pdf-text-extractor
✓Verified·Scanned 2/18/2026
This skill extracts text from PDFs with optional OCR and batch processing. It includes instructions to run npm install pdfjs-dist and node index.js and references network resources such as https://registry.npmjs.org/pdfjs-dist/-/pdfjs-dist-3.11.174.tgz.
from clawhub.ai·v794b670·50.5 KB·0 installs
Scanned from 1.0.0 at 794b670
$ vett add clawhub.ai/michael-laffin/pdf-text-extractor
PDF-Text-Extractor
Extract text from PDFs, with OCR support for scanned documents. No external dependencies other than PDF.js (pdfjs-dist).
Quick Start
# Install
clawhub install pdf-text-extractor
# Extract text from PDF
cd ~/.openclaw/skills/pdf-text-extractor
node index.js extractText '{"pdfPath":"./document.pdf","options":{"outputFormat":"text"}}'
Usage Examples
Extract to Text
const result = await extractText({
pdfPath: './invoice.pdf',
options: { outputFormat: 'text' }
});
console.log(result.text);
Extract to JSON with Metadata
const result = await extractText({
pdfPath: './contract.pdf',
options: {
outputFormat: 'json',
includeMetadata: true
}
});
console.log(result.metadata);
console.log(`Words: ${result.wordCount}`);
Batch Process Multiple PDFs
const results = await extractBatch({
pdfFiles: [
'./doc1.pdf',
'./doc2.pdf',
'./doc3.pdf'
]
});
console.log(`Processed ${results.successCount}/${results.results.length} documents`);
Extract with OCR (Scanned Documents)
const result = await extractText({
pdfPath: './scanned-doc.pdf',
options: {
ocr: true,
language: 'eng',
ocrQuality: 'high'
}
});
console.log(result.text);
Count Words and Stats
// `result` is the value returned by an earlier extractText call
const stats = await countWords({
text: result.text,
options: { countByPage: true }
});
console.log(`Total words: ${stats.wordCount}`);
console.log(`Pages: ${stats.pageCounts.length}`);
console.log(`Avg per page: ${stats.averageWordsPerPage}`);
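The per-page figures printed above could be computed along the following lines. This is a minimal sketch, not the skill's own countWords: countWordsByPage is a hypothetical helper, and it assumes the page texts are already available as an array of strings.

```javascript
// Hypothetical helper, not part of the skill's documented API.
// Assumption: `pages` is an array with one string of text per page.
function countWordsByPage(pages) {
  const pageCounts = pages.map((pageText) => {
    const trimmed = pageText.trim();
    // An empty page has zero words; otherwise split on runs of whitespace.
    return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
  });
  const wordCount = pageCounts.reduce((sum, n) => sum + n, 0);
  const averageWordsPerPage =
    pageCounts.length === 0 ? 0 : wordCount / pageCounts.length;
  return { wordCount, pageCounts, averageWordsPerPage };
}

const sample = countWordsByPage(['one two three', '', '  four   five ']);
console.log(sample.pageCounts, sample.wordCount);
```

Splitting on whitespace is the simplest definition of a word; it treats hyphenated terms and numbers as single words, which may or may not match what the skill reports.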
Detect Language
// Pass extracted text, e.g. result.text from an earlier extractText call
const lang = await detectLanguage(result.text);
console.log(`Language: ${lang.languageName}`);
console.log(`Confidence: ${lang.confidence}%`);
Features
- Text Extraction: Extract text from PDFs without external tools
- OCR Support: Use Tesseract for scanned documents
- Batch Processing: Process multiple PDFs at once
- Multiple Output Formats: Text, JSON, Markdown, HTML
- Word Counting: Accurate word and character counting
- Language Detection: Simple heuristic for common languages
- Metadata Extraction: Title, author, creation date
- Page-by-Page: Extract text with page structure
- Zero Config Required: Works out of the box
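The README describes language detection only as a "simple heuristic". One plausible form of such a heuristic is stopword counting, sketched below. Everything here is an assumption for illustration: guessLanguage and the STOPWORDS lists are hypothetical, and the skill's detectLanguage may work differently.

```javascript
// Hypothetical stopword lists; a real detector would use many more words.
const STOPWORDS = {
  English: ['the', 'and', 'of', 'to', 'is', 'are', 'with'],
  Spanish: ['el', 'los', 'que', 'y', 'es', 'con', 'para'],
  French: ['le', 'les', 'et', 'est', 'des', 'avec', 'pour'],
  German: ['der', 'die', 'und', 'das', 'ist', 'nicht', 'mit'],
};

// Hypothetical helper, not the skill's detectLanguage.
function guessLanguage(text) {
  // Lowercase and split on anything that is not a Unicode letter.
  const words = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  const scores = {};
  let totalHits = 0;
  for (const [name, list] of Object.entries(STOPWORDS)) {
    const set = new Set(list);
    scores[name] = words.filter((w) => set.has(w)).length;
    totalHits += scores[name];
  }
  if (totalHits === 0) return { languageName: 'Unknown', confidence: 0 };
  // Pick the language with the most stopword hits.
  const [languageName, hits] = Object.entries(scores).sort(
    (a, b) => b[1] - a[1]
  )[0];
  // Confidence here is the winner's share of all stopword hits, in percent.
  return { languageName, confidence: Math.round((100 * hits) / totalHits) };
}

console.log(guessLanguage('The cat and the dog are in the garden').languageName);
```

A heuristic like this is cheap and dependency-free, but it is unreliable on very short texts and on languages outside its word lists.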
Use Cases
Document Digitization
- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents
Content Analysis
- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports
Data Extraction
- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows
Text Processing
- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content
Configuration
Edit config.json to customize:
{
"ocr": {
"enabled": true,
"defaultLanguage": "eng",
"quality": "medium"
},
"output": {
"defaultFormat": "text",
"preserveFormatting": true
},
"batch": {
"maxConcurrent": 3
}
}
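How batch.maxConcurrent is enforced is not shown in this README. A common way to cap concurrency in Node.js is a small pool of workers pulling from a shared index, sketched below. mapWithLimit is a hypothetical helper, and the per-item result shape ({ ok, value } or { ok, error }) is an assumption, not the skill's extractBatch output.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit`
// calls in flight at once. Failures are recorded instead of aborting the batch.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  async function runner() {
    while (next < items.length) {
      const i = next++; // claim an index (safe: JS runs this synchronously)
      try {
        results[i] = { ok: true, value: await worker(items[i]) };
      } catch (err) {
        results[i] = { ok: false, error: String(err) };
      }
    }
  }

  // Start up to `limit` runners and wait for all of them to drain the queue.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Usage sketch: the worker below is a stand-in for a real extraction call.
mapWithLimit(['./doc1.pdf', './doc2.pdf', './doc3.pdf'], 3, async (p) => p.length)
  .then((out) => console.log(out.filter((r) => r.ok).length, 'succeeded'));
```

Results keep the input order regardless of which file finishes first, which makes it easy to match each outcome back to its path.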
Test
node test.js
Output Formats
Text
Plain text extraction with newlines between pages.
JSON
{
"text": "Document text here...",
"pages": 10,
"wordCount": 1500,
"charCount": 8500,
"language": "English",
"metadata": {
"title": "Document Title",
"author": "Author Name",
"creationDate": "2026-02-04"
}
}
Performance
Text-Based PDFs
- Speed: ~100 ms for a 10-page PDF
- Accuracy: 100% (text is read directly from the PDF rather than recognized)
OCR Processing
- Speed: ~1-3s per page
- Accuracy: 85-95% (depends on scan quality)
Troubleshooting
PDF Not Parsing
- Check that the file is a valid PDF
- Ensure the file is not password-protected
- Verify that PDF.js (pdfjs-dist) is installed
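The first check can be done quickly by looking for the PDF header. PDF files begin with the marker %PDF-, and many readers accept it anywhere in the first 1024 bytes. The sketch below uses only Node's built-in fs module; looksLikePdf and fileLooksLikePdf are hypothetical helper names, not part of the skill.

```javascript
const fs = require('fs');

// Hypothetical helper: true if the "%PDF-" marker appears in the first 1024 bytes.
function looksLikePdf(bytes) {
  return Buffer.from(bytes).subarray(0, 1024).includes('%PDF-', 0, 'latin1');
}

// Hypothetical helper: read only the head of the file, not the whole document.
function fileLooksLikePdf(pdfPath) {
  const fd = fs.openSync(pdfPath, 'r');
  try {
    const head = Buffer.alloc(1024);
    const bytesRead = fs.readSync(fd, head, 0, head.length, 0);
    return looksLikePdf(head.subarray(0, bytesRead));
  } finally {
    fs.closeSync(fd);
  }
}

console.log(looksLikePdf(Buffer.from('%PDF-1.7\n')));
```

A passing header check does not prove the file is well-formed or unencrypted; it only rules out files that are clearly not PDFs, such as an HTML error page saved with a .pdf extension.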
OCR Low Accuracy
- Ensure document language matches OCR language setting
- Try higher quality setting (slower but more accurate)
- Check scan quality (300 DPI+ recommended)
Slow Processing
- Reduce batch concurrency (batch.maxConcurrent in config.json)
- Lower the OCR quality setting for speed
- Process files individually
Dependencies
npm install pdfjs-dist
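For orientation, text extraction with pdfjs-dist alone looks roughly like the sketch below. It assumes pdfjs-dist 3.x, whose CommonJS build for Node is at pdfjs-dist/legacy/build/pdf.js; other major versions use different entry points. itemsToText and extractPdfText are hypothetical names, and this is not the skill's own index.js.

```javascript
const fs = require('fs');

// Hypothetical helper: join the text items of one page.
// In pdf.js, `hasEOL` marks an item that ends a line.
function itemsToText(items) {
  let out = '';
  for (const item of items) {
    if (typeof item.str !== 'string') continue; // skip non-text items
    out += item.str;
    if (item.hasEOL) out += '\n';
  }
  return out.trim();
}

// Hypothetical helper: extract text page by page, pages separated by newlines.
async function extractPdfText(pdfPath) {
  // Assumption: pdfjs-dist 3.x is installed; loaded lazily so the helper
  // above stays usable without it.
  const pdfjsLib = require('pdfjs-dist/legacy/build/pdf.js');
  // pdf.js expects a Uint8Array rather than a Node Buffer.
  const data = new Uint8Array(fs.readFileSync(pdfPath));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }
  return { text: pages.join('\n'), pages: doc.numPages };
}

// Usage sketch:
// extractPdfText('./document.pdf').then((r) => console.log(r.pages, r.text.length));
```

Scanned PDFs have no text layer, so this path returns empty strings for them; that is the case the OCR option is meant to cover.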
License
MIT
Extract text from PDFs. Fast, accurate, ready to use. 🔮