pdf-text-extractor

✓ Verified · Scanned 2/18/2026

This skill extracts text from PDFs with optional OCR and batch processing. It includes instructions to run npm install pdfjs-dist and node index.js and references network resources such as https://registry.npmjs.org/pdfjs-dist/-/pdfjs-dist-3.11.174.tgz.

from clawhub.ai · v794b670 · 50.5 KB · 0 installs

Scanned from 1.0.0 at 794b670

$ vett add clawhub.ai/michael-laffin/pdf-text-extractor

PDF-Text-Extractor

Extract text from PDFs, with optional OCR for scanned documents. The only npm dependency to install is PDF.js (pdfjs-dist).

Quick Start

# Install
clawhub install pdf-text-extractor

# Extract text from PDF
cd ~/.openclaw/skills/pdf-text-extractor
node index.js extractText '{"pdfPath":"./document.pdf","options":{"outputFormat":"text"}}'
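The JSON argument has to survive shell quoting, which gets awkward once a path contains spaces or quotes. A minimal sketch of calling the CLI from Node without a shell, assuming the command takes the function name and one JSON string exactly as in the Quick Start; `buildCliArgs` and `runExtract` are hypothetical helper names, not part of the skill:

```javascript
const { execFile } = require('node:child_process');

// Hypothetical helper: builds the argv for `node index.js extractText '<json>'`.
function buildCliArgs(pdfPath, options = {}) {
  return ['index.js', 'extractText', JSON.stringify({ pdfPath, options })];
}

// Hypothetical helper: runs the CLI with execFile, so no shell quoting is involved.
// `cwd` is assumed to be the skill directory that contains index.js.
function runExtract(pdfPath, options, cwd) {
  return new Promise((resolve, reject) => {
    execFile(process.execPath, buildCliArgs(pdfPath, options), { cwd }, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}
```

Because the arguments are passed as an array, a path such as `./my doc's.pdf` needs no escaping.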

Usage Examples

Extract to Text

const result = await extractText({
  pdfPath: './invoice.pdf',
  options: { outputFormat: 'text' }
});

console.log(result.text);

Extract to JSON with Metadata

const result = await extractText({
  pdfPath: './contract.pdf',
  options: {
    outputFormat: 'json',
    includeMetadata: true
  }
});

console.log(result.metadata);
console.log(`Words: ${result.wordCount}`);

Batch Process Multiple PDFs

const results = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf'
  ]
});

console.log(`Processed ${results.successCount}/${results.results.length} documents`);
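How the skill schedules a batch internally is not documented here. As an illustration of what a concurrency cap like `batch.maxConcurrent` typically means, the following sketch runs a worker over a list with at most `limit` calls in flight. `runBatch` and the per-item result shape are assumptions; only `results` and `successCount` come from the example above.

```javascript
// Runs `worker` over `items`, keeping at most `limit` calls in flight.
// A failure is recorded for that item instead of aborting the whole batch.
async function runBatch(items, worker, limit = 3) {
  const results = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane() {
    while (next < items.length) {
      const i = next++; // no await between the check and the increment, so no race
      try {
        results[i] = { item: items[i], ok: true, value: await worker(items[i]) };
      } catch (err) {
        results[i] = { item: items[i], ok: false, error: String((err && err.message) || err) };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return { results, successCount: results.filter((r) => r.ok).length };
}
```

Results keep the input order regardless of which file finishes first, which makes it easy to match failures back to file paths.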

Extract with OCR (Scanned Documents)

const result = await extractText({
  pdfPath: './scanned-doc.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

console.log(result.text);
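OCR is much slower than reading an embedded text layer, so it can help to decide per document whether it is needed. One possible heuristic, not taken from the skill itself: treat a PDF as scanned when plain extraction yields almost no characters per page. `needsOcr` and its threshold are hypothetical.

```javascript
// Hypothetical heuristic: a PDF whose text layer is nearly empty is probably a scan.
// `pageTexts` holds one extracted string per page.
function needsOcr(pageTexts, minCharsPerPage = 20) {
  if (pageTexts.length === 0) return false;
  // Count non-whitespace characters across all pages.
  const chars = pageTexts.reduce((sum, t) => sum + t.replace(/\s+/g, '').length, 0);
  return chars / pageTexts.length < minCharsPerPage;
}
```

A caller could run a normal extraction first and retry with `ocr: true` only when this returns true.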

Count Words and Stats

// `result` is the value returned by an earlier extractText() call
const stats = await countWords({
  text: result.text,
  options: { countByPage: true }
});

console.log(`Total words: ${stats.wordCount}`);
console.log(`Pages: ${stats.pageCounts.length}`);
console.log(`Avg per page: ${stats.averageWordsPerPage}`);
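The fields printed above can be derived from per-page text with simple whitespace tokenization. A minimal sketch, assuming pages arrive as an array of strings; `countTokens` and `wordStats` are hypothetical names, and the skill's own counting rules may differ:

```javascript
// Counts whitespace-separated tokens; a blank string counts as zero words.
function countTokens(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Assumption: `pageTexts` holds one string per page.
function wordStats(pageTexts) {
  const pageCounts = pageTexts.map(countTokens);
  const wordCount = pageCounts.reduce((a, b) => a + b, 0);
  const charCount = pageTexts.reduce((sum, t) => sum + t.length, 0);
  const averageWordsPerPage = pageCounts.length ? wordCount / pageCounts.length : 0;
  return { wordCount, charCount, pageCounts, averageWordsPerPage };
}
```

Whitespace splitting works for space-delimited languages only; text in Chinese or Japanese, for example, would need a different tokenizer.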

Detect Language

// `text` is any string, for example result.text from extractText()
const lang = await detectLanguage(text);

console.log(`Language: ${lang.languageName}`);
console.log(`Confidence: ${lang.confidence}%`);
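The skill describes its detection as a simple heuristic without giving details. One common approach of that kind counts stopword hits per language; the sketch below is an assumption about how such a heuristic could look, not the skill's actual code. `guessLanguage` and the word lists are illustrative.

```javascript
// Tiny stopword lists for illustration; a real detector would use far larger ones.
const STOPWORDS = {
  English: ['the', 'and', 'of', 'to', 'is', 'in', 'that', 'it'],
  Spanish: ['el', 'la', 'de', 'que', 'y', 'en', 'los', 'es'],
  French: ['le', 'la', 'de', 'et', 'les', 'des', 'est', 'un'],
  German: ['der', 'die', 'und', 'das', 'ist', 'nicht', 'ein', 'zu'],
};

function guessLanguage(text) {
  // Split into runs of Unicode letters so accented words stay intact.
  const words = text.toLowerCase().match(/\p{L}+/gu) || [];
  let best = { languageName: 'Unknown', hits: 0 };
  let totalHits = 0;
  for (const [languageName, list] of Object.entries(STOPWORDS)) {
    const set = new Set(list);
    const hits = words.filter((w) => set.has(w)).length;
    totalHits += hits;
    if (hits > best.hits) best = { languageName, hits };
  }
  // Confidence: share of all stopword hits that went to the winning language (0-100).
  const confidence = totalHits ? Math.round((best.hits / totalHits) * 100) : 0;
  return { languageName: best.languageName, confidence };
}
```

Short inputs give unreliable results with this method, so a low `confidence` is best treated as "unknown".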

Features

Text Extraction: Extract text from PDFs without external tools
OCR Support: Use Tesseract for scanned documents
Batch Processing: Process multiple PDFs at once
Multiple Output Formats: Text, JSON, Markdown, HTML
Word Counting: Accurate word and character counting
Language Detection: Simple heuristic for common languages
Metadata Extraction: Title, author, creation date
Page-by-Page: Extract text with page structure
Zero Config Required: Works out of the box

Use Cases

Document Digitization

Convert paper documents to digital text
Process invoices and receipts
Digitize contracts and agreements
Archive physical documents

Content Analysis

Extract text for analysis tools
Prepare content for LLM processing
Clean up scanned documents
Parse PDF-based reports

Data Extraction

Extract data from PDF reports
Parse tables from PDFs
Pull structured data
Automate document workflows

Text Processing

Prepare content for translation
Clean up OCR output
Extract specific sections
Search within PDF content

Configuration

Edit config.json to customize:

{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium"
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true
  },
  "batch": {
    "maxConcurrent": 3
  }
}
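A partial config.json is easier to maintain if omitted keys fall back to defaults. The sketch below shows one way to do that merge; whether the skill merges this way, and whether the values above are its built-in defaults, are both assumptions. `mergeConfig` and `loadConfig` are hypothetical names.

```javascript
const fs = require('node:fs');

// Assumed defaults, copied from the sample config.json above.
const DEFAULTS = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium' },
  output: { defaultFormat: 'text', preserveFormatting: true },
  batch: { maxConcurrent: 3 },
};

// Merges one level deep: each section keeps defaults for keys the user omitted.
function mergeConfig(user = {}) {
  const merged = {};
  for (const section of Object.keys(DEFAULTS)) {
    merged[section] = { ...DEFAULTS[section], ...(user[section] || {}) };
  }
  return merged;
}

// Hypothetical loader: falls back to the defaults when the file does not exist.
function loadConfig(path = './config.json') {
  if (!fs.existsSync(path)) return mergeConfig();
  return mergeConfig(JSON.parse(fs.readFileSync(path, 'utf8')));
}
```

With this approach a config file containing only `{"ocr": {"quality": "high"}}` changes the OCR quality and leaves every other setting at its default.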

Test

node test.js

Output Formats

Text

Plain text extraction with newlines between pages.

JSON

{
  "text": "Document text here...",
  "pages": 10,
  "wordCount": 1500,
  "charCount": 8500,
  "language": "English",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "creationDate": "2026-02-04"
  }
}
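PDF files store dates in the form `D:YYYYMMDDHHmmSS` plus an optional time zone, while the sample above shows `creationDate` as `2026-02-04`. If the raw value needs converting, a sketch like the following would do it; whether the skill performs this conversion itself is an assumption, and `pdfDateToIso` is a hypothetical name.

```javascript
// Converts a PDF date string such as "D:20260204103000Z" to "2026-02-04".
// Only the year is mandatory in a PDF date; month and day default to 01.
// Returns null when the input does not start with a four-digit year.
function pdfDateToIso(pdfDate) {
  const m = /^(?:D:)?(\d{4})(\d{2})?(\d{2})?/.exec(pdfDate || '');
  if (!m) return null;
  const [, year, month = '01', day = '01'] = m;
  return `${year}-${month}-${day}`;
}
```

The time and time zone parts are dropped on purpose here, since the sample output carries a date only.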

Performance

Text-Based PDFs

Speed: ~100 ms for a 10-page PDF
Accuracy: 100% (text is read directly from the PDF's text layer, with no recognition step)

OCR Processing

Speed: ~1-3 s per page
Accuracy: 85-95% (depends on scan quality)

Troubleshooting

PDF Not Parsing

Check that the file is a valid PDF
Ensure the PDF is not password-protected
Verify that PDF.js is installed (npm install pdfjs-dist)

OCR Low Accuracy

Ensure document language matches OCR language setting
Try higher quality setting (slower but more accurate)
Check scan quality (300 DPI+ recommended)
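The 300 DPI recommendation translates into a render scale when a page is rasterized for OCR. PDF page sizes are measured in points at 72 per inch by default, so the scale factor is `dpi / 72`. The sketch below works that arithmetic through; `scaleForDpi` and `pixelSize` are hypothetical helpers, and how the skill rasterizes pages is not documented here.

```javascript
// Default PDF user space is 72 units per inch, so scale = dpi / 72.
function scaleForDpi(dpi) {
  return dpi / 72;
}

// Pixel dimensions of a page rendered at `dpi`, given its size in PDF points.
function pixelSize(widthPt, heightPt, dpi) {
  const scale = scaleForDpi(dpi);
  return { width: Math.round(widthPt * scale), height: Math.round(heightPt * scale) };
}

// US Letter is 612 x 792 points, which is 2550 x 3300 pixels at 300 DPI.
const letterAt300 = pixelSize(612, 792, 300);
```

With PDF.js the same factor could be passed as `page.getViewport({ scale })`. A 2550 x 3300 RGBA bitmap is roughly 34 MB, which is one reason higher quality settings run slower.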

Slow Processing

Reduce batch concurrency
Lower OCR quality for speed
Process files individually

Dependencies

npm install pdfjs-dist
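For orientation, this is roughly what text extraction with pdfjs-dist looks like when used directly. It is a sketch under assumptions, not the skill's source: the require path matches the 3.x package layout (such as 3.11.174), and `itemsToText` and `extractPages` are hypothetical names.

```javascript
const fs = require('node:fs');

// Joins the text items of one page; items from getTextContent() carry a `str` field.
function itemsToText(items) {
  return items.map((item) => item.str || '').join(' ').replace(/\s+/g, ' ').trim();
}

// Sketch only: reads every page's text layer plus basic metadata with PDF.js.
async function extractPages(pdfPath) {
  // Required lazily so itemsToText() can be used without the package installed.
  // Assumption: pdfjs-dist 3.x, which ships a CommonJS build at this path.
  const pdfjsLib = require('pdfjs-dist/legacy/build/pdf.js');
  const data = new Uint8Array(fs.readFileSync(pdfPath));
  const doc = await pdfjsLib.getDocument({ data }).promise;

  const pages = [];
  for (let n = 1; n <= doc.numPages; n++) { // PDF.js pages are 1-indexed
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }

  const { info } = await doc.getMetadata();
  return { pages, metadata: { title: info.Title, author: info.Author } };
}
```

Joining items with a single space loses layout such as columns and tables; preserving those needs the position data in each item's `transform` field.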

License

MIT

Extract text from PDFs. Fast, accurate, ready to use. 🔮