pymupdf-pdf
Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.
PyMuPDF PDF Parser - Clawdbot Skill
A Clawdbot skill for fast, lightweight PDF parsing using PyMuPDF (fitz). Ideal for quick text extraction when speed matters.
Features
- Fast processing — Parses PDFs in ~1 second per page
- Lightweight — Single pip dependency, no heavy models
- Markdown output — Clean text extraction with page markers
- JSON output — Simple structured text per page
- Image extraction — Optional embedded image extraction
- NixOS compatible — Includes notes for libstdc++ issues
Installation
Prerequisites
- Python 3.8+
- PyMuPDF:
pip install pymupdf - Clawdbot installed
Install the skill
# Clone the repo
git clone https://github.com/kesslerio/PyMuPDF-PDF-Parser-Clawdbot-Skill.git
# Or copy the pymupdf-pdf/ folder to your Clawdbot skills directory
cp -r PyMuPDF-PDF-Parser-Clawdbot-Skill/pymupdf-pdf ~/.clawdbot/skills/
# Install dependency
pip install pymupdf
NixOS users
If you hit libstdc++ import errors:
export LD_LIBRARY_PATH=/nix/store/<your-gcc-lib-path>/lib
See pymupdf-pdf/references/pymupdf-notes.md for details.
Usage
Quick start
# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/document.pdf
Options
./scripts/pymupdf_parse.py /path/to/document.pdf --format json
./scripts/pymupdf_parse.py /path/to/document.pdf --format both --images
./scripts/pymupdf_parse.py /path/to/document.pdf --outroot ./my-output
| Option | Default | Description |
|---|---|---|
--format | md | Output format: md, json, or both |
--outroot | ./pymupdf-output | Output root directory |
--images | off | Extract embedded images |
--tables | off | Extract line-based table approximation |
--lang | en | Language hint (stored in JSON metadata) |
Output
Creates a per-document folder under the output root:
./pymupdf-output/
└── document-name/
├── output.md # Markdown with page markers
├── output.json # Simple JSON (~1KB, text per page)
├── images/ # Extracted images (if --images)
└── tables.json # Line-based tables (if --tables)
Output quality
PyMuPDF produces fast, minimal output:
- Plain text extraction (no layout preservation)
- Simple JSON with text per page
- Optional image extraction
Best for: Quick text extraction, batch processing, or when speed matters.
Comparison with MinerU
| Aspect | PyMuPDF | MinerU |
|---|---|---|
| Speed | Fast (~1s/page) | Slower (~15-30s/page) |
| JSON output | Minimal (~1KB, text only) | Rich (~50KB+, layout data) |
| Image extraction | Optional | Automatic |
| Layout preservation | Basic | Excellent |
| Dependencies | Light (pip install) | Heavy (~20GB models) |
Use PyMuPDF when: Speed matters or for simple text extraction.
Use MinerU when: Quality and structure matter more than speed.
License
Apache 2.0
Contributing
Issues and PRs welcome. Please test changes with various PDF types before submitting.
Related
- MinerU PDF Parser Skill — Rich, layout-aware alternative
- PyMuPDF — The underlying PDF library
- Clawdbot — The AI agent framework