pymupdf-pdf

✓Verified·Scanned 2/18/2026

Fast local PDF parsing with PyMuPDF (fitz), producing Markdown/JSON output with optional image and table extraction. Use when speed matters more than robustness, or as a fallback when heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.

from clawhub.ai·vfb42094·8.8 KB·0 installs

Scanned from 1.0.0 at fb42094

$ vett add clawhub.ai/kesslerio/pymupdf-pdf

PyMuPDF PDF Parser - Clawdbot Skill

A Clawdbot skill for fast, lightweight PDF parsing using PyMuPDF (fitz). Ideal for quick text extraction when speed matters.

Features

Fast processing — Parses PDFs in ~1 second per page
Lightweight — Single pip dependency, no heavy models
Markdown output — Clean text extraction with page markers
JSON output — Simple structured text per page
Image extraction — Optional embedded image extraction
NixOS compatible — Includes notes for libstdc++ issues

Installation

Prerequisites

Python 3.8+
PyMuPDF: pip install pymupdf
Clawdbot installed

Install the skill

# Clone the repo
git clone https://github.com/kesslerio/PyMuPDF-PDF-Parser-Clawdbot-Skill.git

# Then copy the pymupdf-pdf/ folder into your Clawdbot skills directory
cp -r PyMuPDF-PDF-Parser-Clawdbot-Skill/pymupdf-pdf ~/.clawdbot/skills/

# Install dependency
pip install pymupdf

NixOS users

If importing PyMuPDF fails with a libstdc++ error, point LD_LIBRARY_PATH at your GCC library directory:

export LD_LIBRARY_PATH=/nix/store/<your-gcc-lib-path>/lib

See pymupdf-pdf/references/pymupdf-notes.md for details.
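
To tell a libstdc++ loading problem apart from a missing package, a small check like the following can help. It is only a sketch: `libstdcxx_hint` and `check_pymupdf` are hypothetical helpers, not part of this skill, and the match on the error text assumes the usual "libstdc++.so.6: cannot open shared object file" message.

```python
import os
from typing import Optional


def libstdcxx_hint(error_message: str) -> Optional[str]:
    """Return a hint if the import error mentions libstdc++, else None."""
    if "libstdc++" not in error_message:
        return None
    current = os.environ.get("LD_LIBRARY_PATH", "(unset)")
    return (
        "PyMuPDF could not load libstdc++. "
        f"LD_LIBRARY_PATH is currently {current}; "
        "set it to your GCC lib directory under /nix/store."
    )


def check_pymupdf() -> Optional[str]:
    """Try importing PyMuPDF and return a problem description, or None if it works."""
    try:
        import fitz  # noqa: F401  (PyMuPDF)
    except ImportError as exc:
        return libstdcxx_hint(str(exc)) or f"PyMuPDF import failed: {exc}"
    return None


if __name__ == "__main__":
    print(check_pymupdf() or "PyMuPDF imported successfully")
```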

Usage

Quick start

# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/document.pdf

Options

./scripts/pymupdf_parse.py /path/to/document.pdf --format json
./scripts/pymupdf_parse.py /path/to/document.pdf --format both --images
./scripts/pymupdf_parse.py /path/to/document.pdf --outroot ./my-output

Option	Default	Description
`--format`	`md`	Output format: `md`, `json`, or `both`
`--outroot`	`./pymupdf-output`	Output root directory
`--images`	off	Extract embedded images
`--tables`	off	Extract line-based table approximation
`--lang`	`en`	Language hint (stored in JSON metadata)
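
For readers who want to see how these options fit together, here is a minimal argparse sketch that mirrors the table above. It is a reconstruction from the documented flags and defaults only; the actual `pymupdf_parse.py` may define them differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI definition mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="pymupdf_parse.py")
    parser.add_argument("pdf", help="Path to the input PDF")
    parser.add_argument("--format", choices=["md", "json", "both"], default="md",
                        help="Output format")
    parser.add_argument("--outroot", default="./pymupdf-output",
                        help="Output root directory")
    parser.add_argument("--images", action="store_true",
                        help="Extract embedded images")
    parser.add_argument("--tables", action="store_true",
                        help="Extract line-based table approximation")
    parser.add_argument("--lang", default="en",
                        help="Language hint (stored in JSON metadata)")
    return parser


args = build_parser().parse_args(["document.pdf", "--format", "both", "--images"])
print(args.format, args.images, args.tables, args.outroot)
# prints: both True False ./pymupdf-output
```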

Output

Creates a per-document folder under the output root:

./pymupdf-output/
└── document-name/
    ├── output.md      # Markdown with page markers
    ├── output.json    # Simple JSON (~1KB, text per page)
    ├── images/        # Extracted images (if --images)
    └── tables.json    # Line-based tables (if --tables)
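
The layout above could be produced roughly as follows. This is a sketch under assumptions, not the skill's actual code: the page marker format (`<!-- page N -->`), the JSON field names, and the helper names `load_page_texts` and `write_outputs` are all made up for illustration. Only `fitz.open` and `page.get_text()` are real PyMuPDF calls.

```python
import json
from pathlib import Path
from typing import List


def load_page_texts(pdf_path: str) -> List[str]:
    """Read plain text per page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def write_outputs(pdf_path: str, page_texts: List[str],
                  outroot: str = "./pymupdf-output",
                  fmt: str = "md", lang: str = "en") -> Path:
    """Write output.md and/or output.json into <outroot>/<pdf stem>/."""
    out_dir = Path(outroot) / Path(pdf_path).stem
    out_dir.mkdir(parents=True, exist_ok=True)

    if fmt in ("md", "both"):
        # Assumed page marker format; the real script may use another.
        parts = [f"<!-- page {number} -->\n\n{text.strip()}"
                 for number, text in enumerate(page_texts, start=1)]
        (out_dir / "output.md").write_text("\n\n".join(parts) + "\n",
                                           encoding="utf-8")

    if fmt in ("json", "both"):
        # Assumed JSON shape: metadata plus one entry per page.
        payload = {
            "source": Path(pdf_path).name,
            "lang": lang,
            "pages": [{"page": number, "text": text}
                      for number, text in enumerate(page_texts, start=1)],
        }
        (out_dir / "output.json").write_text(
            json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    return out_dir


# Usage (requires PyMuPDF and a real file):
#   texts = load_page_texts("/path/to/document.pdf")
#   write_outputs("/path/to/document.pdf", texts, fmt="both")
```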

Output quality

PyMuPDF is fast, and its output is deliberately minimal:

Plain text extraction (only basic layout preservation)
Simple JSON with text per page
Optional image extraction

Best for: Quick text extraction, batch processing, or when speed matters.
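
The optional image extraction mentioned above can be sketched with PyMuPDF's `page.get_images()` and `doc.extract_image()`. The file naming scheme and the helper names `image_name` and `extract_images` are assumptions for illustration; the skill's script may organize the `images/` folder differently.

```python
from pathlib import Path
from typing import List


def image_name(page_number: int, index: int, ext: str) -> str:
    """Assumed naming scheme, e.g. page001-img02.png."""
    return f"page{page_number:03d}-img{index:02d}.{ext}"


def extract_images(pdf_path: str, out_dir: str) -> List[Path]:
    """Save every embedded image under <out_dir>/images/ and return the paths."""
    import fitz  # PyMuPDF, imported lazily

    images_dir = Path(out_dir) / "images"
    images_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for index, info in enumerate(page.get_images(full=True), start=1):
                xref = info[0]  # first item is the image's cross-reference number
                image = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                target = images_dir / image_name(page_number, index, image["ext"])
                target.write_bytes(image["image"])
                written.append(target)
    return written
```

An image referenced from several pages is written once per page here; deduplicating by `xref` would be a reasonable refinement.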

Comparison with MinerU

Aspect	PyMuPDF	MinerU
Speed	Fast (~1s/page)	Slower (~15-30s/page)
JSON output	Minimal (~1KB, text only)	Rich (~50KB+, layout data)
Image extraction	Optional	Automatic
Layout preservation	Basic	Excellent
Dependencies	Light (pip install)	Heavy (~20GB models)

Use PyMuPDF when: speed matters, or you only need simple text extraction.
Use MinerU when: quality and structure matter more than speed.
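
To make the speed difference concrete, the per-page figures from the table can be turned into a rough estimate. The numbers are the approximate ones quoted above, not measurements, and `estimate_seconds` is a hypothetical helper.

```python
# Rough per-page timings taken from the comparison table.
PYMUPDF_SECONDS_PER_PAGE = 1
MINERU_SECONDS_PER_PAGE = (15, 30)  # low and high end of the quoted range


def estimate_seconds(pages: int) -> dict:
    """Estimate total parse time in seconds for a document of `pages` pages."""
    low, high = MINERU_SECONDS_PER_PAGE
    return {
        "pymupdf": pages * PYMUPDF_SECONDS_PER_PAGE,
        "mineru_low": pages * low,
        "mineru_high": pages * high,
    }


print(estimate_seconds(200))
# prints: {'pymupdf': 200, 'mineru_low': 3000, 'mineru_high': 6000}
```

For a 200-page document that is a little over 3 minutes with PyMuPDF versus roughly 50 to 100 minutes with MinerU.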

License

Apache 2.0

Contributing

Issues and PRs welcome. Please test changes with various PDF types before submitting.

Related

MinerU PDF Parser Skill: rich, layout-aware alternative
PyMuPDF: the underlying PDF library
Clawdbot: the AI agent framework