pymupdf-pdf

Verified·Scanned 2/18/2026

Fast local PDF parsing with PyMuPDF (fitz) for Markdown/JSON outputs and optional images/tables. Use when speed matters more than robustness, or as a fallback while heavier parsers are unavailable. Default to single-PDF parsing with per-document output folders.

from clawhub.ai·vfb42094·8.8 KB·0 installs
Scanned from 1.0.0 at fb42094
$ vett add clawhub.ai/kesslerio/pymupdf-pdf

PyMuPDF PDF Parser - Clawdbot Skill

A Clawdbot skill for fast, lightweight PDF parsing using PyMuPDF (fitz). Ideal for quick text extraction when speed matters.

Features

  • Fast processing — Parses PDFs in ~1 second per page
  • Lightweight — Single pip dependency, no heavy models
  • Markdown output — Clean text extraction with page markers
  • JSON output — Simple structured text per page
  • Image extraction — Optional embedded image extraction
  • NixOS compatible — Includes notes for libstdc++ issues

Installation

Prerequisites

  1. Python 3.8+
  2. PyMuPDF: pip install pymupdf
  3. Clawdbot installed

Install the skill

# Clone the repo
git clone https://github.com/kesslerio/PyMuPDF-PDF-Parser-Clawdbot-Skill.git

# Then copy the pymupdf-pdf/ folder to your Clawdbot skills directory
cp -r PyMuPDF-PDF-Parser-Clawdbot-Skill/pymupdf-pdf ~/.clawdbot/skills/

# Install dependency
pip install pymupdf

NixOS users

If you hit libstdc++ import errors:

export LD_LIBRARY_PATH=/nix/store/<your-gcc-lib-path>/lib

See pymupdf-pdf/references/pymupdf-notes.md for details.
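Before changing LD_LIBRARY_PATH, it can help to confirm that the import itself is what fails. The following is a minimal sketch, not part of the skill: check_pymupdf is a hypothetical helper that tries to import the fitz module and reports the error text, which on NixOS would usually name the missing shared library.

```python
import importlib


def check_pymupdf() -> str:
    """Return "ok" if the fitz module imports, else the import error text."""
    try:
        importlib.import_module("fitz")
    except ImportError as exc:
        # A missing shared library (for example libstdc++) is reported as
        # an ImportError whose message names the library that failed to load.
        return f"import failed: {exc}"
    return "ok"


print(check_pymupdf())
```

If the message mentions libstdc++, the export shown above is the fix to try next.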

Usage

Quick start

# Run from the skill directory
./scripts/pymupdf_parse.py /path/to/document.pdf

Options

./scripts/pymupdf_parse.py /path/to/document.pdf --format json
./scripts/pymupdf_parse.py /path/to/document.pdf --format both --images
./scripts/pymupdf_parse.py /path/to/document.pdf --outroot ./my-output
Option      Default            Description
--format    md                 Output format: md, json, or both
--outroot   ./pymupdf-output   Output root directory
--images    off                Extract embedded images
--tables    off                Extract line-based table approximation
--lang      en                 Language hint (stored in JSON metadata)
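To make the defaults concrete, here is a minimal sketch of an argparse parser that mirrors the documented options. It is an illustration only: build_parser is a hypothetical name, and the real scripts/pymupdf_parse.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="pymupdf_parse.py")
    parser.add_argument("pdf", help="Path to the input PDF")
    parser.add_argument("--format", choices=["md", "json", "both"], default="md")
    parser.add_argument("--outroot", default="./pymupdf-output")
    # Boolean switches: absent means off, present means on
    parser.add_argument("--images", action="store_true")
    parser.add_argument("--tables", action="store_true")
    parser.add_argument("--lang", default="en")
    return parser


args = build_parser().parse_args(["report.pdf", "--format", "both", "--images"])
print(args.format, args.images, args.tables, args.outroot)
```

Restricting --format with choices means an unsupported value is rejected before any parsing work starts.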

Output

Creates a per-document folder under the output root:

./pymupdf-output/
└── document-name/
    ├── output.md      # Markdown with page markers
    ├── output.json    # Simple JSON (~1KB, text per page)
    ├── images/        # Extracted images (if --images)
    └── tables.json    # Line-based tables (if --tables)
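The layout above can be sketched as a small path helper. This assumes the per-document folder is named after the PDF file name without its extension, which the tree suggests but does not state. output_paths is a hypothetical function, not the skill's own code.

```python
from pathlib import Path


def output_paths(pdf_path: str, outroot: str = "./pymupdf-output") -> dict:
    """Map a PDF path to the files shown in the output tree."""
    # Assumption: the folder name is the PDF file name minus ".pdf"
    doc_dir = Path(outroot) / Path(pdf_path).stem
    return {
        "dir": doc_dir,
        "markdown": doc_dir / "output.md",
        "json": doc_dir / "output.json",
        "images": doc_dir / "images",       # only used with --images
        "tables": doc_dir / "tables.json",  # only used with --tables
    }


paths = output_paths("/path/to/document.pdf")
print(paths["markdown"].as_posix())
```

Because each document gets its own folder, two PDFs parsed into the same output root do not overwrite each other unless they share a file name.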

Output quality

PyMuPDF produces fast, minimal output:

  • Plain text extraction (no layout preservation)
  • Simple JSON with text per page
  • Optional image extraction

Best for: Quick text extraction, batch processing, or when speed matters.
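For readers who want to see what this kind of output involves, the sketch below shows one plausible way to produce it with PyMuPDF. The page marker format ("## Page N") and the JSON field names are assumptions made for illustration, since the skill's exact formats are not documented here. The calls fitz.open and page.get_text("text") are standard PyMuPDF.

```python
import json


def extract_page_texts(pdf_path: str) -> list:
    """Return the plain text of each page, in page order."""
    import fitz  # PyMuPDF; imported lazily so the helpers below run without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text("text") for page in doc]


def to_markdown(pages: list) -> str:
    # Assumption: "## Page N" is an illustrative marker, not the skill's format
    parts = [f"## Page {i}\n\n{text.strip()}" for i, text in enumerate(pages, start=1)]
    return "\n\n".join(parts) + "\n"


def to_json(pages: list, lang: str = "en") -> str:
    # Assumption: this schema is a guess at "text per page" plus the lang hint
    payload = {
        "lang": lang,
        "pages": [{"page": i, "text": t.strip()} for i, t in enumerate(pages, start=1)],
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

A typical call would be to_markdown(extract_page_texts("/path/to/document.pdf")). Keeping extraction separate from formatting lets the same page list feed both the Markdown and JSON outputs.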

Comparison with MinerU

Aspect                PyMuPDF                     MinerU
Speed                 Fast (~1s/page)             Slower (~15-30s/page)
JSON output           Minimal (~1KB, text only)   Rich (~50KB+, layout data)
Image extraction      Optional                    Automatic
Layout preservation   Basic                       Excellent
Dependencies          Light (pip install)         Heavy (~20GB models)

Use PyMuPDF when: Speed matters or for simple text extraction.
Use MinerU when: Quality and structure matter more than speed.

License

Apache 2.0

Contributing

Issues and PRs welcome. Please test changes with various PDF types before submitting.

Related