mineru-pdf

✓Verified·Scanned 2/18/2026

Parse PDFs locally (CPU) into Markdown/JSON using MinerU. Assumes MinerU creates per‑doc output folders; supports table/image extraction.

from clawhub.ai·v1bcf7e8·10.4 KB·0 installs

Scanned from 1.0.0 at 1bcf7e8

$ vett add clawhub.ai/kesslerio/mineru-pdf

MinerU PDF Parser - Clawdbot Skill

A Clawdbot skill for parsing PDFs locally using MinerU (CPU). Produces rich structured output including Markdown, JSON with layout data, and extracted images.

Features

Local CPU processing — No GPU required; runs entirely on your machine
Rich structured output — Markdown + detailed JSON with layout information
Image extraction — Automatically extracts embedded images
Table support — Optional table extraction (if supported by your MinerU version)
Configurable — Flexible env overrides for different MinerU wrappers

Installation

Prerequisites

MinerU CLI installed and accessible (see MinerU installation)
Clawdbot installed

Install the skill

# Clone the repo
git clone https://github.com/kesslerio/MinerU-PDF-Parser-Clawdbot-Skill.git

# Then copy the mineru-pdf/ folder into your Clawdbot skills directory
cp -r MinerU-PDF-Parser-Clawdbot-Skill/mineru-pdf ~/.clawdbot/skills/

Usage

Quick start

# Run from the skill directory
./scripts/mineru_parse.sh /path/to/document.pdf

Options

./scripts/mineru_parse.sh /path/to/document.pdf --format json
./scripts/mineru_parse.sh /path/to/document.pdf --tables --images
./scripts/mineru_parse.sh /path/to/document.pdf --outroot ./my-output

Option	Default	Description
`--format`	`both`	Output format: `md`, `json`, or `both`
`--outroot`	`./mineru-output`	Output root directory
`--tables`	off	Extract tables (if supported)
`--images`	off	Extract images (if supported)
`--threads`	`4`	Thread count (OMP_NUM_THREADS)
`--lang`	`en`	Language
`--backend`	`pipeline`	MinerU backend
`--method`	`auto`	Processing method
`--device`	`cpu`	Device (cpu/gpu)
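
To parse a whole directory rather than one file, the options above can be combined in a small loop. The following is a minimal sketch, not part of the skill: `parse_all` and the `PARSER` variable are hypothetical names introduced here, and the sketch assumes the script exits non-zero when a parse fails.

```shell
# PARSER is a hypothetical variable for this example; it defaults to the
# script path used in the quick start.
PARSER="${PARSER:-./scripts/mineru_parse.sh}"

# parse_all DIR [OUTROOT]: run the parser on every *.pdf in DIR.
# Returns 1 if at least one file failed, 0 otherwise.
parse_all() {
  local dir="$1" outroot="${2:-./mineru-output}" pdf failed=0
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue   # the glob matched nothing
    # Assumption: a non-zero exit status from the script means failure.
    if ! "$PARSER" "$pdf" --format both --outroot "$outroot"; then
      echo "failed: $pdf" >&2
      failed=$((failed + 1))
    fi
  done
  return "$((failed > 0))"
}

# Example call:
#   parse_all ~/papers ./mineru-output
```

Given the per-page timings MinerU needs on CPU, running files one at a time and logging failures is usually easier to reason about than parallel runs that compete for the same threads.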

Configuration

If your MinerU wrapper uses a different command name or different flags, override them with environment variables. See mineru-pdf/references/mineru-cli.md for full documentation.

export MINERU_CMD=~/.local/bin/mineru
export MINERU_INPUT_FLAG=-p
export MINERU_OUTPUT_FLAG=-o
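
As an illustration of what these overrides control, the sketch below shows one way a wrapper could assemble the MinerU command line from them. It is an assumption about the general pattern, not the script's actual code: `build_mineru_cmd` is a hypothetical helper, and the fallback values simply mirror the example exports above.

```shell
# build_mineru_cmd INPUT OUTDIR: print the command line a wrapper would run.
# Hypothetical helper; the fallbacks mirror the example values above.
build_mineru_cmd() {
  local input="$1" outdir="$2"
  printf '%s %s %s %s %s\n' \
    "${MINERU_CMD:-mineru}" \
    "${MINERU_INPUT_FLAG:--p}" "$input" \
    "${MINERU_OUTPUT_FLAG:--o}" "$outdir"
}

# Example call:
#   build_mineru_cmd document.pdf ./mineru-output
```

Printing the command before running it is a cheap way to confirm that an override took effect.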

Output

MinerU creates a per-document subfolder under the output root:

./mineru-output/
└── document-name/
    └── auto/
        ├── document-name.md          # Markdown output
        ├── document-name_middle.json # Rich structured JSON (~50KB+)
        ├── document-name_layout.pdf  # Layout visualization
        └── images/                   # Extracted images
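
Downstream steps often need the path of the generated Markdown file. The helper below derives it from the layout shown above. It is a sketch under two assumptions: the subfolder is named after the PDF's basename, and the method subfolder is `auto` (the default `--method`), so other methods may produce a different folder name. `mineru_md_path` is a hypothetical name.

```shell
# mineru_md_path PDF [OUTROOT]: print the expected Markdown output path.
# Assumes the <outroot>/<name>/auto/<name>.md layout shown above.
mineru_md_path() {
  local pdf="$1" outroot="${2:-./mineru-output}" name
  name="$(basename "$pdf" .pdf)"   # strip directory and .pdf suffix
  printf '%s/%s/auto/%s.md\n' "$outroot" "$name" "$name"
}

# Example call:
#   md="$(mineru_md_path /path/to/document.pdf)"
#   [ -f "$md" ] && head -n 20 "$md"
```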

Output quality

MinerU produces rich structured output including:

Layout-aware text extraction
Detailed JSON with position/structure metadata
Extracted images and layout PDFs

Best for: Documents requiring accurate layout preservation, image extraction, or structured data output.

Comparison with PyMuPDF

Aspect	MinerU	PyMuPDF
Speed	Slower (~15-30s/page)	Fast (~1s/page)
JSON output	Rich (~50KB+, layout data)	Minimal (~1KB, text only)
Image extraction	Yes (automatic)	Yes (optional)
Layout preservation	Excellent	Basic
Dependencies	Heavy (~20GB models)	Light (pip install)

Use MinerU when: Quality and structure matter more than speed.
Use PyMuPDF when: Speed matters or for simple text extraction.
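
For a rough sense of the trade-off, the per-page figures in the table can be turned into a time estimate. This sketch only multiplies the page count by the table's approximate 15 and 30 seconds per page; actual timings depend on hardware and document content, and `estimate_seconds` is a hypothetical helper.

```shell
# estimate_seconds PAGES: print "<low> <high>" seconds for a MinerU CPU run,
# using the approximate 15-30 s/page range from the comparison table.
estimate_seconds() {
  local pages="$1"
  echo "$((pages * 15)) $((pages * 30))"
}

# Example call:
#   estimate_seconds 40
```

By that estimate a 40-page document takes roughly 10 to 20 minutes with MinerU, compared with well under a minute at PyMuPDF's approximate 1 second per page.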

License

Apache 2.0

Contributing

Issues and PRs welcome. Please test with a variety of PDFs before submitting changes.

Related

PyMuPDF PDF Parser Skill — Fast, lightweight alternative
MinerU — The underlying PDF parser
Clawdbot — The AI agent framework