mineru-pdf

Verified·Scanned 2/18/2026

Parse PDFs locally (CPU) into Markdown/JSON using MinerU. Assumes MinerU creates per‑doc output folders; supports table/image extraction.

from clawhub.ai·v1bcf7e8·10.4 KB·0 installs
Scanned from 1.0.0 at 1bcf7e8
$ vett add clawhub.ai/kesslerio/mineru-pdf

MinerU PDF Parser - Clawdbot Skill

A Clawdbot skill for parsing PDFs locally using MinerU (CPU). Produces rich structured output including Markdown, JSON with layout data, and extracted images.

Features

  • Local CPU processing — No GPU required; runs entirely on your machine
  • Rich structured output — Markdown + detailed JSON with layout information
  • Image extraction — Automatically extracts embedded images
  • Table support — Optional table extraction (if supported by your MinerU version)
  • Configurable — Flexible env overrides for different MinerU wrappers

Installation

Prerequisites

  1. MinerU CLI installed and accessible (see MinerU installation)
  2. Clawdbot installed

Install the skill

# Clone the repo
git clone https://github.com/kesslerio/MinerU-PDF-Parser-Clawdbot-Skill.git

# Then copy the mineru-pdf/ folder into your Clawdbot skills directory
cp -r MinerU-PDF-Parser-Clawdbot-Skill/mineru-pdf ~/.clawdbot/skills/

Usage

Quick start

# Run from the skill directory
./scripts/mineru_parse.sh /path/to/document.pdf

Options

./scripts/mineru_parse.sh /path/to/document.pdf --format json
./scripts/mineru_parse.sh /path/to/document.pdf --tables --images
./scripts/mineru_parse.sh /path/to/document.pdf --outroot ./my-output

Option      Default           Description
--format    both              Output format: md, json, or both
--outroot   ./mineru-output   Output root directory
--tables    off               Extract tables (if supported)
--images    off               Extract images (if supported)
--threads   4                 Thread count (OMP_NUM_THREADS)
--lang      en                Language
--backend   pipeline          MinerU backend
--method    auto              Processing method
--device    cpu               Device (cpu/gpu)
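
To apply the same options to many files, one possible approach is to generate one command per PDF and review the list before running anything. The sketch below is illustrative only: the function name `print_batch_commands` is hypothetical and not part of the skill, and it prints commands rather than executing them.

```shell
# Hypothetical batch helper (not part of the skill): print one
# mineru_parse.sh command per PDF found directly inside a directory.
# It only prints, so the commands can be reviewed before running them.
print_batch_commands() (
    dir=$1
    outroot=${2:-./mineru-output}
    for pdf in "$dir"/*.pdf; do
        # Skip the literal, unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        printf './scripts/mineru_parse.sh "%s" --format md --outroot "%s"\n' \
            "$pdf" "$outroot"
    done
)
```

For example, `print_batch_commands ~/papers ./parsed` lists the commands for every PDF in `~/papers`. Paths are wrapped in double quotes for display only; paths that themselves contain double quotes would need extra escaping.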

Configuration

If your MinerU installation or wrapper uses a different command name or different input/output flags, override them with environment variables. See mineru-pdf/references/mineru-cli.md for full documentation.

export MINERU_CMD=~/.local/bin/mineru
export MINERU_INPUT_FLAG=-p
export MINERU_OUTPUT_FLAG=-o
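
As a rough sketch of how overrides like these could be consumed, the helper below assembles a MinerU command line from the three variables and falls back to defaults when they are unset. The function name `build_mineru_cmd` and the fallback values (`mineru`, `-p`, `-o`) are assumptions for illustration; the actual logic in scripts/mineru_parse.sh may differ.

```shell
# Hypothetical helper: print the MinerU command that would be run.
# The fallbacks (mineru, -p, -o) are assumed here, not taken from the
# skill's own script.
build_mineru_cmd() (
    input=$1
    outroot=$2
    cmd=${MINERU_CMD:-mineru}
    in_flag=${MINERU_INPUT_FLAG:--p}
    out_flag=${MINERU_OUTPUT_FLAG:--o}
    printf '%s %s %s %s %s\n' "$cmd" "$in_flag" "$input" "$out_flag" "$outroot"
)
```

Printing the command first is a convenient way to check that a wrapper's flags are picked up before starting a slow parse.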

Output

MinerU creates a per-document subfolder under the output root:

./mineru-output/
└── document-name/
    └── auto/
        ├── document-name.md          # Markdown output
        ├── document-name_middle.json # Rich structured JSON (~50KB+)
        ├── document-name_layout.pdf  # Layout visualization
        └── images/                   # Extracted images
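
Given that layout, the Markdown path for a parsed PDF can be derived from the input file name. The sketch below assumes the `auto` folder mirrors the processing method and that the document folder is named after the PDF without its extension; `mineru_md_path` is a hypothetical name, not part of the skill.

```shell
# Hypothetical helper: print the expected Markdown path for a parsed PDF.
# Assumes the <outroot>/<name>/<method>/<name>.md layout shown above,
# with <method> defaulting to "auto".
mineru_md_path() (
    pdf=$1
    outroot=${2:-./mineru-output}
    method=${3:-auto}
    name=$(basename "$pdf" .pdf)
    printf '%s/%s/%s/%s.md\n' "$outroot" "$name" "$method" "$name"
)
```

A caller could check that the printed path exists with `[ -f "$(mineru_md_path /path/to/document.pdf)" ]` before reading the result.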

Output quality

MinerU produces rich structured output including:

  • Layout-aware text extraction
  • Detailed JSON with position/structure metadata
  • Extracted images and layout PDFs

Best for: Documents requiring accurate layout preservation, image extraction, or structured data output.

Comparison with PyMuPDF

Aspect               MinerU                       PyMuPDF
Speed                Slower (~15-30s/page)        Fast (~1s/page)
JSON output          Rich (~50KB+, layout data)   Minimal (~1KB, text only)
Image extraction     Yes (automatic)              Yes (optional)
Layout preservation  Excellent                    Basic
Dependencies         Heavy (~20GB models)         Light (pip install)

Use MinerU when: Quality and structure matter more than speed.
Use PyMuPDF when: Speed matters or for simple text extraction.
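
To make the speed trade-off concrete, the approximate per-page figures above can be turned into a rough runtime range. This is only a back-of-the-envelope sketch: `estimate_mineru_seconds` is a hypothetical helper, and real timings depend on hardware, page content, and thread count.

```shell
# Rough MinerU runtime range on CPU, using the ~15-30 s/page figure above.
# Prints "<min>-<max>" in seconds for a given page count.
estimate_mineru_seconds() (
    pages=$1
    printf '%s-%s\n' "$((pages * 15))" "$((pages * 30))"
)
```

For a 40-page document this gives 600-1200 seconds, roughly 10 to 20 minutes, compared with about 40 seconds at PyMuPDF's ~1s/page.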

License

Apache 2.0

Contributing

Issues and PRs welcome. Please test with a variety of PDFs before submitting changes.
