marker-pdf-ocr

⚠Review·Scanned 2/17/2026

This skill converts PDFs to Markdown using local marker-pdf with cloud fallback to Datalab.to. It requires MARKER_API_KEY, makes network calls to https://www.datalab.to/api/v1, and runs subprocesses such as python3 -m marker.scripts.convert_single and pip install marker-pdf.

from clawhub.ai·ve8283e1·176.8 KB·0 installs

Scanned from 1.0.0 at e8283e1 · Transparency log ↗

$ vett add clawhub.ai/charpup/marker-pdf-ocrReview findings below

Marker PDF OCR Skill

OpenClaw Skill for PDF to Markdown OCR conversion
Built with TDD+SDD Dual Pyramid Process

Overview

This skill provides robust PDF OCR capabilities with multiple deployment options optimized for different resource constraints. It converts PDF documents to Markdown (or JSON/HTML) with high accuracy, supporting various OCR engines and optional LLM enhancement.

Features

Multi-Modal Deployment: Cloud API, Local CPU, Docker, or Hybrid modes
Automatic Failover: Seamlessly switches between modes on errors
Resource Aware: Optimized for 8GB RAM environments
High Accuracy: Supports Surya, OCRMyPDF, and Tesseract OCR engines
Multiple Formats: Output to Markdown, JSON, HTML, or Chunks (for RAG)

Quick Start

from marker_pdf_ocr import MarkerOCRService

# Initialize service
service = MarkerOCRService()

# Convert PDF (auto-detects best mode)
result = service.convert_pdf(
    pdf_path="/path/to/document.pdf",
    output_format="markdown"
)

if result["success"]:
    print(result["content"])
else:
    print(f"Error: {result['error']}")

Deployment Modes

1. Cloud API Mode (Recommended for 8GB RAM)

Uses Datalab.to hosted API for processing.

Requirements: ~512MB RAM, API key
Pros: Fast, no GPU needed, managed scaling
Cons: Data leaves premises, API costs

export DATLAB_API_KEY="your_api_key"
export MARKER_DEPLOYMENT_MODE="cloud"

2. Local CPU Mode (Fallback)

Runs marker-pdf locally without GPU acceleration.

Requirements: 4GB+ RAM, 8GB+ swap recommended
Pros: No data leaves premises, one-time cost
Cons: Slower processing, no LLM enhancement

pip install marker-pdf
export MARKER_DEPLOYMENT_MODE="local_cpu"
export TORCH_DEVICE=cpu

3. Docker Mode (Not recommended for 8GB)

Uses containerized marker-api.

Requirements: 6GB+ RAM, Docker
Note: Container overhead makes this unsuitable for 8GB hosts

4. Auto Mode (Default)

Automatically selects best mode based on available resources and configuration.

export MARKER_DEPLOYMENT_MODE="auto"

Configuration

Environment Variables

Variable	Required	Default	Description
`DATLAB_API_KEY`	For cloud mode	-	Datalab.to API key
`MARKER_DEPLOYMENT_MODE`	No	`auto`	Deployment mode: auto, cloud, local_cpu
`MARKER_OCR_ENGINE`	No	`surya`	OCR engine: surya, ocrmypdf, tesseract
`MARKER_OUTPUT_DIR`	No	`/tmp/marker_output`	Output directory
`MARKER_FAILOVER_ENABLED`	No	`true`	Enable automatic failover
`MARKER_MAX_RETRIES`	No	`3`	Max retry attempts

Mode Selection Logic

if mode == "auto":
    if DATLAB_API_KEY set:
        return "cloud"          # Preferred: fast, low resource
    elif marker-pdf installed and RAM >= 4GB:
        return "local_cpu"      # Fallback: private, offline
    else:
        raise ConfigurationError("No viable mode available")

Resource Requirements

Mode	RAM	Storage	GPU	Notes
Cloud	512MB	1GB	No	Recommended for 8GB hosts
Local CPU	4GB+	5GB	No	Use swap for 8GB hosts
Docker	6GB+	10GB	No	Not recommended for 8GB
GPU	8GB+	10GB	4GB+	Not suitable for 8GB host

Architecture

┌─────────────────────────────────────────────────────────┐
│              OpenClaw Skill Interface                   │
└─────────────────────────────────────────────────────────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
      ┌──────────────┐        ┌──────────────┐
      │ Orchestrator │        │ Health Check │
      └──────┬───────┘        └──────────────┘
             │
    ┌────────┼────────┐
    ▼        ▼        ▼
┌───────┐ ┌───────┐ ┌───────┐
│ Cloud │ │ Local │ │Docker │
│  API  │ │  CPU  │ │ (opt) │
└───┬───┘ └───┬───┘ └───────┘
    │         │
    ▼         ▼
┌────────┐  ┌──────────┐
│Datalab │  │ marker-  │
│  .to   │  │  pdf     │
└────────┘  └──────────┘

Testing

Run All Tests

pip install -r tests/requirements-test.txt
pytest tests/ -v

Run by Category

# Unit tests only
pytest tests/unit/ -v

# Integration tests
pytest tests/integration/ -v

# Acceptance tests (requires API key)
pytest tests/acceptance/ -v --cloud

# Specific mode tests
pytest -m cloud -v
pytest -m local_cpu -v
pytest -m failover -v

Test Markers

@pytest.mark.unit - Unit tests
@pytest.mark.integration - Integration tests
@pytest.mark.acceptance - Acceptance tests
@pytest.mark.cloud - Cloud API tests
@pytest.mark.local_cpu - Local CPU tests
@pytest.mark.failover - Failover tests
@pytest.mark.performance - Performance tests

Development

TDD+SDD Process

This skill was developed using the TDD+SDD Dual Pyramid approach:

Specification First: SPEC.yaml defines all interfaces and contracts
Test Generation: Tests generated from SPEC.yaml scenarios
Implementation: Code written to pass tests
Validation: SPEC validated against implementation

Project Structure

marker-pdf-ocr/
├── SPEC.yaml              # SDD specification
├── findings.md            # Deployment research
├── task_plan.md           # Development plan
├── README.md              # This file
├── lib/                   # Implementation
│   ├── __init__.py
│   └── marker_pdf_ocr/    # Main package
├── tests/
│   ├── unit/              # Unit tests
│   ├── integration/       # Integration tests
│   ├── acceptance/        # E2E tests
│   └── conftest.py        # Test fixtures
└── requirements.txt       # Dependencies

Troubleshooting

"No viable deployment mode available"

Cause: Neither cloud API key nor local marker-pdf is available.

Solutions:

Set DATLAB_API_KEY for cloud mode
Install marker-pdf: pip install marker-pdf
Check available memory: free -h

"Out of memory" during local processing

Cause: Insufficient RAM for marker-pdf.

Solutions:

Use cloud mode instead
Add swap space: sudo fallocate -l 16G /swapfile
Reduce MARKER_MAX_WORKERS to 1

"API rate limit exceeded"

Cause: Too many requests to Datalab.to API.

Behavior: Skill automatically fails over to local CPU mode if available.

Solutions:

Wait for rate limit reset
Install marker-pdf for local fallback
Contact Datalab.to for higher limits

Performance Benchmarks

Mode	Single Page	10 Pages	50 Pages
Cloud	~3s	~15s	~60s
Local CPU	~5s	~60s	~5min

Benchmarks on 8GB RAM system with swap

License

This skill is part of the OpenClaw ecosystem. Refer to LICENSE file for details.

References

Marker Core - Core PDF conversion library
Marker API - FastAPI wrapper
Datalab.to - Hosted API service