web-to-markdown
Converts webpages to Markdown using a local web2md CLI and a Puppeteer-driven browser. It instructs running shell commands like command -v web2md and cd ~/workspace/softaworks/projects/web2md && npm install && npm run build && npm link, uses --user-data-dir ./tmp/web2md-profile, and accesses arbitrary http:///https:// URLs.
web-to-markdown
Convert web pages to clean Markdown using local browser automation with Puppeteer and Readability.
Purpose
This skill enables Claude Code to convert web pages (including JavaScript-rendered content) into clean, readable Markdown format by leveraging the web2md CLI tool. It's particularly useful for extracting article content, documentation, or any web content that needs to be processed, archived, or analyzed in Markdown format.
When to Use
Use this skill when you need to:
- Extract article content from news sites, blogs, or documentation
- Convert JavaScript-heavy pages that simple HTTP fetching can't handle
- Archive web content in a readable, portable format
- Process web content for analysis or documentation
- Handle pages with login walls or human verification (using interactive mode)
- Batch convert multiple URLs to Markdown files
Important: This skill must be explicitly invoked by the user with phrases like:
- "use the skill web-to-markdown ..."
- "use a skill web-to-markdown ..."
This is a hard requirement to prevent accidental usage when simpler tools might suffice.
How It Works
The skill uses the web2md CLI tool which:
- Launches a real browser (Chrome/Chromium/Brave/Edge) via Puppeteer
- Renders the page including all JavaScript and dynamic content
- Extracts main content using Mozilla's Readability library
- Converts to Markdown using Turndown with cleaned links
- Outputs to stdout or saves to file(s) with optional YAML frontmatter
This approach handles modern single-page applications and JavaScript-rendered content that simple HTTP fetchers cannot process.
Key Features
- JavaScript Support: Renders pages with Puppeteer, capturing dynamically loaded content
- Smart Content Extraction: Uses Readability to identify and extract main article content
- Flexible Output: Print to stdout, save to specific files, or auto-name files in a directory
- Interactive Mode: Handle login walls and human verification challenges
- Batch Processing: Convert multiple URLs in one operation
- Wait Strategies: Multiple options to ensure content is fully loaded before extraction
- Browser Profiles: Support for persistent user data directories (sessions, logins)
- Clean Markdown: Produces readable, well-formatted Markdown with optional metadata
Prerequisites
The web2md CLI tool must be installed. The skill will check for it and provide installation instructions if needed:
cd ~/workspace/softaworks/projects/web2md
npm install
npm run build
npm link
Usage Examples
Basic Conversion
Convert a single URL and save to a file:
use the skill web-to-markdown to convert https://example.com/article to article.md
This will run:
web2md 'https://example.com/article' --out ./article.md
Auto-Named Output
Convert a URL and let the tool name the file based on page title:
use the skill web-to-markdown to convert https://example.com/article and save to ./output/
This creates a directory and auto-names the file:
mkdir -p ./output
web2md 'https://example.com/article' --out ./output/
Print to Console
Convert and display the Markdown (useful for quick inspection):
use the skill web-to-markdown to convert https://example.com/article and print the result
This will run:
web2md 'https://example.com/article' --print
Interactive Mode (Login Walls)
Handle pages requiring login or human verification:
use the skill web-to-markdown to convert https://example.com/protected-article in interactive mode
This will run:
mkdir -p ./tmp/web2md-profile
web2md 'https://example.com/protected-article' --interactive --user-data-dir ./tmp/web2md-profile --out ./output/
The browser window will appear, allowing you to complete login or verification, then press Enter to continue.
Batch Conversion
Convert multiple URLs:
use the skill web-to-markdown to convert these URLs:
- https://example.com/article1
- https://example.com/article2
- https://example.com/article3
Save them to ./articles/
This will create the directory and run separate commands for each URL:
mkdir -p ./articles
web2md 'https://example.com/article1' --out ./articles/
web2md 'https://example.com/article2' --out ./articles/
web2md 'https://example.com/article3' --out ./articles/
Advanced Wait Strategies
For heavy JavaScript applications:
use the skill web-to-markdown to convert https://app.example.com/dashboard
Wait for the main selector to appear
This will run:
web2md 'https://app.example.com/dashboard' --wait-until domcontentloaded --wait-for 'main' --out ./dashboard.md
Advanced Options
The skill supports various options to handle tricky pages:
--chrome-path <path>: Specify Chrome/Chromium location if auto-detection fails--interactive: Show browser and pause for manual intervention--wait-until <event>: Wait forload,domcontentloaded,networkidle0, ornetworkidle2(default:networkidle2)--wait-for '<selector>': Wait for specific CSS selector to appear--wait-ms <milliseconds>: Additional wait time in milliseconds--headful: Show browser window (useful for debugging)--no-sandbox: Disable sandbox (sometimes required in containers/CI)--user-data-dir <dir>: Use persistent browser profile (for sessions/logins)
Output Format
The generated Markdown includes:
- YAML Frontmatter (optional): Title, author, publication date, URL, excerpt
- Main Content: Article text with preserved formatting
- Clean Links: Properly formatted Markdown links
- Images: Preserved with alt text and captions where available
Technical Details
- Browser Automation: Uses
puppeteer-corewith local Chrome/Chromium - Content Extraction: Mozilla Readability library
- Markdown Conversion: Turndown library
- Supported Browsers: Chrome, Chromium, Brave, Edge (Chromium-based)
Troubleshooting
Chrome not found:
- Install Chrome/Chromium or specify path with
--chrome-path
Page content incomplete:
- Try
--wait-until networkidle2(waits for network to settle) - Use
--wait-for '<selector>'to wait for specific elements - Add
--wait-ms 2000for additional delay
Login required:
- Use
--interactivemode to manually login - Use
--user-data-dirto persist session across runs
Verification/CAPTCHA:
- Use
--interactivemode to complete verification manually
Version
Current version: 0.1.0
Related Tools
For simpler use cases without JavaScript, consider using Claude Code's built-in WebFetch tool instead.