A tool for converting web documentation to clean Markdown, ready for embeddings, AI training data, and Claude Code skills.
Since its launch two weeks ago, Docpull has been downloaded 842 times by people streamlining their AI training workflows, and it's fully open source.
Modern documentation sites are notoriously difficult to scrape. They're built with JavaScript frameworks, load content dynamically, and are structured in ways that break traditional crawlers.
Docpull was designed specifically to solve these problems.
Most scrapers fail on JavaScript-heavy sites. Docpull uses a three-layer discovery system:
Works seamlessly with Docusaurus, GitBook, ReadTheDocs, Nextra, and custom documentation sites.
Docpull streams results as they're found:
async for event in fetcher.fetch_docs():
    if event.type == "PAGE_SAVED":
        print(f"Got {event.path}")  # Process immediately
Pipe this directly into a RAG ingestion pipeline. No need to wait for the full crawl to finish.
Docpull comes with built-in protections against common scraping pitfalls.
StreamingDeduplicator computes SHA-256 hashes on-the-fly:
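The core of on-the-fly deduplication is small: hash each page's content as it arrives and skip anything already seen. Only the class name comes from the post; the method shape below is an illustrative sketch.

```python
import hashlib

class StreamingDeduplicator:
    """Skip pages whose SHA-256 content hash has already been seen."""

    def __init__(self):
        self._seen: set[str] = set()

    def is_duplicate(self, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

Storing fixed-size digests rather than page bodies keeps memory use flat even on large crawls.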
--cache # Enable persistent cache
--cache-ttl 30 # 30-day expiry
--no-skip-unchanged # Override cache
CacheManager features:
Incremental updates on a 10k-page site take seconds, not hours.
docpull https://docs.example.com --profile rag
Profiles are pre-configured bundles of settings optimized for different use cases. They save you the hassle of manually tuning concurrency, caching, deduplication, and depth for your scraping tasks.
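Conceptually, a profile is just a named bundle of the knobs listed above. The values and profile fields below are invented for illustration; only the `rag` profile name appears in the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    concurrency: int
    cache: bool
    dedupe: bool
    max_depth: int

# Hypothetical presets; real Docpull profiles may differ.
PROFILES = {
    "rag": Profile(concurrency=10, cache=True, dedupe=True, max_depth=5),
    "quick": Profile(concurrency=4, cache=False, dedupe=False, max_depth=2),
}
```

Selecting `--profile rag` would then expand to the corresponding bundle instead of a dozen individual flags.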
Output isn't just HTML→Markdown. Every file includes Markdown + YAML frontmatter:
---
title: Getting Started
source: https://docs.example.com/getting-started
og_description: Learn how to...
json_ld:
"@type": HowTo
---
Pulls Open Graph, JSON-LD, and microdata via extruct, giving your RAG pipeline rich context for embeddings.
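Assembling that frontmatter from extracted metadata can be sketched with the standard library alone. The input dict below is shaped like extruct's output (keys such as `opengraph` and `json-ld`), but the function itself is a simplified illustration, not Docpull's code, and the YAML emission is deliberately minimal.

```python
def build_frontmatter(title: str, source: str, meta: dict) -> str:
    """Render a YAML frontmatter block from extracted page metadata."""
    lines = ["---", f"title: {title}", f"source: {source}"]
    og = meta.get("opengraph") or {}
    if "og:description" in og:
        lines.append(f"og_description: {og['og:description']}")
    jsonld = meta.get("json-ld") or []
    if jsonld and "@type" in jsonld[0]:
        lines.append("json_ld:")
        lines.append(f'  "@type": {jsonld[0]["@type"]}')
    lines.append("---")
    return "\n".join(lines)
```

A real implementation would use a YAML emitter to handle escaping and nesting, but the mapping from extracted metadata to frontmatter keys is the same.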
Built on aiohttp with:
Fetching 1,000 pages typically takes 2–5 minutes, even with rate limits in place.
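The usual way to get that throughput while respecting rate limits is to cap in-flight requests with a semaphore around each fetch. The sketch below shows the pattern with asyncio only; `fake_get` stands in for a real aiohttp request, and the concurrency limit is an assumption.

```python
import asyncio

async def fake_get(url: str) -> str:
    # Stand-in for an aiohttp GET; simulates yielding to the event loop.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    """Fetch all URLs with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:
            return await fake_get(url)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(fetch_all([f"https://docs.example.com/{i}" for i in range(5)], limit=2))
```

With the bound in place, adding more URLs increases total time roughly linearly while the server never sees more than `limit` concurrent connections.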
Docpull makes scraping modern documentation simple, fast, and reliable, producing clean Markdown with metadata, ready for AI pipelines.