A tool for converting web documentation to clean Markdown, ready for embeddings, AI training data, and Claude Code skills.
Since its launch two weeks ago, Docpull has been downloaded 842 times by people streamlining their AI training workflows, and it's fully open source.
Modern documentation sites are notoriously difficult to scrape. They're built with JavaScript frameworks, load content dynamically, and are structured in ways that break traditional crawlers.
Docpull was designed specifically to solve these problems.
Most scrapers fail on JavaScript-heavy sites. Docpull uses a three-layer discovery system:
Works seamlessly with Docusaurus, GitBook, ReadTheDocs, Nextra, and custom documentation sites.
Docpull streams results as they're found:
async for event in fetcher.fetch_docs():
    if event.type == "PAGE_SAVED":
        print(f"Got {event.path}")  # Process immediately
Pipe this directly into a RAG ingestion pipeline. No need to wait for the full crawl to finish.
Docpull comes with built-in protections against common scraping pitfalls.
StreamingDeduplicator computes SHA-256 hashes on-the-fly:
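The core of on-the-fly deduplication is small: hash each page's content as it arrives and skip anything already seen. Only the class name comes from the post; the method shape below is an illustrative sketch.

```python
import hashlib

class StreamingDeduplicator:
    """Skip pages whose SHA-256 content hash has already been seen."""

    def __init__(self):
        self._seen: set[str] = set()

    def is_duplicate(self, content: bytes) -> bool:
        digest = hashlib.sha256(content).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False
```

Storing fixed-size digests rather than page bodies keeps memory use flat even on large crawls.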
--cache # Enable persistent cache
--cache-ttl 30 # 30-day expiry
--no-skip-unchanged # Override cache
CacheManager features:
Incremental updates on a 10k-page site take seconds, not hours.
docpull https://docs.example.com --profile rag
Profiles are pre-configured bundles of settings optimized for different use cases. They save you the hassle of manually tuning concurrency, caching, deduplication, and depth for your scraping tasks.
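Conceptually, a profile is just a named bundle of the knobs listed above. The values and profile fields below are invented for illustration; only the `rag` profile name appears in the post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    concurrency: int
    cache: bool
    dedupe: bool
    max_depth: int

# Hypothetical presets; real Docpull profiles may differ.
PROFILES = {
    "rag": Profile(concurrency=10, cache=True, dedupe=True, max_depth=5),
    "quick": Profile(concurrency=4, cache=False, dedupe=False, max_depth=2),
}
```

Selecting `--profile rag` would then expand to the corresponding bundle instead of a dozen individual flags.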
Output isn't just HTML→Markdown. Every file includes Markdown + YAML frontmatter:
---
title: Getting Started
source: https://docs.example.com/getting-started
og_description: Learn how to...
json_ld:
"@type": HowTo
---
Pulls Open Graph, JSON-LD, and microdata via extruct, giving your RAG pipeline rich context for embeddings.
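Assembling that frontmatter from extracted metadata can be sketched with the standard library alone. The input dict below is shaped like extruct's output (keys such as `opengraph` and `json-ld`), but the function itself is a simplified illustration, not Docpull's code, and the YAML emission is deliberately minimal.

```python
def build_frontmatter(title: str, source: str, meta: dict) -> str:
    """Render a YAML frontmatter block from extracted page metadata."""
    lines = ["---", f"title: {title}", f"source: {source}"]
    og = meta.get("opengraph") or {}
    if "og:description" in og:
        lines.append(f"og_description: {og['og:description']}")
    jsonld = meta.get("json-ld") or []
    if jsonld and "@type" in jsonld[0]:
        lines.append("json_ld:")
        lines.append(f'  "@type": {jsonld[0]["@type"]}')
    lines.append("---")
    return "\n".join(lines)
```

A real implementation would use a YAML emitter to handle escaping and nesting, but the mapping from extracted metadata to frontmatter keys is the same.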
Built on aiohttp with:
Fetching 1,000 pages typically takes 2–5 minutes, even with rate limits in place.
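The usual way to get that throughput while respecting rate limits is to cap in-flight requests with a semaphore around each fetch. The sketch below shows the pattern with asyncio only; `fake_get` stands in for a real aiohttp request, and the concurrency limit is an assumption.

```python
import asyncio

async def fake_get(url: str) -> str:
    # Stand-in for an aiohttp GET; simulates yielding to the event loop.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

async def fetch_all(urls: list[str], limit: int = 10) -> list[str]:
    """Fetch all URLs with at most `limit` requests in flight."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with sem:
            return await fake_get(url)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(u) for u in urls))

pages = asyncio.run(fetch_all([f"https://docs.example.com/{i}" for i in range(5)], limit=2))
```

With the bound in place, adding more URLs increases total time roughly linearly while the server never sees more than `limit` concurrent connections.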
Docpull makes scraping modern documentation simple, fast, and reliable, producing clean Markdown with metadata, ready for AI pipelines.