How Docsieve Works
Docsieve turns noisy, multi-page HTML documentation sites into high-density Markdown briefs and automatically tracks changes between crawls. Here is how our four-step pipeline works.
Recurse & Validate
When you request a crawl, our pipeline walks the target site recursively. Powered by an asynchronous crawler, Docsieve follows links up to your configured depth limit.
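As a rough illustration of a depth-limited asynchronous crawl, here is a minimal sketch. It is not Docsieve's actual crawler: `crawl`, `fetch_links`, and `max_depth` are hypothetical names, and `fetch_links` stands in for whatever coroutine downloads a page and returns the links found on it.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature: a coroutine that returns the links found on a page.
FetchLinks = Callable[[str], Awaitable[list[str]]]

async def crawl(start_url: str, fetch_links: FetchLinks, max_depth: int) -> dict[str, int]:
    """Breadth-first crawl; returns each discovered URL with its depth."""
    depths = {start_url: 0}
    frontier = [start_url]
    for depth in range(max_depth):
        # Fetch every page on the current level concurrently.
        results = await asyncio.gather(*(fetch_links(url) for url in frontier))
        next_frontier = []
        for links in results:
            for link in links:
                if link not in depths:  # skip pages already discovered
                    depths[link] = depth + 1
                    next_frontier.append(link)
        frontier = next_frontier
    # Pages at max_depth are recorded, but their links are not followed.
    return depths
```

Processing one level at a time keeps the depth bookkeeping simple; a production crawler would likely also cap concurrency and handle fetch errors.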
SSRF IP Validation
Security is built into our networking stack. We resolve every hostname and check the resulting IP addresses before fetching content. Loopback addresses, private ranges (RFC 1918), and cloud metadata endpoints are all blocked, ensuring our crawler cannot target internal servers.
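The kind of check described above can be sketched with Python's standard `ipaddress` and `socket` modules. This is an illustrative sketch under assumptions, not Docsieve's real code: the function names `is_blocked_ip` and `resolve_and_check` are hypothetical, and the exact set of blocked ranges is a guess based on the description.

```python
import ipaddress
import socket

def is_blocked_ip(ip_text: str) -> bool:
    """Return True if the address should never be fetched by a crawler."""
    ip = ipaddress.ip_address(ip_text)
    # Treat IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) as their IPv4 form.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return (
        ip.is_loopback        # 127.0.0.0/8, ::1
        or ip.is_private      # includes the RFC 1918 ranges
        or ip.is_link_local   # includes 169.254.169.254 (cloud metadata)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )

def resolve_and_check(host: str) -> list[str]:
    """Resolve a hostname and reject it if any resolved address is blocked."""
    infos = socket.getaddrinfo(host, None)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [addr for addr in addresses if is_blocked_ip(addr)]
    if blocked:
        raise ValueError(f"{host} resolves to blocked address(es): {blocked}")
    return addresses
```

For the check to hold, the fetcher should then connect to one of the validated addresses rather than resolving the hostname a second time, since a second lookup could return a different IP.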
Robots.txt & Stepper
The crawl pipeline checks every URL against the site's robots.txt. Paths matched by Disallow rules are skipped, so routes the site owner has excluded are never fetched.
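A robots.txt filter of this kind can be approximated with the standard library's `urllib.robotparser`. The sketch below is an assumption about how such a step might look; `allowed_urls` and the user agent string `docsieve-bot` are hypothetical and do not come from Docsieve itself.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the robots.txt rules permit for this agent."""
    parser = build_parser(robots_txt)
    return [url for url in urls if parser.can_fetch(user_agent, url)]

# Example usage with a hypothetical user agent and site.
rules = "User-agent: *\nDisallow: /internal/\n"
candidates = [
    "https://example.com/docs/intro",
    "https://example.com/internal/keys",
]
crawlable = allowed_urls(rules, "docsieve-bot", candidates)
```

Parsing the file once and reusing the parser for every candidate URL avoids re-fetching robots.txt per page.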
Clean & Hash
Raw HTML is parsed to extract core semantic documentation content. Headers, footers, sidebars, cookie banners, and script tags are removed.
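To make the cleaning step concrete, here is a simplified sketch built on the standard library's `html.parser`. It only illustrates the idea of dropping page chrome by tag name and returns plain text rather than Markdown; the tag list and the names `ContentExtractor` and `extract_text` are assumptions, and cookie banners in particular would need site-specific rules that this sketch does not attempt.

```python
from html.parser import HTMLParser

# Assumed set of tags treated as page chrome rather than documentation content.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside any skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)

def extract_text(html: str) -> str:
    """Return the visible content text, one chunk per line."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```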
Page Content Hashing
Docsieve computes SHA-256 hashes of the cleaned Markdown content for every page. These hashes are stored and used as version baselines for subsequent crawl comparisons.
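Per-page hashing can be sketched with `hashlib`. The whitespace normalization shown here is an assumption added so that trivial formatting differences do not register as changes; the source does not say whether Docsieve normalizes content before hashing, and `content_hash` is a hypothetical name.

```python
import hashlib

def content_hash(markdown: str) -> str:
    """SHA-256 hex digest of cleaned Markdown after light normalization."""
    # Assumed normalization: trim the document and strip trailing spaces per line.
    normalized = "\n".join(line.rstrip() for line in markdown.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hypothetical baseline: a mapping of page URL to content hash for one crawl.
pages = {"/install": "# Install\nRun the installer.\n"}
baseline = {url: content_hash(body) for url, body in pages.items()}
```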
Multimedia Manifest
Images and code files referenced in the documents are compiled into a structural media manifest. Broken links are flagged automatically with link-rot indicators in the reading view.
Diff & Changelog
During recrawls, Docsieve compares the new page hashes with those stored from your previous crawl, building a detailed change report.
Sparse Version Changelog
Instead of showing raw files, we build a sparse changelog documenting exactly which pages were **added**, **modified**, or **removed**. Clicking a changed page highlights the specific text differences.
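The added/modified/removed classification follows directly from comparing two hash maps, and the text-level highlighting can be approximated with `difflib`. This is a minimal sketch under the assumption that each crawl is stored as a mapping of page URL to content hash; `build_changelog` and `text_diff` are hypothetical names.

```python
import difflib

def build_changelog(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify pages by comparing URL -> hash maps from two crawls."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    modified = sorted(
        url for url in current.keys() & previous.keys()
        if current[url] != previous[url]
    )
    return {"added": added, "modified": modified, "removed": removed}

def text_diff(old: str, new: str) -> list[str]:
    """Unified diff lines for a single modified page."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
```

Because unchanged pages never appear in the result, the changelog stays sparse even for large sites; the text diff only needs to be computed for pages listed as modified.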
Alert Subscriptions
Subscribe to spaces to receive notifications when changes are detected. If you do not acknowledge the changes in the active SSE dashboard within 120 seconds, Docsieve fires email fallbacks automatically.
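The acknowledge-or-fall-back behaviour can be modelled as a timed wait on an event. The sketch below uses `asyncio`; only the 120-second window comes from the text, while `notify_with_fallback`, the `acknowledged` event, and the `send_email` callback are hypothetical stand-ins for the dashboard and mailer.

```python
import asyncio
from typing import Awaitable, Callable

ACK_TIMEOUT_SECONDS = 120  # acknowledgement window stated in the text

async def notify_with_fallback(
    acknowledged: asyncio.Event,
    send_email: Callable[[], Awaitable[None]],
    timeout: float = ACK_TIMEOUT_SECONDS,
) -> str:
    """Wait for a dashboard acknowledgement; send email if none arrives in time."""
    try:
        await asyncio.wait_for(acknowledged.wait(), timeout)
        return "acknowledged"
    except asyncio.TimeoutError:
        # No acknowledgement within the window: fall back to email.
        await send_email()
        return "email_sent"
```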
AI Executive Brief
For larger crawls, digesting hundreds of pages is impractical. Our summarization pipeline uses Gemma models to generate a unified, high-density brief.
Synthesized Documentation Summaries
The summary extracts code patterns, architecture models, security boundaries, and API credential scopes, giving team members a rapid onboarding baseline without reading hundreds of pages.
Interactive Reading Canvas
Read compiled briefs inside the web dashboard complete with hierarchical tables of contents, lazy-rendered Mermaid diagrams, and code snippets.
See it in action
Get started crawling docs or read our developer guide to set up the CLI and MCP tools.