Technical Walkthrough

How Docsieve Works

Docsieve transforms noisy, multi-page HTML documentation structures into high-density Markdown briefs, tracking version changes between runs automatically. Here is how our four-step pipeline operates.

Step 1

Recurse & Validate

When you request a crawl, our pipeline follows links from the target site recursively. Powered by an asynchronous crawler, Docsieve traverses directories up to your configured depth limit.
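As a rough illustration of depth-limited asynchronous traversal, here is a minimal sketch. It is not Docsieve's actual crawler: the `crawl` function, the `fetch_links` callback, and the in-memory `SITE` map are all hypothetical stand-ins for real HTTP fetching and link extraction.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def crawl(
    start: str,
    fetch_links: Callable[[str], Awaitable[list[str]]],
    max_depth: int,
) -> dict[str, int]:
    """Breadth-first crawl; returns each discovered URL with its depth."""
    seen = {start: 0}
    frontier = [start]
    for depth in range(max_depth):
        # Fetch every page at the current depth concurrently.
        results = await asyncio.gather(*(fetch_links(url) for url in frontier))
        next_frontier = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen[link] = depth + 1
                    next_frontier.append(link)
        frontier = next_frontier
        if not frontier:
            break
    return seen


# Hypothetical in-memory site standing in for real HTTP fetches.
SITE = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
    "/docs/b": ["/docs"],
    "/docs/a/deep": ["/docs/a/deep/deeper"],
}


async def fake_fetch_links(url: str) -> list[str]:
    return SITE.get(url, [])


def crawl_demo(max_depth: int) -> dict[str, int]:
    return asyncio.run(crawl("/docs", fake_fetch_links, max_depth))
```

In this sketch, pages at the depth limit are recorded but their links are not followed, and the `seen` map keeps cyclic links (such as `/docs/b` pointing back to `/docs`) from being revisited.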

SSRF IP Validation

Security is built into our networking stack. We resolve each hostname and check every resulting IP address before fetching content. All loopback IPs, private ranges (RFC 1918), and cloud metadata endpoints are blocked, so our crawler cannot target internal servers.
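A check of this kind can be sketched with Python's standard `ipaddress` module. This is an illustration under assumptions, not Docsieve's implementation: the function names are hypothetical, and treating the whole link-local range as blocked is an assumed way of covering the address cloud metadata services commonly use (169.254.169.254).

```python
import ipaddress
import socket


def is_blocked_ip(ip_text: str) -> bool:
    """Return True for addresses a crawler should refuse to contact."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so it cannot bypass the check.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return ip.is_loopback or ip.is_private or ip.is_link_local


def resolve_and_check(host: str, port: int = 443) -> list[str]:
    """Resolve a hostname and reject it if any resolved address is blocked."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    blocked = [addr for addr in addresses if is_blocked_ip(addr)]
    if blocked:
        raise ValueError(f"refusing to fetch {host}: blocked addresses {blocked}")
    return addresses
```

A real crawler would also need to connect to the exact address it vetted, since resolving the name a second time at fetch time could return a different IP.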

Robots.txt & Stepper

The crawl pipeline validates robots.txt compliance. Paths matched by Disallow rules are skipped, so routes a site owner has excluded from crawling are never fetched.
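The filtering step can be approximated with the standard library's `urllib.robotparser`. The sample rules, the `DocsieveBot` user agent string, and the helper names below are assumptions made for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /internal/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs the given user agent is permitted to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/docs/intro",
    "https://example.com/admin/users",
    "https://example.com/internal/notes",
]
crawlable = allowed_urls(build_parser(ROBOTS_TXT), "DocsieveBot", candidates)
```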

Step 2

Clean & Hash

Raw HTML is parsed to extract core semantic documentation content. Headers, footers, sidebars, cookie banners, and script tags are removed.
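To make the cleaning step concrete, here is a simplified sketch built on the standard `html.parser` module. It only extracts plain text and drops a fixed set of tags. The tag list and class name are assumptions, and cookie banners (usually marked by site-specific classes or IDs) are not handled here.

```python
from html.parser import HTMLParser

# Assumed set of elements treated as page chrome rather than content.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that sits outside any skipped element."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A production pipeline would additionally convert the remaining elements to Markdown rather than flattening them to text.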

Page Content Hashing

Docsieve computes SHA-256 hashes of the cleaned Markdown content for every page. These hashes are stored and used as version baselines for subsequent crawl comparisons.
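Hashing a page can be sketched with `hashlib`. The whitespace normalization shown here is an assumption added so that trailing spaces alone do not register as a change; the source does not say whether Docsieve normalizes content before hashing.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """SHA-256 hex digest of Markdown after light whitespace normalization."""
    # Assumed normalization: trim the document and strip trailing spaces per line.
    normalized = "\n".join(line.rstrip() for line in markdown.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical baseline keyed by page path.
baseline = {
    "/docs/install": content_hash("# Install\n\nRun the installer."),
}
```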

Multimedia Manifest

Images and code files referenced in the documents are compiled into a structural media manifest. Broken links are flagged automatically with link-rot indicators in the reading view.
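One way to picture the manifest is as a list of referenced assets, each with a broken-link flag. The sketch below covers only Markdown image references. The regular expression, the record fields, and the `is_reachable` callback (a stand-in for a real HTTP check) are all assumptions.

```python
import re
from collections.abc import Callable

# Matches ![alt](target) and ![alt](target "title").
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def build_manifest(markdown: str, is_reachable: Callable[[str], bool]) -> list[dict]:
    """List every image reference and flag the ones that cannot be reached."""
    manifest = []
    for alt, target in IMAGE_REF.findall(markdown):
        manifest.append(
            {"alt": alt, "target": target, "broken": not is_reachable(target)}
        )
    return manifest
```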

Step 3

Diff & Changelog

During recrawls, Docsieve compares the new page hashes with those stored from your previous crawl, building a detailed change report.

Sparse Version Changelog

Instead of showing raw files, we build a sparse changelog documenting exactly which pages were **added**, **modified**, or **removed**. Clicking a changed page highlights the specific text differences.
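The classification into added, modified, and removed pages follows directly from comparing two hash maps. Here is a minimal sketch, assuming each crawl is represented as a dictionary from page path to content hash. The function names are hypothetical, and `difflib.ndiff` is used only as one possible way to surface line-level differences.

```python
import difflib


def build_changelog(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Classify pages by comparing hash maps from two crawls."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    modified = sorted(
        path for path in current.keys() & previous.keys()
        if current[path] != previous[path]
    )
    return {"added": added, "modified": modified, "removed": removed}


def changed_lines(old_text: str, new_text: str) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

Unchanged pages never appear in the result, which is what keeps the changelog sparse.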

Alert Subscriptions

Subscribe to spaces to receive notifications when changes are detected. If you do not acknowledge the changes in the active SSE dashboard within 120 seconds, Docsieve fires email fallbacks automatically.
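The acknowledge-or-fall-back behavior amounts to waiting on an event with a timeout. The sketch below shows that pattern with `asyncio`. The function name, the `send_email` callback, and the return values are assumptions, and the dashboard's SSE delivery is left out.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def notify_with_fallback(
    acknowledged: asyncio.Event,
    send_email: Callable[[], Awaitable[None]],
    timeout_seconds: float = 120.0,
) -> str:
    """Wait for a dashboard acknowledgement; send an email if none arrives in time."""
    try:
        await asyncio.wait_for(acknowledged.wait(), timeout=timeout_seconds)
        return "acknowledged"
    except asyncio.TimeoutError:
        # No acknowledgement within the window: fall back to email.
        await send_email()
        return "email_sent"
```

The dashboard handler would call `acknowledged.set()` when the user confirms the change, which ends the wait before the email is sent.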

Step 4

AI Executive Brief

For larger crawls, digesting hundreds of pages is impractical. Our summarization pipeline uses Gemma models to generate a unified, high-density brief.

Synthesized Documentation Summaries

The summary extracts code patterns, architecture models, security boundaries, and API credential scopes, giving team members a rapid onboarding baseline without reading hundreds of pages.

Interactive Reading Canvas

Read compiled briefs inside the web dashboard, complete with hierarchical tables of contents, lazy-rendered Mermaid diagrams, and code snippets.

See it in action

Get started crawling docs or read our developer guide to set up the CLI and MCP tools.
