Configuration | Docsieve

Docsieve's parsing pipeline is configurable. You can tune crawls with a local config file or in the workspace settings.

Configuration File Structure

When running the CLI, you can define a .docsieverc.json file in the root of your project directory:

{
  "max_depth": 3,
  "blacklist_regex": [
    ".*\\/api\\/.*",
    ".*\\/v1\\/.*"
  ],
  "parser_options": {
    "strip_header_nav": true,
    "ignore_images": false
  },
  "concurrency_limit": 5
}

Parameters

max_depth: The recursion depth limit. A value of 1 crawls only the root URL; the default, 3, also follows links up to two levels beyond the root.
blacklist_regex: List of regular expressions. URLs that match any pattern are skipped. Useful for excluding API references, search endpoints, or login portals.
concurrency_limit: Maximum number of concurrent HTTP connections. A lower value helps avoid triggering rate limits on target servers.
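
To make the interaction between max_depth and blacklist_regex concrete, here is a minimal Python sketch. It is not Docsieve's implementation. The plan_crawl function and the in-memory LINKS graph are hypothetical stand-ins for real fetching, and the sketch assumes that the root URL counts as depth 1 and that a pattern may match anywhere in the URL.

```python
import json
import re
from collections import deque

# Same shape as the .docsieverc.json example; the raw string keeps the JSON escapes intact.
CONFIG_TEXT = r'''
{
  "max_depth": 2,
  "blacklist_regex": [".*\\/api\\/.*", ".*\\/v1\\/.*"]
}
'''

# Hypothetical link graph standing in for pages fetched over HTTP.
LINKS = {
    "https://example.com/": [
        "https://example.com/guide/",
        "https://example.com/api/users",
    ],
    "https://example.com/guide/": ["https://example.com/guide/install"],
    "https://example.com/guide/install": [],
}


def plan_crawl(root, config, links):
    """Return URLs in breadth-first order, honoring depth and blacklist limits."""
    patterns = [re.compile(p) for p in config.get("blacklist_regex", [])]
    max_depth = config.get("max_depth", 3)  # 3 is the documented default
    seen = {root}
    order = []
    queue = deque([(root, 1)])  # assumption: the root URL is depth 1
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in links.get(url, []):
            if link in seen or any(p.search(link) for p in patterns):
                continue  # already queued, or excluded by a blacklist pattern
            seen.add(link)
            queue.append((link, depth + 1))
    return order


config = json.loads(CONFIG_TEXT)
visited = plan_crawl("https://example.com/", config, LINKS)
```

With max_depth set to 2, the sketch visits the root and the guide index, skips the /api/ link because of the blacklist, and never reaches the install page because it sits one level too deep.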

Security Controls

Docsieve applies the following security controls to every crawl:

SSRF Protection

During crawl validation, Docsieve checks every resolved IP address. It automatically blocks loopback addresses, private subnets (e.g. RFC 1918 ranges), and cloud metadata endpoints (such as 169.254.169.254 on AWS or GCP). This guards against server-side request forgery (SSRF) attacks that target internal databases or APIs.
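
The check described above can be approximated with Python's standard ipaddress module. This is a sketch under assumptions, not Docsieve's actual code: is_blocked_ip and host_is_allowed are hypothetical names, and Python's is_private flag covers the RFC 1918 ranges plus several other reserved blocks.

```python
import ipaddress
import socket


def is_blocked_ip(address: str) -> bool:
    """Return True for loopback, private, and link-local addresses."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    # 169.254.169.254 (cloud metadata) falls inside the link-local range.
    return ip.is_loopback or ip.is_private or ip.is_link_local


def host_is_allowed(hostname: str) -> bool:
    """Resolve a hostname and allow it only if every resolved address is safe."""
    infos = socket.getaddrinfo(hostname, None)
    for info in infos:
        # sockaddr[0] is the address; drop any IPv6 zone suffix such as %eth0.
        address = info[4][0].split("%", 1)[0]
        if is_blocked_ip(address):
            return False
    return True
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to several addresses and only one of them needs to be internal for a request to leak.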

Robots.txt Compliance

As part of access verification, the crawl pipeline always checks the target site’s robots.txt and automatically skips any path matched by a Disallow rule.
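
For readers who want to preview which URLs such a check would skip, Python's standard urllib.robotparser gives a rough equivalent. This is only an illustration: the robots.txt content, the build_parser helper, and the "docsieve-example" user agent string are all hypothetical, and Docsieve's own matching rules may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawl would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""

# Hypothetical user agent name, used only for this example.
USER_AGENT = "docsieve-example"


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls):
    """Keep only the URLs that the robots.txt rules permit for USER_AGENT."""
    return [url for url in urls if parser.can_fetch(USER_AGENT, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/docs/intro",
    "https://example.com/private/notes",
    "https://example.com/search?q=config",
]
kept = allowed_urls(parser, candidates)
```

Here only the /docs/intro URL survives, since the other two fall under Disallow rules.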