Configuration
Configure recursive crawler depth limits, regex filters, and advanced parser settings.
Docsieve provides a highly configurable parsing pipeline. You can fine-tune crawls using local config files or within the workspace settings.
Configuration File Structure
When running the CLI, you can define a .docsieverc.json file in the root of your project directory:
{
  "max_depth": 3,
  "blacklist_regex": [
    ".*\\/api\\/.*",
    ".*\\/v1\\/.*"
  ],
  "parser_options": {
    "strip_header_nav": true,
    "ignore_images": false
  },
  "concurrency_limit": 5
}
Parameters
max_depth: The recursion depth limit. A value of 1 crawls only the root URL, while 3 (default) recurses up to three levels deep.
blacklist_regex: List of regular expression patterns to skip. Useful for excluding API references, search endpoints, or login portals.
concurrency_limit: Maximum number of concurrent HTTP connections. Helps avoid triggering rate limits on target servers.
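To make the interaction between max_depth and blacklist_regex concrete, here is a minimal Python sketch of how a crawler could apply both settings to a candidate URL. It is not Docsieve's implementation: the helper names load_rules and should_crawl are hypothetical, and it assumes that depth 1 is the root URL and that a pattern must match the entire URL.

```python
import json
import re

# Hypothetical config text mirroring the .docsieverc.json example.
CONFIG_TEXT = r'''
{
  "max_depth": 3,
  "blacklist_regex": [".*\\/api\\/.*", ".*\\/v1\\/.*"],
  "concurrency_limit": 5
}
'''


def load_rules(config_text):
    """Parse the config and pre-compile the blacklist patterns."""
    config = json.loads(config_text)
    patterns = [re.compile(p) for p in config.get("blacklist_regex", [])]
    # Fall back to the documented default depth of 3.
    return config.get("max_depth", 3), patterns


def should_crawl(url, depth, max_depth, patterns):
    """Return True if the URL is within the depth limit and not blacklisted."""
    # Assumption: depth 1 is the root URL, so depth must not exceed max_depth.
    if depth > max_depth:
        return False
    # Assumption: a pattern has to match the whole URL (re.fullmatch).
    return not any(p.fullmatch(url) for p in patterns)


max_depth, patterns = load_rules(CONFIG_TEXT)
print(should_crawl("https://example.com/docs/intro", 1, max_depth, patterns))
print(should_crawl("https://example.com/api/users", 2, max_depth, patterns))
```

Under these assumptions, the first URL is accepted, while the second is rejected because it matches the /api/ pattern.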
Security Controls
Docsieve is built with strict developer-first security controls:
SSRF Protection
During crawl validation, Docsieve checks every resolved IP address. It automatically blocks loopback addresses, private subnets (e.g. RFC 1918 ranges), and cloud metadata endpoints (such as 169.254.169.254 on AWS or GCP). This prevents SSRF attacks that target internal databases or APIs.
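The kind of check described above can be sketched with Python's standard ipaddress module. This is an illustration only, not Docsieve's actual code: the function names is_blocked_ip and all_addresses_allowed are hypothetical, and the sketch assumes the caller has already resolved the hostname to a list of IP address strings.

```python
import ipaddress


def is_blocked_ip(address):
    """Return True for loopback, private, or link-local addresses."""
    ip = ipaddress.ip_address(address)
    # is_link_local covers 169.254.0.0/16, which contains the
    # 169.254.169.254 cloud metadata endpoint.
    return ip.is_loopback or ip.is_private or ip.is_link_local


def all_addresses_allowed(addresses):
    """A host is only safe to crawl if every resolved address is allowed."""
    return not any(is_blocked_ip(a) for a in addresses)


print(is_blocked_ip("127.0.0.1"))
print(is_blocked_ip("169.254.169.254"))
print(all_addresses_allowed(["8.8.8.8"]))
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to several addresses and a single internal one is enough for an SSRF attempt.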
Robots.txt Compliance
The crawl pipeline performs an access check as part of every crawl: the target site’s robots.txt is always checked, and paths matched by Disallow rules are skipped automatically.
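As a rough illustration of how Disallow rules translate into skipped URLs, the sketch below uses Python's standard urllib.robotparser. It is not Docsieve's implementation: the robots.txt content, the "docsieve" user agent string, and the helper names are assumptions made for the example, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="docsieve"):
    """Return True if the rules permit this user agent to fetch the URL."""
    # "docsieve" is an assumed user agent name, used only for illustration.
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/docs/intro"))
print(is_allowed(parser, "https://example.com/private/report"))
```

With this sample file, URLs under /private/ are rejected and everything else is allowed.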