Crawler Behavior

SiteOne Crawler systematically explores a website within the limits and options you configure. Understanding this behavior helps you use the tool effectively and tune it to your needs.

The crawler follows a methodical approach:

  1. Initialization: Processes the starting URL and configuration options
  2. URL Queue Management: Maintains a queue of URLs to be crawled
  3. Content Fetching: Retrieves content from each URL in the queue
  4. Content Processing: Parses different content types (HTML, CSS, JS, etc.)
  5. URL Discovery: Extracts new URLs from processed content
  6. URL Filtering: Applies various rules to decide which URLs to follow
  7. Recursive Crawling: Repeats the process for each accepted URL
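
The steps above can be sketched as a small queue-driven loop. This is only an illustration of the general technique, not SiteOne Crawler's actual implementation: the names crawl, fetch, extract_links, and accept are hypothetical, and an in-memory dictionary stands in for real HTTP responses.

```python
from collections import deque

def crawl(start_url, fetch, extract_links, accept):
    """Breadth-first crawl sketch; returns URLs in the order they were fetched."""
    queue = deque([start_url])      # step 2: queue of URLs still to crawl
    seen = {start_url}              # prevents queueing the same URL twice
    visited = []
    while queue:
        url = queue.popleft()
        body = fetch(url)           # step 3: retrieve content
        visited.append(url)
        for link in extract_links(url, body):      # steps 4-5: parse and discover
            if link not in seen and accept(link):  # step 6: filter
                seen.add(link)
                queue.append(link)  # step 7: accepted URLs are crawled in turn
    return visited

# Toy in-memory "site" (an assumption for this example): URL -> links on that page.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://other.test/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}

order = crawl(
    "https://example.com/",
    fetch=lambda url: SITE.get(url, []),
    extract_links=lambda url, body: body,
    accept=lambda url: url.startswith("https://example.com/"),
)
print(order)
```

The seen set is what keeps the loop from revisiting pages that link back to each other, and the accept callback is where filtering rules such as the domain restriction would plug in.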

The crawler discovers URLs from various sources:

  • HTML Links: <a href> elements for navigation
  • HTML Resources: <img>, <script>, <link>, etc. for assets
  • CSS Resources: url() references to images, fonts, etc.
  • JavaScript Resources: URL strings in JS code
  • Redirects: HTTP redirects to new locations

URLs are processed to ensure they’re absolute and properly formatted before being added to the queue.
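
As a rough illustration of discovery and normalization, the sketch below collects URLs from a few HTML tags and resolves them against the page URL using Python's standard library. The tag-to-attribute table and the choice to drop "#fragment" parts are assumptions made for this example, not documented SiteOne Crawler behavior.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Attribute that carries the URL for each tag of interest (illustrative subset).
URL_ATTRS = {"a": "href", "link": "href", "img": "src", "script": "src"}

class LinkCollector(HTMLParser):
    """Collects absolute URLs found in the tags listed in URL_ATTRS."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                absolute = urljoin(self.base_url, value)   # resolve relative URLs
                absolute, _fragment = urldefrag(absolute)  # drop "#fragment"
                self.urls.append(absolute)

def discover_urls(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.urls

html = (
    '<a href="../about#team">About</a>'
    '<img src="/img/logo.png">'
    '<script src="app.js"></script>'
)
found = discover_urls("https://example.com/blog/post", html)
print(found)
```

Resolving against the page URL is what turns "../about" and "app.js" into absolute URLs that can be queued and de-duplicated.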

Not all discovered URLs are followed. The crawler filters them by domain, content type, depth, and robots.txt rules, as described below.

By default, the crawler only follows URLs within the initial domain. You can modify this behavior with these options:

The crawler handles each content type differently:

  • HTML Pages: Fully processed and crawled for links
  • Static Assets: Downloaded but not further crawled (CSS, JS, images, etc.)
  • Files: Downloaded based on configuration (PDFs, documents, etc.)

You can selectively disable certain content types with options like --disable-javascript, --disable-images, etc.

Two options limit how far the crawl extends:

  • --max-depth: Controls how deep the crawler will follow links from the initial URL
  • --single-page: Limits crawling to just the starting URL and its assets
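
The idea behind a depth limit can be sketched as follows. This example assumes depth is counted in link hops from the starting URL; the tool's own definition of depth for --max-depth may differ, so treat it as an illustration of the concept only. The function crawl_with_depth and the GRAPH data are hypothetical.

```python
from collections import deque

def crawl_with_depth(start_url, links_of, max_depth):
    """Return {url: depth} for every URL reachable within max_depth hops."""
    depths = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue                  # at the limit: do not expand this page's links
        for link in links_of(url):
            if link not in depths:    # first visit gives the shortest hop count
                depths[link] = depths[url] + 1
                queue.append(link)
    return depths

# Toy link graph (an assumption for this example): page -> links on that page.
GRAPH = {"/": ["/a"], "/a": ["/a/b"], "/a/b": ["/a/b/c"], "/a/b/c": []}

reached = crawl_with_depth("/", lambda u: GRAPH.get(u, []), max_depth=2)
print(reached)
```

With max_depth=0 only the starting URL is recorded, which is close in spirit to what --single-page describes for pages (asset handling is not modeled here).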

The crawler respects the robots.txt protocol by default:

  • Automatically fetches and parses robots.txt for each domain
  • Respects Disallow directives
  • Caches robots.txt content to minimize requests
  • Can be overridden with --ignore-robots-txt for private site analysis

HTTP status codes are handled as follows:

  • Redirects (3xx) are followed automatically
  • Error pages (4xx, 5xx) are noted but don’t terminate the crawl
  • --max-non200-responses-per-basename: Caps the number of non-200 responses per URL basename to prevent infinite crawling of error pages
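
The robots.txt rules listed earlier (fetch once per domain, honor Disallow, allow an explicit override) can be sketched with Python's standard urllib.robotparser module. This is not how SiteOne Crawler is implemented internally; fetch_robots, is_allowed, the ExampleBot user agent, and the sample robots.txt content are all assumptions made for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"   # sample file for this sketch
fetch_log = []   # records each simulated robots.txt download
_parsers = {}    # cache: one parsed robots.txt per domain

def fetch_robots(domain):
    """Hypothetical stand-in for an HTTP GET of https://<domain>/robots.txt."""
    fetch_log.append(domain)
    return ROBOTS_TXT

def is_allowed(url, user_agent="ExampleBot", ignore_robots=False):
    if ignore_robots:                  # mirrors the idea of --ignore-robots-txt
        return True
    domain = urlsplit(url).netloc
    if domain not in _parsers:         # fetch and parse only once per domain
        parser = RobotFileParser()
        parser.parse(fetch_robots(domain).splitlines())
        _parsers[domain] = parser
    return _parsers[domain].can_fetch(user_agent, url)

print(is_allowed("https://example.com/blog/"))                            # True
print(is_allowed("https://example.com/admin/users"))                      # False
print(is_allowed("https://example.com/admin/users", ignore_robots=True))  # True
print(fetch_log)                                                          # ['example.com']
```

Caching the parsed file per domain means the robots.txt check adds one request per domain rather than one per URL.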

The crawler includes specialized processors for modern frameworks:

  • Next.js: Handles Next.js-specific URL patterns and content
  • Svelte: Processes Svelte components and routing
  • Astro: Manages Astro’s hybrid static/dynamic content

Purely client-rendered (SPA) sites need JavaScript executed before their links and content become visible. For these, enable the optional browser rendering mode (--browser), which renders each page in a real Chromium browser so the crawler sees the post-render DOM.

The crawler balances thoroughness with efficiency:

  • Concurrency: Uses multiple workers with non-blocking async I/O to parallelize requests
  • --workers: Controls the number of concurrent connections
  • --max-reqs-per-sec: Limits request rate to prevent overloading servers
  • Memory Management: The native Rust engine keeps crawl state in efficient in-memory data structures; for very large sites you can offload response bodies and headers to disk with --result-storage=file
  • --memory-limit: Sets maximum memory usage
  • --result-storage: Controls where content is stored (memory or disk)
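
To show how a worker cap and a request-rate cap interact, here is a minimal asyncio sketch. It is an assumption-laden illustration rather than the crawler's real scheduler: RateLimiter, crawl_all, and fake_fetch are hypothetical names, and the workers and max_per_sec parameters only mimic the roles of --workers and --max-reqs-per-sec.

```python
import asyncio
import time

class RateLimiter:
    """Spaces out request starts so at most max_per_sec begin each second."""

    def __init__(self, max_per_sec):
        self.interval = 1.0 / max_per_sec
        self.next_slot = 0.0
        self.lock = asyncio.Lock()

    async def wait(self):
        async with self.lock:
            now = time.monotonic()
            slot = max(now, self.next_slot)        # earliest permitted start time
            self.next_slot = slot + self.interval  # reserve the following slot
        await asyncio.sleep(slot - now)

async def crawl_all(urls, fetch, workers, max_per_sec):
    limiter = RateLimiter(max_per_sec)
    slots = asyncio.Semaphore(workers)   # caps the number of in-flight requests

    async def one(url):
        async with slots:
            await limiter.wait()
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(one(u) for u in urls))

async def fake_fetch(url):
    await asyncio.sleep(0.01)            # simulated network latency
    return f"body of {url}"

urls = [f"https://example.com/page{i}" for i in range(5)]
start = time.monotonic()
results = asyncio.run(crawl_all(urls, fake_fetch, workers=3, max_per_sec=50))
elapsed = time.monotonic() - start
print(len(results), f"{elapsed:.2f}s")
```

The semaphore bounds how many requests are open at once, while the limiter bounds how quickly new ones start, so raising the worker count alone does not increase load on the target server beyond the configured rate.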

Future enhancements could include:

  • Enhanced detection and handling of AJAX and SPA content beyond the current browser rendering mode
  • Improved handling of progressive web apps and service workers
  • Support for authentication workflows to crawl protected content
  • More intelligent content parsing for complex web applications