Crawler Behavior

SiteOne Crawler implements a robust and efficient web crawling mechanism that systematically explores websites while respecting various limitations and configuration options. Understanding this behavior is crucial for effectively using the tool and customizing it to your specific needs.

The crawler follows a methodical approach, sketched in code after this list:

  1. Initialization: Processes the starting URL and configuration options
  2. URL Queue Management: Maintains a queue of URLs to be crawled
  3. Content Fetching: Retrieves content from each URL in the queue
  4. Content Processing: Parses different content types (HTML, CSS, JS, etc.)
  5. URL Discovery: Extracts new URLs from processed content
  6. URL Filtering: Applies various rules to decide which URLs to follow
  7. Recursive Crawling: Repeats the process for each accepted URL
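
To make the loop concrete, here is a minimal breadth-first sketch in Python. It is purely illustrative: the actual crawler is a PHP/Swoole application, and the single-pattern link extraction, fixed depth limit, and same-host rule below are simplifying assumptions.

    import re
    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def crawl(start_url, max_depth=2):
        """Illustrative breadth-first crawl loop, not the real implementation."""
        start_host = urlparse(start_url).netloc
        queue = deque([(start_url, 0)])                # steps 1-2: initialization + URL queue
        visited = {start_url}
        while queue:
            url, depth = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # step 3: fetch
            except OSError:
                continue                               # errors are noted, the crawl goes on
            for href in re.findall(r'href="([^"]+)"', html):   # steps 4-5: parse + discover
                absolute = urljoin(url, href)          # normalize to an absolute URL
                if (absolute in visited
                        or urlparse(absolute).netloc != start_host
                        or depth >= max_depth):
                    continue                           # step 6: filtering
                visited.add(absolute)
                queue.append((absolute, depth + 1))    # step 7: recursive crawling
        return visited

In the real tool the same cycle runs across many concurrent workers and handles far more content types and URL sources than this single-pattern example.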

The crawler discovers URLs from various sources:

  • HTML Links: <a href> elements for navigation
  • HTML Resources: <img>, <script>, <link>, etc. for assets
  • CSS Resources: url() references to images, fonts, etc.
  • JavaScript Resources: URL strings in JS code
  • Redirects: HTTP redirects to new locations

URLs are processed to ensure they’re absolute and properly formatted before being added to the queue.
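
As a rough illustration of that discovery step, the sketch below pulls href/src attributes out of HTML and url() references out of CSS, then resolves them against the page URL with urljoin. The regular expressions are deliberately simplified assumptions, not the crawler's actual parsers.

    import re
    from urllib.parse import urljoin

    HTML_REFS = re.compile(r'(?:href|src)="([^"]+)"', re.IGNORECASE)   # <a>, <img>, <script>, <link>, ...
    CSS_REFS = re.compile(r'url\(\s*["\']?([^"\')]+?)["\']?\s*\)', re.IGNORECASE)  # images, fonts, ...

    def discover_urls(base_url, content, content_type):
        """Return the absolute URLs referenced by an HTML or CSS document."""
        pattern = CSS_REFS if "css" in content_type else HTML_REFS
        refs = pattern.findall(content)
        return {urljoin(base_url, ref) for ref in refs if not ref.startswith("data:")}

    print(discover_urls("https://example.com/blog/post/",
                        '<a href="../about">About</a> <img src="/img/logo.png">',
                        "text/html"))
    # resolves to https://example.com/blog/about and https://example.com/img/logo.png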

Not all discovered URLs are followed; the crawler applies several filters before a URL is added to the queue.

By default, the crawler only follows URLs within the initial domain; this scope can be broadened with the crawler's domain-related options.
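
A minimal sketch of that default same-domain rule, assuming a simple hostname comparison (the real filter also honours the domain-related options and other rules not shown here):

    from urllib.parse import urlparse

    def should_follow(url, start_url, extra_allowed_hosts=frozenset()):
        """Follow a URL only if it is on the initial domain or an explicitly allowed host."""
        host = urlparse(url).netloc.lower()
        return host == urlparse(start_url).netloc.lower() or host in extra_allowed_hosts

    print(should_follow("https://example.com/page", "https://example.com/"))        # True
    print(should_follow("https://cdn.example.org/app.js", "https://example.com/"))  # False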

The crawler intelligently handles different content types:

  • HTML Pages: Fully processed and crawled for links
  • Static Assets: Downloaded but not further crawled (CSS, JS, images, etc.)
  • Files: Downloaded based on configuration (PDFs, documents, etc.)

You can selectively disable certain content types with options like --disable-javascript, --disable-images, etc.
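
One way to picture this routing is a small dispatcher keyed on the Content-Type header. The mapping below is a hedged approximation, and the disabled set only mimics the spirit of options such as --disable-images:

    def classify(content_type, disabled=frozenset()):
        """Coarse handling decision for a fetched response (illustrative only)."""
        ct = content_type.split(";")[0].strip().lower()
        if ct == "text/html":
            return "crawl"                              # fully processed and followed for links
        if "javascript" in ct:
            return "skip" if "javascript" in disabled else "download"
        if ct.startswith("image/"):
            return "skip" if "images" in disabled else "download"
        return "download"                               # CSS, PDFs, documents, ...

    print(classify("image/png", disabled={"images"}))   # skip
    print(classify("text/html; charset=utf-8"))         # crawl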

Crawl depth and scope are controlled by the following options (a conceptual check is sketched after the list):

  • --max-depth: Controls how deep the crawler will follow links from the initial URL
  • --single-page: Limits crawling to just the starting URL and its assets
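
Conceptually, both options reduce to a depth check like the one below; the exact semantics (for example, how assets of the start page are counted) are assumptions made for illustration.

    def within_scope(depth, is_asset, max_depth=None, single_page=False):
        """Decide whether a URL at a given link depth should still be crawled."""
        if single_page:
            # only the starting URL itself plus the assets it references
            return depth == 0 or (depth == 1 and is_asset)
        return max_depth is None or depth <= max_depth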

The crawler respects the robots.txt protocol by default (an illustration follows the list):

  • Automatically fetches and parses robots.txt for each domain
  • Respects Disallow directives
  • Caches robots.txt content to minimize requests
  • Can be overridden with --ignore-robots-txt for private site analysis
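
The same behaviour can be approximated with Python's standard urllib.robotparser; the cache dictionary and the user-agent string below are illustrative placeholders, not the crawler's internals.

    from urllib.parse import urlparse, urlunparse
    from urllib.robotparser import RobotFileParser

    _robots_cache = {}   # one parsed robots.txt per scheme+host, to minimize requests

    def allowed_by_robots(url, user_agent="siteone-crawler", ignore_robots=False):
        """Check robots.txt Disallow rules for a URL, caching the file per domain."""
        if ignore_robots:                                  # mirrors the --ignore-robots-txt override
            return True
        parts = urlparse(url)
        key = (parts.scheme, parts.netloc)
        if key not in _robots_cache:
            robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
            parser = RobotFileParser(robots_url)
            parser.read()                                  # fetch and parse once per domain
            _robots_cache[key] = parser
        return _robots_cache[key].can_fetch(user_agent, url)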
Redirects and error responses are handled as follows (an illustrative error counter appears after the list):

  • Redirects (3xx) are followed automatically
  • Error pages (4xx, 5xx) are noted but don’t terminate the crawl
  • --max-non200-responses-per-basename: Prevents infinite crawling of error pages
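
The idea behind the last option can be sketched as a per-basename error counter. The limit value and the reading of "basename" as the last path segment are assumptions made for this example, not the tool's actual defaults.

    from collections import Counter
    from urllib.parse import urlparse

    _non200_by_basename = Counter()

    def keep_queueing(url, status, limit=5):
        """Count non-2xx responses per path basename; once the limit is exceeded,
        similar URLs are no longer queued (illustrative limit only)."""
        if 200 <= status < 300:
            return True
        basename = urlparse(url).path.rsplit("/", 1)[-1] or "/"
        _non200_by_basename[basename] += 1
        return _non200_by_basename[basename] <= limit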

The crawler includes specialized processors for modern frameworks:

  • Next.js: Handles Next.js-specific URL patterns and content
  • Svelte: Processes Svelte components and routing
  • Astro: Manages Astro’s hybrid static/dynamic content

The crawler balances thoroughness with efficiency (a conceptual concurrency sketch follows the list):

  • Concurrency: Uses multiple workers to parallelize requests
  • --workers: Controls the number of concurrent connections
  • --max-reqs-per-sec: Limits request rate to prevent overloading servers
  • Memory Management: Uses Swoole tables for efficient data structures
  • --memory-limit: Sets maximum memory usage
  • --result-storage: Controls where content is stored (memory or disk)
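
As a rough analogue (the real crawler relies on PHP and Swoole coroutines rather than Python), the asyncio sketch below shows how a fixed worker count and a requests-per-second cap can work together; all names and numbers here are illustrative.

    import asyncio, time

    async def worker(name, queue, min_interval, state):
        """One of N concurrent workers; N plays the role of --workers."""
        while True:
            url = await queue.get()
            # Rate limiting in the spirit of --max-reqs-per-sec:
            # reserve the next available start slot, then wait for it.
            now = time.monotonic()
            wait = state["next_slot"] - now
            state["next_slot"] = max(now, state["next_slot"]) + min_interval
            if wait > 0:
                await asyncio.sleep(wait)
            print(f"{name} fetching {url}")        # a real worker would issue the HTTP request here
            queue.task_done()

    async def main(urls, workers=3, max_reqs_per_sec=10.0):
        queue = asyncio.Queue()
        for u in urls:
            queue.put_nowait(u)
        state = {"next_slot": time.monotonic()}
        tasks = [asyncio.create_task(worker(f"w{i}", queue, 1 / max_reqs_per_sec, state))
                 for i in range(workers)]
        await queue.join()                         # wait until every queued URL is processed
        for t in tasks:
            t.cancel()

    asyncio.run(main([f"https://example.com/page/{i}" for i in range(10)]))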

Future enhancements could include:

  • Advanced JavaScript rendering capabilities using headless browsers
  • Enhanced detection and handling of AJAX and SPA content
  • Improved handling of progressive web apps and service workers
  • Support for authentication workflows to crawl protected content
  • More intelligent content parsing for complex web applications