Crawler Behavior
SiteOne Crawler implements a robust and efficient web crawling mechanism that systematically explores websites while respecting various limitations and configuration options. Understanding this behavior is crucial for effectively using the tool and customizing it to your specific needs.
Basic Crawling Process
The crawler follows a methodical approach (a simplified sketch of the loop appears after this list):
- Initialization: Processes the starting URL and configuration options
- URL Queue Management: Maintains a queue of URLs to be crawled
- Content Fetching: Retrieves content from each URL in the queue
- Content Processing: Parses different content types (HTML, CSS, JS, etc.)
- URL Discovery: Extracts new URLs from processed content
- URL Filtering: Applies various rules to decide which URLs to follow
- Recursive Crawling: Repeats the process for each accepted URL
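To make the flow concrete, here is a minimal, self-contained Python sketch of such a queue-driven crawl. It is only an illustration of the steps above, not SiteOne Crawler's actual implementation (the tool itself is a PHP application using asynchronous workers); the simple `href` regex and the `max_pages` cap are simplifications for brevity.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def crawl(start_url, max_pages=50):
    """Queue-driven crawl: fetch, parse, discover, filter, repeat."""
    start_host = urlparse(start_url).netloc
    queue, seen, results = deque([start_url]), {start_url}, {}
    while queue and len(results) < max_pages:
        url = queue.popleft()                              # URL queue management
        try:
            with urlopen(url, timeout=10) as resp:         # content fetching
                html = resp.read().decode("utf-8", "replace")
        except OSError as exc:
            results[url] = f"error: {exc}"
            continue
        results[url] = resp.status
        for href in re.findall(r'href="([^"#]+)"', html):  # URL discovery
            absolute = urljoin(url, href)                   # normalize to an absolute URL
            if urlparse(absolute).netloc == start_host and absolute not in seen:
                seen.add(absolute)                          # URL filtering (same domain, not yet seen)
                queue.append(absolute)                      # recursive crawling via the queue
    return results
```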
URL Handling and Discovery
The crawler discovers URLs from various sources:
- HTML Links: `<a href>` elements for navigation
- HTML Resources: `<img>`, `<script>`, `<link>`, etc. for assets
- CSS Resources: `url()` references to images, fonts, etc.
- JavaScript Resources: URL strings in JS code
- Redirects: HTTP redirects to new locations
URLs are processed to ensure they’re absolute and properly formatted before being added to the queue.
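As a rough illustration of this discovery step (not the tool's actual parser, which covers many more cases, including URL strings inside JavaScript), the sketch below pulls candidate URLs out of HTML `href`/`src` attributes and CSS `url()` references and resolves them to absolute form:

```python
import re
from urllib.parse import urljoin

# Simplified patterns for the URL sources described above.
HTML_ATTRS = re.compile(r'(?:href|src)\s*=\s*["\']([^"\'#]+)["\']', re.I)
CSS_URLS = re.compile(r'url\(\s*["\']?([^"\')]+)["\']?\s*\)', re.I)

def discover_urls(base_url, content, content_type):
    """Return absolute URLs referenced by an HTML or CSS document."""
    if "text/html" in content_type:
        candidates = HTML_ATTRS.findall(content)   # <a>, <img>, <script>, <link>, ...
    elif "text/css" in content_type:
        candidates = CSS_URLS.findall(content)     # background images, fonts, ...
    else:
        candidates = []
    # Resolve relative references against the page's own URL.
    return {urljoin(base_url, c) for c in candidates if not c.startswith("data:")}

print(discover_urls("https://example.com/blog/",
                    '<a href="../about">About</a> <img src="/logo.png">',
                    "text/html"))
# {'https://example.com/about', 'https://example.com/logo.png'}  (a set; order may vary)
```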
URL Filtering
Not all discovered URLs are followed. The crawler applies several filters:
Domain Restrictions
By default, the crawler only follows URLs within the initial domain. You can modify this behavior with these options (illustrated in the sketch after the list):
- `--allowed-domain-for-crawling`: Permits crawling content from additional domains
- `--allowed-domain-for-external-files`: Enables loading assets (but not crawling) from other domains
- `--single-foreign-page`: When crawling other domains, only visit the directly linked page
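The following sketch shows how such a domain policy can be expressed. The option names in the comments are SiteOne Crawler's; the domains, set names, and functions are hypothetical placeholders:

```python
from urllib.parse import urlparse

INITIAL_DOMAIN = "example.com"                          # domain of the starting URL
CRAWL_DOMAINS = {INITIAL_DOMAIN, "docs.example.com"}    # --allowed-domain-for-crawling
ASSET_DOMAINS = CRAWL_DOMAINS | {"cdn.example.net"}     # --allowed-domain-for-external-files

def may_crawl(url):
    """Follow links only on the initial domain plus explicitly allowed ones."""
    return urlparse(url).hostname in CRAWL_DOMAINS

def may_download_asset(url):
    """Assets may additionally come from domains allowed for external files."""
    return urlparse(url).hostname in ASSET_DOMAINS

print(may_crawl("https://cdn.example.net/app.js"))           # False: not crawled
print(may_download_asset("https://cdn.example.net/app.js"))  # True: downloaded as an asset
```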
Content Type Filtering
The crawler intelligently handles different content types:
- HTML Pages: Fully processed and crawled for links
- Static Assets: Downloaded but not further crawled (CSS, JS, images, etc.)
- Files: Downloaded based on configuration (PDFs, documents, etc.)
You can selectively disable certain content types with options like `--disable-javascript`, `--disable-images`, etc.
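Conceptually, the per-response decision resembles the sketch below. This is an illustration only: the real crawler recognizes many more MIME types, and the `DISABLED` set here merely stands in for whichever `--disable-*` options were passed.

```python
CRAWLABLE = {"text/html", "application/xhtml+xml"}
DISABLED = {"image"}   # e.g. derived from options such as --disable-images

def classify(content_type):
    """Decide whether a response is crawled, downloaded, or skipped."""
    mime = content_type.split(";")[0].strip().lower()
    if mime in CRAWLABLE:
        return "crawl"                    # parse and follow links
    if mime.split("/")[0] in DISABLED:
        return "skip"                     # content type disabled by options
    return "download"                     # asset or file: fetch but don't crawl further

print(classify("text/html; charset=utf-8"))   # crawl
print(classify("image/png"))                  # skip
print(classify("application/pdf"))            # download
```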
Depth Limitation
- `--max-depth`: Controls how deep the crawler will follow links from the initial URL
- `--single-page`: Limits crawling to just the starting URL and its assets
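One common way to implement such a limit is to carry the link depth alongside each queued URL, as in this illustrative sketch (the toy link graph and the `MAX_DEPTH` value are made up; `--max-depth` is the real option name):

```python
from collections import deque

MAX_DEPTH = 2   # analogous to --max-depth=2

def crawl_with_depth(start_url, get_links):
    """Breadth-first crawl that stops following links beyond MAX_DEPTH."""
    queue = deque([(start_url, 0)])       # (url, distance from the starting URL)
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        if depth >= MAX_DEPTH:
            continue                      # URL is recorded, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

site = {  # toy link graph standing in for fetched pages
    "/": ["/a", "/b"],
    "/a": ["/a/deep"],
    "/a/deep": ["/a/deeper"],
}
print(sorted(crawl_with_depth("/", lambda u: site.get(u, []))))
# ['/', '/a', '/a/deep', '/b']  ('/a/deeper' lies beyond depth 2 and is never discovered)
```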
Pattern Matching
- `--include-regex`: Only URLs matching the pattern will be crawled
- `--ignore-regex`: URLs matching the pattern will be skipped
- `--regex-filtering-only-for-pages`: Applies regex filtering only to HTML pages, not assets
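A minimal sketch of how these rules can combine (the patterns are examples, and the exact precedence and option semantics in the tool may differ slightly):

```python
import re

INCLUDE = [r"/blog/", r"/docs/"]      # as passed via --include-regex
IGNORE = [r"\.pdf$", r"/private/"]    # as passed via --ignore-regex
PAGES_ONLY = True                     # mirrors --regex-filtering-only-for-pages

def passes_filters(url, is_html_page):
    """Apply include/ignore regexes; optionally exempt non-page assets."""
    if PAGES_ONLY and not is_html_page:
        return True                                        # assets bypass the regex rules
    if IGNORE and any(re.search(p, url) for p in IGNORE):
        return False                                       # explicit ignore wins
    if INCLUDE and not any(re.search(p, url) for p in INCLUDE):
        return False                                       # must match at least one include
    return True

print(passes_filters("https://example.com/blog/post-1", True))       # True
print(passes_filters("https://example.com/private/report", True))    # False
print(passes_filters("https://example.com/theme/style.css", False))  # True (asset)
```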
Robots.txt Compliance
The crawler respects the `robots.txt` protocol by default:
- Automatically fetches and parses `robots.txt` for each domain
- Respects `Disallow` directives
- Caches `robots.txt` content to minimize requests
- Can be overridden with `--ignore-robots-txt` for private site analysis
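Python's standard library exposes a comparable fetch/parse/cache/check pattern, which the sketch below uses purely as an analogy; SiteOne Crawler has its own robots.txt handling in PHP, and the `user_agent` string here is an arbitrary placeholder:

```python
from urllib import robotparser
from urllib.parse import urlparse

IGNORE_ROBOTS_TXT = False   # mirrors the intent of --ignore-robots-txt
_parsers = {}               # per-host cache: robots.txt is fetched at most once per host

def allowed_by_robots(url, user_agent="siteone-crawler"):
    """Check a URL against the cached robots.txt rules of its host."""
    if IGNORE_ROBOTS_TXT:
        return True
    host = urlparse(url).scheme + "://" + urlparse(url).netloc
    if host not in _parsers:
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()                       # fetch and parse Allow/Disallow directives
        _parsers[host] = rp
    return _parsers[host].can_fetch(user_agent, url)
```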
Handling Special Cases
Query Parameters
- `--remove-query-params`: Strips query strings from URLs before processing
- `--add-random-query-params`: Adds random parameters to prevent caching issues
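For illustration, this is roughly what the two transformations do to a URL (a sketch built on Python's `urllib.parse`; the `_rand` parameter name is made up, not the one the tool uses):

```python
import random
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_query_params(url):
    """Drop the query string entirely, in the spirit of --remove-query-params."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

def add_random_query_param(url):
    """Append a throwaway parameter so caches see a unique URL, like --add-random-query-params."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [("_rand", str(random.randint(0, 10**9)))]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))

print(remove_query_params("https://example.com/list?page=2&sort=asc"))
# https://example.com/list
print(add_random_query_param("https://example.com/list"))
# e.g. https://example.com/list?_rand=482019374
```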
Redirects and Non-200 Responses
- Redirects (3xx) are followed automatically
- Error pages (4xx, 5xx) are noted but don’t terminate the crawl
- `--max-non200-responses-per-basename`: Prevents infinite crawling of error pages
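The basename cap can be pictured as a counter keyed by the last path segment, as in this sketch (the threshold, helper names, and exact keying are illustrative assumptions, not the tool's internals):

```python
from collections import Counter
from urllib.parse import urlparse

MAX_NON200_PER_BASENAME = 5     # mirrors the intent of --max-non200-responses-per-basename
_non200_counts = Counter()

def _basename(url):
    """Last path segment of a URL, e.g. 'page.html' for /missing/page.html."""
    return urlparse(url).path.rstrip("/").split("/")[-1] or "/"

def record_response(url, status_code):
    """Count non-200 responses so repeatedly failing basenames can be cut off."""
    if status_code != 200:
        _non200_counts[_basename(url)] += 1

def should_fetch(url):
    """Skip URLs whose basename has already produced too many non-200 responses."""
    return _non200_counts[_basename(url)] < MAX_NON200_PER_BASENAME

for _ in range(6):
    record_response("https://example.com/missing/page.html", 404)
print(should_fetch("https://example.com/other/page.html"))   # False: 'page.html' is exhausted
```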
JavaScript Frameworks
The crawler includes specialized processors for modern frameworks:
- Next.js: Handles Next.js-specific URL patterns and content
- Svelte: Processes Svelte components and routing
- Astro: Manages Astro’s hybrid static/dynamic content
Performance Considerations
The crawler balances thoroughness with efficiency:
- Concurrency: Uses multiple workers to parallelize requests
  - `--workers`: Controls the number of concurrent connections
  - `--max-reqs-per-sec`: Limits the request rate to prevent overloading servers
- Memory Management: Uses Swoole tables for efficient data structures
  - `--memory-limit`: Sets maximum memory usage
  - `--result-storage`: Controls where content is stored (memory or disk)
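The interplay between a worker pool and a shared request-rate budget can be sketched with Python's `asyncio`. SiteOne Crawler itself achieves this with Swoole coroutines in PHP, so this is only an analogy, and `fetch` here is a stand-in rather than a real HTTP client:

```python
import asyncio
import time

WORKERS = 3              # mirrors --workers
MAX_REQS_PER_SEC = 5     # mirrors --max-reqs-per-sec

async def fetch(url):
    """Stand-in for an HTTP request; a real crawler would call an HTTP client here."""
    await asyncio.sleep(0.05)
    print(f"{time.monotonic():.2f} fetched {url}")

async def crawl(urls):
    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)

    interval = 1.0 / MAX_REQS_PER_SEC
    lock = asyncio.Lock()
    state = {"next_slot": time.monotonic()}

    async def acquire_slot():
        # All workers share one budget: slots are handed out at most every `interval` seconds.
        async with lock:
            slot = max(time.monotonic(), state["next_slot"])
            state["next_slot"] = slot + interval
        await asyncio.sleep(max(0.0, slot - time.monotonic()))

    async def worker():
        while not queue.empty():
            url = queue.get_nowait()
            await acquire_slot()        # request-rate limit (--max-reqs-per-sec)
            await fetch(url)            # concurrent fetching (--workers)

    await asyncio.gather(*(worker() for _ in range(WORKERS)))

asyncio.run(crawl([f"https://example.com/page-{i}" for i in range(10)]))
```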
💡Further Development Ideas
Future enhancements could include:
- Advanced JavaScript rendering capabilities using headless browsers
- Enhanced detection and handling of AJAX and SPA content
- Improved handling of progressive web apps and service workers
- Support for authentication workflows to crawl protected content
- More intelligent content parsing for complex web applications