Crawler Behavior
SiteOne Crawler implements a robust and efficient web crawling mechanism that systematically explores websites while respecting various limitations and configuration options. Understanding this behavior is crucial for effectively using the tool and customizing it to your specific needs.
Basic Crawling Process
The crawler follows a methodical approach (a simplified sketch of the loop appears after this list):
- Initialization: Processes the starting URL and configuration options
- URL Queue Management: Maintains a queue of URLs to be crawled
- Content Fetching: Retrieves content from each URL in the queue
- Content Processing: Parses different content types (HTML, CSS, JS, etc.)
- URL Discovery: Extracts new URLs from processed content
- URL Filtering: Applies various rules to decide which URLs to follow
- Recursive Crawling: Repeats the process for each accepted URL
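To make the flow concrete, here is a minimal, self-contained Python sketch of such a queue-driven crawl. It is only an illustration of the steps above, not SiteOne Crawler's actual implementation (the tool itself is a PHP application using asynchronous workers); the simple `href` regex and the `max_pages` cap are simplifications for brevity.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def crawl(start_url, max_pages=50):
    """Queue-driven crawl: fetch, parse, discover, filter, repeat."""
    start_host = urlparse(start_url).netloc
    queue, seen, results = deque([start_url]), {start_url}, {}
    while queue and len(results) < max_pages:
        url = queue.popleft()                              # URL queue management
        try:
            with urlopen(url, timeout=10) as resp:         # content fetching
                html = resp.read().decode("utf-8", "replace")
        except OSError as exc:
            results[url] = f"error: {exc}"
            continue
        results[url] = resp.status
        for href in re.findall(r'href="([^"#]+)"', html):  # URL discovery
            absolute = urljoin(url, href)                   # normalize to an absolute URL
            if urlparse(absolute).netloc == start_host and absolute not in seen:
                seen.add(absolute)                          # URL filtering (same domain, not yet seen)
                queue.append(absolute)                      # recursive crawling via the queue
    return results
```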
URL Handling and Discovery
The crawler discovers URLs from various sources:
- HTML Links: `<a href>` elements for navigation
- HTML Resources: `<img>`, `<script>`, `<link>`, etc. for assets
- CSS Resources: `url()` references to images, fonts, etc.
- JavaScript Resources: URL strings in JS code
- Redirects: HTTP redirects to new locations
URLs are processed to ensure they’re absolute and properly formatted before being added to the queue.
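As a rough illustration of this discovery step (not the tool's actual parser, which covers many more cases, including URL strings inside JavaScript), the sketch below pulls candidate URLs out of HTML `href`/`src` attributes and CSS `url()` references and resolves them to absolute form:

```python
import re
from urllib.parse import urljoin

# Simplified patterns for the URL sources described above.
HTML_ATTRS = re.compile(r'(?:href|src)\s*=\s*["\']([^"\'#]+)["\']', re.I)
CSS_URLS = re.compile(r'url\(\s*["\']?([^"\')]+)["\']?\s*\)', re.I)

def discover_urls(base_url, content, content_type):
    """Return absolute URLs referenced by an HTML or CSS document."""
    if "text/html" in content_type:
        candidates = HTML_ATTRS.findall(content)   # <a>, <img>, <script>, <link>, ...
    elif "text/css" in content_type:
        candidates = CSS_URLS.findall(content)     # background images, fonts, ...
    else:
        candidates = []
    # Resolve relative references against the page's own URL.
    return {urljoin(base_url, c) for c in candidates if not c.startswith("data:")}

print(discover_urls("https://example.com/blog/",
                    '<a href="../about">About</a> <img src="/logo.png">',
                    "text/html"))
# {'https://example.com/about', 'https://example.com/logo.png'}  (a set; order may vary)
```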
URL Filtering
Not all discovered URLs are followed. The crawler applies several filters:
Domain Restrictions
By default, the crawler only follows URLs within the initial domain. You can modify this behavior with these options (illustrated in the sketch after the list):
- `--allowed-domain-for-crawling`: Permits crawling content from additional domains
- `--allowed-domain-for-external-files`: Enables loading assets (but not crawling) from other domains
- `--single-foreign-page`: When crawling other domains, only visit the directly linked page
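The following sketch shows how such a domain policy can be expressed. The option names in the comments are SiteOne Crawler's; the domains, set names, and functions are hypothetical placeholders:

```python
from urllib.parse import urlparse

INITIAL_DOMAIN = "example.com"                          # domain of the starting URL
CRAWL_DOMAINS = {INITIAL_DOMAIN, "docs.example.com"}    # --allowed-domain-for-crawling
ASSET_DOMAINS = CRAWL_DOMAINS | {"cdn.example.net"}     # --allowed-domain-for-external-files

def may_crawl(url):
    """Follow links only on the initial domain plus explicitly allowed ones."""
    return urlparse(url).hostname in CRAWL_DOMAINS

def may_download_asset(url):
    """Assets may additionally come from domains allowed for external files."""
    return urlparse(url).hostname in ASSET_DOMAINS

print(may_crawl("https://cdn.example.net/app.js"))           # False: not crawled
print(may_download_asset("https://cdn.example.net/app.js"))  # True: downloaded as an asset
```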
Content Type Filtering
The crawler intelligently handles different content types:
- HTML Pages: Fully processed and crawled for links
- Static Assets: Downloaded but not further crawled (CSS, JS, images, etc.)
- Files: Downloaded based on configuration (PDFs, documents, etc.)
You can selectively disable certain content types with options like `--disable-javascript`, `--disable-images`, etc.
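Conceptually, the per-response decision resembles the sketch below. This is an illustration only: the real crawler recognizes many more MIME types, and the `DISABLED` set here merely stands in for whichever `--disable-*` options were passed.

```python
CRAWLABLE = {"text/html", "application/xhtml+xml"}
DISABLED = {"image"}   # e.g. derived from options such as --disable-images

def classify(content_type):
    """Decide whether a response is crawled, downloaded, or skipped."""
    mime = content_type.split(";")[0].strip().lower()
    if mime in CRAWLABLE:
        return "crawl"                    # parse and follow links
    if mime.split("/")[0] in DISABLED:
        return "skip"                     # content type disabled by options
    return "download"                     # asset or file: fetch but don't crawl further

print(classify("text/html; charset=utf-8"))   # crawl
print(classify("image/png"))                  # skip
print(classify("application/pdf"))            # download
```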
Depth Limitation
- `--max-depth`: Controls how deep the crawler will follow links from the initial URL
- `--single-page`: Limits crawling to just the starting URL and its assets
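One common way to implement such a limit is to carry the link depth alongside each queued URL, as in this illustrative sketch (the toy link graph and the `MAX_DEPTH` value are made up; `--max-depth` is the real option name):

```python
from collections import deque

MAX_DEPTH = 2   # analogous to --max-depth=2

def crawl_with_depth(start_url, get_links):
    """Breadth-first crawl that stops following links beyond MAX_DEPTH."""
    queue = deque([(start_url, 0)])       # (url, distance from the starting URL)
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        if depth >= MAX_DEPTH:
            continue                      # URL is recorded, but its links are not followed
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

site = {  # toy link graph standing in for fetched pages
    "/": ["/a", "/b"],
    "/a": ["/a/deep"],
    "/a/deep": ["/a/deeper"],
}
print(sorted(crawl_with_depth("/", lambda u: site.get(u, []))))
# ['/', '/a', '/a/deep', '/b']  ('/a/deeper' lies beyond depth 2 and is never discovered)
```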
Pattern Matching
- `--include-regex`: Only URLs matching the pattern will be crawled
- `--ignore-regex`: URLs matching the pattern will be skipped
- `--regex-filtering-only-for-pages`: Applies regex filtering only to HTML pages, not assets
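A minimal sketch of how these rules can combine (the patterns are examples, and the exact precedence and option semantics in the tool may differ slightly):

```python
import re

INCLUDE = [r"/blog/", r"/docs/"]      # as passed via --include-regex
IGNORE = [r"\.pdf$", r"/private/"]    # as passed via --ignore-regex
PAGES_ONLY = True                     # mirrors --regex-filtering-only-for-pages

def passes_filters(url, is_html_page):
    """Apply include/ignore regexes; optionally exempt non-page assets."""
    if PAGES_ONLY and not is_html_page:
        return True                                        # assets bypass the regex rules
    if IGNORE and any(re.search(p, url) for p in IGNORE):
        return False                                       # explicit ignore wins
    if INCLUDE and not any(re.search(p, url) for p in INCLUDE):
        return False                                       # must match at least one include
    return True

print(passes_filters("https://example.com/blog/post-1", True))       # True
print(passes_filters("https://example.com/private/report", True))    # False
print(passes_filters("https://example.com/theme/style.css", False))  # True (asset)
```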
Robots.txt Compliance
The crawler respects the `robots.txt` protocol by default:
- Automatically fetches and parses `robots.txt` for each domain
- Respects `Disallow` directives
- Caches `robots.txt` content to minimize requests
- Can be overridden with `--ignore-robots-txt` for private site analysis
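Python's standard library exposes a comparable fetch/parse/cache/check pattern, which the sketch below uses purely as an analogy; SiteOne Crawler has its own robots.txt handling in PHP, and the `user_agent` string here is an arbitrary placeholder:

```python
from urllib import robotparser
from urllib.parse import urlparse

IGNORE_ROBOTS_TXT = False   # mirrors the intent of --ignore-robots-txt
_parsers = {}               # per-host cache: robots.txt is fetched at most once per host

def allowed_by_robots(url, user_agent="siteone-crawler"):
    """Check a URL against the cached robots.txt rules of its host."""
    if IGNORE_ROBOTS_TXT:
        return True
    host = urlparse(url).scheme + "://" + urlparse(url).netloc
    if host not in _parsers:
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()                       # fetch and parse Allow/Disallow directives
        _parsers[host] = rp
    return _parsers[host].can_fetch(user_agent, url)
```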
Handling Special Cases
Query Parameters
- `--remove-query-params`: Strips query strings from URLs before processing
- `--add-random-query-params`: Adds random parameters to prevent caching issues
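For illustration, this is roughly what the two transformations do to a URL (a sketch built on Python's `urllib.parse`; the `_rand` parameter name is made up, not the one the tool uses):

```python
import random
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_query_params(url):
    """Drop the query string entirely, in the spirit of --remove-query-params."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))

def add_random_query_param(url):
    """Append a throwaway parameter so caches see a unique URL, like --add-random-query-params."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query) + [("_rand", str(random.randint(0, 10**9)))]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))

print(remove_query_params("https://example.com/list?page=2&sort=asc"))
# https://example.com/list
print(add_random_query_param("https://example.com/list"))
# e.g. https://example.com/list?_rand=482019374
```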
Redirects and Non-200 Responses
- Redirects (3xx) are followed automatically
- Error pages (4xx, 5xx) are noted but don’t terminate the crawl
- `--max-non200-responses-per-basename`: Prevents infinite crawling of error pages
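The basename cap can be pictured as a counter keyed by the last path segment, as in this sketch (the threshold, helper names, and exact keying are illustrative assumptions, not the tool's internals):

```python
from collections import Counter
from urllib.parse import urlparse

MAX_NON200_PER_BASENAME = 5     # mirrors the intent of --max-non200-responses-per-basename
_non200_counts = Counter()

def _basename(url):
    """Last path segment of a URL, e.g. 'page.html' for /missing/page.html."""
    return urlparse(url).path.rstrip("/").split("/")[-1] or "/"

def record_response(url, status_code):
    """Count non-200 responses so repeatedly failing basenames can be cut off."""
    if status_code != 200:
        _non200_counts[_basename(url)] += 1

def should_fetch(url):
    """Skip URLs whose basename has already produced too many non-200 responses."""
    return _non200_counts[_basename(url)] < MAX_NON200_PER_BASENAME

for _ in range(6):
    record_response("https://example.com/missing/page.html", 404)
print(should_fetch("https://example.com/other/page.html"))   # False: 'page.html' is exhausted
```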
JavaScript Frameworks
The crawler includes specialized processors for modern frameworks:
- Next.js: Handles Next.js-specific URL patterns and content
- Svelte: Processes Svelte components and routing
- Astro: Manages Astro’s hybrid static/dynamic content
Performance Considerations
The crawler balances thoroughness with efficiency:
- Concurrency: Uses multiple workers to parallelize requests
  - `--workers`: Controls the number of concurrent connections
  - `--max-reqs-per-sec`: Limits the request rate to prevent overloading servers
- Memory Management: Uses Swoole tables for efficient data structures
  - `--memory-limit`: Sets maximum memory usage
  - `--result-storage`: Controls where content is stored (memory or disk)
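The interplay between a worker pool and a shared request-rate budget can be sketched with Python's `asyncio`. SiteOne Crawler itself achieves this with Swoole coroutines in PHP, so this is only an analogy, and `fetch` here is a stand-in rather than a real HTTP client:

```python
import asyncio
import time

WORKERS = 3              # mirrors --workers
MAX_REQS_PER_SEC = 5     # mirrors --max-reqs-per-sec

async def fetch(url):
    """Stand-in for an HTTP request; a real crawler would call an HTTP client here."""
    await asyncio.sleep(0.05)
    print(f"{time.monotonic():.2f} fetched {url}")

async def crawl(urls):
    queue = asyncio.Queue()
    for u in urls:
        queue.put_nowait(u)

    interval = 1.0 / MAX_REQS_PER_SEC
    lock = asyncio.Lock()
    state = {"next_slot": time.monotonic()}

    async def acquire_slot():
        # All workers share one budget: slots are handed out at most every `interval` seconds.
        async with lock:
            slot = max(time.monotonic(), state["next_slot"])
            state["next_slot"] = slot + interval
        await asyncio.sleep(max(0.0, slot - time.monotonic()))

    async def worker():
        while not queue.empty():
            url = queue.get_nowait()
            await acquire_slot()        # request-rate limit (--max-reqs-per-sec)
            await fetch(url)            # concurrent fetching (--workers)

    await asyncio.gather(*(worker() for _ in range(WORKERS)))

asyncio.run(crawl([f"https://example.com/page-{i}" for i in range(10)]))
```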
💡Further Development Ideas
Future enhancements could include:
- Advanced JavaScript rendering capabilities using headless browsers
- Enhanced detection and handling of AJAX and SPA content
- Improved handling of progressive web apps and service workers
- Support for authentication workflows to crawl protected content
- More intelligent content parsing for complex web applications