Troubleshooting

Even though SiteOne Crawler is designed to work smoothly across various environments and websites, you may occasionally encounter issues that require troubleshooting. This guide provides solutions to common problems and guidance on debugging more complex ones.

Problem: Crawler crashes with “out of memory” errors when crawling large websites.

Solutions:

  1. Adjust memory limit: Use the --memory-limit option to increase available memory (e.g., --memory-limit=4096M for 4GB).
  2. Switch to file storage: Use --result-storage=file instead of the default memory storage for large websites.
  3. Enable compression: Add --result-storage-compression to reduce memory footprint (at the cost of some CPU usage).
  4. Limit crawl scope: Use --max-depth, --max-visited-urls, or regex filters to reduce the number of pages crawled.
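
For example, a memory-constrained crawl of a large site might combine several of these options (the URL and values are illustrative; adjust them to your site):

Terminal window
./crawler --url=https://example.com/ --memory-limit=4096M --result-storage=file --result-storage-compression --max-visited-urls=10000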

Problem: Crawling is too slow or times out on large sites.

Solutions:

  1. Increase concurrency: Use --workers to increase the number of concurrent requests (default is 3, consider 5-10 for faster crawling).
  2. Adjust timeout: Use --timeout to set a higher timeout value for slow-responding servers.
  3. Disable unnecessary features: Use options like --disable-javascript, --disable-images, etc. if you don’t need to analyze those resources.
  4. Use the HTTP cache: Make sure you’re not disabling the cache with --http-cache-dir='off' during repeated runs.
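
For example, a faster crawl of a slow-responding site might raise concurrency and the timeout while skipping resources you don't need (the values are illustrative):

Terminal window
./crawler --url=https://example.com/ --workers=8 --timeout=30 --disable-images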

Problem: Crawler can’t access certain pages or doesn’t find all links.

Solutions:

  1. Check robots.txt constraints: The crawler respects robots.txt by default. Use --ignore-robots-txt if needed for internal analysis.
  2. Examine JavaScript links: For SPAs or JavaScript-heavy sites, many links might be generated dynamically and not discovered. Consider adding key URLs manually.
  3. Domain restrictions: If your site spans multiple domains or subdomains, use --allowed-domain-for-crawling to include them.
  4. Authentication: For protected sites, use --http-auth for Basic authentication.
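
For example, an internal audit of a protected site that also spans a subdomain might look like this (the subdomain and credentials are placeholders):

Terminal window
./crawler --url=https://example.com/ --allowed-domain-for-crawling=blog.example.com --http-auth=username:password --ignore-robots-txt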

Problem: Reports or exports are incomplete or contain errors.

Solutions:

  1. Verify export paths: Check that the output directories exist and are writable.
  2. Encoding issues: For websites with non-standard character encodings, consider using --replace-content to fix encoding problems.
  3. Large reports: For extensive sites, HTML reports might be too large for browsers. Consider using JSON output and filtering the data.

For detailed debugging information, use the --debug flag:

Terminal window
./crawler --url=https://example.com/ --debug

For targeted debugging of specific URLs, use the --debug-url-regex option:

Terminal window
./crawler --url=https://example.com/ --debug-url-regex='/about/'

To capture debug output to a file without displaying it in the console:

Terminal window
./crawler --url=https://example.com/ --debug-log-file=debug.log
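
These debug options can generally be combined, for example to capture debug output for a subset of URLs into a file:

Terminal window
./crawler --url=https://example.com/ --debug-url-regex='/about/' --debug-log-file=debug.log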

When troubleshooting complex issues:

  1. Start small: Begin with a single page (--single-page) to verify basic functionality
  2. Add complexity gradually: Incrementally add more pages and features to identify where issues occur
  3. Isolate components: Use flags like --analyzer-filter-regex to focus on specific analyzers
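
For example, step 1 might start with a single page and debug output enabled, before gradually widening the scope:

Terminal window
./crawler --url=https://example.com/ --single-page --debug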

If you’re crawling sites built with React, Vue, Angular, or other modern frameworks:

  1. Use --remove-all-anchor-listeners to bypass client-side routing that might prevent normal link discovery
  2. Consider manually discovering key URLs as SPA links might not be found through standard HTML parsing
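
A starting point for a JavaScript-heavy site might be the command below; key SPA routes the crawler cannot discover can then be crawled in separate runs against their direct URLs:

Terminal window
./crawler --url=https://example.com/ --remove-all-anchor-listeners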

For large e-commerce sites with thousands of product pages:

  1. Use --max-depth=2 to focus on category pages rather than all products
  2. Consider --result-storage=file and --result-storage-compression for memory efficiency
  3. Use regex filtering to focus on important sections: --include-regex='/product/|/category/'
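
Putting these together, a scoped crawl of a large store might look like this (the depth and regex are illustrative):

Terminal window
./crawler --url=https://example.com/ --max-depth=2 --result-storage=file --result-storage-compression --include-regex='/product/|/category/'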

For sites requiring authentication:

  1. Basic HTTP authentication: Use --http-auth=<username:password>
  2. For complex login forms, consider pre-authenticating and using cookies (this requires custom development)
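
For example, with Basic HTTP authentication (the credentials are placeholders):

Terminal window
./crawler --url=https://example.com/ --http-auth=username:password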
Common error messages, their likely causes, and solutions:

| Error Message | Likely Cause | Solution |
| --- | --- | --- |
| Maximum function nesting level... | Deep recursion, typically with complex URL structures | Increase PHP's xdebug.max_nesting_level or disable Xdebug |
| cURL error 28: Operation timed out... | Server not responding within the timeout period | Increase the timeout with --timeout |
| Allowed memory size exhausted... | Not enough memory allocated | Increase the memory limit or use file storage |
| Maximum execution time... | Script running longer than PHP allows | Increase PHP's max_execution_time or run in smaller batches |
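
If you run the crawler through your own PHP installation rather than the bundled launcher, these PHP limits can also be raised on the command line. The php crawler.php invocation and the values below are assumptions; adapt them to your setup:

Terminal window
# Assumes a self-managed PHP setup where the crawler is started via crawler.php
php -d memory_limit=4096M -d max_execution_time=0 -d xdebug.max_nesting_level=512 crawler.php --url=https://example.com/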

If you’re still experiencing issues after trying these solutions:

  1. Check the GitHub repository for recent issues or discussions
  2. Open a new issue with detailed information about your problem
  3. Include your command with all options, environment details, and relevant error messages

Additional tips:

  • When memory issues persist, consider breaking a large site into smaller segments using --include-regex
  • For cross-origin resource problems, ensure all domains are properly added with --allowed-domain-for-external-files
  • Regularly update SiteOne Crawler to benefit from bug fixes and performance improvements
  • For custom installations, ensure all PHP dependencies and extensions are properly installed