Troubleshooting
Even though SiteOne Crawler is designed to work smoothly across various environments and websites, you may occasionally encounter issues that require troubleshooting. This guide provides solutions to common problems and guidance on debugging more complex issues.
Common Issues and Solutions
Memory Issues
Problem: Crawler crashes with "out of memory" errors when crawling large websites.
Solutions:
- Adjust memory limit: Use the --memory-limit option to increase available memory (e.g., --memory-limit=4096M for 4 GB).
- Switch to file storage: Use --result-storage=file instead of the default memory storage for large websites.
- Enable compression: Add --result-storage-compression to reduce memory footprint (at the cost of some CPU usage).
- Limit crawl scope: Use --max-depth, --max-visited-urls, or regex filters to reduce the number of pages crawled.
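For example, these options can be combined in a single run; the URL and numeric values below are placeholders to adapt to your site:
./crawler --url=https://example.com/ --memory-limit=4096M --result-storage=file --result-storage-compression --max-visited-urls=10000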
Performance Issues
Problem: Crawling is too slow or times out on large sites.
Solutions:
- Increase concurrency: Use --workers to increase the number of concurrent requests (the default is 3; consider 5-10 for faster crawling).
- Adjust timeout: Use --timeout to set a higher timeout value for slow-responding servers.
- Disable unnecessary features: Use options like --disable-javascript, --disable-images, etc. if you don't need to analyze those resources.
- Use the HTTP cache: Make sure you're not disabling the cache with --http-cache-dir='off' during repeated runs.
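For example, a faster configuration might look like this (the worker count and timeout are illustrative values, not recommendations):
./crawler --url=https://example.com/ --workers=8 --timeout=10 --disable-images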
Crawling Problems
Problem: Crawler can't access certain pages or doesn't find all links.
Solutions:
- Check robots.txt constraints: The crawler respects robots.txt by default. Use --ignore-robots-txt if needed for internal analysis.
- Examine JavaScript links: For SPAs or JavaScript-heavy sites, many links might be generated dynamically and not discovered. Consider adding key URLs manually.
- Domain restrictions: If your site spans multiple domains or subdomains, use --allowed-domain-for-crawling to include them.
- Authentication: For protected sites, use --http-auth for Basic authentication.
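As a sketch, a run against a protected site that spans an additional subdomain might look like this (the domain and credentials are placeholders, and the value syntax shown for --allowed-domain-for-crawling is an assumption):
./crawler --url=https://example.com/ --ignore-robots-txt --allowed-domain-for-crawling=blog.example.com --http-auth=user:secret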
Export and Report Issues
Problem: Reports or exports are incomplete or contain errors.
Solutions:
- Verify exported paths: Check that the output directories exist and are writable.
- Encoding issues: For websites with non-standard character encodings, consider using --replace-content to fix encoding problems.
- Large reports: For extensive sites, HTML reports might be too large for browsers. Consider using JSON output and filtering the data.
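Before a run, a quick shell check like the following can confirm that the output location exists and is writable (the directory name is just an example):
mkdir -p ./report-output && test -w ./report-output && echo "output directory is writable"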
Debugging Techniques
Enable Debug Mode
For detailed debugging information, use the --debug flag:
./crawler --url=https://example.com/ --debug
For targeted debugging of specific URLs, use the --debug-url-regex option:
./crawler --url=https://example.com/ --debug-url-regex='/about/'
Logging to File
To capture debug output to a file without displaying it in the console:
./crawler --url=https://example.com/ --debug-log-file=debug.log
Progressive Testing
When troubleshooting complex issues:
- Start small: Begin with a single page (--single-page) to verify basic functionality.
- Add complexity gradually: Incrementally add more pages and features to identify where issues occur.
- Isolate components: Use flags like --analyzer-filter-regex to focus on specific analyzers (see the example sequence below).
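One possible progression, using only options mentioned above (the URL, depths, and the analyzer regex value are illustrative assumptions):
./crawler --url=https://example.com/ --single-page
./crawler --url=https://example.com/ --max-depth=1
./crawler --url=https://example.com/ --max-depth=2 --analyzer-filter-regex='Seo'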
Specific Scenarios
Handling Modern JavaScript Frameworks
If you're crawling sites built with React, Vue, Angular, or other modern frameworks:
- Use --remove-all-anchor-listeners to bypass client-side routing that might prevent normal link discovery.
- Consider manually discovering key URLs, as SPA links might not be found through standard HTML parsing.
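For instance (the URL is a placeholder):
./crawler --url=https://spa.example.com/ --remove-all-anchor-listeners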
Working with Large E-commerce Sites
For large e-commerce sites with thousands of product pages:
- Use --max-depth=2 to focus on category pages rather than all products.
- Consider --result-storage=file and --result-storage-compression for memory efficiency.
- Use regex filtering to focus on important sections: --include-regex='/product/|/category/'
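Combined, a run for such a site might look like this (the domain, depth, and regex are illustrative):
./crawler --url=https://shop.example.com/ --max-depth=2 --result-storage=file --result-storage-compression --include-regex='/product/|/category/'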
Handling Sites with Login Requirements
For sites requiring authentication:
- Basic HTTP authentication: Use --http-auth=<username:password>
- For complex login forms, consider pre-authenticating and using cookies (this requires custom development)
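A minimal example with Basic authentication (the URL and credentials are placeholders):
./crawler --url=https://intranet.example.com/ --http-auth=admin:s3cret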
Error Messages Explained
| Error Message | Likely Cause | Solution |
|---|---|---|
| Maximum function nesting level... | Deep recursion, typically with complex URL structures | Increase PHP's xdebug.max_nesting_level or disable Xdebug |
| cURL error 28: Operation timed out... | Server not responding within the timeout period | Increase the timeout with --timeout |
| Allowed memory size exhausted... | Not enough memory allocated | Increase the memory limit or use file storage |
| Maximum execution time... | Script running longer than PHP allows | Increase PHP's max_execution_time or run in smaller batches |
Getting More Help
If you're still experiencing issues after trying these solutions:
- Check the GitHub repository for recent issues or discussions
- Open a new issue with detailed information about your problem
- Include your command with all options, environment details, and relevant error messages
💡 Further Troubleshooting Tips
- When memory issues persist, consider breaking a large site into smaller segments using --include-regex (see the sketch after this list)
- For cross-origin resource problems, ensure all domains are properly added with --allowed-domain-for-external-files
- Regularly update SiteOne Crawler to benefit from bug fixes and performance improvements
- For custom installations, ensure all PHP dependencies and extensions are properly installed
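As a sketch, a large site could be crawled in separate segments, one section per run (the section paths are placeholders):
./crawler --url=https://example.com/ --include-regex='/blog/'
./crawler --url=https://example.com/ --include-regex='/docs/'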