# Troubleshooting
Even though SiteOne Crawler is designed to work smoothly across various environments and websites, you may occasionally encounter issues that require troubleshooting. This guide provides solutions to common problems and guidance on debugging more complex issues.
## Common Issues and Solutions

### Memory Issues

Problem: The crawler crashes with "out of memory" errors when crawling large websites.
Solutions:
- Adjust memory limit: Use the `--memory-limit` option to increase available memory (e.g., `--memory-limit=4096M` for 4 GB).
- Switch to file storage: Use `--result-storage=file` instead of the default memory storage for large websites.
- Enable compression: Add `--result-storage-compression` to reduce the memory footprint (at the cost of some CPU usage).
- Limit crawl scope: Use `--max-depth`, `--max-visited-urls`, or regex filters to reduce the number of pages crawled (see the combined example below).
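A minimal sketch combining the options above; the URL and the limit values are placeholders and should be tuned to your site and machine:

```bash
# Illustrative only: 4 GB memory limit, file-based result storage with
# compression, and a cap on visited URLs to keep memory usage bounded.
./crawler --url=https://example.com/ \
  --memory-limit=4096M \
  --result-storage=file \
  --result-storage-compression \
  --max-visited-urls=10000
```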
### Performance Issues

Problem: Crawling is too slow or times out on large sites.
Solutions:
- Increase concurrency: Use `--workers` to increase the number of concurrent requests (the default is 3; consider 5-10 for faster crawling).
- Adjust timeout: Use `--timeout` to set a higher timeout value for slow-responding servers.
- Disable unnecessary features: Use options like `--disable-javascript`, `--disable-images`, etc. if you don't need to analyze those resources.
- Use the HTTP cache: Make sure you're not disabling the cache with `--http-cache-dir='off'` during repeated runs (see the example below).
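A hedged example combining these options; the worker count and timeout are illustrative values, not recommendations:

```bash
# Sketch: more parallel workers, a longer per-request timeout, and image
# analysis disabled to speed up the run.
./crawler --url=https://example.com/ \
  --workers=8 \
  --timeout=30 \
  --disable-images
```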
### Crawling Problems

Problem: The crawler can't access certain pages or doesn't find all links.
Solutions:
- Check robots.txt constraints: The crawler respects robots.txt by default. Use `--ignore-robots-txt` if needed for internal analysis.
- Examine JavaScript links: For SPAs or JavaScript-heavy sites, many links might be generated dynamically and never discovered. Consider adding key URLs manually.
- Domain restrictions: If your site spans multiple domains or subdomains, use `--allowed-domain-for-crawling` to include them.
- Authentication: For protected sites, use `--http-auth` for HTTP Basic authentication (example below).
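A sketch of an internal-audit crawl; the subdomain and credentials are placeholders, and the exact value formats should be checked against `--help` output:

```bash
# Illustrative: ignore robots.txt, allow an additional subdomain for crawling,
# and send HTTP Basic credentials (replace with your own values).
./crawler --url=https://example.com/ \
  --ignore-robots-txt \
  --allowed-domain-for-crawling=blog.example.com \
  --http-auth=user:password
```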
### Export and Report Issues

Problem: Reports or exports are incomplete or contain errors.
Solutions:
- Verify exported paths: Check that the output directories exist and are writable.
- Encoding issues: For websites with non-standard character encodings, consider using `--replace-content` to fix encoding problems.
- Large reports: For extensive sites, HTML reports might be too large for browsers. Consider using JSON output and filtering the data.
## Debugging Techniques

### Enable Debug Mode

For detailed debugging information, use the `--debug` flag:

```bash
./crawler --url=https://example.com/ --debug
```

For targeted debugging of specific URLs, use the `--debug-url-regex` option:

```bash
./crawler --url=https://example.com/ --debug-url-regex='/about/'
```

### Logging to File

To capture debug output to a file without displaying it in the console:

```bash
./crawler --url=https://example.com/ --debug-log-file=debug.log
```

### Progressive Testing

When troubleshooting complex issues:
- Start small: Begin with a single page (`--single-page`) to verify basic functionality.
- Add complexity gradually: Incrementally add more pages and features to identify where issues occur.
- Isolate components: Use flags like `--analyzer-filter-regex` to focus on specific analyzers (see the sketch below).
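A progressive-testing sketch using the flags mentioned above; the URL is a placeholder and the depth value is chosen only for illustration:

```bash
# Step 1: verify that a single page can be fetched and analyzed.
./crawler --url=https://example.com/ --single-page

# Step 2: widen the crawl gradually, e.g. to two levels of links.
./crawler --url=https://example.com/ --max-depth=2
```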
## Specific Scenarios

### Handling Modern JavaScript Frameworks

If you're crawling sites built with React, Vue, Angular, or other modern frameworks:
- Use `--remove-all-anchor-listeners` to bypass client-side routing that might prevent normal link discovery (example below).
- Consider discovering key URLs manually, as SPA links might not be found through standard HTML parsing.
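A minimal sketch for a single-page application, with a placeholder URL:

```bash
# Strip client-side anchor listeners so standard link navigation works
# during the crawl instead of being intercepted by the SPA router.
./crawler --url=https://example.com/ --remove-all-anchor-listeners
```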
### Working with Large E-commerce Sites

For large e-commerce sites with thousands of product pages:
- Use `--max-depth=2` to focus on category pages rather than all products.
- Consider `--result-storage=file` and `--result-storage-compression` for memory efficiency.
- Use regex filtering to focus on important sections: `--include-regex='/product/|/category/'` (combined in the example below).
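Putting those options together as an illustrative command; the URL and the regex are placeholders for your own site structure:

```bash
# Shallow crawl of a large catalog: file-backed storage with compression,
# restricted to product and category URLs.
./crawler --url=https://example.com/ \
  --max-depth=2 \
  --result-storage=file \
  --result-storage-compression \
  --include-regex='/product/|/category/'
```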
### Handling Sites with Login Requirements

For sites requiring authentication:
- Basic HTTP authentication: Use `--http-auth=<username:password>` (example below).
- For complex login forms, consider pre-authenticating and using cookies (this requires custom development).
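A sketch with placeholder credentials and a placeholder hostname; avoid putting real credentials into scripts or shell history:

```bash
# Replace myuser:mypassword with the real credentials for the protected site.
./crawler --url=https://intranet.example.com/ --http-auth=myuser:mypassword
```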
## Error Messages Explained

| Error Message | Likely Cause | Solution |
|---|---|---|
| `Maximum function nesting level...` | Deep recursion, typically with complex URL structures | Increase PHP's `xdebug.max_nesting_level` or disable Xdebug |
| `cURL error 28: Operation timed out...` | Server not responding within the timeout period | Increase the timeout with `--timeout` |
| `Allowed memory size exhausted...` | Not enough memory allocated | Increase the memory limit or use file storage |
| `Maximum execution time...` | Script running longer than PHP allows | Increase PHP's `max_execution_time` or run in smaller batches |
## Getting More Help

If you're still experiencing issues after trying these solutions:
- Check the GitHub repository for recent issues or discussions
- Open a new issue with detailed information about your problem
- Include your command with all options, environment details, and relevant error messages
### 💡 Further Troubleshooting Tips

- When memory issues persist, consider breaking a large site into smaller segments using `--include-regex` (see the sketch below).
- For cross-origin resource problems, ensure all domains are properly added with `--allowed-domain-for-external-files`.
- Regularly update SiteOne Crawler to benefit from bug fixes and performance improvements.
- For custom installations, ensure all PHP dependencies and extensions are properly installed.
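A hedged illustration of segmenting a large crawl; the section paths are placeholders, and the regexes should be adjusted to match your URL structure:

```bash
# Crawl the site in two smaller passes, each limited to one section,
# so a single run never has to hold the whole site in memory.
./crawler --url=https://example.com/ --include-regex='/docs/'
./crawler --url=https://example.com/ --include-regex='/blog/'
```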