Troubleshooting
Even though SiteOne Crawler is designed to work smoothly across various environments and websites, you may occasionally encounter issues that require troubleshooting. This guide provides solutions to common problems and guidance on debugging more complex issues.
Common Issues and Solutions
Memory Issues
Problem: Crawler crashes with "out of memory" errors when crawling large websites.
Solutions:
- Adjust memory limit: Use the --memory-limit option to increase available memory (e.g., --memory-limit=4096M for 4 GB).
- Switch to file storage: Use --result-storage=file instead of the default memory storage for large websites.
- Enable compression: Add --result-storage-compression to reduce memory footprint (at the cost of some CPU usage).
- Limit crawl scope: Use --max-depth, --max-visited-urls, or regex filters to reduce the number of pages crawled.
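For example, these options can be combined in a single run; the URL and numeric values below are placeholders to adapt to your site:
./crawler --url=https://example.com/ --memory-limit=4096M --result-storage=file --result-storage-compression --max-visited-urls=10000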
Performance Issues
Problem: Crawling is too slow or times out on large sites.
Solutions:
- Increase concurrency: Use --workers to increase the number of concurrent requests (the default is 3; consider 5-10 for faster crawling).
- Adjust timeout: Use --timeout to set a higher timeout value for slow-responding servers.
- Disable unnecessary features: Use options like --disable-javascript, --disable-images, etc. if you don't need to analyze those resources.
- Use the HTTP cache: Make sure you're not disabling the cache with --http-cache-dir='off' during repeated runs.
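For example, a faster configuration might look like this (the worker count and timeout are illustrative values, not recommendations):
./crawler --url=https://example.com/ --workers=8 --timeout=10 --disable-images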
Crawling Problems
Problem: Crawler can't access certain pages or doesn't find all links.
Solutions:
- Check robots.txt constraints: The crawler respects robots.txt by default. Use --ignore-robots-txt if needed for internal analysis.
- Examine JavaScript links: For SPAs or JavaScript-heavy sites, many links might be generated dynamically and not discovered. Consider adding key URLs manually.
- Domain restrictions: If your site spans multiple domains or subdomains, use --allowed-domain-for-crawling to include them.
- Authentication: For protected sites, use --http-auth for Basic authentication.
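As a sketch, a run against a protected site that spans an additional subdomain might look like this (the domain and credentials are placeholders, and the value syntax shown for --allowed-domain-for-crawling is an assumption):
./crawler --url=https://example.com/ --ignore-robots-txt --allowed-domain-for-crawling=blog.example.com --http-auth=user:secret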
Export and Report Issues
Problem: Reports or exports are incomplete or contain errors.
Solutions:
- Verify exported paths: Check that the output directories exist and are writable.
- Encoding issues: For websites with non-standard character encodings, consider using --replace-content to fix encoding problems.
- Large reports: For extensive sites, HTML reports might be too large for browsers. Consider using JSON output and filtering the data.
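Before a run, a quick shell check like the following can confirm that the output location exists and is writable (the directory name is just an example):
mkdir -p ./report-output && test -w ./report-output && echo "output directory is writable"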
Debugging Techniques
Enable Debug Mode
For detailed debugging information, use the --debug flag:
./crawler --url=https://example.com/ --debug
For targeted debugging of specific URLs, use the --debug-url-regex option:
./crawler --url=https://example.com/ --debug-url-regex='/about/'
Logging to File
To capture debug output to a file without displaying it in the console:
./crawler --url=https://example.com/ --debug-log-file=debug.log
Progressive Testing
When troubleshooting complex issues:
- Start small: Begin with a single page (--single-page) to verify basic functionality.
- Add complexity gradually: Incrementally add more pages and features to identify where issues occur.
- Isolate components: Use flags like --analyzer-filter-regex to focus on specific analyzers (see the example sequence below).
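One possible progression, using only options mentioned above (the URL, depths, and the analyzer regex value are illustrative assumptions):
./crawler --url=https://example.com/ --single-page
./crawler --url=https://example.com/ --max-depth=1
./crawler --url=https://example.com/ --max-depth=2 --analyzer-filter-regex='Seo'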
Specific Scenarios
Handling Modern JavaScript Frameworks
If you're crawling sites built with React, Vue, Angular, or other modern frameworks:
- Use --remove-all-anchor-listeners to bypass client-side routing that might prevent normal link discovery.
- Consider manually discovering key URLs, as SPA links might not be found through standard HTML parsing.
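For instance (the URL is a placeholder):
./crawler --url=https://spa.example.com/ --remove-all-anchor-listeners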
Working with Large E-commerce Sites
For large e-commerce sites with thousands of product pages:
- Use --max-depth=2 to focus on category pages rather than all products.
- Consider --result-storage=file and --result-storage-compression for memory efficiency.
- Use regex filtering to focus on important sections: --include-regex='/product/|/category/'
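Combined, a run for such a site might look like this (the domain, depth, and regex are illustrative):
./crawler --url=https://shop.example.com/ --max-depth=2 --result-storage=file --result-storage-compression --include-regex='/product/|/category/'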
Handling Sites with Login Requirements
For sites requiring authentication:
- Basic HTTP authentication: Use --http-auth=<username:password>
- For complex login forms, consider pre-authenticating and using cookies (this requires custom development)
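A minimal example with Basic authentication (the URL and credentials are placeholders):
./crawler --url=https://intranet.example.com/ --http-auth=admin:s3cret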
Error Messages Explained
| Error Message | Likely Cause | Solution |
|---|---|---|
| Maximum function nesting level... | Deep recursion, typically with complex URL structures | Increase PHP's xdebug.max_nesting_level or disable Xdebug |
| cURL error 28: Operation timed out... | Server not responding within the timeout period | Increase the timeout with --timeout |
| Allowed memory size exhausted... | Not enough memory allocated | Increase the memory limit or use file storage |
| Maximum execution time... | Script running longer than PHP allows | Increase PHP's max_execution_time or run in smaller batches |
Getting More Help
If you're still experiencing issues after trying these solutions:
- Check the GitHub repository for recent issues or discussions
- Open a new issue with detailed information about your problem
- Include your command with all options, environment details, and relevant error messages
💡 Further Troubleshooting Tips
- When memory issues persist, consider breaking a large site into smaller segments using --include-regex (see the sketch after this list)
- For cross-origin resource problems, ensure all domains are properly added with --allowed-domain-for-external-files
- Regularly update SiteOne Crawler to benefit from bug fixes and performance improvements
- For custom installations, ensure all PHP dependencies and extensions are properly installed
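As a sketch, a large site could be crawled in separate segments, one section per run (the section paths are placeholders):
./crawler --url=https://example.com/ --include-regex='/blog/'
./crawler --url=https://example.com/ --include-regex='/docs/'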