Deep Website Crawling
An important aspect of a crawler is the ability to find, and then crawl, everything that can be discovered and inspected in the code of a website.
- will crawl all files, styles, scripts, fonts, images, documents, etc. on your website
- it also scans paths in CSS files - typically `url()` references to images, SVG icons or fonts
- it scans all `srcset` attributes for images and therefore also goes through all the found paths to responsive images/formats (this can help prevent visitors from waiting seconds for non-standard image sizes to be generated because they were the first to request them) - see the extraction sketch below this list
- in some cases, it also parses paths from generated JavaScript files (e.g. NextJS chunks from the build manifest)
- the crawler respects the `robots.txt` file and will not crawl the pages that are not allowed - see the robots.txt sketch below
- has incredible C++ performance (thanks to Swoole's coroutines)
- distributes the load as respectfully as possible to the hosting server(s) and with the least impact
- due to the very low CPU load and the `--workers` and `--max-reqs-per-sec` options, it can execute and parse even hundreds or thousands of requests per second, so it can also be used as a stress-test tool or a tester of protection against DoS attacks - see the rate-limiting sketch below
- captures CTRL+C (macOS and Linux only) and ends with statistics for at least the URLs processed so far
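To make the path discovery above more concrete, here is a minimal sketch of how `url()` references in CSS and `srcset` candidates in HTML can be pulled out with regular expressions. The function names and patterns are hypothetical illustrations of the idea only, not the crawler's actual implementation (which handles many more edge cases).

```php
<?php

// Illustrative only - extract asset paths from CSS url(...) references.
function extractCssUrls(string $css): array
{
    // Matches url("..."), url('...') and unquoted url(...).
    preg_match_all('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $css, $matches);
    return array_values(array_unique($matches[1]));
}

// Illustrative only - extract image URLs from a srcset attribute value,
// e.g. "hero-480.webp 480w, hero-960.webp 960w".
function extractSrcsetUrls(string $srcset): array
{
    $urls = [];
    foreach (explode(',', $srcset) as $candidate) {
        // The first whitespace-separated token is the URL; the rest (480w, 2x, ...) is the descriptor.
        $parts = preg_split('/\s+/', trim($candidate));
        if ($parts !== false && $parts[0] !== '') {
            $urls[] = $parts[0];
        }
    }
    return array_values(array_unique($urls));
}

$css = '.icon { background: url("/img/icon.svg"); } @font-face { src: url(/fonts/brand.woff2); }';
print_r(extractCssUrls($css));   // finds /img/icon.svg and /fonts/brand.woff2
print_r(extractSrcsetUrls('/img/hero-480.webp 480w, /img/hero-960.webp 960w')); // both responsive variants
```

Every path found this way is normalized against the base URL and added to the crawl queue, which is why responsive image variants get visited (and their server-side cache warmed) as well.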
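Similarly, "respects `robots.txt`" boils down to loading the Disallow rules before crawling and checking every discovered URL against them. The sketch below is deliberately naive (prefix matching for the `*` user-agent group only, no `Allow` rules or wildcards) and only illustrates the principle; it is not the crawler's real parser.

```php
<?php

// Naive sketch (PHP 8+): collect Disallow prefixes from the "User-agent: *" group.
function parseDisallowRules(string $robotsTxt): array
{
    $rules = [];
    $appliesToUs = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if (preg_match('/^user-agent:\s*(.+)$/i', $line, $m)) {
            $appliesToUs = (trim($m[1]) === '*');
        } elseif ($appliesToUs && preg_match('/^disallow:\s*(\S*)/i', $line, $m) && $m[1] !== '') {
            $rules[] = $m[1];
        }
    }
    return $rules;
}

function isAllowed(string $path, array $disallowRules): bool
{
    foreach ($disallowRules as $prefix) {
        if (str_starts_with($path, $prefix)) {
            return false; // matched a Disallow prefix -> do not crawl this URL
        }
    }
    return true;
}

$rules = parseDisallowRules("User-agent: *\nDisallow: /admin/\nDisallow: /tmp/");
var_dump(isAllowed('/admin/login', $rules)); // bool(false)
var_dump(isAllowed('/blog/post-1', $rules)); // bool(true)
```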
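Finally, the throughput claim is easier to reason about with a simple model: roughly speaking, `--workers` bounds how many requests can be in flight at once, while `--max-reqs-per-sec` bounds how many may be started per second. The fixed-window budget below is a plain-PHP conceptual sketch of that second limit, not the crawler's actual Swoole-based scheduler.

```php
<?php

// Conceptual sketch (PHP 8+) of a per-second request budget, similar in spirit
// to what a --max-reqs-per-sec option enforces. Not the crawler's real code.
final class RequestBudget
{
    private float $windowStart;
    private int $usedInWindow = 0;

    public function __construct(private int $maxReqsPerSec)
    {
        $this->windowStart = microtime(true);
    }

    // Blocks until starting another request would stay within the budget.
    public function acquire(): void
    {
        while (true) {
            $now = microtime(true);
            if ($now - $this->windowStart >= 1.0) {
                // A new one-second window starts: reset the counter.
                $this->windowStart = $now;
                $this->usedInWindow = 0;
            }
            if ($this->usedInWindow < $this->maxReqsPerSec) {
                $this->usedInWindow++;
                return;
            }
            usleep(10_000); // wait 10 ms before re-checking the budget
        }
    }
}

// Example: allow at most 5 simulated "requests" per second.
$budget = new RequestBudget(5);
for ($i = 1; $i <= 12; $i++) {
    $budget->acquire();
    printf("request %2d started at %.3f\n", $i, microtime(true));
}
```

Raising the limit turns the same mechanism into a stress-testing knob: with a high enough budget and enough workers, the crawler simply keeps the hosting server saturated with requests.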
💡 Further development ideas
In the current release, the crawler already handles task parallelization, crawling and analysis very well.
In the future, however, we may make further improvements that take even greater advantage of multi-core processors.
This would be especially beneficial if, over time, we add more demanding analyses that could make use of it.
If you have suggestions to improve crawling, don't be afraid to send a feature request (for the desktop application or for the command-line interface) describing your proposed improvement. We will be happy to consider and implement it if it benefits a wider range of users.