Deep Website Crawling

A key capability of a crawler is finding, and then crawling, everything that is referenced in the code of a website.

  • crawls all files on your website: styles, scripts, fonts, images, documents, etc.
  • also scans paths in CSS files, typically url() references to images, SVG icons, or fonts
  • scans every srcset on images and follows all discovered paths to responsive images/formats (this can spare a visitor from waiting seconds while a non-standard image size is generated on its first request)
  • in some cases, also discovers files referenced from JavaScript (e.g. Next.js chunks listed in the build manifest)
  • respects the robots.txt file and will not crawl pages that are disallowed for User-agent: *. You can also block it from your website specifically by adding User-agent: SiteOne-Crawler and Disallow: / to your robots.txt (see the FAQ)
  • has incredible 🚀 native Rust performance with non-blocking async I/O and multi-threaded crawling: fast, low-overhead, and with zero runtime dependencies
  • JavaScript-heavy / SPA sites can be crawled with their post-render DOM using the optional browser rendering mode (--browser)
  • fine-grained URL handling and filtering: seed a bounded set with --url-list, keep only chosen query parameters with --keep-query-param, rewrite URLs on the fly with --transform-url, and skip links hidden in HTML comments with --ignore-html-comments
  • spreads the load across the hosting server(s) as considerately as possible, keeping the impact to a minimum
  • thanks to its very low CPU load and the --workers and --max-reqs-per-sec options, it can execute and parse hundreds or even thousands of requests per second, so it can also serve as a stress-testing tool or as a way to test protection against DoS attacks
  • captures CTRL+C (on macOS and Linux only) and exits with statistics for at least the URLs processed so far
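
To make the CSS url() and srcset scanning described above more concrete, here is a minimal Python sketch of the general technique. It is not the crawler's actual implementation (which is written in Rust); the function names css_urls and srcset_urls and the example.com URLs are hypothetical and exist only for illustration.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the path.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def css_urls(css_text, base_url):
    """Return absolute URLs referenced via url() in a stylesheet (hypothetical helper)."""
    found = []
    for match in CSS_URL_RE.finditer(css_text):
        path = match.group(2).strip()
        # Inline data: URIs carry their own payload, so there is nothing to fetch.
        if path and not path.startswith("data:"):
            found.append(urljoin(base_url, path))
    return found


def srcset_urls(srcset, base_url):
    """Return absolute URLs for every candidate in a srcset value (hypothetical helper).

    Simplification: candidates are split on commas, so URLs that themselves
    contain commas are not handled correctly.
    """
    found = []
    for candidate in srcset.split(","):
        parts = candidate.split()
        if parts:
            # The first token is the URL; the rest is a width/density descriptor.
            found.append(urljoin(base_url, parts[0]))
    return found


if __name__ == "__main__":
    css = 'a{background:url("../img/logo.svg")} @font-face{src:url(fonts/a.woff2)}'
    print(css_urls(css, "https://example.com/css/site.css"))
    print(srcset_urls("/img/hero-480.webp 480w, /img/hero-960.webp 960w",
                      "https://example.com/page"))
```

Relative paths are resolved against the URL of the file they were found in, which is why the stylesheet's own URL is passed as the base for url() references.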

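The robots.txt rules mentioned in the list above can be checked before deploying them. The sketch below uses Python's standard urllib.robotparser to evaluate a robots.txt that blocks SiteOne-Crawler entirely while leaving other agents subject to the User-agent: * rules. The /admin/ rule, the OtherBot agent name, and the is_allowed helper are assumptions made for the example, and Python's matching logic may differ in detail from the crawler's own.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: the SiteOne-Crawler block comes from the text above;
# the /admin/ rule under "*" is an illustrative assumption.
ROBOTS_TXT = """\
User-agent: SiteOne-Crawler
Disallow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # SiteOne-Crawler is blocked from the whole site.
    print(is_allowed(ROBOTS_TXT, "SiteOne-Crawler", "https://example.com/"))
    # Other agents fall back to the "*" rules.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/about"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/admin/users"))
```
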
The crawler in the current release already handles task parallelization, crawling, and analysis very well.

In the future, however, we may add improvements that make even better use of multi-core processors.

This would be especially worthwhile if, over time, we add more computationally demanding analyses that could benefit from it.


If you have suggestions for improving crawling, don't hesitate to send a feature request (for the desktop application or the command-line interface). We are happy to consider and implement it if it would benefit more users.