Overview

Purpose of the SiteOne Crawler

The main purpose of this multi-platform crawler is to help website owners, developers, QA engineers, and DevOps engineers find weaknesses and improve the quality of their websites in many areas: SEO, security, performance, accessibility, social-network sharing, best practices, and content issues.

It offers extensive configuration options and provides its output as a clear, structured graphic report, or as text or JSON for further integration.

At the same time, the tool provides other very useful functions: it can convert an entire website into a browsable offline/archive form, generate sitemaps, send reports by e-mail, and is designed for easy development of further functionality. From version 1.0.7 you can also upload HTML reports for free and share them through a secure unique URL.
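
For example, a single run can produce several of these outputs at once. The following is only a sketch: the option names below are assumptions based on common usage (this page only documents --workers and --max-reqs-per-sec later on), so verify them with crawler --help:

    # Hypothetical example: crawl a site and produce an HTML report, a JSON export,
    # an XML sitemap and an offline copy in one run (option names are assumptions;
    # verify them with `crawler --help`)
    crawler --url=https://example.com/ \
            --output-html-report=report.html \
            --output-json-file=report.json \
            --sitemap-xml-file=sitemap.xml \
            --offline-export-dir=offline-copy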

Typical Use Cases

Website owner or consultant

  • can check the overall build quality of their website and consider implementing some of the recommendations;
  • can check how fast individual pages are generated on the server side, e.g. whether the parameters of an SLA contract are met;
  • can quickly reveal flaws in the content defined for search engines (duplicate titles/descriptions, poor heading structure, etc.);
  • for archival and other (e.g. legal) purposes, can create a functional offline version of their website, which can be saved to a USB stick and viewed from a local disk, completely without the need for the Internet.

Developer

  • can check the quality of their work in many areas and find imperfections they may have overlooked;
  • can effectively compare the effects of their optimizations before and after the release of a new version;
  • can very quickly locate the slowest pages that should be optimized, or pages that are not cached but should be;
  • can easily test how the website they are developing behaves with fonts, images, styles, or JavaScript turned off (see the sketch after this list);
  • can check a still-in-progress development version at http://localhost:PORT/ or non-public, password-protected versions.
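
As a sketch of the last two points, a developer might point the crawler at a local development server and switch selected asset types off. The --disable-* option names are assumptions (they are not documented on this page), so verify them with crawler --help:

    # Hypothetical example: crawl a local dev server with JavaScript and images turned off
    # (option names are assumptions; verify them with `crawler --help`)
    crawler --url=http://localhost:8080/ --disable-javascript --disable-images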

DevOps

  • can test the functionality of the entire website as part of the CI/CD pipeline (see the sketch after this list);
  • can warm up the website cache after a release, so that the first visitors do not have to wait for individual pages to be generated;
  • can test the performance of the website under heavy load, because requests can be made in parallel with any number of workers (but be careful not to cause a DoS attack);
  • can test the functionality of the protections used against DoS attacks (typically rate limiting and limits on simultaneous connections).
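
As a sketch of the CI/CD point above, the crawler can be run as a pipeline step against a freshly deployed environment. Only --workers and --max-reqs-per-sec are documented on this page; the --url option, the staging URL and the non-zero-exit-on-failure behaviour are assumptions to verify for your version:

    # Hypothetical CI step: crawl the deployed site with conservative limits
    # and fail the job if the crawler exits with a non-zero status (assumption)
    crawler --url=https://staging.example.com/ --workers=3 --max-reqs-per-sec=10 || exit 1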

QA engineer

  • the crawler can be another useful tool for testing an entire website and analyzing details that other tools do not notice;
  • can keep historical reports and track how performance, security, accessibility, SEO, and all other parameters have evolved over time;
  • can compare the results of the same website across different environments (e.g. production vs. staging) or different versions of the website (e.g. before/after the release of a new version);
  • can test the functionality of the website under heavy load, because requests can be made in parallel, with any number of workers and requests per second (but be careful not to cause a DoS attack);
  • as a professional, they can think of improvements to this tool that would directly help them and submit them as feature requests - the goal is to create a truly useful and universal tool that helps improve the quality of websites around the world.

Core Principles

The principle of operation of this crawler is relatively simple: it loads the provided URL, parses links to other pages and to all other content, visits the discovered pages in parallel, parses them again, and stops when it finds no new URLs.

For each URL response, it performs many analyses - some are executed immediately, others as a summary analysis at the end of the crawl. Programmers can easily add their own analyzers.

To avoid overloading the server, the crawler uses this anti-overloading default setting: --workers=3 --max-reqs-per-sec=10 (short form -w=3 -rps=10). This means the crawler runs 3 parallel workers that together make at most 10 requests per second. This is a very safe setting that will not overload the server, but crawling a large website will take a long time. If you want to speed up the crawling, you can increase the number of workers and the number of requests per second. But be careful not to cause a DoS attack.
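
For example, assuming the binary is invoked as crawler and accepts a --url option (an assumption; only the worker and rate-limit options are described above), the difference between the default and a faster setting looks like this:

    # Default, very safe limits: 3 workers, at most 10 requests per second in total
    crawler --url=https://example.com/ --workers=3 --max-reqs-per-sec=10

    # Faster crawl of a server you control - increase with care to avoid a self-inflicted DoS
    crawler --url=https://example.com/ --workers=6 --max-reqs-per-sec=30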

By default, the crawler stores the content and headers of visited URLs in a local cache, so if the crawl is interrupted, you can start again and previously downloaded content does not have to be fetched again.
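
In practice, simply re-running the same command continues from the local cache. The example below assumes the binary name crawler and the --url option:

    # First run, interrupted part-way through (e.g. with CTRL+C)
    crawler --url=https://example.com/
    # Re-running the identical command reuses the cached responses instead of downloading them again
    crawler --url=https://example.com/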

You can interrupt the crawl at any time by pressing CTRL+C - all analyses and reports will still be generated, but only for the URLs visited before the interruption. In the case of an interruption, however, the optionally configured e-mail report will not be sent.