Examples

Analysis of the entire website with default settings

./crawler --url=https://crawler.siteone.io/

By default, the crawler will save the HTML report in a path like ./tmp/report.crawler.siteone.io.20231214-152604.html.
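Because the report file name contains a timestamp, the newest report can be picked out of ./tmp from a script. A minimal sketch; the two touch lines create dummy fixture files purely for illustration (in real use the crawler writes the timestamped reports itself):

```shell
# Pick the newest HTML report in ./tmp by its timestamped file name.
# The mkdir/touch lines are fixtures for illustration only.
mkdir -p tmp
touch tmp/report.example.com.20231214-152604.html
touch tmp/report.example.com.20231215-101500.html

# the timestamp in the name sorts lexicographically, so a plain sort works
latest=$(ls tmp/report.*.html | sort | tail -n 1)
echo "latest report: $latest"
```

You could then open "$latest" in a browser.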

Analysis and upload of the HTML report to the online service


Basic upload to our free service (30d retention, no password)

./crawler \
--url=https://crawler.siteone.io/ \
--upload

Advanced upload to your own service (custom retention, password and timeout)
./crawler \
--url=https://crawler.siteone.io/ \
--upload \
--upload-to=https://your.domain.tld/my-upload-service \
--upload-retention=365d \
--upload-password=secret123 \
--upload-timeout=7200

See Online HTML report (upload) or Upload options for more information.

Analysis and sending the HTML report by e-mail

./crawler \
--url=https://crawler.siteone.io/ \
--mail-smtp-host=my.smtp.com \
--mail-to=first@email.com,second@email.com

See Mailer options for all settings.

Simulate a tablet and crawl only the first 100 URLs

./crawler \
--url=https://crawler.siteone.io/ \
--device=tablet \
--max-visited-urls=100
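To compare results across device profiles, the same limited crawl can be repeated for each device. A sketch that only prints the commands to run; note that the device names other than tablet (used above) are assumptions here, so verify the accepted values with ./crawler --help:

```shell
# Print one crawl command per simulated device.
# "desktop" and "mobile" are assumed device names - check ./crawler --help.
for device in desktop tablet mobile; do
  echo "./crawler --url=https://crawler.siteone.io/ --device=$device --max-visited-urls=100"
done
```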

Internal password-protected website behind a proxy

./crawler \
--url=http://internal.web.dev/ \
--proxy=10.11.12.13:8080 \
--http-auth=user:secret123 \
--timeout=30

SEO-oriented analysis and output (ignore assets)

./crawler \
--url=https://crawler.siteone.io/ \
--extra-columns='Title(30),Description(40),Keywords(40)' \
--analyzer-filter-regex='/(seo|best)/i' \
--disable-javascript \
--disable-styles \
--disable-fonts \
--disable-images \
--disable-files \
--hide-progress-bar

Stress test with 10 workers and 100 reqs/sec

./crawler \
--url=https://crawler.siteone.io/ \
--workers=10 \
--max-reqs-per-sec=100 \
--add-random-query-params \
--analyzer-filter-regex='/nothing/i' \
--disable-javascript \
--disable-styles \
--disable-fonts \
--disable-images \
--disable-files

Option --add-random-query-params appends random query parameters to each URL, which bypasses caches so every request hits the origin server.

Option --analyzer-filter-regex='/nothing/i' skips all analyzers, saving time, resources and output volume.
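Rather than hitting the target with 100 req/s immediately, you may prefer to ramp the load up in steps. A sketch that prints each step's command, reusing only the flags shown above (replace echo with the real invocation to execute each step):

```shell
# Ramp up gradually: 10 -> 25 -> 50 -> 100 requests per second.
for rate in 10 25 50 100; do
  echo "./crawler --url=https://crawler.siteone.io/ --workers=10 --max-reqs-per-sec=$rate --add-random-query-params"
done
```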

Analysis and export of a large website (~1 million URLs)

./crawler \
--url=https://www.very-large.website/ \
--workers=5 \
--max-reqs-per-sec=50 \
--max-visited-urls=1000000 \
--max-queue-length=900000 \
--memory-limit=4096M \
--offline-export-dir='tmp/www.very-large.website' \
--allowed-domain-for-external-files='*' \
--allowed-domain-for-crawling='*.very-large.website' \
--remove-query-params \
--result-storage='file' \
--result-storage-dir='tmp/result-storage' \
--result-storage-compression \
--http-cache-compression

Generate an offline version of the website

./crawler \
--url=https://astro.build/ \
--offline-export-dir=tmp/astro.build \
--allowed-domain-for-external-files='*' \
--allowed-domain-for-crawling='*.astro.build'

Option --offline-export-dir=tmp/astro.build will activate export mode and save the website to the ./tmp/astro.build directory.

Option --allowed-domain-for-external-files='*' ensures that all external JavaScript, styles, fonts and images (for example, avatars from GitHub), as well as any other files from other domains that the HTML references, are also downloaded for offline use.

Option --allowed-domain-for-crawling='*.astro.build' ensures that only URLs from the initial domain astro.build and its subdomains are crawled.
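Once the export has finished, a quick sanity check is to count the exported HTML files. A sketch; the mkdir/touch lines below are fixtures standing in for the real export directory, which the crawler fills itself:

```shell
# Count HTML files in the offline export directory.
# mkdir/touch create fixtures for illustration only - in real use
# the crawler populates tmp/astro.build during the export.
mkdir -p tmp/astro.build
touch tmp/astro.build/index.html

count=$(find tmp/astro.build -name '*.html' | wc -l | tr -d ' ')
echo "exported HTML files: $count"
# to browse the copy locally: cd tmp/astro.build && python3 -m http.server 8000
```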

Generate sitemap.xml and sitemap.txt

./crawler \
--url=https://crawler.siteone.io/ \
--sitemap-xml-file=tmp/sitemap.xml \
--sitemap-txt-file=tmp/sitemap.txt \
--sitemap-base-priority=0.8 \
--sitemap-priority-increase=0.1

You can find the sitemap files in ./tmp/sitemap.xml and ./tmp/sitemap.txt.
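A quick way to confirm that the generated sitemap.xml is well-formed is to parse it with Python's standard library. In this sketch the heredoc writes a tiny fixture in place of the real ./tmp/sitemap.xml produced by the crawler:

```shell
# Write a minimal sitemap fixture (stands in for the crawler's output),
# then check that it parses as XML.
mkdir -p tmp
cat > tmp/sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://crawler.siteone.io/</loc><priority>0.8</priority></url>
</urlset>
EOF

python3 -c 'import xml.etree.ElementTree as ET; ET.parse("tmp/sitemap.xml"); print("sitemap.xml is well-formed")'
```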

For a complete list of all available options, run:

./crawler --help