Examples
Analysis of the entire website with default settings
./crawler --url=https://crawler.siteone.io/
By default, the crawler will save the HTML report in a path like ./tmp/report.crawler.siteone.io.20231214-152604.html.
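Because the report filename contains a timestamp, a small shell sketch like the following can locate the newest report after a run (an illustration, assuming the default ./tmp output directory; the sample files below stand in for real crawler runs):

```shell
# Sample report files standing in for real crawler output
# (the timestamps are set explicitly so "newest" is unambiguous).
mkdir -p tmp
touch -t 202312141526 tmp/report.example.com.20231214-152604.html
touch -t 202312150900 tmp/report.example.com.20231215-090000.html

# Pick the most recently modified report.
latest=$(ls -t tmp/report.*.html | head -n 1)
echo "$latest"
```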
Analysis and upload of the HTML report to the online service
Basic upload to our free service (30-day retention, no password)
./crawler \
  --url=https://crawler.siteone.io/ \
  --upload
Advanced upload to your own service
./crawler \
  --url=https://crawler.siteone.io/ \
  --upload \
  --upload-to=https://your.domain.tld/my-upload-service \
  --upload-retention=365d \
  --upload-password=secret123 \
  --upload-timeout=7200
See Online HTML report (upload) or Upload options for more information.
Analysis and sending of the HTML report by e-mail
./crawler \
  --url=https://crawler.siteone.io/ \
  --mail-smtp-host=my.smtp.com \
  --mail-to=first@email.com,second@email.com
See Mailer options for all settings.
Simulate a tablet and crawl only the first 100 URLs
./crawler \
  --url=https://crawler.siteone.io/ \
  --device=tablet \
  --max-visited-urls=100
Internal password-protected website behind a proxy
./crawler \
  --url=http://internal.web.dev/ \
  --proxy=10.11.12.13:8080 \
  --http-auth=user:secret123 \
  --timeout=30
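As an aside, a user:password pair such as the one passed to --http-auth is typically sent over the wire as an HTTP Basic Authorization header (an illustration of the general mechanism, not the crawler's internal code):

```shell
# HTTP Basic auth encodes "user:password" as base64 in a request header.
token=$(printf 'user:secret123' | base64)
echo "Authorization: Basic $token"
```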
SEO-oriented analysis and output (ignore assets)
./crawler \
  --url=https://crawler.siteone.io/ \
  --extra-columns='Title(30),Description(40),Keywords(40)' \
  --analyzer-filter-regex='/(seo|best)/i' \
  --disable-javascript \
  --disable-styles \
  --disable-fonts \
  --disable-images \
  --disable-files \
  --hide-progress-bar
Stress test with 10 workers and 100 reqs/sec
./crawler \
  --url=https://crawler.siteone.io/ \
  --workers=10 \
  --max-reqs-per-sec=100 \
  --add-random-query-params \
  --analyzer-filter-regex='/nothing/i' \
  --disable-javascript \
  --disable-styles \
  --disable-fonts \
  --disable-images \
  --disable-files
The option --add-random-query-params is used to bypass caching.
The option --analyzer-filter-regex='/nothing/i' skips all analyzers, saving time, resources, and output.
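The cache-busting idea behind --add-random-query-params can be sketched in plain shell (an illustration, not the crawler's own code): a query parameter that is unique per request makes cached responses unusable, so every hit reaches the origin server.

```shell
# Append a unique query parameter so caches treat each request as new.
url="https://example.com/page"
busted="${url}?rnd=$(date +%s)"
echo "$busted"
```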
Analysis and export of a large website (~1 million URLs)
./crawler \
  --url=https://www.very-large.website/ \
  --workers=5 \
  --max-reqs-per-sec=50 \
  --max-visited-urls=1000000 \
  --max-queue-length=900000 \
  --memory-limit=4096M \
  --offline-export-dir='tmp/www.very-large.website' \
  --allowed-domain-for-external-files='*' \
  --allowed-domain-for-crawling='*.very-large.website' \
  --remove-query-params \
  --result-storage='file' \
  --result-storage-dir='tmp/result-storage' \
  --result-storage-compression \
  --http-cache-compression
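The effect of --remove-query-params can be sketched as follows (an illustration, not the crawler's implementation): the query string is dropped from discovered URLs, so parameterized variants of the same page collapse into one canonical entry.

```shell
# Strip everything from the first '?' onward to get the canonical URL.
url='https://www.very-large.website/products?page=2&sort=asc'
canonical="${url%%\?*}"
echo "$canonical"
```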
Generate an offline version of the website
./crawler \
  --url=https://astro.build/ \
  --offline-export-dir=tmp/astro.build \
  --allowed-domain-for-external-files='*' \
  --allowed-domain-for-crawling='*.astro.build'
The option --offline-export-dir=tmp/astro.build activates export mode and saves the website to the ./tmp/astro.build directory.
The option --allowed-domain-for-external-files='*' ensures that all external JavaScript, styles, fonts, avatar images from GitHub, or any other external files referenced by the HTML are also downloaded for offline use.
The option --allowed-domain-for-crawling='*.astro.build' ensures that only URLs from the initial domain astro.build and all of its subdomains are crawled.
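The domain-matching behavior described above can be illustrated with a shell case pattern (not the crawler's internals): a wildcard such as '*.astro.build' covers the initial domain and every subdomain, while other hosts are rejected.

```shell
# Illustrative matcher for the '*.astro.build' crawling restriction.
matches() {
  case "$1" in
    astro.build|*.astro.build) echo yes ;;
    *) echo no ;;
  esac
}

matches docs.astro.build
matches example.com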
Generate sitemaps for the website
./crawler \
  --url=https://crawler.siteone.io/ \
  --sitemap-xml-file=tmp/sitemap.xml \
  --sitemap-txt-file=tmp/sitemap.txt \
  --sitemap-base-priority=0.8 \
  --sitemap-priority-increase=0.1
You can find the sitemap files in ./tmp/sitemap.xml and ./tmp/sitemap.txt.
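The text sitemap is a plain list of URLs, one per line, so a quick spot-check is straightforward (sample data is used below in place of a real crawl):

```shell
# Create a sample sitemap.txt standing in for crawler output,
# then count the URLs it contains.
mkdir -p tmp
printf '%s\n' \
  'https://crawler.siteone.io/' \
  'https://crawler.siteone.io/features/' > tmp/sitemap.txt
wc -l < tmp/sitemap.txt
```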
Help with all available options
./crawler --help