
Command-line options

Basic Settings

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--url` | Required. HTTP or HTTPS URL of the website to be crawled. If you provide only a domain name without a scheme, `https://` is added automatically. Use quotation marks if the URL contains query parameters. | |
| `--device` | Device type used to select a predefined User-Agent. Supported values: `desktop`, `tablet`, `mobile`. Ignored when `--user-agent` is defined. | `desktop` |
| `--user-agent` | Custom User-Agent header; overrides the User-Agent selected by `--device`. Use quotation marks. | |
| `--timeout` | Request timeout (in seconds). | `5` |
| `--proxy` | HTTP proxy to use, in `host:port` format. The host can be a hostname, IPv4 or IPv6 address. | |
| `--http-auth` | Basic HTTP authentication in `username:password` format. | |
| `--help` | Show help and exit. | |
| `--version` | Show crawler version and exit. | |
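
For illustration, a basic run combining several of these options might look like the following sketch. The `./crawler` launcher name, the domain, and all values are assumptions; substitute the executable or script from your installation.

```bash
# example.com, credentials and values below are placeholders.
# Crawl with a mobile User-Agent, a 10-second timeout and basic HTTP authentication;
# the URL is quoted because it contains query parameters.
./crawler --url="https://example.com/?utm_source=test" \
  --device=mobile \
  --timeout=10 \
  --http-auth=admin:secret
```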

Output Settings

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--output` | Console output type. Supported values: `text` or `json`. The JSON output shows only a one-line progress on STDERR during crawling and generates JSON with crawling details on STDOUT at the end. | `text` |
| `--extra-columns` | Comma-delimited list of extra columns added to the output table. You can specify HTTP header names (e.g. `X-Cache`) or the predefined `Title`, `Keywords`, `Description`, or `DOM` for the number of DOM elements found in the HTML. You can set the expected column length in parentheses and add `>` for do-not-truncate, e.g. `DOM(6),X-Cache(10),Title(40>),Description(50>)`. | |
| `--url-column-size` | Basic URL column width. By default, it is calculated from the size of your terminal window. | |
| `--show-inline-criticals` | Show criticals from the analyzer directly in the URL table. | |
| `--show-inline-warnings` | Show warnings from the analyzer directly in the URL table. | |
| `--do-not-truncate-url` | In text output, long URLs are truncated by default to `--url-column-size` so the table does not wrap. This option turns truncation off. | |
| `--show-scheme-and-host` | In text output, also show the scheme and host for origin-domain URLs. | |
| `--hide-progress-bar` | Hide the progress bar shown in text and JSON output for a more compact view. | |
| `--no-color` | Disable colored output. | |
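
As a hedged example, machine-readable output with a few extra columns could be requested like this (launcher name, domain, and column choices are illustrative):

```bash
# With --output=json, one-line progress goes to STDERR and the final JSON
# report to STDOUT, so it can be redirected straight into a file.
./crawler --url=https://example.com \
  --output=json \
  --extra-columns="Title(40>),X-Cache(10),DOM(6)" \
  --hide-progress-bar > report.json
```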

Upload Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--upload` | Enable HTML report upload to `--upload-to`. | |
| `--upload-to` | URL of the endpoint where the HTML report is sent. | `https://crawler.siteone.io/up` |
| `--upload-retention` | How long the HTML report should be kept in the online version. Values: `1h` / `4h` / `12h` / `24h` / `3d` / `7d` / `30d` / `365d` / `forever`. | `30d` |
| `--upload-password` | Optional password that must be entered to view the online HTML report (the username is `crawler`). | |
| `--upload-timeout` | Upload timeout in seconds. | `3600` |

See Online HTML report (upload) for more information.
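
A possible upload invocation, assuming the same `./crawler` launcher and placeholder values:

```bash
# Upload the HTML report, keep it online for 7 days and protect it with a
# password (the viewing username is "crawler"). Domain and password are placeholders.
./crawler --url=https://example.com \
  --upload \
  --upload-retention=7d \
  --upload-password='s3cret-pass'
```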

Resource Filtering

These options disable the downloading of selected resource types. For example, it is often useful to disable JavaScript on modern websites, e.g. those built on React with Next.js, which use SSR and therefore work fine without JavaScript for content browsing and navigation.

Disabling JavaScript is particularly useful when exporting a website built e.g. on React to an offline form (without an HTTP server), where it is almost impossible to make the site work from an arbitrary location on disk over the file:// protocol alone.

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--disable-javascript` | Disables JavaScript downloading and removes all JavaScript code from HTML, including `onclick` and other `on*` handlers. | |
| `--disable-styles` | Disables CSS file downloading and removes all style definitions, whether in `<style>` tags or inline `style` attributes. | |
| `--disable-fonts` | Disables font downloading and removes all font/font-face definitions from CSS. | |
| `--disable-images` | Disables downloading of all images and replaces images found in HTML with a placeholder image. | |
| `--disable-files` | Disables downloading of any files (typically downloadable documents) to which various links point. | |
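
A sketch of such a crawl, assuming the `./crawler` launcher and an example domain:

```bash
# example.com is a placeholder. Strip JavaScript (including on* handlers) and
# skip font downloads; images and CSS are still fetched.
./crawler --url=https://example.com \
  --disable-javascript \
  --disable-fonts
```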

Advanced Crawler Settings

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--workers` / `-w` | Maximum number of concurrent workers (threads). The crawler will not make more simultaneous requests to the server than this number. Use carefully! A high number of workers can cause a DoS attack. | `3` (`1` on Windows) |
| `--max-reqs-per-sec` / `-rps` | Maximum requests per second for the whole crawler and all workers. Use carefully! A high value can cause a DoS attack. | `10` |
| `--memory-limit` | Memory limit in units `M` (megabytes) or `G` (gigabytes). If you crawl a large website and run into an out-of-memory error, we recommend setting `--result-storage=file`. | `2048M` |
| `--allowed-domain-for-external-files` | Enables a domain (or multiple domains, by repeating the option) from which the crawler may download static external files (JS, CSS, fonts, images, etc.). You can enable all domains with `*`, or use a wildcard such as `*.my-domain.tld`. Otherwise, only files from the same domain are downloaded. | |
| `--allowed-domain-for-crawling` | Allows crawling of content from other linked domains. Useful e.g. for traversing `*.my-domain.tld` subdomains or other TLD-driven language mutations `*.my-domain.*`. If you set `*`, you are crazy and can start crawling the entire internet (based on external links on your website). | |
| `--include-regex` | PCRE-compatible regular expression for URLs that should be included. Can be specified multiple times. Example: `--include-regex='/^\/public\//'` | |
| `--ignore-regex` | PCRE-compatible regular expression for URLs that should be ignored. Can be specified multiple times. Example: `--ignore-regex='/^.*\/downloads\/.*\.pdf$/i'` | |
| `--regex-filtering-only-for-pages` | Apply the `*-regex` filtering rules only to page URLs, so static assets (JS, CSS, images, fonts, documents) are loaded regardless of filtering. Useful when you want to crawl only `/sub-pages/` via `--include-regex='/\/sub-pages\//'` but still load assets from any URL. | |
| `--analyzer-filter-regex` | PCRE-compatible regular expression applied to Analyzer class names to filter analyzers. Example: `/(content\|accessibility)/i`, or `/^(?:(?!best\|access).)*$/i` for all analyzers except `BestPracticesAnalyzer` and `AccessibilityAnalyzer`. | |
| `--accept-encoding` | Custom `Accept-Encoding` request header. | `gzip, deflate, br` |
| `--remove-query-params` | Remove query parameters from found URLs. Useful on websites where many links point to the same pages with different, irrelevant query parameters. | |
| `--add-random-query-params` | Add several random query parameters to each URL. This makes it possible to bypass certain forms of server and CDN caches. | |
| `--max-queue-length` | The maximum length of the waiting URL queue. Increase for large websites, but expect higher memory requirements. | `9000` |
| `--max-visited-urls` | The maximum number of visited URLs. Increase for large websites, but expect higher memory requirements. | `10000` |
| `--max-url-length` | The maximum supported URL length in characters. Increase for very long URLs, but expect higher memory requirements. | `2083` |
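
A cautious configuration for a larger site might combine workers, a rate limit and regex filtering, for example (launcher name, domain and all values are illustrative, not recommendations):

```bash
# Two workers capped at 4 requests/second; crawl only /blog/ pages, skip PDFs,
# keep loading assets from any URL, and drop query parameters from found URLs.
./crawler --url=https://example.com \
  --workers=2 \
  --max-reqs-per-sec=4 \
  --include-regex='/^\/blog\//' \
  --ignore-regex='/\.pdf$/i' \
  --regex-filtering-only-for-pages \
  --remove-query-params
```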

Expert Settings

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--result-storage` | Result storage type for content and headers cache. Values: `memory` or `file`. Use `file` for large websites and lower memory consumption. See the Caching section. | `memory` |
| `--result-storage-dir` | Directory for `--result-storage=file`. | `tmp/result-storage` |
| `--result-storage-compression` | Enable compression for result storage. Saves disk space, but uses more CPU. | |
| `--http-cache-dir` | Cache directory for HTTP responses. You can disable the HTTP cache with `--http-cache-dir=''`. See the Caching section. | `tmp/http-client-cache` |
| `--http-cache-compression` | Enable compression for the HTTP cache storage. Saves disk space, but uses more CPU. | |
| `--websocket-server` | Start the crawler with a websocket server on the given `host:port`. The WebSocket is used, for example, by the desktop application, but you can use it for other purposes. Detailed documentation of the sent messages is still missing. | |
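
For a very large website, a file-based setup along these lines should keep memory usage down (the paths shown are the defaults from the table; the launcher name and domain are assumptions):

```bash
# Keep crawl results on disk instead of in memory, compress both the result
# storage and the HTTP cache, and lower the memory limit. example.com is a placeholder.
./crawler --url=https://example.com \
  --result-storage=file \
  --result-storage-dir=tmp/result-storage \
  --result-storage-compression \
  --http-cache-dir=tmp/http-client-cache \
  --http-cache-compression \
  --memory-limit=512M
```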

File Export Settings

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--output-html-report` | Save the HTML report into this file. Set to an empty value to disable the HTML report. | `tmp/report.%domain%.%datetime%.html` |
| `--output-json-file` | Save the report as JSON. Set to an empty value to disable the JSON report. | `tmp/output.%domain%.%datetime%.json` |
| `--output-text-file` | Save the output as TXT. Set to an empty value to disable the TXT report. | `tmp/output.%domain%.%datetime%.txt` |
| `--add-host-to-output-file` | Append the initial URL host to the output filename (except for sitemaps). | |
| `--add-timestamp-to-output-file` | Append a timestamp to the output filename (except for sitemaps). | |
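
To place the reports at fixed paths and skip the TXT output, something like this sketch can be used (the file names, domain and launcher name are illustrative):

```bash
# Write the HTML and JSON reports to explicit paths and disable the TXT output
# by setting it to an empty value.
./crawler --url=https://example.com \
  --output-html-report=reports/example.html \
  --output-json-file=reports/example.json \
  --output-text-file=''
```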

Mailer Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--mail-to` | Recipients of HTML e-mail reports. Optional, but required to activate the mailer. You can specify multiple e-mail addresses separated by commas. | |
| `--mail-from` | E-mail sender address. | `siteone-crawler@your-hostname.com` |
| `--mail-from-name` | E-mail sender name. | `SiteOne Crawler` |
| `--mail-subject-template` | E-mail subject template. You can use the dynamic variables `%domain%` and `%datetime%`. | `Crawler Report for %domain% (%date%)` |
| `--mail-smtp-host` | SMTP host. | `localhost` |
| `--mail-smtp-port` | SMTP port. Unfortunately, only unencrypted SMTP on port 25 is supported in the current version. | `25` |
| `--mail-smtp-user` | SMTP user, if your SMTP server requires authentication. | |
| `--mail-smtp-pass` | SMTP password, if your SMTP server requires authentication. | |
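
A sketch of e-mailing the report through a plain SMTP relay (all addresses and the host are placeholders; launcher name assumed as above):

```bash
# Send the HTML report to two recipients via an unencrypted SMTP relay
# (only port 25 is supported in the current version).
./crawler --url=https://example.com \
  --mail-to=dev@example.com,qa@example.com \
  --mail-from=crawler@example.com \
  --mail-smtp-host=smtp.example.com \
  --mail-smtp-port=25
```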

Offline Exporter Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--offline-export-dir` | Path to the directory where the offline version of the website is saved. | |
| `--offline-export-store-only-url-regex` | Store only URLs that match one of these PCRE regexes. Activates debug mode. | |
| `--remove-all-anchor-listeners` | Remove all event listeners from links on the page. Useful on some sites built with modern JS frameworks (React, Svelte, Vue, Angular, etc.) that would otherwise compose content dynamically. | |
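
An offline export of a JavaScript-heavy site might be launched roughly like this (the directory, domain and launcher name are illustrative):

```bash
# Export the website so it can be browsed from disk via file://; JavaScript is
# disabled and anchor event listeners are removed.
./crawler --url=https://example.com \
  --offline-export-dir=tmp/offline/example.com \
  --disable-javascript \
  --remove-all-anchor-listeners
```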

Sitemap Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--sitemap-xml-file` | File path where the generated XML sitemap will be saved. Extension `.xml` is automatically added if not specified. | |
| `--sitemap-txt-file` | File path where the generated TXT sitemap will be saved. Extension `.txt` is automatically added if not specified. | |
| `--sitemap-base-priority` | Base priority for the XML sitemap. | `0.5` |
| `--sitemap-priority-increase` | Priority increase value based on the slashes count in the URL. | `0.1` |
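
Generating both sitemap formats in one run could look like this sketch (paths, domain and launcher name are illustrative):

```bash
# Write XML and TXT sitemaps; the .xml/.txt extensions would also be appended
# automatically if omitted.
./crawler --url=https://example.com \
  --sitemap-xml-file=tmp/sitemap.xml \
  --sitemap-txt-file=tmp/sitemap.txt \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1
```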

Fastest URL Analyzer

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--fastest-urls-top-limit` | Number of URL addresses in the TOP fastest URL addresses list. | `20` |
| `--fastest-urls-max-time` | The maximum response time for a URL address to be evaluated as fast. | `1` |

SEO and OpenGraph Analyzer

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--max-heading-level` | Maximum analyzed heading level, from 1 to 6. | `3` |

Slowest URL Analyzer

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--slowest-urls-top-limit` | Number of URL addresses in the TOP slowest URL addresses list. | `20` |
| `--slowest-urls-min-time` | The minimum response time for a URL address to be added to the TOP slowest selection. | `0.01` |
| `--slowest-urls-max-time` | The maximum response time for a URL address to be evaluated as very slow. | `3` |
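
The thresholds of the three analyzers above can be tuned together, for example (all values, the domain and the launcher name are illustrative, not recommendations):

```bash
# Report the 10 fastest URLs under 0.5 s and the 10 slowest URLs over 2 s,
# and analyze headings down to level 4.
./crawler --url=https://example.com \
  --fastest-urls-top-limit=10 \
  --fastest-urls-max-time=0.5 \
  --slowest-urls-top-limit=10 \
  --slowest-urls-min-time=0.1 \
  --slowest-urls-max-time=2 \
  --max-heading-level=4
```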

Debug Settings

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--debug` | Activate debug mode. | |
| `--debug-log-file` | Log file where debug messages are saved. When `--debug` is not set but `--debug-log-file` is, logging is active without visible console output. | |
| `--debug-url-regex` | Regex for URL(s) to debug. When a crawled URL matches, parsing, URL replacing, and other actions are printed to the output. Can be specified multiple times. | |
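
A focused debugging run limited to a subset of URLs might look like this (the log path, regex, domain and launcher name are illustrative):

```bash
# Log detailed debug messages for /checkout/ URLs only; because --debug is not
# set, logging stays in the file without visible console debug output.
./crawler --url=https://example.com \
  --debug-log-file=tmp/debug.log \
  --debug-url-regex='/\/checkout\//'
```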