Command-line options
Basic Settings
Parameter | Description | Default |
---|---|---|
--url | Required. HTTP or HTTPS URL of the website or sitemap XML to crawl. If you provide only a domain name without a scheme, https:// will be added automatically. Use quotation marks if the URL contains query parameters. | |
--single-page | Load only the page at the given URL (and its assets), but do not follow links to other pages. | |
--max-depth | Maximum crawling depth (for pages, not assets). 0 means no limit, 1 means only /about or /about/ , 2 means /about/contacts etc. | 0 |
--device | Device type for choosing a predefined User-Agent. Supported values: desktop , tablet , mobile . Ignored when --user-agent is defined. | desktop |
--user-agent | Custom User-Agent header; overrides the User-Agent selected by --device . Use quotation marks. If you add ! at the end, the siteone-crawler/version signature will not be appended to the final User-Agent. | |
--timeout | Request timeout (in seconds). | 5 |
--proxy | HTTP proxy to use in host:port format. Host can be a hostname, IPv4 or IPv6 address. | |
--http-auth | Basic HTTP authentication in username:password format. | |
--help | Show help and exit. | |
--version | Show crawler version and exit. |
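For example, a basic crawl combining several of these settings could look like the following sketch (the ./crawler launcher name, domain and proxy host are illustrative placeholders; adjust them to your installation):

```bash
# Crawl a site with a mobile User-Agent, a 10-second request timeout and an
# HTTP proxy; the quotes protect the URL in case it contains query parameters.
./crawler --url="https://example.com/" --device=mobile --timeout=10 \
  --proxy=proxy.example.com:8080
```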
Output Settings
Parameter | Description | Default |
---|---|---|
--output | Console output type. Supported values: text or json . The JSON output displays only one-line progress on STDERR during crawling and at the end generates JSON on STDOUT with crawling details. | text |
--extra-columns | Comma-delimited list of extra columns added to the output table. It is possible to specify HTTP header names (e.g. X-Cache ) or predefined Title , Keywords , Description or DOM for the number of DOM elements found in the HTML. You can set the expected length of the column in parentheses and > for do-not-truncate - e.g. DOM(6),X-Cache(10),Title(40>),Description(50>) . For custom extraction, use the format Custom_column_name=method:pattern#group(length) , where method is xpath or regexp , pattern is the extraction pattern, an optional #group specifies the capturing group (or node index for XPath) to return (defaulting to the entire match or first node), and an optional (length) sets the maximum output length (append > to disable truncation). For example, use Heading1=xpath://h1/text()(20>) to extract the text of the first H1 element from the HTML document, and ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10) to extract a numeric price (e.g., "29.99") from a string like "Price: $29.99". | |
--url-column-size | Basic URL column width. By default, it is calculated from the size of your terminal window. | |
--rows-limit | Max. number of rows to display in tables with analysis results (protection against very long and slow reports). | 200 |
--timezone | Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g., Europe/Prague . Default is UTC . Available values can be found at Timezones Documentation. | UTC |
--show-inline-criticals | Show criticals from the analyzer directly in the URL table. | |
--show-inline-warnings | Show warnings from the analyzer directly in the URL table. | |
--do-not-truncate-url | In the text output, long URLs are truncated by default to --url-column-size so the table does not wrap due to long URLs. With this option, you can turn off the truncation. | |
--show-scheme-and-host | In the text output, show the scheme and host also for URLs on the origin domain. | |
--hide-progress-bar | Hide the progress bar visible in text and JSON output for a more compact view. | |
--no-color | Disable colored output. | |
--force-color | Force colored output regardless of support detection. |
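A sketch of a machine-readable run using these output options (example.com, the chosen columns and the row limit are placeholders):

```bash
# Emit JSON on STDOUT (one-line progress goes to STDERR) and add extra columns:
# Title (40 chars, never truncated) and the X-Cache response header (10 chars).
./crawler --url="https://example.com/" --output=json \
  --extra-columns="Title(40>),X-Cache(10)" --rows-limit=500 --no-color
```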
Upload options
Parameter | Description | Default Value |
---|---|---|
--upload | Enable HTML report upload to --upload-to . | |
--upload-to | URL of the endpoint where to send the HTML report. | https://crawler.siteone.io/up |
--upload-retention | How long should the HTML report be kept in the online version? Values: 1h / 4h / 12h / 24h / 3d / 7d / 30d / 365d / forever . | 30d |
--upload-password | Optional password required to view the online HTML report (the username is crawler ). | |
--upload-timeout | Upload timeout in seconds. | 3600 |
See Online HTML report (upload) for more information.
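An illustrative upload to the default endpoint with a one-week retention and a password (the password value is a placeholder):

```bash
# Upload the HTML report, keep it online for 7 days and protect it with a
# password (the viewer logs in with username "crawler").
./crawler --url="https://example.com/" --upload --upload-retention=7d \
  --upload-password="s3cret"
```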
Resource Filtering
For example, it is very useful to disable JavaScript on modern websites, e.g. those built on React with NextJS, which use SSR and therefore work fine without JavaScript for content browsing and navigation.
Disabling JavaScript is particularly useful when exporting websites built e.g. on React to an offline form (without an HTTP server), where it is almost impossible to make the website work from an arbitrary location on the disk only through the file:// protocol.
Parameter | Description | Default |
---|---|---|
--disable-all-assets | Disables crawling of all assets and files and only crawls pages in href attributes. Shortcut for calling all other --disable-* flags. | |
--disable-javascript | Disables JavaScript downloading and removes all JavaScript code from HTML, including onclick and other on* handlers. | |
--disable-styles | Disables CSS file downloading and at the same time removes all style definitions by <style> tag or inline by style attributes. | |
--disable-fonts | Disables font downloading and also removes all font/font-face definitions from CSS. | |
--disable-images | Disables downloading of all images and replaces images found in the HTML with a placeholder image. | |
--disable-files | Disables downloading of any files (typically downloadable documents) to which various links point. |
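A minimal sketch of crawling pages without their JavaScript and CSS (use --disable-all-assets instead if you want to skip every asset type):

```bash
# Crawl content only: skip JavaScript and stylesheets so the crawler focuses
# on pages; images, fonts and documents are still downloaded.
./crawler --url="https://example.com/" --disable-javascript --disable-styles
```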
Advanced Crawler Settings
Parameter | Description | Default |
---|---|---|
--workers / -w | Maximum number of concurrent workers (threads). Crawler will not make more simultaneous requests to the server than this number. Use carefully! A high number of workers can cause a DoS attack. | 3 (1 on Windows) |
--max-reqs-per-sec / -rps | Max requests/s for the whole crawler across all workers. Use carefully! A high request rate can have the effect of a DoS attack. | 10 |
--memory-limit | Memory limit in units M (Megabytes) or G (Gigabytes). If you crawl a large website and encounter an out-of-memory error, we recommend setting --result-storage=file . | 2048M |
--resolve | Force a domain+port to resolve to a specific IP address, just like curl --resolve does. Example: --resolve='www.mydomain.tld:80:127.0.0.1' . Can be specified multiple times. | |
--allowed-domain-for-external-files | Allows a domain (or domains, via multiple definitions) from which the crawler can download static external files (JS, CSS, fonts, images, etc.). You can allow all domains via * , or use a wildcard such as *.my-domain.tld . Otherwise, only files from the same domain will be downloaded. | |
--allowed-domain-for-crawling | Allows crawling of content from other linked domains. Useful e.g. for traversing *.my-domain.tld subdomains or other TLD-driven language mutations *.my-domain.* . If you set * , be warned: the crawler can start crawling the entire internet by following external links on your website. | |
--single-foreign-page | If crawling of other domains is allowed (using --allowed-domain-for-crawling ), it ensures that when a linked domain is not on the same second-level domain, only the linked page and its assets are crawled from that foreign domain. | |
--include-regex | PCRE-compatible regular expression for URLs that should be included. Argument can be specified multiple times. Example: --include-regex='/^\/public\//' | |
--ignore-regex | PCRE-compatible regular expression for URLs that should be ignored. Argument can be specified multiple times. Example: --ignore-regex='/^.*\/downloads\/.*\.pdf$/i' | |
--regex-filtering-only-for-pages | Set this if you want the *-regex filtering rules to apply only to page URLs, while static assets (JS, CSS, images, fonts, documents) are loaded regardless of filtering. Useful when you want to filter only /sub-pages/ with --include-regex='/\/sub-pages\//' but assets have to be loaded from any URL. | |
--analyzer-filter-regex | PCRE-compatible regular expression applied to Analyzer class names for analyzers filtering. Example: /(content|accessibility)/i or /^(?:(?!best|access).)*$/i for all analyzers except BestPracticesAnalyzer and AccessibilityAnalyzer . | |
--accept-encoding | Custom Accept-Encoding request header. | gzip, deflate, br |
--remove-query-params | Remove query parameters from found URLs. Useful on websites where many links point to the same pages and differ only in irrelevant query parameters. | |
--add-random-query-params | Adds several random query parameters to each URL. With this, it is possible to bypass certain forms of server and CDN caches. | |
--ignore-robots-txt | Should robots.txt content be ignored? Useful for crawling an otherwise internal/private/unindexed site. | |
--max-queue-length | The maximum length of the waiting URL queue. Increase in case of large websites, but expect higher memory requirements. | 9000 |
--max-visited-urls | The maximum number of visited URLs. Increase in case of large websites, but expect higher memory requirements. | 10000 |
--max-skipped-urls | The maximum number of skipped URLs. Increase in case of large websites, but expect higher memory requirements. | 10000 |
--max-url-length | The maximum supported URL length in chars. Increase in case of very long URLs, but expect higher memory requirements. | 2083 |
--max-non200-responses-per-basename | Protection against looping with dynamic non-200 URLs. If a basename (the last part of the URL after the last slash) has more non-200 responses than this limit, other URLs with the same basename will be ignored/skipped. | 5 |
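As an illustration, a polite crawl of a larger site restricted to one section might combine these flags as follows (the /docs/ path, rate limit and worker count are placeholders):

```bash
# Two workers, at most 4 requests/s, crawl only pages under /docs/, skip PDFs,
# and apply the regex filters to pages only so assets still load from anywhere.
./crawler --url="https://example.com/" --workers=2 --max-reqs-per-sec=4 \
  --include-regex='/^\/docs\//' --ignore-regex='/\.pdf$/i' \
  --regex-filtering-only-for-pages
```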
Expert Settings
Parameter | Description | Default |
---|---|---|
--result-storage | Result storage type for content and headers cache. Values: memory or file . Use file for large websites and lower memory consumption. See Caching section. | memory |
--result-storage-dir | Directory for --result-storage=file . | tmp/result-storage |
--result-storage-compression | Enable compression for results storage. Saves disk space, but uses more CPU. | |
--http-cache-dir | Cache dir for HTTP responses. You can disable HTTP cache by --http-cache-dir='' . See Caching section. | tmp/http-client-cache |
--http-cache-compression | Enable compression for HTTP cache storage. Saves disk space, but uses more CPU. | |
--websocket-server | Start crawler with websocket server on given host:port . WebSocket is used, for example, by a desktop application, but you can use it for other purposes. Detailed documentation of sent messages is still missing. | |
--console-width | Enforce a fixed console width, disabling automatic detection. |
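For very large websites, a low-memory configuration could look like this sketch (the cache directory name is a placeholder):

```bash
# Store results on disk with compression instead of in memory, and keep the
# HTTP response cache in a custom directory.
./crawler --url="https://example.com/" --result-storage=file \
  --result-storage-compression --http-cache-dir=tmp/my-http-cache
```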
File Export Settings
Parameter | Description | Default |
---|---|---|
--output-html-report | Save the HTML report into this file. Set to an empty value '' to disable the HTML report. | tmp/report.%domain%.%datetime%.html |
--output-json-file | Save the report as JSON. Set to an empty value '' to disable the JSON report. | tmp/output.%domain%.%datetime%.json |
--output-text-file | Save the output as TXT. Set to an empty value '' to disable the TXT report. | tmp/output.%domain%.%datetime%.txt |
--add-host-to-output-file | Append the host of the initial URL to output filenames (except sitemaps). | |
--add-timestamp-to-output-file | Append a timestamp to output filenames (except sitemaps). | |
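An illustrative run that writes the HTML report to a fixed path and disables the JSON and TXT outputs (the report path is a placeholder):

```bash
# Keep only the HTML report; empty values disable the JSON and TXT files.
./crawler --url="https://example.com/" \
  --output-html-report=reports/example.html \
  --output-json-file='' --output-text-file=''
```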
Mailer Options
Section titled “Mailer Options”Parameter | Description | Default |
---|---|---|
--mail-to | Recipients of the HTML e-mail reports. Setting this parameter activates the mailer. You can specify multiple e-mail addresses separated by commas. | |
--mail-from | E-mail sender address. | siteone-crawler@your-hostname.com |
--mail-from-name | E-mail sender name. | SiteOne Crawler |
--mail-subject-template | E-mail subject template. You can use dynamic variables %domain% and %datetime%. | Crawler Report for %domain% (%date%) |
--mail-smtp-host | SMTP host. | localhost |
--mail-smtp-port | SMTP port. Unfortunately, only unencrypted SMTP port 25 is supported in the current version. | 25 |
--mail-smtp-user | SMTP user, if your SMTP server requires authentication. | |
--mail-smtp-pass | SMTP password, if your SMTP server requires authentication. |
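A sketch of sending the report by e-mail through a local SMTP server (the addresses are placeholders):

```bash
# Activate the mailer by setting --mail-to; the report is sent via unencrypted
# SMTP on port 25 (the only mode supported in the current version).
./crawler --url="https://example.com/" \
  --mail-to="dev@example.com,qa@example.com" \
  --mail-from="crawler@example.com" \
  --mail-smtp-host=localhost --mail-smtp-port=25
```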
Offline Exporter Options
The offline export feature is activated by setting the --offline-export-dir parameter. All other parameters are optional.
Parameter | Description | Default |
---|---|---|
--offline-export-dir | Path to directory for saving the offline version of the website. | |
--offline-export-store-only-url-regex | Store only URLs which match one of these PCRE regexes. Activates debug mode. | |
--remove-all-anchor-listeners | Remove all event listeners from links on the page. Useful on some sites built with modern JS frameworks (React, Svelte, Vue, Angular, etc.) that try to compose content dynamically. | |
--offline-export-remove-unwanted-code | Remove unwanted code for offline mode? Typically JS for analytics, social networks, cookie consent, cross-origin requests, etc. | 1 |
--offline-export-no-auto-redirect-html | Disable automatic creation of redirect HTML files for subfolders that contain an index.html file. This solves situations for URLs where sometimes the URL ends with a slash, sometimes it doesn’t. | |
--replace-content | Replace HTML/JS/CSS content with foo -> bar or regexp in PREG format: /card[0-9]/i -> card . | |
--replace-query-string | Instead of replacing the query string in the filename with a short hash, just replace some characters. You can use the simple format 'foo -> bar' or a regexp in PREG format, e.g. '/([a-z]+)=([^&]*)(&|$)/i -> $1__$2' . | |
--ignore-store-file-error | Ignores any file storing errors. The export process will continue. |
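For example, an offline export of a JavaScript-heavy site could be sketched as follows (the export directory is a placeholder):

```bash
# Export an offline copy that works over the file:// protocol: strip JavaScript
# and remove analytics/consent scripts (the latter is already the default).
./crawler --url="https://example.com/" --offline-export-dir=tmp/example-offline \
  --disable-javascript --offline-export-remove-unwanted-code=1
```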
Markdown exporter options
The markdown export feature is activated by setting the --markdown-export-dir parameter. All other parameters are optional.
Parameter | Description | Default |
---|---|---|
--markdown-export-dir | Path to directory where to save the markdown version of the website. Directory will be created if it doesn’t exist. | |
--markdown-export-single-file | Path to a file where to save the combined markdown files into one document. Requires --markdown-export-dir to be set. Ideal for AI tools that need to process the entire website content in one go. | |
--markdown-move-content-before-h1-to-end | Move all content before the main H1 heading (typically the header with the menu) to the end of the markdown. | |
--markdown-disable-images | Do not export and show images in markdown files. Images are enabled by default. | |
--markdown-disable-files | Do not export and link files other than HTML/CSS/JS/fonts/images, e.g. PDF, ZIP, etc. These files are enabled by default. | |
--markdown-remove-links-and-images-from-single-file | Remove links and images from the combined single markdown file. Useful for AI tools that don’t need these elements. Requires --markdown-export-single-file to be set. | |
--markdown-exclude-selector | Exclude some page content (DOM elements) from markdown export, defined by CSS selectors like 'header' , '.header' , '#header' , etc. Can be specified multiple times. | |
--markdown-replace-content | Replace text content with foo -> bar or regexp in PREG format: /card[0-9]/i -> card . | |
--markdown-replace-query-string | Instead of replacing the query string in the filename with a short hash, just replace some characters. You can use the simple format 'foo -> bar' or a regexp in PREG format, e.g. /([a-z]+)=([^&]*)(&|$)/i -> $1__$2 . | |
--markdown-export-store-only-url-regex | For debugging: when set, activates debug mode and stores only URLs that match one of these PCRE regexes. Can be specified multiple times. | |
--markdown-ignore-store-file-error | Ignores any file storing errors. The export process will continue. |
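An illustrative markdown export aimed at feeding a whole site to an AI tool (the paths and selectors are placeholders):

```bash
# Export per-page markdown, combine everything into a single file, and exclude
# the header and footer elements from the generated markdown.
./crawler --url="https://example.com/" --markdown-export-dir=tmp/example-md \
  --markdown-export-single-file=tmp/example.md \
  --markdown-exclude-selector='header' --markdown-exclude-selector='footer'
```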
Sitemap Options
Parameter | Description | Default |
---|---|---|
--sitemap-xml-file | File path where generated XML Sitemap will be saved. Extension .xml is automatically added if not specified. | |
--sitemap-txt-file | File path where generated TXT Sitemap will be saved. Extension .txt is automatically added if not specified. | |
--sitemap-base-priority | Base priority for XML sitemap. | 0.5 |
--sitemap-priority-increase | Priority increase value based on slashes count in the URL. | 0.1 |
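A sketch generating both sitemap formats, with the default priority settings written out explicitly (the file paths are placeholders):

```bash
# Write XML and TXT sitemaps; URL priorities are derived from the base priority
# and the per-slash increase configured below.
./crawler --url="https://example.com/" --sitemap-xml-file=tmp/sitemap.xml \
  --sitemap-txt-file=tmp/sitemap.txt --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1
```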
Fastest URL Analyzer
Section titled “Fastest URL Analyzer”Parameter | Description | Default |
---|---|---|
--fastest-urls-top-limit | Number of URLs shown in the TOP fastest URLs report. | 20 |
--fastest-urls-max-time | The maximum response time (in seconds) for a URL to be evaluated as fast. | 1 |
SEO and OpenGraph Analyzer
Section titled “SEO and OpenGraph Analyzer”Parameter | Description | Default |
---|---|---|
--max-heading-level | Maximum analyzed heading level, from 1 to 6. | 3 |
Slowest URL Analyzer
Section titled “Slowest URL Analyzer”Parameter | Description | Default |
---|---|---|
--slowest-urls-top-limit | Number of URLs shown in the TOP slowest URLs report. | 20 |
--slowest-urls-min-time | The minimum response time (in seconds) for a URL to be added to the TOP slowest selection. | 0.01 |
--slowest-urls-max-time | The maximum response time (in seconds) for a URL to be evaluated as very slow. | 3 |
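Finally, an illustrative run tuning the analyzer limits described in the last three tables (the threshold values are placeholders):

```bash
# Report the 10 fastest and 10 slowest URLs, analyze headings up to H2, and
# treat responses above 2 seconds as very slow.
./crawler --url="https://example.com/" --fastest-urls-top-limit=10 \
  --max-heading-level=2 --slowest-urls-top-limit=10 --slowest-urls-max-time=2
```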