Skip to content

Command-line options

This page is the complete reference for every command-line option of siteone-crawler. It mirrors the output of siteone-crawler --help, using the same section groupings and order. For each option you will find its description and, where the crawler defines one, its default value.

ParameterDescription
--url=<url>Required URL. It can also be the URL to sitemap.xml. Enclose in quotes if URL contains query parameters.
--url-list=<val>Path to a plain-text file with one URL per line (blank lines and # comments ignored). When provided, --url is optional; the first URL in the file is used as the crawl base. All listed URLs are seeded into the crawl queue.
--single-pageLoad only one page to which the URL is given (and its assets), but do not follow other pages.
--max-depth=<int>Maximum crawling depth (for pages, not assets). Default is 0 (no limit). 1 means /about or /about/, 2 means /about/contacts etc.
--device=<val>Device type for User-Agent selection. Values desktop, tablet, mobile. Ignored with --user-agent. Default value is desktop.
--user-agent=<val>Override User-Agent selected by --device. If you add ! at the end, the SiteOne-Crawler/version will not be added as a signature at the end of the final user-agent.
--timeout=<int>Request timeout (in sec). Default value is 5.
--proxy=<host:port>HTTP proxy in host:port format.
--http-auth=<val>Basic HTTP authentication in username:password format.
--accept-invalid-certsAccept invalid or incomplete SSL/TLS certificates (e.g. expired, self-signed, or missing intermediate CA). Use with caution.
--config-file=<file>Load options from a configuration file (one option per line). Processed before all other arguments; CLI arguments override values from the file. See Configuration file for the format and auto-discovery details.
--helpShow help and exit.
--versionShow crawler version and exit.
ParameterDescription
--output=<val>Output type text or json. Default value is text.
--extra-columns=<val>Extra table headers for output table with option to set width and do-not-truncate (>), e.g. DOM,X-Cache(10),Title(40>).
--url-column-size=<int>URL column width. By default, it is calculated from the size of your terminal window.
--timezone=<val>Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g. Europe/Prague. Default is UTC.
--rows-limit=<int>Max. number of rows to display in tables with analysis results (protection against very long and slow report). Default value is 200.
--show-inline-criticalsShow criticals from the analyzer directly in the URL table.
--show-inline-warningsShow warnings from the analyzer directly in the URL table.
--do-not-truncate-urlAvoid truncating URLs to --url-column-size.
--show-scheme-and-hostShow the schema://host also of the original domain URL as well. By default, only path+query is displayed for original domain.
--hide-progress-barSuppress progress bar in output.
--hide-columns=<val>Hide specified columns from the progress table. Comma-separated list: type, time, size, cache.
--no-colorDisable colored output.
--force-colorForce colored output regardless of support detection.
ParameterDescription
--disable-all-assetsDisables crawling of all assets and files and only crawls pages in href attributes. Shortcut for calling all other --disable-* flags.
--disable-javascriptDisables JavaScript downloading and removes all JavaScript code from HTML, including onclick and other on* handlers.
--disable-stylesDisables CSS file downloading and at the same time removes all style definitions by <style> tag or inline by style attributes.
--disable-fontsDisables font downloading and also removes all font/font-face definitions from CSS.
--disable-imagesDisables downloading of all images and replaces found images in HTML with placeholder image only.
--disable-filesDisables downloading of any files (typically downloadable documents) to which various links point.
--remove-all-anchor-listenersOn all links on the page remove any event listeners. Useful on some types of sites with modern JS frameworks.
ParameterDescription
--workers=<int>Max concurrent workers (threads). Crawler will not make more simultaneous requests to the server than this number. Default value is 3.
--max-reqs-per-sec=<val>Max requests/s for whole crawler. Be careful not to cause a DoS attack. Default value is 10.
--memory-limit=<size>Memory limit in units M (Megabytes) or G (Gigabytes). Default value is 2048M.
--resolve=<domain:port:ip>The ability to force the domain+port to resolve to its own IP address, just like CURL --resolve does. Example: --resolve='www.mydomain.tld:80:127.0.0.1'.
--allowed-domain-for-external-files=<val>Primarily, the crawler crawls only the URL within the domain for initial URL. This allows you to enable loading of file content from another domain as well (e.g. if you want to load assets from a CDN). Can be specified multiple times. You can use domains with wildcard *.
--allowed-domain-for-crawling=<val>This option will allow you to crawl all content from other listed domains — typically in the case of language mutations on other domains. Can be specified multiple times. You can use domains with wildcard *.
--single-foreign-pageIf crawling of other domains is allowed (using --allowed-domain-for-crawling), it ensures that when another domain is not on the same second-level domain, only that linked page and its assets are crawled from that foreign domain.
--include-regex=<regex>Include only URLs matching at least one PCRE regex. Can be specified multiple times.
--ignore-regex=<regex>Ignore URLs matching any PCRE regex. Can be specified multiple times.
--regex-filtering-only-for-pagesSet if you want filtering by *-regex rules to apply only to page URLs, but static assets are loaded regardless of filtering.
--analyzer-filter-regex=<regex>Use only analyzers that match the specified regexp.
--accept-encoding=<val>Set Accept-Encoding request header. Default value is gzip, deflate, br.
--remove-query-paramsRemove URL query parameters from crawled URLs.
--keep-query-param=<val>Keep only the specified query parameter(s) in discovered URLs. All other query parameters are removed. Can be specified multiple times. Ignored when --remove-query-params is active.
--add-random-query-paramsAdd random query parameters to each crawled URL.
--transform-url=<val>Transform URLs before crawling. Format: from -> to or /regex/ -> replacement. Example: live-site.com -> local-site.local or /live-site\.com\/wp/ -> local-site.local/. Can be specified multiple times.
--force-relative-urlsNormalize all discovered URLs matching the initial domain (incl. www variant and protocol differences) to relative paths. Prevents duplicate files in offline export when the site uses inconsistent URL formats.
--ignore-robots-txtShould robots.txt content be ignored? Useful for crawling an otherwise private/unindexed site.
--ignore-html-commentsIgnore URLs found inside HTML comments (<!-- ... -->), which search engines also ignore, so commented links are not crawled or reported as broken.
--max-queue-length=<int>Max URL queue length. It affects memory requirements. Default value is 9000.
--max-visited-urls=<int>Max visited URLs. It affects memory requirements. Default value is 10000.
--max-skipped-urls=<int>Max skipped URLs. It affects memory requirements. Default value is 10000.
--max-url-length=<int>Max URL length in chars. It affects memory requirements. Default value is 2083.
--max-non200-responses-per-basename=<int>Protection against looping with dynamic non-200 URLs. If a basename (the last part of the URL after the last slash) has more non-200 responses than this limit, other URLs with same basename will be ignored/skipped. Default value is 5.
ParameterDescription
--debugActivate debug mode.
--debug-log-file=<file>Log file where to save debug messages. When --debug is not set and --debug-log-file is set, logging will be active without visible output.
--debug-url-regex=<regex>Regex for URL(s) to debug. When crawled URL is matched, parsing, URL replacing and other actions are printed to output. Can be specified multiple times.
--result-storage=<val>Result storage type for content and headers. Values: memory or file. Use file for large websites. Default value is memory.
--result-storage-dir=<dir>Directory for --result-storage=file. Default value is tmp/result-storage.
--result-storage-compressionEnable compression for results storage. Saves disk space, but uses more CPU.
--http-cache-dir=<dir>Cache dir for HTTP responses. Disable with --http-cache-dir='off' or --no-cache. Default value is ~/.cache/siteone-crawler/http-cache.
--http-cache-compressionEnable compression for HTTP cache storage. Saves disk space, but uses more CPU.
--http-cache-ttl=<val>TTL for HTTP cache entries (e.g. 1h, 7d, 30m). Use 0 for infinite. Default: 24h.
--no-cacheDisable HTTP cache completely. Shortcut for --http-cache-dir='off'.
--websocket-server=<host:port>Start crawler with websocket server on given host:port, typically 0.0.0.0:8000.
--console-width=<int>Enforce the definition of the console width and disable automatic detection.
ParameterDescription
--output-html-report=<file>Save HTML report into that file. Set to empty '' to disable HTML report. Default value is tmp/%domain%.report.%datetime%.html.
--html-report-options=<val>Comma-separated list of sections to include in HTML report. Available sections: summary, seo-opengraph, image-gallery, video-gallery, visited-urls, dns-ssl, crawler-stats, crawler-info, headers, content-types, skipped-urls, caching, best-practices, accessibility, security, redirects, 404-pages, slowest-urls, fastest-urls, source-domains. Default: all sections.
--output-json-file=<file>Save report as JSON. Set to empty '' to disable JSON report. Default value is tmp/%domain%.output.%datetime%.json.
--output-text-file=<file>Save output as TXT. Set to empty '' to disable TXT report. Default value is tmp/%domain%.output.%datetime%.txt.
--add-host-to-output-fileAppend initial URL host to filename except sitemaps.
--add-timestamp-to-output-fileAppend timestamp to filename except sitemaps.
ParameterDescription
--mail-to=<email>E-mail report recipient address(es). Can be specified multiple times.
--mail-from=<email>E-mail sender address. Default value is siteone-crawler@your-hostname.com.
--mail-from-name=<val>E-mail sender name. Default value is SiteOne Crawler.
--mail-subject-template=<val>E-mail subject template. You can use dynamic variables %domain% and %datetime%. Default value is Crawler Report for %domain% (%date%).
--mail-smtp-host=<val>SMTP host. Default value is localhost.
--mail-smtp-port=<int>SMTP port. Default value is 25.
--mail-smtp-user=<val>SMTP user for authentication.
--mail-smtp-pass=<val>SMTP password for authentication.

The markdown export feature is activated by entering the --markdown-export-dir parameter. All others are optional.

ParameterDescription
--markdown-export-dir=<dir>Path to directory where to save the markdown version of the website.
--markdown-export-single-file=<file>Path to a file where to save the combined markdown files into one document. Requires --markdown-export-dir to be set.
--markdown-move-content-before-h1-to-endMove all content before the main H1 heading (typically the header with the menu) to the end of the markdown.
--markdown-disable-imagesDo not export and show images in markdown files. Images are enabled by default.
--markdown-disable-filesDo not export and link files other than HTML/CSS/JS/fonts/images — e.g. PDF, ZIP, etc. These files are enabled by default.
--markdown-remove-links-and-images-from-single-fileRemove links and images from the combined single markdown file. Useful for AI tools that don’t need these elements.
--markdown-exclude-selector=<val>Exclude some page content (DOM elements) from markdown export defined by CSS selectors like header, .header, #header, etc.
--markdown-replace-content=<val>Replace text content with foo -> bar or regexp in PREG format: /card[0-9]/i -> card.
--markdown-replace-query-string=<val>Instead of using a short hash instead of a query string in the filename, just replace some characters. You can use simple format foo -> bar or regexp in PREG format, e.g. /([a-z]+)=([^&]*)(&|$)/i -> $1__$2.
--markdown-export-store-only-url-regex=<regex>For debug — when filled it will activate debug mode and store only URLs which match one of these PCRE regexes. Can be specified multiple times.
--markdown-ignore-store-file-errorIgnores any file storing errors. The export process will continue.

The offline export feature is activated by entering the --offline-export-dir parameter. All others are optional.

ParameterDescription
--offline-export-dir=<dir>Path to directory where to save the offline version of the website.
--offline-export-store-only-url-regex=<regex>For debug — when filled it will activate debug mode and store only URLs which match one of these PCRE regexes. Can be specified multiple times.
--offline-export-remove-unwanted-codeRemove unwanted code for offline mode? Typically JS of the analytics, social networks, cookie consent, cross origins, etc. Default value is 1.
--offline-export-no-auto-redirect-htmlDisable automatic creation of redirect HTML files for subfolders that contain an index.html file. This solves situations for URLs where sometimes the URL ends with a slash, sometimes it doesn’t.
--offline-export-preserve-url-structurePreserve the original URL path structure. E.g. /about is stored as about/index.html instead of about.html. Useful for web server deployment.
--offline-export-preserve-urlsPreserve original URL format in exported HTML/CSS/JS. Same-domain links become root-relative (/path), cross-domain links stay absolute. Useful when exported HTML is processed by tools that need production URLs.
--offline-export-no-url-rewritingDisable all URL rewriting in exported HTML/CSS/JS. URLs remain exactly as in the original source. Useful for RAG indexing or other processing where original URLs must be preserved verbatim.
--replace-content=<val>Replace HTML/JS/CSS content with foo -> bar or regexp in PREG format: /card[0-9]/i -> card.
--replace-query-string=<val>Instead of using a short hash instead of a query string in the filename, just replace some characters. You can use simple format foo -> bar or regexp in PREG format, e.g. /([a-z]+)=([^&]*)(&|$)/i -> $1__$2.
--offline-export-lowercaseConvert all filenames to lowercase for offline export. Useful for case-insensitive filesystems.
--ignore-store-file-errorIgnores any file storing errors. The export process will continue.
--disable-astro-inline-modulesDisables inlining of Astro module scripts for offline export. Scripts will remain as external files with corrected relative paths.
ParameterDescription
--sitemap-xml-file=<file>Save sitemap to XML. .xml added if missing.
--sitemap-txt-file=<file>Save sitemap to TXT. .txt added if missing.
--sitemap-base-priority=<val>Base priority for XML sitemap. Default value is 0.5.
--sitemap-priority-increase=<val>Priority increase value based on slashes count in the URL. Default value is 0.1.
ParameterDescription
--uploadEnable HTML report upload to --upload-to.
--upload-to=<url>URL of the endpoint where to send the HTML report. Default value is https://crawler.siteone.io/up.
--upload-retention=<val>How long should the HTML report be kept in the online version? Values: 1h / 4h / 12h / 24h / 3d / 7d / 30d / 365d / forever. Default value is 30d.
--upload-password=<val>Optional password, which must be entered (the user will be crawler) to display the online HTML report.
--upload-timeout=<int>Upload timeout in seconds. Default value is 3600.

See Online HTML report (upload) for more information.

ParameterDescription
--fastest-urls-top-limit=<int>Number of URL addresses in TOP fastest URL addresses. Default value is 20.
--fastest-urls-max-time=<val>The maximum response time for an URL address to be evaluated as fast. Default value is 1.
ParameterDescription
--max-heading-level=<int>Maximal analyzer heading level from 1 to 6. Default value is 3.
ParameterDescription
--slowest-urls-top-limit=<int>Number of URL addresses in TOP slowest URL addresses. Default value is 20.
--slowest-urls-min-time=<val>The minimum response time for an URL address to be added to TOP slow selection. Default value is 0.01.
--slowest-urls-max-time=<val>The maximum response time for an URL address to be evaluated as very slow. Default value is 3.

See CI/CD integration for exit codes, JUnit XML, baseline regression and full pipeline examples.

ParameterDescription
--ciEnable CI/CD quality gate. Crawler exits with code 10 if thresholds are not met.
--ci-min-score=<val>Minimum overall quality score (0.0-10.0). Default value is 5.0.
--ci-min-performance=<val>Minimum Performance category score (0.0-10.0). Default value is 5.
--ci-min-seo=<val>Minimum SEO category score (0.0-10.0). Default value is 5.
--ci-min-security=<val>Minimum Security category score (0.0-10.0). Default value is 5.
--ci-min-accessibility=<val>Minimum Accessibility category score (0.0-10.0). Default value is 3.
--ci-min-best-practices=<val>Minimum Best Practices category score (0.0-10.0). Default value is 5.
--ci-max-404=<int>Maximum number of 404 responses allowed. Default value is 0.
--ci-max-5xx=<int>Maximum number of 5xx server error responses allowed. Default value is 0.
--ci-max-criticals=<int>Maximum number of critical analysis findings allowed. Default value is 0.
--ci-max-warnings=<int>Maximum number of warning analysis findings allowed.
--ci-max-avg-response=<val>Maximum average response time in seconds.
--ci-min-pages=<int>Minimum number of HTML pages that must be found. Default value is 10.
--ci-min-assets=<int>Minimum number of assets (JS, CSS, images, fonts) that must be found. Default value is 10.
--ci-min-documents=<int>Minimum number of documents (PDF, etc.) that must be found. Default value is 0.
--ci-baseline=<file>Path to a previous JSON output used as a baseline for regression checks.
--ci-max-score-drop=<val>Maximum allowed drop of the overall score vs the --ci-baseline run. Default 0 (any drop fails).
--ci-fail-on-code=<val>Fail the build if a finding code (aplCode, e.g. seo-noindex-sitewide) is present. Can be specified multiple times.
--ci-ignore-code=<val>Ignore a finding code (aplCode, e.g. pages-without-lang) when counting criticals/warnings (also suppresses --ci-fail-on-code). Can be specified multiple times.
--ci-junit-file=<file>Write the CI gate result as a JUnit XML report to this file.
--ci-github-annotationsPrint GitHub Actions error annotations for failed CI checks.
ParameterDescription
--serve-markdown=<dir>Start HTTP server to browse a markdown export directory. Renders .md files as styled HTML with table and accordion support. No crawling is performed.
--serve-offline=<dir>Start HTTP server to browse an offline HTML export directory. Serves files with Content-Security-Policy restricting to same origin. No crawling is performed.
--serve-port=<int>Port for the built-in HTTP server (used with --serve-markdown or --serve-offline). Default value is 8321.
--serve-bind-address=<val>Bind address for the built-in HTTP server. Default is 127.0.0.1 (localhost only). Use 0.0.0.0 to listen on all network interfaces.
--html-to-markdown=<val>Convert a local HTML file to Markdown and print to stdout. Uses the same pipeline as --markdown-export-dir. Respects --markdown-disable-images, --markdown-disable-files, --markdown-move-content-before-h1-to-end, and --markdown-exclude-selector. No crawling is performed.
--html-to-markdown-output=<val>Output file path for --html-to-markdown. If not set, markdown is printed to stdout.
ParameterDescription
--ai-provider=<val>AI provider: openai, anthropic, gemini, or openai-compatible (vLLM/LiteLLM/MiniMax/self-hosted). Enables the optional AI features. Default value is openai-compatible.
--ai-endpoint=<url>Base API endpoint URL. Required for openai-compatible; optional override for the other providers.
--ai-model=<val>Model name to call, e.g. MiniMax-M3, gpt-5-mini, claude-sonnet-4-6, gemini-2.5-pro.
--ai-api-key=<val>API key (DISCOURAGED — leaks into ps/shell history/logs). Supports env:VARNAME indirection. Prefer the default conventional env var (OPENAI_API_KEY/ANTHROPIC_API_KEY/GEMINI_API_KEY) or --ai-api-key-file.
--ai-api-key-env=<val>Name of the environment variable to read the API key from.
--ai-api-key-file=<file>Path to a file whose first line is the API key (safest for CI).
--ai-max-tokens=<int>Max output tokens per request. Auto-mapped to max_completion_tokens for OpenAI reasoning models. Raise it further if you enable thinking/reasoning. Default value is 32000.
--ai-use-max-completion-tokensForce max_completion_tokens instead of max_tokens (for endpoints/models that require it). Otherwise auto-detected.
--ai-temperature=<val>Sampling temperature (omitted automatically for OpenAI reasoning models). Default value is 0.0.
--ai-extra-body=<val>JSON object deep-merged into the request body, overriding native fields. Use for thinking/reasoning control and any provider-specific knobs, e.g. {"chat_template_kwargs":{"enable_thinking":false}}.
--ai-synthesis-extra-body=<val>Like --ai-extra-body but applied ONLY to the final report-summary synthesis call (the summary action). Use to enable thinking/max reasoning just for the synthesis, e.g. {} to drop a global enable_thinking:false, or {"reasoning_effort":"high"}.
--ai-actions=<val>Comma-separated AI analyses to run: seo, llms-txt, llms-full, typos, custom, summary. The default runs the full report set; custom (needs a prompt) and llms-txt/llms-full (extra files) are opt-in. Default value is seo,typos,summary.
--ai-prompt-file=<file>Path to a custom prompt file for the custom action. Supports placeholders like {{url}}, {{title}}, {{content_markdown}}.
--ai-prompt=<val>Inline custom prompt for the custom action (alternative to --ai-prompt-file).
--ai-language=<val>Force content language (BCP-47, e.g. cs, de) for the typos action. Auto-detected if unset.
--ai-include=<regex>Only run AI on URLs matching this regex (repeatable). Applied before ranking.
--ai-exclude=<regex>Skip AI on URLs matching this regex (repeatable, wins over --ai-include). E.g. exclude /press/.
--ai-max-pages=<int>Hard cap on the number of pages sent to the LLM (highest-ranked pages kept). Default value is 100.
--ai-max-concurrency=<int>Maximum concurrent AI requests. Default value is 4.
--ai-max-reqs-per-sec=<val>Maximum AI requests per second (rate limit for the LLM API).
--ai-timeout=<int>Per-request timeout for AI calls in seconds (raise it for slow reasoning models). Default value is 180.
--ai-cache-dir=<dir>Directory for caching AI responses. Empty value disables caching. Default value is tmp/ai-cache.
--ai-seo-affects-scoreLet the AI SEO assessment apply a small capped deduction to the SEO quality score (off = advisory only). Note: non-deterministic across runs.
--ai-dry-runShow which pages would be analyzed, the number of LLM calls, and an estimated input-token count, then exit without calling the API.
ParameterDescription
--browserRender each page in a real Chromium browser (CDP) instead of a direct HTTP request. Enables crawling JS-rendered/SPA sites. Included in all builds and the pre-built binaries; it just needs a Chromium-family browser (auto-detected, or --browser-path).
--browser-path=<val>Explicit path to a Chromium/Chrome/Edge/Brave executable. Skips auto-detection and download.
--browser-headfulShow a visible browser window (default is headless). Concurrency is kept low so the windows are watchable.
--browser-workers=<int>Max concurrently rendered pages (separate from --workers; browser pages are heavier). Default value is 3.
--browser-wait=<val>Page-ready wait strategy: load, domcontentloaded, or networkidle (near-idle network). Default value is networkidle.
--browser-wait-extra=<int>Extra settle delay in milliseconds after the wait condition is met. Default value is 0.
--browser-timeout=<int>Hard navigation+render timeout per page, in seconds. On timeout, whatever rendered is captured. Default value is 30.
--browser-render-allRender every URL in the browser. By default only HTML documents are rendered; assets (images/CSS/JS/fonts) are fetched via HTTP.
--browser-auto-downloadPre-consent to downloading chrome-headless-shell when no browser is found (for non-interactive/CI runs; interactive runs prompt instead).
--screenshotsCapture a screenshot of every rendered page (requires --browser).
--screenshots-dir=<val>Directory to save screenshots into. Defaults to tmp/screenshots/.
--screenshot-mode=<val>Screenshot mode: viewport (visible area at the set resolution) or full-page (entire scroll height). Default value is viewport.
--screenshot-viewport=<val>Viewport size WxH used for rendering and viewport screenshots. Default value is 1920x1080.
--screenshot-format=<val>Screenshot image format: png, jpg, or webp. Default value is png.
--screenshot-quality=<int>Image quality (1-100) for jpg/webp screenshots. Default value is 80.
--screenshots-animation=<val>Build an animation from the page screenshots (in crawl order): comma-separated gif,mp4. Requires --browser and --screenshots.
--screenshots-animation-frame-duration=<val>Seconds each page is shown in the animation (0.2-10). Default value is 2.
--screenshots-animation-width=<int>Animation width in px; height is derived from the --screenshot-viewport aspect ratio. Default value is 1024.
--ffmpeg-path=<file>Path to the ffmpeg binary (auto-detected from PATH if omitted). Required for MP4 output; GIF needs no ffmpeg.
--screenshot-hide-cookie-bannersBefore each screenshot, try to dismiss/hide cookie consent banners (best-effort). Requires --browser and --screenshots.
--screenshot-hide-selector=<val>Comma-separated CSS selectors to hide before each screenshot (e.g. a site-specific cookie banner).
--browser-no-sandboxLaunch Chromium with --no-sandbox. Often required in Docker/CI/WSL or when running as root, but it weakens the renderer’s security isolation against untrusted pages.
--console-max-messages=<int>Max console/diagnostic messages per page kept for the AI payload. Default value is 100.
--console-msg-max-chars=<int>Truncate each console/diagnostic message to this many characters in the AI payload. Default value is 200.
--console-total-max-kb=<int>Total size cap (in KB) of the per-page console diagnostics AI payload. Default value is 128.