Required. HTTP or HTTPS URL of the website to crawl. If you provide only a domain name without a scheme, https:// is added automatically. Use quotation marks if the URL contains query parameters.
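For instance, a minimal invocation might look like this (the binary name crawler and the option name --url are assumptions for illustration):

```bash
# Scheme is optional; https:// is assumed for a bare domain.
crawler --url=my-domain.tld

# Quote the URL when it contains query parameters, so the shell
# does not interpret ? or & itself.
crawler --url="https://my-domain.tld/catalog?page=2&sort=price"
```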
Device type for choosing a predefined User-Agent. Supported values: desktop, tablet, mobile. Ignored when --user-agent is defined.
Console output type. Supported values: text or json. With json, only a one-line progress indicator is printed to STDERR during crawling, and at the end detailed crawling results are generated as JSON on STDOUT.
Comma-delimited list of extra columns added to the output table. You can specify HTTP header names (e.g. X-Cache) or the predefined columns Title, Keywords, Description, or DOM (the number of DOM elements found in the HTML). You can set the expected column length in parentheses and append > for do-not-truncate, e.g. DOM(6),X-Cache(10),Title(40>),Description(50>).
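A hedged example of the column syntax in a full command (the option name --extra-columns and the binary name are assumptions):

```bash
# DOM is capped at 6 chars and X-Cache at 10; Title may exceed
# 40 chars thanks to the trailing > (do-not-truncate).
crawler --url=https://my-domain.tld --extra-columns='DOM(6),X-Cache(10),Title(40>)'
```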
In the text output, long URLs are truncated by default to --url-column-size so the table does not wrap due to long URLs. With this option, you can turn off the truncation.
For example, it is very useful to disable JavaScript on modern websites built e.g. with React and Next.js that use SSR, because they work fine without JavaScript for content browsing and navigation. Disabling JavaScript is particularly useful when exporting such websites to an offline form (without an HTTP server), where it is almost impossible to make the website work from an arbitrary location on disk via the file:// protocol alone.
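A sketch of this workflow, assuming the option is named --disable-javascript (the flag and binary names are assumptions):

```bash
# Crawl an SSR React/Next.js site without executing client-side JS,
# e.g. as a step before exporting it to a form that works over file://.
crawler --url=https://my-nextjs-site.tld --disable-javascript
```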
Maximum number of concurrent workers (threads). The crawler will not make more simultaneous requests to the server than this number. Use with caution! A high number of workers can effectively cause a DoS attack on the target server.
Memory limit in units of M (megabytes) or G (gigabytes). If you crawl a large website and encounter an out-of-memory error, we recommend setting --result-storage=file.
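For example, a large-site setup might combine these limits (the option names --workers and --memory-limit are assumptions; --result-storage=file is taken from the recommendation above):

```bash
# Large website: cap memory at 2 GB, keep concurrency modest, and
# move the content/headers cache to disk to reduce memory use.
crawler --url=https://big-site.tld --workers=3 --memory-limit=2G --result-storage=file
```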
Allows you to enable a domain (or several domains via multiple definitions) from which the crawler may download static external files (JS, CSS, fonts, images, etc.). You can enable all domains via *, or use a wildcard such as *.my-domain.tld. Otherwise, only files from the same domain are downloaded.
Allows crawling of content from other linked domains. Useful e.g. for traversing *.my-domain.tld subdomains or other TLD-driven language mutations such as *.my-domain.*. If you set *, you are crazy and may start crawling the entire internet (via the external links on your website).
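A sketch of both domain options together; the option names --allowed-domain-for-external-files and --allowed-domain-for-crawling are assumptions here:

```bash
# Allow static assets from a CDN domain and crawl all language
# subdomains. Repeat an option to allow multiple domains.
crawler --url=https://www.my-domain.tld \
  --allowed-domain-for-external-files='cdn.my-domain.tld' \
  --allowed-domain-for-crawling='*.my-domain.tld'
```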
PCRE-compatible regular expression for URLs that should be included. The argument can be specified multiple times. Example: --include-regex='/^\/public\//'
PCRE-compatible regular expression for URLs that should be ignored. The argument can be specified multiple times. Example: --ignore-regex='/^.*\/downloads\/.*\.pdf$/i'
Set this option if you want the *-regex filtering rules to apply only to page URLs, while static assets (JS, CSS, images, fonts, documents) are loaded regardless of the filters. Useful when you want to crawl only /sub-pages/ via --include-regex='/\/sub-pages\//' but assets have to be loaded from any URL. See the combined example after this option.
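A combined example of the regex filters (--include-regex and --ignore-regex appear above; the name of the only-for-pages switch, --regex-filtering-only-for-pages, is an assumption):

```bash
# Crawl only /sub-pages/ URLs and skip PDF downloads, but still load
# JS/CSS/images/fonts from anywhere (filters apply to pages only).
crawler --url=https://my-domain.tld \
  --include-regex='/\/sub-pages\//' \
  --ignore-regex='/^.*\/downloads\/.*\.pdf$/i' \
  --regex-filtering-only-for-pages
```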
PCRE-compatible regular expression applied to analyzer class names to filter which analyzers run. Example: /(content|accessibility)/i, or /^(?:(?!best|access).)*$/i for all analyzers except BestPracticesAnalyzer and AccessibilityAnalyzer.
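For illustration, using the first example pattern above (the option name --analyzer-filter-regex is an assumption):

```bash
# Run only analyzers whose class name contains "content" or "accessibility".
crawler --url=https://my-domain.tld --analyzer-filter-regex='/(content|accessibility)/i'
```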
Remove query parameters from found URLs. Useful on websites where many links point to the same pages and differ only in irrelevant query parameters.
Result storage type for the content and headers cache. Values: memory or file. Use file for large websites to lower memory consumption. See the Caching section.
Start the crawler with a WebSocket server on the given host:port. The WebSocket server is used, for example, by the desktop application, but you can use it for other purposes. Detailed documentation of the sent messages is still missing.
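For example (the option name --websocket-server is an assumption; the host:port value is just a placeholder):

```bash
# Expose crawl progress over WebSocket, e.g. for an external UI.
crawler --url=https://my-domain.tld --websocket-server=localhost:8000
```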
Remove any event listeners from all links on the page. Useful on some types of sites built with modern JS frameworks (React, Svelte, Vue, Angular, etc.) that would otherwise compose content dynamically.
Regex for URL(s) to debug. When a crawled URL matches, the parsing, URL replacing, and other actions are printed to the output. Can be specified multiple times.
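A sketch, assuming the option is named --debug-url-regex and follows the same /pattern/ convention as the filters above:

```bash
# Print detailed parsing and URL-replacement debug output,
# but only for product pages.
crawler --url=https://my-domain.tld --debug-url-regex='/\/product\//'
```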