
Basic HTML-only Crawl

HTML-only crawling is a lightweight approach that focuses on analyzing your website’s structure and content without downloading additional assets like images, stylesheets, or JavaScript files. This method is faster and consumes less bandwidth, making it ideal for:

  • Quick site structure analysis
  • Content auditing
  • Link validation
  • Initial SEO assessments

The simplest way to perform an HTML-only crawl is:

./siteone-crawler --url=https://example.com/ --disable-all-assets

The --disable-all-assets flag tells the crawler to skip downloading any non-HTML content, dramatically reducing the crawl time and resource usage.

When running an HTML-only crawl, you’ll receive output similar to this:

SiteOne Crawler v2.5.1 (2026-06-27) - https://crawler.siteone.io/
URL                Status  Time   Mime type  Title
------------------------------------------------------------------------------------------
/                  200 OK  0.18s  text/html  Example Domain
/about             200 OK  0.11s  text/html  About Us - Example
/contact           200 OK  0.09s  text/html  Contact - Example
/products          200 OK  0.12s  text/html  Products - Example
/products/item1    200 OK  0.10s  text/html  Item 1 - Products
/products/item2    200 OK  0.11s  text/html  Item 2 - Products
/blog              200 OK  0.14s  text/html  Blog - Example
/blog/post1        200 OK  0.09s  text/html  Post 1 - Blog
/policies/privacy  200 OK  0.10s  text/html  Privacy Policy
/policies/terms    200 OK  0.08s  text/html  Terms of Service
------------------------------------------------------------------------------------------
Analysis completed in 1.12s. Found 10 URLs (10 OK, 0 redirects, 0 errors).
Report saved to: tmp/example.com.report.20240115-120000.html

For more detailed analysis while still maintaining the lightweight approach, you can use additional options:

./siteone-crawler --url=https://example.com/ \
--disable-all-assets \
--max-depth=3 \
--workers=5 \
--show-inline-warnings \
--output-html-report=example-html-only.html

This command:

  • Crawls only HTML pages on example.com
  • Limits crawling to a maximum depth of 3 levels
  • Uses 5 concurrent workers for faster crawling
  • Shows warnings directly in the URL table
  • Saves an HTML report to the specified file

For a quick SEO structure check focusing on titles and descriptions:

./siteone-crawler --url=https://example.com/ \
--disable-all-assets \
--extra-columns="Title,Description"

To efficiently check for broken internal links:

./siteone-crawler --url=https://example.com/ \
--disable-all-assets \
--analyzer-filter-regex="Page404Analyzer"

To focus on page content and headings:

./siteone-crawler --url=https://example.com/ \
--disable-all-assets \
--extra-columns="Title(40),Heading1=xpath://h1/text()(40)"

The predefined extra columns are Title, Description, Keywords, and DOM. Other values (such as a page’s H1) are extracted with a custom XPath or regexp column using the Name=xpath:... / Name=regexp:... syntax. See the --extra-columns option for details.
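When several custom columns are needed, the --extra-columns value can get hard to read. The following is a minimal sketch that assembles it in a shell variable. It assumes the Name=xpath:<expression>(<width>) form shown in the command above; xpath_col is a hypothetical helper defined here for illustration and is not part of SiteOne Crawler.

```shell
#!/bin/sh
# Hypothetical helper (not part of SiteOne Crawler): formats one custom
# column in the Name=xpath:<expression>(<width>) form used above.
xpath_col() {
    # $1 = column name, $2 = XPath expression, $3 = column width
    printf '%s=xpath:%s(%s)' "$1" "$2" "$3"
}

# One predefined column plus one custom XPath column, comma-separated.
EXTRA_COLUMNS="Title(40),$(xpath_col Heading1 '//h1/text()' 40)"

# Print the assembled value so it can be checked before crawling.
printf '%s\n' "$EXTRA_COLUMNS"
```

The resulting value can then be passed as --extra-columns="$EXTRA_COLUMNS" alongside --disable-all-assets.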

HTML-only crawling offers several benefits:

  1. Speed: Typically 3-10x faster than a full crawl
  2. Efficiency: Lower bandwidth and memory usage
  3. Focus: Concentrates on page structure and content
  4. Reduced server load: Minimizes impact on the target server
  5. Simplicity: Provides clearer reports for content-focused analysis

While HTML-only crawling is efficient, be aware of these limitations:

  • No analysis of CSS, JavaScript, or image issues
  • Cannot check resource loading performance
  • Misses SEO factors related to non-HTML content
  • Cannot validate multimedia content

For more control over your HTML-only crawl, consider these additional flags:

  • --include-regex=<regex>: Only crawl URLs matching the regex pattern
  • --ignore-regex=<regex>: Skip URLs matching the regex pattern
  • --max-visited-urls=<int>: Limit the total number of URLs crawled
  • --timeout=<int>: Set request timeout in seconds (default is 5)
  • --rows-limit=<int>: Limit the number of rows displayed in the results table
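The flags above can be combined in a single invocation. The sketch below collects them and prints the resulting command for review rather than running it. All values are illustrative assumptions: the /blog/ and ?page= patterns, the limits, and the timeout are placeholders, and the slash-delimited pattern style is assumed here, so check the option reference for the exact regex format the crawler expects.

```shell
#!/bin/sh
# Sketch: collect the scoping flags listed above into one invocation.
# Every value below is an illustrative assumption, not a recommendation.
set -- \
    --url=https://example.com/ \
    --disable-all-assets \
    --include-regex='/\/blog\//' \
    --ignore-regex='/\?page=/' \
    --max-visited-urls=500 \
    --timeout=10 \
    --rows-limit=50

# Build a printable form of the command and show it instead of running it.
CRAWL_CMD="./siteone-crawler $*"
printf '%s\n' "$CRAWL_CMD"

# To run the crawl for real, use the collected arguments directly:
#   ./siteone-crawler "$@"
```

Keeping the arguments in the positional parameters preserves their quoting, which matters once a regex contains characters the shell would otherwise interpret.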

After completing a basic HTML-only crawl, you might want to: