Basic HTML-only Crawl

HTML-only crawling is a lightweight approach that focuses on analyzing your website’s structure and content without downloading additional assets like images, stylesheets, or JavaScript files. This method is faster and consumes less bandwidth, making it ideal for:

Quick site structure analysis
Content auditing
Link validation
Initial SEO assessments

Basic HTML-only Crawl Command

The simplest way to perform an HTML-only crawl is:

./siteone-crawler --url=https://example.com/ --disable-all-assets

The --disable-all-assets flag tells the crawler to skip all non-HTML content (images, stylesheets, scripts, and other files), which reduces crawl time, bandwidth, and resource usage.

Sample Output

When running an HTML-only crawl, you’ll receive output similar to this:

SiteOne Crawler v2.5.1 (2026-06-27) - https://crawler.siteone.io/
         URL                           Status   Time   Mime type        Title
------------------------------------------------------------------------------------------
/                                      200 OK   0.18s  text/html        Example Domain
/about                                 200 OK   0.11s  text/html        About Us - Example
/contact                               200 OK   0.09s  text/html        Contact - Example
/products                              200 OK   0.12s  text/html        Products - Example
/products/item1                        200 OK   0.10s  text/html        Item 1 - Products
/products/item2                        200 OK   0.11s  text/html        Item 2 - Products
/blog                                  200 OK   0.14s  text/html        Blog - Example
/blog/post1                            200 OK   0.09s  text/html        Post 1 - Blog
/policies/privacy                      200 OK   0.10s  text/html        Privacy Policy
/policies/terms                        200 OK   0.08s  text/html        Terms of Service
------------------------------------------------------------------------------------------
Analysis completed in 1.12s. Found 10 URLs (10 OK, 0 redirects, 0 errors).
Report saved to: tmp/example.com.report.20240115-120000.html
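
If you save the console table to a text file, a few lines of shell can summarize it. The sketch below counts rows per HTTP status code. It assumes the saved table is plain text without color codes and that each URL row starts with a path followed by the status code, as in the sample output above. The function name summarize_statuses and the sample file are hypothetical, chosen only for this illustration.

```shell
#!/bin/sh
# Count URL rows per HTTP status code in a saved copy of the crawler's table.
# Assumption: URL rows start with "/" and the second field is the status code.

summarize_statuses() {
  # $1 = path to the saved table; header and separator lines are ignored
  # because they do not start with "/".
  awk '$1 ~ /^\// { count[$2]++ }
       END { for (code in count) print code, count[code] }' "$1" | sort
}

# Hypothetical sample data for demonstration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/                                      200 OK          0.18s  text/html  Example Domain
/about                                 200 OK          0.11s  text/html  About Us - Example
/old-page                              404 Not Found   0.07s  text/html  Not Found
EOF

summarize_statuses "$sample"   # prints "200 2" then "404 1"
rm -f "$sample"
```

Sorting the awk output keeps the summary stable, since awk does not guarantee the iteration order of array keys.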

Detailed HTML-only Crawl Example

For more detailed analysis while still maintaining the lightweight approach, you can use additional options:

./siteone-crawler --url=https://example.com/ \
  --disable-all-assets \
  --max-depth=3 \
  --workers=5 \
  --show-inline-warnings \
  --output-html-report=example-html-only.html

This command:

Crawls only HTML pages on example.com
Limits crawling to a maximum depth of 3 levels
Uses 5 concurrent workers for faster crawling
Shows warnings directly in the URL table
Saves an HTML report to the specified file
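
When the same crawl is repeated for several sites, it can help to generate the command from a few parameters. The following is a minimal sketch that only prints the command (a dry run) rather than executing it. The names CRAWLER_BIN and build_crawl_cmd, and the report naming scheme derived from the host name, are assumptions made for this example and are not part of the crawler.

```shell
#!/bin/sh
# Print an HTML-only crawl command for a given URL, depth, and worker count.
# CRAWLER_BIN and build_crawl_cmd are hypothetical names for this sketch.

CRAWLER_BIN="${CRAWLER_BIN:-./siteone-crawler}"

build_crawl_cmd() {
  url="$1"
  depth="${2:-3}"     # default depth matches the example above
  workers="${3:-5}"   # default worker count matches the example above

  # Derive a report file name from the host: strip the scheme, then the path.
  host=$(printf '%s' "$url" | sed -e 's|^[a-z]*://||' -e 's|/.*$||')

  printf '%s --url=%s --disable-all-assets --max-depth=%s --workers=%s --show-inline-warnings --output-html-report=%s-html-only.html\n' \
    "$CRAWLER_BIN" "$url" "$depth" "$workers" "$host"
}

# Dry run: review the printed command before executing it yourself.
build_crawl_cmd "https://example.com/"
```

Printing instead of executing makes it easy to inspect the flags first. Note that this sketch names the report after the full host (example.com-html-only.html), which differs slightly from the file name used in the command above.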

Use Cases

SEO Structure Analysis

For a quick SEO structure check focusing on page titles and descriptions:

./siteone-crawler --url=https://example.com/ \
  --disable-all-assets \
  --extra-columns="Title,Description"

Broken Link Check

To check for broken internal links efficiently, restrict the analysis to the 404 analyzer:

./siteone-crawler --url=https://example.com/ \
  --disable-all-assets \
  --analyzer-filter-regex="Page404Analyzer"

Content Audit

To audit page titles and top-level (H1) headings:

./siteone-crawler --url=https://example.com/ \
  --disable-all-assets \
  --extra-columns="Title(40),Heading1=xpath://h1/text()(40)"

The predefined extra columns are Title, Description, Keywords, and DOM. Other values, such as a page’s H1, are extracted with a custom XPath or regexp column using the Name=xpath:... or Name=regexp:... syntax; see the --extra-columns option for details.

Benefits of HTML-only Crawling

Speed: Significantly faster than full crawls (typically 3-10x faster)
Efficiency: Lower bandwidth and memory usage
Focus: Concentrates on page structure and content
Reduced server load: Minimizes impact on the target server
Simplicity: Provides clearer reports for content-focused analysis

Limitations

While HTML-only crawling is efficient, be aware of these limitations:

No analysis of CSS, JavaScript, or image issues
Cannot check resource loading performance
Misses SEO factors related to non-HTML content
Cannot validate multimedia content

Advanced Options

For more control over your HTML-only crawl, consider these additional flags:

--include-regex=<regex>: Only crawl URLs matching the regex pattern
--ignore-regex=<regex>: Skip URLs matching the regex pattern
--max-visited-urls=<int>: Limit the total number of URLs crawled
--timeout=<int>: Set request timeout in seconds (default is 5)
--rows-limit=<int>: Limit the number of rows displayed in the results table
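
Before passing patterns to --include-regex and --ignore-regex, it can be useful to preview them against a list of known paths. The sketch below uses grep -E for that. It rests on two assumptions that are not confirmed here: that a URL is crawled when it matches the include pattern and does not match the ignore pattern, and that grep's extended regex syntax is close enough to the crawler's matcher for simple patterns. The crawler may expect a different regex dialect or delimiters, so treat the result as a rough check. The function name preview_filter is hypothetical.

```shell
#!/bin/sh
# Preview which paths an include/ignore regex pair would keep.
# Assumption: grep -E only approximates the crawler's own regex matching.

preview_filter() {
  include="$1"
  ignore="$2"
  # Keep lines that match the include pattern, then drop lines that
  # match the ignore pattern. Input is read from stdin, one path per line.
  grep -E -- "$include" | grep -Ev -- "$ignore"
}

# Hypothetical path list for demonstration only.
printf '%s\n' /blog /blog/post1 /blog/tag/news /products |
  preview_filter '^/blog' '/tag/'
# prints /blog and /blog/post1
```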

Next Steps

After completing a basic HTML-only crawl, you might want to:

Generate a sitemap based on the discovered pages
Perform a more detailed SEO analysis on specific sections
Export the site to markdown for content review
Set up a more comprehensive full website crawl for deeper analysis