Basic HTML-only Crawl
HTML-only crawling is a lightweight approach that focuses on analyzing your website’s structure and content without downloading additional assets like images, stylesheets, or JavaScript files. This method is faster and consumes less bandwidth, making it ideal for:
- Quick site structure analysis
- Content auditing
- Link validation
- Initial SEO assessments
Basic HTML-only Crawl Command
The simplest way to perform an HTML-only crawl is:
    ./siteone-crawler --url=https://example.com/ --disable-all-assets
The --disable-all-assets flag tells the crawler to skip downloading any non-HTML content, which dramatically reduces crawl time and resource usage.
Sample Output
When running an HTML-only crawl, you’ll receive output similar to this:
    SiteOne Crawler v2.5.1 (2026-06-27) - https://crawler.siteone.io/
    URL                 Status   Time    Mime type   Title
    ------------------------------------------------------------------------------------------
    /                   200 OK   0.18s   text/html   Example Domain
    /about              200 OK   0.11s   text/html   About Us - Example
    /contact            200 OK   0.09s   text/html   Contact - Example
    /products           200 OK   0.12s   text/html   Products - Example
    /products/item1     200 OK   0.10s   text/html   Item 1 - Products
    /products/item2     200 OK   0.11s   text/html   Item 2 - Products
    /blog               200 OK   0.14s   text/html   Blog - Example
    /blog/post1         200 OK   0.09s   text/html   Post 1 - Blog
    /policies/privacy   200 OK   0.10s   text/html   Privacy Policy
    /policies/terms     200 OK   0.08s   text/html   Terms of Service
    ------------------------------------------------------------------------------------------
    Analysis completed in 1.12s. Found 10 URLs (10 OK, 0 redirects, 0 errors).
    Report saved to: tmp/example.com.report.20240115-120000.html
Detailed HTML-only Crawl Example
For more detailed analysis while still maintaining the lightweight approach, you can use additional options:
    ./siteone-crawler --url=https://example.com/ \
      --disable-all-assets \
      --max-depth=3 \
      --workers=5 \
      --show-inline-warnings \
      --output-html-report=example-html-only.html
This command:
- Crawls only HTML pages on example.com
- Limits crawling to a maximum depth of 3 levels
- Uses 5 concurrent workers for faster crawling
- Shows warnings directly in the URL table
- Saves an HTML report to the specified file
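If you run this crawl regularly, it may help to wrap it in a small shell function. The following is a minimal sketch, not part of SiteOne Crawler: the function name html_only_crawl and the CRAWLER and DRY_RUN variables are assumptions made for this example, while the flags are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper wrapping the detailed HTML-only crawl shown above.
# html_only_crawl, CRAWLER and DRY_RUN are names invented for this sketch.

html_only_crawl() {
  url=$1
  depth=${2:-3}                          # value for --max-depth
  workers=${3:-5}                        # value for --workers
  report=${4:-example-html-only.html}    # value for --output-html-report

  # Rebuild the positional parameters as the crawler's argument list,
  # so every flag stays a single, properly quoted argument.
  set -- \
    "--url=$url" \
    --disable-all-assets \
    "--max-depth=$depth" \
    "--workers=$workers" \
    --show-inline-warnings \
    "--output-html-report=$report"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "${CRAWLER:-./siteone-crawler} $*"
  else
    "${CRAWLER:-./siteone-crawler}" "$@"
  fi
}
```

Calling html_only_crawl https://example.com/ 2 4 out.html would crawl two levels deep with four workers. Setting DRY_RUN=1 beforehand prints the command so you can check it before anything is requested from the server.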
Use Cases
SEO Structure Analysis
For a quick SEO structure check focusing on page titles and descriptions:
    ./siteone-crawler --url=https://example.com/ \
      --disable-all-assets \
      --extra-columns="Title,Description"
Broken Link Check
To efficiently check for broken internal links:
    ./siteone-crawler --url=https://example.com/ \
      --disable-all-assets \
      --analyzer-filter-regex="Page404Analyzer"
Content Audit
To focus on page content and headings:
    ./siteone-crawler --url=https://example.com/ \
      --disable-all-assets \
      --extra-columns="Title(40),Heading1=xpath://h1/text()(40)"
The predefined extra columns are Title, Description, Keywords, and DOM. Other values (such as a page’s H1) are extracted with a custom XPath or regexp column using the Name=xpath:... / Name=regexp:... syntax; see --extra-columns.
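Column specifications contain parentheses, and regexp columns may contain characters such as $, all of which the shell treats specially unless they are quoted. One way to keep long column lists readable is to quote each specification separately and join them afterwards. This is a sketch only: join_columns is a hypothetical helper name, and the column values are the ones used in the command above.

```shell
#!/bin/sh
# Joins column specifications with commas for use with --extra-columns.
# join_columns is a hypothetical helper name used only in this sketch.
# The body runs in a subshell, so changing IFS does not leak to the caller.
join_columns() (
  IFS=,
  printf '%s\n' "$*"
)

# Single quotes keep parentheses (and any $ in a pattern) away from the shell.
columns=$(join_columns 'Title(40)' 'Heading1=xpath://h1/text()(40)')

# Print the full command for inspection before running it.
printf '%s\n' "./siteone-crawler --url=https://example.com/ --disable-all-assets --extra-columns=\"$columns\""
```

The printed command is equivalent to the Content Audit command above; adding another column means adding one more quoted argument to join_columns.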
Benefits of HTML-only Crawling
- Speed: Significantly faster than full crawls (typically 3-10x faster)
- Efficiency: Lower bandwidth and memory usage
- Focus: Concentrates on page structure and content
- Reduced server load: Minimizes impact on the target server
- Simplicity: Provides clearer reports for content-focused analysis
Limitations
While HTML-only crawling is efficient, be aware of these limitations:
- No analysis of CSS, JavaScript, or image issues
- Cannot check resource loading performance
- Misses SEO factors related to non-HTML content
- Cannot validate multimedia content
Advanced Options
For more control over your HTML-only crawl, consider these additional flags:
- --include-regex=<regex>: Only crawl URLs matching the regex pattern
- --ignore-regex=<regex>: Skip URLs matching the regex pattern
- --max-visited-urls=<int>: Limit the total number of URLs crawled
- --timeout=<int>: Set request timeout in seconds (default is 5)
- --rows-limit=<int>: Limit the number of rows displayed in the results table
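These flags can be combined with --disable-all-assets. The sketch below shows one possible way to add each flag only when a value has been provided. The function name scoped_crawl and the CRAWL_*, CRAWLER and DRY_RUN variables are assumptions of this example, not crawler features. Regex values are passed through unchanged, so their syntax is whatever the crawler itself expects.

```shell
#!/bin/sh
# Hypothetical helper: adds the optional flags listed above only when set.
# scoped_crawl and all CRAWL_* / CRAWLER / DRY_RUN names are invented here.

scoped_crawl() {
  url=$1
  # Start from the basic HTML-only crawl and append optional flags.
  set -- "--url=$url" --disable-all-assets

  if [ -n "${CRAWL_INCLUDE_REGEX:-}" ]; then
    set -- "$@" "--include-regex=$CRAWL_INCLUDE_REGEX"
  fi
  if [ -n "${CRAWL_IGNORE_REGEX:-}" ]; then
    set -- "$@" "--ignore-regex=$CRAWL_IGNORE_REGEX"
  fi
  if [ -n "${CRAWL_MAX_URLS:-}" ]; then
    set -- "$@" "--max-visited-urls=$CRAWL_MAX_URLS"
  fi
  if [ -n "${CRAWL_TIMEOUT:-}" ]; then
    set -- "$@" "--timeout=$CRAWL_TIMEOUT"
  fi
  if [ -n "${CRAWL_ROWS_LIMIT:-}" ]; then
    set -- "$@" "--rows-limit=$CRAWL_ROWS_LIMIT"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    printf '%s\n' "${CRAWLER:-./siteone-crawler} $*"
  else
    "${CRAWLER:-./siteone-crawler}" "$@"
  fi
}
```

For example, setting CRAWL_MAX_URLS=200 and CRAWL_TIMEOUT=10 before calling scoped_crawl https://example.com/ caps the crawl at 200 URLs and raises the request timeout from the default 5 seconds to 10.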
💡Next Steps
After completing a basic HTML-only crawl, you might want to:
- Generate a sitemap based on the discovered pages
- Perform a more detailed SEO analysis on specific sections
- Export the site to markdown for content review
- Set up a more comprehensive full website crawl for deeper analysis