
Full Website Crawl

A full website crawl with SiteOne Crawler provides a comprehensive analysis of your entire website, including all pages, assets, and resources. This approach gives you the most complete picture of your site’s health, performance, and optimization opportunities.

The simplest way to perform a full website crawl is:

./crawler --url=https://example.com/

Without any restrictive options, the crawler will:

  1. Follow all internal links on the website
  2. Download and analyze all assets (HTML, CSS, JS, images, etc.)
  3. Analyze various aspects of the site (SEO, performance, security, etc.)
  4. Generate a detailed HTML report

For a thorough analysis with all features enabled:

./crawler --url=https://example.com/ \
--workers=8 \
--output-html-report=full-report.html \
--output-json-file=full-report.json \
--sitemap-xml-file=sitemap.xml \
--sitemap-txt-file=sitemap.txt \
--show-inline-warnings \
--show-inline-criticals

This command:

  • Uses 8 concurrent workers for faster crawling
  • Saves a detailed HTML report to full-report.html
  • Exports all data to JSON for further processing
  • Generates XML and TXT sitemaps
  • Shows both warnings and critical issues directly in the URL table

A full crawl generates a comprehensive HTML report with multiple analysis tabs:

  1. Basic Stats: Overall crawl statistics and site health metrics
  2. URL Analysis: Complete list of all crawled URLs with status and metrics
  3. Content Type Analysis: Breakdown of content types, sizes, and load times
  4. SEO & OpenGraph: Analysis of SEO factors and social sharing metadata
  5. Heading Structure: Assessment of HTML heading hierarchy and structure
  6. Redirects & 404 Errors: Details on redirects and broken links
  7. Security Analysis: Security header checks and potential vulnerabilities
  8. Performance Analysis: Loading time analysis and bottlenecks
  9. Best Practices: Assessment of web development best practices
  10. Technical Analysis: Technical metadata and implementations

For a comprehensive SEO audit:

./crawler --url=https://example.com/ \
--extra-columns="Title,H1,Description,Keywords" \
--show-inline-warnings \
--output-html-report=seo-audit.html

For identifying performance bottlenecks:

./crawler --url=https://example.com/ \
--analyzer-filter-regex="SlowestAnalyzer|FastestAnalyzer|ContentTypeAnalyzer" \
--output-html-report=performance-report.html

For focusing on security aspects:

./crawler --url=https://example.com/ \
--analyzer-filter-regex="SecurityAnalyzer|SslTlsAnalyzer|HeadersAnalyzer" \
--output-html-report=security-report.html

For preparing a site migration or redesign:

./crawler --url=https://example.com/ \
--output-html-report=migration-baseline.html \
--sitemap-xml-file=sitemap.xml \
--markdown-export-dir=./content-backup \
--offline-export-dir=./site-backup

Control the crawler’s impact on server resources:

./crawler --url=https://example.com/ \
--workers=10 \
--max-reqs-per-sec=20 \
--timeout=10

For very large websites with thousands of pages:

./crawler --url=https://example.com/ \
--memory-limit=4096M \
--result-storage=file \
--result-storage-dir=./storage \
--result-storage-compression

Focus on specific sections of your website:

./crawler --url=https://example.com/ \
--include-regex="/blog/|/products/" \
--ignore-regex="/author/|/tag/" \
--max-depth=5

Analyze sites that span multiple domains:

./crawler --url=https://example.com/ \
--allowed-domain-for-crawling=blog.example.com,shop.example.com \
--allowed-domain-for-external-files=cdn.example.com,assets.example.com

For large or complex sites, keep these best practices in mind:

  1. Start Small: Begin with small sections before crawling the entire site
  2. Increase Workers Gradually: Start with default settings, then increase workers if needed
  3. Use Result Storage Files: For sites with 1000+ pages, use file storage to manage memory
  4. Monitor Server Load: Watch your server’s performance during the crawl
  5. Schedule During Off-peak Hours: Run large crawls when site traffic is low
  6. Set Depth Limits: Use --max-depth to prevent infinite crawling of complex sites
  7. Use HTTP Cache: Enable caching for repeated crawls during analysis (see the sketch after this list)
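As a rough starting point that combines several of these recommendations, the sketch below caps crawl depth, keeps results in compressed file storage, and enables the HTTP cache for faster repeat runs. The --http-cache-dir and --http-cache-compression options are assumptions here; confirm the exact option names supported by your build with ./crawler --help.

./crawler --url=https://example.com/ \
--max-depth=5 \
--result-storage=file \
--result-storage-dir=./storage \
--result-storage-compression \
--http-cache-dir=./http-cache \
--http-cache-compression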

Combine your full crawl with various export capabilities:

Generate and upload an HTML report for team sharing:

./crawler --url=https://example.com/ \
--output-html-report=report.html \
--upload \
--upload-retention=7d

Create a fully functional offline copy:

./crawler --url=https://example.com/ \
--offline-export-dir=./offline-site

Extract all content in markdown format for CMS migration:

./crawler --url=https://example.com/ \
--markdown-export-dir=./content

If you encounter issues during a full crawl:

  1. Memory Issues: Increase --memory-limit or switch to --result-storage=file
  2. Timeout Errors: Increase --timeout value for slow-responding servers
  3. Too Many Requests: Reduce --workers or --max-reqs-per-sec to avoid server overload
  4. Missing Pages: Check for JavaScript navigation that might require additional settings
  5. Slow Performance: Use --disable-images temporarily to speed up the initial analysis (see the sketch after this list)
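For a quicker first pass while you diagnose problems, a minimal sketch along these lines lowers the request load and skips image downloads; the specific values are illustrative, and every option shown already appears elsewhere in this guide.

./crawler --url=https://example.com/ \
--workers=3 \
--max-reqs-per-sec=5 \
--timeout=20 \
--disable-images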

After your full website crawl:

  • Review the HTML report to identify key issues and opportunities
  • Export specific sections to markdown for content review
  • Set up regular scans to monitor site health (see the scheduling sketch below)
  • Use the JSON output for integration with your own analysis tools
  • Create a baseline report to measure improvements over time
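
One way to set up regular scans and keep dated baseline reports is a cron entry along the lines of the sketch below. The installation path, report directory, and schedule are placeholders to adjust for your environment (note that % must be escaped in crontab entries).

# Run a full crawl every Sunday at 03:00 and keep a dated HTML report
0 3 * * 0 /opt/siteone-crawler/crawler --url=https://example.com/ --output-html-report=/var/reports/crawl-$(date +\%F).html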