Full Website Crawl
A full website crawl with SiteOne Crawler provides a comprehensive analysis of your entire website, including all pages, assets, and resources. This approach gives you the most complete picture of your site’s health, performance, and optimization opportunities.
Basic Full Website Crawl
The simplest way to perform a full website crawl is:
./crawler --url=https://example.com/
Without any restrictive options, the crawler will:
- Follow all internal links on the website
- Download and analyze all assets (HTML, CSS, JS, images, etc.)
- Analyze various aspects of the site (SEO, performance, security, etc.)
- Generate a detailed HTML report
Comprehensive Crawl with All Features
For a thorough analysis with all features enabled:
./crawler --url=https://example.com/ \
    --workers=8 \
    --output-html-report=full-report.html \
    --output-json-file=full-report.json \
    --sitemap-xml-file=sitemap.xml \
    --sitemap-txt-file=sitemap.txt \
    --show-inline-warnings \
    --show-inline-criticals
This command:
- Uses 8 concurrent workers for faster crawling
- Saves a detailed HTML report to full-report.html
- Exports all data to JSON for further processing (see the note after this list)
- Generates XML and TXT sitemaps
- Shows both warnings and critical issues directly in the URL table
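If you plan to post-process the JSON export, a quick way to inspect its structure first is jq (an external tool, assumed to be installed); the exact schema depends on the crawler version, so this sketch only lists the top-level keys of the file produced above:

# Show the top-level structure of the JSON export
jq 'keys' full-report.json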
Sample Report Sections
A full crawl generates a comprehensive HTML report with multiple analysis tabs:
- Basic Stats: Overall crawl statistics and site health metrics
- URL Analysis: Complete list of all crawled URLs with status and metrics
- Content Type Analysis: Breakdown of content types, sizes, and load times
- SEO & OpenGraph: Analysis of SEO factors and social sharing metadata
- Heading Structure: Assessment of HTML heading hierarchy and structure
- Redirects & 404 Errors: Details on redirects and broken links
- Security Analysis: Security header checks and potential vulnerabilities
- Performance Analysis: Loading time analysis and bottlenecks
- Best Practices: Assessment of web development best practices
- Technical Analysis: Technical metadata and implementations
Use Cases
Website Audit for SEO
For a comprehensive SEO audit:
./crawler --url=https://example.com/ \
    --extra-columns="Title,H1,Description,Keywords" \
    --show-inline-warnings \
    --output-html-report=seo-audit.html
Performance Optimization
For identifying performance bottlenecks:
./crawler --url=https://example.com/ \
    --analyzer-filter-regex="SlowestAnalyzer|FastestAnalyzer|ContentTypeAnalyzer" \
    --output-html-report=performance-report.html
Security Assessment
For focusing on security aspects:
./crawler --url=https://example.com/ \
    --analyzer-filter-regex="SecurityAnalyzer|SslTlsAnalyzer|HeadersAnalyzer" \
    --output-html-report=security-report.html
Technical Migration Preparation
For preparing a site migration or redesign:
./crawler --url=https://example.com/ \
    --output-html-report=migration-baseline.html \
    --sitemap-xml-file=sitemap.xml \
    --markdown-export-dir=./content-backup \
    --offline-export-dir=./site-backup
Advanced Configuration
Concurrency and Rate Limiting
Control the crawler’s impact on server resources:
./crawler --url=https://example.com/ \
    --workers=10 \
    --max-reqs-per-sec=20 \
    --timeout=10
Memory Management for Large Sites
For very large websites with thousands of pages:
./crawler --url=https://example.com/ \
    --memory-limit=4096M \
    --result-storage=file \
    --result-storage-dir=./storage \
    --result-storage-compression
Filtering Content
Focus on specific sections of your website:
./crawler --url=https://example.com/ \
    --include-regex="/blog/|/products/" \
    --ignore-regex="/author/|/tag/" \
    --max-depth=5
Cross-domain Analysis
Analyze sites that span multiple domains:
./crawler --url=https://example.com/ \
    --allowed-domain-for-crawling=blog.example.com,shop.example.com \
    --allowed-domain-for-external-files=cdn.example.com,assets.example.com
Best Practices for Full Crawls
- Start Small: Begin with small sections before crawling the entire site
- Increase Workers Gradually: Start with default settings, then increase workers if needed
- Use Result Storage Files: For sites with 1000+ pages, use file storage to manage memory
- Monitor Server Load: Watch your server’s performance during the crawl
- Schedule During Off-peak Hours: Run large crawls when site traffic is low
- Set Depth Limits: Use --max-depth to prevent infinite crawling of complex sites (see the combined example after this list)
- Use HTTP Cache: Enable caching for repeated crawls during analysis
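As a combined illustration of these practices, the following sketch crawls a single section with a bounded depth, modest concurrency, and file-based result storage; it only uses options shown elsewhere on this page, and the regex and limits are placeholders to adapt to your site:

# Bounded first crawl: one section, limited depth, file-based result storage
./crawler --url=https://example.com/ \
    --include-regex="/blog/" \
    --max-depth=3 \
    --workers=3 \
    --result-storage=file \
    --result-storage-dir=./storage

Once this small run completes cleanly, widen the include pattern and raise the worker count step by step.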
Export Options
Combine your full crawl with various export capabilities:
HTML Report with Upload
Generate and upload an HTML report for team sharing:
./crawler --url=https://example.com/ \
    --output-html-report=report.html \
    --upload \
    --upload-retention=7d
Offline Website Version
Create a fully functional offline copy:
./crawler --url=https://example.com/ \
    --offline-export-dir=./offline-site
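To preview the exported copy in a browser, any static file server works; for example, assuming Python 3 is available on your machine:

# Serve the offline export locally at http://localhost:8080/
cd ./offline-site && python3 -m http.server 8080

Since the export is a self-contained offline copy, you can usually also open it directly from disk, so serving it over HTTP is optional.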
Markdown Content Export
Extract all content in markdown format for CMS migration:
./crawler --url=https://example.com/ \
    --markdown-export-dir=./content
Troubleshooting Full Crawls
If you encounter issues during a full crawl:
- Memory Issues: Increase --memory-limit or switch to --result-storage=file (see the combined example after this list)
- Timeout Errors: Increase the --timeout value for slow-responding servers
- Too Many Requests: Reduce --workers or --max-reqs-per-sec to avoid server overload
- Missing Pages: Check for JavaScript navigation that might require additional settings
- Slow Performance: Use --disable-images temporarily to speed up initial analysis
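As a combined example, the sketch below applies several of these adjustments at once for a large site on a slow or sensitive server; the values are illustrative and should be tuned to your environment:

# Conservative settings: more memory, file storage, longer timeout, gentler request rate
./crawler --url=https://example.com/ \
    --memory-limit=4096M \
    --result-storage=file \
    --result-storage-dir=./storage \
    --timeout=30 \
    --workers=2 \
    --max-reqs-per-sec=5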
💡Next Steps
After your full website crawl:
- Review the HTML report to identify key issues and opportunities
- Export specific sections to markdown for content review
- Set up regular scans to monitor site health (see the cron example below)
- Use the JSON output for integration with your own analysis tools
- Create a baseline report to measure improvements over time
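For the recurring scans mentioned above, one option is a cron entry on a Linux host; this is only a sketch, and the crawler path, report directory, and schedule are illustrative assumptions:

# Crawl every Monday at 03:00 and keep a date-stamped HTML report
0 3 * * 1 /opt/crawler/crawler --url=https://example.com/ --output-html-report=/var/reports/example-$(date +\%F).html

Comparing these dated reports against your baseline makes it easy to see whether issues are being resolved over time.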