Full Website Crawl
A full website crawl with SiteOne Crawler provides a comprehensive analysis of your entire website, including all pages, assets, and resources. This approach gives you the most complete picture of your site’s health, performance, and optimization opportunities.
Basic Full Website Crawl
The simplest way to perform a full website crawl is:
./crawler --url=https://example.com/
Without any restrictive options, the crawler will:
- Follow all internal links on the website
- Download and analyze all assets (HTML, CSS, JS, images, etc.)
- Analyze various aspects of the site (SEO, performance, security, etc.)
- Generate a detailed HTML report
Comprehensive Crawl with All Features
For a thorough analysis with all features enabled:
./crawler --url=https://example.com/ \
    --workers=8 \
    --output-html-report=full-report.html \
    --output-json-file=full-report.json \
    --sitemap-xml-file=sitemap.xml \
    --sitemap-txt-file=sitemap.txt \
    --show-inline-warnings \
    --show-inline-criticals
This command:
- Uses 8 concurrent workers for faster crawling
- Saves a detailed HTML report to full-report.html
- Exports all data to JSON for further processing (see the note after this list)
- Generates XML and TXT sitemaps
- Shows both warnings and critical issues directly in the URL table
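If you plan to post-process the JSON export, a quick way to inspect its structure first is jq (an external tool, assumed to be installed); the exact schema depends on the crawler version, so this sketch only lists the top-level keys of the file produced above:

# Show the top-level structure of the JSON export
jq 'keys' full-report.json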
Sample Report Sections
A full crawl generates a comprehensive HTML report with multiple analysis tabs:
- Basic Stats: Overall crawl statistics and site health metrics
- URL Analysis: Complete list of all crawled URLs with status and metrics
- Content Type Analysis: Breakdown of content types, sizes, and load times
- SEO & OpenGraph: Analysis of SEO factors and social sharing metadata
- Heading Structure: Assessment of HTML heading hierarchy and structure
- Redirects & 404 Errors: Details on redirects and broken links
- Security Analysis: Security header checks and potential vulnerabilities
- Performance Analysis: Loading time analysis and bottlenecks
- Best Practices: Assessment of web development best practices
- Technical Analysis: Technical metadata and implementations
Use Cases
Website Audit for SEO
For a comprehensive SEO audit:
./crawler --url=https://example.com/ \
    --extra-columns="Title,H1,Description,Keywords" \
    --show-inline-warnings \
    --output-html-report=seo-audit.html
Performance Optimization
For identifying performance bottlenecks:
./crawler --url=https://example.com/ \
    --analyzer-filter-regex="SlowestAnalyzer|FastestAnalyzer|ContentTypeAnalyzer" \
    --output-html-report=performance-report.html
Security Assessment
For focusing on security aspects:
./crawler --url=https://example.com/ \
    --analyzer-filter-regex="SecurityAnalyzer|SslTlsAnalyzer|HeadersAnalyzer" \
    --output-html-report=security-report.html
Technical Migration Preparation
For preparing a site migration or redesign:
./crawler --url=https://example.com/ \
    --output-html-report=migration-baseline.html \
    --sitemap-xml-file=sitemap.xml \
    --markdown-export-dir=./content-backup \
    --offline-export-dir=./site-backup
Advanced Configuration
Concurrency and Rate Limiting
Control the crawler’s impact on server resources:
./crawler --url=https://example.com/ \
    --workers=10 \
    --max-reqs-per-sec=20 \
    --timeout=10
Memory Management for Large Sites
For very large websites with thousands of pages:
./crawler --url=https://example.com/ \
    --memory-limit=4096M \
    --result-storage=file \
    --result-storage-dir=./storage \
    --result-storage-compression
Filtering Content
Focus on specific sections of your website:
./crawler --url=https://example.com/ \
    --include-regex="/blog/|/products/" \
    --ignore-regex="/author/|/tag/" \
    --max-depth=5
Cross-domain Analysis
Analyze sites that span multiple domains:
./crawler --url=https://example.com/ \
    --allowed-domain-for-crawling=blog.example.com,shop.example.com \
    --allowed-domain-for-external-files=cdn.example.com,assets.example.com
Best Practices for Full Crawls
- Start Small: Begin with small sections before crawling the entire site
- Increase Workers Gradually: Start with default settings, then increase workers if needed
- Use Result Storage Files: For sites with 1000+ pages, use file storage to manage memory
- Monitor Server Load: Watch your server’s performance during the crawl
- Schedule During Off-peak Hours: Run large crawls when site traffic is low
- Set Depth Limits: Use --max-depth to prevent infinite crawling of complex sites (see the combined example after this list)
- Use HTTP Cache: Enable caching for repeated crawls during analysis
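As a combined illustration of these practices, the following sketch crawls a single section with a bounded depth, modest concurrency, and file-based result storage; it only uses options shown elsewhere on this page, and the regex and limits are placeholders to adapt to your site:

# Bounded first crawl: one section, limited depth, file-based result storage
./crawler --url=https://example.com/ \
    --include-regex="/blog/" \
    --max-depth=3 \
    --workers=3 \
    --result-storage=file \
    --result-storage-dir=./storage

Once this small run completes cleanly, widen the include pattern and raise the worker count step by step.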
Export Options
Combine your full crawl with various export capabilities:
HTML Report with Upload
Generate and upload an HTML report for team sharing:
./crawler --url=https://example.com/ \
    --output-html-report=report.html \
    --upload \
    --upload-retention=7d
Offline Website Version
Create a fully functional offline copy:
./crawler --url=https://example.com/ \
    --offline-export-dir=./offline-site
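To preview the exported copy in a browser, any static file server works; for example, assuming Python 3 is available on your machine:

# Serve the offline export locally at http://localhost:8080/
cd ./offline-site && python3 -m http.server 8080

Since the export is a self-contained offline copy, you can usually also open it directly from disk, so serving it over HTTP is optional.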
Markdown Content Export
Extract all content in markdown format for CMS migration:
./crawler --url=https://example.com/ \
    --markdown-export-dir=./content
Troubleshooting Full Crawls
If you encounter issues during a full crawl:
- Memory Issues: Increase --memory-limit or switch to --result-storage=file (see the combined example after this list)
- Timeout Errors: Increase the --timeout value for slow-responding servers
- Too Many Requests: Reduce --workers or --max-reqs-per-sec to avoid server overload
- Missing Pages: Check for JavaScript navigation that might require additional settings
- Slow Performance: Use --disable-images temporarily to speed up initial analysis
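As a combined example, the sketch below applies several of these adjustments at once for a large site on a slow or sensitive server; the values are illustrative and should be tuned to your environment:

# Conservative settings: more memory, file storage, longer timeout, gentler request rate
./crawler --url=https://example.com/ \
    --memory-limit=4096M \
    --result-storage=file \
    --result-storage-dir=./storage \
    --timeout=30 \
    --workers=2 \
    --max-reqs-per-sec=5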
💡Next Steps
After your full website crawl:
- Review the HTML report to identify key issues and opportunities
- Export specific sections to markdown for content review
- Set up regular scans to monitor site health (see the cron example below)
- Use the JSON output for integration with your own analysis tools
- Create a baseline report to measure improvements over time
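For the recurring scans mentioned above, one option is a cron entry on a Linux host; this is only a sketch, and the crawler path, report directory, and schedule are illustrative assumptions:

# Crawl every Monday at 03:00 and keep a date-stamped HTML report
0 3 * * 1 /opt/crawler/crawler --url=https://example.com/ --output-html-report=/var/reports/example-$(date +\%F).html

Comparing these dated reports against your baseline makes it easy to see whether issues are being resolved over time.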