Caching
SiteOne Crawler includes a powerful HTTP response caching system that significantly improves performance during repeated crawls of the same website. This feature is especially valuable during development, testing, or when regularly monitoring websites for changes.
How Caching Works
When SiteOne Crawler fetches a URL, it can store the HTTP response (headers and body) in its cache system. On subsequent crawls, the tool checks whether the URL is already in the cache and, if so, uses the cached version instead of making a new HTTP request.
This approach offers several advantages:
- Reduced network traffic: Minimizes bandwidth usage for repeated crawls
- Faster execution time: Cached responses are retrieved instantly without network latency
- Reduced load on target servers: Prevents unnecessary requests to the crawled websites
- Consistent testing environment: Cached responses ensure consistent results for development testing
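The overall flow is a cache-first lookup. The sketch below is a rough, hypothetical illustration in PHP: the function name fetchWithCache, the file naming, and the use of file_get_contents as a stand-in HTTP client are assumptions made for brevity and do not mirror the actual HttpClient implementation.

```php
<?php
// Rough sketch of the cache-first flow (hypothetical names, not the real
// HttpClient code). file_get_contents() stands in for the crawler's HTTP client.

function fetchWithCache(string $url, string $cacheDir = 'tmp/http-client-cache'): string
{
    $cacheFile = $cacheDir . '/' . sha1($url) . '.cache';

    // Cache hit: reuse the stored response, no network request is made.
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }

    // Cache miss: perform the HTTP request and store the response for the next crawl.
    $response = file_get_contents($url);
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    file_put_contents($cacheFile, $response);

    return $response;
}

// Second and later calls with the same URL are served from tmp/http-client-cache.
echo fetchWithCache('https://example.com/');
```

In the crawler itself the cached entry also carries the response headers and status code, as described under Cache Implementation Details below.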
Cache Configuration
You can control the caching behavior through command-line options:
Main Cache Settings
- --http-cache-dir: Specifies the directory where HTTP responses are cached
  - Default value: tmp/http-client-cache
  - To disable caching completely, use --http-cache-dir='off'
- --http-cache-compression: Enables compression for HTTP cache storage
  - This option saves disk space but requires more CPU processing
  - Recommended for large websites with many cached responses
Example Usage
Basic usage with the default cache location:
./crawler --url=https://example.com/
Specifying a custom cache directory:
./crawler --url=https://example.com/ --http-cache-dir=./my-cache-folder
Disabling cache for a fresh crawl:
./crawler --url=https://example.com/ --http-cache-dir=off
Enabling compression for large website crawls:
./crawler --url=https://example.com/ --http-cache-compression
Cache Implementation Details
The cache implementation in SiteOne Crawler is handled by the HttpClient class:
- Cache Key Generation: For each request, a unique cache key is generated based on the following (see the sketch after this list):
  - Host and port of the target server
  - URL path and query parameters
  - HTTP method and headers
- Cache Storage: Responses are serialized and stored in files within the cache directory:
  - Files are organized in subdirectories based on hash prefixes for efficient access
  - When compression is enabled, cached data is compressed using gzip
- Cache Retrieval: Before making an HTTP request, the crawler checks whether a valid cache entry exists:
  - If found and valid, the cached response is used
  - Error responses (429, 500, 502, 503) are not cached, so failed requests are retried on the next run
- Cache Invalidation: The cache does not expire entries automatically; use --http-cache-dir='off' to perform a fresh crawl when needed
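The following sketch ties the first three items together: a key derived from the request properties, storage under a hash-prefix subdirectory with optional gzip compression, and retrieval that skips the non-cacheable error codes. It is an illustrative approximation, not the real HttpClient code; every function name, the file layout, and the serialize()-based format are assumptions.

```php
<?php
// Illustrative approximation of the key generation, storage and retrieval
// steps described above (hypothetical names, not the actual HttpClient code).

const NON_CACHEABLE_STATUS_CODES = [429, 500, 502, 503];

// Cache key: derived from host, port, path + query, HTTP method and headers.
function buildCacheKey(string $host, int $port, string $pathAndQuery, string $method, array $headers): string
{
    return sha1($host . ':' . $port . ' ' . $method . ' ' . $pathAndQuery . ' ' . serialize($headers));
}

// Storage path: files grouped into subdirectories by a short hash prefix.
function cacheFilePath(string $cacheDir, string $key, bool $compression): string
{
    return sprintf('%s/%s/%s.cache%s', $cacheDir, substr($key, 0, 2), $key, $compression ? '.gz' : '');
}

// Store a serialized response, optionally gzip-compressed; error responses are skipped.
function storeResponse(string $path, int $statusCode, array $headers, string $body, bool $compression): void
{
    if (in_array($statusCode, NON_CACHEABLE_STATUS_CODES, true)) {
        return; // not cached, so the URL is retried on the next crawl
    }
    $payload = serialize(['status' => $statusCode, 'headers' => $headers, 'body' => $body]);
    if ($compression) {
        $payload = gzencode($payload, 9);
    }
    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0777, true);
    }
    file_put_contents($path, $payload);
}

// Retrieval: return the cached entry if present, or null to signal a cache miss.
function loadResponse(string $path, bool $compression): ?array
{
    if (!is_file($path)) {
        return null; // miss: the crawler performs a real HTTP request
    }
    $payload = file_get_contents($path);
    if ($compression) {
        $payload = gzdecode($payload);
    }
    return unserialize($payload);
}

// Example wiring with made-up values:
$key  = buildCacheKey('example.com', 443, '/page?x=1', 'GET', ['Accept' => 'text/html']);
$path = cacheFilePath('tmp/http-client-cache', $key, true);
```

Grouping files by a short hash prefix keeps any single directory from accumulating a huge number of entries, which is the usual motivation for the hash-prefix layout mentioned above.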
When to Use Caching
Caching is beneficial in several scenarios:
- Development and Testing: During tool development or when testing new configurations
- Regular Monitoring: When periodically checking websites for specific changes
- Complex Analysis: When performing multiple analysis passes on the same site
When to Disable Caching
Consider disabling caching in these cases:
- First-time crawls: When crawling a site for the first time
- Timely information: When you need the most up-to-date content
- Performance testing: When evaluating actual website response times
- Dynamic content: When crawling sites with content that changes frequently
💡 Further Development Ideas
Future enhancements to the caching system could include:
- Time-based cache expiration settings
- Selective caching based on content types
- Intelligent cache invalidation for specific URLs
- Cache sharing between multiple crawler instances