Caching
SiteOne Crawler includes a powerful HTTP response caching system that significantly improves performance during repeated crawls of the same website. This feature is especially valuable during development, testing, or when regularly monitoring websites for changes.
How Caching Works
When SiteOne Crawler fetches a URL, it can store the HTTP response (headers and body) in its cache system. On subsequent crawls, the tool checks whether the URL is already in the cache and, if so, uses the cached version instead of making a new HTTP request.
This approach offers several advantages:
- Reduced network traffic: Minimizes bandwidth usage for repeated crawls
- Faster execution time: Cached responses are retrieved instantly without network latency
- Reduced load on target servers: Prevents unnecessary requests to the crawled websites
- Consistent testing environment: Cached responses ensure consistent results for development testing
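The overall flow is a cache-first lookup. The sketch below is a rough, hypothetical illustration in PHP: the function name fetchWithCache, the file naming, and the use of file_get_contents as a stand-in HTTP client are assumptions made for brevity and do not mirror the actual HttpClient implementation.

```php
<?php
// Rough sketch of the cache-first flow (hypothetical names, not the real
// HttpClient code). file_get_contents() stands in for the crawler's HTTP client.

function fetchWithCache(string $url, string $cacheDir = 'tmp/http-client-cache'): string
{
    $cacheFile = $cacheDir . '/' . sha1($url) . '.cache';

    // Cache hit: reuse the stored response, no network request is made.
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }

    // Cache miss: perform the HTTP request and store the response for the next crawl.
    $response = file_get_contents($url);
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    file_put_contents($cacheFile, $response);

    return $response;
}

// Second and later calls with the same URL are served from tmp/http-client-cache.
echo fetchWithCache('https://example.com/');
```

In the crawler itself the cached entry also carries the response headers and status code, as described under Cache Implementation Details below.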
Cache Configuration
You can control the caching behavior through command-line options:
Main Cache Settings
- --http-cache-dir: Specifies the directory where HTTP responses are cached
  - Default value: tmp/http-client-cache
  - To disable caching completely, use --http-cache-dir='off'
- --http-cache-compression: Enables compression for HTTP cache storage
  - This option saves disk space but requires more CPU processing
  - Recommended for large websites with many cached responses
Example Usage
Basic usage with the default cache location:
./crawler --url=https://example.com/
Specifying a custom cache directory:
./crawler --url=https://example.com/ --http-cache-dir=./my-cache-folder
Disabling cache for a fresh crawl:
./crawler --url=https://example.com/ --http-cache-dir=off
Enabling compression for large website crawls:
./crawler --url=https://example.com/ --http-cache-compression
Cache Implementation Details
The cache implementation in SiteOne Crawler is handled by the HttpClient class:
- Cache Key Generation: For each request, a unique cache key is generated based on the following (see the sketch after this list):
  - Host and port of the target server
  - URL path and query parameters
  - HTTP method and headers
- Cache Storage: Responses are serialized and stored in files within the cache directory:
  - Files are organized in subdirectories based on hash prefixes for efficient access
  - When compression is enabled, cached data is compressed using gzip
- Cache Retrieval: Before making an HTTP request, the crawler checks whether a valid cache entry exists:
  - If found and valid, the cached response is used
  - Error responses (429, 500, 502, 503) are not cached, so failed requests are retried on the next run
- Cache Invalidation: The cache does not expire entries automatically; use --http-cache-dir='off' to perform a fresh crawl when needed
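The following sketch ties the first three items together: a key derived from the request properties, storage under a hash-prefix subdirectory with optional gzip compression, and retrieval that skips the non-cacheable error codes. It is an illustrative approximation, not the real HttpClient code; every function name, the file layout, and the serialize()-based format are assumptions.

```php
<?php
// Illustrative approximation of the key generation, storage and retrieval
// steps described above (hypothetical names, not the actual HttpClient code).

const NON_CACHEABLE_STATUS_CODES = [429, 500, 502, 503];

// Cache key: derived from host, port, path + query, HTTP method and headers.
function buildCacheKey(string $host, int $port, string $pathAndQuery, string $method, array $headers): string
{
    return sha1($host . ':' . $port . ' ' . $method . ' ' . $pathAndQuery . ' ' . serialize($headers));
}

// Storage path: files grouped into subdirectories by a short hash prefix.
function cacheFilePath(string $cacheDir, string $key, bool $compression): string
{
    return sprintf('%s/%s/%s.cache%s', $cacheDir, substr($key, 0, 2), $key, $compression ? '.gz' : '');
}

// Store a serialized response, optionally gzip-compressed; error responses are skipped.
function storeResponse(string $path, int $statusCode, array $headers, string $body, bool $compression): void
{
    if (in_array($statusCode, NON_CACHEABLE_STATUS_CODES, true)) {
        return; // not cached, so the URL is retried on the next crawl
    }
    $payload = serialize(['status' => $statusCode, 'headers' => $headers, 'body' => $body]);
    if ($compression) {
        $payload = gzencode($payload, 9);
    }
    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0777, true);
    }
    file_put_contents($path, $payload);
}

// Retrieval: return the cached entry if present, or null to signal a cache miss.
function loadResponse(string $path, bool $compression): ?array
{
    if (!is_file($path)) {
        return null; // miss: the crawler performs a real HTTP request
    }
    $payload = file_get_contents($path);
    if ($compression) {
        $payload = gzdecode($payload);
    }
    return unserialize($payload);
}

// Example wiring with made-up values:
$key  = buildCacheKey('example.com', 443, '/page?x=1', 'GET', ['Accept' => 'text/html']);
$path = cacheFilePath('tmp/http-client-cache', $key, true);
```

Grouping files by a short hash prefix keeps any single directory from accumulating a huge number of entries, which is the usual motivation for the hash-prefix layout mentioned above.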
When to Use Caching
Caching is beneficial in several scenarios:
- Development and Testing: During tool development or when testing new configurations
- Regular Monitoring: When periodically checking websites for specific changes
- Complex Analysis: When performing multiple analysis passes on the same site
When to Disable Caching
Consider disabling caching in these cases:
- First-time crawls: When crawling a site for the first time
- Timely information: When you need the most up-to-date content
- Performance testing: When evaluating actual website response times
- Dynamic content: When crawling sites with content that changes frequently
💡 Further Development Ideas
Future enhancements to the caching system could include:
- Time-based cache expiration settings
- Selective caching based on content types
- Intelligent cache invalidation for specific URLs
- Cache sharing between multiple crawler instances