Extending
SiteOne Crawler is built with extensibility in mind, providing a flexible architecture that allows developers to enhance and customize its functionality. This page explains how to extend the crawler with your own components.
Architecture Overview
SiteOne Crawler is composed of several extensible component types:
- Analyzers: Process crawled content to extract insights and generate reports
- Content Processors: Parse different content types and extract URLs
- Exporters: Generate various output formats from crawled data
Each component type follows a specific interface pattern, making it easy to create custom implementations.
Creating Custom Analyzers
Analyzers examine crawled content and generate insights or reports. They can analyze HTML structure, performance metrics, SEO factors, or any other aspect of web content.
Analyzer Interface
To create a custom analyzer, implement the Analyzer interface or extend the BaseAnalyzer class:
```php
namespace Crawler\Analysis;

class MyCustomAnalyzer extends BaseAnalyzer implements Analyzer
{
    // Required method implementation
    public function shouldBeActivated(): bool
    {
        return true; // or check configuration options
    }

    // Main analysis method
    public function analyze(): void
    {
        // Process $this->status->getVisitedUrls() and generate insights
        // Add output using $this->output->addSuperTable() or other methods
    }

    // Define analyzer order (lower numbers execute earlier)
    public function getOrder(): int
    {
        return 120; // Choose a number that suits your analyzer's dependencies
    }

    // Define configuration options
    public static function getOptions(): Options
    {
        $options = new Options();
        // Add your custom options here
        return $options;
    }
}
```
Key Analyzer Methods
- shouldBeActivated(): Determines if the analyzer should run based on configuration
- analyze(): Main method that processes crawled data and generates results
- analyzeVisitedUrl(): For real-time analysis of each URL during crawling
- getOrder(): Defines execution order among multiple analyzers
- getOptions(): Specifies configuration options for the analyzer
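The skeleton earlier on this page only fills in analyze(). As a hedged sketch of the real-time variant, analyzeVisitedUrl() might look like the following; the exact method signature and the urlsWithoutTitle property are assumptions for illustration, so verify them against the Analyzer interface in your version of the crawler:

```php
// Sketch only: called once per URL while the crawl is running.
// The parameter list is an assumption -- check the Analyzer interface
// in your installed version of SiteOne Crawler before relying on it.
public function analyzeVisitedUrl(VisitedUrl $visitedUrl, ?string $body): void
{
    // Only look at successfully fetched documents with a body
    if ($visitedUrl->statusCode !== 200 || $body === null) {
        return;
    }

    // Example real-time check: remember pages served without a <title> tag.
    // $this->urlsWithoutTitle is a hypothetical property you would declare
    // on your analyzer and report on later in analyze().
    if (stripos($body, '<title') === false) {
        $this->urlsWithoutTitle[] = $visitedUrl->url;
    }
}
```

Collecting lightweight findings here and rendering them in analyze() keeps per-URL work cheap during the crawl.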
Adding the Analyzer
Custom analyzers are automatically detected and loaded when placed in the Crawler/Analysis/ directory.
Creating Custom Content Processors
Content processors parse specific content types (HTML, CSS, JS, etc.) to extract URLs and modify content for exports.
Content Processor Interface
To create a custom content processor, implement the ContentProcessor interface or extend the BaseProcessor class:
```php
namespace Crawler\ContentProcessor;

class MyFrameworkProcessor extends BaseProcessor implements ContentProcessor
{
    // Define which content types this processor handles
    public function __construct(Crawler $crawler)
    {
        parent::__construct($crawler);
        $this->relevantContentTypes = [
            Crawler::CONTENT_TYPE_ID_HTML,
            // Add other relevant content types
        ];
    }

    // URL extraction method
    public function findUrls(string $content, int $contentType, ParsedUrl $url): ?FoundUrls
    {
        if (!in_array($contentType, $this->relevantContentTypes)) {
            return null;
        }

        // Extract URLs from content based on your custom logic
        $foundUrls = new FoundUrls();
        // Add URLs to $foundUrls
        return $foundUrls;
    }

    // Optional: Modify content for offline export
    public function applyContentChangesForOfflineVersion(string &$content, int $contentType, ParsedUrl $url, bool $removeUnwantedCode): void
    {
        // Modify content for offline viewing
    }
}
```
Adding the Content Processor
Register your custom processor in the ContentProcessorManager::__construct() method.
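As an illustrative sketch of that registration (the internal property name is an assumption; mirror however the built-in processors are registered in your copy of ContentProcessorManager::__construct()):

```php
// Sketch only -- inside ContentProcessorManager::__construct().
// The $this->processors property name is a guess for illustration;
// follow the pattern the built-in processors use in your codebase.
public function __construct(Crawler $crawler)
{
    // ... registration of the built-in processors ...

    // Add your custom processor alongside them
    $this->processors[] = new MyFrameworkProcessor($crawler);
}
```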
Creating Custom Exporters
Exporters transform crawled data into various output formats, such as HTML reports, offline websites, or specialized formats.
Exporter Interface
To create a custom exporter, implement the Exporter interface or extend the BaseExporter class:
```php
namespace Crawler\Export;

class MyCustomExporter extends BaseExporter implements Exporter
{
    // Should this exporter be activated?
    public function shouldBeActivated(): bool
    {
        return true; // or check configuration
    }

    // Main export method
    public function export(): void
    {
        // Process $this->status data and generate output
    }

    // Define configuration options
    public static function getOptions(): Options
    {
        $options = new Options();
        // Add your custom options here
        return $options;
    }
}
```
Adding the Exporter
Custom exporters should be placed in the Crawler/Export/ directory and registered in the ExporterManager.
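A minimal sketch of that registration, assuming ExporterManager keeps its exporters in a list property (the $this->exporters name is hypothetical; mirror how the built-in exporters are registered in your version):

```php
// Sketch only -- inside ExporterManager's constructor or setup method.
// $this->exporters is an assumed property name for illustration.
$this->exporters[] = new MyCustomExporter();
```

Because shouldBeActivated() gates execution, registering the exporter unconditionally is safe: it simply won't run unless its activation check passes.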
Best Practices for Extensions
- Follow Existing Patterns: Study existing components to understand design patterns
- Respect Memory Usage: Large crawls can produce very large result sets, so avoid holding more data in memory than necessary
- Add Configuration Options: Make your extensions configurable
- Error Handling: Implement proper error handling to avoid crashing the crawler
- Documentation: Document your extensions with comments and examples
Example: Simple SEO Title Analyzer
Here’s a simplified example of a custom analyzer that checks if page titles follow best practices:
```php
namespace Crawler\Analysis;

class TitleLengthAnalyzer extends BaseAnalyzer implements Analyzer
{
    protected int $minTitleLength = 30;
    protected int $maxTitleLength = 60;

    public function shouldBeActivated(): bool
    {
        return true;
    }

    public function analyze(): void
    {
        $results = [];

        foreach ($this->status->getVisitedUrls() as $visitedUrl) {
            if ($visitedUrl->statusCode === 200 && $visitedUrl->contentType === Crawler::CONTENT_TYPE_ID_HTML) {
                $htmlBody = $this->status->getStorage()->load($visitedUrl->uqId);
                preg_match('/<title[^>]*>([^<]*)<\/title>/i', $htmlBody, $matches);
                $title = trim($matches[1] ?? '');

                $length = mb_strlen($title);
                $status = ($length >= $this->minTitleLength && $length <= $this->maxTitleLength)
                    ? 'optimal'
                    : ($length < $this->minTitleLength ? 'too_short' : 'too_long');

                $results[] = [
                    'url' => $visitedUrl->url,
                    'title' => $title,
                    'length' => $length,
                    'status' => $status,
                ];
            }
        }

        // Create output table
        $superTable = new SuperTable(
            'title-length-analysis',
            'Title Length Analysis',
            'No HTML pages found.',
            [
                new SuperTableColumn('url', 'URL', 50),
                new SuperTableColumn('title', 'Title', 40),
                new SuperTableColumn('length', 'Length', 8),
                new SuperTableColumn('status', 'Status', 10, function ($value) {
                    return $value === 'optimal'
                        ? Utils::getColorText($value, 'green')
                        : Utils::getColorText($value, 'red');
                }),
            ],
            true,
            'status',
            'DESC'
        );

        $superTable->setData($results);
        $this->status->addSuperTable($superTable);
        $this->output->addSuperTable($superTable);
    }

    public function getOrder(): int
    {
        return 130;
    }

    public static function getOptions(): Options
    {
        $options = new Options();
        $options->addGroup(new Group(
            'title-length-analyzer',
            'Title Length Analyzer',
            [
                new Option('--min-title-length', null, 'minTitleLength', Type::INT, false, 'Minimum title length in characters', 30),
                new Option('--max-title-length', null, 'maxTitleLength', Type::INT, false, 'Maximum title length in characters', 60),
            ]
        ));
        return $options;
    }
}
```
💡 Further Development Ideas
The extensibility of SiteOne Crawler opens up many possibilities:
- Custom analyzers for industry-specific requirements
- Integration with third-party APIs for enhanced analysis
- Specialized exporters for different reporting formats
- Content processors for modern JavaScript frameworks
- Machine learning analyzers for content quality assessment
If you develop a useful extension, consider contributing it to the project to benefit the wider community.