
Extending

SiteOne Crawler is built with extensibility in mind, providing a flexible architecture that allows developers to enhance and customize its functionality. This page explains how to extend the crawler with your own components.

SiteOne Crawler is composed of several extensible component types:

  1. Analyzers: Process crawled content to extract insights and generate reports
  2. Content Processors: Parse different content types and extract URLs
  3. Exporters: Generate various output formats from crawled data

Each component type follows a specific interface pattern, making it easy to create custom implementations.

Analyzers

Analyzers examine crawled content and generate insights or reports. They can analyze HTML structure, performance metrics, SEO factors, or any other aspect of web content.

To create a custom analyzer, implement the Analyzer interface or extend the BaseAnalyzer class:

namespace Crawler\Analysis;

class MyCustomAnalyzer extends BaseAnalyzer implements Analyzer
{
    // Required: determines whether this analyzer runs
    public function shouldBeActivated(): bool
    {
        return true; // or check configuration options
    }

    // Main analysis method
    public function analyze(): void
    {
        // Process $this->status->getVisitedUrls() and generate insights
        // Add output using $this->output->addSuperTable() or other methods
    }

    // Define analyzer order (lower numbers execute earlier)
    public function getOrder(): int
    {
        return 120; // choose a number that suits your analyzer's dependencies
    }

    // Define configuration options
    public static function getOptions(): Options
    {
        $options = new Options();
        // Add your custom options here
        return $options;
    }
}
  • shouldBeActivated(): Determines if the analyzer should run based on configuration
  • analyze(): Main method that processes crawled data and generates results
  • analyzeVisitedUrl(): For real-time analysis of each URL during crawling (sketched below)
  • getOrder(): Defines execution order among multiple analyzers
  • getOptions(): Specifies configuration options for the analyzer

Custom analyzers are automatically detected and loaded when placed in the Crawler/Analysis/ directory.
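
The analyzeVisitedUrl() hook listed above runs once per URL during the crawl, while the response body is still in memory, so it suits checks that would otherwise require re-loading every page from storage in analyze(). The sketch below is illustrative only: the parameter and return types are assumptions, so check the Analyzer interface in your version of the crawler for the exact signature.

// Hypothetical signature - consult the Analyzer interface for the real one.
public function analyzeVisitedUrl(VisitedUrl $visitedUrl, ?string $body, ?array $headers)
{
    // Only inspect successfully crawled pages whose body is available
    if ($visitedUrl->statusCode !== 200 || $body === null) {
        return;
    }
    // Example real-time check: count <h1> headings as each page arrives,
    // accumulating into a property that analyze() reports on later.
    $this->h1Counts[$visitedUrl->url] = preg_match_all('/<h1[\s>]/i', $body);
}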

Content Processors

Content processors parse specific content types (HTML, CSS, JS, etc.) to extract URLs and modify content for exports.

To create a custom content processor, implement the ContentProcessor interface or extend the BaseProcessor class:

namespace Crawler\ContentProcessor;

class MyFrameworkProcessor extends BaseProcessor implements ContentProcessor
{
    // Declare which content types this processor handles
    public function __construct(Crawler $crawler)
    {
        parent::__construct($crawler);
        $this->relevantContentTypes = [
            Crawler::CONTENT_TYPE_ID_HTML,
            // Add other relevant content types
        ];
    }

    // URL extraction method
    public function findUrls(string $content, int $contentType, ParsedUrl $url): ?FoundUrls
    {
        if (!in_array($contentType, $this->relevantContentTypes)) {
            return null;
        }

        // Extract URLs from content based on your custom logic
        $foundUrls = new FoundUrls();
        // Add URLs to $foundUrls
        return $foundUrls;
    }

    // Optional: modify content for offline export
    public function applyContentChangesForOfflineVersion(string &$content, int $contentType, ParsedUrl $url, bool $removeUnwantedCode): void
    {
        // Modify content for offline viewing
    }
}

Register your custom processor in the ContentProcessorManager::__construct() method.
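
As a rough illustration (the manager's internal structure may differ in your checkout, so mirror how the built-in HTML/CSS/JS processors are registered there):

// Illustrative sketch of ContentProcessorManager::__construct() -
// the property name and registration style are assumptions; copy
// whatever pattern the built-in processors use.
public function __construct(Crawler $crawler)
{
    // ... built-in processors are registered here ...
    $this->processors[] = new MyFrameworkProcessor($crawler);
}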

Exporters

Exporters transform crawled data into various output formats, such as HTML reports, offline websites, or specialized formats.

To create a custom exporter, implement the Exporter interface or extend the BaseExporter class:

namespace Crawler\Export;

class MyCustomExporter extends BaseExporter implements Exporter
{
    // Should this exporter be activated?
    public function shouldBeActivated(): bool
    {
        return true; // or check configuration
    }

    // Main export method
    public function export(): void
    {
        // Process $this->status data and generate output
    }

    // Define configuration options
    public static function getOptions(): Options
    {
        $options = new Options();
        // Add your custom options here
        return $options;
    }
}

Custom exporters should be placed in the Crawler/Export/ directory and registered in the ExporterManager.
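
To give a feel for what export() typically does, here is a minimal sketch that writes every visited URL and its status code to a JSON file; the hard-coded file name is a placeholder that a real exporter would expose through getOptions():

public function export(): void
{
    $rows = [];
    foreach ($this->status->getVisitedUrls() as $visitedUrl) {
        $rows[] = [
            'url' => $visitedUrl->url,
            'statusCode' => $visitedUrl->statusCode,
        ];
    }
    // 'crawl-summary.json' is hard-coded for brevity; in practice,
    // define an Option for the output path and read it from a property.
    file_put_contents('crawl-summary.json', json_encode($rows, JSON_PRETTY_PRINT));
}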

Best Practices

  1. Follow Existing Patterns: Study existing components to understand the design patterns they follow
  2. Respect Memory Usage: Be mindful of memory usage with large websites
  3. Add Configuration Options: Make your extensions configurable
  4. Error Handling: Implement proper error handling to avoid crashing the crawler (see the sketch after this list)
  5. Documentation: Document your extensions with comments and examples
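
For point 4, a pattern that works well is to wrap per-URL work in a try/catch so that one malformed page cannot abort the whole run; a minimal sketch:

public function analyze(): void
{
    foreach ($this->status->getVisitedUrls() as $visitedUrl) {
        try {
            // Per-URL work that might throw (parsing, decoding, etc.)
        } catch (\Throwable $e) {
            // Skip the offending page and keep going; report $e->getMessage()
            // through whatever logging your component already uses.
            continue;
        }
    }
}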

Example: Title Length Analyzer

Here’s a simplified example of a custom analyzer that checks if page titles follow best practices:

namespace Crawler\Analysis;

class TitleLengthAnalyzer extends BaseAnalyzer implements Analyzer
{
    protected int $minTitleLength = 30;
    protected int $maxTitleLength = 60;

    public function shouldBeActivated(): bool
    {
        return true;
    }

    public function analyze(): void
    {
        $results = [];
        foreach ($this->status->getVisitedUrls() as $visitedUrl) {
            if ($visitedUrl->statusCode === 200 && $visitedUrl->contentType === Crawler::CONTENT_TYPE_ID_HTML) {
                $htmlBody = $this->status->getStorage()->load($visitedUrl->uqId);
                preg_match('/<title[^>]*>([^<]*)<\/title>/i', $htmlBody, $matches);
                $title = trim($matches[1] ?? '');
                $length = mb_strlen($title);
                $status = ($length >= $this->minTitleLength && $length <= $this->maxTitleLength)
                    ? 'optimal'
                    : ($length < $this->minTitleLength ? 'too_short' : 'too_long');
                $results[] = [
                    'url' => $visitedUrl->url,
                    'title' => $title,
                    'length' => $length,
                    'status' => $status,
                ];
            }
        }

        // Create output table
        $superTable = new SuperTable(
            'title-length-analysis',
            'Title Length Analysis',
            'No HTML pages found.',
            [
                new SuperTableColumn('url', 'URL', 50),
                new SuperTableColumn('title', 'Title', 40),
                new SuperTableColumn('length', 'Length', 8),
                new SuperTableColumn('status', 'Status', 10, function ($value) {
                    return $value === 'optimal'
                        ? Utils::getColorText($value, 'green')
                        : Utils::getColorText($value, 'red');
                }),
            ],
            true,
            'status',
            'DESC'
        );
        $superTable->setData($results);

        $this->status->addSuperTable($superTable);
        $this->output->addSuperTable($superTable);
    }

    public function getOrder(): int
    {
        return 130;
    }

    public static function getOptions(): Options
    {
        $options = new Options();
        $options->addGroup(new Group(
            'title-length-analyzer',
            'Title Length Analyzer', [
                new Option('--min-title-length', null, 'minTitleLength', Type::INT, false, 'Minimum title length in characters', 30),
                new Option('--max-title-length', null, 'maxTitleLength', Type::INT, false, 'Maximum title length in characters', 60),
            ]
        ));
        return $options;
    }
}
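
Once the file is placed in the Crawler/Analysis/ directory, the analyzer is detected automatically and its options appear alongside the built-in ones. Assuming the standard launcher script, a run could look like:

./crawler --url=https://example.com/ --min-title-length=35 --max-title-length=65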

The extensibility of SiteOne Crawler opens up many possibilities:

  • Custom analyzers for industry-specific requirements
  • Integration with third-party APIs for enhanced analysis
  • Specialized exporters for different reporting formats
  • Content processors for modern JavaScript frameworks
  • Machine learning analyzers for content quality assessment

If you develop a useful extension, consider contributing it to the project to benefit the wider community.