
Extending

SiteOne Crawler is built with extensibility in mind, providing a flexible architecture that allows developers to enhance and customize its functionality. This page explains how to extend the crawler with your own components.

SiteOne Crawler is composed of several extensible component types:

  1. Analyzers: Process crawled content to extract insights and generate reports
  2. Content Processors: Parse different content types and extract URLs
  3. Exporters: Generate various output formats from crawled data

Each component type follows a specific interface pattern, making it easy to create custom implementations.

Analyzers

Analyzers examine crawled content and generate insights or reports. They can analyze HTML structure, performance metrics, SEO factors, or any other aspect of web content.

To create a custom analyzer, implement the Analyzer interface or extend the BaseAnalyzer class:

namespace Crawler\Analysis;

class MyCustomAnalyzer extends BaseAnalyzer implements Analyzer
{
    // Required: determines whether this analyzer runs
    public function shouldBeActivated(): bool
    {
        return true; // or check configuration options
    }

    // Main analysis method
    public function analyze(): void
    {
        // Process $this->status->getVisitedUrls() and generate insights
        // Add output using $this->output->addSuperTable() or other methods
    }

    // Define analyzer order (lower numbers execute earlier)
    public function getOrder(): int
    {
        return 120; // choose a number that suits your analyzer's dependencies
    }

    // Define configuration options
    public static function getOptions(): Options
    {
        $options = new Options();
        // Add your custom options here
        return $options;
    }
}
  • shouldBeActivated(): Determines if the analyzer should run based on configuration
  • analyze(): Main method that processes crawled data and generates results
  • analyzeVisitedUrl(): For real-time analysis of each URL during crawling (sketched below)
  • getOrder(): Defines execution order among multiple analyzers
  • getOptions(): Specifies configuration options for the analyzer

Custom analyzers are automatically detected and loaded when placed in the Crawler/Analysis/ directory.
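
The analyzeVisitedUrl() hook listed above runs once per URL during the crawl, while the response body is still in memory, so it suits checks that would otherwise require re-loading every page from storage in analyze(). The sketch below is illustrative only: the parameter and return types are assumptions, so check the Analyzer interface in your version of the crawler for the exact signature.

// Hypothetical signature - consult the Analyzer interface for the real one.
public function analyzeVisitedUrl(VisitedUrl $visitedUrl, ?string $body, ?array $headers)
{
    // Only inspect successfully crawled pages whose body is available
    if ($visitedUrl->statusCode !== 200 || $body === null) {
        return;
    }
    // Example real-time check: count <h1> headings as each page arrives,
    // accumulating into a property that analyze() reports on later.
    $this->h1Counts[$visitedUrl->url] = preg_match_all('/<h1[\s>]/i', $body);
}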

Content Processors

Content processors parse specific content types (HTML, CSS, JS, etc.) to extract URLs and modify content for exports.

To create a custom content processor, implement the ContentProcessor interface or extend the BaseProcessor class:

namespace Crawler\ContentProcessor;

class MyFrameworkProcessor extends BaseProcessor implements ContentProcessor
{
    // Declare which content types this processor handles
    public function __construct(Crawler $crawler)
    {
        parent::__construct($crawler);
        $this->relevantContentTypes = [
            Crawler::CONTENT_TYPE_ID_HTML,
            // Add other relevant content types
        ];
    }

    // URL extraction method
    public function findUrls(string $content, int $contentType, ParsedUrl $url): ?FoundUrls
    {
        if (!in_array($contentType, $this->relevantContentTypes)) {
            return null;
        }

        // Extract URLs from content based on your custom logic
        $foundUrls = new FoundUrls();
        // Add URLs to $foundUrls
        return $foundUrls;
    }

    // Optional: modify content for offline export
    public function applyContentChangesForOfflineVersion(string &$content, int $contentType, ParsedUrl $url, bool $removeUnwantedCode): void
    {
        // Modify content for offline viewing
    }
}

Register your custom processor in the ContentProcessorManager::__construct() method.
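
As a rough illustration (the manager's internal structure may differ in your checkout, so mirror how the built-in HTML/CSS/JS processors are registered there):

// Illustrative sketch of ContentProcessorManager::__construct() -
// the property name and registration style are assumptions; copy
// whatever pattern the built-in processors use.
public function __construct(Crawler $crawler)
{
    // ... built-in processors are registered here ...
    $this->processors[] = new MyFrameworkProcessor($crawler);
}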

Exporters

Exporters transform crawled data into various output formats, such as HTML reports, offline websites, or specialized formats.

To create a custom exporter, implement the Exporter interface or extend the BaseExporter class:

namespace Crawler\Export;

class MyCustomExporter extends BaseExporter implements Exporter
{
    // Should this exporter be activated?
    public function shouldBeActivated(): bool
    {
        return true; // or check configuration
    }

    // Main export method
    public function export(): void
    {
        // Process $this->status data and generate output
    }

    // Define configuration options
    public static function getOptions(): Options
    {
        $options = new Options();
        // Add your custom options here
        return $options;
    }
}

Custom exporters should be placed in the Crawler/Export/ directory and registered in the ExporterManager.
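
To give a feel for what export() typically does, here is a minimal sketch that writes every visited URL and its status code to a JSON file; the hard-coded file name is a placeholder that a real exporter would expose through getOptions():

public function export(): void
{
    $rows = [];
    foreach ($this->status->getVisitedUrls() as $visitedUrl) {
        $rows[] = [
            'url' => $visitedUrl->url,
            'statusCode' => $visitedUrl->statusCode,
        ];
    }
    // 'crawl-summary.json' is hard-coded for brevity; in practice,
    // define an Option for the output path and read it from a property.
    file_put_contents('crawl-summary.json', json_encode($rows, JSON_PRETTY_PRINT));
}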

Best Practices

  1. Follow Existing Patterns: Study existing components to understand the design patterns they follow
  2. Respect Memory Usage: Be mindful of memory usage with large websites
  3. Add Configuration Options: Make your extensions configurable
  4. Error Handling: Implement proper error handling to avoid crashing the crawler (see the sketch after this list)
  5. Documentation: Document your extensions with comments and examples
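
For point 4, a pattern that works well is to wrap per-URL work in a try/catch so that one malformed page cannot abort the whole run; a minimal sketch:

public function analyze(): void
{
    foreach ($this->status->getVisitedUrls() as $visitedUrl) {
        try {
            // Per-URL work that might throw (parsing, decoding, etc.)
        } catch (\Throwable $e) {
            // Skip the offending page and keep going; report $e->getMessage()
            // through whatever logging your component already uses.
            continue;
        }
    }
}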

Example: Title Length Analyzer

Here’s a simplified example of a custom analyzer that checks if page titles follow best practices:

namespace Crawler\Analysis;

class TitleLengthAnalyzer extends BaseAnalyzer implements Analyzer
{
    protected int $minTitleLength = 30;
    protected int $maxTitleLength = 60;

    public function shouldBeActivated(): bool
    {
        return true;
    }

    public function analyze(): void
    {
        $results = [];
        foreach ($this->status->getVisitedUrls() as $visitedUrl) {
            if ($visitedUrl->statusCode === 200 && $visitedUrl->contentType === Crawler::CONTENT_TYPE_ID_HTML) {
                $htmlBody = $this->status->getStorage()->load($visitedUrl->uqId);
                preg_match('/<title[^>]*>([^<]*)<\/title>/i', $htmlBody, $matches);
                $title = trim($matches[1] ?? '');
                $length = mb_strlen($title);
                $status = ($length >= $this->minTitleLength && $length <= $this->maxTitleLength)
                    ? 'optimal'
                    : ($length < $this->minTitleLength ? 'too_short' : 'too_long');
                $results[] = [
                    'url' => $visitedUrl->url,
                    'title' => $title,
                    'length' => $length,
                    'status' => $status,
                ];
            }
        }

        // Create output table
        $superTable = new SuperTable(
            'title-length-analysis',
            'Title Length Analysis',
            'No HTML pages found.',
            [
                new SuperTableColumn('url', 'URL', 50),
                new SuperTableColumn('title', 'Title', 40),
                new SuperTableColumn('length', 'Length', 8),
                new SuperTableColumn('status', 'Status', 10, function ($value) {
                    return $value === 'optimal'
                        ? Utils::getColorText($value, 'green')
                        : Utils::getColorText($value, 'red');
                }),
            ],
            true,
            'status',
            'DESC'
        );
        $superTable->setData($results);

        $this->status->addSuperTable($superTable);
        $this->output->addSuperTable($superTable);
    }

    public function getOrder(): int
    {
        return 130;
    }

    public static function getOptions(): Options
    {
        $options = new Options();
        $options->addGroup(new Group(
            'title-length-analyzer',
            'Title Length Analyzer', [
                new Option('--min-title-length', null, 'minTitleLength', Type::INT, false, 'Minimum title length in characters', 30),
                new Option('--max-title-length', null, 'maxTitleLength', Type::INT, false, 'Maximum title length in characters', 60),
            ]
        ));
        return $options;
    }
}
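
Once the file is placed in the Crawler/Analysis/ directory, the analyzer is detected automatically and its options appear alongside the built-in ones. Assuming the standard launcher script, a run could look like:

./crawler --url=https://example.com/ --min-title-length=35 --max-title-length=65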

The extensibility of SiteOne Crawler opens up many possibilities:

  • Custom analyzers for industry-specific requirements
  • Integration with third-party APIs for enhanced analysis
  • Specialized exporters for different reporting formats
  • Content processors for modern JavaScript frameworks
  • Machine learning analyzers for content quality assessment

If you develop a useful extension, consider contributing it to the project to benefit the wider community.