
Crawl with Sitemap Generation

Sitemaps are crucial for helping search engines discover and index your website effectively. SiteOne Crawler can generate both XML and TXT sitemaps while crawling your site, saving you time and ensuring the sitemaps are complete and accurate.

The simplest way to crawl a website and generate sitemaps is:

Terminal window
./crawler --url=https://example.com/ \
  --sitemap-xml-file=sitemap.xml \
  --sitemap-txt-file=sitemap.txt

This command will:

  1. Crawl the entire website at example.com
  2. Generate an XML sitemap (sitemap.xml)
  3. Generate a TXT sitemap (sitemap.txt)
  4. Apply default priority settings to the XML sitemap

The generated XML sitemap follows the standard sitemap protocol format recognized by all major search engines:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/products</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/products/item1</loc>
    <priority>0.6</priority>
  </url>
  <url>
    <loc>https://example.com/blog</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/blog/post1</loc>
    <priority>0.6</priority>
  </url>
  <url>
    <loc>https://example.com/contact</loc>
    <priority>0.7</priority>
  </url>
</urlset>
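Before submitting a generated sitemap, it can be useful to parse it and confirm that the entries look right. A minimal sketch in Python (the filename sitemap.xml matches the command above; the helper name is our own):

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def read_sitemap(path):
    """Return a list of (loc, priority) tuples from a sitemap.xml file."""
    root = ET.parse(path).getroot()
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        priority = float(url.findtext("sm:priority", default="0.5", namespaces=NS))
        entries.append((loc, priority))
    return entries
```

Running this over the file and eyeballing the URL count and priorities is a quick sanity check before you upload anything.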

The TXT sitemap uses a simple one-URL-per-line format:

https://example.com/
https://example.com/about
https://example.com/blog
https://example.com/blog/post1
https://example.com/contact
https://example.com/products
https://example.com/products/item1

For more control over your sitemaps, you can use additional options:

Terminal window
./crawler --url=https://example.com/ \
  --sitemap-xml-file=sitemap.xml \
  --sitemap-txt-file=sitemap.txt \
  --sitemap-base-priority=0.6 \
  --sitemap-priority-increase=0.05 \
  --include-regex="/blog/|/products/" \
  --max-depth=3

This command:

  • Sets a higher base priority (0.6) for all URLs
  • Uses a smaller priority increment (0.05) between URL levels
  • Only includes URLs containing “/blog/” or “/products/”
  • Limits crawling to a maximum depth of 3 levels
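The effect of the --include-regex filter can be emulated in a few lines; this sketch assumes the pattern is treated as a standard regular expression matched anywhere in the URL:

```python
import re

# Keep a URL only if the include pattern matches somewhere in it
# (mirrors the "/blog/|/products/" filter used above).
def matches_include(url, pattern="/blog/|/products/"):
    return re.search(pattern, url) is not None

urls = [
    "https://example.com/",
    "https://example.com/blog/post1",
    "https://example.com/contact",
    "https://example.com/products/item1",
]
kept = [u for u in urls if matches_include(u)]
# kept == ["https://example.com/blog/post1", "https://example.com/products/item1"]
```

Note that the homepage itself does not match "/blog/|/products/", so a filter like this also excludes it; add an alternation such as "^https://example.com/$" if the root URL should stay in the sitemap.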

The XML sitemap priority values are calculated based on URL depth:

  • --sitemap-base-priority: Default value for URLs (default: 0.5)
  • --sitemap-priority-increase: Value added for each level closer to the root (default: 0.1)

For example, with default settings, each level closer to the root adds 0.1 to the priority, so pages near the root rank highest. In the example sitemap above:

  • Homepage (“/”) gets the highest priority: 0.8
  • First-level pages (“/about”, “/products”) get 0.7
  • Second-level pages (“/blog/post1”) get 0.6
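The depth-based calculation can be sketched as a small function. This assumes the deepest crawled level receives the base priority, each level closer to the root adds the increase, and the result is capped at 1.0 (the cap and the choice of reference depth are our assumptions, not documented behavior):

```python
def sitemap_priority(depth, deepest_level, base=0.5, increase=0.1):
    """Priority for a URL `depth` levels below the root, where URLs at
    `deepest_level` receive the base priority. Capped at 1.0 (assumed)."""
    return round(min(1.0, base + (deepest_level - depth) * increase), 2)

# With the defaults and a deepest level of 3, this reproduces the example
# sitemap above: homepage 0.8, first level 0.7, second level 0.6.
```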

For an e-commerce site where product pages are most important:

Terminal window
./crawler --url=https://myshop.com/ \
  --sitemap-xml-file=sitemap.xml \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.2 \
  --include-regex="/products/"

For a blog where the homepage and articles should have high priority:

Terminal window
./crawler --url=https://myblog.com/ \
  --sitemap-xml-file=blog-sitemap.xml \
  --sitemap-base-priority=0.7 \
  --sitemap-priority-increase=0.1 \
  --include-regex="/$|/post/|/category/"

For very large sites, you might want to create multiple targeted sitemaps:

Terminal window
# Products sitemap
./crawler --url=https://example.com/products/ \
  --sitemap-xml-file=products-sitemap.xml \
  --max-depth=3

# Blog sitemap
./crawler --url=https://example.com/blog/ \
  --sitemap-xml-file=blog-sitemap.xml \
  --max-depth=3

A few practical notes:

  1. URL Selection: Only URLs with a 200 status code and an HTML content type are included
  2. URL Sorting: URLs are sorted by depth (number of slashes) and then alphabetically
  3. Filename Convention: If you don’t specify a filename extension, .xml or .txt will be added automatically
  4. Verification: Always validate your XML sitemap using Google Search Console or similar tools
  5. Regular Updates: Schedule regular crawls to keep your sitemaps current
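The sorting rule (point 2 above) can be expressed directly: order by the number of slashes, then alphabetically within each depth. A sketch:

```python
def sort_sitemap_urls(urls):
    # Depth first (counted as number of slashes), then A-Z within a depth.
    return sorted(urls, key=lambda u: (u.count("/"), u))
```

This keeps shallow, typically more important pages at the top of the file, with deeper pages grouped after them.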

After generating your sitemap:

  1. Upload both files to your website’s root directory
  2. Add a reference to your XML sitemap in robots.txt:
    Sitemap: https://example.com/sitemap.xml
  3. Submit your sitemap URL to search engines through their webmaster tools (e.g. Google Search Console)
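As a quick check for step 2, you can verify that robots.txt actually advertises the sitemap. A small sketch (the helper name is our own):

```python
def robots_lists_sitemap(robots_txt, sitemap_url):
    # True if any "Sitemap: <url>" line in robots.txt points at sitemap_url.
    for line in robots_txt.splitlines():
        if line.strip().lower().startswith("sitemap:"):
            if line.split(":", 1)[1].strip() == sitemap_url:
                return True
    return False
```

Run this against the robots.txt you actually serve, since a sitemap that search engines cannot discover does little good.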
