Crawl with Sitemap Generation
Sitemaps are crucial for helping search engines discover and index your website effectively. SiteOne Crawler can generate both XML and TXT sitemaps while crawling your site, saving you time and ensuring the sitemaps are complete and accurate.
Basic Sitemap Generation
The simplest way to crawl a website and generate sitemaps is:
```bash
./crawler --url=https://example.com/ \
  --sitemap-xml-file=sitemap.xml \
  --sitemap-txt-file=sitemap.txt
```
This command will:
- Crawl the entire website at example.com
- Generate an XML sitemap (sitemap.xml)
- Generate a TXT sitemap (sitemap.txt)
- Apply default priority settings to the XML sitemap
Output Examples
XML Sitemap Output
The generated XML sitemap follows the standard sitemap protocol format recognized by all major search engines:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/products</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/products/item1</loc>
    <priority>0.6</priority>
  </url>
  <url>
    <loc>https://example.com/blog</loc>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>https://example.com/blog/post1</loc>
    <priority>0.6</priority>
  </url>
  <url>
    <loc>https://example.com/contact</loc>
    <priority>0.7</priority>
  </url>
</urlset>
```
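Because the file follows the standard sitemap protocol, any XML tooling can consume it. As a minimal sketch, here is how the URLs and priorities could be extracted with Python's standard library (the sample XML is inlined for illustration):

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemap protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><priority>0.8</priority></url>
  <url><loc>https://example.com/about</loc><priority>0.7</priority></url>
</urlset>"""

root = ET.fromstring(sitemap_xml)
entries = [
    (url.findtext("sm:loc", namespaces=NS),
     float(url.findtext("sm:priority", namespaces=NS)))
    for url in root.findall("sm:url", NS)
]
print(entries)
```

The same pattern works on a sitemap read from disk by swapping `ET.fromstring` for `ET.parse`.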
TXT Sitemap Output
The TXT sitemap uses a simple one-URL-per-line format:
```
https://example.com/
https://example.com/about
https://example.com/blog
https://example.com/blog/post1
https://example.com/contact
https://example.com/products
https://example.com/products/item1
```
Advanced Sitemap Generation
For more control over your sitemaps, you can use additional options:
```bash
./crawler --url=https://example.com/ \
  --sitemap-xml-file=sitemap.xml \
  --sitemap-txt-file=sitemap.txt \
  --sitemap-base-priority=0.6 \
  --sitemap-priority-increase=0.05 \
  --include-regex="/blog/|/products/" \
  --max-depth=3
```
This command:
- Sets a higher base priority (0.6) for all URLs
- Uses a smaller priority increment (0.05) between URL levels
- Only includes URLs containing “/blog/” or “/products/”
- Limits crawling to a maximum depth of 3 levels
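Before launching a long crawl, it can help to sanity-check an `--include-regex` pattern against a few candidate URLs. This sketch assumes the crawler matches the pattern anywhere in the URL, the way Python's `re.search` does (the exact regex dialect is an assumption):

```python
import re

# Same pattern as the --include-regex example above
pattern = re.compile(r"/blog/|/products/")

urls = [
    "https://example.com/blog/post1",      # contains /blog/     -> included
    "https://example.com/products/item1",  # contains /products/ -> included
    "https://example.com/about",           # matches neither     -> excluded
]

included = [u for u in urls if pattern.search(u)]
print(included)
```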
Customizing Priority Settings
The XML sitemap priority values are calculated based on URL depth:
- `--sitemap-base-priority`: Default value for URLs (default: 0.5)
- `--sitemap-priority-increase`: Value added for each level closer to the root (default: 0.1)
For example, with default settings (as in the XML sample above):
- A second-level page (“/blog/post1”) gets 0.6
- A first-level page (“/about”) gets 0.6 + 0.1 = 0.7
- The homepage (“/”) gets 0.7 + 0.1 = 0.8
Each level closer to the root adds the 0.1 increase, so the most important pages near the top of the site receive the highest priority.
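The exact formula is internal to the crawler, but one calculation consistent with the sample XML output (0.8 for the homepage, 0.7 for first-level pages, 0.6 for second-level pages under the defaults) is to start near the base priority at the deepest crawled level and add the increase for each level closer to the root. A hypothetical sketch:

```python
def sitemap_priority(depth: int, max_depth: int,
                     base: float = 0.5, increase: float = 0.1) -> float:
    """Hypothetical depth-based priority: the deepest pages sit just above
    the base value, and each level closer to the root adds `increase`.
    Capped at 1.0, the maximum the sitemap protocol allows."""
    return round(min(1.0, base + increase * (max_depth + 1 - depth)), 2)

# With the defaults and a site whose deepest crawled page is 2 levels down:
print(sitemap_priority(2, 2))  # second-level page -> 0.6
print(sitemap_priority(1, 2))  # first-level page  -> 0.7
print(sitemap_priority(0, 2))  # homepage          -> 0.8
```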
Use Cases
E-commerce Site Sitemap
For an e-commerce site where product pages are most important:
```bash
./crawler --url=https://myshop.com/ \
  --sitemap-xml-file=sitemap.xml \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.2 \
  --include-regex="/products/"
```
Blog Sitemap
For a blog where the homepage and articles should have high priority:
```bash
./crawler --url=https://myblog.com/ \
  --sitemap-xml-file=blog-sitemap.xml \
  --sitemap-base-priority=0.7 \
  --sitemap-priority-increase=0.1 \
  --include-regex="/$|/post/|/category/"
```
Multiple Sitemaps for Large Sites
For very large sites, you might want to create multiple targeted sitemaps:
```bash
# Products sitemap
./crawler --url=https://example.com/products/ \
  --sitemap-xml-file=products-sitemap.xml \
  --max-depth=3
```
```bash
# Blog sitemap
./crawler --url=https://example.com/blog/ \
  --sitemap-xml-file=blog-sitemap.xml \
  --max-depth=3
```
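Once you have several sitemap files, the sitemap protocol's index format lets you reference them all from a single file that you submit to search engines; an index like the following can be maintained alongside the generated sitemaps (the filenames match the commands above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/products-sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/blog-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```

Submit the index file itself; search engines will then fetch each referenced sitemap.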
Best Practices
- URL Selection: Only URLs that return a 200 status code with an HTML content type are included
- URL Sorting: URLs are sorted by depth (number of slashes) and then alphabetically
- Filename Convention: If you don’t specify a filename extension, `.xml` or `.txt` will be added automatically
- Verification: Always validate your XML sitemap using Google Search Console or similar tools
- Regular Updates: Schedule regular crawls to keep your sitemaps current
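The sorting rule above (depth first, then alphabetical) can be reproduced if you post-process URL lists yourself, for instance when merging sitemaps from several crawls. A small sketch, counting depth as the number of slashes in the URL path:

```python
from urllib.parse import urlparse

def sitemap_sort_key(url: str):
    # Sort by depth (slashes in the path), then alphabetically
    return (urlparse(url).path.count("/"), url)

urls = [
    "https://example.com/blog/post1",
    "https://example.com/about",
    "https://example.com/",
]
ordered = sorted(urls, key=sitemap_sort_key)
print(ordered)
```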
Submitting Your Sitemap
After generating your sitemap:
- Upload both files to your website’s root directory
- Add a reference to your XML sitemap in robots.txt:
```
Sitemap: https://example.com/sitemap.xml
```
- Submit your sitemap URL to search engines through their webmaster tools (for example, Google Search Console or Bing Webmaster Tools)
💡Next Steps
After generating your sitemaps, consider these follow-up actions:
- Set up a regular automated crawl to keep sitemaps updated
- Perform a full website crawl to identify any SEO issues
- Analyze your site for redirects and 404 errors that might affect indexing
- Export an offline version of your site for archiving