1. Insight
The problem this article addresses and why it matters.
Sitemaps are the difference between "indexed" and "indexed quickly"
A sitemap is a structured list of the URLs you want a search engine to crawl, with optional metadata for each URL: priority, change frequency, and last-modification date. The format is XML, the protocol is defined at sitemaps.org, and every major search engine supports it. Submitting a sitemap to Google Search Console can shorten discovery of new content from "eventual" to "within hours" on many sites, although a sitemap is a hint and does not guarantee indexing.
The format is small. The implementation is often broken, and broken in ways that don't fail loudly. A sitemap with stale or inaccurate <lastmod> dates misleads crawlers: dates bumped on every build trigger re-fetches of unchanged pages, and dates that never change hide real updates. A sitemap with <priority> 1.0 on every URL dilutes the signal: the sitemaps.org protocol defines priority as relative to the other URLs on your site, so making everything 1.0 says no more than making everything 0.5. (Google's documentation states that it ignores <priority> and <changefreq>, so these values matter only to crawlers that read them.) A sitemap with 100,000 URLs exceeds the 50,000-URL-per-file limit, and a crawler may reject it or process only part of it.
These problems show up six months after deploy as "why isn't our new content indexing?" support tickets. The fix is "validate the sitemap before deploying it."
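The stale-lastmod case is easy to check mechanically. The following is a minimal sketch, assuming the sitemap entries have already been parsed into objects; the `UrlEntry` type, the `findStaleEntries` name, and the 18-month default are illustrative assumptions, not the tool's internals.

```typescript
// Hypothetical parsed-entry shape; not the validator's actual type.
interface UrlEntry {
  loc: string;
  lastmod?: string; // W3C Datetime, e.g. "2024-03-12"
}

// Return entries whose lastmod is older than `maxAgeMonths` before `now`.
// Entries with a missing or unparseable lastmod are skipped, not flagged.
function findStaleEntries(
  entries: UrlEntry[],
  now: Date,
  maxAgeMonths = 18,
): UrlEntry[] {
  const cutoff = new Date(now.getTime());
  cutoff.setUTCMonth(cutoff.getUTCMonth() - maxAgeMonths);
  return entries.filter((entry) => {
    if (entry.lastmod === undefined) return false;
    const modified = new Date(entry.lastmod);
    if (Number.isNaN(modified.getTime())) return false;
    return modified.getTime() < cutoff.getTime();
  });
}
```

Passing `now` explicitly keeps the check deterministic, which matters if the result is compared between runs.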
Why a validator with crawl-budget analysis
The tool in this article validates against the sitemap XML schema (catching well-formedness issues) and runs a crawl-budget analysis: how many URLs, expected crawl frequency per URL given the <changefreq> distribution, the priority distribution (are most URLs at 1.0, suggesting priority dilution?), and the count against the 50,000-URL hard cap.
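The priority-distribution part of that analysis can be approximated in a few lines. This is a sketch under assumptions, not the tool's implementation: `priorityDistribution` and its return shape are hypothetical, and it assumes priorities have already been parsed to numbers. Entries that omit <priority> are counted at 0.5, which is the default in the sitemaps.org protocol.

```typescript
// Hypothetical report shape for this sketch.
interface PriorityReport {
  counts: Record<string, number>; // e.g. { "1.0": 3, "0.8": 1 }
  dominantShare: number; // share (0..1) held by the most common priority value
}

// Count URLs per priority value and measure how concentrated the distribution is.
function priorityDistribution(
  priorities: Array<number | undefined>,
): PriorityReport {
  const counts: Record<string, number> = {};
  let largest = 0;
  for (const priority of priorities) {
    // A missing <priority> falls back to the protocol default of 0.5.
    const key = (priority ?? 0.5).toFixed(1);
    const next = (counts[key] ?? 0) + 1;
    counts[key] = next;
    if (next > largest) largest = next;
  }
  const total = priorities.length;
  return { counts, dominantShare: total === 0 ? 0 : largest / total };
}
```

A `dominantShare` close to 1 means almost every URL carries the same priority, which is the dilution pattern described above.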
The output isn't just a pass/fail. It's a structured report a team can read and act on: "23 URLs have priority 1.0 — dilutes the signal", "the sitemap will exceed the 50,000-URL cap in approximately 90 days at current growth", "12 URLs have <lastmod> more than 18 months ago — consider removing or refreshing."
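A projection such as "will exceed the cap in approximately 90 days" can be produced with a linear estimate. The sketch below assumes constant growth in URLs per day; the source does not say which growth model the tool uses, and `daysUntilCap` is a hypothetical name.

```typescript
const URL_CAP = 50000; // per-file limit in the sitemaps.org protocol

// Linear projection: days until the sitemap reaches the cap at a constant growth rate.
// Returns 0 if already at or over the cap, Infinity if the sitemap is not growing.
function daysUntilCap(currentUrls: number, urlsAddedPerDay: number): number {
  if (currentUrls >= URL_CAP) return 0;
  if (urlsAddedPerDay <= 0) return Infinity;
  return Math.ceil((URL_CAP - currentUrls) / urlsAddedPerDay);
}
```

For example, a sitemap with 41,000 URLs growing by 100 URLs per day reaches the cap in 90 days under this model.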
What this article delivers
End-to-end walkthroughs of a real sitemap validation, the crawl-budget analyser's output, and the recommendations it produces. We cover the priority-dilution finding, the lastmod-staleness finding, and the cases where the sitemap is technically valid but structurally suboptimal for Google's crawling behaviour.
2. Intent
What you will be able to do after reading.
By the end of this article you will be able to:
- Validate any XML sitemap against the sitemaps.org schema with per-line error reporting
- Run the crawl-budget analyser to surface priority dilution, lastmod staleness, and over-size warnings
- Read the recommendations output and apply the highest-impact fix first
- Recognise the 50,000-URL hard cap and the sitemap-index pattern for sites that exceed it
- Identify the cases where a sitemap is technically valid but structurally suboptimal for Google's crawling behaviour
The Examples section walks through validating a real sitemap and reading the crawl-budget report.
3. Examples
Annotated code and worked scenarios.
Before / after: validating a sitemap
A real sitemap from a midsize site (12,000 URLs):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2026-05-15</lastmod>
<changefreq>daily</changefreq>
<priority>1.0</priority>
</url>
<url>
<loc>https://example.com/about</loc>
<lastmod>2024-03-12</lastmod>
<changefreq>monthly</changefreq>
<priority>1.0</priority>
</url>
<!-- ... 11,998 more URLs ... -->
</urlset>

sitemapValidator({
xml,
crawlBudgetAnalysis: true,
});
// valid: true
// urlCount: 12000
// errors: []
// warnings: []
// crawlBudget: {
// estimatedCrawlsPerDay: 800,
// crawlFrequencyPerUrl: '~once every 15 days',
// urlsByPriority: {
// '1.0': 9847,
// '0.8': 1234,
// '0.6': 919,
// },
// urlsByChangefreq: {
// 'daily': 2435,
// 'weekly': 4892,
// 'monthly': 4673,
// },
// recommendations: [
// 'CRITICAL: 9847 URLs have priority 1.0 (82% of sitemap). Google treats per-URL priority as relative — making everything 1.0 is equivalent to making nothing prioritised. Vary priorities to signal which URLs matter most.',
// 'WARNING: 1247 URLs have <lastmod> more than 18 months old. Either remove these from the sitemap or update lastmod when the content was actually last meaningful.',
// 'INFO: Approaching the 50,000-URL cap at current growth rates (estimated 18 months). Plan a sitemap-index split before you hit it.',
// ],
// }

The validator reports the sitemap is well-formed (zero errors). The crawl-budget analyser is where the real findings live: priority dilution is critical, lastmod staleness is a warning, the size-cap warning is informational because there's time to plan.
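The headline figures in this report are consistent with simple ratios. The sketch below reproduces them under the assumption that the per-URL crawl interval is the URL count divided by the estimated daily crawls; the source does not state how the tool derives the figure, and both function names are illustrative.

```typescript
// Average days between crawls of any one URL, assuming crawls are spread evenly.
function crawlIntervalDays(
  urlCount: number,
  estimatedCrawlsPerDay: number,
): number {
  if (estimatedCrawlsPerDay <= 0) return Infinity;
  return urlCount / estimatedCrawlsPerDay;
}

// Whole-number percentage of `part` within `total`.
function sharePercent(part: number, total: number): number {
  return total === 0 ? 0 : Math.round((part / total) * 100);
}
```

With the values from the report, 12,000 URLs at 800 crawls per day gives an interval of 15 days, and 9,847 of 12,000 URLs rounds to 82%.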
Before / after: well-formedness errors
A broken sitemap with subtle issues:
<urlset>
<url>
<loc>https://example.com/page-1</loc>
<lastmod>2026/05/15</lastmod>
<changefreq>often</changefreq>
<priority>2.0</priority>
</url>
<url>
<loc>not-a-url</loc>
</url>
</urlset>

sitemapValidator({ xml });
// valid: false
// urlCount: 2
// errors: [
// { line: 1, message: 'Missing xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" on <urlset>' },
// { line: 4, message: 'lastmod value "2026/05/15" is not W3C Datetime format. Use 2026-05-15 (ISO 8601).' },
// { line: 5, message: 'changefreq value "often" is not valid. Must be one of always/hourly/daily/weekly/monthly/yearly/never.' },
// { line: 6, message: 'priority "2.0" out of range — must be 0.0 to 1.0 inclusive.' },
// { line: 9, message: 'loc value "not-a-url" is not a valid URL.' },
// ]

Five errors. The validator's per-line output points at exactly what to fix. Most search engines would silently ignore individual broken <url> entries; the validator surfaces them so the team can fix the sitemap before deploy rather than wondering why some URLs aren't crawling.
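The field-level rules behind those messages are small enough to sketch. This is an illustration, not the validator's code: `checkEntry` and `RawEntry` are hypothetical names, the entry is assumed to be already extracted from the XML as strings, and the regular expressions are approximations of W3C Datetime and of an absolute http(s) URL.

```typescript
// Hypothetical raw entry: every value is the text content of its element.
interface RawEntry {
  loc?: string;
  lastmod?: string;
  changefreq?: string;
  priority?: string;
}

const CHANGEFREQ_VALUES: string[] = [
  "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
];

// W3C Datetime: YYYY, YYYY-MM, YYYY-MM-DD, or a timestamp with a timezone designator.
const W3C_DATETIME =
  /^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$/;

// Approximation: scheme plus a non-empty host; not a full URL parser.
const ABSOLUTE_HTTP_URL = /^https?:\/\/[^\s\/?#]+/i;

// Return the names of the fields that break a rule; an empty array means the entry passes.
function checkEntry(entry: RawEntry): string[] {
  const problems: string[] = [];
  if (entry.loc === undefined || !ABSOLUTE_HTTP_URL.test(entry.loc)) {
    problems.push("loc");
  }
  if (entry.lastmod !== undefined && !W3C_DATETIME.test(entry.lastmod)) {
    problems.push("lastmod");
  }
  if (
    entry.changefreq !== undefined &&
    CHANGEFREQ_VALUES.indexOf(entry.changefreq) === -1
  ) {
    problems.push("changefreq");
  }
  if (entry.priority !== undefined) {
    const value = Number(entry.priority);
    const empty = entry.priority.trim() === "";
    if (empty || Number.isNaN(value) || value < 0 || value > 1) {
      problems.push("priority");
    }
  }
  return problems;
}
```

Applied to the two entries in the broken sitemap, the first fails on lastmod, changefreq, and priority, and the second fails on loc. The namespace check on <urlset> is a document-level rule and is not covered by this per-entry sketch.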
Before / after: the 50,000-URL cap
<!-- sitemap with 52,341 URLs ... -->

sitemapValidator({ xml, crawlBudgetAnalysis: true });
// valid: false
// urlCount: 52341
// errors: [
// { line: 0, message: 'Sitemap contains 52341 URLs, exceeding the 50000-URL per-sitemap limit. Use a sitemap index to split into multiple sitemap files.' },
// ]
// warnings: []

The limits are 50,000 URLs per sitemap file and 50MB uncompressed. A file over either limit may be rejected or only partly processed by crawlers, often without a visible failure. The recommended fix is a sitemap index (sitemap_index.xml) pointing at multiple sub-sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemaps/products.xml</loc>
<lastmod>2026-05-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemaps/articles.xml</loc>
<lastmod>2026-05-15</lastmod>
</sitemap>
</sitemapindex>

The validator handles sitemap-index files separately: pass the index XML and the tool validates each sub-sitemap reference's syntax.
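Producing the split itself is outside the validator's scope, but the pattern is short. The sketch below chunks a URL list at the 50,000-URL limit and builds an index in the format shown above; `chunkUrls`, `buildSitemapIndex`, and `escapeXml` are hypothetical helpers, and the sketch does not check the 50MB size limit.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

// Split a URL list into consecutive chunks that each respect the per-file limit.
function chunkUrls(urls: string[], size = MAX_URLS_PER_SITEMAP): string[][] {
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

// Escape the five characters that XML text and attribute values require escaped.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap index with one <sitemap> entry per sub-sitemap URL.
function buildSitemapIndex(sitemapUrls: string[], lastmod: string): string {
  const entries = sitemapUrls
    .map(
      (url) =>
        `  <sitemap>\n    <loc>${escapeXml(url)}</loc>\n` +
        `    <lastmod>${escapeXml(lastmod)}</lastmod>\n  </sitemap>`,
    )
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${entries}\n</sitemapindex>`
  );
}
```

Chunking by count is the simplest split. Splitting by content type, as in the products and articles example, tends to make per-section crawl reports easier to read.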
When humans use this
A developer launching a new site validates the generated sitemap before submitting to Google Search Console. A team auditing search performance runs the validator with crawl-budget analysis to find the priority-dilution and stale-lastmod issues that account for most "why isn't this indexing?" problems.
When agents use this
Two production patterns:
- Pre-deploy sitemap check. A CI agent validates every generated sitemap on every deploy. Critical errors fail the build; warnings open a comment-style finding on the PR.
- Continuous-SEO agent. A scheduled agent re-validates the live sitemap weekly and compares the crawl-budget output against the prior week. A week-over-week increase in priority dilution or lastmod staleness signals that the sitemap generator is degrading; alerts fire when key thresholds are crossed.
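Both patterns reduce to small decisions over the validator's output. The sketch below is one way to express them, under assumptions: `gateBuild` and `priorityShareDrift` are hypothetical names, `ValidationResult` mirrors only the documented fields used here, and any alert threshold is left to the caller.

```typescript
// Subset of the documented output shape that the gate needs.
interface ValidationResult {
  valid: boolean;
  errors: Array<{ line: number; message: string }>;
  warnings: Array<{ line: number; message: string }>;
}

type GateDecision = "fail" | "comment" | "pass";

// Pre-deploy check: errors fail the build, warnings become a PR comment.
function gateBuild(result: ValidationResult): GateDecision {
  if (!result.valid || result.errors.length > 0) return "fail";
  if (result.warnings.length > 0) return "comment";
  return "pass";
}

// Share (0..1) of URLs that carry the given priority key in a urlsByPriority map.
function priorityShare(counts: Record<string, number>, key: string): number {
  let total = 0;
  for (const k of Object.keys(counts)) total += counts[k] ?? 0;
  return total === 0 ? 0 : (counts[key] ?? 0) / total;
}

// Weekly comparison: positive values mean a larger share of URLs now sits at `key`.
function priorityShareDrift(
  previous: Record<string, number>,
  current: Record<string, number>,
  key = "1.0",
): number {
  return priorityShare(current, key) - priorityShare(previous, key);
}
```

Comparing shares rather than raw counts keeps the drift signal stable while the sitemap grows.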
Edge cases
Sitemap indexes
The tool handles both <urlset> (standard sitemap) and <sitemapindex> (an index pointing at multiple sitemaps). Pass either; the tool detects which it is. Index files are validated against a different schema with sub-sitemap location URLs rather than URL entries.
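Detection of the document type can be illustrated by looking at the root element. The sketch below is a regular-expression shortcut with a hypothetical `detectSitemapKind` function; it is not the tool's detection logic, and a real XML parser would be the safer choice for untrusted input.

```typescript
type SitemapKind = "urlset" | "sitemapindex" | "unknown";

// Identify the document type from the root element, skipping a byte-order mark,
// the XML declaration or other processing instructions, and comments.
function detectSitemapKind(xml: string): SitemapKind {
  const body = xml
    .replace(/^\uFEFF/, "")
    .replace(/<\?[\s\S]*?\?>/g, "")
    .replace(/<!--[\s\S]*?-->/g, "");
  // Allow an optional namespace prefix such as <sm:urlset>.
  const match = /^\s*<(?:[\w.-]+:)?(urlset|sitemapindex)[\s>]/.exec(body);
  if (match === null) return "unknown";
  return match[1] === "urlset" ? "urlset" : "sitemapindex";
}
```

Returning "unknown" instead of throwing lets the caller decide whether an unexpected root element is an error or simply a different kind of file.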
Image, video, and news sitemaps
Google publishes extensions to the sitemaps.org protocol for image sitemaps, video sitemaps, and news sitemaps. The tool's validator handles all three. Image sitemaps validate <image:image> namespaces, video sitemaps validate <video:duration>, news sitemaps validate <news:publication_date>.
Compressed sitemaps
.xml.gz compressed sitemaps are common for large sites. Decompress before passing to the validator — the tool operates on the raw XML string. The 50MB uncompressed cap applies to the decompressed form; gzip ratios vary but a 5MB compressed sitemap is typically well under the limit.
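In a Node.js environment the decompression step can be done with the built-in zlib module. The sketch below assumes Node.js and uses hypothetical helper names (`decompressSitemap`, `compressSitemap`); it also applies the 50MB limit to the decompressed bytes before handing the string to a validator.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

const MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024; // 52,428,800 bytes

// Decompress a .xml.gz payload and return the raw XML string.
// Throws if the decompressed size is over the 50MB limit.
function decompressSitemap(gzipped: Uint8Array): string {
  const raw = gunzipSync(gzipped);
  if (raw.byteLength > MAX_UNCOMPRESSED_BYTES) {
    throw new Error(
      `Sitemap is ${raw.byteLength} bytes uncompressed, over the 50MB limit`,
    );
  }
  return raw.toString("utf8");
}

// Inverse helper, useful for building test fixtures.
function compressSitemap(xml: string): Uint8Array {
  return gzipSync(xml);
}
```

Note that `gunzipSync` holds the whole payload in memory, so this approach suits files near the protocol limits rather than multi-gigabyte inputs.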
Localised URLs (hreflang)
Sitemaps can include hreflang annotations (<xhtml:link rel="alternate" hreflang="...">). The tool validates the syntax; mismatches between hreflang entries and the actual page's Content-Language header are a different audit (a content-vs-sitemap consistency check) outside the validator's scope.
4. Documentation
Reference signatures, edge cases, and lookup tables.
Input parameters
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| xml | string | ✓ | — | Sitemap or sitemap-index XML content |
| crawlBudgetAnalysis | boolean | ✗ | false | Run the crawl-budget analyser |
Output shape
{
valid: boolean;
urlCount: number;
errors: Array<{ line: number; message: string }>;
warnings: Array<{ line: number; message: string }>;
crawlBudget?: {
estimatedCrawlsPerDay: number;
crawlFrequencyPerUrl: string;
urlsByPriority: Record<string, number>;
urlsByChangefreq: Record<string, number>;
recommendations: string[];
};
}

Validation checks
| Check | Severity |
|---|---|
| Missing or wrong | error |
| URL count > 50,000 | error (truncation by crawlers) |
| File size > 50MB uncompressed | error |
|  | error |
|  | error |
|  | error |
|  | error |
| Missing | error |
|  | warning |
|  | warning |
| Approaching 50,000-URL cap (>40,000) | info |
| No | info |
Error codes
| Code | When it fires | Recovery |
|---|---|---|
|  |  | Provide a non-empty sitemap |
|  | XML parse failed | Verify the XML is well-formed ( |
|  | XML exceeds 100MB | Split into a sitemap-index and validate each sub-sitemap separately |
When NOT to use this tool
For sitemap generation, use a dedicated sitemap generator (next-sitemap, Hugo's built-in, Jekyll's plugin, or your framework's equivalent). This tool validates after generation; it doesn't produce sitemaps from scratch.
For sitemap-index management at scale (sites with multi-million URLs), use a streaming sitemap generator that emits to disk in batches. Loading a multi-GB sitemap into the validator's memory isn't the right path.
For Google Search Console submission status (has Google indexed the URLs from your sitemap?), use the Search Console API. The validator audits the file; Search Console reports on the crawling and indexing.
Performance notes
Typical execution: under 50ms for sitemaps under 1MB. Crawl-budget analysis adds 5-20ms. The 50,000-URL hard cap is enforced — sitemaps exceeding it return errors without running the full analysis. The tool is deterministic — same input + same parameters always produce byte-identical output — so REST responses are Edge-Cache eligible.
The validator follows the sitemaps.org 0.9 schema and Google's published extensions. New extensions (image / video / news) are supported as of 2026; future spec additions require a tool release.