1. Insight
The problem this article addresses and why it matters.
The 12-line file that decides whether you're indexed
robots.txt is one of the smallest files on most websites, yet few pieces of static content have more influence on search traffic. A single Disallow: / accidentally shipped to production tells every compliant crawler to stay away from the entire site. A Disallow: /api/* that is slightly too broad can stop Googlebot from crawling the very pages the site depends on for search traffic. A missing Sitemap: directive forces every crawler to discover URLs the hard way.
The format predates HTTP/1.1: Martijn Koster proposed it in 1994, and it remained a de facto convention until RFC 9309 formalised it in 2022. Every major search engine still respects it (modulo their own crawling-priority logic). The format is small and the syntax is simple, which is the problem: developers write it once, copy-paste it between projects, and miss the production-specific gotchas every time.
Why a generator with simulation beats a snippet
The tool in this article does two related jobs. Generate mode produces a robots.txt from a structured rule set — easier to maintain than hand-edited text because the rules are typed and reviewable. Simulate mode is the killer feature: pass an array of URLs and the tool tells you which would be blocked, allowed, or ambiguously matched by the generated rules.
The simulation catches the failure mode every team encounters: shipping a Disallow: /admin* that also blocks /administration/ (a URL the team had no idea existed because it was an old redirect), or shipping a Disallow: /api/v1/ that blocks /api/v1/health (the URL the team uses for uptime monitoring). Run simulation against your site's full URL list before deploying the new robots.txt.
What this article delivers
End-to-end walkthroughs of generating a robots.txt for a real site, simulating against representative URLs, and reading the warning output that flags common SEO mistakes. We cover the precedence rules between conflicting Allow / Disallow, the crawl-delay directive (and which crawlers respect it), and the cases where robots.txt is the wrong tool entirely (anything that must actually be confidential, not just non-indexed).
2. Intent
What you will be able to do after reading.
By the end of this article you will be able to:
- Generate a robots.txt from a structured set of user-agent rules with optional sitemap and crawl-delay directives
- Simulate the generated rules against an array of URLs to verify which paths would be blocked or allowed
- Read the warnings output that flags common mistakes (blocking CSS/JS files, overly broad disallow, missing sitemap)
- Recognise the cases where robots.txt is the wrong primitive: security through robots.txt is not security
- Understand the precedence rules when multiple Allow / Disallow patterns match the same URL
The Examples section walks through generating a real robots.txt, simulating against site URLs, and the warning that catches an accidental admin-page block.
3. Examples
Annotated code and worked scenarios.
Before / after: generating a real robots.txt
robotsTxtGenerator({
rules: [
{
agent: '*',
allow: ['/'],
disallow: ['/admin/*', '/api/*', '/_next/*'],
},
{
agent: 'GPTBot',
allow: [],
disallow: ['/'],
},
],
sitemap: 'https://obfus.link/sitemap.xml',
crawlDelay: 1,
});
// robotsTxt: `
// User-agent: *
// Allow: /
// Disallow: /admin/*
// Disallow: /api/*
// Disallow: /_next/*
// Crawl-delay: 1
//
// User-agent: GPTBot
// Disallow: /
//
// Sitemap: https://obfus.link/sitemap.xml
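// `
Two user-agent rule blocks. The default * block allows the public site and blocks the admin and API paths. It also blocks /_next/*, which is the pattern the CSS/JS warning later in this section flags, so treat that line as illustrative rather than recommended. GPTBot is fully disallowed: the site opts out of OpenAI's training-data crawling. The crawl-delay is a hint (some crawlers respect it, some don't).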
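How a structured rule set becomes robots.txt text can be sketched in a few lines. This is a minimal illustration, not the tool's actual implementation: the `RuleBlock` and `GeneratorInput` types and the `renderRobotsTxt` function are hypothetical names, and attaching `Crawl-delay` to the `*` group only is an assumption that mirrors the example output.

```typescript
// Hypothetical types for illustration; the tool's internals are not shown in this article.
interface RuleBlock {
  agent: string;
  allow: string[];
  disallow: string[];
}

interface GeneratorInput {
  rules: RuleBlock[];
  sitemap?: string;
  crawlDelay?: number;
}

function renderRobotsTxt(input: GeneratorInput): string {
  const groups: string[] = input.rules.map((block) => {
    const lines: string[] = [`User-agent: ${block.agent}`];
    for (const path of block.allow) lines.push(`Allow: ${path}`);
    for (const path of block.disallow) lines.push(`Disallow: ${path}`);
    // Assumption: the delay goes on the wildcard group only, mirroring the example output.
    if (input.crawlDelay !== undefined && block.agent === "*") {
      lines.push(`Crawl-delay: ${input.crawlDelay}`);
    }
    return lines.join("\n");
  });
  if (input.sitemap !== undefined) groups.push(`Sitemap: ${input.sitemap}`);
  return groups.join("\n\n") + "\n"; // groups are separated by one blank line
}

const renderedExample = renderRobotsTxt({
  rules: [
    { agent: "*", allow: ["/"], disallow: ["/admin/", "/api/"] },
    { agent: "GPTBot", allow: [], disallow: ["/"] },
  ],
  sitemap: "https://obfus.link/sitemap.xml",
  crawlDelay: 1,
});
console.log(renderedExample);
```

Keeping the rules as typed data and rendering them in one place is what makes the output reviewable: a diff on the rule array is easier to read than a diff on hand-edited text.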
Before / after: simulation catches a false positive
The team thinks they're only blocking the admin panel. They run simulation against representative URLs:
robotsTxtGenerator({
rules: [
{ agent: '*', allow: ['/'], disallow: ['/admin*'] }, // intentional: admin pages
],
simulate: [
'https://obfus.link/', // expect allowed
'https://obfus.link/tool/json-to-zod', // expect allowed
'https://obfus.link/admin/users', // expect blocked
'https://obfus.link/administration', // expect ?
'https://obfus.link/tool/admin-helper', // expect ?
],
});
// simulation: [
// { url: '/', agent: '*', allowed: true, matchedRule: 'Allow: /' },
// { url: '/tool/json-to-zod', agent: '*', allowed: true, matchedRule: 'Allow: /' },
// { url: '/admin/users', agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
// { url: '/administration', agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
// { url: '/tool/admin-helper', agent: '*', allowed: true, matchedRule: 'Allow: /' },
// ]
// warnings: [
// {
// severity: 'warning',
// message: 'Pattern "Disallow: /admin*" also blocks /administration. Use "Disallow: /admin/" (trailing slash) to scope to the admin directory only.',
// line: 3,
// },
// ]
The pattern /admin* matched /administration because robots.txt rules are prefix matches: both /admin/users and /administration start with /admin, and the trailing * adds nothing. The fix is Disallow: /admin/ (with the trailing slash), which only matches paths under the admin directory.
The warning catches the false positive before deploy. The /tool/admin-helper URL stays allowed because patterns are anchored at the start of the path, and that path starts with /tool, not /admin.
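The difference between the two patterns comes down to prefix matching, which can be checked in isolation. The sketch below is a simplified stand-in, not the tool's matcher: `matchesPrefix` is a hypothetical helper that handles only a trailing `*` and ignores other wildcards.

```typescript
// Simplified illustration: a robots.txt rule matches any path that starts with it.
// A trailing "*" adds nothing, so it is stripped before comparing.
function matchesPrefix(pattern: string, path: string): boolean {
  const prefix = pattern.endsWith("*") ? pattern.slice(0, -1) : pattern;
  return path.startsWith(prefix);
}

const samplePaths = ["/admin/users", "/administration", "/tool/admin-helper"];
for (const pattern of ["/admin*", "/admin/"]) {
  for (const path of samplePaths) {
    const verdict = matchesPrefix(pattern, path) ? "matched" : "not matched";
    console.log(`${pattern} vs ${path}: ${verdict}`);
  }
}
```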
Before / after: the CSS/JS warning
A common mistake — blocking the _next directory in a Next.js site:
robotsTxtGenerator({
rules: [
{ agent: '*', allow: ['/'], disallow: ['/_next/*'] },
],
});
// warnings: [
// {
// severity: 'critical',
// message: 'Disallow: /_next/* blocks CSS and JS Next.js needs to render the site. Googlebot can crawl the HTML but cannot evaluate the page; ranking drops for content that depends on JS-rendered text. Allow /_next/static/* explicitly.',
// line: 3,
// },
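// ]
The critical warning surfaces the production failure mode: Next.js apps need the assets under /_next/static/ to render. Blocking them means Googlebot sees a broken page. The recommendation is to add Allow: /_next/static/* before the disallow. Under longest-match precedence the more specific Allow wins regardless of order; listing it first also covers crawlers that use first-match.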
Before / after: AI-crawler opt-out
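OpenAI's GPTBot, Anthropic's anthropic-ai, Google-Extended (Google's opt-out token for AI training use), CCBot (Common Crawl), Bytespider (ByteDance): many sites want to opt out of these specifically. User-agent tokens change over time, so confirm the current names in each vendor's documentation. The tool generates the multi-agent rule block: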
robotsTxtGenerator({
rules: [
{ agent: '*', allow: ['/'], disallow: ['/admin/*'] },
{ agent: 'GPTBot', allow: [], disallow: ['/'] },
{ agent: 'anthropic-ai', allow: [], disallow: ['/'] },
{ agent: 'Google-Extended', allow: [], disallow: ['/'] },
{ agent: 'CCBot', allow: [], disallow: ['/'] },
],
});
// robotsTxt includes five user-agent blocks
Note this blocks training-data crawling, not search indexing. Googlebot (the search crawler) remains allowed; Google-Extended (the token that controls whether content is used for Google's Gemini, formerly Bard, models) is blocked separately.
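When the opt-out list grows, building the rule blocks from a list of agent names keeps the input short. The following is a small sketch under stated assumptions: `withOptOuts` and `AI_AGENTS` are hypothetical names, the `RuleBlock` shape is inferred from the examples in this article, and the token list is simply the one named in the text.

```typescript
// Rule shape inferred from the examples in this article.
interface RuleBlock {
  agent: string;
  allow: string[];
  disallow: string[];
}

// Tokens named in the text; verify current names against each vendor's documentation.
const AI_AGENTS: string[] = ["GPTBot", "anthropic-ai", "Google-Extended", "CCBot", "Bytespider"];

// Appends one fully-disallowed block per agent to the base rules.
function withOptOuts(base: RuleBlock[], agents: string[]): RuleBlock[] {
  const optOuts = agents.map((agent): RuleBlock => ({ agent, allow: [], disallow: ["/"] }));
  return [...base, ...optOuts];
}

const optOutRules = withOptOuts(
  [{ agent: "*", allow: ["/"], disallow: ["/admin/"] }],
  AI_AGENTS,
);
console.log(optOutRules.map((rule) => rule.agent).join(", "));
```

The resulting array can be passed as the `rules` field of the generator input.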
When humans use this
The first use is "I'm setting up a new site, generate me a sensible robots.txt." The high-leverage use is the pre-deploy simulation: before promoting a robots.txt change to production, run simulation against the site's URL list. The warning output catches the CSS/JS blocking case and the over-broad pattern case — both of which silently degrade SEO if shipped.
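A pre-deploy check along these lines could be wired into CI. This is a sketch, not part of the tool: `blockingWarnings` is a hypothetical helper, the warning shape follows the output documented later in this article, and treating only `critical` as deploy-blocking is an assumption you may want to tighten.

```typescript
type Severity = "critical" | "warning" | "info";

// Warning shape as documented in the tool's output.
interface ToolWarning {
  severity: Severity;
  message: string;
  line: number;
}

// Returns the warnings that should stop a deploy.
// Assumption: only "critical" blocks by default; pass more severities to be stricter.
function blockingWarnings(warnings: ToolWarning[], blockOn: Severity[] = ["critical"]): ToolWarning[] {
  return warnings.filter((w) => blockOn.some((severity) => severity === w.severity));
}

const sampleWarnings: ToolWarning[] = [
  { severity: "critical", message: "Disallow: /_next/* blocks CSS and JS", line: 3 },
  { severity: "warning", message: "Pattern \"Disallow: /admin*\" also blocks /administration", line: 3 },
  { severity: "info", message: "No sitemap configured", line: 0 },
];

const blockers = blockingWarnings(sampleWarnings);
for (const w of blockers) console.log(`line ${w.line}: ${w.message}`);
console.log(blockers.length > 0 ? "deploy blocked" : "ok to deploy");
```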
When agents use this
Two production patterns:
- Site-generator agent. An agent scaffolding a new Next.js / Astro / Hugo site generates the robots.txt from the project's content structure. The structured input is easier for the LLM to reason about than free-form text.
- SEO-audit agent. A scheduled agent runs simulation against the site's full sitemap weekly. Any URL that should be indexed but is blocked, or any URL that should be blocked but is reachable, opens an alert with the specific rule that caused the mismatch.
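The audit comparison itself is a small diff between simulation results and expectations. The sketch below assumes the simulation entry shape documented in this article; `findMismatches`, the `Mismatch` type, and the expectations map are hypothetical names introduced for illustration.

```typescript
// Simulation entry shape as documented in the tool's output.
interface SimulationResult {
  url: string;
  agent: string;
  allowed: boolean;
  matchedRule: string;
}

interface Mismatch {
  url: string;
  expectedAllowed: boolean;
  matchedRule: string;
}

// Compares each simulated URL against the expected outcome and reports disagreements.
// URLs without a recorded expectation are skipped.
function findMismatches(results: SimulationResult[], expectations: Map<string, boolean>): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const result of results) {
    const expectedAllowed = expectations.get(result.url);
    if (expectedAllowed === undefined || expectedAllowed === result.allowed) continue;
    mismatches.push({ url: result.url, expectedAllowed, matchedRule: result.matchedRule });
  }
  return mismatches;
}

const auditExpectations = new Map<string, boolean>([
  ["/admin/users", false],
  ["/administration", true],
  ["/tool/json-to-zod", true],
]);

const auditResults: SimulationResult[] = [
  { url: "/tool/json-to-zod", agent: "*", allowed: true, matchedRule: "Allow: /" },
  { url: "/admin/users", agent: "*", allowed: false, matchedRule: "Disallow: /admin*" },
  { url: "/administration", agent: "*", allowed: false, matchedRule: "Disallow: /admin*" },
];

for (const m of findMismatches(auditResults, auditExpectations)) {
  console.log(`${m.url}: expected ${m.expectedAllowed ? "allowed" : "blocked"}, caused by "${m.matchedRule}"`);
}
```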
Edge cases
Allow vs Disallow precedence
When multiple patterns match the same URL, the most specific (longest) pattern wins. Allow: /admin/help beats Disallow: /admin/* for the URL /admin/help. This is the rule Google follows and the one RFC 9309 specifies, with Allow winning when an Allow and a Disallow are equally specific; other crawlers may differ (some use first-match). The tool warns when conflicting patterns exist and explains which would win under the Google model.
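The longest-match rule can be sketched as follows. This is an illustration of the precedence logic only, not the tool's implementation: `decide` and `prefixOf` are hypothetical names, patterns are treated as plain prefixes with a trailing `*` ignored, and ties go to Allow.

```typescript
interface Rule {
  type: "allow" | "disallow";
  pattern: string;
}

// Simplification: treat each pattern as a plain prefix, ignoring a trailing "*".
function prefixOf(pattern: string): string {
  return pattern.endsWith("*") ? pattern.slice(0, -1) : pattern;
}

// Longest matching pattern wins; on a tie, Allow wins (the least restrictive rule).
function decide(rules: Rule[], path: string): { allowed: boolean; matchedRule: string | null } {
  let best: Rule | null = null;
  for (const rule of rules) {
    if (!path.startsWith(prefixOf(rule.pattern))) continue;
    const longer = best === null || rule.pattern.length > best.pattern.length;
    const tieGoesToAllow =
      best !== null && rule.pattern.length === best.pattern.length && rule.type === "allow";
    if (longer || tieGoesToAllow) best = rule;
  }
  if (best === null) return { allowed: true, matchedRule: null }; // nothing matched: crawling is allowed
  const label = best.type === "allow" ? "Allow" : "Disallow";
  return { allowed: best.type === "allow", matchedRule: `${label}: ${best.pattern}` };
}

const precedenceRules: Rule[] = [
  { type: "allow", pattern: "/" },
  { type: "disallow", pattern: "/admin/*" },
  { type: "allow", pattern: "/admin/help" },
];

for (const path of ["/admin/help", "/admin/users", "/pricing"]) {
  const verdict = decide(precedenceRules, path);
  console.log(path, verdict.allowed ? "allowed" : "blocked", verdict.matchedRule);
}
```

A first-match crawler would instead stop at the first rule whose pattern matches, which is why rule order can still matter outside the Google model.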
Crawl-delay support
Googlebot ignores Crawl-delay: entirely. Bing honours it. Support among other crawlers varies and has changed over time (Yandex, for example, announced in 2018 that it would stop honouring the directive), so check each crawler's current documentation instead of assuming. The tool emits the directive when set; it doesn't tell you which crawler is going to obey it. If crawl rate is a real concern, configure it per crawler in that crawler's webmaster dashboard, where one is offered.
Pattern-matching dialect
robots.txt patterns use * for "zero or more characters" and $ for end-of-string anchoring. They're not full regex — ?, +, parentheses, character classes are all literal characters. The tool surfaces a warning if it sees these in a pattern (which is almost always a copy-paste mistake from regex).
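The dialect is small enough to translate into a regular expression directly. The sketch below shows one way to do that and to flag regex-looking characters; `patternToRegExp` and `suspiciousChars` are hypothetical helpers written for illustration, not the tool's API.

```typescript
// Translates a robots.txt path pattern into a RegExp:
// "*" becomes ".*", a trailing "$" anchors the end, everything else is literal.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split("*")
    .map((literal) => literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")) // escape regex metacharacters
    .join(".*");
  return new RegExp("^" + source + (anchored ? "$" : ""));
}

// Characters that mean something in regex but are literal in robots.txt.
// Their presence usually signals a pattern copied from a regex.
function suspiciousChars(pattern: string): string[] {
  const found: string[] = pattern.match(/[?+()[\]{}|]/g) ?? [];
  return found.filter((ch, index) => found.indexOf(ch) === index); // de-duplicate, keep order
}

console.log(patternToRegExp("/*.pdf$").test("/docs/report.pdf"));
console.log(patternToRegExp("/*.pdf$").test("/docs/report.pdf?download=1"));
console.log(suspiciousChars("/(blog|news)/.+"));
```

Because unanchored patterns are prefix matches, the generated expression is anchored at the start only; the end anchor is added solely when the pattern ends in `$`.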
Pages that need confidentiality
robots.txt is a polite request, not an access control mechanism. The file itself is public, so every path listed in it is visible to anyone. Listing /admin/secrets/ in Disallow: advertises that the URL exists. For real confidentiality, use authentication. For "discourage indexing but the URL itself isn't a secret", use <meta name="robots" content="noindex"> on the page and leave the URL crawlable: a crawler that is blocked by robots.txt never fetches the page, so it never sees the noindex tag.
4. Documentation
Reference signatures, edge cases, and lookup tables.
Input parameters
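Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `rules` | `Array<{ agent: string; allow: string[]; disallow: string[] }>` | ✓ | — | Per-user-agent rule blocks |
| `sitemap` | `string` | ✗ | — | Sitemap URL to include as a `Sitemap:` directive |
| `crawlDelay` | `number` | ✗ | — | Seconds; emitted as `Crawl-delay:` |
| `simulate` | `string[]` | ✗ | — | URLs to test against the generated rules |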
Output shape
{
robotsTxt: string; // the generated robots.txt content
simulation?: Array<{
url: string;
agent: string; // user-agent that the rule applies to
allowed: boolean;
matchedRule: string; // 'Allow: /foo' or 'Disallow: /bar'
}>;
warnings: Array<{
severity: 'critical' | 'warning' | 'info';
message: string;
line: number; // line in the generated robots.txt
}>;
}
Pattern semantics
Pattern | Matches | Doesn't match |
|---|---|---|
| | | (matches more than intended) |
| | every URL (rooted) | (nothing: wildcard prefix) |
Common warnings
Severity | Trigger |
|---|---|
critical | Blocking |
critical | |
warning | Pattern with no |
warning | |
warning | Conflicting Allow + Disallow with same specificity (precedence depends on crawler) |
info | No |
Error codes
Code | When it fires | Recovery |
|---|---|---|
| | | Provide at least one rule block |
| | Rule pattern contains non- | Use only |
| | | Split into multiple calls |
When NOT to use this tool
Don't use robots.txt to hide pages you want to keep confidential. The file itself is publicly readable and listing a URL there advertises that URL exists. For real confidentiality use authentication. For "this URL should not appear in search but the URL itself isn't a secret" use <meta name="robots" content="noindex"> on the page.
Don't use Disallow: to handle login walls. A crawler that requests a login-protected URL without credentials should receive a 401; that's the right way to gate non-public content. robots.txt is for "this URL exists but please don't crawl it"; auth is for "this URL is for authorised users only."
Performance notes
Typical execution: under 3ms for generation. Simulation: O(rules × urls). 100 URLs × 5 rules runs in about 20ms. The tool is deterministic — same input always produces the same output — so REST responses are Edge-Cache eligible.
The pattern-matching follows Google's robots.txt parser semantics (RFC 9309, 2022). Other crawlers may differ slightly; the simulation results are accurate for Googlebot.