robots.txt with SEO impact simulation — obfus.link

1. Insight

The problem this article addresses and why it matters.

The 12-line file that decides whether you're indexed

robots.txt is one of the smallest files on most websites, yet few pieces of static content have more influence on search traffic. A single Disallow: / accidentally shipped to production tells every compliant crawler to stop crawling the site. A slightly too aggressive Disallow: /api/* can stop Googlebot from crawling the very pages the site depends on for search traffic. A missing Sitemap: directive forces every crawler to discover URLs the hard way.

The format itself is older than HTTP/1.1: Martijn Koster proposed it in 1994, and it remained a de facto convention until it was formalised as RFC 9309 in 2022. Every major search engine still respects it (modulo their own crawling-priority logic). The format is small and the syntax is simple, which is the problem: developers write it once, copy-paste it between projects, and miss the production-specific gotchas.

Why a generator with simulation beats a snippet

The tool in this article does two related jobs. Generate mode produces a robots.txt from a structured rule set — easier to maintain than hand-edited text because the rules are typed and reviewable. Simulate mode is the killer feature: pass an array of URLs and the tool tells you which would be blocked, allowed, or ambiguously matched by the generated rules.

The simulation catches the failure mode every team encounters: shipping a Disallow: /admin* that also blocks /administration/ (a URL the team had no idea existed because it was an old redirect), or shipping a Disallow: /api/v1/ that blocks /api/v1/health (the URL the team uses for uptime monitoring). Run the simulation against your site's full URL list before deploying a new robots.txt.

What this article delivers

End-to-end walkthroughs of generating a robots.txt for a real site, simulating it against representative URLs, and reading the warning output that flags common SEO mistakes. We cover the precedence rules between conflicting Allow / Disallow patterns, the Crawl-delay directive (and which crawlers respect it), and the cases where robots.txt is the wrong tool entirely (anything that must actually be confidential, not just non-indexed).

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

Generate a robots.txt from a structured set of user-agent rules with optional sitemap and crawl-delay directives
Simulate the generated rules against an array of URLs to verify which paths would be blocked or allowed
Read the warnings output that flags common mistakes (blocking CSS/JS files, overly broad disallow, missing sitemap)
Recognise the cases where robots.txt is the wrong primitive — security through robots.txt is not security
Understand the precedence rules when multiple Allow / Disallow patterns match the same URL

The Examples section walks through generating a real robots.txt, simulating against site URLs, and the warning that catches an accidental admin-page block.

3. Examples

Annotated code and worked scenarios.

Before / after: generating a real robots.txt

robotsTxtGenerator({
  rules: [
    {
      agent:    '*',
      allow:    ['/'],
      disallow: ['/admin/*', '/api/*', '/_next/*'],
    },
    {
      agent:    'GPTBot',
      allow:    [],
      disallow: ['/'],
    },
  ],
  sitemap:    'https://obfus.link/sitemap.xml',
  crawlDelay: 1,
});

// robotsTxt: `
//   User-agent: *
//   Allow: /
//   Disallow: /admin/*
//   Disallow: /api/*
//   Disallow: /_next/*
//   Crawl-delay: 1
//
//   User-agent: GPTBot
//   Disallow: /
//
//   Sitemap: https://obfus.link/sitemap.xml
// `

Two user-agent rule blocks. The default * block allows the public site and blocks the admin, API, and /_next/ paths (the /_next/* rule triggers the critical CSS/JS warning covered later in this section). GPTBot is fully disallowed: the site opts out of OpenAI training-data crawling. Crawl-delay is only a hint; some crawlers respect it and some don't.
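The mapping from structured rules to text is mechanical, which is part of why typed input is easy to review. A minimal sketch of such a serializer follows. `renderRobotsTxt` and the `RuleBlock` / `GeneratorInput` types are hypothetical names, not the tool's actual source, and attaching `Crawl-delay` only to the `*` block is an assumption read off the sample output above.

```typescript
// Hypothetical types mirroring the input shown above; not the tool's real source.
interface RuleBlock {
  agent: string;
  allow: string[];
  disallow: string[];
}

interface GeneratorInput {
  rules: RuleBlock[];
  sitemap?: string;
  crawlDelay?: number;
}

function renderRobotsTxt(input: GeneratorInput): string {
  const blocks = input.rules.map((rule) => {
    const lines = [`User-agent: ${rule.agent}`];
    for (const path of rule.allow) lines.push(`Allow: ${path}`);
    for (const path of rule.disallow) lines.push(`Disallow: ${path}`);
    // Assumption: the delay is attached to the wildcard block only,
    // which is what the sample output above shows.
    if (rule.agent === '*' && input.crawlDelay !== undefined) {
      lines.push(`Crawl-delay: ${input.crawlDelay}`);
    }
    return lines.join('\n');
  });
  // Sitemap is a file-level directive, so it goes after all agent blocks.
  if (input.sitemap) blocks.push(`Sitemap: ${input.sitemap}`);
  return blocks.join('\n\n') + '\n';
}

const rendered = renderRobotsTxt({
  rules: [
    { agent: '*', allow: ['/'], disallow: ['/admin/*', '/api/*', '/_next/*'] },
    { agent: 'GPTBot', allow: [], disallow: ['/'] },
  ],
  sitemap: 'https://obfus.link/sitemap.xml',
  crawlDelay: 1,
});
```

Because the output is a pure function of the input, a rule change shows up as a small, reviewable diff in the generated text.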

Before / after: simulation catches a false positive

The team thinks they're only blocking the admin panel. They run simulation against representative URLs:

robotsTxtGenerator({
  rules: [
    { agent: '*', allow: ['/'], disallow: ['/admin*'] },  // intentional: admin pages
  ],
  simulate: [
    'https://obfus.link/',                       // expect allowed
    'https://obfus.link/tool/json-to-zod',       // expect allowed
    'https://obfus.link/admin/users',            // expect blocked
    'https://obfus.link/administration',         // expect ?
    'https://obfus.link/tool/admin-helper',      // expect ?
  ],
});

// simulation: [
//   { url: '/',                     agent: '*', allowed: true,  matchedRule: 'Allow: /' },
//   { url: '/tool/json-to-zod',     agent: '*', allowed: true,  matchedRule: 'Allow: /' },
//   { url: '/admin/users',          agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
//   { url: '/administration',       agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
//   { url: '/tool/admin-helper',    agent: '*', allowed: true,  matchedRule: 'Allow: /' },
// ]
// warnings: [
//   {
//     severity: 'warning',
//     message:  'Pattern "Disallow: /admin*" also blocks /administration. Use "Disallow: /admin/" (trailing slash) to scope to the admin directory only.',
//     line:     3,
//   },
// ]

The pattern /admin* matched /administration because robots.txt rules are prefix matches and * matches any run of characters: both /admin/users and /administration start with /admin. The fix is Disallow: /admin/ (with the trailing slash), which only matches paths under the admin directory.

The warning catches the false positive before deploy. /tool/admin-helper stays allowed because patterns are anchored at the start of the path, so /admin* never matches in the middle of a URL.
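The matching behaviour behind this result can be reproduced in a few lines. The sketch below translates a robots.txt pattern into a regular expression, treating `*` as "any run of characters" and a trailing `$` as an end anchor. `patternToRegExp` is a hypothetical helper written for illustration, not the tool's matcher.

```typescript
// Hypothetical helper: translate a robots.txt path pattern into a RegExp.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.length > 0 && pattern.charAt(pattern.length - 1) === '$';
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split('*')
    // Escape regex metacharacters so everything except `*` is matched literally.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  // Rules match from the start of the path; they only match to the end
  // when the pattern finishes with `$`.
  return new RegExp('^' + source + (anchored ? '$' : ''));
}

const samplePaths = [
  '/',
  '/tool/json-to-zod',
  '/admin/users',
  '/administration',
  '/tool/admin-helper',
];

for (const pattern of ['/admin*', '/admin/']) {
  const re = patternToRegExp(pattern);
  console.log(pattern, 'blocks', samplePaths.filter((p) => re.test(p)));
}
// /admin* blocks /admin/users and /administration
// /admin/ blocks only /admin/users
```

Note that the trailing `*` in `/admin*` adds nothing: `/admin` alone is already a prefix match and blocks the same paths.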

Before / after: the CSS/JS warning

A common mistake — blocking the _next directory in a Next.js site:

robotsTxtGenerator({
  rules: [
    { agent: '*', allow: ['/'], disallow: ['/_next/*'] },
  ],
});

// warnings: [
//   {
//     severity: 'critical',
//     message:  'Disallow: /_next/* blocks CSS and JS Next.js needs to render the site. Googlebot can crawl the HTML but cannot evaluate the page; ranking drops for content that depends on JS-rendered text. Allow /_next/static/* explicitly.',
//     line:     3,
//   },
// ]

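The critical warning surfaces the production failure mode: Next.js apps need _next/static/* to render, so blocking it means Googlebot sees a broken page. The recommendation is to add Allow: /_next/static/*. Under longest-match precedence the longer Allow wins wherever it appears, but listing it before the Disallow also keeps first-match crawlers on the right side.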

Before / after: AI-crawler opt-out

OpenAI's GPTBot, Anthropic's anthropic-ai, Google-Extended (Google's control token for AI training use), CCBot (Common Crawl), Bytespider: many sites want to opt out of these specifically. The tool generates the multi-agent rule block:

robotsTxtGenerator({
  rules: [
    { agent: '*',             allow: ['/'], disallow: ['/admin/*'] },
    { agent: 'GPTBot',        allow: [],    disallow: ['/'] },
    { agent: 'anthropic-ai',  allow: [],    disallow: ['/'] },
    { agent: 'Google-Extended', allow: [],  disallow: ['/'] },
    { agent: 'CCBot',         allow: [],    disallow: ['/'] },
  ],
});

// robotsTxt includes five user-agent blocks

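Note this blocks training-data crawling, not search indexing. Googlebot (the search crawler) remains allowed. Google-Extended is not a separate crawler but a token that controls whether content is used for Google's generative AI models (Gemini, formerly Bard); it is blocked separately and does not affect inclusion in Google Search.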
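Since every opt-out block has the same shape, the list of tokens can be kept as data and expanded into rule blocks. A small sketch, where `optOutRules` is a hypothetical helper and the token list simply repeats the four agents from the example above:

```typescript
interface RuleBlock {
  agent: string;
  allow: string[];
  disallow: string[];
}

// Hypothetical helper: one full-site Disallow block per user-agent token.
function optOutRules(agents: string[]): RuleBlock[] {
  return agents.map((agent) => ({ agent, allow: [], disallow: ['/'] }));
}

const baseRules: RuleBlock[] = [
  { agent: '*', allow: ['/'], disallow: ['/admin/*'] },
];

// The resulting array has the same shape as the `rules` input shown above.
const optOutRuleSet = baseRules.concat(
  optOutRules(['GPTBot', 'anthropic-ai', 'Google-Extended', 'CCBot']),
);
```

Adding or dropping a crawler then becomes a one-token change rather than a new hand-written block.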

When humans use this

The first use is "I'm setting up a new site, generate me a sensible robots.txt." The high-leverage use is the pre-deploy simulation: before promoting a robots.txt change to production, run simulation against the site's URL list. The warning output catches the CSS/JS blocking case and the over-broad pattern case — both of which silently degrade SEO if shipped.
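One way to wire the warnings into a deploy pipeline is a severity gate. The sketch below assumes the `warnings` array described under Output shape; `blockingWarnings`, the severity ranking, and the sample messages are hypothetical and not part of the tool.

```typescript
type Severity = 'critical' | 'warning' | 'info';

interface GeneratorWarning {
  severity: Severity;
  message: string;
  line: number;
}

// Assumed ordering: info < warning < critical.
const severityRank: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

// Returns the warnings at or above the threshold; an empty result means
// nothing should stop the deploy.
function blockingWarnings(
  warnings: GeneratorWarning[],
  threshold: Severity,
): GeneratorWarning[] {
  return warnings.filter((w) => severityRank[w.severity] >= severityRank[threshold]);
}

// Sample data for illustration only.
const sampleWarnings: GeneratorWarning[] = [
  { severity: 'critical', message: 'Disallow: /_next/* blocks CSS and JS', line: 3 },
  { severity: 'info', message: 'No Sitemap: directive', line: 1 },
];

const blockers = blockingWarnings(sampleWarnings, 'warning');
if (blockers.length > 0) {
  // In CI this is where the job would exit non-zero.
  console.log(`Refusing to deploy: ${blockers.length} blocking warning(s)`);
}
```

Gating at `warning` rather than `critical` is a policy choice; a team that tolerates over-broad patterns on staging might relax it there.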

When agents use this

Two production patterns:

Site-generator agent. An agent scaffolding a new Next.js / Astro / Hugo site generates the robots.txt from the project's content structure. The structured input is easier for the LLM to reason about than free-form text.
SEO-audit agent. A scheduled agent runs simulation against the site's full sitemap weekly. Any URL that should be indexed but is blocked, or any URL that should be blocked but is reachable, opens an alert with the specific rule that caused the mismatch.
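The audit pattern boils down to comparing simulation results with the intended crawlability of each URL. A minimal sketch, assuming the `simulation` entries have the shape shown earlier; `findMismatches` and the `expected` map are hypothetical names for illustration.

```typescript
interface SimulationResult {
  url: string;
  agent: string;
  allowed: boolean;
  matchedRule: string;
}

interface Mismatch {
  url: string;
  expected: boolean;
  actual: boolean;
  matchedRule: string;
}

// Hypothetical audit step: report every URL whose simulated result differs
// from what the team intended. URLs without an expectation are skipped.
function findMismatches(
  results: SimulationResult[],
  expected: { [url: string]: boolean },
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const r of results) {
    if (!Object.prototype.hasOwnProperty.call(expected, r.url)) continue;
    const want = expected[r.url];
    if (want !== r.allowed) {
      mismatches.push({
        url: r.url,
        expected: want,
        actual: r.allowed,
        matchedRule: r.matchedRule,
      });
    }
  }
  return mismatches;
}

const auditAlerts = findMismatches(
  [
    { url: '/admin/users', agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
    { url: '/administration', agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
  ],
  { '/admin/users': false, '/administration': true },
);
// auditAlerts has one entry: /administration, blocked by "Disallow: /admin*".
```

Carrying `matchedRule` into the alert is what lets the agent name the specific rule that caused the mismatch.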

Edge cases

Allow vs Disallow precedence

When multiple patterns match the same URL, the more specific (longer) pattern wins. Allow: /admin/help beats Disallow: /admin/* for the URL /admin/help. This is the rule in RFC 9309 and in Google's parser, where an exact tie goes to Allow; other crawlers may differ (some use first-match). The tool warns when conflicting patterns exist and explains which would win under the Google model.
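The longest-match rule can be sketched as a small selection function over the rules that already matched a URL. `pickWinner` is a hypothetical name; resolving equal-length ties in favour of Allow follows Google's documented behaviour and is an assumption about what the tool does. Pattern length is measured in characters here, which is equivalent to the octet count used by RFC 9309 only for ASCII patterns.

```typescript
interface MatchedRule {
  type: 'allow' | 'disallow';
  pattern: string;
}

// Sketch of longest-match precedence. Returns undefined when nothing
// matched, which a caller would treat as "allowed by default".
function pickWinner(matches: MatchedRule[]): MatchedRule | undefined {
  let winner: MatchedRule | undefined;
  for (const m of matches) {
    if (
      winner === undefined ||
      m.pattern.length > winner.pattern.length ||
      // Tie on length: prefer the less restrictive rule.
      (m.pattern.length === winner.pattern.length &&
        m.type === 'allow' &&
        winner.type === 'disallow')
    ) {
      winner = m;
    }
  }
  return winner;
}

// Both rules match /admin/help; the longer Allow pattern wins.
const helpWinner = pickWinner([
  { type: 'disallow', pattern: '/admin/*' },
  { type: 'allow', pattern: '/admin/help' },
]);

// Equal-length conflict: Allow wins under the assumed tie-break.
const tieWinner = pickWinner([
  { type: 'disallow', pattern: '/page' },
  { type: 'allow', pattern: '/page' },
]);
```

A first-match crawler would instead return the first element of the list, which is why rule order still matters for portability.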

Crawl-delay support

Google ignores Crawl-delay: and manages its crawl rate automatically. Some other crawlers, Bing among them, honour it, but support varies by crawler and has changed over time, so check each crawler's current documentation. The tool emits the directive when set; it doesn't tell you which crawler is going to obey it. If crawl rate is a real concern, configure it per-crawler at the dashboard layer where the crawler exposes settings.

Pattern-matching dialect

robots.txt patterns use * for "zero or more characters" and $ for end-of-string anchoring. They're not full regex — ?, +, parentheses, character classes are all literal characters. The tool surfaces a warning if it sees these in a pattern (which is almost always a copy-paste mistake from regex).
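A check for regex-style characters is easy to approximate. The sketch below reports each suspicious character once, using only the characters named above (`?`, `+`, parentheses, and square brackets); `findRegexLikeTokens` is a hypothetical helper, not the tool's implementation.

```typescript
// Hypothetical lint: list characters that suggest a pattern was written
// as a regular expression rather than in robots.txt syntax.
function findRegexLikeTokens(pattern: string): string[] {
  const found: string[] = [];
  const suspicious = /[?+()[\]]/g;
  let match: RegExpExecArray | null;
  while ((match = suspicious.exec(pattern)) !== null) {
    // Report each distinct character only once.
    if (found.indexOf(match[0]) === -1) found.push(match[0]);
  }
  return found;
}

const regexStyle = findRegexLikeTokens('/blog/(draft|old)/.+');
// regexStyle is ['(', ')', '+']

const robotsStyle = findRegexLikeTokens('/foo/*.html$');
// robotsStyle is [] because * and $ are valid robots.txt syntax
```

One caveat on the design: `?` also appears legitimately in rules that target query strings, so a production check might treat it more leniently than the other characters.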

Pages that need confidentiality

robots.txt is a polite request, not an access control mechanism. Any path listed in robots.txt is publicly visible, because the file itself is public: listing /admin/secrets/ in Disallow: advertises that the URL exists. For real confidentiality, use authentication. For "discourage indexing but the URL itself isn't a secret", put <meta name="robots" content="noindex"> on the page and leave the page crawlable; a crawler that is blocked by robots.txt never fetches the page and so never sees the noindex.

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

Field	Type	Required	Default	Description
`rules`	`Array<{agent, allow, disallow}>`	✓	—	Per-user-agent rule blocks
`sitemap`	`string`	✗	—	Sitemap URL to include as `Sitemap:` directive
`crawlDelay`	`number`	✗	—	Seconds — emitted as `Crawl-delay:` (Google ignores)
`simulate`	`string[]`	✗	—	URLs to test against the generated rules

Output shape

{
  robotsTxt:   string;        // the generated robots.txt content
  simulation?: Array<{
    url:         string;
    agent:       string;       // user-agent that the rule applies to
    allowed:     boolean;
    matchedRule: string;       // 'Allow: /foo' or 'Disallow: /bar'
  }>;
  warnings: Array<{
    severity: 'critical' | 'warning' | 'info';
    message:  string;
    line:     number;           // line in the generated robots.txt
  }>;
}

Pattern semantics

Pattern	Matches	Doesn't match
`/admin/`	`/admin/`, `/admin/x`, `/admin/x/y`	`/admin` (no trailing slash), `/administration`
`/admin*`	`/admin`, `/admin/x`, `/administration`	(matches more than intended)
`/foo/*.html`	`/foo/a.html`, `/foo/bar/b.html`	`/foo/a.htm`
`/foo$`	`/foo` exactly	`/foo/`, `/foo/bar`
`/`	every URL (rooted)	(nothing; every path starts with `/`)

Common warnings

Severity	Trigger
critical	Blocking `/_next/static/`, `/static/`, `/assets/`, `*.css`, `*.js`: degrades JS-rendering quality for Googlebot
critical	`Disallow: /` for `User-agent: *`: blocks all crawling of the entire site
warning	Pattern with no `/` prefix: likely a typo, should be `/foo` not `foo`
warning	`Disallow` directive missing its `:` (typo)
warning	Conflicting Allow + Disallow with same specificity — precedence depends on crawler
info	No `Sitemap:` directive — crawlers fall back to link discovery, slower indexing of new content

Error codes

Code	When it fires	Recovery
`INPUT_EMPTY`	`rules` array empty	Provide at least one rule block
`INPUT_INVALID_TYPE`	Rule pattern contains non-`robots.txt` syntax (`?`, `+`, character classes)	Use only `*` and `$`
`INPUT_TOO_LARGE`	`simulate` array exceeds 1000 URLs	Split into multiple calls
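Recovering from `INPUT_TOO_LARGE` means batching the URL list before calling the tool. A minimal sketch of the split; `chunk` is a hypothetical helper, the URLs are generated placeholders, and the limit of 1000 comes from the table above.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError('size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Placeholder URLs for illustration only.
const allUrls: string[] = [];
for (let i = 0; i < 2500; i++) {
  allUrls.push('https://obfus.link/page/' + i);
}

const urlBatches = chunk(allUrls, 1000);
// urlBatches.length === 3, with sizes 1000, 1000, and 500
```

Each batch can then be passed as the `simulate` array of a separate call, and the per-call `simulation` arrays concatenated afterwards.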

When NOT to use this tool

Don't use robots.txt to hide pages you want to keep confidential. The file itself is publicly readable and listing a URL there advertises that URL exists. For real confidentiality use authentication. For "this URL should not appear in search but the URL itself isn't a secret" use <meta name="robots" content="noindex"> on the page.

Don't use Disallow: to handle login walls. A logged-out crawler hitting a login-protected URL should receive a 401 or 403; that is the right way to gate non-public content. robots.txt is for "this URL exists but please don't crawl it"; auth is for "this URL is for authorised users only."

Performance notes

Typical execution: under 3ms for generation. Simulation: O(rules × urls). 100 URLs × 5 rules runs in about 20ms. The tool is deterministic — same input always produces the same output — so REST responses are Edge-Cache eligible.

The pattern-matching follows Google's robots.txt parser semantics (RFC 9309, 2022). Other crawlers may differ slightly; the simulation results are accurate for Googlebot.