robots.txt with SEO impact simulation

Generate a robots.txt from structured rules and simulate the result against an array of URLs. Catches the false positives (overly broad disallow, accidentally blocked CSS/JS) before deploy.

The robots.txt Generator produces robots.txt from a structured ruleset and simulates the output against URLs to verify which paths would be blocked or allowed. Warnings flag common SEO mistakes — blocking CSS or JS files, overly broad disallow patterns, missing Sitemap directives, conflicting Allow / Disallow with ambiguous precedence.

1. Insight

The problem this article addresses and why it matters.

The 12-line file that decides whether you're indexed

robots.txt is one of the smallest files on most websites, and few static files have as much influence on search traffic. A single Disallow: / accidentally shipped to production tells every compliant crawler to stay away from the entire site. A Disallow: /api/* that is slightly too broad can block Googlebot from the very pages the site depends on for indexing. A missing Sitemap: directive leaves crawlers to discover URLs by following links.

The format is older than HTTP/1.1: Martijn Koster proposed it in 1994, and it remained a de facto convention until RFC 9309 formalised it in 2022. Every major search engine still respects it (modulo their own crawling-priority logic). The format is small and the syntax is simple, which is the problem: developers write it once, copy-paste it between projects, and miss the production-specific gotchas.

Why a generator with simulation beats a snippet

The tool in this article does two related jobs. Generate mode produces a robots.txt from a structured rule set — easier to maintain than hand-edited text because the rules are typed and reviewable. Simulate mode is the killer feature: pass an array of URLs and the tool tells you which would be blocked, allowed, or ambiguously matched by the generated rules.

The simulation catches a failure mode most teams hit sooner or later: shipping a Disallow: /admin* that also blocks /administration (a URL the team had no idea existed because it was an old redirect), or shipping a Disallow: /api/v1/ that blocks /api/v1/health (the URL the team uses for uptime monitoring). Run the simulation against your site's full URL list before deploying a new robots.txt.

What this article delivers

End-to-end walkthroughs of generating a robots.txt for a real site, simulating it against representative URLs, and reading the warning output that flags common SEO mistakes. We cover the precedence rules between conflicting Allow / Disallow patterns, the Crawl-delay directive (and which crawlers respect it), and the cases where robots.txt is the wrong tool entirely (anything that must actually be confidential, not just non-indexed).

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

  • Generate a robots.txt from a structured set of user-agent rules with optional sitemap and crawl-delay directives
  • Simulate the generated rules against an array of URLs to verify which paths would be blocked or allowed
  • Read the warnings output that flags common mistakes (blocking CSS/JS files, overly broad disallow, missing sitemap)
  • Recognise the cases where robots.txt is the wrong primitive — security through robots.txt is not security
  • Understand the precedence rules when multiple Allow / Disallow patterns match the same URL

The Examples section walks through generating a real robots.txt, simulating against site URLs, and the warning that catches an accidental admin-page block.

3. Examples

Annotated code and worked scenarios.

Before / after: generating a real robots.txt

robotsTxtGenerator({
  rules: [
    {
      agent:    '*',
      allow:    ['/'],
      disallow: ['/admin/*', '/api/*', '/_next/*'],
    },
    {
      agent:    'GPTBot',
      allow:    [],
      disallow: ['/'],
    },
  ],
  sitemap:    'https://obfus.link/sitemap.xml',
  crawlDelay: 1,
});

// robotsTxt: `
//   User-agent: *
//   Allow: /
//   Disallow: /admin/*
//   Disallow: /api/*
//   Disallow: /_next/*
//   Crawl-delay: 1
//
//   User-agent: GPTBot
//   Disallow: /
//
//   Sitemap: https://obfus.link/sitemap.xml
// `

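Two user-agent rule blocks. The default * block allows the public site and blocks the admin, API, and /_next/ paths. GPTBot is fully disallowed, so the site opts out of OpenAI's training-data crawling. Crawl-delay is only a hint (some crawlers respect it, some don't). Note that the /_next/* disallow is exactly the kind of rule the CSS/JS warning later in this section flags as critical.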
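To make the mapping from structured rules to text concrete, here is a minimal sketch of a renderer. It is not the tool's actual implementation: the names RuleBlock and renderRobotsTxt are hypothetical, and attaching Crawl-delay to the * group is an assumption taken from the sample output above.

```typescript
// Hypothetical types and renderer, for illustration only.
interface RuleBlock {
  agent: string;
  allow: string[];
  disallow: string[];
}

interface RenderOptions {
  sitemap?: string;
  crawlDelay?: number;
}

function renderRobotsTxt(rules: RuleBlock[], opts: RenderOptions = {}): string {
  const groups = rules.map((rule) => {
    const lines = [`User-agent: ${rule.agent}`];
    for (const pattern of rule.allow) lines.push(`Allow: ${pattern}`);
    for (const pattern of rule.disallow) lines.push(`Disallow: ${pattern}`);
    // Assumption: Crawl-delay goes in the '*' group, as in the sample output.
    if (rule.agent === '*' && opts.crawlDelay !== undefined) {
      lines.push(`Crawl-delay: ${opts.crawlDelay}`);
    }
    return lines.join('\n');
  });
  // Sitemap is independent of any user-agent group, so it is emitted last.
  if (opts.sitemap) groups.push(`Sitemap: ${opts.sitemap}`);
  return groups.join('\n\n') + '\n';
}
```

Keeping each group as its own block separated by a blank line matches how crawlers read the file: one User-agent line followed by the rules that apply to it.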

Before / after: simulation catches a false positive

The team thinks they're only blocking the admin panel. They run simulation against representative URLs:

robotsTxtGenerator({
  rules: [
    { agent: '*', allow: ['/'], disallow: ['/admin*'] },  // intentional: admin pages
  ],
  simulate: [
    'https://obfus.link/',                       // expect allowed
    'https://obfus.link/tool/json-to-zod',       // expect allowed
    'https://obfus.link/admin/users',            // expect blocked
    'https://obfus.link/administration',         // expect ?
    'https://obfus.link/tool/admin-helper',      // expect ?
  ],
});

// simulation: [
//   { url: '/',                     agent: '*', allowed: true,  matchedRule: 'Allow: /' },
//   { url: '/tool/json-to-zod',     agent: '*', allowed: true,  matchedRule: 'Allow: /' },
//   { url: '/admin/users',          agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
//   { url: '/administration',       agent: '*', allowed: false, matchedRule: 'Disallow: /admin*' },
//   { url: '/tool/admin-helper',    agent: '*', allowed: true,  matchedRule: 'Allow: /' },
// ]
// warnings: [
//   {
//     severity: 'warning',
//     message:  'Pattern "Disallow: /admin*" also blocks /administration. Use "Disallow: /admin/" (trailing slash) to scope to the admin directory only.',
//     line:     3,
//   },
// ]

The pattern /admin* matched /administration because robots.txt rules are prefix matches and * matches any run of characters: both /admin/users and /administration start with /admin. The fix is Disallow: /admin/ (with the trailing slash), which only matches paths under the admin directory.

The warning catches the false positive before deploy. The /tool/admin-helper URL stays allowed because patterns are anchored at the start of the path, and that path does not begin with /admin.
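The matching behaviour described here can be sketched in a few lines. This is an illustrative matcher under the pattern dialect this article describes (* as wildcard, $ as end anchor, everything else literal), not the tool's own code; patternToRegExp and matchesPattern are hypothetical names.

```typescript
// Convert a robots.txt path pattern into a RegExp anchored at the start of the path.
function patternToRegExp(pattern: string): RegExp {
  const anchoredEnd = pattern.endsWith('$');
  const body = anchoredEnd ? pattern.slice(0, -1) : pattern;
  const source = body
    .split('*')
    // Escape regex metacharacters so they are treated as literal characters.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  // Without a trailing $, the pattern is a prefix match.
  return new RegExp('^' + source + (anchoredEnd ? '$' : ''));
}

function matchesPattern(pattern: string, path: string): boolean {
  return patternToRegExp(pattern).test(path);
}

// The cases from the simulation above:
matchesPattern('/admin*', '/administration');     // true: the false positive
matchesPattern('/admin*', '/tool/admin-helper');  // false: not at the start of the path
matchesPattern('/admin/', '/administration');     // false: the trailing slash scopes the rule
```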

Before / after: the CSS/JS warning

A common mistake — blocking the _next directory in a Next.js site:

robotsTxtGenerator({
  rules: [
    { agent: '*', allow: ['/'], disallow: ['/_next/*'] },
  ],
});

// warnings: [
//   {
//     severity: 'critical',
//     message:  'Disallow: /_next/* blocks CSS and JS Next.js needs to render the site. Googlebot can crawl the HTML but cannot evaluate the page; ranking drops for content that depends on JS-rendered text. Allow /_next/static/* explicitly.',
//     line:     3,
//   },
// ]

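The critical warning surfaces the production failure mode: Next.js apps need /_next/static/* to render, so blocking it means Googlebot sees a broken page. The recommendation is to add Allow: /_next/static/* alongside the disallow. Under Google's longest-match rule the order of the two lines does not matter, but listing the Allow first also protects crawlers that stop at the first matching rule.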
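One possible corrected ruleset, written in the same input shape as the call above. The variable name correctedRules is only illustrative, and whether the remaining /_next/* disallow is worth keeping at all is a judgment call for the site.

```typescript
// Same rule shape as the examples in this article; only the allow list changes.
const correctedRules = [
  {
    agent: '*',
    // The longer Allow pattern outranks the shorter Disallow for static assets.
    allow: ['/', '/_next/static/*'],
    disallow: ['/_next/*'],
  },
];
```

With this ruleset a path such as /_next/static/chunks/main.js matches both /_next/static/* and /_next/*, and the longer Allow pattern wins.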

Before / after: AI-crawler opt-out

OpenAI's GPTBot, Anthropic's anthropic-ai, Google-Extended (Google's opt-out token for AI training use), CCBot (Common Crawl), Bytespider (ByteDance): many sites want to opt out of these specifically. The tool generates the multi-agent rule block:

robotsTxtGenerator({
  rules: [
    { agent: '*',             allow: ['/'], disallow: ['/admin/*'] },
    { agent: 'GPTBot',        allow: [],    disallow: ['/'] },
    { agent: 'anthropic-ai',  allow: [],    disallow: ['/'] },
    { agent: 'Google-Extended', allow: [],  disallow: ['/'] },
    { agent: 'CCBot',         allow: [],    disallow: ['/'] },
  ],
});

// robotsTxt includes five user-agent blocks

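Note this blocks training-data crawling, not search indexing. Googlebot (the search crawler) remains allowed; Google-Extended is a separate token that controls whether content Google has crawled may be used for training its Gemini (formerly Bard) models, and blocking it does not affect Search.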
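Because every opt-out block has the same shape, the list can be built from an array of agent tokens instead of being written out by hand. The helper below is a hypothetical sketch (withOptOuts is not part of the tool); it assumes the rule shape used throughout this article.

```typescript
interface RuleBlock {
  agent: string;
  allow: string[];
  disallow: string[];
}

// Append a full-site Disallow block for each agent that has no block yet.
function withOptOuts(base: RuleBlock[], agents: string[]): RuleBlock[] {
  // User-agent tokens are matched case-insensitively, so compare in lower case.
  const existing = new Set(base.map((rule) => rule.agent.toLowerCase()));
  const extra = agents
    .filter((agent) => !existing.has(agent.toLowerCase()))
    .map((agent): RuleBlock => ({ agent, allow: [], disallow: ['/'] }));
  return [...base, ...extra];
}

const rules = withOptOuts(
  [{ agent: '*', allow: ['/'], disallow: ['/admin/*'] }],
  ['GPTBot', 'anthropic-ai', 'Google-Extended', 'CCBot'],
);
// rules now holds five blocks, matching the call above.
```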

When humans use this

The first use is "I'm setting up a new site, generate me a sensible robots.txt." The high-leverage use is the pre-deploy simulation: before promoting a robots.txt change to production, run simulation against the site's URL list. The warning output catches the CSS/JS blocking case and the over-broad pattern case — both of which silently degrade SEO if shipped.

When agents use this

Two production patterns:

  • Site-generator agent. An agent scaffolding a new Next.js / Astro / Hugo site generates the robots.txt from the project's content structure. The structured input is easier for the LLM to reason about than free-form text.
  • SEO-audit agent. A scheduled agent runs simulation against the site's full sitemap weekly. Any URL that should be indexed but is blocked, or any URL that should be blocked but is reachable, opens an alert with the specific rule that caused the mismatch.
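The audit pattern reduces to comparing simulation results against a table of expectations. A minimal sketch, assuming the simulation entries have the shape shown in the examples above; findMismatches and the expectations map are hypothetical and would be maintained by the team running the audit.

```typescript
interface SimulationResult {
  url: string;
  agent: string;
  allowed: boolean;
  matchedRule: string;
}

interface Mismatch {
  url: string;
  expected: boolean;
  actual: boolean;
  matchedRule: string;
}

// expectations maps a URL path to whether crawlers should be allowed to fetch it.
function findMismatches(
  results: SimulationResult[],
  expectations: Record<string, boolean>,
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const result of results) {
    const expected = expectations[result.url];
    // URLs without a recorded expectation are skipped rather than guessed at.
    if (expected === undefined || expected === result.allowed) continue;
    mismatches.push({
      url: result.url,
      expected,
      actual: result.allowed,
      matchedRule: result.matchedRule,
    });
  }
  return mismatches;
}
```

Each mismatch carries the matchedRule string, so an alert can name the exact rule responsible.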

Edge cases

Allow vs Disallow precedence

When multiple patterns match the same URL, the more specific (longer) pattern wins. Allow: /admin/help beats Disallow: /admin/* for the URL /admin/help. This is the rule in RFC 9309 and in Google's parser, where an Allow also wins a tie between patterns of equal length; other crawlers may differ (some use first-match). The tool warns when conflicting patterns exist and explains which would win under the Google model.
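A minimal sketch of the longest-match rule, with Allow winning ties and unmatched paths allowed by default as in RFC 9309. The function names are hypothetical, and measuring specificity by raw pattern length is a simplification of what a production parser does.

```typescript
type Candidate = { kind: 'Allow' | 'Disallow'; pattern: string };
type Verdict = { allowed: boolean; matchedRule: string };

// Compact matcher: * is a wildcard, a trailing $ anchors the end, the rest is literal.
function toRegExp(pattern: string): RegExp {
  const anchoredEnd = pattern.endsWith('$');
  const body = anchoredEnd ? pattern.slice(0, -1) : pattern;
  const source = body
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + source + (anchoredEnd ? '$' : ''));
}

function decide(path: string, allow: string[], disallow: string[]): Verdict {
  const candidates: Candidate[] = [
    ...allow.map((pattern): Candidate => ({ kind: 'Allow', pattern })),
    ...disallow.map((pattern): Candidate => ({ kind: 'Disallow', pattern })),
  ];
  let best: Candidate | null = null;
  for (const candidate of candidates) {
    if (!toRegExp(candidate.pattern).test(path)) continue;
    const longer = best === null || candidate.pattern.length > best.pattern.length;
    const tieForAllow =
      best !== null &&
      candidate.pattern.length === best.pattern.length &&
      candidate.kind === 'Allow';
    if (longer || tieForAllow) best = candidate;
  }
  // No matching rule means the path may be crawled.
  if (best === null) return { allowed: true, matchedRule: '' };
  return { allowed: best.kind === 'Allow', matchedRule: `${best.kind}: ${best.pattern}` };
}
```

For /admin/help with Allow: /admin/help and Disallow: /admin/*, the Allow pattern is longer, so the verdict is allowed.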

Crawl-delay support

Google ignores Crawl-delay: entirely. Bing honours it, and support among other crawlers varies, so check each crawler's current documentation before relying on it. The tool emits the directive when set; it doesn't tell you which crawlers will obey it. If crawl rate is a real concern, configure it per crawler in that crawler's webmaster dashboard where one is offered.

Pattern-matching dialect

robots.txt patterns use * for "zero or more characters" and $ for end-of-string anchoring. They're not full regex — ?, +, parentheses, character classes are all literal characters. The tool surfaces a warning if it sees these in a pattern (which is almost always a copy-paste mistake from regex).
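A check of this kind can be sketched as a scan for characters that only carry meaning in regular expressions. This is an illustrative helper, not the tool's implementation; findRegexLikeTokens is a hypothetical name and the character list is limited to the ones named above.

```typescript
// Characters that are regex syntax but plain literals in a robots.txt pattern.
const REGEX_ONLY_CHARS = ['?', '+', '(', ')', '[', ']'];

// Returns the suspicious characters found in the pattern, in list order.
function findRegexLikeTokens(pattern: string): string[] {
  return REGEX_ONLY_CHARS.filter((ch) => pattern.indexOf(ch) !== -1);
}

findRegexLikeTokens('/admin/*');     // [] : only robots.txt syntax
findRegexLikeTokens('/foo/(a|b)+');  // ['+', '(', ')'] : looks like a pasted regex
```

A literal ? is legitimate when a rule targets query strings, so a hit is a prompt to double-check the pattern rather than proof of a mistake.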

Pages that need confidentiality

robots.txt is a polite request, not an access control mechanism. Every path listed in robots.txt is visible to anyone, because the file itself is public. Listing /admin/secrets/ in Disallow: advertises that the URL exists. For real confidentiality, use authentication. For "discourage indexing but the URL itself isn't a secret", use <meta name="robots" content="noindex"> on the page and leave the page crawlable: a crawler that is blocked by robots.txt never fetches the page and so never sees the noindex tag.

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| rules | Array<{agent, allow, disallow}> | | | Per-user-agent rule blocks |
| sitemap | string | | | Sitemap URL to include as Sitemap: directive |
| crawlDelay | number | | | Seconds; emitted as Crawl-delay: (Google ignores) |
| simulate | string[] | | | URLs to test against the generated rules |

Output shape

{
  robotsTxt:   string;        // the generated robots.txt content
  simulation?: Array<{
    url:         string;
    agent:       string;       // user-agent that the rule applies to
    allowed:     boolean;
    matchedRule: string;       // 'Allow: /foo' or 'Disallow: /bar'
  }>;
  warnings: Array<{
    severity: 'critical' | 'warning' | 'info';
    message:  string;
    line:     number;           // line in the generated robots.txt
  }>;
}
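One way to consume the warnings array is as a deploy gate in CI. The sketch below assumes the warning shape documented above; shouldBlockDeploy and its threshold parameter are hypothetical names, and the choice of 'critical' as the default cut-off is only a suggestion.

```typescript
type Severity = 'critical' | 'warning' | 'info';

interface Warning {
  severity: Severity;
  message: string;
  line: number;
}

// Returns true when any warning is at or above the chosen severity threshold.
function shouldBlockDeploy(warnings: Warning[], threshold: Severity = 'critical'): boolean {
  const rank: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };
  return warnings.some((w) => rank[w.severity] >= rank[threshold]);
}
```

A stricter pipeline could pass 'warning' as the threshold so that over-broad patterns also stop the deploy.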

Pattern semantics

| Pattern | Matches | Doesn't match |
| --- | --- | --- |
| /admin/ | /admin/, /admin/x, /admin/x/y | /admin (no trailing slash), /administration |
| /admin* | /admin, /admin/x, /administration | (matches more than intended) |
| /foo/*.html | /foo/a.html, /foo/bar/b.html | /foo/a.htm |
| /foo$ | /foo exactly | /foo/, /foo/bar |
| / | every URL (rooted) | (nothing: wildcard prefix) |

Common warnings

| Severity | Trigger |
| --- | --- |
| critical | Blocking /_next/static/*, /static/*, /assets/*, *.css, *.js (degrades JS-rendering quality for Googlebot) |
| critical | Disallow: / for User-agent: * (blocks all indexing of the entire site) |
| warning | Pattern with no / prefix (likely a typo; should be /foo, not foo) |
| warning | Disallow directive with no colon (typo) |
| warning | Conflicting Allow + Disallow with same specificity (precedence depends on crawler) |
| info | No Sitemap: directive (crawlers fall back to link discovery; slower indexing of new content) |

Error codes

| Code | When it fires | Recovery |
| --- | --- | --- |
| INPUT_EMPTY | rules array empty | Provide at least one rule block |
| INPUT_INVALID_TYPE | Rule pattern contains non-robots.txt syntax (?, +, character classes) | Use only * and $ |
| INPUT_TOO_LARGE | simulate array exceeds 1000 URLs | Split into multiple calls |
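Recovering from INPUT_TOO_LARGE amounts to batching the URL list. A minimal sketch of the split, where chunk is a hypothetical helper and 1000 is the limit from the table above:

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Example: 2500 URLs become batches of 1000, 1000, and 500.
const urls = Array.from({ length: 2500 }, (_, i) => `https://obfus.link/page/${i}`);
const batches = chunk(urls, 1000);
```

Each batch can then be passed as the simulate array of a separate call with the same rules, and the per-call simulation arrays concatenated afterwards.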

When NOT to use this tool

Don't use robots.txt to hide pages you want to keep confidential. The file itself is publicly readable and listing a URL there advertises that URL exists. For real confidentiality use authentication. For "this URL should not appear in search but the URL itself isn't a secret" use <meta name="robots" content="noindex"> on the page.

Don't use Disallow: to handle login walls. A logged-out crawler requesting a login-only URL should receive a 401 (or a redirect to the login page); that's the right way to gate non-public content. robots.txt is for "this URL exists but please don't crawl it"; auth is for "this URL is for authorised users only."

Performance notes

Typical execution: under 3ms for generation. Simulation: O(rules × urls). 100 URLs × 5 rules runs in about 20ms. The tool is deterministic — same input always produces the same output — so REST responses are Edge-Cache eligible.

The pattern-matching follows Google's robots.txt parser semantics (RFC 9309, 2022). Other crawlers may differ slightly; the simulation results are accurate for Googlebot.

Frequently asked questions

How do I block AI crawlers without affecting search?

Add separate User-agent rules for the AI-specific tokens (GPTBot, anthropic-ai, Google-Extended, CCBot, Bytespider) with Disallow: /. The default User-agent: * block remains, allowing search crawlers. Note: Google-Extended controls use of your content for Google's Gemini (formerly Bard) training and is separate from Googlebot, which handles search indexing.

Why does the simulator warn about blocking _next/*?

Modern frameworks (Next.js, Astro with islands, SvelteKit) need their static assets to render correctly for Googlebot. Blocking the JS / CSS directory means Googlebot can crawl the HTML but can't render the page, so content that depends on JS rendering doesn't get indexed. The fix is to add an explicit Allow: /_next/static/* next to the broader disallow; listing it first also covers first-match crawlers.

How does precedence work with overlapping rules?

Per Google's implementation (RFC 9309), the more specific (longer) pattern wins. Allow: /admin/help beats Disallow: /admin/*. The tool surfaces ambiguous cases as warnings. Other crawlers may use first-match — verify against the specific crawler if you're relying on precedence rules.

Should I put credentials or secrets in Disallow?

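No. robots.txt is publicly readable; listing a URL there advertises that the URL exists. For real confidentiality use authentication. For "discourage indexing but the URL itself isn't secret" use a noindex meta tag on a page that crawlers are still allowed to fetch, since a page blocked in robots.txt is never fetched and its noindex tag is never seen.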