Regex from examples + cross-flavor translation across JS, Python, Go, Rust, PCRE — obfus.link

1. Insight

The problem this article addresses and why it matters.

Regex is the wrong syntax for the right idea

Almost every developer has the same relationship with regex: they know what they want to match, they don't remember the syntax to write it, they Google a snippet, paste it in, and run their tests. Sometimes the tests pass. Sometimes they pass and the regex catastrophically backtracks on a real input three weeks later. This class of bug is known as ReDoS; it is documented extensively by OWASP and sits behind several published CVEs and production incidents, notably the July 2019 Cloudflare outage, in which a single regex in a WAF rule exhausted CPU across Cloudflare's network and took a large number of sites offline.

The other half of the problem is portability. A regex written for JavaScript may rely on lookbehind assertions, which Go's regexp package (RE2 syntax) doesn't support at all and which Python's re module accepts only when they are fixed-width; on named capture groups written (?<name>...), which Python's re module spells (?P<name>...); or on Unicode property escapes (\p{L}), which Python's re module doesn't support and which PCRE supports only when built with Unicode support. Cross-language teams end up maintaining three copies of the same regex with subtle drift.
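To illustrate how small these dialect gaps can be, the named-group difference is a purely syntactic rewrite. The sketch below is a deliberately naive converter; the function name toPythonNamedGroups is hypothetical and is not part of the tool, and it assumes the pattern has no escaped parentheses or character classes containing "(?<", which a real translator would have to parse properly.

```javascript
// Naive sketch: rewrite JavaScript-style named groups (?<name>...) into the
// (?P<name>...) spelling used by Python's re module.
// Assumption: no escaped "\(?<" sequences and no "(?<" inside a character
// class; a real translator would parse the pattern instead of using replace.
function toPythonNamedGroups(jsPattern) {
  // (?<= and (?<! are lookbehinds, not named groups, so they are left alone.
  return jsPattern.replace(/\(\?<(?![=!])/g, '(?P<');
}

console.log(toPythonNamedGroups('(?<year>\\d{4})-(?<month>\\d{2})'));
// prints: (?P<year>\d{4})-(?P<month>\d{2})

console.log(toPythonNamedGroups('(?<=user_)\\d+'));
// prints: (?<=user_)\d+
```

Lookbehind and Unicode property escapes are harder: they change what the engine can express, not just how it is spelled, so they cannot be handled with a textual substitution like this one.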

Why a verifier beats writing the regex first

The traditional workflow is regex-first: you write the pattern, then test it against examples. The inverted workflow is examples-first: you write the examples, then ask a tool to produce the regex. The examples are the source of truth (you wanted "match e-mail addresses but not the test inputs that look like e-mail addresses and aren't") and the regex is an implementation detail. The verifier supports both directions: validate a regex against test cases (the existing workflow, with anti-pattern detection bolted on), or generate the regex from examples (the inverted workflow).

The third mode handles the portability problem: paste a regex from one language and ask for the equivalent in another. The tool flags features that don't translate cleanly with per-feature warnings, so the cross-team divergence problem becomes a one-call fix.

What this article delivers

Three workflows walked end-to-end: verifying a pattern against test cases (with anti-pattern detection), generating a pattern from positive and negative examples, and translating a pattern across the five major regex dialects (JavaScript, Python, Go, Rust, PCRE). We cover the explain mode that decomposes any regex into plain English and the anti-pattern detector that flags ReDoS-prone constructs.

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

Verify a regex against an array of test cases and read a per-case pass/fail report with extracted capture groups
Generate a regex from positive and negative examples using one of three strategies (precise, balanced, permissive)
Translate a regex between JavaScript, Python, Go, Rust, and PCRE dialects with per-feature warnings for incompatibilities
Read the Explain mode output that decomposes any regex into a plain-English description
Identify ReDoS-prone patterns (catastrophic backtracking, over-broad wildcards, missing anchors) with severity ratings and one-line fixes

The Examples section walks through each of the three modes against the same real-world problem — matching internal API endpoints across three different services.

3. Examples

Annotated code and worked scenarios.

Before / after: generate mode

You want a regex that matches your team's internal API endpoints. You have examples that should match and examples that shouldn't.

Before: open a regex tester, try ^/api/v\d+/.*, realise it matches the bare /api/v1/, tighten the tail to .+, realise it still matches the deprecated /api/v0/... paths you wanted excluded, end up with ^/api/v[12]/.+ and pray you remember to add v3 next quarter.

After:

regexVerifier({
  mode: 'generate',
  generatorStrategy: 'balanced',
  testCases: [
    { input: '/api/v1/users',            shouldMatch: true  },
    { input: '/api/v2/orgs/123',         shouldMatch: true  },
    { input: '/api/v2/orgs/123/members', shouldMatch: true  },
    { input: '/api/v0/legacy',           shouldMatch: false },
    { input: '/internal/health',         shouldMatch: false },
    { input: '/api/v1',                  shouldMatch: false },
  ],
});

// generatedPattern: '^/api/v[1-9]\\d*/.+'
// generatedFlags:   ''
// valid: true
// results: [
//   { input: '/api/v1/users',            matched: true,  expected: true,  passed: true },
//   { input: '/api/v2/orgs/123',         matched: true,  expected: true,  passed: true },
//   ...
// ]

The [1-9]\d* after v is the generator earning its keep: it accepts v1, v2, v10 and any later version number without a leading zero, but rejects v0. With generatorStrategy: 'precise' you'd get ^/api/v[12]/.+ (matches only the version numbers seen in the examples). With 'permissive' you'd get ^/api/v\d+/.+ (matches v0 too, so it fails the explicit negative case). 'balanced' finds the middle.
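The difference between the three strategies can be checked locally with plain RegExp objects. The following is a minimal sketch that replays the six test cases from the call above against the three patterns quoted in the text; the helper name failingInputs is made up for this illustration and does not call the tool.

```javascript
// The six examples from the generate call above.
const testCases = [
  { input: '/api/v1/users',            shouldMatch: true  },
  { input: '/api/v2/orgs/123',         shouldMatch: true  },
  { input: '/api/v2/orgs/123/members', shouldMatch: true  },
  { input: '/api/v0/legacy',           shouldMatch: false },
  { input: '/internal/health',         shouldMatch: false },
  { input: '/api/v1',                  shouldMatch: false },
];

// The patterns the text attributes to each generatorStrategy.
const candidates = {
  precise:    /^\/api\/v[12]\/.+/,
  balanced:   /^\/api\/v[1-9]\d*\/.+/,
  permissive: /^\/api\/v\d+\/.+/,
};

// Return the inputs whose actual match result disagrees with shouldMatch.
function failingInputs(regex, cases) {
  return cases
    .filter((c) => regex.test(c.input) !== c.shouldMatch)
    .map((c) => c.input);
}

for (const [strategy, regex] of Object.entries(candidates)) {
  console.log(strategy, failingInputs(regex, testCases));
}

// An input outside the example set shows where precise and balanced diverge.
console.log(candidates.precise.test('/api/v3/users'));  // false
console.log(candidates.balanced.test('/api/v3/users')); // true
```

Only the permissive pattern reports a failing input (/api/v0/legacy). The precise and balanced patterns both satisfy every example, and they differ only on inputs such as /api/v3/users that the examples never mention.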

Before / after: cross-flavor translation

The same regex in three of the five dialects:

Before:

// JavaScript
const r = /(?<=user_)\d+/;

# Python
import re
r = re.compile(r"(?<=user_)\d+")

// Go
r := regexp.MustCompile(`(?<=user_)\d+`)
// → panic: regexp: Compile(`(?<=user_)\d+`): error parsing regexp: invalid or unsupported Perl syntax

Go's regexp (RE2) doesn't support lookbehind. You either rewrite to use a non-lookbehind alternative (capture group + extract index 1) or ship a different regex per language.

After:

regexVerifier({
  mode: 'translate',
  pattern: '(?<=user_)\\d+',
  flags: '',
  sourceDialect: 'javascript',
  targetDialect: 'go',
  testCases: [],
});

// translation: {
//   sourceDialect: 'javascript',
//   targetDialect: 'go',
//   translatedPattern: 'user_(\\d+)',
//   translatedFlags: '',
//   warnings: [
//     {
//       feature: 'lookbehind',
//       message: 'Go RE2 does not support lookbehind. Rewrote as capture group; extract via match[1] instead of match[0].',
//     },
//   ],
// }

The translator picks the closest semantic equivalent and surfaces the API contract change (extract match[1] instead of match[0]) as a warning the developer reads before pasting the new regex.
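The contract change is easy to see by running both forms in JavaScript, where each of them compiles. This is a small sketch with a made-up input string: the lookbehind version returns the digits as the whole match, while the capture-group rewrite returns them at index 1.

```javascript
const input = 'id=user_42;';

// Original: the lookbehind keeps "user_" out of the match itself.
const withLookbehind = /(?<=user_)\d+/.exec(input);

// Rewrite: "user_" is consumed, so the digits move into capture group 1.
const withCapture = /user_(\d+)/.exec(input);

console.log(withLookbehind[0]); // 42
console.log(withCapture[0]);    // user_42
console.log(withCapture[1]);    // 42
```

Any calling code that reads match[0], or that uses the match offset for a replacement, has to be updated along with the pattern.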

Before / after: anti-pattern detection

regexVerifier({
  mode: 'verify',
  pattern: '^(a+)+$',
  flags: '',
  testCases: [{ input: 'aaaaaaaaaaaaaaaaaaaaab', shouldMatch: false }],
  detectAntiPattern: true,
});

// valid: true
// results: [{ input: 'aaaaaaaaaaaaaaaaaaaaab', matched: false, expected: false, passed: true }]
// antiPatterns: [
//   {
//     pattern: '(a+)+',
//     severity: 'critical',
//     description: 'Nested quantifier causes catastrophic backtracking on non-matching input',
//     fix: 'Rewrite without nested quantifier: ^a+$ matches the same set without backtracking',
//   },
// ]

The test passes (the regex correctly returns no match), but the input took milliseconds when it should take microseconds. The anti-pattern detector flags (a+)+ as a classic ReDoS construct — the kind of thing that brings down production when an attacker submits a 50-character string.
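The cost difference can be observed directly in a backtracking engine such as the one JavaScript runtimes use. The sketch below times the flagged pattern and the suggested fix on the same input as the verify call; the timeMatch helper is illustrative only, and absolute timings depend on the runtime and machine, so no figures are quoted here.

```javascript
// Use a high-resolution clock where available, falling back to Date.now().
const now = () =>
  typeof performance !== 'undefined' ? performance.now() : Date.now();

// Run a single match and report the result together with the elapsed time.
function timeMatch(regex, input) {
  const start = now();
  const matched = regex.test(input);
  return { matched, ms: now() - start };
}

// Same input as the verify call above: 21 "a" characters followed by "b".
const input = 'a'.repeat(21) + 'b';

const nested = timeMatch(/^(a+)+$/, input); // nested quantifier
const flat = timeMatch(/^a+$/, input);      // suggested fix

console.log('nested:', nested);
console.log('flat:  ', flat);
```

Both calls report matched: false. With the nested quantifier, each additional "a" roughly doubles the number of ways the engine can split the run between the inner and outer +, which is why a slightly longer input is enough to stall a request.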

When humans use this

The most common workflow on the web UI is iterative: paste a regex, paste a few test inputs, click run, read the pass/fail and the anti-pattern warnings, refine. Generate mode is the second-most-common — particularly for developers who think in examples ("match these but not those") and don't want to translate their mental model into syntax. Explain mode powers the "what does this regex do?" question that comes up during code review when someone inherits a regex from a previous engineer.

When agents use this

Three patterns that dominate:

Code-generation agent producing validation logic. An agent asked to "validate that the input matches our internal email format" struggles when it has to write the regex itself. Generate mode inverts the problem: the agent describes the requirement as positive + negative examples, the tool produces the regex, the agent embeds it. Reliability goes up because the regex is verified against the examples before it ships.
Multi-language pipeline agent. An agent generating both backend (Go) and frontend (JavaScript) validation keeps a single canonical regex in JavaScript, calls the verifier in translate mode to produce the Go version for the API server, and surfaces any warnings as comments in the generated code.
Security-audit agent. An agent scanning a codebase for ReDoS-prone patterns calls verify mode with detectAntiPattern: true against every regex literal it finds. Critical findings open a security advisory PR; lower-severity findings open a tech-debt ticket.
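For the multi-language pipeline pattern, the step of surfacing warnings as comments could look roughly like the sketch below. It assumes the translation object has the shape shown in the translate example; the renderGoRegex helper and the variable name userIDPattern are hypothetical, and the sketch ignores translatedFlags because the article does not specify how Go flags are encoded there.

```javascript
// Hypothetical helper: turn a `translation` object into a Go declaration,
// with each warning emitted as a comment above it.
function renderGoRegex(varName, translation) {
  const { translatedPattern, warnings } = translation;
  if (translatedPattern === null) {
    // No equivalent exists in the target dialect; stop instead of emitting code.
    const reasons = warnings.map((w) => w.message).join('; ');
    throw new Error(`No Go equivalent available: ${reasons}`);
  }
  const comments = warnings.map((w) => `// WARNING (${w.feature}): ${w.message}`);
  // Go raw strings are delimited by backticks, so a pattern that contains one
  // needs an interpreted string; JSON.stringify is close enough for a sketch.
  const literal = translatedPattern.includes('`')
    ? JSON.stringify(translatedPattern)
    : '`' + translatedPattern + '`';
  return [...comments, `var ${varName} = regexp.MustCompile(${literal})`].join('\n');
}

// The translation object from the translate example.
const translation = {
  sourceDialect: 'javascript',
  targetDialect: 'go',
  translatedPattern: 'user_(\\d+)',
  translatedFlags: '',
  warnings: [
    {
      feature: 'lookbehind',
      message: 'Go RE2 does not support lookbehind. Rewrote as capture group; extract via match[1] instead of match[0].',
    },
  ],
};

console.log(renderGoRegex('userIDPattern', translation));
```

Keeping the warning next to the declaration means a reviewer of the generated Go file sees the match[1] contract change without having to look up the translation call.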

Edge cases

Empty test-case arrays

Verify mode with testCases: [] returns valid: true (the regex compiles) but no pass/fail entries. Use this to syntax-check a regex without testing it. Generate mode requires at least one positive AND one negative example — empty arrays return INPUT_EMPTY.

Conflicting test cases

Verify mode tolerates contradictory expectations (one case says shouldMatch=true, another with the same input says false). Because the regex either matches that input or it doesn't, the output reports exactly one of the two cases as a failure. Generate mode rejects with INPUT_INVALID_TYPE because contradictory examples have no valid regex solution.
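A caller can catch both generate-mode edge cases before making a request. The function below is a hypothetical client-side pre-flight check, not part of the tool; it returns the error code the article says generate mode would produce, or null when the test cases look acceptable.

```javascript
// Hypothetical pre-flight check mirroring the generate-mode rules above.
function preflightGenerate(testCases) {
  if (testCases.length === 0) return 'INPUT_EMPTY';

  // Collect every shouldMatch value seen for each distinct input.
  const expectations = new Map();
  for (const { input, shouldMatch } of testCases) {
    if (!expectations.has(input)) expectations.set(input, new Set());
    expectations.get(input).add(shouldMatch);
  }

  // An input listed as both true and false is a contradiction.
  const contradictory = [...expectations.values()].some((seen) => seen.size > 1);
  const hasPositive = testCases.some((c) => c.shouldMatch);
  const hasNegative = testCases.some((c) => !c.shouldMatch);

  if (contradictory || !hasPositive || !hasNegative) return 'INPUT_INVALID_TYPE';
  return null; // nothing obviously wrong; safe to send
}

console.log(preflightGenerate([])); // INPUT_EMPTY
console.log(preflightGenerate([
  { input: '/api/v1/users', shouldMatch: true },
  { input: '/api/v1/users', shouldMatch: false },
])); // INPUT_INVALID_TYPE
console.log(preflightGenerate([
  { input: '/api/v1/users', shouldMatch: true },
  { input: '/api/v0/legacy', shouldMatch: false },
])); // null
```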

Translation incompatibilities the tool can't resolve

Some constructs have no equivalent in the target dialect: variable-length lookbehinds (JavaScript, .NET) translated to Python's re module or Go RE2, recursion ((?R) in PCRE), conditional patterns. These return translatedPattern: null with a warning explaining the structural reason and pointing at the closest non-regex alternative (PEG parser, string-manipulation API).

Regex flags across dialects

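The flags parameter is passed verbatim in verify mode. In translate mode, flags are mapped to the target dialect's syntax: JavaScript gi becomes re.IGNORECASE in Python and an inline (?i) in Go, and JavaScript's s (dotall) corresponds to re.DOTALL in Python and (?s) in Go. Flags that don't translate are surfaced in warnings. The g (global) flag is the usual case: neither Python nor Go has a flag for it, because the choice between first match and all matches is made by the function you call (re.search versus re.finditer, FindString versus FindAllString).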
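As a rough sketch of what such a mapping involves for one dialect pair, the function below converts JavaScript flag letters into a Go inline-flag prefix. It relies on the fact that Go's regexp syntax accepts the inline flags i, m and s with the same meaning as their JavaScript counterparts; the function name jsFlagsToGo and the feature strings in the warnings are made up for this illustration and are not the tool's actual output.

```javascript
// Sketch: map JavaScript flag letters to a Go inline-flag prefix such as (?i).
function jsFlagsToGo(flags) {
  // i, m and s exist as inline flags in Go's regexp syntax.
  const inlineSupported = new Set(['i', 'm', 's']);
  const inline = [];
  const warnings = [];

  for (const flag of flags) {
    if (inlineSupported.has(flag)) {
      inline.push(flag);
    } else {
      // g, u and y describe how JavaScript runs the match, not the pattern.
      warnings.push({
        feature: `flag:${flag}`,
        message: `JavaScript flag "${flag}" has no Go inline equivalent`,
      });
    }
  }

  return {
    prefix: inline.length > 0 ? `(?${inline.join('')})` : '',
    warnings,
  };
}

const result = jsFlagsToGo('gi');
console.log(result.prefix);            // (?i)
console.log(result.warnings.length);   // 1
console.log(result.warnings[0].feature); // flag:g
```

The prefix is prepended to the translated pattern, so /user_\d+/gi would become (?i)user_\d+ in Go, with the caller choosing FindAllString to reproduce the global behaviour.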

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `mode` | `'verify' \| 'generate' \| 'translate'` | ✓ | — | Workflow selector |
| `pattern` | `string` | conditional | — | Required for verify and translate modes |
| `flags` | `string` | ✗ | `''` | Regex flags (`g`, `i`, `m`, `s`, `u`, `y`) |
| `testCases` | `Array<{input, shouldMatch}>` | ✓ | — | At least one positive + one negative required for generate mode |
| `explainMode` | `boolean` | ✗ | `false` | Decompose the regex into plain English |
| `detectAntiPattern` | `boolean` | ✗ | `false` | Flag ReDoS-prone constructs with severity ratings |
| `generatorStrategy` | `'precise' \| 'balanced' \| 'permissive'` | for generate mode | `'balanced'` | Controls regex generality vs example fit |
| `sourceDialect` | `'javascript' \| 'python' \| 'go' \| 'rust' \| 'pcre'` | for translate mode | — | Input dialect |
| `targetDialect` | same enum | for translate mode | — | Output dialect |

Output shape

{
  valid:           boolean;
  results:         Array<{
    input:          string;
    matched:        boolean;
    expected:       boolean;
    passed:         boolean;
    captureGroups?: string[];
  }>;
  explanation?:      string;    // when explainMode: true
  antiPatterns?:     Array<{    // when detectAntiPattern: true
    pattern:     string;
    severity:    string;        // e.g. 'critical'
    description: string;
    fix:         string;
  }>;
  generatedPattern?: string;    // when mode: 'generate'
  generatedFlags?:   string;
  translation?: {                // when mode: 'translate'
    sourceDialect:     string;
    targetDialect:     string;
    translatedPattern: string | null;  // null when no equivalent exists
    translatedFlags:   string;
    warnings: Array<{ feature: string; message: string }>;
  };
}
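A consumer of this output usually needs only a few of these fields. The sketch below condenses a verify response into the pieces a CI check or an agent would act on; the summarize helper is hypothetical, and the sample report reuses the values from the anti-pattern example earlier in the article.

```javascript
// Hypothetical helper: reduce a verify-mode response to what a caller acts on.
function summarize(report) {
  const failed = report.results.filter((r) => !r.passed);
  return {
    compiled: report.valid,
    total: report.results.length,
    failed: failed.map((r) => ({
      input: r.input,
      expected: r.expected,
      matched: r.matched,
    })),
    // antiPatterns is optional, so default to an empty list when absent.
    critical: (report.antiPatterns ?? [])
      .filter((a) => a.severity === 'critical')
      .map((a) => a.pattern),
  };
}

// Response values taken from the anti-pattern detection example.
const report = {
  valid: true,
  results: [
    { input: 'aaaaaaaaaaaaaaaaaaaaab', matched: false, expected: false, passed: true },
  ],
  antiPatterns: [
    {
      pattern: '(a+)+',
      severity: 'critical',
      description: 'Nested quantifier causes catastrophic backtracking on non-matching input',
      fix: 'Rewrite without nested quantifier: ^a+$ matches the same set without backtracking',
    },
  ],
};

console.log(summarize(report));
// compiled: true, total: 1, failed: [], critical: ['(a+)+']
```

The example shows why passing tests alone are not a sufficient gate: failed is empty, yet critical still lists a construct that should block the change.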

Generator strategies compared

| Strategy | Output for `[v1, v2]` (positive) / `[v0]` (negative) | When to use |
| --- | --- | --- |
| `precise` | `^/api/v[12]/.+` | Exact-match validation where future inputs must conform |
| `balanced` | `^/api/v[1-9]\d*/.+` | Default; accepts the obvious extension of the example set |
| `permissive` | `^/api/v\d+/.+` | Broadest match, for when over-matching is acceptable; here it also matches the negative `v0` example |

Error codes

| Code | When it fires | Recovery |
| --- | --- | --- |
| `INPUT_EMPTY` | Generate mode with empty `testCases`, or verify mode with empty `pattern` | Provide the required input |
| `INPUT_MALFORMED` | `pattern` is not a valid regex in `sourceDialect` | Verify the source dialect matches the pattern syntax |
| `INPUT_INVALID_TYPE` | Generate mode contradictory examples; missing positive or negative cases | Provide both shouldMatch:true and shouldMatch:false cases |
| `UNSUPPORTED_FORMAT` | Translation between unsupported dialect pair (rare) | Translate via JavaScript as an intermediate dialect |
| `TIMEOUT` | Verify mode hit catastrophic backtracking on a test case (3s ceiling) | The regex is ReDoS-vulnerable; run with detectAntiPattern: true |

When NOT to use this tool

Don't use the generator as a substitute for thinking about the problem space. The generator extrapolates from examples; if your examples don't cover the edge cases your production input set will include, the generated regex won't either. Always include adversarial negative examples (inputs that look like the positive cases but shouldn't match).

Don't use translate mode as a substitute for re-testing in the target language. The translation is semantically equivalent for the documented features; subtle behaviour differences (Unicode handling, locale sensitivity, anchor semantics in multiline mode) still require validation in the target runtime.

For complex parsing (nested structures, recursive grammars), use a parser generator or parsing library (peggy, tree-sitter, chevrotain). Regex is the wrong tool for any input where you'd want to describe the grammar with rules.

Performance notes

Verify mode execution is bounded by a 3-second hard timeout per test case to defend against ReDoS. Generate mode runs an iterative search bounded by 2 seconds; it returns the best result found within that budget. Translate mode is single-pass parse + emit, typically under 5ms. The tool is deterministic for verify and translate modes; generate mode is deterministic per (testCases, generatorStrategy) tuple. REST responses are Edge-Cache eligible for verify and translate modes.