Context-aware HTML encoding: avoiding double-encoding bugs in JSON, XML, and URL embeddings — obfus.link

1. Insight

The problem this article addresses and why it matters.

Encoding the right thing for the right context

Every developer has encountered the "double-encoded HTML" bug: &lt; rendered in a browser instead of <, because the same content was HTML-encoded twice as it moved through a pipeline. The root cause is almost always context confusion — a value that needed JSON escaping was HTML-encoded instead, or content that should have been URL-encoded got HTML entities applied. Each output context (HTML body, HTML attribute, JSON string, XML attribute, URL parameter) has different rules for which characters are unsafe and how to escape them.
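
The failure is easy to reproduce. The sketch below uses a hypothetical `escapeHtml` helper (an illustration, not the tool described in this article) and applies it twice to the same value, as happens when two pipeline stages each assume they are the one responsible for escaping.

```javascript
// Hypothetical helper for illustration only; not the article's encoder.
// Replaces the five HTML-significant characters with entities.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

const once = escapeHtml('a < b');  // 'a &lt; b'     -> a browser renders: a < b
const twice = escapeHtml(once);    // 'a &amp;lt; b' -> a browser renders: a &lt; b

console.log(once);
console.log(twice);
```

The second pass escapes the `&` that the first pass introduced, which is why the entity itself becomes visible on the page.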

OWASP's Cross-Site Scripting Prevention Cheat Sheet lists six distinct escaping contexts. Most teams pick one — HTML entity encoding via a library — and apply it everywhere. That breaks when the destination isn't actually HTML.

Why context-aware encoding

The tool in this article works in two ways. With contextAware: true it auto-detects the embedding context from the shape of the input. With contextOverride it uses the context you name, which is the right choice when the destination is known but does not match the input shape. The supported contexts are html-raw (standard HTML entities), html-in-json (JSON-string escaping), html-in-xml (XML attribute escaping), and html-in-url (percent-encoding).

The output uses the right escape sequences for each context — a single call replaces the manual "which encoder do I need?" decision that produces double-encoding bugs at scale.

What this article delivers

End-to-end walkthroughs of encoding the same content for four different output contexts, the auto-detection behaviour against ambiguous inputs, and the cases where the tool can't decide on its own and the consumer needs contextOverride.

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

Encode HTML content for safe embedding in HTML bodies, JSON strings, XML attributes, or URL parameters
Use auto-detection mode (contextAware: true) when the destination context is implied by the input shape
Override the detected context with contextOverride when the destination differs from the input
Decode any of the four context-specific encodings back to the original string
Recognise the cases that produce double-encoding bugs and structure the pipeline to prevent them

The Examples section walks through each context against the same input.

3. Examples

Annotated code and worked scenarios.

Before / after: encoding for an HTML body

Plain content with characters HTML parses as markup:

Before:

<script>alert("xss")</script>

After (html-raw):

htmlEncoder({
  html:         '<script>alert("xss")</script>',
  mode:         'encode',
  entities:     'named',
  contextAware: false,
  contextOverride: 'html-raw',
});

// encoded: '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'
// detectedContext: 'html-raw'
// entityCount: 6

The <, >, and " characters become HTML entities. The result is safe to insert as text content in an HTML document.
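
To make the mechanics concrete, here is a minimal sketch of what an html-raw style encoder could look like, including the three entity styles named by the `entities` parameter. The function name `encodeHtmlRaw` and its return shape are assumptions for illustration; the tool's internals are not shown in this article.

```javascript
// Illustrative stand-in for the html-raw context; not the tool's implementation.
const NAMED = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

function encodeHtmlRaw(text, style = 'named') {
  let entityCount = 0;
  const encoded = text.replace(/[&<>"']/g, (ch) => {
    entityCount += 1; // one replacement per escaped character
    const code = ch.codePointAt(0);
    if (style === 'numeric') return `&#${code};`;
    if (style === 'hex') return `&#x${code.toString(16).toUpperCase()};`;
    return NAMED[ch];
  });
  return { encoded, entityCount };
}

const named = encodeHtmlRaw('<script>alert("xss")</script>');
const numeric = encodeHtmlRaw('<', 'numeric');
const hex = encodeHtmlRaw('<', 'hex');

console.log(named.encoded);     // &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
console.log(named.entityCount); // 6
console.log(numeric.encoded);   // &#60;
console.log(hex.encoded);       // &#x3C;
```

All three styles decode to the same character in a browser; the choice mostly matters for consumers, such as strict XML parsers, that do not know the HTML named entities.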

Before / after: encoding for embedding in JSON

The same input destined for a JSON string value:

htmlEncoder({
  html:            '<script>alert("xss")</script>',
  mode:            'encode',
  contextAware:    false,
  contextOverride: 'html-in-json',
});

// encoded: '\\u003cscript\\u003ealert(\\"xss\\")\\u003c/script\\u003e'

JSON requires \" escaping, and < and > are commonly escaped to \u003c and \u003e so that the result cannot break out of a <script> tag when the JSON is rendered inline in HTML. The encoder picks the JSON-safe form rather than the HTML-entity form.
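
The same idea can be sketched with built-ins, assuming `JSON.stringify` for the mandatory JSON escapes plus a manual pass for the angle brackets. `toInlineJson` is a hypothetical helper name, and unlike the output shown above it returns the value with its surrounding JSON quotes.

```javascript
// Sketch under assumptions: JSON.stringify handles the escapes JSON itself
// requires (quotes, backslashes, control characters); the replace() calls
// cover the characters that matter when JSON is inlined in a <script> block.
function toInlineJson(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')
    .replace(/>/g, '\\u003e');
}

const json = toInlineJson('<script>alert("xss")</script>');

console.log(json);             // "\u003cscript\u003ealert(\"xss\")\u003c/script\u003e"
console.log(JSON.parse(json)); // <script>alert("xss")</script>
```

Because \u003c is an ordinary JSON escape, any conforming parser restores the original string, so no HTML-decoding step is needed on the consuming side.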

Before / after: encoding for a URL parameter

htmlEncoder({
  html:            'message=<b>hello</b>&user=admin',
  mode:            'encode',
  contextOverride: 'html-in-url',
});

// encoded: 'message%3D%3Cb%3Ehello%3C%2Fb%3E%26user%3Dadmin'

The =, <, >, &, and / characters become percent-encoded for safe inclusion in a URL query parameter. Note this is full percent-encoding — not the partial form that leaves = intact (which would break the parameter parser).
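
The difference between full and partial percent-encoding can be seen with JavaScript's two built-ins. This is a comparison sketch, not a claim about how the tool is implemented; `example.com` is a placeholder host.

```javascript
const raw = 'message=<b>hello</b>&user=admin';

// encodeURIComponent targets a single parameter value: it escapes = & and /.
const asValue = encodeURIComponent(raw);

// encodeURI targets a whole URL, so it leaves = & and / intact.
const asWholeUrl = encodeURI(raw);

console.log(asValue);    // message%3D%3Cb%3Ehello%3C%2Fb%3E%26user%3Dadmin
console.log(asWholeUrl); // message=%3Cb%3Ehello%3C/b%3E&user=admin

// Embedded as one parameter, the fully encoded form survives URL parsing.
const url = new URL(`https://example.com/search?q=${asValue}`);
const recovered = url.searchParams.get('q');
console.log(recovered === raw); // true
```

With the partial form, a query parser would split the value at `&` and read `user=admin` as a separate parameter, which is exactly the injection the full form prevents.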

Before / after: auto-detection

When contextAware: true is set, the tool detects the context from input markers:

htmlEncoder({
  html:         '"users": [<b>Alice</b>]',  // looks like a JSON fragment
  mode:         'encode',
  contextAware: true,
});

// detectedContext: 'html-in-json'  (detected from the JSON-property-shape preamble)
// encoded: ...

Detection is best-effort — when the input is ambiguous (a fragment that could be HTML or JSON), the tool falls back to html-raw and surfaces a warning. For ambiguous cases, set contextOverride explicitly.
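
The tool's actual detection rules are not documented here, so the following is only a sketch of how shape-based detection with a warning fallback might work. Every rule in `guessContext` is an assumption made for illustration.

```javascript
// Hypothetical heuristics; the real tool's detection rules are not documented
// in this article and may differ.
function guessContext(input) {
  const warnings = [];
  const trimmed = input.trim();
  if (/^"[^"]+"\s*:/.test(trimmed)) return { context: 'html-in-json', warnings };
  if (/^<\?xml\b/.test(trimmed)) return { context: 'html-in-xml', warnings };
  if (/^https?:\/\//.test(trimmed)) return { context: 'html-in-url', warnings };
  if (/^<[a-zA-Z!\/]/.test(trimmed)) return { context: 'html-raw', warnings };
  // Nothing matched confidently: fall back and tell the caller.
  warnings.push('context ambiguous; defaulted to html-raw');
  return { context: 'html-raw', warnings };
}

const detected = guessContext('"users": [<b>Alice</b>]');
const ambiguous = guessContext('{<b>Alice</b>}');

console.log(detected.context);   // html-in-json
console.log(ambiguous.context);  // html-raw
console.log(ambiguous.warnings); // [ 'context ambiguous; defaulted to html-raw' ]
```

A caller that knows the destination should treat a non-empty `warnings` array as a signal to retry with an explicit context rather than trusting the fallback.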

Before / after: decoding (reverse)

htmlEncoder({
  html: '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;',
  mode: 'decode',
});

// decoded: '<script>alert("xss")</script>'

Decoding handles named entities (&lt;, &amp;, &quot;), decimal numeric entities (&#60;), and hex entities (&#x3C;) interchangeably. This is useful for reading content stored in HTML-encoded form and processing it as plain text.
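
A minimal decoder covering the three entity forms might look like the sketch below. The named-entity table is deliberately tiny and `decodeEntities` is a hypothetical name; a production decoder would use the complete HTML named character reference list.

```javascript
// Minimal decoder sketch; only five named entities are known here.
const NAMED_ENTITIES = new Map([
  ['lt', '<'], ['gt', '>'], ['amp', '&'], ['quot', '"'], ['apos', "'"],
]);

function fromCodePointSafe(code, fallback) {
  // String.fromCodePoint throws a RangeError above U+10FFFF.
  return code <= 0x10ffff ? String.fromCodePoint(code) : fallback;
}

function decodeEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (/^#[xX]/.test(body)) return fromCodePointSafe(parseInt(body.slice(2), 16), match);
    if (body.startsWith('#')) return fromCodePointSafe(parseInt(body.slice(1), 10), match);
    return NAMED_ENTITIES.get(body) ?? match; // unknown names pass through unchanged
  });
}

const decodedSample = decodeEntities('&lt;b&gt; &#60; &#x3C; &amp;lt; &unknown;');
console.log(decodedSample); // <b> < < &lt; &unknown;
```

The replacement runs in a single pass, so `&amp;lt;` decodes one level to `&lt;` rather than collapsing all the way to `<`. That one-level behaviour is what keeps decode the exact inverse of a single encode.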

When humans use this

A developer integrating user-generated content into a templated email runs the content through html-in-json encoding before embedding it in the JSON template payload. A team building an inline-JSON-in-HTML pattern (<script type="application/json"> blocks) uses the JSON context so the content is valid JSON and cannot close the enclosing script element. For JSON placed in a double-quoted data attribute, note that the \" escape still contains a literal quote character, so the JSON text needs HTML attribute encoding applied on top.

When agents use this

Two patterns:

Template-rendering agent. An agent generating templated content (email, HTML page, JSON config) calls the encoder with the right context for each template's embedding rule. Eliminates the "which escaper does this template need?" guess by the LLM.
XSS-defence pipeline. A pipeline ingesting user content into a multi-context destination runs the encoder per context: once for HTML body display, once for JSON API output, once for URL parameter inclusion. The same input becomes three different safe representations.

Edge cases

Already-encoded content

Passing already-encoded content (&lt;) through encode mode encodes it again (&amp;lt;). Either decode first, or use the tool's detectAlreadyEncoded: true parameter to skip re-encoding when the input is already in the chosen context's form.
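
One way to approximate the skip-if-encoded behaviour is a guard that checks for existing entities before encoding. This is a heuristic sketch with hypothetical names (`encodeOnce`, `escapeHtml`), not the logic behind detectAlreadyEncoded.

```javascript
// Illustrative guard; not the tool's detectAlreadyEncoded implementation.
const ENTITY = /&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/;

function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

function encodeOnce(text) {
  // Treat the input as already encoded only if it contains at least one
  // entity and no raw markup characters remain.
  const looksEncoded = ENTITY.test(text) && !/[<>"]/.test(text);
  return looksEncoded ? text : escapeHtml(text);
}

console.log(encodeOnce('<b>'));         // &lt;b&gt;
console.log(encodeOnce('&lt;b&gt;'));   // &lt;b&gt;  (left alone)
console.log(encodeOnce('Tom & Jerry')); // Tom &amp; Jerry
```

The guard cannot tell encoded markup from plain text that is meant to display an entity literally (for example a tutorial sentence containing `&lt;`), and mixed input such as `&lt;b>` is encoded in full. When the provenance of a value is known, decoding first is the more predictable option.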

Unicode and surrogates

The encoder handles the BMP correctly. Characters outside the BMP (emoji, some CJK extensions) become surrogate pairs in JavaScript strings; the tool encodes them as proper surrogate-pair representations in JSON and as a single full code-point reference in HTML (for example, &#x1F600; for 😀).
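
The distinction between code units and code points is visible directly in JavaScript. The sketch below derives both representations by hand for one emoji; it illustrates the two forms and is not taken from the tool.

```javascript
const emoji = '😀'; // U+1F600, outside the BMP

console.log(emoji.length);      // 2  (UTF-16 code units)
console.log([...emoji].length); // 1  (code points)

// HTML: one reference for the whole code point.
const htmlRef = `&#x${emoji.codePointAt(0).toString(16).toUpperCase()};`;

// JSON: one \uXXXX escape per UTF-16 code unit, i.e. a surrogate pair.
const jsonEscaped = Array.from({ length: emoji.length }, (_, i) =>
  '\\u' + emoji.charCodeAt(i).toString(16).padStart(4, '0')
).join('');

console.log(htmlRef);     // &#x1F600;
console.log(jsonEscaped); // \ud83d\ude00
console.log(JSON.parse(`"${jsonEscaped}"`) === emoji); // true
```

Writing each surrogate as its own HTML reference (`&#xD83D;&#xDE00;`) is invalid, because HTML numeric references name code points rather than UTF-16 code units. That is why the two contexts need different treatment.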

Round-trip stability

Encode then decode returns the original string in all four contexts. Decode then encode is stable for html-raw; in the other contexts an input may use any of several valid forms, and the encoder always emits one canonical representation, so re-encoded output can differ from the input while still representing the same string.
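
Both properties can be checked with built-ins standing in for the URL and JSON contexts. This is a sketch of the test pattern, assuming the tool behaves like these built-ins with respect to round-tripping; it does not exercise the tool itself.

```javascript
// Round-trip checks using built-ins as stand-ins for two of the contexts.
const samples = ['<b>hi</b>', 'a=1&b=2', 'quote " and \\ backslash', '😀'];

for (const s of samples) {
  // encode -> decode must return the original string
  if (decodeURIComponent(encodeURIComponent(s)) !== s) throw new Error('url round trip failed');
  if (JSON.parse(JSON.stringify(s)) !== s) throw new Error('json round trip failed');
}

// decode -> encode is only byte-stable for canonical input.
const nonCanonical = '%3c'; // lowercase hex digits are valid percent-encoding
const reEncoded = encodeURIComponent(decodeURIComponent(nonCanonical));

console.log(reEncoded);                  // %3C
console.log(reEncoded === nonCanonical); // false
```

A pipeline that compares encoded strings (cache keys, signatures) should therefore compare the canonical re-encoded form, not whatever form arrived on the wire.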

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

Field	Type	Required	Default	Description
`html`	`string`	✓	—	The string to encode or decode
`mode`	`'encode' | 'decode'`	✓	—	Direction
`entities`	`'named' | 'numeric' | 'hex'`	✗	`'named'`	Entity style for HTML output
`contextAware`	`boolean`	✗	`false`	Auto-detect the embedding context
`contextOverride`	`'html-raw' | 'html-in-json' | 'html-in-xml' | 'html-in-url'`	✗	—	Force a specific context
`detectAlreadyEncoded`	`boolean`	✗	`false`	Skip re-encoding when input is already in the target context

Output shape

{
  encoded?:        string;   // present when mode: 'encode'
  decoded?:        string;   // present when mode: 'decode'
  detectedContext: 'html-raw' | 'html-in-json' | 'html-in-xml' | 'html-in-url';
  entityCount:     number;   // count of replacements made
  warnings:        string[];
}

Context-specific escape tables

Char	html-raw	html-in-json	html-in-xml	html-in-url
`<`	`&lt;`	`\u003c`	`&lt;`	`%3C`
`>`	`&gt;`	`\u003e`	`&gt;`	`%3E`
`"`	`&quot;`	`\"`	`&quot;`	`%22`
`'`	`&#39;`	`'`	`&apos;`	`%27`
`&`	`&amp;`	`&`	`&amp;`	`%26`

Error codes

Code	When it fires	Recovery
`INPUT_EMPTY`	`html` empty	Provide a non-empty input
`INPUT_INVALID_TYPE`	`contextOverride` value outside the supported set	Use one of the four documented contexts
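
A caller can rule out both error codes before making a request. The pre-flight check below is a hypothetical sketch: `validateParams` is an invented name, and whether the tool throws or returns these codes is not stated in this article.

```javascript
// Hypothetical pre-flight check mirroring the two documented error codes.
// How the tool itself reports errors (throw vs. return) is an assumption.
const CONTEXTS = ['html-raw', 'html-in-json', 'html-in-xml', 'html-in-url'];

function validateParams({ html, contextOverride }) {
  if (html === undefined || html === null || html === '') return 'INPUT_EMPTY';
  if (contextOverride !== undefined && !CONTEXTS.includes(contextOverride)) {
    return 'INPUT_INVALID_TYPE';
  }
  return null; // no problem found
}

console.log(validateParams({ html: '' }));                                    // INPUT_EMPTY
console.log(validateParams({ html: '<b>', contextOverride: 'html-in-css' })); // INPUT_INVALID_TYPE
console.log(validateParams({ html: '<b>', contextOverride: 'html-in-url' })); // null
```

Checking the context name locally is cheap and catches typos such as `html-url` before they cost a round trip.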

When NOT to use this tool

For HTML sanitisation (removing potentially-dangerous tags while preserving safe markup), use a dedicated sanitiser (DOMPurify, sanitize-html). The encoder escapes everything to text; the sanitiser preserves whitelisted markup while removing dangerous constructs.

For binary content embedded in text contexts, use base64 (base64_codec tool) rather than HTML encoding. HTML entities are inefficient for binary; base64 is the right primitive.

Performance notes

Typical execution: under 2ms for inputs under 50KB. The encoder is single-pass; performance scales linearly with input size. Deterministic — same input + same context produce byte-identical output, so REST responses are Edge-Cache eligible.