Context-aware HTML encoding: avoiding double-encoding bugs in JSON, XML, and URL embeddings

Encode HTML content for safe embedding in HTML bodies, JSON strings, XML attributes, or URL parameters. Auto-detects the destination context, eliminating the double-encoding bugs that come from applying the wrong escape rule.

The HTML Encoder escapes content for one of four embedding contexts — html-raw (standard HTML entities), html-in-json (JSON-string escaping), html-in-xml (XML attribute escaping), or html-in-url (percent-encoding). Auto-detection picks the context from the input shape; contextOverride forces a specific one. Reverse mode decodes any of the four contexts back to plain text.

1. Insight

The problem this article addresses and why it matters.

Encoding the right thing for the right context

Every developer has encountered the "double-encoded HTML" bug: a literal &lt; rendered in a browser instead of <, because the same content was HTML-encoded twice as it moved through a pipeline. The root cause is almost always context confusion: a value that needed JSON escaping was HTML-encoded instead, or content that should have been URL-encoded got HTML entities applied. Each output context (HTML body, HTML attribute, JSON string, XML attribute, URL parameter) has different rules for which characters are unsafe and how to escape them.

OWASP's Cross-Site Scripting Prevention Cheat Sheet lists six distinct escaping contexts. Most teams pick one — HTML entity encoding via a library — and apply it everywhere. That breaks when the destination isn't actually HTML.
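A minimal TypeScript sketch of how the double-encoding bug arises. The escapeHtml helper is hypothetical and written for illustration; it is not the tool's implementation. Running the same escaper twice re-escapes the & that the first pass introduced, and that is the text the browser then shows literally.

```typescript
// Hypothetical helper (an assumption for illustration): escapes the five
// characters HTML treats as markup. "&" is replaced first so the ampersands
// introduced by the other entities are not escaped again within one pass.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const raw = "a < b";
const once = escapeHtml(raw);   // "a &lt; b"     -> browser displays: a < b
const twice = escapeHtml(once); // "a &amp;lt; b" -> browser displays: a &lt; b
```

The second call is not wrong in isolation; it faithfully escapes the string it was given. The bug is that two stages of the pipeline each assumed they were the one responsible for escaping.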

Why context-aware encoding

The tool in this article takes a contextAware: true flag and auto-detects the embedding context from the input shape. Alternatively, it accepts a contextOverride parameter for cases where the destination is known but does not match the input shape. Detected contexts: html-raw (standard HTML entities), html-in-json (JSON-string escaping), html-in-xml (XML attribute escaping), html-in-url (percent-encoding).

The output uses the right escape sequences for each context — a single call replaces the manual "which encoder do I need?" decision that produces double-encoding bugs at scale.

What this article delivers

End-to-end walkthroughs of encoding the same content for four different output contexts, the auto-detection behaviour against ambiguous inputs, and the cases where the tool can't decide on its own and the consumer needs contextOverride.

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

  • Encode HTML content for safe embedding in HTML bodies, JSON strings, XML attributes, or URL parameters
  • Use auto-detection mode (contextAware: true) when the destination context is implied by the input shape
  • Override the detected context with contextOverride when the destination differs from the input
  • Reverse the direction (decode) any of the four context-specific encodings back to the original string
  • Recognise the cases that produce double-encoding bugs and structure the pipeline to prevent them

The Examples section walks through each context against the same input.

3. Examples

Annotated code and worked scenarios.

Before / after: encoding for an HTML body

Plain content with characters HTML parses as markup:

Before:

<script>alert("xss")</script>

After (html-raw):

htmlEncoder({
  html:         '<script>alert("xss")</script>',
  mode:         'encode',
  entities:     'named',
  contextAware: false,
  contextOverride: 'html-raw',
});

// encoded: '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;'
// detectedContext: 'html-raw'
// entityCount: 6   (two each of &lt;, &gt;, and &quot;)

The <, >, and " characters become HTML entities. The result is safe to insert as text content in an HTML document.
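The entities option selects among named, numeric, and hex forms. The sketch below shows what those three styles plausibly look like for the five markup characters. The function encodeHtmlRaw and its lookup table are hypothetical names chosen for illustration, and the exact numeric and hex output of the tool is an assumption here.

```typescript
type EntityStyle = "named" | "numeric" | "hex";

// Named forms for the five characters HTML treats as markup.
const NAMED: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;", // numeric reference, because &apos; is not defined in HTML 4
};

// Hypothetical helper: a single pass over the input, replacing each unsafe
// character with an entity in the requested style.
function encodeHtmlRaw(text: string, style: EntityStyle = "named"): string {
  return text.replace(/[&<>"']/g, (ch: string): string => {
    if (style === "named") return NAMED[ch];
    const code = ch.charCodeAt(0);
    return style === "hex" ? "&#x" + code.toString(16) + ";" : "&#" + code + ";";
  });
}

const sample = '<b class="x">';
const asNamed = encodeHtmlRaw(sample);              // &lt;b class=&quot;x&quot;&gt;
const asNumeric = encodeHtmlRaw(sample, "numeric"); // &#60;b class=&#34;x&#34;&#62;
const asHex = encodeHtmlRaw(sample, "hex");         // &#x3c;b class=&#x22;x&#x22;&#x3e;
```

All three outputs decode to the same text in a browser, so the choice is about readability and about consumers that only understand numeric references.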

Before / after: encoding for embedding in JSON

The same input destined for a JSON string value:

htmlEncoder({
  html:            '<script>alert("xss")</script>',
  mode:            'encode',
  contextAware:    false,
  contextOverride: 'html-in-json',
});

// encoded: '\\u003cscript\\u003ealert(\\"xss\\")\\u003c/script\\u003e'

JSON requires \" escaping, and < and > are commonly escaped to \u003c and \u003e to prevent the result from breaking out of a <script> tag if the JSON is rendered inline in HTML. The encoder picks the JSON-safe form rather than the HTML-entity form.
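The JSON-safe form can be approximated with the standard JSON.stringify plus two replacements. This is a sketch under assumptions: encodeForInlineJson is a hypothetical name, and it returns the string body without the surrounding quotes to mirror the encoded value shown above.

```typescript
// Hypothetical helper: JSON-string escaping that is also safe inside an
// inline <script> block.
function encodeForInlineJson(text: string): string {
  // JSON.stringify escapes quotes, backslashes, and control characters.
  const quoted = JSON.stringify(text);
  return quoted
    .slice(1, -1)               // drop the surrounding double quotes
    .replace(/</g, "\\u003c")   // "<" can no longer open or close a tag
    .replace(/>/g, "\\u003e");
}

const jsonBody = encodeForInlineJson('<script>alert("xss")</script>');
// \u003cscript\u003ealert(\"xss\")\u003c/script\u003e

// A JSON parser turns the escapes back into the original characters.
const roundTripped: string = JSON.parse('"' + jsonBody + '"');
```

Because \u003c is an ordinary JSON escape, any conforming parser restores the original text; no HTML-entity decoding step is involved.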

Before / after: encoding for a URL parameter

htmlEncoder({
  html:            'message=<b>hello</b>&user=admin',
  mode:            'encode',
  contextOverride: 'html-in-url',
});

// encoded: 'message%3D%3Cb%3Ehello%3C%2Fb%3E%26user%3Dadmin'

The =, <, >, &, and / characters become percent-encoded for safe inclusion in a URL query parameter. Note this is full percent-encoding — not the partial form that leaves = intact (which would break the parameter parser).
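The difference between full and partial percent-encoding can be seen with the two standard JavaScript functions. In this sketch, encodeUrlParam is a hypothetical wrapper and example.test is a placeholder host. The extra replace is there because encodeURIComponent leaves the apostrophe untouched.

```typescript
// Hypothetical helper: encodeURIComponent plus the apostrophe, which
// encodeURIComponent does not escape on its own.
function encodeUrlParam(value: string): string {
  return encodeURIComponent(value).replace(/'/g, "%27");
}

const paramValue = "message=<b>hello</b>&user=admin";

// Full encoding: every reserved character in the value is escaped.
const full = encodeUrlParam(paramValue);
// "message%3D%3Cb%3Ehello%3C%2Fb%3E%26user%3Dadmin"

// Partial encoding: encodeURI is meant for whole URLs and keeps =, & and /.
const partial = encodeURI(paramValue);
// "message=%3Cb%3Ehello%3C/b%3E&user=admin"

// Only the fully encoded form is safe to append as a single parameter value.
const url = "https://example.test/search?q=" + full;
```

With the partial form, a query-string parser would see two parameters (q and user) instead of one, which is the failure the paragraph above warns about.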

Before / after: auto-detection

When contextAware: true is set, the tool detects the context from input markers:

htmlEncoder({
  html:         '"users": [<b>Alice</b>]',  // looks like a JSON fragment
  mode:         'encode',
  contextAware: true,
});

// detectedContext: 'html-in-json'  (detected from the JSON-property-shape preamble)
// encoded: ...

Detection is best-effort — when the input is ambiguous (a fragment that could be HTML or JSON), the tool falls back to html-raw and surfaces a warning. For ambiguous cases, set contextOverride explicitly.
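One way such detection could work is a small set of shape markers with a fallback. The sketch below is an assumption about the general approach, not the tool's actual rules: detectContext, the MARKERS table, and the three regular expressions are all hypothetical.

```typescript
type Context = "html-raw" | "html-in-json" | "html-in-xml" | "html-in-url";

interface Detection {
  context: Context;
  warnings: string[];
}

// Assumed shape markers, one pattern per non-default context.
const MARKERS: Array<{ context: Context; pattern: RegExp }> = [
  { context: "html-in-json", pattern: /^\s*"[^"]+"\s*:/ },  // "key": ...
  { context: "html-in-xml", pattern: /^\s*[\w:.-]+=['"]/ }, // name='value'
  { context: "html-in-url", pattern: /[?&][\w.-]+=/ },      // ?key= or &key=
];

function detectContext(input: string): Detection {
  const hits = MARKERS.filter((m) => m.pattern.test(input)).map((m) => m.context);
  // Exactly one marker: confident detection.
  if (hits.length === 1) return { context: hits[0], warnings: [] };
  // No marker: treat the input as plain HTML.
  if (hits.length === 0) return { context: "html-raw", warnings: [] };
  // Several markers: fall back to html-raw and say why.
  return {
    context: "html-raw",
    warnings: ["ambiguous input (" + hits.join(", ") + "); falling back to html-raw"],
  };
}

const clear = detectContext('"users": [<b>Alice</b>]'); // context: "html-in-json"
const mixed = detectContext('"q": "?page=1"');          // context: "html-raw", one warning
```

A heuristic like this is easy to fool (an HTML link containing ?a=1 looks like a URL), which is why an explicit contextOverride is the safer choice whenever the destination is known.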

Before / after: decoding (reverse)

htmlEncoder({
  html: '&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;',
  mode: 'decode',
});

// decoded: '<script>alert("xss")</script>'

Decoding handles named entities (&lt;, &amp;, &quot;), numeric entities (&#60;), and hex entities (&#x3c;) interchangeably. Useful for reading content stored in HTML-encoded form and processing it as plain text.
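A minimal sketch of decoding the three entity forms in one pass. The function decodeHtmlEntities and its five-entry table are hypothetical; a real decoder would carry the full HTML named-entity list, so unknown names are left untouched here.

```typescript
// Deliberately small table (an assumption for illustration).
const NAMED_ENTITIES: Record<string, string> = {
  lt: "<",
  gt: ">",
  amp: "&",
  quot: '"',
  apos: "'",
};

function decodeHtmlEntities(text: string): string {
  return text.replace(
    /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g,
    (match: string, body: string): string => {
      if (body.charAt(0) === "#") {
        const isHex = body.charAt(1).toLowerCase() === "x";
        const codePoint = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
        // Leave references beyond the Unicode range as they are.
        return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
      }
      // Own-property check so names like "constructor" are not looked up
      // on the object prototype.
      return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
        ? NAMED_ENTITIES[body]
        : match;
    }
  );
}

const same = decodeHtmlEntities("&lt; &#60; &#x3c;"); // "< < <"
const oneLayer = decodeHtmlEntities("&amp;lt;");      // "&lt;" (one layer removed)
```

Because the replacement is a single pass, double-encoded input comes back one layer at a time, which makes it possible to see how many times a value was encoded.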

When humans use this

A developer integrating user-generated content into a templated email runs the content through html-in-json encoding before embedding it in the JSON template payload. A team building an inline-JSON-in-HTML pattern (<script type="application/json"> blocks) uses the JSON context so that the payload stays valid JSON and cannot close the surrounding script element. JSON placed in a data attribute additionally needs attribute escaping, because \" still ends a double-quoted attribute value.

When agents use this

Two patterns:

  • Template-rendering agent. An agent generating templated content (email, HTML page, JSON config) calls the encoder with the right context for each template's embedding rule. Eliminates the "which escaper does this template need?" guess by the LLM.
  • XSS-defence pipeline. A pipeline ingesting user content into a multi-context destination runs the encoder per context: once for HTML body display, once for JSON API output, once for URL parameter inclusion. The same input becomes three different safe representations.
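The per-context pattern in the second bullet can be sketched as a lookup table from context name to escaping function. Everything here is hypothetical (ENCODERS, escapeMarkup, encodeForAll) and stands in for calls to the real tool; the escape choices follow the context table in the Documentation section.

```typescript
type Context = "html-raw" | "html-in-json" | "html-in-xml" | "html-in-url";

// Shared entity escaper; only the apostrophe form differs between HTML and XML.
function escapeMarkup(text: string, apostrophe: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, apostrophe);
}

// One escaping function per destination context.
const ENCODERS: Record<Context, (text: string) => string> = {
  "html-raw": (text) => escapeMarkup(text, "&#39;"),
  "html-in-xml": (text) => escapeMarkup(text, "&apos;"),
  "html-in-json": (text) =>
    JSON.stringify(text).slice(1, -1).replace(/</g, "\\u003c").replace(/>/g, "\\u003e"),
  "html-in-url": (text) => encodeURIComponent(text).replace(/'/g, "%27"),
};

// Encode the same raw input once per destination, always from the original.
function encodeForAll(text: string, contexts: Context[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (const context of contexts) {
    out[context] = ENCODERS[context](text);
  }
  return out;
}

const safe = encodeForAll("Tom & <b>Jerry's</b>", ["html-raw", "html-in-json", "html-in-url"]);
// safe["html-raw"]     -> Tom &amp; &lt;b&gt;Jerry&#39;s&lt;/b&gt;
// safe["html-in-json"] -> Tom & \u003cb\u003eJerry's\u003c/b\u003e
// safe["html-in-url"]  -> Tom%20%26%20%3Cb%3EJerry%27s%3C%2Fb%3E
```

The design point is that every representation is derived from the raw input, never from another encoded form, so no stage can double-encode the output of a previous one.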

Edge cases

Already-encoded content

Passing already-encoded content (&amp;lt;) through encode mode encodes it again (&amp;amp;lt;). Either decode first, or use the tool's detectAlreadyEncoded: true parameter to skip re-encoding when the input is already in the chosen context's form.
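How detectAlreadyEncoded decides is not specified in this article. One plausible heuristic, sketched below with hypothetical names (looksHtmlEncoded, encodeUnlessEncoded), treats input as already encoded when it contains entity references and no raw markup characters remain once those references are removed.

```typescript
// Matches one named, decimal, or hex entity reference.
const ENTITY = /&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/;
const ENTITY_ALL = new RegExp(ENTITY.source, "g");

// Heuristic (an assumption): at least one entity, and nothing unsafe left over.
function looksHtmlEncoded(text: string): boolean {
  if (!ENTITY.test(text)) return false;
  const rest = text.replace(ENTITY_ALL, "");
  return !/[&<>"]/.test(rest);
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Skip the escaper when the input already looks encoded.
function encodeUnlessEncoded(text: string): string {
  return looksHtmlEncoded(text) ? text : escapeHtml(text);
}

const untouched = encodeUnlessEncoded("&lt;b&gt;hi&lt;/b&gt;"); // returned unchanged
const escaped = encodeUnlessEncoded("<b>hi</b>");               // "&lt;b&gt;hi&lt;/b&gt;"
const mixedInput = encodeUnlessEncoded("AT&T &lt; <3");         // "AT&amp;T &amp;lt; &lt;3"
```

A heuristic of this kind cannot tell encoded markup apart from a user who literally typed &lt; as text, so decoding first and then encoding for the new context remains the more predictable route.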

Unicode and surrogates

The encoder handles the BMP correctly. Characters outside the BMP (emoji, some CJK extensions) become surrogate pairs in JavaScript strings; the tool encodes them as proper surrogate-pair representations in JSON and as full code-point references in HTML (&#x1F600;).
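The two representations can be derived as follows. This is a sketch with hypothetical helper names; it covers only the non-BMP handling described above, not quotes or other escapes.

```typescript
// HTML: combine each surrogate pair into one code point and emit a single
// hex character reference.
function toHtmlCodePointRefs(text: string): string {
  return text.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, (pair: string): string => {
    const high = pair.charCodeAt(0);
    const low = pair.charCodeAt(1);
    const codePoint = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
    return "&#x" + codePoint.toString(16).toUpperCase() + ";";
  });
}

// JSON: escape each UTF-16 code unit outside printable ASCII as \uXXXX, so a
// character outside the BMP becomes two escapes (a surrogate pair).
function escapeNonAscii(text: string): string {
  return text.replace(/[^\x20-\x7e]/g, (unit: string): string => {
    return "\\u" + ("0000" + unit.charCodeAt(0).toString(16)).slice(-4);
  });
}

const grin = "\uD83D\uDE00";                // U+1F600 as stored in a JavaScript string
const htmlRef = toHtmlCodePointRefs(grin);  // "&#x1F600;"
const jsonEscapes = escapeNonAscii(grin);   // "\ud83d\ude00" (12 characters)
```

HTML character references address code points, while JSON \u escapes address UTF-16 code units, which is why the same character needs one reference in HTML and two escapes in JSON.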

Round-trip stability

Encode then decode produces the original string for all four contexts. Decode then encode is stable for HTML-raw; for the other contexts, the encoded output is canonical (the encoder picks one representation among the valid forms).

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

Field                 Type                                                         Required  Default  Description
html                  string                                                                          The string to encode or decode
mode                  'encode' | 'decode'                                                             Direction
entities              'named' | 'numeric' | 'hex'                                            'named'  Entity style for HTML output
contextAware          boolean                                                                false    Auto-detect the embedding context
contextOverride       'html-raw' | 'html-in-json' | 'html-in-xml' | 'html-in-url'                     Force a specific context
detectAlreadyEncoded  boolean                                                                false    Skip re-encoding when input is already in the target context

Output shape

{
  encoded?:        string;   // when mode: 'encode'
  decoded?:        string;   // when mode: 'decode'
  detectedContext: 'html-raw' | 'html-in-json' | 'html-in-xml' | 'html-in-url';
  entityCount:     number;   // count of replacements made
  warnings:        string[];
}

Context-specific escape tables

Char  html-raw  html-in-json  html-in-xml  html-in-url
<     &lt;      \u003c        &lt;         %3C
>     &gt;      \u003e        &gt;         %3E
"     &quot;    \"            &quot;       %22
'     &#39;     '             &apos;       %27
&     &amp;     &             &amp;        %26

Error codes

Code                When it fires                                    Recovery
INPUT_EMPTY         html empty                                       Provide a non-empty input
INPUT_INVALID_TYPE  contextOverride value outside the supported set  Use one of the four documented contexts
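A caller can check for both conditions before invoking the tool. The sketch below is hypothetical (validateInput is not part of the tool), and the order in which the two checks run is an assumption.

```typescript
const CONTEXTS: string[] = ["html-raw", "html-in-json", "html-in-xml", "html-in-url"];

type ErrorCode = "INPUT_EMPTY" | "INPUT_INVALID_TYPE";

// Returns the error code the input would trigger, or null when it is valid.
function validateInput(input: { html: string; contextOverride?: string }): ErrorCode | null {
  if (input.html.length === 0) return "INPUT_EMPTY";
  if (input.contextOverride !== undefined && CONTEXTS.indexOf(input.contextOverride) === -1) {
    return "INPUT_INVALID_TYPE";
  }
  return null;
}

const emptyResult = validateInput({ html: "" });                                  // "INPUT_EMPTY"
const badContext = validateInput({ html: "<b>", contextOverride: "css" });        // "INPUT_INVALID_TYPE"
const valid = validateInput({ html: "<b>", contextOverride: "html-in-url" });     // null
```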

When NOT to use this tool

For HTML sanitisation (removing potentially-dangerous tags while preserving safe markup), use a dedicated sanitiser (DOMPurify, sanitize-html). The encoder escapes everything to text; the sanitiser preserves whitelisted markup while removing dangerous constructs.

For binary content embedded in text contexts, use base64 (base64_codec tool) rather than HTML encoding. HTML entities are inefficient for binary; base64 is the right primitive.

Performance notes

Typical execution: under 2 ms for inputs under 50 KB. The encoder is single-pass, so run time scales linearly with input size. It is also deterministic: the same input and the same context produce byte-identical output, so REST responses are Edge-Cache eligible.

FAQ

Frequently asked questions

Why not just use one encoder everywhere?

Each context has different rules. < becomes &lt; in HTML but \u003c in JSON-inside-HTML and %3C in a URL. Apply the wrong one and you get either double-encoding (&amp;lt;) or under-encoding (the < reaches the renderer un-escaped). The four-context surface fixes this once.

When does auto-detect work?

When the input has shape markers: a JSON-property prefix ("key":), XML attribute syntax (='value'), or URL parameter context (?key=). For ambiguous inputs, set contextOverride explicitly; the tool surfaces detection ambiguity as a warning.

Does this prevent XSS?

Context-aware encoding is one of the defences. Combined with strict Content-Security-Policy and avoiding innerHTML for user content, it eliminates the most common XSS vectors. For full XSS defence, layer sanitisation (DOMPurify) for content that needs to retain markup.

How do I handle content that's already encoded?

Pass detectAlreadyEncoded: true. The encoder checks if the input is already in the target context's form and returns it unchanged rather than encoding again. Otherwise decode first, then encode for the new context.