XML to JSON with attribute, namespace, and CDATA preservation — obfus.link

1. Insight

The problem this article addresses and why it matters.

XML is still everywhere it always was

The peak of "XML for everything" was 2008. The web moved on to JSON in 2010. Most JavaScript developers entering the industry now haven't written XML deliberately — but they've definitely consumed it, because XML never left the systems it landed in. SOAP APIs at financial institutions, RSS feeds from media systems, SAML assertions from enterprise SSO, configuration files in JVM and .NET ecosystems, EDI / HL7 / FpML payloads from any healthcare or finance integration — all XML, all still in production.

The translation problem is the asymmetry: what JSON can express is mostly a subset of what XML can express. XML has attributes, namespaces, mixed content (text and child elements interleaved), comments, processing instructions, and CDATA sections. JSON has none of those. Most XML-to-JSON converters resolve the asymmetry by discarding it: they read the elements, drop the attributes and namespaces, and emit a JSON-flavoured copy that can't round-trip back to the original XML.

Why a preservation-first converter

The tool in this article preserves everything by default. Attributes become @-prefixed keys (for example, @id). Namespace declarations are kept as @xmlns:prefix keys, and element names keep their prefixes, so the output JSON carries the namespaces forward. CDATA sections are marked explicitly so a downstream consumer knows the content was wrapped. The output is faithful enough to round-trip back to valid XML via the json-to-xml reverse mode.

For consumers that want the lossy simple conversion (XML → JSON with attributes dropped because the consumer doesn't care), pass preserveAttributes: false. The tool's default is the conservative choice; the opt-in lossy mode is for cases where the consumer has positive reason to ignore the metadata.

What this article delivers

End-to-end walks of converting a SOAP envelope, a SAML assertion, and a CDATA-heavy XML feed. We cover the reverse direction (JSON to XML with a configurable root element), the namespace-preservation behaviour, and the cases where neither direction is right because the XML uses features without a clean JSON equivalent (recursive schemas, document-type declarations, external entities).

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

Convert XML to JSON with attribute, namespace, and CDATA preservation by default
Convert JSON back to XML in reverse mode with a configurable root element
Choose between compact and verbose JSON representations of attribute-heavy XML
Recognise which XML features the converter rejects by default (DTDs, external entities), drops by default (comments, processing instructions), or reports as warnings
Choose preserveAttributes: false when the consumer doesn't care about XML metadata and a simpler JSON output is preferred

The Examples section walks through SOAP, SAML, and CDATA-bearing XML in both directions.

3. Examples

Annotated code and worked scenarios.

Before / after: a SOAP envelope

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Header>
    <auth:Credentials xmlns:auth="http://example.com/auth">
      <auth:Token>abc123</auth:Token>
    </auth:Credentials>
  </soap:Header>
  <soap:Body>
    <ns:GetOrder xmlns:ns="http://example.com/orders">
      <ns:OrderId>4521</ns:OrderId>
    </ns:GetOrder>
  </soap:Body>
</soap:Envelope>

xmlToJson({
  input:               soapXml,
  mode:                'xml-to-json',
  preserveAttributes:  true,
  preserveNamespaces:  true,
  preserveCDATA:       false,
});

// output: {
//   "soap:Envelope": {
//     "@xmlns:soap":   "http://schemas.xmlsoap.org/soap/envelope/",
//     "soap:Header": {
//       "auth:Credentials": {
//         "@xmlns:auth":  "http://example.com/auth",
//         "auth:Token":   "abc123"
//       }
//     },
//     "soap:Body": {
//       "ns:GetOrder": {
//         "@xmlns:ns":   "http://example.com/orders",
//         "ns:OrderId":  "4521"
//       }
//     }
//   }
// }
// stats: { elements: 7, attributes: 3, namespaces: ['soap', 'auth', 'ns'], cdataSections: 0 }

The @ prefix on attribute keys and the namespace preservation make the JSON faithful to the original. A downstream consumer that needs to know which namespace Token belonged to can read auth:Token; a consumer that doesn't care can ignore the prefix.
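
As a rough illustration of how a consumer might use the preserved declarations, the sketch below walks converted JSON and builds a prefix-to-URI map. The helper name collectNamespaces is hypothetical and not part of the tool; it assumes only the @xmlns:prefix key convention shown in the output above.

```javascript
// Sketch only: collect every "@xmlns:prefix" declaration found in converted JSON.
// Assumes the "@"-prefixed attribute convention from the example output.
function collectNamespaces(node, found = {}) {
  if (Array.isArray(node)) {
    for (const item of node) collectNamespaces(item, found);
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key.startsWith('@xmlns:')) {
        // Map prefix -> namespace URI. A prefix re-declared deeper in the
        // tree overwrites the earlier entry in this simplified version.
        found[key.slice('@xmlns:'.length)] = value;
      } else {
        collectNamespaces(value, found);
      }
    }
  }
  return found;
}

// Fragment hand-copied from the SOAP output above, not produced by calling the tool.
const header = {
  'soap:Header': {
    'auth:Credentials': {
      '@xmlns:auth': 'http://example.com/auth',
      'auth:Token': 'abc123',
    },
  },
};
const prefixes = collectNamespaces(header); // { auth: 'http://example.com/auth' }
```

A consumer that matches on namespace URI rather than prefix can look the prefix up in this map first, which guards against a producer renaming auth to some other prefix.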

Before / after: simple mode (lossy)

Same input, simpler JSON when you don't need the metadata:

xmlToJson({
  input:               soapXml,
  mode:                'xml-to-json',
  preserveAttributes:  false,
  preserveNamespaces:  false,
});

// output: {
//   Envelope: {
//     Header: { Credentials: { Token: 'abc123' } },
//     Body:   { GetOrder:    { OrderId: '4521' } }
//   }
// }

Cleaner, but the namespace declarations and prefixes are gone, so this shape cannot be round-tripped back to a valid SOAP envelope. For one-way consumption (e.g. extracting OrderId for downstream processing), it is the simpler shape.
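
To make the one-way nature of simple mode concrete, here is a minimal sketch that derives the simple shape from the preserved shape. The function names simplify and stripPrefix are hypothetical, and this is not the tool's implementation; it only assumes the @ and #text key conventions used in this article.

```javascript
// Sketch only: reduce preserved JSON to the lossy "simple" shape.
function stripPrefix(name) {
  const colon = name.indexOf(':');
  return colon === -1 ? name : name.slice(colon + 1);
}

function simplify(node) {
  if (Array.isArray(node)) return node.map(simplify);
  if (node === null || typeof node !== 'object') return node;

  const out = {};
  for (const [key, value] of Object.entries(node)) {
    if (key.startsWith('@')) continue; // drop attributes, including xmlns declarations
    // Two children that differ only by prefix collide here and the later one wins.
    out[stripPrefix(key)] = simplify(value);
  }

  // An element that only had attributes plus text collapses to its text.
  const keys = Object.keys(out);
  if (keys.length === 1 && keys[0] === '#text') return out['#text'];
  return out;
}
```

The reverse function cannot be written: once the prefixes and @xmlns keys are dropped, nothing in the simple shape says which namespace Token belonged to.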

Before / after: reverse direction (JSON to XML)

xmlToJson({
  input:        JSON.stringify({ user: { id: 42, name: 'Alice', email: 'alice@example.com' } }),
  mode:         'json-to-xml',
  rootElement:  'response',
});

// output: '<?xml version="1.0" encoding="UTF-8"?>\n<response><user><id>42</id><name>Alice</name><email>alice@example.com</email></user></response>'

The rootElement parameter is the top-level wrapper for the emitted XML. Useful for systems that expect a specific document root.
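
The wrapping behaviour can be sketched with a small recursive serializer. This is an illustration under stated assumptions, not the tool's code: jsonToXml, toXml, and escapeXml are hypothetical names, and the sketch ignores @-prefixed attribute keys, #cdata markers, and element-name validation.

```javascript
// Sketch only: serialize plain JSON into XML under a caller-chosen root element.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function toXml(name, value) {
  if (Array.isArray(value)) {
    // A JSON array becomes a run of sibling elements with the same name.
    return value.map((item) => toXml(name, item)).join('');
  }
  if (value !== null && typeof value === 'object') {
    const children = Object.entries(value)
      .map(([childName, child]) => toXml(childName, child))
      .join('');
    return `<${name}>${children}</${name}>`;
  }
  // Primitives become text content; null becomes an empty element.
  return `<${name}>${escapeXml(value ?? '')}</${name}>`;
}

function jsonToXml(jsonText, rootElement = 'root') {
  const body = toXml(rootElement, JSON.parse(jsonText));
  return `<?xml version="1.0" encoding="UTF-8"?>\n${body}`;
}
```

The root element exists because a JSON object can have several top-level keys while an XML document must have exactly one root, so some wrapper name has to be supplied.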

Before / after: CDATA preservation

<article>
  <title>The HTML5 spec</title>
  <body><![CDATA[<p>Some <strong>HTML</strong> content with <code>&lt;markup&gt;</code></p>]]></body>
</article>

xmlToJson({
  input:         xml,
  mode:          'xml-to-json',
  preserveCDATA: true,
});

// output: {
//   article: {
//     title: 'The HTML5 spec',
//     body:  { '#cdata': '<p>Some <strong>HTML</strong> content with <code>&lt;markup&gt;</code></p>' }
//   }
// }
// stats: { ..., cdataSections: 1 }

The #cdata key tells the consumer the value was wrapped in CDATA. That matters because the parser does not entity-decode CDATA content: the &lt; in the example stays as the literal characters &lt; instead of becoming <. Without preservation, the value would be a plain string, and a round-trip would have no way to know the content should be re-wrapped in CDATA rather than entity-escaped.
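
A minimal sketch of what a serializer might do with that marker on the way back to XML follows. The names serializeContent, wrapCdata, and escapeText are hypothetical; the only assumption taken from this article is the { '#cdata': '...' } shape.

```javascript
// Sketch only: choose between CDATA wrapping and entity escaping per value.
function escapeText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function wrapCdata(text) {
  // "]]>" cannot appear inside a CDATA section, so split it across two sections.
  return `<![CDATA[${text.split(']]>').join(']]]]><![CDATA[>')}]]>`;
}

function serializeContent(value) {
  const isCdata =
    value !== null && typeof value === 'object' && typeof value['#cdata'] === 'string';
  // Marked values are emitted verbatim inside CDATA; everything else is escaped.
  return isCdata ? wrapCdata(value['#cdata']) : escapeText(String(value));
}
```

Escaping the marked value instead would still give an equivalent document after parsing, but the serialized bytes would differ from the original, which matters for consumers that compare or sign the raw XML.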

When humans use this

A developer integrating with a SOAP API runs sample requests through the converter to get a JSON-shaped view they can reason about. A team migrating from XML configuration to JSON configuration runs the existing XML files through the converter to bootstrap the new JSON equivalents (then iterates). The reverse direction (JSON to XML) is less common but shows up when integrating with a system that only accepts XML.

When agents use this

Two patterns:

Legacy API ingestion. An agent integrating with a SOAP or RSS feed converts the response to JSON via the tool, then operates on the JSON downstream. The agent doesn't have to understand XML semantics; the converter handles the impedance mismatch.
Document-format normalisation. A pipeline that ingests heterogeneous documents (some XML, some JSON, some YAML) routes XML through this tool, YAML through yaml_to_env or a similar converter, and ends up with a single JSON representation downstream consumers can process uniformly.

Edge cases

DTDs and external entities

Document Type Declarations and external entity references (<!ENTITY xxx SYSTEM "...">) are a security concern (XML External Entity attacks). The tool rejects DTDs and external entities by default with SECURITY_VIOLATION. Pass allowDtd: true to opt in for trusted inputs — useful when the XML source is a known-safe internal system.
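
The default-deny behaviour can be approximated with a textual pre-check before any parsing happens. This is a deliberately naive sketch, not how the tool detects DTDs: assertNoDtd is a hypothetical name, and attaching the code as an error.code property is an assumption about the error shape.

```javascript
// Sketch only: refuse input that declares a DTD or an entity unless the caller opts in.
function assertNoDtd(xml, { allowDtd = false } = {}) {
  if (allowDtd) return;
  // Naive textual check. It also matches these tokens inside comments or
  // CDATA sections, so it errs on the side of rejecting.
  if (/<!DOCTYPE|<!ENTITY/i.test(xml)) {
    const error = new Error('DTD or entity declaration found in input');
    error.code = 'SECURITY_VIOLATION'; // assumed error shape, mirroring the documented code
    throw error;
  }
}
```

Rejecting before parsing is the conservative order of operations: an external entity is resolved by the parser, so a check that runs after parsing may run too late.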

Mixed content

XML allows mixed content: <p>Hello <b>world</b>!</p> has both text ("Hello ", "!") and a child element (<b>world</b>). JSON has no idiomatic representation. The converter emits {"#text": "Hello ", "b": "world", "#text-after": "!"} to preserve order; this translation doesn't lose information, but it's ugly. For text-dominant XML (DocBook, DITA), this is the failure mode of "structured" converters.
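
A consumer that only wants the readable text back can concatenate the values in key order. The sketch below assumes the shape shown above and relies on JavaScript preserving insertion order for non-numeric string keys; textOf is a hypothetical helper, not part of the tool.

```javascript
// Sketch only: rebuild the plain text of a mixed-content node.
function textOf(node) {
  if (node === null || node === undefined) return '';
  if (Array.isArray(node)) return node.map(textOf).join('');
  if (typeof node !== 'object') return String(node);
  return Object.entries(node)
    // Attributes and comments are metadata, not text content.
    .filter(([key]) => !key.startsWith('@') && key !== '#comment')
    .map(([, value]) => textOf(value))
    .join('');
}

const sentence = textOf({ '#text': 'Hello ', b: 'world', '#text-after': '!' });
// sentence === 'Hello world!'
```

The dependence on key order is the fragile part: any step that rebuilds the object with sorted keys silently reorders the sentence.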

Numeric coercion

<count>42</count> becomes count: "42" (string) by default. Pass coerceNumbers: true to emit count: 42 (number). XML has no type information; numeric coercion is a heuristic that gets it right for unambiguous cases (42, -3.14) and wrong for ambiguous cases (007 is sometimes a number, sometimes a string like an employee ID). The default is "no coercion" because the false-positive rate of coercion at scale is non-trivial.
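
One way to picture "unambiguous" is a rule that coerces only when the number prints back as exactly the same text. This is a sketch of such a heuristic under that assumption, not the rule the tool applies; coerceIfUnambiguous and PLAIN_NUMBER are hypothetical names.

```javascript
// Sketch only: coerce a string to a number when nothing is lost by doing so.
// Plain decimal notation: no leading zeros, no exponent, no whitespace.
const PLAIN_NUMBER = /^-?(0|[1-9]\d*)(\.\d+)?$/;

function coerceIfUnambiguous(text) {
  if (!PLAIN_NUMBER.test(text)) return text;
  const value = Number(text);
  // Keep the string if the number would not print back identically,
  // e.g. "1.50" (trailing zero) or integers beyond 2^53.
  return String(value) === text ? value : text;
}
```

Under this rule 42 and -3.14 become numbers, while 007, 1.50, and 9007199254740993 stay strings because converting them would change what a consumer sees.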

Comments and processing instructions

XML comments (<!-- ... -->) and processing instructions (<?xml-stylesheet ... ?>) are dropped by default. Pass preserveComments: true to keep comments as #comment keys. Most consumers don't care, so the default is to drop.

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `input` | `string` | ✓ | (none) | XML or JSON to convert |
| `mode` | `'xml-to-json' \| 'json-to-xml'` | ✓ | (none) | Direction |
| `compact` | `boolean` | ✗ | `false` | Compact JSON representation for attribute-heavy XML |
| `preserveAttributes` | `boolean` | ✗ | `true` | Keep XML attributes as `@attribute` keys |
| `preserveNamespaces` | `boolean` | ✗ | `true` | Keep `xmlns:*` prefixes |
| `preserveCDATA` | `boolean` | ✗ | `true` | Keep CDATA sections marked as `#cdata` |
| `preserveComments` | `boolean` | ✗ | `false` | Keep XML comments as `#comment` keys |
| `coerceNumbers` | `boolean` | ✗ | `false` | Coerce unambiguous numeric strings to numbers |
| `allowDtd` | `boolean` | ✗ | `false` | Accept Document Type Declarations and external entities |
| `rootElement` | `string` | for json-to-xml | `'root'` | XML root element name |

Output shape

{
  output: string;            // converted JSON or XML
  stats: {
    elements:      number;
    attributes:    number;
    namespaces:    string[]; // list of namespace prefixes found
    cdataSections: number;
  };
  warnings: string[];        // e.g. 'Namespace soap mapped to default'
}
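
Because output is a string in both directions, a caller in xml-to-json mode has to parse it before use. The sketch below shows one way to do that while optionally treating warnings as failures; unpackJsonResult and its strict option are hypothetical, and the sample result object is hand-written rather than captured from the tool.

```javascript
// Sketch only: turn a result of the documented shape into a usable object.
function unpackJsonResult(result, { strict = false } = {}) {
  if (strict && result.warnings.length > 0) {
    throw new Error(`Conversion warnings: ${result.warnings.join('; ')}`);
  }
  return JSON.parse(result.output); // output is JSON text in xml-to-json mode
}

// Hand-written sample matching the documented shape.
const sampleResult = {
  output: '{"order":{"@id":"4521","item":"widget"}}',
  stats: { elements: 2, attributes: 1, namespaces: [], cdataSections: 0 },
  warnings: [],
};
const order = unpackJsonResult(sampleResult).order; // { '@id': '4521', item: 'widget' }
```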

Attribute encoding conventions

| XML feature | JSON representation |
| --- | --- |
| Element with text content only | `'key': 'value'` |
| Element with attributes + text | `'key': { '@attr': 'value', '#text': 'content' }` |
| Element with attributes + children | `'key': { '@attr': 'value', 'child': {...} }` |
| CDATA section | `'key': { '#cdata': 'content' }` |
| Comment | `'#comment': 'text'` (when preserveComments) |
| Repeated element | `'key': [...]` (array) |
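
These conventions mean the same element can arrive as a string, an object, or an array depending on the input document. A consumer can smooth that over with two small helpers; asArray and textValue are hypothetical names, sketched here against the conventions in the table above.

```javascript
// Sketch only: normalise the shapes an element can take in the converted JSON.

// A single <item> arrives as one value and repeated <item>s as an array,
// so always hand the caller an array.
function asArray(value) {
  if (value === undefined) return [];
  return Array.isArray(value) ? value : [value];
}

// Text can be a bare string, a '#text' field next to attributes,
// or a '#cdata' field when CDATA preservation is on.
function textValue(node) {
  if (typeof node === 'string') return node;
  if (node !== null && typeof node === 'object') {
    if (typeof node['#text'] === 'string') return node['#text'];
    if (typeof node['#cdata'] === 'string') return node['#cdata'];
  }
  return undefined;
}

const titles = asArray({ '@lang': 'en', '#text': 'First' }).map(textValue); // ['First']
```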

Error codes

| Code | When it fires | Recovery |
| --- | --- | --- |
| `INPUT_EMPTY` | `input` empty | Provide a non-empty input |
| `INPUT_MALFORMED` | XML or JSON parse failed | Verify the input is well-formed |
| `SECURITY_VIOLATION` | DTD or external entity reference detected and `allowDtd: false` | Pass `allowDtd: true` for trusted inputs; refuse the input for untrusted sources (XXE attack risk) |
| `INPUT_TOO_LARGE` | Input exceeds 5MB | Streaming-XML parsers are the right tool for large inputs |
| `UNSUPPORTED_FORMAT` | XML feature without JSON equivalent (e.g. specific schema constructs) | Use a dedicated XML library for the feature in question |
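
Two of these conditions are cheap to check on the caller's side before a request is sent. The sketch below is a hypothetical pre-flight check, not part of the tool; it assumes the 5MB limit means 5 × 1024 × 1024 bytes of UTF-8, which the table does not specify.

```javascript
// Sketch only: catch the empty and oversized cases before calling the converter.
const MAX_INPUT_BYTES = 5 * 1024 * 1024; // assumption: "5MB" measured as UTF-8 bytes

function preflight(input) {
  if (typeof input !== 'string' || input.length === 0) return 'INPUT_EMPTY';
  // String length counts UTF-16 code units, so measure the encoded size instead.
  const bytes = new TextEncoder().encode(input).length;
  if (bytes > MAX_INPUT_BYTES) return 'INPUT_TOO_LARGE';
  return null; // nothing detectable without actually parsing
}
```

A null result only means the cheap checks passed; INPUT_MALFORMED and UNSUPPORTED_FORMAT can still come back from the conversion itself.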

When NOT to use this tool

For XML schema validation, use a dedicated validator (xmllint, libxml2's validation mode). The converter handles well-formed XML; it doesn't validate against XSD schemas.

For XML feeds larger than the 5MB input limit (multi-MB to multi-GB), use a streaming parser (sax-js, Python's lxml.etree.iterparse). This tool loads the full document into memory.

For XML-to-XML transformations (XSLT use cases), use an XSLT processor. JSON is the wrong intermediate format for that workflow.

Performance notes

Typical execution: under 10ms for inputs under 50KB. Attribute preservation adds 5-15% overhead vs simple mode. The tool is deterministic — same input + same parameters always produce byte-identical output — so REST responses are Edge-Cache eligible.
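
Determinism only helps a cache if equivalent requests map to the same key. One way to get there is to canonicalise the parameter object before using it as a key, as in this sketch. The names cacheKey and canonicalize are hypothetical, and nothing here describes how the tool's own caching layer works.

```javascript
// Sketch only: build a cache key that ignores property order in the request.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value)
        .sort()
        .map((key) => [key, canonicalize(value[key])])
    );
  }
  return value;
}

function cacheKey(params) {
  return JSON.stringify(canonicalize(params));
}
```

This version treats an omitted parameter and an explicitly passed default as different keys, so a real cache would probably fill in the defaults before canonicalising.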

The XML parser is XXE-safe by default (allowDtd: false). The opt-in path expects the caller to have verified the source. For high-throughput conversion (thousands of small documents per second), the per-call overhead of the tool's HTTP layer is meaningful; convert in-process with a library like fast-xml-parser instead.