Chapter 10. The XMLn’t parser

XML Calabash includes support for a non-conformant XML parser: an “XMLn’t” parser. It’s built from the grammar productions in the fifth edition of the XML Recommendation, but it is not an XML parser. It doesn’t check the well-formedness constraints of XML, and it departs from the rules of the XML specification in two ways:

  1. It doesn’t normalize whitespace in attribute values.

  2. Optionally, it encodes entity references such that they can be restored during serialization.

The XMLn’t parser is enabled with the cx:xmlnt property in the parameters on p:document or p:load. The property accepts one of three values: “attributes” or “true()” enables the parser; “entities” enables the parser and also preserves entity references.
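
As a minimal sketch of the p:document form, the pipeline below passes the property in the parameters attribute. The file name some-document.xml is a placeholder, and the string form of the “attributes” value is an assumption based on the description above.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:output port="result"/>

  <!-- Placeholder input document. The value 'attributes' is assumed to
       enable the XMLn’t parser without entity preservation. -->
  <p:identity>
    <p:with-input>
      <p:document href="some-document.xml"
                  parameters="map{'cx:xmlnt': 'attributes'}"/>
    </p:with-input>
  </p:identity>
</p:declare-step>
```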

If you parse this document with the XMLn’t parser:

  |<doc>
  |   <if test="significant
  |             line
  |             breaks">
  |      
  |   </if>
  |</doc>

You get an XPath data model that has an attribute with newlines in it. That is not what an XML parser would do: a conformant XML parser normalizes those newlines to spaces.
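
For comparison, a conformant parser would report the attribute roughly as if the document had been written as follows. This is a sketch that assumes the test attribute is undeclared, so it is normalized as CDATA: each line break becomes a single space and the indentation spaces, as they appear in the listing above, are left in place.

```xml
<doc>
   <if test="significant              line              breaks">
      
   </if>
</doc>
```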

If you also enable entity preservation, the parser replaces every entity reference with a single Unicode character. By default, these characters are taken from the Unicode private use area (starting at U+E000), but you can choose a different starting point by setting the cx:xmlnt-startchar property in the parameters to the initial character.
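
A pipeline that enables entity preservation with a different starting character might look like the sketch below. It assumes that cx:xmlnt-startchar accepts a one-character string (written here as the character reference &#xF000;, which the XML parser reading the pipeline resolves to U+F000); the exact value format is not confirmed by the text above. The file name is a placeholder.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:output port="result"/>

  <!-- 'entities' enables the XMLn’t parser and entity preservation.
       Assumption: the start character is given as a single-character
       string; U+F000 is an arbitrary private use character. -->
  <p:load href="some-document.xml"
          parameters="map{'cx:xmlnt': 'entities',
                          'cx:xmlnt-startchar': '&#xF000;'}"/>
</p:declare-step>
```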

Parsing this document:

  |<?xml version="1.0" encoding="iso-8859-1"?>
  |<!DOCTYPE doc [
  |<!ENTITY motto "Spoon!">
  |]>
  |<doc>
  |   <if test="significant
  |             line
  |             breaks">&motto;</if>
  |</doc>

Results in an XPath data model whose if element contains a single character, &#xE000;. XML Calabash provides a use-character-maps entry in the serialization properties that allows the serializer to reproduce the original. Loading this document with the XMLn’t parser gives you an XProc document with the data model that a proper XML parser would build from this input:

  |<?xml version="1.0" encoding="iso-8859-1"?>
  |<doc>
  |   <if test="significant&#xA;
  |             line&#xA;
  |             breaks">&#xE000;</if>
  |</doc>

The document will have two additional properties: a cx:xmlnt property that contains the text of the prolog (<!DOCTYPE … ]>) and a serialization property that includes a character map:

  |map{Q{}use-character-maps:map{"&#xE000;":"&motto;"}}

Both of these properties must remain on the document in order to get the correct serialization. If you pass this document through a step, such as p:xslt, that produces a new document with different properties, you must arrange to put these properties back or the result of serialization will not be correct.

The cx:merge-properties step exists to simplify putting the XMLn’t properties back. For example:

  |<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  |                xmlns:cx="http://xmlcalabash.com/ns/extensions"
  |                xmlns:xs="http://www.w3.org/2001/XMLSchema"
  |                version="3.0">
  |<p:import href="https://xmlcalabash.com/ext/library/merge-properties.xpl"/>
  |<p:output port="result" sequence="true"/>
  | 
  |<!-- Load the document with the XMLn’t parser enabled. -->
  |<p:load name="original"
  |        href="some-document.xml"
  |        parameters="map{'cx:xmlnt':true()}"/>
  | 
  |<!-- The transformation produces a new document with different properties. -->
  |<p:xslt>
  |  <p:with-input port="stylesheet"
  |                href="some-transform.xsl"/>
  |</p:xslt>
  | 
  |<!-- Put the XMLn’t properties of the loaded document back on the result. -->
  |<cx:merge-properties>
  |  <p:with-input port="alternate"
  |                pipe="@original"/>
  |</cx:merge-properties>
  | 
  |</p:declare-step>

In principle, you can change the character map or the declaration by adjusting the document properties, but any mistake is likely to produce serialized output that cannot be parsed.