Chapter 10. The XMLn’t parser
XML Calabash includes support for a non-conformant XML parser: an “XMLn’t” parser. It’s built from the grammar productions in the fifth edition of the XML Recommendation, but it is not an XML parser. It doesn’t check the well-formedness constraints of XML and it does not follow the rules of the XML specification.
It doesn’t normalize whitespace in attribute values.
Optionally, it encodes entity references such that they can be restored during serialization.
The XMLn’t parser is enabled with the cx:xmlnt property in the parameters on p:document or p:load.
The cx:xmlnt property can have one of three values: “attributes” or “true()” to enable the parser, or “entities” to enable the parser and also preserve entity references.
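As a minimal sketch of the third value, the following pipeline loads a document with the parser and entity preservation both enabled. It reuses the parameters syntax from the pipeline example later in this chapter. The file name some-document.xml is a placeholder, and writing the value as the string 'entities' is an assumption about how the property is expressed in the map.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:output port="result"/>

  <!-- Assumption: the value 'entities' is passed as a string. -->
  <p:load href="some-document.xml"
          parameters="map{'cx:xmlnt':'entities'}"/>
</p:declare-step>
```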
If you parse this document with the XMLn’t parser:
|<doc>
| <if test="significant
| line
| breaks">
| …
| </if>
|</doc>
You get an XPath data model that has an attribute with newlines in it. That is not what an XML parser would do: a conforming XML parser normalizes those newlines to spaces.
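For comparison, the sketch below shows how a conforming parser treats the same kind of attribute. It relies on the attribute-value normalization rules of the XML Recommendation: a literal newline in an attribute value becomes a space, while a character reference such as &#10; survives as a real newline. The element and attribute names follow the example above; the values are written without indentation only to keep the results easy to read.

```xml
<doc>
  <!-- Literal newlines: a conforming parser reports the value
       "significant line breaks", each newline turned into a space. -->
  <if test="significant
line
breaks"/>

  <!-- Character references: a conforming parser reports a value that
       contains two real newlines, which is what the XMLn't parser
       reports for the literal form. -->
  <if test="significant&#10;line&#10;breaks"/>
</doc>
```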
If you also enable entity preservation, the parser replaces every entity reference with a single Unicode character. By default, these are taken from the Unicode private use area (starting at U+E000), but you can choose any starting point you like by setting the cx:xmlnt-startchar property in the parameters to the initial character.
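A sketch of how a different starting point might be set. It assumes that the value of cx:xmlnt-startchar is a one-character string. The character U+F000 is an arbitrary choice from the private use area, written as a character reference so that it stays visible in the pipeline source, and some-document.xml is a placeholder.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:cx="http://xmlcalabash.com/ns/extensions"
                version="3.0">
  <p:output port="result"/>

  <!-- Assumption: the start character is given as a one-character string;
       &#xF000; is resolved by the XML parser before the map is evaluated. -->
  <p:load href="some-document.xml"
          parameters="map{'cx:xmlnt':'entities',
                          'cx:xmlnt-startchar':'&#xF000;'}"/>
</p:declare-step>
```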
Parsing this document:
|<?xml version="1.0" encoding="iso-8859-1"?>
|<!DOCTYPE doc [
|<!ENTITY motto "Spoon!">
|]>
|<doc>
| <if test="significant
| line
| breaks">&motto;</if>
|</doc>
Results in an XPath data model that has an if element that contains a single character, U+E000. XML Calabash provides a use-character-maps entry in the serialization properties that allows the serializer to reproduce the original. If you load this document with the XMLn’t parser, what you get is an XProc document that has the data model that you would get from parsing this input with a proper XML parser:
|<?xml version="1.0" encoding="iso-8859-1"?>
|<doc>
| <if test="significant&#10; line&#10; breaks">&#xE000;</if>
|</doc>
The document will have two additional properties: a cx:xmlnt property that contains the text of the prolog (<!DOCTYPE … ]>) and a serialization property that includes a character map:
|map{Q{}use-character-maps:map{"":"&motto;"}}
Both of these properties must remain on the document in order to get the correct serialization. If you pass this document through a step, such as p:xslt, that produces a new document with different properties, you must arrange to put these properties back or the result of serialization will not be correct.
The cx:merge-properties step exists to simplify putting the XMLn’t properties back. For example:
|<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
| xmlns:cx="http://xmlcalabash.com/ns/extensions"
| xmlns:xs="http://www.w3.org/2001/XMLSchema"
| version="3.0">
|<p:import href="https://xmlcalabash.com/ext/library/merge-properties.xpl"/>
|<p:output port="result" sequence="true"/>
|
|<!-- Load the document with the XMLn’t parser enabled. -->
|<p:load name="original"
| href="some-document.xml"
| parameters="map{'cx:xmlnt':true()}"/>
|
|<!-- The transformation produces a new document with different properties. -->
|<p:xslt>
| <p:with-input port="stylesheet"
| href="some-transform.xsl"/>
|</p:xslt>
|
|<!-- Put the properties of the originally loaded document back. -->
|<cx:merge-properties>
| <p:with-input port="alternate"
| pipe="@original"/>
|</cx:merge-properties>
|
|</p:declare-step>
In principle, you can change the character map or the declaration by adjusting the document properties, but a mistake there will likely produce serialized output that cannot be parsed.