HOWTO Avoid Being Called a Bozo When Producing XML

“There’s just no nice way to say this: Anyone who can’t make a syndication feed that’s well-formed XML is an incompetent fool. … Maybe this is unkind and elitist of me, but I think that anyone who either can’t or won’t implement these measures is, as noted above, a bozo.” – Tim Bray, co-editor of the XML 1.0 specification

There seem to be developers who think that well-formedness is awfully hard—if not impossible—to get right when producing XML programmatically, and developers who can get it right and wonder why the others are so incompetent. I assume no one wants to appear incompetent or to be called names. Therefore, I hope the following list of dos and don’ts helps developers move from the former group to the latter.

Note about the scope of this document: This document focuses on the Unicode layer, the XML 1.0 layer and the Namespaces in XML layer. Getting higher layers like XHTML and Atom right is outside the scope of this document. Also, anything served as text/html is outside the scope of this document, although the methods described here can be applied to producing HTML. In fact, doing so is even a good idea.

Contents

  1. Don’t think of XML as a text format
  2. Don’t use text-based templates
  3. Don’t print
  4. Use an isolated serializer
  5. Use a tree or a stack (or an XML parser)
  6. Don’t try to manage namespace declarations manually
  7. Use unescaped Unicode strings in memory
  8. Use UTF-8 (or UTF-16) for output
  9. Use NFC
  10. Don’t expect software to look inside comments
  11. Don’t rely on external entities on the Web
  12. Don’t bother with CDATA sections
  13. Don’t bother with escaping non-ASCII
  14. Avoid adding pretty-printing white space in character data
  15. Don’t use text/xml
  16. Use XML 1.0
  17. Test with astral characters
  18. Test with forbidden control characters
  19. Test with broken UTF-*

Don’t think of XML as a text format

Even people who have used compilers and seen the error and warning messages seem to think that text formats can be written casually and the piece of software at the other end will be able to fix small errors like a human reader. This is not the case with XML. If the document is not well-formed, it is not XML, and an XML processor has to cease normal processing upon finding a fatal error.

It helps if you think of XML as a binary format like PNG—only with the added bonus that you can use text tools to see what is in the file for debugging.

Don’t use text-based templates

Text-based Web templating systems (MovableType, WordPress, etc.) and active page technologies that seem to allow you to embed program code in a document skeleton (ASP, PHP, JSP, Lasso, Net.Data, etc.) are designed for tag soup. They don’t guarantee well-formed XML output. They don’t guarantee correct HTML output, either. They seem to work with HTML, because text/html user agents are lenient and try to cope with broken HTML. The most common mistakes involve not escaping markup-significant characters or escaping them twice.

Don’t use these systems for producing XML. Making mistakes with them is extremely easy and taking all cases into account is hard. These systems have failed smart people who have actively tried to get things right.

Don’t print

Using print (or echo) calls sprinkled all over your code to emit pieces of markup and literal text is error-prone as well. Is the string you are printing markup or text that needs to be escaped? Have you printed multiple start tags at a time? Can you get the end tags right?

When your program grows and is modified, these things become increasingly difficult to keep track of. It is very easy to overlook something, and very likely that something will go wrong.

Use an isolated serializer

Still, producing the markup characters and writing them as bytes into an output stream has to happen somewhere. Putting all the code that writes to the output stream in a single class or compilation unit makes it possible to debug the escaping-sensitive code in one place. The serializer should have SAX-like methods such as startElement(nsUri, localname, attributes), endElement(nsUri, localname), characters(text), processingInstruction(target, data), etc. The methods always take unescaped strings and escape attribute values and character data. With this approach, the notorious escaping problem just vanishes!
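
To make this concrete, here is a minimal sketch of the core of such a serializer in Java. The class is illustrative rather than any particular library’s API, and prefix and namespace handling is omitted (see the namespace section below):

import java.io.IOException;
import java.io.Writer;

/** Illustrative sketch: the only code in the program that writes markup. */
public class Serializer {
    private final Writer out;

    public Serializer(Writer out) {
        this.out = out;
    }

    /** Takes an unescaped string; escaping happens here and nowhere else. */
    public void characters(String text) throws IOException {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.write("&lt;"); break;
                case '>': out.write("&gt;"); break;
                case '&': out.write("&amp;"); break;
                default: out.write(c);
            }
        }
    }

    public void startElement(String nsUri, String localName) throws IOException {
        // Real code would manage prefixes and namespace declarations here.
        out.write('<');
        out.write(localName);
        out.write('>');
    }

    public void endElement(String nsUri, String localName) throws IOException {
        out.write("</");
        out.write(localName);
        out.write('>');
    }
}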

For Java, there is gnu.xml.util.XMLWriter and its subclass gnu.xml.pipeline.TextConsumer that plugs into the GNU JAXP SAX pipeline framework. (A word of warning: The GNU JAXP XMLWriter does not work properly for all characters unless used with the UTF-8 output encoding and with the XHTML mode turned off. If you believe you need the XHTML mode—that is, the Appendix C mode—you may want to check out fi.karppinen.gnu.xml.util.XMLWriter and fi.karppinen.gnu.xml.pipeline.TextConsumer instead. However, if you need the Appendix C mode, you are probably trying to serve XHTML as text/html. Doing so is considered harmful, so what you really need is a serializer that produces HTML 4.01 from XHTML 1.0 SAX events.)

For Java, there is nu.validator.htmlparser.sax.XmlSerializer. It does not support XHTML 1.0 Appendix C. If you want Appendix C support, you should probably send HTML5 as text/html instead, since serving XHTML as text/html is considered harmful. For that, there is nu.validator.htmlparser.sax.HtmlSerializer.

For C, there is eg. GenX. C programmers may also find the tools in libxml2 useful.

Use a tree or a stack (or an XML parser)

Although the serializer API sketched above makes the escaping problem disappear, the application could still call startElement() and endElement() in a bad sequence and break well-formed nesting.

Since an XML document parses into a tree, traversing an analogous programmatically produced tree (eg. DOM or XOM) induces the right sequence of startElement() and endElement() calls. It is worth noting that even though recursive tree traversal usually gets all the attention in algorithm and data structure textbooks, a tree with parent references can be traversed iteratively.

If you are serializing a tree data structure into an XML format that closely mirrors the in-memory structure, you can use the treeness of the data structure for ensuring well-formed nesting instead of first building a DOM or XOM (or similar) tree.

A tree may be overkill, however. To ensure proper nesting, a stack is sufficient. A stack can keep track of the open elements without wasting space on parts of the document that have already been handled or have not been handled yet. More importantly, the stack does not need to be explicit: the runtime stack can be used. If startElement is always called at the beginning of a method and endElement is always called at the end, the runtime stack guarantees the nesting.

Code using the runtime stack for ensuring nesting would look like this:
void emitFoo() {
    startElement(NS_URI, "foo");
    emitBar();
    if (shouldEmitBaz) {
        emitBaz();
    }
    endElement(NS_URI, "foo");
}

Finally, one way of producing SAX events in a proper sequence may be obvious: a SAX parser emits SAX parse events in a proper sequence. It may also be so obvious that it is easy to overlook.

The original way to get some SAX events is parsing an XML document at runtime. But if you are producing XML dynamically, what good does it do to parse a static document? Well, boilerplate markup can be put in a static XML file and the interesting parts can be produced programmatically. A SAX filter can look for interesting points in the XML document (eg. a particular processing instruction or element) and inject additional SAX events into the pipeline before returning control to the parser. The injection may involve parsing another document and injecting events from it into the same pipeline. If the static XML data is trusted, it is even possible to name methods in processing instructions and use reflection to call back into the application based on the XML data.
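
For example, a SAX filter in Java could watch for a particular processing instruction and replace it with dynamically produced events. This is only a sketch; the PI target insert-here and the injected content are made up for illustration:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Sketch: replaces <?insert-here?> with dynamically produced SAX events. */
public class InjectingFilter extends XMLFilterImpl {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void processingInstruction(String target, String data)
            throws SAXException {
        if ("insert-here".equals(target)) {
            // Inject events into the pipeline instead of forwarding the PI.
            Attributes noAttributes = new AttributesImpl();
            super.startElement(XHTML, "p", "p", noAttributes);
            char[] text = "Hello".toCharArray();
            super.characters(text, 0, text.length);
            super.endElement(XHTML, "p", "p");
        } else {
            super.processingInstruction(target, data);
        }
    }
}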

Another approach to boilerplate markup is code generation in such a way that the parse events from an XML parser are recorded as generated program code that can play back the events efficiently without actually reading input at runtime. My SaxCompiler takes this approach. Since the events are recorded from an XML parser, they occur in a permissible sequence.

Don’t try to manage namespace declarations manually

Namespaces in XML makes it possible for XML element and attribute names to be in a namespace. Being in a namespace means being associated with an additional string symbol, which is required to be a URI although it is compared code point for code point. The name of the XHTML element for paragraphs is not just p. It is the pair consisting of the XHTML namespace URI and p—that is, (http://www.w3.org/1999/xhtml, p) or, in James Clark’s notation, {http://www.w3.org/1999/xhtml}p.

The URI is bound to the local name by using an intermediate syntactic abstraction. The namespace can be declared as a default that affects unprefixed element names (but not attribute names), or it can be bound to a prefix. The crucial point is that the prefix string itself can be chosen arbitrarily and carries no meaning. Also, the declarations can appear earlier in the document tree and are scoped.

My aim in the above paragraphs is to convey that the namespace mechanism is sufficiently complex that it is dangerous to leave it to the casual programmer and application code. Instead, the application programmer should use the URI–local name pair and leave the management of the namespace declarations and prefixes to a dedicated piece of code that someone has already debugged. (Of course, it is OK for the programmer to suggest prefixes to make the output more readable.)

For the GNU JAXP framework, gnu.xml.pipeline.NSFilter is such a piece of code. GenX, on the other hand, does this within the serializer component itself.
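
In Java, the JDK’s StAX API offers the same service: a repairing XMLStreamWriter takes URI–local name pairs and generates the prefixes and declarations itself. A minimal sketch:

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class NamespaceDemo {
    public static void main(String[] args) throws XMLStreamException {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        // Let the writer manage prefixes and namespace declarations.
        factory.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
        StringWriter out = new StringWriter();
        XMLStreamWriter w = factory.createXMLStreamWriter(out);
        // The application only ever supplies the (URI, local name) pair.
        w.writeStartElement("http://www.w3.org/1999/xhtml", "p");
        w.writeCharacters("Hello");
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        System.out.println(out); // the writer chose the prefix and the xmlns declaration
    }
}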

Use unescaped Unicode strings in memory

To keep the abstractions clear, the content strings in memory should be in the unescaped parsed form. For example, if you have content that says two is greater than one, the string in memory should be “2 > 1”. In particular, it should not be “2 &gt; 1”. “2 > 1” is what you mean. Only when the string reaches the serializer does it become the serializer’s responsibility to write “2 &gt; 1” in the output.
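
With a serializer like the one sketched earlier, the call site stays entirely in the unescaped world:

// The string in memory is the unescaped content...
serializer.characters("2 > 1");
// ...and the serializer alone writes 2 &gt; 1 to the output.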

Passing along a chunk of markup is done either by passing a tree data structure (eg. DOM fragment) or by emitting multiple SAX events in sequence.

Moreover, the chances for mistakes are minimized when in-memory strings use the encoding of the built-in Unicode string type of the programming language if your language (or framework) has one. For example, in Java you’d use java.lang.String and char[] and, therefore, UTF-16. Python has the complication that the Unicode string type can be either UTF-16 (OS X, Jython) or UTF-32 (Debian) depending on how the interpreter was compiled. With C it makes sense to choose one UTF and stick to it.

Use UTF-8 (or UTF-16) for output

The XML 1.0 specification requires all XML processors to support the UTF-8 and UTF-16 encodings. XML processors may support other encodings, but they are not required to. It follows that using any encoding other than UTF-8 or UTF-16 is unsafe, because the XML processor used by the recipient might not support the encoding. If you use an encoding other than UTF-8 or UTF-16 and communication fails, it is your fault. Arguments about particular legacy encodings being common in a particular locale (eg. Shift_JIS in Japan or ISO-8859-1 in Western Europe) are totally irrelevant here. (The xml:lang attribute can be used for CJK disambiguation. There is no need to use parochial encodings for that.)

From the XML point of view both UTF-8 and UTF-16 are equally right. If your serializer only supports either one, just go with the one the serializer already supports.

UTF-8 is more compact than UTF-16 (in terms of bytes) for characters in the ASCII range. Even if your content does not contain characters from the ASCII range frequently, the element and attribute names in well-known vocabularies as well as the XML syntax itself consist of characters from the ASCII range. UTF-8 data is also easier to examine for debugging with byte/ASCII-oriented network sniffing and file examination tools. UTF-16 is more compact than UTF-8 only when the number of characters from the U+0800–U+FFFF range exceeds the number of characters from the ASCII range—and the latter includes markup whenever well-known XML vocabularies are used.

It might be tempting to try to optimize the size of the document by choosing the encoding depending on the content or the expected content. However, doing so opens up more possibilities for bugs. Even when the serializer offers a choice, it is safer to pick either UTF-8 or UTF-16 and stick to the choice regardless of content or deployment locale. I am biased in favor of UTF-8.
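
In Java, for example, committing to UTF-8 amounts to fixing the charset in the one place where the output Writer is created (a sketch; outputStream stands for wherever your bytes go):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Pick UTF-8 once, here, and never vary it by content or locale.
Writer out = new OutputStreamWriter(outputStream, StandardCharsets.UTF_8);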

Use NFC

In Unicode, common accented letters can be expressed in two different ways: as a single character or as a base character followed by a combining character. For example, ‘ä’ can be represented as one character (LATIN SMALL LETTER A WITH DIAERESIS) or as two characters (LATIN SMALL LETTER A followed by COMBINING DIAERESIS). The former is known as the precomposed form and the latter as the decomposed form. There are also presentation forms that are considered compatibility equivalents of other characters. For example, LATIN SMALL LIGATURE FI is a presentation form of LATIN SMALL LETTER F and LATIN SMALL LETTER I.

Unicode Normalization Forms defines four normalization forms of Unicode that differ in their representation of characters that can be decomposed or that have compatibility equivalents. Character Model for the World Wide Web 1.0: Normalization (which is still a working draft) specifies that the Normalization Form C (NFC for short) ought to be used on the Web.

There are a lot of transitional applications that treat Unicode as wide ISO-8859-1—like ISO-8859-1 is wide ASCII. These applications are able to deal with precomposed accented characters but not with the canonically equivalent NFD representations. Thus, NFC is the safer choice if you want to maximize the probability that your text renders nicely. Using NFC is not a well-formedness requirement—just a robustness bonus.
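
In Java, for example, normalizing to NFC is a single library call:

import java.text.Normalizer;

String decomposed = "a\u0308"; // LATIN SMALL LETTER A followed by COMBINING DIAERESIS
String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
// nfc is now "\u00E4", the one-character precomposed form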

Don’t expect software to look inside comments

According to the XML spec, “an XML processor MAY, but need not, make it possible for an application to retrieve the text of comments”. Since the receiving application is not guaranteed to see the comments, comments are not an appropriate place for data that you want the recipient to process. That a particular DTD does not allow embedded RDF metadata does not make comments an appropriate place for metadata.

Don’t rely on external entities on the Web

It follows from the XML spec that external entities are inherently unsafe for Web documents, because non-validating XML processors are allowed not to process them, and someone may be using a non-validating XML processor to parse the content you serve on the Web. Therefore, it makes sense not to rely on external entities. And if you are not relying on them, why have them around at all? Anyone processing them would just waste time. The straightforward way is to produce doctypeless XML.

But what about validation? It turns out there is a better validation formalism than DTDs. It is more interesting to know the answer to the question “Does this document conform to these rules?” than to the question “Does this document conform to the rules it declares for itself?” RELAX NG validation answers the first question. DTD validation answers the second. RELAX NG allows you to validate a document against a schema that is more expressive than a DTD without polluting the document with schema-specific syntax.
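
In Java, for example, the standard javax.xml.validation API can drive RELAX NG validation, provided a RELAX NG-capable implementation such as Jing is on the classpath (the JDK alone ships only with W3C XML Schema support; the file names below are placeholders):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class RelaxNgCheck {
    public static void main(String[] args) throws Exception {
        // Throws if no RELAX NG-capable provider is on the classpath.
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.RELAXNG_NS_URI);
        Schema schema = factory.newSchema(new File("schema.rng"));
        Validator validator = schema.newValidator();
        // Throws SAXException if the document does not conform to the schema.
        validator.validate(new StreamSource(new File("document.xml")));
    }
}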

Don’t bother with CDATA sections

XML provides two ways of escaping markup-significant characters: predefined entities and CDATA sections. CDATA sections are only syntactic sugar. The two alternative syntactic constructs have no semantic difference.

CDATA sections are convenient when you are editing XML manually and need to paste a large chunk of text that includes markup-significant characters (eg. code samples). However, when producing XML using a serializer, the serializer takes care of escaping automatically and trying to micromanage the choice of escaping method only opens up possibilities for bugs.

Don’t bother with escaping non-ASCII

Since you are using UTF-8 (or UTF-16), the output encoding can represent the whole of Unicode directly. There is no need to escape non-ASCII characters in any way. Only <, >, & and (in attribute values) " need escaping. That’s it. No entities needed. No numeric character references needed.

If you insist on escaping non-ASCII, please make sure you handle astral characters correctly.

Avoid adding pretty-printing white space in character data

XML has a design problem that makes source formatting leak into parsed content. Instead of reserving eg. literal tabs and line feeds exclusively for source formatting so that the parser could always discard them, XML allows white space to be both significant content and meaningless pretty-printing. The mess is left for higher layers to sort out.

To avoid problems, it is prudent never to introduce pretty-printing white space in character data. Personally, I don’t pretty-print at all when I produce XML programmatically. The safe way to pretty-print is to put the white space inside the tags themselves instead of putting it between them.

That is, if you have
<foo>bar</foo>
instead of doing this
<foo>
    bar
</foo>

do this
<foo
    >bar</foo
>

Don’t use text/xml

The XML specification provides a means for XML documents to declare their own character encoding. This way, the encoding information travels with the document even in environments that can’t store or communicate the encoding information externally.

Unfortunately, the XML specification allows external encoding information to override the internal encoding information. Considering Ruby’s Postulate, it would probably be a better idea to count on the internal information, just as you trust the ZIP file itself when figuring out which compression method has been used instead of letting an external HTTP header say which decompression method you should apply. According to RFC 3023, the text/xml content type never allows you to use the internal information. Even in the absence of an explicit charset parameter, the default is US-ASCII, trumping the XML spec. (Of course, there’s a lot of software that ignores the RFC, but that’s not a good basis to build on.)

When the type application/xml is used without the charset parameter, the XML spec governs the matter of character encoding. For some vocabularies, there are types of the form application/*+xml, which also don’t suffer from the counter-intuitive encoding default of text/xml.
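
In a Java servlet, for example, this comes down to a single call on the response (a sketch; response is an HttpServletResponse):

// Let the XML spec govern the encoding: application/xml without a charset parameter.
response.setContentType("application/xml");
// Or, for a vocabulary that has its own registered type:
response.setContentType("application/atom+xml");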

Use XML 1.0

XML 1.0 is well supported. XML 1.1 is not interoperable with XML 1.0 software. XML 1.0 processors are required to reject XML 1.1 documents.

XML 1.1 adds the ability to use some previously forbidden control characters like the form feed while still forbidding U+0000, so you still cannot zero-extend random binary data and smuggle it over XML as text. XML 1.1 also allows you to use Khmer, Amharic, Ge’ez, Thaana, Cherokee, and Burmese characters in element and attribute names. Contrary to what XML 1.1 propaganda may lead people to believe, XML 1.0 already allows content in those languages. Additionally, XML 1.1 changes the definition of white space to accommodate IBM mainframe text conventions.

Test with astral characters

Unicode was originally supposed to be 16 bits wide. However, the original 16 bits running up to U+FFFF turned out to be insufficient. Thus, Unicode was extended to run up to U+10FFFF. The range of scalar values is considered to be partitioned into 17 planes with 16 bits’ worth of code points on each plane. The characters in the range of the original 16 bits constitute the Basic Multilingual Plane (or BMP or Plane 0). The range above U+FFFF consists of the astral planes, and the characters above U+FFFF are called astral characters.

The original way of simply storing a character as an unsigned 16-bit integer was extended to cover the astral planes using surrogate pairs, yielding the UTF-16 encoding. A range of values in the BMP is set aside for use as surrogates. An astral character is represented as a surrogate pair: a high surrogate (a 16-bit code unit) followed by a low surrogate (another 16-bit code unit).

Some programs operating on 16-bit units may not pass surrogate pairs through intact, even though one might think the surrogate pairs could be smuggled through legacy software as two adjacent “characters”. Moreover, when UTF-16 data is converted into UTF-8, the surrogate pair needs to be converted into the scalar value of the code point, which is then converted into a 4-byte UTF-8 sequence. Some broken converters may produce a 3-byte sequence for each surrogate instead. (This kind of broken UTF-8 has been formalized as CESU-8.)

Because of these issues, it is a good idea to test that astral characters can travel through your system intact and that the output produced is proper UTF-8 and not CESU-8.
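
A sketch of such a test in Java, using U+1D11E MUSICAL SYMBOL G CLEF as the astral test character:

import java.nio.charset.StandardCharsets;

public class AstralTest {
    public static void main(String[] args) {
        String gClef = new String(Character.toChars(0x1D11E)); // a surrogate pair in UTF-16
        byte[] utf8 = gClef.getBytes(StandardCharsets.UTF_8);
        // Proper UTF-8 is one 4-byte sequence: F0 9D 84 9E.
        // CESU-8-style breakage shows up as two 3-byte sequences (6 bytes).
        if (utf8.length != 4) {
            throw new AssertionError("astral character not encoded as proper UTF-8");
        }
    }
}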

Test with forbidden control characters

XML semi-arbitrarily forbids some ASCII control characters and Unicode values that are reserved to be used as sentinels (eg. U+0000 and U+FFFF). These characters render the document ill-formed. Therefore, it is important to make sure they cannot occur in the output of your system.

It is a good idea to try to introduce these characters into the system and make sure that they are either caught right upon input or at least filtered out in the XML serializer.
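
Such a filter needs little more than a predicate for the Char production of XML 1.0 (a sketch):

// Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
static boolean isXmlChar(int codePoint) {
    return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
}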

Test with broken UTF-*

Whichever UTF you use in memory or for input, it is possible to construct illegal code unit sequences. With UTF-32 the scalar value may be outside the Unicode range. With UTF-16 there may be unpaired surrogates. With UTF-8 there may be overlong byte sequences (sequences that are not the shortest form for a given character) or sequences whose scalar value falls in the surrogate range.

You should try throwing broken code unit sequences at your system and make sure that broken input can never silently translate into broken output. Most importantly, if your input or memory UTF is the same as the output UTF, you should not merely copy code units into the output without checking them.

Usually checking is achieved as a side effect by using UTF-8 for input and output and UTF-16 in memory, so broken data is caught in the conversion.
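
In Java, for example, a strict decoder makes malformed input fail loudly at the point of conversion instead of leaking through (a sketch; in stands for your input stream):

import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// REPORT makes the decoder throw on malformed sequences instead of
// silently substituting replacement characters (or passing bad data through).
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Reader reader = new InputStreamReader(in, decoder);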


Stuff to read elsewhere

Uche Ogbuji comments on this article on IBM developerWorks.