There is a particularly irritating requirement in the Unicode standard that a UTF-8 parser must either abort or corrupt its input if it encounters ill-formed UTF-8. By "corrupt" I mean that ill-formed subsequences all get converted to the U+FFFD replacement character, so the information about the values of the bytes in the ill-formed sequence is lost.
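
To make those two permitted behaviours concrete, here is a small illustration using Python's built-in UTF-8 codec (my example, not anything from the standard):

    bad = b"abc\x80def"          # 0x80 is a stray continuation byte

    try:
        bad.decode("utf-8")                  # strict: the parser aborts
    except UnicodeDecodeError as e:
        print("aborted:", e.reason)

    print(bad.decode("utf-8", "replace"))    # corrupts: prints 'abc\ufffddef'
    # The value 0x80 is gone: every ill-formed byte comes back as U+FFFD.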

This is a problem for programs that want to treat data as UTF-8 but which cannot be sure that the data is actually conformant. For example, how do you delete a file or an entry in a database if its name is ill-formed and the API rejects ill-formed names?

UTF-16 has several advantages over UTF-8 in this area. The only syntax error in UTF-16 is the appearance of a surrogate that isn't part of a pair. If a parser ignores this error, it is natural to convert the surrogate into the corresponding invalid UCS-4 character value between 0xD800 and 0xDFFF. The resulting (invalid) string can be written back out without loss of information as UTF-16. It can also be written out as UTF-8 and read back in (if you have a relaxed UTF-8 parser) without loss of information.
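
Python's "surrogatepass" error handler behaves like the relaxed parser described here, so it can be used to illustrate the round trip (my example):

    # An unpaired high surrogate survives as the invalid code point U+D800
    # and can be written back out unchanged, via UTF-16 or relaxed UTF-8.
    lone = b"\x00\xd8"                              # UTF-16LE, lone high surrogate

    s = lone.decode("utf-16-le", "surrogatepass")   # '\ud800'
    assert s.encode("utf-16-le", "surrogatepass") == lone

    u8 = s.encode("utf-8", "surrogatepass")         # b'\xed\xa0\x80'
    assert u8.decode("utf-8", "surrogatepass") == s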

UTF-8 has more error conditions: as well as surrogates, there are overlong sequences, out-of-place continuation bytes, and some byte values that may not appear at all. In many cases there is no natural way to convert an ill-formed sequence into UCS-4. So if you want your code to be more relaxed than the standard, there's no obvious way to do it, and you are unlikely to interoperate with other implementations.
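
For concreteness, here are byte-level examples of those extra error classes (my examples), all of which a strict decoder rejects:

    samples = {
        "overlong sequence":       b"\xc0\xaf",   # '/' encoded in two bytes
        "stray continuation byte": b"\x80",       # no lead byte before it
        "byte that never appears": b"\xff",       # not valid anywhere in UTF-8
    }
    for name, data in samples.items():
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as e:
            print(f"{name}: {e.reason}")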

Markus Kuhn proposed a way to deal with this problem which has been dubbed UTF-8b. He suggests using part of the surrogate range to represent each byte in an ill-formed sequence: byte 0xNN becomes the code point U+DC00 + 0xNN, i.e. one of U+DC80 .. U+DCFF. However, this suggestion conflicts with the use of surrogates to represent ill-formed UTF-16: you can't tell whether to write out (for example) 0xDCBA as the single byte 0xBA (preserving ill-formed UTF-8) or as the sequence 0xED 0xB2 0xBA (preserving ill-formed UTF-16).
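
Python's "surrogateescape" error handler (PEP 383) implements essentially this UTF-8b scheme, so the ambiguity is easy to demonstrate (my illustration, not part of Kuhn's proposal):

    # UTF-8b: the ill-formed byte 0xBA comes back as the surrogate U+DCBA...
    s = b"\xba".decode("utf-8", "surrogateescape")   # '\udcba'

    # ...but U+DCBA is also how a relaxed parser carries an unpaired UTF-16
    # surrogate, and the two readings serialize differently:
    print(s.encode("utf-8", "surrogateescape"))      # b'\xba'         (raw byte)
    print(s.encode("utf-8", "surrogatepass"))        # b'\xed\xb2\xba' (encoded surrogate)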

In a pure UTF-8 world it might make sense to represent ill-formed sequences using character values outside the Unicode range. For instance, all bytes in invalid sequences have their top bit set, so you could just sign-extend them to produce negative, invalid UCS-4 character values. But there are situations where you need to round-trip via UTF-16, and UTF-16 cannot represent negative character values.
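
As a minimal sketch of that idea (my code, not the post's): the decoder below returns a list of plain integers, since string types cannot hold negative code points.

    def decode_utf8_signext(data: bytes) -> list[int]:
        out, i = [], 0
        while i < len(data):
            # Try to decode one well-formed character starting at offset i.
            for n in (1, 2, 3, 4):
                try:
                    out.append(ord(data[i:i + n].decode("utf-8")))
                    i += n
                    break
                except UnicodeDecodeError:
                    continue
            else:
                # Sign-extend the bad byte: 0x80..0xFF -> -0x80..-0x01.
                out.append(data[i] - 0x100)
                i += 1
        return out

    print(decode_utf8_signext(b"a\xff\xe2\x82\xac"))   # [97, -1, 8364]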

So what we really need is a set of 128 code points allocated to represent raw byte values from ill-formed UTF-8 sequences, along the lines of UTF-16 surrogates, and to be used instead of the U+FFFD replacement character. This would allow a parser to read in UTF-8 without losing data, and to leave error handling to higher layers that can make more informed decisions. One odd wrinkle is that a UTF-8 parser that allows graceful round-trip handling of binary data should not treat surrogates as parse errors, so that it can preserve ill-formed UTF-16 data, but it must treat encoded raw byte values as parse errors to avoid ambiguity. A sensible place to allocate the raw byte value code points is in plane 14, which is reserved for special use, above the area reserved for ignorable characters, i.e. U+E1000 .. U+E10FF.
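
Here is a minimal sketch of what such a decoder and encoder could look like, using Python's codec error-handler hook for the decode side. The byte-to-code-point mapping (0xNN becomes U+E1000 + 0xNN, so only the upper 128 code points of the block are used) is my assumption about how the allocation would work, and the sketch omits both the surrogate pass-through and the rule that encoded raw-byte code points are themselves parse errors:

    import codecs

    def utf8r_decode_errors(exc):
        """Keep each byte of an ill-formed sequence as a plane-14 code point."""
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return "".join(chr(0xE1000 + b) for b in bad), exc.end
        raise exc

    codecs.register_error("utf8r", utf8r_decode_errors)

    def utf8r_decode(data: bytes) -> str:
        return data.decode("utf-8", "utf8r")

    def utf8r_encode(text: str) -> bytes:
        # Turn the escape code points back into raw bytes; everything else is
        # encoded normally.  (A full implementation would also pass surrogates
        # through here, to preserve ill-formed UTF-16.)
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if 0xE1080 <= cp <= 0xE10FF:
                out.append(cp - 0xE1000)
            else:
                out += ch.encode("utf-8")
        return bytes(out)

    raw = b"caf\xe9"                      # Latin-1 'café', ill-formed as UTF-8
    text = utf8r_decode(raw)              # 'caf\U000e10e9'
    assert utf8r_encode(text) == raw      # lossless round trip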

I propose the name UTF-8-relaxed or UTF-8r for this scheme.