.@ Tony Finch – blog


I've come to the conclusion that most programming languages' idea of a string - an array or list of characters - is wrong, or at least too simplistic.

Most languages designed before Unicode have the idea that characters are bytes, which is hopelessly inadequate. Languages designed before Unicode was mature (e.g. Java) have the idea that characters are 16 bits which is still inadequate. The next reasonable step is to use a 32 bit type for characters, but that wastes at least a third of the memory you use for strings since Unicode needs at most 21 bits.

If your language's character type is too narrow then you have to use an encoding scheme (such as a Unicode Transformation Format or ISO-2022) to fit a larger repertoire into the available space. However, when you have done that your char is no longer a character. This causes a number of problems:

In fact, even in the absence of encodings you have similar problems, e.g.

Even ASCII is not simple enough to be immune from problems:

On the other hand, if your language's string type is not based on bytes then you have the problem that strings need to be transcoded from their external form into something that the libraries prefer, since UTF-8 is winning the encoding wars.

I think that the solution to this problem is to embrace it, instead of trying to hide behind an inadequate abstraction. The essence of the bug is that, while it is meaningful to talk about a unicode codepoint in isolation, a codepoint is not a character and a string is not an array of codepoints: there is a layer of encoding between the conceptual sequence of codepoints and the representation of strings. (This implies that, although D is refreshingly agnostic about the size of its characters it still falls into the array trap and is chauvinistic towards a subset of Unicode transformation formats.)

What this means is that the language should not have a character or string type, but instead a type for binary blobs like Erlang's. The programmer manipulates binary values using pattern matching (it should probably support Erlang-style matching and some flavour of regular expressions) and some reasonably efficient way of construction by concatenation (as well as its syntax for binary comprehensions, Erlang has the concept of IO lists which support scatter/gather IO). The idea is to shift from thinking of sequences of discrete characters to thinking about strings as structured mass data.

Note that (so far) this design has no built-in preference for any particular string encoding, which should help to make it flexible enough to live gracefully in environments that favour all sorts of textual data, including ones not designed yet. However you need to support string constants reasonably gracefully, which either means that your source encoding is the same as the encoding you deal with at run time (so that the blobs in the source become identical blobs in the object) or you need a good way to transcode them. Ideally transcoding should only happen once, preferably at compile time, but it's probably OK to do it during static initialization. (Note that similar considerations apply to compiling regular expressions.)

To a large extent, what I'm doing is lifting the string representation chauvinism problem from the language level to the library level, and libraries are just as good as languages at growing ugly calluses around poorly-designed features - though you usually have to do less reimplementation when switching libraries. I'm also aiming to encourage a style of string manipulation more like perl's than C's. To succeed properly, though, it'll have to go further with encouraging good use of library support for international text - but at that point I'm stepping on to thin ice.