Comprehending endianness

I've recently been reading about Erlang's bit syntax, which is a pretty nice way of serializing and deserializing binary data. The Erlang documentation includes some programming examples as well as the reference description. Although it is quite simple, it gets a lot of bang per buck by using essentially the same syntax to parse binaries using pattern matching as it does to construct them. It can also be implemented quite efficiently.

More recently the Erlang implementers have been working on a syntax for binary comprehensions which makes it even easier to implement really fiddly binary data manipulation code incredibly concisely and efficiently. It's one of the greatest inventions in programming language syntax in the last ten years.

However all is not sweetness and light, and the problem is that Erlang has incomplete (or broken) support for differing endianness.

Endianness affects the behaviour of two basic operations: splitting a large number into a sequence of smaller numbers; and joining a sequence of small numbers to form a larger number. (I'm being deliberately vague talking about "numbers" because they can be any size, smaller than bytes as well as larger.) These split and join operations are the fundamental building blocks of higher-level serialization and deserializaton operations.

The simplest view of endianness is that it is a choice of byte order, where the larger numbers are words comprising a whole number of bytes, which are the smaller numbers. Serialization is splitting and deserialization is joining, because the programmer wants to deal with words but they must be represented as bytes. Erlang allows you to control the byte order of numbers when they are serialized into or deserialized from a binary.

C's bitfield support illustrates how endianness is more complicated than just byte order. C allows you to specify the width in bits of integer fields in a structure. The compiler packs sequences of bitfields into (typically) word-sized storage units (for efficient manipulation in the processor's registers). In this situation, serialization is joining and deserialization is splitting - the opposite of the previous paragraph! - because the programmer wants to deal with a sequence of small numbers but these must be represented as words. However, it gets more complicated because byte order affects how words are represented in memory, so there are two layers of endianness in effect: how the compiler joins bitfields into words and how the processor splits words into bytes. The compiler agrees with the processor on endianness, so that a sequence of 8-bit-wide bitfields has the same representation as a sequence of char fields.

Erlang's bit syntax deals directly with byte strings, without restricting the alignment of bitfield boundaries to allow simple word-by-word manipulation as C does. However you still get both splitting and joining when serializing (and when deserializing), since large numbers must be split into bytes and small numbers must be joined to form bytes. Small numbers that overlap byte boundaries must be both split and joined!

The bug is that Erlang allows you (when serializing) to control the byte order of splitting, but not the endianness it uses to join small bit fields to form bytes - the latter is always big endian. This means that Erlang bit syntax cannot elegantly handle bitfields in a C struct on a little-endian machine. For example, this is the structure of the program counter on a 26-bit little-endian ARM processor:

	struct {
		unsigned mode : 2;
		unsigned pc : 24;
		unsigned imask : 2;
		unsigned flags : 4;
	};

The layout in memory is as follows. I have numbered the bytes on the left-hand side and the bits along the top, according to their power-of-two values. I have also indicated how the program counter bits end up split between the four bytes.

	    7   6   5   4   3   2   1   0
	  ------------------------+-------+
	0   5         pc        0 |  mode |
	  ------------------------+-------+
	1  13            pc             6
	  ---------------------------------
	2  21               pc          14
	  +---------------+-------+--------
	3 | N   Z   C   V | I   F | 23  22
	  +---------------+-------+--------

It would be nice if you could write in Erlang:

	<< Mode:2/little, PC:24/little, IMask:2/little, Flags:4/little >>

However, what you get is completely crazy:

	    7   6   5   4   3   2   1   0
	  +-------+------------------------
	0 | mode  | 7         pc        2
	  +-------+------------------------
	1   1   0   15        pc       10
	  ---------------------------------
	2   9   8   23        pc       18
	  --------+-------+---------------+
	3  17  16 | I   F | N   Z   C   V |
	  --------+-------+---------------+

That is, the endianness flags have had no effect on the small fields: they have been joined to form bytes in big endian order. The program counter has been split into bytes in little-endian order, and then this three-byte string has been shifted in a big-endian fashion to fit into its unaligned space. As a result its bits are shuffled into an utterly baffling order.

To represent little-endian structures like this one in Erlang, you must deal with layout details manually, defeating the expressiveness of Erlang's bit syntax. In the example, the logically-contiguous PC field gets split up, and its value must be recovered with an auxiliary expression.

	<< PCa:6, Mode:2, PCb:8, PCc:8, Flags:4, IMask:2, PCd:2 >>

So, how to fix this bug?

One possibility that I considered was to have a little-endian version of the binary append operation, which causes small bitfields to be joined into bytes in little-endian order, and larger fields to be shifted and split in little-endian fashion when they are appended on non-byte boundaries. This is nice and straight-forward, but it can still lead to craziness if you mix big- and little-endian operations. Appending binaries is expressed in Erlang by the comma inside << >> brackets; if, instead of putting the endianness flag on each element, you put it on the binary as a whole, e.g. << mode:2, pc:24, imask:2, flags:4 >>/little, then it can simultaneously affect splitting and joining which helps to ensure the results are consistent not crazy, and it also reduces verbosity.

However I'm not sure this is entirely sufficient. Erlang used to require that binaries were a whole number of bytes, so sub-byte fragments could only appear as part of a larger << >> expression. However the new generalized bit syntax allows binaries to end on arbitrary bit boundaries, which opens the question of whether the sub-byte tail of a binary is represented in big- or little-endian fashion. The same problem can occur at the start of a binary if it has been created by splitting a larger binary on a non-byte boundary, and the new binary is represented by reference to the original instead of as a new copy.

Therefore I think binaries should know their endianness, which would determine: the layout of non-byte-aligned boundaries at the start and/or end of the binary; how the binary is re-aligned (shifted) so that it can be appended to a binary with different alignment; and the byte order of large fields. (Shifting left moves big-endians towards the start and little-endians towards the end.) The append operation can also check that binaries have compatible endianness so that you get a sensible error instead of crazy bit shuffling.

Binaries actually need three kinds of endianness: big, little, and indeterminate. The latter can only be used for whole-byte binaries and would (for example) apply to binaries freshly read from the outside world. It seems reasonable to allow whole-byte binaries to have their endianness coerced. Appending binaries would cause an error if they are not a whole number of bytes and they are not the same endianness; if they are a whole number of bytes and of differing endianness then the result has indeterminate endianness; otherwise the result has the same endianness as the arguments.

So, to conclude: Erlang's bit syntax is fantastic but not quite finished. The run-time system needs to annotate binary objects with their endianness, and endianness flags should be attached to whole bit syntax expressions instead of each element.