What is the right way to understand endianness?
In my view endianness makes much more sense when viewed as a matter of bit order: these notes explore how byte order and byte swapping are a consequence of bitwise endianness combined with bytewise addressing and data transmission. But byte swapping isn’t the only flavour of endianness: bitwise endianness is equally comfortable with oddities like PDP-11 word swapping.
I also explain why big-endian is just as mathematically elegant as little-endian: they are perfectly dual to each other, so one can’t be more elegant than the other.
There’s a sketch of an abstract API for endian-aware bit sequences, designed with algebra in mind (associativity, anamorphisms and catamorphisms).
edited to add: this article has a sequel with improved symmetry
 Endianness is fundamentally bitwise
 Bit numbering
 Duality
 Mixed endian nonsense
 Practical mixed endianness
 Summary
 Elsewhere
 What else?
Endianness is fundamentally bitwise
Endianness is usually understood and described as a matter of byte order, though Danny Cohen’s essay On Holy Wars and a Plea For Peace (which introduced the terms “big-endian” and “little-endian”) described it in terms of transmission order, from bitstreams upwards. But, as Cohen observed, computers and network gear now deal in bytes, so the transmission order of bits is invisible to programmers, and it makes sense to think of endianness as byte order.
Cohen also judges implementations in terms of how consistently they follow a particular endianness, which makes it sound like a matter of aesthetic preference.
I think there’s a little more to it than that.
to and from bits
The fundamental operations are to serialize an unsigned integer value into an ordered sequence of bits, and the inverse, to deserialize a sequence of bits into a value.
Bit sequences can be concatenated to make longer sequences, or split into multiple shorter sequences.
big endian
For example, the value
0x1234 == 4660
serializes as (reading left-to-right)
0001 0010 0011 0100
If we break it in half and deserialize it we get (reading line by line downwards)
0001 0010 == 0x12 == 18
0011 0100 == 0x34 == 52
So big-endian byte order comes from serializing to bits (big end first) then deserializing to bytes (big end first). In practice we always implement the composition of these functions as a single fused operation, such as htonl(), which converts a 32-bit value of unspecified endianness to a sequence of big-endian bytes.
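A minimal sketch of such a fused operation in C (the name put_be32 is my own, not a standard API; htonl() itself only reorders within a word):

```c
#include <stdint.h>

// Store a 32-bit value as four big-endian bytes: serialize the value
// big end first, then deserialize the bit sequence into bytes, fused
// into a single step.
void put_be32(uint8_t out[4], uint32_t v) {
    out[0] = v >> 24;   // big end first
    out[1] = v >> 16;
    out[2] = v >> 8;
    out[3] = v;
}
```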
little endian
It’s harder to represent little-endian in text. Either you have to read right-to-left, or I have to ask you to do mental gymnastics to reverse numbers in your head. Like Cohen, I’ll write numbers with the little end on the right as usual, and justified to the right as a reminder of which end goes first.
Our example becomes
4660 == 0x1234
and serializes as (reading right-to-left)
0001 0010 0011 0100
This time when we break it in half and deserialize, we get (reading line by line downwards)
52 == 0x34 == 0011 0100
18 == 0x12 == 0001 0010
As before we are composing serialize (from 16 bits) and deserialize (to bytes), but this time with little-endian bit order, and the result is little-endian byte order.
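A sketch of the fused little-endian operation in C (put_le32 is likewise a hypothetical name):

```c
#include <stdint.h>

// Store a 32-bit value as four little-endian bytes: serialize the
// value little end first, then deserialize the bit sequence into
// bytes, fused into a single step.
void put_le32(uint8_t out[4], uint32_t v) {
    out[0] = v;         // little end first
    out[1] = v >> 8;
    out[2] = v >> 16;
    out[3] = v >> 24;
}
```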
byte order, word order
So, when a system has a consistent endianness, we can understand this as the result of choosing either the big-endian conversions to and from bits, or the little-endian conversions to and from bits. A consistent bit, byte, and word ordering comes from composing the two inverse conversion functions at various widths.
Bit numbering
It is usually said (in Cohen’s essay and elsewhere) that the little-endian order is more mathematically elegant, because when bits are numbered from zero, the deserialized value of bit i is 2**i (where ** is exponentiation).
On the other hand, the big-endian order corresponds better to the way we store strings, and to lexicographic sorting.
I learned binary, bits, bytes, and words on little-endian computers, and I was convinced (as only teenagers can be) that little-endian was the Right Thing and that big-endian ordering didn’t make sense. But there is a mathematically elegant way to view the big-endian order too.
fractions and strings
Let’s imagine we have a set of values, which we are making progressively longer.

If our values are little-endian numbers, when we add bits they get appended on the big end, on the left. This can drastically alter the values of our numbers, and how they sort relative to each other.

If our values are decimal fractions, when we add digits it only makes a small difference to their values, and a small difference to how they sort.

If our values are strings, adding characters can only make small refinements to their lexicographic order.

If our values are big-endian numbers, the set behaves like strings or fractions: making the values longer only makes small changes to the way they sort.
big-endian elegance
This analogy suggests that we should treat big-endian numbers like binary fractions, so the value of bit i is 2**-i.
We usually want to deserialize to an integer, not a fraction, so if we have N bits we will need to multiply by 2**N. (This seems like an extra thing that isn’t necessary for little-endian numbers, or is it…?)
But wait, there’s an off-by-one error! Either we need to count bits starting with i = 1, or we need to multiply by 2**(N-1).
There is a good reason, in this situation, to count from 1, because that gives us a binary fraction between 0 and 1. The effect is a virtual binary point just before the left end of the number, at the start of the bit string.
example
If we have a bit string
0001 0010 0011 0100
We treat it as a binary fraction
0b0.0001_0010_0011_0100
We multiply by 2**16 to get
0b0001_0010_0011_0100.0 == 0x1234 == 4660
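This arithmetic can be checked directly; here is a small C sketch (be16_via_fraction is my own name), which is exact because both steps scale by powers of two:

```c
#include <stdint.h>

// Treat a 16-bit big-endian bit string as a binary fraction between
// 0 and 1 (the binary point sits before the first bit), then multiply
// by 2**16 to recover the integer value.
uint32_t be16_via_fraction(uint16_t bits16) {
    double frac = bits16 / 65536.0;     // 0x1234 -> 0b0.0001_0010_0011_0100
    return (uint32_t)(frac * 65536.0);  // scale up by 2**16
}
```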
Duality
With the right framing, we can see that the big-endian order and the little-endian order are perfectly dual to each other.
                     big-endian          little-endian
first bit            most significant    least significant
transmission order   left-to-right       right-to-left
count bits from      1                   0
bit value            2**-i               2**+i
binary point         to the left         to the right
convert to integer   mul 2**N            mod 2**N
In the expressions, i is a bit number, and N is the word size.
Note that in both cases the binary point is just before the first bit in transmission order.
When converting to an integer, imagine we are deserializing an N-bit word from the start of a longer bit sequence.
In the big-endian case, we multiply by 2**N and take the integer part as the value of our word; the fraction part is the rest of the bit sequence.
In the little-endian case, we divide by 2**N and take the remainder as the value of our word; the quotient is the rest of the bit sequence.
In both the big-endian and little-endian cases, after multiplying or dividing, the rest of the bit sequence retains the same interpretation of bit numbers, bit values, and position of the binary point.
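Held as scaled integers, the two conversions in the table can be sketched like this in C (the function names are mine; a 24-bit sequence stands in for a longer bit stream):

```c
#include <stdint.h>

// Big-endian: the sequence is a fraction with the binary point on the
// left, scaled here by 2**24. Multiplying by 2**8 makes the first
// 8-bit word the integer part; the fraction part is the rest.
uint32_t be_take8(uint32_t seq24, uint32_t *rest) {
    *rest = seq24 & 0xFFFF;   // fraction part: the remaining 16 bits
    return seq24 >> 16;       // integer part: the first 8-bit word
}

// Little-endian: the sequence is an integer with the binary point on
// the right. Dividing by 2**8 makes the first word the remainder; the
// quotient is the rest.
uint32_t le_take8(uint32_t seq24, uint32_t *rest) {
    *rest = seq24 >> 8;       // quotient: the remaining 16 bits
    return seq24 & 0xFF;      // remainder: the first 8-bit word
}
```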
Mixed endian nonsense
What happens if we mix bit sequences of differing endianness? Nothing good! Here are some examples of the nonsense results and complete brokenness we get if we are careless.
bitwise
Let’s take two 12-bit numbers and serialize one big-endian and one little-endian.
0001 0010 0011 == 0x123 == 291
291 == 0x123 == 0001 0010 0011
When we concatenate them, keeping the bit sequences in their specified orders, we read the big-endian one left-to-right and the little-endian one right-to-left. (This is just an artefact of the way I am writing the numbers; if the sequences were represented as lists of bits, we would be traversing them both the same way.) The result is:
0001 0010 0011 1100 0100 1000
We can deserialize it big-endian-wise into a 24-bit number and we get:
0x123C48
Half the number has had its bits reversed, which is almost never going to be what anyone wanted.
bytewise
In practice, software isn’t going to represent a bit sequence using a list; we’ll use something more compact and efficient like a byte vector. We’ll have to pay careful attention when its length is not a multiple of 8, though!
In our example, the representation of our big-endian sequence will look like:
0x12
0x3?
The second byte has been filled from the left, big-endian style. The ? marks the 4 spare bits.
Before we concatenate a second 12-bit value, it needs to be shifted right by the length of the first sequence (mod 8):
0x?1
0x23
Now the used and unused bits mesh neatly so they can be combined into the expected sequence, 0x123123:
0x12
0x3?
++
0x?1
0x23
==
0x12
0x31
0x23
The representation of a little-endian sequence fills bits from the right, and concatenation requires a left shift:
0x23
0x?1
++
0x3?
0x12
==
0x23
0x31
0x12
If we try to concatenate sequences of different endianness, then the used and spare bits clash instead of meshing. It doesn’t work at all:
0x12 0x23
0x3? 0x?1
++ ++
0x3? 0x?1
0x12 0x23
== ==
???? ????
wordwise
When we use a word to represent short-enough bit sequences, it’s quite easy to concatenate them. For a big-endian sequence:
typedef struct bits { unsigned len, val; } bits;
bits bebits_cat(bits a, bits b) {
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val >> a.len };
}
And for a little-endian sequence it is:
bits lebits_cat(bits a, bits b) {
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val << a.len };
}
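To check the concatenation functions against the bytewise example earlier, here is a self-contained usage sketch. It assumes unsigned is 32 bits wide, big-endian sequences hold their bits left-justified in val, little-endian ones right-justified, and lengths stay below 32:

```c
typedef struct bits { unsigned len, val; } bits;

// Big-endian sequences: bits left-justified, new bits shifted in
// below the bits already present.
bits bebits_cat(bits a, bits b) {
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val >> a.len };
}

// Little-endian sequences: bits right-justified, new bits shifted in
// above the bits already present.
bits lebits_cat(bits a, bits b) {
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val << a.len };
}
```

Concatenating the 12-bit value 0x123 with itself gives the expected 0x123123 both ways: in the top 24 bits of val for big-endian, and in the bottom 24 for little-endian.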
If we use these concatenate functions with arguments of mismatched endianness, once again we get complete nonsense.
For instance, when we call bebits_cat() with a little-endian second argument, the right shift will throw away part of the sequence, because it is held at the least significant end of b.val.
muchfool
This is a long-winded way of saying that little-endian and big-endian bit sequences need to be treated as distinct types that cannot be mixed, at least not in a naive way.
Unfortunately, in the real world there are systems with inconsistent endianness. How can we support them if we can’t combine little-endian and big-endian bit sequences? It’s tempting to make the concatenation functions more complicated so that they work a bit better (see my comments about Erlang below for an example) but we can do better.
Practical mixed endianness
In the previous section we tried to combine big-endian and little-endian values without any explicit conversion, and it was a disaster. So what is a good way to convert between big-endian and little-endian?
Usually we talk about byte swapping, but my contention is that bytewise endianness isn’t fundamental; it arises from an underlying bitwise endianness. And there are real-world cases that are not handled by byte swapping: for example, Cohen talks about floating point numbers on the DEC PDP-11 and VAX, which are stored as a sequence of little-endian 16-bit words in big-endian order.
This gives us a clue that endianness conversion needs to depend on a bit length somehow: 8-bit bytes for the usual byte swapping, or 16-bit words for DEC floating point.
little to big
For example, let’s use the 32-bit number
0x12345678
We’ll serialize it to a little-endian bit sequence (which I have written with the first bit on the right, as before)
0001 0010 0011 0100 0101 0110 0111 1000
To convert this little-endian bit sequence to big-endian, we deserialize it (reading right-to-left) to make a sequence of byte values:
0111 1000 == 0x78
0101 0110 == 0x56
0011 0100 == 0x34
0001 0010 == 0x12
And reserialize it to a big-endian bit sequence, reading the bytes top-to-bottom, left-to-right:
0111 1000 0101 0110 0011 0100 0001 0010
If we then deserialize it back to a 32-bit number (reading big-endian bits left-to-right), we get:
0x78563412
We have byte-swapped it, without using any byte-specific operations. If we had changed the intermediate word size parameter from 8 to 16, we would have converted via a sequence of 16-bit words, resulting in a PDP-11 word-swapped number instead of a byte-swapped one.
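A sketch of this parameterized conversion in C (chunk_swap32 is a hypothetical name; it requires that k divide 32 and be less than 32):

```c
#include <stdint.h>

// Convert endianness by reversing the order of the k-bit chunks in a
// 32-bit word: k = 8 gives ordinary byte swapping, and k = 16 gives
// PDP-11-style word swapping.
uint32_t chunk_swap32(uint32_t v, unsigned k) {
    uint32_t out = 0;
    for (unsigned i = 0; i < 32; i += k)
        out = (out << k) | ((v >> i) & ((1u << k) - 1));
    return out;
}
```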
efficiency
It seems like a lot of faff to do byte swapping by composing four operations. But look a bit closer.
Say we are on a little-endian computer; the first operation (serializing a 32-bit number) is a no-op, and the second operation (deserializing to bytes) is also a no-op.
If our internal representation of bit sequences is bytewise, then the third operation (serializing a byte sequence) is also a no-op. The final operation, deserializing a big-endian bit sequence to a 32-bit value, is where the byte swapping happens.
Alternatively, if our internal representation is wordwise, the third operation (serializing a byte sequence) requires byte swapping, and the final operation (deserializing to a 32-bit value) is a no-op or maybe just a shift.
I have not tried implementing byte swapping by composing four functions like this, so I don’t know how feasible it is for a compiler to actually eliminate the no-ops…
Summary
Here’s a sketch of what this might look like as an API.
We need two types, bebits for big-endian bit sequences, and lebits for little-endian bit sequences. They both implement the same bits abstract data type, and behave the same way, except that the differing endianness becomes visible when you serialize and deserialize values of differing widths.
Each type has an algebraic identity value: there are constants for an empty bebits and an empty lebits.
serialize
An unsigned number can be appended to a bits type. (To convert a number by itself, append it to the relevant empty bits value.) Most languages don’t have arbitrary-width numbers (uint19_t or whatever) so there probably needs to be a separate parameter saying how many bits to extract from the number.
Signed numbers need to be converted to unsigned before being serialized, using two’s complement or whatever other format is required.
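A hedged C sketch of serialize for big-endian sequences, reusing the word representation from earlier (bebits_append is my own name; it assumes unsigned is 32 bits, 0 < n < 32, and the total length stays below 32):

```c
typedef struct bits { unsigned len, val; } bits;

// Append the low n bits of num to a big-endian sequence whose bits are
// held left-justified in val: mask off n bits, left-justify them, then
// shift them down past the bits already in the sequence.
bits bebits_append(bits s, unsigned num, unsigned n) {
    unsigned masked = num & ((1u << n) - 1);
    return (bits){ .len = s.len + n,
                   .val = s.val | (masked << (32 - n)) >> s.len };
}
```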
deserialize
An unsigned number of some size (specified with a parameter or implied by the return type) can be extracted from a bits type, yielding the number represented by the prefix of the bit sequence, and the rest of the bit sequence. The bit sequence must be at least as long as the size of the resulting number.
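Dually, a hedged sketch of deserialize for big-endian sequences (bebits_take is my own name; same representation as before, with 0 < n < 32 and n no larger than the sequence length):

```c
typedef struct bits { unsigned len, val; } bits;

// Take an n-bit number off the front of a big-endian sequence held
// left-justified in val: the prefix is the top n bits, and the rest
// of the sequence moves up to become the new prefix.
unsigned bebits_take(bits *s, unsigned n) {
    unsigned num = s->val >> (32 - n);  // the first n bits, big end first
    s->val <<= n;                       // the rest of the sequence
    s->len -= n;
    return num;
}
```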
concatenate and split
Two bits values of the same type can be concatenated. The left operand is always first and the right operand second (there is no endian order dependence), so you get the same results if you concatenate bits then deserialize to bytes, or if you deserialize to bytes then concatenate byte strings.
You can also split a bit sequence into two subsequences, so you can deserialize the parts separately.
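Split can be sketched in the same style (bebits_split is my own name; same left-justified representation, with 0 < n < 32 and n no larger than the sequence length):

```c
typedef struct bits { unsigned len, val; } bits;

// Split a big-endian sequence into its first n bits and the remainder:
// the head keeps the top n bits in place, and the tail shifts up so
// that it starts at the front.
void bebits_split(bits s, unsigned n, bits *head, bits *tail) {
    *head = (bits){ .len = n, .val = s.val & ~(~0u >> n) };
    *tail = (bits){ .len = s.len - n, .val = s.val << n };
}
```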
length
So you can find out the number of bits in a bit sequence: how much there is left to deserialize, etc.
duality
I have described serialize as appending to an existing bits value so that it is dual to deserialize.
Haskell fans will appreciate that serializing a list of numbers to bits is just folding serialize over the list; similarly, deserializing bits to a list of numbers is unfolding deserialize.
If your programming language has the right kind of polymorphism then you can use the same serialize, deserialize, concatenate, and length interfaces for both bebits and lebits, with the endianness determined by the type of the bits argument.
conversion
To convert between bebits and lebits, deserialize to a list of numbers and reserialize them again; 8-bit numbers gives you byte swapping, 16-bit numbers gives you PDP-11-style word swapping, etc.
This definition of conversion is an example of a hylomorphism. In principle this pattern should make it amenable to optimizations such as deforestation.
Elsewhere
In September 2016 I started writing these notes but I didn’t like how it was turning out, so I stuck it on the back burner. Four years later I’ve rewritten the notes so that I can clear the old draft out of my inbox. This version is better at treating bytes as nothing special, and I now have a better description of the duality between bigendian and littleendian.
Erlang
In yet another September, even more years ago, I wrote about endianness in Erlang bit strings, and complained about some crazy behaviour.
Erlang bit syntax is a good example of a wrong way to resolve mixed-endian nonsense that I hinted at earlier. Erlang bit strings are big-endian by default, and concatenation always follows big-endian behaviour.
If you append a little-endian value, it is byte-swapped, then treated as a big-endian value.
Erlang’s bit syntax is utterly brilliant; it’s a shame that its little-endian support is messed up.
Haskell
My abortive 2016 article was inspired by a nice paper about bitdata: High-level Views on Low-level Representations. It describes a type system (with type inference) for bitdata, and a neat syntax for parsing and serialization based on pattern matching algebraic data types. There are two main limitations with their work:

It only deals with word-sized data; they aren’t trying to tackle the representation of larger structures or network packets.

Because of that they can ignore questions of endianness.
Their follow-up paper, Strongly Typed Memory Areas: Programming Systems-Level Data Structures in a Functional Language, extends their ideas to cover what they didn’t (in the end) call “bytedata”. They build up larger structures from byte-aligned bitdata; members of a structure can be marked as LE or BE and they are byte-swapped as required.
What else?
I’m curious if there are other descriptions of endianness that tackle it in a similar way, starting with bit order and deriving byte order as a consequence.
Also I’m not enough of a mathematician to formalize this in a way that properly elucidates the duality between big-endian and little-endian. Endianness is such a commonplace concept (and usually so ugly that it’s desperately in need of cleaning up) that surely someone has done so already…