What is the right way to understand endianness?
In my view endianness makes much more sense when viewed as a matter of bit order: these notes explore how byte order and byte swapping are a consequence of bitwise endianness combined with bytewise addressing and data transmission. But byte swapping isn’t the only flavour of endianness: bitwise endianness is equally comfortable with oddities like PDP11 word swapping.
I also explain why big-endian is just as mathematically elegant as little-endian: they are perfectly dual to each other, so one can’t be more elegant than the other.
There’s a sketch of an abstract API for endian-aware bit sequences, designed with algebra in mind (associativity, anamorphisms and catamorphisms).
edited to add: this article has a sequel with improved symmetry
Endianness is usually understood and described as a matter of byte order, though Danny Cohen’s essay On Holy Wars and a Plea For Peace (which introduced the terms “big-endian” and “little-endian”) described it in terms of transmission order, from bit-streams upwards. But, as Cohen observed, computers and network gear now deal in bytes, so the transmission order of bits is invisible to programmers, and it makes sense to think of endianness as byte order.
Cohen also judges implementations by how consistently they follow a particular endianness, which makes it sound like a matter of aesthetic preference.
I think there’s a little more to it than that.
The fundamental operations are to serialize an unsigned integer value into an ordered sequence of bits, and the inverse, to deserialize a sequence of bits into a value.
Bit sequences can be concatenated to make longer sequences, or split into multiple shorter sequences.
For example, the value
0x1234 == 4660
serializes as (reading left-to-right)
0001 0010 0011 0100
If we break it in half and deserialize it we get (reading line by line downwards)
0001 0010 == 0x12 == 18
0011 0100 == 0x34 == 52
So big-endian byte order comes from serializing to bits (big end first) then deserializing to bytes (big end first). In practice we always implement the composition of these functions as a single fused operation, such as htonl(), which converts a 32 bit value of unspecified endianness to a sequence of big-endian bytes.
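As a concrete sketch, here is what such a fused big-endian operation might look like in C. The name store32_be is made up for illustration; it plays the role of htonl() followed by a store:

```c
#include <stdint.h>

/* Hypothetical fused operation: serialize a 32 bit value
   into a sequence of big-endian bytes, big end first. */
static void store32_be(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v >> 24); /* big end first */
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)(v >> 0);
}
```

Storing our 16 bit example value 0x1234 this way yields the bytes 0x00 0x00 0x12 0x34, with 0x12 before 0x34 as in the worked example above.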
It’s harder to represent little-endian in text. Either you have to read right-to-left, or I have to ask you to do mental gymnastics to reverse numbers in your head. Like Cohen, I’ll write numbers with the little end on the right as usual, and justified to the right as a reminder of which end goes first.
Our example becomes
4660 == 0x1234
and serializes as (reading right-to-left)
0001 0010 0011 0100
This time when we break it in half and deserialize, we get (reading line by line downwards)
52 == 0x34 == 0011 0100
18 == 0x12 == 0001 0010
As before we are composing serialize (from 16 bits) and deserialize (to bytes) but this time with little-endian bit order, and the result is little endian byte order.
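The little-endian counterpart can be sketched the same way (again the name, store32_le, is hypothetical); it emits the little end first:

```c
#include <stdint.h>

/* Hypothetical fused operation: serialize a 32 bit value
   into a sequence of little-endian bytes, little end first. */
static void store32_le(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v >> 0); /* little end first */
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}
```

Our example 0x1234 comes out as the bytes 0x34 0x12 0x00 0x00: reading in transmission order, the little end goes first.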
So, when a system has a consistent endianness, we can understand this as the result of choosing either the big-endian or the little-endian conversions to and from bits. A consistent bit, byte, and word ordering comes from composing the two inverse conversion functions at various widths.
It is usually said (in Cohen’s essay and elsewhere) that the little-endian order is more mathematically elegant, because when bits are numbered from zero, the deserialized value of bit i is 2**i (where ** is exponentiation).
On the other hand, the big-endian order corresponds better to the way we store strings, and to lexicographic sorting.
I learned binary, bits, bytes, and words, on little-endian computers, and I was convinced (as only teenagers can be) that little-endian was the Right Thing and that big-endian ordering didn’t make sense. But there is a mathematically elegant way to view the big-endian order too.
Let’s imagine we have a set of values, which we are making progressively longer.
If our values are little-endian numbers, when we add bits they get appended on the big end, on the left. This can drastically alter the values of our numbers, and how they sort relative to each other.
If our values are decimal fractions, when we add digits it only makes a small difference to their values, and a small difference to how they sort.
If our values are strings, adding characters can only make small refinements to their lexicographic order.
If our values are big-endian numbers, the set behaves like strings or fractions: making the values longer only makes small changes to the way they sort.
This analogy suggests that we should treat big-endian numbers like binary fractions, so the value of bit i is 2**-i.
We usually want to deserialize to an integer, not a fraction, so if we have N bits we will need to multiply by 2**N. (This seems like an extra thing that isn’t necessary for little-endian numbers - or is it…?)
But wait, there’s an off-by-one error! Either we need to count bits starting with i = 1 or we need to multiply by 2**(N-1).
There is a good reason, in this situation, to count from 1, because that gives us a binary fraction between 0 and 1. The effect is a virtual binary point just before the left end of the number, at the start of the bit string.
If we have a bit string
0001 0010 0011 0100
We treat it as a binary fraction
0b0.0001_0010_0011_0100
We multiply by 2**16
to get
0b0001_0010_0011_0100.0 == 0x1234 == 4660
With the right framing, we can see that the big-endian order and the little-endian order are perfectly dual to each other.
                        big-endian          little-endian
    first bit           most significant    least significant
    transmission order  left-to-right       right-to-left
    count bits from     1                   0
    bit value           2**-i               2**+i
    binary point        to the left         to the right
    convert to integer  mul 2**N            mod 2**N
In the expressions, i is a bit number, and N is the word size.
Note that in both cases the binary point is just before the first bit in transmission order.
When converting to an integer, imagine we are deserializing an N bit word from the start of a longer bit sequence.
In the big-endian case, we multiply by 2**N and take the integer part as the value of our word; the fraction part is the rest of the bit sequence.
In the little-endian case we divide by 2**N and take the remainder as the value of our word; the quotient is the rest of the bit sequence.
In both big-endian and little-endian cases, after multiplying or dividing, the rest of the bit sequence retains the same interpretation of bit numbers, bit values, and position of the binary point.
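Using a single word as a toy representation of a bit sequence, the two deserializations can be sketched in C. The seq type and the take functions are hypothetical names; bits are justified to the least significant end of val, so the big-endian "multiply and take the integer part" becomes a right shift, and the little-endian "divide and take the remainder" becomes a mask:

```c
#include <stdint.h>

/* A bit sequence of up to 32 bits: val holds len bits,
   justified to the least significant end. */
typedef struct { unsigned len; uint32_t val; } seq;

/* Big-endian: the first bits are the most significant, so
   "multiply by 2**8 and take the integer part" is a right shift;
   the fraction part (the low bits) is the rest of the sequence. */
static uint32_t be_take8(seq *s) {
    uint32_t word = s->val >> (s->len - 8);
    s->val &= (1u << (s->len - 8)) - 1;
    s->len -= 8;
    return word;
}

/* Little-endian: the first bits are the least significant, so
   "divide by 2**8 and take the remainder" is a mask;
   the quotient (the high bits) is the rest of the sequence. */
static uint32_t le_take8(seq *s) {
    uint32_t word = s->val & 0xFF;
    s->val >>= 8;
    s->len -= 8;
    return word;
}
```

Taking a byte from the 16 bit sequence for 0x1234 yields 0x12 big-endianwise and 0x34 little-endianwise, with the other byte left as the rest of the sequence, matching the worked examples earlier.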
What happens if we mix bit sequences of differing endianness? Nothing good! Here are some examples of the nonsense results and complete brokenness we get if we are careless.
Let’s take two 12 bit numbers and serialize one big-endian and one little-endian.
0001 0010 0011 == 0x123 == 291
291 == 0x123 == 0001 0010 0011
When we concatenate them, keeping the bit sequences in their specified orders, we read the big-endian one left-to-right and the little-endian one right-to-left. (This is just an artefact of the way I am writing the numbers; if the sequences were represented as lists of bits, we would be traversing them both the same way.) The result is:
0001 0010 0011 1100 0100 1000
We can deserialize it big-endianwise into a 24-bit number and we get:
0x123C48
Half the number has had its bits reversed, which is almost never going to be what anyone wanted.
In practice, software isn’t going to represent a bit sequence using a list; we’ll use something more compact and efficient like a byte vector. We’ll have to pay careful attention when its length is not a multiple of 8, though!
In our example, the representation of our big-endian sequence will look like:
0x12
0x3?
The second byte has been filled from the left, big-endian style. The ? marks the 4 spare bits.
Before we concatenate a second 12 bit value, it needs to be shifted right by the length of the first sequence (mod 8):
0x?1
0x23
Now the used and unused bits mesh neatly so they can be combined into the expected sequence, 0x123123:
0x12
0x3?
++
0x?1
0x23
==
0x12
0x31
0x23
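This big-endian byte-vector concatenation can be sketched in C. The bevec type, its fixed capacity, and the function name are all invented for illustration; bits fill each byte from the most significant end, and the spare bits are assumed to be zero:

```c
#include <stdint.h>

/* A big-endian bit sequence in a small fixed-size byte vector:
   len bits, filled from the left (most significant bit first). */
typedef struct { unsigned len; uint8_t byte[8]; } bevec;

/* Append b to a: b's bytes must be shifted right by a.len mod 8
   so the used and spare bits mesh. No bounds checks. */
static void bevec_cat(bevec *a, const bevec *b) {
    unsigned shift = a->len % 8;
    for (unsigned i = 0; i < b->len; i += 8) {
        uint8_t byte = b->byte[i / 8];
        a->byte[(a->len + i) / 8] |= byte >> shift;
        if (shift)
            a->byte[(a->len + i) / 8 + 1] |= (uint8_t)(byte << (8 - shift));
    }
    a->len += b->len;
}
```

Concatenating two copies of the 12 bit sequence 0x12 0x3? produces 0x12 0x31 0x23, the expected 0x123123.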
The representation of a little-endian sequence fills bits from the right, and concatenation requires a left shift:
0x23
0x?1
++
0x3?
0x12
==
0x23
0x31
0x12
If we try to concatenate sequences of different endianness, then the used and spare bits clash instead of meshing. It doesn’t work at all:
0x12 0x23
0x3? 0x?1
++ ++
0x3? 0x?1
0x12 0x23
== ==
???? ????
When we use a word to represent short-enough bit sequences, it’s quite easy to concatenate them. For a big-endian sequence:
typedef struct bits { unsigned len, val; } bits;

bits bebits_cat(bits a, bits b) {
    /* big-endian bits live at the most significant end of val,
       so the second sequence shifts in from the right */
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val >> a.len };
}
And for a little-endian sequence it is:
bits lebits_cat(bits a, bits b) {
    /* little-endian bits live at the least significant end of val,
       so the second sequence shifts in from the left */
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val << a.len };
}
If we use these concatenate functions with arguments of mismatched endianness, once again we get complete nonsense.
For instance, when we call bebits_cat() with a little-endian second argument, the right shift will throw away part of the sequence because it is held at the least significant end of b.val.
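To make the mismatch concrete, here is a small self-contained demo (assuming a 32 bit unsigned, with the big-endian representation keeping its bits at the most significant end of val, as the shift directions imply):

```c
typedef struct bits { unsigned len, val; } bits;

/* big-endian: bits left-justified in val */
static bits bebits_cat(bits a, bits b) {
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val >> a.len };
}

/* little-endian: bits right-justified in val */
static bits lebits_cat(bits a, bits b) {
    return (bits){ .len = a.len + b.len,
                   .val = a.val | b.val << a.len };
}
```

The 12 bit value 0x123 is represented as 0x12300000 as big-endian bits and 0x00000123 as little-endian bits. Matched concatenation gives 0x12312300 and 0x00123123 respectively, both representing 0x123123; but calling bebits_cat() with a little-endian second argument shifts its bits straight off the end, yielding 0x12300000 - half the sequence has vanished.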
This is a long-winded way of saying that little-endian and big-endian bit sequences need to be treated as distinct types that cannot be mixed - at least, not in a naive way.
Unfortunately, in the real world there are systems with inconsistent endianness. How can we support them if we can’t combine little-endian and big-endian bit sequences? It’s tempting to make the concatenation functions more complicated so that they work a bit better (see my comments about Erlang below for an example) but we can do better.
In the previous section we tried to combine big-endian and little-endian values without any explicit conversion, and it was a disaster. So what is a good way to convert between big-endian and little-endian?
Usually we talk about byte-swapping, but my contention is that byte-wise endianness isn’t fundamental; it arises from an underlying bit-wise endianness. And there are real-world cases that are not handled by byte-swapping: for example, Cohen talks about floating point numbers on the DEC PDP11 and VAX, which are stored as a sequence of little-endian 16 bit words in big-endian order.
This gives us a clue that endianness conversion needs to depend on a bit length somehow: 8-bit bytes for the usual byte-swapping, or 16-bit words for DEC floating point.
For example, let’s use the 32 bit number
0x12345678
We’ll serialize it to a little-endian bit sequence (which I have written with the first bit on the right, as before)
0001 0010 0011 0100 0101 0110 0111 1000
To convert this little-endian bit sequence to big-endian, we deserialize it (reading right-to-left) to make a sequence of byte values:
0111 1000 == 0x78
0101 0110 == 0x56
0011 0100 == 0x34
0001 0010 == 0x12
And reserialize it to a big-endian bit sequence, reading the bytes top-to-bottom, left-to-right:
0111 1000 0101 0110 0011 0100 0001 0010
If we then deserialize it back to a 32 bit number (reading big-endian bits left-to-right), we get:
0x78563412
We have byte-swapped it, without using any byte-specific operations. If we had changed the intermediate word size parameter from 8 to 16, we would have converted via a sequence of 16 bit words, resulting in a PDP11 word-swapped number instead of byte-swapped.
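The whole four-step conversion can be sketched in C with the intermediate word size as a parameter. The function name is invented; width must divide 32 and be less than 32 (so 8 or 16 here), since shifting by the full word size is undefined in C:

```c
#include <stdint.h>

/* Convert a value from little-endian to big-endian bit order via an
   intermediate sequence of width-bit words. width must divide 32 and
   be less than 32: width == 8 byte-swaps, width == 16 word-swaps. */
static uint32_t le_to_be32(uint32_t v, unsigned width) {
    uint32_t out = 0;
    for (unsigned i = 0; i < 32; i += width) {
        /* deserialize little-endian: take the low width bits */
        uint32_t word = (v >> i) & ((1u << width) - 1);
        /* reserialize big-endian: earlier words are more significant */
        out = (out << width) | word;
    }
    return out;
}
```

With width 8 this turns 0x12345678 into the byte-swapped 0x78563412; with width 16 it produces the PDP11-style word-swapped 0x56781234.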
It seems like a lot of faff to do byte swapping by composing four operations. But look a bit closer.
Say we are on a little-endian computer; the first operation (serializing a 32 bit number) is a no-op; the second operation (deserializing to bytes) is also a no-op.
If our internal representation of bit sequences is byte-wise, then the third operation (serializing a byte sequence) is also a no-op. The final operation, deserializing a big-endian bit sequence to a 32-bit value, is where the byte swapping happens.
Alternatively, if our internal representation is word-wise, the third operation (serializing a byte sequence) requires byte swapping, and the final operation (deserializing to a 32-bit value) is a no-op or maybe just a shift.
I have not tried implementing byte swapping by composing four functions like this, so I don’t know how feasible it is for a compiler to actually eliminate the no-ops…
Here’s a sketch of what this might look like as an API.
We need two types, bebits for big-endian bit sequences, and lebits for little-endian bit sequences. They both implement the same bits abstract data type, and behave the same way, except that the differing endianness becomes visible when you serialize and deserialize values of differing widths.
Each type has an algebraic identity value: there are constants for an empty bebits and an empty lebits.
An unsigned number can be appended to a bits type. (To convert a number by itself, append it to the relevant empty bits value.) Most languages don’t have arbitrary-width numbers (uint19_t or whatever) so there probably needs to be a separate parameter saying how many bits to extract from the number.
Signed numbers need to be converted to unsigned before being serialized, using two’s complement or whatever other format is required.
An unsigned number of some size (specified with a parameter or implied by the return type) can be extracted from a bits type, yielding the number represented by the prefix of the bit sequence, and the rest of the bit sequence. The bit sequence must be at least as long as the size of the resulting number.
Two bits values of the same type can be concatenated. The left operand is always first and the right operand second (there is no endian order dependence), so you get the same results if you concatenate bits then deserialize to bytes, or if you deserialize to bytes then concatenate byte strings.
You can also split a bit sequence into two subsequences, so you can deserialize the parts separately.
You can also find out the number of bits in a bit sequence - how much there is left to deserialize, etc.
I have described serialize as appending to an existing bits value so that it is dual to deserialize.
Haskell fans will appreciate that serializing a list of numbers to bits is just folding serialize over the list; similarly, deserializing bits to a list of numbers is unfolding deserialize.
If your programming language has the right kind of polymorphism then you can use the same serialize, deserialize, concatenate, and length interfaces for both bebits and lebits, with the endianness determined by the type of the bits argument.
To convert between bebits and lebits, deserialize to a list of numbers and reserialize them again; 8 bit numbers gives you byte swapping, 16 bit numbers gives you PDP11-style word swapping, etc.
This definition of conversion is an example of a hylomorphism. In principle this pattern should make it amenable to optimizations such as deforestation.
In September 2016 I started writing these notes but I didn’t like how it was turning out, so I stuck it on the back burner. Four years later I’ve rewritten the notes so that I can clear the old draft out of my inbox. This version is better at treating bytes as nothing special, and I now have a better description of the duality between big-endian and little-endian.
In yet another September even more years ago, I wrote about endianness in Erlang bit strings, and complained about some crazy behaviour.
Erlang bit syntax is a good example of a wrong way to resolve the mixed-endian nonsense I hinted at earlier. Erlang bit strings are big-endian by default, and concatenation always follows big-endian behaviour.
If you append a little-endian value, it is byte-swapped, then treated as a big-endian value.
Erlang’s bit syntax is utterly brilliant; it’s a shame that its little-endian support is messed up.
My abortive 2016 article was inspired by a nice paper about bitdata: High-level Views on Low-level Representations. It describes a type system (with type inference) for bitdata, and a neat syntax for parsing and serialization based on pattern matching algebraic data types. There are two main limitations with their work:
It only deals with word-sized data; they aren’t trying to tackle the representation of larger structures or network packets.
Because of that they can ignore questions of endianness.
Their follow-up paper, Strongly Typed Memory Areas: Programming Systems-Level Data Structures in a Functional Language extends their ideas to cover what they didn’t (in the end) call “bytedata”. They build up larger structures from byte-aligned bitdata; members of a structure can be marked as LE or BE and they are byte-swapped as required.
I’m curious if there are other descriptions of endianness that tackle it in a similar way, starting with bit order and deriving byte order as a consequence.
Also I’m not enough of a mathematician to formalize this in a way that properly elucidates the duality between big-endian and little-endian. Endianness is such a commonplace concept (and usually so ugly that it’s desperately in need of cleaning up) that surely someone has done so already…