The genesis of my DNS-trie
I'm sitting in my room surrounded by un-assembled bits of IKEA furniture and un-opened boxes containing various goodies to help organize my clutter. I have been distracted by an idea that would not let go. It's not often I get literally compelled by creativity!
I started my radix tree experiments at the start of 2013, but at that time my ideas were not novel or elegant. It was not until September 2015 that my qp-trie design was solid enough to be worth the effort to turn into code.
Unfortunately I can't remember exactly where the ideas came from - I can remember thinking about it on-and-off while walking to work or in the shower. I was probably getting ideas from what I was reading, but there was a lot of subconscious slow stewing before the thoughts resurfaced in a useful form, not necessarily with citations attached. There are a few obvious sources but I fear I may have forgotten some interesting tangential ones.
Weirdly, my DNS-trie idea took about the same amount of time to stew: I first wrote down some thoughts nearly three years ago. To be honest, the idea didn't really improve for most of that time, and it wasn't until spring this year that it started to kick off. But again, a long slow process of returning to the idea every so often while walking or washing, trying to think of improvements.
In a trie, a radix tree, the key is treated as a string of digits: binary digits for a crit-bit tree or patricia trie; hexadecimal or base-32 for a qp-trie. Each node in the tree corresponds to a digit; the bigger the digits, the shorter the key, the shallower the tree, and the fewer memory references it takes to traverse.
The downside is that nodes get exponentially more unwieldy as digits get bigger. At least, that's true if digits/nodes are strictly uniform, as they are in a qp-trie.
Oh, but DNS hostnames have quite a small standard alphabet - letters, digits, hyphen, and a few oddities - so if we prioritize them we can maybe get an effectively base-256 trie for the common case, without using any more space than a base-32 qp-trie.
The foundation of a qp-trie is the index word. It contains a bitmap indicating which child nodes are present (16 bits for a hexadecimal trie, for example), and the offset inside the key of the digit we use to look up the right child node. The space/size sweet spot is a 64 bit index word.
For a DNS-trie, the bitmap needs 39 bits for the common characters: case-insensitive letters, digits, hyphen, underscore, and the domain name label separator that is usually written as a dot. In the common case there's (at most) one tree node per character in the key. For rare characters, the DNS-trie falls back to (at most) two nodes per character, like the original qp-trie.
The original hexadecimal qp-trie split bytes into 4+4 bits, so its bitmaps used 2^4 = 16 bits. But 39 bits (for the common characters) plus 16 bits (for the top half of rare characters) is 55 bits, which doesn't leave enough space in a 64 bit index word for the key offset and node type flags. So the DNS-trie uses a 3+5 bit split: upper nodes have a bitmap with 39 + 2^3 = 47 bits for the common characters and the upper part of rare characters; and lower nodes have a bitmap of 2^5 = 32 bits for the lower part of rare characters.
After using 47 bits for the bitmap, our 64 bit index word has space for the node type flags (2 bits, to encode upper / lower / leaf) and a copy-on-write flag, leaving 14 bits for the offset, which is more than enough for a domain name.
One of the crucial reasons that a qp-trie is interesting for the DNS is that it stores keys in order (unlike a hash table). But the ordering of keys in a qp-trie is strictly lexicographical on the byte values of the keys from left to right, which is not the same as the canonical DNS name order.
A domain name is a list of labels (the dot-separated parts) in little-endian order, which is the opposite of what is usual for strings. Each label can have up to 63 characters, big-endian as usual for strings. The maximum total length is 255 bytes, which means there can be up to 127 labels.
[ Aside: Paul Mockapetris has explained that the reason domain names
are little-endian like
dns.cam.ac.uk, rather than big-endian like
uk.ac.cam.dns, is to make it easier to support autocompletion of
unqualified local names. For example, I can
ssh to bare
which is automatically expanded to
auth0.dns.cam.ac.uk by my DNS
A qp-trie needs random access into a key, using the offset inside the index word of each node. To support random access while also respecting the canonical DNS name order, it can be helpful to treat a domain name as an array of strings. This turns the name into a two dimensional structure instead of a flat sequence of labels.
A random-access offset into this 2D structure needs 13 bits: up to 7 bits for the label number, and up to 6 bits for the character within a label. This onlyjust_ fits into the 14 bits we have available in the index word!
So, in this setup of wide bitmaps and 2D offsets, the process for looking up a domain name in a DNS-trie is basically:
Parse the domain name into an array of labels.
This is very cheap when the domain name is in wire format. The array of labels can be just 128 bytes, indicating the position of each label in the name.
Walk down the tree; at each node check its bitmap with code like:
node->bmp & 1 << bmpbit[labels[node->label][node->chr]]
In this pseudocode,
node->chr are the 7-bit and
6-bit parts of the offset in the index word,
labels is the 2D parsed
form of the domain name, and
bmpbit translates a byte in the name to
a bit in the bitmap.
Now this involves three array lookups, in the labels array, in the bmpbit array, and in the name itself. These are all small arrays, 256 bytes or less, and they should all be in fast cache, but it's still quite a lot of work. In a qp-trie, the corresponding code just pulls a byte or two from the key and does some simple bit manipulation.
One of the big advantages of a qp-trie compared to other tree structures is that at the same time this part of the code is working to pull a radix 16 or 32 digit out of the key, the CPU is concurrently fetching the child node from memory. (Most other trees are not able to overlap their work like this, so they alternate between waiting for the CPU and waiting for memory.) But there's only so much work the CPU can do before it takes longer than the child node prefetch.
I wasn't confident that my DNS-trie idea would be fast enough or cute enough to be worth the effort to implement, so I wrote down some notes, and left it to stew in the back of my mind.
In Knot DNS the qp-trie code is used for a number of non-DNS purposes. I wondered if it might make sense to use the DNS-trie layout for keys other than domain names.
There are two differences between domain names and other string keys:
domain names are case-insensitive, and they have a short maximum
length. In a DNS-trie, case sensitivity or insensitivity is handled
bmpbit lookup table that translates bytes in the key to bits
in the index word bitmap.
Key length is more tricky: 14 bits is kind of a short limit for a general-purpose string lookup table. (Especially since a qp-trie should be pretty efficient at looking up long keys.)
My first new idea was to observe that my lower nodes (with the 32-bit bitmap) were nearly the same shape as nodes in a 5-bit qp-trie. What if I use a DNS-trie layout for the first 16 KiB of a key, and use a 5-bit qp-trie layout for the rest of the key up to its maximum 32 MiB key size? At the cost of even more complexity in the inner tree walk loop...
But if the DNS-trie and non-DNS-trie are that similar, it would also be nice if we could actually use the same code for both. So the second idea (not a new idea, but an old idea in a new context) was to convert domain names into a lexically-sortable strings, that could be used more directly as trie lookup keys.
With this setup, the process for looking up a domain name would be basically:
Stringify the domain name by reversing the order of the labels. We also need to squash upper-case letters to lower case, and escape a few characters to deal with the weird DNS awkwardness that null bytes are allowed inside domain names and sort after short labels.
Walk down the tree; at each node check its bitmap with code like:
node->bmp & 1 << bmpbit[key[node->offset]]
This stringified tree walk is significantly less complicated than the 2D tree walk. Instead we do a little more work when preparing the domain name.
However, there is a gotcha. After walking down the tree, we have to verify that our lookup key matches the key that we found in the leaf node. It is quite natural for a string-indexed tree to have its own copies of its keys, but if we are stringifying domain names we are likely to end up with two copies of each name in memory, in wire format and in stringified format. This is a terrible waste! A DNS server contains a lot of names! We might be able to save this memory overhead by tweaking the trie code to compare wire-format domain names instead of strings, but then we are heading back towards two trie implementations.
I wrote down these new ideas and, again, left them to stew in the back of my mind.
I was revisiting these ideas in the shower last week when I had a brainwave.
I've already observed that by stringifying the domain name we have moved work out of the tree traversal loop and into the name preparation loop. The third new idea was, what if we move more work in the same direction?
Prepare the domain name by reversing the order of the labels, and convert each character into one or two bit numbers, depending on whether it is a common or rare character.
Walk down the tree; at each node check its bitmap something like:
node->bmp & 1 << key[node->offset]
But wait, something magical has happened! The name preparation code has not become any more complicated! In the stringifying version, there were escaping and table lookups (for case conversion), and in the bit-number version there is basically the same amount of escaping and table lookups! Fantastic!
There is a downside, though, which is that we are heading back to the 2017 setup with separate trie implementations for domain names and for general strings. But that's OK since the domain name option is looking a lot sweeter!
I wrote down this new brainwave, but it refused to be put on the back burner to stew. I could not stop thinking about it! I had to code it up to see how well it works!
In the process I discovered some more new ideas for improvement:
New idea 4: During name preparation, assign bit numbers so that they can be tested directly against the index word, so that there is no need to extract the bitmap first.
New idea 5: In previous qp-trie implementations I put the bitmap at the top of the index word and the offset in the middle, between the flag bits and the bitmap. But if I swap the positions of the bitmap and offset, so the offset is at the top of the word, it can be extracted with just a shift, with no masking needed.
Complications were evaporating all over the place. It was so nice!
There was another lingering pain point that I had not previously solved. The DNS-trie setup for rare characters divides byte values into 8 blocks of 32 characters. The block of characters between 32 (space) and 63 (query) includes hyphen (45) and the digits (48-57). The lexicographic ordering was not directly represented in the bitmap: if we wanted to scan the DNS-trie in order, we needed to look at the grandchildren of block 32-63 for characters 32-44, then look at the direct child for 45 (hyphen) then back to the grandchildren for a couple of characters, then direct children for the digits, then grandchildren again. Very awkward.
What made this possible is that an escaped domain name needs less than 512 bytes, so we only need 9 bits for the offset, compared to a 13 bit offset for the 2017 DNS-trie concept. The new DNS-trie uses 50 bits for its bitmap, compared to 47 before.
Other simplifications appeared as free gifts:
Before, we needed three kinds of node: upper branches, lower branches (for rare characters) and leaves. But now the distinction between upper and lower branches is taken care of by name preparation, so now we only need one kind of branch.
One of the most tricky parts of the qp-trie code happens when inserting a new key: it needs to work out the bitwise alignment of the point where an existing key and the new key differ. But in a DNS-trie, name preparation creates keys where all the radix 50-ish digits are byte aligned.
Sadly, this cornucopia of elegance doesn't easily transfer its gifts to arbitrary string keys. The DNS-trie sweetness is unlocked by the knowledge that we can cheaply make a more spaced-out copy of the key. This is true for domain names, which are than 256 bytes long, and expand to keys less than 512 bytes long. So the 5-bit qp-trie is still the winner for general-purpose keys.
The proof of the code is in the running. How would a DNS-trie compare to my other qp-trie variants?
My qp-trie test and benchmarking harnesses are not set up to work with wire format domain names, which the DNS-trie is designed for. So I slightly bodged the DNS-trie implementation to work with string keys, which allows more direct comparisons. Fortunately my test data sets do not use long keys. The bodge involves relatively small adjustments to name preparation and name comparison.
In all these benchmarks, smaller numbers are better.
I have five test data sets:
b9: a list of words extracted from the BIND 9 source code (51k entries);
cam: a list of domain names under cam.ac.uk (400k entries);
rcam: same as dns, but each line is backwards;
usdw: English words from /usr/share/dict/words (235k entries);
top1m: domain names from an Alexa Top 1 Million list (1 million entries).
I'm comparing three trie implementations:
qp: 4-bit qp-trie
fn: 5-bit qp-trie, newer more portable style
dns: my new DNS-trie
These measurements were made on my 9-year-old 2.5GHz Mac Mini.
In a qp-trie, each entry has 2 words for the key and value pointers, plus some overhead for internal tree nodes, which I count as the average number of internal words per entry.
b9 cam rcam usdw top1m ---------------------------------------- qp 1.22 0.57 1.41 1.44 1.23 fn 1.09 0.74 1.31 1.31 1.12 dns 0.88 0.36 1.16 1.05 0.83
My benchmark code can also count the average depth of a tree, which provides a hint about the cost of a lookup.
b9 cam rcam usdw top1m ---------------------------------------- qp 13.09 19.12 22.04 12.46 12.40 fn 11.21 18.48 19.74 10.34 10.35 dns 8.28 12.28 14.58 7.14 6.75
Times for 1 million key lookups in milliseconds on a single thread, including benchmark harness overhead (mainly the random number generator). We're always searching for keys that are known to be present. (Simple lookups for a missing keys in a qp-trie are usually faster because they can bail out early, whereas in a hash table it usually takes longer to work out a key is missing.)
b9 cam rcam usdw top1m ----------------------------------- qp 272 611 707 509 770 fn 271 598 642 463 708 dns 249 549 619 419 582
Times for 1 million sets and deletes, in milliseconds on a single thread. As well as the benchmark harness overhead, this test gets realloc() working when nodes are resized.
b9 cam rcam usdw top1m ----------------------------------- qp 311 683 784 584 880 fn 299 701 725 534 815 dns 293 662 726 494 741