.@ Tony Finch – blog


I’m sitting in my room surrounded by un-assembled bits of IKEA furniture and un-opened boxes containing various goodies to help organize my clutter. I have been distracted by an idea that would not let go. It’s not often I get literally compelled by creativity!

This is the story of my splendid new DNS-trie code, with benchmarks at the end.

long gestation

I started my radix tree experiments at the start of 2013, but at that time my ideas were not novel or elegant. It was not until September 2015 that my qp-trie design was solid enough to be worth the effort to turn into code.

Unfortunately I can’t remember exactly where the ideas came from - I can remember thinking about it on-and-off while walking to work or in the shower. I was probably getting ideas from what I was reading, but there was a lot of subconscious slow stewing before the thoughts resurfaced in a useful form, not necessarily with citations attached. There are a few obvious sources but I fear I may have forgotten some interesting tangential ones.

Weirdly, my DNS-trie idea took about the same amount of time to stew: I first wrote down some thoughts nearly three years ago. To be honest, the idea didn’t really improve for most of that time, and it wasn’t until spring this year that it started to kick off. But again, a long slow process of returning to the idea every so often while walking or washing, trying to think of improvements.

wider nodes for common characters

In a trie, a radix tree, the key is treated as a string of digits: binary digits for a crit-bit tree or patricia trie; hexadecimal or base-32 for a qp-trie. Each node in the tree corresponds to a digit; the bigger the digits, the shorter the key, the shallower the tree, and the fewer memory references it takes to traverse.

The downside is that nodes get exponentially more unwieldy as digits get bigger. At least, that’s true if digits/nodes are strictly uniform, as they are in a qp-trie.

Oh, but DNS hostnames have quite a small standard alphabet - letters, digits, hyphen, and a few oddities - so if we prioritize them we can maybe get an effectively base-256 trie for the common case, without using any more space than a base-32 qp-trie.

The foundation of a qp-trie is the index word. It contains a bitmap indicating which child nodes are present (16 bits for a hexadecimal trie, for example), and the offset inside the key of the digit we use to look up the right child node. The space/size sweet spot is a 64 bit index word.

bit budget

For a DNS-trie, the bitmap needs 39 bits for the common characters: case-insensitive letters, digits, hyphen, underscore, and the domain name label separator that is usually written as a dot. In the common case there’s (at most) one tree node per character in the key. For rare characters, the DNS-trie falls back to (at most) two nodes per character, like the original qp-trie.

The original hexadecimal qp-trie split bytes into 4+4 bits, so its bitmaps used 2^4 = 16 bits. But 39 bits (for the common characters) plus 16 bits (for the top half of rare characters) is 55 bits, which doesn’t leave enough space in a 64 bit index word for the key offset and node type flags. So the DNS-trie uses a 3+5 bit split: upper nodes have a bitmap with 39 + 2^3 = 47 bits for the common characters and the upper part of rare characters; and lower nodes have a bitmap of 2^5 = 32 bits for the lower part of rare characters.

After using 47 bits for the bitmap, our 64 bit index word has space for the node type flags (2 bits, to encode upper / lower / leaf) and a copy-on-write flag, leaving 14 bits for the offset, which is more than enough for a domain name.

domain names

One of the crucial reasons that a qp-trie is interesting for the DNS is that it stores keys in order (unlike a hash table). But the ordering of keys in a qp-trie is strictly lexicographical on the byte values of the keys from left to right, which is not the same as the canonical DNS name order.

A domain name is a list of labels (the dot-separated parts) in little-endian order, which is the opposite of what is usual for strings. Each label can have up to 63 characters, big-endian as usual for strings. The maximum total length is 255 bytes, which means there can be up to 127 labels.

[ Aside: Paul Mockapetris has explained that the reason domain names are little-endian like dns.cam.ac.uk, rather than big-endian like uk.ac.cam.dns, is to make it easier to support autocompletion of unqualified local names. For example, I can ssh to bare auth0 which is automatically expanded to auth0.dns.cam.ac.uk by my DNS resolver. ]

A qp-trie needs random access into a key, using the offset inside the index word of each node. To support random access while also respecting the canonical DNS name order, it can be helpful to treat a domain name as an array of strings. This turns the name into a two dimensional structure instead of a flat sequence of labels.

A random-access offset into this 2D structure needs 13 bits: up to 7 bits for the label number, and up to 6 bits for the character within a label. This only_just fits into the 14 bits we have available in the index word!

DNS-trie lookups (2017)

So, in this setup of wide bitmaps and 2D offsets, the process for looking up a domain name in a DNS-trie is basically:

In this pseudocode, node->label and node->chr are the 7-bit and 6-bit parts of the offset in the index word, labels is the 2D parsed form of the domain name, and bmpbit translates a byte in the name to a bit in the bitmap.

Now this involves three array lookups, in the labels array, in the bmpbit array, and in the name itself. These are all small arrays, 256 bytes or less, and they should all be in fast cache, but it’s still quite a lot of work. In a qp-trie, the corresponding code just pulls a byte or two from the key and does some simple bit manipulation.

One of the big advantages of a qp-trie compared to other tree structures is that at the same time this part of the code is working to pull a radix 16 or 32 digit out of the key, the CPU is concurrently fetching the child node from memory. (Most other trees are not able to overlap their work like this, so they alternate between waiting for the CPU and waiting for memory.) But there’s only so much work the CPU can do before it takes longer than the child node prefetch.

I wasn’t confident that my DNS-trie idea would be fast enough or cute enough to be worth the effort to implement, so I wrote down some notes, and left it to stew in the back of my mind.

a non-DNS DNS-trie (May 2020)

In Knot DNS the qp-trie code is used for a number of non-DNS purposes. I wondered if it might make sense to use the DNS-trie layout for keys other than domain names.

There are two differences between domain names and other string keys: domain names are case-insensitive, and they have a short maximum length. In a DNS-trie, case sensitivity or insensitivity is handled by the bmpbit lookup table that translates bytes in the key to bits in the index word bitmap.

Key length is more tricky: 14 bits is kind of a short limit for a general-purpose string lookup table. (Especially since a qp-trie should be pretty efficient at looking up long keys.)

My first new idea was to observe that my lower nodes (with the 32-bit bitmap) were nearly the same shape as nodes in a 5-bit qp-trie. What if I use a DNS-trie layout for the first 16 KiB of a key, and use a 5-bit qp-trie layout for the rest of the key up to its maximum 32 MiB key size? At the cost of even more complexity in the inner tree walk loop…

But if the DNS-trie and non-DNS-trie are that similar, it would also be nice if we could actually use the same code for both. So the second idea (not a new idea, but an old idea in a new context) was to convert domain names into a lexically-sortable strings, that could be used more directly as trie lookup keys.

With this setup, the process for looking up a domain name would be basically:

This stringified tree walk is significantly less complicated than the 2D tree walk. Instead we do a little more work when preparing the domain name.

However, there is a gotcha. After walking down the tree, we have to verify that our lookup key matches the key that we found in the leaf node. It is quite natural for a string-indexed tree to have its own copies of its keys, but if we are stringifying domain names we are likely to end up with two copies of each name in memory, in wire format and in stringified format. This is a terrible waste! A DNS server contains a lot of names! We might be able to save this memory overhead by tweaking the trie code to compare wire-format domain names instead of strings, but then we are heading back towards two trie implementations.

I wrote down these new ideas and, again, left them to stew in the back of my mind.

DNS-trie lookups (2020)

I was revisiting these ideas in the shower last week when I had a brainwave.

I’ve already observed that by stringifying the domain name we have moved work out of the tree traversal loop and into the name preparation loop. The third new idea was, what if we move more work in the same direction?

Basically:

But wait, something magical has happened! The name preparation code has not become any more complicated! In the stringifying version, there were escaping and table lookups (for case conversion), and in the bit-number version there is basically the same amount of escaping and table lookups! Fantastic!

There is a downside, though, which is that we are heading back to the 2017 setup with separate trie implementations for domain names and for general strings. But that’s OK since the domain name option is looking a lot sweeter!

DNS-trie in code

I wrote down this new brainwave, but it refused to be put on the back burner to stew. I could not stop thinking about it! I had to code it up to see how well it works!

In the process I discovered some more new ideas for improvement:

Complications were evaporating all over the place. It was so nice!

There was another lingering pain point that I had not previously solved. The DNS-trie setup for rare characters divides byte values into 8 blocks of 32 characters. The block of characters between 32 (space) and 63 (query) includes hyphen (45) and the digits (48-57). The lexicographic ordering was not directly represented in the bitmap: if we wanted to scan the DNS-trie in order, we needed to look at the grandchildren of block 32-63 for characters 32-44, then look at the direct child for 45 (hyphen) then back to the grandchildren for a couple of characters, then direct children for the digits, then grandchildren again. Very awkward.

What made this possible is that an escaped domain name needs less than 512 bytes, so we only need 9 bits for the offset, compared to a 13 bit offset for the 2017 DNS-trie concept. The new DNS-trie uses 50 bits for its bitmap, compared to 47 before.

Other simplifications appeared as free gifts:

Beautiful!

Sadly, this cornucopia of elegance doesn’t easily transfer its gifts to arbitrary string keys. The DNS-trie sweetness is unlocked by the knowledge that we can cheaply make a more spaced-out copy of the key. This is true for domain names, which are than 256 bytes long, and expand to keys less than 512 bytes long. So the 5-bit qp-trie is still the winner for general-purpose keys.

benchmarks

The proof of the code is in the running. How would a DNS-trie compare to my other qp-trie variants?

My qp-trie test and benchmarking harnesses are not set up to work with wire format domain names, which the DNS-trie is designed for. So I slightly bodged the DNS-trie implementation to work with string keys, which allows more direct comparisons. Fortunately my test data sets do not use long keys. The bodge involves relatively small adjustments to name preparation and name comparison.

In all these benchmarks, smaller numbers are better.

I have five test data sets:

I’m comparing three trie implementations:

memory overhead

In a qp-trie, each entry has 2 words for the key and value pointers, plus some overhead for internal tree nodes, which I count as the average number of internal words per entry.

           b9     cam    rcam   usdw   top1m
    ----------------------------------------
    qp     1.22   0.57   1.41   1.44   1.23
    fn     1.09   0.74   1.31   1.31   1.12
    dns    0.88   0.36   1.16   1.05   0.83

tree depth

My benchmark code can also count the average depth of a tree, which provides a hint about the cost of a lookup.

           b9     cam    rcam   usdw   top1m
    ----------------------------------------
    qp     13.09  19.12  22.04  12.46  12.40
    fn     11.21  18.48  19.74  10.34  10.35
    dns     8.28  12.28  14.58   7.14   6.75

lookup time

Times for 1 million key lookups in milliseconds, including benchmark harness overhead (mainly the random number generator). We’re always searching for keys that are known to be present. (Simple lookups for a missing keys in a qp-trie are usually faster because they can bail out early, whereas in a hash table it usually takes longer to work out a key is missing.)

          b9    cam   rcam  usdw  top1m
    -----------------------------------
    qp    272   611   707   509   770
    fn    271   598   642   463   708
    dns   249   549   619   419   582

mutation time

Times for 1 million sets and deletes, in milliseconds. As well as the benchmark harness overhead, this test gets realloc() working when nodes are resized.

          b9    cam   rcam  usdw  top1m
    -----------------------------------
    qp    311   683   784   584   880
    fn    299   701   725   534   815
    dns   293   662   726   494   741