DNS-trie: a compelling idea

<h1>DNS-trie: a compelling idea</h1>

<p><em>The genesis of my <a href="https://dotat.at/prog/qp/">DNS-trie</a></em></p>

<p>I'm sitting in my room surrounded by un-assembled bits of IKEA
furniture and un-opened boxes containing various goodies to help
organize my clutter. I have been distracted by an idea that would not
let go. It's not often I get literally compelled by creativity!</p>

<h2>long gestation</h2>

<p>I started my radix tree experiments at the start of 2013, but at that
time my ideas were not novel or elegant. It was not until September
2015 that my qp-trie design was solid enough to be worth the effort
to turn into code.</p>

<p>Unfortunately I can't remember exactly where the ideas came from - I
can remember thinking about it on-and-off while walking to work or in
the shower. I was probably getting ideas from what I was reading, but
there was a lot of subconscious slow stewing before the thoughts
resurfaced in a useful form, not necessarily with citations attached.
There are a few obvious sources but I fear I may have forgotten some
interesting tangential ones.</p>

<p>Weirdly, my DNS-trie idea took about the same amount of time to stew:
I first wrote down some thoughts nearly three years ago. To be
honest, the idea didn't really improve for most of that time, and it
wasn't until spring this year that it started to kick off. But again,
a long slow process of returning to the idea every so often while
walking or washing, trying to think of improvements.</p>

<h2>wider nodes for common characters</h2>

<p>In a trie, a radix tree, the key is treated as a string of digits:
binary digits for a crit-bit tree or patricia trie; hexadecimal or
base-32 for a qp-trie. Each node in the tree corresponds to a digit;
the bigger the digits, the shorter the key, the shallower the tree,
and the fewer memory references it takes to traverse.</p>

<p>The downside is that nodes get exponentially more unwieldy as digits
get bigger. At least, that's true if digits/nodes are strictly
uniform, as they are in a qp-trie.</p>

<p>Oh, but DNS hostnames have quite a small standard alphabet - letters,
digits, hyphen, and a few oddities - so if we prioritize them we can
maybe get an effectively base-256 trie for the common case, without
using any more space than a base-32 qp-trie.</p>

<p>The foundation of a qp-trie is the index word. It contains a bitmap
indicating which child nodes are present (16 bits for a hexadecimal
trie, for example), and the offset inside the key of the digit we use
to look up the right child node. The space/size sweet spot is a 64 bit
index word.</p>

<h2>bit budget</h2>

<p>For a DNS-trie, the bitmap needs 39 bits for the common characters:
case-insensitive letters, digits, hyphen, underscore, and the domain
name label separator that is usually written as a dot. In the common
case there's (at most) one tree node per character in the key. For
rare characters, the DNS-trie falls back to (at most) two nodes per
character, like the original qp-trie.</p>

<p>The original hexadecimal qp-trie split bytes into 4+4 bits, so its
bitmaps used 2^4 = 16 bits. But 39 bits (for the common characters)
plus 16 bits (for the top half of rare characters) is 55 bits, which
doesn't leave enough space in a 64 bit index word for the key offset
and node type flags. So the DNS-trie uses a 3+5 bit split: upper
nodes have a bitmap with 39 + 2^3 = 47 bits for the common characters
and the upper part of rare characters; and lower nodes have a bitmap
of 2^5 = 32 bits for the lower part of rare characters.</p>

<p>After using 47 bits for the bitmap, our 64 bit index word has space
for the node type flags (2 bits, to encode upper / lower / leaf) and
a copy-on-write flag, leaving 14 bits for the offset, which is more
than enough for a domain name.</p>

<h2>domain names</h2>

<p>One of the crucial reasons that a qp-trie is interesting for the DNS
is that it stores keys in order (unlike a hash table). But the
ordering of keys in a qp-trie is strictly lexicographical on the byte
values of the keys from left to right, which is <em>not</em> the same as
<a href="https://tools.ietf.org/html/rfc4034#section-6.1">the canonical DNS name order</a>.</p>

<p>A domain name is a list of labels (the dot-separated parts) in
little-endian order, which is the opposite of what is usual for
strings. Each label can have up to 63 characters, big-endian as usual
for strings. The maximum total length is 255 bytes, which means there
can be up to 127 labels.</p>

<p>[ Aside: Paul Mockapetris has explained that the reason domain names
are little-endian like <code>dns.cam.ac.uk</code>, rather than big-endian like
<code>uk.ac.cam.dns</code>, is to make it easier to support autocompletion of
unqualified local names. For example, I can <code>ssh</code> to bare <code>auth0</code>
which is automatically expanded to <code>auth0.dns.cam.ac.uk</code> by my DNS
resolver. ]</p>

<p>A qp-trie needs random access into a key, using the offset inside the
index word of each node. To support random access while also
respecting the canonical DNS name order, it can be helpful to treat a
domain name as an array of strings. This turns the name into a two
dimensional structure instead of a flat sequence of labels.</p>

<p>A random-access offset into this 2D structure needs 13 bits: up to 7
bits for the label number, and up to 6 bits for the character within
a label. This <em>only</em>just_ fits into the 14 bits we have available in
the index word!</p>

<h2>DNS-trie lookups (2017)</h2>

<p>So, in this setup of wide bitmaps and 2D offsets, the process for
looking up a domain name in a DNS-trie is basically:</p>

<ul>
<li><p>Parse the domain name into an array of labels.</p>

<p>This is very cheap when the domain name is in wire format. The
array of labels can be just 128 bytes, indicating the position of
each label in the name.</p></li>
<li><p>Walk down the tree; at each node check its bitmap with code like:</p>

<pre><code>node-&gt;bmp &amp; 1 &lt;&lt; bmpbit[labels[node-&gt;label][node-&gt;chr]]
</code></pre></li>
</ul>

<p>In this pseudocode, <code>node-&gt;label</code> and <code>node-&gt;chr</code> are the 7-bit and
6-bit parts of the offset in the index word, <code>labels</code> is the 2D parsed
form of the domain name, and <code>bmpbit</code> translates a byte in the name to
a bit in the bitmap.</p>

<p>Now this involves three array lookups, in the labels array, in the
bmpbit array, and in the name itself. These are all small arrays, 256
bytes or less, and they should all be in fast cache, but it's still
quite a lot of work. In a qp-trie, the corresponding code just pulls a
byte or two from the key and does some simple bit manipulation.</p>

<p>One of the big advantages of a qp-trie compared to other tree
structures is that at the same time this part of the code is working
to pull a radix 16 or 32 digit out of the key, the CPU is
concurrently fetching the child node from memory. (Most other trees
are not able to overlap their work like this, so they alternate
between waiting for the CPU and waiting for memory.) But there's only
so much work the CPU can do before it takes longer than the child
node prefetch.</p>

<p>I wasn't confident that my DNS-trie idea would be fast enough or cute
enough to be worth the effort to implement, so I wrote down some
notes, and left it to stew in the back of my mind.</p>

<h2>a non-DNS DNS-trie (May 2020)</h2>

<p>In <a href="https://www.knot-dns.cz/">Knot DNS</a> the qp-trie code is used for a
number of non-DNS purposes. I wondered if it might make sense to use
the DNS-trie layout for keys other than domain names.</p>

<p>There are two differences between domain names and other string keys:
domain names are case-insensitive, and they have a short maximum
length. In a DNS-trie, case sensitivity or insensitivity is handled
by the <code>bmpbit</code> lookup table that translates bytes in the key to bits
in the index word bitmap.</p>

<p>Key length is more tricky: 14 bits is kind of a short limit for a
general-purpose string lookup table. (Especially since a qp-trie
should be pretty efficient at looking up long keys.)</p>

<p>My first new idea was to observe that my lower nodes (with the 32-bit
bitmap) were nearly the same shape as nodes in a 5-bit qp-trie. What
if I use a DNS-trie layout for the first 16 KiB of a key, and use a
5-bit qp-trie layout for the rest of the key up to its maximum 32 MiB
key size? At the cost of even more complexity in the inner tree walk
loop...</p>

<p>But if the DNS-trie and non-DNS-trie are that similar, it would also
be nice if we could actually use the same code for both. So the
second idea (not a new idea, but an old idea in a new context) was to
convert domain names into a lexically-sortable strings, that could be
used more directly as trie lookup keys.</p>

<p>With this setup, the process for looking up a domain name would be
basically:</p>

<ul>
<li><p>Stringify the domain name by reversing the order of the labels.
We also need to squash upper-case letters to lower case, and
escape a few characters to deal with the weird DNS awkwardness
that null bytes are allowed inside domain names and sort after
short labels.</p></li>
<li><p>Walk down the tree; at each node check its bitmap with code like:</p>

<pre><code>node-&gt;bmp &amp; 1 &lt;&lt; bmpbit[key[node-&gt;offset]]
</code></pre></li>
</ul>

<p>This stringified tree walk is significantly less complicated than the
2D tree walk. Instead we do a little more work when preparing the
domain name.</p>

<p>However, there is a gotcha. After walking down the tree, we have to
verify that our lookup key matches the key that we found in the leaf
node. It is quite natural for a string-indexed tree to have its own
copies of its keys, but if we are stringifying domain names we are
likely to end up with two copies of each name in memory, in wire
format and in stringified format. This is a terrible waste! A DNS
server contains a lot of names! We might be able to save this memory
overhead by tweaking the trie code to compare wire-format domain
names instead of strings, but then we are heading back towards two
trie implementations.</p>

<p>I wrote down these new ideas and, again, left them to stew in the
back of my mind.</p>

<h2>DNS-trie lookups (2020)</h2>

<p>I was revisiting these ideas in the shower last week when I had a
brainwave.</p>

<p>I've already observed that by stringifying the domain name we have
moved work out of the tree traversal loop and into the name
preparation loop. The third new idea was, what if we move <em>more</em> work
in the same direction?</p>

<p>Basically:</p>

<ul>
<li><p>Prepare the domain name by reversing the order of the labels, and
convert each character into one or two bit numbers, depending on
whether it is a common or rare character.</p></li>
<li><p>Walk down the tree; at each node check its bitmap something like:</p>

<pre><code>node-&gt;bmp &amp; 1 &lt;&lt; key[node-&gt;offset]
</code></pre></li>
</ul>

<p>But wait, something magical has happened! The name preparation code
has not become any more complicated! In the stringifying version,
there were escaping and table lookups (for case conversion), and in
the bit-number version there is basically the same amount of escaping
and table lookups! Fantastic!</p>

<p>There is a downside, though, which is that we are heading back to the
2017 setup with separate trie implementations for domain names and
for general strings. But that's OK since the domain name option is
looking a lot sweeter!</p>

<h2>DNS-trie in code</h2>

<p>I wrote down this new brainwave, but it refused to be put on the back
burner to stew. I could not stop thinking about it! I had to code it
up to see how well it works!</p>

<p>In the process I discovered some more new ideas for improvement:</p>

<ul>
<li><p>New idea 4: During name preparation, assign bit numbers so that
they can be tested directly against the index word, so that there
is no need to extract the bitmap first.</p></li>
<li><p>New idea 5: In previous qp-trie implementations I put the bitmap
at the top of the index word and the offset in the middle,
between the flag bits and the bitmap. But if I swap the positions
of the bitmap and offset, so the offset is at the top of the
word, it can be extracted with just a shift, with no masking
needed.</p></li>
</ul>

<p>Complications were evaporating all over the place. It was so nice!</p>

<p>There was another lingering pain point that I had not previously
solved. The DNS-trie setup for rare characters divides byte values
into 8 blocks of 32 characters. The block of characters between 32
(space) and 63 (query) includes hyphen (45) and the digits (48-57).
The lexicographic ordering was not directly represented in the
bitmap: if we wanted to scan the DNS-trie in order, we needed to look
at the grandchildren of block 32-63 for characters 32-44, then look
at the direct child for 45 (hyphen) then back to the grandchildren
for a couple of characters, then direct children for the digits, then
grandchildren again. Very awkward.</p>

<ul>
<li>New idea 6: Expand the bitmap a little more, so that the order of
bits in the bitmap and the order of children in the tree and the
lexicographic order of keys all correspond to each other
directly.</li>
</ul>

<p>What made this possible is that an escaped domain name needs less
than 512 bytes, so we only need 9 bits for the offset, compared to a
13 bit offset for the 2017 DNS-trie concept. The new DNS-trie uses 50
bits for its bitmap, compared to 47 before.</p>

<p>Other simplifications appeared as free gifts:</p>

<ul>
<li><p>Before, we needed three kinds of node: upper branches, lower
branches (for rare characters) and leaves. But now the
distinction between upper and lower branches is taken care of by
name preparation, so now we only need one kind of branch.</p></li>
<li><p>One of the most tricky parts of the qp-trie code happens when
inserting a new key: it needs to work out the bitwise alignment
of the point where an existing key and the new key differ. But in
a DNS-trie, name preparation creates keys where all the radix
50-ish digits are byte aligned.</p></li>
</ul>

<p>Beautiful!</p>

<p>Sadly, this cornucopia of elegance doesn't easily transfer its gifts
to arbitrary string keys. The DNS-trie sweetness is unlocked by the
knowledge that we can cheaply make a more spaced-out copy of the key.
This is true for domain names, which are than 256 bytes long, and
expand to keys less than 512 bytes long. So the 5-bit qp-trie is
still the winner for general-purpose keys.</p>

<h2>benchmarks</h2>

<p>The proof of the code is in the running. How would a DNS-trie compare
to my other qp-trie variants?</p>

<p>My qp-trie test and benchmarking harnesses are not set up to work
with wire format domain names, which the DNS-trie is designed for. So
I slightly bodged the DNS-trie implementation to work with string
keys, which allows more direct comparisons. Fortunately my test data
sets do not use long keys. The bodge involves relatively small
adjustments to name preparation and name comparison.</p>

<p>In all these benchmarks, smaller numbers are better.</p>

<p>I have five test data sets:</p>

<ul>
<li><p>b9: a list of words extracted from the BIND 9 source code (51k entries);</p></li>
<li><p>cam: a list of domain names under cam.ac.uk (400k entries);</p></li>
<li><p>rcam: same as dns, but each line is backwards;</p></li>
<li><p>usdw: English words from /usr/share/dict/words (235k entries);</p></li>
<li><p>top1m: domain names from an Alexa Top 1 Million list (1 million entries).</p></li>
</ul>

<p>I'm comparing three trie implementations:</p>

<ul>
<li><p>qp: 4-bit qp-trie</p></li>
<li><p>fn: 5-bit qp-trie, newer more portable style</p></li>
<li><p>dns: my new DNS-trie</p></li>
</ul>

<p>These measurements were made on my 9-year-old 2.5GHz Mac Mini.</p>

<h3>memory overhead</h3>

<p>In a qp-trie, each entry has 2 words for the key and value pointers,
plus some overhead for internal tree nodes, which I count as the
average number of internal words per entry.</p>

<pre><code>           b9     cam    rcam   usdw   top1m
    ----------------------------------------
    qp     1.22   0.57   1.41   1.44   1.23
    fn     1.09   0.74   1.31   1.31   1.12
    dns    0.88   0.36   1.16   1.05   0.83
</code></pre>

<h3>tree depth</h3>

<p>My benchmark code can also count the average depth of a tree, which
provides a hint about the cost of a lookup.</p>

<pre><code>           b9     cam    rcam   usdw   top1m
    ----------------------------------------
    qp     13.09  19.12  22.04  12.46  12.40
    fn     11.21  18.48  19.74  10.34  10.35
    dns     8.28  12.28  14.58   7.14   6.75
</code></pre>

<h3>lookup time</h3>

<p>Times for 1 million key lookups in milliseconds on a single thread,
including benchmark harness overhead (mainly the random number
generator). We're always searching for keys that are known to be
present. (Simple lookups for a missing keys in a qp-trie are usually
faster because they can bail out early, whereas in a hash table it
usually takes longer to work out a key is missing.)</p>

<pre><code>          b9    cam   rcam  usdw  top1m
    -----------------------------------
    qp    272   611   707   509   770
    fn    271   598   642   463   708
    dns   249   549   619   419   582
</code></pre>

<h3>mutation time</h3>

<p>Times for 1 million sets and deletes, in milliseconds on a single
thread. As well as the benchmark harness overhead, this test gets
realloc() working when nodes are resized.</p>

<pre><code>          b9    cam   rcam  usdw  top1m
    -----------------------------------
    qp    311   683   784   584   880
    fn    299   701   725   534   815
    dns   293   662   726   494   741
</code></pre>