In our previous episode, I optimized tolower()
using SWAR tricks.
That was in fact a side-quest to a side-quest to a side-quest. The first
side-quest was trying out a new DNS name compression algorithm; in BIND,
compression is performance-critical and a heavy user of tolower(), which is what
set me off on the second and third side-quests.
Now let’s pop the stack, and talk about DNS names.
This article is part of a series on optimizing DNS names:
- tolower() in bulk at speed
- a DNS name compression algorithm
- faster DNS name decompression
- BIND zone transfer performance
- slower DNS name decompression
- tolower() with AVX-512
- tolower() small string performance
names on the wire
I expect many of my readers know how DNS names are encoded, but I will explain for completeness.
In its text format, a domain name is a sequence of dot-separated “labels”. The dot separators are not used in binary DNS messages: instead, each label is preceded by a byte containing its length, and the name is terminated by a zero-length label representing the root zone.
(Vexingly, a DNS name can’t be treated as a C-style zero-terminated string, because there can be zero bytes inside labels. Sigh.)
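For illustration, here is the name www.example.com laid out as a C byte array, the way it appears (uncompressed) in a DNS message:

```c
/* "www.example.com" in uncompressed wire format: each label is preceded
 * by its length, and a zero-length root label terminates the name. */
static const unsigned char wire_name[] = {
    3, 'w', 'w', 'w',
    7, 'e', 'x', 'a', 'm', 'p', 'l', 'e',
    3, 'c', 'o', 'm',
    0, /* the root label */
};
```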
A label can be up to 63 bytes long; if the length byte is 64 (0x40) or greater, it has a special meaning. Values between 0x40 and 0xBF have no purpose except to cause painful memories for those involved in DNS extensions in the late 1990s. If the length byte is 0xC0 or greater, the length byte and the next byte form a “compression pointer”.
A DNS name compression pointer allows DNS messages to re-use parent domains. The lower 14 bits are an offset into the DNS message where the remaining suffix of the name previously occurred.
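For example, a decoder might recognize and unpack a compression pointer like this (a minimal sketch with my own function names, not BIND’s):

```c
#include <stdbool.h>
#include <stdint.h>

/* The top two bits of the length byte distinguish a pointer (11) from an
 * ordinary label length (00). */
static bool
is_compression_pointer(uint8_t len_byte) {
    return (len_byte & 0xC0) == 0xC0;
}

/* The remaining 14 bits, together with the next byte, are an offset from
 * the start of the DNS message. */
static uint16_t
pointer_offset(uint8_t hi, uint8_t lo) {
    return (uint16_t)(((hi & 0x3F) << 8) | lo);
}
```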
One of the few comments in djbdns is, “DNS should have used LZ77 instead of its own sophomoric compression algorithm.”
decompressing
My favourite slapstick bug in DNS software is due to DNS name compression. If the programmer is not careful enough, their DNS message parsing code can get into an infinite loop when trying to decode a compressed name.
Many, many DNS experts have made this mistake.
The neatest way to avoid the bug is to keep an upper limit for compression offsets, which is initialized with the offset of the start of the name. When the decoder encounters a compression pointer, it must be less than the limit or the message is malformed. If it is OK, then both the decoder’s position and the compression upper limit are set to the pointer’s target.
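Here is a sketch of that technique (my own simplified code, not BIND’s): because the limit only ever moves towards the start of the message, the decoder cannot loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the name starting at msg[name_off] into buf, following compression
 * pointers, in uncompressed form. Returns the length written, or 0 if the
 * message is malformed (including any attempt at a pointer loop). */
static size_t
decompress_name(const uint8_t *msg, size_t msg_len, size_t name_off,
                uint8_t *buf, size_t buf_len) {
    size_t pos = name_off;   /* where we are reading */
    size_t limit = name_off; /* pointers must target an offset below this */
    size_t out = 0;

    for (;;) {
        if (pos >= msg_len)
            return 0;
        uint8_t len = msg[pos];
        if (len == 0) {
            /* zero-length root label terminates the name */
            if (out >= buf_len)
                return 0;
            buf[out++] = 0;
            return out;
        } else if (len < 64) {
            /* ordinary label: copy the length byte and the label */
            if (pos + 1 + len > msg_len || out + 1 + len > buf_len)
                return 0;
            memcpy(buf + out, msg + pos, 1 + (size_t)len);
            out += 1 + (size_t)len;
            pos += 1 + (size_t)len;
        } else if (len >= 0xC0) {
            /* compression pointer: it must point strictly before the
             * current limit; then both the position and the limit move
             * to its target, so we can never revisit an offset */
            if (pos + 1 >= msg_len)
                return 0;
            size_t target = (size_t)(len & 0x3F) << 8 | msg[pos + 1];
            if (target >= limit)
                return 0;
            pos = target;
            limit = target;
        } else {
            /* 0x40..0xBF: those painful reserved label types; reject */
            return 0;
        }
    }
}
```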
compressing, in abstract
The difficulty with DNS name compression is not so much to do with prat-fall bugs - though it is fairly funny when a DNS server replaces a one-byte root label with a two-byte compression pointer (which is, in fact, the bug that set me off on these side-quests).
No, the problem with compressing DNS names is that it is difficult to do efficiently.
In outline, the most naïve algorithm is:
    for each name to be added to the message,
        for each suffix of the name,
            for each suffix already in the message,
                if the suffixes match,
                    we have found a compression offset
This is O(M*N*M*N), where M is the number of names in the message and N is the size of each name. Our aim is to get this down to something much less quadratic, near O(M*N), that can be implemented efficiently.
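To make the quadratic behaviour concrete, here is the naïve search written out as C (a sketch only, with hypothetical helper names, and assuming for simplicity that names already in the message were written uncompressed):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: do two uncompressed wire-format names, starting at
 * a and b, read the same all the way to their root labels? */
static bool
wire_suffix_equal(const uint8_t *a, size_t alen,
                  const uint8_t *b, size_t blen) {
    size_t i = 0;
    for (;;) {
        if (i >= alen || i >= blen || a[i] != b[i])
            return false;
        uint8_t len = a[i];
        if (len == 0)
            return true;
        if (len >= 64 || i + 1 + len > alen || i + 1 + len > blen)
            return false;
        if (memcmp(a + i + 1, b + i + 1, len) != 0)
            return false;
        i += 1 + len;
    }
}

/* The naive search: msg[] is the message so far, name[] is the next name
 * to add; try every suffix of the name against every offset already in
 * the message. Returns 0 for no match (a name can never start at offset
 * 0, because the 12-byte message header lives there). */
static size_t
naive_find_offset(const uint8_t *msg, size_t msg_len,
                  const uint8_t *name, size_t name_len) {
    for (size_t s = 0; s < name_len && name[s] != 0; s += 1 + name[s])
        for (size_t off = 0; off < msg_len; off++)
            if (wire_suffix_equal(msg + off, msg_len - off,
                                  name + s, name_len - s))
                return off; /* we have found a compression offset */
    return 0;
}
```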
data structures
Authoritative-only DNS servers have the advantage that they can precalculate most of the information they need to do a good job of compression. (Knot DNS and NSD have strategies along these lines.) But recursive DNS servers have to work it out on the fly.
One possibility is to use some kind of tree data structure that supports longest-match lookups. That will take care of the three inner loops (“for each suffix of the name”, “for each suffix already in the message”, “if the suffixes match”) with good asymptotic performance.
But trees need quite a lot of machinery, when all we need is something
that exists very briefly to store a few names before it is thrown
away.
Avoiding fancy data structures, the minimum we need is a map from DNS names to compression offsets. But a compression offset refers to a point in the message where the name has been written. So in fact all we need is a hash set of compression offsets. This kind of hash set can be incredibly small, just 4 bytes per slot: 2 bytes of compression offset and 2 bytes of hash.
By default, my implementation needs less than 300 bytes: it is fixed size, and it lives on the stack so it requires no heap allocation.
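As a sketch of the shape of this data structure (the field names and slot count are my own illustration, not BIND’s):

```c
#include <stdint.h>

/* One slot: the offset in the message where a suffix starts, plus a hash
 * of that suffix so that most non-matches can be rejected cheaply. */
struct compress_slot {
    uint16_t offset;
    uint16_t hash;
};

/* Fixed size, so the whole set can live on the caller's stack; with 64
 * slots (an assumed number) that is 256 bytes plus a little bookkeeping. */
struct compress_set {
    struct compress_slot slot[64];
};
```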
matching suffixes
A tiny hash set is all very well, but it only reduces “for each suffix already in the message” to O(1), and we still have quadratic nested loops (“for each suffix of the name”, “if the suffixes match”).
In terms of the implementation, there are two parts to matching a suffix of the new name against an entry in the hash set:
- calculate a hash code for the suffix;
- compare the suffix with a name in the message.
We walk the name from right to left, with successively longer suffixes. So it’s straightforward to make the suffix hash calculations linear instead of quadratic: when we add a label to the suffix, hash the new label and combine its hash with the previous hash, instead of hashing everything again.
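For instance, the incremental hashing might look like this (a sketch: the hash function and the combining step are placeholders, not what BIND uses):

```c
#include <stddef.h>
#include <stdint.h>

/* Hash one label, ignoring ASCII case, since DNS names compare
 * case-insensitively. FNV-1a is used here purely as an example. */
static uint32_t
hash_label(const uint8_t *label, size_t len) {
    uint32_t h = 0x811c9dc5;
    for (size_t i = 0; i < len; i++) {
        uint8_t c = label[i];
        if (c >= 'A' && c <= 'Z')
            c += 'a' - 'A';
        h = (h ^ c) * 0x01000193;
    }
    return h;
}

/* Walking the name right to left, extend the previous suffix's hash with
 * the newly prepended label instead of rehashing the whole suffix. */
static uint32_t
extend_suffix_hash(uint32_t suffix_hash, const uint8_t *label, size_t len) {
    return hash_label(label, len) * 31 + suffix_hash;
}
```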
Speeding up name comparisons requires a deeper insight into how the DNS message is constructed.
comparing names
When we have just added a label to our suffix, and found a possible match for it in the hash set, we know a few things:
- our previous (shorter) suffix occurred in the message (otherwise we would already have stopped)
- we know where we found the previous suffix in the message
- we ensure that every entry in the hash set is the first occurrence of a particular suffix
- we ensure that names in the message are compressed as much as possible
So the first thing we need to do is verify that the label we just added to our suffix matches the label in the message at the offset we found in the hash set.
We also need to match the rest of the suffix, so that hash collisions don’t muddle up labels from different names. And we need to do it in O(1) so that our overall performance is linear not quadratic.
There are three possibilities (see the sketch after this list):
- When the first occurrence of the new longer suffix is also the first occurrence of the previous shorter suffix, the end of the label that we just matched is also the start of the shorter suffix that we remembered. So we can just compare two offsets and we’re done.
- Otherwise, the longer suffix might occur in a compressed name. If so, it will be followed by a compression pointer that refers to the previous shorter suffix. We remembered the offset of the previous suffix, which tells us what this compression pointer looks like, so we can just check two bytes and we’re done.
- The final possibility is that the new suffix occurs in an uncompressed name. (Not all names in a DNS message can be compressed.) Sadly, in this case we have to compare the rest of the suffix in full.
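Putting those three cases together, the comparison might look something like this (a sketch with my own names and parameters, not BIND’s code; the real comparison is case-insensitive where this sketch uses plain memcmp()):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*   msg        : the message built so far
 *   match_off  : candidate offset found in the hash set
 *   suffix     : the suffix of the new name, in wire format, starting with
 *                the label we just added; suffix_len includes the root label
 *   label_len  : length of that first label, including its length byte
 *   prev_off   : where the previous, shorter suffix was found earlier */
static bool
suffix_matches(const uint8_t *msg, size_t match_off,
               const uint8_t *suffix, size_t suffix_len,
               size_t label_len, size_t prev_off) {
    /* The label we just added must match at the candidate offset. */
    if (memcmp(msg + match_off, suffix, label_len) != 0)
        return false;

    size_t rest = match_off + label_len;

    /* Case 1: the candidate is immediately followed by the first
     * occurrence of the shorter suffix, so comparing offsets is enough. */
    if (rest == prev_off)
        return true;

    /* Case 2: the candidate is part of a compressed name, so it is
     * followed by a compression pointer to the shorter suffix; we know
     * exactly what those two bytes must be. */
    if (msg[rest] == (uint8_t)(0xC0 | (prev_off >> 8)) &&
        msg[rest + 1] == (uint8_t)(prev_off & 0xFF))
        return true;

    /* Case 3: the candidate is part of an uncompressed name, so we have
     * to compare the rest of the suffix in full. */
    return memcmp(msg + match_off, suffix, suffix_len) == 0;
}
```

Only the third case needs more than a constant amount of work.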
summary
This DNS name compression algorithm is built out of compression offsets; it doesn’t just produce them as a result. We get some crucial advantages by using offsets this way:
- They are only 16 bits, so the data structure can be really small;
- We don’t need redundant copies of names, because we find the ones we care about in the message using our compression offsets;
- In many cases we can shortcut name comparisons because we know the offset in the message of the rest of the name.
code size
The commit that implements this DNS name compression algorithm in BIND deletes 623 lines of code and adds 375.
That does not include my tolower()
work, which deletes 496 lines and
adds 418.
(Not counting test and benchmark code, of which there is more!)
microbenchmarks
I have a small benchmark that compresses a list of DNS names into a series of 4KiB messages. I have been running it with a couple of data sets:
- a list of the top 50,000 domains, which are all different
- all the domain names in the root zone, which has a lot of repetition
Here are some times to run the benchmarks:
|          | top-50k | root |
|----------|---------|------|
| old code | 102ms   | 28ms |
| new code | 36ms    | 16ms |
system performance
I added benchmark jobs for my compression branch to ISC’s BIND performance lab. The numbers are rather noisy, but roughly:
|          | auth     | rec      | root     |
|----------|----------|----------|----------|
| old code | 965 kqps | 986 kqps | 459 kqps |
| new code | 992 kqps | 991 kqps | 449 kqps |
conclusion
In some tests this targeted optimization improved whole-system performance by several percentage points. I’m mildly disappointed not to be fastest across the board, but the old compression code was specifically tuned for root servers.