In our previous episode, I optimized tolower() using SWAR tricks. That was in fact a side-quest to a side-quest to a side-quest. The first side-quest was trying out a new DNS name compression algorithm; in BIND, compression is performance-critical and a heavy user of tolower(), which is what set me off on the second and third side-quests.

Now let’s pop the stack, and talk about DNS names.

names on the wire

I expect many of my readers know how DNS names are encoded, but I will explain for completeness.

In its text format, a domain name is a sequence of dot-separated “labels”. The dot separators are not used in binary DNS messages: instead, each label is preceded by a byte containing its length, and the name is terminated by a zero-length label representing the root zone.

(Vexingly, a DNS name can’t be treated as a C-style zero-terminated string, because there can be zero bytes inside labels. Sigh.)

A label can be up to 63 bytes long; if the length byte is 64 (0x40) or greater, it has a special meaning. Values between 0x40 and 0xBF have no purpose except to cause painful memories for those involved in DNS extensions in the late 1990s. If the length byte is 0xC0 or greater, the length byte and the next byte form a “compression pointer”.

A DNS name compression pointer allows DNS messages to re-use parent domains. The lower 14 bits are an offset into the DNS message where the remaining suffix of the name previously occurred.
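To make that concrete, here is roughly how two names might be laid out in a message. The byte values follow the wire format described above, but the offsets are invented for the sake of the example.

    /* "www.example.org" written in full, assumed to start at offset
     * 0x0014 in the message (the offset is made up for this example). */
    static const unsigned char first_name[] = {
        3, 'w', 'w', 'w',
        7, 'e', 'x', 'a', 'm', 'p', 'l', 'e',
        3, 'o', 'r', 'g',
        0,                  /* zero-length root label ends the name */
    };

    /* "mail.example.org" written later in the same message, compressed:
     * the "example.org" suffix is replaced by a two-byte pointer to
     * offset 0x0018, where that suffix already appears (0x0014 + 4). */
    static const unsigned char second_name[] = {
        4, 'm', 'a', 'i', 'l',
        0xC0, 0x18,         /* 0xC000 | 0x0018 */
    };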

One of the few comments in djbdns is, “DNS should have used LZ77 instead of its own sophomoric compression algorithm.”

decompressing

My favourite slapstick bug in DNS software is due to DNS name compression. If the programmer is not careful enough, their DNS message parsing code can get into an infinite loop when trying to decode a compressed name.

Many, many DNS experts have made this mistake.

The neatest way to avoid the bug is to keep an upper limit for compression offsets, which is initialized with the offset of the start of the name. When the decoder encounters a compression pointer, it must be less than the limit or the message is malformed. If it is OK, then both the decoder’s position and the compression upper limit are set to the pointer’s target.
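Here is a sketch of that scheme. It is not BIND's code, just an illustration of the decreasing limit, and it only reports the uncompressed length of the name rather than copying it anywhere.

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the uncompressed length of the name at `name_off`
     * (including the root label), or -1 if the name is malformed.
     * Compression pointers must point strictly before `limit`, which
     * starts at the name's own offset and only ever decreases, so the
     * decoder cannot loop. */
    static int
    name_length(const uint8_t *msg, size_t msglen, size_t name_off)
    {
        size_t pos = name_off;
        size_t limit = name_off;
        int len = 0;

        for (;;) {
            if (pos >= msglen)
                return -1;                      /* truncated */
            uint8_t lab = msg[pos];
            if (lab == 0) {
                return len + 1;                 /* root label */
            } else if (lab < 0x40) {            /* ordinary label */
                if (pos + 1 + lab > msglen || len + 1 + lab > 254)
                    return -1;
                len += 1 + lab;
                pos += 1 + lab;
            } else if (lab >= 0xC0) {           /* compression pointer */
                if (pos + 1 >= msglen)
                    return -1;
                size_t target = ((size_t)(lab & 0x3F) << 8) | msg[pos + 1];
                if (target >= limit)
                    return -1;                  /* would loop or point forward */
                limit = target;
                pos = target;
            } else {
                return -1;                      /* 0x40 - 0xBF, see above */
            }
        }
    }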

compressing, in abstract

The difficulty with DNS name compression is not so much to do with prat-fall bugs - though it is fairly funny when a DNS server replaces a one-byte root label with a two-byte compression pointer (which is, in fact, the bug that set me off on these side-quests).

No, the problem with compressing DNS names is that it is difficult to do efficiently.

In outline, the most naïve algorithm is:

    for each name to be added to the message,
        for each suffix of the name,
            for each suffix already in the message,
                if the suffixes match,
                    we have found a compression offset

This is O(M*N*M*N) where M is the number of names in the message and N is the size of each name. Our aim is to get this down to something much less quadratic, near O(M*N), that can be implemented efficiently.

data structures

Authoritative-only DNS servers have the advantage that they can precalculate most of the information they need to do a good job of compression. (Knot DNS and NSD have strategies along these lines.) But recursive DNS servers have to work it out on the fly.

One possibility is to use some kind of tree data structure that supports longest-match lookups. That will take care of the three inner loops (for each suffix of the name, for each suffix already in the message, if the suffixes match) with good asymptotic performance. But trees need quite a lot of machinery, when all we need is something that exists very briefly to store a few names before it is thrown away.

Avoiding fancy data structures, the minimum we need is a map from DNS names to compression offsets. But a compression offset refers to a point in the message where the name has been written. So in fact all we need is a hash set of compression offsets. This kind of hash set can be incredibly small, just 4 bytes per slot: 2 bytes of compression offset and 2 bytes of hash.

By default my implementation needs less than 300 bytes; it is fixed size and lives on the stack, so it requires no heap allocation.
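For concreteness, a slot could look something like this; the struct, its field names, and the 64-slot default are my guesses to match the sizes above, not BIND's actual definitions.

    #include <stdint.h>

    /* One slot: a 16-bit offset into the message where a name suffix
     * was written, plus a 16-bit hash of that suffix so most misses can
     * be rejected without touching the message. Offset zero can mean
     * "empty", because no name can start inside the 12-byte header. */
    typedef struct {
        uint16_t offset;
        uint16_t hash;
    } compress_slot;

    /* 64 slots x 4 bytes = 256 bytes, small enough for the stack. */
    typedef struct {
        compress_slot slot[64];
    } compress_set;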

matching suffixes

A tiny hash set is all very well, but it only reduces "for each suffix already in the message" to O(1), and we still have quadratic nested loops ("for each suffix of the name", "if the suffixes match").

In terms of the implementation, there are two parts to matching a suffix of the new name against an entry in the hash set: calculating the hash of the suffix so we can look it up, and comparing the suffix against the name in the message at the entry's offset.

We walk the name from right to left, with successively longer suffixes. So it’s straightforward to make the suffix hash calculations linear instead of quadratic: when we add a label to the suffix, hash the new label and combine its hash with the previous hash, instead of hashing everything again.
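Here is a sketch of the incremental hash, extending the hash of a suffix by one label at a time; the hash function itself is an arbitrary choice of mine, not the one BIND uses.

    #include <stddef.h>
    #include <stdint.h>

    /* Extend the hash of a suffix by one more label. Hashing is
     * case-insensitive because DNS names compare case-insensitively,
     * and the length byte is mixed in so label boundaries matter. */
    static uint16_t
    hash_label(uint16_t suffix_hash, const uint8_t *label)
    {
        uint32_t h = suffix_hash;
        uint8_t len = label[0];

        h = h * 31 + len;
        for (size_t i = 1; i <= len; i++) {
            uint8_t c = label[i];
            if (c >= 'A' && c <= 'Z')
                c += 'a' - 'A';
            h = h * 31 + c;
        }
        return (uint16_t)(h ^ (h >> 16));
    }

Each longer suffix's hash comes from the previous suffix's hash plus one call to hash_label(), so hashing every suffix of a name costs O(N) overall rather than O(N*N).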

Speeding up name comparisons requires a deeper insight into how the DNS message is constructed.

comparing names

When we have just added a label to our suffix, and found a possible match for it in the hash set, we know a few things: the hash of the suffix matches the hash stored in the entry (though that might be a collision), and the entry's offset points to somewhere earlier in the message where a name was previously written.

So the first thing we need to do is verify that the label we just added to our suffix matches the label in the message at the offset we found in the hash set.
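Checking the newly added label is a straightforward case-insensitive comparison, something like the sketch below; BIND's real code uses the tolower() tricks from the previous post rather than ctype.h.

    #include <ctype.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Compare the label we just added to our suffix with the label in
     * the message at the offset from the hash set, ignoring ASCII case. */
    static bool
    label_equal(const uint8_t *a, const uint8_t *b)
    {
        if (a[0] != b[0])
            return false;                 /* different lengths */
        for (size_t i = 1; i <= a[0]; i++) {
            if (tolower(a[i]) != tolower(b[i]))
                return false;
        }
        return true;
    }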

We also need to match the rest of the suffix, so that hash collisions don’t muddle up labels from different names. And we need to do it in O(1) so that our overall performance is linear not quadratic.

There are three possibilities:

summary

This DNS name compression algorithm is built out of compression offsets; it doesn’t just produce them as a result. We get some crucial advantages by using offsets this way:

code size

The commit that implements this DNS name compression algorithm in BIND deletes 623 lines of code and adds 375.

That does not include my tolower() work, which deletes 496 lines and adds 418.

(Not counting test and benchmark code, of which there is more!)

microbenchmarks

I have a small benchmark that compresses a list of DNS names into a series of 4KiB messages. I have been running it with a couple of data sets:

Here are some times to run the benchmarks:

                 top-50k   root
    old code       102ms   28ms
    new code        36ms   16ms

system performance

I added benchmark jobs for my compression branch to ISC’s BIND performance lab. The numbers are rather noisy, but roughly:

                   auth        rec       root
    old code   965 kqps   986 kqps   459 kqps
    new code   992 kqps   991 kqps   449 kqps

conclusion

In some tests this targeted optimization improved whole-system performance by multiple percentage points. I'm mildly disappointed not to be faster across the board, but the old compression code was specifically tuned for root servers.