In 2021, I came up with a design for a new memory layout for a qp-trie, and I implemented a prototype of the design in NLnet Labs NSD (see my git repo or GitHub).
Since I started work at ISC my main project has been to adapt the NSD prototype into a qp-trie for use in BIND. The ultimate aim is to replace BIND’s red-black tree database, its in-memory store of DNS records.
Yesterday I merged the core qp-trie implementation into BIND so it’s a good time to write some blog notes about it.
The core of the design is still close to what I sketched in 2021 and implemented in NSD, so these notes are mostly about what’s different, and the mistakes I made along the way…
A qp-trie is a kind of radix tree, derived from Dan Bernstein’s crit-bit tree (one two) and Phil Bagwell’s HAMT (one two).
A qp-trie supports lexically ordered lookups on string keys, such as longest match (e.g., to find which zone can answer a DNS query), predecessor (e.g., to find an NSEC record for a DNSSEC proof of nonexistence), etc.
The version in BIND supports lock-free reads and strictly serialized transactional writes. It has multi-version concurrency with snapshots and rollbacks.
It typically uses about 20 bytes of memory (often less) per object stored in the trie. When keys are normal DNS names, an interior branch in the trie corresponds to one character in the name. A trie containing a million items can sustain millions of lookups per second per core.
NSD was a good place to implement the prototype, because it avoids a lot of the complexity of BIND.
NSD doesn’t use its tree data structures for as many different things as BIND does, so it didn’t need much support for polymorphism;
NSD is authoritative-only, so it doesn’t have to support the kind of high-frequency fine-grain updates that a DNS cache is subjected to;
NSD has a multi-process architecture, so I could handwave the details of multithreading and pretend to myself that it was enough to get the copy-on-write logistics working.
So the goals of the project were to fill in these gaps. Simples!
In BIND, when you create a qp-trie, you give it a few methods that
tell the trie how to work with the leaf objects you are going to store
in it:
detach() for reference counting:
When an interior node is duplicated for copy-on-write, the leaves below it become shared between the old and new versions of the trie. The leaf objects hanging off the trie are much bigger than interior nodes, so we don’t want to duplicate them needlessly. Hence, we keep track of them using reference counting, like memory management in the rest of BIND.
makekey() to get the
qpkey corresponding to the leaf object:
This turned out to be unexpectedly nice. In almost all cases, a
qp-trie key will be derived from a DNS name stored somewhere in
the leaf object. But instead of just getting the name, the
makekey() method is responsible for turning the name into a key,
usually by calling
dns_qpkey_fromname(). So there’s no
memory management difficulty if the method needs to temporarily
rehydrate the name, and the key could even be derived from something
that is not a DNS name at all.
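To make the shape of this concrete, here is a minimal sketch of what such a method table might look like. All the names here (leaf_methods, myleaf, and so on) are hypothetical, and makekey() below just copies name bytes as a stand-in for the real dns_qpkey_fromname() conversion; the actual BIND types and signatures differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified sketch of the method table a qp-trie
 * user supplies; not the real BIND types. */
typedef unsigned char qpkey[512];

typedef struct leaf_methods {
	/* adjust the leaf object's reference count */
	void (*attach)(void *ctx, void *leaf);
	void (*detach)(void *ctx, void *leaf);
	/* fill in the lookup key for a leaf, returning its length */
	size_t (*makekey)(void *ctx, void *leaf, qpkey key);
} leaf_methods;

/* Example leaf: a refcounted object that owns a name. */
typedef struct myleaf {
	unsigned int refcount;
	char name[64];
} myleaf;

static void myleaf_attach(void *ctx, void *leaf) {
	(void)ctx;
	((myleaf *)leaf)->refcount++;
}

static void myleaf_detach(void *ctx, void *leaf) {
	(void)ctx;
	((myleaf *)leaf)->refcount--;
}

/* Stand-in for dns_qpkey_fromname(): here we just copy the name's
 * bytes; the real conversion maps DNS names to qp-trie keys. */
static size_t myleaf_makekey(void *ctx, void *leaf, qpkey key) {
	(void)ctx;
	myleaf *l = leaf;
	size_t len = 0;
	while (l->name[len] != '\0') {
		key[len] = (unsigned char)l->name[len];
		len++;
	}
	return len;
}

static const leaf_methods myleaf_methods = {
	myleaf_attach, myleaf_detach, myleaf_makekey,
};
```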
For cases where multithreading isn’t needed, BIND has a bare
dns_qp_t type which saves a bit of space. I’m not sure it’s going to
get much use, but I have used it in tests to deduplicate randomly
generated keys.
All multithreaded access is done in the context of a transaction. When
you start a modify transaction, you get a pointer to the internal
dns_qp_t which you can use just like the single threaded version.
For a read-only transaction, you get a pointer to a specific type
depending on the kind of transaction.
All the read-only operations on a qp-trie can take any of these types,
because the types share a common prefix. There’s a neat GNU C extension to
make this kind of polymorphism safer: you can declare a transparent
union of the compatible pointer types, and any type in
the union can be passed to the functions without casting. Unlike a
void * argument, passing other types makes the compiler report an
error.
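The trick looks roughly like the following sketch. The type names here (common_t, reader_t, readable_t, and so on) are hypothetical stand-ins, not BIND’s actual transaction types, but the attribute is the real GNU C extension.

```c
#include <assert.h>

/* Hypothetical types sharing a common prefix, standing in for the
 * qp-trie read-only transaction types described above. */
typedef struct common { int magic; } common_t;
typedef struct reader { common_t base; int extra; } reader_t;
typedef struct snapshot { common_t base; int chunks; } snapshot_t;

/* GNU C transparent union: any member type can be passed where the
 * union is expected, without a cast, but an unrelated pointer type
 * still provokes a compiler diagnostic (unlike a void * argument). */
typedef union readable {
	common_t *common;
	reader_t *reader;
	snapshot_t *snapshot;
} readable_t __attribute__((__transparent_union__));

static int get_magic(readable_t r) {
	/* all members share the common prefix, so this is safe */
	return r.common->magic;
}
```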
There are 4 kinds of transactions:
query transactions are lightweight and read-only
snapshots are heavyweight and read-only
update transactions are for heavyweight modifications
write transactions are for lightweight modifications
Query transactions are the workhorse. They are lightweight because they are completely lock-free, relying on QSBR.
For years I was under the misapprehension that two versions of each trie would be enough: one stable shared read-only one, and an exclusive mutable one. However, two is not enough for lock-free read-only access and decent write performance, because there can be many modify transactions during a QSBR “grace period”, and each one creates a new version of the trie.
I wanted to avoid having to keep track of a fresh memory allocation for each version. The solution I came up with is to allocate read-only versions as a special type of node, so they can be cleaned up by the existing machinery for interior nodes.
A read-only version consists of a reference to the root of the trie, a pointer to a table of memory chunks, and a pointer to the trie object it belongs to. The root reference is unique per version; it identifies which cell within which chunk contains the root node. The chunk table is typically shared by many versions of the trie, so it has a reference count so we know when it is no longer needed. All we need the trie object for is the method table and user context pointer; it’s also used for some safety checks.
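The description above can be sketched as a couple of structs. This is a hypothetical simplification (the packed root-reference layout, CELL_BITS, and the struct names are my own illustration, not BIND’s actual definitions):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of a read-only version, following the
 * description above; the real BIND structures differ in detail. */
typedef struct chunktable {
	unsigned int refcount; /* shared by many read-only versions */
	void **chunks;         /* pointers to allocation chunks */
} chunktable_t;

typedef struct readonly_version {
	uint32_t root;       /* packed (chunk, cell) root reference */
	chunktable_t *table; /* which chunks this version can see */
	void *trie;          /* for methods, context, safety checks */
} readonly_version_t;

/* pack and unpack a root reference: high bits = chunk, low = cell */
#define CELL_BITS 10 /* e.g. 1024 cells per chunk */

static uint32_t make_root(uint32_t chunk, uint32_t cell) {
	return (chunk << CELL_BITS) | cell;
}
static uint32_t root_chunk(uint32_t root) {
	return root >> CELL_BITS;
}
static uint32_t root_cell(uint32_t root) {
	return root & ((1u << CELL_BITS) - 1);
}

/* the chunk table is refcounted because many versions share it */
static void chunktable_detach(chunktable_t *t) {
	if (--t->refcount == 0) {
		free(t->chunks);
		free(t);
	}
}
```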
The way I integrated QSBR into BIND means that qp-trie queries can
only live as long as one
uv_loop callback. This is fine for
most DNS requests: the main exception is zone transfers, which might
need to keep a stable view of one version of a zone for several
seconds. For that they need a snapshot.
Snapshots are heavyweight, because to create one we need to briefly get the qp-trie’s mutex in order to copy some of its metadata into a new allocation.
A qp-trie snapshot has its own memory allocation, so it can live as many loops as necessary for a zone transfer to complete. Originally, I simply suppressed memory reclamation while a trie had any snapshots, until Paul Khuong pointed out that this would lead to trouble if (in DNS terms) a zone was subject to a heavy load of outgoing zone transfers (i.e. snapshots exist all the time) and concurrent UPDATEs or incoming zone transfers. To cope with that, snapshots are now aware of exactly which chunks they use, so the qp-trie knows which chunks become free when a snapshot is destroyed.
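One simple way to make a snapshot aware of exactly which chunks it uses is a per-snapshot bitmap. The sketch below is my own illustration of that idea, not BIND’s actual data structure:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: a snapshot records the chunks it uses in a
 * bitmap, so when it is destroyed the trie can tell which chunks
 * were kept alive only by this snapshot. */
#define MAX_CHUNKS 64

typedef struct snapshot_bm {
	uint64_t used; /* bit i set if chunk i is reachable from here */
} snapshot_bm;

static void snapshot_mark(snapshot_bm *s, unsigned chunk) {
	s->used |= UINT64_C(1) << chunk;
}

/* chunks that become free when this snapshot dies: those it used,
 * minus the ones still needed by other snapshots or by the current
 * version of the trie */
static uint64_t reclaimable(const snapshot_bm *dying,
			    uint64_t still_needed) {
	return dying->used & ~still_needed;
}
```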
Update transactions are named after the DNS UPDATE opcode for making dynamic updates to zones. There are a couple of ways they are heavyweight.
It is fairly common for authoritative DNS servers to have huge numbers
of zones, most of which have only a couple of names (e.g.
www.example.com). So it would be enormously
wasteful to allocate for every zone a whole chunk of 1024 nodes and
use only 0.5% of them.
So, when an update transaction commits, it compacts the trie
to use as little space as possible, and uses
realloc() to shrink the
last chunk so it is only just as big as necessary.
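The shrink-on-commit step amounts to a realloc() down to the bump-allocation point. A minimal sketch, with hypothetical chunk and node types (the real BIND code is more involved):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical sketch of shrinking the last chunk on commit: after
 * compaction only `used` of the chunk's `capacity` cells are live,
 * so realloc() the chunk down to exactly that size. */
typedef struct node { void *ptr; } node;

typedef struct chunk {
	node *cells;
	size_t used;     /* bump-allocation point */
	size_t capacity; /* e.g. 1024 at first */
} chunk;

static bool chunk_shrink(chunk *c) {
	node *smaller = realloc(c->cells, c->used * sizeof(node));
	if (smaller == NULL && c->used > 0)
		return false; /* shrinking should not fail, but be safe */
	c->cells = smaller;
	c->capacity = c->used;
	return true;
}
```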
The other thing is that update transactions can be rolled back. More than once I made the mistake of underestimating how much of the qp-trie allocator metadata needs to be restored in order to roll back correctly, which made me quite irritated with myself. In the qp-trie code that was merged this week, rollback support is both more correct and more straightforward than my previous attempts: when an update transaction is opened, it allocates a fresh copy of the trie’s metadata; when the transaction commits, the rollback copy is discarded; when the transaction is rolled back, the trie metadata is overwritten with its rollback copy.
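Because the metadata lives in one plain struct, the open/commit/rollback dance reduces to a struct copy. A sketch of the scheme, using hypothetical names and fields rather than BIND’s actual metadata:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of the rollback scheme described above: the
 * trie's metadata is a plain struct, so rollback is a struct copy. */
typedef struct trie_meta {
	uint32_t root;
	size_t used_count;
	size_t free_count;
	/* ... allocator bookkeeping ... */
} trie_meta;

typedef struct update_tx {
	trie_meta *live;     /* the trie's metadata, modified in place */
	trie_meta *rollback; /* copy taken when the transaction opened */
} update_tx;

static void update_open(update_tx *tx, trie_meta *live) {
	tx->live = live;
	tx->rollback = malloc(sizeof(*tx->rollback));
	*tx->rollback = *live; /* save a pristine copy */
}

static void update_commit(update_tx *tx) {
	free(tx->rollback); /* changes stand; discard the copy */
	tx->rollback = NULL;
}

static void update_rollback(update_tx *tx) {
	*tx->live = *tx->rollback; /* restore everything at once */
	free(tx->rollback);
	tx->rollback = NULL;
}
```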
Write transactions are lightweight because they avoid allocation (so no rollback support) and they do not eagerly compact the trie. They are intended for the DNS cache, which (compared to authoritative zones) has a high rate of small changes.
Allocating interior nodes for a qp-trie is very cheap with a chunk-based memory layout: just increment an index into the latest chunk and check that it has not run out of space. An update transaction has to allocate a fresh allocation chunk each time, because the previous one got shrunk; instead, write transactions re-use the same chunk until it fills up.
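The fast path of that allocator is just a bounds check and an increment. A sketch of the idea, with hypothetical names and a chunk size chosen for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of chunk-based bump allocation: grab the next
 * cell (or run of cells) in the current chunk, or signal that a
 * fresh chunk is needed. */
#define CHUNK_SIZE 1024

typedef struct bump_chunk {
	size_t next; /* index of the first unused cell */
} bump_chunk;

/* returns the allocated cell index, or -1 when the chunk is full */
static long chunk_alloc(bump_chunk *c, size_t cells) {
	if (c->next + cells > CHUNK_SIZE)
		return -1; /* caller must start a fresh chunk */
	long cell = (long)c->next;
	c->next += cells;
	return cell;
}
```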
Another nice aspect of the chunk-based memory layout is that the metadata is per-chunk, not per node, saving a lot of overhead. One of the bits of metadata is a flag indicating that a chunk contains nodes from previous transactions and is therefore immutable. (Modifications need to use copy-on-write, not overwrite-in-place.) But one bit is not enough when write transactions are re-using chunks!
One of my more hilariously frustrating bugs was to do with an
extra-strict debug mode for the qp-trie. Instead of immutability being just a
flag, it can be enforced in hardware using
mprotect(). This is a
very effective way to make my computer point out to me that I have got
some of the fine details of write transactions wrong.
A big disadvantage of cheap sequential memory allocation is that it usually makes it more expensive to reclaim memory. This happens in a couple of ways in BIND’s qp-trie:
compaction, which is necessary to evacuate chunks that are nearly empty, so the trie isn’t holding on to memory that is mostly unused.
scanning chunks before they are
free()ed, to decrement refcounts.
Both of these can take ages, which I am not very happy about, but both of them are designed not to happen in performance-sensitive situations. I guess we’ll see if they turn out to be slow enough to be painful.
The core qp-trie and QSBR code lives in libisc, accompanied by some design and overview documentation and a suite of tests.
I’m reasonably happy with the way the code has turned out. It has taken about twice as long as I hoped to get to this point, largely because BIND could not simply use liburcu, so I designed and wrote a QSBR implementation and redesigned the qp-trie for good performance with safe memory reclamation.
The last few weeks I have started to put this new qp-trie to work as a replacement for BIND’s zone table. Every query has to go through the zone table, so it is crucial for good performance, but it is also relatively simple. The main thing I hope to learn from this exercise is how to get all this new machinery to shut itself down gracefully.
I am looking forward to seeing some practical results!