Page-based GC for qp-trie RCU

Here are some ideas for how a special-purpose allocator might improve my qp-trie implementation:

lower memory usage
faster allocation
neater RCU support
possibly less load on the TLB
tunable fragmentation overhead

The downsides are:

complexity - it's a custom allocator and garbage collector!
it would only support transactional updates

Let's dig in...

COW and RCU

A few years ago I added support for transactional updates to the qp-trie used by Knot DNS. Knot handled DNS updates by making a complete copy of the zone, so that the old copy could continue to serve queries while the new copy was being modified. The new code made it possible to copy only the parts of the zone that were affected by the update, and reduce the overhead of handling small updates.

My COW (copy-on-write) code was designed to work with the RCU (read-copy-update) concurrency framework. RCU was developed for concurrent data structures in the Linux kernel; there is also a userland RCU library. RCU is a combination of;

COW data structures, so updates don't interfere with readers
lightweight concurrency barriers, so readers do not need to take a lock
deferred cleanup, so writers know when all readers are using the new copy of a data structure, when the old copy can be cleaned up

I used one-bit reference counts to mark the boundary between the parts of the tree (mostly near the root) that had been copied, and the shared parts (towards the leaves). So it wasn't a pure COW, because the refcount manipulation required writes to the shared parts of the tree.

memory layout

A common design for malloc() implementations (for example phkmalloc and jemalloc) is to keep allocations of different sizes separate. Each size class has its own free list, and each page can only satisfy allocations from a single size class. This can reduce the amount of searching around for free space inside malloc() and reduce the amount of fragmentation.

But in a qp-trie, nodes are often different sizes, so each step when traversing the tree will usually require a leap to a different page, which can increase pressure on the CPU's address translation lookaside buffer.

Could we, perhaps, make a qp-trie more friendly to the TLB, and maybe also the prefetcher, by being more clever about how it allocates nodes, and how they are arranged next to each other in memory?

A custom allocator seems like a lot of work for a (probably) small performance improvement, so I have not (until recently) pursued the idea.

refcounts vs tracing

Reference counts are often regarded as a poor substitute for "proper" tracing garbage collection. A tracing copying collector can give you:

cheaper allocations: just bump a pointer
amortized free: release whole pages rather than individual nodes
better locality and less fragmentation
no extra write traffic to update reference counts

To get most of these advantages, the garbage collector must be able to move objects around. What you gain in more efficient alloc and free, you pay for by copying.

However, if all updates to our data structure are RCU transactions that necessarily involve making copies, then tracing garbage collection seems like less of a stretch.

rough design

Our qp-trie allocator has a bag of pages, whose size does not need to match the hardware page size, but that kind of size should be about right.

For each page, we keep track of how much free space it has (so that we can decide when it is worth evacuating and freeing it), and a note of the RCU epoch after which it can be freed.

There's a global array of pages, containing the the address of each page. Actually, when the page table is resized, we will need to do an RCU delayed cleanup, so there can also be a secondary array which is waiting to be freed.

starting a transaction

When an update transaction is started, we obtain a fresh page where we will put new nodes and modified copies. We use a cheap bump allocator that just obtains another page when it runs out of space. Unlike many GC languages, we still manually free() nodes, to keep count of the free space in each page.

There can only be one write transaction at a time, so the writer can update the page metadata without interlocks.

finishing a transaction

After the updates have been applied to make a new version of the tree, we can do a bit of extra maintenance work before switching our readers over to the new tree. I'll discuss these in more detail below:

layout optimization: it might be worth doing some extra copying to make the tree nicer for the prefetcher;
garbage collection: identify which pages have too much free space, and evacuate and compact their contents so they can be freed.
cache eviction: if our tree is used for a cache rather than for authoritative data, the GC phase can also discard entries that are past their TTL.

Finally, the switch-over process:

swap the tree's root pointers atomically
wait for an RCU epoch so all readers are using the new tree
free everything on the delayed cleanup list

layout optimization

This is entirely optional: I don't know if it will have any useful effect. The idea is to copy tree nodes into a layout that's friendly to the CPU's prefetcher and maybe also its TLB. My best guess for how to achieve this is, starting from the root of the tree, to copy nodes in breadth-first order, until some heuristic limits are reached.

One of the tradeoffs is between better layout and extra memory usage (for more copies). A minimal option might be to only copy the few uppermost levels of the tree until they fill one page. Layout optimization across multiple pages is more complicated.

garbage collection

Here is a sketch of an algorithm for a full collection; I have not worked out how to do a useful collection that touches less data.

We recursively traverse the whole tree. The argument for the recursive function is a branch twig, i.e. a pointer to an interior node with its metadata (bitmap etc.), and the return value is either the same as the argument, or an altered version pointing to the node's new location.

The function makes a temporary copy of its node on the stack, then iterates over the twigs contained in the node. Leaf twigs are copied as is; it calls itself recursively for each branch twig.

If any of the branch twigs were changed by the recursive calls, or if the old copy of this node was in a sufficiently-empty page, the old copy is freed (which only alters its page's free space counter), the new version of the node is copied to the allocation pointer, and this recursive invocation returns the node's new location. Otherwise it returns a pointer to the old location (and the copy on the stack is discareded).

We can tune our fragmentation overhead by adjusting the threshold for sufficiently-empty pages. Note that garbage collection must also include recent allocations during the update transaction: a transaction containing multiple updates is likely to generate garbage because many qp-trie updates change the size of a node, even if we update in place when we can. So the pages used for new allocations should be treated as sufficiently-empty so that their contents are compacted before they enter heavy read-only use.

cache eviction

So far my qp-trie code has worked well for authoritative data, but I have not tried to make it work for a DNS cache. A cache needs to do a couple of extra things:

evict entries that have passed their time-to-live;
evict older entries to keep within a size limit.

Both of these can be done as part of the garbage collection tree walk.

In BIND, activities like this are performed incrementally by co-operatively scheduled tasks, rather than dedicated threads, which makes them a bit more intricate to code.

small pointers

The page table allows us to use much smaller node pointers.

Instead of using a native 64-bit pointer, we can refer to a node by the index of its page in the page table and the position of the node in its page, which together can easily fit in 32 bits. This requires a double indirection to step from one node to the next, but the page table should be in cache, and qp-trie traversal is friendly to prefetching, so we can provide hints if the processor can't prefetch automatically.

There are a couple of ways to make use of this saving.

We can reduce the size of each twig from 16 bytes to 12 bytes, making the whole tree 25% smaller. This adds some constraints on leaves: either the key and value pointers must fit in 48 bits each (which requires unwarranted chumminess with the implementation); or we can get hold of the key via the value (and waste 32 bits in the leaf).

Or if this is a one-pass DNS-trie we can use the extra space for path compression, and avoid making assumptions about pointers.

metadata placement

For each page we need to keep track of how much free space it contains, so that we know when it should be evacuated; and something to tell us if the page should be freed after the next RCU epoch.

It's fairly straightforward to put this metadata at the start of each page. At the cost of a little wasted space we can make sure this writable data doesn't share a cache line with read-only nodes.

If we are using small pointers, another option is to put per-page metadata in the page table, or perhaps in another array parallel to the page table to keep read-only and writable data separate.

transactions and caches

I normally think of a cache as having a lot of small point updates, which is unlikely to be efficient with this transaction-oriented design. But perhaps it makes sense if we split the cache into two parts.

The main cache is read-only; we use transactional updates for eviction based on TTL and cache size, and to bring in new records from the working cache. It uses ordered lookups to support RFC 8198 NXDOMAIN synthesis.

The working cache is used by the resolver to keep track of queries in progress. It can be based on fine-grained updates and locking, rather than being designed for a read-mostly workload. It might not need ordered lookups at all.

Queries that miss the main cache get handed over to the resolver, which might be able to answer them straight from the working cache, or it might add the query to a list of queries waiting for the same answer, or when there are no matches it creates a new entry in the working cache that belongs to a new resolver task.

application data

Most of what I have written above is about working with the interior branch nodes of the tree. What about the application data hanging off the leaf nodes?

During a transaction, any elements that we want to delete need to be added to a free list, so that they can be cleaned up after the next RCU epoch. When we need to modify an element, we must do so COW-style.

It's reasonable for the tree implementation to keep a free list of application elements, so any delete or set operations will automatically add the old element pointer to the list for later cleanup. On the other hand, it's probably easier for the applicaton to COW its own data.

The only callback we will need is to free application elements during the delayed cleanup after the RCU epoch has passed. (This is simpler than Knot DNS, which also has callbacks for refcount manipulation.)

conclusion

For a long time I was doubtful that a custom allocator for a qp-trie would be worth the effort. But now I think it is likely to be worth it:

The refcounting in Knot is confusing; GC seems to be a nicer way to support RCU.
Small pointers can save a significant amount of space, and are more accommodating for a one-pass radix tree version.
It will become feasible to see if layout optimization can make queries faster.

It remains to be seen if I can find the time to turn these ideas into code!