Compacting a qp-trie

My new job is working on BIND for ISC, and my main project is to replace BIND’s core red-black tree data structure with my qp-trie.

previously

In the summer of 2021 I wrote some notes on page-based GC for qp-trie RCU which I then went on to implement in my fork of NSD.

Since the start of May 2022 I have ported the NSD version of my qp-trie to BIND, with several improvements:

multi-version concurrency, instead of just two versions, one for readers and one for the writer;
the rather sketchy locking has been completed;
two flavours of write transaction: minimum space for authoritative DNS; and minimum time for recursive caches;
rollback for failed transactions.

The notes I wrote last summer turned into code very nicely: NSD proved to be a good place to try out the ideas. And more recently, I am pleased with how the code adapted to the more complicated demands of BIND.

But there’s one area that has been problematic: compaction.

previously
memory manager
version one
version two
version three
version four
version five
version N+1
postscript

memory manager

My qp-trie organizes its memory into a collection of “chunks”, each of which is something like 12 KiB or 48 KiB. (I previously called them “pages” but they aren’t the same size as hardware pages, and the authors of the Garbage Collection Handbook says “chunk” is their preferred term.)

There is very little metadata: each chunk keeps a count of how much it has allocated, and another count of how much it has freed. Unlike most garbage collectors, the qp-trie code frees memory explicitly. This helps because the GC does not have to scan a chunk to find out how fragmented it is, and we can know in advance whether it is worth the effort to compact the trie.

However, we don’t know which nodes in a chunk are in use or not, without either scanning the chunk or traversing the trie. So this is the most expensive part of the memory manager.

I have tried several compaction algorithms so far, and I am not sure I have found a good one yet…

version one

code

The first compaction algorithm I implemented was basically what I described in the notes I wrote before writing the code. Looking back at the code now, I can’t see how it would have worked, which probably explains why I tried other algorithms.

I remember one failure mode where the compactor often left behind a lot of new fragmentation (oops), so the GC would fire again soon after a compaction, and the double compaction would effectively copy the whole trie.

version two

code

Next I tried a simple semi-space garbage collector, using Cheney’s exceptionally beautiful stackless copying algorithm. This simplifed many things, because there was no longer any need for a chunk table, instead just a single allocation for the whole trie.

But it seemed a bit heavy-handed to me to copy the whole thing when fragmentation is likely to affect only part of the trie. And semi-space collectors need a lot of unused space to work efficiently.

version three

code

So I went back to the chunk table, and tried to apply the generational hypothesis. This is a rule of thumb that says most allocations are short-lived. Many garbage collectors split their memory into “generations”; the youngest “nursery” generation is optimized for fast allocation, and the expectation is that when it fills up, most of the contents will already be garbage, so it will be cheap to evacuate the live data from the nursery to the next generation.

I applied this idea to the qp-trie by guessing that most fragmentation would occur near the root of the trie. So, walk the trie recursively from the root, copying nodes compactly into a fresh chunk, and stop recursing whenever you reach a node that is in a full chunk.

This works OK for copy-on-write transactions, which must copy the path from the root to any modified leaves, but it is terrible for a single-threaded trie. All mutations only affect one node near the leaf that is being added or deleted, i.e. nodes near the root are mostly left alone.

version four

code

I thought until last week that version three was OK: I was dismayed to learn that this part of my code from last summer was weaker than I remembered. But, I had a rethink, and worked out that the generational hypothesis is false in my situation, and thought up a new algorithm.

This one takes advantage of the work I did to support transaction rollback, and more well-defined lifecycle management for the value objects hanging off the trie. As a result, BIND’s qp-trie code can scan a chunk and know which parts of it are free or in use, without walking the trie.

But there’s a big proviso: it can only compact mutable chunks. This algorithm does not work for copy-on-write transactional modifications. It was OK as a stop-gap, but I knew it needed more work.

version five

code

Version three taught me that I need a way to find and compact fragmentation near the leaves of the trie.

Version four suggested that an algorithm that aims not to increase fragmentation can work quite well in practice even if there are situations that can make it less effective.

So I had another think and worked out a “bottom-up” algorithm: walk the trie, and copy/compactify any nodes we find in fragmented chunks. We might also need gratuitous copies of nodes on the paths from the root to the nodes that have been copied, if those nodes are in shared chunks. That is, we need to obey the copy-on-write rule.

Today’s algorithm turned out to be more effective at compaction than version four, and faster. It’s also quite similar to version one, except less broken. (And I had fun writing an informal inductive proof to convince myself how it works!)

version N+1

This compaction algorithm rework comes in the middle of my efforts to write some basic testing and validation code for BIND’s qp-trie, since the testing I have done so far revealed compaction to be a problem. The next things to test are transactional writes to the trie, exercising the copy-on-write code, and then see if I can break it in multithreaded mode.

Even if version five survives this testing, I am sure I will need to do more experiments and try out other algorithms, because this aspect of the qp-trie code still uses more CPU than I am happy with. But it might be possible to shuffle the work into a corner where it doesn’t cause trouble?

postscript

I have enjoyed reading Andy Wingo’s recent blog posts about garbage collection which inspired me to write about it too.