At the end of October, I finally got my multithreaded qp-trie working! It could be built with two different concurrency control mechanisms:
- A reader/writer lock. This has poor read-side scalability, because every thread is hammering on the same shared location. But its write performance is reasonably good: concurrent readers don’t slow it down too much.

- `liburcu`, userland read-copy-update. RCU has a fast and scalable read side, nice! But on the write side I used `synchronize_rcu()`, which is blocking and rather slow, so my write performance was terrible.
OK, but I want the best of both worlds! To fix it, I needed to change the qp-trie code to use safe memory reclamation more effectively: instead of blocking inside `synchronize_rcu()` before cleaning up, use `call_rcu()` to clean up asynchronously. I expect I’ll write about the qp-trie changes another time.
Another issue is that I want the best of both worlds by default, but `liburcu` is LGPL and we don’t want BIND to depend on code whose licence demands more from our users than the MPL. So I set out to write my own safe memory reclamation support code.
lock freedom
In a multithreaded qp-trie, there can be many concurrent readers, but there can be only one writer at a time and modifications are strictly serialized. When I have got it working properly, readers are completely wait-free, unaffected by other readers, and almost unaffected by writers. Writers need to get a mutex to ensure there is only one at a time, but once the mutex is acquired, a writer is not obstructed by readers.
The way this works is that readers use an atomic load to get a pointer to the root of the current version of the trie. Readers can make multiple queries using this root pointer and the results will be consistent wrt that particular version, regardless of what changes writers might be making concurrently. Writers do not affect readers because all changes are made by copy-on-write. When a writer is ready to commit a new version of the trie, it uses an atomic store to flip the root pointer.
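As a concrete illustration, here is a minimal sketch of that protocol using `<stdatomic.h>`. The names (`struct trie`, `copy_on_write_modify()`) are hypothetical; the real qp-trie code is more involved.

```c
#include <pthread.h>
#include <stdatomic.h>

struct trie; /* one immutable version of the trie */

/* hypothetical: build a modified copy of the current version */
extern struct trie *copy_on_write_modify(struct trie *old);

static _Atomic(struct trie *) root;
static pthread_mutex_t write_mutex = PTHREAD_MUTEX_INITIALIZER;

/* readers: one atomic load, then any number of queries that are
 * all consistent with that particular version of the trie */
void reader(void)
{
	struct trie *t = atomic_load_explicit(&root, memory_order_acquire);
	/* ... multiple lookups via t ... */
	(void)t;
}

/* writers: serialized by the mutex; readers never block them */
void writer(void)
{
	pthread_mutex_lock(&write_mutex);
	/* relaxed is fine here: the mutex orders writers wrt each other */
	struct trie *old = atomic_load_explicit(&root, memory_order_relaxed);
	struct trie *new = copy_on_write_modify(old);
	/* commit: flip the root pointer to publish the new version */
	atomic_store_explicit(&root, new, memory_order_release);
	pthread_mutex_unlock(&write_mutex);
	/* "old" cannot be freed yet: see safe memory reclamation below */
}
```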
safe memory reclamation
We can’t copy-on-write indefinitely: we need to reclaim the memory used by old versions of the trie. And we must do so “safely”, i.e. without `free()`ing memory that readers are still using.

So, before `free()`ing memory, a writer must wait for a “grace period”, which is a jargon term meaning “until readers are not using the old version”. There are a bunch of algorithms for determining when a grace period is over, with varying amounts of over-approximation, CPU overhead, and memory backlog.
The RCU function `synchronize_rcu()` is slow because it blocks waiting for a grace period; the `call_rcu()` function runs a callback asynchronously after a grace period has passed. I wanted to avoid blocking my writers, so I needed to implement something like `call_rcu()`.
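To make the contrast concrete, here is a sketch of the two styles using the classic `liburcu` names (header paths and function prefixes vary between liburcu versions, and each thread must be registered with `rcu_register_thread()`); `commit()` is a hypothetical helper that publishes the new version and returns the old one.

```c
#include <stdlib.h>
#include <urcu-qsbr.h> /* QSBR flavour; link with -lurcu-qsbr */

struct version {
	struct rcu_head rcu; /* hook for call_rcu() */
	/* ... an old version of the data structure ... */
};

/* hypothetical: publish the new version, return the old one */
extern struct version *commit(struct version *new_version);

static void reclaim(struct rcu_head *head)
{
	free(caa_container_of(head, struct version, rcu));
}

void update_blocking(struct version *new_version)
{
	struct version *old = commit(new_version);
	synchronize_rcu(); /* the writer stalls for a whole grace period */
	free(old);
}

void update_async(struct version *new_version)
{
	struct version *old = commit(new_version);
	/* returns immediately; reclaim() runs after a grace period */
	call_rcu(&old->rcu, reclaim);
}
```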
aversions
When I started trying to work out how to do safe memory reclamation, it all seemed quite intimidating. But as I learned more, I found that my circumstances make it easier than it appeared at first.
The `liburcu` homepage has a long list of supported CPU architectures and operating systems. Do I have to care about those details too? No! The RCU code dates back to before the age of standardized concurrent memory models, so the RCU developers had to invent their own atomic primitives and correctness rules. Twenty-ish years later the state of the art has advanced, so I can use `<stdatomic.h>` without having to re-do it like `liburcu`.
You can also choose between several algorithms implemented by `liburcu`, involving questions about kernel support, specially reserved signals, and intrusiveness in application code. But while I was working out how to schedule asynchronous memory reclamation work, I realised that BIND is already well-suited to the fastest flavour of RCU, called “QSBR”.
QSBR
QSBR stands for “quiescent state based reclamation”. A “quiescent state” is a fancy name for a point when a thread is not accessing a lock-free data structure, and does not retain any root pointers or interior pointers.
When a thread has passed through a quiescent state, it no longer has access to older versions of the data structures. When all threads have passed through quiescent states, then nothing in the program has access to old versions. This is how QSBR detects grace periods: after a writer commits a new version, it waits for all threads to pass through quiescent states, and therefore a grace period has definitely elapsed, and so it is then safe to reclaim the old version’s memory.
QSBR is fast because readers do not need to explicitly mark the critical section surrounding the atomic load that I mentioned earlier. Threads just need to pass through a quiescent state frequently enough that there isn’t a huge build-up of unreclaimed memory.
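To illustrate the idea, here is a simple blocking sketch of grace period detection by quiescent-state counting, assuming a fixed set of `NTHREADS` worker threads that each keep marking quiescent states; this is just the basic mechanism, not BIND’s code.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NTHREADS 8 /* assumed fixed for this sketch */

/* each thread bumps its own counter whenever it is quiescent */
static _Atomic uint64_t quiescent_count[NTHREADS];

/* called by a worker at a point where it holds no pointers
 * into any version of the data structure */
void mark_quiescent(unsigned tid)
{
	atomic_fetch_add_explicit(&quiescent_count[tid], 1,
				  memory_order_release);
}

/* called by a writer after committing a new version: once every
 * thread's counter has moved past our snapshot, every thread has
 * been quiescent at least once, so a grace period has elapsed */
void wait_for_grace_period(void)
{
	uint64_t snapshot[NTHREADS];
	for (unsigned i = 0; i < NTHREADS; i++)
		snapshot[i] = atomic_load_explicit(&quiescent_count[i],
						   memory_order_acquire);
	for (unsigned i = 0; i < NTHREADS; i++)
		while (atomic_load_explicit(&quiescent_count[i],
					    memory_order_acquire) ==
		       snapshot[i])
			; /* spin; this is the blocking part to avoid */
}
```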
Inside an operating system kernel (RCU’s native environment), a context switch provides a natural quiescent state. In a userland application, you need to find a good place to call `rcu_quiescent_state()`. You could call it every time you have finished using a root pointer, but marking a quiescent state is not completely free, so there are probably more efficient ways.
`libuv`
BIND is multithreaded, and (basically) each thread runs an event loop.
Recent versions of BIND use `libuv` for the event loops.

A lot of things started falling into place when I realised that the `libuv` event loop gives BIND a natural quiescent state: when the event callbacks have finished running, and `libuv` is about to call `select()` or `poll()` or whatever, we can mark a quiescent state. We can require that event-handling functions do not stash root pointers in the heap, but only use them via local variables, so we know that old versions are inaccessible after the callback returns.
My design marks a quiescent state once per loop, so on a busy server where each loop has lots to do, the cost of marking a quiescent state is amortized across several I/O events.
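For example, one plausible way to hook this in (a sketch, not BIND’s actual code) is a `libuv` prepare handle, which runs once per loop iteration just before the loop blocks to poll for I/O; `qsbr_quiescent_state()` is a hypothetical name for the function that marks the quiescent state.

```c
#include <uv.h>

extern void qsbr_quiescent_state(void); /* hypothetical */

/* prepare callbacks run once per loop iteration, right before
 * libuv blocks to poll for I/O */
static void quiescent_cb(uv_prepare_t *handle)
{
	(void)handle;
	qsbr_quiescent_state(); /* this thread holds no root pointers */
}

void install_qsbr_hook(uv_loop_t *loop)
{
	static uv_prepare_t prepare;
	uv_prepare_init(loop, &prepare);
	uv_prepare_start(&prepare, quiescent_cb);
}
```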
fuzzy barrier
So, how do we mark a quiescent state? Using a “fuzzy barrier”.
When a thread reaches a normal barrier, it blocks until all the other threads have reached the barrier, after which exactly one of the threads can enter a protected section of code, and the others are unblocked and can proceed as normal.
When a thread encounters a fuzzy barrier, it never blocks. It either proceeds immediately as normal, or if it is the last thread to reach the barrier, it enters the protected code.
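A fuzzy barrier along those lines can be little more than a counter. Here’s a sketch, assuming a fixed set of `NTHREADS` threads that each pass through the barrier once per round; the names are hypothetical.

```c
#include <stdatomic.h>

#define NTHREADS 8 /* assumed fixed for this sketch */

static _Atomic unsigned arrived;

extern void protected_work(void); /* e.g. reclaim old versions */

/* never blocks: the last thread to arrive in a round does the
 * protected work and resets the barrier; the others carry on.
 * (assumes rounds do not overlap, to keep the sketch simple) */
void fuzzy_barrier(void)
{
	unsigned prev = atomic_fetch_add_explicit(&arrived, 1,
						  memory_order_acq_rel);
	if (prev + 1 == NTHREADS) {
		protected_work();
		atomic_store_explicit(&arrived, 0, memory_order_release);
	}
}
```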
RCU does not actually use a fuzzy barrier as I have described it. Like a fuzzy barrier, each thread keeps track of whether it has passed through a quiescent state in the current grace period, without blocking; but unlike a fuzzy barrier, no thread is diverted to the protected code. Instead, code that wants to enter a protected section uses the blocking `synchronize_rcu()` function.
EBR-ish
As in the paper “performance of memory reclamation for lockless synchronization”, my implementation of QSBR uses a fuzzy barrier designed for another safe memory reclamation algorithm, EBR, epoch based reclamation. (EBR was invented here in Cambridge by Keir Fraser.)
Actually, my fuzzy barrier is slightly different to EBR’s. In EBR, the fuzzy barrier is used every time the program enters a critical section. (In qp-trie terms, that would be every time a reader fetches a root pointer.) So it is vital that EBR’s barrier avoids mutating shared state, because that would wreck multithreaded performance.
Because BIND will only pass through the fuzzy barrier when it is about to use a blocking system call, my version mutates shared state more frequently (typically, once per CPU per grace period, instead of once per grace period). If this turns out to be a problem, it won’t be too hard to make it work more like EBR.
More trivially, I’m using the term “phase” instead of “epoch”, because it’s nothing to do with the unix epoch, because there are three phases, and because I can talk about phase transitions and threads being out of phase with each other.
code
My QSBR implementation started out in the branch where I am working on my qp-trie multithreaded performance improvements; it has now been merged into BIND’s main branch.
Once I understood how it should work, the implementation was about 200 lines of code, plus 200 lines of comments.
coda
While reading various RCU-related papers, I was amused by “user-level implementations of read-copy update”, which says:
> BIND, a major domain-name server used for Internet domain-name resolution, is facing scalability issues. Since domain names are read often but rarely updated, using user-level RCU might be beneficial.
Yes, I think it might :-)