Previously, I wrote about implementing safe memory reclamation for my qp-trie code in BIND. I now have it working with a refactored qp-trie that supports asynchronous memory reclamation, to the point where I can run some benchmarks comparing the performance of the old and new versions.
the contenders
My qp-trie code supports concurrent readers and at most one writer at a time. Writes are strictly serialized.
The old code reclaimed memory synchronously, immediately after committing a modify transaction. It could be built to use one of two different concurrency control mechanisms:
- A reader/writer lock. This has poor read-side scalability, because every thread is hammering on the same shared location. Writers perform reasonably well (BIND’s rwlock prefers writers), but a writer blocks all readers when committing, just long enough to swap the trie’s root pointer.
- `liburcu`, userland read-copy-update. RCU has a fast and scalable read side, nice! But on the write side I used `synchronize_rcu()`, which is rather slow, so write performance is terrible. In effect, readers block the writer for a limited but relatively long period of time.
The new code reclaims memory asynchronously, using a callback instead of blocking like `synchronize_rcu()`, so it should be able to get much better performance with `liburcu`. However, it can only be built one way:
- QSBR has a fast, scalable read side. Reads and writes do not interfere with each other at all, except for contention on the trie’s root pointer, which only occurs on commit.
The aim of the refactored qp-trie is to have one good concurrency mode instead of two bad ones.
the benchmark
I needed to re-do my multithreaded qp-trie benchmark so that it uses `isc_qsbr` in a reasonably realistic manner. This means that each CPU runs a `libuv` event loop; `libuv` callbacks drive both the QSBR phase transitions and the benchmark runs.
The old benchmark used unvarnished pthreads, so the raw numbers from the old and new benchmarks are not directly comparable. I have cooked most of the numbers from the new benchmark to discount time spent in `libuv`. (The benchmark does print the time taken including `libuv`, and some of the benchmark series explore how much overhead it adds, but I’m not concerned with that aspect today.)
The old and new benchmarks have the same overall plan:
- Each benchmark run exercises a qp-trie for 250 ms.
- Each run is part of a series in which most of the parameters are fixed (e.g. number of threads, size of transactions) and one parameter is varied.
- There is a collection of series, each of which examines some aspect of performance that I am concerned about.
My main aim is to confirm that the code behaves as expected, not so much to get the most flattering absolute numbers.
I ran the benchmarks on my MacBook Pro M1 Pro, without making any effort to reduce background activity, so the numbers are rather noisy.
the results
How does read performance vary with the number of concurrent threads? The number of ops per microsecond per core should be flat.
         rwlock         urcu           qsbr
read   Kops  ops/us   Kops  ops/us   Kops  ops/us
  1    1530   6.00    1348   5.31    1251   5.32
  2    3003   5.89    3030   6.05    2746   5.87
  3    4626   6.05    4521   5.91    3964   5.66
  4    5568   5.46    5721   5.61    5600   6.03
  5    7045   5.53    7102   5.57    7047   6.10
  6    8582   5.61    7988   5.32    8387   6.09
  7    9628   5.39    9653   5.41    9448   5.89
  8   11190   5.48   10997   5.39   10998   5.97
  9   11364   4.95   11578   5.13   11352   5.68
 10   12840   4.80   12511   4.91   11716   5.31
OK, the `urcu` and `qsbr` numbers are very noisy, but generally flat; there’s a perceptible downward trend in the `rwlock` performance. (The lower absolute count of ops for `qsbr` is due to the `libuv` loop overhead, which is discounted from the ops/us/core numbers.)
What if we use one of the cores to add some write activity?
         rwlock         urcu           qsbr
read   Kops  ops/us   Kops  ops/us   Kops  ops/us
  1     627   2.46    1406   5.52    1134   4.83
  2    1253   2.46    2832   5.65    2447   5.22
  3    1666   2.17    4256   5.56    3717   5.32
  4    2002   1.96    5831   5.80    4749   5.36
  5    2313   1.81    6933   5.54    6147   5.43
  6    2688   1.75    8389   5.48    7512   5.50
  7    2703   1.52    9535   5.44    8928   5.59
  8    2643   1.29   11769   5.77    9284   5.26
  9    3153   1.37   11686   5.09   10344   5.28
The `rwlock` read performance takes a dive, and scales even worse with the number of threads. The `urcu` and `qsbr` performance remains flat and is not much slower. (I think it could be better if I get rid of some false sharing?)
But while that reading is going on, what is the write thread doing?
         rwlock          urcu           qsbr
read    ops  ops/us     ops  ops/us     ops  ops/us
  1   65440   0.03      320   0.00    72208   0.03
  2   54704   0.03      304   0.00    75200   0.04
  3   50384   0.03      304   0.00    77184   0.05
  4   44672   0.03      240   0.00    70400   0.06
  5   45504   0.04      240   0.00    67024   0.06
  6   40944   0.04      192   0.00    65632   0.07
  7   42256   0.05      208   0.00    61056   0.09
  8   40320   0.08      176   0.00    60800   0.16
  9   41424   0.16      144   0.00    42400   0.25
As I said, the `urcu` write performance is terrible (note that the ops columns are not divided by 1000 in this table!). Pleasingly, the `qsbr` code gets better write throughput than the `rwlock` code as well as decent read throughput.
That’s possibly even better than what I wanted, yay!
more numbers
In the benchmark runs above, I did 32 read operations per transaction (and in the `qsbr` case, 32 transactions before returning to the `libuv` event loop). TBH 32 ops per read transaction is unrealistically large. To increase write contention I did a more realistic 4 ops per write transaction (and 4 tx per loop with `qsbr`).
To get some idea of the read-side overhead, I did a series of runs that keep the ops per loop constant at 32 × 32 == 1024 while varying the number of ops per transaction.
A lightweight read side should have performance that does not degrade so much towards the top of the table, where there are only a few ops per transaction.
                 rwlock         urcu           qsbr
tx/loop ops/tx  Kops  ops/us   Kops  ops/us   Kops  ops/us
   1024      1   822   0.36    9828   4.28    8962   4.54
    512      2  1230   0.54   10566   4.60    9645   4.89
    256      4  1938   0.84   10752   4.68    9864   5.03
    128      8  2525   1.10   11595   5.04   10148   5.23
     64     16  2658   1.16   11756   5.11   10198   5.24
     32     32  3176   1.36   12118   5.28   10432   5.34
     16     64  3206   1.40   12170   5.30   10256   5.27
      8    128  3886   1.70   12141   5.29   10345   5.30
      4    256  5065   2.21   12008   5.23   10267   5.29
      2    512  5899   2.57   11555   5.03   10384   5.30
      1   1024  6406   2.84   11885   5.17   10246   5.25
All of these benchmark runs are working on a qp-trie containing a million random string keys. An operation chooses a key uniformly at random. This thrashes the CPU’s caches fairly hard: there’s almost one GByte of leaf objects (mostly consisting of oversized buffers containing the keys) plus 15 MBytes for the trie interior nodes.
It goes quite a lot faster with a smaller trie, and the space used by the trie structure remains reasonably modest. (I think more realistic, less random keys would need more like 20 bytes of interior nodes per item, because they would have a deeper, narrower trie, needing more interior branch nodes. Each item in the trie needs at least 12 bytes for its interior leaf node.)
                  rwlock         urcu           qsbr
B/item    items  Kops  ops/us   Kops  ops/us   Kops  ops/us
 64.00       10 37474  14.70   53225  20.90   46567  27.73
 17.08      100 35880  13.88   45519  17.86   39366  22.31
 14.49     1000 31671  12.43   37312  14.64   37173  20.46
 16.01    10000 27270  10.70   29958  11.75   29821  15.80
 14.85   100000 19915   7.81   19228   7.61   18874   9.31
 15.14  1000000 12125   4.79   12686   4.97   11578   5.30
what's missing
There isn’t much to put these numbers into context.
Previously, when I prototyped an earlier version of my qp-trie in NLnetLabs NSD, I was able to compare it fairly directly with NSD’s existing red-black tree and radix tree: the interface between the data structures and the rest of NSD is straightforward.
In BIND, things are more complicated, because BIND is multi-threaded (NSD uses multiple processes) and makes changes to its working data structures all the time (because it is a resolver, and because its authoritative zones support dynamic updates and automatic DNSSEC signing).
BIND’s existing red-black tree is single-threaded. On top of the `rbt` is the `rbtdb`, which is an in-memory database of DNS records that uses the `rbt` as an index. The `rbtdb` is deeply intertwingled with the `rbt` in order to add multithreading support, and with the `rdataslab` storage for DNS records. So it is difficult (and I have not tried) to set up a simple benchmark that can compare a qp-trie with BIND’s `rbtdb`.
what's next
I hope that the qp-trie will give BIND some cleaner layering. The data structure is general-purpose, and has built-in support for concurrency and versioning. (Versioning allows zone transfers and updates to happen concurrently without interference, and it supports `ixfr-from-differences`.) Its API enforces a clean separation between its interior nodes and the leaf objects stored in the trie, which doesn’t exist in BIND’s `rbt`.
As the qp-trie gets used for real work inside BIND, I expect there will be some more meaningful benchmarks, so we can show there are performance improvements, or at least no regressions.