Previously, I wrote about implementing safe memory reclamation for my qp-trie code in BIND. I have now got the refactored qp-trie, which has been changed to support asynchronous memory reclamation, working well enough that I can run some benchmarks to compare the performance of the old and new versions.

the contenders

My qp-trie code supports concurrent readers and at most one writer at a time. Writes are strictly serialized.

The old code reclaimed memory synchronously, immediately after committing a modify transaction. It could be built to use one of two different concurrency control mechanisms: a pthread read-write lock (the rwlock columns in the tables below), or liburcu, blocking in synchronize_rcu() to wait for readers before freeing memory (the urcu columns), which makes its writes very slow.

The new code reclaims memory asynchronously, using a callback instead of blocking like synchronize_rcu(), so it should be able to get much better write performance than the old code managed with liburcu. However, it can only be built one way: with my isc_qsbr quiescent-state-based reclamation (the qsbr columns).

The aim of the refactored qp-trie is to have one good concurrency mode instead of two bad ones.
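
To make the comparison concrete, here is a rough sketch of what each build does during read transactions and write commits. The function names are invented for illustration; this is not the real BIND or qp-trie API.

    /* Invented names, not the real API: a sketch of the three builds. */

    #include <pthread.h>
    #include <urcu.h>   /* liburcu: rcu_read_lock(), synchronize_rcu() */

    struct qp;                           /* the trie, details elided */
    void do_read_ops(struct qp *);       /* a read transaction's lookups */
    void commit(struct qp *);            /* publish the modified trie */
    void free_detached_chunks(void *);   /* free memory readers no longer need */
    void isc_qsbr_defer(void (*cb)(void *), void *arg);   /* invented name */

    /* rwlock build: a read transaction holds the lock for its whole
     * duration, so readers and the writer exclude each other. */
    void read_rwlock(struct qp *qp, pthread_rwlock_t *lock) {
        pthread_rwlock_rdlock(lock);
        do_read_ops(qp);
        pthread_rwlock_unlock(lock);
    }

    /* urcu build: reads are cheap read-side critical sections... */
    void read_urcu(struct qp *qp) {
        rcu_read_lock();
        do_read_ops(qp);
        rcu_read_unlock();
    }

    /* ...but after committing, the old code blocks in synchronize_rcu()
     * until every reader has finished, and only then frees memory. */
    void write_urcu(struct qp *qp) {
        commit(qp);
        synchronize_rcu();
        free_detached_chunks(qp);
    }

    /* qsbr build: read transactions need no per-read locking at all
     * (readers report quiescent states from the event loop instead),
     * and the writer hands its garbage to the reclamation machinery
     * with a callback and carries on immediately. */
    void write_qsbr(struct qp *qp) {
        commit(qp);
        isc_qsbr_defer(free_detached_chunks, qp);
    }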

the benchmark

I needed to re-do my multithreaded qp-trie benchmark so that it uses isc_qsbr in a reasonably realistic manner. This means that each CPU is running a libuv event loop; libuv callbacks drive both the QSBR phase transitions and the benchmark runs.
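
Concretely, each worker's libuv callback runs a batch of transactions and then tells the QSBR machinery that the thread has reached a quiescent state. Here is a hypothetical sketch of the shape of it; the isc_qsbr call and the helper functions are invented names, not the real interface.

    /* Hypothetical sketch: a libuv idle callback drives the benchmark
     * and QSBR; isc_qsbr_quiescent_state() and the helpers are invented. */

    #include <stdint.h>
    #include <uv.h>

    #define TX_PER_LOOP 32
    #define OPS_PER_TX  32

    struct qp;                                   /* the trie, details elided */
    void lookup_random_key(struct qp *);         /* one read operation */
    void isc_qsbr_quiescent_state(uv_loop_t *);  /* invented name */

    struct worker {
        uv_idle_t idle;    /* keeps the loop calling us between batches */
        struct qp *qp;
        uint64_t ops;
    };

    static void benchmark_cb(uv_idle_t *handle) {
        struct worker *w = handle->data;

        /* run a batch of read transactions, then return to the loop */
        for (int tx = 0; tx < TX_PER_LOOP; tx++) {
            for (int op = 0; op < OPS_PER_TX; op++) {
                lookup_random_key(w->qp);
                w->ops++;
            }
        }

        /* being back in the event loop means this thread holds no
         * references into the trie, so it can report a quiescent state,
         * letting QSBR advance its phase and run reclamation callbacks */
        isc_qsbr_quiescent_state(handle->loop);
    }

    void start_worker(uv_loop_t *loop, struct worker *w) {
        uv_idle_init(loop, &w->idle);
        w->idle.data = w;
        uv_idle_start(&w->idle, benchmark_cb);
    }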

The old benchmark used unvarnished pthreads, so the raw numbers from the old and new benchmarks are not directly comparable. I have cooked most of the numbers from the new benchmark to discount time spent in libuv. (The benchmark does print the time taken including libuv, and some of the benchmark series explore how much overhead it adds, but I’m not concerned with that aspect today.)

The old and new benchmarks have the same overall plan: some number of threads run read transactions against a shared trie, optionally with one more thread running write transactions, and each thread counts how many operations it completes and how long they take.

My main aim is to confirm that the code behaves as expected, not so much to get the most flattering absolute numbers.

I ran the benchmarks on my MacBook Pro M1 Pro, without making any effort to reduce background activity, so the numbers are rather noisy.

the results

How does read performance vary with the number of concurrent threads? The number of ops per microsecond per core should be flat.

              rwlock              urcu               qsbr
read        Kops  ops/us       Kops  ops/us       Kops  ops/us
   1        1530    6.00       1348    5.31       1251    5.32
   2        3003    5.89       3030    6.05       2746    5.87
   3        4626    6.05       4521    5.91       3964    5.66
   4        5568    5.46       5721    5.61       5600    6.03
   5        7045    5.53       7102    5.57       7047    6.10
   6        8582    5.61       7988    5.32       8387    6.09
   7        9628    5.39       9653    5.41       9448    5.89
   8       11190    5.48      10997    5.39      10998    5.97
   9       11364    4.95      11578    5.13      11352    5.68
  10       12840    4.80      12511    4.91      11716    5.31

OK, the urcu and qsbr numbers are very noisy, but generally flat; there’s a perceptible downward trend in the rwlock performance. (The lower absolute count of ops for qsbr is due to the libuv loop overhead, which is discounted from the ops/us/core numbers.)

What if we use one of the cores to add some write activity?

              rwlock             urcu               qsbr
read       Kops  ops/us       Kops  ops/us       Kops  ops/us
   1        627    2.46       1406    5.52       1134    4.83
   2       1253    2.46       2832    5.65       2447    5.22
   3       1666    2.17       4256    5.56       3717    5.32
   4       2002    1.96       5831    5.80       4749    5.36
   5       2313    1.81       6933    5.54       6147    5.43
   6       2688    1.75       8389    5.48       7512    5.50
   7       2703    1.52       9535    5.44       8928    5.59
   8       2643    1.29      11769    5.77       9284    5.26
   9       3153    1.37      11686    5.09      10344    5.28

The rwlock read performance takes a dive, and scales even worse with the number of threads. The urcu and qsbr performance remains flat and is not much slower than in the read-only runs. (I think it could be better if I get rid of some false sharing?)
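
For what it is worth, the usual fix for that kind of false sharing is to give each thread's hot counters a cache line of their own; a generic sketch (not the actual benchmark code) looks like this.

    /* Generic sketch of avoiding false sharing, not the actual benchmark
     * code: pad and align each thread's counters so adjacent slots never
     * share a cache line. */

    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE 128   /* Apple silicon reports 128-byte cache lines */

    struct per_thread {
        alignas(CACHE_LINE) uint64_t ops;   /* hot counter, written constantly */
        uint64_t transactions;
        /* alignas() rounds sizeof(struct per_thread) up to a multiple of
         * CACHE_LINE, so one thread's writes cannot invalidate the line
         * holding another thread's counters */
    };

    static struct per_thread counters[64];   /* one slot per thread */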

But while that reading is going on, what is the write thread doing?

              rwlock             urcu               qsbr
read        ops  ops/us        ops  ops/us        ops  ops/us
   1      65440    0.03        320    0.00      72208    0.03
   2      54704    0.03        304    0.00      75200    0.04
   3      50384    0.03        304    0.00      77184    0.05
   4      44672    0.03        240    0.00      70400    0.06
   5      45504    0.04        240    0.00      67024    0.06
   6      40944    0.04        192    0.00      65632    0.07
   7      42256    0.05        208    0.00      61056    0.09
   8      40320    0.08        176    0.00      60800    0.16
   9      41424    0.16        144    0.00      42400    0.25

As I said, the urcu write performance is terrible (note the ops columns are not divided by 1000 in this table!). Pleasingly, the qsbr code gets better write throughput than the rwlock code as well as decent read throughput.

That’s possibly even better than what I wanted, yay!

more numbers

In the benchmark runs above, I did 32 read operations per transaction (and in the qsbr case, 32 transactions before returning to the libuv event loop). TBH 32 ops per read transaction is unrealistically large. To increase write contention I did a more realistic 4 ops per write transaction (and 4 tx per loop with qsbr).

To get some idea of the read-side overhead, I did a series of runs that keep the ops per loop constant at 32*32 == 1024, but vary the number of ops per transaction.

A lightweight read side should have performance that does not degrade so much towards the top of the table, where there are only a few ops per transaction.
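
In code, each pass of the loop has roughly this shape (invented names again): the begin/end cost is paid once per transaction, so small ops-per-transaction values expose the read-side overhead.

    /* Sketch of one benchmark loop pass, with invented names;
     * tx_per_loop * ops_per_tx == 1024 for every run in this series. */

    struct qp;
    void read_begin(struct qp *);         /* rdlock / rcu_read_lock / no-op */
    void read_end(struct qp *);           /* unlock / rcu_read_unlock / no-op */
    void lookup_random_key(struct qp *);  /* one read operation */

    void one_loop_pass(struct qp *qp, int tx_per_loop, int ops_per_tx) {
        for (int tx = 0; tx < tx_per_loop; tx++) {
            read_begin(qp);    /* per-transaction overhead paid here... */
            for (int op = 0; op < ops_per_tx; op++) {
                lookup_random_key(qp);
            }
            read_end(qp);      /* ...and here */
        }
    }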

                      rwlock            urcu             qsbr
tx/loop  ops/tx     Kops  ops/us     Kops  ops/us     Kops  ops/us
   1024       1      822    0.36     9828    4.28     8962    4.54
    512       2     1230    0.54    10566    4.60     9645    4.89
    256       4     1938    0.84    10752    4.68     9864    5.03
    128       8     2525    1.10    11595    5.04    10148    5.23
     64      16     2658    1.16    11756    5.11    10198    5.24
     32      32     3176    1.36    12118    5.28    10432    5.34
     16      64     3206    1.40    12170    5.30    10256    5.27
      8     128     3886    1.70    12141    5.29    10345    5.30
      4     256     5065    2.21    12008    5.23    10267    5.29
      2     512     5899    2.57    11555    5.03    10384    5.30
      1    1024     6406    2.84    11885    5.17    10246    5.25

All of these benchmark runs are working on a qp-trie containing a million random string keys. An operation chooses a key uniformly at random. This thrashes the CPU’s caches fairly hard: there’s almost one GByte of leaf objects (mostly consisting of oversized buffers containing the keys) plus 15 MBytes for the trie interior nodes.

It goes quite a lot faster with a smaller trie, and the space used by the trie structure remains reasonably modest: the B/item column below is the interior-node overhead per item. (I think more realistic / less random keys would need more like 20 bytes of interior nodes per item, because they would have a deeper, narrower trie, needing more interior branch nodes. Each item in the trie needs at least 12 bytes for its interior leaf node.)

                       rwlock            urcu             qsbr
 B/item    items     Kops  ops/us     Kops  ops/us     Kops  ops/us
  64.00       10    37474   14.70    53225   20.90    46567   27.73
  17.08      100    35880   13.88    45519   17.86    39366   22.31
  14.49     1000    31671   12.43    37312   14.64    37173   20.46
  16.01    10000    27270   10.70    29958   11.75    29821   15.80
  14.85   100000    19915    7.81    19228    7.61    18874    9.31
  15.14  1000000    12125    4.79    12686    4.97    11578    5.30

what’s missing

There isn’t much to put these numbers into context.

Previously, when I prototyped an earlier version of my qp-trie in NLnetLabs NSD, I was able to compare it fairly directly with NSD’s existing red-black tree and radix tree: the interface between the data structures and the rest of NSD is straightforward.

In BIND, things are more complicated, because BIND is multi-threaded (NSD uses multiple processes) and makes changes to its working data structures all the time (because it is a resolver, and because its authoritative zones support dynamic updates and automatic DNSSEC signing).

BIND’s existing red-black tree is single-threaded. On top of the rbt is the rbtdb, which is an in-memory database of DNS records that uses the rbt as an index. The rbtdb is deeply intertwingled with the rbt in order to add multithreading support, and with the rdataslab storage for DNS records. So it is difficult (and I have not tried) to set up a simple benchmark that can compare a qp-trie with BIND’s rbtdb.

what’s next

I hope that the qp-trie will give BIND some cleaner layering. The data structure is general-purpose, and has built-in support for concurrency and versioning. (Versioning allows zone transfers and updates to happen concurrently without interference, and it supports ixfr-from-differences.) Its API enforces a clean separation between its interior nodes and the leaf objects stored in the trie, which doesn’t exist in BIND’s rbt.
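
To give a flavour of that separation, here is a hypothetical sketch with invented names (not the real API): the trie stores opaque pointers to leaf objects and calls back through a small method table to extract keys and manage reference counts, so the interior nodes stay entirely private to the trie.

    /* Hypothetical sketch of a trie / leaf-object separation; the names
     * are invented and this is not the real BIND qp-trie API. */

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* the trie never looks inside a leaf object; it only calls these */
    struct leaf_methods {
        /* write the leaf's lookup key into key[], return its length */
        size_t (*make_key)(uint8_t *key, size_t space, const void *leaf);
        /* reference counting, so versioned snapshots can share leaves */
        void (*attach)(void *leaf);
        void (*detach)(void *leaf);
    };

    struct trie;   /* interior branch and leaf nodes are private to the trie */

    struct trie *trie_new(const struct leaf_methods *methods);
    bool trie_insert(struct trie *t, void *leaf);
    void *trie_lookup(struct trie *t, const uint8_t *key, size_t keylen);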

As the qp-trie gets used for real work inside BIND, I expect there will be some more meaningful benchmarks, so we can show there are performance improvements, or at least no regressions.