never mind the quadbits, feel the width!

benchmarking wider-fanout versions of qp tries

QP tries are faster than crit-bit tries because they extract up to four bits of information out of the key per branch, whereas crit-bit tries only get one bit per branch. If branch nodes have more children, then the trie should be shallower on average, which should mean lookups are faster.

So if we extract even more bits of information from the key per branch, it should be even better, right? But there are downsides to wider fanout.

I originally chose to index by nibbles for two reasons: they are easy to pull out of the key, and the bitmap easily fits in a word with room for a key index.

If we index keys by 5-bit chunks we pay a penalty for more complicated indexing.

If we index keys by 6-bit chunks, trie nodes have to be bigger to fit a bigger bitmap, so we pay a memory overhead penalty.

How do these costs compare to the benefits of wider branches?

(The next two sections expand on the downsides; skip to "results" for the tl;dr.)

five-bit fields are fiddly

If we use the key 5 bits at a time, the bitmap is 2^5 == 32 bits. This leaves room in a 64 bit word for a 28 bit index (limiting keys to 256 Mbytes), three bits to store the shift/alignment of the 5 bit field relative to 8 bit bytes, and a 1 bit branch/leaf flag.
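Concretely, the branch word can be packed something like this; a minimal sketch, where the exact field order and the helper names are my assumptions rather than the real code's:

    #include <stdint.h>

    /* 1 + 3 + 28 + 32 == 64 bits */
    #define BRANCH_FLAG ((uint64_t)1)                      /* bit 0: branch, not leaf */
    #define SHIFT(w)    ((unsigned)((w) >> 1) & 7)         /* bits 1-3: chunk alignment */
    #define INDEX(w)    ((uint32_t)((w) >> 4) & 0xFFFFFFF) /* bits 4-31: byte index */
    #define BITMAP(w)   ((uint32_t)((w) >> 32))            /* bits 32-63: child bitmap */

    static inline uint64_t
    branch_word(uint32_t bitmap, uint32_t index, unsigned shift) {
        return (uint64_t)bitmap << 32
             | (uint64_t)(index & 0xFFFFFFF) << 4
             | (uint64_t)(shift & 7) << 1
             | BRANCH_FLAG;
    }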

When indexing by 4 bit chunks, we only need to look at one byte of key at a time; the chunk can be in the top half or the bottom half. For larger chunks we often need to extract a chunk that overlaps two bytes. To identify a 5-bit chunk we store the byte index of the first byte it overlaps, and how much to shift the pair of bytes to extract the chunk we want. (The aim is to avoid complicated arithmetic in the inner loops.) This gives us a lot more freedom to identify chunks than we actually use, because although there are 8 possible shifts at any particular byte index, we only use one or two of them. The shifts have to be consistent with representing the key as a string of successive 5-bit chunks.
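A sketch of the difference; the helper names and details are mine, not necessarily the real code's:

    #include <stddef.h>
    #include <stdint.h>

    /* a 4-bit chunk never crosses a byte boundary: chunk c lives in
     * byte c/2, in the top half if c is even, the bottom half if odd */
    static inline unsigned
    chunk4(const unsigned char *key, uint32_t c) {
        unsigned byte = key[c / 2];
        return (c & 1) ? byte & 0xF : byte >> 4;
    }

    /* a 5-bit chunk is named by the index i of the first byte it
     * overlaps and its shift s: load the byte pair as a 16-bit value,
     * shift the chunk up to the top, then pull it down to the bottom */
    static inline unsigned
    chunk5(const unsigned char *key, size_t len, uint32_t i, unsigned s) {
        unsigned pair = (unsigned)key[i] << 8;
        if (i + 1 < len)        /* the chunk may hang off the end of the key */
            pair |= key[i + 1];
        return (pair << s >> 11) & 0x1F;
    }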

The diagram below shows the correspondence between byte indexes (i) and valid chunk alignment shifts (s). The shift value is how much to shift the chunk left to move it to the top of its byte, so at any particular byte index the valid shifts increase from left to right. This fact is used when working out how deep a new branch node is: for instance, when i%5==1 the possible shifts are 2 and 7, so the branch for the chunk with shift 2 has to be a parent of the branch for the chunk with shift 7. A chunk with shift 5 also overlaps a byte where i%5==1, but it is stored with the index of the preceding byte, where i%5==0.

 i%5==0  i%5==1  i%5==2  i%5==3  i%5==4
|       |       |       |       |       | bytes
7654321076543210765432107654321076543210
|    |    |    |    |    |    |    |    | chunks
 s=0  s=5  s=2  s=7  s=4  s=1  s=6  s=3
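There is nothing magic about this pattern: chunk number c starts at bit offset 5*c from the start of the key, so its byte index is 5*c/8 and its shift is 5*c%8. A throwaway check that reproduces the diagram:

    #include <stdio.h>

    int main(void) {
        /* prints i = 0,0,1,1,2,3,3,4 and s = 0,5,2,7,4,1,6,3,
         * matching the byte indexes and shifts shown above */
        for (unsigned c = 0; c < 8; c++)
            printf("chunk %u: i=%u s=%u\n", c, 5 * c / 8, 5 * c % 8);
        return 0;
    }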

When we are working out the first chunk where two keys differ (so we can work out where to insert a new branch) we start by scanning to find the first byte that differs. But if the first differing bits are in the top of the byte, they can be in a chunk which overlaps the preceding byte. In those cases we have to subtract one from the byte index.
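In code, the subtraction falls out of the same arithmetic. A sketch, assuming GCC's __builtin_clz, 32-bit ints, and keys that are known to differ before either one ends (divergence5 is a hypothetical name):

    #include <stdint.h>

    static void
    divergence5(const unsigned char *k1, const unsigned char *k2,
                uint32_t *ip, unsigned *sp) {
        uint32_t i = 0;
        while (k1[i] == k2[i])
            i++;                                        /* first differing byte */
        unsigned r = __builtin_clz(k1[i] ^ k2[i]) - 24; /* first differing bit, from the top */
        uint32_t c = (8 * i + r) / 5;                   /* first differing chunk */
        *ip = 5 * c / 8;    /* this is i - 1 when the chunk overlaps the preceding byte */
        *sp = 5 * c % 8;
    }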

six-bit chunks blow the size budget

If we use the key in 6-bit chunks, we get most of the lookup complexity of 5-bit chunks, plus added size problems.

The bitmap is now 2^6 == 64 bits, a whole word, so we have to expand trie nodes to three words: bitmap, index, and twigs pointer. This means there is now a word of wasted space in the leaf nodes: they have to be the same size as branch nodes, but they only need two words, for the key and value pointers.
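A minimal sketch of the resulting three-word node (the field names are mine, not necessarily the real code's):

    #include <stdint.h>

    typedef union node node_t;
    union node {
        struct {
            uint64_t bitmap;    /* 2^6 == 64 bits, a whole word */
            uint64_t index;     /* byte index, chunk shift, branch/leaf flag */
            node_t *twigs;      /* one child per set bit in the bitmap */
        } branch;
        struct {
            const char *key;    /* borrowed from the caller, see below */
            void *value;
            /* the third word is wasted: the union forces leaves to be
             * the same size as branches */
        } leaf;
    };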

My code borrows the caller's existing pointer to the key, rather than taking a copy of the key for its own use. If I changed it to take a copy, I could save the wasted space in big leaf nodes by using it to store short keys inline. But for now I've just done enough of a proof of concept to see the effect of wider branches, without redoing the memory ownership model.

The other potential loss for 6-bit chunks is the size of the twig array: up to 64 twigs of three words each, 64 * 3 * 8 = 1536 bytes. That is 24 cache lines, so prefetching is unlikely to be much help when the twig index is large, because the prefetch guess is way off.
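The prefetch in question is issued at each step of a lookup, before popcount has worked out which child to descend to. A minimal sketch of one step, assuming GCC builtins and the node_t layout sketched above (step6 is a hypothetical name, and it assumes the chunk is already known to be present in the bitmap):

    #include <stdint.h>

    static inline node_t *
    step6(node_t *n, unsigned chunk) {
        /* request the start of the twig array early; when the array
         * spans up to 24 cache lines, children near the end gain
         * little or nothing from this guess */
        __builtin_prefetch(n->branch.twigs);
        /* count the set bits below our chunk's bit to find the child */
        uint64_t below = n->branch.bitmap & (((uint64_t)1 << chunk) - 1);
        return &n->branch.twigs[__builtin_popcountll(below)];
    }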

benchmark process

The data sets I use are the five named in the tables below: in-b9, in-dns, in-rdns, in-usdw, and top-1m.

A benchmark run takes four time measurements (lower is better): "load", "search", "mutate", and "free". The steps of a run are:

    read the data into memory (not timed)
    "load" it into the trie
    "search" for 1 000 000 pseudorandomly selected keys
    "mutate" (i.e. add or delete) 1 000 000 pseudorandomly selected keys
    reload any missing items (not timed)
    "free" the entire trie one key at a time

A benchmark run has an explicit PRNG seed.

Each benchmark iteration has a fixed PRNG seed for all runs. It performs benchmark runs with every data set and every version of the data structure under test (5 x 4 runs). The versions of the data structure are: cb, a crit-bit trie branching on one bit at a time; qp, a qp trie branching on 4-bit chunks; fp, branching on 5-bit chunks; and wp, branching on 6-bit chunks.

Benchmarks are iterated and we take the mean.

The following table gives some information about the sizes of the data structures: number of leaf nodes, number of branch nodes, "overhead" i.e. number of words of storage per leaf (not counting key and value pointers), and the average depth of the trie. The in-dns file seems to be particularly weird.

         leaves branches  OH   depth

    cb    63603   63602  2.00  29.93  in-b9
    qp    63603   35414  1.11  12.45
    fp    63603   32490  1.02  10.56
    wp    63603   31311  2.48   9.27

    cb   314309  314308  2.00  42.19  in-dns
    qp   314309   95094  0.61  17.53
    fp   314309  119916  0.76  16.87
    wp   314309   95866  1.92  14.20

    cb   314309  314308  2.00  38.53  in-rdns
    qp   314309  188252  1.20  20.67
    fp   314309  186714  1.19  18.78
    wp   314309  173500  2.66  17.93

    cb   235882  235881  2.00  26.50  in-usdw
    qp   235882  169434  1.44  12.46
    fp   235882  154272  1.31  10.34
    wp   235882  144357  2.84   9.02

    cb  1000000  999999  2.00  30.76  top-1m
    qp  1000000  617027  1.23  12.52
    fp  1000000  560985  1.12  10.45
    wp  1000000  514113  2.54   8.95

results

Overall, 5-bit chunks (fp) win over 4-bit chunks (qp). 6-bit chunks (wp) are not a win on my slow computers (the first two tables below), but they are a win on a bigger Xeon (the third table). (All these machines are pretty old.)

         in-b9  in-dns in-rdns in-usdw  top-1m

    cb   0.030   0.173   0.151   0.095   1.438  free
    qp   0.028   0.153   0.155   0.099   1.017
    fp   0.026   0.165   0.144   0.096   0.924
    wp   0.029   0.164   0.155   0.100   0.945

    cb   0.052   0.181   0.199   0.081   1.425  load
    qp   0.053   0.250   0.276   0.131   1.170
    fp   0.051   0.233   0.265   0.129   1.102
    wp   0.053   0.231   0.279   0.130   1.127

    cb   0.489   1.314   1.221   1.064   1.855  mutate
    qp   0.461   0.997   1.046   0.833   1.292
    fp   0.425   0.988   1.005   0.773   1.163
    wp   0.461   0.995   1.051   0.822   1.131

    cb   0.465   1.090   1.079   0.861   1.756  search
    qp   0.387   0.883   0.921   0.708   1.139
    fp   0.358   0.836   0.888   0.657   1.002
    wp   0.426   0.828   0.932   0.710   0.973

         in-b9  in-dns in-rdns in-usdw  top-1m

    cb   0.017   0.085   0.071   0.041   0.809  free
    qp   0.020   0.102   0.087   0.056   0.643
    fp   0.020   0.102   0.085   0.055   0.603
    wp   0.020   0.103   0.091   0.057   0.669

    cb   0.027   0.125   0.142   0.051   0.940  load
    qp   0.029   0.164   0.188   0.079   0.734
    fp   0.028   0.159   0.186   0.080   0.693
    wp   0.028   0.151   0.194   0.080   0.733

    cb   0.380   0.889   0.829   0.642   1.307  mutate
    qp   0.356   0.711   0.764   0.518   0.940
    fp   0.342   0.717   0.749   0.481   0.879
    wp   0.345   0.741   0.815   0.561   0.907

    cb   0.301   0.740   0.730   0.521   1.202  search
    qp   0.262   0.592   0.661   0.443   0.798
    fp   0.251   0.596   0.644   0.412   0.730
    wp   0.255   0.610   0.704   0.488   0.737

         in-b9  in-dns in-rdns in-usdw  top-1m

    cb   0.020   0.106   0.095   0.041   0.789  free
    qp   0.014   0.087   0.073   0.042   0.457
    fp   0.013   0.089   0.074   0.042   0.414
    wp   0.013   0.081   0.071   0.039   0.415

    cb   0.038   0.200   0.225   0.073   1.172  load
    qp   0.026   0.160   0.191   0.077   0.667
    fp   0.027   0.160   0.193   0.080   0.612
    wp   0.024   0.139   0.183   0.071   0.615

    cb   0.481   0.865   0.794   0.641   1.261  mutate
    qp   0.285   0.495   0.550   0.385   0.721
    fp   0.270   0.520   0.542   0.355   0.662
    wp   0.256   0.494   0.534   0.350   0.634

    cb   0.473   0.880   0.844   0.611   1.292  search
    qp   0.297   0.522   0.607   0.391   0.737
    fp   0.275   0.526   0.584   0.360   0.668
    wp   0.264   0.506   0.582   0.361   0.644

Written by Tony Finch <dot@dotat.at> <https://dotat.at/>; you may do anything with this. It has no warranty. <https://creativecommons.org/publicdomain/zero/1.0/>