never mind the quadbits, feel the width!

benchmarking wider-fanout versions of qp tries

QP tries are faster than crit-bit tries because they extract up to four bits of information out of the key per branch, whereas crit-bit tries only get one bit per branch. If branch nodes have more children, then the trie should be shallower on average, which should mean lookups are faster.

So if we extract even more bits of information from the key per branch, it should be even better, right? But, there are downsides to wider fanout.

I originally chose to index by nibbles for two reasons: they are easy to pull out of the key, and the bitmap easily fits in a word with room for a key index.

If we index keys by 5-bit chunks we pay a penalty for more complicated indexing.

If we index keys by 6-bit chunks, trie nodes have to be bigger to fit a bigger bitmap, so we pay a memory overhead penalty.

How do these costs compare to the benefits of wider branches?

(The next two sections expand on the downsides; skip to "results" for the tl;dr.)

five-bit fields are fiddly

If we use the key 5 bits at a time, the bitmap is 2^5 == 32 bits. This leaves room in a 64 bit word for a 28 bit index (limiting keys to 256 Mbytes), three bits to store the shift/alignment of the 5 bit field relative to 8 bit bytes, and a 1 bit branch/leaf flag.

When indexing by 4 bit chunks, we only need to look at one byte of key at a time; the chunk can be in the top half or the bottom half. For larger chunks we often need to extract a chunk that overlaps two bytes. To identify a 5-bit chunk we store the byte index of the first byte it overlaps, and how much to shift the pair of bytes to extract the chunk we want. (The aim is to avoid complicated arithmetic in the inner loops.) This gives us a lot more freedom to identify chunks than we actually use, because although there are 8 possible shifts at any particular byte index, we only use one or two of them. The shifts have to be consistent with representing the key as a string of successive 5-bit chunks.

The diagram below shows the correspondence between byte indexes (i) and valid chunk alignment shifts (s). The shift value is how much to shift the chunk left to move it to the top of its byte, so shift values and indexes increase from left to right. This fact is used when working out how deep a new branch node is; for instance, when i%5==1 the possible shifts are 2 and 7, so the branch for the chunk with shift 2 has to be a parent of 7. Although shift 5 overlaps the same byte, it can only occur when i%5=0.

 i%5==0  i%5==1  i%5==2  i%5==3  i%5==4
|       |       |       |       |       | bytes
7654321076543210765432107654321076543210
|    |    |    |    |    |    |    |    | chunks
 s=0  s=5  s=2  s=7  s=4  s=1  s=6  s=3

When we are working out which is the first chunk where two keys differ (so we can work out where to insert a new branch) we start off by scanning to find the first byte that differs. But, if the first differing bits are in the top of the byte, they can be in a chunk which overlaps the preceding byte. In those cases we have to subtract one from the index byte.

six-bit chunks blow the size budget

If we use the key in 6-bit chunks, we get most of the lookup complexity of 5-bit chunks, plus added size problems.

The bitmap now is 2^6 == 64 bits, a whole word. So we have to expand trie nodes to three words: bitmap, index, and twigs pointer. This means there's now a word of wasted space in the leaf nodes, because they have to be the same size as branch nodes, but they only need two words for key and value pointers.

My code borrows the caller's existing pointer to the key, rather than taking a copy of the key for its own use. If I change it to take a copy, then I could save the wasted space in big leaf nodes by using it to store short keys inline. But for now I've just done enough of a proof of concept to see what is the effect of wider branches, without re-doing the memory ownership model.

The other potential loss for 6-bit chunks is the potential size of the twig array, up to 3*8*64 = 1536 bytes. This is 24 cache lines, so prefetching is unlikely to be much help if the twig index is large, because the prefetch guess is way off.

benchmark process

The data sets I use are:

in-b9: words extracted from the BIND-9 source code
- 63k lines, 784k bytes, av. len. 12.3
- lots of common prefixes
in-dns: list of domain names under cam.ac.uk.
- 314k lines, 14 049k bytes, av. len. 44.7
in-rdns: same as above, but all keys reversed
- huge common prefixes
in-usdw: /usr/share/dict/words
- 236k lines, 2 493k bytes, av. len. 10.7
- English dictionary
top-1m: the Alexa top million web site list
- 1000k lines, 15 559k bytes, av. len. 15.6
- numeric rankings removed

A benchmark run takes four time measurements (lower is better): read the data into memory (without timing), "load" it into the trie, "search" for 1 000 000 pseudorandomly selected keys, "mutate" (i.e. add or delete) 1 000 000 pseudorandomly selected keys, reload any missing items (without timing), then "free" the entire trie one key at a time. A benchmark run has an explicit PRNG seed.

Each benchmark iteration has a fixed PRNG seed for all runs. It performs benchmark runs with every data set and every version of the data structure under test (5 x 4 runs). The versions of the data structure are:

cb: crit-bit tries
qp: quadbit popcount patricia tries
fp: 5-bit version of qp tries
wp: 6-bit version of qp tries

Benchmarks are iterated and we take the mean.

The following table gives some information about the sizes of the data structures: number of leaf nodes, number of branch nodes, "overhead" i.e. number of words of storage per leaf (not counting key and value pointers), and the average depth of the trie. The in-dns file seems to be particularly weird.

         leaves branches  OH   depth

    cb    63603   63602  2.00  29.93  in-b9
    qp    63603   35414  1.11  12.45
    fp    63603   32490  1.02  10.56
    wp    63603   31311  2.48   9.27

    cb   314309  314308  2.00  42.19  in-dns
    qp   314309   95094  0.61  17.53
    fp   314309  119916  0.76  16.87
    wp   314309   95866  1.92  14.20

    cb   314309  314308  2.00  38.53  in-rdns
    qp   314309  188252  1.20  20.67
    fp   314309  186714  1.19  18.78
    wp   314309  173500  2.66  17.93

    cb   235882  235881  2.00  26.50  in-usdw
    qp   235882  169434  1.44  12.46
    fp   235882  154272  1.31  10.34
    wp   235882  144357  2.84   9.02

    cb  1000000  999999  2.00  30.76  top-1m
    qp  1000000  617027  1.23  12.52
    fp  1000000  560985  1.12  10.45
    wp  1000000  514113  2.54   8.95

results

Overall, 5-bit chunks win over 4-bit chunks. 6-bit chunks are not a win on my slow computers, but are a win on a bigger Xeon. (All these machines are pretty old.)

laptop
- MacBook Core 2 Duo P8600 2.4GHz, Mac OS 10.11.1
- Apple LLVM version 7.0.0 (clang-700.0.72)
- -O3 -march=native -mtune=native
- __builtin_popcount

	in-b9	in-dns	in-rdns	in-usdw	top-1m
cb	0.030	0.173	0.151	0.095	1.438	free
qp	0.028	0.153	0.155	0.099	1.017
fp	0.026	0.165	0.144	0.096	0.924
wp	0.029	0.164	0.155	0.100	0.945
cb	0.052	0.181	0.199	0.081	1.425	load
qp	0.053	0.250	0.276	0.131	1.170
fp	0.051	0.233	0.265	0.129	1.102
wp	0.053	0.231	0.279	0.130	1.127
cb	0.489	1.314	1.221	1.064	1.855	mutate
qp	0.461	0.997	1.046	0.833	1.292
fp	0.425	0.988	1.005	0.773	1.163
wp	0.461	0.995	1.051	0.822	1.131
cb	0.465	1.090	1.079	0.861	1.756	search
qp	0.387	0.883	0.921	0.708	1.139
fp	0.358	0.836	0.888	0.657	1.002
wp	0.426	0.828	0.932	0.710	0.973

desktop
- Dell Core 2 Duo E8500 3.16GHz, FreeBSD 8
- gcc 4.2.1
- -O3 -march=native -mtune=native
- HAVE_SLOW_POPCOUNT

	in-b9	in-dns	in-rdns	in-usdw	top-1m
cb	0.017	0.085	0.071	0.041	0.809	free
qp	0.020	0.102	0.087	0.056	0.643
fp	0.020	0.102	0.085	0.055	0.603
wp	0.020	0.103	0.091	0.057	0.669
cb	0.027	0.125	0.142	0.051	0.940	load
qp	0.029	0.164	0.188	0.079	0.734
fp	0.028	0.159	0.186	0.080	0.693
wp	0.028	0.151	0.194	0.080	0.733
cb	0.380	0.889	0.829	0.642	1.307	mutate
qp	0.356	0.711	0.764	0.518	0.940
fp	0.342	0.717	0.749	0.481	0.879
wp	0.345	0.741	0.815	0.561	0.907
cb	0.301	0.740	0.730	0.521	1.202	search
qp	0.262	0.592	0.661	0.443	0.798
fp	0.251	0.596	0.644	0.412	0.730
wp	0.255	0.610	0.704	0.488	0.737

server
- Generic Xeon E5620 2.40GHz, SLES 11 SP3
- gcc 4.3.4
- -O3 -march=native -mtune=native -mpopcnt
- __builtin_popcnt

	in-b9	in-dns	in-rdns	in-usdw	top-1m
cb	0.020	0.106	0.095	0.041	0.789	free
qp	0.014	0.087	0.073	0.042	0.457
fp	0.013	0.089	0.074	0.042	0.414
wp	0.013	0.081	0.071	0.039	0.415
cb	0.038	0.200	0.225	0.073	1.172	load
qp	0.026	0.160	0.191	0.077	0.667
fp	0.027	0.160	0.193	0.080	0.612
wp	0.024	0.139	0.183	0.071	0.615
cb	0.481	0.865	0.794	0.641	1.261	mutate
qp	0.285	0.495	0.550	0.385	0.721
fp	0.270	0.520	0.542	0.355	0.662
wp	0.256	0.494	0.534	0.350	0.634
cb	0.473	0.880	0.844	0.611	1.292	search
qp	0.297	0.522	0.607	0.391	0.737
fp	0.275	0.526	0.584	0.360	0.668
wp	0.264	0.506	0.582	0.361	0.644

Written by Tony Finch dot@dotat.at https://dotat.at/; You may do anything with this. It has no warranty. https://creativecommons.org/publicdomain/zero/1.0/