never mind the quadbits, feel the width!

benchmarking wider-fanout versions of qp tries

QP tries are faster than crit-bit tries because they extract up to four bits of information out of the key per branch, whereas crit-bit tries only get one bit per branch. If branch nodes have more children, then the trie should be shallower on average, which should mean lookups are faster.

So if we extract even more bits of information from the key per branch, it should be even better, right? But, there are downsides to wider fanout.

I originally chose to index by nibbles for two reasons: they are easy to pull out of the key, and the bitmap easily fits in a word with room for a key index.

If we index keys by 5-bit chunks we pay a penalty for more complicated indexing.

If we index keys by 6-bit chunks, trie nodes have to be bigger to fit a bigger bitmap, so we pay a memory overhead penalty.

How do these costs compare to the benefits of wider branches?

(The next two sections expand on the downsides; skip to "results" for the tl;dr.)

five-bit fields are fiddly

If we use the key 5 bits at a time, the bitmap is 2^5 == 32 bits. This leaves room in a 64 bit word for a 28 bit index (limiting keys to 256 Mbytes), three bits to store the shift/alignment of the 5 bit field relative to 8 bit bytes, and a 1 bit branch/leaf flag.

When indexing by 4 bit chunks, we only need to look at one byte of key at a time; the chunk can be in the top half or the bottom half. For larger chunks we often need to extract a chunk that overlaps two bytes. To identify a 5-bit chunk we store the byte index of the first byte it overlaps, and how much to shift the pair of bytes to extract the chunk we want. (The aim is to avoid complicated arithmetic in the inner loops.) This gives us a lot more freedom to identify chunks than we actually use, because although there are 8 possible shifts at any particular byte index, we only use one or two of them. The shifts have to be consistent with representing the key as a string of successive 5-bit chunks.

The diagram below shows the correspondence between byte indexes (i) and valid chunk alignment shifts (s). The shift value is how much to shift the chunk left to move it to the top of its byte, so shift values and indexes increase from left to right. This fact is used when working out how deep a new branch node is; for instance, when i%5==1 the possible shifts are 2 and 7, so the branch for the chunk with shift 2 has to be a parent of 7. Although shift 5 overlaps the same byte, it can only occur when i%5=0.

 i%5==0  i%5==1  i%5==2  i%5==3  i%5==4
|       |       |       |       |       | bytes
|    |    |    |    |    |    |    |    | chunks
 s=0  s=5  s=2  s=7  s=4  s=1  s=6  s=3

When we are working out which is the first chunk where two keys differ (so we can work out where to insert a new branch) we start off by scanning to find the first byte that differs. But, if the first differing bits are in the top of the byte, they can be in a chunk which overlaps the preceding byte. In those cases we have to subtract one from the index byte.

six-bit chunks blow the size budget

If we use the key in 6-bit chunks, we get most of the lookup complexity of 5-bit chunks, plus added size problems.

The bitmap now is 2^6 == 64 bits, a whole word. So we have to expand trie nodes to three words: bitmap, index, and twigs pointer. This means there's now a word of wasted space in the leaf nodes, because they have to be the same size as branch nodes, but they only need two words for key and value pointers.

My code borrows the caller's existing pointer to the key, rather than taking a copy of the key for its own use. If I change it to take a copy, then I could save the wasted space in big leaf nodes by using it to store short keys inline. But for now I've just done enough of a proof of concept to see what is the effect of wider branches, without re-doing the memory ownership model.

The other potential loss for 6-bit chunks is the potential size of the twig array, up to 3*8*64 = 1536 bytes. This is 24 cache lines, so prefetching is unlikely to be much help if the twig index is large, because the prefetch guess is way off.

benchmark process

The data sets I use are:

A benchmark run takes four time measurements (lower is better): read the data into memory (without timing), "load" it into the trie, "search" for 1 000 000 pseudorandomly selected keys, "mutate" (i.e. add or delete) 1 000 000 pseudorandomly selected keys, reload any missing items (without timing), then "free" the entire trie one key at a time. A benchmark run has an explicit PRNG seed.

Each benchmark iteration has a fixed PRNG seed for all runs. It performs benchmark runs with every data set and every version of the data structure under test (5 x 4 runs). The versions of the data structure are:

Benchmarks are iterated and we take the mean.

The following table gives some information about the sizes of the data structures: number of leaf nodes, number of branch nodes, "overhead" i.e. number of words of storage per leaf (not counting key and value pointers), and the average depth of the trie. The in-dns file seems to be particularly weird.

         leaves branches  OH   depth

    cb    63603   63602  2.00  29.93  in-b9
    qp    63603   35414  1.11  12.45
    fp    63603   32490  1.02  10.56
    wp    63603   31311  2.48   9.27

    cb   314309  314308  2.00  42.19  in-dns
    qp   314309   95094  0.61  17.53
    fp   314309  119916  0.76  16.87
    wp   314309   95866  1.92  14.20

    cb   314309  314308  2.00  38.53  in-rdns
    qp   314309  188252  1.20  20.67
    fp   314309  186714  1.19  18.78
    wp   314309  173500  2.66  17.93

    cb   235882  235881  2.00  26.50  in-usdw
    qp   235882  169434  1.44  12.46
    fp   235882  154272  1.31  10.34
    wp   235882  144357  2.84   9.02

    cb  1000000  999999  2.00  30.76  top-1m
    qp  1000000  617027  1.23  12.52
    fp  1000000  560985  1.12  10.45
    wp  1000000  514113  2.54   8.95


Overall, 5-bit chunks win over 4-bit chunks. 6-bit chunks are not a win on my slow computers, but are a win on a bigger Xeon. (All these machines are pretty old.)


Written by Tony Finch; You may do anything with this. It has no warranty.