Prefetching tries

The inner loop in qp trie lookups is roughly

    while(t->isbranch) {
        b = 1 << key[t->index]; // simplified
        if((t->bitmap & b) == 0) return(NULL);
        t = t->twigs + popcount(t->bitmap & b-1);

The efficiency of this loop depends on how quickly we can get from one indirection down the trie to the next. There is quite a lot of work in the loop, enough to slow it down significantly compared to the crit-bit search loop. Although qp tries are half the depth of crit-bit tries on average, they don't run twice as fast. The prefetch compensates in a big way: without it, qp tries are about 10% faster; with it they are about 30% faster.

I adjusted the code above to emphasize that in one iteration of the loop it accesses two locations: the key, which it is traversing linearly with small skips, so access is fast; and the tree node t, whose location jumps around all over the place, so access is slow. The body of the loop calculates the next location of t, but we know at the start that it is going to be some smallish offset from t->twigs, so the prefetch is very effective at overlapping calculation and memory latency.

It was entirely accidental that prefetching works well for qp tries. I was trying not to waste space, so the thought process was roughly, a leaf has to be two words:

    struct Tleaf { const char *key; void *value; };

Leaves should be embedded in the twig array, to avoid a wasteful indirection, so branches have to be the same size as leaves.

    union Tnode { struct Tleaf leaf; struct Tbranch branch; };

A branch has to have a pointer to its twigs, so there is space in the other word for the metadata: bitmap, index, flags. (The limited space in one word is partly why qp tries test a nibble at a time.) Putting the metadata about the twigs next to the pointer to the twigs is the key thing that makes prefetching work.

One of the inspirations of qp tries was Phil Bagwell's hash array mapped tries. HAMTs use the same popcount trick, but instead of using the PATRICIA method of skipping redundant branch nodes, they hash the key and use the hash as the trie index. The hashes should very rarely collide, so redundant branches should also be rare. Like qp tries, HAMTs put the twig metadata (just a bitmap in their case) next to the twig pointer, so they are friendly to prefetching.

So, if you are designing a tree structure, put the metdadata for choosing which child is next adjacent to the node's pointer in its parent, not inside the node itself. That allows you to overlap the computation of choosing which child is next with the memory latency for fetching the child pointers.

Written by Tony Finch; You may do anything with this. It has no warranty.