crit-bit tries without allocation

Crit-bit tries have fixed-size branch nodes and a constant overhead per leaf, which means they can be used as an embedded lookup structure. Embedded lookup structures do not need any extra memory allocation; it is enough to allocate the objects that are to be indexed by the lookup structure.

An embedded lookup structure is a data structure in which the internal pointers used to search for an object (such as branch nodes) are embedded within the objects you are searching through. Each object can be a member of at most one of any particular kind of lookup structure, though an object can simultaneously be a member of several different kinds of lookup structure.

The BSD <sys/queue.h> macros are embedded linked lists. They are used frequently in the kernel, for instance in the network stack to chain mbuf packet buffers together. Each mbuf can be a member of a list and a tailq. There is also a <sys/tree.h> which is used by OpenSSH's privilege separation memory manager. Embedded red-black trees also appear in jemalloc.

embedded crit-bit branch node structure

DJB's crit-bit branch nodes require three words: bit index, left child, and right child; embedded crit-bit branches are the same with an additional parent pointer. (However, see the addendum below in which Simon Tatham shows the parent pointer is not necessary.)

    struct branch {
        uint index;
        void *twig[2];
        void **parent;
    };

The "twig" child pointers are tagged to indicate whether they point to a branch node or a leaf. The parent pointer normally points to the relevant child pointer inside the parent node; it can also point at the trie's root pointer, which means there has to be exactly one root pointer in a fixed place.

(An aside about how I have been counting overhead: DJB does not include the leaf string pointer as part of the overhead of his crit-bit tries, and I have followed his lead by not counting the leaf key and value pointers in my crit-bit and qp tries. So by this logic, although an embedded branch adds four words to an object, it only counts as three words of overhead. Perhaps it would be more honest to count the total size of the data structure.)

using embedded crit-bit tries

For most purposes, embedded crit-bit tries work the same as external crit-bit tries.

When searching for an object, there is a final check that the search key matches the leaf. This check needs to know where to find the search key inside the leaf object - it should not assume the key is at the start.

When inserting a new object, you need to add a branch node to the trie. For external crit-bit tries this new branch is allocated; for embedded crit-bit tries you use the branch embedded in the new leaf object.

deleting objects from embedded crit-bit tries

This is where the fun happens. There are four objects of interest:

The doomed leaf object to be deleted;
The victim branch object which needs to remain in the trie, although it is embedded in the doomed leaf object;
The parent branch object pointing at the leaf, which will be unlinked from the trie;
The bystander leaf object in which the parent branch is embedded, which remains in the trie.

The plan is that after unlinking the parent branch from the trie, you rescue the victim branch from the doomed leaf object by moving it into the place vacated by the parent branch. You use the parent pointer in the victim branch to update the twig (or root) pointer to follow the move.

Note that you need to beware of the case where the parent branch happens to be embedded in the doomed leaf object.

exercise for the reader

Are the parent pointers necessary?

Is the movement of branches constrained enough that we will always encounter a leaf's embedded branch in the course of searching for that leaf? If so, we can eliminate the parent pointers and save a word of overhead.

conclusion

I have not implemented this idea, but following Simon Tatham's encouragement I have written this description in the hope that it inspires someone else.

addendum

In response to the "exercise for the reader", Simon Tatham says:

I think so, yes, and here is (hopefully) a proof.

We aim to show that the trie always satisfies the property that every leaf's embedded branch object (if it exists at all – since we need one fewer branch than we have leaves, there's always one leaf whose embedded branch object is totally unused) is an ancestor of that leaf.

Proof is by induction, of course. The base case is that a trie with zero or one leaf obviously obeys the invariant. Now we have to show that insertion and deletion each preserve it.

Insertion. We need one new branch object which will point at the new leaf. Obviously we'll make this the branch object embedded in the new leaf itself. (In principle we could instead find and use the free one somewhere else in the tree, as mentioned above. But that would be deliberately perverse, so let's not.)

So the new leaf's embedded branch object is indeed an ancestor of the new leaf – specifically, the very closest ancestor.

And the new branch object gets inserted in the middle of some existing link of the trie, from one branch object (or the root) to another branch object (or a leaf). So the paths from the root to all pre-existing leaves are either unchanged, or they get this new branch object added in the middle of them. But insertion never removes a branch object from any leaf's ancestry, so it cannot break the invariant.

Deletion. In your terminology: the parent branch object is the closest ancestor of the doomed leaf object. The victim branch object is currently embedded in the doomed leaf object, and therefore, by the inductive hypothesis, it's currently an ancestor of doomed leaf object. Hence, the victim branch object is also ancestor of the parent branch object. (Unless they're the same object, as you point out, but in that case we have nothing to prove anyway.)

Also by the inductive hypothesis, the parent branch object is an ancestor of the bystander leaf object. (I.e. the bystander leaf is some leaf that you can reach by following whichever of the parent branch object's child pointers doesn't lead to the doomed leaf).

But if the victim branch object is an ancestor of the parent branch object, and the parent branch object in turn is an ancestor of the bystander leaf object, then it follows that the victim branch object must be an ancestor of the bystander leaf object. So embedding the victim branch object in the bystander leaf object surely cannot break the invariant!

Caveat. “I have only proved it correct, not tried it.” I won't be completely confident of this argument until I've seen it go through a long random test run :-)

And:

Actually there's a second easy special case, which you didn't warn implementors to watch out for :-) Another possibility is that the doomed leaf object might be the one whose embedded branch object isn't used at all, in which case there's no victim branch object in the first place and it would be a mistake to follow any pointers in an effort to salvage it!

In my original construction you would have to mark the unused branch in some way (e.g. NULL twigs). You always have to copy the victim branch on top of the removed parent branch, whether the victim is used or not; if the victim was unused the parent has to be marked as unused.

But given your proof we will discover during the traversal that the victim branch is unused - we will not discover the victim's parent twig so we will not have a pointer to update!

(Perhaps I need better terminology since I have both the parent twig of the victim branch, and the parent branch of the deleted leaf.)

Written by Tony Finch dot@dotat.at https://dotat.at/; You may do anything with this. It has no warranty. https://creativecommons.org/publicdomain/zero/1.0/