This is a follow-up to my unfinished series of posts last month.
On the Friday of the beer festival I found myself rather worn out. I managed to write a missing function (delete an element copy-on-write style) but that was about it.
When I got back to work after the bank holiday there was a bunch of stuff demanding more urgent attention so I wasn’t able to find time to finish the qp trie hacking until this week.
The nice thing about testing data structures is you can get a very long way with randomized testing and a good set of invariant checks.
When there’s a bug I tend to rely on voluminous dumps of how the trie structure changes as it mutates, with pointer values so I can track allocation and reuse. I stare at them wondering how the heck that pointer got where it shouldn’t be, until enlightenment dawns.
There were a number of notable bugs:
Another variable rename error from the big refactor. I think that was the last refactoring bug. I got away with that pretty well, I think :-)
Memory leaks in the commit/rollback functions. Remember to free the top-level structure, dumbass!
COW pushdown works on a “node stack” structure, which is a list of pointers to the trie nodes on the spine from the root to the leaf of interest. Pushdown involves making a copy of each branch node, so that the copies can be exclusively owned by the new trie where they are safe to mutate. The bug was that the pushdown function didn’t update the child pointer in the node stack to point to the new copy instead of the old one. A relatively small oversight which caused interesting corruption and much staring at runic dump output.
During my Beer Festival week thinking, I completely forgot to consider the string keys. The Knot qp trie code makes copies of the keys for its own use, so it needs to keep track of their sharing state during a COW transaction so that they can be freed at the right time. This was quite a big thing to forget! Fortunately, since the keys are owned by the qp trie code, I could change their shape to add a COW flag and fix the use-after-free and memory leak bugs…
Having dealt with those, I have at last submitted my patches to CZ.NIC! There is still a lot of work to do, changing the DNS-specific parts of Knot so that UPDATE/IXFR transactions use the COW API instead of a copy of the entire zone.
One thing I’m not entirely sure about is whether I have been working with a valid memory model; in particular I have assumed that it’s OK to flip COW flags in shared words without any interlocks. The COW flags are only written in a single threaded way by the mutation thread; the concurrent read-only threads pay no attention to the COW flags, though they read the words that hold the flags.
If this is wrong, I might need to sprinkle some RCU calls through the qp trie implementation to ensure it works correctly…