Recently, Alex Kladov (@matklad) wrote on the TigerBeetle blog about swarm testing data structures. It’s a neat post about randomized testing in Zig. I wrote a comment with an idea that was new to Alex, so I’m reposting a longer version here.

differential testing

A common approach to testing data structures is to write a second, reference implementation that has the same API but is simpler and/or more obviously correct, even though it uses more memory, or is slower, or less concurrent, or otherwise not up to production quality.

Then, run the production implementation and the reference implementation on the same sequence of operations, and verify that they produce the same results.

Any difference is either a bug in the production implementation (probably), a bug in the reference implementation (unlucky), or a bug in the tests (unfortunate).

This is a straightforward differential testing pattern.
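
In outline, a differential test loop might look like the following sketch in C, where the prod_* and ref_* functions are hypothetical stand-ins for the shared API of the production and reference implementations:

    #include <assert.h>
    #include <stdlib.h>

    /* Differential testing sketch: drive both implementations with the
     * same random operations and require identical results. The prod_*
     * and ref_* calls are hypothetical stand-ins for the two
     * implementations' common API. */
    void differential_test(struct prod *prod, struct ref *ref, int iterations) {
        for (int i = 0; i < iterations; i++) {
            int key = rand() % 1000; /* arbitrary key space */
            switch (rand() % 3) {
            case 0: assert(prod_add(prod, key) == ref_add(ref, key)); break;
            case 1: assert(prod_del(prod, key) == ref_del(ref, key)); break;
            case 2: assert(prod_get(prod, key) == ref_get(ref, key)); break;
            }
        }
    }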

problems

There are a couple of difficulties with this kind of basic differential testing.

grow / shrink

The TigerBeetle article talks about adjusting the probabilities of different operations on the data structure to try to explore more edge cases. To motivate the idea, it uses the probabilities of adding and deleting items: if adding and deleting have equal probability, the test finds it hard to grow the data structure to interesting sizes that might expose bugs.

Unfortunately, if the probability of add is greater than del, then the data structure tends to grow without bound. If the probability of del is greater than add, then it tries to shrink from nothing: worse than equal probabilities! They could preload the data structure to test how it behaves when it shrinks, but a fixed set of probabilities per run is not good at testing both growth and shrinkage of the same data structure in the same run.

One way to improve this kind of test is to adjust the probability of add and del dynamically: make add more likely when the data structure is small, and del more likely when it is big. And maybe make add more likely in the first half of a test run and del more likely in the second half.
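
As a sketch in C, the bias might be computed like this (the halving and the halfway cut-off are arbitrary tuning choices, not anything from the TigerBeetle article):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Sketch: decide between add and del dynamically. The probability
     * of add falls as the structure fills up, and is halved in the
     * second half of the run; both adjustments are arbitrary knobs. */
    bool choose_add(size_t size, size_t capacity, int iter, int iterations) {
        double p_add = 1.0 - (double)size / (double)capacity;
        if (iter > iterations / 2)
            p_add /= 2;
        return rand() < p_add * RAND_MAX;
    }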

random elements

The TigerBeetle article glosses over the question of where the tests get fresh elements to add to the data structure. And its example is chosen so it doesn’t have to think about which elements get deleted.

In my experience writing data structures for non-garbage-collected languages, I had to be more deliberate about how to create and destroy elements. That led to a style of test that’s more element-centric, as Alex described it.

element-wise testing

Change the emphasis: instead of testing that two implementations match, test that one implementation obeys the expected behaviour. There’s no need to write a drop-in replacement reference implementation!

What I typically do is pre-allocate an array of elements, with slots that I can set to keep track of how each element relates to the data structure under test. The most important property is whether the element has been added or deleted, but there might be others related to ordering of elements, or values associated with keys, and so on.
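
In C, the bookkeeping might look like this sketch (the fields and sizes are illustrative, not prescriptive):

    /* Sketch: pre-allocated elements whose slots shadow what the data
     * structure under test should think about each element. Which
     * fields you need depends on the structure; these are examples. */
    enum elem_state { ABSENT, PRESENT };

    struct elem {
        enum elem_state state;  /* has it been added or deleted? */
        int key;                /* the element's identity */
        int value;              /* expected value associated with the key */
    };

    #define N_ELEMS 1024        /* arbitrary; fixed for the whole run */
    static struct elem elems[N_ELEMS];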

test loop

Each time round the loop, choose at random an element from the array, and an action such as add / del / get / …

Then, if it makes sense, perform the operation on the data structure with the element.

For example, you might skip an add action if the element is already in the data structure, unless you can try to add it and expect an error.
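
Continuing the sketch above, with a hypothetical ds_* API for the structure under test and -1 as an assumed “not found” result:

    /* Sketch of the test loop: random element, random action, performed
     * only when it makes sense for that element's recorded state. */
    for (int i = 0; i < iterations; i++) {
        struct elem *e = &elems[rand() % N_ELEMS];
        switch (rand() % 3) {
        case 0: /* add */
            if (e->state == ABSENT) {
                ds_add(ds, e->key, e->value);
                e->state = PRESENT;
            }   /* else: could try the add anyway and expect an error */
            break;
        case 1: /* del */
            if (e->state == PRESENT) {
                ds_del(ds, e->key);
                e->state = ABSENT;
            }
            break;
        case 2: /* get */
            if (e->state == PRESENT)
                assert(ds_get(ds, e->key) == e->value);
            else
                assert(ds_get(ds, e->key) == -1); /* assumed "not found" */
            break;
        }
    }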

data structure size

This strategy tends to grow the data structure until about 50% of the pre-allocated elements are inserted, then it makes a random walk around this 50% point. Random walks can diverge widely from their central point both in theory and in practice, so this kind of testing is reasonably effective at both growing and (to a lesser extent) shrinking the data structure.

invariants

I usually check some preconditions before an action, to verify that the data structure matches the expected properties of the chosen element. This helps to detect sooner when an action on one element has corrupted another.

After performing the action and updating the element’s properties, I check the updated properties as a postcondition, to make sure the action had the expected effects.
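
In the sketch’s terms, that wraps each action like this (ds_contains() is another hypothetical query, and do_action() stands in for the switch above):

    /* Sketch: verify that the structure agrees with an element's slots.
     * ds_contains() is a hypothetical membership query. */
    static void check_elem(struct ds *ds, struct elem *e) {
        if (e->state == PRESENT) {
            assert(ds_contains(ds, e->key));
            assert(ds_get(ds, e->key) == e->value);
        } else {
            assert(!ds_contains(ds, e->key));
        }
    }

    /* around each action in the test loop: */
    check_elem(ds, e);   /* precondition: element still as expected */
    do_action(ds, e);    /* perform the action and update e's slots */
    check_elem(ds, e);   /* postcondition: the action had its effect */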

performance

John Regehr’s great tutorial, “how to fuzz an ADT implementation”, recommends writing a checkRep() function that thoroughly verifies a data structure’s internal consistency. A checkRep() function is a solid gold testing tool, but it is O(n) at least and typically very slow.

If you call checkRep() frequently during testing, your tests slow down dramatically as your data structure gets larger.

I like my per-element invariants to be local and ideally O(1) or O(log n) at worst, so they don’t slow down the tests too much.
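
A cheap compromise, continuing the sketch, is to run the local checks on every iteration and the expensive global check only occasionally (the period is an arbitrary knob):

    #define CHECKREP_PERIOD 10000   /* arbitrary tuning knob */

    check_elem(ds, e);              /* local, O(1), every iteration */
    if (i % CHECKREP_PERIOD == 0)
        checkRep(ds);               /* global, O(n) at least, rarely */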

conclusion

Recently I’ve used this pattern to exhibit concurrency bugs in an API that’s hard to make thread-safe. Writing the tests has required some cunning to work out what invariants I can usefully maintain and test; what variety of actions I can use to stress those invariants; and what mix of elements + actions I need so that my tests know which properties of each element should be upheld and which can change.

I’m testing multiple implementations of the same API, trying to demonstrate which is safest. Differential testing can tell me that implementations diverge, but not which is correct, whereas testing properties and invariants more directly tells me whether an implementation does what I expect. (Or gives me a useless answer when my tests are weak.)

Which is to say that this kind of testing is a fun creative challenge. I find it a lot more rewarding than example-based testing.