See 2022-10-12 for an update
I have reached the point in my main project (qp-trie for BIND) where I want to collect performance statistics. Things like:
In a wider context, those using BIND for large authoritative server systems would like better statistics on things like:
For my own purposes, I’m no longer satisfied with summarizing performance with just the mean and standard deviation; and in an operational context, it’s useful to know about the existence of outliers (rare outliers can be hidden by simple summary statistics) and the size of an outlier can be a useful clue.
So, we need histograms!
I have written a proof-of-concept histogram data structure called
hg64
, which you can get from:
It can load 1 million data items in about 5ms (5ns per item), and uses a few KiB of memory.
Here I will write about how hg64
came to be.
The DataDog blog has a very nice introduction to histograms and quantile sketches in general, and DDSketch in particular. I particularly liked it because the design of DDSketch is simple and it demystified things for me.
Briefly, the idea is that by choosing a relative error (e.g. 1%, or two significant digits), you implicitly choose how to map values to buckets: anything within 1% of the bucket’s nominal value is in the same bucket. And buckets are spaced logarithmically, so a large range can be covered by a relatively small number of buckets.
I think it’s worth distinguishing between quantile sketches that have a target rank error and those that have a target value error. As well as the comparison in the DataDog blog article, there’s a lot of informative discussion about quantile sketches in the Apache DataSketches documentation. DDSketch is just a histogram, like the ones we learned about in school, with a logarithmic horizontal axis.
Well, not “just” a histogram, since DDSketch has a “collapsing” mechanism that sacrifices accuracy for small measurements when the data structure gets too big. (Generally we are more interested in the large outliers.) There’s a variant called UDDSketch that collapses all buckets uniformly, which I liked because it is effectively a self-tuning histogram.
So I thought I would try implementing my own UDDSketch.
A histogram data structure is based on two questions:
What is the underlying data structure? As well as directly finding buckets by key, we would like to be able to search by rank so that we can calculate quantiles efficiently.
How do we map a value to the key that identifies its bucket? This affects the choice of data structure, because it’s easier if the keys have a smaller range, or are more densely numbered.
Existing DDSketch implementations have an abstraction layer between the histogram logic and an underlying storage structure. I wanted to simplify my code by making a firm choice, so I tried out a few alternatives:
A red-black tree. Each node has left and right pointers, the bucket key, bucket count, and (for searching by rank) a sum of the counts of every bucket in the subtree.
This was fiddly to implement, used a lot of memory, and was not particularly fast.
A sorted array. Each element has a bucket key and a count. Use binary search to find a bucket by value, and linear search to find a bucket by rank.
About twice as fast as the tree, and 0.2x memory usage.
I thought to myself, I know a trick to find things faster than a binary search…
A Bagwell array-mapped trie with three levels, enough for 18 bit keys.
I did not finish implementing this, because at the same time it was becoming more clear to me what I needed from my bucket keys, and that 18 bits was excessive.
An short array of popcount
packed vectors. Each vector also
keeps a sum total of the counts in its buckets for faster rank
searches. Keys are 8..12 bits.
About 10x faster than the sorted array, down to around 5ns to add an item. (The speed increase corresponds fairly accurately to the smaller number of memory accesses.) Tolerable increase in memory usage.
It’s definitely inspired by a Bagwell trie, but flattened too much to be tree-like.
DDSketch counts floating point values. It maps a value to an integer bucket number with a neat little formula:
ceil(log(value) / log(gamma))
Rounding with ceil()
is where the precision is discarded so that a
range of values land in the same bucket. The base of the logarithm
gamma
is derived from the chosen precision.
I adjusted this formula in a couple of ways: I precalculated 1 / log(gamma)
, because a multiplication is faster than a log and a div.
And I added a bias so that I did not have to worry about negative
numbers (such as weirdness from undefined behaviour when shifting
signed integers in C).
At this point I was thinking, my bucket bias is like the exponent bias in IEEE754; and I was thinking, how can I get rid of this other logarithm?
Another histogram data structure, circllhist
(Circonus log-linear
histogram), buckets values by converting them to the form
X.Y * 10^Z
, i.e. exactly two decimal significant digits. In effect,
circllhist
uses a custom low-precision decimal floating point
format.
So, if my values are floating point, they are already log-linear, and
I can adjust my precision (my bucket size) by adjusting how many bits
of mantissa I use. Maybe I could just use bfloat16
as my
key? Basically just the top 16 bits of a single-precision IEEE754
number.
I posted a few tweets about my histogram experiments and got a number of helpful replies.
That discussion reminded me that the values I want to handle - time
measurements, memory usage - start off as integers, in particular
64-bit integers. (32 bits doesn’t have quite enough range for e.g.
measuring hours to microsecond precision, and it’s smaller than
size_t
on most modern platforms.)
So, our floating-point (or log-linear) keys need 6 bits of exponent to cover the range of a 64-bit unsigned integer, and (roughly) 6 bits of mantissa to get two decimal digits of precision.
To calculate the exponent, we can use CLZ
(count leading zeroes),
then assemble the key with a small amount of bit shuffling:
clz = __builtin_clzll(value);
exponent = 64 - 6 - clz;
mantissa = value >> (exponent - 1);
key = exponent * 64 + mantissa % 64;
I compared this with converting value
to float
and pulling out the
bits I wanted, and using float
was neither simpler nor faster. (And
float
introduced tricky rounding issues.)
This 6+6 (or 64+64) format meshes beautifully with 64-bit popcount
packed arrays, so I called the result hg64
.
But 6 bits of mantissa is more than necessary for 2 s.f. of precision: 5 bits is about right. And I expect that in many cases 2 bits of mantissa or 1 s.f. will be plenty. (My colleagues have suggested simple order-of-magnitude bucketing for time measurements.)
So I added a coarse precision adjustment, but apart from that, hg64
does not have any configuration. For instance, unlike DDSketch, size
limits are implicit in the design: for 2 s.f. precision hg64
can’t
grow larger than 17 KiB, and for 1 s.f. the max is 2 KiB. And hg64
remains consistently fast regardless of how much data it is fed.
It was fun making hg64
and I’m pleased with the result.
See 2022-10-12 for an update