See 2022-10-12 for an update
I have reached the point in my main project (qp-trie for BIND) where I want to collect performance statistics. Things like:
- how long it takes to compact a trie;
- how much memory was recovered by compaction;
- size of trie before and after compaction.
In a wider context, those using BIND for large authoritative server systems would like better statistics on things like:
- incoming and outgoing zone transfer times;
- zone refresh query latency.
For my own purposes, I’m no longer satisfied with summarizing performance with just the mean and standard deviation; and in an operational context, it’s useful to know about the existence of outliers (rare outliers can be hidden by simple summary statistics) and the size of an outlier can be a useful clue.
So, we need histograms!
hg64
I have written a proof-of-concept histogram data structure called hg64, which you can get from:

It can load 1 million data items in about 5ms (5ns per item), and uses a few KiB of memory.

Here I will write about how hg64 came to be.
starting points
The DataDog blog has a very nice introduction to histograms and quantile sketches in general, and DDSketch in particular. I particularly liked it because the design of DDSketch is simple and it demystified things for me.
Briefly, the idea is that by choosing a relative error (e.g. 1%, or two significant digits), you implicitly choose how to map values to buckets: anything within 1% of the bucket’s nominal value is in the same bucket. And buckets are spaced logarithmically, so a large range can be covered by a relatively small number of buckets.
I think it’s worth distinguishing between quantile sketches that have a target rank error and those that have a target value error. As well as the comparison in the DataDog blog article, there’s a lot of informative discussion about quantile sketches in the Apache DataSketches documentation. DDSketch is just a histogram, like the ones we learned about in school, with a logarithmic horizontal axis.
Well, not “just” a histogram, since DDSketch has a “collapsing” mechanism that sacrifices accuracy for small measurements when the data structure gets too big. (Generally we are more interested in the large outliers.) There’s a variant called UDDSketch that collapses all buckets uniformly, which I liked because it is effectively a self-tuning histogram.
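To make the collapsing idea concrete, here is a minimal sketch of a uniform collapse (my own illustrative code, not taken from any particular implementation): adjacent buckets are merged pairwise, which squares gamma and so roughly doubles the relative error.

#include <stddef.h>
#include <stdint.h>

/* Illustrative UDDSketch-style uniform collapse: merge buckets pairwise
   so each new bucket covers what two old buckets covered. After the
   collapse the base gamma becomes gamma*gamma, so the relative error
   roughly doubles. Names and layout are mine, not UDDSketch's. */
static size_t
collapse_uniformly(uint64_t counts[], size_t nbuckets)
{
	size_t half = (nbuckets + 1) / 2;
	for (size_t i = 0; i < half; i++) {
		uint64_t merged = counts[2 * i];
		if (2 * i + 1 < nbuckets)
			merged += counts[2 * i + 1];
		counts[i] = merged;
	}
	return half; /* new number of buckets */
}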
So I thought I would try implementing my own UDDSketch.
two questions
A histogram data structure is based on two questions:
- What is the underlying data structure? As well as directly finding buckets by key, we would like to be able to search by rank so that we can calculate quantiles efficiently.
- How do we map a value to the key that identifies its bucket? This affects the choice of data structure, because it’s easier if the keys have a smaller range, or are more densely numbered.
data structures
Existing DDSketch implementations have an abstraction layer between the histogram logic and an underlying storage structure. I wanted to simplify my code by making a firm choice, so I tried out a few alternatives:
- A red-black tree. Each node has left and right pointers, the bucket key, the bucket count, and (for searching by rank) a sum of the counts of every bucket in the subtree.
  This was fiddly to implement, used a lot of memory, and was not particularly fast.
- A sorted array. Each element has a bucket key and a count. Use binary search to find a bucket by value, and linear search to find a bucket by rank.
  About twice as fast as the tree, with 0.2x the memory usage.
  I thought to myself, I know a trick to find things faster than a binary search…
- A Bagwell array-mapped trie with three levels, enough for 18-bit keys.
  I did not finish implementing this, because at the same time it was becoming clearer to me what I needed from my bucket keys, and that 18 bits was excessive.
- A short array of popcount-packed vectors. Each vector also keeps a sum total of the counts in its buckets for faster rank searches. Keys are 8..12 bits. (See the sketch after this list.)
  About 10x faster than the sorted array, down to around 5ns to add an item. (The speed increase corresponds fairly accurately to the smaller number of memory accesses.) Tolerable increase in memory usage.
  It’s definitely inspired by a Bagwell trie, but flattened too much to be tree-like.
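Here is a minimal sketch of the popcount-packing trick, assuming each pack has a 64-bit occupancy bitmap and a dense array of counters; the struct and field names are mine, not hg64's exact layout.

#include <stdint.h>

/* Each pack covers 64 possible buckets but only allocates counters for
   the occupied ones. A bitmap records which buckets exist; the popcount
   of the bits below a bucket's position gives its index in the dense
   counter array. (Illustrative layout, not hg64's exact structure.) */
struct pack {
	uint64_t bitmap;  /* bit i set => bucket i is present */
	uint64_t total;   /* sum of counts in this pack, for rank searches */
	uint64_t *count;  /* dense array, one entry per set bit */
};

/* Find the counter for bucket i (0..63) in this pack, or NULL if absent. */
static uint64_t *
pack_find(struct pack *p, unsigned i)
{
	uint64_t bit = 1ULL << i;
	if ((p->bitmap & bit) == 0)
		return NULL;
	unsigned below = __builtin_popcountll(p->bitmap & (bit - 1));
	return &p->count[below];
}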
ddsketch keys
DDSketch counts floating point values. It maps a value to an integer bucket number with a neat little formula:
ceil(log(value) / log(gamma))
Rounding with ceil() is where the precision is discarded so that a range of values land in the same bucket. The base of the logarithm gamma is derived from the chosen precision.

I adjusted this formula in a couple of ways: I precalculated 1 / log(gamma), because a multiplication is faster than a log and a div. And I added a bias so that I did not have to worry about negative numbers (such as weirdness from undefined behaviour when shifting signed integers in C).
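As a minimal sketch of that adjusted calculation (my own illustrative code, with an arbitrary bias value; DDSketch derives gamma from the relative accuracy alpha as gamma = (1 + alpha) / (1 - alpha)):

#include <math.h>

#define KEY_BIAS 128 /* arbitrary illustrative bias to keep keys non-negative */

static double inv_log_gamma; /* precalculated 1 / log(gamma) */

static void
key_setup(double alpha) /* e.g. alpha = 0.01 for 1% relative error */
{
	double gamma = (1 + alpha) / (1 - alpha);
	inv_log_gamma = 1.0 / log(gamma);
}

static unsigned
value_to_key(double value)
{
	/* multiply by the precalculated reciprocal instead of dividing */
	return (unsigned)(ceil(log(value) * inv_log_gamma) + KEY_BIAS);
}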
At this point I was thinking, my bucket bias is like the exponent bias in IEEE754; and I was thinking, how can I get rid of this other logarithm?
floating point keys
Another histogram data structure, circllhist (Circonus log-linear histogram), buckets values by converting them to the form X.Y * 10^Z, i.e. exactly two decimal significant digits. In effect, circllhist uses a custom low-precision decimal floating point format.

So, if my values are floating point, they are already log-linear, and I can adjust my precision (my bucket size) by adjusting how many bits of mantissa I use. Maybe I could just use bfloat16 as my key? Basically just the top 16 bits of a single-precision IEEE754 number.
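As an illustration of that idea (not what hg64 ended up doing), a key could be pulled straight from the top 16 bits of the value's single-precision representation:

#include <stdint.h>
#include <string.h>

/* Illustrative only: a bfloat16-style bucket key, i.e. the sign bit,
   8 exponent bits, and the top 7 mantissa bits of an IEEE754 float. */
static uint16_t
value_to_bf16_key(float value)
{
	uint32_t bits;
	memcpy(&bits, &value, sizeof(bits)); /* well-defined type punning */
	return (uint16_t)(bits >> 16);
}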
integer values
I posted a few tweets about my histogram experiments and got a number of helpful replies.
That discussion reminded me that the values I want to handle - time measurements, memory usage - start off as integers, in particular 64-bit integers. (32 bits doesn’t have quite enough range for e.g. measuring hours to microsecond precision, and it’s smaller than size_t on most modern platforms.)
So, our floating-point (or log-linear) keys need 6 bits of exponent to cover the range of a 64-bit unsigned integer (exponents up to 63), and (roughly) 6 bits of mantissa to get two decimal digits of precision (a 6-bit mantissa spaces buckets about 1/64, or 1.6%, apart).
To calculate the exponent, we can use CLZ (count leading zeroes), then assemble the key with a small amount of bit shuffling:

unsigned clz = __builtin_clzll(value);
unsigned exponent = 64 - 6 - clz;
/* top 7 bits of the value; the % 64 below drops the implicit leading bit */
uint64_t mantissa = value >> (exponent - 1);
uint64_t key = exponent * 64 + mantissa % 64;
/* (values smaller than 64 would need a separate case, not shown here) */
I compared this with converting value to float and pulling out the bits I wanted, and using float was neither simpler nor faster. (And float introduced tricky rounding issues.)
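Going the other way, a key can be turned back into its bucket's nominal (lower-bound) value; this is my own sketch based on the key layout above, not code from hg64:

#include <stdint.h>

/* Invert the key calculation above: 6 bits of exponent, 6 bits of
   mantissa, with the implicit leading bit added back. Keys whose
   exponent is zero (the small-value case) are not handled here. */
static uint64_t
key_to_nominal_value(unsigned key)
{
	uint64_t exponent = key / 64;
	uint64_t mantissa = key % 64;
	return (64 + mantissa) << (exponent - 1);
}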
putting it together
This 6+6 (or 64+64) format meshes beautifully with 64-bit popcount-packed arrays, so I called the result hg64.
But 6 bits of mantissa is more than necessary for 2 s.f. of precision: 5 bits is about right. And I expect that in many cases 2 bits of mantissa or 1 s.f. will be plenty. (My colleagues have suggested simple order-of-magnitude bucketing for time measurements.)
So I added a coarse precision adjustment, but apart from that, hg64 does not have any configuration. For instance, unlike DDSketch, size limits are implicit in the design: for 2 s.f. precision hg64 can’t grow larger than 17 KiB, and for 1 s.f. the max is 2 KiB. And hg64 remains consistently fast regardless of how much data it is fed.
It was fun making hg64 and I’m pleased with the result.