Since my blog post about
hg64 histograms in July I have
continued occasional hacking on the code, so it’s time to post an
As before, my proof-of-concept histogram data structure
My original version used
popcount packed arrays to avoid storing
empty counters. It’s difficult to make this trick efficiently
thread-safe, so I switched it out for a simpler but slightly less
Now there is a top-level array of pointers, one for each possible logarithm of a value, 0 .. 63. Each pointer is assigned at most once (with an atomic compare-and-swap) to an array of counters. The sub-array size depends on the precision (number of significant digits) of the histogram.
separate summary totals
When calculating ranks and percentiles,
hg64 uses summary totals so
that it can skip over most of the counter arrays. These totals can’t
be updated in multithreaded code in a way that is both fast and
Instead, you need to take an immutable snapshot of the histogram. This separate copy is immune from multithreading issues so its summary totals are always consistent with its counter arrays.
And we save work in the
hg64_inc() hot path by removing summary
totals from mutable histograms.
branch-free key calculation
Paul Khuong encouraged me to understand his code for log-linear
bucketing which does the same job as my function to map a 64-bit
value to a low-precision floating-point key. Paul’s
function uses a clever trick which allows me to remove a branch from
the fast path.
Together, these improvements (unpacked arrays, no summary totals, branch-free bucketing) mean that in the best case (single threaded, histogram is warmed up and in cache) it takes about 2.3 nanoseconds to atomically increment a histogram counter on my M1 Mac. That’s about half the time of my original code back in July.
space / time tradeoffs
hg64 test code tries hammering a histogram concurrently from
multiple threads. The time measurements mostly show the cost of
high-contention atomic operations - it can be 50x slower than
single-threaded increments on my M1 Mac.
This slowdown isn’t a problem for me, because I aim to use
situations where updates are relatively rare; they need to be
thread-safe but cheap.
hg64 test code illustrates an alternative setup where each
thread has its own histogram, which can be updated fast with zero
contention. The per-thread histograms can be merged to get a
whole-system histogram (even while the update threads are active,
though the test code does not do that).
The real test is that we get the same results from the single concurrent histogram, and the merged per-thread histograms.
I still don’t have any useful way to export or import histogram data. BIND’s built-in statistics web server exports in JSON and XML; I think it would also be useful to support a compressed binary format, and CSV.
And I would like to be able to visualize the data without having a whole prometheus + grafana setup!