Since my blog post about hg64
histograms in July I have
continued occasional hacking on the code, so it’s time to post an
update.
As before, my proof-of-concept histogram data structure hg64
is
available from:
simpler allocation
My original version used popcount
packed arrays to avoid storing
empty counters. It’s difficult to make this trick efficiently
thread-safe, so I switched it out for a simpler but slightly less
memory-efficient design.
Now there is a top-level array of pointers, one for each possible logarithm of a value, 0 .. 63. Each pointer is assigned at most once (with an atomic compare-and-swap) to an array of counters. The sub-array size depends on the precision (number of significant digits) of the histogram.
separate summary totals
When calculating ranks and percentiles, hg64
uses summary totals so
that it can skip over most of the counter arrays. These totals can’t
be updated in multithreaded code in a way that is both fast and
consistent.
Instead, you need to take an immutable snapshot of the histogram. This separate copy is immune from multithreading issues so its summary totals are always consistent with its counter arrays.
And we save work in the hg64_inc()
hot path by removing summary
totals from mutable histograms.
branch-free key calculation
Paul Khuong encouraged me to understand his code for log-linear
bucketing which does the same job as my function to map a 64-bit
value to a low-precision floating-point key. Paul’s bin_down_of()
function uses a clever trick which allows me to remove a branch from
the fast path.
Together, these improvements (unpacked arrays, no summary totals, branch-free bucketing) mean that in the best case (single threaded, histogram is warmed up and in cache) it takes about 2.3 nanoseconds to atomically increment a histogram counter on my M1 Mac. That’s about half the time of my original code back in July.
space / time tradeoffs
The hg64
test code tries hammering a histogram concurrently from
multiple threads. The time measurements mostly show the cost of
high-contention atomic operations - it can be 50x slower than
single-threaded increments on my M1 Mac.
This slowdown isn’t a problem for me, because I aim to use hg64
in
situations where updates are relatively rare; they need to be
thread-safe but cheap.
The hg64
test code illustrates an alternative setup where each
thread has its own histogram, which can be updated fast with zero
contention. The per-thread histograms can be merged to get a
whole-system histogram (even while the update threads are active,
though the test code does not do that).
The real test is that we get the same results from the single concurrent histogram, and the merged per-thread histograms.
todo
I still don’t have any useful way to export or import histogram data. BIND’s built-in statistics web server exports in JSON and XML; I think it would also be useful to support a compressed binary format, and CSV.
And I would like to be able to visualize the data without having a whole prometheus + grafana setup!