2022-10-12 – hg64 multithreaded histograms

Since my blog post about hg64 histograms in July I have continued occasional hacking on the code, so it’s time to post an update.

As before, my proof-of-concept histogram data structure hg64 is available from:

simpler allocation

My original version used popcount packed arrays to avoid storing empty counters. It’s difficult to make this trick efficiently thread-safe, so I switched it out for a simpler but slightly less memory-efficient design.

Now there is a top-level array of pointers, one for each possible logarithm of a value, 0 .. 63. Each pointer is assigned at most once (with an atomic compare-and-swap) to an array of counters. The sub-array size depends on the precision (number of significant digits) of the histogram.

separate summary totals

When calculating ranks and percentiles, hg64 uses summary totals so that it can skip over most of the counter arrays. These totals can’t be updated in multithreaded code in a way that is both fast and consistent.

Instead, you need to take an immutable snapshot of the histogram. This separate copy is immune from multithreading issues so its summary totals are always consistent with its counter arrays.

And we save work in the hg64_inc() hot path by removing summary totals from mutable histograms.

branch-free key calculation

Paul Khuong encouraged me to understand his code for log-linear bucketing which does the same job as my function to map a 64-bit value to a low-precision floating-point key. Paul’s bin_down_of() function uses a clever trick which allows me to remove a branch from the fast path.

Together, these improvements (unpacked arrays, no summary totals, branch-free bucketing) mean that in the best case (single threaded, histogram is warmed up and in cache) it takes about 2.3 nanoseconds to atomically increment a histogram counter on my M1 Mac. That’s about half the time of my original code back in July.

space / time tradeoffs

The hg64 test code tries hammering a histogram concurrently from multiple threads. The time measurements mostly show the cost of high-contention atomic operations - it can be 50x slower than single-threaded increments on my M1 Mac.

This slowdown isn’t a problem for me, because I aim to use hg64 in situations where updates are relatively rare; they need to be thread-safe but cheap.

The hg64 test code illustrates an alternative setup where each thread has its own histogram, which can be updated fast with zero contention. The per-thread histograms can be merged to get a whole-system histogram (even while the update threads are active, though the test code does not do that).

The real test is that we get the same results from the single concurrent histogram, and the merged per-thread histograms.

todo

I still don’t have any useful way to export or import histogram data. BIND’s built-in statistics web server exports in JSON and XML; I think it would also be useful to support a compressed binary format, and CSV.

And I would like to be able to visualize the data without having a whole prometheus + grafana setup!


Comments welcome on Dreamwidth


⇐ 2022-09-23 ⇐ How big is this integer type? ⇐ ⇒ Faster DNS name decompression ⇒ 2022-11-17 ⇒