I am pleased that so many people enjoyed my talk about time at RIPE86. I thought I would write a few notes on some of the things I left out.
This week I was in Rotterdam for a RIPE meeting. On Friday morning I gave a lightning talk called where does my computer get the time from? The RIPE meeting website has a copy of my slides and a video of the talk; this is a blogified version, not an exact transcript.
This weekend I was in Rotterdam for the RIPE DNS Hackathon.
About 50 people gathered with several ideas for potential projects: things like easier DNSSEC provisioning, monitoring DNS activity in the network, what is the environmental cost of the DNS, …
At the start of the weekend we were asked to introduce ourselves and say what our goals were. My goal was to do something different from my day job working on BIND. I was successful, tho I did help some others out with advice on a few of BIND’s obscurities.
The team I joined was very successful at producing a working prototype and a cool demo.
In 2021, I came up with a design for a new memory layout for a qp-trie, and I implemented a prototype of the design in NLnet Labs NSD (see my git repo or github).
Since I started work at ISC my main project has been to adapt the NSD prototype into a qp-trie for use in BIND. The ultimate aim is to replace BIND’s red-black tree database, its in-memory store of DNS records.
Yesterday I merged the core qp-trie implementation into BIND so it’s a good time to write some blog notes about it.
The core of the design is still close to what I sketched in 2021 and implemented in NSD, so these notes are mostly about what’s different, and the mistakes I made along the way…
Chris Wellons posted a good review of why large chunks of the C
library are terrible,
especially if you are coding on Windows - good fun if you like staring
into the abyss. He followed up with let’s write a
setjmp which is fun in a
more positive way. I was also pleased to learn about
__builtin_longjmp! There’s a small aside in this article about the
signal mask, which skates past another horrible abyss - which might
even make it sensible to DIY
Some of the nastiness can be seen in the POSIX rationale for
sigsetjmp which says that on BSD-like systems,
_setjmp correspond to
setjmp on System V Unixes.
The effect is that
setjmp might or might not involve a system call
to adjust the signal mask. The syscall overhead might be OK for
exceptional error recovery, such as Chris’s arena out of memory
example, but it’s likely to be more troublesome if you are
But why would they need to mess with the signal mask? Well, if you are
using BSD-style signals or you are using
sigaction correctly, a
signal handler will run with its signal masked. If you decide to
longjmp out of the handler, you also need to take care to unmask the
signal. On BSD-like systems,
longjmp does that for you.
The problem is that
longjmp out of a signal handler is basically
impossible to do correctly. (There’s a whole flamewar in the wg14
committee documents on this subject.) So this is another example of
libc being optimized for the unusual, broken case at the cost of the
The other day, Paul McKenney posted an article on LiveJournal about different flavours of RCU, prompted by a question about couple of Rust RCU crates. (There are a few comments about it on LWN.)
McKenney goes on to propose an RCU classification system based on the API an implementation provides to its users. (I am curious that the criteria do not involve how RCU works.)
Here’s how I would answer the questions for QSBR in BIND:
Are there explicit RCU read-side markers?
No, it relies on
libuv callbacks to bound the lifetime of a
read-side critical section.
Are grace periods computed automatically?
Yes. There is an internal
but that mainly exists to separate the QSBR code from the event
loop manager, and for testing purposes, not for use by
Are there synchronous grace-period-wait APIs?
No. (Because they led me astray when designing a data structure to use RCU.)
Are there asynchronous grace-period-wait APIs?
Yes, but instead of one-shot
call_rcu(), a subsystem (such as
the qp-trie code) registers a permanent callback
isc_qsbr_register()), and notifies the QSBR when there is work
for the callback to do (
isc_qsbr_activate()). This avoids having
to allocate a thunk on every modification, and it automatically
coalesces reclamation work.
If so, are there callback-wait APIs?
No. At the moment, final cleanup work is tied to event loop teardown.
Are there polled grace-period-wait APIs?
Are there multiple grace-period domains?
One per event loop manager, and there’s only one
Previously, I wrote about implementing safe memory reclamation for my qp-trie code in BIND. I have now got it working with a refactored qp-trie that has been changed to support asynchronous memory reclamation - working to the point where I can run some benchmarks to compare the performance of the old and new versions.
Previously, I wrote about my cataract and its assessment at Addenbrooke’s cataract clinic.
I had my cataract removed a couple of weeks ago, and so far things are going well, though there is still some follow-up work needed.
At the end of October, I finally got my multithreaded qp-trie working! It could be built with two different concurrency control mechanisms:
A reader/writer lock
This has poor read-side scalability, because every thread is hammering on the same shared location. But its write performance is reasonably good: concurrent readers don’t slow it down too much.
liburcu, userland read-copy-update
RCU has a fast and scalable read side, nice! But on the write side
synchronize_rcu(), which is blocking and rather slow, so
my write performance was terrible.
OK, but I want the best of both worlds! To fix it, I needed to change
the qp-trie code to use safe memory reclamation more effectively:
instead of blocking inside
synchronize_rcu() before cleaning up, use
call_rcu() to clean up asynchronously. I expect I’ll write about the
qp-trie changes another time.
Another issue is that I want the best of both worlds by default,
liburcu is LGPL and we don’t want BIND to depend on
code whose licence demands more from our users than the MPL.
So I set out to write my own safe memory reclamation support code.
In a previous entry, I wrote about making DNS name decompression faster by moving work left on this diagram so that we do less of it:
names < pointers < labels < bytes
Last week I had a bright idea about that leftmost step, moving per-pointer work to per-name, using some dirty tricks. Sadly the experiment was not successful, because it also increased the per-label work. Nevertheless I think it’s interesting enough to be worth writing about.
This year I have rewritten BIND’s DNS name compression and decompression code. I didn’t plan to, it just sort of happened! Anyway, last week my colleague Petr was doing some benchmarking, and he produced some numbers that seemed too good to be true, so I have re-done the measurement myself, and wow.
It has been a couple of years since my previous blog post about leap seconds, though I have been tweeting on the topic fairly frequently: see my page on date, time, and leap seconds for an index of threads. But Twitter now seems a lot less likely to stick around, so I’ll aim to collect more of my thinking-out-loud here on my blog.
Earlier this year, I rewrote BIND’s DNS name compression algorithm. That work is in the November release, BIND 9.19.7.
Last week I rewrote the reverse algorithm, DNS name decompression, and made it about 2x faster. I enjoyed deleting this comment that dates back to January 1999:
/* * Note: The following code is not optimized for speed, but * rather for correctness. Speed will be addressed in the future. */
So here is a collection of folklore and tricks related to DNS name decompression.
Since my blog post about
hg64 histograms in July I have
continued occasional hacking on the code, so it’s time to post an
As before, my proof-of-concept histogram data structure
This week I am on “escalation engineer” duty, which means triaging new
bug reports. One of them was found by ubsan (
reporting an overflow in a signed integer shift. This was fun to fix
because it needed a bit of lateral thinking.
See 2022-10-12 for an update
I have reached the point in my main project (qp-trie for BIND) where I want to collect performance statistics. Things like:
In a wider context, those using BIND for large authoritative server systems would like better statistics on things like:
For my own purposes, I’m no longer satisfied with summarizing performance with just the mean and standard deviation; and in an operational context, it’s useful to know about the existence of outliers (rare outliers can be hidden by simple summary statistics) and the size of an outlier can be a useful clue.
So, we need histograms!
I have written a proof-of-concept histogram data structure called
hg64, which you can get from:
It can load 1 million data items in about 5ms (5ns per item), and uses a few KiB of memory.
Here I will write about how
hg64 came to be.
In our previous episode, I optimized
tolower() using SWAR tricks.
That was in fact a side-quest to a side-quest to a side-quest. The first
side-quest was trying out a new DNS name compression algorithm; in BIND,
compression is performance-critical and a heavy user of tolower(), which is what
set me off on the second and third side-quests.
Now let’s pop the stack, and talk about DNS names.
Here’s a fun bit of Monday optimization.
DNS servers often have to convert DNS names to their canonical lower case form. BIND has to do so a bit more than most because it tries to preserve the case of DNS names rather than canonicalizing them.
I was recently distracted by a cleanup oportunity: there were multiple
maptolower lookup tables in BIND, so I decided to make a home for
them so that we would only need one copy. Then I thought, surely there
are faster ways to implement
tolower() than a lookup table.
My new job is working on BIND for ISC, and my main project is to replace BIND’s core red-black tree data structure with my qp-trie.
In the summer of 2021 I wrote some notes on page-based GC for qp-trie RCU which I then went on to implement in my fork of NSD.
Since the start of May 2022 I have ported the NSD version of my qp-trie to BIND, with several improvements:
multi-version concurrency, instead of just two versions, one for readers and one for the writer;
the rather sketchy locking has been completed;
two flavours of write transaction: minimum space for authoritative DNS; and minimum time for recursive caches;
rollback for failed transactions.
The notes I wrote last summer turned into code very nicely: NSD proved to be a good place to try out the ideas. And more recently, I am pleased with how the code adapted to the more complicated demands of BIND.
But there’s one area that has been problematic: compaction.
My response to the UK government consultation on using imperial units of measure for retail.
Regarding the foreword to the consultation description, my children were weighed in kilos at birth, so whoever wrote the foreword must be wery old, and talking about babies who are now grown adults.
And for metrological purposes, British weights and measures have been derived from the metric system since the international convention on the metre in 1875 and the production of the new metric prototypes in London in 1898, because the metric standards were made to a higher quality with greater reliability than the old imperial standards. So it’s incorrect to say that we started adopting the metric system in 1995; that was actually when we started abolishing imperial (and that was at the initiative of the UK government, not the EU).
oops, I should have mentioned that the tories were in power in 1995
Q. 1 for all,
1 a) Are there any specific areas of consumer transactions that should be a priority for allowing a choice in units of measurement, and why?
None, because there is no need to make it more confusing to compare sizes and prices.
1 b) Are there any specific areas that you think should be excluded from a choice in units of measurement, and why?
There should be a single set of standard units of measure so that it is as easy as possible to compare the sizes and prices of goods, especially (e.g.) price per kilo or price per litre.
1 c) If an item is sold in imperial measures, should there be a requirement for a metric equivalent alongside it?
Metric measures should always be required, and there should always be clear labelling showing the price per some power of 10 (e.g. 100g or 1kg)
Q. 2 for businesses,
n/a (I am replying as an individual)
Q. 3 for consumers,
3 a) If you had a choice, would you want to purchase items:
- (i) in imperial units?
- (ii) in imperial units alongside a metric equivalent?
I always want to purchase items in metric (preferably a round number when measured in metric units), and I want it to be easy to find the quantity labelled in metric.
3 b) Are you more likely to shop from businesses that sell in imperial units?
I am likely to avoid shops that use imperial units, because the measurements will be confusing and unfamiliar.
3 c) Do you foresee any costs or benefits to you from businesses being permitted to sell:
- (i) solely in imperial units?
- (ii) in imperial units alongside a less prominent metric equivalent?
I expect it will be detrimental to allow businesses to sell in imperial units only, because it will make it easier for them to confuse customers, misprice goods, and hide their malpractice.
3 d) Do you have experience of buying solely in imperial units?
I am 47 years old and I cannot remember when imperial measures were in wide use for anything except milk and beer.
Milk, for example, is often confusing when 2 litre bottles are sold alongside 4 pint bottles, and I have to carefully examine the labels to find out which is the larger volume and what is the price per litre of the 4 pint bottle.
For beer, how can I compare a pint bottle and a 33cl bottle? It’s relatively easy for 50cl vs 33cl because the ratio is 3/2. In practice I think of a pint as about half a litre, so I can compare in metric. And, when I am served beer in a pub in a pint-to-brim glass, what I usually get is closer to half a litre than a pint.
Q. 4 for trading standards,
n/a (I am replying as an individual)