I have added syntax highlighting to my blog using tree-sitter. Here are some notes about what I learned, with some complaining.
Last year I wrote about inlining just the fast path of Lemire’s
algorithm for nearly-divisionless unbiased bounded random
numbers. The idea was to reduce code bloat by eliminating lots
of copies of the random number generator in the rarely-executed slow
paths. However, a simple split prevented the compiler from optimizing cases like pcg32_rand(1 << n), so a lot of the blog post was toying around with ways to mitigate this problem.
On Monday while procrastinating a different blog post, I realised that
it’s possible to do better: there’s a more general optimization which
gives us the 1 << n special case for free.
nearly divisionless
Lemire’s algorithm has about 4 neat tricks:
- use multiplication instead of division to reduce the output of a random number generator modulo some limit
- eliminate the bias in (1) by (counterintuitively) looking at the lower digits
- fun modular arithmetic to calculate the reject threshold for (2)
- arrange the reject tests to avoid the slow division in (3) in most cases
The nearly-divisionless logic in (4) leads to two copies of the random number generator, in the fast path and the slow path.
Generally speaking, compilers don’t try to deduplicate code that was written by the programmer, so they can’t simplify the nearly-divisionless algorithm very much when the limit is constant.
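For reference, the nearly-divisionless function looks roughly like this (a sketch along the lines of Lemire’s paper, not the exact code from last year’s post). Note the two calls to pcg32_random(), one in the fast path and one in the rejection loop:

static inline uint32_t
pcg32_rand(pcg32_t *rng, uint32_t limit) {
    uint64_t sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
    uint32_t lo = (uint32_t)sample;
    if (lo < limit) {                     // rarely true: slow path
        uint32_t reject = -limit % limit; // the slow division (3)
        while (lo < reject) {             // rejection sampling (2)
            sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
            lo = (uint32_t)sample;
        }
    }
    return ((uint32_t)(sample >> 32));    // the multiply trick (1)
}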
constantly divisionless
Two points occurred to me:
- when the limit is constant, the reject threshold (3) can be calculated at compile time
- when the division is free, there’s no need to avoid it using (4)
These observations suggested that when the limit is constant, the function for random numbers less than a limit should be written:
static inline uint32_t
pcg32_rand_const(pcg32_t *rng, uint32_t limit) {
    uint32_t reject = -limit % limit;
    uint64_t sample;
    do sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
    while ((uint32_t)(sample) < reject);
    return ((uint32_t)(sample >> 32));
}
This has only one call to pcg32_random(), saving space as I wanted,
and the compiler is able to eliminate the loop automatically when the
limit is a power of two. The loop is smaller than a call to an
out-of-line slow path function, so it’s better all round than the code
I wrote last year.
algorithm selection
As before it’s possible to automatically choose the
constantly-divisionless or nearly-divisionless algorithms depending on
whether the limit is a compile-time constant or run-time variable,
using arcane C tricks or GNU C __builtin_constant_p().
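For example, the selection can be a macro like this (a sketch; pcg32_rand_var() is a hypothetical name for the nearly-divisionless function):

// __builtin_constant_p(limit) folds to 1 when the compiler can
// prove the limit is a compile-time constant, and 0 otherwise
#define pcg32_rand(rng, limit)              \
    (__builtin_constant_p(limit)            \
        ? pcg32_rand_const(rng, limit)      \
        : pcg32_rand_var(rng, limit))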
I have been idly wondering how to do something similar in other languages.
Rust isn’t very keen on automatic specialization, but it has a reasonable alternative. The thing to avoid is passing a runtime variable to the constantly-divisionless algorithm, because then it becomes never-divisionless. Rust has a much richer notion of compile-time constants than C, so it’s possible to write a method like the following, which can’t be misused:
pub fn upto<const LIMIT: u32>(&mut self) -> u32 {
    let reject = LIMIT.wrapping_neg().wrapping_rem(LIMIT);
    loop {
        let (lo, hi) = self.get_u32().embiggening_mul(LIMIT);
        if lo < reject {
            continue;
        } else {
            return hi;
        }
    }
}
assert!(rng.upto::<42>() < 42);
(embiggening_mul is my stable replacement for the unstable widening_mul API.)
This is a nugatory optimization, but there are more interesting cases
where it makes sense to choose a different implementation for constant
or variable arguments – that is, the constant case isn’t simply a
constant-folded or partially-evaluated version of the variable case.
Regular expressions might be lex-style or pcre-style, for example.
It’s a curious question of language design whether it should be
possible to write a library that provides a uniform API that
automatically chooses constant or variable implementations, or whether
the user of the library must make the choice explicit.
Maybe I should learn some Zig to see how its comptime works.
One of the neat things about the PCG random number generator by Melissa O’Neill is its use of instruction-level parallelism: the PCG state update can run in parallel with its output permutation.
However, PCG only has a limited amount of ILP, about 3 instructions. Its overall speed is limited by the rate at which a CPU can run a sequence where the output of one multiply-add feeds into the next multiply-add.
… Or is it?
With some linear algebra and some AVX512, I can generate random numbers from a single instance of pcg32 at 200 Gbit/s on a single core. This is the same sequence of random numbers generated in the same order as normal pcg32, but more than 4x faster.
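For reference, the standard pcg32 generator looks roughly like this (a sketch of O’Neill’s reference implementation, not my AVX512 code). The multiply-add state update feeds into the next call, which is the serial bottleneck; the output permutation depends only on the old state, so it can overlap with the update:

#include <stdint.h>

typedef struct { uint64_t state, inc; } pcg32_t;

static uint32_t pcg32_random(pcg32_t *rng) {
    uint64_t old = rng->state;
    // serial dependency: this multiply-add feeds the next call
    rng->state = old * 6364136223846793005ULL + rng->inc;
    // output permutation: independent of the update above
    uint32_t xorshifted = (uint32_t)(((old >> 18) ^ old) >> 27);
    uint32_t rot = (uint32_t)(old >> 59);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}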
The International Obfuscated C Code Contest has a newly revamped web site, and the Judges have announced the 28th contest, to coincide with its 40th anniversary. (Or 41st?)
The Judges have also updated the archive of past winners so that as many of them as possible work on modern systems. Accordingly, I took a look at my 1998 winner to see how much damage time hath wrought.
D’oh, I lost track of a bug report that should have been fixed in nsnotifyd-2.2. Thus, hot on the heels of the previous release, here’s nsnotifyd-2.3. Sorry for causing extra work to my uncountably many users!
The nsnotifyd daemon monitors a set of DNS zones and runs a command when any of them change. It listens for DNS NOTIFY messages so it can respond to changes promptly. It also uses each zone’s SOA refresh and retry parameters to poll for updates if nsnotifyd does not receive NOTIFY messages more frequently. It comes with a client program nsnotify for sending notify messages.
This nsnotifyd-2.3 release includes some bug fixes:
- When nsnotifyd received a SIGINT or SIGTERM while running the command, it failed to handle it correctly. Now it exits promptly. Many thanks to Athanasius for reporting the bug!
- Miscellaneous minor code cleanup and compiler warning suppression.
Thanks also to Dan Langille who sent me a lovely appreciation:
Now that I think of it, nsnotifyd is in my favorite group of software. That group is software I forget I’m running, because they just run and do the work. For years. I haven’t touched, modified, or configured nsnotifyd and it just keeps doing the job.
I have made a new release of nsnotifyd, a tiny DNS server that just listens for NOTIFY messages and runs a script when one of your zones changes.
This nsnotifyd-2.2 release includes a new feature: nsnotify can now send NOTIFY messages from a specific source address.
Thanks to Adam Augustine for the suggestion. I like receiving messages that say things like,
Thanks for making this useful tool available for free.
Recently the Spritely Institute published an introduction to Petnames, A humane approach to secure, decentralized naming.
I have long been a fan of petnames, and graph naming systems in general. I first learned about them in the context of Mark Miller’s E programming language which was a distributed object capability system on the JVM. I gather Spritely are working on something similar, but with a more webby lisp/scheme flavour – and in fact Mark Miller is the second author on this article.
An interesting thing that the article sort of hints at is that these kinds of systems have a fairly smooth range of operating points from decentralized petnames all the way to centralized globally unique names. Zooko’s triangle is actually more like a fan when the “human friendly” point is fixed: there’s an arc describing the tradeoffs, from personal through local to global.
As a first step away from petnames, I like to use the term “nickname” for nearby instances of what they call “edge names” in the article: a nickname is not as personal as a petname, it’s a name you share with others.
A bit further along the arc is the article’s example of the “bizdir” local business directory. It’s a trusted (and hopefully trustworthy) naming authority, but in a more local / federated fashion rather than centralized.
How can a petname system function at the global+centralized point, so it could replace the DNS? It needs to pass the “billboard test”: I type in a name I see on a billboard and I get to the right place. (It might be a multipart name like a postal address or DNS name, with an extra “edge name” or two to provide enough disambiguating context.)
I imagine that an operating system supplier might provide a few preconfigured petnames (it probably includes its own petname so the software can update itself securely), a lot like its preconfigured PKIX CA certificates. These petnames would refer to orgs like the “bizdir”, or Verisign, or Nominet, that act as nickname registries. Your collection of petnames is in effect your personal root zone, and the preconfigured petnames are in effect the default TLDs.
When I was thinking about how a decentralized graph naming system could be made user-friendly and able to pass the billboard test I realised that there are organizational structures that are hard to avoid. I expect there would inevitably be something like the CA/Browser forum to mediate between OS suppliers and nickname registries: a petname ICANN.
I wrote an older iteration of these ideas over a decade ago. Those notes suffer from too many DNS brainworms, but looking back, there’s some curious anti-DNS discussion. How might it be useful to be able to reach a name by multiple paths? Could you use that for attestation? What might it look like to have a shared context for names that is not global but is national or regional or local?
The other day I learned about the Rust crate lexopt which describes itself as,
A pathologically simple command line argument parser.
Most argument parsers are declarative: you tell them what to parse, and they do it. This one provides you with a stream of options and values and lets you figure out the rest.
For “pathologically simple” I still rather like getopt(3) despite its lack of support for long options. Aaron S Cohen wrote getopt in around 1979, and it was released into the public domain by AT&T in 1985. A very useful 50-ish lines of code! It still has almost everything required by POSIX nearly four decades later.
But the description of lexopt made me think getopt() could be simpler. The insight is that the string of options that you have to pass to getopt() is redundant with respect to the code that deals with the return values from getopt(). What if you just get rid of the options string?
I thought I would try it. Turns out, not much is lost in getting rid of the options string, and a few things are gained.
My new code is half the size of getopt(), and has more functionality. I’m going to show how this was done, because it’s short (ish), not because it is interesting. Then I’ll try to tease out a lesson or two.
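As a taste of the idea, here is a minimal sketch (hypothetical names, not the real code, and simplified: no option bundling or attached values). There is no options string: the caller’s switch statement is the only description of the options, and an option claims a value by calling optval().

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int opti = 1;  // index of the next argv word to examine

// return the next single-character option, or -1 when done;
// "--" ends the options explicitly
static int nextopt(int argc, char **argv) {
    char *w = opti < argc ? argv[opti] : NULL;
    if (w == NULL || w[0] != '-' || w[1] == '\0')
        return -1;                 // not an option: stop scanning
    opti++;
    if (strcmp(w, "--") == 0)
        return -1;                 // explicit end of options
    // anything longer than "-x" is an error in this sketch
    return w[2] == '\0' ? (unsigned char)w[1] : '?';
}

// claim the next word as the current option's value; only the
// caller knows which options take values, so there is no options
// string to keep in sync with the switch statement
static const char *optval(int argc, char **argv) {
    return opti < argc ? argv[opti++] : NULL;
}

int main(int argc, char **argv) {
    bool verbose = false;
    const char *outfile = NULL;
    for (int opt; (opt = nextopt(argc, argv)) != -1;) {
        switch (opt) {
        case 'v': verbose = true; break;
        case 'o': outfile = optval(argc, argv); break;
        default: fprintf(stderr, "usage: example [-v] [-o file]\n");
                 exit(1);
        }
    }
    // operands remain in argv[opti..argc-1]
    printf("verbose=%d outfile=%s\n", verbose, outfile ? outfile : "-");
    return 0;
}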
I commented on Lobsters that /tmp is usually a bad idea, which caused some surprise. I suppose /tmp security bugs were common in the 1990s when I was learning Unix, but they are pretty rare now so I can see why less grizzled hackers might not be familiar with the problems.
I guess that’s some kind of success, but sadly the fixes have left
behind a lot of scar tissue because they didn’t address the underlying
problem: /tmp should not exist.
It’s a bad idea because it’s shared global mutable state that crosses security boundaries. There’s a ton of complexity at all levels of unix (filesystems, kernel APIs, libc, shell, admin scripts) that only exists as a workaround for the dangers caused by making /tmp shared.
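For a flavour of the classic problem, here is a sketch of the kind of bug that used to be everywhere: a privileged program using a predictable name in shared /tmp.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // UNSAFE: a predictable name in shared /tmp; if an attacker has
    // pre-created it as a symlink to a file the program can write,
    // this open() follows the link and clobbers the target
    int bad = open("/tmp/myapp.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);

    // the traditional mitigation: mkstemp() picks an unpredictable
    // name and creates it with O_CREAT|O_EXCL, which refuses to
    // follow an existing file or symlink
    char name[] = "/tmp/myapp.XXXXXX";
    int good = mkstemp(name);

    printf("bad=%d good=%d (%s)\n", bad, good, name);
    return 0;
}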
A couple of notable things have happened in recent months:
- There is a new edition of POSIX for 2024. There’s lots of good stuff in it, but today I am writing about getentropy() which is the first officially standardized POSIX API for getting cryptographically secure random numbers.
- On Linux the getentropy(3) function is based on the getrandom(2) system call. In Linux 6.11 there is a new vDSO call, vgetrandom(), that makes it possible to implement getentropy() entirely in userland, which should make it significantly faster.
UUID v4 and v7 are great examples of the need for high performance secure random numbers: you don’t want the performance of your database inserts to be limited by your random number generator! Another example is DNS source port and query ID randomization which help protect DNS resolvers against forged answers.
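The API itself is pleasingly small. A minimal example of how a call looks (my sketch):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>   // getentropy() on glibc; <sys/random.h> elsewhere

int main(void) {
    uint8_t key[32];
    // fills the buffer with cryptographically secure random bytes;
    // it fails if the request is larger than 256 bytes
    if (getentropy(key, sizeof(key)) != 0) {
        perror("getentropy");
        return 1;
    }
    for (size_t i = 0; i < sizeof(key); i++)
        printf("%02x", key[i]);
    printf("\n");
    return 0;
}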
I was inspired to play with getentropy() by a blog post about getting a few secure random bytes in PostgreSQL without pgcrypto: it struck me that PostgreSQL doesn’t use getentropy(), and I thought it might be fun (and possibly even useful!) to add support for it.
I learned a few things along the way!
Following my previous post on rate limiting with GCRA, leaky buckets without the buckets, I reviewed my old notes on rate limiting for Exim. I thought I should do a new write-up of the ideas that I hope will be more broadly interesting.
Exponential rate limiting uses an exponentially-weighted moving average to measure the client’s rate. It is motivated by a shift of perspective:
- first measure the client’s rate,
- then compare it to the limit.
Algorithms like GCRA and leaky bucket don’t allow you to separate these two points because they don’t measure the client’s rate as a concrete number.
A moving average allows more flexible policy enforcement because the rate measurement is meaningful even when you don’t apply back-pressure. For example, it’s useful in a dry run mode, or when diverting messages to a quarantine.
An exponential rate limiter stores, for each client:
- last update time
- average rate
This is a similar amount of space as leaky bucket. GCRA uses less space because it only needs to store a time.
The main disadvantage is that an exponential rate limiter needs fairly complicated floating point arithmetic.
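To sketch the arithmetic (this is the shape of the update step, not the exact Exim code): on each event, decay the stored average according to the time since the last update, then blend in the instantaneous rate.

#include <math.h>

// per-client state: a similar amount of space as leaky bucket
struct client {
    double time;    // last update time, in seconds
    double rate;    // average rate, in events per second
};

// update the moving average on each event; "period" is the
// smoothing window in seconds
double measure(struct client *c, double now, double period) {
    double interval = now - c->time;
    if (interval <= 0) interval = 1e-9;      // simultaneous events
    double alpha = exp(-interval / period);  // decay: old data fades
    double inst = 1.0 / interval;            // this event's rate
    c->rate = alpha * c->rate + (1.0 - alpha) * inst;
    c->time = now;
    return c->rate;  // the policy compares this against the limit
}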
Yesterday I read an article describing the GCRA rate limiting algorithm. I thought it was really interesting, but I wasn’t entirely satisfied with Brandur’s explanation, and the Wikipedia articles on leaky buckets and GCRA are terrible, so here’s my version.
what is GCRA?
GCRA is the “generic cell rate algorithm”, a rate-limiting algorithm that came from ATM. GCRA does the same job as the better-known leaky bucket algorithm, but using half the storage and with much less code.
GCRA only stores one time stamp per sender, whereas leaky buckets need to store a time stamp and a quota.
GCRA needs less arithmetic, and makes it trivial to tell a client how long it should wait before the next request is permitted.
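Here’s a sketch of GCRA along those lines (my own, with hypothetical names): the single stored time stamp is the theoretical arrival time of the next request, assuming the client keeps to its rate.

#include <math.h>
#include <stdbool.h>

// per-sender state: just one time stamp, the theoretical arrival
// time (TAT) of the sender's next permitted request
static double tat;

// T is the emission interval (1/rate); tau is the burst tolerance
bool gcra_allow(double now, double T, double tau) {
    if (tat > now + tau)
        return false;         // too early; (tat - now - tau) is
                              // exactly how long the client must wait
    tat = fmax(now, tat) + T; // book the next slot
    return true;
}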
Yesterday there was some discussion on the Orange Site about whether or not C is Turing complete. The consensus in the StackOverflow question is,
- no, because the C abstract machine is a (large) finite state machine,
- or maybe yes, if you believe that unaddressable local variables can exist outside the finite address space, and you can have an unbounded number of them, i.e. no.
My answer is definitely yes, if you include the standard IO library.
And using IO is much closer to Turing’s original model of a finite state machine working on unbounded peripheral storage.
I’m pleased that so many people enjoyed my previous blog post on tolower() with AVX-512. Thanks for all the great comments and discussion!
One aspect that needed more work was examining the performance for small strings. The previous blog post had a graph for strings up to about 1000 bytes long, mainly because it illustrated some curious behaviour – but that distracted me from what I really care about, which is strings up to about 100 bytes long, 10x smaller.
Eventually I managed to get some numbers that I think are plausible.
A couple of years ago I wrote about tolower() in bulk at speed using SWAR tricks. A couple of days ago I was interested by Olivier Giniaux’s article about unsafe read beyond of death, an optimization for handling small strings with SIMD instructions, for a fast hash function written in Rust.
I’ve long been annoyed that SIMD instructions can easily eat short strings whole, but it’s irritatingly difficult to transfer short strings between memory and vector registers. Olivier’s post caught my eye because it seemed like a fun way to avoid the problem, at least for loads. (Stores remain awkward!)
Actually, to be frank, Olivier nerdsniped me.
Semaphores are one of the oldest concurrency primitives in computing, invented over 60 years ago. They are weird: usually the only numbers of concurrent processes we care about are zero, one, or many – but semaphores deal with those fussy finite numbers in between.
Yesterday I was writing code that needed to control the number of concurrent operating system processes that it spawned so that it didn’t overwhelm the computer. One of those rare situations when a semaphore is just the thing!
a Golang channel is a semaphore
A Golang channel has a buffer size – a number of free slots – which corresponds to the initial value of the semaphore. We don’t care about the values carried by the channel: any type will do.
var semaphore = make(chan any, MAXPROCS)
The acquire operation uses up a slot in the channel. It is traditionally called P(), and described as decrementing the value of the semaphore, i.e. decrementing the number of free slots in the channel. When the channel is full this will block, waiting for another goroutine to release the semaphore.
func acquire() {
    semaphore <- nil
}
The release operation, traditionally called V(), frees a slot in the channel, incrementing the value of the semaphore.
func release() {
    <-semaphore
}
That’s it!
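A hypothetical usage sketch: spawn the tasks, letting the semaphore cap the number that run at once.

// run all the tasks, at most MAXPROCS of them at a time
func runAll(tasks []func()) {
    for _, task := range tasks {
        acquire()           // blocks while MAXPROCS tasks are running
        go func(t func()) {
            defer release() // free the slot even if t panics
            t()
        }(task)
    }
}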
the GNU make jobserver protocol is a semaphore
The GNU make -j parallel builds feature uses a semaphore in the form of its jobserver protocol. Occasionally, other programs support the jobserver protocol too, such as Cargo. BSD make -j uses basically the same semaphore implementation, but is not compatible with the GNU make jobserver protocol.
The make jobserver semaphore works in a similar manner to a Golang channel semaphore, but:
- instead of a channel, make uses a unix pipe;
- because make can’t control the amount of free space in a pipe’s buffer, the value of the semaphore is represented as the amount of used space in the pipe’s buffer;
- the value of the semaphore can’t be more than PIPE_BUF, to ensure that release() will never block.
Here’s a C-flavoured sketch of how it works. To create a semaphore and initialize its value, create a pipe and write that many characters to it, which are buffered in the kernel:
int fd[2];
pipe(fd);
char slots[MAXPROCS] = {0};
write(fd[1], slots, sizeof(slots));
To acquire a slot, read a character from the pipe. When the pipe is empty this will block, waiting for another process to release the semaphore.
char slot;
read(fd[0], &slot, 1);
To release a slot, the worker must write the same character back to the pipe:
write(fd[1], &slot, 1);
Error handling is left as an exercise for the reader.
bonus: waiting for concurrent tasks to complete
If we need to wait for everything to finish, we don’t need any extra machinery. We don’t even need to know how many tasks are still running! It’s enough to acquire all possible slots, which will block until the tasks have finished, then release all the slots again.
func wait() {
    for range MAXPROCS {
        acquire()
    }
    for range MAXPROCS {
        release()
    }
}
That’s all for today! Happy hacking :-)
a blog post for international RNG day
Lemire’s nearly-divisionless algorithm for unbiased bounded random numbers has a fast path and a slow path. In the fast path it gets a random number, does a multiplication, and a comparison. In the rarely-taken slow path, it calculates a remainder (the division) and enters a rejection sampling loop.
When Lemire’s algorithm is coupled to a small random number generator such as PCG, the fast path is just a handful of instructions. When performance matters, it makes sense to inline it. It makes less sense to inline the slow path, because that just makes it harder for the compiler to work on the hot code.
Lemire’s algorithm is great when the limit is not a constant (such as during a Fisher-Yates shuffle) or when the limit is not a power of two. But when the limit is a constant power of two, it ought to be possible to eliminate the slow path entirely.
What are the options?
I have made a new release of nsnotifyd, a tiny DNS server that just listens for NOTIFY messages and runs a script when one of your zones changes.
This nsnotifyd-2.1 release includes a few bugfixes:
- more lenient handling of trailing . in domain names on the command line
- avoid dividing by zero when the refresh interval is less than 10 seconds
- do not error out when in TCP mode and debug mode and the refresh timer expires
- explain how to fix incomplete github forks
Many thanks to Lars-Johann Liman, JP Mens, and Jonathan Hewlett for the bug reports. I like receiving messages that say things like,
thanks for nsnotifyd, is a great little program, and a good example of a linux program, does one thing well.
(There’s more like that in the nsnotifyd-2.0 release announcement.)
I have also included a little dumpaxfr program, which I wrote when fiddling around with binary wire format DNS zone transfers. I used the nsnotifyd infrastructure as a short cut, though dumpaxfr doesn’t logically belong here. But it’s part of the family, so I wrote a dumpaxfr(1) man page and included it in this release.
I will be surprised if anyone else finds dumpaxfr useful!
Yesterday I received a bug report for regpg, my program that safely stores server secrets encrypted with gpg so they can be committed to a git repository.
The bug was that I used the classic shell pipeline find | xargs grep
with the classic Unix “who would want spaces in filenames?!” flaw.
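The usual fix, for the record, is NUL-terminated file names, something like:

# NUL bytes cannot appear in paths, so spaces and newlines
# in file names pass through the pipeline unharmed
find . -type f -print0 | xargs -0 grep -l pattern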
I have pushed a new release, regpg-1.12, containing the bug fix.
There’s also a gendnskey subcommand which I used when doing my algorithm rollovers a few years ago. (It’s been a long time since the last regpg release!) It’s somewhat obsolete now that I know how to use dnssec-policy.
A bunch of minor compatibility issues have crept in, which mostly required fixing the tests to deal with changes in Ansible, OpenSSL, and GnuPG.
My most distressing discovery was that Mac OS crypt(3) still supports only DES. Good grief.
There are a couple of version control commands that deserve wider appreciation: SCCS what and RCS ident. They allow you to find out what source a binary was built from, without having to run it – handy if it is a library! They basically scan a file looking for magic strings that contain version control metadata and print out what they discover.
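For example, the magic strings embedded in a C source file look roughly like this (a sketch from memory):

/* SCCS: what(1) scans the binary for the "@(#)" marker */
static const char sccsid[] = "@(#)example.c 1.2";

/* RCS: ident(1) scans for "$keyword: ... $" patterns, which the
   version control system expands from a plain "$Id$" keyword */
static const char rcsid[] = "$Id: example.c,v 1.2 2024/01/01 fanf Exp $";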