Last year I wrote about inlining just the fast path of Lemire’s
algorithm for nearly-divisionless unbiased bounded random
numbers. The idea was to reduce code bloat by eliminating lots
of copies of the random number generator in the rarely-executed slow
paths. However, a simple split prevented the compiler from optimizing
cases like pcg32_rand(1 << n), so a lot of the blog post
was toying around with ways to mitigate this problem.
On Monday, while procrastinating on a different blog post, I realised
that it's possible to do better: there's a more general optimization
which gives us the 1 << n special case for free.
nearly divisionless
Lemire’s algorithm has about 4 neat tricks (sketched in code below):
1. use multiplication instead of division to reduce the output of a random number generator modulo some limit
2. eliminate the bias in (1) by (counterintuitively) looking at the lower digits
3. use fun modular arithmetic to calculate the reject threshold for (2)
4. arrange the reject tests to avoid the slow division in (3) in most cases
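Here’s a rough sketch of how the tricks fit together, assuming the same pcg32_t and pcg32_random() as the code below; the function name is my invention and this is an illustration, not the exact code from last year’s post:
static inline uint32_t
pcg32_rand_sketch(pcg32_t *rng, uint32_t limit) {
    // (1) multiply instead of dividing; the top 32 bits of the
    // product are the bounded sample
    uint64_t sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
    // (2) the low 32 bits reveal the bias
    uint32_t low = (uint32_t)sample;
    // (4) fast path: reject < limit, so if low >= limit we can
    // accept without doing any division
    if (low < limit) {
        // (3) reject threshold is 2^32 % limit, via modular arithmetic
        uint32_t reject = -limit % limit;
        // slow path: resample until unbiased, with a second copy
        // of the random number generator
        while (low < reject) {
            sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
            low = (uint32_t)sample;
        }
    }
    return (uint32_t)(sample >> 32);
}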
The nearly-divisionless logic in (4) leads to two copies of the random number generator, in the fast path and the slow path.
Generally speaking, compilers don’t try to deduplicate code that was written by the programmer, so they can’t simplify the nearly-divisionless algorithm very much when the limit is constant.
constantly divisionless
Two points occurred to me:
- when the limit is constant, the reject threshold (3) can be calculated at compile time
- when the division is free, there’s no need to avoid it using (4)
These observations suggested that when the limit is constant, the function for random numbers less than a limit should be written:
static inline uint32_t
pcg32_rand_const(pcg32_t *rng, uint32_t limit) {
    // constant-folded when the limit is a compile-time constant
    uint32_t reject = -limit % limit;
    uint64_t sample;
    do sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
    while ((uint32_t)(sample) < reject);
    return ((uint32_t)(sample >> 32));
}
This has only one call to pcg32_random(), saving space as I wanted,
and the compiler is able to eliminate the loop automatically when the
limit is a power of two. The loop is smaller than a call to an
out-of-line slow path function, so it’s better all round than the code
I wrote last year.
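For instance, when the limit is 1 << 20 the reject threshold constant-folds to zero, so the loop condition is never true and the whole function should reduce to a shift. Roughly (the exact codegen depends on the compiler):
// reject = -(1 << 20) % (1 << 20) == 0, so the loop body runs once
uint32_t r = pcg32_rand_const(rng, 1 << 20);
// ... which the compiler can reduce to something like
uint32_t r2 = pcg32_random(rng) >> 12;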
algorithm selection
As before, it’s possible to automatically choose the
constantly-divisionless or nearly-divisionless algorithm depending on
whether the limit is a compile-time constant or a run-time variable,
using arcane C tricks or GNU C’s __builtin_constant_p().
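A sketch of the GNU C version might look like this; the macro name pcg32_rand_lt is my invention, and a real version would want the usual macro hygiene:
// dispatch at compile time: constant limits get the
// constantly-divisionless function, variable limits get the
// nearly-divisionless one
#define pcg32_rand_lt(rng, limit)          \
    (__builtin_constant_p(limit)           \
        ? pcg32_rand_const(rng, limit)     \
        : pcg32_rand(rng, limit))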
I have been idly wondering how to do something similar in other languages.
Rust isn’t very keen on automatic specialization, but it has a reasonable alternative. The thing to avoid is passing a runtime variable to the constantly-divisionless algorithm, because then it becomes never-divisionless. Rust has a much richer notion of compile-time constants than C, so it’s possible to write a method like the following, which can’t be misused:
pub fn upto<const LIMIT: u32>(&mut self) -> u32 {
    let reject = LIMIT.wrapping_neg().wrapping_rem(LIMIT);
    loop {
        let (lo, hi) = self.get_u32().embiggening_mul(LIMIT);
        if lo < reject {
            continue;
        } else {
            return hi;
        }
    }
}
assert!(rng.upto::<42>() < 42);
(embiggening_mul is my stable replacement for the unstable widening_mul API.)
This is a nugatory optimization, but there are more interesting cases
where it makes sense to choose a different implementation for constant
or variable arguments – that is, the constant case isn’t simply a
constant-folded or partially-evaluated version of the variable case.
Regular expressions might be lex-style or pcre-style, for example.
It’s a curious question of language design whether it should be
possible to write a library that provides a uniform API that
automatically chooses constant or variable implementations, or whether
the user of the library must make the choice explicit.
Maybe I should learn some Zig to see how its comptime works.