Tony Finch – blog


Last year I wrote about inlining just the fast path of Lemire’s algorithm for nearly-divisionless unbiased bounded random numbers. The idea was to reduce code bloat by eliminating lots of copies of the random number generator in the rarely-executed slow paths. However a simple split prevented the compiler from being able to optimize cases like pcg32_rand(1 << n), so a lot of the blog post was toying around with ways to mitigate this problem.

On Monday while procrastinating a different blog post, I realised that it’s possible to do better: there’s a more general optimization which gives us the 1 << n special case for free.

nearly divisionless

Lemire’s algorithm has about 4 neat tricks:

  1. use multiplication instead of division to reduce the output of a random number generator modulo some limit

  2. eliminate the bias in (1) by (counterintuitively) looking at the lower digits

  3. fun modular arithmetic to calculate the reject threshold for (2)

  4. arrange the reject tests to avoid the slow division in (3) in most cases

The nearly-divisionless logic in (4) leads to two copies of the random number generator, in the fast path and the slow path.
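For concreteness, the two-copy shape looks roughly like this. This is a sketch of the technique, not Lemire's or the post's exact code, and `pcg32_random_stub()` is a stand-in mixer rather than a real PCG generator:

```c
#include <stdint.h>

/* Stand-in for pcg32_random(): a simple LCG-based mixer, just so
   the sketch is self-contained. Not a real PCG step function. */
static uint64_t state = 0x853c49e6748fea9bULL;
static uint32_t pcg32_random_stub(void) {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(state >> 32);
}

/* Nearly-divisionless bounded generator: the common case needs no
   division; the % only runs when the low half of the product is
   small enough that the sample might be biased. Note the generator
   is called in two places: the fast path and the reject loop. */
static uint32_t pcg32_rand(uint32_t limit) {
    uint64_t sample = (uint64_t)pcg32_random_stub() * (uint64_t)limit;
    uint32_t lo = (uint32_t)sample;
    if (lo < limit) {                      /* rare: possibly biased */
        uint32_t reject = -limit % limit;  /* the slow division */
        while (lo < reject) {              /* second copy of the RNG call */
            sample = (uint64_t)pcg32_random_stub() * (uint64_t)limit;
            lo = (uint32_t)sample;
        }
    }
    return (uint32_t)(sample >> 32);
}
```

The duplicated `pcg32_random_stub()` call is exactly what gets copied into every caller when the whole thing is inlined.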

Generally speaking, compilers don’t try to deduplicate code that was written by the programmer, so they can’t simplify the nearly-divisionless algorithm very much when the limit is constant.

constantly divisionless

Two points occurred to me:

  1. when the limit is a compile-time constant, the reject threshold in (3) can be calculated at compile time, so the slow division costs nothing;

  2. when the division in (3) is free, there’s no need for the nearly-divisionless arrangement in (4): a plain rejection loop will do.

These observations suggested that when the limit is constant, the function for random numbers less than a limit should be written:

    static inline uint32_t
    pcg32_rand_const(pcg32_t *rng, uint32_t limit) {
        uint32_t reject = -limit % limit;
        uint64_t sample;
        do sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
        while ((uint32_t)(sample) < reject);
        return ((uint32_t)(sample >> 32));
    }

This has only one call to pcg32_random(), saving space as I wanted, and the compiler is able to eliminate the loop automatically when the limit is a power of two. The loop is smaller than a call to an out-of-line slow path function, so it’s better all round than the code I wrote last year.
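The power-of-two case falls out because when the limit divides 2^32, the wrapped negation `-limit` is a multiple of the limit, so the reject threshold is zero and the loop condition can never be true. A quick check of that arithmetic (the helper name is mine, for illustration):

```c
#include <stdint.h>

/* The reject threshold from pcg32_rand_const(): in unsigned
   arithmetic -limit wraps to 2^32 - limit, so this computes
   2^32 mod limit. For limit = 1 << n that is exactly zero,
   which is why the compiler can delete the loop. */
static uint32_t reject_threshold(uint32_t limit) {
    return -limit % limit;
}
```

With a zero threshold, `(uint32_t)sample < reject` is always false, the `do`/`while` runs exactly once, and what remains is a single multiply and shift.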

algorithm selection

As before it’s possible to automatically choose the constantly-divisionless or nearly-divisionless algorithms depending on whether the limit is a compile-time constant or run-time variable, using arcane C tricks or GNU C __builtin_constant_p().
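A minimal GNU C sketch of that dispatch. The function bodies and names here are assumptions for illustration (in particular, the variable-limit stub below just reuses the constant version, where the real thing would be the nearly-divisionless algorithm):

```c
#include <stdint.h>

typedef struct { uint64_t state; } pcg32_t;

/* Stand-in generator so the sketch compiles; not real PCG. */
static uint32_t pcg32_random(pcg32_t *rng) {
    rng->state = rng->state * 6364136223846793005ULL
               + 1442695040888963407ULL;
    return (uint32_t)(rng->state >> 32);
}

/* Constantly-divisionless: fine when limit folds to a constant. */
static uint32_t pcg32_rand_const(pcg32_t *rng, uint32_t limit) {
    uint32_t reject = -limit % limit;
    uint64_t sample;
    do sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
    while ((uint32_t)sample < reject);
    return (uint32_t)(sample >> 32);
}

/* Placeholder for the nearly-divisionless run-time version. */
static uint32_t pcg32_rand_var(pcg32_t *rng, uint32_t limit) {
    return pcg32_rand_const(rng, limit);
}

/* GNU C: pick the algorithm based on whether the limit is known
   at compile time. __builtin_constant_p folds to 1 or 0 during
   optimization, so the untaken branch disappears. */
#define pcg32_rand(rng, limit)             \
    (__builtin_constant_p(limit)           \
        ? pcg32_rand_const(rng, limit)     \
        : pcg32_rand_var(rng, limit))
```

The macro double-evaluates its arguments, which a production version would want to avoid, but it shows the shape of the trick.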

I have been idly wondering how to do something similar in other languages.

Rust isn’t very keen on automatic specialization, but it has a reasonable alternative. The thing to avoid is passing a runtime variable to the constantly-divisionless algorithm, because then it becomes never-divisionless. Rust has a much richer notion of compile-time constants than C, so it’s possible to write a method like the following, which can’t be misused:

    pub fn upto<const LIMIT: u32>(&mut self) -> u32 {
        let reject = LIMIT.wrapping_neg().wrapping_rem(LIMIT);
        loop {
            let (lo, hi) = self.get_u32().embiggening_mul(LIMIT);
            if lo < reject {
                continue;
            } else {
                return hi;
            }
        }
    }

    assert!(rng.upto::<42>() < 42);

(embiggening_mul is my stable replacement for the unstable widening_mul API.)

This is a nugatory optimization, but there are more interesting cases where it makes sense to choose a different implementation for constant or variable arguments – that is, the constant case isn’t simply a constant-folded or partially-evaluated version of the variable case. Regular expressions might be lex-style or pcre-style, for example. It’s a curious question of language design whether it should be possible to write a library that provides a uniform API that automatically chooses constant or variable implementations, or whether the user of the library must make the choice explicit.

Maybe I should learn some Zig to see how its comptime works.