inlined nearly divisionless random numbers

a blog post for international RNG day

Lemire’s nearly-divisionless algorithm unbiased bounded random numbers has a fast path and a slow path. In the fast path it gets a random number, does a multiplication, and a comparison. In the rarely-taken slow path, it calculates a remainder (the division) and enters a rejection sampling loop.

When Lemire’s algorithm is coupled to a small random number generator such as PCG, the fast path is just a handful of instructions. When performance matters, it makes sense to inline it. It makes less sense to inline the slow path, because that just makes it harder for the compiler to work on the hot code.

Lemire’s algorithm is great when the limit is not a constant (such as during a Fisher-Yates shuffle) or when the limit is not a power of two. But when the limit is a constant power of two, it ought to be possible to eliminate the slow path entirely.

What are the options?

basic fast / slow split
problem
rely on the programmer?
check for a power of two?
move the split?
compile time hacks?

basic fast / slow split

The following function is a typical implementation of Lemire’s algorithm. (See my past blog post for an explanation.)

    uint32_t
    pcg32_rand(pcg32_t *rng, uint32_t limit) {
        uint64_t sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
        if ((uint32_t)(sample) < limit) {
            uint32_t reject = -limit % limit;
            while ((uint32_t)(sample) < reject)
                sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
        }
        return ((uint32_t)(sample >> 32));
    }

To separate the fast and slow paths, we can put everything up to the slow path test into an inline function in a header:

    static inline uint32_t
    pcg32_rand_fast(pcg32_t *rng, uint32_t limit) {
        uint64_t sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
        if ((uint32_t)(sample) < limit) {
            return (pcg32_rand_slow(rng, limit, sample));
        return ((uint32_t)(sample >> 32));
    }

And the division (or rather, the remainder) and the rejection sampling loop are shoved off into a separate compilation unit:

    uint32_t
    pcg32_rand_slow(pcg32_t *rng, uint32_t limit, uint64_t sample) {
        uint32_t reject = -limit % limit;
        while ((uint32_t)(sample) < reject)
            sample = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
        return ((uint32_t)(sample >> 32));
    }

problem

The basic split minimizes the amount of duplicated slow-path code. The problem is that the compiler cannot eliminate the slow path when pcg32_rand_fast() is called with a constant power of two.

The slow path guard (uint32_t)(sample) < limit can true when the sample is zero or a large enough power of two. Inside the slow path, when the limit is a power of two, the code calculates that the reject threshold is zero so it will never enter the re-sample loop – but the compiler can’t see into the slow path when it is compiling the fast path, so it can’t work out that it’s OK to completely drop the slow path guard and call.

rely on the programmer?

Maybe we can just add a note to the documentation saying, don’t call pcg32_rand_fast(rng, N) when N is a power of two, use pcg32_random(rng) & (N - 1) instead. But what are computers for if not taking care of tedious details for us?

check for a power of two?

The expression N & (N - 1) is zero if N is a power of two (or zero) and nonzero otherwise.

Maybe we can use this bitwise hack as a special case test in the fast path, like:

        if ((limit & (limit - 1)) != 0 && (uint32_t)(sample) < limit)
            return (pcg32_rand_slow(rng, limit, sample));

This is fine for constant limits, when the extra test can be optimized away. However, when the limit is not constant it increases the code size slightly, and adds a difficult-to-predict branch to the hot path. Not so great.

move the split?

Another option is to change where the split occurs, moving the division into the inlined code, like this:

        if ((uint32_t)(sample) < limit) {
            uint32_t reject = -limit % limit;
            if ((uint32_t)(sample) < reject)
                return (pcg32_rand_slow(rng, limit, reject));
        }

And change pcg32_rand_slow() so it gets the reject threshold from an argument instead of recalculating it.

This gives the compiler enough information to eliminate the slow path completely when the limit is a constant power of two.

In other cases it increases code size roughly the same amount as the explicit power-of-two check, but the extra code is hidden behind an easy-to-predict branch, so it’s cheaper overall.

compile time hacks?

The final option is to directly check if the limit is a compile-time constant power of two.

There is a gcc/clang extension __builtin_constant_p() that checks for a compile-time constant. More fun and more standard is the C11 version of Martin Uecker’s clever hack that relies on arcane details of how pointer types work with the ?: operator:

    #define is_constexpr(x) \
            _Generic(0 ? (long *)(1) : (void *)(0 * (long)(x)), \
                     long *: 1, void *: 0)
    // ...
        if (!(is_constexpr(limit) && (limit & (limit - 1)) == 0)
            && (uint32_t)(sample) < limit)
                return (pcg32_rand_slow(rng, limit, sample));

This is fast for constant powers of two, and minimizes the amount of duplicated slow path code in other cases. Best of both worlds!

But be warned: if you stare long into the void, the void stares also into you.

My implementations of pcg32 and pcg64-dxsm include this split version of Lemire’s algorithm, plus random floating point numbers