a blog post for international RNG day

Lemire’s nearly-divisionless algorithm unbiased bounded random numbers has a fast path and a slow path. In the fast path it gets a random number, does a multiplication, and a comparison. In the rarely-taken slow path, it calculates a remainder (the division) and enters a rejection sampling loop.

When Lemire’s algorithm is coupled to a small random number generator such as PCG, the fast path is just a handful of instructions. When performance matters, it makes sense to inline it. It makes less sense to inline the slow path, because that just makes it harder for the compiler to work on the hot code.

Lemire’s algorithm is great when the limit is not a constant (such as during a Fisher-Yates shuffle) or when the limit is not a power of two. But when the limit is a constant power of two, it ought to be possible to eliminate the slow path entirely.

What are the options?

- basic fast / slow split
- problem
- rely on the programmer?
- check for a power of two?
- move the split?
- compile time hacks?

## basic fast / slow split

The following function is a typical implementation of Lemire’s algorithm. (See my past blog post for an explanation.)

```
uint32_t
pcg32_rand(pcg32_t *rng, uint32_t limit) {
uint64_t hi_lo = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
if ((uint32_t)(hi_lo) < limit) {
pcg_uint_t residue = -limit % limit;
while ((pcg_uint_t)(hi_lo) < residue)
hi_lo = (pcg_ulong_t)pcg_random(rng) * (pcg_ulong_t)limit;
}
return ((uint32_t)(hi_lo >> PCG_UINT_BITS));
}
```

To separate the fast and slow paths, we can put everything up to the slow path test into an inline function in a header:

```
static inline uint32_t
pcg32_rand_fast(pcg32_t *rng, uint32_t limit) {
uint64_t hi_lo = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
if ((uint32_t)(hi_lo) < limit)
return (pcg32_rand_slow(rng, limit, hi_lo));
return ((uint32_t)(hi_lo >> PCG_UINT_BITS));
}
```

And the division (or rather, the remainder) and the rejection sampling loop are shoved off into a separate compilation unit:

```
uint32_t
pcg32_rand_slow(pcg32_t *rng, uint32_t limit, uint64_t hi_lo) {
uint32_t residue = -limit % limit;
while ((uint32_t)(hi_lo) < residue)
hi_lo = (uint64_t)pcg32_random(rng) * (uint64_t)limit;
return ((uint32_t)(hi_lo >> 32));
}
```

## problem

The basic split minimizes the amount of duplicated slow-path code. The
problem is that the compiler cannot eliminate the slow path when
`pcg32_rand_fast()`

is called with a constant power of two.

## rely on the programmer?

Maybe we can just add a note to the documentation saying, don’t call
`pcg32_rand_fast(rng, N)`

when N is a power of two, use
`pcg32_random(rng) & (N - 1)`

instead. But what are computers for if
not taking care of tedious details for us?

## check for a power of two?

The expression `N & (N - 1)`

is zero if N is a power of two (or zero)
and nonzero otherwise.

Maybe we can use this bitwise hack as a special case test in the fast path, like:

```
if ((limit & (limit - 1)) != 0 && (uint32_t)(hi_lo) < limit)
return (pcg32_rand_slow(rng, limit, hi_lo));
```

This is fine for constant limits, when the extra test can be optimized away. However, when the limit is not constant it increases the code size slightly, and adds a difficult-to-predict branch to the hot path. Not so great.

## move the split?

Another option is to change where the split occurs, moving the division into the inlined code, like this:

```
if ((uint32_t)(hi_lo) < limit) {
uint32_t residue = -limit % limit;
if ((uint32_t)(hi_lo) < residue)
return (pcg32_rand_slow(rng, limit, residue));
}
```

And change `pcg32_rand_slow()`

so it gets the residue from an argument
instead of recalculating it.

This gives the compiler enough information to eliminate the slow path completely when the limit is a constant power of two.

In other cases it increases code size roughly the same amount as the explicit power-of-two check, but the extra code is hidden behind an easy-to-predict branch, so it’s cheaper overall.

## compile time hacks?

The final option is to directly check if the limit is a compile-time constant power of two.

There is a gcc/clang extension `__builtin_constant_p()`

that checks
for a compile-time constant. More fun and more standard is the C11
version of Martin Uecker’s clever hack that relies on
arcane details of how pointer types work with the `?:`

operator:

```
#define is_constexpr(x) \
_Generic(0 ? (long *)(1) : (void *)(0 * (long)(x)), \
long *: 1, void *: 0)
// ...
if (!(is_constexpr(limit) && (limit & (limit - 1)) == 0)
&& (uint32_t)(hi_lo) < limit)
return (pcg32_rand_slow(rng, limit, hi_lo));
```

This is fast for constant powers of two, and minimizes the amount of duplicated slow path code in other cases. Best of both worlds!

But be warned: if you stare long into the void, the void stares also into you.

My implementations of pcg32 and pcg64-dxsm include this split version of Lemire’s algorithm, plus random floating point numbers