I recently saw Colm MacCárthaigh’s exegesis of Daniel Lemire’s paper about nearly divisionless fast random integer generation. I found myself writing down my own notes on how it works, which take a slightly different angle.

#include <stdint.h>

/* GCC and Clang provide a 128-bit unsigned type as an extension. */
typedef __uint128_t uint128_t;

/* A source of uniformly distributed raw 64-bit random words. */
uint64_t random_u64(void);

/* Returns a uniform random integer r such that 0 <= r < limit. */
uint64_t random_lemire(uint64_t limit) {
    uint64_t raw = random_u64();
    uint128_t sample = (uint128_t)raw * (uint128_t)limit;
    uint64_t frac = (uint64_t)sample; /* low 64 bits of sample */
    if(frac < limit) {
        /* residue == (2**64) % limit, as explained below */
        const uint64_t residue = -limit % limit;
        while(frac < residue) {
            raw = random_u64();
            sample = (uint128_t)raw * (uint128_t)limit;
            frac = (uint64_t)sample;
        }
    }
    return(sample >> 64); /* high 64 bits of sample */
}
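
For example, assuming random_u64() has been wired up to some 64-bit generator, a fair die roll is a one-liner:

uint64_t roll = random_lemire(6) + 1; /* uniform in 1..6 */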

(In this explanation, W == 1<<64 i.e. two to the power of the word size.)

We first get a raw random value, which we can think of as the fraction 0 <= raw/W < 1 represented as a fixed-point number, i.e. scaled by W. We calculate a sample as raw * limit, which is a random fixed-point number representing a fraction 0 <= sample/W < limit. Our return value r will be the integer part of the fixed-point sample, which in machine arithmetic is r == sample / W == sample >> 64. We will also work with frac, the fractional part of sample, which in machine arithmetic we get by truncating sample to its least significant 64 bits, so that sample == W*r + frac.
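
Here is a toy check of these identities, using a 4-bit word (so W == 16) and arbitrary limit and raw values chosen just to make the fixed-point bookkeeping visible:

#include <assert.h>

int main(void) {
    unsigned W = 16, limit = 6, raw = 9;  /* raw represents the fraction 9/16 */
    unsigned sample = raw * limit;        /* 54, i.e. fixed-point 3.375 * 16 */
    unsigned r = sample / W;              /* integer part == 3 */
    unsigned frac = sample % W;           /* fractional part == 6 */
    assert(sample == W*r + frac);         /* 54 == 16*3 + 6 */
    assert(r < limit);
    return 0;
}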

There is a little bias in this basic fixed-point calculation: unless limit happens to divide W exactly, some return values are produced by one more raw value than others. So we are going to do some rejection sampling.
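
To see the bias concretely, here is an exhaustive count at a toy word size; W == 16 and limit == 6 are illustrative choices, not part of the real algorithm:

#include <stdio.h>

int main(void) {
    unsigned W = 16, limit = 6;
    unsigned count[6] = {0};
    for(unsigned raw = 0; raw < W; raw++)
        count[raw * limit / W]++;         /* tally the integer part */
    for(unsigned r = 0; r < limit; r++)
        printf("r=%u count=%u\n", r, count[r]); /* prints 3,3,2,3,3,2 */
    return 0;
}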

We will reject and retry when we get one of residue = W % limit special raw values, so that our return values are sampled from W - residue == n * limit random values. This is the largest possible integer n where n * limit <= W, to minimize the probability of a retry.
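
In the toy setting from before (W == 16 and limit == 6, chosen only for illustration) the numbers work out like this:

#include <assert.h>

int main(void) {
    unsigned W = 16, limit = 6;
    unsigned residue = W % limit;         /* 4 raw values will be rejected */
    unsigned n = (W - residue) / limit;   /* n == 2 */
    assert(W - residue == n * limit);     /* 12 == 2 * 6 */
    assert(n * limit <= W && (n + 1) * limit > W); /* n is maximal */
    return 0;
}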

To avoid overflow when calculating residue (because W does not fit in a word) we use some modular arithmetic tricks: W % limit == (W - limit) % limit, and W - limit is what we get from -limit in unsigned arithmetic, which wraps around modulo W, so residue == W % limit == -limit % limit. We know this division is safe because the limit must be strictly greater than zero to pass the if() test.
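
A quick sanity check of these identities, computing W % limit directly with a 128-bit type (limit == 6 is an arbitrary example):

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t limit = 6;
    __uint128_t W = (__uint128_t)1 << 64;
    assert(-limit == (uint64_t)(W - limit));         /* -limit wraps to W - limit */
    assert((uint64_t)(W % limit) == -limit % limit); /* both equal 4 */
    return 0;
}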

We can’t just compare raw against a threshold and reject a contiguous range of possible raw random values, as simpler rejection sampling does, because sample uses the entire range of raw: the bias is spread across all the return values, so there isn’t a straightforward overflow interval that we can reject. Instead we compare residue with frac, the fractional part of sample = raw * limit. The rough idea is to reject residue/limit worth of bias for each of the limit possible return values in the integer part of sample.

More precisely (working in machine arithmetic rather than fixed-point), the range 0 <= sample < W * limit consists of several intervals, one for each return value 0 <= r < limit. Interval r covers the sub-range

    W*r <= sample < W*(r+1)

In each interval we reject sample if it is in the range

    W*r <= sample < W*r + residue

and keep sample if it is in the range

    W*r + residue <= sample < W*(r+1)

The code uses truncated frac rather than sample == W*r + frac, which in effect subtracts out the W*r part of these ranges, so the reject test becomes frac < residue.

Each keep interval spans

    W*(r+1) - (W*r + residue) == W - residue == n * limit

When we map from raw to sample with sample = raw * limit, consecutive sample values are spaced exactly limit apart, so a keep interval of size n * limit contains exactly n sample values no matter how it is aligned. The same number of raw values therefore falls into each keep interval, and the sampling is uniform.
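
The toy example can verify this interval accounting exhaustively; again W == 16 and limit == 6 are arbitrary choices:

#include <assert.h>

int main(void) {
    unsigned W = 16, limit = 6;
    unsigned residue = W % limit, n = (W - residue) / limit;
    for(unsigned r = 0; r < limit; r++) {
        unsigned kept = 0;
        for(unsigned raw = 0; raw < W; raw++) {
            unsigned sample = raw * limit;
            if(W*r + residue <= sample && sample < W*(r+1))
                kept++;
        }
        assert(kept == n);  /* exactly n raw values in each keep interval */
    }
    return 0;
}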

Not all reject intervals contain a sample value because residue < limit, or (in terms of our earlier intuition) because residue/limit < 1. (Sort-of relatedly, the phase alignment of samples in each keep interval is typically different.) Because of this per-return-value lumpiness, and unlike simpler versions of rejection sampling, this algorithm has a timing side-channel that can reveal a small amount of information about its result, so it should not be used for secrets.

The final tweak makes the algorithm nearly divisionless by avoiding the calculation of residue. We observe that residue == W % limit < limit, so we can do an initial rough comparison frac < limit to see if it’s worth preparing for the rejection sampling loop. Since frac is nearly uniform over 0 <= frac < W, this pre-check passes with probability about limit/W, so when limit is much smaller than W we almost never pay for the division. It also saves us from having to explicitly check if the limit is zero.
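
For comparison, here is a sketch (a hypothetical variant, not from Lemire’s paper) of the same sampler without the pre-check, reusing random_u64() and uint128_t from the listing above; it pays for the division on every call and must be guarded against a zero limit separately:

uint64_t random_division(uint64_t limit) {
    /* caller must ensure limit > 0, or -limit % limit divides by zero */
    const uint64_t residue = -limit % limit; /* a division on every call */
    uint128_t sample;
    uint64_t frac;
    do {
        sample = (uint128_t)random_u64() * (uint128_t)limit;
        frac = (uint64_t)sample;
    } while(frac < residue);
    return(sample >> 64);
}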

Our return values use the most significant bits of the underlying random number generator, and rejection sampling depends on the next most significant bits (because of the way carries in raw * limit cascade over multiples of W), so the algorithm still works OK with RNGs that have quality problems in their least significant bits.