I recently saw Colm MacCárthaigh’s exegesis of Daniel Lemire’s paper about nearly divisionless fast random integer generation. I found myself writing down my own notes on how it works that take a slightly different angle.
I have written a follow-up blog post about splitting Lemire’s algorithm between an inline fast path and a separate slow path.
#include <stdint.h>

/* GCC and Clang provide 128-bit integers on 64-bit targets */
typedef unsigned __int128 uint128_t;

/* our source of uniformly distributed 64-bit words */
uint64_t random_u64(void);

uint64_t random_lemire(uint64_t limit) {
    uint64_t raw = random_u64();
    uint128_t sample = (uint128_t)raw * (uint128_t)limit;
    uint64_t frac = (uint64_t)sample;
    /* rough test: might we need rejection sampling? */
    if(frac < limit) {
        /* the only division, and it is usually skipped */
        const uint64_t residue = -limit % limit;
        while(frac < residue) {
            raw = random_u64();
            sample = (uint128_t)raw * (uint128_t)limit;
            frac = (uint64_t)sample;
        }
    }
    return(sample >> 64);
}
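For example, random_lemire(6) returns a uniform value between 0 and 5 inclusive, like a die roll counted from zero.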
(In this explanation, W == 1<<64, i.e. two to the power of the word size.)
We first get a raw random value, which we can think of as a
fraction 0 <= raw < 1, multiplied by W so it is represented
as a fixed-point number. We calculate a sample as raw * limit, which is a random fixed-point fractional number 0 <= sample < limit. Our return value r will be the integer part
of fixed-point sample, which in machine arithmetic is r == sample / W == sample >> 64. We will also work with frac, the
fractional part of sample, which in machine arithmetic we get
by truncating sample to its least significant 64 bits, so that
sample == W*r + frac.
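To make this concrete, here is a minimal sketch that splits a sample into its two parts and verifies the identity. (check_decomposition is an illustrative name, not part of the algorithm; the uint128_t typedef is the GCC/Clang extension used above.)

    #include <assert.h>
    #include <stdint.h>

    typedef unsigned __int128 uint128_t;

    /* illustrative only: split sample into its integer part r and
       fractional part frac, then verify sample == W*r + frac */
    void check_decomposition(uint64_t raw, uint64_t limit) {
        uint128_t sample = (uint128_t)raw * (uint128_t)limit;
        uint64_t r = (uint64_t)(sample >> 64); /* integer part */
        uint64_t frac = (uint64_t)sample;      /* fractional part */
        assert(sample == ((uint128_t)r << 64) + frac);
    }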
There is a little bias in this basic fixed-point calculation, so we are going to do some rejection sampling.
We will reject and retry when we get one of residue == W % limit special raw values, so that our return values are sampled from the remaining W - residue == n * limit raw values. Here n is the largest possible integer such that n * limit <= W, which minimizes the probability of a retry.
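For instance, if we shrink the word to 4 bits so that W == 16, then for limit == 6 we have residue == 16 % 6 == 4, and return values are sampled from the remaining 12 == 2 * 6 raw values.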
To avoid overflow when calculating residue (because W does
not fit in a word) we use some modular arithmetic tricks: W % limit == (W - limit) % limit, and W - limit is the same as
-limit in unsigned two’s complement arithmetic, so residue == W % limit == -limit % limit. We know this division is safe
because we know the limit is strictly greater than zero after
the if() test.
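As a quick sanity check of that identity, here is a sketch that computes W % limit directly in 128-bit arithmetic and compares. (check_residue is illustrative, not part of the algorithm.)

    #include <assert.h>
    #include <stdint.h>

    typedef unsigned __int128 uint128_t;

    /* verify that -limit % limit computes W % limit, for limit > 0 */
    void check_residue(uint64_t limit) {
        uint128_t W = (uint128_t)1 << 64;
        assert((uint64_t)(W % limit) == -limit % limit);
    }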
We don’t compare residue with raw and reject a contiguous
range of possible raw random values, because sample uses the
entire range of raw and there isn’t a straightforward overflow
interval that we can reject. Instead we compare residue with
frac, the fractional part of sample = raw * limit. The rough
idea is to reject residue/limit worth of bias for each of the
limit possible return values in the integer part of sample.
More precisely (working in machine arithmetic rather than
fixed-point), the range 0 <= sample < W * limit consists of
several intervals, one for each return value 0 <= r < limit.
Interval r covers the sub-range
W*r <= sample < W*(r+1)
In each interval we reject sample if it is in the range
W*r <= sample < W*r + residue
and keep sample if it is in the range
W*r + residue <= sample < W*(r+1)
The code uses truncated frac rather than sample == W*r + frac, which in effect subtracts out the W*r part of these
ranges, so the reject test becomes frac < residue.
Each keep interval spans
W*(r+1) - (W*r + residue) == W - residue == n * limit
When we map from raw to sample with sample = raw * limit,
the same number of raw values fall into each keep interval,
because each keep interval is the same size and is a multiple of
limit. As a result, the sampling will be uniform.
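We can watch this happen at toy scale by shrinking the word to 8 bits, so that W == 256 and we can enumerate every raw value. This sketch, with an arbitrary example limit of 6, counts how many kept raw values map to each return value; every count comes out equal (n == 42), and 256 - 252 == 4 raw values are rejected:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t limit = 6;              /* arbitrary example limit */
        uint32_t residue = 256 % limit;  /* W == 256 for 8-bit words */
        uint32_t count[6] = {0};         /* sized to match limit */
        for(uint32_t raw = 0; raw < 256; raw++) {
            uint32_t sample = raw * limit;
            uint32_t frac = sample & 0xff; /* low 8 bits of sample */
            if(frac >= residue)            /* keep, don't reject */
                count[sample >> 8]++;      /* tally the integer part */
        }
        for(uint32_t r = 0; r < limit; r++)
            printf("r == %u: %u raw values\n", r, count[r]);
        return(0);
    }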
Not all reject intervals contain a sample value because residue < limit, or (in terms of our earlier intuition) because
residue/limit < 1. (Sort-of relatedly, the phase alignment of
samples in each keep interval is typically different.) Because
of this per-return-value lumpiness, and unlike simpler versions
of rejection sampling, this algorithm has a timing side-channel
that can reveal a small amount of information about its result,
so it should not be used for secrets.
The final tweak makes the algorithm nearly divisionless by avoiding the calculation of residue in most cases. We observe that residue == W % limit < limit, so we can do an initial rough comparison frac < limit to see if it’s worth preparing for the rejection sampling loop. This saves division work in proportion to how much smaller limit is than W. It also saves us from having to explicitly check that the limit is not zero, which would otherwise make the calculation of residue divide by zero.
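(The test frac < limit succeeds for roughly limit out of the W possible raw values, so when the limit is small the division is almost never reached.)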
Our return values use the most significant bits of the
underlying random number generator, and rejection sampling
depends on the next most significant bits (because of the way
carries in raw * limit cascade over multiples of W), so we
still work OK with RNGs that have quality problems in their
least significant bits.