I recently saw Colm MacCárthaigh’s exegesis of Daniel Lemire’s paper about nearly divisionless fast random integer generation. I found myself writing down my own notes on how it works, which take a slightly different angle.
I have written a follow-up blog post about splitting Lemire’s algorithm between an inline fast path and a separate slow path.
#include <stdint.h>

/* assumes a compiler with 128-bit integers, such as GCC or Clang */
typedef unsigned __int128 uint128_t;

uint64_t random_lemire(uint64_t limit) {
    /* raw is 64 uniform random bits from the underlying generator */
    uint64_t raw = random_u64();
    /* 128-bit product: the top 64 bits are the integer part of the
       sample, the bottom 64 bits are the fractional part */
    uint128_t sample = (uint128_t)raw * (uint128_t)limit;
    uint64_t frac = (uint64_t)sample;
    if (frac < limit) {
        /* slow path: the algorithm's only division */
        const uint64_t residue = -limit % limit;
        while (frac < residue) {
            raw = random_u64();
            sample = (uint128_t)raw * (uint128_t)limit;
            frac = (uint64_t)sample;
        }
    }
    return (uint64_t)(sample >> 64);
}
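The code assumes a random_u64() that returns 64 uniform random bits; nothing in the algorithm depends on which generator provides them. To make the examples below runnable, here is a minimal placeholder using the well-known splitmix64 generator (the fixed seed is an arbitrary choice of mine, for illustration only):

#include <stdint.h>

/* placeholder source of uniform 64-bit values, not part of the algorithm */
static uint64_t rng_state = 12345; /* arbitrary fixed seed, illustration only */

uint64_t random_u64(void) {
    uint64_t z = (rng_state += 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

With that in place, random_lemire(6) returns a uniform number in the range 0 to 5, like an unbiased die roll.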
(In this explanation, W == 1<<64, i.e. two to the power of the word size.)
We first get a raw random value, which we can think of as a fraction 0 <= raw < 1 multiplied by W, so it is represented as a fixed-point number. We calculate a sample as raw * limit, which is a random fixed-point fractional number 0 <= sample < limit. Our return value r will be the integer part of fixed-point sample, which in machine arithmetic is r == sample / W == sample >> 64. We will also work with frac, the fractional part of sample, which in machine arithmetic we get by truncating sample to its least significant 64 bits, so that sample == W*r + frac.
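To make the fixed-point picture concrete, here is a small worked example (the raw value is an arbitrary choice of mine, representing the fraction 3/4):

#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 uint128_t;

int main(void) {
    uint64_t limit = 6;                     /* sampling 0..5 */
    uint64_t raw = 0xc000000000000000ull;   /* represents the fraction 3/4 */
    uint128_t sample = (uint128_t)raw * (uint128_t)limit;
    uint64_t r = (uint64_t)(sample >> 64);  /* integer part: 4 == floor(6 * 3/4) */
    uint64_t frac = (uint64_t)sample;       /* fractional part: W/2, i.e. 0.5 */
    printf("r == %llu, frac == 0x%llx\n",
           (unsigned long long)r, (unsigned long long)frac);
    return 0;
}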
There is a little bias in this basic fixed-point calculation, so we are going to do some rejection sampling. We will reject and retry when we get one of residue = W % limit special raw values, so that our return values are sampled from W - residue == n * limit random values. This is the largest possible integer n where n * limit <= W, which minimizes the probability of a retry.
To avoid overflow when calculating residue (because W does not fit in a word) we use some modular arithmetic tricks: W % limit == (W - limit) % limit, and W - limit is the same as -limit in unsigned two’s complement arithmetic, so residue == W % limit == -limit % limit. We know this division is safe because the limit is strictly greater than zero after the if() test.
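For example (a quick check of mine, not anything from the paper), with limit == 6 the residue is 2^64 % 6 == 4, and the unsigned negation computes it entirely in 64 bits:

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint64_t limit = 6;
    /* -limit wraps to W - limit == 2^64 - 6 in unsigned arithmetic,
       and (W - limit) % limit == W % limit */
    uint64_t residue = -limit % limit;
    assert(residue == 4); /* 2^64 == 6 * 3074457345618258602 + 4 */
    return 0;
}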
We don’t compare residue with raw and reject a contiguous range of possible raw random values, because sample uses the entire range of raw and there isn’t a straightforward overflow interval that we can reject. Instead we compare residue with frac, the fractional part of sample = raw * limit. The rough idea is to reject residue/limit worth of bias for each of the limit possible return values in the integer part of sample.
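To see how the rejected raw values are scattered rather than contiguous, it helps to shrink the word size. The sketch below (my own scaled-down model, with W == 256 and limit == 6) prints which raw values the frac < residue test rejects:

#include <stdio.h>

int main(void) {
    unsigned W = 256, limit = 6;  /* scaled-down model of the 64-bit algorithm */
    unsigned residue = W % limit; /* == 4 */
    for (unsigned raw = 0; raw < W; raw++) {
        unsigned frac = (raw * limit) % W;
        if (frac < residue)
            printf("reject raw == %u (r == %u)\n", raw, raw * limit / W);
    }
    return 0;
}

It rejects raw values 0, 43, 128, and 171: one just above each of four interval boundaries, not a single contiguous run, and the intervals for r == 2 and r == 5 reject nothing at all.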
More precisely (working in machine arithmetic rather than fixed-point), the range 0 <= sample < W * limit consists of several intervals, one for each return value 0 <= r < limit. Interval r covers the sub-range

    W*r <= sample < W*(r+1)

In each interval we reject sample if it is in the range

    W*r <= sample < W*r + residue

and keep sample if it is in the range

    W*r + residue <= sample < W*(r+1)

The code uses truncated frac rather than sample == W*r + frac, which in effect subtracts out the W*r part of these ranges, so the reject test becomes frac < residue.
Each keep interval spans

    W*(r+1) - (W*r + residue) == W - residue == n * limit

When we map from raw to sample with sample = raw * limit, the same number of raw values fall into each keep interval, because every keep interval is the same size and that size is a multiple of limit. As a result, the sampling is uniform.
Not all reject intervals contain a sample value because residue < limit, or (in terms of our earlier intuition) because residue/limit < 1. (Sort-of relatedly, the phase alignment of samples in each keep interval is typically different.) Because of this per-return-value lumpiness, and unlike simpler versions of rejection sampling, this algorithm has a timing side-channel that can reveal a small amount of information about its result, so it should not be used for secrets.
The final tweak makes the algorithm nearly divisionless by avoiding the calculation of residue. We observe that residue == W % limit < limit, so we can do an initial rough comparison frac < limit to see if it’s worth preparing for the rejection sampling loop. This saves division work in proportion to how much smaller limit is than W. It also saves us from having to explicitly check if the limit is zero.
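The frac < limit test is also what makes it natural to split the algorithm between an inline fast path and a separate slow path, as in the follow-up post mentioned above. Here is a rough sketch of the shape of that split, reusing random_u64() and the uint128_t typedef from earlier (the names and details are mine, not necessarily the follow-up’s):

/* out of line: entered with probability limit/W, does the only division */
uint64_t random_lemire_slow(uint64_t limit, uint128_t sample) {
    const uint64_t residue = -limit % limit;
    while ((uint64_t)sample < residue)
        sample = (uint128_t)random_u64() * (uint128_t)limit;
    return (uint64_t)(sample >> 64);
}

/* inline: the common case is one multiply and one comparison */
static inline uint64_t random_lemire_fast(uint64_t limit) {
    uint128_t sample = (uint128_t)random_u64() * (uint128_t)limit;
    if ((uint64_t)sample < limit)
        return random_lemire_slow(limit, sample);
    return (uint64_t)(sample >> 64);
}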
Our return values use the most significant bits of the underlying random number generator, and rejection sampling depends on the next most significant bits (because of the way carries in raw * limit cascade over multiples of W), so we still work OK with RNGs that have quality problems in their least significant bits.