I recently saw Colm MacCárthaigh’s exegesis of Daniel Lemire’s paper about nearly divisionless fast random integer generation. I found myself writing down my own notes on how it works, from a slightly different angle.

I have written a follow-up blog post about splitting Lemire’s algorithm between an inline fast path and a separate slow path.

```
uint64_t random_lemire(uint64_t limit) {
    uint64_t raw = random_u64();
    uint128_t sample = (uint128_t)raw * (uint128_t)limit;
    uint64_t frac = (uint64_t)sample;
    if(frac < limit) {
        const uint64_t residue = -limit % limit;
        while(frac < residue) {
            raw = random_u64();
            sample = (uint128_t)raw * (uint128_t)limit;
            frac = (uint64_t)sample;
        }
    }
    return(sample >> 64);
}
```
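
The listing relies on a `uint128_t` type and a `random_u64()` generator that
aren’t shown. Here is a minimal sketch of the missing pieces, assuming GCC or
Clang’s `unsigned __int128`, with a splitmix64-style generator standing in for
whatever RNG you actually use (these definitions would need to come before the
function above):

```
#include <stdint.h>
#include <stdio.h>

/* assumption: uint128_t is GCC/Clang's unsigned __int128 */
typedef unsigned __int128 uint128_t;

/* stand-in generator (splitmix64); substitute your real RNG here */
static uint64_t rng_state = 0x853c49e6748fea9b;
uint64_t random_u64(void) {
    uint64_t z = (rng_state += 0x9e3779b97f4a7c15);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
    return z ^ (z >> 31);
}

uint64_t random_lemire(uint64_t limit); /* the function above */

int main(void) {
    for (int i = 0; i < 10; i++)
        printf("%llu\n", (unsigned long long)random_lemire(6));
    return 0;
}
```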

(In this explanation, `W == 1<<64`, i.e. two to the power of the word size.)

We first get a `raw` random value, which we can think of as a
fraction `0 <= raw < 1`, multiplied by `W` so it is represented
as a fixed-point number. We calculate a `sample` as `raw * limit`,
which is a random fixed-point fractional number `0 <= sample < limit`.
Our return value `r` will be the integer part of fixed-point
`sample`, which in machine arithmetic is `r == sample / W == sample >> 64`.
We will also work with `frac`, the fractional part of `sample`,
which in machine arithmetic we get by truncating `sample` to its
least significant 64 bits, so that `sample == W*r + frac`.
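
Here is a small sketch of that bookkeeping (assuming the `unsigned __int128`
type from the usage sketch above), splitting one `sample` into its integer and
fractional parts and checking that `sample == W*r + frac`:

```
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;

/* split sample = raw * limit into integer part r and fractional part frac,
   and check the identity sample == W*r + frac (requires limit > 0) */
void check_split(uint64_t raw, uint64_t limit) {
    uint128_t sample = (uint128_t)raw * (uint128_t)limit;
    uint64_t r = (uint64_t)(sample >> 64); /* integer part: sample / W */
    uint64_t frac = (uint64_t)sample;      /* fractional part: sample % W */
    assert(sample == ((uint128_t)r << 64) + frac);
    assert(r < limit);                     /* the return value is always in range */
}

int main(void) {
    check_split(0x123456789abcdef0, 6);
    check_split(UINT64_MAX, 12345);
    return 0;
}
```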

There is a little bias in this basic fixed-point calculation, so we are going to do some rejection sampling.

We will reject and retry when we get one of `residue = W % limit`
special `raw` values, so that our return values are sampled from
`W - residue == n * limit` random values. This is the largest
possible integer `n` where `n * limit <= W`, to minimize the
probability of a retry.
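
To see the counting with concrete numbers, here is a toy sketch with a
hypothetical 4-bit word, so `W == 16`: with `limit == 6` we get `residue == 4`
and `n == 2`, and in general `n * limit + residue == W`.

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t W = 16;              /* toy word size: pretend words are 4 bits */
    const uint32_t limit = 6;
    const uint32_t residue = W % limit; /* 4 raw values will be rejected */
    const uint32_t n = W / limit;       /* 2 raw values kept per return value */
    assert(n * limit + residue == W);   /* i.e. W - residue == n * limit */
    printf("keep %u of %u raw values, %u per return value\n",
           W - residue, W, n);
    return 0;
}
```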

To avoid overflow when calculating `residue` (because `W` does
not fit in a word) we use some modular arithmetic tricks:
`W % limit == (W - limit) % limit`, and `W - limit` is the same as
`-limit` in unsigned two’s complement arithmetic, so
`residue == W % limit == -limit % limit`. We know this division is
safe because we know the `limit` is strictly greater than zero
after the `if()` test.
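
A quick sketch to confirm the trick, assuming `unsigned __int128` so that `W`
itself can be represented and `W % limit` computed directly:

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const unsigned __int128 W = (unsigned __int128)1 << 64;
    for (uint64_t limit = 1; limit < 100000; limit++) {
        uint64_t wide = (uint64_t)(W % limit); /* what residue means */
        uint64_t narrow = -limit % limit;      /* what the code computes */
        assert(wide == narrow);
    }
    puts("-limit % limit agrees with W % limit for every limit tested");
    return 0;
}
```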

We don’t compare `residue` with `raw` and reject a contiguous
range of possible `raw` random values, because `sample` uses the
entire range of `raw` and there isn’t a straightforward overflow
interval that we can reject. Instead we compare `residue` with
`frac`, the fractional part of `sample = raw * limit`. The rough
idea is to reject `residue/limit` worth of bias for each of the
`limit` possible return values in the integer part of `sample`.

More precisely (working in machine arithmetic rather than
fixed-point), the range `0 <= sample < W * limit` consists of
several intervals, one for each return value `0 <= r < limit`.
Interval `r` covers the sub-range

```
W*r <= sample < W*(r+1)
```

In each interval we reject `sample` if it is in the range

```
W*r <= sample < W*r + residue
```

and keep `sample` if it is in the range

```
W*r + residue <= sample < W*(r+1)
```

The code uses truncated `frac` rather than `sample == W*r + frac`,
which in effect subtracts out the `W*r` part of these ranges, so
the reject test becomes `frac < residue`.

Each keep interval spans

```
W*(r+1) - (W*r + residue) == W - residue == n * limit
```

When we map from `raw` to `sample` with `sample = raw * limit`,
the same number of `raw` values fall into each keep interval,
because each keep interval is the same size and is a multiple of
`limit`. As a result, the sampling will be uniform.
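
We can verify both the interval argument and the uniformity claim exhaustively
at a toy word size. The following sketch pretends the word is 16 bits
(`W == 1 << 16`), enumerates every possible `raw`, applies the same reject
test, and checks that each return value keeps exactly `n` raw values:

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint32_t W = 1 << 16;         /* toy word size: 16-bit raw values */
    const uint32_t limit = 6000;        /* any 0 < limit <= W works here */
    const uint32_t residue = W % limit; /* toy equivalent of -limit % limit */
    const uint32_t n = W / limit;       /* expected keeps per return value */

    static uint32_t count[1 << 16];     /* count[r] = kept raws returning r */
    for (uint32_t raw = 0; raw < W; raw++) {
        uint32_t sample = raw * limit;  /* fits in 32 bits: raw, limit < 1<<16 */
        uint32_t frac = sample % W;     /* low 16 bits */
        uint32_t r = sample / W;        /* high 16 bits: the return value */
        if (frac >= residue)            /* the keep condition */
            count[r]++;
    }
    for (uint32_t r = 0; r < limit; r++)
        assert(count[r] == n);          /* every return value equally likely */
    printf("each of %u return values keeps exactly %u raw values\n", limit, n);
    return 0;
}
```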

Not all reject intervals contain a sample value because
`residue < limit`, or (in terms of our earlier intuition) because
`residue/limit < 1`. (Sort-of relatedly, the phase alignment of
samples in each keep interval is typically different.) Because
of this per-return-value lumpiness, and unlike simpler versions
of rejection sampling, this algorithm has a timing side-channel
that can reveal a small amount of information about its result,
so it should not be used for secrets.

The final tweak makes the algorithm nearly divisionless by
avoiding the calculation of `residue`. We observe that
`residue == W % limit < limit`, so we can do an initial rough
comparison `frac < limit` to see if it’s worth preparing for the
rejection sampling loop. This saves division work in proportion
to how much smaller `limit` is than `W`: the pre-check succeeds
with probability `limit/W`, so the division is usually skipped
entirely. It also saves us from having to explicitly check if
the `limit` is zero.
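
For comparison, here is a sketch of the same rejection scheme without the
pre-check, so `residue` is computed on every call (it reuses the hypothetical
`uint128_t` and `random_u64()` from the usage sketch near the top). It draws
from the same distribution, but pays for a division even when no rejection
would be needed, and it has to check for a zero `limit` explicitly:

```
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t; /* as in the usage sketch above */
uint64_t random_u64(void);           /* caller-supplied generator */

uint64_t random_always_divides(uint64_t limit) {
    assert(limit > 0);                       /* no longer implied by an if() test */
    const uint64_t residue = -limit % limit; /* one division on every call */
    uint64_t raw, frac;
    uint128_t sample;
    do {
        raw = random_u64();
        sample = (uint128_t)raw * (uint128_t)limit;
        frac = (uint64_t)sample;
    } while (frac < residue);                /* reject the biased fractions */
    return sample >> 64;
}
```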

Our return values use the most significant bits of the
underlying random number generator, and rejection sampling
depends on the next most significant bits (because of the way
carries in `raw * limit` cascade over multiples of `W`), so we
still work OK with RNGs that have quality problems in their
least significant bits.