I previously wrote about Daniel Lemire’s algorithm for nearly divisionless unbiased bounded random numbers. Recently I found out there is a way to get really divisionless random numbers, as explained by Steve Canon and demonstrated by Kendall Willets.
I have written my own version of really divisionless random numbers so I could compare it with Lemire’s algorithm. Here’s how it works.
biased numbers
As before, I’m going to use W == 1<<32
as an abbreviation for two to the
power of the word size. (Except I am using 32-bit words instead of 64 bits
to make the worst cases more visible in the benchmark later on.)
We have a raw random number generator rng()
that produces numbers in the
range [0,W), but we are going to treat it as if it produces fixed-point
fractions in the range [0,1), as if it were rng() / W
. Then we can get a
value in our desired range [0,limit) by multiplying rng() * limit
. This
gives us a 32.32 fixed-point result, represented as a 64-bit number.
uint64_t num = (uint64_t)rng() * (uint64_t)limit;
uint32_t value = (uint32_t)(num >> 32);
This random value is biased. The bias is due to the limited precision of the fraction we started with. Lemire’s nearly-divisionless algorithm uses rejection sampling to debias its result. But what if we could calculate it with unlimited precision instead?
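To see the bias concretely, here is a sketch (the helper name is mine) using a tiny 3-bit word, W == 8, where the skew is easy to count by hand:

```c
#include <stdint.h>

/* The multiply-shift maps x in [0,W) to floor(x * limit / W).
   With a tiny 3-bit word (W == 8) and limit == 3, the outputs
   0 and 1 each occur three times but 2 occurs only twice,
   because 8 is not a multiple of 3. */
static uint32_t mulshift3(uint32_t x, uint32_t limit) {
    return (x * limit) >> 3; /* 3.3 fixed point, keep the integer part */
}
```

The same skew exists at W == 1<<32, just spread over far more values.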
multiprecision fractions
Let’s make a multiprecision fraction in [0,1) from a series of raw random
numbers. We started with rng() / W
, which we can extend W bits at a time:
rng()/W + rng()/(W*W) + rng()/(W*W*W) + ...
When we multiply by the limit we can distribute the multiplication through the sum:
limit*rng()/W + limit*rng()/(W*W) + ...
Each term of the sum is represented as a 32.32 fixed-point number, same as
before. To add them up, the upper half of the term on the right of a +
is
added to the lower half of the term on the left. But this can overflow, in
which case we need to propagate a carry.
It’s possible for a carry to propagate all the way to the left and affect the integer part of our result. Ignoring these carries is where the bias comes from in the limited-precision multiply. Instead, we are going to handle carries properly.
unbounded but not infinite
We will add up our multiprecision sum from left to right, at each step adding the high half of the next term to the low half of the current term. There are three interesting cases:

hi + lo < (W-1)
The addition does not cause a carry, and this term will absorb carries from later terms, so whatever integer value we started with is the correct result, and we can stop.

hi + lo == (W-1)
The addition does not cause a carry, but this term will propagate carries from later terms. We need to examine more terms.

hi + lo > (W-1)
The addition causes a carry. This term will absorb carries from later terms, so we can stop here. We also know (by induction) that all the terms to our left were == (W-1) and will propagate our carry up to our preliminary integer value, so our result is value + 1.
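The three cases can be traced through by hand with a tiny 3-bit word (W == 8); the numbers below are my own made-up example, not from the post:

```c
#include <stdint.h>

/* Worked example with W == 8 and limit == 3.  Suppose the first
   raw number is 5: 5 * 3 == 15 == 1.7 in 3.3 fixed point, so the
   preliminary value is 1 and the fraction is 7 == W-1, which is
   the ambiguous middle case.  Suppose the next raw number is 6:
   6 * 3 == 18 == 2.2, and its high half 2 added to the fraction 7
   overflows the 3-bit fraction, so a carry propagates and the
   result is 2.  Checking against exact arithmetic:
   (5/8 + 6/64) * 3 == 138/64, whose integer part is indeed 2. */
static uint32_t two_term_example(void) {
    uint32_t num = 5 * 3;          /* first term, 3.3 fixed point */
    uint32_t value = num >> 3;     /* preliminary integer part: 1 */
    uint32_t lo = num & 7;         /* 7 == W-1: must look further */
    uint32_t more = (6 * 3) >> 3;  /* high half of the second term: 2 */
    if (lo + more > 7) value += 1; /* carry: hi + lo > W-1 */
    return value;
}
```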
In the middle case where we need to keep going, we don’t actually have to
keep the result of the sum around (because we know it is W-1
) so we can
just discard it, forget about dividing by W
, and redo the loop exactly
as if we were adding the second term again.
The middle case is also very unlikely, probability 1 / W
, so the loop
will almost never have more than one iteration.
early escape
There is a fourth case that allows us to skip getting the next term’s
random number. Every term’s high half satisfies hi < limit
(it is the integer part of limit multiplied by a fraction less than one),
so when
limit + lo <= (W-1)
we know hi + lo < (W-1) whatever the next random number turns out to be,
and we can return the current value immediately.
bitwise hacks
OK, we are nearly there! But these additions can overflow our word size, and we would prefer to be able to test if there is a carry within our word size and without overflow. The following tests are all equivalent:
hi + lo > (1 << 32) - 1
hi + lo > (uint32_t)-1
hi > -lo - 1
hi > ~lo
All of our carry propagation checks test if something added to our
fractional part (what I called lo
above) would overflow, so we can keep
the fraction part inverted all the time.
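These equivalences are easy to spot-check. A minimal sketch (the helper names are mine) comparing the wide-arithmetic carry test against the narrow hi > ~lo test:

```c
#include <stdint.h>

/* The reference test, done in 64 bits where hi + lo cannot overflow. */
static int carry_wide(uint32_t hi, uint32_t lo) {
    return (uint64_t)hi + (uint64_t)lo > (uint64_t)UINT32_MAX;
}

/* The equivalent test that stays within the 32-bit word:
   hi + lo > W-1 is the same as hi > W-1-lo, and W-1-lo == ~lo. */
static int carry_narrow(uint32_t hi, uint32_t lo) {
    return hi > ~lo;
}
```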
code
So what I ended up with is:
uint32_t really_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    uint32_t value = (uint32_t)(num >> 32);
    uint32_t frac = ~(uint32_t)(num);
    while(limit > frac) {
        num = (uint64_t)rng() * (uint64_t)limit;
        uint32_t more = (uint32_t)(num >> 32);
        if(more < frac) return(value + 0);
        if(more > frac) return(value + 1);
        frac = ~(uint32_t)(num);
    }
    return(value);
}
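To try it out, the function needs an rng(). Here is a self-contained sketch wiring it to a simple xorshift32 generator; the generator choice and seed are my own, not from the benchmark below:

```c
#include <stdint.h>

/* Marsaglia's xorshift32 standing in for rng(); any 32-bit
   generator works, this one is just small and deterministic. */
static uint32_t rng_state = 2463534242u;
static uint32_t rng(void) {
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}

/* The really-divisionless algorithm from above, unchanged. */
uint32_t really_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    uint32_t value = (uint32_t)(num >> 32);
    uint32_t frac = ~(uint32_t)(num);
    while(limit > frac) {
        num = (uint64_t)rng() * (uint64_t)limit;
        uint32_t more = (uint32_t)(num >> 32);
        if(more < frac) return(value + 0);
        if(more > frac) return(value + 1);
        frac = ~(uint32_t)(num);
    }
    return(value);
}
```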
It is not quite as brief as Lemire’s algorithm:
uint32_t nearly_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    if((uint32_t)(num) < limit) {
        uint32_t residue = -limit % limit;
        while((uint32_t)(num) < residue)
            num = (uint64_t)rng() * (uint64_t)limit;
    }
    return((uint32_t)(num >> 32));
}
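The -limit % limit in the slow path is the clever bit: in 32-bit unsigned arithmetic -limit wraps to 2^32 - limit, so the expression computes 2^32 mod limit (the size of the biased residue to reject) without any 64-bit division. A sketch (helper name is mine):

```c
#include <stdint.h>

/* In unsigned 32-bit arithmetic, -limit == 2^32 - limit, and
   (2^32 - limit) mod limit == 2^32 mod limit, so this yields
   the rejection threshold using only a 32-bit division. */
static uint32_t residue32(uint32_t limit) {
    return -limit % limit;
}
```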
benchmark
I’ve written a simple benchmark harness to compare nearly-divisionless and really-divisionless random numbers.
On my Apple M1 Pro, I get:
nearly really
limit rng() time rng() time
10 100000000 0.288708 100000000 0.272645 (+0.199934)
100 100000000 0.270472 100000002 0.271755 (+0.019925)
1000 100000004 0.270880 100000018 0.271528 (+0.001843)
10000 100000195 0.271386 100000226 0.272936 (+0.000239)
100000 100001588 0.273159 100002420 0.271645 (+0.000011)
1000000 100022422 0.270636 100023153 0.273379 (+0.000001)
10000000 100115390 0.274848 100232254 0.275551 (-0.000170)
100000000 102259813 0.295489 102329850 0.308394 (+0.000040)
1000000000 107371660 0.506537 123283286 0.554149 (+0.000011)
On an old Intel i5-2520M (2.50GHz) I get:
nearly really
limit rng() time rng() time
10 100000000 0.527669 100000000 0.539901 (+0.199986)
100 100000002 0.517446 100000004 0.539832 (+0.019943)
1000 100000003 0.517961 100000029 0.540178 (+0.002025)
10000 100000164 0.517207 100000203 0.540129 (+0.000231)
100000 100001551 0.517465 100002365 0.541679 (+0.000082)
1000000 100022311 0.518833 100023428 0.542076 (-0.000006)
10000000 100115835 0.521978 100232914 0.546125 (-0.000093)
100000000 102260934 0.560111 102328804 0.580683 (-0.000099)
1000000000 107370473 0.925359 123280839 0.922405 (-0.000007)
The most notable thing is that the nearly-divisionless algorithm makes
fewer calls to the random number generator. On further investigation (not
shown in the results above) I found that both algorithms take the
early-exit with about the same probability, but nearly-divisionless calls
rng()
much less often in the slow path.
The old Intel box is less able to handle the greater complexity of the really-divisionless code, so it performs significantly slower even when mostly taking the early-out.
The differences build up when the limit gets near W
, when the algorithms
have to hit their slow path more frequently. On the old Intel box, with
relatively slow division, nearly-divisionless slows down more than
really-divisionless, but the M1 Pro clearly has very fast divisions so
nearly-divisionless stays ahead.
conclusion
It was fun to investigate this really-divisionless algorithm. It’s easier to understand than Lemire’s nearly-divisionless algorithm, but the really-divisionless code ends up being longer and more branchy, so it has a hard time keeping up.
If you need unbiased numbers with a large limit, you are better off using
an RNG that produces 64 bits at a time, so that these algorithms almost
never need to hit their slow path. Then the relative performance of your
rng()
and your hardware’s division matters much less, and Lemire’s algorithm
wins on brevity.