I previously wrote about Daniel Lemire’s algorithm for nearly divisionless unbiased bounded random numbers. Recently I found out there is a way to get really divisionless random numbers, as explained by Steve Canon and demonstrated by Kendall Willets.
I have written my own version of really divisionless random numbers so I could compare it with Lemire’s algorithm. Here’s how it works.
As before, I’m going to use W == 1<<32 as an abbreviation for two to the power of the word size. (Except I am using 32-bit words instead of 64-bit words, to make the worst cases more visible in the benchmark later on.)
We have a raw random number generator rng() that produces numbers in the range [0,W), but we are going to treat it as if it produces fixed-point fractions in the range [0,1), as if it were rng() / W. Then we can get a value in our desired range [0,limit) by multiplying rng() * limit. This gives us a 32.32 fixed-point result, represented as a 64-bit number.
uint64_t num = (uint64_t)rng() * (uint64_t)limit;
uint32_t value = (uint32_t)(num >> 32);
This random value is biased. The bias is due to the limited precision of the fraction we started with. Lemire’s nearly-divisionless algorithm uses rejection sampling to debias its result. But what if we could calculate it with unlimited precision instead?
Let’s make a multiprecision fraction in [0,1) from a series of raw random numbers. We started with rng() / W, which we can extend one word at a time:
rng()/W + rng()/(W*W) + rng()/(W*W*W) + ...
When we multiply by the limit we can distribute the multiplication through the sum:
limit*rng()/W + limit*rng()/(W*W) + ...
Each term of the sum is represented as a 32.32 fixed-point number, same as before. To add them up, the upper half of the term on the right of a + is added to the lower half of the term on the left. But this can overflow, in which case we need to propagate a carry.
It’s possible for a carry to propagate all the way to the left and affect the integer part of our result. Ignoring these carries is where the bias comes from in the limited-precision multiply. Instead, we are going to handle carries properly.
We will add up our multiprecision sum from left to right, at each step adding the high half of the next term to the low half of the current term. There are three interesting cases:
hi + lo < (W-1)
The addition does not cause a carry, and this term will absorb carries from later terms, so whatever integer value we started with is the correct result, and we can stop.
hi + lo == (W-1)
The addition does not cause a carry, but this term will propagate carries from later terms. We need to examine more terms.
hi + lo > (W-1)
The addition causes a carry. This term will absorb carries from later terms, so we can stop here. We also know (by induction) that all the terms to our left were == (W-1) and will propagate our carry up to our preliminary integer value, so our result is value + 1.
In the middle case where we need to keep going, we don’t actually have to keep the result of the sum around (because we know it is W-1), so we can just discard it, forget about dividing by W, and re-do the loop exactly as if we were adding the second term again.
The middle case is also very unlikely, with probability 1 / W, so the loop will almost never have more than one iteration.
There is a fourth case that allows us to skip getting the next term’s random number, because we know hi < limit.
limit + lo <= (W-1)
OK, we are nearly there! But these additions can overflow our word size, and we would prefer to be able to test if there is a carry within our word size and without overflow. The following tests are all equivalent:
hi + lo > (1 << 32) - 1
hi + lo > (uint32_t)-1
hi > -lo - 1
hi > ~lo
All of our carry propagation checks test whether something added to our fractional part (what I called lo above) would overflow, so we can keep the fraction part inverted all the time.
So what I ended up with is:
uint32_t really_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    uint32_t value = (uint32_t)(num >> 32);
    /* keep the fraction inverted so the carry tests stay within 32 bits */
    uint32_t frac = ~(uint32_t)(num);
    while(limit > frac) {
        num = (uint64_t)rng() * (uint64_t)limit;
        uint32_t more = (uint32_t)(num >> 32);
        if(more < frac) return(value + 0); /* no carry, stop */
        if(more > frac) return(value + 1); /* carry, stop */
        frac = ~(uint32_t)(num); /* exact tie, examine another term */
    }
    return(value);
}
It is not quite as brief as Lemire’s algorithm:
uint32_t nearly_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    if((uint32_t)(num) < limit) {
        uint32_t residue = -limit % limit;
        while((uint32_t)(num) < residue)
            num = (uint64_t)rng() * (uint64_t)limit;
    }
    return((uint32_t)(num >> 32));
}
I’ve written a simple benchmark harness to compare nearly-divisionless and really-divisionless random numbers.
On my Apple M1 Pro, I get:
                  nearly                   really
     limit       rng()      time         rng()      time
        10   100000000  0.288708     100000000  0.272645  (+0.199934)
       100   100000000  0.270472     100000002  0.271755  (+0.019925)
      1000   100000004  0.270880     100000018  0.271528  (+0.001843)
     10000   100000195  0.271386     100000226  0.272936  (+0.000239)
    100000   100001588  0.273159     100002420  0.271645  (+0.000011)
   1000000   100022422  0.270636     100023153  0.273379  (+0.000001)
  10000000   100115390  0.274848     100232254  0.275551  (-0.000170)
 100000000   102259813  0.295489     102329850  0.308394  (+0.000040)
1000000000   107371660  0.506537     123283286  0.554149  (+0.000011)
On an old Intel i5-2520M (2.50GHz) I get:
                  nearly                   really
     limit       rng()      time         rng()      time
        10   100000000  0.527669     100000000  0.539901  (+0.199986)
       100   100000002  0.517446     100000004  0.539832  (+0.019943)
      1000   100000003  0.517961     100000029  0.540178  (+0.002025)
     10000   100000164  0.517207     100000203  0.540129  (+0.000231)
    100000   100001551  0.517465     100002365  0.541679  (+0.000082)
   1000000   100022311  0.518833     100023428  0.542076  (-0.000006)
  10000000   100115835  0.521978     100232914  0.546125  (-0.000093)
 100000000   102260934  0.560111     102328804  0.580683  (-0.000099)
1000000000   107370473  0.925359     123280839  0.922405  (-0.000007)
The most notable thing is that the nearly-divisionless algorithm makes fewer calls to the random number generator. On further investigation (not shown in the results above) I found that both algorithms take the early exit with about the same probability, but nearly-divisionless calls rng() much less often in the slow path.
The old Intel box is less able to handle the greater complexity of the really-divisionless code, so it performs significantly slower even when it mostly takes the early exit.
The differences build up when the limit gets near W, when the algorithms have to hit their slow path more frequently. On the old Intel box, with relatively slow division, nearly-divisionless slows down more than really-divisionless, but the M1 Pro clearly has very fast division, so nearly-divisionless stays ahead.
It was fun to investigate this really-divisionless algorithm. It’s easier to understand than Lemire’s nearly-divisionless algorithm, but the really-divisionless code ends up being longer and more branchy, so it has a hard time keeping up.
If you need unbiased numbers with a large limit, you are better off using an RNG that produces 64 bits at a time, so that these algorithms almost never need to hit their slow path. Then the relative performance of your rng() and your hardware’s division matters much less, and Lemire’s algorithm wins on brevity.