I previously wrote about Daniel Lemire’s algorithm for nearly divisionless unbiased bounded random numbers. Recently I found out there is a way to get really divisionless random numbers, as explained by Steve Canon and demonstrated by Kendall Willets.
I have written my own version of really divisionless random numbers so I could compare it with Lemire’s algorithm. Here’s how it works.
As before, I’m going to use W == 1<<32 as an abbreviation for two to the power of the word size. (Except I am using 32-bit words instead of 64-bit words, to make the worst cases more visible in the benchmark later on.)
We have a raw random number generator rng() that produces numbers in the range [0,W), but we are going to treat it as if it produces fixed-point fractions in the range [0,1), as if it were rng() / W. Then we can get a value in our desired range [0,limit) by multiplying rng() * limit. This gives us a 32.32 fixed-point result, represented as a 64-bit number.
uint64_t num = (uint64_t)rng() * (uint64_t)limit;
uint32_t value = (uint32_t)(num >> 32);
This random value is biased. The bias is due to the limited precision of the fraction we started with. Lemire’s nearly-divisionless algorithm uses rejection sampling to debias its result. But what if we could calculate it with unlimited precision instead?
Let’s make a multiprecision fraction in [0,1) from a series of raw random numbers. We started with rng() / W, which we can extend one word at a time:
rng()/W + rng()/(W*W) + rng()/(W*W*W) + ...
When we multiply by the limit we can distribute the multiplication through the sum:
limit*rng()/W + limit*rng()/(W*W) + ...
Each term of the sum is represented as a 32.32 fixed-point number, same as before. To add them up, the upper half of the term on the right of a + is added to the lower half of the term on the left. But this can overflow, in which case we need to propagate a carry.
It’s possible for a carry to propagate all the way to the left and affect the integer part of our result. Ignoring these carries is where the bias comes from in the limited-precision multiply. Instead, we are going to handle carries properly.
We will add up our multiprecision sum from left to right, at each step adding the high half of the next term to the low half of the current term. There are three interesting cases:
hi + lo < (W-1)
The addition does not cause a carry, and this term will absorb carries from later terms, so whatever integer value we started with is the correct result, and we can stop.
hi + lo == (W-1)
The addition does not cause a carry, but this term will propagate carries from later terms. We need to examine more terms.
hi + lo > (W-1)
The addition causes a carry. This term will absorb carries from later terms, so we can stop here. We also know (by induction) that all the terms to our left were == (W-1) and will propagate our carry up to our preliminary integer value, so our result is value + 1.
In the middle case where we need to keep going, we don’t actually have to keep the result of the sum around (because we know it is W-1), so we can just discard it, forget about dividing by W, and re-do the loop exactly as if we were adding the second term again.
The middle case is also very unlikely, with probability 1 / W, so the loop will almost never have more than one iteration.
There is a fourth case that allows us to skip getting the next term’s random number, because we know hi < limit.
limit + lo <= (W-1)
OK, we are nearly there! But these additions can overflow our word size, and we would prefer to be able to test if there is a carry within our word size and without overflow. The following tests are all equivalent:
hi + lo > (1 << 32) - 1
hi + lo > (uint32_t)-1
hi > -lo - 1
hi > ~lo
All of our carry propagation checks test whether something added to our fractional part (what I called lo above) would overflow, so we can keep the fraction part inverted all the time.
So what I ended up with is:
uint32_t really_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    uint32_t value = (uint32_t)(num >> 32);
    /* keep the fraction inverted so the carry tests stay within 32 bits */
    uint32_t frac = ~(uint32_t)(num);
    while(limit > frac) {
        num = (uint64_t)rng() * (uint64_t)limit;
        uint32_t more = (uint32_t)(num >> 32);
        if(more < frac) return(value + 0); /* no carry, stop */
        if(more > frac) return(value + 1); /* carry, stop */
        frac = ~(uint32_t)(num); /* exact tie, examine another term */
    }
    return(value);
}
It is not quite as brief as Lemire’s algorithm:
uint32_t nearly_divisionless(uint32_t limit) {
    uint64_t num = (uint64_t)rng() * (uint64_t)limit;
    if((uint32_t)(num) < limit) {
        uint32_t residue = -limit % limit;
        while((uint32_t)(num) < residue)
            num = (uint64_t)rng() * (uint64_t)limit;
    }
    return((uint32_t)(num >> 32));
}
I’ve written a simple benchmark harness to compare nearly-divisionless and really-divisionless random numbers.
On my Apple M1 Pro, I get:
                  nearly                   really
     limit       rng()      time         rng()      time
        10   100000000  0.288708     100000000  0.272645  (+0.199934)
       100   100000000  0.270472     100000002  0.271755  (+0.019925)
      1000   100000004  0.270880     100000018  0.271528  (+0.001843)
     10000   100000195  0.271386     100000226  0.272936  (+0.000239)
    100000   100001588  0.273159     100002420  0.271645  (+0.000011)
   1000000   100022422  0.270636     100023153  0.273379  (+0.000001)
  10000000   100115390  0.274848     100232254  0.275551  (-0.000170)
 100000000   102259813  0.295489     102329850  0.308394  (+0.000040)
1000000000   107371660  0.506537     123283286  0.554149  (+0.000011)
On an old Intel i5-2520M (2.50GHz) I get:
                  nearly                   really
     limit       rng()      time         rng()      time
        10   100000000  0.527669     100000000  0.539901  (+0.199986)
       100   100000002  0.517446     100000004  0.539832  (+0.019943)
      1000   100000003  0.517961     100000029  0.540178  (+0.002025)
     10000   100000164  0.517207     100000203  0.540129  (+0.000231)
    100000   100001551  0.517465     100002365  0.541679  (+0.000082)
   1000000   100022311  0.518833     100023428  0.542076  (-0.000006)
  10000000   100115835  0.521978     100232914  0.546125  (-0.000093)
 100000000   102260934  0.560111     102328804  0.580683  (-0.000099)
1000000000   107370473  0.925359     123280839  0.922405  (-0.000007)
The most notable thing is that the nearly-divisionless algorithm makes fewer calls to the random number generator. On further investigation (not shown in the results above) I found that both algorithms take the early exit with about the same probability, but nearly-divisionless calls rng() much less often in the slow path.
The old Intel box is less able to handle the greater complexity of the really-divisionless code, so it performs significantly slower even when it mostly takes the early exit.
The differences build up when the limit gets near W, when the algorithms have to hit their slow path more frequently. On the old Intel box, with relatively slow division, nearly-divisionless slows down more than really-divisionless, but the M1 Pro clearly has very fast division, so nearly-divisionless stays ahead.
It was fun to investigate this really-divisionless algorithm. It’s easier to understand than Lemire’s nearly-divisionless algorithm, but the really-divisionless code ends up being longer and more branchy, so it has a hard time keeping up.
If you need unbiased numbers with a large limit, you are better off using an RNG that produces 64 bits at a time, so that these algorithms almost never need to hit their slow path. Then the relative performance of your rng() and your hardware’s division matters much less, and Lemire’s algorithm wins on brevity.