Tony Finch – blog


One of the neat things about the PCG random number generator by Melissa O’Neill is its use of instruction-level parallelism: the PCG state update can run in parallel with its output permutation.

However, PCG only has a limited amount of ILP, about 3 instructions' worth. Its overall speed is limited by the rate at which a CPU can run a dependent chain in which the output of one multiply-add feeds into the next multiply-add.

… Or is it?

With some linear algebra and some AVX512, I can generate random numbers from a single instance of pcg32 at 200 Gbit/s on a single core. This is the same sequence of random numbers generated in the same order as normal pcg32, but more than 4x faster.

You can look at the benchmark in my pcg-dxsm repository.

skip ahead

One of the slightly weird features that PCG gets from its underlying linear congruential generator is “seekability”: you can skip ahead k steps in the stream of random numbers in log(k) time.

The PCG paper (in section 4.3.1) cites Forrest Brown's paper, “random numbers with arbitrary strides”, which explains that the skip-ahead feature is useful for reproducibility of Monte Carlo simulations.

But what caught my eye is the skip-ahead formula. Rephrased in programmer style,

    state[n+k] = state[n] * pow(MUL, k) +
                 inc * (pow(MUL, k) - 1) / (MUL - 1)

the insight

The skip-ahead formula says that we can calculate a future state using a couple of multiplications. The skip-ahead multipliers depend only on the LCG multiplier, not on the variable state, nor on the configurable increment. That means that for a fixed skip ahead, we can precalculate the multipliers at compile time.

The skip-ahead formula allows us to unroll the PCG data dependency chain. Normally, four iterations of the PCG state update look like,

    state0 = rng->state
    state1 = state0 * MUL + rng->inc
    state2 = state1 * MUL + rng->inc
    state3 = state2 * MUL + rng->inc
    state4 = state3 * MUL + rng->inc
    rng->state = state4

With the skip-ahead multipliers it looks like,

    state0 = rng->state
    state1 = state0 * MULs1 + rng->inc * MULi1
    state2 = state0 * MULs2 + rng->inc * MULi2
    state3 = state0 * MULs3 + rng->inc * MULi3
    state4 = state0 * MULs4 + rng->inc * MULi4
    rng->state = state4

These state calculations can be done in parallel using NEON or AVX vector instructions.

The disadvantage is that calculating future states in parallel requires more multiplications than doing so in series, but the inc * MULiN products are loop-invariant, so they can be hoisted out of the inner loop and the bulk cost is the same.

multipliers

The skip-ahead formula is useful for jumping ahead long distances, because (as Forrest Brown explained) you can do the exponentiation in log(k) time using repeated squaring. (The same technique is used for modexp in RSA.) But I’m only interested in the first few skip-ahead multipliers.

I’ll define the linear congruential generator as:

    lcg(s, inc) = s * MUL + inc

Which is used in PCG’s normal state update like:

    rng->state = lcg(rng->state, rng->inc)

To precalculate the first few skip-ahead multipliers, we iterate the LCG twice: once starting from one with increment zero, and once starting from zero with increment one, like this:

    MULs0 = 1
    MULs1 = lcg(MULs0, 0)
    MULs2 = lcg(MULs1, 0)

    MULi0 = 0
    MULi1 = lcg(MULi0, 1)
    MULi2 = lcg(MULi1, 1)

My benchmark code’s commentary includes a proof by induction, which I wrote to convince myself that these multipliers are correct.

trying it out

To explore how well this skip-ahead idea works, I have written a couple of variants of my pcg32_bytes() function, which simply iterates pcg32 and writes the results to a byte array. The variants have an adjustable amount of parallelism.

One variant is written as scalar code in a loop that has been unrolled by hand a few times. I wanted to see if standard C gets a decent speedup, perhaps from autovectorization.

The other variant uses the GNU C portable vector extensions to calculate pcg32 in an explicitly parallel manner.

The benchmark also ensures the output from every variant matches the baseline pcg32_bytes().

results

The output from the benchmark harness lists each function variant, its throughput in bytes per nanosecond, and its speedup relative to the baseline.

There are small differences in style between the baseline and u1 functions, but their performance ought to be basically the same.

Apple clang 16, MacBook Pro M1 Pro.

This compiler is eager and fairly effective at autovectorizing. ARM NEON isn’t big enough to get a speedup from 8 lanes of parallelism.

    __  3.66 bytes/ns x 1.00
    u1  3.90 bytes/ns x 1.07
    u2  6.40 bytes/ns x 1.75
    u3  7.66 bytes/ns x 2.09
    u4  8.52 bytes/ns x 2.33
    x2  7.59 bytes/ns x 2.08
    x4 10.49 bytes/ns x 2.87
    x8 10.40 bytes/ns x 2.84

The following results were from my AMD Ryzen 9 7950X running Debian 12 “bookworm”, comparing gcc vs clang, and AVX2 vs AVX512. gcc is less keen to autovectorize so it doesn’t do very well with the unrolled loops. (Dunno why u1 is so much slower than the baseline.)

That last result is pcg32 generating random numbers at 200 Gbit/s.