One of the neat things about the PCG random number generator by Melissa O’Neill is its use of instruction-level parallelism: the PCG state update can run in parallel with its output permutation.
However, PCG only has a limited amount of ILP, about 3 instructions’ worth. Its overall speed is limited by the rate at which a CPU can run a sequence of dependent multiply-adds, where the output of one multiply-add feeds into the next.
… Or is it?
With some linear algebra and some AVX512, I can generate random numbers from a single instance of pcg32 at 200 Gbit/s on a single core. This is the same sequence of random numbers generated in the same order as normal pcg32, but more than 4x faster.
You can look at the benchmark in my pcg-dxsm repository.
skip ahead
One of the slightly weird features that PCG gets from its underlying linear congruential generator is “seekability”: you can skip ahead k steps in the stream of random numbers in log(k) time.
The PCG paper (in section 4.3.1) cites Forrest Brown’s paper, random number generation with arbitrary strides, which explains that the skip-ahead feature is useful for reproducibility of Monte Carlo simulations.
But what caught my eye is the skip-ahead formula. Rephrased in programmer style,
state[n+k] = state[n] * pow(MUL, k) +
             inc * (pow(MUL, k) - 1) / (MUL - 1)
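To see where the formula comes from, unroll the recurrence a couple of steps: the state picks up a factor of MUL each step, and the increments accumulate as a geometric series.

state[n+1] = state[n] * MUL + inc
state[n+2] = (state[n] * MUL + inc) * MUL + inc
           = state[n] * pow(MUL, 2) + inc * (MUL + 1)
state[n+k] = state[n] * pow(MUL, k)
           + inc * (pow(MUL, k-1) + ... + MUL + 1)

The geometric series in the increment term sums to (pow(MUL, k) - 1) / (MUL - 1), which is where the division comes from.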
the insight
The skip-ahead formula says that we can calculate a future state using a couple of multiplications. The skip-ahead multipliers depend only on the LCG multiplier, not on the variable state, nor on the configurable increment. That means that for a fixed skip-ahead distance, we can precalculate the multipliers before compile time.
The skip-ahead formula allows us to unroll the PCG data dependency chain. Normally, four iterations of the PCG state update look like,
state0 = rng->state
state1 = state0 * MUL + rng->inc
state2 = state1 * MUL + rng->inc
state3 = state2 * MUL + rng->inc
state4 = state3 * MUL + rng->inc
rng->state = state4
With the skip-ahead multipliers it looks like,
state0 = rng->state
state1 = state0 * MULs1 + rng->inc * MULi1
state2 = state0 * MULs2 + rng->inc * MULi2
state3 = state0 * MULs3 + rng->inc * MULi3
state4 = state0 * MULs4 + rng->inc * MULi4
rng->state = state4
These state calculations can be done in parallel using NEON or AVX vector instructions.
The disadvantage is that calculating future states in parallel requires more multiplications than doing so in series, but inc * MULiN can be lifted out of an inner loop so the bulk cost is the same.
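To illustrate the hoisting, here is a minimal sketch of an unrolled inner loop. This is my own illustration, not the code from the repository: it assumes the multipliers MULs1..MULs4 and MULi1..MULi4 have been precalculated as described in the next section, and pcg32_t and pcg32_out() are hypothetical stand-ins for the state struct and output permutation.

static void pcg32_unrolled_sketch(pcg32_t *rng, uint32_t *out, size_t n) {
    /* the increment products do not depend on the state,
       so hoist them out of the loop */
    uint64_t inc1 = rng->inc * MULi1;
    uint64_t inc2 = rng->inc * MULi2;
    uint64_t inc3 = rng->inc * MULi3;
    uint64_t inc4 = rng->inc * MULi4;
    uint64_t state = rng->state;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* four independent multiply-adds per iteration */
        uint64_t s1 = state * MULs1 + inc1;
        uint64_t s2 = state * MULs2 + inc2;
        uint64_t s3 = state * MULs3 + inc3;
        uint64_t s4 = state * MULs4 + inc4;
        out[i+0] = pcg32_out(s1);
        out[i+1] = pcg32_out(s2);
        out[i+2] = pcg32_out(s3);
        out[i+3] = pcg32_out(s4);
        state = s4; /* the dependency chain advances 4 steps at once */
    }
    rng->state = state;
}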
multipliers
The skip-ahead formula is useful for jumping ahead long distances, because (as Forrest Brown explained) you can do the exponentiation in log(k) time using repeated squaring. (The same technique is used for modexp in RSA.) But I’m only interested in the first few skip-ahead multipliers.
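(For completeness, the log(k) jump looks something like this sketch, much like the advance functions in the PCG reference library: it composes affine functions by repeated squaring, with all arithmetic mod 2^64, so the division in the skip-ahead formula never actually happens.)

#include <stdint.h>

/* advance an LCG by delta steps in log(delta) time,
   by repeated squaring of the map x -> mul*x + inc */
uint64_t lcg_skip(uint64_t state, uint64_t delta,
                  uint64_t mul, uint64_t inc) {
    uint64_t acc_mul = 1, acc_inc = 0;
    while (delta > 0) {
        if (delta & 1) {
            /* apply the current power-of-two step */
            acc_mul *= mul;
            acc_inc = acc_inc * mul + inc;
        }
        inc *= mul + 1; /* square the map: two steps become one */
        mul *= mul;
        delta >>= 1;
    }
    return state * acc_mul + acc_inc;
}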
I’ll define the linear congruential generator as:
lcg(s, inc) = s * MUL + inc
It is used in PCG’s normal state update like this:
rng->state = lcg(rng->state, rng->inc)
To precalculate the first few skip-ahead multipliers, we iterate the LCG starting from zero and one, like this:
MULs0 = 1
MULs1 = lcg(MULs0, 0)
MULs2 = lcg(MULs1, 0)
MULi0 = 0
MULi1 = lcg(MULi0, 1)
MULi2 = lcg(MULi1, 1)
My benchmark code’s commentary includes a proof by induction, which I wrote to convince myself that these multipliers are correct.
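If you would rather see it checked than proved, a tiny standalone test (my sketch, not the repository’s proof) can compare the multipliers against stepping the generator directly:

#include <assert.h>
#include <stdint.h>

#define MUL 6364136223846793005ULL /* pcg32's default LCG multiplier */

static uint64_t lcg(uint64_t s, uint64_t inc) {
    return s * MUL + inc; /* mod 2^64, via unsigned wraparound */
}

int main(void) {
    uint64_t state0 = 12345, inc = 987654321; /* arbitrary test values */
    uint64_t state = state0;
    uint64_t MULs = 1, MULi = 0;              /* MULs0 and MULi0 */
    for (int k = 1; k <= 8; k++) {
        MULs = lcg(MULs, 0);                  /* MULs for skip k */
        MULi = lcg(MULi, 1);                  /* MULi for skip k */
        state = lcg(state, inc);              /* k serial steps */
        assert(state == state0 * MULs + inc * MULi);
    }
    return 0;
}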
trying it out
To explore how well this skip-ahead idea works, I have written a couple of variants of my pcg32_bytes() function, which simply iterates pcg32 and writes the results to a byte array. The variants have an adjustable amount of parallelism.
One variant is written as scalar code in a loop that has been unrolled by hand a few times. I wanted to see if standard C gets a decent speedup, perhaps from autovectorization.
The other variant uses the GNU C portable vector extensions to calculate pcg32 in an explicitly parallel manner.
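To give a flavour of the vector variant, here is a minimal sketch using the vector extensions. This is a simplification of mine, not the repository’s code: it writes 32-bit words rather than bytes, hardcodes 4 lanes, and assumes pcg32’s standard LCG multiplier and XSH-RR output permutation.

#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64x4 __attribute__((vector_size(32)));
typedef struct { uint64_t state, inc; } pcg32_t;

#define MUL 6364136223846793005ULL /* pcg32's default LCG multiplier */

/* the standard pcg32 (XSH-RR) output permutation */
static inline uint32_t pcg32_out(uint64_t s) {
    uint32_t x = (uint32_t)(((s >> 18) ^ s) >> 27);
    uint32_t r = (uint32_t)(s >> 59);
    return (x >> r) | (x << (-r & 31));
}

static void pcg32_x4_sketch(pcg32_t *rng, uint32_t *out, size_t n) {
    /* skip-ahead multipliers: k = 0..3 in the vector lanes,
       k = 4 as scalars for the state update */
    u64x4 muls, muli;
    uint64_t mul4 = 1, sum4 = 0;
    for (int k = 0; k < 4; k++) {
        muls[k] = mul4;
        muli[k] = sum4;
        mul4 = mul4 * MUL;     /* becomes pow(MUL, k+1) */
        sum4 = sum4 * MUL + 1; /* becomes MULi(k+1)     */
    }
    /* hoisted: the increment products do not depend on the state */
    u64x4 incs = rng->inc * muli;
    uint64_t inc4 = rng->inc * sum4;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        /* states n .. n+3, all calculated from state n in parallel */
        u64x4 st = rng->state * muls + incs;
        for (int j = 0; j < 4; j++)
            out[i+j] = pcg32_out(st[j]); /* same outputs as serial pcg32 */
        rng->state = rng->state * mul4 + inc4; /* jump ahead 4 steps */
    }
}

The output permutation is written as a scalar loop here for clarity; the real code is more careful about how the outputs are computed and written.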
The benchmark also ensures the output from every variant matches the baseline pcg32_bytes().
results
The output from the benchmark harness lists:
- the function variant:
  - either the baseline version,
  - or uN for a scalar loop unrolled N times,
  - or xN for vector code with N lanes
- its speed in bytes per nanosecond (aka gigabytes per second)
- its performance relative to the baseline
There are small differences in style between the baseline and u1 functions, but their performance ought to be basically the same.

Apple clang 16, MacBook Pro M1 Pro.
This compiler is eager and fairly effective at autovectorizing. ARM NEON isn’t big enough to get a speedup from 8 lanes of parallelism.
__ 3.66 bytes/ns x 1.00
u1 3.90 bytes/ns x 1.07
u2 6.40 bytes/ns x 1.75
u3 7.66 bytes/ns x 2.09
u4 8.52 bytes/ns x 2.33
x2 7.59 bytes/ns x 2.08
x4 10.49 bytes/ns x 2.87
x8 10.40 bytes/ns x 2.84
The following results were from my AMD Ryzen 9 7950X running Debian 12 “bookworm”, comparing gcc vs clang, and AVX2 vs AVX512. gcc is less keen to autovectorize so it doesn’t do very well with the unrolled loops. (Dunno why u1 is so much slower than the baseline.)
gcc 12.2 -march=x86-64-v3

__ 5.57 bytes/ns x 1.00
u1 5.13 bytes/ns x 0.92
u2 5.03 bytes/ns x 0.90
u3 7.01 bytes/ns x 1.26
u4 6.83 bytes/ns x 1.23
x2 3.96 bytes/ns x 0.71
x4 8.00 bytes/ns x 1.44
x8 12.35 bytes/ns x 2.22

clang 16.0 -march=x86-64-v3

__ 4.89 bytes/ns x 1.00
u1 4.08 bytes/ns x 0.83
u2 8.76 bytes/ns x 1.79
u3 10.43 bytes/ns x 2.13
u4 10.81 bytes/ns x 2.21
x2 6.67 bytes/ns x 1.36
x4 12.67 bytes/ns x 2.59
x8 15.27 bytes/ns x 3.12

gcc 12.2 -march=x86-64-v4

__ 5.53 bytes/ns x 1.00
u1 5.53 bytes/ns x 1.00
u2 5.55 bytes/ns x 1.00
u3 6.99 bytes/ns x 1.26
u4 6.79 bytes/ns x 1.23
x2 4.75 bytes/ns x 0.86
x4 17.14 bytes/ns x 3.10
x8 20.90 bytes/ns x 3.78

clang 16.0 -march=x86-64-v4

__ 5.53 bytes/ns x 1.00
u1 4.25 bytes/ns x 0.77
u2 7.94 bytes/ns x 1.44
u3 9.31 bytes/ns x 1.68
u4 15.33 bytes/ns x 2.77
x2 9.07 bytes/ns x 1.64
x4 21.74 bytes/ns x 3.93
x8 26.34 bytes/ns x 4.76
That last result is pcg32 generating random numbers at 200 Gbit/s.