[noise] [ANNOUNCE] WireGuard Launched!

D. J. Bernstein djb at cr.yp.to
Sat Jul 9 16:37:04 PDT 2016


Brian Smith writes:
> wouldn't the performance be much better on Skylake+ and ARMv8+?

No. Concretely, several generations of Intel chips have run 12-round
ChaCha12-256 at practically the same speed as 12-round AES-192 (with a
similar security margin), even though AES-192 has "hardware support", a
smaller key, a smaller block size, and smaller data limits. For example:

   * Both ciphers are ~1.7 cycles/byte on Westmere (introduced 2010).
   * Both ciphers are ~1.5 cycles/byte on Ivy Bridge (introduced 2012).
   * Both ciphers are ~0.8 cycles/byte on Skylake (introduced 2015).
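
For anyone who wants to reproduce this kind of number, here is a
minimal cycle-counting sketch in C. The cipher call is a stub, so drop
in a real ChaCha or AES implementation to get meaningful figures. One
caveat: on recent Intel chips RDTSC counts fixed-rate reference cycles,
so pin the clock frequency (or use performance counters) if you want
exact core cycles/byte.

   #include <stdint.h>
   #include <stdio.h>
   #include <x86intrin.h>   /* __rdtsc; GCC/Clang, x86 only */

   #define LEN 4096

   /* Stand-in for a real cipher call; this stub just touches bytes. */
   static void cipher_stub(uint8_t *buf, size_t len) {
      for (size_t i = 0; i < len; i++) buf[i] ^= (uint8_t)i;
   }

   int main(void) {
      static uint8_t buf[LEN];
      uint64_t best = UINT64_MAX;
      for (int trial = 0; trial < 1000; trial++) {
         uint64_t t0 = __rdtsc();
         cipher_stub(buf, LEN);
         uint64_t t1 = __rdtsc();
         if (t1 - t0 < best) best = t1 - t0;   /* keep the minimum */
      }
      printf("%.2f cycles/byte\n", (double)best / LEN);
      return 0;
   }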

ChaCha20-256 is slower than ChaCha12-256, but this is entirely because
it has a much larger security margin (20 rounds rather than 12). For
reasons explained below, I
wouldn't be surprised to see ChaCha20-256 running _faster_ than AES-256
on future Intel chips.

Anyway, it's hard to find any applications that will notice any of these
gaps. We're talking about differences under 1 cycle/byte, on CPUs where
cycles are very fast in the first place, vs. multiple cycles/byte for
typical data processing.

Meanwhile the ecosystem includes many smaller CPUs where AES is
painfully slow---e.g., ~20 cycles/byte for AES-128 on Cortex-A8. On a
small CPU core, spending precious area for an AES instruction isn't an
easy decision for a CPU designer to make! The ecosystem also includes,
more importantly, many AES implementations vulnerable to timing attacks.

> SHA-256 + AES-256-GCM, for which newer CPUs are being optimized

No.

CPUs are optimized for video games. Video games are a huge market---a
market where people pay close attention to performance and adjust their
purchasing decisions accordingly.

Yes, yes, some buyers pay attention to CPU performance for weather
prediction, or movie rendering, or various other important applications.
But most of these applications have the same basic bottlenecks as video
games: most importantly, low-precision multiplications and additions. 

Do CPU designers spend area on niche operations such as _binary-field_
multiplication? Sometimes, yes, but not much area. Given how CPUs are
actually used, CPU designers see vastly more benefit to spending area
on, e.g., vectorized floating-point multipliers.
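
To make "binary-field multiplication" concrete: it's the carry-less
multiply underlying GCM's GHASH, exposed on Intel chips as the
PCLMULQDQ instruction. A small sketch with made-up example values
(compile with gcc -mpclmul -msse4.1):

   #include <immintrin.h>
   #include <stdint.h>
   #include <stdio.h>

   int main(void) {
      /* Polynomials over GF(2): multiplication without carries. */
      __m128i a = _mm_set_epi64x(0, 0x87);   /* x^7 + x^2 + x + 1 */
      __m128i b = _mm_set_epi64x(0, 0x10);   /* x^4 */
      /* imm 0x00 selects the low 64-bit half of each operand. */
      __m128i p = _mm_clmulepi64_si128(a, b, 0x00);
      uint64_t lo = (uint64_t)_mm_cvtsi128_si64(p);
      uint64_t hi = (uint64_t)_mm_extract_epi64(p, 1);
      /* prints 0000000000000000 0000000000000870 */
      printf("%016llx %016llx\n", (unsigned long long)hi,
             (unsigned long long)lo);
      return 0;
   }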

Intel is doubling the vector size in its newest CPUs---again!---this
time from 256 bits to 512 bits. The design principles explained in

   https://cr.yp.to/snuffle/design.pdf (especially parallelization)
   https://cr.yp.to/snuffle/speed.pdf (see in particular Section 12)

allow Salsa20 and ChaCha20 to take direct advantage of this increase in
vector size---and also of other improvements that Intel has been adding,
such as extra vector units and three-operand instructions. Intel is also
adding vectorized rotations (which take very little extra hardware area
on top of shifts), so add-shift-shift-merge-xor becomes add-rotate-xor.
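
In code, add-rotate-xor is exactly the ChaCha quarter-round. Here it is
in plain C, straight from the ChaCha definition; the rotation is
written as two shifts and a merge, which a vectorized rotation such as
AVX-512's VPROLD collapses into a single instruction:

   #include <stdint.h>

   /* Without a rotate instruction: shift, shift, merge. */
   #define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

   /* One quarter-round: nothing but 32-bit add, rotate, xor.
      Each round is four independent quarter-rounds, so the state
      maps directly onto 128-, 256-, or 512-bit vector lanes. */
   #define QUARTERROUND(a, b, c, d) \
      do { \
         (a) += (b); (d) ^= (a); (d) = ROTL32((d), 16); \
         (c) += (d); (b) ^= (c); (b) = ROTL32((b), 12); \
         (a) += (b); (d) ^= (a); (d) = ROTL32((d),  8); \
         (c) += (d); (b) ^= (c); (b) = ROTL32((b),  7); \
      } while (0)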

Poly1305 is similarly parallelizable, with slightly more effort. It also
benefits from Intel's continued attention to the speed of multipliers.
Intel's newest optimization here is the introduction of very efficient
52-bit multipliers that expose the internals of the floating-point
multipliers. This is essentially what was requested in, e.g.,

   * https://cr.yp.to/talks/2002.06.15/slides.ps (last slide, reusing
     the 64-bit extended-precision floating-point multiplier) and

   * https://blog.cr.yp.to/20140517-insns.html (reusing the
     53-bit double-precision vector floating-point multiplier, very
     close to what Intel subsequently announced).

This will---with very little hardware cost---also speed up conservative
prime-field ECC and many further applications.
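
To sketch the parallelization (a standard implementation trick, stated
here only for concreteness): Poly1305 evaluates a polynomial at a
secret point r modulo 2^130 - 5, normally by Horner's rule. Regrouping
the same polynomial splits it into independent chains, one per vector
lane, each stepping by a precomputed power of r. For four blocks
c1..c4:

   h = (((c1*r + c2)*r + c3)*r + c4)*r        mod 2^130 - 5
     = c1*r^4 + c2*r^3 + c3*r^2 + c4*r        mod 2^130 - 5
     = (c1*r^2 + c3)*r^2 + (c2*r^2 + c4)*r    mod 2^130 - 5

The last line is two chains that each multiply by r^2 and run side by
side; wider vectors use r^4, r^8, and so on. The per-limb products
inside each chain are what those 52-bit multipliers accelerate.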

The reasons that CPUs can make simple vector arithmetic run really fast
are related, at a fundamental hardware-design level, to the reasons that
the same instructions take constant time on most CPUs. The story is
quite different for variable data indices: these naturally create
variations in timing and are hard to efficiently vectorize. For the same
reasons, variable instruction indices (i.e., branches) naturally create
variations in timing and are hard to efficiently vectorize.
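
A toy illustration of the contrast: selecting a value with pure
arithmetic executes the same instructions for either input and
vectorizes as a lane mask, while the branching version's timing can
depend on the data:

   #include <stdint.h>

   /* Constant time: mask is all-ones or all-zeros and the same
      instructions run either way. pick must be 0 or 1. */
   static uint32_t ct_select(uint32_t pick, uint32_t a, uint32_t b) {
      uint32_t mask = (uint32_t)(-(int32_t)pick);
      return (a & mask) | (b & ~mask);
   }

   /* Data-dependent branch: which path runs (and for how long) can
      depend on the secret, and vectorizing it means falling back
      to masks anyway. */
   static uint32_t branch_select(uint32_t pick, uint32_t a, uint32_t b) {
      return pick ? a : b;
   }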

Neither AES nor GCM was ever designed to run quickly in software using
common arithmetic instructions: both relied critically on variable
data indices for performance. This approach turned out to be a security
nightmare, producing many successful timing attacks. Even in contexts
where timing attacks weren't a concern, this approach vectorized poorly,
with performance falling farther and farther behind pure arithmetic
designs as CPUs evolved in predictable ways.
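
Concretely, the classic fast-software-AES approach precomputes
"T-tables" combining SubBytes and MixColumns and then indexes them
with secret state bytes. Schematically (placeholder table contents,
not a real round):

   #include <stdint.h>

   /* A real T-table holds S-box outputs premultiplied by the
      MixColumns coefficients; zeros here are placeholders. */
   static const uint32_t T0[256] = {0};

   /* The index is secret data, so WHICH cache line gets loaded
      depends on the secret: exactly the signal that cache-timing
      attacks measure. */
   static uint32_t leaky_lookup(uint8_t secret_state_byte) {
      return T0[secret_state_byte];
   }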

Adding hardware for AES arithmetic, particularly inversion in GF(256),
deals with the timing attacks but costs quite a bit of hardware area for
yet another niche operation. Intel decided that on its large cores it
could justify the area for 8 inverters, and then for 16 inverters, and
probably 32 on future cores---but this is only a small percentage of the
overall area of Intel's vector units, and the percentage doesn't seem to
be increasing as the years go by.
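
Those inverters are reached through the AES-NI instructions: a single
AESENC performs a full AES round, with the GF(256) inversions handled
by the dedicated hardware in constant time. A minimal sketch (compile
with -maes):

   #include <immintrin.h>

   /* ShiftRows, SubBytes (the hardware GF(256) inversions),
      MixColumns, and the round-key xor, in one instruction. */
   static __m128i aes_round(__m128i state, __m128i round_key) {
      return _mm_aesenc_si128(state, round_key);
   }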

This is why 12-round AES-192 isn't beating 12-round ChaCha12-256. Yes,
there's some special-purpose hardware for AES, but there's much more
general-purpose hardware for vector computations, and ChaCha takes
advantage of this general-purpose hardware.

---Dan

