ImperialViolet

Post-quantum confidentiality for TLS (11 Apr 2018)

In 2016, my colleague, Matt Braithwaite, ran an experiment in Google Chrome which integrated a post-quantum key-agreement primitive (NewHope) with a standard, elliptic-curve one (X25519). Since that time, the submissions for the 1st round of NIST’s post-quantum process have arrived. We thus wanted to consider which of the submissions, representing the new state of the art, would be most suitable for future work on post-quantum confidentiality in TLS.

A major change since the 2016 experiment is the transition from TLS 1.2 to TLS 1.3 (a nearly-final version of which is now enabled by default in Chrome). This impacts where in the TLS handshake the larger post-quantum messages appear:

In TLS 1.2, the client offers possible cipher suites in its initial flow, the server selects one and sends a public-key in its reply, then the client completes the key-agreement in its second flow. With TLS 1.3, the client offers several possible public-keys in its initial flow and the server completes one of them in its reply. Thus, in a TLS 1.3 world, the larger post-quantum keys will be sent to every TLS server, whether or not they’ll use it.

(There is a mechanism in TLS 1.3 for a client to advertise support for a key-agreement but not provide a public-key for it. In this case, if the server wishes to use that key-agreement, it replies with a special message to indicate that the client should try again and include a public-key because it’ll be accepted. However, this obviously adds an extra round-trip and we don’t want to penalise more secure options—at least not if we want them adopted. Therefore I’m assuming here that any post-quantum key agreement will be optimistically included in the initial message. For a diagram of how this might work with a post-quantum KEM, see figure one of the Kyber paper. Also I'm using the term “public-key” here in keeping with the KEM constructions of the post-quantum candidates. It's not quite the same as a Diffie-Hellman value, but it's close enough in this context.)

In order to evaluate the likely latency impact of a post-quantum key-exchange in TLS, Chrome was augmented with the ability to include a dummy, arbitrarily-sized extension in the TLS ClientHello. To be clear: this was not an implementation of any post-quantum primitive, it was just a fixed number of bytes of random noise that could be sized to simulate the bandwidth impact of different options. It was included for all versions of TLS because this experiment straddled the enabling of TLS 1.3.

Post quantum families

Based on submissions to NIST’s process, we grouped likely candidates into three “families” of algorithms, based on size: supersingular isogenies (SI), structured lattices (SL), and unstructured lattices (UL). Not all submissions fit into these families, of course: in some cases the public-key + ciphertext size was too large to be viable in the context of TLS, and others were based on problems that we were too unfamiliar with to be able to evaluate.

As with our 2016 experiment, we expect any post-quantum algorithm to be coupled with a traditional, elliptic-curve primitive so that, if the post-quantum component is later found to be flawed, confidentiality is still protected as well as it would otherwise have been.

This led to the following, rough sizes for evaluation:

  1. Supersingular isogenies (SI): 400 bytes
  2. Structured lattices (SL): 1 100 bytes
  3. Unstructured lattices (UL): 10 000 bytes

Incompatibilities with unstructured lattices

Before the experiment began, in order to establish some confidence that TLS connections wouldn’t break, the top 2 500 sites from the Alexa list were probed to see whether large handshakes caused problems. Unfortunately, the 10 000-byte extension caused 21 of these sites to fail, including common sites such as godaddy.com, linkedin.com, and python.org.

Oddly enough, reducing the size of the extension to 9 999 bytes reduced that number to eight, but linkedin.com was still included. These remaining eight sites seem to reject ClientHello messages larger than 3 970 bytes.

This indicated that widespread testing of a 10 000-byte extension was not viable. Thus the unstructured lattice configuration was replaced with a 3 300-byte “unstructured lattice stand-in” (ULS). This size was chosen to be about as large as we could reasonably test (although four sites still broke.

Phase One

In February 2018, a fraction of Chrome installs (Canary and Dev channels only) were configured to enable the dummy extension in one of four sizes, with the configuration randomly selected, per install, from:

  1. Control group: no extension sent
  2. Supersingular isogenies (SI): 400 bytes
  3. Structured lattices (SL): 1 100 bytes
  4. Unstructured lattice standing (ULS): 3 300 bytes

We measured TLS handshake latency at the 50th and 95th percentile, split out by mobile / non-mobile and by whether the handshake was a full or resumption handshake.

Given that we measured a stand-in for the UL case, we needed to extrapolate from that to an estimate for the UL latency. We modeled a linear cost of each additional byte in the handshake, 66 bytes of overhead per packet, and an MTU around 1 500 bytes. The number of packets for the UL case is the same for MTU values around 1 500, but the number of packets for the SL case is less certain: a session ticket could push the ClientHello message into two packets there. We chose to model the SL case as a single packet for the purposes of estimating the UL cost.

We also ignored the possibility that the client’s initial congestion window could impact the UL case. If that were to happen, a UL client would have to wait an additional round-trip, thus our UL estimates might be optimistic.

Despite the simplicity of the model, the control, SI, and SL points for the various configurations are reasonably co-linear under it and we draw the least-squares line out to 10 000 bytes to estimate the UL latency.

Configuration Additional latency over control group
SI SL UL (estimated)
Desktop, Full, Median 4.0% 6.4% 71.2%
Desktop, Full, 95% 4.7% 9.6% 117.0%
Desktop, Resume, Median 4.3% 12.5% 118.6%
Desktop, Resume, 95% 5.2% 17.7% 205.1%
Mobile, Full, Median -0.2% 3.4% 34.3%
Mobile, Full, 95% 0.5% 7.2% 110.7%
Mobile, Resume, Median 0.6% 7.2% 66.7%
Mobile, Resume, 95% 4.2% 12.5% 149.5%

(The fact that one of the SI values came out implausibly negative should give some feeling for the accuracy of the results—i.e. the first decimal place is probably just noise, even at the median.)

As can be seen, the estimated latency overhead of unstructured lattices is notably large in TLS, even without including the effects of the initial congestion window. For this reason, we feel that UL primitives are probably not preferable for real-time TLS connections.

It is also important to note that, in a real deployment, the server would also have to return a message to the client. Therefore the transmission overhead, when connecting to a compatible server, would be doubled. This phase of the experiment experiment did not take that into account at all. Which leads us to…

Phase Two

In the second phase of the experiment we wished to measure the effect of actually negotiating a post-quantum primitive, i.e. when both client and server send post-quantum messages. We again added random noise of various sizes to simulate the bandwidth costs of this. Cloudflare and Google servers were augmented to echo an equal amount of random noise back to clients that included it.

In this phase, latencies were only measured when the server echoed the random noise back so that the set of measured servers could be equal between control and experiment groups. This required that the control group send one byte of noise (rather than nothing).

Since unstructured lattices were not our focus after the phase one results, there were only three groups in phase two: one byte (control), 400 bytes (SI), and 1100 bytes (SL).

Unlike phase one, we did not break the measurements out into resumption / full handshakes this time. We did that in phase one simply because that’s what Chrome measures by default. However, it begs the question of what fraction of handshakes resume, since that value is needed to correctly weigh the two numbers. Thus, in phase two, we simply measured the overall latency, with resumption (or not) implicitly included.

Here are the latency results, this time in milliseconds in order to reason about CPU costs below:

Configuration Additional latency over control group (ms)
SI SL
Mobile, Median 3.5 9.6
Mobile, 95% 18.4 159.0
Desktop, Median 2.6 5.5
Desktop, 95% 19.2 136.9

So far we’ve only talked about families of algorithms but, to analyse these results, we have to be more specific. That’s easy in the case supersingular isogenies (SI) because there is only one SI-based submission to NIST: SIKE.

For SIKEp503, the paper claims an encapsulation speed on a 3.4GHz Skylake of 16 619 kilocycles, or 4.9ms of work on a server. They also report on an assembly implementation for Aarch64, which makes a good example client. There the key generation + decapsulation time is 86 078 kilocycles, or 43ms on their 2GHz Cortex-A79.

On the other hand, structured lattices (SL) are much faster. There are several candidates here but most have comparable speed, and it’s under a 10th of a millisecond on modern Intel chips.

Therefore, if computation were included, the SI numbers above would have 48ms added to them. That happens to makes SL clearly faster at the median, although it does not close the gap at the 95% level. (The SI key-generation could be amortised across connections, at the cost of code complexity and maybe some forward security. Amortising key-generation across 16 connections results in 28ms of computation per connection, which doesn’t affect the argument.)

We had previously deployed CECPQ1 (X25519 + NewHope) in TLS, which should be comparable to SL measurements, and gotten quite different results: we found a 1ms increase at the median and only 20ms at 95%. (And CECPQ1 took roughly 1 900 bytes.)

On a purely technical level, CECPQ1 was based on TLS 1.2 and therefore increased the sizes of the server’s first flow and the client’s second flow. It’s possible that bytes in the client’s second flow are less expensive, but much more significantly, TLS 1.2 does not do a fresh key-agreement for a resumption handshake. Therefore all the successful resumptions had no overhead with CECPQ1, but with TLS 1.3, they would. Going back to historical Chrome statistics from the time of the CECPQ1 experiment, we estimate that about 50% of connections would have been resumptions, which closes the gap.

Additionally, there may be a significant difference in Chrome user populations between the experiments. CECPQ1 went to Chrome Stable (non-mobile only) while the post-quantum measurements here have been done on Dev and Canary channels. It is the case that, for all non-mobile TLS connections (completely unrelated to any post-quantum experiments), the population of Dev and Canary channel users see higher TLS latencies and the effect is larger than the SI or SL effect measured above. Thus the user population for these experiments might have had poorer internet connections than the population for CECPQ1, exaggerating the costs of extra handshake bytes.

Apart from CPU time, a second major consideration is that we are not comparing like with like. Our 400-byte figure for SI maps to a NIST “category one” strength, while 1 100 bytes for SL is roughly in line with “category three”. A comparable, “category one” SL primitive comes in at more like 800 bytes, while a “category three” SI is much slower.

(Although there is a recent paper suggesting that SIKEp503 might be more like category three after all. So maybe this point is moot, but such gyrations highlight that SI is a relatively new field of study.)

Lastly, and much more qualitatively, we have had many years of dealing with subtle bugs in elliptic-curve implementations. The desire for performance pushes for carefully-optimised implementations, but missed carries, or other range violations, are typically impossible for random tests to find. After many years, we are at the point where we are starting to use formally-synthesised code for elliptic-curve field operations but SI would take us deeper into the same territory, likely ensuring that the pattern of ECC bugs continues into the future. (As Hamburg says in his NIST submission, “some cryptographers were probably hoping never to see a carry chain again”.)

On the other hand, SL designs are much more like symmetric primitives, using small fields and bitwise operations. While it’s not impossible to botch, say, AES, my experience is that the defect rates in implementations of things like AES and SHA-256 is dramatically lower, and this is strong argument for SL over SI. Combining post-quantum primitives with traditional ECC is a good defense against bugs, but we should still be reticent to adopt potentially fragile constructs.

(Not all SL designs have this benefit, mind you. Hamburg, quoted above, uses ℤ/(23120 - 21560 - 1) in ThreeBears and lattice schemes for other primitives may need operations like discrete Gaussian sampling. But it holds for a number of SL-based KEMs.)

Thus the overall conclusion of these experiments is that post-quantum confidentiality in TLS should probably be based on structured lattices although there is still an open question around QUIC, for which exceeding a single packet is problematic and thus the larger size of SL is more significant.