The curious case of slow downloads

04/11/2016

6 min read

Some time ago we discovered that certain very slow downloads were getting abruptly terminated and began investigating whether that was a client (i.e. web browser) or server (i.e. us) problem.

Some users were unable to download a binary file a few megabytes in length. The story was simple—the download connection was abruptly terminated even though the file was in the process of being downloaded. After a brief investigation we confirmed the problem: somewhere in our stack there was a bug.

Describing the problem was simple, reproducing it was easy with a single curl command, but fixing it took a surprising amount of effort.


CC BY 2.0 image by jojo nicdao

In this article I'll describe the symptoms we saw, how we reproduced it and how we fixed it. Hopefully, by sharing our experiences we will save others from the tedious debugging we went through.

Failing downloads

Two things caught our attention in the bug report. First, only users on mobile phones were experiencing the problem. Second, the asset causing issues—a binary file—was pretty large, at around 30MB.

After a fruitful session with tcpdump one of our engineers was able to prepare a test case that reproduced the problem. As so often happens, once you know what you are looking for, reproducing a problem is easy. In this case setting up a large file on a test domain and using the --limit-rate option to curl was enough:

$ curl -v http://example.com/large.bin --limit-rate 10k > /dev/null
* Closing connection #0
curl: (56) Recv failure: Connection reset by peer

Poking around with tcpdump showed there was an RST packet coming from our server exactly 60 seconds after the connection was established:

$ tcpdump -tttttni eth0 port 80
00:00:00 IP 192.168.1.10.50112 > 1.2.3.4.80: Flags [S], seq 3193165162, win 43690, options [mss 65495,sackOK,TS val 143660119 ecr 0,nop,wscale 7], length 0
...
00:01:00 IP 1.2.3.4.80 > 192.168.1.10.50112: Flags [R.], seq 1579198, ack 88, win 342, options [nop,nop,TS val 143675137 ecr 143675135], length 0

Clearly our server was doing something wrong. An RST packet coming from the CloudFlare server is just bad. The client was behaving: it sent ACK packets politely and consumed the data at its own pace, and yet we abruptly cut the conversation short.

Not our problem

We are heavy users of NGINX. In order to isolate the problem we set up a basic, off-the-shelf NGINX server. The issue was easily reproducible locally:

$ curl --limit-rate 10k  localhost:8080/large.bin > /dev/null
* Closing connection #0
curl: (56) Recv failure: Connection reset by peer

This proved the problem was not specific to our setup—it was a broader NGINX issue!

After some further poking we found two culprits. First, we were using the reset_timedout_connection setting. This causes NGINX to close connections abruptly: when NGINX wants to time out a connection, it sets SO_LINGER with a zero linger time on the socket, followed by a close(). This triggers an RST packet instead of the usual graceful TCP finalization. Here's an strace log from NGINX:

04:20:22 setsockopt(5, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
04:20:22 close(5) = 0
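
For illustration, here's a minimal C sketch of that shutdown sequence (my own sketch with a hypothetical helper name, not NGINX source): setting SO_LINGER with a zero linger time makes the subsequent close() discard any unsent data and emit an RST instead of the usual FIN handshake.

#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Abort a TCP connection the way reset_timedout_connection does:
 * SO_LINGER with a zero timeout makes close() drop any unsent data
 * and send an RST instead of the normal FIN/ACK sequence. */
void abort_connection(int fd)
{
    struct linger lng;
    memset(&lng, 0, sizeof(lng));
    lng.l_onoff  = 1;   /* lingering enabled...              */
    lng.l_linger = 0;   /* ...with zero seconds: reset now   */
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));
    close(fd);          /* kernel emits the RST here         */
}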

We could just have disabled the reset_timedout_connection setting, but that wouldn't have solved the underlying problem. Why was NGINX closing the connection in the first place?

After further investigation we looked at the send_timeout configuration option. The default value is 60 seconds, exactly the timeout we were seeing.

http {
     send_timeout 60s;
     ...

The send_timeout option is used by NGINX to ensure that all connections will eventually drain. It controls the time allowed between successive send/sendfile calls on each connection. Generally speaking it's not acceptable for a single connection to hold precious server resources for too long. If a download has been going on too long or is plain stuck, it's okay for the HTTP server to give up on it.

But there was more to it than that.

Not an NGINX problem either

Armed with strace we investigated what NGINX actually did:

04:54:05 accept4(4, ...) = 5
04:54:05 sendfile(5, 9, [0], 51773484) = 5325752
04:55:05 close(5) = 0

In the config we ordered NGINX to use sendfile to transmit the data. The call to sendfile succeeds and pushes 5MB of data to the send buffer. This value is interesting—it's about the amount of space we have in our default write buffer setting:

$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 5242880 33554432
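
As a side note, the sendfile() call itself is straightforward. This is roughly its shape (a simplified sketch with a hypothetical wrapper name, not NGINX's actual code): the kernel copies bytes from the file descriptor straight into the socket, and the return value says how much actually fit.

#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/types.h>

/* Queue as much of the file as the socket will accept. On a
 * non-blocking socket the return value may be far smaller than the
 * requested count -- exactly what the strace above shows. */
ssize_t push_file(int sock_fd, int file_fd, off_t *offset, size_t count)
{
    ssize_t sent = sendfile(sock_fd, file_fd, offset, count);
    if (sent < 0)
        perror("sendfile");
    return sent;
}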

A minute after the first long sendfile the socket is closed. Let's see what happens when we increase the send_timeout value to some big value (like 600 seconds):

08:21:37 accept4(4, ...) = 5
08:21:37 sendfile(5, 9, [0], 51773484) = 6024754
08:24:21 sendfile(5, 9, [6024754], 45748730) = 1768041
08:27:09 sendfile(5, 9, [7792795], 43980689) = 1768041
08:30:07 sendfile(5, 9, [9560836], 42212648) = 1768041
...

After the first large push of data, sendfile is called several more times, transferring about 1.7 MiB on each call. These syscalls happen roughly every 180 seconds, yet the slow curl was draining the socket the whole time, so why didn't NGINX refill it continuously?

The asymmetry

A motto of Unix design is "everything is a file". I prefer to think about this as: "in Unix everything can be readable and writeable when given to poll". But what exactly does "being readable" mean? Let's discuss the behavior of network sockets on Linux.

The semantics of reading from a socket are simple:

  • Calling read() will return the data available on the socket, until it's empty.
  • poll reports the socket as readable when any data is available on it.

One might think this is symmetrical and similar conditions hold for writing to a socket, like this:

  • Calling write() will copy data to the write buffer, up until "send buffer" memory is exhausted.
  • poll reports the socket is writeable if there is any space available in the send buffer.

Surprisingly, the last point is not true.

Different code paths

It's very important to realize that in the Linux Kernel, there are two separate code paths: one for sending data and another one for checking if a socket is writeable.

In order for send() to succeed two conditions must be met:

  • There must be some space available in the socket send buffer.
  • The amount of enqueued, not yet sent, data must be lower than the LOWAT setting. Let's ignore LOWAT for this discussion.

On the other hand, the conditions for a socket to be reported as "writeable" by poll are slightly narrower:

  • There must be some space available in the socket send buffer.
  • The amount of enqueued, not yet sent, data must be lower than the LOWAT setting.
  • The amount of free space in the send buffer must be greater than half of the used send buffer space.

The last condition is critical. This means that after you fill the socket send buffer to 100%, the socket will become writeable again only when it's drained below 66% of send buffer size.

Going back to our NGINX trace, the second sendfile we saw:

08:24:21 sendfile(5, 9, [6024754], 45748730) = 1768041

The call succeeded in sending 1.7 MiB of data. This is close to 33% of 5 MiB, our default wmem send buffer size.

I presume this threshold was implemented in Linux in order to avoid refilling the buffers too often. It is undesirable to wake up the sending program after every single byte of data is acknowledged by the client.
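
To see this behaviour in isolation, here's a small self-contained test program one could write (my own sketch, not from the original investigation, with error handling omitted): it fills a loopback TCP socket's send buffer, then slowly drains the reader while asking poll() when the writer becomes writeable again. POLLOUT does not come back after the first freed byte; a sizeable chunk of the send buffer has to be acknowledged first.

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Set up a TCP connection over loopback: listener, writer, reader. */
    int lst = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                        /* any free port */
    bind(lst, (struct sockaddr *)&addr, sizeof(addr));
    listen(lst, 1);
    socklen_t alen = sizeof(addr);
    getsockname(lst, (struct sockaddr *)&addr, &alen);

    int wr = socket(AF_INET, SOCK_STREAM, 0);
    connect(wr, (struct sockaddr *)&addr, sizeof(addr));
    int rd = accept(lst, NULL, NULL);

    /* Fill the writer's send buffer (and the peer's receive buffer). */
    fcntl(wr, F_SETFL, O_NONBLOCK);
    char chunk[4096];
    memset(chunk, 'x', sizeof(chunk));
    long queued = 0;
    ssize_t n;
    while ((n = write(wr, chunk, sizeof(chunk))) > 0)
        queued += n;
    printf("buffers full after %ld bytes\n", queued);

    /* Drain slowly and check when poll() reports the writer writeable.
     * It does not happen as soon as the first bytes are freed. */
    struct pollfd pfd = { .fd = wr, .events = POLLOUT };
    long drained = 0;
    while (drained < queued) {
        drained += read(rd, chunk, sizeof(chunk));
        if (poll(&pfd, 1, 10) > 0 && (pfd.revents & POLLOUT)) {
            printf("POLLOUT after draining %ld of %ld bytes\n",
                   drained, queued);
            break;
        }
    }
    return 0;
}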

The solution

With a full understanding of the problem we can say decisively when it happens:

  1. The socket send buffer is filled to at least 66%.

  2. The client's download speed is poor and it fails to drain the send buffer below 66% within 60 seconds.

  3. When that happens, the socket is not reported as writeable, NGINX can't refill the send buffer before send_timeout expires, and the connection gets reset.

There are a couple of ways to fix the problem.

One option is to increase the send_timeout to, say, 280 seconds. Given the default 5 MiB send buffer, the socket becomes writeable again only after roughly 1.7 MiB has drained; at 50Kbps (about 6 KB/s) that takes around 280 seconds, so consumers faster than 50Kbps would never time out.

Another choice is to reduce the tcp_wmem send buffer sizes.

The final option is to patch NGINX to react differently on timeout. Instead of closing the connection, we could inspect the amount of data remaining in the send buffer. We can do that with ioctl(TIOCOUTQ). With this information we know exactly how quickly the connection is being drained; if the drain rate is above some configurable threshold, we could decide to grant the connection some more time.
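
For reference, reading that counter is a one-liner; something like this standalone sketch (a hypothetical helper, not the actual patch):

#include <stdio.h>
#include <sys/ioctl.h>

/* Return how many bytes are still queued in the socket's send buffer
 * (written by the application but not yet acknowledged by the peer),
 * or -1 on error. TIOCOUTQ (a.k.a. SIOCOUTQ) is Linux specific. */
long unsent_bytes(int sock_fd)
{
    int outq = 0;
    if (ioctl(sock_fd, TIOCOUTQ, &outq) < 0) {
        perror("ioctl(TIOCOUTQ)");
        return -1;
    }
    return outq;
}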

My colleague Chris Branch prepared a Linux specific patch to NGINX. It implements a send_minimum_rate option, which is used to specify the minimum permitted client throughput.

Summary

The Linux networking stack is very complex. While it usually works really well, sometimes it gives us a surprise. Even very experienced programmers don't fully understand all the corner cases. During debugging we learned that setting timeouts in the "write" path of the code requires special attention. You can't just treat the "write" timeouts in the same way as "read" timeouts.

It was a surprise to me that the semantics of a socket being "writeable" are not symmetrical to the "readable" state.

In the past we found that raising receive buffers can have unexpected consequences. Now we know that tweaking wmem values can affect something totally different: NGINX send timeouts.

Tuning a CDN to work well for all users takes a lot of work. This write-up is the result of hard work done by four engineers (special thanks to Chris Branch!). If this sounds interesting, consider applying!
