Server Outages and Increased API Errors
Incident Report for Discord
Postmortem

All times are PDT.

Summary

Discord was unavailable for most users for about an hour. The root cause is well understood and fixed. The bug was in our service discovery system, which is used by services within our infrastructure to discover one another. In this instance, service discovery is used by our real-time chat services to discover the RPC endpoints they use to load data from our databases when you connect to Discord, or when a Discord server (or "guild") is created for the first time or needs to be re-loaded from the database.

Timeline

  • 14:18 - A set of nodes that serve our API traffic scales out to deal with growing load on the API cluster. This is an event that happens throughout the day. A single node, api-prd-main-m1ds, had an unexpected error when announcing itself to service discovery.
  • 14:19 - Most of our Elixir services, which handle our real-time connections and chat message processing, started to crash, resulting in an instantaneous loss of capacity and causing a cascading failure in other dependent systems.
  • 14:21 - Multiple internal alarms fire, signaling a drop in key metrics we watch, along with anomaly alerts for cluster utilization. A SEV1 incident is declared. A phone bridge is set up, and all available engineers hop on to start investigating and to establish internal and external communications.
  • 14:24 - A status page incident is opened to let our users know that we're investigating: https://status.discordapp.com/incidents/62gt9cgjwdgf
  • 14:31 - A tweet is posted, letting users know that we're looking into the issue, and to check the status page for more updates: https://twitter.com/discordapp/status/1239665509596983296
  • 14:23 to 14:43 - A few engineers investigate why exactly we lost so much capacity on our real-time systems, while other engineers focus on recovering service: restarting the lost capacity and throttling connections to Discord to help with recovery. Additionally, we stop database maintenance operations ("anti-entropy repairs") on two of our ScyllaDB clusters, since they would lead to resource starvation while everyone attempts to reconnect.
  • 14:55 - Engineers pinpoint the issue as strongly correlated with a spike in errors originating from our service discovery modules. It is determined that the service discovery processes watching our API service had gotten into a crash loop due to an unexpected deserialization error. This triggered an event called "max restart intensity", in which a process's supervisor decides it is crashing too frequently and triggers a full restart of the node. This event occurred instantaneously across approximately 50% of the nodes that were watching for API nodes, across multiple clusters. At the time, we believe this to be related to hitting a cap on the number of watchers in etcd (the key-value store we use for service discovery), and we attempt to raise that cap using runtime configuration. Engineers continue to remediate failed nodes and restore service to our users.
  • 15:07 to 15:26 - The connection throttle is continually increased, allowing more users to reconnect as services recover.
  • 15:32 - The status page incident is resolved, and service is deemed to be fully operational again.
  • 23:00 - Once the root cause is fully understood, a mitigation is deployed to production to prevent the issue from recurring.

Investigation and Analysis

The root cause of this outage was determined to be an invalid service entry being inserted into service discovery, causing a parse error when that entry was deserialized after being loaded from etcd. Engineers worked to re-create the failure and were able to reproduce the same behavior observed in production in our development environment.

  • Discord uses an open source, distributed, reliable key-value store called etcd (https://github.com/etcd-io/etcd) to store service discovery information. Services that are discoverable announce themselves by setting a key in a specific directory in etcd that corresponds to the cluster they are part of. That key has a 60-second TTL, and the service is responsible for heart-beating to etcd to "re-announce" the key. Discord is using the etcd v2 API.
  • At 14:18, a node joined our API cluster after being introduced by the Google Cloud managed instance group autoscaler. This is a normal event that happens hundreds of times a day as platform utilization rises toward peak hours. This node logged an error while attempting to initially announce itself to etcd: "http.client.RemoteDisconnected: Remote end closed connection without response". Nearly immediately, almost all of our Elixir nodes logged that the service watcher for the "discord_api" service had crashed while attempting to parse the JSON metadata that should be stored in the key's value on etcd. These processes crash-looped briefly due to the invalid JSON data in the etcd cluster; the crash loop lasted until that API node retried announcing itself to service discovery, fixing the "corrupt" key that had been written to etcd.
  • Nodes announce themselves to etcd by issuing an HTTP PUT request whose urlencoded form body contains the "value" of the key. In our case, this value is JSON-encoded metadata relevant to discovering the service. Our etcd client uses Python's built-in HTTP client, which sends the PUT request line and headers (including the Content-Length header) in one packet, and the request body in another. We determined that the connection was reset after the first packet was sent, but before the second packet could be sent.
  • A well-behaved HTTP server would see that it received a request specifying a Content-Length with an incomplete or non-existent body, and reject that request. etcd is written in the Go programming language and uses the Go standard library's net/http request handlers for its v2 keys API. To parse the form body sent by clients, it uses the net/http package's Request.ParseForm() method. This method does not check whether the request body's length matches the length specified in the Content-Length header.
  • This caused the key to be written with an empty string as its value: the announce request successfully sent its headers but never sent its body. The resulting invalid key in service discovery caused the downstream services to crash until the announcing node retried and re-wrote the key. The sketch below walks through this announce flow.
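To make the failure mode above concrete, here is a minimal sketch of an etcd v2 announce using Python's built-in HTTP client, following the flow described in this list. The etcd hostname, key path, and metadata fields here are illustrative assumptions, not our actual configuration.

    import http.client
    import json
    import urllib.parse

    # Illustrative metadata only; the real schema is internal.
    metadata = json.dumps({"host": "10.0.0.12", "port": 443})

    # etcd v2 expects a urlencoded form body; ttl=60 makes the key expire
    # unless the service heartbeats and re-announces it within 60 seconds.
    body = urllib.parse.urlencode({"value": metadata, "ttl": 60})

    conn = http.client.HTTPConnection("etcd.internal", 2379)  # assumed hostname
    # http.client writes the request line and headers with one socket send and
    # the form body with a second send. If the connection drops between the
    # two, etcd receives a PUT that declares a Content-Length but carries no
    # body, which is how the empty "value" in this incident was written.
    conn.request(
        "PUT",
        "/v2/keys/discovery/discord_api/api-prd-main-m1ds",  # assumed key layout
        body=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    response = conn.getresponse()
    print(response.status, response.read())

A watcher on the same etcd directory reads each key's value and parses the JSON metadata; an empty value fails that parse, which is what the downstream watchers crash-looped on.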

Action Items / Response

Code within our service discovery system was not resilient to this type of failure, because we did not expect that a key could be announced without a value due to a transient network error. Our service discovery system is resilient to various failure modes within etcd; however, we did not anticipate a client being able to write a corrupt record due to improper HTTP handling. Go's net/http package, which violates the HTTP/1.1 specification in this specific case, introduced an additional failure case we did not anticipate: a server is expected to reject a request whose body is incomplete relative to its declared Content-Length, but net/http does not check that the bytes read from the body match the Content-Length header. We've since hardened our system to reduce the likelihood of this occurring, and to handle invalid service announcements without crash-looping.

  • In order to reduce the likelihood of invalid keys being written to service discovery, we've modified our etcd clients to send their announce requests in a single TCP packet instead of two. This means that the headers and body should either be received completely, or not at all (see the first sketch after this list).
  • We've added additional error handling to ignore services that have a "corrupt" key value, in case this issue does recur. The worst that will happen now is that the service will not be discovered, and we'll be able to investigate (see the second sketch after this list).
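As a rough illustration of the first item above, the sketch below hand-builds the announce request and hands it to the kernel in a single write, so the headers and form body of a small request leave together. This is a sketch under the same assumptions as the earlier example (hostname, key layout), not our actual client change.

    import socket
    import urllib.parse

    def announce_single_write(host, port, key, value, ttl=60):
        """Illustrative: send an etcd v2 announce as one buffered write, so a
        small request's headers and urlencoded body go out together rather
        than as two separate sends."""
        payload = urllib.parse.urlencode({"value": value, "ttl": ttl}).encode()
        head = (
            f"PUT /v2/keys/{key} HTTP/1.1\r\n"
            f"Host: {host}:{port}\r\n"
            "Content-Type: application/x-www-form-urlencoded\r\n"
            f"Content-Length: {len(payload)}\r\n"
            "Connection: close\r\n"
            "\r\n"
        ).encode()
        with socket.create_connection((host, port)) as sock:
            sock.sendall(head + payload)  # headers and body in one write
            return sock.recv(65536)       # raw HTTP response bytes (unparsed)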
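For the second item, the idea is simply to treat an entry whose value does not parse as "not discoverable" and set it aside for investigation, instead of letting the watcher crash. Our watchers are Elixir processes; the Python sketch below shows the same defensive shape purely for illustration.

    import json

    def parse_service_entries(raw_values):
        """Keep entries whose values are valid JSON metadata; collect anything
        corrupt (such as the empty-string value from this incident) for
        investigation instead of crashing."""
        services, corrupt = [], []
        for raw in raw_values:
            try:
                services.append(json.loads(raw))
            except (json.JSONDecodeError, TypeError):
                corrupt.append(raw)
        return services, corrupt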

Additionally, we will be filing an upstream bug report with the Go project so they're aware of this issue, and hopefully nobody else will have to learn about it the hard way.

Sorry for any inconvenience this caused! We're working around the clock to make sure that Discord is reliable and available for everyone, especially as utilization of the platform is at an all-time high. Thank you for choosing Discord as your place to hang out and talk to your friends!

Posted Mar 20, 2020 - 16:59 PDT

Resolved
We have resolved an issue with our service discovery suffering under a high amount of load and triggering cascading failures. All users are now able to reconnect to all servers.
Posted Mar 16, 2020 - 15:32 PDT
Monitoring
Error rate and latency are back to a stable level. We are starting to let users back into Discord.
Posted Mar 16, 2020 - 15:05 PDT
Update
A series of fatal errors caused the majority of servers to become unavailable. We are working to revive all of these resources. Most users will be unable to connect while this work is ongoing.
Posted Mar 16, 2020 - 14:51 PDT
Investigating
We are currently investigating an issue where a number of Discord servers are completely unavailable.
Posted Mar 16, 2020 - 14:24 PDT
This incident affected: API.