Analyzing the GitHub outage


A couple of weeks ago, GitHub had a major outage, lasting over 24 hours and resulting in widespread disruption of many operations for customers. A few days after everything was fixed, they posted their analysis of what happened, which makes for a really good read.

The pebble that started all of this was a connection disruption that lasted 43 seconds(!). A couple of months ago I talked about people who say that you can assume that distributed failures are no longer meaningful. The real world will keep serving up weird / strange / nasty stuff to your production systems, and you need to handle it. Quoting from the original post:

Therefore: the question becomes: how much availability is lost when we guarantee consistency? In practice, the answer is very little. Systems that guarantee consistency only experience a necessary reduction in availability in the event of a network partition. As networks become more redundant, partitions become an increasingly rare event. And even if there is a partition, it is still possible for the majority partition to be available. Only the minority partition must become unavailable. Therefore, for the reduction in availability to be perceived, there must be both a network partition, and also clients that are able to communicate with the nodes in the minority partition (and not the majority partition). This combination of events is typically rarer than other causes of system unavailability.

So no, not really. There is a good point here about the fact that only the minority portion of the system must become unavailable, but given typical production deployments, any disconnect between data centers will leave some clients facing the minority portion, which then becomes unavailable to them.

The actual GitHub issues that are discussed in the post are a lot more interesting. First, we have the obvious problem that most applications assume that their database access is fast, and they make multiple such calls during the processing of a single request (sometimes, many calls). This is just another example of the Fallacies of Distributed Computing in action. RavenDB has built-in detection for that and a host of features that allow you to go to the database server once, instead of multiple times. In such a case, even if you need to failover to a remote server, you won’t pay the roundtrip costs multiple times.
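
To make the cost concrete, here is a back-of-the-envelope sketch (the call counts and latencies are made-up but plausible numbers, not GitHub’s) showing why a chatty request that was fine against a local database falls apart once a failover sends it across a data center boundary:

```python
# Back-of-the-envelope sketch with assumed numbers: the cost of chatty data
# access once a failover turns a local database hop into a cross-DC hop.

CALLS_PER_REQUEST = 30    # a "one query per loop iteration" style page (assumed)
LOCAL_RTT_MS = 0.5        # round trip to a database in the same data center (assumed)
CROSS_DC_RTT_MS = 60.0    # round trip to a failover server in another region (assumed)

def request_latency_ms(calls: int, rtt_ms: float) -> float:
    """Network time one web request spends waiting on the database."""
    return calls * rtt_ms

print(f"local, chatty:     {request_latency_ms(CALLS_PER_REQUEST, LOCAL_RTT_MS):6.0f} ms")
print(f"cross-DC, chatty:  {request_latency_ms(CALLS_PER_REQUEST, CROSS_DC_RTT_MS):6.0f} ms")
print(f"cross-DC, batched: {request_latency_ms(1, CROSS_DC_RTT_MS):6.0f} ms")
```

The batched line is what features like includes and lazy requests are for: collapse the chatter into a single round trip, whatever that round trip ends up costing.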

However, this is such a common problem that I don’t think it deserves much attention. There isn’t much that you can do about it without careful consideration and support from the whole stack. Usually this only happens on projects where a strong leader institutes a performance budget and enforces it. That has costs of its own, and usually it is cheaper to just not fail over across data center boundaries.

The next part that I find really interesting is that the system GitHub uses for managing topologies is not consistent, but is required to be. The problem is that there is an inherent delay between the moment a failure actually occurs and the moment their orchestrator re-organizes the cluster in response. That would have been fine if they had a way to successfully merge histories, but that is not the case. In fact, looking at just the information they have published (and ignoring that I have the benefit of hindsight), the issue is glaringly obvious.

A deep dive (and a fascinating read) into how GitHub handles high availability talks about the underlying details and exposes the root cause. You cannot layer distinct distributed architectures on top of one another and expect to come up with a good result. Here is their description of the flow (a sketch of the final watch-and-reload step follows the list):

In a master crash scenario:

  • The orchestrator nodes detect failures.
  • The orchestrator/raft leader kicks off a recovery. A new master gets promoted.
  • orchestrator/raft advertises the master change to all raft cluster nodes.
  • Each orchestrator/raft member receives a leader change notification. They each update the local Consul’s KV store with the identity of the new master.
  • Each GLB/HAProxy has consul-template running, which observes the change in Consul’s KV store, and reconfigures and reloads HAProxy.
  • Client traffic gets redirected to the new master.
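
Steps 4 through 6 boil down to “watch a key in Consul and reload the proxy when it changes”. GitHub does this with consul-template; purely to illustrate the mechanism, here is a minimal sketch of the same loop written directly against Consul’s HTTP blocking-query API (the key name, addresses, and reload command are assumptions, not GitHub’s actual configuration):

```python
# Minimal sketch of the "watch Consul, reload HAProxy" step, using Consul's
# HTTP blocking-query API directly instead of consul-template.
# The KV key, Consul address, and reload command are illustrative assumptions.
import base64
import subprocess
import requests

CONSUL_KEY_URL = "http://127.0.0.1:8500/v1/kv/mysql/main-cluster/master"  # assumed key

def watch_master(index: int = 0) -> None:
    while True:
        # Blocking query: Consul holds the request open until the key changes
        # (or the wait time elapses), then returns the latest modify index.
        resp = requests.get(CONSUL_KEY_URL, params={"index": index, "wait": "5m"})
        resp.raise_for_status()
        new_index = int(resp.headers["X-Consul-Index"])
        if new_index != index and index != 0:
            master = base64.b64decode(resp.json()[0]["Value"]).decode()
            print(f"new master advertised: {master}; regenerating haproxy.cfg")
            # ...render the HAProxy backend pointing at `master` here...
            subprocess.run(["systemctl", "reload", "haproxy"], check=True)
        index = new_index

watch_master()
```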

I read this and feel a bit queasy, because the master crash scenario is not the interesting bit. That is the easy part. The really hard part is how you manage things when you have a network disruption with both sides still up and functioning. In fact, that is exactly what happened to GitHub. In this case, on the minority side, their orchestrator cannot get a majority (so it cannot make any forward progress). But that means the rest of the flow cannot proceed either; the whole thing stops at the first or second stage.

That means that the rest of the system will continue to write to the old master, resulting in a conflict. And this is where things get complicated. The issue here is that with MySQL (and most other systems that rely on log replication) you must have a single master at any given time. That is an absolute requirement. If you get to the point where you have two masters with divergent histories, you are left choosing which one you’ll accept (and which data you’ll discard) and trying to manually fix things after the fact.
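
To see why divergent histories cannot simply be merged, consider this toy sketch (the transactions are invented): once both sides have appended different entries after a common prefix, one of the suffixes has to be thrown away or replayed by hand.

```python
# Toy sketch of divergent replication logs: after a common prefix, the two
# "masters" have different suffixes, and log replication gives you no way to
# merge them; you keep one and discard (or hand-replay) the other.

def common_prefix_len(a: list[str], b: list[str]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

old_master = ["tx1", "tx2", "tx3", "tx4-written-on-old-side"]  # kept taking writes during the partition
new_master = ["tx1", "tx2", "tx3", "tx5-written-on-new-side"]  # promoted by orchestrator

split = common_prefix_len(old_master, new_master)
print("shared history:           ", old_master[:split])
print("kept (promoted master):   ", new_master[split:])
print("discarded / manual replay:", old_master[split:])
```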

The proper way to handle something like this would have been to use Raft to actually send the commands themselves to the servers. This ensures a consistent set of statements that run in the same order on all servers. Rqlite is a great example of this, where you get a consistent, distributed system on top of individual components. That would be the proper way to do it, mind you, but it isn’t the way anyone actually builds such systems.
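
For a feel of what that model looks like, here is a rough sketch against rqlite’s HTTP API as I understand it (the node address and schema are assumptions): the SQL statements themselves are sent to the cluster and go through Raft before any node applies them.

```python
# Rough sketch against rqlite's HTTP API (as I understand it): the statements
# themselves are replicated through Raft, so every node applies the same
# commands in the same order. Node address and table are assumptions.
import requests

RQLITE = "http://localhost:4001"

# Writes: each statement is committed through Raft before it is applied.
requests.post(f"{RQLITE}/db/execute", json=[
    "CREATE TABLE IF NOT EXISTS repos (id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO repos(name) VALUES('raven')",
]).raise_for_status()

# Reads: asking for 'strong' consistency routes the query through the leader.
rows = requests.get(f"{RQLITE}/db/query",
                    params={"q": "SELECT * FROM repos", "level": "strong"})
print(rows.json())
```

The price is that every write pays a full round of consensus before it is acknowledged, which is exactly what caps the throughput discussed next.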

You wouldn’t be able to get any reasonable performance from the system using this kind of approach. Rqlite, for example, talks about being able to get 10 – 200 operations per second. I’m going to assume that GitHub has a need for something better than that. So the underlying distributed architecture looks like this:

  • MySQL write master with multiple read-only secondaries replicating via binlog.
  • Orchestrator to provide health monitoring and selection of a new write master (itself made consistent using Raft).
  • Underlying routing infrastructure (Consul / GLB / HAProxy) that uses a different Raft cluster to store the routing configuration.

If you break Orchestrator’s ability to make decisions (easy, just create a partition), you take away the ability to change the write master, and if the failure mode you are dealing with is not a failed master (for example, a network partition), you are going to keep accepting new writes on the old master. That completely breaks the whole idea of binlog replication, of course, so you are sort of stuck at that point. In short, I think that Orchestrator was meant to solve an entirely different problem: it was meant to deal with the failure of a single node, not to handle a full data center partition.

When looking at such incidents, I always compare to what would have happened if RavenDB had been used instead. This is not really fair in this case, because RavenDB was designed upfront to be a distributed database. RavenDB doesn’t really have the concept of a write master. For simplicity’s sake, we usually try to direct all writes to a single node for each database, because that simplifies how you usually work. However, any node can accept writes and will distribute them to the rest of the nodes in the cluster. In a situation like the one GitHub faced, both sides of the partition would keep accepting writes (just as happened in GitHub’s case with MySQL).

The difference is what happens when the partition is healed. Each side of the partition will update the other with the data it is missing. Any conflicting writes (by which I mean writes on both sides of the partition to the same document or documents) will be detected and resolved automatically. Automatic resolution is very important to keeping everything up and running. It can follow a custom resolution policy defined by the user, or RavenDB can pick one of the conflicting versions arbitrarily. Regardless of the conflict resolution policy, the administrator will be notified about the conflicts and can review the actions taken by RavenDB and decide what to do about them.
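
To give a feel for what a custom resolution policy means, here is an illustrative sketch (this is not RavenDB’s actual API, and the document shape is invented): a resolver is essentially a function that receives every conflicting version of a document and returns the single version to keep.

```python
# Illustrative sketch of a custom conflict resolver (not RavenDB's actual API;
# the document shape is invented): given every version of a document written
# on both sides of the partition, return the one version to keep.
from datetime import datetime, timezone

def resolve_issue_conflict(versions: list[dict]) -> dict:
    """Keep the most recently modified version, but merge the comment lists."""
    latest = max(versions, key=lambda v: v["modified"])
    merged_comments: list[str] = []
    for version in versions:
        for comment in version["comments"]:
            if comment not in merged_comments:
                merged_comments.append(comment)
    resolved = {**latest, "comments": merged_comments}
    # Surface the decision so an administrator can review it later.
    print(f"{datetime.now(timezone.utc).isoformat()} resolved conflict: "
          f"kept version from {latest['modified']}, merged {len(merged_comments)} comments")
    return resolved

side_a = {"title": "Outage postmortem", "comments": ["c1", "c2"], "modified": "2018-10-21T22:53:01"}
side_b = {"title": "Outage analysis",   "comments": ["c1", "c3"], "modified": "2018-10-21T22:53:30"}
print(resolve_issue_conflict([side_a, side_b]))
```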

In GitHub’s case, their busiest cluster had less than a thousand writes in the time period in question, most of which wouldn’t conflict. I would expect the timeline with RavenDB to be:

  • 2018 October 21 22:52 UTC – initial network partition, lasting 43 seconds
  • 2018 October 21 22:54 UTC – internal monitoring alert about any conflicts (but I consider this unlikely)
  • 2018 October 21 23:00 UTC – issue resolved, go home

The difference is mostly because RavenDB was designed to live in this kind of environment: deployed in multiple data centers and actually handling, in the real world and with very little assistance, the task of keeping applications up and running without blowing things up. It is quite literally one of the basic building blocks we have, so it shouldn’t be surprising that we are pretty good at it.