Shoot the Automated Failure in the Head

This past week GitHub experienced their most significant service disruption of the year, and much of it came at the hands of an automated failover system they had designed to try to avoid disruptions. A number of different factors made the situation as bad as it was, but the basic summary of what led to the problem looks like this:

- On Monday, they attempted a schema migration, which led to a load spike.
- The high load triggered an automated failover to one of their MySQL slaves.
- Once failed over, the new master also experienced high load, so the automated failover system attempted to revert back to the original master.
- At this point, the ops team put the automated failover system into “maintenance mode” to prevent further failovers.
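To make the failure mode concrete, here is a minimal sketch of a load-triggered failover loop with a maintenance-mode switch. It is purely illustrative (Python, with invented names like `FailoverController` and `tick`), not a description of GitHub's actual tooling, but it captures why flapping happens when every node is overloaded and why putting the system into maintenance mode is the way to stop it.

```python
import time


class FailoverController:
    """Toy health-check loop that promotes a replica when the master looks overloaded.

    Hypothetical sketch, not GitHub's implementation: the class, thresholds,
    and load model are invented for illustration. It shows two behaviors from
    the summary above: when every node is overloaded, each check bounces the
    master back and forth ("flapping"), and the only thing that stops the loop
    is an explicit maintenance-mode flag.
    """

    def __init__(self, node_loads, load_threshold=0.9):
        self.node_loads = node_loads          # name -> current load, e.g. {"db1": 0.97}
        self.master = next(iter(node_loads))  # first node starts as master
        self.load_threshold = load_threshold
        self.maintenance_mode = False         # the "off switch" ops flipped

    def overloaded(self, node):
        return self.node_loads[node] > self.load_threshold

    def promote(self, node):
        print(f"failover: promoting {node} (old master: {self.master})")
        self.master = node

    def tick(self):
        """One health-check cycle."""
        if self.maintenance_mode:
            return  # humans are in control; take no automatic action
        if self.overloaded(self.master):
            # Naive policy: promote the least-loaded *other* node. If that
            # node is also overloaded, the next tick fails over yet again.
            others = {n: load for n, load in self.node_loads.items()
                      if n != self.master}
            self.promote(min(others, key=others.get))


if __name__ == "__main__":
    # Simulate the incident: a migration has pushed load up on every node,
    # so the controller keeps flapping the master until ops enable
    # maintenance mode.
    cluster = FailoverController({"db1": 0.97, "db2": 0.95})
    for _ in range(3):
        cluster.tick()      # promotes db2, then db1, then db2 again
        time.sleep(0.1)
    cluster.maintenance_mode = True  # ops step in
    cluster.tick()                   # no further failovers
```

A common hardening against this kind of flapping is to cap consecutive automatic failovers or add a cooldown, so the system has to ask a human before bouncing the master a second time.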