The Story
At 22:52 UTC on October 21, 2018, a GitHub data center technician began routine maintenance to replace failing 100G optical equipment in the US East Coast network hub. The technician disconnected the hardware. For 43 seconds, connectivity between the East Coast hub and GitHub's primary US East Coast data center was severed. The maintenance crew restored the connection almost immediately; 43 seconds is barely enough time to notice a problem. But in those 43 seconds, two things happened in rapid succession that would take the next 24 hours to undo: the East Coast primary MySQL database accepted writes that had not yet been replicated to the West Coast, and GitHub's automated failover system, Orchestrator, detected the partition and promoted a West Coast replica to primary status.
GitHub stores all its non-Git metadata in MySQL. The data in these clusters covers issues, pull requests, comments, notifications, user authentication, background job queues, and most other features that make GitHub useful beyond raw Git object storage. The Git repositories themselves — the actual code — were unaffected throughout the incident. But every issue, every pull request, every comment, every notification was at risk of showing stale or inconsistent data while the two clusters remained diverged. GitHub made the decision to put the site into a degraded read-only state — disabling writes and disabling features that depended on write consistency — while engineers worked to reconcile the clusters.
Problem
43-Second Network Partition Triggers Automatic Failover
At 22:52 UTC, routine maintenance severs the East Coast network hub from the primary data center for 43 seconds. In this window, the East Coast MySQL primary accepts writes that are not yet replicated to West Coast replicas. Orchestrator — running across three sites (East Coast DC, West Coast DC, and East Coast public cloud) — detects that West Coast replicas cannot reach the East Coast primary. The West Coast and East Coast cloud Orchestrator nodes form a quorum and promote a West Coast replica to primary. The network partition resolves at 22:53 UTC, but a new primary has already been elected while the old primary was still running and accepting writes.
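The quorum step can be sketched in a few lines. This is a minimal illustration of majority voting among three sites, not Orchestrator's actual implementation; the site labels and the `majority_partition` helper are hypothetical names introduced for the example.

```python
from typing import List, Optional, Set

# Hypothetical labels for the three Orchestrator sites described above.
SITES = {"east-dc", "west-dc", "east-cloud"}


def majority_partition(partitions: List[Set[str]], total: int) -> Optional[Set[str]]:
    """Return the group of mutually reachable nodes that holds a strict majority, if any."""
    for group in partitions:
        if len(group) > total // 2:
            return group
    return None


# During the partition, the East Coast DC node is isolated from the other two.
partitions = [{"east-dc"}, {"west-dc", "east-cloud"}]
quorum = majority_partition(partitions, total=len(SITES))
print(sorted(quorum))  # ['east-cloud', 'west-dc']
```

Two of three nodes is a valid majority, so the election is legitimate from the quorum's point of view even though the isolated primary is still healthy and serving writes.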
Cause
Replication Lag Meant the New Primary Had an Incomplete History
MySQL replication is asynchronous by default: replicas apply changes written to the primary, but not necessarily in real time. During the 43-second partition, some writes committed to the East Coast primary had not yet been shipped to or applied by the West Coast replicas. The West Coast replica was therefore lagging at the moment it was promoted, and it became primary without some committed writes. When the East Coast primary came back online and tried to connect as a replica to the new West Coast primary, the two histories were irreconcilably divergent: the East Coast primary had committed writes that the new West Coast primary had never seen, and the West Coast primary had already accepted new writes on top of its incomplete history.
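A toy model makes the divergence concrete. The sketch below assumes each node's history is just an ordered list of transaction ids; the `Node`, `commit`, and `replicate` names and the ids `e1`, `e2`, `e3`, `w1` are invented for illustration and do not correspond to MySQL internals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    log: List[str] = field(default_factory=list)  # committed transaction ids, in order


def commit(primary: Node, txn: str) -> None:
    """Async commit: the primary records the write and acknowledges immediately."""
    primary.log.append(txn)


def replicate(primary: Node, replica: Node) -> None:
    """Ship whatever the replica is missing. Only possible while the link is up."""
    replica.log.extend(primary.log[len(replica.log):])


east = Node("east-primary")
west = Node("west-replica")

commit(east, "e1")
replicate(east, west)   # link up: west catches up

commit(east, "e2")      # link down: these commits stay on east only
commit(east, "e3")

commit(west, "w1")      # west is promoted and starts accepting writes

only_east = [t for t in east.log if t not in west.log]
only_west = [t for t in west.log if t not in east.log]
print(only_east, only_west)  # ['e2', 'e3'] ['w1']
```

Both lists are non-empty, so neither node's history is a prefix of the other's. That is the condition that rules out simply re-attaching the old primary as a replica.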
Solution
Site Read-Only; Manual Data Reconciliation Across All Clusters
Engineers stopped the bleeding by putting GitHub into read-only mode and halting all write traffic. This prevented further divergence but meant users could not create or update issues, pull requests, or comments. Engineers then began the manual process of identifying all rows that had been written to the East Coast primary but were absent from the West Coast primary, which meant comparing transaction logs and data snapshots. Each diverged cluster had to be reconciled independently. Webhooks and background jobs were paused and queued. Re-syncing from backups, promoting clean replicas, and failing back to the original topology took until roughly 16:00 UTC on October 22, about 17 hours after the incident began; processing the queued backlog and verifying consistency took until 23:03 UTC, for a total of 24 hours and 11 minutes.
Result
24h11m of Degraded Service; GitHub Rethinks Failover Philosophy
GitHub experienced 24 hours and 11 minutes of service degradation. Git repositories were unaffected. Issues, pull requests, notifications, and user-generated content were in read-only mode for the majority of the outage. Over 5 million webhook events and 80,000 GitHub Pages builds were queued and required reprocessing. GitHub's post-incident analysis led to fundamental changes in Orchestrator configuration and deployment philosophy: automated failover would be limited to intra-region scenarios only, and cross-region failover would require human judgment. Engineers also adopted GitHub's own internal tooling to avoid creating unsupported MySQL topologies during failover.
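The timeline arithmetic is worth checking explicitly. The sketch below derives the end of the degraded window from the 22:52 UTC start and the 24h11m duration given in this article, then compares that duration with the 43-second partition; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

start = datetime(2018, 10, 21, 22, 52, tzinfo=timezone.utc)
degraded = timedelta(hours=24, minutes=11)
partition = timedelta(seconds=43)

end = start + degraded
print(end.isoformat())              # 2018-10-22T23:03:00+00:00

# How many times longer the recovery was than the fault that triggered it.
print(round(degraded / partition))  # 2025
```

A 43-second fault produced roughly two thousand times its own duration in degraded service, which is the asymmetry the rest of the analysis tries to explain.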
Orchestrator Did What It Was Configured to Do — That Was the Problem
The most technically subtle aspect of the incident is that Orchestrator made the correct decision given the information it had. From the West Coast's perspective: replicas could no longer reach the East Coast primary. Orchestrator is designed to promote a new primary when this happens. It did so, correctly, within its programmed parameters. The problem was not that Orchestrator malfunctioned — it was that the configuration allowed Orchestrator to elect a new primary across a wide area network partition, a scenario where the original primary might still be healthy and accepting writes.
GitHub's MySQL topology at the time included an intermediate replication layer that amplified the incident. Some West Coast replicas acted as 'intermediate primaries': they had their own downstream replicas to handle read distribution. When the West Coast intermediate primary was promoted, its downstream replicas were reconfigured to replicate from it, but the intermediate primary was itself behind on replication from the East Coast primary. The replication lag cascaded through the topology: the promoted primary was missing data, and all its downstream replicas were missing the same data. When engineers attempted to sync from backups, the backups were stored in cloud storage and took hours to restore over available bandwidth.
The Fix
Why GitHub Changed From 'Automate Everything' to 'Never Auto-Failover Across Regions'
The post-incident analysis led GitHub to a conclusion that runs counter to the common instinct in infrastructure engineering: more automation is not always safer. For GitHub's cross-region MySQL topology, automated failover created a failure mode — split-brain with no safe merge path — that was strictly worse than the degraded availability of a temporarily unreachable primary. The right tradeoff for this specific topology was to sacrifice automatic cross-region failover in exchange for data consistency guarantees.
| Dimension | Before Incident (Async Replication + Auto Failover) | After Incident (Semi-Sync + Human Cross-Region Failover) |
|---|---|---|
| Replication mode | Asynchronous — primary commits write without waiting for replica acknowledgment | Semi-synchronous — primary waits for at least one replica to acknowledge before committing |
| Replication lag during failure | Unbounded: lag accumulates during partition, creating irreconcilable divergence | Bounded while semi-sync is active: the primary does not acknowledge a commit until at least one replica has received it |
| Cross-region failover | Orchestrator auto-promotes across regions on partition detection | Cross-region failover requires human decision and explicit operator action |
| Intra-region failover | Automated via Orchestrator | Automated via Orchestrator (unchanged — intra-region partitions have lower split-brain risk) |
| Split-brain recovery | Manual: compare transaction logs, restore from backup, replay changes — 24+ hours | Semi-sync prevents the write divergence that creates split-brain in the first place |
| Write availability during partition | High (writes accepted by both partitions, creating divergence) | Lower (writes blocked until replica acknowledges) — but consistency is preserved |
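The cross-region and intra-region rows of the table reduce to a small policy rule. The sketch below is one hypothetical way to express it; the `Decision`, `Candidate`, and `failover_decision` names and the region labels are assumptions for illustration and are not Orchestrator's API or configuration keys.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_PROMOTE = "auto-promote"
    PAGE_HUMAN = "page-human"


@dataclass(frozen=True)
class Candidate:
    host: str
    region: str


def failover_decision(failed_primary_region: str, candidate: Candidate) -> Decision:
    """Promote automatically only when the candidate is in the failed primary's region."""
    if candidate.region == failed_primary_region:
        return Decision.AUTO_PROMOTE
    # Across a WAN the old primary may still be alive and taking writes,
    # so the decision is handed to an operator.
    return Decision.PAGE_HUMAN


print(failover_decision("us-east", Candidate("db-east-2", "us-east")).value)  # auto-promote
print(failover_decision("us-east", Candidate("db-west-1", "us-west")).value)  # page-human
```

The rule deliberately trades recovery speed for safety in exactly one case: when the failure detector cannot tell a dead primary from an unreachable one.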
```sql
-- BEFORE: asynchronous replication (GitHub's setup during the incident)
-- The primary commits immediately; replicas apply the change later.
-- If the primary becomes unreachable after commit but before a replica receives the write:
--   → the committed data exists only on the old primary
--   → if a replica is promoted, that data is missing from the new topology
--
-- Async commit sequence (MySQL default):
--   Client:  INSERT INTO pull_requests (id, title) VALUES (42, 'Fix bug');
--   Primary: writes to the binary log, marks the transaction COMMITTED
--   Primary: returns success to the client
--   Replica: later reads the binlog and applies the INSERT
-- If the network severs between commit and replica apply:
--   → the commit is on the primary, NOT on the replica
--   → if the replica is promoted, PR #42 is gone from the new primary

-- AFTER: semi-synchronous replication
-- The primary waits for an ACK from at least one replica before
-- returning success to the client.
-- (MySQL 8.0.26+ renames these to rpl_semi_sync_source_* and rpl_semi_sync_replica_*.)

-- Enable semi-sync on the primary:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- Require at least 1 replica to acknowledge before commit:
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 1;
-- Timeout in milliseconds before falling back to async:
SET GLOBAL rpl_semi_sync_master_timeout = 10000; -- 10 seconds

-- Enable on each replica, then restart its I/O thread so the setting takes effect:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;

-- Now a client INSERT does not return success until a replica has received the write.
-- In a 43-second partition where no semi-sync replica can ACK:
--   → commits block until the 10-second timeout fires
--   → the primary then falls back to async for the remaining ~33 seconds
--   → writes committed in that window can still diverge from a promoted replica
-- Semi-sync only prevents split-brain here if an ACKing replica stays reachable
-- (for example, in the same region) or the timeout outlasts the partition,
-- and cross-region promotion is not automatic.
```

LESS AUTOMATION WAS THE RIGHT CHOICE FOR THIS TOPOLOGY
The most counterintuitive outcome of the incident analysis was that GitHub chose to reduce automated failover scope rather than improve it. The conventional wisdom in high-availability database design is that you should minimize human involvement in failover, because humans are slow and error-prone. GitHub's conclusion was the opposite for cross-region scenarios: the risk of an automated system electing a new primary across a WAN partition, and creating irreconcilable split-brain, was higher than the risk of 30–60 minutes of extra downtime while a human makes the failover decision. The key insight is that automated failover protects against downtime but creates risk of data loss. For GitHub's metadata, data consistency was worth more than automatic recovery speed.

Architecture
GitHub's MySQL topology at the time was a multi-site primary-replica arrangement with intermediate primaries for read scaling. Understanding the topology explains why the 43-second partition had such outsized consequences: the failover happened at the wrong layer of a deep replication chain, and the data that was missing had to be manually recovered from every cluster that inherited the gap.
THE TOPOLOGY THAT CREATED THE PROBLEM
The critical observation in this diagram: when the WAN link severs during the 43-second partition, writes are accumulating on the East Coast primary (red box) that have not reached the West Intermediate Primary (yellow box). Orchestrator's quorum elects the West Intermediate Primary as the new primary. But West Intermediate Primary is already behind on East's recent commits. Everything downstream of it, West Replica A and B, is also behind. When the East primary reconnects and tries to replicate from the West's newly elected primary, the histories are irreconcilably diverged.

What Changed in the Post-Incident Architecture
In the after diagram, the West Coast is no longer a cross-region failover candidate. Semi-synchronous replication on the East Coast ensures that no commit is acknowledged to the application until at least one East Coast replica has confirmed receipt, so every acknowledged write exists on at least two hosts for as long as semi-sync is active. That guarantee lapses only if the primary waits out the semi-sync timeout (10 seconds by default in MySQL) and falls back to asynchronous replication. Cross-region failover, if ever needed, now requires a human decision, giving operators the context to choose whether to fail over or wait for recovery.
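The interaction between partition length and the semi-sync timeout is easy to misjudge, so a small arithmetic sketch may help. It assumes a deliberately simplified model in which no semi-sync replica can acknowledge during the partition, commits block until the timeout, and the primary then runs asynchronously until the link returns. The function name is hypothetical.

```python
def unprotected_seconds(partition_s: float, semi_sync_timeout_s: float) -> float:
    """Seconds of a partition during which commits proceed without a replica ACK.

    Simplified model: commits block until the timeout elapses, then the
    primary falls back to async for the rest of the partition.
    """
    return max(0.0, float(partition_s) - float(semi_sync_timeout_s))


print(unprotected_seconds(43, 10))  # 33.0
print(unprotected_seconds(43, 60))  # 0.0
```

Under this model a 10-second timeout leaves 33 seconds of a 43-second partition unprotected, while a 60-second timeout blocks writes for the whole partition instead. This is why keeping an acknowledging replica in the same region as the primary matters: a WAN partition then never triggers the fallback at all.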
Lessons
The GitHub October 2018 incident is a case study in how the same tool — automated database failover — can be simultaneously the right engineering choice (for intra-region failures) and the wrong engineering choice (for cross-region WAN partitions). The generalization is not 'don't automate failover'; it is 'understand the failure modes your automation creates before the failure happens'.
What to remember
- Automated failover protects against downtime but creates risk of data divergence — test which risk you can tolerate before the failure, not during it. GitHub's Orchestrator correctly detected the East Coast primary as unreachable and promoted a replacement. The cost of that correct decision was a 24-hour data reconciliation effort. For GitHub's metadata, the downtime cost of waiting for human judgment would have been far less than the recovery cost of split-brain. Know the tradeoff for your specific topology and data consistency requirements.
- Replication lag is invisible until it causes data loss. In steady state, GitHub's West Coast replicas appeared healthy — they were applying writes, they were reachable, they were being monitored. But async replication always has lag. That lag only matters during a promotion event, and the amount of lag at the moment of promotion determines the severity of the data consistency problem. Semi-synchronous replication is the mechanism that makes lag bounded at commit time.
- Verify that your failover tool's topology elections match your application's assumptions about primary uniqueness. Orchestrator can create topologies that work correctly at the database layer but break application assumptions — for example, creating a topology with two writable primaries briefly visible from different application nodes. GitHub's post-incident work included aligning Orchestrator configuration with the topologies the application could safely support.
- Recovery bandwidth is an often-ignored variable in disaster recovery planning. GitHub's backup restoration took hours in part because large database backups stored in cloud object storage required significant time to transfer over available bandwidth. Recovery time objectives must account for transfer time, not just restore time. Testing backup restoration under realistic bandwidth constraints reveals this gap before an incident does.
- Disable write-dependent features rather than serving stale data silently. During the 24-hour recovery, GitHub showed users clearly that the site was in a degraded read-only state, rather than silently serving stale issue and PR data. Users who saw a clear 'read-only' indication could adjust their workflow; users who were served stale data silently could make decisions based on incorrect information. Explicit degradation is better than invisible inconsistency.
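The recovery-bandwidth point above lends itself to a back-of-the-envelope estimate. The sketch below uses decimal units (1 TB = 10^12 bytes) and ignores protocol overhead, decompression, and replay time. The backup size and throughput are made-up inputs for illustration, not figures from GitHub's postmortem.

```python
def transfer_hours(backup_terabytes: float, throughput_gbit_per_s: float) -> float:
    """Hours needed to move a backup of the given size at a sustained throughput."""
    bits = backup_terabytes * 1e12 * 8
    seconds = bits / (throughput_gbit_per_s * 1e9)
    return seconds / 3600


# Hypothetical example: a 5 TB backup over a sustained 2 Gbit/s link.
print(round(transfer_hours(5, 2), 2))  # 5.56
```

Running this with your own backup sizes and measured throughput gives a lower bound on restore time; the real figure is higher once decompression and replication catch-up are included.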
On the Postmortem's Honesty
GitHub's public postmortem, published October 30, 2018, is one of the most technically detailed and candid incident analyses released by a software company. Author Jason Warner described not just what happened but the sequence of decisions engineers made under time pressure and why each seemed reasonable at the time. The most striking detail in the postmortem: the West Coast database cluster had already been ingesting writes for nearly 40 minutes when engineers understood the full scope of the split-brain. By the time you fully understand a split-brain scenario, the database has already been doing its job, just in two different directions.