The Story
GitHub is the infrastructure underpinning much of the world's software development. Over 100 million developers use it to store code, run CI/CD pipelines via GitHub Actions, and collaborate on everything from weekend hobby projects to the world's most critical open-source software. When GitHub goes down, much of the software industry's ability to ship code grinds to a halt. This is not a theoretical concern; it is the reality GitHub's infrastructure team lives with every day. For years, the team had known about a single point of failure in their network architecture at the Internet edge. The fix was a second Internet edge facility, completed in January 2023.
The second edge facility had been routing live production traffic since January, operating alongside the primary in a high-availability (HA) architecture. Six months of real traffic without incident. The team's next step was logical and responsible: perform a live failover test, deliberately routing all traffic to the secondary as if the primary had failed, to verify that the redundancy actually worked. June 29, 2023. The test began. The secondary facility could not function as a primary. GitHub went down.
Unfortunately, during this failover test we inadvertently caused a production outage. The test exposed a network pathing configuration issue in the secondary side that prevented it from properly functioning as the primary facility.
The Hidden Configuration Flaw
The root cause was a network path configuration issue in the secondary Internet edge facility. The secondary had been designed to route traffic alongside the primary in a shared active-active setup, but its routing configuration was never validated for the scenario where it had to handle all traffic alone. This is the subtle trap of active-active HA: a facility can route 50% of traffic flawlessly for six months and still fail when asked to route 100%, because some of its internal network paths were only configured to work while the primary was present. The facility was a co-pilot that had never practiced landing the plane alone.
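The trap can be illustrated with a toy model. This is a minimal sketch, not GitHub's actual configuration: the facility names, the `PATHS` table, and `broken_paths_without` are all hypothetical. Each service lists the hops its traffic traverses, and a path that quietly passes through the primary works in shared mode but breaks once the primary is removed.

```python
# Hypothetical toy model: each path out of the secondary lists the hops it traverses.
# None of these names come from GitHub's report; they only illustrate the failure mode.
PATHS = {
    "web": ["secondary-edge", "internal-lb"],
    "git": ["secondary-edge", "primary-edge", "internal-lb"],  # silently depends on the primary
    "api": ["secondary-edge", "internal-lb"],
}


def broken_paths_without(paths: dict[str, list[str]], removed: str) -> list[str]:
    """Return the services whose path traverses the removed facility."""
    return sorted(name for name, hops in paths.items() if removed in hops)


# Shared-load operation: nothing is removed, so every path works.
print(broken_paths_without(PATHS, removed="none"))          # []
# Solo operation: the dependency on the primary is exposed.
print(broken_paths_without(PATHS, removed="primary-edge"))  # ['git']
```

Months of healthy shared-load traffic only ever exercise the first call. The second call is the question a failover actually asks.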
Problem
Failover Test Initiated
At 17:39 UTC on June 29, 2023, GitHub engineers begin a planned live validation of the secondary Internet edge facility by shifting all traffic away from the primary. Within seconds, parts of North America (especially the US East Coast) and South America begin experiencing connectivity failures to GitHub.
Cause
Secondary Cannot Function as Primary
The secondary facility has a network path configuration issue that was invisible while it shared load with the primary but becomes critical when it must handle all traffic alone. Failover cannot happen correctly because the secondary's own configuration is broken.
Solution
Revert in 2 Minutes
GitHub's monitoring fires immediately. Within two minutes of being alerted, engineers revert the failover change and bring the primary facility back online. The revert itself is fast, but once the primary is back online, border routers across the internet need time to reconverge, so GitHub service is not restored the instant the primary is running.
Result
Fixed, Then Tested Better
The network path configuration issue in the secondary is corrected. GitHub commits to improved failover testing procedures that minimize customer impact — specifically, scheduling future tests in a way that reduces blast radius. The test that caused the outage was ironically the most valuable test the team ever ran: it found the flaw that would have caused a much longer, unplanned outage during a real emergency.
The Reconvergence Penalty
Even after GitHub reverted the failover and the primary came back online, users could not immediately reach GitHub. The internet's border routers needed time to reconverge, undoing the routing changes that the failover had caused. This is the hidden cost of network-level failures: the fix is fast, the recovery is slow.
The June 7 Incident: A Different Kind of Queue Starvation
Three weeks before the failover outage, GitHub experienced a completely separate but equally instructive incident. On June 7 at 16:11 UTC, GitHub's internal job queue for processing Git pushes began experiencing increasing delays. The monitoring system alerted engineers after 19 minutes. Customers experienced GitHub Actions workflow delays of up to 55 minutes and pull requests that failed to reflect new commits. The root cause was a single customer pushing to a repository with a specific, unusual data shape, one that caused the Git backend to throttle the processing jobs and make them slow. These slow jobs exhausted the shared worker pool that served all other users. One customer's pathological repository data silently starved the global Git push queue for nearly two and a half hours.
Tenant Isolation in Shared Queues
The June 7 incident is a textbook case of the noisy neighbor problem in a shared job queue. GitHub's fix, making the Git backend's throttle behavior fail faster and reducing the Git client timeout, prevents any single customer's workload from holding a worker slot indefinitely. The principle applies anywhere a shared queue serves diverse workloads.
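One general way to apply that principle is a per-tenant cap on worker slots. The sketch below is an illustration of the technique, not GitHub's fix (which was fail-faster throttling and a shorter timeout). The class name `TenantSlotLimiter`, the tenant names, and the slot counts are assumptions chosen for the example.

```python
from collections import Counter


class TenantSlotLimiter:
    """Caps how many worker slots one tenant may hold at once (hypothetical sketch)."""

    def __init__(self, total_slots: int, max_per_tenant: int) -> None:
        self.total_slots = total_slots
        self.max_per_tenant = max_per_tenant
        self.in_use: Counter[str] = Counter()

    def try_acquire(self, tenant: str) -> bool:
        """Grant a slot only if both the pool and the tenant are under their limits."""
        if sum(self.in_use.values()) >= self.total_slots:
            return False
        if self.in_use[tenant] >= self.max_per_tenant:
            return False
        self.in_use[tenant] += 1
        return True

    def release(self, tenant: str) -> None:
        """Return a slot to the pool when the tenant's job finishes or fails."""
        if self.in_use[tenant] > 0:
            self.in_use[tenant] -= 1


limiter = TenantSlotLimiter(total_slots=4, max_per_tenant=2)
noisy = [limiter.try_acquire("noisy-repo") for _ in range(4)]
print(noisy)                              # [True, True, False, False]
print(limiter.try_acquire("other-repo"))  # True: slots remain for other tenants
```

With the cap in place, the noisy tenant can slow only its own work. Half the pool stays available no matter how its jobs behave.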
The Fix
Fixing the Failover Test Outage
The immediate fix for the June 29 outage was narrow and fast: engineers identified the network path configuration issue exposed by the failover test and corrected it in the secondary edge facility. The more important fix was procedural: changing how future failover tests are designed and scheduled. A live failover test that cuts off users across two continents is not a sustainable validation strategy. GitHub committed to scheduling tests in ways that minimize customer impact. That likely means phased traffic migration (moving a small percentage of traffic first) to surface configuration gaps before they cause outages, and off-peak timing to reduce the blast radius if something goes wrong. With its configuration corrected, the secondary facility can now function as a primary.
# Simplified model of a safer failover test strategy.
# Instead of "flip all traffic to secondary", use staged validation.
# EdgeFacility, FailoverResult, shadow_test, traffic_shift, error_rate, monitor,
# rollback, and alert_team are placeholders assumed to be defined elsewhere.
def run_failover_validation(primary: EdgeFacility, secondary: EdgeFacility):
    """
    Safe failover validation: verify the secondary can function as primary
    without causing a production outage.
    """
    # Step 1: Shadow test. Route 0% of real traffic and compare responses.
    # Checks routing and config WITHOUT touching user requests.
    shadow_result = shadow_test(secondary, sample_requests=SYNTHETIC_TRAFFIC)
    if not shadow_result.routes_correctly:
        # ✅ Caught here: no user impact
        alert_team("Secondary cannot route independently: config issue found")
        return FailoverResult.ABORTED

    # Step 2: Canary. Shift 1% of traffic to secondary and monitor error rates.
    with traffic_shift(secondary, percentage=1):
        if error_rate() > ACCEPTABLE_THRESHOLD:
            rollback()  # Instant revert, only 1% of users briefly affected
            return FailoverResult.ABORTED

    # Step 3: Gradual ramp: 10% → 25% → 50% → 100%.
    # At each stage, verify secondary handles the load correctly.
    for percentage in [10, 25, 50, 100]:
        with traffic_shift(secondary, percentage=percentage):
            # Monitor BGP convergence, latency, error rates
            health = monitor(duration_seconds=300)
            if not health.acceptable:
                rollback()  # Revert to last good state
                return FailoverResult.ABORTED

    # Step 4: Full failover validated. The secondary proved capable.
    return FailoverResult.SUCCESS

# The June 29 incident used the equivalent of jumping straight to step 4.
# A broken secondary had no chance to be caught before users felt it.

The BGP Reconvergence Reality
When GitHub's primary facility came back online after the revert, engineers could not simply flip a switch and restore service. Border routers across the internet needed time to reconverge — each network that had learned the (broken) route to GitHub's secondary needed to update its routing tables back to the primary. This is unavoidable, which is why the outage lasted 32 minutes even though the fix itself took under 2 minutes.
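The gap between fix time and recovery time can be worked out from the figures above. This is a small arithmetic sketch that assumes the 32 minutes are counted from the 17:39 UTC start of the test and treats the 2-minute revert as an upper bound. The end time it prints is derived, not quoted from GitHub.

```python
from datetime import datetime, timedelta, timezone

start = datetime(2023, 6, 29, 17, 39, tzinfo=timezone.utc)  # failover test begins
outage = timedelta(minutes=32)                               # total outage duration
revert = timedelta(minutes=2)                                # upper bound on alert-to-revert time

end = start + outage
# Everything that was not the revert itself: detection plus reconvergence.
remainder = outage - revert

print(end.strftime("%H:%M UTC"))                 # 18:11 UTC
print(int(remainder.total_seconds() // 60))      # 30
```

At least 30 of the 32 minutes were spent on something other than applying the fix, which is why recovery time, not fix time, should drive incident planning.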
The June 7 Git push queue fix was more technically nuanced. The Git backend's throttling behavior was changed to fail faster — instead of a throttled job slowly consuming a worker slot while retrying indefinitely, it now returns a failure quickly so the slot is released for another repository's work. The Git client timeout within the job was also reduced, preventing a hung upstream connection from holding a worker open. These two changes together mean a pathological repository data shape can no longer starve the shared worker pool. Additional improvements were added to reduce detection and diagnosis time for future incidents of this type.
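The effect of a shorter timeout on a shared pool can be sketched with a tiny FIFO simulation. The worker count, job durations, and timeout values below are invented for illustration, and `finish_time` is a hypothetical helper, not part of GitHub's system. A job that exceeds the timeout is treated as failed and its slot is released.

```python
import heapq


def finish_time(job_durations: list[float], workers: int, timeout: float) -> float:
    """Time at which the last job leaves the pool, with each job capped at `timeout` seconds."""
    free_at = [0.0] * workers  # when each worker next becomes free
    heapq.heapify(free_at)
    last = 0.0
    for duration in job_durations:  # FIFO queue: jobs start in arrival order
        start = heapq.heappop(free_at)
        # A job that runs past the timeout is failed and its slot is released.
        end = start + min(duration, timeout)
        heapq.heappush(free_at, end)
        last = max(last, end)
    return last


# Two pathological 600-second jobs arrive ahead of four normal 1-second jobs.
jobs = [600, 600, 1, 1, 1, 1]
print(finish_time(jobs, workers=2, timeout=600))  # 602.0
print(finish_time(jobs, workers=2, timeout=5))    # 7.0
```

In this toy queue the normal jobs wait ten minutes behind the slow ones under the long timeout and a few seconds under the short one. The pathological jobs still fail, but they fail without taking everyone else with them.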
The Outage That Validated the Investment
GitHub's engineering team noted a pointed irony: the test that caused the outage was exactly the right test to run. Without it, the hidden configuration flaw would have remained undetected until a real infrastructure failure — at which point the outage would have been unplanned, potentially longer, and without the fast human revert that limited the June 29 impact to 32 minutes. A self-inflicted outage you can control is always better than a real one you cannot.
TWO INCIDENTS, ONE JUNE
June 2023 gave GitHub two distinct outage patterns in a single month. The June 7 incident (2h 28m) was caused by shared-resource exhaustion: one customer's data starving a global queue. The June 29 incident (32 min) was caused by untested redundancy: infrastructure built for resilience that had never been validated as a solo primary. Both share a root: assumptions that were never tested in production conditions.

The Hidden Cost of Active-Active HA
The secondary facility had routed live production traffic for six months without incident — because it was always operating alongside the primary, not instead of it. Active-active HA gives you a false signal of readiness. A facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load, not inferred from shared-load health.
The most important long-term fix was cultural: GitHub's team committed to making failover testing a regular practice, not a one-time event. Regular failover tests — scheduled with appropriate notice, designed to minimize blast radius, and run at off-peak times — are the only way to keep redundancy validated over time. Infrastructure drifts: routers get reconfigured, network policies change, and a facility that was a fully functional backup six months ago may not be today. Untested redundancy is not redundancy. It is the comforting fiction that your system is more resilient than it actually is.
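Drift can be made visible by recording what the backup looked like when it last passed a failover test. The following is a minimal sketch under stated assumptions: the `ValidationRecord` type, the 90-day freshness window, and the idea of hashing a configuration text are illustrative choices, not anything GitHub has described.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


def fingerprint(config_text: str) -> str:
    """Stable hash of a facility's configuration text."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


@dataclass
class ValidationRecord:
    """What was true the last time a failover test passed (hypothetical record)."""
    config_hash: str
    tested_on: date


def redundancy_is_validated(record: ValidationRecord, current_config: str,
                            today: date, max_age: timedelta = timedelta(days=90)) -> bool:
    """Trust the backup only if its config is unchanged and the test is recent."""
    unchanged = record.config_hash == fingerprint(current_config)
    recent = today - record.tested_on <= max_age
    return unchanged and recent


record = ValidationRecord(fingerprint("route-policy v1"), date(2023, 7, 1))
print(redundancy_is_validated(record, "route-policy v1", date(2023, 8, 1)))  # True
print(redundancy_is_validated(record, "route-policy v2", date(2023, 8, 1)))  # False: config drifted
print(redundancy_is_validated(record, "route-policy v1", date(2024, 1, 1)))  # False: test is stale
```

A check like this does not replace the failover test. It only flags when the last test no longer describes the system that is running.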
Architecture
GitHub's Internet edge architecture is the layer that connects the global internet to GitHub's internal infrastructure. Every request from every developer in the world, whether pushing code, pulling a repository, or triggering a GitHub Action, flows through an Internet edge facility. For years, this was a single point of failure: one facility, one set of border routers, and one path in from the internet. The second facility, completed in January 2023, was designed to eliminate this vulnerability. What the architecture diagrams did not capture was the specific network path configuration that would only become a problem when the secondary had to stand alone.
Before: Single Point of Failure at the Internet Edge
After: HA Architecture with Secondary Edge — and the Hidden Flaw
The architecture diagram shows the deceptive appearance of redundancy. Two edge facilities, both actively routing traffic, both connected to the same internal load balancer: it looks bulletproof. But the diagram does not show the network path configuration inside the secondary: the specific route advertisements that tell the global internet how to reach GitHub, and the internal routing rules that control traffic flow within the facility. When the secondary was asked to function as the primary during the failover test, those configurations were incorrect for the solo-primary role. The redundancy was a drawing on paper, not a tested fact.
Border Router Reconvergence: The Delay Nobody Talks About
When GitHub's primary facility came back online, the recovery was not instant. Every network on the internet that had updated its routing tables to route via the broken secondary had to learn the path back to the primary. This propagation delay is inherent to how the internet works and is unavoidable once a failover has been initiated. It is one more reason to avoid unnecessary failover events: even a 2-minute fix can result in 30 minutes of degraded service.
June 2023 GitHub incidents — two outages, two root causes, one shared theme
| Incident | Date | Duration | Root Cause | Fix |
|---|---|---|---|---|
| Git Push Queue Starvation | June 7 | 2h 28m | Single customer's pathological data shape throttled jobs, exhausting the shared worker pool | Fail-faster throttling, reduced Git client timeout |
| Failover Test Outage | June 29 | 32 min | Secondary edge facility had hidden network path config flaw that only manifested when operating solo | Fixed secondary config; improved failover test procedures |
| Common thread | Both | — | Assumptions about system behavior that were never validated under the actual failure conditions | Testing at the real failure boundary, not the assumed one |
Lessons
June 2023 gave GitHub — and the industry — two clean case studies in the same month. Neither outage was caused by a novel bug or an obscure race condition. Both were caused by things that look like good engineering on paper but hadn't been tested at the right failure boundary. These lessons apply to any team operating infrastructure with redundancy assumptions they have never validated.
What to remember
- Untested redundancy is not redundancy; it is a liability. GitHub's secondary edge facility routed 50% of live production traffic for six months without revealing the flaw that prevented it from functioning as a primary. Active-active operation does not validate failover capability; it validates shared-load operation. Test your redundancy by actually removing the primary, not by observing the secondary under normal conditions.
- Failover tests should be staged, not binary. Shifting 100% of traffic to an untested secondary in a single step is a high-stakes gamble: if the secondary is broken, every user feels it before anyone can abort. Canary failovers (shifting 1%, then 10%, then 25%, validating at each stage before proceeding) expose configuration issues before they cause full outages. The extra complexity of staged testing is small compared to the cost of a production outage discovered mid-test.
- Reverting fast does not mean recovering fast. GitHub reverted the failover change in under 2 minutes, but the outage lasted 32 minutes because border router reconvergence takes time that no amount of engineering on GitHub's side can compress. Build your incident response timelines around recovery time, not just fix time.
- Shared queues need tenant isolation to prevent noisy neighbor failures. The June 7 incident is a canonical example of one tenant's unusual workload consuming all of a shared resource. Design queue systems with per-tenant rate limits and fast-fail timeouts so that a single job never holds a worker slot long enough to starve the entire pool. The fix, making the Git backend's throttling fail faster and shortening the Git client timeout, is a small change that protects millions of users from one user's edge case.
- The test that breaks production is the most valuable test you ever run. GitHub's team made a pointed admission: without the June 29 failover test, the network path flaw would have remained hidden until a real infrastructure emergency forced a failover under far worse conditions — with no time to prepare, no clean revert path, and no certainty about what was broken. Deliberately probing your redundancy in a controlled environment, even at the cost of a brief outage, is the engineering equivalent of a fire drill: painful in the moment, essential in the long run.
THE IRONY THEOREM
The infrastructure designed to prevent an outage caused the outage. The test designed to validate resilience proved the resilience didn't exist. And the 32-minute disruption, painful as it was, turned out to be far better than the real-emergency outage it prevented. Sometimes the most constructive thing you can do for your reliability is schedule an outage before the universe schedules one for you.

Document Your Assumptions Before You Test Them
A pre-test checklist should include: what does the secondary need to do independently? Not just what load it handles, but what configuration it needs — BGP route advertisements, internal routing policies, health check endpoints, TLS certificates. Every assumption about how the secondary behaves when the primary is absent should be written down and verified before the test runs, not discovered by watching production users experience an outage.
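Such a checklist can be kept as data and evaluated before the test is allowed to start. This is a minimal sketch: the `Assumption` type and `unverified` helper are hypothetical, and the `lambda` checks are stand-ins for real probes that a team would have to write against its own infrastructure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Assumption:
    """One written-down belief about the secondary, plus a way to check it."""
    description: str
    check: Callable[[], bool]


def unverified(assumptions: list[Assumption]) -> list[str]:
    """Return descriptions of assumptions whose check fails; an empty list means go."""
    return [a.description for a in assumptions if not a.check()]


# Placeholder checks: each lambda stands in for a real probe.
checklist = [
    Assumption("BGP routes are advertised from the secondary alone", lambda: True),
    Assumption("Internal routing works with the primary absent", lambda: False),
    Assumption("Health check endpoints respond on the secondary", lambda: True),
    Assumption("TLS certificates are installed on the secondary", lambda: True),
]

failures = unverified(checklist)
print("GO" if not failures else f"NO-GO: {failures}")
# NO-GO: ['Internal routing works with the primary absent']
```

The value is less in the code than in the forcing function: an assumption that cannot be expressed as a check is one nobody has verified.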