The Story
Between May 2025 and April 2026, GitHub logged 257 incidents, 48 of them classified as major outages. That is roughly one major outage per week. February 2026 was the worst month in the platform's history, with 37 incidents. GitHub's own CTO publicly acknowledged that the platform had failed its three-nines reliability commitment to enterprise customers. Developer Mitchell Hashimoto announced he was moving the 52,000-star Ghostty project off GitHub because outages were blocking his work for hours each day. GitHub's COO responded to him directly on X.
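To put the three-nines figure in concrete terms, the sketch below converts an availability target into a downtime budget. It is generic arithmetic only: the article does not say how GitHub measures availability or over what window, so the one-year period and the helper name `downtime_budget_hours` are assumptions for illustration.

```python
# Downtime budget implied by an availability target over a given period.
# Assumption: a 365-day year; the real measurement window is not stated in the article.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_budget_hours(availability: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the maximum downtime (in hours) that still meets the target."""
    return (1.0 - availability) * period_hours


three_nines = downtime_budget_hours(0.999)
print(f"{three_nines:.2f} hours per year")            # 8.76 hours per year
print(f"{three_nines * 60 / 12:.1f} minutes per month")  # 43.8 minutes per month
```

Under these assumptions, a 99.9% target leaves under nine hours of total downtime per year, which is little room when major outages arrive about once a week.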
The Actual Cause
GitHub's infrastructure was built in 2008 on Ruby on Rails and still runs a nearly two-million-line monolith at its core. AI agentic workflows (automated agents that commit, review, and merge code around the clock) generated a type and volume of events the 2008 architecture was never designed to handle.

In October 2025, GitHub announced a plan to grow platform capacity by 10x. By February 2026, CTO Vlad Fedorov had revised that target to 30x, because agentic development workflows had accelerated beyond any model the infrastructure team had used for forecasting. The projected volume of AI-generated GitHub events for full-year 2026 was 14 billion, a 14x increase year-over-year.

Three structural failures were present across every major incident: tight coupling between services allowed localized failures to cascade across the platform, the system could not shed load from misbehaving clients, and operators had too few manual control mechanisms during active incidents.
The Fix
GitHub's response is not an expansion but a ground-up redesign. Fedorov drew the distinction clearly: scaling up means adding machines and memory to the existing architecture, while redesigning means accepting that the current architectural assumptions fail at 30x, which requires a fundamental rethink of service decomposition, data flow, and fault isolation. GitHub is decoupling critical services, adding backpressure mechanisms so the system can protect itself from misbehaving clients, and building explicit load-shedding controls for operators. Migrating a nearly two-million-line monolith while keeping the platform live is one of the most complex engineering projects underway in production software today.
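As a rough illustration of what a backpressure mechanism does, the sketch below uses a bounded buffer that refuses new work once it is full, so the producer has to slow down or retry later. This is a minimal, generic example and not GitHub's implementation; `BoundedIntake`, `submit`, and `drain` are hypothetical names, and the capacity of 3 is arbitrary.

```python
import queue
from typing import List


class BoundedIntake:
    """Accept events only while a fixed-size buffer has room (hypothetical name)."""

    def __init__(self, capacity: int) -> None:
        self._buffer: "queue.Queue[str]" = queue.Queue(maxsize=capacity)
        self.rejected = 0  # count of events refused because the buffer was full

    def submit(self, event: str) -> bool:
        """Return True if accepted; False tells the caller to back off."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, max_items: int) -> List[str]:
        """Simulate a worker consuming up to max_items buffered events."""
        items: List[str] = []
        while len(items) < max_items:
            try:
                items.append(self._buffer.get_nowait())
            except queue.Empty:
                break
        return items


intake = BoundedIntake(capacity=3)
accepted = [intake.submit(f"push-{i}") for i in range(5)]
print(accepted)          # [True, True, True, False, False]
print(intake.drain(2))   # ['push-0', 'push-1']
print(intake.submit("push-5"))  # True, because draining freed capacity
```

The design choice worth noting is that rejection is explicit and cheap: the system does a bounded amount of work per refused request instead of queuing unbounded work it cannot finish.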
Solution
Ground-up architecture redesign underway: service decoupling, backpressure mechanisms, operator load-shedding controls, shard-based isolation. Target: 30x current scale.
Three structural causes Fedorov identified
| Problem | Symptom | Redesign approach |
|---|---|---|
| Tight service coupling | Local failures cascade platform-wide | Service decomposition with isolation boundaries |
| No load shedding | Misbehaving clients degrade platform for all users | Backpressure + client throttling |
| Insufficient operator controls | Binary choice: accept all load or collapse | Granular manual traffic controls per service |
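The third row of the table, replacing a binary accept-or-collapse choice with granular controls, can be sketched as a per-service shedding level that an operator adjusts during an incident. Everything here is an assumption made for illustration: the priority tiers, the service names, and the `SheddingControl` class are hypothetical and do not describe GitHub's actual tooling.

```python
from enum import IntEnum
from typing import Dict


class Priority(IntEnum):
    """Hypothetical traffic tiers; lower value means more important."""
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


class SheddingControl:
    """Per-service switch an operator can set during an incident (hypothetical)."""

    def __init__(self) -> None:
        # service name -> least important tier that is still admitted
        self._max_admitted: Dict[str, Priority] = {}

    def set_level(self, service: str, max_admitted: Priority) -> None:
        self._max_admitted[service] = max_admitted

    def clear(self, service: str) -> None:
        self._max_admitted.pop(service, None)

    def admits(self, service: str, priority: Priority) -> bool:
        # Services with no override admit every tier.
        limit = self._max_admitted.get(service, Priority.BACKGROUND)
        return priority <= limit


control = SheddingControl()
control.set_level("webhooks", Priority.CRITICAL)  # shed everything but critical traffic

print(control.admits("webhooks", Priority.BACKGROUND))  # False
print(control.admits("webhooks", Priority.CRITICAL))    # True
print(control.admits("git", Priority.BACKGROUND))       # True, no override set
```

Because the level is set per service, an operator can degrade one struggling subsystem without touching traffic to the rest of the platform.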
Architecture
The coupling problem: one failure propagates to everything
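One widely used way to stop a failing dependency from dragging down its callers is a circuit breaker: after repeated failures, callers fail fast instead of waiting on a service that is already down. The article does not say which isolation patterns GitHub is adopting, so the sketch below is a generic example under that assumption; `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int,
        reset_after_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("dependency unavailable; failing fast")
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self._threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the failure count
        return result
```

In use, each call to a downstream service would go through its own breaker instance, for example `breaker.call(fetch_status, repo_id)` where `fetch_status` is whatever function talks to that service. The caller then handles `CircuitOpenError` with a degraded response rather than blocking.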
Lessons
What to remember
- AI agent traffic is not human traffic. It has no human-shaped timing distribution, no stop times, and no patience for degraded service. Design infrastructure for it specifically — not as a variant of human traffic.
- Tight service coupling makes individual features faster to ship and makes platform-level failures harder to contain. At 30x scale, isolation matters more than development velocity.
- If your platform cannot shed load from a single misbehaving client, one poorly written agentic CI loop can degrade service for millions of other users simultaneously.
- Reforecast your load model when the workload type changes, not when the capacity runs out. The shift from human commits to agent commits happened in months, not years.
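The third lesson, shedding load from a single misbehaving client, is commonly approached with a per-client token bucket: each client gets its own budget, so one runaway loop exhausts only its own allowance. The sketch below is a minimal in-memory version under stated assumptions; the class names, the rate of 1 request per second, and the burst of 3 are all hypothetical, and a production limiter would need shared state across servers.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at rate_per_s tokens per second, holding at most burst tokens."""

    def __init__(self, rate_per_s: float, burst: int, now: float) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.updated = now

    def try_take(self, now: float) -> bool:
        # Add tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.updated
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one independent bucket per client ID (hypothetical name)."""

    def __init__(
        self,
        rate_per_s: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = self._buckets[client_id] = TokenBucket(self._rate, self._burst, now)
        return bucket.try_take(now)


fake_now = [0.0]  # fixed clock so the demo is deterministic
limiter = PerClientLimiter(rate_per_s=1.0, burst=3, clock=lambda: fake_now[0])

agent = [limiter.allow("ci-agent") for _ in range(5)]
print(agent)                      # [True, True, True, False, False]
print(limiter.allow("human-dev"))  # True, unaffected by the agent's burst
```

The key property is isolation of budgets: the throttled agent's rejected requests never consume capacity that belongs to other clients.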