The Story

Between May 2025 and April 2026, GitHub logged 257 incidents — 48 of which were classified as major outages. That is roughly one significant disruption per week. February 2026 was the worst month in the platform's history: 37 incidents. GitHub's own CTO publicly acknowledged that the platform had failed its three-nines reliability commitment to enterprise customers. Developer Mitchell Hashimoto announced he was moving the 52,000-star Ghostty project off GitHub because it was blocking his work for hours daily. GitHub's COO responded to him directly and personally on X.

The actual cause

GitHub's infrastructure was built in 2008 on Ruby on Rails and still runs a near-two-million-line monolith at its core. AI agentic workflows — automated agents that commit, review, and merge code around the clock — generated a type and volume of events the 2008 architecture was never designed to handle.

In October 2025, GitHub announced a plan to grow platform capacity by 10x. By February 2026, CTO Vlad Fedorov revised that target to 30x — because agentic development workflows had accelerated beyond any model the infrastructure team had used for forecasting. The projected volume of AI-generated GitHub events for full-year 2026 was 14 billion — a 14x increase year-over-year. Three structural failures were present across every major incident: tight coupling between services allowed localized failures to cascade across the platform, the system could not shed load from misbehaving clients, and there were insufficient manual control mechanisms for operators during active incidents.
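To put the projected volume in perspective, the figures above can be converted into average event rates. This is only back-of-the-envelope arithmetic on the two numbers quoted in the text (14 billion events, 14x growth). It says nothing about peak load, which would be higher than any yearly average.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

projected_events_2026 = 14_000_000_000  # figure quoted in the text
growth_factor = 14                      # year-over-year multiple quoted in the text

# Prior-year volume implied by the two quoted figures.
implied_events_2025 = projected_events_2026 / growth_factor

# Average rates only; real traffic is bursty, so peaks sit well above these.
avg_rate_2026 = projected_events_2026 / SECONDS_PER_YEAR
avg_rate_2025 = implied_events_2025 / SECONDS_PER_YEAR

print(f"implied 2025 volume: {implied_events_2025:,.0f} events")
print(f"average 2026 rate:   {avg_rate_2026:,.0f} events/s")
print(f"average 2025 rate:   {avg_rate_2025:,.0f} events/s")
```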

257: Incidents tracked, May 2025–April 2026
48: Major outages in 12 months
14B: AI events projected for full-year 2026 (14x growth)
30x: Scale redesign target set February 2026

The Fix

GitHub's response is not an expansion but a ground-up redesign. Fedorov drew the distinction this way: scaling up means adding machines and memory to the existing architecture, while redesigning starts from the recognition that the current architectural assumptions fail at 30x and that service decomposition, data flow, and fault isolation must be rethought. GitHub is decoupling critical services, adding backpressure mechanisms so the system can protect itself from misbehaving clients, and building explicit load-shedding controls for operators. Migrating a two-million-line monolith while keeping the platform live is among the most complex engineering projects underway in production software today.
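As an illustration of what backpressure means in practice, here is a minimal sketch of a bounded intake queue that rejects new work once it is full instead of buffering without limit. The class name `BoundedIntake` and its methods are hypothetical and chosen for this example; nothing here describes GitHub's actual implementation.

```python
import queue


class BoundedIntake:
    """Accept work until a fixed-size queue is full, then reject.

    Rejecting early pushes the pressure back onto the caller, which can
    retry later, rather than letting a backlog grow inside the service.
    """

    def __init__(self, max_pending: int):
        self._pending = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, event) -> bool:
        """Producer side: return True if accepted, False if shed."""
        try:
            self._pending.put_nowait(event)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Worker side: remove one event and free a slot (None if empty)."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None
```

With `max_pending=3`, five back-to-back submissions yield three acceptances and two rejections; once a worker calls `take()`, a slot frees up and the next submission is accepted again.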

Solution

Ground-up architecture redesign underway: service decoupling, backpressure mechanisms, operator load-shedding controls, shard-based isolation. Target: 30x current scale.
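One way to picture an operator load-shedding control is a per-service switch that sets the lowest priority class still admitted. The sketch below is an assumption-laden illustration: the priority classes, the service names, and the `ShedControl` API are all hypothetical, not something the source attributes to GitHub.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical traffic classes; lower value means more important.
    CRITICAL = 0
    NORMAL = 1
    BACKGROUND = 2


class ShedControl:
    """Manual, per-service admission levels set by an operator."""

    def __init__(self):
        # service name -> least important priority still admitted
        self._limits = {}

    def set_level(self, service: str, lowest_admitted: Priority) -> None:
        self._limits[service] = lowest_admitted

    def clear(self, service: str) -> None:
        self._limits.pop(service, None)

    def admits(self, service: str, priority: Priority) -> bool:
        # With no operator override, everything is admitted.
        limit = self._limits.get(service, Priority.BACKGROUND)
        return priority <= limit
```

The point of the design is that an operator gets intermediate positions between "accept everything" and "fall over": shedding background work on one service leaves other services, and more important traffic on the same service, untouched.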

Three structural causes Fedorov identified

| Problem | Symptom | Redesign approach |
| --- | --- | --- |
| Tight service coupling | Local failures cascade platform-wide | Service decomposition with isolation boundaries |
| No load shedding | Misbehaving clients degrade platform for all users | Backpressure + client throttling |
| Insufficient operator controls | Binary choice: accept all load or collapse | Granular manual traffic controls per service |
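Client throttling is commonly built on a token bucket kept per client, so that one noisy caller exhausts only its own budget. The following is a minimal sketch of that general technique, not GitHub's design; the class names, the rate and capacity values, and the explicit `now` argument (used so the example is deterministic) are assumptions.

```python
class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so short bursts are allowed
        self.updated_at = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientThrottle:
    """One independent bucket per client id."""

    def __init__(self, rate: float, capacity: float):
        self._rate = rate
        self._capacity = capacity
        self._buckets = {}

    def allow(self, client_id: str, now: float) -> bool:
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, now)
            self._buckets[client_id] = bucket
        return bucket.allow(now)
```

With a rate of 1 request per second and a burst capacity of 2, a client that fires five requests in the same instant gets two through and three rejected, while a second client arriving at that same instant is unaffected.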

Architecture

The coupling problem: one failure propagates to everything

In a tightly coupled monolith, a failure in any component reaches every other component that shares the same data store or service boundary.
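A standard way to stop a failing dependency from dragging its callers down is a circuit breaker: after repeated failures the caller stops invoking the dependency for a cool-down period and serves a fallback instead. The sketch below shows the general pattern only. The class, its thresholds, and the explicit `now` parameter are illustrative assumptions, and the source does not say which isolation mechanisms GitHub uses.

```python
class CircuitBreaker:
    """Fail fast once a dependency has failed `failure_threshold` times in a row."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # cool-down in seconds
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, fn, now: float, fallback=None):
        if self._opened_at is not None and now - self._opened_at < self.reset_after:
            # Open: skip the dependency entirely so its failure stays local.
            return fallback
        # Closed, or cool-down elapsed: make a (trial) call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = now    # open, or re-open after a failed trial
            return fallback
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the failing function is never invoked, so the caller spends no threads or connections waiting on it; that is the isolation boundary in miniature.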

Lessons

What to remember

  1. AI agent traffic is not human traffic. It has no human-shaped timing distribution, no stop times, and no patience for degraded service. Design infrastructure for it specifically — not as a variant of human traffic.
  2. Tight service coupling makes individual features faster to ship and makes platform-level failures harder to contain. At 30x scale, isolation matters more than development velocity.
  3. If your platform cannot shed load from a single misbehaving client, one poorly written agentic CI loop can degrade service for millions of other users simultaneously.
  4. Reforecast your load model when the workload type changes, not when the capacity runs out. The shift from human commits to agent commits happened in months, not years.

"We started this year planning for 10x. By February, we knew that wasn't the world we were living in."
GitHub CTO Vlad Fedorov, March 2026