GitHub 2026: AI Agents Force a 30x Architecture Redesign

The Story

Between May 2025 and April 2026, GitHub logged 257 incidents — 48 of which were classified as major outages. That is roughly one significant disruption per week. February 2026 was the worst month in the platform's history: 37 incidents. GitHub's own CTO publicly acknowledged that the platform had failed its three-nines reliability commitment to enterprise customers. Developer Mitchell Hashimoto announced he was moving the 52,000-star Ghostty project off GitHub because it was blocking his work for hours daily. GitHub's COO responded to him directly and personally on X.

The actual cause

GitHub's infrastructure was built in 2008 on Ruby on Rails and still runs a near-two-million-line monolith at its core. AI agentic workflows — automated agents that commit, review, and merge code around the clock — generated a type and volume of events the 2008 architecture was never designed to handle.

In October 2025, GitHub announced a plan to grow platform capacity by 10x. By February 2026, CTO Vlad Fedorov revised that target to 30x — because agentic development workflows had accelerated beyond any model the infrastructure team had used for forecasting. The projected volume of AI-generated GitHub events for full-year 2026 was 14 billion — a 14x increase year-over-year. Three structural failures were present across every major incident: tight coupling between services allowed localized failures to cascade across the platform, the system could not shed load from misbehaving clients, and there were insufficient manual control mechanisms for operators during active incidents.

257

Incidents tracked May 2025–April 2026

Major outages in 12 months

14B

AI events projected for full-year 2026 (14x growth)

30x

Scale redesign target set February 2026

The Fix

GitHub's response is not an expansion — it is a ground-up redesign. Fedorov described the distinction clearly: scaling up means adding machines and memory to the existing architecture. Redesigning means the current architectural assumptions fail at 30x, requiring a fundamental rethink of service decomposition, data flow, and fault isolation. GitHub is decoupling critical services, adding backpressure mechanisms so the system can protect itself from misbehaving clients, and building explicit load-shedding controls for operators. The migration of a two-million-line monolith while keeping the platform live is one of the most complex engineering projects underway in production software today.

Solution

Ground-up architecture redesign underway: service decoupling, backpressure mechanisms, operator load-shedding controls, shard-based isolation. Target: 30x current scale.

Three structural causes Fedorov identified

Three structural causes Fedorov identified
Problem	Symptom	Redesign approach
Tight service coupling	Local failures cascade platform-wide	Service decomposition with isolation boundaries
No load shedding	Misbehaving clients degrade platform for all users	Backpressure + client throttling
Insufficient operator controls	Binary choice: accept all load or collapse	Granular manual traffic controls per service

Architecture

The coupling problem: one failure propagates to everything

Diagram preview unavailable.

In a tightly coupled monolith, a failure in any component reaches every other component that shares the same data store or service boundary.

Lessons

What to remember

AI agent traffic is not human traffic. It has no human-shaped timing distribution, no stop times, and no patience for degraded service. Design infrastructure for it specifically — not as a variant of human traffic.
Tight service coupling makes individual features faster to ship and makes platform-level failures harder to contain. At 30x scale, isolation matters more than development velocity.
If your platform cannot shed load from a single misbehaving client, one poorly written agentic CI loop can degrade service for millions of other users simultaneously.
Reforecast your load model when the workload type changes, not when the capacity runs out. The shift from human commits to agent commits happened in months, not years.

We started this year planning for 10x. By February, we knew that wasn't the world we were living in.GitHub CTO Vlad Fedorov, March 2026

The Story

The Fix

Architecture

Lessons

Related Stories

How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query

GitHub Built the Internet's Code Platform — Then AI Agents Broke It

GitHub's Settings Cache Went Stale and Took Authentication, Actions, and Copilot Down With It