GitHub AI Agents Outage 2025–2026: How Agentic Workflows Bro

The Story

We started executing our plan to increase GitHub's capacity by 10X in October 2025, with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today's scale. The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply.

— Vlad Fedorov, CTO of GitHub — GitHub Engineering Blog, April 28, 2026

For most of its existence, GitHub has been one of the most reliable platforms on the internet. Developers took it for granted the way they take electricity for granted — always on, always there, a utility so dependable it disappeared into the background. That changed in 2025. Not because GitHub's engineers got worse. Not because the codebase got sloppier. But because something fundamental changed about who — or more precisely, what — was using GitHub. AI coding agents arrived at scale, and they didn't behave anything like the human developers the platform was built for.

The numbers tell the story with uncomfortable clarity. In 2024, GitHub logged 119 service incidents, including 26 major ones — frustrating, but manageable, with an average resolution time of roughly 106 minutes. Then, between May 2025 and April 2026, incident monitoring service IncidentHub tracked 257 separate incidents, of which 48 were classified as major outages. February 2026 alone produced 37 incidents — the worst month on record. GitHub Actions, the platform's CI/CD automation backbone, suffered 57 outages in the same 12-month stretch. On May 15, 2026, a single Actions degradation caused 42% of all Actions runs to fail at peak impact. Developers worldwide woke up to red CI pipelines and spent hours debugging their own code — before eventually realizing it wasn't their code at all.

THE CORE PROBLEM: AGENTS DON'T BEHAVE LIKE HUMANS

A human developer on a free GitHub account might generate a few commits and a handful of CI runs in a working day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, and thousands of Actions minutes in a single afternoon. GitHub's 2025 Octoverse report celebrated nearly 1 billion commits. By early 2026, GitHub COO Kyle Daigle shared a more alarming figure: the platform was handling 275 million commits every single week — on pace for 14 billion in 2026 if growth held linear. That's a 14x annual increase. It wasn't 14x more developers. It was agents treating GitHub's API like a utility and consuming at machine speed.
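The arithmetic behind that projection is easy to check. The sketch below holds the quoted weekly rate constant for 52 weeks and compares the result with the 2025 total; rounding "nearly 1 billion" to exactly 1 billion is an assumption made here for simplicity, and the function name is purely illustrative.

```python
# Figures as quoted in the text; the 2025 total is "nearly 1 billion",
# rounded here to exactly 1 billion for simplicity (an assumption).
WEEKLY_COMMITS_EARLY_2026 = 275_000_000
COMMITS_2025 = 1_000_000_000
WEEKS_PER_YEAR = 52


def annualized(weekly_rate: int, weeks: int = WEEKS_PER_YEAR) -> int:
    """Project a yearly total by holding the weekly rate constant."""
    return weekly_rate * weeks


projected_2026 = annualized(WEEKLY_COMMITS_EARLY_2026)
growth_multiple = projected_2026 / COMMITS_2025

print(f"Projected 2026 commits: {projected_2026 / 1e9:.1f} billion")
print(f"Multiple of 2025 volume: {growth_multiple:.1f}x")
```

A flat 275 million commits per week already lands at about 14.3 billion for the year, so the 14x figure does not depend on any further acceleration.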

Problem

GitHub Was Built for Human-Paced Development

GitHub's architecture was designed for a world where developers work at human speed: open a PR, push commits over hours or days, wait for CI to run, merge when green. The platform's capacity planning, its database schemas, its job queues, its rate limits — all calibrated for a workflow where one human generates a bounded amount of activity per session. That assumption held for 17 years.

Cause

AI Agents Changed the Economics of Every GitHub Operation

GitHub CTO Vlad Fedorov identified the mechanism: a single pull request can simultaneously touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. A human merging one PR triggers this chain once. An AI agent framework running hundreds of concurrent sessions triggers it thousands of times simultaneously. AI-agent PRs alone jumped from 4 million in September 2025 to 17 million in March 2026 — a 325% increase in six months. Actions usage went from 500 million minutes per week in 2023 to 2.1 billion minutes in a single week in early 2026.
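A percentage increase and a growth multiple are easy to confuse, so it helps to compute both from the figures quoted above. This is a minimal sketch; the helper names are illustrative and the inputs are taken directly from the text.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative growth expressed as a percentage of the starting value."""
    return (after - before) / before * 100


def times_larger(before: float, after: float) -> float:
    """How many times larger the new value is than the old one."""
    return after / before


# AI-agent PR counts: September 2025 vs March 2026 (figures from the text).
agent_pr_increase = percent_increase(4_000_000, 17_000_000)
agent_pr_multiple = times_larger(4_000_000, 17_000_000)

# Actions minutes per week: 2023 vs a peak week in early 2026.
actions_multiple = times_larger(500_000_000, 2_100_000_000)

print(f"Agent PRs: +{agent_pr_increase:.0f}% ({agent_pr_multiple:.2f}x)")
print(f"Actions minutes per week: {actions_multiple:.1f}x")
```

A 325% increase means the March figure is 4.25 times the September figure, not 3.25 times, which is why the same growth can also be described as roughly 4x.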

Solution

10x Plan Became 30x Plan — And They Were Still Behind

GitHub began a 10x capacity scaling initiative in October 2025. By February 2026, that plan was already obsolete — the real demand required 30x. Simultaneously, GitHub was running a migration to Azure, with 12.5% of all traffic on Azure Central US and a target of 50% by July 2026. Running a platform migration alongside an AI-driven traffic explosion is the engineering equivalent of rebuilding an airplane's engines at 35,000 feet.

Result

Cascading Failures and High-Profile Departures

The pressure produced not just performance degradation but engineering failures. On April 23, 2026, an incomplete feature flag silently reverted commits across 658 repositories and 2,092 pull requests — the UI showed green checkmarks while code was being rewritten underneath. On April 27, a botnet attack overwhelmed the Elasticsearch backend and took GitHub Search offline for hours. On April 28, Mitchell Hashimoto — GitHub user #1299, co-founder of HashiCorp, joined February 2008 — announced that Ghostty was leaving GitHub. The Zig programming language project also migrated away.

Peak monthly metrics by early 2026: 90 million merged PRs, 1.4 billion commits, 20 million new repositories. These are not the metrics of a platform being used by developers. These are the metrics of a platform being consumed by machines.

The Two April Incidents That Broke Developer Trust

Two incidents in late April 2026 crystallized the reliability crisis into something personal for every developer who experienced them. The first, on April 23, was a merge queue regression — a bug caused by an incomplete feature flag deployment — that silently reverted commits across 658 repositories and 2,092 pull requests. The terrifying part was not the scope. It was the silence. The UI continued to show green checkmarks and merge confirmations while the system was actively undoing work underneath. Developers had no idea their code had been reverted until they dug into the diff. A platform's most sacred contract with its users is that when it shows a green checkmark, the operation succeeded. GitHub broke that contract.

The second incident, on April 27, was a Search outage triggered by what GitHub described as a likely botnet overwhelming the Elasticsearch backend. Search went down for hours across the platform. Taken individually, either incident could be explained away. Together, in the same week, they were the signal that the accumulated reliability debt had become impossible to ignore.

The Silent Revert: Why the Merge Bug Was So Damaging

The April 23 merge queue bug was technically a data integrity issue — code that had been merged was reverted without notification. But its deeper damage was psychological. Developers depend on version control's fundamental promise: what you commit is preserved, what you merge is recorded. A bug that silently breaks this promise doesn't just cause data loss. It causes a loss of confidence in every operation the platform reports as successful. How many other green checkmarks weren't telling the whole story? This is the question that made Mitchell Hashimoto's departure feel less like frustration and more like a considered assessment of risk.

The Mitchell Hashimoto Moment

On April 28, 2026, Mitchell Hashimoto — GitHub user number 1299, co-founder of HashiCorp, creator of Vagrant, Packer, Consul, Terraform, and Vault, one of the most prolific infrastructure engineers in the industry — posted that Ghostty was leaving GitHub. He had visited GitHub almost every day for over 18 years. His post described the decision as 'irrationally sad' but said the platform was no longer a place where he could 'get work done' and 'ship software.' He made a point that resonated across the developer community: the problem was not Git itself — the distributed version control system remained excellent. The problem was the surrounding infrastructure: issues, pull requests, GitHub Actions. The platform built around Git was failing.

GITHUB USER #1299: WHY THE NUMBER MATTERS

Mitchell Hashimoto is GitHub user #1299, one of the earliest accounts on the platform, created in February 2008, the year GitHub launched. His departure is not symbolically significant because he is famous. It is symbolically significant because he is the kind of developer GitHub was built for: a serious, high-output infrastructure engineer who had chosen GitHub without question for 18 years. When the person who never had reason to question the platform starts questioning it, something has fundamentally changed.

257

Total incidents tracked between May 2025 and April 2026 by IncidentHub — roughly five per week, every week, for twelve months straight

48

Major outages in the same period, producing over 112 hours of total significant downtime across the platform

30x

The scale GitHub needed to design for by February 2026 — triple the 10x plan they had launched just four months earlier in October 2025

2,092

Pull requests silently reverted by the April 23 merge queue bug across 658 repositories — with no notification to affected developers
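The incident figures quoted in this article can be turned into weekly rates with a few divisions. The sketch below uses the IncidentHub totals from the text (257 incidents, 48 of them major, over 112 hours of major downtime); treating the period as exactly 52 weeks is an assumption.

```python
TOTAL_INCIDENTS = 257
MAJOR_OUTAGES = 48
MAJOR_DOWNTIME_HOURS = 112  # quoted as "over 112 hours", so a lower bound
WEEKS = 52                  # May 2025 to April 2026, taken as 52 weeks

incidents_per_week = TOTAL_INCIDENTS / WEEKS
major_outages_per_week = MAJOR_OUTAGES / WEEKS
min_avg_hours_per_major = MAJOR_DOWNTIME_HOURS / MAJOR_OUTAGES

print(f"All incidents per week:   {incidents_per_week:.1f}")
print(f"Major outages per week:   {major_outages_per_week:.1f}")
print(f"Avg hours per major (min): {min_avg_hours_per_major:.1f}")
```

The result is about five incidents of any severity per week, a little under one major outage per week, and at least two hours of significant downtime per major outage on average.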

The Fix

The Engineering Response: Ruby to Go, Monolith to Services, Single Cloud to Multi-Cloud

GitHub's CTO Vlad Fedorov publicly outlined the engineering response in his April 28 blog post. The plan has three interlocking architectural components (Ruby to Go, monolith to isolated services, single cloud to multi-cloud), backed by changes to capacity planning and feature-flag safety. Each targets a different layer of the scaling problem. Together they represent one of the most significant architectural transformations GitHub has undertaken since its founding.

GitHub's five-layer engineering response to the agentic AI scaling crisis, as outlined by CTO Vlad Fedorov in April 2026

Problem Layer	Root Cause	GitHub's Fix
Language / Runtime	Ruby monolith has GIL (Global Interpreter Lock), limiting CPU parallelism — cannot efficiently use all available cores under high concurrency	Rewriting performance-critical services from Ruby to Go — Go's goroutine model handles massive concurrency without the GIL bottleneck
Infrastructure	Single-cloud (Microsoft Azure-only migration) creates concentrated failure risk and limits horizontal scaling options	Multi-cloud deployment strategy — 12.5% on Azure Central US in early 2026, targeting 50% by July 2026 with additional cloud providers
Service Isolation	A single PR cascades through 10+ interconnected subsystems — Git storage, Actions, search, notifications, permissions, webhooks — so any bottleneck in one propagates to all	Isolating critical services (Git and Actions especially) into independent failure domains so a degradation in one subsystem cannot cascade to others
Capacity Planning	10x scaling plan (October 2025) became obsolete by February 2026 as AI agent traffic tripled the required headroom	30x capacity design: automated scaling for agent-driven burst load patterns rather than human-paced steady-state assumptions
Feature Safety	April 23 merge queue regression was caused by an incomplete feature flag that allowed a new code path to activate without full safeguards	Strengthened feature flag discipline — no code path that affects data integrity ships without complete flag protection and staged rollout verification
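The "Service Isolation" row describes keeping a degradation in one subsystem from cascading into others. One common way to get that behavior at a call site is a circuit breaker. The source does not say GitHub uses this pattern, so the sketch below is an illustration of the general idea under stated assumptions: the class, its parameters, and the thresholds are all hypothetical.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: stop calling a failing dependency for a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, fallback):
        """Run fn(); on failure, or while the circuit is open, use fallback()."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # open: do not touch the dependency at all
            self.opened_at = None   # cool-down over: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        self.failures = 0           # success closes the circuit fully
        return result


def search_backend():
    """Hypothetical dependency that is currently failing."""
    raise RuntimeError("search backend unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0)
for _ in range(3):
    # After two failures the third call skips search_backend entirely.
    print(breaker.call(search_backend, lambda: "results temporarily unavailable"))
```

The design choice worth noting is that the caller degrades (here, by returning a placeholder) instead of queueing more work behind a dependency that is already struggling, which is what lets the failure stay local.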

Why the Ruby-to-Go Rewrite Is the Right Call

Ruby served GitHub extraordinarily well for 18 years. It enabled rapid product development, contributed to GitHub's culture, and its Rails framework made the web application layer elegant and maintainable. But Ruby's Global Interpreter Lock (GIL) is a fundamental constraint: even on a 64-core server, a Ruby process can only execute one thread of Ruby code at a time. For human-paced web traffic, this limitation is manageable. For AI agent workflows that generate thousands of concurrent operations, the GIL is a hard ceiling. Go's goroutine model — lightweight threads managed by the Go runtime that can run across all available CPU cores without a GIL — is architecturally suited for exactly the concurrency profile that AI agents create. The rewrite is not about language preference. It is about physics.

The Structural Diagnosis: Agents Are Not Billed Like Agents

A deeper structural problem underlies the engineering crisis: GitHub's business model was designed for humans, and its pricing reflects human-scale consumption. A developer on a free GitHub account generates some commits, a few CI runs, and a handful of API calls per day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, thousands of Actions minutes, and tens of thousands of API calls in a single afternoon. The infrastructure cost per 'user' has fundamentally changed, but the pricing model has not yet caught up. As one engineer put it plainly: GitHub's Octoverse 2025 report celebrated nearly 1 billion commits and 36 million new developers. But the 2026 numbers aren't being driven by 36 million new developers showing up. They're being driven by agents that treat GitHub's API like a utility — which it basically is, except utilities charge for consumption.

83 INCIDENTS FROM CAPACITY FAILURES ALONE

83 of GitHub's 257 incidents between May 2025 and April 2026 were caused by load and capacity problems — with indications that many services did not have automatic scaling configured, requiring manual intervention to add capacity during surges. This means that dozens of times, engineers had to notice the problem, escalate it, and manually provision resources before the platform could recover. Automated capacity scaling for burst load is not optional infrastructure. For a platform being consumed by AI agents, it is the minimum viable reliability architecture.
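To make "automated capacity scaling" concrete, here is a minimal sketch of a proportional scaling rule: scale the worker count by the ratio of observed load to target load, then clamp it. This is the same shape as the rule documented for the Kubernetes Horizontal Pod Autoscaler, but the source does not say what GitHub runs, so the function name, the metric (queued jobs per worker), and the limits are assumptions.

```python
import math


def desired_workers(current_workers: int, load_per_worker: float,
                    target_load_per_worker: float,
                    min_workers: int = 1, max_workers: int = 100) -> int:
    """Proportional scaling: grow or shrink until load per worker hits target."""
    if current_workers <= 0 or target_load_per_worker <= 0:
        raise ValueError("worker count and target load must be positive")
    raw = math.ceil(current_workers * load_per_worker / target_load_per_worker)
    # Clamp so a metrics glitch cannot scale to zero or to an unbounded size.
    return max(min_workers, min(max_workers, raw))


# Hypothetical burst: 10 workers, each with 300 queued jobs, target of 100.
print(desired_workers(10, 300, 100))
# Quiet period: the same pool with 50 queued jobs per worker.
print(desired_workers(10, 50, 100))
```

With the numbers above, the burst case asks for 30 workers and the quiet case for 5. The point is that the decision is computed from a metric on every evaluation instead of waiting for an engineer to notice, escalate, and provision by hand.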

The CVE-2026-3854 Problem: Reliability and Security Compounding

The April week that prompted Hashimoto's departure also included a critical security disclosure: CVE-2026-3854, a CVSS 8.7 remote code execution vulnerability in GitHub's internal Git layer. The flaw allowed an attacker to inject extra header fields via a malformed git push and execute code as the Git service user. GitHub.com was patched within six hours of disclosure. GitHub Enterprise Server patches were released. But Wiz reported that 88% of self-hosted GitHub Enterprise Server instances remained unpatched at time of publication. A platform under reliability stress is also a platform whose administrators are too busy managing incidents to maintain their security posture — the two crises compound each other.

Architecture

GitHub's architecture evolved over 18 years around a core assumption: the unit of load is a human developer. A human opens a PR, waits for review, pushes a few commits, and merges. The platform's service graph (its Git storage layer, its mergeability computation engine, its branch protection evaluation system, its Actions job dispatch queue, its search indexer, its notification fan-out, its webhook delivery pipeline, its permission evaluation layer, its API gateway) was sized and coupled around this human-paced access pattern. Every service in the chain was both a dependency and a dependent of every other service. This architecture was efficient and made GitHub easy to reason about for years.

AI agents broke the architecture's fundamental assumption. An agent doesn't open a PR and wait. An agent opens 50 PRs in parallel, each triggering the full service chain simultaneously. At scale, this creates a concurrency storm that amplifies through every layer of the graph. GitHub CTO Vlad Fedorov described it precisely: a single PR touches Git storage, mergeability checks, branch protection, Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. When the number of concurrent PRs scales 4x in six months, the pressure on every one of those systems scales accordingly — and the interconnected failures begin.
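The fan-out described above can be expressed as a simple lower-bound calculation. The sketch assumes each pull request touches each of the twelve listed subsystems exactly once, which understates real traffic (retries, repeated CI runs, and webhook redeliveries are ignored); the function name is illustrative.

```python
# The subsystems named in Fedorov's description of a single pull request.
SUBSYSTEMS = [
    "Git storage", "mergeability checks", "branch protection", "Actions",
    "search", "notifications", "permissions", "webhooks", "APIs",
    "background jobs", "caches", "databases",
]


def fanout_operations(concurrent_prs: int, subsystems=SUBSYSTEMS) -> int:
    """Lower bound on subsystem touches: one per PR per subsystem."""
    return concurrent_prs * len(subsystems)


human_load = fanout_operations(1)    # one developer merging one PR
agent_load = fanout_operations(50)   # one agent opening 50 PRs in parallel

print(f"Human: {human_load} subsystem touches")
print(f"Agent: {agent_load} subsystem touches at the same moment")
```

One human-paced PR produces 12 touches spread over hours; 50 parallel agent PRs produce at least 600 at once, and that is for a single agent session.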

A Single GitHub PR: The 10+ Subsystems It Touches

Flowchart showing how one AI-agent pull request fans out across Git storage, mergeability, branch protection, Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases.

GitHub Actions: Weekly Compute Minutes — The AI Agent Surge

Bar chart of GitHub Actions weekly compute minutes from 2023 through early 2026, showing growth from 500 million to 2.1 billion minutes per week.

THE RUBY GIL: WHY THE MONOLITH COULDN'T SCALE

Ruby's Global Interpreter Lock (GIL) is a mutex that prevents multiple threads from executing Ruby code simultaneously in the same process. For human-paced web traffic — where a request comes in, does some database work, and returns a response — the GIL is rarely the bottleneck. For AI agent traffic — where thousands of operations arrive concurrently and each one fans out across dozens of internal services — the GIL becomes a hard ceiling. Even on a 64-core server, a Ruby process can use exactly one core at a time for Ruby execution. The fix isn't optimization. It's a different runtime. Go's goroutine scheduler runs across all available CPU cores without a GIL, making it architecturally suited for the concurrency profile that AI agent workflows generate. GitHub's Ruby-to-Go migration for performance-critical services is the right move — not as a language preference, but as a physics constraint.
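The per-process ceiling can be captured in a toy model. This is a deliberately simplified sketch, not a benchmark: it only counts how many cores can execute interpreter code at once inside one process, and it ignores that threads blocked on I/O release the lock and that Ruby deployments commonly run several worker processes per machine to use more cores. Both function names are hypothetical.

```python
def cores_running_interpreter_code(runnable_threads: int, cores: int,
                                   global_lock: bool) -> int:
    """Upper bound on cores executing interpreter code inside ONE process."""
    if runnable_threads < 0 or cores < 1:
        raise ValueError("need runnable_threads >= 0 and cores >= 1")
    limit = 1 if global_lock else cores
    return min(runnable_threads, limit)


def processes_to_saturate(cores: int, global_lock: bool) -> int:
    """Processes needed before every core can run interpreter code at once."""
    return cores if global_lock else 1


# 1,000 runnable threads on a 64-core server.
print(cores_running_interpreter_code(1000, 64, global_lock=True))
print(cores_running_interpreter_code(1000, 64, global_lock=False))
print(processes_to_saturate(64, global_lock=True))
```

Under these assumptions a single locked process keeps one core busy regardless of thread count, while a runtime without a global lock can spread the same work across all 64. Saturating the machine with a locked runtime takes one process per core, each carrying its own memory footprint.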

The Azure Migration Timing Problem

GitHub began migrating traffic to Azure as part of its Microsoft integration: 12.5% of all traffic was on Azure Central US in early 2026, with a target of 50% by July 2026. Running this migration simultaneously with a 30x capacity redesign and a Ruby-to-Go rewrite is an extraordinary amount of concurrent infrastructure transformation. Each of these projects is a multi-year undertaking at GitHub's scale. Each is meant to shrink the blast radius of future failures, but running all three in parallel increases the cognitive load and coordination complexity for the engineering teams managing them. The timing was not chosen; it was forced by the speed of the AI agent traffic explosion.

Lessons

GitHub's reliability crisis is not a story about a company making engineering mistakes. It's a story about a platform being asked to do something it was never designed for — at a speed that outpaced any reasonable capacity planning horizon. The lessons are as much about the nature of AI agent infrastructure demands as they are about reliability engineering practice.

What to remember

Your platform's capacity model must be built around its actual consumers — not its original consumers. GitHub was built for human developers. AI agents consume infrastructure at orders of magnitude greater intensity. Any platform that introduces AI-native workflows must remodel its capacity assumptions from scratch, not incrementally adjust from the human baseline.
Feature flags are not optional for infrastructure that handles data integrity. The April 23 merge queue bug, which silently reverted 2,092 pull requests, was caused by an incomplete feature flag. A complete feature flag would have allowed engineers to disable the affected code path instantly without a full redeployment. For any code path that touches data that developers trust as immutable, flag protection is not a best practice. It is the minimum viable safety mechanism.
A monolith that can't be incrementally scaled will become a single point of failure at sufficient scale. GitHub's Ruby monolith served the platform for 18 years because human-paced traffic was bounded enough that the GIL's concurrency limit never became the primary bottleneck. AI agents removed that bound. The architectural lesson is not that monoliths are bad — it's that every architectural decision encodes assumptions about scale, and those assumptions must be revisited when the scale changes fundamentally.
When critical services are deeply coupled — when a PR touches Git storage, Actions, search, notifications, permissions, and webhooks in a single chain — a failure in any one component becomes a failure across all components. Service isolation is not premature optimization. It is the prerequisite for containing blast radius at scale. GitHub's commitment to isolating Git and Actions into independent failure domains is the architectural move that will have the most long-term impact on reliability.
Trust is the asset that reliability engineering protects. Mitchell Hashimoto didn't leave GitHub because of the April 27 Search outage alone. He left because 257 incidents over 12 months had eroded confidence in the platform as a reliable foundation for serious work. Reliability is not measured in individual incident severities — it is measured in the cumulative effect of failures on whether people trust the platform to do what it says it did. The merge bug's silent revert made this unmistakably concrete.
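The feature flag lesson above comes down to two properties: the new code path is off unless a flag explicitly enables it, and it can be switched off at runtime without a redeploy. The sketch below shows that shape with a hypothetical in-memory flag store. The flag name, the functions, and the store are illustrative assumptions and say nothing about how GitHub's merge queue is actually built.

```python
from typing import Callable, Dict


class FlagStore:
    """Hypothetical in-memory flag store; unknown flags read as disabled."""

    def __init__(self) -> None:
        self._flags: Dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)  # default off is the safe default


def merge_pull_request(pr_id: int, flags: FlagStore,
                       new_path: Callable[[int], str],
                       legacy_path: Callable[[int], str]) -> str:
    """Route through the new code path only while its flag is on."""
    if flags.is_enabled("new_merge_queue_path"):
        return new_path(pr_id)
    return legacy_path(pr_id)


flags = FlagStore()
new = lambda pr: f"PR {pr} merged via new path"
legacy = lambda pr: f"PR {pr} merged via legacy path"

print(merge_pull_request(1, flags, new, legacy))   # flag unset: legacy path
flags.set("new_merge_queue_path", True)
print(merge_pull_request(2, flags, new, legacy))   # staged rollout: new path
flags.set("new_merge_queue_path", False)           # kill switch, no redeploy
print(merge_pull_request(3, flags, new, legacy))   # back on the legacy path
```

A flag is "complete" in this sense only if every entry point into the new behavior goes through the same check; a single unguarded call site is enough to recreate the failure mode described above.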

The Availability-First Mandate

GitHub's leadership response to the crisis was to shift from a growth-at-all-costs philosophy to an availability-first mandate. This means engineering prioritization changes: new feature work is deprioritized relative to stability improvements, scaling infrastructure, and incident remediation. The availability-first mandate is the organizational signal that GitHub recognizes the seriousness of the reliability debt it has accumulated. Whether the engineering plans — Ruby-to-Go, service isolation, 30x capacity, multi-cloud — can be executed faster than the AI agent traffic continues to grow is the open question that will define GitHub's next two years.

WHAT EVERY DEVELOPER SHOULD DO RIGHT NOW

GitHub's instability has a practical implication for every engineering team: treat GitHub as important infrastructure, not invisible infrastructure. Map your team's GitHub dependency surface: CI/CD pipelines, registry mirrors, source-of-truth, identity flows, Actions runners. Know which of your deployments would be blocked if GitHub Actions were degraded for four hours. Have a runbook for 'GitHub is down' that doesn't end with 'wait for GitHub to come back.' Independent CI mirroring, artifact registries with fallback paths, and local Git mirrors of critical dependencies are not paranoia. They are the appropriate response to a platform that has averaged roughly five incidents per week, about one of them major, for a year.
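One small piece of such a runbook can be automated: deciding when to switch to a fallback. The sketch below parses a status payload in the Atlassian Statuspage `status.json` shape, where `status.indicator` is one of `none`, `minor`, `major`, or `critical`. Whether your team polls a public status page or runs its own probes is an assumption left open here, and the remote names and the failover policy are hypothetical.

```python
import json

# Policy assumption: only "major" and "critical" justify leaving the primary.
FAILOVER_INDICATORS = {"major", "critical"}


def should_fail_over(status_json: str) -> bool:
    """Return True when the reported indicator crosses the failover policy."""
    payload = json.loads(status_json)
    indicator = payload.get("status", {}).get("indicator", "unknown")
    return indicator in FAILOVER_INDICATORS


def choose_remote(status_json: str, primary: str = "origin",
                  mirror: str = "mirror") -> str:
    """Pick which Git remote a deploy script should fetch from."""
    return mirror if should_fail_over(status_json) else primary


healthy = '{"status": {"indicator": "none", "description": "All Systems Operational"}}'
outage = '{"status": {"indicator": "major", "description": "Partial System Outage"}}'

print(choose_remote(healthy))
print(choose_remote(outage))
```

A status page can lag behind a real outage, so a check like this is best treated as one input alongside your own failure signals, such as repeated fetch timeouts.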

GitHub spent 18 years building the platform where the world's code lives, survived Microsoft's acquisition, launched Copilot, and made developers 10x more productive. Then the thing that broke it was all those developers becoming 100x more productive using Copilot. The platform that accelerated AI-assisted development got outrun by AI-assisted development. There is probably a lesson in there somewhere about building infrastructure for the world you are creating, not the world you came from.

TechLogStack: built at scale, broken in public, rebuilt by engineers