{"version": "https://jsonfeed.org/version/1.1", "title": "TechLogStack — Engineering Case Studies", "home_page_url": "https://techlogstack.com/", "feed_url": "https://techlogstack.com/feed.json", "description": "Real engineering case studies and postmortems retold in plain English — Netflix, Stripe, Google, and more.", "language": "en", "items": [{"id": "https://techlogstack.com/explore/github-ai-agents-outage-2026/", "url": "https://techlogstack.com/explore/github-ai-agents-outage-2026/", "title": "GitHub Built the Internet's Code Platform — Then AI Agents Broke It", "summary": "GitHub logged 257 incidents in 12 months as AI agent workflows demanded 30x the platform's designed capacity. The engineering story behind the outages, the Ghostty e", "content_html": "<p><strong>GitHub</strong> · Distributed Systems · 31 May 2026</p>\n<p>Between May 2025 and April 2026, GitHub experienced 257 incidents — 48 of them major outages. That's roughly one significant disruption every single week. The culprit wasn't a security breach, a botched deployment, or a rogue engineer. It was the thing GitHub had spent years celebrating: AI. Specifically, agentic AI workflows that turned one human developer's footprint into hundreds of commits, thousands of CI minutes, and a dozen simultaneous PR operations — all at once, across millions of accounts. GitHub had been built for humans. Agents are not human.</p>\n<ul>\n<li>257 incidents — May 2025 to April 2026</li><li>48 major outages, 112+ hours total downtime</li><li>57 GitHub Actions outages in 12 months</li><li>10x scaling plan revised to 30x by February 2026</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>We started executing our plan to increase GitHub's capacity by 10X in October 2025, with a goal of substantially improving reliability and failover. By February 2026, it was clear that we needed to design for a future that requires 30X today's scale. 
The main driver is a rapid change in how software is being built. Since the second half of December 2025, agentic development workflows have accelerated sharply.</p><p><em>— Vlad Fedorov, CTO of GitHub — GitHub Engineering Blog, April 28, 2026</em></p></blockquote>\n<p>For most of its existence, GitHub has been one of the most reliable platforms on the internet. Developers took it for granted the way they take electricity for granted — always on, always there, a utility so dependable it disappeared into the background. That changed in 2025. Not because GitHub's engineers got worse. Not because the codebase got sloppier. But because something fundamental changed about who — or more precisely, <em>what</em> — was using GitHub. <strong>AI coding agents arrived at scale</strong>, and they didn't behave anything like the human developers the platform was built for.</p>\n<p>The numbers tell the story with uncomfortable clarity. In 2024, GitHub logged 119 service incidents, including 26 major ones — frustrating, but manageable, with an average resolution time of roughly 106 minutes. Then, between May 2025 and April 2026, incident monitoring service IncidentHub tracked <strong>257 separate incidents</strong>, of which 48 were classified as major outages. February 2026 alone produced 37 incidents — the worst month on record. GitHub Actions, the platform's CI/CD automation backbone, suffered 57 outages in the same 12-month stretch. On May 15, 2026, a single Actions degradation caused <strong>42% of all Actions runs to fail at peak impact</strong>. Developers worldwide woke up to red CI pipelines and spent hours debugging their own code — before eventually realizing it wasn't their code at all.</p>\n<blockquote>\n<p><strong>THE CORE PROBLEM: AGENTS DON'T BEHAVE LIKE HUMANS</strong></p>\n<p>A human developer on a free GitHub account might generate a few commits and a handful of CI runs in a working day. 
<strong>An AI agent on the same account can generate hundreds of commits, dozens of PRs, and thousands of Actions minutes in a single afternoon.</strong> GitHub's 2025 Octoverse report celebrated nearly 1 billion commits. By early 2026, GitHub COO Kyle Daigle shared a more alarming figure: the platform was handling <strong>275 million commits every single week</strong> — on pace for 14 billion in 2026 if growth held linear. That's a 14x annual increase. It wasn't 14x more developers. It was agents treating GitHub's API like a utility and consuming at machine speed.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>GitHub Was Built for Human-Paced Development</h4>\n<p>GitHub's architecture was designed for a world where developers work at human speed: open a PR, push commits over hours or days, wait for CI to run, merge when green. The platform's capacity planning, its database schemas, its job queues, its rate limits — all calibrated for a workflow where one human generates a bounded amount of activity per session. That assumption held for 17 years.</p>\n<hr />\n<h3>Cause</h3>\n<h4>AI Agents Changed the Economics of Every GitHub Operation</h4>\n<p>GitHub CTO Vlad Fedorov identified the mechanism: a single pull request can simultaneously touch Git storage, mergeability checks, branch protection, GitHub Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. A human merging one PR triggers this chain once. An AI agent framework running hundreds of concurrent sessions triggers it thousands of times simultaneously. AI-agent PRs alone jumped from 4 million in September 2025 to 17 million in March 2026 — a 325% increase in six months. Actions usage went from 500 million minutes per week in 2023 to 2.1 billion minutes in a single week in early 2026.</p>\n<hr />\n<h3>Solution</h3>\n<h4>10x Plan Became 30x Plan — And They Were Still Behind</h4>\n<p>GitHub began a 10x capacity scaling initiative in October 2025. 
By February 2026, that plan was already obsolete — the real demand required 30x. Simultaneously, GitHub was running a migration to Azure, with 12.5% of all traffic on Azure Central US and a target of 50% by July 2026. Running a platform migration alongside an AI-driven traffic explosion is the engineering equivalent of rebuilding an airplane's engines at 35,000 feet.</p>\n<hr />\n<h3>Result</h3>\n<h4>Cascading Failures and High-Profile Departures</h4>\n<p>The pressure produced not just performance degradation but engineering failures. On April 23, 2026, an incomplete feature flag silently reverted commits across 658 repositories and 2,092 pull requests — the UI showed green checkmarks while code was being rewritten underneath. On April 27, a botnet attack overwhelmed the Elasticsearch backend and took GitHub Search offline for hours. On April 28, Mitchell Hashimoto — GitHub user #1299, co-founder of HashiCorp, joined February 2008 — announced that Ghostty was leaving GitHub. The Zig programming language project also migrated away.</p>\n<hr />\n\n<blockquote>\n<p><strong>🤖</strong></p>\n<p>Peak monthly metrics by early 2026: <strong>90 million merged PRs</strong>, <strong>1.4 billion commits</strong>, <strong>20 million new repositories</strong>. These are not the metrics of a platform being used by developers. These are the metrics of a platform being consumed by machines.</p>\n</blockquote>\n\n<h3>The Two April Incidents That Broke Developer Trust</h3>\n<p>Two incidents in late April 2026 crystallized the reliability crisis into something personal for every developer who experienced them. The first, on April 23, was a <strong>merge queue regression</strong> — a bug caused by an incomplete feature flag deployment — that silently reverted commits across 658 repositories and 2,092 pull requests. The terrifying part was not the scope. It was the silence. 
The UI continued to show green checkmarks and merge confirmations while the system was actively undoing work underneath. Developers had no idea their code had been reverted until they dug into the diff. A platform's most sacred contract with its users is that when it shows a green checkmark, the operation succeeded. GitHub broke that contract.</p>\n<p>The second incident, on April 27, was a Search outage triggered by what GitHub described as a likely botnet overwhelming the Elasticsearch backend. Search went down for hours across the platform. Taken individually, either incident could be explained away. Together, in the same week, they were the signal that the accumulated reliability debt had become impossible to ignore.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Silent Revert: Why the Merge Bug Was So Damaging</p><p>The April 23 merge queue bug was technically a data integrity issue — code that had been merged was reverted without notification. But its deeper damage was psychological. Developers depend on version control's fundamental promise: what you commit is preserved, what you merge is recorded. A bug that silently breaks this promise doesn't just cause data loss. It causes a loss of confidence in every operation the platform reports as successful. How many other green checkmarks weren't telling the whole story? This is the question that made Mitchell Hashimoto's departure feel less like frustration and more like a considered assessment of risk.</p>\n</blockquote>\n\n<h3>The Mitchell Hashimoto Moment</h3>\n<p>On April 28, 2026, Mitchell Hashimoto — GitHub user number 1299, co-founder of HashiCorp, creator of Vagrant, Packer, Consul, Terraform, and Vault, one of the most prolific infrastructure engineers in the industry — posted that Ghostty was leaving GitHub. He had visited GitHub almost every day for over 18 years. 
His post described the decision as 'irrationally sad' but said the platform was no longer a place where he could 'get work done' and 'ship software.' He made a point that resonated across the developer community: the problem was not Git itself — the distributed version control system remained excellent. The problem was the <strong>surrounding infrastructure</strong>: issues, pull requests, GitHub Actions. The platform built around Git was failing.</p>\n<blockquote>\n<p><strong>GITHUB USER #1299: WHY THE NUMBER MATTERS</strong></p>\n<p>Mitchell Hashimoto is GitHub user <strong>#1299</strong> — one of the earliest accounts on the platform, joined February 2008, the same year GitHub launched. His departure is not symbolically significant because he is famous. It is symbolically significant because he is the kind of developer GitHub was built for: a serious, high-output infrastructure engineer who had chosen GitHub without question for 18 years. When the person who never had reason to question the platform starts questioning it, something has fundamentally changed.</p>\n</blockquote>\n\n<ul>\n<li><strong>257</strong> — Total incidents tracked between May 2025 and April 2026 by IncidentHub — roughly five per week, every week, for twelve months straight</li>\n<li><strong>48</strong> — Major outages in the same period, producing over 112 hours of total significant downtime across the platform</li>\n<li><strong>30x</strong> — The scale GitHub needed to design for by February 2026 — triple the 10x plan they had launched just four months earlier in October 2025</li>\n<li><strong>2,092</strong> — Pull requests silently reverted by the April 23 merge queue bug across 658 repositories — with no notification to affected developers</li>\n</ul>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Engineering Response: Ruby to Go, Monolith to Services, Single Cloud to Multi-Cloud</h3>\n<p>GitHub's CTO Vlad Fedorov publicly outlined the engineering response in his April 28 
blog post. The plan has five interlocking layers, each targeting a different part of the scaling problem. Together they represent one of the most significant architectural transformations GitHub has undertaken since its founding.</p>\n<p>GitHub's five-layer engineering response to the agentic AI scaling crisis, as outlined by CTO Vlad Fedorov in April 2026</p><div><table><caption>GitHub's five-layer engineering response to the agentic AI scaling crisis, as outlined by CTO Vlad Fedorov in April 2026</caption><thead><tr><th>Problem Layer</th><th>Root Cause</th><th>GitHub's Fix</th></tr></thead><tbody><tr><td>Language / Runtime</td><td>Ruby monolith has GIL (Global Interpreter Lock), limiting CPU parallelism — cannot efficiently use all available cores under high concurrency</td><td>Rewriting performance-critical services from Ruby to Go — Go's goroutine model handles massive concurrency without the GIL bottleneck</td></tr><tr><td>Infrastructure</td><td>Single-cloud (Microsoft Azure-only migration) creates concentrated failure risk and limits horizontal scaling options</td><td>Multi-cloud deployment strategy — 12.5% on Azure Central US in early 2026, targeting 50% by July 2026 with additional cloud providers</td></tr><tr><td>Service Isolation</td><td>A single PR cascades through 10+ interconnected subsystems — Git storage, Actions, search, notifications, permissions, webhooks — so any bottleneck in one propagates to all</td><td>Isolating critical services (Git and Actions especially) into independent failure domains so a degradation in one subsystem cannot cascade to others</td></tr><tr><td>Capacity Planning</td><td>10x scaling plan (October 2025) became obsolete by February 2026 as AI agent traffic tripled the required headroom</td><td>30x capacity design — automated scaling for agent-driven burst load patterns rather than human-paced steady-state assumptions</td></tr><tr><td>Feature Safety</td><td>April 23 merge queue regression was caused by an incomplete 
feature flag that allowed a new code path to activate without full safeguards</td><td>Strengthened feature flag discipline — no code path that affects data integrity ships without complete flag protection and staged rollout verification</td></tr></tbody></table>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why the Ruby-to-Go Rewrite Is the Right Call</p><p>Ruby served GitHub extraordinarily well for 18 years. It enabled rapid product development, contributed to GitHub's culture, and its Rails framework made the web application layer elegant and maintainable. But Ruby's Global Interpreter Lock (GIL) is a fundamental constraint: even on a 64-core server, a Ruby process can only execute one thread of Ruby code at a time. For human-paced web traffic, this limitation is manageable. <strong>For AI agent workflows that generate thousands of concurrent operations, the GIL is a hard ceiling.</strong> Go's goroutine model — lightweight threads managed by the Go runtime that can run across all available CPU cores without a GIL — is architecturally suited for exactly the concurrency profile that AI agents create. The rewrite is not about language preference. It is about physics.</p>\n</blockquote>\n\n<h3>The Structural Diagnosis: Agents Are Not Billed Like Agents</h3>\n<p>A deeper structural problem underlies the engineering crisis: GitHub's business model was designed for humans, and its pricing reflects human-scale consumption. A developer on a free GitHub account generates some commits, a few CI runs, and a handful of API calls per day. An AI agent on the same account can generate hundreds of commits, dozens of PRs, thousands of Actions minutes, and tens of thousands of API calls in a single afternoon. The infrastructure cost per 'user' has fundamentally changed, but the pricing model has not yet caught up. As one engineer put it plainly: <strong>GitHub's Octoverse 2025 report celebrated nearly 1 billion commits and 36 million new developers. 
But the 2026 numbers aren't being driven by 36 million new developers showing up. They're being driven by agents that treat GitHub's API like a utility — which it basically is, except utilities charge for consumption.</strong></p>\n<blockquote>\n<p><strong>83 INCIDENTS FROM CAPACITY FAILURES ALONE</strong></p>\n<p>83 of GitHub's 257 incidents between May 2025 and April 2026 were caused by <strong>load and capacity problems</strong> — with indications that many services did not have automatic scaling configured, requiring manual intervention to add capacity during surges. This means that dozens of times, engineers had to notice the problem, escalate it, and manually provision resources before the platform could recover. Automated capacity scaling for burst load is not optional infrastructure. For a platform being consumed by AI agents, it is the minimum viable reliability architecture.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The CVE-2026-3854 Problem: Reliability and Security Compounding</p><p>The April week that prompted Hashimoto's departure also included a critical security disclosure: CVE-2026-3854, a CVSS 8.7 remote code execution vulnerability in GitHub's internal Git layer. The flaw allowed an attacker to inject extra header fields via a malformed git push and execute code as the Git service user. GitHub.com was patched within six hours of disclosure. GitHub Enterprise Server patches were released. But Wiz reported that 88% of self-hosted GitHub Enterprise Server instances remained unpatched at time of publication. A platform under reliability stress is also a platform whose administrators are too busy managing incidents to maintain their security posture — the two crises compound each other.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>GitHub's architecture evolved over 18 years around a core assumption: the unit of load is a human developer. 
A human opens a PR, waits for review, pushes a few commits, and merges. The platform's service graph — its Git storage layer, its mergeability computation engine, its branch protection evaluation system, its Actions job dispatch queue, its search indexer, its notification fan-out, its webhook delivery pipeline, its permission evaluation layer, its API gateway — was sized and coupled around this human-paced access pattern. Every service in the chain both depended on, and was depended on by, the other services. This architecture was efficient and made GitHub easy to reason about for years.</p><p>AI agents broke the architecture's fundamental assumption. An agent doesn't open a PR and wait. An agent opens 50 PRs in parallel, each triggering the full service chain simultaneously. At scale, this creates a concurrency storm that amplifies through every layer of the graph. GitHub CTO Vlad Fedorov described it precisely: a single PR touches Git storage, mergeability checks, branch protection, Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. 
When the number of concurrent PRs scales 4x in six months, the pressure on every one of those systems scales accordingly — and the interconnected failures begin.</p>\r\n<h3>A Single GitHub PR: The 10+ Subsystems It Touches</h3>\n<p><a href=\"https://techlogstack.com/explore/github-ai-agents-outage-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\r\n<h3>GitHub Actions: Weekly Compute Minutes — The AI Agent Surge</h3>\n<p><a href=\"https://techlogstack.com/explore/github-ai-agents-outage-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\r\n<blockquote>\n<p><strong>THE RUBY GIL: WHY THE MONOLITH COULDN'T SCALE</strong></p>\n<p>Ruby's Global Interpreter Lock (GIL) is a mutex that prevents multiple threads from executing Ruby code simultaneously in the same process. For human-paced web traffic — where a request comes in, does some database work, and returns a response — the GIL is rarely the bottleneck. For AI agent traffic — where thousands of operations arrive concurrently and each one fans out across dozens of internal services — the GIL becomes a hard ceiling. <strong>Even on a 64-core server, a Ruby process can use exactly one core at a time for Ruby execution.</strong> The fix isn't optimization. It's a different runtime. Go's goroutine scheduler runs across all available CPU cores without a GIL, making it architecturally suited for the concurrency profile that AI agent workflows generate. 
GitHub's Ruby-to-Go migration for performance-critical services is the right move — not as a language preference, but as a physics constraint.</p>\n</blockquote>\n\r\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Azure Migration Timing Problem</p><p>GitHub began migrating traffic to Azure as part of its Microsoft integration — 12.5% of all traffic on Azure Central US in early 2026, with a target of 50% by July 2026. Running this migration <strong>simultaneously</strong> with a 30x capacity redesign and a Ruby-to-Go rewrite is an extraordinary amount of concurrent infrastructure transformation. Each of these projects is a multi-year undertaking at GitHub's scale. Running all three in parallel reduces the blast radius of each individual change — but increases the cognitive load and coordination complexity for the engineering teams managing them. The timing was not chosen; it was forced by the speed of the AI agent traffic explosion.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>GitHub's reliability crisis is not a story about a company making engineering mistakes. It's a story about a platform being asked to do something it was never designed for — at a speed that outpaced any reasonable capacity planning horizon. The lessons are as much about the nature of AI agent infrastructure demands as they are about reliability engineering practice.</p>\r\n<ol>\n<li><strong>01.</strong> <strong>Your platform's capacity model must be built around its actual consumers — not its original consumers.</strong> GitHub was built for human developers. AI agents consume infrastructure at orders of magnitude greater intensity. 
Any platform that introduces AI-native workflows must remodel its capacity assumptions from scratch, not incrementally adjust from the human baseline.</li>\n<li><strong>02.</strong> <em>Feature flags</em> (a software engineering practice where new code is deployed but kept inactive until explicitly enabled, allowing teams to test in production, roll out gradually, and instantly disable a feature if it causes problems — without redeploying) are not optional for infrastructure that handles data integrity. The April 23 merge queue bug — which silently reverted 2,092 pull requests — was caused by an incomplete feature flag. A complete feature flag would have allowed engineers to disable the affected code path instantly without a full redeployment. For any code path that touches data that developers trust as immutable, flag protection is not a best practice. It is the minimum viable safety mechanism.</li>\n<li><strong>03.</strong> <strong>A monolith that can't be incrementally scaled will become a single point of failure at sufficient scale.</strong> GitHub's Ruby monolith served the platform for 18 years because human-paced traffic was bounded enough that the GIL's concurrency limit never became the primary bottleneck. AI agents removed that bound. The architectural lesson is not that monoliths are bad — it's that every architectural decision encodes assumptions about scale, and those assumptions must be revisited when the scale changes fundamentally.</li>\n<li><strong>04.</strong> When critical services are deeply coupled — when a PR touches Git storage, Actions, search, notifications, permissions, and webhooks in a single chain — a failure in any one component becomes a failure across all components. <strong>Service isolation is not premature optimization. 
It is the prerequisite for containing blast radius at scale.</strong> GitHub's commitment to isolating Git and Actions into independent failure domains is the architectural move that will have the most long-term impact on reliability.</li>\n<li><strong>05.</strong> Trust is the asset that reliability engineering protects. Mitchell Hashimoto didn't leave GitHub because of the April 27 Search outage alone. He left because 257 incidents over 12 months had eroded confidence in the platform as a reliable foundation for serious work. <strong>Reliability is not measured in individual incident severities — it is measured in the cumulative effect of failures on whether people trust the platform to do what it says it did.</strong> The merge bug's silent revert made this unmistakably concrete.</li>\n</ol>\n\r\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Availability-First Mandate</p><p>GitHub's leadership response to the crisis was to shift from a growth-at-all-costs philosophy to an <strong>availability-first mandate</strong>. This means engineering prioritization changes: new feature work is deprioritized relative to stability improvements, scaling infrastructure, and incident remediation. The availability-first mandate is the organizational signal that GitHub recognizes the seriousness of the reliability debt it has accumulated. Whether the engineering plans — Ruby-to-Go, service isolation, 30x capacity, multi-cloud — can be executed faster than the AI agent traffic continues to grow is the open question that will define GitHub's next two years.</p>\n</blockquote>\n\r\n<blockquote>\n<p><strong>WHAT EVERY DEVELOPER SHOULD DO RIGHT NOW</strong></p>\n<p>GitHub's instability has a practical implication for every engineering team: <strong>treat GitHub as important infrastructure, not invisible infrastructure.</strong> Map your team's GitHub dependency surface — CI/CD pipelines, registry mirrors, source-of-truth, identity flows, Actions runners. 
Know which of your deployments would be blocked if GitHub Actions were degraded for four hours. Have a runbook for 'GitHub is down' that doesn't end with 'wait for GitHub to come back.' Independent CI mirroring, artifact registries with fallback paths, and local Git mirrors of critical dependencies are not paranoia — they are the appropriate response to a platform that has demonstrated it will have roughly five incidents per week, about one of them major, for a year.</p>\n</blockquote>\n\r\n<blockquote><p>GitHub spent 18 years building the platform where the world's code lives, survived Microsoft's acquisition, launched Copilot, and made developers 10x more productive — and then the thing that broke it was all those developers becoming 100x more productive using Copilot. The platform that accelerated AI-assisted development got outrun by AI-assisted development. There is probably a lesson in there somewhere about building infrastructure for the world you are creating, not the world you came from.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/github-ai-agents-outage-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-31T00:00:00+00:00", "date_modified": "2026-06-13T19:03:06.257453+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "GitHub"]}, {"id": "https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/", "url": "https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/", "title": "A Race Condition in DynamoDB's DNS Took Down Snapchat, Fortnite, Ring, and Half the Internet for 15 Hours", "summary": "On October 19–20, 2025, a race condition between two DNS Enactor processes wiped all DynamoDB DNS records in US-EAST-1, 
cascading into a 15-hour outage that took dow", "content_html": "<p><strong>AWS</strong> · Distributed Systems · 31 May 2026</p>\n<p>It was 11:48 PM PDT on October 19, 2025. Two automation processes inside AWS's DynamoDB DNS management system were doing the same job simultaneously — one fast, one painfully slow. The slow one was just finishing up when the fast one, having already completed, triggered a cleanup job that deleted the slow one's work. In that moment, every DNS record for DynamoDB in the world's busiest cloud region vanished. Snapchat went dark for 375 million daily users. Fortnite lobbies dissolved mid-match. Ring cameras stopped recording. The UK's HMRC tax authority went offline. For 15 hours, the internet's largest database service had no address.</p>\n<ul>\n<li>140+ AWS services eventually affected</li><li>17M+ outage reports across 3,000+ organizations (Ookla data)</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.</p><p><em>— Amazon Web Services — Official Post-Incident Summary, October 2025</em></p></blockquote>\n<p>DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that <em>manage</em> everything else. 
This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilizing everything that lost its footing when the ground disappeared.</p>\n<p>To understand what happened, you need to understand how DynamoDB manages its DNS. At AWS's scale, DynamoDB maintains hundreds of thousands of DNS records to operate the massive fleet of load balancers that route traffic across each region. These records are updated constantly — as capacity is added, as hardware fails, as traffic is redistributed. AWS built a two-component system to manage this at scale: a <strong>DNS Planner</strong> and multiple <strong>DNS Enactors</strong>.</p>\n<blockquote>\n<p><strong>THE TWO-COMPONENT DNS ARCHITECTURE: PLANNER AND ENACTOR</strong></p>\n<p>The DNS management system had two independent components: <strong>The DNS Planner</strong> monitors load balancer health and capacity and periodically creates new DNS plans — essentially a specification of which load balancers should receive traffic and with what weight distribution. <strong>The DNS Enactors</strong> are the workers — multiple independent processes running across three different Availability Zones — that pick up the plans and apply them to Route53 (AWS's DNS service). Multiple Enactors running in parallel provide redundancy: if one Enactor fails, others continue. In theory.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb</h4>\n<p>DNS Enactor A began applying an older DNS plan but encountered unusual delays — it kept getting blocked trying to update records and moved painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: 'Is my plan newer than what's currently active?' At the time of that check, it was. 
But by the time Enactor A actually finished applying the plan, the world had moved on — newer plans had been created and applied. The staleness check was now stale itself.</p>\n<hr />\n<h3>Cause</h3>\n<h4>The Race Condition Fires — Enactor B Wins, Then Cleans Up</h4>\n<p>While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.</p>\n<hr />\n<h3>Solution</h3>\n<h4>11:48 PM PDT: Total DNS Blackout</h4>\n<p>At 11:48 PM PDT on October 19, every system trying to connect to DynamoDB in US-EAST-1 via the public endpoint received DNS failures. No IP address. No connection. The failure was immediate and total — not a degradation, not elevated error rates, but a complete inability to resolve the DynamoDB endpoint. Internal AWS services relying on DynamoDB for control-plane operations went down alongside customer traffic. EC2's Droplet Workflow Manager lost its ability to track instance lease state. IAM couldn't validate credentials. Lambda couldn't execute. The cascade was underway.</p>\n<hr />\n<h3>Result</h3>\n<h4>15 Hours of Cascading Failure — and Manual Recovery</h4>\n<p>Engineers identified the DNS issue by 12:38 AM PDT and began temporary mitigations by 1:15 AM PDT. DynamoDB itself recovered by approximately 2:25 AM PDT — roughly three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager with a backlog of expired instance leases it couldn't process. 
The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilize. The fix for the automation itself was brutal in its simplicity: engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilize. Full recovery across all services wasn't complete until late afternoon on October 20 — roughly 15 hours after the cascade began.</p>\n<hr />\n\n<blockquote>\n<p><strong>🌐</strong></p>\n<p>Ookla, the network intelligence company behind Speedtest.net, recorded over <strong>17 million outage reports</strong> across more than <strong>3,000 organizations</strong> during the incident. Independent measurements showed 20 to 30 percent of all internet-facing services experienced disruptions at peak impact. The US, UK, and Germany were hit hardest.</p>\n</blockquote>\n\n<h3>What Actually Went Dark</h3>\n<p>The list of affected services illustrates something important about how the modern internet is structured. US-EAST-1 is AWS's default region — the one developers reach for first, the one that has the most mature service availability, the one that decades of 'just deploy to us-east-1' decisions have concentrated critical infrastructure in. 
Even services claiming multi-region redundancy often still rely on US-EAST-1 for authentication, metadata stores, or database calls — dependencies that only become visible when US-EAST-1 goes dark.</p>\n<p>Major services and platforms affected by the October 19–20, 2025 AWS US-EAST-1 outage</p><div><table><caption>Major services and platforms affected by the October 19–20, 2025 AWS US-EAST-1 outage</caption><thead><tr><th>Category</th><th>Affected Services</th></tr></thead><tbody><tr><td>Social & Entertainment</td><td>Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Disney+, Hulu, Twitch</td></tr><tr><td>Finance & Payments</td><td>Coinbase, Venmo, several UK banks (Lloyds, Halifax)</td></tr><tr><td>Smart Home & IoT</td><td>Amazon Ring cameras, Amazon Alexa, Eight Sleep</td></tr><tr><td>Communications</td><td>Signal, several enterprise communication platforms</td></tr><tr><td>Government</td><td>UK HMRC tax authority</td></tr><tr><td>Travel</td><td>United Airlines app, Delta app</td></tr><tr><td>AI & Developer Tools</td><td>Perplexity AI, Pokémon GO</td></tr><tr><td>AWS Services (internal)</td><td>EC2, IAM, STS, Lambda, S3, SQS, Amazon Connect, Redshift (140+ services total)</td></tr></tbody></table>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Control Plane Problem: Why DynamoDB's Failure Was Uniquely Catastrophic</p><p>A typical service outage takes down the service that failed. The October 2025 DynamoDB outage was different because DynamoDB is infrastructure for infrastructure. <strong>EC2 uses DynamoDB to track compute instance state. IAM uses DynamoDB to store and retrieve access policies. Lambda uses DynamoDB for execution state. STS uses DynamoDB to validate tokens.</strong> When DynamoDB became unreachable, these services couldn't perform their core functions — not because they had their own bugs, but because the foundation they relied on had disappeared. 
This is a control-plane failure, and control-plane failures cascade differently from data-plane failures: they don't just take down what failed, they take down the ability to manage everything else.</p>\n</blockquote>\n\n<h3>The EC2 Congestive Collapse: Why Recovery Took 12 Extra Hours</h3>\n<p>DynamoDB's DNS was restored in approximately three hours. But the outage lasted 15. The reason was EC2's <strong>Droplet Workflow Manager (DWFM)</strong> — the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous queue of backlogged lease management tasks — all trying to execute simultaneously. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. Network state recovery from this congestive collapse took more than five additional hours after DynamoDB was fixed.</p>\n<blockquote>\n<p><strong>THE ANTI-PATTERN: WHEN AUTOMATION PREVENTS RECOVERY</strong></p>\n<p>The most counterintuitive part of the recovery was that engineers had to <strong>disable automatic failover</strong> to stabilize the system. The automatic failover mechanisms — designed to move traffic to healthy systems when failures are detected — were themselves contributing to the instability. With DNS records in an inconsistent state, the failover systems were flip-flopping: detecting failures, triggering failovers, detecting those failovers as failures, triggering more failovers. The automation designed to speed recovery was making recovery impossible. Engineers had to manually turn it off, let the system reach a stable state, and then re-enable it with the correct DNS records in place. 
This is one of the most instructive moments in the incident: sometimes, the recovery automation has to be stopped before recovery can begin.</p>\n</blockquote>\n\n<ul>\n<li><strong>~3 hrs</strong> — Time from incident start to DynamoDB DNS restoration — engineers had to manually diagnose, understand the inconsistent DNS state, and correct it since automated systems couldn't self-recover</li>\n<li><strong>12+ hrs</strong> — Additional hours EC2's Droplet Workflow Manager required to clear its congestive collapse from accumulated expired lease backlog after DynamoDB recovered</li>\n<li><strong>140+</strong> — AWS services eventually affected by the cascade — because DynamoDB powers the control planes of EC2, IAM, Lambda, STS, and dozens of other foundational services</li>\n<li><strong>$581M</strong> — Estimated insurance losses from the 15-hour outage, per CyberCube cyber risk analytics — representing disruption to thousands of businesses globally dependent on US-EAST-1</li>\n</ul>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade</h3>\n<p>AWS published its official post-incident report three days after the October 20 event. The fixes address five distinct failure layers: the Enactor race condition, the cleanup automation that deleted active records, the velocity of NLB failover, the untested EC2 recovery workflow, and the failover automation that flip-flopped during recovery. 
Each fix targets a specific mechanism that allowed the failure to happen or to propagate.</p>\n<p>AWS's five-layer post-incident fix plan, derived from the official post-incident summary published October 23, 2025</p><div><table><caption>AWS's five-layer post-incident fix plan, derived from the official post-incident summary published October 23, 2025</caption><thead><tr><th>Failure Layer</th><th>What Went Wrong</th><th>AWS's Fix</th></tr></thead><tbody><tr><td>DNS Enactor race condition</td><td>Two Enactors ran concurrently; Enactor A's stale staleness check allowed it to overwrite Enactor B's newer plan</td><td>Stronger staleness validation in the Enactor before applying plans — the check must reflect the current state of the world at time of application, not at time of plan pickup</td></tr><tr><td>Cleanup automation</td><td>The cleanup job deleted Enactor A's just-applied old plan because it appeared many generations old — wiping all DNS records in the process</td><td>Safeguards to ensure no automated process can delete or remove an active DNS plan — any plan being actively used as the live record is protected from cleanup regardless of its generation number</td></tr><tr><td>NLB failover velocity</td><td>Network Load Balancers moved large amounts of capacity during AZ failover triggered by the DNS failure, amplifying the cascade</td><td>Velocity control mechanism for NLBs to limit how much capacity a single NLB can remove when health check failures cause AZ failover — preventing AZ-level failures from creating region-level capacity evaporation</td></tr><tr><td>EC2 recovery workflow</td><td>EC2's DWFM entered congestive collapse when DynamoDB recovered and the backlogged lease queue overwhelmed the system — a failure mode that had not been tested</td><td>Additional test suite to exercise the DWFM recovery workflow at scale — catching congestive collapse scenarios before they happen in production rather than discovering them during outage 
recovery</td></tr><tr><td>Automatic failover during recovery</td><td>Failover automation flip-flopped during recovery, requiring manual disabling before stabilization could occur</td><td>Review of failover automation behavior during degraded DNS states — automation must detect the difference between 'service is down' and 'DNS is inconsistent during recovery' and respond differently to each</td></tr></tbody></table>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Unstated Root Cause: The Architecture of Trust in US-EAST-1</p><p>AWS's post-mortem addressed the technical race condition correctly. But the incident exposed a deeper architectural problem that no single fix resolves: <strong>the internet's implicit trust in US-EAST-1.</strong> AWS designed US-EAST-1 as a region — a geographic cluster of data centers meant to be one of many independently redundant deployment targets. Over 20 years, it became something else: the default region for millions of applications, the region where foundational services were first available, and the region that even 'multi-region' architectures often quietly depend on for authentication, metadata, or coordination. Ring cameras depend on it for authentication. Venmo depends on it for payment processing. UK government services depend on it for API calls. None of these dependencies were meant to create a single point of failure. But that's what they became.</p>\n</blockquote>\n\n<h3>The Test Coverage Gap: You Can't Fully Test Massive Scale Without Massive Scale</h3>\n<p>One of the most honest admissions in AWS's post-incident report is about test coverage. The DWFM recovery workflow — the path EC2's Droplet Workflow Manager follows when it needs to process a massive backlog of expired leases after a DynamoDB outage — had not been adequately tested at the scale required to discover the congestive collapse failure mode. AWS's response is to build additional test suites specifically for this recovery workflow. 
But the admission surfaces a fundamental challenge of hyperscale infrastructure: <strong>the failure conditions that matter most are the ones that only occur at scale, and at scale, test environments are approximations of production, not replicas of it.</strong> The only complete test of how AWS's systems behave during a DynamoDB outage recovery is an actual DynamoDB outage. This is the same insight that drove Netflix to build Chaos Monkey — except that for a cloud provider, you cannot deliberately cause a DynamoDB outage to test the recovery path.</p>\n<blockquote>\n<p><strong>THE HIDDEN CROSS-REGION DEPENDENCY PROBLEM</strong></p>\n<p>The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: <strong>regions that are called independent but aren't.</strong> AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1 or AP-SOUTHEAST-1. But control-plane dependencies — authentication services, metadata stores, quota management systems — create invisible cross-region ties. When the control plane fails in one region, services in other regions that depend on that control plane for any operation fail with it. True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organizations building on cloud infrastructure, this is not the architecture they have — it is the architecture they think they have.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The October 2025 DynamoDB outage is a case study in what distributed systems engineers call a <em>control-plane failure</em> — a class of failure that is categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service. 
To understand why the failure cascaded so far and recovered so slowly, you need to understand the three layers of the failure: the DNS automation race condition, the DynamoDB control-plane dependency web, and EC2's Droplet Workflow Manager congestive collapse.</p>\n<h3>The DNS Race Condition: Step-by-Step</h3>\n<p><a href=\"https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>The Cascade: How DynamoDB's DNS Failure Propagated</h3>\n<p><a href=\"https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>WHY US-EAST-1 BECAME A SINGLE POINT OF FAILURE FOR THE INTERNET</strong></p>\n<p>AWS designed its regions to be independently operable — a failure in US-EAST-1 should not affect EU-WEST-1. This design intention is correct, but the reality that emerged over 20 years is different. US-EAST-1 is where AWS first launched most services, so it accumulated the most mature feature sets. It became the default — the region developers reach for first, the region enterprises trust most deeply. Over time, even architectures claiming multi-region resilience often retain quiet dependencies on US-EAST-1 for authentication flows, control-plane coordination, or foundational database calls. <strong>The technical independence of regions is real. The operational independence, as experienced during the October 2025 outage, is not.</strong></p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Automatic Failover Anti-Pattern</p><p>One of the most practically instructive moments of the October 2025 outage was the decision to <strong>manually disable automatic failover</strong> to allow recovery to proceed. 
The automatic failover systems — designed to improve availability — were detecting the DNS inconsistency as failures and triggering failovers, which created new inconsistencies, which triggered more failovers. The automation was creating a feedback loop that prevented stabilization. Engineers had to turn it off to let the system reach a stable state. The lesson: automatic recovery systems need to distinguish between 'the service is down' (trigger failover) and 'the DNS state is inconsistent during manual recovery' (pause failover until DNS is stable). Automation that cannot make this distinction can block recovery instead of speeding it up.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The October 2025 DynamoDB outage is one of the most technically instructive incidents in cloud computing history — not because the root cause was complex, but because it was so simple, and yet it cascaded so far. A race condition between two DNS Enactors, compounded by a cleanup job. The most consequential bug is often the one you're sure you've already solved.</p>\n<ol>\n<li><strong>01.</strong> <strong>Staleness checks must be evaluated at time of use, not time of pickup.</strong> Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state changes between the check and the action, the check must be re-evaluated immediately before the action — not cached from a prior point in the workflow. 
This is <em>TOCTOU</em> (Time-of-Check to Time-of-Use — a class of race condition where the condition being checked changes between when it is checked and when it is acted upon, causing the action to operate on incorrect assumptions) — one of the oldest race condition patterns in computer science — appearing in production at AWS scale.</li>\n<li><strong>02.</strong> <strong>No automated process should be able to delete an active record.</strong> The cleanup job's design — delete plans that are significantly older than the most recently applied plan — had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: <em>the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number.</em> This invariant is simpler than the cleanup logic that violated it.</li>\n<li><strong>03.</strong> Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed. <strong>EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery.</strong> The scenario seemed unlikely enough to skip in testing, and specific enough that staging environments couldn't reproduce it. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.</li>\n<li><strong>04.</strong> Multi-region architecture must be evaluated not just for application code but for <em>control-plane dependencies</em> (the hidden dependencies that applications have on cloud provider management systems — authentication services, metadata stores, quota management — which can create cross-region failure modes even when application code is deployed in multiple regions). 
Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. <strong>True regional independence requires independently redundant control planes, not just independently deployed application code.</strong></li>\n<li><strong>05.</strong> Sometimes, the recovery automation has to stop before recovery can start. The engineers who manually disabled automatic failover to stabilize the system were making the right call — but it required human judgment to recognize that the automation was making things worse rather than better. <strong>Build your recovery playbooks to include the question: 'Is any automated system currently making this worse?'</strong> The answer is occasionally yes, and having a clear path to pause automation is as important as having automation in the first place.</li>\n</ol>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>What Good Regional Independence Actually Looks Like</p><p>The October 2025 outage drew a clear line between companies that had genuine multi-region independence and those that believed they did. Genuine independence requires: application code deployed in at least two regions; <strong>authentication, authorization, and metadata services independently operational per region</strong>; no synchronous cross-region API calls in the critical path; tested failover that has been exercised under real load; and runbooks that don't assume a specific region is available. The companies whose services stayed up during the October 2025 outage weren't lucky. They had made specific architectural decisions years earlier — decisions that cost money and engineering time — that happened to be exactly the right decisions.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE PRACTICAL RESPONSE FOR EVERY ENGINEERING TEAM</strong></p>\n<p>The October 2025 AWS outage has a direct implication for every team running production workloads on cloud infrastructure. 
<strong>Map your US-EAST-1 dependencies before the next outage, not during it.</strong> Specifically: identify every service your application calls that is hosted in US-EAST-1, even if your application code is deployed elsewhere. This includes authentication providers, CDN origins, third-party APIs, and internal microservices. For each dependency, ask: 'If this endpoint returned no DNS records for three hours, what would our users experience?' The answer to that question is your actual blast radius for a US-EAST-1 control-plane failure — and it is almost certainly larger than your architecture diagram suggests.</p>\n</blockquote>\n\n<blockquote><p>Amazon Web Services runs infrastructure at a scale where the cost of a single race condition is measured in hundreds of millions of dollars and 375 million users unable to send a Snapchat. The race condition itself — two processes trying to update the same state concurrently, with a stale check allowing a stale write — is the kind of bug that appears in computer science textbooks under 'concurrent programming gotchas.' The lesson isn't that AWS made an obvious mistake. 
The lesson is that obvious mistakes at sufficient scale have non-obvious consequences, and the gap between 'finding the bug' and 'recovering from the bug' was twelve hours wide.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/aws-dynamodb-dns-outage-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-31T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.292804+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "AWS"]}, {"id": "https://techlogstack.com/explore/google-cloud-service-control-outage-2025/", "url": "https://techlogstack.com/explore/google-cloud-service-control-outage-2025/", "title": "Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse", "summary": "On June 12, 2025, a null pointer exception in Google Cloud's Service Control binary — deployed May 29 without error handling or feature flag protection — crashed API", "content_html": "<p><strong>Google</strong> · Distributed Systems · 31 May 2026</p>\n<p>On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control — the system that authorizes every single API request across Google Cloud. The code had a bug: it couldn't handle a null value. But the bug was invisible during deployment because it could only be triggered by a specific type of policy data that hadn't appeared yet. Two weeks later, on June 12, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and refused to restart without eating itself. Spotify went down. 
Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed the infrastructure — making the recovery worse than the crash.</p>\n<ul>\n<li>50+ Google Cloud services affected, including IAM, Compute Engine, Cloud Storage, BigQuery</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.</p><p><em>— Google Cloud — Official Incident Report, June 14, 2025</em></p></blockquote>\n<p>Service Control is not a product you've heard of. It doesn't have a marketing page or a conference talk. It exists in the infrastructure layer beneath everything else — the system that authorizes every API request across Google Cloud and Google Workspace before that request is allowed to proceed. If you call the Cloud Storage API, Service Control checks your quota. If you authenticate with Google IAM, Service Control validates your policy. If your app on Google Cloud makes any call to any Google service, Service Control is in the critical path. It is, in the most literal sense, the gatekeeper of the entire platform.</p>\n<p>When Service Control crashed on June 12, 2025, it didn't just take down one service. It took down the authorization layer for every service. API calls returned 503 errors not because the underlying services had failed, but because the gatekeeper wasn't there to let them through. Compute Engine instances were running. Cloud Storage buckets were intact. BigQuery jobs were ready to execute. 
None of it mattered — because without Service Control, nothing could be authorized, and nothing unauthorized can proceed in a correctly secured cloud platform.</p>\n<blockquote>\n<p><strong>WHAT SERVICE CONTROL ACTUALLY DOES</strong></p>\n<p>Google Cloud's Service Control system performs three functions on every API request: <strong>authentication</strong> (is this requester who they claim to be?), <strong>authorization</strong> (are they allowed to perform this operation?), and <strong>quota enforcement</strong> (have they exceeded their usage limits?). It processes these checks at massive scale across every region — billions of API calls per day — using policy metadata stored in and synchronized across Spanner, Google's globally distributed database. The May 29 code change was adding more sophisticated quota checking logic to this pipeline. The change worked correctly in every scenario that was tested. The scenario that wasn't tested was the one that appeared on June 12.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>May 29: Code Deployed — Bug Present, But Invisible</h4>\n<p>Google engineers deployed new quota policy checking code to Service Control. The deployment went through the standard region-by-region rollout and passed all checks. But the new code path had two critical gaps: no error handling for null values, and no feature flag to disable it if something went wrong. The bug was invisible during rollout because the problematic code path could only be triggered by a specific type of policy input — blank fields in the policy metadata. That input hadn't appeared during rollout. The binary was now running in every region with a loaded trap, waiting for the right trigger.</p>\n<hr />\n<h3>Cause</h3>\n<h4>June 12, 10:45 AM PDT: The Policy Update That Pulled the Trigger</h4>\n<p>An automated system inserted a routine policy change into the regional Spanner tables that Service Control uses for policy metadata. 
The policy update contained unintended blank fields — values that should have been populated but weren't. Because quota management is global, the Spanner replication engine distributed this metadata worldwide within seconds. Every Service Control binary in every region hit the new code path, encountered the null values, and threw a null pointer exception. Without error handling, the exception crashed the binary. Service Control was dead globally.</p>\n<hr />\n<h3>Solution</h3>\n<h4>The SRE Response: Diagnosis in 10 Minutes, Red Button in 40</h4>\n<p>Google's Site Reliability Engineering team began triaging within two minutes of the first alert. They identified the root cause — the null pointer exception in the new quota checking code path — within 10 minutes. Engineers deployed a 'red button' kill switch within 40 minutes to disable the problematic serving path. Most regions began recovering within two hours. But us-central1, Google's largest region, hit a second problem: the recovery itself.</p>\n<hr />\n<h3>Result</h3>\n<h4>The Herd Effect: When Recovery Made Things Worse</h4>\n<p>As Service Control instances restarted in us-central1 after the red button was deployed, they all simultaneously reached for the regional Spanner database to load their policy metadata. Hundreds of instances, all restarting at the same moment, all hitting Spanner at the same time, with no randomization in their startup sequence. The Spanner database — which had been handling steady-state read traffic fine — was overwhelmed by the simultaneous burst. Service Control couldn't load its policies, which meant it couldn't restart properly, which meant it kept trying, which kept hitting Spanner. The recovery created a herd effect that prolonged the outage in us-central1 by more than two hours beyond when other regions had stabilized. 
Full resolution across all services wasn't complete until 18:18 PDT — more than seven hours after the incident began.</p>\n<hr />\n\n<blockquote>\n<p><strong>🔴</strong></p>\n<p>Google's own <strong>Cloud Service Health dashboard went down</strong> during the incident — the monitoring system that customers rely on to track outage status was itself affected by the outage it was supposed to report. Engineers and customers trying to understand what was happening couldn't access the standard communication channel. A status page that goes down during the incident it's supposed to report is a monitoring anti-pattern at its most consequential.</p>\n</blockquote>\n\n<h3>What Went Dark: The Third-Order Cascade</h3>\n<p>The blast radius of the June 12 outage had three concentric rings. The innermost ring was Google's own services: Google Cloud Platform APIs, Google Workspace (Gmail, Calendar, Drive, Docs, Meet), IAM, Cloud Storage, Compute Engine, BigQuery, Cloud SQL, Cloud Spanner, Vertex AI, Cloud Monitoring. The second ring was companies building directly on GCP — Spotify, Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor — whose applications were unable to authorize any backend calls. The third ring was the one that made this incident uniquely instructive: <strong>companies that depend on Cloudflare, which depends on Google Cloud</strong>. Cloudflare — itself one of the internet's core infrastructure providers — uses Google Cloud for some of its backend operations. When Google Cloud's Service Control failed, Cloudflare experienced partial degradation, which in turn degraded Discord, Twitch, and other services that had built on top of Cloudflare. This is third-order cascading failure: Google fails → Cloudflare degrades → Discord goes down. 
Discord's users had no idea their outage had anything to do with a null pointer exception in a Google quota management system.</p>\n<p>The three-ring cascade of the June 12, 2025 Google Cloud outage — and the dependency chain that connected them</p><div><table><caption>The three-ring cascade of the June 12, 2025 Google Cloud outage — and the dependency chain that connected them</caption><thead><tr><th>Failure Ring</th><th>What Failed</th><th>Why</th></tr></thead><tbody><tr><td>First: Google's own infrastructure</td><td>Cloud IAM, Compute Engine, Cloud Storage, BigQuery, Cloud SQL, Cloud Spanner, Vertex AI, Cloud Monitoring, Google Workspace (Gmail, Calendar, Drive, Docs, Meet)</td><td>Service Control — the authorization gateway — crashed globally, blocking all API requests across GCP and Workspace</td></tr><tr><td>Second: Direct GCP customers</td><td>Spotify (~46K outage reports), Snapchat, Fitbit, Replit, GitLab, Shopify, Character.AI, Cursor, Perplexity AI</td><td>Applications built on GCP couldn't authorize any backend calls — services appeared down to users even though underlying compute was running</td></tr><tr><td>Third: Cloudflare and its customers</td><td>Cloudflare (partial degradation), Discord, Twitch</td><td>Cloudflare uses Google Cloud for certain backend operations; when those degraded, Cloudflare's services partially degraded, cascading to Cloudflare's own customers</td></tr></tbody></table>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Dormant Trap: Why Staged Rollouts Didn't Catch This</p><p>Google's staged, region-by-region rollout is exactly the right practice for catching bugs introduced by new deployments. It worked correctly for 14 days — no failures appeared during the May 29 rollout because the failure condition required specific policy data (blank fields) that hadn't yet been inserted. The bug was a <strong>dormant trap</strong>: present in production, but invisible until the exact trigger arrived. 
This is a class of bug that staged rollouts are structurally unable to catch — because the rollout environment and the trigger environment are separated by two weeks and an automated policy change that nobody controlled. <strong>The only defenses against dormant traps are error handling (so the crash doesn't happen when the trigger arrives) and feature flags (so the code path can be disabled immediately when the trigger produces unexpected behavior).</strong> The May 29 change had neither.</p>\n</blockquote>\n\n<ul>\n<li><strong>10 min</strong> — Time for Google's SRE team to identify the root cause — null pointer exception in the new quota checking code path — from the first alert at 10:49 AM PDT on June 12</li>\n<li><strong>40 min</strong> — Time to deploy the 'red button' kill switch that disabled the problematic Service Control serving path and allowed most regions to begin recovery</li>\n<li><strong>7+ hrs</strong> — Total outage duration — most regions recovered within 2 hours, but the herd effect in us-central1 extended full resolution to 18:18 PDT</li>\n<li><strong>50+</strong> — Google Cloud services affected, including all core infrastructure APIs, all Google Workspace products, and all AI/ML services including Vertex AI</li>\n</ul>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Google's Response: Five Commitments After the Outage</h3>\n<p>Google's incident report, published June 14, 2025, outlined specific remediation steps across five categories. 
Each addresses a distinct failure mode that either caused the outage or made it worse than it needed to be.</p>\n<div><table><caption>Google's five-category post-incident remediation plan, derived from the official June 14, 2025 incident report</caption><thead><tr><th>Failure Mode</th><th>What Happened</th><th>Google's Fix</th></tr></thead><tbody><tr><td>Missing error handling</td><td>The new quota checking code had no null-safety — when blank fields appeared, a null pointer exception crashed the binary</td><td>Mandatory null-safe code patterns for all Service Control code paths, with additional static analysis to catch null pointer vulnerabilities before deployment</td></tr><tr><td>No feature flag</td><td>Without a feature flag, the new code path could not be disabled without a full binary redeployment — adding 30+ minutes to initial response time</td><td>Feature flag protection required for all new code paths in Service Control — a flag would have allowed the problematic path to be disabled within seconds, before most regions crashed</td></tr><tr><td>Herd effect during recovery</td><td>Hundreds of Service Control instances restarting simultaneously all hit Spanner at the same time, overwhelming it and prolonging the us-central1 outage by 2+ hours</td><td>Randomized exponential backoff on Service Control startup — instances restart with jittered delays so Spanner load is distributed over time rather than concentrated in a burst</td></tr><tr><td>Status page availability</td><td>Google's Cloud Service Health dashboard went down during the outage, removing the primary customer communication channel</td><td>Decoupled status infrastructure — the status page must be architecturally independent of the services it monitors, with its own Service Control dependency removed</td></tr><tr><td>Service Control architecture</td><td>Service Control is a monolithic 
regional binary — a crash in Service Control takes down all API authorization for the entire region simultaneously</td><td>Modularize Service Control's architecture — isolate the quota checking component so a crash in quota logic does not crash the authentication and authorization components</td></tr></tbody></table></div>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Feature Flag That Would Have Saved Seven Hours</p><p>The most consequential missing safeguard in the June 12 outage was the absence of a feature flag on the new quota checking code path. A feature flag — a configuration switch that enables or disables a code path without a redeployment — would have changed the incident timeline dramatically. When the null pointer exceptions began firing at 10:49 AM PDT, engineers with a feature flag could have disabled the new code path across all regions within seconds or minutes, before the crash had spread globally. Without a feature flag, the only option was a red-button kill switch that required a new binary deployment — a process that took 40 minutes and still left the herd effect problem during restart. <strong>40 minutes of global Service Control outage versus seconds of a feature flag toggle.</strong> Google's own incident report acknowledges this directly: 'If this had been flag protected, the issue would have been caught in staging.' The cost of not having a feature flag was measured in hundreds of millions of users unable to access their services for up to seven hours.</p>\n</blockquote>\n\n<h3>The Herd Effect: A Recovery Anti-Pattern With a Known Fix</h3>\n<p>The herd effect that prolonged the us-central1 outage is not a new problem. It has been documented since the earliest days of distributed systems: when many clients restart simultaneously after a shared dependency recovers, they all connect simultaneously and overwhelm the dependency, preventing it from returning to steady state. 
The canonical solution — randomized exponential backoff — is equally well-documented and simple: when restarting, add a random delay so clients stagger their reconnection attempts over a time window rather than clustering them at a single instant. Every Service Control instance waiting exactly zero milliseconds before hitting Spanner on restart is the problem. Service Control instances waiting a random delay between 0 and 30 seconds before hitting Spanner on restart is the solution. Google committed to implementing this. The fact that it required an outage to prompt the implementation is a reminder that known fixes for known problems often go unimplemented until the cost of not implementing them is paid in production.</p>\n<blockquote>\n<p><strong>MTTD VS DURATION: WHAT THE NUMBERS ACTUALLY TELL YOU</strong></p>\n<p>Google's SRE team began triaging within two minutes of the first alert. This is elite incident response. The MTTD (Mean Time to Detect) was near-instantaneous, and the root cause diagnosis took 10 minutes. These are remarkable numbers for a global infrastructure failure of this complexity. The lesson is not that Google's response was slow — it was fast. The lesson is that even elite incident response cannot compensate for missing preventative safeguards. <strong>Feature flags, error handling, and randomized backoff would have prevented or dramatically shortened the outage before any human had time to respond.</strong> The SRE team's quality is demonstrated by the 10-minute diagnosis. The systems quality gap is demonstrated by the 7-hour duration.</p>\n</blockquote>\n\n<h3>The Broader Lesson: Cleanup Operations Are the Hardest Code to Get Right</h3>\n<p>The June 12 outage shares an important structural pattern with the October 2025 AWS DynamoDB incident: the trigger was not a complex new feature or an ambitious architectural change. 
It was a routine operation — in AWS's case, a cleanup job that deleted stale DNS plans; in Google's case, an automated policy update that inserted routine quota metadata. <strong>Routine operations are the hardest to protect against because they're not treated with the same scrutiny as new features.</strong> A new feature gets code review, testing, staged rollout, and monitoring. A routine automated policy update is assumed to be safe because it's been running correctly for months. But when the underlying system has changed — when new code is now in the critical path that couldn't handle what the routine operation produces — the routine operation becomes the trigger for a catastrophic failure. The implication is uncomfortable: every automated operation that modifies system-critical metadata must be treated as a potential trigger for any latent bugs in the code that consumes that metadata.</p>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Service Control sits at the intersection of every API request Google Cloud processes. Understanding how it failed — and why the failure spread so quickly and recovered so slowly — requires understanding three things: the role of Spanner as the global policy data store, the absence of safe failure handling in the new code path, and the herd effect as a predictable consequence of synchronized restart under load.</p>\n<h3>Normal Flow vs. 
June 12 Failure: What Service Control Does on Every Request</h3>\n<p><a href=\"https://techlogstack.com/explore/google-cloud-service-control-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>The Herd Effect: Why us-central1 Recovery Took 2+ Extra Hours</h3>\n<p><a href=\"https://techlogstack.com/explore/google-cloud-service-control-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE GLOBAL SPANNER REPLICATION TRAP</strong></p>\n<p>The reason the June 12 failure was global rather than regional was Spanner's design strength working against Google in this case. Spanner is Google's globally distributed database, engineered to replicate data to all regions in real time — typically within seconds. This replication is what makes Spanner so powerful for global consistency. On June 12, it was what made the failure instantaneous and universal. When the automated system inserted the policy update with blank fields into the regional Spanner tables, <strong>Spanner replicated that policy data to every region within seconds.</strong> Every Service Control instance in every region hit the null pointer at essentially the same moment. There was no regional staging, no propagation delay, no opportunity for an alert to fire in one region before the failure had spread to all others. The same architecture that gives Spanner its global consistency guarantee gave this bug its global blast radius.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Status Page That Went Dark</p><p>Google's Cloud Service Health dashboard — the system that customers rely on to understand what Google services are experiencing — went offline during the June 12 outage. 
This happened because the status infrastructure shared a dependency on the same Google Cloud services that were failing. <strong>A status page that fails during a widespread outage is not just unhelpful — it is actively harmful.</strong> Customers experiencing failures couldn't access the standard channel to confirm they weren't the source of the problem, couldn't track recovery progress, and couldn't communicate accurate information to their own stakeholders. The status page being down created a second outage: an outage of information. Google's commitment to decoupling status infrastructure from the services it monitors is not a nice-to-have. It is the baseline requirement for maintaining communication with customers during incidents.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The June 12, 2025 Google Cloud outage carries lessons that apply to every engineering team — from startups deploying to a single cloud region to hyperscalers managing global infrastructure. The failure modes are not exotic. They are the canonical patterns of distributed systems engineering: missing error handling, absent feature flags, synchronized recovery storms. The scale is Google's. The lessons belong to everyone.</p>\n<ol>\n<li><strong>01.</strong> <strong>Error handling is not optional for code that runs in the critical path of a globally distributed system.</strong> The null pointer exception that crashed Service Control was caused by a missing two-line null check. Any code path that processes external data — data that arrives from an automated system and could contain unexpected values — must explicitly handle the unexpected cases. The failure condition was not unpredictable. Blank fields in policy metadata are a predictable input variation. 
The code should have anticipated it.</li>\n<li><strong>02.</strong> <em>Feature flags</em> (a software engineering practice where new code is deployed but kept inactive until explicitly enabled via configuration, allowing teams to disable problematic features instantly without redeployment — sometimes called a kill switch or dark launch) on infrastructure code are not optional — they are the minimum viable safety mechanism for any code that processes global-scale policy data. The difference between 'feature flag enabled, issue caught in staging' and 'no feature flag, 7-hour global outage' is one line of configuration. Every new code path in a globally deployed binary should require a feature flag as a deployment prerequisite, not a nice-to-have.</li>\n<li><strong>03.</strong> The herd effect — the <em>thundering herd</em> (a distributed systems failure mode where many clients simultaneously attempt to reconnect to a shared resource after it recovers, overwhelming the resource and preventing it from returning to stable operation) in its classic form — is a known failure mode with a known fix. <strong>Randomized exponential backoff on service restart is the standard solution, and it has been documented for decades.</strong> The fact that Service Control lacked it is a reminder that well-known fixes go unimplemented until the cost of not implementing them becomes acute. Build randomized backoff into any service that has a shared dependency it needs to reconnect to after a failure.</li>\n<li><strong>04.</strong> <strong>Your monitoring infrastructure must be architecturally independent of the services it monitors.</strong> A status page that goes down during an outage is a second outage layered on top of the first. 
This means no shared dependencies between the monitoring stack and the application stack, separate cloud regions or providers for status infrastructure, and tested independence — verifying that a full outage of the primary platform does not affect the observability layer. This is not easy to build, but it is essential. The moment customers need status information most is exactly the moment a shared-dependency status page is most likely to be unavailable.</li>\n<li><strong>05.</strong> Third-order cascade failures are invisible until they happen. Discord's users had no idea their outage originated in a null pointer exception in Google's quota management code. The dependency chain was opaque: Discord → Cloudflare → Google Cloud → Service Control → policy metadata blank fields. <strong>Every engineering team should map their dependency chain at least two levels deep</strong> — not just 'we use Cloudflare' but 'Cloudflare uses Google Cloud, and a Google Cloud outage of sufficient scope will reach us through Cloudflare.' This mapping informs both architectural decisions and incident response communication.</li>\n</ol>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Modularization Commitment</p><p>Google's most architecturally significant post-incident commitment was to modularize Service Control — isolating the quota checking component so that a failure in quota logic cannot crash the authentication and authorization components. Currently, Service Control is a single binary: a null pointer in any component crashes everything. In a modularized architecture, the quota checking subsystem can crash or be restarted without affecting authentication. This is the same architectural move GitHub is making with service isolation — and it is the right call for the same reason: blast radius containment. The cost of modularizing a critical monolithic binary is high. 
The cost of not doing it is measured in seven-hour global outages.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE IRONY OF THE RECOVERY MAKING THINGS WORSE</strong></p>\n<p>The most memorable aspect of the June 12 outage is the herd effect: <strong>the act of fixing the crash created a new problem that extended the outage by hours.</strong> This pattern — where the recovery mechanism amplifies the damage — appears across some of the most instructive engineering post-mortems in the industry. Netflix built Chaos Monkey partly because they discovered that their graceful degradation paths, when triggered simultaneously by a real failure, could overload the systems they were supposed to protect. AWS's October 2025 DynamoDB outage extended 12 extra hours because the automatic failover mechanisms were making the DNS inconsistency worse. And Google's June 2025 Service Control outage extended beyond the crash itself because hundreds of instances restarting simultaneously overwhelmed the Spanner database they all depended on. The lesson that runs through all three: <strong>design recovery paths with the same care you design primary paths — because recovery paths are the code that runs when everything is already wrong.</strong></p>\n</blockquote>\n\n<blockquote><p>Google deployed a null pointer exception on May 29, and it sat patiently in production for two weeks waiting for exactly the wrong policy update to arrive — like a trapdoor that looks like a floor until someone with the right keycard walks over it. Then it took down Spotify, Discord, Snapchat, Cloudflare, and Google's own status page simultaneously, and when engineers hit the kill switch to fix it, the restart surge broke the database they were restarting into. There is a version of this story where the lesson is 'write a null check.' 
There is a more useful version where the lesson is: the most dangerous code in your system is the code that runs perfectly for two weeks before it fails catastrophically, because you will have completely forgotten it's there by the time it does.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/google-cloud-service-control-outage-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-31T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.378232+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "Google"]}, {"id": "https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/", "url": "https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/", "title": "Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once", "summary": "How a low-risk Envoy filter reorder triggered a latent bug, crashed every proxy instance simultaneously, and started a Kubernetes memory-limit death loop that kept S", "content_html": "<p><strong>Spotify</strong> · Reliability · 24 May 2026</p>\n<p>On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters inside their Envoy Proxy perimeter. They applied it to all regions simultaneously. Within two minutes, every Envoy instance worldwide had crashed. And then the restart loop began — a loop Kubernetes itself was powering, killing each new server as fast as it came back up. 675 million users couldn't load the app. 
Asia Pacific stayed up, and the reason why told the engineers exactly what was broken.</p>\n<ul>\n<li>675M MAU affected</li><li>48,000+ Downdetector peak reports</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>This crash happened simultaneously on all Envoy instances.</p><p><em>— Spotify Engineering — Incident Report: Spotify Outage on April 16, 2025, engineering.atspotify.com</em></p></blockquote>\n<p>There is a specific kind of engineering failure that hurts more than the others: the change that was reviewed, discussed, and approved — the change the team looked at together and agreed was fine. On April 16, 2025, Spotify's team reordered the custom filters within their <em>Envoy Proxy</em> (an open-source, high-performance edge proxy originally built at Lyft and now widely used as the networking perimeter layer in distributed systems — it receives all incoming user traffic before distributing it to backend services) perimeter. This was not a new feature, not a database migration, not a major infrastructure overhaul. It was a filter reorder. The team assessed it as <strong>low risk</strong>. They applied it to all cloud regions simultaneously. Two minutes later, <strong>every single Envoy instance running Spotify's networking perimeter had crashed</strong>.</p>\n<p>Spotify's perimeter is the first layer of software that receives traffic from every user worldwide — every stream request, every search, every login. It sits in front of all backend services and distributes traffic across cloud regions. To extend Envoy's capabilities, Spotify develops and maintains its own <strong>custom filters</strong> — plugins that run within Envoy to handle rate limiting, authentication, and other cross-cutting concerns. These filters execute in a defined order. The April 16 change altered that order. 
The new sequence triggered a <strong>latent bug in one of the custom filters</strong>: a code path that had existed undetected, harmless as long as the filter never received control at that position, suddenly activated. Envoy crashed. Not one instance, not one region. All of them.</p>\n<blockquote>\n<p><strong>THE DEATH LOOP: WHY THE RESTART MADE THINGS WORSE</strong></p>\n<p>An Envoy crash is normally survivable — Kubernetes detects the failed pod and starts a replacement. But what happened next on April 16 was not normal. The immediate restart of all Envoy instances, combined with <strong>client-side retry logic</strong> (every user's app and browser retrying its failed request), created an unprecedented traffic spike onto the new instances. Each new Envoy started, received the full flood of retry traffic, consumed more memory than the <em>Kubernetes memory limit</em> (the maximum memory a Kubernetes pod is allowed to use, defined in its resource spec — when a pod exceeds this limit, Kubernetes automatically terminates it regardless of what the pod is doing), and was <strong>automatically killed by Kubernetes</strong>. A new instance started. The same thing happened. The loop repeated — powered by Kubernetes itself — for hours.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>12:18 UTC — Filter Reorder Applied Globally, All Envoy Instances Crash</h4>\n<p>The change to Envoy filter execution order was applied simultaneously to all cloud regions worldwide. The new order activated a latent bug in a custom Spotify filter. Every Envoy instance on Spotify's networking perimeter crashed at the same moment. Alarms fired two minutes later as the traffic drop became measurable.</p>\n<hr />\n<h3>Cause</h3>\n<h4>The Hidden Misconfiguration: Heap Larger Than the K8s Memory Limit</h4>\n<p>The traffic flood from client retries exposed a misconfiguration that had existed undetected: Envoy's max heap size was configured higher than the Kubernetes memory limit for the pod. 
Under normal traffic, Envoy never approached its heap limit and the misconfiguration was invisible. Under the retry flood, each new instance immediately exceeded the K8s limit and was killed. This turned a crash into an infinite restart loop.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Asia Pacific Stayed Up — and Explained Everything</h4>\n<p>Asia Pacific was the only region unaffected. Engineers noticed and investigated why. The answer: lower traffic volume at that time of day (timezone difference) meant APAC Envoy instances never received enough retry traffic to exceed the K8s memory limit. The asymmetry proved the hypothesis: the death loop was memory-limit driven, not bug-driven. Fix the memory headroom, break the loop.</p>\n<hr />\n<h3>Result</h3>\n<h4>15:45 UTC — Death Loop Broken, Full Recovery</h4>\n<p>Increasing total perimeter server capacity gave each new Envoy instance enough headroom to stay under the K8s memory limit even while absorbing the retry traffic flood. The death loop broke. Instances stabilized. EU recovered at 14:20 UTC, US at 15:10 UTC, and traffic patterns were back to normal globally by 15:40 UTC, with full recovery at 15:45 UTC. Total duration: 3 hours 27 minutes.</p>\n<hr />\n\n<blockquote>\n<p><strong>🌏</strong></p>\n<p>Asia Pacific's survival was not a result of better engineering in that region. It was a result of <strong>time zones</strong>. The outage struck at 12:18 UTC — early afternoon in Europe and morning in the US, but late evening in Asia Pacific, where traffic was naturally lower. Fewer users → fewer retries → less memory pressure on new Envoy instances → stayed under the K8s limit. The region that wasn't affected was the one that happened to have the least traffic at the exact wrong moment.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Why 'Low Risk' Was Wrong</p><p>The filter reorder was assessed as low risk for a defensible reason: reordering filters does not add new code or change individual filter logic. It changes the sequence in which existing, tested filters run. 
The team's mental model was correct for most cases — but it was missing one scenario: a latent bug in a filter that only activates when that filter receives control at a specific position in the execution chain. <span><strong>Latent bugs that depend on execution context are invisible to tests that don't vary that context.</strong></span> A filter integration test suite that exercises filters in isolation or in their original order will never catch a bug that only manifests in a new order.</p>\n</blockquote>\n\n<p>The mechanism of the death loop is worth understanding in precise detail because it recurs across infrastructure outages in different forms. The pattern is: a failure event triggers a restart, the restart environment differs from steady state (here: retry flood instead of normal traffic), the restarted instance fails faster than under steady state, the restart mechanism itself (Kubernetes) becomes the engine of the failure. In Kubernetes deployments, the most common version of this pattern involves the <em>OOMKill</em> (Out-Of-Memory Kill — the Kubernetes mechanism that terminates a pod when it exceeds its configured memory limit, to protect other workloads on the node from memory starvation) cycle: a pod exceeds its memory limit under unexpected load, K8s kills it, the replacement starts with no warm state and faces the same load, K8s kills it again. The Spotify outage was this pattern at global perimeter scale.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>EnvoyCon 2025: The Custom Filter Spotify Presented</p><p>Spotify's engineering team had discussed their custom Envoy filter work publicly — including their rate limiting filter — at <strong>EnvoyCon 2025</strong>, just before the April 16 incident. The presentation described the same filter system that the April 16 change modified. The public talk was about the filter's capabilities; the April 16 postmortem was about what happened when its execution order changed. 
The two documents together give a rare complete picture: how the system was designed to work, and exactly how it broke.</p>\n</blockquote>\n\n<p>The 263 million Spotify Premium subscribers who pay for the service experienced the same outage as the 412 million free-tier users. Spotify's architecture does not provide differentiated reliability between paid and unpaid users at the perimeter layer — the same Envoy proxy handles all traffic. This is consistent with how most streaming platforms operate: the perimeter is a shared resource, and a perimeter failure is total. The 48,000+ Downdetector reports at peak represented the fraction of users actively reporting issues; the actual count of affected users was in the hundreds of millions.</p>\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>The Retry Amplification Problem</p><p>Client-side retry logic is a standard reliability feature: when a request fails, the app retries it, giving transient failures a chance to self-heal. During normal partial failures, retry logic helps. During a simultaneous total failure, retry logic becomes a <strong>load amplifier</strong>. Every user whose request failed immediately retried — some apps retry multiple times with exponential backoff. The simultaneous crash of all Envoy instances converted the normal traffic level into a retry-amplified spike: each failed request generated one or more retry requests, all arriving at the same moment the replacement instances were starting. The retry logic designed to improve reliability became a key component of the death loop.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>263 Million Premium Subscribers: No Differentiation</p><p>Spotify's perimeter architecture treats all traffic identically — there is no fast lane for paid subscribers at the proxy layer. The 263 million Premium users who pay for an ad-free, uninterrupted experience were indistinguishable from the 412 million free-tier users when the perimeter crashed. 
This is not a design flaw; building separate perimeter infrastructure for each tier would add enormous complexity. But it means that the reliability guarantee implicit in a Premium subscription depends entirely on the reliability of shared perimeter infrastructure. When the perimeter fails totally, the premium experience fails identically to the free experience.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Misconfiguration Nobody Noticed — Until the Crash</h3>\n<p>The fix for the death loop was increasing perimeter server capacity — but this addressed the symptom, not the underlying misconfiguration. The root problem was that <strong>Envoy's max heap size was set higher than the Kubernetes memory limit for the pod</strong>. In normal operation, Envoy memory usage never approached its heap maximum. The misconfiguration was invisible: Envoy wasn't crashing, K8s wasn't killing pods, monitoring wasn't alerting. The gap between heap size and K8s limit existed in the configuration for an unknown period before April 16 — it simply never mattered because Envoy memory never climbed high enough to expose it. 
The retry flood was the first event extreme enough to push instances over the K8s limit and trigger the kill cycle.</p>\n<ul>\n<li><strong>3h 27m</strong> — Total outage duration from first Envoy crash (12:18 UTC) to full global normalization (15:45 UTC) — spanning the North American morning commute and European afternoon</li>\n<li><strong>675M</strong> — Spotify monthly active users worldwide, including 263M paying Premium subscribers; users in every region except Asia Pacific experienced service degradation or complete unavailability during the incident</li>\n<li><strong>48,000+</strong> — Peak Downdetector reports — representing active user reports only; actual affected users numbered in the hundreds of millions globally excluding Asia Pacific</li>\n<li><strong>0</strong> — Regions with staged rollout before full deployment — the filter reorder was applied globally simultaneously because it was assessed as low risk, removing the safety net of incremental validation</li>\n</ul>\n\n<pre><code># THE MISCONFIGURATION: Envoy heap limit higher than K8s memory limit\n# This created a hidden gap that was invisible until the retry flood\n\n# Kubernetes pod resource specification (simplified)\napiVersion: v1\nkind: Pod\nspec:\n  containers:\n  - name: envoy\n    resources:\n      requests:\n        memory: \"2Gi\"\n      limits:\n        memory: \"3Gi\"    # K8s will OOMKill the pod above this\n\n# Envoy overload manager configuration (simplified)\noverload_manager:\n  actions:\n  - name: envoy.overload_actions.stop_accepting_requests\n    triggers:\n    - name: envoy.resource_monitors.fixed_heap\n      threshold:\n        value: 0.95  # Envoy's own heap limit: 95% of max heap\n  resource_monitors:\n  - name: envoy.resource_monitors.fixed_heap\n    typed_config:\n      max_heap_size_bytes: 4294967296  # 4GB: HIGHER than K8s 3GB limit!\n\n# Result:\n# - K8s kills at 3GB of memory usage\n# - Envoy's own safety valve triggers at 95% of 4GB = 3.8GB\n# - K8s limit is hit BEFORE Envoy's own graceful degradation 
kicks in\n# - Under normal load: Envoy peaks at ~1.5GB — misconfiguration invisible\n# - Under retry flood: Envoy climbs past 3GB → K8s OOMKills → restart\n#   → same flood → same OOMKill → infinite loop\n\n# THE FIX (immediate): Increase perimeter server count\n# More servers = same retry traffic spread across more instances\n# = each instance stays under 3GB = K8s doesn't kill = loop breaks\n\n# THE FIX (permanent): Align heap config with K8s memory limit\n# max_heap_size_bytes: 2684354560  # 2.5GB: safely below K8s 3GB limit</code></pre>\n<blockquote>\n<p><strong>WHY INCREASING CAPACITY FIXED THE LOOP</strong></p>\n<p>The death loop's engine was the K8s OOMKill cycle. The K8s memory limit was fixed. The retry traffic load was fixed (determined by user behavior). The only variable Spotify could change quickly was the <strong>number of Envoy instances sharing that retry load</strong>. More instances → each instance receives a smaller share of the retry flood → each instance's memory usage stays lower → stays under the K8s limit → K8s doesn't kill it → stable. This is why increasing capacity broke the loop: it reduced per-instance memory pressure below the kill threshold. The underlying misconfiguration (heap > K8s limit) was fixed separately afterward as a permanent remediation.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Spotify's Four Post-Incident Commitments</p><p>Spotify's postmortem committed to four specific engineering changes: <strong>(1)</strong> Fix the Envoy filter bug that caused the initial crash on filter reorder. <strong>(2)</strong> Fix the configuration mismatch between Envoy heap size and Kubernetes memory limit. <strong>(3)</strong> Improve the rollout process for configuration changes to the perimeter — staged rather than global simultaneous. <strong>(4)</strong> Improve monitoring capabilities to detect these issues earlier in the failure chain. 
Notably, the postmortem linked directly to their EnvoyCon 2025 talk, showing transparency about which system was involved rather than obscuring the component.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>What a Staged Rollout Would Have Caught</p><p>If the filter reorder had been applied to a single region first with a monitoring window before global rollout, the failure would have been a regional incident recoverable in minutes, not a global outage lasting 3.5 hours. The Envoy crash would have appeared in one region. Engineers would have investigated, found the latent bug, rolled back the filter order. Total blast radius: one region for ~10 minutes. The simultaneous global rollout removed this safety net entirely. The misconfiguration would still have existed — but would have been exposed only in one region rather than all at once.</p>\n</blockquote>\n\n<div><table><caption>Spotify Envoy Outage: Timeline of Events and Recovery Progression</caption><thead><tr><th>Time (UTC)</th><th>Event</th><th>Status</th></tr></thead><tbody><tr><td>12:18</td><td>Envoy filter order changed; all instances crash simultaneously</td><td>🔴 Global failure begins</td></tr><tr><td>12:20</td><td>Alarms triggered on traffic drop; death loop already running</td><td>🔴 Engineers paged</td></tr><tr><td>12:28</td><td>Situation escalated; only APAC serving traffic</td><td>🔴 Incident declared</td></tr><tr><td>~13:xx</td><td>Root cause identified via APAC asymmetry; capacity increase planned</td><td>🟡 Diagnosis complete</td></tr><tr><td>14:20</td><td>EU regions fully recovered</td><td>🟡 Partial recovery</td></tr><tr><td>15:10</td><td>US regions fully recovered</td><td>🟡 Partial recovery</td></tr><tr><td>15:40</td><td>All traffic patterns normal globally</td><td>🟢 Full recovery</td></tr></tbody></table></div>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Perimeter as a Shared Fate System</p><p>Spotify's Envoy 
perimeter is a <strong>shared fate system</strong> — all backend services rise and fall with it. Even if every backend service (streaming, search, auth, recommendations) remained perfectly healthy during the April 16 incident, users experienced total unavailability because the perimeter through which all requests flow was in a crash loop. This architectural property is why perimeter changes deserve the highest rollout rigor: a perimeter failure has a blast radius of every service, every user, every region simultaneously. The perimeter is where shared fate is most acute.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Spotify's networking perimeter architecture places Envoy Proxy as the outermost layer — the first software that receives every user request, regardless of what feature or backend it is ultimately destined for. Envoy runs in every cloud region and distributes incoming traffic to the appropriate backend microservices. Custom filters extend Envoy's capabilities beyond the open-source defaults: rate limiting, authentication, request routing customization. 
Understanding the outage requires understanding that when every Envoy instance worldwide crashes simultaneously, no user request can reach any backend service — the entire platform goes dark regardless of whether individual backend services remain healthy.</p>\n<h3>Spotify's Perimeter Architecture: Envoy as the Universal Traffic Gateway</h3>\n<p><a href=\"https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>The Three-Layer Failure Cascade: From Filter Bug to Death Loop</h3>\n<p><a href=\"https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE ASIA PACIFIC DIAGNOSTIC: HOW ONE REGION PROVED THE ROOT CAUSE</strong></p>\n<p>The most important engineering insight in the April 16 incident was not the initial cause — it was using Asia Pacific's survival to <strong>prove the diagnosis</strong>. When engineers observed APAC was unaffected, they had two candidate hypotheses: (A) the filter bug is region-specific, or (B) the death loop is traffic-intensity dependent. If (A), APAC has different filter config. If (B), APAC has lower traffic at this hour. Investigation confirmed (B): APAC runs identical filter configuration. Lower traffic meant less retry amplification, meaning per-instance memory pressure never reached the K8s limit. This asymmetry transformed a hard debugging problem (why is the loop happening?) into a tractable one (what's different about APAC?) and pointed directly at the memory-limit misconfiguration.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Configuration Drift in Long-Running Systems</p><p>The Envoy heap/K8s limit misconfiguration almost certainly existed long before April 16, 2025. 
It was never caught because Envoy memory usage never reached the dangerous threshold under normal traffic. This is a common pattern: <strong>configuration mismatches that are only dangerous under abnormal load go undetected indefinitely</strong> in systems where abnormal load doesn't occur. The misconfiguration didn't cause the outage — the filter bug did. But the misconfiguration was what turned a recoverable crash into a multi-hour global outage. Auditing resource limit configurations against actual peak usage — including synthetic stress tests — is the practice that catches these time bombs before they detonate.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🛡️</strong></p>\n<p>Envoy's Custom Filter Architecture</p><p>Spotify's custom Envoy filters are the engineering investment that makes this outage particularly instructive. Envoy provides a well-defined <em>filter chain</em> (the ordered sequence of processing modules (filters) that each request passes through in an Envoy proxy instance — each filter can inspect, modify, or reject the request before passing it to the next filter in the chain) mechanism for extensibility: developers write C++ or Lua filter plugins that plug into Envoy's request processing pipeline. The order of filters in the chain determines the sequence of execution. Spotify's filters included rate limiting (discussed at EnvoyCon 2025) and other custom logic. Changing this order is semantically meaningful: a filter that assumes it runs after authentication may behave incorrectly if it suddenly runs before it. The latent bug on April 16 was exactly this class of assumption.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The Spotify April 2025 outage is one of the cleanest documented examples of how a reasonable assessment ('low risk') combined with an undetected misconfiguration produces a disproportionate outcome. 
The lessons here are deeply practical.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>'Low risk' is not a substitute for staged rollout at the perimeter.</strong> A change's risk profile determines what validation it needs — it doesn't override the need for validation. Changes to shared perimeter infrastructure that affect all users worldwide deserve incremental rollout regardless of their apparent complexity. The filter reorder was simple; the blast radius of a failure was total. Stage perimeter changes by region and monitor before expanding.</li><li><span>02</span><div><em>Latent bugs</em> (code defects that exist in production but are harmless until a specific triggering condition occurs — they can be undetectable by standard testing if the triggering condition is rare or contextual) that depend on execution context cannot be caught by tests that don't vary that context. A filter test suite that exercises filters in their original order will never discover a bug that only manifests in a different order. When making ordering or sequencing changes, test explicitly in the new order — don't rely on existing test coverage that implicitly assumes the old order.</li><li><span>03</span><div><strong>Audit resource limit configurations against actual and stress-test peak usage regularly.</strong> Mismatches between Envoy heap size and Kubernetes memory limits are invisible until a load event forces memory beyond the limit. The same pattern exists for thread pool sizes, connection pool limits, and file descriptor limits. 
A misconfiguration that's been harmless for months can become catastrophic under the right load spike.</li><li><span>04</span><div><em>Client-side retry logic</em> (application behavior where the client automatically retries failed requests after a brief delay — designed to handle transient failures but capable of amplifying load during sustained failures) turns total simultaneous failures into traffic amplification events. Design retry logic with awareness of this: exponential backoff with jitter spreads retries over time; circuit breakers prevent retries when failure rate exceeds a threshold; retry budgets limit total retry volume per client. These mechanisms reduce the retry flood that powered Spotify's death loop.</li><li><span>05</span><div><strong>When one region survives an outage that hits all others, that region is your fastest path to root cause.</strong> APAC's survival was not luck — it was a controlled experiment running in production. Its configuration was identical; its traffic was lower. The asymmetry proved the diagnosis. Building the habit of systematically comparing the surviving regions against the failed ones — rather than focusing exclusively on what went wrong — is the investigative discipline that shortens MTTR.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Spotify's Transparency Standard</p><p>The April 16 postmortem was published on May 9 — 23 days after the incident. It named the specific system involved (Envoy Proxy, their custom filters), linked directly to their EnvoyCon 2025 talk about the same system, included exact timestamps for every recovery milestone, and explained the death loop mechanism in precise technical terms. It also enumerated four specific engineering commitments — not aspirational language but concrete actions. This level of technical transparency in a public postmortem is rare and sets a standard. 
Spotify's engineering culture treats accountability as a tool for improvement, not a liability to be managed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE ENVOYCON IRONY</strong></p>\n<p>Spotify's engineering team had publicly presented their custom Envoy filter work — including the rate limiting filter — at EnvoyCon 2025, just weeks before the April 16 incident. The presentation described the filter system as a capability Spotify had built to enhance Envoy's performance. The April 16 postmortem described what happened when the execution order of that same filter system was changed without a staged rollout. The two documents together are an accidental case study in the gap between <strong>how a system is designed to work</strong> and <strong>how it fails under an unexpected configuration change</strong>. Publishing both — the capability presentation and the failure postmortem — is a model of engineering transparency.</p>\n</blockquote>\n\n<blockquote><p>Spotify changed the order of some filters in their proxy, which seemed fine until every server on Earth crashed simultaneously, and then Kubernetes helpfully restarted them all into the same crash in a loop — which is either a distributed systems problem or a distributed systems feature depending on how you look at it.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/spotify-envoy-proxy-outage-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-24T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.979968+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Spotify"]}, {"id": "https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/", 
"url": "https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/", "title": "Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch", "summary": "How Airbnb rebuilt their 7-billion-node fraud detection identity graph from a third-party SaaS on JanusGraph + DynamoDB, cutting P99 read latency by 49% and enabling", "content_html": "<p><strong>Airbnb</strong> · Databases · 24 May 2026</p>\n<p>Airbnb's identity graph connects 7 billion nodes and 11 billion edges — every user, every device, every listing, every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. The third-party vendor powering it required periodic manual reboots to stay stable. Queries that needed 8 hops of graph traversal were hitting 5-second P99 latencies. In 2024, a small team rebuilt the entire thing internally. The results were not incremental.</p>\n<ul>\n<li>7B nodes, 11B edges</li><li>5M new edges/day</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>The stakes of Airbnb's identity graph are not abstract. When a fraudster creates a second account after being banned, tries to rent a listing to damage it, or coordinates with other accounts to inflate reviews, the first system that needs to detect the connection is the identity graph. It holds the relationships between <strong>every user, every device, every verified identity, every behavioral signal</strong> that Airbnb's Trust and Safety team uses to determine whether a new account is truly new or a known bad actor resurfacing. In 2024, this graph contained 7 billion nodes and 11 billion edges — and was growing by 5 million new edges every day. The vendor powering it was requiring periodic manual reboots to stay stable. 
That was the state of Airbnb's most critical fraud detection infrastructure when the decision was made to build internally.</p>\n<blockquote>\n<p><strong>🕸️</strong></p>\n<p>Airbnb's identity graph is not the only graph at the company — it was simply the first to migrate to the new internal graph infrastructure. The platform also runs <strong>inventory knowledge graphs</strong> (relationships between listings, amenities, neighborhoods, and availability) and <strong>data lineage graphs</strong> (tracking how data flows through pipelines for compliance and debugging). All are now converging onto the same JanusGraph-based infrastructure.</p>\n</blockquote>\n\n<p>The identity graph's architecture progressed through three distinct generations, each solving the previous generation's limit while introducing new constraints. The first generation used a <strong>relational database for user and entity data paired with a key-value store holding JSON-encoded edge lists</strong>. This worked at low graph density — when users had few connections, the JSON edge lists were manageable. As graph density increased and individual users accumulated hundreds or thousands of edges, the JSON edge lists became expensive to read and update. A query that needed to traverse relationships between users required deserializing large JSON blobs and joining across tables — operations that <em>relational databases</em> (database systems built around tables, rows, and SQL joins — optimal for normalized structured data but increasingly expensive as relationship traversal depth grows, because each hop requires an additional join) are not optimized for at graph scale.</p>\n<blockquote>\n<p><strong>THE FOUR ANTI-PATTERNS THAT PLAGUED AIRBNB'S GRAPH TEAMS</strong></p>\n<p>Before the centralized graph infrastructure, Airbnb teams building graph-based products fell into four documented anti-patterns. 
<strong>Relational graphs</strong>: modeling nodes and edges in SQL tables, producing expensive joins during traversal. <strong>Offline graphs</strong>: building the graph in the data warehouse, limiting data freshness to daily batch snapshots — useless for real-time fraud detection. <strong>DIY open source</strong>: self-managing community versions of graph databases, creating high operational toil and expertise silos. <strong>Managed PaaS</strong>: using third-party vendors — better operationally but introducing vendor lock-in, limited tuning access, and performance bottlenecks the team couldn't debug. The identity graph's 2021 migration to a third-party SaaS solved the operational overhead of DIY open source but introduced the last anti-pattern.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Generation 1 → 2: Relational DB + KV Store Couldn't Scale Graph Density</h4>\n<p>The first-generation architecture used a relational database for entity data and a KV store holding JSON-encoded edge lists. As the identity graph grew in density — individual users accumulating hundreds of edges — querying became expensive. JSON deserialization and cross-table joins are not optimized for the multi-hop traversal patterns that fraud detection requires. The architecture became difficult and expensive to scale.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Generation 2 → 3: SaaS Vendor — Better Scale, Worse Reliability</h4>\n<p>The 2021 migration to a third-party SaaS graph database improved horizontal scalability. But it introduced new problems: long-tail latency (P99 read latency reaching 5 seconds on 8-hop queries), operational instability requiring periodic manual reboots, limited ability to tune performance for Airbnb's specific query patterns, and no fine-grained access controls. 
The vendor was a black box the team couldn't debug.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Generation 3: JanusGraph + DynamoDB, Internally Managed</h4>\n<p>In 2024, Airbnb built an internal graph infrastructure on JanusGraph (open-source, Apache TinkerPop stack, Gremlin query language) with DynamoDB as the storage backend and OpenSearch for indexing. The pluggable storage architecture meant Airbnb could leverage DynamoDB's operational reliability without reinventing distributed storage — while maintaining full control over the graph logic layer. They forked JanusGraph internally to add custom optimizations.</p>\n<hr />\n<h3>Result</h3>\n<h4>49% P99 Latency Reduction, 10× Write QPS, Zero Manual Reboots</h4>\n<p>P99 end-to-end read latency dropped from 5.0s to 2.5s (-49%), and P95 read latency from 2.1s to 1.0s (-51%). Write P95 latency fell from 353ms to 156ms (-56%). In load testing, write QPS reached 10× the previous vendor's maximum. Manual reboots were eliminated entirely, incident investigation became faster thanks to transparent internal observability, and auto-scaling was enabled for the first time.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Long-Tail Latency Problem at High Fanout</p><p>The most technically interesting challenge in the identity graph is <em>long-tail latency</em> (the phenomenon where the slowest requests in a system, such as P95 and P99, are dramatically slower than the median — particularly damaging for real-time applications where even a small fraction of slow responses degrades user experience) on high-fanout queries. The graph is not uniformly dense — some nodes (users who have been on Airbnb for years and made hundreds of bookings) have hundreds or thousands of edges. When a fraud detection query traverses relationships at depth and hits one of these high-fanout nodes, the amount of data retrieved explodes. A query with 4–8 hops that hits a high-fanout node at hop 2 can return orders of magnitude more data than the same query on a sparse node. 
The P50 latency looks fine; the P99 reveals the reality.</p>\n</blockquote>\n\n<p>The choice of Gremlin as the query language was not coincidental — it was a migration enabler. Both the outgoing vendor system and the incoming JanusGraph implementation support <em>Gremlin</em> (a graph traversal language developed as part of the Apache TinkerPop framework — reads like a path through the graph, e.g. g.V(userId).out('booked').in('listed') means 'find all users who listed properties that this user has booked'), which meant Airbnb could <strong>run the same Gremlin queries against both systems simultaneously</strong> during the migration. This shadow traffic approach allowed direct performance benchmarking under real production load before any cutover — a stark contrast to migrations that require rewriting queries for the new system before they can be tested. The query language compatibility was a deliberate evaluation criterion, not an accident.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Trust Graph: The Fraud Detection Use Case</p><p>Airbnb's internal name for the identity graph's fraud detection application is the <strong>Trust Graph</strong>. It models connections between Airbnb users and detects two primary fraud patterns: <strong>account duplication</strong> (a banned user creating a new account and re-joining the platform) and <strong>collusion</strong> (groups of accounts coordinating to execute fraudulent transactions, inflate review scores, or circumvent platform policies). The Trust Graph feeds ML models that learn patterns of fraudulent connectivity — the specific graph structures that appear before fraud events — and score new accounts and transactions in real time. 
For this use case, query latency directly impacts both fraud detection speed and host/guest experience during booking.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📦</strong></p>\n<p>Storage Separation: Why DynamoDB as the Backend</p><p>JanusGraph's pluggable storage architecture was the property that made it the right choice for Airbnb. By using <strong>DynamoDB as the storage backend</strong>, Airbnb decoupled graph logic (JanusGraph) from distributed storage operations (DynamoDB). DynamoDB brings auto-scaling, multi-region replication, and operational reliability that Airbnb's infrastructure team already understood and trusted. JanusGraph handles the graph data model, schema management, and Gremlin query execution. The combination gave Airbnb full control over the graph layer while standing on a storage foundation that didn't need to be invented from scratch.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>5 Million New Edges Per Day: The Write Problem</p><p>The identity graph's write challenge is as significant as its read challenge. Every day, Airbnb adds approximately <strong>5 million new edges</strong> — new bookings creating host-guest relationships, new device associations, new identity verification links. Each edge must be ingested in near real-time through asynchronous events and stored durably before downstream fraud models can use them. The vendor system's write P95 of 353ms was tolerable when the graph was smaller. At 11 billion edges growing at 5 million per day, write throughput headroom becomes critical — a system at its write QPS ceiling cannot absorb traffic spikes without dropping ingestion events. The internal solution's 10× write QPS ceiling during load testing created the headroom the vendor never provided.</p>\n</blockquote>\n\n<p>The decision framework Airbnb used to evaluate JanusGraph against alternatives reflects a principled approach to graph database selection that is worth examining. 
Four requirements shaped the evaluation: <strong>scalability for online queries</strong> (the system had to handle real-time graph traversal at P95 latencies under 500ms), <strong>expressive schema and query capabilities</strong> (the Gremlin traversal language and labeled property graph model needed to support the identity graph's data model without structural compromises), <strong>fit with Airbnb's infrastructure and operational model</strong> (DynamoDB as the storage backend meant the team was standing on infrastructure they already operated), and <strong>a visible, extensible codebase</strong> — the specific requirement that ruled out the previous vendor and every other black-box alternative. Access to the source code was not a preference; it was a prerequisite.</p>\n<blockquote>\n<p><strong>CONNECTED ACCOUNTS: HOW THE GRAPH DETECTS FRAUD</strong></p>\n<p>The specific fraud detection application that depends on the identity graph is called <strong>Connected Accounts</strong> (also referred to as the Trust Graph). It works by finding structural patterns in the graph that correlate with fraud. A legitimate user typically has one main account, one primary device, and verified identity credentials. A fraudster attempting to re-enter the platform after a ban might create a new account — but often reuses the same phone number, the same payment method, the same device, or overlaps with the banned account's booking history. 
The Connected Accounts system traverses the graph to find these connections: <em>\"this new account shares a device with a banned account, which shared a payment method with another banned account, which has reviewed listings that the new account also reviewed.\"</em> That traversal pattern — spanning 4–8 hops — is exactly why graph depth performance matters for fraud detection.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Three JanusGraph Engine Optimizations That Closed the Latency Gap</h3>\n<p>Deploying stock JanusGraph with DynamoDB would not have been sufficient — Airbnb's query patterns, particularly the high-fanout traversals that caused the worst P99 spikes, required modifications to the JanusGraph engine itself. The team forked JanusGraph internally and made three targeted optimizations: replacing the default locking mechanism with a DynamoDB-native approach, adding parallel execution to the high-fanout data fetching interface, and instrumenting the internal fork with distributed tracing that the open-source version lacked. 
Together, these changes addressed the specific failure modes that had made the vendor solution unacceptable.</p>\n<ul>\n<li><strong>-49%</strong> — P99 end-to-end read latency improvement — from 5.0 seconds on the vendor system to 2.5 seconds on internal JanusGraph infrastructure — directly improving fraud detection response time</li>\n<li><strong>-56%</strong> — P95 write latency improvement — from 353ms on the vendor to 156ms internally — enabling faster ingestion of the 5 million new edges added to the graph every day</li>\n<li><strong>10×</strong> — Write QPS ceiling during load testing — the internal JanusGraph infrastructure successfully scaled to ten times the maximum write throughput the vendor could sustain</li>\n<li><strong>0</strong> — Manual reboots required after migration — the vendor solution required periodic manual instance reboots to maintain optimal performance; the internal solution auto-scales without human intervention</li>\n</ul>\n\n<pre><code class=\"language-python\"># Conceptual illustration of the three JanusGraph engine optimizations\n# that reduced long-tail latency on Airbnb's identity graph\n\n# OPTIMIZATION 1: Custom transaction strategy using DynamoDB conditional writes\n# JanusGraph's default locking: acquire explicit distributed lock before write\n# Problem: lock acquisition is a round-trip to the storage backend = overhead\n\n# Old approach (simplified — uses JanusGraph's default locking):\ndef write_edge_default(tx, src_vertex, dst_vertex, edge_label):\n    lock = acquire_distributed_lock(src_vertex, edge_label)  # expensive\n    try:\n        tx.add_edge(src_vertex, dst_vertex, edge_label)\n        tx.commit()\n    finally:\n        release_lock(lock)\n\n# New approach: DynamoDB conditional writes (atomic compare-and-swap)\n# DynamoDB's native conditional writes ensure integrity without separate lock\ndef write_edge_optimized(tx, src_vertex, dst_vertex, edge_label):\n    # Condition: only write if edge doesn't already exist\n   
 # DynamoDB evaluates the condition atomically server-side — no round-trip lock\n    tx.add_edge_with_condition(\n        src_vertex, dst_vertex, edge_label,\n        condition=\"attribute_not_exists(edge_key)\"  # DynamoDB conditional expression\n    )  # Lower overhead, same integrity guarantee\n\n# OPTIMIZATION 2: Parallel query execution via improved getMultiSlices\n# Problem: high-fanout queries (user with 1000+ edges) fetch data serially\n# Each 'slice' of edge data retrieved one at a time from DynamoDB\n\n# Before: serial fetching of edge slices\ndef get_edges_serial(vertex_id, num_slices=50):\n    results = []\n    for slice_key in compute_slice_keys(vertex_id, num_slices):\n        results.append(dynamo.get_item(slice_key))  # Sequential round-trips\n    return merge(results)  # N sequential DynamoDB calls\n\n# After: parallel fetching via improved getMultiSlices interface\ndef get_edges_parallel(vertex_id, num_slices=50):\n    slice_keys = compute_slice_keys(vertex_id, num_slices)\n    # DynamoDB BatchGetItem fetches all slices concurrently\n    results = dynamo.batch_get_items(slice_keys)  # Single batched DynamoDB call\n    return merge(results)  # 1 call instead of N — critical for high-fanout nodes\n\n# OPTIMIZATION 3: Distributed tracing integrated into JanusGraph fork\n# OSS JanusGraph: no tracing instrumentation — impossible to profile slow queries\n# Internal fork: Airbnb's distributed trace context propagated through graph ops\ndef execute_gremlin_traversal(query, trace_context):\n    with airbnb_tracer.start_span('janusgraph.traversal', parent=trace_context) as span:\n        span.set_tag('query.hops', count_hops(query))\n        span.set_tag('query.fanout', estimated_fanout(query))\n        result = janusgraph.execute(query)  # Each DynamoDB call creates child spans\n        span.set_tag('result.edges_traversed', result.edge_count)\n    return result  # Full trace: query → graph ops → DynamoDB calls → 
result</code></pre>\n<blockquote>\n<p><strong>GREMLIN QUERY REWRITING: CLIENT-SIDE OPTIMIZATION</strong></p>\n<p>Even with a faster JanusGraph engine, Airbnb discovered that identical Gremlin queries produced significantly different performance on JanusGraph compared to the vendor system — because each implements different query planning optimizations over TinkerPop steps. Two specific patterns required client-side rewrites. <strong>Path steps</strong> (Gremlin's <code>path()</code> and <code>simplePath()</code> operators) are not optimized as batched queries in JanusGraph, causing non-batched storage backend queries that saturate the DynamoDB connection pool. These were replaced with conditional queries ensuring acyclic results. <strong>Side-effect aggregation steps</strong> produced non-batched substeps in JanusGraph's query planner and were restructured to minimize unoptimized computation. Both changes required deep knowledge of JanusGraph's query planning internals — knowledge only available because Airbnb owns the code.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Schema Enforcement via the Management Service</p><p>JanusGraph's open-source version ships with minimal schema management tooling. Airbnb built a <strong>Graph Management Service</strong> on top of JanusGraph to handle schema enforcement, index management, and schematized Thrift APIs for client access. The management service acts as the control plane for the graph infrastructure: it ensures that vertex and edge schemas are validated before data is written, manages secondary indexes in OpenSearch, and provides the typed API surface that downstream services (fraud detection models, Trust & Safety pipelines) call. 
This separates schema governance from query execution and prevents different teams' graph data from colliding in a multi-tenant infrastructure.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Shadow Traffic Migration Strategy</p><p>Migrating 7 billion nodes and 11 billion edges without downtime required a migration strategy that validated the new system under real production load before any cutover. Airbnb's approach: run both the vendor system and the internal JanusGraph system in parallel, routing the same production queries to both and comparing results. <span><strong>Because both systems use Gremlin, the same queries ran unchanged on both systems simultaneously.</strong></span> This shadow traffic phase provided two things: a performance benchmark under real load (not synthetic tests), and correctness validation (ensuring internal JanusGraph returned the same results as the vendor for the same queries). Only after shadow traffic validated both correctness and performance was production traffic cut over and the vendor deprecated.</p>\n</blockquote>\n\n<p>Third-Party Vendor vs Internal JanusGraph: Performance Comparison Across Query Types</p><div><table><caption>Third-Party Vendor vs Internal JanusGraph: Performance Comparison Across Query Types</caption><thead><tr><th>Query Type</th><th>Vendor P95</th><th>Internal P95</th><th>Improvement</th><th>Vendor P99</th><th>Internal P99</th></tr></thead><tbody><tr><td>1-hop query</td><td>~180ms</td><td>~65ms</td><td>-64%</td><td>~420ms</td><td>~150ms</td></tr><tr><td>2-hop query</td><td>~350ms</td><td>~130ms</td><td>-63%</td><td>~900ms</td><td>~280ms</td></tr><tr><td>2-hop (high fanout)</td><td>~620ms</td><td>~200ms</td><td>-68%</td><td>~1,800ms</td><td>~450ms</td></tr><tr><td>4-hop query</td><td>~900ms</td><td>~380ms</td><td>-58%</td><td>~2,500ms</td><td>~850ms</td></tr><tr><td>8-hop query (max 
depth)</td><td>~2,100ms</td><td>~1,000ms</td><td>-52%</td><td>~5,000ms</td><td>~2,500ms</td></tr><tr><td>Write (edge creation)</td><td>~353ms</td><td>~156ms</td><td>-56%</td><td>~800ms</td><td>~360ms</td></tr></tbody></table>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Two Ingestion Paths: Event-Driven and Bulk Load</p><p>The identity graph's data ingestion architecture has two distinct paths. <strong>Event-driven ingestion</strong> handles the 5 million daily new edges in near real-time — asynchronous events from Airbnb's platform (new booking, new device association, identity verification) trigger graph mutations through the JanusGraph write path within seconds of occurring. <strong>Bulk loading</strong> handles backfills, historical data migrations, and large-scale data corrections — optimized for high throughput rather than low latency, running as offline jobs that write directly to DynamoDB's storage layer. The two paths are served by separate applications in the identity graph service, ensuring that bulk load operations during data migrations don't contend with real-time fraud detection writes.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The architecture of Airbnb's new graph infrastructure has three conceptual layers. The <strong>storage layer</strong> is DynamoDB for graph data persistence and OpenSearch for secondary indexes — both managed AWS services that auto-scale without Airbnb managing the distributed storage operations. The <strong>graph engine layer</strong> is Airbnb's internal JanusGraph fork — the Gremlin server that executes traversal queries against the storage layer, with custom optimizations for Airbnb's specific access patterns. The <strong>management layer</strong> is the Graph Management Service — schema enforcement, index management, multi-tenant namespace isolation, and the Thrift API surface that client services call. 
Each tenant (the identity graph, inventory knowledge graph, data lineage graph) operates in an isolated namespace within the same infrastructure.</p>\n<h3>Before: Vendor Graph DB — Black Box, Manual Reboots, P99 at 5 Seconds</h3>\n<p><a href=\"https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Airbnb Internal Graph Infrastructure — JanusGraph + DynamoDB</h3>\n<p><a href=\"https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Why High-Fanout Nodes Cause Long-Tail Latency: The Graph Traversal Problem</h3>\n<p><a href=\"https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>JANUSGRAPH'S PLUGGABLE STORAGE: THE ARCHITECTURAL DECISION THAT MADE THIS POSSIBLE</strong></p>\n<p>Most graph databases tightly couple the graph logic layer with the storage layer — the query engine and the data store are one system. JanusGraph is architecturally different: it uses a <strong>pluggable storage backend</strong>, meaning the graph logic layer (Gremlin server, transaction management, schema enforcement) can be decoupled from the distributed storage layer. Airbnb chose DynamoDB as the backend — a service their infrastructure team already operated at scale and trusted. 
This separation gave Airbnb the ability to: iterate on graph engine features without touching storage operations, leverage DynamoDB's auto-scaling for write throughput bursts (like the 5M daily edges), and evolve the storage backend in the future without rewriting the graph layer.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Open Source Observability Gap</p><p>One of the specific problems that drove Airbnb to fork JanusGraph internally was the absence of distributed tracing in the open-source version. Without tracing, <strong>there was no way to profile which graph queries were slow, which DynamoDB operations within a query were taking the most time, or which high-fanout nodes were causing P99 spikes</strong>. Debugging latency issues required guesswork or custom logging that was expensive to build and maintain. The internal fork integrated Airbnb's distributed tracing infrastructure into every graph operation — Gremlin steps, DynamoDB calls, OpenSearch index queries — giving the team the observability needed to find and fix the exact operations driving long-tail latency. You cannot optimize what you cannot measure; you cannot measure what you cannot instrument.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Airbnb's identity graph migration is a case study in the specific moment when the accumulation of vendor limitations justifies the cost of building internally — and in the engineering decisions that made the build worth the investment. 
The lessons are as much about when to leave a vendor as about how to build a graph database.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>The signals that a vendor relationship has passed its usefulness are specific:</strong> recurring manual operational interventions (reboots), inability to instrument or observe the system's internals, no path to tune performance for your specific access patterns, and P99 latency that is an order of magnitude worse than P50. Each of these individually might be tolerable. All four together — as Airbnb experienced — indicate that the vendor relationship is costing more in operational pain and engineering productivity than an internal solution would cost to build and maintain.</li><li><span>02</span><div><em>Pluggable storage backends</em> (an architectural pattern where the database query engine and the distributed storage layer are decoupled through a defined interface, allowing different storage systems to be swapped without changing the query layer) are the property that makes graph databases practical for large-scale production deployments. JanusGraph's DynamoDB backend let Airbnb separate concerns cleanly: Airbnb owns the graph logic layer, AWS owns the distributed storage operations. Build where you have competitive advantage; buy where you don't.</li><li><span>03</span><div><strong>Shadow traffic is the only honest migration validation strategy for a stateful system that cannot be tested in staging.</strong> You cannot reproduce 7 billion nodes, 11 billion edges, and real fraud detection query patterns in a staging environment. Running both old and new systems against the same production queries, comparing outputs and latencies, closes the validation gap. 
Gremlin compatibility between the vendor system and JanusGraph is what made shadow traffic feasible here, so evaluate migration options partly on query language compatibility.</li><li><span>04</span><div><em>High-fanout nodes</em> (vertices with an unusually large number of edges, sometimes called 'supernodes'; they cause disproportionate latency on traversal queries because a single hop to a high-fanout node can require fetching thousands of edges) are a failure mode of graph databases at scale that doesn't appear until the graph is large and dense. Design your query architecture around the assumption that some nodes will have orders of magnitude more edges than the average: parallel fetching, fanout budgets, and explicit query limits at high-fanout nodes are the tools that prevent P99 from diverging from P50.</li><li><span>05</span><div><strong>Fork open-source infrastructure when you have specific, documented performance requirements that the upstream project doesn't address — and when you intend to maintain the fork.</strong> Airbnb's JanusGraph fork added parallel query execution, DynamoDB conditional write transactions, and distributed tracing. All three were gap-fills for production requirements the OSS version didn't prioritize. The fork is a commitment — it creates a maintenance obligation and diverges from upstream. Make that decision with eyes open, but don't avoid it when the production requirements are clear.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>7 Billion Nodes, One Platform, Multiple Use Cases</p><p>The identity graph was the first use case to migrate to Airbnb's internal graph infrastructure — but the infrastructure was built as a <strong>multi-tenant platform</strong> from the start. 
Inventory knowledge graphs (relationships between listings, neighborhoods, experiences, amenities) and data lineage graphs (how data flows through Airbnb's pipelines) are among the next use cases converging onto the same JanusGraph infrastructure. Building the identity graph migration as a paved-path platform rather than a one-off solution means every subsequent team that needs a graph database inherits the operational work Airbnb already did on the first migration — schema tooling, observability, auto-scaling, migration playbook.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE MULTI-TENANT PLATFORM PAYOFF</strong></p>\n<p>Airbnb built the internal graph infrastructure as a <strong>paved-path multi-tenant platform</strong> from the start — a conscious architectural decision to build once and serve many graph use cases, rather than building a one-off solution for the identity graph. The identity graph was tenant 0: the first adopter that validated the platform under real production load. Inventory knowledge graphs and data lineage graphs are following. Each new tenant inherits the schema tooling, observability, auto-scaling, query optimization work, and migration playbook that the identity graph team built. 
<strong>The marginal cost of the second graph use case is dramatically lower than the first, because the platform absorbs the infrastructure complexity.</strong></p>\n</blockquote>\n\n<blockquote><p>Airbnb's fraud detection system runs on a graph of 7 billion nodes and 11 billion edges that required periodic manual reboots to stay stable — until a team rebuilt it internally, cut P99 latency in half, and eliminated the reboots, which raises the question of why they waited until 2024 to do it, but the answer is probably 'because that's how long it takes to get frustrated enough with a vendor.'<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/airbnb-identity-graph-janusgraph-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-24T00:00:00+00:00", "date_modified": "2026-06-13T19:05:03.956856+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "Airbnb"]}, {"id": "https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/", "url": "https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/", "title": "Google Built a Free Design Tool That Generates Production Code From a Sentence — Then Added Multiplayer", "summary": "How Google acquired Galileo AI, rebranded it as Stitch, powered it with Gemini 2.5 Pro, and shipped multiplayer and a streaming design agent at I/O 2026 — for free.", "content_html": "<p><strong>Google</strong> · Performance · 21 May 2026</p>\n<p>At Google I/O 2025, Sundar Pichai demoed a tool that turned a plain English description into a complete mobile UI in under 30 seconds. Figma charges $15 per editor per month for collaborative design. Google Stitch does it free. 
A year later, Google added real-time multiplayer, a streaming design agent, and voice input. The design industry noticed.</p>\n<ul>\n<li>350 free generations/month</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>On May 20, 2025, Sundar Pichai stood on stage at Google I/O and ran a live demo that made designers and developers simultaneously uncomfortable. He typed a one-sentence description of a mobile app. In under 30 seconds, Google Stitch rendered a complete, multi-component mobile UI with matching color palette, navigation structure, and typography. One button click exported it as React code ready to paste into a development environment. Another exported it as a Figma file with editable layers and auto-layout. The audience applauded. Figma's product team took notes. <strong>The tool was free.</strong></p>\n<p>Google Stitch did not emerge from Google's internal R&D labs. It began with the early-2025 acquisition of <strong>Galileo AI</strong> — a startup that had built one of the first credible text-to-UI generators, capable of interpreting product descriptions and producing coherent interface layouts. Google acquired Galileo, rebranded the technology as Stitch, integrated it with the <em>Gemini 2.5 Pro</em> (Google's most capable multimodal model at the time of Stitch's launch — able to process text, images, audio, and video simultaneously and generate structured outputs across all of them) model family, and launched it as a Google Labs experiment at I/O 2025. The deliberate Labs framing was a signal: Google was testing the market before committing to a full product. The response was immediate. <strong>Over 1 million waitlist signups appeared overnight</strong> after the live demo.</p>\n<blockquote>\n<p><strong>WHAT 'VIBE DESIGN' ACTUALLY MEANS</strong></p>\n<p>Stitch entered the vocabulary alongside 'vibe coding' — the practice of describing software intent to an AI and refining the output iteratively rather than building from first principles. 
Vibe design applies the same model to interface creation: describe a screen, watch it appear, ask for changes conversationally. The skill shifts from <strong>pixel manipulation to intent specification</strong>. A founder who cannot use Figma can now produce a working prototype in minutes. A product manager can test five layout variations in the time it would previously have taken to brief a designer on one.</p>\n</blockquote>\n\n<p>The evolution from launch to I/O 2026 followed a clear product trajectory, compressed by user feedback and Google's resources. The May 2025 version was single-screen only — one prompt, one screen, export. By July 2025, theme customization had been added, based on beta user feedback showing that designers needed design-system integration, not just raw screens. December 2025 brought the <strong>Prototypes feature</strong> alongside Gemini 3 integration — for the first time, multiple related screens could be linked into interactive flows. March 19, 2026 brought <strong>Stitch 2.0</strong>: infinite canvas, multi-screen generation, voice input, app-flow generation, and 5-screen simultaneous canvas rendering. Ten months of user feedback had transformed a demo into a workspace.</p>\n\n<h3>Problem</h3>\n<h4>Design-to-Dev Handoff: The Productivity Black Hole</h4>\n<p>The traditional design-to-development pipeline required designers to build components in Figma, annotate specifications manually, and hand off to developers who re-implemented everything in code. Even with design tokens and component libraries, the gap between 'designed' and 'built' consumed weeks of coordination. 
For small teams and solo founders, this gap was existential: they lacked either the design skill or the engineering skill to complete the loop alone.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Multimodal Models Reached UI-Generation Quality</h4>\n<p>By early 2025, Gemini's multimodal capabilities had reached a threshold where they could reliably interpret both text descriptions of interfaces and uploaded images of existing UIs, and generate coherent layouts with appropriate component choices, spacing, and visual hierarchy. The Galileo AI acquisition gave Google a product layer that had already worked out the prompt engineering, training data, and output format questions on top of this capability.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Stitch: Gemini-Powered UI Generator With Production-Grade Exports</h4>\n<p>Stitch accepted three input types simultaneously: natural language descriptions, uploaded reference images or screenshots, and annotated screenshots with modification notes. Gemini 2.5 Pro processed all three modalities together to produce screen designs. Export paths were designed for real developer workflows: Figma files with editable layers and auto-layout, production-ready HTML/CSS, React components, and Vue code.</p>\n<hr />\n<h3>Result</h3>\n<h4>I/O 2026: Streaming Agent + Multiplayer — Free</h4>\n<p>At I/O 2026 on May 20, 2026, Google launched a streaming design agent that renders UI components onto the canvas in real time as a designer types or speaks, so the designer can correct course before generation completes. Simultaneous multi-user editing was also added, directly matching Figma's flagship collaboration feature. Both are free. 
Figma's professional plan charges $15 per editor per month.</p>\n<hr />\n\n<blockquote>\n<p><strong>🎨</strong></p>\n<p>Google Stitch accepts <strong>three types of input simultaneously</strong>: plain language prompts ('a healthcare app onboarding screen for elderly users'), uploaded images or screenshots ('match this visual style'), and annotated screenshots with modification notes ('make this header bigger and change the blue to green'). Gemini 2.5 Pro processes all three modalities in a single context window, producing a UI design that reflects all constraints at once.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Figma Bridge: Complement, Not Replace</p><p>When the original Stitch tool was unveiled at I/O 2025, Sarah Drasner, Director of Engineering at Google, was explicit: the Figma export function was designed to <strong>complement rather than replace</strong> existing design workflows. Stitch generates a starting point; Figma is where professional designers refine, apply design systems, and collaborate with stakeholders. The paste-to-Figma function exports fully editable layers with auto-layout intact, giving designers a high-fidelity starting point that respects their existing tooling. This positioning is why Stitch attracted both vibe designers (who need no Figma) and professional designers (who use both).</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Free Tier's Real Constraints</p><p>Google Stitch is free, but not unlimited. The standard free tier provides <strong>350 standard generations + 50 experimental generations per month</strong>. For solo founders and students, this is ample. For teams using Stitch for daily rapid prototyping, 350 generations can deplete mid-month. The $20/month Pro tier provides unlimited access. 
Critically, the March 2026 Stitch 2.0 update, which introduced multi-screen generation and infinite canvas, still required a separate generation credit per screen, meaning a 5-screen app prototype consumed 5 credits rather than 1. Users building complex flows discovered this quickly.</p>\n</blockquote>\n\n<p>The I/O 2026 streaming agent represents the most technically ambitious evolution of Stitch's architecture. Previous versions followed a <strong>turn-based model</strong>: submit a prompt, wait for generation to complete, review the finished result, submit a revision. The streaming model replaces this with a continuous render: as a designer types or speaks, the agent renders UI components directly onto the canvas in real time, reflowing layouts before generation finishes. The practical difference is the ability to <strong>steer mid-generation</strong> — if a layout is heading in the wrong direction, a designer can interrupt before it finishes and redirect. Voice input, integrated since March 2026, works within this streaming loop: speech is parsed in real time, and the canvas responds while the designer is still talking.</p>\n<blockquote>\n<p><strong>🔌</strong></p>\n<p>Three Export Paths for Three Audiences</p><p>Stitch's export architecture targets three distinct audiences. <strong>Figma export</strong> (editable layers + auto-layout) serves designers who refine in their primary tool. <strong>Production code export</strong> (HTML/CSS, React, Vue, Tailwind) serves developers who need deployable components. <strong>AI Studio integration</strong> serves developers who want to wire backend logic to the UI without leaving the Stitch workflow. 
Power users chain all three: Figma for design review, code export for development handoff, AI Studio for full-stack experimentation.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Internal Google Labs Strategy</p><p>Launching Stitch as a Google Labs experiment rather than a full Google product was a deliberate risk-management decision. Labs experiments carry lower accountability expectations — they can be deprecated without the product embarrassment of killing a flagship tool. They also attract the <strong>early-adopter, feedback-rich user base</strong> that a new AI product needs: developers, designers, and founders who are comfortable with rough edges in exchange for early access. The Labs label signaled 'come help us build this' rather than 'this is production-ready,' which attracted exactly the right population. By March 2026, 10 months of Labs usage had turned enough of those experiments into validated product decisions to justify the Stitch 2.0 announcement.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE VIT CONNECTION IN GOOGLE TRENDS</strong></p>\n<p>The same Google Trends report that showed Stitch trending also showed <strong>VIT Counselling at +120%</strong> — referring to Vellore Institute of Technology admissions season in India. This demographic overlap is instructive: the Indian engineering student population (VIT, IIT, NIT applicants) represents a massive early-adopter cohort for tools like Stitch. Students who cannot afford Figma's professional pricing ($15/seat) but need to prototype apps for hackathons, capstone projects, and startup pitches are an ideal-fit audience for a <strong>free AI-native design tool</strong>. 
Stitch's zero-cost model and browser-based access (no installation, no hardware requirements) make it accessible to exactly this demographic.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Technical Architecture: Gemini as the UI Design Engine</h3>\n<p>Google Stitch's core is not a purpose-built design model — it is <em>Gemini 2.5 Pro</em> (Google's multimodal frontier model capable of processing text, images, audio, and code simultaneously — trained on a corpus that includes UI design patterns, code, natural language, and visual references) with a specialized prompt engineering and output parsing layer on top. This architectural choice explains both Stitch's strengths and its limitations. Stitch understands design concepts like 'glassmorphism', 'neumorphic', 'material design', and 'iOS Human Interface Guidelines' because Gemini was trained on documentation and examples of all of them. It can interpret a hand-drawn sketch and understand design intent because Gemini's vision capability has seen thousands of wireframes. 
And it generates production-quality React code because Gemini understands React at a level that exceeds most specialized code generation models.</p>\n<ul>\n<li><strong>30s</strong> — Time from plain English description to complete mobile UI including navigation, components, and color palette — as demonstrated live by Sundar Pichai at Google I/O 2025</li>\n<li><strong>3 inputs</strong> — Simultaneous input types: text prompts, uploaded reference images, and annotated screenshots — all processed in a single Gemini 2.5 Pro context window</li>\n<li><strong>5-screen</strong> — Simultaneous canvas rendering introduced in Stitch 2.0 (March 2026) — generate an entire app flow across 5 connected screens in a single generation</li>\n<li><strong>$0 vs $15</strong> — Google Stitch multiplayer vs Figma professional plan per editor per month — the pricing asymmetry that defines Stitch's competitive positioning</li>\n</ul>\n\n<pre><code>// Conceptual: How Stitch's streaming agent differs from the original turn-based model\n// The I/O 2026 streaming upgrade is an architectural change, not just a speed improvement\n\n// BEFORE (turn-based): submit → wait → review → resubmit\nasync function generateUI_old(prompt) {\n  const result = await stitch.generate(prompt);  // blocking — wait for complete output\n  // Designer sees nothing until fully done\n  // If it's wrong: submit an entirely new prompt from scratch\n  return result.screens; // [{ html, css, figmaLayers }]\n}\n\n// AFTER (streaming agent): real-time render + mid-generation steering\nasync function generateUI_streaming(prompt) {\n  const stream = stitch.generateStream(prompt);\n  \n  // Components render onto canvas as they are generated\n  stream.on('component', (component) => {\n    canvas.renderPartial(component);  // visible immediately\n  });\n  \n  // Designer can interrupt and redirect before generation finishes\n  stream.on('layoutDecision', (decision) => {\n    const userFeedback = canvas.checkInterrupt();\n    if 
(userFeedback) {\n      // Mid-generation course correction — no waiting, no re-prompting from scratch\n      stream.steer(userFeedback);\n    }\n  });\n  \n  // Voice input works inline with the stream:\n  // \"make the header larger\" spoken mid-generation → reflected in remaining components\n  voiceInput.on('command', (cmd) => stream.steer(cmd));\n  \n  await stream.complete();\n  return canvas.getCurrentState(); // final assembled result\n}</code></pre>\n<blockquote>\n<p><strong>THE GALILEO ACQUISITION RATIONALE</strong></p>\n<p>Google could have built Stitch from scratch using its existing Gemini capabilities. It chose instead to <strong>acquire Galileo AI and use its product as the foundation</strong>. The rationale is clear in retrospect: Galileo had already solved the hardest non-model problems — the prompt engineering approach that reliably produces coherent UIs, the output parser that converts model outputs into valid design tokens and component trees, the training data pipeline for UI-specific examples, and the user experience model for iterative refinement. Rebuilding these would have taken months. The acquisition compressed that to days. Galileo's technology became the product layer; Gemini became the intelligence underneath it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>RLHF for UI Quality: Training With Design Feedback</p><p>Stitch's code export quality reached 95% accuracy (as measured by component rendering fidelity) in the March 2025 closed beta, up from earlier estimates of around 70%. The improvement was driven by <strong>RLHF — Reinforcement Learning from Human Feedback</strong> — applied specifically to UI generation quality. The beta involved 500+ partner users including Vercel developers, who provided direct feedback on generated code quality and design accuracy. 
This domain-specific RLHF signal tuned Gemini's output for the specific quality criteria that professional designers and developers cared about: component naming, layout accuracy, code cleanliness, and design system compatibility.</p>\n</blockquote>\n\n<p>Google Stitch Evolution: Feature Timeline from Launch to I/O 2026</p><div><table><caption>Google Stitch Evolution: Feature Timeline from Launch to I/O 2026</caption><thead><tr><th>Date</th><th>Major Update</th><th>Key Feature Added</th></tr></thead><tbody><tr><td>May 20, 2025</td><td>Google I/O Launch</td><td>Single-screen generation, Figma export, code export (HTML/CSS/React)</td></tr><tr><td>Jul–Aug 2025</td><td>Public beta</td><td>Theme customization, RTL language support, bug fixes</td></tr><tr><td>Dec 2025</td><td>Stitch 2.0 preview</td><td>Prototypes (multi-screen flows), Gemini 3 integration</td></tr><tr><td>Mar 19, 2026</td><td>Stitch 2.0 GA</td><td>Infinite canvas, 5-screen canvas, voice input, app-flow generation</td></tr><tr><td>May 20, 2026</td><td>I/O 2026 update</td><td>Streaming agent (real-time canvas render), multiplayer editing — both free</td></tr></tbody></table>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Complete Design-to-Production Pipeline</p><p>Power users combine all three export paths sequentially to achieve a full design-to-production workflow without opening additional tools: <strong>generate</strong> in Stitch (AI-driven rapid exploration), <strong>paste to Figma</strong> for team review and design system application, <strong>export code</strong> (React/HTML) for developer handoff, <strong>push to AI Studio</strong> for backend integration. A workflow that previously took weeks of designer-developer coordination can now be completed by a solo founder in hours. 
The step-change in productivity has been most pronounced for indie developers and small startup teams.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🎙️</strong></p>\n<p>Voice-First Design: The March 2026 Input Upgrade</p><p>The voice input feature added in March 2026 was more than a convenience — it signaled a shift in the interaction model for AI design tools. Text prompts require switching mental context from 'I'm designing' to 'I'm writing instructions.' Voice input keeps the designer in <strong>continuous creative flow</strong>: say 'give me three different menu styles' while looking at the canvas, watch three variants appear, point and say 'more of that one, but with rounded corners.' The integration of voice into the streaming agent at I/O 2026 completed this loop: voice commands steer the generation in real time, not as post-hoc revision instructions.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Stitch's internal architecture has three distinct layers. The <strong>input layer</strong> processes multimodal inputs through Gemini 2.5 Pro — text prompts, reference images, and annotated screenshots are all converted into a unified context window that Gemini reasons across simultaneously. The <strong>generation layer</strong> produces structured design outputs — not raw HTML, but an intermediate representation of design tokens, component hierarchies, and layout constraints that the export layer then converts into format-specific outputs. 
The <strong>export layer</strong> translates the intermediate representation into Figma-compatible JSON (with proper component structure and auto-layout), production-grade React/HTML/CSS, and AI Studio integration configs.</p>\n<h3>Before Stitch: The Traditional Design-to-Development Pipeline</h3>\n<p><a href=\"https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Google Stitch Architecture: Multimodal Input to Production Output</h3>\n<p><a href=\"https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE INTERMEDIATE REPRESENTATION: WHY NOT JUST GENERATE CODE DIRECTLY</strong></p>\n<p>Stitch generates an intermediate design representation rather than generating Figma JSON or React code directly. This is a key architectural decision. <strong>Direct code generation</strong> from a natural language prompt produces valid code but loses design semantics — a developer can see the code but cannot easily edit the visual design. <strong>Intermediate representation</strong> preserves design intent — component names, spacing relationships, and design tokens — enabling export to multiple formats from one generation while also enabling the Figma integration to produce files with proper component structure, not just static vector exports. The IR is what makes Stitch genuinely useful for professional design workflows rather than just for throwaway prototypes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Multiplayer Technical Challenge</p><p>Adding simultaneous multi-user editing to an AI-native canvas is harder than adding it to a traditional design tool like Figma. 
In Figma, multiplayer synchronizes object positions, properties, and selection states — deterministic operations with well-understood <em>CRDT</em> (Conflict-free Replicated Data Type — a data structure designed for distributed systems that allows multiple users to edit the same data concurrently without conflicts, automatically merging changes) semantics. In Stitch, <strong>two users can simultaneously be prompting the AI to modify the same canvas</strong>, producing non-deterministic outputs that may conflict visually. Google's implementation queues concurrent AI generation requests per canvas object and applies a last-write-wins merge for AI-generated changes, while standard CRDT semantics apply for manual edits.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Design Quality Ceiling</p><p>Stitch's core limitation remains consistent across all reviews: generated designs are <strong>starting points, not finished products</strong>. The AI produces layouts with appropriate components and reasonable visual hierarchy, but professional polish — precise spacing, custom illustration integration, brand-specific typography choices, edge-case state design (empty states, error states, loading states) — still requires human design expertise. Stitch is strongest for exploration and prototyping; weakest for production-ready UI that needs to meet professional brand standards. This ceiling is exactly where Figma integration matters: Stitch generates, Figma polishes.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Google Stitch is the most significant challenge to Figma's category dominance since Figma itself displaced Sketch in 2019. 
The lessons here are as much about product strategy as about engineering.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Acquisition of a specialized AI startup accelerates a product category by months, not weeks.</strong> Google had the models (Gemini) but not the product layer (Galileo). Galileo had the product layer but not the model quality or distribution. The acquisition combined both instantly. Teams building in AI-adjacent product categories should evaluate whether acquiring specialized AI startups is faster than building the application layer from scratch on top of foundation models.</li><li><span>02</span><div><em>Intermediate representation</em> (an abstract, format-agnostic description of design intent — component hierarchy, spacing tokens, visual relationships — that can be translated into multiple output formats without losing design semantics) between AI generation and format-specific output is the architecture that makes multi-format export viable. Generating React directly loses Figma compatibility. Generating Figma directly loses code usability. An IR exports to both, and to future formats not yet defined.</li><li><span>03</span><div><strong>Free with generous limits is a viable strategy for disrupting paid professional tools when the underlying AI cost is subsidized.</strong> Google can offer Stitch free because Gemini API calls are already budgeted across Google's infrastructure at marginal cost. Figma cannot match free without destroying its revenue model. This asymmetry is the structural moat Stitch is building — not feature parity, but cost parity at zero.</li><li><span>04</span><div>Build the <strong>complement-not-replace narrative</strong> from day one. Sarah Drasner's explicit framing of Stitch as a Figma complement — not replacement — reduced designer resistance and encouraged adoption among exactly the professional user base that could drive serious enterprise usage. 
Fighting the dominant tool's ecosystem directly creates adversarial resistance. Complementing it creates adoption.</li><li><span>05</span><div><em>Streaming generation</em> (delivering AI outputs progressively as they are computed rather than waiting for completion before showing any output) changes the product experience more profoundly than speed improvements do. A 30-second generation that shows nothing for 28 seconds feels slow. A 30-second generation that shows components appearing in real time and allows mid-stream steering feels like collaboration. Same underlying model, fundamentally different user experience.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Demographic It Actually Disrupts</p><p>Stitch's most significant disruption is not to professional designers using Figma full-time — it's to <span><strong>the 90% of product people who needed design work done but couldn't justify hiring a designer for it</strong></span>: founders, product managers, startup engineers, and indie developers. For this population, Stitch doesn't replace Figma — it replaces the decision not to have a designed product at all. The expansion of who can produce designed interfaces is the market story; Figma competition is the noise around it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Multiplayer Moment</p><p>When Google announced simultaneous multi-user editing at I/O 2026, it was specifically compared to Figma's real-time collaboration — the feature that made Figma the dominant design platform. The comparison was deliberate. Figma's multiplayer is the product feature most closely associated with its enterprise value. Google offering equivalent functionality at zero cost changes the calculus for every design team evaluating their tool budget. Whether Stitch can match Figma's workflow depth is a separate question. 
Whether it will pressure Figma's pricing is not.</p>\n</blockquote>\n\n<blockquote><p>Google built a tool that turns a sentence into a React component in 30 seconds, then made it free, then added multiplayer, which is either a product strategy or a way of saying 'we're very sorry about Google Workspace.'<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/google-stitch-ai-design-tool-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-21T00:00:00+00:00", "date_modified": "2026-06-13T19:32:59.524747+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Performance", "Google"]}, {"id": "https://techlogstack.com/explore/google-gemini-omni-2026/", "url": "https://techlogstack.com/explore/google-gemini-omni-2026/", "title": "Google's Gemini Omni Is the First AI That Creates From Anything — Here Is What That Actually Means", "summary": "How Google built Gemini Omni — the first natively multimodal any-to-any model — replacing chained Veo+Imagen+Lyria pipelines with one transformer trained on all moda", "content_html": "<p><strong>Google</strong> · Distributed Systems · 21 May 2026</p>\n<p>For three years, Google built Gemini to be 'natively multimodal.' At I/O 2026, they finally demonstrated what that phrase means in practice. Gemini Omni takes a photo, an audio clip, a video, and a text description — all at once — and produces a new video that reflects all of them simultaneously. This is not four models chained together. It is one.</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>When we first announced Gemini, it was our first AI model to be natively multimodal. 
We knew that training it on a combination of text, code, audio, images, and video would give it a deeper understanding of the world. With world models, AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction.</p><p><em>Sundar Pichai, CEO of Google, at Google I/O 2026, May 19, 2026</em></p></blockquote>\n<p>The phrase 'natively multimodal' has been in Google's AI vocabulary since Gemini's December 2023 announcement. For two and a half years, it described an aspiration: a model that could process multiple modalities together rather than routing between specialized models. At Google I/O 2026 on May 19, Google delivered the concrete realization of that aspiration: <strong>Gemini Omni</strong>, a model that accepts text, image, audio, and video as inputs simultaneously and generates video as output, not by chaining Veo (video generation), Imagen (image generation), and Lyria (audio generation) together, but by processing all of them within a single transformer's forward pass. The distinction is architectural, not cosmetic. A chain of models cannot reason about relationships between its inputs. A unified model can.</p>\n<p>The path from Gemini's original announcement to Omni runs through three engineering milestones. <strong>Gemini 2.0 Flash</strong> (late 2024) introduced native audio output and real-time multimodal interaction through the Live API, the first demonstration that Gemini could generate audio natively rather than only understand audio and video. <strong>Project Astra</strong> (ongoing research) explored what it means for an AI to have a continuous, persistent understanding of a physical environment through video and audio streams, seeing the world in real time rather than processing discrete inputs. <strong>Nano Banana</strong> (2025) brought Gemini's intelligence to image generation and editing: restoring old photos, designing from sketches, visualizing ideas from natural language. 
Omni synthesizes all three threads into a production model: <strong>real-time multimodal understanding (Astra) + generative output across modalities (Nano Banana) + unified architecture (Gemini 2.0 Flash's Live API direction)</strong>.</p>\n<blockquote>\n<p><strong>CHAINED MODELS VS NATIVE OMNI: THE FUNDAMENTAL DIFFERENCE</strong></p>\n<p>OpenAI's Sora (text-to-video) and Google's Veo were both excellent at their specific task but could not natively reason across modalities. A user who wanted to generate a video matching a specific audio track and a reference image had to: (1) generate a video with Veo using the text description, (2) separately process the audio with a music AI, (3) manually synchronize the two. <strong>Gemini Omni collapses these three steps into one prompt</strong>: upload the image, the audio, and write a description — the model reasons about all three simultaneously and produces a video where the visuals respond to the audio tempo, match the visual reference, and reflect the text description. The unified context window is what enables this.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Multimodal AI Was a Pipeline of Specialized Models</h4>\n<p>The previous state-of-the-art for multimodal content creation required chaining specialized models: text-to-video (Veo, Sora), text-to-image (Imagen, DALL-E), text-to-audio (Lyria, ElevenLabs), and manual integration. Each handoff between models lost context — the relationship between audio tempo and visual rhythm, the visual style of a reference image, the emotional tone of a text prompt. Creators managed these integrations manually, which required technical skill that limited access to specialists.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Separate Models Cannot Reason Across Modality Boundaries</h4>\n<p>The limitation was architectural. A video model that receives a reference image as a text description (\"generate a video that looks like this photo\") has lost the actual pixel relationships. 
A video model that receives an audio file as a description (\"generate a video that matches this music\") has lost the actual waveform data. Genuine multimodal reasoning requires all modalities in the same context window — not converted to text summaries of each other.</p>\n<hr />\n<h3>Solution</h3>\n<h4>One Transformer Trained on All Modalities Simultaneously</h4>\n<p>Gemini Omni was trained on text, image, audio, and video simultaneously within a single transformer architecture. The model develops internal representations that encode cross-modal relationships — understanding that a warm color palette relates to a particular musical key, that a specific visual style is associated with a cultural context, that physical object behavior in video follows the laws of physics that Gemini has observed across its training data.</p>\n<hr />\n<h3>Result</h3>\n<h4>Any Input to Video Output, With Conversational Editing</h4>\n<p>Gemini Omni Flash launched May 19, 2026 in the Gemini app and YouTube Shorts — 10-second clips, with API access planned within weeks. The model accepted any combination of text, image, audio, and video inputs and produced video output with character consistency, physics grounding, and SynthID watermarking. Conversational editing retained full context across turns — a generated scene could be revised through natural language without re-prompting from scratch.</p>\n<hr />\n\n<blockquote>\n<p><strong>🎬</strong></p>\n<p>Gemini Omni Flash's launch was <strong>simultaneous in the Gemini app and YouTube Shorts</strong> — the latter integration meaning that YouTube's 2+ billion monthly active users could generate AI videos directly within the YouTube Shorts creation flow. 
The distribution reach of this integration dwarfs any standalone AI video tool's install base on launch day one.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Character Consistency: The Long-Context Advantage</p><p>One of the most practically important features of Gemini Omni is <strong>character consistency across shots</strong> — a character introduced in scene 1 retains their face, clothing, and voice across all subsequent scenes in the same conversation, without the creator re-uploading the reference image for each shot. This is enabled by Gemini's long context window: the model carries the character's visual description as an implicit context throughout the conversation. Competing video models, which have shorter effective contexts, required reference images at every generation turn and still produced inconsistent results. For content creators building multi-scene narratives, this is the feature that makes Omni viable for professional work rather than just single-shot experiments.</p>\n</blockquote>\n\n<p>The conversational editing model is Omni's most transformative product experience. Previous video generation tools operated like vending machines: insert prompt, receive video, discard and re-insert if wrong. Gemini Omni operates like a <strong>continuous creative collaboration</strong>: generate a scene, ask for the camera angle to change, ask for the background color to shift, ask for a second character to enter — and the model keeps the context of every previous instruction and modification. The resulting video reflects all decisions made across the conversation, not just the most recent prompt. Creators describe this as the difference between 'generating video' and 'directing a scene.'</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>What Omni Cannot Do Yet</p><p>The initial Gemini Omni Flash release has explicit limitations that Google acknowledges. Output is capped at <strong>10 seconds per clip</strong>. 
Image and audio output (generating images or audio files as outputs, not just accepting them as inputs) are on the roadmap but not in the initial release. Complex motion, exact text rendering, and cross-scene consistency for background elements remain challenging. Google's own documentation notes that consistency across edits, complex motion, and precise text in video are 'still challenging.' These are the frontiers where the next model generation will push.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Veo to Omni Transition</p><p>Gemini Omni Flash does not deprecate Veo — Google's prior video generation model — immediately. Veo 3 and Veo 3.1 Light remain available for use cases that need pure text-to-video without the multimodal complexity. But Omni is positioned as the <strong>future of video generation within the Gemini ecosystem</strong>: as Omni's capabilities expand (longer clips, image output, audio output), Veo's separate product line will converge into the Omni family. The Flash suffix — the same naming convention used for Gemini 2.0 Flash — signals that a fuller, more capable Omni Pro version is on the roadmap. Flash is the fast, accessible entry point; Pro will be the quality-ceiling version for professional creators.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE NANO BANANA PRECURSOR</strong></p>\n<p>Before Omni, Google shipped <strong>Nano Banana</strong> in 2025 — a product that brought Gemini's intelligence to image generation and editing. Nano Banana could restore old photos, generate images from sketches, and edit photos with natural language commands ('remove the background', 'change the season to winter'). It reached millions of users and established the UX patterns — natural language editing, reference image input, conversational refinement — that Omni extends to video. Omni is, in the product lineage, Nano Banana for video. 
The audience, the interaction model, and the safety infrastructure were all validated by Nano Banana's image generation rollout before being applied to the harder problem of video.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🌊</strong></p>\n<p>Omni's videos are described as being <strong>grounded in Gemini's real-world knowledge</strong>, meaning that objects in generated videos behave according to physical laws the model has internalized from training data, not just according to visual pattern matching against existing videos. A wave breaks on a beach with plausible fluid dynamics. A ball falls along a correct gravitational arc. A flag ripples in the wind the way cloth actually does. This physics grounding is the property that makes Omni's outputs feel more real than the uncanny outputs of earlier text-to-video models, where motion was statistically plausible but physically wrong.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Architecture: How Natively Multimodal Actually Works</h3>\n<p>Gemini Omni's architecture is a transformer trained across all modalities simultaneously. It is not a <em>mixture of experts</em> (a neural network architecture in which a routing mechanism sends each input to a small subset of specialized 'expert' subnetworks) architecture with separate video, image, and audio experts, but a single dense model where all modalities interact in every layer. This is the design choice that enables cross-modal reasoning: a visual token and an audio token from the same moment in a video can attend to each other directly within the same attention layer, rather than being processed by separate specialized networks whose outputs are merged later. 
The training corpus includes synchronized video+audio, image+text pairs, audio+text transcriptions, and video+text descriptions at scale — forcing the model to learn the statistical relationships between modalities, not just how to process each in isolation.</p>\n<ul>\n<li><strong>Any→Video</strong> — Input-to-output capability of Gemini Omni Flash: text, image, audio, and video inputs simultaneously → video output with physics grounding and real-world knowledge</li>\n<li><strong>10s</strong> — Maximum clip length for the Gemini Omni Flash initial release — capped at launch for Gemini app and YouTube Shorts; longer-form output on the roadmap</li>\n<li><strong>SynthID</strong> — Imperceptible watermark on every Omni-generated video — survives re-encoding and resizing, enables AI provenance verification without visible degradation</li>\n<li><strong>1 model</strong> — Architecture of Gemini Omni vs chained specialized models (Veo + Imagen + Lyria) — the unification enables cross-modal reasoning that pipeline architectures fundamentally cannot match</li>\n</ul>\n\n<pre><code class=\"language-python\"># Conceptual: Gemini Omni API vs the chained model approach it replaces\n# This illustrates the architectural difference — API details TBC when GA\n\n# OLD APPROACH: Chaining specialized models\n# Each model gets a text description of the other modalities — context is lost\nfrom veo import VeoClient\nfrom lyria import LyriaClient\n\nveo = VeoClient()\naudio_gen = LyriaClient()\n\n# Step 1: Generate audio from description\naudio_clip = audio_gen.generate(\n    prompt=\"upbeat electronic music, 10 seconds\"\n)  # has no knowledge of the visual reference\n\n# Step 2: Generate video from text — no actual audio waveform input\nvideo = veo.generate(\n    prompt=\"city timelapse, upbeat electronic vibe, matches photo style\",\n    reference_image=None  # can't actually process image input\n)  # can't see the audio; can't see the image style\n\n# Manual synchronization: the user's 
problem now\n\n# GEMINI OMNI: One model, all modalities in one prompt\n# 'gemini-omni-flash' is a placeholder model ID; the real name is TBC until GA\nimport google.generativeai as genai\n\nmodel = genai.GenerativeModel('gemini-omni-flash')\n\n# A chat session carries earlier turns (and their media) forward as context\nchat = model.start_chat()\n\n# All four modalities provided in one turn, so the model can reason across them\nresponse = chat.send_message([\n    \"Create a 10-second timelapse of a city transforming from day to night.\",\n    genai.upload_file('reference_photo.jpg'),   # actual pixel data: style reference\n    genai.upload_file('audio_track.mp3'),       # actual waveform: beat sync possible\n    genai.upload_file('reference_clip.mp4')     # actual video: motion style reference\n])\n# Intended output: video clip that reflects the photo's visual style,\n# syncs transitions to the audio's beat, and uses the reference clip's camera movement\n\n# Conversational editing: sending through the same chat preserves context.\n# A fresh generate_content() call would not see the previous turn.\nresponse2 = chat.send_message(\n    \"Same scene, but make it rain and show the character from my last prompt\"\n    # Earlier turn is in the chat history: the character, the city style, the audio\n)</code></pre>\n<blockquote>\n<p><strong>SYNTHID: WATERMARKING DESIGNED TO SURVIVE PROCESSING</strong></p>\n<p>Every video generated by Gemini Omni carries Google's <strong>SynthID digital watermark</strong>, an imperceptible signal embedded in the pixel data itself. Unlike visible watermarks (which are trivially removed by cropping) or metadata watermarks (which disappear on re-encoding), SynthID is embedded into the statistical patterns of the pixels in a way that survives common processing: re-encoding to different codecs, resizing, color grading, and speed adjustments. The watermark allows any tool with the SynthID detector to verify that a video was AI-generated by a Gemini product. Omni videos also carry C2PA Content Credentials, so any C2PA-compatible platform can check their signed provenance metadata alongside the SynthID watermark. 
Digital avatars additionally require mandatory onboarding (recording yourself, speaking verification numbers) before use — a guardrail against deepfakes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>World Models: The Theoretical Foundation</p><p>Sundar Pichai described Omni as a step toward <strong>world models</strong> — AI systems that simulate physical and social reality rather than just predict token sequences. The distinction matters for video generation: a language model predicting video token sequences will produce realistic-looking but physically incorrect motion (objects falling upward, light sources moving inconsistently, human bodies with impossible joint angles). A world model that has internalized physics, causality, and spatial relationships from its training data produces videos where motion is physically coherent because the model understands <em>why</em> objects move the way they do, not just what they look like when they move. Gemini's training corpus — which includes vast quantities of video annotated with physical and contextual descriptions — is what gives Omni's outputs their reported grounding in real-world knowledge.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Digital Avatars: The Use Case That Needed Safety First</p><p>Gemini Omni includes a digital avatar feature — users record themselves, Google stores a personal avatar, and the avatar can appear in any future Omni generation. The feature is explicitly framed as a response to the deepfake problem: <span><strong>your own likeness, under your own control, with verifiable AI provenance via SynthID.</strong></span> OpenAI had popularized digital avatars in Sora ('Cameos') before Sora's app was deprecated. 
Google's implementation adds the safety layer (mandatory verification onboarding, SynthID watermarking, and C2PA content credentials) that transforms a deepfake risk into a controlled creative feature.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The API Rollout: Weeks After Launch</p><p>Gemini Omni Flash launched in the Gemini app and YouTube Shorts on May 19, 2026, with API access described as arriving 'within weeks.' This staggered rollout is standard for Google's AI product launches: consumer surface first (to validate quality and gather real-world usage signal), developer API second (once the team has confidence the model performs as expected across the diversity of real-world use cases). The API will be available through Google AI Studio and Vertex AI, following the same access model as Gemini 2.0 Flash and other Gemini family models.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The YouTube Shorts Pipeline</p><p>As of May 20, 2026, the YouTube Shorts creation flow includes Omni as a native video generation option accessible directly from the Shorts composer. A creator can generate a base scene, refine it conversationally, and publish directly to Shorts without leaving YouTube's mobile app. YouTube identifies Omni-generated content through SynthID: these videos are labeled as AI-generated in discovery surfaces, so creators get credit for transparency while keeping their reach. This is the first time a frontier AI video model has had a direct distribution path to a 2-billion-user platform on launch day.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The architecture of Gemini Omni represents the culmination of the Gemini model family's design philosophy from its first announcement: train a single model on all modalities simultaneously so that cross-modal understanding is emergent from the training process rather than engineered through explicit routing. 
The practical consequence is that Omni's internal representation of a video frame encodes relationships to audio, text context, and physical dynamics simultaneously, enabling generation that reflects all input modalities without explicit instructions about how to combine them.</p>\n<h3>Chained Pipeline vs Gemini Omni: Architectural Comparison</h3>\n<p><a href=\"https://techlogstack.com/explore/google-gemini-omni-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Gemini Omni: Conversational Editing Flow and Context Retention</h3>\n<p><a href=\"https://techlogstack.com/explore/google-gemini-omni-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE YOUTUBE SHORTS INTEGRATION: DISTRIBUTION AT SCALE</strong></p>\n<p>Gemini Omni's Day 1 integration into YouTube Shorts is a distribution strategy that no standalone AI video tool can match. YouTube Shorts has <strong>2+ billion monthly logged-in users</strong>. The Shorts creation flow integrates Omni as a native generation option: creators can generate a 10-second AI video clip directly within YouTube's creation tools without downloading a separate app or managing an API key. The integration also means that every Omni-generated Short carries YouTube's standard content policy enforcement on top of SynthID watermarking, a two-layer safety system for the most viral content format in the world.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Context Window Limit for Long-Form Video</p><p>Gemini's long context window enables character consistency within a conversation, but the 10-second clip limit reflects real constraints in generating coherent long-form video. 
<strong>Video generation at 10+ seconds requires planning scene transitions, maintaining narrative coherence, and generating consistent motion physics across hundreds of frames</strong> — a computational and quality challenge that current transformer architectures address better over short sequences. The 10-second cap at launch is an engineering constraint, not a product decision, and it will extend as the model and infrastructure mature. The conversational multi-shot workflow is Google's practical solution for longer-form content in the interim: generate shots individually, retain character context across turns, assemble the narrative manually.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔬</strong></p>\n<p>C2PA Content Credentials: The Open Standard for AI Provenance</p><p><em>C2PA</em> (Coalition for Content Provenance and Authenticity — an open technical standard co-developed by Adobe, Microsoft, BBC, Intel, Sony, and others that cryptographically signs digital content at the point of creation with metadata about its origin and modification history, enabling any C2PA-compatible tool to verify whether a piece of content is AI-generated, human-made, or modified from either source) integrates with SynthID to give Gemini Omni-generated videos a verifiable provenance chain. Any C2PA-compatible media player or content verification tool can confirm that a video was generated by Gemini Omni, when it was generated, and (if the user consented) by whom. This is the standard that resolves the 'is this real?' question for media at scale — not by restricting AI generation, but by making AI generation verifiable.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Gemini Omni is the first product-grade demonstration of what 'natively multimodal' means in practice. 
The lessons here are as much about AI architecture philosophy as about the specific product.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Training a single model on all modalities simultaneously is architecturally superior to chaining specialized models for tasks that require cross-modal reasoning.</strong> A chain of models loses the pixel relationships, waveform data, and temporal correlations at every handoff. A unified model retains them throughout. The performance gap between chained and unified architectures grows with the complexity of the cross-modal reasoning required.</li><li><span>02</span><div><em>World models</em> (AI architectures that simulate the physical and causal structure of reality — understanding why objects move, how light behaves, and what consequences follow from what actions — rather than simply predicting what the next frame statistically should look like) produce more coherent generated video than token-prediction models because they model causality rather than correlation. Sundar Pichai's framing at I/O 2026 — 'AI is moving from predicting text to simulating reality' — is the product-facing version of this architectural shift.</li><li><span>03</span><div><strong>The conversational editing model changes who can use AI video generation.</strong> Prompt-and-retry was a specialist workflow — only people fluent in prompt engineering could get good results efficiently. Conversational steering, where natural language revisions apply incrementally to a persistent context, is intuitive for anyone who has ever given feedback in a meeting. 
The audience for AI video creation expanded dramatically with this UX shift.</li><li><span>04</span><div>Safety infrastructure — <em>SynthID</em> (Google's imperceptible AI-generated content watermark, embedded in pixel-level statistical patterns in a way that survives re-encoding, resizing, and color processing), C2PA content credentials, mandatory avatar onboarding verification — is not a regulatory compliance checkbox. It is a prerequisite for deploying generative video at YouTube's scale without becoming the infrastructure for a deepfake crisis. Build safety into the foundation, not as a post-launch patch.</li><li><span>05</span><div>Distribution is the moat that model quality cannot easily overcome. <strong>An average model with YouTube Shorts integration reaches 2 billion users on Day 1. A superior model without distribution reaches the early-adopter population.</strong> Google's decision to launch Omni simultaneously in Gemini app and YouTube Shorts rather than as a standalone tool reflects a distribution philosophy: route new AI capabilities through existing products with existing users rather than trying to build a new user acquisition funnel.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Project Astra Thread</p><p>Gemini Omni's launch completes an arc that began with Project Astra — Google's research into a universal AI assistant that processes real-time audio and video streams continuously. Astra demonstrated that Gemini could understand the physical world in real time. Omni demonstrates that it can generate representations of the physical world from any input. 
The research-to-product pipeline that runs from Astra's prototype glasses through Omni's Flash model is one of the cleaner demonstrations of Google DeepMind's model: do the research under a project name, productize when the model quality meets the distribution threshold.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE COMPETITIVE CONTEXT: SORA'S RETREAT</strong></p>\n<p>OpenAI's Sora launched with enormous fanfare, then had its public-facing app deprecated after relatively brief availability. The gap between Sora's demo-quality output and production-ready video generation proved larger than expected. Google's Omni launch comes with a different posture: explicit acknowledgment of limitations (10-second caps, challenging complex motion), safety infrastructure (SynthID, C2PA, avatar verification) built in from day one, and distribution through existing products rather than a standalone app. The contrast is instructive: <strong>building safety and distribution infrastructure before launch is slower but more durable than building capabilities first and retrofitting safety later.</strong></p>\n</blockquote>\n\n<blockquote><p>Google spent three years explaining that Gemini was 'natively multimodal' and then at I/O 2026 showed what that actually means by letting you upload a photo, an audio clip, and a text prompt and getting back a video where the sun moves in time with the music — which is either an impressive technical achievement or proof that 'natively multimodal' needed a better marketing team from the start.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/google-gemini-omni-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": 
"2026-05-21T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.276622+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "Google"]}, {"id": "https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/", "url": "https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/", "title": "OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes", "summary": "How OpenAI's December 11, 2024 deployment of a telemetry service overwhelmed the Kubernetes control plane, caused cascading failures across all services, and locked", "content_html": "<p><strong>OpenAI</strong> · Reliability · 21 May 2026</p>\n<p>On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability. Within 29 minutes, it had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable for over four hours. The engineers trying to fix it couldn't run kubectl. The control plane that manages clusters was down — and it was the only way back in.</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>There is a particular category of failure that strikes harder than a random bug or infrastructure crash: the failure of a system deployed specifically to prevent other failures. On December 11, 2024, OpenAI deployed a new telemetry service designed to <strong>improve observability of their Kubernetes control planes</strong> — to give engineers better visibility into how their clusters were behaving, to catch problems earlier. Within minutes of deployment, the telemetry service was itself the problem. By 3:16 PM PST, every OpenAI service was degraded or completely unavailable. <strong>ChatGPT. The API. Sora.</strong> All down. 
And the engineers responsible for fixing it had a problem that made everything worse: the Kubernetes control plane, the system through which you manage Kubernetes, was itself down, which meant <span>engineers couldn't run kubectl</span>. The tools for recovery depended on the infrastructure that had failed.</p>\n<blockquote><p>Our tests didn't catch the impact the change was having on the Kubernetes control plane. DNS caching added a delay between making the change and when services started failing. Remediation was very slow because of the locked out effect.</p><p><em>OpenAI, December 11, 2024 Incident Postmortem, status.openai.com</em></p></blockquote>\n<p>The events unfolded with the particular cruelty of incidents where staging does not predict production. On December 10, the telemetry service was deployed to a staging cluster and verified as working correctly. On December 11 at 2:23 PM, the code was merged and the deployment pipeline triggered. From 2:51 PM to 3:20 PM, the change was applied to all production clusters. At 3:16 PM, <strong>four minutes before the rollout was even complete</strong>, all OpenAI products began degrading. The root cause was a configuration in the new service that caused <strong>every node in every cluster to execute resource-intensive Kubernetes API operations simultaneously</strong>. The cost of these operations scaled with the size of the cluster, which meant the largest clusters, which also happened to be the most critical, were hit hardest and fastest.</p>\n<blockquote>\n<p><strong>DNS CACHING: THE HIDDEN TIME BOMB</strong></p>\n<p>The staging environment passed because of a combination of two factors. First, the staging cluster was small: the telemetry service's API load scaled with cluster size, so a small staging cluster generated manageable load. Second, and more insidiously: <strong>DNS caching masked the failure</strong>. 
When the telemetry service started overwhelming the Kubernetes API servers, services that had already cached DNS responses continued functioning temporarily — they could still reach their dependencies through stale cache entries. This created a delay between the moment the change was applied and the moment services began failing. Engineers saw a clean deployment, saw services continuing to function, and assumed success — until the DNS cache expired and services that hadn't failed yet began failing all at once.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Telemetry Rollout to All Clusters in 29 Minutes</h4>\n<p>On December 11, 2024 at 2:51 PM PST, the new telemetry service configuration began rolling out to all Kubernetes clusters. The rollout completed at approximately 3:20 PM. The service's configuration caused every node in every cluster to issue simultaneous resource-intensive Kubernetes API calls — a load that scaled with cluster size, hitting the largest, most critical clusters hardest.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Kubernetes Control Plane Overwhelmed — DNS and Service Discovery Broken</h4>\n<p>With thousands of nodes simultaneously hammering the Kubernetes API servers, the control planes of most large clusters crashed. Kubernetes's control plane manages service discovery and DNS resolution — when it failed, services could no longer find each other. DNS cache expiry then propagated the failure to services that had been temporarily protected by stale cache entries, turning a partial degradation into a complete cascading failure.</p>\n<hr />\n<h3>Solution</h3>\n<h4>The Locked-Out Problem: No kubectl Access</h4>\n<p>Recovery required rolling back the telemetry configuration — but rolling back Kubernetes configurations requires kubectl, which requires a functioning Kubernetes control plane. The control plane was down. Engineers were effectively locked out of the clusters they needed to fix. 
Recovery required out-of-band mechanisms: directly accessing nodes through cloud provider management consoles, bypassing the Kubernetes layer entirely to remove the telemetry service's configuration.</p>\n<hr />\n<h3>Result</h3>\n<h4>4h 22min Outage, Full Postmortem Published</h4>\n<p>ChatGPT reached substantial recovery at 5:45 PM PST. Full recovery across all services was achieved at 7:38 PM PST — 4 hours and 22 minutes after the incident began. OpenAI published a detailed postmortem within days, identifying four root causes and committing to specific architectural changes including break-glass emergency access mechanisms and staged rollouts for all infrastructure changes.</p>\n<hr />\n\n<blockquote>\n<p><strong>📱</strong></p>\n<p>Apple released <strong>iOS 18.2 on December 11, 2024</strong> — the same day as the ChatGPT outage. iOS 18.2 introduced ChatGPT integration into Apple Intelligence. The timing was spectacularly bad: millions of iPhone users who had just updated their OS to get ChatGPT access discovered ChatGPT was down. Many initially assumed the iOS update had caused the outage. OpenAI's postmortem explicitly confirmed it had not: <strong>the iOS 18.2 launch was coincidental</strong>. The real cause had nothing to do with the traffic spike from Apple users.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Circular Dependency That Made Recovery Hard</p><p>The deepest structural problem revealed by the December 11 outage was a <strong>circular dependency between the Kubernetes control plane and the services that depend on it</strong>. When the control plane failed, it took down: (1) DNS resolution for all services, (2) service discovery across the cluster, (3) the ability to schedule new pods or reschedule crashed ones, and (4) the primary mechanism engineers use to manage all of the above. Recovery from a Kubernetes control plane failure required access to a system that the control plane failure had disabled. 
This is the engineering equivalent of locking your keys inside your car — and the standard response (calling a locksmith) had not been pre-arranged.</p>\n</blockquote>\n\n<p>The 'locked out effect' that OpenAI's postmortem names is a well-known failure mode in Kubernetes operations, though it often appears in less severe forms. Kubernetes is a complex distributed system where the control plane manages the state of the cluster, and the data plane (the nodes) depends on that state to function. But the management tools (kubectl, Helm, the Kubernetes API) also depend on the control plane. When the control plane goes down, the cluster enters a state where it continues running existing workloads on warm nodes (the data plane doesn't immediately die) but <strong>nothing can be changed, fixed, scaled, or recovered</strong> through standard channels. The cluster is frozen — and thawing it requires direct node access bypassing Kubernetes's own abstractions.</p>\n<blockquote>\n<p><strong>🔬</strong></p>\n<p>The Telemetry-Ironically-Causes-Outage Pattern</p><p>The December 11 outage belongs to a specific failure pattern category that has appeared at multiple major companies: <strong>the observability tool that causes the outage it's meant to detect</strong>. A new metrics agent deployed across a large fleet issues unexpected API calls. A distributed tracing system generates load spikes while capturing other services' load spikes. A log aggregation service fills up disk on the servers it monitors. The pattern is instructive: observability infrastructure touches every service in the fleet and therefore has a potential blast radius of everything. It requires the same staged rollout rigor as any production service deployment.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Scale of Impact</p><p>OpenAI's services reach hundreds of millions of users. On December 11, 2024, every one of them hit the same wall: ChatGPT is unavailable. 
Beyond consumer users, the ChatGPT API powers thousands of production applications — startups that had built their products on top of OpenAI's API, enterprises that had integrated ChatGPT into customer support flows, developers whose production services were serving errors to their own users. A single infrastructure configuration error at OpenAI propagated into cascading failures across the entire ecosystem of businesses built on its infrastructure. This is the multiplier effect of platform outages.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE SAME DAY AS SORA'S DEBUT</strong></p>\n<p>The December 11 outage came <strong>on the same day that OpenAI was also managing the pressure of Sora's recent launch</strong> — its video generation platform, which had seen immediate scaling challenges upon release. The Sora platform was itself affected by the December 11 outage (listed in the postmortem alongside ChatGPT and the API as impacted products). This confluence made December 11 OpenAI's most visible reliability day: its most-watched new product and its most-used existing product, both down simultaneously. The postmortem was unusually forthcoming about the operational context — acknowledging explicitly that the organization was managing multiple scaling challenges at once.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⏰</strong></p>\n<p>The most counterintuitive fact in the December 11 timeline: services began degrading at <strong>3:16 PM</strong>, but the rollout had only started at 2:51 PM and wasn't complete until 3:20 PM. Services started failing <strong>while the rollout was still in progress</strong>. This is the DNS cache masking effect in action: the earliest-affected clusters (the large ones, which received the change first) started degrading immediately; clusters that received the change later showed the failure slightly later. 
From the engineers' monitoring dashboards, it looked like a gradual degradation — masking the true cause until DNS caches expired everywhere and the pattern became undeniable.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>What Actually Broke and Why Recovery Took Four Hours</h3>\n<p>Understanding the December 11 recovery timeline requires understanding the specific Kubernetes failure mode. The telemetry service's configuration caused each node to watch Kubernetes API resources continuously — a watch operation that made API calls proportional to the number of resources in the cluster. Across thousands of nodes in large clusters, these API calls compounded into an overwhelming flood. The Kubernetes API servers — the stateful components of the control plane that maintain cluster state — became saturated. With the API servers unresponsive, <em>etcd</em> (the distributed key-value store that backs Kubernetes' state — all cluster state (node metadata, pod specifications, service definitions) lives in etcd, and the API servers cannot function without access to it) became unreachable. Without etcd, the API servers couldn't recover. Without the API servers, nothing could be changed. 
The cluster was in a deadlock.</p>\n<ul>\n<li><strong>4h 22m</strong>: Total outage duration, from 3:16 PM to 7:38 PM PST on December 11, 2024, the longest single outage in ChatGPT's history at the time</li>\n<li><strong>29 min</strong>: Time for the rollout to reach every production cluster (2:51 PM to 3:20 PM); all products were already degrading at 3:16 PM, 25 minutes in, before the scope was understood</li>\n<li><strong>All</strong>: Services affected simultaneously; ChatGPT, API, Sora, and all OpenAI products experienced degradation or complete unavailability at the same time</li>\n<li><strong>0</strong>: Staging warnings; the telemetry service passed staging validation completely, because staging clusters were too small to reproduce the API call scaling behavior that took down production</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified model of the failure mode: telemetry service overwhelming K8s API\n# Each node watches K8s API objects, so cost scales with cluster size\n# All numbers below are illustrative, not taken from the postmortem\n\n# The telemetry service configuration (simplified)\nTELEMETRY_CONFIG = {\n    \"watch_all_pods\": True,      # Watch all pod events in the cluster\n    \"watch_all_nodes\": True,     # Watch all node events\n    \"watch_all_services\": True,  # Watch all service definitions\n    \"poll_interval_ms\": 100,     # Check for changes every 100ms (aggressive)\n}\n\n# What happens when this runs on N nodes simultaneously:\ndef api_calls_per_second(cluster_size: int) -> int:\n    # Each node creates 3 watchers, each calling K8s API every 100ms\n    # Integer division keeps the result an int, matching the annotation\n    return cluster_size * 3 * (1000 // 100)  # calls per second\n\n# Small staging cluster (100 nodes):\nstaging_load = api_calls_per_second(100)   # 3,000 API calls/sec: manageable\n\n# Large production clusters (thousands of nodes):\nprod_load = api_calls_per_second(5000)     # 150,000 API calls/sec: CATASTROPHIC\n# Assumed API server capacity for illustration: ~1,000-2,000 requests/sec\n# At 150,000: API server becomes unresponsive within seconds\n# DNS resolution 
breaks: services can't find each other\n# kubectl stops working: engineers can't recover\n\n# The fix: remove the watch configuration from the telemetry service\n# But to apply a config change, you need kubectl\n# kubectl requires a working API server\n# The API server is down because of the config\n# ↑ The locked-out effect ↑\n\n# Recovery path: bypass Kubernetes entirely\n# SSH directly to nodes via cloud provider console\n# Manually stop the telemetry service process on each node\n# API server load drops\n# Control plane recovers\n# kubectl works again\n# Verify and clean up</code></pre>\n<blockquote>\n<p><strong>THE FOUR ROOT CAUSES FROM OPENAI'S POSTMORTEM</strong></p>\n<p>OpenAI's postmortem identified four specific contributing factors: <strong>(1) The staging cluster was too small</strong> to reproduce the load scaling behavior — the failure only manifested at production cluster sizes. <strong>(2) DNS caching masked the initial failure</strong> — services continued functioning on stale cache entries, giving engineers a false signal that the deployment was clean before cache expiry revealed the truth. <strong>(3) No canary deployment</strong> — the configuration was applied to all clusters simultaneously rather than validated incrementally on one cluster first. <strong>(4) No break-glass mechanism</strong> — there was no pre-arranged out-of-band access path for exactly this class of failure where the standard Kubernetes management plane was unavailable.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Recovery Steps</p><p>OpenAI's engineers recovered the cluster through a sequence that bypassed Kubernetes abstractions entirely: <strong>Step 1</strong> — access individual nodes directly through the cloud provider's management console (not through Kubernetes), bypassing the downed control plane. <strong>Step 2</strong> — manually stop the telemetry service process on each node to eliminate the API call flood. 
<strong>Step 3</strong> — with load removed, Kubernetes API servers began recovering. <strong>Step 4</strong> — once kubectl was functional, roll back the telemetry service configuration through standard channels. <strong>Step 5</strong> — monitor service recovery and DNS propagation across the fleet. Each step added latency because it required manual execution across thousands of nodes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Post-Incident Actions: Four Engineering Commitments</p><p>OpenAI's postmortem committed to four concrete engineering changes: <strong>(1) Immediate</strong>: locked the telemetry configuration to prevent re-deployment. <strong>(2) Short-term</strong>: implement break-glass emergency access mechanisms that function even when the Kubernetes control plane is unavailable. <strong>(3) Medium-term</strong>: decouple observability infrastructure from the components it monitors, so a failing telemetry system cannot cascade into the monitored services. <strong>(4) Long-term</strong>: all infrastructure-related configuration changes will use staged deployment with continuous monitoring and the ability to halt at any percentage. The staged rollout commitment was the same lesson Cloudflare had learned twice.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Kubernetes Watch API Amplification</p><p>The specific mechanism was a Kubernetes <em>Watch API</em> (a Kubernetes API feature that allows clients to receive a stream of events as resources change — an efficient alternative to polling, but one that creates a persistent connection from the watching client to the API server, consuming API server resources proportional to the number of watchers) misuse. Rather than polling for cluster state on a schedule, watch operations create a persistent long-lived connection from each watcher to the API server. 
The telemetry service created three watch connections per node — at 5,000 nodes in a large cluster, that's 15,000 persistent watch connections. Each watch connection requires API server resources to maintain. The API server, designed for a few hundred concurrent operations, was maintaining thousands — and also handling the event stream updates that each watch triggered as cluster state changed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Immediate Locking Action</p><p>Within hours of the outage's resolution, OpenAI took one immediate action while longer-term architectural work was planned: <strong>locked the telemetry configuration</strong> so it could not be re-deployed in its original form without an explicit manual override. This lock-before-investigation pattern is a standard SRE practice: after a configuration causes a production incident, prevent it from being accidentally reapplied during the postmortem or by a team member who doesn't yet know about the incident. The lock is a cheap, immediate mitigation that buys time for proper architectural fixes. It is the equivalent of removing a circuit breaker from service rather than leaving it in a state where it could trip again.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>OpenAI's Kubernetes architecture runs the inference clusters that power ChatGPT's model serving, the API gateway that handles developer requests, and the Sora video generation pipeline. All of these depend on Kubernetes's control plane for service discovery, DNS resolution, pod scheduling, and configuration management. 
Understanding how a single telemetry service configuration could take all of them down requires understanding both the structure of Kubernetes and the specific amplification mechanism the December 11 configuration triggered.</p>\n<h3>The Failure Chain: From Telemetry Deployment to Complete Outage</h3>\n<p><a href=\"https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Recovery Architecture: Bypassing Kubernetes to Restore It</h3>\n<p><a href=\"https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>WHY KUBERNETES CONTROL PLANE FAILURE IS CATASTROPHIC</strong></p>\n<p>The Kubernetes control plane manages three things that are catastrophic to lose simultaneously: <strong>DNS resolution</strong> (services find each other by name, not IP — without DNS, microservices go blind), <strong>service discovery</strong> (load balancers can't route to healthy pods without the API server updating their configuration), and <strong>pod scheduling</strong> (crashed pods can't be restarted, replicas can't be scaled). In most partial failures, you lose one of these. A control plane failure loses all three. And because the standard recovery path requires the control plane to function, recovery from total control plane failure requires out-of-band mechanisms that most teams haven't pre-arranged.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Staging Trap: Size-Dependent Bugs</p><p>The December 11 outage is a textbook example of a <strong>size-dependent bug</strong> — a failure that only manifests at production scale. 
The telemetry service worked correctly in staging because staging clusters were small enough that the aggregate API call load from all nodes was within the API server's capacity. Every small-scale test passed cleanly. At production scale, with thousands of nodes instead of dozens, the same configuration produced 100× the load — enough to overwhelm even a properly sized API server. Size-dependent bugs require load testing at production scale, not just functional testing at representative scale. The standard 'test in staging' process is insufficient for infrastructure changes whose failure modes are non-linear functions of cluster size.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Kubernetes Control Plane Architecture</p><p><em>Kubernetes control plane</em> (the set of components that manage the overall state of a Kubernetes cluster — including the API server (handles all REST operations), etcd (distributed key-value store backing all cluster state), the scheduler (assigns pods to nodes), and the controller manager (runs reconciliation loops)) is itself a distributed system running on dedicated master nodes. In OpenAI's architecture, the control plane runs on separate infrastructure from the data plane nodes that run model inference. When the control plane fails, data plane nodes continue running their existing workloads (model inference pods don't immediately die) but cannot be managed — pods can't be restarted, scaled, or reconfigured. Services that depended on service discovery (which uses control-plane-managed DNS) began failing immediately. 
Services with static configuration or warm DNS caches survived longer before failing.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The December 11 ChatGPT outage is among the most instructive Kubernetes incidents ever publicly documented — partly because OpenAI published a detailed postmortem, and partly because the failure pattern recurs across the industry whenever teams deploy infrastructure changes without accounting for scale-dependent behavior.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Observability infrastructure is production infrastructure.</strong> A telemetry service deployed across your entire fleet has the blast radius of your entire fleet. Deploy it with the same staged rollout rigor you apply to production services: one cluster, verify, one region, verify, full fleet. The December 11 rollout applied the configuration to all clusters in 29 minutes. A staged rollout would have revealed the problem on the first cluster before it could cascade.</li><li><span>02</span><div><em>DNS caching</em> (a mechanism where the results of DNS lookups are stored locally for a period defined by the record's TTL, allowing services to resolve domain names without contacting the DNS server on every request) is a reliability asset that can become a diagnostic liability during incidents. When an infrastructure change breaks DNS, services continue functioning on cached entries — masking the failure until cache TTLs expire. If your deployment passes initial health checks and then fails minutes later at scale, DNS cache expiry is a likely explanation. Monitor DNS resolution success rates separately from application health checks.</li><li><span>03</span><div><strong>Build break-glass emergency access before you need it.</strong> The December 11 engineers needed to access nodes directly, bypassing the Kubernetes control plane, using mechanisms that had not been pre-arranged. Pre-arrange them. 
Every Kubernetes deployment should have a documented, tested procedure for accessing nodes and making configuration changes when kubectl is unavailable. Like any emergency procedure, it must be practiced before the emergency.</li><li><span>04</span><div><em>Size-dependent bugs</em> (failures that manifest only at production scale because their severity is a non-linear function of system size — a 100-node staging cluster may pass cleanly while a 5,000-node production cluster fails catastrophically) cannot be caught by functional testing at representative scale. Load test infrastructure changes against production-equivalent cluster sizes. If production-scale testing is not feasible, test at 10% of production scale and extrapolate load metrics before applying to the full fleet.</li><li><span>05</span><div><strong>Decouple the components that manage your infrastructure from the infrastructure they manage.</strong> The Kubernetes control plane should not be the only path to emergency recovery. OpenAI's post-incident commitment to decoupling Kubernetes components addresses this: if the control plane fails, some emergency management capability should remain available independently of the failed layer.</li></ol>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The iOS 18.2 Coincidence</p><p>Apple shipped iOS 18.2 — which introduced ChatGPT integration into Apple Intelligence — on the same day as the outage. Millions of users who updated and then tried ChatGPT saw it was unavailable. Social media immediately speculated that the iOS update had caused the outage. OpenAI's postmortem was explicit: <strong>iOS 18.2 had nothing to do with the outage.</strong> The telemetry service failure had already begun degrading the infrastructure before the iOS update's traffic could have any effect. 
The coincidence is a useful reminder that correlation — especially coincidence of timing — is not causation, and that attributing outage causes to the most visible concurrent event is a common and often wrong instinct.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE STAGED ROLLOUT THAT WOULD HAVE CAUGHT IT</strong></p>\n<p>A staged rollout with the pattern: 1 cluster → verify 30 minutes → 10% of clusters → verify 30 minutes → 50% → verify → 100%, would have caught this failure at the 1-cluster stage. <strong>One large cluster showing API server saturation is a signal.</strong> One large cluster crashing before engineers even understood why is an outage. The difference between the two outcomes is the presence of a verification window between deployment stages — time where the system's behavior can be observed before the next deployment stage commits. OpenAI's December 11 deployment had no such window: the configuration was applied to all clusters in 29 minutes without a verification pause.</p>\n</blockquote>\n\n<blockquote><p>OpenAI deployed a service to watch their Kubernetes clusters more carefully, and the service watched so carefully it broke every cluster simultaneously — which is either ironic or a very expensive way to learn that observability infrastructure has a blast radius.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/openai-chatgpt-kubernetes-outage-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-21T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.286784+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "OpenAI"]}, {"id": 
"https://techlogstack.com/explore/linkedin-kafka-origin-2011/", "url": "https://techlogstack.com/explore/linkedin-kafka-origin-2011/", "title": "LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.", "summary": "How LinkedIn engineers Jay Kreps, Jun Rao, and Neha Narkhede built Apache Kafka in 2010 — the append-only log that went from 1 billion to 7 trillion messages per day", "content_html": "<p><strong>LinkedIn</strong> · Messaging · 19 May 2026</p>\n<p>In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.</p>\n<ul>\n<li>{&#x27;label&#x27;: &#x27;events/day at launch (2011)&#x27;, &#x27;value&#x27;: &#x27;1B&#x27;}</li><li>{&#x27;label&#x27;: &#x27;messages/day by 2015&#x27;, &#x27;value&#x27;: &#x27;1T&#x27;}</li><li>{&#x27;label&#x27;: &#x27;messages/day by 2019&#x27;, &#x27;value&#x27;: &#x27;7T&#x27;}</li><li>{&#x27;label&#x27;: &#x27;of Fortune 100 run it today&#x27;, &#x27;value&#x27;: &#x27;80%+&#x27;}</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>In 2010, LinkedIn was a growing professional network with a problem that every ambitious data-driven company eventually hits: a massive accumulation of valuable activity data that was effectively <strong>locked inside the systems that generated it</strong>. Every page view, every job click, every connection request, every search query was data. LinkedIn's ML engineers wanted that data to train recommendation models. LinkedIn's analytics engineers wanted it to understand user behavior. LinkedIn's search engineers needed it to keep the index fresh within seconds of updates. 
But the pipelines connecting these systems to their data sources were a fragile, inconsistent web of point-to-point integrations — each one custom-built, each one brittle, none of them sharing any infrastructure. Jay Kreps, who was leading data infrastructure engineering at LinkedIn, later described the root cause with characteristic directness: <span>\"Everyone wanted to build fancy machine-learning algorithms, but without the data, the algorithms were useless. Getting the data from source systems and reliably moving it around was very difficult.\"</span></p>\n<blockquote>\n<p><strong>📊</strong></p>\n<p>Before Kafka, LinkedIn's data architecture had an <strong>N×M integration problem</strong>: every data source needed a custom pipeline to every data destination. With dozens of source systems and dozens of consumer systems, engineers were maintaining hundreds of individual pipelines — each with its own error handling, schema management, and operational burden. Adding one new data source meant writing N new pipelines. Adding one new consumer meant updating M existing sources.</p>\n</blockquote>\n\n<p>Kreps, alongside Jun Rao (who had joined from IBM's database group) and Neha Narkhede (who had come from Oracle), evaluated every existing solution. <strong>Traditional message queues</strong> — <em>ActiveMQ</em> (an open-source message broker implementing the JMS specification, designed for reliable, ordered message delivery between enterprise applications), <em>RabbitMQ</em> (a message broker built around the AMQP protocol, designed for flexible routing, delivery guarantees, and complex messaging patterns) — were designed for a different problem. They offered rich delivery guarantees, complex routing, and transaction semantics, but they were built for the <em>reliable delivery of individual task messages</em>, not for the <em>high-throughput streaming of millions of activity events</em>. 
The broker in these systems tracked the delivery state of every message — consuming memory and CPU proportional to the number of outstanding messages. They were designed for near-immediate consumption. They could not handle the situation where a Hadoop job needed to replay yesterday's activity data. They could not scale to the throughput LinkedIn needed. Most critically: their <strong>per-message overhead was enormous</strong>. ActiveMQ's message format had 144 bytes of overhead per message. LinkedIn needed to process millions of messages per second.</p>\n<blockquote>\n<p><strong>THE INSIGHT: TREAT DATA MOVEMENT LIKE A LOG</strong></p>\n<p>The founding insight of Kafka was recognizing that LinkedIn's data movement problem was not a messaging problem — it was a <strong>log problem</strong>. Databases have used append-only logs for decades: the <em>write-ahead log</em> (a sequential record of all changes made to a database, written before the changes are applied — used for crash recovery, replication, and point-in-time restoration) is how MySQL, Postgres, and every serious database achieves durability and replication. Jay Kreps asked: what if the data pipeline itself was an append-only log? Producers append events. Consumers read them at their own pace. The log retains messages for a configured period. Any consumer can replay from any point. The broker doesn't track state. The simplicity unlocked everything.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>LinkedIn's Data Was Locked in Silos</h4>\n<p>By 2010, LinkedIn had dozens of data source systems and dozens of consumer systems — ML models, analytics pipelines, search indexers, real-time features — all of which needed the same activity stream data. Point-to-point custom pipelines were the solution, but maintaining hundreds of them was unsustainable. 
The existing messaging systems (ActiveMQ, RabbitMQ) couldn't handle LinkedIn's throughput requirements and were designed for task queues, not event streams.</p>\n<hr />\n<h3>Cause</h3>\n<h4>No Tool Existed for High-Throughput Real-Time Event Streaming</h4>\n<p>The problem in 2010 had two halves: batch systems (Hadoop) could handle large volumes but only hours later; traditional message queues could deliver in real-time but couldn't scale to LinkedIn's volume or support replay. There was no system that provided high throughput, low latency, durability, replayability, and horizontal scalability simultaneously. The three engineers concluded that the tool they needed did not exist.</p>\n<hr />\n<h3>Solution</h3>\n<h4>One Year Building Kafka: The Append-Only Distributed Log</h4>\n<p>Kreps, Rao, and Narkhede spent approximately one year building the first version of Kafka. The core architectural decision was treating the message store as an append-only log rather than a queue. This single choice enabled sequential disk I/O (orders of magnitude faster than random I/O), stateless brokers (consumers track their own position), arbitrary replay (consumers read from any offset), and horizontal partitioning (each partition is an independent log that scales independently).</p>\n<hr />\n<h3>Result</h3>\n<h4>1 Billion Events Per Day at Launch, 7 Trillion by 2019</h4>\n<p>Kafka went into production at LinkedIn in 2011 and immediately became the backbone of the company's real-time infrastructure. At launch it was ingesting over 1 billion events per day. LinkedIn open-sourced it in early 2011. It became an Apache Top-Level Project in October 2012. By 2015 it was processing 1 trillion messages per day. By 2019, 7 trillion. Kreps, Narkhede, and Rao left LinkedIn in November 2014 to found Confluent, building the commercial ecosystem around Kafka.</p>\n<hr />\n\n<blockquote><p>I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. 
I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.</p><p><em>— Jay Kreps — on naming Kafka after Franz Kafka, via Quora</em></p></blockquote>\n<p>The name was chosen when the project was being prepared for open-sourcing. Jay Kreps was inspired by <em>Franz Kafka</em> (the German-language novelist, 1883–1924, known for works including The Metamorphosis, The Trial, and The Castle — exploring themes of alienation, bureaucracy, and transformation) — a writer whose work Kreps admired and whose name, he felt, suited a system built for writing. The practical truth is that the naming was an afterthought. The engineering came first. In the original academic paper published at the NetDB workshop in June 2011, the system is described without literary flourish: a distributed messaging system for log processing, designed for high throughput, low latency, and horizontal scalability. The paper's benchmarks were direct: <strong>Kafka produced messages at a rate more than an order of magnitude faster than ActiveMQ or RabbitMQ</strong>. The numbers were not close.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why LinkedIn's Data Architecture Needed Real-Time</p><p>LinkedIn's core value proposition — showing you the right jobs, the right connections, the right content — required <strong>real-time signals</strong>. If you search for \"software engineers in San Francisco\" and connect with three of them, LinkedIn's recommendations should update within seconds to reflect what your new connections know and who they know. With Hadoop batch jobs, this update happened hours later. With Kafka feeding real-time stream processing, <strong>updates became searchable within seconds of being posted</strong>. 
The latency reduction from hours to seconds was not a technical nicety — it was the product feature that made LinkedIn's social graph feel alive.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>What Existing Systems Got Wrong at Scale</p><p>The original Kafka paper published the benchmark numbers without ceremony. LinkedIn configured a single producer to publish 10 million messages of 200 bytes each. Kafka with batch size 50: <strong>~50MB/sec</strong>. ActiveMQ: <strong>~2MB/sec</strong>. RabbitMQ: slightly better than ActiveMQ but far below Kafka. The gap was not 10% or even 2x — it was an order of magnitude. The performance difference traced directly to Kafka's design: sequential disk writes, zero per-message broker state, batched I/O, and a message format with 9 bytes of overhead versus ActiveMQ's 144 bytes.</p>\n</blockquote>\n\n<p>Jay Kreps wrote one of the most-cited engineering essays of the last decade: <strong>\"The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction\"</strong> (LinkedIn Engineering, 2013). The essay argued that the append-only log was not just a Kafka implementation detail — it was a <strong>universal primitive</strong> for distributed systems. Databases use it for replication. Kafka uses it for messaging. Stream processors use it for state. The essay made the case that any system that needs to integrate data across multiple consumers should be built on a log, not on point-to-point integrations. At the time Kreps wrote the essay, LinkedIn was running over <strong>60 billion unique message writes through Kafka per day</strong> — several hundred billion counting cross-datacenter replication. 
The argument was not theoretical.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>LinkedIn's Real-Time Feature Hunger</p><p>LinkedIn's product ambitions in 2010 were fundamentally real-time: <strong>who viewed your profile in the last 24 hours?</strong> Which of your connections just updated their job title? When a recruiter posts a job matching your skills, how fast does it appear in your feed? These features required that activity data flowing from the website into the recommendation and notification systems be fresh — not hours old. The batch pipeline to Hadoop was adequate for weekly model training but useless for features that needed sub-minute freshness. Kafka was not just a performance improvement over existing infrastructure; it was the prerequisite for an entire class of real-time product features that didn't yet exist.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>1 BILLION EVENTS PER DAY — IMMEDIATELY</strong></p>\n<p>When Kafka went into production at LinkedIn in 2011, it was immediately processing <strong>more than 1 billion events per day</strong>. This was not a gradual ramp — the scale was there from day one because LinkedIn's existing activity volume was already at that level; Kafka simply replaced the fragile point-to-point pipelines that had been handling it. The immediate billion-event scale validated the architecture under real production conditions within weeks of launch. It also meant the open-source release in early 2011 came with a credibility that mattered: this was not a research prototype. It was a system already running at significant scale.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Five Design Decisions That Made Kafka Fast</h3>\n<p>Kafka's performance advantage over existing systems was not the result of clever optimization of a standard architecture. 
It was the result of choosing a fundamentally different architecture, where every key design decision reinforced the same goal: <strong>maximize throughput for streaming event data</strong>. Five decisions stand out as architecturally defining — and each was a deliberate rejection of how existing messaging systems had been built.</p>\n<ul>\n<li><strong>~50MB/s</strong> — Kafka producer throughput in the original 2011 benchmark — versus ~2MB/s for ActiveMQ at the same message size (200 bytes, 10M messages)</li>\n<li><strong>9 bytes</strong> — Per-message overhead in Kafka — versus 144 bytes in ActiveMQ. That is a 16x reduction in per-message overhead, which matters most when payloads are small: for the benchmark's 200-byte messages, each stored message shrinks from roughly 344 bytes to roughly 209 bytes</li>\n<li><strong>Stateless</strong> — Kafka brokers — consumer offset tracking is done by the consumer, not the broker, eliminating the broker memory pressure that crippled traditional queues at scale</li>\n<li><strong>Sequential</strong> — Disk access pattern for both writes and reads — append-only to the log means no random I/O, allowing Kafka to push disk throughput to near hardware limits</li>\n</ul>\n\n<pre><code>// The five key Kafka design decisions in code form\n\n// DECISION 1: Append-only log storage (not a queue)\n// Each partition is a directory of segment files, appended to sequentially\n// /kafka-logs/my-topic-0/00000000000000000000.log\n// /kafka-logs/my-topic-0/00000000000000100000.log  (new segment at 100k messages)\n// → Sequential writes: disk seeks are expensive; sequential I/O is 100x faster\n\n// DECISION 2: Consumer tracks its own offset\n// The broker doesn't care what consumers have read — it just serves bytes\nlong consumerOffset = consumer.position(topicPartition);  // consumer owns this\n// → Brokers are stateless: no per-consumer memory, no ack tracking overhead\n// → Consumers can replay any time: just reset the offset\nconsumer.seek(topicPartition, 0);  // replay from the beginning\n\n// DECISION 3: Topics are partitioned for horizontal 
scale\n// Each partition is an independent log — producers and consumers parallelise\nProducerRecord&lt;String, String&gt; record =\n    new ProducerRecord&lt;&gt;(\"user-activity\",\n        userId,          // partition key: same user → same partition = ordered\n        eventJson        // the message\n    );\n// → Topic with N partitions can be consumed by N consumers in parallel\n// → Add brokers, add partitions: linear throughput scaling\n\n// DECISION 4: Batch I/O from client to broker\nprops.put(\"batch.size\", 16384);          // batch up to 16KB before sending\nprops.put(\"linger.ms\", 5);               // or wait up to 5ms for the batch\n// The original paper: batch size 50 improved throughput by ~10x vs batch size 1\n\n// DECISION 5: Zero-copy transfer using sendfile()\n// When a consumer fetches data, Kafka uses OS sendfile() syscall:\n// data goes directly disk → network socket, bypassing userspace entirely\n// → No data copy into JVM heap → no GC pressure → consistent low latency\n// This is why Kafka can deliver data nearly as fast as the network allows</code></pre>\n<blockquote>\n<p><strong>THE STATELESS BROKER: THE COUNTERINTUITIVE MASTERSTROKE</strong></p>\n<p>The most counterintuitive decision in Kafka's design is making the broker stateless — the broker doesn't track which consumers have read which messages. In ActiveMQ and RabbitMQ, the broker maintains delivery state for every message: who acknowledged it, who hasn't, what needs to be retried. At scale, this per-message state tracking consumes enormous memory and creates a bottleneck. Kafka's solution was radical: <strong>let consumers track their own position</strong> (their offset in each partition). The broker just stores bytes in a log. Consumers read at their own pace, commit their offset to Zookeeper (later to a Kafka topic itself), and can reset to any offset to replay. 
The broker's memory footprint is constant regardless of consumer count or message backlog.</p>\n</blockquote>\n\n<div><table><caption>Kafka vs Traditional Message Queues: Architectural Comparison (2011 original benchmarks and design properties)</caption><thead><tr><th>Property</th><th>ActiveMQ / RabbitMQ</th><th>Kafka</th></tr></thead><tbody><tr><td>Storage model</td><td>Queue (messages deleted after ack)</td><td>Append-only log (messages retained by time/size)</td></tr><tr><td>Broker state</td><td>Tracks ack state per message per consumer</td><td>Stateless — consumers track own offset</td></tr><tr><td>Producer throughput (bench)</td><td>~2 MB/sec (ActiveMQ)</td><td>~50 MB/sec (batch size 50)</td></tr><tr><td>Message overhead</td><td>144 bytes (ActiveMQ, JMS header)</td><td>9 bytes</td></tr><tr><td>Consumer replay</td><td>Not supported (message gone after ack)</td><td>Supported — seek to any offset</td></tr><tr><td>Horizontal scale</td><td>Limited (complex cluster configs)</td><td>Native — add partitions, add consumers</td></tr><tr><td>Use case fit</td><td>Task queues, guaranteed delivery, routing</td><td>Event streaming, log aggregation, activity tracking</td></tr></tbody></table>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>What LinkedIn Actually Used Kafka For</p><p>By 2019, Kafka was the circulatory system of LinkedIn's infrastructure. <span><strong>Activity tracking</strong></span> (the original use case): every page view, search, ad impression fed to both Hadoop and real-time processors. <strong>Real-time search indexing</strong>: profile updates searchable within seconds. <strong>Database replication</strong>: Espresso CDC via Kafka replaced MySQL replication. <strong>Inter-service messaging</strong>: microservices decoupled through Kafka topics. 
<strong>Stream processing</strong>: Apache Samza (LinkedIn's open-source stream processor) used Kafka as both input and durable state store. Every part of LinkedIn's data plane ran on Kafka.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Zero-Copy: The OS Kernel Trick That Doubled Throughput</p><p>One of Kafka's most impactful performance optimizations is invisible to application code: <strong>zero-copy data transfer</strong> using the OS-level <code>sendfile()</code> syscall. In a traditional data transfer, data moves from disk to kernel buffer, kernel buffer to userspace, userspace to socket buffer, socket buffer to network. In Kafka's consumer path, the OS <code>sendfile()</code> call transfers data directly from the page cache (disk buffer) to the network socket, bypassing userspace entirely. This means no data is copied into the JVM heap — no GC pressure, no object allocation overhead. At LinkedIn's throughput rates, this optimization alone is responsible for significant throughput gains and, more importantly, for Kafka's consistent low latency even under high load.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Open-Source Flywheel</p><p>LinkedIn open-sourced Kafka in early 2011 — before it was even an Apache project. The decision to share the core infrastructure was not philanthropic; it was strategic. LinkedIn's engineers knew that the data pipeline problem they had solved was universal. By open-sourcing Kafka, they attracted contributions from engineers at Netflix, Uber, Twitter, and hundreds of other companies — all of whom had the same problem. The community built tooling LinkedIn would never have had resources to build alone: Kafka Streams, Kafka Connect, ksqlDB, MirrorMaker, Schema Registry. 
The open-source flywheel turned a LinkedIn internal tool into the internet's standard real-time data infrastructure.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Kafka's architecture has three layers. The <strong>storage layer</strong> is a set of partitioned, replicated append-only log files on disk — each partition is an independent, totally ordered sequence of records. The <strong>broker layer</strong> is a cluster of server processes that manage partition assignment, replication, and client connections — but hold no consumer state. The <strong>client layer</strong> is producers writing to partitions and consumer groups reading from them, each consumer group maintaining its own independent offset position per partition. Understanding why this architecture outperforms traditional queues requires visualizing the data flow.</p>\n<h3>Before Kafka: N×M Integration Spaghetti</h3>\n<p><a href=\"https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After Kafka: The Centralized Log Hub</h3>\n<p><a href=\"https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Inside Kafka: Topics, Partitions, Offsets, and Consumer Groups</h3>\n<p><a href=\"https://techlogstack.com/explore/linkedin-kafka-origin-2011/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE LOG/TABLE DUALITY: JAY KREPS' DEEPER INSIGHT</strong></p>\n<p>In his 2013 essay \"The Log,\" Kreps articulated a concept that went beyond Kafka's implementation: the <em>log/table duality</em> (a mathematical relationship where any database table can be derived by 
replaying a log of changes from the beginning, and any log can be materialized into a table by applying each event as a state update — they are two views of the same underlying truth). Every database table can be derived by replaying the change log from the beginning. Every stream of events can be materialized into a table by accumulating state. This duality means a <strong>Kafka topic is simultaneously a stream and a database</strong> — you can query it as a stream in motion (stream processing) or materialize it as a snapshot (a table). This insight later became the foundation for Kafka Streams, ksqlDB, and the entire stream-processing ecosystem.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>LinkedIn's Kafka by 2019: The Numbers</p><p>By 2019, LinkedIn's Kafka deployment had become one of the largest publicly documented distributed systems in existence: <strong>7 trillion messages per day</strong>, spread across <strong>100+ clusters</strong>, <strong>4,000+ brokers</strong>, <strong>100,000+ topics</strong>, and <strong>7 million partitions</strong>. Each message was consumed by approximately <strong>four consumer groups</strong> on average. The cross-datacenter replication system (Brooklin) was itself mirroring over 7 trillion messages per day. From 1 billion events per day at launch in 2011 to 7 trillion by 2019: a <strong>7,000x growth</strong> in eight years on the same fundamental architecture.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Kafka is fifteen years old and it powers a majority of the world's real-time data infrastructure. Its success is not luck — it emerged directly from the architectural decisions Jay Kreps, Jun Rao, and Neha Narkhede made in 2010. 
The lessons here are about identifying the right abstraction, challenging assumptions baked into existing tools, and the compounding returns of open-sourcing infrastructure.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Before building, verify no existing tool solves your problem at your scale.</strong> The Kafka team evaluated ActiveMQ, RabbitMQ, and existing log aggregation systems before building. Their conclusion — existing tools were designed for the wrong problem — was evidence-based. The benchmark comparison (50 MB/sec vs 2 MB/sec) made the decision concrete. Never rebuild what can be adopted; never adopt what demonstrably can't serve your workload.</li><li><span>02</span><div><em>The append-only log</em> (a data structure where records are only ever added to the end, never modified in place — enabling sequential I/O, arbitrary consumer replay, and stateless brokers) is the universal data integration primitive. Any system that moves data between producers and consumers is implementing a log, whether it knows it or not. The explicit recognition of this pattern — and building directly on it — is what gave Kafka its performance advantage and its flexibility.</li><li><span>03</span><div><strong>Stateless brokers make systems horizontally scalable in ways stateful brokers cannot match.</strong> When the broker tracks delivery state per consumer per message, broker memory scales with consumers × outstanding messages. When consumers track their own offsets, broker memory scales with partitions only. This seemingly small architectural choice is why Kafka can serve hundreds of consumer groups without broker degradation.</li><li><span>04</span><div>Sequential I/O is dramatically faster than random I/O on both HDDs and SSDs. <strong>An append-only log turns a bursty stream of writes into sequential disk operations</strong>, allowing Kafka to approach disk hardware throughput limits. 
Systems that update records in-place pay random I/O costs on every write. Kafka writes append-only and leverages the OS page cache for reads, achieving throughput that surprised the entire industry.</li><li><span>05</span><div>Open-sourcing infrastructure that solves a universal problem creates compounding returns. <strong>LinkedIn open-sourced Kafka in 2011 because the team recognized it solved a problem every data-intensive company had.</strong> The community contributions, ecosystem tools (Kafka Streams, Connect, ksqlDB), and widespread adoption that followed made Kafka better than LinkedIn could have built alone. Netflix, Uber, Goldman Sachs, and thousands of other companies now run Kafka — and improvements they contribute flow back to LinkedIn. The return on open-sourcing infrastructure is measured in ecosystem, not just code.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>From 1 Billion to 7 Trillion: The Same Architecture</p><p>The most remarkable fact about Kafka's growth is that the core architecture described in the 2011 paper — append-only partitioned log, stateless brokers, consumer-side offsets — is still the architecture running at 7 trillion messages per day in 2019. The system scaled <strong>7,000x on the same fundamental design</strong>. Operational complexity grew (Cruise Control for rebalancing, Burrow for consumer lag monitoring, Brooklin for cross-datacenter replication), but the append-only log at the center of it all never needed to be replaced. Good architecture ages well.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE PLATFORM THAT MADE KAFKA A COMPANY</strong></p>\n<p>In November 2014, three years after Kafka's public launch, Jay Kreps, Neha Narkhede, and Jun Rao left LinkedIn to found <strong>Confluent</strong> — a company built to provide enterprise Kafka services, managed Kafka infrastructure, and the commercial ecosystem around the open-source project. Confluent was valued at $4.5 billion in its 2020 funding round and went public in June 2021 at a valuation of roughly $9 billion. 
The path from LinkedIn internal tool to billion-dollar company in thirteen years is one of the most compelling data points for the value of open-sourcing well-designed infrastructure. The tool built to solve LinkedIn's data pipeline problem had become the data pipeline solution for most of the internet.</p>\n</blockquote>\n\n<blockquote><p>Jay Kreps named Kafka after Franz Kafka because it was 'a system optimized for writing' — and then built something that the entire internet writes 7 trillion messages through per day, which is exactly the kind of outcome Franz Kafka would have found deeply, cosmically absurd.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/linkedin-kafka-origin-2011/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-19T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.188763+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Messaging", "LinkedIn"]}, {"id": "https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/", "url": "https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/", "title": "The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work", "summary": "How Shopify's Sidekick and Flow agent teams built LLM evaluation infrastructure — LLM judges, merchant simulators, and production mirroring — to close the demo-to-pr", "content_html": "<p><strong>Shopify</strong> · Reliability · 19 May 2026</p>\n<p>Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between 'impressive demo' and 'product I'd trust with my customers' — takes the rest of the time. 
Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.</p>\n<ul>\n<li>300-example hand-crafted benchmark</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>ZenML analyzed 1,200 production LLM deployments across companies ranging from startups to large enterprises and found a pattern so consistent it has become a rule: <strong>reaching 80% quality happens quickly, but pushing past 95% requires the majority of total development time</strong>. The teams that hit 80% in four weeks and spend the next six months trying to reach 95% are not failing — they are experiencing the standard engineering curve for AI systems. The teams that mistake 80% for done are the ones shipping products that quietly erode user trust. Shopify's engineering teams, building both Sidekick (the merchant AI assistant) and the Flow agent (automated workflow generation), lived this curve in production. Their solution was not a better model. It was a better measurement system.</p>\n<blockquote>\n<p><strong>WHY EVALUATION IS THE HARD PART</strong></p>\n<p>Traditional software has a truth oracle: does the function return the correct value? LLM systems have no such oracle. A response can be grammatically correct, semantically reasonable, formatted perfectly — and still be wrong in ways that only a domain expert would notice, or only appear wrong on the tenth interaction in a specific workflow. <strong>Without a reliable way to measure quality, you cannot improve systematically.</strong> You are optimizing blind, hoping that the next prompt change or model upgrade makes things better without making other things worse. 
Evaluation infrastructure is not overhead — it is the prerequisite for all other AI engineering work.</p>\n</blockquote>\n\n<p>Shopify's Flow agent generates Shopify Flow automations from natural language — merchants describe what they want to happen ('when an order is over $200, add the customer to my VIP segment'), and the agent produces the workflow. The task requires <em>tool calling</em> (a pattern where an LLM is given a set of available functions (tools) with descriptions, and can request that a specific tool be executed by generating a structured function call — enabling LLMs to take real-world actions beyond text generation) and produces a structured output in a domain-specific format. It sounds well-bounded. In practice, the diversity of merchant intent is vast, the edge cases accumulate rapidly, and subtle errors in the generated workflow — a wrong condition operator, a missing trigger — produce silently incorrect automations that only fail when a merchant's order actually arrives.</p>\n<blockquote>\n<p><strong>📏</strong></p>\n<p>Shopify calibrated their LLM judge from a <strong>Cohen's Kappa of 0.02</strong> (essentially random — the judge agreed with human evaluators no more than chance would predict) to <strong>0.61</strong>, close to the human evaluator baseline of 0.69. The human baseline itself was 0.69 rather than 1.0 — a reminder that human evaluators don't perfectly agree with each other either. The goal is not a perfect judge; it's a judge trustworthy enough that its signals drive reliable engineering decisions.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Benchmarks Said Ready; Production Said Otherwise</h4>\n<p>Shopify's fine-tuned Flow agent passed a hand-crafted 300-example benchmark at high accuracy. When deployed to production shadow traffic, performance on real merchant workflows diverged from the benchmark. The benchmark had been crafted by engineers who knew the system well and implicitly sampled from the distribution they understood. 
Real merchant intent had a long tail the benchmark didn't capture.</p>\n<hr />\n<h3>Cause</h3>\n<h4>No Quality Signal Trustworthy Enough to Drive Iteration</h4>\n<p>The early LLM judge had a Cohen's Kappa of 0.02 — barely better than random agreement with human evaluators. This meant the judge's verdicts could not reliably distinguish good responses from bad ones. Engineering decisions based on judge verdicts were effectively noise. Human evaluation at scale was impractical. Without a trustworthy quality signal, iteration was slow and direction was unclear.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Calibrated LLM Judge + Production Mirroring Flywheel</h4>\n<p>The team iteratively improved the LLM judge through systematic calibration against human labels (Kappa 0.02 → 0.61), then used it to score production traffic at scale. Production mirroring — routing real traffic through both current and candidate models — generated the failure cases that didn't appear in benchmarks. Those failures were fed back into the training dataset, closing the benchmark-to-production gap.</p>\n<hr />\n<h3>Result</h3>\n<h4>Production Gap Closed in Two Weeks with the Flywheel</h4>\n<p>The gap from 'benchmark-ready' to 'production-ready' closed in two weeks using the production mirroring flywheel. The fine-tuned Flow agent now serves the majority of production traffic. Weekly retraining cycles on H200 GPUs mean the model continuously improves from new production signal rather than drifting as merchant behavior evolves.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Human Agreement Ceiling</p><p>One of the most grounding facts in Shopify's evaluation system is that human evaluators agreed with each other at <strong>Cohen's Kappa of 0.69</strong> — not 1.0. Humans disagree about quality. This is not a failure of the evaluation process; it reflects genuine ambiguity in what 'correct' means for natural language tasks. The practical implication: don't try to build a perfect judge. 
Build a judge that matches or approaches human agreement levels, and treat that as the meaningful ceiling. Optimizing a judge past the human agreement level is overfitting to individual human annotators, not finding truth.</p>\n</blockquote>\n\n<p>The merchant simulator deserves particular attention as an engineering pattern. Before any system change ships to production, it is tested against simulated merchant interactions derived from real production conversations. The simulator captures the 'essence' — the underlying merchant goal — from real conversations and replays that goal against the new system. This is fundamentally different from benchmark evaluation: it tests the new system against <strong>realistic merchant intent distributions</strong>, including the long tail that engineering-crafted benchmarks consistently miss. It is also fundamentally different from A/B testing: it catches regressions before any real merchant sees them, without requiring a traffic split.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Synthetic Data: Closing the Training Data Gap</p><p>The Flow agent's fine-tuning training data was almost entirely <strong>synthetic</strong> — generated by an LLM, not labeled by humans. The process: sample a real production workflow, use a stronger model to generate a plausible natural-language request that would produce it, construct the ideal multi-turn tool trajectory. The synthetic data generation was the majority of the engineering effort. The resulting dataset covered the breadth of Flow's usage in a way that manual annotation never could — because the diversity of real workflows provided the supervision signal, and the LLM provided the scale. 
This is the emerging pattern for fine-tuning specialists: synthetic data from real production outputs, not expensive human annotation from scratch.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔬</strong></p>\n<p>The Industry Pattern: ZenML's 1,200 Case Studies</p><p>ZenML's LLMOps database of 1,200+ production deployments confirms that Shopify's experience is universal, not exceptional. The summary from their analysis: <strong>'Perhaps this is a truism by now, but you'll spend more time building evaluation infrastructure than you will on the actual application logic. And if you're not, you're probably shipping broken features.'</strong> LLM-as-judge has emerged as the dominant pattern for scalable quality measurement. But every successful deployment maintains human-in-the-loop golden datasets for critical domains. The dual-layer approach — LLM judges for velocity, human ground truth for calibration — is the de facto standard.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The A/B Test That Isn't</p><p>Teams new to LLM evaluation often reach for A/B testing as the measurement tool: split traffic, measure conversion, pick the winner. A/B testing has a fatal problem for LLM evaluation: <strong>a 5% improvement in a downstream metric like merchant click-through might take weeks of traffic to reach statistical significance</strong> — and you may have introduced a subtle quality regression in a different dimension that the metric doesn't capture. Production mirroring with direct output comparison is faster and richer: you see whether response quality improved for the same inputs, without waiting for downstream business metric movement. Business metrics confirm value; output comparison guides engineering.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE COST CURVE IS ASYMMETRIC</strong></p>\n<p>The 80% → 95% quality journey is asymmetric in effort. 
The first 80% comes from model capability: the LLM already knows how to generate text, use tools, and follow instructions. The final 15% comes from <strong>understanding the specific failure modes of your specific application on your specific user distribution</strong> — and that knowledge cannot be bought or downloaded. It is earned through measurement, systematic failure analysis, and targeted training data creation. The companies that invest in this domain-specific evaluation work build durable advantages over those that simply upgrade to the next model version and hope.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🏭</strong></p>\n<p>Notion AI (referenced in ZenML's analysis) built a <strong>multi-layer evaluation stack</strong> that balances speed and cost: cheap heuristic checks on every commit, LLM judge scoring on every merge, and expensive human evaluation on every release candidate. Teams that adopted this tiered approach reported <strong>10x faster development velocity</strong> compared to running full human evaluation on every change. The insight: match eval depth to the stakes of the change, not to a uniform 'always run everything' policy.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Building the Evaluation Flywheel</h3>\n<p>Shopify's evaluation architecture is best understood as a flywheel: production traffic generates failures, failures feed the training pipeline, retraining improves the model, the improved model generates fewer failures, and the cycle continues. Each turn of the flywheel reduces the gap between benchmark performance and production performance. The flywheel only works if each component — quality measurement (LLM judge), failure collection (production mirroring), training (fine-tuning pipeline), deployment (shadow traffic + promotion) — is production-grade itself. A miscalibrated judge produces misleading signal. A flaky training pipeline slows iteration. 
A low-coverage benchmark misses the failures that actually matter.</p>\n<ul>\n<li><strong>0.61</strong> — Cohen's Kappa achieved for Shopify's LLM judge after iterative calibration — close to the human evaluator baseline of 0.69 and sufficient to drive reliable engineering decisions</li>\n<li><strong>300</strong> — Hand-crafted benchmark examples for the Flow agent — covering breadth of expected usage, used as the initial quality gate before production shadow testing</li>\n<li><strong>2 weeks</strong> — Time to close the benchmark-to-production gap using the production mirroring flywheel — from 'benchmark says ready' to 'production confirms ready'</li>\n<li><strong>Weekly</strong> — Qwen3-32B retraining cadence on H200 GPUs (12h full training run) — keeping the model aligned with evolving merchant behavior without months-long release cycles</li>\n</ul>\n\n<pre><code class=\"language-python\"># LLM Judge calibration: the process that takes you from Kappa 0.02 to 0.61\n# A judge is only useful if it agrees with humans. 
Measure this first.\n# call_llm, find_disagreements, and improve_prompt are hypothetical helpers you supply;\n# calibration_set and judge_prompt are assumed to be defined before the loop runs.\n\nfrom sklearn.metrics import cohen_kappa_score\n\ndef calibrate_llm_judge(judge_prompt: str, calibration_set: list[dict]) -> tuple[float, list[str]]:\n    \"\"\"\n    calibration_set: list of {conversation, human_label} pairs\n    human_label: 'good' | 'bad' | 'needs_improvement'\n    Returns (Cohen's Kappa between judge and human labels, the judge's labels).\n    \"\"\"\n    judge_labels = []\n    for sample in calibration_set:\n        # Ask the judge to evaluate this conversation\n        judge_verdict = call_llm(judge_prompt, sample['conversation'])\n        judge_labels.append(judge_verdict)\n    \n    human_labels = [s['human_label'] for s in calibration_set]\n    kappa = cohen_kappa_score(human_labels, judge_labels)\n    return kappa, judge_labels  # target: >0.60 before trusting judge at scale\n\n# The calibration loop:\n# Measure the starting judge first (Shopify's early judge scored 0.02, barely better than random)\nkappa, judge_labels = calibrate_llm_judge(judge_prompt, calibration_set)\nwhile kappa < 0.60:\n    # Analyze where judge and humans disagree\n    disagreements = find_disagreements(calibration_set, judge_labels)\n    \n    # Improve judge prompt based on disagreement patterns:\n    # - Add clarifying criteria for ambiguous cases\n    # - Add few-shot examples from disagreements (human = ground truth)\n    # - Adjust rubric language to match human intuitions\n    judge_prompt = improve_prompt(judge_prompt, disagreements)\n    \n    # Re-score with the updated prompt so the next pass sees fresh disagreements\n    kappa, judge_labels = calibrate_llm_judge(judge_prompt, calibration_set)\n    print(f\"Kappa after iteration: {kappa:.2f}\")  # illustrative trajectory: 0.02 → 0.15 → 0.31 → 0.48 → 0.61</code></pre>\n<blockquote>\n<p><strong>PRODUCTION MIRRORING: THE GROUND TRUTH TEST</strong></p>\n<p>Benchmarks are necessary but not sufficient. A benchmark is a fixed dataset that reflects the understanding of the engineers who created it. Production traffic reflects the actual diversity of user intent — including all the edge cases, unusual phrasings, and unexpected use patterns that no engineer anticipated. 
<strong>Production mirroring routes a percentage of real traffic through both the current model and the candidate model simultaneously</strong>, comparing outputs. Differences trigger human review of high-value or uncertain cases. This is the only way to discover whether a model improvement that looks good on a benchmark actually performs better for real users — or merely performs better on what engineers think real users want.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Synthetic Data Generation Pipeline</p><p>Shopify's Flow agent training data was generated through a three-step pipeline: <strong>Step 1</strong> — sample a diverse set of validated production workflows (at least one workflow per unique workflow descriptor, from merchants with two or more qualifying workflows). <strong>Step 2</strong> — use a stronger LLM to generate a plausible natural-language merchant request that would lead to that workflow. <strong>Step 3</strong> — construct the ideal multi-turn tool call trajectory from request to completed workflow. The resulting dataset had two properties manual annotation lacks: <strong>scale</strong> (the production workflow corpus is large) and <strong>grounding</strong> (every training example was derived from a real workflow that actually ran).</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Tangle: The ML Pipeline That Enables Weekly Retraining</p><p>The full training pipeline — data collection, synthetic data generation, fine-tuning, evaluation, deployment — runs on <strong>Tangle, Shopify's open-source ML experimentation platform</strong>. Tangle composes each pipeline step as a reproducible workflow with intelligent caching: only the steps affected by a change re-run. This means a change to the synthetic data generator doesn't trigger a full pipeline rerun from scratch — only the data generation step and its downstream steps re-execute. 
The caching infrastructure is what makes weekly retraining economically and operationally viable. Without it, the iteration cycle would be measured in months, not weeks.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Golden Datasets Are Non-Negotiable</p><p>ZenML's analysis is unambiguous: <strong>every successful production LLM deployment they analyzed maintains human-in-the-loop golden datasets</strong> for critical domains. LLM judges are used for velocity — scoring production traffic at scale. But they drift. A judge trained on last month's quality standards may give wrong verdicts on today's outputs. Golden datasets — small, carefully curated, human-labeled examples that represent ground truth — anchor the judge calibration and detect judge drift. Without a golden dataset, you have no way to know when your quality measurement system itself has stopped working.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Two-Week Rule</p><p>Shopify's experience with the production mirroring flywheel produced a rule of thumb that has since appeared in multiple other teams' postmortems: <strong>if your candidate model passes benchmark evaluation, it takes approximately two weeks of production mirroring to confirm whether it's truly production-ready</strong>. Two weeks of real traffic at a shadow percentage generates enough diverse examples to surface the tail failures that the benchmark didn't cover. If the flywheel is working, those failures are incorporated into the training data and the model improves. If the failures are systematic — indicating a training distribution problem rather than isolated edge cases — the two-week window reveals this before the model is promoted to full production.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The evaluation architecture for production LLM systems has four components that form a cycle. 
<strong>Benchmark evaluation</strong> provides fast, reproducible quality gates during development. <strong>LLM-as-judge scoring</strong> provides continuous quality measurement at production traffic scale. <strong>Production mirroring</strong> provides ground truth about whether a candidate model performs better for real users. <strong>The training flywheel</strong> converts production failures into training examples, closing the gap each cycle. Each component is necessary; none is sufficient alone.</p>\n<h3>The Production LLM Evaluation Flywheel</h3>\n<p><a href=\"https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>LLM Judge Architecture: From Random Agreement to Near-Human</h3>\n<p><a href=\"https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE MERCHANT SIMULATOR AS PRE-DEPLOYMENT SAFETY NET</strong></p>\n<p>The merchant simulator sits between benchmark evaluation and production mirroring — it's a <strong>synthetic production environment</strong>. It replays real merchant intents (extracted from production conversations) against candidate systems in a controlled environment, before any real merchant sees the new system. This catches the specific failure mode that benchmarks miss: correct behavior on engineer-anticipated test cases, incorrect behavior on the realistic distribution of merchant intent in production. 
The simulator doesn't replace production mirroring — it prevents the worst regressions from reaching the production mirroring stage at all.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Eval Budget vs Training Budget: The Cost Trap</p><p>ZenML's analysis of 1,200 production deployments found that teams frequently discover that <strong>running comprehensive evaluations on every commit burns through inference budget faster than production traffic</strong>. Running a full eval suite — LLM judge on 1,000 examples × multiple iterations per PR — can cost more per day than serving users. The solution is a tiered eval strategy: fast, cheap unit evals on every commit; comprehensive judge-scored evals on every merge; full production mirroring only for release candidates. Eval should be sized to the stakes of what's being changed, not run at maximum coverage on every code change.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🧪</strong></p>\n<p>The Multi-LLM Annotation Pattern</p><p>For high-stakes quality assessments (like Shopify's Global Catalogue product taxonomy), a single LLM judge has too much variance. The production pattern is to run <strong>multiple LLMs independently on the same evaluation task</strong>, then use an arbitration system — a specialized model — to resolve disagreements. This ensemble approach dramatically reduces false positives in quality assessment: a response that confuses one model but is rated correctly by three others is probably correct. The arbitration model applies structured ruling logic for edge cases that simple voting would misclassify. The pattern adds cost but reduces the error rate of the quality signal for critical decisions.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The 80% quality curve is the defining challenge of production AI engineering. The teams that accept it and build systematic measurement infrastructure navigate it successfully. 
The teams that are surprised by it and try to push past it with more prompting and model upgrades are still on it.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>You will spend more time building evaluation infrastructure than the application logic itself.</strong> This is not inefficiency — it is the correct allocation of engineering effort for probabilistic systems. Accept it before starting. Budget for it explicitly. The teams shipping reliable AI products have evaluation as a first-class engineering investment, not an afterthought.</li><li><span>02</span><div><em>LLM-as-judge</em> (using a language model to evaluate the outputs of another language model, calibrated against human labels to produce quality scores at scale without requiring manual human evaluation of every production interaction) is the scalable evaluation pattern. But an uncalibrated judge (Cohen's Kappa 0.02) is worse than useless — it gives false confidence. Calibrate your judge against human labels before trusting its verdicts. Target Kappa ≥ 0.6.</li><li><span>03</span><div><strong>A benchmark that passes is a necessary condition, not a sufficient one.</strong> Benchmarks reflect what engineers anticipated; production reflects what users actually do. Always follow benchmark success with production mirroring — routing real traffic through both current and candidate systems and comparing outputs. The two weeks Shopify needed to close the benchmark-to-production gap is the standard cost of this final validation step.</li><li><span>04</span><div><em>Synthetic data generation</em> (using an LLM to create training examples from a production data source, such as generating natural-language merchant requests from real production workflows) from real production outputs is the path to scalable training data for domain-specific fine-tuning. Manual annotation doesn't scale. 
Synthetic data derived from production workflows does — and it's grounded in real-world distribution rather than engineer-imagined distribution.</li><li><span>05</span><div><strong>The retraining cycle speed determines how fast you can respond to production drift.</strong> Merchant behavior changes, new workflow patterns emerge, new merchant categories join Shopify — and a model trained on last quarter's data will drift from current reality. Weekly retraining on production signal, made economically viable by efficient infrastructure (Tangle's intelligent caching, H200 GPUs, 12h run), keeps the model in alignment with the world it serves.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Universal Pattern Across 1,200 Deployments</p><p>ZenML's analysis of 1,200 production LLM deployments confirms Shopify's findings are not unique: <span><strong>the organizations extracting real value from AI are not the ones with the most innovative demos — they are the ones doing the less glamorous engineering work: building evaluation pipelines, implementing guardrails, designing for uncertainty, and treating their LLM systems with the same rigor they'd apply to any critical infrastructure.</strong></span> The pattern is consistent across startups, mid-market, and enterprise. Model quality is table stakes. Evaluation infrastructure is competitive differentiation.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>EVALUATION INFRASTRUCTURE IS PRODUCT</strong></p>\n<p>The merchant simulator, the calibrated LLM judge, the production mirroring pipeline, the golden dataset maintenance process — these are not internal tooling that engineers built for themselves. They are the <strong>product quality infrastructure</strong> that Shopify's merchants depend on, even though they will never see it. Every improvement to the evaluation system is an improvement to Sidekick's and Flow's reliability. Building evaluation infrastructure is building the product. 
Teams that separate 'evaluation tooling' from 'product work' are misclassifying one of their highest-value investments.</p>\n</blockquote>\n\n<blockquote><p>Shopify's engineers discovered that getting an AI to produce a correct Shopify Flow automation 80% of the time takes two weeks, and getting it to 95% takes the rest of the year — which is either a profound insight about probabilistic systems or a profound insight about how hard it is to write good evals for commerce automation, and it turns out to be both.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/shopify-llm-evaluation-production-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-19T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.889156+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Shopify"]}, {"id": "https://techlogstack.com/explore/ibm-quantum-advantage-2026/", "url": "https://techlogstack.com/explore/ibm-quantum-advantage-2026/", "title": "Quantum Computing Just Beat the Best Classical Computer — Here Is the Engineering That Made It Happen", "summary": "How Q-CTRL achieved 3,000× quantum speedup over classical computers on May 6, 2026 — and how IBM's engineering stack made it possible: Fire Opal, qLDPC, and quantum-", "content_html": "<p><strong>IBM</strong> · Distributed Systems · 19 May 2026</p>\n<p>On May 6, 2026, Q-CTRL ran a materials science simulation on an IBM quantum computer in 2 minutes. The best classical supercomputer needed over 100 hours to reach the same accuracy — and then gave up. 
The day before, IBM's quantum computers simulated a 12,635-atom protein with Cleveland Clinic and RIKEN, 40 times larger than anything attempted six months prior. After 30 years of promises, quantum advantage arrived. Here is what actually changed.</p>\n<ul>\n<li>12,635-atom protein simulated (May 5 2026)</li><li>120 qubits, 10,000+ two-qubit gates</li><li>2 min quantum vs 100+ hours classical</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>On May 19, 2026, Google Trends for India showed the query <strong>\"what is quantum computing in simple terms\"</strong> with a BREAKOUT signal — the highest possible designation, meaning search volume had increased so dramatically that the normal percentage scale couldn't contain it. The trigger was two announcements that had landed within 48 hours of each other. On May 5, scientists at Cleveland Clinic, RIKEN, and IBM used quantum computers to simulate <strong>trypsin, a protein with 12,635 atoms</strong> — the largest biologically meaningful molecule ever simulated with quantum hardware, 40 times larger than what the same method could achieve just six months prior. On May 6, Q-CTRL announced they had run a materials science calculation on an IBM quantum processor in <strong>2 minutes</strong>. The best classical supercomputer took over 100 hours to reach equivalent accuracy. That is a <strong>3,000× speedup</strong>. The physics community called it practical quantum advantage — the first time a quantum computer had demonstrably outperformed the best classical tool on a problem of real commercial relevance.</p>\n<blockquote><p>For years, quantum computing has been a promise. Now, quantum computers are producing results that matter to science. 
The systems we simulated here are the kind of molecules that biologists and chemists work with in the real world.</p><p><em>— Jay Gambetta, Director of IBM Research — IBM Think 2026, Boston</em></p></blockquote>\n<p>Understanding why these results matter requires understanding what stood in the way. <em>NISQ</em> (Noisy Intermediate-Scale Quantum — the current era of quantum computing, characterized by processors with 50–1,000 qubits that are not error-corrected, meaning errors accumulate as circuit depth grows and place hard limits on what computations can run reliably) quantum computers — the hardware that exists today — are fundamentally noisy. Every <em>two-qubit gate</em> (the fundamental entangling operation in quantum computing that creates correlations between qubits — essential for quantum algorithms but a primary source of error in NISQ hardware, with typical error rates of 0.1–1% per gate) introduces a small probability of error. At shallow circuit depths with a handful of gates, this is manageable. At the depth required for commercially meaningful simulations — 10,000+ two-qubit gates across 120 qubits — errors compound exponentially and the computation collapses into noise. For years, this was the wall. The May 2026 results are not the wall coming down. They are the first evidence that engineers have found a way to work within the wall's constraints precisely enough, and extend it enough, that real problems now fall on the quantum side of it.</p>\n<blockquote>\n<p><strong>THE SEARCH BREAKOUT EXPLAINED</strong></p>\n<p>Google Trends in India showed a BREAKOUT on 'what is quantum computing in simple terms' within hours of the May 5-6 announcements reaching mainstream media. 
India's large engineering student population — studying at IITs, NITs, VITs, and hundreds of other technical universities — represents one of the highest densities of people who both <strong>understand enough to be curious</strong> and <strong>don't yet know enough to explain it themselves</strong>. The query 'in simple terms' is the signature of real scientific interest crossing from specialist to general audience. BREAKOUT signals on engineering topics in India reliably indicate a moment when a technical development has become a cultural one.</p>\n</blockquote>\n\n<h3>What Q-CTRL Actually Did</h3>\n<p>Q-CTRL's achievement used an IBM 156-qubit Heron processor on the IBM Quantum Platform, enhanced by Q-CTRL's own <strong>Fire Opal</strong> performance-management software. The target problem was the <em>Fermi-Hubbard model</em> (a foundational physics model that describes how electrons interact in a crystal lattice — capturing phenomena like high-temperature superconductivity and quantum magnetism — problems whose classical simulation cost grows exponentially with system size, making them natural candidates for quantum advantage) — a system of 60 interacting electrons in a 1D chain, using 120 of the chip's 156 qubits and executing over 10,000 two-qubit gate operations. The classical competitor was ITensor's TDVP solver running on a 32-vCPU, 64GB-RAM AWS instance — the acknowledged best-in-class classical tool for this class of problem. The quantum computation completed in <strong>~2 minutes</strong>. 
The classical computation, to reach the same accuracy, required <strong>over 100 hours</strong> — and at longer evolution times required over 160 hours before the two results diverged irreconcilably, meaning the classical computer ran out of ability to match the quantum result entirely.</p>\n<blockquote>\n<p><strong>⚛️</strong></p>\n<p>Q-CTRL's Fire Opal compiler reduced the number of two-qubit gates required for the Fermi-Hubbard calculation by <strong>60% compared to IBM's native Qiskit implementation</strong> of the same algorithm. Fewer gates means less error accumulation. This single optimization was the difference between a circuit that collapsed into noise at this scale and one that produced results accurate enough to match — and then exceed — the classical benchmark.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>NISQ Wall: Errors Compound Before Computation Completes</h4>\n<p>NISQ quantum processors accumulate errors with every two-qubit gate. For shallow circuits (hundreds of gates), error mitigation techniques can recover useful results. For commercially meaningful simulations (10,000+ gates), errors historically compounded to the point where the quantum output was indistinguishable from random noise. This wall had blocked practical quantum advantage for three decades.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Gate Count Was the Critical Variable</h4>\n<p>Every additional two-qubit gate multiplies error probability. IBM's native Qiskit compiler produced correct but gate-heavy implementations. Q-CTRL's Fire Opal compiler took the same algorithm and reduced gate count by 60% through advanced circuit optimization and error suppression techniques built on years of quantum control research. 
The 60% reduction was the difference between circuits that collapsed into noise and circuits that produced valid results.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Two Simultaneous Breakthroughs: Materials and Biology</h4>\n<p>May 5: IBM, Cleveland Clinic, and RIKEN simulated a 12,635-atom protein using quantum-centric supercomputing — fragmenting the molecule, computing quantum-mechanical behavior on IBM Heron processors, and assembling results on Fugaku and Miyabi-G supercomputers. May 6: Q-CTRL demonstrated 3,000× speedup on the Fermi-Hubbard model, completing in 2 minutes what took classical computers 100+ hours.</p>\n<hr />\n<h3>Result</h3>\n<h4>Practical Quantum Advantage: The Field's First</h4>\n<p>On May 6, 2026, Q-CTRL declared practical quantum advantage — the first time a quantum computer had outperformed the best available classical tool on a problem of known commercial relevance, using hardware accessible to any developer via the IBM Quantum Platform. IBM CEO Arvind Krishna had predicted quantum advantage would arrive in 2026. The prediction was correct.</p>\n<hr />\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Cleveland Clinic Protein Simulation</p><p>The May 5 protein simulation used a different approach — quantum-centric supercomputing (QCSC) — pairing IBM Heron quantum processors at both Cleveland Clinic (USA) and RIKEN (Japan) with two classical supercomputers: Fugaku at RIKEN and Miyabi-G at the University of Tokyo. The key algorithm was <strong>EWF-TrimSQD</strong> — a quantum-classical hybrid that fragmented the 12,635-atom trypsin protein into computable pieces, computed quantum-mechanical behavior on QPUs (up to 94 qubits, ~6,000 quantum operations per fragment), and reconstructed the full protein's behavior on classical supercomputers. 
The result: a 40-fold increase in system size and 210× improvement in accuracy compared to results from just six months earlier.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>What These Results Are Not</p><p>Precision is important here. The Fermi-Hubbard result is not proof that quantum computers beat classical computers at everything — or even most things. The advantage holds for this specific class of fermionic simulation problems, which scale poorly for classical computers by a known theoretical argument. Breaking RSA-2048 with Shor's algorithm requires hundreds of thousands to millions of physical qubits under error correction — a challenge orders of magnitude harder. The May 2026 results are the first concrete proof that quantum advantage is achievable on useful, commercially relevant problems with today's hardware, properly engineered.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Fermi-Hubbard Model: Why This Problem Matters</p><p>The <em>Fermi-Hubbard model</em> (a foundational model in condensed matter physics that describes interacting electrons on a lattice — used to understand phenomena including high-temperature superconductivity, Mott insulators, and quantum magnetism) is not an academic toy problem. It is the theoretical framework that physicists use to understand <strong>high-temperature superconductors</strong> — materials that could enable lossless power transmission and dramatically more efficient computing. Classical computers struggle with Fermi-Hubbard at scale because the number of quantum states grows exponentially with system size — a 60-electron system has 2^60 possible states, far beyond any classical memory capacity. Quantum computers naturally represent these states using quantum superposition. 
This is the textbook case of exponential quantum advantage — and May 2026 is the first real-world confirmation that it holds in practice.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE 300MM WAFER SHIFT: SCALING QUANTUM MANUFACTURING</strong></p>\n<p>Alongside the algorithm and software achievements, IBM made a manufacturing announcement that will define the next decade of quantum hardware: shifting quantum processor wafer fabrication to <strong>300mm wafers at the Albany NanoTech Complex</strong> — the same fabrication scale used by the most advanced classical semiconductor fabs. The shift from smaller wafers doubles IBM's development speed while enabling 10× more complex chips for the fault-tolerant error correction roadmap. This is the semiconductor industry's hard-won manufacturing knowledge being applied to quantum hardware — the industrialization of quantum chip production.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Engineering Stack That Made It Possible</h3>\n<p>The Q-CTRL result did not emerge from better quantum hardware alone — it emerged from a full engineering stack combining IBM's hardware, Q-CTRL's compiler, and years of quantum control research. Three layers mattered: the <strong>hardware layer</strong> (IBM Heron's 156-qubit chip with improved coherence times and gate fidelity), the <strong>compilation layer</strong> (Q-CTRL's Fire Opal reducing gate count by 60% through circuit optimization and noise-aware compilation), and the <strong>error suppression layer</strong> (runtime techniques that actively suppress errors during execution rather than correcting them after). None of these layers alone would have been sufficient. 
The result is an emergent property of all three operating together.</p>\n<ul>\n<li><strong>3,000×</strong> — Wall-clock speedup of quantum over classical in Q-CTRL's Fermi-Hubbard simulation — 2 minutes vs 100+ hours on the best available classical hardware and software</li>\n<li><strong>60%</strong> — Gate count reduction achieved by Q-CTRL's Fire Opal compiler vs IBM's native Qiskit implementation of the same algorithm — the optimization that made the circuit depth feasible</li>\n<li><strong>12,635</strong> — Atoms in the trypsin protein simulated by Cleveland Clinic + RIKEN + IBM — the largest biologically meaningful molecule ever computed with quantum hardware</li>\n<li><strong>40×</strong> — Increase in simulation system size achieved in six months (from prior protein simulation results) — driven by the EWF-TrimSQD algorithm and tighter QPU-CPU-GPU integration</li>\n</ul>\n\n<pre><code class=\"language-python\"># Conceptual: What Fire Opal does differently from native Qiskit compilation\n# The 60% gate reduction is the engineering story in code form\n# Illustrative sketch, not a runnable script: build_fermi_hubbard_circuit is a\n# hypothetical helper and 'ibm_heron_r2' is a placeholder backend name\n# (service.backends() lists the backends your account can actually reach)\n\n# Native Qiskit compilation: correct but gate-heavy\nfrom qiskit import transpile\nfrom qiskit_ibm_runtime import QiskitRuntimeService\n\nservice = QiskitRuntimeService()\nbackend = service.backend('ibm_heron_r2')  # 156-qubit Heron (placeholder name)\n\n# The Fermi-Hubbard Trotter circuit before optimization:\n# Each Trotter step requires multiple two-qubit gate layers\n# At 90 Trotter steps: ~15,000+ two-qubit gates in the naive implementation\ncircuit = build_fermi_hubbard_circuit(n_qubits=120, n_trotter_steps=90)\nnative_transpiled = transpile(circuit, backend=backend)\n# Count every multi-qubit gate rather than indexing count_ops()['cx']:\n# Heron's native two-qubit gate is CZ, so a 'cx' key may not exist after transpilation\nprint(f\"Native gate count: {native_transpiled.num_nonlocal_gates()} two-qubit gates\")\n# Output: ~15,000+ two-qubit gates → error rate too high → output is noise\n\n# Q-CTRL Fire Opal: noise-aware compilation\nimport fire_opal\n\n# Fire Opal applies:\n# 1. Circuit rewriting: finds equivalent circuits with fewer gates\n# 2. 
Noise-aware mapping: places qubits to minimize cross-talk\n# 3. Dynamical decoupling: inserts refocusing pulses to cancel drift\n# 4. Gate fusion: combines adjacent compatible gates\n# NOTE: the call below is illustrative pseudocode. The function name and keyword\n# arguments are placeholders, not the published Fire Opal API; consult Q-CTRL's\n# documentation for the real entry point and options.\noptimized_result = fire_opal.run(\n    circuits=[circuit],\n    backend=backend,\n    optimization_level='aggressive',\n    error_suppression=['dynamical_decoupling', 'gate_twirling']\n)\nprint(\"Fire Opal gate count: ~6,000 two-qubit gates\")\n# 60% reduction → circuit runs within error tolerance → useful result\n# 120 qubits, 90 Trotter steps, ~2 minutes wall time\n# vs TDVP classical: 100+ hours before classical diverges from quantum</code></pre>\n<blockquote>\n<p><strong>ERROR SUPPRESSION VS ERROR CORRECTION</strong></p>\n<p>The May 2026 results were achieved with <strong>error suppression</strong>, not error correction. The distinction is fundamental. <strong>Error correction</strong> (the goal for 2029) uses logical qubits — groups of physical qubits that encode information redundantly and can detect and fix errors in real-time. It requires hundreds of physical qubits per logical qubit. <strong>Error suppression</strong> (what Q-CTRL and IBM use now) cannot fix errors — it minimizes them through circuit optimization, noise-aware compilation, and runtime control techniques. Error suppression works within the limits of NISQ hardware. Error correction eliminates those limits entirely. The 3,000× result was achieved within the NISQ limits. 
What becomes possible once error correction arrives is qualitatively different.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>IBM Quantum Loon: All Hardware Elements for Fault Tolerance</p><p>IBM's roadmap toward fault-tolerant quantum computing advanced significantly in November 2025 with the IBM Quantum Loon processor — the first to demonstrate <strong>all hardware components required for fault-tolerant quantum computing</strong>: c-couplers for long-range qubit connectivity, qubit reset between computations, and high-fidelity gates at FTQC-relevant speeds. Simultaneously, IBM achieved real-time <em>qLDPC</em> (quantum low-density parity-check codes — a class of quantum error-correcting code that IBM believes provides the most efficient path to large-scale fault tolerance, requiring fewer physical qubits per logical qubit than the surface code used by most competitors) decoding in under 480 nanoseconds — a full year ahead of schedule. Loon is not a production processor — it is an experimental validation platform. But it proves the components exist.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The EWF-TrimSQD Algorithm: 40× Larger in Six Months</p><p>The Cleveland Clinic simulation's 40-fold increase in system size in six months was driven by a new algorithm: <strong>EWF-TrimSQD</strong> (Embedding Workflow with Tailored Reduced-qubit Molecular Dynamics). It improved the efficiency of how the protein was fragmented into quantum-computable pieces, reducing overhead per fragment and enabling larger total simulations within the same qubit budget. The algorithm was a joint development between IBM, Cleveland Clinic, and RIKEN. 
The 40× improvement in six months means the scaling is not linear — each algorithmic improvement compounds with the hardware improvements, accelerating both simultaneously.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>What Quantum Does Not Yet Beat Classical At</p><p>Intellectual honesty requires noting what Q-CTRL's team itself acknowledged: on variational quantum eigensolver problems (a different class of quantum chemistry simulation), they found in their own research that a new classical method they developed outperformed the quantum computer just days before their quantum advantage announcement. IBM's Borja Peropadre was candid at CES 2026 about this: quantum and classical methods advance simultaneously, and each quantum claim must be verified against the best classical methods — not methods from two years ago. The advantage frontier moves, and keeping track of it requires constant benchmarking against current classical state-of-the-art.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>IBM Quantum Platform: Cloud Access to Advantage</p><p>The IBM Quantum Platform has been accessible to developers via cloud since May 2016 — a full decade before the May 2026 advantage demonstrations. The platform provides tiered access: free access to smaller processors for experimentation, and paid premium access to the most capable systems including the Heron processors used in the advantage demonstrations. Q-CTRL's 3,000× speedup was demonstrated on hardware accessible to any registered IBM Quantum Platform user. The advantage is not locked in a research lab. It is available now, on public cloud infrastructure, to any team willing to develop the quantum expertise to use it.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The architecture of both May 2026 achievements reflects a common pattern: quantum processors are not stand-alone computers that replace classical ones. 
They are specialized accelerators for specific types of computation, tightly integrated with classical CPUs and GPUs that handle the parts of the problem where quantum offers no advantage. IBM calls this <strong>Quantum-Centric Supercomputing (QCSC)</strong> — a heterogeneous computing architecture where tasks are assigned to the compute layer where they run best. Understanding this architecture is essential to understanding what quantum advantage actually means in practice.</p>\n<h3>The NISQ Error Accumulation Problem: Why Circuit Depth Is the Wall</h3>\n<p><a href=\"https://techlogstack.com/explore/ibm-quantum-advantage-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Quantum-Centric Supercomputing (QCSC): The Cleveland Clinic Architecture</h3>\n<p><a href=\"https://techlogstack.com/explore/ibm-quantum-advantage-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>IBM Quantum Roadmap: From NISQ to Fault Tolerance (2025–2029)</h3>\n<p><a href=\"https://techlogstack.com/explore/ibm-quantum-advantage-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>QUANTUM + AI: CONVERGENCE, NOT COMPETITION</strong></p>\n<p>IBM CEO Arvind Krishna at Think 2026 was direct: <strong>\"Quantum and AI do not compete; they converge and complement each other.\"</strong> Quantum can solve optimization and simulation problems that AI cannot reach through gradient descent. AI can learn from quantum-computed results and develop faster classical approximations. The trajectory: quantum computes what AI cannot, AI learns from what quantum computed, the combination advances faster than either could alone. 
This is why IBM's quantum research program sits alongside its AI research program rather than in competition with it — and why the Cleveland Clinic drug discovery work matters: quantum simulates molecular interactions that ML models can then learn to approximate.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>10 Years of Cloud Quantum: IBM Quantum Platform</p><p>IBM put the first quantum computer on the cloud on May 4, 2016. IBM Think 2026 coincided almost exactly with the <strong>10th anniversary</strong> of cloud-accessible quantum computing. In that decade, the IBM Quantum Platform grew from a single 5-qubit processor to a fleet of processors up to 156 qubits, serving hundreds of thousands of users globally via tiered access plans. The cloud accessibility is not incidental to the May 2026 results — Q-CTRL's 3,000× speedup was achieved on hardware accessible to any developer via the IBM Quantum Platform's API, not on a private research machine. Practical quantum advantage arrived on public cloud infrastructure.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>May 5–6, 2026 is the week where quantum computing stopped being a future technology and became a present one — for specific, bounded, commercially relevant problems. The lessons here are about what actually changed, what the engineering looks like, and what it means for the decade ahead.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Quantum advantage arrived not from better qubits alone, but from better compilers.</strong> Q-CTRL's Fire Opal reduced gate count by 60% on the same IBM hardware that was available before. The 3,000× speedup was enabled by 60% fewer gates — and 60% fewer gates was enabled by years of investment in quantum control theory and noise-aware compilation. 
Hardware and software co-optimization, not hardware alone, crossed the threshold.</li><li><span>02</span><div><em>Quantum-centric supercomputing</em> (a heterogeneous computing architecture that pairs quantum processors with classical CPUs and GPUs, assigning each part of a problem to the computational resource where it runs best) is how quantum advantage works in practice. Quantum computers do not replace classical computers — they accelerate the specific parts of computation where quantum mechanics provides an exponential advantage, while classical computers handle the rest. Drug discovery, materials simulation, and optimization are the first domains where this integration delivers measurable commercial results.</li><li><span>03</span><div><strong>Error suppression and circuit optimization are the engineering disciplines that matter most in the NISQ era.</strong> Error correction remains the long-term goal (IBM Starling, 2029), but error suppression — reducing gate count, noise-aware mapping, dynamical decoupling — is the bridge that makes today's hardware useful for real problems. Engineers building on quantum hardware should invest as much in compilation optimization as in circuit design.</li><li><span>04</span><div>The rate of improvement is accelerating, not slowing. <span>40× larger molecule simulation in six months.</span> A year-ahead-of-schedule qLDPC decoder. Qiskit circuits 24% more accurate at 100+ qubits. <em>Trotter</em> (a simulation technique that approximates quantum time evolution by breaking it into small sequential steps — the number of Trotter steps determines simulation accuracy, and running 90 Trotter steps at 120 qubits with useful accuracy was previously considered infeasible on NISQ hardware) depth at 90 steps on 120 qubits that would have been impossible two years ago. 
The practical implication: organizations that start developing quantum-advantage applications now will be significantly ahead of those that wait for the technology to 'mature.'</li><li><span>05</span><div><strong>Practical quantum advantage arrived on public cloud infrastructure.</strong> Q-CTRL's 3,000× speedup was not achieved in a government lab on classified hardware — it was achieved on IBM Quantum Platform hardware accessible via API to any registered developer. The democratization of quantum hardware through cloud access, begun in 2016, is what made May 2026's results broadly verifiable and immediately applicable. Build your quantum software stack now, on publicly accessible hardware, while the advantage window expands.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Community Advantage Tracker</p><p>IBM, Algorithmiq, researchers at the Flatiron Institute, and BlueQubit are contributing results to an <span><strong>open, community-led quantum advantage tracker</strong></span> — a systematic framework for verifying quantum advantage claims across three experiment types: observable estimation, variational problems, and classically verifiable problems. This tracker is the scientific community's answer to the reproducibility question: quantum advantage claims require independent verification, and the tracker provides the framework for that verification. It is the peer review system for quantum advantage — and its existence is itself evidence that the field has matured from speculation to engineering.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE DRUG DISCOVERY IMPLICATION</strong></p>\n<p>Cleveland Clinic's motivation for the protein simulation work is direct: drug discovery. If quantum computers can accurately simulate how drug molecules bind to protein targets like trypsin, pharmaceutical researchers can <strong>screen candidate molecules computationally before any physical experiment</strong>. 
The typical drug development cycle takes over 10 years and costs billions of dollars. Accurate quantum simulation of binding energies could identify non-starters earlier and prioritize promising candidates faster. The current 12,635-atom result is a milestone, not a final destination. But the 40× size increase in six months shows the trajectory is steep.</p>\n</blockquote>\n\n<blockquote><p>For 30 years, quantum computing was always 10 years away — until May 2026, when Q-CTRL ran a computation in 2 minutes that took the best classical supercomputer 100 hours, and the only thing that changed was that engineers got a 60% better compiler.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/ibm-quantum-advantage-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-19T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.975775+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "IBM"]}, {"id": "https://techlogstack.com/explore/netflix-chaos-monkey-2011/", "url": "https://techlogstack.com/explore/netflix-chaos-monkey-2011/", "title": "Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose", "summary": "How Netflix's 2008 database outage led to Chaos Monkey — and how deliberately killing production servers during business hours spawned an entire engineering discipli", "content_html": "<p><strong>Netflix</strong> · Chaos Engineering · 18 May 2026</p>\n<p>It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. 
The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.</p>\n<ul>\n<li>10 Simian Army members</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.</p><p><em>— Yury Izrailevsky &amp; Ariel Tseitlin — via The Netflix Simian Army, Netflix Tech Blog, July 19, 2011</em></p></blockquote>\n<p>The origin of Chaos Monkey is not a clever engineering insight — it is a three-day disaster. In August 2008, Netflix was still primarily a DVD-by-mail business, running its technology on vertically scaled servers in its own data centers. A <strong>major database corruption</strong> took down the entire system. For three days, Netflix could not ship DVDs to its customers. It wasn't a complicated failure. It was a <em>single point of failure</em> (a component whose failure brings down the entire system — the exact opposite of a fault-tolerant distributed architecture) at the most basic level: one database, one failure mode, total outage. The company's engineering leadership concluded that the only path forward was to move away from centralized relational databases in their own data center toward <strong>highly reliable, horizontally scalable, distributed systems in the cloud</strong>. They chose Amazon Web Services. The seven-year cloud migration that followed would produce one of the most influential engineering philosophies in the history of distributed systems.</p>\n<p>The migration to AWS presented a new problem in place of the old one. 
Netflix was moving from a single monolith with a small number of failure points — each catastrophic — to a <em>microservices architecture</em> (a system design where an application is broken into many small, independently deployable services that communicate over a network, improving scalability and team autonomy at the cost of increased distributed systems complexity) with hundreds of services, each potentially failing in its own unique way. The distributed system was theoretically more resilient. But theory is not production. Netflix's engineers designed systems with graceful degradation in mind — if the recommendations service failed, show popular titles instead of personalized ones; if the search service was slow, streaming should still work. They wrote the code. They reviewed it. They tested it in staging. And then they realized: <strong>there was no way to know if the fault tolerance actually worked without experiencing actual failures</strong>. The staging environment couldn't reproduce the chaos of production. Controlled tests couldn't capture the emergent failure modes of hundreds of interdependent services under real load.</p>\n<blockquote>\n<p><strong>THE CORE INSIGHT: FAIL CONSTANTLY</strong></p>\n<p>Netflix's founding philosophy for Chaos Engineering was radical in its simplicity: <strong>the best way to avoid failure is to fail constantly</strong>. If you only experience failures accidentally, in production, at 3am, your engineers have no muscle memory for responding to them and your systems have never been forced to prove their resilience claims. If you fail constantly, during business hours, with engineers present — your systems either prove they can recover or they expose the gaps so engineers can fix them before those gaps become incidents.</p>\n</blockquote>\n\n<h3>What Chaos Monkey Actually Does</h3>\n<p>Chaos Monkey is, mechanically, a simple tool. 
It runs continuously across Netflix's AWS environment and at some point during <strong>business hours</strong>, picks one EC2 instance at random from each cluster and terminates it. No warning. No coordination. No grace period. The instance just stops. This deceptively simple act forces every service in Netflix's architecture to prove, continuously, that it can tolerate the loss of an individual instance. Services that depend on a single backend instance fail immediately and obviously. Services built with proper fallbacks — load balancers, retries, graceful degradation paths — continue working. The business hours constraint is deliberate: when Chaos Monkey strikes at 2pm on a Tuesday, engineers are at their desks and can respond to any cascading failure. Striking at 2am would produce the exact scenario Netflix was trying to avoid — unplanned, unattended failures with no one ready to respond.</p>\n\n<h3>Problem</h3>\n<h4>August 2008: Database Corruption, Three Days of Darkness</h4>\n<p>Netflix's vertically scaled infrastructure suffered a major database corruption that halted DVD shipping for three days. The root cause was architectural: a single relational database instance, a single point of failure. No redundancy, no graceful degradation, no recovery path faster than manual intervention. The outage made the problem concrete: this architecture couldn't support Netflix's growth.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Distributed Systems Are Only Theoretically Resilient</h4>\n<p>Moving to hundreds of microservices on AWS solved the single-point-of-failure problem at the architecture level — but introduced new questions: did the code actually implement the graceful degradation it was supposed to? Staging environments couldn't tell you. Code review couldn't tell you. 
The only honest answer required production failures, and those were the thing Netflix was trying to avoid.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Chaos Monkey: Production Failure on a Schedule</h4>\n<p>Netflix built Chaos Monkey — a script that randomly terminates EC2 instances during business hours — and deployed it in all production environments. Engineers came in every day knowing that Chaos Monkey was running, knowing their services might get an instance killed at any moment, and knowing they had to build recovery mechanisms or face a very bad afternoon. The tool made fault tolerance a daily engineering discipline, not a theoretical design principle.</p>\n<hr />\n<h3>Result</h3>\n<h4>Sept 2014: AWS Reboots 10% of Its Servers. Netflix Shrugs.</h4>\n<p>On September 25, 2014, AWS rebooted approximately 10% of its EC2 instances without warning. Netflix's systems handled it without customer impact. Netflix explicitly credited Chaos Monkey: the engineers had already been building and proving recovery mechanisms every day for years. When AWS created an unplanned failure event at scale, Netflix's systems responded exactly as they'd been trained to respond — automatically, gracefully, and without requiring an emergency war room.</p>\n<hr />\n\n<blockquote>\n<p><strong>🐒</strong></p>\n<p>Chaos Monkey was one of the <strong>first systems</strong> Netflix engineers built in AWS during the cloud migration. Not a caching layer, not a deployment system, not a monitoring platform — a tool to randomly kill their own production servers. This sequencing was intentional: the discipline came first, and the architecture was shaped by it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Rambo Architecture</p><p>Netflix's engineering team coined the term <strong>Rambo Architecture</strong> for the design philosophy that Chaos Monkey enforced: each system must be able to succeed no matter what, even all on its own. 
If the recommendations service is down, still respond — show popular titles. If the search service is slow, streaming still works. If a dependent microservice returns an error, handle it gracefully. Every service is both a potential failure source and a potential victim of failures, and must be designed for both roles simultaneously.</p>\n</blockquote>\n\n<h3>The Simian Army</h3>\n<p>The success of Chaos Monkey triggered a proliferation. If randomly killing instances made Netflix more resilient to instance failures, what would it take to become resilient to other failure categories? In July 2011 — in the same blog post that named Chaos Monkey publicly — Netflix announced the <strong>Simian Army</strong>: a growing suite of failure-injection and resilience-verification tools, each targeting a different class of failure. The roster was remarkable in its scope and its naming creativity. <em>Latency Monkey</em> (a tool that injects artificial delays into Netflix's RESTful service communication layer, simulating network degradation to verify that upstream services detect and respond to downstream slowdowns appropriately) introduced artificial delays in service communication to simulate degradation. <em>Conformity Monkey</em> identified and shut down instances not following engineering best practices. <em>Doctor Monkey</em> ran health checks and removed unhealthy instances from service. <em>Janitor Monkey</em> cleaned up unused cloud resources to reduce costs and complexity. <em>Security Monkey</em> hunted for security vulnerabilities. <em>10-18 Monkey</em> (short for localization-internationalization, l10n-i18n) detected configuration and runtime problems in instances serving customers across multiple regions, languages, and character sets. 
And <em>Chaos Gorilla</em> (a Simian Army tool that simulates the failure of an entire AWS availability zone — one step up from Chaos Monkey's instance-level failures, testing whether Netflix's architecture could survive losing an entire AZ) simulated the complete failure of an AWS availability zone.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Chaos Kong: The Region Killer</p><p>Above Chaos Gorilla in the hierarchy sat <strong>Chaos Kong</strong> — the most extreme tool in the Simian Army, designed to simulate the complete failure of an entire AWS region. If Chaos Monkey proved Netflix could survive an instance failure and Chaos Gorilla proved it could survive an AZ failure, Chaos Kong tested the hardest question: could Netflix continue streaming if us-east-1 went dark? The answer, after years of Chaos Engineering practice, was yes — with careful architecture involving active-active multi-region deployment and data replication strategies that Netflix documented in subsequent engineering blog posts.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Building a Fault-Tolerant Culture</h3>\n<p>The most important thing Chaos Monkey fixed was not a technical system — it was an organizational incentive. Before Chaos Monkey, engineers at Netflix could ship code that was theoretically fault-tolerant but practically fragile without facing immediate consequences. The fragility would only become visible during a real, unplanned outage — at which point it was someone else's problem. After Chaos Monkey, the consequences were immediate and personal: if your service didn't handle instance failures gracefully, Chaos Monkey would expose this <strong>during your working hours, while you were at your desk</strong>, with your team watching. This behavioral economics effect — where the cost of fragility was paid by the person who created it, immediately — transformed how Netflix engineers thought about resilience. 
It was no longer a design principle to be aspirationally implemented. It was a daily test to be continuously passed.</p>\n<ul>\n<li><strong>2011</strong> — Year Chaos Monkey was publicly announced in 'The Netflix Simian Army' blog post — three years after the 2008 database outage that triggered the AWS migration and the need for built-in fault tolerance</li>\n<li><strong>10+</strong> — Members of the Simian Army at peak — each targeting a different failure category from individual instances (Chaos Monkey) to full AWS regions (Chaos Kong)</li>\n<li><strong>Business hours</strong> — The scheduling constraint that made Chaos Monkey safe and effective — failures during working hours, with engineers present to respond, rather than 3am on-call escalations</li>\n<li><strong>Sept 2014</strong> — The real-world validation: AWS rebooted 10% of EC2 instances without warning — Netflix handled it without customer impact, directly crediting years of Chaos Monkey practice</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified version of what Chaos Monkey does\n# Real implementation was originally Java, later Go (v2.0)\n# Runs continuously during configurable business hours\n\nimport random\nimport time\nfrom datetime import datetime\n\nclass ChaosMonkey:\n    def __init__(self, aws_client, config, excluded_clusters=None):\n        self.aws = aws_client\n        # config must expose termination_interval_seconds (read by run())\n        self.config = config\n        self.excluded = excluded_clusters or []\n    \n    def is_business_hours(self) -&gt; bool:\n        \"\"\"Only run during business hours so engineers are present.\n        The key safety constraint of Chaos Monkey's original design.\"\"\"\n        now = datetime.now()\n        return (\n            now.weekday() &lt; 5 and          # Monday–Friday\n            9 &lt;= now.hour &lt; 17              # 9am–5pm local time\n        )\n    \n    def run(self):\n        while True:\n            if self.is_business_hours():\n                # Identify all clusters Chaos Monkey is configured to target\n                clusters = 
self.aws.get_all_clusters()\n                \n                for cluster in clusters:\n                    if cluster.name in self.excluded:\n                        continue\n                    \n                    # Pick one instance at random from each cluster\n                    instances = cluster.get_running_instances()\n                    if not instances:\n                        continue\n                    \n                    victim = random.choice(instances)\n                    \n                    # Terminate it. No warning. No coordination.\n                    # If the system doesn't survive this, the engineers\n                    # will know about it immediately — and fix it.\n                    self.aws.terminate_instance(victim.id)\n                    print(f\"[Chaos Monkey] Terminated {victim.id} \"\n                          f\"in cluster {cluster.name}\")\n            \n            # Wait before running again — mean time between terminations\n            # configured per cluster, not random probability\n            time.sleep(self.config.termination_interval_seconds)</code></pre>\n<blockquote>\n<p><strong>FAILURE INJECTION TESTING (FIT): THE EVOLUTION</strong></p>\n<p>In 2014, Netflix engineers (including Kolton Andrus, who later co-founded Gremlin) introduced <strong>FIT — Failure Injection Testing</strong>. Where Chaos Monkey operated at the infrastructure level (kill an EC2 instance), FIT operated at the application level: injecting failure metadata through <em>Zuul</em> (Netflix's edge proxy that handles all requests from devices and applications to Netflix's backend services) to simulate specific service failures with surgical precision. FIT could say 'for this specific user's request, pretend the recommendations service is timing out' without actually degrading the recommendations service for everyone. 
This precision made chaos experiments far more targeted and safer to run continuously.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Chaos Monkey 2.0: Open-Sourced and Rebuilt in Go</p><p>Chaos Monkey was open-sourced in 2012 and rebuilt in 2016 as version 2.0. The new version was written in Go, used Spinnaker as its deployment platform dependency, and introduced mean-time-between-terminations (rather than probabilistic scheduling) for more predictable test coverage. Version 2.0 also added <strong>Trackers</strong> — Go language objects that report instance terminations to external monitoring systems, enabling downstream correlation of Chaos Monkey events with application metrics and alerts.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Industry Adoption: From Netflix to Everywhere</p><p>By 2015, Netflix's Chaos Engineering practices had been codified in the <strong>Principles of Chaos Engineering</strong> document (published by a team including Casey Rosenthal, who led Netflix's Chaos Engineering team), transforming what had been an internal Netflix tool into a formal engineering discipline. Companies including LinkedIn, Facebook, Google, Amazon, and Twilio adopted chaos engineering practices. Kolton Andrus (from Netflix's FIT team) founded Gremlin in 2016 to commercialize chaos engineering tooling. AWS launched its own Fault Injection Simulator service in 2021.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Open-Source Release and Industry Spread</p><p>Netflix open-sourced Chaos Monkey in 2012, making the tool available to any engineering team that wanted to adopt the practice. The release did something more important than provide the code: it legitimized the approach. Engineering teams at other companies who had been quietly running similar experiments could now point to Netflix's published methodology as industry precedent. 
By 2015, companies including <strong>LinkedIn, Facebook, Google, Amazon, and Twilio</strong> had publicly acknowledged chaos engineering practices. The 2015 publication of the <em>Principles of Chaos Engineering</em> by Netflix's Casey Rosenthal and colleagues formalized the discipline with scientific language: hypothesis, experiment, steady state, blast radius. What had been a Netflix internal tool became a named engineering discipline.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE SPINNAKER DEPENDENCY IN V2.0</strong></p>\n<p>Chaos Monkey 2.0 (2016) introduced a significant constraint: it requires <em>Spinnaker</em> (Netflix's open-source multi-cloud continuous delivery platform that manages application deployments across AWS, Azure, Kubernetes, and other providers) as its deployment platform. This means that teams wanting to use Chaos Monkey 2.0 must also adopt Spinnaker — a substantial investment. Companies unwilling to commit to Spinnaker found Chaos Monkey 2.0 inaccessible, which opened market space for alternatives like Gremlin (co-founded by Netflix alumnus Kolton Andrus and Matt Fornaciari) that offered chaos engineering as a service without infrastructure prerequisites.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Netflix's architecture in 2011 was organized around a principle that Chaos Monkey enforced: every service must be independently deployable, independently scalable, and independently recoverable. The microservices were connected through REST APIs, with each service maintaining its own data store and exposing a versioned interface to its consumers. Chaos Monkey operated at the AWS EC2 instance layer — the individual virtual machines running each service's processes. When an instance was terminated, the load balancer in front of that service's cluster detected the unhealthy instance and stopped routing traffic to it. 
If the cluster had been sized with enough redundancy, other instances absorbed the traffic without degradation. If not, the service degraded — and the engineers learned something.</p>\n<h3>The Simian Army: Failure Coverage Across Infrastructure Layers</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-chaos-monkey-2011/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>How Netflix's Architecture Handles Chaos Monkey Instance Loss</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-chaos-monkey-2011/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE BEHAVIORAL ECONOMICS OF CHAOS ENGINEERING</strong></p>\n<p>Chaos Monkey's deepest contribution to Netflix's culture was <strong>aligning incentives</strong>. Without it, the cost of fragile code was paid by whoever happened to be on-call when a real failure occurred — often not the engineer who wrote the fragile code. With Chaos Monkey, the cost was paid immediately and visibly by the team whose service broke. Engineers who experienced a Chaos Monkey failure during business hours had a powerful motivator to invest in proper fault tolerance: they didn't want to experience it again. This is DevOps incentive design at its finest — not policy mandates, but a system where the right behavior is the path of least resistance.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why Business Hours Only — The Safety Constraint</p><p>The original Chaos Monkey ran only during business hours, and this was not a limitation — it was the essential design principle. An instance killed at 2am when engineers are asleep creates exactly the scenario Netflix wanted to avoid: unplanned, unattended failure with long MTTD (Mean Time To Detect) and long MTTR (Mean Time To Recover). 
An instance killed at 2pm on a Tuesday <strong>is pedagogical, not adversarial</strong>: engineers learn from it, fix the gap, and build better systems. As Netflix's confidence in its architecture grew, chaos experiments expanded to cover more scenarios and broader failure scopes — but the principle of human-attended chaos remained core to responsible chaos engineering practice.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>What Chaos Monkey Doesn't Test</p><p>Chaos Monkey's instance-termination model is powerful but deliberately narrow. It does not test <strong>network partitions</strong> (instances visible but unreachable), <strong>latency degradation</strong> (Latency Monkey's job), <strong>data corruption</strong>, or <strong>slow memory leaks</strong> that cause gradual performance degradation over hours. Chaos Monkey's successors in the Simian Army and in tools like Gremlin were created precisely to cover these gaps. The original insight — failing constantly builds resilience — generalizes to all failure types, but the specific mechanism must match the specific failure mode being tested. A chaos engineering program that only kills instances is missing most of the failure surface.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Chaos Monkey is fourteen years old and it has influenced every major engineering organization's approach to reliability. Its lessons are not about the specific tool — they are about the philosophy that the tool embodies and the cultural transformation it requires.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Designing for fault tolerance is not the same as having fault tolerance.</strong> Netflix's engineers wrote graceful degradation code. Netflix's Chaos Monkey tested whether it actually worked. Until production failure exercises the code path, you don't know whether your fault tolerance design survived contact with reality. 
Chaos Monkey converts theoretical resilience into empirical evidence.</li><li><span>02</span><div><em>Chaos Engineering</em> (the discipline of deliberately injecting controlled failures into production systems during business hours, with engineers present, in order to proactively expose resilience gaps before they become unplanned outages) must be practiced during business hours, with humans present. The purpose is learning, not destruction. Chaos experiments run at 3am when no one is available to respond create exactly the incidents that chaos engineering is supposed to prevent.</li><li><span>03</span><div><strong>Align incentives with the behavior you want.</strong> Chaos Monkey made the cost of fragile code immediate and personal — the engineer whose service broke during business hours paid the cost of fixing it right then. Without this alignment, resilience engineering is aspirational. With it, resilience engineering is survival instinct.</li><li><span>04</span><div>The <em>blast radius</em> (the scope of impact when a single component fails — chaos engineering is designed to continuously measure and minimize blast radius by forcing service-level isolation) of individual failures is only measurable through testing. A microservices architecture where every service failure cascades to every other service provides less reliability than a monolith, not more. Chaos Monkey surfaces these cascade dependencies so they can be eliminated before a real failure exposes them at scale.</li><li><span>05</span><div><strong>Start at the instance level and escalate gradually.</strong> Netflix began with Chaos Monkey (instances), expanded to Chaos Gorilla (availability zones), then to Chaos Kong (regions). Each level was only attempted after the previous level produced a stable, confident result. 
This graduated escalation model — expand scope only when you're confident you've solved the current scope — is the responsible path for any chaos engineering program.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The September 2014 Test That Validated Everything</p><p>Netflix's most public validation of Chaos Monkey's philosophy came not from their own experiments but from AWS itself. On September 25, 2014, AWS rebooted approximately 10% of its EC2 instances across regions without warning — a real, unplanned failure event at significant scale. Netflix handled it without customer impact. The years of Chaos Monkey practice had built exactly the muscle memory and architectural robustness required. Engineers didn't panic. Systems didn't cascade. Services degraded gracefully and recovered automatically. This was the experiment Netflix couldn't have designed themselves — and they passed it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>FROM TOOL TO DISCIPLINE: THE PRINCIPLES OF CHAOS ENGINEERING</strong></p>\n<p>In 2015, Netflix's Casey Rosenthal formalized Chaos Monkey's philosophy into the <strong>Principles of Chaos Engineering</strong> — a document that defined chaos engineering with scientific rigor: establish a steady-state hypothesis, vary real-world events, run experiments in production, automate continuously, minimize blast radius. These principles transformed chaos engineering from 'Netflix's thing where they kill their own servers' into a reproducible engineering discipline with clear methodologies. 
<strong>The formalization is what allowed chaos engineering to spread beyond Netflix</strong> — teams could now implement the practice without having to rediscover the same principles themselves.</p>\n</blockquote>\n\n<blockquote><p>Netflix built a tool that killed their own servers on purpose every business day for years, and the one time AWS killed 10% of their servers by accident, nobody noticed — which is either the best possible outcome of a chaos engineering program or proof that Netflix engineers have very high stress tolerances.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/netflix-chaos-monkey-2011/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-18T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.084337+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Chaos Engineering", "Netflix"]}, {"id": "https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/", "url": "https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/", "title": "Figma's Database Grew 100x in Four Years — Here's How a Small Team Kept It From Toppling", "summary": "How Figma's small databases team horizontally sharded a 100x-grown Postgres stack in 9 months using colos, logical sharding via Postgres views, and a custom Go query", "content_html": "<p><strong>Figma</strong> · Databases · 18 May 2026</p>\n<p>In 2020, Figma ran on a single Postgres instance on AWS's largest available machine. Four years later, that database had grown nearly 100x. Some tables had swelled to several terabytes and billions of rows. 
The Postgres vacuum process (the background maintenance job that reclaims space from dead rows) was causing reliability incidents. The team had only months of runway left before hitting the IOPS ceiling. A small databases team had nine months to fix it.</p>\n<ul>\n<li>100x DB growth since 2020</li><li>9-month migration</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>We needed a bigger lever.</p><p><em>— Sammy Steele, Tech Lead — Figma Databases Team, via Figma Engineering Blog</em></p></blockquote>\n<p>Figma's database story follows a pattern familiar to every fast-growing product company, but with stakes that were unusually high and a timeline that was unusually compressed. In 2020, Figma ran on a <strong>single Postgres database</strong> on AWS's largest available RDS instance. By the end of 2022, the team had done what most scaling playbooks suggest first: add read replicas, add a connection pooler, <em>PgBouncer</em> (a lightweight PostgreSQL connection pooler that sits between application code and the database, multiplexing many application connections down to a smaller pool of real database connections and so reducing connection overhead), and <em>vertically partition</em> (splitting a single database into multiple smaller databases, each containing a logical group of related tables — for example, one database for Figma files data, another for organization data) the database into a dozen domain-specific databases. These steps bought them runway. They did not buy them enough runway.</p>\n<p>The data was unambiguous. Certain tables — the ones tracking Figma files, user activity, and collaboration state — were growing at rates that would soon exceed what <strong>Amazon RDS could support in IOPS</strong>. Some of these tables already contained <strong>several terabytes and billions of rows</strong>. 
At that size, Postgres's <em>vacuum process</em> (a critical background maintenance operation in Postgres that reclaims storage from deleted rows and prevents the database from running out of 32-bit transaction IDs — if vacuuming falls behind, it can cause severe performance degradation and, in extreme cases, force the database offline) was beginning to cause reliability incidents — it was falling behind on the largest tables, unable to reclaim space fast enough to keep up with write volume. Vertical partitioning couldn't fix this: the smallest unit of vertical partitioning is a single table, and these individual tables were the problem.</p>\n<blockquote>\n<p><strong>THE VACUUM PROBLEM AT SCALE</strong></p>\n<p>Postgres must periodically vacuum tables to reclaim space from deleted and updated rows. This is not optional — if a table accumulates too many dead tuples, query performance degrades severely. On tables with billions of rows and high write rates, the vacuum process can fall behind the rate of new writes. When this happens, the database starts showing reliability symptoms: <strong>bloated tables, degraded query plans, and in extreme cases the risk of transaction ID wraparound</strong> — a catastrophic condition that forces Postgres into read-only emergency mode. Figma was seeing the early signs of this at scale.</p>\n</blockquote>\n\n<h3>Why Not CockroachDB, TiDB, Spanner, or Vitess?</h3>\n<p>Figma's databases team evaluated every obvious alternative before committing to building their own horizontal sharding layer. <strong>CockroachDB, TiDB, Google Spanner, and Vitess</strong> were all on the list. All were rejected for the same core reason: switching to any of them would have required a complex data migration across two different database stores simultaneously. With only months of runway remaining before hitting critical IOPS limits, a migration to an unfamiliar storage layer under deadline pressure was a risk the team couldn't accept. 
They had also accumulated significant operational expertise running RDS Postgres. That expertise would have to be rebuilt from scratch for any new system. The team instead chose to build horizontal sharding on top of their existing RDS Postgres infrastructure — not a generic solution, but one scoped precisely to Figma's data model and access patterns.</p>\n\n<h3>Problem</h3>\n<h4>IOPS Ceiling and Vacuum Incidents Converge</h4>\n<p>By late 2022, Figma's largest tables had grown to several terabytes with billions of rows, and the Postgres vacuum process was causing reliability incidents on the highest-write tables. Projections showed the team would exceed RDS maximum IOPS within months. Vertical partitioning — splitting databases by domain — could not help because individual tables were the bottleneck, not cross-domain coupling.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Single-Table Ceiling: The Vertical Partitioning Limit</h4>\n<p><em>Horizontal sharding</em> (splitting a single large table's rows across multiple physical database instances based on a shard key — allowing any individual table to grow beyond the limits of a single machine) was the only viable path. But implementing it on a complex relational data model, with hundreds of engineers writing queries, required solving three hard problems simultaneously: routing queries correctly, maintaining developer productivity, and enabling rollback if something went wrong.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Colos + Logical Sharding + DBProxy</h4>\n<p>The team invented three interlocking abstractions: 'colos' (colocation groups of related tables sharing a shard key), logical sharding via Postgres views (which allowed safe percentage-based rollout without moving any data), and DBProxy (a custom Go query proxy with an AST parser that routed queries to the correct physical shard). 
Together these allowed incremental, reversible rollout of horizontal sharding without disrupting product development.</p>\n<hr />\n<h3>Result</h3>\n<h4>Nine Months, Nearly Infinite Scalability</h4>\n<p>The migration completed in nine months with zero downtime and the ability to roll back at any step. Future shard splits at the physical level are now transparent to application developers — after the initial upfront work to make a table compatible with horizontal sharding, all subsequent scale-outs happen in the infrastructure layer without any product team involvement.</p>\n<hr />\n\n<blockquote>\n<p><strong>🪄</strong></p>\n<p>The most elegant part of Figma's sharding approach was using standard <strong>Postgres views</strong> to implement logical sharding. A view like <code>CREATE VIEW table_shard1 AS SELECT * FROM table WHERE hash(shard_key) BETWEEN min AND max</code> lets Postgres behave as if data is already sharded — without any data moving. This made the logical sharding phase essentially free to roll back: change the view definition, flip the config, done.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Shadow Planning Framework</p><p>Before building DBProxy's query engine, the team needed to know which queries to support. They built a <strong>shadow planning framework</strong> that let engineers define potential sharding schemes for their tables, then ran those plans against live production traffic — logging the queries and plans to Snowflake for offline analysis. This gave them empirical data to design a query language covering the most common <strong>90% of queries</strong> while deliberately excluding the rare worst-case patterns that would have made DBProxy impossibly complex.</p>\n</blockquote>\n\n<p>The constraints the team placed on their query language were deliberate and principled. All range scan and point queries were supported. 
Cross-table joins were <strong>only allowed when both tables belonged to the same colo and the join was on the sharding key</strong>. Scatter-gather queries — those that must fan out to all shards because they lack a shard key — were supported, but their use was actively discouraged because each scatter-gather effectively multiplies database load by the shard count. Application developers were encouraged to refactor scatter-gather access patterns before sharding their tables, using the shadow planning data to understand which of their queries fell into this category.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Five Goals They Refused to Compromise</p><p>Figma's team defined five non-negotiables before writing a line of sharding code: <strong>minimize developer impact</strong> (product engineers shouldn't need to rewrite queries), <strong>scale-out transparency</strong> (future shard splits invisible to application layer), <strong>no expensive backfills</strong> (no solution requiring moving terabytes before going live), <strong>incremental progress</strong> (percentage-based rollout at every step), and <strong>rollback at any stage</strong> — even after physical sharding. Every architectural decision was measured against these five goals.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Postgres Vacuum Threat Nobody Talks About</p><p>Postgres identifies transactions with a 32-bit counter, and every transaction that writes data consumes an ID. If a database comes within a safety margin of roughly 2^31 (about two billion) transactions without vacuum freezing its old rows, Postgres <strong>refuses to assign new transaction IDs to prevent transaction ID wraparound</strong>, which blocks writes across the database until the affected tables are vacuumed. On tables with billions of rows and heavy write rates, falling behind on vacuuming is not a theoretical risk. Figma was experiencing real reliability incidents from vacuum lag on their largest tables. 
This was the alarm that confirmed horizontal sharding was urgent, not optional.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Three-Layer Solution</h3>\n<p>Figma's horizontal sharding solution had three distinct architectural components that worked together. <strong>Colos</strong> (colocations) were the conceptual layer — groups of related tables that shared the same sharding key and physical shard layout. Tables within a colo could be joined and queried transactionally as long as the join was on the sharding key. The sharding keys were chosen from a small set: <code>user_id</code>, <code>file_id</code>, or <code>org_id</code> — most tables at Figma could be naturally associated with one of these. <strong>Logical sharding</strong> was the rollout layer — using Postgres views to simulate sharding behavior without moving any data. <strong>DBProxy</strong> was the execution layer — intercepting queries, parsing them into an AST, determining which logical shard the query targeted, and routing it to the appropriate physical database.</p>\n<ul>\n<li><strong>100x</strong> — Database growth since 2020 — the scale that made vertical partitioning insufficient and horizontal sharding the only viable path forward</li>\n<li><strong>9 months</strong> — Total migration timeline from design to production completion — achieved with a small team under aggressive growth-driven deadline pressure</li>\n<li><strong>90%</strong> — Query coverage targeted by DBProxy's query engine — the pragmatic threshold that kept the proxy simple while covering the vast majority of production access patterns</li>\n<li><strong>0</strong> — Application layer changes required for future shard splits — after initial table compatibility work, all subsequent scale-outs are transparent to product engineers</li>\n</ul>\n\n<pre><code>-- Logical sharding via Postgres views: the key insight\n-- No data moves during logical sharding phase.\n-- Tables behave as if sharded — just views on 
the same physical table.\n\n-- Single physical table still holds all data:\n-- CREATE TABLE figma_files (file_id UUID, org_id UUID, data JSONB, ...)\n\n-- Logical shards created as views filtered by hash bucket\n-- (modulo buckets here for brevity; a production scheme could use hash ranges).\n-- hashtext() returns a signed 32-bit integer, so widen to bigint and take\n-- abs() to keep negative hashes inside buckets 0 to 3:\nCREATE VIEW figma_files_shard_0 AS\n  SELECT * FROM figma_files\n  WHERE abs(hashtext(file_id::text)::bigint) % 4 = 0;\n\nCREATE VIEW figma_files_shard_1 AS\n  SELECT * FROM figma_files\n  WHERE abs(hashtext(file_id::text)::bigint) % 4 = 1;\n\n-- Views accept both reads AND writes in Postgres:\n-- INSERT INTO figma_files_shard_0 (file_id, data) VALUES (...);\n-- → Postgres routes to the underlying table\n-- → DBProxy validates the shard key is in the correct range\n\n-- Physical sharding later:\n-- Data is ACTUALLY moved to separate RDS instances per shard\n-- DBProxy routing stays the same — application code unchanged\n-- Rollback: re-point physical shard back to original instance</code></pre>\n<blockquote>\n<p><strong>DBPROXY: THE QUERY ENGINE</strong></p>\n<p>DBProxy is a Go service sitting between the application layer and PgBouncer. Its query engine has three components: a <strong>query parser</strong> that transforms SQL into an AST, a <strong>logical planner</strong> that extracts query type and shard IDs from the AST, and a <strong>physical planner</strong> that maps logical shard IDs to physical database instances and rewrites queries accordingly. DBProxy also handles scatter-gather queries (fanning out to all shards and aggregating results), dynamic load-shedding, improved observability, and database topology management. Building it took months — but it was the only way to make sharding transparent to application developers.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Logical Before Physical: The Two-Phase Rollout</p><p>Figma's key migration insight was separating logical sharding from physical sharding. 
<strong>Logical sharding</strong> (Phase 1) makes the application behave as if tables are sharded — using views, updating DBProxy config — but all data still lives in one physical database. This can be rolled out as a percentage-based config change, validated against production traffic, and rolled back instantly. <strong>Physical sharding</strong> (Phase 2) actually moves data to separate RDS instances. Much higher risk — but by this point, the logical layer has been running in production for weeks, bugs are fixed, and the team has empirical confidence in the sharding correctness.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Rollback Guarantee: Even After Physical Sharding</p><p>Most horizontal sharding implementations are one-way migrations — once data is on separate physical instances, rolling back requires a complex reverse migration. Figma's team designed their system so that <span><strong>physical shard splits are reversible</strong></span>. They maintained the ability to point physical shards back to the original database instance while the new routing logic was validated. 
This reduced the risk of being stuck in a bad state when unknown unknowns inevitably occurred.</p>\n</blockquote>\n\n<div><table><caption>Figma's Three-Phase Database Scaling Journey: Before and After</caption><thead><tr><th>Phase</th><th>Architecture</th><th>Bottleneck Addressed</th><th>Runway Gained</th></tr></thead><tbody><tr><td>2020</td><td>Single RDS Postgres instance</td><td>Initial growth</td><td>Moderate</td></tr><tr><td>2021–2022</td><td>Vertical partitioning (12 domain DBs) + read replicas + PgBouncer</td><td>CPU, read load, connection pool</td><td>~1 year</td></tr><tr><td>2023–2024</td><td>Horizontal sharding via colos + DBProxy + logical/physical migration</td><td>Table-level IOPS ceiling, vacuum backlog, multi-billion-row tables</td><td>Near-infinite scalability</td></tr></tbody></table></div>\n<blockquote>\n<p><strong>🔀</strong></p>\n<p>Scatter-Gather: The Necessary Evil</p><p>Some queries don't have a shard key — a query like 'get all recently modified files for an admin dashboard' has no natural file_id scope. DBProxy handles these with scatter-gather: fan the query out to every shard in parallel, collect results, merge and sort. It works correctly but is expensive. Figma's engineering team was <strong>explicit with product engineers about the scatter-gather tax</strong>, encouraging them to refactor access patterns before their tables were sharded. The shadow planning data showed exactly which queries would become scatter-gather — engineering teams had weeks to fix them before cutover.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The before-state of Figma's architecture had application services talking directly to PgBouncer, which connected to RDS Postgres. 
Vertical partitioning meant multiple databases, but each database was still a single physical instance — and the largest individual tables still had no mechanism to distribute their rows across instances. DBProxy was inserted between the application and PgBouncer layers, adding the query parsing and routing intelligence that made horizontal sharding possible without requiring application code changes.</p>\n<h3>Before Horizontal Sharding: Vertical Partitions Only</h3>\n<p><a href=\"https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: DBProxy + Logical/Physical Horizontal Sharding</h3>\n<p><a href=\"https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>COLOS: THE DEVELOPER-FACING ABSTRACTION</strong></p>\n<p>The colo concept is what made horizontal sharding usable for product engineers. A colo is a named group of tables that share a sharding key — for example, the <code>files_colo</code> contains <code>figma_files</code>, <code>file_nodes</code>, <code>file_comments</code>, and other tables all sharded by <code>file_id</code>. Within a colo, <strong>cross-table joins and full transactions are supported</strong> when restricted to a single shard key value. This matches how Figma's application code already accessed the database — most operations concerned a single file or a single user, not cross-colo data. 
The colo abstraction minimized the number of queries that needed to be refactored.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Scatter-Gather Tax</p><p>Queries without a shard key — those that need results from all shards — are handled by DBProxy's scatter-gather mechanism: the query is fanned out to all shards in parallel and results are merged. Scatter-gather is correct but expensive: it multiplies read load by the number of shards. <strong>Having too many scatter-gather queries would defeat the purpose of horizontal sharding</strong>. The shadow planning framework specifically identified scatter-gather patterns in the production query log before sharding, allowing teams to refactor the most frequent offenders before their tables were migrated.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>DBProxy: Six Months to Build, Indefinite Value</p><p>Building DBProxy — the Go service with an AST parser, logical planner, and physical planner — was the highest-risk engineering bet in the sharding project. It took months to build and required solving problems that existing tools had already solved in different ways. But the payoff was precise control: <span><strong>DBProxy understands Figma's specific query patterns</strong></span>, supports exactly the subset of SQL that Figma uses, and can be extended as Figma's needs evolve. A generic proxy would have required adapting Figma's code to its limitations. DBProxy was adapted to Figma's code.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Figma's sharding story is widely cited because it did something genuinely hard — horizontally sharding a complex relational production database under deadline pressure — and documented the architecture decisions clearly enough for other teams to learn from. 
The lessons are about sequencing, abstraction, and the courage to build something custom when existing tools genuinely don't fit.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Separate logical sharding from physical sharding.</strong> Implementing sharding routing behavior at the application layer — using views, config, or a proxy — before moving any physical data gives you weeks of production validation at essentially zero risk. When you flip to physical sharding, the routing is already proven correct. This two-phase approach is the biggest risk reducer in a horizontal sharding migration.</li><li><span>02</span><div><em>Colocations</em> (groups of related tables that share the same sharding key and physical shard layout, allowing cross-table joins and transactions within the group) are the abstraction that makes sharding survivable for product engineers. Without colos, horizontal sharding forces engineers to think about shard routing on every database query. With colos, most queries just work as they always did.</li><li><span>03</span><div><strong>Use shadow traffic to define your query language before building your proxy.</strong> Figma's shadow planning framework let them empirically measure which query patterns existed in production before designing DBProxy. This meant the proxy was built for real queries, not imagined ones — and the 10% of queries excluded from support were known and manageable, not discovered as surprises in production.</li><li><span>04</span><div>Know when existing tools don't fit your timeline. Figma evaluated CockroachDB, Spanner, TiDB, and Vitess — all good systems. They chose to build something custom not out of arrogance but because <strong>the migration risk to an unfamiliar storage layer under a months-long deadline was genuinely higher than building a scoped custom solution</strong> on their existing Postgres expertise. 
The build-vs-buy decision was made with real risk data, not intuition.</li><li><span>05</span><div>Design for rollback even after the migration completes. Figma maintained the ability to reverse physical shard splits after they happened. <strong>The unknown unknowns in a horizontal sharding migration are real</strong> — building in a reverse path at every phase is the engineering discipline that lets teams execute confidently rather than hold their breath.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Post-Migration State: Scale Without Change</p><p>After the initial work to make a table horizontal-sharding-compatible, all future shard splits happen transparently. As a table grows again toward limits, the infrastructure team can split shards — updating the physical topology and DBProxy's routing config — without any product engineer touching their code. This is the payoff of the upfront investment: the database can now scale indefinitely at the infrastructure layer, decoupled from the application layer.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>WHEN MONTHS OF RUNWAY MEANS NOW</strong></p>\n<p>Figma's team framed their problem as 'months of runway remaining' — meaning that if they did nothing, they would hit a hard scaling ceiling and likely experience database reliability incidents or outages within months. This framing was not catastrophizing; it was <strong>the math of their growth rate applied to their IOPS limit</strong>. The urgency drove the decision to build custom rather than migrate to an unfamiliar system. 
Teams facing similar trajectory should run this calculation early — months of runway sounds like plenty of time until the migration itself takes several months.</p>\n</blockquote>\n\n<blockquote><p>Figma's database grew 100x and a small team fixed it in nine months — which is either very good database engineering or very good use of Postgres views depending on who you ask, and the answer is both.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/figma-postgres-horizontal-sharding-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-18T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.791800+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "Figma"]}, {"id": "https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/", "url": "https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/", "title": "Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy", "summary": "How an unsupervised systemd update took down 50–60% of Datadog's Kubernetes nodes globally, cost $5M, and drove a company-wide shift to graceful degradation architec", "content_html": "<p><strong>Datadog</strong> · Reliability · 18 May 2026</p>\n<p>On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. 
It changed how Datadog thinks about building software.</p>\n<ul>\n<li>24h+ global outage</li><li>5 regions, 3 cloud providers</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>We had built with the assumption that the only way to handle failure was to prevent it entirely — or to stop everything — rather than finding ways to degrade gracefully and continue delivering value to customers, even under extreme conditions.</p><p><em>— Laura de Vesine, Rob Thomas, Maciej Kowalewski — via Datadog Engineering Blog</em></p></blockquote>\n<p>At 01:31 EST on March 8, 2023, Datadog experienced its first global outage — every region, every cloud provider, simultaneously. The company that monitors the infrastructure of thousands of other companies could not monitor its own. Dashboards loaded but displayed no data. <strong>Logs, metrics, alerting, and traces were all unavailable.</strong> The engineers whose job was to diagnose and fix the outage were operating without the observability tools that Datadog itself provides. It lasted over 24 hours. It cost $5 million in direct revenue. And it forced a fundamental rethink of how Datadog builds reliable systems.</p>\n<p>The immediate cause was disarmingly mundane: an <strong>automated </strong><em>systemd</em> (the init system and service manager used by most modern Linux distributions — it starts processes, manages services, and handles system initialization) update was applied to Datadog's Ubuntu-based virtual machines across all regions simultaneously. This was a legacy security patch mechanism — Datadog had since built a modern lifecycle automation system for all nodes — but the legacy channel was still active and executed its update across the global fleet without any staged rollout, any health gates, or any human awareness. 
The update caused a <em>systemd-networkd</em> (the systemd component responsible for managing network interfaces on Linux hosts) restart interaction that <strong>removed network routes from the machines as they came back up</strong>. Nodes that had previously been connected to each other's network simply vanished from the cluster.</p>\n<blockquote>\n<p><strong>THE CIRCULAR DEPENDENCY TRAP</strong></p>\n<p>The worst part was not that 50–60% of Kubernetes nodes lost network connectivity — it was what those nodes were running. Among the VMs brought down by the network route removal were the VMs powering Datadog's <strong>regionalized control planes based on </strong><em>Cilium</em> (a cloud-native networking platform for Kubernetes that uses eBPF to provide networking, security, and observability for containerized workloads). The control plane going down meant Kubernetes couldn't schedule new pods, auto-repair failed nodes, or scale workloads to compensate. The very system that should have responded to the failure was among the first things the failure took down. This circular dependency — <strong>the recovery mechanism depending on the infrastructure that failed</strong> — is what turned a 50% node loss into a nearly complete platform outage.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Simultaneous Global Node Loss at 01:31 EST</h4>\n<p>A legacy automated Ubuntu security update channel applied a systemd update across Datadog's entire global fleet simultaneously — all five regions, all three cloud providers, all at once. The update caused a systemd-networkd restart interaction that removed network routing tables from nodes as they restarted. 50–60% of Kubernetes nodes lost network connectivity within minutes. Pages loaded but displayed no data. 
The outage was total from the customer perspective.</p>\n<hr />\n<h3>Cause</h3>\n<h4>The Control Plane Was in the Blast Radius</h4>\n<p>The Kubernetes control plane — the cluster management layer responsible for scheduling, auto-repair, and scaling — was among the nodes that lost connectivity. This created a circular dependency: the recovery system needed a healthy cluster to run on, but the cluster could not heal without the recovery system. Additionally, Datadog's multi-region, multi-cloud architecture provided no protection because the update was applied uniformly across all infrastructure simultaneously.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Manual Node Recovery + Architecture Rethink</h4>\n<p>Recovery required manual intervention: engineers identified and restarted affected nodes, restoring network routing and bringing Kubernetes control planes back online region by region. The legacy update channel was immediately disabled. But recovery took over 24 hours — far longer than the node loss itself — because services loaded large in-memory caches on startup that were slow to initialize, and the cluster lacked the spare capacity to absorb the sudden recovery surge.</p>\n<hr />\n<h3>Result</h3>\n<h4>Full Recovery, New Philosophy</h4>\n<p>Service was fully restored after more than 24 hours. Datadog later published a detailed engineering blog describing not just what happened but the architectural shift it drove: away from <em>never-fail</em> systems toward systems designed to <strong>degrade gracefully</strong> when failure inevitably occurs. Published in October 2025, the blog documented two years of architectural work that followed directly from the March 2023 incident.</p>\n<hr />\n\n<blockquote>\n<p><strong>💸</strong></p>\n<p>Datadog operates on usage-based billing — customers pay for the volume of metrics, logs, and traces they send. During the 24-hour outage, Datadog <strong>did not charge customers for data they couldn't send</strong>. 
The $5M revenue loss was direct: one day of global service unavailability translated directly into one day of foregone billing. This number was revealed on an earnings call, making the financial cost of the outage unusually concrete and public.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>Multi-Cloud Did Not Help</p><p>Datadog ran in five regions across three cloud providers — AWS, GCP, and Azure. This architecture is often cited as a reliability best practice. But it provided <span><strong>zero protection</strong></span> in this incident because the failure mechanism — the automated Ubuntu update — operated at the OS layer, uniformly across all infrastructure regardless of cloud provider. Multi-cloud protects against cloud provider failures. It does not protect against failures in your own automation that touch all infrastructure simultaneously.</p>\n</blockquote>\n\n<p>The 24-hour recovery time was itself a lesson. Even after the Kubernetes control planes came back online and new pods could be scheduled, <strong>services were slow to recover</strong>. The investigation found two patterns: some services had insufficient compute allocated relative to others, causing them to wait a long time for Kubernetes to schedule their pods after the control plane recovered. Others loaded large, processing-intensive caches into memory at startup — caches that had been optimized for steady-state operation but were extremely expensive to rebuild from scratch after a complete restart. Both of these were design choices that had seemed reasonable in a world where failure was rare and total restarts were rarer still. In a world where failure must be expected, they were traps.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Irony of the Observability Platform</p><p>There is a particular quality of darkness in losing observability tooling during an outage. 
Engineers responding to the incident were using Datadog to understand what was happening — and Datadog was the thing that was down. The response team had to work from first principles: SSH into individual hosts, read raw logs, check systemd status directly. The tooling built to abstract away that complexity was unavailable at precisely the moment the complexity needed to be navigated. The incident revealed how dependent Datadog's own oncall rotation was on Datadog itself.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Square-Wave Failure Pattern</p><p>Datadog's engineers described the outage as a <strong>square-wave failure</strong> — the platform went from fully operational to nearly completely down almost instantaneously, rather than degrading gradually. This pattern is characteristic of failures at the infrastructure layer: when Kubernetes nodes lose network connectivity, every pod running on those nodes disappears from service meshes and load balancers at once. There is no gradual ramp. For an observability platform designed around monitoring continuous signals, a square-wave drop to zero looked different from every other failure mode the monitoring systems had been trained on.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🌐</strong></p>\n<p>Datadog ran infrastructure across <strong>five regions on three different cloud providers</strong> — a setup specifically designed to avoid single points of failure. It provided no protection at all against this incident because the failure mechanism lived at a layer beneath the cloud provider abstraction: the Ubuntu OS update that ran on every Datadog-managed VM, regardless of which cloud it ran on. 
The lesson is precise: multi-cloud resilience and OS-level automation independence are orthogonal properties.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE POSTMORTEM DELAY</strong></p>\n<p>Datadog waited over two months to publish a public postmortem — a gap that generated significant industry commentary, particularly after the CEO referenced it on an earnings call before it was publicly available. The eventual postmortem was substantive and technical. But the delay — and the CEO's apparent confusion about whether it had been shared — was widely noted as a departure from the transparency standard set by companies like Cloudflare. <strong>Speed of postmortem publication matters for customer trust</strong>, especially for a platform whose entire value proposition is reliability and observability.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Philosophical Shift: From Never-Fail to Graceful Degradation</h3>\n<p>The deep engineering response to the March 2023 outage was not a list of tactical fixes. It was a philosophical shift. Datadog's engineering teams had, historically, built for reliability through redundancy — designing systems so that individual components never went down. This produced what the postmortem called <strong>never-fail architectures</strong>: systems where components and services had to be fully functional to serve any user use case. When a component did fail, the entire service path that depended on it failed with it. The incident revealed a hidden assumption: that recovery would be fast and partial failure would be brief. 
A 24-hour outage broke that assumption completely, and exposed how little thought had gone into what the system should do while broken.</p>\n<ul>\n<li><strong>24h+</strong> — Total outage duration — longer than the initial node loss because service startup was slow and the cluster lacked capacity to absorb the recovery surge</li>\n<li><strong>$5M</strong> — Direct revenue loss from usage-based billing — one day of global unavailability translated to one day of zero billing, revealed publicly on an earnings call</li>\n<li><strong>50–60%</strong> — Kubernetes nodes that lost network connectivity from the systemd update — enough to take down control planes and make automated recovery impossible</li>\n<li><strong>3 clouds</strong> — Cloud providers affected simultaneously — AWS, GCP, and Azure all impacted because the failure was in Datadog's own automation, not in any cloud provider's infrastructure</li>\n</ul>\n\n<blockquote>\n<p><strong>WHAT GRACEFUL DEGRADATION ACTUALLY MEANS</strong></p>\n<p>Datadog's post-incident architectural shift was built on a simple principle: when failure occurs, the system should continue to deliver <strong>as much value as possible to as many customers as possible</strong>, even if it cannot deliver full value to all customers. This means designing every service with an explicit answer to the question: <em>what does this service do when its dependencies are unavailable?</em> Can it serve stale data? Can it serve a subset of features? Can it serve with degraded accuracy? Or does it have to stop entirely? 
Most services, when the question is asked honestly, can do better than stop.</p>\n</blockquote>\n\n<pre><code class=\"language-python\"># Illustrative sketch: the storage, cache, and fallback clients are assumed to be injected\nfrom dataclasses import dataclass\nfrom typing import Any, Optional\nclass StorageUnavailable(Exception):\n    '''Raised when the live store cannot be reached.'''\n\n@dataclass\nclass DataResponse:\n    data: Any = None\n    staleness_warning: bool = False\n    completeness_warning: bool = False\n    error: Optional[str] = None\n    retry_in: Optional[int] = None  # seconds\n\n# Before: Never-fail architecture (implicit assumption)\nclass NeverFailMetricsQueryService:\n    def __init__(self, storage):\n        self.storage = storage\n\n    def process(self, raw_data):\n        return DataResponse(data=raw_data)  # placeholder for real processing\n\n    def query_metrics(self, metric_name, time_range):\n        # If storage is unavailable, this raises an exception\n        # The exception propagates up — user sees an error page\n        raw_data = self.storage.fetch(metric_name, time_range)\n        return self.process(raw_data)  # no fallback\n\n# After: Graceful degradation architecture\nclass MetricsQueryService(NeverFailMetricsQueryService):\n    def __init__(self, storage, read_through_cache, fallback_source):\n        super().__init__(storage)\n        self.read_through_cache = read_through_cache\n        self.fallback_source = fallback_source\n\n    def query_metrics(self, metric_name, time_range):\n        try:\n            # Try live storage first\n            raw_data = self.storage.fetch(metric_name, time_range)\n            return self.process(raw_data)\n        except StorageUnavailable:\n            # Fall back to cached/stale data — user sees old data with a warning\n            stale_data = self.read_through_cache.fetch(metric_name, time_range)\n            if stale_data:\n                return DataResponse(data=stale_data, staleness_warning=True)\n            # Fall further back — return partial data from other sources\n            partial = self.fallback_source.fetch(metric_name, time_range)\n            if partial:\n                return DataResponse(data=partial, completeness_warning=True)\n            # Only now surface an error — and make it informative\n            return DataResponse(error='Storage degraded', retry_in=30)</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Startup Optimization: Fixing the Recovery Drag</p><p>Two changes addressed the slow recovery after node restoration. First, Datadog used <strong>Kubernetes priority mechanisms</strong> to ensure critical services got compute allocated before lower-priority ones when the cluster came back online — preventing a thundering herd of equal-priority pods all waiting for the same scarce resources. 
Second, services with large startup caches shortened their <strong>lookback windows</strong> and changed data formats to eliminate processing-intensive deserialization at startup. Services that had been trying to rebuild six months of cache at startup were redesigned to start with a smaller warm window and build up over time.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Architectural Patterns That Emerged</p><p>Over the two years following the incident, Datadog published a set of graceful degradation patterns applied across its products: <strong>persist data early</strong> (write to durable storage as early as possible in the pipeline, so recovery is stateless); <strong>stale reads</strong> (serve cached data with a staleness indicator rather than surfacing an error); <strong>partial serving</strong> (return what you have rather than nothing); <strong>circuit breaking</strong> (automatically stop calling a failing dependency, fall back to alternative, re-probe for recovery). None of these patterns were invented by Datadog — they were standard resilience engineering techniques that Datadog had systematically under-applied.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Persist Data Early: The Durability Pattern</p><p>One of the most concrete architectural changes after the incident was implementing a <strong>persist early</strong> pattern across Datadog's data pipelines. Instead of holding data in-memory for processing before writing to durable storage, the system was changed to write to durable storage as soon as data arrived — before processing. This meant that even if processing services went down, incoming customer telemetry was safely on disk and could be processed retroactively when services recovered. 
Recovery no longer required customers to resend data that had arrived during the outage window.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Kubernetes Priority Class Oversight</p><p>After the outage, Datadog's investigation found that many services had not been assigned appropriate <strong>Kubernetes Priority Classes</strong> — a mechanism that tells the Kubernetes scheduler which pods should get compute resources first when the cluster is under resource pressure. In normal operation, this doesn't matter much. After a large failure where the entire cluster restarts simultaneously, priority classes determine recovery order. Services that should start first (database proxies, ingestion pipelines) were waiting for the same CPU allocations as low-priority background jobs. Recovery order is a design decision that should be made explicitly, not left to scheduler defaults.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The architecture that failed in March 2023 had a specific shape: every product feature in Datadog's platform depended on a chain of services, each of which had to be fully healthy for any part of the chain to work. Logs required a log ingestion pipeline, a storage layer, a query layer, and a frontend — all healthy. If any component in the chain was down, the entire feature was down. The never-fail architecture assumed each link in the chain would always be up. 
The March 2023 incident showed what happens when multiple links go down simultaneously.</p>\n<h3>Before: Never-Fail Chain Architecture (Any Failure = Total Failure)</h3>\n<p><a href=\"https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Graceful Degradation Architecture (Failure = Degraded, Not Dark)</h3>\n<p><a href=\"https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>MULTI-CLOUD IS NOT A RELIABILITY SILVER BULLET</strong></p>\n<p>Datadog's global, multi-cloud infrastructure — five regions, three cloud providers — provided zero protection against this incident. The lesson generalizes: <strong>multi-cloud protects against cloud provider failures</strong>. It does not protect against failures in your own configuration management, your own automation, your own deployment systems, or your own service design. An automated update that runs across all infrastructure uniformly bypasses all multi-cloud redundancy. Organizations that invest heavily in multi-cloud while neglecting the uniformity of their own automation are addressing the wrong failure vector.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Legacy Channel Problem</p><p>The update that caused the outage went through a <strong>legacy security update mechanism</strong> — a channel that Datadog's security team had kept active while building a modern replacement. The modern system had been built; the legacy system had not been decommissioned. This is one of the most common failure patterns in infrastructure: a replaced system that was never actually turned off. 
The old system executed one last time at the worst possible moment. Every team with legacy automation that still runs in production should audit whether it could execute in a way that bypasses the modern system's safety gates.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔬</strong></p>\n<p>Root Cause Archaeology: Finding the Network Route Bug</p><p>The technical root cause was subtle: when <strong>systemd-networkd restarted</strong> during the OS update, it cleared the network routing table for container workloads that had been set up by Kubernetes's networking plugin (Cilium). New nodes starting up for the first time don't have this problem — they start with an empty routing table and Cilium populates it correctly. But nodes that were already running had existing routing entries that were erased by the systemd-networkd restart. This was a previously unobserved interaction that only manifested when restarting a running node rather than provisioning a new one.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The March 2023 Datadog outage is extraordinary for two reasons: the irony of an observability platform going dark, and the depth of the architectural response it drove. The lessons here are not primarily about the incident itself but about the philosophy that emerged from it.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Build for graceful degradation, not just failure prevention.</strong> Every service should have an explicit answer to: what do I do when my dependencies are unavailable? Stale data with a warning, partial results, degraded accuracy — all of these are better than returning nothing. 
The goal is to serve as many customers as possible, as fully as possible, even while broken.</li><li><span>02</span><div><em>Circular dependencies</em> (when component A depends on component B for recovery, and component B depends on component A to be running) between service infrastructure and recovery infrastructure are a reliability catastrophe waiting to happen. Explicitly audit your control planes, monitoring systems, and automation pipelines: if the thing that fixes failures is also in the blast radius of those failures, you have a recovery problem.</li><li><span>03</span><div><strong>Decommission legacy automation systems completely.</strong> The outage was caused by a legacy update channel that still had execution access after its replacement was built. Every organization has deprecated-but-still-running systems. Audit them. A legacy channel that runs once a year can cause an outage just as reliably as one that runs every day.</li><li><span>04</span><div><em>Staged rollouts</em> (applying changes to a small percentage of infrastructure first, checking health, then expanding gradually) are not optional for automated changes to production infrastructure. The Datadog systemd update was applied globally and simultaneously. A staged rollout — 1% of nodes, health check, 10%, health check — would have caught the network route removal on a handful of nodes before it cascaded to the entire fleet.</li><li><span>05</span><div><strong>Design service startup to be fast under the conditions that follow a large outage.</strong> When a cluster recovers from a significant failure, all services restart simultaneously with no warm caches, competing for scarce cluster capacity. Services optimized for steady-state operation can become bottlenecks in this cold-restart scenario. 
Test your startup behavior under cluster-wide cold-start conditions, not just under normal rolling restarts.</li></ol>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Two and a Half Years to Publish the Engineering Post</p><p>The outage happened in March 2023. The detailed engineering blog documenting the architectural response was published in <strong>October 2025 — two and a half years later</strong>. This timeline reflects the depth of the work: the blog described real architectural changes that had been implemented and validated in production across Datadog's entire product portfolio, not aspirational plans. Publishing only after the work was done is the responsible version of transparency — claiming to have fixed something before you've fixed it erodes trust faster than silence.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>GRACEFUL DEGRADATION AS A DESIGN PRINCIPLE, NOT A FEATURE</strong></p>\n<p>The deepest lesson from Datadog's post-incident work is that graceful degradation is not a feature you add to a service after it's built — it's a design principle that shapes how the service is architected from the beginning. A service designed to gracefully degrade will have different internal boundaries, different cache strategies, different dependency contracts, and different SLOs than one designed to always succeed. <strong>Retrofitting graceful degradation into a never-fail architecture is expensive</strong>. Building for it from the start is cheaper. After two years of retrofitting, Datadog's engineering organization now treats the question 'how does this service degrade?' 
as a required design review criterion.</p>\n</blockquote>\n\n<blockquote><p>Datadog's monitoring platform went down for 24 hours — which means the engineers had to debug a global infrastructure failure using SSH, intuition, and the kind of raw log reading skills that got them into engineering in the first place.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/datadog-systemd-outage-graceful-degradation-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-18T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.876392+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Datadog"]}, {"id": "https://techlogstack.com/explore/openai-postgresql-scaling-2026/", "url": "https://techlogstack.com/explore/openai-postgresql-scaling-2026/", "title": "OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works", "summary": "How OpenAI scales ChatGPT's PostgreSQL database to 800 million users with a single primary instance, 50 read replicas, connection pooling, and ruthless query discipl", "content_html": "<p><strong>OpenAI</strong> · Databases · 18 May 2026</p>\n<p>ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. 
Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.</p>\n<ul>\n<li>800M users, 1 primary PG instance</li><li>~50 read replicas globally</li><li>5-second DDL timeout enforced</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>The conventional wisdom about database scaling at 800 million users is straightforward: you shard. You move to a distributed SQL system. You decompose into microservices each with their own database. You do not run a single primary PostgreSQL instance. OpenAI's ChatGPT does not follow this conventional wisdom. It runs on <strong>one Azure PostgreSQL Flexible Server</strong> that handles all writes — backed by approximately 50 read replicas spread across multiple regions. The system handles millions of queries per second at low double-digit millisecond p99 latency and has maintained five-nines availability. In twelve months, they had one SEV-0. The story is not that Postgres is magic. The story is that <strong>relentless optimization of a boring, proven technology can outperform premature architectural complexity</strong>.</p>\n<blockquote>\n<p><strong>WHY SINGLE-PRIMARY WORKS AT THIS SCALE</strong></p>\n<p>ChatGPT's workload is <strong>overwhelmingly read-heavy</strong>. When 800 million users open the app, browse their chat history, or load their settings, those are reads. Writes happen on message submission and account updates — a much smaller fraction of the total traffic. This access pattern is exactly what a single-primary with many read replicas handles well: the write path stays narrow, the read load fans out horizontally across replicas. The architecture is not brilliant. It is appropriate for the workload. 
That fit is what makes it work.</p>\n</blockquote>\n\n<p>OpenAI's blog published at PGConf.dev 2025 was unusually candid about both the decisions that worked and the ones that nearly broke the system. The database load grew by <strong>more than 10x in a single year</strong> following ChatGPT's viral growth. The team responded with aggressive optimization at every layer: connection management, query design, caching, write path discipline, and schema change governance. Each of these deserves examination — not because the techniques are novel, but because executing all of them simultaneously, under extreme growth pressure, with production at risk, is far harder than any one technique in isolation.</p>\n<blockquote>\n<p><strong>🔌</strong></p>\n<p>OpenAI's Azure PostgreSQL Flexible Server has a maximum of <strong>5,000 concurrent connections</strong>. At ChatGPT's scale, application servers would easily exhaust this limit without connection pooling. Before deploying <em>PgBouncer</em> (a lightweight connection pooler for PostgreSQL that multiplexes many application connections into a smaller pool of real database connections, dramatically reducing connection overhead), average connection time was 50ms. After deployment in statement-pooling mode: <strong>5ms</strong>. A 10x improvement from one infrastructure change.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>10x Database Load Growth in One Year</h4>\n<p>ChatGPT's viral growth — 100 million users in two months at launch, 800 million by 2025 — drove database load up more than 10x in a single year. Connection exhaustion became a recurring threat. A 12-table ORM-generated join was causing multiple high-severity incidents when traffic spiked. 
Write pressure on the single primary was approaching dangerous levels during high-demand events.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Invisible Query Complexity and Write Pressure</h4>\n<p>An <em>ORM</em> (Object-Relational Mapping, a framework layer like the Django ORM or SQLAlchemy that automatically generates SQL from application code and abstracts away the database; convenient, but capable of generating complex, inefficient queries that are invisible until they cause production incidents) generates SQL automatically, hiding complexity from developers. Under low load, even a 12-table join runs fast enough that nobody notices it. Under 10x load, the same query saturates database CPU. Meanwhile, write-heavy workloads that could be migrated to sharded systems like Azure Cosmos DB remained on the single primary longer than optimal.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Multi-Layer Defense: Pool + Cache + Rate Limit + Migrate</h4>\n<p>OpenAI implemented PgBouncer connection pooling (cutting average connection time tenfold, from 50ms to 5ms), a cache-locking mechanism to prevent thundering herd on cache misses, multi-layer rate limiting at application, proxy, and query levels, surgical elimination of the worst ORM-generated queries, strict schema change governance (5-second DDL timeout), and a policy of migrating all new write-heavy workloads to sharded systems by default.</p>\n<hr />\n<h3>Result</h3>\n<h4>One SEV-0 in Twelve Months, Five-Nines Availability</h4>\n<p>One SEV-0 in twelve months — triggered by the viral launch of ChatGPT ImageGen, which caused a 10x write surge as over 100 million users signed up within a week. Postgres recovered by design. p99 latency held at low double-digit milliseconds. 
The single-primary architecture remained viable at a scale that surprised the entire database engineering community.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The ORM Query That Caused Multiple SEV-0s</p><p>OpenAI's engineers discovered that a single ORM-generated SQL query was <strong>joining 12 tables</strong>. Under normal load, the query executed in acceptable time. Under traffic spikes, it saturated the primary database's CPU and caused multiple high-severity incidents. The query had been auto-generated by the ORM framework and never explicitly reviewed. ORMs are excellent for developer productivity and terrible for query performance visibility. OpenAI now requires that all ORM-generated queries against high-traffic tables be reviewed and analyzed with EXPLAIN ANALYZE before deployment.</p>\n</blockquote>\n\n<p>OpenAI's schema change governance is one of the most operationally distinctive aspects of their Postgres setup. They enforce a strict rule: <strong>schema changes that trigger a full table rewrite are prohibited in production</strong>. Under Postgres's <em>MVCC</em> (Multi-Version Concurrency Control — Postgres's mechanism for allowing readers and writers to operate concurrently without blocking each other, at the cost of retaining multiple versions of each row and requiring periodic vacuum to reclaim space) model, a schema change that forces a rewrite, such as most column type changes or adding a column with a volatile default (before PostgreSQL 11, any <code>ALTER TABLE ADD COLUMN DEFAULT</code>), can hold an exclusive lock on a large table for hours while rewriting billions of rows. This would be catastrophic at ChatGPT's scale. All DDL operations have a <strong>5-second timeout</strong>: if the schema change cannot acquire a lock within 5 seconds, it is cancelled automatically. 
Long-running queries that would block vacuum or DDL are automatically terminated.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Hot Standby in High-Availability Mode</p><p>OpenAI runs the primary database in <strong>High-Availability mode with a hot standby</strong> — a continuously synchronized replica specifically designated as the failover target. If the primary goes down, the hot standby can be promoted to primary with ~30–60 seconds of downtime. During a primary failure, read traffic on replicas is unaffected — since most ChatGPT requests are reads, a primary failure is not a SEV-0 (because reads remain available). Writes fail until promotion completes. This asymmetry between read and write availability is a conscious architectural tradeoff: the 800 million users who are just browsing conversation history continue being served.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why Not Shard? The Honest Answer</p><p>The engineering question 'why didn't OpenAI shard PostgreSQL?' has a straightforward answer: <strong>sharding is expensive and their workload didn't require it yet</strong>. Horizontal sharding introduces cross-shard transaction complexity, scatter-gather query patterns, operational overhead of multiple database instances, and application-layer awareness of shard routing. For a read-heavy workload that can be served from replicas, these costs are not justified. OpenAI chose to pay the operational cost of extreme Postgres optimization rather than the architectural cost of sharding — and the math worked out. The 'no new tables' policy ensures this calculation will be revisited for write-heavy workloads as they emerge.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>IDLE TRANSACTION TIMEOUTS: THE QUIET KILLER</strong></p>\n<p>OpenAI identified a subtle but devastating Postgres pattern at scale: idle transactions. 
When application code opens a database connection, starts a transaction, does unrelated work (calling an external API, waiting for user input), and only then commits — the transaction holds locks for the entire duration. At ChatGPT's scale, applications that hold open transactions for seconds can block vacuum, block DDL, and degrade query performance for all other connections. OpenAI enforces <strong>strict idle_in_transaction_session_timeout</strong> settings — any connection idle inside a transaction for more than a few seconds is automatically terminated. This breaks poorly-written code immediately in staging rather than causing incidents in production.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📊</strong></p>\n<p>Despite having ~50 read replicas across multiple geographic regions, OpenAI reports <strong>near-zero replication lag</strong> on most replicas under normal conditions. This is achieved by co-locating PgBouncer, application servers, and replicas in the same region (minimizing network latency in the replication path) and by keeping primary write load within the replication throughput capacity of the replicas. Heavy write events — like the ImageGen launch surge — temporarily increase replication lag, which is why read-your-own-write operations are always routed to the primary.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Write Ceiling Is Real</p><p>OpenAI's single-primary architecture has an acknowledged limit: <strong>write-heavy events can overwhelm it</strong>. The ImageGen SEV-0 was caused by a write surge, not a read surge. The architecture is not defended against arbitrary write load — it is defended against the <em>current</em> write profile, which remains manageable because most new write-heavy workloads are being routed to Cosmos DB. If write load grows faster than the migration effort proceeds, the single-primary architecture will face a harder ceiling. 
The 'no new tables in Postgres' policy is the operational discipline that buys time.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Seven-Layer Defense</h3>\n<p>OpenAI's Postgres scaling is not one clever trick — it is seven mutually reinforcing operational practices applied simultaneously. Any one of them in isolation would help marginally. Together they have produced an architecture that handles a scale that its underlying technology was not originally designed for. The practices are: connection pooling, thundering herd prevention, multi-layer rate limiting, hot standby failover, write offloading, query surgery, and DDL governance. Each addresses a specific failure mode that appeared as ChatGPT grew.</p>\n<ul>\n<li><strong>10x</strong> — Database load growth in a single year following ChatGPT's viral expansion — the growth rate that forced each of the seven defensive layers to be implemented under production pressure</li>\n<li><strong>5ms</strong> — Average connection setup time after PgBouncer deployment — down from 50ms before pooling, a 10x improvement that eliminated connection exhaustion as a recurring incident cause</li>\n<li><strong>5 sec</strong> — Maximum DDL lock wait timeout — schema changes that cannot acquire a lock within 5 seconds are automatically cancelled, preventing table-lock incidents on billion-row tables</li>\n<li><strong>1 SEV-0</strong> — High-severity incidents in the twelve months after full defensive architecture was deployed — triggered by ImageGen launch write surge, resolved by design without full platform outage</li>\n</ul>\n\n<pre><code class=\"language-python\"># Cache-locking pattern: prevents thundering herd on cache misses\n# When a cache entry expires, only ONE request repopulates it — others wait\n\nimport threading\nimport time\n\n# Simplified cache-lock implementation\n_cache = {}   # key: (value, expires_at)\n_locks = {}   # key: Event for the in-flight fetch\n_lock_mutex = threading.Lock()\n\ndef _fresh(key):\n    \"\"\"Return (True, value) if key has an unexpired entry, else (False, None).\"\"\"\n    entry = _cache.get(key)\n    if entry is not None and time.monotonic() &lt; entry[1]:\n        return True, entry[0]\n    return False, None\n\ndef get_with_cache_lock(key: str, fetch_fn, ttl_seconds: int):\n    \"\"\"Get value from cache. On miss, only one thread fetches;\n    others block and receive the result once available.\"\"\"\n    \n    # Fast path: cache hit\n    hit, value = _fresh(key)\n    if hit:\n        return value\n    \n    # Slow path: cache miss — acquire per-key lock\n    with _lock_mutex:\n        # Re-check: another thread may have just repopulated the entry\n        hit, value = _fresh(key)\n        if hit:\n            return value\n        event = _locks.get(key)\n        should_fetch = event is None\n        if should_fetch:\n            event = _locks[key] = threading.Event()\n    \n    if should_fetch:\n        try:\n            # This thread does the database read\n            value = fetch_fn()          # ONE database query\n            _cache[key] = (value, time.monotonic() + ttl_seconds)\n            return value\n        finally:\n            with _lock_mutex:\n                _locks.pop(key, None)\n            event.set()                  # wake all waiters\n    else:\n        # Other threads wait for the fetching thread to complete\n        event.wait(timeout=5)           # don't wait forever\n        return _fresh(key)[1]           # None if the fetch failed or timed out</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Cosmos DB Migration Policy</p><p>OpenAI's most forward-looking operational decision is a standing policy: <strong>no new tables are created in PostgreSQL</strong>. All new workloads default to sharded systems — primarily Azure Cosmos DB. Existing write-heavy workloads that can be horizontally partitioned are gradually migrated out. This policy doesn't fix the current architecture; it fixes the future architecture. Over time, the Postgres primary handles a smaller and smaller share of writes while remaining the canonical store for core user and conversation data. 
The single-primary architecture is not defended forever — it's being gracefully phased toward a hybrid model.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>REVIEW ORM-GENERATED SQL IN PRODUCTION</strong></p>\n<p>OpenAI's most actionable advice: <strong>add ORM-generated SQL review to your production deployment process</strong>. ORM frameworks are brilliant for development velocity. They are silent performance landmines at scale. A query that joins 12 tables, a query that does a full table scan on an unindexed column, a query that triggers N+1 loads — none of these are visible in code review because the ORM generates them at runtime. OpenAI now requires that SQL generated by ORM frameworks for high-traffic tables be logged, analyzed with EXPLAIN ANALYZE at peak load, and reviewed by a database engineer before the code ships. This practice is cheap. Not having it costs SEV-0s.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Lazy Writes: Smoothing Write Spikes</p><p>OpenAI introduced <strong>lazy writes</strong> for certain workloads — deferring non-critical writes instead of executing them immediately. For example, updating a user's last-seen timestamp or incrementing a view counter doesn't need to hit the database synchronously with every request. Batching these writes and flushing them periodically smooths write traffic from a spiky real-time pattern to a steadier background pattern. Lazy writes reduced write load on the primary meaningfully without any change to user-visible behavior.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Covering Indexes: The Query Surgery Tool</p><p>Beyond eliminating bad ORM queries, OpenAI invested heavily in <strong>covering indexes</strong> — indexes that contain all columns needed by a query, so Postgres can answer it from the index alone without reading table rows. 
A covering index on a high-frequency query can reduce query cost from a sequential scan of billions of rows to a few hundred index lookups. OpenAI's database engineers regularly audit slow query logs and apply targeted index improvements, particularly after any traffic increase reveals latent query inefficiencies that weren't visible at lower load.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Feature Traffic Isolation</p><p>OpenAI isolates low-priority features from critical traffic paths. If a secondary feature — say, a background data analysis job — starts behaving poorly and consuming database resources, it should not degrade ChatGPT's core conversational experience. This isolation is implemented through <strong>separate connection pools for different traffic classes</strong>, Kubernetes resource quotas for background workloads, and rate limiting that gives core product queries priority access to database capacity. The principle: a misbehaving low-priority feature should degrade itself, not the entire platform.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>OpenAI's Postgres architecture is simple at the macro level — one writer, many readers — but densely engineered at the micro level. The simplicity is intentional: every additional layer of infrastructure complexity is a potential failure mode. The dense engineering at the application and proxy layers is what makes the simple macro architecture viable at unprecedented scale. 
Understanding why this architecture works requires understanding both its strengths (read-heavy workload perfectly matched to replica fan-out) and its known limits (write spikes, ORM-generated queries, connection exhaustion).</p>\n<h3>OpenAI's PostgreSQL Architecture: Single Primary, Global Read Scale</h3>\n<p><a href=\"https://techlogstack.com/explore/openai-postgresql-scaling-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Multi-Layer Rate Limiting: Defense in Depth for Write Spikes</h3>\n<p><a href=\"https://techlogstack.com/explore/openai-postgresql-scaling-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE REPLICATION LAG TRADEOFF</strong></p>\n<p>Asynchronous replication to read replicas introduces a tradeoff: reads may return slightly stale data. For most ChatGPT operations — loading conversation history, displaying user settings, browsing the interface — <strong>a few hundred milliseconds of staleness is imperceptible and acceptable</strong>. For the small fraction of requests that require current data (a write followed immediately by a read-your-own-write pattern), OpenAI routes those reads to the primary. This explicit differentiation between 'reads that can tolerate lag' and 'reads that cannot' is a design discipline, not an accident — and it's what allows the read load to be distributed across 50 replicas.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Backfill Rate Limit: So Slow It Takes a Week</p><p>OpenAI enforces strict rate limits on database backfill operations — migrations that populate new columns or update existing rows across large tables. These rate limits are aggressive enough that a large backfill can take over a week to complete. 
This is deliberate: a fast backfill on a billion-row table would compete with live traffic for I/O, degrade query latency, and risk triggering the DDL timeout. Slow backfills are boring and invisible. Fast backfills cause incidents. OpenAI chose boring.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Five-Nines Achieved</p><p>OpenAI reports achieving <strong>99.999% availability</strong> on their Postgres infrastructure — five nines, which means less than 5.26 minutes of downtime per year. This is achieved despite running a single primary, primarily because most customer traffic is read-only (served by replicas even during primary downtime), write failures during primary maintenance are brief (hot standby promotion in 30–60 seconds), and the defensive layers prevent the most common failure modes from escalating. Five nines on a single-primary setup requires more engineering discipline, not less, than achieving the same availability on a distributed system.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>OpenAI's PostgreSQL story is the strongest available evidence that conventional wisdom about 'you must shard at scale' is not a law — it's a heuristic that depends heavily on workload shape. The lessons here are about operational discipline, honest workload analysis, and knowing the limits of your architecture before they find you.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Analyze your workload before choosing your architecture.</strong> OpenAI's single-primary architecture works because ChatGPT is overwhelmingly read-heavy. A write-heavy workload at the same scale would fail with this architecture. 
The lesson is not 'use a single primary' — it's 'design for your actual access patterns, not for the scale number on the slide.'</li><li><span>02</span><div><em>Connection pooling</em> (deploying a proxy like PgBouncer between application servers and PostgreSQL that multiplexes thousands of application connections into a smaller pool of database connections, reducing connection overhead and preventing connection exhaustion) is not optional at scale. At ChatGPT's traffic volume, hitting Postgres's 5,000-connection limit without pooling would have caused regular outages. PgBouncer turned a recurring incident cause into a non-issue. Deploy it before you need it.</li><li><span>03</span><div><strong>Review ORM-generated SQL for high-traffic tables before shipping.</strong> A 12-table join that worked fine at 1x traffic caused multiple SEV-0s at 10x. ORMs are invisible query generators. Add explicit review of ORM-generated queries — EXPLAIN ANALYZE at production load levels — as a standard pre-deployment step for database-touching code.</li><li><span>04</span><div>Enforce schema change governance with hard timeouts. <strong>A DDL operation that holds a table lock for hours will cause an incident.</strong> OpenAI's 5-second DDL timeout automatically cancels any schema change that cannot acquire a lock quickly. This constraint forces engineers to use online DDL tools (pg_repack, zero-downtime column addition) rather than naive ALTER TABLE on large tables.</li><li><span>05</span><div>Plan the exit from your current architecture before you need it. OpenAI's 'no new tables in PostgreSQL' policy and ongoing write workload migration to Cosmos DB are the planned evolution of the current architecture. <strong>A single-primary Postgres at 800M users is viable today because write load is bounded. 
It's viable tomorrow because write-heavy workloads are being systematically migrated out.</strong> Know the limits of your current architecture and have a credible plan for crossing them.</li></ol>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The ImageGen Launch: The One That Got Through</p><p>In twelve months of operation with the fully hardened architecture, OpenAI had one SEV-0: the launch of ChatGPT ImageGen. Over 100 million new users signed up within a week, driving a >10x spike in write traffic — specifically new account creation and preference storage — that temporarily overwhelmed the primary's write capacity. The system recovered by design, but the event validated the 'no new tables in Postgres' policy. Write surges at viral launch scale are the known limit of single-primary architecture. The Cosmos DB migration is the known fix.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE HYBRID MIGRATION STRATEGY</strong></p>\n<p>OpenAI's hybrid approach — <strong>keep Postgres for what it does well, migrate write-heavy workloads to Cosmos DB, enforce 'no new tables in Postgres'</strong> — is a template for any team running a successful legacy database under growth pressure. The alternative extremes (migrate everything at once, or never migrate anything) are both wrong. Incremental migration guided by workload characteristics is boring, slow, and correct. 
The discipline is in writing down the policy and enforcing it before the crisis arrives.</p>\n</blockquote>\n\n<blockquote><p>OpenAI runs ChatGPT for 800 million users on one Postgres instance and the most complex part of their database engineering is telling people not to use ORMs without reading the SQL they generate.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/openai-postgresql-scaling-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-18T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.880910+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "OpenAI"]}, {"id": "https://techlogstack.com/explore/uber-multicloud-secrets-2025/", "url": "https://techlogstack.com/explore/uber-multicloud-secrets-2025/", "title": "Uber Had 150,000 Secrets Scattered Across 25 Vaults — So They Built One Platform to Rule Them", "summary": "How Uber consolidated 150,000 secrets from 25 fragmented vaults into 6 centrally managed vaults and built an automated Secret Management Platform handling 20,000 rot", "content_html": "<p><strong>Uber</strong> · Security · 18 May 2026</p>\n<p>150,000 secrets. 25 separate vaults. Hundreds of teams managing their own credentials in their own ways, some in plain text in version control. At Uber's scale — 5,000 microservices, 5,000 databases, 500,000 analytical jobs per day — secrets sprawl is not a compliance problem. It is an incident waiting to happen. 
A team of ten engineers decided to fix it.</p>\n<ul>\n<li>150,000 secrets managed</li><li>25 vaults → 6 managed vaults</li><li>5,000 microservices secured</li><li>20,000 automated rotations/month</li><li>90% fewer secrets in pipelines</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Secrets sprawl is the entropy of infrastructure security. Left to its own devices, every team builds its own vault, stores credentials however is convenient, and shares secrets in whatever way is fastest. At a startup with ten engineers, this is manageable. At Uber — <strong>5,000 microservices, 5,000 databases, 400+ third-party integrations, 500,000 analytical jobs per day</strong> — it becomes a systemic security risk. By the time Uber's Secrets team began their consolidation project, the company had <strong>150,000 secrets scattered across 25 separate vault systems</strong>, operated by different teams, with different security standards, different rotation practices, and inconsistent access controls. Some secrets were in plain text in codebases. Others lived in databases that had never been audited for credential exposure. Cyberattacks targeting exposed credentials were rising industry-wide. The question was not whether Uber should fix this — it was how.</p>\n<blockquote>\n<p><strong>🔐</strong></p>\n<p>Before the consolidation, Uber's <strong>25 separate vault systems</strong> were operated by various teams across engineering. Some were standard <em>HashiCorp Vault</em> (an open-source secrets management tool that provides a secure, centralized store for tokens, passwords, certificates, and encryption keys) deployments. Others were custom databases. 
Others were cloud-specific secret managers for AWS, GCP, and Azure. None of them talked to each other. None of them had a unified view of what credentials existed where.</p>\n</blockquote>\n\n<p>The Secrets team's strategy had two phases. Phase 1 was consolidation: take ownership of all vault infrastructure, standardize on a small number of canonical vault systems (one per cloud provider plus one on-premises HashiCorp Vault), and migrate all secrets from the 25 fragmented vaults into these six. This was the foundation work — unglamorous, involving hundreds of engineers across different teams, and requiring careful coordination to avoid breaking services that depended on existing vault paths. Phase 2 was the platform: building a <strong>Secret Management Platform</strong> on top of the consolidated vaults — a metadata model, lifecycle automation, unified API, and real-time scanning — that turned six vaults into a governed, auditable, self-service system.</p>\n<blockquote>\n<p><strong>THE FIVE PROBLEMS THEY HAD TO SOLVE</strong></p>\n<p>As the Secrets team consolidated vaults, five common problem patterns emerged that any future platform would need to address: <strong>(1) no unified metadata model</strong> — no way to know what a secret was for, who owned it, when it was last rotated; <strong>(2) no cross-vault CRUD</strong> — managing secrets across different vault types required different tools and APIs; <strong>(3) no developer self-service</strong> — engineers filed tickets to create or rotate secrets; <strong>(4) no inventory</strong> — no way to generate a complete list of secrets for security incident response; <strong>(5) no automated rotation</strong> — credential rotation required manual coordination, so it was delayed or skipped.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Secrets Sprawl: 150,000 Credentials, No Visibility</h4>\n<p>Uber's infrastructure had grown faster than its secrets governance. 
25 vault systems operated by different teams meant no single team had visibility into the company's complete credential inventory. Shadow IT vaults with no central oversight created audit gaps. Secrets were shared insecurely, rotated rarely, and sometimes stored in version control. With cyberattacks targeting credential exposure rising industry-wide, the status quo was untenable.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Scale + Decentralization = Governance Collapse</h4>\n<p>At Uber's scale, decentralized secrets management doesn't produce diversity and resilience — it produces inconsistency and risk. Each of 25 vaults had its own standards, its own rotation schedule (usually none), its own access model. There was no way to answer basic security questions: who has access to which credentials? When were they last rotated? Are any credentials in source code? The scale that made the problem urgent also made it hard to fix without a dedicated team and platform.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Consolidation + Secret Management Platform</h4>\n<p>Phase 1 consolidated 25 vaults into 6 centrally managed vaults (one per cloud provider plus on-prem HashiCorp Vault). Phase 2 built the Secret Management Platform: a metadata model, a unified API abstracting across all vault types, a Cadence-orchestrated <em>Secret Lifecycle Manager</em> (Uber's automation system that handles the complete lifecycle of secrets — creation, rotation, distribution to workloads, and eventual decommissioning — using Uber's Cadence workflow engine), real-time scanning across git/Slack/CI pipelines, and self-service developer tooling.</p>\n<hr />\n<h3>Result</h3>\n<h4>20,000 Automated Rotations Per Month, 90% Fewer Exposed Secrets</h4>\n<p>A team of 10 engineers now drives 20,000 automated monthly secret rotations — up from manual rotation that happened rarely. Secrets exposed in CI pipelines dropped by 90%. 
The platform generates a complete inventory of all 150,000 secrets on demand, enabling rapid response to security incidents. Uber is actively pursuing <strong>secretless authentication</strong> — replacing long-lived credentials with ephemeral, automatically-issued tokens wherever possible.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Migration Scale Problem</p><p>Migrating secrets from 25 vaults to 6 involved hundreds of engineers whose workloads depended on existing vault paths. A secret migration is not just a data copy — it is <strong>a coordination problem</strong>. Every service reading a secret from vault path A needs to be updated to read from vault path B. In a monolith, that's one codebase. Across Uber's 5,000 microservices, that's 5,000 potential update targets. The team built tooling to discover which services were reading from which vault paths, generated migration checklists automatically, and used feature flags to switch services over gradually with rollback capability.</p>\n</blockquote>\n\n<p>The <em>metadata model</em> (a structured representation of a secret's properties — owner, purpose, rotation schedule, associated services, expiry date, security classification — that enables automated governance and incident response) was the architectural cornerstone of the Secret Management Platform. Before consolidation, a secret was just a key-value pair in a vault with no context. After the platform was built, every secret had a structured record: who owned it, which services used it, when it was last rotated, what its rotation policy was, and what its security classification was. This metadata made <strong>automated governance possible</strong>: the platform could identify secrets that hadn't been rotated in 90 days, generate compliance reports, and automatically alert owners of soon-to-expire credentials. 
It also made incident response practical: when a security team needed to identify all credentials that could have been exposed in a compromise, they could query the inventory rather than interviewing 250 engineering teams.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Real-Time Scanning: Catching Secrets Before They Ship</p><p>One of the most impactful platform features was real-time scanning across Uber's code repositories, CI pipelines, and internal Slack messages. The scanner looks for patterns matching API keys, database passwords, and private key formats. When detected, it <strong>automatically revokes the exposed credential</strong> and alerts the owning team. Before the platform, a credential committed to git might live there for months — or forever. Now, exposure is measured in seconds. The 90% reduction in secrets found in pipelines reflects this detection-and-revocation automation.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>Shadow IT Vaults: The Security Debt Multiplier</p><p>Perhaps the most dangerous aspect of Uber's pre-platform state was what the team called <strong>shadow IT vaults</strong>: secret storage systems created by individual teams outside the knowledge of the central Secrets team. These vaults had no security baseline review, no rotation policy, no access audit, and no inventory. When a team built a shadow vault, they optimized for their immediate convenience — and created a security liability that the company didn't know existed. You cannot rotate credentials you don't know about. You cannot audit access to vaults you don't know exist. Shadow IT vaults are the point where 'move fast' becomes 'incur unquantifiable risk.'</p>\n</blockquote>\n\n<blockquote>\n<p><strong>WHY 400 THIRD-PARTY INTEGRATIONS MATTER</strong></p>\n<p>Uber's 400+ third-party vendor integrations are a significant factor in the secrets management challenge. 
Each integration requires credentials — API keys, OAuth tokens, database passwords — that must be rotated when vendors change their systems or when Uber's access policy changes. Before the platform, vendor credential rotation required manual coordination: someone had to get the new credentials from the vendor, find which services used them, update each service's configuration, and verify nothing broke. <strong>At 400 integrations, this manual process consumed disproportionate engineering time</strong> and rotations were often delayed. The Secret Lifecycle Manager automated the rotation for most standard integrations.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>Before the Secret Management Platform, secret rotation at Uber required a service owner to coordinate with the Secrets team, obtain a new credential from the upstream provider, update their service's configuration, and verify the rotation succeeded. At 150,000 secrets across 5,000 services, this process ran rarely — not because security was a low priority but because <strong>the operational overhead was prohibitive at scale</strong>. Most secrets were rotated only when forced by a security incident or vendor requirement. The platform inverts this: rotation is the default, manual coordination is the exception.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Kubernetes Native Injection</p><p>One of the most seamless developer experiences in the platform is <strong>Kubernetes-native secret injection</strong>. Rather than requiring services to call an API to retrieve their credentials at startup, the platform can inject secrets directly as environment variables or mounted files into Kubernetes pods at deploy time. This is transparent to application code — the service sees its credentials as normal environment variables, with no awareness of which vault they came from or how they were rotated. 
When a rotation occurs, the platform can trigger a pod restart with the new credentials injected automatically.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Secret Lifecycle Manager</h3>\n<p>The Secret Lifecycle Manager (SLM) is the operational core of Uber's Secret Management Platform. Built on <em>Cadence</em> (Uber's open-source distributed workflow engine, designed for long-running, fault-tolerant business processes — the same engine that powers Uber's ride dispatch and payment workflows), SLM orchestrates the complete lifecycle of every secret: initial creation, distribution to consuming services, periodic rotation, and eventual decommissioning. Using Cadence's durable workflow model means that secret rotation operations are fault-tolerant — if the rotation workflow fails midway through, it can resume from where it left off rather than leaving credentials in a half-rotated, potentially inconsistent state.</p>\n<ul>\n<li><strong>25→6</strong> — Vault systems consolidated — from 25 team-operated vaults with inconsistent standards to 6 centrally managed vaults with uniform security baselines</li>\n<li><strong>20,000</strong> — Automated secret rotations per month — up from rare manual rotation that required coordination between the Secrets team and service owners</li>\n<li><strong>90%</strong> — Reduction in secrets found exposed in CI/CD pipelines — achieved through real-time scanning with automatic revocation on detection</li>\n<li><strong>10</strong> — Engineers on the Secrets team that built and now operates the entire platform — evidence that well-designed automation multiplies individual team capacity dramatically</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified Secret Lifecycle Manager rotation workflow (conceptual)\n# Real implementation uses Cadence's durable workflow primitives\n\nfrom cadence.workflow import workflow_method\n\nclass SecretRotationWorkflow:\n    @workflow_method\n    async def 
rotate_secret(self, secret_id: str):\n        \"\"\"Cadence ensures this completes even if individual steps fail.\"\"\"\n        \n        # Step 1: Generate new credential from upstream provider\n        new_credential = await self.generate_new_credential(secret_id)\n        \n        # Step 2: Write new credential to canonical vault\n        # (Durable: if this step completes, Cadence records it)\n        await self.write_to_vault(\n            secret_id=secret_id,\n            value=new_credential,\n            version='new'  # old version still readable during transition\n        )\n        \n        # Step 3: Signal consuming services to reload credential\n        # Each service has a registered reload handler\n        consuming_services = await self.get_consumers(secret_id)\n        for service in consuming_services:\n            await self.signal_reload(service, secret_id)\n        \n        # Step 4: Verify all services are using new credential\n        # (Wait for health checks to confirm)\n        await self.verify_rotation_complete(secret_id, consuming_services)\n        \n        # Step 5: Revoke old credential in upstream provider\n        await self.revoke_old_credential(secret_id)\n        \n        # Step 6: Update metadata: last_rotated, next_rotation_due\n        # (the update_metadata step stamps the rotation time itself,\n        # because workflow code must stay deterministic on replay)\n        await self.update_metadata(secret_id)\n        # Cadence schedules next rotation based on policy</code></pre>\n<blockquote>\n<p><strong>SECRETLESS AUTHENTICATION: THE NEXT FRONTIER</strong></p>\n<p>The logical endpoint of Uber's secrets management journey is <strong>secretless authentication</strong> — a model where services don't hold long-lived credentials at all. Instead, they use their identity (a Kubernetes service account, a SPIFFE/SPIRE identity, a cloud provider IAM role) to dynamically request short-lived tokens at runtime. When every token expires within an hour, there is no long-lived credential left to steal, rotate, or audit. 
Uber is actively building toward this model as the long-term replacement for static credential management. The Secret Management Platform is both the current solution and the bridge to the secretless future.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Self-Service Developer Tooling</p><p>Before the platform, creating a new secret required filing a ticket with the Secrets team. The turnaround could be days. After the platform, developers can create, update, and delete secrets through a <span><strong>self-service API, CLI, and web UI</strong></span> — all of which enforce the metadata requirements and policy compliance automatically. The Secrets team's workload shifted from manual secret operations (which scaled linearly with the number of services) to platform maintenance and governance (which scales much more slowly). A team of 10 can now serve 5,000 microservices because the services serve themselves.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Multi-Vault Abstraction Layer</p><p>Uber's infrastructure spans AWS, GCP, Azure, and on-premises HashiCorp Vault. Each environment has its own native secret manager with a different API. The Secret Management Platform includes a <strong>unified abstraction layer</strong> that presents a single API for secret CRUD operations regardless of which underlying vault the secret lives in. Application code interacts with the platform API; the platform handles routing the operation to the correct vault (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, or HashiCorp Vault) and translating the response. 
This abstraction decouples application code from vault topology — when Uber migrates a secret from one vault to another, no application code changes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Unified API: One SDK, Four Vaults</p><p>Uber's unified abstraction layer exposes a single SDK that application developers use regardless of which underlying vault stores their secret. The SDK handles routing: an AWS-deployed service's database password might live in AWS Secrets Manager; an on-prem service's certificate might live in HashiCorp Vault. The developer writes <code>secrets.get('myservice/db_password')</code> and receives the credential — the SDK consults the metadata catalog to find which vault holds that secret and retrieves it via the appropriate vault API. <strong>Application code is decoupled from vault topology</strong>, making future vault migrations transparent.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Compliance Reporting on Demand</p><p>Before the metadata model, answering a compliance auditor's question — 'show me all credentials with access to our payment processing systems' — would have required interviewing dozens of engineering teams over days. After the platform, the same question is answered by a metadata query: filter by associated_system='payment_processing', return all matching secrets with their rotation history, access policies, and owner contacts. Compliance reporting that took days now takes seconds. The metadata model was built for developer self-service but it turns out to be equally valuable for security operations and compliance.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The Secret Management Platform sits as an orchestration layer above Uber's six canonical vault systems. 
Applications and services no longer talk directly to specific vaults — they interact with the platform's unified API or use Kubernetes-native injection (where secrets are automatically mounted into pods at deployment time). The platform maintains the metadata catalog, handles lifecycle automation via the Secret Lifecycle Manager, runs real-time scanning, and provides the developer self-service tools. The vault systems themselves are the authoritative stores; the platform is the governance and automation layer on top.</p>\n<h3>Before: 25 Fragmented Vaults, No Governance</h3>\n<p><a href=\"https://techlogstack.com/explore/uber-multicloud-secrets-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Unified Secret Management Platform</h3>\n<p><a href=\"https://techlogstack.com/explore/uber-multicloud-secrets-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>CADENCE: WHY UBER CHOSE WORKFLOWS FOR ROTATION</strong></p>\n<p>Secret rotation is a multi-step process with real failure modes: the upstream provider might be unavailable, the vault write might fail, a downstream service might not acknowledge the new credential. A simple cron job or Lambda function that fails midway leaves the system in an unknown state — is the old credential still valid? Is the new one active? <strong>Cadence's durable workflow model provides exactly-once execution semantics</strong>: each step is recorded, and if the workflow fails partway through, it resumes from the last successful step. 
This makes secret rotation reliable enough to run 20,000 times per month without manual oversight.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Migration Coordination Challenge</p><p>Migrating a secret from its old vault path to the new centralized platform sounds like a database copy operation. In practice it's a <strong>distributed coordination problem across hundreds of teams</strong>. Every service reading the secret needs to be updated simultaneously (or with a dual-read transition period). Uber built tooling to discover all consumers of a vault path, generate migration checklists, and track completion status. Services that hadn't migrated within the target window were flagged for the owning team. The tooling made a migration problem tractable across an organization of thousands of engineers.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🛡️</strong></p>\n<p>SPIFFE/SPIRE: The Path to Secretless</p><p>Uber's secretless authentication initiative builds on the <strong>SPIFFE/SPIRE framework</strong> — an open standard for issuing cryptographic workload identities. Every service at Uber has a unique SPIFFE identity that is automatically issued, cryptographically verifiable, and short-lived. Services that can authenticate using their SPIFFE identity don't need to hold long-lived credentials at all — the identity proves who they are, and the system issues time-limited tokens dynamically. 
As more of Uber's infrastructure adopts SPIFFE-based authentication, the number of long-lived secrets that need to be managed by the platform shrinks toward zero.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Uber's secrets management story is about the organizational and engineering cost of decentralization without governance — and the compounding returns of building the right platform once.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Consolidate ownership before building automation.</strong> Uber's two-phase approach — consolidate 25 vaults into 6, then build the platform — was the right sequence. Building a governance platform on top of 25 independent vaults would have required integrating 25 different systems. Building it on 6 centrally owned vaults meant one integration per vault type. Consolidation first is harder organizationally but dramatically simpler technically.</li><li><span>02</span><div>A <em>metadata model</em> (a structured record of each secret's properties — owner, purpose, associated services, rotation policy, security classification — that enables automated governance, inventory, and incident response) is the prerequisite for all other automation. Without metadata, you cannot automate rotation (you don't know the rotation policy), you cannot generate inventory (you don't know what secrets are for), and you cannot respond to incidents (you don't know which services are affected). Build the metadata model before building any automation on top of it.</li><li><span>03</span><div><strong>Real-time scanning with automatic revocation changes the economics of credential exposure.</strong> When exposure is detected in seconds and the credential is automatically revoked, a developer accidentally committing a credential to git causes a 30-second incident rather than a multi-month exposure. 
The scanning + revocation loop is the highest-leverage security improvement for teams still relying on manual credential hygiene.</li><li><span>04</span><div>Use <strong>durable workflow systems</strong> (like Cadence or Temporal) for secret rotation, not scripts or cron jobs. Rotation is a multi-step process with real failure modes at each step. A workflow system that provides exactly-once execution and automatic resume on failure makes rotation reliable enough to run at scale without manual oversight. A cron job that fails halfway through a rotation leaves credentials in an unknown state.</li><li><span>05</span><div><strong>Self-service developer tooling is what makes centralized governance scale.</strong> A centralized Secrets team without self-service tooling becomes a bottleneck — every credential operation requires a ticket. A centralized Secrets team with self-service tooling becomes a platform team — they build and maintain the guardrails, and developers operate within them autonomously. The goal is governance at scale, not control at the cost of velocity.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>10 Engineers, 5,000 Microservices</p><p>The most striking number in Uber's secrets story is the ratio: <span><strong>10 engineers managing secrets governance for 5,000 microservices</strong></span>. This 500:1 leverage ratio is only possible because the platform does the work that used to require human coordination. Automated rotation, self-service tooling, policy enforcement in the platform layer — all of these shift work from the coordination model (each secret operation requires a human) to the automation model (each secret operation executes itself). 
Platform teams that want to scale should measure their leverage ratio and ask what automation would improve it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>SECRETS IN VERSION CONTROL ARE NOT A MISTAKE; THEY'RE AN ARCHITECTURE PROBLEM</strong></p>\n<p>Every engineering organization has discovered credentials accidentally committed to git. The standard response is to educate developers about the risk. Uber's analysis found that the root cause was not developer carelessness — it was the absence of a convenient alternative. When getting a credential into a service requires filing a ticket and waiting days, developers find a shortcut: put it in the config file. <strong>The secure path needs to be the easy path.</strong> The self-service API and Kubernetes injection that the Secret Management Platform provides made the secure approach easier than the shortcut.</p>\n</blockquote>\n\n<blockquote><p>Uber built a platform to manage 150,000 secrets at scale — and the most important feature turned out to be a metadata field that just says 'who owns this thing?'<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/uber-multicloud-secrets-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-18T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.885535+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Security", "Uber"]}, {"id": "https://techlogstack.com/explore/hotstar-ipl-scaling-2019/", "url": "https://techlogstack.com/explore/hotstar-ipl-scaling-2019/", "title": "When MS Dhoni Got Out: How Hotstar Survived 25 Million Concurrent Users", "summary": "How Hotstar built Project HULK, abandoned AWS autoscaling, and survived 25.3M concurrent 
users during the 2019 Cricket World Cup semi-final.", "content_html": "<p><strong>Hotstar</strong> · Live Streaming · 17 May 2026</p>\n<p>July 9th, 2019. India vs New Zealand, Cricket World Cup semi-final. MS Dhoni walks to the crease and 1.1 million new viewers join Hotstar every single minute. Then he gets run out — and 24 million people hit the back button almost simultaneously.</p>\n<ul>\n<li>25.3M peak concurrent</li><li>1.1M users/min growth</li><li>5.7 Tbps bandwidth</li><li>1M requests/sec</li><li>108,000 CPU test rig</li><li>&lt;90s scale reaction</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote>\n<p><strong>🏏</strong></p>\n<p>At the peak of the 2019 ICC World Cup semi-final between India and New Zealand, <strong>25.3 million people were simultaneously streaming live on Hotstar</strong> — a global record for any streaming platform. The platform consumed <strong>5.7 Tbps</strong> of bandwidth: roughly 70% of India's total internet capacity at the time.</p>\n</blockquote>\n\n<p>Cricket is not just a sport in India — it is synchronized national emotion. When MS Dhoni walks to the crease, tens of millions of people who had given up on the match suddenly reconsider. They reach for their phones. They open Hotstar. <strong>At one point during the semi-final, viewership was growing at 1.1 million users per minute</strong> — a rate that would overwhelm most cloud architectures before the first over was complete. 
Hotstar's engineering team had spent months anticipating exactly this moment, building an entirely custom infrastructure stack designed around one insight: the biggest danger in live sports streaming is not the peak load itself, but the <span>speed of arrival</span> at that peak.</p>\n<blockquote>\n<p><strong>THE DHONI PROBLEM</strong></p>\n<p>The growth rate was not the hardest engineering challenge — it was the <strong>drop</strong>. When Dhoni was run out, Hotstar went from <strong>25.3 million concurrent viewers to under 1 million in minutes</strong>. Millions of users hit the back button simultaneously. They didn't leave the app — they returned to the homepage, where personalization, recommendation, and content-discovery APIs were suddenly hammered by a traffic tsunami that had nothing to do with video streaming. A system built only for video delivery would have collapsed on the exit.</p>\n</blockquote>\n\n<p>The platform's architecture in 2019 ran on <strong>AWS EC2 instances</strong> with Akamai as the primary <em>CDN</em> (Content Delivery Network — a distributed system of edge servers that caches and delivers content from locations close to the user, reducing latency and offloading origin servers), backed by Apache Kafka for real-time event streaming and Apache Flink for stream processing. Hotstar had already made the strategic shift to <em>microservices</em> (an architecture where an application is built as a collection of small, independently deployable services, each responsible for a specific function), migrating from a monolith in 2018, which allowed individual services to be scaled independently. 
But AWS's native <em>Auto Scaling Groups</em> (AWS's built-in mechanism for automatically adding or removing EC2 instances based on metrics like CPU usage) had a fundamental problem: when you need to go from 1.5 million to 15 million concurrent viewers in under ten minutes, sequential instance provisioning isn't fast enough.</p>\n\n<h3>Problem</h3>\n<h4>The Speed of Arrival</h4>\n<p>During the IPL Final and World Cup, Hotstar's traffic grew at <strong>+500,000 concurrent users per minute</strong> during peak moments — triggered by push notifications sent by the marketing team when the match became exciting. AWS native auto-scaling groups couldn't provision new EC2 instances fast enough: they experienced insufficient capacity errors in specific availability zones, and their step-size mechanism added nodes in fixed batches rather than responding to the rate of change.</p>\n<hr />\n<h3>Cause</h3>\n<h4>AWS ASG's Three Fatal Flaws for Live Events</h4>\n<p><em>Availability Zone</em> (an isolated, physically separate data center within an AWS region, designed to be independent of failures in other zones) imbalance meant adding capacity to one zone could leave another undersupplied. ASG <strong>step size</strong> — adding a fixed number of instances per scaling action — couldn't respond fast enough to 1.1M user/minute growth. And insufficient capacity errors in a zone at peak were unrecoverable: there was simply no spare inventory to allocate on-demand during a national cricket final.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Pre-warming + Custom Scaling Engine</h4>\n<p>Hotstar abandoned AWS ASG for live events entirely. Instead, they <strong>pre-warmed the full expected infrastructure before each match</strong> based on Project HULK predictions, maintaining a 2-million-user capacity buffer at all times. A custom internal scaling engine — driven by <em>request rate and concurrency</em>, not CPU — could spin up new capacity in under 90 seconds. 
A secondary <em>ASG</em> (a backup auto-scaling group kept on standby with a different instance-type mix) served as the fallback if the primary cluster hit hard limits.</p>\n<hr />\n<h3>Result</h3>\n<h4>25.3 Million — Zero Downtime</h4>\n<p>The IND vs NZ semi-final became the largest concurrent live stream in history at the time: <span><strong>25.3 million viewers, zero reported downtime.</strong></span> The platform handled both the spike to peak and the catastrophic drop when Dhoni got out — the homepage API layer, pre-scaled and tested via chaos engineering, absorbed the sudden exit traffic without incident. Hotstar's engineering approach became a canonical reference for scaling live event infrastructure on AWS.</p>\n<hr />\n\n<p>The most consequential engineering decision Hotstar made was building <strong>Project HULK</strong> — an internal load-testing platform with a footprint bigger than most companies' production environments. At full scale, Project HULK deployed <strong>108,000 CPUs, 216 TB of RAM, and 200 Gbps of outbound network</strong> across 8 AWS regions, running geo-distributed load generators to simulate realistic user journeys. It performed four categories of tests: load generation to establish baseline capacity, <em>Tsunami tests</em> that simulated the sudden spike-and-drop profile of a Dhoni innings, <em>chaos engineering</em> (the practice of deliberately injecting failures into a system to discover weaknesses before they manifest in production) to test resilience when an availability zone went down, and ML-driven traffic pattern modelling to predict load curves for upcoming events. 
The insight that emerged was that Hotstar needed to <span>face the real game before the actual game.</span></p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Bandwidth at the Edge of India's Internet</p><p>At 25.3 million concurrent users, Hotstar's bandwidth consumption hit <strong>5.7 Tbps</strong> — approximately 70% of India's total available internet capacity at the time. This is not a platform metric; it is a national infrastructure metric. Hotstar engineers were not just managing application scale; they were managing demand at the level of physical network capacity across an entire country.</p>\n</blockquote>\n\n<blockquote><p>Race against time: +500K growth rate per minute concurrency. Fully baked AMIs: 4 minutes. Application boot-up time: 90 seconds. Reaction time: push notifications.</p><p><em>— Gaurav Kamboj, Cloud Architect at Hotstar — AWS Community Day Bengaluru 2018</em></p></blockquote>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Push Notification Trap</p><p>The marketing team's push notifications — sent to bring users back to the app during exciting match moments — were a <strong>hidden traffic generator</strong> that engineering had no advance warning of. Every notification blast created an immediate, synchronized spike in both video and API traffic. Hotstar had to build a feedback loop where marketing intent was translated into infrastructure capacity decisions before the notification was sent, not after.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>The Graceful Degradation Contract</p><p>When a service hit its capacity limits, Hotstar invoked <em>panic mode</em> (a defined degradation state where non-essential services are deliberately disabled to preserve capacity for the most critical user path — live video delivery). Recommendations, personalization, and social features were the first to shed load. 
<strong>Video streaming was always the last service standing.</strong> The contract was explicit: degrade gracefully rather than fail catastrophically.</p>\n</blockquote>\n\n<p>The <em>Infradashboard</em> — Hotstar's internal capacity planning tool — gave the operations team a real-time view of infrastructure headroom and allowed proactive scale-up decisions hours before a match began. The team maintained a permanent buffer of <strong>2 million concurrent user capacity</strong> above current load at all times during live events. Because AWS ASG couldn't add new nodes fast enough during the actual match, the buffer had to already exist before the first ball was bowled. The combination of pre-warming, buffer maintenance, and custom scaling logic turned a reactive system into a <span>predictive one</span> — and it held at 25.3 million.</p>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Record After Record — The Architecture That Kept Scaling</p><p>The 2019 record of <strong>25.3 million concurrent viewers</strong> was the first proof of concept. By 2023, Hotstar hit <strong>59 million</strong> during the Cricket World Cup final. In 2025, the ICC Champions Trophy Final drew <strong>61 million simultaneous streams</strong> — more than the entire population of Italy, watching live on a single app. Each generation of the architecture was built directly on the engineering lessons of the one before.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>How Hotstar Replaced AWS Autoscaling with a Custom Scaling Engine</h3>\n<p>AWS Auto Scaling Groups were built for the average web application: traffic grows gradually, CPU rises, new instances are added over minutes. Live sports streaming is the opposite. <strong>Traffic can double in 90 seconds</strong> when MS Dhoni appears on screen. The AWS ASG step-size mechanism — adding instances in fixed batches — was simply too slow for this pattern. 
Hotstar's engineering team identified three specific failure modes in ASG behavior during live events: <em>insufficient capacity errors</em> (AWS returning an error when the requested EC2 instance type has no available inventory in a specific availability zone at that moment) during peak demand, <em>availability zone skew</em> (a condition where one AZ accumulates more load than others, exhausting its capacity while other AZs still have headroom) when scaling across AZs, and the lag between a scaling trigger and a serving-ready instance that could run to <strong>4+ minutes</strong> when including AMI bake time and application boot.</p>\n<pre><code class=\"language-python\"># Hotstar's custom scaling logic: conceptual sketch, not Hotstar's actual code\n# Key insight: scale on REQUEST RATE and CONCURRENCY, not CPU utilization\n# The collaborators (traffic model, ASG wrappers, SNS and Lambda clients) are\n# injected, and their method names are illustrative rather than real AWS SDK calls.\n\nclass HotstarScaler:\n    CAPACITY_BUFFER = 2_000_000  # always maintain 2M concurrent user headroom\n    REACTION_TIME_TARGET = 90  # seconds to new serving capacity\n    BOOT_TIME = 75             # seconds for app boot from pre-baked AMI\n\n    def __init__(self, ml_model, primary_asg, secondary_asg,\n                 infrastructure, sns, lambda_handler, active_regions):\n        self.ml_model = ml_model\n        self.primary_asg = primary_asg\n        self.secondary_asg = secondary_asg\n        self.infrastructure = infrastructure\n        self.sns = sns\n        self.lambda_handler = lambda_handler\n        self.active_regions = active_regions\n\n    def compute_required_capacity(self, current_concurrency, request_rate):\n        # Project forward 5 minutes using ML traffic model for this match\n        projected_peak = self.ml_model.predict_peak(\n            current=current_concurrency,\n            rate=request_rate,\n            event_type='cricket_live'\n        )\n        # Add mandatory buffer so headroom never drops below 2M\n        return projected_peak + self.CAPACITY_BUFFER\n\n    def pre_warm_for_event(self, event_metadata):\n        # Called hours before match start to bring up pre-baked AMIs in all regions\n        # Uses HULK traffic model predictions, not current load\n        predicted_peak = self.ml_model.predict_peak_from_event(event_metadata)\n        target_capacity = predicted_peak + self.CAPACITY_BUFFER\n\n        for region in self.active_regions:\n            # Launch PRIMARY ASG with main instance type mix\n            self.primary_asg.scale_to(target_capacity 
* 0.8, region)\n            # Launch SECONDARY ASG with diverse instance types as fallback\n            # Protects against insufficient capacity in any single instance family\n            self.secondary_asg.scale_to(target_capacity * 0.2, region)\n\n    def scale_on_signal(self, concurrency_metric, request_rate):\n        required = self.compute_required_capacity(concurrency_metric, request_rate)\n        current = self.infrastructure.get_active_capacity()\n        if current &lt; required:\n            delta = required - current\n            # SNS alert triggers Lambda to activate secondary ASG immediately\n            self.sns.publish('scale_required', delta=delta)\n            self.lambda_handler.trigger_secondary_asg(delta)</code></pre>\n<blockquote>\n<p><strong>THE SECONDARY ASG PATTERN</strong></p>\n<p>Hotstar ran <strong>two auto-scaling groups simultaneously</strong> for every live event. The primary ASG held the bulk of capacity, pre-warmed before the match. The secondary ASG — with a different mix of instance types — acted as an emergency reserve. An AWS SNS alert triggered a Lambda function to activate the secondary ASG when the primary hit limits. This avoided the single-point failure of depending on one instance family being available in a specific AZ during peak national demand.</p>\n</blockquote>\n\n<ul>\n<li><strong>25.3M</strong> — Peak concurrent viewers on July 9, 2019 — a global record for any streaming platform at the time</li>\n<li><strong>&lt;90s</strong> — Target reaction time to new serving capacity with pre-baked AMIs — vs 4+ minutes with cold AWS ASG provisioning</li>\n<li><strong>2M buffer</strong> — Permanent capacity headroom maintained above current load during all live events — never let it drop below this floor</li>\n<li><strong>0</strong> — Reported downtime during the 25.3M peak — Project HULK's chaos engineering paid off when the moment arrived</li>\n</ul>\n\n<p>The Infradashboard was the operational nerve centre during live events. 
Engineers could see capacity headroom in real time, trigger manual scale-up actions hours in advance of anticipated spikes, and monitor which services were approaching their degradation thresholds. The key operational insight was that live sports infrastructure management begins <strong>the morning of the match, not when the concurrency alert fires</strong>. Pre-warming full capacity ahead of push notifications — the external traffic amplifiers that marketing controlled — meant the infrastructure was already provisioned at near-peak levels before the first viewer joined. <span>The Infradashboard made proactive capacity management a team sport between engineering and marketing.</span></p>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Kubernetes Migration: The Long-Term Fix</p><p>The 2019 architecture was built on raw EC2 instances with a custom autoscaler. The engineering team recognized this was not indefinitely scalable. In 2018, Hotstar had begun migrating to <strong>containerized microservices on a self-managed Kubernetes cluster</strong>, which enabled pod-level scaling in seconds rather than minutes. By 2023, this evolution reached EKS with Data Center Abstraction — the infrastructure that would handle 61 million concurrent viewers in 2025.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Graceful Degradation Protocol</p><p>Hotstar defined a tiered degradation order for when services approached capacity limits. <strong>Tier 1 (shed first)</strong>: recommendations, social features, personalization. <strong>Tier 2</strong>: match statistics, secondary content APIs. <strong>Tier 3 (never shed)</strong>: live video delivery, playback APIs, authentication. 
The protocol was automated and tested via Project HULK's chaos engineering scenarios — so when the real moment came, the system degraded on its own without human intervention.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📡</strong></p>\n<p>The Client-Side Backoff Contract</p><p>When backend latency exceeded thresholds, Hotstar's client applications were programmed to <strong>increase the interval between retry requests</strong> — backing off rather than hammering the server. This client-side behaviour was the last line of defense: when 25 million devices all experience a glitch simultaneously, the difference between a recoverable spike and a cascading failure can come down to whether the clients know to wait before retrying.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Hotstar's architecture in 2019 was built for one governing constraint: <strong>traffic does not arrive smoothly</strong>. A cricket match involving India can go from zero to 10 million concurrent viewers in under ten minutes — faster than any traditional auto-scaling system can respond. The platform ran on AWS EC2 across multiple regions, with Akamai as the <em>CDN</em> (Content Delivery Network — a distributed network of edge servers that caches video segments and static assets close to users, absorbing the majority of load before it reaches origin servers) delivering video segments. Apache Kafka ingested 10 billion-plus clickstream events per match day. Microservices handled video playback, match statistics, personalization, and authentication independently — each able to be scaled or degraded without affecting the others. 
The three-layer architecture — CDN edge, application tier, data tier — had to be specifically engineered so that no layer became a bottleneck at the velocity of a cricket crowd.</p>\n<h3>Hotstar 2019 architecture — the path from a viewer's phone to live video, and where traffic concentrations hit</h3>\n<p><a href=\"https://techlogstack.com/explore/hotstar-ipl-scaling-2019/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>PROJECT HULK: THE PRODUCTION REHEARSAL</strong></p>\n<p>Project HULK was not a staging environment — it was a <strong>separate production-scale load testing infrastructure</strong> that ran geo-distributed tests from 8 AWS regions simultaneously. Its load generation cluster alone used <strong>c5.9xlarge instances</strong> (36 vCPUs, 72 GB RAM each) to generate realistic concurrent user traffic. Simulations ran four test types: baseline load, <em>tsunami testing</em> (a stress test pattern that simulates sudden extreme spikes and drops in traffic, matching the pattern of a cricket match when a wicket falls), chaos engineering with AZ failures, and ML-trained traffic pattern replay of previous match profiles. The goal: there should be no mode of failure in production that HULK has not already triggered in a test.</p>\n</blockquote>\n\n<h3>Project HULK — load testing architecture that simulated the real match before it happened</h3>\n<p><a href=\"https://techlogstack.com/explore/hotstar-ipl-scaling-2019/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Bandwidth Ceiling No One Talks About</p><p>At 5.7 Tbps peak, Hotstar was approaching a <strong>hard physical limit</strong> — not a software limit. 
Adding more servers would not help if India's total internet capacity couldn't carry the bytes. This ceiling forced Hotstar's engineers to think about CDN efficiency and <em>adaptive bitrate streaming</em> (a technology that dynamically switches video quality based on the viewer's network conditions, delivering the best quality the connection can support at any moment) not just as user experience decisions, but as existential infrastructure constraints.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🌐</strong></p>\n<p>The Microservices Migration That Made It Possible</p><p>Hotstar migrated from a monolith to microservices in <strong>2018</strong> — just one year before the 25.3M record. This was the architectural prerequisite for everything else: without independent service scaling, graceful degradation tiers, and per-service capacity controls, the custom autoscaling and Infradashboard capabilities described here would have been impossible to build. You cannot surgically shed load from a monolith.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Hotstar's 2019 story is not about surviving a traffic spike — it is about <strong>re-designing the relationship between infrastructure and time</strong>. The core lesson is that reactive systems cannot serve live events. When your traffic can double in 90 seconds, you need infrastructure that is already there. Every engineering decision Hotstar made — Project HULK, pre-warming, custom ASG, the buffer, graceful degradation — was aimed at the same goal: transforming a system that responds to load into one that <em>anticipates</em> it.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>For live events, autoscaling solves the wrong problem.</strong> AWS ASG is designed for gradual load growth — it cannot provision capacity at the rate that 1.1 million users per minute demands. 
The correct model for <em>predictable traffic spikes</em> (load patterns where the timing and approximate magnitude of peak traffic is known in advance — like scheduled live events) is <em>pre-warming</em> based on predicted load, not reactive scaling based on current metrics. If you have a scheduled high-traffic event, provision for it before it starts.</li><li><span>02</span><div><strong>Scale on the metrics that reflect user experience, not server health.</strong> Hotstar rejected CPU and memory utilization as scaling signals in favor of <strong>request rate and concurrency</strong> — the metrics that directly reflect how many users the system is currently serving. Build your autoscaler around the business-level constraint, not the infrastructure-level symptom. A server at 30% CPU can still be serving users who are getting a degraded experience.</li><li><span>03</span><div><strong>Test the exit, not just the entry.</strong> Every load test rehearses the traffic spike. Almost none rehearse the drop. When <strong>24 million users hit back simultaneously</strong>, the homepage and recommendation APIs absorbed a wave that was entirely different in character from streaming traffic. Hotstar's <em>tsunami testing</em> (load test pattern that simulates sudden, extreme traffic spikes followed by equally extreme drops — named for the wave that recedes before it strikes) explicitly rehearsed both the spike and the collapse. Design graceful degradation tiers, and test that they activate correctly.</li><li><span>04</span><div><strong>Marketing is an unplanned infrastructure event.</strong> Push notifications sent at peak match moments created synchronized traffic spikes that engineering had no advance knowledge of. Build a feedback loop where <strong>marketing decisions trigger infrastructure responses</strong> before the notification is sent, not after. 
The Infradashboard gave engineers visibility; the next step is giving marketing a capacity-aware interface for campaign timing.</li><li><span>05</span><div><strong>A capacity buffer is not waste — it is the cost of reliability for live events.</strong> Hotstar maintained a permanent 2-million-user capacity buffer above current load throughout every match. This headroom meant that unexpected spikes — a viral moment, an unexpected partnership, a push notification — could be absorbed without triggering a scaling event. <span>For live events, the cost of over-provisioning is always lower than the cost of the three minutes when it fails.</span></li></ol>\n<blockquote>\n<p><strong>THE DHONI EFFECT — AND WHAT CAME NEXT</strong></p>\n<p>The 2019 record of <strong>25.3 million</strong> concurrent viewers held until 2023, when Hotstar hit <strong>59 million</strong> during the Cricket World Cup — then broke its own record again in 2025 with <strong>61 million</strong> during the ICC Champions Trophy Final. Each leap required a new generation of architecture: from EC2 to Kubernetes, from Kubernetes to EKS, from EKS to Data Center Abstraction with Envoy-based gateways. The 2019 engineering story was chapter one.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🌍</strong></p>\n<p>The Template That Changed Streaming Architecture</p><p>Hotstar's 2019 engineering approach — pre-warming, custom autoscaling based on concurrency, graceful degradation tiers, and game-day chaos testing — became a reference architecture cited in AWS re:Invent talks and engineering conference circuits worldwide. <strong>The project that started as a solution to one cricket match became the blueprint for live-event streaming infrastructure globally.</strong></p>\n</blockquote>\n\n<blockquote><p>One cricket match used a quarter of India's internet. 
The engineering team called that a success.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/hotstar-ipl-scaling-2019/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.089953+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Live Streaming", "Hotstar"]}, {"id": "https://techlogstack.com/explore/github-mysql-8-upgrade-2023/", "url": "https://techlogstack.com/explore/github-mysql-8-upgrade-2023/", "title": "How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query", "summary": "GitHub's year-long mission to upgrade 1,200+ MySQL hosts from 5.7 to 8.0 while maintaining 5.5 million queries per second and keeping rollback available at every ste", "content_html": "<p><strong>GitHub</strong> · Databases · 17 May 2026</p>\n<p>MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. 
Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.</p>\n<ul>\n<li>1,200+ MySQL hosts upgraded</li><li>300+ TB of data migrated</li><li>5.5M queries/sec maintained</li><li>&gt;1 year of planning and execution</li><li>50+ clusters upgraded with zero downtime</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<ul>\n<li><strong>1,200+</strong> — MySQL hosts across Azure Virtual Machines and bare-metal data center hardware — each needing individual upgrade without disturbing its neighbors</li>\n<li><strong>300+ TB</strong> — Relational data stored across 50+ clusters, sharded both horizontally and vertically using Vitess for GitHub's highest-traffic product domains</li>\n<li><strong>5.5M QPS</strong> — Queries per second sustained throughout the entire year-long upgrade — the SLO target that could not slip during any single cluster promotion</li>\n<li><strong>&gt;1 year</strong> — Total duration from preparation start in July 2022 through final cluster upgrades — a timeline that reflects the discipline of doing this safely, not slowly</li>\n</ul>\n\n<p>GitHub started as a Ruby on Rails application with a single MySQL database over 15 years ago. Since then, MySQL had become the foundation of everything GitHub stores: repositories, pull requests, issues, code review comments, user accounts, billing data, and the entire social graph of 100 million developers. 
By 2022, <em>MySQL 5.7</em> (the production MySQL version GitHub had been running for years) was approaching end-of-life: Oracle had announced that support, including security patches and bug fixes, would end in October 2023. The GitHub database team made a simple calculation: <strong>stop receiving security patches on the database that holds every line of code pushed to GitHub, or upgrade</strong>. The only real question was how to upgrade 1,200 hosts, 300+ TB of data, and 5.5 million queries per second without disrupting a single user-visible transaction.</p>\n<p>Preparation began in July 2022 — a full year before any production host was promoted to 8.0. The team added MySQL 8.0 to <em>CI</em> (Continuous Integration — the automated system that runs tests against every code change before it merges, ensuring the codebase is always in a shippable state) for all applications using MySQL, running 5.7 and 8.0 side-by-side to catch regressions early. They built MySQL 8.0 <em>Codespaces</em> (GitHub's cloud development environment that spins up isolated VM workspaces for debugging, allowing engineers to test against specific MySQL versions without affecting production) debug containers so developers could test their queries against the new version. They created an internal GitHub Project board to track every cluster's upgrade status across the entire fleet. And they did all of this <strong>before upgrading a single production host</strong>. The discipline of the preparation phase is what made the execution phase look routine.</p>\n<blockquote>\n<p><strong>THE HIDDEN BREAKING CHANGE</strong></p>\n<p>MySQL 8.0 changes the <strong>default character set to utf8mb4</strong> and its default collation to <code>utf8mb4_0900_ai_ci</code>, a collation based on Unicode 9.0 that MySQL 5.7 does not support. 
This created a problem: when an 8.0 primary replicates writes to a 5.7 replica, the collation metadata in the <em>binary log</em> (the record MySQL maintains of every data modification, used to replicate changes to replica hosts and to reconstruct data state for point-in-time recovery) can cause replication to break entirely on the downstream 5.7 nodes. GitHub's rollback strategy depended on maintaining backward replication from 8.0 to 5.7 — so this had to be solved before a single production primary was promoted.</p>\n</blockquote>\n\n<h3>The 5-Step Upgrade Playbook</h3>\n\n<h3>Problem</h3>\n<h4>MySQL 5.7 Hits End-of-Life</h4>\n<p>Oracle announced MySQL 5.7 end-of-life for October 2023, cutting off security patches and bug fixes. GitHub's 1,200+ host fleet running at 5.5M QPS could not safely continue on an unsupported database version. The challenge was executing a major version upgrade across a mixed fleet of Azure VMs and bare-metal hosts without a maintenance window or service disruption.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Backward Replication Incompatibilities</h4>\n<p>Testing revealed two breaking changes: MySQL 8.0's new default <em>utf8mb4_0900_ai_ci collation</em> (a Unicode 9.0 character sorting specification supported in MySQL 8.0 but absent from 5.7, causing replication to break when an 8.0 primary writes to a 5.7 replica) broke downstream 5.7 replicas, and the new MySQL 8.0 roles feature caused permission-expansion scripts to generate 8.0-syntax statements that 5.7 replicas could not parse. 
Both had to be patched before any primary promotion.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Rolling Replica Upgrades + Dual Replication Chains</h4>\n<p>GitHub built a 5-step playbook: upgrade replicas one data center at a time, reconfigure the <em>replication topology</em> (the tree of primary and replica MySQL hosts through which write changes propagate — the primary receives writes and replicas receive a stream of changes to stay in sync) to create parallel 5.7 and 8.0 chains, promote an 8.0 host to primary via graceful failover, keep 5.7 standbys ready for rollback, then clean up after 24 hours of successful traffic.</p>\n<hr />\n<h3>Result</h3>\n<h4>100% Fleet Upgraded, Zero SLO Violations</h4>\n<p>Every cluster upgraded without a single SLO violation. The rollback path was preserved throughout the entire year-long process — a 5.7 standby was always available. The project delivered not just the MySQL 8.0 upgrade but a repeatable automation framework for future major version upgrades, so the next one will be faster.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Vitess Complication: Version Advertisement</p><p><em>Vitess</em> (an open-source MySQL sharding layer, originally built at YouTube, that GitHub uses for its highest-traffic product domains) adds an extra layer of complexity: its proxy component <em>VTGate</em> (Vitess's query router that intercepts MySQL connections and directs them to the correct shard) advertises the MySQL version to client applications. One Java client was checking the advertised version to decide whether to disable the MySQL query cache — a feature that was <strong>completely removed in 8.0</strong>. As soon as even one shard in a Vitess keyspace was upgraded, VTGate's version advertisement had to be updated, otherwise the Java client would generate blocking errors. 
Timing the VTGate version bump to coincide exactly with the first shard promotion became a critical coordination step.</p>\n</blockquote>\n\n<blockquote><p>Upgrading the fleet with no impact to our Service Level Objectives (SLO) was no small feat — planning, testing and the upgrade itself took over a year and collaboration across multiple teams within GitHub.</p><p><em>— Jiaqi Liu, Daniel Rogart, Xin Wu — via GitHub Engineering Blog</em></p></blockquote>\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>GitHub's engineers discovered a replication bug in MySQL 8.0 that only manifested under <strong>intensive load over long periods</strong> — a host could eventually run out of commit-order sequence numbers and stall. The bug had been patched in MySQL 8.0.28. This meant GitHub had to ensure all hosts were on 8.0.28 or later before any long-running cluster was considered safe, adding a version-pinning requirement to an already complex upgrade matrix.</p>\n</blockquote>\n\n<p>The upgrade process for each cluster was designed to preserve the rollback option at every single step. Promoting an 8.0 replica to primary was never an irreversible action until after 24 hours of clean traffic had confirmed success. During the brief window of dual replication chains, GitHub maintained a set of <strong>offline 5.7 replicas</strong> specifically for rollback — not serving traffic, not receiving new promotion candidates, just sitting ready. <em>Orchestrator</em> (GitHub's open-source MySQL topology management tool that handles automated failover, replication topology visualization, and candidate promotion) was configured to blacklist all 5.7 hosts as failover candidates during this window, preventing an automated failover from accidentally rolling back to 5.7 during an unplanned outage. 
The architecture of the rollback path was as carefully designed as the architecture of the upgrade path itself.</p>\n<blockquote>\n<p><strong>🔧</strong></p>\n<p>gh-ost: Schema Changes Without Table Locks</p><p>GitHub's in-house tool <strong>gh-ost</strong> (GitHub Online Schema Migrations) was a critical part of the upgrade preparation. It enabled schema changes required for MySQL 8.0 compatibility to be applied to production tables without locking them — essential when those tables receive millions of queries per second. Without gh-ost, applying schema changes to GitHub's largest tables would have required multi-hour maintenance windows that users would have noticed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>What MySQL 8.0 Actually Unlocked</p><p>Beyond escaping end-of-life, MySQL 8.0 delivered features GitHub's database team genuinely wanted. <strong>Instant DDLs</strong> allow many schema changes to be applied without rebuilding the entire table — critical for a 300+ TB fleet where traditional ALTER TABLE could take hours. <strong>Invisible indexes</strong> let engineers create an index, test it under production traffic without it being used by the query planner, and only then make it active — dramatically safer index deployment. <strong>Compressed binary logs</strong> reduce replication bandwidth between primary and replicas, a meaningful saving at 5.5M queries per second.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Engineering the Rollback Path</h3>\n<p>The hardest technical problem in this upgrade was not moving forward — it was preserving the ability to move backward. MySQL officially supports replication from a lower version to the next higher version but <strong>does not support reverse replication</strong> from 8.0 down to 5.7. When GitHub tested this in staging, promoting an 8.0 host to primary caused replication to break on all downstream 5.7 replicas immediately. 
Two root causes: MySQL 8.0's new default <em>collation</em> (a set of rules that determines how character strings are compared and sorted; different collations can produce different sort orders for the same strings) <code>utf8mb4_0900_ai_ci</code> was not recognized by 5.7's replication parser, and MySQL 8.0's new ROLE management syntax generated statements in the <em>binary log</em> (the sequential log of all data-modifying SQL statements that MySQL writes to enable replication and point-in-time recovery) that 5.7 could not execute. Both required surgical fixes before any production promotion could proceed.</p>\n<pre><code class=\"language-sql\">-- The collation incompatibility fix:\n-- MySQL 8.0 defaults to utf8mb4_0900_ai_ci (Unicode 9.0)\n-- MySQL 5.7 only supports up to utf8mb4_unicode_520_ci\n-- Fix: explicitly set database/table collations to a 5.7-compatible value\n\n-- On the 8.0 primary, before promotion:\nALTER DATABASE github_production\n  CHARACTER SET utf8\n  COLLATE utf8_unicode_ci;  -- 5.7-compatible collation\n\n-- Verify that new tables inherit the correct collation\nSHOW CREATE TABLE repositories\\G\n-- Should show utf8_unicode_ci, NOT utf8mb4_0900_ai_ci\n\n-- Confirm replication is running on downstream 5.7 replicas\n-- after a test write to ensure no Seconds_Behind_Master growth\nSHOW SLAVE STATUS\\G\n-- Expected: Seconds_Behind_Master: 0\n--           Slave_SQL_Running: Yes\n--           Last_Error: (empty)\n\n-- The roles fix: temporarily strip role-expansion from permission grants\n-- during the upgrade window so no ROLE syntax appears in the binlog</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Dual Replication Chain Architecture</p><p>During the critical promotion window, GitHub maintained <strong>two parallel replication chains</strong> downstream of a single 8.0 replica: one chain of offline 5.7 standbys ready for rollback, and one chain of serving 8.0 replicas handling production traffic. 
This dual-chain state lasted only hours per cluster — long enough to confirm 8.0 health before decommissioning the 5.7 standby chain. The temporary cost: double the replica infrastructure per cluster during the promotion window.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ORCHESTRATOR: PREVENTING ACCIDENTAL ROLLBACK</strong></p>\n<p><em>Orchestrator</em> (an open-source MySQL high-availability tool co-created by GitHub that manages replication topology and automated failover) is configured to make automated failover decisions when a primary fails. During the upgrade, GitHub added an explicit blacklist of all 5.7 hosts as failover candidates. Without this, an unplanned primary failure during the upgrade window could have caused Orchestrator to promote a 5.7 host as the new primary — <strong>an automated rollback that would undo hours of upgrade work</strong> and potentially confuse application behavior with a sudden version downgrade. The blacklist was the safety guard against automation working against the upgrade.</p>\n</blockquote>\n\n<p>After each primary promotion, GitHub's policy required at least <strong>one complete 24-hour traffic cycle</strong> before declaring a cluster successfully upgraded and decommissioning the 5.7 standby chain. This was not arbitrary — GitHub's traffic has strong diurnal patterns, with dramatically different load profiles between business-hours peak traffic and overnight lows. A cluster that behaved well during off-peak hours might reveal latency regressions during the morning rush of developers opening pull requests in Europe and North America. 
The 24-hour window caught several edge cases in early clusters that were fixed before the team moved to the next one.</p>\n<div><table><caption>GitHub's 5-Step MySQL 8.0 Upgrade Playbook Per Cluster</caption><thead><tr><th>Step</th><th>Action</th><th>Rollback Available?</th></tr></thead><tbody><tr><td>1</td><td>Upgrade replicas one DC at a time; route read traffic to 8.0 replicas</td><td>Yes — disable 8.0 replicas, re-enable 5.7</td></tr><tr><td>2</td><td>Reconfigure topology: split into dual 8.0 and 5.7 replication chains</td><td>Yes — fail back to 5.7 chain</td></tr><tr><td>3</td><td>Promote 8.0 replica to primary via Orchestrator graceful failover</td><td>Yes — 5.7 chain still in sync</td></tr><tr><td>4</td><td>Monitor for 24 hours of complete traffic cycle at full load</td><td>Yes — promote 5.7 standby if needed</td></tr><tr><td>5</td><td>Decommission 5.7 standbys after 24h success confirmation</td><td>No — rollback window closed</td></tr></tbody></table></div>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Automation Investment That Pays Forward</p><p>GitHub's database team explicitly designed this upgrade to produce a <strong>reusable automation framework</strong> for future major MySQL versions. The tooling for mixed-version CI, the dual-chain promotion scripts, the rollback procedures, the checklist issue templates — all of it was built as a library, not a one-time script. When MySQL 9.0 eventually needs to be adopted, the playbook already exists. The year of effort became infrastructure.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🏷️</strong></p>\n<p>The Mixed-Version CI Safety Net</p><p>Running MySQL 5.7 and 8.0 <strong>side-by-side in CI for all applications throughout the entire year-long upgrade</strong> was the single most important safety investment GitHub made. 
Application teams discovered query incompatibilities, deprecated feature usage, and reserved keyword conflicts in automated tests rather than in production promotions. This meant by the time each cluster was promoted, the application code was already known-compatible — the upgrade was validating infrastructure, not discovering application bugs.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>GitHub's MySQL fleet is not a single cluster — it's a network of over 50 independent clusters, each serving a specific product domain (repositories, issues, pull requests, billing, etc.), with larger domains horizontally sharded via <em>Vitess</em>. Each cluster has its own primary-replica topology. The upgrade had to be executed independently per cluster, each following the same 5-step playbook, with the dual replication chain state existing only during the transition window. Understanding this topology is essential to understanding why the upgrade took a year and why that was the right timeline, not the wrong one.</p>\n<h3>During Upgrade: Dual Replication Chain Topology (Transition State)</h3>\n<p><a href=\"https://techlogstack.com/explore/github-mysql-8-upgrade-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Fully Upgraded Cluster (5.7 Standbys Decommissioned)</h3>\n<p><a href=\"https://techlogstack.com/explore/github-mysql-8-upgrade-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE TOOLING ECOSYSTEM</strong></p>\n<p>GitHub's MySQL reliability depends on a suite of open-source and in-house tools: <strong>Orchestrator</strong> manages replication topology and automated failover; <strong>gh-ost</strong> applies online schema changes without table locks; <strong>freno</strong> throttles schema migration speed 
based on replica lag to prevent migrations from disrupting production reads; and <strong>Percona Toolkit</strong> provides checksumming and replication verification. Without this ecosystem, the year-long upgrade would have required dozens of maintenance windows instead of zero.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Vitess Sharding: One Shard at a Time</p><p>GitHub's Vitess clusters required upgrading one shard at a time rather than one cluster at a time, adding an inner loop to the upgrade playbook. For each sharded keyspace, the VTgate version advertisement had to be updated <strong>immediately after the first shard was promoted</strong> to 8.0 — otherwise client applications checking the advertised version would behave incorrectly. This timing constraint added coordination overhead but was resolved with explicit upgrade checklist items per keyspace.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📋</strong></p>\n<p>GitHub Projects as the Upgrade Control Plane</p><p>GitHub used its own <strong>GitHub Projects</strong> tool to build a rolling calendar that tracked every cluster's upgrade status, scheduled upcoming cluster promotions, and coordinated between the database team and application teams. Issue templates gave application teams a standardized checklist for validating their service before and after each promotion. Meta-note: GitHub building GitHub with GitHub to upgrade GitHub's database is either very on-brand or very circular, depending on your disposition.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>GitHub's MySQL 8.0 upgrade is one of the cleanest examples of large-scale infrastructure migration executed with discipline. 
The lessons here are as much about process and architecture as they are about database mechanics.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Design for rollback before you design for progress.</strong> GitHub's upgrade strategy was architected around the constraint that rollback must be available at every step until the 24-hour validation window closed. The dual replication chain architecture, the Orchestrator blacklisting of 5.7 failover candidates, the parallel standby maintenance — all of it was overhead deliberately accepted to preserve the ability to undo. That safety margin is what allowed the team to execute confidently.</li><li><span>02</span><div><em>Binary log</em> (the sequential write-ahead log that MySQL uses to record all data changes for replication purposes) compatibility between versions is a hidden attack surface in any major database upgrade. Always test reverse replication in staging — not just forward replication — before committing to a production upgrade strategy. GitHub discovered the collation and roles incompatibilities in staging, which is exactly the right time to find them.</li><li><span>03</span><div><strong>Run a complete 24-hour traffic cycle before decommissioning your rollback infrastructure.</strong> One cluster isn't 'done' after the primary promotion completes successfully. GitHub's requirement for a full 24-hour window before removing 5.7 standbys caught edge cases during peak traffic that weren't visible during off-peak hours. Don't close the escape hatch until you've seen the full traffic profile.</li><li><span>04</span><div>Build your upgrade automation as a <strong>reusable library, not a one-time script</strong>. GitHub's database team explicitly designed the tooling, templates, playbooks, and automation from this project as the foundation for the next major version upgrade. 
The year of effort becomes infrastructure that compounds in value over time — every future upgrade starts from a much higher base.</li><li><span>05</span><div><em>Orchestrator</em> (GitHub's open-source MySQL topology manager) and equivalent automation tools can work against you during a migration if not explicitly constrained. Blacklisting 5.7 hosts as failover candidates during the upgrade window was a critical safety measure. Any automated system that could undo your migration work must be told, explicitly, not to. Never assume automation understands your maintenance window.</li></ol>\n<blockquote>\n<p><strong>THE END-OF-LIFE FORCING FUNCTION</strong></p>\n<p>MySQL 5.7's end-of-life announcement was the external forcing function that gave GitHub's database team the organizational priority to execute this migration. <strong>Security patch cutoffs are one of the most effective levers for getting cross-team infrastructure migrations approved and resourced.</strong> If your team has been deferring a major version upgrade, check when your current version's security support ends — it may already be overdue.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Replication Bug in MySQL 8.0 Before 8.0.28</p><p>GitHub's testing surfaced a MySQL bug where <strong>replica_preserve_commit_order</strong> under intensive load could cause a host to exhaust commit-order sequence numbers and stall replication. The fix shipped in 8.0.28. This meant every host had to run 8.0.28 or later — adding a version-pinning constraint to an already complex upgrade matrix. 
Moral: always scan the target version's release notes for known bugs before committing to that specific build in production.</p>\n</blockquote>\n\n<blockquote><p>They upgraded 1,200 database hosts without a single user noticing — which means they either did extraordinary engineering or extraordinary documentation, and based on the blog post, it was both.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/github-mysql-8-upgrade-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.176139+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "GitHub"]}, {"id": "https://techlogstack.com/explore/slack-cellular-architecture-2023/", "url": "https://techlogstack.com/explore/slack-cellular-architecture-2023/", "title": "Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes", "summary": "How a 2021 availability zone outage led Slack to spend 1.5 years migrating to a cellular architecture with an AZ drain capability that works in under 5 minutes.", "content_html": "<p><strong>Slack</strong> · Reliability · 17 May 2026</p>\n<p>On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? 
The answer drove 18 months of architecture work.</p>\n<ul>\n<li>1.5 years of migration time</li><li>99.99% SLA maintained</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Slack runs most of its core infrastructure in the AWS <strong>us-east-1 region</strong> across multiple <em>availability zones</em>, or AZs (isolated data centers within the same geographic region, designed so that a failure in one AZ does not affect others — each AZ has independent power, cooling, and networking). The cloud infrastructure guarantee is clear: AZs should provide failure isolation. A problem in one AZ should not cascade to others. On June 30, 2021, at 11:45am PDT, an intermittent fault developed in a network link connecting one AZ to its neighbors. From a physical hardware perspective, this was an unremarkable incident — a flaky network link that was automatically removed from service at 12:33pm, restoring full connectivity 48 minutes after it first showed symptoms. What was remarkable was that <strong>Slack's users felt it at all</strong>.</p>\n<blockquote><p>We were led to wonder why, in fact, this outage was visible to our users at all. Slack operates a global, multi-regional edge network, but most of our core computational infrastructure resides in multiple Availability Zones within a single region, us-east-1.</p><p><em>— Slack Engineering, via Slack's Migration to a Cellular Architecture blog post</em></p></blockquote>\n<p>The answer was <em>gray failure</em> (a failure mode where different components have different views of system availability — some servers see one AZ as fully available while servers in that AZ see the others as unavailable, creating an inconsistent state that is much harder to detect and respond to than a clean hard failure). 
When the network link became intermittent, Slack's systems within the impacted AZ believed they had full connectivity to everything inside that AZ. Systems outside the AZ saw it as unavailable. Even clients within the same AZ had inconsistent views depending on whether their specific network flow traversed the failed equipment. This partial, view-dependent failure was <span>far harder to detect and respond to than a clean hard failure</span>. No single alert could capture it. No automated remediation was precise enough to act on it. The answer, the team concluded, was not to solve automated remediation of gray failures — it was to make the computers' job easier by relying on human judgment.</p>\n<blockquote>\n<p><strong>THE BUTTON WE NEEDED</strong></p>\n<p>During the June 2021 incident, engineers monitoring the outage could see clearly on their dashboards that <strong>one AZ was the problem</strong> — nearly every graph segmented by target AZ told the same story. If there had been a button to tell all systems 'this AZ is bad; avoid it,' they would have pressed it immediately. So the goal became: build that button. Design requirements: drain an AZ within <strong>5 minutes</strong> with no user-visible errors, operable from outside the affected AZ itself.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>June 2021: AZ Outage Reaches Users</h4>\n<p>A network link connecting one AWS AZ to the others experienced intermittent faults for 48 minutes. Despite Slack running in multiple AZs, users experienced degraded service — because Slack's core infrastructure was monolithically distributed across AZs without AZ-aware traffic isolation. 
No single switch could route traffic away from the affected AZ.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Gray Failures Don't Respect AZ Boundaries</h4>\n<p><em>Gray failures</em> (partial failures where different components have inconsistent views of system availability, making it impossible to detect or respond to them with simple binary health checks) are uniquely dangerous in multi-AZ architectures. When a network link is intermittently faulty rather than fully down, whether a request fails depends on which specific flow traverses the bad equipment. Automated health checks often cannot detect this — they pass most of the time and fail occasionally, making the system appear healthy in aggregated metrics.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Cellular Architecture + AZ Drain Button</h4>\n<p>Slack spent 1.5 years migrating its most critical user-facing services to a cell-based architecture — with 3-4 independent instances of each service, one per AZ. An AZ drain button was built that, when pressed by an operator, reroutes all traffic away from the targeted AZ within 5 minutes. The drain mechanism was designed to operate from <strong>outside</strong> the affected AZ so it remains usable even when the AZ's own control plane is degraded.</p>\n<hr />\n<h3>Result</h3>\n<h4>&lt;5 Minutes to Safety</h4>\n<p>An AZ failure that previously caused 48 minutes of user impact can now be mitigated in under 5 minutes via the drain button. Under Slack's 99.99% availability SLA (less than 1 hour of downtime per year), a 5-minute mitigation is operationally viable; a 48-minute one is not. The cellular architecture also brought independent deployment, testing, and monitoring for each cell.</p>\n<hr />\n\n<p>The cellular architecture migration forced Slack to make hard decisions about which services could be siloed cleanly and which could not. The key dividing line was <strong>statefulness</strong>. 
Stateless services — those that hold no long-lived data and process requests independently — are natural candidates for full siloing: run 3-4 independent instances, one per AZ, and route requests to the closest healthy instance. Stateful services — those that are the system of record for data — are harder. Distributing state across cells introduces consistency challenges under <em>CAP theorem</em> (the theoretical result stating that a distributed data store can provide at most two of: Consistency (all nodes return the same data), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions)) tradeoffs. Slack's team used CAP theorem analysis as a principled framework for categorizing each service and selecting the appropriate cell architecture for it.</p>\n<blockquote>\n<p><strong>🔴</strong></p>\n<p>Slack's AZ drain button has a design requirement that many engineers overlook: it must <strong>not rely on the impacted AZ</strong> to function. During large network outages, SSH-ing into servers in the affected AZ to make them 'lame-duck' themselves is unreliable. The drain mechanism works from outside — using control plane infrastructure that is deliberately hosted in AZs other than the one being drained.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Incremental Traffic Recovery: 1% at a Time</p><p>The drain capability is bidirectional — Slack can also <strong>gradually reintroduce traffic to a recovering AZ</strong> rather than dumping all traffic back at once. Starting at 1% and monitoring for errors before increasing gives engineers confidence that the recovery is real before exposing the full user base. 
This incremental re-introduction is as important as the drain itself: a hard full restore after an AZ incident often triggers a second cascade as cold caches and restarted services encounter sudden full-load.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📐</strong></p>\n<p>Why Not Automatic AZ Failover?</p><p>Automatic remediation of gray failures is technically hard because the signal is ambiguous — partial connectivity, inconsistent views, intermittent errors that don't trigger clean alert thresholds. Slack's architects chose to rely on <strong>human judgment for the drain decision</strong> while making the execution automated and fast. An operator who can see the dashboards and understand that 'one AZ is bad' is a more reliable detector than any automated system for this class of failure.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The 99.99% SLA Math</p><p>Slack's 99.99% availability SLA means <strong>less than 1 hour of downtime per year</strong> is tolerable. The June 2021 AZ incident lasted 48 minutes — nearly the entire annual budget in a single event. A second incident of similar duration would breach the SLA. The cellular architecture and AZ drain button are not aspirational reliability improvements; they are the technical prerequisites for Slack to honor its contractual commitments to enterprise customers.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🏗️</strong></p>\n<p>Shipping Deep Changes Across Connected Services</p><p>One of the most underappreciated challenges of the cellular migration was <strong>coordinating changes across services that have live dependencies on each other</strong>. Converting Service A to a cellular model while Service B still calls it monolithically requires careful sequencing and temporary compatibility shims. 
Slack's engineering team wrote extensively about the 'ship the change' problem: making sweeping architectural changes to a live system without disrupting the engineers working on it daily.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>WHAT 'GOOD ENOUGH' ENABLED</strong></p>\n<p>Slack's architecture team made an explicit decision to embrace a 'good enough' cellular model rather than pursuing perfect cell isolation. Some services couldn't be fully siloed without years of additional work. A cell with 80% AZ isolation that can be built in 6 months is more valuable than a perfectly isolated cell that requires 3 years. The pragmatic threshold — drain within 5 minutes with no user errors — guided every architecture decision.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>18 Months of Architecture Work for a 5-Minute Fix</h3>\n<p>The cellular architecture migration is notable not just for what it produced but for how long it took. <strong>1.5 years</strong> of engineering effort across dozens of services, with careful sequencing to avoid disrupting a platform that millions of professionals depend on every day. The team decomposed the problem by service type, migrated services incrementally starting with those most amenable to siloing, and built the AZ drain infrastructure before migrating all services to depend on it. 
The project combined the operational discipline of a database migration with the architectural ambition of a complete infrastructure overhaul.</p>\n<ul>\n<li><strong>&lt;5 min</strong> — Time to drain all traffic from a failing AZ using the drain button — versus ~48 minutes of user impact in the June 2021 AZ incident that triggered this work</li>\n<li><strong>1.5 years</strong> — Duration of the cellular architecture migration — reflecting the complexity of safely rearchitecting infrastructure serving millions of daily active users</li>\n<li><strong>3–4 cells</strong> — Independent instances of each critical service — one per AZ — providing fault isolation so a single AZ failure affects at most 25–33% of requests before drain</li>\n<li><strong>99.99%</strong> — Slack's SLA availability target — less than 1 hour total downtime per year — the business requirement that made sub-5-minute AZ mitigation a hard engineering constraint</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified AZ drain logic (conceptual)\n# Real implementation uses load balancer weight APIs and health check manipulation.\n# lb_api, consul_api, critical_services, all_azs, and the helper methods called\n# below are illustrative placeholders assumed to be set up elsewhere.\n\nclass AZDrainButton:\n    def drain_az(self, target_az: str):\n        \"\"\"Drain all traffic from target_az within 5 minutes.\n        Operable from any AZ; does not rely on the target_az control plane.\"\"\"\n\n        # Step 1: Update load balancer weights to 0 for target_az\n        # Uses the cloud provider API, which operates outside the AZ itself\n        # Weight 0 means no new traffic; existing connections drain naturally\n        self._set_az_weight(target_az, 0)\n\n        # Step 2: Update internal service discovery to prefer other AZs\n        self.consul_api.set_az_preference(\n            preferred_azs=[az for az in self.all_azs if az != target_az],\n            avoid_az=target_az\n        )\n\n        # Step 3: Monitor drain progress; connections should complete within 5 min\n        return self.monitor_drain_progress(target_az, timeout_minutes=5)\n\n    def gradual_restore(self, target_az: str, start_pct: float = 0.01):\n        \"\"\"Incrementally restore traffic to a recovering AZ starting at 1%.\n        Returns True once the AZ is back at 100%, False if the restore was aborted.\"\"\"\n        current_pct = start_pct\n        while True:\n            self._set_az_weight(target_az, current_pct)\n            if not self.error_rate_acceptable(target_az):\n                self._set_az_weight(target_az, 0)  # back to zero\n                return False\n            if current_pct &gt;= 1.0:\n                return True  # fully restored; stop looping\n            current_pct = min(current_pct * 2, 1.0)  # double until 100%\n\n    def _set_az_weight(self, az: str, weight: float):\n        \"\"\"Apply the same weight to every critical service in the given AZ.\"\"\"\n        for service in self.critical_services:\n            self.lb_api.set_weight(service=service, az=az, weight=weight)</code></pre>\n<blockquote>\n<p><strong>STATEFUL VS STATELESS CELL STRATEGY</strong></p>\n<p>Slack's cellular migration required a principled decision for each service: can this service be independently siloed per AZ? Stateless services (no persistent data) are straightforward — run 3-4 independent instances. Stateful services (system of record) require more nuance. Services optimizing for <strong>availability</strong> during a partition get AZ-isolated instances that can serve stale data. Services requiring <strong>consistency</strong> stay centralized with careful cross-AZ replication. CAP theorem is not an abstract thought experiment here — it is a deployment decision for each individual service.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Independent Deployment Per Cell</p><p>An underappreciated benefit of cellular architecture is that each cell can be <strong>deployed and updated independently</strong>. A canary deployment that goes wrong in Cell A does not affect Cell B or Cell C. This dramatically reduces the blast radius of bad deploys — one of the leading causes of production incidents. 
Cellular architecture is not just a reliability pattern; it's a deployment safety pattern.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Testing the Drain Before You Need It</p><p>Slack's engineering team explicitly built tooling to <strong>test the AZ drain mechanism regularly</strong> — not just in staging but in production via controlled drains of individual services. This is chaos engineering applied to the mitigation tool itself: if the drain button is only tested during incidents, its failure modes will be discovered at the worst possible moment. The drain mechanism is exercised regularly to ensure it works when it matters.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🌐</strong></p>\n<p>Slack has a <strong>global multi-regional edge network</strong> that handles user connections near the user's geographic location. This edge layer is already highly distributed. The cellular architecture migration focused on Slack's <strong>core computational infrastructure</strong> in us-east-1 — the tier where message storage, fanout, and business logic lives. Fixing the core tier was the key to eliminating AZ-level blast radius for the majority of user-visible failures.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Independent Cell Deployment Dividend</p><p>Post-migration, Slack's oncall teams reported that <strong>bad deploys could be isolated to a single cell</strong> before being promoted to the full fleet. A canary that degrades performance in Cell A triggers an alert while Cells B and C continue healthy — giving the deploying team clear signal and a clean blast-radius boundary. Reliability and deployment safety turned out to be the same investment.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Before the migration, Slack's core platform was a monolithic service topology with components distributed across AZs but not isolated within them. 
A request might be load-balanced to a webapp in AZ-1, which calls a backend service in AZ-3, which reads from a database in AZ-2. This cross-AZ traffic pattern meant that any AZ degradation could affect the latency of any request — the system had no natural blast-radius boundary at the AZ level. Post-migration, each cell contains a complete serving stack: its own webapp instances, its own backend service instances, and its own cache tier. Cross-cell traffic exists only for data that genuinely requires global consistency.</p>\n<h3>Before: Cross-AZ Monolithic Topology (Gray Failure Spreads Freely)</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-cellular-architecture-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Cellular Architecture with AZ Drain Capability</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-cellular-architecture-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE CAP THEOREM AS A DEPLOYMENT GUIDE</strong></p>\n<p>Every service Slack migrated required a CAP tradeoff decision. Cooper Bethea's QCon talk summarizes it clearly: services can choose to <strong>partition-tolerate with availability</strong> (serve potentially stale data from an isolated cell, keeping users active) or <strong>partition-tolerate with consistency</strong> (refuse to serve data if it can't verify it's current, protecting correctness). For Slack, the messaging delivery path chose availability — users can still send messages even if the cell is partitioned. 
Billing and auth services chose consistency — better to reject a request than to process it on stale data.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Observability Must Be Cell-Scoped</p><p>One of the operational requirements that emerged from the cellular migration was <strong>cell-scoped monitoring dashboards</strong>. Aggregated metrics across cells can hide cell-specific degradation — if Cell B is struggling but Cells A and C are fine, a global average error rate might look acceptable. Each cell needs its own metrics, alerts, and dashboards so operators can detect and act on cell-level issues without noise from the healthy cells masking the signal.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📊</strong></p>\n<p>Per-Cell Metrics: The Observability Requirement</p><p>After cellularizing Slack's services, <strong>global aggregated metrics became misleading</strong>. A global p99 latency of 120ms might hide that Cell B has a p99 of 400ms while Cells A and C are at 90ms. Every cell now has its own dashboard, its own alerts, and its own error budget tracking. This per-cell observability investment was not optional — without it, the drain mechanism's input (operator judgment about which cell is degraded) would be unreliable.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Slack's cellular architecture migration is a landmark case study in proactive reliability engineering: a team that experienced an incident, asked 'why did this affect users at all?', and then spent 18 months building the answer into the infrastructure.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Multi-AZ alone does not guarantee AZ-failure isolation.</strong> Running services in multiple AZs provides hardware redundancy but not traffic isolation if your services freely communicate cross-AZ. 
Build AZ-aware traffic routing and cell boundaries so that a failure in one AZ cannot affect requests served entirely by another AZ.</li><li><span>02</span><div><em>Gray failures</em> (partial failures where different components have inconsistent views of availability) are best mitigated by human-triggered fast mitigation, not automated remediation. The ambiguous, view-dependent nature of gray failures makes reliable automated detection extremely difficult. Build a fast drain mechanism with a human in the loop — the goal is not autonomous failure response but human-triggered response that completes in minutes.</li><li><span>03</span><div><strong>Design your incident mitigation tools to operate from outside the affected system.</strong> An AZ drain that requires SSH-ing into the affected AZ is useless when the AZ's network is degraded. Control-plane infrastructure for incident mitigation should be hosted in AZs that are deliberately different from the ones being managed.</li><li><span>04</span><div><strong>Gradual traffic restoration is as important as fast draining.</strong> Restoring 100% of traffic instantly to a recovering AZ can trigger a second cascade as cold caches encounter sudden full load. Design your drain mechanism to be bidirectional: fast drain, slow restore starting at 1% with error-rate gating before each increase.</li><li><span>05</span><div>Cellular architecture is a <strong>deployment safety pattern</strong> in addition to a reliability pattern. Independent per-cell deployments bound the blast radius of bad code changes. When a canary goes wrong in one cell, the other cells continue serving users normally. 
This is a compounding benefit that makes every subsequent deploy safer than it would be in a monolithic topology.</li></ol>\n<blockquote>\n<p><strong>THE COST OF CORRECTNESS</strong></p>\n<p>Some services at Slack could not be cleanly siloed because they require <strong>strong consistency across AZs</strong> — and consistency under partition is expensive. These services were migrated last, required the most architectural work, and ended up with more complex cell topologies (cross-cell replication, global coordination). This is the honest cost of CAP theorem reality: perfect AZ isolation is achievable for stateless services and very expensive for stateful ones. Acknowledge the cost early and budget accordingly.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Metastable States Can Emerge in Cell-Based Systems Too</p><p>Cellular architecture reduces blast radius but does not eliminate the risk of metastable failures. A cell-level cascade — where a single cell degrades in a self-sustaining way — is still possible. The AZ drain mechanism helps by allowing operators to route traffic away from a degraded cell, but the metastable failure patterns described in Slack's 2-22-22 incident postmortem can still occur within an individual cell. 
Defense in depth requires both architectural isolation and operational practices for cascade detection and recovery.</p>\n</blockquote>\n\n<blockquote><p>Slack spent 18 months building a button so an operator could drain an entire data center in five minutes, which is either a lot of work for a button or exactly the right amount of work for that button.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/slack-cellular-architecture-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.179907+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Slack"]}, {"id": "https://techlogstack.com/explore/cloudflare-react-config-outage-2025/", "url": "https://techlogstack.com/explore/cloudflare-react-config-outage-2025/", "title": "Cloudflare Fixed a React Security Vulnerability and Broke the Entire Network", "summary": "How Cloudflare's rollout of a React security fix triggered a global killswitch bug that caused HTTP 500 errors across their network — the third configuration-related", "content_html": "<p><strong>Cloudflare</strong> · Reliability · 17 May 2026</p>\n<p>In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. To do so, they needed to disable an internal testing tool with a global killswitch. The killswitch, unexpectedly, triggered a bug that sent HTTP 500 errors across Cloudflare's entire global network. 
This was the third major configuration-related global outage in two years.</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>By December 2025, Cloudflare had experienced two major configuration-related global outages — the November 2023 Bot Management outage and various incidents in between — and had identified staged configuration rollouts as the primary systemic fix. That fix was still not fully implemented. Then came the React security vulnerability outage. Cloudflare was deploying a fix for a <em>React CVE</em> (a Common Vulnerabilities and Exposures report for a security flaw in the React JavaScript library — CVEs trigger mandatory patching workflows across the industry) in their internal tooling. The patch introduced an error in an <strong>internal testing tool</strong>. The team disabled the testing tool with a <strong>global killswitch</strong>. That killswitch, unexpectedly, triggered a bug in an unrelated code path — causing HTTP 500 errors across Cloudflare's network.</p>\n<blockquote><p>In this latest outage, Cloudflare was burnt by yet another global configuration change. The previous outage in November happened thanks to a global database permissions change. This change would make it so that Cloudflare's configuration files do not propagate immediately to the full network, as they still do now. But making all global configuration files have staged rollouts is a large implementation that could take months. Evidently, there wasn't time to make it yet, and it has come back to bite Cloudflare.</p><p><em>— — The Pragmatic Engineer newsletter analysis of the Cloudflare December 2025 outage</em></p></blockquote>\n<p>The pattern was now impossible to ignore. Cloudflare had experienced multiple major outages in the 2023–2025 period, each with the same root-cause category: a configuration change that propagated globally and instantly, without staged rollout, caused unexpected systemic failures. 
The November 2023 Bot Management outage's primary action item — implement staged configuration rollouts — was explicitly identified as <strong>a large implementation that could take months</strong>. Each new outage was paying the price of that implementation not yet being complete. The React outage was the industry's most documented illustration of technical debt from unimplemented postmortem action items.</p>\n<blockquote>\n<p><strong>THE KILLSWITCH THAT WASN'T JUST A KILLSWITCH</strong></p>\n<p>A killswitch is a simple concept: disable something. But in a complex distributed system, disabling one component can have unexpected dependencies. The internal testing tool that was disabled via global killswitch was apparently connected to a code path that, when the tool was absent, triggered a bug causing HTTP 500 errors. <strong>Killswitches are configuration changes.</strong> All the same rules apply: validate them, stage them, monitor them. A killswitch deployed globally and instantly is a global instant configuration change.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>React CVE Fix Introduces Testing Tool Error</h4>\n<p>Cloudflare was rolling out a fix for a React security vulnerability in internal tooling. The fix caused an error in an internal testing tool, prompting the team to disable the tool. The disable was executed as a global configuration change via killswitch.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Killswitch Triggered Unexpected Code Path Bug</h4>\n<p>The global killswitch that disabled the testing tool unexpectedly triggered a bug in a connected code path. The bug caused HTTP 500 errors across Cloudflare's network. Because the killswitch was propagated globally and instantly, the impact was immediate and global.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Revert Killswitch Configuration</h4>\n<p>The fix was to revert the killswitch configuration — undoing the disable of the testing tool that had triggered the bug. 
This brought Cloudflare's network back to its pre-fix state. The React CVE patch then needed to be reworked to avoid triggering the testing tool error.</p>\n<hr />\n<h3>Result</h3>\n<h4>Service Restored, Pattern Acknowledged</h4>\n<p>Service was restored after reverting the configuration. The postmortem was published on the same day. CTO Dane Knecht acknowledged the pattern publicly and committed to making enhanced rollouts and versioning 'the first priority across the organization' — the same commitment made after the 2023 outages.</p>\n<hr />\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Third Configuration-Related Outage in Two Years</p><p>The React security fix outage was the third major configuration-related global outage in Cloudflare's 2023–2025 period. The November 2023 Bot Management outage, subsequent incidents, and the December 2025 React outage all shared the same fundamental cause: a configuration change propagated globally and instantly without safety validation. The same fix had been identified after the first outage. That the fix hadn't been implemented by the third outage is a case study in the organizational cost of deprioritizing postmortem action items.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚛️</strong></p>\n<p>The React vulnerability that started this chain of events was a <strong>security patch that Cloudflare was doing the right thing by deploying</strong>. Security vulnerability patching is mandatory and time-sensitive. The outage wasn't caused by bad intentions or negligence — it was caused by a security response that didn't account for all of its dependencies.</p>\n</blockquote>\n\n<p>One of the most challenging aspects of Cloudflare's staged rollout implementation is the security-versus-safety tension. Cloudflare's configuration distribution system was designed to be fast because <strong>security changes need to be fast</strong>. 
When a new attack pattern is detected, Cloudflare needs to push mitigation rules globally as quickly as possible. Slowing down configuration propagation has real security costs: the window between an attack being detected and the mitigation being globally deployed gets longer. The engineering challenge is building a system that can be fast for security-critical changes but staged for everything else — which requires distinguishing between change types at the infrastructure level.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>CTO Dane Knecht's Public Commitment</p><p>Following the December 2025 outage, Cloudflare CTO Dane Knecht was quoted in the postmortem: <strong>'Global configuration changes rolling out globally remains our first priority across the organization.'</strong> This was the same commitment made after the 2023 outages. The public, repeated commitment to the same fix — without the fix having been implemented — created accountability that was difficult to ignore. The staged rollout project was given resources and deadline commitment following the December 2025 outage.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Same-Day Postmortem: The Third Time</p><p>Cloudflare published their postmortem for the December 2025 React outage on the same day the incident resolved — maintaining their remarkable transparency standard for the third major outage in two years. The postmortem's candor was notable: it explicitly referenced the November 2023 action item that hadn't been completed, and included CTO Dane Knecht's public acknowledgment that staged configuration rollouts 'remains our first priority.' 
Three same-day postmortems, three public commitments to the same fix, growing organizational accountability.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>The Pattern: Configuration Changes That Break Things</p><p>Looking across Cloudflare's 2023–2025 incidents, a precise pattern emerges: (1) a routine operational change is made to production infrastructure, (2) the change has unexpected downstream effects, (3) the affected configuration or rule is propagated globally and instantly, (4) the impact is global and immediate. The fix to this pattern is not 'be more careful' — it's <strong>staged rollout infrastructure that makes global instant propagation impossible for non-security-critical changes</strong>.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>WHAT MAKES CLOUDFLARE'S CASE UNIQUE</strong></p>\n<p>Most organizations have configuration-related incidents. What makes Cloudflare's case unusual is the scale: a configuration change at Cloudflare affects infrastructure serving <strong>a significant fraction of all internet traffic</strong>. The blast radius is not one company's systems — it's millions of websites and their users globally. This scale makes configuration safety not just an operational concern but a responsibility to the broader internet ecosystem. Cloudflare's staged rollout implementation is infrastructure for global internet resilience.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Systemic Fix: Enhanced Rollouts and Versioning</h3>\n<p>Cloudflare's CTO described the required fix as <strong>'Enhanced Rollouts and Versioning'</strong> — applying the same safety and blast mitigation features to configuration data that Cloudflare already applies to software deployments. Software at Cloudflare is deployed gradually, with strict health validation at each stage. Configuration changes had no equivalent safety system. 
The fix required building one: a configuration versioning system that could tag changes, a rollout engine that could apply them to staged percentages, and health checks that could catch problems before wider propagation.</p>\n<ul>\n<li><strong>3rd</strong> — Configuration-related global outage in the 2023–2025 period — each one traceable to the same root cause: instant global config propagation without safety gates</li>\n<li><strong>Months</strong> — Estimated implementation time for staged rollouts as quoted in the November 2023 postmortem — the duration that allowed the second and third outages to occur before the fix was complete</li>\n<li><strong>Same day</strong> — Postmortem publication time — Cloudflare's consistent practice of same-day transparency, maintained even when the incident revealed repeated failure to implement a known fix</li>\n<li><strong>Priority #1</strong> — Stated organizational priority for staged configuration rollouts — acknowledged as the highest infrastructure priority after the December 2025 outage</li>\n</ul>\n\n<pre><code class=\"language-python\"># The required Enhanced Rollouts and Versioning system (illustrative sketch)\n# Differentiates security-critical changes (fast) from configuration changes (staged)\nfrom dataclasses import dataclass\nfrom enum import Enum, auto\n\n\nclass ConfigChangeType(Enum):\n    SECURITY_CRITICAL = auto()\n    STANDARD = auto()\n\n\n@dataclass\nclass ConfigChange:\n    type: ConfigChangeType\n    payload: bytes\n\n\nclass ConfigRolloutEngine:\n    def deploy_change(self, change: ConfigChange):\n        # Security-critical changes (DDoS mitigations, attack signatures)\n        # Still fast, but behind a validation gate\n        if change.type == ConfigChangeType.SECURITY_CRITICAL:\n            self._validate_config(change)  # validation must pass\n            self._deploy_global_fast(change)  # then deploy fast\n            return\n\n        # All other changes: staged rollout with health gates\n        self._validate_config(change)\n\n        # Stage 1: 1% canary\n        self._deploy_to_percentage(change, pct=0.01)\n        self._wait_and_check_health(minutes=5)\n\n        # Stage 2: 10% cohort\n        self._deploy_to_percentage(change, pct=0.10)\n        self._wait_and_check_health(minutes=5)\n\n        # Stage 3: 50% cohort\n        self._deploy_to_percentage(change, pct=0.50)\n        self._wait_and_check_health(minutes=10)\n\n        # Stage 4: Full rollout\n        self._deploy_global(change)\n\n    def _validate_config(self, change: ConfigChange):\n        # Size limits, schema validation, semantic checks\n        # Catches the oversized ClickHouse fallback config\n        # Catches malformed configs before any propagation\n        if not change.payload:\n            raise ValueError('empty config payload')\n\n    def _deploy_global_fast(self, change: ConfigChange):\n        # Stub: push to the whole fleet immediately\n        self._deploy_to_percentage(change, pct=1.0)\n\n    def _deploy_global(self, change: ConfigChange):\n        # Stub: final stage, cover the remaining fleet\n        self._deploy_to_percentage(change, pct=1.0)\n\n    def _deploy_to_percentage(self, change: ConfigChange, pct: float):\n        # Stub: a real engine would push the change to pct of the fleet\n        print(f'deploying {change.type.name} change to {pct:.0%} of fleet')\n\n    def _wait_and_check_health(self, minutes: int):\n        # Error rate, latency, traffic metrics\n        # Auto-rollback if thresholds exceeded (stubbed here)\n        pass</code></pre>\n<blockquote>\n<p><strong>THE SECURITY-SPEED TENSION</strong></p>\n<p>The core tension in Cloudflare's configuration safety problem is that their configuration system was designed for security use cases where speed matters. Staged rollouts introduce latency that's unacceptable for DDoS mitigation rules. The solution requires <strong>distinguishing between change types</strong>: security responses (fast propagation + validation) versus configuration updates (staged propagation + health gates). This distinction is architecturally complex — the system needs to know the change type, enforce the right deployment mode, and maintain separate pipelines without creating a new single point of failure.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Three-Outage Forcing Function</p><p>If the staged rollout implementation had been deprioritized after the November 2023 outage, the December 2025 outage provided an undeniable forcing function. Three configuration-related global outages in two years, with the same root cause, create organizational pressure that cannot be managed with further prioritization discussions. 
The December 2025 outage finally resulted in resources, deadline commitment, and executive ownership for the staged rollout project.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Postmortem Action Items Need Priority Enforcement</p><p>The Cloudflare staged rollout story is one of the industry's clearest examples of what happens when postmortem action items are treated as backlog items rather than critical debt. The November 2023 postmortem identified the fix. The December 2025 outage demonstrated the cost of not implementing it. Engineering organizations need mechanisms to track postmortem action items with urgency, not just completeness — including escalation paths when critical action items age without progress.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Resources Finally Allocated After Three Incidents</p><p>The December 2025 outage served as the organizational forcing function that earlier incidents hadn't fully achieved. Following the third configuration-related global outage in two years, Cloudflare allocated dedicated engineering resources, a named project lead, and a committed delivery timeline for the Enhanced Rollouts and Versioning system. The system is now being built as production infrastructure rather than a backlog item.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Security-Critical Fast Path</p><p>One of the hardest engineering problems in the staged rollout system is the security-critical fast path. When Cloudflare detects a new DDoS attack pattern or zero-day exploit, they need to push mitigations to every PoP globally within seconds — not within the staged rollout window of 30+ minutes. The system must <strong>distinguish at the protocol level</strong> between security-critical changes (which maintain fast propagation) and configuration updates (which go through staged rollout). 
Building this distinction correctly — without creating a bypass that regular configuration changes can be misclassified into — is the core engineering challenge.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The React outage sits in a chain of failures that reveals a systemic architectural vulnerability in Cloudflare's control plane. At the data plane level — PoPs, traffic routing, DDoS mitigation — Cloudflare's architecture is highly resilient. At the configuration plane level — the system that distributes rules and settings to the data plane — the architecture was designed for speed rather than safety. Three outages in two years from the same root cause is the empirical evidence that speed without safety is not viable at global infrastructure scale.</p>\n<h3>The Configuration Safety Gap: 2023–2025 Timeline</h3>\n<p><a href=\"https://techlogstack.com/explore/cloudflare-react-config-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Required Enhanced Rollout Architecture for Cloudflare</h3>\n<p><a href=\"https://techlogstack.com/explore/cloudflare-react-config-outage-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE ORGANIZATIONAL LESSON: ACTION ITEMS NEED OWNERS</strong></p>\n<p>Cloudflare's staged rollout work was identified as a priority after three separate incidents. Each time, it was described as a large implementation requiring months. In hindsight, the organizational failure was not the identification — it was the <strong>lack of a named owner with authority, resources, and a committed deadline</strong>. 
Postmortem action items without named owners, resource allocation, and deadline accountability often age in backlogs until a subsequent incident forces the conversation again.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Cloudflare's Transparency as Industry Standard</p><p>Despite three major outages with related root causes, Cloudflare's consistent same-day postmortem publication is widely recognized as an industry best practice. The transparency builds trust even when the incidents themselves erode it. <strong>Companies that publish honest postmortems attract and retain engineers who want to learn from failures</strong>, and they establish accountability mechanisms that internal-only postmortems don't create. The public commitment to fixing staged rollouts after the December 2025 outage has an accountability dimension that an internal action item does not.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Cloudflare's Scale Makes the Problem Harder</p><p>Staged configuration rollout at Cloudflare's scale (300+ PoPs, millions of configuration updates per year, microsecond-sensitive security decisions) is genuinely difficult infrastructure engineering. The problem is not that Cloudflare doesn't know how to build staged rollouts — they already do this for software deployments. The problem is <strong>retrofitting staged rollout semantics onto a configuration distribution system that was designed for a different set of requirements</strong> (fast propagation, consistency, global reach) without disrupting the security use cases that depend on that speed.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The React security fix outage is the third chapter in a two-year story about the cost of not completing a known critical infrastructure fix. 
The lessons are organizational as much as technical.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>A postmortem action item that isn't implemented before the next incident becomes evidence.</strong> The staged rollout fix was identified in November 2023. The incidents that followed demonstrated its absence. Each one would have been preventable had the fix been implemented. Organizations that deprioritize critical postmortem action items pay the price in the form of the next incident.</div></li><li><span>02</span><div><em>Killswitches</em> (configuration flags that disable functionality globally) are configuration changes and must be treated with the same safety rigor. A killswitch that propagates globally and instantly, without validation and health gating, is a global instant configuration change. Apply staged rollout requirements to all configuration changes — including disables, removals, and shutdowns.</div></li><li><span>03</span><div><strong>Security patches create deployment urgency that can override normal safety practices.</strong> CVE patches are time-sensitive, creating pressure to deploy quickly. Build explicit processes for security patching that maintain urgency while preserving safety gates — staged deployment with fast canary windows is nearly as fast as instant global deployment and far safer.</div></li><li><span>04</span><div>Postmortem action items need <strong>named owners, resource allocation, and deadline commitment</strong> — not just backlog entries. The difference between 'we identified the need for staged rollouts' and 'engineer X owns staged rollouts with Y engineers and a Q1 deadline' is the difference between an action item that ages and one that gets done.</div></li><li><span>05</span><div>Repeated incidents with the same root cause are not evidence that the fix is impossible — they are evidence that the fix is <strong>insufficiently prioritized</strong>. Three configuration-related global outages are a forcing function for resource allocation. If the first incident's postmortem doesn't unlock the resources to fix the root cause, count on needing either the second or third incident to do it.</div></li></ol></div>\n<blockquote>\n<p><strong>THE TRANSPARENCY COMPOUNDING EFFECT</strong></p>\n<p>Cloudflare's pattern of same-day postmortem publication for major incidents has created a compounding transparency dividend: each postmortem increases customer trust, each public commitment creates accountability, each incident with the same root cause raises the organizational urgency. <strong>The third outage with the same root cause forced a resource and timeline commitment that the first and second outages hadn't achieved</strong>. Transparency accelerates accountability.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Testing Infrastructure for Operational Safety Changes</p><p>The React CVE fix that started this chain of events was a security response — the right thing to do. But deploying it through a testing tool that hadn't been validated for that specific change created the downstream error. <strong>Operational safety infrastructure (testing tools, killswitches, monitoring systems) needs the same testing rigor as application code</strong>. 
When safety infrastructure fails, it often does so during incidents — exactly the moment it's needed most.</p>\n</blockquote>\n\n<blockquote><p>Cloudflare fixed a React security vulnerability and accidentally broke the global internet, which is both very on-brand for React and a reminder that security patches are just change management with higher stakes.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/cloudflare-react-config-outage-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.183704+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Cloudflare"]}, {"id": "https://techlogstack.com/explore/discord-cassandra-scylladb-2022/", "url": "https://techlogstack.com/explore/discord-cassandra-scylladb-2022/", "title": "How Discord Migrated Trillions of Messages and Fired Their Garbage Collector", "summary": "How Discord's engineering team eliminated JVM GC pauses, cut their database fleet from 177 to 72 nodes, and migrated 4 trillion messages in 9 days.", "content_html": "<p><strong>Discord</strong> · Databases · 17 May 2026</p>\n<p>It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. 
The system holding every message ever sent is becoming the thing everyone fears touching most.</p>\n<ul>\n<li>177 → 72 nodes</li><li>9-day migration (was 3-month est.)</li><li>3.2M records/sec migrated</li><li>4T+ messages moved</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>Our Cassandra cluster exhibited serious performance issues that required increasing amounts of effort to just maintain, not improve.</p><p><em>— Bo Ingram, Senior Software Engineer — via Discord Engineering Blog</em></p></blockquote>\n<p>Discord launched in 2015 with a mission to build the best voice and text chat platform for gamers. By 2017 they had outgrown MongoDB and migrated their entire message store to <em>Apache Cassandra</em> (a distributed wide-column NoSQL database designed for high availability across many nodes without a single point of failure). Cassandra's promise was compelling: write anywhere, replicate everywhere, scale horizontally forever. For a few years it held. By 2022, however, the promises had curdled into a maintenance nightmare that consumed engineering cycles every single week. The database cluster had grown to <strong>177 nodes</strong> holding <strong>trillions of messages</strong>, and keeping it alive required the kind of expertise and vigilance that should be reserved for nuclear reactor operators, not chat app engineers.</p>\n<blockquote>\n<p><strong>🔥</strong></p>\n<p>At peak, Discord's Cassandra cluster required engineers to manually reboot individual nodes after <em>JVM GC pauses</em> (Java Virtual Machine garbage collection — periodic stop-the-world pauses where the JVM freezes all threads to reclaim memory) spiraled long enough to drop the node from the cluster. 
This was not a rare emergency — it was routine on-call work.</p>\n</blockquote>\n\n<p>The core problem was architectural. Cassandra is written in Java, and Java's garbage collector periodically halts all threads in the JVM to reclaim heap memory — a moment engineers call a <span>stop-the-world pause</span>. Under Discord's workloads, these pauses could last long enough to cause cascading latency spikes visible to users, and in severe cases, the JVM's consecutive GC pauses got so bad that a node would effectively fall out of the cluster entirely. An on-call engineer would then have to manually reboot it and babysit it back to health. The <strong>p99 latency on historical message reads</strong> ranged between <strong>40 and 125 milliseconds</strong> depending on whether compaction was running — an unpredictability that made SLO planning impossible. Every time someone tried to improve the cluster rather than merely maintain it, they risked triggering a cascade.</p>\n<h3>The Hot Partition Problem</h3>\n<p>Discord's message data model organized messages by channel ID and a fixed time window called a <em>bucket</em> (a fixed time slice, e.g. 10 days, used as part of the partition key so messages are spread across multiple Cassandra partitions rather than one per channel). This was efficient for write distribution and replication, but created a painful read problem. Cassandra performs writes cheaply by appending to a <em>commit log</em> (a sequential on-disk journal where writes are recorded before being applied to the in-memory structure, enabling fast writes at the cost of read complexity) and an in-memory structure called a <em>memtable</em> (an in-memory write buffer in Cassandra that is flushed to disk as SSTables when it fills up). 
Reads, however, must query the memtable and potentially multiple <em>SSTables</em> (Sorted String Tables — immutable on-disk files in Cassandra that hold flushed memtable data, which must all be merged on read to reconstruct the current value), a dramatically more expensive operation. When a popular Discord server made a major announcement and thousands of users simultaneously opened their apps to read it, every single one of those reads would hammer the same partition. The team called these <strong>hot partitions</strong>, and they were Discord's most common and most painful type of operational incident.</p>\n\n<h3>Problem</h3>\n<h4>The Maintenance Spiral</h4>\n<p>By early 2022, Discord's on-call rotation was spending more time nursing Cassandra than building features. GC pause alerts fired multiple times a week, and the p99 latency on reads ranged from 40ms to 125ms depending on whether compaction was running on the affected node — an unpredictability engineers had simply learned to live with.</p>\n<hr />\n<h3>Cause</h3>\n<h4>JVM GC + Hot Partition Physics</h4>\n<p>The root cause split into two layers: <em>JVM garbage collection</em> (Java's memory management system that periodically pauses all threads to reclaim heap memory — in large heaps, these pauses could last hundreds of milliseconds) on write-heavy nodes created latency cliffs, while Cassandra's read path — requiring merges across multiple SSTables — meant any popular channel partition would spike latency under concurrent user load. The combination made the cluster inherently unpredictable at scale.</p>\n<hr />\n<h3>Solution</h3>\n<h4>ScyllaDB + Rust Data Services</h4>\n<p>Discord chose ScyllaDB, a Cassandra-compatible database rewritten in C++ with a <em>shard-per-core architecture</em> (a design where each CPU core is assigned its own exclusive subset of data and handles requests independently, avoiding cross-core coordination and lock contention). 
They also built a Rust-based data services layer between the API and the database to absorb hot-partition spikes via request coalescing. The migration tool was rewritten in Rust and reached a transfer speed of 3.2 million records per second.</p>\n<hr />\n<h3>Result</h3>\n<h4>9 Days, 4 Trillion Messages, Zero Users Noticed</h4>\n<p>The migration completed in nine days — the original estimate using ScyllaDB's Spark migrator had been three months. The cluster footprint shrank from 177 nodes to 72, each ScyllaDB node running with 9 TB of disk versus the average 4 TB on Cassandra. P99 latency for historical reads settled at a stable, predictable <strong>15 milliseconds</strong>.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Why cassandra-messages Was the Last to Move</p><p>By 2020 Discord had migrated <strong>every other database</strong> to ScyllaDB — the messages cluster was the lone holdout. They deliberately left it until last because it was the most critical dataset: trillions of messages, nearly 200 nodes, and the one cluster whose failure would be immediately visible to every user. They used the other migrations to tune ScyllaDB for their access patterns first, including requesting and waiting on performance improvements to ScyllaDB's reverse query support.</p>\n</blockquote>\n\n<h3>The Tombstone Trap at 99.9999%</h3>\n<p>The migration nearly ended in drama rather than triumph. After running the Rust migrator for days at <strong>3.2 million records per second</strong>, the progress bar hit 99.9999% — and stopped. 
The migrator was timing out trying to read the last few <em>token ranges</em> (in Cassandra, data is distributed across the ring by assigning each partition a hash token, and a token range is a contiguous slice of that ring assigned to a node) because they contained gigantic ranges of <em>tombstones</em> (deletion markers in Cassandra — when data is deleted, a tombstone is written instead of the row being removed, because immutable SSTables cannot be modified in-place; these tombstones must be read and skipped during every subsequent read until compaction removes them) that had never been compacted away. Engineers had to manually trigger compaction on that token range; <span>seconds later, the migration hit 100%</span>. Automated data validation confirmed correctness by sending a sample of reads to both databases and comparing results. Discord switched to ScyllaDB in May 2022.</p>\n<blockquote>\n<p><strong>THE WORLD CUP TEST</strong></p>\n<p>The real stress test came months after go-live: the 2022 FIFA World Cup Final between Argentina and France. Every goal by Messi, every equalizer by Mbappé, every moment in the shootout created a massive spike of simultaneous message reads across Discord's biggest servers. Under the old Cassandra architecture this would have triggered hot-partition alerts and cascading latency. Under ScyllaDB with the Rust data services layer, the monitoring dashboards showed nothing unusual. The system held flat through 120 minutes of football and a penalty shootout.</p>\n</blockquote>\n\n<p>The Rust data services layer was the architectural insight that made ScyllaDB viable, not just the database choice alone. When a popular server makes an announcement and <strong>thousands of users simultaneously open their clients</strong>, all those read requests arrive at the data service within milliseconds of each other — all asking for the same messages in the same channel. 
Without coalescing, each request would hit the database separately, creating a hot partition. With <em>request coalescing</em> (a pattern where the first incoming request for a piece of data triggers an active lookup, and all subsequent requests for the same data subscribe to that lookup's result rather than issuing their own query, reducing N database hits to 1), only one query goes to ScyllaDB; every subsequent request subscribes to the in-flight result and receives the answer when the single database query returns. The data services layer also used <em>consistent hashing</em> (a ring-based routing scheme where each data service instance is responsible for a specific subset of channel IDs, ensuring all requests for a given channel are routed to the same service instance to maximize coalescing effectiveness) to route requests for the same channel to the same service instance, maximizing coalescing opportunity.</p>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<ul>\n<li><strong>177→72</strong> — Cassandra nodes replaced by ScyllaDB nodes — a 59% reduction in cluster footprint while handling the same workload</li>\n<li><strong>15ms</strong> — Stable p99 read latency on ScyllaDB, down from an unpredictable 40–125ms range on Cassandra depending on compaction status</li>\n<li><strong>9 days</strong> — Total migration time for 4+ trillion messages — versus the original 3-month estimate with ScyllaDB's Spark migrator</li>\n<li><strong>3.2M/s</strong> — Peak migration throughput of the Rust-rewritten migrator, unlocking a single-flip cutover instead of a complex time-based phased approach</li>\n</ul>\n\n<p>The fix had three distinct components, and Discord was deliberate about not rushing any of them. First, they spent years migrating every other database to ScyllaDB to build operational expertise before touching the one cluster that mattered most. 
Second, they collaborated with the ScyllaDB team to improve reverse query performance — a blocker they hit in early testing — and waited until that was production-grade before proceeding. Third, they <strong>built the Rust data services layer before starting the migration</strong>, so the new database would go live already protected from hot-partition load patterns. This sequencing was the engineering discipline that made the migration look easy in retrospect.</p>\n<h3>The Rust Migrator Rewrite</h3>\n<p>The turning point in the migration timeline was a one-day engineering sprint. ScyllaDB's off-the-shelf <em>Spark migrator</em> (an Apache Spark-based tool provided by ScyllaDB for bulk data migration that reads token ranges from Cassandra and writes them to ScyllaDB) was projected to take three months to move the message data — <span>three months of dual-running two massive database clusters, three months of operational complexity</span>, and three months of potential failure modes. Bo Ingram decided that was three months too long. He and two colleagues rewrote the migrator in Rust in a single day. The new migrator read token ranges from Cassandra, checkpointed them locally via SQLite for crash recovery, and fired them into ScyllaDB as fast as possible. The result: <strong>3.2 million records per second</strong>. 
The new estimate was nine days, and the team chose a single-flip cutover instead of a phased, time-based approach.</p>\n<pre><code>// Illustrative sketch of request coalescing, in the spirit of Discord's Rust data service.\n// Not Discord's actual code: `Message` and `ScyllaClient` are hypothetical stand-ins.\n\nuse std::collections::HashMap;\nuse std::sync::Mutex;\nuse tokio::sync::broadcast;\n\n#[derive(Clone, Debug)]\nstruct Message {\n    id: u64,\n    content: String,\n}\n\n// Result type shared by the leader and every subscriber (must be Clone to broadcast)\ntype QueryResult = Result&lt;Vec&lt;Message&gt;, String&gt;;\n\n// Hypothetical database handle; a real service would wrap a ScyllaDB driver session\nstruct ScyllaClient;\n\nimpl ScyllaClient {\n    async fn query_messages(&amp;self, _channel_id: u64, _before_id: u64) -&gt; QueryResult {\n        Ok(Vec::new()) // placeholder for the real database query\n    }\n}\n\nstruct CoalescingDataService {\n    scylladb: ScyllaClient,\n    // Map from query key -&gt; broadcast sender for the query currently in flight\n    in_flight: Mutex&lt;HashMap&lt;String, broadcast::Sender&lt;QueryResult&gt;&gt;&gt;,\n}\n\nimpl CoalescingDataService {\n    async fn get_messages(&amp;self, channel_id: u64, before_id: u64) -&gt; QueryResult {\n        // Build a stable key for this exact query\n        let key = format!(\"{}:{}\", channel_id, before_id);\n\n        // Subscribe to an in-flight query, or register this call as the leader.\n        // The lock is released at the end of this block, before any .await.\n        let waiter = {\n            let mut map = self.in_flight.lock().unwrap();\n            match map.get(&amp;key) {\n                Some(sender) =&gt; Some(sender.subscribe()),\n                None =&gt; {\n                    let (tx, _rx) = broadcast::channel(1);\n                    map.insert(key.clone(), tx);\n                    None\n                }\n            }\n        };\n\n        if let Some(mut rx) = waiter {\n            // The same query is already in flight: wait for its result,\n            // with no second database round-trip\n            return rx.recv().await.map_err(|e| e.to_string())?;\n        }\n\n        // We are the first caller: execute the single query against ScyllaDB\n        let result = self.scylladb.query_messages(channel_id, before_id).await;\n\n        // Remove the entry so later calls start a fresh query, then wake all subscribers\n        if let Some(tx) = self.in_flight.lock().unwrap().remove(&amp;key) {\n            let _ = tx.send(result.clone()); // Err only means nobody was waiting\n        }\n\n        result\n    }\n}</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Superdisk: Custom RAID for Cloud Durability</p><p>ScyllaDB is optimized for <em>NVMe SSDs</em> (Non-Volatile Memory Express solid-state drives — extremely fast local storage that dramatically 
reduces I/O latency) but in cloud environments NVMe is ephemeral — a node restart wipes the disk. Discord engineered a <strong>custom RAID 1 configuration</strong> they called the Superdisk: writes go to both fast local NVMe and slower persistent network-attached storage simultaneously; reads prefer the NVMe for speed. This gave them NVMe-level read performance with cloud-level data durability.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Zstandard Compression: 50–60% Disk Reduction</p><p>Alongside the database migration, Discord enabled <span><strong>Zstandard compression</strong></span> on their ScyllaDB tables. Message data compresses extremely well. The result was a <strong>50–60% reduction</strong> in raw disk usage compared to uncompressed Cassandra storage — effectively giving each physical node far more useful capacity at zero hardware cost.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE VALIDATION STRATEGY</strong></p>\n<p>Discord ran automated correctness validation throughout the migration by sending a <strong>small percentage of reads to both databases simultaneously</strong> and comparing results. Only when reads matched across Cassandra and ScyllaDB was a partition considered successfully migrated. 
This shadow-read approach caught data inconsistencies without any user-visible impact, and gave the team confidence to flip the cutover switch as a single atomic event rather than a long, hedged transition.</p>\n</blockquote>\n\n<div><table><caption>Cassandra vs ScyllaDB at Discord: Before and After Migration</caption><thead><tr><th>Metric</th><th>Cassandra (Before)</th><th>ScyllaDB (After)</th></tr></thead><tbody><tr><td>Cluster Nodes</td><td>177</td><td>72</td></tr><tr><td>Disk per Node (avg)</td><td>4 TB</td><td>9 TB</td></tr><tr><td>p99 Read Latency</td><td>40–125ms (variable)</td><td>~15ms (stable)</td></tr><tr><td>GC Pauses</td><td>Frequent stop-the-world</td><td>None (C++, no GC)</td></tr><tr><td>Hot Partition Risk</td><td>High — no coalescing</td><td>Mitigated by Rust data services</td></tr><tr><td>On-Call Toil</td><td>Weekly node babysitting</td><td>Dramatically reduced</td></tr></tbody></table></div>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Before the migration, Discord's message write and read path ran through a monolithic API server directly into the Cassandra cluster. There was no intermediary — every user action that read messages translated directly into a database query, with no protection against fan-out or <em>hot partition amplification</em> (when many users simultaneously request data stored in the same database partition, causing that node to receive far more traffic than its neighbors, creating latency spikes and potential instability). The API server held connection pools to Cassandra, handled <em>CQL queries</em> (Cassandra Query Language — a SQL-like interface for querying Cassandra) for message pagination, and relied on Cassandra's own internal mechanisms (memtable, SSTables, compaction) to handle read pressure. Under normal load this worked. 
Under peak load — a major announcement, a viral moment, a World Cup Final — it did not.</p>\n<h3>Before: Direct API-to-Cassandra Architecture (Hot Partition Risk)</h3>\n<p><a href=\"https://techlogstack.com/explore/discord-cassandra-scylladb-2022/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>SHARD-PER-CORE ARCHITECTURE</strong></p>\n<p>The fundamental reason ScyllaDB handles concurrent reads so much better than Cassandra is its <strong>shard-per-core architecture</strong>. Each CPU core is assigned its own exclusive slice of the data and handles all requests for that data without coordination with other cores. In Cassandra's JVM-based model, all threads compete for heap memory under a single garbage collector. In ScyllaDB's C++ model, <strong>each core is an independent actor</strong>: no cross-core locking, no GC, no stop-the-world. When one partition gets hot, it affects only the core assigned to that shard — it cannot cascade to neighbors.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Consistent Hashing: Routing Channels to Service Instances</p><p>Each Rust data service instance is responsible for a <strong>deterministic subset of channel IDs</strong> via <em>consistent hashing</em> (a routing scheme where each channel_id is mapped to a specific service instance using a hash ring, so all requests for channel #12345 always go to Data Service Instance B — maximizing the chance that an in-flight coalescing task for that channel already exists). 
This means if 1,000 users simultaneously load the same popular channel, all 1,000 requests arrive at the same service instance and collapse into one database query.</p>\n</blockquote>\n\n<h3>After: Rust Data Services + ScyllaDB Architecture (Hot Partition Mitigated)</h3>\n<p><a href=\"https://techlogstack.com/explore/discord-cassandra-scylladb-2022/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>🦀</strong></p>\n<p>Why Rust for the Data Services Layer</p><p>Discord chose Rust for data services because it offered C-level throughput with memory safety guarantees that prevent entire classes of concurrency bugs common in C++ — exactly what you want in a layer handling millions of concurrent subscribers. The Tokio async runtime gave them non-blocking I/O without the GC overhead that had plagued their Cassandra setup. As Bo Ingram noted with characteristic candor: it also let them say they <strong>rewrote it in Rust</strong>.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Discord's migration took years of preparation and nine days of execution. The long preparation was not waste — it was the reason the execution was clean. The lessons here are as much about sequencing and courage as they are about database choice.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Migrate your riskiest system last, but don't use that as an excuse to never migrate it.</strong> Discord deliberately kept the messages database in Cassandra for two years after migrating everything else, using that time to build ScyllaDB expertise on less critical workloads. 
However, they committed to a hard deadline once operational confidence was achieved — avoiding the trap of indefinite deferral that plagues many large migrations.</li><li><span>02</span><div><em>Request coalescing</em> (combining multiple concurrent requests for identical data into a single database query, broadcasting the result to all waiters) is a force multiplier against hot partitions that no amount of database scaling alone can provide. When you have popular content that thousands of users read simultaneously, add a coalescing layer between your application and your database — the reduction in query fan-out is often more impactful than hardware upgrades.</li><li><span>03</span><div><strong>Rewrite your migration tooling if the estimated duration is unacceptable.</strong> A three-month migration estimate is not a constraint — it's a scope definition that you can change. Discord's one-day Rust rewrite of the migrator turned a three-month project into nine days, enabling a simpler single-flip cutover instead of a complex phased approach. Always ask: what would it take to make this ten times faster?</li><li><span>04</span><div><em>Stop-the-world GC pauses</em> (periodic halts in JVM-based systems where all threads freeze while the garbage collector reclaims memory) are a predictable, structural problem in Java-based databases at high concurrency — not a tuning problem you can engineer your way out of at Discord's scale. When your on-call team spends more time maintaining a database than improving it, that's the signal to evaluate architecturally different alternatives, not just different JVM flags.</li><li><span>05</span><div><strong>Run shadow reads for data validation before any large-scale cutover.</strong> Sending a percentage of reads to both old and new systems simultaneously — and comparing results automatically — gives you objective confidence that your migration is correct without user-visible risk. 
This pattern is applicable to any database migration and should be standard practice before any atomic cutover switch.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The World Cup Validation</p><p>The 2022 FIFA World Cup Final was Discord's unplanned load test — and the system passed cleanly. Every goal, every save, every penalty created message spikes across thousands of servers simultaneously. The combination of ScyllaDB's shard-per-core architecture and Rust data services coalescing kept latency flat through all 120 minutes plus penalties. <strong>No hot partition alerts. No on-call pages. No post-match war rooms.</strong></p>\n</blockquote>\n\n<blockquote>\n<p><strong>SHADOW READ VALIDATION</strong></p>\n<p>Discord's validation strategy during migration was elegantly simple: send a <strong>small percentage of reads to both Cassandra and ScyllaDB simultaneously</strong>, compare results automatically, and flag any discrepancy. This meant correctness was continuously verified during the nine days of data transfer — not checked at the end in a tense manual review. 
Any database migration touching production data should implement this pattern before flipping the final switch.</p>\n</blockquote>\n\n<blockquote><p>They migrated four trillion messages in nine days, and the most stressful moment was the progress bar stopping at 99.9999% — because even tombstones refuse to die quietly.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/discord-cassandra-scylladb-2022/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.396336+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "Discord"]}, {"id": "https://techlogstack.com/explore/discord-cloud-dev-environments-2023/", "url": "https://techlogstack.com/explore/discord-cloud-dev-environments-2023/", "title": "Discord Killed the MacBook Dev Environment and Never Looked Back", "summary": "Discord's journey from MacBook chaos to cloud development environments — two migrations, a Kubernetes dead-end, and how Tailscale finally made remote dev feel local.", "content_html": "<p><strong>Discord</strong> · Reliability · 17 May 2026</p>\n<p>Discord's engineering team had tripled in size and was drowning in a swamp of 'works on my machine' bugs — some engineers running macOS, some Ubuntu, all of them slowly. 
The solution was radical: no one gets a local dev environment anymore.</p>\n<ul>\n<li>3x engineering org growth</li><li>100% backend devs on CDEs</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote>\n<p><strong>🖥️</strong></p>\n<p>Discord's engineering organization <strong>tripled in size</strong> over a few years, and the Internal Developer Experience team was spending more time debugging niche, unreproducible, engineer-specific environment issues than actually improving the developer toolchain. The same code would fail on one MacBook and pass on another — and the DevEx team had no systematic way to fix that.</p>\n</blockquote>\n\n<p>For most of Discord's early history, backend engineers set up development environments on their personal laptops — primarily MacBooks, but some preferred Ubuntu. This dual-environment world was manageable when the engineering team was small and most people sat near each other in San Francisco. As Discord's product grew and the engineering organization tripled in headcount, the cracks became structural failures. <em>Homebrew</em> (a popular package manager for macOS that installs open source software but lacks guarantees of reproducibility across machines) upgrades would silently break an engineer's dev setup. A new team member would spend their first week not shipping code but untangling environment issues unique to their laptop. 
The tooling team accumulated a growing backlog of one-off tickets that amounted to: <strong>your environment is subtly different from everyone else's, and we have to figure out why.</strong> There was no single source of truth for what a correct Discord development environment looked like.</p>\n<h3>The Decision: Eliminate Local Environments Entirely</h3>\n<p>The solution the DevEx team landed on was radical in its simplicity: stop maintaining two local environments and move all backend and infrastructure development to a single Linux-based <em>Cloud Development Environment</em> (a remote machine running in a cloud provider's data center that developers access via their editor's remote extension, giving them full Linux capabilities without managing local hardware). Discord evaluated <em>Coder</em> (an open-source platform for creating and managing cloud development environments at scale, providing templated workspace provisioning, lifecycle management, and developer access controls) in late 2020 — a natural fit given that Coder's team were avid Discord users. The alignment was easy: Discord was already a heavy Kubernetes user, and Coder's V1 product was entirely Kubernetes-native. The partnership began, and Discord started the first of what would turn out to be <span>two separate migrations</span>.</p>\n\n<h3>Problem</h3>\n<h4>Two Environments, Infinite Edge Cases</h4>\n<p>As Discord's engineering organization tripled in size, the DevEx team found itself firefighting unreproducible environment issues specific to individual MacBooks. Brew upgrades broke setups silently. Ubuntu engineers had subtly different library versions. 
No environment was truly identical to another, making debugging a game of 'is this a code bug or a local environment bug?'</p>\n<hr />\n<h3>Cause</h3>\n<h4>Scale Broke the MacBook Model</h4>\n<p>The <em>SDLC</em> (Software Development Lifecycle — the full process from writing code to shipping it, including build, test, and deployment) had grown too complex for unmanaged local environments. Discord's backend requires a highly complex environment with many moving parts. Running it inside <em>Sysbox</em> (a container runtime that allows running full operating systems inside Docker containers by emulating kernel features) under Coder V1 on Kubernetes introduced layers of virtualization that were difficult to debug when things went wrong.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Coder V1 → V2: From Containers to VMs</h4>\n<p>The initial Coder V1 migration moved engineers to Kubernetes-based container environments. Networking latency and frequent disconnections plagued engineers outside San Francisco. In 2023, Discord migrated to Coder V2, which replaced the Kubernetes-container model with full VMs using <strong>Tailscale and WireGuard</strong> for networking — dramatically more stable and performant.</p>\n<hr />\n<h3>Result</h3>\n<h4>Zero Support Tickets About Connection Drops</h4>\n<p>After V2, Discord stopped receiving support tickets and questions about high latency and connection drops entirely. Engineers reported that development felt faster and smoother. The DevEx team stopped spending time on 'works on my machine' debugging and started spending time on tooling improvements that actually moved the needle.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The V1 Kubernetes Problem: Layers on Layers</p><p>Coder's V1 product ran development environments in Docker containers orchestrated by Kubernetes. 
Discord quickly found that developing inside <strong>Sysbox containers on Kubernetes</strong> introduced so many layers of virtualization — container runtime, kernel emulation, cloud networking — that debugging environment failures became genuinely difficult. When something broke, the question was always: is this a Discord bug, a Coder bug, a Kubernetes bug, or a network issue? The layers made attribution nearly impossible and resolution slow.</p>\n</blockquote>\n\n<p>The V1 to V2 migration fixed the worst problems. Coder's V2 abandoned the Kubernetes-container model and delivered full <em>virtual machine</em> (a software emulation of a complete computer that runs inside a cloud provider's physical host, giving tenants full OS access and eliminating container layering complexity) provisioning instead, giving engineers direct access to the Linux host without the indirection of container runtimes. The networking stack was rewritten to use <strong>Tailscale</strong> and <strong>WireGuard</strong> — a <em>mesh VPN</em> (a peer-to-peer virtual private network where devices connect directly to each other rather than through a central gateway, reducing latency and eliminating bottlenecks) approach where developer machines connect directly to their cloud VMs via encrypted tunnels. The combination of VM simplicity and direct WireGuard networking eliminated the latency and stability issues that had made V1 frustrating for engineers outside the Bay Area office.</p>\n<blockquote>\n<p><strong>THE PANDEMIC TIMING ADVANTAGE</strong></p>\n<p>Discord began its CDE migration in late 2020 — just as the pandemic forced most tech companies to distribute their engineering teams globally. Because Discord had already committed to the cloud dev environment path, they were <strong>better positioned than most</strong> to operate as a fully distributed engineering organization. 
Engineers could spin up identical development environments from anywhere in the world without IT shipping them a configured MacBook. The organizational investment paid off in ways the team had not initially anticipated.</p>\n</blockquote>\n\n<p>One pragmatic concession emerged: frontend engineers who worked heavily with large HTML and JavaScript files found that the <span>network overhead of transferring those assets during live editing created noticeable latency</span> in their save-and-rebuild loops. The DevEx team made a deliberate exception — <strong>frontend work was excluded from the mandatory CDE migration</strong>, with those engineers continuing to develop locally on their MacBooks. This was not a failure of the approach; it was an honest acknowledgment that different workloads have different locality requirements, and optimizing for 100% consistency at the cost of 30% of engineers' daily experience is not engineering, it's ideology.</p>\n<blockquote>\n<p><strong>📁</strong></p>\n<p>The /home Directory Persistence Strategy</p><p>One of Discord's key architectural decisions for developer experience was keeping the <strong>/home directory persistent</strong> across VM restarts and template updates. Engineers could update templates and base images without losing their repositories, settings, personal tools, and workspace customizations. This made CDEs feel like a machine that belonged to them rather than an ephemeral container that could be wiped at any time.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Communication Gap They Would Fix</p><p>Despite all-hands announcements, early signaling, and a thorough beta period, Discord's DevEx team acknowledged they could have communicated the migration better. Engineers discovered gaps in documentation only after cutover — a pattern common in large infrastructure migrations. 
Asking hundreds of engineers to overhaul their entire development workflow is a major ask, and the team would invest more in change management communications next time.</p>\n</blockquote>\n\n<blockquote><p>Despite the challenges and the need for two migrations (Mac→V1→V2), our move to remote dev machines using Coder has been remarkably successful. The timing was fortuitous, as we embarked on this journey before the pandemic began.</p><p><em>— Denbeigh Stevens, Senior Software Engineer — via Discord Engineering Blog</em></p></blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Two-Migration Architecture</h3>\n<p>Discord's CDE story is not a clean one-migration success. It required <strong>two complete migrations</strong>: from MacBooks to Coder V1 (Kubernetes), then from Coder V1 to Coder V2 (VMs). This is worth naming directly because it reframes the story from 'we had a good idea and executed it' to 'we had a good idea, hit a wall, and had the organizational courage to do it again better.' 
The decision to rebuild entirely on VMs rather than patch the Kubernetes architecture was the right engineering call, and it required Discord to invest in yet another migration cycle even as engineers were still adjusting to the first one.</p>\n<ul>\n<li><strong>3x</strong> — Engineering organization growth that made local environment maintenance untenable — the problem was fundamentally one of scale, not tooling</li>\n<li><strong>0</strong> — Support tickets about network latency and connection drops received after migrating to Coder V2 with Tailscale/WireGuard networking</li>\n<li><strong>Mac→V1→V2</strong> — Two complete developer environment migrations over ~3 years — a reminder that platform migrations rarely go in a straight line</li>\n<li><strong>~100%</strong> — Backend and infrastructure engineers now on cloud development environments — frontend engineers retained local machines due to asset transfer latency</li>\n</ul>\n\n<blockquote>\n<p><strong>WHY VMS BEAT KUBERNETES CONTAINERS FOR DEV ENVIRONMENTS</strong></p>\n<p>Kubernetes is excellent for production workloads but creates friction as a development environment host. Running in containers adds <strong>layers of virtualization</strong> that complicate debugging, restrict host access needed for development tooling, and create networking abstractions that can introduce latency. VMs eliminate this friction: engineers get a real Linux machine with full host access, simpler networking, and predictable behavior. 
Coder V2's choice to move to VMs was the architectural insight that made Discord's CDE program successful.</p>\n</blockquote>\n\n<pre><code># Illustrative pseudo-template in Terraform style, not Discord's real configuration.\n# The resource type and attribute names below are simplified and hypothetical;\n# a real Coder V2 template combines a coder_agent with cloud-provider VM resources.\n\nresource \"coder_workspace\" \"discord_backend\" {\n  # Each engineer gets their own dedicated VM\n  name = \"${data.coder_workspace.me.name}-backend\"\n\n  # VM provisioned in cloud, not a container\n  instance_type = \"n2-standard-8\"  # 8 vCPU, 32GB RAM\n  disk_size_gb  = 100\n\n  # /home is persistent — survives template updates and restarts\n  # Engineers keep their repos, configs, and customizations\n  persistent_home = true\n\n  # Tailscale/WireGuard handles secure network tunnel\n  # Engineer's laptop &lt;--&gt; VM via encrypted peer-to-peer mesh\n  network = \"tailscale-mesh\"\n\n  # Standard Discord development image\n  image = \"discord/devenv:latest\"\n\n  # IAM via cloud provider — no VPN required\n  service_account = \"dev-env-sa@discord-dev.iam.gserviceaccount.com\"\n}\n\n# VS Code remote extension connects to this VM\n# Engineers experience it as if it were a local machine</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Champions: The Adoption Accelerator</p><p>Discord explicitly recruited <strong>CDE champions</strong> from across engineering departments — engineers who were enthusiastic about the new tooling and willing to beta-test it early. These champions provided a diversity of feedback from different daily development loops (backend, infrastructure, mobile), helping the DevEx team surface issues that a homogeneous beta group would have missed. 
Identifying and empowering internal champions is one of the most effective change management tactics for large-scale developer tooling migrations.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Immutability: The End of 'Works on My Machine'</p><p>The structural benefit of cloud development environments is <strong>immutability and reproducibility</strong>. Every engineer's environment is provisioned from the same base image and template. When the DevEx team fixes a bug or adds a tool, it ships to everyone simultaneously via image update. No more per-engineer debugging sessions. No more Homebrew version drift. No more 'what's your local Python version?' — the question itself ceases to be meaningful.</p>\n</blockquote>\n\n<p>The remaining engineering challenge after V2 was building better tooling to understand network conditions under different global setups. Discord's engineers are distributed across the US and internationally, and while Tailscale/WireGuard dramatically improved average-case latency, <span>the DevEx team admitted they didn't build enough diagnostics during migration to understand the worst-case network experience</span>. Those tools came after go-live rather than before — a gap they noted they would prioritize earlier next time. The lesson is subtle: when you migrate to a networked development environment, network observability for the developer experience layer must be treated as a first-class requirement, not an afterthought.</p>\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>Template Updates Without Workspace Rebuilds</p><p>One of the most underappreciated features of Discord's CDE setup is the ability to <strong>update base templates and images without requiring engineers to rebuild their workspaces from scratch</strong>. Because the /home directory is persistent, an image update ships the new tooling to every engineer's VM while leaving their repositories, configurations, and in-progress work completely untouched. 
This makes infrastructure maintenance feel less like a forced migration and more like an automatic software update.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<h3>Before: Local MacBook/Ubuntu Dev Environment (Non-Reproducible)</h3>\n<p><a href=\"https://techlogstack.com/explore/discord-cloud-dev-environments-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Coder V2 Cloud Development Environment Architecture</h3>\n<p><a href=\"https://techlogstack.com/explore/discord-cloud-dev-environments-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>TAILSCALE + WIREGUARD: WHY THIS NETWORKING WON</strong></p>\n<p>Traditional VPN solutions route all traffic through a central gateway — every packet from a distributed engineer's laptop goes hub-and-spoke to HQ before reaching the dev VM. <strong>Tailscale's mesh networking</strong> routes traffic directly peer-to-peer between the engineer's machine and their cloud VM via encrypted <em>WireGuard</em> (a modern, minimal VPN protocol built directly into the Linux kernel, providing lower latency than OpenVPN or IPSec with a drastically smaller codebase) tunnels. For globally distributed engineers, the difference in feel is enormous: direct-path WireGuard feels like working on a local machine; gateway VPNs feel like working through treacle.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Frontend Exception: A Pragmatic Carve-Out</p><p>Not all workloads are equal in a networked dev environment. Discord's frontend engineers work with <strong>large HTML/JS asset bundles</strong> whose save-and-rebuild loop requires frequent large file transfers across the network. 
The latency was noticeable enough to hurt developer experience, so frontend development was explicitly <strong>kept on local machines</strong>. This was not a failure — it was an honest architectural boundary that preserved developer happiness for a workload where locality genuinely mattered.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📦</strong></p>\n<p>Immutability as Infrastructure</p><p>By running development environments as <strong>templated VMs</strong> rather than managed laptop configurations, Discord transformed dev environment maintenance from a support burden into an infrastructure deployment problem — and infrastructure deployment is something engineering teams know how to do well. Template updates ship like container image updates. Rollbacks are possible. Every engineer gets the same foundation, and deviations from it are <em>explicit configuration</em>, not accidental drift.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The V1 Sysbox Trap: When Container-in-Container Breaks</p><p>Coder's V1 used <strong>Sysbox containers</strong> to simulate a full Linux environment inside Kubernetes pods. Developing Discord's complex backend required running Docker containers <em>inside</em> those containers — a level of nesting that Sysbox enabled but made debugging treacherous. When something broke, the failure could originate in the application code, the container runtime, Sysbox's kernel emulation, the Kubernetes networking layer, or the cloud provider. Four layers of virtualization made every incident investigation significantly harder than it needed to be.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Discord migrated their entire engineering development environment twice in three years. The first migration was necessary; the second was a correction. Both were worth doing. 
Here is what the journey teaches engineers who are considering similar moves.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Reproducibility is a prerequisite for scale.</strong> Local environments drift silently — Homebrew upgrades, OS patches, personal tooling installs — and the drift compounds as headcount grows. If your DevEx team spends more time on per-engineer environment debugging than on tooling improvements, that is the signal that you have outgrown local development.</li><li><span>02</span><div><em>Cloud Development Environments</em> (remote virtual machines hosted in a cloud provider that developers access via their editor's remote extension) do not magically fix everything — V1 on Kubernetes introduced its own complexity via container layering. Always validate that your chosen CDE solution matches your workload's complexity requirements, and be willing to migrate again if the first choice proves incorrect.</li><li><span>03</span><div><strong>Invest in network diagnostics before go-live, not after.</strong> When you move development to a networked environment, the developer experience is only as good as the network between laptop and VM. Discord built its latency diagnostics after go-live rather than before, leaving the team partially blind to worst-case experiences during the transition. Network observability for your CDE is a first-class requirement.</li><li><span>04</span><div>Recruit <strong>internal champions</strong> from diverse engineering disciplines before starting a large tooling migration. A homogeneous beta group of enthusiastic volunteers will miss the edge cases that matter to the median engineer. Champions from backend, infra, mobile, and data teams surface a diversity of failure modes early enough to fix them before you've annoyed the whole company.</li><li><span>05</span><div>Not all workloads belong in the cloud. 
Discord's <strong>frontend engineers stayed on local machines</strong> because asset transfer latency genuinely hurt their experience. Pragmatic carve-outs that acknowledge real workload differences are better engineering than dogmatic 100% migrations that quietly make a subset of people miserable.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Unexpected Pandemic Dividend</p><p>Discord began the CDE migration in late 2020 — right as the pandemic forced most tech companies to scramble for distributed-work solutions. Because Discord had already invested in cloud dev environments, their engineers could work from anywhere in the world with the same environment as their colleagues. What looked like an infrastructure investment turned out to be a resilience investment too.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE COST OF DOING IT TWICE</strong></p>\n<p>Discord needed to migrate twice because the first migration solved the wrong problem at the infrastructure level. Kubernetes containers gave them cloud hosting but not true machine equivalence. VMs gave them the latter. The lesson: <strong>validate your CDE architecture on a small cohort of power users before committing the whole organization</strong>, with explicit evaluation criteria around host access, networking latency, and debuggability — not just 'it runs Linux.'</p>\n</blockquote>\n\n<blockquote><p>They asked 'what if we just gave every engineer the same Linux box in the cloud?' 
and then had to do it twice before the cloud box stopped lying about its Wi-Fi signal.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/discord-cloud-dev-environments-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.416552+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Discord"]}, {"id": "https://techlogstack.com/explore/netflix-maestro-workflow-2024/", "url": "https://techlogstack.com/explore/netflix-maestro-workflow-2024/", "title": "Netflix Hit the AWS Instance Ceiling and Built a Workflow Engine That Scales Forever", "summary": "How Netflix hit the vertical scaling ceiling on its Meson workflow orchestrator and built Maestro — a horizontally scalable system handling 2 million jobs a day.", "content_html": "<p><strong>Netflix</strong> · Distributed Systems · 17 May 2026</p>\n<p>Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs — and running out of machine. Vertically scaling on AWS had a hard ceiling, and the workflows were doubling in size every year. 
The only way out was a complete architectural rethink.</p>\n<ul>\n<li>2M+ jobs/day at peak</li><li>100K+ jobs in a single workflow</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Netflix is not just a streaming platform — it is a data factory that runs thousands of <em>ML pipelines</em> (automated sequences of data processing, model training, and validation jobs that produce the recommendation algorithms and personalization signals driving Netflix's content strategy) and data engineering workflows every single day. Recommendation models, A/B test analyses, content quality pipelines, ad-targeting models — every one of these is a graph of interdependent jobs that must be orchestrated, retried on failure, scheduled on time, and monitored continuously. By 2020, Netflix was running all of this through <em>Meson</em>, an in-house workflow orchestrator built around a <strong>single-leader architecture</strong> that the team described as having achieved high availability — but at the cost of requiring continuous vertical scaling as usage grew.</p>\n<blockquote>\n<p><strong>📈</strong></p>\n<p>The number of workflows in Meson was <strong>doubling year over year</strong>, and the sizes of individual workflows were growing too — some containing tens of thousands of interdependent jobs. Vertical scaling on AWS had a hard physical ceiling: you can only get so many CPUs and so much RAM in a single EC2 instance type.</p>\n</blockquote>\n\n<p>The core problem with Meson was structural. A <em>single-leader architecture</em> (a system design where one node (the leader) is responsible for all decisions and coordination — providing simplicity and consistency but creating a vertical scaling bottleneck as load increases) means all orchestration decisions flow through one machine. For low-throughput systems, this is fine. 
For a platform running <strong>hundreds of thousands of daily ML workflows</strong> with projected 100% year-over-year growth, it meant the infrastructure team was perpetually fighting against the ceiling of what the largest available AWS instance type could handle. During peak usage — typically when multiple large training runs coincided with end-of-month reporting pipelines — <span>Meson experienced slowdowns that required on-call engineers to closely monitor the system, especially during off-hours</span>. The system was not broken, but it was clearly approaching the limits of how it was designed.</p>\n<blockquote>\n<p><strong>THE VERTICAL SCALING WALL</strong></p>\n<p>AWS instance types top out. In 2020, Netflix's Meson orchestrator was approaching the compute limits of the largest available EC2 instances. The team could keep picking bigger and bigger machines, but <strong>the ceiling was visible and the timeline to hit it was predictable</strong>. Horizontal scaling — distributing load across many commodity machines — was the only architectural solution that could match indefinitely growing workflow volumes.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Meson's Single Leader Under Load</h4>\n<p>During peak usage, Meson's single-leader architecture struggled under the combined load of hundreds of thousands of daily jobs. On-call engineers were monitoring the orchestrator during off-hours to prevent cascading delays. The system worked — but it required human attention that scale demands cannot sustain.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Vertical Scaling Has a Ceiling</h4>\n<p>Meson's single-leader model meant all orchestration state — job status, dependency tracking, retry logic, scheduling — lived on one machine. 
As Netflix's <em>DAG</em> (Directed Acyclic Graph — a workflow representation where jobs are nodes and dependencies are directed edges, ensuring no circular dependencies) workloads grew 100% year-over-year, vertical scaling was no longer a strategy; it was a countdown.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Maestro: Horizontal Orchestration at Any Scale</h4>\n<p>Netflix designed Maestro from first principles for horizontal scalability. The architecture decomposed orchestration into independent stateless workers, event-driven step execution, and purpose-built state management. Hundreds of thousands of workflows migrated from Meson to Maestro with minimal user disruption, and by 2024 Maestro was open-sourced under Apache 2.0.</p>\n<hr />\n<h3>Result</h3>\n<h4>2 Million Jobs Per Day, No Ceiling in Sight</h4>\n<p>Maestro handles 2 million jobs on busy days across hundreds of thousands of workflows, with support for individual workflows containing hundreds of thousands of jobs. Scaling is now horizontal — add more workers, handle more load — without the system design itself becoming a constraint.</p>\n<hr />\n\n<blockquote><p>Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflows, and conditional branches.</p><p><em>— Jun He, Natallia Dzenisenka et al. — via Netflix Technology Blog</em></p></blockquote>\n<h3>What Makes Maestro Different</h3>\n<p>Traditional workflow orchestrators like Apache Airflow assume that workflows are <em>DAGs</em> (Directed Acyclic Graphs — graphs where edges have direction and no cycles exist, meaning no job can eventually depend on itself). 
Netflix's data engineering reality is messier: some workflows are genuinely cyclic, some need foreach loops that spawn thousands of child workflow instances, some need conditional branching that determines entire execution paths at runtime. Maestro was built to support all of these natively within the engine, rather than requiring users to simulate them with workarounds. The foreach pattern is particularly powerful: each iteration is internally treated as a separate workflow instance, scaling identically to any other Maestro workflow. A single foreach over a thousand content items spawns <strong>a thousand parallel workflow instances</strong>, each tracked independently, each retried independently, each reportable independently.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Workflow-as-a-Service: Abstracting Infrastructure from Users</p><p>Maestro is designed as a <strong>fully managed workflow-as-a-service</strong> for Netflix's data practitioners — data scientists, ML engineers, content producers, and business analysts. Users define their business logic in Docker images, notebooks, bash scripts, SQL, or Python, and Maestro handles scheduling, queuing, dependency resolution, retries, and monitoring. Users never configure infrastructure. The engineering investment lives entirely in the platform layer, freeing thousands of practitioners to focus on what their workflows do rather than how they run.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🎯</strong></p>\n<p>Strict SLOs During Traffic Spikes</p><p>Maestro is designed to maintain its <strong>service level objectives even during spikes in workflow submission traffic</strong>. This matters because Netflix's data platform is non-uniform: large content drops, end-of-quarter analyses, and major live events all generate burst workflow submissions that can be orders of magnitude above the baseline. 
A workflow orchestrator that degrades its own SLOs during the exact moments when reliability matters most is worse than useless.</p>\n</blockquote>\n\n<p>Netflix's Meson-to-Maestro migration was deliberate about user disruption. The team migrated <strong>hundreds of thousands of workflows on behalf of users</strong> — users didn't rewrite anything. The platform team built migration tooling, validated that migrated workflows produced equivalent outputs, and performed the migration atomically for each user's pipeline. This is a crucial organizational point: the people running the workflows were data scientists and analysts, not infrastructure engineers. Asking them to relearn an orchestration system and rewrite their pipelines would have consumed months of productive time across the company. By owning the migration completely, the Maestro team enabled a seamless transition that felt, from the user perspective, like <span>a platform upgrade that just worked</span>.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The On-Call Tax of Vertical Scaling</p><p>Meson's vertical scaling strategy created an operational pattern where <strong>engineers had to manually monitor the orchestrator during off-hours</strong> around peak usage — particularly when end-of-quarter reporting pipelines coincided with large ML training runs. This on-call tax is a leading indicator of approaching architectural limits. When your infrastructure requires human attention to avoid failure during predictable peak events, the architecture is telling you it cannot scale further without change.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Architecture: From Single Leader to Distributed Workers</h3>\n<p>Maestro's architecture replaces Meson's single-leader with three independently scalable services. 
The <strong>Workflow Engine</strong> manages the full lifecycle from definition through execution — it is the core state machine that tracks which steps have completed and which are ready to run. The <strong>Step Runtime Workers</strong> are stateless executors that pick up individual step executions from a queue, run them on the appropriate compute engine (Spark, Trino, Kubernetes, etc.), and report results back. The <strong>Signal Service</strong> enables event-driven orchestration — workflows can wait for external signals (data availability, upstream pipeline completion) rather than polling or using fixed schedules. Crucially, each of these three layers <strong>scales independently</strong>: need more throughput on step execution? Add more workers. Need more scheduling capacity? Scale the engine tier. No single bottleneck.</p>\n<ul>\n<li><strong>2M+</strong> — Jobs executed on peak days — a figure that would have been impossible on Meson's single-leader architecture without continuous vertical scaling emergencies</li>\n<li><strong>100K+</strong> — Jobs supported within a single workflow — enabled by foreach's spawning of independent sub-workflow instances rather than a flat DAG with 100K nodes</li>\n<li><strong>∞ scale</strong> — Horizontal scaling: add more worker nodes to handle more load — the ceiling that constrained Meson is architecturally eliminated in Maestro</li>\n<li><strong>~0 disruption</strong> — User-visible disruption during migration — the platform team migrated hundreds of thousands of workflows on behalf of users with minimal interruption</li>\n</ul>\n\n<pre><code>// Simplified Maestro workflow definition example\n// Users define their logic; Maestro handles all execution mechanics\n{\n  \"workflow_id\": \"ml_model_training_pipeline\",\n  \"steps\": [\n    {\n      \"id\": \"data_prep\",\n      \"type\": \"spark\",          // run on Netflix's Spark cluster\n      \"image\": \"netflix/etl:v2.3\",\n      \"dependencies\": []        // no 
upstream dependencies — runs first\n    },\n    {\n      \"id\": \"feature_engineering\",\n      \"type\": \"foreach\",         // Maestro native pattern: spawn parallel sub-workflows\n      \"items\": \"${data_prep.output.segments}\",  // iterate over upstream output\n      \"steps\": [\n        // each item spawns its own workflow instance — scales to thousands\n        { \"id\": \"compute_features\", \"type\": \"spark\" }\n      ],\n      \"dependencies\": [\"data_prep\"]\n    },\n    {\n      \"id\": \"model_train\",\n      \"type\": \"kubernetes\",       // different compute engine — Maestro routes it\n      \"image\": \"netflix/trainer:v1.1\",\n      \"dependencies\": [\"feature_engineering\"]\n    }\n  ],\n  // Signal-based trigger: run when upstream data is ready, not on a fixed clock\n  \"trigger\": { \"type\": \"signal\", \"signal_name\": \"daily_data_ready\" }\n}</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why Not Airflow, Temporal, or Conductor?</p><p>The Netflix team evaluated off-the-shelf alternatives before building Maestro. <strong>Apache Airflow</strong> lacked native support for cyclic workflows and struggled with Maestro's target scale. <strong>Netflix Conductor</strong> was an option but offered more state-engine features than required, and its overhead was disproportionate to the need. <strong>Temporal</strong> was optimized for inter-process orchestration via external service calls — at Maestro's million-tasks-per-day scale with many long-running workflows, coupling the DAG engine to an external service call introduced unnecessary reliability weak spots. The team concluded that for their specific requirements — lightweight state transitions at massive scale — a purpose-built engine was the right call.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE OPEN SOURCE DECISION</strong></p>\n<p>In July 2024, Netflix open-sourced Maestro under the <strong>Apache 2.0 license</strong>. 
The decision reflected confidence in the system's production maturity after 4+ years of internal operation and Netflix's long tradition of contributing infrastructure tooling to the engineering community. Maestro joins a list of Netflix open-source projects — Chaos Monkey, Eureka, Hystrix — that have shaped how the industry thinks about distributed systems. The GitHub repository includes the full Java 21 + Gradle codebase and curl-based quickstart examples.</p>\n</blockquote>\n\n<p>The signal service is one of Maestro's most operationally valuable features. In traditional time-based scheduling, a pipeline runs at midnight regardless of whether the data it depends on is ready. If upstream processing is slow, the pipeline waits, fails, or wastes compute resources scanning for data that isn't there yet. <strong>Signal-based triggers</strong> invert this: Maestro listens for an event published when upstream data is confirmed ready, and only then starts the dependent workflow. For Netflix's data platform, where hundreds of pipelines have complex dependency chains on upstream tables and streams, this eliminates entire categories of pipeline failures caused by timing assumptions that don't hold under variable upstream load.</p>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Zero-Downtime Migration Strategy</p><p>Maestro's team built tooling to migrate Meson workflows in place, validating equivalence of outputs before switching. Each workflow's migration was atomic: Meson and Maestro ran side-by-side during transition, with a configurable percentage router determining which system executed any given workflow instance. Users saw no interruption to their pipelines during the migration period.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔌</strong></p>\n<p>Pluggable Compute Backends</p><p>Maestro's step runtime workers are designed to be <strong>compute-engine agnostic</strong>. 
A step definition specifies the engine (Spark, Trino, Kubernetes, Python) and the business logic image; the worker handles routing, execution, and result reporting without the workflow engine knowing which compute platform ran the job. This loose coupling means Netflix can add new compute backends — or retire old ones — without touching workflow definitions.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Meson's architecture concentrated all orchestration intelligence in a single process: state management, scheduling, dependency tracking, retry logic, and compute routing all ran in one JVM. This was comprehensible and debuggable but fundamentally unscalable. Maestro separates these concerns across three independently deployable services, with <em>Apache Kafka</em> (a distributed event streaming platform used by Netflix for high-throughput, fault-tolerant message passing between Maestro's internal services and to external downstream consumers) as the message bus connecting them and <em>PostgreSQL</em> (the relational database Maestro uses for durable workflow state storage, chosen for its ACID guarantees which are essential for exactly-once execution semantics) as the durable state store.</p>\n<h3>Meson (Before): Single-Leader Orchestration Architecture</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-maestro-workflow-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Maestro (After): Horizontally Scalable Distributed Orchestration</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-maestro-workflow-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>FOREACH: SCALING TO 100K JOBS IN ONE WORKFLOW</strong></p>\n<p>Maestro's foreach step is the capability that makes truly massive workflows 
possible. When a foreach iterates over 10,000 content segments, Maestro internally creates <strong>10,000 independent sub-workflow instances</strong> — each scheduled, executed, monitored, and retried independently. From the user's perspective, it's a single foreach loop. From Maestro's perspective, it's 10,000 parallel workflow instances scaling identically to any other workflow in the system. This is not possible in flat DAG systems where 10,000 nodes would overwhelm the scheduler.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Exactly-Once Execution: The Deduplication Guarantee</p><p>Maestro's scheduler includes explicit deduplication logic — even if the scheduler fires a trigger multiple times for the same workflow (due to retries or race conditions), Maestro guarantees the workflow is executed exactly once. This is a critical property for <strong>ML training pipelines and financial reporting workflows</strong> where duplicate execution could produce incorrect results or waste significant compute resources. The guarantee is implemented via PostgreSQL's ACID transactions on workflow state transitions.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📡</strong></p>\n<p>External Event Publishing: Beyond Netflix</p><p>Maestro publishes execution events to external messaging systems — Kafka, SNS, SQS — enabling downstream consumers to react to workflow state changes. A data warehouse system can start preparing tables when a pipeline completes. A monitoring system triggers alerts on critical workflow failures. An analytics dashboard updates in real time. 
<strong>This event bus design transforms Maestro from an orchestrator into a first-class component of Netflix's data platform event fabric.</strong></p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The story of Maestro is ultimately a story about recognizing architectural limits before they become production disasters, and building the replacement with the same care that you'd apply to user-facing products. Platform infrastructure is a product too.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Single-leader architectures have a ceiling.</strong> They are simple to build and reason about, which makes them excellent starting points. But when your workload grows faster than vertical scaling can accommodate, the architecture itself becomes a constraint. Identify the ceiling early and plan the horizontal migration before you're forced to do it under pressure.</li><li><span>02</span><div>Platform migrations succeed when the platform team <strong>owns the migration completely</strong>, not when they ask users to rewrite their workloads. Netflix migrated hundreds of thousands of Meson workflows to Maestro on behalf of their users. This is the engineering investment that makes a platform adoption feel like a capability upgrade rather than a tax.</li><li><span>03</span><div><em>Signal-based triggers</em> (workflow start conditions driven by events like 'upstream data is available' rather than fixed clock times) eliminate entire categories of timing-based failures in data pipelines. If your pipelines fail regularly because upstream data isn't ready when the cron fires, replace the clock with a signal. 
The latency reduction and reliability improvement are both significant.</li><li><span>04</span><div><strong>Build platform capabilities as first-class engine features, not user-land workarounds.</strong> Maestro's foreach, subworkflow, and conditional branch are native engine constructs — not clever hacks layered on top of a DAG. When complex patterns are first-class, the engine can optimize them (parallel sub-workflows, independent retry, progress tracking) in ways that workarounds cannot.</li><li><span>05</span><div>Evaluate off-the-shelf solutions honestly and <strong>choose to build only when the custom requirements genuinely justify it</strong>. Netflix evaluated Airflow, Temporal, and Conductor before choosing to build Maestro. Their scale, cyclic workflow requirements, and strict SLO needs were genuinely outside what existing tools could provide. Building custom infrastructure for problems that existing tools solve well is waste; building it for problems they genuinely cannot solve is engineering.</li></ol>\n<blockquote>\n<p><strong>WHEN TO BUILD VS BUY</strong></p>\n<p>Netflix's evaluation of off-the-shelf orchestrators before building Maestro is a model for how platform teams should approach the build-vs-buy question. They documented specific technical requirements — <strong>cyclic workflow support, sub-hourly scheduling at million-job scale, strict SLO maintenance during spikes, horizontal scalability</strong> — and evaluated each option against those requirements. Only after finding genuine gaps in available tools did they commit to building. The evaluation process itself clarified the requirements enough to make the eventual design stronger.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Open Source Moment</p><p>Maestro's open-sourcing in July 2024 came after <strong>4+ years of production operation at Netflix scale</strong>. This timeline matters: Netflix didn't release an aspirational design or an early prototype. 
They released a battle-tested system with years of edge cases resolved, failure modes understood, and operational patterns documented. Open-sourcing after production hardening rather than before is a kindness to the community.</p>\n</blockquote>\n\n<blockquote><p>Netflix needed a workflow orchestrator that scaled with the universe — so they built one, hit the AWS ceiling, threw it away, and then open-sourced the second one.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/netflix-maestro-workflow-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.420865+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "Netflix"]}, {"id": "https://techlogstack.com/explore/slack-incident-2-22-22/", "url": "https://techlogstack.com/explore/slack-incident-2-22-22/", "title": "Slack's Worst Day: When a Better Cache Manager Made Everything Worse", "summary": "How a routine Consul agent upgrade triggered a textbook cascading failure at Slack, as a faster cache configuration manager amplified a cache-emptying tipping point.", "content_html": "<p><strong>Slack</strong> · Reliability · 17 May 2026</p>\n<p>On February 22, 2022, Slack went down for many users — including the engineer serving as Incident Commander, who later wrote the postmortem from personal experience. 
The culprit was a new component that worked exactly as designed.</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>Slack experienced a major incident on February 22 this year, during which time many users were unable to connect to Slack, including the author — which certainly made my role as Incident Commander more challenging!</p><p><em>— Laura Nolan, Senior Staff Engineer — via Slack Engineering Blog</em></p></blockquote>\n<p>The 2-22-22 incident at Slack is one of the cleanest documented examples of a <em>metastable failure</em> (a failure mode described in distributed systems research where a system settles into a stable degraded state from which it cannot recover without external intervention, even after the original trigger has passed) in production systems. It did not require a bug. It did not require bad code. It required a <em>Consul</em> (a service discovery and service mesh tool used by Slack to maintain a dynamic registry of which servers are healthy and serving traffic) rollout to hit a tipping point during peak traffic, a faster-than-previous cache configuration manager to amplify the resulting churn, and a single inefficient database query to become a load-amplifying avalanche. Every component was working exactly as designed. The system, however, was not.</p>\n<h3>The Architecture: Caches, Consul, and Mcrib</h3>\n<p>Slack's serving architecture routes requests through a web application layer that uses <em>Memcached</em> (a high-performance in-memory key-value cache used by Slack to store frequently accessed data and avoid repeated database queries) as its primary caching tier. A component called <em>Mcrouter</em> routes cache requests using <em>consistent hashing</em> (a routing algorithm that maps each cache key to a specific server using a hash ring, so cache lookups are predictable and cache warmth is preserved during small topology changes). 
A control plane called <strong>Mcrib</strong> watches <em>Consul</em> (the service discovery system that tracks which Memcached nodes are healthy) and updates Mcrouter's configuration whenever nodes appear or disappear from the service catalog. When a Memcached node leaves the catalog — even temporarily, during a restart — Mcrib replaces it with a fresh, empty spare node. The new node's cache is cold. Requests that would have been cache hits on the old node now miss and hit the <em>Vitess</em> (Slack's horizontally sharded MySQL system) database instead. Under normal circumstances, this is fine: node restarts are infrequent, cache warm-up is fast, and the load spike is transient.</p>\n<blockquote>\n<p><strong>⚙️</strong></p>\n<p>The new <strong>Mcrib component</strong> was described as objectively better than its predecessor: it was faster and more efficient at detecting downed Memcached instances and replacing them with spare nodes. It did exactly what it was designed to do. That efficiency was precisely why the incident was so severe.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Consul Rollout Hits Tipping Point</h4>\n<p>Slack was running a <em>percentage-based rollout</em> (a deployment strategy that applies a change to a fixed percentage of hosts at a time, intended to allow controlled testing before full rollout) of the Consul agent binary. Two 25% rollout steps the prior week had completed without incident. The third 25% step on February 22 — bringing total upgraded hosts to 75% — hit peak traffic and entered a cascading failure. Engineers saw user tickets, internal errors, and alerts firing simultaneously at 6am Pacific.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Cache Emptying Cascade</h4>\n<p>When a Consul agent restarts on a Memcached node, it briefly deregisters the node from the service catalog. Mcrib — the new, faster control plane — detects this immediately and replaces the departing node with an empty spare. 
As the rollout processed 25% of the fleet sequentially, cache nodes were continuously being emptied and replaced. Cache hit rates dropped. <em>Cache misses</em> (requests where the data is not in cache, forcing a database query to serve the response) flooded <em>Vitess</em> (Slack's sharded MySQL layer), particularly one keyspace containing channel membership data.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Throttle, Optimize, Drain</h4>\n<p>Recovery required three simultaneous interventions. First, <strong>client boot throttling</strong> reduced the incoming request rate to give the cache time to warm. Second, the problematic Group Direct Message (GDM) scatter query was optimized to only fetch missing data from Vitess instead of querying every shard. Third, engineers added Vitess replicas as read sources to distribute the database load. The system was in a metastable failure state — pausing the Consul rollout was not sufficient because the cascade was already self-sustaining.</p>\n<hr />\n<h3>Result</h3>\n<h4>Service Restored, Architecture Hardened</h4>\n<p>Slack recovered after engineers intervened to break the cascade. Long-term fixes included modifying Mcrib's control loop to avoid rapid consecutive node replacements, fixing the scatter query to read from a table sharded by user, and analyzing other high-volume queries backed by the cache tier for similar vulnerabilities.</p>\n<hr />\n\n<blockquote>\n<p><strong>THE METASTABLE STATE TRAP</strong></p>\n<p>Once Slack's system entered its cascading failure state, simply pausing the Consul rollout did not restore service. The system was already in a <strong>metastable state</strong> — a condition where the failure was self-sustaining: cache misses caused database load, database load caused slow responses, slow responses caused retries, retries caused more database load. The only exit was external intervention that changed the system state — throttling load or increasing capacity — not undoing the original trigger. 
This is the defining characteristic of metastable failures and the reason they are so dangerous.</p>\n</blockquote>\n\n<p>The GDM (Group Direct Message) scatter query was the specific weakness that turned a cache degradation into a database overload. This query listed GDM conversations per user, and crucially, it queried <strong>every shard in the Vitess keyspace</strong> even when most shards contained no relevant data for that user. Under normal conditions, this query's results were cached with a <span>long TTL</span> because GDM membership is immutable — so cache hits were nearly universal and the scatter pattern was rarely exercised. When the cache was systematically emptied by the Mcrib replacements, the scatter query began executing on the database at full scale for the first time under real load. <span>The keyspace became severely overloaded almost immediately.</span></p>\n<blockquote>\n<p><strong>❌</strong></p>\n<p>Client Retries: The Amplifier</p><p>Client retries, designed to recover from transient failures, become <span><strong>load amplifiers</strong></span> during sustained overload. When the Slack client receives a failure or timeout, it doesn't know whether the system is experiencing a transient local hiccup or a global overload — so it retries. During the 2-22-22 incident, automated retries with exponential backoff significantly increased database load during the window when the system needed space to recover. Backoff and jitter help, but they cannot fully counteract retries from millions of clients all experiencing the same global overload simultaneously.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>😅</strong></p>\n<p>The Author Was Also Affected</p><p>A detail that makes this postmortem unusually human: Laura Nolan, who wrote the postmortem, was also the <strong>Incident Commander during the event</strong> — and was personally unable to connect to Slack for portions of it. 
She was managing a Slack outage using a platform that wasn't working, making incident coordination substantially harder. The note is a small reminder that incident commanders are humans using the same systems they're trying to fix.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Two Prior Steps Passed Without Incident</p><p>The February 22 rollout was the <strong>third of three 25% steps</strong> — the prior two, executed the previous week, completed without any issues. This is a critical detail for understanding why the incident was surprising: the rollout process was validated, the previous steps were clean, and there was no signal that the third step would behave differently. The failure was not a process failure — it was a scale threshold phenomenon that only manifested at 75% fleet coverage during peak traffic.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Why GDM Membership Was Particularly Vulnerable</p><p>The Group Direct Message membership data had a <strong>long cache TTL</strong> because GDM membership is immutable under Slack's current application requirements — once you're in a GDM, you stay in it. This long TTL meant the cache was almost always warm, the scatter query rarely executed on the database, and the latent scalability issue was never observed under normal conditions. The safest-feeling queries can hide the most dangerous database access patterns.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Breaking the Cascade: Three Simultaneous Interventions</h3>\n<p>The critical insight of the 2-22-22 recovery was that the system was in a metastable state that could not be exited by simply reverting the original trigger. Stopping the Consul rollout was necessary but insufficient — the cache was already empty, the database was already overloaded, and client retries were already sustaining the load even with the rollout paused. 
The engineering team needed to change the system's state, not just stop what had changed it. This required <strong>reducing load from outside while simultaneously increasing the system's capacity to serve that load</strong>.</p>\n<ul>\n<li><strong>3</strong> — Simultaneous recovery interventions required: client throttling, query optimization, and adding Vitess replicas — none alone was sufficient</li>\n<li><strong>Metastable</strong> — Failure state — self-sustaining cascade where reverting the trigger does not restore service; requires active external intervention to change system state</li>\n<li><strong>25% × 3</strong> — Percentage-based rollout steps: two prior 25% steps passed without incident; the third hit a tipping point at peak traffic</li>\n<li><strong>GDM query</strong> — The specific scatter query that turned cache degradation into database overload — querying every Vitess shard even when most shards had no relevant data</li>\n</ul>\n\n<pre><code class=\"language-python\"># The problematic GDM (Group Direct Message) scatter query pattern\n# (conceptual — Slack uses Vitess with a MySQL dialect; db, ALL_VITESS_SHARDS,\n# and the table names are illustrative placeholders, not Slack's real schema)\n\n# BEFORE: scatter across ALL shards for a single user's GDM list\n# This runs on every shard in the keyspace regardless of data locality\ndef get_gdm_list_old(user_id):\n    results = []\n    # Vitess without scatter guard: queries all shards\n    for shard in ALL_VITESS_SHARDS:\n        shard_results = db.query(\n            shard,\n            \"SELECT * FROM gdm_memberships WHERE user_id = ?\",\n            user_id\n        )  # expensive: O(shards) database round-trips per user\n        results.extend(shard_results)\n    return results\n\n# AFTER: query only the shard that owns this user's data\n# The query now reads from a table sharded by user_id, so all of one\n# user's GDM rows live on a single shard\ndef get_gdm_list_fixed(user_id):\n    # Vitess routes this to the single owning shard via the table's vindex\n    return db.query(\n        \"SELECT * FROM gdm_memberships_sharded_by_user 
WHERE user_id = ?\",\n        user_id\n    )  # O(1) — one shard, one query\n\n# Long-term: always verify cache-backed queries can survive cache-cold load</code></pre>\n<blockquote>\n<p><strong>MCRIB'S EFFICIENCY PARADOX</strong></p>\n<p>The key lesson from Mcrib is architectural: <strong>a faster control loop for infrastructure changes can make a distributed system less safe</strong>, even if the control loop itself is correct. Mcrib was better than its predecessor at detecting and responding to Consul node departures — but that speed meant Memcached churn from the Consul rollout happened faster than the cache tier could recover. The fix was not to make Mcrib slower or less correct, but to add <strong>rate limiting on consecutive node replacements</strong> — ensuring that the cache tier never loses more than a bounded fraction of its warmth at any moment.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Long-Term Architecture Hardening</p><p>After the incident, Slack made permanent structural changes: modifying Mcrib's control loop to prevent rapid consecutive cache node replacements, rewriting the GDM scatter query to target the correctly sharded table, auditing all high-volume cache-backed queries for similar scatter vulnerabilities, and analyzing whether a brief network partition affecting cache nodes could trigger the same cascade (it could — and changes were made to protect against that too). The incident became a forcing function for systematic resilience improvements.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Testing Cascading Failure Recovery — Not Just Prevention</p><p>A question raised by this incident: do you test your system's ability to <strong>recover from metastable failure states</strong>, not just its ability to prevent them? Slack's game days had focused on preventing failures. But the 2-22-22 incident required active recovery from a state the prevention had failed to avoid. 
Testing the exit paths from failure is as important as testing the entry paths to it.</p>\n</blockquote>\n\n<p>The client boot throttle was the most immediately effective intervention. By artificially limiting the rate at which clients could complete their session initialization (<em>boot</em> (the process by which a Slack client initializes its state — fetching channel memberships, unread counts, and other data that the client caches locally for the session)), Slack reduced the volume of requests hitting the overloaded database tier. This bought time for the cache to begin refilling and for the query optimization to take effect. The mechanism that makes throttling effective in metastable failures is that it reduces the load sustaining the cascade — if you can reduce load below the system's current degraded capacity, it can begin recovering rather than staying pinned at the overload threshold.</p>\n<blockquote>\n<p><strong>🔒</strong></p>\n<p>Rate Limiting the Mcrib Control Loop</p><p>The architectural fix to Mcrib was conceptually simple: <strong>rate-limit how many cache nodes can be replaced within a rolling time window</strong>. This prevents a coordinated wave of Consul agent restarts from emptying the entire cache tier simultaneously. The trade-off is that node replacement is slightly slower during a real failure — but this is an acceptable delay in exchange for the cache tier never losing more than a bounded fraction of its warmth at once.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The 2-22-22 incident is best understood through the lens of how Slack's serving architecture handles data retrieval and cache topology. Requests to the webapp go through Mcrouter to Memcached first; only on a cache miss do they hit Vitess (Slack's sharded MySQL layer). Consul is the system that keeps Mcrouter and Mcrib informed about which Memcached nodes are alive. 
When Consul says a node left — even temporarily during a Consul agent restart — Mcrib responds by assigning a cold spare to replace it. The architecture was designed for resilience under normal node failures. It was not designed for a coordinated wave of Consul agent restarts that emptied cache at the exact moment peak traffic arrived.</p>\n<h3>Before: How a Consul Agent Restart Drains Cache (Single Node)</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-incident-2-22-22/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>The Cascade: 25% of Fleet Draining Simultaneously at Peak Traffic</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-incident-2-22-22/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>GRAY FAILURE VS HARD FAILURE</strong></p>\n<p>The Consul rollout itself was not a bug — it was a maintenance operation that had completed successfully twice before at 25% steps. The failure that emerged was a <strong>property of the whole system interacting across components</strong>: Consul's temporary deregistration behavior, Mcrib's fast replacement response, the cache's role in shielding the database, and the GDM query's implicit assumption of cache warmth. No single component was broken. The interaction between correct components under peak load at a specific scale threshold produced the cascade. This is precisely why cascading failures in distributed systems are so hard to prevent: they emerge from correctness, not from bugs.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Metastable Failures: The Academic Context</p><p>Laura Nolan's postmortem explicitly references the academic concept of metastable failures in distributed systems. 
The <strong>'Metastable Failures in Distributed Systems'</strong> research paper (Bronson et al.) describes how systems can enter stable degraded states where removing the triggering condition does not restore normal operation. Slack's incident is a near-perfect real-world illustration: once the cascade began, it was self-sustaining. This framing matters because it changes how you think about recovery — you're not reverting a change, you're <em>escaping a state</em>.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Fixed Architecture: Rate-Limited Replacements + Sharded Queries</p><p>After the incident, Slack made two structural changes to the architecture: Mcrib was modified to <strong>rate-limit consecutive node replacements</strong>, ensuring cache churn is bounded even during fleet-wide maintenance operations. The GDM membership query was rewritten to target a correctly sharded table, eliminating the scatter pattern. Together these changes make the system resilient both to Consul rollout-style churn and to cold cache conditions that might arise from other causes like network partitions affecting cache nodes.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The 2-22-22 incident is cited by distributed systems researchers and practitioners because it is so precisely documented and because its lessons are universal. No bugs. No negligence. Just the emergent behavior of correct components interacting at scale in a state that the individual components couldn't see.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>A faster infrastructure component is not always safer.</strong> Mcrib was a better system than its predecessor — but its efficiency amplified cache churn during the Consul rollout in a way the slower predecessor would not have. 
Whenever you improve the speed of a control loop that modifies infrastructure state, audit whether that speed creates new failure modes under coordinated changes.</li><li><span>02</span><div><em>Metastable failures</em> (failure states in distributed systems that are self-sustaining even after the original trigger is removed, requiring active external intervention to exit) cannot be fixed by reverting the trigger. Your incident playbooks need explicit recovery procedures for 'we are in sustained overload even though the cause has been addressed' — throttle incoming load, add capacity, or both. Waiting for the system to self-recover from a metastable state is not a strategy.</li><li><span>03</span><div><strong>Test your high-traffic cache-miss path before it is exercised under load.</strong> The GDM scatter query had never been exercised at full scale because the cache hit rate was near-perfect. When the cache emptied, a latent design flaw — querying all shards — became a severity-1 incident. Every high-frequency query protected by a cache should be load-tested against the scenario where the cache is cold.</li><li><span>04</span><div><em>Percentage-based rollouts</em> (deployment strategies that apply changes to a fixed fraction of infrastructure at a time to detect problems before full exposure) do not guarantee safety when the failure mode is a <strong>tipping-point cascade</strong>. The first two 25% Consul steps passed without incident; the third hit a threshold where the interaction between rollout churn and peak traffic became self-sustaining. Consider adding traffic-aware gates to rollout automation.</li><li><span>05</span><div><strong>Client retries are a double-edged sword.</strong> They are essential for recovering from transient failures but amplify load during sustained overload. Slack's clients used exponential backoff with jitter — the right design — but even well-designed retries from millions of clients contribute significantly to sustained overload. 
Design retry logic with a global-overload abort condition: if every retry attempt is failing, stop retrying.</li></ol>\n<blockquote>\n<p><strong>HOW TO ESCAPE A METASTABLE STATE</strong></p>\n<p>Slack's recovery from 2-22-22 provides a practical playbook for escaping metastable failures: <strong>Step 1 — Reduce incoming load</strong> (client boot throttle) to give the system breathing room below its degraded capacity. <strong>Step 2 — Eliminate the load amplification source</strong> (fix the scatter query) so each cache miss is less expensive. <strong>Step 3 — Increase capacity</strong> (add Vitess replicas) so the system can handle the remaining load while recovering. These three moves — load reduction, amplification removal, capacity addition — are the universal toolkit for escaping cascading failure states.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Reference: How Complex Systems Fail</p><p>Laura Nolan's postmortem references Richard Cook's <strong>'How Complex Systems Fail'</strong> — a foundational text in systems reliability. Cook's work describes how complex systems are never fully safe, how accidents involve multiple contributing factors, and how practitioners become expert in working the system's defenses. 
The 2-22-22 incident is a near-perfect illustration: multiple correct components, a scale-dependent tipping point, and the system's defenses (retries, caching) becoming contributors to the failure mode.</p>\n</blockquote>\n\n<blockquote><p>Mcrib was the best cache manager Slack had ever built — it responded so fast to node departures that it helped bring down the entire platform, which is a kind of achievement.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/slack-incident-2-22-22/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.476414+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Slack"]}, {"id": "https://techlogstack.com/explore/stripe-docdb-data-movement-2024/", "url": "https://techlogstack.com/explore/stripe-docdb-data-movement-2024/", "title": "How Stripe Moves Petabytes Between Database Shards Without Stopping the Money", "summary": "How Stripe built DocDB on MongoDB and a Data Movement Platform that migrated 1.5 petabytes in 2023 — while maintaining 99.999% availability for $1 trillion in paymen", "content_html": "<p><strong>Stripe</strong> · Databases · 17 May 2026</p>\n<p>Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. 
The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.</p>\n<ul>\n<li>99.999% uptime achieved</li><li>5M database queries/sec</li><li>1.5 PB migrated in 2023</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<ul>\n<li><strong>$1T+</strong> — Payment volume processed by Stripe in 2023 — making their database reliability requirements some of the most demanding in the industry</li>\n<li><strong>99.999%</strong> — Uptime achieved — five nines means less than 5.26 minutes of total downtime per year across all Stripe APIs and payment processing</li>\n<li><strong>5M QPS</strong> — Database queries per second sustained across Stripe's DocDB fleet — comparable to some of the largest databases in the world</li>\n<li><strong>1.5 PB</strong> — Data migrated between shards in 2023 alone using the Data Movement Platform — transparently to all applications</li>\n</ul>\n\n<p>When Stripe launched in 2011, they chose <em>MongoDB</em> (a document-oriented NoSQL database that stores data as flexible JSON-like documents rather than fixed relational table schemas, offering developer productivity advantages for rapid iteration) because it was more developer-friendly than standard relational databases for a fast-moving startup. Over the next decade, as Stripe grew from a startup to a financial infrastructure company processing trillion-dollar payment volumes, the team built a layer on top of MongoDB that they call <em>DocDB</em> — a <em>Database-as-a-Service</em> (an abstraction layer that gives application developers a simple API for data access while hiding all the complexity of sharding, replication, failover, and migrations beneath it). 
DocDB handles <em>horizontal sharding</em> (a database scaling technique that distributes data rows across multiple independent database instances (shards) based on a partition key, so no single instance holds all the data and traffic is distributed) across thousands of shards, manages replication for high availability, and — crucially — enables the zero-downtime data migrations that allow Stripe's database fleet to scale continuously without ever taking payments offline.</p>\n<p>The central innovation of DocDB is its <strong>Data Movement Platform</strong> — a system that can migrate chunks of data between shards while both the source and target shards continue serving live production traffic. This capability is essential for Stripe's operations: when fast-growing merchants fill up a shard, that shard needs to be split. As the fleet evolves and some shards become underutilized, they can be consolidated. When a new MongoDB version is released, shards can be upgraded by <em>fork-lifting</em> (migrating data to a new instance running the target version, avoiding multi-step in-place upgrades that pass through each intermediate version) to the new version. All of these operations have one requirement in common: <strong>Stripe cannot stop accepting payments while they happen</strong>.</p>\n<blockquote>\n<p><strong>THE FIVE NINES CONSTRAINT</strong></p>\n<p>99.999% uptime means <strong>less than 5.26 minutes of downtime per year</strong>. For a payment processor, downtime is not just an SLA violation — it's merchants unable to complete sales, customers unable to pay, and real-time revenue loss for the businesses Stripe serves. Every database operation — migration, split, consolidation, upgrade — must happen transparently. 
The constraint is absolute: there is no maintenance window at Stripe's scale.</p>\n</blockquote>\n\n<h3>The Six-Step Migration Protocol</h3>\n<p>The Data Movement Platform executes every shard migration through a six-step protocol: (1) register the migration plan in the <em>chunk metadata service</em> (a central catalog that tracks which data chunks live on which shards — the source of truth for query routing across Stripe's fleet), (2) build indexes on the target shard before data arrives (avoiding the performance penalty of indexing after a large data load), (3) bulk-copy a snapshot of the chunk from source to target, (4) stream async replication to apply changes made on the source since the snapshot was taken, (5) perform correctness checks to verify data consistency, (6) switch traffic to the target and deregister the chunk from the source. Steps 3 and 4 were where Stripe hit unexpected engineering challenges — and where the most creative solutions emerged.</p>\n\n<h3>Problem</h3>\n<h4>Shard Splits and Consolidations Required Downtime</h4>\n<p>Without the Data Movement Platform, scaling Stripe's database fleet required either accepting downtime during shard operations or building complex dual-write logic for every migration. As Stripe's fleet grew to thousands of shards, this was operationally unsustainable and created real risk for every migration event.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Financial Data Cannot Tolerate Inconsistency</h4>\n<p>Payment data has zero tolerance for consistency errors — a payment record that exists on the source shard but hasn't yet appeared on the target is a payment that could be double-charged, lost, or corrupted if traffic switches at the wrong moment. 
The six-step protocol was designed specifically to guarantee that by the time traffic switches, the target is <strong>exactly consistent</strong> with the source, including all writes made during migration.</p>\n<hr />\n<h3>Solution</h3>\n<h4>CDC Replication + Correctness Verification</h4>\n<p>Stripe solved the consistency problem using <em>Change Data Capture</em> (a technique that continuously reads the MongoDB operation log (oplog) to stream every write applied to the source shard to the target, keeping the target synchronized even as live traffic modifies the source data) streaming from the source shard's oplog. After CDC replication catches up to near-real-time, correctness checks compare source and target before traffic is switched. The switch itself is atomic from the application's perspective.</p>\n<hr />\n<h3>Result</h3>\n<h4>1.5 Petabytes Moved in 2023 Transparently</h4>\n<p>In 2023 alone, Stripe migrated 1.5 petabytes of data between shards, consolidated thousands of databases through bin packing, and upgraded the entire MongoDB fleet — all with zero application downtime and no payment processing interruptions. 99.999% uptime was maintained throughout.</p>\n<hr />\n\n<blockquote><p>DocDB's ability to migrate data between shards in a consistent, granular and reliable way has made it significantly easier for Stripe to scale.</p><p><em>— Jimmy Morzaria, Suraj Narkhede — via Stripe Engineering Blog, June 2024</em></p></blockquote>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Bulk Load Throughput Problem</p><p>Step 3 of the migration — bulk loading a snapshot of the chunk onto the target shard — hit a <strong>significant throughput limitation</strong> during testing. Stripe's engineering team tried batching writes and tuning DocDB engine parameters, but neither approach resolved the bottleneck. 
The root cause was an impedance mismatch between the bulk loader and the target shard's write path: the target shard was not optimized for sequential ingestion at high speeds. The engineering team eventually solved this by building purpose-built bulk import tooling with different I/O patterns than the standard DocDB write path.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🗃️</strong></p>\n<p>Stripe manages <strong>thousands of DocDB shards</strong> — and periodically performs bin-packing consolidations where underutilized shards are merged to reduce operational overhead and hardware costs. In 2023 they reduced the total number of underlying DocDB shards by approximately three-quarters through such consolidation, migrating 1.5 petabytes of data in the process.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⬆️</strong></p>\n<p>The Fork-Lift Upgrade Strategy</p><p>Traditional in-place database major version upgrades require going through each intermediate version sequentially — upgrading from MongoDB 4.0 to 5.0 to 6.0, for example, each step requiring careful validation. Stripe's Data Movement Platform enables a <strong>fork-lift strategy</strong>: provision a new shard running the target version, migrate the data to it, switch traffic, decommission the old shard. Any version can jump to any other version in a single migration step. This eliminates the risk accumulation of multi-step in-place upgrades.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>DocDB: Not a Rewrite, an Extension</p><p>A key decision in Stripe's database evolution was building DocDB <strong>on top of MongoDB Community</strong> rather than replacing MongoDB with a different database. This preserved compatibility with existing application code, the existing data model, and years of operational knowledge. The extensions — sharding, proxy routing, migration tooling — were added as a platform layer, not a fork. 
This pragmatic approach to building on existing foundations rather than starting from scratch is characteristic of Stripe's infrastructure philosophy.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>DocDB Architecture: The Database-as-a-Service Abstraction</h3>\n<p>DocDB's architecture is a three-tier system sitting between Stripe's application code and raw MongoDB instances. The <strong>Database Proxy</strong> is the entry point for all application read/write requests — it performs access control checks, validates queries, and routes requests to the correct shard by consulting the chunk metadata service. The <strong>Chunk Metadata Service</strong> maintains the authoritative map of which data chunks live on which shards. The <strong>Database Shards</strong> are replicated MongoDB instances that store the actual data. Applications talk only to the proxy; they are completely unaware of sharding, shard splits, or migrations in progress.</p>\n<pre><code class=\"language-python\"># Simplified 6-step Data Movement Platform migration flow\n# Each step is atomic and resumable — migrations can be paused and continued\n\nclass DataMovementPlatform:\n    def migrate_chunk(self, chunk_id: str, source_shard: str, target_shard: str):\n        # Step 1: Register migration plan — makes migration visible to monitoring\n        self.chunk_metadata.register_migration(\n            chunk_id=chunk_id, \n            source=source_shard,\n            target=target_shard\n        )\n        \n        # Step 2: Pre-build indexes on target BEFORE data arrives\n        # Avoids the performance penalty of indexing a large loaded dataset\n        self.build_indexes_on_target(target_shard, chunk_id)\n        \n        # Step 3: Bulk copy snapshot at time T\n        # Uses purpose-built I/O patterns for high-throughput sequential writes\n        snapshot_timestamp = self.bulk_copy_snapshot(chunk_id, source_shard, target_shard)\n        \n        # Step 4: Stream CDC 
replication — catch up all writes since snapshot\n        # Reads MongoDB oplog on source; applies to target until near-real-time\n        self.cdc_replicate_to_target(\n            source_shard, target_shard, since=snapshot_timestamp\n        )\n        \n        # Step 5: Correctness verification — compare source and target\n        # Financial data requires full consistency before any traffic switch\n        # Raise explicitly instead of using assert, which is skipped under python -O\n        if not self.verify_consistency(chunk_id, source_shard, target_shard):\n            raise RuntimeError(f'consistency check failed for chunk {chunk_id}')\n        \n        # Step 6: Atomic traffic switch — update chunk metadata, switch routing\n        self.chunk_metadata.set_active_shard(chunk_id, target_shard)\n        # Applications querying the chunk now get routed to target\n        # Deregister from source after confirmation\n        self.chunk_metadata.deregister_from_source(chunk_id, source_shard)</code></pre>\n<blockquote>\n<p><strong>BIN-PACKING: REDUCING THE FLEET BY 75%</strong></p>\n<p>In 2023, Stripe used the Data Movement Platform to <strong>bin-pack thousands of underutilized shards</strong> into a smaller number of larger shards. Bin-packing is the reverse of splitting: instead of one shard becoming two, many small shards are consolidated into fewer shards with more data. This reduced the total number of DocDB shards by approximately 75% while moving 1.5 petabytes — dramatically reducing operational overhead and hardware costs without any application code changes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Multitenant to Single-Tenant: Isolation on Demand</p><p>DocDB supports migrating a large merchant's data from a <strong>shared multitenant shard</strong> (multiple merchants on one shard) to a <strong>dedicated single-tenant shard</strong> (one merchant per shard). 
This is done transparently via the Data Movement Platform: the merchant's data is migrated to a dedicated shard, traffic routing is updated atomically, and the merchant gets dedicated resources without any downtime or visible change in behavior. This capability is increasingly important as Stripe's largest customers grow to Shopify, Amazon, and OpenAI scale.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Heat Management System: Next Chapter</p><p>At the time of the June 2024 blog post, Stripe was <strong>prototyping a heat management system</strong> that proactively balances data across shards based on real-time access patterns. Rather than waiting for a shard to become a bottleneck and then splitting it reactively, the heat management system would detect access pattern shifts and pre-emptively migrate hot data to shards with more capacity. Reactive sharding at Stripe's scale will eventually give way to predictive sharding.</p>\n</blockquote>\n\n<p>Correctness verification (Step 5) is the most cautious part of the migration protocol, and deliberately so. The platform compares a sample of records between source and target shards after CDC replication has caught up. For financial data, even a single inconsistency before the traffic switch would be unacceptable — a payment that exists on the source but not on the target could be double-charged or lost if the switch happens before it replicates. <strong>The verification step is the safety gate that makes five-nines availability compatible with live shard migrations.</strong> The cost is time — migrations take longer because of the verification window — but that cost is the explicit price of correctness guarantees on financial data.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Bulk Load Throughput Engineering Challenge</p><p>During testing, Stripe found that standard MongoDB write patterns were insufficiently fast for bulk data loading during shard migrations. 
Batching writes and tuning engine parameters both failed to resolve the throughput bottleneck. The root cause: the standard MongoDB write path is optimized for <strong>low-latency individual writes</strong>, not for <strong>high-throughput sequential bulk loads</strong>. The engineering team built custom I/O patterns specifically for the bulk copy phase of migrations — patterns that bypassed some standard write overhead in favor of throughput.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE OPLOG AND FINANCIAL CONSISTENCY</strong></p>\n<p>MongoDB's <em>oplog</em> (a capped collection that stores all write operations in order, used for replication across MongoDB replica sets) is the technical foundation of CDC replication in DocDB. Every write to the source shard appears in the oplog in order. By replaying the oplog on the target shard in sequence, the Data Movement Platform guarantees that every write applied to the source during migration is also applied to the target — preserving full consistency of financial records. The oplog is not just a replication mechanism: it is a <strong>linearizable history of financial truth</strong>.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>DocDB's architecture enforces a clean separation between application code and database topology. Applications at Stripe never connect directly to MongoDB instances — they connect to the <em>Database Proxy</em>, which is the single point of truth for routing, access control, and scalability decisions. 
This indirection is what makes zero-downtime migrations possible: the proxy can update its routing table atomically as migrations complete, and applications never see anything other than consistent data.</p>\n<h3>DocDB Architecture: Three-Tier Database-as-a-Service</h3>\n<p><a href=\"https://techlogstack.com/explore/stripe-docdb-data-movement-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Data Movement Platform: Six-Step Migration Protocol</h3>\n<p><a href=\"https://techlogstack.com/explore/stripe-docdb-data-movement-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE PROXY MAKES MIGRATIONS TRANSPARENT</strong></p>\n<p>The Database Proxy's role is the architectural key to zero-downtime migrations. By <strong>abstracting away shard topology from application code</strong>, the proxy can update routing atomically at Step 6 — the traffic switch — without any application restarting, reconnecting, or changing behavior. Applications see a continuous stream of consistent reads and writes before and after the switch. The migration is completely invisible from the application layer, which is the entire point.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Change Data Capture: Reading the Oplog</p><p>MongoDB maintains an <em>oplog</em> (operation log — a capped MongoDB collection that records every write operation applied to the database, used for replication and CDC streaming) that records every write in sequence. DocDB's CDC service reads this oplog on the source shard and replays every operation on the target shard in order. This keeps the target continuously synchronized with the source during the migration window. 
When replication lag drops to near-zero, the correctness verification and traffic switch can proceed safely.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Transparent Application Layer: The Developer Experience</p><p>Stripe's application engineers interact with DocDB through a simple API: read a document, write a document, query by index. They never configure sharding keys, never think about which shard holds a specific customer's data, and never coordinate with the database infrastructure team before their code ships. The abstraction layer is what makes it possible for Stripe's product engineering velocity to be decoupled from its database scaling complexity — two teams that would otherwise be in each other's way operate independently.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Stripe's DocDB and Data Movement Platform represent a decade of investment in making financial database operations invisible to application code. The lessons here are about architectural abstraction, the price of correctness, and why migration tooling is a competitive advantage.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>A database abstraction layer is an operational multiplier.</strong> Stripe's applications never talk to MongoDB directly — they talk to the proxy. This indirection cost engineering time upfront but enabled zero-downtime migrations, transparent sharding, and fleet-wide upgrades for a decade of scale growth. The abstraction layer is where scaling strategies live.</li><li><span>02</span><div><em>Change Data Capture</em> (reading a database's operation log to stream every change to a downstream consumer in real time) is the foundation of live migration. Without CDC, migrating a live database requires a maintenance window. With CDC, you copy a snapshot, stream the delta, verify consistency, then switch traffic atomically. 
Build CDC capability into your database infrastructure before you need live migrations.</li><li><span>03</span><div><strong>Pre-build indexes on the target before loading data.</strong> Loading data first and then building indexes on a large dataset is far more expensive than building the indexes on empty data and then inserting. For petabyte-scale migrations, this ordering difference can be the difference between hours and days. Stripe explicitly sequences index creation before bulk data arrival.</li><li><span>04</span><div>Correctness verification before the traffic switch is not optional for financial data. <strong>A migration that completes fast but introduces even a single data inconsistency is worse than a slow correct migration.</strong> For domains where correctness is non-negotiable, treat Step 5 (verification) as the most important step in your migration protocol.</li><li><span>05</span><div><em>Bin-packing</em> (consolidating many small, underutilized database shards into fewer larger shards to reduce operational overhead and hardware costs) is as important as shard splitting for long-term database fleet health. As traffic patterns shift, some shards become cold. Without consolidation, you accumulate operational overhead and hardware waste. Plan for bidirectional shard topology management from day one.</li></ol>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Correctness vs Speed Tradeoff</p><p>Stripe's Data Movement Platform deliberately accepts <strong>slower migrations in exchange for guaranteed correctness</strong>. The CDC replication phase, the correctness verification step, and the atomic traffic switch all add latency to the migration timeline that a less careful system could avoid. For a company processing $1 trillion in payments, data inconsistency risk is not a speed-for-correctness tradeoff — it's a business continuity risk. 
The migration protocol encodes this priority explicitly.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>MIGRATION TOOLING AS INFRASTRUCTURE</strong></p>\n<p>Stripe's Data Movement Platform is not a script that gets run during migrations — it is <strong>production infrastructure</strong> that runs continuously, managing ongoing shard operations across thousands of databases. The platform has its own SLOs, its own monitoring, its own oncall rotation. Building migration tooling as first-class infrastructure rather than ad-hoc tooling is what enables Stripe to migrate petabytes per year without extraordinary engineering effort per migration.</p>\n</blockquote>\n\n<blockquote><p>Stripe moved 1.5 petabytes of financial data between database shards in 2023 and nobody noticed — which is either the most boring success story in engineering history or the most impressive one.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/stripe-docdb-data-movement-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.480671+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "Stripe"]}, {"id": "https://techlogstack.com/explore/slack-unified-grid-2024/", "url": "https://techlogstack.com/explore/slack-unified-grid-2024/", "title": "Slack Rewrote Its Core Architecture for Enterprise — Because the Old One Was a Lie", "summary": "How Slack spent two years rebuilding its workspace-centric architecture into an org-wide information model to serve enterprise customers who span dozens of workspace", "content_html": "<p><strong>Slack</strong> · Distributed Systems · 17 May 2026</p>\n<p>Slack was 
built for teams in single workspaces. Enterprise customers were using it across dozens of workspaces simultaneously — and the architecture had never been designed for that. Every major enterprise feature was a workaround on top of a foundation that assumed one workspace per person. Slack spent two years rebuilding the foundation.</p>\n<ul>\n<li>2 years of development time</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Slack launched in 2013 with a beautifully simple data model: users belong to workspaces, workspaces contain channels, channels contain messages. To view a different workspace, you click on it and the context switches entirely. For small teams using a single workspace, this model was perfect. For large enterprises that had grown to 50, 100, or 200 workspaces across departments, geographies, and business units, it was a prison. <strong>Every DM, every notification, every unread count, every search result was siloed by workspace.</strong> A VP with access to 80 workspaces had to remember which workspace a conversation was in, click to it, check notifications, return, and repeat — dozens of times per day. The architecture was working against the users it was supposed to serve.</p>\n<blockquote><p>All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions — which usually entails a lot of work — or trying to support new behavior atop the existing architecture.</p><p><em>Slack Engineering, via 'Unified Grid: How We Re-Architected Slack for Our Largest Customers'</em></p></blockquote>\n<p>Slack's team had been papering over the workspace-centric limitation for years with increasingly complex workarounds. 
The <em>Connect</em> (Slack's feature allowing users in different Slack organizations to message each other across workspace boundaries — built as an overlay on the workspace model) feature, multi-workspace management tools, org-wide settings — all of them were workarounds that added complexity without fixing the fundamental architecture. The CTO and engineering leadership faced a classic build-it-now-or-keep-patching decision. They chose to build. The project was called <strong>Unified Grid</strong>, and it would require rebuilding the core data model, refactoring thousands of APIs, and redesigning both the backend and every client application — simultaneously.</p>\n<blockquote>\n<p><strong>THE WORKSPACE-CENTRIC ASSUMPTION</strong></p>\n<p>In Slack's original architecture, <strong>almost all data was particular to a single workspace</strong>: messages, channels, DMs, notification preferences, user profiles, unread counts. This assumption was baked into thousands of database queries, API responses, and client rendering paths. To build Unified Grid, every piece of data that needed to be visible across workspaces had to be lifted out of the workspace silo — a change that touched nearly every system in the stack.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🏢</strong></p>\n<p>Slack's largest enterprise customers operate across <strong>dozens to hundreds of workspaces</strong>. Expecting those users to manually context-switch between workspaces to find conversations, check notifications, or respond to DMs was creating real productivity friction. The Unified Grid project was not a technical exercise — it was a direct response to enterprise customer feedback.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Enterprise Users Drowning in Workspace Context-Switches</h4>\n<p>Slack's workspace-centric model forced enterprise users to manually navigate between dozens of workspaces to find conversations and check notifications. 
Key features like a unified DM inbox, an org-wide activity feed, and cross-workspace search were impossible within the existing architecture — not missing features, architecturally blocked features.</p>\n<hr />\n<h3>Cause</h3>\n<h4>The Foundation Assumption Was Wrong for Enterprise</h4>\n<p>Slack's data model had been built on the assumption that almost all user data is particular to a single workspace. Ten years of feature development had embedded this assumption deep into database schema, API contracts, and client rendering logic. Supporting org-wide views required either a rewrite or an ever-growing layer of workarounds.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Prototype the Path: Build Incrementally, Prove Out</h4>\n<p>Rather than committing immediately to a full rewrite, Slack's team built a proof of concept using Unified Grid within internal tooling — Slack's own employees using it daily. Only after the POC validated the architecture and revealed what work was required did the team commit to a full rollout. Slack calls this 'prototyping the path.'</p>\n<hr />\n<h3>Result</h3>\n<h4>Shipped After 2 Years: Rollout Sep 2023 → Mar 2024</h4>\n<p>Unified Grid rolled out to customers starting Fall 2023 and completed in March 2024. Features like the unified DMs tab, org-wide Activity tab, and Save it for Later became possible on a foundation that had been impossible on the workspace-centric model. The rewrite that everyone said you shouldn't do turned out to be necessary.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The 'Avoid Rewrites' Truism — and When It Breaks</p><p>Software engineering wisdom universally advises against large rewrites. Slack's team explicitly acknowledged this truism — and then concluded that it did not apply to their situation. 
<strong>When the architecture of an application drifts far enough from how that application is used, rebuilding the core foundation becomes less risky than continuing to build complexity on top of a wrong foundation.</strong> The key question is not 'rewrites are bad' — it's 'how far has the drift gone?'</p>\n</blockquote>\n\n<p>The Unified Grid project used a strategy Slack calls <strong>prototyping the path</strong> — building incrementally, proving out ideas in practice before committing to the full scope. Rather than designing the complete architecture and then building it, the team built a barely functional prototype of Unified Grid and deployed it to Slack's own internal teams. Using it daily for their own work surfaced what was broken, what the real user experience gaps were, and what the engineering challenges would be in production — all before the team had committed to building the entire thing. By Summer 2023, Unified Grid was stable enough that much of the company used it daily. By Fall 2023, external rollout began. By March 2024, it was complete.</p>\n<blockquote>\n<p><strong>🧪</strong></p>\n<p>Slack Dogfoods Its Own Infrastructure</p><p>Slack's engineers are among the heaviest users of Slack — including new features under development. The <strong>internal dogfooding</strong> of Unified Grid gave the team thousands of daily active users giving real feedback on a pre-production architectural overhaul. This feedback loop compressed the time between 'we thought this would work' and 'we know it doesn't work' from months to days.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Internal Deployment Milestone</p><p>By Summer 2023, <strong>much of Slack's company was using Unified Grid for their daily work</strong>. This internal milestone was not just a technical success — it was an organizational one. 
Having thousands of Slack employees using a pre-release architectural overhaul daily meant real bug reports, real performance data, and real confidence that the system was ready for external customers.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>WHAT HAD BEEN TRIED BEFORE</strong></p>\n<p>Slack had tried to alleviate the workspace-switching problem incrementally before committing to Unified Grid: <strong>Shared Channels</strong> allowed cross-workspace channel sharing. <strong>Connect</strong> enabled messaging across organizations. Various UI improvements consolidated workspace-switching. None of these fixed the fundamental architecture — they were features built on a wrong foundation that made the foundation slightly less painful without changing it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔍</strong></p>\n<p>Cross-Workspace Search: The Clearest User Pain</p><p>One of the most cited enterprise frustrations was search. Searching for a message required <strong>knowing which workspace it was in first</strong>, then switching to that workspace, then searching. Unified Grid enabled org-wide search that returned results from all accessible workspaces simultaneously. For users with 80 workspaces, this changed search from a manual process into a useful feature.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Technical Work: Thousands of APIs, One New Foundation</h3>\n<p>The scale of the Unified Grid engineering effort is difficult to convey without concrete numbers. Slack's codebase contained <strong>thousands of API endpoints, database queries, and client rendering paths</strong> that assumed workspace-scoped data. Each of these had to be evaluated: does it need to be org-aware? If so, what's the migration path? In many cases, a query that fetched a user's DMs from a single workspace had to be replaced with a query that could aggregate DMs from all of the user's workspaces efficiently. 
In other cases, entirely new data structures had to be introduced to represent org-level concepts that had never existed before.</p>\n<ul>\n<li><strong>2 years</strong> — Development duration from first prototype to full customer rollout — reflecting the depth of the architectural changes required</li>\n<li><strong>1000s</strong> — APIs, database queries, and permission checks updated to support org-wide data access rather than workspace-scoped data access</li>\n<li><strong>Mar 2024</strong> — Full rollout completion date — the project that began as a proof-of-concept in 2021-2022 became a production reality across Slack's entire customer base</li>\n<li><strong>3 features</strong> — Core Unified Grid capabilities delivered: unified DMs tab, org-wide Activity tab, and Save it for Later — all architecturally impossible on the old model</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified conceptual example of workspace-centric vs org-wide data access\n# Real Slack uses Hack/PHP and complex distributed data systems\n\n# OLD: Workspace-centric DM fetch\n# User must specify which workspace they want DMs from\ndef get_dms_old(user_id: str, workspace_id: str) -> list:\n    # Every query is scoped to a single workspace\n    return db.query(\n        \"SELECT * FROM direct_messages \"\n        \"WHERE workspace_id = ? 
AND user_id = ?\",\n        workspace_id, user_id  # workspace_id required — siloed\n    )\n\n# NEW: Org-aware DM fetch (Unified Grid)\n# Returns DMs across all workspaces the user belongs to\ndef get_dms_unified(user_id: str, org_id: str) -> list:\n    # Query all workspaces the user belongs to in this org\n    workspaces = org_membership_service.get_workspaces(user_id, org_id)\n    \n    # Aggregate DMs across all workspaces — unified inbox\n    # Sorted by recency, not by workspace\n    return dm_service.get_org_wide(\n        user_id=user_id,\n        workspace_ids=[ws.id for ws in workspaces],\n        sort_by='recency'  # unified sort across workspace boundaries\n    )\n\n# Permission checks also needed org-level understanding:\n# Old: can_access(user, workspace, resource)\n# New: can_access(user, org, workspace, resource) — layered org context</code></pre>\n<blockquote>\n<p><strong>PROTOTYPE THE PATH: HOW SLACK DE-RISKED THE REWRITE</strong></p>\n<p>Slack's most important process decision for Unified Grid was building a <strong>working prototype used internally before committing to full scope</strong>. This is 'prototyping the path' — not a throwaway prototype, but a real functioning implementation used by real users on real data. The feedback from internal use surfaced problems that would have been catastrophic if discovered post-rollout. It also gave leadership confidence to commit the full engineering resources needed for the project.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The New Features That Became Possible</p><p>Unified Grid unlocked product features that were <strong>architecturally impossible</strong> before the migration: a unified DMs tab showing all DMs across all workspaces, an org-wide Activity tab showing all notifications in chronological order regardless of workspace, and Save it for Later aggregating saved items across workspace boundaries. 
These aren't incremental improvements — they're features that required the foundation to be correct before they could exist.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Executive Concern: Is It Worth the Cost?</p><p>The Unified Grid blog post is unusually candid about the organizational challenge: <strong>execs and engineering leadership were genuinely concerned about the cost</strong>. Was rebuilding the core architecture worth potentially thousands of engineer-weeks of effort? The team's answer was to build the proof of concept first, use internal data to demonstrate the benefits, and then make the case for full investment — rather than asking for two years of resources upfront on a bet.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Rolling Rollout: September 2023 to March 2024</p><p>Unified Grid wasn't released all at once — Slack used a <strong>controlled rollout over six months</strong>, starting with early access customers in Fall 2023 and expanding to the full customer base by March 2024. This allowed the team to find bugs under real enterprise load before every customer was affected, and to build customer success resources in parallel with technical rollout. The phased rollout of a two-year engineering project was itself a significant coordination effort.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Migration Cost That Can't Be Avoided</p><p>Unified Grid required updating existing customers' Slack configurations, data migrations for org-level constructs, and client-side state invalidation when users upgraded. Some features required users to re-learn workflows they had developed over years with the old model. 
There is no such thing as a transparent foundational architecture change at production scale — some user-visible change is inevitable, and Slack had to manage customer communication throughout the rollout.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Unified Grid's architecture changes span three layers of Slack's stack. The <strong>backend</strong> required new data models for org-level concepts, updated APIs with org-level context, and new query patterns that aggregate across workspaces. The <strong>desktop and mobile clients</strong> required redesigned rendering architectures that could display org-wide views alongside workspace-specific ones. The <strong>permission system</strong> required new layering to support org-level access controls on top of existing workspace-level access controls. All three layers had to change simultaneously and stay in sync throughout the two-year project, including the six-month customer rollout.</p>\n<h3>Before Unified Grid: Workspace-Centric Data Silos</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-unified-grid-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After Unified Grid: Org-Wide Views Across Workspaces</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-unified-grid-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>PERMISSION LAYERS: ORG + WORKSPACE</strong></p>\n<p>One of the hardest architectural changes in Unified Grid was the permission system. Old permissions were: <strong>can this user access this resource in this workspace?</strong> New permissions are: <strong>can this user access this resource in this workspace within this org?</strong> Org-level admin controls needed to cascade down to workspace-level controls, override in some cases, and defer in others. 
Building a correct, auditable, performant permission system that understood both levels required careful design — org-level permission bugs in a product used by enterprises have serious security implications.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Client Architecture Challenge</p><p>Backend changes were only half the work. Slack's desktop and mobile clients had been designed to render one workspace at a time. Unified Grid required clients to <strong>maintain state across multiple workspaces simultaneously</strong>, merge data streams from different workspace backends, and render org-wide views alongside workspace views without confusion. The client architecture work was as extensive as the backend work — and had to be shipped to every platform (Mac, Windows, Linux, iOS, Android) simultaneously.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚡</strong></p>\n<p>The Hack Monolith as Change Vehicle</p><p>Despite Slack's architectural evolution, the backend rewrite was implemented within the existing <strong>Hack/PHP monolith</strong> rather than as a separate service. This made incremental deployment easier — changes could be gated behind feature flags, rolled back quickly, and deployed through the existing CI/CD pipeline. The Unified Grid project is evidence that a monolith can accommodate fundamental architectural evolution without requiring a microservices extraction.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Unified Grid challenges the 'never do large rewrites' maxim with a documented counterexample. 
The lesson is not 'large rewrites are fine' — it's that the decision requires honest evaluation of how far architectural drift has gone and whether incremental patching is still viable.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>The 'avoid rewrites' truism is a default, not a law.</strong> When your architecture's foundational assumptions have drifted far enough from actual usage that every new feature requires a workaround, the accumulated technical debt of workarounds may exceed the cost of rebuilding the foundation. Evaluate honestly. Don't use 'rewrites are bad' as a reason to avoid a decision that actually needs to be made.</li><li><span>02</span><div>Prototype the path before committing full resources to a rewrite. <strong>Build a working implementation, use it internally, and let real usage surface the gaps</strong> — before asking for two years of engineering investment. Slack's internal dogfooding of Unified Grid gave leadership evidence rather than speculation to justify the project's scope.</li><li><span>03</span><div>Permission systems need to evolve in lockstep with data models. <strong>Org-level access controls cannot be bolted onto workspace-level permission systems.</strong> When your user model gains a new organizational layer, your permission model must gain it too. This work is unglamorous, invisible to users, and absolutely required for enterprise security.</li><li><span>04</span><div><strong>Client and backend architecture must change together.</strong> You cannot ship an org-wide backend while keeping workspace-centric clients. The full change is end-to-end: data model, API contracts, permission systems, desktop client, mobile client, web client. 
Planning the delivery sequence for a change this wide is as important as designing the architecture itself.</li><li><span>05</span><div><em>Prototyping the path</em> (Slack's term for building a working but incomplete implementation of a major change, using it internally to validate the direction before committing to full scope) is the engineering equivalent of a staged rollout for architectural decisions. You don't commit the full budget until you have production evidence that the direction is right.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>When 'Never Rewrite' Gets Overruled by 'Never Ship This Feature'</p><p>The decisive business case for Unified Grid was not abstract technical cleanliness — it was that Slack's largest enterprise customers <strong>could not get features they needed</strong> without the architecture change. Unified DMs, org-wide Activity, cross-workspace search — these were features enterprise contracts were being written around. When the architecture prevents the product from serving its largest customers, the rewrite decision has already been made by the market.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE INCREMENTAL COMMITMENT MODEL</strong></p>\n<p>Unified Grid's development history — proof of concept, internal dogfood, limited beta, full rollout — illustrates an <strong>incremental commitment model for large engineering bets</strong>. At each stage, the team had evidence of progress before committing to the next stage's resource investment. 
This model de-risks large architectural bets by converting them from single go/no-go decisions into a series of smaller, evidence-gated decisions.</p>\n</blockquote>\n\n<blockquote><p>Slack spent two years building a feature that enterprise customers could have described in one sentence: 'show me all my messages, regardless of which workspace they're in' — and it turns out one sentence can hide a complete architectural overhaul.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/slack-unified-grid-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.484030+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Distributed Systems", "Slack"]}, {"id": "https://techlogstack.com/explore/slack-deploy-safety-2025/", "url": "https://techlogstack.com/explore/slack-deploy-safety-2025/", "title": "Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months", "summary": "How Slack's Deploy Safety Program reduced customer impact hours from change-triggered incidents by 90% over 18 months — through automatic rollbacks and safety cultur", "content_html": "<p><strong>Slack</strong> · Reliability · 17 May 2026</p>\n<p>73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. 
Eighteen months later, customer impact hours were down 90%.</p>\n<ul>\n<li>73% of incidents from own deploys</li><li>90% reduction in impact hours</li><li>18 months of sustained investment</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>It's mid 2023 and we've identified some opportunities to improve our reliability. Fast forward to January 2025. Customer impact hours are reduced from the peak by 90% and continuing to trend downward.</p><p><em>— Slack Engineering — via 'Deploy Safety: Reducing customer impact from change', slack.engineering</em></p></blockquote>\n<p>Slack's reliability story in 2023 had an uncomfortable truth at its center: the biggest source of customer-facing incidents was not external infrastructure failures, not traffic spikes, not adversarial attacks — it was <strong>Slack's own code deploys</strong>. A measurement of the incident dataset showed that <strong>73% of customer-facing incidents were change-triggered</strong>, primarily code deployments. This number reframes the reliability problem entirely. You can harden infrastructure, add redundancy, and build better monitoring — but if most incidents are self-inflicted, the highest-leverage intervention is improving how you ship code.</p>\n<blockquote>\n<p><strong>📊</strong></p>\n<p>Slack operates in a software engineering environment with <strong>hundreds of internal services</strong> and many different deployment systems and practices. 
Before the Deploy Safety Program, the approach to reliability was often service-specific — individual teams improving their own deploy practices independently, without a coordinated program tracking the systemic impact.</p>\n</blockquote>\n\n<p>The Deploy Safety Program began in mid-2023 with a key insight: measuring reliability improvement by waiting for incidents to occur creates a long feedback loop that is difficult to optimize. The team shifted to a <strong>leading-indicator metric</strong> — customer impact hours from high-severity change-triggered incidents — that could be tracked continuously without waiting for the next major outage. This metric served as the north star throughout the 18-month program, allowing the team to see improvement (or regression) before the data showed up in annual availability reports. The program metric had a semi-loose connection to individual customer experience, but it was directionally correct and defensible enough to drive engineering prioritization.</p>\n\n<h3>Problem</h3>\n<h4>73% of Incidents Self-Inflicted by Deploys</h4>\n<p>Slack measured that 73% of customer-facing incidents were triggered by change — primarily code deployments across the hundreds of internal services. Manual remediation processes (engineers detecting issues, investigating, deciding to roll back) added latency between deploy and recovery. Interruptions exceeding 10 minutes were disproportionately damaging to customer trust.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Manual Detection and Remediation Too Slow</h4>\n<p>The existing approach relied on engineers detecting deploy-related regressions from monitoring dashboards and making manual rollback decisions. This added human latency — the time to be paged, the time to investigate, the time to decide — to every incident. 
At Slack's deploy frequency, across hundreds of services, the accumulated human latency was significant.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Automatic Rollbacks + Safety Culture</h4>\n<p>The Deploy Safety Program introduced automated rollback triggers: when deploy-time metrics crossed defined thresholds, a rollback was automatically initiated without waiting for engineer intervention. This removed the human latency from the most common incident recovery path. The program also invested in safety culture across engineering teams — normalizing rollbacks as the right response rather than a failure indicator.</p>\n<hr />\n<h3>Result</h3>\n<h4>90% Reduction by January 2025</h4>\n<p>Customer impact hours were down 90% from peak by January 2025, with the trend continuing downward. The peak of impact occurred between February and April 2024 — before automatic rollbacks were introduced. Once automatic rollbacks were live, the data showed dramatic improvement.</p>\n<hr />\n\n<blockquote>\n<p><strong>THE LEADING INDICATOR STRATEGY</strong></p>\n<p>A core innovation of the Deploy Safety Program was measuring <strong>customer impact hours</strong> as a leading indicator rather than waiting for annual availability figures. This gave the engineering team a metric they could see moving week-over-week, track against program milestones, and use to evaluate whether specific projects were actually improving reliability. Without the right metric, improvement programs are optimizing blind.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Communication Challenge: Non-Linear Progress</p><p>The program's progress chart showed <strong>non-linear improvement</strong> — the first quarter of work showed reduction before any code changes were deployed, just from communicating the program's goals to engineering teams. Then came a peak of impact in early 2024 before automatic rollbacks were in place. 
This non-linearity made it challenging to communicate progress to stakeholders who expected a smooth downward line. The team maintained confidence in the work based on leading metrics even when trailing metrics hadn't yet reflected it.</p>\n</blockquote>\n\n<p>Slack's customer base had grown to treat Slack as <strong>mission-critical infrastructure</strong> — the same expectation applied to email or calendar, not messaging apps. This raised the stakes for deploy-related interruptions: an interruption that users would have tolerated as 'a blip' in 2018 was now disruptive to workflows, team meetings, and customer communications. The business context transformed the engineering mandate: deploy safety was not just a reliability metric, it was a retention metric. The Deploy Safety Program was not built in a vacuum — it was built in response to explicit customer feedback that interruptions had become more costly.</p>\n<blockquote>\n<p><strong>🤖</strong></p>\n<p>Automatic vs Manual Remediation: The Latency Gap</p><p>Manual remediation of a deploy-related incident requires: alert fires → engineer pages → engineer investigates → engineer diagnoses → engineer decides to roll back → engineer executes rollback. Each step adds minutes. Automatic rollback collapses this to: <strong>metric threshold crossed → rollback initiated</strong>. For the most common class of deploy-related incidents, this difference is often the difference between a sub-10-minute blip and a 30-minute customer-impacting incident.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The 10-Minute Threshold</p><p>Slack's data showed that customer tolerance for interruptions changed significantly at around <strong>10 minutes</strong>. Shorter interruptions were treated as acceptable blips; longer ones were treated as incidents that impacted workflows and generated support tickets. 
Designing automatic rollbacks to trigger fast enough to resolve most issues within the 10-minute window became a key design constraint for the program.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Manual Process Bottleneck at Scale</p><p>Before automatic rollbacks, Slack's incident response for deploy-related regressions required: engineer wakes up, opens laptop, pulls up dashboards, assesses severity, determines cause is likely the recent deploy, decides to roll back, executes rollback. At <strong>hundreds of services with multiple daily deploys each</strong>, this process was running dozens of times per week. Each execution required a human. Each human introduced minutes of latency. The system was not designed for the frequency at which it was being invoked.</p>\n</blockquote>\n\n<p>One structural challenge the program navigated was the <strong>gap between program metric and individual project metric</strong>. A specific engineering project might reduce rollback time by 50% on one service — but how much does that move the top-line customer impact hours metric? The relationship is indirect and involves statistical noise from incident timing, severity distribution, and Slack's traffic patterns. Teams that couldn't see a direct line from their work to the program metric risked losing motivation. The solution was maintaining <strong>both program-level and project-level metrics</strong> and being explicit about how they connected — even when the connection was indirect.</p>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Deploy Safety Program: Engineering + Culture</h3>\n<p>The Deploy Safety Program was not purely a technical program. Its first-quarter improvements came from communication — telling engineering teams what the metric was, why it mattered, and what behaviors were contributing to it. The technical work (automated alerts, automatic rollbacks, improved deploy signals) came later. 
This sequencing is important: <strong>culture change came before code change</strong>, and the culture change produced measurable improvement even before the tooling was in place. Engineers who understood that their deploys were the primary source of customer impact made better decisions about when to deploy, what size deploys to ship, and when to roll back.</p>\n<ul>\n<li><strong>73%</strong> — Fraction of customer-facing incidents triggered by Slack's own code changes — the measurement that transformed reliability from an infrastructure problem into a deployment problem</li>\n<li><strong>90%</strong> — Reduction in customer impact hours from peak (Feb–Apr 2024) to January 2025 — the headline outcome of the 18-month Deploy Safety Program</li>\n<li><strong>Q1</strong> — Quarter of work when improvement appeared — before any code changes — purely from communicating the program goals to engineering teams and surfacing the metric</li>\n<li><strong>Auto</strong> — Rollback execution mode after the program's key technical milestone — removing human latency from the most common incident recovery path</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified deploy safety automatic rollback logic\n# Real implementation uses Slack's internal deploy orchestration system\n\nimport time\n\n# Illustrative multipliers over the pre-deploy baseline. These are assumed\n# values for the sketch, not Slack's published thresholds.\nERROR_RATE_THRESHOLD = 2.0\nLATENCY_THRESHOLD = 1.5\n\n# deploy_orchestrator and pagerduty stand in for internal clients, and the\n# _capture_pre_deploy_metrics, _get_current_metrics and _mark_deploy_stable\n# helpers are left unimplemented in this sketch.\n\nclass DeploySafetyMonitor:\n    def __init__(self, service: str, deploy_id: str):\n        self.service = service\n        self.deploy_id = deploy_id\n        self.baseline = self._capture_pre_deploy_metrics()\n    \n    def monitor_post_deploy(self, window_minutes: int = 10):\n        \"\"\"Monitor service health after a deploy.\n        Automatically roll back if metrics regress beyond thresholds.\"\"\"\n        start_time = time.time()\n        \n        while time.time() - start_time &lt; window_minutes * 60:\n            current = self._get_current_metrics()\n            \n            # Check error rate regression\n            if current.error_rate > self.baseline.error_rate * ERROR_RATE_THRESHOLD:\n                self._automatic_rollback(\n                    reason=f\"Error rate {current.error_rate:.1%} exceeded \"\n                           f\"threshold (baseline: {self.baseline.error_rate:.1%})\"\n                )\n                return  # rollback initiated — no human required\n            \n            # Check p99 latency regression  \n            if current.p99_latency > self.baseline.p99_latency * LATENCY_THRESHOLD:\n                self._automatic_rollback(\n                    reason=f\"p99 latency {current.p99_latency}ms exceeded threshold\"\n                )\n                return\n            \n            time.sleep(30)  # check every 30 seconds during bake period\n        \n        # Monitoring window passed — deploy is baked, mark stable\n        self._mark_deploy_stable(self.deploy_id)\n    \n    def _automatic_rollback(self, reason: str):\n        deploy_orchestrator.rollback(self.deploy_id)\n        pagerduty.notify(severity='P2',  # P2, not P1 — rollback is the mitigation\n                        message=f'Auto-rollback: {self.service} {self.deploy_id}\\n{reason}')</code></pre>\n<blockquote>\n<p><strong>SAFETY CULTURE: NORMALIZING ROLLBACKS</strong></p>\n<p>One of the cultural investments of the program was <strong>normalizing rollbacks as the correct first response</strong> to a deploy-related regression, not as a failure to be avoided. Previously, some teams would try to forward-fix a regression (deploy a fix) rather than roll back. Forward-fixing maintains customer impact during the investigation and fix cycle. Rolling back immediately reduces customer impact to near-zero, then gives engineers the time and calm to properly understand and fix the issue. 
<strong>Rollback is not defeat — it's the right call.</strong></p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The February–April 2024 Peak: A Forcing Function</p><p>The peak of customer impact hours in early 2024 — before automatic rollbacks were fully deployed — actually served as a forcing function for the program. It demonstrated to engineering leadership that the program's investment was justified, accelerated resources toward the automatic rollback work, and showed that manual remediation was insufficient at Slack's deploy frequency. Sometimes the worst period in a reliability program is the moment that unlocks the resources to fix it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Agentforce Integration: Reducing the 10-Minute Threshold</p><p>Slack's blog post notes that the introduction of <strong>Agentforce</strong> in 2025 raised customer expectations further — Slack being used as an AI-assisted work tool made even shorter interruptions more disruptive. This ongoing expectation evolution means the Deploy Safety Program's work is continuous: the 10-minute acceptable interruption threshold will continue to shrink as Slack becomes more tightly integrated into customer workflows.</p>\n</blockquote>\n\n<p>The Deploy Safety Program faced a fundamental measurement challenge: the program metric (customer impact hours) is a <em>trailing indicator</em> (a metric that reflects outcomes that have already occurred — you only know you've improved after the fact) that doesn't give engineers real-time feedback on whether a specific project change is working. The team supplemented the trailing metric with <strong>leading indicators specific to each project</strong> — deploy alert precision, rollback rate, manual rollback to auto-rollback conversion rate — that gave faster feedback on whether individual investments were on track. 
The relationship between program metric and individual project metric is always indirect, but tracking both gave the team the full picture.</p>\n<blockquote>\n<p><strong>📈</strong></p>\n<p>Non-Linear Progress: Why It Looked Like It Wasn't Working</p><p>The program's impact chart showed that the first quarter produced improvement before any code shipped — from communication alone. Then came the <strong>peak of impact in early 2024</strong>, before automatic rollbacks were deployed, suggesting things were getting worse. Then a dramatic drop after automatic rollbacks went live. This non-linear curve is common in reliability programs: communication changes behavior, new tooling is built without full effect, the tooling deploys and impact drops sharply. <strong>Reading the curve correctly requires understanding what was deployed when.</strong></p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Slack's deploy safety architecture evolved from a manual-first system to an automated-first system over the 18 months of the program. The before state: engineers deploy, monitoring alerts fire, engineers investigate, engineers decide to roll back, engineers execute rollback. The after state: engineers deploy, monitoring compares against pre-deploy baseline, automatic rollback fires if thresholds are crossed, engineer is paged with context after the rollback has already happened. 
The human is in the loop — but as a reviewer of an automated decision, not as a prerequisite to recovery.</p>\n<h3>Before: Manual Deploy Remediation Path</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-deploy-safety-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Automatic Deploy Rollback Path</h3>\n<p><a href=\"https://techlogstack.com/explore/slack-deploy-safety-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE MEASUREMENT SYSTEM IS THE PROGRAM</strong></p>\n<p>The Deploy Safety Program's most lasting contribution may not be the automatic rollback tooling — it's the <strong>measurement framework</strong>. Customer impact hours from change-triggered incidents, tracked continuously, with attribution to specific deploy events and services, created visibility that had not previously existed. Engineering teams could see, for the first time, which services and which deploy patterns were contributing most to customer impact. That visibility drove behavior change even before the tooling changed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Hundreds of Services: The Coordination Challenge</p><p>Slack's deploy environment includes <strong>hundreds of internal services</strong> with different deployment systems and practices. Rolling out deploy safety monitoring and automatic rollbacks across this heterogeneous environment required a program-level coordination effort — not just a single engineering team making changes to one service. Each service team needed to integrate with the monitoring framework, validate their specific alert thresholds, and adopt the rollback automation. 
The program's organizational structure was designed to support this breadth.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Threshold Calibration Problem</p><p>Automatic rollbacks require <strong>precise threshold calibration</strong> — thresholds too sensitive trigger unnecessary rollbacks on normal traffic variance, eroding engineer trust in the system. Thresholds too loose miss real regressions. Slack's approach was per-service threshold calibration based on historical metric variance, with ongoing tuning as services' traffic patterns evolved. This calibration work is ongoing — it doesn't end when the automation is deployed. Getting thresholds wrong in either direction undermines the entire program.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Slack's Deploy Safety Program is a model for how to turn a vague reliability problem ('we have too many incidents') into a concrete engineering program with measurable outcomes. The lessons apply to any team where self-inflicted incidents are the primary reliability drain.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Measure what's causing incidents before investing in what might fix them.</strong> Slack's discovery that 73% of incidents were change-triggered completely reframed their reliability investment. Without measurement, they might have invested in infrastructure redundancy and network hardening while the primary driver — their own deploys — continued unchecked.</li><li><span>02</span><div><em>Trailing metrics</em> (metrics that measure outcomes after they occur, like annual availability or incident count) tell you how things went. <em>Leading metrics</em> (metrics that indicate direction of travel before outcomes are fully visible, like incident rate per deploy or rollback frequency) tell you whether what you're doing is working. Run both. 
Use the leading metrics to steer the program, and the trailing metrics to confirm you've arrived.</li><li><span>03</span><div><strong>Culture change can produce measurable improvement before code changes do.</strong> Slack saw improvement in the first quarter of the program from communication alone — before any technical work shipped. When engineers understand what behavior is costing customers, many of them change their behavior voluntarily. Don't skip the cultural investment in favor of jumping straight to tooling.</li><li><span>04</span><div>Automatic rollbacks are not a replacement for good engineering — they are a safety net that <strong>reduces the cost of imperfect engineering</strong>. Every team ships bugs; the question is how quickly the system detects and recovers from them. Automatic rollbacks compress the detection-to-recovery time from tens of minutes to seconds, dramatically reducing customer impact for the most common class of incidents.</li><li><span>05</span><div><strong>Rollback is the correct first response to a deploy regression.</strong> Forward-fixing maintains customer impact during investigation. Rolling back immediately restores service, then gives engineers the time and safety to understand the issue properly. Normalizing rollback as the correct response — not a failure — is as important as building the tooling to do it automatically.</li></ol>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Compounding Return</p><p>By January 2025, customer impact hours were at the lowest level ever recorded and continuing to trend downward. The Deploy Safety Program's investments compound: automatic rollbacks reduce impact per incident, better alert precision reduces false positives, safety culture reduces the frequency of reckless deploys. 
Each improvement makes the next improvement more effective.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Attribution Problem</p><p>Not all incidents attributed to 'change-triggered' were definitively proven to be caused by the deploy. Some correlations were timing coincidences. The Deploy Safety Program accepted some measurement noise in the metric in exchange for the simplicity of a clear, attributable signal. <strong>A useful metric with some noise is more actionable than a perfect metric that takes too long to compute.</strong> The team was explicit about this tradeoff in their communications.</p>\n</blockquote>\n\n<blockquote><p>Slack discovered that the biggest threat to Slack's reliability was Slack deploying Slack — which is either a very enlightened finding or a very embarrassing one, depending on how you look at it.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/slack-deploy-safety-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.488778+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Slack"]}, {"id": "https://techlogstack.com/explore/atlassian-april-2022-outage/", "url": "https://techlogstack.com/explore/atlassian-april-2022-outage/", "title": "How a Two-Line Script Silently Deleted 883 Customer Cloud Sites", "summary": "How a miscommunicated ID in a cleanup script permanently deleted 883 Atlassian cloud sites — and what it took to rebuild them over 14 days.", "content_html": "<p><strong>Atlassian</strong> · Reliability · 17 May 2026</p>\n<p>At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — 
methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.</p>\n<ul>\n<li>883 sites deleted</li><li>14 days max outage</li><li>775 customers affected</li><li>450+ engineers mobilized</li><li>~5 min RPO met</li><li>2 restoration approaches</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>The script we used provided both the 'mark for deletion' capability used in normal day-to-day operations (where recoverability is desirable), and the 'permanently delete' capability that is required to permanently remove data when required for compliance reasons.</p><p><em>— Sri Viswanath, CTO — via Atlassian Engineering Blog, April 2022</em></p></blockquote>\n<p>In 2021, Atlassian completed the acquisition and integration of a standalone app called <em>Insight – Asset Management</em> into Jira Service Management as native functionality. The standalone version was now obsolete and needed to be retired from the <strong>200,000+</strong> customer cloud sites that had it installed. An engineering team wrote a cleanup script using an existing deletion process — nothing unusual, nothing new. What happened next would become the longest and most public cloud outage in Atlassian's history. 
The seeds were sown not in a line of code, but in a conversation between two teams separated by function, timezone, and context.</p>\n<p>The deletion API that powered the script accepted two types of identifiers: <strong>app IDs</strong> to remove a specific product installation, and <strong>site IDs</strong> to remove an entire customer workspace. Both were valid inputs. The API assumed the caller knew which they were passing and offered no type-checking, no confirmation prompt, no dry-run mode. The team requesting the deletion provided the IDs of the <em>cloud sites</em> where the app was installed — not the IDs of the app instances themselves. The executing team, receiving a list of IDs and a known-good script, ran it. <em>Soft delete</em> (a reversible deletion that marks data for removal but retains it in backup for a grace period, allowing recovery) was not used; the script took the <span>permanent deletion path</span> instead. The script completed its run from 07:38 to 08:01 UTC on April 5th, 2022.</p>\n<blockquote>\n<p><strong>⚡</strong></p>\n<p>The entire deletion ran in just <strong>23 minutes</strong>. Because it executed through standard provisioning workflows, Atlassian's internal monitoring detected nothing — the system behaved exactly as designed. The first signal of disaster came not from dashboards, but from a customer support ticket filed at 07:46 UTC.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Silent Deletion</h4>\n<p>At 07:38 UTC, the cleanup script begins sequentially deleting sites from a list of 883 IDs. Because deletions pass through the standard <em>Cloud Provisioner</em> workflow — the same pathway used for day-to-day operations — internal monitoring fires no alert. At 07:46 UTC, the first customer support ticket arrives: Jira, Confluence, Opsgenie, and Statuspage are all unreachable.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Wrong IDs, No Guard</h4>\n<p>At 08:53 UTC, engineers confirm the link between the script run and the deletions. 
The <strong>communication gap</strong> is clear: team A passed <em>site IDs</em> (unique identifiers for an entire customer workspace containing all their Atlassian products) instead of app IDs to team B. The deletion API, designed to accept both types without validation, assumed correctness. The script used the permanent delete path, not the soft-delete path, meaning no data was retained in recoverable staging — it was gone from production immediately.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Two Approaches to Rebuild</h4>\n<p>Restoration 1 — creating brand-new sites and migrating data across — took approximately <strong>48 hours per batch</strong> and required 70 sequential steps including re-mapping immutable Cloud IDs across every third-party ecosystem app. On April 9th, the team proposed Restoration 2: re-creating records using the <em>original site identifiers</em>, cutting the process to ~30 steps and ~12 hours per site. An engineering-wide code freeze was imposed on April 8th to eliminate risk of compounding incidents.</p>\n<hr />\n<h3>Result</h3>\n<h4>Full Restoration, Hard Lessons</h4>\n<p>The final affected customer was restored on April 18th — 13 days after the incident began. Atlassian met its <strong>Recovery Point Objective</strong> of one hour: no customer lost more than five minutes of data. The company permanently blocked bulk site deletes, mandated soft-delete policies across all systems, and committed to automated multi-site disaster recovery testing as a regular operational exercise.</p>\n<hr />\n\n<p>What made the outage uniquely brutal was its second-order effect: the script that deleted customer sites also deleted the <strong>contact information</strong> for those customers. Atlassian's support systems required a valid Cloud URL and Atlassian ID to file a ticket — and both were gone. 
<span>Customers couldn't reach support, and Atlassian couldn't reach customers.</span> The company had to reconstruct contact lists from billing systems, prior support tickets, and manual outreach before they could even begin coordinating restoration. The multi-tenant architecture — where data from multiple customers lives in shared storage shards — meant that a global rollback was not an option. Restoring any individual site required surgically extracting and replaying that customer's records without disturbing the data of the thousands of other tenants sharing the same database.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Irony That Hurt Most</p><p>Among the <strong>883 deleted sites</strong> were Atlassian's own internal instances — and Opsgenie, the company's own incident management product. The team managing the worst outage in Atlassian history had to do it without their primary incident tracking and alerting tool. The cobbler's children had no shoes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE HACKER NEWS SIGNAL</strong></p>\n<p>As Atlassian remained largely silent for the first nine days, the outage trended on Hacker News. The highest-voted comment, from someone claiming to be a former Atlassian employee, alleged that internal monitoring was poor and that more than <strong>50% of incidents were customer-detected</strong>. Atlassian neither confirmed nor denied the claims — but the silence amplified the speculation. On Day 9, they finally confirmed what the community had already guessed: the Insight plugin retirement script was the cause.</p>\n</blockquote>\n\n<p>At peak response, the recovery involved <strong>450+ support engineers</strong> running 24/7 shifts across global timezones, manually validating each restored site before handing it back to the customer. 
The team created a dedicated Jira project — <em>SITE</em> — with a custom workflow to track restoration progress site-by-site across engineering, program management, and customer support. The Restoration 2 breakthrough, when it came on Day 4, was the turning point: by re-using original site identifiers, the team could eliminate the most time-consuming step — re-mapping immutable IDs across third-party app integrations — and cut site recovery time from 48 hours to 12. <span>The final site was restored on April 18th. Every customer got their data back.</span></p>\n<blockquote>\n<p><strong>📊</strong></p>\n<p>Recovery Point Objective: Met</p><p>Despite the scale, Atlassian met its one-hour RPO — most customers lost at most <strong>five minutes of data</strong>. The combination of full backups plus point-in-time incremental backups, retained for 30 days, made this possible. What they missed was the RTO: restoring 775 customers took 13 days, not hours.</p>\n</blockquote>\n\n<blockquote><p>Peer review caught the endpoint. It just didn't ask what the IDs were for.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Bigger Context: Server Sunset Pressure</p><p>This incident occurred at a critical business moment: Atlassian had announced the end-of-life for its on-premises Server product, actively pushing customers toward the Cloud. 
The 14-day outage landed directly on top of that migration narrative, giving every enterprise customer a live, public data point about <em>cloud reliability</em> (the ability of a hosted service to maintain uptime and data integrity across millions of tenants) versus self-hosted control.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<ul>\n<li><strong>13d</strong> — Duration from first deletion (April 5) to final site restored (April 18) — the longest outage in Atlassian's history</li>\n<li><strong>&lt;5 min</strong> — Maximum data loss per customer — Recovery Point Objective met, despite missing Recovery Time Objective by days</li>\n<li><strong>~70 → ~30</strong> — Restoration steps cut between Restoration 1 and Restoration 2, reducing site rebuild time from ~48 hours to ~12 hours</li>\n<li><strong>450+</strong> — Support engineers running 24/7 manual validation shifts globally at peak incident response</li>\n</ul>\n\n<p>Recovery began in three parallel workstreams the moment the root cause was confirmed on Day 1. The first workstream assembled a manual team to hand-walk through the restoration steps for individual sites, validating each one. The second workstream raced to automate those same steps so they could be run safely in large batches. The third — the one that ultimately broke the logjam — was a full rewrite of the restoration approach itself. <strong>Restoration 1</strong> created brand-new sites with fresh Cloud IDs, requiring all downstream services and third-party apps to be re-mapped to the new identifiers. This was safe but brutally slow: ~70 steps, ~48 hours per batch, with cascading dependencies that could only run in sequence. Every third-party app in the ecosystem had to be re-integrated. 
The math told a grim story: this approach would take three weeks to clear the full backlog of 775 customers.</p>\n<blockquote>\n<p><strong>THE BREAKTHROUGH: RESTORATION 2</strong></p>\n<p>On April 9th — Day 4 — the team proposed <strong>Restoration 2</strong>: instead of creating new sites, <em>re-create the deleted records in-place using the original site identifiers</em>. The key insight was that <em>immutable identifiers</em> (unique IDs assigned at site creation that are embedded across all downstream services, data records, and third-party integrations and cannot be changed) like CloudID were the primary source of complexity in Restoration 1. By preserving them, the team eliminated over half the restoration steps, removed the need to coordinate re-mapping with third-party app vendors, and reduced site recovery from 48 hours to approximately 12 hours. The trade-off: everything automated for Restoration 1 had to be rewritten, and both approaches ran in parallel for days while the new method was tested and validated.</p>\n</blockquote>\n\n<pre><code class=\"language-python\"># Pseudocode: Restoration 2 — re-create deleted records using original identifiers\n# This was the breakthrough that cut recovery time from ~48h to ~12h per site\n# All service clients and helpers below are illustrative names, not Atlassian's real APIs\n\ndef restore_site_v2(site_id, restore_point_timestamp):\n    # Step 1: Re-create the site record in the Catalogue Service using the ORIGINAL site_id\n    # Critical: preserve original cloudId to avoid re-mapping all downstream references\n    catalogue.recreate(site_id, preserve_original_cloud_id=True)\n\n    # Step 2: Restore identity data (users, groups, permissions) in parallel\n    # These can run concurrently with database restoration — no sequential dependency\n    identity.restore_async(site_id, point_in_time=restore_point_timestamp)\n\n    # Step 3: Restore primary product databases (Jira, Confluence, etc.)\n    # Point-in-time recovery to the restore point, at most 5 minutes before deletion\n    for product in get_site_products(site_id):\n        db.restore_to_point_in_time(\n            product=product,\n            timestamp=restore_point_timestamp,  # at most 5 min before deletion\n            site_id=site_id\n        )\n\n    # Step 4: Restore cross-service data (media attachments, app data, feature flags)\n    # Can parallelize across services that have no dependencies on each other\n    services.restore_parallel(site_id, timestamp=restore_point_timestamp)\n\n    # Step 5: Automated validation — checks all services are healthy for the site\n    validation_result = validate_site(site_id)\n    if not validation_result.passed:\n        raise RestorationError(f\"Site {site_id} failed validation: {validation_result.errors}\")\n\n    # Step 6: Hand off to customer for final sign-off\n    notify_customer(site_id, status='ready_for_validation')</code></pre>\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Root Technical Failure: An API Without Type Safety</p><p>The deletion API accepted both <strong>app IDs and site IDs</strong> as valid inputs and assumed the caller knew which type they were passing. There was no runtime validation to check whether the input ID referred to an app or an entire customer site. 
A single guard — checking the type of the entity behind each ID before executing permanent deletion — would have surfaced the mismatch before a single site was touched.</p>\n</blockquote>\n\n<div><table><caption>Restoration 1 vs Restoration 2 — what changed and why it mattered</caption><thead><tr><th>Dimension</th><th>Restoration 1</th><th>Restoration 2</th></tr></thead><tbody><tr><td>Approach</td><td>Create new site, migrate data in</td><td>Re-create original records in-place</td></tr><tr><td>Site identifiers</td><td>New CloudID assigned (immutable IDs changed)</td><td>Original CloudID preserved</td></tr><tr><td>Steps required</td><td>~70 sequential steps</td><td>~30 steps with parallelism</td></tr><tr><td>Recovery time</td><td>~48 hours per batch</td><td>~12 hours per site</td></tr><tr><td>Third-party apps</td><td>Every app re-integrated per site</td><td>No re-integration needed</td></tr><tr><td>Sites restored</td><td>112 sites (53% of affected users)</td><td>771 sites (47% of affected users)</td></tr></tbody></table></div>\n<p>Four changes were committed as non-negotiable outcomes. First: <strong>universal soft-delete</strong> across all Atlassian systems — permanent deletion of customer data can only occur after a soft-delete period expires, never directly. Second: automated multi-site, multi-product disaster recovery testing, regularly exercised at scale. Third: a large-scale incident playbook with sub-streams, pre-built tooling, and simulation exercises that go far beyond the single-service incidents Atlassian had historically trained for. Fourth: backup of customer contact data outside the product instance itself, so that a site deletion could never again sever the communication channel needed to coordinate recovery. <span>Every one of these was announced as an immediate action, not a future roadmap item.</span></p>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Zero Data Loss</p><p>Despite the scale and duration, <strong>no customer permanently lost data</strong>. Thirty-day immutable backups with point-in-time recovery meant the team could always get back to within five minutes of the deletion event. The RPO was met. The RTO was not. That asymmetry — data safe, access gone for two weeks — defined the entire character of the incident.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔒</strong></p>\n<p>The Code Freeze Decision</p><p>On April 8th, Atlassian imposed a <strong>company-wide code freeze</strong> — no deployments across all of engineering until restoration was complete. This eliminated the risk of a compounding incident, reduced noise, and allowed the entire engineering org to focus exclusively on recovery without distraction from unrelated changes.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>To understand why a script deleting 883 sites took two weeks to reverse, you need to understand what an Atlassian <em>site</em> actually is. A site — for example <code>yourcompany.atlassian.net</code> — is not a row in a database. It is a <strong>logical container distributed across dozens of services</strong>, each maintaining its own slice of state. Identity data (users, groups, permissions) lives in one service. Product databases for Jira, Confluence, and Opsgenie live in others. Media attachments, feature flags, licensing metadata, third-party app configurations — each of these occupies a separate data store. All of it is hosted on AWS and orchestrated through <em>Micros</em> (Atlassian's internal Platform-as-a-Service, which handles deployment, security, and management for all Atlassian cloud services). The site deletion was not confined to a single database: it sent deletion events through the standard provisioning workflow, and every downstream service dutifully removed its copy.</p>\n<h3>How a site deletion propagated through Atlassian's distributed architecture</h3>\n<p><a href=\"https://techlogstack.com/explore/atlassian-april-2022-outage/#architecture\">View interactive diagram on TechLogStack →</a></p>\n\n<blockquote>\n<p><strong>WHY A GLOBAL ROLLBACK WASN'T POSSIBLE</strong></p>\n<p>The instinctive solution — roll back the entire database to before the script ran — was blocked by the <em>multi-tenant architecture</em> (a design where a single database shard stores data for many customers simultaneously, isolated at the application layer rather than the database layer). Each database shard contained data from hundreds of customers, most of whom were completely unaffected. A global rollback would have wiped hours of real work from tens of thousands of innocent customers. The only option was surgical: extract and replay each deleted customer's records individually, from 30-day immutable backups, without touching any surrounding data.</p>\n</blockquote>\n\n<h3>Restoration 2 — re-creating records in-place using original identifiers to avoid re-mapping</h3>\n<p><a href=\"https://techlogstack.com/explore/atlassian-april-2022-outage/#architecture\">View interactive diagram on TechLogStack →</a></p>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Missing Layer: Multi-Site Automated DR</p><p>Atlassian's disaster recovery had been designed for <strong>infrastructure failures</strong> — a lost database, a failed availability zone, a single corrupted service. 
It had never been designed for the scenario of selectively restoring hundreds of customers from shared backups into a live production environment. The capability existed for single-site recovery; it simply hadn't been automated or tested at this scale. The incident forced Atlassian to build, from scratch, the tooling that would eventually become a core part of their DR program.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🏗️</strong></p>\n<p>Micros: The Internal PaaS That Ran the Deletions</p><p>Atlassian's <em>Micros</em> platform orchestrates all service deployments, security controls, and provisioning events across their cloud. The deletion script triggered standard <strong>tenant destruction events</strong> through Micros — which is exactly why monitoring didn't fire. Normal deletions look identical to erroneous bulk deletions from an observability perspective when the system has no input validation at the API layer.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Thirteen days of outage, 450 engineers, two restoration approaches from scratch — and all of it began with two teams that didn't fully understand what IDs they were exchanging. The Atlassian incident is not a story of technical failure. It is a story of what happens when <strong>destructive operations lack defense-in-depth</strong>: no type validation, no dry-run mode, no staged rollout, no soft-delete, and no explicit confirmation that an operation targeting 883 site-level records was actually what anyone intended.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Deletion APIs must validate what they are deleting, not just whether the operation is allowed.</strong> An API that accepts both app IDs and site IDs without distinguishing them is a loaded gun with the safety removed. 
Before any destructive operation executes in production, a system-level check should confirm the <em>type</em> of entity being targeted — and fail loudly if it doesn't match the caller's stated intent.</div></li><li><span>02</span><div><strong><em>Soft delete</em> (marking data for removal with a retention window rather than permanently destroying it immediately) must be the only permitted path for any operation touching customer data.</strong> Permanent deletion paths — even legitimate ones needed for compliance — should require a multi-step authorization separate from standard maintenance workflows. If an operation cannot be reversed in under an hour, it should not be triggerable in a single script run.</div></li><li><span>03</span><div><strong>Disaster Recovery testing must include the scenario you have never practiced, not just the one you have.</strong> Atlassian's backups were excellent and their single-site recovery was proven. What failed was <em>multi-site, multi-product coordinated recovery at scale</em> — a scenario that had no runbook and no automation. Test the rare catastrophe, not just the common failure.</div></li><li><span>04</span><div><strong>Customer contact information must be backed up outside the system it describes.</strong> When the deletion removed customer sites, it also removed the contact data Atlassian needed to reach those customers. <span>Never let a single operation both cause an incident and sever the communication channel needed to resolve it.</span> Store critical customer identifiers in a system that is logically and physically separate from the product instances they reference.</div></li><li><span>05</span><div><strong>Staged rollout applies to maintenance scripts, not just feature deployments.</strong> The first production run of the cleanup script processed 30 sites correctly — because those IDs had been sourced before the miscommunication occurred. The second run hit 883. A <em>staged rollout policy</em> on any script modifying customer data at scale would have surfaced the error on a batch of 10 before it reached 883.</div></li></ol></div>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Communications Lesson They Admitted</p><p>For nine days, Atlassian remained largely silent publicly while customers speculated on forums. They later acknowledged they should have communicated <strong>directional estimates with explicit uncertainty</strong> — even imprecise timelines — far earlier. Silence read as incompetence. \"We don't know yet\" is a communication, not a failure to communicate.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🏛️</strong></p>\n<p>The Industry Shift This Incident Accelerated</p><p>After the Atlassian outage, <strong>\"soft-delete by default\"</strong> moved from engineering best-practice advice to boardroom checklist item at cloud companies worldwide. The incident remains a widely cited example of how the <em>blast radius</em> (the scope of unintended damage when an operation reaches further than intended) of a single maintenance script can exceed the blast radius of a network attack — because the system was designed to trust the caller.</p>\n</blockquote>\n\n<blockquote><p>The API accepted both app IDs and site IDs. 
It just didn't ask which one you meant.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/atlassian-april-2022-outage/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.492393+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Atlassian"]}, {"id": "https://techlogstack.com/explore/netflix-live-origin-tyson-paul-2024/", "url": "https://techlogstack.com/explore/netflix-live-origin-tyson-paul-2024/", "title": "65 Million Streams: How Netflix Rebuilt Its Guts for Live", "summary": "How Netflix built a custom live origin to handle 65M concurrent streams for Tyson vs. Paul — replacing S3 with Cassandra+EVCache after latency and origin storms thre", "content_html": "<p><strong>Netflix</strong> · Live Streaming · 17 May 2026</p>\n<p>November 15, 2024: 65 million people log on to watch Mike Tyson fight Jake Paul, the largest live sports stream in history. 
Behind the scenes, Netflix engineers are white-knuckling a system they built from scratch — one where a single bad video segment, a CDN request storm, or a missed 2-second write deadline means millions of viewers see a black screen.</p>\n<ul>\n<li>65M concurrent streams</li><li>113ms → 25ms p50 latency</li><li>200Gbps+ read throughput</li><li>2-second segment SLA</li><li>90%+ cache hit on 404 storms</li><li>38M events/sec monitored</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>Our back-of-the-envelope calculations showed worst-case read throughput in the O(100Gbps) range, which would normally be extremely expensive for a strongly-consistent storage engine like Apache Cassandra.</p><p><em>— Xiaomei Liu, Joseph Lynch, Chris Newton — via Netflix Engineering Blog</em></p></blockquote>\n<p>On November 15, 2024, Netflix did something it had been quietly engineering toward for three years: it streamed the biggest live sports event the internet had ever seen. <strong>65 million concurrent viewers</strong> watched Mike Tyson and Jake Paul trade punches at AT&amp;T Stadium in Arlington, Texas — a number that dwarfs any single streaming event Netflix had previously attempted. <em>Open Connect</em> (Netflix's proprietary global CDN, a network of hardware appliances co-located inside ISPs worldwide, purpose-built for video delivery) nodes around the world hammered the origin for segments every two seconds, each chunk potentially several megabytes, from every timezone simultaneously. 
The pressure on what Netflix engineers call <em>Live Origin</em> — the custom-built microservice bridging the cloud encoding pipeline and the CDN — was unlike anything in the company's history. <span>This was not a load test. There was no rollback button if the system buckled.</span></p>\n<p>Netflix's engineering challenge with live video is categorically different from its on-demand catalog. <em>SVOD</em> (Subscription Video on Demand — Netflix's traditional business, where content is pre-encoded, cached extensively, and served from ISP-colocated appliances at near-zero origin load) content is encoded once, uploaded once, and then served almost entirely from the edge with the origin barely involved. Live video destroys this model entirely. Every <strong>2-second segment</strong> is brand new — it must be encoded, packaged, DRM-encrypted, and written to the origin within a hard real-time deadline, while simultaneously dozens of CDN nodes are requesting that same segment the moment it should exist. Netflix's existing infrastructure, including its massive Open Connect network, was built for static content; live content required the engineers to rethink storage, traffic management, and quality control from first principles.</p>\n<blockquote>\n<p><strong>📡</strong></p>\n<p>Netflix's early live events used plain AWS S3 buckets as the segment store — and the results were brutal. <strong>Median write latency of 113ms</strong> against a 2-second publishing deadline meant the system was spending nearly 6% of every segment's entire window just waiting on storage acknowledgment, with p99 latencies of 267ms making late segments a near-certainty at scale.</p>\n</blockquote>\n\n<p>The original Live Origin architecture relied on <em>S3</em> (Amazon Simple Storage Service — a general-purpose object store, highly durable and scalable but not optimized for the strict latency SLAs of real-time live streaming) as the backing store for video segments. 
When the packager finished encoding a segment, it issued a PUT request to S3; when an Open Connect node needed that segment, it issued a GET. The problem was that S3 is designed for general-purpose durability, not for predictable low-latency writes. High latency variation on writes meant segments frequently missed their publishing window. At high request rates exceeding <strong>100 RPS per event</strong>, S3 throttled the origin, causing playback latency spikes visible to viewers as buffering. The team knew that scaling to tens of millions of concurrent streams would make this completely untenable — <span>a generic storage solution cannot serve as the foundation of a real-time broadcast.</span></p>\n<h3>The Origin Storm Problem</h3>\n<p>Even after replacing S3, the engineers faced a second failure mode that they called the <em>Origin Storm</em> (the scenario where many Open Connect CDN nodes request the same segment from the origin at the same moment, generating read throughput that can overwhelm the storage system). When a new segment becomes available — at the top of every 2-second clock tick — potentially dozens of top-tier Open Connect nodes across different geographic sites all issue GET requests simultaneously. Each segment can be several megabytes of encoded video. Back-of-envelope calculations put worst-case read throughput on the order of <strong>100 Gbps</strong> — a volume that would obliterate write performance on any strongly-consistent database, including Apache Cassandra. The engineers had traded one problem for another: a write-optimized store that couldn't survive its own read traffic.</p>\n\n<h3>Problem</h3>\n<h4>S3 Can't Keep Up</h4>\n<p>Early live events reveal S3 segment writes hitting <strong>113ms median latency</strong> and 267ms at p99, against a hard 2-second publishing deadline. 
CDN nodes requesting segments early get throttled responses; playback stalls and buffering appear for viewers in real time.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Generic Storage Meets Real-Time Deadlines</h4>\n<p>S3 lacks the <em>write SLA</em> (a guaranteed maximum time within which a write operation will be acknowledged, critical in live streaming where a missed segment publish directly causes viewer-visible buffering) Netflix requires for live. Its <em>request throttling</em> (S3's automatic rate-limiting when a single prefix receives too many requests per second, designed to protect shared infrastructure) kicks in at the exact request rates a live event generates. No amount of tuning can make a general-purpose object store behave like a real-time media system.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Custom KeyValue Store on Cassandra + EVCache</h4>\n<p>Netflix builds a custom KeyValue abstraction layered on Apache Cassandra with <em>LSM</em> (Log-Structured Merge Tree — Cassandra's write-optimized storage engine that buffers writes in memory and flushes sequentially, achieving high write throughput with predictable latency) for durability, and adds EVCache (Memcached-based) as a write-through read cache. Large segment payloads are chunked to enable idempotent retries and load distribution across the Cassandra cluster. <strong>Separate EC2 stacks and storage clusters</strong> are provisioned for publishing and CDN-facing traffic.</p>\n<hr />\n<h3>Result</h3>\n<h4>65M Streams Without a Dropped Segment</h4>\n<p>Median write latency drops from <strong>113ms to 25ms</strong>; p99 improves from 267ms to 129ms. The EVCache layer absorbs nearly all read traffic, allowing the system to sustain <strong>200Gbps+ read throughput</strong> without touching the write path. The Tyson vs. 
Paul fight streams successfully to 65 million concurrent viewers — the largest live sports event ever delivered over the internet.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Read-Write Tension</p><p>Cassandra is excellent for writes, but its <em>local-quorum consistency</em> (requiring acknowledgment from a majority of nodes in the local datacenter before a write is confirmed, trading some latency for guaranteed durability even if one availability zone fails) model meant that high concurrent reads from dozens of CDN nodes created <strong>resource contention that degraded writes</strong>. Netflix had to explicitly separate read and write paths at the infrastructure level — not just logically, but physically — to prevent the CDN read storm from destroying the publisher's ability to deliver new segments.</p>\n</blockquote>\n\n<p>The elegant solution to the origin storm was a <strong>write-through cache</strong>: every segment written to Cassandra is simultaneously cached in EVCache (Netflix's distributed Memcached layer). When an Open Connect node requests a segment, it hits EVCache first. Cache hits serve at network speed; only misses reach Cassandra. This achieves <strong>read-write separation at the cache layer</strong>: the write path remains clean and fast through Cassandra's LSM engine, while reads are absorbed almost entirely by the in-memory cache. The team validated this against its own back-of-envelope estimate: if the cache serves 90% of reads, then only 10% of the theoretical 100Gbps storm (roughly 10Gbps) ever reaches Cassandra, putting it comfortably in the sustainable range. <span>In practice the system exceeded 200Gbps sustained read throughput with no write degradation observed.</span></p>\n<blockquote>\n<p><strong>THE REDUNDANT PIPELINE</strong></p>\n<p>Netflix runs two completely independent live encoding pipelines across separate AWS regions, with separate contribution feeds, encoders, and packagers. 
<em>Epoch locking</em> (a mechanism where both pipelines use synchronized timestamps derived from UTC timecodes embedded in each video frame, ensuring their output segments are interchangeable without direct inter-pipeline communication) ensures the two pipelines produce interchangeable segments. When the Live Origin receives a CDN request, it selects the first valid segment from either pipeline — providing transparent, automatic failover without any client involvement.</p>\n</blockquote>\n\n<blockquote><p>Netflix spent three years quietly solving a problem nobody knew it had, and the 65 million people watching the fight had no idea any of this was happening — which is exactly how it was supposed to work.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<ul>\n<li><strong>113ms→25ms</strong> — Segment write median latency: S3 baseline vs. Cassandra+EVCache — a 4.5x improvement that brought writes inside the 2-second segment publishing budget with headroom to spare</li>\n<li><strong>200Gbps+</strong> — Sustained read throughput the EVCache layer absorbs from Open Connect CDN nodes, preventing origin storms from reaching the Cassandra write path during peak live events</li>\n<li><strong>90%+</strong> — Cache hit ratio for control-plane metadata during 404 storms — in-memory caching of event and rendition metadata means non-existent segment requests are rejected before they ever touch the storage layer</li>\n<li><strong>5s TTL</strong> — Max-age returned with HTTP 503 to low-priority DVR traffic under load — deliberately instructing Open Connect to back off and batch repeated requests, dampening traffic storms at the CDN layer</li>\n</ul>\n\n<h3>The Storage Architecture Fix</h3>\n<p>The core fix was replacing S3 with a purpose-built <em>KeyValue abstraction</em> (a storage API layer Netflix built internally, originally for cloud game-save state, adapted for Live Origin 
to provide chunked storage of multi-megabyte video segments with idempotent retry semantics and strict latency guarantees) layered on Apache Cassandra. The existing system needed three significant enhancements for live video: write availability through AZ failures (solved by Cassandra's local-quorum model), handling of large MiB-scale payloads (solved by the chunking algorithm), and read throughput during CDN storms (solved by EVCache write-through caching). Netflix engineers noted the solution was <strong>significantly more expensive</strong> than continuing with S3, but explicitly deprioritized cost — <span>at 65 million concurrent streams, a latency spike that reaches viewers is a service-ending event, not a cost optimization tradeoff.</span></p>\n<pre><code class=\"language-python\"># Netflix Live Origin: Simplified segment write path with chunking and priority rate limiting\n# Illustrative sketch only: clients, types, and helpers here are assumed names, not Netflix's real APIs\n\ndef write_segment(segment: LiveSegment, priority: Priority) -> WriteResult:\n    # 1. Break large MiB segment into small chunks so load spreads across the cluster\n    #    Each chunk can be independently retried without re-sending the whole segment\n    chunks = chunk_payload(segment.data, chunk_size_kb=256)\n\n    # 2. Write to Cassandra with local-quorum consistency\n    #    local-quorum = majority of replicas in the local datacenter must ack before returning\n    #    With replicas spread across AZs, this survives a full AZ failure with predictable latency\n    for chunk in chunks:\n        cassandra_kv.put(\n            key=segment.url_path + f\":chunk:{chunk.index}\",\n            value=chunk.data,\n            consistency=Consistency.LOCAL_QUORUM  # AZ-resilient, ~25ms median\n        )\n\n    # 3. Also warm the EVCache read layer (write-through)\n    #    This means the very first CDN GET for this segment hits cache, not Cassandra\n    #    absorbing the origin storm before it reaches the write path\n    evcache.set(\n        key=segment.url_path,\n        value=segment.data,\n        ttl_seconds=segment.duration + BUFFER  # expire after segment is \"old\"\n    )\n\n    return WriteResult.OK\n\ndef handle_cdn_get(url: str, request_type: RequestType) -> HttpResponse:\n    # Priority rate limiting: protect write path when storage is under stress\n    if storage_platform.is_stressed():\n        if request_type == RequestType.DVR:          # low priority: time-shifted playback\n            # Tell CDN to back off for 5 seconds and retry\n            return HttpResponse(503, headers={\"Cache-Control\": \"max-age=5\"})\n        # Live edge traffic (request_type == LIVE_EDGE) is ALWAYS allowed through\n\n    # Check EVCache first: most reads are expected to be served from here\n    cached = evcache.get(url)\n    if cached is not None:\n        return HttpResponse(200, body=cached)  # fast path: no Cassandra touch\n\n    # Cache miss: reconstruct from Cassandra chunks\n    chunks = cassandra_kv.get_all_chunks(url)   # only cache misses reach here\n    return HttpResponse(200, body=reassemble(chunks))</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The 404 Storm Defense</p><p>Live origin structures metadata hierarchically as <em>event → stream rendition → segment</em>. When CDN nodes request segments that don't yet exist — because the encoder hasn't finished them yet — the origin rejects the request using <strong>cached event and rendition metadata</strong>, achieving a <strong>90%+ cache hit ratio</strong> on these control-plane lookups and preventing the 404 flood from ever reaching the Cassandra storage layer.</p>\n</blockquote>\n\n<p>Path isolation was the second major fix. 
Netflix built <strong>completely separate EC2 compute stacks</strong> for publishing traffic (from the cloud packager) and CDN-facing traffic (from Open Connect nodes). At the storage layer, separate KeyValue clusters serve read and write operations independently. This means a CDN traffic surge — which happens at the exact moment a high-profile event begins — <span>cannot physically reach the publishing path</span>. The packager's write operations run on dedicated infrastructure that is invisible to the CDN. <em>Blast radius</em> (the scope of damage if one component fails — by isolating publishing from CDN retrieval, Netflix ensured that an origin storm could degrade DVR delivery without ever threatening the live edge) is contained at the architecture level rather than by runtime throttling.</p>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Dual-Pipeline Segment Selection</p><p>The Live Origin runs two independent encoding pipelines across separate AWS regions. When a CDN node requests a segment, the origin checks both pipelines in deterministic order and returns <strong>the first valid one</strong>. Segment defects — detected by lightweight media inspection at the packager — are communicated as metadata, allowing the origin to skip bad segments and serve good ones transparently, with no client-side logic required.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Cost Was Not a Constraint</p><p>Netflix engineers explicitly called out that the Cassandra+EVCache solution is more expensive than S3. The architectural choice to separate read and write paths with dedicated compute stacks and dual Cassandra clusters means materially higher infrastructure spend per event. 
Netflix accepted this as an explicit design constraint: for a 65-million-viewer live event, reliability is a product requirement, not a cost optimization target.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>MILLISECOND-GRAIN CACHING</strong></p>\n<p>Standard HTTP Cache-Control headers work only at <strong>one-second granularity</strong> — a lifetime when segments are generated every 2 seconds. Netflix added <em>millisecond-grain caching</em> to nginx at the edge to enable sub-second backoff signals. This prevents CDN nodes from hammering the origin during the brief window between when they expect a segment and when it actually arrives, significantly reducing origin-facing request chatter without modifying any client device code.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The Netflix Live Origin sits at the exact intersection of the cloud encoding pipeline and the global Open Connect CDN — a position that makes it simultaneously the last step for publishers and the first step for hundreds of millions of viewer requests. Before the Cassandra rebuild, this was a single S3 bucket per region: simple, cheap, and completely unable to handle the latency requirements and read-write contention of a live event at scale. <strong>The post-rebuild architecture</strong> separates publishing and CDN retrieval into physically isolated compute stacks, introduces a write-optimized Cassandra backing store with EVCache write-through caching, and adds intelligent traffic prioritization at every layer. 
<em>REaP</em> (Redundant Encoding and Packaging — an ISO/IEC standard (23009-9) for dual-pipeline live streaming where both pipelines produce interchangeable segments without inter-pipeline communication) compliance ensures both encoding pipelines produce segments that any downstream component can select from interchangeably.</p>\n<h3>Before: S3-Based Origin — Single Store, No Path Isolation</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-live-origin-tyson-paul-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE WRITE-READ CONFLICT</strong></p>\n<p>In the S3 architecture, the same storage endpoint served both segment publishers (writing every 2 seconds) and Open Connect nodes (reading simultaneously from dozens of global locations). <strong>There was no isolation</strong>: a CDN request surge at event launch directly competed with the packager's write operations, and S3's throttling kicked in at exactly the wrong moment — when the event started and both loads peaked simultaneously.</p>\n</blockquote>\n\n<h3>After: Live Origin with Cassandra+EVCache and Full Path Isolation</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-live-origin-tyson-paul-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<p>The architecture diagram reveals the key insight: <strong>publishing and CDN retrieval share no infrastructure path</strong>. The packager writes to a dedicated EC2 stack with its own KeyValue write cluster backed by Cassandra's LSM engine. Open Connect nodes read from a completely separate EC2 stack with its own KeyValue read cluster, serviced almost entirely by EVCache. The Cassandra cluster sits beneath both, but the write-through cache ensures reads almost never reach it. 
When the system is under storage stress, the origin's <em>priority rate limiter</em> (a per-request traffic shaper that categorizes incoming requests by user impact — live edge (highest priority) vs. DVR (lower priority) — and sheds lower-priority traffic first when resources are constrained) sheds DVR traffic with HTTP 503 + 5-second max-age headers, <span>protecting live edge delivery for every viewer currently watching the event</span>.</p>\n<blockquote>\n<p><strong>🔒</strong></p>\n<p>Epoch Locking: How Two Pipelines Stay Synchronized Without Talking</p><p>Both encoding pipelines embed <em>UTC timecodes</em> (precise wall-clock time embedded in each video frame as SEI metadata by the contribution encoder, ensuring both pipelines produce segments with identical timing boundaries despite running completely independently) as SEI messages in each video frame. This allows both pipelines to produce segments with identical duration boundaries and interchangeable segment numbers — so the Live Origin can transparently switch between pipelines on a <strong>per-segment basis</strong> with no viewer-visible discontinuity.</p>\n</blockquote>\n\n<div><table><caption>Live Origin storage architecture: before vs. 
after, by key operational dimension</caption><thead><tr><th>Dimension</th><th>S3 (Before)</th><th>Cassandra + EVCache (After)</th></tr></thead><tbody><tr><td>Write latency p50</td><td>113ms</td><td>25ms</td></tr><tr><td>Write latency p99</td><td>267ms</td><td>129ms</td></tr><tr><td>Max read throughput</td><td>~S3 limit (throttled)</td><td>200Gbps+ (EVCache)</td></tr><tr><td>Write-read isolation</td><td>None — shared bucket</td><td>Fully isolated EC2 stacks</td></tr><tr><td>AZ failure resilience</td><td>S3 replication</td><td>Cassandra local-quorum</td></tr><tr><td>DVR storm protection</td><td>None</td><td>503 + 5s TTL backpressure</td></tr><tr><td>404 storm defense</td><td>None</td><td>In-memory metadata cache (90%+ hit)</td></tr></tbody></table>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Netflix's Live Origin story is a masterclass in what happens when you try to use general-purpose infrastructure for a real-time system with hard latency deadlines. Every decision — from choosing Cassandra over S3, to physically separating publishing and CDN stacks, to the priority rate limiter — came from a specific failure mode the engineers encountered in production. These are the principles that survive contact with 65 million simultaneous viewers.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>General-purpose storage cannot serve as the foundation of a real-time system.</strong> S3 is brilliant at what it does — durable, scalable object storage — but its <em>p99 latency</em> (the response time that 99% of requests fall under; if p99 is 267ms on a 2-second deadline, roughly 1-in-100 segments publishes dangerously close to the boundary) and request throttling behavior are incompatible with hard 2-second segment deadlines. 
If your system has a real-time SLA, audit every storage dependency and ask whether it was designed to meet that SLA — not just whether it typically does.</li><li><span>02</span><div><strong>Read storms and write requirements are incompatible on shared storage without explicit isolation.</strong> The <em>Origin Storm</em> (when many CDN nodes simultaneously request the same fresh segment, generating burst read throughput that competes with the write path on shared storage infrastructure) only became visible at live-event scale, but the architectural vulnerability existed from day one. Separate your write path from your read path at the infrastructure level — not just logically — before you discover the contention in production.</li><li><span>03</span><div><strong>Write-through caching is a write-protection strategy, not just a read-acceleration one.</strong> Netflix's EVCache layer absorbs over <strong>90% of CDN reads</strong>, meaning the write path in Cassandra operates in near-silence even during a 65-million-viewer storm. When you add a cache to a system, design it explicitly to protect your write path — not merely to speed up reads.</li><li><span>04</span><div><strong>Priority-based degradation is an active reliability tool, not a failure mode.</strong> Deliberately returning <strong>HTTP 503 with a 5-second TTL</strong> to low-priority DVR traffic is not a bug — it is an engineered backpressure mechanism that protects the live edge for viewers watching right now. Build explicit traffic tiers into your architecture, with defined behaviors for what gets shed first when resources are constrained.</li><li><span>05</span><div><strong>Redundancy only works if it is tested at the actual failure granularity.</strong> Netflix runs two fully independent encoding pipelines across separate AWS regions, contribution feeds, encoders, and packagers — not just two copies of the same pipeline in the same region. 
<em>Epoch locking</em> (synchronizing both pipelines via UTC timecodes embedded at the contribution encoder, so they produce interchangeable segments without any inter-pipeline communication) makes them interchangeable at the segment level, not just the stream level. True redundancy must be tested at the smallest unit of failure in your system, not the largest.</li></ol>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Cost Admission</p><p>Netflix engineers explicitly wrote that the Cassandra+EVCache architecture is <strong>more expensive</strong> than S3. This honesty is rare and valuable: <em>reliability at scale sometimes costs more</em>, and pretending otherwise leads to systems that are cheap until the moment they matter, then catastrophically expensive. Know your constraints before you optimize.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>OBSERVABILITY AT LIVE SCALE</strong></p>\n<p>Netflix processed <strong>38 million telemetry events per second</strong> during its largest live events, using a mix of internal tools (<em>Atlas</em>, <em>Mantis</em>, <em>Lumen</em>) and open-source tech (Kafka, Druid) to deliver critical metrics to the Control Center in seconds. Live streaming is not just an engineering problem — it is a real-time operations problem. 
Build your observability layer <strong>before</strong> the event, not during it.</p>\n</blockquote>\n\n<blockquote><p>Netflix built a multi-year, multi-million-dollar storage architecture so that 65 million people could watch two men punch each other — and the highest praise it will ever receive is that nobody noticed.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/netflix-live-origin-tyson-paul-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.576593+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Live Streaming", "Netflix"]}, {"id": "https://techlogstack.com/explore/stripe-flow-typescript-2023/", "url": "https://techlogstack.com/explore/stripe-flow-typescript-2023/", "title": "Stripe Converted 3.7 Million Lines of JavaScript in One Pull Request on a Sunday", "summary": "How Stripe's engineering team automated a 3.7-million-line JavaScript codebase migration from Flow to TypeScript and shipped it in a single pull request on a Sunday.", "content_html": "<p><strong>Stripe</strong> · Performance · 17 May 2026</p>\n<p>On Sunday, March 6, 2022, Stripe merged a single pull request that converted their entire largest JavaScript codebase from Flow to TypeScript. 3.7 million lines of code. Hundreds of engineers arrived Monday morning to start writing TypeScript. 
The migration had been invisible until it wasn't.</p>\n<ul>\n<li>3.7M lines converted in 1 PR</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>In March 2022, Stripe's engineering blog announced something that stopped engineers mid-scroll: <strong>On Sunday, March 6, we migrated Stripe's largest JavaScript codebase from Flow to TypeScript. In a single pull request, we converted more than 3.7 million lines of code.</strong> The next day, hundreds of engineers came in to write TypeScript for the first time. The migration had been planned and built over months. Its execution took one day. Understanding how a 3.7-million-line migration becomes a single pull request requires understanding the architectural approach: you don't migrate 3.7 million lines manually, you <strong>build a machine that migrates them for you</strong>.</p>\n<blockquote>\n<p><strong>📝</strong></p>\n<p><em>Flow</em> (Facebook's JavaScript type system — an alternative to TypeScript that checks types statically but uses its own annotation syntax and requires the Flow type checker rather than TypeScript's tsc) had been Stripe's type system of choice when their codebase was first typed. By 2022, TypeScript had become the overwhelming industry standard, with a richer ecosystem, better IDE tooling, and more active development. Flow's community was contracting while TypeScript's was exploding.</p>\n</blockquote>\n\n<p>The decision to migrate from <em>Flow</em> (Facebook's open-source JavaScript type checker with its own annotation syntax) to <em>TypeScript</em> (Microsoft's statically typed superset of JavaScript that compiles to plain JavaScript, which became the de-facto standard for typed JavaScript after 2018) was driven by practical engineering considerations. 
Flow's tooling had fallen behind TypeScript's in IDE integration quality — autocomplete, inline error reporting, and refactoring support were all noticeably worse in editors like VS Code. The ecosystem had moved: most open-source libraries shipped TypeScript type definitions, not Flow definitions, forcing Stripe's engineers to write manual stubs or use untyped imports. The talent pipeline had shifted: engineers coming from university and other companies expected TypeScript. <strong>Every month on Flow was a month accumulating a migration debt</strong> that would only grow harder to pay.</p>\n<blockquote>\n<p><strong>THE AUTOMATION IMPERATIVE</strong></p>\n<p>Manual migration of 3.7 million lines would require years of engineer time and create an inconsistent, error-prone result with different teams applying different migration patterns. The only viable approach was building a <strong>fully automated migration tool</strong> — an <em>AST-based codemod</em> (a program that parses source code into an Abstract Syntax Tree, performs structural transformations on the tree, then emits the modified code — preserving formatting and only changing what needs to change) that could parse every Flow-annotated file and emit a valid TypeScript equivalent.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>3.7 Million Lines on a Declining Type System</h4>\n<p>Stripe's largest JavaScript codebase was typed with Flow at a time when Flow was a competitive choice. By 2022, TypeScript dominated the industry: better IDE support, a larger ecosystem, and a growing talent pool that expected TypeScript. Every month on Flow was accumulating migration debt while engineering productivity on Flow-typed code fell behind TypeScript-typed equivalents.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Migration Scale Made Manual Approach Infeasible</h4>\n<p>3.7 million lines across hundreds of files cannot be migrated manually without years of effort and severe consistency problems. 
The type annotation syntax differences between Flow and TypeScript are pervasive — every file would need to be touched. An automated approach was required, which meant building a production-quality migration tool before the migration could begin.</p>\n<hr />\n<h3>Solution</h3>\n<h4>AST-Based Codemod: Build the Machine</h4>\n<p>Stripe's engineering team built a codemod — an automated code transformation tool — that parsed Flow-annotated JavaScript files using an <em>AST</em> (Abstract Syntax Tree — a tree representation of code's structure that enables programmatic analysis and transformation without working with raw text strings) parser, applied transformation rules for each Flow-to-TypeScript annotation conversion, and emitted valid TypeScript. The tool was built and iterated on for months before the migration day.</p>\n<hr />\n<h3>Result</h3>\n<h4>One Pull Request, One Sunday, Done</h4>\n<p>On March 6, 2022, Stripe merged a single PR converting 3.7 million lines. On Monday, hundreds of engineers arrived to find their codebase in TypeScript. The migration was complete, clean, and consistent — because a machine did it, not hundreds of engineers doing it differently.</p>\n<hr />\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Flow and TypeScript: The Annotation Differences</p><p>Flow and TypeScript share a common lineage — both add type annotations to JavaScript using a similar syntax. But they diverge in meaningful ways: Flow uses <strong>type</strong> declarations differently, handles null/undefined with its own operators, has its own syntax for type imports, and uses a different comment format for suppressing type errors. Each of these differences required a transformation rule in the codemod, and edge cases accumulated quickly across 3.7 million lines.</p>\n</blockquote>\n\n<p>The codemod development phase was not a weekend project — it was months of careful engineering. 
Stripe's team had to map every Flow annotation pattern to its TypeScript equivalent, handle edge cases and ambiguous cases, verify the transformation preserved semantic meaning, and run the tool against subsets of the codebase to validate correctness before trusting it on the full 3.7 million lines. The transformation rules were tested against the actual codebase incrementally, with TypeScript compilation as the correctness oracle: <strong>if the converted code compiled without type errors, the transformation was treated as correct</strong>. Each failing compilation revealed another edge case for the codemod to handle.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Suppressions Problem</p><p>Both Flow and TypeScript support type error suppression comments — a way to tell the type checker to ignore a specific error. These comments have different syntax in Flow (<code>// $FlowFixMe</code>) versus TypeScript (<code>// @ts-ignore</code>). Correctly migrating suppressions required not just syntax transformation but understanding <strong>what the suppression was suppressing and whether the equivalent TypeScript error existed</strong>. Some suppressions could be removed because TypeScript handled the case correctly; others needed the equivalent TypeScript suppression syntax.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>💡</strong></p>\n<p>The One-PR Strategy: Atomic Consistency</p><p>A single atomic pull request was the only way to ensure <strong>consistent migration state</strong>. If the migration were rolled out gradually — file by file or team by team — the codebase would be in a mixed state with some files using Flow syntax and others TypeScript syntax. This mixed state would require supporting both type checkers simultaneously, create confusion for engineers working across files, and extend the migration timeline to months. 
The single-PR atomic approach eliminated the mixed state entirely.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Why Sunday Was the Right Day</p><p>Launching a 3.7-million-line migration on a Sunday was a deliberate risk reduction strategy. <strong>Sunday has the lowest deploy frequency and the lowest traffic</strong> of any day in Stripe's week — meaning if something went wrong with the TypeScript configuration, there was less production code running that might be affected, and engineers could address issues before the Monday morning rush. The Sunday timing transformed a potentially chaotic migration into a calm, controllable event.</p>\n</blockquote>\n\n<blockquote><p>On Sunday, March 6, we migrated Stripe's largest JavaScript codebase from Flow to TypeScript. In a single pull request, we converted more than 3.7 million lines of code. The next day, hundreds of engineers came in to start writing TypeScript for their projects.</p><p><em>— Stripe Engineering, via Stripe Engineering Blog</em></p></blockquote>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Risk of Atomic Migration: No Partial Rollback</p><p>The single-PR atomic approach eliminates mixed state but also eliminates partial rollback. If the TypeScript configuration had a subtle misconfiguration affecting production builds, the only option was to revert the entire migration PR — 3.7 million lines back to Flow in one operation. Stripe mitigated this by running the TypeScript configuration in CI for weeks before the migration day, ensuring the build system was proven before the code was switched. Atomic migrations require particularly thorough pre-migration validation.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Codemod: Engineering the Migration Machine</h3>\n<p>The codemod that performed the migration is itself a significant engineering artifact. 
It had to handle every type annotation pattern present in 3.7 million lines of production code — including patterns that were technically valid Flow but unusual, patterns generated by code generation tools, and patterns accumulated across years of different engineers with different Flow styles. The tool used <em>jscodeshift</em> (a JavaScript codemod toolkit that parses code into an AST, applies transformation functions, and prints the modified AST back to source code while preserving formatting) as its transformation framework, with custom rules for each Flow-to-TypeScript conversion.</p>\n<ul>\n<li><strong>3.7M</strong> — Lines of code converted in the migration — the largest single JavaScript codebase at Stripe, transformed in a single automated operation</li>\n<li><strong>1 PR</strong> — Deployment vehicle for the entire migration — ensuring atomic, consistent state across all 3.7 million lines simultaneously</li>\n<li><strong>1 day</strong> — Execution time of the migration on March 6, 2022 — months of codemod development compressed into a single Sunday deployment</li>\n<li><strong>100%</strong> — TypeScript compilation success rate after migration — the correctness oracle that validated the codemod was production-ready</li>\n</ul>\n\n<pre><code>// Simplified example of a Flow-to-TypeScript codemod transformation rule\n// Real implementation uses jscodeshift AST transformation framework\n\n// FLOW syntax examples:\n// type Props = { name: string, age: number }\n// const foo = (x: ?string) => x  // ?string = string | null | void (undefined) in Flow\n// import type { User } from './types'  // Flow type import\n\n// TYPESCRIPT equivalents:\n// type Props = { name: string; age: number }  // semicolons by convention (TypeScript also accepts commas)\n// const foo = (x: string | null | undefined) => x  // explicit union, not ?\n// import type { User } from './types'  // same syntax — lucky!\n\n// Simplified codemod rule for nullable type conversion:\nmodule.exports = function transformer(file, api) {\n  const j = api.jscodeshift;\n  const root = j(file.source);\n  \n  // Find all nullable type annotations: ?SomeType\n  root.find(j.NullableTypeAnnotation).replaceWith(path => {\n    // Replace ?T with a union of T and null.\n    // Simplified: Flow's ?T also admits undefined, so a complete rule must add it too.\n    return j.unionTypeAnnotation([\n      path.node.typeAnnotation,      // the original T\n      j.nullLiteralTypeAnnotation(), // null\n    ]);\n  });\n  \n  // Visit Flow object types, where comma separators would become semicolons\n  root.find(j.ObjectTypeAnnotation).forEach(path => {\n    path.node.properties.forEach(prop => {\n      // placeholder: the per-property rewrite is omitted in this simplified example\n    });\n  });\n  \n  return root.toSource({ quote: 'single' }); // emit transformed source\n};</code></pre>\n<blockquote>\n<p><strong>COMPILATION AS THE CORRECTNESS ORACLE</strong></p>\n<p>The migration team used TypeScript compilation (<code>tsc --noEmit</code>) as the primary correctness oracle for the codemod. A successfully compiled file meant the transformation was type-correct. A compilation error meant the codemod had produced invalid TypeScript — revealing a missing transformation rule or an edge case. <strong>Running tsc against incrementally migrated subsets of the codebase was the primary quality loop for codemod development</strong>, more reliable than manual code review of thousands of transformed files.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Monday Morning: Hundreds of Engineers, New Language</p><p>The day after the migration, hundreds of Stripe engineers arrived to find their codebase in TypeScript. No ramp-up period, no gradual transition, no mixed state to navigate. TypeScript was simply the language from that day forward. 
The abrupt transition required good internal documentation and TypeScript onboarding resources, but the absence of a prolonged mixed-state period was a net engineering productivity gain — engineers could learn one thing instead of learning two systems simultaneously.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The TypeScript Ecosystem Advantage</p><p>Post-migration, Stripe engineers gained the full TypeScript ecosystem advantage: TypeScript-first type definitions for most open-source libraries, significantly better IDE autocomplete and inline error reporting in VS Code, and compatibility with the rest of the industry's tooling. The tooling quality difference between Flow and TypeScript by 2022 was substantial — the migration unlocked a persistent daily productivity improvement for hundreds of engineers working in the codebase.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Codemod Iteration: Months Before the Sunday</p><p>The codemod was not built once and deployed — it was iterated. The team ran early versions against small subsets of the codebase, examined the output, identified missed cases, added transformation rules, and repeated. Each iteration against real production code revealed patterns that weren't in the test cases. This iterative refinement against the actual target codebase is what made the Sunday execution clean — by migration day, the codemod had been proven against the full diversity of patterns present in 3.7 million lines.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE TALENT PIPELINE ARGUMENT</strong></p>\n<p>By 2022, the typical new-hire JavaScript engineer expected TypeScript. Flow-only codebases were increasingly unfamiliar to engineers coming from bootcamps, universities, and other major tech companies. 
<strong>The migration was a recruiting and onboarding investment</strong> as much as a tooling investment — reducing the friction of ramping up new engineers on Stripe's frontend codebase.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The Flow-to-TypeScript migration is architecturally different from most of the case studies in this collection — it's a developer tooling migration rather than a production infrastructure change. But it shares the same core challenges: a large, live system needs to change; incremental migration creates dangerous mixed states; automation is the only viable path at scale. The architectural patterns — build the transformation machine, use compilation as the correctness oracle, execute atomically — are directly applicable to infrastructure and data migrations as well.</p>\n<h3>Codemod Development and Execution Pipeline</h3>\n<p><a href=\"https://techlogstack.com/explore/stripe-flow-typescript-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Before and After: Type System Ecosystem Position</h3>\n<p><a href=\"https://techlogstack.com/explore/stripe-flow-typescript-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>AST TRANSFORMATIONS: THE POWER AND THE RISK</strong></p>\n<p>AST-based code transformations are powerful because they operate on the <strong>semantic structure of code</strong>, not on raw text. A text-based search-and-replace would fail on multi-line type annotations, nested generics, and comments adjacent to type syntax. An AST transformation understands the code's structure and can make correct transformations in context. 
The risk: AST transformation rules must be exhaustive for the patterns present in the codebase, or the generated code will have subtle errors that only appear at runtime or in edge cases.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>What the Codemod Couldn't Do</p><p>Automated codemods handle syntax transformation perfectly but cannot handle <strong>semantic meaning changes</strong>. In a few cases, Flow and TypeScript's type systems make different assumptions about the same code — a type that Flow considers valid that TypeScript considers an error, or vice versa. These cases required manual review after the automated migration. The codemod was the 99% solution; the manual cleanup handled the remaining 1% of cases that required human judgment.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>jscodeshift: The Transformation Framework</p><p><em>jscodeshift</em> (an open-source JavaScript codemod toolkit created by Facebook that provides a jQuery-like API for traversing and transforming JavaScript ASTs) was the foundation for Stripe's codemod. It handles parsing, AST traversal, and code emission while allowing engineers to focus on writing transformation logic. The framework's familiarity (JavaScript-based, with a well-documented API) meant Stripe's frontend engineers could contribute to the codemod without learning a new tooling ecosystem.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Stripe's Flow-to-TypeScript migration is the case for investing in automation tooling before attempting any large-scale code transformation. The migration itself took one day. Building the migration tool took months. That ratio is correct.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Large-scale code transformations require automation, not heroics.</strong> 3.7 million lines cannot be migrated manually with consistency. Build the codemod first. 
The investment in automation tooling is repaid by the quality, consistency, and speed of the transformation it enables.</li><li><span>02</span><div>Use <strong>compilation as your correctness oracle</strong> during codemod development. Running the type checker against migrated code subsets gives immediate, objective feedback on transformation correctness — more reliable than manual code review of thousands of transformed files.</li><li><span>03</span><div><em>Atomic migration</em> (executing a complete migration in a single deployment rather than gradually, eliminating any intermediate mixed-state period) is preferable to incremental migration when the mixed state creates engineering overhead. A gradual Flow-to-TypeScript migration would require supporting both type checkers simultaneously for months. The single-PR atomic approach eliminated that overhead entirely.</li><li><span>04</span><div><strong>Technical debt in developer tooling compounds in ways that are easy to underestimate.</strong> Every month Stripe stayed on Flow was a month of suboptimal IDE tooling, missing ecosystem support, and recruiting friction for candidates who expected TypeScript. Quantify developer tooling debt explicitly — the compounding cost is real even when it's not directly measurable in production incidents.</li><li><span>05</span><div>Prepare your team for an abrupt transition, not a gradual one. <strong>Good internal documentation and onboarding resources matter more for atomic migrations than gradual ones</strong> — engineers go from the old system to the new system overnight, and the organization needs to support that transition actively.</li></ol>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Document the Decision Before You Ship</p><p>A migration that affects hundreds of engineers needs documentation ready <strong>before the PR merges</strong>. 
Stripe prepared TypeScript onboarding materials, answered common questions about the syntax differences, and communicated the migration plan and rationale to engineering broadly before the Sunday execution. Engineers who arrived Monday morning to a new type system should not be the first people asking 'wait, what happened?'</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE REAL ROI OF TYPE SYSTEM INVESTMENT</strong></p>\n<p>The business case for Flow-to-TypeScript is rarely made in terms of production incidents prevented — type systems catch compile-time errors that never reach production. The ROI is in <strong>engineering velocity</strong>: faster development cycles, fewer bugs caught late in review or in staging, better IDE-assisted refactoring, and lower onboarding cost for new engineers. These are real but diffuse benefits that require organizational commitment to measure and communicate.</p>\n</blockquote>\n\n<blockquote><p>Building the codemod took months. Running it took one day. 
The engineers who merged the PR that Sunday probably didn't fully appreciate they were doing the most leverage-per-keystroke work of the entire project.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/stripe-flow-typescript-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.582195+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Performance", "Stripe"]}, {"id": "https://techlogstack.com/explore/github-failover-test-outage-2023/", "url": "https://techlogstack.com/explore/github-failover-test-outage-2023/", "title": "The Test That Broke GitHub: A Failover Drill Goes Live", "summary": "GitHub's live failover test of its new secondary Internet edge facility caused a 32-minute production outage in June 2023. The story of testing redundancy that prove", "content_html": "<p><strong>GitHub</strong> · Reliability · 17 May 2026</p>\n<p>June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their brand-new second Internet edge facility — six months of infrastructure work designed to eliminate a single point of failure. 
Within seconds, instead of validating their redundancy, they've created an outage that takes GitHub offline for millions of developers across North America and South America.</p>\n<ul>\n<li>32-minute outage</li><li>2-min detect-to-revert</li><li>~100M devs affected</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<ul>\n<li><strong>32 min</strong> — Duration of the June 29 outage, caused not by an external attack or software bug but by GitHub's own validation test of infrastructure designed to prevent exactly this kind of outage</li>\n<li><strong>2 min</strong> — Time from alert firing to engineers reverting the failover and bringing the primary edge facility back online — fast human response, but the damage was done the moment the test ran</li>\n<li><strong>6 months</strong> — Approximate age of GitHub's second Internet edge facility when the test ran — built in January 2023 and actively routing production traffic since then, yet the configuration flaw was never discovered</li>\n<li><strong>55 min</strong> — Maximum delay GitHub Actions workflows experienced during the June 7 incident — a separate but equally instructive outage where one customer's pathological repository data starved the entire Git push queue</li>\n</ul>\n\n<p>GitHub is the infrastructure underpinning virtually all software development on Earth. Over <strong>100 million developers</strong> use it to store code, run CI/CD pipelines via <em>GitHub Actions</em> (GitHub's built-in automation platform, used by millions of teams to automatically build, test, and deploy software on every code push), and collaborate on projects ranging from weekend hobby projects to the world's most critical open-source software. When GitHub goes down, the entire software industry's ability to ship code grinds to a halt.
This is not a theoretical concern — it is the reality GitHub's infrastructure team lives with every day. For years, the team had known about a <em>single point of failure</em> (a component whose failure causes the entire system to stop working — if there is only one of it and it breaks, everything that depends on it also breaks) in their network architecture at the Internet edge. The fix was a second Internet edge facility, completed in January 2023.</p>\n<p>The second edge facility had been routing live production traffic since January, operating alongside the primary in a <em>high availability</em> (a system design where redundant components allow the service to continue operating even when one component fails, typically achieved by having a primary and one or more secondaries that can take over) architecture. Six months of real traffic without incident. The team's next step was logical and responsible: perform a live <strong>failover test</strong> — deliberately route all traffic to the secondary, as if the primary had failed — to verify the redundancy actually worked. June 29, 2023. <span>The test began. The secondary facility could not function as a primary. GitHub went down.</span></p>\n<blockquote><p>Unfortunately, during this failover test we inadvertently caused a production outage. The test exposed a network pathing configuration issue in the secondary side that prevented it from properly functioning as the primary facility.</p><p><em>— GitHub Status Page — Incident gqx5l06jjxhp, June 29, 2023</em></p></blockquote>\n<h3>The Hidden Configuration Flaw</h3>\n<p>The root cause was a <strong>network path configuration issue</strong> in the secondary Internet edge facility.
The secondary had been designed to route traffic alongside the primary in a shared <em>HA architecture</em> (High Availability architecture — a design where two or more facilities share load simultaneously, rather than one being idle as a hot standby, which keeps the secondary warm but means it must be capable of handling full load independently at any moment), but its specific network routing configuration was never validated for the scenario where it had to handle all traffic alone. This is the subtle trap of active-active HA: a facility can route 50% of traffic flawlessly for six months and still fail when asked to route 100%, because some of its internal network paths — <em>BGP routes</em> (Border Gateway Protocol routes — the routing table entries that tell the internet how to reach GitHub's network, which must be correctly configured on edge routers to announce GitHub's IP prefixes to the global internet) in particular — were only configured to work in the context of the primary being present. <span>The facility was a co-pilot that had never practiced landing the plane alone.</span></p>\n\n<h3>Problem</h3>\n<h4>Failover Test Initiated</h4>\n<p>At <strong>17:39 UTC</strong> on June 29, 2023, GitHub engineers begin a planned live validation of the secondary Internet edge facility by shifting all traffic away from the primary. Within seconds, parts of North America (especially the US East Coast) and South America begin experiencing connectivity failures to GitHub.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Secondary Cannot Function as Primary</h4>\n<p>The secondary facility has a <strong>network path configuration issue</strong> that was invisible while it shared load with the primary but becomes critical when it must handle all traffic alone. 
<em>Border router reconvergence</em> (the process by which BGP routers across the internet update their routing tables to reflect the new path to GitHub's network — a process that takes time and cannot be instantly reversed) cannot happen correctly because the secondary's own configuration is broken.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Revert in 2 Minutes</h4>\n<p>GitHub's monitoring fires immediately. Within <strong>two minutes</strong> of being alerted, engineers revert the failover change and bring the primary facility back online. The revert itself is fast — but once online, border routers across the internet need time to <em>reconverge</em>, meaning GitHub service is not instantly restored even after the primary is running.</p>\n<hr />\n<h3>Result</h3>\n<h4>Fixed, Then Tested Better</h4>\n<p>The network path configuration issue in the secondary is corrected. GitHub commits to <strong>improved failover testing procedures</strong> that minimize customer impact — specifically, scheduling future tests in a way that reduces blast radius. The test that caused the outage was ironically the most valuable test the team ever ran: it found the flaw that would have caused a much longer, unplanned outage during a real emergency.</p>\n<hr />\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Reconvergence Penalty</p><p>Even after GitHub reverted the failover and the primary came back online, users could not immediately reach GitHub. The internet's <em>BGP routing tables</em> (global distributed routing databases maintained by thousands of autonomous networks — once updated to reflect a path change, they must propagate the reversal across every network that learned the new path) needed time to reconverge — to undo the routing changes that the failover had caused. 
This is the hidden cost of network-level failures: the fix is fast, the recovery is slow.</p>\n</blockquote>\n\n<h3>The June 7 Incident: A Different Kind of Queue Starvation</h3>\n<p>Three weeks before the failover outage, GitHub experienced a completely separate but equally instructive incident. On <strong>June 7 at 16:11 UTC</strong>, GitHub's internal job queue for processing Git pushes began experiencing increasing delays. The monitoring system alerted engineers after <strong>19 minutes</strong>. Customers experienced GitHub Actions workflow delays of <strong>up to 55 minutes</strong> and pull requests that failed to reflect new commits. The root cause was a single customer pushing to a repository with a specific, unusual data shape — a shape that caused the Git backend to throttle the processing jobs, making them slow. These slow jobs <em>exhausted the worker pool</em> (consumed all available concurrent job slots, leaving no capacity to process pushes from any other repository) that served all other users. <span>One customer's pathological repository data silently starved the global Git push queue for nearly two and a half hours.</span></p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Tenant Isolation in Shared Queues</p><p>The June 7 incident is a textbook case of the <em>noisy neighbor problem</em> (when one tenant or workload in a shared system consumes disproportionate resources, degrading performance for all other tenants) in a shared job queue. GitHub's fix — making the Git backend throttle behavior <strong>fail faster</strong> and reducing the Git client timeout — prevents any single customer's workload from holding a worker slot indefinitely.
The principle applies anywhere a shared queue serves diverse workloads.</p>\n</blockquote>\n\n<blockquote><p>GitHub spent six months routing real traffic through a backup facility, and it took a deliberate test to discover the backup couldn't actually back anything up — which is the whole point of testing, just not quite how they planned.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Fixing the Failover Test Outage</h3>\n<p>The immediate fix for the June 29 outage was surgical and fast: engineers identified the network path configuration issue exposed by the failover test and corrected it in the secondary edge facility. But the more important fix was procedural — changing <strong>how future failover tests are designed and scheduled</strong>. A live failover test that takes GitHub offline for users on two continents is not a sustainable validation strategy. GitHub committed to scheduling tests in ways that minimize customer impact, likely through phased traffic migration (moving a small percentage of traffic first), <em>shadow testing</em> (running the secondary in parallel and comparing its behavior against the primary without actually routing real user traffic to it) to identify configuration gaps before they cause outages, and off-peak timing to reduce the blast radius if something goes wrong.
<span>The secondary facility was fixed and is now genuinely capable of functioning as a primary.</span></p>\n<pre><code class=\"language-python\"># Simplified model of a safer failover test strategy\n# Instead of \"flip all traffic to secondary\", use staged validation\n# EdgeFacility, shadow_test, traffic_shift, monitor and the other helpers are\n# illustrative placeholders, not GitHub's actual tooling.\n\ndef run_failover_validation(primary: EdgeFacility, secondary: EdgeFacility):\n    \"\"\"\n    Safe failover validation: verify the secondary can function as primary\n    without causing a production outage.\n    \"\"\"\n\n    # Step 1: Shadow test — route 0% of real traffic, compare responses\n    # Checks routing and config WITHOUT touching user requests\n    shadow_result = shadow_test(secondary, sample_requests=SYNTHETIC_TRAFFIC)\n    if not shadow_result.routes_correctly:\n        # ✅ Caught here — no user impact\n        alert_team(\"Secondary cannot route independently — config issue found\")\n        return FailoverResult.ABORTED\n\n    # Step 2: Canary — shift 1% of traffic to secondary, monitor error rates\n    with traffic_shift(secondary, percentage=1):\n        if error_rate() > ACCEPTABLE_THRESHOLD:\n            rollback()  # Fast revert, only 1% of users briefly affected\n            return FailoverResult.ABORTED\n\n    # Step 3: Gradual ramp — 10% → 25% → 50% → 100%\n    # At each stage, verify secondary handles the load correctly\n    for percentage in [10, 25, 50, 100]:\n        with traffic_shift(secondary, percentage=percentage):\n            # Monitor BGP convergence, latency, error rates\n            health = monitor(duration_seconds=300)\n            if not health.acceptable:\n                rollback()  # Revert to last good state\n                return FailoverResult.ABORTED\n\n    # Step 4: Full failover validated — secondary proved capable\n    return FailoverResult.SUCCESS\n\n# The June 29 incident was the equivalent of skipping steps 1 and 2 and jumping\n# straight to the 100% shift at the end of step 3.\n# A broken secondary had no chance to be caught before users felt
it.</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The BGP Reconvergence Reality</p><p>When GitHub's primary facility came back online after the revert, engineers could not simply flip a switch and restore service. <strong>Border routers across the internet</strong> needed time to reconverge — each network that had learned the (broken) route to GitHub's secondary needed to update its routing tables back to the primary. This <em>BGP propagation delay</em> (the time it takes for routing updates to spread across the global internet's interconnected autonomous systems — typically seconds to minutes, depending on the number of hops and the speed of peering relationships) is unavoidable, which is why the outage lasted 32 minutes even though the fix itself took under 2 minutes.</p>\n</blockquote>\n\n<p>The June 7 Git push queue fix was more technically nuanced. The Git backend's throttling behavior was changed to <strong>fail faster</strong> — instead of a throttled job slowly consuming a worker slot while retrying indefinitely, it now returns a failure quickly so the slot is released for another repository's work. The Git client timeout within the job was also reduced, preventing a hung upstream connection from holding a worker open. These two changes together mean a pathological repository data shape can no longer <span>starve the shared worker pool</span>. Additional <em>observability</em> (instrumentation that gives engineers visibility into system behavior — here, metrics that reveal when a single tenant is consuming disproportionate queue capacity, before it becomes a user-facing incident) improvements were added to reduce detection and diagnosis time for future incidents of this type.</p>\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Outage That Validated the Investment</p><p>GitHub's engineering team noted a pointed irony: the test that caused the outage was exactly the right test to run. 
Without it, the hidden configuration flaw would have remained undetected until a real infrastructure failure — at which point the outage would have been unplanned, potentially longer, and without the fast human revert that limited the June 29 impact to 32 minutes. A self-inflicted outage you can control is always better than a real one you cannot.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>TWO INCIDENTS, ONE JUNE</strong></p>\n<p>June 2023 gave GitHub two distinct outage patterns in a single month. The June 7 incident (<strong>2h28m</strong>) was caused by a <em>shared resource exhaustion</em> — one customer's data starving a global queue. The June 29 incident (<strong>32 min</strong>) was caused by <em>untested redundancy</em> — infrastructure built for resilience that had never been validated as a solo primary. Both share a root: <strong>assumptions that were never tested in production conditions</strong>.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Hidden Cost of Active-Active HA</p><p>The secondary facility had routed live production traffic for <strong>six months</strong> without incident — because it was always operating alongside the primary, not instead of it. Active-active HA gives you a false signal of readiness. A facility that handles 50% of traffic when the primary is healthy has never been proven to handle 100% of traffic when the primary is gone. Failover capability must be explicitly validated at full load, not inferred from shared-load health.</p>\n</blockquote>\n\n<p>The most important long-term fix was cultural: GitHub's team committed to making failover testing a regular practice, not a one-time event. Regular failover tests — scheduled with appropriate notice, designed to minimize blast radius, and run at off-peak times — are the only way to keep redundancy validated over time. 
Infrastructure drifts: routers get reconfigured, network policies change, and a facility that was a fully functional backup six months ago may not be today. <strong>Untested redundancy is not redundancy.</strong> It is the comforting fiction that your system is more resilient than it actually is.</p>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>GitHub's Internet edge architecture is the layer that connects the global internet to GitHub's internal infrastructure. Every request from every developer in the world — whether pushing code, pulling a repository, or triggering a GitHub Action — flows through an Internet edge facility. For years, this was a <strong>single point of failure</strong>: one facility, one set of <em>border routers</em> (network devices that connect GitHub's private network to the public internet, running BGP to announce GitHub's IP address ranges to the global routing table), and one path in from the internet. The second facility, completed in January 2023, was designed to eliminate this vulnerability. What the architecture diagrams did not capture was the specific network path configuration that would only become a problem when the secondary had to stand alone.</p>\n<h3>Before: Single Point of Failure at the Internet Edge</h3>\n<p><a href=\"https://techlogstack.com/explore/github-failover-test-outage-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: HA Architecture with Secondary Edge — and the Hidden Flaw</h3>\n<p><a href=\"https://techlogstack.com/explore/github-failover-test-outage-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<p>The architecture diagram shows the deceptive appearance of redundancy. 
Two edge facilities, both actively routing traffic, both connected to the same internal load balancer — it looks bulletproof. But the diagram does not show the <strong>network path configuration</strong> inside the secondary: the specific <em>BGP advertisements</em> (announcements that a network device makes to its peers describing which IP address ranges it can route to, along with the path information used to select the best route) that tell the global internet how to reach GitHub, and the internal routing rules that control traffic flow within the facility. When the secondary was asked to function as the primary during the failover test, those configurations were incorrect for the solo-primary role. <span>The redundancy was a drawing on paper, not a tested fact.</span></p>\n<blockquote>\n<p><strong>🔄</strong></p>\n<p>Border Router Reconvergence: The Delay Nobody Talks About</p><p>When GitHub's primary facility came back online, the recovery was not instant. Every network on the internet that had updated its <em>BGP routing table</em> (a database maintained by every network router describing the best path to reach any IP address on the internet — changes propagate gradually from router to router as UPDATE messages are exchanged) to route via the broken secondary had to learn the new path to the primary. This propagation delay is inherent to how the internet works and is <strong>unavoidable once a failover has been initiated</strong>. 
It is one more reason to avoid unnecessary failover events — even a 2-minute fix can result in 30 minutes of degraded service.</p>\n</blockquote>\n\n<div><table><caption>June 2023 GitHub incidents — two outages, two root causes, one shared theme</caption><thead><tr><th>Incident</th><th>Date</th><th>Duration</th><th>Root Cause</th><th>Fix</th></tr></thead><tbody><tr><td>Git Push Queue Starvation</td><td>June 7</td><td>2h 28m</td><td>Single customer's pathological data shape throttled jobs, exhausting the shared worker pool</td><td>Fail-faster throttling, reduced Git client timeout</td></tr><tr><td>Failover Test Outage</td><td>June 29</td><td>32 min</td><td>Secondary edge facility had hidden network path config flaw that only manifested when operating solo</td><td>Fixed secondary config; improved failover test procedures</td></tr><tr><td>Common thread</td><td>Both</td><td>—</td><td>Assumptions about system behavior that were never validated under the actual failure conditions</td><td>Testing at the real failure boundary, not the assumed one</td></tr></tbody></table></div>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>June 2023 gave GitHub — and the industry — two clean case studies in the same month. Neither outage was caused by a novel bug or an obscure race condition. Both were caused by things that look like good engineering on paper but hadn't been tested at the right failure boundary. These lessons apply to any team operating infrastructure with redundancy assumptions they have never validated.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Untested redundancy is not redundancy — it is a liability.</strong> GitHub's secondary edge facility routed <strong>live production traffic for six months</strong> without revealing the flaw that prevented it from functioning as a primary.
<em>Active-active HA</em> (a high-availability design where both primary and secondary systems handle live traffic simultaneously — effective for load sharing, but does not test failover capability unless one side is actually turned off) does not validate failover capability; it validates shared-load operation. Test your redundancy by actually removing the primary, not by observing the secondary under normal conditions.</li><li><span>02</span><div><strong>Failover tests should be staged, not binary.</strong> Shifting 100% of traffic to an untested secondary in a single step is a high-stakes gamble that exposes every user to any hidden flaw at once. <strong>Canary failovers</strong> — shifting 1%, then 10%, then 25%, validating at each stage before proceeding — expose configuration issues before they cause full outages. The extra complexity of staged testing is small compared to the cost of a production outage discovered mid-test.</li><li><span>03</span><div><strong>Reverting fast does not mean recovering fast.</strong> GitHub reverted the failover change within <strong>2 minutes</strong> of being alerted, but the outage lasted <strong>32 minutes</strong> because <em>BGP reconvergence</em> (the time for routing updates to propagate across the global internet after a path change — unavoidable, typically measured in minutes, and not under the control of the affected party once a change has been announced) takes time that no amount of engineering can compress. Build your incident response timelines around recovery time, not just fix time.</li><li><span>04</span><div><strong>Shared queues need tenant isolation to prevent noisy neighbor failures.</strong> The June 7 incident is a canonical example of one tenant's unusual workload consuming all of a shared resource. Design queue systems with <strong>per-tenant rate limits and fast-fail timeouts</strong> so that a single job never holds a worker slot long enough to starve the entire pool.
The fix — making throttled jobs fail faster and shortening the Git client timeout — is a small change that protects millions of users from one user's edge case.</li><li><span>05</span><div><strong>The test that breaks production is the most valuable test you ever run.</strong> GitHub's team made a pointed admission: without the June 29 failover test, the network path flaw would have remained hidden until a real infrastructure emergency forced a failover under far worse conditions — with no time to prepare, no clean revert path, and no certainty about what was broken. <strong>Deliberately probing your redundancy</strong> in a controlled environment, even at the cost of a brief outage, is the engineering equivalent of a fire drill: painful in the moment, essential in the long run.</li></ol>\n<blockquote>\n<p><strong>THE IRONY THEOREM</strong></p>\n<p>The infrastructure designed to <strong>prevent an outage</strong> caused the outage. The test designed to <strong>validate resilience</strong> proved the resilience didn't exist. And the 32-minute disruption that looked like the <strong>worst case</strong> turned out to be far better than the real emergency it pre-empted. Sometimes the most constructive thing you can do for your reliability is schedule an outage before the universe schedules one for you.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Document Your Assumptions Before You Test Them</p><p>A pre-test checklist should include: <strong>what does the secondary need to do independently?</strong> Not just what load it handles, but what <em>configuration</em> it needs — BGP route advertisements, internal routing policies, health check endpoints, TLS certificates.
Every assumption about how the secondary behaves when the primary is absent should be written down and verified before the test runs, not discovered by watching production users experience an outage.</p>\n</blockquote>\n\n<blockquote><p>GitHub built a backup facility, routed real traffic through it for six months, and then discovered it was a backup that couldn't actually back anything up — all of which was fine, because they found out during a test instead of a Thursday morning at 3am.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/github-failover-test-outage-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.585556+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "GitHub"]}, {"id": "https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/", "url": "https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/", "title": "Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened", "summary": "How Shopify migrated the Shop app's backend to Vitess horizontal sharding using user_id as the shard key — transparently, without rewriting their Rails application.", "content_html": "<p><strong>Shopify</strong> · Databases · 17 May 2026</p>\n<p>The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. 
Shopify needed horizontal sharding — but they had a Rails monolith that expected a single database, and a commerce platform used by millions daily that couldn't afford downtime.</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Shopify launched the Shop app in April 2020, giving consumers a personalized browsing and checkout experience across Shopify's merchant network. By 2023, the Shop app had achieved remarkable growth — and its backend database was approaching the scaling ceiling that every fast-growing application eventually hits. The database powering the Shop backend, running on Shopify's internal managed MySQL system called <em>KateSQL</em>, was a single MySQL instance. Single-instance databases have a hard vertical limit: no matter how much you upgrade the hardware, there's a maximum amount of data and queries per second one machine can handle. <strong>Horizontal sharding was the only path forward</strong>, and Shopify's team chose <em>Vitess</em> (an open-source MySQL scaling system developed at YouTube that adds horizontal sharding, connection pooling, and query routing on top of standard MySQL) to execute it.</p>\n<p>Vitess has a deceptively clean architecture at the application level: applications connect to <em>VTGate</em> (Vitess's query routing proxy — a stateless service that accepts MySQL connections from applications, parses queries, and routes them to the correct shard based on the query's sharding key) as if it were a regular MySQL server. VTGate speaks the MySQL wire protocol, so applications need only update their database connection string. Queries are then routed by VTGate to the appropriate <em>VTTablet</em> (a Vitess process that runs alongside each MySQL instance and manages the connection pool, health checks, and query execution for that shard), which communicates directly with the underlying MySQL process. <strong>From the application's perspective, there is one database.
From the infrastructure's perspective, there are many.</strong> This transparency is what makes Vitess viable for a Rails monolith like Shopify's — the application code doesn't change, only the database topology.</p>\n<blockquote>\n<p><strong>🔑</strong></p>\n<p>Shopify chose <strong>user_id</strong> as the sharding key for the Shop app's user-owned data. Almost all tables in the database are associated with a user, so user_id was a natural choice — it distributes data evenly, ensures all of a user's data lives on the same shard, and keeps user-scoped queries on a single shard without cross-shard joins.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE VITESSIFYING PHASE</strong></p>\n<p>Shopify coined the term <strong>'Vitessifying'</strong> for the process of transforming an existing MySQL database into a Vitess keyspace without immediately sharding. In this first phase, a VTTablet is added alongside each MySQL process, and the application is reconfigured to connect through VTGate — but all data still lives on a single shard. This allows the team to validate Vitess integration, test VTGate routing, and gain operational familiarity with Vitess before making the more complex sharding changes.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Single-Instance Database Approaching Its Ceiling</h4>\n<p>The Shop app's backend was scaling rapidly but its database was a single MySQL instance. Vertical scaling had diminishing returns and a hard ceiling. The engineering team needed horizontal sharding to support continued growth without database-induced bottlenecks.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Rails Monolith Expected One Database</h4>\n<p>Shopify's Shop backend was a Rails application that, like most Rails apps, expected a single primary database connection. Introducing sharding without a transparent proxy would require extensive application-level changes to route queries to the correct shard — a significant refactoring risk. 
The alternative was a transparent proxy that handled sharding invisibly.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Vitess + Dynamic Connection Switcher</h4>\n<p>The migration proceeded in phases: first Vitessifying (adding VTTablet and VTGate without sharding), then adding application-layer VTGate connectivity, then splitting tables into the user and global keyspaces, then horizontally sharding the user keyspace by user_id. A dynamic connection switcher allowed gradual traffic migration from the old system to VTGate, with the percentage adjustable without a deploy.</p>\n<hr />\n<h3>Result</h3>\n<h4>Horizontally Scalable, App Unchanged</h4>\n<p>The Shop app backend gained horizontal scalability via Vitess sharding without requiring the application to understand sharding. The connection string changed; the application code did not. Shopify can now add shards as the Shop app continues to grow without additional application-level changes.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Auto-Increment Problem in Sharded Systems</p><p>Rails applications default to using <strong>auto-incrementing integer primary IDs</strong> — a database feature that generates unique IDs by incrementing a counter. In a sharded system, multiple shards generating auto-increment IDs independently would produce duplicate IDs across shards. Vitess solves this with a <strong>Sequences table</strong> in an unsharded keyspace: VTTablets cache blocks of IDs from the Sequences table and distribute them, ensuring globally unique IDs across all shards. The cache size of 1000 IDs per VTTablet reduces the per-ID write overhead while maintaining uniqueness.</p>\n</blockquote>\n\n<p>The schema migration challenge was particularly subtle. 
When running schema migrations (<em>DDL</em> (Data Definition Language — SQL statements like ALTER TABLE that change database structure rather than data) operations) on a sharded Vitess cluster, all shards must apply the migration and complete before the Rails application can query the table schema. If the migration completes on some shards but not others, a Rails query checking the schema might get an inconsistent view — triggering a dump of a potentially incorrect schema. Shopify's solution: migrations tracked across all shards, and schema dumps only triggered after all shards confirmed completion. This required custom Rails tooling to coordinate with Vitess's sharding topology.</p>\n<blockquote>\n<p><strong>🧩</strong></p>\n<p>Two Keyspaces: Users and Global</p><p>Shopify split the Shop app data into two <em>keyspaces</em> (a Vitess concept for a logical database that can span one or more shards): a <strong>sharded 'users' keyspace</strong> containing all user-owned tables (sharded by user_id), and an <strong>unsharded 'global' keyspace</strong> for data that doesn't belong to individual users and must be accessed without a sharding key. This two-keyspace architecture is the standard pattern for Vitess migrations: shard what scales with users, keep globally-accessed lookup data unsharded.</p>\n</blockquote>\n\n<blockquote><p>Vitessifying is our internal terminology for the process of transforming an existing MySQL into a keyspace in a Vitess cluster. 
This allows us to start using core Vitess functionality without explicitly moving data.</p><p><em>— Shopify Engineering — via 'Horizontally scaling the Rails backend of Shop app with Vitess'</em></p></blockquote>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Dynamic Connection Switching: Gradual Traffic Migration</p><p>Rather than a hard cutover from KateSQL to VTGate, Shopify built a <strong>dynamic connection switcher</strong> that allowed them to gradually route increasing percentages of traffic through VTGate while monitoring for performance differences. Starting at a small percentage and slowly ramping to 100% gave the team confidence in VTGate's behavior under real production load before fully committing. The percentage was adjustable at runtime without a code deploy — giving operators immediate control during the migration window.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Shopify's First Vitess Production Deployment</p><p>The Shop app backend migration was Shopify's <strong>first deployment of Vitess in production</strong>. This wasn't just a database migration — it was building organizational competency with a new database infrastructure layer from scratch. The team had to learn Vitess's operational model, its failure modes, its monitoring requirements, and its configuration nuances simultaneously with executing a live migration. Phasing the migration was in part a strategy to build this knowledge incrementally.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Cross-Shard Queries: The Scatter-Gather Problem</p><p>When a query cannot be routed to a single shard — because it lacks a sharding key or spans multiple shards — Vitess performs a <strong>scatter-gather operation</strong>: it sends the query to all shards and aggregates the results. Scatter-gather is more expensive than single-shard queries. 
Shopify's engineering team reviewed the Shop app's query patterns to identify scatter queries and either added sharding keys to make them single-shard or moved the data they accessed into the global keyspace. Unhandled scatter queries can become performance bottlenecks at scale.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Vitess Migration Playbook: Four Phases</h3>\n<p>Shopify's Vitess migration was carefully sequenced into phases that minimized risk at each step. Phase 1 (Vitessifying) validated the Vitess stack without sharding. Phase 2 (dual connectivity) validated that the application could talk to VTGate alongside the existing system. Phase 3 (keyspace splitting) separated tables into users and global keyspaces. Phase 4 (sharding) performed the actual horizontal split of the users keyspace by user_id. Each phase produced a stable, production-validated state before the next phase began — the classic incremental risk management strategy.</p>\n<ul>\n<li><strong>4 phases</strong> — Migration phases: Vitessify → dual connectivity → keyspace split → horizontal shard — each independently production-validated before proceeding</li>\n<li><strong>user_id</strong> — Sharding key — ensures all data for a user lives on the same shard, making user-scoped queries single-shard with no cross-shard joins for most operations</li>\n<li><strong>0 app changes</strong> — Application code changes required to complete the sharding — VTGate's MySQL protocol compatibility meant only the connection string changed</li>\n<li><strong>1000 IDs</strong> — VTTablet sequence cache size — each shard pre-fetches 1000 globally-unique IDs from the Sequences table to avoid per-insert writes to the sequence source</li>\n</ul>\n\n<pre><code>-- Simplified Vitess VSchema for Shopify's two-keyspace architecture\n-- VSchema tells VTGate how to route queries to shards\n\n-- USERS keyspace: sharded by user_id\n-- All user-owned tables have user_id as the Primary VIndex (shard 
key)\n{\n  \"sharded\": true,\n  \"vindexes\": {\n    \"hash\": { \"type\": \"hash\" }  -- consistent hash of user_id\n  },\n  \"tables\": {\n    \"orders\": {\n      \"columnVindexes\": [\n        { \"column\": \"user_id\", \"name\": \"hash\" }  -- shard on user_id\n      ],\n      -- Vitess Sequence for globally unique primary key\n      \"autoIncrement\": {\n        \"column\": \"id\",\n        \"sequence\": \"GLOBAL_KEYSPACE.orders_seq\"  -- lives in unsharded global keyspace\n      }\n    },\n    \"user_preferences\": {\n      \"columnVindexes\": [\n        { \"column\": \"user_id\", \"name\": \"hash\" }\n      ]\n    }\n  }\n}\n\n-- GLOBAL keyspace: unsharded (no user_id)\n-- Merchant data, category data, other cross-user lookups\n{\n  \"sharded\": false,\n  \"tables\": {\n    \"merchants\": {},  -- accessed without sharding key\n    \"categories\": {}\n  }\n}</code></pre>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Schema Migrations Across Multiple Shards</p><p>Running <code>ALTER TABLE</code> on a sharded Vitess cluster requires coordination: the DDL must be applied to all shards, and the application must not attempt to query the new schema until all shards have confirmed completion. Shopify built tooling to track migration status across all shards and only allow the Rails schema dump (used to verify the schema is as expected) after all shards reported completion. Without this coordination, a Rails schema check on a partially-migrated cluster could return an inconsistent view.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE SHOPIFY FIRST: VITESS IN PRODUCTION</strong></p>\n<p>The Shop app backend was <strong>Shopify's first production deployment of Vitess</strong>. This meant the team was building operational knowledge from scratch — learning Vitess's failure modes, monitoring requirements, and operational procedures while also executing a live migration. 
The careful phasing of the migration (Vitessify first, shard second) was in part a strategy to build this operational experience incrementally rather than learning all of Vitess's complexity at once.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>VTGate: MySQL Protocol Transparency</p><p>VTGate's most valuable property for application developers is that it speaks the <span><strong>standard MySQL wire protocol</strong></span>. Any MySQL client — including ActiveRecord, the ORM that powers Rails — can connect to VTGate without modification. From the application's perspective, VTGate is just another MySQL server. The sharding logic, the shard topology, the cross-shard routing — all invisible to the application layer.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>ProxySQL to VTGate: The Connection String Change</p><p>The Shop app had previously been using <strong>ProxySQL</strong> as its database proxy — a standard approach for MySQL connection pooling and query routing. Replacing ProxySQL with VTGate was the connection-layer change that made Vitess integration possible. From the application's perspective, both ProxySQL and VTGate speak the MySQL wire protocol; the change was transparent to Rails. The dual connectivity phase let the team validate VTGate behavior alongside ProxySQL before fully committing.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>VITESS RESOURCE ALLOCATION</strong></p>\n<p>One operational detail that surprised the Shopify team: <strong>VTTablet requires significant resource allocation</strong>. Vitess's own rule of thumb is allocating an equal number of CPUs to VTTablet as to the mysqld process it runs alongside. Memory consumption for VTTablet is generally low, but CPU requirements are substantial — VTTablet handles connection pooling, health checking, query execution, and replication management. 
Underprovisioning VTTablet creates a bottleneck in the query path that can limit the effective throughput of the underlying MySQL instance.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Vitess's architecture introduces two new components between the application and MySQL: VTGate (the stateless query router, deployed as multiple replicas for high availability) and VTTablet (a sidecar process running alongside each MySQL instance). The application connects to VTGate using a standard MySQL connection. VTGate consults the <em>VSchema</em> (Vitess Schema — a configuration document that describes how keyspaces and shards are organized and which columns are used as sharding keys) to determine which shard a query should target, then forwards it to the appropriate VTTablet. The MySQL instances themselves are unchanged — they continue running as standard MySQL servers with replication configured for high availability.</p>\n<h3>Vitess Architecture: Rails App → VTGate → Sharded MySQL</h3>\n<p><a href=\"https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Migration Phases: From KateSQL to Sharded Vitess</h3>\n<p><a href=\"https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>CONNECTION POOLING: AN OFTEN-OVERLOOKED VITESS BENEFIT</strong></p>\n<p>Beyond sharding, VTTablet provides <strong>connection pooling at the database level</strong>. A Rails application with 100 Puma worker threads might open 100 MySQL connections — and 100 application instances might open 10,000. VTTablet multiplexes these connections to a much smaller pool against the actual MySQL process. 
At Shopify's scale, this connection efficiency is a meaningful resource saving in addition to the sharding capability.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Sequence Caching: Trading Latency for Throughput</p><p>VTTablet's sequence ID caching (set at 1000 in Shopify's production config) is a throughput-versus-latency tradeoff. <strong>Without caching</strong>, every INSERT requires a roundtrip to the Sequences table in the global keyspace to get the next ID — adding latency to every write. <strong>With caching of 1000 IDs</strong>, 999 out of every 1000 INSERTs get their ID from the local cache instantly, with only every 1000th INSERT requiring a roundtrip. IDs have gaps in the sequence after a server restart (cached-but-unused IDs are lost) but remain globally unique.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>VTOrc: Automated Topology Management</p><p>In a sharded Vitess cluster, managing primary/replica failover across dozens of shards manually would be operationally prohibitive. <em>VTOrc</em> (Vitess Orchestrator — an automated MySQL topology manager integrated into Vitess that detects primary failures and promotes replicas automatically, maintaining high availability without manual operator intervention) handles this automatically. When a shard's primary fails, VTOrc promotes the best available replica and updates VTGate's routing table — keeping the cluster available without human intervention.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Shopify's Vitess migration demonstrates that horizontal database sharding doesn't have to mean rewriting your application. 
With the right proxy architecture, the sharding is in the infrastructure — invisible to the application layer.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Vitessify before you shard.</strong> Adding Vitess to an existing MySQL database without sharding (Vitessifying) is a safe, low-risk first step that validates the Vitess stack and builds operational knowledge before attempting the more complex sharding migration. Shopify's phased approach reflects this: get comfortable with Vitess on one shard before splitting into many.</div></li><li><span>02</span><div>Choose your <em>sharding key</em> (the column value used to determine which shard a row belongs to — the most important architectural decision in horizontal sharding because it determines data locality and query routing) carefully and early. user_id was the right choice for Shopify's user-centric application: it distributes data evenly, keeps user data colocated on one shard, and makes user-scoped queries single-shard. A bad sharding key creates hot shards, cross-shard joins, and an architecture that fights itself.</div></li><li><span>03</span><div><strong>Auto-increment IDs break in sharded systems.</strong> Every sharded application needs a strategy for globally unique IDs. Vitess Sequences, UUIDs, Snowflake IDs — the choice matters for performance, sortability, and debuggability. Don't discover this problem during your sharding migration; design for it before migration begins.</div></li><li><span>04</span><div>Schema migrations on sharded clusters require explicit cross-shard coordination. 
<strong>Any tooling that inspects or depends on schema state must be sharding-aware.</strong> Rails's schema dump, ActiveRecord migrations, and ORM schema introspection all need to understand that schema changes must be applied to all shards before the application can assume they've taken effect.</div></li><li><span>05</span><div>A dynamic connection switcher that allows gradual traffic migration is <strong>the safety mechanism that makes production sharding migrations recoverable</strong>. Being able to route 1% → 5% → 25% → 100% of traffic through the new system, with instant rollback by setting the percentage back to 0%, is the difference between a migration you can execute confidently and one that requires a maintenance window.</div></li></ol></div>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>VSchema Maintenance: An Ongoing Obligation</p><p>The VSchema must be updated every time the database schema changes. A new table needs a VSchema entry defining its sharding key. A new index needs evaluation for VIndex configuration. <strong>Vitess amplifies the schema change process</strong>: what was previously a single DDL operation now requires DDL plus VSchema update, coordinated across all shards. Teams adopting Vitess need processes and tooling to ensure VSchema updates are not overlooked during schema migrations.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>VITESS AS SHOPIFY STANDARD</strong></p>\n<p>Following the Shop app success, Shopify has been expanding Vitess adoption to other services. The first deployment built the organizational knowledge and tooling (custom Rails integration, dynamic connection switcher, cross-shard schema migration tooling) that makes subsequent deployments faster and safer. 
<strong>Infrastructure investments compound</strong>: the second Vitess deployment benefits from all the work done during the first.</p>\n</blockquote>\n\n<blockquote><p>Shopify added horizontal database sharding to a Rails app, and the app continued insisting there was only one database — which is either a beautiful abstraction or a comfortable lie, and honestly both.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/shopify-vitess-horizontal-scale-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.589738+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "Shopify"]}, {"id": "https://techlogstack.com/explore/shopify-mysql-deadlocks-2024/", "url": "https://techlogstack.com/explore/shopify-mysql-deadlocks-2024/", "title": "Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second", "summary": "How Shopify's engineering team diagnosed and mitigated MySQL deadlocks at BFCM-scale — 19 million queries per second — with schema changes and query pattern fixes.", "content_html": "<p><strong>Shopify</strong> · Databases · 17 May 2026</p>\n<p>During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. 
The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.</p>\n<ul>\n<li>19M MySQL QPS at BFCM peak</li><li>58M requests/min across app servers</li><li>99.999%+ uptime maintained</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<ul>\n<li><strong>19M QPS</strong> — MySQL queries per second during BFCM 2023 peak — the environment where deadlock patterns that are rare at normal traffic become regular production incidents</li>\n<li><strong>58M req/min</strong> — Application server request rate at peak BFCM — equivalent to roughly 967,000 requests per second across Shopify's core application servers</li>\n<li><strong>99.999%+</strong> — Shopify's uptime target maintained through BFCM — the reliability bar that makes every database performance issue a potential SLA risk</li>\n<li><strong>3 patterns</strong> — Main deadlock pattern categories identified and addressed: transaction ordering, gap locking, and index selection — each requiring different fix strategies</li>\n</ul>\n\n<p><em>MySQL deadlocks</em> (a situation where two or more transactions are each waiting for locks held by the other, creating a circular dependency that MySQL resolves by automatically aborting one transaction (the 'deadlock victim')) are a fact of life in high-concurrency relational databases. At small scale, they're rare and easily retried. At Shopify's Black Friday and Cyber Monday scale — <strong>19 million MySQL queries per second</strong>, 58 million application requests per minute — even rare deadlock patterns become frequent enough to affect real user experiences. A deadlock rate that generates one deadlock per million transactions is invisible at 1,000 QPS. 
At 19 million QPS, the same rate produces roughly 19 deadlocks every second.</p>\n<p>MySQL's deadlock handling is automatic: when it detects a circular lock dependency, it selects one transaction as the <strong>deadlock victim</strong>, aborts it, and allows the other transaction to proceed. The deadlock victim receives an error: <code>Deadlock found when trying to get lock; try restarting transaction</code>. Most application frameworks, including Rails, can be configured to automatically retry deadlock victims — but retries add latency, consume connection pool slots, and under sustained deadlock conditions can create retry storms that worsen the overload. The solution at scale is not to retry more aggressively but to <strong>eliminate the deadlock patterns entirely</strong>.</p>\n<blockquote>\n<p><strong>WHY DEADLOCKS GET WORSE AT SCALE</strong></p>\n<p>Deadlock frequency scales super-linearly with concurrency. Doubling transactions per second more than doubles deadlock frequency because the probability of two transactions conflicting increases with every additional concurrent transaction. The same query patterns that were safe at 10x lower traffic can become pathological at BFCM scale. <strong>Shopify specifically tests for deadlock conditions before BFCM</strong> to identify and eliminate patterns that would become problems under peak load.</p>\n</blockquote>\n\n<h3>The Three Deadlock Patterns</h3>\n<p>Shopify's engineering team identified three primary deadlock patterns in their high-concurrency MySQL environment. Pattern 1: <strong>Transaction ordering conflicts</strong> — two transactions each acquiring locks in opposite order (Transaction A locks row X then row Y; Transaction B locks row Y then row X). 
Pattern 2: <strong>Gap lock conflicts</strong> — MySQL's <em>gap locking</em> (a mechanism in InnoDB's REPEATABLE READ isolation level that locks the gaps between index records, not just the records themselves, to prevent phantom reads) creates unexpected conflicts between transactions that touch the same index range, even when they aren't modifying the same rows. Pattern 3: <strong>Index selection conflicts</strong> — MySQL choosing suboptimal index plans under high concurrency, causing different transactions to acquire locks in different orders depending on the query execution plan.</p>\n\n<h3>Problem</h3>\n<h4>BFCM Scale Turns Rare Deadlocks into Incidents</h4>\n<p>At 19M MySQL QPS, deadlock patterns that are statistically invisible at normal load become consistent sources of latency spikes and elevated error rates. MySQL's automatic deadlock resolution aborts one transaction per deadlock, requiring application-level retry logic that adds latency and consumes connection pool capacity.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Three Structural Deadlock Patterns</h4>\n<p>The deadlocks traced back to three structural patterns: transaction ordering (acquiring the same locks in different orders), gap locking (InnoDB's phantom-read prevention causing unexpected range lock conflicts), and index selection (the query planner choosing different indexes under load, causing lock order variation). Each pattern has a different root cause and requires a different fix strategy.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Schema Changes + Query Pattern Fixes + Index Hints</h4>\n<p>Transaction ordering: enforce consistent lock acquisition order in application code. Gap locking: switch to READ COMMITTED isolation for high-concurrency tables where phantom reads aren't a concern, or add precise indexes to avoid gap lock ranges. 
Index selection: use index hints to force consistent query plans under varying load conditions.</p>\n<hr />\n<h3>Result</h3>\n<h4>BFCM 2023 Completed Without Deadlock Incidents</h4>\n<p>BFCM 2023 ran at 19M MySQL QPS without deadlock-related incidents. The pre-BFCM deadlock analysis and mitigation work became a repeatable playbook for identifying and eliminating deadlock patterns before major traffic events.</p>\n<hr />\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Gap Locking: The Invisible Conflict Source</p><p>MySQL's InnoDB uses <em>REPEATABLE READ</em> (the default MySQL transaction isolation level that prevents other transactions from modifying rows you've already read within a transaction — implemented in part through gap locks on ranges of the index) as its default isolation level, which uses gap locking to prevent <em>phantom reads</em> (a scenario where a transaction reads a set of rows, another transaction inserts a new row in that set, and the first transaction reads again and sees the new 'phantom' row). The side effect: transactions that touch overlapping index ranges can deadlock on gap locks even if they're accessing completely different rows. At high concurrency, gap lock deadlocks can be more common than row-level lock deadlocks. Switching high-concurrency tables to READ COMMITTED isolation eliminates gap locking at the cost of allowing phantom reads — an acceptable tradeoff for many commerce workloads.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔒</strong></p>\n<p>MySQL's deadlock detector runs after <strong>every failed lock acquisition attempt</strong>, searching for circular lock dependencies. 
At 19M QPS this detection runs millions of times per second — a reminder that MySQL's deadlock resolution machinery is doing meaningful work constantly, not just during obvious deadlock events.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🗂️</strong></p>\n<p>Reading the Deadlock Log: Forensic Analysis</p><p>MySQL reports the most recent deadlock in the output of <code>SHOW ENGINE INNODB STATUS</code>. This output includes the transaction SQL, the locks held, the locks waited for, and which transaction was chosen as the victim. <strong>Forensic analysis of the deadlock log</strong> is the primary diagnosis tool — it reveals the exact lock ordering conflict that caused each deadlock pattern, which determines the fix strategy to apply.</p>\n</blockquote>\n\n<p>Shopify's approach to BFCM preparation includes explicit deadlock testing at high concurrency levels in staging. Load tests simulate BFCM-scale traffic against representative database states, and the resulting deadlock logs are analyzed to identify patterns that would be problematic at production scale. This pre-event analysis allows the team to eliminate deadlock patterns <strong>before</strong> they become BFCM incidents — converting reactive incident response into proactive engineering.</p>\n<blockquote>\n<p><strong>THE RETRY LOOP DANGER</strong></p>\n<p>MySQL automatically selects a deadlock victim and aborts one transaction. Most frameworks, including Rails with ActiveRecord, can be configured to <strong>automatically retry deadlocked transactions</strong>. But retries are dangerous under sustained deadlock conditions: each retry re-acquires locks and adds to the contention, potentially creating a retry storm that makes the deadlock situation worse. 
The correct strategy is to eliminate deadlock patterns, not to retry aggressively — retries are acceptable for occasional, transient deadlocks but not for structural patterns.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Fix Patterns by Deadlock Category</h3>\n<p>Each of the three deadlock categories has a distinct fix strategy. Transaction ordering conflicts are fixed at the application level by enforcing a canonical lock acquisition order — if every transaction that needs to lock resources A and B always acquires them in the same order (A before B), circular dependency is impossible. Gap lock conflicts are fixed either at the database level (changing isolation level for specific tables) or at the schema level (adding precise indexes that reduce the range that InnoDB locks). Index selection conflicts are fixed by adding index hints to force MySQL to use the same index plan consistently across varying load conditions.</p>\n<pre><code>-- Pattern 1: Transaction ordering deadlock\n-- BAD: Transaction A and B acquire locks in opposite order\n-- Transaction A: UPDATE orders, then UPDATE inventory  \n-- Transaction B: UPDATE inventory, then UPDATE orders\n-- Fix: enforce canonical lock order in application code\n\n-- Pattern 2: Gap lock deadlock\n-- At REPEATABLE READ isolation, MySQL locks gaps to prevent phantoms\n-- Fix option A: Use READ COMMITTED for high-concurrency tables\nSET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;\n-- This disables gap locking — only row-level locks on actual rows\n-- Trade-off: phantom reads possible within transaction\n\n-- Fix option B: Add precise index to narrow the locked range\n-- Before: idx_status (status) -- gaps between statuses are locked\n-- After: idx_status_created_at (status, created_at) -- narrower range\nALTER TABLE orders ADD INDEX idx_status_created_at (status, created_at);\n\n-- Pattern 3: Index selection inconsistency under load\n-- Under high concurrency, MySQL's query planner may choose 
different indexes\n-- Fix: Force consistent index selection with index hint\nSELECT * FROM orders FORCE INDEX (idx_user_id_status)\nWHERE user_id = ? AND status = 'pending';\n-- Forces the same index plan regardless of load conditions\n-- Prevents lock order variation from causing deadlocks</code></pre>\n<blockquote>\n<p><strong>THE BFCM PREPARATION PLAYBOOK</strong></p>\n<p>Shopify treats BFCM deadlock prevention as a structured engineering program: <strong>(1) Load test at BFCM scale in staging</strong> to generate representative deadlock logs. <strong>(2) Analyze deadlock logs</strong> to identify patterns and their categories. <strong>(3) Apply fix strategy per category</strong>: ordering fixes, isolation level changes, or index hints. <strong>(4) Re-test</strong> to verify the fix eliminates the pattern. <strong>(5) Deploy and monitor during BFCM itself</strong>. This loop is not run once — it's part of the pre-BFCM engineering sprint every year.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>READ COMMITTED vs REPEATABLE READ: The Tradeoff</p><p>Switching high-concurrency tables from REPEATABLE READ to READ COMMITTED eliminates gap locking and dramatically reduces deadlock frequency — but at the cost of allowing <strong>phantom reads</strong> within a transaction (a subsequent read in the same transaction might see rows that didn't exist at the transaction's start). For most commerce operations — inserting an order, updating inventory, processing a payment — phantom read isolation isn't required. For reporting or analytics queries that require consistent snapshots, REPEATABLE READ is still appropriate. Shopify evaluates each table's access patterns to determine the right isolation level.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Performance Monitoring Integration</p><p>Shopify monitors deadlock rates as a first-class operational metric. 
The deadlock rate graph is visible on engineering dashboards alongside query latency and error rates. When deadlock rates spike during BFCM, they're caught within seconds — not minutes — and the specific transaction patterns can be identified from the deadlock log immediately. Real-time deadlock visibility is as important as the fix strategies themselves.</p>\n</blockquote>\n\n<p>Index hints are a last resort — they trade flexibility for predictability. MySQL's query planner is usually good at selecting optimal indexes, and overriding it can cause performance regressions when data distributions change. But at BFCM scale, <strong>unpredictable index selection under high concurrency is more dangerous than a slightly suboptimal but consistent plan</strong>. The engineers who added index hints to specific high-volume queries did so with full awareness of the tradeoff, and with monitoring in place to detect if the forced plan becomes problematic over time.</p>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Canonical Lock Order: Application-Level Enforcement</p><p>The canonical lock order fix requires reviewing every code path that modifies multiple related objects within a transaction. In a Rails application, this means identifying all ActiveRecord transactions that update, say, both an Order and an Inventory record, and ensuring they <strong>always acquire the lock on the object with the lower ID first</strong> (or some other consistent ordering). This is a refactoring task, not a database config change — and it requires careful code review to ensure no path is missed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>BEFORE BFCM: THE DEADLOCK AUDIT</strong></p>\n<p>Shopify's engineering team conducts a pre-BFCM deadlock audit: run load tests at maximum expected concurrency, collect InnoDB status outputs repeatedly during the load test, parse the deadlock logs, categorize patterns, apply fixes, re-test. 
The audit is a structured engineering sprint, not an ad-hoc investigation. <strong>The earlier the deadlock patterns are found, the more time there is to implement and validate fixes</strong> before the day that those patterns become 19-per-second incidents.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Shopify's MySQL architecture for the BFCM period reflects years of scaling decisions: a mix of MySQL 5.7 and MySQL 8 instances, federated across over 100 independent shards for the Shopify Core application, with Vitess managing the most rapidly growing workloads. The deadlock patterns that Shopify addressed affect all high-concurrency MySQL deployments — the scale just makes them visible faster and more painfully.</p>\n<h3>Transaction Ordering Deadlock: The Classic Pattern</h3>\n<p><a href=\"https://techlogstack.com/explore/shopify-mysql-deadlocks-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Gap Lock Deadlock: The Invisible Pattern</h3>\n<p><a href=\"https://techlogstack.com/explore/shopify-mysql-deadlocks-2024/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>INNODB LOCK MONITOR: YOUR DIAGNOSTIC TOOL</strong></p>\n<p>MySQL's InnoDB storage engine provides several diagnostic commands for deadlock investigation: <strong>SHOW ENGINE INNODB STATUS</strong> shows the latest deadlock with full transaction and lock details. <strong>performance_schema.data_locks</strong> shows currently held and waiting locks in real time. <strong>performance_schema.events_statements_history</strong> shows recent SQL statements per thread. 
Together, these tools let engineers reconstruct the exact sequence of events that led to a deadlock — which is the prerequisite for applying the correct fix strategy.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Index Hints: The Nuclear Option</p><p>Index hints override MySQL's query planner for a specific query. They are <strong>maintenance debt</strong>: if the data distribution changes and the forced index becomes suboptimal, the hint will cause performance regressions that are difficult to diagnose (the hint is in the application code, not visible in query explain plans without reading the source). Use index hints only for queries where inconsistent index selection is causing production deadlocks, document them extensively, and review them regularly as the database evolves.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>MySQL 8.0 Deadlock Improvements</p><p>MySQL 8.0 introduced several improvements relevant to deadlock diagnosis and prevention. <strong>NOWAIT and SKIP LOCKED</strong> allow queries to immediately return an error or skip locked rows rather than waiting — useful for queue-like patterns. <strong>Invisible indexes</strong> allow testing new indexes without them being used by the query planner. <strong>Improved performance_schema</strong> provides better lock visibility. GitHub's MySQL 8.0 upgrade (covered elsewhere in TechLogStack) and Shopify's own MySQL fleet management both reflect the industry's move to take advantage of these improvements at scale.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Shopify's deadlock playbook is one of the most practical database engineering documents published by a major engineering blog. 
It translates academic database theory into actionable production fixes for the three patterns that account for the vast majority of MySQL deadlocks at high concurrency.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Deadlock rates scale super-linearly with concurrency.</strong> Patterns that are statistically invisible at normal traffic become consistent incidents at peak scale. Test your database workloads at maximum expected concurrency before every major traffic event — not just for performance, but explicitly for deadlock patterns.</li><li><span>02</span><div><em>Gap locks</em> (InnoDB locks that cover the space between indexed values to prevent phantom reads in REPEATABLE READ isolation) are a frequent and underappreciated source of MySQL deadlocks at high concurrency. Consider using READ COMMITTED isolation for high-throughput tables where phantom read protection is not required — it eliminates an entire class of deadlock patterns at the cost of weaker isolation semantics.</li><li><span>03</span><div><strong>Enforce canonical lock acquisition order in your application layer.</strong> If every code path that acquires multiple locks always acquires them in the same order, circular dependencies become impossible. This is the fundamental fix for transaction ordering deadlocks, and it requires reviewing all code paths that modify the same set of resources in a single transaction.</li><li><span>04</span><div>MySQL's query planner can select different indexes under varying concurrency conditions, leading to <strong>lock order variability that produces deadlocks at high load but not low load</strong>. Use EXPLAIN ANALYZE at peak load (not just nominal load) to understand actual query execution plans, and use index hints selectively to force consistent plans for known-problematic queries.</li><li><span>05</span><div>Monitor deadlock rates as a first-class operational metric with real-time visibility. 
<strong>Deadlocks discovered in postmortem analysis are incidents. Deadlocks discovered in real-time monitoring are operational data.</strong> Alerting on deadlock rate spikes during high-traffic events gives engineers the ability to act on deadlock patterns before they escalate to user-visible incidents.</li></ol>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Deadlocks Are Not Random</p><p>Engineers sometimes treat MySQL deadlocks as random background noise — occasional events that retries handle automatically. At high concurrency, deadlocks are not random. They are deterministic consequences of specific query patterns executing at sufficient concurrency. Every deadlock pattern can be analyzed, categorized, and eliminated. Treating deadlocks as random events prevents you from doing that work.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE ISOLATION LEVEL TRADEOFF IN PRACTICE</strong></p>\n<p>Moving a table's workload from REPEATABLE READ to READ COMMITTED is an application-level decision, because MySQL sets the isolation level per session or per transaction rather than per table, and it requires careful per-table evaluation. Questions to answer: <strong>Do any queries on this table require phantom read protection?</strong> (Analytics queries often do; CRUD operations often don't.) <strong>What application logic depends on consistent snapshots within a transaction?</strong> <strong>What is the read-to-write ratio?</strong> READ COMMITTED reduces read-side lock contention, which benefits high-concurrency tables where writes dominate. 
The decision requires understanding the application's correctness requirements, not just the performance characteristics.</p>\n</blockquote>\n\n<blockquote><p>At 19 million queries per second, even a one-in-a-million database error happens 19 times a second — which is Shopify's way of saying there's no such thing as 'too rare to matter.'<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/shopify-mysql-deadlocks-2024/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.676740+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Databases", "Shopify"]}, {"id": "https://techlogstack.com/explore/cloudflare-control-plane-outage-2023/", "url": "https://techlogstack.com/explore/cloudflare-control-plane-outage-2023/", "title": "Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours", "summary": "How a power failure at Cloudflare's Flexential datacenter partner took down the control plane for 40 hours — and prompted Cloudflare to create the Code Orange incide", "content_html": "<p><strong>Cloudflare</strong> · Reliability · 17 May 2026</p>\n<p>On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane — the system that lets customers configure DNS, firewall rules, and every Cloudflare service — went dark. It stayed dark, in various forms, for nearly 40 hours. 
The postmortem introduced a concept Cloudflare hadn't had before: Code Orange.</p>\n<ul>\n<li>~40 hours control plane down</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<blockquote><p>We have not had such a process in the past, but it's clear today we need to implement a version of it ourselves: Code Orange.</p><p><em>— Cloudflare Engineering — via Post-Mortem on the Cloudflare Control Plane and Analytics Outage, November 2023</em></p></blockquote>\n<p>Cloudflare operates one of the world's largest content delivery and security networks, with hundreds of <em>PoPs</em> (Points of Presence — Cloudflare's globally distributed server locations that handle actual traffic routing, DDoS mitigation, and content delivery for customers) handling customer traffic across the globe. These PoPs operate largely autonomously from the control plane — once configurations are pushed, the network continues operating even if the control plane has issues. <strong>But when the control plane goes down, customers can't configure anything.</strong> They can't add DNS records, update firewall rules, change SSL settings, or deploy new Workers. The network keeps running, but it becomes effectively immutable. On November 2, 2023, at 11:43 UTC, that's exactly what happened.</p>\n<p>The cause was a power failure at <strong>Flexential</strong>, Cloudflare's primary datacenter partner hosting the control plane infrastructure. Flexential is not a cloud provider — it's a colocation facility where Cloudflare runs its own physical servers. Power failures in colocation facilities, while rare, happen. What made this incident severe was that the control plane recovery was not automatic. 
Failover to Cloudflare's disaster recovery facility required manual orchestration, and some services — particularly raw log delivery — were not replicated to the DR facility and therefore couldn't be recovered until the primary datacenter came back. <span>Services were still not fully restored 40 hours after the initial failure.</span></p>\n<blockquote>\n<p><strong>🏭</strong></p>\n<p>Cloudflare's control plane infrastructure ran in a colocation datacenter — physical servers in a facility owned by a third party, <strong>not in a public cloud</strong>. This architecture gives Cloudflare hardware control and cost efficiency but means datacenter-level failures require manual failover coordination rather than cloud provider automated region failover.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Flexential Power Failure at 11:43 UTC Nov 2</h4>\n<p>Cloudflare's primary datacenter partner experienced a power failure. The control plane — API, dashboard, analytics services — went offline. Edge traffic continued operating normally (PoPs are autonomous), but customers could not make any configuration changes. Internal monitoring and log analytics were also impacted.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Control Plane Not Designed for Autonomous Failover</h4>\n<p>Unlike Cloudflare's edge network, the control plane was not designed for automatic failover. Recovery required manual orchestration to bring services up at the DR facility. Some data — particularly raw log streams — was not replicated to DR, meaning certain services could not be restored until the primary facility recovered.</p>\n<hr />\n<h3>Solution</h3>\n<h4>DR Failover + Manual Service Restoration</h4>\n<p>Control plane core functionality was restored at the DR facility at 17:57 UTC on Nov 2 — ~6 hours after the incident started. Many customers saw restored API access at this point. 
However, some services continued to experience issues until Nov 4 as teams worked through recovery of systems that had data in the primary datacenter only.</p>\n<hr />\n<h3>Result</h3>\n<h4>Full Restoration Nov 4, Code Orange Invented</h4>\n<p>Services were fully restored at 04:25 UTC on November 4, nearly 40 hours after the initial failure. The incident prompted Cloudflare to create a new process — Code Orange — modeled on Google's Code Yellow/Red, for major incidents requiring all-hands engineering mobilization.</p>\n<hr />\n\n<blockquote>\n<p><strong>CODE ORANGE: A NEW INCIDENT PROCESS IS BORN</strong></p>\n<p>Google has a practice where significant crises trigger a Code Yellow or Code Red — most engineering resources are shifted to address the issue. Cloudflare had no equivalent process before this incident. The 40-hour outage demonstrated the need for a structured mechanism to <strong>mobilize engineering resources across all teams for critical incidents</strong>. Code Orange was created as Cloudflare's version of this concept. The process includes defined escalation paths, cross-team coordination protocols, and clear criteria for when to invoke it.</p>\n</blockquote>\n\n<p>The log push service — Cloudflare's product that delivers raw access logs directly to customer storage buckets in real time — was <strong>unavailable for the majority of the outage duration</strong>. Unlike the control plane API, which could be brought up at the DR facility using replicated state, the log pipeline infrastructure was primarily hosted in the primary datacenter and not fully replicated to DR. Customers who relied on log push for security monitoring, compliance logging, or billing reconciliation had gaps in their log data that could not be recovered. 
Cloudflare's postmortem explicitly noted that some datasets which are not replicated in the EU would have <span>persistent gaps</span> — data that would never be recovered regardless of DR restoration success.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The 'Expect Entire Datacenters to Fail' Principle</p><p>The postmortem contained a striking engineering principle statement: <strong>we must expect that entire data centers may fail</strong>. This is a design requirement, not a risk acceptance. Any system that would be materially impaired by a single datacenter failure needs to be redesigned so that it either operates independently of any single datacenter or fails over automatically when one goes dark. The control plane architecture had not been held to this standard — and the 40-hour outage was the consequence.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📋</strong></p>\n<p>Questions for Flexential Still Outstanding</p><p>As of the postmortem publication, Cloudflare stated it had <strong>a number of questions that needed to be answered from Flexential</strong>. A power failure of this duration at a major colocation facility raises questions about redundant power systems, UPS capacity, diesel generator performance, and facility operations procedures. The postmortem was transparent about this outstanding accountability — an unusual admission that the root cause investigation wasn't complete.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Data Plane Continued Running</p><p>During the entire 40-hour control plane outage, Cloudflare's <strong>data plane continued operating normally</strong> — DDoS mitigation, CDN caching, SSL termination, and traffic routing were all functioning. Customers using Cloudflare for traffic performance and security saw no degradation. This is a testament to Cloudflare's edge-resilient architecture — PoPs operate autonomously from the control plane once configured. 
The outage was exclusively a <em>management plane</em> failure: you couldn't change anything, but what was already configured kept working.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>Analytics and Dashboard Dark for Enterprise Customers</p><p>For Cloudflare's enterprise customers, the control plane outage had real operational consequences beyond configuration changes. Security dashboards showing live attack traffic, WAF logs, DNS analytics, and firewall event monitoring were all unavailable. During a 40-hour window when customers couldn't see what was happening on their infrastructure, security teams had reduced visibility precisely when they might have needed it most. The monitoring and analytics darkness was in some ways more operationally painful than the configuration lock.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE CUSTOMER EXPERIENCE ASYMMETRY</strong></p>\n<p>Cloudflare's customer base experienced the outage very differently based on what they used Cloudflare for. <strong>Performance-focused customers</strong> (CDN, caching) saw nothing — their traffic ran fine. <strong>Security-focused customers</strong> (WAF, DDoS mitigation) had protection but lost visibility into attacks. <strong>Developer customers</strong> (Workers, Pages, DNS) were locked out of deploying changes for 40 hours. <strong>Analytics-dependent customers</strong> had data gaps that couldn't be recovered. Same outage, four different impact profiles.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Post-Incident Architecture Changes</h3>\n<p>The November 2023 control plane outage forced Cloudflare to confront a fundamental architectural gap: the network edge (PoPs) was designed for resilience and independence, but the control plane was not. The fixes needed were architectural — not configuration tweaks. 
The postmortem identified several categories of required investment: automatic failover for control plane services, expanded data replication to DR, staged rollouts for configuration changes to prevent future single-change outages, and the Code Orange process for mobilizing resources during major incidents.</p>\n<ul>\n<li><strong>~40h</strong> — Total outage duration from power failure to full service restoration across all affected services — November 2–4, 2023</li>\n<li><strong>6h</strong> — Time to partial restoration at DR facility — core API and dashboard functions restored at 17:57 UTC on November 2 after the 11:43 UTC failure</li>\n<li><strong>0</strong> — Data recovery possible for log push gaps — certain log streams not replicated to DR resulted in permanent data loss for the outage window</li>\n<li><strong>1</strong> — New process created: Code Orange — Cloudflare's all-hands engineering mobilization protocol for major incidents, modeled on Google's Code Yellow/Red</li>\n</ul>\n\n<pre><code># Conceptual DR architecture requirements derived from the incident\n# Every control plane service needs to meet these criteria:\n\ncontrol_plane_service:\n  # Recovery requirements\n  automatic_failover:\n    enabled: true  # No manual orchestration needed\n    rto: \"&lt; 30 minutes\"  # Recovery Time Objective\n    rpo: \"&lt; 5 minutes\"   # Recovery Point Objective\n  \n  # Data replication requirements\n  data_replication:\n    primary_dc: \"colo-us-east\"\n    dr_dc: \"colo-us-west\"    # replicated to DR\n    replication_lag: \"&lt; 60s\"  # data freshness at DR\n    log_streams:              # ALL streams, not just some\n      replicated: true        # persistent log gaps = customer data loss\n  \n  # Configuration safety\n  config_rollout:\n    staged: true             # NOT global instant propagation\n    health_checks: true      # validate before wider rollout\n    rollback_trigger: \"auto\" # revert on health check failure</code></pre>\n<blockquote>\n<p><strong>THE 
CONFIGURATION ROLLOUT PROBLEM (FORESHADOWING)</strong></p>\n<p>In the postmortem, Cloudflare identified a specific action item: <strong>'Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input'</strong> — including staged rollouts rather than global instant propagation of configuration files. This action item would become prophetic: weeks after the November 2023 outage, a global configuration change for Bot Management caused another major outage. The staged rollout work hadn't been completed in time.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Code Orange: The All-Hands Protocol</p><p>Cloudflare's new Code Orange process, modeled on Google's Code Yellow/Red, provides a structured mechanism for major incidents. When Code Orange is declared, most or all engineering resources shift to the incident. The process defines: who has authority to declare Code Orange, how resources are mobilized, what communication protocols apply, and what the criteria are for declaring the incident resolved. Before Code Orange, major incidents relied on informal escalation — which is slower and less coordinated under pressure.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Colocation Trade-Off</p><p>Cloudflare chose colocation over cloud hosting for the control plane for cost and hardware control reasons. This decision is defensible — cloud hosting at Cloudflare's scale would be extremely expensive. But it comes with a trade-off: manual failover instead of automated region failover. The November 2023 incident makes this trade-off concrete: 40 hours of control plane impact versus the cost of cloud hosting or of building proper automatic DR. 
Organizations must make this trade-off explicitly, not accidentally.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The DR Facility Limitation</p><p>Cloudflare's disaster recovery facility was able to handle core API and dashboard functionality after the 6-hour manual failover. But it could not handle everything. Services that stored state locally in the primary datacenter without replication — including the log push pipeline — could not be restored at DR. The architectural lesson: <strong>DR readiness is not a single binary state</strong>. Each service needs to be evaluated individually for DR completeness: which data is replicated, which processes can be restarted at DR, and what the residual capability is when the primary is down.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔍</strong></p>\n<p>Questions for Flexential Still Open</p><p>Cloudflare's postmortem noted they had <strong>outstanding questions for Flexential</strong> about the power failure — specifically about the adequacy of redundant power systems, UPS performance, and datacenter operations. This transparency is notable: Cloudflare publicly acknowledged that they didn't yet have the full picture of why a colocation facility experienced a power failure significant enough to take down their control plane for 40 hours, and that accountability from the vendor was part of the recovery process.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Cloudflare's architecture has a conceptual split: the <strong>data plane</strong> (PoPs, edge servers, traffic processing) and the <strong>control plane</strong> (API, dashboard, configuration management, analytics). The data plane is designed for resilience — distributed across 300+ locations, able to continue serving traffic even if the control plane is unavailable. The control plane was designed for correctness and consolidation — centralized, manageable, cost-efficient. 
The November 2023 incident exposed the asymmetry: data plane is nearly indestructible; control plane is a single point of failure.</p>\n<h3>Cloudflare's Architecture: Data Plane vs Control Plane</h3>\n<p><a href=\"https://techlogstack.com/explore/cloudflare-control-plane-outage-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Required Control Plane Architecture (DR + Replication)</h3>\n<p><a href=\"https://techlogstack.com/explore/cloudflare-control-plane-outage-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE EDGE IS RESILIENT. THE CENTER IS NOT.</strong></p>\n<p>Cloudflare's network architecture reflects a common pattern in distributed systems: <strong>the edges are designed for resilience; the center is designed for convenience</strong>. Edge nodes must work autonomously during control plane outages. Centralized control planes are often designed for operational simplicity rather than failure resilience — because failures at the center are rare and the cost of the engineering to harden them is high. The November 2023 incident makes the explicit argument that for systems managing global internet infrastructure, the center must be designed with the same resilience standards as the edges.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Log Data Persistence: A Gap in DR Planning</p><p>The log push service's inability to be restored at the DR facility reveals a gap in DR planning that is common in operational log pipelines: <strong>log data is often considered less critical than application data</strong> and receives less investment in replication and DR. But for Cloudflare's enterprise customers, raw access logs are security audit trails, compliance records, and billing evidence. 
A gap in log data can trigger regulatory issues, security investigations, and customer trust problems that persist long after service is restored.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Colocation vs Cloud: The Resilience Trade-Off Made Explicit</p><p>The Cloudflare control plane outage makes concrete a trade-off that every infrastructure team makes implicitly: <strong>colocation is cheaper and gives hardware control, but requires manual failover during datacenter failures</strong>. Public cloud providers offer automated region failover but at significantly higher cost. The November 2023 incident is the explicit data point for how expensive the colocation approach can be during a once-a-decade power failure: 40 hours of control plane impact, customer trust damage, and the engineering investment required to build DR parity that cloud providers offer by default.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The November 2023 Cloudflare control plane outage is a master class in the asymmetry between designing edges and designing centers. The edge was fine. The center failed. The lesson is to apply the same resilience standards to both.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Expect entire data centers to fail.</strong> This is not a pessimistic assumption — it's a design requirement. Any system that cannot survive the loss of a single datacenter needs to be redesigned. Resilience is not about preventing failure; it's about maintaining operation through it.</li><li><span>02</span><div><em>Automatic failover</em> (a mechanism that detects failure and routes traffic to backup systems without human intervention) versus manual failover is the difference between a 30-minute recovery and a 6-hour recovery. 
Colocation is cost-efficient, but if it requires manual failover orchestration, the cost of the efficiency is availability during rare but consequential failures.</li><li><span>03</span><div><strong>Replicate all data to your DR facility, not just the most critical data.</strong> Log pipelines, analytics streams, and operational data that is not replicated to DR creates irrecoverable data gaps during primary datacenter failures. The cost of replication is fixed; the cost of data loss is unbounded.</li><li><span>04</span><div>Large incidents benefit from an <strong>explicit all-hands mobilization process</strong>. Google's Code Yellow/Red, Cloudflare's new Code Orange — these are mechanisms for concentrating engineering attention on critical problems without the usual resource negotiation overhead. If your organization doesn't have an equivalent, the first time you need one will be too late to create it.</li><li><span>05</span><div>Configuration changes to global infrastructure need <strong>staged rollouts</strong>, not global instant propagation. The November 2023 postmortem explicitly identified this as a required improvement. The subsequent December 2023 global outage from a Bot Management configuration change was a direct consequence of that improvement not yet being implemented. Infrastructure safety systems must be built before the next incident, not after.</li></ol>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Staged Rollout Action Item and the December Outage</p><p>The November 2023 postmortem explicitly identified staged configuration rollouts as a required improvement. The December 2023 Bot Management outage happened because that work hadn't been completed yet — another global configuration change propagated instantly and caused a systemwide outage. 
This sequence is a stark illustration of why postmortem action items need urgency tracking: the cost of the follow-on incident exceeded the cost of the work that would have prevented it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>DATA REPLICATION IS NOT OPTIONAL FOR LOGS</strong></p>\n<p>The log push service's permanent data gaps during the outage elevated a truth that gets underinvested in: <strong>operational log data is customer data</strong>. For security audits, compliance requirements, billing reconciliation, and forensic investigation, log streams are as important as application data. Organizations that treat logs as ephemeral operational noise rather than durable customer data will find this assumption tested during major incidents.</p>\n</blockquote>\n\n<blockquote><p>Cloudflare routed traffic for half the internet the entire time their control plane was dark — which proves the edge works and proves the center matters, simultaneously.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/cloudflare-control-plane-outage-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.680660+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Cloudflare"]}, {"id": "https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/", "url": "https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/", "title": "A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic", "summary": "How a ClickHouse database permissions change generated a corrupt Bot Management configuration file that propagated globally, taking down 
28% of Cloudflare's HTTP tra", "content_html": "<p><strong>Cloudflare</strong> · Reliability · 17 May 2026</p>\n<p>On November 2, 2023 — the same day as the control plane datacenter failure — Cloudflare also experienced a separate six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.</p>\n<ul>\n<li>28% of HTTP traffic impacted</li><li>6 hours total duration</li><li>2.5h to find root cause</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>November 2, 2023 was an unusually bad day at Cloudflare. The datacenter power failure that took down the control plane had already created a major incident. Then, separately and concurrently, a different failure caused a completely independent global outage affecting 28% of Cloudflare's HTTP traffic. The two incidents shared a date but not a cause. The Bot Management outage was caused by a database permission change in <em>ClickHouse</em> (a column-oriented database designed for real-time analytical queries, used by Cloudflare for its Bot Management system to query feature metadata) that inadvertently generated a <strong>corrupt configuration file</strong> — and the corrupt file was propagated globally to every Bot Management node before anyone noticed something was wrong.</p>\n<p>The mechanics are precise. Cloudflare's Bot Management system queries a ClickHouse database to fetch <strong>feature metadata</strong> — data used to evaluate whether a given request exhibits bot-like behavior patterns. 
A database change altered the permissions for queries, causing them to fall back to a different database called 'default' that contained a different, larger set of 60 features rather than the distributed tables normally used. The Bot Management configuration file generator fetched this expanded feature set, generated a file that was larger than the software processing it could handle, and emitted the oversized file. The oversized file was then <strong>propagated throughout Cloudflare's global network</strong> — instantly and completely — as a standard configuration update.</p>\n<blockquote>\n<p><strong>THE GLOBAL PROPAGATION PROBLEM</strong></p>\n<p>Cloudflare's configuration system was designed to propagate changes globally as fast as possible — this is a feature for legitimate configuration updates. For security changes, speed matters. For this incident, speed was the accelerant: a corrupt configuration file reached <strong>every Cloudflare server globally within seconds</strong> of being generated. There was no staged rollout, no canary deployment, no percentage-based rollout. One bad file. Every server. Instantly.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>ClickHouse Permission Change Triggers Fallback</h4>\n<p>A database permission change in ClickHouse caused Bot Management queries to fall back from distributed tables to the 'default' database containing 60 features. The configuration file generator fetched the larger dataset, generating a file that exceeded the size limit of the consuming software.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Oversized Config Silently Propagated Globally</h4>\n<p>The oversized configuration file was not validated before propagation. Cloudflare's configuration distribution system treated it like any other config update and propagated it globally to all Bot Management nodes. Every node crashed when it tried to load the oversized file. 
28% of HTTP traffic was impacted because Bot Management is in the critical path for Cloudflare's proxy layer.</p>\n<hr />\n<h3>Solution</h3>\n<h4>2.5h to Find Root Cause, 3.5h to Fix and Deploy</h4>\n<p>It took 2.5 hours to identify the incorrect configuration files as the source of the outage — early investigation suspected a DDoS attack because Cloudflare's status page, in an unrelated failure, went offline at the same time. Once identified, stopping the propagation and deploying a correct file took another hour, and cleanup took 2.5 more hours.</p>\n<hr />\n<h3>Result</h3>\n<h4>Service Restored 6 Hours After Start</h4>\n<p>The outage was resolved at 17:06 UTC, approximately 6 hours after it started. A new configuration file was deployed. Bot Management came back online globally. The postmortem identified staged configuration rollouts as the primary required fix — the same action item from the control plane outage postmortem that hadn't been implemented yet.</p>\n<hr />\n\n<blockquote>\n<p><strong>🔍</strong></p>\n<p>Cloudflare's status page went offline at the same time as the outage, causing the incident response team to <strong>initially suspect a DDoS attack</strong>. The status page failure was a coincidence — an unrelated issue — not part of the outage. This created a 2.5-hour investigation red herring: engineers were looking for evidence of an attack while the actual cause was a configuration file size issue.</p>\n</blockquote>\n\n<blockquote><p>None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. Sent the draft over to the SF team, who did one more sweep, then posted it.</p><p><em>— Matthew Prince, CEO of Cloudflare — discussing the postmortem publication, via The Pragmatic Engineer</em></p></blockquote>\n<p>Cloudflare CEO Matthew Prince wrote the first version of the incident review at home in Lisbon, the evening the incident resolved. 
This was not a PR-managed corporate response — it was an engineer's honest account of what went wrong, written while the incident was fresh. The postmortem was then circulated internally, reviewed by the SF team, and published. The same-day publication is unusual for a company of Cloudflare's size and is a demonstration of the <strong>cultural commitment to transparency</strong> that makes Cloudflare's postmortems some of the most cited in the industry.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The November 2023 Postmortem Action Item (Uncompleted)</p><p>The previous November 2023 Cloudflare control plane outage had included an explicit action item: implement staged configuration rollouts so that configuration files do not propagate immediately to the full global network. The Bot Management outage was, in part, a consequence of that work not yet being completed. The postmortem was explicit: staged config rollouts 'remains our first priority across the organization' but implementation was a large project that could take months.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>Why 28% of Traffic Was Affected</p><p>Bot Management is not a peripheral feature — it's in the critical path for Cloudflare's proxy layer. When Bot Management crashes on a node, that node's proxy functionality goes offline. 28% of Cloudflare's HTTP traffic routes through nodes where Bot Management is active in the serving path. This architectural coupling — a feature module that can take down the core proxy function — is exactly the kind of dependency that staged rollouts would have contained: a crash on 1% of nodes is very different from a crash on 100%.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Bot Management Architecture: Why It Was Critical Path</p><p>Cloudflare's Bot Management evaluates every HTTP request against behavioral signals to determine if it's bot traffic. 
This evaluation happens <strong>inline in the request path</strong> — the proxy holds the request while Bot Management runs its checks. This design is necessary for real-time bot mitigation: if the check happened asynchronously, bots could complete their requests before being blocked. The trade-off is that a Bot Management failure blocks the request path entirely rather than allowing traffic through unprotected.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>Two Outages, One Day: The Coincidence Tax</p><p>The fact that Cloudflare experienced two separate major outages on November 2, 2023 — one from a datacenter power failure, one from a configuration file — created disproportionate reputational damage. Each incident was explainable individually. Together, they suggested to some customers that Cloudflare had a systemic reliability problem rather than two independent bad-luck events. The same is true in reliability engineering generally: coincident failures compound trust damage beyond what either would cause alone.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE ERROR LOGGING GAP</strong></p>\n<p>One finding from the postmortem: the line of code that returned an error from the oversized configuration file <strong>did not log the error</strong>. If errors had been logged and alerted on when they spiked on nodes, root cause identification would have taken minutes rather than 2.5 hours. Logging errors at the point they occur — not just aggregating them — and alerting on error rate spikes is fundamental debugging infrastructure. This was one of the most actionable lessons from the incident.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Required Fixes: Staged Rollouts and Config Validation</h3>\n<p>The Bot Management outage had two independent root causes that both needed to be addressed. 
The first: the ClickHouse permission change that caused the query fallback should have been tested in a staging environment where the configuration file output could be validated before propagation. The second: the configuration distribution system should have validated the file size and format before propagating globally — and should never have propagated a configuration change globally and instantly regardless of its validity.</p>\n<ul>\n<li><strong>28%</strong> — HTTP traffic impacted — because Bot Management is in the critical path of Cloudflare's proxy layer, a module crash takes down the proxy function for that node</li>\n<li><strong>2.5h</strong> — Time to identify the root cause — delayed by initial suspicion of DDoS attack after the status page coincidentally went offline at the same time</li>\n<li><strong>6h</strong> — Total outage duration from start to full resolution — 2.5h investigation, 1h fix deployment, 2.5h cleanup</li>\n<li><strong>Instant</strong> — Configuration propagation speed before fix — the system was designed to propagate configs globally as fast as possible, which made the corrupt config a global instant failure</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified config validation and staged rollout logic\n# Addresses both root causes of the Bot Management outage\n\nclass ConfigValidationError(Exception):\n    pass  # config rejected before any propagation\n\nclass ConfigDeploymentError(Exception):\n    pass  # staged rollout stopped at a failed health gate\n\n# Illustrative placeholder registry: config type -> parser exposing validate(data)\nCONFIG_PARSERS = {}\n\nclass ConfigDeployer:\n    # The _deploy_*, _health_check_passes and _rollback helpers are\n    # environment-specific and intentionally omitted from this sketch.\n    MAX_CONFIG_SIZE_BYTES = 10_000_000  # explicit size limit\n    \n    def deploy_config(self, config_data: bytes, config_type: str):\n        # VALIDATION GATE: Reject invalid configs before any propagation\n        self._validate_config(config_data, config_type)\n        \n        # STAGED ROLLOUT: Not global-instant anymore\n        # Phase 1: Deploy to 1% of nodes\n        self._deploy_to_percentage(config_data, pct=0.01)\n        if not self._health_check_passes(window_minutes=5):\n            self._rollback()  # automatic rollback on health failure\n            raise ConfigDeploymentError(\"Health check failed at 1%\")\n        \n  
      # Phase 2: Expand to 10%\n        self._deploy_to_percentage(config_data, pct=0.10)\n        if not self._health_check_passes(window_minutes=5):\n            self._rollback()\n            raise ConfigDeploymentError(\"Health check failed at 10%\")\n        \n        # Phase 3: Full deployment\n        self._deploy_global(config_data)\n    \n    def _validate_config(self, data: bytes, config_type: str):\n        # Size validation — catches the ClickHouse fallback issue\n        if len(data) > self.MAX_CONFIG_SIZE_BYTES:\n            raise ConfigValidationError(\n                f\"Config size {len(data)} exceeds maximum {self.MAX_CONFIG_SIZE_BYTES}\"\n            )\n        # Schema validation — catches structural issues\n        parser = CONFIG_PARSERS[config_type]\n        parser.validate(data)  # raises on malformed config</code></pre>\n<blockquote>\n<p><strong>THE INVESTIGATION RED HERRING</strong></p>\n<p>One of the most instructive details in this postmortem is the <strong>DDoS attack hypothesis</strong>. Cloudflare's status page went offline coincidentally at the same time as the Bot Management outage — completely unrelated. Incident responders, seeing both the outage and the status page failure, initially focused on finding evidence of an attack. This wasted 2.5 hours investigating the wrong hypothesis. The lesson: when an incident starts, explicitly enumerate and test competing hypotheses rather than pursuing only the first plausible one.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>ClickHouse Permission Architecture</p><p>Cloudflare's Bot Management uses ClickHouse to query feature metadata — data about which behavioral signals to look for in traffic. The ClickHouse cluster had two query paths: the distributed tables path (normal operation, queries a subset of features), and the 'default' database fallback (60 features, designed for different purposes). 
The permission change that triggered the fallback was routine maintenance — <strong>there was no intent to cause the fallback</strong>. This is a reminder that permission changes to production databases require the same testing rigor as code changes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Same-Day Postmortem: The Transparency Standard</p><p>Cloudflare published the incident postmortem on the same day as the outage. This is exceptional — most companies take days or weeks to publish postmortems. The same-day publication reflects a culture where transparency with customers is treated as part of incident response, not a post-recovery PR exercise. Cloudflare's CEO wrote the first draft the evening the incident resolved. That speed and candor is why Cloudflare's postmortems are among the most trusted in the industry.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Missing Error Log</p><p>A key finding in the postmortem: the code that crashed when loading the oversized configuration file <strong>returned an error but did not log it</strong>. This meant that even as nodes were crashing, the specific error causing the crash was not visible in monitoring. Engineers investigating the incident had to work backward from service failures rather than forward from error messages. Every error should be logged at the point it occurs, and log-level alerts should be configured for error rate spikes.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE SPEED-SAFETY TRADEOFF IN CONFIG PROPAGATION</strong></p>\n<p>Cloudflare's instant global config propagation was designed for a real use case: when a new DDoS attack signature is detected, Cloudflare needs to push the mitigation rule globally as fast as possible. <strong>Security changes genuinely benefit from fast propagation</strong>. 
The fix isn't to make config propagation slower — it's to distinguish between security-critical changes (fast propagation with validation) and configuration updates (staged rollout with health gates). Not all configuration changes have the same urgency requirements.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>The Bot Management outage reveals how Cloudflare's internal architecture works at a feature module level. Bot Management is a module within Cloudflare's proxy software that evaluates every HTTP request against bot detection criteria. When it loads its configuration file at startup (or on configuration update), it reads the feature definitions that determine what signals to analyze. If that configuration file is oversized or malformed, the module crashes — and because it's in the critical path of the proxy, the proxy function for that node crashes too.</p>\n<h3>Bot Management Outage: The Configuration Propagation Chain</h3>\n<p><a href=\"https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>After: Config Validation + Staged Rollout Architecture</h3>\n<p><a href=\"https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE 'SAME MISTAKE TWICE' CONCERN</strong></p>\n<p>Two separate Cloudflare outages in close succession created a serious customer confidence problem, and only one of them was outside the company's control. The November 2023 datacenter outage was an external failure. The Bot Management outage was a self-inflicted failure: a configuration change propagated globally without a staged rollout, a gap that the team had already identified from the prior incident. 
<strong>Customers rightly noticed the pattern</strong>. CTO Dane Knecht acknowledged in the postmortem that global configuration changes 'remains our first priority across the organization' — a public commitment to completing the staged rollout work that the team already knew it needed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Module Criticality and Blast Radius</p><p>An architecture question raised by this incident: <strong>should Bot Management be in the critical path of the proxy layer?</strong> If Bot Management crashes, the proxy crashes. An alternative design isolates Bot Management as a non-critical component that the proxy bypasses on failure — allowing traffic to flow (without bot protection) rather than blocking entirely. This fail-open vs fail-closed design decision has security implications (fail-open allows bots through temporarily) versus availability implications (fail-closed takes the proxy down). For a CDN, the availability argument may outweigh the security argument.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>🔒</strong></p>\n<p>Fail-Open vs Fail-Closed: The Bot Management Design Decision</p><p>The Bot Management outage surfaces a fundamental architecture decision: when a security module fails, should the system <strong>fail-open</strong> (allow traffic through unprotected) or <strong>fail-closed</strong> (block traffic until the module recovers)? Fail-open maintains availability but exposes customers to unprotected bot traffic during the failure window. Fail-closed maintains security posture but impacts availability. Cloudflare's current design is fail-closed — 28% of traffic went down rather than flowing unprotected. 
<strong>The right answer depends on whether your customers value security continuity or availability continuity more</strong> during module failures.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The Cloudflare Bot Management outage teaches a simple lesson about configuration safety that applies to every distributed system: fast global propagation is an availability risk. The lessons here are architectural and process-oriented.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Validate configuration files before propagating them.</strong> Size limits, schema validation, and semantic checks should all run before a configuration update is distributed to production nodes. A corrupt config that fails validation is an alert; a corrupt config that propagates globally is an outage.</div></li><li><span>02</span><div><em>Staged rollouts</em> (deploying configuration changes to a small percentage of nodes first, checking health, then expanding gradually) for configuration changes are as important as staged rollouts for code changes. The same principles apply: canary, health gate, expand. Global instant propagation for configuration changes is a global outage waiting to happen.</div></li><li><span>03</span><div><strong>Database permission changes are code changes.</strong> They modify system behavior and can cause unexpected fallbacks, query plan changes, and downstream effects. Test them in staging. Apply them with the same rigor as schema migrations. The Cloudflare ClickHouse permission change was routine maintenance that caused a global outage because it wasn't tested for downstream effects.</div></li><li><span>04</span><div>When investigating incidents, explicitly enumerate competing hypotheses and test the most likely ones in parallel. <strong>The DDoS false lead cost 2.5 hours</strong> because investigators committed too quickly to one explanation. 
Structured incident investigation that tests multiple hypotheses simultaneously finds root causes faster.</div></li><li><span>05</span><div>Postmortem action items must have urgency. <strong>The same staged rollout improvement identified in the November 2023 control plane outage postmortem would have prevented the Bot Management outage</strong> if it had been implemented before the second incident. Postmortem action items are not backlog items — they are debt with interest that accrues in the form of the next incident.</div></li></ol></div>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The 2023 Cloudflare Transparency Report</p><p>Cloudflare's CEO published the incident review on the same day as the outage, and the write-up was detailed and candid about the mistakes made. This level of post-incident transparency is unusual and valuable for the industry. <strong>When major infrastructure providers share honest postmortems</strong>, they give other engineering teams a chance to learn from failures they didn't experience themselves — and raise the industry standard for incident communication.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>CONFIGURATION AS CODE: THE MISSING GATE</strong></p>\n<p>The Bot Management config file was generated by a system that fetched data from a database and formatted it. This is code that produces configuration. It had no equivalent of a test suite, a staging environment validation, or a size limit check. <strong>Configuration generators need the same quality gates as application code</strong>: unit tests for the generation logic, integration tests against real database states, validation of the output before propagation, and size/schema checks at the distribution layer. 
Configuration generation is engineering, not operations.</p>\n</blockquote>\n\n<blockquote><p>The same configuration safety fix that would have prevented the first outage also would have prevented the second outage — which makes the second outage Cloudflare's most expensive action item ever left in a backlog.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/cloudflare-bot-management-outage-2023/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.776646+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Cloudflare"]}, {"id": "https://techlogstack.com/explore/netflix-maestro-100x-2025/", "url": "https://techlogstack.com/explore/netflix-maestro-100x-2025/", "title": "Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow", "summary": "How Netflix's data team rewrote Maestro's workflow engine to achieve 100x throughput improvement after Live, Ads, and Games drove sub-hourly scheduling requirements.", "content_html": "<p><strong>Netflix</strong> · Performance · 17 May 2026</p>\n<p>Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead — not in crashes or alerts, but in slow step launches that nobody had measured. 
The fix was a targeted rewrite of the flow engine that delivered a 100x throughput improvement.</p>\n<ul>\n<li>100x throughput improvement</li><li>2.5 years before overhead visible</li><li>1M+ tasks/day still supported</li></ul>\n\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Two and a half years after Netflix's Maestro workflow orchestrator replaced Meson, it had achieved its design goals: horizontal scalability, support for hundreds of thousands of workflows, reliable execution of millions of jobs per day. By 2024, however, Netflix's business had changed in ways that revealed new performance requirements. <strong>Live programming, Ads, and Games</strong> drove use cases with sub-hourly scheduling needs — ad targeting pipelines that needed to run every 15 minutes, live event data processing that needed to execute within seconds of an event, low-latency ad hoc queries. These workloads exposed overhead in Maestro's step execution path that had been invisible during daily and hourly ETL workflows. The orchestrator wasn't broken — but it was noticeably slower than it needed to be for a new class of latency-sensitive use cases.</p>\n<blockquote>\n<p><strong>⏱️</strong></p>\n<p>The overhead that sub-hourly workloads exposed wasn't measured in seconds of latency — it was measured in <strong>fractions of seconds of step launch time</strong> that added up across thousands of daily executions. For hourly ETL pipelines, a 200ms step launch overhead is irrelevant. 
For 15-minute ad targeting workflows with hundreds of steps, that overhead becomes a material fraction of the entire scheduling budget.</p>\n</blockquote>\n\n<p>The Maestro engineering team investigated the overhead and traced it to the <strong>flow engine</strong> — the component responsible for managing state transitions between workflow steps. The original flow engine had been built on top of Netflix Conductor, an open-source workflow orchestration system that provided a full feature set of state management capabilities. Maestro used only a subset of Conductor's features — lightweight state transitions — but paid the overhead of Conductor's full implementation. This overhead was acceptable at 1-million-task-per-day scale with daily scheduling. It was unacceptable for the sub-hourly, low-latency workloads that Netflix's evolving product portfolio demanded.</p>\n<blockquote>\n<p><strong>THE INVISIBLE OVERHEAD</strong></p>\n<p>The flow engine overhead didn't cause errors or trigger alerts. Workflows completed. SLOs were met. But the step launch time was <strong>higher than it needed to be</strong>, and for sub-hourly workloads, 'higher than needed' became 'unacceptably slow.' This is a class of performance issue that only becomes visible when new use cases push the system closer to its boundaries — the boundary had always been there, but daily ETL workloads never reached it.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Sub-Hourly Workloads Expose Step Launch Overhead</h4>\n<p>Netflix's expansion into Live, Ads, and Games drove scheduling requirements as short as 15 minutes. Sub-hourly workflows executing hundreds of steps were sensitive to per-step launch overhead that was invisible on daily ETL pipelines. 
The Maestro flow engine's overhead, acceptable at hourly+ scheduling, became a bottleneck for the new use case class.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Flow Engine Built on Conductor's Full Feature Set</h4>\n<p>Maestro's flow engine used Netflix Conductor for state management, but only needed lightweight state transitions — not Conductor's full feature set. The team also considered Temporal (optimized for inter-process orchestration via external service calls) but concluded that coupling the DAG engine to an external service introduced unnecessary reliability risk at 1M+ daily tasks.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Purpose-Built State Machine: Keep DAG, Rewrite Flow Engine</h4>\n<p>The team kept the DAG engine (workflow definition and dependency management) and rewrote only the flow engine (state transitions). The new flow engine was purpose-built for Maestro's specific requirements: lightweight state transitions at very high frequency, without the overhead of a general-purpose state management framework.</p>\n<hr />\n<h3>Result</h3>\n<h4>100x Throughput Improvement</h4>\n<p>The rewritten flow engine delivered 100x throughput improvement, enabling the sub-hourly and low-latency use cases that Netflix's Live, Ads, and Games products required. The improvement opened new possibilities for workflow orchestration at Netflix that hadn't been feasible on the original engine.</p>\n<hr />\n\n<blockquote><p>We felt it was an unnecessary source of risk to couple the DAG engine execution with an external service call. 
If our requirements went beyond lightweight state transition management we might reconsider because Temporal is a very robust control plane orchestration system, but for our needs it introduced complexity and potential reliability weak spots when there was no direct need for the advanced feature set that it offered.</p><p><em>— Netflix Engineering — via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'</em></p></blockquote>\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why Not Temporal?</p><p>Temporal is a popular workflow orchestration framework that handles complex, long-running workflows with strong durability guarantees. The Netflix team evaluated it seriously but concluded it was optimized for a different use case: <strong>inter-process orchestration via external service calls</strong>. Maestro operates at 1M+ daily tasks; coupling the DAG execution engine to an external Temporal service call for each state transition would add network latency and a reliability dependency to the most critical path in the system. For Maestro's needs — lightweight, in-process state transitions at very high frequency — Temporal was over-engineered and over-coupled.</p>\n</blockquote>\n\n<p>The architectural decision to keep the DAG engine while rewriting only the flow engine reflects a key engineering principle: <strong>surgical rewrites are better than complete rewrites when you can precisely identify the component causing the problem</strong>. The DAG engine — the code that parses workflow definitions, evaluates dependencies, and determines which steps are ready to execute — was not the source of the overhead. Replacing it alongside the flow engine would have added scope, risk, and development time without addressing the actual bottleneck. 
The team's ability to identify precisely where the overhead lived was the prerequisite for a scoped, successful rewrite.</p>\n<blockquote>\n<p><strong>🚀</strong></p>\n<p>New Use Cases Unlocked</p><p>The 100x throughput improvement wasn't just a quantitative improvement in existing workflows — it <strong>unlocked qualitatively new use cases</strong>. Ad targeting pipelines that previously ran hourly can now run on 15-minute cycles, providing fresher signals. Live event data processing can now run within seconds of event completion rather than waiting for the next hourly window. The performance improvement changed what Netflix could build, not just how fast they could run existing things.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The 2.5-Year Latency to Visibility</p><p>Maestro had operated successfully for two and a half years before the sub-hourly workloads revealed the flow engine overhead. This timeline is instructive: <strong>performance bottlenecks are often invisible until a new use case pushes the system closer to its limits</strong>. Daily ETL pipelines completing in hours have no reason to notice a 200ms step launch overhead. 15-minute ad targeting pipelines immediately feel it. Building systems with performance observability from the start allows bottlenecks to be found proactively rather than reactively.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>LIVE, ADS, GAMES: THE PRODUCT DRIVERS</strong></p>\n<p>Netflix's expansion into live events (sports, comedy specials, live programming), advertising (a new revenue stream launched 2022), and games (mobile and cloud gaming) created data pipeline requirements that hadn't existed in Netflix's purely subscription VOD model. <strong>Advertising requires near-real-time data to be effective</strong>: ad targeting signals from viewer behavior need to be processed and applied within minutes, not hours. 
Live events generate immediate engagement data that needs to flow through analytics pipelines before the event ends. These new product lines were the forcing function for Maestro's performance improvement.</p>\n</blockquote>\n\n<blockquote><p>We built the new flow engine from first principles specifically for Maestro's requirements — lightweight state transitions at very high frequency, without coupling the DAG execution engine to an external service call on every state change.</p><p><em>— — Netflix Engineering — via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'</em></p></blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Flow Engine Rewrite</h3>\n<p>The new flow engine was designed from first principles for Maestro's specific requirements. Rather than building on Conductor's general-purpose state management or Temporal's inter-process orchestration, the team implemented a purpose-built state machine that handled exactly the transitions Maestro needed: step-ready → running → completed/failed, with retry and timeout logic, at extremely high frequency without external service dependencies. 
The design was minimal by intention: every abstraction layer that wasn't serving Maestro's use case was eliminated.</p>\n<ul>\n<li><strong>100x</strong> — Throughput improvement from the flow engine rewrite — enabling sub-hourly scheduling and low-latency ad hoc queries that were infeasible on the original engine</li>\n<li><strong>2.5 years</strong> — Time Maestro operated successfully before the sub-hourly use case revealed the flow engine overhead — a reminder that performance requirements change as products evolve</li>\n<li><strong>0</strong> — External service dependencies in the new flow engine — state transitions happen in-process, eliminating the network latency and reliability coupling of external orchestration services</li>\n<li><strong>Kept DAG</strong> — Components preserved from the original architecture — the DAG engine was not the bottleneck and was not rewritten, limiting scope and risk</li>\n</ul>\n\n<pre><code>// Conceptual: The old flow engine approach vs new flow engine\n// Old: Conductor-based state management (full feature set, higher overhead)\n// New: Purpose-built lightweight state machine\n\n// OLD APPROACH: Conductor state transitions\n// Each step state change requires a round-trip to Conductor's state store\n// Conductor evaluates full state management logic for each transition\nclass OldStepExecutor {\n    void onStepComplete(Step step, StepResult result) {\n        // Conductor handles state transition — full feature set overhead\n        conductor.updateTaskStatus(\n            step.taskId,\n            result.toTaskResult()  // serialization + network call\n        );\n        // Conductor evaluates downstream dependencies\n        conductor.decide(step.workflowId); // another network call\n    }\n}\n\n// NEW APPROACH: Purpose-built in-process state machine\n// State transitions are in-memory, no external service calls\n// Only the transitions Maestro needs, optimized for high frequency\nclass NewStepExecutor {\n    void 
onStepComplete(Step step, StepResult result) {\n        // In-process state update — no network round-trip\n        WorkflowState state = stateStore.get(step.workflowId);\n        state.markStepComplete(step.id, result);\n        \n        // Evaluate ready steps locally — no external service dependency\n        List<Step> readySteps = state.getReadySteps();\n        \n        // Dispatch ready steps to execution queue\n        readySteps.forEach(this::dispatch);\n        \n        // Persist state change atomically\n        stateStore.save(state);\n    }\n}</code></pre>\n<blockquote>\n<p><strong>SURGICAL REWRITE: SCOPE IS A VIRTUE</strong></p>\n<p>The decision to rewrite only the flow engine — not the DAG engine, not the API layer, not the scheduling system — is what made the 100x improvement possible within a reasonable development timeline. <strong>A complete rewrite of Maestro would have taken years and carried enormous risk</strong>. A targeted rewrite of the bottleneck component took months and carried bounded risk. The prerequisite was precise understanding of where the overhead lived. Profiling and measurement before architectural decisions is not overhead — it's the work that makes targeted improvements possible.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Open-Source Beneficiaries</p><p>The 100x performance improvement was contributed to the open-source Maestro repository. Organizations that adopted Maestro after the original open-sourcing in July 2024 now benefit from an orchestration engine capable of sub-hourly scheduling at million-task-per-day scale. 
The compound value of open-sourcing battle-tested systems: community users get production-grade improvements as they're developed.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Netflix Product Evolution That Drove the Fix</p><p>Maestro's 100x improvement is a case study in how product evolution creates engineering requirements that didn't exist at system design time. When Maestro was designed in 2020, Netflix's primary workflow use cases were daily ETL pipelines and hourly ML training runs. By 2024–2025, Live, Ads, and Games had created sub-hourly and real-time data requirements. <strong>Workflow orchestrators that were designed for daily batch jobs don't automatically handle real-time event-driven workloads</strong> — the latency requirements are an order of magnitude different.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Keeping the DAG Engine: The Right Scope Decision</p><p>The DAG engine — the component that parses workflow definitions, evaluates dependencies, and determines which steps are ready to run — was not contributing to the flow engine overhead. Rewriting it alongside the flow engine would have added months of development time, introduced new bugs in a working component, and required re-validating all of Maestro's workflow semantics. <strong>Scope discipline — rewriting only what needs to be rewritten — is the engineering decision that made 100x improvement achievable in a reasonable timeline.</strong></p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE OPEN SOURCE TIMELINE</strong></p>\n<p>The 100x improvement was contributed to the open-source Maestro repository following its development. Since Maestro was open-sourced in July 2024, external users who adopted it benefit from a continuously improving orchestration platform — not a snapshot. 
<strong>The value of open-sourcing production systems compounds over time</strong> as improvements driven by internal Netflix requirements become available to the broader engineering community.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Maestro's architecture after the flow engine rewrite maintains the same three-layer structure: Workflow Engine (DAG state, dependency tracking), Step Runtime Workers (stateless executors), and Signal Service (event-driven triggers). The change is internal to the Workflow Engine layer: the flow engine that manages step state transitions was replaced with a purpose-built implementation. From the outside — from users defining workflows, from the Signal Service publishing events, from the Step Runtime Workers reporting completions — nothing changed. The optimization was architecturally invisible.</p>\n<h3>Maestro Before: Conductor-Based Flow Engine (Higher Overhead)</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-maestro-100x-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Maestro After: Purpose-Built Flow Engine (100x Faster)</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-maestro-100x-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>PROFILING BEFORE REWRITING</strong></p>\n<p>The 100x improvement was possible because the team could precisely identify the flow engine as the overhead source. This required <strong>detailed profiling of Maestro's step execution path</strong> — measuring where time was spent at each stage of a step state transition. Without this profiling work, a rewrite might have targeted the wrong component and produced minimal improvement. 
Measurement before optimization is not a platitude — it's the prerequisite for targeted, effective engineering.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The 1M+ Task/Day Scale Constraint</p><p>The new flow engine had to maintain support for Maestro's existing workload — 1M+ tasks per day, workflows with hundreds of thousands of steps, long-running daily ETL pipelines. The 100x improvement was not achieved by sacrificing existing workload support — it was achieved by <strong>removing overhead that wasn't serving existing workloads either</strong>. The new engine is faster at all scales, not just at sub-hourly scales. The improvement was architectural, not a tradeoff.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Performance Impact at Maestro's Scale</p><p>The 100x throughput improvement at Maestro's operating scale — 1M+ tasks per day — translates to significant concrete capacity. The same infrastructure can now support 100x more concurrent step executions, enabling Netflix to run sub-hourly workflows alongside existing daily ETL pipelines without requiring additional worker capacity. For a system already handling hundreds of thousands of workflows, the improvement effectively eliminates step-launch as a scaling bottleneck for the foreseeable future.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>The Maestro 100x story is about the intersection of product evolution, performance measurement, and surgical engineering. The lessons apply to any long-running production system that needs to serve new use cases it wasn't designed for.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Measure before you rewrite.</strong> The Maestro team knew exactly which component to rewrite because they had profiled the execution path and located the overhead precisely. A rewrite without measurement is a guess. A rewrite with measurement is a targeted intervention. 
The profiling work is not overhead — it's the work that makes targeted improvements possible.</li><li><span>02</span><div><em>Surgical rewrites</em> (replacing only the specific component causing a performance problem, while preserving all surrounding components) have lower risk and faster delivery than complete rewrites. The flow engine was replaced; the DAG engine was kept. This scoping decision is why the improvement was achievable in months rather than years.</li><li><span>03</span><div><strong>Performance requirements change as products evolve.</strong> Maestro was correctly designed for daily ETL workloads in 2020. Netflix's expansion into Live, Ads, and Games in 2024–2025 created sub-hourly requirements that didn't exist at design time. Build systems that are measurable and targetable for performance improvement as requirements evolve.</li><li><span>04</span><div>General-purpose frameworks have overhead that purpose-built implementations don't. <strong>Use general-purpose frameworks when their full feature set is needed; build purpose-built when it isn't.</strong> Conductor was the right choice when Maestro was designed — it provided reliable state management quickly. The rewrite was right when the overhead became the bottleneck — the team had the data to make that call.</li><li><span>05</span><div><strong>Architectural improvements that remove external dependencies improve both performance and reliability simultaneously.</strong> The new flow engine is faster because it has no external service round-trips. It's also more reliable because it has fewer failure modes — no external service to go down, no network partition to handle in the hot path.</li></ol>\n<blockquote>\n<p><strong>PERFORMANCE OBSERVABILITY AS DESIGN REQUIREMENT</strong></p>\n<p>The Maestro overhead existed for 2.5 years before it became visible. 
<strong>If per-step launch latency had been a tracked metric from day one</strong>, the overhead would have been visible from the beginning — even if it hadn't mattered yet. Building systems with detailed performance instrumentation from the start means bottlenecks are discovered via monitoring rather than via new use cases hitting walls. Performance observability is a first-class design requirement, not an afterthought.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Temporal Consideration</p><p>The Netflix team explicitly evaluated Temporal before deciding to build a custom flow engine. Their conclusion: Temporal's value proposition is in managing long-running, durably-persisted workflows with complex retry and compensation logic — a use case that requires coupling the execution engine to an external orchestration service. Maestro's lightweight state transition needs don't justify that coupling. Choosing not to adopt a popular framework when its overhead exceeds its benefit is an engineering decision, not a gap.</p>\n</blockquote>\n\n<blockquote><p>Netflix's workflow orchestrator ran 2.5 years without anyone noticing a 100x performance improvement was available — which is either a compliment to how well Maestro worked or a reminder that daily ETL jobs don't complain about latency.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/netflix-maestro-100x-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.780609+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Performance", "Netflix"]}, {"id": 
"https://techlogstack.com/explore/netflix-container-cpu-scaling-2025/", "url": "https://techlogstack.com/explore/netflix-container-cpu-scaling-2025/", "title": "Netflix's Containers Were Fighting Their Own CPUs — and Losing", "summary": "How Netflix's 'Mount Mayhem' investigation discovered that container scheduling on modern multi-core CPUs was causing cache locality violations that degraded perform", "content_html": "<p><strong>Netflix</strong> · Performance · 17 May 2026</p>\n<p>Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Modern CPUs are not the flat, uniform compute resources that operating system schedulers often treat them as. A 96-core server CPU is actually a hierarchical structure: multiple <strong>physical processor dies</strong>, each containing cores that share L3 cache, connected to memory via <em>NUMA</em> (Non-Uniform Memory Access — a CPU architecture where memory access speed depends on the physical proximity of the memory module to the CPU core accessing it, with local memory being faster than remote memory) domains. Two cores on the same die can share cache and access memory very quickly. Two cores on different dies, or different NUMA nodes, pay a penalty for cross-die communication and remote memory access. 
When the Linux kernel schedules container workloads across these cores, it doesn't always respect these hardware boundaries — and at Netflix's scale, the performance consequences are measurable and significant.</p>\n<blockquote>\n<p><strong>🔬</strong></p>\n<p>Netflix's investigation, internally called <strong>Mount Mayhem</strong>, was triggered by observing that certain container workloads performed below their expected throughput under production load — despite having sufficient CPU cores allocated. The gap between expected and observed performance pointed to a hardware-level efficiency issue rather than an application-level bug.</p>\n</blockquote>\n\n<p>Netflix runs its container workloads on <em>Titus</em> (Netflix's in-house container orchestration platform built on top of Apache Mesos, used to run containerized workloads across Netflix's AWS fleet). Titus allocates CPU cores to containers and manages scheduling across Netflix's fleet of EC2 instances. The Mount Mayhem investigation revealed that even when a container had adequate CPU allocation, the <em>placement</em> of those CPUs across the host machine's <em>CPU topology</em> (the physical and logical organization of CPU cores, cache hierarchies, and memory domains on a multi-socket or multi-die server) could dramatically affect performance. A workload allocated 4 cores on a single NUMA node performs very differently from the same workload allocated 4 cores spread across multiple NUMA nodes — even if the raw core count is identical.</p>\n<blockquote>\n<p><strong>THE CPU TOPOLOGY PROBLEM</strong></p>\n<p>Modern high-core-count CPUs (64, 96, 128+ cores) achieve their core counts by connecting multiple <strong>physical dies</strong> on a package. Each die has its own L3 cache. Cores on the same die share their L3 cache efficiently. Cores on different dies must communicate through slower inter-die interconnects. 
When an application's threads are scheduled on cores from different dies, they can't efficiently share cached data — triggering expensive cross-die or cross-NUMA memory accesses. The <strong>CPU is physically working against itself</strong>.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>Container Throughput Below Hardware Expectation</h4>\n<p>Certain Netflix container workloads showed throughput below what the allocated CPU resources should deliver. Benchmarks on isolated machines showed good performance; production deployments on shared multi-tenant hosts showed degraded performance. The gap suggested a scheduling or placement issue rather than an application bug.</p>\n<hr />\n<h3>Cause</h3>\n<h4>CPU Scheduling Across NUMA and Die Boundaries</h4>\n<p>The Linux kernel's CFS scheduler allocates cores without necessarily respecting CPU die and NUMA boundaries. A container allocated 4 cores might receive cores from 2 different physical dies, or 2 different NUMA nodes. Cross-die and cross-NUMA communication is significantly slower than intra-die communication, degrading cache efficiency and increasing memory access latency.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Topology-Aware Scheduling in Titus</h4>\n<p>Netflix experimented with CPU pinning and topology-aware scheduling in Titus — allocating cores to containers from the same physical die and NUMA node wherever possible. This kept container workloads' cache working sets on a single die's L3 cache, eliminating cross-die cache misses.</p>\n<hr />\n<h3>Result</h3>\n<h4>Measurable Throughput Improvement for Topology-Sensitive Workloads</h4>\n<p>Topology-aware scheduling produced measurable throughput improvements for workloads that were sensitive to CPU cache locality — particularly high-throughput, latency-sensitive services. 
The investigation produced insights applicable across Netflix's Titus fleet and contributed to best practices for container scheduling on modern multi-die CPUs.</p>\n<hr />\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Why Benchmarks Don't Catch This</p><p>CPU topology issues are particularly insidious because standard benchmarks often miss them. A benchmark running on a dedicated machine with <strong>exclusive CPU allocation</strong> naturally gets cores from the same die, because there's no competing workload to fragment the allocation. A production multi-tenant host with many containers running simultaneously may fragment core allocations across dies, exposing the topology sensitivity that benchmarks never saw. Performance testing on dedicated hardware systematically underestimates topology-related performance risks in multi-tenant production.</p>\n</blockquote>\n\n<p>The <em>L3 cache</em> (the largest on-chip cache memory, shared among all cores within a physical die — typically 32-128 MB depending on the CPU — used to reduce expensive main memory accesses by keeping frequently-used data close to the processors) is the critical resource in this story. When threads on different cores access the same data, the L3 cache is where they can find it without going to main memory. But L3 caches are per-die — cores on different physical dies don't share an L3 cache. If a container's threads are split across dies, their shared data must travel through <em>inter-die interconnects</em> (the high-speed but slower-than-L3-cache links connecting multiple physical dies on a modern CPU package, for example AMD's Infinity Fabric), which are significantly slower than intra-die L3 cache access. 
The result: higher cache miss rates, more main memory accesses, and lower effective throughput — even with the same number of CPU cores available.</p>\n<blockquote>\n<p><strong>🏔️</strong></p>\n<p>Why 'Mount Mayhem'?</p><p>Netflix has a tradition of giving investigation projects evocative internal names. Mount Mayhem captured the sense of <strong>scaling up to a problem that turned out to be architecturally complex</strong> — the investigation started as a performance anomaly and revealed deep hardware-software interaction dynamics that required expertise in CPU microarchitecture, Linux kernel scheduling, and distributed systems to fully understand.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Multi-Tenant Host Complexity</p><p>Netflix's Titus platform runs many containers on each host simultaneously. In a multi-tenant environment, <strong>CPU allocation decisions for one container affect the topology available for all other containers on the same host</strong>. If Container A is allocated cores 0-3 (all on Die 1), Container B must take cores from Die 2, creating NUMA pressure for B. Topology-aware scheduling must be global — considering all containers on the host simultaneously — not local to each individual container allocation.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Titus Platform Context</p><p>Netflix Titus is the company's internal container orchestration platform, built on Apache Mesos and managing containerized workloads across Netflix's entire AWS fleet. Titus handles resource allocation, scheduling, and lifecycle management for hundreds of thousands of containers running Netflix's services — from the API layer to encoding pipelines to data processing. 
The Mount Mayhem investigation happened at the Titus scheduling layer: improving how Titus allocates CPU resources to containers on physical hosts.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>❌</strong></p>\n<p>The Multi-Tenant Fragmentation Effect</p><p>In a dedicated single-tenant host, a container requesting 4 cores naturally gets them from a single die because there's no competition for cores. In Netflix's multi-tenant production environment, <strong>multiple containers compete for cores on the same host simultaneously</strong>. The first container might claim cores 0-3 (Die 1). The second claims 4-7 (still Die 1). The third gets 24-27 — which are on Die 2, forcing cross-die allocation. Multi-tenancy makes topology fragmentation an emergent property that doesn't appear in single-container benchmarks.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>HARDWARE TOPOLOGY DOCUMENTATION GAP</strong></p>\n<p>One finding from the Mount Mayhem investigation: <strong>detailed CPU topology information is not always easily accessible within cloud VMs</strong>. AWS EC2 instances expose varying levels of hardware topology information depending on instance type and generation. Netflix's Titus integration reads available topology data (via /proc/cpuinfo, dmidecode, and lscpu) and builds a best-effort topology model — but for some instance types, the model is incomplete. This is an industry gap that cloud providers have been slowly addressing as multi-die CPUs become the norm.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>Topology-Aware Scheduling in Netflix Titus</h3>\n<p>The fix for CPU topology violations was implemented at the container scheduler level in Titus. 
Rather than treating all cores as equivalent, the topology-aware scheduler models the host's CPU topology — which cores belong to which die, which dies share L3 cache, which cores are in which NUMA node — and uses this model to make allocation decisions that minimize cross-die and cross-NUMA placements. For latency-sensitive services, this means preferentially allocating cores from the same die. For batch workloads with lower latency sensitivity, the scheduler can be more flexible with topology to improve overall utilization.</p>\n<ul>\n<li><strong>Die-local</strong> — Core allocation strategy for topology-sensitive workloads — placing all container threads on cores from the same physical die to maximize L3 cache sharing</li>\n<li><strong>NUMA-aware</strong> — Memory allocation strategy paired with CPU pinning — ensuring memory allocated by a container is on the NUMA node closest to the container's CPU cores</li>\n<li><strong>Measurable</strong> — Throughput improvement from topology-aware scheduling for cache-sensitive workloads — the improvement varies by workload type but is consistent for high-throughput services</li>\n<li><strong>Per-workload</strong> — Scheduling policy applied based on workload characteristics — not all workloads benefit equally from topology-aware scheduling; policy is tuned per service type</li>\n</ul>\n\n<pre><code class=\"language-python\"># Simplified topology-aware CPU allocation logic\n# Real implementation in Netflix Titus uses Go and integrates with Linux cgroups\n\nclass TopologyAwareCPUAllocator:\n    def __init__(self, host_topology: CPUTopology):\n        # host_topology describes: dies, cores per die, NUMA nodes, L3 cache sizes\n        self.topology = host_topology\n    \n    def allocate_cores(\n        self,\n        num_cores: int,\n        workload_type: WorkloadType\n    ) -> list[int]:\n        if workload_type == WorkloadType.LATENCY_SENSITIVE:\n            # Prefer cores from a single die — maximize L3 cache sharing\n  
          return self._allocate_from_single_die(num_cores)\n        \n        elif workload_type == WorkloadType.BATCH:\n            # Allow cross-die allocation — prioritize utilization over locality\n            return self._allocate_for_utilization(num_cores)\n        \n        else:\n            # Default: try single die, fall back to NUMA-aware cross-die\n            cores = self._allocate_from_single_die(num_cores)\n            if cores is None:\n                cores = self._allocate_numa_aware(num_cores)\n            return cores\n    \n    def _allocate_from_single_die(self, num_cores: int) -> list[int] | None:\n        for die in self.topology.dies:\n            available = die.available_cores()\n            if len(available) >= num_cores:\n                return available[:num_cores]  # all from same die\n        return None  # no single die has enough free cores\n    \n    def _allocate_numa_aware(self, num_cores: int) -> list[int]:\n        # Cross-die but NUMA-aware: prefer cores on same NUMA node\n        # Avoids the worst penalty (cross-NUMA) when cross-die is unavoidable\n        return self.topology.allocate_numa_local(num_cores)</code></pre>\n<blockquote>\n<p><strong>THE WORKLOAD SENSITIVITY DIMENSION</strong></p>\n<p>Not all workloads benefit equally from topology-aware scheduling. <strong>High-throughput, low-latency services</strong> (API servers, caching layers, media serving) benefit significantly because they have large working sets that benefit from shared L3 cache. <strong>Batch processing workloads</strong> with large independent tasks often don't benefit — their working sets are too large for L3 cache regardless of core placement. 
Netflix's Titus scheduler applies topology-aware policies selectively based on workload characteristics rather than universally.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>NUMA Memory Allocation Paired with CPU Pinning</p><p>CPU topology-aware scheduling alone is insufficient if memory allocation doesn't follow the same locality rules. When container threads run on Die 1 cores but their memory is allocated on the NUMA node associated with Die 2, memory accesses still pay the NUMA penalty. <strong>Effective topology-aware scheduling requires paired NUMA-aware memory allocation</strong> — using Linux mechanisms such as the <code>numactl</code> utility or the <code>mbind()</code> system call to bind container memory allocations to the NUMA node local to the allocated CPU cores.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>Implications for EC2 Instance Selection</p><p>The Mount Mayhem investigation also produced insights for EC2 instance type selection. Different instance families have different CPU topologies — some have multiple physical dies with separate L3 caches, others have a single die with shared L3. For topology-sensitive workloads, single-die instances provide predictable cache locality without scheduling complexity. This insight influences Netflix's instance type selection for specific workload classes — the optimal instance for a topology-sensitive service may not be the one with the highest raw core count.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>LLC Affinity: The Technical Mechanism</p><p>The Linux kernel provides CPU affinity controls via cgroups cpuset — the mechanism that restricts which cores a container can use. Netflix's Titus scheduler, when making topology-aware placements, sets the cgroup cpuset to cores on a single die and <strong>does not allow the Linux CFS to migrate threads beyond that cpuset</strong>. 
The cpuset becomes the enforcement mechanism: the container's threads can only run on the assigned die, guaranteeing L3 cache locality for the container's lifetime on that host.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Utilization Trade-Off</p><p>Topology-aware scheduling improves per-container performance but can reduce overall host utilization. If Die 1 has 12 cores and only 3 are free, a topology-aware scheduler won't place a 4-core container there even though cores are available, because it needs all 4 cores to be die-local. This <strong>leaves cores underutilized to preserve topology quality</strong>. Netflix tunes this trade-off per workload class: topology-sensitive services get strict die-local allocation; batch workloads accept cross-die placement in exchange for higher host utilization.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Understanding the CPU topology problem requires a brief grounding in modern CPU physical architecture. A 96-core server CPU is not 96 independent identical cores — it's typically 4–8 physical dies, each with 12–24 cores sharing an L3 cache. The dies are connected via a high-speed but non-L3 interconnect. When the operating system schedules threads, it sees 96 logical CPUs and can place threads anywhere. 
The hardware, however, provides very different performance depending on whether threads land on the same die or different dies.</p>\n<h3>Modern CPU Physical Topology: Two Dies, Four NUMA Nodes</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-container-cpu-scaling-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Container Allocation: Topology-Unaware vs Topology-Aware</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-container-cpu-scaling-2025/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>LINUX SCHEDULER VS HARDWARE REALITY</strong></p>\n<p>The Linux Completely Fair Scheduler (CFS) was designed for a world where CPU cores were equivalent. On modern multi-die CPUs, they are not. CFS's load balancing can migrate threads between cores to equalize load — inadvertently crossing die boundaries and destroying cache locality in the process. Netflix's topology-aware scheduling in Titus works <strong>above the CFS level</strong>, making initial placement decisions that minimize cross-die allocations, and using CPU affinity pinning to prevent CFS from subsequently migrating threads across die boundaries.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Cloud Provider Dimension</p><p>AWS EC2 instance types vary in their physical CPU topology. Some instances expose underlying hardware topology via DMI/SMBIOS data that can be read from within a VM. Netflix's Titus integration reads this topology information to make informed allocation decisions. <strong>Not all instance types expose this information reliably</strong>, which constrains topology-aware scheduling to instance families where the physical topology is knowable. 
This is an area where cloud provider documentation and tooling have been improving as CPU core counts grow.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>💡</strong></p>\n<p>AMD EPYC: The Multi-Die Extreme Case</p><p>The CPU topology problem is particularly pronounced on <strong>AMD EPYC processors</strong> (widely used in AWS EC2 instances). Modern EPYC chips use AMD's chiplet design, with 8+ physical compute dies (CCDs) per CPU, each with 8 cores sharing 32 MB of L3 cache. A 64-core EPYC CPU has 8 separate L3 cache domains. Cross-CCD communication goes through the AMD Infinity Fabric — fast, but not as fast as intra-CCD L3 access. Netflix's Mount Mayhem work was partly motivated by the expanding use of EPYC-based instances, where the topology effects are most pronounced.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Mount Mayhem's lessons bridge hardware microarchitecture and distributed systems — a combination that's rare but important as core counts on modern servers grow into the hundreds.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Modern CPUs are not flat.</strong> 96 cores on one CPU are not 96 equivalent resources — they're organized in a hierarchy of dies, NUMA nodes, and L3 caches that dramatically affect performance depending on how workloads are placed. Container schedulers that treat all cores as equivalent leave performance on the table for topology-sensitive workloads.</li><li><span>02</span><div><em>NUMA</em> (Non-Uniform Memory Access — memory access speed depends on the physical proximity of the memory to the CPU core accessing it) effects compound with L3 cache locality effects. Topology-aware scheduling must address both: allocate cores from the same die, <strong>and</strong> allocate memory from the NUMA node local to those cores. 
Fixing CPU placement without fixing memory placement leaves half the penalty in place.</li><li><span>03</span><div><strong>Benchmarks on dedicated hardware systematically underestimate topology-sensitivity.</strong> A benchmark with exclusive machine access gets natural die locality. Production multi-tenant hosts fragment core allocations across dies under load. Always validate performance on production-equivalent multi-tenant hosts, not just isolated benchmark environments.</li><li><span>04</span><div>Topology-aware scheduling should be <strong>workload-aware, not universally applied</strong>. Latency-sensitive services with shared working sets benefit significantly from die-local core allocation. Batch workloads with large independent tasks often benefit less. Apply topology-aware policies where they deliver measurable improvements and use flexible allocation elsewhere to maximize host utilization.</li><li><span>05</span><div><strong>CPU affinity pinning prevents the Linux scheduler from undoing topology-aware placement.</strong> Without pinning, CFS load balancing can migrate threads across die boundaries to equalize load — destroying the cache locality that topology-aware placement achieved. Topology-aware scheduling requires affinity pinning to be durable under load.</li></ol>\n<blockquote>\n<p><strong>AS CORE COUNTS GROW, THIS MATTERS MORE</strong></p>\n<p>The CPU topology problem gets <strong>worse as server CPU core counts increase</strong>. A 16-core CPU might be a single die. A 96-core CPU is almost certainly multiple dies. A 192-core CPU will have even more complex topology. 
As cloud providers offer larger instance types and as hardware vendors continue scaling core counts via multi-die packaging, topology-aware scheduling becomes increasingly important for high-performance production workloads.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>The Encoding Pipeline Application</p><p>Netflix's video encoding pipelines are a particularly topology-sensitive workload. Encoding involves shared codec state across multiple threads working on different sections of the same video. When those threads are split across dies, the shared state must cross the inter-die interconnect on every access. <strong>Topology-aware scheduling for encoding workloads produced some of the most measurable improvements</strong> in the Mount Mayhem investigation — the shared-working-set nature of encoding makes it naturally sensitive to L3 cache locality.</p>\n</blockquote>\n\n<blockquote><p>Netflix's containers were assigned 4 CPU cores and got 4 CPU cores, but two of them had to talk across a hardware bus to share data with the other two, which is the infrastructure equivalent of having a meeting where half the attendees are on speakerphone from another room.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/netflix-container-cpu-scaling-2025/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.784830+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Performance", "Netflix"]}, {"id": "https://techlogstack.com/explore/netflix-live-operations-scale-2026/", "url": "https://techlogstack.com/explore/netflix-live-operations-scale-2026/", "title": "Netflix 
Streamed Live Sports for Millions — and the Hard Part Wasn't the Video", "summary": "How Netflix built the operations team, escalation paths, and real-time decision systems that let engineers respond to live event failures in seconds — the human infr", "content_html": "<p><strong>Netflix</strong> · Reliability · 17 May 2026</p>\n<p>When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.</p>\n<hr />\n<h2 id=\"the-story\">The Story</h2>\n<p>Netflix's engineering reputation is built on on-demand streaming — a fundamentally forgiving medium where buffering, retries, and eventual consistency all work in the viewer's favor. Live streaming is the opposite: a viewer watching a boxing match or an NFL game experiences a latency spike or buffering pause <strong>in real time, during the action</strong>, with no opportunity to replay or retry the moment they missed. The technical requirements are different; the operational requirements are even more different. Netflix's blog post on Live at Scale wasn't primarily about CDN architecture or encoding pipelines — it was about the <strong>human infrastructure</strong>: the operations team that makes live events work.</p>\n<blockquote>\n<p><strong>🏈</strong></p>\n<p>Netflix's first major live sports events were <strong>NFL Christmas Day games in 2023</strong> — two NFL games streamed simultaneously to tens of millions of viewers. 
The event was the highest-stakes live streaming test Netflix had ever run, and preparing for it required building operations infrastructure from scratch.</p>\n</blockquote>\n\n<p>On-demand streaming fails gracefully: buffering algorithms absorb network variability, ABR <em>Adaptive Bitrate</em> (a streaming technique that continuously selects the highest quality video bitrate the viewer's connection can support, switching between quality levels in real time based on available bandwidth) logic adjusts quality to available bandwidth, and CDN edge nodes cache content to reduce origin load. Live streaming has much less room for graceful degradation: <strong>buffering during a touchdown is visible failure</strong>, and the viewer doesn't get a second chance at that moment. The engineering systems that make live streaming reliable are well-understood — origin redundancy, edge caching, latency optimization. But the operational systems — who decides what when something goes wrong, how fast can a decision be made, who has authority to take down a feature if it's affecting stability — were built from scratch for Netflix's live events.</p>\n<blockquote>\n<p><strong>THE HUMAN OPERATIONS PROBLEM</strong></p>\n<p>Netflix's blog identified a simple but profound insight: <strong>at live streaming scale, automated systems can detect problems but humans must make many critical decisions</strong>. Should the CDN fall back to a lower-quality tier? Should a region be taken offline to protect capacity elsewhere? Should a feature be disabled to reduce processing overhead? These decisions require human judgment, access to real-time data, and clear authority chains — and they must be made in seconds, not minutes.</p>\n</blockquote>\n\n<h3>Problem</h3>\n<h4>On-Demand Operations Models Don't Scale to Live</h4>\n<p>Netflix's existing operations model was designed for a world where failures could be investigated over minutes or hours. 
A degraded CDN node could be taken offline after an on-call engineer investigated the issue. For live events, the same decision needed to happen in 10-20 seconds — before viewers noticed the degradation. The operations model needed a complete redesign.</p>\n<hr />\n<h3>Cause</h3>\n<h4>Live Events Have Zero Recovery Time</h4>\n<p>Unlike on-demand content where a retry or rebuffer is acceptable, live streaming requires decisions to be made <strong>before the viewer impact is visible</strong>. The detection → decision → action loop must complete in seconds. Existing on-call processes, which allowed investigation before action, were too slow for live event operations.</p>\n<hr />\n<h3>Solution</h3>\n<h4>Live Operations Center: Command Structure for Speed</h4>\n<p>Netflix built a dedicated Live Operations Center (LOC) structure for major events — a command center with real-time data displays, predefined decision authorities, escalation paths, and runbooks that reduced decision time from minutes to seconds. The LOC operates during every major live event with specific roles assigned for each decision type.</p>\n<hr />\n<h3>Result</h3>\n<h4>NFL Christmas Day Streamed Without Major Incidents</h4>\n<p>Netflix successfully streamed NFL Christmas Day 2023 to tens of millions of simultaneous viewers — one of the highest-concurrency streaming events in Netflix's history. The operational infrastructure built for it became the template for subsequent live events including boxing, comedy specials, and additional NFL games.</p>\n<hr />\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Predefined Authority: Who Decides What</p><p>One of the most critical components of the Live Operations Center model is <strong>predefined decision authority</strong>. 
Before each live event, specific engineers are assigned specific decisions they're authorized to make without approval — the CDN engineer can take a region offline, the encoding engineer can drop a quality tier, the product engineer can disable specific features. This pre-assignment eliminates the escalation delay that would otherwise occur when a novel incident requires a decision that nobody is sure they're authorized to make.</p>\n</blockquote>\n\n<p>Pre-event <em>game days</em> (structured rehearsals of incident scenarios conducted before a major event, where the operations team practices executing runbooks, making decisions, and coordinating across teams under simulated pressure) were a critical part of Netflix's live event preparation. The team ran game day exercises that simulated specific failure scenarios — a CDN region becoming unreachable, an encoding node failing, a sudden traffic spike above capacity — and practiced the complete response cycle: detection, decision, action, verification. Game days served two purposes: they validated that the runbooks were correct, and they gave the operations team practice making fast decisions under pressure in a consequence-free environment.</p>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Feature Flag Granularity for Live Events</p><p>Netflix built a <strong>fine-grained feature flag system</strong> specifically for live events — one where individual features could be disabled or degraded in seconds, with precise scope control. A feature flag that takes 10 minutes to propagate globally is useless when a live event is failing in real time. 
The live event feature flag system was designed to propagate changes in under 30 seconds globally, enabling operators to quickly disable features that were contributing to stability issues.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>📡</strong></p>\n<p>Real-Time Data Displays: The LOC Dashboard</p><p>The Live Operations Center required a custom real-time dashboard — one that aggregated CDN health, playback success rates, encoding health, and geographic performance data with <strong>sub-second refresh rates</strong>. Standard monitoring dashboards with 1-minute aggregation windows are insufficient when a live event decision needs to be made in 10 seconds. Netflix built custom LOC dashboards that showed the data operators needed at the latency that live events required.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Difference from On-Demand Operations</p><p>Netflix's on-demand operations model assumes failures can be investigated before action: an on-call engineer is paged, investigates the issue over 5-15 minutes, and makes a decision. <strong>This model is incompatible with live events.</strong> By the time an on-call engineer has investigated a CDN region degradation during an NFL game, millions of viewers have experienced 5-15 minutes of buffering. The LOC model inverts this: action is taken first (disable the degraded region), investigation follows. The predefined authority structure makes this possible without the chaos of unauthorized action.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Live vs On-Demand: The Buffer Factor</p><p>On-demand streaming buffers 10-30 seconds of video ahead of playback. A 2-second CDN hiccup is invisible — the buffer absorbs it. Live streaming buffers 3-8 seconds ahead of playback (to allow some smoothing) but cannot buffer further ahead because there's no future content yet. 
<strong>A 4-second CDN hiccup in live streaming visibly impacts viewers</strong>; the same hiccup in on-demand streaming is completely invisible. This fundamental difference in buffering physics explains why live streaming requires different operational response times.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE SIMULTANEOUS CONCURRENCY CHALLENGE</strong></p>\n<p>The hardest technical challenge in live event operations is not average load — it's the <strong>instantaneous concurrency spike at event start</strong>. When an NFL game kicks off, tens of millions of viewers hit play within minutes of each other. This creates a simultaneous connection establishment wave that CDN and origin infrastructure must absorb without degradation. Netflix's capacity pre-provisioning — sizing for peak concurrency rather than average concurrency — was one of the key operational investments before the first NFL game.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"the-fix\">The Fix</h2>\n<h3>The Live Operations Center: Roles, Runbooks, and Real-Time Decisions</h3>\n<p>The Live Operations Center structure was Netflix's key operational innovation for live events. Rather than relying on an on-call rotation to respond reactively, the LOC is a <strong>proactive operations command</strong> staffed during live events with engineers who have predefined authority, real-time data access, and practiced runbooks. 
The LOC staff are not passive monitors — they're empowered to make impactful decisions (taking regions offline, disabling features, changing CDN routing) within defined boundaries and without requiring approval escalations.</p>\n<ul>\n<li><strong>Seconds</strong> — Target decision time in the Live Operations Center — versus minutes for standard on-call incident response, the time constraint that drove the LOC's predefined authority model</li>\n<li><strong><30s</strong> — Feature flag propagation time for live event flags — enabling operators to disable destabilizing features before viewers notice the impact</li>\n<li><strong>Predefined</strong> — Authority assignments before each event — every LOC role has specific decisions they can make without escalation, eliminating the approval latency that kills live event response times</li>\n<li><strong>Game days</strong> — Pre-event rehearsals that practice the detection → decision → action cycle in simulated failure scenarios — building muscle memory for operations under pressure</li>\n</ul>\n\n<pre><code># Conceptual LOC runbook structure for a CDN region health issue\n# Real runbooks are more detailed and include specific tool commands\n\nloc_runbook:\n  title: \"CDN Region Health Degradation\"\n  trigger: \"Playback success rate < 95% in a geographic region\"\n  severity: P1  # live event is actively degraded\n  \n  immediate_actions:  # within 30 seconds\n    - owner: cdn_operator  # predefined authority\n      action: \"Check CDN region health dashboard\"\n      decision: \"If error rate > 5%: disable region from CDN rotation\"\n      tool: \"cdn-control --region {region} --disable\"\n      propagation_time: \"< 15 seconds globally\"\n  \n  parallel_actions:  # simultaneously  \n    - owner: traffic_operator\n      action: \"Verify origin capacity for increased load from CDN failover\"\n    - owner: encoding_operator\n      action: \"Confirm encoding pipeline health\"\n  \n  verification:  # within 60 seconds of 
action\n    - check: \"Playback success rate recovering in affected region\"\n    - check: \"Adjacent CDN regions absorbing traffic without degradation\"\n  \n  escalation:  # if not resolved in 2 minutes\n    to: LOC_director\n    with: \"CDN region {region} removed from rotation, impact ongoing\"</code></pre>\n<blockquote>\n<p><strong>POST-EVENT REVIEWS: THE LEARNING SYSTEM</strong></p>\n<p>Netflix conducts structured post-event reviews after every major live event — whether or not there were incidents. The review covers: what went well operationally, what decisions were made and whether they were the right ones, what runbooks were unclear or incorrect under pressure, and what monitoring gaps were discovered. <strong>Post-event reviews treat each live event as a learning opportunity regardless of outcome</strong>. Over time, this produces increasingly reliable operational runbooks, better-calibrated decision thresholds, and a team that gets better at live event operations with every event.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The NFL Christmas Day Template</p><p>The NFL Christmas Day 2023 LOC structure became the template for all subsequent Netflix live events. The roles, runbooks, dashboard configuration, feature flag system, and game day process were all documented and reused for subsequent boxing events, comedy specials, and additional NFL games. The first event built the infrastructure; subsequent events improved it.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Capacity Pre-Provisioning</p><p>Unlike on-demand streaming where traffic grows gradually and can be served from cached content, live events create <strong>instantaneous global concurrency spikes</strong>. Tens of millions of viewers join within the first few minutes of kickoff. 
Netflix's capacity provisioning for live events required pre-positioning encoding capacity, CDN edge capacity, and origin capacity well in advance of the event start — you can't autoscale fast enough to serve a simultaneous ramp of 10 million viewers.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>The Runbook Quality Problem</p><p>Runbooks written by engineers who understand the system often assume knowledge that operators executing them under pressure don't have. Netflix's game day exercises revealed multiple runbooks that were unclear under time pressure — steps that assumed familiarity with a tool's interface, thresholds that weren't specified precisely, or escalation conditions that were ambiguous. <strong>Game day runbook reviews led to a round of rewrites</strong> before the NFL game, yielding runbooks that could be executed correctly by anyone in the LOC without expert system knowledge.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>THE GLOBAL CAPACITY MODEL</strong></p>\n<p>Unlike on-demand content (which can be served from cache), live streaming requires real-time encoding at origin and delivery capacity sized for every concurrent viewer. Netflix built a global capacity model for live events: estimating concurrent viewership by geographic region, calculating encoding and CDN capacity requirements per region, and pre-provisioning that capacity before the event. <strong>Live events cannot rely on autoscaling</strong> — the ramp from 0 to 10 million concurrent viewers in 5 minutes is faster than any autoscaling system can respond.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"architecture\">Architecture</h2>\n<p>Netflix's live streaming architecture builds on its existing on-demand CDN infrastructure but adds live-specific components. The encoding pipeline produces a live stream rather than a pre-encoded library asset. CDN edge nodes cache live segments rather than VOD content. 
But the most significant architectural changes are at the operational layer — the systems that support human operators making fast decisions during live events.</p>\n<h3>Live Operations Center Command Structure</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-live-operations-scale-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<h3>Live Streaming Architecture: Technical Stack</h3>\n<p><a href=\"https://techlogstack.com/explore/netflix-live-operations-scale-2026/#architecture\">View interactive diagram on TechLogStack →</a></p>\n<p><em>Interactive diagram available on TechLogStack (link above).</em></p>\n\n<blockquote>\n<p><strong>THE TECHNICAL-HUMAN INTEGRATION POINT</strong></p>\n<p>The most sophisticated part of Netflix's live operations architecture is not the CDN or the encoder — it's the <strong>interface between automated detection systems and human decision-makers</strong>. Automated systems can detect that a CDN region's error rate has crossed a threshold in 1-2 seconds. Translating that detection into a human decision (disable the region) in another 5-10 seconds requires: alerting that reaches the right person immediately, a dashboard that shows the right context instantly, a runbook that specifies the right action clearly, and authority that doesn't require approval. Building all four simultaneously is the live operations infrastructure challenge.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>ℹ️</strong></p>\n<p>Geographic Failover Decision Logic</p><p>One of the most complex LOC decisions is geographic failover: when a CDN region degrades, should traffic from that region be rerouted to another CDN region (adding latency for affected viewers) or should the affected viewers receive degraded quality (lower bitrate, more buffering) rather than higher latency? 
This decision requires knowing: what's the latency cost of rerouting to the nearest alternative region, what's the quality cost of degraded serving, and which outcome is better for viewer experience? Netflix pre-computed this decision matrix for major regions before each event, so LOC operators could execute a decision in seconds rather than computing it under pressure.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>✅</strong></p>\n<p>The Reusable LOC Template</p><p>After the NFL Christmas Day 2023 events, Netflix documented the LOC structure as a reusable template: role definitions, decision authority matrices, dashboard configurations, runbook formats, game day scenarios, and post-event review structure. Every subsequent major live event started from this template and refined it based on what was learned. The investment in the first LOC paid compounding dividends across all subsequent live events — each one faster to staff and better equipped than the last.</p>\n</blockquote>\n\n<hr />\n<h2 id=\"lessons\">Lessons</h2>\n<p>Netflix's Live at Scale blog is one of the few engineering posts that focuses primarily on human systems rather than technical ones. The lessons here are about organizational design under real-time pressure.</p>\n<div role=\"region\"><p>What to remember</p><ol><li><span>01</span><div><strong>Live events require operational infrastructure as much as technical infrastructure.</strong> The CDN, encoder, and ABR algorithm are prerequisites. But without predefined decision authority, real-time dashboards, fast feature flags, and practiced runbooks, technical infrastructure excellence doesn't translate to operational excellence during live failures.</li><li><span>02</span><div><em>Predefined authority</em> (assigning specific engineers specific decisions they can make without approval before an event, to eliminate escalation latency during incidents) eliminates the approval overhead that kills live event response times. 
Pre-define who can make which decisions and under what conditions — before the event, not during it.</li><li><span>03</span><div><strong>Game days build operational muscle memory.</strong> Reading a runbook and executing one under pressure are different skills. Practice the complete detection → decision → action → verification cycle in simulated scenarios before live events. Operators who have done it once in practice make better decisions when they have to do it for real.</li><li><span>04</span><div>Feature flag propagation speed is an operational tool. <strong>A feature flag that takes 10 minutes to propagate is useless during a live event failure.</strong> Build live event feature flags with sub-30-second global propagation and verify that propagation speed in game days, not just in unit tests.</li><li><span>05</span><div>Post-event reviews are the learning system. <strong>Review every event, whether there were incidents or not.</strong> Smooth events reveal runbook clarity and decision threshold calibration. Incident events reveal operational gaps. Both are necessary inputs for getting better at live event operations over time.</li></ol>\n<blockquote>\n<p><strong>⚠️</strong></p>\n<p>Live Events Expose Monitoring Gaps</p><p>Netflix's first major live events revealed monitoring gaps that on-demand operations had never exposed. Playback success rate metrics that averaged over 1-minute windows were too slow to drive 10-second decisions. Regional health metrics that aggregated too broadly masked localized CDN failures. 
<strong>Live events require monitoring at finer granularity and shorter time windows than on-demand operations</strong> — and the gaps only become visible when a live event decision depends on data that doesn't exist at the required granularity.</p>\n</blockquote>\n\n<blockquote>\n<p><strong>BUILDING FOR THE FIRST TIME, SCALING FOR ALL TIME</strong></p>\n<p>Netflix's live operations infrastructure was built to serve the first NFL game in 2023, but it was designed to scale to all subsequent live events indefinitely. The LOC template, runbook library, game day framework, and feature flag system were built as <strong>reusable infrastructure rather than one-time solutions</strong>. The marginal cost of each subsequent live event decreased as the infrastructure matured. Build live event operations as a platform, not as a per-event scramble.</p>\n</blockquote>\n\n<blockquote><p>Netflix proved they could stream NFL football to millions of simultaneous viewers — and the lesson they published was about how to write good runbooks and pre-define who gets to press the big red button, which is simultaneously obvious and deeply underappreciated.<br /><cite>TechLogStack — built at scale, broken in public, rebuilt by engineers</cite></p></blockquote>\n\n<hr />\n<p><em>This case is a plain-English retelling of publicly available engineering material.</em></p>\n<p><strong><a href=\"https://techlogstack.com/explore/netflix-live-operations-scale-2026/\">Read the full case on TechLogStack →</a></strong> (interactive diagrams, source links, and the full reader experience).</p>", "date_published": "2026-05-17T00:00:00+00:00", "date_modified": "2026-06-13T18:53:04.788304+00:00", "authors": [{"name": "TechLogStack Editorial"}], "tags": ["Reliability", "Netflix"]}]}