⚡ Engineering War Stories from the Trenches

When Big Tech Breaks & How They Fix It

The postmortems they published. The outages they survived. The fixes that saved millions of users. Read the real story — not the press release.

∞ Things Broke · 100% No Fiction

Netflix Chaos Engineering ★ 5.0

Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose

It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.

July 19 2011 blog published · Business-hours instance killing · 10 Simian Army members

Hotstar Live Streaming ★ 5.0

When MS Dhoni Got Out: How Hotstar Survived 25 Million Concurrent Users

July 9th, 2019. India vs New Zealand, Cricket World Cup semi-final. MS Dhoni walks to the crease and 1.1 million new viewers join Hotstar every single minute. Then he gets run out — and 24 million people hit the back button almost simultaneously.

25.3M peak concurrent · 1.1M users/min growth · 5.7 Tbps bandwidth

GitHub Databases

How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query

MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.

1,200+ MySQL hosts upgraded · 300+ TB data migrated · 5.5M queries/sec maintained

Slack Reliability

Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work.

2021-06-30 AZ outage trigger · 1.5 years migration time · AZ drain in <5 minutes

Cloudflare Reliability

Cloudflare Fixed a React Security Vulnerability and Broke the Entire Network

In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. To do so, they needed to disable an internal testing tool with a global killswitch. The killswitch, unexpectedly, triggered a bug that sent HTTP 500 errors across Cloudflare's entire global network. This was the third major configuration-related global outage in two years.

Dec 2025 global outage · React CVE fix triggered outage · Global killswitch bug

LinkedIn Messaging

LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.

In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.

1B events/day at launch (2011) · 1T messages/day by 2015 · 7T messages/day by 2019

Google Performance

Google Built a Free Design Tool That Generates Production Code From a Sentence — Then Added Multiplayer

At Google I/O 2025, Sundar Pichai demoed a tool that turned a plain English description into a complete mobile UI in under 30 seconds. Figma charges $15 per editor per month for collaborative design. Google Stitch does it free. A year later, Google added real-time multiplayer, a streaming design agent, and voice input. The design industry noticed.

Launched I/O May 20 2025 · Galileo AI acquisition → Stitch rebrand · Gemini 2.5 Pro multimodal core

Google Distributed Systems

Google's Gemini Omni Is the First AI That Creates From Anything — Here Is What That Actually Means

For three years, Google built Gemini to be 'natively multimodal.' At I/O 2026, they finally demonstrated what that phrase means in practice. Gemini Omni takes a photo, an audio clip, a video, and a text description — all at once — and produces a new video that reflects all of them simultaneously. This is not four models chained together. It is one.

Announced I/O May 19 2026 · Any-to-any: text+image+audio+video → video · Flash release: 10s clips, Gemini app + YouTube Shorts

GitHub Distributed Systems

GitHub Built the Internet's Code Platform — Then AI Agents Broke It

Between May 2025 and April 2026, GitHub experienced 257 incidents — 48 of them major outages. That's roughly one significant disruption every single week. The culprit wasn't a security breach, a botched deployment, or a rogue engineer. It was the thing GitHub had spent years celebrating: AI. Specifically, agentic AI workflows that turned one human developer's footprint into hundreds of commits, thousands of CI minutes, and a dozen simultaneous PR operations — all at once, across millions of accounts. GitHub had been built for humans. Agents are not human.

257 incidents, May 2025 to April 2026 · 48 major outages, 112+ hours total downtime · 57 GitHub Actions outages in 12 months

OpenAI Reliability

OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes

On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability. Within 29 minutes, it had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable for over four hours. The engineers trying to fix it couldn't run kubectl. The control plane that manages clusters was down — and it was the only way back in.

3:16 PM → 7:38 PM PST (4h 22min) · ALL OpenAI services affected · Kubernetes control plane down in most large clusters

AWS Distributed Systems

A Race Condition in DynamoDB's DNS Took Down Snapchat, Fortnite, Ring, and Half the Internet for 15 Hours

It was 11:48 PM PDT on October 19, 2025. Two automation processes inside AWS's DynamoDB DNS management system were doing the same job simultaneously — one fast, one painfully slow. The slow one was just finishing up when the fast one, having already completed, triggered a cleanup job that deleted the slow one's work. In that moment, every DNS record for DynamoDB in the world's busiest cloud region vanished. Snapchat went dark for 375 million daily users. Fortnite lobbies dissolved mid-match. Ring cameras stopped recording. The UK's HMRC tax authority went offline. For 15 hours, the internet's largest database service had no address.

October 19–20, 2025: 15-hour outage in US-EAST-1 · Root cause: race condition between two DNS Enactor processes · DynamoDB offline for ~3 hours; EC2 cascade lasted 12+ more hours

Google Distributed Systems

Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse

On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control — the system that authorizes every single API request across Google Cloud. The code had a bug: it couldn't handle a null value. But the bug was invisible during deployment because it could only be triggered by a specific type of policy data that hadn't appeared yet. Two weeks later, on June 12, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and refused to restart without eating itself. Spotify went down. Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed the infrastructure — making the recovery worse than the crash.

June 12, 2025: 7+ hour outage across North America, Europe, Far East, Africa · Root cause: null pointer exception in Service Control binary from May 29 code change · No feature flag protection and no error handling on the new code path

Discord Databases

How Discord Migrated Trillions of Messages and Fired Their Garbage Collector

It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.

177 → 72 nodes · p99 latency 15ms (was 40–125ms) · 9-day migration (was 3-month est.)

Discord Reliability

Discord Killed the MacBook Dev Environment and Never Looked Back

Discord's engineering team had tripled in size and was drowning in a swamp of 'works on my machine' bugs — some engineers running macOS, some Ubuntu, all of them slowly. The solution was radical: no one gets a local dev environment anymore.

3x engineering org growth · Mac→V1→V2 (two migrations) · Began 2020, V2 done 2023

Netflix Distributed Systems

Netflix Hit the AWS Instance Ceiling and Built a Workflow Engine That Scales Forever

Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs — and running out of machine. Vertically scaling on AWS had a hard ceiling, and the workflows were doubling in size every year. The only way out was a complete architectural rethink.

2M+ jobs/day at peak · Hundreds of thousands of workflows · Meson → Maestro 2020

Slack Reliability

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

On February 22, 2022, Slack went down for many users — including the engineer designated as Incident Commander, who was authoring the postmortem from a position of personal experience. The culprit was a new component that worked exactly as designed.

Feb 22 2022 outage · Consul rollout to 75% of fleet · Cache hit rate collapsed

Stripe Databases

How Stripe Moves Petabytes Between Database Shards Without Stopping the Money

Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.

$1T+ payment volume 2023 · 99.999% uptime achieved · 5M database queries/sec

Slack Distributed Systems

Slack Rewrote Its Core Architecture for Enterprise — Because the Old One Was a Lie

Slack was built for teams in single workspaces. Enterprise customers were using it across dozens of workspaces simultaneously — and the architecture had never been designed for that. Every major enterprise feature was a workaround on top of a foundation that assumed one workspace per person. Slack spent two years rebuilding the foundation.

2 years development time · Workspace-centric → org-wide · Thousands of APIs refactored

Slack Reliability

Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. Eighteen months later, customer impact hours were down 90%.

73% incidents from own deploys · 90% reduction in impact hours · Manual → automatic rollbacks

Atlassian Reliability

How a Two-Line Script Silently Deleted 883 Customer Cloud Sites

At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.

883 sites deleted · 14 days max outage · 775 customers affected

Netflix Live Streaming

65 Million Streams: How Netflix Rebuilt Its Guts for Live

November 15, 2024: 65 million people log on to watch Mike Tyson fight Jake Paul, the largest live sports stream in history. Behind the scenes, Netflix engineers are white-knuckling a system they built from scratch — one where a single bad video segment, a CDN request storm, or a missed 2-second write deadline means millions of viewers see a black screen.

65M concurrent streams · 113ms → 25ms p50 latency · 200Gbps+ read throughput

Stripe Performance

Stripe Converted 3.7 Million Lines of JavaScript in One Pull Request on a Sunday

On Sunday, March 6, 2022, Stripe merged a single pull request that converted their entire largest JavaScript codebase from Flow to TypeScript. 3.7 million lines of code. Hundreds of engineers arrived Monday morning to start writing TypeScript. The migration had been invisible until it wasn't.

3.7M lines converted in 1 PR · Single Sunday deploy · Largest JS codebase at Stripe

GitHub Reliability

The Test That Broke GitHub: A Failover Drill Goes Live

June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their brand-new second Internet edge facility — six months of infrastructure work designed to eliminate a single point of failure. Within seconds, instead of validating their redundancy, they've created an outage that takes GitHub offline for millions of developers across North America and South America.

32-minute outage · 2-min detect-to-revert · US East + South America

Shopify Databases

Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened

The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding, but they had a Rails monolith that expected a single database and a commerce platform, used by millions daily, that couldn't take downtime.

KateSQL → Vitess migration · user_id as sharding key · VTGate transparent to app

Shopify Databases

Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second

During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.

19M MySQL QPS at BFCM peak · 58M requests/min app servers · 99.999%+ uptime maintained

Cloudflare Reliability

Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours

On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane — the system that lets customers configure DNS, firewall rules, and every Cloudflare service — went dark. It stayed dark, in various forms, for nearly 40 hours. The postmortem introduced a concept Cloudflare hadn't had before: Code Orange.

Nov 2–4 2023 outage · ~40 hours control plane down · Flexential datacenter failure

Cloudflare Reliability

A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic

On November 2, 2023 — the same day as the control plane datacenter failure — Cloudflare also experienced a separate six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.

Nov 2 2023 outage · 28% HTTP traffic impacted · 6 hours total duration

Netflix Performance

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead — not in crashes or alerts, but in slow step launches that nobody had measured. The fix was a complete engine rewrite that delivered 100x throughput improvement.

100x throughput improvement · 2.5 years before overhead visible · Sub-hourly scheduling trigger

Netflix Performance

Netflix's Containers Were Fighting Their Own CPUs — and Losing

Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'

Mount Mayhem investigation · Modern multi-core CPU topology · NUMA and cache locality

Netflix Reliability

Netflix Streamed Live Sports for Millions — and the Hard Part Wasn't the Video

When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.

Live events at Netflix scale · Sub-second escalation paths · NFL Christmas Day 2024

Figma Databases

Figma's Database Grew 100x in Four Years — Here's How a Small Team Kept It From Toppling

In 2020, Figma ran on a single Postgres instance on AWS's largest available machine. Four years later, that database had grown nearly 100x. Some tables had swelled to several terabytes and billions of rows. The Postgres vacuum process — the background job that keeps Postgres alive — was causing reliability incidents. They had months of runway left before hitting the IOPS ceiling. A small databases team had nine months to fix it.

100x DB growth since 2020 · Single instance → horizontal shards · 9-month migration

Datadog Reliability

Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.

24h+ global outage · $5M revenue loss · 50–60% Kubernetes nodes lost

OpenAI Databases

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.

800M users, 1 primary PG instance · ~50 read replicas globally · Millions of QPS, p99 <20ms

Uber Security

Uber Had 150,000 Secrets Scattered Across 25 Vaults — So They Built One Platform to Rule Them

150,000 secrets. 25 separate vaults. Hundreds of teams managing their own credentials in their own ways, some in plain text in version control. At Uber's scale — 5,000 microservices, 5,000 databases, 500,000 analytical jobs per day — secrets sprawl is not a compliance problem. It is an incident waiting to happen. A team of ten engineers decided to fix it.

150,000 secrets managed · 25 vaults → 6 managed vaults · 5,000 microservices secured

Shopify Reliability

The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between 'impressive demo' and 'product I'd trust with my customers' — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.

LLM judge: 0.02 → 0.61 Kappa · 300-example hand-crafted benchmark · Production mirroring closes gap in 2 weeks

IBM Distributed Systems

Quantum Computing Just Beat the Best Classical Computer — Here Is the Engineering That Made It Happen

On May 6, 2026, Q-CTRL ran a materials science simulation on an IBM quantum computer in 2 minutes. The best classical supercomputer needed over 100 hours to reach the same accuracy — and then gave up. The day before, IBM's quantum computers simulated a 12,635-atom protein with Cleveland Clinic and RIKEN, 40 times larger than anything attempted six months prior. After 30 years of promises, quantum advantage arrived. Here is what actually changed.

3,000× speedup over best classical (May 6 2026) · 12,635-atom protein simulated (May 5 2026) · 120 qubits, 10,000+ two-qubit gates

Spotify Reliability

Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters inside their Envoy Proxy perimeter. They applied it to all regions simultaneously. Within two minutes, every Envoy instance worldwide had crashed. And then the restart loop began — a loop Kubernetes itself was powering, killing each new server as fast as it came back up. 675 million users couldn't load the app. Asia Pacific stayed up, and the reason why told the engineers exactly what was broken.

12:18–15:45 UTC (3h 27min) · 675M MAU affected · 48,000+ Downdetector peak reports

Airbnb Databases

Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch

Airbnb's identity graph connects 7 billion nodes and 11 billion edges — every user, every device, every listing, every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. The third-party vendor powering it required periodic manual reboots to stay stable. Queries that needed 8 hops of graph traversal were hitting 5-second P99 latencies. In 2024, a small team rebuilt the entire thing internally. The results were not incremental.

7B nodes, 11B edges · 5M new edges/day · P99 read: 5.0s → 2.5s (-49%)

AWS · Distributed Systems

A Race Condition in DynamoDB's DNS Took Down Snapchat, Fortnite, Ring, and Half the Internet for 15 Hours

The Story

When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.

— Amazon Web Services — Official Post-Incident Summary, October 2025

DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that manage everything else. This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilizing everything that lost its footing when the ground disappeared.

To understand what happened, you need to understand how DynamoDB manages its DNS. At AWS's scale, DynamoDB maintains hundreds of thousands of DNS records to operate the massive fleet of load balancers that route traffic across each region. These records are updated constantly — as capacity is added, as hardware fails, as traffic is redistributed. AWS built a two-component system to manage this at scale: a DNS Planner and multiple DNS Enactors.

THE TWO-COMPONENT DNS ARCHITECTURE: PLANNER AND ENACTOR

The DNS management system had two independent components: The DNS Planner monitors load balancer health and capacity and periodically creates new DNS plans — essentially a specification of which load balancers should receive traffic and with what weight distribution. The DNS Enactors are the workers — multiple independent processes running across three different Availability Zones — that pick up the plans and apply them to Route53 (AWS's DNS service). Multiple Enactors running in parallel provide redundancy: if one Enactor fails, others continue. In theory.

Problem

Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb

DNS Enactor A began applying an older DNS plan but encountered unusual delays — it kept getting blocked trying to update records and moved painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: 'Is my plan newer than what's currently active?' At the time of that check, it was. But by the time Enactor A actually finished applying the plan, the world had moved on — newer plans had been created and applied. The staleness check was now stale itself.

Cause

The Race Condition Fires — Enactor B Wins, Then Cleans Up

While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.
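The interleaving described above is a check-then-act race: the staleness check and the write are separated in time, and nothing re-validates in between. The following is a minimal sketch of that sequence, assuming a toy model in which each plan has a generation number and a single object stands in for the live record set. The names `DnsState`, `passes_staleness_check`, `apply_plan`, `cleanup`, the `keep_last` threshold, and the `dynamodb.example` hostname are all illustrative assumptions, not AWS's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DnsState:
    """Toy stand-in for the live DNS record set (hypothetical model)."""
    active_generation: int = 0
    records: dict = field(default_factory=dict)  # endpoint -> list of IPs
    plans: dict = field(default_factory=dict)    # generation -> plan records


def passes_staleness_check(state, generation):
    # Runs once, at plan pickup time.
    return generation > state.active_generation


def apply_plan(state, generation):
    # Applies without re-checking: this is the gap the race exploits.
    state.records = dict(state.plans[generation])
    state.active_generation = generation


def cleanup(state, newest_applied, keep_last=2):
    # Delete plans that are much older than the one just applied.
    for gen in list(state.plans):
        if gen < newest_applied - keep_last:
            del state.plans[gen]
            if gen == state.active_generation:
                # The deleted plan was the live one, so its records go too.
                state.records = {}


def simulate():
    state = DnsState(
        plans={g: {"dynamodb.example": [f"10.0.0.{g}"]} for g in range(1, 11)}
    )
    # Enactor A picks up old plan 1 and checks it: passes, since 1 > 0.
    a_check_passed = passes_staleness_check(state, 1)
    # Enactor A stalls. Enactor B checks and applies the newer plan 10.
    assert passes_staleness_check(state, 10)
    apply_plan(state, 10)
    # Enactor A resumes, trusts its earlier check, and overwrites plan 10.
    if a_check_passed:
        apply_plan(state, 1)
    # Enactor B's cleanup now sees plan 1 as many generations old.
    cleanup(state, newest_applied=10)
    return state
```

Calling `simulate()` leaves `records` empty while `active_generation` still points at plan 1, which no longer exists: the toy equivalent of an endpoint with no address.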

Solution

11:48 PM PDT: Total DNS Blackout

At 11:48 PM PDT on October 19, every system trying to connect to DynamoDB in US-EAST-1 via the public endpoint received DNS failures. No IP address. No connection. The failure was immediate and total — not a degradation, not elevated error rates, but a complete inability to resolve the DynamoDB endpoint. Internal AWS services relying on DynamoDB for control-plane operations went down alongside customer traffic. EC2's Droplet Workflow Manager lost its ability to track instance lease state. IAM couldn't validate credentials. Lambda couldn't execute. The cascade was underway.

Result

15 Hours of Cascading Failure — and Manual Recovery

Engineers identified the DNS issue by 12:38 AM PDT and began temporary mitigations by 1:15 AM PDT. DynamoDB itself recovered by approximately 2:25 AM PDT, roughly three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager (DWFM) with a backlog of expired instance leases it couldn't process. The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilize. The fix for the automation itself was brutal in its simplicity: engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilize. Full recovery across all services wasn't complete until late afternoon on October 20, roughly 15 hours after the cascade began.

Ookla, the network intelligence company behind Speedtest.net, recorded over 17 million outage reports across more than 3,000 organizations during the incident. Independent measurements showed 20 to 30 percent of all internet-facing services experienced disruptions at peak impact. The US, UK, and Germany were hit hardest.

What Actually Went Dark

The list of affected services illustrates something important about how the modern internet is structured. US-EAST-1 is AWS's default region — the one developers reach for first, the one that has the most mature service availability, the one that decades of 'just deploy to us-east-1' decisions have concentrated critical infrastructure in. Even services claiming multi-region redundancy often still rely on US-EAST-1 for authentication, metadata stores, or database calls — dependencies that only become visible when US-EAST-1 goes dark.

Major services and platforms affected by the October 19–20, 2025 AWS US-EAST-1 outage

Category	Affected Services
Social & Entertainment	Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Pokémon GO, Disney+, Hulu, Twitch
Finance & Payments	Coinbase, Venmo, several UK banks (Lloyds, Halifax)
Smart Home & IoT	Amazon Ring cameras, Amazon Alexa, Eight Sleep
Communications	Signal, several enterprise communication platforms
Government	UK HMRC tax authority
Travel	United Airlines app, Delta app
AI & Developer Tools	Perplexity AI
AWS Services (internal)	EC2, IAM, STS, Lambda, S3, SQS, Amazon Connect, Redshift (140+ services total)

The Control Plane Problem: Why DynamoDB's Failure Was Uniquely Catastrophic

A typical service outage takes down the service that failed. The October 2025 DynamoDB outage was different because DynamoDB is infrastructure for infrastructure. EC2 uses DynamoDB to track compute instance state. IAM uses DynamoDB to store and retrieve access policies. Lambda uses DynamoDB for execution state. STS uses DynamoDB to validate tokens. When DynamoDB became unreachable, these services couldn't perform their core functions — not because they had their own bugs, but because the foundation they relied on had disappeared. This is a control-plane failure, and control-plane failures cascade differently from data-plane failures: they don't just take down what failed, they take down the ability to manage everything else.
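One way to see why a control-plane dependency fails differently is to compute the transitive set of dependents for a failed service. The sketch below assumes a small dependency map: the DynamoDB edges come from the paragraph above, while `checkout-app` and `static-site` are hypothetical customer workloads added only for illustration.

```python
from collections import deque

# service -> services it depends on. The DynamoDB edges come from the text;
# "checkout-app" and "static-site" are hypothetical examples.
DEPENDS_ON = {
    "EC2": ["DynamoDB"],
    "IAM": ["DynamoDB"],
    "Lambda": ["DynamoDB"],
    "STS": ["DynamoDB"],
    "checkout-app": ["EC2", "IAM"],
    "static-site": [],
}


def blast_radius(failed, depends_on):
    """Return every service that transitively depends on `failed`."""
    # Invert the map: dependency -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted
```

In this toy map, `blast_radius("DynamoDB", DEPENDS_ON)` includes `checkout-app` even though it never calls DynamoDB directly, which is the hidden-dependency effect the section describes.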

The EC2 Congestive Collapse: Why Recovery Took 12 Extra Hours

DynamoDB's DNS was restored in approximately three hours. But the outage lasted 15. The reason was EC2's Droplet Workflow Manager (DWFM), the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous queue of backlogged lease management tasks, all trying to execute simultaneously. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. Network state recovery alone took more than five additional hours after DynamoDB was fixed, and the wider EC2 cascade needed 12+ hours to clear.
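The feedback loop above can be illustrated with a small queue simulation. This is a sketch under stated assumptions: the goodput formula (completed work falls as attempted work exceeds capacity) is a simplification chosen for illustration, and the numbers are invented. The `admit_limit` parameter shows a generic admission-control remedy for this pattern; the text does not describe how DWFM's backlog was actually drained.

```python
def drain_ticks(backlog, capacity, arrivals, admit_limit=None, max_ticks=1000):
    """Return the tick at which the backlog clears, or None if it never does.

    backlog: tasks already queued when the dependency recovers.
    capacity: tasks the system can complete per tick when not overloaded.
    arrivals: new tasks added each tick (for example, more leases expiring).
    admit_limit: if set, at most this many tasks are attempted per tick.
    """
    for tick in range(1, max_ticks + 1):
        backlog += arrivals
        attempted = backlog if admit_limit is None else min(backlog, admit_limit)
        if attempted <= capacity:
            completed = attempted
        else:
            # Assumed overload model: work is wasted on timeouts and retries,
            # so useful throughput shrinks as attempted load grows.
            completed = capacity * capacity // attempted
        backlog -= completed
        if backlog == 0:
            return tick
    return None
```

With a backlog of 10,000, capacity of 100, and 50 arrivals per tick, `drain_ticks(10_000, 100, 50)` never clears, while `drain_ticks(10_000, 100, 50, admit_limit=100)` clears at tick 200. Attempting less work per tick is what lets the queue shrink.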

THE ANTI-PATTERN: WHEN AUTOMATION PREVENTS RECOVERY

The most counterintuitive part of the recovery was that engineers had to disable automatic failover to stabilize the system. The automatic failover mechanisms — designed to move traffic to healthy systems when failures are detected — were themselves contributing to the instability. With DNS records in an inconsistent state, the failover systems were flip-flopping: detecting failures, triggering failovers, detecting those failovers as failures, triggering more failovers. The automation designed to speed recovery was making recovery impossible. Engineers had to manually turn it off, let the system reach a stable state, and then re-enable it with the correct DNS records in place. This is one of the most instructive moments in the incident: sometimes, the recovery automation has to be stopped before recovery can begin.
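The flip-flopping can be shown with a toy counter. In this sketch the health-check sequence is an assumed input in which every check fails, standing in for the inconsistent DNS state. The `enabled` flag models operators switching the automation off, as the text describes. The `hold_down` parameter is a common damping technique included for comparison only; it is not something the text attributes to AWS.

```python
def count_failovers(health_checks, hold_down=0, enabled=True):
    """Count failovers triggered by a sequence of health check results.

    health_checks: iterable of booleans, True means the check passed.
    hold_down: ticks to ignore checks after a failover (0 = react every time).
    enabled: False models operators switching the automation off.
    """
    if not enabled:
        return 0
    failovers = 0
    cooldown = 0
    for healthy in health_checks:
        if cooldown > 0:
            cooldown -= 1  # still inside the hold-down window
            continue
        if not healthy:
            failovers += 1
            cooldown = hold_down
    return failovers


# Assumed input: with inconsistent DNS state, every target looks unhealthy.
always_failing = [False] * 12
```

Over those 12 checks, the undamped controller fails over 12 times, a hold-down of 5 ticks cuts that to 2, and the disabled controller does nothing, which gives the underlying state time to be repaired by hand.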

~3 hrs

Time from incident start to DynamoDB DNS restoration — engineers had to manually diagnose, understand the inconsistent DNS state, and correct it since automated systems couldn't self-recover

12+ hrs

Additional hours EC2's Droplet Workflow Manager required to clear its congestive collapse from accumulated expired lease backlog after DynamoDB recovered

140+

AWS services eventually affected by the cascade — because DynamoDB powers the control planes of EC2, IAM, Lambda, STS, and dozens of other foundational services

$581M

Estimated insurance losses from the 15-hour outage, per CyberCube cyber risk analytics — representing disruption to thousands of businesses globally dependent on US-EAST-1

The Fix

AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade

AWS published its official post-incident report three days after the October 20 event. The fixes address five distinct failure layers: the DNS Enactor race condition itself, the cleanup automation that deleted active records, the NLB failover velocity that amplified the cascade, the untested EC2 recovery workflow, and the failover automation that flip-flopped during recovery. Each fix targets a specific mechanism that allowed the failure to happen or to propagate.

AWS's five-layer post-incident fix plan, derived from the official post-incident summary published October 23, 2025

Failure Layer	What Went Wrong	AWS's Fix
DNS Enactor race condition	Two Enactors ran concurrently; Enactor A's staleness check had gone out of date by the time it acted, allowing it to overwrite Enactor B's newer plan	Stronger staleness validation in the Enactor before applying plans: the check must reflect the current state of the world at time of application, not at time of plan pickup
Cleanup automation	The cleanup job deleted Enactor A's just-applied old plan because it appeared many generations old — wiping all DNS records in the process	Safeguards to ensure no automated process can delete or remove an active DNS plan — any plan being actively used as the live record is protected from cleanup regardless of its generation number
NLB failover velocity	Network Load Balancers moved large amounts of capacity during AZ failover triggered by the DNS failure, amplifying the cascade	Velocity control mechanism for NLBs to limit how much capacity a single NLB can remove when health check failures cause AZ failover — preventing AZ-level failures from creating region-level capacity evaporation
EC2 recovery workflow	EC2's DWFM entered congestive collapse when DynamoDB recovered and the backlogged lease queue overwhelmed the system — a failure mode that had not been tested	Additional test suite to exercise the DWFM recovery workflow at scale — catching congestive collapse scenarios before they happen in production rather than discovering them during outage recovery
Automatic failover during recovery	Failover automation flip-flopped during recovery, requiring manual disabling before stabilization could occur	Review of failover automation behavior during degraded DNS states — automation must detect the difference between 'service is down' and 'DNS is inconsistent during recovery' and respond differently to each
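The velocity control idea in the NLB row can be sketched as a sliding-window budget on removals. This is an illustrative sketch under stated assumptions, not AWS's mechanism: the class name `RemovalVelocityLimiter`, its parameters, and the choice of a sliding window are all hypothetical.

```python
from collections import deque


class RemovalVelocityLimiter:
    """Caps how many targets may be pulled from service per time window."""

    def __init__(self, max_removals: int, window_seconds: float):
        self.max_removals = max_removals
        self.window_seconds = window_seconds
        self._removal_times = deque()  # timestamps of recent removals

    def allow_removal(self, now: float) -> bool:
        # Forget removals that have aged out of the window.
        while self._removal_times and now - self._removal_times[0] >= self.window_seconds:
            self._removal_times.popleft()
        if len(self._removal_times) >= self.max_removals:
            return False  # budget spent: keep the target in service
        self._removal_times.append(now)
        return True


if __name__ == "__main__":
    limiter = RemovalVelocityLimiter(max_removals=3, window_seconds=60.0)
    for t in (0.0, 1.0, 2.0, 3.0, 61.0):
        print(t, limiter.allow_removal(t))
```

The design choice here is to fail static: when health checks suddenly condemn a large share of capacity, the limiter assumes the health signal may be the broken part and refuses to shed more than the budget allows.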

The Unstated Root Cause: The Architecture of Trust in US-EAST-1

AWS's post-mortem addressed the technical race condition correctly. But the incident exposed a deeper architectural problem that no single fix resolves: the internet's implicit trust in US-EAST-1. AWS designed US-EAST-1 as a region — a geographic cluster of data centers meant to be one of many independently redundant deployment targets. Over 20 years, it became something else: the default region for millions of applications, the region where foundational services were first available, and the region that even 'multi-region' architectures often quietly depend on for authentication, metadata, or coordination. Ring cameras depend on it for authentication. Venmo depends on it for payment processing. UK government services depend on it for API calls. None of these dependencies were meant to create a single point of failure. But that's what they became.

The Test Coverage Gap: You Can't Fully Test Massive Scale Without Massive Scale

One of the most honest admissions in AWS's post-incident report is about test coverage. The DWFM recovery workflow — the path EC2's Droplet Workflow Manager follows when it needs to process a massive backlog of expired leases after a DynamoDB outage — had not been adequately tested at the scale required to discover the congestive collapse failure mode. AWS's response is to build additional test suites specifically for this recovery workflow. But the admission surfaces a fundamental challenge of hyperscale infrastructure: the failure conditions that matter most are the ones that only occur at scale, and at scale, test environments are approximations of production, not replicas of it. The only complete test of how AWS's systems behave during a DynamoDB outage recovery is an actual DynamoDB outage. This is the same insight that drove Netflix to build Chaos Monkey — except that for a cloud provider, you cannot deliberately cause a DynamoDB outage to test the recovery path.

THE HIDDEN CROSS-REGION DEPENDENCY PROBLEM

The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: regions that are called independent but aren't. AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1 or AP-SOUTHEAST-1. But control-plane dependencies — authentication services, metadata stores, quota management systems — create invisible cross-region ties. When the control plane fails in one region, services in other regions that depend on that control plane for any operation fail with it. True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organizations building on cloud infrastructure, this is not the architecture they have — it is the architecture they think they have.

Architecture

The October 2025 DynamoDB outage is a case study in what distributed systems engineers call a control-plane failure — a class of failure that is categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service. To understand why the failure cascaded so far and recovered so slowly, you need to understand the three layers of the failure: the DNS automation race condition, the DynamoDB control-plane dependency web, and EC2's Droplet Workflow Manager congestive collapse.

The DNS Race Condition: Step-by-Step

sequenceDiagram
    participant Planner as DNS Planner
    participant EnactorA as Enactor A (slow)
    participant EnactorB as Enactor B (fast)
    participant Route53 as Route53 DNS
    participant Cleanup as Cleanup Job
    Planner->>EnactorA: Assign Plan #50
    Planner->>EnactorB: Assign Plan #100 (newer)
    Note over EnactorA: Staleness check: Plan #50 is current ✓ (check is now stale)
    EnactorB->>Route53: Apply Plan #100 rapidly ✓
    EnactorB->>Cleanup: Trigger: delete plans much older than #100
    Cleanup->>Route53: Delete Plan #50 and older
    EnactorA->>Route53: Finally apply Plan #50 (overwrites #100!)
    Cleanup->>Route53: Delete Plan #50 (it's old!)
    Note over Route53: ALL DynamoDB DNS records deleted
    Note over Route53: 11:48 PM PDT, total DNS blackout

The Cascade: How DynamoDB's DNS Failure Propagated

flowchart TD
    dns_gone["DynamoDB DNS Records Deleted\n11:48 PM PDT"]
    dns_gone --> dynamo_dark["DynamoDB Unreachable\n(no IP to connect to)"]
    dynamo_dark --> ec2_fail["EC2 Instance State Tracking Fails\n(Droplet Workflow Manager can't write state)"]
    dynamo_dark --> iam_fail["IAM Policy Evaluation Fails\n(can't retrieve access policies)"]
    dynamo_dark --> lambda_fail["Lambda Execution State Fails"]
    dynamo_dark --> sts_fail["STS Token Validation Fails"]
    ec2_fail --> dwfm_backlog["DWFM Accumulates Backlog\nof Expired Instance Leases"]
    iam_fail --> auth_broken["All API Authentication Broken\nacross dependent services"]
    auth_broken --> snapchat["Snapchat ↓"]
    auth_broken --> fortnite["Fortnite ↓"]
    auth_broken --> ring["Ring Cameras ↓"]
    auth_broken --> venmo["Venmo ↓"]
    dwfm_backlog --> congestive["Congestive Collapse\nwhen DynamoDB recovered\n(backlog overwhelms recovered DB)"]
    congestive --> extra_12hr["12+ Additional Hours\nfor EC2 state recovery"]
    style dns_gone fill:#ef4444,color:#ffffff
    style congestive fill:#f59e0b,color:#000000

WHY US-EAST-1 BECAME A SINGLE POINT OF FAILURE FOR THE INTERNET

AWS designed its regions to be independently operable — a failure in US-EAST-1 should not affect EU-WEST-1. This design intention is correct, but the reality that emerged over 20 years is different. US-EAST-1 is where AWS first launched most services, so it accumulated the most mature feature sets. It became the default — the region developers reach for first, the region enterprises trust most deeply. Over time, even architectures claiming multi-region resilience often retain quiet dependencies on US-EAST-1 for authentication flows, control-plane coordination, or foundational database calls. The technical independence of regions is real. The operational independence, as experienced during the October 2025 outage, is not.

The Automatic Failover Anti-Pattern

One of the most practically instructive moments of the October 2025 outage was the decision to manually disable automatic failover to allow recovery to proceed. The automatic failover systems — designed to improve availability — were detecting the DNS inconsistency as failures and triggering failovers, which created new inconsistencies, which triggered more failovers. The automation was creating a feedback loop that prevented stabilization. Engineers had to turn it off to let the system reach a stable state. The lesson: automatic recovery systems need to distinguish between 'the service is down' (trigger failover) and 'the DNS state is inconsistent during manual recovery' (pause failover until DNS is stable). Automation that cannot make this distinction can prevent recovery faster than it enables it.
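The distinction drawn above can be written down as a small decision function. This is a hedged sketch: the `Action` enum, the `decide_failover` name, and the three boolean inputs are hypothetical, and how a real system would determine `dns_consistent` is left out entirely.

```python
from enum import Enum


class Action(Enum):
    FAIL_OVER = "fail_over"  # move traffic to a healthy target
    HOLD = "hold"            # failure seen, but acting could make things worse
    NONE = "none"            # nothing to do


def decide_failover(health_ok: bool, dns_consistent: bool,
                    automation_paused: bool) -> Action:
    """Decide what failover automation should do for one health-check result."""
    if health_ok:
        return Action.NONE
    # An operator pause or an inconsistent DNS state both mean the failure
    # signal cannot be trusted yet, so wait instead of triggering a failover.
    if automation_paused or not dns_consistent:
        return Action.HOLD
    return Action.FAIL_OVER


if __name__ == "__main__":
    print(decide_failover(health_ok=False, dns_consistent=True, automation_paused=False))
    print(decide_failover(health_ok=False, dns_consistent=False, automation_paused=False))
```

Keeping `automation_paused` as an explicit input gives operators the manual off switch the incident required, without having to disable the health checks themselves.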

Lessons

The October 2025 DynamoDB outage is one of the most technically instructive incidents in cloud computing history, not because the root cause was complex, but because it was so simple and yet cascaded so far: a race condition between two DNS Enactors and a cleanup job. The most consequential bug is often the one you're sure you've already solved.

Staleness checks must be evaluated at time of use, not time of pickup. Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state changes between the check and the action, the check must be re-evaluated immediately before the action, not cached from a prior point in the workflow. This is a time-of-check to time-of-use (TOCTOU) race, one of the oldest race condition patterns in computer science, appearing in production at AWS scale.
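A minimal sketch of the difference between checking at pickup and checking at the moment of the write. The `DnsState` class and its method names are hypothetical stand-ins, not the Enactor's real interface; the plan numbers mirror the ones used in the diagram in this article.

```python
import threading


class DnsState:
    """Hypothetical stand-in for the live DNS record set."""

    def __init__(self):
        self._lock = threading.Lock()
        self.applied_generation = 0

    def apply_unchecked(self, generation: int) -> None:
        # Buggy pattern: the caller decided earlier that its plan was fresh,
        # and nothing re-validates that decision at write time.
        with self._lock:
            self.applied_generation = generation

    def apply_if_newer(self, generation: int) -> bool:
        # Check and write happen under one lock, so no newer plan can be
        # applied between the comparison and the update.
        with self._lock:
            if generation <= self.applied_generation:
                return False  # a newer plan is already live: drop this one
            self.applied_generation = generation
            return True


if __name__ == "__main__":
    buggy = DnsState()
    plan_50_looked_fresh = 50 > buggy.applied_generation  # checked at pickup
    buggy.apply_unchecked(100)                             # fast enactor wins
    if plan_50_looked_fresh:
        buggy.apply_unchecked(50)                          # stale write lands
    print("buggy:", buggy.applied_generation)

    fixed = DnsState()
    fixed.apply_if_newer(100)
    fixed.apply_if_newer(50)                               # rejected
    print("fixed:", fixed.applied_generation)
```

In a distributed setting the lock would typically be replaced by a conditional write or compare-and-set on the store that holds the live plan, but the principle is the same: the comparison and the write must be one atomic step.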

No automated process should be able to delete an active record. The cleanup job's design — delete plans that are significantly older than the most recently applied plan — had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number. This invariant is simpler than the cleanup logic that violated it.
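The invariant is short enough to state in code. This is a sketch under assumptions: the function name, the integer generation numbers, and the `keep_last` retention rule are hypothetical and only illustrate excluding the live plan from cleanup.

```python
from typing import Iterable, List


def plans_safe_to_delete(plans: Iterable[int], active_generation: int,
                         newest_generation: int, keep_last: int) -> List[int]:
    """Return the plan generations that cleanup may remove.

    A plan is eligible only if it is older than the retention cutoff AND
    is not the plan currently serving live traffic, however old it looks.
    """
    cutoff = newest_generation - keep_last
    return sorted(
        generation for generation in plans
        if generation < cutoff and generation != active_generation
    )


if __name__ == "__main__":
    # Plan 50 is old by generation number but is the live record.
    print(plans_safe_to_delete([40, 50, 95, 100], active_generation=50,
                               newest_generation=100, keep_last=10))
```

The guard is a single extra condition, which matches the point made above: the invariant is simpler than the cleanup logic that violated it.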

Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed. EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery. The scenario seemed unlikely enough to skip in testing, and specific enough that staging environments couldn't reproduce it. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.

Multi-region architecture must be evaluated not just for application code but for control-plane dependencies. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. True regional independence requires independently redundant control planes, not just independently deployed application code.

Sometimes, the recovery automation has to stop before recovery can start. The engineers who manually disabled automatic failover to stabilize the system were making the right call — but it required human judgment to recognize that the automation was making things worse rather than better. Build your recovery playbooks to include the question: 'Is any automated system currently making this worse?' The answer is occasionally yes, and having a clear path to pause automation is as important as having automation in the first place.

What Good Regional Independence Actually Looks Like

The October 2025 outage drew a clear line between companies that had genuine multi-region independence and those that believed they did. Genuine independence requires: application code deployed in at least two regions; authentication, authorization, and metadata services independently operational per region; no synchronous cross-region API calls in the critical path; tested failover that has been exercised under real load; and runbooks that don't assume a specific region is available. The companies whose services stayed up during the October 2025 outage weren't lucky. They had made specific architectural decisions years earlier — decisions that cost money and engineering time — that happened to be exactly the right decisions.

THE PRACTICAL RESPONSE FOR EVERY ENGINEERING TEAM

The October 2025 AWS outage has a direct implication for every team running production workloads on cloud infrastructure. Map your US-EAST-1 dependencies before the next outage, not during it. Specifically: identify every service your application calls that is hosted in US-EAST-1, even if your application code is deployed elsewhere. This includes authentication providers, CDN origins, third-party APIs, and internal microservices. For each dependency, ask: 'If this endpoint returned no DNS records for three hours, what would our users experience?' The answer to that question is your actual blast radius for a US-EAST-1 control-plane failure — and it is almost certainly larger than your architecture diagram suggests.
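A first pass at the dependency map can be automated from a list of hostnames your services call. This is a rough sketch, assuming endpoints follow the common `service.region.amazonaws.com` shape; the function names are hypothetical, and any hostname that does not encode a region (CDNs, third-party APIs, global endpoints) is marked `unknown` for manual review rather than guessed at.

```python
import re
from typing import Dict, Iterable, List

# Assumed pattern: a region label such as "us-east-1" directly before
# ".amazonaws.com". Hostnames in other shapes will not match.
_REGIONAL = re.compile(r"\.([a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com$")


def classify_endpoints(hostnames: Iterable[str]) -> Dict[str, str]:
    """Map each hostname to the region in its name, or 'unknown'."""
    result = {}
    for host in hostnames:
        normalized = host.strip().lower().rstrip(".")
        match = _REGIONAL.search(normalized)
        result[host] = match.group(1) if match else "unknown"
    return result


def needs_review(hostnames: Iterable[str], region: str = "us-east-1") -> List[str]:
    """Hostnames pinned to the given region, plus those that could not be classified."""
    classified = classify_endpoints(hostnames)
    return sorted(h for h, r in classified.items() if r in (region, "unknown"))


if __name__ == "__main__":
    hosts = [
        "dynamodb.us-east-1.amazonaws.com",
        "sqs.eu-west-1.amazonaws.com",
        "api.example.com",
    ]
    for host in needs_review(hosts):
        print(host)
```

A hostname scan only finds direct dependencies. Transitive ones, such as a vendor whose own backend sits in US-EAST-1, still have to be found by asking the vendor or by observing behaviour during a game day.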

Amazon Web Services runs infrastructure at a scale where the cost of a single race condition is measured in hundreds of millions of dollars and 375 million users unable to send a Snapchat. The race condition itself (two processes trying to update the same state concurrently, with a stale check allowing a stale write) is the kind of bug that appears in computer science textbooks under 'concurrent programming gotchas.' The lesson isn't that AWS made an obvious mistake. The lesson is that obvious mistakes at sufficient scale have non-obvious consequences, and the gap between 'finding the bug' and 'recovering from the bug' was twelve hours wide.

TechLogStack: built at scale, broken in public, rebuilt by engineers