The Story
When this issue occurred at 11:48 PM PDT, all systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included customer traffic as well as traffic from internal AWS services that rely on DynamoDB.
— Amazon Web Services — Official Post-Incident Summary, October 2025
DynamoDB is not just a database. Inside AWS's infrastructure, it is the connective tissue — the system that EC2, IAM, Lambda, STS, Redshift, and dozens of other control-plane services rely on to store metadata, track state, and coordinate operations. When DynamoDB becomes unreachable, it doesn't just take databases offline. It takes down the systems that manage everything else. This is why a DNS failure that lasted roughly three hours for DynamoDB itself cascaded into a 15-hour platform-wide crisis. The control plane broke. And when the control plane breaks, recovery is not a matter of fixing the root cause — it is a matter of stabilizing everything that lost its footing when the ground disappeared.
To understand what happened, you need to understand how DynamoDB manages its DNS. At AWS's scale, DynamoDB maintains hundreds of thousands of DNS records to operate the massive fleet of load balancers that route traffic across each region. These records are updated constantly — as capacity is added, as hardware fails, as traffic is redistributed. AWS built a two-component system to manage this at scale: a DNS Planner and multiple DNS Enactors.
THE TWO-COMPONENT DNS ARCHITECTURE: PLANNER AND ENACTOR
The DNS management system had two independent components:
The DNS Planner monitors load balancer health and capacity and periodically creates new DNS plans — essentially a specification of which load balancers should receive traffic and with what weight distribution.
The DNS Enactors are the workers: multiple independent processes, running across three different Availability Zones, that pick up the plans and apply them to Route 53 (AWS's DNS service). Multiple Enactors running in parallel provide redundancy: if one Enactor fails, others continue. In theory.
Problem
Enactor A Slows Down — And Its Stale Check Becomes a Time Bomb
DNS Enactor A began applying an older DNS plan but encountered unusual delays — it kept getting blocked trying to update records and moved painfully slowly through the list of endpoints. Crucially, Enactor A performed a staleness check early in its process: 'Is my plan newer than what's currently active?' At the time of that check, it was. But by the time Enactor A actually finished applying the plan, the world had moved on — newer plans had been created and applied. The staleness check was now stale itself.
Cause
The Race Condition Fires — Enactor B Wins, Then Cleans Up
While Enactor A was slowly working through its updates, Enactor B picked up one of the newer plans and rapidly applied it across all endpoints. When Enactor B completed, it triggered the cleanup process: identify plans that are significantly older than the one just applied, and delete them. At that exact moment — T+45 seconds after the race began — Enactor A finally finished applying its old plan, overwriting Enactor B's newer records. The cleanup job identified Enactor A's newly-applied old plan as many generations old, and deleted it. All DynamoDB DNS records for the US-EAST-1 regional endpoint were gone.
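The interleaving described above can be replayed deterministically. This is a minimal sketch, not AWS's implementation: the `Endpoint` class, the `keep=10` cleanup threshold, the plan numbers, and the IP addresses are all hypothetical, chosen only to show how an early check followed by a late write leaves the endpoint pointing at a deleted plan.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Endpoint:
    """Toy model of one regional endpoint; every name here is hypothetical."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # generation -> IPs
    active: Optional[int] = None                               # generation being served

    def resolve(self) -> List[str]:
        # If the active plan has been deleted, the endpoint resolves to nothing.
        return self.plans.get(self.active, [])


def looks_newer(ep: Endpoint, generation: int) -> bool:
    """The staleness check: only valid at the instant it runs."""
    return ep.active is None or generation > ep.active


def apply_plan(ep: Endpoint, generation: int) -> None:
    ep.active = generation  # unconditional write, no re-check


def cleanup(ep: Endpoint, just_applied: int, keep: int = 10) -> None:
    """Delete plans much older than the one the caller just applied."""
    for gen in [g for g in ep.plans if g < just_applied - keep]:
        del ep.plans[gen]


def replay_race() -> Endpoint:
    ep = Endpoint(
        plans={40: ["10.0.0.1"], 50: ["10.0.0.2"], 100: ["10.0.0.3"]},
        active=40,
    )
    a_check = looks_newer(ep, 50)   # Enactor A checks early: 50 > 40, passes
    if looks_newer(ep, 100):        # Enactor B checks and writes promptly
        apply_plan(ep, 100)
    if a_check:                     # Enactor A acts on its stale result
        apply_plan(ep, 50)          # overwrites plan 100 with plan 50
    cleanup(ep, just_applied=100)   # B's cleanup removes plans 40 and 50
    return ep


ep = replay_race()
print(ep.active, ep.resolve())  # prints: 50 []
```

Plan 100 still exists in the store, but nothing points at it. The endpoint is left referencing a plan that cleanup has already removed, so it resolves to an empty record set.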
Solution
11:48 PM PDT: Total DNS Blackout
At 11:48 PM PDT on October 19, every system trying to connect to DynamoDB in US-EAST-1 via the public endpoint received DNS failures. No IP address. No connection. The failure was immediate and total — not a degradation, not elevated error rates, but a complete inability to resolve the DynamoDB endpoint. Internal AWS services relying on DynamoDB for control-plane operations went down alongside customer traffic. EC2's Droplet Workflow Manager lost its ability to track instance lease state. IAM couldn't validate credentials. Lambda couldn't execute. The cascade was underway.
Result
15 Hours of Cascading Failure — and Manual Recovery
Engineers identified the DNS issue by 12:38 AM PDT and began temporary mitigations by 1:15 AM PDT. DynamoDB itself recovered by approximately 2:25 AM PDT, roughly three hours after the incident began. But the cascade had already overwhelmed EC2's Droplet Workflow Manager (DWFM) with a backlog of expired instance leases it couldn't process. The DWFM entered congestive collapse, requiring 12+ more hours for network state to fully stabilize. The fix for the automation itself was brutal in its simplicity: engineers had to manually disable the automatic failover system entirely to stop it from flip-flopping between states and allow the platform to stabilize. Full recovery across all services wasn't complete until late afternoon on October 20, roughly 15 hours after the cascade began.
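The elapsed times behind "roughly three hours" can be checked with a few lines of arithmetic. This sketch assumes all four clock times are in the same time zone as the 11:48 PM PDT start, which is the only reading under which the three-hour figure holds; the dictionary labels are paraphrases, not official milestone names.

```python
from datetime import datetime

# Clock times quoted in the text, read as one shared local time zone (assumption).
start = datetime(2025, 10, 19, 23, 48)
milestones = {
    "DNS issue identified": datetime(2025, 10, 20, 0, 38),
    "temporary mitigations begin": datetime(2025, 10, 20, 1, 15),
    "DynamoDB recovered": datetime(2025, 10, 20, 2, 25),
}


def elapsed_minutes(t: datetime) -> int:
    """Whole minutes between the incident start and milestone t."""
    return int((t - start).total_seconds() // 60)


elapsed = {name: elapsed_minutes(t) for name, t in milestones.items()}
for name, minutes in elapsed.items():
    print(f"{name}: +{minutes // 60}h{minutes % 60:02d}m")
```

DynamoDB's recovery lands at 2 hours 37 minutes after the start, which is the "roughly three hours" used throughout.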
🌐 Ookla, the network intelligence company behind Speedtest.net, recorded over 17 million outage reports across more than 3,000 organizations during the incident. Independent measurements showed 20 to 30 percent of all internet-facing services experienced disruptions at peak impact. The US, UK, and Germany were hit hardest.
What Actually Went Dark
The list of affected services illustrates something important about how the modern internet is structured. US-EAST-1 is AWS's default region — the one developers reach for first, the one that has the most mature service availability, the one that decades of 'just deploy to us-east-1' decisions have concentrated critical infrastructure in. Even services claiming multi-region redundancy often still rely on US-EAST-1 for authentication, metadata stores, or database calls — dependencies that only become visible when US-EAST-1 goes dark.
Major services and platforms affected by the October 19–20, 2025 AWS US-EAST-1 outage
| Category | Affected Services |
|---|---|
| Social & Entertainment | Snapchat (375M daily users), Discord, Reddit, Roblox, Fortnite, Disney+, Hulu, Twitch |
| Finance & Payments | Coinbase, Venmo, several UK banks (Lloyds, Halifax) |
| Smart Home & IoT | Amazon Ring cameras, Amazon Alexa, Eight Sleep |
| Communications | Signal, several enterprise communication platforms |
| Government | UK HMRC tax authority |
| Travel | United Airlines app, Delta app |
| AI & Developer Tools | Perplexity AI, Pokémon GO |
| AWS Services (internal) | EC2, IAM, STS, Lambda, S3, SQS, Amazon Connect, Redshift (140+ services total) |
⚠️ The Control Plane Problem: Why DynamoDB's Failure Was Uniquely Catastrophic
A typical service outage takes down the service that failed. The October 2025 DynamoDB outage was different because DynamoDB is infrastructure for infrastructure. EC2 uses DynamoDB to track compute instance state. IAM uses DynamoDB to store and retrieve access policies. Lambda uses DynamoDB for execution state. STS uses DynamoDB to validate tokens. When DynamoDB became unreachable, these services couldn't perform their core functions — not because they had their own bugs, but because the foundation they relied on had disappeared. This is a control-plane failure, and control-plane failures cascade differently from data-plane failures: they don't just take down what failed, they take down the ability to manage everything else.
The EC2 Congestive Collapse: Why Recovery Took 12 Extra Hours
DynamoDB's DNS was restored in approximately three hours. But the outage lasted 15. The reason was EC2's Droplet Workflow Manager (DWFM), the system responsible for managing EC2 instance lifecycle events, including lease renewals. When DynamoDB became unavailable, DWFM couldn't process instance state updates and began accumulating a backlog of expired leases. By the time DynamoDB recovered, DWFM was facing an enormous queue of backlogged lease management tasks, all trying to execute simultaneously. The system entered congestive collapse: the more it tried to process, the more it overwhelmed the now-recovered DynamoDB, which slowed processing, which lengthened the queue, which increased the pressure. Network state recovery from this congestive collapse took more than twelve additional hours after DynamoDB was fixed.
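The feedback loop in a congestive collapse can be illustrated with a toy queue. This sketch is an assumption-heavy model, not a description of DWFM internals: tasks that wait longer than `timeout` ticks are treated as wasted work and re-queued, `capacity` is the number of tasks served per tick, and `admit_limit` is an illustrative admission cap rather than a mitigation AWS is documented to have used.

```python
from collections import deque
from typing import Optional


def drain(backlog: int, capacity: int, timeout: int,
          admit_limit: Optional[int] = None,
          max_ticks: int = 10_000) -> Optional[int]:
    """Return the ticks needed to finish `backlog` tasks, or None if it never drains."""
    waiting = backlog   # tasks held outside the queue
    queue = deque()     # tick at which each queued task was (re)enqueued
    done = 0
    for tick in range(max_ticks):
        # Admit everything at once, or only enough to fill the admission cap.
        if admit_limit is None:
            room = waiting
        else:
            room = min(waiting, admit_limit - len(queue))
        queue.extend([tick] * room)
        waiting -= room
        # Serve up to `capacity` tasks this tick.
        for _ in range(min(capacity, len(queue))):
            enqueued_at = queue.popleft()
            if tick - enqueued_at >= timeout:
                queue.append(tick)  # timed out while waiting: wasted work, retry
            else:
                done += 1
        if done == backlog:
            return tick + 1
    return None


print(drain(1000, capacity=10, timeout=5))                  # unbounded admission
print(drain(1000, capacity=10, timeout=5, admit_limit=30))  # bounded admission
```

With the whole backlog admitted at once, only the first 50 tasks finish before every remaining task has waited past its timeout, and from then on the server spends all its capacity on work that is already dead. With admission capped below `capacity * timeout`, every served task is still live and the backlog drains at full speed.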
THE ANTI-PATTERN: WHEN AUTOMATION PREVENTS RECOVERY
The most counterintuitive part of the recovery was that engineers had to disable automatic failover to stabilize the system. The automatic failover mechanisms, designed to move traffic to healthy systems when failures are detected, were themselves contributing to the instability. With DNS records in an inconsistent state, the failover systems were flip-flopping: detecting failures, triggering failovers, detecting those failovers as failures, triggering more failovers.
The automation designed to speed recovery was making recovery impossible. Engineers had to manually turn it off, let the system reach a stable state, and then re-enable it with the correct DNS records in place. This is one of the most instructive moments in the incident: sometimes, the recovery automation has to be stopped before recovery can begin.
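A pause switch of the kind the engineers effectively needed can be sketched in a few lines. Everything here is hypothetical (the `FailoverController` class, the `threshold` parameter, the target names); the point is only that a controller which fails over on every bad health check will oscillate when all targets look unhealthy, and that an explicit operator-controlled pause stops the oscillation without removing the automation.

```python
from typing import List


class FailoverController:
    """Toy failover loop with a manual pause; all names are illustrative."""

    def __init__(self, targets: List[str], threshold: int = 3) -> None:
        self.targets = list(targets)
        self.active = 0             # index of the target currently serving
        self.threshold = threshold  # consecutive failures before failing over
        self.failures = 0
        self.paused = False
        self.flips = 0              # how many failovers have been triggered

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        # Clear history so stale failures do not trigger an instant failover.
        self.paused = False
        self.failures = 0

    def observe(self, healthy: bool) -> str:
        """Feed one health-check result for the active target."""
        self.failures = 0 if healthy else self.failures + 1
        if not self.paused and self.failures >= self.threshold:
            self.active = (self.active + 1) % len(self.targets)
            self.failures = 0
            self.flips += 1
        return self.targets[self.active]


# While DNS is inconsistent, every target looks unhealthy.
eager = FailoverController(["az-a", "az-b"], threshold=1)
for _ in range(12):
    eager.observe(False)
print(eager.flips)  # prints: 12

held = FailoverController(["az-a", "az-b"], threshold=1)
held.pause()
for _ in range(12):
    held.observe(False)
print(held.flips)  # prints: 0
```

Raising `threshold` dampens flapping from transient failures, but it cannot help when the failure signal itself is wrong for every target. That case needs the pause.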
~3 hrs
Time from incident start to DynamoDB DNS restoration — engineers had to manually diagnose, understand the inconsistent DNS state, and correct it since automated systems couldn't self-recover
12+ hrs
Additional hours EC2's Droplet Workflow Manager required to clear its congestive collapse from accumulated expired lease backlog after DynamoDB recovered
140+
AWS services eventually affected by the cascade — because DynamoDB powers the control planes of EC2, IAM, Lambda, STS, and dozens of other foundational services
$581M
Estimated insurance losses from the 15-hour outage, per CyberCube cyber risk analytics — representing disruption to thousands of businesses globally dependent on US-EAST-1
The Fix
AWS's Post-Incident Fixes: Preventing the Race, Containing the Cascade
AWS published its official post-incident report three days after the October 20 event. The fixes address five distinct failure layers: the race condition itself, the cleanup automation that deleted active records, the velocity of NLB failover, the untested EC2 recovery workflow, and the failover automation that flip-flopped during recovery. Each fix targets a specific mechanism that allowed the failure to happen or to propagate.
AWS's five-layer post-incident fix plan, derived from the official post-incident summary published October 23, 2025
| Failure Layer | What Went Wrong | AWS's Fix |
|---|---|---|
| DNS Enactor race condition | Two Enactors ran concurrently; Enactor A's stale staleness check allowed it to overwrite Enactor B's newer plan | Stronger staleness validation in the Enactor before applying plans — the check must reflect the current state of the world at time of application, not at time of plan pickup |
| Cleanup automation | The cleanup job deleted Enactor A's just-applied old plan because it appeared many generations old — wiping all DNS records in the process | Safeguards to ensure no automated process can delete or remove an active DNS plan — any plan being actively used as the live record is protected from cleanup regardless of its generation number |
| NLB failover velocity | Network Load Balancers moved large amounts of capacity during AZ failover triggered by the DNS failure, amplifying the cascade | Velocity control mechanism for NLBs to limit how much capacity a single NLB can remove when health check failures cause AZ failover — preventing AZ-level failures from creating region-level capacity evaporation |
| EC2 recovery workflow | EC2's DWFM entered congestive collapse when DynamoDB recovered and the backlogged lease queue overwhelmed the system — a failure mode that had not been tested | Additional test suite to exercise the DWFM recovery workflow at scale — catching congestive collapse scenarios before they happen in production rather than discovering them during outage recovery |
| Automatic failover during recovery | Failover automation flip-flopped during recovery, requiring manual disabling before stabilization could occur | Review of failover automation behavior during degraded DNS states — automation must detect the difference between 'service is down' and 'DNS is inconsistent during recovery' and respond differently to each |
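The NLB velocity control in the table can be pictured as a removal budget. This is a hedged sketch of the general idea only: the function name, the 20 percent default, and the node identifiers are assumptions, and AWS has not published the actual mechanism or its thresholds.

```python
from typing import List, Set


def nodes_to_remove(in_service: List[str], failing: Set[str],
                    max_percent: int = 20) -> List[str]:
    """Pick failing nodes to remove, capped at a share of current capacity."""
    # Integer arithmetic avoids floating-point surprises in the budget.
    budget = len(in_service) * max_percent // 100
    candidates = [node for node in in_service if node in failing]
    return candidates[:budget]


fleet = [f"node-{i}" for i in range(10)]

# A single bad node is removed as usual.
print(nodes_to_remove(fleet, {"node-3"}))  # prints: ['node-3']

# If health checks fail everywhere at once, only the budget is removed.
print(nodes_to_remove(fleet, set(fleet)))  # prints: ['node-0', 'node-1']
```

The design choice is to treat "everything is failing at once" as more likely a broken health signal than a broken fleet, and to keep most capacity in service until a human or a slower process confirms otherwise.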
⚠️ The Unstated Root Cause: The Architecture of Trust in US-EAST-1
AWS's post-mortem addressed the technical race condition correctly. But the incident exposed a deeper architectural problem that no single fix resolves: the internet's implicit trust in US-EAST-1. AWS designed US-EAST-1 as a region — a geographic cluster of data centers meant to be one of many independently redundant deployment targets. Over 20 years, it became something else: the default region for millions of applications, the region where foundational services were first available, and the region that even 'multi-region' architectures often quietly depend on for authentication, metadata, or coordination. Ring cameras depend on it for authentication. Venmo depends on it for payment processing. UK government services depend on it for API calls. None of these dependencies were meant to create a single point of failure. But that's what they became.
The Test Coverage Gap: You Can't Fully Test Massive Scale Without Massive Scale
One of the most honest admissions in AWS's post-incident report is about test coverage. The DWFM recovery workflow — the path EC2's Droplet Workflow Manager follows when it needs to process a massive backlog of expired leases after a DynamoDB outage — had not been adequately tested at the scale required to discover the congestive collapse failure mode. AWS's response is to build additional test suites specifically for this recovery workflow. But the admission surfaces a fundamental challenge of hyperscale infrastructure: the failure conditions that matter most are the ones that only occur at scale, and at scale, test environments are approximations of production, not replicas of it. The only complete test of how AWS's systems behave during a DynamoDB outage recovery is an actual DynamoDB outage. This is the same insight that drove Netflix to build Chaos Monkey — except that for a cloud provider, you cannot deliberately cause a DynamoDB outage to test the recovery path.
THE HIDDEN CROSS-REGION DEPENDENCY PROBLEM
The October 2025 outage adds to a body of evidence about a specific architectural anti-pattern: regions that are called independent but aren't. AWS regions were designed with the premise that a failure in US-EAST-1 should not affect services running in EU-WEST-1 or AP-SOUTHEAST-1. But control-plane dependencies (authentication services, metadata stores, quota management systems) create invisible cross-region ties. When the control plane fails in one region, services in other regions that depend on that control plane for any operation fail with it.
True regional independence requires not just deploying application code in multiple regions, but ensuring that every control-plane dependency is also independently redundant per region. For most organizations building on cloud infrastructure, this is not the architecture they have — it is the architecture they think they have.
Architecture
The October 2025 DynamoDB outage is a case study in what distributed systems engineers call a control-plane failure — a class of failure that is categorically more damaging than a data-plane failure because it removes the ability to manage and coordinate infrastructure rather than just disrupting one service. To understand why the failure cascaded so far and recovered so slowly, you need to understand the three layers of the failure: the DNS automation race condition, the DynamoDB control-plane dependency web, and EC2's Droplet Workflow Manager congestive collapse.
The DNS Race Condition: Step-by-Step
sequenceDiagram
participant Planner as DNS Planner
participant EnactorA as Enactor A (slow)
participant EnactorB as Enactor B (fast)
participant Route53 as Route53 DNS
participant Cleanup as Cleanup Job
Planner->>EnactorA: Assign Plan #50
Planner->>EnactorB: Assign Plan #100 (newer)
Note over EnactorA: Early staleness check passes for Plan #50 ✓ (result goes stale during the delay)
EnactorB->>Route53: Apply Plan #100 rapidly ✓
EnactorB->>Cleanup: Trigger: delete plans much older than #100
EnactorA->>Route53: Finally apply Plan #50 (overwrites #100!)
Cleanup->>Route53: Delete Plan #50 (it's old!)
Note over Route53: ALL DynamoDB DNS records deleted
Note over Route53: 11:48 PM PDT — total DNS blackout
The Cascade: How DynamoDB's DNS Failure Propagated
flowchart TD
dns_gone["DynamoDB DNS Records Deleted\n11:48 PM PDT"]
dns_gone --> dynamo_dark["DynamoDB Unreachable\n(no IP to connect to)"]
dynamo_dark --> ec2_fail["EC2 Instance State Tracking Fails\n(Droplet Workflow Manager can't write state)"]
dynamo_dark --> iam_fail["IAM Policy Evaluation Fails\n(can't retrieve access policies)"]
dynamo_dark --> lambda_fail["Lambda Execution State Fails"]
dynamo_dark --> sts_fail["STS Token Validation Fails"]
ec2_fail --> dwfm_backlog["DWFM Accumulates Backlog\nof Expired Instance Leases"]
iam_fail --> auth_broken["All API Authentication Broken\nacross dependent services"]
auth_broken --> snapchat["Snapchat ↓"]
auth_broken --> fortnite["Fortnite ↓"]
auth_broken --> ring["Ring Cameras ↓"]
auth_broken --> venmo["Venmo ↓"]
dwfm_backlog --> congestive["Congestive Collapse\nwhen DynamoDB recovered\n(backlog overwhelms recovered DB)"]
congestive --> extra_12hr["12+ Additional Hours\nfor EC2 state recovery"]
style dns_gone fill:#ef4444,color:#ffffff
style congestive fill:#f59e0b,color:#000000
WHY US-EAST-1 BECAME A SINGLE POINT OF FAILURE FOR THE INTERNET
AWS designed its regions to be independently operable — a failure in US-EAST-1 should not affect EU-WEST-1. This design intention is correct, but the reality that emerged over 20 years is different. US-EAST-1 is where AWS first launched most services, so it accumulated the most mature feature sets. It became the default — the region developers reach for first, the region enterprises trust most deeply. Over time, even architectures claiming multi-region resilience often retain quiet dependencies on US-EAST-1 for authentication flows, control-plane coordination, or foundational database calls.
The technical independence of regions is real. The operational independence, as experienced during the October 2025 outage, is not.
⚠️ The Automatic Failover Anti-Pattern
One of the most practically instructive moments of the October 2025 outage was the decision to manually disable automatic failover to allow recovery to proceed. The automatic failover systems — designed to improve availability — were detecting the DNS inconsistency as failures and triggering failovers, which created new inconsistencies, which triggered more failovers. The automation was creating a feedback loop that prevented stabilization. Engineers had to turn it off to let the system reach a stable state. The lesson: automatic recovery systems need to distinguish between 'the service is down' (trigger failover) and 'the DNS state is inconsistent during manual recovery' (pause failover until DNS is stable). Automation that cannot make this distinction can prevent recovery faster than it enables it.
Lessons
The October 2025 DynamoDB outage is one of the most technically instructive incidents in cloud computing history — not because the root cause was complex, but because it was so simple, and yet it cascaded so far. A race condition in a cleanup job. The most consequential bug is often the one you're sure you've already solved.
01
Staleness checks must be evaluated at time of use, not time of pickup. Enactor A's staleness check was valid when it ran. By the time Enactor A acted on the result, the check was stale. In any concurrent system where state changes between the check and the action, the check must be re-evaluated immediately before the action, not cached from a prior point in the workflow. This is a time-of-check to time-of-use (TOCTOU) race, one of the oldest race condition patterns in computer science, appearing in production at AWS scale.
02
No automated process should be able to delete an active record. The cleanup job's design — delete plans that are significantly older than the most recently applied plan — had no protection for the case where an older plan was actively in use as the live DNS record. The invariant that must be protected: the record currently resolving live traffic cannot be deleted by any automated process, regardless of its generation number. This invariant is simpler than the cleanup logic that violated it.
03
Congestive collapse is a failure mode that only appears at scale — and the recovery path for it must be tested before it's needed. EC2's DWFM had never been tested through the scenario of processing a massive backlog of expired leases simultaneously after a DynamoDB recovery. The scenario seemed unlikely enough to skip in testing, and specific enough that staging environments couldn't reproduce it. Building the test suite that exercises recovery workflows at production scale is the investment that pays off only in disasters — but those are exactly the moments when it matters most.
04
Multi-region architecture must be evaluated not just for application code but for control-plane dependencies. Ring cameras deployed globally still authenticated against US-EAST-1 IAM. UK government services deployed in EU regions still made US-EAST-1 API calls. True regional independence requires independently redundant control planes, not just independently deployed application code.
05
Sometimes, the recovery automation has to stop before recovery can start. The engineers who manually disabled automatic failover to stabilize the system were making the right call — but it required human judgment to recognize that the automation was making things worse rather than better. Build your recovery playbooks to include the question: 'Is any automated system currently making this worse?' The answer is occasionally yes, and having a clear path to pause automation is as important as having automation in the first place.
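Lessons 01 and 02 translate into two small invariants: check and write atomically, and never delete the plan that is live. The sketch below is illustrative only. `PlanStore` and its methods are hypothetical names, and the in-process lock stands in for whatever conditional write or transaction a real DNS control plane would use.

```python
import threading
from typing import Dict, List, Optional


class PlanStore:
    """Toy plan store enforcing both invariants; all names are hypothetical."""

    def __init__(self, plans: Dict[int, List[str]], active: Optional[int] = None) -> None:
        self._lock = threading.Lock()
        self.plans = dict(plans)
        self.active = active

    def apply_if_newer(self, generation: int) -> bool:
        """Lesson 01: the check and the write happen under one lock."""
        with self._lock:
            if generation not in self.plans:
                return False
            if self.active is not None and generation <= self.active:
                return False  # stale plan rejected at write time
            self.active = generation
            return True

    def cleanup(self, newest: int, keep: int = 10) -> List[int]:
        """Lesson 02: old plans are deleted, but never the active one."""
        with self._lock:
            doomed = [g for g in self.plans
                      if g < newest - keep and g != self.active]
            for g in doomed:
                del self.plans[g]
            return sorted(doomed)

    def resolve(self) -> List[str]:
        with self._lock:
            return list(self.plans.get(self.active, []))


plans = {40: ["10.0.0.1"], 50: ["10.0.0.2"], 100: ["10.0.0.3"]}

store = PlanStore(plans, active=40)
print(store.apply_if_newer(100))  # prints: True
print(store.apply_if_newer(50))   # prints: False
print(store.cleanup(newest=100))  # prints: [40, 50]
print(store.resolve())            # prints: ['10.0.0.3']
```

The two guards are independent on purpose. Even if a stale write somehow slipped through and plan 50 became active, `cleanup(newest=100)` would skip it, and the endpoint would keep resolving to old but valid records instead of nothing.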
✅ What Good Regional Independence Actually Looks Like
The October 2025 outage drew a clear line between companies that had genuine multi-region independence and those that believed they did. Genuine independence requires: application code deployed in at least two regions; authentication, authorization, and metadata services independently operational per region; no synchronous cross-region API calls in the critical path; tested failover that has been exercised under real load; and runbooks that don't assume a specific region is available. The companies whose services stayed up during the October 2025 outage weren't lucky. They had made specific architectural decisions years earlier — decisions that cost money and engineering time — that happened to be exactly the right decisions.
THE PRACTICAL RESPONSE FOR EVERY ENGINEERING TEAM
The October 2025 AWS outage has a direct implication for every team running production workloads on cloud infrastructure.
Map your US-EAST-1 dependencies before the next outage, not during it. Specifically: identify every service your application calls that is hosted in US-EAST-1, even if your application code is deployed elsewhere. This includes authentication providers, CDN origins, third-party APIs, and internal microservices. For each dependency, ask: 'If this endpoint returned no DNS records for three hours, what would our users experience?' The answer to that question is your actual blast radius for a US-EAST-1 control-plane failure — and it is almost certainly larger than your architecture diagram suggests.
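A first pass at the dependency map can be automated by scanning the endpoints your services are configured to call. The sketch below is a rough heuristic under stated assumptions: it only catches hostnames that name their region, the `GLOBAL_ENDPOINTS` set reflects the assumption that the global STS and IAM endpoints are anchored in us-east-1 (verify against current AWS documentation), and the example URLs are placeholders. Third-party APIs that happen to be hosted in us-east-1 will not show up here and have to be confirmed with the vendor.

```python
import re
from typing import Iterable, List
from urllib.parse import urlparse

# Matches region-style labels such as us-east-1 or ap-southeast-2.
REGION_RE = re.compile(r"\b([a-z]{2}(?:-gov)?-[a-z]+-\d)\b")

# Assumption: these global endpoints depend on us-east-1.
GLOBAL_ENDPOINTS = {"sts.amazonaws.com", "iam.amazonaws.com"}


def us_east_1_dependencies(urls: Iterable[str]) -> List[str]:
    """Return the URLs whose hostname points at us-east-1, by name or by assumption."""
    flagged = []
    for url in urls:
        host = (urlparse(url).hostname or url).lower()
        match = REGION_RE.search(host)
        names_region = match is not None and match.group(1) == "us-east-1"
        if names_region or host in GLOBAL_ENDPOINTS:
            flagged.append(url)
    return flagged


deps = [
    "https://dynamodb.us-east-1.amazonaws.com",
    "https://dynamodb.eu-west-1.amazonaws.com",
    "https://sts.amazonaws.com",
    "https://api.example.com/v1",
]
for url in us_east_1_dependencies(deps):
    print(url)
```

Each flagged URL is a candidate for the three-hour question above. An unflagged URL is not proof of independence, only a sign that the hostname does not reveal where it is served from.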
Amazon Web Services runs infrastructure at a scale where the cost of a single race condition is measured in hundreds of millions of dollars and 375 million users unable to send a Snapchat. The race condition itself (two processes trying to update the same state concurrently, with a stale check allowing a stale write) is the kind of bug that appears in computer science textbooks under 'concurrent programming gotchas.' The lesson isn't that AWS made an obvious mistake. The lesson is that obvious mistakes at sufficient scale have non-obvious consequences, and the gap between 'finding the bug' and 'recovering from the bug' was twelve hours wide.
TechLogStack: built at scale, broken in public, rebuilt by engineers