AWS DynamoDB Outage 2025: DNS Race Condition Explained

The Story

To understand the October 2025 AWS outage, you first need to understand how DynamoDB manages its DNS. AWS runs three independent DNS Enactor processes for DynamoDB in us-east-1, one in each of three Availability Zones (us-east-1a, us-east-1b, us-east-1c). Each Enactor independently monitors DynamoDB's health and generates a 'DNS plan': a specification of which IP addresses should be registered under dynamodb.us-east-1.amazonaws.com. The design intent is redundancy: if one AZ fails, the other two Enactors continue managing DNS correctly. This design had worked for years. On October 20, 2025, it failed in a way the designers had not anticipated.

The failure mechanism was a race condition between a slow Enactor and a garbage collection mechanism. DynamoDB's DNS system uses a 'keep last N plans' garbage collection model — it retains recent DNS plans and deletes older ones. The deletion logic has one critical constraint: it must never delete a plan that is currently active — that is, currently serving live traffic. One of the three DNS Enactors, running in us-east-1a, became extremely slow. It fell hours behind the other two Enactors. The garbage collector, running across all three AZs, began deleting old plans — including a plan that the slow us-east-1a Enactor still considered active. The Enactor's active plan was garbage-collected out from under it. The Enactor then wrote an empty DNS record — no IP addresses — for the DynamoDB regional endpoint.

An empty DNS record for dynamodb.us-east-1.amazonaws.com means that any system trying to connect to DynamoDB in us-east-1 by its standard regional endpoint gets back a DNS response with no IP addresses. The connection fails at DNS resolution, before any TCP connection is even attempted. This is categorically different from a DynamoDB service failure — the DynamoDB service itself was running normally. The databases were up. The read and write handlers were responding. But no new connection could be established because the DNS record that should point to those handlers was empty. DynamoDB was down from the DNS layer, not from the database layer.
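
The distinction between a DNS-layer failure and a service-layer failure can be made concrete from the client side. The following is a minimal sketch, not AWS client code: `classify_lookup` and `system_resolver` are hypothetical helper names, and the resolver is injectable so the empty-record case can be simulated without touching the network. It assumes that an empty answer and a resolver error are both treated as a DNS failure, since in either case no TCP connection is ever attempted.

```python
import socket
from typing import Callable, List

# A resolver maps a hostname to a list of IP address strings.
Resolver = Callable[[str], List[str]]


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname with the operating system's resolver."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def classify_lookup(hostname: str, resolver: Resolver = system_resolver) -> str:
    """Return 'resolved' if at least one address comes back, else 'dns_failure'."""
    try:
        addresses = resolver(hostname)
    except socket.gaierror:
        # The OS resolver usually reports "no addresses" as an error.
        return "dns_failure"
    if not addresses:
        # An empty record set: the name exists, but there is nothing to connect to.
        return "dns_failure"
    return "resolved"
```

For example, `classify_lookup("dynamodb.us-east-1.amazonaws.com", lambda host: [])` reports a DNS failure even though nothing in the sketch has tried to reach the database itself.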

DYNAMODB IS THE CONTROL PLANE FOR AWS'S CONTROL PLANE

The blast radius of DynamoDB's DNS failure was enormous because of a second architectural property: DynamoDB is a critical dependency for many AWS control plane services. When a customer launches a new EC2 instance, the control plane needs to look up configuration in DynamoDB. When Lambda invokes a function, state management uses DynamoDB. When Fargate schedules a container, metadata lives in DynamoDB. All of these services connect to DynamoDB via the public regional endpoint that was now returning empty DNS. The DynamoDB failure cascaded into EC2 instance launch failures, Lambda invocation failures, and Fargate task launch failures — each of which had its own downstream customers whose services were now degraded.

Problem

3:11 AM ET: DynamoDB DNS Resolution Failures Begin

At approximately 3:11 AM ET on October 20, 2025, AWS's monitoring systems detected increased error rates and latency on DynamoDB in us-east-1. Customers began experiencing DNS resolution failures for dynamodb.us-east-1.amazonaws.com. Internally, the trigger was the DNS record for the DynamoDB regional endpoint being written as empty — no IP addresses — by the slow us-east-1a DNS Enactor. DynamoDB itself was operational; the failure was entirely at the DNS layer. But the effect was indistinguishable from DynamoDB being fully down.

Cause

Race Condition Between Slow Enactor and Garbage Collector

The cause was a latent race condition in the DNS management system. One of three redundant DNS Enactors (running in us-east-1a) became extremely slow, falling hours behind its peers. The garbage collection system, which uses a 'keep last N DNS plans' model, deleted an old plan that the slow Enactor still considered active. The slow Enactor, finding its active plan garbage-collected, wrote an empty DNS record to replace it. The condition had existed in the codebase undetected because triggering it required a specific combination of extreme Enactor latency (well beyond normal variance) and unlucky garbage collection timing. AWS's postmortem described the root cause as a 'latent race condition in the DynamoDB DNS management system.'

Solution

DNS Record Corrected; Cascade Impact Managed Service by Service

AWS engineers identified the empty DNS record and corrected it, restoring DynamoDB connectivity. However, the cascade effects — services that had failed to initialize or had queued work during the DNS outage — required individual recovery across EC2, Lambda, Fargate, and other dependent services. Each service that depended on DynamoDB had its own recovery timeline as it worked through queued requests and re-established connections. Customers with DynamoDB global tables were able to connect to replica tables in other regions throughout the outage but experienced prolonged replication lag. The AWS Support Center — which had failed over to another region as designed — encountered its own secondary failure when a metadata subsystem blocked legitimate users from accessing it.
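
The global tables observation suggests a client-side mitigation: if the home region's endpoint does not resolve, try a replica region. The sketch below is one possible shape for that logic, under stated assumptions: the table is a global table with replicas in every listed region, the function names are hypothetical, and the resolver is passed in so the behavior can be exercised offline. Reads from a replica may be stale while replication lag is elevated, which this sketch does not address.

```python
from typing import Callable, List, Optional, Sequence

# A resolver maps a hostname to a list of IP address strings.
Resolver = Callable[[str], List[str]]


def regional_endpoint(region: str) -> str:
    """Build the public regional DynamoDB hostname for a region."""
    return f"dynamodb.{region}.amazonaws.com"


def first_resolvable_region(regions: Sequence[str], resolver: Resolver) -> Optional[str]:
    """Return the first region whose endpoint resolves to at least one address."""
    for region in regions:
        try:
            if resolver(regional_endpoint(region)):
                return region
        except OSError:
            # socket.gaierror is a subclass of OSError; treat it as "try the next region".
            continue
    return None
```

Ordering `regions` with the home region first keeps normal traffic local and only shifts to a replica when resolution actually fails.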

Result

Cascading Failures Across EC2, Lambda, Fargate; 3-15 Hours Customer Impact

DynamoDB itself recovered within a few hours of the initial alert, but cascading effects on dependent services meant some customers experienced degradation for significantly longer — some reported impact lasting up to 15 hours. Affected end-user applications included Snapchat, Fortnite, Duolingo, Signal, Ring security cameras, and many banking and financial services. AWS released a detailed postmortem within 3 days of the incident — notably faster than the 4-month wait following a similarly large 2023 event. Long-term fixes included architectural changes to the DNS Enactor system to prevent the garbage collection race condition.

The Failure Mode That Redundancy Didn't Cover

The most technically interesting aspect of the failure is that the redundancy worked correctly for the intended failure mode but failed for an unexpected failure mode. The three DNS Enactors were designed to be independent so that a single AZ failure wouldn't break DNS management. An AZ failure would take one Enactor offline entirely, but the other two would continue. The designers had not modeled the scenario where one Enactor remained running but became extremely slow — slow enough to fall hours behind the garbage collector's expectations of what constituted an 'old' plan. Being partially functional was worse than being offline, because the slow-but-running Enactor had enough authority to write the empty DNS record, but not enough speed to avoid having its active plan garbage-collected first.

The AWS postmortem drew a distinction that became central to the community discussion that followed: the DynamoDB service itself maintained availability. If customers had used DynamoDB's private VPC endpoints rather than the public regional endpoint, they might not have been affected — private endpoints don't rely on the same DNS resolution path. But the vast majority of customers — and AWS's own control plane services — used the public regional endpoint by default. The question of whether AWS's internal services should be more insulated from their own managed service endpoints is one the postmortem implicitly raises without fully answering.

The Fix

The Race Condition AWS Fixed and the Broader DNS Architecture Question It Raised

Fixing the race condition in the DynamoDB DNS Enactor required changing the interaction between the garbage collection system and the concept of 'active plan.' The core issue was that garbage collection made decisions about which plans were safe to delete based on age, without fully verifying whether any running Enactor still considered that plan active. The fix required adding a coordination mechanism that ensures no plan is garbage-collected while any Enactor claims it as its current active plan — regardless of how far behind that Enactor has fallen.

Empty

DNS record that caused the outage — dynamodb.us-east-1.amazonaws.com returned zero IP addresses, making all new connections fail at DNS resolution before any TCP connection was attempted

3 AZs

Independent DNS Enactors (one per AZ) — designed for AZ-level failure tolerance but not for the scenario where one Enactor remained running while falling hours behind peers

VPC

Endpoint type that may have avoided the failure: customers using DynamoDB's private VPC endpoints (rather than the public regional endpoint) had a different DNS resolution path and experienced different or no impact

3 days

Time to AWS postmortem publication — notably faster than the 4-month delay following a comparably large AWS outage in 2023, demonstrating improved incident transparency practices

Comparison table
Component	Before (Vulnerable State)	After (Race Condition Fixed)
DNS Enactor coordination	Three independent Enactors with no cross-AZ plan ownership tracking	Enactors register active plan ownership; GC cannot delete a plan while any Enactor claims it
Garbage collection safety check	Age-based deletion — 'keep last N plans' without verifying if any Enactor still references an old plan	Reference-counting or heartbeat-based check — plan cannot be deleted while any registered Enactor has it as active
Enactor latency detection	No alarm on Enactor falling hours behind peers — treated as normal variance	Alert triggers when any Enactor's lag exceeds a safety threshold; engineers investigate before lag reaches dangerous levels
Regional endpoint DNS path	All traffic (customer + internal AWS services) via public regional endpoint	Internal AWS control plane services migrated to use isolated endpoints separate from the public regional endpoint
DynamoDB global tables failover	Available in other regions during us-east-1 DNS failure, but with prolonged replication lag	Replication lag monitoring improved; global table failover more clearly documented for customers as a mitigation path
VPC endpoint adoption	Optional feature, low adoption	AWS now more strongly recommends VPC endpoints for production workloads; internal services migrated off public endpoints

python

# BEFORE: The vulnerable DNS Enactor pseudocode
# Three instances run independently. GC uses age-based deletion.
# Illustrative only: `db`, POLL_INTERVAL, KEEP_LAST_N, generate_dns_plan()
# and write_to_dns() are placeholders, not real AWS interfaces.

from time import sleep

class DnsEnactor:
    def __init__(self, az: str):
        self.az = az
        self.active_plan_id = None

    def run(self):
        while True:
            new_plan = self.generate_dns_plan()  # compute current desired DNS state
            self.active_plan_id = new_plan.id
            self.write_to_dns(new_plan)  # atomically update DNS record
            sleep(POLL_INTERVAL)

class GarbageCollector:
    def collect_old_plans(self):
        all_plans = db.get_all_plans()
        plans_to_keep = all_plans[-KEEP_LAST_N:]  # keep recent N plans
        
        # BUG: No check whether a slow Enactor still holds an old plan as 'active'
        # If us-east-1a Enactor is hours behind, its active_plan_id might point
        # to a plan that is now outside the KEEP_LAST_N window.
        for plan in all_plans[:-KEEP_LAST_N]:
            db.delete_plan(plan.id)  # ← Deletes enactor's active plan!
        
        # After deletion, slow Enactor wakes up, sees active plan missing:
        # → writes an empty DNS record as a 'safe' default
        # → dynamodb.us-east-1.amazonaws.com → []

# AFTER: Coordinated deletion with ownership tracking

class DnsEnactor:
    def __init__(self, az: str):
        self.az = az
        self.active_plan_id = None

    def run(self):
        while True:
            new_plan = self.generate_dns_plan()
            # Register ownership before activating
            db.claim_plan_ownership(self.az, new_plan.id)
            self.active_plan_id = new_plan.id
            self.write_to_dns(new_plan)

            # Report lag every cycle so an alert can fire before the GC
            # window overtakes this Enactor's active plan
            self.check_lag_and_alert()
            sleep(POLL_INTERVAL)

class GarbageCollector:
    def collect_old_plans(self):
        # Check ownership table before deleting
        all_plans = db.get_all_plans()
        active_plan_ids = db.get_all_active_plan_ids()  # registered by all enactors
        
        for plan in all_plans[:-KEEP_LAST_N]:
            if plan.id not in active_plan_ids:  # ← Safe to delete
                db.delete_plan(plan.id)
            else:
                # Alert: plan outside retention window is still active
                # This means an Enactor is extremely slow — investigate
                alert_on_call('Enactor lag: plan outside GC window still active')
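
The pseudocode above depends on placeholders and loops forever, so it cannot be run as written. The following self-contained sketch replays the same scenario against an in-memory store to show the difference between the two deletion policies. Everything here is an illustrative assumption: `PlanStore`, `build_store`, the retention window of 3, and the plan numbering are invented for the example and say nothing about how AWS stores plans.

```python
from typing import Dict, List

KEEP_LAST_N = 3  # hypothetical retention window


class PlanStore:
    """In-memory stand-in for the plan database (hypothetical)."""

    def __init__(self) -> None:
        self.plans: Dict[int, List[str]] = {}  # plan id -> IP addresses
        self.claims: Dict[str, int] = {}       # AZ -> plan id it holds as active


def build_store() -> PlanStore:
    """Five plans exist; the us-east-1a Enactor is still on plan 1."""
    store = PlanStore()
    for plan_id in range(1, 6):
        store.plans[plan_id] = [f"192.0.2.{plan_id}"]
    store.claims = {"us-east-1a": 1, "us-east-1b": 5, "us-east-1c": 5}
    return store


def stale_plan_ids(store: PlanStore) -> List[int]:
    """Plan ids that fall outside the newest KEEP_LAST_N plans."""
    return sorted(store.plans)[:-KEEP_LAST_N]


def collect_by_age(store: PlanStore) -> List[int]:
    """Vulnerable policy: delete every stale plan, claimed or not."""
    deleted = stale_plan_ids(store)
    for plan_id in deleted:
        del store.plans[plan_id]
    return deleted


def collect_with_claims(store: PlanStore) -> List[int]:
    """Coordinated policy: skip any stale plan an Enactor still claims."""
    claimed = set(store.claims.values())
    deleted = [pid for pid in stale_plan_ids(store) if pid not in claimed]
    for plan_id in deleted:
        del store.plans[plan_id]
    return deleted
```

With the age-based policy, plan 1 is deleted while us-east-1a still claims it. With the claim-aware policy, only plan 2 is removed and the slow Enactor's plan survives until it moves on.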

THE REAL FIX: MONITORING THE LEADING INDICATOR, NOT JUST THE OUTCOME

The most counterintuitive lesson from the architectural fix is that the problem was not primarily the race condition itself. It was the assumption that an Enactor falling hours behind was a normal operational condition to handle silently. If a health check had fired when the us-east-1a Enactor's lag exceeded a safe threshold, engineers could have intervened before the garbage collector deleted the active plan. The race condition required both an extremely slow Enactor and unlucky garbage collection timing, so monitoring Enactor lag as a leading indicator could have kept the race from being triggered even before the fix to the GC logic was deployed.
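
A lag check of this kind can be very small. The sketch below assumes lag is measured in plan generations (how many plans behind the newest one an Enactor is) rather than wall-clock time, and that each Enactor reports the id of the plan it last applied. Both are assumptions made for illustration; `lagging_enactors` is a hypothetical name.

```python
from typing import Dict, List


def lagging_enactors(latest_plan_by_az: Dict[str, int], max_lag: int) -> List[str]:
    """Return the AZs whose Enactor is more than max_lag plans behind the newest."""
    if not latest_plan_by_az:
        return []
    newest = max(latest_plan_by_az.values())
    return sorted(
        az for az, plan_id in latest_plan_by_az.items()
        if newest - plan_id > max_lag
    )
```

Choosing `max_lag` smaller than the garbage collector's retention window means the alert fires while the slow Enactor's plan is still inside the window, before it becomes eligible for deletion.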

Architecture

The DNS Enactor architecture and its failure mode are best understood as a distributed coordination problem. Three processes, running independently, each making autonomous decisions about DNS state. When one process slows dramatically, the other two — and the GC system — continue operating without knowing the slow process has an in-flight stake in a plan they're about to delete.

Why us-east-1 Failures Hit So Hard — and So Broadly

The blast radius reveals a critical architectural property of AWS us-east-1: because DynamoDB is used by so many of AWS's own control plane services, a failure in DynamoDB's DNS does not just affect customers using DynamoDB. It affects customers using EC2, Lambda, Fargate, and any other service whose control plane operations touch DynamoDB in us-east-1. This is why the community discussion after the outage focused on the risks of hyperscaler concentration: a latent defect in one service's DNS management can propagate outward through the entire platform's dependency graph.
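
Propagation through a dependency graph can be computed with a breadth-first walk over reversed edges. The sketch below uses a deliberately tiny, hypothetical graph built from the services named in this article; the node names and edges are illustrative and are not a map of AWS's real internal dependencies.

```python
from collections import deque
from typing import Dict, List, Set

# service -> services it depends on (hypothetical, for illustration only)
DEPENDS_ON: Dict[str, List[str]] = {
    "dynamodb": ["dynamodb-dns-record"],
    "ec2-launch": ["dynamodb"],
    "lambda-invoke": ["dynamodb"],
    "fargate-launch": ["dynamodb"],
    "customer-app": ["lambda-invoke"],
}


def blast_radius(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every service that depends, directly or transitively, on `failed`."""
    # Invert the edges: dependency -> services that rely on it.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

In this toy graph, failing `dynamodb-dns-record` reaches all five other nodes, while failing `lambda-invoke` reaches only `customer-app`. The closer a defect sits to the root of the graph, the wider its reach.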

Lessons

The October 2025 AWS DynamoDB DNS outage is among the most technically complex major cloud incidents of 2025. The root cause was not exotic, but the chain of dependencies (DNS Enactor, empty record, DynamoDB connection failures, control plane cascade) is long and hard to reason about statically. The lessons are about distributed system design, dependency isolation, and the limits of redundancy.

What to remember

Redundancy only protects against the failure modes it was designed for. Three independent DNS Enactors protected against AZ-level failure — if us-east-1a went offline, the other two continued. But one Enactor falling hours behind while still running was a failure mode the designers had not specifically tested. Redundancy without comprehensive failure mode analysis creates false confidence: your system is resilient against the scenarios you imagined, not against the scenarios you haven't imagined yet.
Garbage collection must never delete resources that any process in the distributed system still considers active. This is a classic distributed systems coordination problem: the 'keep last N' model assumes all processes have roughly similar views of what is recent. When one process falls dramatically behind, it can hold references to resources that appear old to the GC. The solution — requiring explicit ownership registration before any resource can be GC'd — is well-understood in distributed memory management but had not been applied to this DNS plan lifecycle.
Monitor the leading indicator, not just the outcome. The us-east-1a Enactor's extreme lag was visible hours before the race condition triggered. If an alert had fired when lag exceeded a safety threshold, engineers could have investigated and potentially intervened before the GC deleted the active plan. Many failure cascades have an early-warning signal that is measurable but not monitored — because the system designers did not predict the failure path the signal was warning about.
Internal platform services should not depend on the same public endpoint that customers use. AWS's own control plane services — EC2, Lambda, Fargate — connected to DynamoDB via the same public regional endpoint (dynamodb.us-east-1.amazonaws.com) that experienced the DNS failure. This meant the DynamoDB DNS failure cascaded into control plane failures for services that had no direct relationship with DynamoDB from a customer's perspective. Isolating internal service-to-service communication from the public data plane reduces the blast radius of DNS failures.
A service-level failover that succeeds can still fail at the system level. The AWS Support Center failed over to another region as designed — it worked exactly as planned. But a metadata subsystem responsible for account verification in the backup region began blocking legitimate users who needed to file support tickets during the outage. The failover success was real, but a secondary dependency that was not stress-tested under the conditions of a major outage created a new failure mode in the recovery path.

The us-east-1 Concentration Risk Persists

The most significant industry consequence of the October 2025 outage was the renewed discussion about over-reliance on us-east-1. Cloud economist Corey Quinn had noted after the 2021 AWS outage that 'you cannot have a multi-region failover strategy on AWS that features us-east-1. Too many things apparently single-track through that region for you to be able to count on anything other than total control-plane failure when that region experiences a significant event.' The 2025 DynamoDB DNS outage confirmed this assessment had not yet been fully addressed. The most popular cloud region in the world remains a concentration risk — and its concentration of internal AWS dependencies makes customer-facing blast radius larger, not smaller, with each passing year.

DynamoDB promised 99.999% availability. On October 20, 2025, DynamoDB kept that promise: the service was running fine. The DNS record pointing to it was just empty. Which, for the millions of customers who couldn't connect to it, was exactly the same as it being down.
