The Story

At 9:37 AM PST on February 28th, 2017, an authorized member of the Amazon S3 team in US-EAST-1 was debugging a problem with the S3 billing system, which had been processing more slowly than expected. Using an established playbook -- a documented, pre-approved operational procedure -- the engineer ran a command intended to remove a small number of servers from one of the subsystems that supports billing. One of the inputs to that command was entered incorrectly. Instead of removing a handful of servers, the command removed a much larger set. Within minutes, two of S3's most load-bearing subsystems had lost so much capacity that both required a full restart -- something AWS had not had to do at that scale in years.

S3 underpins far more of AWS than object storage. During the outage, the S3 console, new EC2 instance launches, EBS volumes that needed to read from an S3 snapshot, and AWS Lambda were all degraded or unavailable in US-EAST-1 -- because each one quietly depends on S3 being reachable. So did hundreds of customer sites: Slack, Trello, Quora, Docker Hub, and many more all reported errors within the same window.

THE INSIGHT: TWO SUBSYSTEMS, ONE DEPENDENCY CHAIN

The servers removed by mistake supported two other subsystems beyond billing. The index subsystem manages the metadata and location of every object in the region; without it, S3 cannot resolve where any object actually lives. The placement subsystem allocates storage for new objects and depends on the index subsystem being healthy first. Lose enough capacity from both at once, and S3 doesn't degrade gracefully -- it stops answering GET, LIST, PUT, and DELETE requests alike, because every request type needs the index, and the dependency runs in one direction: placement waits on index, and nothing waits on placement.
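The dependency described above can be sketched as a small model. This is an illustration only, not AWS's implementation: the names `REQUIRES`, `DEPENDS_ON`, `effective_health`, and `servable_requests` are hypothetical, and the model assumes every request type needs the index subsystem while only PUT also needs placement.

```python
# Hypothetical model: which request types can be served, given subsystem health.
# Assumption: every request type needs the index; PUT also needs placement.
REQUIRES = {
    "GET": {"index"},
    "LIST": {"index"},
    "DELETE": {"index"},
    "PUT": {"index", "placement"},
}

# One-directional dependency: placement waits on index, nothing waits on placement.
DEPENDS_ON = {"index": set(), "placement": {"index"}}


def effective_health(healthy):
    """Return the subsystems that are actually usable once dependencies apply."""
    usable = set(healthy)
    changed = True
    while changed:
        changed = False
        for name in list(usable):
            # A subsystem is unusable if anything it depends on is unusable.
            if not DEPENDS_ON[name] <= usable:
                usable.discard(name)
                changed = True
    return usable


def servable_requests(healthy):
    """Return the sorted request types that can be served right now."""
    usable = effective_health(healthy)
    return sorted(op for op, needs in REQUIRES.items() if needs <= usable)


if __name__ == "__main__":
    print(servable_requests({"index", "placement"}))
    print(servable_requests({"index"}))
    print(servable_requests({"placement"}))
```

With only the index healthy, the model serves reads and deletes but not writes; with only placement nominally healthy, it serves nothing, because placement is unusable without the index.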

Problem

A Billing Subsystem Was Running Slow

The S3 billing system had been processing more slowly than expected for some time. An authorized engineer began a routine debugging session to investigate, using a pre-approved playbook for removing a small number of servers from the affected subsystem to observe behavior under reduced load.

Cause

One Input, Entered Wrong

At 9:37 AM PST, the engineer executed the playbook command. One parameter, meant to specify a small number of servers, was entered with a much larger value. The tool had no safeguard checking whether the requested removal would take any subsystem below its minimum required capacity, so it executed exactly as instructed.

Solution

Two Full Subsystem Restarts, From Scratch

With insufficient capacity remaining, both the index and placement subsystems required a full restart. S3 had not fully restarted these subsystems in its largest regions in years, and S3's growth since then meant the restart, plus the integrity checks needed to validate the metadata, took longer than engineers expected.

Result

Index Recovered by 12:26 PM, Placement by 1:54 PM

The index subsystem began serving GET, LIST, and DELETE requests again at 12:26 PM PST and was fully recovered by 1:18 PM. The placement subsystem, which needed the index subsystem healthy first, finished recovering at 1:54 PM, at which point S3 was operating normally again, roughly four hours and seventeen minutes after the command had run.
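The recovery durations follow directly from the timestamps above. A minimal sketch of the arithmetic, assuming all times are PST on the same day; the helper names `at` and `elapsed` are illustrative only:

```python
from datetime import datetime


def at(clock):
    """Parse a same-day PST clock time such as '9:37 AM' on February 28, 2017."""
    return datetime.strptime(f"2017-02-28 {clock}", "%Y-%m-%d %I:%M %p")


def elapsed(start, end):
    """Return the gap between two datetimes as (hours, minutes)."""
    total_minutes = int((end - start).total_seconds() // 60)
    return divmod(total_minutes, 60)


COMMAND_RUN = at("9:37 AM")

# Milestones taken from the timeline in the text.
MILESTONES = {
    "index begins serving GET, LIST, DELETE": at("12:26 PM"),
    "index fully recovered": at("1:18 PM"),
    "placement recovered, S3 normal": at("1:54 PM"),
}

if __name__ == "__main__":
    for label, moment in MILESTONES.items():
        hours, minutes = elapsed(COMMAND_RUN, moment)
        print(f"{label}: {hours}h {minutes}m after the command")
```

The last milestone works out to 4 hours 17 minutes, matching the figure quoted for the full outage.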

Why the Outage Page Couldn't Load Either

From the start of the event until 11:37 AM PST, AWS could not update individual service statuses on its own Service Health Dashboard, because the dashboard's administration console itself had a dependency on S3. AWS fell back to its Twitter account and a banner message on the dashboard to communicate status. The irony of a status page that depended on the very service it was reporting on played out in public.

Why AWS Hadn't Restarted These Subsystems in Years

S3 is built on the assumption that capacity will occasionally fail, and removing or replacing servers is a routine operational practice the team relies on constantly, in small doses. What hadn't happened in years was losing enough capacity from the index and placement subsystems simultaneously to force a full cold restart of either one. Routine operations at small scale and the same operation at a scale nobody had rehearsed turned out to be very different problems.

THE CORE TECHNICAL INSIGHT

The tool that removed the servers did exactly what it was told -- the failure wasn't a bug in the logic, it was the absence of a guardrail. Nothing in the playbook's tooling checked whether an input would push a subsystem below the minimum capacity it needed to keep functioning. A system can be well-architected for failure and still go down hard, if the operational tooling around it has no concept of a safe floor.

The Fix

Three Changes That Closed the Gap

AWS's response to the outage wasn't a rewrite of S3's architecture -- the underlying design held up. The fix was about constraining how fast and how far a single human mistake could propagate, and about shrinking how long recovery takes when it happens anyway. Three changes did most of the work.

9:37 AM PST: time the flawed command executed, the single point of failure for the entire event
0: minimum-capacity safeguards that existed in the removal tool before this incident; added immediately after
4 hr 17 min: total time from command execution to S3 operating normally again in US-EAST-1
Multi-region: new design for the Service Health Dashboard admin console, so a single region's S3 issue can't blind status reporting again

S3's Capacity-Removal Tooling: Before vs. After the Outage

Property | Before Feb 28, 2017 | After Feb 28, 2017
Minimum-capacity check | None, tool executed any input as given | Removal blocked if it would drop a subsystem below minimum required capacity
Removal speed | Large capacity changes applied immediately | Capacity removed more gradually, with checks between steps
Blast radius per subsystem | A single mistaken input could affect index and placement together | Index subsystem further partitioned into smaller cells to limit blast radius
Status page dependency | SHD admin console depended on S3 in the affected region | SHD admin console runs across multiple AWS regions
Audit scope | Safeguards were specific to this one tool | Other operational tools audited for the same missing safety checks
python
# Illustrative: the class of safeguard AWS added to capacity-removal tooling.
# Not AWS's actual implementation -- this models the missing check described
# in the official post-incident summary. All names below are hypothetical.

SAFE_REMOVAL_BATCH_SIZE = 2  # assumed value, for illustration only


class CapacityGuardrailError(Exception):
    """Raised when a removal would drop a subsystem below its capacity floor."""


def chunk(items, batch_size):
    """Yield successive batches of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def remove_capacity(subsystem, servers_to_remove, current_capacity, min_required_capacity):
    """
    Before: this function executed whatever server list was passed in,
    with no floor check. A typo in `servers_to_remove` had no safety net.

    `subsystem` is assumed to expose `name`, `remove(batch)`, and
    `health_check()`.
    """
    remaining = current_capacity - len(servers_to_remove)

    # The fix: refuse the operation if it would drop the subsystem
    # below the capacity it needs to keep functioning.
    if remaining < min_required_capacity:
        raise CapacityGuardrailError(
            f"Refusing to remove {len(servers_to_remove)} servers from "
            f"{subsystem.name}: would leave {remaining}, below the minimum "
            f"required capacity of {min_required_capacity}."
        )

    # Additional fix: remove capacity in smaller, throttled increments
    # rather than all at once, so a bad input surfaces before full damage.
    for batch in chunk(servers_to_remove, batch_size=SAFE_REMOVAL_BATCH_SIZE):
        subsystem.remove(batch)
        if subsystem.health_check() == "degraded":
            # Stop here: the remaining batches are never issued.
            break

THE COUNTERINTUITIVE PART: THE FIX WASN'T MORE AUTOMATION

It would be easy to assume the lesson here is "automate the human out of the loop." That's not quite what AWS did. The playbook command was already an authorized, established procedure -- the human wasn't doing anything unusual. The fix was adding a guardrail that makes the tool itself refuse unsafe instructions, regardless of who or what issues them. The goal wasn't removing trust in operators; it was making sure no single input, human or scripted, could push a subsystem past a point of no return.

Architecture

The outage is easiest to understand as a single mistaken input propagating through a dependency chain that nobody had load-tested at that scale. Two diagrams make the propagation concrete: what the cascade looked like as it happened, and what the safeguarded version of the same operation looks like today.

[Diagram: The Cascade: One Command to a Region-Wide Outage]

[Diagram: Before vs. After: The Guardrail That Was Missing]

What to Notice in the Cascade

Look at how few nodes actually needed to fail for the blast radius to reach customer sites: just two subsystems, both fed by the same single command. Everything downstream -- EC2, EBS, Lambda, the status dashboard itself, and ultimately Slack and Trello -- was a passenger on that initial failure, not an independent point of weakness. The fix diagram shows the same operation with exactly one new decision point added: a floor check that didn't exist before.
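The cascade can also be expressed as a reachability problem over a dependency graph. The sketch below is a simplification under stated assumptions: the edge list in `DEPENDS_ON` is hypothetical, built only from the services this article names, and every dependency is treated as a hard one, so anything that transitively depends on a failed node counts as affected.

```python
from collections import deque

# Hypothetical edges, read as "key depends on each listed value".
DEPENDS_ON = {
    "s3_placement": ["s3_index"],
    "s3_api": ["s3_index", "s3_placement"],
    "s3_console": ["s3_api"],
    "ec2_launches": ["s3_api"],
    "ebs_snapshot_restores": ["s3_api"],
    "lambda": ["s3_api"],
    "status_dashboard_admin": ["s3_api"],
    "customer_sites": ["s3_api"],
}


def blast_radius(failed, depends_on):
    """Return every service that transitively depends on a failed one."""
    failed = set(failed)

    # Invert the graph: for each service, who depends on it?
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed services.
    affected = set()
    queue = deque(failed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected and dependent not in failed:
                affected.add(dependent)
                queue.append(dependent)
    return affected


if __name__ == "__main__":
    for name in sorted(blast_radius({"s3_index"}, DEPENDS_ON)):
        print(name)
```

In this model, failing the index alone reaches every other node, including the status dashboard's admin console, while failing a leaf such as `customer_sites` reaches nothing. That is the asymmetry the cascade illustrates.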

Lessons

This incident is nearly a decade old, but it's still one of the most-cited reliability case studies in the industry, because the failure mode it exposes -- a well-tested system brought down by an operation that was routine at small scale and untested at large scale -- keeps recurring in new systems every year.

What to remember

  1. Authorize the operation, but verify the input against a hard floor. The command that caused the outage was already an approved, established procedure. The missing control wasn't permission, it was a guardrail checking whether the specific input would leave a subsystem below the minimum capacity it needs to function. Authorization and input validation are two different layers of safety, and you need both.
  2. A routine operation proven at small scale is not thereby validated at large scale. AWS had relied on removing and replacing S3 capacity since launch, but had not fully restarted the index or placement subsystems in its largest regions in years, because nothing had forced that path. An operation you've run a thousand times safely can still be untested at the scale a mistake can trigger.
  3. Dependency direction determines blast radius, not dependency count. The placement subsystem depended on the index subsystem being healthy, but nothing depended on placement. That asymmetry meant a single shared failure point, capacity loss, took down both index-dependent reads and placement-dependent writes simultaneously, rather than degrading independently.
  4. Status pages should never depend on the system they're reporting on. AWS couldn't update its own Service Health Dashboard for the first two hours of the outage because the dashboard's admin console itself required S3, the exact service that was down. Monitoring and status infrastructure needs an independence boundary from the systems it monitors.
  5. Recovery time is itself a property you have to design for, not a byproduct of good architecture. The index subsystem's restart took longer than expected because of integrity-check overhead that scales with data volume, and AWS's response included accelerating partitioning work to reduce future recovery time, not just to prevent recurrence.
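The partitioning idea behind the last lesson can be illustrated with a toy model. AWS has not published how its cells are keyed, so everything here is an assumption: `cell_for` and `affected_fraction` are hypothetical helpers, and hashing object keys into a fixed number of cells is simply one common way to split an index so that a single cell's failure or restart touches only a fraction of the keys.

```python
import hashlib


def cell_for(key, cell_count):
    """Map an object key to a cell with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_fraction(keys, cell_count, failed_cell):
    """Return the share of keys that live in the failed cell."""
    hit = sum(1 for key in keys if cell_for(key, cell_count) == failed_cell)
    return hit / len(keys)


if __name__ == "__main__":
    sample_keys = [f"object-{i}" for i in range(10_000)]
    for cells in (1, 4, 16):
        share = affected_fraction(sample_keys, cells, failed_cell=0)
        print(f"{cells} cell(s): {share:.1%} of keys affected by losing cell 0")
```

With a single cell, losing it affects every key. As the cell count grows, the share affected by any one cell shrinks, and so does the amount of metadata a single restart has to validate.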

S3's Eleven Nines, Nearly a Decade Later

The most notable long-term outcome isn't that a region-wide S3 outage of exactly this kind never recurred -- it's that the underlying append-and-replicate object storage architecture that was running that day is still, fundamentally, the architecture running S3 today. The fix was operational discipline around the system, not a redesign of the system itself. AWS has continued to advertise S3's durability design at 99.999999999% (eleven nines) for objects across multiple facilities. The February 2017 outage was an availability incident, not a durability one: no customer data was lost.

THE PLAYBOOK BECAME A TEACHING TOOL

AWS's own postmortem of this event has since become one of the most commonly cited examples in chaos-engineering training material and SRE coursework, not because the failure was exotic, but because it wasn't. A single mistyped parameter, a missing floor check, and a cascading dependency chain form a failure pattern nearly every infrastructure team can recognize in their own systems, which is exactly why this incident keeps getting taught years after the outage itself ended.

The most expensive bug in this story wasn't a line of code, it was the absence of one. A single 'if remaining < minimum: refuse' check, never written, cost the internet four hours and seventeen minutes.

TechLogStack -- built at scale, broken in public, rebuilt by engineers