The Story
At 9:37 AM PST on February 28th, 2017, an authorized member of the Amazon S3 team in US-EAST-1 was debugging a problem with the S3 billing system, which had been processing more slowly than expected. Using an established playbook -- a documented, pre-approved operational procedure -- the engineer ran a command intended to remove a small number of servers from one of the subsystems that supports billing. One of the inputs to that command was entered incorrectly. Instead of removing a handful of servers, it removed a much larger set. Within minutes, two of S3's most load-bearing subsystems had lost so much capacity that both required a full restart -- something AWS had not had to do at that scale in years.
THE INSIGHT: TWO SUBSYSTEMS, ONE DEPENDENCY CHAIN
The servers removed by mistake supported two other subsystems beyond billing. The index subsystem manages the metadata and location of every object in the region; without it, S3 cannot resolve where any object actually lives. The placement subsystem allocates storage for new objects and depends on the index subsystem being healthy first. Lose enough capacity from both at once, and S3 doesn't degrade gracefully -- it stops answering GET, LIST, PUT, and DELETE requests alike, because the dependency runs in one direction: placement waits on index, and nothing waits on placement.
Problem
A Billing Subsystem Was Running Slow
The S3 billing system had been processing more slowly than expected for some time. An authorized engineer began a routine debugging session to investigate, using a pre-approved playbook for removing a small number of servers from the affected subsystem to observe behavior under reduced load.
Cause
One Input, Entered Wrong
At 9:37 AM PST, the engineer executed the playbook command. One of its inputs was entered incorrectly, so the command targeted a much larger set of servers than intended. The tool had no safeguard checking whether the requested removal would take any subsystem below its minimum required capacity, so it executed exactly as instructed.
Solution
Two Full Subsystem Restarts, From Scratch
With insufficient capacity remaining, both the index and placement subsystems required a full restart. S3 had not fully restarted these subsystems in its largest regions in years, and S3's growth since the last such restart meant that the restart itself, plus the safety checks needed to validate the integrity of the metadata, took longer than engineers expected.
Result
Index Serving Again by 12:26 PM, Placement Recovered by 1:54 PM
The index subsystem began serving GET, LIST, and DELETE requests again at 12:26 PM PST and was fully recovered by 1:18 PM. The placement subsystem, which needed the index subsystem healthy first, finished recovering at 1:54 PM, at which point S3 was operating normally again, roughly four hours and seventeen minutes after the command had run.
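The recovery order and the elapsed time in that timeline can both be checked mechanically. The sketch below is illustrative only: the `DEPENDS_ON` map, `restart_order`, and `minutes_since_command` are hypothetical names, and the only facts taken from the text are that placement waits on index and the clock times quoted above. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from datetime import datetime
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists the subsystems that must be
# healthy before it can start. Only the "placement needs index" edge comes
# from the incident summary; the data structure itself is an assumption.
DEPENDS_ON = {
    "index": set(),
    "placement": {"index"},
}


def restart_order(depends_on):
    """Return subsystems ordered so that dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())


# Clock times quoted in the timeline above (all PST, same day).
COMMAND_RUN = datetime(2017, 2, 28, 9, 37)
MILESTONES = {
    "index serving GET/LIST/DELETE": datetime(2017, 2, 28, 12, 26),
    "index fully recovered": datetime(2017, 2, 28, 13, 18),
    "placement fully recovered": datetime(2017, 2, 28, 13, 54),
}


def minutes_since_command(moment):
    """Whole minutes elapsed between the command and a milestone."""
    return int((moment - COMMAND_RUN).total_seconds() // 60)


if __name__ == "__main__":
    print(restart_order(DEPENDS_ON))  # ['index', 'placement']
    for label, moment in MILESTONES.items():
        hours, minutes = divmod(minutes_since_command(moment), 60)
        print(f"{label}: {hours}h {minutes:02d}m after the command")
```

The final milestone works out to 257 minutes, which is the four hours and seventeen minutes cited above.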
Why the Outage Page Couldn't Load Either
From the start of the event until 11:37 AM PST, AWS could not update individual service statuses on its own Service Health Dashboard, because the dashboard's administration console itself had a dependency on S3. AWS fell back to its Twitter account and a banner message on the dashboard to communicate status, while the irony of a status page that depended on the very service that was down played out in public.
Why AWS Hadn't Restarted These Subsystems in Years
S3 is built on the assumption that capacity will occasionally fail, and removing or replacing servers is a routine operational practice the team relies on constantly, in small doses. What hadn't happened in years was losing enough capacity from the index and placement subsystems simultaneously to force a full cold restart of either one. Routine operations at small scale and the same operation at a scale nobody had rehearsed turned out to be very different problems.
THE CORE TECHNICAL INSIGHT
The tool that removed the servers did exactly what it was told -- the failure wasn't a bug in the logic, it was the absence of a guardrail. Nothing in the playbook's tooling checked whether an input would push a subsystem below the minimum capacity it needed to keep functioning. A system can be well-architected for failure and still go down hard if the operational tooling around it has no concept of a safe floor.
The Fix
Three Changes That Closed the Gap
AWS's response to the outage wasn't a rewrite of S3's architecture -- the underlying design held up. The fix was about constraining how fast and how far a single human mistake could propagate, and about shrinking how long recovery takes when it happens anyway. Three changes did most of the work.
S3's Capacity-Removal Tooling: Before vs. After the Outage
| Property | Before Feb 28, 2017 | After Feb 28, 2017 |
|---|---|---|
| Minimum-capacity check | None, tool executed any input as given | Removal blocked if it would drop a subsystem below minimum required capacity |
| Removal speed | Large capacity changes applied immediately | Capacity removed more gradually, with checks between steps |
| Blast radius per subsystem | A single mistaken input could affect index and placement together | Index subsystem further partitioned into smaller cells to limit blast radius |
| Status page dependency | SHD admin console depended on S3 in the affected region | SHD admin console runs across multiple AWS regions |
| Audit scope | Safeguards were specific to this one tool | Other operational tools audited for the same missing safety checks |
```python
# Illustrative: the class of safeguard AWS added to capacity-removal tooling.
# Not AWS's actual implementation -- this models the missing check described
# in the official post-incident summary. `execute_removal`,
# `abort_remaining_removals`, and the `subsystem` object are hypothetical
# stand-ins for real infrastructure calls.

SAFE_REMOVAL_BATCH_SIZE = 5  # assumed value, for illustration only


class CapacityGuardrailError(Exception):
    """Raised when a removal would breach a subsystem's capacity floor."""


def chunk(items, batch_size):
    """Yield successive batches of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def remove_capacity(subsystem, servers_to_remove, current_capacity, min_required_capacity):
    """
    Before: this function executed whatever server list was passed in,
    with no floor check. A typo in `servers_to_remove` had no safety net.
    """
    remaining = current_capacity - len(servers_to_remove)

    # The fix: refuse the operation if it would drop the subsystem
    # below the capacity it needs to keep functioning.
    if remaining < min_required_capacity:
        raise CapacityGuardrailError(
            f"Refusing to remove {len(servers_to_remove)} servers from "
            f"{subsystem}: would leave {remaining}, below the minimum "
            f"required capacity of {min_required_capacity}."
        )

    # Additional fix: remove capacity in smaller, throttled increments
    # rather than all at once, so a bad input surfaces before full damage.
    for batch in chunk(servers_to_remove, batch_size=SAFE_REMOVAL_BATCH_SIZE):
        execute_removal(subsystem, batch)
        if subsystem.health_check() == "degraded":
            abort_remaining_removals(subsystem)
            break
```
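The function above leaves its infrastructure calls abstract, so it cannot be run on its own. To exercise both safeguards end to end, here is a self-contained sketch that swaps the real fleet for an in-memory stand-in. Everything in it is hypothetical: `FakeSubsystem`, its `degraded_below` early-warning threshold, `guarded_remove`, and the batch size are assumptions chosen for illustration, not details from AWS's summary.

```python
class CapacityGuardrailError(Exception):
    """Raised when a removal would breach the capacity floor."""


class FakeSubsystem:
    """In-memory stand-in for a fleet; not a model of any real S3 subsystem."""

    def __init__(self, name, servers, min_required, degraded_below):
        self.name = name
        self.servers = set(servers)
        self.min_required = min_required      # hard floor
        self.degraded_below = degraded_below  # assumed early-warning threshold

    def health_check(self):
        return "degraded" if len(self.servers) < self.degraded_below else "healthy"


def guarded_remove(subsystem, servers_to_remove, batch_size=2):
    """Remove servers in batches and return the ones actually removed."""
    targets = [s for s in servers_to_remove if s in subsystem.servers]
    remaining = len(subsystem.servers) - len(targets)

    # Safeguard 1: reject the whole request if it would breach the floor.
    if remaining < subsystem.min_required:
        raise CapacityGuardrailError(
            f"{subsystem.name}: removing {len(targets)} would leave "
            f"{remaining}, below the floor of {subsystem.min_required}"
        )

    # Safeguard 2: go batch by batch and stop at the first sign of trouble.
    removed = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        subsystem.servers.difference_update(batch)
        removed.extend(batch)
        if subsystem.health_check() == "degraded":
            break
    return removed


if __name__ == "__main__":
    fleet = FakeSubsystem("index", [f"srv-{i}" for i in range(10)],
                          min_required=4, degraded_below=7)
    try:
        guarded_remove(fleet, [f"srv-{i}" for i in range(8)])  # the "typo"
    except CapacityGuardrailError as err:
        print("blocked:", err)
    print("removed:", guarded_remove(fleet, [f"srv-{i}" for i in range(5)]))
```

In the demo, the oversized request is rejected before any server is touched, while the smaller request passes the floor check but halts after four removals because the stand-in health check reports degradation. Keeping the two safeguards separate means a request that looks safe on paper can still be stopped by what the fleet actually reports.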
THE COUNTERINTUITIVE PART: THE FIX WASN'T MORE AUTOMATION
It would be easy to assume the lesson here is "automate the human out of the loop." That's not quite what AWS did. The playbook command was already an authorized, established procedure -- the human wasn't doing anything unusual. The fix was adding a guardrail that makes the tool itself refuse unsafe instructions, regardless of who or what issues them. The goal wasn't removing trust in operators; it was making sure no single input, human or scripted, could push a subsystem past a point of no return.
Architecture
The outage is easiest to understand as a single mistaken input propagating through a dependency chain that nobody had load-tested at that scale. Two diagrams make the propagation concrete: what the cascade looked like as it happened, and what the safeguarded version of the same operation looks like today.
[Diagram: The Cascade: One Command to a Region-Wide Outage]
[Diagram: Before vs. After: The Guardrail That Was Missing]
What to Notice in the Cascade
Look at how few nodes actually needed to fail for the blast radius to reach customer sites: just two subsystems, both fed by the same single command. Everything downstream -- EC2, EBS, Lambda, the status dashboard itself, and ultimately Slack and Trello -- was a passenger on that initial failure, not an independent point of weakness. The fix diagram shows the same operation with exactly one new decision point added: a floor check that didn't exist before.
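One way to make the cascade concrete in code is to treat it as reverse reachability over a "depends on" graph: start at the failed node and collect everything that transitively relies on it. The edge list below is a deliberately simplified model built only from the services named in this article. It is an assumption for illustration, not a map of real AWS internals, and `blast_radius` is a hypothetical helper.

```python
from collections import deque

# Simplified "depends on" edges drawn from the services named in the text.
# Illustrative only: real dependencies are more numerous and less direct.
DEPENDS_ON = {
    "placement": ["index"],
    "s3": ["index", "placement"],
    "ec2": ["s3"],
    "ebs": ["s3"],
    "lambda": ["s3"],
    "status-dashboard-admin": ["s3"],
    "slack": ["s3"],
    "trello": ["s3"],
}


def blast_radius(depends_on, failed):
    """Return every node that transitively depends on a failed node."""
    # Invert the edges so we can walk from a failure to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    impacted, queue = set(), deque(failed)
    while queue:
        current = queue.popleft()
        for node in dependents.get(current, ()):
            if node not in impacted and node not in failed:
                impacted.add(node)
                queue.append(node)
    return impacted


if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, {"index"})))
    print(sorted(blast_radius(DEPENDS_ON, {"placement"})))
```

In this model, failing `index` reaches every other node, while failing `placement` reaches everything except `index`. That is the asymmetry the cascade turns on: the direction of the edges, not their number, decides how far a failure travels.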
Lessons
This incident is nearly a decade old, but it's still one of the most-cited reliability case studies in the industry, because the failure mode it exposes -- a well-tested system brought down by an operation that was routine at small scale and untested at large scale -- keeps recurring in new systems every year.
What to remember
- Authorize the operation, but verify the input against a hard floor. The command that caused the outage was already an approved, established procedure. The missing control wasn't permission, it was a guardrail checking whether the specific input would leave a subsystem below the minimum capacity it needs to function. Authorization and input validation are two different layers of safety, and you need both.
- An operation that is routine at small scale is not thereby validated at large scale. AWS had relied on removing and replacing S3 capacity since launch, but had not fully restarted the index or placement subsystems in its largest regions in years, because nothing had forced that path. An operation you've run a thousand times safely can still be untested at the scale a mistake can trigger.
- Dependency direction determines blast radius, not dependency count. The placement subsystem depended on the index subsystem being healthy, but nothing depended on placement. That asymmetry meant a single shared failure point, capacity loss, took down both index-dependent reads and placement-dependent writes simultaneously, rather than degrading independently.
- Status pages should never depend on the system they're reporting on. AWS couldn't update its own Service Health Dashboard for the first two hours of the outage because the dashboard's admin console itself required S3, the exact service that was down. Monitoring and status infrastructure needs an independence boundary from the systems it monitors.
- Recovery time is itself a property you have to design for, not a byproduct of good architecture. The index subsystem's restart took longer than expected because of integrity-check overhead that scales with data volume, and AWS's response included accelerating partitioning work specifically to reduce future recovery time, not just to prevent recurrence.
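The recovery-time lesson can be illustrated with a toy model of partitioning. AWS has not published how its cells are keyed, so the stable-hash scheme below is purely an assumption, as are the function names and the synthetic keys. The sketch only shows why smaller cells bound the amount of metadata a single cold restart has to validate.

```python
import hashlib


def cell_for(key, num_cells):
    """Map an object key to a cell with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def entries_per_cell(keys, num_cells):
    """Count how many index entries each cell would hold."""
    counts = [0] * num_cells
    for key in keys:
        counts[cell_for(key, num_cells)] += 1
    return counts


def worst_case_restart_scan(keys, num_cells):
    """Entries an integrity check must scan to cold-restart the largest cell."""
    return max(entries_per_cell(keys, num_cells))


if __name__ == "__main__":
    keys = [f"bucket/object-{i}" for i in range(10_000)]  # synthetic keys
    for cells in (1, 4, 16):
        scan = worst_case_restart_scan(keys, cells)
        print(f"{cells:>2} cell(s): worst-case restart scans {scan} entries")
```

With a single cell, a restart has to validate every entry. With more cells, the worst case shrinks toward the total divided by the cell count, and a failure confined to one cell leaves the keys in the other cells untouched.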
S3's Eleven Nines, Nearly a Decade Later
The most notable long-term outcome isn't that a region-wide S3 outage never recurred in exactly this form -- it's that the underlying append-and-replicate object storage architecture that failed that day is still, fundamentally, the architecture running S3 today. The fix was operational discipline around the system, not a redesign of the system itself. AWS has continued to advertise S3's durability design at 99.999999999% (eleven nines) for objects stored across multiple facilities. The February 2017 outage was an availability incident, not a durability one: no customer data was lost.
THE PLAYBOOK BECAME A TEACHING TOOL
AWS's own postmortem of this event has since become one of the most commonly cited examples in chaos-engineering training material and SRE coursework, not because the failure was exotic, but because it wasn't. A single mistyped parameter, a missing floor check, and a cascading dependency chain together form a failure pattern nearly every infrastructure team can recognize in their own systems, which is exactly why this incident keeps getting taught years after the outage itself ended.