Netflix Chaos Monkey 2011: The Origin of Chaos Engineering

The Story

The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.

— Yury Izrailevsky & Ariel Tseitlin — via The Netflix Simian Army, Netflix Tech Blog, July 19, 2011

The origin of Chaos Monkey is not a clever engineering insight; it is a three-day disaster. In August 2008, Netflix was still primarily a DVD-by-mail business, running its technology on vertically scaled servers in its own data centers. A major database corruption took down the entire system. For three days, Netflix could not ship DVDs to its customers. It wasn't a complicated failure. It was a single point of failure at the most basic level: one database, one failure mode, total outage. The company's engineering leadership concluded that the only path forward was to move away from centralized relational databases in their own data center toward highly reliable, horizontally scalable, distributed systems in the cloud. They chose Amazon Web Services. The seven-year cloud migration that followed would produce one of the most influential engineering philosophies in the history of distributed systems.

The migration to AWS presented a new problem in place of the old one. Netflix was moving from a single monolith with a small number of failure points, each catastrophic, to a distributed system with hundreds of services, each potentially failing in its own unique way. The distributed system was theoretically more resilient. But theory is not production. Netflix's engineers designed systems with graceful degradation in mind: if the recommendations service failed, show popular titles instead of personalized ones; if the search service was slow, streaming should still work. They wrote the code. They reviewed it. They tested it in staging. And then they realized: there was no way to know if the fault tolerance actually worked without experiencing actual failures. The staging environment couldn't reproduce the chaos of production. Controlled tests couldn't capture the emergent failure modes of hundreds of interdependent services under real load.

THE CORE INSIGHT: FAIL CONSTANTLY

Netflix's founding philosophy for Chaos Engineering was radical in its simplicity: the best way to avoid failure is to fail constantly. If you only experience failures accidentally, in production, at 3am, your engineers have no muscle memory for responding to them and your systems have never been forced to prove their resilience claims. If you fail constantly, during business hours, with engineers present — your systems either prove they can recover or they expose the gaps so engineers can fix them before those gaps become incidents.

What Chaos Monkey Actually Does

Chaos Monkey is, mechanically, a simple tool. It runs continuously across Netflix's AWS environment and, at some point during business hours, picks one EC2 instance at random from each cluster and terminates it. No warning. No coordination. No grace period. The instance just stops. This deceptively simple act forces every service in Netflix's architecture to prove, continuously, that it can tolerate the loss of an individual instance. Services that depend on a single backend instance fail immediately and obviously. Services built with proper fallbacks (load balancers, retries, graceful degradation paths) continue working. The business hours constraint is deliberate: when Chaos Monkey strikes at 2pm on a Tuesday, engineers are at their desks and can respond to any cascading failure. Striking at 2am would produce the exact scenario Netflix was trying to avoid: unplanned, unattended failures with no one ready to respond.

Problem

August 2008: Database Corruption, Three Days of Darkness

Netflix's vertically scaled infrastructure suffered a major database corruption that halted DVD shipping for three days. The root cause was architectural: a single relational database instance, a single point of failure. No redundancy, no graceful degradation, no recovery path faster than manual intervention. The outage made the problem concrete: this architecture couldn't support Netflix's growth.

Cause

Distributed Systems Are Only Theoretically Resilient

Moving to hundreds of microservices on AWS solved the single-point-of-failure problem at the architecture level — but introduced new questions: did the code actually implement the graceful degradation it was supposed to? Staging environments couldn't tell you. Code review couldn't tell you. The only honest answer required production failures, and those were the thing Netflix was trying to avoid.

Solution

Chaos Monkey: Production Failure on a Schedule

Netflix built Chaos Monkey — a script that randomly terminates EC2 instances during business hours — and deployed it in all production environments. Engineers came in every day knowing that Chaos Monkey was running, knowing their services might get an instance killed at any moment, and knowing they had to build recovery mechanisms or face a very bad afternoon. The tool made fault tolerance a daily engineering discipline, not a theoretical design principle.

Result

Sept 2014: AWS Reboots 10% of Its Servers. Netflix Shrugs.

On September 25, 2014, AWS rebooted approximately 10% of its EC2 instances without warning. Netflix's systems handled it without customer impact. Netflix explicitly credited Chaos Monkey: the engineers had already been building and proving recovery mechanisms every day for years. When AWS created an unplanned failure event at scale, Netflix's systems responded exactly as they'd been trained to respond — automatically, gracefully, and without requiring an emergency war room.

Chaos Monkey was one of the first systems Netflix engineers built in AWS during the cloud migration. Not a caching layer, not a deployment system, not a monitoring platform — a tool to randomly kill their own production servers. This sequencing was intentional: the discipline came first, and the architecture was shaped by it.

The Rambo Architecture

Netflix's engineering team coined the term Rambo Architecture for the design philosophy that Chaos Monkey enforced: each system must be able to succeed no matter what, even all on its own. If the recommendations service is down, still respond — show popular titles. If the search service is slow, streaming still works. If a dependent microservice returns an error, handle it gracefully. Every service is both a potential failure source and a potential victim of failures, and must be designed for both roles simultaneously.
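The fallback rule described above ("if the recommendations service is down, still respond") can be sketched in a few lines. This is an illustrative sketch, not Netflix's code: `get_home_rows`, `POPULAR_TITLES`, and the two fake services are hypothetical names invented for the example.

```python
from typing import Callable, List

# Hypothetical static fallback: a non-personalized list that needs no backend call.
POPULAR_TITLES: List[str] = ["Popular Title A", "Popular Title B", "Popular Title C"]


def get_home_rows(user_id: str,
                  fetch_recommendations: Callable[[str], List[str]]) -> List[str]:
    """Return personalized titles, or popular titles if the dependency fails."""
    try:
        titles = fetch_recommendations(user_id)
    except Exception:
        # The dependency is down or timed out: degrade instead of propagating.
        return POPULAR_TITLES
    # An empty answer is treated like a failure so the page is never blank.
    return titles or POPULAR_TITLES


def broken_service(user_id: str) -> List[str]:
    """Stand-in for a recommendations service that is currently failing."""
    raise TimeoutError("recommendations service unavailable")


def healthy_service(user_id: str) -> List[str]:
    """Stand-in for a recommendations service that is working."""
    return [f"Pick for {user_id}"]
```

Calling `get_home_rows("u1", broken_service)` returns the popular list rather than raising, which is the behavior an instance termination is meant to exercise.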

The Simian Army

The success of Chaos Monkey triggered a proliferation. If randomly killing instances made Netflix more resilient to instance failures, what would it take to become resilient to other failure categories? In July 2011, in the same blog post that named Chaos Monkey publicly, Netflix announced the Simian Army: a growing suite of failure-injection and resilience-verification tools, each targeting a different class of failure. The roster was remarkable in its scope and its naming creativity. Latency Monkey introduced artificial delays in service communication to simulate degradation. Conformity Monkey identified and shut down instances not following engineering best practices. Doctor Monkey ran health checks and removed unhealthy instances from service. Janitor Monkey cleaned up unused cloud resources to reduce costs and complexity. Security Monkey hunted for security vulnerabilities. 10-18 Monkey (short for localization-internationalization) detected configuration and runtime problems in instances serving customers in multiple geographic regions and languages. And Chaos Gorilla simulated the complete failure of an AWS availability zone.
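To make the idea of injecting artificial delays concrete, here is a minimal sketch of latency injection as a Python decorator. It is not how Netflix's tool was built; `inject_latency` and its parameters are hypothetical, and the `sleep` and `rng` arguments exist only so the behavior can be tested without actually waiting.

```python
import random
import time
from functools import wraps
from typing import Callable, Optional


def inject_latency(delay_seconds: float,
                   probability: float = 1.0,
                   rng: Optional[random.Random] = None,
                   sleep: Callable[[float], None] = time.sleep):
    """Decorator that delays a call by delay_seconds with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # random() returns a value in [0.0, 1.0), so probability=1.0 always
            # delays and probability=0.0 never does.
            if rng.random() < probability:
                sleep(delay_seconds)  # simulate a slow dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a client call with `@inject_latency(0.5, probability=0.1)` would slow roughly one call in ten by half a second, which is enough to reveal callers that have no timeout.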

Chaos Kong: The Region Killer

Above Chaos Gorilla in the hierarchy sat Chaos Kong — the most extreme tool in the Simian Army, designed to simulate the complete failure of an entire AWS region. If Chaos Monkey proved Netflix could survive an instance failure and Chaos Gorilla proved it could survive an AZ failure, Chaos Kong tested the hardest question: could Netflix continue streaming if us-east-1 went dark? The answer, after years of Chaos Engineering practice, was yes — with careful architecture involving active-active multi-region deployment and data replication strategies that Netflix documented in subsequent engineering blog posts.

The Fix

Building a Fault-Tolerant Culture

The most important thing Chaos Monkey fixed was not a technical system — it was an organizational incentive. Before Chaos Monkey, engineers at Netflix could ship code that was theoretically fault-tolerant but practically fragile without facing immediate consequences. The fragility would only become visible during a real, unplanned outage — at which point it was someone else's problem. After Chaos Monkey, the consequences were immediate and personal: if your service didn't handle instance failures gracefully, Chaos Monkey would expose this during your working hours, while you were at your desk, with your team watching. This behavioral economics effect — where the cost of fragility was paid by the person who created it, immediately — transformed how Netflix engineers thought about resilience. It was no longer a design principle to be aspirationally implemented. It was a daily test to be continuously passed.

2011: Year Chaos Monkey was publicly announced in 'The Netflix Simian Army' blog post, three years after the 2008 database outage that triggered the AWS migration and the need for built-in fault tolerance.

10+: Members of the Simian Army at peak, each targeting a different failure category, from individual instances (Chaos Monkey) to full AWS regions (Chaos Kong).

Business hours: The scheduling constraint that made Chaos Monkey safe and effective. Failures happen during working hours, with engineers present to respond, rather than as 3am on-call escalations.

Sept 2014: The real-world validation. AWS rebooted 10% of EC2 instances without warning; Netflix handled it without customer impact, directly crediting years of Chaos Monkey practice.

```python
# Simplified sketch of what Chaos Monkey does.
# The real implementation was originally Java, later Go (v2.0).
# Runs continuously during configurable business hours.
# aws_client is a hypothetical wrapper, not a real SDK: it is assumed to expose
# get_all_clusters() and terminate_instance(instance_id).

import random
import time
from datetime import datetime


class ChaosMonkey:
    def __init__(self, aws_client, excluded_clusters=None,
                 termination_interval_seconds=3600):
        self.aws = aws_client
        self.excluded = excluded_clusters or []
        # Seconds to wait between passes over the clusters.
        self.termination_interval_seconds = termination_interval_seconds

    def is_business_hours(self) -> bool:
        """Only run during business hours so engineers are present.
        The key safety constraint of Chaos Monkey's original design."""
        now = datetime.now()  # local time of the machine running the loop
        return (
            now.weekday() < 5 and          # Monday to Friday
            9 <= now.hour < 17             # 9am to 5pm local time
        )

    def run(self):
        while True:
            if self.is_business_hours():
                # Identify all clusters Chaos Monkey is configured to target
                clusters = self.aws.get_all_clusters()

                for cluster in clusters:
                    if cluster.name in self.excluded:
                        continue

                    # Pick one instance at random from each cluster
                    instances = cluster.get_running_instances()
                    if not instances:
                        continue

                    victim = random.choice(instances)

                    # Terminate it. No warning. No coordination.
                    # If the system doesn't survive this, the engineers
                    # will know about it immediately and fix it.
                    self.aws.terminate_instance(victim.id)
                    print(f"[Chaos Monkey] Terminated {victim.id} "
                          f"in cluster {cluster.name}")

            # Wait before running again. In v2.0 the pacing is a mean time
            # between terminations configured per cluster, not a random
            # probability; a single fixed interval stands in for that here.
            time.sleep(self.termination_interval_seconds)
```

FAILURE INJECTION TESTING (FIT): THE EVOLUTION

In 2014, Netflix engineers (including Kolton Andrus, who later co-founded Gremlin) introduced FIT (Failure Injection Testing). Where Chaos Monkey operated at the infrastructure level (kill an EC2 instance), FIT operated at the application level: injecting failure metadata into individual requests to simulate specific service failures with surgical precision. FIT could say 'for this specific user's request, pretend the recommendations service is timing out' without actually degrading the recommendations service for everyone. This precision made chaos experiments far more targeted and safer to run continuously.
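A rough sketch of request-scoped failure injection, in the spirit of what is described above, might look like the following. Nothing here reflects FIT's actual design or API: `RequestContext`, `call_dependency`, `InjectedFailure`, and `render_home` are hypothetical names, and the only idea being shown is that a failure tag travels with one request and affects nobody else.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


class InjectedFailure(Exception):
    """Raised in place of a real call when the request is tagged to fail."""


@dataclass
class RequestContext:
    user_id: str
    # Names of dependencies that should appear broken for this request only.
    fail_dependencies: Set[str] = field(default_factory=set)


def call_dependency(ctx: RequestContext, name: str,
                    real_call: Callable[[], str]) -> str:
    """Invoke a dependency unless this request carries a failure tag for it."""
    if name in ctx.fail_dependencies:
        raise InjectedFailure(f"simulated failure of {name}")
    return real_call()


def render_home(ctx: RequestContext) -> str:
    """Caller under test: should degrade to a fallback, not crash."""
    try:
        return call_dependency(ctx, "recommendations", lambda: "personalized")
    except InjectedFailure:
        return "popular"
```

A request built as `RequestContext("bob", {"recommendations"})` sees the fallback path, while every other request still gets the real service.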

Chaos Monkey 2.0: Open-Sourced and Rebuilt in Go

Chaos Monkey was open-sourced in 2012 and rebuilt in 2016 as version 2.0. The new version was written in Go, used Spinnaker as its deployment platform dependency, and introduced mean-time-between-terminations (rather than probabilistic scheduling) for more predictable test coverage. Version 2.0 also added Trackers — Go language objects that report instance terminations to external monitoring systems, enabling downstream correlation of Chaos Monkey events with application metrics and alerts.
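The two ideas in this paragraph, a mean time between terminations and trackers that report each kill, can be illustrated with a small Python analogue. This is a sketch under stated assumptions rather than a port of the Go code: it assumes that "one termination every N work days on average" is modeled as a per-day probability of 1/N, and `Termination`, `Tracker`, `ListTracker`, `daily_probability`, and `maybe_terminate` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Protocol, Sequence


@dataclass
class Termination:
    instance_id: str
    cluster: str


class Tracker(Protocol):
    """Anything that can be told about a termination."""
    def track(self, termination: Termination) -> None: ...


class ListTracker:
    """Stand-in for an external monitoring system: records events in memory."""
    def __init__(self) -> None:
        self.events: List[Termination] = []

    def track(self, termination: Termination) -> None:
        self.events.append(termination)


def daily_probability(mean_workdays_between_terminations: int) -> float:
    """Assumed model: one termination per N work days on average gives p = 1/N."""
    if mean_workdays_between_terminations < 1:
        raise ValueError("mean must be at least one work day")
    return 1.0 / mean_workdays_between_terminations


def maybe_terminate(cluster: str, instance_ids: Sequence[str], mean_workdays: int,
                    trackers: Sequence[Tracker],
                    rng: random.Random) -> Optional[Termination]:
    """Decide today's fate for one cluster and notify every tracker on a kill."""
    if not instance_ids or rng.random() >= daily_probability(mean_workdays):
        return None
    event = Termination(rng.choice(instance_ids), cluster)
    for tracker in trackers:
        tracker.track(event)  # lets dashboards correlate the kill with metrics
    return event
```

The sketch only records the decision; the actual termination call is left out so the example stays runnable without cloud credentials.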

Industry Adoption: From Netflix to Everywhere

By 2015, Netflix's Chaos Engineering practices had been codified in the Principles of Chaos Engineering document (published by a team including Casey Rosenthal, who led Netflix's Chaos Engineering team), transforming what had been an internal Netflix tool into a formal engineering discipline. Companies including LinkedIn, Facebook, Google, Amazon, and Twilio adopted chaos engineering practices. Kolton Andrus (from Netflix's FIT team) founded Gremlin in 2016 to commercialize chaos engineering tooling. AWS launched its own Fault Injection Simulator service in 2021.

The Open-Source Release and Industry Spread

Netflix open-sourced Chaos Monkey in 2012, making the tool available to any engineering team that wanted to adopt the practice. The release did something more important than provide the code: it legitimized the approach. Engineering teams at other companies who had been quietly running similar experiments could now point to Netflix's published methodology as industry precedent. By 2015, companies including LinkedIn, Facebook, Google, Amazon, and Twilio had publicly acknowledged chaos engineering practices. The 2015 publication of the Principles of Chaos Engineering by Netflix's Casey Rosenthal and colleagues formalized the discipline with scientific language: hypothesis, experiment, steady state, blast radius. What had been a Netflix internal tool became a named engineering discipline.

THE SPINNAKER DEPENDENCY IN V2.0

Chaos Monkey 2.0 (2016) introduced a significant constraint: it requires Spinnaker as its deployment platform. This means that teams wanting to use Chaos Monkey 2.0 must also adopt Spinnaker, which is a substantial investment. Companies unwilling to commit to Spinnaker found Chaos Monkey 2.0 inaccessible, which opened market space for alternatives like Gremlin (founded by Netflix alumni Kolton Andrus and Matt Fornaciari) that offered chaos engineering as a service without infrastructure prerequisites.

Architecture

Netflix's architecture in 2011 was organized around a principle that Chaos Monkey enforced: every service must be independently deployable, independently scalable, and independently recoverable. The microservices were connected through REST APIs, with each service maintaining its own data store and exposing a versioned interface to its consumers. Chaos Monkey operated at the AWS EC2 instance layer — the individual virtual machines running each service's processes. When an instance was terminated, the load balancer in front of that service's cluster detected the unhealthy instance and stopped routing traffic to it. If the cluster had been sized with enough redundancy, other instances absorbed the traffic without degradation. If not, the service degraded — and the engineers learned something.
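Whether a cluster "had been sized with enough redundancy" comes down to simple arithmetic, sketched below. The function names and the requests-per-second framing are assumptions made for the example; real capacity planning involves more than one number per instance.

```python
import math


def survives_instance_loss(instance_count: int,
                           capacity_per_instance_rps: float,
                           peak_load_rps: float,
                           instances_lost: int = 1) -> bool:
    """True if the remaining instances can still carry peak load."""
    remaining = instance_count - instances_lost
    return remaining > 0 and remaining * capacity_per_instance_rps >= peak_load_rps


def min_instances(capacity_per_instance_rps: float,
                  peak_load_rps: float,
                  instances_lost: int = 1) -> int:
    """Smallest cluster size that tolerates losing instances_lost instances."""
    needed_for_load = math.ceil(peak_load_rps / capacity_per_instance_rps)
    return needed_for_load + instances_lost
```

With 100 requests per second per instance and a 300 requests per second peak, three instances are exactly enough until one is terminated; four are needed to absorb the loss.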

[Figure: The Simian Army: Failure Coverage Across Infrastructure Layers]

[Figure: How Netflix's Architecture Handles Chaos Monkey Instance Loss]

THE BEHAVIORAL ECONOMICS OF CHAOS ENGINEERING

Chaos Monkey's deepest contribution to Netflix's culture was aligning incentives. Without it, the cost of fragile code was paid by whoever happened to be on-call when a real failure occurred — often not the engineer who wrote the fragile code. With Chaos Monkey, the cost was paid immediately and visibly by the team whose service broke. Engineers who experienced a Chaos Monkey failure during business hours had a powerful motivator to invest in proper fault tolerance: they didn't want to experience it again. This is DevOps incentive design at its finest — not policy mandates, but a system where the right behavior is the path of least resistance.

Why Business Hours Only — The Safety Constraint

The original Chaos Monkey ran only during business hours, and this was not a limitation — it was the essential design principle. An instance killed at 2am when engineers are asleep creates exactly the scenario Netflix wanted to avoid: unplanned, unattended failure with long MTTD (Mean Time To Detect) and long MTTR (Mean Time To Recover). An instance killed at 2pm on a Tuesday is pedagogical, not adversarial: engineers learn from it, fix the gap, and build better systems. As Netflix's confidence in its architecture grew, chaos experiments expanded to cover more scenarios and broader failure scopes — but the principle of human-attended chaos remained core to responsible chaos engineering practice.

What Chaos Monkey Doesn't Test

Chaos Monkey's instance-termination model is powerful but deliberately narrow. It does not test network partitions (instances visible but unreachable), latency degradation (Latency Monkey's job), data corruption, or slow memory leaks that cause gradual performance degradation over hours. Chaos Monkey's successors in the Simian Army and in tools like Gremlin were created precisely to cover these gaps. The original insight — failing constantly builds resilience — generalizes to all failure types, but the specific mechanism must match the specific failure mode being tested. A chaos engineering program that only kills instances is missing most of the failure surface.

Lessons

Chaos Monkey is fourteen years old and it has influenced every major engineering organization's approach to reliability. Its lessons are not about the specific tool — they are about the philosophy that the tool embodies and the cultural transformation it requires.

What to remember

Designing for fault tolerance is not the same as having fault tolerance. Netflix's engineers wrote graceful degradation code. Netflix's Chaos Monkey tested whether it actually worked. Until production failure exercises the code path, you don't know whether your fault tolerance design survived contact with reality. Chaos Monkey converts theoretical resilience into empirical evidence.
Chaos engineering must be practiced during business hours, with humans present. The purpose is learning, not destruction. Chaos experiments run at 3am when no one is available to respond create exactly the incidents that chaos engineering is supposed to prevent.
Align incentives with the behavior you want. Chaos Monkey made the cost of fragile code immediate and personal — the engineer whose service broke during business hours paid the cost of fixing it right then. Without this alignment, resilience engineering is aspirational. With it, resilience engineering is survival instinct.
The blast radius of individual failures is only measurable through testing. A microservices architecture where every service failure cascades to every other service provides less reliability than a monolith, not more. Chaos Monkey surfaces these cascade dependencies so they can be eliminated before a real failure exposes them at scale.
Start at the instance level and escalate gradually. Netflix began with Chaos Monkey (instances), expanded to Chaos Gorilla (availability zones), then to Chaos Kong (regions). Each level was only attempted after the previous level produced a stable, confident result. This graduated escalation model — expand scope only when you're confident you've solved the current scope — is the responsible path for any chaos engineering program.
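The cascade dependencies mentioned in the list above can be reasoned about as a graph problem. The sketch below is a toy model, not a Netflix tool: `blast_radius` is a hypothetical function, and it assumes you already know which dependencies are "hard" (no working fallback), which in practice is exactly what the experiments are meant to discover.

```python
from collections import deque
from typing import Dict, Set


def blast_radius(hard_deps: Dict[str, Set[str]], failed: str) -> Set[str]:
    """Return every service that fails, transitively, when `failed` goes down.

    hard_deps maps a service to the dependencies it cannot work without.
    Dependencies with a working fallback are simply left out of the map.
    """
    # Invert the graph: for each dependency, who hard-depends on it?
    dependents: Dict[str, Set[str]] = {}
    for service, deps in hard_deps.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected
```

If the homepage hard-depends on recommendations, a ratings outage takes the homepage down with it; once the homepage has a fallback for recommendations, the same outage stops one hop earlier.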

The September 2014 Test That Validated Everything

Netflix's most public validation of Chaos Monkey's philosophy came not from their own experiments but from AWS itself. On September 25, 2014, AWS rebooted approximately 10% of its EC2 instances across regions without warning — a real, unplanned failure event at significant scale. Netflix handled it without customer impact. The years of Chaos Monkey practice had built exactly the muscle memory and architectural robustness required. Engineers didn't panic. Systems didn't cascade. Services degraded gracefully and recovered automatically. This was the experiment Netflix couldn't have designed themselves — and they passed it.

FROM TOOL TO DISCIPLINE: THE PRINCIPLES OF CHAOS ENGINEERING

In 2015, Netflix's Casey Rosenthal formalized Chaos Monkey's philosophy into the Principles of Chaos Engineering — a document that defined chaos engineering with scientific rigor: establish a steady-state hypothesis, vary real-world events, run experiments in production, automate continuously, minimize blast radius. These principles transformed chaos engineering from 'Netflix's thing where they kill their own servers' into a reproducible engineering discipline with clear methodologies. The formalization is what allowed chaos engineering to spread beyond Netflix — teams could now implement the practice without having to rediscover the same principles themselves.
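The experiment loop those principles describe (measure steady state, inject a failure, compare, restore) can be sketched as follows. This is a minimal illustration under assumptions: `run_experiment` and `ExperimentResult` are hypothetical names, the steady state is reduced to a single number, and the hypothesis is taken to hold when that number stays within a relative tolerance of its baseline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float
    during_failure: float
    hypothesis_held: bool


def run_experiment(measure_steady_state: Callable[[], float],
                   inject_failure: Callable[[], None],
                   rollback: Callable[[], None],
                   tolerance: float) -> ExperimentResult:
    """Check that a steady-state metric survives an injected failure.

    tolerance is relative: 0.05 means the metric may move by up to 5 percent
    of its baseline value before the hypothesis is considered violated.
    """
    baseline = measure_steady_state()
    inject_failure()
    try:
        during = measure_steady_state()
    finally:
        rollback()  # always restore the system, even if measuring raises
    held = abs(during - baseline) <= tolerance * abs(baseline)
    return ExperimentResult(baseline, during, held)
```

Keeping the rollback in a `finally` block is one small way to limit blast radius: the injected failure never outlives the experiment that introduced it.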

Netflix built a tool that killed their own servers on purpose every business day for years, and the one time AWS killed 10% of their servers by accident, nobody noticed, which is either the best possible outcome of a chaos engineering program or proof that Netflix engineers have very high stress tolerances.
