Netflix

Every Netflix engineering case study on TechLogStack — real production incidents, post-mortems, and fixes.

Netflix Chaos Engineering
19 min

Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose

It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.

10 Simian Army members

Netflix Distributed Systems
16 min

Netflix Hit the AWS Instance Ceiling and Built a Workflow Engine That Scales Forever

Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs, and it was running out of room on a single machine. Vertical scaling on AWS had a hard ceiling, and the workflows were doubling in size every year. The only way out was a complete architectural rethink.

2M+ jobs/day at peak | 100K+ jobs in single workflow

Netflix Live Streaming
18 min

65 Million Streams: How Netflix Rebuilt Its Guts for Live

November 15, 2024: 65 million people log on to watch Mike Tyson fight Jake Paul, the largest live sports stream in history. Behind the scenes, Netflix engineers are white-knuckling a system they built from scratch — one where a single bad video segment, a CDN request storm, or a missed 2-second write deadline means millions of viewers see a black screen.

65M concurrent streams | 113ms → 25ms p50 latency | 200Gbps+ read throughput | 2-second segment SLA | 90%+ cache hit on 404 storms | 38M events/sec monitored

Netflix Performance
16 min

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead: not in crashes or alerts, but in slow step launches that nobody had measured. The fix was a complete engine rewrite that delivered a 100x throughput improvement.

100x throughput improvement | 2.5 years before overhead visible | 1M+ tasks/day still supported

Netflix Performance
16 min

Netflix's Containers Were Fighting Their Own CPUs — and Losing

Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'

Netflix Reliability
16 min

Netflix Streamed Live Sports for Millions — and the Hard Part Wasn't the Video

When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.