Reliability Engineering Case Studies

17 min

Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work.

1.5 years migration time · 99.99% SLA maintained

Cloudflare Reliability

18 min

Cloudflare Fixed a React Security Vulnerability and Broke the Entire Network

In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. To do so, they needed to disable an internal testing tool with a global killswitch. The killswitch, unexpectedly, triggered a bug that sent HTTP 500 errors across Cloudflare's entire global network. This was the third major configuration-related global outage in two years.

OpenAI Reliability

20 min

OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes

On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability. Within 29 minutes, it had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable for over four hours. The engineers trying to fix it couldn't run kubectl. The control plane that manages clusters was down — and it was the only way back in.

Cloudflare Reliability

3 min

Cloudflare Changed a Database Permission and 2.4 Billion Users Got HTTP 500

A ClickHouse permission update caused the bot detection file to triple in size. Cloudflare's proxies were not designed to survive that — and for six hours, neither was most of the internet.

~6 hr outage · 2.4B users affected · HTTP 500 sitewide

Shopify Reliability

3 min

Shopify's Authentication Went Down on Cyber Monday — the Year's Biggest Shopping Day

At 6:45 AM Pacific on December 1, 2025, Shopify merchants started getting locked out of their own stores. The platform was in the middle of a record $14.6 billion holiday weekend.

Cyber Monday morning · 4,000+ Downdetector reports · $5.1M/min peak throughput

GitHub Reliability

3 min

GitHub Actions Froze for 95% of Workflows When a Redis Cluster Hit Its Limit

On March 5, 2026, GitHub's CI job queue stopped dispatching. Developers pushed code, saw their checks queued — and then nothing happened for 30 minutes.

95% workflows blocked · 30 min avg delay · Redis cluster failure

Optus Reliability

3 min

Optus Upgraded a Firewall and Accidentally Blocked Emergency 000 Calls for 13 Hours

A routine firewall upgrade on September 18, 2025, broke the routing path for Triple Zero emergency calls in four Australian states. Six hundred calls failed. Four people died.

13 hr outage window · 600 failed emergency calls · 4 states affected

Discord Reliability

15 min

Discord Killed the MacBook Dev Environment and Never Looked Back

Discord's engineering team had tripled in size and was drowning in a swamp of 'works on my machine' bugs — some engineers running macOS, some Ubuntu, all of them slowly. The solution was radical: no one gets a local dev environment anymore.

3x engineering org growth · 100% backend devs on CDEs

Slack Reliability

17 min

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

On February 22, 2022, Slack went down for many users, including the engineer designated as Incident Commander, who later wrote the postmortem from firsthand experience of the outage. The culprit was a new component that worked exactly as designed.

Slack Reliability

16 min

Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. Eighteen months later, customer impact hours were down 90%.

73% incidents from own deploys · 90% reduction in impact hours · 18 months of sustained investment

Atlassian Reliability

18 min

How a Two-Line Script Silently Deleted 883 Customer Cloud Sites

At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.

883 sites deleted · 14 days max outage · 775 customers affected

GitHub Reliability

17 min

The Test That Broke GitHub: A Failover Drill Goes Live

June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their brand-new second Internet edge facility — six months of infrastructure work designed to eliminate a single point of failure. Within seconds, instead of validating their redundancy, they've created an outage that takes GitHub offline for millions of developers across North America and South America.

32-minute outage · 2-min detect-to-revert · ~100M devs affected

Cloudflare Reliability

16 min

Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours

On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane — the system that lets customers configure DNS, firewall rules, and every Cloudflare service — went dark. It stayed dark, in various forms, for nearly 40 hours. The postmortem introduced a concept Cloudflare hadn't had before: Code Orange.

~40 hours control plane down

Cloudflare Reliability

17 min

A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic

On November 18, 2025, Cloudflare experienced a six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.

28% HTTP traffic impacted · 6 hours total duration · 2.5h to find root cause

Netflix Reliability

16 min

Netflix Streamed Live Sports for Millions — and the Hard Part Wasn't the Video

When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.

Datadog Reliability

18 min

Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.

24h+ global outage · 5 regions, 3 cloud providers

Shopify Reliability

19 min

The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between 'impressive demo' and 'product I'd trust with my customers' — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.

300-example hand-crafted benchmark

Spotify Reliability

20 min

Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters inside their Envoy Proxy perimeter. They applied it to all regions simultaneously. Within two minutes, every Envoy instance worldwide had crashed. And then the restart loop began — a loop Kubernetes itself was powering, killing each new server as fast as it came back up. 675 million users couldn't load the app. Asia Pacific stayed up, and the reason why told the engineers exactly what was broken.

675M MAU affected · 48,000+ Downdetector peak reports
