TechLogStack — When Big Tech Breaks & How They Fix It

★ 5.0

19 min

Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose

It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.

10 Simian Army members
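The core idea in the teaser above (kill something at random, but only while engineers are at their desks) can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the function names, the 09:00 to 17:00 weekday window, and the one-victim-per-group rule are all assumptions made for the example.

```python
import random
from datetime import datetime


def in_business_hours(now: datetime) -> bool:
    # Hypothetical window: Monday to Friday, 09:00 up to (not including) 17:00.
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(groups: dict[str, list[str]], now: datetime,
                 rng: random.Random) -> dict[str, str]:
    """Choose one random instance per non-empty group, only in business hours."""
    if not in_business_hours(now):
        return {}
    return {name: rng.choice(instances)
            for name, instances in groups.items() if instances}
```

Passing in the clock and the random generator keeps the selection testable; a real tool would then hand the chosen instance IDs to whatever terminates them.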

Read full story

Hotstar Live Streaming

★ 5.0

18 min

When MS Dhoni Got Out: How Hotstar Survived 25 Million Concurrent Users

July 9th, 2019. India vs New Zealand, Cricket World Cup semi-final. MS Dhoni walks to the crease and 1.1 million new viewers join Hotstar every single minute. Then he gets run out — and 24 million people hit the back button almost simultaneously.

25.3M peak concurrent 1.1M users/min growth 5.7 Tbps bandwidth +3

Read full story

GitHub Databases

16 min

How GitHub Upgraded 1200 MySQL Hosts Without Dropping a Single Query

MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.

1,200+ MySQL hosts upgraded 300+ TB data migrated 5.5M queries/sec maintained +2

Read full story

Slack Reliability

17 min

Slack Built a Big Red Button to Drain an Entire Data Center in Five Minutes

On June 30, 2021, a network link connecting one AWS availability zone failed — and Slack users felt it, despite Slack running in multiple availability zones. The postmortem question was brutal: why did a single AZ failure affect users at all? The answer drove 18 months of architecture work.

1.5 years migration time 99.99% SLA maintained

Read full story

Cloudflare Reliability

18 min

Cloudflare Fixed a React Security Vulnerability and Broke the Entire Network

In late 2025, Cloudflare was rolling out a fix for a React security vulnerability. To do so, they needed to disable an internal testing tool with a global killswitch. The killswitch, unexpectedly, triggered a bug that sent HTTP 500 errors across Cloudflare's entire global network. This was the third major configuration-related global outage in two years.

Read full story

LinkedIn Messaging

21 min

LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.

In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.

1B events/day at launch (2011) 1T messages/day by 2015 7T messages/day by 2019 +1

Read full story

Google Performance

19 min

Google Built a Free Design Tool That Generates Production Code From a Sentence — Then Added Multiplayer

At Google I/O 2025, Sundar Pichai demoed a tool that turned a plain English description into a complete mobile UI in under 30 seconds. Figma charges $15 per editor per month for collaborative design. Google Stitch does it free. A year later, Google added real-time multiplayer, a streaming design agent, and voice input. The design industry noticed.

350 free generations/month

Read full story

X Infrastructure Outages

15 min

Why Does X Go Down? Inside the Microservice Spaghetti and Database Bottlenecks That Trigger Global Outages

When a routine update knocks out a global network, thousands of systems are suddenly isolated from real-time global conversation. In June 2026, X (formerly Twitter) went completely dark for hours, stranding millions of active users and forcing engineering teams into an all-hands-on-deck post-mortem. It wasn’t a hacker attack—it was a core architectural failure of microservice dependency loops and database connections failing under concurrent load.

users impacted globally: 10M+ duration of peak downtime: 3.5 Hours concurrent error spike: 24,000% +1

Read full story

Google Distributed Systems

20 min

Google's Gemini Omni Is the First AI That Creates From Anything — Here Is What That Actually Means

For three years, Google built Gemini to be 'natively multimodal.' At I/O 2026, they finally demonstrated what that phrase means in practice. Gemini Omni takes a photo, an audio clip, a video, and a text description — all at once — and produces a new video that reflects all of them simultaneously. This is not four models chained together. It is one.

Read full story

Anthropic Platform Reliability

15 min

Why Did Claude Go Down? Inside the Trillion-Token Capacity Strain and API Fault Loops That Silenced LLM Workflows

On June 23, 2026, large language models shifted instantly from cutting-edge workflow accelerators to silent systemic bottlenecks. A massive, multi-tiered service disruption hit Anthropic’s infrastructure, causing Claude.ai, the Claude API, Claude Console, and Claude Code environments to collapse under elevated backend error rates. It wasn't a malicious cyber attack. It was a story of extreme computing demand, cascading server overload, and the real-world vulnerability of AI dependencies across global software ecosystems.

user alerts submitted: 1,272 time from alert to fix: 34 Minutes affected models simultaneously: 100% +1

Read full story

GitHub Distributed Systems

19 min

GitHub Built the Internet's Code Platform — Then AI Agents Broke It

Between May 2025 and April 2026, GitHub experienced 257 incidents — 48 of them major outages. That's roughly one significant disruption every single week. The culprit wasn't a security breach, a botched deployment, or a rogue engineer. It was the thing GitHub had spent years celebrating: AI. Specifically, agentic AI workflows that turned one human developer's footprint into hundreds of commits, thousands of CI minutes, and a dozen simultaneous PR operations — all at once, across millions of accounts. GitHub had been built for humans. Agents are not human.

257 incidents — May 2025 to April 2026 48 major outages, 112+ hours total downtime 57 GitHub Actions outages in 12 months +1

Read full story

OpenAI Reliability

20 min

OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes

On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability. Within 29 minutes, it had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable for over four hours. The engineers trying to fix it couldn't run kubectl. The control plane that manages clusters was down — and it was the only way back in.

Read full story

Tata Communications Infrastructure Disasters

15 min

Why Did the Delhi Data Centre Burn? Inside the Lithium-Battery Fire and Single-Zone Storage Failures That Erased Decades of Enterprise Data

On June 5, 2026, a massive fire ripped through a major New Delhi data centre jointly operated by STT Global Data Centres India and Tata Communications. The catastrophic blaze severely damaged internal infrastructure, knocking out critical cloud routing paths for Google Cloud and wiping out decades of historical operational records for enterprise clients. It wasn't a software bug or a configuration mistake—it was a physical force majeure event that exposed the dangerous realities of non-replicated storage pools and tight localized hardware dependency.

estimated commercial loss: ₹500 Crore+ years of legacy data erased: 20 Years+ fire deployment tenders: 10 Units +1

Read full story

Google Platform Reliability

15 min

Why is Gemini Down? Inside the Database Hotspotting and Cache Failures That Triggered Error 1076 Worldwide

On June 10, 2026, millions of creators, developers, and Google Workspace users were suddenly locked out of their primary workflows. Google’s flagship AI assistant, Gemini, crumbled under a massive, multi-tiered service disruption. Across mobile apps, web surfaces, and Chrome integrations, prompts were met with cryptic 'error 1076' and 'error 1099' messages. Far from a simple networking glitch, the root cause revealed a severe internal breakdown: database hotspotting and a cache configuration failure that sent a 60% failure rate rippling through Google’s foundational storage tier.

elevated error rate duration: 6h 55m total system incident duration: 14h 49m database query failure peak: 60% +1

Read full story

Google Distributed Systems

20 min

Google's Own Cleanup Job Crashed Cloud Services Across 4 Continents — and Then Made Recovery Worse

On May 29, 2025, a Google engineer deployed new quota-checking code to Service Control, the system that authorizes every single API request across Google Cloud. The code had a bug: it couldn't handle a null value. But the bug was invisible during deployment because it could only be triggered by a specific type of policy data that hadn't appeared yet. Two weeks later, on June 12, an automated system pushed a routine policy update containing blank fields. The policy data replicated globally within seconds. Every Service Control binary in every region hit the null pointer, crashed, and fell into a crash loop on restart. Spotify went down. Discord went down. Snapchat went down. Google's own status page went down. And when engineers deployed the fix, the restart surge overwhelmed the infrastructure, making the recovery worse than the crash.

50+ Google Cloud services affected including IAM, Compute Engine, Cloud Storage, BigQuery
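To make the failure mode concrete, here is a small sketch of a quota check that trips on a blank field next to one that guards against it. Everything in it is hypothetical: the `Policy` type, the `quota_limit` field, and the choice to fail open (allow the request) when the field is blank are assumptions for illustration, not Service Control's actual code or policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Policy:
    name: str
    quota_limit: Optional[int]  # may arrive blank (None) in replicated data


def unsafe_check(policy: Policy, used: int) -> bool:
    # Raises TypeError when quota_limit is None: the crash path.
    return used < policy.quota_limit


def safe_check(policy: Policy, used: int) -> bool:
    # Assumption: a blank limit means "no decision possible", so fail open
    # instead of crashing the process that authorizes every request.
    if policy.quota_limit is None:
        return True
    return used < policy.quota_limit
```

Whether to fail open or fail closed is a product decision; the point of the sketch is only that the blank case gets an explicit branch instead of an exception.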

Read full story

AWS Infrastructure Outages

15 min

Why Did AWS Overheat? Inside the Northern Virginia Thermal Surge and Power Failures That Halted Global Trading Rails

On May 7, 2026, a sudden cooling failure at an Amazon Web Services (AWS) facility in Northern Virginia turned the backbone of the internet into an environmental oven. High temperatures triggered a critical 'thermal event' and subsequent power loss, causing widespread instance impairments and freezing operations for enterprise heavyweights like Coinbase, FanDuel, and CME Group. Far from a routine glitch, this high-profile disruption exposed the extreme fragility of high-density availability zones failing under synchronous load.

peak incident duration: 21 Hours affected core services: 12+ downdetector alert spike: ~600 Blocks +1

Read full story

Railway Platform Infrastructure

15 min

Why Did Railway Go Down? Inside the Automated Google Cloud Account Suspension That Paralyzed Cross-Cloud Control Planes

On May 20, 2026, cloud infrastructure provider Railway vanished from the internet's control loop. Without prior notice, an automated compliance sweep by Google Cloud Platform (GCP) suspended Railway’s primary project account, instantly cutting off their multi-cloud orchestration engine. While existing container workloads fought to survive, the sudden loss of the underlying network control plane API triggered a massive global outage, leaving developers staring at blank console screens and broken application pipelines. It was a stark lesson in absolute provider dependency and the terrifying velocity of automated administrative locks.

GCP monthly infrastructure spend: $1M+ time to account restoration: 9 Minutes consecutive annual scale incidents: 3 Years +1

Read full story

Cloudflare Reliability

4 min

Cloudflare Changed a Database Permission and 2.4 Billion Users Got HTTP 500

A ClickHouse permission update caused the bot detection file to triple in size. Cloudflare's proxies were not designed to survive that — and for six hours, neither was most of the internet.

~6 hr outage 2.4B users affected HTTP 500 sitewide +1
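One common defence against a generated file that suddenly grows is to validate it before loading and keep serving the last known-good version if it fails. The sketch below shows that pattern only; the function name, the `max_features` limit, and the fallback behaviour are assumptions, not a description of Cloudflare's proxy.

```python
def select_feature_file(candidate: list[str], last_good: list[str],
                        max_features: int) -> tuple[list[str], bool]:
    """Return (features_to_load, accepted).

    The candidate is rejected when it exceeds the size the consumer was
    built for; the caller keeps running on last_good and can raise an alert.
    """
    if len(candidate) > max_features:
        return last_good, False
    return candidate, True
```

The design choice is that an oversized input degrades freshness instead of availability: the service keeps answering with slightly stale data while someone investigates.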

Read full story

Cloudflare Security

3 min

A Security Fix Broke 28% of the Internet for 25 Minutes — Cloudflare's December 2025 Outage

A well-reviewed security patch hit production traffic patterns it had never seen in testing, and a retry amplification loop did the rest.

25 min outage ~28% internet affected HTTP 500 errors +1
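Retry amplification is easy to underestimate, so a quick piece of arithmetic helps. Assuming each layer in a call chain retries a failed call a fixed number of times, the worst-case load on the bottom layer grows exponentially with depth. The layer and retry counts below are illustrative and do not come from the incident.

```python
def amplification(layers: int, retries_per_layer: int) -> int:
    """Worst-case attempts reaching the bottom layer for one user request.

    Each layer makes 1 original attempt plus retries_per_layer retries,
    and every attempt fans out again at the layer below.
    """
    return (1 + retries_per_layer) ** layers
```

With three layers that each retry twice, `amplification(3, 2)` is 27: one user request can become 27 requests against a backend that is already failing.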

Read full story

GitHub Databases

4 min

GitHub's Settings Cache Went Stale and Took Authentication, Actions, and Copilot Down With It

A configuration change to the user settings cache triggered a global invalidation. Every AI model preference and policy setting hit the database at once — and the database that stored them also handled login.

2 hr 43 min disruption 2 outage windows Auth, Actions, Copilot affected +1
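When a global invalidation empties a cache, every concurrent reader misses at once and goes to the database. One standard mitigation is request coalescing (often called single-flight): only the first caller per key loads from the database, and the rest wait for that result. The class below is a minimal sketch of the technique with hypothetical names; it is not GitHub's code.

```python
import threading

_MISSING = object()  # sentinel so a cached None is still a hit


class SingleFlightCache:
    def __init__(self, loader):
        self._loader = loader            # function: key -> value (the DB read)
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()   # protects the per-key lock table

    def get(self, key):
        value = self._values.get(key, _MISSING)
        if value is not _MISSING:
            return value
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded while we waited.
            value = self._values.get(key, _MISSING)
            if value is _MISSING:
                value = self._loader(key)
                self._values[key] = value
            return value

    def invalidate_all(self):
        self._values.clear()
```

This caps the load at one database read per key after an invalidation. It does not help if millions of distinct keys are invalidated together; staggering the invalidation is the complementary fix.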

Read full story

Shopify Reliability

3 min

Shopify's Authentication Went Down on Cyber Monday — the Year's Biggest Shopping Day

At 6:45 AM Pacific on December 1, 2025, Shopify merchants started getting locked out of their own stores. The platform was in the middle of a record $14.6 billion holiday weekend.

Cyber Monday morning 4,000+ Downdetector reports $5.1M/min peak throughput +1

Read full story

GitHub Reliability

3 min

GitHub Actions Froze for 95% of Workflows When a Redis Cluster Hit Its Limit

On March 5, 2026, GitHub's CI job queue stopped dispatching. Developers pushed code, saw their checks queued — and then nothing happened for 30 minutes.

95% workflows blocked 30 min avg delay Redis cluster failure +1

Read full story

GitHub Security

3 min

GitHub Lost Telemetry and Its Own Security System Started Blocking Real Developer VMs

A telemetry pipeline went silent. GitHub's security automation treated the silence as a threat signal — and locked every Codespaces VM out of its own metadata service.

~6 hr Codespaces outage All regions affected Copilot, CodeQL blocked +1

Read full story

GitHub Performance

2 min

A Fiber Cut in Seattle Slowed GitHub Clone Speeds to Under 1 MB/s for Eight Hours

Nothing changed in GitHub's codebase. A cable under Seattle got cut, and git clone on the US west coast went from fast to dial-up speed — and TCP made it worse.

<1 MiB/s clone speed 800 Gbps → 3.2 Tbps ~8 hr disruption +1

Read full story

GitHub Databases

2 min

Dependabot Silently Skipped 10% of Security PRs Because a Failover Landed on a Read-Only Database

For 42 hours, Dependabot appeared to be running fine. It was quietly failing to create security pull requests — and nobody got an alert.

42 hr degraded window 10% of PRs silently failed Zero visible errors shown +1

Read full story

Optus Reliability

3 min

Optus Upgraded a Firewall and Accidentally Blocked Emergency 000 Calls for 13 Hours

A routine firewall upgrade on September 18, 2025, broke the routing path for Triple Zero emergency calls in four Australian states. Six hundred calls failed. Four people died.

13 hr outage window 600 failed emergency calls 4 states affected +1

Read full story

GitHub Distributed Systems

4 min

GitHub Was Built for 2008. AI Agents Demanded 30x More Scale in 2026 — and the Platform Broke

In October 2025, GitHub set a goal of 10x capacity growth. By February 2026, the CTO was publicly saying that wasn't enough. AI-assisted development had changed the load model entirely.

257 incidents in 12 months 48 major outages 30x redesign target +1

Read full story

AWS Reliability

10 min

One Wrong Number in a Routine Command Took Down Slack, Trello, and a Chunk of the Internet

On the morning of February 28th, 2017, an authorized S3 engineer ran a command to remove a small number of servers from a billing subsystem. One input was typed wrong. The command removed far more capacity than intended -- enough to force two core S3 subsystems into a full restart neither had needed in years. For the next four hours, US-EAST-1 was effectively offline, and a long list of sites that quietly depended on S3 went down with it.

PST time the command ran: 9:37 AM total disruption: 4hr 17min core subsystems needing full restart: 2 +1

Read full story

Facebook Distributed Systems

15 min

Facebook Took Itself Off the Internet for 7 Hours — and Couldn't Get Back In

At 15:39 UTC on October 4, 2021, a command was issued to audit Facebook's global backbone network capacity. The command contained an error. Within seconds, it severed all fiber-optic connections between Facebook's data centers worldwide. Facebook, Instagram, WhatsApp, and Messenger went dark simultaneously for 3.5 billion users. Engineers trying to fix the problem discovered they couldn't reach the servers remotely — because the corporate authentication systems also ran on the same disconnected backbone. To get back in, they had to physically drive to a data center in Santa Clara.

users locked out globally: 3.5B duration of full outage: ~7 hrs market cap lost on the day: ~$47B +1

Read full story

GitLab Databases

9 min

GitLab Deleted Its Own Production Database. Then Found All Five Backups Had Failed Too

On January 31st, 2017, a GitLab engineer trying to fix a lagging database replica ran one command on the wrong host. In two seconds, around 300GB of production data was gone -- and so was the database GitLab.com depended on to function. When the team turned to their backups to recover, they found that of five separate backup and replication mechanisms, only one partial, six-hour-old snapshot actually worked.

production data removed: ~300GB backup mechanisms that had silently failed: 4 of 5 hours to fully restore service: ~18hrs +1

Read full story

CircleCI Security

11 min

Malware on One Laptop Gave Attackers a Way Into CircleCI's Production Systems

On December 16th, 2022, malware landed on the laptop of a CircleCI engineer. CircleCI's antivirus software didn't catch it. Three days later, the malware stole a session cookie that was already authenticated past two-factor authentication -- and an attacker used it to impersonate that engineer from a remote location. Because the employee's job included generating production access tokens, the attacker could too. It took a customer's bug report, not CircleCI's own monitoring, to surface the breach.

days the malware went undetected: 13 laptop compromised: Dec 16, 2022 customers reporting downstream compromise: < 5 +1

Read full story

GitHub Databases

16 min

GitHub's Database Split Its Brain in 43 Seconds. Cleaning It Up Took 24 Hours.

On October 21, 2018, a technician in a GitHub data center accidentally disconnected a 100G optical cable during routine maintenance. The connection was restored in 43 seconds. In those 43 seconds, GitHub's automated database failover tool had already promoted a new primary database on the West Coast — before the East Coast primary had finished replicating its recent writes. For the next 24 hours, GitHub had two sources of truth for its most critical metadata, neither complete, and no clean way to merge them.

duration of network partition: 43 sec service degradation after: 24+ hrs webhooks queued for replay: 5M+ +1

Read full story

Netflix Reliability

16 min

A 3-Day Database Outage in 2008 Convinced Netflix to Move Everything to AWS. It Took 7 Years.

In August 2008, Netflix's Oracle database — the monolith at the center of their DVD-by-mail empire — became corrupted. For three days, Netflix could not ship DVDs to its members. The company could not fix the problem fast enough with its existing data center infrastructure. Yury Izrailevsky, the VP who would lead the subsequent migration, later said the outage 'is when we realized that we had to move away from vertically scaled single points of failure.' The solution would take seven years to complete and permanently reshape how companies think about cloud infrastructure.

duration of 2008 DB outage: 3 days migration duration: 7 years streaming member growth (2008–2016): 8× +1

Read full story

Amazon Web Services Distributed Systems

17 min

AWS's Most Popular Region Went Down Because DynamoDB's DNS Had a Race Condition Nobody Had Seen Before

At 3:11 AM ET on October 20, 2025, AWS began receiving alerts that DynamoDB in us-east-1 was failing to resolve. The root cause turned out to be a race condition in DynamoDB's DNS management automation — a latent defect that had existed undetected until a slow Availability Zone caused one of three independent DNS enactors to fall hours behind its peers. The resulting empty DNS record for dynamodb.us-east-1.amazonaws.com didn't just take down DynamoDB — it took down every AWS service that depended on DynamoDB for metadata, control plane operations, or state management. Snapchat, Fortnite, Duolingo, Ring doorbells, and hundreds of banking apps went offline. The technical chain that caused it is one of the most intricate dependency failures in cloud computing history.

services impacted in us-east-1: Many root cause: race condition type: DNS Enactor customer impact duration: ~3–15 hrs +1
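The hazard described above is a slow writer applying an old plan on top of a newer one. A generic guard is to version every plan and refuse anything that is not strictly newer, and to refuse a plan that would leave the record empty. The sketch below illustrates that guard with hypothetical types and names; it is not AWS's DNS automation, and a real system would need the check and the write to happen atomically.

```python
from dataclasses import dataclass, field


@dataclass
class DnsRecord:
    generation: int = 0
    addresses: list[str] = field(default_factory=list)


def apply_plan(record: DnsRecord, generation: int, addresses: list[str]) -> bool:
    """Apply a plan only if it is newer than the current one and non-empty."""
    if generation <= record.generation:
        return False  # stale plan from a delayed writer: ignore it
    if not addresses:
        return False  # never publish an empty record
    record.generation = generation
    record.addresses = list(addresses)
    return True
```

For example, after applying generation 5, a delayed writer that shows up with generation 3 is rejected and the record keeps its generation 5 addresses.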

Read full story

Discord Databases

16 min

How Discord Migrated Trillions of Messages and Fired Their Garbage Collector

It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.

177 → 72 nodes 9-day migration (was 3-month est.) 3.2M records/sec migrated +1

Read full story

Discord Reliability

15 min

Discord Killed the MacBook Dev Environment and Never Looked Back

Discord's engineering team had tripled in size and was drowning in a swamp of 'works on my machine' bugs — some engineers running macOS, some Ubuntu, all of them slowly. The solution was radical: no one gets a local dev environment anymore.

3x engineering org growth 100% backend devs on CDEs

Read full story

Netflix Distributed Systems

16 min

Netflix Hit the AWS Instance Ceiling and Built a Workflow Engine That Scales Forever

Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs — and running out of machine. Vertically scaling on AWS had a hard ceiling, and the workflows were doubling in size every year. The only way out was a complete architectural rethink.

2M+ jobs/day at peak 100K+ jobs in single workflow

Read full story

Slack Reliability

17 min

Slack's Worst Day: When a Better Cache Manager Made Everything Worse

On February 22, 2022, Slack went down for many users — including the engineer designated as Incident Commander, who was authoring the postmortem from a position of personal experience. The culprit was a new component that worked exactly as designed.

Read full story

Stripe Databases

16 min

How Stripe Moves Petabytes Between Database Shards Without Stopping the Money

Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.

99.999% uptime achieved 5M database queries/sec 1.5 PB migrated in 2023
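The "fewer than 6 minutes" figure follows directly from the availability number. A short calculation, assuming a 365-day year:

```python
def downtime_minutes_per_year(availability: float, days: int = 365) -> float:
    """Allowed downtime in minutes for a given availability (0.0 to 1.0)."""
    minutes_in_year = days * 24 * 60  # 525,600 for a 365-day year
    return minutes_in_year * (1.0 - availability)
```

`downtime_minutes_per_year(0.99999)` comes to about 5.26 minutes, which is where the five-nines budget in the teaser comes from. Four nines (0.9999) allows roughly ten times that, about 52.6 minutes.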

Read full story

Slack Distributed Systems

16 min

Slack Rewrote Its Core Architecture for Enterprise — Because the Old One Was a Lie

Slack was built for teams in single workspaces. Enterprise customers were using it across dozens of workspaces simultaneously — and the architecture had never been designed for that. Every major enterprise feature was a workaround on top of a foundation that assumed one workspace per person. Slack spent two years rebuilding the foundation.

2 years development time

Read full story

Slack Reliability

16 min

Slack Cut Deploy-Related Customer Impact by 90% in Eighteen Months

73% of Slack's customer-facing incidents were being triggered by Slack itself — by its own code deploys. The team stopped treating each outage as a one-off and started treating deploy safety as a program, with metrics, milestones, and automated rollbacks. Eighteen months later, customer impact hours were down 90%.

73% incidents from own deploys 90% reduction in impact hours 18 months of sustained investment

Read full story

Atlassian Reliability

18 min

How a Two-Line Script Silently Deleted 883 Customer Cloud Sites

At 07:38 UTC on April 5th, 2022, a maintenance script begins its run — methodical, peer-reviewed, totally routine. Twenty-three minutes later, 883 Atlassian Cloud sites have been permanently deleted, and the company's own incident management tool, Opsgenie, is one of the casualties.

883 sites deleted 14 days max outage 775 customers affected +3

Read full story

Netflix Live Streaming

18 min

65 Million Streams: How Netflix Rebuilt Its Guts for Live

November 15, 2024: 65 million people log on to watch Mike Tyson fight Jake Paul, the largest live sports stream in history. Behind the scenes, Netflix engineers are white-knuckling a system they built from scratch — one where a single bad video segment, a CDN request storm, or a missed 2-second write deadline means millions of viewers see a black screen.

65M concurrent streams 113ms → 25ms p50 latency 200Gbps+ read throughput +3

Read full story

Stripe Performance

15 min

Stripe Converted 3.7 Million Lines of JavaScript in One Pull Request on a Sunday

On Sunday, March 6, 2022, Stripe merged a single pull request that converted their entire largest JavaScript codebase from Flow to TypeScript. 3.7 million lines of code. Hundreds of engineers arrived Monday morning to start writing TypeScript. The migration had been invisible until it wasn't.

3.7M lines converted in 1 PR

Read full story

GitHub Reliability

17 min

The Test That Broke GitHub: A Failover Drill Goes Live

June 29, 2023, 17:39 UTC: GitHub engineers initiate a planned live failover test of their brand-new second Internet edge facility — six months of infrastructure work designed to eliminate a single point of failure. Within seconds, instead of validating their redundancy, they've created an outage that takes GitHub offline for millions of developers across North America and South America.

32-minute outage 2-min detect-to-revert ~100M devs affected

Read full story

Shopify Databases

15 min

Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened

The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding — but they had a Rails monolith that expected a single database, and a system that couldn't have downtime during a commerce platform used by millions daily.

Read full story

Shopify Databases

15 min

Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second

During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.

19M MySQL QPS at BFCM peak 58M requests/min app servers 99.999%+ uptime maintained

Read full story

Cloudflare Reliability

16 min

Cloudflare's Datacenter Partner Failed and the Control Plane Went Dark for 40 Hours

On November 2, 2023, Cloudflare's primary datacenter partner experienced a power failure. The control plane — the system that lets customers configure DNS, firewall rules, and every Cloudflare service — went dark. It stayed dark, in various forms, for nearly 40 hours. The postmortem introduced a concept Cloudflare hadn't had before: Code Orange.

~40 hours control plane down

Read full story

Cloudflare Reliability

17 min

A Database Permission Change in ClickHouse Took Down 28% of Cloudflare's HTTP Traffic

On November 18, 2025, Cloudflare experienced a six-hour global outage. The cause: a database permission change in ClickHouse generated a corrupt configuration file that was silently propagated to every server in Cloudflare's Bot Management system, crashing it globally.

28% HTTP traffic impacted 6 hours total duration 2.5h to find root cause

Read full story

Netflix Performance

16 min

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead — not in crashes or alerts, but in slow step launches that nobody had measured. The fix was a complete engine rewrite that delivered 100x throughput improvement.

100x throughput improvement 2.5 years before overhead visible 1M+ tasks/day still supported

Read full story

Netflix Performance

16 min

Netflix's Containers Were Fighting Their Own CPUs — and Losing

Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'

Read full story

Netflix Reliability

16 min

Netflix Streamed Live Sports for Millions — and the Hard Part Wasn't the Video

When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.

Read full story

Figma Databases

17 min

Figma's Database Grew 100x in Four Years — Here's How a Small Team Kept It From Toppling

In 2020, Figma ran on a single Postgres instance on AWS's largest available machine. Four years later, that database had grown nearly 100x. Some tables had swelled to several terabytes and billions of rows. The Postgres vacuum process — the background job that keeps Postgres alive — was causing reliability incidents. They had months of runway left before hitting the IOPS ceiling. A small databases team had nine months to fix it.

100x DB growth since 2020 9-month migration

Read full story

Datadog Reliability

18 min

Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.

24h+ global outage 5 regions, 3 cloud providers

Read full story

OpenAI Databases

18 min

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.

800M users, 1 primary PG instance ~50 read replicas globally 5-second DDL timeout enforced

Read full story

Uber Security

19 min

Uber Had 150,000 Secrets Scattered Across 25 Vaults — So They Built One Platform to Rule Them

150,000 secrets. 25 separate vaults. Hundreds of teams managing their own credentials in their own ways, some in plain text in version control. At Uber's scale — 5,000 microservices, 5,000 databases, 500,000 analytical jobs per day — secrets sprawl is not a compliance problem. It is an incident waiting to happen. A team of ten engineers decided to fix it.

150,000 secrets managed 25 vaults → 6 managed vaults 5,000 microservices secured +2

Shopify Reliability

19 min

The 80% Problem: Why Getting an LLM System to 'Works in Demo' Is 20% of the Work

Every team building with LLMs discovers the same brutal truth: 80% quality arrives in a few weeks. The final 15% — the gap between 'impressive demo' and 'product I'd trust with my customers' — takes the rest of the time. Shopify's Flow agent and Sidekick teams lived this curve and came back with a systematic playbook. It is mostly about measurement.

300-example hand-crafted benchmark

IBM Distributed Systems

21 min

Quantum Computing Just Beat the Best Classical Computer — Here Is the Engineering That Made It Happen

On May 6, 2026, Q-CTRL ran a materials science simulation on an IBM quantum computer in 2 minutes. The best classical supercomputer needed over 100 hours to reach the same accuracy — and then gave up. The day before, IBM's quantum computers simulated a 12,635-atom protein with Cleveland Clinic and RIKEN, 40 times larger than anything attempted six months prior. After 30 years of promises, quantum advantage arrived. Here is what actually changed.

12,635-atom protein simulated (May 5 2026) 120 qubits, 10,000+ two-qubit gates 2 min quantum vs 100+ hours classical

Spotify Reliability

20 min

Spotify Changed a Filter Order in Their Proxy — Then Every Server in the World Crashed at Once

On April 16, 2025, Spotify's engineering team made a change they deemed low risk: reordering the custom filters inside their Envoy Proxy perimeter. They applied it to all regions simultaneously. Within two minutes, Envoy instances in nearly every region had crashed. Then the restart loop began, a loop Kubernetes itself was powering, killing each new instance as fast as it came back up. 675 million users couldn't load the app. Only Asia Pacific stayed up, and the reason why told the engineers exactly what was broken.

675M MAU affected 48,000+ Downdetector peak reports

Airbnb Databases

★ 4.0

23 min

Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch

Airbnb's identity graph connects 7 billion nodes and 11 billion edges — every user, every device, every listing, every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. The third-party vendor powering it required periodic manual reboots to stay stable. Queries that needed 8 hops of graph traversal were hitting 5-second P99 latencies. In 2024, a small team rebuilt the entire thing internally. The results were not incremental.

7B nodes, 11B edges 5M new edges/day

New stories, zero jargon