Topic

Databases

A database migration goes wrong at 2 AM. A query pattern no one anticipated brings down a cluster holding trillions of messages. A single index change cascades into an hours-long outage. These are the database engineering case studies where the boring infrastructure layer turned out to be the most interesting one.

How GitHub Upgraded 1,200 MySQL Hosts Without Dropping a Single Query

MySQL 5.7 was hitting end-of-life, and GitHub's production database fleet spanned 1,200 hosts, 300 terabytes of data, and 5.5 million queries every second. Getting from here to MySQL 8.0 without disrupting 100 million developers was going to take more than a weekend.

1,200+ MySQL hosts upgraded · 300+ TB data migrated · 5.5M queries/sec maintained · >1 year planning+execution · 50+ clusters zero-downtime

GitHub's Settings Cache Went Stale and Took Authentication, Actions, and Copilot Down With It

A configuration change to the user settings cache triggered a global invalidation. Every AI model preference and policy setting hit the database at once — and the database that stored them also handled login.

2 hr 43 min disruption · 2 outage windows · Auth, Actions, Copilot affected · Zero data lost
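The failure mode in this teaser, where a global invalidation sends every cache miss to the database at once, is commonly mitigated by coalescing concurrent misses so that only one caller per key reloads the value. The following is a minimal sketch of that general technique, not GitHub's actual fix; `SingleFlightCache`, `load_setting`, and the key format are hypothetical names introduced for illustration.

```python
import threading
import time
from typing import Callable, Dict


class SingleFlightCache:
    """In-memory cache that lets only one caller per key hit the backing store.

    Hypothetical illustration of request coalescing; not taken from the case study.
    """

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader
        self._values: Dict[str, str] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects both dictionaries

    def get(self, key: str) -> str:
        with self._guard:
            if key in self._values:
                return self._values[key]
            lock = self._locks.setdefault(key, threading.Lock())
        # Only one thread per key proceeds to the loader; the rest wait here.
        with lock:
            with self._guard:
                if key in self._values:  # filled while we were waiting
                    return self._values[key]
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
            return value

    def invalidate_all(self) -> None:
        """Simulate a global invalidation: every key becomes a miss."""
        with self._guard:
            self._values.clear()


if __name__ == "__main__":
    db_calls = []

    def load_setting(key: str) -> str:
        # Stand-in for a slow database read (assumed, for demonstration only).
        time.sleep(0.05)
        db_calls.append(key)
        return f"value-for-{key}"

    cache = SingleFlightCache(load_setting)
    threads = [threading.Thread(target=cache.get, args=("user:1",)) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"database reads for 50 concurrent requests: {len(db_calls)}")
```

The per-key lock trades a little latency for the waiting callers against a large reduction in database load right after an invalidation. A production version would also need expiry, error handling, and cleanup of unused locks.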

Dependabot Silently Skipped 10% of Security PRs Because a Failover Landed on a Read-Only Database

For 42 hours, Dependabot appeared to be running fine. It was quietly failing to create security pull requests — and nobody got an alert.

42 hr degraded window · 10% of PRs silently failed · Zero visible errors shown · All jobs recovered after reroute

How Discord Migrated Trillions of Messages and Fired Their Garbage Collector

It is 2022 and Discord's on-call engineers are babysitting a 177-node database cluster, manually rebooting nodes after Java GC pauses spiral out of control. The system holding every message ever sent is becoming the thing everyone fears touching most.

177 → 72 nodes · 9-day migration (was 3-month est.) · 3.2M records/sec migrated · 4T+ messages moved

How Stripe Moves Petabytes Between Database Shards Without Stopping the Money

Stripe processed over $1 trillion in payment volume in 2023 while maintaining 99.999% uptime — five nines, fewer than 6 minutes of downtime all year. The infrastructure secret is a database platform called DocDB and a migration engine that moves petabytes of financial data between shards without any application knowing it happened.

99.999% uptime achieved · 5M database queries/sec · 1.5 PB migrated in 2023
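The "five nines" figure in the Stripe teaser can be checked with simple arithmetic. This short sketch converts an availability fraction into a yearly downtime budget; the function name is an illustrative assumption, and the calculation uses a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float, minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) allowed at the given availability."""
    return (1.0 - availability) * minutes


budget = downtime_budget_minutes(0.99999)
print(f"{budget:.2f} minutes of downtime per year")
```

At 99.999% availability the budget comes to roughly 5.26 minutes per year, consistent with the "fewer than 6 minutes" claim.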

Shopify Sharded a Rails Database With Vitess and the App Never Knew It Happened

The Shop app was growing exponentially. Its single MySQL database was approaching vertical scaling limits. Shopify needed horizontal sharding, but it had a Rails monolith that expected a single database and a commerce platform, used by millions daily, that could not tolerate downtime.

Shopify's Engineers Hunted Deadlocks at 19 Million Queries per Second

During Black Friday and Cyber Monday 2023, Shopify's MySQL fleet was handling 19 million queries per second. At that scale, even rare deadlock patterns become common enough to cause real incidents. The engineering team published a detailed playbook for diagnosing and eliminating MySQL deadlocks in high-concurrency production environments.

19M MySQL QPS at BFCM peak · 58M requests/min app servers · 99.999%+ uptime maintained

Figma's Database Grew 100x in Four Years — Here's How a Small Team Kept It From Toppling

In 2020, Figma ran on a single Postgres instance on AWS's largest available machine. Four years later, that database had grown nearly 100x. Some tables had swelled to several terabytes and billions of rows. The Postgres vacuum process — the background job that keeps Postgres alive — was causing reliability incidents. They had months of runway left before hitting the IOPS ceiling. A small databases team had nine months to fix it.

100x DB growth since 2020 · 9-month migration

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.

800M users, 1 primary PG instance · ~50 read replicas globally · 5-second DDL timeout enforced

Airbnb's Fraud Detection Runs on a Graph of 7 Billion Nodes — Here's Why They Rebuilt It From Scratch

Airbnb's identity graph connects 7 billion nodes and 11 billion edges — every user, every device, every listing, every relationship that might reveal a fraudster trying to create a duplicate account or collude on a fake transaction. The third-party vendor powering it required periodic manual reboots to stay stable. Queries that needed 8 hops of graph traversal were hitting 5-second P99 latencies. In 2024, a small team rebuilt the entire thing internally. The results were not incremental.

7B nodes, 11B edges · 5M new edges/day