OpenAI

Every OpenAI engineering case study on TechLogStack — real production incidents, post-mortems, and fixes.

OpenAI Reliability
20 min

OpenAI Deployed a Tool to Monitor Kubernetes — and It Took Down All of Kubernetes

On December 11, 2024, OpenAI deployed a new telemetry service designed to improve Kubernetes observability. Within 29 minutes, it had crashed the Kubernetes control plane across every cluster. ChatGPT, the API, and Sora were all unavailable for over four hours. The engineers trying to fix it couldn't run kubectl, because the control plane they needed to roll back the change was the very thing that had gone down.

OpenAI Databases
18 min

OpenAI Runs ChatGPT for 800 Million Users on One PostgreSQL Instance — and It Works

ChatGPT has 800 million users. It handles millions of database queries per second. And it runs on a single primary PostgreSQL instance on Azure — one writer, backed by about fifty read replicas. No sharding. No distributed SQL. Just Postgres, pushed further than almost anyone thought possible through obsessive optimization and ruthless operational discipline.

800M users, 1 primary PG instance | ~50 read replicas globally | 5-second DDL timeout enforced