Datadog Went Dark for 24 Hours and Came Back With a Different Philosophy

On March 8, 2023, Datadog — the platform engineers use to know when their own infrastructure is broken — broke. For more than 24 hours, across five regions on three cloud providers, metrics stopped arriving, logs disappeared, and dashboards showed nothing. The people whose job was to fix it couldn't see what was happening. It cost $5 million. It changed how Datadog thinks about building software.

24h+ global outage across 5 regions and 3 cloud providers