The Story
On January 31st, 2017, an engineer at GitLab began a routine infrastructure project: setting up multiple PostgreSQL servers in staging to test a way of reducing load on GitLab.com's single production database. At roughly 17:20 UTC, before starting that work, the engineer took a snapshot of the production database and loaded it into staging -- wanting a more current copy than the automatic daily snapshot would provide. It was an ordinary first step for an ordinary infrastructure task. Nothing about it suggested what would happen six hours later.
THE INSIGHT: THE TERMINAL DOESN'T TELL YOU WHICH HOST YOU'RE ON
Around 23:00 UTC, increased database load (later traced to a spam wave and a background job mistakenly scheduled to delete a GitLab employee's account) caused replication between the primary and secondary to fall behind, then fail outright. The fix was routine: wipe the secondary's data directory and re-sync it from the primary using pg_basebackup. An engineer went to do exactly that. After a confusing series of failed attempts and unclear error messages, a second engineer, troubleshooting the same stuck process, ran the data-directory wipe again -- on the primary, while believing they were on the secondary. The two terminal sessions looked identical.
Problem
A Lagging Replica Needed a Manual Re-Sync
Increased database load around 23:00 UTC caused the secondary's replication to fall too far behind the primary; the WAL segments it needed had already been removed. With no WAL archiving in place, the only fix was to wipe the secondary's data directory and rebuild it from the primary using pg_basebackup.
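Lag of this kind can be watched before the standby falls past the point of no return. The following is a minimal sketch, not GitLab's monitoring: `classify_replication_lag` and its thresholds are hypothetical, and the lag figure is assumed to come from a query like the one shown in the comment.

```bash
# Hypothetical helper: classify replication lag (bytes of WAL the standby
# has not yet replayed) against operator-chosen thresholds.
#
# One way to obtain the number on a PostgreSQL 9.6 primary, via psql:
#   SELECT pg_xlog_location_diff(pg_current_xlog_location(), replay_location)
#   FROM pg_stat_replication;
# On PostgreSQL 10+ the equivalents are pg_wal_lsn_diff(),
# pg_current_wal_lsn() and replay_lsn.
classify_replication_lag() {
  local lag_bytes="$1" warn_bytes="$2" crit_bytes="$3"

  # Anything that is not a non-negative integer (for example empty output
  # because no standby is connected) is reported as UNKNOWN, not as OK.
  case "$lag_bytes" in
    ''|*[!0-9]*) echo "UNKNOWN"; return 3 ;;
  esac

  if [ "$lag_bytes" -ge "$crit_bytes" ]; then
    echo "CRITICAL"
    return 2
  elif [ "$lag_bytes" -ge "$warn_bytes" ]; then
    echo "WARNING"
    return 1
  fi
  echo "OK"
}
```

Treating an unparseable value as UNKNOWN rather than OK is deliberate: a standby that has disconnected entirely produces no lag figure at all, and that is the case most worth alerting on.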
Cause
pg_basebackup Hung, and a Second Engineer Tried the Same Fix on the Wrong Host
Multiple attempts to run pg_basebackup hung silently with no clear error. Even after the engineers raised max_wal_senders and restarted PostgreSQL to apply the change, the process showed no visible sign of starting replication. Believing they were clearing the secondary's data directory to retry, an engineer ran the wipe command against the primary instead. The mistake was caught within one to two seconds -- but by then, roughly 300 GB had already been removed.
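One mechanical defence against this mistake is to ask the data directory itself what role it plays, rather than trusting the operator's belief about the host. The sketch below is illustrative and not GitLab's tooling. The function name `datadir_is_wipeable` is hypothetical, and it rests on one stated assumption: a standby's data directory contains a marker file (recovery.conf on PostgreSQL 11 and earlier, standby.signal on 12 and later), while a populated directory without one is probably a primary.

```bash
# Hypothetical guard: decide whether a PostgreSQL data directory is safe to
# wipe before re-running pg_basebackup.
datadir_is_wipeable() {
  local datadir="$1"

  if [ ! -d "$datadir" ]; then
    echo "REFUSE: '$datadir' is not a directory" >&2
    return 1
  fi

  # An empty directory holds nothing to lose.
  if [ -z "$(ls -A "$datadir")" ]; then
    return 0
  fi

  # Assumption: only a standby carries one of these marker files.
  if [ -f "$datadir/recovery.conf" ] || [ -f "$datadir/standby.signal" ]; then
    return 0
  fi

  echo "REFUSE: '$datadir' has data but no standby marker; it may be a primary" >&2
  return 1
}
```

A wipe script could call this first, for example `datadir_is_wipeable "$PGDATA" || exit 1`, so that a populated primary is refused regardless of which terminal the command was typed into.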
Solution
Five Backup Mechanisms, Checked One by One -- and Four Failed
With the primary wiped and the secondary already wiped moments earlier as part of the failed re-sync, engineers checked every backup path: daily pg_dump to S3, daily Azure disk snapshots, daily LVM snapshots, and the now-broken replication stream. Only a single LVM snapshot taken roughly six hours earlier, originally created just to refresh the staging environment, was usable.
Result
An 18-Hour Restore From a Staging Snapshot on Slow Disks
GitLab restored production from that six-hour-old LVM snapshot, copying it from slow, throttled staging-environment disks over roughly 18 hours. Service came back online on February 1st. The team estimated they permanently lost around 5,000 projects, 5,000 comments, and 700 user accounts created or modified in the six-hour gap.
Why pg_dump Was Silently Failing for Weeks
GitLab's daily logical backups used pg_dump, but the backup job ran from an application server rather than the database server. With no PostgreSQL data directory of its own from which to detect the running version, it defaulted to an older PostgreSQL 9.2 client against a PostgreSQL 9.6 database. pg_dump refuses to dump from a server whose major version is newer than its own, so the job errored out immediately. It had been failing every single day, and nobody knew, because the failure-notification emails were silently rejected by the receiving mail server for lacking DMARC signing.
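A backup job can detect this class of mismatch itself before it attempts a dump. The sketch below is a hypothetical pre-flight check, not part of GitLab's tooling. Both function names are invented for illustration, and requiring identical major versions is a deliberately conservative policy (a newer pg_dump can usually dump an older server).

```bash
# Reduce a PostgreSQL version string to its major version.
# Before PostgreSQL 10 the major version is the first two components
# (9.2, 9.6); from 10 onward it is the first component alone.
pg_major_version() {
  local version="$1" first rest second
  first="${version%%.*}"
  case "$first" in
    ''|*[!0-9]*) return 1 ;;
  esac
  if [ "$first" -ge 10 ]; then
    echo "$first"
    return 0
  fi
  case "$version" in
    *.*) ;;
    *) return 1 ;;
  esac
  rest="${version#*.}"
  second="${rest%%.*}"
  echo "$first.$second"
}

# Hypothetical check: succeed only if client and server majors are equal.
check_dump_client_version() {
  local client server
  client="$(pg_major_version "$1")" || { echo "cannot parse client version '$1'" >&2; return 2; }
  server="$(pg_major_version "$2")" || { echo "cannot parse server version '$2'" >&2; return 2; }
  if [ "$client" != "$server" ]; then
    echo "MISMATCH: pg_dump is $client, server is $server" >&2
    return 1
  fi
  echo "OK: pg_dump and server are both $client"
}
```

In a real job the two arguments might come from `pg_dump --version | awk '{print $3}'` and `psql -At -c 'SHOW server_version'`. The important part is that a mismatch makes the job exit non-zero through a path that does not depend on email being delivered.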
Why the Azure Snapshots Didn't Help Either
GitLab did use Azure disk snapshots for some infrastructure, including its NFS servers, but they had never been enabled for the database hosts -- the team had assumed their other backup mechanisms were sufficient. Restoring an Azure snapshot across storage accounts could also take hours to days, which would have made it a poor primary recovery path even if it had existed for the database.
THE CORE TECHNICAL INSIGHT
GitLab didn't have zero backup mechanisms -- it had five, on paper. What it didn't have was a single person responsible for verifying that any of them actually worked. Backups that have never been test-restored aren't backups; they're untested assumptions, and on January 31st, four of GitLab's five untested assumptions turned out to be false simultaneously.
The Fix
Five Backups, Zero Owners -- and How GitLab Fixed That
GitLab's fix wasn't a single new backup tool. It was assigning explicit ownership and verification to a backup strategy that had quietly become five disconnected, individually-unverified mechanisms, plus making it structurally harder to confuse one database host for another in the first place.
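Verification can start with something far cheaper than a full restore. The sketch below is a hypothetical first-line check, run separately from the backup job, that fails when the newest backup file is missing, suspiciously small, or stale. The function name, path, and thresholds are all illustrative, and a passing result is not proof of restorability; only a periodic test restore provides that.

```bash
# Hypothetical check: is there a recent backup file of plausible size?
#   $1 = path to the newest backup file
#   $2 = minimum acceptable size in bytes
#   $3 = maximum acceptable age in minutes
check_backup_artifact() {
  local file="$1" min_bytes="$2" max_age_minutes="$3" size

  if [ ! -f "$file" ]; then
    echo "FAIL: no backup file at '$file'" >&2
    return 1
  fi

  # wc -c pads its output on some platforms, so strip the spaces.
  size="$(wc -c < "$file" | tr -d ' ')"
  if [ "$size" -lt "$min_bytes" ]; then
    echo "FAIL: '$file' is only $size bytes (expected >= $min_bytes)" >&2
    return 1
  fi

  # find prints the path only if it was modified within the last N minutes.
  if [ -z "$(find "$file" -mmin "-$max_age_minutes")" ]; then
    echo "FAIL: '$file' is older than $max_age_minutes minutes" >&2
    return 1
  fi

  echo "OK: '$file' ($size bytes)"
}
```

A size floor matters as much as the age check: a dump that errors out immediately can still leave behind a fresh file of a few bytes.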
GitLab's Five Backup Mechanisms on January 31, 2017
| Mechanism | Intended purpose | What actually happened |
|---|---|---|
| pg_dump to S3 (daily) | Logical backup, primary disaster-recovery path | Failed silently for weeks: ran with the wrong major-version client, and the failure emails were rejected for lacking DMARC signing |
| Azure disk snapshots (daily) | Full-disk recovery for NFS and database hosts | Never enabled for database servers; team assumed other backups covered it |
| LVM snapshot (daily, automatic) | Refresh staging environment from production | Existed, but most recent one was nearly 24 hours old |
| LVM snapshot (manual, ad hoc) | One-off snapshot taken ~6 hours before the incident for unrelated testing | The only usable recovery point -- used by chance, not by design |
| Primary-to-secondary replication | Failover for high availability | Already broken before the incident; secondary's data directory was wiped as part of the failed re-sync |
# Illustrative: the class of safeguard GitLab adopted after this incident --
# making it visually and mechanically harder to run a destructive command
# against the wrong host.

# Before: PS1 prompts on db1 and db2 looked nearly identical,
# e.g. "user@cluster:~$" on both -- no environment cue in the terminal itself.
# After: PS1 explicitly encodes host role and environment, and destructive
# commands require an explicit environment-matching confirmation.
export PS1='\[\e[41m\][PRODUCTION:db1-PRIMARY]\[\e[0m\] \u@\h:\w\$ '

# A wrapper around destructive data-directory operations that requires
# the operator to type back the hostname they believe they're targeting.
safe_wipe_data_directory() {
  local target_host="$1"
  local actual_host confirm
  # Assign separately from `local` so a failing hostname call is not masked.
  actual_host="$(hostname -f)" || return 1

  if [[ "$target_host" != "$actual_host" ]]; then
    echo "ABORT: you are on '$actual_host' but specified '$target_host'." >&2
    return 1
  fi

  read -r -p "Type '$actual_host' to confirm wiping its data directory: " confirm
  if [[ "$confirm" != "$actual_host" ]]; then
    echo "ABORT: confirmation did not match current host." >&2
    return 1
  fi

  # ${PGDATA:?} aborts if PGDATA is unset or empty, so this never expands to /*.
  rm -rf "${PGDATA:?PGDATA is not set}"/*
}
THE COUNTERINTUITIVE PART: THE FIX WASN'T A BETTER BACKUP TOOL
The instinct after a backup failure is usually to add another backup tool. GitLab's actual postmortem went a different direction: the highest-priority fix was assigning a named owner for data durability -- someone explicitly responsible for verifying, on a recurring basis, that every backup mechanism could actually restore data. Five backup mechanisms with no owner had quietly degraded into five backup mechanisms nobody was accountable for testing.
Architecture
The incident is really two failures stacked on top of each other: the mistaken wipe of the primary database, and then the much longer tail of discovering that recovery had no working safety net. Two diagrams separate those failures: the cascade that led to the data loss, and the backup topology before vs. after GitLab rebuilt it.
The Cascade: From Replication Lag to a Wiped Primary
Before vs. After: GitLab's Backup Topology
What to Notice in the Cascade
Notice that the data-loss event itself -- the mistaken wipe -- took one to two seconds. Almost everything that made this incident severe happened afterward, while engineers discovered, one mechanism at a time, that their safety net had more holes in it than backup paths. The 'before' topology diagram shows exactly why: every arrow pointing away from the primary represented an assumption, and none of them had a named owner checking whether the assumption still held.
Lessons
GitLab's decision to publish this postmortem in full, including a livestreamed recovery and a public incident document, made it one of the most studied database failures in the industry. The specifics are GitLab's, but the structural failure -- backups nobody verifies -- shows up in nearly every infrastructure team eventually.
What to remember
- A backup that has never been restored is not a backup -- it's an assumption. GitLab had five separate backup and replication mechanisms. Four were broken, and none of the breakage was discovered until the moment recovery actually depended on them. Assign someone to regularly prove, by restoring, that backups work.
- Make the current host impossible to mistake. The engineer who wiped the primary believed they were on the secondary because the two terminal sessions looked identical. A prompt, hostname banner, or confirmation step that makes production unmistakable is cheap insurance against exactly this mistake.
- Validate cross-version compatibility for every tool in your backup chain, not just your application code. GitLab's pg_dump backups failed for weeks because the backup job's client version didn't match the database server's major version -- a mismatch that produced an error, but one nobody saw.
- Failure notifications need a delivery guarantee, not just a send action. The pg_dump failure emails were sent every single day and rejected silently by the receiving mail server over a missing DMARC signature. A notification system needs to verify delivery, or alert through a channel that can't fail the same way email can.
- Replication is a high-availability tool, not a disaster-recovery tool -- treat them as separate problems. GitLab's secondary existed purely for failover, and it was the first thing wiped in the attempted fix, removing the one resource that might otherwise have shortened recovery time.
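The notification lesson above can be sketched as a dead-man's switch: rather than trusting failure messages to arrive, record each success and alert when successes stop. This is a minimal illustration with hypothetical function names and file paths, assuming the freshness check runs from a monitor that is independent of the backup job.

```bash
# Run a command; write the current Unix time to the heartbeat file only
# if the command exits 0. A failed run leaves the old heartbeat untouched.
run_with_heartbeat() {
  local heartbeat_file="$1"
  shift
  "$@" || return $?
  date +%s > "$heartbeat_file"
}

# Succeed only if the heartbeat exists, is well-formed, and is no older
# than max_age_seconds. Intended for a separate monitoring process.
heartbeat_is_fresh() {
  local heartbeat_file="$1" max_age_seconds="$2" last now
  [ -f "$heartbeat_file" ] || return 1
  last="$(cat "$heartbeat_file")"
  case "$last" in
    ''|*[!0-9]*) return 1 ;;
  esac
  now="$(date +%s)"
  [ $((now - last)) -le "$max_age_seconds" ]
}
```

A nightly job might be wrapped as `run_with_heartbeat /var/run/backup.ok ./nightly_backup.sh` (both paths hypothetical), with the monitor paging when `heartbeat_is_fresh /var/run/backup.ok 90000` fails. Silence then becomes the alarm, so a rejected email can no longer hide weeks of failures.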
The Postmortem Became the Recovery
The most surprising long-term outcome wasn't a technical one. GitLab's decision to publish this incident in detail, with names initially included by the engineer's own choice, became one of the clearest examples in the industry of radical transparency building trust rather than destroying it. GitLab went public on NASDAQ in October 2021, and this postmortem is still referenced today, nearly a decade later, in onboarding material at companies that have never used GitLab's product -- purely for what it teaches about backup verification.
THE LIVESTREAM THAT BECAME PART OF THE STORY
GitLab restored GitLab.com while livestreaming the recovery process on YouTube, reaching a peak of around 5,000 concurrent viewers -- briefly the second most-watched livestream on the platform at the time. It wasn't a stunt; it was the same instinct that produced the public postmortem: show the actual recovery, mistakes included, rather than a polished summary after the fact.