GitLab's 2017 Outage: When All 5 Backups Failed

The Story

On January 31st, 2017, an engineer at GitLab began a routine infrastructure project: setting up multiple PostgreSQL servers in staging for testing, hoping to reduce load on GitLab.com's single production database. At roughly 17:20 UTC, before starting that work, the engineer took a snapshot of the production database and loaded it into staging -- wanting a more current copy than the automatic daily snapshot would provide. It was an ordinary first step for an ordinary infrastructure task. Nothing about it suggested what would happen six hours later.

GitLab.com ran on a single primary PostgreSQL database, `db1.cluster.gitlab.com`, with one secondary, `db2.cluster.gitlab.com`, kept purely as a hot-standby failover -- not a load-balancing replica. That single-primary setup had already caused multiple prior incidents, including a database outage in November 2016. Every write GitLab.com's users made depended on that one host staying healthy.

THE INSIGHT: THE TERMINAL DOESN'T TELL YOU WHICH HOST YOU'RE ON

Around 23:00 UTC, increased database load (later traced to a spam wave and a background job mistakenly scheduled to delete a GitLab employee's account) caused replication between the primary and secondary to fall behind, then fail outright. The fix was routine: wipe the secondary's data directory and re-sync it from the primary using pg_basebackup. An engineer went to do exactly that. After a confusing series of failed attempts and unclear error messages, a second engineer, troubleshooting the same stuck process, ran the data-directory wipe again -- on the primary, while believing they were on the secondary. The two terminal sessions looked identical.

Problem

A Lagging Replica Needed a Manual Re-Sync

Increased database load around 23:00 UTC caused the secondary's replication to fall too far behind the primary; the WAL segments it needed had already been removed. With no WAL archiving in place, the only fix was to wipe the secondary's data directory and rebuild it from the primary using pg_basebackup.

Cause

pg_basebackup Hung, and a Second Engineer Tried the Same Fix on the Wrong Host

Multiple attempts to run pg_basebackup hung silently, with no clear error. Engineers raised max_wal_senders and restarted PostgreSQL so the new setting would take effect, but pg_basebackup still gave no visible sign that replication had started. Believing they were clearing the secondary's data directory before another retry, an engineer ran the wipe command against the primary instead. The mistake was caught within one to two seconds -- but by then, roughly 300GB had already been removed.
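One mechanical guard against this class of mistake is to have the wipe step inspect the data directory itself, rather than trusting the operator's sense of which host they are on. The sketch below is an illustration under stated assumptions, not GitLab's actual procedure. The function name `data_dir_wipe_allowed` is hypothetical. It relies only on files PostgreSQL itself maintains: `postmaster.pid` while a server is running, `PG_VERSION` in an initialized directory, and `recovery.conf` (PostgreSQL 11 and earlier) or `standby.signal` (PostgreSQL 12 and later) on a standby.

```shell
# Hypothetical guard: decide whether a data directory may be wiped for a
# replica rebuild. Returns 0 if allowed, 1 if refused (reason on stderr).
data_dir_wipe_allowed() {
  local dir="$1"

  if [ -z "$dir" ] || [ ! -d "$dir" ]; then
    echo "REFUSE: '$dir' is not a directory." >&2
    return 1
  fi

  # A running PostgreSQL server keeps postmaster.pid in its data directory.
  if [ -e "$dir/postmaster.pid" ]; then
    echo "REFUSE: a server appears to be running from '$dir'." >&2
    return 1
  fi

  # An initialized directory with no standby marker may be a primary.
  if [ -e "$dir/PG_VERSION" ] \
     && [ ! -e "$dir/recovery.conf" ] \
     && [ ! -e "$dir/standby.signal" ]; then
    echo "REFUSE: '$dir' is initialized but has no standby marker." >&2
    return 1
  fi

  return 0
}
```

A check like this would have refused the wipe on a live primary, whose data directory contains `postmaster.pid`. It is deliberately conservative: a stale `postmaster.pid` left behind by a crash would also trigger a refusal and need a human decision.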

Solution

Five Backup Mechanisms, Checked One by One -- and Four Failed

With the primary wiped and the secondary already wiped moments earlier as part of the failed re-sync, engineers checked every backup path: daily pg_dump to S3, daily Azure disk snapshots, daily LVM snapshots, and the now-broken replication stream. Only a single LVM snapshot taken roughly six hours earlier, originally created just to refresh the staging environment, was usable.

Result

An 18-Hour Restore From a Staging Snapshot on Slow Disks

GitLab restored production from that six-hour-old LVM snapshot, copying it from slow, throttled staging-environment disks over roughly 18 hours. Service came back online on February 1st. The team estimated they permanently lost around 5,000 projects, 5,000 comments, and 700 user accounts created or modified in the six-hour gap.

Why pg_dump Was Silently Failing for Weeks

GitLab's daily logical backups used pg_dump, but the backup job ran from an application server, not the database server. With no PostgreSQL data directory of its own to detect the running version from, it defaulted to an older PostgreSQL 9.2 client against a PostgreSQL 9.6 database. pg_dump refuses to dump from a server whose major version is newer than its own, so every run errored out immediately. It had been failing every single day, and nobody knew, because the failure-notification emails were silently rejected by the receiving mail server for lacking DMARC signing.
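A version mismatch like this can be caught by a preflight check that runs before the dump. The following is a minimal sketch, not GitLab's tooling: `pg_major` and `pg_dump_version_ok` are hypothetical helper names. The equality policy is an assumption that is stricter than pg_dump itself, which only rejects servers newer than the client.

```shell
# Reduce a PostgreSQL version string to its major version.
# Before version 10 the major version is the first two components (9.6);
# from 10 onward it is the first component only (14).
pg_major() {
  local ver first rest second
  # Pull the first dotted number out of strings such as
  # "pg_dump (PostgreSQL) 9.2.x" or a bare "9.6.x".
  ver=$(printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9.]*\).*/\1/p')
  first="${ver%%.*}"
  if [ -z "$first" ]; then
    return 1
  fi
  if [ "$first" -ge 10 ]; then
    printf '%s\n' "$first"
  else
    rest="${ver#*.}"
    second="${rest%%.*}"
    printf '%s.%s\n' "$first" "$second"
  fi
}

# Succeed only when client and server share a major version.
# Returns 2 if either version string cannot be parsed.
pg_dump_version_ok() {
  local client server
  client=$(pg_major "$1") || return 2
  server=$(pg_major "$2") || return 2
  [ "$client" = "$server" ]
}

# Possible use in a backup job (requires a reachable database):
#   client_ver=$(pg_dump --version)
#   server_ver=$(psql -Atc 'SHOW server_version')
#   pg_dump_version_ok "$client_ver" "$server_ver" || exit 1
```

Failing the job loudly at this step turns a silent daily error into an explicit, named precondition that can be monitored on its own.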

Why the Azure Snapshots Didn't Help Either

GitLab did use Azure disk snapshots for some infrastructure, including its NFS servers, but they had never been enabled for the database hosts -- the team had assumed their other backup mechanisms were sufficient. Restoring an Azure snapshot across storage accounts could also take hours to days, which would have made it a poor primary recovery path even if it had existed for the database.

THE CORE TECHNICAL INSIGHT

GitLab didn't have zero backup mechanisms -- it had five, on paper. What it didn't have was a single person responsible for verifying that any of them actually worked. Backups that have never been test-restored aren't backups; they're untested assumptions, and on January 31st, four of GitLab's five untested assumptions turned out to be false simultaneously.
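A real restore test needs a scratch database, but even a cheap first-line check would surface a backup job that has quietly stopped producing output. The sketch below assumes backups land as files on a path the checker can read. The function name `backup_artifact_ok` and its thresholds are hypothetical, and the check complements a periodic test restore rather than replacing one.

```shell
# Hypothetical check: succeed only if FILE exists, is a regular file,
# is larger than MIN_BYTES, and was modified less than MAX_AGE_MIN
# minutes ago.
backup_artifact_ok() {
  local file="$1" min_bytes="$2" max_age_min="$3" hit
  hit=$(find "$file" -maxdepth 0 -type f \
          -size +"${min_bytes}"c -mmin -"${max_age_min}" 2>/dev/null)
  [ -n "$hit" ]
}

# Possible use from a scheduled job, with illustrative values only:
#   backup_artifact_ok /path/to/latest.dump 1000000 1500 \
#     || echo "latest backup is missing, tiny, or stale" >&2
```

A missing file, an empty file, and a stale file all fail the same way here, which is the point: the check asserts that a plausible backup exists, not that the job reported success.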

The Fix

Five Backups, Zero Owners -- and How GitLab Fixed That

GitLab's fix wasn't a single new backup tool. It was assigning explicit ownership and verification to a backup strategy that had quietly become five disconnected, individually-unverified mechanisms, plus making it structurally harder to confuse one database host for another in the first place.

~300GB

Production data removed in roughly one to two seconds before the command was stopped

4 of 5

Backup and replication mechanisms that turned out to be broken or unusable when actually needed

6 hours

Age of the one LVM snapshot that ended up being GitLab's actual recovery path

Hourly

New LVM snapshot frequency after the incident, up from once every 24 hours

GitLab's Five Backup Mechanisms on January 31, 2017

Mechanism	Intended purpose	What actually happened
pg_dump to S3 (daily)	Logical backup, primary disaster-recovery path	Failed silently for weeks: ran with the wrong major-version client, and the failure emails were rejected for lacking DMARC signing
Azure disk snapshots (daily)	Full-disk recovery for NFS and database hosts	Never enabled for database servers; team assumed other backups covered it
LVM snapshot (daily, automatic)	Refresh staging environment from production	Existed, but most recent one was nearly 24 hours old
LVM snapshot (manual, ad hoc)	One-off snapshot taken ~6 hours before the incident for unrelated testing	The only usable recovery point -- used by chance, not by design
Primary-to-secondary replication	Failover for high availability	Already broken before the incident; secondary's data directory was wiped as part of the failed re-sync

bash

# Illustrative: the class of safeguard GitLab adopted after this incident --
# making it visually and mechanically harder to run a destructive command
# against the wrong host.

# Before: PS1 prompts on db1 and db2 looked nearly identical,
# e.g. "user@cluster:~$" on both -- no environment cue in the terminal itself.

# After: PS1 explicitly encodes host role and environment, and destructive
# commands require an explicit environment-matching confirmation.
export PS1='\[\e[41m\][PRODUCTION:db1-PRIMARY]\[\e[0m\] \u@\h:\w\$ '

# A wrapper around destructive data-directory operations that requires
# the operator to type back the hostname they believe they're targeting.
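safe_wipe_data_directory() {
  local target_host="$1"
  local actual_host confirm
  # Assign separately from 'local' so a hostname failure is not masked.
  actual_host="$(hostname -f)" || return 1

  if [[ "$target_host" != "$actual_host" ]]; then
    echo "ABORT: you are on '$actual_host' but specified '$target_host'." >&2
    return 1
  fi

  read -r -p "Type '$actual_host' to confirm wiping its data directory: " confirm
  if [[ "$confirm" != "$actual_host" ]]; then
    echo "ABORT: confirmation did not match current host." >&2
    return 1
  fi

  # ${PGDATA:?} aborts if PGDATA is unset or empty, so this line can
  # never expand to "rm -rf /*".
  rm -rf -- "${PGDATA:?PGDATA must be set}"/*
}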

THE COUNTERINTUITIVE PART: THE FIX WASN'T A BETTER BACKUP TOOL

The instinct after a backup failure is usually to add another backup tool. GitLab's actual postmortem went a different direction: the highest-priority fix was assigning a named owner for data durability -- someone explicitly responsible for verifying, on a recurring basis, that every backup mechanism could actually restore data. Five backup mechanisms with no owner had quietly degraded into five backup mechanisms nobody was accountable for testing.

Architecture

The incident is really two failures stacked on top of each other: the mistaken wipe of the primary database, and then the much longer tail of discovering that recovery had no working safety net. Two diagrams separate those failures: the cascade that led to the data loss, and the backup topology before vs. after GitLab rebuilt it.

[Diagram: The Cascade: From Replication Lag to a Wiped Primary]

[Diagram: Before vs. After: GitLab's Backup Topology]

What to Notice in the Cascade

Notice that the data-loss event itself -- the mistaken wipe -- took one to two seconds. Almost everything that made this incident severe happened afterward, while engineers discovered, one mechanism at a time, that their safety net had more holes in it than backup paths. The 'before' topology diagram shows exactly why: every arrow pointing away from the primary represented an assumption, and none of them had a named owner checking whether the assumption still held.

Lessons

GitLab's decision to publish this postmortem in full, including a livestreamed recovery and a public incident document, made it one of the most studied database failures in the industry. The specifics are GitLab's, but the structural failure -- backups nobody verifies -- shows up in nearly every infrastructure team eventually.

What to remember

A backup that has never been restored is not a backup -- it's an assumption. GitLab had five separate backup and replication mechanisms. Four were broken, and none of the breakage was discovered until the moment recovery actually depended on them. Assign someone to regularly prove, by restoring, that backups work.
Make the current host impossible to mistake. The engineer who wiped the primary believed they were on the secondary because the two terminal sessions looked identical. A prompt, hostname banner, or confirmation step that makes production unmistakable is cheap insurance against exactly this mistake.
Validate cross-version compatibility for every tool in your backup chain, not just your application code. GitLab's pg_dump backups failed for weeks because the backup job's client version didn't match the database server's major version -- a mismatch that produced an error, but one nobody saw.
Failure notifications need a delivery guarantee, not just a send action. The pg_dump failure emails were sent every single day and rejected silently by the receiving mail server over a missing DMARC signature. A notification system needs to verify delivery, or alert through a channel that can't fail the same way email can.
Replication is a high-availability tool, not a disaster-recovery tool -- treat them as separate problems. GitLab's secondary existed purely for failover, and it was the first thing wiped in the attempted fix, removing the one resource that might otherwise have shortened recovery time.
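The notification lesson above can be sketched as a pull-based check. Instead of trusting that a failure email arrives, the job records its outcome in a file, and a separate monitor treats a failed, stale, or missing record as an alarm. This is a minimal illustration with hypothetical names (`run_and_record`, `last_run_healthy`) and an assumed status-file format, not a description of GitLab's monitoring.

```shell
# Run a job and record its outcome in STATUS_FILE as
# "<ok|fail> <exit code> <epoch seconds>". Returns the job's exit code.
run_and_record() {
  local status_file="$1" rc
  shift
  if "$@"; then rc=0; else rc=$?; fi
  if [ "$rc" -eq 0 ]; then
    printf 'ok %s %s\n' "$rc" "$(date +%s)" > "$status_file"
  else
    printf 'fail %s %s\n' "$rc" "$(date +%s)" > "$status_file"
  fi
  return "$rc"
}

# Succeed only if the last recorded run succeeded AND is recent enough.
# A missing, unreadable, or malformed record counts as unhealthy.
last_run_healthy() {
  local status_file="$1" max_age_s="$2" status rc ts now
  [ -r "$status_file" ] || return 1
  read -r status rc ts < "$status_file" || return 1
  case "$ts" in
    ''|*[!0-9]*) return 1 ;;
  esac
  now=$(date +%s)
  [ "$status" = "ok" ] && [ $((now - ts)) -le "$max_age_s" ]
}
```

The design choice is that the monitor fails closed. If the job never runs, the record goes stale and the check fails, so a broken delivery channel cannot hide a broken backup.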

The Postmortem Became the Recovery

The most surprising long-term outcome wasn't a technical one. GitLab's decision to document this incident publicly, in detail, with names initially included by the engineer's own choice, became one of the clearest examples in the industry of radical transparency building trust rather than destroying it. GitLab went public on NASDAQ in October 2021, and this postmortem is still referenced today, nearly a decade later, in onboarding material at companies that have never used GitLab's product -- purely for what it teaches about backup verification.

THE LIVESTREAM THAT BECAME PART OF THE STORY

GitLab restored GitLab.com while livestreaming the recovery process on YouTube, reaching a peak of around 5,000 concurrent viewers -- briefly the second most-watched livestream on the platform at the time. It wasn't a stunt; it was the same instinct that produced the public postmortem: show the actual recovery, mistakes included, rather than a polished summary after the fact.

Five backup mechanisms, four of them theoretical. GitLab didn't lose a database that day -- it lost the assumption that having a backup plan and having a working backup plan are the same sentence.
