The Story

On February 9, 2026, GitHub experienced two back-to-back degradation windows, 16:12 to 17:39 UTC and 18:53 to 20:09 UTC, totalling 2 hours 43 minutes of disrupted service across GitHub.com, the API, Actions, Git over HTTPS, and Copilot. Developers pushing code over HTTPS got errors. CI pipelines stalled. The root cause was a database cluster that had been quietly growing into a single point of failure for two years.

GitHub stores user settings — AI model preferences, governance policies, feature flags — in a dedicated database cluster. When those settings were first introduced, each user's row was a few bytes. Over time, as GitHub added AI model controls and policy enforcement, each row grew to kilobytes. The growth was masked by caching: the database rarely saw the full read load. A configuration change to the caching mechanism invalidated the cache globally. Every user setting had to be fetched from the database simultaneously. The cluster buckled. Because that same cluster handled authentication and user management, every GitHub service that requires a logged-in user broke with it.
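
The masking effect described above can be put in rough numbers. The sketch below is a back-of-envelope model, not GitHub's data: the request rate and hit ratio are hypothetical values chosen only to show how a high cache hit ratio hides most of the read load, and how much reaches the database once the cache is empty.

```python
def db_read_rate(request_rate: float, hit_ratio: float) -> float:
    """Reads per second that miss the cache and reach the database."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return request_rate * (1.0 - hit_ratio)


# Hypothetical figures, not taken from the incident report.
REQUESTS_PER_SECOND = 100_000

steady_state = db_read_rate(REQUESTS_PER_SECOND, hit_ratio=0.99)
after_global_invalidation = db_read_rate(REQUESTS_PER_SECOND, hit_ratio=0.0)

print(f"steady state:  {steady_state:.0f} reads/s")
print(f"cache emptied: {after_global_invalidation:.0f} reads/s")
print(f"amplification: {after_global_invalidation / steady_state:.0f}x")
```

Under these assumed numbers the database is sized and monitored against a small fraction of the real demand, which is why growth in row size and read cost can go unnoticed until the cache stops absorbing it.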

Why this compounded

The database load was invisible during normal operation, hidden behind the cache TTL. It only appeared during model or policy rollouts. Nobody saw the danger accumulating until the cache was gone entirely.

Total user-visible disruption: 2 hr 43 min
Separate outage windows on the same day: 2
SSH Git operations affected: 0
User data lost or corrupted: 0

The Fix

GitHub separated the user settings store from the authentication database cluster. AI model preferences and governance policy data were moved to a purpose-built store with lazy-read semantics, so a settings query failure no longer cascades to login. The cache TTL was restructured with per-user staggered expiry, the same jitter principle that prevents thundering herds, so entries no longer expire in lockstep and a single configuration change should not send every user's read to the database at once. Circuit breakers were added around the settings read path so saturation there no longer propagates to auth.
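
Staggered expiry can be illustrated with a few lines of code. This is a minimal sketch of TTL jitter in general, not GitHub's implementation: the base TTL, the jitter fraction, and the function names are all assumptions made for the example. It covers only the expiry-spreading part of the fix.

```python
import random
from typing import Optional

BASE_TTL_SECONDS = 300.0   # hypothetical base lifetime of a cached settings entry
JITTER_FRACTION = 0.2      # hypothetical spread: +/-20% around the base


def jittered_ttl(base: float = BASE_TTL_SECONDS,
                 fraction: float = JITTER_FRACTION,
                 rng: Optional[random.Random] = None) -> float:
    """Return a TTL drawn uniformly from base*(1-fraction) to base*(1+fraction)."""
    if rng is None:
        rng = random.Random()
    return base * (1.0 + rng.uniform(-fraction, fraction))


def expiry_times(user_ids, now: float, rng: Optional[random.Random] = None) -> dict:
    """Give every user an absolute expiry time with its own jitter."""
    return {uid: now + jittered_ttl(rng=rng) for uid in user_ids}


# Entries written at the same moment now expire at different moments.
demo_rng = random.Random(42)   # seeded only so the demo is repeatable
for uid, expires_at in expiry_times(["u1", "u2", "u3"], now=0.0, rng=demo_rng).items():
    print(uid, round(expires_at, 1))
```

With a fixed TTL, entries populated together also expire together, and the database sees the refill as one burst. Drawing each entry's lifetime from a range spreads the refill over the width of that range.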

Solution

Settings store decoupled from auth cluster. Staggered per-user cache expiry replaces synchronized TTL. Circuit breakers prevent settings saturation from reaching login flows.
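
The circuit-breaker idea in the summary above can be sketched as follows. This is a generic illustration under stated assumptions, not GitHub's code: the class, the thresholds, the `DEFAULT_SETTINGS` fallback, and `load_settings_or_default` are hypothetical names introduced for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("settings store circuit is open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call reopens the circuit immediately.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


DEFAULT_SETTINGS = {"model": "default"}  # hypothetical fallback values


def load_settings_or_default(breaker, fetch, user_id):
    """Return stored settings, or defaults when the settings store is unhealthy."""
    try:
        return breaker.call(fetch, user_id)
    except Exception:
        return dict(DEFAULT_SETTINGS)
```

The design choice worth noting is the fallback: once the breaker opens, callers stop sending reads to the saturated store and proceed with defaults, so a login flow that wraps its settings read this way degrades instead of failing.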

Architecture

Before: auth and settings sharing one database cluster

A single database cluster handled both authentication lookups and user setting reads. A cache invalidation made both fail simultaneously.

Lessons

What to remember

  1. Data that starts as bytes and grows to kilobytes without a review is a slow-burning single point of failure. Audit stored-data size as part of schema review, not just schema shape.
  2. A global cache invalidation is a targeted traffic spike. Never allow it without per-user stagger or lazy reload — especially when the backing store also handles auth.
  3. If your authentication database also stores AI model preferences, you have made login vulnerable to every failed ML policy rollout.
  4. Load masked by cache TTL is still load. Instrument the database query rate during rollouts, not just during steady state.

"The danger accumulated invisibly until catastrophic failure. The cache TTL was hiding a bomb, not preventing one."
Reliability engineer analysis of the February 9 incident