The Story
On March 5, 2026, GitHub's CI/CD coordination layer hit a ceiling. A Redis cluster responsible for managing the GitHub Actions job queue degraded, and within minutes 95% of workflow runs couldn't start. Pull request checks froze — not with an error, just silence. The average delay before jobs eventually ran was 30 minutes. For teams running time-sensitive deploys or on-call automated checks, that silence cost real hours.
What Actions Redis actually does
The Redis cluster coordinates job dispatch: which workflow runs are pending, which runners are available, which jobs can be scheduled. When it degrades, the queue coordination layer stops working. GitHub receives the webhook trigger, acknowledges it — and then loses the job in the queue.The failure exposed a structural weakness: all workflow job coordination ran through a single Redis cluster with no sharding by workload type. A failure in that cluster wasn't a partial outage — it was a near-total one. Replication lag prevented a clean failover. The replica that took over inherited an incomplete queue state, which caused a second wave of issues as engineers tried to recover the backlog without triggering a thundering herd of queued jobs firing simultaneously.
Problem
A single Redis cluster owned all Actions job queue coordination. Degradation blocked 95% of workflow dispatch. Replication lag made failover incomplete, prolonging recovery.
Problem
Redis cluster degrades
Job dispatch rate drops sharply. Engineers are paged. 95% of Actions workflows stop starting within five minutes.
Cause
Replication lag prevents clean failover
The replica promoted to primary inherits an incomplete queue state, causing coordination inconsistencies and prolonging job dispatch failures.
Solution
Traffic rerouted to secondary cluster
Engineers route coordination traffic to a secondary Redis cluster while the primary is recovered and queue state reconciled.
Result
Actions resumes, backlog flushes
Full dispatch capacity restored within 2 hours. Queued jobs flush over the following 4 hours without a stampede, due to controlled release.
The Fix
GitHub sharded the Actions job queue so no single Redis cluster owns all pending coordination. The queue is now partitioned by runner type and organization tier — a failure in one shard degrades a slice of capacity, not 95% of all workflows. Dead-letter queue logic was added so jobs that cannot be dispatched are logged and requeued rather than silently dropped. The Redis cluster was also horizontally scaled with enough replica headroom to absorb a failover under live load, not just in a quiet maintenance window.
Solution
Job queue partitioned by shard. Dead-letter queue added for failed dispatches. Redis cluster horizontally scaled with 3x replica headroom for load-bearing failover.
Lessons
What to remember
- A job queue that is 95% blocked by one cluster failure is a single point of failure — even if that cluster is technically replicated. Replication is not sharding.
- Silent job loss is worse than a visible error. Dead-letter queues surface what was dropped. Without them, you're debugging ghost workflows.
- Test your Redis failover under production-level load, not in a quiet window. Replication lag under load is not the same as replication lag during maintenance.
- Backlog release after a queue outage is its own risk. Control the flush rate to avoid a thundering herd of held jobs firing all at once.
A job that disappears without an error is harder to diagnose than a job that fails loudly. Build for observability in the failure path, not just the success path.