The Story

Between January 31 and February 2, 2026, Dependabot, GitHub's automated dependency security tool, failed to create pull requests for roughly 10% of the repositories it was supposed to update. Engineers and open source maintainers noticed first: security patches weren't appearing. The automation that protects codebases from vulnerable libraries had quietly stopped doing its job for 42 hours.

The hardest part

Dependabot doesn't notify you when it fails to create a PR; the PR simply never appears. Affected repositories went without security patches for 42 hours with no indication that anything was wrong.

A routine cluster failover had routed Dependabot's processing infrastructure to a read-only database replica instead of a writable primary. Dependabot tried to write new pull request records, hit write failures, and moved the work to its retry queue. The queue backed up silently. No error surfaced to repository owners. The jobs sat in limbo until GitHub engineers noticed the anomaly in PR creation rate metrics and rerouted traffic to a healthy writable cluster.

42 hrs: Total degraded window
10%: Of automated PRs silently failed
0: Visible error notifications shown to developers
100%: Failed jobs recovered after traffic reroute

Problem

Cluster failover routed Dependabot writes to a read-only replica. Write failures queued silently. No alert fired. Repositories went without security PR coverage for 42 hours.

The Fix

GitHub added write-verification logic to the Dependabot processing cluster — a failover destination is tested for write capability before job routing begins. A read-only replica is now rejected as a failover destination for write-heavy workloads before any traffic flows to it. Jobs that failed during the degraded window were identified and requeued once a writable primary was restored. GitHub also added PR creation rate monitoring per repository — a drop below expected velocity now triggers an alert rather than accumulating silently in a retry queue.
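The difference between a connectivity check and a write-capability check can be sketched in a few lines. This is a minimal illustration, not GitHub's implementation: it uses SQLite from the Python standard library as a stand-in database, and the names `can_write`, `pick_failover_target`, and the `failover_probe` table are hypothetical.

```python
import sqlite3


def can_write(conn: sqlite3.Connection) -> bool:
    """Return True only if a real write succeeds on this connection.

    Assumption: conn is a dedicated probe connection with no pending
    transaction, because the probe ends with a rollback.
    """
    try:
        # Hypothetical probe table; a read-only destination rejects the write.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS failover_probe (id INTEGER PRIMARY KEY)"
        )
        conn.execute("INSERT INTO failover_probe DEFAULT VALUES")
        conn.rollback()  # discard the probe row, we only needed to know it worked
        return True
    except sqlite3.OperationalError:
        # SQLite raises this on "attempt to write a readonly database".
        conn.rollback()
        return False


def pick_failover_target(candidates: dict):
    """Return the name of the first candidate that accepts writes, else None.

    candidates maps a hypothetical destination name to an open connection.
    """
    for name, conn in candidates.items():
        if can_write(conn):
            return name
    return None
```

A connection opened with `sqlite3.connect("file:app.db?mode=ro", uri=True)` connects successfully, so a connectivity-only check would pass it, but `can_write` returns False and `pick_failover_target` skips it.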

Solution

Failover destinations verified for write capability before routing. PR creation rate alerting added. All failed jobs from the 42-hour window identified and requeued successfully.
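PR creation rate alerting amounts to comparing observed output against an expected baseline. The sketch below is one assumed way to express that; `RateCheck`, its fields, and the 0.95 threshold are hypothetical and not taken from GitHub's tooling. The threshold is chosen deliberately tight: a 10% shortfall like the one in this incident would pass unnoticed under a looser rule such as "alert below 50% of baseline".

```python
from dataclasses import dataclass


@dataclass
class RateCheck:
    """Flags a drop in output rate relative to an expected baseline."""

    expected_per_hour: float  # hypothetical baseline, e.g. from historical data
    min_ratio: float = 0.95   # hypothetical: alert below 95% of the baseline

    def is_degraded(self, observed_count: int, window_hours: float) -> bool:
        """Return True if the observed rate falls below the allowed floor."""
        if window_hours <= 0:
            raise ValueError("window_hours must be positive")
        observed_rate = observed_count / window_hours
        return observed_rate < self.expected_per_hour * self.min_ratio
```

With a baseline of 1000 PRs per hour, `RateCheck(1000).is_degraded(900, 1)` reports a degraded state, while 980 PRs in the same hour does not. The same check could be keyed per repository or applied to an aggregate.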

Before vs after

| Control | Before | After |
| --- | --- | --- |
| Failover destination check | Connectivity only | Write capability verified |
| Silent failure alerting | None | PR creation rate monitoring |
| Failed job visibility | Retry queue only | Logged and surfaced |

Lessons

What to remember

  1. A healthy read-only replica is the wrong failover target for a write-heavy workload. Connectivity checks are not capability checks. Test both.
  2. Silent automation failures are worse than noisy ones. If your tool doesn't create an artifact it was supposed to, it should tell someone immediately — not queue and retry indefinitely.
  3. Track expected output rates for automation, not just error rates. A PR creation rate chart costs almost nothing; discovering a 42-hour security coverage gap does not.
  4. Retry queues hide failures gracefully in the short term but create a thundering herd when the underlying issue resolves. Bound queue depth and alert on queue age.
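The last lesson, bounding queue depth and alerting on queue age, can be sketched as a small wrapper around a queue. This is an illustrative assumption rather than how Dependabot's retry queue works; `BoundedRetryQueue`, `RetryQueueFull`, and the limits are hypothetical names and values.

```python
import time
from collections import deque


class RetryQueueFull(Exception):
    """Raised instead of silently accepting more retries."""


class BoundedRetryQueue:
    def __init__(self, max_depth: int, max_age_seconds: float, clock=time.monotonic):
        self._items = deque()  # entries are (enqueued_at, job)
        self.max_depth = max_depth
        self.max_age_seconds = max_age_seconds
        self._clock = clock  # injectable so the age logic can be tested

    def push(self, job) -> None:
        """Enqueue a job for retry, failing loudly once the depth bound is hit."""
        if len(self._items) >= self.max_depth:
            raise RetryQueueFull(f"retry queue reached {self.max_depth} jobs")
        self._items.append((self._clock(), job))

    def pop(self):
        """Remove and return the oldest job."""
        return self._items.popleft()[1]

    def oldest_age(self) -> float:
        """Seconds the oldest queued job has been waiting (0.0 if empty)."""
        if not self._items:
            return 0.0
        return self._clock() - self._items[0][0]

    def should_alert(self) -> bool:
        """True when the oldest job has waited longer than the allowed age."""
        return self.oldest_age() > self.max_age_seconds
```

A monitoring loop would poll `should_alert()` and page someone when it returns True, so a queue that is retrying without making progress surfaces within `max_age_seconds` rather than after many hours.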
"The queue kept retrying quietly for 42 hours. The system assumed patience was the same thing as progress."
GitHub post-incident review, February 2026