The Story
Between January 31 and February 2, 2026, Dependabot, GitHub's automated dependency security tool, failed to create pull requests for roughly 10% of the repositories it was supposed to update. Engineers and open source maintainers noticed it first: security patches weren't appearing. The automation that protects codebases from vulnerable libraries had quietly stopped doing its job for 42 hours.
The hardest part
Dependabot doesn't notify you when it doesn't create a PR; it just doesn't create one. Repositories went without security patches for 42 hours with no indication anything was wrong.

A routine cluster failover had routed Dependabot's processing infrastructure to a read-only database replica instead of a writable primary. Dependabot tried to write new pull request records, hit write failures, and moved the work to its retry queue. The queue backed up silently. No error surfaced to repository owners. The jobs sat in limbo until GitHub engineers noticed the anomaly in PR creation rate metrics and rerouted traffic to a healthy writable cluster.
Problem
Cluster failover routed Dependabot writes to a read-only replica. Write failures queued silently. No alert fired. Repositories went without security PR coverage for 42 hours.
The Fix
GitHub added write-verification logic to the Dependabot processing cluster — a failover destination is tested for write capability before job routing begins. A read-only replica is now rejected as a failover destination for write-heavy workloads before any traffic flows to it. Jobs that failed during the degraded window were identified and requeued once a writable primary was restored. GitHub also added PR creation rate monitoring per repository — a drop below expected velocity now triggers an alert rather than accumulating silently in a retry queue.
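
To make the write-verification idea concrete, here is a minimal sketch of a failover check that tests capability rather than connectivity. It is not GitHub's implementation: SQLite stands in for the real database, and the names `can_write`, `pick_failover_target`, and the `failover_probe` table are hypothetical.

```python
import sqlite3


def can_write(conn: sqlite3.Connection) -> bool:
    """Return True only if an actual write succeeds on this connection."""
    try:
        # A connectivity check such as SELECT 1 passes on a read-only
        # replica, so the probe has to perform a real write.
        # `failover_probe` is a hypothetical scratch table for this sketch.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS failover_probe (id INTEGER PRIMARY KEY)"
        )
        conn.execute("INSERT INTO failover_probe DEFAULT VALUES")
        return True
    except sqlite3.OperationalError:
        # SQLite raises this for a read-only database; any other
        # operational failure is also treated as "not writable".
        return False
    finally:
        conn.rollback()  # discard the probe row; harmless if nothing was written


def pick_failover_target(candidates):
    """Return the name of the first candidate that accepts writes.

    `candidates` is an iterable of (name, connection) pairs in preference order.
    """
    for name, conn in candidates:
        if can_write(conn):
            return name
    raise RuntimeError("no writable failover destination available")
```

The point of the sketch is the ordering: the probe runs before any job is routed, and a destination that answers queries but rejects writes is skipped instead of being selected. Raising when no candidate is writable keeps the failure loud rather than letting work pile up behind an unusable target.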
Solution
Failover destinations verified for write capability before routing. PR creation rate alerting added. All failed jobs from the 42-hour window identified and requeued successfully.
Before vs after
| Control | Before | After |
|---|---|---|
| Failover destination check | Connectivity only | Write capability verified |
| Silent failure alerting | None | PR creation rate monitoring |
| Failed job visibility | Retry queue only | Logged and surfaced |
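
The "PR creation rate monitoring" row can be illustrated with a small output-rate check. This is a sketch under assumptions, not GitHub's alerting logic: the `RepoWindow` type, the `min_ratio` and `min_expected` thresholds, and the idea of a per-repository baseline are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RepoWindow:
    repo: str
    expected_prs: float  # hypothetical baseline, e.g. a trailing average for this window
    observed_prs: int    # PRs actually created in this window


def repos_below_expected(windows, min_ratio=0.5, min_expected=1.0):
    """Return the repos whose observed PR count fell below min_ratio of baseline."""
    alerts = []
    for w in windows:
        if w.expected_prs < min_expected:
            continue  # too little baseline signal to judge this repo
        if w.observed_prs < w.expected_prs * min_ratio:
            alerts.append(w.repo)
    return alerts
```

The check compares output against expectation instead of counting errors, so it still fires when failures are absorbed by a retry queue and never raised. Skipping repositories with a very low baseline is one way to avoid alerting on repos that rarely receive PRs in the first place.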
Lessons
What to remember
- A healthy read-only replica is the wrong failover target for a write-heavy workload. Connectivity checks are not capability checks. Test both.
- Silent automation failures are worse than noisy ones. If your tool doesn't create an artifact it was supposed to, it should tell someone immediately — not queue and retry indefinitely.
- Track expected output rates for automation, not just error rates. A PR creation rate chart costs almost nothing; discovering a 42-hour security coverage gap does not.
- Retry queues hide failures gracefully in the short term but create a thundering herd when the underlying issue resolves. Bound queue depth and alert on queue age.
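
The last lesson, bounding queue depth and alerting on queue age, could be sketched as follows. The class name `BoundedRetryQueue`, its parameters, and the `QueueFull` exception are hypothetical; a production queue would also need persistence and a rate-limited drain to soften the thundering herd.

```python
import time
from collections import deque


class QueueFull(Exception):
    """Raised when the retry queue has reached its depth limit."""


class BoundedRetryQueue:
    def __init__(self, max_depth, max_age_seconds, clock=time.monotonic):
        self.max_depth = max_depth
        self.max_age_seconds = max_age_seconds
        self._clock = clock
        self._items = deque()  # (enqueued_at, job) pairs, oldest first

    def push(self, job):
        # Refuse new work loudly instead of letting the backlog grow unseen.
        if len(self._items) >= self.max_depth:
            raise QueueFull(f"retry queue at max depth {self.max_depth}")
        self._items.append((self._clock(), job))

    def pop(self):
        """Remove and return the oldest job."""
        return self._items.popleft()[1]

    def oldest_age(self):
        """Seconds the oldest queued job has been waiting (0.0 if empty)."""
        if not self._items:
            return 0.0
        return self._clock() - self._items[0][0]

    def is_stale(self):
        """True when the oldest job has waited longer than the age limit."""
        return self.oldest_age() > self.max_age_seconds
```

A monitor would poll `is_stale()` and page someone when it returns true, which turns a silent backlog into a visible one. The depth limit does the same at enqueue time: the caller gets an exception it has to handle rather than an ever-growing queue.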