The Story
At 11:05 UTC on November 18, 2025, a Cloudflare engineer applied a routine permission update. Fifteen minutes later, X was down, ChatGPT was unreachable, and Shopify merchants couldn't process a single order. The error everywhere was identical: HTTP 500 Internal Server Error.
Cloudflare's proxies load a feature file every five minutes — it tells each node how to score incoming traffic as human or bot. The permission change triggered a duplicate-row bug in the underlying query. What normally returned 60 features came back with over 200. The file exceeded a hard size limit baked into the proxy binary. When a proxy hit its next scheduled refresh and tried loading the oversized file, it crashed. The tricky part: the refresh cycle is staggered, not synchronized. Some proxies loaded a clean file, others got the broken one. Services flickered — briefly recovered, then crashed again — until every database node was patched.
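The flickering is easier to see in a toy model. The sketch below is only an illustration under stated assumptions, not Cloudflare's implementation: the fleet of five proxies, the round-robin choice of database node, and the names `Proxy`, `simulate`, and `make_fleet` are all hypothetical. Each proxy refreshes once per five-minute cycle at its own offset, and the file it receives is bad whenever the query happened to run on a node that already had the permission change.

```python
from dataclasses import dataclass

REFRESH_SECONDS = 300  # five-minute refresh interval, as described in the text


@dataclass
class Proxy:
    offset: int           # second within each cycle at which this proxy refreshes
    healthy: bool = True


def simulate(proxies, bad_nodes, node_count, duration, step=60):
    """Return the number of healthy proxies at each time step."""
    timeline = []
    for t in range(0, duration, step):
        # Assumption: feature-file queries rotate round-robin over database nodes.
        node = (t // step) % node_count
        for proxy in proxies:
            if t % REFRESH_SECONDS == proxy.offset:
                # A bloated file takes the proxy down; a clean one brings it back.
                proxy.healthy = node not in bad_nodes
        timeline.append(sum(p.healthy for p in proxies))
    return timeline


def make_fleet():
    """Hypothetical fleet: one proxy refreshing at each minute of the cycle."""
    return [Proxy(offset=o) for o in range(0, REFRESH_SECONDS, 60)]


# One of two database nodes has the permission change applied.
print(simulate(make_fleet(), bad_nodes={1}, node_count=2, duration=660))
```

In this model the healthy count drops, partly recovers, and drops again for as long as any node still produces the oversized file, which matches the up-and-down pattern described above.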
Problem
Bot Management feature file grew from 60 fields to 200+ fields due to a ClickHouse permission change producing duplicate rows. Proxies crashed on their next five-minute reload cycle.
Problem
ClickHouse permission deployed
At 11:05 UTC, a database permission update is applied. The Bot Management feature-file query begins returning duplicate rows.
Cause
Proxy crash cycle begins
At 11:20 UTC, proxies that hit their five-minute refresh crash when they attempt to load the oversized feature file. HTTP 500 errors start propagating globally.
Solution
ClickHouse permission reverted
Cloudflare engineers identify the database change as the source and push a corrective permission update across all database nodes.
Result
Full recovery
By approximately 17:30 UTC, all proxies load clean feature files on their next refresh cycle. Total window: roughly six hours.
The Fix
The ClickHouse permission change was corrected so that the feature-file query no longer returned duplicate rows. But Cloudflare added a more important safeguard: proxies now validate the feature file's size before loading it. If a file exceeds the expected ceiling, the proxy rejects it and reverts to its last known good configuration instead of crashing. That single validation step, checking before loading, would have prevented the entire six-hour outage.
Solution
Size validation added to proxy feature file load cycle. Oversized files are rejected; proxy reverts to last known good state rather than crashing.
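A minimal sketch of that validate-then-swap pattern follows. It is not Cloudflare's code: the JSON file format, the `MAX_FEATURES` value, and the names `FeatureStore`, `parse_and_validate`, and `FeatureFileRejected` are assumptions made for illustration. The point it shows is that the candidate file is fully checked before it replaces the active configuration, so a bad file leaves the last known good state untouched.

```python
import json

MAX_FEATURES = 200  # assumed ceiling; the real limit lives in the proxy binary


class FeatureFileRejected(Exception):
    """Raised when a candidate feature file fails validation."""


def parse_and_validate(raw: bytes, max_features: int = MAX_FEATURES) -> list[str]:
    # Assumption: the file is a JSON list of feature names.
    features = json.loads(raw)
    if not isinstance(features, list) or not all(isinstance(f, str) for f in features):
        raise FeatureFileRejected("expected a JSON list of feature names")
    if len(features) > max_features:
        raise FeatureFileRejected(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    return features


class FeatureStore:
    def __init__(self, initial: list[str]):
        self.active = list(initial)  # last known good configuration
        self.rejected = 0            # count of refused files, useful for alerting

    def refresh(self, raw: bytes) -> bool:
        """Swap in the new file only if it validates; otherwise keep the old one."""
        try:
            candidate = parse_and_validate(raw)
        except (FeatureFileRejected, ValueError):
            # ValueError covers malformed JSON and undecodable bytes.
            self.rejected += 1
            return False
        self.active = candidate
        return True
```

A caller would invoke `refresh` on each scheduled reload and alert when `rejected` increases, so an oversized file shows up as a warning rather than as a crash.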
Operational insight
The staggered refresh cycle that slowed recovery also prevents synchronized failures in normal operation. Don't remove it — add a validation gate before the file is accepted, not after.
Architecture
How the feature file refresh cascade caused staggered failures
Lessons
What to remember
- A hard size limit in a hot-path binary needs a fallback. If the file is too big and the proxy just crashes, there is no recovery — only waiting for the next deploy.
- Rolling refresh cycles both protect you and slow your recovery. Know which systems refresh on a schedule so you can factor that delay into your incident-resolution estimates.
- Database permission changes can alter query behavior. They deserve the same testing rigor as schema migrations, rather than being handled as routine access control updates.
- When services flicker up and down on a regular interval, look for a scheduled reload cycle before investigating deployment tooling.
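One way to give permission changes migration-style testing is to compare a query's output before and after the change in a staging environment. The sketch below is a hypothetical check, not something described in the incident report: the function names, the tuple row format, and the `max_growth` threshold are assumptions. It flags duplicate rows and unexpected growth in row count, the two symptoms described in this write-up.

```python
from collections import Counter


def duplicate_rows(rows):
    """Return each row that appears more than once, with its count."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def check_query_output(before, after, max_growth=1.5):
    """Compare query results captured before and after a permission change.

    `before` and `after` are lists of hashable rows, e.g. (name, type) tuples.
    Returns a list of human-readable problems; an empty list means no findings.
    """
    problems = []
    dups = duplicate_rows(after)
    if dups:
        problems.append(f"{len(dups)} distinct rows are duplicated")
    if len(after) > len(before) * max_growth:
        problems.append(f"row count grew from {len(before)} to {len(after)}")
    return problems


# Usage with made-up rows: the change causes every row to be returned twice.
baseline = [("feature_a", "Float64"), ("feature_b", "Float64")]
print(check_query_output(baseline, baseline * 2))
```

Running a check like this against a staging copy before the permission change ships turns a silent change in query behavior into a failed test.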