The Story

At 11:05 UTC on November 18, 2025, a Cloudflare engineer applied a routine permission update. Fifteen minutes later, X was down, ChatGPT was unreachable, and Shopify merchants couldn't process a single order. The error everywhere was identical: HTTP 500 Internal Server Error.

~6 hrs: Total outage window
2.4B: Monthly active users affected
200+: Feature fields in the bloated file (normal: 60)
0: Data lost or corrupted

Cloudflare's proxies load a feature file every five minutes — it tells each node how to score incoming traffic as human or bot. The permission change triggered a duplicate-row bug in the underlying query. What normally returned 60 features came back with over 200. The file exceeded a hard size limit baked into the proxy binary. When a proxy hit its next scheduled refresh and tried loading the oversized file, it crashed. The tricky part: the refresh cycle is staggered, not synchronized. Some proxies loaded a clean file, others got the broken one. Services flickered — briefly recovered, then crashed again — until every database node was patched.

Problem

The Bot Management feature file grew from 60 fields to more than 200 after a ClickHouse permission change caused the underlying query to return duplicate rows. Proxies crashed on their next five-minute reload cycle.

Problem

ClickHouse permission deployed

At 11:05 UTC, a database permission update is applied. The Bot Management feature-file query begins returning duplicate rows.
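To make the duplicate-row effect concrete, here is a minimal sketch of how a query that suddenly sees the same features through several sources can inflate a 60-entry list past 200. The feature names, the `source_*` labels, and the choice of four visible sources are all assumptions made for illustration; the article does not describe the real query or its schema.

```python
# Hypothetical feature names; the real file's contents are not described in the article.
BASE_FEATURES = [f"feature_{i:02d}" for i in range(60)]


def query_rows(visible_sources):
    """Simulate a metadata query returning one row per (source, feature) pair."""
    return [(source, name) for source in visible_sources for name in BASE_FEATURES]


def build_feature_list(rows, dedupe=False):
    """Flatten query rows into the list of feature names written to the file."""
    names = [name for _source, name in rows]
    if dedupe:
        # dict.fromkeys drops repeats while keeping first-seen order.
        names = list(dict.fromkeys(names))
    return names


# Before the permission change: the query sees a single source.
before = build_feature_list(query_rows(["source_a"]))

# After the change: the same query sees four sources (an assumed number).
widened = query_rows(["source_a", "source_b", "source_c", "source_d"])
after = build_feature_list(widened)
fixed = build_feature_list(widened, dedupe=True)

print(len(before), len(after), len(fixed))  # 60 240 60
```

The point of the sketch is that nothing in the data is "wrong" row by row. The list only becomes dangerous in aggregate, which is why a consumer-side check on the total is worth having even when the producer is expected to deduplicate.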

Cause

Proxy crash cycle begins

At 11:20 UTC, proxies that hit their five-minute refresh crash when they attempt to load the oversized feature file. HTTP 500 errors start propagating globally.

Solution

ClickHouse permission reverted

Cloudflare engineers identify the database change as the source and push a corrective permission update across all database nodes.

Result

Full recovery

By approximately 17:30 UTC, all proxies load clean feature files on their next refresh cycle. Total window: nearly six hours.

The Fix

The ClickHouse permission change was corrected so that the feature-file query returned deduplicated rows again. But Cloudflare added a more important safeguard: proxies now validate the feature file's size before loading it. If a file exceeds the expected ceiling, the proxy rejects it and reverts to its last known good configuration instead of crashing. That single validation step, checking before loading, would have prevented the entire six-hour outage.

Solution

Size validation added to proxy feature file load cycle. Oversized files are rejected; proxy reverts to last known good state rather than crashing.
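The validate-then-fall-back pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not Cloudflare's implementation: the `Proxy` class, the `FeatureFileRejected` exception, the use of feature count as the size measure, and the ceiling of 200 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

MAX_FEATURES = 200  # assumed ceiling, chosen only for this example


class FeatureFileRejected(Exception):
    """Raised when a candidate feature file fails validation."""


def validate(features: List[str], max_features: int = MAX_FEATURES) -> None:
    """Check a candidate file before it is allowed to replace the active one."""
    if not features:
        raise FeatureFileRejected("feature file is empty")
    if len(features) > max_features:
        raise FeatureFileRejected(
            f"{len(features)} features exceeds ceiling of {max_features}"
        )


@dataclass
class Proxy:
    active: List[str] = field(default_factory=list)  # last known good file
    rejected: int = 0  # rejected refreshes, useful for alerting

    def refresh(self, candidate: List[str]) -> bool:
        """Accept the candidate if valid; otherwise keep the last known good file."""
        try:
            validate(candidate)
        except FeatureFileRejected:
            self.rejected += 1
            return False
        self.active = list(candidate)
        return True


good = [f"feature_{i:02d}" for i in range(60)]
bloated = good * 4  # 240 entries, over the assumed ceiling

proxy = Proxy()
accepted_good = proxy.refresh(good)        # True: file becomes active
accepted_bloated = proxy.refresh(bloated)  # False: file rejected, state unchanged
print(accepted_good, accepted_bloated, len(proxy.active), proxy.rejected)  # True False 60 1
```

Counting rejections rather than silently ignoring them matters here: a proxy that keeps serving a stale file is healthy from the outside, so the rejection counter is the only signal that the upstream file has gone bad.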

Operational insight

The staggered refresh cycle that slowed recovery also prevents synchronized failures in normal operation. Don't remove it — add a validation gate before the file is accepted, not after.

Architecture

How the feature file refresh cascade caused staggered failures

Each proxy refreshes its bot management feature file on a rolling schedule. When the bloated file was live, proxies failed on their individual refresh cycle — not all at once.
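A small simulation shows why a rolling schedule spreads both the failures and the recovery across one refresh interval. The proxy names, the per-proxy offsets, and the timestamps (seconds relative to the moment the bloated file went live) are invented for illustration, and the sketch assumes a crashed proxy still retries on its original schedule.

```python
REFRESH_INTERVAL = 300  # seconds; the five-minute cycle described in the article


def next_refresh(offset, not_before, interval=REFRESH_INTERVAL):
    """Earliest time >= not_before at which a proxy with this offset refreshes.

    A proxy with offset k refreshes at every time t where t % interval == k.
    """
    return not_before + (offset - not_before) % interval


# Hypothetical fleet: each proxy refreshes at a different point in the cycle.
offsets = {"proxy-a": 0, "proxy-b": 60, "proxy-c": 120, "proxy-d": 180, "proxy-e": 240}

BAD_FILE_LIVE = 0   # bloated file becomes the one proxies fetch
FIX_LIVE = 1000     # clean file is available again

crash_times = {name: next_refresh(off, BAD_FILE_LIVE) for name, off in offsets.items()}
recovery_times = {name: next_refresh(off, FIX_LIVE) for name, off in offsets.items()}

for name in offsets:
    print(name, "crashes at", crash_times[name], "recovers at", recovery_times[name])

# The last proxy to recover lags the fix by less than one full interval.
worst_lag = max(recovery_times.values()) - FIX_LIVE
print("worst recovery lag:", worst_lag)  # 260
```

In this model no proxy fails at the same instant as another, and none recovers the moment the fix lands. That is the trade the article describes: the stagger limits the blast radius of any single refresh while adding up to one interval of delay to every rollout, good or bad.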

Lessons

What to remember

  1. A hard size limit in a hot-path binary needs a fallback. If the file is too big and the proxy just crashes, there is no recovery — only waiting for the next deploy.
  2. Rolling refresh cycles both protect you and slow your recovery. Know which systems refresh on a schedule so you can budget for that delay when estimating incident resolution time.
  3. Database permission changes touch query behavior. They deserve the same testing rigor as schema migrations — not the same process as access control updates.
  4. When services flicker up and down on a regular interval, look for a scheduled reload cycle before investigating deployment tooling.
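The fourth lesson lends itself to a quick triage check: given the times at which failure bursts began, test whether the gaps between them sit close to multiples of a known reload interval. The function name, the tolerance, and the sample timestamps below are hypothetical, and a real investigation would want more data than this sketch uses.

```python
def matches_reload_cycle(onsets, interval, tolerance):
    """Return True if every gap between failure onsets is near a multiple of interval.

    onsets: sorted failure-burst start times, in seconds.
    interval: suspected reload period, in seconds.
    tolerance: allowed deviation from an exact multiple, in seconds.
    """
    if len(onsets) < 3:
        return False  # too few samples to call anything periodic
    gaps = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    for gap in gaps:
        multiple = round(gap / interval)
        if multiple < 1 or abs(gap - multiple * interval) > tolerance:
            return False
    return True


# Bursts roughly every five minutes, with one cycle skipped: looks scheduled.
periodic = matches_reload_cycle([0, 302, 598, 1201], interval=300, tolerance=15)

# Bursts at irregular spacing: probably not a reload cycle.
irregular = matches_reload_cycle([0, 130, 410, 455], interval=300, tolerance=15)

print(periodic, irregular)  # True False
```

Allowing multiples of the interval, not just the interval itself, accounts for cycles where a node happened to load a clean file and no burst occurred.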
"The proxy didn't crash because of bad data. It crashed because we gave it a file size we promised would never exist."
Cloudflare postmortem, November 2025