Cloudflare operates one of the world's largest content delivery and security networks, with hundreds of points of presence (PoPs) handling customer traffic across the globe. These PoPs operate largely autonomously from the control plane: once configurations are pushed, the network continues operating even if the control plane has issues. But when the control plane goes down, customers can't configure anything. They can't add DNS records, update firewall rules, change SSL settings, or deploy new Workers. The network keeps running, but it becomes effectively immutable. On November 2, 2023, at 11:43 UTC, that's exactly what happened.
The cause was a power failure at Flexential, Cloudflare's primary datacenter partner hosting the control plane infrastructure. Flexential is not a cloud provider — it's a colocation facility where Cloudflare runs its own physical servers. Power failures in colocation facilities, while rare, happen. What made this incident severe was that the control plane recovery was not automatic. Failover to Cloudflare's disaster recovery facility required manual orchestration, and some services — particularly raw log delivery — were not replicated to the DR facility and therefore couldn't be recovered until the primary datacenter came back. Services were still not fully restored 40 hours after the initial failure.
🏭Cloudflare's control plane infrastructure ran in a colocation datacenter — physical servers in a facility owned by a third party, not in a public cloud. This architecture gives Cloudflare hardware control and cost efficiency but means datacenter-level failures require manual failover coordination rather than cloud provider automated region failover.
Problem
Flexential Power Failure at 11:43 UTC Nov 2
Cloudflare's primary datacenter partner experienced a power failure. The control plane — API, dashboard, analytics services — went offline. Edge traffic continued operating normally (PoPs are autonomous), but customers could not make any configuration changes. Internal monitoring and log analytics were also impacted.
Cause
Control Plane Not Designed for Autonomous Failover
Unlike Cloudflare's edge network, the control plane was not designed for automatic failover. Recovery required manual orchestration to bring services up at the DR facility. Some data — particularly raw log streams — was not replicated to DR, meaning certain services could not be restored until the primary facility recovered.
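The dependency described above, where recoverability hinges on whether a service's state was replicated to DR and whether failover is automated, can be sketched as a small audit. This is a minimal illustration, not Cloudflare's tooling: the `Service` type, the `classify` helper, and the inventory entries and their flags are all hypothetical, chosen only to mirror the three situations the text describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    replicated_to_dr: bool    # is the service's state copied to the DR facility?
    automated_failover: bool  # can it move to DR without human orchestration?


def classify(service: Service) -> str:
    """Return the recovery path a service would take if the primary site is lost."""
    if not service.replicated_to_dr:
        # Nothing to fail over to: recovery waits for the primary datacenter.
        return "blocked until primary recovers"
    if service.automated_failover:
        return "automatic failover"
    return "manual failover"


# Hypothetical inventory; the flags are illustrative assumptions.
inventory = [
    Service("api", replicated_to_dr=True, automated_failover=False),
    Service("dashboard", replicated_to_dr=True, automated_failover=False),
    Service("raw-log-delivery", replicated_to_dr=False, automated_failover=False),
]

plan = {service.name: classify(service) for service in inventory}
for name, path in plan.items():
    print(f"{name}: {path}")
```

Running an audit like this before an incident makes the gaps visible in advance: anything that lands in the "blocked" bucket is a service whose recovery time is set by the primary facility, not by the DR plan.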
Solution
DR Failover + Manual Service Restoration
Control plane core functionality was restored at the DR facility at 17:57 UTC on Nov 2 — ~6 hours after the incident started. Many customers saw restored API access at this point. However, some services continued to experience issues until Nov 4 as teams worked through recovery of systems that had data in the primary datacenter only.
Result
Full Restoration Nov 4, Code Orange Invented
Services were fully restored at 04:25 UTC on November 4, just over 40 hours after the initial failure. The incident prompted Cloudflare to create a new process called Code Orange, modeled on Google's Code Yellow/Red, for major incidents requiring all-hands engineering mobilization.
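The durations in this timeline follow directly from the three timestamps quoted above. As a quick check, the arithmetic can be worked out with Python's standard `datetime` module; the variable and function names here are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Timestamps quoted in the timeline above (all UTC).
start = datetime(2023, 11, 2, 11, 43, tzinfo=timezone.utc)
dr_restored = datetime(2023, 11, 2, 17, 57, tzinfo=timezone.utc)
fully_restored = datetime(2023, 11, 4, 4, 25, tzinfo=timezone.utc)


def hours_minutes(delta: timedelta) -> tuple[int, int]:
    """Split a duration into whole hours and leftover minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


time_to_dr = hours_minutes(dr_restored - start)
time_to_full = hours_minutes(fully_restored - start)

print(f"Core control plane back at DR after {time_to_dr[0]}h {time_to_dr[1]}m")
print(f"Full restoration after {time_to_full[0]}h {time_to_full[1]}m")
```

The first interval is 6 hours 14 minutes and the second is 40 hours 42 minutes, which is where the "about six hours" and "40-hour" figures come from.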
CODE ORANGE: A NEW INCIDENT PROCESS IS BORN
Google has a practice where significant crises trigger a Code Yellow or Code Red, in which most engineering resources are shifted to address the issue. Cloudflare had no equivalent process before this incident. The 40-hour outage demonstrated the need for a structured mechanism to mobilize engineering resources across all teams for critical incidents. Code Orange was created as Cloudflare's version of this concept. The process includes defined escalation paths, cross-team coordination protocols, and clear criteria for when to invoke it.
The log push service — Cloudflare's product that delivers raw access logs directly to customer storage buckets in real time — was unavailable for the majority of the outage duration. Unlike the control plane API, which could be brought up at the DR facility using replicated state, the log pipeline infrastructure was primarily hosted in the primary datacenter and not fully replicated to DR. Customers who relied on log push for security monitoring, compliance logging, or billing reconciliation had gaps in their log data that could not be recovered. Cloudflare's postmortem explicitly noted that some datasets which are not replicated in the EU would have persistent gaps — data that would never be recovered regardless of DR restoration success.
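For a customer on the receiving end, gaps like these show up as a jump between consecutive log delivery times. The sketch below shows one simple way to detect such a jump. It assumes you already have the timestamps of the batches that arrived in your bucket; the `find_gaps` helper, the five-minute threshold, and the sample timestamps are all hypothetical and are not part of any Cloudflare API.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(timestamps, max_interval):
    """Return (last_seen, next_seen) pairs separated by more than max_interval."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_interval
    ]


UTC = timezone.utc

# Hypothetical delivery times: steady batches, a long silence, then resumption.
received = [
    datetime(2023, 11, 2, 11, 40, tzinfo=UTC),
    datetime(2023, 11, 2, 11, 41, tzinfo=UTC),
    datetime(2023, 11, 2, 11, 42, tzinfo=UTC),
    datetime(2023, 11, 4, 4, 30, tzinfo=UTC),
    datetime(2023, 11, 4, 4, 31, tzinfo=UTC),
]

gaps = find_gaps(received, max_interval=timedelta(minutes=5))
for gap_start, gap_end in gaps:
    print(f"missing logs between {gap_start.isoformat()} and {gap_end.isoformat()}")
```

A check like this only tells you that data is missing, not whether it can be backfilled. For datasets that were never replicated, the gap is permanent.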
📋Questions for Flexential Still Outstanding
As of the postmortem publication, Cloudflare stated that it still had a number of questions it needed Flexential to answer. A power failure of this duration at a major colocation facility raises questions about redundant power systems, UPS capacity, diesel generator performance, and facility operations procedures. The postmortem was transparent about this outstanding accountability, an unusual admission that the root cause investigation wasn't complete.
ℹ️The Data Plane Continued Running
During the entire 40-hour control plane outage, Cloudflare's data plane continued operating normally — DDoS mitigation, CDN caching, SSL termination, and traffic routing were all functioning. Customers using Cloudflare for traffic performance and security saw no degradation. This is a testament to Cloudflare's edge-resilient architecture — PoPs operate autonomously from the control plane once configured. The outage was exclusively a management plane failure: you couldn't change anything, but what was already configured kept working.
❌Analytics and Dashboard Dark for Enterprise Customers
For Cloudflare's enterprise customers, the control plane outage had real operational consequences beyond configuration changes. Security dashboards showing live attack traffic, WAF logs, DNS analytics, and firewall event monitoring were all unavailable. For up to 40 hours, security teams could not see what was happening on their infrastructure, precisely when they might have needed that visibility most. The monitoring and analytics darkness was in some ways more operationally painful than the configuration lock.
THE CUSTOMER EXPERIENCE ASYMMETRY
Cloudflare's customer base experienced the outage very differently based on what they used Cloudflare for.
Performance-focused customers (CDN, caching) saw nothing — their traffic ran fine.
Security-focused customers (WAF, DDoS mitigation) had protection but lost visibility into attacks.
Developer customers (Workers, Pages, DNS) were locked out of deploying changes until API access returned: about six hours in for many, and up to 40 hours for the services that recovered last.
Analytics-dependent customers had data gaps that couldn't be recovered.
Same outage, four different impact profiles.