On November 18, 2025, Cloudflare experienced a significant global outage that severely impacted its network and the numerous internet services relying on its infrastructure. This incident was characterized by widespread 500 errors and affected millions of users and thousands of websites worldwide.
The disruption commenced at approximately 11:20 UTC on November 18, 2025, with Cloudflare's network beginning to experience significant failures in delivering core network traffic 1. Cloudflare officially acknowledged internal service degradation and initiated an investigation around 11:40-11:48 UTC 2. The timeline of key events is detailed below:
| Time (UTC) | Event Description | Reference |
|---|---|---|
| 11:20 | Cloudflare's network began experiencing significant failures to deliver core network traffic. | 1 |
| 11:48 | Cloudflare confirmed internal service degradation and began investigating. | 2 |
| 13:09 | The issue was identified, and a fix was being implemented. | 2 |
| 14:30 | Core traffic was largely flowing as normal. | 1 |
| 14:42 | A fix was implemented, and the incident was believed resolved, with monitoring continuing. | 2 |
| 17:06 | All Cloudflare systems were functioning as normal. | 1 |
| 17:44 | Cloudflare confirmed services were operating normally, with a deeper investigation underway. | 2 |
The period of significant impact lasted approximately 3 hours and 10 minutes, from 11:20 UTC to 14:30 UTC, while full restoration of all systems took about 5 hours and 46 minutes, concluding at 17:06 UTC 1.
The outage led to significant failures across Cloudflare's global network, affecting both its internal services and a vast array of customer-facing platforms. Cloudflare's own services experienced degraded performance, including Cloudflare Sites and Services, Bot Management, CDN/Cache, Dashboard, Firewall, Network, and Workers 2. Additionally, Cloudflare Access and WARP were affected, with WARP access being temporarily disabled in London during the incident 2.
Customers encountered prevalent errors such as "Internal server error Error code 500" 4 and a high volume of 5xx HTTP status codes 1. Numerous external platforms and applications that rely on Cloudflare's infrastructure became inaccessible, including major services like X (formerly Twitter), ChatGPT, Canva, Grindr 4, Discord, and Shopify. Even outage tracking sites such as Downdetector.com and Pingdom.com appeared to experience issues themselves 3. Cloudflare's own status page also went down temporarily, which initially complicated their response and led them to suspect a broader attack 1.
The disruption was global, affecting websites and users across the entire world 3. Cloudflare's status page reported a "Partial Outage" in numerous cities spanning Africa, Asia, Europe, Latin America & the Caribbean, the Middle East, North America, and Oceania 2. London was specifically called out because WARP access was temporarily disabled there 2.
Cloudflare confirmed that the outage was not the result of a cyberattack or malicious activity 1. The technical trigger was identified as an inadvertent change to permissions within one of Cloudflare's database systems 1. This permission modification caused the ClickHouse database to output multiple, duplicate entries into a "feature file" utilized by the Bot Management system 1.
The underlying cause lay within a specific Rust code component in the Bot Management system, specifically within the fl2_worker_thread 1. This component had a hard limit on the size of the feature file it processed 1. When the influx of duplicate entries caused the file to exceed this size limit, the Rust code, instead of handling the error gracefully, invoked Result::unwrap on an Err value, leading the thread to panic and crash the core proxy modules 1. The duplication of entries in the feature file was traced back to a database migration involving ClickHouse, where a query against system.columns returned duplicate rows because users had been granted access to an additional backing database 1.
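The failure pattern described here can be illustrated with a short, self-contained Rust sketch. This is not Cloudflare's actual FL2 code; the file format, the limit value, and names such as parse_feature_file are assumptions chosen only to mirror the reported behaviour, in which calling Result::unwrap on an Err value aborts the thread with a panic.

```rust
// Illustrative sketch of the reported failure mode: a feature-file loader with a
// hard cap whose result is consumed with unwrap(), so an oversized file panics
// the calling thread. Names and the limit value are assumptions, not real code.

const FEATURE_LIMIT: usize = 200; // assumed cap, mirroring the reported 200-feature limit

#[derive(Debug)]
struct FeatureFileError(String);

fn parse_feature_file(lines: &[&str]) -> Result<Vec<String>, FeatureFileError> {
    if lines.len() > FEATURE_LIMIT {
        return Err(FeatureFileError(format!(
            "{} features exceed the limit of {}",
            lines.len(),
            FEATURE_LIMIT
        )));
    }
    Ok(lines.iter().map(|s| s.to_string()).collect())
}

fn main() {
    // A duplicate-laden file with roughly twice the expected number of entries.
    let oversized: Vec<&str> = (0..400).map(|_| "duplicated_feature").collect();

    // unwrap() on the resulting Err value panics, taking the thread down with it.
    let _features = parse_feature_file(&oversized).unwrap();
}
```

Running this sketch terminates with a panic rather than a handled error, which is the same class of behaviour that brought down the core proxy modules.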
The global outage immediately triggered widespread public reaction and extensive media coverage. Users across the internet reported "widespread 500 errors" and often initially suspected issues with their own internet connections before realizing the scale of the Cloudflare problem. Social media platforms, particularly X (formerly Twitter), saw over 11,500 problem reports 7, with major media outlets quickly highlighting disruptions to prominent services like ChatGPT and X. The inaccessibility of Cloudflare's own status page for parts of the incident compounded user frustration and prevented timely updates. Within online technical communities, such as r/sysadmin and Hacker News, administrators engaged in discussions that blended "hugops" (sympathetic support) with "gallows humor" regarding the initial panic before the cause was known 7.
Media commentary frequently underscored the "fragility" and "interconnectedness" of the modern digital ecosystem, drawing parallels to recent outages at Amazon Web Services and Microsoft Azure in October 2025. While Cloudflare's team initially suspected a hyper-scale DDoS attack due to the fluctuating nature of the errors and the coincident unavailability of its status page 6, external commentary largely confirmed the reported impact and Cloudflare's rapid efforts toward resolution. There were no external critiques contradicting Cloudflare's eventual detailed technical explanation of the root cause 8. Instead, expert and media commentary broadened the discussion to emphasize the critical need for "cyber resilience," highlighting how "one small mistake gets amplified everywhere at once" in highly automated, globally distributed systems 8.
The Cloudflare outage on November 18, 2025, caused widespread disruption across the internet, highlighting the critical dependency of numerous online services on its infrastructure and underscoring the potential for cascading failures within the internet ecosystem 9. With the period of significant impact lasting approximately 3 hours and 10 minutes and full restoration taking about 5 hours and 46 minutes, the incident broadly affected thousands of websites and applications globally.
The outage led to extensive operational disruptions and negative impacts on end-users and internet traffic flow worldwide.
The outage broadly impacted thousands of websites and applications that rely on Cloudflare's CDN and other services. Services exhibiting degraded performance included Cloudflare Sites and Services, Bot Management, CDN/Cache, Dashboard, Firewall, Network, and Workers 2. External platforms relying on Cloudflare became inaccessible, leading to a significant list of disrupted services:
| Category | Specific Services Affected |
|---|---|
| Social Media & Communication | X (formerly Twitter), Truth Social 9, Discord, Facebook 7, Zoom |
| AI Services | ChatGPT, Claude AI 9, Character AI 9, OpenAI, Gemini 7, Perplexity AI |
| Creative & Productivity | Canva, Medium, Feedly 7, Figma 7, 1Password 7, Trello 7, Postman 7, Archive of Our Own 9, Dropbox 10, Atlassian 7 |
| Gaming & Entertainment | League of Legends, Spotify 9, Letterboxd 9 |
| E-commerce & Financial | Shopify, Coinbase 10, multiple cryptocurrency exchanges 8, Moody's credit ratings service 10, IKEA 9, Uber 9, Google Store 9, Square 9 |
| Hosting & Infrastructure | DigitalOcean 7, Namecheap 7, Vercel 7 |
| Other Critical Services | New Jersey Transit system (njtransit.com) 10, Dayforce 9, Indeed 9, Quizlet 9, Canvas 9, Downdetector (briefly experienced connectivity issues due to its reliance on Cloudflare) |
The financial impact of the Cloudflare outage was substantial for businesses globally, underscoring the high cost of downtime in the digital economy 9.
Beyond direct revenue loss and service unavailability, the outage incurred several hidden costs and broader consequences.
The global outage experienced by Cloudflare on November 18, 2025, was not the result of a cyberattack or malicious activity, but rather a complex chain of technical failures originating from an inadvertent configuration change 1. This section details the immediate trigger, underlying vulnerabilities, specific technical failures, and the subsequent diagnostic and resolution processes employed by Cloudflare.
The immediate trigger for the outage was an "inadvertent change to permissions within one of Cloudflare's database systems" 6, which affected a ClickHouse database cluster 1. This permission change, made around 11:05 UTC, allowed ClickHouse queries to run under initial user accounts and explicitly access metadata for tables in the r0 database 6.
This seemingly minor modification exposed a pre-existing vulnerability in a query used to construct a "feature file" for Cloudflare's Bot Management system 1. The query, SELECT name, type FROM system.columns WHERE table = 'http_requests_features' order by name;, lacked a filter for database name 6. Consequently, with the new r0 access, the query began returning duplicate entries from the r0 schema, effectively doubling the number of rows (features) in the final output file 6. This oversized feature file was then propagated to all network machines running traffic routing software 6.
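As a hedged illustration of that missing filter, the constants below contrast the quoted query with a variant scoped to a single database. The sources summarized here do not quote the corrected query Cloudflare deployed, so the added database predicate and the 'default' schema name are assumptions for illustration only.

```rust
// Illustrative only: the unfiltered metadata query quoted in the incident
// write-up versus a variant scoped to a single database. The added predicate
// and the schema name 'default' are assumptions, not Cloudflare's verified fix.

const UNFILTERED_QUERY: &str = "SELECT name, type FROM system.columns \
                                WHERE table = 'http_requests_features' order by name;";

const SCOPED_QUERY: &str = "SELECT name, type FROM system.columns \
                            WHERE database = 'default' \
                              AND table = 'http_requests_features' order by name;";

fn main() {
    // With access to a second backing database, the unfiltered query returns one
    // row per database and the feature list doubles; the scoped variant does not.
    println!("before: {}", UNFILTERED_QUERY);
    println!("after:  {}", SCOPED_QUERY);
}
```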
A critical underlying vulnerability was a hardcoded size limit within a specific Rust code component of the Bot Management system, residing in the fl2_worker_thread 1. This limit, set to 200 machine learning features, was intended for memory preallocation to optimize performance 6. When the expanded feature file, containing more than 200 entries, exceeded this preallocated memory limit, the Rust code failed to handle the error gracefully 6. Instead, it called Result::unwrap on an Err value, causing the fl2_worker_thread to panic and crash core proxy modules across Cloudflare's network 1. This cascading failure led to widespread service degradation and HTTP 5xx errors 1.
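For contrast with that unwrap pattern, the sketch below shows one way a loader could degrade gracefully under the same condition: reject the oversized file and keep serving with the last configuration that loaded successfully. The names, types, and limit here are illustrative assumptions, not Cloudflare's implementation.

```rust
// Sketch of graceful degradation: reject an oversized feature file but keep the
// last known-good configuration instead of panicking the worker thread.
// Names, types, and the limit value are illustrative assumptions.

const FEATURE_LIMIT: usize = 200;

#[derive(Debug, Clone)]
struct FeatureConfig {
    features: Vec<String>,
}

fn load_feature_config(lines: &[&str]) -> Result<FeatureConfig, String> {
    if lines.len() > FEATURE_LIMIT {
        return Err(format!(
            "refusing oversized feature file: {} entries > limit of {}",
            lines.len(),
            FEATURE_LIMIT
        ));
    }
    Ok(FeatureConfig {
        features: lines.iter().map(|s| s.to_string()).collect(),
    })
}

fn refresh(current: &mut FeatureConfig, candidate: &[&str]) {
    match load_feature_config(candidate) {
        Ok(new_config) => *current = new_config,
        // On failure, log and keep serving with the previous configuration
        // rather than crashing the proxy.
        Err(e) => eprintln!("feature refresh rejected, keeping last good config: {e}"),
    }
}

fn main() {
    let mut config = load_feature_config(&["bot_score"]).expect("initial config should load");
    let oversized: Vec<&str> = (0..400).map(|_| "duplicated_feature").collect();

    refresh(&mut config, &oversized);
    assert_eq!(config.features.len(), 1); // still serving the last good config
}
```

The design point is simply that a parse or validation failure stays an error value handled by the caller instead of becoming a process-wide panic.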
The bug was classified as a "latent bug"—an error hidden within the system that only manifested under a specific and unusual combination of conditions, in this case, a routine configuration change interacting with an underlying query flaw 11.
Cloudflare's engineering teams initiated an immediate investigation upon detecting internal service degradation and unusual traffic spikes around 11:40-11:48 UTC 2. The intermittent nature of the initial outage, with systems recovering and then failing again every five minutes due to the periodic regeneration of the faulty feature file, initially led Cloudflare to suspect a hyper-scale Distributed Denial of Service (DDoS) attack 6. Further complicating diagnostics was the concurrent, unrelated outage of Cloudflare's independently hosted status page, which initially suggested a broader attack 1.
The core issue was identified as a faulty configuration file 2, specifically the oversized bot management feature file 1. Resolution efforts faced challenges as there was no immediate mechanism to "insert a good file into the queue" 1. To rectify the situation, Cloudflare's team resorted to rebooting processes across a multitude of machines globally to force them to flush the corrupted configuration files 1. This process also involved isolating the malfunctioning component and temporarily disabling WARP services in London as a remediation step 2. By stopping the propagation of the faulty feature file and manually inserting a known good version, the core issue was effectively addressed 6.
The restoration process unfolded over several hours:
| Event | Time (UTC) | Source |
|---|---|---|
| Cloudflare's network began experiencing significant failures | 11:20 | 1 |
| Cloudflare acknowledged and began investigating the issue | 11:40-11:48 | 2 |
| The issue was identified, and a fix was being implemented | 13:09 | 2 |
| Core traffic was largely flowing as normal | 14:30 | 1 |
| A fix was implemented, and the incident was believed resolved | 14:42 | 2 |
| All Cloudflare systems were functioning as normal | 17:06 | 1 |
| Cloudflare confirmed services were operating normally, with a deeper investigation underway | 17:44 | 2 |
The period of significant impact lasted approximately 3 hours and 10 minutes (from 11:20 UTC to 14:30 UTC), while full restoration of all systems took about 5 hours and 46 minutes (from 11:20 UTC to 17:06 UTC) 1. After the fix, teams continuously monitored the system for errors and latency, confirming a return to normal service levels 2.
Throughout the incident, Cloudflare maintained a transparent communication strategy, issuing status updates as the investigation and remediation progressed 2.
To prevent similar incidents, Cloudflare outlined several hardening measures and lessons learned.
The incident also served as a broader reminder of the internet's dependence on central providers like Cloudflare and underscored the importance of cyber resilience, advocating for diversification, redundant DNS providers, multi-CDN strategies, and robust incident response plans 8.
The Cloudflare outage of November 18, 2025, served as a stark reminder of the internet's inherent vulnerabilities, prompting significant introspection within Cloudflare and across the broader industry. This section details the corrective actions and new protocols initiated by Cloudflare, examines the industry-wide dialogue on internet resilience, draws comparisons to previous outages to distill critical lessons, and outlines recommendations for building a more robust and fault-tolerant internet infrastructure.
Following the November 2025 incident, Cloudflare immediately outlined several key remediation and follow-up steps to address the root causes and enhance system stability. These measures, alongside prior initiatives, aim to fortify their infrastructure against future disruptions.
The primary cause of the outage was a permissions change in a database system, leading to an oversized "feature file" that caused the software (FL2 Rust code) to panic and generate HTTP 5xx errors across the network 6. In response, Cloudflare committed to a set of hardening and follow-up measures.
Prior to the November 2025 event, Cloudflare had already embarked on several significant initiatives following earlier incidents to improve its resilience.
The November 2025 outage reignited critical industry discussions regarding internet resilience, emphasizing distributed architectures and the mitigation of single points of failure.
The incident underscored the "centralization paradox," where the efficiency and performance benefits offered by major providers like Cloudflare and AWS inadvertently create systemic vulnerabilities that can cascade across the internet during failures 9. This has highlighted the benefits of truly distributed or decentralized systems, with observations that "legacy setups" or systems running on bare metal without reliance on such centralized services often survived outages, suggesting a move towards more distributed models 17.
The root cause, an automatically generated configuration file, emphasized that while automation boosts efficiency, it can amplify errors without robust safeguards like oversight, size limits, and validation checks 9. The discussion extended to concerns about "Vibe Coding" or AI-generated code potentially leading to hard-to-debug infrastructure issues 17. Furthermore, the outage demonstrated how many services not directly using Cloudflare were still affected due to indirect dependencies, underscoring the necessity of understanding the complete dependency chain 9. There is also an ongoing debate regarding balancing the need for rapid configuration updates to combat evolving threats against the safety concerns of pushing potentially flawed updates quickly to a global network 1.
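A minimal sketch of such a safeguard on the producer side is shown below: before an automatically generated artifact is propagated to the fleet, a gate enforces a size budget and rejects duplicate entries, the kind of check that could stop a doubled feature file before it ships. The threshold and names are hypothetical.

```rust
// Hypothetical pre-propagation gate for an auto-generated configuration
// artifact: enforce a size budget and reject duplicate entries before the
// file is pushed to the fleet. Threshold and entry names are illustrative.
use std::collections::HashSet;

const MAX_ENTRIES: usize = 200; // assumed budget, not a real Cloudflare limit

fn validate_artifact(entries: &[String]) -> Result<(), String> {
    if entries.len() > MAX_ENTRIES {
        return Err(format!(
            "artifact has {} entries, budget is {}",
            entries.len(),
            MAX_ENTRIES
        ));
    }
    let mut seen = HashSet::new();
    for entry in entries {
        if !seen.insert(entry) {
            return Err(format!("duplicate entry detected: {entry}"));
        }
    }
    Ok(())
}

fn main() {
    let good = vec!["bot_score".to_string(), "ja3_fingerprint".to_string()];
    let duplicated = vec!["bot_score".to_string(), "bot_score".to_string()];

    assert!(validate_artifact(&good).is_ok());
    // A duplicated artifact is stopped at the gate instead of reaching the fleet.
    assert!(validate_artifact(&duplicated).is_err());
}
```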
The November 18, 2025, Cloudflare outage shares commonalities with, and offers distinct lessons from, previous major internet infrastructure failures.
Several recurring themes emerge when comparing this incident to past outages, most notably configuration errors propagating through centralized dependencies.
The November 2025 outage also provided unique lessons, particularly around the impact of a single unhandled Rust error (the .unwrap call) on a globally distributed proxy fleet.
Experts and industry bodies emphasize several best practices to enhance internet infrastructure resilience and incident preparedness.
To mitigate the risks associated with centralized dependencies, organizations should consider architectural redundancy measures such as multi-CDN deployments, DNS-based failover, origin shielding, and edge computing 9.
Beyond architectural redundancy, operational excellence and preparedness are crucial, spanning comprehensive monitoring, tested disaster recovery and business continuity plans, and automation with safeguards 9.
The table below summarizes key recommendations for resilient infrastructure:
| Category | Recommendation | Description |
|---|---|---|
| Architectural Redundancy | Multi-CDN Architecture | Distribute traffic across multiple CDNs (primary/backup, active/active, geographic) to avoid single points of failure 9. |
| Architectural Redundancy | DNS-Based Failover | Implement automated DNS failover with continuous health checks and low TTL values for rapid redirection in case of service degradation or outage 9 (a minimal failover sketch follows this table). |
| Architectural Redundancy | Origin Shield | Use an intermediate caching layer to protect origin servers and maintain caching availability even if the primary CDN fails 9. |
| Architectural Redundancy | Edge Computing / Distributed Architecture | Process data and application logic closer to users to reduce latency and ensure localized functionality despite central failures 9. |
| Operational Excellence | Comprehensive Monitoring | Implement multi-layer monitoring (synthetic, RUM, infrastructure, CDN-specific) with well-defined alerts and escalation procedures 9. |
| Operational Excellence | DR/BCP with Regular Testing | Develop detailed disaster recovery and business continuity plans, including RTO/RPO, incident response playbooks, and frequent drills 9. |
| Operational Excellence | Automation with Safeguards | Ensure automated systems include oversight, size limits, validation, canary deployments, and circuit breakers to prevent cascading errors 9. |
| Strategic Preparedness | Dependency Mapping | Understand all direct and indirect service dependencies to identify hidden risks and potential points of failure 9. |
| Strategic Preparedness | AI-Powered Traffic Management | Leverage machine learning for predictive scaling, intelligent routing, anomaly detection, and automated remediation (future trend) 9. |
| Strategic Preparedness | Adherence to Regulatory Standards | Anticipate and comply with increasing regulatory requirements for infrastructure resilience, such as the EU Digital Operational Resilience Act (DORA) 9. |
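As a concrete companion to the Multi-CDN and DNS-Based Failover rows above, the sketch below models health-check-driven selection between a primary and a backup provider. The health probe is a stub and the provider names and endpoints are hypothetical; a real deployment would wire this to actual endpoint checks and automated DNS updates with low TTL values.

```rust
// Hedged sketch of health-check-driven failover between a primary and a backup
// provider. The probe is a stub; provider names and endpoints are hypothetical.

#[derive(Debug)]
struct Provider {
    name: &'static str,
    endpoint: &'static str,
}

// Stub health probe. In practice this would run a synthetic request against the
// provider's endpoint and report success or failure.
fn is_healthy(provider: &Provider, simulated_outage: &[&str]) -> bool {
    !simulated_outage.contains(&provider.name)
}

// Pick the first healthy provider in priority order, if any.
fn select_provider<'a>(providers: &'a [Provider], simulated_outage: &[&str]) -> Option<&'a Provider> {
    providers.iter().find(|p| is_healthy(p, simulated_outage))
}

fn main() {
    let providers = [
        Provider { name: "primary-cdn", endpoint: "cdn-a.example.net" },
        Provider { name: "backup-cdn", endpoint: "cdn-b.example.net" },
    ];

    // Simulate the primary being unavailable: traffic should fail over to the backup.
    let outage = ["primary-cdn"];
    let chosen = select_provider(&providers, &outage).expect("no healthy provider available");

    println!("routing traffic via {} ({})", chosen.name, chosen.endpoint);
    assert_eq!(chosen.name, "backup-cdn");
}
```

In a production setting the selection result would feed an automated DNS update, and low TTL values on the records keep the failover window short.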
The November 2025 Cloudflare outage underscores the persistent challenges in maintaining robust internet infrastructure. Cloudflare's detailed remediation efforts, focusing on configuration hardening, enhanced error handling, and robust deployment practices, alongside its ongoing commitment to distributed architectures and chaos testing, are crucial steps towards mitigating future incidents. The broader industry conversation highlights the critical need to address the "centralization paradox" and actively pursue truly distributed models, recognizing the inherent risks of automation without sufficient safeguards. By learning from past mistakes, both common (configuration errors, centralized dependencies) and unique (Rust .unwrap impact), the industry can advance towards more resilient systems. Implementing comprehensive strategies for redundant infrastructure, multi-CDN approaches, rigorous monitoring, and meticulous disaster recovery planning, coupled with an understanding of evolving regulatory landscapes and future technologies like AI-powered traffic management, will be paramount in safeguarding the future of the interconnected internet. The takeaway is clear: continuous vigilance, architectural foresight, and operational discipline are non-negotiable for internet infrastructure providers and consumers alike.