Unlocking Performance: How Cloudflare Resolved a Critical Bottleneck in ClickHouse

In the world of hyper-scale distributed systems, performance degradation is rarely a result of simple hardware saturation. When Cloudflare, a global leader in internet security and performance, began experiencing mysterious latency spikes in its mission-critical billing and fraud detection pipelines, the engineering team discovered that the culprit wasn’t a lack of CPU, RAM, or I/O bandwidth. Instead, they uncovered a deep-seated architectural bottleneck within the query planning phase of ClickHouse, the high-performance analytical database that powers their massive data operations.

The resolution of this issue, which involved a fundamental re-engineering of how ClickHouse handles data part locks, offers a masterclass in modern systems debugging. It underscores a shifting reality in infrastructure management: as datasets grow into the petabyte scale, the "coordination layer"—the invisible glue holding distributed systems together—often becomes the primary point of failure.

The Context: A Pipeline Under Pressure

Cloudflare’s reliance on ClickHouse dates back to the database’s early days, long before modern features like built-in data expiration were standard. To manage hundreds of petabytes of data with an ingestion rate reaching millions of rows per second, Cloudflare developed a custom retention system. By splitting data in their "Ready-Analytics" tables into daily partitions and programmatically purging records older than 31 days, they maintained a high-performance environment suitable for hundreds of internal applications.

By December 2024, this repository had ballooned to over 2PiB of data. While the system was robust, it possessed a structural vulnerability: its rigid partitioning scheme. Seeking to improve flexibility, the team initiated a migration to a more granular partitioning scheme that incorporated customer namespaces. This allowed for tenant-specific retention policies, a move that promised greater efficiency but inadvertently triggered a "slowdown" that baffled the team.

The Chronology: From Migration to Diagnosis

The issues surfaced shortly after the migration. Despite the new partitioning scheme, daily aggregation jobs—which are essential for accurate billing and fraud detection—began to suffer from significant latency.

The Diagnostic Paradox

Initially, the engineering team looked at the usual suspects. They monitored I/O throughput, memory pressure, and row-scan counts. To their surprise, these standard performance metrics remained well within the expected, "healthy" range. The system appeared to be under-utilized, yet queries were taking exponentially longer to execute.

The engineers soon shifted their focus from execution to planning. Using deep-dive profiling tools, they discovered that nearly 45% of total sampled CPU time was being consumed by a single function: filterPartsByPartition.

Cloudflare Identifies Query Planning Bottleneck in ClickHouse

Uncovering the Mutex Contention

The deeper investigation revealed that the issue was not a lack of computational power, but rather a "traffic jam" at the coordination layer. The query planner, tasked with identifying which data parts to read before execution, was being forced into a serial bottleneck.

Every query required the acquisition of a MergeTreeData mutex—a synchronization primitive designed to protect the integrity of the table’s list of parts. Because the new migration had significantly increased the total number of parts, the time spent waiting for this single lock to become available skyrocketed. The system was not CPU-bound; it was lock-bound.

Supporting Data and Technical Root Causes

The technical findings published by senior distributed systems engineers James Morrison and Christian Endres illustrate a classic concurrency trap. In the original ClickHouse architecture, the query planning phase involved a deep copy of the entire list of data parts for every single incoming query.

As the number of parts grew, the cost of this operation—combined with the exclusive lock required to access that list—created a compounding latency effect. For a system processing millions of rows per second across hundreds of applications, the serialization of this metadata lookup proved fatal to throughput.

The data presented by Cloudflare shows a clear correlation: as the number of data parts increased, the time spent in the planning phase increased linearly. By March 2026, after the implementation of the fix, this correlation was effectively broken. Query durations dropped by 50%, and the system’s performance profile became decoupled from the total number of parts, providing a much more predictable and scalable environment.

The Remediation: Engineering a Solution

The team’s solution was a three-pronged approach aimed at reducing the burden on the MergeTreeData mutex:

Shared vs. Exclusive Locks: The most impactful change was replacing the exclusive lock with a shared lock. This allowed multiple concurrent queries to read the parts list simultaneously, provided no write operations were modifying it, thereby eliminating the serialization bottleneck.
Eliminating Redundant Copies: The team refactored the code to remove the requirement that every query must generate a full copy of the parts list. By moving to a more memory-efficient structure, they drastically reduced the overhead associated with query initialization.
Optimized Part Filtering: Finally, they improved the filtering logic itself. Instead of scanning the entire list of parts every time, the planner was optimized to discard irrelevant partitions more efficiently, further reducing the work required under the protection of the lock.

These patches were not merely temporary workarounds; they represented a fundamental improvement to the ClickHouse codebase. Cloudflare contributed these changes back to the upstream project, where they were formally integrated in version 25.11, benefiting the entire global community of ClickHouse users.

Implications for Distributed Systems Architecture

The broader industry reaction to Cloudflare’s findings has been telling. While some observers on forums like Reddit pointed to the "massive single table" design as the source of the struggle, industry experts argue that this is a symptom of a much larger, systemic shift.

Edydh Marquez Avila, a field engineer at Park Place Technologies, noted on LinkedIn that the investigation serves as a critical "reminder that modern infrastructure failures increasingly happen in coordination layers."

The "Death" of High-Level Telemetry

The Cloudflare incident highlights a growing gap in observability. High-level telemetry—CPU usage, memory, disk latency—is increasingly insufficient for diagnosing the complexities of modern, highly concurrent distributed systems. As systems become more distributed and layered, low-level execution visibility is no longer a "nice-to-have" for performance engineers; it is an absolute necessity.

The Sustainability of Meta-Data Management

Despite the success of the patch, Cloudflare’s engineers remain cautious. The "uneasy truce" between the database and its orchestration layer (ZooKeeper) continues to face pressure. The metadata load generated by their scale is pushing the limits of current coordination tools. The question remains: can the current architecture scale indefinitely, or is there a hard ceiling to how much metadata a single analytical engine can effectively track before the coordination layer collapses again?

Conclusion

The Cloudflare ClickHouse incident is a potent case study in the lifecycle of massive-scale software. It demonstrates that the most sophisticated performance issues are rarely found in the "hot paths" of data processing, but rather in the overlooked synchronization mechanisms that govern how systems behave under load.

By transitioning from exclusive to shared locking and optimizing how metadata is handled, Cloudflare not only saved their own billing pipeline but also provided a vital architectural upgrade to the open-source community. As we push the boundaries of what data systems can handle, the lesson is clear: for modern, distributed infrastructure, the most important optimizations often happen in the quiet, crowded corners where processes wait for permission to act.

For the engineers at Cloudflare and beyond, the pursuit of efficiency continues—not just by adding more power, but by better orchestrating the power we already have.