Uncovering a Hidden ClickHouse Bottleneck: A Guide to Diagnosing and Fixing Slow Aggregation Pipelines

Published 2026-05-20 19:03:25 · Education & Careers

Overview

When your daily billing pipeline starts crawling, every minute of delay can cost you in reconciliation headaches, missed SLAs, and lost revenue. At Cloudflare, we rely on ClickHouse to run millions of queries each day that calculate usage-based billing for hundreds of millions of dollars in revenue. After a routine migration, those critical aggregation jobs slowed dramatically. All the usual metrics looked fine: I/O was normal, memory pressure was low, and rows scanned and parts read were within expected ranges. The problem turned out to be a subtle bottleneck buried deep in ClickHouse’s internals, one that required three targeted patches to resolve.

Source: blog.cloudflare.com

This guide walks you through the investigation we carried out: understanding the underlying architecture, identifying the hidden bottleneck, and implementing the fixes. By the end, you’ll know what to look for when your ClickHouse pipeline suddenly degrades, even when all the usual suspects check out.

Prerequisites

Before diving in, make sure you have a solid grasp of these ClickHouse concepts:

  • Partitioning and Primary Keys – How ClickHouse organizes data into parts and sorts them using primary keys.
  • MergeTree Engine – The engine behind most ClickHouse tables, including how merges work and how parts are compacted.
  • TTL vs. Partition-based Retention – Native time-to-live versus custom retention via dropping partitions.
  • Query Profiling – Using system.query_log and system.part_log to diagnose slow queries.
  • Cluster Architecture – Understanding shards, replicas, and distributed tables (though the bottleneck we found was single-node).

This guide assumes you’re already comfortable running ClickHouse in production and have access to performance metrics (CPU, I/O, memory, query logs).

Step-by-Step Diagnosis and Fix

1. Understand Your Setup: The Petabyte-Scale Analytics Platform

Cloudflare runs a system called Ready-Analytics built on ClickHouse. It stores over 100 petabytes across dozens of clusters. The idea is simple: teams stream data into a single massive table instead of designing custom schemas. Each record uses a standard schema (20 float fields, 20 string fields, a timestamp, and an indexID). The primary key is (namespace, indexID, timestamp), where namespace distinguishes different datasets and indexID controls data ordering for each namespace. By December 2024, the table had grown to 2 PiB, ingesting millions of rows per second.
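
To make the layout concrete, here is a minimal Python sketch of one record and the sort order implied by the primary key. The field names, the string type for indexID, and the integer timestamp are assumptions for illustration; the text above only gives the field counts and the key order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Record:
    namespace: str   # which dataset the row belongs to
    index_id: str    # assumed to be a string; the real type is not stated
    timestamp: int   # assumed to be Unix seconds
    floats: List[float] = field(default_factory=lambda: [0.0] * 20)
    strings: List[str] = field(default_factory=lambda: [""] * 20)


def primary_key(record: Record) -> Tuple[str, str, int]:
    """Key order from the text: (namespace, indexID, timestamp)."""
    return (record.namespace, record.index_id, record.timestamp)


def sort_like_a_part(records: List[Record]) -> List[Record]:
    """Order rows the way a sorted part would store them."""
    return sorted(records, key=primary_key)


rows = [
    Record("dns", "zone-b", 200),
    Record("cdn", "zone-a", 300),
    Record("dns", "zone-a", 100),
]
ordered = [primary_key(r) for r in sort_like_a_part(rows)]
```

Because namespace leads the key, all rows of one namespace end up contiguous inside a part, which is what makes per-namespace scans cheap.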

Critical flaw: Retention was enforced by dropping partitions older than 31 days, a one-size-fits-all policy. Teams that needed longer retention (years) or shorter retention (days) couldn’t use the platform. We needed per-namespace retention.
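
The original 31-day policy can be sketched in a few lines of Python. This assumes one partition per calendar day and a hypothetical table name, ready_analytics; the actual partition key and tooling are not described in the text. ALTER TABLE ... DROP PARTITION is the standard ClickHouse statement for removing a whole partition.

```python
from datetime import date, timedelta
from typing import Iterable, List


def partitions_to_drop(partition_days: Iterable[date], today: date,
                       retention_days: int = 31) -> List[date]:
    """Return daily partitions that are older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_days if d < cutoff)


def drop_statements(table: str, days: Iterable[date]) -> List[str]:
    """Render one DROP PARTITION statement per expired day.

    The partition expression format ('YYYY-MM-DD') is an assumption.
    """
    return [f"ALTER TABLE {table} DROP PARTITION '{d.isoformat()}'" for d in days]


today = date(2024, 12, 31)
existing = [today - timedelta(days=n) for n in range(40)]
expired = partitions_to_drop(existing, today)
statements = drop_statements("ready_analytics", expired)
```

Note that the function has a single retention_days value for the whole table. That is exactly the limitation: every namespace shares the same cutoff.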

2. Identify the Bottleneck: When Aggregation Jobs Slow Down

After a migration to support per-namespace retention, the daily aggregation queries used for billing slowed dramatically. Here’s what we checked, and what each check showed:

  • I/O – No spike in disk reads/writes.
  • Memory – No unusual pressure.
  • Rows scanned – Still in the same range as before.
  • Parts read – Normal.
  • CPU – Consistent with previous runs.

Everything looked healthy, yet the jobs were taking hours longer. We turned to system.query_log and system.part_log to dig deeper.
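
One way to look past query-time metrics is to track how much data background merges read per day. The sketch below pairs a query against system.part_log with a small Python helper that flags days whose merge read volume jumps above a baseline. The 14-day window, the baseline length, and the 2x threshold are assumptions chosen for illustration, not values from the investigation.

```python
# Daily merge activity. event_type, event_date, read_bytes and duration_ms
# are columns of system.part_log; the 14-day window is an arbitrary choice.
MERGE_ACTIVITY_SQL = """
SELECT
    event_date,
    count() AS merges,
    sum(read_bytes) AS read_bytes,
    sum(duration_ms) AS duration_ms
FROM system.part_log
WHERE event_type = 'MergeParts'
  AND event_date >= today() - 14
GROUP BY event_date
ORDER BY event_date
"""


def flag_merge_regressions(daily_read_bytes, baseline_days=3, factor=2.0):
    """Return the days whose merge read volume exceeds factor x the baseline.

    daily_read_bytes: list of (day, read_bytes) tuples ordered by day.
    The baseline is the mean of the first `baseline_days` entries.
    """
    if baseline_days < 1:
        raise ValueError("baseline_days must be at least 1")
    if len(daily_read_bytes) <= baseline_days:
        return []
    baseline = sum(b for _, b in daily_read_bytes[:baseline_days]) / baseline_days
    return [day for day, b in daily_read_bytes[baseline_days:] if b > factor * baseline]


# Made-up numbers: three quiet days, one normal day, one regression.
sample = [("d1", 100), ("d2", 110), ("d3", 90), ("d4", 120), ("d5", 450)]
flagged = flag_merge_regressions(sample)
```

A check like this would not explain the cause, but it points at merges rather than queries as the place to keep digging.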

3. Discover the Hidden Bottleneck: Inside ClickHouse’s Merge Pipeline

ClickHouse stores data in parts (sorted chunks). Over time, background merges combine smaller parts into larger ones. With per-namespace retention, we had to keep namespaces with varying lifetimes in the same table. To drop old data per namespace, we started using partitions based on a virtual column that combined namespace and day. This created many micro-partitions, each with only one or a few namespaces. MergeTree, however, only merges parts that belong to the same partition. With data spread across so many small partitions, few parts were eligible to merge with one another, so merges became extremely selective and consequently much slower.
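
The growth in partition count is simple arithmetic: with a daily key there is one live partition per retained day, while a (namespace, day) key needs one per namespace per retained day. A rough Python sketch, with hypothetical namespace names and retention periods:

```python
from typing import Dict


def daily_partition_count(retention_days: int) -> int:
    """One live partition per retained day, shared by every namespace."""
    return retention_days


def per_namespace_partition_count(retention_by_namespace: Dict[str, int]) -> int:
    """One live partition per namespace per retained day."""
    return sum(retention_by_namespace.values())


# Hypothetical namespaces with short, default and long retention (in days).
retention = {"short": 7, "default": 31, "long": 365}

before = daily_partition_count(31)
after = per_namespace_partition_count(retention)

# A thousand namespaces that all keep the old 31-day default.
many = per_namespace_partition_count({f"ns{i}": 31 for i in range(1000)})
```

Three namespaces already take the table from 31 live partitions to 403, and a thousand namespaces on the old default reach 31,000.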

The hidden bottleneck was read amplification during merges. Even though query-time reads were fine, each merge had to re-read and sort a huge number of tiny parts, causing a dramatic increase in total bytes read from disk over the course of a day. This slowdown cascaded into the aggregation jobs, which depend on up-to-date merged parts for efficient scanning.
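
A toy model shows why many tiny parts inflate merge reads. It is not ClickHouse's merge selector: it simply assumes parts are merged in fixed-size groups, level by level, until one part remains, and counts every byte a merge has to read. The fan-in values and part sizes are made up.

```python
from typing import List


def merge_bytes_read(part_sizes: List[int], fan_in: int) -> int:
    """Total bytes read while merging all parts down to a single part.

    Toy model: at each level, adjacent parts are merged in groups of
    `fan_in`, and every merge reads all of its inputs in full.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    parts = list(part_sizes)
    total_read = 0
    while len(parts) > 1:
        next_level = []
        for i in range(0, len(parts), fan_in):
            group = parts[i:i + fan_in]
            if len(group) > 1:          # a lone part is carried over unread
                total_read += sum(group)
            next_level.append(sum(group))
        parts = next_level
    return total_read


# The same 64 units of data, stored as 64 tiny parts or as 4 larger ones.
tiny = merge_bytes_read([1] * 64, fan_in=2)
large = merge_bytes_read([16] * 4, fan_in=2)
wide = merge_bytes_read([1] * 64, fan_in=8)
```

In this model the 64 tiny parts are read six times over (384 units) while the four larger parts are read twice (128 units), even though the data volume is identical. Merging more parts per step narrows the gap, which is the intuition behind reducing the number of tiny parts.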

4. Apply the Three Patches

We developed three patches to resolve the issue without changing the retention model:

  1. Optimize Merge Selectivity – Modify the merge algorithm to merge parts that share the same namespace even if the partition key differs. This reduced the number of tiny parts and cut merge overhead significantly.
  2. Parallel Merge Workers – Increase the number of background merge threads per partition range, allowing merges to run concurrently rather than sequentially.
  3. Adaptive Memory Budget for Merges – Allow merges to use more memory when the system is idle, speeding up the sorting phase.

After deploying these patches, merge write amplification dropped by 40%, and the aggregation jobs returned to their normal completion times.
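
The idea behind the first patch can be illustrated with a small grouping exercise. This is not the patch itself, which lives inside ClickHouse's merge code; it is a hypothetical Python sketch showing how the choice of grouping key changes which parts are eligible to merge together. The part names and dictionary fields are invented.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List


def merge_candidates(parts: List[dict],
                     key: Callable[[dict], Hashable]) -> Dict[Hashable, List[str]]:
    """Group part names by `key` and keep groups with at least two parts."""
    groups = defaultdict(list)
    for part in parts:
        groups[key(part)].append(part["name"])
    return {k: names for k, names in groups.items() if len(names) > 1}


# One part per (namespace, day) micro-partition.
parts = [
    {"name": f"{ns}_{day}", "namespace": ns, "day": day}
    for ns in ("cdn", "dns")
    for day in ("2024-12-01", "2024-12-02", "2024-12-03")
]

by_partition = merge_candidates(parts, key=lambda p: (p["namespace"], p["day"]))
by_namespace = merge_candidates(parts, key=lambda p: p["namespace"])
```

Grouping by the full partition key leaves nothing to merge, while grouping by namespace yields two groups of three parts each.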

Common Mistakes

  • Ignoring merge performance – Many operators focus on query-time metrics (rows scanned, I/O) and forget that background merges can become the bottleneck. Always monitor system.merges and system.part_log for merge latency and bytes processed.
  • Assuming partitioning fixes all retention problems – Relying on partition-level retention for per-namespace data can create a partition explosion. Validate your partition scheme with the expected number of namespaces and retention periods.
  • Not simulating the workload – Before deploying a new retention strategy, run a load test that mimics production merge patterns. Use SYSTEM START MERGES and SYSTEM STOP MERGES to observe merge behavior in isolation.
  • Overlooking the primary key – Even if the primary key is optimal for queries, it may cause merge inefficiencies if the sort order doesn’t align well with partition boundaries.
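
As a starting point for the monitoring advice above, the following sketch summarizes how fragmented a table is from system.parts output. The database and table names and the size threshold are placeholders; partition, active, and bytes_on_disk are columns of system.parts.

```python
from typing import Dict, List, Tuple

# Active parts and bytes per partition; database and table are placeholders.
PARTS_SQL = """
SELECT partition, count() AS parts, sum(bytes_on_disk) AS bytes
FROM system.parts
WHERE active AND database = 'default' AND table = 'ready_analytics'
GROUP BY partition
"""


def fragmentation_report(rows: List[Tuple[str, int, int]],
                         small_partition_bytes: int) -> Dict[str, object]:
    """Summarize (partition, parts, bytes) rows returned by PARTS_SQL."""
    return {
        "partitions": len(rows),
        "active_parts": sum(parts for _, parts, _ in rows),
        "small_partitions": sorted(
            name for name, _, size in rows if size < small_partition_bytes
        ),
    }


# Made-up rows: one large partition and two micro-partitions.
sample_rows = [("cdn-2024-12-01", 12, 5_000_000_000),
               ("dns-2024-12-01", 3, 40_000),
               ("dns-2024-12-02", 2, 25_000)]
report = fragmentation_report(sample_rows, small_partition_bytes=1_000_000)
```

Tracking the count of small partitions over time gives an early signal of a partition explosion before it shows up in job runtimes.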

Summary

A seemingly healthy ClickHouse pipeline can hide a debilitating bottleneck in its merge engine. When per-namespace retention forced us into a virtual partition scheme, merges became selective and slow, causing daily aggregation jobs to stall. The fix required understanding ClickHouse’s merge internals and applying three targeted patches. This guide showed you how to diagnose such issues by looking beyond obvious metrics, and gave you concrete steps to prevent them in your own environment. Always profile merges as part of your performance baseline, and test new partition or retention strategies before going live.