ClickHouse documentation recommends using hasAnyTokens / hasAllTokens for full-text search against its text inverted index. We followed that guidance precisely and ran eight WHERE … LIMIT 100 queries against a 5 TB log table, comparing it with CtrlB on the same hardware.
span_id point lookup
16 GiB vs. 282 GiB
CtrlB wins every pattern
- CtrlB wins all 8 queries, from 2.2× to 99× faster, cold cache, same hardware
- Index footprint: CtrlB 16.3 GiB vs. ClickHouse 282.4 GiB (17× gap)
- CtrlB queries run in standard SQL with no additional syntax layer
ClickHouse is a capable analytical database with active documentation on optimizing full-text search using its inverted index functions. We followed those recommendations and set up both engines on identical hardware against the same 5 TB production dataset. Results are the p50 of 5 cold runs.
Test Environment
| Component | Configuration |
|---|---|
| Dataset | 5 TB structured Kubernetes log data, production workload |
| Table | default.benchmark_5tb |
| Hardware | 16 vCPU instance, identical for both engines |
| Cache state | Cold cache at the start of every run |
| Measurement | p50 of 5 cold runs per query, wall-clock latency end-to-end |
| ClickHouse version | v26.2 |
| ClickHouse indexes | bloom_filter(0.01) on trace_id & span_id; text(tokenizer=splitByNonAlpha) on message |
| CtrlB queries | Standard SQL with LIKE and equality predicates |
| ClickHouse queries | Written per ClickHouse documentation for full-text index usage |
Table Schema & Index Setup
The ClickHouse table carries three secondary indexes representing the two index families ClickHouse provides for this workload.
CREATE TABLE default.benchmark_5tb
(
-- Timestamp & human-readable date
`_timestamp` UInt64,
`ist_timestamp` String DEFAULT '',
-- Free-text log line (the haystack)
`message` String,
-- Distributed tracing IDs (high-cardinality)
`trace_id` String DEFAULT '',
`span_id` String DEFAULT '',
-- Kubernetes metadata (low/medium cardinality)
`namespace_name` LowCardinality(String) DEFAULT '',
`container_name` LowCardinality(String) DEFAULT '',
`pod_name` String DEFAULT '',
`plugin_source` LowCardinality(String) DEFAULT '',
`stream` LowCardinality(String) DEFAULT '',
`bytes` UInt64 DEFAULT 0,
-- Full-text inverted index (splitByNonAlpha tokenizer)
INDEX idx_message_text message
TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 100000000,
-- Bloom filters for exact ID lookups (1% false-positive rate)
INDEX idx_trace_id trace_id TYPE bloom_filter(0.01) GRANULARITY 1,
INDEX idx_span_id span_id TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY (namespace_name, container_name, stream, _timestamp)
SETTINGS
index_granularity = 8192,
min_compress_block_size = 65536,
old_parts_lifetime = 480
Sample log record
One representative row from the 5 TB Kubernetes log dataset used in this benchmark:
The Two Index Families
idx_trace_id and idx_span_id use bloom_filter(0.01) with
GRANULARITY 1. These are high-cardinality, near-unique ID columns where the entire
point is exact-match lookup. A Bloom filter at granularity 1 is the textbook ClickHouse answer:
one filter per MergeTree granule (~8,192 rows), letting ClickHouse skip granules that provably
don't contain the value. The 0.01 false-positive rate is standard ClickStack tuning.
idx_message_text uses TYPE text(tokenizer = splitByNonAlpha) —
ClickHouse's inverted full-text index on the free-text message column, tokenizing on
non-alphanumeric boundaries. This is ClickHouse's strongest possible choice for substring search.
The GRANULARITY 100000000 effectively covers the entire dataset in a single index
entry, correct for a posting-list inverted index.
Ingestion Rate: 280 MB/s vs. 210 MB/s
Both engines ingested the same 5 TB dataset on identical hardware — a 16 vCPU VM — with sustained ingest measured during load. Storage was the only deliberate difference: CtrlB streams to object storage; ClickHouse wrote to local SSD (required for its MergeTree layout).
Results: Wall-Clock Latency Per Query
8 queries · p50 of 5 cold runs · lower is better. ClickHouse queries use
hasAnyTokens / hasAllTokens as recommended in the ClickHouse
documentation for activating the text inverted index.
| Query pattern | CtrlB | ClickHouse | Advantage |
|---|---|---|---|
Find one trace_id |
140 ms | 7.1 s | 50× |
Find one span_id |
92 ms | 9.1 s | 99× |
| Substring match, rare UUID | 46 ms | 721 ms | 16× |
Substring match, common token (ERROR) |
33 ms | 2.1 s | 64× |
Two filters: container_name + trace_id |
107 ms | 588 ms | 5.5× |
Filter + substring on message |
180 ms | 2.1 s | 12× |
Double substring on message, worst case |
1.25 s | 2.8 s | 2.2× |
High-cardinality equality on pod_name |
3.6 s | 24.5 s | 7× |
The 8 Queries
Each query was run against default.benchmark_5tb. CtrlB uses standard SQL.
ClickHouse uses the syntax recommended in its documentation for full-text index coverage.
trace_id
High-Cardinality Point Lookup
The most common operation in distributed tracing: pull all log lines for a given trace.
Both engines use equality filters. ClickHouse has a Bloom filter on trace_id
with GRANULARITY 1, the most aggressive resolution possible.
At 1% FPR with thousands of granules across 5 TB, the Bloom filter generates hundreds of
false-positive granule reads. Each false-positive requires decompressing the granule and doing
a precise row-level check. That I/O and CPU cost accumulates to 7+ seconds.
CtrlB's sparse index achieves near-zero false positives at this cardinality.
SELECT * FROM benchmark_5tb_rti_v17 WHERE trace_id = '2d3d5a83f6a923d9...' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE trace_id = '8ceafb5879a818a5...' AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100 FORMAT JSONEachROW
span_id
High-Cardinality Point Lookup
Same pattern as trace_id, but on span_id. The gap widens to 99×.
SELECT * FROM benchmark_5tb_rti_v17 WHERE span_id = '07766058015caeea' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE span_id = 'fbf291e847a2ac5e' AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100 FORMAT JSONEachROW
A LIKE '%<uuid>%' search on the message column for a
low-frequency substring. CtrlB: 46 ms. ClickHouse with its text index: 721 ms.
SELECT * FROM benchmark_5tb_rti_v17 WHERE message LIKE '%9d46ccea-98bd-4cef-bbd5%' LIMIT 100
SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message,
'c8fedcbb-9a05-47a4-8dd8-f24f75291749')
AND _timestamp >= 1776617583000000
AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC LIMIT 100
Searching for 'ERROR', a high-frequency token, is the worst case for an inverted
index: the posting list is enormous and provides almost no pruning benefit.
SELECT * FROM benchmark_5tb_rti_v17 WHERE message LIKE '%ERROR%' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE hasAnyTokens(message, 'ERROR') AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100
container_name + trace_id
Low/Medium Cardinality Exact Filter
A compound predicate combining a low-cardinality equality filter with a high-cardinality ID lookup. CtrlB: 107 ms. ClickHouse: 588 ms (5.5× slower).
SELECT * FROM benchmark_5tb_rti_v17 WHERE container_name = 'sidekiq-data-engine' AND trace_id = '2d3d5a83f6a923d9...' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE container_name = 'controller' AND trace_id = '4f99282f17d604c3...' AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100 FORMAT JSONEachROW
message
Full-Text / Wildcard Search
Filter by container_name, then search for 'INFO' in the message —
the bread-and-butter observability query: narrow by service, then search logs.
SELECT * FROM benchmark_5tb_rti_v17 WHERE container_name = 'sidekiq-data-engine' AND message LIKE '%INFO%' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE container_name = 'demo-app' AND hasAnyTokens(message, 'INFO') AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100
message
Full-Text / Wildcard Search
Two simultaneous LIKE conditions on message: 'INFO'
AND 'timeout'. The most demanding full-text pattern in this benchmark.
The gap narrows here because the intersection of posting lists for two common tokens
genuinely prunes granules.
SELECT * FROM benchmark_5tb_rti_v17 WHERE message LIKE '%INFO%' AND message LIKE '%timeout%' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE hasAnyTokens(message, 'INFO') AND hasAnyTokens(message, 'timeout') AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100
pod_name
High Cardinality Exact Filter
An equality filter on pod_name, which has very high cardinality (unique per pod
deployment). Neither engine has a dedicated secondary index on this column; both must scan.
CtrlB's storage layout and scan path are more efficient even without a secondary index.
SELECT * FROM benchmark_5tb_rti_v17 WHERE pod_name = 'demo-app-sqs-a4544e758c-fe295' LIMIT 100
SELECT * FROM default.benchmark_5tb WHERE pod_name = 'ingress-nginx-controller-ddc1a51c61-f795c' AND _timestamp >= 1776617583000000 AND _timestamp <= 1776842347000000 ORDER BY _timestamp DESC LIMIT 100
Index Footprint: 16 GiB vs. 282 GiB
The most significant infrastructure difference is the index size on the message column.
ClickHouse requires a full-text inverted index to make substring search tractable. That index
weighs 282.4 GiB. CtrlB's sparse index on the same column is 16.3 GiB, 17× smaller.
The ClickHouse text inverted index uses a Lucene-style approach: for every token produced
by the splitByNonAlpha tokenizer, it maintains a posting list of which granules
contain that token. At 5 TB of log data, that posting list becomes very large, at 5.6% of the raw
data size, compared to CtrlB's 0.33%.
Storage cost is not included in the latency numbers, but it shapes every capacity decision once the dataset crosses a few terabytes. At 50 TB, the ClickHouse text index alone would approach 2.8 TB of additional local SSD.
Same CPU. Very Different Bill.
Both engines ran on a 16-vCPU instance. The CPU was identical. What differed was the attached storage requirement.
| Attribute | CtrlB | ClickHouse |
|---|---|---|
| Data storage | Object storage (cold streaming) | Local SSD required |
| Index size (message col.) | 16.3 GiB | 282.4 GiB |
| Index as % of raw data | 0.33% | 5.60% |
| Index scaling with data | Sub-linear (sparse) | Linear with data volume |
Architectural Differences
ClickHouse Index Characteristics
Lucene-style inverted indexes work well for document search: structured token vocabularies and read-heavy patterns over well-defined fields. Log observability presents a different access pattern:
- High-cardinality point lookups on UUIDs. Even 1% FPR becomes expensive at 5 TB scale, producing hundreds of false-positive granule reads per query
- Substring matches on freeform message content that may cross token boundaries
- Compound filters mixing cardinality levels in a single query
- Cold cache on every query, with datasets too large to keep hot
CtrlB Index Design
CtrlB's sparse index is built for the observability workload. It keeps a small footprint,
reads directly from object storage, achieves near-zero false positives on ID lookups at
any scale, and accepts standard SQL LIKE natively.
Conclusion
ClickHouse is a capable analytical database with strong text indexing support. This benchmark followed its documentation for full-text search configuration and ran both engines on identical hardware. The results reflect how each system handles the specific access patterns common in log observability workloads at scale.
CtrlB is purpose-built for this workload shape. Across all 8 queries, the results were consistent:
8 of 8 queries: CtrlB is faster, by 2.2× to 99×
17× smaller index: 16.3 GiB vs. 282.4 GiB on the message column
Standard SQL with no additional syntax requirements
No local SSD requirement. Streams cold data directly from object storage
Benchmark dataset
The 5 TB Kubernetes log dataset used in this benchmark is stored in a public S3 bucket. Click here to browse or download the data for reproducibility.
