Return to Blogs

Finding the Needle.
CtrlB vs. ClickHouse, 8 Ways.

PUBLISHED:
CtrlB vs ClickHouse benchmark comparison chart

ClickHouse documentation recommends using hasAnyTokens / hasAllTokens for full-text search against its text inverted index. We followed that guidance precisely and ran eight WHERE … LIMIT 100 queries against a 5 TB log table, comparing it with CtrlB on the same hardware.

99×
Fastest speedup
span_id point lookup
17×
Smaller index
16 GiB vs. 282 GiB
8/8
Queries won
CtrlB wins every pattern
Key Findings
  • CtrlB wins all 8 queries, from 2.2× to 99× faster, cold cache, same hardware
  • Index footprint: CtrlB 16.3 GiB vs. ClickHouse 282.4 GiB (17× gap)
  • CtrlB queries run in standard SQL with no additional syntax layer

ClickHouse is a capable analytical database with active documentation on optimizing full-text search using its inverted index functions. We followed those recommendations and set up both engines on identical hardware against the same 5 TB production dataset. Results are the p50 of 5 cold runs.

Test Environment

ComponentConfiguration
Dataset5 TB structured Kubernetes log data, production workload
Tabledefault.benchmark_5tb
Hardware16 vCPU instance, identical for both engines
Cache stateCold cache at the start of every run
Measurementp50 of 5 cold runs per query, wall-clock latency end-to-end
ClickHouse versionv26.2
ClickHouse indexesbloom_filter(0.01) on trace_id & span_id; text(tokenizer=splitByNonAlpha) on message
CtrlB queriesStandard SQL with LIKE and equality predicates
ClickHouse queriesWritten per ClickHouse documentation for full-text index usage

Table Schema & Index Setup

The ClickHouse table carries three secondary indexes representing the two index families ClickHouse provides for this workload.

ClickHouse
CREATE TABLE default.benchmark_5tb
(
  -- Timestamp & human-readable date
  `_timestamp`       UInt64,
  `ist_timestamp`    String DEFAULT '',

  -- Free-text log line (the haystack)
  `message`          String,

  -- Distributed tracing IDs (high-cardinality)
  `trace_id`         String DEFAULT '',
  `span_id`          String DEFAULT '',

  -- Kubernetes metadata (low/medium cardinality)
  `namespace_name`   LowCardinality(String) DEFAULT '',
  `container_name`   LowCardinality(String) DEFAULT '',
  `pod_name`         String DEFAULT '',
  `plugin_source`    LowCardinality(String) DEFAULT '',
  `stream`           LowCardinality(String) DEFAULT '',
  `bytes`            UInt64 DEFAULT 0,

  -- Full-text inverted index (splitByNonAlpha tokenizer)
  INDEX idx_message_text message
    TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 100000000,

  -- Bloom filters for exact ID lookups (1% false-positive rate)
  INDEX idx_trace_id trace_id TYPE bloom_filter(0.01) GRANULARITY 1,
  INDEX idx_span_id  span_id  TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY (namespace_name, container_name, stream, _timestamp)
SETTINGS
  index_granularity      = 8192,
  min_compress_block_size = 65536,
  old_parts_lifetime     = 480

Sample log record

One representative row from the 5 TB Kubernetes log dataset used in this benchmark:

High cardinality Low / medium cardinality
Sample record
{
"_timestamp": 1780257600000000,High
"bytes": 17540,High
"container_name": "rabbitmq-data-engine", Low
"ist_timestamp": "01/06/2026, 01:30:00",High
"message": "I, [2025-03-08T20:57:48.308159 #1] INFO -- : [70b56038c39e2247e90f419c] [15dc74a5] [New relic Debug] - (job_started) - {:enqueued_at=>1741467082, :scheduled_at=>nil, :class=>\"TrackingCrawlerSqs\", :queue=>\"data_engine-high-processor\", :job_id=>\"70b56038c39e2247e90f419c\", :execution_id=>\"15dc74a5\", :execution_count=>1, :execution_at=>1741467468, :duration=>386, :status=>\"started\", :engine_name=>\"container_tracking\"} [batch:56e6d843-dee7-4160-8855-f3cbaeee41561162]", High
"namespace_name": "production", Low
"plugin_source": "k8s-fb-1.19.2", Low
"pod_name": "ctrlbgo-app-rabbitmq-data-engine-79db6c66bc-gp6dl", High
"span_id": "9ed343b6a933fc9a", High
"stream": "stdout", Low
"timestamp": "2026-05-31T19:59:59.065497790Z",High
"trace_id": "9c3116fe0a2bbd33d9b1f4771f2868c4" High
}

The Two Index Families

idx_trace_id and idx_span_id use bloom_filter(0.01) with GRANULARITY 1. These are high-cardinality, near-unique ID columns where the entire point is exact-match lookup. A Bloom filter at granularity 1 is the textbook ClickHouse answer: one filter per MergeTree granule (~8,192 rows), letting ClickHouse skip granules that provably don't contain the value. The 0.01 false-positive rate is standard ClickStack tuning.

idx_message_text uses TYPE text(tokenizer = splitByNonAlpha) — ClickHouse's inverted full-text index on the free-text message column, tokenizing on non-alphanumeric boundaries. This is ClickHouse's strongest possible choice for substring search. The GRANULARITY 100000000 effectively covers the entire dataset in a single index entry, correct for a posting-list inverted index.

Ingestion Rate: 280 MB/s vs. 210 MB/s

Both engines ingested the same 5 TB dataset on identical hardware — a 16 vCPU VM — with sustained ingest measured during load. Storage was the only deliberate difference: CtrlB streams to object storage; ClickHouse wrote to local SSD (required for its MergeTree layout).

CtrlB · Ingestion rate
280 MB/s
~33% higher sustained ingest
16 vCPU VM · object storage (cold streaming)
ClickHouse · Ingestion rate
210 MB/s
Sustained ingest during load
16 vCPU VM · local SSD

Results: Wall-Clock Latency Per Query

8 queries · p50 of 5 cold runs · lower is better. ClickHouse queries use hasAnyTokens / hasAllTokens as recommended in the ClickHouse documentation for activating the text inverted index.

Query pattern CtrlB ClickHouse Advantage
Find one trace_id 140 ms 7.1 s 50×
Find one span_id 92 ms 9.1 s 99×
Substring match, rare UUID 46 ms 721 ms 16×
Substring match, common token (ERROR) 33 ms 2.1 s 64×
Two filters: container_name + trace_id 107 ms 588 ms 5.5×
Filter + substring on message 180 ms 2.1 s 12×
Double substring on message, worst case 1.25 s 2.8 s 2.2×
High-cardinality equality on pod_name 3.6 s 24.5 s

The 8 Queries

Each query was run against default.benchmark_5tb. CtrlB uses standard SQL. ClickHouse uses the syntax recommended in its documentation for full-text index coverage.

Query 1: Find One trace_id High-Cardinality Point Lookup

The most common operation in distributed tracing: pull all log lines for a given trace. Both engines use equality filters. ClickHouse has a Bloom filter on trace_id with GRANULARITY 1, the most aggressive resolution possible. At 1% FPR with thousands of granules across 5 TB, the Bloom filter generates hundreds of false-positive granule reads. Each false-positive requires decompressing the granule and doing a precise row-level check. That I/O and CPU cost accumulates to 7+ seconds. CtrlB's sparse index achieves near-zero false positives at this cardinality.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE trace_id = '2d3d5a83f6a923d9...'
LIMIT 100
Wall clock 140 ms
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE trace_id = '8ceafb5879a818a5...'
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC
LIMIT 100 FORMAT JSONEachROW
Wall clock 7.1 s · 50× slower
Query 2: Find One span_id High-Cardinality Point Lookup

Same pattern as trace_id, but on span_id. The gap widens to 99×.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE span_id = '07766058015caeea'
LIMIT 100
Wall clock 92 ms
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE span_id = 'fbf291e847a2ac5e'
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC
LIMIT 100 FORMAT JSONEachROW
Wall clock 9.1 s · 99× slower
Query 3: Substring Match, Rare UUID Full-Text / Wildcard Search

A LIKE '%<uuid>%' search on the message column for a low-frequency substring. CtrlB: 46 ms. ClickHouse with its text index: 721 ms.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE message LIKE '%9d46ccea-98bd-4cef-bbd5%'
LIMIT 100
Wall clock 46 ms
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message,
    'c8fedcbb-9a05-47a4-8dd8-f24f75291749')
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC LIMIT 100
Wall clock 721 ms · 16× slower
Query 4: Substring Match, Common Token Full-Text / Wildcard Search

Searching for 'ERROR', a high-frequency token, is the worst case for an inverted index: the posting list is enormous and provides almost no pruning benefit.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE message LIKE '%ERROR%'
LIMIT 100
Wall clock 33 ms
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message, 'ERROR')
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC LIMIT 100
Wall clock 2.1 s · 64× slower
Query 5: Two Filters: container_name + trace_id Low/Medium Cardinality Exact Filter

A compound predicate combining a low-cardinality equality filter with a high-cardinality ID lookup. CtrlB: 107 ms. ClickHouse: 588 ms (5.5× slower).

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE container_name = 'sidekiq-data-engine'
  AND trace_id = '2d3d5a83f6a923d9...'
LIMIT 100
Wall clock 107 ms
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE container_name = 'controller'
  AND trace_id = '4f99282f17d604c3...'
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC
LIMIT 100 FORMAT JSONEachROW
Wall clock 588 ms · 5.5× slower
Query 6: Filter + Substring on message Full-Text / Wildcard Search

Filter by container_name, then search for 'INFO' in the message — the bread-and-butter observability query: narrow by service, then search logs.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE container_name = 'sidekiq-data-engine'
  AND message LIKE '%INFO%'
LIMIT 100
Wall clock 180 ms
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE container_name = 'demo-app'
  AND hasAnyTokens(message, 'INFO')
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC LIMIT 100
Wall clock 2.1 s · 12× slower
Query 7: Double Substring on message Full-Text / Wildcard Search

Two simultaneous LIKE conditions on message: 'INFO' AND 'timeout'. The most demanding full-text pattern in this benchmark. The gap narrows here because the intersection of posting lists for two common tokens genuinely prunes granules.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE message LIKE '%INFO%'
  AND message LIKE '%timeout%'
LIMIT 100
Wall clock 1.25 s
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message, 'INFO')
  AND hasAnyTokens(message, 'timeout')
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC LIMIT 100
Wall clock 2.8 s · 2.2× slower
Query 8: High-Cardinality Equality on pod_name High Cardinality Exact Filter

An equality filter on pod_name, which has very high cardinality (unique per pod deployment). Neither engine has a dedicated secondary index on this column; both must scan. CtrlB's storage layout and scan path are more efficient even without a secondary index.

CtrlB
SELECT * FROM benchmark_5tb_rti_v17
WHERE pod_name =
  'demo-app-sqs-a4544e758c-fe295'
LIMIT 100
Wall clock 3.6 s
ClickHouse
SELECT * FROM default.benchmark_5tb
WHERE pod_name =
  'ingress-nginx-controller-ddc1a51c61-f795c'
  AND _timestamp >= 1776617583000000
  AND _timestamp <= 1776842347000000
ORDER BY _timestamp DESC LIMIT 100
Wall clock 24.5 s · 7× slower

Index Footprint: 16 GiB vs. 282 GiB

The most significant infrastructure difference is the index size on the message column. ClickHouse requires a full-text inverted index to make substring search tractable. That index weighs 282.4 GiB. CtrlB's sparse index on the same column is 16.3 GiB, 17× smaller.

CtrlB · Sparse index
16.3 GiB
0.33% of raw data
Streams cold data from object storage. No local SSD required.
ClickHouse · Text inverted index
282.4 GiB
5.6% of raw data
Requires oversized local SSD to keep compressed MergeTree parts + ~250 GiB text index hot.

The ClickHouse text inverted index uses a Lucene-style approach: for every token produced by the splitByNonAlpha tokenizer, it maintains a posting list of which granules contain that token. At 5 TB of log data, that posting list becomes very large, at 5.6% of the raw data size, compared to CtrlB's 0.33%.

Storage cost is not included in the latency numbers, but it shapes every capacity decision once the dataset crosses a few terabytes. At 50 TB, the ClickHouse text index alone would approach 2.8 TB of additional local SSD.

Same CPU. Very Different Bill.

Both engines ran on a 16-vCPU instance. The CPU was identical. What differed was the attached storage requirement.

AttributeCtrlBClickHouse
Data storageObject storage (cold streaming)Local SSD required
Index size (message col.)16.3 GiB282.4 GiB
Index as % of raw data0.33%5.60%
Index scaling with dataSub-linear (sparse)Linear with data volume

Architectural Differences

ClickHouse Index Characteristics

Lucene-style inverted indexes work well for document search: structured token vocabularies and read-heavy patterns over well-defined fields. Log observability presents a different access pattern:

  • High-cardinality point lookups on UUIDs. Even 1% FPR becomes expensive at 5 TB scale, producing hundreds of false-positive granule reads per query
  • Substring matches on freeform message content that may cross token boundaries
  • Compound filters mixing cardinality levels in a single query
  • Cold cache on every query, with datasets too large to keep hot

CtrlB Index Design

CtrlB's sparse index is built for the observability workload. It keeps a small footprint, reads directly from object storage, achieves near-zero false positives on ID lookups at any scale, and accepts standard SQL LIKE natively.

Conclusion

ClickHouse is a capable analytical database with strong text indexing support. This benchmark followed its documentation for full-text search configuration and ran both engines on identical hardware. The results reflect how each system handles the specific access patterns common in log observability workloads at scale.

CtrlB is purpose-built for this workload shape. Across all 8 queries, the results were consistent:

Performance

8 of 8 queries: CtrlB is faster, by 2.2× to 99×

Index size

17× smaller index: 16.3 GiB vs. 282.4 GiB on the message column

SQL compatibility

Standard SQL with no additional syntax requirements

Storage

No local SSD requirement. Streams cold data directly from object storage

Benchmark dataset

The 5 TB Kubernetes log dataset used in this benchmark is stored in a public S3 bucket. Click here to browse or download the data for reproducibility.

Elevate Your Workflow

Take control of your observability

Join thousands of developers using CtrlB to monitor their systems with complete confidence and extreme precision.

Request Access

Fill out the form below to get started with CtrlB.