# Finding the Needle. CtrlB vs. ClickHouse, 8 Ways.

Eight WHERE … LIMIT 100 queries on a 5 TB log table — CtrlB vs. ClickHouse on identical hardware. CtrlB wins all 8 patterns, 2.2× to 99× faster, with a 17× smaller index.

Canonical: https://ctrlb.ai/blogs/finding-the-needle-ctrlb-vs-clickhouse

## Article body

99×

        Fastest speedup
span_id point lookup

      

      
        17×

        Smaller index
16 GiB vs. 282 GiB

      

      
        8/8

        Queries won
CtrlB wins every pattern

      

    

    
      Key Findings

      
        - CtrlB wins all 8 queries, from 2.2× to 99× faster, cold cache, same hardware

        - Index footprint: CtrlB 16.3 GiB vs. ClickHouse 282.4 GiB (17× gap)

        - CtrlB queries run in standard SQL with no additional syntax layer

      
    

    
      ClickHouse is a capable analytical database with active documentation on optimizing
      full-text search using its inverted index functions. We followed those recommendations
      and set up both engines on identical hardware against the same 5 TB production dataset.
      Results are the p50 of 5 cold runs.
    

    
    Test Environment

    
      
        
          ComponentConfiguration

        
        
          Dataset5 TB structured Kubernetes log data, production workload

          Tabledefault.benchmark_5tb

          Hardware16 vCPU instance, identical for both engines

          Cache stateCold cache at the start of every run

          Measurementp50 of 5 cold runs per query, wall-clock latency end-to-end

          ClickHouse versionv26.2

          ClickHouse indexesbloom_filter(0.01) on trace_id & span_id; text(tokenizer=splitByNonAlpha) on message

          CtrlB queriesStandard SQL with LIKE and equality predicates

          ClickHouse queriesWritten per ClickHouse documentation for full-text index usage

        
      

    

    
    Table Schema & Index Setup

    
      The ClickHouse table carries three secondary indexes representing the two index families
      ClickHouse provides for this workload.
    

    
      ClickHouse

      CREATE TABLE default.benchmark_5tb
(
  -- Timestamp & human-readable date
  `_timestamp`       UInt64,
  `ist_timestamp`    String DEFAULT '',

  -- Free-text log line (the haystack)
  `message`          String,

  -- Distributed tracing IDs (high-cardinality)
  `trace_id`         String DEFAULT '',
  `span_id`          String DEFAULT '',

  -- Kubernetes metadata (low/medium cardinality)
  `namespace_name`   LowCardinality(String) DEFAULT '',
  `container_name`   LowCardinality(String) DEFAULT '',
  `pod_name`         String DEFAULT '',
  `plugin_source`    LowCardinality(String) DEFAULT '',
  `stream`           LowCardinality(String) DEFAULT '',
  `bytes`            UInt64 DEFAULT 0,

  -- Full-text inverted index (splitByNonAlpha tokenizer)
  INDEX idx_message_text message
    TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 100000000,

  -- Bloom filters for exact ID lookups (1% false-positive rate)
  INDEX idx_trace_id trace_id TYPE bloom_filter(0.01) GRANULARITY 1,
  INDEX idx_span_id  span_id  TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY (namespace_name, container_name, stream, _timestamp)
SETTINGS
  index_granularity      = 8192,
  min_compress_block_size = 65536,
  old_parts_lifetime     = 480
    

    Sample log record

    
      One representative row from the 5 TB Kubernetes log dataset used in this benchmark:
    

    
      High cardinality
      Low / medium cardinality
    

    
      Sample record

      {
  "_timestamp": 1780257600000000,High

  "bytes": 17540,High

  "container_name": "rabbitmq-data-engine", Low

  "ist_timestamp": "01/06/2026, 01:30:00",High

  "message": "I, [2025-03-08T20:57:48.308159 #1]  INFO -- : [70b56038c39e2247e90f419c] [15dc74a5] [New relic Debug] - (job_started) - {:enqueued_at=>1741467082, :scheduled_at=>nil, :class=>\"TrackingCrawlerSqs\", :queue=>\"data_engine-high-processor\", :job_id=>\"70b56038c39e2247e90f419c\", :execution_id=>\"15dc74a5\", :execution_count=>1, :execution_at=>1741467468, :duration=>386, :status=>\"started\", :engine_name=>\"container_tracking\"} [batch:56e6d843-dee7-4160-8855-f3cbaeee41561162]", High

  "namespace_name": "production", Low

  "plugin_source": "k8s-fb-1.19.2", Low

  "pod_name": "ctrlbgo-app-rabbitmq-data-engine-79db6c66bc-gp6dl", High

  "span_id": "9ed343b6a933fc9a", High

  "stream": "stdout", Low

  "timestamp": "2026-05-31T19:59:59.065497790Z",High

  "trace_id": "9c3116fe0a2bbd33d9b1f4771f2868c4" High

}

    

    The Two Index Families

    
      idx_trace_id and idx_span_id use bloom_filter(0.01) with
      GRANULARITY 1. These are high-cardinality, near-unique ID columns where the entire
      point is exact-match lookup. A Bloom filter at granularity 1 is the textbook ClickHouse answer:
      one filter per MergeTree granule (~8,192 rows), letting ClickHouse skip granules that provably
      don't contain the value. The 0.01 false-positive rate is standard ClickStack tuning.
    

    
      idx_message_text uses TYPE text(tokenizer = splitByNonAlpha) —
      ClickHouse's inverted full-text index on the free-text message column, tokenizing on
      non-alphanumeric boundaries. This is ClickHouse's strongest possible choice for substring search.
      The GRANULARITY 100000000 effectively covers the entire dataset in a single index
      entry, correct for a posting-list inverted index.
    

    Ingestion Rate: 280 MB/s vs. 210 MB/s

    
      Both engines ingested the same 5 TB dataset on identical hardware — a
      16 vCPU VM — with sustained ingest measured during load.
      Storage was the only deliberate difference: CtrlB streams to
      object storage; ClickHouse wrote to local SSD
      (required for its MergeTree layout).
    

    
      
        CtrlB · Ingestion rate

        280 MB/s

        ~33% higher sustained ingest

        16 vCPU VM · object storage (cold streaming)

      

      
        ClickHouse · Ingestion rate

        210 MB/s

        Sustained ingest during load

        16 vCPU VM · local SSD

      

    

    
    Results: Wall-Clock Latency Per Query

    
      8 queries · p50 of 5 cold runs · lower is better. ClickHouse queries use
      hasAnyTokens / hasAllTokens as recommended in the ClickHouse
      documentation for activating the text inverted index.
    

    
      
        
          
            Query pattern
            CtrlB
            ClickHouse
            Advantage
          

        
        
          
            Find one trace_id
            140 ms
            7.1 s
            50×
          

          
            Find one span_id
            92 ms
            9.1 s
            99×
          

          
            Substring match, rare UUID
            46 ms
            721 ms
            16×
          

          
            Substring match, common token (ERROR)
            33 ms
            2.1 s
            64×
          

          
            Two filters: container_name + trace_id
            107 ms
            588 ms
            5.5×
          

          
            Filter + substring on message
            180 ms
            2.1 s
            12×
          

          
            Double substring on message, worst case
            1.25 s
            2.8 s
            2.2×
          

          
            High-cardinality equality on pod_name
            3.6 s
            24.5 s
            7×
          

        
      

    

    
    The 8 Queries

    
      Each query was run against default.benchmark_5tb. CtrlB uses standard SQL.
      ClickHouse uses the syntax recommended in its documentation for full-text index coverage.
    

    
    
      Query 1: Find One trace_id
      High-Cardinality Point Lookup
    

    
      The most common operation in distributed tracing: pull all log lines for a given trace.
      Both engines use equality filters. ClickHouse has a Bloom filter on trace_id
      with GRANULARITY 1, the most aggressive resolution possible.
      At 1% FPR with thousands of granules across 5 TB, the Bloom filter generates hundreds of
      false-positive granule reads. Each false-positive requires decompressing the granule and doing
      a precise row-level check. That I/O and CPU cost accumulates to 7+ seconds.
      CtrlB's sparse index achieves near-zero false positives at this cardinality.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE trace_id = '2d3d5a83f6a923d9...'
LIMIT 100
        Wall clock 140 ms

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE trace_id = '8ceafb5879a818a5...'
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 7.1 s · 50× slower

      

    

    
    
      Query 2: Find One span_id
      High-Cardinality Point Lookup
    

    
      Same pattern as trace_id, but on span_id. The gap widens to 99×.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE span_id = '07766058015caeea'
LIMIT 100
        Wall clock 92 ms

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE span_id = 'fbf291e847a2ac5e'
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 9.1 s · 99× slower

      

    

    
    
      Query 3: Substring Match, Rare UUID
      Full-Text / Wildcard Search
    

    
      A LIKE '%<uuid>%' search on the message column for a
      low-frequency substring. CtrlB: 46 ms. ClickHouse with its text index: 721 ms.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE message LIKE '%9d46ccea-98bd-4cef-bbd5%'
LIMIT 100
        Wall clock 46 ms

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message,
    'c8fedcbb-9a05-47a4-8dd8-f24f75291749')
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 721 ms · 16× slower

      

    

    
    
      Query 4: Substring Match, Common Token
      Full-Text / Wildcard Search
    

    
      Searching for 'ERROR', a high-frequency token, is the worst case for an inverted
      index: the posting list is enormous and provides almost no pruning benefit.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE message LIKE '%ERROR%'
LIMIT 100
        Wall clock 33 ms

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message, 'ERROR')
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 2.1 s · 64× slower

      

    

    
    
      Query 5: Two Filters: container_name + trace_id
      Low/Medium Cardinality Exact Filter
    

    
      A compound predicate combining a low-cardinality equality filter with a high-cardinality ID
      lookup. CtrlB: 107 ms. ClickHouse: 588 ms (5.5× slower).
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE container_name = 'sidekiq-data-engine'
  AND trace_id = '2d3d5a83f6a923d9...'
LIMIT 100
        Wall clock 107 ms

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE container_name = 'controller'
  AND trace_id = '4f99282f17d604c3...'
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 588 ms · 5.5× slower

      

    

    
    
      Query 6: Filter + Substring on message
      Full-Text / Wildcard Search
    

    
      Filter by container_name, then search for 'INFO' in the message —
      the bread-and-butter observability query: narrow by service, then search logs.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE container_name = 'sidekiq-data-engine'
  AND message LIKE '%INFO%'
LIMIT 100
        Wall clock 180 ms

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE container_name = 'demo-app'
  AND hasAnyTokens(message, 'INFO')
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 2.1 s · 12× slower

      

    

    
    
      Query 7: Double Substring on message
      Full-Text / Wildcard Search
    

    
      Two simultaneous LIKE conditions on message: 'INFO'
      AND 'timeout'. The most demanding full-text pattern in this benchmark.
      The gap narrows here because the intersection of posting lists for two common tokens
      genuinely prunes granules.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE message LIKE '%INFO%'
  AND message LIKE '%timeout%'
LIMIT 100
        Wall clock 1.25 s

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE hasAnyTokens(message, 'INFO')
  AND hasAnyTokens(message, 'timeout')
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 2.8 s · 2.2× slower

      

    

    
    
      Query 8: High-Cardinality Equality on pod_name
      High Cardinality Exact Filter
    

    
      An equality filter on pod_name, which has very high cardinality (unique per pod
      deployment). Neither engine has a dedicated secondary index on this column; both must scan.
      CtrlB's storage layout and scan path are more efficient even without a secondary index.
    

    
      
        CtrlB

        SELECT * FROM benchmark_5tb_rti_v17
WHERE pod_name =
  'demo-app-sqs-a4544e758c-fe295'
LIMIT 100
        Wall clock 3.6 s

      

      
        ClickHouse

        SELECT * FROM default.benchmark_5tb
WHERE pod_name =
  'ingress-nginx-controller-ddc1a51c61-f795c'
  AND _timestamp >= 1776617583000000
  AND _timestamp 
        Wall clock 24.5 s · 7× slower

      

    

    
    Index Footprint: 16 GiB vs. 282 GiB

    
      The most significant infrastructure difference is the index size on the message column.
      ClickHouse requires a full-text inverted index to make substring search tractable. That index
      weighs 282.4 GiB. CtrlB's sparse index on the same column is 16.3 GiB, 17× smaller.
    

    
      
        CtrlB · Sparse index

        16.3 GiB

        0.33% of raw data

        Streams cold data from object storage. No local SSD required.

      

      
        ClickHouse · Text inverted index

        282.4 GiB

        5.6% of raw data

        Requires oversized local SSD to keep compressed MergeTree parts + ~250 GiB text index hot.

      

    

    
      The ClickHouse text inverted index uses a Lucene-style approach: for every token produced
      by the splitByNonAlpha tokenizer, it maintains a posting list of which granules
      contain that token. At 5 TB of log data, that posting list becomes very large, at 5.6% of the raw
      data size, compared to CtrlB's 0.33%.
    

    
      Storage cost is not included in the latency numbers, but it shapes every capacity decision once
      the dataset crosses a few terabytes. At 50 TB, the ClickHouse text index alone would approach
      2.8 TB of additional local SSD.
    

    Same CPU. Very Different Bill.

    Both engines ran on a 16-vCPU instance. The CPU was identical. What differed was the attached storage requirement.

    
      
        
          AttributeCtrlBClickHouse

        
        
          Data storageObject storage (cold streaming)Local SSD required

          Index size (message col.)16.3 GiB282.4 GiB

          Index as % of raw data0.33%5.60%

          Index scaling with dataSub-linear (sparse)Linear with data volume

        
      

    

    
    Architectural Differences

    ClickHouse Index Characteristics

    
      Lucene-style inverted indexes work well for document search: structured token vocabularies
      and read-heavy patterns over well-defined fields. Log observability presents a different
      access pattern:
    

    
      - High-cardinality point lookups on UUIDs. Even 1% FPR becomes expensive at 5 TB scale, producing hundreds of false-positive granule reads per query

      - Substring matches on freeform message content that may cross token boundaries

      - Compound filters mixing cardinality levels in a single query

      - Cold cache on every query, with datasets too large to keep hot

    

    CtrlB Index Design

    
      CtrlB's sparse index is built for the observability workload. It keeps a small footprint,
      reads directly from object storage, achieves near-zero false positives on ID lookups at
      any scale, and accepts standard SQL LIKE natively.
    

    Conclusion

    
      ClickHouse is a capable analytical database with strong text indexing support. This benchmark
      followed its documentation for full-text search configuration and ran both engines on identical
      hardware. The results reflect how each system handles the specific access patterns common in
      log observability workloads at scale.
    

    CtrlB is purpose-built for this workload shape. Across all 8 queries, the results were consistent:

    
      
        Performance

        8 of 8 queries: CtrlB is faster, by 2.2× to 99×

      

      
        Index size

        17× smaller index: 16.3 GiB vs. 282.4 GiB on the message column

      

      
        SQL compatibility

        Standard SQL with no additional syntax requirements

      

      
        Storage

        No local SSD requirement. Streams cold data directly from object storage

      

    

    Benchmark dataset

    
      The 5 TB Kubernetes log dataset used in this benchmark is stored in a public S3 bucket.
      Click here
      to browse or download the data for reproducibility.
