ctrlb-decompose

Compress raw log lines into structural patterns with statistics, anomalies, and correlations. Turn millions of noisy log lines into a handful of actionable patterns.

CLI | WASM | Rust Library | GitHub Repository | MIT License
bash — ctrlb-decompose
$ cat server.log | ctrlb-decompose
ctrlb-decompose: 1,247,831 lines -> 43 patterns (99.9% reduction)
#1 [ERROR] ██████████████████████ 18,402 (1.5%)
<TS> ERROR [<*>] Connection to <ip> timed out after <duration>
  ip        IPv4      unique=12  top: 10.0.1.15 (34%), 10.0.1.22 (28%)
  duration  Duration  p50=120ms p99=4.8s
#2 [INFO] ████████████████████ 904,221 (72.5%)
<TS> INFO [<*>] Request from <ip> completed in <duration> status=<status>
  ip        IPv4      unique=1,847  top: 10.0.1.15 (12%), 10.0.1.22 (8%)
  duration  Duration  p50=23ms p99=312ms
  status    Enum      unique=3  values: 200 (91%), 404 (6%), 500 (3%)

How It Works

ctrlb-decompose runs a two-stage pipeline, normalization followed by clustering, that processes logs in a single streaming pass with a small, bounded memory footprint.


Timestamp Extraction

Strip & parse timestamps (ISO 8601, Apache, syslog, Unix epoch, etc.) into normalized <TS> markers with DateTime values.
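As a rough illustration of this step, the sketch below replaces the first timestamp in a line with a <TS> marker and returns the parsed value. It is a minimal sketch, not the tool's implementation: the two regular expressions (ISO 8601 and 10-digit Unix epoch) and the function name extract_timestamp are assumptions, and fractional seconds are dropped for brevity.

```python
import re
from datetime import datetime, timezone

# Illustrative patterns only; the formats the tool recognizes may differ.
_ISO = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})[T ](?P<clock>\d{2}:\d{2}:\d{2})"
    r"(?:\.\d+)?(?P<zone>Z|[+-]\d{2}:\d{2})?"
)
_EPOCH = re.compile(r"\b1\d{9}\b")  # 10-digit Unix seconds


def extract_timestamp(line):
    """Return (line with the first timestamp replaced by <TS>, datetime or None)."""
    m = _ISO.search(line)
    if m:
        zone = m.group("zone") or "Z"  # assumption: no zone means UTC
        if zone == "Z":
            zone = "+00:00"
        parsed = datetime.fromisoformat(f"{m.group('date')}T{m.group('clock')}{zone}")
    else:
        m = _EPOCH.search(line)
        if not m:
            return line, None
        parsed = datetime.fromtimestamp(int(m.group(0)), tz=timezone.utc)
    return line[: m.start()] + "<TS>" + line[m.end():], parsed
```

For example, extract_timestamp("2024-01-15T14:22:01Z ERROR [db] timeout") yields the line "<TS> ERROR [db] timeout" together with a timezone-aware datetime.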


Drain3 Clustering

Tree-based similarity clustering groups logtypes into patterns. Differing tokens become <*> wildcards. Incremental — no second pass needed.


Statistics Accumulation

DDSketch quantiles (p50/p99), HyperLogLog cardinality estimation, top-k values, temporal bucketing, and reservoir-sampled example lines.


Scoring & Correlation

Keyword-based severity (ERROR > WARN > INFO > DEBUG), temporal co-occurrence, shared variable correlation, and error cascade detection across patterns.
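The keyword-based part of this scoring can be pictured with a small sketch. It assumes severity is read from the first matching level keyword in a template and that patterns are ordered by severity before count; the names SEVERITY_RANK, severity, and rank_patterns are hypothetical, and the co-occurrence and cascade analysis is not covered.

```python
import re

# Higher rank means more severe, following ERROR > WARN > INFO > DEBUG.
SEVERITY_RANK = {"ERROR": 3, "WARN": 2, "INFO": 1, "DEBUG": 0}
_LEVEL = re.compile(r"\b(ERROR|WARN|INFO|DEBUG)\b")


def severity(template):
    """Return the most severe level keyword found in a template, or None."""
    found = _LEVEL.findall(template)
    return max(found, key=SEVERITY_RANK.__getitem__) if found else None


def rank_patterns(patterns):
    """Sort (template, count) pairs by severity first, then by line count."""
    def key(pattern):
        template, count = pattern
        return (SEVERITY_RANK.get(severity(template), -1), count)

    return sorted(patterns, key=key, reverse=True)
```

Under this ordering an ERROR pattern with 18,402 lines sorts ahead of an INFO pattern with 904,221 lines, which matches the order in the sample output at the top of the page.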


Stage 1 — CLP Encoding

CLP (Compact Log Pattern) encoding normalizes variable tokens into typed placeholders, so structurally identical lines produce identical logtypes regardless of the actual values.

Input:
"Request from 10.0.1.15 completed in 45ms status=200"
Logtype:
"Request from <dict> completed in <float>ms status=<int>"
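To make the idea concrete, here is a minimal sketch of token-level encoding. The rules are deliberately cruder than the tool's and are assumptions: pure integers become <int>, decimals become <float>, and any other token containing a digit becomes <dict>. As a result it maps 45ms to <dict> rather than <float>ms as in the example above, but it shows the key property that lines differing only in values share a logtype.

```python
import re

_INT = re.compile(r"-?\d+")
_FLOAT = re.compile(r"-?\d+\.\d+")


def encode_token(token):
    """Map one whitespace-delimited token to itself or a typed placeholder."""
    key, sep, value = token.partition("=")
    if sep:  # key=value: keep the key, encode only the value
        return key + sep + encode_token(value)
    if _FLOAT.fullmatch(token):
        return "<float>"
    if _INT.fullmatch(token):
        return "<int>"
    if any(ch.isdigit() for ch in token):
        return "<dict>"  # mixed text and digits, e.g. an IP address or an ID
    return token


def to_logtype(message):
    """Encode every token of a message and rejoin them into a logtype."""
    return " ".join(encode_token(t) for t in message.split())
```

Two requests from different IPs with different latencies and status codes both encode to "Request from <dict> completed in <dict> status=<int>" under these rules.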

Stage 2 — Drain3 Clustering

The Drain algorithm builds a prefix tree over logtypes and groups them by token similarity (configurable threshold, default 0.4). Where tokens diverge, the template gains a <*> wildcard.

This runs incrementally — each line is processed once with no second pass needed.
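The grouping and wildcard behavior can be sketched in a few lines. This is a simplified illustration, not Drain3 itself: it buckets logtypes by token count only (the first level of the tree) instead of building the full prefix tree, and the class name MiniDrain is hypothetical. The 0.4 default threshold comes from the text above.

```python
WILDCARD = "<*>"


class MiniDrain:
    """Simplified Drain-style clusterer: buckets by token count, no deeper tree."""

    def __init__(self, threshold=0.4):
        self.threshold = threshold
        self.buckets = {}  # token count -> list of [template_tokens, line_count]

    @staticmethod
    def similarity(template, tokens):
        """Fraction of positions where the template and the tokens agree."""
        same = sum(1 for a, b in zip(template, tokens) if a == b)
        return same / len(tokens) if tokens else 1.0

    def add(self, logtype):
        """Assign one logtype to a cluster and return that cluster's template."""
        tokens = logtype.split()
        clusters = self.buckets.setdefault(len(tokens), [])
        best = max(clusters, key=lambda c: self.similarity(c[0], tokens), default=None)
        if best is None or self.similarity(best[0], tokens) < self.threshold:
            best = [tokens, 0]  # nothing similar enough: start a new cluster
            clusters.append(best)
        else:
            # Positions where the tokens diverge become wildcards.
            best[0] = [a if a == b else WILDCARD for a, b in zip(best[0], tokens)]
        best[1] += 1
        return " ".join(best[0])
```

Each call to add touches one line once and updates the matching template in place, which is what makes the approach incremental.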

Variable Classification

Extracted variables are classified into semantic types for richer analysis:

Type          Example                Detection Method
IPv4 / IPv6   10.0.1.15              CIDR pattern match
UUID          550e8400-e29b-...      8-4-4-4-12 hex format
Duration      45ms, 3.2s             Numeric + time unit suffix
HexID         0x1a2b3c               4+ hex digits
Integer       200                    Parses as i64
Float         3.14                   Contains ., parses as f64
Enum          ERROR                  Low cardinality (<=20 unique, top-3 >= 80%)
Timestamp     2024-01-15T14:22:01Z   RFC 3339 pattern
String        anything else          Fallback
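A classifier along these lines could look like the sketch below. The regular expressions, the check order, and the function names classify and is_enum are assumptions made for illustration. In particular, Integer and Float are tested before HexID so that a decimal value such as 1234 is not read as hex, and Enum is decided over a column of values rather than a single value.

```python
import ipaddress
import re
from collections import Counter

_UUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")
_RFC3339 = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})")
_DURATION = re.compile(r"\d+(?:\.\d+)?(?:ns|us|ms|s|m|h)")
_INT = re.compile(r"-?\d+")
_HEX = re.compile(r"(?:0x)?[0-9a-fA-F]{4,}")


def classify(value):
    """Return a semantic type name for a single extracted value."""
    try:
        return "IPv4" if ipaddress.ip_address(value).version == 4 else "IPv6"
    except ValueError:
        pass
    if _UUID.fullmatch(value):
        return "UUID"
    if _RFC3339.fullmatch(value):
        return "Timestamp"
    if _DURATION.fullmatch(value):
        return "Duration"
    if _INT.fullmatch(value) and -2**63 <= int(value) < 2**63:  # fits in an i64
        return "Integer"
    if "." in value:
        try:
            float(value)
            return "Float"
        except ValueError:
            pass
    if _HEX.fullmatch(value):
        return "HexID"
    return "String"


def is_enum(values, max_unique=20, top_n=3, coverage=0.8):
    """Low-cardinality test: few unique values, and the top few dominate."""
    counts = Counter(values)
    if not counts or len(counts) > max_unique:
        return False
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(values) >= coverage
```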

Memory Efficiency

Drain3 Clusters

O(k) memory for k clusters, with LRU eviction (default maximum 10k clusters)
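One common way to get this kind of bound is an ordered map that evicts its least recently used entry once a cap is reached. The sketch below illustrates that technique; the class name LruClusters and its touch method are hypothetical, and the tool's own eviction logic may differ.

```python
from collections import OrderedDict


class LruClusters:
    """Keep at most max_clusters entries, evicting the least recently used."""

    def __init__(self, max_clusters=10_000):
        self.max_clusters = max_clusters
        self._items = OrderedDict()  # cluster id -> cluster state

    def touch(self, cluster_id, make_state=dict):
        """Return the state for cluster_id, creating it and evicting if needed."""
        state = self._items.get(cluster_id)
        if state is not None:
            self._items.move_to_end(cluster_id)  # mark as most recently used
            return state
        state = make_state()
        self._items[cluster_id] = state
        if len(self._items) > self.max_clusters:
            self._items.popitem(last=False)  # drop the oldest entry
        return state

    def __contains__(self, cluster_id):
        return cluster_id in self._items

    def __len__(self):
        return len(self._items)
```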

Quantiles

DDSketch — fixed ~200 bytes per numeric slot, no raw value storage
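The core idea of DDSketch is to count values in logarithmically sized buckets so that any quantile can be answered within a fixed relative error without keeping the raw values. The sketch below is a minimal version for positive values only. The class name MiniDDSketch is hypothetical, and the bucket cap that keeps memory fixed in a production sketch is omitted here.

```python
import math
from collections import Counter


class MiniDDSketch:
    """Quantile sketch with relative-error buckets (positive values only)."""

    def __init__(self, relative_accuracy=0.01):
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.buckets = Counter()  # bucket index -> number of values
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this sketch handles positive values only")
        # Bucket i covers the range (gamma**(i-1), gamma**i].
        index = math.ceil(math.log(value) / math.log(self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1), or None if the sketch is empty."""
        if self.count == 0:
            return None
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Bucket midpoint that bounds the relative error.
                return 2 * self.gamma ** index / (self.gamma + 1)
```

Feeding request durations into add and calling quantile(0.5) and quantile(0.99) gives p50 and p99 estimates of the kind shown in the sample output.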

Cardinality

HyperLogLog++ — ~200 bytes per high-cardinality variable
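For intuition, the sketch below implements plain HyperLogLog with the small-range linear-counting correction, not the HyperLogLog++ variant named above. The class name MiniHLL, the use of SHA-1 as the hash, and the default of 256 one-byte registers are all assumptions chosen to keep the example short.

```python
import hashlib
import math


class MiniHLL:
    """Plain HyperLogLog cardinality estimator with 2**precision registers."""

    def __init__(self, precision=8):
        self.p = precision
        self.m = 1 << precision
        self.registers = bytearray(self.m)  # one byte per register

    def add(self, value):
        # Take 64 bits of a stable hash of the value.
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Position of the leftmost 1-bit in the remaining 64 - p bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # linear counting
        return raw
```

With 256 registers the expected standard error is about 6.5%, and adding the same value repeatedly never changes the estimate.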

Examples

Reservoir sampling — bounded buffer per pattern
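Reservoir sampling keeps a uniform random sample of a stream in a fixed-size buffer. The sketch below uses the classic Algorithm R; the class name Reservoir and the optional seed parameter are illustrative assumptions.

```python
import random


class Reservoir:
    """Algorithm R: uniform sample of at most `capacity` items from a stream."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self._rng = random.Random(seed)

    def offer(self, item):
        """Consider one stream item for inclusion in the sample."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(item)  # buffer not full yet: always keep
            return
        slot = self._rng.randrange(self.seen)  # uniform in [0, seen)
        if slot < self.capacity:
            self.samples[slot] = item  # replace with probability capacity / seen
```

However many lines a pattern matches, the buffer never grows past its capacity, so example lines cost a bounded amount of memory per pattern.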

Installation

macOS (Homebrew)

brew tap ctrlb-hq/tap
brew install ctrlb-decompose

Debian / Ubuntu

curl -LO https://github.com/ctrlb-hq/ctrlb-decompose/releases/download/v0.1.0/ctrlb-decompose_0.1.0-1_amd64.deb
sudo dpkg -i ctrlb-decompose_0.1.0-1_amd64.deb

Build from source

git clone https://github.com/ctrlb-hq/ctrlb-decompose.git
cd ctrlb-decompose
cargo build --release
# Binary at target/release/ctrlb-decompose

Usage & Options

# Pipe from stdin
cat /var/log/syslog | ctrlb-decompose
# Read from file
ctrlb-decompose server.log
# LLM-optimized output (compact, token-efficient)
ctrlb-decompose --llm app.log
# JSON output
ctrlb-decompose --json app.log
# Top 10 patterns with 3 example lines each
ctrlb-decompose --top 10 --context 3 app.log

Command Line Options

ctrlb-decompose [OPTIONS] [FILE]

Arguments:
  [FILE]          Log file path (reads stdin if omitted or "-")

Options:
      --human         Human-readable output with colors (default)
      --llm           LLM-optimized compact markdown
      --json          Structured JSON output
      --top <N>       Show top N patterns (default: 20)
      --context <N>   Example lines per pattern (default: 0)
      --no-color      Disable ANSI colors
      --no-banner     Suppress header/footer
  -q, --quiet         Suppress progress messages
  -h, --help          Show help
  -V, --version       Show version

Output Formats

Format   Flag                Best for
Human    --human (default)   Terminal investigation: colored, visual bars
LLM      --llm               Feeding into LLMs: compact, token-efficient markdown
JSON     --json              Programmatic consumption: structured, machine-readable
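For programmatic consumption, the JSON format can be driven from a script. The sketch below builds a command line from the documented flags only (--json, --quiet, --top, --context) and parses the output. It assumes the binary is on PATH and writes the JSON document to stdout; the helper names decompose_argv and run_decompose are hypothetical, and the JSON schema is not described here, so the result is returned as-is.

```python
import json
import subprocess


def decompose_argv(path, top=None, context=None):
    """Build an argument list for JSON output using only the documented flags."""
    argv = ["ctrlb-decompose", "--json", "--quiet"]
    if top is not None:
        argv += ["--top", str(top)]
    if context is not None:
        argv += ["--context", str(context)]
    argv.append(path)
    return argv


def run_decompose(path, **options):
    """Run the CLI and return the parsed JSON document (schema not assumed)."""
    result = subprocess.run(
        decompose_argv(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return json.loads(result.stdout)
```

Calling run_decompose("app.log", top=10, context=3) corresponds to running ctrlb-decompose --json --quiet --top 10 --context 3 app.log.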