Overview
A research collection on data compression strategies for code, telemetry, and storage, covering algorithms, pipelines, and practical tooling for the 2025–2026 ecosystem.
The collection spans three core areas: code condensation via repomix and semantic compression tools, observability pipeline compression for OpenTelemetry data, and storage-level codec selection for SQL databases and key-value stores.
Key Findings at a Glance

| Scenario | Pipeline | Reduction |
|---|---|---|
| AI chat (fast reload) | repomix --compress → zstd -3 | ~85–90% tokens (code repos); ~84% (markup repos), measured |
| AI deep analysis | repomix lossless → zstd -9 | ~82% size (measured, HTML/CSS repo); ~70% for code repos |
| Cold archival | repomix → PPMd via 7z | ~93–95% size |
| OTEL hot path | zstd at collector + ClickHouse ZSTD(3) | 85–95% vs raw |
| OTEL archival | OTel Arrow + ClickHouse DoubleDelta + ZSTD | up to 30:1 |
| SQL time-series | Delta + ZSTD codec chain | 10–50x vs uncompressed |
| KV small records | zstd dictionary training | +10–30% over base |

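
The Reduction column in the key-findings table mixes percentages ("85–95%"), ratios ("30:1"), and multipliers ("10–50x"). A small conversion helper can make those figures comparable. This is a minimal sketch; the function names are illustrative and not taken from any of the papers.

```python
def ratio_to_reduction(ratio: float) -> float:
    """Convert a compression ratio (original / compressed) to percent size reduction."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


def reduction_to_ratio(percent: float) -> float:
    """Convert a percent size reduction to a compression ratio (original / compressed)."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    return 100.0 / (100.0 - percent)


if __name__ == "__main__":
    # Ratios and multipliers quoted in the table, expressed as percent reduction.
    for label, ratio in [("30:1", 30.0), ("10x", 10.0), ("50x", 50.0)]:
        print(f"{label} is a {ratio_to_reduction(ratio):.1f}% reduction")
    # Percent reductions quoted in the table, expressed as ratios.
    for pct in (85.0, 95.0):
        print(f"{pct:.0f}% reduction is {reduction_to_ratio(pct):.1f}:1")
```

Read this way, a 10x result corresponds to a 90% reduction, so the SQL time-series and OTEL archival rows sit at or above the top of the 85–95% range quoted for the OTEL hot path.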
Research Papers
Repomix-to-Condense Pipeline: Compression Analysis & Tool Integrations
The core document. Answers how much repomix losslessly condenses a codebase, what a code → repomix → zstd pipeline adds, and whether the round-trip is recoverable. Includes a deep dive on integrations with PPM, ast-grep, and custom Makefiles for polyglot repos with mixed file types, plus per-type compression strategy tables and concrete pipeline examples.
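
One way to check the byte-level stage of such a pipeline is to compress the packed output, decompress it, and compare hashes. The sketch below is an assumption-laden illustration: it uses Python's standard-library `zlib` as a stand-in for zstd, and `roundtrip_report` is a hypothetical helper, not part of repomix or zstd. It verifies only the compression stage; any lossy condensation applied before it is out of scope.

```python
import hashlib
import zlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def roundtrip_report(packed: bytes, level: int = 9) -> dict:
    """Compress, decompress, and report size reduction plus a lossless check.

    zlib is a stand-in for zstd here; the verification logic is the same
    for any byte-level compressor.
    """
    if not packed:
        raise ValueError("packed input must not be empty")
    compressed = zlib.compress(packed, level)
    restored = zlib.decompress(compressed)
    return {
        "original_bytes": len(packed),
        "compressed_bytes": len(compressed),
        "reduction_pct": (1.0 - len(compressed) / len(packed)) * 100.0,
        "lossless": sha256_hex(restored) == sha256_hex(packed),
    }


if __name__ == "__main__":
    # Synthetic stand-in for a packed repository file.
    sample = b"def handler(event):\n    return event['id']\n" * 500
    print(roundtrip_report(sample))
```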
OTEL Telemetry Data Compression
Compression best practices for OpenTelemetry traces, metrics, and logs. Covers OTLP protocol compression options (gzip mandatory, zstd/snappy optional), OTel Arrow's 30–70% bandwidth reduction, ClickHouse codec selection for SigNoz/ClickStack, collector pipeline configuration, and dictionary training on protobuf samples. Includes cost impact estimates for SaaS ingest at scale.
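
The idea behind dictionary compression for small telemetry records can be sketched with the standard library alone. The example below makes several assumptions: it uses `zlib` preset dictionaries as a stand-in for zstd dictionary training, it uses JSON span-like records rather than protobuf, and the "dictionary" is simply a concatenation of sample records rather than a trained one. The field names are illustrative.

```python
import json
import zlib
from typing import List, Optional


def make_records(n: int) -> List[bytes]:
    """Build small, structurally similar span-like records (illustrative fields)."""
    return [
        json.dumps({
            "service.name": "checkout",
            "span.kind": "SERVER",
            "http.method": "GET",
            "http.status_code": 200,
            "trace_id": f"{i:032x}",
        }).encode()
        for i in range(n)
    ]


def compress(record: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress one record, optionally priming the compressor with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=6)
    else:
        comp = zlib.compressobj(level=6, zdict=zdict)
    return comp.compress(record) + comp.flush()


def decompress(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Decompress one record; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


if __name__ == "__main__":
    records = make_records(1001)
    dictionary = b"".join(records[:20])  # naive dictionary: raw samples, not trained
    target = records[1000]               # a record the dictionary has not seen
    print("raw bytes:       ", len(target))
    print("no dictionary:   ", len(compress(target)))
    print("with dictionary: ", len(compress(target, dictionary)))
```

The gain comes from the shared keys and values that a single small record cannot amortize on its own. The dictionary has to be available to every reader, which is the main operational cost of the approach.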
SQL & KV Data Storage Compression
Compression best practices for relational databases and key-value stores. Covers PostgreSQL TOAST (pglz vs lz4), ClickHouse per-column codec selection (Delta, DoubleDelta, Gorilla, T64, ZSTD), RocksDB compression by level, Redis memory encoding (ziplist/listpack), and Cloudflare KV edge compression. Addresses write amplification tradeoffs across compression levels.
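
To illustrate why a delta stage in front of a general-purpose compressor helps time-series columns, the following sketch delta-encodes a column of timestamps before compressing it. It is not ClickHouse's implementation: `zlib` stands in for ZSTD, the encoding is a plain first-order delta, and `sample_timestamps` generates hypothetical data.

```python
import struct
import zlib
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def compressed_size(values: List[int]) -> int:
    """Pack values as little-endian int64 and return the zlib-compressed size in bytes."""
    raw = struct.pack(f"<{len(values)}q", *values)
    return len(zlib.compress(raw, 6))


def sample_timestamps(n: int) -> List[int]:
    """Hypothetical millisecond timestamps: roughly one per second with small jitter."""
    base = 1_700_000_000_000
    return [base + i * 1000 + (i % 7) for i in range(n)]


if __name__ == "__main__":
    ts = sample_timestamps(5000)
    assert delta_decode(delta_encode(ts)) == ts  # the transform is lossless
    print("compressed, raw column:  ", compressed_size(ts))
    print("compressed, delta column:", compressed_size(delta_encode(ts)))
```

The delta column consists of small, repetitive integers, which the downstream compressor handles far better than slowly increasing 64-bit values. The actual gain depends on how regular the real data is.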
Reference Documents
| Document | Contents |
|---|---|
| Prediction by Partial Matching | PPM algorithm — context modeling, escape symbols, arithmetic coding; variants PPMC, PPMd, PPM*, PPM-Decay; open-source libraries across C, Python, Java, R |
| Repomix CLI Cheat Sheet | Full CLI reference — output formats, file filtering, --compress lossy behavior, compression modes (Interface/Signature/Minimal), full repomix.config.json template |
| Zstandard Condense Report | zstd overview — benchmark tables vs brotli/zlib/lz4/snappy, dictionary training how-to, build systems (make/cmake/vcpkg/conan), language bindings across 30+ languages |
| HTTP Content Compression (RFC 8478) | Full text of RFC 8478 — Zstandard compression format specification, frame format, FSE and Huffman entropy encoding, IANA media type registration |