Repomix-to-Condense Pipeline

Compression Analysis & Tool Integrations
Research Paper | Collection: Code Condensation Whitepaper | Integrity Studio

Research Questions

How much does repomix losslessly condense? How much additional benefit would a code → repomix → zstd pipeline add, and what about the round-trip back? Then: a deeper dive on integrations with PPM, ast-grep, custom Makefiles, and other tools for polyglot repos.

1. How Much Does Repomix Condense?

Repomix operates at two distinct levels:

1a. Lossless Packing (default, no --compress)

Repomix concatenates all source files into a single structured document (XML, Markdown, JSON, or plain text). This step is not compression in the traditional sense — the output is typically larger than the sum of individual files because it adds XML/Markdown wrapper tags, directory structure metadata, file summary / token count header, and security check metadata.

Estimated overhead: +5–15% over raw file concatenation, depending on output style. XML adds the most overhead; plain text the least.

Effective reduction for AI consumption:
  • Ignore filtering: Stripping node_modules/, dist/, .git/, lockfiles, binaries — which in many repos account for 80–95% of total disk size
  • Comment removal (--remove-comments): Typically saves 10–25% of source-only content
  • Empty line removal (--remove-empty-lines): Saves another 3–8%

Net result (lossless): For a typical Node.js/Python project, repomix with --remove-comments --remove-empty-lines produces output that is roughly 60–80% of the raw source file content. Relative to total repo size on disk (including node_modules, .git, etc.), the reduction is 90–98%.
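The two figures compose multiplicatively. A quick sanity check of the arithmetic, using illustrative midpoints from the ranges above (the 90% dead-weight share and 70% retained-source fraction are assumptions, not measurements):

```python
# Illustrative midpoints from the ranges above (assumptions, not measurements):
dead_weight_share = 0.90   # node_modules/, .git/, etc. as a fraction of disk size
source_kept = 0.70         # repomix output vs. raw source after comment/empty-line removal

source_share = 1 - dead_weight_share         # source files are 10% of the repo on disk
output_vs_disk = source_share * source_kept  # packed output as a fraction of disk size
reduction_vs_disk = 1 - output_vs_disk

print(f"output is {output_vs_disk:.0%} of disk size "
      f"({reduction_vs_disk:.0%} reduction)")  # 7% of disk, 93% reduction
```

This lands squarely inside the 90–98% range quoted above.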

1b. Lossy Compression (--compress)

Using Tree-sitter, repomix extracts structural signatures and strips function bodies:

Status    | Content
----------|--------
Preserves | Function/method signatures, class structures, interface/type definitions, imports, exports, docstrings
Removes   | Function bodies, loop/conditional internals, local variable assignments

Reported reduction: ~70% fewer tokens vs. full source. Combined with comment and empty-line removal, the total reduction from source → compressed repomix output is typically 75–85% fewer tokens.

2. The code → repomix → zstd Pipeline

2a. What zstd Adds on Top of Repomix Output

Repomix output is structured text (XML/Markdown) with high redundancy — repeated tag names, indentation patterns, and boilerplate. This makes it an excellent candidate for general-purpose compression.

Measured zstd ratios on repomix lossless output (4.0 MB XML, HTML/CSS/MD-heavy repo, zstd 1.5.7, Apple M-series, Mar 2026):

zstd Level | Compression Ratio | Compress Time (4 MB) | Decompress Time (4 MB)
-----------|-------------------|----------------------|-----------------------
-1 (fastest) | 3.73x | ~0.02s (~200 MB/s) | ~0.01s (~400 MB/s)
-9 (balanced) | 5.31x | ~0.15s (~27 MB/s) | ~0.01s (~400 MB/s)
--ultra -22 (max) | 5.90x | ~2.8s (~1.4 MB/s) | ~0.01s (~400 MB/s)

Note: Ratios exceed Silesia corpus estimates because repomix XML output has extreme structural redundancy. Actual ratios vary by repo type — code-heavy repos (JS/TS) typically see slightly lower ratios than markup-heavy repos. Source: integritystudio.io reports repo (233 HTML/CSS/MD files, 4.1 MB).

Key insight: Decompression time is consistent across all levels (~0.01s for 4 MB). The cost is paid entirely at compression time. For AI tool workflows where you compress once and decompress frequently, even --ultra -22 is effectively free on reload.
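The compress-once/decompress-often economics are easy to make concrete. A small sketch using the measured times from the table above as fixed costs:

```python
# Measured times for the 4 MB test file (from the table above)
compress_s = {"zstd -1": 0.02, "zstd -9": 0.15, "zstd --ultra -22": 2.8}
decompress_s = 0.01  # roughly constant across levels

def total_time(level, reloads):
    """One-time compression cost plus `reloads` decompressions."""
    return compress_s[level] + reloads * decompress_s

# After enough reloads, the one-time compression cost washes out:
for level in compress_s:
    print(level, round(total_time(level, 1000), 2))
```

At 1000 reloads, even --ultra -22's 2.8s head start is under a 30% difference in total time, and the gap keeps shrinking as reload count grows.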

2b. Combined Pipeline Numbers

Empirical measurements — integritystudio.io reports repo (233 HTML/CSS/MD files, zstd 1.5.7, Apple M-series, Mar 2026):

Stage | Size | Cumulative Reduction
------|------|---------------------
Raw source files | 4.1 MB | baseline
repomix lossless | 3.9 MB | 6%
repomix --compress | 3.6 MB | 15%
lossless + zstd -1 | 1.05 MB | 75%
lossless + zstd -9 | 0.74 MB | 82%
compress + zstd -9 | 0.69 MB | 84%
compress + zstd --ultra -22 | 0.62 MB | 85%

For comparison, raw source → zstd -9 (no repomix): 0.76 MB (82% reduction).

HTML/CSS/MD caveat: --compress shows only 15% reduction for this markup-heavy repo because Tree-sitter's function-body stripping targets code constructs not present in HTML/CSS. For JS/TS/Python repos, --compress typically delivers 70–85% token reduction, making the combined pipeline significantly more effective. The zstd stage dominates size reduction regardless of repo type.
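The cumulative-reduction column can be recomputed from the reported sizes (small discrepancies against the table are rounding in the reported megabyte figures):

```python
# Sizes from the pipeline table above (MB)
raw_mb = 4.1
stages = {
    "repomix lossless": 3.9,
    "repomix --compress": 3.6,
    "lossless + zstd -1": 1.05,
    "lossless + zstd -9": 0.74,
    "compress + zstd -9": 0.69,
    "compress + zstd --ultra -22": 0.62,
}
for name, mb in stages.items():
    print(f"{name}: {1 - mb / raw_mb:.0%} reduction")
```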

2c. The Round-Trip

zstd is fully lossless. Decompressing a .zst file yields the exact repomix output byte-for-byte. So the round-trip question is really about repomix:

  • Without --compress: The round-trip is lossless. You can extract individual files from the output (though repomix doesn't provide an "unpack" tool).
  • With --compress: The round-trip is lossy. Function bodies are gone. This is a one-way transformation suitable for AI analysis, not archival.

3. Deep Dive: Additional Tools & Integrations

3a. Prediction by Partial Matching (PPMd)

PPM is an adaptive statistical compression technique using Markov context models + arithmetic coding. PPMd (by Dmitry Shkarin) is the most practical implementation, used in RAR and 7-Zip.
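To see why context modeling works so well on text, here is a minimal sketch — not PPMd itself, which adds escape handling, variable-order blending, and arithmetic coding — of an order-2 model that predicts each character from the two preceding ones:

```python
from collections import Counter, defaultdict

def order2_hit_rate(text: str) -> float:
    """Fraction of characters whose most frequent order-2 successor
    (learned from the same text) matches the actual next character."""
    successors = defaultdict(Counter)
    for i in range(len(text) - 2):
        successors[text[i:i + 2]][text[i + 2]] += 1
    hits = sum(
        successors[text[i:i + 2]].most_common(1)[0][0] == text[i + 2]
        for i in range(len(text) - 2)
    )
    return hits / (len(text) - 2)

# Repomix-style XML boilerplate is highly predictable from context:
xmlish = '<file path="src/a.ts">\n</file>\n' * 50
print(f"{order2_hit_rate(xmlish):.0%}")
```

The more predictable each symbol is from its context, the fewer bits arithmetic coding needs to spend on it — which is exactly why PPMd excels on repetitive structured text.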

PPMd vs. zstd on text/code (Large Text Compression Benchmark — enwik9, 1 GB Wikipedia XML):

Compressor | Compressed Size | Ratio | Compress Time | Memory
-----------|-----------------|-------|---------------|-------
FreeArc PPMd (order 13, 1012 MB model) | 175 MB | 5.7x | 1175s | 1046 MB
7zip PPMd (order 10, 1630 MB model) | 179 MB | 5.6x | 503s | 1630 MB
zstd --ultra -22 | 216 MB | 4.6x | 701s | 792 MB
xz -9 -e (LZMA2) | 248 MB | 4.0x | 2310s | 690 MB
bzip2 -9 | 254 MB | 3.9x | 379s | 8 MB
gzip -9 | 323 MB | 3.1x | 101s | 1.6 MB

Key takeaway: PPMd achieves 15–25% better compression than zstd's maximum on text-heavy content, but at the cost of 5–10x slower decompression, higher memory requirements (1–2 GB vs. <1 GB), and no streaming/dictionary support.

Integration with repomix pipeline:

# Active use (fast decompress for AI tools):
repomix --compress | zstd -9 > repo.xml.zst

# Archival (maximum squeeze):
repomix --compress -o repo.xml
7z a -m0=PPMd -mx=9 -mmem=1024m repo.7z repo.xml

3b. ast-grep (Structural Code Search & Rewriting)

ast-grep (12.6k GitHub stars, MIT license) is a Rust CLI tool for AST-based code structural search, lint, and rewriting. It uses Tree-sitter — the same parser repomix uses for --compress.

1. Pre-compression normalization

# Normalize all arrow functions to consistent style
sg --pattern 'function $NAME($$$ARGS) { return $EXPR }' \
   --rewrite 'const $NAME = ($$$ARGS) => $EXPR' --lang ts --update-all src/

2. Selective extraction

# Extract only exported API surfaces
sg --pattern 'export $$$' --lang ts . > api-surface.txt

3. Full pipeline

# Normalize → strip dead code → pack → compress
sg --pattern 'console.log($$$)' --rewrite '' --lang ts --update-all src/
repomix --compress --remove-comments -o packed.xml src/
zstd -9 packed.xml

ast-grep supports 20+ languages via Tree-sitter grammars — matching repomix's polyglot capability but with surgical precision.

3c. Custom Makefiles for Polyglot Pipelines

For repos with mixed languages and file types, a Makefile can orchestrate per-filetype optimal compression:

SRCDIR := src
OUTDIR := .condense
ZSTD_LEVEL := 9

CODE_FILES := $(shell find $(SRCDIR) -name '*.ts' -o -name '*.py' -o -name '*.rs' -o -name '*.go' -o -name '*.java')
CONFIG_FILES := $(shell find $(SRCDIR) -name '*.json' -o -name '*.yaml' -o -name '*.toml' -o -name '*.xml')
TEXT_FILES := $(shell find $(SRCDIR) -name '*.md' -o -name '*.txt' -o -name '*.rst')
IMAGE_FILES := $(shell find $(SRCDIR) -name '*.png' -o -name '*.jpg' -o -name '*.svg')

$(OUTDIR):
	mkdir -p $(OUTDIR)

# Code: semantic compress via repomix
$(OUTDIR)/code.xml.zst: $(CODE_FILES) | $(OUTDIR)
	repomix --compress --remove-comments --include "**/*.{ts,py,rs,go,java}" \
		-o $(OUTDIR)/code.xml
	zstd -$(ZSTD_LEVEL) --rm $(OUTDIR)/code.xml

# Config: lossless pack (structure matters)
$(OUTDIR)/config.xml.zst: $(CONFIG_FILES) | $(OUTDIR)
	repomix --remove-empty-lines --include "**/*.{json,yaml,toml,xml}" \
		-o $(OUTDIR)/config.xml
	zstd -$(ZSTD_LEVEL) --rm $(OUTDIR)/config.xml

# Text: PPMd for maximum ratio (compress once, read rarely)
$(OUTDIR)/docs.7z: $(TEXT_FILES) | $(OUTDIR)
	repomix --include "**/*.{md,txt,rst}" -o $(OUTDIR)/docs.xml
	7z a -m0=PPMd -mx=9 $(OUTDIR)/docs.7z $(OUTDIR)/docs.xml
	rm $(OUTDIR)/docs.xml

# Images: already compressed, just archive
$(OUTDIR)/assets.tar.zst: $(IMAGE_FILES) | $(OUTDIR)
	tar cf - $(IMAGE_FILES) | zstd -1 > $(OUTDIR)/assets.tar.zst

condense: $(OUTDIR)/code.xml.zst $(OUTDIR)/config.xml.zst $(OUTDIR)/docs.7z $(OUTDIR)/assets.tar.zst

clean:
	rm -rf $(OUTDIR)

.PHONY: condense clean

Why per-type strategies matter:

File Type | Best Strategy | Reason
----------|---------------|-------
Code (.ts, .py, .rs) | repomix --compress + zstd | AST extraction removes ~70% before statistical compression
Config (.json, .yaml) | repomix lossless + zstd | Structure must be preserved; high redundancy benefits zstd
Text (.md, .txt) | repomix + PPMd | Natural language is PPMd's sweet spot (15–25% better than zstd)
Images (.png, .jpg) | tar + zstd -1 | Already compressed; fast archive only
SVG | repomix lossless + zstd -19 | XML text; compresses very well
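The routing logic in the table can be expressed as a small dispatcher. A sketch (the strategy strings mirror the table; the fallback for unknown extensions is this document's assumption):

```python
from pathlib import Path

# Extension → strategy, mirroring the per-type table above
STRATEGIES = {
    (".ts", ".py", ".rs"): "repomix --compress + zstd",
    (".json", ".yaml"): "repomix lossless + zstd",
    (".md", ".txt"): "repomix + PPMd",
    (".png", ".jpg"): "tar + zstd -1",
    (".svg",): "repomix lossless + zstd -19",
}

def strategy_for(path: str) -> str:
    ext = Path(path).suffix.lower()
    for exts, strategy in STRATEGIES.items():
        if ext in exts:
            return strategy
    return "repomix lossless + zstd"  # safe default for unknown types

print(strategy_for("src/app.ts"))       # repomix --compress + zstd
print(strategy_for("assets/logo.svg"))  # repomix lossless + zstd -19
```

In practice this mapping would feed a Makefile generator or a pre-pack filtering script rather than being called per-file at compression time.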

3d. Dictionary Training for Polyglot Repos

# Train a dictionary on samples from your codebase
# (the ** globs require zsh, or bash with `shopt -s globstar`)
zstd --train src/**/*.ts src/**/*.py -o code.dict

# Compress with the trained dictionary
zstd -D code.dict -9 packed-output.xml

For repos with a consistent coding style, a trained dictionary can improve compression ratios by 10–30% on small-to-medium files (under 100 KB) that are compressed individually — dictionaries add little on a single large packed file. Best candidates: microservices repos, monorepos with many similarly structured small files, and configuration file collections.

3e. BWT (Burrows-Wheeler Transform) Preprocessing

bzip2 uses BWT + MTF + Huffman. From the Large Text Compression Benchmark (enwik9, 1 GB Wikipedia XML):

  • bzip2 -9: 254 MB (3.9x ratio), 379s, 8 MB memory
  • zstd --ultra -22: 216 MB (4.6x ratio), 701s, 792 MB memory
  • PPMd (7zip, order 10): 179 MB (5.6x ratio), 503s, 1630 MB memory

BWT could theoretically be used as a preprocessing stage before entropy coding, but no production tool currently chains BWT → zstd cleanly.
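The transform itself is simple to sketch. A naive O(n² log n) version with a sentinel, for illustration only — production implementations (including bzip2) use suffix-array-style constructions:

```python
def bwt(s: str) -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s += "\0"  # sentinel marks the original rotation
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(t: str) -> str:
    """Invert by repeatedly prepending t as a column and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    return next(row for row in table if row.endswith("\0"))[:-1]

out = bwt("banana")
print(repr(out))  # similar characters cluster into runs
assert ibwt(out) == "banana"
```

The clustered output is what makes the follow-on move-to-front and entropy-coding stages effective.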

3f. Tree-sitter Grammar-Aware Tokenization

A theoretical pipeline enhancement using both repomix and ast-grep's shared Tree-sitter infrastructure:

  1. Tree-sitter parse → extract AST
  2. AST normalization → canonical variable names, consistent formatting
  3. AST serialization → compact binary or S-expression format
  4. zstd compression → on the normalized output

This theoretical pipeline could achieve better compression than repomix's current approach because AST normalization eliminates cosmetic variation (naming, whitespace, brace style) that wastes entropy. No production tool implements this full pipeline today, but the components exist.
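Step 2 can be sketched with Python's stdlib ast module — a toy canonicalizer that only renames assigned variables, not the full pipeline described above, which no tool implements today:

```python
import ast

class Canonicalize(ast.NodeTransformer):
    """Rename assigned variables to v0, v1, ... in order of first assignment."""
    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        if isinstance(node.ctx, ast.Store):
            new = self.names.setdefault(node.id, f"v{len(self.names)}")
        elif node.id in self.names:
            new = self.names[node.id]
        else:
            return node  # leave builtins and unknowns (e.g. print) alone
        return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

def normalize(src: str) -> str:
    return ast.unparse(Canonicalize().visit(ast.parse(src)))

# Two snippets differing only in cosmetic naming normalize identically:
a = "x = 1\ny = x + 2\nprint(y)\n"
b = "count = 1\ntotal = count + 2\nprint(total)\n"
assert normalize(a) == normalize(b)
print(normalize(a))
```

Once cosmetic variation collapses like this, duplicated logic across a repo becomes byte-identical, which is exactly the redundancy a downstream zstd stage exploits.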

4. Summary: Optimal Pipelines by Use Case

Use Case | Pipeline | Expected Reduction
---------|----------|-------------------
AI chat (fast reload) | repomix --compress → zstd -3 | ~90% tokens, instant decompress
AI deep analysis | repomix (lossless) → zstd -9 | ~70% size, preserves all logic
Cold archival | repomix → PPMd via 7z | ~93–95% size, slow decompress
Polyglot repo (mixed types) | Makefile with per-type strategy | ~90–95% overall
Monorepo (many small files) | zstd dictionary training + repomix | ~92% with trained dict
CI artifact storage | repomix → zstd --fast=3 | ~80% size, minimal CPU cost


Appendix: Practical Repomix Pipeline Choices (2025–2026 Era)

Most people choose zstd (often level 3–9 for balance, or --ultra -19 to -22 when squeezing harder) because the decompression speed advantage is huge when reloading the condensed repo into tools like Continue.dev, Cursor, Aider, or Claude.

PPMd shines in offline archival scenarios (e.g., yearly repo snapshots, cold storage) where you compress once and rarely decompress.

Hybrid tip: repomix --compress → zstd (fast, everyday) for active use; repomix --compress → PPMd (via 7z a -m0=PPMd -mx=9 archive.7z) for maximum squeeze when archiving.

Related Research

  • OTEL Telemetry Data Compression — Compression best practices for OpenTelemetry traces, metrics, and logs; OTLP protocol compression; ClickHouse codecs; collector and storage-level strategies.
  • SQL & KV Data Storage Compression — Compression best practices for SQL databases (PostgreSQL, ClickHouse, MySQL) and KV stores (RocksDB, Redis, Cloudflare KV); per-column codec selection; write amplification tradeoffs.