3a. Prediction by Partial Matching (PPMd)
PPM (Prediction by Partial Matching) is an adaptive statistical compression technique that combines finite-context Markov models with arithmetic coding. PPMd (by Dmitry Shkarin) is the most practical implementation, used in RAR and 7-Zip.
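The core idea, order-k context counts with escape fallback to shorter contexts, can be sketched in a few lines of Python. This is a toy illustration, not PPMd: the +1 escape estimate and the order -1 uniform fallback are simplifications, and no arithmetic coder is attached.

```python
from collections import defaultdict

class OrderKModel:
    """Toy PPM-style model: per-context symbol counts with escape
    fallback to shorter contexts. Real PPMd adds an arithmetic coder
    and far more careful escape estimation."""

    def __init__(self, order=2):
        self.order = order
        # counts[(k, context)][symbol] -> frequency
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, text):
        for i, sym in enumerate(text):
            for k in range(self.order + 1):
                if i >= k:
                    self.counts[(k, text[i - k:i])][sym] += 1

    def predict(self, context, sym):
        """Probability estimate for `sym` after `context`, escaping to
        shorter contexts when the symbol is unseen in the current one."""
        p = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            ctx = context[len(context) - k:]
            seen = self.counts[(k, ctx)]
            total = sum(seen.values())
            if total == 0:
                continue
            if seen[sym] > 0:
                return p * seen[sym] / (total + 1)  # +1 reserves escape mass
            p *= 1 / (total + 1)  # take the escape, try a shorter context
        return p / 256  # order -1: uniform over bytes

model = OrderKModel(order=2)
model.update("abracadabra")
# 'a' always follows "br" in the training text, so it gets high probability:
assert model.predict("br", "a") > model.predict("br", "z")
```

The escape mechanism is what lets PPM exploit long contexts when they help without being penalized when they have never been seen, which is exactly why it shines on natural-language text.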
PPMd vs. zstd on text/code (Large Text Compression Benchmark — enwik9, 1 GB Wikipedia XML):
| Compressor | Compressed Size | Ratio | Compress Time | Memory |
|---|---|---|---|---|
| FreeArc PPMd (order 13, 1012 MB) | 175 MB | 5.7x | 1175s | 1046 MB |
| 7zip PPMd (order 10, 1630 MB) | 179 MB | 5.6x | 503s | 1630 MB |
| zstd --ultra -22 | 216 MB | 4.6x | 701s | 792 MB |
| xz -9 -e (LZMA2) | 248 MB | 4.0x | 2310s | 690 MB |
| bzip2 -9 | 254 MB | 3.9x | 379s | 8 MB |
| gzip -9 | 323 MB | 3.1x | 101s | 1.6 MB |
Key takeaway: PPMd achieves 15–25% better compression than zstd's maximum on text-heavy content, but at the cost of 5–10x slower decompression, higher memory requirements (1–2 GB vs. <1 GB), and no streaming/dictionary support.
Integration with repomix pipeline:
```sh
# Active use (fast decompress for AI tools):
repomix --compress -o repo.xml
zstd -9 repo.xml                 # writes repo.xml.zst

# Archival (maximum squeeze):
repomix --compress -o repo.xml
7z a -m0=PPMd -mx=9 -mmem=1024m repo.7z repo.xml
```
3b. ast-grep (Structural Code Search & Rewriting)
ast-grep (12.6k GitHub stars, MIT license) is a Rust CLI tool for AST-based code structural search, lint, and rewriting. It uses Tree-sitter — the same parser repomix uses for --compress.
1. Pre-compression normalization
```sh
# Normalize function declarations to arrow-function style
# ($$$ARGS is ast-grep's multi-node metavariable; $ARGS matches only one node)
sg --pattern 'function $NAME($$$ARGS) { return $EXPR }' \
   --rewrite 'const $NAME = ($$$ARGS) => $EXPR' \
   --lang ts --update-all
```
2. Selective extraction
```sh
# Extract only exported API surfaces
# (paths are positional arguments; -r is short for --rewrite in ast-grep)
sg --pattern 'export $$$' --lang ts . > api-surface.txt
```
3. Full pipeline
```sh
# Normalize → strip dead code → pack → compress
sg --pattern 'console.log($$$)' --rewrite '' --lang ts --update-all src/
repomix --compress --remove-comments -o packed.xml src/
zstd -9 packed.xml
```
ast-grep supports 20+ languages via Tree-sitter grammars — matching repomix's polyglot capability but with surgical precision.
3c. Custom Makefiles for Polyglot Pipelines
For repos with mixed languages and file types, a Makefile can orchestrate per-filetype optimal compression:
```make
SRCDIR := src
OUTDIR := .condense
ZSTD_LEVEL := 9

CODE_FILES := $(shell find $(SRCDIR) -name '*.ts' -o -name '*.py' -o -name '*.rs' -o -name '*.go' -o -name '*.java')
CONFIG_FILES := $(shell find $(SRCDIR) -name '*.json' -o -name '*.yaml' -o -name '*.toml' -o -name '*.xml')
TEXT_FILES := $(shell find $(SRCDIR) -name '*.md' -o -name '*.txt' -o -name '*.rst')
IMAGE_FILES := $(shell find $(SRCDIR) -name '*.png' -o -name '*.jpg' -o -name '*.svg')

$(OUTDIR):
	mkdir -p $(OUTDIR)

# Code: semantic compress via repomix
$(OUTDIR)/code.xml.zst: $(CODE_FILES) | $(OUTDIR)
	repomix --compress --remove-comments --include "**/*.{ts,py,rs,go,java}" \
		-o $(OUTDIR)/code.xml
	zstd -$(ZSTD_LEVEL) --rm $(OUTDIR)/code.xml

# Config: lossless pack (structure matters)
$(OUTDIR)/config.xml.zst: $(CONFIG_FILES) | $(OUTDIR)
	repomix --remove-empty-lines --include "**/*.{json,yaml,toml,xml}" \
		-o $(OUTDIR)/config.xml
	zstd -$(ZSTD_LEVEL) --rm $(OUTDIR)/config.xml

# Text: PPMd for maximum ratio (compress once, read rarely)
$(OUTDIR)/docs.7z: $(TEXT_FILES) | $(OUTDIR)
	repomix --include "**/*.{md,txt,rst}" -o $(OUTDIR)/docs.xml
	7z a -m0=PPMd -mx=9 $(OUTDIR)/docs.7z $(OUTDIR)/docs.xml
	rm $(OUTDIR)/docs.xml

# Images: already compressed, just archive
$(OUTDIR)/assets.tar.zst: $(IMAGE_FILES) | $(OUTDIR)
	tar cf - $(IMAGE_FILES) | zstd -1 > $(OUTDIR)/assets.tar.zst

condense: $(OUTDIR)/code.xml.zst $(OUTDIR)/config.xml.zst $(OUTDIR)/docs.7z $(OUTDIR)/assets.tar.zst

clean:
	rm -rf $(OUTDIR)

.PHONY: condense clean
```
Why per-type strategies matter:
| File Type | Best Strategy | Reason |
|---|---|---|
| Code (.ts, .py, .rs) | repomix --compress + zstd | AST extraction removes 70% before statistical compression |
| Config (.json, .yaml) | repomix lossless + zstd | Structure must be preserved; high redundancy benefits zstd |
| Text (.md, .txt) | repomix + PPMd | Natural language is PPMd's sweet spot (15–25% better than zstd) |
| Images (.png, .jpg) | tar + zstd -1 | Already compressed; fast archive only |
| SVG | repomix lossless + zstd -19 | XML text; compresses very well |
3d. Dictionary Training for Polyglot Repos
```sh
# Train a dictionary on your codebase's file samples
# (** globbing requires e.g. bash with `shopt -s globstar`)
zstd --train src/**/*.ts src/**/*.py -o code.dict

# Compress with the trained dictionary
zstd -D code.dict -9 packed-output.xml

# Decompression needs the same dictionary:
zstd -D code.dict -d packed-output.xml.zst
```
For repos with a consistent coding style, a trained dictionary can improve compression ratios by 10–30% on small-to-medium files (under 100 KB). It works best for microservices repos, monorepos with many similarly structured small files, and collections of configuration files.
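The effect is easy to demonstrate with Python's standard-library zlib, which supports preset dictionaries via `zdict`. It stands in here for zstd's trained dictionaries (the mechanism, priming the compressor with shared byte patterns, is the same), and the JSON samples are invented for illustration:

```python
import zlib

# Simulated "repo files" sharing boilerplate, like similarly structured configs.
files = [
    b'{"name": "svc-%d", "port": 8080, "logLevel": "info", "retries": 3}' % i
    for i in range(20)
]

# A crude "dictionary": one representative sample. zstd --train builds a
# smarter one from many samples, but the principle is identical.
dictionary = files[0]

def size_without_dict(data):
    return len(zlib.compress(data, 9))

def size_with_dict(data):
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, zdict=dictionary)
    return len(c.compress(data) + c.flush())

plain = sum(size_without_dict(f) for f in files[1:])
primed = sum(size_with_dict(f) for f in files[1:])
assert primed < plain  # shared structure is referenced, not re-encoded
```

Small files compress poorly on their own because the compressor has no history to reference; a preset dictionary supplies that history up front, which is exactly the regime the 10–30% figure describes.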
3e. BWT (Burrows-Wheeler Transform) Preprocessing
bzip2 uses BWT + MTF + Huffman. From the Large Text Compression Benchmark (enwik9, 1 GB Wikipedia XML):
- bzip2 -9: 254 MB (3.9x ratio), 379s, 8 MB memory
- zstd --ultra -22: 216 MB (4.6x ratio), 701s, 792 MB memory
- PPMd (7zip, order 10): 179 MB (5.6x ratio), 503s, 1630 MB memory
BWT could theoretically be used as a preprocessing stage before entropy coding, but no production tool currently chains BWT → zstd cleanly.
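The transform is simple enough to sketch. The toy Python below (naive rotation sort; production codecs use suffix arrays) shows how BWT clusters symbols so that a move-to-front pass produces the runs of small values an entropy coder exploits:

```python
def bwt(data: bytes) -> bytes:
    """Naive BWT: sort all rotations, take the last column.
    O(n^2 log n); bzip2 and friends use suffix arrays instead."""
    assert 0 not in data          # byte 0 serves as the end sentinel
    s = data + b"\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)

def mtf(data: bytes) -> bytes:
    """Move-to-front: runs of equal symbols become runs of zeros."""
    table = list(range(256))
    out = bytearray()
    for b in data:
        i = table.index(b)
        out.append(i)
        table.insert(0, table.pop(i))
    return bytes(out)

text = b"banana_bandana_banana"
transformed = mtf(bwt(text))
# BWT groups symbols with similar following context, so MTF emits many
# zeros, while MTF on the raw text (no adjacent repeats) emits none:
assert transformed.count(0) > mtf(text).count(0)
```

This is the sense in which BWT is "preprocessing": it changes no information content, only the order, turning contextual redundancy into local runs that a simple back-end can encode cheaply.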
3f. Tree-sitter Grammar-Aware Tokenization
A theoretical pipeline enhancement using both repomix and ast-grep's shared Tree-sitter infrastructure:
- Tree-sitter parse → extract AST
- AST normalization → canonical variable names, consistent formatting
- AST serialization → compact binary or S-expression format
- zstd compression → on the normalized output
This theoretical pipeline could achieve better compression than repomix's current approach because AST normalization eliminates cosmetic variation (naming, whitespace, brace style) that wastes entropy. No production tool implements this full pipeline today, but the components exist.
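Since no production tool implements the full pipeline, the sketch below illustrates only the payoff of step 2. A regex-based renamer stands in for a real Tree-sitter normalizer and zlib stands in for zstd; both substitutions, and the TypeScript snippets themselves, are purely illustrative:

```python
import re
import zlib

def normalize_identifiers(source: str) -> str:
    """Toy stand-in for AST normalization: rename each distinct
    identifier to v0, v1, ... in order of first appearance. A real
    implementation would walk a Tree-sitter parse to avoid touching
    keywords, property names, strings, and comments."""
    keywords = {"const", "return", "function", "let", "if", "else"}
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name in keywords:
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]
    return re.sub(r"[A-Za-z_]\w*", rename, source)

# Two functions that differ only in naming style:
snippets = [
    "function addTotals(orderList) { return orderList.length }",
    "function sum_items(item_arr) { return item_arr.length }",
]
raw = "\n".join(snippets)
normalized = "\n".join(normalize_identifiers(s) for s in snippets)
# After normalization the two snippets are byte-identical, so the
# compressor encodes the second as a single back-reference:
assert len(zlib.compress(normalized.encode(), 9)) < len(zlib.compress(raw.encode(), 9))
```

The gain comes precisely from eliminating cosmetic variation: once naming is canonicalized, structurally identical code becomes literally identical bytes, which any LZ-family compressor deduplicates for free.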