Prediction by Partial Matching (PPM)

Overview

Prediction by Partial Matching (PPM) is an adaptive statistical data compression technique that predicts the next symbol in a stream based on a set of previous symbols (the "context"). First proposed by Cleary and Witten in 1984, it is highly effective for lossless compression of natural language text.

Core Mechanism

PPM uses three interlocking mechanisms to achieve high compression ratios on sequential data:

Context Modeling: The algorithm uses a Markov model of order n, where n is the number of previous symbols used for prediction.
Blending & Escape Symbols: If a symbol has not been seen in the current n-order context, an escape symbol is emitted, and the algorithm "escapes" to a shorter (n-1) context to attempt a new prediction. This continues until a match is found or it reaches a default order-0 or order-(-1) context.
Arithmetic Coding: The final predicted probabilities are typically processed using an arithmetic coder to produce the compressed output.

Key Variants

Variant	Description
PPMC	A widely cited implementation by Alistair Moffat (1990) that became a standard benchmark for high compression ratios.
PPMd	An optimized version by Dmitry Shkarin used by default in the RAR file format and available in 7-Zip.
PPM*	An unbounded variant that allows contexts of any length.
PPM-Decay	A version that incorporates memory decay, often used in auditory cognition research to model how humans predict patterns.

Applications

File Compression: Used by default in RAR and available in 7z. Also defined as method 98 in the ZIP file format specification, though most ZIP archivers (WinZip, Info-ZIP) default to Deflate.
Assistive Technology: Powers Dasher, a gesture-based text-entry system for users with motor impairments.
Web Prefetching: Used to predict the next webpage a user might visit to reduce loading times.

Relevance to Code Compression: PPMd excels on polyglot codebases because code has strong byte-level patterns (indentation, keyword repetition, naming conventions) that PPM's adaptive context modeling captures better than LZ77-family algorithms. The advantage grows with homogeneous code style. On the Large Text Compression Benchmark (1 GB Wikipedia XML), PPMd achieves 5.6x compression vs zstd's 4.6x — at 15–25% better ratios, but 5–10x slower decompression.

Open-Source PPM Libraries

Several open-source libraries and implementations of PPM are available across different programming languages, ranging from high-performance C versions to research-focused Python and R packages.

High-Performance C/C++ Libraries

Library	Description
7-Zip / p7zip (PPMd)	The most widely used implementation. PPMd (Variant H and I) is part of the open-source 7-Zip LZMA SDK. Highly optimized for both speed and compression ratio.
PPMZ	A specialized file-to-file compressor based on PPM, known for extremely high compression ratios. Slower and more memory-intensive than PPMd.

Python Libraries

Library	Description
PyPPMd	Python binding for the C-based PPMd implementation from p7zip. Supports PPMd Variant H and Variant I. Suitable for high-performance applications.
ppmd-cffi	Python wrapper using CFFI to interface with the p7zip PPMd code.
pyppm	Native Python implementation of PPM. Useful for learning or projects where a pure Python solution is preferred.
Reference Arithmetic Coding (Nayuki)	Clear, educational Python implementation of PPM combined with arithmetic coding.

Research & Data Mining Libraries

Library	Description
SPMF (Java)	Open-source data mining library including a PPM sequence prediction model. Often used for predicting the next symbol in a sequence rather than file compression.
ppm (R Package)	Specialized package for auditory cognition research implementing variants like PPM-Decay, which models how human memory fades over time.

Other Implementations

Implementation	Description
rene-puschinger/ppm (C++)	Straightforward C++ implementation on GitHub designed for benchmarking against other compression tools like ZIP and 7-Zip.
ppm-c (C)	Student-led implementation from the Federal University of Paraiba. Useful for those studying the algorithm from an information theory perspective.