RFC 8478

Zstandard Compression and the application/zstd Media Type
IETF RFC (Informational) | Authors: Y. Collet & M. Kucherawy, Ed. (Facebook) | Published: October 2018 | ISSN: 2070-1721

Abstract

Zstandard, or "zstd" (pronounced "zee standard"), is a data compression mechanism. This document describes the mechanism and registers a media type and content encoding to be used when transporting zstd-compressed content via Multipurpose Internet Mail Extensions (MIME).

Despite use of the word "standard" as part of its name, readers are advised that this document is not an Internet Standards Track specification; it is being published for informational purposes only.

Status of This Memo: This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has been approved for publication by the Internet Engineering Steering Group (IESG). Information about the current status of this document, errata, and feedback: https://www.rfc-editor.org/info/rfc8478

1. Introduction

Zstandard, or "zstd" (pronounced "zee standard"), is a data compression mechanism, akin to gzip [RFC1952].

This document describes the Zstandard format. Also, to enable the transport of a data object compressed with Zstandard, this document registers a media type that can be used to identify such content when it is used in a payload encoded using Multipurpose Internet Mail Extensions (MIME).

2. Definitions

Some terms used elsewhere in this document are defined here for clarity.

uncompressed
Describes an arbitrary set of bytes in their original form, prior to being subjected to compression.
compress, compression
The act of processing a set of bytes via the compression mechanism described here.
compressed
Describes the result of passing a set of bytes through this mechanism. The original input has thus been compressed.
decompress, decompression
The act of processing a set of bytes through the inverse of the compression mechanism described here, in an attempt to recover the original set of bytes prior to compression.
decompressed
Describes the result of passing a set of bytes through the reverse of this mechanism. When successful, the decompressed payload and the uncompressed payload are indistinguishable.
encode
The process of translating data from one form to another; this may include compression or it may refer to other translations done as part of this specification.
decode
The reverse of "encode"; describes a process of reversing a prior encoding to recover the original content.
frame
Content compressed by Zstandard is transformed into a Zstandard frame. Multiple frames can be appended into a single file or stream. A frame is completely independent, has a defined beginning and end, and has a set of parameters that tells the decoder how to decompress it.
block
A frame encapsulates one or multiple blocks. Each block contains arbitrary content, described by its header, and has a guaranteed maximum content size depending on frame parameters. Unlike frames, each block depends on previous blocks for proper decoding. However, each block can be decompressed without waiting for its successor, allowing streaming operations.
natural order
A sequence or ordering of objects or values that is typical of that type. A set of unique integers is in "natural order" if when progressing from one element to the next, there is never a decrease in value.

The naming convention for identifiers within the specification is Mixed_Case_With_Underscores. Identifiers inside square brackets indicate that the identifier is optional in the presented context.

3. Compression Algorithm

This section describes the Zstandard algorithm.

The purpose of this document is to define a lossless compressed data format that is a) independent of the CPU type, operating system, file system, and character set and b) suitable for file compression and pipe and streaming compression, using the Zstandard algorithm.

The format uses the Zstandard compression method, and an optional xxHash-64 checksum method, for detection of data corruption. The data format does not attempt to allow random access to compressed data.

Compliance requirements:

  • A compliant compressor must produce data sets that conform to the specifications presented here. It does not need to support all options.
  • A compliant decompressor must be able to decompress at least one working set of parameters that conforms to the specifications. It may also ignore informative fields such as the checksum. Whenever it does not support a parameter, it must produce a non-ambiguous error code and associated error message.

3.1. Frames

Zstandard compressed data is made up of one or more frames. Each frame is independent and can be decompressed independently of other frames. The decompressed content of multiple concatenated frames is the concatenation of each frame's decompressed content.

There are two frame formats defined for Zstandard: Zstandard frames (contain compressed data) and skippable frames (contain custom user metadata).

3.1.1. Zstandard Frames

The structure of a single Zstandard frame:

+--------------------+------------+
|    Magic_Number    | 4 bytes    |
+--------------------+------------+
|    Frame_Header    | 2-14 bytes |
+--------------------+------------+
|     Data_Block     | n bytes    |
+--------------------+------------+
| [More Data_Blocks] |            |
+--------------------+------------+
| [Content_Checksum] | 0-4 bytes  |
+--------------------+------------+
  • Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528. Selected to be less probable at the beginning of an arbitrary file — avoids trivial patterns, contains byte values outside ASCII range, and doesn't map into UTF-8 space.
  • Frame_Header: 2 to 14 bytes. Detailed in Section 3.1.1.1.
  • Data_Block: This is where data appears. Detailed in Section 3.1.1.2.
  • Content_Checksum: Optional 32-bit checksum, present only if Content_Checksum_Flag is set. It is the low 4 bytes of the XXH64() hash of the original (decoded) data, computed with a seed of zero and stored in little-endian format.
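
Because Magic_Number is stored little-endian, the bytes on the wire appear in the reverse of the order in which the hexadecimal value is written. The following is a minimal sketch of a magic-number check; the function name has_zstd_magic is hypothetical and is not part of any Zstandard library.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # value given in the frame layout above


def has_zstd_magic(data: bytes) -> bool:
    """Return True if data begins with the Zstandard frame magic number."""
    if len(data) < 4:
        return False
    # "<I" reads an unsigned 32-bit integer in little-endian byte order.
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == ZSTD_MAGIC
```

A Zstandard frame therefore starts with the byte sequence 28 B5 2F FD.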

3.1.1.1. Frame Header

The frame header has a variable size, with a minimum of 2 bytes and up to 14 bytes depending on optional parameters:

+-------------------------+-----------+
| Frame_Header_Descriptor | 1 byte    |
+-------------------------+-----------+
|   [Window_Descriptor]   | 0-1 byte  |
+-------------------------+-----------+
|     [Dictionary_ID]     | 0-4 bytes |
+-------------------------+-----------+
|  [Frame_Content_Size]   | 0-8 bytes |
+-------------------------+-----------+

Frame_Header_Descriptor

A single byte encoding multiple flags. Key fields:

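+--------+-------------------------+
| Bit(s) | Field                   |
+--------+-------------------------+
| 7-6    | Frame_Content_Size_Flag |
+--------+-------------------------+
| 5      | Single_Segment_Flag     |
+--------+-------------------------+
| 4      | Unused_Bit              |
+--------+-------------------------+
| 3      | Reserved_Bit            |
+--------+-------------------------+
| 2      | Content_Checksum_Flag   |
+--------+-------------------------+
| 1-0    | Dictionary_ID_Flag      |
+--------+-------------------------+
  • Frame_Content_Size_Flag: Selects the size of the Frame_Content_Size field. Flag values 1, 2, and 3 select 2, 4, and 8 bytes. Flag value 0 selects 1 byte when Single_Segment_Flag is set and 0 bytes (field absent) otherwise.
  • Single_Segment_Flag: If set, Window_Descriptor is omitted, Frame_Content_Size must be present, and the data must be regenerated within a single contiguous memory segment.
  • Unused_Bit: An encoder must set this bit to zero; a decoder must not interpret it.
  • Reserved_Bit: Reserved for future use. Its value must be zero, and a decoder must check that it is not set.
  • Content_Checksum_Flag: If set, a 4-byte content checksum is appended after the last block.
  • Dictionary_ID_Flag: Selects the size of the Dictionary_ID field. Flag values 0, 1, 2, and 3 select 0, 1, 2, and 4 bytes.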
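
The bit layout can be unpacked with shifts and masks. The sketch below assumes the field-size mapping of RFC 8478 Section 3.1.1.1.1: Frame_Content_Size_Flag values 1, 2, 3 select 2, 4, 8 bytes, and value 0 selects 1 byte if Single_Segment_Flag is set, else 0. The names FrameHeaderDescriptor and parse_frame_header_descriptor are hypothetical.

```python
from typing import NamedTuple


class FrameHeaderDescriptor(NamedTuple):
    single_segment: bool
    content_checksum: bool
    fcs_field_size: int          # bytes used by Frame_Content_Size
    dict_id_field_size: int      # bytes used by Dictionary_ID
    has_window_descriptor: bool


def parse_frame_header_descriptor(byte: int) -> FrameHeaderDescriptor:
    """Decode the one-byte Frame_Header_Descriptor."""
    fcs_flag = (byte >> 6) & 0x3
    single_segment = bool((byte >> 5) & 0x1)
    reserved_bit = (byte >> 3) & 0x1
    content_checksum = bool((byte >> 2) & 0x1)
    dict_id_flag = byte & 0x3

    if reserved_bit:
        raise ValueError("Reserved_Bit is set; frame is not decodable")

    if fcs_flag == 0:
        # Flag 0 means "1 byte" only for single-segment frames.
        fcs_field_size = 1 if single_segment else 0
    else:
        fcs_field_size = 1 << fcs_flag  # 1 -> 2, 2 -> 4, 3 -> 8

    dict_id_field_size = (0, 1, 2, 4)[dict_id_flag]
    return FrameHeaderDescriptor(
        single_segment=single_segment,
        content_checksum=content_checksum,
        fcs_field_size=fcs_field_size,
        dict_id_field_size=dict_id_field_size,
        has_window_descriptor=not single_segment,
    )
```

The total frame header size is then 1 byte plus 1 for the Window_Descriptor (if present) plus the two field sizes, which gives the 2 to 14 byte range stated above.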

Window_Descriptor

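Present only when Single_Segment_Flag is not set. Encodes Window_Size, the minimum memory buffer the decoder needs, as a 5-bit Exponent (bits 7–3) selecting a power of two and a 3-bit Mantissa (bits 2–0) for finer-grained control. The smallest encodable Window_Size is 1 KB and the largest is 3.75 TB. For interoperability, decoders are recommended to support Window_Size values up to 8 MB, and encoders are recommended not to generate frames that require more than 8 MB.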
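
The Window_Size computation can be sketched as follows, assuming the formula of RFC 8478 Section 3.1.1.1.2 (Exponent in the top 5 bits, Mantissa in the low 3 bits, base window of 2^(10 + Exponent) extended in eighths). The function name is hypothetical.

```python
def window_size_from_descriptor(descriptor: int) -> int:
    """Compute Window_Size in bytes from the one-byte Window_Descriptor."""
    exponent = descriptor >> 3        # bits 7-3
    mantissa = descriptor & 0x7       # bits 2-0
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add
```

A descriptor of 0x00 yields the 1 KB minimum, and 0x68 (Exponent 13, Mantissa 0) yields exactly 8 MB.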

Dictionary_ID

Optional field. When present, it indicates the ID of the dictionary that was used during compression. The decompressor must have this dictionary loaded before it can decompress the frame. For frames and dictionaries distributed in public, IDs in the low range (<= 32767) and the high range (>= 2^31) are reserved for dictionaries registered with IANA; any other value, that is, the range [32768, 2^31 - 1], can be used by private arrangement between participants.

Frame_Content_Size

Optional field indicating the original (decompressed) content size. This allows the decompressor to pre-allocate an output buffer of the right size. When present, the field is 1, 2, 4, or 8 bytes depending on Frame_Content_Size_Flag, stored in little-endian format. When the field is 2 bytes, 256 is added to the stored value.
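
A short sketch of decoding this field, assuming the rule from RFC 8478 Section 3.1.1.1.4 that a 2-byte field carries an offset of 256 (so it covers 256 to 65791). The function name is hypothetical.

```python
def read_frame_content_size(field: bytes) -> int:
    """Decode a Frame_Content_Size field of 1, 2, 4, or 8 bytes."""
    if len(field) not in (1, 2, 4, 8):
        raise ValueError("Frame_Content_Size must be 1, 2, 4, or 8 bytes")
    value = int.from_bytes(field, "little")
    if len(field) == 2:
        # The 2-byte form stores (size - 256).
        value += 256
    return value
```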

3.1.1.2. Blocks

Each block consists of a 3-byte header followed by block content:

+---------------+---------+
| Block_Header  | 3 bytes |
+---------------+---------+
| Block_Content | n bytes |
+---------------+---------+

Block_Header

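A 3-byte little-endian value encoding three fields:

+--------+------------+------------------------------------------------------------------+
| Bit(s) | Field      | Description                                                      |
+--------+------------+------------------------------------------------------------------+
| 0      | Last_Block | If set, this is the last block in the frame                      |
+--------+------------+------------------------------------------------------------------+
| 2-1    | Block_Type | 0 = Raw_Block, 1 = RLE_Block, 2 = Compressed_Block, 3 = Reserved |
+--------+------------+------------------------------------------------------------------+
| 23-3   | Block_Size | 21-bit size field (see below)                                    |
+--------+------------+------------------------------------------------------------------+

For Raw_Block and Compressed_Block, Block_Size is the size of Block_Content in bytes. For RLE_Block, Block_Content is always 1 byte and Block_Size is the number of times that byte is repeated.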

Block Types

  • Raw_Block (type 0): Block content is raw data, stored as-is. Used when data cannot be compressed effectively.
  • RLE_Block (type 1): Block content is a single byte, repeated Block_Size times in the output.
  • Compressed_Block (type 2): Block content is compressed. Detailed in Section 3.1.1.3.
  • Reserved (type 3): This value is reserved.
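
A minimal sketch of Block_Header parsing follows. It assumes the layout given in RFC 8478 Section 3.1.1.2.1 (a 3-byte little-endian value with Last_Block in bit 0, Block_Type in bits 2-1, and Block_Size in bits 23-3) and treats the reserved type as corrupt input. The function and constant names are hypothetical.

```python
BLOCK_TYPE_NAMES = ("Raw_Block", "RLE_Block", "Compressed_Block")


def parse_block_header(header: bytes):
    """Return (last_block, block_type_name, block_size) from a 3-byte header."""
    if len(header) != 3:
        raise ValueError("Block_Header is exactly 3 bytes")
    value = int.from_bytes(header, "little")
    last_block = bool(value & 0x1)
    block_type = (value >> 1) & 0x3
    block_size = value >> 3
    if block_type == 3:
        raise ValueError("reserved Block_Type: data is corrupt")
    return last_block, BLOCK_TYPE_NAMES[block_type], block_size


def block_content_length(block_type_name: str, block_size: int) -> int:
    """Number of bytes that follow the header in the compressed stream."""
    # An RLE block stores one byte, repeated block_size times on output.
    return 1 if block_type_name == "RLE_Block" else block_size
```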

The maximum content size of a block, Block_Maximum_Size, is the lesser of Window_Size and 128 KB. For single-segment frames there is no Window_Descriptor, and Window_Size is taken to be the declared Frame_Content_Size, so the same limit applies.

3.1.1.3. Compressed Blocks

A compressed block consists of a Literals_Section followed by a Sequences_Section. This mirrors the LZ77 compression scheme: the literals section stores literal bytes (content not matched from earlier in the data), and the sequences section stores (offset, match_length, literal_length) tuples that reference previously decoded content.

Literals Section

The literals section header encodes the literals block type and associated size information:

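+---------------------------+-------+--------------------------------------------------------------------------+
| Literals_Block_Type       | Value | Description                                                              |
+---------------------------+-------+--------------------------------------------------------------------------+
| Raw_Literals_Block        | 0     | Literals are stored uncompressed                                         |
+---------------------------+-------+--------------------------------------------------------------------------+
| RLE_Literals_Block        | 1     | A single byte repeated N times                                           |
+---------------------------+-------+--------------------------------------------------------------------------+
| Compressed_Literals_Block | 2     | Literals are Huffman-compressed (see Section 4.2)                        |
+---------------------------+-------+--------------------------------------------------------------------------+
| Treeless_Literals_Block   | 3     | Huffman-compressed using a table from a previous block in the same frame |
+---------------------------+-------+--------------------------------------------------------------------------+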

Sequences Section

The sequences section encodes a series of commands, each consisting of:

  • Literal_Length: Number of literal bytes to copy from the literals section
  • Match_Length: Number of bytes to copy from a previous position in the decoded output
  • Offset: Distance back in the decoded output where the match begins

The sequences header specifies the number of sequences and the compression modes for the three FSE tables (Literals_Lengths_Table, Match_Lengths_Table, Offsets_Table). Four compression modes are available:

  • Predefined_Mode (0): Use a predefined FSE table (specified in Appendix A)
  • RLE_Mode (1): A single symbol with probability 1.0; stored as a single byte
  • FSE_Compressed_Mode (2): A new FSE table is provided; used for the current block and subsequent blocks with Repeat_Mode
  • Repeat_Mode (3): Reuse the FSE table from the previous block in the same frame

3.1.1.4. Sequence Execution

To decode a compressed block, a decoder:

  1. Decodes the literals section to obtain the literals buffer
  2. Reads and decodes the sequences section using the FSE bitstreams
  3. Executes each sequence: copies Literal_Length bytes from the literals buffer, then copies Match_Length bytes from offset output_position - Offset in the output buffer
  4. After all sequences, copies any remaining literals from the literals buffer

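The sequences bitstream is read backward: the FSE decoder reads bits from the end of the stream towards the beginning. Because the encoder writes the stream in the opposite direction, the decoder still recovers the sequences in the forward order in which they are executed to produce the decompressed output.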
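
Steps 3 and 4 can be sketched as below, assuming the sequences have already been decoded and their offsets already resolved to actual distances. The tuple order (offset, match_length, literal_length) follows the description in Section 3.1.1.3; the function name and the optional history argument (previously decoded output that matches may reach into) are hypothetical.

```python
def execute_sequences(literals: bytes, sequences, history: bytes = b"") -> bytes:
    """Rebuild block output from a literals buffer and decoded sequences."""
    out = bytearray(history)
    lit_pos = 0
    for offset, match_length, literal_length in sequences:
        if lit_pos + literal_length > len(literals):
            raise ValueError("sequence reads past the end of the literals")
        out += literals[lit_pos:lit_pos + literal_length]
        lit_pos += literal_length

        if offset < 1 or offset > len(out):
            raise ValueError("offset points before the start of the output")
        start = len(out) - offset
        # Copy byte by byte: a match may overlap the bytes it is producing.
        for i in range(match_length):
            out.append(out[start + i])

    # Literals left over after the last sequence are appended as-is.
    out += literals[lit_pos:]
    return bytes(out[len(history):])
```

The byte-by-byte copy matters when Offset is smaller than Match_Length: a sequence with offset 3 and match length 6 after the literals "abc" produces "abcabcabc".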

3.1.1.5. Repeat Offsets

Zstandard maintains a table of three recent offsets to improve compression of data with repetitive reference patterns. The three most recently used offsets are stored as offset_1, offset_2, and offset_3.

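When an offset value of 1, 2, or 3 appears in the sequences bitstream, it does not mean a literal offset of 1, 2, or 3 bytes. Instead, it refers to the first, second, or third most recently used offset; any larger value encodes an actual offset equal to the value minus 3. There is one exception: when the sequence's Literal_Length is 0, the repeat codes are shifted by one, so 1 means offset_2, 2 means offset_3, and 3 means offset_1 minus 1 byte. This "repeat offset" mechanism improves compression when the same offset recurs frequently, which is common in structured data.

At the start of a frame the table holds the values 1, 4, and 8. It is updated after each sequence: a new offset becomes offset_1 and pushes the previous entries down one position, dropping the old offset_3; a repeat of offset_1 leaves the table unchanged; any other repeat code places the resulting offset at the front and shifts the entries that were ahead of it down by one.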
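
The resolution and update rules can be sketched as follows. This assumes the behavior described in RFC 8478 Section 3.1.1.5: initial history 1, 4, 8; offset values above 3 encode the actual offset plus 3; repeat codes are shifted by one when Literal_Length is 0. The function name and the list-based history are hypothetical.

```python
def resolve_offset(offset_value: int, literal_length: int, history: list) -> int:
    """Turn an Offset_Value into an actual offset, updating history in place.

    history is [offset_1, offset_2, offset_3], most recent first.
    """
    if offset_value > 3:
        offset = offset_value - 3
        history[:] = [offset, history[0], history[1]]
        return offset

    index = offset_value - 1          # 0, 1, or 2
    if literal_length == 0:
        index += 1                    # repeat codes shift by one
    if index == 0:
        return history[0]             # most recent offset: table unchanged

    if index == 3:
        offset = history[0] - 1       # special case: offset_1 minus 1 byte
    else:
        offset = history[index]

    if index == 1:
        history[:] = [offset, history[0], history[2]]
    else:
        history[:] = [offset, history[0], history[1]]
    return offset
```

For example, starting from [1, 4, 8], an Offset_Value of 10 yields offset 7 and the history [7, 1, 4]; a following Offset_Value of 2 with nonzero Literal_Length yields offset 1 and the history [1, 7, 4].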

3.1.2. Skippable Frames

A skippable frame allows custom user metadata to be stored alongside compressed data without affecting decompression. Any decompressor that encounters a skippable frame must skip over it without error.

+--------------------+-----------+
|  Magic_Number      | 4 bytes   |
+--------------------+-----------+
|  Frame_Size        | 4 bytes   |
+--------------------+-----------+
|  User_Data         | n bytes   |
+--------------------+-----------+

The magic number for skippable frames is any value in the range [0x184D2A50, 0x184D2A5F] (16 possible values), giving users flexibility to tag different kinds of metadata. Frame_Size is a 4-byte little-endian value indicating the size of User_Data in bytes.

Use cases include embedding seek tables, index structures, or application-specific metadata alongside compressed data without requiring a separate file.
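
A decoder that walks a stream of concatenated frames needs only the two fixed fields to step over a skippable frame. The following sketch assumes the layout above; the function name is hypothetical.

```python
import struct

SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def skip_skippable_frame(data: bytes, pos: int = 0) -> int:
    """Return the position just past the skippable frame starting at pos."""
    if len(data) - pos < 8:
        raise ValueError("truncated skippable frame header")
    # Both Magic_Number and Frame_Size are 4-byte little-endian values.
    magic, frame_size = struct.unpack_from("<II", data, pos)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        raise ValueError("not a skippable frame")
    end = pos + 8 + frame_size
    if end > len(data):
        raise ValueError("truncated User_Data")
    return end
```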

4. Entropy Encoding

Zstandard uses two entropy encoding schemes: FSE (Finite State Entropy) for sequences and Huffman coding for literals.

4.1. FSE (Finite State Entropy)

FSE is a fast and efficient entropy coder developed by Yann Collet (the same author as Zstandard). It is based on Asymmetric Numeral Systems (ANS), a family of entropy codes introduced by Jarek Duda. It provides near-optimal entropy coding at significantly higher speed than arithmetic coding.

How FSE Works

FSE encodes a sequence of symbols using a state machine with a table of size 2^Accuracy_Log entries. Each state encodes a symbol and a transition to the next state. The state is maintained as an integer, and symbols are packed/unpacked by writing/reading bits to/from the bitstream.

4.1.1. FSE Table Description

The FSE table describes the probability distribution of symbols. It encodes normalized symbol frequencies using a compact bitstream format. For each symbol with nonzero probability, the table provides:

  • Number_of_States: How many states are assigned to this symbol; proportional to its probability
  • Baseline: The minimum state value for this symbol's transitions
  • Nb_Bits: How many bits to read when transitioning out of a state for this symbol

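The Accuracy_Log parameter controls the table size and precision. Its minimum value is 5. Its maximum is 9 for the literal lengths and match lengths tables, 8 for the offsets table, and 6 for the table used to compress Huffman weights. Higher accuracy logs allow finer probability distinctions but increase table memory.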

Decoding process:

  1. Initialize the FSE state by reading Accuracy_Log bits from the bitstream
  2. Read the current symbol from the table at position state
  3. Read Nb_Bits[state] bits from the stream to get low_bits
  4. Update state: state = Baseline[state] + low_bits
  5. Repeat from step 2 until all symbols are decoded
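
The decoding loop can be illustrated with a toy table. Everything in this sketch is an assumption made for illustration: the 4-state table (Accuracy_Log = 2) is invented rather than taken from the format, the number of symbols is passed in explicitly, and bits are consumed front to back, whereas real Zstandard bitstreams are read backward.

```python
# Toy decoding table: "A" has probability 2/4, "B" and "C" have 1/4 each.
ACCURACY_LOG = 2
SYMBOL = ["A", "A", "B", "C"]
NB_BITS = [1, 1, 2, 2]
BASELINE = [0, 2, 0, 0]


class BitReader:
    """Reads integers from a list of 0/1 values, most significant bit first."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.pos = 0

    def read(self, count: int) -> int:
        if self.pos + count > len(self.bits):
            raise ValueError("bitstream exhausted")
        value = 0
        for bit in self.bits[self.pos:self.pos + count]:
            value = (value << 1) | bit
        self.pos += count
        return value


def fse_decode(bits, symbol_count: int) -> str:
    reader = BitReader(bits)
    state = reader.read(ACCURACY_LOG)            # step 1
    out = []
    for i in range(symbol_count):
        out.append(SYMBOL[state])                # step 2
        if i + 1 < symbol_count:
            low_bits = reader.read(NB_BITS[state])   # step 3
            state = BASELINE[state] + low_bits       # step 4
    return "".join(out)
```

With the bits 1 0 1 1 0 1 0 and four symbols, the states visited are 2, 3, 1, 2, so the output is "BCAB". Note how the more probable symbol "A" consumes only one bit per transition.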

4.2. Huffman Coding

Huffman coding is used to compress the literals section. Zstandard's Huffman implementation supports up to 256 symbols (one per byte value) and prefix codes of up to 11 bits.

4.2.1. Huffman Tree Description

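The Huffman tree is described by a weight for each symbol. A weight is not itself a code length: a larger weight marks a more frequent symbol, which receives a shorter code, and a weight of 0 marks a symbol that is not present. Weights are listed in natural order for every symbol up to the last present one. The weight of that last symbol is not transmitted; the decoder deduces it.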

4.2.1.1. Huffman Tree Header

The header byte determines how the tree weights are encoded:

  • If the header byte value is >= 128: the weights for all symbols follow in direct 4-bit-per-symbol encoding. The number of symbols is header_byte - 127.
  • If the header byte value is < 128: the header byte encodes the compressed size of an FSE-encoded weight table. The weights are encoded using FSE compression.

4.2.1.2. FSE Compression of Huffman Weights

When weights are FSE-compressed, they use a specialized FSE table with Accuracy_Log in the range [5, 6]. The description of that table appears at the beginning of the compressed weights data, and the bitstream that follows is decoded with two interleaved FSE states that share the table.

4.2.1.3. Conversion from Weights to Huffman Prefix Codes

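Given the weight array Weight[symbol], the code length of each symbol with a nonzero weight is Number_of_Bits = Max_Number_of_Bits + 1 - Weight, so higher weights yield shorter codes. Max_Number_of_Bits is the base-2 logarithm of the sum of 2^(Weight - 1) over all nonzero weights, including the deduced last weight; that sum must be a power of two. Prefix codes are then assigned by sorting symbols by weight and, within the same weight, by natural order. Symbols with Weight = 0 are not used and have no code.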
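
The weight arithmetic can be sketched as below. It assumes the rules of RFC 8478 Section 4.2.1: the last weight is chosen so that the sum of 2^(Weight - 1) reaches the next power of two, and Number_of_Bits = Max_Number_of_Bits + 1 - Weight for nonzero weights. The function names are hypothetical, and assignment of the actual prefix codes is not shown.

```python
def complete_weights(transmitted):
    """Deduce the final weight; return (all_weights, max_number_of_bits)."""
    total = sum(1 << (w - 1) for w in transmitted if w > 0)
    if total == 0:
        raise ValueError("at least one transmitted weight must be nonzero")
    # The next power of two strictly above total is 1 << bit_length(total).
    max_bits = total.bit_length()
    remaining = (1 << max_bits) - total
    if remaining & (remaining - 1):
        raise ValueError("remaining weight sum is not a power of two")
    last_weight = remaining.bit_length()   # log2(remaining) + 1
    return list(transmitted) + [last_weight], max_bits


def code_lengths(weights, max_bits):
    """Convert weights to code lengths; weight 0 means the symbol is unused."""
    return [max_bits + 1 - w if w > 0 else 0 for w in weights]
```

For the transmitted weights 4, 3, 2, 0, 1 the sum is 15, the next power of two is 16, the deduced last weight is 1, and the resulting code lengths are 1, 2, 3, 0, 4, 4.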

4.2.2. Huffman-Coded Streams

Literals are encoded using the Huffman tree, with the bitstream written in little-endian order and read backward, from its end towards its beginning. The literals section contains either 1 or 4 Huffman-coded streams. When 4 streams are used, a jump table gives the sizes of the first three, so the streams are independent of one another and an implementation can decode them concurrently. This is a key source of Zstandard's high decompression throughput.

5. Dictionary Format

A Zstandard dictionary is a file that provides pre-shared context to improve compression of small data. Dictionaries are especially effective for data under 64 KB where LZ77-family compressors would otherwise have little past context to exploit.

+------------------------+----------+
|  Dictionary_Magic      | 4 bytes  |
+------------------------+----------+
|  Dictionary_ID         | 4 bytes  |
+------------------------+----------+
|  Entropy_Tables        | variable |
+------------------------+----------+
|  Content               | variable |
+------------------------+----------+
  • Dictionary_Magic: Value 0xEC30A437, identifies this as a Zstandard dictionary.
  • Dictionary_ID: 4-byte identifier, matched against the Dictionary_ID field in frame headers. The decompressor uses this to look up the correct dictionary.
  • Entropy_Tables: Pre-trained Huffman and FSE tables. These seed the entropy coders with symbol probability distributions derived from the training corpus, eliminating the need to transmit tables for each compressed frame.
  • Content: Raw bytes that form the "virtual history" prepended to the data being compressed. Back-references in the LZ77 stage can point into this content, allowing the compressor to reference common byte sequences from the training corpus.
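
The two fixed fields can be read as in the sketch below. It assumes both are stored little-endian, as RFC 8478 Section 5 specifies, and that an ID of 0 is invalid in a dictionary file. The function name is hypothetical, and parsing of Entropy_Tables and Content is not shown.

```python
import struct

DICTIONARY_MAGIC = 0xEC30A437


def read_dictionary_id(data: bytes) -> int:
    """Validate the dictionary magic number and return Dictionary_ID."""
    if len(data) < 8:
        raise ValueError("dictionary is too short")
    magic, dictionary_id = struct.unpack_from("<II", data, 0)
    if magic != DICTIONARY_MAGIC:
        raise ValueError("not a Zstandard dictionary")
    if dictionary_id == 0:
        raise ValueError("Dictionary_ID 0 means 'no dictionary'")
    return dictionary_id
```

A decompressor could compare the returned value with the Dictionary_ID field of a frame header before attempting to decode the frame.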

Dictionary IDs

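+-------------------+------------------------------------------------+
| Range             | Use                                            |
+-------------------+------------------------------------------------+
| 0                 | Reserved; indicates no dictionary              |
+-------------------+------------------------------------------------+
| [1, 32767]        | Reserved for dictionaries registered with IANA |
+-------------------+------------------------------------------------+
| [32768, 2^31 - 1] | Available for use by private arrangement       |
+-------------------+------------------------------------------------+
| [2^31, 2^32 - 1]  | Reserved for dictionaries registered with IANA |
+-------------------+------------------------------------------------+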

6. IANA Considerations

6.1. The 'application/zstd' Media Type

This document registers the application/zstd media type for Zstandard-compressed data:

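+---------------------------------+---------------+
| Field                           | Value         |
+---------------------------------+---------------+
| Type name                       | application   |
+---------------------------------+---------------+
| Subtype name                    | zstd          |
+---------------------------------+---------------+
| Required parameters             | None          |
+---------------------------------+---------------+
| Optional parameters             | None          |
+---------------------------------+---------------+
| Encoding considerations         | Binary        |
+---------------------------------+---------------+
| Security considerations         | See Section 7 |
+---------------------------------+---------------+
| Interoperability considerations | None          |
+---------------------------------+---------------+
| Published specification         | RFC 8478      |
+---------------------------------+---------------+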

6.2. Content Encoding

This document registers the "zstd" HTTP Content Encoding token in the IANA "HTTP Content Coding Registry". This encoding is suitable for use in the Content-Encoding and Accept-Encoding HTTP headers.

6.3. Dictionaries

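No IANA registry is defined for Zstandard dictionaries at this time, because no dictionaries had been published for public use when this document was written. The low range (<= 32767) and high range (>= 2^31) of dictionary IDs remain reserved for future registration; dictionaries used by private arrangement should take IDs from the range [32768, 2^31 - 1]. Applications sharing dictionaries must ensure both parties have the same dictionary by means outside the scope of this specification.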

7. Security Considerations

As with any compression algorithm, several security considerations apply:

  • Decompression bomb / zip bomb: A malicious actor could create a small compressed file that expands to a very large decompressed output, exhausting the decompressor's memory or disk space. Implementations should impose limits on decompressed output size. The Frame_Content_Size field, when present, allows the decompressor to pre-check whether the output would exceed available resources.
  • Malformed input: Compressed data may be malformed, either by accident (transmission error) or by deliberate manipulation. Implementations must handle all invalid input gracefully by returning an error rather than crashing or producing incorrect output. The xxHash-64 content checksum provides detection of accidental corruption.
  • Dictionary handling: Dictionaries contain pre-shared content and entropy tables. Applications that load dictionaries from untrusted sources must validate them appropriately; a malformed dictionary could cause a compliant implementation to produce incorrect output.
  • Information leakage via compression: Compression ratios can leak information about the content being compressed, as demonstrated by the CRIME and BREACH attacks on compressed TLS and HTTP traffic. Applications should be aware that compressing attacker-controlled data alongside sensitive data may allow the attacker to infer sensitive content based on the compressed size.

8. Implementation Status

At the time of RFC 8478 publication (October 2018), Zstandard had been deployed at scale across Meta (Facebook) infrastructure and was actively used in production systems including Linux kernel compression (v4.14+), Firefox update packages, Python (through the third-party zstandard package), FoundationDB, Hadoop, and many other projects.

As of 2025–2026, Zstandard v1.5.7 is the reference implementation. The library has been continuously fuzzed by Google's oss-fuzz project since 2016 with no known security vulnerabilities in the decompression path. Language bindings exist for 30+ languages including C, C++, Go, Rust, Python, Java, JavaScript, Ruby, PHP, and others.

9. References

Normative References

  • RFC 1952 — GZIP file format specification version 4.3
  • XXHASH — Collet, Y., "xxHash — Extremely fast non-cryptographic hash algorithm", https://github.com/Cyan4973/xxHash

Appendix A: Decoding Tables for Predefined Codes

This appendix specifies the predefined FSE decoding tables used when sequences use Predefined_Mode. These tables encode fixed probability distributions that were chosen to work reasonably well across a variety of data types without requiring table transmission in the frame. There are three predefined tables:

A.1. Literal Length Code Table

Accuracy_Log = 6 (table size = 64). Encodes the distribution of literal length codes used in sequence commands. Shorter literal lengths (0–15) are assigned much higher probabilities reflecting their frequency in typical compressed data; longer literal lengths (16–35) receive lower probabilities.

A.2. Match Length Code Table

Accuracy_Log = 6 (table size = 64). Encodes the distribution of match length codes. Short matches (3–8 bytes) are most common and receive the highest probabilities. The distribution reflects typical LZ77 match-length statistics observed across diverse data corpora.

A.3. Offset Code Table

Accuracy_Log = 5 (table size = 32). Encodes the distribution of offset codes. An offset code gives the number of extra bits that are read to form the offset value, so a small set of codes covers a wide range of distances; the predefined distribution is close to uniform across the codes it covers. Offset values 1, 2, and 3 are interpreted as "repeat offsets" by the mechanism of Section 3.1.1.5. They are produced by the two smallest offset codes rather than having dedicated entries in this table.

Note: The full numeric values for all three tables are specified in the RFC text. Implementors must use the exact table values to ensure bitstream compatibility with other Zstandard implementations. See RFC 8478 Appendix A for the complete predefined code tables.