LZ77 is a lossless compression algorithm based on a sliding window approach. It is a dictionary-based encoding method that dynamically references previously seen text to compress data.
Sliding Window Structure
The sliding window is conceptually divided into two parts:
- Search Window (Dictionary / Past):
- Contains already encoded data.
- Used to find matching substrings for compression.
- Lookahead Buffer (Future):
- Contains data to be encoded.
- Scanned to find the longest match in the search window.
At each step, the encoder examines the lookahead buffer and searches the past (search window) for the longest match.
While it is possible to have the entire past and future accessible (from the start to the end of the string), in practice both are bounded to limit memory usage. For example, a 1MB search window and a 2MB lookahead buffer allow the encoder to look back million characters and ahead million characters, respectively.
Encoding Strategy
To encode the input text:
- Find the longest match:
- Scan the search window (past) for the longest substring that matches from the start of the lookahead buffer (future).
- If the match extends beyond the search window, it’s still valid.
- If the match reaches the end of the lookahead buffer, it stops, but the
next_charcan still be taken.
- Encode as a triple:
offset— Distance back to where the match begins in the search window.- Encoded using a prefix-free integer code
length— The number of characters matched.- Encoded using a prefix-free integer code
next_char— The character following the matched substring in the lookahead buffer.- Encoded using Huffman Coding based on character frequencies
- Output the binary representation:
- The triple is stored as
<offset><length><next_char>, in binary.
- The triple is stored as
- Advance the window:
- Increase the search window size by
length + 1(up to its limit), or shift it bylength + 1if already at capacity. - Move the start position in the lookahead buffer forward by
length + 1.
- Increase the search window size by
This process is repeated until the entire input is encoded. Then, an encoded header, containing the length of the original input and the Huffman codewords, is prepended to the compressed data to ensure lossless decoding.
Decoding Strategy
To decode the compressed binary data:
- Read the encoded header:
- Extract the original input length.
- Load the Huffman codebook used to decode
next_char.
- Decode each triple:
offset— Decode using the same prefix-free integer code used during encoding.length— Also decoded using a prefix-free integer code.next_char— Decoded using the provided Huffman codebook.
- Reconstruct the substring:
- From the current position in the output, go back
offsetcharacters. - Copy
lengthcharacters from that position into the output. - Append
next_charimmediately after the copied substring.
- From the current position in the output, go back
- Advance the Output Position:
- Move forward by
length + 1in the reconstructed output. - This becomes the new position for processing the next triple.
- Move forward by
Repeat this process until the number of output characters matches the original input length (as specified in the header). This ensures that the entire original text is exactly reconstructed, making the decoding process lossless.
LZSS (Lempel-Ziv-Storer-Szymanski) Variation
LZSS (Lempel-Ziv-Storer-Szymanski) is a variation of the LZ77 algorithm where instead of always using the full triple, it introduces conditional encoding strategies to reduce the number of bits used during compression.
Each substring is encoded using one of two formats based on the length of the match:
- Format 1 (Literal Character):
- Used when the match length is less than a threshold (typically for English text)
- Encoded as
<1-bit flag, char>as the character is stored directly without referencing past data.
- Format 0 (Pointer):
- Used when the match length is greater than or equal to .
- Encoded as
<0-bit flag, offset, length>without thenext_char.
In standard LZ77, encoding short matches still requires a full offset and length, which can be bit-expensive, especially if the offset is large. For such short matches, simply storing the character directly often uses fewer bits.