Lempel Ziv For Lossless Text Compressions

LZ77 is a lossless compression algorithm based on a sliding window approach. It is a dictionary-based encoding method that dynamically references previously seen text to compress data.

Sliding Window Structure

The sliding window is conceptually divided into two parts:

Search Window (Dictionary / Past):
- Contains already encoded data.
- Used to find matching substrings for compression.
Lookahead Buffer (Future):
- Contains data to be encoded.
- Scanned to find the longest match in the search window.

At each step, the encoder examines the lookahead buffer and searches the past (search window) for the longest match.

While it is possible to have the entire past and future accessible (from the start to the end of the string), in practice both are bounded to limit memory usage. For example, a 1MB search window and a 2MB lookahead buffer allow the encoder to look back $1$ million characters and ahead $2$ million characters, respectively.

Encoding Strategy

To encode the input text:

Find the longest match:
- Scan the search window (past) for the longest substring that matches from the start of the lookahead buffer (future).
- If the match extends beyond the search window, it’s still valid.
- If the match reaches the end of the lookahead buffer, it stops, but the next_char can still be taken.
Encode as a triple:
- offset — Distance back to where the match begins in the search window.
  - Encoded using a prefix-free integer code
- length — The number of characters matched.
  - Encoded using a prefix-free integer code
- next_char — The character following the matched substring in the lookahead buffer.
  - Encoded using Huffman Coding based on character frequencies
Output the binary representation:
- The triple is stored as <offset><length><next_char>, in binary.
Advance the window:
- Increase the search window size by length + 1 (up to its limit), or shift it by length + 1 if already at capacity.
- Move the start position in the lookahead buffer forward by length + 1.

This process is repeated until the entire input is encoded. Then, an encoded header, containing the length of the original input and the Huffman codewords, is prepended to the compressed data to ensure lossless decoding.

Decoding Strategy

To decode the compressed binary data:

Read the encoded header:
- Extract the original input length.
- Load the Huffman codebook used to decode next_char.
Decode each triple:
- offset — Decode using the same prefix-free integer code used during encoding.
- length — Also decoded using a prefix-free integer code.
- next_char — Decoded using the provided Huffman codebook.
Reconstruct the substring:
- From the current position in the output, go back offset characters.
- Copy length characters from that position into the output.
- Append next_char immediately after the copied substring.
Advance the Output Position:
- Move forward by length + 1 in the reconstructed output.
- This becomes the new position for processing the next triple.

Repeat this process until the number of output characters matches the original input length (as specified in the header). This ensures that the entire original text is exactly reconstructed, making the decoding process lossless.

LZSS (Lempel-Ziv-Storer-Szymanski) Variation

LZSS (Lempel-Ziv-Storer-Szymanski) is a variation of the LZ77 algorithm where instead of always using the full triple, it introduces conditional encoding strategies to reduce the number of bits used during compression.

Each substring is encoded using one of two formats based on the length of the match:

Format 1 (Literal Character):
- Used when the match length is less than a threshold $δ$ (typically $δ = 3$ for English text)
- Encoded as <1-bit flag, char> as the character is stored directly without referencing past data.
Format 0 (Pointer):
- Used when the match length is greater than or equal to $δ$ .
- Encoded as <0-bit flag, offset, length> without the next_char.

In standard LZ77, encoding short matches still requires a full offset and length, which can be bit-expensive, especially if the offset is large. For such short matches, simply storing the character directly often uses fewer bits.

Quartz 4

Explorer

Lempel Ziv For Lossless Text Compressions

Sliding Window Structure

Encoding Strategy

Decoding Strategy

LZSS (Lempel-Ziv-Storer-Szymanski) Variation

Graph View

Table of Contents

Backlinks