Variable-length codewords must be designed so that decoding is unambiguous. If one codeword is a prefix of another, a single bit sequence can have multiple valid interpretations:

Example Of Ambiguous Decoding

Given the codewords:

  A = 0
  B = 1
  C = 10
  D = 101

Encoding the message BABA yields 1010. However, this sequence can be decoded in multiple ways:

  1. 1 0 1 0 as B A B A
  2. 10 10 as C C
  3. 101 0 as D A

This ambiguity arises because some codewords are prefixes of others (e.g., 1, 10, and 101).
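As a quick illustration, here is a small Python sketch that enumerates every valid way to split a bit string into codewords. The helper `all_decodings` is hypothetical (not from any library), and the `codes` dictionary uses the mapping implied by the decodings above. Note that besides the three splits listed, it also finds BAC and CBA:

```python
def all_decodings(bits, codes, prefix=()):
    """Recursively enumerate every way to split `bits` into codewords."""
    if not bits:
        yield prefix
        return
    for symbol, code in codes.items():
        if bits.startswith(code):
            yield from all_decodings(bits[len(code):], codes, prefix + (symbol,))

# Codewords implied by the example: A=0, B=1, C=10, D=101
codes = {"A": "0", "B": "1", "C": "10", "D": "101"}

print(sorted("".join(d) for d in all_decodings("1010", codes)))
# → ['BABA', 'BAC', 'CBA', 'CC', 'DA']
```

The exponential blow-up in possible decodings is exactly what prefix-free codes eliminate.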

To prevent this, prefix-free codes are used, where no codeword is a prefix of another. This ensures that each message has a unique decoding.

Example Of Prefix-Free Codewords (Unambiguous Decoding)

Given the codewords:

Note how no codeword is a prefix of another; decoding is unambiguous because no codeword overlaps with the start of another.

Two codewords may share a common prefix; what matters is that no codeword, in its entirety, is a prefix of another.
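This distinction is easy to check mechanically. Below is a minimal sketch (the function name and example codewords are illustrative, not from the text): the first set shares the prefix 11 between two codewords yet is still prefix-free, while the second fails because the whole codeword 1 begins 10:

```python
def is_prefix_free(codes):
    """A code is prefix-free iff no codeword is the start of a different codeword."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

print(is_prefix_free(["0", "10", "110", "111"]))  # True: "110"/"111" share "11", which is fine
print(is_prefix_free(["1", "10", "101"]))         # False: "1" is a prefix of "10" and "101"
```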

Huffman Coding

Huffman coding is a greedy algorithm that generates prefix-free codewords based on the frequencies of symbols. It ensures that more frequent characters get shorter codes, while less frequent ones get longer codes, minimising the total number of bits used.

It takes as input the frequencies or probabilities of each symbol and constructs an optimal binary prefix code (i.e. the shortest codeword for the most frequent symbol, the second shortest for the second most frequent, and so on).

Constructing The Huffman Tree

Huffman coding builds a binary tree where each leaf node represents a symbol and each internal node represents the combined frequency of its child nodes.

  1. Start with a list of all symbols and their frequencies.
  2. Create a leaf node for each symbol and insert them into a priority queue (min-heap) sorted by frequency.
  3. While more than one node remains:
    1. Remove the two nodes with the lowest frequencies.
    2. Combine them into a new internal node with the new frequency being the sum of the two.
    3. Insert the new node back into the queue.
  4. The final remaining node is the root of the Huffman tree.

Once the Huffman tree is fully constructed, traverse each root-to-leaf path to assign a binary codeword to each symbol (leaf): typically, following a left edge appends 0 and a right edge appends 1.

Decoding involves traversing the Huffman tree from the root, following each bit (0 for left, 1 for right) until a leaf is reached, which yields the corresponding symbol; decoding then resumes from the root.
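A minimal decoding sketch, assuming a codeword table rather than an explicit tree: accumulating bits into `path` simulates walking edge by edge, and matching a codeword corresponds exactly to reaching a leaf. The codewords used here are illustrative:

```python
def decode(bits, codes):
    """Decode a prefix-free bit string by simulating the root-to-leaf walk."""
    inverse = {code: sym for sym, code in codes.items()}
    out, path = [], ""
    for bit in bits:
        path += bit               # follow one edge (0 = left, 1 = right)
        if path in inverse:       # reached a leaf: emit the symbol
            out.append(inverse[path])
            path = ""             # restart from the root
    return "".join(out)

example = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode("0100110", example))  # → ABAC
```

Because the code is prefix-free, the first codeword that matches is always the correct one, so the decoder never needs to backtrack.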

This structure guarantees that each symbol is assigned a unique binary path and that no codeword is a prefix of another.

Why The Code Is Prefix-Free

In a binary tree, every proper prefix of a root-to-leaf path ends at an internal node, never at a leaf. Since Huffman codewords are derived from complete leaf paths, and only leaves represent valid symbols, it's impossible for one codeword to be a prefix of another.

For a codeword to be a prefix of another, it would have to correspond to an internal node (i.e. an incomplete path), but Huffman assigns codewords only to leaf nodes, ensuring prefix-freeness.

This algorithm fails to generate prefix-free codes for the integers, however: constructing the tree requires a finite set of symbols with known frequencies, which an unbounded alphabet does not provide.