Variable-length codewords must be designed so that decoding is unambiguous. If one codeword is a prefix of another, a message can have multiple possible interpretations:
Example Of Ambiguous Decoding
Given the codewords:
- A → 0
- B → 1
- C → 10
- D → 101
Encoding the message BABA yields 1010. However, this sequence can be decoded in multiple ways:
- 1 0 1 0 as B A B A
- 10 10 as C C
- 101 0 as D A
This ambiguity arises because some codewords are prefixes of others (e.g., 1, 10, and 101).
To prevent this, prefix-free codes are used, where no codeword is a prefix of another. This ensures that each message has a unique decoding.
Example Of Prefix Free Codewords (Unambiguous Decoding)
Given the code words:
Note how no codeword is a prefix of another; decoding is unambiguous because no codeword overlaps with the start of another.
Two codewords may share a common prefix; what matters is that no codeword, in its entirety, is a prefix of another.
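The prefix-free property above is easy to verify mechanically. A minimal sketch (the function name is illustrative, not from the notes): sorting the codewords lexicographically places any prefix immediately before a word it prefixes, so only adjacent pairs need checking.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of another."""
    # After sorting, if a is a prefix of c and a <= b <= c, then a is
    # also a prefix of b, so checking adjacent pairs suffices.
    words = sorted(codewords)
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

assert is_prefix_free(["0", "10", "110", "111"])      # unambiguous code
assert not is_prefix_free(["0", "1", "10", "101"])    # 1 prefixes 10 and 101
```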
Huffman Coding
Huffman coding is a greedy algorithm that generates prefix-free codewords based on the frequencies of symbols. It ensures that more frequent characters get shorter codes, while less frequent ones get longer codes, minimising the total number of bits used.
It takes as input the frequencies or probabilities of each symbol and constructs an optimal binary prefix code: more frequent symbols receive codewords that are no longer than those of less frequent symbols.
Constructing The Huffman Tree
Huffman coding builds a binary tree where each leaf node represents a symbol and each internal node represents the combined frequency of its child nodes.
- Start with a list of all symbols and their frequencies.
- Create a leaf node for each symbol and insert them into a priority queue (min-heap) sorted by frequency.
- While more than one node remains:
- Remove the two nodes with the lowest frequencies.
- Combine them into a new internal node with the new frequency being the sum of the two.
- Insert the new node back into the queue.
- The final remaining node is the root of the Huffman tree.
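The steps above can be sketched with Python's `heapq` (the node layout and function name are assumptions for illustration, not the notes' own code):

```python
import heapq
import itertools

def build_huffman_tree(freqs):
    """Build a Huffman tree from {symbol: frequency}; return the root.

    A node is (frequency, tiebreak, symbol_or_None, left, right);
    leaves carry a symbol and have no children.
    """
    tick = itertools.count()  # tie-breaker so heapq never compares subtrees
    heap = [(f, next(tick), sym, None, None) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)  # lowest frequency
        b = heapq.heappop(heap)  # second-lowest frequency
        # New internal node: frequency is the sum of the two children.
        heapq.heappush(heap, (a[0] + b[0], next(tick), None, a, b))
    return heap[0]  # the last remaining node is the root
```

The unique `tick` value breaks frequency ties before `heapq` ever tries to compare the subtree tuples themselves.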
Once the Huffman tree is fully constructed, traverse each root-to-leaf path to assign a binary codeword to each symbol (leaf); conventionally, moving to a left child appends a 0 and moving to a right child appends a 1.
Additionally, decoding involves traversing the Huffman tree from the root, following each bit (0 for left, 1 for right) until a leaf is reached, which yields the corresponding symbol.
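Both traversals can be sketched over a small hand-built tree (the tuple node layout and the three-symbol example are assumptions for illustration):

```python
# A node is either ("leaf", symbol) or ("node", left, right).
def assign_codes(node, prefix=""):
    """Collect codewords by walking every root-to-leaf path."""
    if node[0] == "leaf":
        return {node[1]: prefix or "0"}  # single-symbol edge case gets "0"
    codes = assign_codes(node[1], prefix + "0")       # left child appends 0
    codes.update(assign_codes(node[2], prefix + "1"))  # right child appends 1
    return codes

def decode(bits, root):
    """Follow bits from the root; emit a symbol at each leaf, then restart."""
    out, node = [], root
    for bit in bits:
        node = node[1] if bit == "0" else node[2]
        if node[0] == "leaf":
            out.append(node[1])
            node = root
    return "".join(out)

# Hypothetical three-symbol tree: A = 0, B = 10, C = 11.
tree = ("node", ("leaf", "A"), ("node", ("leaf", "B"), ("leaf", "C")))
```

With this tree, `assign_codes(tree)` yields `{'A': '0', 'B': '10', 'C': '11'}` and `decode("01011", tree)` returns `"ABC"`.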
This structure guarantees that each symbol is assigned a unique binary path and no codeword is a prefix of another.
Example Of Huffman Coding
Given the string A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED, the character frequencies are:
- C: 2
- B: 6
- E: 7
- _: 10
- D: 10
- A: 11
Step 1 — Initialise a priority queue Q with leaf nodes for each symbol, sorted by ascending frequency:

Q = [C (2), B (6), E (7), _ (10), D (10), A (11)]

Step 2 — Remove the two lowest-frequency nodes, C and B, and combine them into a new internal node CB with frequency 2 + 6 = 8. Note that CB has C and B as its children.

Step 3 — Insert CB back into the priority queue:

Q = [E (7), CB (8), _ (10), D (10), A (11)]

Continue combining the two lowest-frequency nodes until one node remains (the full tree).
Note that once the final merges have produced the full tree, a symbol-to-codeword mapping can be constructed via root-to-leaf traversal, and the encoded string can be decoded by walking the tree bit by bit from the root.
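The whole example can be run end to end. A compact sketch (not the notes' own code) that merges codeword dictionaries directly instead of building an explicit tree: each merge prepends 0 to every codeword in the lower-frequency subtree and 1 to the other.

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(text):
    """Return a {symbol: codeword} Huffman code for the symbols in text."""
    tick = itertools.count()  # tie-breaker for equal frequencies
    heap = [(f, next(tick), {s: ""}) for s, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, ca = heapq.heappop(heap)  # two lowest-frequency subtrees
        fb, _, cb = heapq.heappop(heap)
        # Prepend 0 to codes in the first subtree, 1 to codes in the second.
        merged = {s: "0" + c for s, c in ca.items()}
        merged.update({s: "1" + c for s, c in cb.items()})
        heapq.heappush(heap, (fa + fb, next(tick), merged))
    return heap[0][2]

text = "A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
```

For these frequencies the merge order matches the steps above (C+B, then E+CB, then _+D, then A+ECB), giving codeword lengths 4, 4, 3, 2, 2, 2 for C, B, E, _, D, A and an encoded length of 115 bits, versus 46 × 3 = 138 bits for a fixed-length 3-bit code.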
Why The Code Is Prefix-Free
In a binary tree, no two leaf nodes can share the exact same root-to-leaf path. Since Huffman codes are derived from leaf paths, and only leaves represent valid symbols, it’s impossible for one codeword to be a prefix of another.
For a codeword to be a prefix of another, it would have to correspond to an internal node (i.e. an incomplete path), but Huffman assigns codewords only to leaf nodes, ensuring prefix-freeness.
Proof By Contradiction
Suppose, for contradiction, that one codeword c₁c₂...cₖ is a prefix of another codeword c₁c₂...cₖ...cₙ, where k < n. In the Huffman tree, this implies:
- c₁c₂...cₖ leads to a node that must be a leaf (since it represents a complete codeword),
- but the path continues to cₖ₊₁, so the node at c₁c₂...cₖ must also be an internal node.
This is a contradiction: a node cannot both be a leaf and have children. (If k = n, then you're comparing a codeword to itself.) Hence, Huffman codes are always prefix-free.
Note, however, that this algorithm cannot generate prefix-free codes for an unbounded alphabet such as the integers, since it requires the complete set of symbols and their frequencies up front.
