Variable-length codewords must be designed so that decoding is unambiguous. If one codeword is a prefix of another, a single bit sequence can have multiple valid interpretations:

Example Of Ambiguous Decoding

Given the codewords:

  A = 0
  B = 1
  C = 10
  D = 101

Encoding the message BABA yields 1010. However, this sequence can be decoded in multiple ways:

  1. 1 0 1 0 as B A B A
  2. 10 10 as C C
  3. 101 0 as D A

This ambiguity arises because some codewords are prefixes of others (e.g., 1, 10, and 101).
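As a quick illustration, here is a small Python sketch that enumerates every valid way to split a bit string into codewords. The helper `all_decodings` is hypothetical (not from any library), and the `codes` dictionary uses the mapping implied by the decodings above. Note that besides the three splits listed, it also finds BAC and CBA:

```python
def all_decodings(bits, codes, prefix=()):
    """Recursively enumerate every way to split `bits` into codewords."""
    if not bits:
        yield prefix
        return
    for symbol, code in codes.items():
        if bits.startswith(code):
            yield from all_decodings(bits[len(code):], codes, prefix + (symbol,))

# Codewords implied by the example: A=0, B=1, C=10, D=101
codes = {"A": "0", "B": "1", "C": "10", "D": "101"}

print(sorted("".join(d) for d in all_decodings("1010", codes)))
# → ['BABA', 'BAC', 'CBA', 'CC', 'DA']
```

The exponential blow-up in possible decodings is exactly what prefix-free codes eliminate.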

To prevent this, prefix-free codes are used, where no codeword is a prefix of another. This ensures that each message has a unique decoding.

Example Of Prefix-Free Codewords (Unambiguous Decoding)

Given the codewords:

Note how no codeword is a prefix of another; decoding is unambiguous because no codeword overlaps with the start of another.

Two codewords may share a common prefix; what matters is that no codeword, in its entirety, is a prefix of another.
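This distinction is easy to check mechanically. Below is a minimal sketch (the function name and example codewords are illustrative, not from the text): the first set shares the prefix 11 between two codewords yet is still prefix-free, while the second fails because the whole codeword 1 begins 10:

```python
def is_prefix_free(codes):
    """A code is prefix-free iff no codeword is the start of a different codeword."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

print(is_prefix_free(["0", "10", "110", "111"]))  # True: "110"/"111" share "11", which is fine
print(is_prefix_free(["1", "10", "101"]))         # False: "1" is a prefix of "10" and "101"
```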

Huffman Coding

Huffman coding is a greedy algorithm that generates prefix-free codewords based on the frequencies of symbols. It ensures that more frequent characters get shorter codes, while less frequent ones get longer codes, minimising the total number of bits used.

It takes as input the frequencies or probabilities of each symbol and constructs an optimal binary prefix code (i.e. the shortest codeword for the most frequent symbol, the second shortest for the second most frequent, and so on).

Constructing The Huffman Tree

Huffman coding builds a binary tree where each leaf node represents a symbol and each internal node represents the combined frequency of its child nodes.

  1. Start with a list of all symbols and their frequencies.
  2. Create a leaf node for each symbol and insert them into a priority queue (min-heap) sorted by frequency.
  3. While more than one node remains:
    1. Remove the two nodes with the lowest frequencies.
    2. Combine them into a new internal node with the new frequency being the sum of the two.
    3. Insert the new node back into the queue.
  4. The final remaining node is the root of the Huffman tree.

Once the Huffman tree is fully constructed, traverse each root-to-leaf path to assign a binary codeword to each symbol (leaf): typically, following a left edge appends 0 and a right edge appends 1.

Decoding involves traversing the Huffman tree from the root, following each bit (0 for left, 1 for right) until a leaf is reached, which yields the corresponding symbol; decoding then resumes from the root.
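A minimal decoding sketch, assuming a codeword table rather than an explicit tree: accumulating bits into `path` simulates walking edge by edge, and matching a codeword corresponds exactly to reaching a leaf. The codewords used here are illustrative:

```python
def decode(bits, codes):
    """Decode a prefix-free bit string by simulating the root-to-leaf walk."""
    inverse = {code: sym for sym, code in codes.items()}
    out, path = [], ""
    for bit in bits:
        path += bit               # follow one edge (0 = left, 1 = right)
        if path in inverse:       # reached a leaf: emit the symbol
            out.append(inverse[path])
            path = ""             # restart from the root
    return "".join(out)

example = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode("0100110", example))  # → ABAC
```

Because the code is prefix-free, the first codeword that matches is always the correct one, so the decoder never needs to backtrack.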

This structure guarantees that each symbol is assigned a unique binary path and that no codeword is a prefix of another.

Why The Code Is Prefix-Free

In a binary tree, every proper prefix of a root-to-leaf path ends at an internal node, never at a leaf. Since Huffman codewords are derived from complete leaf paths, and only leaves represent valid symbols, it's impossible for one codeword to be a prefix of another.

For a codeword to be a prefix of another, it would have to correspond to an internal node (i.e. an incomplete path), but Huffman assigns codewords only to leaf nodes, ensuring prefix-freeness.

This algorithm fails to generate prefix-free codes for the integers, however: constructing the tree requires a finite set of symbols with known frequencies, which an unbounded alphabet does not provide.