Data streams generate quintillions of bytes of data daily, containing a variety of objects (text, images, structured tables, etc.). Compressing this data saves space and time in storage, retrieval and transmission.

Data Compression

Data compression is the practice of representing information in a compact form. It is achieved by identifying and exploiting repeated patterns in data.

Lossless data compression algorithms encode data into a compact form that can be perfectly reconstructed (decoded) to recover the original data.

Fixed-Length Codewords

Fixed-length codewords are an encoding method where each symbol is represented by a codeword of the same length, using the same number of bits regardless of the symbol’s frequency.

An example of this is ASCII, where each character is represented by a 7- or 8-bit binary code; as such, every symbol, whether common or rare, gets the same bit length.
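The uniform width can be seen directly with Python's built-ins, a small sketch rather than anything specific to a compression library:

```python
# Each ASCII character maps to a code point that fits in 7 bits;
# padding to 8 bits gives every symbol the same codeword length.
def ascii_codeword(ch: str) -> str:
    return format(ord(ch), "08b")  # fixed 8-bit binary string

print("e", ascii_codeword("e"))  # a very common letter
print("Z", ascii_codeword("Z"))  # a rare letter
# Both occupy exactly 8 bits, regardless of frequency.
```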

This uniformity means:

  • Encoding is fast and straightforward as there is no need to track symbol frequencies.
  • Decoding is fast and straightforward as the system always knows how many bits to read per symbol.
  • There’s no need for delimiters or complex parsing logic.

However, fixed-length encoding is not space-efficient. It assumes all symbols are equally likely, which rarely reflects real data. More efficient schemes use variable-length codes, assigning shorter codes to more frequent symbols.
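A minimal sketch of a fixed-length scheme for a hypothetical 4-symbol alphabet (the codebook here is invented for illustration):

```python
# Hypothetical 4-symbol alphabet: 2 bits per symbol, regardless of frequency.
FIXED = {"a": "00", "b": "01", "c": "10", "d": "11"}

def encode_fixed(text: str) -> str:
    return "".join(FIXED[ch] for ch in text)

def decode_fixed(bits: str, width: int = 2) -> str:
    inverse = {code: sym for sym, code in FIXED.items()}
    # The decoder always reads exactly `width` bits per symbol,
    # so no delimiters or frequency tables are needed.
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))

msg = "aaabca"
bits = encode_fixed(msg)
assert decode_fixed(bits) == msg
assert len(bits) == 2 * len(msg)  # cost is fixed: 2 bits per symbol
```

Note that the cost is the same even though "a" dominates the message; that wasted space is exactly what variable-length codes recover.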

If you open a compressed file in a plain text editor, you may see gibberish. That's because compressed files often use variable-length encoding, but the editor tries to interpret every 8 bits as a character.

Variable-Length Codewords

Variable-length codewords are an encoding method where each symbol is assigned a codeword whose length in bits depends on its probability of occurrence.

Symbols that appear more frequently are given shorter codewords, while rarer symbols receive longer ones. This strategy reduces the total number of bits needed to represent the data, making it more space-efficient than fixed-length encoding.

This approach means:

  • Encoding and decoding are more complex, requiring knowledge of codeword boundaries and symbol frequencies.
  • It enables compression, as the average bit length per symbol is minimised.
  • To avoid ambiguity during decoding, the scheme must ensure that no codeword is a prefix of another (i.e. the code must be prefix-free).

An example is Morse code, where frequent letters like E and T have shorter representations (. and -), while rare letters like Q or Z have longer ones. (Morse is not prefix-free; it relies on pauses between letters to mark codeword boundaries.)

Many compression algorithms, such as Huffman coding and arithmetic coding, are based on variable-length encoding, leveraging symbol frequencies to minimise file size.
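A minimal sketch of Huffman code construction using Python's standard library, assuming a text with at least two distinct symbols (a full implementation would also handle the single-symbol edge case and serialise the tree):

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free codebook from symbol frequencies (minimal sketch)."""
    # Each heap entry: (frequency, tiebreaker, {symbol: partial codeword}).
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend 0/1 as each subtree moves one level deeper in the tree.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
# The most frequent symbol gets the shortest codeword.
assert len(codes["a"]) <= len(codes["b"]) <= len(codes["c"])
```

Because less frequent symbols are merged first, they end up deeper in the tree and so receive longer codewords, which is precisely the frequency-to-length trade-off described above.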