The Burrows-Wheeler Transform (BWT) of a string, is a transformation formed by the characters in the last column of the Sorted Cyclic Permutation Matrix of the string. It is widely used in pattern matching and lossless compression.

Sorted Cyclic Permutation Matrix

A Cyclic Permutation Matrix is a set of all cyclic rotations (permutations) of a string, with each rotation forming a row in the matrix. Sorting the rows lexicographically yields the Sorted Cyclic Permutation Matrix

As usual to mark the end of the string and ensure uniqueness, a unique symbol $, which is lexicographically smallest, is appended to the string. Since each row is a rotation, characters beyond $ are redundant. Thus, the matrix can be simplified by truncating rows after $.

The resulting matrix represents the sorted suffixes of the original string. Each row is a suffix text[i...n], therefore a suffix array can capture this by recording only the starting indices of these suffixes relative to the original string.

Therefore, a suffix array can be constructed in place of the Sorted Cyclic Permutation Matrix without any loss of information.

Properties Of The Sorted Cyclic Permutation Matrix

The manipulations made above are supported by the structural properties of the matrix :

Property 1: Each column of the matrix is a permutation of the original string

This is because every character appears exactly once in each rotation, and the matrix contains all such rotations.

Property 2: Any consecutive columns of the matrix represent all possible substrings (in lexicographic order) of length (i.e., all -mers)

This follows from the fact that each row is a rotated version of the text, so each -mer appears exactly once across the rows.

A corollary of Property 2 is that, as consists of sorted cyclic permutations, the last column always precedes the first column in the original text.

Extracting The Burrows-Wheeler Transform

The BWT of a string, is the string formed by the letters in the last column of its Sorted Cyclic Permutation Matrix.

Since a suffix array is often used instead of constructing the full matrix, the last column must be computed indirectly.

Due to the cyclic nature of the rotations, the characters in the last column are the ones preceding the characters in the first column. Thus, the index of the characters in the first column is calculated as follows:

bwt[i] = text[suffix_array[i] - 1], where

This provides an alternate definition of the BWT as the string formed by the characters immediately preceding the start positions indicated in the suffix array.

Time Complexity Of BWT Construction

The time complexity of generating the burrow wheelers transform of a string depends on the suffix array construction:

  • Naive Method:
  • Prefix doubling: