Inverting The Burrows-Wheeler Transform Of A String

A key property of Burrows-Wheeler Transform that enables quick pattern matching is that it is invertible.

Naive Inversion Of A BWT

The naive method of inverting a BWT reconstructs the Sorted Cyclic Permutation Matrix by iteratively prepending the BWT to the current column, sorting the rows, and repeating this until the full matrix is reconstructed.

Sorting the BWT yields the first column of the SCPM, since the matrix is formed by sorting all cyclic rotations.
- In fact, sorting any column reconstructs the first column due to the matrix’s sorting property.
Cyclically, the first column follows the last (BWT), therefore prepending the BWT to the sorted column and sorting the resulting 2-mers reconstructs the first two columns.
- Recall that the SCPM is created from sorting each row, which is achieved by sorting the 2-mers in this case.
This process is repeated until all $n$ columns are built.

This is supported by property 2 of SCPMs as the first column is the sorted permutation of substrings of length 1, the first 2 columns are the same but of length 2, and so on.

Finally, the original string is retrieved as the first row of the reconstructed matrix, which has the form $ + S[1...n-1]. This is because the row is lexicographically smallest as it begins with the terminal $, which precedes all other characters.

Inversion Using LF-Mapping

The BWT example used for this section is lo$oogg

In a SCPM, each letter in the last column/BWT can be mapped to to a corresponding letter in the first column, via property 1.

As the first row has the form $ + S[1...n-1] due to the terminal $ being unique and lexicographically smallest, and due to the cyclical nature of the SCPM. The first letter of the BWT is always the last letter of the original string (not including $).

This insight gives us the starting point of recovery using the first letter of the BWT.

Therefore BWT[1] = l is the last letter of the string, meaning l$ can be reconstructed trivially.

To go further, the rank of each unique character in the BWT must be computed. The rank of a character is the position of it in the first column.

To do this, a sorted array of each unique character is required, this can be made in $O ∣ alphabet ∣)$ space and time using ASCII values for ordering.

A snapshot of the rank table is:

ASCII Values 36 ($) 103 (g) 108 (l) 111 (o)
Rank - - - -

ASCII Values	36 ($)	103 (g)	108 (l)	111 (o)
Rank	-	-	-	-

Then for each character, find out where the characters block starts in the first column

Naive Computation Of Rank Array

A naive method is sorting the BWT to get the first column F. This can be done in $O (n)$ using radix sort and $O (n lo g n)$ using comparison sort.

Then populate the table by iterating through F and storing the first index of each unique character found, which is also $O (n)$

Sorting the BWT lo$oogg results in $ggloo, this can be used to compute the rank table

ASCII Values 36 ($) 103 (g) 108 (l) 111 (o)
Rank 1 2 4 5

ASCII Values	36 ($)	103 (g)	108 (l)	111 (o)
Rank	1	2	4	5

Optimal Computation Of Rank Array

An optimal method is recognising you already have the ordered characters in the rank table and the length of the string.

Therefore, first iterate through the BWT and count the number of occurrence of each character, store this in a count table initialised similarly as rank

l = 1 o = 3 $ = 1 g = 2

Then recall that the terminal $, always starts at position 1 in F via the properties of the SCPM and $ being the lexicographically smallest. This gives a starting point to reconstruct the rest of the rank array.

Use the fact above to start populating the rank array,

36 ($) = 1

The next value in ASCII order is g, meaning the next sorted character is g in F, so we can populate g by adding the current index = 1 to the number of occurrence of $ = 1

103 (g) = 1 +1 = 2

The next value in sorted order is l, to populate this, add the current index = 2 to the number of occurrences of g = 2

108 (l) = 2 + 2 = 4

The next value is o, to populate this, add the current index = 4 to the number of occurrences of l = 1

111 (o) = 4 + 1 = 5

This, in $O (n)$ time, constructs the rank array.

The formal algorithm involves parsing through the BWT and counting the number of occurrences. Then going through the rank array which is an ASCII ordered array, starting at $ which always stores 1, and then for each successive unique character, store current_rank + count(prev_letter).

So its $O (n + ∣ alphabet ∣)$

Using Rank Array

The rank array tells us the position of each character in the BWT, in the first column. To reconstruct the string we need to know the characters preceding the strings we’ve already reconstructed.

So using the mapping from the rank array and the cyclic nature of the SCPM, we can effectively go from the rank position in the first column to get a particular row and go to the preceding column (the BWT) of that row to get the preceding character.

Simply speaking BWT[rank[ord(char)]] returns the character preceding char

Illustration Of This LF Mapping

We know L[4] = o precedes F[4] = l. So we can reconstruct
- ol$
If any L[i] maps to some F[j], then L[j] has to be its preceding letter

An issue arises for non-unique character, as there can be multiple mappings. However, note that the order of non unique characters is the same as in F and L, this is the case for all BWT. This is because L is just a shifted version of F and the sorting done in F preserves the order in which the block of letters would appear in subsequent columns.

For any letter L[I] = ‘x’ in the last or BWT column L, there has to be a suffix starting with this specific instance of x in the first column F at some position/index ‘pos’ in F. That is, F[pos] = x:

$pos = rank (x) + nocc (x, L [1... i))$

The position of a letter $x$ is the position where $x$ first appears plus the number of times $x$ appears in L[1..i-1]. Where i is the position of interest.

$pos = rank (x) + nocc (x, B W T [1... i))$

The nocc table represents the number of occurrences of a particular letter in L.

EXAMPLE

bwt = lo$oogg rank = [$ = 1, g = 2, l = 4, o = 5]

Starting from l$,

First look at the rank(l) = 4 (where it first appears in F), the preceding character is bwt[4] = o constructing ol$

Then to identify the position of the character preceding o in F, use the rank table and add the number of occurrence of the character before the previous rank exclusive. So prev_rank = 4, therefore bwt[1..prev_rank] = bwt[1...4] = lo$

The pos(o) = rank(o) + nocc(o, bwt[1...4]) = 5 + 1 = 6, therefore the preceding character is bwt[6] = g

gol$

To identify the character preceding g, find pos(g) = rank(g) + nocc(g, bwt[1..6]) = 2 + 0 = 2, and then use that pos to find bwt[pos] = bwt[2] = o

ogol$

To identify the character preceding o, find pos(o) = rank(o) + nocc(o, bwt[1...2] = 5 + 0 = 5 and use that to find bwt[pos] = bwt[5] = o

oogol$

Then repeat pos(o) = rank(o) + nocc(o, bwt[1..5) = 5 + 2 = 7 so bwt[pos] = bwt[7] = g

googol$

And we can either continue one last iteration (leading to $ again) or stop when the string of length $n$

Complexity Analysis

The time and space complexity of the naive BWT inversion method is:

Time Complexity: $O (n^{2} lo g n)$
- Each of the $n$ iterations involves sorting $n$ strings of increasing length (from 1 to $n$ ), with each sort being $O (n lo g n)$
Space Complexity: $O (n^{2})$
- The algorithm stores $n$ strings, each of up to $n$ characters, to build the full matrix.

The time and space complexity of LF-Mapping is:

Pre-process the rank table = $O (n)$
Pre-process the nocc table = $O (n)$
Then the algorithm iterates through $n$ texts so all together its $O (n)$

Quartz 4

Explorer

Inverting The Burrows-Wheeler Transform Of A String

Naive Inversion Of A BWT

Inversion Using LF-Mapping

Naive Computation Of Rank Array

Optimal Computation Of Rank Array

Using Rank Array

Complexity Analysis

Graph View

Table of Contents

Backlinks