Suffix Array

The suffix array of a string is a sorted array of all its suffixes. Rather than storing each suffix explicitly, which would take $O(n^{2})$ space, the array stores only the starting index of each suffix in the original string, taking $O(n)$ space.

To ensure all suffixes are distinct and the empty suffix is included, a terminal character ($), which is lexicographically smaller than all other characters, is appended to the string. The empty suffix, now represented by the terminal character alone, therefore appears first in the sorted order.

Thus, a string of length $n$ has $n + 1$ suffixes, including the empty suffix.

Example of Constructing a Suffix Array

Let text[1..6] = banana. After appending the terminal character, we get banana$. Its suffixes are:

text[1...7] = banana$,

text[2...7] = anana$,

text[3...7] = nana$,

text[4...7] = ana$,

text[5...7] = na$,

text[6...7] = a$,

text[7...7] = $

The suffix array for banana$ would thus be:

SA("banana$") = [$, a$, ana$, anana$, banana$, na$, nana$]

Or, more space efficiently:

SA("banana$") = [7, 6, 4, 2, 1, 5, 3]
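The two representations above can be reproduced with a short sketch. It assumes Python's default string ordering, which compares characters by code point and so already places `$` before the lowercase letters; the variable names are illustrative only.

```python
text = "banana$"

# Pair every suffix with its 1-based starting index, as in the listing above.
suffixes = [(text[i:], i + 1) for i in range(len(text))]

# "$" (code point 36) precedes "a" (97), so plain string comparison
# matches the ordering assumed in the text.
suffixes.sort()

explicit_sa = [suffix for suffix, _ in suffixes]
index_sa = [start for _, start in suffixes]

print(explicit_sa)  # ['$', 'a$', 'ana$', 'anana$', 'banana$', 'na$', 'nana$']
print(index_sa)     # [7, 6, 4, 2, 1, 5, 3]
```

The explicit form holds every suffix in full, while the index form holds one integer per suffix, which is where the space saving comes from.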

Naively Constructing Suffix Arrays

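The naive method builds a list of all starting indices and sorts them by comparing the corresponding suffixes. The code below uses Python's 0-based indices, so for banana$ it returns [6, 5, 3, 1, 0, 4, 2] rather than the 1-based array shown above.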

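def naive_suffix_build(string: str) -> list[int]:
    # `string` is expected to already end with the terminal character,
    # e.g. "banana$". The returned indices are 0-based.
    # Initialise the suffix array with every starting index
    suffix_array = list(range(len(string)))
    # Sort the indices by comparing the suffixes they point to
    suffix_array.sort(key=lambda i: string[i:])
    return suffix_array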

This algorithm takes $O(n^{2} \log n)$ time: sorting performs $O(n \log n)$ comparisons, and each suffix comparison takes up to $O(n)$ time.

Prefix Doubling Algorithm

The Prefix Doubling Algorithm constructs a suffix array SA[1...n] by iteratively sorting suffixes based on the first $k$ characters, where $k$ doubles each round ( $k = 1, 2, 4, 8, \dots$ ), until all suffixes are uniquely ranked.

If a suffix has fewer than $k$ characters, it is padded with a sentinel value (e.g., an empty character) that is lexicographically smaller than any valid character. A string therefore sorts before any longer string that it is a prefix of, e.g., cat < cathode.
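To see how the order settles as $k$ doubles, here is a deliberately inefficient sketch that sorts by the first $k$ characters directly. Instead of explicit padding it relies on Python ordering a string before any longer string it is a prefix of, which has the same effect; `sort_by_prefix` is a hypothetical helper, not part of the algorithm proper.

```python
def sort_by_prefix(text: str, k: int) -> list[int]:
    """Return suffix start indices ordered by their first k characters."""
    # text[i:i + k] is simply shorter near the end of the string, and Python
    # orders a string before any longer string it is a prefix of.
    return sorted(range(len(text)), key=lambda i: text[i:i + k])

assert "cat" < "cathode"

text = "banana$"
print(sort_by_prefix(text, 1))  # [6, 1, 3, 5, 0, 2, 4]
print(sort_by_prefix(text, 2))  # [6, 5, 1, 3, 0, 2, 4]
print(sort_by_prefix(text, 4))  # [6, 5, 3, 1, 0, 4, 2]
```

Ties between equal prefixes are left in index order here because `sorted` is stable. The real algorithm avoids the slicing and compares integer ranks instead.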

Core Data Structures

To efficiently maintain and update the sorted order of suffixes, two structures are used:

Suffix Array (SA) — An array of indices representing the current order of suffixes.
Rank Array (rank) — For each index i, rank[i] gives the current rank (or position) of the suffix starting at i.

Initially, suffixes are sorted by their first character and ranks are assigned accordingly (identical prefixes receive identical ranks).
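A minimal sketch of this initial ranking step, assuming dense ranks that start at 0 (`initial_ranks` is an illustrative name):

```python
def initial_ranks(text: str) -> list[int]:
    """Rank each suffix by its first character (dense ranks starting at 0)."""
    # Map each distinct character to its position in sorted order,
    # so equal first characters share the same rank.
    order = {ch: r for r, ch in enumerate(sorted(set(text)))}
    return [order[ch] for ch in text]

print(initial_ranks("banana$"))  # [2, 1, 3, 1, 3, 1, 0]
```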

Fast Suffix Comparison Via Rank Pairs

Once the ranks for prefixes of length $k$ are computed, prefixes of length $2k$ can be compared efficiently by splitting each of them into two halves.

Key Idea

To compare the first $2k$ characters of two suffixes, call these prefixes A and B and split them as:

A = A₁ + A₂

B = B₁ + B₂

where A₁ and B₁ are the first $k$ characters, and A₂ and B₂ are the next $k$.

Then compare as follows:

If A₁ < B₁, then A < B

If A₁ > B₁, then A > B

If A₁ = B₁, compare A₂ and B₂

Since the ranks for prefixes of length $k$ are already known, to compare suffixes starting at i and j, the algorithm compares:

The first $k$ characters using rank[i] and rank[j], and
The next $k$ characters using rank[i + k] and rank[j + k]

This works because rank[i] represents the order of the first $k$ characters of the suffix at i, and rank[i+k] represents the order of the next $k$ characters, which are exactly the first $k$ characters of the suffix starting at i + k. If both rank pairs are equal, the suffixes are considered equal for this round.
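As a small illustration, the rank pair can be built as a tuple and compared with ordinary tuple ordering. The helper `rank_pair` is hypothetical, and using $-1$ for positions past the end of the string is one common convention for the padding sentinel.

```python
def rank_pair(rank: list[int], i: int, k: int) -> tuple[int, int]:
    """Key describing the first 2k characters of the suffix starting at i."""
    n = len(rank)
    # -1 stands in for "past the end"; it sorts below every real rank.
    second = rank[i + k] if i + k < n else -1
    return (rank[i], second)

# Ranks of "banana$" by first character (k = 1).
rank = [2, 1, 3, 1, 3, 1, 0]

print(rank_pair(rank, 1, 1))  # (1, 3) for "anana$", whose first two characters are "an"
print(rank_pair(rank, 5, 1))  # (1, 0) for "a$"
print(rank_pair(rank, 5, 1) < rank_pair(rank, 1, 1))  # True, since "a$" < "an"
```

Tuples compare component by component, which mirrors the rule above: the first halves decide unless they tie, in which case the second halves decide.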

Each iteration doubles the prefix length used for comparison, allowing suffixes to be sorted using previously computed ranks. By comparing rank pairs (rank[i], rank[i + k]), the first $2k$ characters of two suffixes are compared in $O(1)$ time. After at most $\lceil \log_{2} n \rceil$ iterations, all suffixes have unique ranks, and the suffix array is fully constructed without any full string comparisons.

Example of Prefix Doubling (In 0-based Indexing)

Let text[0..5] = banana. After appending the terminal character, this becomes banana$.

Initial Setup

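Assign initial ranks based on the first character of each suffix:

| Index (i) | Suffix (text[i...6]) | First char | Rank |
|---|---|---|---|
| 0 | `banana$` | `b` | 2 |
| 1 | `anana$` | `a` | 1 |
| 2 | `nana$` | `n` | 3 |
| 3 | `ana$` | `a` | 1 |
| 4 | `na$` | `n` | 3 |
| 5 | `a$` | `a` | 1 |
| 6 | `$` | `$` | 0 |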

Thus rank = [2, 1, 3, 1, 3, 1, 0] and SA = [0, 1, 2, 3, 4, 5, 6] (the default, unsorted order).

Iteration 1 ( $k = 1$ )

Sort each suffix text[i...n] using the rank pairs (rank[i], rank[i+k]). The value $-1$ is used as the second component when $i + k$ falls past the end of the string (here, when $i + k \geq 7$).

| Index (i) | Suffix (text[i...6]) | Rank Pair |
|---|---|---|
| 0 | `banana$` | $(2, 1)$ |
| 1 | `anana$` | $(1, 3)$ |
| 2 | `nana$` | $(3, 1)$ |
| 3 | `ana$` | $(1, 3)$ |
| 4 | `na$` | $(3, 1)$ |
| 5 | `a$` | $(1, 0)$ |
| 6 | `$` | $(0, -1)$ |

Sorting these tuples results in SA = [6, 5, 1, 3, 0, 2, 4]. New ranks are then assigned in this order, with equal rank pairs receiving the same rank:

| Index (i) | Suffix (text[i...6]) | New Rank |
|---|---|---|
| 6 | `$` | 0 |
| 5 | `a$` | 1 |
| 1 | `anana$` | 2 |
| 3 | `ana$` | 2 |
| 0 | `banana$` | 3 |
| 2 | `nana$` | 4 |
| 4 | `na$` | 4 |

Thus rank = [3, 2, 4, 2, 4, 1, 0]
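One round of this process (sort by rank pairs, then reassign ranks) might be sketched as follows. The function name `doubling_round` is an assumption for illustration, and a comparison sort is used for simplicity.

```python
def doubling_round(rank: list[int], k: int) -> tuple[list[int], list[int]]:
    """Sort by (rank[i], rank[i + k]) and return (suffix array, new ranks)."""
    n = len(rank)

    def key(i: int) -> tuple[int, int]:
        # -1 marks a second half that lies past the end of the string.
        return (rank[i], rank[i + k] if i + k < n else -1)

    sa = sorted(range(n), key=key)
    new_rank = [0] * n
    for prev, cur in zip(sa, sa[1:]):
        # Equal pairs share a rank; a different pair gets the next rank
        # (the comparison yields True/False, which adds as 1/0).
        new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
    return sa, new_rank

sa, rank = doubling_round([2, 1, 3, 1, 3, 1, 0], k=1)
print(sa)    # [6, 5, 1, 3, 0, 2, 4]
print(rank)  # [3, 2, 4, 2, 4, 1, 0]
```

Note that the new ranks are indexed by suffix start position, not by position in the sorted order, so `rank[5]` is the rank of `a$`.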

Iteration 2 ( $k = 2$ )

Sort each suffix text[i...n] using the rank pairs: (rank[i], rank[i+k]).

| Index (i) | Suffix (text[i...6]) | Rank Pair |
|---|---|---|
| 0 | `banana$` | $(3, 4)$ |
| 1 | `anana$` | $(2, 2)$ |
| 2 | `nana$` | $(4, 4)$ |
| 3 | `ana$` | $(2, 1)$ |
| 4 | `na$` | $(4, 0)$ |
| 5 | `a$` | $(1, -1)$ |
| 6 | `$` | $(0, -1)$ |

Sorting these tuples results in SA = [6, 5, 3, 1, 0, 4, 2]. New ranks are then assigned in this order:

| Index (i) | Suffix (text[i...6]) | New Rank |
|---|---|---|
| 6 | `$` | 0 |
| 5 | `a$` | 1 |
| 3 | `ana$` | 2 |
| 1 | `anana$` | 3 |
| 0 | `banana$` | 4 |
| 4 | `na$` | 5 |
| 2 | `nana$` | 6 |

Thus rank = [4, 3, 6, 2, 5, 1, 0]. Since rank now assigns unique values, sorting is complete.

Final Results

SA = [6, 5, 3, 1, 0, 4, 2]
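Putting the rounds together, a compact sketch of the whole procedure could look like the following. It is an illustrative implementation under the conventions used in this example (0-based indices, $-1$ as the out-of-range rank, a comparison sort per round), not a tuned one, and `prefix_doubling` is a name chosen here.

```python
def prefix_doubling(text: str) -> list[int]:
    """Build the 0-based suffix array of text by prefix doubling."""
    n = len(text)
    if n == 0:
        return []
    # Initial setup: rank every suffix by its first character.
    order = {ch: r for r, ch in enumerate(sorted(set(text)))}
    rank = [order[ch] for ch in text]
    sa = list(range(n))
    k = 1
    while True:
        def key(i: int) -> tuple[int, int]:
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        # Reassign ranks: equal pairs keep the same rank.
        new_rank = [0] * n
        for prev, cur in zip(sa, sa[1:]):
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every suffix now has a distinct rank
            return sa
        k *= 2

text = "banana$"
print(prefix_doubling(text))  # [6, 5, 3, 1, 0, 4, 2]

# Cross-check against direct suffix comparison.
assert prefix_doubling(text) == sorted(range(len(text)), key=lambda i: text[i:])
```

The loop stops as soon as the largest rank equals $n - 1$, which is the same "all ranks are unique" condition used to end the worked example.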

Complexity Analysis Of Prefix Doubling

The Prefix Doubling Algorithm runs in $O(n \log^{2} n)$ time when using a standard comparison sort because:

Each iteration sorts the suffixes by their rank pairs, which describe prefixes of length $2k$. Every pair comparison takes $O(1)$ time, so the sort takes $O(n \log n)$ time.
The prefix length $k$ doubles each iteration, so there are $O(\log n)$ iterations.

Thus, the total complexity is:

$O(n \log n) \text{ per iteration} \times O(\log n) \text{ iterations} = O(n \log^{2} n)$

However, if radix sort is used instead of comparison sort, each sorting round can be done in linear time $O (n)$ , resulting in an improved total time complexity of:

$O(n \log n)$
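One way to get a linear-time round is a two-pass least-significant-digit radix sort over the rank pairs, using a stable counting sort for each pass. The sketch below is an assumption about how such a round could be written; `counting_sort_by` and `radix_sort_pairs` are hypothetical names. It works in $O(n)$ because ranks never exceed $n - 1$.

```python
def counting_sort_by(indices: list[int], key: list[int], max_key: int) -> list[int]:
    """Stable counting sort of indices by key[i], where 0 <= key[i] <= max_key."""
    count = [0] * (max_key + 2)
    for i in indices:
        count[key[i] + 1] += 1
    for v in range(1, len(count)):
        count[v] += count[v - 1]  # count[v] is now the first output slot for key v
    out = [0] * len(indices)
    for i in indices:
        out[count[key[i]]] = i
        count[key[i]] += 1
    return out

def radix_sort_pairs(rank: list[int], k: int) -> list[int]:
    """Order suffix indices by (rank[i], rank[i + k]) without comparisons."""
    n = len(rank)
    # Shift the second component by one so the out-of-range value -1 becomes 0.
    second = [rank[i + k] + 1 if i + k < n else 0 for i in range(n)]
    max_key = max(rank) + 1
    # Sort by the less significant component first, then stably by the first.
    by_second = counting_sort_by(list(range(n)), second, max_key)
    return counting_sort_by(by_second, rank, max_key)

print(radix_sort_pairs([2, 1, 3, 1, 3, 1, 0], 1))  # [6, 5, 1, 3, 0, 2, 4]
print(radix_sort_pairs([3, 2, 4, 2, 4, 1, 0], 2))  # [6, 5, 3, 1, 0, 4, 2]
```

Both calls reproduce the orders obtained in the banana$ example, so this routine could stand in for the comparison sort in each round.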

Additionally, the space complexity is $O (n)$ , due to arrays for the suffix array, rank, and temporary storage.

The algorithm is efficient in practice and relatively simple to implement compared to more advanced linear-time suffix array algorithms like SA-IS or DC3.
