The suffix array of a string is a sorted array of all its suffixes, however, rather than storing each suffix explicitly, taking space, the array stores only the starting indices of each suffix in the original string, taking space.
To ensure all suffixes are distinct and the empty suffix is included, a terminal character ($) which is lexicographically smaller than all other characters, is appended to the string. This character makes the empty suffix appear first in the sorted order.
Thus, a string of length has suffixes, including the empty suffix.
Example of A Constructing A Suffix Array
Let
text[1..6] = banana. After appending the terminal character, we getbanana$. Its suffixes are:
text[1...7] = banana$,text[2...7] = anana$,text[3...7] = nana$,text[4...7] = ana$,text[5...7] = na$,text[6...7] = a$,text[7...7] = $The suffix array for
banana$would thus be:
SA("banana$") = [$, a$, ana$, anana$, banana$, na$, nana$]Or, more space efficiently:
SA("banana$") = [7, 6, 4, 2, 1, 5, 3]
Naively Constructing Suffix Arrays
The naive method builds a list of all starting indices and sorts them by comparing the corresponding suffixes.
def naive_suffix_build(string: str):
# Initiliase suffix array with starting indices
suffix_array = [i for i in range(len(string))]
# Sort indices by comparing each suffix of the string
suffix_array.sort(key = lambda i : string[i:])
return suffix_arrayThis algorithm take time due to sorting taking time, and comparing suffixes taking up to per comparison.
Prefix Doubling Algorithm
The Prefix Doubling Algorithm constructs a suffix array SA[1...n] by iteratively sorting suffixes based on the first characters, where doubles each round (), until all suffixes are uniquely ranked.
If a suffix has fewer than characters, it’s padded with a sentinel value (e.g., an empty character) that is lexicographically smaller than any valid character, e.g., cat < cathode.
Core Data Structures
To efficiently maintain and update the sorted order of suffixes, two structures are used:
- Suffix Array (
SA) — An array of indices representing the current order of suffixes. - Rank Array (
rank) — For each indexi,rank[i]gives the current rank (or position) of the suffix starting ati.
Initially, suffixes are sorted by their first character and ranks are assigned accordingly (identical prefixes receive identical ranks).
Fast Suffix Comparison Via Rank Pairs
To achieve fast suffix comparisons, once the ranks for prefixes of length are computed, suffixes of length can be compared efficiently by splitting them into two halves:
Key Idea
To compare two suffixes
AandBof length , split them as:
A = A₁ + A₂B = B₁ + B₂Where
A₁,B₁are the first characters, andA₂,B₂are the next .Then compare as follows:
- If
A₁ < B₁, thenA < B- If
A₁ > B₁, thenA > B- If
A₁ = B₁, compareA₂andB₂
Since the ranks for prefixes of length are already known, to compare suffixes starting at i and j, the algorithm compares:
- The first characters using
rank[i]andrank[j], and - The next characters using
rank[i + k]andrank[j + k]
This works because rank[i] represents the order of the first characters of the suffix at i, and rank[i+k] represents the order of the next characters. If both rank pairs are equal, the suffixes are considered equal for this round
Each iteration doubles the prefix length used for comparison, allowing suffixes to be sorted using previously computed ranks. By comparing rank pairs (rank[i], rank[i + k]), suffixes of length are compared in time. After at most iterations, all suffixes get unique ranks, and the suffix array would be fully constructed without full string comparisons.
Example of Prefix Doubling (In 0-based Indexing)
Let
text[0..5] = banana. After appending the terminal character, this becomebanana$.Initial Setup
- Assign initial ranks based on the first character of each suffix
Index ( i)Suffix ( text[i...6])First char Rank 0banana$ b1anana$ a2nana$ n3ana$ a4na$ n5a$ a6$ $
- Thus
rank = [2, 1, 3, 1, 3, 1, 0]andSA = [0, 1, 2, 3, 4, 5, 6](default)Iteration 1 ()
- Sort each suffix
text[i...n]using the rank pairs:(rank[i], rank[i+k]). Note that is used when
Index ( i)Suffix ( text[i...6])Rank Pair 0banana$ 1anana$ 2nana$ 3ana$ 4na$ 5a$ 6$
- Sorting these tuples results in
SA = [6, 5, 1, 3, 0, 2, 4], using this assign new ranks:
Index ( i)Suffix ( text[i...6])New Rank 6$ 0 5a$ 1 1anana$ 2 3ana$ 2 0banana$ 3 2nana$ 4 4na$ 4
- Thus
rank = [3, 2, 4, 2, 4, 1, 0]Iteration 2 ()
- Sort each suffix
text[i...n]using the rank pairs:(rank[i], rank[i+k]).
Index ( i)Suffix ( text[i...6])Rank Pair 0banana$ 1anana$ 2nana$ 3ana$ 4na$ 5a$ 6$
- Sorting these tuples results in
SA = [6, 5, 3, 1, 0, 4, 2], using this assign new ranks:
Index ( i)Suffix ( text[i...6])New Rank 6$ 0 5a$ 1 3ana$ 2 1anana$ 3 0banana$ 4 4na$ 5 2nana$ 6
- Thus
rank = [4, 3, 6, 2, 5, 1, 0]. Sinceranknow assigns unique values, sorting is complete.Final Results
SA = [6, 5, 3, 1, 0, 4, 2]
Complexity Analysis Of Prefix Doubling
The Prefix Doubling Algorithm runs in time when using a standard comparison sort as:
- Each iteration compares and sorts suffixes based on rank pairs of length , which takes time.
- The prefix length doubles each iteration, so there are iterations.
Thus, the total complexity is :
However, if radix sort is used instead of comparison sort, each sorting round can be done in linear time , resulting in an improved total time complexity of:
Additionally, the space complexity is , due to arrays for the suffix array, rank, and temporary storage.
The algorithm is efficient in practice and relatively simple to implement compared to more advanced linear-time suffix array algorithms like SA-IS or DC3.