The suffix array of a string is a sorted array of all its suffixes, however, rather than storing each suffix explicitly, taking space, the array stores only the starting indices of each suffix in the original string, taking space.

To ensure all suffixes are distinct and the empty suffix is included, a terminal character ($) which is lexicographically smaller than all other characters, is appended to the string. This character makes the empty suffix appear first in the sorted order.

Thus, a string of length has suffixes, including the empty suffix.

Naively Constructing Suffix Arrays

The naive method builds a list of all starting indices and sorts them by comparing the corresponding suffixes.

def naive_suffix_build(string: str):
    # Initiliase suffix array with starting indices
    suffix_array = [i for i in range(len(string))]
    # Sort indices by comparing each suffix of the string
    suffix_array.sort(key = lambda i : string[i:])
    return suffix_array

This algorithm take time due to sorting taking time, and comparing suffixes taking up to per comparison.

Prefix Doubling Algorithm

The Prefix Doubling Algorithm constructs a suffix array SA[1...n] by iteratively sorting suffixes based on the first characters, where doubles each round (), until all suffixes are uniquely ranked.

If a suffix has fewer than characters, it’s padded with a sentinel value (e.g., an empty character) that is lexicographically smaller than any valid character, e.g., cat < cathode.

Core Data Structures

To efficiently maintain and update the sorted order of suffixes, two structures are used:

  • Suffix Array (SA) — An array of indices representing the current order of suffixes.
  • Rank Array (rank) — For each index i, rank[i] gives the current rank (or position) of the suffix starting at i.

Initially, suffixes are sorted by their first character and ranks are assigned accordingly (identical prefixes receive identical ranks).

Fast Suffix Comparison Via Rank Pairs

To achieve fast suffix comparisons, once the ranks for prefixes of length are computed, suffixes of length can be compared efficiently by splitting them into two halves:

Key Idea

To compare two suffixes A and B of length , split them as:

  • A = A₁ + A₂
  • B = B₁ + B₂

Where A₁, B₁ are the first characters, and A₂, B₂ are the next .

Then compare as follows:

  • If A₁ < B₁, then A < B
  • If A₁ > B₁, then A > B
  • If A₁ = B₁, compare A₂ and B₂

Since the ranks for prefixes of length are already known, to compare suffixes starting at i and j, the algorithm compares:

  • The first characters using rank[i] and rank[j], and
  • The next characters using rank[i + k] and rank[j + k]

This works because rank[i] represents the order of the first characters of the suffix at i, and rank[i+k] represents the order of the next characters. If both rank pairs are equal, the suffixes are considered equal for this round

Each iteration doubles the prefix length used for comparison, allowing suffixes to be sorted using previously computed ranks. By comparing rank pairs (rank[i], rank[i + k]), suffixes of length are compared in time. After at most iterations, all suffixes get unique ranks, and the suffix array would be fully constructed without full string comparisons.

Complexity Analysis Of Prefix Doubling

The Prefix Doubling Algorithm runs in time when using a standard comparison sort as:

  • Each iteration compares and sorts suffixes based on rank pairs of length , which takes time.
  • The prefix length doubles each iteration, so there are iterations.

Thus, the total complexity is :

However, if radix sort is used instead of comparison sort, each sorting round can be done in linear time , resulting in an improved total time complexity of:

Additionally, the space complexity is , due to arrays for the suffix array, rank, and temporary storage.

The algorithm is efficient in practice and relatively simple to implement compared to more advanced linear-time suffix array algorithms like SA-IS or DC3.