The substring matching problems asks to pre-process a text[1...n], so that any pattern pat[1...m] can be searched in time. Unlike exact pattern matching, where the pattern is known upfront, substring matching requires building a structure from the text to support efficient multiple-pattern queries.

Suffix-Based Tree Data Structures can be constructed from the text to support pattern matching in time.

In string processing, a unique terminal symbol $, chosen as the lexicographically smallest character, is often appended to mark the end of the string and ensure all suffixes are distinct.

Suffix Tries

A Suffix Trie is a rooted tree where each edge is labelled with a character, and each root-to-leaf path represents a suffix of the input string.

It stores all suffixes of a given string in a compact, prefix-sharing structure.

Suffix Trees

A suffix tree for a string str[1...n] is a path compressed suffix trie, with the following properties:

  1. It has exactly leaves numbered through , representing the suffix start index
  2. every internal node, except for the root, has at least two children
  3. Each edge is labelled with a non-empty substring of str
  4. Every node has at most one edge beginning with any given character of the alphabet.
  5. The string label of the path from root to leaf is the suffix str[i...n] for
Naive Construction Of Suffix Trees (Pseudocode)
def naive_suffix_tree(str[1...n]):
    initialise root node
    for i from [1...n]:
        suffix = str[i...n]
        current = root
        j = 0
        while j < length(suffix):
            if there is an edge from current starting with suffix[j]:
                follow the edge and match characters
                if mismatch occurs mid edge:
                    split the edge, insert a new node and create a new leaf
                    break
                else:
                    advance to next node
            else:
                create a new edge labeled suffix[j...end] from current
                break

The naive suffix tree construction has a time complexity of . It generates suffixes, and for each suffix str[i...n], it may require traversing the tree from the root, potentially examining up to characters. The total comparisons across all suffixes thus sum to .

Ukkonen’s algorithm offers a more efficient, linear-time approach to building suffix trees.

Implicit Suffix Trees

An implicit suffix tree for a str[1...n] is the suffix tree constructed for str, without appending a terminal character ($).

  • The terminal symbol ensures all suffixes are distinct, resulting in leaves (one for each suffix).
  • By omitting $, only unique paths are represented as suffixes with matching prefixes share branches.

Additionally, A full suffix tree is obtained by appending $ to each suffix contained within the tree.

Space Efficient Representation

Explicitly storing the substring labels of each edge in a suffix tree requires space. A more efficient representation is implicitly using a tuple to denote the substring str[j...i], where .

This reduces the space complexity to , assuming a fixed size alphabet.