The Gusfield’s Z-Algorithm is a linear-time algorithm for pre-processing a string. Its output, the Z-values (), enable efficient solutions to various string-related problems, including linear-time pattern matching.

Defining The -Values, -Box, &

For a string str[1...n], the -value, at each position , represents the length of the longest substring starting at position that matches a prefix of the string.

The -box at each position is the contiguous range of indices , representing the longest substring starting at position that matches the prefix of str.

The , for , is the right-most endpoint of all Z-boxes that begin at or before position . The value of never decreases as increases.

The , for , is the left endpoint of any Z-box that ends at position . If there are multiple Z-boxes that end at , then can be chosen as the left endpoint of any of those Z-boxes.

Overview Of Gusfield’s Z-Algorithm

The goal is to compute values for each position . The computation starts at , since is always the length of the string and provides no useful information.

However, not all values need to be actively tracked and stored in an array:

  • Each computed value is stored, as it is essential for further calculations.
  • -box — The interval can be derived in time from .
  • Only the most recent values (i.e., ) are stored as temporary variables, since updating only depends on the previous iteration.

Z-Algorithm: Base Case

To start the algorithm, compute by performing an explicit left-to-right comparison of the characters str[(i=2)...q-1] with str[1...q-2] until a mismatch is found at some .

  • This results in
  • If , i.e. a Z-box of some length is found:
    • Set and
  • Else, if , i.e. the Z-box is undefined:
    • Set and

Z-Algorithm: Inductive Step

Assume that for some position :

  • The values through have been computed correctly,
  • currently holds and,
  • currently holds ,

To compute at position , there are two cases to consider:

Case 1 —

If , then is outside the previously computed -box, meaning there’s no additional information available.

Therefore, to compute , explicitly compare characters str[k...q-1] with the prefix str[1...q-k] until a mismatch is found at some .

  • This results in
  • If , then:
    • Set and
  • Else, if , then:
    • retains its value, remaining the same as .

Case 2 —

If , then lies within the boundaries of the previously computer -box, allowing future information about substring matches to aid in computing .

  • By definition, the -box (), matches the prefix range , making them symmetrical.
  • This then implies that str[k...r] = str[k-l+1...Z_l] which again by definition, is equal to str[1...r-k+1]
  • By the inductive hypothesis, is already computed, so it can be used to aid in the computation of

More specifically, we relate to the length of the green box (), leading to three possible cases:

Case 2a —

If , then the -box remains within the bounds of the remaining portion of the ​-box, so its value can be directly copied.

  • Recall that -box is equivalent to , and str[k...r] = str[k-l+1...Z[l]] = str[1...r-k+1]
  • Since a mismatch occurs immediately after the -box, at position with the prefix at position , the exact same mismatch must occur after the -box, at position

Therefore, as is identical to , we can simply:

  • Set .
  • Keep and unchanged.

Since no explicit comparisons are performed, is computed in time.

Case 2b —

If , then the -box reaches the boundary of the box, meaning it could extend further, requiring explicit matching.

  • Recall that -box is equivalent to , and str[k...r] = str[k-l+1...Z_l] = str[1...r-k+1]
  • This case implies that a mismatch occurred at with the corresponding prefix character at
  • Thus, the character at can’t be the same as , because if it was, the box would have extended further, contradicting its original boundary at .
  • However, the character at can be the same as ; we can’t be certain that is strictly bounded by .

Therefore, we:

  • Set
  • Increment by continuing explicit matching from str[r+1...q-1] with the prefix from str[r-k+2...q-k] until a mismatch occurs at some position .
  • This results in
  • Additionally, set and

This takes time, as the algorithm iterates over characters.

Case 2c —

If , then the -box extends beyond the boundary of the -box, meaning provides enough information to compute without further comparisons.

  • Recall that -box is equivalent to , and str[k...r] = str[k-l+1...Z_l] = str[1...r-k+1]
  • Since the -box extends past , by definition, its corresponding match at the prefix, extends past
  • Therefore, does not extend, as if it did, then the character at would be the same as , but this would mean the box would have extended further, contradicting its original boundary at .
  • This means that the character at cannot be the same as , and is thus a mismatch.

Therefore, we can simply:

  • Set
  • Keep and unchanged (, and (or as both are the left endpoint of the z-box ending at )

Since no new comparisons are required, this case takes time.

Time Complexity Analysis

The Z-algorithm pre-processes a string in time by computing all -values efficiently.

  • The total run-time depends on two factors
    • The number of iterations
    • The total number of character comparisons (both matches and mismatches)
  • Since the algorithm computes one -value per iteration, there are exactly iterations
    • Each iteration, aside from character comparisons, involves only constant-time operations.
  • Character comparisons stop as soon as a mismatch occurs, meaning there can be at most mismatches in total.
    • To determine the total number of matches, note that the rightmost boundary is always non-decreasing across iterations. Denoted by , where
    • A -box cannot extend beyond the string boundaries, , ensuring that the total number of matches

Therefore the algorithm performs at most character comparisons across all iterations, along with additional work, leading to an overall time complexity of

Z-Algorithm For Linear Time Exact Pattern Matching

The Exact Pattern Matching Problem involves finds all occurrences of a given pattern[1...m] in a given text[1...n].

Gusfield’s Z-algorithm enables a linear-time solution to this, as follows:

  1. Construct a new string: str[1...m+n+1] = pat[1...m] + "$" + txt[1...n]
  2. For , pre-process values corresponding to the str.
  3. For any , positions where indicate an occurrence of pat[1...m] at position in str, and hence in txt. That is: pat[1...m] = str[i...i+m-1] = txt[i-(m+1)...i-2]
  4. As the computation of values for any string takes time, this algorithm takes time.