The Gusfield’s Z-Algorithm is a linear-time algorithm for pre-processing a string. Its output, the Z-values (), enable efficient solutions to various string-related problems, including linear-time pattern matching.
Defining The -Values, -Box, &
For a string str[1...n], the -value, at each position , represents the length of the longest substring starting at position that matches a prefix of the string.
Example of -values
Given the string
str = aabcaabxaay, the corresponding Z-array, with the same length asstr, has values:
Z[1] = 11→ SinceZ[1]represents the entire string, it is always equal ton.Z[2] = 1→ The substring starting atstr[2](a) matches the first character of the prefix.Z[3] = 0→ No match betweenstr[3](b) and the prefix.Z[4] = 0→ No match betweenstr[4](c) and the prefix.Z[5] = 3→ The substringstr[5..7] = "aab"matches the prefixstr[1..3] = "aab".Z[6] = 1→ The substringstr[6] = "a"matches the first character of the prefix.Z[7] = 0→ No match betweenstr[7](b) and the prefix.Z[8] = 0→ No match betweenstr[8](x) and the prefix.Z[9] = 2→ The substringstr[9..10] = "aa"matches the prefixstr[1..2] = "aa".Z[10] = 1→ The substringstr[10] = "a"matches the first character of the prefix.Z[11] = 0→ No match betweenstr[11](y) and the prefix.
The -box at each position is the contiguous range of indices , representing the longest substring starting at position that matches the prefix of str.
Example of a -Box
Given the string
str = aabcaabxaay, the corresponding -box has the following values:
ZBox[1] = [1..11]→ The -box is from index 1 to 11 (i.e., the entire string).ZBox[2] = [2..2]→ The -box is from index 2 to 2 (a).ZBox[3] = undefined→ No match, so the -box is empty.ZBox[4] = undefined→ No match, so the -box is empty.ZBox[5] = [5..7]→ The -box is from index 5 to 7 (aab).ZBox[6] = [6..6]→ The -box is from index 6 to 6 (a).ZBox[7] = undefined→ No match, so the -box is empty.ZBox[8] = undefined→ No match, so the -box is empty.ZBox[9] = [9..10]→ The -box is from index 9 to 10 (aa).ZBox[10] = [10..10]→ The -box is from index 10 to 10 (a).ZBox[11] = undefined→ No match, so the -box is empty.
The , for , is the right-most endpoint of all Z-boxes that begin at or before position . The value of never decreases as increases.
Example of
Given the string
str = aabcaabxaay, the corresponding values are:
r[2] = 2→ZBox[2] = [2..2], so the right-most endpoint is2.r[3] = 2→ZBox[3]is undefined, so the right-most endpoint remains as2.r[4] = 2→ Same as previous.r[5] = 7→ZBox[5] = [5..7], so the right-most endpoint is7, as .r[6] = 7→ZBox[6] = [6..6], the right-most endpoint remains7, as .r[7] = 7→ZBox[7]is undefined, so the right-most endpoint remains as7.r[8] = 7→ Same as previous.r[9] = 10→ZBox[9] = [9..10], so the right-most endpoint is10, as .r[10] = 10→ZBox[10] = [10..10], the right-most endpoint remains as10.r[11] = 10→ZBox[1]is undefined, so the right-most endpoint remains as10.
The , for , is the left endpoint of any Z-box that ends at position . If there are multiple Z-boxes that end at , then can be chosen as the left endpoint of any of those Z-boxes.
Example of
Given the string
str = aabcaabxaay, the corresponding values are:
l[2] = 2→ The Z-box that end at positionr[2] = 2isZBox[2] = [2..2]which has the left-endpoint of2l[3] = 2→r[3] = r[2], so the left endpoint remains.l[4] = 2→ Same as previous.l[5] = 5→ The Z-box that end at positionr[5] = 7isZBox[5] = [5..7]which has the left-endpoint of5l[6] = 5→r[6] = r[5], so the left endpoint remainsl[7] = 5→ Same as previous.l[8] = 5→ Same as previous.l[9] = 9 || 10→ The Z-boxes that end at positionr[9] = 10areZBox[9] = [9..10]andZBox[10] = [10..10]. The could either be9or10.l[10] = 9 || 10→r[10] = r[9], so the left endpoint remainsl[11] = 9 || 10→ Same as previous.
Illustration of all Definitions
Overview Of Gusfield’s Z-Algorithm
The goal is to compute values for each position . The computation starts at , since is always the length of the string and provides no useful information.
However, not all values need to be actively tracked and stored in an array:
- — Each computed value is stored, as it is essential for further calculations.
- -box — The interval can be derived in time from .
- — Only the most recent values (i.e., ) are stored as temporary variables, since updating only depends on the previous iteration.
Z-Algorithm: Base Case
To start the algorithm, compute by performing an explicit left-to-right comparison of the characters str[(i=2)...q-1] with str[1...q-2] until a mismatch is found at some .
- This results in
- If , i.e. a Z-box of some length is found:
- Set and
- Else, if , i.e. the Z-box is undefined:
- Set and
The length of a -box (the -value) is given by .
As and are the start & end points of the -box, we have the formula:
This can be rearranged to make the subject:
Example of Computing the Base Case ( )
Given the string
str = aabcaacxaabcaabcay, the base case begins with calculating .
- Initialise the -array as a zero-filled array of the
str’s length.- Compare
str[2]withstr[1]:
str[2] = a = str[1], so increments.- Compare
str[3]withstr[2]
str[3] = b != str[2], so a mismatch is found.- Since , set:
Z-Algorithm: Inductive Step
Assume that for some position :
- The values through have been computed correctly,
- currently holds and,
- currently holds ,
To compute at position , there are two cases to consider:
Case 1 —

If , then is outside the previously computed -box, meaning there’s no additional information available.
Therefore, to compute , explicitly compare characters str[k...q-1] with the prefix str[1...q-k] until a mismatch is found at some .
- This results in
- If , then:
- Set and
- Else, if , then:
- retains its value, remaining the same as .
Example of Computing Case 1
Given the string
str = aabcaacxaabcaabcay,
- The base case begins with , , and
- For :
- Since the -box ends at , it provides no additional information for computing .
- Explicit matching is performed:
str[3] = b != str[1], so .- Thus, retain the previous values of .
Case 2 —
If , then lies within the boundaries of the previously computer -box, allowing future information about substring matches to aid in computing .
- By definition, the -box (), matches the prefix range , making them symmetrical.
- This then implies that
str[k...r] = str[k-l+1...Z_l]which again by definition, is equal tostr[1...r-k+1] - By the inductive hypothesis, is already computed, so it can be used to aid in the computation of
More specifically, we relate to the length of the green box (), leading to three possible cases:
Case 2a —

If , then the -box remains within the bounds of the remaining portion of the -box, so its value can be directly copied.
- Recall that -box is equivalent to , and
str[k...r] = str[k-l+1...Z[l]] = str[1...r-k+1] - Since a mismatch occurs immediately after the -box, at position with the prefix at position , the exact same mismatch must occur after the -box, at position
Therefore, as is identical to , we can simply:
- Set .
- Keep and unchanged.
Since no explicit comparisons are performed, is computed in time.
Case 2b —

If , then the -box reaches the boundary of the box, meaning it could extend further, requiring explicit matching.
- Recall that -box is equivalent to , and
str[k...r] = str[k-l+1...Z_l] = str[1...r-k+1] - This case implies that a mismatch occurred at with the corresponding prefix character at
- Thus, the character at can’t be the same as , because if it was, the box would have extended further, contradicting its original boundary at .
- However, the character at can be the same as ; we can’t be certain that is strictly bounded by .
Therefore, we:
- Set
- Increment by continuing explicit matching from
str[r+1...q-1]with the prefix fromstr[r-k+2...q-k]until a mismatch occurs at some position . - This results in
- Additionally, set and
This takes time, as the algorithm iterates over characters.
Case 2c —

If , then the -box extends beyond the boundary of the -box, meaning provides enough information to compute without further comparisons.
- Recall that -box is equivalent to , and
str[k...r] = str[k-l+1...Z_l] = str[1...r-k+1] - Since the -box extends past , by definition, its corresponding match at the prefix, extends past
- Therefore, does not extend, as if it did, then the character at would be the same as , but this would mean the box would have extended further, contradicting its original boundary at .
- This means that the character at cannot be the same as , and is thus a mismatch.
Therefore, we can simply:
- Set
- Keep and unchanged (, and (or as both are the left endpoint of the z-box ending at )
Since no new comparisons are required, this case takes time.
Time Complexity Analysis
The Z-algorithm pre-processes a string in time by computing all -values efficiently.
- The total run-time depends on two factors
- The number of iterations
- The total number of character comparisons (both matches and mismatches)
- Since the algorithm computes one -value per iteration, there are exactly iterations
- Each iteration, aside from character comparisons, involves only constant-time operations.
- Character comparisons stop as soon as a mismatch occurs, meaning there can be at most mismatches in total.
- To determine the total number of matches, note that the rightmost boundary is always non-decreasing across iterations. Denoted by , where
- A -box cannot extend beyond the string boundaries, , ensuring that the total number of matches
Therefore the algorithm performs at most character comparisons across all iterations, along with additional work, leading to an overall time complexity of
Space Complexity Analysis
- A regular implementation computes and stores -values in an array of length , requiring space
Z-Algorithm For Linear Time Exact Pattern Matching

The Exact Pattern Matching Problem involves finds all occurrences of a given pattern[1...m] in a given text[1...n].
Gusfield’s Z-algorithm enables a linear-time solution to this, as follows:
- Construct a new string:
str[1...m+n+1] = pat[1...m] + "$" + txt[1...n] - For , pre-process values corresponding to the
str. - For any , positions where indicate an occurrence of
pat[1...m]at position instr, and hence intxt. That is:pat[1...m] = str[i...i+m-1] = txt[i-(m+1)...i-2] - As the computation of values for any string takes time, this algorithm takes time.
The terminal character (
$) can be omitted with minor modifications to the algorithm.
- The
$serves as a separator, ensuring no character intextmatches it.- This guarantees that the maximum possible value is .
- Without
$, this restriction is lost, so occurrences must be identified by considering all values.- Additionally, indexing adjustments are needed since the string length is reduced by one.
Improvements to Space Complexity
- In a regular implementation, the space complexity would be space
- This can be optimised by only storing necessary values (those of length ), reducing it to space
