Boyer-Moore leverages both the extended bad character rule and the good suffix rule to compute the maximum possible shift at each alignment, significantly improving efficiency over naive matching.

Overview Of The Algorithm

  1. Pre-process the pattern pat[1...m] to compute:
    • The values needed for the extended bad character shift rule, and
    • The good_suffix and matched_prefix values need for the good suffix shift rule
  2. The main search algorithm proceeds as follows:
    1. Align pat[1...m] with text[1...n], and scan characters right-to-left within the current alignment.
    2. If a mismatch occurs,
      • Use the bad character rule to compute a potential shift, denoted as
      • Use the good suffix rule to compute a second shift, denoted as
      • Shift the pattern to the right by positions
    3. If a complete match occurs:
      • Report the match position
      • Then shift the pattern by m - matched_prefix[2] positions

The Boyer-Moore algorithm requires Galil’s optimisation to achieve linear-time complexity in the worst case, as without, the time complexity can degrade to .

Galil’s Optimisation

Galil’s optimisation is a technique that avoids redundant character comparisons during the search phase of the Boyer-Moore algorithm:

  • Each time the pattern shifts, there’s often regions of the pattern that are guaranteed to match the text at the new alignment.
  • These matching regions arise naturally from the logic behind the good suffix and matched prefix rules.
  • Galil’s insight is to skip comparisons in these known-matching regions in the next iteration, thus avoiding unnecessary work.

Implementing Galil’s Optimisation

To implement this optimisation, maintain two indices:

  • start — the leftmost index of the region that is know to match
  • stop — the rightmost index of the matching region

When the pattern is realigned, the right-to-left scan of the pattern:

  • Starts from the end (pat[m]) and proceeds down to stop + 1.
  • Skips the region from start to stop entirely.
  • Resumes from start - 1 down to pat[1] if needed.

This segmentation ensures that the known-matching region pat[start ... stop] is never re-compared against the text.

Two Cases To Consider

There are two main scenarios in which Galil’s optimisation is applied:

  1. Good Suffix Shift
    • If a mismatch occurs at some position , and the good suffix rule suggests a shift such that good_suffix[k+1] = p > 0
    • Then it is known that pat[p-m+k+1...p] matches txt[j+k...j+m-1]
    • Therefore, comparisons in this region can be skipped in the next scan, and the variable are set as:
      • start = p - m + k + 1
      • stop = p

  1. Matched Prefix Shift
    • If the shift is based on a matched prefix, (when good_suffix[k+1] = 0),
    • Then the prefix pat[1...matched_prefix[k+1] matches txt[j+m-matched_prefix[k+1]...j+m-1]
    • Therefore, comparisons in this region can be skipped in the next scan, and the variable are set as:
      • start = 1
      • stop = matched_prefix[k+1]

Galil’s optimisation ensures that Boyer-Moore does not re-check character pairs that are already known to match. By efficiently skipping over these regions, the algorithm maintains its worst-case time complexity of after an initial pre-processing phase.