Implementing The Boyer-Moore Algorithm

Boyer-Moore leverages both the extended bad character rule and the good suffix rule to compute the maximum possible shift at each alignment, significantly improving efficiency over naive matching.

Overview Of The Algorithm

Pre-process the pattern pat[1...m] to compute:
- The $R_{k}$ values needed for the extended bad character shift rule, and
- The good_suffix and matched_prefix values need for the good suffix shift rule
The main search algorithm proceeds as follows:
1. Align pat[1...m] with text[1...n], and scan characters right-to-left within the current alignment.
2. If a mismatch occurs,
  - Use the bad character rule to compute a potential shift, denoted as $n_{b c}$
  - Use the good suffix rule to compute a second shift, denoted as $n_{g s}$
  - Shift the pattern to the right by $max (n_{b c}, n_{g s})$ positions
3. If a complete match occurs:
  - Report the match position
  - Then shift the pattern by m - matched_prefix[2] positions

The Boyer-Moore algorithm requires Galil’s optimisation to achieve linear-time complexity in the worst case, as without, the time complexity can degrade to $O (mn)$ .

Worst-Case Example for Boyer-Moore

Let text[1...10] = aaaaaaaaaa and pat[1...3] = aaa.

In this case, every valid alignment between pat and text results in a complete match.

Since m - matched_prefix[2] = 1, the pattern shifts by only one position each time. This means Boyer-Moore examines all possible alignments, resulting in an $O (mn)$ time complexity.

Clearly, this is inefficient despite all matches being trivial.

Galil’s Optimisation

Galil’s optimisation is a technique that avoids redundant character comparisons during the search phase of the Boyer-Moore algorithm:

Each time the pattern shifts, there’s often regions of the pattern that are guaranteed to match the text at the new alignment.
These matching regions arise naturally from the logic behind the good suffix and matched prefix rules.
Galil’s insight is to skip comparisons in these known-matching regions in the next iteration, thus avoiding unnecessary work.

Implementing Galil’s Optimisation

To implement this optimisation, maintain two indices:

start — the leftmost index of the region that is know to match
stop — the rightmost index of the matching region

When the pattern is realigned, the right-to-left scan of the pattern:

Starts from the end (pat[m]) and proceeds down to stop + 1.
Skips the region from start to stop entirely.
Resumes from start - 1 down to pat[1] if needed.

This segmentation ensures that the known-matching region pat[start ... stop] is never re-compared against the text.

Two Cases To Consider

There are two main scenarios in which Galil’s optimisation is applied:

Good Suffix Shift
- If a mismatch occurs at some position $k$ , and the good suffix rule suggests a shift such that good_suffix[k+1] = p > 0
- Then it is known that pat[p-m+k+1...p] matches txt[j+k...j+m-1]
- Therefore, comparisons in this region can be skipped in the next scan, and the variable are set as:
  - start = p - m + k + 1
  - stop = p

Matched Prefix Shift
- If the shift is based on a matched prefix, (when good_suffix[k+1] = 0),
- Then the prefix pat[1...matched_prefix[k+1] matches txt[j+m-matched_prefix[k+1]...j+m-1]
- Therefore, comparisons in this region can be skipped in the next scan, and the variable are set as:
  - start = 1
  - stop = matched_prefix[k+1]

Galil’s optimisation ensures that Boyer-Moore does not re-check character pairs that are already known to match. By efficiently skipping over these regions, the algorithm maintains its worst-case time complexity of $O (n)$ after an initial $O (m)$ pre-processing phase.

Worst-Case Example with Galil's Optimisation

Let text[1...10] = aaaaaaaaaa and pat[1...3] = aaa.

As before, every valid alignment between pat and text results in a complete match. Without optimisation, Boyer-Moore shifts by m - matched_prefix[2] = 1 each time, re-checking almost every character.

With Galil’s optimisation, after the first match is found, the region pat[1...matched_prefix[2]] = pat[1...2] is known to match the corresponding region in the text. So on the next shift:

Set start = 1 and stop = matched_prefix[2] = 2

During the right-to-left scan, we skip comparing pat[1...2], since it’s guaranteed to match

The scan only needs to compare pat[3] with txt[j + 2], and then immediately confirms another full match

This process repeats for each alignment: only one comparison (the rightmost character) is needed at each step, instead of 3.

As a result, the algorithm still explores all alignments, but now does so in linear time. The number of character comparisons is reduced from $O (mn)$ to $O (n)$ — achieving the optimal performance Boyer-Moore promises in theory.

Quartz 4

Explorer

Implementing The Boyer-Moore Algorithm

Overview Of The Algorithm

Galil’s Optimisation

Implementing Galil’s Optimisation

Two Cases To Consider

Graph View

Table of Contents

Backlinks