The Boyer-Moore algorithm

Matching backwards may enable new optimizations

As described in the section on the KMP algorithm, the naïve string matching algorithm can be sped up by leveraging the self-similarity between segments of the pattern. One disadvantage of this approach is that the amount of "speed-up" at any iteration is bounded by the amount of "work" the algorithm did during that iteration, i.e., the number of characters of the pattern that were matched against the text before a mismatch was encountered. This behavior is a direct consequence of the fact that the matching proceeds in a "forward" direction, moving from left to right along the pattern and the text. While the pattern can be processed prior to matching to reveal useful self-similarities, the composition of the text is unknown until characters of the pattern are aligned to it. Thus, the KMP algorithm cannot anticipate what happens in the text at later positions, limiting the length of the shifts it can make.

Note, however, that there's absolutely no reason why the search should process the pattern and text in a left-to-right manner. For example, let's rewrite the naïve string matching algorithm encountered in the introduction to string matching to the following form:

for i = 1, n - m + 1:
  for j = m, 1:
    if P[j] != T[i + j - 1]:
      goto NOMATCH
  report match found @ position i
  NOMATCH:

The only thing we changed was the second line, modified to iterate the variable j from the right end (m) to the left end (1) of the pattern. You can easily see that this algorithm still identifies all possible matches of the pattern to the text.
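The reversed scan is easy to express in a conventional language. Here is a short Python sketch of the pseudo-code above (function name and 0-based indexing are ours; reported positions therefore start at 0):

```python
def naive_match_rtl(P, T):
    """Naive string matching, comparing the pattern right to left.

    Returns the 0-based starting positions of all occurrences of P in T.
    Same worst-case cost as the left-to-right version; only the scan
    direction differs.
    """
    n, m = len(T), len(P)
    matches = []
    for i in range(n - m + 1):          # each candidate alignment
        for j in range(m - 1, -1, -1):  # compare from the right end
            if P[j] != T[i + j]:
                break                   # mismatch: try the next alignment
        else:
            matches.append(i)           # all m characters matched
    return matches
```

For example, `naive_match_rtl("aba", "ababa")` returns `[0, 2]`, exactly the positions the left-to-right version would report.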

Does changing the direction in which we check for the match give us new opportunities for optimization? The intuition is that, by starting to match at the end of the pattern, we are essentially peering into the future along the text (by future, meaning that we see characters that haven't yet been reached by the beginning of the pattern). If we find a mismatch, we may be able to shift by an amount that is no longer bounded by the amount of matches we made, potentially skipping over the entire length of the pattern before trying again. That's the key idea that the Boyer-Moore algorithm is trying to exploit.

The rules for shifting

More precisely, the Boyer-Moore algorithm defines what happens when a mismatch is found, summarized in two rules:

  • The bad character rule. Let x be the character in T that mismatches character P[i] in the pattern. Shift P to the right until the closest instance of x to the left of position i in P is aligned with that character of T. A match between P and T cannot occur unless all characters in P match the corresponding characters in T; by shifting this way, we make sure that at least the character that previously mismatched will now match.

  • The good suffix rule. Shift P to the right until a substring of P matches the segment of T that matched the suffix of P prior to the mismatch. The same idea as in the bad character rule applies here: once we have identified a partial match between the pattern and the text, we shift such that the corresponding segment of the text still matches the pattern.

A graphical representation of these heuristics is shown below:

Bad character rule. Pattern P is matched to the text T starting from the right, the blue box representing the section of the pattern that matches the text, with a mismatch between character C in the pattern and character X in the text. The third row represents a shift of the pattern such that the rightmost X in the pattern now matches the X in the text.
Good suffix rule. Pattern P is matched to the text starting from the right, with blue boxes in the pattern and the text indicating the matching segment. The third row represents the pattern after a shift such that a segment inside the pattern matches the previously matched segment in the text.

Note that if neither rule applies, i.e., if there is no instance of character x in the pattern to the left of the mismatch position, or if the matched suffix of the pattern doesn't occur anywhere else within it, we can simply shift the pattern past the point where the mismatch occurred.

Clearly, we need to preprocess the pattern in order to make these operations possible. For the "bad character" rule, we will need to record the positions of all characters in the pattern. More specifically, we will construct an array R[x, i] which stores the rightmost position less than i where character x occurs in P. This information can be computed in linear time.
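As a concrete illustration, here is a Python sketch of a commonly used simplification of the bad character rule that stores only the rightmost occurrence of each character (one entry per character, rather than the full two-dimensional R[x, i]); the function names are ours. Note that when the rightmost occurrence of x lies at or to the right of the mismatch position, this simplified table cannot locate an earlier occurrence, so it conservatively shifts by 1, whereas the full R[x, i] array could shift further:

```python
def rightmost_occurrence(P):
    """Map each character of P to the 0-based index of its rightmost
    occurrence. Characters absent from P are simply missing from the map."""
    occ = {}
    for idx, ch in enumerate(P):
        occ[ch] = idx   # later occurrences overwrite earlier ones
    return occ

def bad_char_shift(occ, j, x):
    """Shift suggested when text character x mismatches P[j] (0-based).

    If the rightmost occurrence of x lies to the left of j, align it
    with x; if x does not occur in P at all, shift past the mismatch;
    otherwise fall back to the minimum shift of 1.
    """
    return max(1, j - occ.get(x, -1))
```

For P = "abcab", a mismatch of text character "z" against P[4] yields a shift of 5 (past the whole pattern), while a mismatch of "c" at the same position yields a shift of 2 (aligning the "c" at index 2).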

Question: Outline an algorithm that computes R[x, i]. Note that a trivial algorithm would run in time O(mΣ), where m is the length of the pattern and Σ is the size of the alphabet (e.g., 4 for DNA or 26 for English text). Can you implicitly store the information contained in the array R such that you use less memory, e.g., O(m) only, while still being able to efficiently determine the shift?

To implement the "good suffix" rule, we take inspiration from the KMP algorithm and define an array that records the self-similarity between segments of the pattern, focusing on the suffix of the pattern. Specifically, we define L'[i] to be the largest value less than m such that P[i .. m] matches a suffix of P[1 .. L'[i]] and P[i - 1] mismatches the character preceding the corresponding suffix. You can see the parallel between the sp' values used in KMP and the L' values in Boyer-Moore. The L' array can be constructed in linear time with the help of the Z algorithm.

Question: Describe an algorithm that uses the Z algorithm to compute, in linear time, the L' array.

Before concluding the discussion of the Boyer-Moore algorithm, it is important to highlight a corner case. It is possible that a prefix of P matches the suffix of the region matched in the text even if that region doesn't match fully inside P. To ensure we do not miss any matches, we want to shift the pattern only up to the point where we identify such a prefix-suffix match. To determine when such situations occur, we need to define one more array, l', as follows: l'[i] = the length of the longest suffix of P[i .. m] that matches a prefix of P. It should be easy to see that the l' array can also be computed in linear time using the Z algorithm.

To clarify the two arrays, here is an example:
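The definitions can be spelled out directly with a brute-force computation (quadratic time; the question above asks for the linear-time construction via the Z algorithm). The sketch below uses 1-based indices for i to match the text, with index 0 and 1 of each array unused:

```python
def good_suffix_arrays(P):
    """Brute-force computation of the L' and l' arrays (1-based, as in
    the text). Quadratic time; a real implementation would use the
    Z algorithm to build both arrays in linear time."""
    m = len(P)
    L = [0] * (m + 1)   # L[i] holds L'[i] for i = 2 .. m
    l = [0] * (m + 1)   # l[i] holds l'[i]
    for i in range(2, m + 1):
        suf = P[i - 1:]              # the suffix P[i .. m]
        k = len(suf)
        # L'[i]: largest j < m such that suf matches a suffix of
        # P[1 .. j] and the character just before that occurrence
        # differs from P[i - 1] (or the occurrence is a prefix of P).
        for j in range(m - 1, k - 1, -1):
            if P[j - k:j] == suf and (j == k or P[j - k - 1] != P[i - 2]):
                L[i] = j
                break
        # l'[i]: length of the longest suffix of P[i .. m] that is
        # also a prefix of P.
        for t in range(k, 0, -1):
            if P[:t] == suf[k - t:]:
                l[i] = t
                break
    return L, l
```

For instance, for P = "abab" this yields L'[2] = 0, L'[3] = 2, L'[4] = 0 (the suffix "ab" starting at i = 3 reappears as the prefix P[1 .. 2]) and l'[2] = 2, l'[3] = 2, l'[4] = 0.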

Putting it all together

To summarize the execution of the Boyer-Moore algorithm, here is the pseudo-code:
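The overall structure is a straightforward extension of the reversed naive matcher: at each alignment, compare right to left, and on a mismatch shift by the largest amount any applicable rule allows. As a minimal runnable sketch, here is a Python version that uses only the simplified bad character rule from earlier (the full shift arithmetic combining R, L', and l' is the subject of the first question below); names and details are ours:

```python
def boyer_moore_simple(P, T):
    """Boyer-Moore search using only the simplified bad character rule.

    Compares the pattern right to left at each alignment and, on a
    mismatch, shifts by the rightmost-occurrence heuristic. Correct,
    but without the good suffix rule it lacks the stronger shift
    guarantees of the full algorithm.
    """
    n, m = len(T), len(P)
    occ = {ch: idx for idx, ch in enumerate(P)}  # rightmost occurrences
    matches = []
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[i + j]:   # scan right to left
            j -= 1
        if j < 0:
            matches.append(i)
            i += 1          # conservative shift of 1 after a full match
        else:
            # align the rightmost occurrence of the mismatched text
            # character, or shift past it; never shift by less than 1
            i += max(1, j - occ.get(T[i + j], -1))
    return matches
```

On large alphabets, the first comparison at each alignment frequently hits a text character absent from the pattern, and the loop jumps ahead by the full pattern length.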

Question: We haven't detailed the arithmetic that determines the length of the shift according to the values stored in the R, L', and l' arrays. Please work out these numbers.

Question: Prove that the algorithm described above does not miss any matches. Unlike KMP, where there was only one "rule" for determining the shift, here you need to consider the interaction between three different rules and make sure that choosing a long shift according to one of the rules doesn't result in jumping over a valid match.

Final thoughts

A natural question is whether the Boyer-Moore algorithm is efficient. Can we prove that it runs in O(n + m) time, as we did for the KMP algorithm? Yes, but the proof is quite complex and requires a careful analysis that reasons about the periodicity of strings, something that is beyond the scope of this chapter. For details, see (Gusfield, 1997), pages 39-47.

It is useful to consider how the alphabet size influences the execution of the Boyer-Moore algorithm. The primary place where alphabet size may be an issue is the bad character rule. As we've already noted above, the size of the R array used to determine the shift is proportional to the alphabet size. At the same time, the larger the alphabet size, the more likely it is that there will not be any instances of a given text character in the pattern, thereby allowing the algorithm to rapidly shift over the entire pattern. In English text, this may occur quite frequently since many letters occur rarely in typical text.

Exercises

  1. Does the bad character rule add anything? In other words, are there situations where the bad character rule allows the algorithm to be faster than if just the good suffix rule was used?

  2. Would it be possible to add a "bad character" rule to the KMP algorithm, perhaps offering speed-ups over those obtained with just the sp array? If yes, please detail an algorithm and justify its efficiency. If not, then argue why such a rule would not be effective.

References

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press. http://books.google.com/books?isbn=0521585198.
