The KMP algorithm
As mentioned earlier, the Z algorithm was developed by Dan Gusfield as a way to teach string algorithms. The ideas we just presented should help guide you through the approach used by the Knuth-Morris-Pratt (KMP) string matching algorithm.
To develop the basic intuition about this algorithm, let us start with the naïve string matching algorithm described at the beginning of this chapter. The matching process compares the pattern to the text until either a match or a mismatch is found. In case of a mismatch, the algorithm shifts the pattern position by one and restarts all the comparisons. But can we instead reuse some of the information we learned while matching the pattern to the text? Let us look at the example below, matching the pattern A B C X A B C D E against a text:
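For reference, the naïve matcher just described can be sketched as follows (a minimal Python sketch with 0-based indexing; the function name naive_match is our own choice, not from the text):

```python
def naive_match(text, pattern):
    """Return all start positions (0-based) where pattern occurs in text."""
    n, m = len(text), len(pattern)
    matches = []
    for shift in range(n - m + 1):
        j = 0
        # compare pattern to text until a mismatch or a full match
        while j < m and text[shift + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(shift)
        # on a mismatch, we shift by one and restart all comparisons
    return matches
```

Note how a mismatch throws away everything learned at the current shift; the rest of this section is about avoiding exactly that waste.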
At position i in the text, the naïve algorithm starts by matching A B C X A B C, then mismatches on character D in the pattern, aligned to character X in the text. The intuition behind the KMP algorithm is that we can now simply shift the pattern over until its prefix matches a previously matched section of the text, as highlighted on the third row of the example where the pattern shifted four positions. Thus, we can potentially save a lot of computation (comparisons) by not trying to match the pattern earlier (e.g., the naïve algorithm would shift the pattern by 1 character and try to re-match, only to find a mismatch between the first A in the pattern and character B at position i+1 in the text).
Let us now formalize this approach. We will define a new function, sp_i, which is defined for every position i in the pattern. For every i, sp_i is the length of the longest non-trivial (proper) suffix of P[1..i] that matches exactly a prefix of the pattern.
As in the case of the Z algorithm, we will assume for now that the sp_i values can be computed efficiently. Can we use these values to speed up the alignment? The basic idea is to shift the pattern i - sp_i positions (rather than 1) once we mismatch at position i + 1, such that the sp_i-long prefix of the pattern is aligned to the corresponding matching suffix (see figure below). The matching continues from where it stopped (the position in the text aligned to the X in the figure), as we already know the shaded region matches the text.
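Under this assumption, the matching loop can be sketched as follows. This is a Python sketch with our own names; sp_naive is a deliberately brute-force placeholder for the sp computation (the efficient computation is developed later in this section), and sp[i] stands for sp_i with sp[0] = 0:

```python
def sp_naive(pattern):
    """Brute-force sp values: sp[i] = length of the longest proper suffix
    of pattern[:i] that is also a prefix of pattern. Placeholder only;
    the linear-time computation comes later in the chapter."""
    m = len(pattern)
    sp = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i - 1, 0, -1):  # proper suffix lengths, longest first
            if pattern[i - k:i] == pattern[:k]:
                sp[i] = k
                break
    return sp

def kmp_search(text, pattern):
    """Match pattern against text; on a mismatch after i matched
    characters, fall back to sp[i] (i.e., shift the pattern by i - sp[i])
    instead of restarting, and never re-examine matched text."""
    sp = sp_naive(pattern)
    matches = []
    i = 0  # number of pattern characters currently matched
    for pos, c in enumerate(text):
        # on a mismatch, the sp[i]-long prefix is already known to align
        while i > 0 and pattern[i] != c:
            i = sp[i]
        if pattern[i] == c:
            i += 1
        if i == len(pattern):
            matches.append(pos - len(pattern) + 1)
            i = sp[i]  # keep going to find further occurrences
    return matches
```

The outer loop advances through the text exactly once, which is the observation behind the efficiency argument below.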
Question: Are we missing any matches when shifting so far?
To convince ourselves (i.e., prove) that the shifts are correct, i.e., we are not skipping any possible matches, we proceed with a proof by contradiction. Assume that there is another match between the original position of the pattern and the new one (which would be missed if we just jump along). The figure below shows this situation.
Assume the completely shaded pattern matches the text fully. As a corollary, the prefix of the pattern matches within the text in the same region where the pattern originally matched (the region before the X at the first position of the pattern), a region that also matches the suffix of P[1..i] represented by the sp_i value. This is clearly a contradiction with our initial assumption that sp_i is the maximal such value. QED
Question: Is the KMP algorithm efficient?
Let us concentrate on the number of comparisons we make. It should be obvious that once a character in the text has been matched with a character in the pattern, we never examine that specific character again. At the same time, the pattern can mismatch the same character in the text (the one aligned to the X in the figure) multiple times. Notice, however, that every time we hit a mismatch, we shift the pattern by at least one position (i - sp_i ≥ 1), thus the number of shifts (and therefore mismatch comparisons) is bounded by the total length of the text. To sum it up, during our algorithm we encounter at most n correct matches, and we shift the pattern at most n times, for a total run time of 2n, i.e., the KMP algorithm is efficient. The memory usage is also good – we only use an additional m memory locations to store the sp values.
The only bit remaining is computing the sp values in linear time. The simplest approach is to use the Z algorithm, as there should be obvious similarities between the two concepts. Before you attempt to write such an algorithm, let us make a small change to the KMP algorithm. Let us define a new sp' array, where sp'_i is the length of the longest suffix of P[1..i] that matches a prefix of P, with the additional condition that P(i+1) ≠ P(sp'_i + 1) (the character after the matching suffix-prefix is different).
You can quickly check that running the KMP algorithm using the sp'_i values is equally correct and efficient. The new values are, however, easier to compute using the Z values.
Note: The sp' values highlight an interesting aspect – we are essentially being smarter about predicting what character is being aligned next (or, more precisely, what character isn't aligned next).
Write out the algorithm for computing the sp' values using the Z values.
Explain why the sp' values are easier to compute using Z values than the original sp values.
Implement, in your favorite language, the algorithms described above.
The genomes of many bacteria are circular. In order to represent the data in a linear form, genome assembly programs pick a random location in the genome to break the circle. Thus, it is possible that running the same program multiple times would yield different answers, corresponding to different circular rotations of the same string. Using the Z values described in class, describe a simple algorithm that identifies whether two DNA strings are circular rotations of each other. For example, string ACATTCAT is a circular rotation of string TTCATACA. (Hint: this algorithm is a simple extension of the algorithm used to perform string matching based on Z values.)
Define one possible relationship between sp_i and sp'_i. Can you define a similar relationship between the Z values and the sp' values?
The Z algorithm and its use in the computation of the KMP sp' values have served their pedagogical role and have hopefully provided you with a better understanding of the way we reason about string matching. Below we will describe the original KMP algorithm (specifically, the computation of the sp values). While we can clearly achieve the same results using the Z algorithm, the KMP approach will prove useful as we explore other aspects of string matching, such as the ability to simultaneously match multiple strings.
As you should already expect by now, the KMP algorithm achieves efficiency by reusing the results of earlier computations. This general optimization technique is called memoization and will pop up later, most prominently when we discuss dynamic programming.
Thus, let us assume that we are trying to compute the value sp_{i+1} and that we already know the values sp_j for all j ≤ i. Also, let us denote by c the character at position i + 1 in the pattern (c = P(i+1)), and by x the character at position sp_i + 1 (x = P(sp_i + 1)), i.e., the character that follows the prefix of P that matched the suffix ending at i. If the two characters match (c = x) we can simply extend the matching prefix/suffix pair by one character to obtain sp_{i+1} = sp_i + 1. What happens, however, if c ≠ x? In this case sp_i provides us with no additional information, but we might be able to use sp_{sp_i}, the length of the longest suffix of P[1..sp_i] that matches a prefix of P. Here we will compare c with x' = P(sp_{sp_i} + 1), the character following this new prefix (see figure below). If they match, sp_{i+1} = sp_{sp_i} + 1. If they don't, we simply continue trying further sp values until we either find a match, or we 'bottom out', indicating no such prefix exists. In the latter case we simply set sp_{i+1} = 1 if P(i+1) = P(1), or sp_{i+1} = 0 otherwise.
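The recursion just described can be written as a short loop. Below is a minimal Python sketch under our own conventions (0-based string indexing, sp[i] standing for sp_i, and sp[0] = sp[1] = 0):

```python
def compute_sp(pattern):
    """Compute sp[i] = length of the longest proper suffix of pattern[:i]
    that matches a prefix of pattern, reusing earlier sp values."""
    m = len(pattern)
    sp = [0] * (m + 1)
    for i in range(1, m):
        c = pattern[i]          # c: the character at position i + 1 (1-based)
        k = sp[i]               # candidate prefix length to extend
        # if c mismatches the character after the candidate prefix,
        # recurse to shorter matching prefixes via earlier sp values
        while k > 0 and pattern[k] != c:
            k = sp[k]
        if pattern[k] == c:
            sp[i + 1] = k + 1   # extend the prefix/suffix match by one
        else:
            sp[i + 1] = 0       # bottomed out: no matching prefix exists
    return sp
```

The `while` loop is exactly the chain sp_i, sp_{sp_i}, ... from the text; the charging argument below explains why its total cost stays linear.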
The correctness of this algorithm should be fairly obvious. Its efficiency, however, is unclear. At every position in the pattern we might have to recurse through many sp values before being able to compute the value of sp_{i+1}. For now we will omit a proof of the linear run time, which can be found in (Gusfield 1997) pages 50-51. The intuition behind the proof, however, is that we can 'charge' every recurrence to a previously matched character in the pattern. Thus, while any particular computation may 'stall' for a number of steps, the total number of recurrences will not exceed the length of the pattern, yielding an overall linear run time. Put in a different way, the proof relies on the intuition that sp_{i+1} ≤ sp_i + 1, i.e., sp values grow slowly even if they can decrease fast. The total number of times we can iterate at any round is, thus, bounded by the amount of growth we have achieved so far, which is ultimately bounded by the length of the pattern. A useful analogy is your bank account - by the end of the year you cannot spend more than the amount that was in the account at the beginning of the year (here the work spent computing the first sp value) plus the amount of money you accumulated in bi-weekly payments during the year (the slow growth in sp values that only happens when the character at position i + 1 matches somewhere earlier in the pattern).