The KMP algorithm

Accelerating pattern shifting with sp values

As mentioned earlier, the Z algorithm was developed by Dan Gusfield as a way to teach string algorithms. The ideas we just presented should help guide you through the approach used by the Knuth-Morris-Pratt (KMP) string matching algorithm.

To develop the basic intuition about this algorithm, let us start with the naïve string matching algorithm described at the beginning of this chapter. The matching process compares the pattern to the text until either the full pattern matches or a mismatch is found. In case of a mismatch, the algorithm shifts the pattern position by one and restarts all the comparisons. But can we instead reuse some of the information we learned while matching the pattern against the text? Let us look at the example below, matching the pattern A B C X A B C D E against a text:

1 i              n
XYABCXABCXADCDAFEA
  ABCXABCDE
      ABCXABCDE

At position i in the text, the naïve algorithm starts by matching A B C X A B C, then mismatches on character D in the pattern, which is aligned to character X in the text. The intuition behind the KMP algorithm is that we can now shift the pattern over until a prefix of the pattern matches a suffix of the previously matched section of the text, as highlighted on the third row of the example, where the pattern is shifted four positions. Thus, we can potentially save a lot of computation (comparisons) by not trying to match the pattern at earlier positions (e.g., the naïve algorithm would shift the pattern by 1 character and try to re-match, only to find a mismatch between the first A in the pattern and character B at position i + 1 in the text).
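For concreteness, here is a minimal sketch of the naïve matcher we are trying to improve on (Python; the exact formulation earlier in the chapter may differ slightly, and the function name is ours):

def naive_match(text, pattern):
    """Slide the pattern one position at a time, restarting all comparisons after a mismatch."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):      # try every possible starting position in the text
        i = 0
        while i < m and text[start + i] == pattern[i]:
            i += 1                      # extend the match one character at a time
        if i == m:                      # all m characters matched
            hits.append(start)          # record the (0-indexed) start of the occurrence
    return hits

# In the example above the pattern matches 7 characters but never occurs in full,
# so the naïve matcher returns no hits.
print(naive_match("XYABCXABCXADCDAFEA", "ABCXABCDE"))   # []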

Let us now formalize this approach. We will define a new function, sp[i], defined for every position in the pattern: for every i ∈ (1, m], sp[i] is the length of the longest proper (non-trivial) suffix of P[1..i] that exactly matches a prefix of the pattern.

pattern:	A T T C A C T A T T C G G C T A T
sp[i]:  	0 0 0 0 1 0 0 1 2 3 4 0 0 0 0 1 2
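The table can be double-checked with a short brute-force computation (a quick sanity check only, not the efficient algorithm; Python, with sp stored 1-indexed and sp[0] unused):

def sp_naive(pattern):
    """Brute-force sp values: for each i, try proper suffix lengths from longest to shortest."""
    m = len(pattern)
    sp = [0] * (m + 1)                       # sp[0] unused; sp[1] = 0 by definition
    for i in range(2, m + 1):
        for k in range(i - 1, 0, -1):        # k = candidate suffix length (proper, so k < i)
            if pattern[i - k:i] == pattern[:k]:
                sp[i] = k
                break
    return sp

print(sp_naive("ATTCACTATTCGGCTAT")[1:])
# [0, 0, 0, 0, 1, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 1, 2]  -- matches the table above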

As in the case of the Z algorithm, we will assume for now that the sp[i] values can be computed efficiently. Can we use these values to speed up matching? The basic idea is to shift the pattern by i - sp[i] positions (rather than 1) once we mismatch at position i + 1 of the pattern, so that a prefix of the pattern is aligned to the corresponding matching suffix (see Figure 1). Matching then continues from where it stopped (the position in the text aligned to the X in the figure), as we already know the shaded region matches the text.
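To make the use of the sp values concrete, here is a minimal sketch of the matching loop (Python; the 0-indexed text and pattern together with a 1-indexed sp array are implementation choices, not part of the description above):

def kmp_search(text, pattern, sp):
    """Find all occurrences of pattern in text, given the sp values of the pattern.

    sp[i] (for i = 1..m, sp[0] unused) is the length of the longest proper suffix
    of pattern[1..i] that matches a prefix of the pattern.
    """
    n, m = len(text), len(pattern)
    occurrences = []
    i = 0                                   # number of pattern characters currently matched
    for j in range(n):                      # j scans the text left to right and never backs up
        while i > 0 and text[j] != pattern[i]:
            i = sp[i]                       # mismatch: shift the pattern by i - sp[i] positions
        if text[j] == pattern[i]:
            i += 1                          # the match extends by one character
        if i == m:                          # the full pattern matched, ending at text position j
            occurrences.append(j - m + 1)
            i = sp[m]                       # keep scanning (allows overlapping occurrences)
    return occurrences

Any routine that produces the sp table, such as the brute-force check above or the linear-time computations described later, can be plugged in, e.g., kmp_search(text, pattern, sp_naive(pattern)).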

Question: Are we missing any matches when shifting this far?

To convince ourselves (i.e., prove) that the shifts are correct, i.e., that we are not skipping any possible matches, we proceed by contradiction. Assume that there is another match starting between the original position of the pattern and the new one (a match that would be missed if we simply jump ahead). Figure 2 shows this situation.

Assume the intermediate copy of the pattern (fully shaded in Figure 2) matches the text completely. In particular, its prefix matches the text within the region already matched by the pattern at its original position (the region before the X), which means that a prefix of P matches a suffix of P[1..i] that is longer than sp[i]. This contradicts our assumption that sp[i] is the length of the longest such suffix, so no intermediate match can be missed. QED

Question: Is the KMP algorithm efficient?

Let us concentrate on the number of comparisons we make. Once a character in the text has been matched with a character in the pattern, we never examine that character again. At the same time, the pattern can mismatch the same character in the text (the one aligned to the X in Figure 1) multiple times. Notice, however, that every time we hit a mismatch we shift the pattern by at least one position (since sp[i] < i), thus the number of shifts, and therefore the number of mismatched comparisons, is bounded by the length of the text. To sum up, during the algorithm we encounter at most n correct matches and we shift the pattern at most n times, for a total of at most 2n comparisons, i.e., the KMP algorithm is efficient. The memory usage is also modest: we only need an additional m memory locations to store the sp[i] values.

Exercises

  1. Implement, in your favorite language, the algorithms described above.

  2. The genomes of many bacteria are circular. In order to represent the data in a linear form, genome assembly programs pick a random location in the genome to break the circle. Thus, it is possible that running the same program multiple times would yield different answers, corresponding to different circular rotations of the same string. Using the Z values described in class, describe a simple algorithm that identifies whether two DNA strings are circular rotations of each other. For example, string ACATTCAT is a circular rotation of string TTCATACA. (Hint: this algorithm is a simple extension of the algorithm used to perform string matching based on Z values.)

Direct computation of sp values

The Z algorithm has served its pedagogical role and has hopefully provided you with a better understanding of the way we reason about string matching. Below we describe how the KMP algorithm computes the sp values: first by relating them to the Z values through a variant called sp′, and then directly. While we can clearly achieve the same results using the Z algorithm, the KMP approach will prove useful as we explore other aspects of string matching, such as the ability to simultaneously match multiple strings.

As you should already expect by now, the KMP algorithm achieves efficiency by reusing the results of earlier computations. This general optimization technique is called memoization and will pop up later, most prominently when we discuss dynamic programming.

The only bit remaining is computing the sp[i] values in linear time. The simplest approach is to use the Z algorithm, as there are obvious similarities between the two concepts. Before you attempt to write such an algorithm, let us make a small change to the KMP algorithm. Let us define a new array, sp′, where sp′[i] is the length of the longest proper suffix of P[1..i] that matches a prefix of P and for which P[sp′[i] + 1] ≠ P[i + 1] (the character following the matching prefix differs from the character following the suffix).

You can quickly check that running the KMP algorithm using the sp′ values is equally correct and efficient. The new values are, however, easier to compute using the Z values.

Note: The sp′ values highlight an interesting aspect – we are essentially being smarter about predicting what character is being aligned next (or, more precisely, what character isn't aligned next).

Write out the algorithm for computing the sp′ values using the Z values.

Explain why the sp′ values are easier to compute using the Z values than the original sp values.

Define one possible relationship between sp[i] and sp[i - 1]. Can you define a similar relationship between Z[i] and Z[i - 1]?
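For reference, here is one possible sketch of the algorithm asked for in the first of the prompts above, computing the sp′ values from the Z values of the pattern (the 1-indexed array layout, with index 0 unused, is an assumption):

def sp_prime_from_z(pattern, z):
    """Derive sp' values from Z values.

    Assumes z[j] (for j = 2..m, z[0] and z[1] unused) is the length of the longest
    substring starting at position j that matches a prefix of the pattern.
    """
    m = len(pattern)
    sp_prime = [0] * (m + 1)
    for j in range(m, 1, -1):               # right to left, so for each i the largest Z value wins
        if z[j] > 0:
            i = j + z[j] - 1                # the suffix of P[1..i] starting at j matches a prefix
            sp_prime[i] = z[j]              # by Z maximality, the characters that follow differ
    return sp_prime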

Let us now see how to compute the sp values directly. Assume that we are trying to compute the value sp[i + 1] and that we already know the values sp[j] for all j ≤ i. Let us denote by c the character at position i + 1 in the pattern (c = P[i + 1]), and by x the character at position sp[i] + 1 (x = P[sp[i] + 1]), i.e., the character that follows the prefix of P matched by the suffix ending at i. If the two characters match (x == c), we can simply extend the matching prefix/suffix pair by one character to obtain sp[i + 1] = sp[i] + 1. What happens, however, if x ≠ c? In this case sp[i] provides no additional information, but we may be able to use sp[sp[i]], the length of the longest suffix of P[1..sp[i]] that matches a prefix of P. Here we compare c with x′, the character following this new prefix (see Figure 3). If they match, sp[i + 1] = sp[sp[i]] + 1. If they don't, we continue trying further values sp[sp[...[sp[i]]...]] until we either find a match or we 'bottom out', indicating that no such prefix exists. In the latter case we simply set sp[i + 1] = 1 if c == P[1], and sp[i + 1] = 0 otherwise.
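A sketch of this procedure (Python; the 0-indexed strings and 1-indexed sp array are implementation choices):

def compute_sp(pattern):
    """Compute sp values directly, following the recursion described above."""
    m = len(pattern)
    sp = [0] * (m + 1)                      # sp[0] unused; sp[1] = 0 by definition
    for i in range(1, m):                   # knowing sp[1..i], compute sp[i + 1]
        c = pattern[i]                      # c = P[i + 1] in 1-indexed terms
        k = sp[i]                           # candidate prefix length to extend
        while k > 0 and pattern[k] != c:    # pattern[k] is the character after the candidate prefix (x, then x', ...)
            k = sp[k]                       # fall back to the next shorter suffix-prefix match
        if pattern[k] == c:
            sp[i + 1] = k + 1               # c extends a prefix of length k
        else:
            sp[i + 1] = 0                   # 'bottomed out' and c != P[1]
    return sp

print(compute_sp("ATTCACTATTCGGCTAT")[1:])
# [0, 0, 0, 0, 1, 0, 0, 1, 2, 3, 4, 0, 0, 0, 0, 1, 2]  -- matches the table from earlier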

The correctness of this algorithm should be fairly obvious. Its efficiency, however, is less clear: at any given position in the pattern we might have to recurse through many sp[j] values before being able to compute sp[i + 1]. It turns out that the total amount of work is nonetheless linear in the length of the pattern; we will omit the proof, which can be found in (Gusfield 1997), pages 50-51. The intuition behind the proof is that we can 'charge' every recurrence to a previously matched character in the pattern. Thus, while any particular computation may 'stall' for a number of steps, the total number of recurrences cannot exceed the length of the pattern, yielding an overall linear run time. Put differently, the proof relies on the observation that sp[i + 1] ≤ sp[i] + 1, i.e., sp values grow slowly even though they can decrease quickly. The total number of iterations we can perform at any round is thus bounded by the amount of growth achieved so far, which is ultimately bounded by the length of the pattern. A useful analogy is your bank account: by the end of the year you cannot spend more than the amount that was in the account at the beginning of the year (here, the work spent computing the first sp value) plus the amount of money accumulated in bi-weekly payments during the year (the slow growth in sp values, which only happens when the character at position i + 1 matches a character earlier in the pattern).

Figure 1: Pattern shifting through a suffix-prefix match. A long rectangle represents the text T, with a blue region highlighting the portion of the text that matches the pattern at its first position. The two possible alignment positions for pattern P are indicated by two smaller rectangles, each containing two blue regions marking the "sp box" for the matched region, i.e., the longest proper suffix of P[1..i] that matches a prefix of P. The X in the rectangle for the first position of P marks the first character of the pattern that no longer matches the text. Once the X is encountered, the pattern is shifted to the second position, so that its prefix in the new location aligns with the suffix of the matched region in the previous location.

Figure 2: A hypothetical situation in which the pattern would match the text fully at a position between the two locations selected by the suffix-prefix match. The two possible alignment positions for pattern P are drawn as in Figure 1, each with two blue regions marking the "sp box" for the matched region. An additional blue rectangle, also representing P, is placed between the first and second positions to indicate an intermediate position where the pattern would match the text fully. Aligning the pattern at this intermediate position against its first location (i.e., between the two blue regions) would imply a longer suffix-prefix match, violating our assumption that sp[i] is the length of the longest such match.

Figure 3: Recursion used when computing the sp values for pattern P, depicted as a long white rectangle. The two blue boxes correspond to the suffix-prefix match represented by sp[i], with the prefix ending at position sp[i] and the suffix ending at position i. The letter x at the end of the prefix represents the character after position sp[i], while the letter c at the end of the suffix is the character after position i. If c and x do not match, we turn to the suffix-prefix match represented by sp[sp[i]], shown by the two red boxes located within the first blue box and ending at positions sp[sp[i]] and sp[i], respectively. The same string also occurs at the end of the second blue box, so we now compare c to x′, the character after position sp[sp[i]], to decide whether sp[i + 1] can be computed or we need to iterate further.