Bioinformatics lecture notes

Introduction to exact string matching

Introduction

One of the most basic string operations is finding an exact copy of one string (often called the query or pattern) within a longer string (called the text or reference). Simple examples of this operation arise in pretty much all text-related applications, such as finding a word or short phrase within a webpage or document. There are also many such examples in biology. In the previous chapter we discussed finding frequent k-mers that may indicate the location of the origin of replication. Once we have found a frequent k-mer, locating its occurrences within the genome is a variant of exact matching. Also, restriction enzymes (proteins that 'cut' DNA at specific locations in the genome) recognize exact patterns within a genome. Finding all such restriction sites is thus another instance of exact matching.

In this section we will try to develop fast algorithms for this seemingly simple problem. First, let's explore the following question:

Given a pattern (query), does it exist in the text (reference)?

For example, assume text = ATTCACTATTCGGCTAT and pattern = GCAT. Before reading any further, try to design an algorithm that will find all locations of the pattern in the text. How "expensive" is your algorithm?

The cost of an algorithm

As a quick aside – in computer science we generally measure the 'cost' of an algorithm in terms of two main commodities: the time and the memory needed to execute it. To simplify calculations at this point, assume that each object (letter, integer, etc.) uses up one memory location and that the only 'expensive' operation is the comparison of two characters. The run time of the algorithm is then expressed as the number of comparison operations. Also, as a convention throughout the class, assume the text has length n and the pattern has length m.

Naïve string matching approach

The simplest algorithm for exactly matching one string against another works as follows. Align the pattern with the start of the text and compare the characters one by one. If a mismatch occurs, shift the pattern one character to the right and repeat the procedure. Terminate when either a match is found, or not enough characters remain in the text.

Here is the algorithm in pseudo-code:

for i = 1, n - m + 1 :
  match = true
  for j = 1, m :
    if P[j] != T[i+j-1] :
      match = false
      break
  if match :
    report match found at position i
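The pseudo-code above translates almost directly into Python; here is one possible sketch (note that Python strings are 0-indexed, so reported positions are shifted by one relative to the 1-indexed pseudo-code):

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return all 0-based positions where pattern occurs exactly in text."""
    n, m = len(text), len(pattern)
    matches = []
    for i in range(n - m + 1):          # each candidate alignment of the pattern
        for j in range(m):              # compare character by character
            if pattern[j] != text[i + j]:
                break                   # mismatch: shift the pattern by one
        else:                           # inner loop completed: full match found
            matches.append(i)
    return matches

print(naive_match("ATTCACTATTCGGCTAT", "ATT"))   # [0, 7]
print(naive_match("ATTCACTATTCGGCTAT", "GCAT"))  # [] -- no occurrence
```

Python's `for ... else` clause plays the role of the `match` flag in the pseudo-code: the `else` branch runs only if the inner loop finished without hitting `break`.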

Now, let us try to analyze this algorithm. First, its memory usage: we store only the pattern and the text (n + m memory locations) and a few additional variables (i, j, n, m), for a total of n + m + 4 memory locations. Since the memory usage is proportional to the total size of the inputs, we say that our algorithm has 'linear memory usage'.

How about the run time? A quick back-of-the-envelope calculation tells us that the comparison P[j] != T[i+j-1] is executed m times in the inner for loop, which is itself repeated n − m + 1 times in the outer for loop, for a total of (n − m + 1)·m = nm − m² + m operations.

If we just want to coarsely compare the performance of different algorithms, we do not really need this much detail; it suffices to focus on the largest term of the equation above and say that our algorithm has a runtime 'order of' nm, also written as O(nm). This kind of reasoning is called 'asymptotic analysis' and focuses on just the behavior of the algorithm as the size of the inputs grows towards infinity. A full description of this approach is beyond the scope of this class, and I encourage you to refresh your memory about asymptotic analysis by reviewing materials from earlier classes.

Question: Does the algorithm above always perform nm − m² + m operations?

Question: Assume we just want to find the first match. Do you ever need nm − m² + m operations? If yes, when?

The answer depends on whether we stop as soon as we find a match, or continue until we have found all matches of the pattern within the text. In the latter case we have to examine every alignment to make sure we do not miss any matches. In the former, the algorithm stops as soon as the first match is found, possibly using far fewer operations.

The worst case scenario occurs when the pattern 'almost' matches the text, i.e., we have to compare characters all the way to the end of the pattern before figuring out that the pattern does not match. See the example below:

1               n
AAAAAAAAAAAAAAAAA
AAAAT (m comparisons)
 AAAAT (m comparisons)
  AAAAT (m comparisons)
       ... (repeat for all n − m + 1 alignments)

After successfully matching the first m − 1 As, the mismatch at the last position of the pattern forces us to start from scratch at the next alignment.

OK, so is this runtime the best we can do? Is it good enough for our typical applications? Let's imagine we want to match a 1,000 bp sequence against the 3 billion bp human genome. The total runtime would be on the order of 3·10¹² operations. On a 3 GHz processor (and assuming each comparison takes exactly one CPU cycle), the calculation would take 1,000 seconds, or about 16 minutes. Given that a typical sequencing experiment generates tens to hundreds of millions of sequences, this runtime is clearly too slow for even the simplest applications.

The following sections describe several approaches for speeding up the matching procedure.
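Before moving on, it is worth verifying the worst-case count empirically. The sketch below instruments the naïve algorithm to count character comparisons on the almost-matching example above (the helper function and its name are my own):

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Run the naive matcher and return the number of character comparisons."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1            # count every character comparison
            if pattern[j] != text[i + j]:
                break
    return comparisons

n, m = 17, 5
text = "A" * n                   # worst case: the pattern almost matches everywhere
pattern = "A" * (m - 1) + "T"    # m - 1 matches, then a mismatch, at every alignment
print(count_comparisons(text, pattern))   # 65, i.e. (n - m + 1) * m = 13 * 5
```

Every one of the n − m + 1 alignments performs exactly m comparisons before the final mismatch, matching the nm − m² + m formula (17·5 − 25 + 5 = 65).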