Introduction to exact string matching
Introduction
One of the most basic string operations is finding an exact copy of one string (often called query or pattern) within a longer string (called text or reference). One can easily come up with simple examples of this operation in pretty much all text related applications, such as finding a word or short phrase within a webpage or document. There are also many such examples in biology. In the previous chapter we discussed finding frequent k-mers that may indicate the location of the origin of replication. Once we found a frequent k-mer, finding its locations within the genome is a variant of exact matching. Also, restriction enzymes (proteins that 'cut' the DNA at specific locations in the genome) recognize exact patterns within a genome. Finding all such restriction sites is, thus, another instance of exact matching.
In this section we will try to develop fast algorithms for this seemingly simple problem. First, let's explore the following question:
Given a pattern (query), does it exist in the text (reference)?
For example, assume text = ATTCACTATTCGGCTAT
and pattern = GCAT.
Before reading any further, try to design an algorithm that will find all locations of the pattern in the text. How "expensive" is your algorithm?
The cost of an algorithm
As a quick aside – in computer science we generally compute the 'cost' of an algorithm in terms of two main commodities: time and memory needed to execute the algorithm. To simplify calculations at this point, assume that each object (letter, integer, etc.) uses up one memory location and that the only 'expensive' operation is the comparison of two characters. Thus, the run time of the algorithm would be expressed as the number of comparison operations. Also, as convention throughout the class, assume the text has length n and the pattern length m.
Naïve string matching approach
The most simple algorithm for exactly matching a string to another one works as follows. Start with the first character of the text and the first character of the pattern and compare each character individually. If a mismatch occurs shift the pattern one character and repeat the procedure. Terminate when either a match is found, or not enough characters remain in the text.
Here is the algorithm in pseudo-code:
Now, let us try to analyze this algorithm. First, its memory usage. Clearly we are only storing the pattern and the text (n + m memory locations) and a few additional variables (i, j, n, m), or a total of n + m + 4 memory locations. Since the memory usage is proportional to the total size of the inputs we say that our algorithm has 'linear memory usage'.
If we just want to coarsely compare the performance of different algorithms, we do not really need to go into this much detail, rather it is sufficient to focus on the largest term of the equation above and it suffices to say that our algorithm has a runtime 'order of' nm, also written as O(nm). This kind of reasoning is called 'asymptotic analysis' and focuses on just the behavior of the algorithm as the size of the inputs grows towards infinity. A full description of this approach is beyond the scope of this class and I encourage you to refresh your memory about asymptotic analysis by reviewing materials from earlier classes.
The answer depends on whether we stop as soon as we find a match, or we continue until we have found all the matches of the pattern within the text. In the latter case we will have to do all the comparisons to make sure we do not miss any matches. In the former, as soon as the first match is found the algorithm stops, possibly using a lot fewer operations.
The worst case scenario occurs when the pattern 'almost' matches the text, i.e., we have to compare characters all the way to the end of the pattern before figuring out the pattern doesn't match. See the example below:
After successfully matching the first m-1 As, there's a mismatch at the last position in pattern that requires us to start from scratch.
The following sections describe several approaches for speeding up the matching procedure.
Last updated