Bioinformatics lecture notes
  1. Exact string matching

Semi-numerical matching

Let's revisit the naïve algorithm presented earlier:

for i = 1, n - m + 1:
  for j = 1, m:
    if P[j] != T[i+j-1]:
      goto NOMATCH                        // mismatch: move on to the next position i
  report match found @ position i
  NOMATCH:

It's unlikely you would write your own code like that, at least not in a modern programming language; instead, the inner loop would be hidden inside a string comparison:

for i = 1, n - m + 1:
  if P == T[i..i+m-1]:                  // the length m substring of T starting at i
    report match found @ position i
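As a concrete, runnable sketch of this version (Python, 0-based indexing; the function name is just illustrative), the slice comparison hides the character-by-character inner loop:

```python
def naive_find_all(P, T):
    """Report every position where pattern P occurs in text T (0-based)."""
    m, n = len(P), len(T)
    # The slice comparison T[i:i+m] == P still costs O(m) per position.
    return [i for i in range(n - m + 1) if T[i:i + m] == P]
```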

It is important to recognize that simple operators such as '==' can entail a significant computational cost, depending on what objects are being compared. Here, P == T[i..i+m-1] compares two strings, and this comparison requires checking that all characters in the two strings match each other, thus the runtime of this simple operator is O(m), i.e., proportional to the length of the strings being compared.

In the case of integers, however, a test such as x == y can be performed in O(1), since such tests are natively implemented in the microprocessor's hardware (for completeness: this is only true for integers that fit within a processor's "word" size). Perhaps we could speed up string matching by simply converting the strings being matched to integers.

Let's assume we have a function toInt that converts a string into an integer representation. You've seen an example in the previous chapter, where each DNA letter was converted to a 2-bit number. Using this function, we could change the code as follows:

for i = 1, n - m + 1:
  if toInt(P) == toInt(T[i..i+m-1]):  
    report match found @ position i
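For concreteness, here is one possible sketch of such a toInt function in Python, using the 2-bit encoding from the previous chapter (A=0, C=1, G=2, T=3); the names are illustrative:

```python
# 2-bit encoding of the DNA alphabet (as assumed in the text).
ENCODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def to_int(s):
    """Interpret a DNA string as a base-4 number."""
    value = 0
    for c in s:                 # visits every character: O(m) per call
        value = value * 4 + ENCODE[c]
    return value
```

Note that this function touches every character of its argument, which is exactly the hidden cost discussed below.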

Now the == operator is applied to two integers, thus the comparison is O(1), for a total cost of the algorithm of O(n) time - a huge improvement over the original O(mn) time of the naïve string matching algorithm.

Question: Has the use of the toInt function truly solved the exact matching problem?

That's actually not entirely accurate, since we've simply shifted the cost from the comparison operator to the toInt function. The conversion from DNA letters to integers requires examining every character in the string to find its corresponding 2-bit encoding, thus the runtime of toInt is O(m), so the pseudo-code described above is still running in O(mn) time.

Note that the cost of processing the pattern, toInt(P), can be moved out of the loop since the pattern doesn't change:

intPattern = toInt(P)
for i = 1, n - m + 1:
  if intPattern == toInt(T[i..i+m-1]):  
    report match found @ position i

However, the corresponding section of the text changes from iteration to iteration, thus toInt must be executed again for a different string each time.

To figure out how to speed up the computation of toInt, we need to look more closely at the relationship between the strings being processed in consecutive iterations of the loop. Let's make things concrete by assuming a pattern of length 4 and the following portion of the text:

...TAC**AGGT**ACGATAC...

The four characters bolded above (AGGT) represent the string being processed in iteration i of the algorithm. At iteration i + 1, the substring being processed will be GGTA (the next 4 characters in the text):

...TACA**GGTA**CGATAC...

It should be easy to see that the only characters that change are the first and the last characters. The string GGT is the same in both iterations, and the substring analyzed at iteration i + 1 is simply the substring from iteration i after removing its first character and adding a new character at the end. In other words, we make a constant number of changes to the substring being analyzed between iterations. Perhaps we can leverage this property to speed up the computation of toInt.

Let us assume that we are using the simple 2-bit encoding of DNA letters described in the previous chapter, and let's write out the actual integer values of the two strings described above:

toInt(AGGT) = toInt(A)·4^3 + toInt(G)·4^2 + toInt(G)·4 + toInt(T)

toInt(GGTA) = toInt(G)·4^3 + toInt(G)·4^2 + toInt(T)·4 + toInt(A)

Where toInt(A) = 0 (binary 00), toInt(C) = 1 (01), toInt(G) = 2 (10), and toInt(T) = 3 (11). Another way to think about this is that we are looking at strings of DNA letters as if they were base-4 numbers.
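Since DNA strings behave exactly like base-4 numbers under this encoding, we can even let the language's own base conversion do the work. A small Python sketch (names are illustrative):

```python
# Map each DNA letter to its base-4 digit (A=0, C=1, G=2, T=3).
DIGITS = str.maketrans("ACGT", "0123")

def dna_value(s):
    """Value of a DNA string viewed as a base-4 number."""
    return int(s.translate(DIGITS), 4)
```

For example, dna_value("AGGT") evaluates to 0·64 + 2·16 + 2·4 + 3 = 43.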

We can now use simple arithmetic to relate the two values to each other:

toInt(GGTA) = (toInt(AGGT) - toInt(A)·4^3)·4 + toInt(A)

To put this in words, we remove the contribution to the integer due to the first A in the substring, shift the three characters shared by the two substrings to the left (i.e., their significance increases by a factor of 4), then add the new character A to the right of the string. Please verify that the arithmetic works out. Try it for several iterations of the algorithm.
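One quick way to verify the arithmetic is a numeric check; the sketch below assumes the 2-bit encoding from the text (A=0, C=1, G=2, T=3) and illustrative helper names:

```python
# Verify the rolling-update identity on the example from the text.
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def to_int(s):
    v = 0
    for c in s:
        v = v * 4 + ENC[c]
    return v

old = to_int("AGGT")                         # window at iteration i
# Drop the leading A, shift left by one base-4 digit, append the new A:
new = (old - ENC['A'] * 4**3) * 4 + ENC['A']
assert new == to_int("GGTA")                 # window at iteration i + 1
```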

Thus, we can change our code to something like this:

intPattern = toInt(P)
intText = toInt(T[1..m])                  // first substring of the text
for i = 1, n - m + 1:
  if intPattern == intText:  
    report match found @ position i
  // now we shift to a new substring
  if i < n - m + 1:                       // no shift needed after the last position
    intText = (intText - toInt(T[i]) * pow(4, m-1)) * 4 + toInt(T[i+m])

Here, the function pow(4, m-1) computes 4^(m-1). Also, toInt(c) can be implemented as a simple table lookup for any individual character c.

Note that in the new implementation, every operation within the loop can be computed in O(1) time, thus the overall runtime of the algorithm is O(n).
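The pseudo-code above can be fleshed out as a runnable sketch (Python, 0-based indexing, illustrative names). Python integers don't overflow, but in a lower-level language this exact-hash version would require the pattern to fit within a machine word:

```python
# Rolling 2-bit hash matcher (exact hash, no modulus).
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def find_all(P, T):
    m, n = len(P), len(T)
    if m > n:
        return []
    high = 4 ** (m - 1)                    # weight of the leftmost character
    int_pattern = int_text = 0
    for j in range(m):                     # hash the pattern and the first window
        int_pattern = int_pattern * 4 + ENC[P[j]]
        int_text = int_text * 4 + ENC[T[j]]
    matches = []
    for i in range(n - m + 1):
        if int_pattern == int_text:
            matches.append(i)              # report 0-based match position
        if i + m < n:                      # shift the window right by one character
            int_text = (int_text - ENC[T[i]] * high) * 4 + ENC[T[i + m]]
    return matches
```

Every operation inside the loop is constant-time, so the whole scan is O(n) after the O(m) setup.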

EXERCISE: Implement this approach using bit-wise operations instead of the arithmetic formula described above.

Advanced aside

As described, the algorithm above only works for relatively short strings, since its efficiency depends on the hash value fitting within a computer word. For modern 64-bit architectures, this algorithm only works well for patterns up to 32 letters in length. One way around this limitation is to give up on the one-to-one property of the hash function, e.g., by computing toInt modulo some number (typically a large prime) chosen so that the result fits within a computer word. Doing so will ensure that the hash value fits within a computer word irrespective of how long the pattern is. However, it is now possible for two different strings to be encoded into the same hash value. The search algorithm must, then, be modified to account for this possibility:

intPattern = toInt(P) mod q               // q: a large prime that fits in a word
intText = toInt(T[1..m]) mod q            // first substring of the text
for i = 1, n - m + 1:
  if intPattern == intText:  
    if P == T[i..i+m-1]:                  // additional check needed
      report match at position i 
  // now we shift to a new substring, keeping the hash value modulo q
  if i < n - m + 1:
    intText = ((intText - toInt(T[i]) * pow(4, m-1)) * 4 + toInt(T[i+m])) mod q

Thus, we don't completely avoid performing O(m) operations within the loop, but these operations would only be performed rarely, when the hashes of the pattern and the corresponding substring of the text are the same.
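A Python sketch of this modular variant follows; the prime q below is an illustrative choice (any modulus keeping values within a word would do), and the function name is made up for this example:

```python
# Rolling hash kept modulo a large prime, with verification on hash hits.
ENC = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
Q = (1 << 61) - 1                          # an illustrative Mersenne prime < 2^64

def find_all_mod(P, T):
    m, n = len(P), len(T)
    if m > n:
        return []
    high = pow(4, m - 1, Q)                # weight of the leftmost character, mod Q
    hp = ht = 0
    for j in range(m):                     # hash the pattern and the first window
        hp = (hp * 4 + ENC[P[j]]) % Q
        ht = (ht * 4 + ENC[T[j]]) % Q
    matches = []
    for i in range(n - m + 1):
        if hp == ht and P == T[i:i + m]:   # verify: different strings may collide
            matches.append(i)
        if i + m < n:                      # shift the window, staying modulo Q
            ht = ((ht - ENC[T[i]] * high) * 4 + ENC[T[i + m]]) % Q
    return matches
```

The explicit string comparison runs only on hash hits, so the expected cost stays close to O(n) when collisions are rare.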

EXERCISE: Implement the algorithm outlined above and run it on a range of randomly-constructed patterns and texts. How often do the hashes match each other but the corresponding strings do not? How long does it take to execute this algorithm in comparison to the naïve string matching algorithm and to the simpler version of the algorithm that only works for short patterns?



We mentioned in the previous chapter that the conversion of a DNA string into an integer is a form of hashing. A hash function that can be computed in the fashion described above as we shift a "window" across a string is called a rolling hash, and the algorithm we have just described is called the Karp-Rabin algorithm, after its developers, Richard Karp and Michael Rabin.