Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Z values
  • Using Z values in string matching
  • Linear-time computation of Z values
  • Exercises
  • References
  1. Exact string matching

The Z algorithm

Z values

Another strategy for speeding up matching is to modify the naïve matching approach in a way that avoids doing too much work. If we can somehow reuse our work we might be able to save time. To illustrate this idea we will start with a simple algorithm, the Z algorithm, developed by Dan Gusfield (Gusfield 1997) as an educational tool to help explain the basics of other string matching algorithms.

For a text T, let us define a simple function Z that captures the internal structure of the text. Specifically, for every position i∈(0,n) i \in (0, n)i∈(0,n) from T, we define Z[i]Z[i]Z[i] to be the length of the longest prefix of T[i..n]T[i..n]T[i..n] that matches exactly a prefix of T[0..n]T[0..n]T[0..n]. We also define Z[0]=0Z[0] = 0Z[0]=0. Here is an example:

Text: A T T C A C T A T T C G G C T A T
Z[i]: 0 0 0 0 1 0 0 4 0 0 0 0 0 0 0 2 0

Question: Can Z values help in aligning a pattern to a text?

IMPORTANT: When questions are posed in the text, please spend a few minutes trying to figure them out before reading further. Evidence in the teaching and learning literature shows that even if you can't figure out the answer, the effort you devoted to trying to answer the question will help you better understand the answer and retain this information long term.

Using Z values in string matching

To use Z values as a tool for exact matching, we can construct the following string S=P$TS = P\$TS=P$T, where P and T are the pattern and text, respectively, and $ is a character that does not match any other character in either pattern or text. Let us assume for the moment that we have a black box function that computes the Z values for string S.

Question: Can the Z values computed for the string S defined above help us find matches between P and T?

The answer is yes – it suffices to look at the Z values corresponding to characters in T. Any Z-value equal to the length of P indicates the pattern matches. Specifically, for any i>mi > mi>m, if Z[i] == m, pattern P matches text T starting at position i. (Aside: can Z[i] be greater than m?)

Question: What is the runtime and memory usage of this algorithm?

First, the memory – we clearly need to store the Z values in addition to the pattern and text, for a total of 2(n+m)+12(n+m)+12(n+m)+1 memory locations. Even though we are using more memory than the naïve algorithm, the memory consumption is still linear in the size of the inputs. There might be additional memory required by the algorithm that computes the Z values, but we'll get to that in a moment.

In terms of runtime, again omitting the work done to compute the Z values, we simply need to examine each position in the Z array to find all matches between P and T, or to decide that a match doesn't exist, i.e., the runtime is simply O(n) O(n) O(n) (we only need to examine the Z values corresponding to positions in the text). Here we see a much bigger improvement over the earlier naïve algorithm.

To sum up, if we have available an approach for computing Z values, we can perform exact matching much faster than the naïve algorithm, without using up (too much) more memory.

Of course, this result is predicated on the assumption that the Z values can be computed efficiently.

Question: Can Z-values be computed in linear time and linear space?

Linear-time computation of Z values

Note that one could come up with a naïve algorithm for computing the Z values using essentially the same approach used by the naïve exact matching algorithm. Clearly this approach would also yield a quadratic time algorithm (Aside: write down this algorithm and figure out its complexity in time and memory).

A much faster algorithm can be developed if we cleverly reuse the work done earlier. The intuition is that when computing Z[i]Z[i]Z[i] we can use induction, leveraging the Z[j]Z[j]Z[j] values for j<ij < ij<i.

Let us look at the figure below, where the Z 'boxes' (the part of the string that matches both at the beginning and at position j) are represented as blue boxes. By definition, the length of those boxes is Z[j]Z[j]Z[j]. We assume that there exists a position j<ij < ij<i such that the Z 'box' at position j extends past i, i.e., j+Z[j]>ij + Z[j] > ij+Z[j]>i. The section of the Z 'box' starting at j that extends past i is identical to the same region at the beginning of the string (bracket in the figure), information that we can now use to compute the Z[i]Z[i]Z[i] value.

We can distinguish several cases:

Case 1. The Z 'box' for value k=i–jk = i – jk=i–j does not extend past the blue region (Z[j]Z[j]Z[j] box), i.e.,k+Z[k]<Z[j]k+Z[k] < Z[j]k+Z[k]<Z[j]. In that case we know that Z[i]=Z[k]Z[i] = Z[k]Z[i]=Z[k] . The Z[i]Z[i]Z[i] box cannot extend past that point as the following character (marked with an x in the figure below) is different from the corresponding character in the prefix of the string.

Case 2. The Z 'box' for k extends past the blue region (i.e., k+Z[k]>Z[j]k + Z[k] > Z[j]k+Z[k]>Z[j]). In this case Z[i]=j+Z[j]–iZ[i] = j + Z[j] – iZ[i]=j+Z[j]–i, i.e., the Z[i]Z[i]Z[i] box ends at the end of the Z[j]Z[j]Z[j] box. The reasoning is that the character marked with an x in the figure is different from the character at the end of the Z[j]Z[j]Z[j] box in the prefix of the string (otherwise Z[j]Z[j]Z[j] could have been extended), and thus the Z[i]Z[i]Z[i] box cannot be extended any further.

Case 3. Finally, in the remaining case, where Z[k]+k=Z[j]Z[k] + k = Z[j]Z[k]+k=Z[j], we know that Z[i]>=j+Z[j]–iZ[i] >= j + Z[j] – iZ[i]>=j+Z[j]–i. The exact value cannot be known without checking the additional characters (starting with the one marked ? in the figure).

If you check carefully the three cases above, you can see that in cases 1 and 2 we can directly assign the Z[i]Z[i]Z[i] value without doing any additional comparisons, thus all the work is done in case 3. We will show below that this approach allows us to dramatically speed up the algorithm.

But first, what value for j should we use? For any i, we will select the value j<ij < ij<i such that j+Z[j]>ij + Z[j] > ij+Z[j]>i and this value is maximal, i.e., the corresponding Z 'box' extends the farthest into the string. The intuition is that this way we are making maximal use of the information already computed. What happens, however, if no Z[j]Z[j]Z[j] box extends past i? In that case, we simply revert to the naïve algorithm, i.e. we compare T[i]T[i]T[i] to T[1]T[1]T[1], T[i+1]T[i + 1]T[i+1] to T[2]T[2]T[2], and so on and stop when we find a mismatch, appropriately updating the Z value.

Here is the full Z algorithm:

Z[0] = 0
maxZ = 0
j = 0
for (i = 1; i < n; i++)
  if (maxZ < i)                          // must do the hard work
    l = 0
    Z[i] = 0
    while (T[i + l] == T[l])
      Z[i]++
      l++
    end while
    maxZ = i + Z[i]
    j = i
  else
    k = i – j
    if (Z[k] + i < maxZ)                 // case 1
      Z[i] = Z[k]
    else if (Z[k] + i > maxZ)            // case 2
      Z[i] = maxZ – i
    else                                 // case 3 Z[k] + i = maxZ
      Z[i] = maxZ – i
      l = Z[i]
      while (T[i + l] == T[l])
        Z[i]++
        l++
      end while
      maxZ = i + Z[i]
      j = i
    end if
  end if
end for

Question: Check the algorithm above and fix the 'off by one' errors (i.e., should k be i–ji – ji–j or i–j+1i – j + 1i–j+1?)

Now, why is this algorithm efficient? It should easy to see that other than a few additional variables we do not add any more memory, thus the memory usage remains linear. But how about the comparisons? They only occur within the two while loops, and only involve characters that have never been compared before, thus the total runtime cannot exceed the number of characters in the text, i.e., the runtime is linear.

Question: Is it important that we only focus on the j whose Z[j]Z[j]Z[j] extends the furthest in the text? Argue/prove why this is important for ensuring either correctness or efficiency. (Hint: draw a situation where you are using a sub-optimal j, i.e., there exists another j you could use that extends further).

Exercises

  1. Implement the Z algorithm in your favorite programming language. Create several test strings that ensure your algorithm will execute the three cases. How else are you going to figure out you have implemented this algorithm correctly?

  2. Write out the Z-values for the following string: 1 1 1 1 1 1 1 1 1 1 2 2 2 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 A G A G A G A C C A G C A G A G A G A G A C G T Assume you are trying to compute Z[14]Z[14]Z[14] - please explain how you compute this value using the efficient algorithm described in class. Specifically:

    1. what are the values for j and Z[j]Z[j]Z[j]?

    2. which earlier value of Z do you need to make this computation?

    3. do you need to perform any additional character comparisons?

References

PreviousSemi-numerical matchingNextThe KMP algorithm

Last updated 3 months ago

Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences. The press syndicate of the university of Cambridge. .

http://books.google.com/books?isbn=0521585198
Visual representation of a string where two segments are identical to each other, one at the beginning of the string and the other at position j. A longer description is provided in the caption
A white rectangle representing a string that contains two blue boxes (Z boxes) representing sections of the string that are identical to each other, one starting at position 0 (the prefix of the string) and ending at position Z[j]Z[j]Z[j], and the other starting at position j and ending at position j+Z[j]j + Z[j]j+Z[j]. The length of the two segments is Z[j]Z[j]Z[j]. The fact that our current focus is on position i, and that this position is located within the Z box starting at position j, allows us to infer how the portion of the string starting at position i aligns to the prefix of the string. The portion of the string starting at position k=i−jk=i-jk=i−j and ending at Z[j]Z[j]Z[j] is the same as the portion of the string starting at position i and ending at position j+Z[j]j + Z[j]j+Z[j].
Visual representation of Case 1. A longer description is provided in the caption.
The two blue boxes represent sections of the string that are identical to each other, one starting at position 0 (the prefix of the string) and ending at position Z[j]Z[j]Z[j], and the other starting at position j and ending at position j+Z[j]j + Z[j]j+Z[j]. The dark bar represents the portion of the string matching at position 0 (prefix) and position k, i.e., the Z[k]Z[k]Z[k] box. Since the segment between k and Z[j]Z[j]Z[j] in the first blue box is the same as that between i and j+Z[j]j + Z[j]j+Z[j] in the second blue box, the same string (of length Z[k]Z[k]Z[k]) also matches starting at position i (shown by a third dark bar, or the Z[i]Z[i]Z[i] box). Since the Z[k]Z[k]Z[k] box starting at k does not extend past the end of the Z[j]Z[j]Z[j] box (i.e., k+Z[k]<Z[j]k + Z[k] < Z[j]k+Z[k]<Z[j]), we can infer that the same length starting at position i will stop before j+Z[j]j + Z[j]j+Z[j], i.e., Z[i]=Z[k]Z[i] = Z[k]Z[i]=Z[k].
Visual representation of Case 2. A longer description is provided in the caption.
The two blue boxes represent sections of the string that are identical to each other, one starting at position 0 (the prefix of the string) and ending at position Z[j]Z[j]Z[j], and the other starting at position j and ending at position j+Z[j]j + Z[j]j+Z[j]. The dark bar represents the portion of the string matching at position 0 (prefix) and position k, i.e., the Z[k]Z[k]Z[k] box. Since the Z[k]Z[k]Z[k] box starting at k extends past the end of the Z[j]Z[j]Z[j] box (i.e., k+Z[k]>Z[j]k + Z[k] > Z[j]k+Z[k]>Z[j]), we can infer that the portion of the string from position i to position j+Z[j]j + Z[j]j+Z[j], i.e., the end of the Z[j]Z[j]Z[j] box starting at j, will match the prefix of the string. This latter matching portion cannot extend further since the characters right after the two blue boxes are different, i.e., Z[i]=Z[j]−k=j+Z[j]–iZ[i] = Z[j] - k = j + Z[j] – iZ[i]=Z[j]−k=j+Z[j]–i
Visual representation of Case 3. A longer description is provided in the caption.
The two blue boxes represent sections of the string that are identical to each other, one starting at position 0 (the prefix of the string) and ending at position Z[j]Z[j]Z[j], and the other starting at position j and ending at position j+Z[j]j + Z[j]j+Z[j]. The dark bar represents the portion of the string matching at position 0 (prefix) and position k, i.e., Z[k]. The dark bar represents the portion of the string matching at position 0 (prefix) and position k, i.e., the Z[k]Z[k]Z[k] box. Since the segment between k and Z[j]Z[j]Z[j] in the first blue box is the same as that between i and j+Z[j]j + Z[j]j+Z[j] in the second blue box, the same string (of length Z[k]Z[k]Z[k]) also matches starting at position i (shown by a third dark bar, or the Z[i]Z[i]Z[i] box). Since the Z[k]Z[k]Z[k] box starting at k ends at the same place as the first Z[j]Z[j]Z[j] box (i.e., Z[k]+k=Z[j]Z[k] + k = Z[j]Z[k]+k=Z[j]), we can infer that the string starting at position i will reach position j+Z[j]j + Z[j]j+Z[j], i.e., the end of the Z[j]Z[j]Z[j] box starting at j. Whether the string starting at position i can still be extended while still matching at the beginning of the text needs to be checked by comparing additional characters.