The Z algorithm
Z values
Another strategy for speeding up matching is to modify the naïve matching approach in a way that avoids doing too much work. If we can somehow reuse our work we might be able to save time. To illustrate this idea we will start with a simple algorithm, the Z algorithm, developed by Dan Gusfield (Gusfield 1997) as an educational tool to help explain the basics of other string matching algorithms.
For a text T, let us define a simple function Z that captures the internal structure of the text. Specifically, for every position from T, we define to be the length of the longest prefix of that matches exactly a prefix of . We also define . Here is an example:
Question: Can Z values help in aligning a pattern to a text?
IMPORTANT: When questions are posed in the text, please spend a few minutes trying to figure them out before reading further. Evidence in the teaching and learning literature shows that even if you can't figure out the answer, the effort you devoted to trying to answer the question will help you better understand the answer and retain this information long term.
Using Z values in string matching
To use Z values as a tool for exact matching, we can construct the following string , where P and T are the pattern and text, respectively, and $ is a character that does not match any other character in either pattern or text. Let us assume for the moment that we have a black box function that computes the Z values for string S.
Question: Can the Z values computed for the string S defined above help us find matches between P and T?
The answer is yes – it suffices to look at the Z values corresponding to characters in T. Any Z-value equal to the length of P indicates the pattern matches. Specifically, for any , if Z[i] == m
, pattern P matches text T starting at position i. (Aside: can Z[i]
be greater than m?)
Question: What is the runtime and memory usage of this algorithm?
First, the memory – we clearly need to store the Z values in addition to the pattern and text, for a total of memory locations. Even though we are using more memory than the naïve algorithm, the memory consumption is still linear in the size of the inputs. There might be additional memory required by the algorithm that computes the Z values, but we'll get to that in a moment.
In terms of runtime, again omitting the work done to compute the Z values, we simply need to examine each position in the Z array to find all matches between P and T, or to decide that a match doesn't exist, i.e., the runtime is simply (we only need to examine the Z values corresponding to positions in the text). Here we see a much bigger improvement over the earlier naïve algorithm.
To sum up, if we have available an approach for computing Z values, we can perform exact matching much faster than the naïve algorithm, without using up (too much) more memory.
Of course, this result is predicated on the assumption that the Z values can be computed efficiently.
Question: Can Z-values be computed in linear time and linear space?
Linear-time computation of Z values
Note that one could come up with a naïve algorithm for computing the Z values using essentially the same approach used by the naïve exact matching algorithm. Clearly this approach would also yield a quadratic time algorithm (Aside: write down this algorithm and figure out its complexity in time and memory).
A much faster algorithm can be developed if we cleverly reuse the work done earlier. The intuition is that when computing we can use induction, leveraging the values for .
Let us look at the figure below, where the Z 'boxes' (the part of the string that matches both at the beginning and at position j) are represented as blue boxes. By definition, the length of those boxes is . We assume that there exists a position such that the Z 'box' at position j extends past i, i.e., . The section of the Z 'box' starting at j that extends past i is identical to the same region at the beginning of the string (bracket in the figure), information that we can now use to compute the value.
We can distinguish several cases:
Case 1. The Z 'box' for value does not extend past the blue region ( box), i.e.,. In that case we know that . The box cannot extend past that point as the following character (marked with an x in the figure below) is different from the corresponding character in the prefix of the string.
Case 2. The Z 'box' for k extends past the blue region (i.e., ). In this case , i.e., the box ends at the end of the box. The reasoning is that the character marked with an x in the figure is different from the character at the end of the box in the prefix of the string (otherwise could have been extended), and thus the box cannot be extended any further.
Case 3. Finally, in the remaining case, where , we know that . The exact value cannot be known without checking the additional characters (starting with the one marked ? in the figure).
If you check carefully the three cases above, you can see that in cases 1 and 2 we can directly assign the value without doing any additional comparisons, thus all the work is done in case 3. We will show below that this approach allows us to dramatically speed up the algorithm.
But first, what value for j should we use? For any i, we will select the value such that and this value is maximal, i.e., the corresponding Z 'box' extends the farthest into the string. The intuition is that this way we are making maximal use of the information already computed. What happens, however, if no box extends past i? In that case, we simply revert to the naïve algorithm, i.e. we compare to , to , and so on and stop when we find a mismatch, appropriately updating the Z value.
Here is the full Z algorithm:
Question: Check the algorithm above and fix the 'off by one' errors (i.e., should k be or ?)
Now, why is this algorithm efficient? It should easy to see that other than a few additional variables we do not add any more memory, thus the memory usage remains linear. But how about the comparisons? They only occur within the two while loops, and only involve characters that have never been compared before, thus the total runtime cannot exceed the number of characters in the text, i.e., the runtime is linear.
Question: Is it important that we only focus on the j whose extends the furthest in the text? Argue/prove why this is important for ensuring either correctness or efficiency. (Hint: draw a situation where you are using a sub-optimal j, i.e., there exists another j you could use that extends further).
Exercises
Implement the Z algorithm in your favorite programming language. Create several test strings that ensure your algorithm will execute the three cases. How else are you going to figure out you have implemented this algorithm correctly?
Write out the Z-values for the following string:
1 1 1 1 1 1 1 1 1 1 2 2 2 1 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 A G A G A G A C C A G C A G A G A G A G A C G T
Assume you are trying to compute - please explain how you compute this value using the efficient algorithm described in class. Specifically:what are the values for j and ?
which earlier value of Z do you need to make this computation?
do you need to perform any additional character comparisons?
References
Last updated