Introduction to suffix trees
Previously, we looked for a full match between a pattern and a text. Assume now that no perfect match can be found. It can still be useful to know that part of the pattern matches the text. Thus, we can define a relaxed version of the string matching problem:
Prefix match. Given pattern P and text T, find the longest prefix of P that matches somewhere in T.
I argue that you can easily solve this problem efficiently using the Z or KMP algorithms, with some small changes. Please think carefully about how you may modify these algorithms to solve this problem, before proceeding.
Now, let's think about the following problem:
Longest common substring. Given pattern P and text T, find the longest substring of P that is also a substring of T.
For example, for P = ATTCGCTTAGCCTA and T = GGAGCTTAGAACT the answer is GCTTAG. Can any of the algorithms described earlier solve this longest common substring problem? It may not be immediately obvious, but none of the earlier algorithms can solve it efficiently. Key to their correctness and efficiency was the shift function, yet this function only guarantees that we will not miss a complete match of the pattern within the text, not that we will not miss a longest substring match between pattern and text.
As before, let's ignore efficiency for now and focus on a naïve solution to the problem. The naïve algorithm is very simple: treat every substring of P as a pattern, search for it in T, and keep track of the longest one that matches.
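Here is a minimal sketch of this idea in Python (the function name is mine), checking candidates from longest to shortest so the first hit can be returned immediately:

```python
def longest_common_substring_naive(P, T):
    """Naive approach: try every substring of P, longest first,
    and return the first one that also occurs somewhere in T."""
    m = len(P)
    for length in range(m, 0, -1):            # candidate lengths, longest first
        for start in range(m - length + 1):   # every substring of that length
            candidate = P[start:start + length]
            if candidate in T:                # scan T for the candidate
                return candidate
    return ""                                 # no non-empty common substring
```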
It is not hard to see, however, that the algorithm is inefficient. We have $O(m^2)$ different patterns (one for each substring of $P$), each of maximal length $m$, thus yielding an $O(m^2(m+n))$ runtime (see the calculation at the beginning of the chapter). Can we possibly do better?
Let us try to improve this algorithm. First, the algorithm is fairly inefficient because of the large number of patterns we have to search. To reduce this number we'll just focus on $m$ patterns, specifically, all the suffixes of P. As we discussed earlier, finding a longest prefix match between a pattern and the text is easy. Since each substring is the prefix of a suffix, the answer to the longest common substring problem will be the longest matching prefix of one of the suffixes. Note that this simple observation reduces the number of "patterns" we need to search from $O(m^2)$ to $m$, yielding a total runtime of $O(m(m+n))$ if using KMP or Z.
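To make the suffix-based idea concrete, here is one possible sketch (function names are mine) that runs the Z algorithm once per suffix of P against T. It implements the $O(m(m+n))$ approach just described, not yet the faster suffix tree solution, and it assumes the separator '#' occurs in neither string.

```python
def z_array(s):
    """Z algorithm: z[i] is the length of the longest substring of s
    starting at position i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    l, r = 0, 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def longest_common_substring_suffixes(P, T):
    """For every suffix of P, find its longest prefix occurring in T
    (one Z computation each) and keep the overall longest match."""
    best = ""
    for i in range(len(P)):
        suffix = P[i:]
        z = z_array(suffix + "#" + T)          # '#' acts as a separator
        # positions after the separator correspond to alignments within T
        for j in range(len(suffix) + 1, len(z)):
            if z[j] > len(best):
                best = suffix[:z[j]]
    return best
```

For the example strings above, this sketch returns GCTTAG, matching the answer given earlier.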
We'll try to obtain further speed-ups by exploiting the similarity between patterns. Specifically, since it's easier to find matches at the beginning of the pattern, we want to place the patterns in a data structure that ensures that shared prefixes are not repeated. The data structure we'll use is called a trie—just a fancy word to say we'll construct a tree structure that merges together patterns which have common prefixes (see below for patterns abcq, abcr, bcde, cd). Each edge in this tree is labeled with a single character, and every path from the root to a leaf spells the sequence of one of the patterns.
This structure is just a (slightly) compressed way to store the patterns together. Having the shared prefixes along the same path in the tree should allow us to simultaneously match (up to a point) multiple patterns.
If the patterns stored in the trie are the suffixes of a string, we'll call the resulting data structure the suffix trie of the string.
The suffix tree for a more complex string is shown below. As before, the leaves are labeled with the number of the corresponding suffix.
If you briefly look at this tree you'll notice that we are missing two suffixes (6 and 7). If you look more carefully, suffix 6 (AT) is already found in the tree on the second edge of the left-most path going from root to suffix 1. The same is true for suffix 7 (T) which is in the middle of the first edge on the right-most path of the tree. There's no need to branch at this location, thus we have not introduced any node or leaf corresponding to this suffix. This makes it a bit confusing as now suffixes can either end at leaves, or can end at arbitrary locations in the middle of edges. To make things more consistent, we'll use a simple trick—we'll append to the string the character $ or any other character that's not a letter in the string. This character will force every suffix to end at a leaf as shown below.
Also note that I'm using 1-based coordinates here. It may be more natural to use 0-based coordinates, but you'll find out that biologists often prefer to start with 1, thus it is useful to get used to both conventions.
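If you want to experiment with these structures programmatically, here is a minimal sketch that builds a suffix trie as nested Python dictionaries (the representation and names are mine, not a standard library). It appends the $ sentinel and labels each leaf with its 1-based suffix number, following the conventions above.

```python
def build_suffix_trie(text):
    """Build the suffix trie of text as nested dicts: each key is one
    character (an edge label); a 'leaf' entry stores the 1-based
    starting position of the suffix ending at that node."""
    text = text + "$"                       # sentinel forces every suffix to end at a leaf
    root = {}
    for start in range(len(text) - 1):      # one entry per non-empty suffix of the original string
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # follow the edge, creating it if needed
        node["leaf"] = start + 1            # 1-based suffix number
    return root
```

This direct construction takes quadratic time and space; the space issue is revisited later in the chapter.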
Before you proceed with this chapter, please try to build the suffix trees for a few strings. If you are looking for inspiration, try: BANANA, MISSISSIPPI, GATTACCA, AATATTATAATATA, or even AAAAAAA.
OK, so we created this fancy new data structure, but how useful is it? Let's look at some examples.
We'll start with the most basic string matching problem—given a pattern P of length m and a text T of length n, find if the pattern occurs within T, as well as the locations where it occurs.
To see how a suffix tree may be useful, let us remember the observation we made earlier that any substring of T can be seen as the prefix of a suffix of T. Thus, if P matches within T, then P is the prefix of one or more suffixes of T. Given that the suffixes are organized within a suffix tree such that they share common prefixes, that means that P must spell a path starting from the root of the suffix tree of T.
As you can see, starting down from the root of the tree, we can follow edges that spell ATA, indicating that this pattern occurs in the text. Also note that below the endpoint of the string ATA we have a sub-tree that has two leaves labeled 1 and 3, indicating that pattern ATA occurs twice in the text, the first time starting at position 1 and the second at position 3, which you can verify by looking at the two strings.
An actual algorithm for searching for the pattern inside the text will look something like:
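The following is a sketch of what that search could look like, assuming the dictionary-based suffix trie from the sketch above (where each edge is labeled with a single character):

```python
def find_pattern(pattern, trie):
    """Walk down from the root one character at a time. Returns the node
    reached after spelling the whole pattern, or None if some character
    cannot be matched (i.e., the pattern does not occur in the text)."""
    node = trie
    for ch in pattern:
        if ch not in node:       # mismatch: no edge spells this character
            return None
        node = node[ch]          # follow the edge labeled ch
    return node                  # root of the subtree containing all occurrences
```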
As a final point of discussion, we should focus briefly on figuring out where the pattern actually matches. As mentioned above, the node at the end of the edge where the pattern match ends (after executing the algorithm we just outlined) is the root of a subtree that has, as leaves, the suffixes corresponding to the positions where the pattern matches. It is, thus, sufficient to run a simple tree traversal algorithm to collect the labels of all the leaves and output all the match locations.
If all we care about is figuring out how many matches there are, but not actually finding their positions, we could pre-process the tree to store, at each internal node of the tree, the number of leaves located below it. Once we find a match of the pattern in the tree, we simply look up the number stored at the node right below the match.
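As an illustration of this pre-processing step (again using the nested-dictionary representation assumed in the earlier sketches), one could annotate each node with the number of leaves beneath it roughly as follows:

```python
def count_leaves(node):
    """Annotate every node with 'n_leaves', the number of leaves in the
    subtree rooted at that node, and return that count."""
    children = [child for child in node.values() if isinstance(child, dict)]
    if not children:
        node["n_leaves"] = 1                   # a leaf counts itself
    else:
        node["n_leaves"] = sum(count_leaves(child) for child in children)
    return node["n_leaves"]
```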
Write code that searches for a pattern inside a suffix tree
Write code that enumerates the labels of all leaves below a given node in a tree
Write code that pre-processes a (suffix) tree to include, at each node, the number of leaves below it.
Write code that pre-processes a suffix tree to include, at each node, the depth of the node in terms of the number of characters spelled by the path from root to the node.
A key component of the Z and KMP algorithms was identifying that certain regions of a string are identical to each other. In the Z and KMP algorithms, we focused on a specific type of repeated strings—those that match the prefix (beginning) of the string. But repeats are interesting more generally, and we've actually dealt with them when talking about finding hidden signals in DNA. There, we looked at strings of length k (k-mers) that are repeated throughout the genome. You probably asked yourself then (and if not, think a bit about this now) how we decide what value of k we should use? Suffix trees allow us to figure this out without starting with any particular guess for k.
Specifically, we're looking for segments of DNA that appear multiple times in the genome, as shown below (the shaded areas). You'll note that these segments represent the prefixes of different suffixes of the string (S1, S2, and S3 in the figure), thus they will show up as a shared path inside the tree (thick line), before the corresponding suffixes branch off.
Note that any path (and part thereof) leading to an internal node of the tree represents a repeat as it is the prefix of at least two distinct suffixes. We could, thus, expand the idea of frequent k-mers that we developed earlier in the class into a new problem:
Find the most frequent repeat in the genome that is longer than k.
This way, the value of k becomes a lower bound meant to avoid finding obvious patterns (e.g., for k=1, the answer is the most frequent letter in the genome, which is not particularly interesting). But this new algorithm could find interesting patterns that are longer than k. With some basic book-keeping and pre-processing, you could also look for clumps of frequent repeats. You will need to record, at each node in the tree, the range of coordinates represented at that node (the minimum and maximum values of the leaves in the subtree rooted at the node), then find nodes with relatively narrow ranges (where distance between min and max is below a certain value) but that have many leaves below them.
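As a hint toward the exercises below, here is a rough sketch of the first idea, assuming the dictionary-based trie and the n_leaves annotation from the earlier sketches (the function name and the choice of "at least k" as the length cutoff are mine):

```python
def most_frequent_repeat(node, k, depth=0, path=""):
    """Return (occurrence_count, repeat) for the most frequent string of
    length >= k spelling a path to a node with at least two leaves below.
    Assumes count_leaves() has already filled in 'n_leaves'."""
    best = (0, "")
    if depth >= k and node.get("n_leaves", 0) >= 2:
        best = (node["n_leaves"], path)
    for ch, child in node.items():
        if isinstance(child, dict):
            best = max(best, most_frequent_repeat(child, k, depth + 1, path + ch))
    return best
```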
Describe an algorithm that finds the most frequent repeat that is longer than k.
Describe an algorithm that finds all repeats longer than k and that occur more than N times in the genome.
Describe an algorithm that pre-processes a suffix tree to include, at each node, the minimum and maximum values of the leaves in the sub-tree rooted at that node.
I started this chapter by promising you that suffix trees can efficiently solve the longest common substring problem. This is true, but it requires a somewhat deeper understanding of suffix trees, which we develop here.
I'll leave it as an exercise to write out the code that simultaneously measures the string depth of nodes in a suffix tree and determines whether the nodes are "diverse" (contain leaves from two strings), and outputs the "deepest" such node.
Describe a linear-time algorithm that uses a suffix tree to compute the sp values of a string
Describe a linear-time algorithm that uses a suffix tree to compute the Z values of a string
We have labeled the $m$ leaves with the starting position of each suffix (1 is the longest suffix, etc.). This data structure requires $O(m^2)$ memory since there are $O(m^2)$ characters (and therefore nodes in the tree) in the set of all suffixes of a string of length $m$. Note, however, that this is quite wasteful as most internal nodes of the tree only have one child, so we could save some space by only keeping the internal branching nodes (in the case of string "abc" there are none).
This new structure saves a bit of space since the number of internal nodes is now $O(m)$ instead of $O(m^2)$. Why is that the case? I will argue that the number of internal nodes in a tree where every internal node has at least two children is always less than the number of leaves. Can you prove it (hint: use induction)? What tree has the most internal nodes with respect to the number of leaves?
Thus, even by "compacting" multiple edges into a single one, the space requirement of the tree remains $O(m^2)$, as we have to store all the characters contained in the suffixes of the pattern. A simple 'trick' can help us further lower the space. Specifically, each edge within the tree corresponds to a substring of the pattern, and can, therefore, be represented by its coordinates (start and end) within the pattern. As a result, each edge needs to only store constant information, and the total size of the tree becomes $O(m)$. The resulting tree for the string "abc" is shown below.
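A minimal sketch of this representation (names and layout are mine): each outgoing edge stores only the (start, end) coordinates of its label within the original string, and the substring itself can be recovered on demand. Note that Python slicing is 0-based and end-exclusive, unlike the 1-based convention used in the figures.

```python
class Node:
    """A suffix tree node whose edges are stored by coordinates into the text."""
    def __init__(self):
        self.edges = {}         # first character of the edge -> (start, end, child Node)
        self.suffix_id = None   # set only at leaves: starting position of the suffix


def edge_label(text, start, end):
    """Recover the substring an edge represents from its (start, end) coordinates."""
    return text[start:end]
```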
Note that this algorithm advances character by character along both P and the suffix tree of T, and completes when either P is fully found, or when a mismatch occurs, indicating P is not found in T. Thus, the runtime is $O(m)$.
Take one second to compare this run time to that of the KMP algorithm—$O(n+m)$. We seem to have magically removed from the equation the length of the text. In fact, I'm playing a bit of a trick on you since I didn't take into account the time needed to create the suffix tree. For now, trust me that it is possible to build this data structure in $O(n)$ time, thus searching for one pattern gives us a similar run time as KMP. But what happens when we try to search for k patterns as described at the beginning of the chapter? The runtime for KMP is $O(k(n+m))$. How about for a suffix tree? Since we don't need to construct a new suffix tree for each pattern, we can avoid the repeated $O(n)$ term, leading to $O(n+km)$. In many biological applications, k could be on the order of thousands or even millions, and given that n is also on the order of millions, the suffix tree solution provides an important reduction in run time.
One "hack" we can use is to construct a suffix tree that contains the suffixes of two different strings at the same time. To figure out which string is which, we'll include different characters at the end of the two strings, such as $ and #. In this tree, the longest common substring will be a path in the tree that is the prefix of suffixes in both strings, and is the longest such path. It should be pretty easy to figure out that this path must end at an internal node in the tree—if it ended in the middle of an edge, then you could extend it until the nearest node, violating the property that it is the longest path. Furthermore, the node where the path ends must be the root of a sub-tree that contains leaves from both strings. Since we used two different end of string characters, each leaf in the tree can be uniquely assigned to one of the two strings. Thus, to find the longest common substring, you need to find "diverse" nodes and pick the one that is the furthest away from the root in terms of the number of characters along the path from the root to the node. It turns out that a simple tree traversal can enumerate all diverse nodes, and compute their "string depth", and can thus find the shortest common substring in instead of as we proposed earlier. Of course, it's all predicated on being able to construct a suffix tree in . This result may not seem surprising, but before suffix trees were discovered, (the K in the KMP algorithm) conjectured that the shortest common substring could not be solved in sub-quadratic time.