Suffix trees: beyond the basics
Last updated
Last updated
The has provided an introduction to the basic structure of a suffix tree and has shown how powerful this data structure can be for string indexing and exact matching. This introduction, however, omitted some of the details of the suffix tree data structure and did not show how suffix trees can be constructed in linear time. We provide these additional details. here.
As hinted in the basic introduction to suffix trees, solving the longest common substring problem (finding the longest exact match between a substring of a pattern and a text) is not easily done with the basic suffix tree data structure. It turns out that a bit of additional information added to the tree can help. This information consists of additional edges added to the tree that allow you to "jump" across branches of the tree, thereby speeding up searches. These edges are called suffix links, and are defined as follows. For any node in the tree that has the label , where is a string of characters and is a single character, a suffix link connects with node which has label . An example is shown in the figure below.
Before looking at the figure, however, try to answer this question: given a node with label in the tree, is the tree guaranteed to also have a node with label ? In other words, are the suffix links well-defined?
Let's return to the question above, paying attention to the situation shown in the figure. The reason node exists in the tree is because there are two suffixes, labeled and , such that the first letter of string is different from the first letter of string . In other words, the longest common prefix of the two suffixes is exactly . Since the suffix tree contains all the suffixes of the original string, it will also contain all the suffixes of any suffix of this string. Thus, if string is a suffix stored in the tree, so is the suffix . Likewise, suffix must also exist in the tree, and both suffixes must be labels of paths starting from the root. But these two suffixes share a common prefix , and the first letters of strings and are different, thus a node must exist in the tree that has the label , i.e., the node we have labeled . QED.
Thus, we can be sure each node in the tree has a valid suffix link to one, and only one other node in the tree (the one that can be reached through a path spelling the string ). As a result, the suffix links only add an additional space requirement, keeping the overall space needed to store a suffix tree of a string of length to .
To better understand how KMP values relate to suffix links, assume that we are matching a suffix of the pattern against a text and we encounter a mismatch at position in the suffix. Let's also assume that the matched section of the suffix (positions through ) is the string . The KMP algorithm would now shift this suffix according to the sp values, i.e., by finding the longest suffix of the string that matches a prefix of that string. However, our goal is to find if any of the suffixes of the pattern match, thus we can generalize the sp value to be the longest non-trivial suffix of the string that matches the prefix of any one of the suffixes of our pattern. It turns out that the string itself is this longest suffix: it cannot be any longer, since adding an additional character would make it be the full suffix, and it is the prefix of one of the suffixes of the pattern. Thus, the "shift" implied by this generalized version of the sp values will go from the string to the string , or exactly the definition of the suffix links.
As we've mentioned above, the suffix links are meant to represent a generalization of the KMP sp values that allow us to "shift" suffixes of the pattern along the text. To be clear — in this case the pattern is represented within the suffix tree. At a high level, we can match some suffix of the pattern to the text by following a path from the root of the tree for as long as the characters match the text. The "string depth" (number of characters along the path from the root) represents the longest match, so far, between a substring of the pattern (a prefix of a suffix of the pattern) and the text. Once we find a mismatch, we follow the suffix link from the position right before the mismatch, and continue trying to match going forward from the new location we have found in the tree, incrementing the string depth when we find matches, and decrementing it when we follow suffix links (by the definition of suffix links, following a link takes us to a location with a string depth that is exactly 1 lower than the current one), and retaining the largest string depth (the longest common substring). The figure below sketches this process.
Just as is the case for the KMP algorithm, the characters in the text that have been matched to the pattern don't have to be compared again, since they are guaranteed to continue to match after a "shift" of the tree along the text. Every time a suffix link is followed, the suffix tree shifts along the text by one position. The combination of these two observations yields a runtime of (a total of possible matches and a total of possible shifts).
It is important to note, as we have done in the caption of the figure above, that all suffixes of the pattern that share the prefix that is currently matched along the text are processed at the same time — the search implicitly operates in parallel for these regions that are duplicated within the pattern.
In the description above, we have made a very strong and unrealistic assumption: that the mismatch between the path in the tree and the text occurs exactly following a node in the tree. This allowed us to follow the suffix link located at that node in order to shift the matching by one character along the text. How do we handle the more general situation when the mismatch occur in the middle of an edge? There are no suffix links connecting the middles of edges to each other (if there were, we would need space to store such links). Let's look at the figure below that highlights such a situation.
As shown above, assume we have matched the string to the text and that the first mismatch occurred in the middle of an edge. Since there is no suffix link at this location, we move up the tree to the node immediately above the location of the mismatch (labeled ), then follow its suffix link to the node labeled . We now need to descend through the tree until we identify the end of the segment labeled .
Importantly, we cannot proceed character by character, since the runtime of our algorithm will quickly blow up. The advantage of the suffix links is that they allow us to jump around the tree in . Thus, how can we quickly find the end of the segment labeled ? A first observation is that we know the string must occur in the tree starting from the node labeled . This is true because string is the prefix of at least one suffix of the pattern, because string is the prefix of at least one suffix of the pattern (a suffix of a suffix is still a suffix). Since we know the length of the string , we do not need to look at all characters, we simply descend from the node labeled for a distance .
To follow the correct path, at every node encountered along the way, we compare the first letter of the edges below the node with the corresponding character within the string . For example, if we've moved down 15 characters from the end of the suffix link and we encounter a node, we compare character 16 in the string to the first character in the edges below this node, and follow the edge that starts with this character, continuing this process until we've moved down in the tree for a distance equal to , reaching the end of the string .
Note that this process is a lot more involved than following a single suffix link between two nodes. By "jumping" down according to the number of characters we still need to find from string , we avoid having to move down character by character, but it is possible that we may encounter multiple nodes on the path spelling . Following such implicit suffix links, may, therefore still incur a significant additional cost.
To prove that this is not the case, we can use a simple banking argument, by noting that anytime we follow a suffix link (implicit or explicit), the node depth of the new location in the tree either increases (in case there are nodes along the segment labeled , as is the case in the figure above), or decreases by at most one. This last statement is true because any node along the path labeled has a suffix link that connects to a node along the path labeled , i.e., the node depth of the node labeled is at least as large as the node depth of the node labeled , with one exception, when a node exist in the tree that is labeled . This node, at node depth 1 has a suffix link connecting it to the root (the empty suffix), thus leading to a decrease in node depth after following the suffix link.
To summarize, by following a suffix link, we may decrease the node depth by 1, or increase it by an arbitrary amount. This arbitrary amount is, however, bounded by the length of the longest suffix (). And once we increase the node depth, we cannot decrease it too fast since at each iteration we can only decrease it by 1. Using a banking analogy, the total number of increases throughout the execution of the algorithm (which is the "added" cost of following implicit suffix links) is bounded by the total amount of decreases throughout the execution of the algorithm, which is (we can decrease node depth at most 1 each time we shift the pattern along the text). Thus, on average, following each implicit suffix links takes .
The basic idea of the approach is that we will iteratively build the tree on prefixes of the string. For example, for string BANANAS, we will build the tree for string B, then for BA, BAN, and so on. We'll denote by the suffix tree built on the ith prefix of the string ( corresponds to B in our example, is BAN, etc.) At each stage in the process we will add one character to all the suffixes stored in the earlier trees, as well as add the final suffix for the just added character. Note that we will ignore the $ character for now, and allow suffixes to end along a path in the tree.
Clearly this algorithm appears to be quite expensive. We process different prefixes, and at each prefix we have to update different suffixes, which leads to an algorithm. If we take into account the time needed to 'find' each of the suffixes we may end up with an solution, far from what we are trying to achieve.
A couple observations can help us speed up this algorithm dramatically. For the following, assume we are processing suffix of tree and try to extend it with character .
Observation 1: Once a leaf, always a leaf. Once the algorithm has created a leaf, that leaf will never disappear. Furthermore, the string labeling the edge leading to this leaf will always end at the end of the current string (position in the original string for tree ) As a corollary, at each stage in the algorithm we do not need to update the leaves that are already created. Instead, once we create a leaf we can simply label the end of the edge with a special 'end of string' character (remember, the edge is simply a pair of numbers corresponding to the coordinates of the edge in the original string). Once the tree is constructed we will spend time to replace these numbers with – the coordinate of the end of the whole string. Also note that there are leaves in the tree and we only spend computation when creating each leaf, thus the total time spent on leaves in our algorithm is overall .
Observation 2: If string already exists in the tree, all subsequent suffixes also exist. This observation is fairly easy to prove. If existed in tree , then clearly all other suffixes shorter than it would also be in by the definition of this tree. As a result, when the our algorithm reaches this situation, we can simply stop and proceed with the next prefix of , i.e., tree is complete and we can proceed to . As a corollary, we can also bound the number of times we encounter this situation (a suffix of is already in ) to as the situation only occurs once per iteration.
Observation 3: If a node is created by the algorithm, the node to which the suffix link will point either exists in the tree, or will be created immediately afterward. To see why this is true, assume we are processing suffix which is not a leaf (i.e., either ends in the middle of an edge, or is a node with two or more children), yet the character(s) following is/are not . If ends at a node, the suffix link must already exist since is a correct suffix tree. If ends in the middle of an edge, we must now split the edge and create a new node. Now our algorithm will try to add character to the next suffix, such that . If ends at a node, we simply add a suffix link from the newly created node to this node. If ends in the middle of an edge, the character following is not , i.e., a new node will have to be created, and this node will be the target for the suffix link from the previously created node. Thus, suffix links will be correctly created. Furthermore, the process of finding the end of can use the same approach we used to find the longest common substring, which we have shown earlier to be on average. Following the same logic used in the longest common substring situation, over the entire execution of the algorithm, this process also only uses computation.
Putting the three observation together we get an algorithm for constructing suffix trees.
In this example, we build the suffix tree for string ABCABCD. At step I we simply build the tree for the first prefix, comprising a single suffix (A), and create a first leaf in the tree. By the observation above, this leaf will not need to be updated again, rather the string will be filled in automatically.
In step II, we add another suffix (B), the suffix AB having been automatically added by the leaf rule, and create a new leaf.
Same deal for step III.
In step IV, like in the earlier steps, we implicitly add suffixes ABCA, BCA, and CA represented at the corresponding leaves. We just need to add suffix A (the 4th suffix). As in the previous iteration we were implicitly left at leaf 3, we progress to the root of the tree and try to find a path labeled A. This path already exists in the tree and we terminate this round. The current pointer is now placed at the end of this path (character A). Note that no new leaves or nodes have been created.
At step V, we again know that suffixes ABCAB, BCAB, CAB are already in the tree at the corresponding leaves. We now try to add suffixes AB and B, starting from the last position we visited in the tree (character A highlighted in red in step IV). We find that suffix AB is already in the tree and terminate this round. Note that we did not need to check if suffix B is also found in the tree as we already know this fact to be true. The pointer in the tree has now progressed to the end of suffix 4, the B character highlighted in red.
Step VI is the same as step V, progressing another position down the same edge.
Finally, we reach step VII where we try to add the final character to all the suffixes not ending in a leaf. We start from the position where we last ended (the red C character in panel VI), the end of suffix 4 (we started with this suffix in steps V, VI, and VII). As we do not find suffix ABCD in the tree we create a new node and a leaf for suffix 4, then progress to search for suffix 5. We do so by using the skip-count trick we developed earlier in the context of search. Specifically we find the parent of the node (the root in this case), and progress (blindly) down the path from the root that starts with B for two characters (length of ABC minus the first character). At this position we again try to add character D to the tree, and since it is not found, we create a new node and leaf 5. The newly created node becomes the target of the suffix link from the previously created node. We then follow a similar approach (move to the root, follow path labeled C one character down) to insert the 6ths suffix (CD), and add a suffix link for the previously created node. Finally, we insert the 7th suffix (D) in the tree, and create a suffix link from the previous node to the root.
Throughout this approach we see that at each stage we either created a leaf, or terminated the execution of the algorithm as soon as we found that a suffix existed in the tree.
Note that when creating the suffix links, the node to which the link will point to may already exist in the tree (see below), however we are still guaranteed that we will visit that node immediately after creating the new node. The example below follows the insertion of character D in the suffix tree constructed for the string ABCACDABC. In the last panel, the red node already existed in the tree when we try to fill in the suffix link for the newly created node (red node in the second to last panel).
To better understand suffix links, it is useful to learn a bit about their history. A version of this idea was developed by and as they were trying to figure out how to extend the notion of the sp values in the KMP algorithm to multiple patterns. The goal was to create an algorithm that can simultaneously match multiple patterns to a text at the same time. Thus, when finding a mismatch and shifting, we want to shift in such a way that doesn't miss a match in any of the patterns. Thealgorithm achieved this through a variant of the suffix links.
We will now describe 's algorithm for constructing suffix trees in linear time (). Note that the suffix tree structure was developed earlier by Peter Weiner and subsequent work by and Ukkonen simplified the construction approach.
The following description can be quite confusing. I suggest you also use online resources to better understand what's going on. The following link provides a nice description and also a number of other pointers.