Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Introduction
  • Suffix links
  • Historical note on suffix links
  • Longest common substring computation with suffix trees
  • Implicit suffix links
  • Linear time construction of suffix trees
  • Ukkonen's algorithm example
  1. String indexing

Suffix trees: beyond the basics

PreviousIntroduction to suffix treesNextSuffix arrays

Last updated 2 months ago

Introduction

The has provided an introduction to the basic structure of a suffix tree and has shown how powerful this data structure can be for string indexing and exact matching. This introduction, however, omitted some of the details of the suffix tree data structure and did not show how suffix trees can be constructed in linear time. We provide these additional details. here.

Suffix links

As hinted in the basic introduction to suffix trees, solving the longest common substring problem (finding the longest exact match between a substring of a pattern and a text) is not easily done with the basic suffix tree data structure. It turns out that a bit of additional information added to the tree can help. This information consists of additional edges added to the tree that allow you to "jump" across branches of the tree, thereby speeding up searches. These edges are called suffix links, and are defined as follows. For any node nnn in the tree that has the label cαc\alphacα, where α\alphaα is a string of characters and ccc is a single character, a suffix link connects nnn with node n′n'n′ which has label α\alphaα. An example is shown in the figure below.

Before looking at the figure, however, try to answer this question: given a node nnn with label cαc\alphacα in the tree, is the tree guaranteed to also have a node n′n'n′ with label α\alphaα? In other words, are the suffix links well-defined?

Let's return to the question above, paying attention to the situation shown in the figure. The reason node nnn exists in the tree is because there are two suffixes, labeled cαβc\alpha\betacαβ and cαγc\alpha\gammacαγ, such that the first letter of string β\betaβ is different from the first letter of string γ\gammaγ. In other words, the longest common prefix of the two suffixes is exactly cαc\alphacα. Since the suffix tree contains all the suffixes of the original string, it will also contain all the suffixes of any suffix of this string. Thus, if string cαβc\alpha\betacαβ is a suffix stored in the tree, so is the suffix αβ\alpha\betaαβ. Likewise, suffix αγ\alpha\gammaαγ must also exist in the tree, and both suffixes must be labels of paths starting from the root. But these two suffixes share a common prefix α\alphaα, and the first letters of strings β\betaβ and γ\gammaγ are different, thus a node must exist in the tree that has the label α\alphaα, i.e., the node we have labeled n′n'n′. QED.

Thus, we can be sure each node in the tree has a valid suffix link to one, and only one other node in the tree (the one that can be reached through a path spelling the string α\alphaα). As a result, the suffix links only add an additional O(n)O(n)O(n) space requirement, keeping the overall space needed to store a suffix tree of a string of length nnn to O(n)O(n)O(n).

Historical note on suffix links

To better understand how KMP values relate to suffix links, assume that we are matching a suffix of the pattern against a text and we encounter a mismatch at position i+1i + 1i+1 in the suffix. Let's also assume that the matched section of the suffix (positions 000 through iii) is the string cαc\alphacα . The KMP algorithm would now shift this suffix according to the sp values, i.e., by finding the longest suffix of the string cαc\alphacα that matches a prefix of that string. However, our goal is to find if any of the suffixes of the pattern match, thus we can generalize the sp value to be the longest non-trivial suffix of the string cαc\alphacα that matches the prefix of any one of the suffixes of our pattern. It turns out that the string α\alphaα itself is this longest suffix: it cannot be any longer, since adding an additional character would make it be the full suffix, and it is the prefix of one of the suffixes of the pattern. Thus, the "shift" implied by this generalized version of the sp values will go from the string cαc\alphacα to the string α\alphaα, or exactly the definition of the suffix links.

Longest common substring computation with suffix trees

As we've mentioned above, the suffix links are meant to represent a generalization of the KMP sp values that allow us to "shift" suffixes of the pattern along the text. To be clear — in this case the pattern is represented within the suffix tree. At a high level, we can match some suffix of the pattern to the text by following a path from the root of the tree for as long as the characters match the text. The "string depth" (number of characters along the path from the root) represents the longest match, so far, between a substring of the pattern (a prefix of a suffix of the pattern) and the text. Once we find a mismatch, we follow the suffix link from the position right before the mismatch, and continue trying to match going forward from the new location we have found in the tree, incrementing the string depth when we find matches, and decrementing it when we follow suffix links (by the definition of suffix links, following a link takes us to a location with a string depth that is exactly 1 lower than the current one), and retaining the largest string depth (the longest common substring). The figure below sketches this process.

Just as is the case for the KMP algorithm, the characters in the text that have been matched to the pattern don't have to be compared again, since they are guaranteed to continue to match after a "shift" of the tree along the text. Every time a suffix link is followed, the suffix tree shifts along the text by one position. The combination of these two observations yields a runtime of O(2n)O(2n)O(2n) (a total of nnn possible matches and a total of nnn possible shifts).

It is important to note, as we have done in the caption of the figure above, that all suffixes of the pattern that share the prefix that is currently matched along the text are processed at the same time — the search implicitly operates in parallel for these regions that are duplicated within the pattern.

Implicit suffix links

In the description above, we have made a very strong and unrealistic assumption: that the mismatch between the path in the tree and the text occurs exactly following a node in the tree. This allowed us to follow the suffix link located at that node in order to shift the matching by one character along the text. How do we handle the more general situation when the mismatch occur in the middle of an edge? There are no suffix links connecting the middles of edges to each other (if there were, we would need O(n2)O(n^2)O(n2) space to store such links). Let's look at the figure below that highlights such a situation.

As shown above, assume we have matched the string CαβC\alpha\betaCαβ to the text and that the first mismatch occurred in the middle of an edge. Since there is no suffix link at this location, we move up the tree to the node immediately above the location of the mismatch (labeled CαC\alphaCα), then follow its suffix link to the node labeled α\alphaα. We now need to descend through the tree until we identify the end of the segment labeled β\betaβ.

Importantly, we cannot proceed character by character, since the runtime of our algorithm will quickly blow up. The advantage of the suffix links is that they allow us to jump around the tree in O(1)O(1)O(1). Thus, how can we quickly find the end of the segment labeled β\betaβ? A first observation is that we know the string β\betaβ must occur in the tree starting from the node labeled α\alphaα. This is true because string αβ\alpha\betaαβ is the prefix of at least one suffix of the pattern, because string CαβC\alpha\betaCαβ is the prefix of at least one suffix of the pattern (a suffix of a suffix is still a suffix). Since we know the length of the string β\betaβ, we do not need to look at all characters, we simply descend from the node labeled α\alphaα for a distance len(β)\text{len}(\beta)len(β).

To follow the correct path, at every node encountered along the way, we compare the first letter of the edges below the node with the corresponding character within the string β\betaβ. For example, if we've moved down 15 characters from the end of the suffix link and we encounter a node, we compare character 16 in the string β\betaβ to the first character in the edges below this node, and follow the edge that starts with this character, continuing this process until we've moved down in the tree for a distance equal to len(β)\text{len}(\beta)len(β), reaching the end of the string β\betaβ.

Note that this process is a lot more involved than following a single suffix link between two nodes. By "jumping" down according to the number of characters we still need to find from string β\betaβ, we avoid having to move down character by character, but it is possible that we may encounter multiple nodes on the path spelling β\betaβ. Following such implicit suffix links, may, therefore still incur a significant additional cost.

To prove that this is not the case, we can use a simple banking argument, by noting that anytime we follow a suffix link (implicit or explicit), the node depth of the new location in the tree either increases (in case there are nodes along the segment labeled β\betaβ, as is the case in the figure above), or decreases by at most one. This last statement is true because any node along the path labeled CαC\alphaCα has a suffix link that connects to a node along the path labeled α\alphaα, i.e., the node depth of the node labeled α\alphaα is at least as large as the node depth of the node labeled CαC\alphaCα, with one exception, when a node exist in the tree that is labeled CCC. This node, at node depth 1 has a suffix link connecting it to the root (the empty suffix), thus leading to a decrease in node depth after following the suffix link.

To summarize, by following a suffix link, we may decrease the node depth by 1, or increase it by an arbitrary amount. This arbitrary amount is, however, bounded by the length of the longest suffix (len(P)\text{len}(P)len(P)). And once we increase the node depth, we cannot decrease it too fast since at each iteration we can only decrease it by 1. Using a banking analogy, the total number of increases throughout the execution of the algorithm (which is the "added" cost of following implicit suffix links) is bounded by the total amount of decreases throughout the execution of the algorithm, which is O(n)O(n)O(n) (we can decrease node depth at most 1 each time we shift the pattern along the text). Thus, on average, following each implicit suffix links takes O(1)O(1)O(1).

Linear time construction of suffix trees

The basic idea of the approach is that we will iteratively build the tree on prefixes of the string. For example, for string BANANAS, we will build the tree for string B, then for BA, BAN, and so on. We'll denote by TiT_iTi​ the suffix tree built on the ith prefix of the string SSS (T1T_1T1​ corresponds to B in our example, T3T_3T3​ is BAN, etc.) At each stage in the process we will add one character to all the suffixes stored in the earlier trees, as well as add the final suffix for the just added character. Note that we will ignore the $ character for now, and allow suffixes to end along a path in the tree.

Clearly this algorithm appears to be quite expensive. We process nnn different prefixes, and at each prefix iii we have to update iii different suffixes, which leads to an O(n2)O(n^2)O(n2) algorithm. If we take into account the time needed to 'find' each of the suffixes we may end up with an O(n3)O(n^3)O(n3) solution, far from what we are trying to achieve.

A couple observations can help us speed up this algorithm dramatically. For the following, assume we are processing suffix α\alphaα of tree TiT_iTi​ and try to extend it with character S[i+1]S[i+1]S[i+1].

Observation 1: Once a leaf, always a leaf. Once the algorithm has created a leaf, that leaf will never disappear. Furthermore, the string labeling the edge leading to this leaf will always end at the end of the current string (position iii in the original string for tree TiT_iTi​) As a corollary, at each stage in the algorithm we do not need to update the leaves that are already created. Instead, once we create a leaf we can simply label the end of the edge with a special 'end of string' character (remember, the edge is simply a pair of numbers corresponding to the coordinates of the edge in the original string). Once the tree is constructed we will spend O(n)O(n)O(n) time to replace these numbers with nnn – the coordinate of the end of the whole string. Also note that there are O(n)O(n)O(n) leaves in the tree and we only spend computation when creating each leaf, thus the total time spent on leaves in our algorithm is overall O(n)O(n)O(n).

Observation 2: If string αS[i+1]\alpha S[i+1]αS[i+1] already exists in the tree, all subsequent suffixes also exist. This observation is fairly easy to prove. If αS[i+1]\alpha S[i+1]αS[i+1] existed in tree TiT_iTi​, then clearly all other suffixes shorter than it would also be in TiT_iTi​ by the definition of this tree. As a result, when the our algorithm reaches this situation, we can simply stop and proceed with the next prefix of SSS, i.e., tree Ti+1T_{i+1}Ti+1​ is complete and we can proceed to Ti+2T_{i+2}Ti+2​. As a corollary, we can also bound the number of times we encounter this situation (a suffix of Ti+1T_{i+1}Ti+1​ is already in TiT_iTi​) to O(n)O(n)O(n) as the situation only occurs once per iteration.

Observation 3: If a node is created by the algorithm, the node to which the suffix link will point either exists in the tree, or will be created immediately afterward. To see why this is true, assume we are processing suffix α\alphaα which is not a leaf (i.e., α\alphaα either ends in the middle of an edge, or is a node with two or more children), yet the character(s) following α\alphaα is/are not S[i+1]S[i+1]S[i+1]. If α\alphaα ends at a node, the suffix link must already exist since TiT_iTi​ is a correct suffix tree. If α\alphaα ends in the middle of an edge, we must now split the edge and create a new node. Now our algorithm will try to add character S[i+1]S[i+1]S[i+1] to the next suffix, β\betaβ such that α=cβ\alpha = c \betaα=cβ. If β\betaβ ends at a node, we simply add a suffix link from the newly created node to this node. If β\betaβ ends in the middle of an edge, the character following β\betaβ is not S[i+1]S[i+1]S[i+1], i.e., a new node will have to be created, and this node will be the target for the suffix link from the previously created node. Thus, suffix links will be correctly created. Furthermore, the process of finding the end of β\betaβ can use the same approach we used to find the longest common substring, which we have shown earlier to be O(1)O(1)O(1) on average. Following the same logic used in the longest common substring situation, over the entire execution of the algorithm, this process also only uses O(n)O(n)O(n) computation.

Putting the three observation together we get an O(n)O(n)O(n) algorithm for constructing suffix trees.

Ukkonen's algorithm example

In this example, we build the suffix tree for string ABCABCD. At step I we simply build the tree for the first prefix, comprising a single suffix (A), and create a first leaf in the tree. By the observation above, this leaf will not need to be updated again, rather the string will be filled in automatically.

In step II, we add another suffix (B), the suffix AB having been automatically added by the leaf rule, and create a new leaf.

Same deal for step III.

In step IV, like in the earlier steps, we implicitly add suffixes ABCA, BCA, and CA represented at the corresponding leaves. We just need to add suffix A (the 4th suffix). As in the previous iteration we were implicitly left at leaf 3, we progress to the root of the tree and try to find a path labeled A. This path already exists in the tree and we terminate this round. The current pointer is now placed at the end of this path (character A). Note that no new leaves or nodes have been created.

At step V, we again know that suffixes ABCAB, BCAB, CAB are already in the tree at the corresponding leaves. We now try to add suffixes AB and B, starting from the last position we visited in the tree (character A highlighted in red in step IV). We find that suffix AB is already in the tree and terminate this round. Note that we did not need to check if suffix B is also found in the tree as we already know this fact to be true. The pointer in the tree has now progressed to the end of suffix 4, the B character highlighted in red.

Step VI is the same as step V, progressing another position down the same edge.

Finally, we reach step VII where we try to add the final character to all the suffixes not ending in a leaf. We start from the position where we last ended (the red C character in panel VI), the end of suffix 4 (we started with this suffix in steps V, VI, and VII). As we do not find suffix ABCD in the tree we create a new node and a leaf for suffix 4, then progress to search for suffix 5. We do so by using the skip-count trick we developed earlier in the context of search. Specifically we find the parent of the node (the root in this case), and progress (blindly) down the path from the root that starts with B for two characters (length of ABC minus the first character). At this position we again try to add character D to the tree, and since it is not found, we create a new node and leaf 5. The newly created node becomes the target of the suffix link from the previously created node. We then follow a similar approach (move to the root, follow path labeled C one character down) to insert the 6ths suffix (CD), and add a suffix link for the previously created node. Finally, we insert the 7th suffix (D) in the tree, and create a suffix link from the previous node to the root.

Throughout this approach we see that at each stage we either created a leaf, or terminated the execution of the algorithm as soon as we found that a suffix existed in the tree.

Note that when creating the suffix links, the node to which the link will point to may already exist in the tree (see below), however we are still guaranteed that we will visit that node immediately after creating the new node. The example below follows the insertion of character D in the suffix tree constructed for the string ABCACDABC. In the last panel, the red node already existed in the tree when we try to fill in the suffix link for the newly created node (red node in the second to last panel).

To better understand suffix links, it is useful to learn a bit about their history. A version of this idea was developed by and as they were trying to figure out how to extend the notion of the sp values in the KMP algorithm to multiple patterns. The goal was to create an algorithm that can simultaneously match multiple patterns to a text at the same time. Thus, when finding a mismatch and shifting, we want to shift in such a way that doesn't miss a match in any of the patterns. Thealgorithm achieved this through a variant of the suffix links.

We will now describe 's algorithm for constructing suffix trees in linear time (). Note that the suffix tree structure was developed earlier by Peter Weiner and subsequent work by and Ukkonen simplified the construction approach.

The following description can be quite confusing. I suggest you also use online resources to better understand what's going on. The following link provides a nice description and also a number of other pointers.

Alfred Aho
Margaret Corasick
Aho-Corasick
Esko Ukkonen
Ukkonen 1995
Edward McCreight
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
previous section
A representation of two paths in a tree, the first one labeled c alpha, ending in node n, with the second one labeled alpha, and ending in node n prime. Two solid edges below node n are labeled beta and gamma, respectively, and two solid edges below node n prime are  also labeled beta and gamma. Dashed edges below both n and n prime indicate that other edges may exist in the tree beyond the labeled ones.
Suffix link between nodes nnn and n′n'n′. The label of the path to node nnn is cαc\alphacα, with the α\alphaα portion of the path highlighted by a blue box. The label of node n′n'n′ is just α\alphaα. Dashed edges indicate that there are other potential edges stemming from nodes nnn and n′n'n′. Not shown in the figure is the possibility that other nodes (and edges) may occur on the path between the root and the nodes nnn and n′n'n′.
A text T is represented as a long rectangle containing the string c alpha, with the alpha segment highlighted as a blue rectangle. A path in a suffix tree is shown aligned to the text so that the c alpha portion of the tree matches the text. The tree representation also include a suffix link to a path starting with a label alpha.  The pattern is represented below containing two occurrences of the string c alpha.
Suffix tree of the pattern aligned to the text. Since the node with the label cαc\alphacα has two children, the pattern contains two occurrences of this string (shown at the bottom) and both are simultaneously aligned to the text when aligning the corresponding path in the tree. By following the suffix link shown as a thick arrow, the root of the tree gets implicitly shifted by one character along the text, and the matching can continue again from the character following the end of the string cαc\alphacα in the text.
Suffix tree of a pattern aligned to a text after a shift caused by following a suffix link The text has the string c alpha highlighted in it with the segment labeled alpha highlighted as a blue box. A path in the suffix tree starting with the label alpha is aligned to the corresponding region in the text. A separate path, starting with the label c alpha is shown in the tree as well as the suffix link (a thick arrow) connecting the ends of these two paths.
Situation after following the suffix link. The tree has effectively shifted by one position along the text since now the path starting with the label α\alphaα is aligned to the text.
Suffix tree in which two paths are highlighted - one labeled C alpha beta and the other alpha beta. Between the strings alpha and beta there is a node in each path and the corresponding nodes are connected with a suffix link, the node labeled c alpha links to the node labeled alpha. The path labeled beta ends in the middle of an edge. On the path  from the root labeled alpha beta, there is another node in the middle of the segment labeled beta.
Situation where the match ends in the middle of an edge, after matching the string CαβC \alpha\betaCαβ to the text. In this situation, the suffix link for the node immediately above the string β\betaβ is followed, then the end of the string β\betaβ can be found by descending from the end of the suffix link.
One edge labeled A, with the suffix labeled 1.
Tree after step 1.
Tree after two steps - one edge labeled AB from root to leaf 1, another edge labeled B from root to leaf 2.
Tree after step 2
Tree after step 3. Three edges from the root to leaves 1, 2, and 3, with corresponding labels ABC, BC, and C.
Tree after step 3.
Tree with three edges connecting the root to leaves 1, 2, and 3, with corresponding labels ABCA, BCA, and CA.
Tree after step 4.
Tree with three edges linking root to leaves 1, 2, and 3, and labels ABCAB, BCAB, and CAB.
Tree after step 5.
Tree with three edges connecting root to leaves 1, 2, and 3, labeled ABCABC, BCABC, and CABC.
Tree after step 6.
Four versions of the suffix tree representing the order of operations during Ukkonen's algorithm.
Sequence of operations taking place in step 7. The final tree is in the bottom panel.
Sequence of operations for adding a new character to the suffix tree in Ukkonen's algorithm.
Sequence of operations in Ukkonnen's algorithm if the suffix link connects to an already existing node.