Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Introduction
  • The suffix tree structure
  • Using suffix trees
  • Searching for a pattern inside a text
  • Exercises
  • Finding repeats
  • Exercises
  • Notes
  • Exercises
  1. String indexing

Introduction to suffix trees

PreviousIntroduction to string indexingNextSuffix trees: beyond the basics

Last updated 3 months ago

Introduction

Previously we have looked for a full match between a pattern and a text. Assume that no perfect match can be found. It can still be useful to know that part of the pattern matches the text. Thus, we can have a relaxed version of the string matching problem:

Prefix match. Given pattern P and text T, find the longest prefix of P that matches somewhere in T.

I argue that you can easily solve this problem efficiently using the Z or KMP algorithms, with some small changes. Please think carefully about how you may modify these algorithms to solve this problem, before proceeding.

Now, let's think about the following problem:

Longest common substring. Given pattern P and text T, find the longest substring of P that is also a substring of T.

For example, for P = ATTCGCTTAGCCTA and T = GGAGCTTAGAACT the answer is GCTTAG. Can any of the algorithms described earlier solve this longest common substring problem? It may not be immediately obvious, but none of the earlier algorithms can solve it efficiently. Key to their correctness and efficiency was the shift function, yet this function only guarantees that we will not miss a complete match of the pattern within the text, not that we will not miss a longest substring match between pattern and text.

As before, let's ignore efficiency for now and focus on a naïve solution to the problem. This naïve algorithm is very simple:

Construct all substrings of P
Search each of these against T

It is not hard to see, however, that the algorithm is inefficient. We have O(m2)O(m^2)O(m2) different patterns, each of maximal length O(m)O(m)O(m), thus yielding an O(m3+m2n)O(m^3 + m^2n)O(m3+m2n) runtime (see the calculation at the beginning of the chapter). Can we possibly do better?

The suffix tree structure

Let us try to improve this algorithm. First, the algorithm is fairly inefficient because of the large number of patterns we have to search. To reduce this number we'll just focus on O(m)O(m)O(m) patterns, specifically, all the suffixes of P. As we discussed earlier, finding a longest prefix match between a pattern and the text is easy. Since each substring is the prefix of a suffix, the answer to the longest common substring problem will be the longest matching prefix of one of the suffixes. Note, that this simple observation reduces the number of "patterns" we need to search from O(m2)O(m^2)O(m2) to O(m)O(m)O(m), yielding to a total runtime of O(m2+mn)O(m^2 + mn)O(m2+mn) if using KMP or Z.

We'll try to obtain further speed-ups by exploiting the similarity between patterns. Specifically, since it's easier to find matches at the beginning of the pattern, we want to place the patterns in a datastructure that ensures that shared prefixes are not repeated. The datastructure we'll use is called a trie—just a fancy word to say we'll construct a tree structure that merges together patterns which have common prefixes (see below for patterns abcq, abcr, bcde, cd). Each edge in this tree is labeled with one single character, and every path from the root to a leaf spells the sequence of one of the patterns.

This structure is just a (slightly) compressed way to store the patterns together. Having the shared prefixes along the same path in the tree should allow us to simultaneously match (up to a point) multiple patterns.

If the patterns stored in the trie are the suffixes of a string, we'll call the resulting datastructure the suffix trie of the string.

We have labeled the m leaves with the starting position of each suffix (1 is the longest suffix, etc.). This data structure requires O(m2)O(m^2)O(m2) memory since there are m2m^2m2 characters (and therefore nodes in the tree) in the set of all suffixes of a string of length m. Note, however, that this is quite wasteful as most internal nodes of the tree only have one child, so we could save some space by only keeping the internal branching nodes (in the case of string "abc" there are none).

This new structure saves a bit of space since the number of internal nodes is now O(m)O(m)O(m) instead of O(m2)O(m^2)O(m2). Why is that the case? I will argue that the number of internal nodes in a tree is always less than the number of leaves. Can you prove it (hint: use induction)? What tree has the most internal nodes with respect to the number of leaves?

Thus, even by "compacting" multiple edges into a single one, the space requirement of the tree remains O(m2)O(m^2)O(m2) as we have to store all O(m2)O(m^2)O(m2) characters contained in the O(m)O(m)O(m) suffixes of the pattern. A simple 'trick' can help us further lower the space. Specifically, each edge within the tree corresponds to a substring of the pattern, and can, therefore, be represented by its coordinates (start and end) within the pattern. As a result, each edge needs to only store constant information, and the total size of the tree becomes O(m)O(m)O(m). The resulting tree for the string "abc" is shown below.

The suffix tree for a more complex string is shown below. As before, the leaves are labeled with the number of the corresponding suffix.

If you briefly look at this tree you'll notice that we are missing two suffixes (6 and 7). If you look more carefully, suffix 6 (AT) is already found in the tree on the second edge of the left-most path going from root to suffix 1. The same is true for suffix 7 (T) which is in the middle of the first edge on the right-most path of the tree. There's no need to branch at this location, thus we have not introduced any node or leaf corresponding to this suffix. This makes it a bit confusing as now suffixes can either end at leaves, or can end at arbitrary locations in the middle of edges. To make things more consistent, we'll use a simple trick—we'll append to the string the character $ or any other character that's not a letter in the string. This character will force every suffix to end at a leaf as shown below.

Also note that I'm using 1-based coordinates here. It may be more natural to use 0-based coordinates, but you'll find out that biologists often prefer to start with 1, thus it is useful to get used to both conventions.

Before you proceed with this chapter, please try to build the suffix trees for a few strings. If you are looking for inspiration, try: BANANA, MISSISSIPPI, GATTACCA, AATATTATAATATA, or even AAAAAAA.

Using suffix trees

OK, so we created this fancy new data structure, but how useful is it? Let's look at some examples.

Searching for a pattern inside a text

We'll start with the most basic string matching problem—given a pattern P of length m and a text T of length n, find if the pattern occurs within T, as well as the locations where it occurs.

To see how a suffix tree may be useful, let us remember the observation we made earlier that any substring of T can be seen as the prefix of a suffix of T. Thus, if P matches within T, then P is the prefix of one or more suffixes of T. Given that the suffixes are organized within a suffix tree such that they share common prefixes, that means that P must spell a path starting from the root of the suffix tree of T.

As you can see, starting down from the root of the tree, we can follow edges that spell ATA, indicating that this pattern occurs in the text. Also note that below the endpoint of the string ATA we have a sub-tree that has two leaves labeled 1 and 3, indicating that pattern ATA occurs twice in the text, the first time starting at position 1 and the second at position 3, which you can verify by looking at the two strings.

An actual algorithm for searching for the pattern inside the text will look something like:

n = root # current node
i = 0 # current character in string
while not end of string
  if edge e under cn starts with string[i]:
    check if label of e matches string starting at i
       if mismatch occurs prior to end of string, declare no match
       if string ends (without mismatch) before end of e – declare match
       if e ends before end of string, 
          set cn as endpoint of e, 
          and i as first unmatched character in string
  else:
    string doesn't match

Note that this algorithm advances character by character along both P and the suffix tree of T, and completes when either P is fully found, or when a mismatch occurs, indicating P is not found in T. Thus, the runtime is O(m)O(m)O(m).

Take one second to compare this run time to that of the KMP algorithm—O(m+n)O(m + n)O(m+n). We seem to have magically removed from the equation the length of the text. In fact, I'm playing a bit of trick on you since I didn't take into account the time needed to create the suffix tree. For now, trust me that it is possible to build this data structure in O(n)O(n)O(n) time, thus searching for one pattern gives us a similar run time as KMP. But what happens when we try to search for k patterns as described at the beginning of the chapter? The runtime for KMP is O(kn+∑imi)O(kn + \sum_i m_i)O(kn+∑i​mi​). How about for a suffix tree? Since we don't need to construct a new suffix tree for each pattern, we can avoid the O(kn)O(kn)O(kn) term, leading to O(n+∑imi)O(n + \sum_i m_i)O(n+∑i​mi​). In many biological applications, k could be on the order of thousands or even millions, and given that n is also on the order of millions, the suffix tree solution provides an important reduction in run time.

As a final point of discussion, we should focus briefly on figuring out where the pattern actually matches. As mentioned above, the node at the end of the edge where the end of the pattern is located after executing the algorithm we just outlined, is the root of a subtree that has, as leaves, the suffixes corresponding to the positions where the pattern matches. It is, thus, sufficient to run a simple tree traversal algorithm to collect the labels of all the leaves and output all the match locations.

If all we care about is figuring out how many matches there are, but not actually finding their positions, we could pre-process the tree to store, at each internal node of the tree, the number of leaves located below it. Once we find a match of the pattern in the tree, we simply look up the number stored at the node right below the match.

Exercises

  1. Write code that searches for a pattern inside a suffix tree

  2. Write code that enumerates the labels of all leaves below a given node in a tree

  3. Write code that pre-processes a (suffix) tree to include, at each node, the number of leaves below it.

  4. Write code that pre-processes a suffix tree to include, at each node, the depth of the node in terms of the number of characters spelled by the path from root to the node.

Finding repeats

A key component of the Z and KMP algorithms was identifying that certain regions of a string are identical to each other. In the Z and KMP algorithms, we focused on a specific type of repeated strings—those that match the prefix (beginning) of the string. But repeats are interesting more generally, and we've actually dealt with them when talking about finding hidden signals in DNA. There, we looked at strings of length k (k-mers) that are repeated throughout the genome. You probably asked yourself then (and if not, think a bit about this now) how we decide what value of k we should use? Suffix trees allow us to figure this out without starting with any particular guess for k.

Specifically, we're looking for segments of DNA that appear multiple times in the genome, as shown below (the shaded areas). You'll note that these segments represent the prefixes of different suffixes of the string (S1, S2, and S3 in the figure), thus they will show up as a shared path inside the tree (thick line), before the corresponding suffixes branch off.

Note that any path (and part thereof) leading to an internal node of the tree represents a repeat as it is the prefix of at least two distinct suffixes. We could, thus, expand the idea of frequent k-mers that we developed earlier in the class into a new problem:

Find the most frequent repeat in the genome that is longer than k.

This way, the value of k becomes a lower bound meant to avoid finding obvious patterns (e.g., for k=1, the answer is the most frequent letter in the genome, which is not particularly interesting). But this new algorithm could find interesting patterns that are longer than k. With some basic book-keeping and pre-processing, you could also look for clumps of frequent repeats. You will need to record, at each node in the tree, the range of coordinates represented at that node (the minimum and maximum values of the leaves in the subtree rooted at the node), then find nodes with relatively narrow ranges (where distance between min and max is below a certain value) but that have many leaves below them.

Exercises

  1. Describe an algorithm that finds the most frequent repeat that is longer than k.

  2. Describe an algorithm that finds all repeats longer than k and that occur more than N times in the genome.

  3. Describe an algorithm that pre-processes a suffix tree to include, at each node, the minimum and maximum values of the leaves in the sub-tree rooted at that node.

Notes

I've started this chapter by promising you that suffix trees can efficiently solve the longest common substring problem. This is true, but it requires a bit of a deeper understanding of suffix trees that we develop here.

I'll leave it as an exercise to write out the code that simultaneously measures the string depth of nodes in a suffix tree and determines whether the nodes are "diverse" (contain leaves from two strings), and outputs the "deepest" such node.

Exercises

  1. Describe a linear-time algorithm that uses a suffix tree to compute the sp values of a string

  2. Describe a linear-time algorithm that uses a suffix tree to compute the Z values of a string

One "hack" we can use is to construct a suffix tree that contains the suffixes of two different strings at the same time. To figure out which string is which, we'll include different characters at the end of the two strings, such as $ and #. In this tree, the longest common substring will be a path in the tree that is the prefix of suffixes in both strings, and is the longest such path. It should be pretty easy to figure out that this path must end at an internal node in the tree—if it ended in the middle of an edge, then you could extend it until the nearest node, violating the property that it is the longest path. Furthermore, the node where the path ends must be the root of a sub-tree that contains leaves from both strings. Since we used two different end of string characters, each leaf in the tree can be uniquely assigned to one of the two strings. Thus, to find the longest common substring, you need to find "diverse" nodes and pick the one that is the furthest away from the root in terms of the number of characters along the path from the root to the node. It turns out that a simple tree traversal can enumerate all diverse nodes, and compute their "string depth", and can thus find the shortest common substring in O(n)O(n)O(n) instead of O(n2)O(n^2)O(n2) as we proposed earlier. Of course, it's all predicated on being able to construct a suffix tree in O(n)O(n)O(n). This result may not seem surprising, but before suffix trees were discovered, (the K in the KMP algorithm) conjectured that the shortest common substring could not be solved in sub-quadratic time.

Donald Knuth
Trie constructed from four string patterns. Strings that share a prefix (abcq and abcr) are represented along the same path until their disagreement. Note that only prefix-level similarities are compacted in the tree - the same strings "bc" and "cd" are repeated along multiple paths within the tree.
Suffix trie for the string "abc". The three patterns within the trie represent the three suffixes of the string: 1-"abc", 2-"bc", 3-"c". The numbers represent the starting position of each suffix within the string.
Suffix trie "compressed" by removing the non-branching internal nodes.
Suffix tree of the string "abc". The labels on the edges implicitly refer to the coordinates, within the original string, of the substring represented along the edge.
Suffix tree for string "ATATAAT". The strings referenced by the pair of coordinates for each edge are shown in parentheses below the corresponding coordinates.
Suffix tree for string "ATATAAT$". The addition of the "end of string" character $ forces all of the suffixes to end at a leaf.
Using a suffix tree to search for pattern "ATA" within the string "ATATAAT$". Starting from the root, the path highlighted with a box indicates that "ATA""is the prefix of two suffxes stored in the tree: suffix 1 (the entire string), and suffix 3 ("ATAAT"), indicating that string "ATA" occurs twice, starting at coordinates 1 and 3.
At the top of the figure we highlight a string that contains three repeated segments (in gray). The three lines below highlight three of the suffixes that have the repeated segments as prefix. The tree below shows the repeated segment as thick gray lines. The three suffixes occur within the subtree the starts below the repeated segment.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent characters within the strings placed in the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent characters within the strings placed in the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent characters within the strings placed in the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent characters within the strings placed in the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent characters within the strings placed in the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent characters within the strings placed in the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Numbers on the edges represent the range in the original string corresponding to the edge's label.
Graphical representation of a tree structure with the root at the top and edges hanging down. Numbers on the edges represent the range in the original string corresponding to the edge's label.
Graphical representation of a tree structure with the root at the top and edges hanging down. Numbers on the edges represent the range in the original string corresponding to the edge's label.  Letters in parantheses represent the actual string referenced by the pair of coordinates.
Graphical representation of a tree structure with the root at the top and edges hanging down. Numbers on the edges represent the range in the original string corresponding to the edge's label.  Letters in parantheses represent the actual string referenced by the pair of coordinates.
Graphical representation of a tree structure with the root at the top and edges hanging down. Numbers on the edges represent the range in the original string corresponding to the edge's label.  Letters in parantheses represent the actual string referenced by the pair of coordinates.
Graphical representation of a tree structure with the root at the top and edges hanging down. Numbers on the edges represent the range in the original string corresponding to the edge's label.  Letters in parantheses represent the actual string referenced by the pair of coordinates.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent the string stored along the edge. A curved box along one edge highlights the search path for a pattern within the tree.
Graphical representation of a tree structure with the root at the top and edges hanging down. Letters on the edges represent the string stored along the edge. A curved box along one edge highlights the search path for a pattern within the tree.
A white rectangle with three gray rectangles inside highlights a string and three repeated segments. Three lines below it, aligned with the start of the gray rectanges represent the three suffixes that share the repeated region as a prefix. At the bottom, a tree representation highlights in thick gray lines where the repeated region occurs along a path from the root.
A white rectangle with three gray rectangles inside highlights a string and three repeated segments. Three lines below it, aligned with the start of the gray rectanges represent the three suffixes that share the repeated region as a prefix. At the bottom, a tree representation highlights in thick gray lines where the repeated region occurs along a path from the root.