Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Introduction
  • The LCP array: speeding up suffix array searches
  • I. LCP[L, S] > LCP[L, M]
  • II. LCP[L, S] < LCP[L, M]
  • III. LCP[L, S] == LCP[L, M]
  • Exercises
  1. String indexing

Suffix arrays

PreviousSuffix trees: beyond the basicsNextThe Burrows-Wheeler transform and the FM-index

Last updated 3 months ago

Introduction

Suffix trees appear, at first glance, to solve a wide range of exact matching problems efficiently, in both time and space consumption. The constants associated with the O(n)O(n)O(n) space consumption, however, can be quite large, as the tree can be viewed as a collection of O(n)O(n)O(n) pointers, each of which can use 4-8 bytes. In practice, the best implementations of suffix trees use about 15 bytes per base, compared to 2 bits/base needed to store the actual sequence. Can we store all the suffixes of a string more efficiently, and still allow fast searches?

A simple answer is the suffix array data structure of and (Manber and Myers 1993). In it's most basic form, we can store all the suffixes of a string in a sorted list. We already know how to search sorted lists in O(log⁡n)O(\log n)O(logn) time using binary search, thus the problem may be easily solved.

Here's an example for the string BANANA, with each suffix numbered:

Before sorting:
1 BANANA
2 ANANA
3 NANA
4 ANA
5 NA
6 A

After sorting:
6 A
4 ANA
2 ANANA
1 BANANA
5 NA
3 NANA

Note, however, that in its basic form, this data structure is quadratic in size as it must store all the characters in all the suffixes. Just like in the case of suffix trees we can overcome this limitation by simply storing the position of the suffix in the string (one number suffices, the starting position). The resulting structure is much more compact, storing just O(n)O(n)O(n) integers (6, 4, 2, 1, 5, 3, for the example shown above).

How long does it take, however, to search this data structure? At first glance, the simple binary search algorithm would yield an O(log⁡n)O(\log n)O(logn) runtime. Binary search, however, assumes that each comparison is performed in a unit of time. Comparing strings requires a runtime proportional to the length of the string, i.e., in the worst case the binary search may require O(nlog⁡n)O(n \log n)O(nlogn) time. This is much worse than the O(n)O(n)O(n) time needed to search within a suffix tree.

The LCP array: speeding up suffix array searches

To fix the runtime, we make a simple observation: if two suffixes are next to each other in the sorted suffix array, these suffixes likely share a common prefix. Thus, while performing the binary search we may be able to reuse some of the comparison information.

Let us formalize this idea a bit better. We will construct a 'longest common prefix' (LCP) array that stores, for every pair of suffixes, the length of the longest common prefix of those suffixes. In other words, LCP[i,j] contains the largest number of characters that match exactly at the beginning of suffixes i and j of text T, or T[i .. i+LCP[i,j]] = T[j .. j+LCP[i,j]]. Let us see how the LCP array helps find a match in a suffix array for text T.

We will follow the execution of the binary search algorithm as it's trying to find if a string S matches the prefix of some suffix in T. We will pick up the algorithm in the middle of its execution. Just as a brief reminder, binary search operates by maintaining the left (L) and right (R) boundaries of the range in an array where a particular value may be found, starting with L = 0, R = n, where n is the size of the array. At each stage, the algorithm interrogates the position in the middle of this range (M = (L+R)/2) and recurses on the left [L, M] or right half [M, R] of the search interval. We will, thus, assume that we are at particular positions L, M, R in the suffix array. We will also assume that we have, so far computed the value LCP[S, L], the longest common prefix of the query string S and the suffix at the left boundary of the current search interval. In the previous iteration, the string labeled L would have been the middle of the range, and knowing the longest common prefix of S and this string was necessary to decide the direction in which we are recursing. We will now compare this value with the precomputed LCP[L, M] value (remember, all LCP[i,j] values are precomputed for the suffix array).

Importantly — our goal is to compare S to M, but we try to infer how they compare to each other based on how S had aligned to other strings during prior iterations.

We encounter three possible situations:

I. LCP[L, S] > LCP[L, M]

Without comparing any characters, we can now directly recurse on the left side of the search interval as we know that string S is lexicographically smaller than suffix M. Otherwise, the longest common prefix of L and M would have been longer. Another way to see this situation is to realize that S[LCP[L,M] + 1] = L[LCP[L, M] +1] < M[LCP[L,M] + 1] (the character following the LCP[L, M] prefix in S is the same in L but must be smaller than the corresponding character in M, otherwise the LCP property would be invalidated).

II. LCP[L, S] < LCP[L, M]

Using a similar argument, we now know that we have to recurse on the right side of the search interval. Here we know that S cannot occur between L and M since the all suffixes in that range share the prefix LCP[L, M]. By exclusion, S can only occur to the right of M (if S matches a suffix, this suffix must occur between L and R by induction as we start with a search interval spanning the entire suffix array).

III. LCP[L, S] == LCP[L, M]

In this case we do not have sufficient information to decide whether we should recurse on the left or right side of the array. To figure out where S may fit we need to compare the characters of S beyond LCP[L, S] = LCP[L, M] to the corresponding characters of M.

To understand why this algorithm is efficient, it suffices to notice that in cases I and II we never compare characters, rather make a decision based solely on the values of the LCP array. In case III, we only compare characters we have not matched before, limiting the number of successful comparisons to O(m)O(m)O(m) (m=length(S)m = length(S)m=length(S)). Just as in the case of KMP, we may mismatch a same character of S multiple times, however each time we do we recurse in the algorithm. The total runtime is, thus, O(m+log(n))O(m + log(n))O(m+log(n)).

Note, however, that we have cheated a bit by ignoring the runtime necessary to construct the suffix array and corresponding LCP array.

Observation: The suffix array of string S can be constructed in O(n) time from the suffix tree of S.

Ok, we cheat a bit because we have to construct a suffix tree first, then convert it to a more space-efficient structure, but at least we now have a linear time algorithm for constructing the suffix array. Suffix arrays can be constructed directly without the help of a suffix tree using a careful implementation of radix sort.

The LCP array appears to be more complicated. We defined this array for every pair i,j of suffixes in S, i.e., the array must contain O(n2)O(n^2)O(n2) values and would, thus, require at least O(n2)O(n^2)O(n2) runtime to compute. To reduce this time, we notice that we do not need to store all LCP values. Rather, we only use the LCP[L,M] or LCP[M,R] values where L, M, R are a triplet of values encountered in the execution of the binary search algorithm. It is not difficult to prove that not all pairs of coordinates will be examined. To see why, you can represent the execution of the binary search as a binary search tree. Each node in the tree corresponds to a different (L, M, R) triplet encountered during the binary search process. It is, thus, easy to see that the number of such triplets is linear (the tree has O(n)O(n)O(n) leaves, and therefore O(n–1)O(n – 1)O(n–1) internal nodes). It, thus, suffices to store only a linear number of LCP values.

While we solved our storage problem, we still need to compute these values in linear time.

Observation: The LCP array can be computed in linear time using the suffix tree of string S.

Observation: For any triplet (L, M, R), LCP(L, R) = min(LCP(L, M), LCP(M, R))

The new data structure, including the suffix array itself (O(n)O(n)O(n) space) and the LCP array (O(n)O(n)O(n) as well), requires linear space, just like the suffix trees. The constants are much lower than for suffix trees, requiring just 4 bytes per base in the most basic implementations. Note, however, that we have taken a hit in terms of runtime as the matching algorithm requires O(m+log(n))O(m + log(n))O(m+log(n)) time, as opposed to O(m)O(m)O(m) in the case of suffix trees.

Exercises

  1. Describe in pseudo-code (or implement in your favorite language) an algorithm that computes the suffix array of a string from its suffix tree.

  2. Describe in pseudo-code (or implement in your favorite language) an algorithm that computes the LCP array of a string from its suffix tree. Only construct the LCP array values needed for performing the binary search.

  3. Given a suffix array (including the relevant LCP values), implement the search algorithm described above.

Manber
Myers
Graphical representation of the search range within the suffix array. The range is bounded by suffixes marked L and R, and its middle is suffix M. The shared prefix (longest common prefix) between the query string S and L is shown as a blue box, and that between L and M is showed as a gray box. In this situation, LCP(L, S) > LCP(L, M). During the next iteration the recursion will proceed to the left of M.
Graphical representation of the search range within the suffix array. The range is bounded by suffixes marked L and R, and its middle is suffix M. The shared prefix (longest common prefix) between the query string S and L is shown as a blue box, and that between L and M is showed as a gray box. In this situation, LCP(L, S) < LCP(L, M). During the next iteration the recursion will proceed to the right of M.
Graphical representation of the search range within the suffix array. The range is bounded by suffixes marked L and R, and its middle is suffix M. The shared prefix (longest common prefix) between the query string S and L is shown as a blue box, and that between L and M is showed as a gray box. In this situation, LCP(L, S) == LCP(L, M). It is not possible to determine whether to recurse left or right. Making this decision requires further comparisons between S and M.
Vertical lines representing a query string S and the three string from the suffix array that represent the left (L), right (R), and middle (M) of the search range. The longest common prefix of S and L is shown as blue rectangles. The longest common prefis of L and M is shown as gray rectangles.
Vertical lines representing a query string S and the three string from the suffix array that represent the left (L), right (R), and middle (M) of the search range. The longest common prefix of S and L is shown as blue rectangles. The longest common prefis of L and M is shown as gray rectangles.
Vertical lines representing a query string S and the three string from the suffix array that represent the left (L), right (R), and middle (M) of the search range. The longest common prefix of S and L is shown as blue rectangles. The longest common prefis of L and M is shown as gray rectangles.
Vertical lines representing a query string S and the three string from the suffix array that represent the left (L), right (R), and middle (M) of the search range. The longest common prefix of S and L is shown as blue rectangles. The longest common prefis of L and M is shown as gray rectangles.
Vertical lines representing a query string S and the three string from the suffix array that represent the left (L), right (R), and middle (M) of the search range. The longest common prefix of S and L is shown as blue rectangles. The longest common prefis of L and M is shown as gray rectangles.
Vertical lines representing a query string S and the three string from the suffix array that represent the left (L), right (R), and middle (M) of the search range. The longest common prefix of S and L is shown as blue rectangles. The longest common prefis of L and M is shown as gray rectangles.