Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Advanced inexact alignment

Sequence alignment with bounded error

PreviousSequence alignment in linear spaceNextIntroduction to proteomic data analysis

Last updated 4 months ago

All the inexact alignment algorithms we have described so far require quadratic run time, irrespective of how well sequences match each other. Can we do better for sequences that match well? For example, in the absence of gaps, global alignment simply requires filling the main diagonal of the table, leading to a linear time algorithm.

To explore the relationship between alignment runtime and the amount of error, let us denote by k the numbers of edits we encounter in an alignment. Assume for now that we know a priori what k is, or more realistically, we have set an upper bound on its value (e.g., we are only interested in alignments with fewer than k=10 edits).

In this case, we can easily convince ourselves that certain parts of the dynamic programming table do not need to be filled in. Specifically, reaching a cell that is more than k diagonals away from the main diagonal of the table requires us to incur more than k edits. As a result, we do not need to fill out any cells of the dynamic programming matrix that are further than k edits away from the main diagonal, as shown in the figure below.

Our algorithm only has to fill in O(2kmin⁡(n,m))O(2 k \min(n,m))O(2kmin(n,m)) cells, which leads to essentially linear time execution for k≪min⁡(n,m)k \ll \min(n,m)k≪min(n,m). (in English: k much smaller than the length of the smallest string)

It is certainly possible that the best alignment found in the 'trimmed' table (the band around the diagonal) contains more than k differences, thus our algorithm must check that the best alignment reported satisfies the bound on the number of edits.

Question: If the best alignment has more than k mismatches, can we still report it as the best alignment of the two strings?

If we do not know the value of k a priori we can find it using binary search. Specifically, we begin with a guess for k and find the best alignment. If this alignment has more than k edits, it means that we underestimated the error, and we repeat the process with k′=2kk'=2kk′=2k, and so on, until we find an alignment with fewer edits than the current value of k. We can easily see that this algorithm will always find the best alignment in a parsimonious way (k is never more than twice as large as necessary).

Question: What is the cost of this algorithm?

Let us denote by kMk_MkM​, the maximal value of k encountered in the algorithm, i.e., the best alignment has fewer than kMk_MkM​ edits (generally between kM/2k_M/2kM​/2 and kMk_MkM​ edits). Clearly, the maximum memory used by the algorithm is O(kMmin⁡(n,m))O(k_M \min(n, m))O(kM​min(n,m)), as this is the size of the "band" that we used within the dynamic programming table. We can enumerate the steps that took us to this point and estimate their computational complexity. The run time can be written as:

O(min⁡(n,m)⋅(kM+kM/2+kM/4+…+k))O(\min(n, m) \cdot (k_M + k_M / 2 + k_M / 4 + \ldots + k))O(min(n,m)⋅(kM​+kM​/2+kM​/4+…+k)) where k is our initial guess. As a reminder, kMk_MkM​ is the final value of the band, at the point where an alignment contained entirely within the 2kM 2 k_M2kM​ wide band had a total number of edits of less than kMk_MkM​. Note that we now have the same exponential series as we encountered when discussing alignment in linear space, series that converges to 2. Thus, the total runtime for the banded alignment is O(min⁡(n,m)⋅2⋅kM)O(\min(n, m) \cdot 2 \cdot k_M)O(min(n,m)⋅2⋅kM​) . Since kMk_MkM​ is itself at most twice as large as the optimal number of edits, despite the possibly large number of doubling steps in our algorithm, the runtime will be bounded by O(4⋅k∗⋅min⁡(n,m)) O(4\cdot k^*\cdot \min(n, m))O(4⋅k∗⋅min(n,m)) and total memory bounded by O(2⋅k∗⋅min⁡(n,m))O(2 \cdot k^* \cdot \min(n, m))O(2⋅k∗⋅min(n,m)), where k∗k^*k∗ is the number of edits of the optimal alignment. Without actually knowing this value, were are able to create an algorithm that uses only about as many resources as needed to compute the alignment.

Banded alignment. If alignment A has fewer than k edits it will be the best scoring such alignment. If A has more than k edits then it is not necessarily the best alignment. Alignment B crosses outside the 2k-width band, thus it has to incur more than k edits, but it could have a better score than alignment A if the latter also has more than k edits. Alignment B, however, cannot be found since the banded alignment algorithm only computes the dynamic programming table between the diagonals.
A rectangle representing a dynamic programming table. The main diagonal is highlighted as well as two diagonals on either side of it, each located at distance k from the main diagonal. Two alignments are shown as squiggly lines - alignment A is entirely within the region bounded by the two secondary diagonals, while alignmnet B crosses outside of this bound.