Sequence alignment with bounded error

All the inexact alignment algorithms we have described so far require quadratic run time, irrespective of how well sequences match each other. Can we do better for sequences that match well? For example, in the absence of gaps, global alignment simply requires filling the main diagonal of the table, leading to a linear time algorithm.

To explore the relationship between alignment runtime and the amount of error, let us denote by k the numbers of edits we encounter in an alignment. Assume for now that we know a priori what k is, or more realistically, we have set an upper bound on its value (e.g., we are only interested in alignments with fewer than k=10 edits).

In this case, we can easily convince ourselves that certain parts of the dynamic programming table do not need to be filled in. Specifically, reaching a cell that is more than k diagonals away from the main diagonal of the table requires us to incur more than k edits. As a result, we do not need to fill out any cells of the dynamic programming matrix that are further than k edits away from the main diagonal, as shown in the figure below.

A rectangle representing a dynamic programming table. The main diagonal is highlighted as well as two diagonals on either side of it, each located at distance k from the main diagonal. Two alignments are shown as squiggly lines - alignment A is entirely within the region bounded by the two secondary diagonals, while alignmnet B crosses outside of this bound.

Our algorithm only has to fill in $O(2 k \min(n,m))$ cells, which leads to essentially linear time execution for $k \ll \min(n,m)$ . (in English: k much smaller than the length of the smallest string)

It is certainly possible that the best alignment found in the 'trimmed' table (the band around the diagonal) contains more than k differences, thus our algorithm must check that the best alignment reported satisfies the bound on the number of edits.

Question: If the best alignment has more than k mismatches, can we still report it as the best alignment of the two strings?

If we do not know the value of k a priori we can find it using binary search. Specifically, we begin with a guess for k and find the best alignment. If this alignment has more than k edits, it means that we underestimated the error, and we repeat the process with $k'=2k$ , and so on, until we find an alignment with fewer edits than the current value of k. We can easily see that this algorithm will always find the best alignment in a parsimonious way (k is never more than twice as large as necessary).

Question: What is the cost of this algorithm?

Let us denote by $k_M$ , the maximal value of k encountered in the algorithm, i.e., the best alignment has fewer than $k_M$ edits (generally between $k_M/2$ and $k_M$ edits). Clearly, the maximum memory used by the algorithm is $O(k_M \min(n, m))$ , as this is the size of the "band" that we used within the dynamic programming table. We can enumerate the steps that took us to this point and estimate their computational complexity. The run time can be written as:

$O(\min(n, m) \cdot (k_M + k_M / 2 + k_M / 4 + \ldots + k))$ where k is our initial guess. As a reminder, $k_M$ is the final value of the band, at the point where an alignment contained entirely within the $2 k_M$ wide band had a total number of edits of less than $k_M$ . Note that we now have the same exponential series as we encountered when discussing alignment in linear space, series that converges to 2. Thus, the total runtime for the banded alignment is $O(\min(n, m) \cdot 2 \cdot k_M)$ . Since $k_M$ is itself at most twice as large as the optimal number of edits, despite the possibly large number of doubling steps in our algorithm, the runtime will be bounded by $O(4\cdot k^*\cdot \min(n, m))$ and total memory bounded by $O(2 \cdot k^* \cdot \min(n, m))$ , where $k^*$ is the number of edits of the optimal alignment. Without actually knowing this value, were are able to create an algorithm that uses only about as many resources as needed to compute the alignment.

PreviousSequence alignment in linear space NextIntroduction to proteomic data analysis

Last updated 6 months ago