Sequence alignment in linear space
The overall runtime of O(nm) for global inexact sequence alignment (where n and m are the lengths of the two strings) is not easy to overcome (though we will try to do so soon). Can we, however, reduce the overall memory usage?
For computing the scores, it is easy to see that we do not need to store the entire dynamic programming table, rather it is sufficient to remember just two rows (or columns, or diagonals) in order to find out the best score in the table, as all the values necessary for computing a cell are found on just two rows of the table (the current row, and the preceding one).
Specifically, a linear-space computation of the alignment score would look something like:
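A minimal Python sketch of this two-row idea is shown below; the unit match/mismatch/gap costs are illustrative assumptions, not fixed by the discussion above:

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score in O(len(b)) space: keep only two rows."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: all-gap prefixes of b
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: all-gap prefix of a
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr  # discard the older row; two rows suffice
    return prev  # last row of the table; prev[-1] is the global score
```

Only two rows (`prev` and `curr`) ever exist at once, so memory is O(m) even though the full table has nm cells.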
If we want to also find the actual alignment, however, we need to be able to backtrack, and thus we need the information stored in each cell, i.e., the entire table.
To gain some intuition on why we might be able to do better than O(nm) space, note that the actual best alignment itself can be stored in linear space—we simply store one out of the many possible paths through the table connecting the two corners (assuming global alignment). The problem is that we do not know which path to store until the whole table is filled out, hence the need to store O(nm) cells. The key idea behind a linear space alignment algorithm is that we can, in linear space, figure out parts of the optimal path, which will eventually allow us to reconstruct the whole path.
Specifically, the algorithm uses a "divide and conquer" strategy, by breaking up the dynamic programming table into two, and focusing separately on the alignments of the first and second halves of one of the strings (as we'll show in more detail below). For the first half, the alignment proceeds as usual, filling the first half of the dynamic programming table up to (and including) the middle row.
For the second half, we fill the table backwards, starting from the bottom, effectively matching the reverse of the second half of one string to the other. Importantly, we are not referring to reverse complementing the strings, just flipping them around. It is easy to see that the optimal alignment should be the same for both the forward and reverse versions of the problem. Again, we fill the dynamic programming table up to (and including) the middle row.
Note that the alignment algorithm we use only computes the dynamic programming score (does not store backtrack pointers) and it can, thus, be executed in linear space, by only storing two rows of the matrix. Also note that the middle row is computed by both the forward and the backward dynamic programming approaches, corresponding to alignment scores for the first and second halves of the string represented along the left side of the table. We will set the value of this middle row to be the sum of the values obtained by the two executions.
We argue that the optimal path through the entire dynamic programming table crosses this middle row exactly through the cell with the maximum value, as this location maximizes the sum of the dynamic programming scores of the two halves of the string, a value which is exactly the score of the alignment for the entire string. You can perhaps gain some intuition for why this is true by thinking of the alignment as a graph. The highest-scoring path through the graph has the same score irrespective of the direction in which you traverse it, and the location in the matrix (node in the graph) we select is the one that accumulates the highest score when reached from both ends of the matrix.
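This forward-plus-backward computation of the middle row can be sketched as follows; the function names and the unit scoring scheme are illustrative choices, not part of the original presentation:

```python
def score_last_row(a, b, match=1, mismatch=-1, gap=-1):
    # linear-space DP: return only the last row of the score table for a vs b
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev

def middle_crossing(a, b):
    # where the optimal path crosses the middle row of the table for `a`
    mid = len(a) // 2
    fwd = score_last_row(a[:mid], b)              # forward pass, first half
    bwd = score_last_row(a[mid:][::-1], b[::-1])  # backward pass, strings reversed
    sums = [f + r for f, r in zip(fwd, reversed(bwd))]
    k = max(range(len(sums)), key=sums.__getitem__)
    return mid, k, sums[k]  # sums[k] equals the optimal global alignment score
```

Here `fwd[j]` is the best score for aligning the first half of `a` to `b[:j]`, and `reversed(bwd)[j]` is the best score for aligning the second half of `a` to `b[j:]`, so their sum at the maximizing column is the score of the full alignment.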
At this point, we can directly fill in part of the optimal alignment by following the path this alignment takes along the middle row up to the point where it diverges into the top or bottom halves of the table. In other words, we can backtrack using the information available to us at this point, i.e., the middle row of the matrix and the two rows right above and below it. Backtracking cannot proceed any further because, in order to save memory, we only remember two rows of the dynamic programming table at a time.
So far we have, thus, identified a small section of the optimal alignment path (which, as pointed above, can be stored in linear space) while using only linear space. We, then, repeat this process by recursing on the two halves of the dynamic programming table, at each step reconstructing another small segment of the path, until we eventually reconstruct the whole path. Note that at each stage the recursion only needs to focus on part of the dynamic programming table as shown in the figure below. Given that we only allow indel and match/mismatch edits, the alignment has to advance in one or both of the strings, thus its path will be limited to the two segments of the dynamic programming table defined by the location in the middle row that we just discovered (the location that maximizes the sum of the scores for the two halves of one of the strings) and the top-left and bottom-right corners of the matrix, respectively.
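Putting the pieces together, the full divide-and-conquer recursion can be sketched as below. This is a hedged illustration: the scoring scheme (match +1, mismatch -1, gap -1) and the simple base cases are assumptions made for compactness, and for brevity this sketch recurses down to single characters rather than stitching path segments along the middle row.

```python
def hirschberg(a, b):
    """Linear-space reconstruction of one optimal global alignment
    (illustrative scoring: match +1, mismatch -1, gap -1)."""
    def last_row(x, y):
        # linear-space score computation: keep only two rows
        prev = [-j for j in range(len(y) + 1)]
        for i in range(1, len(x) + 1):
            curr = [-i]
            for j in range(1, len(y) + 1):
                diag = prev[j - 1] + (1 if x[i - 1] == y[j - 1] else -1)
                curr.append(max(diag, prev[j] - 1, curr[j - 1] - 1))
            prev = curr
        return prev

    if len(a) == 0:
        return "-" * len(b), b           # a exhausted: gap out the rest of b
    if len(b) == 0:
        return a, "-" * len(a)           # b exhausted: gap out the rest of a
    if len(a) == 1:
        # base case: place the single character of a against b directly
        j = b.find(a)
        if j == -1:
            j = 0                        # no match: mismatch against b[0]
        return "-" * j + a + "-" * (len(b) - j - 1), b
    mid = len(a) // 2
    fwd = last_row(a[:mid], b)                 # forward pass, top half
    bwd = last_row(a[mid:][::-1], b[::-1])     # backward pass, bottom half
    # split b where the summed forward+backward score is maximal
    k = max(range(len(b) + 1), key=lambda j: fwd[j] + bwd[len(b) - j])
    top = hirschberg(a[:mid], b[:k])
    bottom = hirschberg(a[mid:], b[k:])
    return top[0] + bottom[0], top[1] + bottom[1]
```

Each call stores only two rows plus the recursion stack, so space stays linear, while every level of the recursion works on disjoint sub-rectangles of the original table.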
While it is clear that this algorithm uses linear space, it is not at all obvious that it is efficient. To compute its runtime, note that at the first stage the runtime is exactly the same as that for the traditional dynamic programming approach: O(nm).
As we recurse, however, at each stage we halve the runtime since we ignore half of the dynamic programming table. The total computational cost of this algorithm is, thus, nm for stage 1, nm/2 for stage 2, nm/4 for stage three, ..., nm/2^i for stage i+1, and so on, leading to a total time of:

nm (1 + 1/2 + 1/4 + ... + 1/2^i + ...) ≤ 2nm

The inequality is due to the properties of the geometric series: 1 + 1/2 + 1/4 + ... + 1/2^i + ... = 2. Thus, we can achieve a linear space algorithm while just doubling the run time.
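A quick numeric sanity check of this geometric bound (the lengths n and m below are arbitrary illustrative values):

```python
# each recursion stage i (0-indexed) fills about nm / 2**i cells;
# the grand total over all stages stays below 2 * n * m
n, m = 1000, 800          # illustrative string lengths
total = sum(n * m / 2 ** i for i in range(40))
assert total < 2 * n * m  # the partial sums never reach the limit 2nm
```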