Landau-Vishkin algorithm

In the traditional approach for computing the alignment between two sequences, the dynamic programming formulation sets up the sub-problem to be the alignment between the prefixes of the aligned strings. The way in which the algorithm moves from sub-problem to sub-problem (e.g., moving row by row, column by column, along diagonals, or even moving along reversed strings) creates opportunities for various optimizations, as see, e.g., when seeking to perform the alignment within certain error bounds. Can a re-definition of the dynamic programming sub-problem provide further opportunities for optimizing the alignment algorithms?

A solution was proposed by Landau and Vishkin (Landau and Vishkin 1986). This approach focuses on the different diagonals within the dynamic programming table – a perfect match represents simply a stretch of one of these diagonals (shown in thick lines in the figure below). The basic idea behind the algorithm is to try to find the furthest distance along each diagonal that can be traveled with a certain number of edits. The dynamic programming sub-problem is, thus, defined as: $V(i, d)$ - the rightmost end of an alignment that contains $i$ mismatches and ends on diagonal $d$ .

By diagonal we mean all the cells in the dynamic programming table that have the same difference between the coordinates in the two strings: $d = k – j$ , where $k$ is the coordinate in $S1$ and $j$ is the coordinate in $S2$ . To further clarify the following better, $V(i, d)$ will refer to the coordinate in $S1$ of the end of the alignment. Note that from the definition of $d$ and $V(i, d)$ we can immediately compute the coordinates of the cell ending this alignment to be $V(i, d)$ in $S1$ (by definition), and $V(i, d) – d$ in $S2$ .

It should be fairly easy to see that $V(i, d)$ can be computed from nearby values as the maximum among three cases described below. In all cases, we denote $LCP(i, j)$ to represent the longest common prefix between $S1[i..n]$ and $S2[j..m]$ , i.e., the length of the shared prefix of suffixes $i$ and $j$ of strings $S1$ and $S2$ , respectively.

Mismatch along diagonal $d$ . $VM = V(i-1, d) + LCP(V(i-1, d)) + 1, V(i-1, d) - d + 1)$

Rectangle representing a dynamic programming table between strings S1 (represented along the horizontal axis) and S2 (represented along the vertical axis) with the lengths n, m, respectively. A diagonal line located at position d along the top edge of the rectangle represents the d-th diagonal in the matrix. The end of a possible alignment is shown as a thickened line along the diagonal, ending at coordinates (V(i-1, d), V(i-1, d) - d). A second thickened diagonal starting one character after the end of the first on is labeled LCP(V(i-1, d) + 1, V(i-1, d) - d + 1) and represents the longest exact match along this diagonal after a mismatch at the end of the previously aligned region. The end of this diagonal is labeled VM.

Deletion from S1. $VD = V(i-1, d-1) + LCP(V(i-1, d-1) + 1, V(i-1, d-1) - d + 1)$

In this setting, the alignment "jumps" from diagonal $d-1$ to diagonal $d$ along $S1$ . The character within $S1$ that is skipped by this jump is not aligned to any character in $S2$ , thus it is deleted (aligned to a gap character) from the alignment. Once on diagonal $d$ , the alignment proceeds further as far as exact matches are found, i.e., the length of the longest common prefix of the suffixes $V(i-1, d-1)+1$ of $S1$ and $V(i-1, d-1) -d + 1$ of $S2$ .

Insertion into S1. $VI = V(i-1, d+1) + LCP(V(i-1, d+1), V(i-1, d+1) -d)$

In this situation, the alignment jumps vertically from diagonal $d+1$ to diagonal $d$ , representing an insertion of a character into $S1$ or a deletion from $S2$ (the corresponding character in $S2$ will be aligned to a gap character in $S1$ ). Once on diagonal $d$ , the alignment proceeds further as far as exact matches are found, i.e., the length of the longest common prefix of the suffixes $V(i-1, d+1)$ of $S1$ and $V(i-1, d+1) -d$ of $S2$ .

As for other dynamic programming algorithms, the final alignment score is the maximum of the three situations described:

$V(i, d) = max(VM, VD, VI)$

As described, the cost of a gap is 1, however it should be fairly easy to extend this algorithm to more complex gap functions by properly accounting for the corresponding scores when jumping across diagonals in the computation of $VD$ and $VI$ . For example, a k-length deletion in $S1$ would result in the following equation:

$VD = V(i-cost(k), d-k) + LCP(V(i-cost(k), d-k) + k, V(i-cost(k), d-k) - d + k)$

where $cost(k)$ is the cost of a gap of length $k$ .

It should be fairly obvious now that computing the $V(i, d)$ values is sufficient for finding the optimal alignment. Specifically, we are looking for the smallest $i$ value that leads one of the $V(i, d)$ values to reach the bottom right corner of the dynamic programming table. If we fill in the $V(i, d)$ values in increasing order of $i$ , we stop once we reach the correct value $i = e$ , where $e$ is the minimum number of edits, and the total number of values filled in is $O(ne)$ . As a result, the memory usage is automatically 'tuned' to the divergence between the two strings.

The runtime depends on how quickly we can compute the LCP values. One approach relies on generalized suffix trees – suffix trees constructed from the suffixes of two or more strings (it is easy to see that Ukkonen's algorithm can easily construct such trees). If we construct a suffix tree from strings $S1$ and $S2$ the $LCP(i, j)$ for suffixes $i$ in $S1$ and $j$ in $S2$ is simply the string depth of the lowest common ancestor of the leaves corresponding to the two suffixes. An algorithm due to Vishkin can compute this information in constant time (more specifically, this algorithm can find the lowest common ancestor of two nodes in any tree), leading to an overall $O(ne)$ runtime for our algorithm.

References

Landau, G. M. and U. Vishkin (1986). Introducing efficient parallelism into approximate string matching and a new serial algorithm. Proceedings of the eighteenth annual ACM symposium on Theory of computing. Berkeley, California, USA, ACM: 220-230. http://dl.acm.org/citation.cfm?id=12152

PreviousSequence alignment with bounded error NextParallelizing sequence alignment

Last updated 28 days ago

hashtagReferences

References