Landau-Vishkin algorithm

In the traditional approach for computing the alignment between two sequences, the dynamic programming formulation sets up the sub-problem to be the alignment between the prefixes of the aligned strings. The way in which the algorithm moves from sub-problem to sub-problem (e.g., moving row by row, column by column, along diagonals, or even moving along reversed strings) creates opportunities for various optimizations, as see, e.g., when seeking to perform the alignment within certain error bounds. Can a re-definition of the dynamic programming sub-problem provide further opportunities for optimizing the alignment algorithms?

A solution was proposed by Landau and Vishkin (Landau and Vishkin 1986). This approach focuses on the different diagonals within the dynamic programming table – a perfect match represents simply a stretch of one of these diagonals (shown in thick lines in the figure below). The basic idea behind the algorithm is to try to find the furthest distance along each diagonal that can be traveled with a certain number of edits. The dynamic programming sub-problem is, thus, defined as: V(i,d)V(i, d) - the rightmost end of an alignment that contains ii mismatches and ends on diagonal dd.

By diagonal we mean all the cells in the dynamic programming table that have the same difference between the coordinates in the two strings: d=kjd = k – j, where kk is the coordinate in S1S1 and jj is the coordinate in S2S2. To further clarify the following better, V(i,d)V(i, d) will refer to the coordinate in S1S1 of the end of the alignment. Note that from the definition of dd and V(i,d)V(i, d) we can immediately compute the coordinates of the cell ending this alignment to be V(i,d)V(i, d) in S1S1 (by definition), and V(i,d)dV(i, d) – d in S2S2.

It should be fairly easy to see that V(i,d)V(i, d) can be computed from nearby values as the maximum among three cases described below. In all cases, we denote LCP(i,j)LCP(i, j) to represent the longest common prefix between S1[i..n]S1[i..n] and S2[j..m]S2[j..m], i.e., the length of the shared prefix of suffixes ii and jj of strings S1S1 and S2S2, respectively.

Mismatch along diagonal dd. VM=V(i1,d)+LCP(V(i1,d))+1,V(i1,d)d+1)VM = V(i-1, d) + LCP(V(i-1, d)) + 1, V(i-1, d) - d + 1)

Rectangle representing a dynamic programming table between strings S1 (represented along the horizontal axis) and S2 (represented along the vertical axis) with the lengths n, m, respectively. A diagonal line located at position d along the top edge of the rectangle represents the d-th diagonal in the matrix.  The end of a possible alignment is shown as a thickened line along the diagonal, ending at coordinates (V(i-1, d), V(i-1, d) - d). A second thickened diagonal starting one character after the end of the first on is labeled LCP(V(i-1, d) + 1, V(i-1, d) - d + 1) and represents the longest exact match along this diagonal after a mismatch at the end of the previously aligned region. The end of this diagonal is labeled VM.
Computation of VM. After the end of the previous alignment ending at coordinates x = V(i-1, d), y= V(i-1, d) -d, a mismatch is allowed along the diagonal, then the longest exact match following this mismatch determines the end of the alignment, VM.

Deletion from S1. VD=V(i1,d1)+LCP(V(i1,d1)+1,V(i1,d1)d+1)VD = V(i-1, d-1) + LCP(V(i-1, d-1) + 1, V(i-1, d-1) - d + 1)

In this setting, the alignment "jumps" from diagonal d1d-1 to diagonal dd along S1S1. The character within S1S1 that is skipped by this jump is not aligned to any character in S2S2, thus it is deleted (aligned to a gap character) from the alignment. Once on diagonal dd, the alignment proceeds further as far as exact matches are found, i.e., the length of the longest common prefix of the suffixes V(i1,d1)+1V(i-1, d-1)+1 of S1S1 and V(i1,d1)d+1V(i-1, d-1) -d + 1 of S2S2.

Rectangle representing a dynamic programming table between strings S1 (represented along the horizontal axis) and S2 (represented along the vertical axis) with the lengths n, m, respectively. A diagonal line located at position d along the top edge of the rectangle represents the d-th diagonal in the matrix. Below it is a second diagonal representing the (d-1)-th diagonal in the matrix. The end of a possible alignment along diagonal d-1 is shown as a thickened line, ending at coordinates (V(i-1, d-1), V(i-1, d-1) - d + 1). A second thickened diagonal starting on diagonal d at the same y coordinate where the prior alignment ended, is labeled LCP(V(i-1, d-1) + 1, V(i-1, d-1) - d + 1) and represents the longest exact match along this diagonal after a deletion from S1 at the end of the previously aligned region. The end of this diagonal is labeled VD.
Computation of VD. At the end of the previous alignment ending at coordinates x=V(i-1, d-1), y=V(i-1, d-1)-d + 1, a deletion from S1 is allowed (short arrow in the middle), shifting the alignment onto diagonal d, then the longest exact match following this jump determines the end of the alignment, VD.

Insertion into S1. VI=V(i1,d+1)+LCP(V(i1,d+1),V(i1,d+1)d)VI = V(i-1, d+1) + LCP(V(i-1, d+1), V(i-1, d+1) -d)

In this situation, the alignment jumps vertically from diagonal d+1d+1 to diagonal dd, representing an insertion of a character into S1S1 or a deletion from S2S2 (the corresponding character in S2S2 will be aligned to a gap character in S1S1). Once on diagonal dd, the alignment proceeds further as far as exact matches are found, i.e., the length of the longest common prefix of the suffixes V(i1,d+1)V(i-1, d+1) of S1S1 and V(i1,d+1)dV(i-1, d+1) -d of S2S2.

Rectangle representing a dynamic programming table between strings S1 (represented along the horizontal axis) and S2 (represented along the vertical axis) with the lengths n, m, respectively. A diagonal line located at position d along the top edge of the rectangle represents the d-th diagonal in the matrix. Above it is a second diagonal representing the (d+1)-th diagonal in the matrix. The end of a possible alignment along diagonal d+1 is shown as a thickened line, ending at coordinates (V(i-1, d+1), V(i-1, d+1) - d - 1). A second thickened diagonal starting on diagonal d at the same x coordinate where the prior alignment ended, is labeled LCP(V(i-1, d+1), V(i-1, d+1) - d) and represents the longest exact match along this diagonal after an insertion into S1 (deletion from S2) at the end of the previously aligned region. The end of this diagonal is labeled VI.
Computation of VI. At the end of the previous alignment ending at coordinates x=V(i-1, d+1), y=V(i-1, d+1)-d - 1, a deletion from S2 is allowed (short arrow in the middle), shifting the alignment onto diagonal d, then the longest exact match following this jump determines the end of the alignment, VI.

As for other dynamic programming algorithms, the final alignment score is the maximum of the three situations described:

V(i,d)=max(VM,VD,VI)V(i, d) = max(VM, VD, VI)

As described, the cost of a gap is 1, however it should be fairly easy to extend this algorithm to more complex gap functions by properly accounting for the corresponding scores when jumping across diagonals in the computation of VDVD and VIVI. For example, a k-length deletion in S1S1 would result in the following equation:

VD=V(icost(k),dk)+LCP(V(icost(k),dk)+k,V(icost(k),dk)d+k)VD = V(i-cost(k), d-k) + LCP(V(i-cost(k), d-k) + k, V(i-cost(k), d-k) - d + k)

where cost(k)cost(k) is the cost of a gap of length kk.

It should be fairly obvious now that computing the V(i,d)V(i, d) values is sufficient for finding the optimal alignment. Specifically, we are looking for the smallest ii value that leads one of the V(i,d)V(i, d) values to reach the bottom right corner of the dynamic programming table. If we fill in the V(i,d)V(i, d) values in increasing order of ii, we stop once we reach the correct value i=ei = e, where ee is the minimum number of edits, and the total number of values filled in is O(ne)O(ne). As a result, the memory usage is automatically 'tuned' to the divergence between the two strings.

The runtime depends on how quickly we can compute the LCP values. One approach relies on generalized suffix trees – suffix trees constructed from the suffixes of two or more strings (it is easy to see that Ukkonen's algorithm can easily construct such trees). If we construct a suffix tree from strings S1S1 and S2S2 the LCP(i,j)LCP(i, j) for suffixes ii in S1S1 and jj in S2S2 is simply the string depth of the lowest common ancestor of the leaves corresponding to the two suffixes. An algorithm due to Vishkin can compute this information in constant time (more specifically, this algorithm can find the lowest common ancestor of two nodes in any tree), leading to an overall O(ne)O(ne) runtime for our algorithm.

References

Landau, G. M. and U. Vishkin (1986). Introducing efficient parallelism into approximate string matching and a new serial algorithm. Proceedings of the eighteenth annual ACM symposium on Theory of computing. Berkeley, California, USA, ACM: 220-230. http://dl.acm.org/citation.cfm?id=12152arrow-up-right

Last updated