In the traditional approach for computing the alignment between two sequences, the dynamic programming formulation sets up the sub-problem to be the alignment between the prefixes of the aligned strings. The way in which the algorithm moves from sub-problem to sub-problem (e.g., moving row by row, column by column, along diagonals, or even moving along reversed strings) creates opportunities for various optimizations, as see, e.g., when seeking to perform the alignment within certain error bounds. Can a re-definition of the dynamic programming sub-problem provide further opportunities for optimizing the alignment algorithms?
A solution was proposed by Landau and Vishkin (Landau and Vishkin 1986). This approach focuses on the different diagonals within the dynamic programming table – a perfect match represents simply a stretch of one of these diagonals (shown in thick lines in the figure below). The basic idea behind the algorithm is to try to find the furthest distance along each diagonal that can be traveled with a certain number of edits. The dynamic programming sub-problem is, thus, defined as: V(i,d) - the rightmost end of an alignment that contains i mismatches and ends on diagonal d.
By diagonal we mean all the cells in the dynamic programming table that have the same difference between the coordinates in the two strings: d=k–j, where k is the coordinate in S1 and j is the coordinate in S2. To further clarify the following better, V(i,d) will refer to the coordinate in S1 of the end of the alignment. Note that from the definition of d and V(i,d) we can immediately compute the coordinates of the cell ending this alignment to be V(i,d) in S1 (by definition), and V(i,d)–d in S2.
It should be fairly easy to see that V(i,d) can be computed from nearby values as the maximum among three cases described below. In all cases, we denote LCP(i,j) to represent the longest common prefix between S1[i..n] and S2[j..m], i.e., the length of the shared prefix of suffixes i and j of strings S1 and S2, respectively.
Mismatch along diagonal d. VM=V(i−1,d)+LCP(V(i−1,d))+1,V(i−1,d)−d+1)
Computation of VM. After the end of the previous alignment ending at coordinates x = V(i-1, d), y= V(i-1, d) -d, a mismatch is allowed along the diagonal, then the longest exact match following this mismatch determines the end of the alignment, VM.
Deletion from S1.VD=V(i−1,d−1)+LCP(V(i−1,d−1)+1,V(i−1,d−1)−d+1)
In this setting, the alignment "jumps" from diagonal d−1 to diagonal d along S1. The character within S1 that is skipped by this jump is not aligned to any character in S2, thus it is deleted (aligned to a gap character) from the alignment. Once on diagonal d, the alignment proceeds further as far as exact matches are found, i.e., the length of the longest common prefix of the suffixes V(i−1,d−1)+1 of S1 and V(i−1,d−1)−d+1 of S2.
Computation of VD. At the end of the previous alignment ending at coordinates x=V(i-1, d-1), y=V(i-1, d-1)-d + 1, a deletion from S1 is allowed (short arrow in the middle), shifting the alignment onto diagonal d, then the longest exact match following this jump determines the end of the alignment, VD.
Insertion into S1.VI=V(i−1,d+1)+LCP(V(i−1,d+1),V(i−1,d+1)−d)
In this situation, the alignment jumps vertically from diagonal d+1 to diagonal d, representing an insertion of a character into S1 or a deletion from S2 (the corresponding character in S2 will be aligned to a gap character in S1). Once on diagonal d, the alignment proceeds further as far as exact matches are found, i.e., the length of the longest common prefix of the suffixes V(i−1,d+1) of S1 and V(i−1,d+1)−d of S2.
Computation of VI. At the end of the previous alignment ending at coordinates x=V(i-1, d+1), y=V(i-1, d+1)-d - 1, a deletion from S2 is allowed (short arrow in the middle), shifting the alignment onto diagonal d, then the longest exact match following this jump determines the end of the alignment, VI.
As for other dynamic programming algorithms, the final alignment score is the maximum of the three situations described:
V(i,d)=max(VM,VD,VI)
As described, the cost of a gap is 1, however it should be fairly easy to extend this algorithm to more complex gap functions by properly accounting for the corresponding scores when jumping across diagonals in the computation of VD and VI. For example, a k-length deletion in S1 would result in the following equation:
It should be fairly obvious now that computing the V(i,d) values is sufficient for finding the optimal alignment. Specifically, we are looking for the smallest i value that leads one of the V(i,d) values to reach the bottom right corner of the dynamic programming table. If we fill in the V(i,d) values in increasing order of i, we stop once we reach the correct value i=e, where e is the minimum number of edits, and the total number of values filled in is O(ne). As a result, the memory usage is automatically 'tuned' to the divergence between the two strings.
The runtime depends on how quickly we can compute the LCP values. One approach relies on generalized suffix trees – suffix trees constructed from the suffixes of two or more strings (it is easy to see that Ukkonen's algorithm can easily construct such trees). If we construct a suffix tree from strings S1 and S2 the LCP(i,j) for suffixes i in S1 and j in S2 is simply the string depth of the lowest common ancestor of the leaves corresponding to the two suffixes. An algorithm due to Vishkin can compute this information in constant time (more specifically, this algorithm can find the lowest common ancestor of two nodes in any tree), leading to an overall O(ne) runtime for our algorithm.
References
Landau, G. M. and U. Vishkin (1986). Introducing efficient parallelism into approximate string matching and a new serial algorithm. Proceedings of the eighteenth annual ACM symposium on Theory of computing. Berkeley, California, USA, ACM: 220-230. http://dl.acm.org/citation.cfm?id=12152