Local alignment
Last updated
Last updated
The alignment algorithm we have described so far requires the two strings to align from end to end. For this reason, this version of the alignment is called global alignment. Importantly, when we say that the algorithm makes the strings align end to end, we don't mean that there cannot be any gaps at the end of the alignment. For example, the alignment:
is still a global alignment as long as we "pay" for all the edits up to the end of both strings. In the "edit distance" formulation, the alignment score of this global alignment is 8, since we had to delete the 8 As that flank the matching triplet CGT.
To make sure we pay for all these edits, the dynamic programming algorithm had to traverse the entire dynamic programming table, computing the alignment score for the bottom-right corner, and backtracking through the table all the way back to the top-left corner. The path taken by the alignment is shown in the figure below.
We can easily modify this algorithm to relax this requirement. For starters, let us try to find the best prefix-prefix alignment between the two strings, i.e., we will identify the prefixes of S1 and S2 that, when aligned, have the best edit distance score from among all possible such prefixes.
It should be easy to see that we already have the answer. The value at each location in the dynamic programming table is exactly the score of aligning some prefix of one string to a prefix of the other one (by definition), thus solving the prefix alignment problem simply relies on finding the best score in the entire table (rather than the score E[n,m]
). Thus, the initialization of the first row and column stays the same as before, the recurrence proceeds as before, but the backtracking starts at the cell in the matrix that has the best score, rather than the bottom-right corner. Effectively, you are ignoring the content of the ends of the two strings.
If thinking in terms of the graph metaphor described above, you can solve the prefix alignment problem by adding to the graph connections "free" (cost 0) edges from each node of the graph to the bottom-right node. The graph can get quite busy (an additional edges are added) so we are not drawing it here.
IMPORTANT: Prefix alignment (as well as other non-global versions) only makes sense in a scoring system that seeks the best scoring alignment (i.e., using a in the recurrence equation), rather than the least "bad" alignment as was the case with the generic edit distance. Otherwise, the best alignment will always be one with score of 0, i.e., perhaps one that doesn't use any letter from either of the strings.
How about converting this algorithm to find the inexact version of the longest common substring problem, i.e., find the substrings of each of the strings that align to each other with the best score? The basic intuition is that we want to allow the path through the dynamic programming table to begin elsewhere than the top left corner (see figure below).
Note, however, that the way the initial values are set up in the original description of the algorithm, we are penalized for omitting characters from the beginning of one or the other string. We can eliminate this penalty by simply converting the first row and column of this table to 0s. The problem is still not fully solved as we would still pay a penalty for any mismatches or indels that occur as we move from the first row or first column into the dynamic programming table. To eliminate this penalty we can simply add '0' as an option in the dynamic programming recurrence.
In the graph metaphor, we want to add "free" edges from the top-left node to all other nodes in the graph, as well as from each node in the graph to the bottom-right node (allowing for parts of the ends of the strings to not align).
To summarize, we can solve the local alignment problem simply by filling the initial row and column of the dynamic programming table with 0s, and by using the following recurrences:
The initial values for the first row and column of the dynamic programming table are set to 0s, though this is not necessarily important since the recurrence equation will always reset negative scores to 0.
The backtracking starts from the value in the matrix that has the highest score, and stops once the algorithm reaches 0.
To close this section, let's return to the alignment we have shown at the beginning:
As mentioned earlier, this looks like a local alignment (just part of the first string is aligned), but we are, in fact, paying for all the gap characters, thus it is in fact a global alignment. We've also mentioned that having a scoring function that penalizes edits may ultimately lead to uninformative alignments. Not aligning anything would give us the minimum penalty (0). Thus, let's assume our scoring function rewards us 1 for every match, and penalizes us -0.5 for every mismatch or indel. Using this scoring function, the global alignment score for the alignment above is: . This score takes into account the 8 gap characters and 3 matches. What if we wanted to compute the best local alignment between the two strings. The result would look like:
We would no longer pay for the gaps at the beginning and end of the first string, and the total score would be 3. Thus, we could have a local alignment that has a better score than the global alignment of two strings, and that's ultimately the point. This alignment strategy allows us to focus on the regions of similarity within longer strings, thus ignoring the "noise" created by segments of the strings that are not actually similar/related to each other.