Backtracking through the dynamic programming table
Last updated
Last updated
The graph representation of inexact alignment can provide a bit of context to the numbers represented in the dynamic programming table with which we started this section. The value stored in the bottom-right corner of the matrix (the node towards which all the edges converge in the graph representation) is simply the cost accumulated along the shortest path from the top-left corner of the matrix/graph. Importantly, when we say "shortest path", we mean "the shortest among all possible paths between the top-left and bottom-right corners of the matrix", i.e., the score implicitly captures a large number of possible traversals of the graph/matrix.
Question: How many possible paths are there in the graph between the top left and bottom right corners, expressed in terms of the lengths of the two strings (m and n, respectively)?
Also note that each of the internal nodes of the matrix also stores a value that is equal to the cost accumulated along the shortest path from the top-left corner to the current location in the matrix/graph.
Let's return to the question of how we can figure out the alignment that led to the observed minimum score. We need to work backwards, trying to figure out how we computed the score in the first place. The logic we'll use is the following: "If the current score is optimal, then each step up to this point was optimal", i.e., the prior location in the matrix is the location from which we can get where we are using an optimal move.
As a quick reminder, the recurrence equations we used were:
Thus, from location we'll move back to (move diagonally up and left in the graph) if the first line in the equation led to the maximum, we'll move back to (move to the left) in case the second line led to the maximum, and we'll move back to (move up) in case the last line led to the maximum.
What if two or more lines in the recurrence equation all lead to the same value? Then, it doesn't matter in which direction we go — they are all legitimate shortest paths, though each will lead to a different alignment. In other words, there may be more than one way to align two strings to each other.
Let's go through an example starting with the matrix we computed earlier.
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
Note that there may be different ways that we could have used to get the value 2 in the bottom right corner, and we cannot know which one actually led to this value without tracing back our steps. Thus, we start at the bottom right corner (highlighted number) and ask the question: why did we decide the value was 2?
We can follow the recurrence equations and note that the row and column characters match, thus case I applies, meaning that the value at this cell ([4, 3]) is the same as the value at cell [3, 2] which is 2, indicating that the alignment between the strings aligned the last characters of the two strings. We can also note that the other parts of the recurrence equation (indels) yield values 3, and 4, respectively, thus they are not valid choices. We can, thus, shift our attention to cell [3,2], and fill in the last characters of the alignment:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
We now find ourselves at location [3, 2], highlighted next in the matrix above. Again, we need to figure out where the value at cell [3, 2] came from. Since the corresponding characters, 'B' and 'C', are different, the values for cell [3, 2] are the minimum of:
mismatch. E[2, 1] + 1 = 2 deletion from S2. E[2, 2] + 1 = 2 deletion from S1. E[3, 1] + 1 = 3
Note that now we have two possible previous values, obtained either by following the diagonal (mismatch) or by moving up in the matrix (deletion), highlighted below on row 2:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
Choosing either yields the same score, but different alignments. Following the diagonal forces the next two characters in the strings to mismatch:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
Choosing to move up, leads to the deletion of a character (the last B in S2), indicated by a – character (highlighted below):
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
To simplify things at this point, we'll assume we always take the diagonal if possible, so we find ourselves in the following state:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
Note that because we moved up in the matrix, we had to "delete" a character from the string represented on the left (S2), i.e., character B does not align to any character in the top string.
We then repeat the process one more time, noting that at cell [1, 1] the characters match, case I (match) applies, and that the other options (indels) imply values larger than the one we have (zero), thus we follow the diagonal and complete the alignment:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
You can now verify that indeed only two edits are needed to change one string into the other, for example deleting the first B in ABBA and replacing the second B with a C. Conversely, you can insert a B in the second string, and replace the C with another B.
But what would have happened if we followed the other path when we had a choice to make. Remember, we had stopped here:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
You will now see that the only way to get the 1 is to follow the diagonal (a mismatch since B and C do not match):
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
Which brings us back to cell [1, 1] from where the only choice is the diagonal:
-
A
C
A
-
0
1
2
3
A
1
0
1
2
B
2
1
1
2
B
3
2
2
2
A
4
3
3
2
Again, you can verify that this alignment also has exactly two edits as expected since we started from a cell with value 2 in it.
Please compare these two examples with the shortest paths through the alignment graph that were highlighted on the previous page.
The two alignments we found here are:
and
There are two main strategies used for performing the backtracking. The first is the same as I have just described—at each cell in the matrix, you apply the same rules as you did to fill the matrix in the first place in order to figure out where to go next. This approach makes coding quite easy as you simply copy the code you used to fill in the matrix but use it "backwards".
The second approach is to explicitly remember how you got the value in the cell when you computed its value. Specifically, you store a pointer from the cell that you just filled in, to the cell from which you derived its value. You can store several pointers if multiple choices yield the same lowest value. Now you can view the backtracking process as a graph traversal, following the backtracking pointers from the bottom right corner until you reach the top left corner, with each edge encoding a different type of edit, and choosing an arbitrary pointer in case multiple exist at a node.
When you come to a fork in the road, take it. -Yogi Berra
LCS
Graph-based formulation
Now we need to figure out the origin of the value at cell [2, 1]. The characters mismatch so a diagonal move would cost 1, and we note that the value of 1 we find in the current cell could only have come from an indel event since . Thus, we move up: