Modeling alignment as a graph
Last updated
Last updated
Before describing how backtracking works, it is useful to introduce another way of looking at sequence alignment, modeling the problem as a graph traversal. While the description we have provided earlier makes sense from the point of view of writing code to compute an inexact alignment, thinking about the problem may provide a more intuitive way of understanding the algorithm.
For two strings S1 of length n and S2 of length m, we construct a graph that has nodes representing the "gaps" between the characters in the two strings (see below). The nodes are connected to each other with edges that represent the possible edits—a horizontal edge indicates a deletion from the string represented along the top, a vertical edge indicates a deletion from the string represented along the side, and a diagonal represents a match or mismatch between the corresponding characters in the two strings. Here is the graph for the two strings we have discussed above, together with the matrix we used earlier:
Each edge is labeled with the "cost" of traversing the edge, i.e., the cost of the edit represented along the edge. Most of the edges in the figure have a cost of 1 as they imply either a deletion or a substitution, and there are four edges with a cost of 0 indicating the match between the As in the two strings.
In this graph, the optimal alignment is represented by a shortest path between the top left and bottom right nodes, highlighted in gray in the figure above. Solving the shortest path problem is itself a dynamic programming algorithm (Dijkstra's), so this representation does not make the alignment any easier, but it does provide a different conceptual framework that may be easier to understand for certain people. Note that there are multiple possible shortest paths—highlighted by the thick edges in the figure.