Modeling alignment as a graph

Before describing how backtracking works, it is useful to introduce another way of looking at sequence alignment, modeling the problem as a graph traversal. While the description we have provided earlier makes sense from the point of view of writing code to compute an inexact alignment, thinking about the problem may provide a more intuitive way of understanding the algorithm.

For two strings S1 of length n and S2 of length m, we construct a graph that has (n+1)(m+1)(n+1) \cdot (m+1) nodes representing the "gaps" between the characters in the two strings (see below). The nodes are connected to each other with edges that represent the possible edits—a horizontal edge indicates a deletion from the string represented along the top, a vertical edge indicates a deletion from the string represented along the side, and a diagonal represents a match or mismatch between the corresponding characters in the two strings. Here is the graph for the two strings we have discussed above, together with the matrix we used earlier:

On the left, the dynamic programming table described in the previous example. On the right, a graph representation of the same table where nodes represent gaps between the aligned characters and the edges represent the possible edits  - horizontal or vertical represent indels while diagonal edges represent match/mismatch events. The weights on the edges represent the cost of the corresponding operation. The eges are oriented and only point down and/or to the right.
Dynamic programming as a graph. The edges represent possible edits and their score. The alignment is represented by a shortest path in this graph between the top left and bottom right nodes (highlighted in gray). Note that two paths are possible, shown with thickened edges.

Each edge is labeled with the "cost" of traversing the edge, i.e., the cost of the edit represented along the edge. Most of the edges in the figure have a cost of 1 as they imply either a deletion or a substitution, and there are four edges with a cost of 0 indicating the match between the As in the two strings.

In this graph, the optimal alignment is represented by a shortest path between the top left and bottom right nodes, highlighted in gray in the figure above. Solving the shortest path problem is itself a dynamic programming algorithm (Dijkstra's), so this representation does not make the alignment any easier, but it does provide a different conceptual framework that may be easier to understand for certain people. Note that there are multiple possible shortest paths—highlighted by the thick edges in the figure.

Last updated