Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Inexact alignment

Modeling alignment as a graph

PreviousExample: filling the dynamic programming tableNextBacktracking through the dynamic programming table

Last updated 3 months ago

Before describing how backtracking works, it is useful to introduce another way of looking at sequence alignment, modeling the problem as a graph traversal. While the description we have provided earlier makes sense from the point of view of writing code to compute an inexact alignment, thinking about the problem may provide a more intuitive way of understanding the algorithm.

For two strings S1 of length n and S2 of length m, we construct a graph that has (n+1)⋅(m+1)(n+1) \cdot (m+1)(n+1)⋅(m+1) nodes representing the "gaps" between the characters in the two strings (see below). The nodes are connected to each other with edges that represent the possible edits—a horizontal edge indicates a deletion from the string represented along the top, a vertical edge indicates a deletion from the string represented along the side, and a diagonal represents a match or mismatch between the corresponding characters in the two strings. Here is the graph for the two strings we have discussed above, together with the matrix we used earlier:

Each edge is labeled with the "cost" of traversing the edge, i.e., the cost of the edit represented along the edge. Most of the edges in the figure have a cost of 1 as they imply either a deletion or a substitution, and there are four edges with a cost of 0 indicating the match between the As in the two strings.

In this graph, the optimal alignment is represented by a shortest path between the top left and bottom right nodes, highlighted in gray in the figure above. Solving the shortest path problem is itself a dynamic programming algorithm (Dijkstra's), so this representation does not make the alignment any easier, but it does provide a different conceptual framework that may be easier to understand for certain people. Note that there are multiple possible shortest paths—highlighted by the thick edges in the figure.

Dynamic programming as a graph. The edges represent possible edits and their score. The alignment is represented by a shortest path in this graph between the top left and bottom right nodes (highlighted in gray). Note that two paths are possible, shown with thickened edges.
On the left, the dynamic programming table described in the previous example. On the right, a graph representation of the same table where nodes represent gaps between the aligned characters and the edges represent the possible edits  - horizontal or vertical represent indels while diagonal edges represent match/mismatch events. The weights on the edges represent the cost of the corresponding operation. The eges are oriented and only point down and/or to the right.
On the left, the dynamic programming table described in the previous example. On the right, a graph representation of the same table where nodes represent gaps between the aligned characters and the edges represent the possible edits  - horizontal or vertical represent indels while diagonal edges represent match/mismatch events. The weights on the edges represent the cost of the corresponding operation. The eges are oriented and only point down and/or to the right.