Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Inexact alignment

Inexact alignment calculation with dynamic programming

Computing the edit distance with dynamic programming

To compute this distance we will rely on a programming paradigm called dynamic programming. Specifically, let us assume that we have two sequences S1 and S2 and that we have managed to align to each other the ith and jth prefix of the two strings respectively. In other words, we assume that we know E[i, j] , the edit distance between S1[1..i] and S2[1..j]. We will now attempt to find the edit distance for longer prefixes of S1 and S2, and, proceeding by induction, we will eventually obtain the edit distance between S1 and S2.

We can distinguish two situations:

I. The characters after the already-aligned prefixes match each other: S1[i + 1] == S2[j + 1] In this case, we can set E[i + 1, j + 1] = E[i, j]—we simply extend the previous alignment for free (no edit necessary)

II. If they characters don't match, they mismatch: S1[i + 1] != S2[j + 1]

In this second case it's not immediately clear how to align longer sequences and we can try three different approaches:

II.a. Simply pay for the mismatch: E[i + 1, j + 1] = E[i, j] + 1

II.b. Delete the last character in string 1: E[i + 1, j + 1] = E[i, j + 1] + 1

II.c. Delete the last character in string 2: E[i + 1, j + 1] = E[i + 1, j] + 1

Note that these last three cases all require us to "pay" for one edit, whether it is a substitution or an indel event. Thus, we have several different ways for computing E[i + 1, j + 1] from elements of the E matrix that have already been computed. Which one shall we choose?

Remember that our goal is align the strings in such a way that the total number of edits is minimized, thus, we'll just select the smallest of the different possible values:

E[i+1,j+1]=min⁡{E[i,j]+(S1[i+1]==S2[j+1]?0:1)E[i,j+1]+1E[i+1,j]+1E[i +1, j+1] = \min \left \{ \begin{matrix} E[i, j] + (S1[i+1] {==} S2[j+1]?0:1) \\ E[i, j+1] + 1\\ E[i+1, j] + 1 \end{matrix} \right .E[i+1,j+1]=min⎩⎨⎧​E[i,j]+(S1[i+1]==S2[j+1]?0:1)E[i,j+1]+1E[i+1,j]+1​

The expression (S1[i+1]==S2[j+1]?0:1) (S1[i+1] {==} S2[j+1] ? 0:1)(S1[i+1]==S2[j+1]?0:1) evaluates to 0 if the equality between characters in the two strings holds or to 1 if it doesn't (i.e., we have to pay for a mismatch).

It is easy to see that we can represent the E values as a two dimensional array of size n=length(S1) x m = length(S2), and that the 4 cases outlined (which we will call recurrence equations) above allow us to fill in this array starting with just its boundaries.

One thing omitted from the discussion above is a base case. If we repeatedly apply the recurrence equations, we'll reach the boundaries of the array (E[0, j] and E[i, 0]), values that cannot be computed from prior values. Note that these values represent the alignment of the prefixes of one string against an empty string. Under our definition of edit distance the alignment score, we can simply set E[i, 0] = i, and E[0, j] = j i.e., we account for the number of edits needed to delete the first i or j characters.\

PreviousIntroduction to inexact alignmentNextExample: filling the dynamic programming table

Last updated 4 months ago