Inexact alignment calculation with dynamic programming
Computing the edit distance with dynamic programming
To compute this distance we will rely on a programming paradigm called dynamic programming. Specifically, let us assume that we have two sequences S1 and S2 and that we have managed to align to each other the ith and jth prefix of the two strings respectively. In other words, we assume that we know E[i, j]
, the edit distance between S1[1..i]
and S2[1..j]
. We will now attempt to find the edit distance for longer prefixes of S1 and S2, and, proceeding by induction, we will eventually obtain the edit distance between S1 and S2.
We can distinguish two situations:
I. The characters after the already-aligned prefixes match each other: S1[i + 1] == S2[j + 1]
In this case, we can set E[i + 1, j + 1] = E[i, j]
—we simply extend the previous alignment for free (no edit necessary)
II. If they characters don't match, they mismatch: S1[i + 1] != S2[j + 1]
In this second case it's not immediately clear how to align longer sequences and we can try three different approaches:
II.a. Simply pay for the mismatch: E[i + 1, j + 1] = E[i, j] + 1
II.b. Delete the last character in string 1: E[i + 1, j + 1] = E[i, j + 1] + 1
II.c. Delete the last character in string 2: E[i + 1, j + 1] = E[i + 1, j] + 1
Note that these last three cases all require us to "pay" for one edit, whether it is a substitution or an indel event. Thus, we have several different ways for computing E[i + 1, j + 1]
from elements of the E matrix that have already been computed. Which one shall we choose?
Remember that our goal is align the strings in such a way that the total number of edits is minimized, thus, we'll just select the smallest of the different possible values:
The expression evaluates to 0 if the equality between characters in the two strings holds or to 1 if it doesn't (i.e., we have to pay for a mismatch).
It is easy to see that we can represent the E values as a two dimensional array of size n=length(S1) x m = length(S2), and that the 4 cases outlined (which we will call recurrence equations) above allow us to fill in this array starting with just its boundaries.
One thing omitted from the discussion above is a base case. If we repeatedly apply the recurrence equations, we'll reach the boundaries of the array (E[0, j]
and E[i, 0]
), values that cannot be computed from prior values. Note that these values represent the alignment of the prefixes of one string against an empty string. Under our definition of edit distance the alignment score, we can simply set E[i, 0] = i
, and E[0, j] = j
i.e., we account for the number of edits needed to delete the first i or j characters.\
Last updated