From edit distance to alignment scores

Up until now, we have discussed a fairly simple way of scoring an alignment: counting the number of edits needed to convert one string into the other. Nature, however, is not a typewriter, so we may want to define alignment scores in a way that better captures the biological relationships between sequences. If we define such a score, can we still compute the alignments using the algorithm we just described?

As we'll show below, the dynamic programming algorithm we created still works for a number of scoring systems that can incorporate biological knowledge. Note, however, that it is not necessarily obvious that this algorithm should work, and there are some scoring systems for alignments for which we do not have a good way of computing the optimal alignment. One example is the situation in which we allow an additional edit: the transposition of two neighboring letters (e.g., changing from ABCD to ACBD in one single edit by swapping the order of letters B and C). In its general form, optimizing such a scoring system is an NP-hard problem, unlike our simple model that only accounts for insertions, deletions, and substitutions.

Assume we are given a biologically informed function $f$ that provides a score for any pair of aligned characters (including '-' to indicate an indel). Mathematically, $f: (\Sigma \cup \{'-'\}) \times (\Sigma \cup \{'-'\}) \rightarrow \mathbb{R}$, where $\Sigma$ represents the alphabet from which the strings are constructed (e.g., $\Sigma = \{A, C, G, T\}$ for DNA). We can now rewrite the recurrence equations as:

$E[i + 1, j + 1] = \min \left \{ \begin{matrix} E[i, j] + f(S1[i+1], S2[j+1]) \\ E[i+1, j] + f(-, S2[j + 1]) \\ E[i, j+1] + f(S1[i+1], -)\end{matrix} \right.$

Note that the diagonal case (the first equation) automatically accounts for the situation in which the two characters being aligned are either the same or different. You can also see that in the edit distance case earlier, we could have simply defined the function $f$ as:

f(a, b) = (a==b ? 0 : 1)
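The generalized recurrence can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the names `alignment_cost` and `unit_cost` are hypothetical, strings are 0-indexed (so `s1[i - 1]` plays the role of $S1[i]$), and the first row and column are filled by accumulating indel scores.

```python
def alignment_cost(s1, s2, f):
    """Return the minimum total cost of aligning s1 with s2 under scoring function f."""
    n, m = len(s1), len(s2)
    # E[i][j] = best cost of aligning the first i letters of s1 with the first j of s2.
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: letters of s1 against gaps
        E[i][0] = E[i - 1][0] + f(s1[i - 1], "-")
    for j in range(1, m + 1):          # first row: gaps against letters of s2
        E[0][j] = E[0][j - 1] + f("-", s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = min(
                E[i - 1][j - 1] + f(s1[i - 1], s2[j - 1]),  # diagonal: match or substitution
                E[i][j - 1] + f("-", s2[j - 1]),            # gap in s1
                E[i - 1][j] + f(s1[i - 1], "-"),            # gap in s2
            )
    return E[n][m]


def unit_cost(a, b):
    """Edit-distance scoring: 0 for identical characters, 1 for anything else."""
    return 0 if a == b else 1


print(alignment_cost("kitten", "sitting", unit_cost))
```

With `unit_cost`, the function reduces to the edit distance computed earlier; passing a different `f` changes the scoring without touching the table-filling logic.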

In the graph formulation, the function $f$ assigns the weights of the edges. These weights are fixed throughout the algorithm, since they depend only on the corresponding characters. Because the alignment graph has no cycles, we can still compute a shortest path through it even when the edge weights are arbitrary, so our initial algorithm also works for more complex scoring functions.

What is perhaps less obvious is that we could choose a function that rewards matches rather than penalizes edits, as we've done so far. If we do so, the best alignment is the one that maximizes the score, so we have to change the recurrence equation to:

$E[i + 1, j + 1] = \max \left \{ \begin{matrix} E[i, j] + f(S1[i+1], S2[j+1]) \\ E[i+1, j] + f(-, S2[j + 1]) \\ E[i, j+1] + f(S1[i+1], -)\end{matrix} \right.$

The algorithm would stay the same as before, but we'll select the maximum among the three possible "edits" at each location in the matrix instead of the minimum as we've done so far.
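A minimal Python sketch of the maximizing variant is shown here. The function names and the specific values (+1 for a match, -1 for a mismatch, -2 for an indel) are assumptions chosen purely for illustration, not values prescribed by the text.

```python
def alignment_score(s1, s2, f):
    """Return the maximum total score of aligning s1 with s2 under scoring function f."""
    n, m = len(s1), len(s2)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        E[i][0] = E[i - 1][0] + f(s1[i - 1], "-")
    for j in range(1, m + 1):
        E[0][j] = E[0][j - 1] + f("-", s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Same three cases as before, but we keep the largest value.
            E[i][j] = max(
                E[i - 1][j - 1] + f(s1[i - 1], s2[j - 1]),
                E[i][j - 1] + f("-", s2[j - 1]),
                E[i - 1][j] + f(s1[i - 1], "-"),
            )
    return E[n][m]


def reward_matches(a, b):
    """Illustrative scores (assumed): +1 match, -1 mismatch, -2 indel."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


print(alignment_score("ACGT", "AGT", reward_matches))  # 3 matches and 1 indel -> 1
```

The only change from the minimizing version is the use of `max` in the inner loop; the boundary cells are still filled by accumulating indel scores.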

What biologically-relevant scoring functions exist?

At this point, we will not go into a lot of detail about how one defines a biologically relevant scoring function. At a very high level, such functions are defined by statistically analyzing alignments that we know have some biological meaning (e.g., by aligning proteins that have the same function but different sequences). In the case of DNA, one can think of a number of factors that we could account for. For example, when discussing finding the origin of replication in DNA, we mentioned deamination, the process by which a C can easily mutate into a T. Thus, a biologically informed scoring function could assign a higher score (or lower penalty) to C-T alignments than to other pairings of different letters. Such a function could also prioritize alignments that match a purine with a purine, or a pyrimidine with a pyrimidine, and increase the penalty (or decrease the score) of purine-pyrimidine alignments.
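To make the purine/pyrimidine idea concrete, here is a small Python sketch of such a DNA scoring function. The name `dna_score` and every numeric value in it are hypothetical placeholders; real scoring schemes are derived from data, as described above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=1, same_class=-1, cross_class=-2, gap=-3):
    """Score a pair of aligned DNA characters (all values are illustrative assumptions)."""
    if a == "-" or b == "-":
        return gap                      # indel
    if a == b:
        return match                    # identical letters
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return same_class               # purine-purine or pyrimidine-pyrimidine (e.g., C-T)
    return cross_class                  # purine paired with pyrimidine


# A C-T pairing is penalized less than an A-C pairing under these assumed values.
print(dna_score("C", "T"), dna_score("A", "C"))
```

Because the function has the same shape as $f$ (two characters in, one number out), it can be plugged into the maximizing recurrence unchanged.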

Scoring functions are, however, more commonly used to guide alignments of protein sequences. There the alphabet has 20 letters (one for each of the 20 amino acids) and there are many possible pairings, some of which are more likely than others. A commonly used substitution matrix is the BLOSUM-62 matrix shown below:

# Entries for the BLOSUM62 matrix at a scale of ln(2)/2.0.
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

(source https://ncbi.nlm.nih.gov)

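Note that this matrix is symmetric ($f(x, y) = f(y, x)$) and that its entries have been scaled and rounded to integers. Higher values indicate more favorable pairings, so the matrix is meant to be used with the maximizing form of the recurrence. In the NCBI matrix files, the '*' symbol stands for a translation stop rather than for a gap (which we wrote as '-' earlier); gap penalties are typically specified separately from the substitution matrix.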
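As a rough sketch of how a table in this text layout could be turned into a scoring function, the Python snippet below parses a four-letter excerpt copied from the matrix above (rows and columns A, R, N, D). The names `parse_matrix` and `EXCERPT` are hypothetical; the same parsing logic should also handle the full table, since it only relies on whitespace-separated columns.

```python
EXCERPT = """\
# Excerpt of the BLOSUM62 table above (A, R, N, D only).
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6
"""


def parse_matrix(text):
    """Parse a whitespace-separated substitution table into a {(row, col): score} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    columns = lines[0].split()              # header row with the column letters
    scores = {}
    for ln in lines[1:]:
        parts = ln.split()
        row, values = parts[0], parts[1:]   # first token is the row letter
        for col, value in zip(columns, values):
            scores[(row, col)] = int(value)
    return scores


blosum = parse_matrix(EXCERPT)

# The table is symmetric, so the order of the two letters does not matter.
print(blosum[("N", "D")], blosum[("D", "N")])
```

Storing the scores in a dictionary keyed by letter pairs keeps lookups simple; a function such as `lambda a, b: blosum[(a, b)]` then plays the role of $f$ for pairs of amino acids.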
