Gap penalties

In the original description of the sequence alignment algorithm, each gap character was penalized an equal amount. For example, if aligning strings ABBBA and ABA, the following two alignments both incur the same cost:

ABBBA
A--BA	

ABBBA
A-B-A

But remember that we are performing alignment in order to capture some biologically meaningful relationship between DNA or protein strings. The alignment score is supposed to reflect some notion of how biologically plausible the alignment is. Depending on the context, either the first or the second alignment may be biologically "better". A good example is the structure of eukaryotic genes: in the mRNA the gene is a contiguous sequence, while in the DNA the gene is a collection of exons separated by introns. The latter get 'spliced' out of the pre-mRNA during RNA processing. An alignment of the mRNA sequence to the DNA should therefore not be penalized for the length of the gaps corresponding to the 'missing' introns, favoring alignments such as the first one shown above. Can we modify the dynamic programming algorithm to capture this fact?

Case 1: arbitrary gap penalties

Let us, first, assume that we can define a function $g(n_{gaps})$ which reflects the biological "cost" of a number $n_{gaps}$ of adjacent gaps in the alignment. In the original alignment algorithm, $g(n_{gaps}) = n_{gaps} \cdot \sigma$, where $\sigma$ is the gap penalty. For now, we are making no assumption about the "shape" of this function.
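To make the idea of a general gap function concrete, here is a small Python sketch of two possible choices. The names `g_linear` and `g_concave`, and the particular logarithmic form, are illustrative assumptions rather than something these notes prescribe: the first is the linear function used by the original algorithm, the second a concave function under which long gaps cost only slightly more than short ones.

```python
import math

# Hypothetical gap-penalty functions, for illustration only.
# Each takes the length of a run of adjacent gaps and returns a positive penalty.

def g_linear(n_gaps, sigma=1.0):
    """The original model: every gap character costs sigma."""
    return n_gaps * sigma

def g_concave(n_gaps):
    """A concave alternative: long gaps cost only slightly more than short ones."""
    return 1.0 + math.log(n_gaps)
```

Note that $g_{concave}(2) < 2 \cdot g_{concave}(1)$, which, as discussed below, is exactly the condition that makes a single long gap preferable to several short ones.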

Let's return to the two alignments shown above and compute their score using the old and the new gap function:

ABBBA
A--BA

"traditional" score: $3 \cdot \text{match} - 2 \cdot \sigma$
"new" score: $3 \cdot \text{match} - g(2)$

ABBBA
A-B-A

"traditional" score: $3 \cdot \text{match} - 2 \cdot \sigma$
"new" score: $3 \cdot \text{match} - 2 \cdot g(1)$

Note that the traditional scores of the two alignments are identical, but the new scoring function yields different scores, as long as $g(2) \ne 2 \cdot g(1)$. If $g(2) < 2 \cdot g(1)$, the first alignment would have a higher score, and would, thus, be preferred by the dynamic programming algorithm. We assumed here that both $\sigma$ and $g()$ are positive, thus we subtract the gap penalty from the score.
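For instance, with the hypothetical values $\text{match} = 1$, $\sigma = 1$, $g(1) = 2$, and $g(2) = 3$, both alignments receive the traditional score $3 \cdot 1 - 2 \cdot 1 = 1$, while the new scores are $3 - g(2) = 0$ for the first alignment and $3 - 2 \cdot g(1) = -1$ for the second. Since $g(2) < 2 \cdot g(1)$, the single two-character gap wins.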

Question: Could you modify the dynamic programming alignment algorithm to compute the optimal alignment between two strings using this new gap function?

Question: What is the runtime of this modified dynamic programming alignment algorithm?

To address these questions we will modify the dynamic programming recurrences as follows. Note that we now call the table of scores V (rather than E, as we did above) in keeping with historical usage. We'll assume a global alignment setting, but the algorithm can be easily modified for local alignment.

$$
V[i + 1, j + 1] = \max \left\{ \begin{matrix} V[i, j] + f(S1[i + 1], S2[j + 1]) \\ \max_{1 \le k \le i} V[i - k + 1, j + 1] - g(k) \\ \max_{1 \le k \le j} V[i + 1, j - k + 1] - g(k) \end{matrix} \right.
$$

First, let us convince ourselves that this approach yields a correct algorithm. We have replaced the recurrences corresponding to gap insertions in one or the other string with an exploration of all possible gap lengths that could be inserted at that location. Thus, the new algorithm will automatically pick the gap length that maximizes the score according to the generic gap penalty function g.

In the graph metaphor for alignment, we now allow each node to be connected to every other node along the same row or column, with each such edge incurring a gap penalty of $g(k)$, where k is the number of nodes "skipped" by the edge.

In terms of performance, however, note that the new recurrences require us to perform $O(n)$ and $O(m)$ work, respectively, at each cell of the $O(nm)$ dynamic programming table, leading to a runtime of $O(nm(n+m))$, or $O(n^3)$ when the two strings have comparable lengths. Thus, we can incorporate more biology into our algorithm, yet we pay a major penalty in terms of runtime. Can we find a reasonable trade-off between these two requirements?
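Before looking for that trade-off, here is a minimal Python sketch of the recurrence above, as a concrete answer to the two questions posed earlier. The function name, the score-function interface, and the boundary handling are assumptions made for illustration; as in the examples above, $g$ returns a positive penalty that is subtracted from the score.

```python
def align_arbitrary_gaps(s1, s2, f, g):
    """Global alignment score with an arbitrary gap-penalty function.

    f(a, b)  -- match/mismatch score for characters a and b
    g(k)     -- positive penalty for a run of k adjacent gaps
    """
    n, m = len(s1), len(s2)
    NEG = float("-inf")
    # V[i][j] = best score of an alignment of s1[:i] with s2[:j]
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0 and j > 0:
                # case 1: the alignment ends with a match/mismatch
                best = V[i - 1][j - 1] + f(s1[i - 1], s2[j - 1])
            for k in range(1, i + 1):
                # case 2: it ends with k characters of s1 aligned to gaps
                best = max(best, V[i - k][j] - g(k))
            for k in range(1, j + 1):
                # case 3: it ends with k characters of s2 aligned to gaps
                best = max(best, V[i][j - k] - g(k))
            V[i][j] = best
    return V[n][m]
```

The two inner loops over k are what push the cost of each cell from $O(1)$ to $O(n + m)$, giving the $O(nm(n+m))$ runtime discussed above. For example, with the hypothetical `g_concave` from earlier and a +1/-1 match/mismatch score, the best alignment of ABBBA and ABA keeps the two gap characters adjacent, scoring $3 - g(2) \approx 1.31$ rather than $3 - 2 \cdot g(1) = 1$.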

Affine gap penalties

It turns out that for certain forms of the gap function we can compute the alignment score more efficiently. In particular, we will focus on affine gap penalties of the form $g(n_{gap}) = g_{open} + n_{gap} \cdot g_{extend}$, where $g_{open}$ and $g_{extend}$ are the penalties for creating a gap and for adding another gap character to an already existing gap run, respectively. Typically $g_{open} > g_{extend}$, and because every new gap run incurs the additional $g_{open}$ cost, the scoring favors extending an existing gap over opening a new one. This formulation of the gap penalty will allow us to 'remember' information computed in earlier stages of the algorithm, and thus avoid the additional penalty in runtime.

The intuition behind the algorithm we are about to describe is that we will try to remember whether the alignment scores we are computing originated from a run of gaps or from a match/mismatch. We cannot keep track of all this information in a single dynamic programming table, so we will have to create a few additional tables to help us out.

Specifically, we will record three new dynamic programming tables, E, F, and G, in addition to V.

In table E we will record the best scores for alignments that end with a gap in S1. In table F we will record the best scores for alignments that end with a gap in S2. In table G we will record the best scores for alignments that do not end with a gap (i.e., the last characters either match or mismatch). In V, as before, we record the best alignment score overall.

The tables are filled with the following recurrences. Gaps in S1 are handled in table E:

$$
E[i + 1, j + 1] = \max \left\{ \begin{matrix} V[i + 1, j] - (g_{open} + g_{extend}) \\ E[i + 1, j] - g_{extend} \end{matrix} \right.
$$

In words, if we are creating a gap in S1, we record this information in table E and account for two possible situations: (i) this is the first gap in a run, thus we carry over information from the general table V and open a new gap of length 1 (penalty $g_{open} + g_{extend}$); (ii) we are extending a run of gaps, thus we use information from table E itself and only add a gap extension.

The same situation, but for string S2, is accounted for in table F:

$$
F[i + 1, j + 1] = \max \left\{ \begin{matrix} V[i, j + 1] - (g_{open} + g_{extend}) \\ F[i, j + 1] - g_{extend} \end{matrix} \right.
$$

The matrix G only handles the match/mismatch situation:

$$
G[i + 1, j + 1] = V[i, j] + f(S1[i + 1], S2[j + 1])
$$

Note that we don't really have a choice here, since there is no special meaning to a chain of matches. Also, since we do not care whether the previous value was the result of a match/mismatch or an indel, we use the best of all possible ways to get there, i.e., $V[i, j]$. Which brings us to the definition of the V matrix:

$$
V[i + 1, j + 1] = \max (E[i + 1, j + 1], F[i + 1, j + 1], G[i + 1, j + 1])
$$

Note that these tables can be filled in constant time per cell, or $O(nm)$ overall (i.e., the same runtime as the simple alignment algorithm). Of course, the total space requirement has tripled (the V table can be maintained implicitly), but that is a small price to pay for shaving a linear multiplicative factor off the runtime.

The approach outlined above is due to Gotoh, and affine gap penalties are the standard definition of gap scores in the bioinformatics community, as they offer a good trade-off between efficiency and biological plausibility.
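For concreteness, here is a minimal Python sketch of Gotoh's recurrences as written above. The function name, the score-function interface, and the boundary initialization are assumptions made for illustration; penalties are positive and are subtracted from the score, matching the convention used throughout this section.

```python
def align_affine(s1, s2, f, g_open, g_extend):
    """Global alignment score with affine gap penalties (Gotoh's algorithm).

    f(a, b)           -- match/mismatch score for characters a and b
    g_open, g_extend  -- positive penalties; a run of k gaps costs g_open + k * g_extend
    """
    n, m = len(s1), len(s2)
    NEG = float("-inf")
    # E: alignments ending with a gap in S1; F: gap in S2; G: match/mismatch; V: best overall
    E = [[NEG] * (m + 1) for _ in range(n + 1)]
    F = [[NEG] * (m + 1) for _ in range(n + 1)]
    G = [[NEG] * (m + 1) for _ in range(n + 1)]
    V = [[NEG] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    # Boundary cells (an assumption of this sketch): aligning a prefix against the
    # empty string is a single gap run of that length.
    for j in range(1, m + 1):
        E[0][j] = V[0][j] = -(g_open + j * g_extend)
    for i in range(1, n + 1):
        F[i][0] = V[i][0] = -(g_open + i * g_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # gap in S1: open a new gap (from V) or extend an existing one (from E)
            E[i][j] = max(V[i][j - 1] - (g_open + g_extend),
                          E[i][j - 1] - g_extend)
            # gap in S2: the symmetric case, moving along the other string
            F[i][j] = max(V[i - 1][j] - (g_open + g_extend),
                          F[i - 1][j] - g_extend)
            # match/mismatch: no choice to make here
            G[i][j] = V[i - 1][j - 1] + f(s1[i - 1], s2[j - 1])
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[n][m]
```

As a quick check, `align_affine("ABBBA", "ABA", lambda a, b: 1 if a == b else -1, 2, 1)` returns -1, the score of the alignment with a single two-character gap ($3 - (2 + 2 \cdot 1) = -1$), rather than the $3 - 2 \cdot (2 + 1) = -3$ obtained with two separate gaps.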