Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Inexact alignment

Local alignment

PreviousFrom edit distance to alignment scoresNextExercises

Last updated 3 months ago

The alignment algorithm we have described so far requires the two strings to align from end to end. For this reason, this version of the alignment is called global alignment. Importantly, when we say that the algorithm makes the strings align end to end, we don't mean that there cannot be any gaps at the end of the alignment. For example, the alignment:

AAAACGTAAAA
----CGT----

is still a global alignment as long as we "pay" for all the edits up to the end of both strings. In the "edit distance" formulation, the alignment score of this global alignment is 8, since we had to delete the 8 As that flank the matching triplet CGT.

To make sure we pay for all these edits, the dynamic programming algorithm had to traverse the entire dynamic programming table, computing the alignment score for the bottom-right corner, and backtracking through the table all the way back to the top-left corner. The path taken by the alignment is shown in the figure below.

We can easily modify this algorithm to relax this requirement. For starters, let us try to find the best prefix-prefix alignment between the two strings, i.e., we will identify the prefixes of S1 and S2 that, when aligned, have the best edit distance score from among all possible such prefixes.

It should be easy to see that we already have the answer. The value at each location in the dynamic programming table is exactly the score of aligning some prefix of one string to a prefix of the other one (by definition), thus solving the prefix alignment problem simply relies on finding the best score in the entire table (rather than the score E[n,m]). Thus, the initialization of the first row and column stays the same as before, the recurrence proceeds as before, but the backtracking starts at the cell in the matrix that has the best score, rather than the bottom-right corner. Effectively, you are ignoring the content of the ends of the two strings.

How about converting this algorithm to find the inexact version of the longest common substring problem, i.e., find the substrings of each of the strings that align to each other with the best score? The basic intuition is that we want to allow the path through the dynamic programming table to begin elsewhere than the top left corner (see figure below).

Note, however, that the way the initial values are set up in the original description of the algorithm, we are penalized for omitting characters from the beginning of one or the other string. We can eliminate this penalty by simply converting the first row and column of this table to 0s. The problem is still not fully solved as we would still pay a penalty for any mismatches or indels that occur as we move from the first row or first column into the dynamic programming table. To eliminate this penalty we can simply add '0' as an option in the dynamic programming recurrence.

In the graph metaphor, we want to add "free" edges from the top-left node to all other nodes in the graph, as well as from each node in the graph to the bottom-right node (allowing for parts of the ends of the strings to not align).

To summarize, we can solve the local alignment problem simply by filling the initial row and column of the dynamic programming table with 0s, and by using the following recurrences:

The initial values for the first row and column of the dynamic programming table are set to 0s, though this is not necessarily important since the recurrence equation will always reset negative scores to 0.

The backtracking starts from the value in the matrix that has the highest score, and stops once the algorithm reaches 0.

Final thoughts on local alignment

To close this section, let's return to the alignment we have shown at the beginning:

AAAACGTAAAA
----CGT----
AAAACGTAAAA
    CGT

We would no longer pay for the gaps at the beginning and end of the first string, and the total score would be 3. Thus, we could have a local alignment that has a better score than the global alignment of two strings, and that's ultimately the point. This alignment strategy allows us to focus on the regions of similarity within longer strings, thus ignoring the "noise" created by segments of the strings that are not actually similar/related to each other.

If thinking in terms of the graph metaphor described above, you can solve the prefix alignment problem by adding to the graph connections "free" (cost 0) edges from each node of the graph to the bottom-right node. The graph can get quite busy (an additional O(mn)O(mn)O(mn) edges are added) so we are not drawing it here.

IMPORTANT: Prefix alignment (as well as other non-global versions) only makes sense in a scoring system that seeks the best scoring alignment (i.e., using a max⁡\maxmax in the recurrence equation), rather than the least "bad" alignment as was the case with the generic edit distance. Otherwise, the best alignment will always be one with score of 0, i.e., perhaps one that doesn't use any letter from either of the strings.

E[i+1,j+1]=max⁡{0E[i,j]+f(S1[i+1],S2[j+1])E[i+1,j]+f(−,S2[j+1])E[i,j+1]+f(S1[i+1],−)E[i + 1, j + 1] = \max \left \{ \begin{matrix} 0 \\ E[i, j] + f(S1[i+1], S2[j+1]) \\ E[i+1, j] + f(-, S2[j + 1]) \\ E[i, j+1] + f(S1[i+1], -) \end{matrix} \right .E[i+1,j+1]=max⎩⎨⎧​0E[i,j]+f(S1[i+1],S2[j+1])E[i+1,j]+f(−,S2[j+1])E[i,j+1]+f(S1[i+1],−)​

As mentioned earlier, this looks like a local alignment (just part of the first string is aligned), but we are, in fact, paying for all the gap characters, thus it is in fact a global alignment. We've also mentioned that having a scoring function that penalizes edits may ultimately lead to uninformative alignments. Not aligning anything would give us the minimum penalty (0). Thus, let's assume our scoring function rewards us 1 for every match, and penalizes us -0.5 for every mismatch or indel. Using this scoring function, the global alignment score for the alignment above is: 3−8⋅0.5=−13 - 8\cdot 0.5 = -13−8⋅0.5=−1. This score takes into account the 8 gap characters and 3 matches. What if we wanted to compute the best local alignment between the two strings. The result would look like:

Global alignment:The path taken during backtracking (squiggly line) starts from the bottom-right corner (blue square) and ends in the top-left corner (white circle). The values shown along the top row and left column represent the initial conditions - penalties for the corresponding number of gaps.
Prefix alignment: Global alignment:The path taken during backtracking (squiggly line) starts from the best score in the matrix (blue square) and ends in the top-left corner (white circle). The values shown along the top row and left column represent the initial conditions - penalties for the corresponding number of gaps.
Local alignment: The path taken during backtracking (squiggly line) starts from the best score in the matrix (blue square) and ends in the middle of the matrix (white circle) where a score of 0 is encountered. The values shown along the top row and left column represent the initial conditions - zeroes ensure that the algorithm doesn't penalize for gaps at the beginning of either string.
Rectangle representing a dynamic programming table. A blue square in the bottom right corner represents the end of alignment, a circle in the top left corner represents the start of the alignment, and a squiggly line connecting the two represents the path taken by the alignment.
Rectangle representing a dynamic programming table. A blue square in the bottom right corner represents the end of alignment, a circle in the top left corner represents the start of the alignment, and a squiggly line connecting the two represents the path taken by the alignment.
Rectangle representing a dynamic programming table. A blue square in the middle of the rectangle represents the end of alignment, a circle in the top left corner represents the start of the alignment, and a squiggly line connecting the two represents the path taken by the alignment.
Rectangle representing a dynamic programming table. A blue square in the middle of the rectangle represents the end of alignment, a circle in the top left corner represents the start of the alignment, and a squiggly line connecting the two represents the path taken by the alignment.
Rectangle representing a dynamic programming table. A blue square in the middle of the rectangle represents the end of alignment, a circle in the middle of the matrix represents the start of the alignment, and a squiggly line connecting the two represents the path taken by the alignment.
Rectangle representing a dynamic programming table. A blue square in the middle of the rectangle represents the end of alignment, a circle in the middle of the matrix represents the start of the alignment, and a squiggly line connecting the two represents the path taken by the alignment.