Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Backtracking example
  • Implementation details
  1. Inexact alignment

Backtracking through the dynamic programming table

The graph representation of inexact alignment can provide a bit of context to the numbers represented in the dynamic programming table with which we started this section. The value stored in the bottom-right corner of the matrix (the node towards which all the edges converge in the graph representation) is simply the cost accumulated along the shortest path from the top-left corner of the matrix/graph. Importantly, when we say "shortest path", we mean "the shortest among all possible paths between the top-left and bottom-right corners of the matrix", i.e., the score implicitly captures a large number of possible traversals of the graph/matrix.

Question: How many possible paths are there in the graph between the top left and bottom right corners, expressed in terms of the lengths of the two strings (m and n, respectively)?

Also note that each of the internal nodes of the matrix also stores a value that is equal to the cost accumulated along the shortest path from the top-left corner to the current location in the matrix/graph.

Let's return to the question of how we can figure out the alignment that led to the observed minimum score. We need to work backwards, trying to figure out how we computed the score in the first place. The logic we'll use is the following: "If the current score is optimal, then each step up to this point was optimal", i.e., the prior location in the matrix is the location from which we can get where we are using an optimal move.

As a quick reminder, the recurrence equations we used were:

E[i+1,j+1]=min⁡{E[i,j]+(S1[i+1]==S2[j+1]?0:1)E[i,j+1]+1E[i+1,j]+1E[i +1, j+1] = \min \left \{ \begin{matrix} E[i, j] + (S1[i+1] == S2[j+1] ? 0:1) \\ E[i, j+1] + 1\\ E[i+1, j] + 1 \end{matrix} \right .E[i+1,j+1]=min⎩⎨⎧​E[i,j]+(S1[i+1]==S2[j+1]?0:1)E[i,j+1]+1E[i+1,j]+1​

Thus, from location E[i+1,j+1]E[i + 1, j + 1]E[i+1,j+1] we'll move back to E[i,j]E[i, j]E[i,j] (move diagonally up and left in the graph) if the first line in the equation led to the maximum, we'll move back to E[i,j+1]E[i, j+1]E[i,j+1] (move to the left) in case the second line led to the maximum, and we'll move back to E[i+1,j]E[i + 1, j]E[i+1,j] (move up) in case the last line led to the maximum.

What if two or more lines in the recurrence equation all lead to the same value? Then, it doesn't matter in which direction we go — they are all legitimate shortest paths, though each will lead to a different alignment. In other words, there may be more than one way to align two strings to each other.

Backtracking example

Let's go through an example starting with the matrix we computed earlier.

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Note that there may be different ways that we could have used to get the value 2 in the bottom right corner, and we cannot know which one actually led to this value without tracing back our steps. Thus, we start at the bottom right corner (highlighted number) and ask the question: why did we decide the value was 2?

We can follow the recurrence equations and note that the row and column characters match, thus case I applies, meaning that the value at this cell ([4, 3]) is the same as the value at cell [3, 2] which is 2, indicating that the alignment between the strings aligned the last characters of the two strings. We can also note that the other parts of the recurrence equation (indels) yield values 3, and 4, respectively, thus they are not valid choices. We can, thus, shift our attention to cell [3,2], and fill in the last characters of the alignment:

A
A

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

We now find ourselves at location [3, 2], highlighted next in the matrix above. Again, we need to figure out where the value at cell [3, 2] came from. Since the corresponding characters, 'B' and 'C', are different, the values for cell [3, 2] are the minimum of:

mismatch. E[2, 1] + 1 = 2 deletion from S2. E[2, 2] + 1 = 2 deletion from S1. E[3, 1] + 1 = 3

Note that now we have two possible previous values, obtained either by following the diagonal (mismatch) or by moving up in the matrix (deletion), highlighted below on row 2:

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Choosing either yields the same score, but different alignments. Following the diagonal forces the next two characters in the strings to mismatch:

CA
BA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Choosing to move up, leads to the deletion of a character (the last B in S2), indicated by a – character (highlighted below):

-A
BA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

To simplify things at this point, we'll assume we always take the diagonal if possible, so we find ourselves in the following state:

CA
BA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Now we need to figure out the origin of the value at cell [2, 1]. The characters mismatch so a diagonal move would cost 1, and we note that the value of 1 we find in the current cell could only have come from an indel event since E[2,1]=E[1,1]+1=1E[2,1] = E[1, 1] + 1 = 1E[2,1]=E[1,1]+1=1. Thus, we move up:

-CA
BBA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Note that because we moved up in the matrix, we had to "delete" a character from the string represented on the left (S2), i.e., character B does not align to any character in the top string.

We then repeat the process one more time, noting that at cell [1, 1] the characters match, case I (match) applies, and that the other options (indels) imply values larger than the one we have (zero), thus we follow the diagonal and complete the alignment:

A-CA
ABBA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

You can now verify that indeed only two edits are needed to change one string into the other, for example deleting the first B in ABBA and replacing the second B with a C. Conversely, you can insert a B in the second string, and replace the C with another B.

But what would have happened if we followed the other path when we had a choice to make. Remember, we had stopped here:

-A
BA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

You will now see that the only way to get the 1 is to follow the diagonal (a mismatch since B and C do not match):

C-A
BBA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Which brings us back to cell [1, 1] from where the only choice is the diagonal:

AC-A
ABBA

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Again, you can verify that this alignment also has exactly two edits as expected since we started from a cell with value 2 in it.

Please compare these two examples with the shortest paths through the alignment graph that were highlighted on the previous page.

The two alignments we found here are:

A-CA
ABBA

and

AC-A
ABBA

Implementation details

There are two main strategies used for performing the backtracking. The first is the same as I have just described—at each cell in the matrix, you apply the same rules as you did to fill the matrix in the first place in order to figure out where to go next. This approach makes coding quite easy as you simply copy the code you used to fill in the matrix but use it "backwards".

The second approach is to explicitly remember how you got the value in the cell when you computed its value. Specifically, you store a pointer from the cell that you just filled in, to the cell from which you derived its value. You can store several pointers if multiple choices yield the same lowest value. Now you can view the backtracking process as a graph traversal, following the backtracking pointers from the bottom right corner until you reach the top left corner, with each edge encoding a different type of edit, and choosing an arbitrary pointer in case multiple exist at a node.

When you come to a fork in the road, take it. -Yogi Berra

LCS

Graph-based formulation

PreviousModeling alignment as a graphNextFrom edit distance to alignment scores

Last updated 4 months ago