Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Inexact alignment

Example: filling the dynamic programming table

Let us see how this works out for one of the examples shown above—ABBA versus ACA. First, we set up the dynamic programming table by making room for the string '-ABBA' along the rows and string '-ACA' along the columns. The character '-' indicates the empty string, and symbolizes the deletion of a character. The resulting table looks like:

-

A

C

A

-

A

B

B

A

As discussed above, we start by initializing the first row and column of the matrix with the values of the edit distances between the empty string and successively longer prefixes of one or the other strings.

For example, the cell '-,-' will have value 0, as the edit distance between two empty strings is 0. The cell '-,A' will have value 1, as we need to delete (or insert) character A in order to make string 'A' equal to the empty string:

-

A

C

A

-

0

1

A

B

B

A

Following the same logic, we can see that the first row and column are simply increasing series of integers:

-

A

C

A

-

0

1

2

3

A

1

B

2

B

3

A

4

At this point, we are ready to apply the recurrence equations. Note that we can only compute the values located in cells for which we already know the value in the preceding 3 cells (left, top-left, and top), thus, we can only fill in the cell 'A, A'.

As a reminder, the recurrence equations are:

In our case, S1[1] = A = S2[1], thus the cell [1,1] receives the same value as the cell [0, 0] which is 0.

Note that all the other options listed above yield numbers larger than 0 — the indels (last two equations) yield a score of 2. The matrix now becomes:

-

A

C

A

-

0

1

2

3

A

1

0

B

2

B

3

A

4

At this point, we are now ready to fill in two other cells – 'A, C' or [1, 2] and 'B, A' or [2, 1]. Which one you fill out is completely arbitrary, and different algorithms either go row by row, column by column, or diagonal by diagonal. The different orders of operations yield the same results but may provide different opportunities for optimizing the execution if runtime is a concern.

Let's go row by row. We now want to fill in cell [1,2] and note that S1[1] = A which is different from C = S2[2], thus we have the following options:

substitution (forcing A and C to align): E[1, 2] = E[0, 1] + 1 = 2 deletion from S2: E[1, 2] = E[0, 2] + 1 = 3 deletion from S1: E[1, 2] = E[1, 1] + 1 = 1

Since we want to minimize the score, we'll select the last value, yielding the matrix:

-

A

C

A

-

0

1

2

3

A

1

0

1

B

2

B

3

A

4

We, then continue to the next cell,[1, 3], where S1[1] = A = S2[2], thus case I applies and the three equations now are: match. E[1, 3] = E[0, 2] = 2 deletion from S2. E[1, 3] = E[0, 3] + 1 = 4 deletion from S1. E[1, 3] = E[1, 2] + 1 = 2

The resulting matrix (choosing the minimum score from the three above) is:

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

B

3

A

4

We then continue the same process, moving on to cell [2, 1] or 'B, A'', and so on, yielding the full table:

-

A

C

A

-

0

1

2

3

A

1

0

1

2

B

2

1

1

2

B

3

2

2

2

A

4

3

3

2

Since the value in the bottom right corner of the matrix is 2, we know that we can transform the string ACA into ABBA by making just two edits, and that this is the minimum number of edits required.

But this value doesn't tell us anything about how we can figure out exactly which edits we need to make. To do so, we need to "walk back" through all the decisions we have made in a process called backtracking.

Additional resources

PreviousInexact alignment calculation with dynamic programmingNextModeling alignment as a graph

Last updated 4 months ago

E[i+1,j+1]=min⁡{E[i,j]+(S1[i+1]==S2[j+1]?0:1)E[i,j+1]+1E[i+1,j]+1E[i +1, j+1] = \min \left \{ \begin{matrix} E[i, j] + (S1[i+1] == S2[j+1]? 0: 1) \\ E[i, j+1] + 1\\ E[i+1, j] + 1 \end{matrix} \right .E[i+1,j+1]=min⎩⎨⎧​E[i,j]+(S1[i+1]==S2[j+1]?0:1)E[i,j+1]+1E[i+1,j]+1​

A nice animation of this algorithm is available at:

https://algorithm-visualizer.org/dynamic-programming/levenshteins-edit-distance