Introduction to inexact alignment

Inexact sequence alignment

So far in the material we have only discussed situations where the two strings we align to each other match exactly. What if we want to allow for more extensive differences between the two strings? The differences we may allow are simple 'typos' such as characters inserted or deleted in one of the strings, and characters aligned to mismatching characters. How can we define a good alignment, and how can we compute it?

Assume, for example that we are looking at string ABBA. We might encounter the following situations. The characters aligned to each other indicate how we might 'edit' one string to make it look like the other.

The alignment of string ABBA to string ABA indicates an insertion/deletion event:

ABBA
AB-A

Note that the "dash" (-) character indicates that the second B in string "ABBA" is deleted in order to align ABBA to ABA. Alternatively, we can say that character B must be inserted in string "ABA" in order to transform it into string "ABBA". Given that the same event can be seen as both an insertion and deletion depending on which string we view as the reference, such events are typically referred-to as insertion/deletion, or indel for short.

The alignment of string ABBA to string ABCA reveals a substitution event:

ABBA
ABCA

Here the two strings are aligned to each other, and one of the characters needs to be changed/substituted to transform one of the strings into the other.

The alignment of string ABBA to string ACA demonstrates how multiple of these events may need to be accounted for:

ABBA
A-CA

Here we have both an indel event (B/-) and a substitution event (B/C). Note that we could also align the strings in a way that avoids substitutions altogether:

ABB-A
A--CA

Essentially we can view the problem of inexactly aligning one sequence to another as an editing operation—changing one string by inserting, deleting, or substituting characters until it perfectly matches another one. As it is clear from the above, there are many ways we could perform such edits. Clearly finding an algorithm that reports all such possible alignments is impractical, both computationally and in terms of the interpretability of the results. For simplicity, we will focus for now on the task of finding just the 'best' match between two sequences defined in terms of the number of edits that need to be performed.

Aside: Often times we (as computer scientists) like to focus on well defined optimization problems even though we do not fully understand what objective function we need to optimize in order to match a specific biological 'meaning' (which itself is not too well defined a concept). An emerging and promising area of research is the development of approaches that enumerate some sub-optimal solutions as well, allowing additional biological information to be brought in to identify the most biologically-relevant solution even when computationally modeling such biological information is impractical.

Coming back to the inexact alignment problem, we'll define the edit distance between two strings to be the minimum number of insertions, deletions, and substitutions necessary to convert one of the strings into the other. This distance is also called the 'Levenshtein distance' based on the scientist that studied it in the 1960s (i.e., before we could sequence DNA).

PreviousThe Burrows-Wheeler transform and the FM-index NextInexact alignment calculation with dynamic programming

Last updated 6 months ago