# Whole-genome alignment ## Introduction to chaining The discussion so far about inexact string alignment assumed that the strings being aligned are relatively short and relatively similar to each other, e.g., two versions of a gene sequence. Additional complications arise when aligning long sequences, such as entire genomes. A global alignment of such long sequences is no longer biologically relevant due to possible large-scale changes between even closely related genomes, such as insertions/deletions of large segments of DNA (e.g., antibiotic resistance cassettes), or large-scale rearrangements (e.g., large segments that are duplicated or inverted or swapped). Finding the best local alignment between such long sequences is equally meaningless, since a good local alignment may only represent a very short segment of the sequences being aligned. To get an intuition about the types of alignments we may want to discover, it is worth looking at the relationship between real genomes. The figure below (from (Eisen, Heidelberg, et al. 2000))shows a "dot plot" relating two *Chlamydia* genomes. Each dot represents a short exact match between the two genomes, and the x and y coordinates of the dot correspond to the locations of the exact match in the two genomes.

Dot plot showing the relationship between the genomes of Chlamydia pneumoniae and Chlamydia trachmoatis. Dots representing exacth matches between the two genomes are largely aligned along a diagonal of a square with the exception of two X-like patterns going perpendicular to the main diagonal. — Dot plot of the alignment between the genomes of Chlamydia pneumoniae (Y axis) and Chlamydia trachomatis (X axis). Each dot is an exact match between the two genomes, and the location of the dots represents the corresponding locations in the two genomes. Two X-like patterns located at the origin marked with an R) and terminus (marked with a T) of replication highlight genomic inversions. Figure from (Eisen, Heidelberg, et al. 2000).

As can be seen in the figure, the exact matches roughly align in long stretches, in this case either along the main diagonal or perpendicular to it. Thus, a reasonable goal would be to identify such stretches of exact matches (or, more generally, local alignments) between two genomes. The main question here is how we define the “goodness” of a set of local alignments. To gather a bit of intuition, see the picture below, representing a dynamic programming table. The thick lines are good local alignments (be it exact or inexact). Just by visual inspection, it appears that there is a good alignment in the middle of the table – there simply is a large density of local alignments around the same diagonal. There are different ways of 'completing' these local alignments into global alignments (the thin lines). Which one of them is better?

A box with several thick diagonal segments going from top left to bottom right. Several segments are connected to each other by thin lines indicating which segments are compatible with each other. — A series of exact or inexact matches within a dynamic programming alignment table. The thick segments are the alignments while thin lines indicate which segments are compatible with each other, i.e., the top left corner of a segment is to the right and below the bottom right corner of the preceding segment. Only one of the paths through the directed acyclic structure shown in the figure can be used to build a full alignment.

To formalize, we will define the 'goodness' of a chain of alignments to be simply the sum of the scores of the chained alignments, as long as the alignments are 'compatible'. By 'compatible' we mean that the alignments are not allowed to overlap. We can also add a score term (or penalty) related to the gap between the alignment (chaining across large gaps in the alignment could be penalized, for example), however we'll leave the details of that implementation as a detail to be worked out by the reader. ## One-dimensional chaining To understand how we can find the best chain, we'll first focus on a 1-dimensional version of this problem. Specifically, assume you have a set of intervals on a line (see the figure below) and you want to find the set of non-overlapping intervals that has maximum weight, where for each interval $$i$$ we define $$w(i)$$ to be its weight.

A set of horizontal line segments at different vertical positions overlapping each other along the horizontal dimension. — Set of intervals along a line. Segments overlapping along the horizontal dimension are incompatible, i.e., one alignment can only use one of them.

The algorithm for computing this maximum weight 'cover' (not to be confused with the traditional set theory max cover problem) follows the dynamic programming paradigm. First, we define $$V\[i]$$ to be the weight of the maximum weight set of non-overlapping intervals ending with interval $$i$$ (we assume the intervals are ordered from $$1$$ to $$n$$ by their left-most coordinate). A simple (and inefficient) dynamic programming approach would iterate over all $$i$$ values in increasing order. For each $$i$$ this algorithm sets $$V\[i] = \max\_{j < i, \text{interval j ends before interval i}} V\[j] + w(i)$$. In other words, at each iteration we check all prior intervals which do not overlap the current one and choose the one with maximum score. It is easy to see that this algorithm runs in $$O(n^2)$$. A more efficient algorithm can be obtained by observing that we do not need to check all prior intervals, just the one with the maximum score that ended before the current interval started. Specifically, the algorithm will record, as above, the score $$V\[i]$$, as well as a value $$MAX$$ representing the maximum $$V\[j]$$ over all intervals $$j$$ that ended before the current position. The ends of all intervals are first sorted, keeping track of whether each coordinate represents a left or a right end of an interval. Then the algorithm proceeds by filling in the $$V\[i]$$ and $$MAX$$ values as follows: ``` foreach interval end e from left to right (in increasing order of coordinates) if (e == left(i)) # if e is the left end of interval i V[i] = MAX + w(i) end if if (e == right(i)) # e is the right end of interval i if (V[i] > MAX) MAX = V[i] end if end if end foreach ``` At the end of the algorithm, the value $$MAX$$ will contain the weight of the maximum weight set of non-overlapping intervals. The actual set of intervals can be found through backtracking by recording a bit more information. The correctness of this algorithm is easy to verify – $$MAX$$ is always set to the weight of the maximum chain of intervals that has been completed (the rightmost position of its rightmost interval was seen), as $$MAX$$ is only set once we encounter the right end of an interval. In addition, when setting $$V\[i]$$, the current interval is implicitly compatible (non-overlapping) with the interval embodied by $$MAX$$ as the latter's rightmost end occurred before the left end of the current interval ($$V\[i]$$ is only set at the left end of an interval). The runtime of the algorithm is dominated by sorting the interval ends, yielding a runtime of $$O(n \log n)$$. ## Two-dimensional chaining The first dynamic programming algorithm described above can be easily extended to two dimensions. Like in the one-dimensional case we can visit the intervals (which are now rectangles bounding the 'good' diagonals within the dot plot or dynamic programming matrix) in order of their x- or y-coordinates, and check all compatible intervals to extend a current chain. Like before, $$V\[i]$$ is the weight of the maximum weight chain that ends with rectangle $$i$$. In two dimensions, the concept of compatibility is a bit more complicated – the upper-left corner of the current rectangle must occur to the right and below the lower right corner of some previous rectangle in order to be able to chain them together (see figure below).

A rectangle containing multiple gray rectangles numbered from 1 to 5. Rectangles labeled 1, 2, 5 have their top-left corner connected to the top-left corner of the large rectangle. Rectangles 3, 4, 5 have their bottom-right corner linked to the bottom-right corner of the large rectangle. The bottom-right corner of rectangle 1 is connected to the top-left corners of rectangles 3 and 4. The bottom-right corner of rectangle 2 is connected to the top-left corner of rectangle 4. — A set of sub-alignments drawn as rectangles numbered from 1 to 5. An edge connects the bottom-right corner of a rectangle to the top-left corner of another rectangle if the latter corner is to the right and below to the former, i.e., the two sub-alignments are compatible (can be connected into a proper alignment of the strings represented on the edges of the dynamic programming table). For clarity, connections from the top-left corner of the matrix to the top-left corners of rectangles 3 and 4, as well as connections from the bottom-right corners of rectangles 1 and 2 to the bottom-right corner of the dynamic programming table have been omitted.

The lines connecting the rectangles shown in the figure also suggest a different way of looking at the chaining problem. We could simply look for a path through the graph defined by these lines (with the nodes representing the rectangles) that maximizes the total weight of the nodes along the path. Such a graph formulation also has an $$O(n^2)$$ runtime. A natural question at this point is whether $$O(n^2)$$ is the best we can achieve in two-dimensions, or whether the more efficient one-dimensional algorithm can be adapted to two dimensions. Let's try! Let us assume that we will proceed along the x axis. We will follow a same procedure as before, visiting every interval end in order to update the value of the chain ending with the current interval ($$V\[i]$$). At each position, however, we will have to take into account more than just one maximum value, tracking the vertical placement of the intervals. Specifically, instead of a single value $$MAX$$, we will maintain a vector $$MAX$$ that contains the tuple $$(b\[i], V\[i])$$ for every interval $$i$$ in this vector, where $$b\[i]$$ is the bottom y coordinate of the interval. We will maintain this vector sorted by $$b\[i]$$. While visiting the intervals along the x axis, as before, we distinguish two situations: **1. The current position is the left end of interval i** In this situation, we need to first find the position of this interval along the y axis (since we visit intervals in order of their x coordinates, we already know their horizontal placement). Thus, we look for the top coordinate of interval $$i$$, $$t\[i]$$ within the array $$MAX$$, looking for the interval $$j$$ such that $$b\[j] > t\[i]$$ and $$b\[j]$$ is the "lowest" such value. Thus, interval $$i$$ occurs to the right (only "complete" intervals occur in $$MAX$$) and below (because of the search above) interval $$j$$, meaning that the two intervals are compatible. We can, thus, set $$V\[i] = V\[j] + w(i)$$. This step take $$O(\log n)$$ time since we can perform the search of $$t\[i]$$ within $$MAX$$ using binary search. If none of the intervals in $$MAX$$ are compatible with interval $$i$$ (i.e., the binary search fails), we simply set $$V\[i] = w(i)$$. **2. The current position is the right end of interval i** In this situation we need to update the $$MAX$$ array. As in the previous case, we'll look for the position of $$b\[i]$$ in this array, using binary search. Let $$j$$ be the "lowest" interval in $$MAX$$ such that $$b\[j] ≥ b\[i]$$. If $$V\[i] < V\[j]$$ we do not need to add interval $$i$$ to the $$MAX$$ array since interval $$j$$ is a better choice for building a high-scoring chain. More precisely, intervals $$i$$ and $$j$$ cannot be both used within a chain, thus we must use interval $$j$$ since it leads to a higher score. **STOP AND THINK:** Convince yourself (prove) that if $$V\[i] < V\[j]$$, intervals $$i$$ and $$j$$ are incompatible. If $$V\[i] > V\[j]$$, we can add interval $$i$$ into the $$MAX$$ array in the correct position to keep this array sorted by the $$b\[i]$$ values. We can also remove from the $$MAX$$ array all intervals $$k$$ that satisfy the conditions $$b\[k] < b\[i]$$ and $$V\[k] < V\[i]$$. In other words, we remove any interval that ends below interval $$i$$ that has a lower score than interval $$i$$. Any chain that includes such an interval can use interval $$i$$ instead, leading to a higher score. **STOP AND THINK:** Prove the last sentence in the prior paragraph. This second step requires a runtime of $$O(\log n)$$ for the search process and insertion of interval $$i$$ into the $$MAX$$ array. To bound the amount of time spent removing intervals from the $$MAX$$ array, we make two observations: (i) due to the process described for inserting an interval into the array, the $$MAX$$array is sorted both by the $$b\[i]$$ and the $$V\[i]$$ values; (ii) an interval can only be added or removed once during the entire execution of the algorithm. The consequence of the first observation is that all intervals $$k$$ that satisfy $$b\[k] < b\[i]$$ and $$V\[k] < V\[i]$$ occur next to each other in the array, thus they can be remove in time proportional to the number of these intervals (rather than time proportional to the size of the $$MAX$$ array). The consequence of the second observation is that throughout the entire execution of the algorithm, the time spent deleting intervals from the $$MAX$$ array is bounded by $$n$$ - the total number of intervals. Putting all this information together, we can determine that the total runtime of the two-dimensional chaining algorithm is $$O(n \log n + n) = O(n \log n)$$. The first term corresponds to binary search being executed $$2n$$ times (once for each end of each interval), and the second term corresponds to the cost of removing intervals from the $$MAX$$ array. The score of the highest scoring chain will be stored in the last element of the $$MAX$$ array (as a reminder, this array ends up being sorted by both score and bottom coordinates). Finding the actual set of intervals within this chain requires backtracking. ## Summary To summarize, when aligning long genomes we start by finding good local alignments, then chain these into either a global alignment (as described above), or simply longer local alignments (using a local version of the algorithm described above). Once the alignments are chained, we can fill in the gaps between them using the traditional global alignment approach. ## Historical note The program MUMmer developed by Art Delcher (former scientist at the CBCB) introduced the idea that the local alignment 'guides' should be Maximal Unique Matches (MUMs) (Delcher, Kasif et al. 1999) – longest matching segments that are unique in both of the strings being aligned. Adam Phillippy (former UMD grad) applied the chaining algorithm to find inexact matches around the MUMs found by MUMmer (Delcher, Phillippy et al. 2002). He implemented the $$O(n^2)$$ version since it is easier to implement and debug and the improved performance of the $$O(n \log n)$$ algorithm was not necessary in typical applications. The MUMmer package was originally developed when Art worked at Celera Genomics. Open-source suffix tree code from Stefan Kurtz allowed the package to be released open source (Kurtz, Phillippy et al. 2004). ## Exercises 1. For a sequence of n integers, the longest increasing subsequence problem (LIS) represents finding the longest subsequence in which all integers are listed in increasing order. For example, for string:\ \ 5,3,4,9,6,2,1,8,7,10\ \ one possible answer is:\ \ 3,4,6,8,10 1. Describe a simple dynamic programming algorithm to compute the longest increasing subsequence in $$O(n^2)$$. 2. Can you see any similarity between the LIS problem and the chaining algorithms described in the chapter? Indicate whether the 1-D or 2-D chaining algorithms is most similar to the LIS problem, and sketch a mechanism for solving the LIS problem in $$O(n \log n)$$, assuming you already have code that can compute the optimal solution to the chaining problem. 2. You are trying to find the placement of sequences of length 100 along the human genome (3 Gbp), allowing for at most 4 mis-matches, however you do not want to use Smith-Waterman directly. Instead you will use an exact matching algorithm to find good candidate matches then extend these with Smith-Waterman. What is the length of the exact match seeds you will use in order to guarantee you that no correct alignments are missed? 3\. You are given two strings of integers, e.g.:\ \ S1= 1 7 5 3 2 8 9\ S2= 14 4 1 3 6 5 4\ \ Assume you partition these strings into the same number of sections. Sum up the numbers present in each section of each string S, and replace the string with a new (possibly shorter) string S' representing the sums of its parts. For example, if S1 is partitioned: | 1 7 5 | 3 2 | 8 | 9 |, you would replace it with the string: S1' = 13 5 8 9. Compare the strings S1' and S2' by summing up the squares of the differences between corresponding numbers (remember, the prime strings have the same number of integers because the original strings were partitioned into the same number of sections).\ \ Describe a dynamic programming algorithm that computes the optimal way of partitioning the strings, i.e. the approach that minimized the sum of the squared differences between the primed strings.\ \ For example, for the strings shown above a possible solution is:\ \ S1= | 1 7 5 | 3 | 2 8 | 9 |\ S2= | 14 | 4 | 1 3 6 | 5 4 |\ \ S1'= 13 3 10 9\ S2'= 14 4 10 9\ \ Diff= 1 1 0 0\ Dist= 1² + 1² = 2\ \ **Additional constraints:** each string must be split into two or more sections, and none of the sections can be empty.\ \ **Hint:** Try to think in terms of the spaces between numbers, instead of the numbers themselves, i.e. possible locations of a "cut" defining a subsection of a string. Try to determine how to pair up cut-points between the two strings. ## References Delcher, A. L., S. Kasif, R. D. Fleischmann, J. Peterson, O. White and S. L. Salzberg (1999). *Alignment of whole genomes*. Nucleic Acids Res 27(11): 2369-2376. Delcher, A. L., A. Phillippy, J. Carlton and S. L. Salzberg (2002). *Fast algorithms for large-scale genome alignment and comparison*. Nucleic Acids Res 30(11): 2478-2483. Eisen JA, Heidelberg JF, White O, Salzberg SL. Evidence for symmetric chromosomal inversions around the replication origin in bacteria. Genome Biol. 2000;1(6):RESEARCH0011. doi: 10.1186/gb-2000-1-6-research0011. Epub 2000 Dec 4. PMID: 11178265; PMCID: PMC16139. Kurtz, S., A. Phillippy, A. L. Delcher, M. Smoot, M. Shumway, C. Antonescu and S. L. Salzberg (2004). *Versatile and open software for comparing large genomes*. Genome Biol 5(2): R12.