> For the complete documentation index, see [llms.txt](https://mpop.gitbook.io/bioinformatics-lecture-notes/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://mpop.gitbook.io/bioinformatics-lecture-notes/multiple-sequence-alignment/global-multiple-sequence-alignment.md).

# Global multiple sequence alignment

## Introduction

The global multiple sequence alignment is a relatively straightforward extension of [global pairwise sequence alignment](/bioinformatics-lecture-notes/inexact-alignment/introduction-to-inexact-alignment.md). Given $$n$$ sequences, we seek to insert "gap" (-) characters in them such that all sequences are the same length, i.e., they are all aligned to each other. Of course, we do not want just any kind of alignment, but the one that minimizes the disagreements between the sequences (similar to the definition of edit distance), or an alignment that maximizes the fit of the sequences, e.g., based on some definition of the similarity between the aligned nucleotides or amino-acids.  An example is shown below:

```
Seq1	ACC-AGTGA-CT
Seq2	-CGTAGCGA-CT
Seq3	ACC-AGTGT-CT
Seq4	ACGTAGTCATCT
```

Note that we typically expect at least one character per column to not be a gap character, otherwise the alignment can be simplified by removing the "all gaps" columns.

## Scoring multiple sequence alignments

### Edit-distance-like scores

Let us first assume that we are aiming to define an edit distance-like score where we simply count the differences between the sequences. While such a distance makes intuitive sense, it's not immediately obvious how to count differences within a column where you have both identical and distinct letters. Assume, for example, that you have a column that contains: A, A, A, C, C, - .  What is the edit distance?

One approach that is commonly used in motif finding is to simply define the edit score of a column to be the number of letters that differ from the majority letter (in the example above A). Thus, the score of the column would be 3, since C, C, and - differ from A.  See more discussion on the nuances of this score in the [motif finding section](/bioinformatics-lecture-notes/multiple-sequence-alignment/motif-finding.md).&#x20;

Another way to think about the score is to compute all pairwise distances between the letters in the column:

$$D = d(A, A) + d(A, A) + d(A, C) + d(A, C) + d(A, -) + \dots = 11$$

Importantly - this assumes we're only counting the difference between one letter and the ones following it, otherwise each distance would be double-counted.

### Sum of pairs alignment score

The pairwise distance approach described at the end of the previous section can be applied separately to each column in the alignment to generate the score of the full multiple alignment.  Furthermore, it is easy to see that this letter-by-letter score can be computed on a sequence-by-sequence basis, looking at the sum of the pairwise edit distances between the sequences in the multiple alignment:

$$\text{Score} = \sum\_{s\_i, s\_j \in S} d(s\_i, s\_j)$$

**STOP AND THINK:** are the $$d(s\_i, s\_j)$$ values the optimal edit distances (as defined in the inexact alignment section) between sequences $$s\_i$$ and $$s\_j$$?

The short answer is NO.  The location of gaps and mismatches in the global multiple alignment are intended to minimize the "cost" of the entire alignment, and do not necessarily minimize the cost of aligning any given pair of sequences. Thus, it is important to distinguish whether a distance is the optimal distance between a pair of sequences (according to inexact alignment, distance we will represent as $$D(s\_i, s\_j)$$)) or whether it is the distance induced by the multiple alignment $$M$$, represented as $$d\_M(s\_i, s\_j)$$.  In general, $$d\_M(s\_i, s\_j) \ge D(s\_i, s\_j)$$, since the optimal alignment has the lowest possible distance score.&#x20;

For completeness, we rewrite the equation above to use this new notation:

$$\text{Score}(M) = \sum\_{s\_i, s\_j \in S} d\_M(s\_i, s\_j)$$

So far we have assumed that the distance metric was edit distance, however the sum of pairs approach can apply to any other distance or similarity metric, such as the ones described in the [sequence alignment statistics](/bioinformatics-lecture-notes/advanced-inexact-alignment/sequence-alignment-statistics.md) section.&#x20;

### Bayesian integral log-odds scores

After reading the [section on alignment statistics](/bioinformatics-lecture-notes/advanced-inexact-alignment/sequence-alignment-statistics.md), you should have a good appreciation for the fact that pairwise alignment scores are defined by a rigorous statistical framework, based on estimating the likelihood that two aligned letters could evolved into each other within evolutionarily-related proteins. The sum-of-pairs score, however, cannot generalize the statistical properties of BLOSUM scores, for example, to multiple alignments. Specifically, the likelihood that a column of letters is generated by some evolutionary process within a set of related proteins, is not simply the pairwise sum of the probabilities of transitions between paired letters.  To address this issue, (Altschul et al, 2010) developed a scoring system specifically for columns of multiple alignments, the Bayesian Integral Log-oDds score (BILD). At a very high level, this scoring system relies on Dirichlet mixture models to estimate the structure of the space of columns in protein multiple alignments. A deeper discussion is beyond the scope of this book.

## Optimizing sum-of-pairs alignment scores

Above we defined various ways of determining the "score" of an alignment, which brings us to the question: *How can we find the alignment that optimizes the score over all possible alignments of the strings?*&#x20;

This question can be "easily" answered through an extension of the dynamic programming algorithm for pairwise sequence alignment. The intuition is to extend the $$V(i,j)$$ values from the 2-sequence context, to $$V(i, j, k)$$ for 3 sequences, etc. The resulting algorithm runs in $$O(L^n)$$ where $$L$$ is the length of the aligned sequences, and $$n$$ is the number of sequences.

**STOP AND THINK.** Does this algorithm solve the multiple alignment problem for all scoring systems, or just for the sum-of-pairs score?

Clearly, this algorithm is impractical, though some run-time improvements can be obtained by carefully pruning the dynamic programming table. For example, one could start with all pairwise optimal alignments between sequences and define, along each face of the dynamic programming high-dimensional 'cube' a region of acceptable sub-optimal alignments (essentially a swath surrounding the optimal path). The search for the optimal multiple sequence alignment can then focus on just the section of the high-dimensional table that is consistent with all the pairwise regions, similar in a way to the banded alignment approach used in pairwise alignment.

In general, the Multiple Alignment Problem is NP-hard, leading to the need for heuristic approximation algorithms.

### Star alignment

One such algorithm is called **Star Alignment** – deriving its name from the fact that the multiple alignment is constructed by aligning all the sequences to a same "center" sequence. The alignment process is called "progressive alignment" and proceeds as follows:

1. A sequence is selected to represent the "center" around which the alignment will be built (more on how this sequence is selected later)
2. Construct alignment of sequence 1 with the center sequence (using standard pairwise alignment)
3. Construct alignment of sequence 2 with center sequence (using standard pairwise alignment). Any gaps inserted within the center are propagated to the already constructed alignment with sequence 1.
4. repeat step 3 until all sequences have been aligned

**Note:** the implementation can be a bit tricky as you need to avoid having spurious gaps inserted in the alignment – e.g., if two sequences both lead to the insertion of a gap at the same location in the center sequence, the alignment algorithm should ensure a single gap is inserted (rather than two adjacent gaps).

The question of aligning one sequence against an existing alignment, or two alignments to each other is discussed in more detail below in <mark style="color:$warning;">Section</mark> .

#### Choice of the center sequence

For reasons that will become apparent below, the center sequence $$s\_c$$ is chosen to be the sequence that minimizes the equation:

$$\sum\_{i \ne c} D(s\_i, s\_c)$$

over all possible choices of $$s\_c$$.

#### **Theorem**

If the distance between sequences satisfies the triangle inequality (i.e., the distance is a metric), the star alignment has a score at most $$2 \* \text{OPT}$$ , where $$\text{OPT}$$ is the score of the optimal multiple alignment.

#### Proof

Let $$S^\*$$ be the sum of pairs score of the star alignment. By the definition of this score and the triangle inequality property:

$$S^*=\sum\_{i \ne j} d\_{*}(s\_i, s\_j) \le \sum\_{i\ne j} (D(s\_i, s\_c) + D(s\_j, s\_c)) = 2(k - 1) \sum\_i D(s\_i, s\_c)$$

where $$k$$ is the number of sequences and $$d\_\*$$ is the distance between sequences implied by the star alignment. The first inequality is due to the triangle inequality – each pairwise comparison between two sequences can be bounded by the longer 'path' passing through the chosen center sequence. The second equality requires a bit of counting – there are $$k (k - 1)$$ distinct pairs of sequences, and for each of them we compute $$D(s\_i, s\_c)$$ twice.

Let $$\text{OPT}$$ be the score of the optimal alignment. We can write this score as:

$$\text{OPT} = \sum\_{i \ne j} d\_{\text{OPT}}(s\_i, s\_j) = \sum\_i \sum\_j d\_{\text{OPT}} (s\_i, s\_j) \ge \sum\_j \sum\_i D(s\_i, s\_j) \ge k \sum\_i D(s\_i, s\_c)$$

The first equality simply transforms the pairwise sums into a collection of star alignments with a different sequence $$j$$ as the center. The inequality is due to the fact that the pairwise distance between two sequences in the optimal alignment is at most as good as the optimal pairwise alignment between the sequences. The final equality is due to the fact that the star with the center $$s\_c$$ obtains the best score from among all possible choices of center.

Combining the two equations, we get:

$${S^\* \over \text{OPT}} \le 2 - {2 \over k} \lt 2$$

i.e., the star alignment is at most twice as bad as the optimal alignment. We can also say that the star alignment achieves a 2-factor approximation.

### Steiner and consensus string

Let us spend a bit more time on the alignment problem, at the conceptual level. Above we came up with an alignment approach that picked a special string, the center, to guide the alignment. This center string is one from among the strings being aligned. What if we could pick another string such that the score of the star alignment for this string is optimized? This string (denoted $$T$$ in the following) is called the Steiner string. For this string, $$\sum\_i D(s\_i, T)$$ is minimized over all possible choices of $$T$$. In some sense you can get the intuition that an alignment built around $$T$$ would have a better score than that built around another choice of center. Also note that this string $$T$$ can, at a very broad level, be viewed as a way to summarize the alignment (though more precise ways of summarization will be described below). This string embodies the commonality between the different strings being aligned.

Another way of summarizing an alignment is to construct a **consensus string**. Specifically, for each column in the alignment we pick the letter that minimizes the distance from all the other letters in the alignment (essentially a Steiner nucleotide or amino-acids). The resulting string is the consensus string (see below).

```
Seq1 ACC-AGTGA-CT
Seq2 -CGTAGCGA-CT
Seq3 ACC-AGTGT-CT
Seq4 ACGTAGTCATCT
Cons ACCTAGTGA-CT
```

Note that for large enough alignments of similar enough sequences, the consensus string simply comprises the most abundant letter in each column (the sum of distances contains many zeroes). In generaly, however, the consensus letter for a column may not be the same as the most frequent letter in the column, depending on the choice of scoring system. &#x20;

It is not immediately obvious but there is a connection between the consensus string and the Steiner string – specifically, the consensus string of the optimal alignment is actually the optimal Steiner string. Needless to say, finding this string is NP hard.

Nonetheless, the consensus sequence of a multiple alignment (not necessarily the optimal one) is commonly used in practice as a way to summarize the alignment, e.g., in the context of progressive alignment (sequences are iteratively aligned to the consensus sequence constructed so far).

### Tree-guided alignment

Neither the optimization problem implied in the definition of the multiple alignment problem, nor the approximation provided by the star alignment have a biological interpretation. As a corollary, the resulting multiple alignments may not capture interesting biological phenomena. A more "biologically meaningful" approach to multiple alignment relies on the construction of a guide tree that, intuitively, captures the evolutionary relationship between the sequences being aligned. The guide tree stores the sequences at its leaves, and the progressive alignment approach described in the context of the star alignment can be modified to allow sequences to be merged together in a bottom-up fashion into a single multiple alignment.

Virtually all "popular" multiple alignment programs rely on this paradigm. Two examples are ClustalW (Thompson, Higgins et al. 1994) and Muscle (Edgar 2004).

## Multiple alignment of genomes

So far we've discussed global multiple alignment – the sequences have to be aligned end-to-end. This approach simply doesn't scale to whole genomes. Also, whole genomes have rearrangements, events that are not captured in the typical edit-distance alignment algorithms. One approach that can handle the alignment of multiple genome is Mauve (Darling, Mau et al. 2004) <https://darlinglab.org/mauve/mauve.html> .

### Outline of the Mauve approach

1. Find a collection of maximally-unique-matches (MUMs) shared by 2 or more genomes (multi-MUMs).\
   \
   One can think of a variety of ways of doing this, but in principle the basic idea is to first find exact k-mer matches across multiple genomes (making sure none is a repeat in any of the genomes) then extend these matches along groups of genomes as long as the corresponding sequences agree. Other useful techniques are described in the [string indexing chapter.](/bioinformatics-lecture-notes/string-indexing/introduction-to-string-indexing.md)&#x20;
2. Find locally collinear blocks (LCB) among the MUMs – these are sets of MUMs that occur in the same order in all genomes. Note: the LCB algorithm requires all MUMs to occur in all genomes so it's only applied to those MUMs.\
   \
   The idea of the LCB algorithm is to repeatedly sort the MUMs based on their occurrence in each genome, and mark breakpoints in this sorting – places when the MUMs sorted by genome $$i$$ occur out of order in genome $$j$$.
3. Construct a global guide-tree from all genomes, using the weight of the shared MUMs as a similarity/distance measure. This tree is then used to align the regions between the MUMs using ClustalW.

## Exercises

1. The optimal Steiner multiple alignment for a set of $$k$$ strings, is the alignment that minimizes $$E(S^*) = \sum\_i D(S^*, S\_i)$$over all possible choices of the Steiner string $$S^*$$. Show that, if the scoring function $$D$$ satisfies the triangle inequality, there exists a string $$\bar{S}$$ in the original set of strings such that $${{E(\bar{S})} \over {E(S^*)}} < 2$$.
2. We showed that the sum-of-pairs score of the optimal star alignment is at most twice the score of the optimal multiple alignment of a set of sequences. Can you generalize this result? Specifically, the star is a special type of tree. Assume that instead of a star you could construct a tree containing the sequences being aligned at both the leaves and internal nodes. The score of such a tree is, just as in the star case, the sum of the optimal pairwise alignment scores for sequences adjacent in the tree. Assume you had an algorithm that could compute the optimal tree alignment (sum of distances is minimized). Can you prove anything about the sum-of-pairs score of the induced alignment? (Note: assume distances satisfy triangle inequality)

## References

* Altschul SF, Wootton JC, Zaslavsky E, Yu YK (2010) *The Construction and Use of Log-Odds Substitution Scores for Multiple Sequence Alignment.* PLOS Computational Biology 6(7): e1000852. <https://doi.org/10.1371/journal.pcbi.1000852>
* Darling, A. C., B. Mau, F. R. Blattner, and N. T. Perna. (2004) *Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements.* Genome Research 14 (7): 1394–403. <https://doi.org/10.1101/gr.2289704>.
* Thompson, J. D., D. G. Higgins, and T. J. Gibson. (1994) *CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice.* Nucleic Acids Res 22 (22): 4673–80. 7984417.