Sequence alignment statistics
Introduction
The sequence alignment problem can be viewed as the interplay of three main considerations:
The definition of what is a good alignment (generally modeling some biological meaning)
Algorithm to compute a good alignment
Statistical arguments to determine whether an alignment can be explained by chance
The three considerations are in tension: the most biologically relevant score may be intractable to compute or may entail intractable statistics, while tractable definitions may not be closely related to biology.
We have already described algorithms for finding an inexact match between two strings, assuming some arbitrary scoring function that determined the score of aligning or mis-aligning characters to each other. In this chapter we delve a bit more into the statistical meaning behind those scores.
Throughout the following we will focus primarily on local alignment as the statistics are well defined and easier to reason about. Such a statistical framework does not exist for global alignment.
We will focus on two main questions:
How to estimate the probability that a particular alignment can be explained by chance?
How to define the alignment score?
The former question is particularly relevant in the context of database searches. Assume we have a new biological sequence (for now we'll assume we are looking at protein sequences) and try to find a similar sequence in a database. Assume our alignment algorithm finds an alignment of score S between this sequence and another sequence in the database. How do we know that the alignment we found is 'real', i.e., that it is not due simply to chance?
This question falls into the category of statistical hypothesis testing where we are asking whether a particular observation belongs in a particular distribution. The null hypothesis assumes that the observation is indeed derived from the distribution, and statistical tests can tell us whether this null hypothesis is invalidated. In the context of alignment, the null hypothesis is that our sequence found a match simply by chance. Rejecting the null hypothesis would mean that the match found is 'surprising', i.e., the matching sequence is probably related to our query.
The significance of an alignment
Before we proceed it is important to realize that we try to find a scoring function that has some biological meaning, in order to ensure that our alignments are biologically correct. By necessity, to make the statistics tractable, we will make some simplifying assumptions which we will relax later. For now we will assume that the alignments have no gaps.
The scores are represented by a 20x20 matrix that contains the score of aligning two amino-acids to each other. These scores are defined statistically - they are deviations from chance that two amino-acids are aligned to each other. We need a definition of background probabilities and then try to find a string of letters that are aligned in a 'surprising' way.
We will describe how these scores are computed later. For now, we will focus on the properties we might want the scores to have. It may not be immediately obvious from the description of the dynamic programming algorithm, but the type of alignment we find depends strongly on the scoring function. With a small change we can, for example, change the dynamic programming algorithm to only find exact matches (an overkill given that we already know how to do it more efficiently). We will require the following two properties from a 'good' scoring system:
The expected score is negative: $\sum_{i,j} p_i p_j s_{i,j} < 0$. This property is useful for the math that will follow, but also ensures that we can find real alignments among the noise. If scores are largely positive, the alignment algorithm would generally produce very long alignments even when aligning random sequences, as simply adding more characters to the alignment will increase the alignment score.
At least some of the scores are positive. This property is useful for ensuring we will find at least some alignments. If all scores are negative, the Smith-Waterman algorithm would always select the null alignment of score 0.
Theorem: Any scoring system satisfying the properties described above can be written as $s_{i,j} = \left(\ln \frac{q_{i,j}}{p_i p_j}\right)/\lambda$ for some positive parameter $\lambda$.
Conversely, a non-zero matrix computed according to the formula above will satisfy our two conditions. The values qi,j are called the target frequencies and we will discuss them later. The λ value is simply a positive scaling factor that will not affect the actual alignment algorithm but it is useful for defining the statistics. The value of this parameter can be computed analytically as shown below.
To prove the theorem above, we can define a new function $f(x) = \sum_{i,j} p_i p_j e^{s_{i,j} x}$, where $s_{i,j}$ is the scoring function we chose. Based on the assumptions we made about this scoring function we can derive a few properties of $f$. First, $f(0) = \sum_{i,j} p_i p_j = 1$, simply because the background probabilities sum to one. Second, $f'(0) = \sum_{i,j} p_i p_j s_{i,j} < 0$, because the expected score is negative. In other words, the function $f(x)$ crosses the y axis at value 1 and decreases as $x$ increases past 0. Third, $f''(x) = \sum_{i,j} p_i p_j s_{i,j}^2 e^{s_{i,j} x} > 0$, i.e., $f$ is a convex function. Finally, since at least one of the $s_{i,j}$ values is positive, $\lim_{x \to \infty} f(x) = \infty$, implying that the equation $f(x) = 1$ has a unique positive solution $x = \lambda$. This value can be easily computed with standard numerical algorithms for any scoring function.
We can now simply define $q_{i,j} = p_i p_j e^{\lambda s_{i,j}}$. These values are all positive and sum up to one, since $\sum_{i,j} q_{i,j} = f(\lambda) = 1$. Furthermore, $s_{i,j} = \left(\ln \frac{q_{i,j}}{p_i p_j}\right)/\lambda$, thereby proving the theorem.
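The numerical computation of $\lambda$ can be sketched with a simple bisection on $f(x) = 1$. The two-letter alphabet, background frequencies, and match/mismatch scores below are hypothetical illustrations, not a real substitution matrix:

```python
import math

# Hypothetical two-letter alphabet satisfying both conditions:
# expected score 0.5*1 + 0.5*(-3) = -1 < 0, and a positive entry exists.
p = [0.5, 0.5]
s = [[1, -3],
     [-3, 1]]
n = len(p)

def f(x):
    # f(x) = sum_{i,j} p_i * p_j * exp(s_ij * x)
    return sum(p[i] * p[j] * math.exp(s[i][j] * x)
               for i in range(n) for j in range(n))

# f(0) = 1, f dips below 1 for small positive x, then grows without
# bound, so the unique positive root of f(x) = 1 can be bracketed
# and located by bisection.
lo, hi = 1e-9, 1.0
while f(hi) <= 1.0:
    hi *= 2.0
for _ in range(100):
    mid = (lo + hi) / 2.0
    if f(mid) < 1.0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2.0

# The implied target frequencies q_ij = p_i * p_j * exp(lam * s_ij)
# sum to f(lam) = 1, as in the proof above.
q = [[p[i] * p[j] * math.exp(lam * s[i][j]) for j in range(n)]
     for i in range(n)]
```

For this toy matrix the root comes out a little above 0.6; any real scoring matrix satisfying the two conditions can be handled the same way.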
We will now focus on the question of aligning two random sequences. To perform the statistical hypothesis test described above, we need to find, for each score S, the expected number of independent local alignments that are greater than or equal to this score E(S,m,n). By independent we mean alignments that do not overlap or touch each other. The product of the lengths of the two strings N=m×n is the size of the search space.
Simply using intuition, we can see that $E(S, m, n)$ grows proportionally to $mn$ (if you double the search space you double the number of alignments expected). How does $E(S, m, n)$ vary with $S$? Again intuition indicates that the expected number of random alignments must decrease with increasing $S$ (the higher the score, the fewer such alignments), and that it will decrease exponentially. The intuition is based on a simple coin-flipping experiment where we count the number of heads at the beginning of a series of flips. The likelihood of a run of $h$ heads at the beginning of the series decreases exponentially with the length of the run: $p(h \text{ heads}) = (1/2)^h = e^{-(\ln 2) h}$.
Thus, for a general scoring function, we can expect that $E(S, m, n) \approx e^{-aS}$ for some constant $a$.
One can prove mathematically that $E(S, m, n) = K m n e^{-\lambda S}$ for a constant $K$; however, this proof is very complex. Note that this expectation can be easily interpreted, even if the math underlying it is complex. Also, note that this formula is asymptotic – it approaches this value for large values of m and n. For small values of m and n additional corrections are necessary.
The expectation (or E-value) is just one part of the equation – the statistical hypothesis test relies on a p-value – the probability that an alignment with score higher than $S$ exists (remember, $E$ is simply the expected number of such alignments). Using Poisson statistics it can be shown that $p = 1 - e^{-E}$. For small values of $E$, $p \approx E$.
Also note that the formulas above are parametrized by $K$ and $\lambda$, parameters specific to the scoring function which must be computed in order to compute the statistics. The intuition behind the $K$ parameter is that it represents the reduced search space within which one can expect to find optimal alignments. To simplify the equations we can fold the parameters into the score, obtaining a new score $S' = (\lambda S - \ln K)/\ln 2$. Now the expectation can be written as $E(S', m, n) = mn\,2^{-S'}$, with $S'$ expressed in bits (also called the bit score).
This score can be easily interpreted, as increasing the score by one bit halves the number of random alignments. As an example, imagine a sequence of length 250 aligned to a database of length one billion. The expected number of local alignments with normalized scores greater than 35 bits is given by the search space – $250 \times 10^9 \approx 2^8 \times 2^{30} = 2^{38}$ – divided by $2^{35}$, or $E \approx 2^{38}/2^{35} = 2^3 = 8$. Increasing the minimum score by 10 bits reduces this number by $2^{10}$, or roughly 1000, leading to an expectation of 0.008 random alignments with a bit score of at least 45.
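The arithmetic above can be checked directly. A short sketch using the same search space (length-250 query against a billion-letter database):

```python
import math

# Search space from the example: length-250 query, 10^9-letter database.
m, n = 250, 10**9

def evalue(bits):
    # E = m * n * 2^(-S'), with S' the normalized score in bits
    return m * n * 2.0 ** (-bits)

e35 = evalue(35)            # close to the rounded 2^38 / 2^35 = 8 estimate
e45 = evalue(45)            # ten more bits: exactly 2^10 = 1024 times fewer
p45 = 1.0 - math.exp(-e45)  # Poisson p-value; for small E, p is close to E
```

Note that the exact value of e35 is slightly below 8 because $250 \times 10^9$ is a bit less than $2^{38}$; the text's powers of two are rounded.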
As we already mentioned, we focused on a simplified version of the alignment problem where we allow no gaps. The dynamic programming alignment algorithm can tolerate gaps, and we can clearly add a gap scoring function to that alignment. It turns out that this situation can also be modeled statistically in a similar way to the description above, however this fact is only evident empirically and has not yet been proven analytically.
We need to make sure gaps are penalized enough otherwise we can always get a good alignment by simply stringing together high scoring letter pairs with long stretches of gaps. This is equivalent to the requirement that the expected alignment score be negative. There is a transition point in the gap score beyond which the gapped alignments behave just like a local alignment. The transition point can be found experimentally, but there is no mathematical proof that allows one to calculate it.
For a large enough penalty we have the same general statistics, but we do not know how to calculate the values of $\lambda$ and $K$. Instead we can estimate these parameters using the following approach. First it is important to note that the score of the highest scoring alignment (in both the gapped and ungapped cases) follows an extreme value distribution. Such a distribution is given by the probability density function $\mathrm{pdf}(x) = \lambda e^{-\lambda(x-u) - e^{-\lambda(x-u)}}$, where $\lambda$ is called the scale, and $u$ the characteristic value. For optimal local alignments, $\lambda$ is the same as the $\lambda$ parameter we defined above, and the characteristic value is $u = (\ln K m n)/\lambda$. For ungapped alignments $\lambda$ and $K$ can be determined analytically. For gapped alignments we can simply run many gapped alignments of random sequences using a given gapped scoring function, record the highest score of each, then fit an extreme value distribution to the observed values. From the fitted distribution we can then calculate the values of $\lambda$ and $K$.
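The fitting step can be sketched with a method-of-moments estimate of the extreme value distribution's parameters ($\lambda$ from the sample variance, $u$ from the sample mean). The synthetic sample below stands in for the maxima that would come from aligning shuffled sequences:

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def fit_evd(scores):
    """Method-of-moments fit of an extreme value (Gumbel) distribution:
    scale lam = pi / sqrt(6 * var), location u = mean - gamma / lam."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / n
    lam = math.pi / math.sqrt(6.0 * var)
    u = mean - EULER_GAMMA / lam
    return lam, u

# Sanity-check the fit on synthetic maxima drawn from a known EVD via
# inverse-CDF sampling: x = u - ln(-ln(U)) / lam for uniform U.
random.seed(0)
true_lam, true_u = 0.3, 25.0
sample = [true_u - math.log(-math.log(random.random())) / true_lam
          for _ in range(20000)]
lam_hat, u_hat = fit_evd(sample)
```

In practice one would replace the synthetic sample with the top scores of many alignments of random sequences under the chosen gapped scoring function.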
All our theory so far assumes a model of proteins as random strings of letters. Proteins, however, have sections that do not behave like the random model, in part due to the actual structure of the protein. In globular proteins the random model is pretty good. For more 'unusual' proteins the statistics will not work, e.g., the expected scores in certain regions are positive. Some such examples are proteins with long 'tails' composed of small repeated patterns. Database searches try to screen out such regions because the statistics don't work, and in some sense the alignment idea itself does not make sense in these regions anyway.
Defining alignment scores
As pointed out above, any 'good' alignment score can be written as a log odds: $s_{i,j} = \log \frac{q_{i,j}}{p_i p_j}$. This is true for any score, whether or not it captures a deeper biological meaning. In this section we will focus on how to tie the scores more closely to actual biological information. We will do so by defining the target frequencies $q_{i,j}$.
To get a better intuition of what we are looking for, note that the distribution/histogram of alignment scores for random alignments is exponential – the higher the score, the fewer random alignments with that score. For a search space $N = mn$, we expect at most one random score to exceed $\log_2 N$ bits. This property holds for all possible scoring systems. We will try to find a scoring system that maximizes the scores for true non-random alignments, i.e., a scoring system that best separates true matches from random matches. We will use a score that relies on the frequency at which amino-acids align to each other in 'true' alignments; specifically, we will set the target frequencies $q_{i,j}$ to be the observed frequency with which amino-acid i is aligned to amino-acid j in a true alignment. Note that this definition is intuitively correct – if the frequency of aligning the amino-acids is larger than expected by chance ($q_{i,j} > p_i p_j$), the corresponding score $s_{i,j}$ will be positive.
Note that 'aligned amino-acids in true alignments' is not a single well-defined concept: the alignments and amino-acid pairing frequencies depend on the evolutionary distance between the two proteins. Thus, the scoring system must be adapted to a specific degree of evolutionary divergence.
Deriving scoring matrices from alignments which are themselves constructed with the use of scoring matrices entails a certain level of circularity. This circularity was broken by Margaret Dayhoff (Dayhoff 1983) – she focused on proteins that are only 15% diverged. Due to the high similarity, the alignment can be computed correctly irrespective of the scoring function. She then modeled evolutionary change by estimating the frequency with which amino-acids mutate into each other. From these evolutionary changes she could extrapolate out to other evolutionary distances, leading to the PAM model (Point Accepted Mutation). A point mutation is a mutation at a single position; an accepted point mutation is one that survived evolution. The basic unit, the starting point for the PAM model, is 1 PAM – an evolutionary distance such that on average there is one accepted mutation per 100 sites.
PAM 100 corresponds roughly to proteins that are about 43% identical. The PAM matrix most consistent with the evolutionary distances of greatest interest to biologists at the time was PAM 250 (proteins roughly 20% identical). This matrix became widely used.
As an aside, what would PAM 0 look like? These are alignments without any substitutions, i.e., all aligned pairs are identical. The off-diagonal target frequencies are 0, and hence the corresponding scores are negative infinity. Along the main diagonal, the target frequency is exactly the background proportion of the specific amino acid, and the score is $s_{i,i} = \log \frac{q_{i,i}}{p_i p_i} = \log \frac{p_i}{p_i p_i} = -\log p_i$. Since the $p_i$ values are less than one, the diagonal values are all positive.
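A quick illustration of the PAM 0 diagonal, using made-up background frequencies (the real amino-acid statistics differ, though the ordering – tryptophan rare, leucine common – is realistic):

```python
import math

# Hypothetical background frequencies for three residues.
background = {"L": 0.10, "A": 0.08, "W": 0.01}

# PAM-0 diagonal: s_ii = log2(p_i / p_i^2) = -log2(p_i), in bits.
pam0_diag = {aa: -math.log2(p) for aa, p in background.items()}
```

Rarer residues score higher: an aligned pair of tryptophans is more surprising, hence more informative, than a pair of leucines.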
Henikoff and Henikoff (Henikoff and Henikoff 1992) sought to avoid the extrapolation to large PAM distances (certain phenomena, such as slowly evolving amino-acids, cannot be accurately modeled from short-term data) by looking directly at sequences at given levels of evolutionary divergence. Again, they ran into the circularity issue. To overcome it they relied on large multiple alignments of sequences at a certain level of divergence, and identified blocks of good alignments from which substitution rates could be estimated. Good multiple alignments provide higher confidence that the individual pairwise alignments are actually correct, even if the scoring function is not exactly accurate. This approach led to the creation of the BLOSUM matrices. BLOSUM-62 is the most widely used and is roughly equivalent to PAM-180.
One question that arises is whether we can estimate the average substitution score, i.e., the contribution of a single aligned pair of letters to the alignment score. From this we can estimate the score of an alignment based on its length. For this purpose, we introduce the concept of relative entropy. Suppose two sequences are diverged by 100 PAMs and we compare them with the matrix PAM-100 – the average score per position can be expressed as:
$$H = \sum_{i,j} q_{i,j} s_{i,j} = \sum_{i,j} q_{i,j} \log \frac{q_{i,j}}{p_i p_j}$$
This is the relative entropy. If using a base-2 logarithm, it is expressed in bits per position. Note that this is the average score of true alignments, and it can be shown to be positive. Replacing $q_{i,j}$ with $p_i p_j$ yields the average score of random alignments, which is negative.
The further the sequences diverge, the less information we get from the alignment. If we want our alignment to appear above the statistical noise, we can estimate the number of bits necessary (30-40 bits, depending on database size). We can then divide this number by the relative entropy to find the minimum alignment length needed before we can separate signal from noise.
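A minimal sketch of this calculation on a hypothetical two-letter alphabet (all frequencies below are made up for illustration): compute the relative entropy in bits per position, then the minimum length needed to clear a 35-bit threshold.

```python
import math

# Hypothetical background and target frequencies; q sums to 1 and its
# diagonal exceeds p_i * p_j, so matches carry positive score.
p = [0.5, 0.5]
q = [[0.35, 0.15],
     [0.15, 0.35]]

# Relative entropy H = sum q_ij * log2(q_ij / (p_i * p_j)), in bits
# per aligned position.
H = sum(q[i][j] * math.log2(q[i][j] / (p[i] * p[j]))
        for i in range(2) for j in range(2))

# Positions needed for a true alignment to reach ~35 bits of score.
min_length = math.ceil(35 / H)
```

With these toy frequencies each position contributes only about 0.12 bits, so an alignment would need to span roughly 300 positions to rise above the noise.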
Note that in general we do not know what divergence level to look for. What if we use the wrong matrix? We can define the efficiency of a matrix at a particular evolutionary distance (e.g., PAM 150 at 100 PAMs distance). The efficiency is the proportion of the score you obtain with the wrong matrix compared to the score that you would get with the correct matrix. The efficiency of the PAM y matrix at PAM distance x can be formally defined as
$$\frac{\sum_{i,j} q^{x}_{i,j} s^{y}_{i,j}}{\sum_{i,j} q^{x}_{i,j} s^{x}_{i,j}}$$
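This ratio of expected scores can be sketched on the same hypothetical two-letter alphabet, with two made-up sets of target frequencies standing in for two PAM distances:

```python
import math

p = [0.5, 0.5]
# Hypothetical target frequencies at the true distance x and at a
# different distance y (illustrative values only).
qx = [[0.35, 0.15], [0.15, 0.35]]
qy = [[0.30, 0.20], [0.20, 0.30]]

def log_odds(q):
    # s_ij = log2(q_ij / (p_i * p_j))
    return [[math.log2(q[i][j] / (p[i] * p[j])) for j in range(2)]
            for i in range(2)]

sx, sy = log_odds(qx), log_odds(qy)

# Efficiency: expected score at distance x under the wrong matrix s^y,
# relative to the expected score under the matched matrix s^x.
num = sum(qx[i][j] * sy[i][j] for i in range(2) for j in range(2))
den = sum(qx[i][j] * sx[i][j] for i in range(2) for j in range(2))
efficiency = num / den
```

The ratio is at most 1, with equality when the matrix matches the true divergence; here the mismatched matrix recovers only about three quarters of the available score.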
The concept of efficiency allows us to understand why we use the BLOSUM-62 matrix. Assume, for example, that we have sequences that are highly similar. The corresponding alignment would carry a lot of information irrespective of scoring function, thus the true alignment will be easily distinguished from noise. Conversely, poor alignments will almost always be missed, irrespective of scoring function. The precise use of the scoring function matters only for alignments that have just enough information to be differentiated from noise (~30-40 bits). This region, the 'twilight zone' of alignments, falls exactly in the BLOSUM-62 (or PAM-180) range for typical alignment lengths biologists are most interested in (~100 or so amino-acids), thus BLOSUM-62 is the matrix that leads to the least loss of information for such alignments.
All that we have discussed above can be extended to DNA alignments. We can come up with an evolutionary model for DNA (just like Dayhoff's model) and use it to construct PAM matrices for matches/mismatches, allowing us to perform alignments the same way as for proteins.
Note that we can compare DNA sequences at the DNA level or at the protein level (the translated search – simply translate the DNA into an amino-acid sequence and compare the amino-acid sequences). Which approach is better?
We can estimate the information in DNA-level and protein-level alignments. At close evolutionary distances (highly similar sequences), DNA alignments carry more information. This can be easily seen from the fact that an exact alignment of two codons (3 base-pairs) contains 6 bits of information as opposed to just about 4 for a pair of amino-acids. At longer evolutionary distances (> 50 PAMs), however, we get more information in the protein sequence.
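The bit counts cited above come straight from alphabet sizes: an exactly matching DNA base carries $\log_2 4 = 2$ bits, so a codon carries 6, while an identical amino-acid pair carries about $\log_2 20 \approx 4.3$. A one-line check:

```python
import math

# Information in an exact match: 2 bits per DNA base (4 letters), so
# 6 bits per codon, versus log2(20) bits per amino acid (20 letters).
bits_per_codon = 3 * math.log2(4)
bits_per_aa = math.log2(20)
```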
Note that the observations above apply to naïve alignments of DNA sequences and more complex alignment strategies can recover more information. There are, however, computational tradeoffs to be taken into account. For example, translated DNA searches multiply the runtime by a factor of 6 (three reading frames on the 'forward' strand, and three on the 'reverse' strand). Similarly, 'threading' – an alignment for protein sequences that takes into account structural information in addition to sequence information – is more accurate biologically, but is NP-hard, i.e., finding the optimal alignment requires sifting through an exponential number of possible alignments.
Notes
Many of the ideas described above appeared in (Karlin and Altschul 1990) and (Altschul 1991). The BLOSUM matrices, first described in (Henikoff and Henikoff 1992), turn out to have been incorrectly computed in practice, yet the original implementation is more effective at sorting through the 'twilight zone' than the correctly computed BLOSUM-62 matrix (Styczynski, Jensen et al. 2008). The comparison between the information in codon alignments and the corresponding protein alignments was discussed in (States, Gish et al. 1991).
References
Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. J Mol Biol 219(3): 555-565. http://www.ncbi.nlm.nih.gov/pubmed/2051488
Dayhoff, M. O., W. C. Barker and L. T. Hunt (1983). Establishing homologies in protein sequences. Methods Enzymol 91: 524-545. doi: 10.1016/s0076-6879(83)91049-2
Henikoff, S. and J. G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89(22): 10915-10919. doi: 10.1073/pnas.89.22.10915
Karlin, S. and S. F. Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 87(6): 2264-2268. http://www.ncbi.nlm.nih.gov/pubmed/2315319
States, D. J., W. Gish and S. F. Altschul (1991). Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods 3(1): 66-70. http://www.sciencedirect.com/science/article/pii/S1046202305801653
Styczynski, M. P., K. L. Jensen, I. Rigoutsos and G. Stephanopoulos (2008). BLOSUM62 miscalculations improve search performance. Nat Biotechnol 26(3): 274-275. http://www.ncbi.nlm.nih.gov/pubmed/18327232