Sequence alignment heuristics
Introduction
In the general case, if our goal is to find the best scoring alignment between two strings, irrespective of how good this alignment is, we cannot beat the quadratic runtime of the basic algorithm. For most applications this runtime is simply prohibitive – imagine, for example, aligning several million 100bp sequenced reads to the entire 3 billion base-pairs in the human genome, or aligning one entire 3 Mbp bacterial genome against a collection of 10,000 other such genomes. To efficiently perform such alignment tasks we must relax our goals, for example by focusing on just the highest quality alignments, or even by giving up on finding the best scoring alignment. In this chapter we will discuss some such heuristic approaches. Note, however, that such heuristics must be used with care, and only when the biological application is compatible with the relaxed assumptions of the heuristic approach. Thus, before you propose a 'smart' new heuristic that outperforms other alignment approaches, try to fully understand the needs of the biologist users.
Exclusion methods
Exclusion methods attempt to restrict the search space to just the regions where good alignments are likely to be found. Assume, for example, that we are interested in finding an alignment with at most k edits of a pattern P of length m against a text of length n. Baeza-Yates and Perleberg (Baeza-Yates 1996) observed that if we break P into k+1 equally-sized blocks of size m/(k+1), at least one of these blocks contains no edits, i.e., the block matches perfectly against the text. As any good alignment satisfies this property, it suffices to focus on the parts of the text that contain exact matches with the pattern of length at least m/(k+1). Once exact match "anchors" are found, the rest of the pattern can be aligned to the text in O(m²) rather than O(mn) time, since the regions of the text where the pattern may match are only O(m) in length.
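The partition filter can be sketched in a few lines. The sketch below is a minimal illustration (the function name and the simple str.find scan are mine, not from the original paper; real implementations use the multi-pattern search techniques discussed next):

```python
def candidate_regions(pattern, text, k):
    """Baeza-Yates/Perleberg-style filter: return text windows that could
    contain a match of `pattern` with at most k edits.

    The pattern is split into k+1 blocks; any approximate match must contain
    at least one block exactly, so every exact block hit seeds a window of
    length m + 2k around its implied pattern start.
    """
    m = len(pattern)
    block = m // (k + 1)  # leftover characters beyond the k+1 blocks are ignored
    regions = []
    for i in range(k + 1):
        piece = pattern[i * block:(i + 1) * block]
        start = text.find(piece)
        while start != -1:
            # back out the implied start of the full pattern, padded by k edits
            lo = max(0, start - i * block - k)
            regions.append((lo, min(len(text), lo + m + 2 * k)))
            start = text.find(piece, start + 1)
    return regions

print(candidate_regions("abcdef", "xxabcdefxx", 1))  # → [(1, 9), (1, 9)]
```

Each returned window is then verified with the full dynamic programming algorithm, which is where the O(m²)-per-candidate cost comes from.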
To find the locations where the pattern has an exact substring match to the text one can use any of the exact string matching algorithms we discussed earlier. Specifically:
The pattern can be split into k+1 equally-sized blocks, which are then searched against the text using the Aho-Corasick algorithm;
The longest common substrings between the pattern and the text can be found (as well as all matching substrings of length at least m/(k+1) ) using a suffix tree constructed on the text;
An index of all strings of length m/(k+1) can be constructed from the text allowing us to quickly find all segments that are found in both the pattern and the text.
The latter option warrants further discussion. Substrings of length k of a string are typically called k-mers. The type of index that stores k-mers together with their locations in a text is called an inverted index, as it maps content (k-mers) to locations. For each k-mer, the inverted index lists all positions in the text where that k-mer occurs. Such indices can be quite efficient, as they can be organized as a hash table with effectively constant access time per k-mer. The efficiency, however, depends on the size of the k-mer (the shorter the k-mer, the more efficiently the index can be stored) and on the level of repetitiveness of the genome (the more times a k-mer occurs, the less efficient the index).
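A minimal inverted k-mer index is easy to sketch with a hash table (a toy illustration; the function name is mine):

```python
from collections import defaultdict

def build_kmer_index(text, k):
    """Inverted index: maps each k-mer to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index

index = build_kmer_index("ACGTACGT", 3)
print(index["ACG"])  # → [0, 4]
```

To find shared segments, one simply looks up every k-mer of the pattern in the index; each hit is a candidate anchor to be verified by alignment.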
Case study: BLAST
The BLAST (Altschul 1990) program is one of the most widely used and heavily cited programs in bioinformatics. It is 'simply' a database search program that searches individual sequences against a large database. To identify putative alignments and their scores, BLAST constructs a k-mer index for the query, i.e. it stores all k-mers of a certain length (k=3 for proteins, k=11 for DNA by default) and their location in the query. For each k-mer BLAST also records the 1-off neighbors of the k-mer (e.g., in addition to ACA it would also store TCA, GCA, CCA, as well as ACT, ..., etc.). The algorithm then attempts to find exact k-mer matches between the database and the query, and performs a local dynamic programming alignment around each of the matches found to identify high scoring segment pairs (HSPs), which are then evaluated statistically. Note that the exact matches are scored according to the selected scoring matrix, and only k-mer hits of a sufficiently high score are considered.
The local alignment procedure starting from a k-mer match continues on either side of the hit and stops once the score drops significantly.
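This stopping rule is commonly implemented as an 'X-drop' criterion: the extension halts once the running score falls more than some threshold X below the best score seen so far. The sketch below illustrates the idea for rightward, ungapped extension only (the function name, score values, and threshold are mine, not BLAST's actual defaults):

```python
def extend_right(query, target, qpos, tpos, match=1, mismatch=-3, xdrop=10):
    """Ungapped extension to the right of a seed hit at (qpos, tpos).

    Returns the best score seen and the extension length achieving it;
    extension stops once the score drops more than `xdrop` below the best.
    """
    score = best = best_len = length = 0
    while qpos + length < len(query) and tpos + length < len(target):
        score += match if query[qpos + length] == target[tpos + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        if best - score > xdrop:  # the X-drop cutoff
            break
    return best, best_len
```

A full implementation would extend in both directions from the hit; the highest-scoring extensions become the HSPs that are evaluated statistically.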
Note that BLAST is not guaranteed to find the best match between a query and the database – assume, for example, that it compares two versions of a gene's DNA sequence, and these versions differ in the third position of their codons while still encoding the same protein. In these sequences one will not find any exact match longer than 2 bp, and the biologically correct alignment will be missed by BLAST (which uses k=11 for DNA sequences), even though it tries to account for mismatches in its search. At the same time, using a smaller value of k, e.g., k=5 (after accounting for one mismatch we should be able to find matching k-mers in this case), might lead to too many exact matches, requiring a prohibitive amount of time to compute all alignments and find the best one.
This trade-off between efficiency and sensitivity (ability to find the correct alignment) is key to most heuristic choices in alignment.
BLAST involves several tradeoffs. First, the more word hits, the slower the algorithm (runtime is roughly proportional to the number of words that match between the query and the database). The expected number of word hits depends on word length and score threshold, and an intuition of this dependence can be obtained from an analysis of the statistics of alignment scores. At the same time, increasing word length or score threshold increases the chance that an alignment may be missed.
Note that the original BLAST used ungapped extensions of the exact matches – something that can be done in linear time. To allow for gapped alignments, BLAST uses a second score threshold which selects ungapped extensions that are 'promising' – once an ungapped alignment of a high enough score is found, the seed hit is extended with a gapped alignment procedure. This threshold is selected so that, on average, only 1 in 50 hits would trigger the gapped extension procedure. Furthermore, the interplay between the gapped and ungapped analyses creates opportunities for errors or misunderstandings of the BLAST output (Shah 2019).
Another heuristic used by BLAST requires two hits to exist on the same diagonal (rather than one longer hit) before an extension is performed. This heuristic is similar to the one used by the FASTA program (Pearson and Lipman 1988).
Expected number of anchors
Exclusion methods, as described above, are perhaps unnecessarily conservative. By choosing the length of an exact anchor to be ⌊m/(e+1)⌋ we can ensure that no match of a pattern of length m with at most e mismatches (errors) will be missed. However, for small values of m or large values of e, the exact matches we look for may yield too many putative matches thereby offsetting the performance gains of the approach. The worst case scenario implied by this approach – that the errors are equidistant along the pattern – may not be that common. Instead, we could look for longer anchors in the hopes that with high enough probability a good alignment would contain a long enough error-free stretch. We can estimate this probability more formally as follows.
Just as before we will assume the length of the pattern is m. We will also define λ to be the error rate in the alignment we are looking for. The expected number of errors within the pattern is e=mλ. Given these parameters, the expected number of error-free anchors of length k within the pattern is (m−k+1)(1−λ)^k. This formula simply multiplies the number of possible anchor positions by the probability that a stretch of k consecutive positions contains no error.
A similar analysis can be applied to estimate the number of error-free windows of length k or longer within the pattern. Every such window either starts at the beginning of the pattern (before the first error) or immediately follows an error; since the expected number of errors within the pattern is e, there are 1+e candidate starting points. The probability that a given starting point is followed by at least k error-free positions is (1−λ)^k. Putting these together, the expected number of error-free windows of length at least k is approximately (1−λ)^k + e(1−λ)^k = (1+e)(1−λ)^k.
To see the discrepancy between the worst case and average case scenarios, let m=100 and λ=0.04. The worst case analysis suggests we should look for anchors of length 20, assuming the four expected errors in the pattern are equally spaced. This is clearly overly conservative, as on average we expect to see about 36 anchors of length 20 at this error rate. In fact, even for an anchor of length 50 (half a read) we expect that, on average, more than 6 such anchors could be found within a pattern.
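These expectations follow directly from the anchor formula above; a quick numeric check (the function name is mine):

```python
def expected_anchors(m, lam, k):
    """Expected number of error-free anchors of length k in a pattern of
    length m with per-position error rate lam: (m - k + 1) * (1 - lam)^k."""
    return (m - k + 1) * (1 - lam) ** k

print(round(expected_anchors(100, 0.04, 20), 1))  # → 35.8
print(round(expected_anchors(100, 0.04, 50), 1))  # → 6.6
```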
Analyses such as those outlined above can be extended to multiple patterns. A DNA sequencing experiment usually generates tens of millions of sequences. Depending on the question being asked, it may not be necessary for all the reads to be aligned, and we could use a more aggressive choice for the length of the exact matches we are looking for. Of course, as we design such a strategy, we want to check that our choice of anchor length does not eliminate too many of the sequences. To do so, we want to estimate the expected number of sequences of length L that contain at least m exact matches of length k, assuming we are starting with N sequences.
We first need to estimate the probability that a sequence contains at least m matches of length at least k. As we discussed above, each match starts with an error (probability λ) followed by at least k error-free positions (probability (1−λ)^k). As we are looking for m such patterns, the probability becomes λ^m(1−λ)^(km). Finally, we need to take into account the number of ways in which the base pairs not included in these exact matches are distributed within the sequence. There are L−mk such characters distributed among m+1 buckets (the spaces between and around the exact matches), for a total of C(L−m(k−1), m) configurations (where C(n, m) denotes the binomial coefficient), leading to the per-sequence probability λ^m(1−λ)^(mk)C(L−m(k−1), m). By linearity of expectation, the expected number of sequences with at least m such matches is Nλ^m(1−λ)^(mk)C(L−m(k−1), m).
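This estimate is easy to evaluate numerically; a sketch with illustrative parameter values (the function name is mine):

```python
from math import comb

def expected_reads_with_anchors(N, L, lam, k, m):
    """Expected number of reads (out of N, each of length L) containing at
    least m error-free matches of length >= k, per the approximation above:
    N * lam^m * (1 - lam)^(m*k) * C(L - m*(k - 1), m)."""
    return N * lam ** m * (1 - lam) ** (m * k) * comb(L - m * (k - 1), m)

# e.g., 10 reads of length 100, 4% error rate, two anchors of length >= 20
print(expected_reads_with_anchors(10, 100, 0.04, 20, 2))  # ≈ 5.9
```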
Spaced seeds
Earlier we gave an example of a situation where k-mer filtering may fail to capture real biological relationships: the alignment of two versions of a gene that differ in every third position of each codon while still encoding the same protein. As an example, the following genes:
g1: AATCAAGGUAAATTA
g2: AACCAGGGAAAGTTG
both encode the amino-acid string: NQGKL.
Clearly no 11-mer matches exactly, or even with one mismatch, between the two versions of the gene, and this alignment would be missed by a tool like BLAST that relies on k-mer filtering prior to performing the alignment. We can, however, find the alignment using a special type of seed called a spaced seed, where we allow certain positions to mismatch. Specifically, we will represent the seed as a string of 1s and 0s, and only require that the characters aligned to a 1 match each other.
We will compare two seeds of length (also called width) 11: the seed 11111111111 (the original 11-mer) and the seed 11011011011 (an 11-mer that allows 'wobble' basepairs at every third location). Let us see how they do in the example above.
The seed 11111111111 requires the first 11 characters of the two genes (AATCAAGGUAA and AACCAGGGAAA, respectively) to match each other, thus would reject the alignment. The spaced seed treats every third character in the two strings as a 'don't care' symbol, leading to a comparison between the strings AA*CA*GG*AA from g1 and AA*CA*GG*AA from g2, strings which match, thus prompting the aligner to verify the alignment using the dynamic programming algorithm.
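The comparison can be spelled out in a few lines (the helper name is mine; a 1 marks a position that must match, a 0 a 'don't care' position):

```python
def seed_agrees(seed, s1, s2):
    """True if s1 and s2 agree at every position where the seed has a 1."""
    return all(a == b for s, a, b in zip(seed, s1, s2) if s == "1")

g1 = "AATCAAGGUAAATTA"
g2 = "AACCAGGGAAAGTTG"
print(seed_agrees("11111111111", g1[:11], g2[:11]))  # → False
print(seed_agrees("11011011011", g1[:11], g2[:11]))  # → True
```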
In other words, the spaced seed appears to be better – it has the same width yet finds a match missed by the contiguous seed, and also only requires us to look at 8 characters (the weight of the seed) instead of 11.
We can formalize this property by defining the sensitivity of a seed to be the probability that at least one seed matches between two aligned strings. The sensitivity is specific to the similarity between the strings, defined by the fraction of identities in the alignment. Note that we can think of the alignment as simply a series of 1s and 0s, where 1s indicate that the corresponding characters match perfectly (we ignore gaps here). If the rate of identities is p, we expect 1s to occur in the alignment with probability p.
We can build some intuition for why spaced seeds are better by computing the expected number of seed matches between the two strings. Let L be the length of the alignment, W the width of the seed, and p the identity rate (66% in the case above, where 1 in 3 letters mismatches). There are L−W+1 possible placements of the seed in the alignment. For the exact seed, the probability that all 11 positions match identical characters in both strings is simply p^11 = 0.66^11 ≈ 0.0104. For the spaced seed we only need 8 positions to match, leading to p^8 = 0.66^8 ≈ 0.0360, i.e., about 3.5 times higher than for the exact seed. Given that L−W+1 is the same for both seeds, the spaced seed yields more expected hits in the same alignment range and thus has a higher likelihood of finding the correct alignment.
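The arithmetic behind this comparison:

```python
# Per-placement hit probabilities for the two seeds at identity rate p = 0.66.
p = 0.66
hit_contiguous = p ** 11  # contiguous seed of weight 11, ≈ 0.0104
hit_spaced = p ** 8       # spaced seed of weight 8,      ≈ 0.0360
print(round(hit_spaced / hit_contiguous, 1))  # → 3.5
```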
Another way to look at this is from the point of view of the information provided by overlapping seeds. For contiguous seeds, a shift by one position changes just two letters (the ones entering and leaving the seed), i.e., the new position of the seed provides little new knowledge. For spaced seeds, every block of 0s creates another 'start and end' where new letters will be aligned, yielding more information (more chances of finding an alignment).
To formally compute the sensitivity of a seed, we encode possible alignments as strings of 0s and 1s representing the position of the exact matches and mismatches. We define the sensitivity of a seed as the probability that the seed is compatible with an alignment, i.e., that the seed (represented as a string of 0s and 1s) aligns (all 1s in the seed must be aligned to 1s in the alignment) to a substring within the corresponding string representing the alignment. To compute this probability we will rely on dynamic programming, as follows.
First, we define f(i,b) to be the probability that the seed matches a prefix of length i of the alignment that ends in the binary string b (of length W, the width of the seed). The probability of the seed matching an alignment of length L is then simply sensitivity(seed) = Σ_b f(L,b)·p(b), where the sum is over all binary strings b of length W and p(b) is the probability of observing b. The latter can be computed as we have done above, based on the number of 1s in b and the probability p of any given character matching: p(b) = p^(#1s in b)·(1−p)^(#0s in b).
In other words, the dynamic programming table f(i,b) focuses on some prefix of possible alignments, and checks whether the seed matches the suffix of this prefix. If the seed matches b (every 1 in the seed is aligned to a 1 in b), then f(i,b)=1, as clearly the seed matches all alignments that end in b. If the seed does not match b, we can use the recurrence equation:
f(i,b)=f(i−1,0b′)(1−p)+f(i−1,1b′)p
where b′ is the string b without its last bit. Essentially, we are shifting the seed along the possible alignment strings and checking again whether we have a match between the seed and the alignment. The values for f(W,b) (the base case for the dynamic programming) are trivially set to 1 if the seed matches b and to 0 otherwise.
Note that the dynamic programming approach outlined above essentially enumerates all possible alignments of length L and computes the proportion of these alignments that contain our seed. The runtime of this algorithm is exponential in the size of the alignment: O(2^W·2^(L−W)) = O(2^L). The first term corresponds to the enumeration of all words of length W, while the second one enumerates all the extensions of these words to the total length of the alignment.
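The dynamic program can be written down compactly. The sketch below (variable names are mine) follows the recurrence literally and cross-checks it against brute-force enumeration of all alignments; note that memoizing over the 2^W possible windows, as done here, brings the table computation itself down to O(2^W·L) updates.

```python
from itertools import product

def matches(seed, window):
    """Seed compatibility: every 1 in the seed aligns to a 1 in the window."""
    return all(w == "1" for s, w in zip(seed, window) if s == "1")

def sensitivity(seed, L, p):
    """Probability that the seed matches a random alignment of length L in
    which each position is a 1 (match) independently with probability p."""
    W = len(seed)
    windows = ["".join(bits) for bits in product("01", repeat=W)]
    # base case f(W, b): 1 if the seed matches b, else 0
    f = {b: 1.0 if matches(seed, b) else 0.0 for b in windows}
    for _ in range(W + 1, L + 1):
        # recurrence: f(i,b) = 1 if seed matches b,
        # else (1-p)*f(i-1, 0b') + p*f(i-1, 1b')
        f = {b: 1.0 if matches(seed, b)
             else (1 - p) * f["0" + b[:-1]] + p * f["1" + b[:-1]]
             for b in windows}
    # weight each ending b by its probability p(b)
    return sum(f[b] * p ** b.count("1") * (1 - p) ** b.count("0")
               for b in windows)

def sensitivity_brute(seed, L, p):
    """Direct enumeration of all 2^L alignments, for validation."""
    W = len(seed)
    total = 0.0
    for bits in product("01", repeat=L):
        a = "".join(bits)
        if any(matches(seed, a[i:i + W]) for i in range(L - W + 1)):
            total += p ** a.count("1") * (1 - p) ** a.count("0")
    return total
```

For small seeds and alignment lengths, the two functions agree exactly (up to floating-point rounding), which is a useful sanity check when experimenting with seed designs.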
Exercise: To get a handle on this algorithm, try running it for the seed 1101, alignments of length 8, and p=0.5 (a 50% chance of a match, which makes all alignments equally likely). You should find that f(8,1011)=0.625, because 10 of the 16 possible alignments ending in 1011 are compatible with the seed.
Using multiple seeds
Any specific spaced seed is unlikely to be fully sensitive (i.e., to guarantee that every good alignment contains a matching seed), with the exception of trivial corner cases (e.g., exact matches must contain other exact matches). As a result, algorithms that rely on spaced seeds use multiple such seeds in order to increase the chances of finding an alignment. What is important in this case is that the seeds be independent of each other, such that together they cover a large fraction of the space of possible alignments. Examining the example above should provide you with the intuition that independent seeds do not overlap each other well, i.e., as we shift the window b during the execution of the algorithm, the windows that match one seed occur in different parts of the table than those matching another. While computing the sensitivity of a seed is highly expensive (as we have seen above), this simple observation can lead to effective and efficient algorithms for constructing seed sets – see (Ilie and Ilie 2009).
Historical note
The concept of spaced seeds is closely related to that of locality-sensitive hashing. In general terms, a locality-sensitive hash function is 'compatible' with some definition of distance between objects: the closer two objects are (e.g., the more similar two sequences are), the more likely it is that the hash function places them in the same bucket. One can easily see that selecting a set of positions from a sequence (the 1s in the example above) and mapping the sequence to the bucket corresponding to the base pairs thus selected is a hash function, and that the more similar two sequences are, the more likely they are to agree over some subset of their positions. Such ideas were initially introduced by Buhler (Buhler 2001) before the concept of spaced seeds was developed in the PatternHunter program by Ma, Tromp, and Li (Ma, Tromp et al. 2002). The statistical aspects of spaced seeds were studied by Keich et al. (Keich, Li et al. 2004).
Inexact seeds
A concept similar to spaced seeds was introduced by Ghodsi and Pop (Ghodsi and Pop 2009) and independently by Kärkkäinen and Na (Kärkkäinen and Na 2007). The main limitation of spaced seeds is that they do not tolerate insertions and deletions; their sensitivity is thus inherently limited to finding alignments where most differences are due to substitutions. As we have seen so far, handling insertions and deletions can be quite expensive, while exact matches can miss too many alignments, or lead to too many candidates that need to be explored. A trade-off between these two extremes can be obtained based on the following simple lemma:
Lemma: For any l<L, an alignment of length L with k edits must contain a sub-alignment of length l with at most ⌊kl/L⌋ mismatches.
At the extreme, this lemma is equivalent to the Baeza-Yates observation that any alignment with k edits contains an exact match of length l=L/(k+1) (for this choice of l, ⌊kl/L⌋=0). The lemma is easy to prove by observing that a path through the dynamic programming table representing the alignment cannot exceed ⌊kl/L⌋ mismatches along every one of its subpaths of length l.
References
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. doi: 10.1016/S0022-2836(05)80360-2.
Baeza-Yates, RA, Perleberg, CH, Fast and practical approximate string matching, Information Processing Letters, 59(1), pp. 21-27, 1996, https://doi.org/10.1016/0020-0190(96)00083-X
Buhler, J. (2001). Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 17(5): 419-428. http://www.ncbi.nlm.nih.gov/pubmed/11331236
Ghodsi, M. and M. Pop (2009). Inexact Local Alignment Search over Suffix Arrays. IEEE International Conference on Bioinformatics and Biomedicine (BIBM). X.-W. Chen and S. Kim. Washington, DC, IEEE: 83-87. http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/21278916/
Ilie, L. and S. Ilie (2009). Fast computation of neighbor seeds. Bioinformatics 25(6): 822-823. http://www.ncbi.nlm.nih.gov/pubmed/19176560
Kärkkäinen, J. and J. C. Na (2007). Faster Filters for Approximate String Matching. Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX): 84-90.
Keich, U., M. Li, B. Ma and J. Tromp (2004). On spaced seeds for similarity search. Discrete Applied Mathematics 138(3): 253-263. http://www.sciencedirect.com/science/article/pii/S0166218X03003822
Ma, B., J. Tromp and M. Li (2002). PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3): 440-445. http://www.ncbi.nlm.nih.gov/pubmed/11934743
Pearson, W. R. and D. J. Lipman (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85: 2444-2448.
Shah N, Nute MG, Warnow T, Pop M. Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics. 2019 May 1;35(9):1613-1614. doi: 10.1093/bioinformatics/bty833. PMID: 30247621.