K-mers in genome assembly
When running a modern genome assembler you are frequently required to select a k-mer size (or range of k-mer sizes) that is most appropriate for your assembly problem. Choosing the value of this parameter is the subject of vigorous debate among practitioners, in no small part due to the fact that the principles underlying the use of k-mer-based approaches in assembly are poorly understood. Here I attempt to clarify some of these principles.
Let's start with the beginning. Assembly is necessary because the sequences "read" by a sequencing machine are commonly much shorter than the segments of DNA being sequenced. Assembly is the process that takes a shredded version of the DNA and "glues" together the shreds into a reconstruction of the original DNA (see genome assembly primer for more details, and try your hand at "assembling" text by going here). The assembly process is simply impossible if the fragments of DNA do not overlap each other as seen in the figure below.
To ensure the fragments overlap, it is necessary to sequence the DNA multiple times over, leading to each base from the original DNA being present, on average, in c different reads. The variable c is called coverage and usually has values larger than ~5 (i.e., one sequences about 5 times more DNA than is contained in the sequence being reconstructed).
Once the fragments overlap, we have a chance to put together the original genome, as shown below.
Before we proceed, it is important to stress that we will focus on two types of "overlaps" between DNA fragments or sequencing reads. The "true" overlaps are those that actually represent the location in the genome from which the fragments or reads were derived. This is the information that a genome/sequence assembler is trying to infer. The "inferred" overlaps are the implicit or explicit relationships between reads that are inferred by the assembler from the sequence of the reads. The difficulty of sequence assembly derives from the difference between the inferred set of overlaps and the true ones.
Making sure that the fragments overlap is necessary but not sufficient for assembly. The assembler must be able to identify these overlaps, and do so efficiently.
Forgetting about efficiency for a moment, let us focus on what is necessary to find overlaps. Assume that, on average, two sequences overlap by exactly one character. This information is clearly useless for assembly, as every sequence will share its last character with the first or last character of roughly half of the rest of the sequences. Even deep learning cannot untangle the resulting mess. In other words, for assembly to be possible, the overlap between two sequences must be long enough to ensure it can be distinguished from overlaps that occur simply by chance.
We are only focusing at this point on the true overlaps between reads assuming that we know exactly where the reads fit in the genome. We'll revisit this point a bit further down. The length of these true overlaps is determined by the depth of coverage. At 1X (or 1-fold) coverage (i.e., if we sequence as much DNA within reads as there is in the sequence being reconstructed) we would expect the reads to not overlap at all, rather line up end to end along the sequence and we wouldn't be able to reconstruct the genome. In reality, the randomness of the sequencing process leads to some reads overlapping by quite a bit, with large gaps occurring between others. This phenomenon can be analyzed through Poisson statistics, and was detailed by Eric Lander and Mike Waterman in Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics. 2(3):231-239.
To get an intuitive idea of the relationship between depth of coverage and the expected overlap between reads, here are two simple diagrams. In the first, the coverage is relatively low, somewhere between 1 and 2.
As you can see, most of the overlaps between the reads (the gray boxes) are small. In the diagram below, we increase the depth a little bit, to just about 3.
At this point, there are many more long overlaps between the reads. As coverage further increases, so does the length of the true overlaps between reads, which increases the chances that a sequence assembler will be able to detect them.
Let's consider what happens when the sequencing technology is not perfect. Assume we are looking at an overlap of exactly 4 basepairs and try to decide whether it is "real", i.e., distinguish it from possible random matches. There are exactly
different DNA strings of length 4, and only one of them is the string we are interested in. The chance that a completely random sequence will inadvertently match the selected string is, thus, only 1 in 256. Assume now that one error is possible within this same 4-letter long segment. More precisely, when looking for an exact overlap between two sequences that contain a given string of length 4, we want to allow for the possibility that one letter in one sequence is incorrect. To account for this error, we only require that 3 out of the 4 letters match within the string. Now the chance that a random string would match our selected one has increased to 1 in 16. There are four possible locations in our selected string where an error could occur, and for each location, there are
possible strings of length 3, hence we have 4 chances in 64 to find a random string matching our selected one with exactly one mismatch.
Given that a typical sequencing experiment entails millions of sequences, many unrelated sequence will, thus, be assumed to be overlapping, confusing the assembly process. To counter this effect, sequence assemblers require longer overlaps, which, in turn, require more depth of coverage (as discussed above) to ensure that the reads do, indeed, overlap.
The need to tolerate errors also has implications for the efficiency of algorithms used to detect overlaps. When only looking for exact overlaps, we can very efficiently find them using a number of indexing strategies (e.g., suffix trees, suffix arrays, Burrows-Wheeler Transform, etc.). These algorithms run in linear or sub-linear time, meaning that the runtime is proportional to the number of bases in the set of reads. Once errors become part of the equation, the algorithms used to find the overlaps become much more expensive. In the worst case, these algorithms run in quadratic time (proportional to the square of the number of bases in the set of sequences). In the best case, the runtime is amplified by the error rate that we wish to tolerate.
To briefly summarize the discussion above: In order to assemble sequences, we need to determine how they overlap each other, even in the presence of errors. Doing so, using "traditional" algorithms, entails a significant computational cost.
An alternative, that has been proposed from the early days of sequence assembly research, is to rely on exact matches as much as possible. Specifically, instead of computing the actual overlap between two sequencing reads, we can simply check how many strings of length k (also called k-mers) are shared by the two reads. If the reads overlap each other perfectly, then all k-mers in the overlapping region are shared, and, at a first approximation, the fraction of k-mers shared is roughly proportional to the level of similarity between the two strings.
It is important to note at this point that most genome assemblers today rely on a de Bruijn graph approach that directly models the k-mers in the data, largely ignoring the reads themselves. That is, modern assemblers don't actually try to figure out if two reads overlap. Nonetheless, the overlap between reads is implied within the assembly graph, and thinking in terms of overlaps clarifies the k-mer concept for both overlap-based and de Bruijn graph-based approaches.
So, let us focus for now on just one such overlap between two reads:
How large should the value of k be so that we can find this overlap of length o? If we don't expect any errors, we can select k = o. Since we are not simply focusing on a pair of reads but want to detect enough overlaps between enough reads so that we can reconstruct the genome, we need to select k depending on the depth of coverage and the expected length of the overlap between reads at that particular depth of coverage. You can work out this number using simple statistics, or just experiment with a few values and see if the assembly works.
In reality, though, sequencing errors exist. If you selected k = o and there is an error in one of the reads, the overlap will no longer be detectable since the corresponding k-mers are different:
To account for errors, we, thus, have to select a smaller value of k, proportional to the expected "gap" between errors. For a given probability of error
, the expected gap is roughly
, so you want to select k such that it is smaller than both this gap and the size of the expected overlap o.
In short, to make sure you find the true overlaps between the reads, you want to select a fairly small value of k so that you can tolerate errors as well as regions with relatively low coverage.
Aside: figuring out the value of k based on the size of the expected overlap based on coverage and the expected gap between errors is a bit more complicated than I describe above as you have to take into account the full distribution of overlap and gap sizes to ensure enough overlaps are detected by the assembler. You can do so analytically, but it's more effective to experiment with multiple values of k, starting with an initial guess based on simple calculations like those described above. The empirical results will implicitly account for other types of experimental artifacts that are difficult to model mathematically.
So far, we have focused on whether we can detect true overlaps between reads, essentially only focusing on addressing false negatives--the situations when a true overlap is missed because we picked the wrong value of k. How about false positives--detecting an overlap between reads that does not actually exist?
Such false overlaps are possible because of genomic repeats--segments of DNA that occur multiple times in a genome in an identical or near-identical form. A simple example is the word "the" from the phrase "the quick brown fox jumps over the lazy dog". Because this word is repeated, the fragments "the quick brown", "the lazy dog", and "jumps over the" all overlap each other, even though only "jumps over the" and "the lazy dog" should.
Repeats are, actually, one of the major challenges in sequence assembly since they induce ambiguity in the sequence reconstruction. The relationship between the length of overlaps and the length of repeats determines whether the assembly problem is easy, hard, or impossible. (see Nagarajan and Pop. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol. 2009 Jul;16(7):897-908). Once the overlaps between reads are longer than the repeats in the genome, the problem becomes easier as seen below.
The 2-word overlaps make it easy to overcome the complexity introduce by the single word repeat "the".
This is one of the reasons why long read technologies have been such a game-changer in genome assembly--the longer reads allow longer overlaps to be considered by the assembler, which means that fewer repeats are a problem for the assembly process.
Ok, so what do repeats have to do with k-mers? In the figure below I have a presumed overlap between two reads (A and B) that happens to located in a repetitive region R. This repeat is also found in the genome at a location where read C is located. If the k-mer size is smaller than the size of R, the k-mer (shown as a gray box) will match both between A and B, but also between A and C, and B and C.
The result is that the assembler may, thus, infer incorrect information, which then makes it harder to correctly assemble the genome. Note that, in this case, the repeat is shorter than length of the overlap between the reads, which I claimed earlier made the problem easier. If, however, the assembly algorithm does not attempt to compute the full alignment between the reads and just relies on the shared k-mers (as is the case for de Bruijn graph assemblers), the inferred overlaps will contain ambiguity that is not actually present in the data.
As a result, we want to use a longer k-mer:
Now the k-mer that is shared by reads A and B is no longer the same as the k-mer in read C, and only the true overlap is inferred by the assembler.
So, let's quickly summarize. We have figured out that in order to make sure we reduce false negatives (i.e., find most true overlaps), we need to use small k-mers, particularly if coverage is low or sequencing error is high. However, in order to reduce false positives due to repeats, we need to use large k-mers.
There is no good answer to how to resolve this trade-off. The most successful assemblers don't even try, rather use multiple k-mer sizes hoping that their complementary strengths will lead to the best of both worlds, reducing both false positives and false negatives simultaneously.
Most non-assembly analytical tools that use k-mers to analyze read sets, however, typically use a single k-mer size. In this case, carefully evaluating the trade-off between false positives and false negatives is important even if an assembly is never explicitly computed. The implicit overlaps between reads, even if not explicitly calculated by the tool, impact the estimation of abundances, thus the wrong choice for k may have negative downstream effects.
So far, the presentation has assumed that we are looking at sequencing reads from one genome. What if the reads come from many different genomes (which is the case in metagenomics applications), or even just the different haplotypes of a eukaryotic genome?
All the considerations described above still apply, with some caveats. Depending on what you are looking for, the genomes themselves can be viewed as repeats, for example if you are trying to figure out the different strains of E. coli found in a stool sample, or to characterize the parental alleles of a eukaryotic genome. The size of the "repeats" handled by the assembler is, thus, substantially larger than for single genomes/haplotypes, making assembly much harder. If you don't believe me, try to see how many "haplotype-resolved" genomes you can find in public databases.
The "error" rate that you must tolerate is also different in mixed samples, as you must now contend not just with experimental errors in the sequencing process, but also with strain variation within the sample. These two factors induce differences between reads that must be tolerated by the assembler and that are not easy to tell apart from each other.
Thus, for metagenomic or polyploid samples, you may have to use lower values for k in your analyses than you would when looking at single genomes. Better yet, perform multiple analyses with different values of k and compare the results.
I hope that this page gives you a better idea for how k-mers are related to the information they can give you about the data contained in a set of reads, and the key factors that impact k-mer choice. When analyzing your own data, don't just copy the parameters used in other studies, but use the ideas described above to determine whether the key features of your data match those in these other studies, and adjust the analysis parameters accordingly.