Coverage and abundance estimation
The term "coverage" is commonly used when discussing shotgun sequencing data, particularly in the context of sequence assembly and sequence quantification. This term is increasingly ambiguous--this page provides a broad background on this concept that will hopefully clarify the contexts in which you may encounter it.
I first encountered the concept of "coverage" in the 1988 paper by Eric Lander and Michael Waterman--Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics. 2(3):231-239. This paper describes a mathematical/statistical approach for thinking about shotgun experiments, essentially answering the question: What is the relation between the amount of sequencing and our ability to reconstruct a genome from the resulting reads? In this context, they define the coverage c, as the extent by which the reads over-sample the genome. Assuming a genome of length G, and n reads of equal length L:
Another way of looking at the Lander-Waterman definition of coverage is to interpret it as the average number of sequencing reads that "pile up" at any given location in the genome. It is easy to understand it if you assume the genome is exactly the same length as the reads, as seen in the figure below.
In this case, each base pair of the genome is covered by n reads. The figure also provides an intuition of why we refer to this as "depth" of coverage.
In a more realistic setting, the genome is (much) longer than the reads, and the reads themselves are not of the same length, yielding a situation as the figure below.
Even in this case, the depth of coverage is simply defined as the number of reads that overlap at each position in the genome. For an entire genome, or a contig, the depth of coverage is simply the average over all the depths of coverage at each base in the genome/contig.
In the case when the genome sequence was generated from the reads themselves, by definition, the minimum depth of coverage in the genome is 1 - at least one read covers every base pair in the genome. This is not true when looking at alignments of reads against a genome or some other sequence, e.g., when aligning metagenomic reads to a reference genome sequence, or RNA-seq reads against a protein database. In that case, some parts of the reference sequence may not be covered by any read (depth of coverage equal to 0), as shown below.
In such situations, the concept of "breadth" of coverage describes the fraction of the reference sequence that is "covered" by the reads. In short, the "depth" and "breadth" of coverage are complementary concepts that describe how well represented a genome/reference sequence is within a set of sequencing reads.
The coverage information, or more precisely, the alignment of reads to DNA sequences, can be used to estimate the actual abundance of a DNA sequence within a sample. A special case is determining whether a sequence is actually found in a sample--essentially asking whether the coverage is non-zero. Estimating abundances from the "pile-up" of reads is an active (and occasionally controversial) area of research that goes beyond the scope of this simple tutorial. What I want to do here is highlight some common issues you may encounter.
In the first figure shown in this document (where the reads and the reference sequence all have the same length), the number of reads aligned to the reference sequence is a reasonable estimate of the abundance of the reference sequence in a sample. This is no longer true for the cases exemplified in the other figures. In the general case, the number of reads aligned to a sequence is proportional not only to its abundance, but to its length as well. To "normalize" the counts, researchers divide the number of reads by the length of the reference sequence - a measure called "reads per kilo-base-pair", which is essentially the average depth of coverage along a sequence. Most commonly, practical approaches simultaneously also normalize by the total amount of sequence generated in an experiment, yielding a measure you may have heard about: reads per kilo-base-pair per million (RPKM) - the average depth of coverage per sequence for every million reads sequenced. You can read more about RPKM and other variants of this measure here: https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/, https://www.biostars.org/p/273537/ .
It is important to realize that even in an idealized setting, the depth of coverage varies along the DNA sequence due to the randomness of the shotgun sequencing process. Below is an example where the beginning of a DNA sequence is significantly over-represented than its end.
In the example above, the depth of coverage is highly uneven, but the breadth of coverage is 100% (the entire sequence is covered by sequencing reads). In some cases, the breadth of coverage itself can be limited.
It should be clear that in such situations the average depth of coverage may not be the most accurate estimate of the abundance of the DNA sequence being analyzed.
Thus, before estimating abundances, make sure that your data are "well behaved". To my knowledge, most tools used in this context don't perform any such checks.
Why do situations such as those described above happen? Some explanations include:
- 1."Features" of sequence alignment tools--tools such as BWA and Bowtie have to cut a number of corners to rapidly compute alignments. In general, they will only find matches that are almost identical in sequence, and usually only find one or a few among multiple equally-good matches. Repeats, or near-repeats, thus "confuse" the alignment program, with some copies ending up over- or under-represented in the data.
- 2.Conserved regions/domains--sometimes the matches correspond simply to regions that are conserved across bacteria or protein families, and that happen to occur in the reference sequence. For example a fragment from an organism that spans part of the 16S rRNA gene will likely appear to be present in every bacterial metagenomic dataset simply because any read matching a conserved region of the 16S rRNA gene will match this fragment.
- 3.Assembly errors--the DNA sequences whose abundance we are trying to reconstruct are themselves built by software that often occasionally makes mistakes. Assume, for example, that the genome assembler incorrectly joined together two fragments of DNA originating from organisms with widely-divergent abundance. The depth of coverage will be highly uneven across this chimeric sequence.
First and foremost, run some sanity checks. Look at the variance in coverage within sequences, and flag those that have very high variance. Also, make sure to filter sequences by breadth of coverage--if a large fraction of a sequence is not covered by reads, it's highly unlikely that sequence is really present in your sample.
Second, make sure that the abundance estimation tools that you are using can effectively handle uneven coverage. For example, in our binning tool Binnacle, we identify breakpoints in the coverage signal and break the assembly at those locations. If the tool you are using doesn't handle uneven coverage, then don't trust the results particularly for sequences that you have flagged as having a high variance of coverage.