Introduction to sequence assembly
The process through which scientists decode the DNA sequence of an organism is called sequencing. In 1975, Frederick Sanger developed the basic sequencing technology that is still widely used today. While this technology has been continuously improved over the years, the amount of DNA a sequencing machine can decode is usually smaller than the size of most genomes: even the simplest viruses contain tens of thousands of base-pairs, bacteria contain millions, and mammalian genomes contain billions of base-pairs. To overcome this limitation, scientists have developed a technique called shotgun sequencing, whereby the DNA of an organism is sheared into a large number of small fragments, the fragments are sequenced, and the resulting sequences are joined together using a computer program called an assembler. In order for the reconstruction to even be possible, the individual sequences must be sampled from random locations in the genome. Also, many copies of the genome of interest must be sequenced so that enough sequences are sampled to ensure that the individual sequences overlap, i.e., so that enough information is available to decide which sequences should be joined together.
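To make the sampling process concrete, here is a toy simulation of shotgun read generation in Python. It is only a sketch under simplifying assumptions (no sequencing errors, all reads the same length, start positions uniformly at random); the function name and parameter values are ours, not part of any real sequencing protocol.

```python
import random

def shotgun_reads(genome, num_reads, read_len, seed=0):
    """Sample reads whose start positions are chosen uniformly at random,
    mimicking the random shearing step of shotgun sequencing.
    Toy model: no sequencing errors, every read has the same length."""
    rng = random.Random(seed)
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

# Example: a 1,000 bp random "genome" sampled with 100 bp reads at ~8x coverage
# (80 reads * 100 bp = 8,000 bp of sequence for a 1,000 bp genome).
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = shotgun_reads(genome, num_reads=80, read_len=100)
print(len(reads), "reads,", sum(map(len, reads)), "bp of sequence in total")
```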
It is not at all clear that the assembly of a genome from small pieces should be possible. Given that the process through which the sequences are generated is random, it is possible that certain parts of the genome will remain uncovered by reads unless an impractical number of sequences is generated, as shown in the figure below.

To assess the theoretical feasibility of the assembly of shotgun sequencing data, Eric Lander and Michael Waterman developed a statistical analysis based on Poisson statistics. Briefly, if some events occur uniformly at random (e.g., the start of a sequencing fragment along a genome can be assumed to be chosen uniformly at random), the number of events occurring within a given interval is represented by a Poisson distribution.
Given an average "arrival rate" λ (the average number of events occurring within a given interval), the probability that exactly n events occur within that interval is expressed by the formula:

$$P(n) = \frac{\lambda^{n} e^{-\lambda}}{n!}$$
In the context of sequencing we are interested in finding intervals that contain no events (n=0)—these represent gaps in the coverage of the genome by sequences. The contiguous segments of the genome that occur between gaps are called contigs (or islands in the original paper).
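As a quick sanity check of this reasoning, the snippet below evaluates the Poisson probability of zero events. If the "interval" is a single genome position and λ is the depth of coverage, then P(0) = e^{-λ} is the chance that the position is covered by no read, so G * P(0) gives a rough estimate of the number of uncovered bases. This is a back-of-the-envelope calculation, not the full Lander-Waterman derivation.

```python
import math

def poisson_pmf(n, lam):
    """P(exactly n events) for a Poisson distribution with mean lam."""
    return lam ** n * math.exp(-lam) / math.factorial(n)

# Probability that a given base is covered by zero reads (n = 0) at several
# coverage depths; the expected number of uncovered bases in a G bp genome
# is then roughly G * P(0).
G = 1_000_000
for c in (1, 2, 5, 8, 10):
    p0 = poisson_pmf(0, c)
    print(f"coverage {c:2d}x: P(uncovered) = {p0:.4f}, "
          f"expected uncovered bases ~ {G * p0:,.0f}")
```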
The Lander-Waterman statistics estimate the number of gaps in coverage (conversely the number of contiguous DNA segments) that can be expected given the following set of parameters:
G: genome length
n: number of sequences/reads
L: length of sequences (assuming each sequence/read has the same length)
c = nL/G: depth of coverage (the average number of sequences covering each location, or the amount by which the DNA contained in the reads oversamples the genome)
t: the amount by which two sequences need to overlap for an algorithm to be able to detect the overlap
σ = 1 − t/L: the fraction of each sequence contained outside the overlapping region
Among other values, the L-W statistics provide an estimate for the expected number of contigs: $n e^{-c\sigma}$. This function is plotted below for a genome of length 1 Mbp.
As can be seen in the graph above, the expected number of contigs increases initially, as each individual read represents a separate contig/island. Once the coverage reaches about 1, indicating that the total amount of sequence in the reads is roughly equal to the size of the genome, the number of contigs plateaus, since new reads overlap prior ones and therefore do not create new contigs. As the coverage increases further, the number of contigs drops as new reads join existing contigs together. Once coverage reaches about 8-10 fold (once we have sequenced enough reads to cover the genome about 8-10 times), the expected number of contigs approaches 1, indicating that most of the genome will be found in just a few contigs.
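The behavior described above can be checked numerically. The sketch below evaluates the expected-contig formula n·e^{-cσ} for a 1 Mbp genome; the read length (500 bp) and minimum detectable overlap (50 bp) are illustrative values we chose, not numbers given in the text.

```python
import math

def expected_contigs(G, L, t, coverage):
    """Lander-Waterman expected number of contigs, n * exp(-c * sigma),
    where n = c*G/L is the number of reads and sigma = 1 - t/L."""
    n = coverage * G / L      # number of reads needed to reach this coverage
    sigma = 1 - t / L         # fraction of a read outside the minimum overlap
    return n * math.exp(-coverage * sigma)

# Illustrative parameters (assumed): 1 Mbp genome, 500 bp reads,
# 50 bp minimum detectable overlap.
G, L, t = 1_000_000, 500, 50
for c in (0.5, 1, 2, 4, 8, 10):
    print(f"coverage {c:4.1f}x -> expected contigs ~ {expected_contigs(G, L, t, c):8.1f}")
```

With these numbers the curve rises to roughly 800 expected contigs around 1x coverage and falls to only a handful by 8-10x, matching the shape described above.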
These calculations demonstrated that shotgun sequencing is a cost-effective way to reconstruct genomes, and were used as part of the justification to fund the initial effort to sequence the human genome.
A simple formulation of the assembly problem as an optimization problem phrases it as the Shortest Common Superstring Problem: find the shortest string that contains all the sequences as substrings. In other words, find the most parsimonious "explanation" for the set of sequences. A fair amount of research went into this problem in the 1980s and 1990s: the problem was shown to be NP-hard, and a series of approximation algorithms were developed that approach an approximation factor of 2 (the reconstructed string is at most twice as long as the shortest possible string that contains all the sequences as substrings).
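To make the objective concrete, here is a brute-force solver for tiny instances: it tries every ordering of the input strings and merges consecutive strings by their longest exact suffix-prefix overlap. This is purely illustrative (exponential in the number of strings) and is not one of the approximation algorithms mentioned above; the function names are ours.

```python
from itertools import permutations

def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b

def shortest_common_superstring(strings):
    """Exact SCS by trying every ordering; only feasible for toy inputs."""
    best = None
    for order in permutations(strings):
        candidate = order[0]
        for s in order[1:]:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

print(shortest_common_superstring(["ACGT", "CGTA", "GTAC"]))  # -> "ACGTAC"
```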
A simple greedy algorithm can be proven to yield a 4-approximation; however, it is conjectured that this algorithm is actually a 2-approximation, given that no example has yet been found where the greedy assembly algorithm produces a superstring more than twice the length of the optimal one.
This is a good example of a situation where the theoretical solution to a practical problem simply misses the mark — we want to reconstruct the genome of the organism we're interested in, not some DNA sequence that is longer than it, even if we can place an upper bound on how wrong the solution will be.
The greedy assembly algorithm proceeds as follows (a code sketch appears after the list):
Compare all sequences in a pairwise fashion to identify sequences that overlap each other;
Pick the pair of sequences that have the best overlap among all pairs and merge them;
Repeat step 2 until no more sequences can be merged, or the remaining overlaps conflict with existing contigs.
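Below is a minimal sketch of this greedy strategy, assuming exact suffix-prefix overlaps, no sequencing errors, and no reverse complements; the conflict check in the last step is simplified to never reusing a sequence once it has been merged. Function names and the minimum-overlap threshold are ours, not taken from any particular assembler.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b
    (at least min_len long), or 0 if there is none."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the best (longest) overlap
    until no pair overlaps by at least min_len; the survivors are the contigs."""
    contigs = list(reads)
    while True:
        best_len, best_pair = 0, None
        # Step 1: compare all sequences pairwise to find overlaps.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_pair = k, (i, j)
        if best_pair is None:
            return contigs
        # Step 2: merge the best-overlapping pair into a single contig.
        i, j = best_pair
        merged = contigs[i] + contigs[j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

# Three overlapping toy reads assemble into a single contig.
print(greedy_assemble(["TTACG", "ACGGA", "GGATC"], min_len=3))  # ['TTACGGATC']
```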
While this algorithm is only an approximation, it has been extremely successful in practice—most early genome assemblers (phrap, TIGR Assembler, CAP) relied on this simple greedy heuristic, and were successful in reconstructing a range of genomes, including large segments of the first human genome ever sequenced.