Sequence assembly primer

(last updated August 2022)

Our genetic heritage, as well as that of all living organisms, is encoded in a set of DNA molecules called chromosomes. Each such molecule can be represented as a string of just four letters: A, C, T, and G, representing the four basic molecules (called nucleotides) that compose the DNA. In most organisms, DNA is structured as two molecules twisted around each other into a double-helix and held together by bonds between the complementary nucleotides A-T and C-G. The two molecules (also called strands) have a natural orientation defined by their chemical structure and are twisted together in opposite orientations. Both strands, thus, contain the same information, and the sequence of one strand can be obtained from the sequence of the other strand by reverse complementation, namely by reversing its sequence and then replacing each nucleotide with its complement (replacing each A with a T, each G with a C, and so on). This characteristic of DNA is essential to life as it allows the replication of the DNA molecule -- an important element of our ability to pass genetic information to our offspring.
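Reverse complementation is simple enough to express in a few lines of code. Below is a minimal Python sketch (the function name and the hard-coded complement table are ours, for illustration):

    # Minimal sketch of reverse complementation.
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def reverse_complement(seq):
        """Return the reverse complement of a DNA string."""
        # Reverse the sequence, then replace each base with its complement.
        return "".join(COMPLEMENT[base] for base in reversed(seq))

    # The two strands of a double helix carry the same information:
    assert reverse_complement("ACCTG") == "CAGGT"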


Shotgun sequencing and assembly

The process through which scientists decode the DNA sequence of an organism is called sequencing. In 1975, Frederick Sanger developed the first scalable sequencing technology. While this technology has been continuously improved in the decades since, it can only decode between 1,000 and 2,000 base-pairs of DNA at a time -- a significant limitation given that even the simplest viruses contain tens of thousands of base-pairs, bacteria contain millions, and mammalian genomes contain billions. To overcome this limitation, scientists developed a technique called shotgun sequencing, whereby the DNA of an organism is sheared into a large number of small fragments (Figure 1), the ends of the fragments are sequenced (Figure 2), and the resulting sequences are joined together by a computer program called an assembler (Figure 3).

Since the development of the Sanger technology, many other sequencing technologies have been developed, some of which can generate very long reads (up to hundreds of thousands of base-pairs). However, most genomes are still far longer than the reads generated from them, meaning that assembly remains, for the time being, a necessary step in the analysis of genome sequences.

Sequencing statistics

The assembler relies on the basic assumption that two sequence reads (two strings of letters produced by the sequencing machine) that share the same string of letters originated from the same place in the genome (Figure 3). Using such overlaps between the sequences, the assembler can join the sequences together in a manner similar to solving a jigsaw puzzle. You can try your own hand at assembling phrases from their fragments at virtualmagnets.umiacs.io.
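As a concrete illustration, the Python sketch below finds the longest exact suffix-prefix overlap between two reads. Exact matching is a simplification (real assemblers must also tolerate sequencing errors), and the function name and minimum-length cutoff are ours:

    def overlap(a, b, min_length=20):
        """Length of the longest suffix of a that exactly matches a
        prefix of b; 0 if no overlap of at least min_length exists."""
        start = 0
        while True:
            # Jump to the next occurrence of b's first min_length letters in a.
            start = a.find(b[:min_length], start)
            if start == -1:
                return 0
            # Verify that the rest of a's suffix matches the prefix of b.
            if b.startswith(a[start:]):
                return len(a) - start
            start += 1

    print(overlap("TTACGGACCGT", "ACCGTTTAAG", min_length=3))  # prints 5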

It is important to note that the shotgun sequencing process is inherently "wasteful": due to the randomness of the shearing process, assembly is only possible once enough sequences have been generated to cover the genome 8 to 10 times. Intuitively, this phenomenon can be understood by thinking of a sidewalk as it begins to rain. As raindrops fall randomly across the sidewalk, dry spots persist for quite a while, corresponding to regions of the genome that are not represented in the set of shotgun reads. Mathematically, this phenomenon was modeled by Eric Lander and Michael Waterman in 1988. They examined the relationship between the oversampling of the genome (also called coverage) and the number of contiguous pieces of DNA (commonly called contigs) that can be reconstructed by an idealized assembly program. Figure 4 shows a plot of the Lander-Waterman equation for a genome of 1 Mbp (megabase pairs = 1,000,000 base pairs). At 8- to 10-fold coverage, the model predicts that most of the genome will be assembled into a small number of contigs (approx. 5 for a 1 Mbp genome).
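The Lander-Waterman model predicts the expected number of contigs as N * e^(-c(1 - theta)), where N is the number of reads, c the coverage, and theta the fraction of a read length required to detect an overlap. A small Python sketch of the calculation (the parameter values are illustrative):

    import math

    def expected_contigs(G, L, N, T):
        """Expected number of contigs under the Lander-Waterman model.

        G: genome length (bp); L: read length (bp); N: number of reads;
        T: minimum overlap (bp) needed to detect that two reads overlap.
        """
        c = N * L / G      # coverage (fold oversampling of the genome)
        theta = T / L      # minimum overlap as a fraction of read length
        return N * math.exp(-c * (1 - theta))

    # A 1 Mbp genome sequenced with 800 bp reads at 8-fold coverage,
    # requiring 40 bp overlaps, is expected to yield about 5 contigs:
    G, L, c, T = 1_000_000, 800, 8, 40
    N = c * G / L
    print(round(expected_contigs(G, L, N, T), 1))  # prints 5.0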

Assembly challenges: non-random shearing and repeats

Ideally, an assembly program should produce one contig for every chromosome of the genome being sequenced. In all but the simplest cases, however, many contigs are produced, due to a combination of factors. Even if the genome is fragmented into perfectly random fragments at 8-10 fold coverage (or higher), there is a non-zero probability that some portion of the genome remains unsequenced. More importantly, however, the fragmentation process -- or, more precisely, the set of fragments that can be effectively sequenced -- cannot be modeled as a perfect Poisson process. For a range of reasons, some segments of DNA cannot be sequenced, fragment too easily, or are too difficult to fragment. As a result, the coverage of the genome is far from uniform, significantly increasing the chances that some regions are simply not covered at all.

The ability of an assembly program to produce a single contig is also limited by regions that occur in multiple near-identical copies throughout the genome (repeats). The reads originating from different copies of a repeat appear identical to the assembler and cause assembly errors. A simple example is shown in Figure 5, where the assembler incorrectly collapses the two copies of repeat A, leading to the creation of two contigs instead of one (Figure 6).

Scaffolding

The contigs produced by an assembly program can be ordered and oriented along a chromosome using additional information contained in the shotgun data. In most sequencing projects, the sizes of the fragments generated through the shotgun process are carefully controlled, thus providing a link between the sequence reads generated from the ends of a fragment (called paired ends or mate pairs). In a typical shotgun project, multiple libraries--collections of fragments of similar sizes--are usually generated, providing the assembler with additional constraints: within the assembly the paired end reads must be placed at a distance consistent with the size of the library from which they originate and must be oriented towards each other. Within an assembly each read is assigned an orientation corresponding to the DNA strand from which the read was generated. The constraints provided by mate pairs lead to constraints on the relative order and orientation of the contigs (Figure 7). The process through which the read pairing information is used to order and orient the contigs along a chromosome is called scaffolding.
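As a rough illustration of how a single mate pair constrains two contigs, the sketch below estimates the gap between them from the library size. The function name and the simplified geometry are ours; real scaffolders combine many such constraints, each with an error tolerance:

    def gap_estimate(insert_size, dist1, dist2):
        """Estimated gap between two contigs linked by one mate pair.

        insert_size: known size of the sheared fragment (from the library)
        dist1: distance from read 1's outer end to the edge of its contig
        dist2: distance from read 2's outer end to the edge of its contig
        (the two reads must point at each other across the gap).
        """
        return insert_size - dist1 - dist2

    # A pair from a 10,000 bp library whose reads lie 3,000 bp and
    # 2,000 bp inside their contigs implies a gap of roughly 5,000 bp:
    print(gap_estimate(10_000, 3_000, 2_000))  # prints 5000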

Finishing

The ultimate goal of any sequencing project is to determine every single base-pair of the original set of chromosomes. As we described above, rarely is an assembly program able to reconstruct a single piece of DNA per chromosome, leading to gaps in the reconstruction of the genome. These gaps are filled in through directed sequencing experiments in a process called finishing or gap closure. At this stage in the sequencing project, additional laboratory experiments and extensive manual curation are performed to validate the correctness of the final assembly, leading to a high-quality reconstruction of the original genome. Due to the significant cost of this process, it is only undertaken when it is critical to determine the complete sequence of a genome.

Assembly algorithms

The many assembly programs available to researchers differ in the details of their implementation and of the algorithms they employ; however, they all fall primarily into four general categories:

Greedy assemblers

The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. An example is shown in Figure 8, where the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided as input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
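A toy version of this strategy fits in a few lines of Python. The sketch below considers only exact overlaps and uses an arbitrary minimum overlap of 3 bp, so it illustrates the greedy idea rather than a usable assembler:

    def overlap_len(a, b, min_len=3):
        """Longest suffix of a matching a prefix of b (0 if below min_len)."""
        for n in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def greedy_assemble(reads):
        """Repeatedly merge the pair of reads with the largest overlap."""
        reads = list(reads)
        while len(reads) > 1:
            best_len, best_i, best_j = 0, None, None
            for i, a in enumerate(reads):
                for j, b in enumerate(reads):
                    if i != j:
                        n = overlap_len(a, b)
                        if n > best_len:
                            best_len, best_i, best_j = n, i, j
            if best_len == 0:      # no remaining pair overlaps: stop
                break
            merged = reads[best_i] + reads[best_j][best_len:]
            reads = [r for k, r in enumerate(reads)
                     if k not in (best_i, best_j)]
            reads.append(merged)
        return reads

    print(greedy_assemble(["ACGTAC", "GTACGG", "ACGGAT"]))  # ['ACGTACGGAT']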

Overlap-layout-consensus

The relationships between the reads provided to an assembler can be represented as a graph, where the nodes represent the reads and an edge connects two nodes if the corresponding reads overlap. The assembly problem thus becomes the problem of identifying a path through the graph that visits every node exactly once -- a Hamiltonian path. This formulation allows researchers to apply techniques developed in the field of graph theory to the assembly problem. An assembler following this paradigm starts with an overlap stage, during which all overlaps between the reads are computed and the graph is constructed. In a layout stage, the graph is simplified by removing redundant information, and graph algorithms are used to determine a layout (relative placement) of the reads along the genome. In a final consensus stage, the assembler builds a multiple alignment of all the reads covering the genome and infers, as the consensus of the aligned reads, the original sequence of the genome being assembled.
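One common simplification performed during the layout stage is the removal of transitive edges: if read A overlaps read B, B overlaps C, and A also overlaps C, the edge from A to C carries no extra information. A minimal sketch of the idea (ignoring edge lengths and read orientations, which real assemblers must track):

    def transitive_reduction(edges):
        """Drop edges explained by an intermediate node."""
        nodes = {v for _, v in edges}
        reduced = set(edges)
        for a, c in edges:
            # If some b gives a path a -> b -> c, the a -> c edge is redundant.
            if any((a, b) in edges and (b, c) in edges for b in nodes):
                reduced.discard((a, c))
        return reduced

    edges = {("r1", "r2"), ("r2", "r3"), ("r1", "r3")}
    print(sorted(transitive_reduction(edges)))  # [('r1', 'r2'), ('r2', 'r3')]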

de Bruijn graph / Eulerian approach

de Bruijn graph / Eulerian path approaches are based on early attempts to sequence genomes through a technique called sequencing by hybridization. In this technique, instead of generating a set of reads, scientists identified all strings of length k (k-mers) contained in the original genome. While this experimental method did not produce a viable alternative to Sanger sequencing, it led to the development of an elegant approach to sequence assembly.

This approach, also based on a graph-theoretic model, breaks up each read into a collection of overlapping k-mers. Each k-mer is represented in a graph as an edge connecting the two nodes corresponding to its length-(k-1) prefix and suffix, respectively. The resulting graph is called a de Bruijn graph, after the graph-theoretic structure of the same name.
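Constructing such a graph is straightforward. The sketch below builds a de Bruijn graph from a set of reads as a mapping from each (k-1)-mer to the list of (k-1)-mers it connects to; repeated edges are kept, since edge multiplicity matters for the Eulerian formulation:

    from collections import defaultdict

    def de_bruijn_graph(reads, k):
        """Each k-mer becomes an edge from its (k-1)-prefix to its
        (k-1)-suffix; overlapping reads feed into the same graph."""
        graph = defaultdict(list)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].append(kmer[1:])
        return graph

    print(dict(de_bruijn_graph(["ACGTG", "CGTGA"], k=3)))
    # {'AC': ['CG'], 'CG': ['GT', 'GT'], 'GT': ['TG', 'TG'], 'TG': ['GA']}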

In the graph containing the information obtained from all the reads, a solution to the assembly problem corresponds to a path in the graph that uses all the edges exactly once--an Eulerian path. More accurately, a solution to the assembly problem is the shortest path in the graph that uses each edge at least once--solving the Route Inspection problem.
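When an Eulerian path exists, it can be found in time linear in the number of edges using Hierholzer's algorithm. The sketch below (function names are ours) assumes error-free k-mers and a graph with a unique Eulerian path; spelling out the path recovers the genome, including both copies of the repeated k-mer CGT:

    from collections import defaultdict

    def eulerian_path(graph):
        """Hierholzer's algorithm; assumes an Eulerian path exists."""
        out = {u: list(vs) for u, vs in graph.items()}   # mutable copy
        indeg = defaultdict(int)
        for vs in out.values():
            for v in vs:
                indeg[v] += 1
        # Start at a node with more outgoing than incoming edges, if any.
        start = next((u for u in out if len(out[u]) > indeg[u]),
                     next(iter(out)))
        stack, path = [start], []
        while stack:
            u = stack[-1]
            if out.get(u):               # unused edges remain: follow one
                stack.append(out[u].pop())
            else:                        # dead end: commit u to the path
                path.append(stack.pop())
        return path[::-1]

    # de Bruijn graph of the genome ACGTGCGTA for k = 3 (CGT occurs twice):
    g = {"AC": ["CG"], "CG": ["GT", "GT"], "GT": ["TG", "TA"],
         "TG": ["GC"], "GC": ["CG"]}
    path = eulerian_path(g)
    print(path[0] + "".join(node[-1] for node in path[1:]))  # ACGTGCGTA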

Assembly graph controversy

In the early 2000s, the sequence assembly community was rocked by controversy. Which assembly approach is better? Overlap-layout-consensus or de Bruijn graph?

The argument in favor of the overlap-layout-consensus approach was that de Bruijn graph approaches throw away information (the focus is on k-mers rather than the reads that contain them) and that sequencing errors can have an outsized impact, since the analysis is based on exact matches alone. As an aside, the emergence of de Bruijn graph approaches also spurred significant research on error correction in order to mitigate the impact errors have on the structure of de Bruijn graphs (each error affects up to k different k-mers).
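A simple instance of the error-correction idea is to count the k-mers in all reads: at reasonable coverage, correct k-mers are seen many times, while k-mers containing an error are rare. A toy sketch (the frequency threshold of one is ours):

    from collections import Counter

    def kmer_spectrum(reads, k):
        """Count every k-mer occurring in the reads."""
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return counts

    # A single error creates up to k novel, low-frequency k-mers:
    reads = ["ACGTACGT"] * 5 + ["ACGTACTT"]   # last read has one error
    rare = [km for km, n in kmer_spectrum(reads, k=4).items() if n == 1]
    print(rare)  # ['TACT', 'ACTT']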

The argument in favor of the de Bruijn graph approach focuses on computational efficiency. The Hamiltonian path problem that needs to be solved by the overlap-layout-consensus approach is a difficult computational problem, falling in the category of NP-hard problems. As an over-simplification, these are problems for which no approach substantially better than trying all possible solutions is known, i.e., it is unlikely they can be solved efficiently. In contrast, the route inspection problem can be solved in polynomial time, meaning that the run time of the algorithm can be expressed as a polynomial function of the size of the problem (in this case, the number of nodes or edges in the graph).

As always, the reality is a lot more complex. The NP-hardness of the Hamiltonian path problem only applies to worst case scenarios that are unlikely to occur in practice. Furthermore, a de Bruijn graph is consistent with an exponential number of possible solutions to the route inspection problem, and constraining the assembly to find the path that corresponds to the genome being assembled leads to NP-hard problems even if any particular (and potentially incorrect) solution can be found in polynomial time.

The choice of algorithmic paradigm for assembly has, thus, been largely driven by the properties of the data being assembled, more than by theoretical arguments about which approach is "better". Most genome assemblers developed for short-read high-quality data (e.g., Illumina) have used the de Bruijn approach, while the emergence of long-read, high-error sequencing technologies has led to a re-birth of overlap-layout-consensus approaches as well as hybrid algorithms that attempt to combine the best features of the two paradigms.

Reference-guided assembly

As more and more genomes become available in public databases, it is increasingly the case that a completed genome exists that is closely related to the genome being assembled. The assembly problem thus becomes easier, as the relative placement of reads can be inferred from their alignment to the related genome (or reference), in a process called reference-guided, or comparative, assembly. As a result, the overlap stage of assembly (often one of the most computationally intensive assembly tasks) is replaced by an alignment step, and the layout stage is greatly simplified by the additional constraints provided by the alignment to the reference. The challenge when using such an approach is to make sure the assembly actually reflects the information contained in the reads instead of simply reproducing the reference.
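The sketch below illustrates the idea with a toy "aligner" that places each read at its minimum-Hamming-distance position on the reference and then takes a per-column majority vote. Production pipelines use real aligners and far more careful consensus calling; all names here are ours:

    from collections import Counter

    def place(read, ref):
        """Best position of read on ref by Hamming distance (toy aligner)."""
        return min(range(len(ref) - len(read) + 1),
                   key=lambda p: sum(x != y for x, y
                                     in zip(read, ref[p:p + len(read)])))

    def consensus(reads, ref):
        """Lay reads out by alignment to the reference, then majority-vote."""
        columns = [Counter() for _ in ref]
        for read in reads:
            p = place(read, ref)
            for i, base in enumerate(read):
                columns[p + i][base] += 1
        # Fall back to the reference base where no read covers a column.
        return "".join(col.most_common(1)[0][0] if col else ref[i]
                       for i, col in enumerate(columns))

    # The sample differs from the reference at one position (A vs T at 4),
    # and the consensus reflects the reads rather than the reference:
    ref = "AACGTTGGCA"
    reads = ["AACGA", "CGATG", "TGGCA"]
    print(consensus(reads, ref))  # AACGATGGCA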

Hierarchical sequencing

In order to avoid some of the complexity involved in assembling large genomes, scientists have developed a hierarchical approach. During the human genome project, this approach involved breaking up the genome into a collection of large fragments (between 40 and 200 kbp) called Bacterial Artificial Chromosomes, or BACs. The BACs' locations along the genome were then mapped using specialized laboratory experiments, and a minimal tiling path of BACs was chosen such that each base in the genome was covered by at least one BAC while the overlap between BACs was minimized. Each BAC was then sequenced through the standard shotgun method, and the resulting assemblies were combined into an assembly for each chromosome using the information provided by the tiling paths (Figure 10).
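Selecting a minimal tiling path from mapped BACs is an instance of the classic interval-covering problem, for which a greedy algorithm is optimal: among the BACs starting at or before the first uncovered base, repeatedly pick the one extending furthest. A sketch with hypothetical coordinates:

    def minimal_tiling_path(bacs, genome_length):
        """Greedy minimal cover of [0, genome_length) by (start, end) BACs."""
        bacs = sorted(bacs)                # sort by start coordinate
        path, covered, i = [], 0, 0
        while covered < genome_length:
            best = None
            # Among BACs reaching the uncovered region, keep the longest reach.
            while i < len(bacs) and bacs[i][0] <= covered:
                if best is None or bacs[i][1] > best[1]:
                    best = bacs[i]
                i += 1
            if best is None:
                raise ValueError("gap in BAC map at position %d" % covered)
            path.append(best)
            covered = best[1]
        return path

    # Mapped BAC coordinates in kbp (hypothetical numbers):
    bacs = [(0, 150), (100, 260), (120, 210), (240, 400), (390, 500)]
    print(minimal_tiling_path(bacs, 500))
    # [(0, 150), (100, 260), (240, 400), (390, 500)]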

More recently, a number of other technologies were developed that follow a similar approach, though avoiding the need to grow E. coli cells. These include the Illumina Synthetic Long Reads approach and techniques developed at 10X Genomics.

Challenging assembly problems

Everything described above implicitly assumes that we are trying to sequence and assemble a single DNA sequence. In a number of settings this is not true, leading to much more challenging computational problems.

In most eukaryotic genomes, each chromosome occurs in two copies, each derived from a different parent. In plants, each chromosome may have even more copies (e.g., modern wheat has 6 copies of each chromosome). An ideal assembler would resolve these chromosomes, or haplotypes, into separate contigs, thereby enabling analyses of the flow of genetic information from generation to generation. The fact that the chromosomes are almost identical for long stretches, however, makes this problem very complex since the repeats that the assembler needs to resolve are now very long.

A similar situation occurs in metagenomics -- the sequencing of DNA extracted from a microbial community. Not only do metagenomic assemblers have to contend with the different haplotypes an organism may have in the mixture (here the haplotypes are different strains of closely related bacteria), but also with the varying levels of abundance, and therefore coverage, of the different organisms in a sample.
