Bioinformatics lecture notes

Introduction to sequence assembly



The process through which scientists decode the DNA sequence of an organism is called sequencing. In 1975, Frederick Sanger developed the basic sequencing technology that is still widely used today. While this technology has been continuously improved over the years, the amount of DNA a sequencing machine can decode in a single read is usually much smaller than the size of most genomes: even the simplest viruses contain tens of thousands of base-pairs, bacteria contain millions, and mammalian genomes contain billions of base-pairs. To overcome this limitation, scientists developed a technique called shotgun sequencing, whereby the DNA of an organism is sheared into a large number of small fragments that are then sequenced; the resulting sequences (reads) are joined back together by a computer program called an assembler. For the reconstruction to even be possible, the individual sequences must be sampled from random locations in the genome. In addition, many copies of the genome of interest must be sequenced, so that enough sequences are sampled to ensure that they overlap, i.e., that enough information is available to decide which sequences should be joined together.

Is assembly possible? Lander-Waterman statistics

It is not at all clear that the assembly of a genome from small pieces should be possible. Given that the process through which the sequences are generated is random, it is possible that certain parts of the genome will remain uncovered by reads unless an impractically large number of sequences is generated, as shown in the figure below.

To assess the theoretical feasibility of assembling a genome from shotgun sequencing data, Eric Lander and Mike Waterman developed a statistical analysis based on Poisson statistics. Briefly, if events occur uniformly at random (e.g., the starting position of a sequencing fragment along a genome can be assumed to be chosen uniformly at random), the number of events occurring within a given interval follows a Poisson distribution. Given an average "arrival rate" λ (the number of events expected to occur within a given interval), the probability that exactly n events occur within that interval is expressed by the formula:

$$f(n, \lambda) = \frac{\lambda^n e^{-\lambda}}{n!}$$

In the context of sequencing, we are interested in finding intervals that contain no events (n = 0), which represent gaps in the coverage of the genome by sequences. The contiguous segments of the genome that occur between gaps are called contigs (or islands in the original paper).
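
For intuition, the probability that a given genome position is covered by no reads is $f(0, c) = e^{-c}$, since the expected number of reads covering a position equals the depth of coverage c. A minimal sketch in Python (the coverage values are illustrative):

```python
import math

def poisson_pmf(n: int, lam: float) -> float:
    """Probability that exactly n events occur given arrival rate lam."""
    return lam**n * math.exp(-lam) / math.factorial(n)

# Probability that a genome position falls in a gap (zero covering reads)
# at several depths of coverage. These coverage values are illustrative.
for c in [1, 2, 5, 8, 10]:
    print(f"coverage {c:>2}x: P(gap at a position) = {poisson_pmf(0, c):.5f}")
```

Even at 5-fold coverage, roughly 0.7% of positions are expected to remain uncovered, which is why deep oversampling is needed.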

The Lander-Waterman statistics estimate the number of gaps in coverage (and, conversely, the number of contiguous DNA segments) that can be expected given the following set of parameters:

  • G: genome length

  • n: number of sequences/reads

  • L: length of sequences (assuming each sequence/read has the same length)

  • $c = nL/G$: depth of coverage (the average number of sequences covering each location, or the amount by which the DNA contained in reads oversamples the genome)

  • t: the minimum amount by which two sequences must overlap for an algorithm to be able to detect the overlap

  • $\sigma = (L - t)/L$: the fraction of each sequence contained outside the overlapping region

Among other values, the Lander-Waterman statistics provide an estimate for the expected number of contigs: $N_{\text{contigs}} = n e^{-c\sigma}$. This function is plotted below for a genome of length 1 Mbp.
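
As a sketch of how these numbers come about, the following snippet evaluates the formula over a range of coverage values. The read length and minimum overlap are illustrative assumptions of our own, not values from the original paper:

```python
import math

G = 1_000_000        # genome length (1 Mbp, as in the plot below)
L = 500              # read length -- an assumed, illustrative value
t = 50               # minimum detectable overlap -- also assumed
sigma = (L - t) / L  # fraction of a read outside the overlap region

# Expected number of contigs N = n * exp(-c * sigma), where n = cG/L.
for c in [0.5, 1, 2, 4, 6, 8, 10]:
    n = c * G / L                      # number of reads at coverage c
    contigs = n * math.exp(-c * sigma)
    print(f"coverage {c:>4}x: {contigs:10.1f} expected contigs")
```

With these parameters the expected count peaks near 1-fold coverage and drops to a handful of contigs by 8-10x, matching the shape of the curve described below.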

As can be seen in the graph above, the expected number of contigs increases initially, as each individual read represents a separate contig/island. Once the coverage reaches about 1, indicating that the total amount of sequence in the reads is roughly equal to the size of the genome, the number of contigs peaks: new reads are now about as likely to overlap prior reads, generating no new contigs, as they are to start new ones. As the coverage increases further, the number of contigs drops, since new reads join existing contigs together. Once coverage reaches about 8-10 fold (once we have sequenced enough reads to cover the genome about 8-10 times), the number of contigs approaches 1, indicating that most of the genome will be found in just a few contigs.

These calculations demonstrated that shotgun sequencing is a cost-effective way to reconstruct genomes, and were used as part of the justification to fund the initial effort to sequence the human genome.

Shortest common superstring and greedy algorithm

A simple formulation of assembly as an optimization problem phrases it as the Shortest Common Superstring Problem: find the shortest string that contains all the sequences as substrings. In other words, find the most parsimonious "explanation" for the set of sequences. A fair amount of research went into this problem in the 1980s and 1990s; the problem was shown to be NP-hard, and a series of approximation algorithms were developed that approach an approximation factor of 2 (the reconstructed string is at most twice as long as the shortest possible string that contains all the substrings).
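
To make the problem concrete, here is a small brute-force sketch that finds a shortest common superstring by trying every ordering of the input strings, merging consecutive strings at their maximal overlap. This is exponential-time and assumes no read is a substring of another; the example reads are made up:

```python
from itertools import permutations

def merge(a: str, b: str) -> str:
    """Append b to a, reusing the longest suffix of a that prefixes b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b

def shortest_superstring(reads: list[str]) -> str:
    """Exhaustive search over all read orderings -- demo only."""
    best = None
    for order in permutations(reads):
        s = order[0]
        for r in order[1:]:
            s = merge(s, r)
        if best is None or len(s) < len(best):
            best = s
    return best

# Toy reads sampled from the made-up string "ATGGCGTGCA"
print(shortest_superstring(["ATGGC", "GGCGT", "GTGCA"]))  # -> ATGGCGTGCA
```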

A simple greedy algorithm can be proven to yield a 4-approximation; however, it is conjectured that this algorithm actually achieves an approximation factor of 2, given that no example has yet been found where the greedy assembly algorithm generates a worse approximation.

This is a good example of a situation where the theoretical solution to a practical problem simply misses the mark: we want to reconstruct the genome of the organism we're interested in, not some DNA sequence that is longer than it, even if we can place an upper bound on how wrong the solution will be.

The greedy assembly algorithm proceeds as follows (a code sketch follows the steps below):

  1. Compare all sequences in a pairwise fashion to identify sequences that overlap each other;

  2. Pick the pair of sequences that have the best overlap among all pairs and merge them;

  3. Repeat step 2 until no more sequences can be merged, or the remaining overlaps conflict with existing contigs.
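
A minimal sketch of this procedure, assuming exact suffix-prefix overlaps; real assemblers use inexact overlaps and far more efficient overlap detection than this all-pairs scan:

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        # Steps 1-2: find the best-overlapping pair among all pairs.
        best_k, best_pair = 0, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_pair = k, (i, j)
        if best_pair is None:        # Step 3: no usable overlaps remain.
            break
        i, j = best_pair
        merged = reads[i] + reads[j][best_k:]
        # Remove the two merged reads (larger index first), add the contig.
        for idx in sorted((i, j), reverse=True):
            reads.pop(idx)
        reads.append(merged)
    return reads  # one or more contigs

# Toy example: reads drawn from the made-up genome "ATGGCGTGCAATG"
print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATG"]))
```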

While this algorithm is only an approximation, it has been extremely successful in practice: most early genome assemblers (phrap, TIGR Assembler, CAP) relied on this simple greedy heuristic, and they were successful in reconstructing a range of genomes, including large segments of the first human genome ever sequenced.


Figure: Sequencing reads (thin lines) generated from a genome (thick line). The reads are shown in the positions from which they were derived; this information is not known a priori and must be inferred by the assembler. Multiple copies of the genome must be sampled to ensure that the reads overlap each other.

Figure: Gap in sequence coverage. The segment between the vertical dashed lines contains no reads, making it impossible to reconstruct the sequence of the original genome.

Figure: Number of contigs in a shotgun sequencing experiment as a function of the average depth of coverage, as estimated by the Lander-Waterman equations. For a 1 Mbp genome, the curve starts at 0 for zero coverage, rises rapidly to roughly 3,400 contigs at a coverage of about 1, then decreases steadily, approaching 0 again at coverage between 8 and 10.