Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Overlap-layout-consensus graph
  • DeBruijn graph
  • Differences between overlap graphs and de Bruijn graphs
  • Exercises
  1. Sequence assembly

Graph formulations of sequence assembly

PreviousIntroduction to sequence assemblyNextFinding Eulerian tours

Last updated 4 months ago

The parsimony definition of assembly implicit in the Shortest Common Superstring problem can be easily seen not to be relevant in a biological setting, primarily due to the presence of repeated DNA sequences (repeats) within the genomes of most organisms. These redundant sequences would be collapsed into a single "unit" by any algorithm that attempts to solve the Shortest Common Superstring problem. As a concrete example, consider a string comprising 1000 As, which is sequenced with a technology that generates sequencing reads of just 100 letters long. A solution to the shortest common superstring problem would simply pile up all the reads on top of each other, generating a string of just 100 letters. A visual representation of this phenomenon is shown below.

Other optimization criteria have been proposed that attempt to capture the biological nature of the problem. Myers proposed, for example, that we should phrase the assembly problem as the task of reconstructing a layout of the sequences that is consistent (in terms of Kolmogorov statistics) with the characteristics of the random process that generated the sequences. Unfortunately this formulation is hard to translate into a practical algorithm.

Instead, most modern assemblers formulate the assembly as a graph traversal problem as described in the following sections.

Overlap-layout-consensus graph

DeBruijn graph

An alternative formulation for the assembly problem arose form early explorations of a sequencing technology called "sequencing by hybridization". In this approach, one could find out if a given k-mer occurred in the genome being sequenced, i.e., the assembly problem could be phrased as:

Given the collection of all k-mers (strings of length k) in a genome, reconstruct the genome's sequence.

In some sense, this problem is equivalent to a shotgun process that generates reads of length exactly k, and which guarantees that exactly one read starts at each position in the genome (perfect coverage).

In the context of genome assembly, we are, of course, not considering complete de Bruijn graphs. Instead, a graph is constructed from the k-mers found in the individual sequencing reads being assembled. Specifically, each k-mer in each read is converted to a node in the graph such that each k-mer is uniquely represented in a node. In other words, if a same k-mer is found in two different locations in a read or in two different reads, then all of the instances refer to the same node in the graph. An edge is created in the graph connecting the k-mers that are adjacent in the read, i.e., which overlap each other by exactly k-1 letters. As an example, the string ACCAGACTGAC can be converted as shown below. Note that the k-mer GAC is found twice in the string but is represented by a single node in the graph.

The string can be reconstructed from the graph by traversing the graph and outputting the letters that are implicit in each edge. Specifically, each edge in the graph indicates that the strings corresponding to the head and tail nodes overlap by k-1 letters (ACC and CCA overlap by two letters: CC). The string "spelled" by the edge ACC→CCA is the k-mer in the head node, followed by the letter in the tail node that is not represented in the overlap (A in this case). This is the letter that "extends" the head node as we move along the edge. You can see this in the graph shown below where each edge is augmented with this "extension" letter. You can also observe, that in a de Bruijn graph constructed from DNA strings, each node can have at most 4 neighbors following it, corresponding to an extension by each of the four possible nucleotides.

So far, it is not clear how this helps with assembly since we only talked about a single read. Let's see what happens when we construct the graph from two overlapping reads as seen in the figure below.

The two reads are highlighted, one in bold, and the other one with underlines. As you see from the graph below, several nodes (k-mers) are found in both reads (are both bold and underlined) and they allow the two sub-graphs corresponding to the individual reads to join. By traversing the joint graph we can reconstruct the string we originally started with.

It is important to note that constructing the de Bruijn graph can be done in linear time (with respect to the total length of the reads) assuming that we can identify in constant time which node corresponds to a k-mer. For relatively short k-mers, a hash table can be used for this purpose.

Differences between overlap graphs and de Bruijn graphs

Before we discuss how to use de Bruijn graphs to construct genome sequences, it is useful to spend a bit of time to identify some key differences between overlap graphs and de Bruijn graphs.

First, let's return to the overlap we have shown in the previous figure. Let's assume that one of the reads contains a small error, for example the G in the first read was incorrectly read as a T. The resulting de Bruijn graph is shown below.

As seen above, one small error causes three k-mers to change (CAG, AGA, and GAC become CAT, ATA, and TAC, respectively) and now no k-mers in the first read match k-mers in the second read, leading to a disconnected graph. Yet, computing the overlap between the reads using dynamic programming can easily overcome this error, thus the overlap graph will be able to reconstruct this genomic segment. For this reason, genome assembly algorithms using de Bruijn graph include complex error correction modules.

However, the most significant difference between the two approaches is the complexity of finding a path through the graph that represents the correct genome. As mentioned earlier, in an overlap graph we want to find a path that uses every node of the graph exactly once, i.e., a Hamiltonian path—problem that is NP-hard. What about de Bruijn graphs. We note that each edge in a de Bruijn graph corresponds to a pair of k-mers that exist within a read. Since we are trying to use all information in the reads, we are looking for a path that uses all the edges in the graph, and since we want to find a parsimonious solution, we will be looking for the shortest such path. This corresponds to the route inspection problem—a computational problem related to the Eulerian tour problem, and which has a simple polynomial time solution. Thus, solving the assembly problem using de Bruijn graphs is (at least theoretically) much more efficient.

In reality, the situation is more complex, and actually both formulations of assembly are computationally intractable for practical applications, however that discussion is beyond the scope of this chapter.

Exercises

  1. Construct the de Bruijn graph for k = 3 (nodes of the graph represent 3-mers) for the string: ATTATAATATATTA

One of the first formulations of assembly as a graph problem, represents each sequence as a separate node in a graph, and creates an edge between any two nodes whose corresponding sequences overlap. The "overlap" is usually defined in terms of an imperfect alignment as was discussed in the The overlap between two sequences can be easily constructed with a variant of the global alignment dynamic programming algorithm.

In this formulation, we want to find a traversal of the graph that uses all the sequences, each exactly once (you cannot use a same sequence in multiple places in the genome), i.e., we are looking for a Hamiltonian path—a well-known NP-hard problem. This formulation dates to the early 1980s in the work of and colleagues.

formulated this problem in the context of de Bruijn graphs: a de Bruijn graph of order k is a graph that contains all strings of length k-1 from a given alphabet as nodes, and contains an edge between two nodes if the corresponding strings overlap by exactly k-2 letters. An example for k = 4 and the alphabet {A, T} is shown below. In such a graph, any path that visits every edge in the graph (also called an Eulerian path) spells a string that contains all possible k-mers constructed from the alphabet.

In contrast, however, the runtime of de Bruijn graph construction is linear, as we pointed out above, while the construction of an overlap graph requires aligning the reads to each other in a pairwise fashion, leading to an O(N2)O(N^2)O(N2) algorithm for N reads. Furthermore, instead of the simple hash lookups used by de Bruijn graphs, overlap graphs rely on the fairly expensive dynamic programming alignment algorithm for each pair of reads. For this reason, algorithms using overlap graphs implement many optimizations such as the banded we discussed in the that yields faster run times when the sequences are very similar to each other.

sequence alignment chapter.
Esko Ukkonen
Pavel Pevzner
alignment technique
alignment chapter
Shortest common superstring incorrectly reconstructs repeats. (left) Set of reads generated from a repetitive region of a genome; (right) The (incorrect) shortest common superstring reconstruction.
Complete de Bruijn graph for k = 4 and alphabet {A,T}. Every length 4 string comprised of As and Ts occurs within a traversal of the graph that visits all its edges.
The de Bruijn graph obtained from string ACCAGACTGAC. 4-length k-mers are represented as edges connecting their first and last 3-mers. For example, 4-mer ACCA becomes an edge between 3-mer nodes ACC and CCA.
De Bruijn graph with edges labeled with the corresponding letters. The full string can be "spelled" by concatenating the first k-mer (ACC) with the letters spelled along the path: AGACTGAC
De Bruijn graph for two overlapping reads. The bolded k-mers originate from the first string, the underlined k-mers originate from the second string, and the bolded and underlined k-mers are shared by both strings. The graph emerges after processing each string separately, without the need to explicitly identify the overlap through alignment.
Error in one read leads to a disconnected de Bruijn graph since no k-mer is shared by both strings.
On the left, a long string composed of all As that is covered by a collection of short 4-letter strings of As. On the right is a representation of all of the short strings piled up on top of each other implying they all originate form a short string of length 4.
A graph representing the set of all DNA strings of length 3 that are connected to each other by edges if the corresponding strings overlap by 2 letters.
Graph representing the 3-mers found in the string ACCAGACTGAC. Edges connect adjacent k-mers, i.e., k-mers that overlap each other by 2 letters. A loop occurs at the end of the graph connecting k-mer TGA to k-mer  GAC, indicating the latter k-mer occurs twice in the string.
Graph representing the 3-mers found in the string ACCAGACTGAC. Edges connect adjacent k-mers, i.e., k-mers that overlap each other by 2 letters. A loop occurs at the end of the graph connecting k-mer TGA to k-mer  GAC, indicating the latter k-mer occurs twice in the string. Each edge is labeled by the last letter in the 4-mer, i.e., the letter that can be added to the 3-mer at the start of the edge to complete the sequence of the 4-mer.
At the top of the image, two overlapping strings are shown. At the bottom is the de Bruijn graph that is obtained from them. The nodes are bolded if they belong to the first string, underlined if they belong to the second string, and are both bolded and underlined if they are shared by the two strings.
At the top of the image, two overlapping strings are shown. At the bottom is the de Bruijn graph that is obtained from them. The nodes are bolded if they belong to the first string, underlined if they belong to the second string, and are both bolded and underlined if they are shared by the two strings. Due to a single mutation in one of the strings, the graphs constructed from the two strings remain separate as no k-mer is shared by them.