Graph formulations of sequence assembly

The parsimony definition of assembly implicit in the Shortest Common Superstring problem can be easily seen not to be relevant in a biological setting, primarily due to the presence of repeated DNA sequences (repeats) within the genomes of most organisms. These redundant sequences would be collapsed into a single "unit" by any algorithm that attempts to solve the Shortest Common Superstring problem. As a concrete example, consider a string comprising 1000 As, which is sequenced with a technology that generates sequencing reads of just 100 letters long. A solution to the shortest common superstring problem would simply pile up all the reads on top of each other, generating a string of just 100 letters. A visual representation of this phenomenon is shown below.

On the left, a long string composed of all As that is covered by a collection of short 4-letter strings of As. On the right is a representation of all of the short strings piled up on top of each other implying they all originate form a short string of length 4.

Other optimization criteria have been proposed that attempt to capture the biological nature of the problem. Myers proposed, for example, that we should phrase the assembly problem as the task of reconstructing a layout of the sequences that is consistent (in terms of Kolmogorov statistics) with the characteristics of the random process that generated the sequences. Unfortunately this formulation is hard to translate into a practical algorithm.

Instead, most modern assemblers formulate the assembly as a graph traversal problem as described in the following sections.

Overlap-layout-consensus graph

One of the first formulations of assembly as a graph problem, represents each sequence as a separate node in a graph, and creates an edge between any two nodes whose corresponding sequences overlap. The "overlap" is usually defined in terms of an imperfect alignment as was discussed in the sequence alignment chapter. The overlap between two sequences can be easily constructed with a variant of the global alignment dynamic programming algorithm.

In this formulation, we want to find a traversal of the graph that uses all the sequences, each exactly once (you cannot use a same sequence in multiple places in the genome), i.e., we are looking for a Hamiltonian path—a well-known NP-hard problem. This formulation dates to the early 1980s in the work of Esko Ukkonen and colleagues.

DeBruijn graph

An alternative formulation for the assembly problem arose form early explorations of a sequencing technology called "sequencing by hybridization". In this approach, one could find out if a given k-mer occurred in the genome being sequenced, i.e., the assembly problem could be phrased as:

Given the collection of all k-mers (strings of length k) in a genome, reconstruct the genome's sequence.

In some sense, this problem is equivalent to a shotgun process that generates reads of length exactly k, and which guarantees that exactly one read starts at each position in the genome (perfect coverage).

Pavel Pevzner formulated this problem in the context of de Bruijn graphs: a de Bruijn graph of order k is a graph that contains all strings of length k-1 from a given alphabet as nodes, and contains an edge between two nodes if the corresponding strings overlap by exactly k-2 letters. An example for k = 4 and the alphabet {A, T} is shown below. In such a graph, any path that visits every edge in the graph (also called an Eulerian path) spells a string that contains all possible k-mers constructed from the alphabet.

A graph representing the set of all DNA strings of length 3 that are connected to each other by edges if the corresponding strings overlap by 2 letters.

In the context of genome assembly, we are, of course, not considering complete de Bruijn graphs. Instead, a graph is constructed from the k-mers found in the individual sequencing reads being assembled. Specifically, each k-mer in each read is converted to a node in the graph such that each k-mer is uniquely represented in a node. In other words, if a same k-mer is found in two different locations in a read or in two different reads, then all of the instances refer to the same node in the graph. An edge is created in the graph connecting the k-mers that are adjacent in the read, i.e., which overlap each other by exactly k-1 letters. As an example, the string ACCAGACTGAC can be converted as shown below. Note that the k-mer GAC is found twice in the string but is represented by a single node in the graph.

Graph representing the 3-mers found in the string ACCAGACTGAC. Edges connect adjacent k-mers, i.e., k-mers that overlap each other by 2 letters. A loop occurs at the end of the graph connecting k-mer TGA to k-mer GAC, indicating the latter k-mer occurs twice in the string.

The string can be reconstructed from the graph by traversing the graph and outputting the letters that are implicit in each edge. Specifically, each edge in the graph indicates that the strings corresponding to the head and tail nodes overlap by k-1 letters (ACC and CCA overlap by two letters: CC). The string "spelled" by the edge ACC→CCA is the k-mer in the head node, followed by the letter in the tail node that is not represented in the overlap (A in this case). This is the letter that "extends" the head node as we move along the edge. You can see this in the graph shown below where each edge is augmented with this "extension" letter. You can also observe, that in a de Bruijn graph constructed from DNA strings, each node can have at most 4 neighbors following it, corresponding to an extension by each of the four possible nucleotides.

So far, it is not clear how this helps with assembly since we only talked about a single read. Let's see what happens when we construct the graph from two overlapping reads as seen in the figure below.

At the top of the image, two overlapping strings are shown. At the bottom is the de Bruijn graph that is obtained from them. The nodes are bolded if they belong to the first string, underlined if they belong to the second string, and are both bolded and underlined if they are shared by the two strings.

The two reads are highlighted, one in bold, and the other one with underlines. As you see from the graph below, several nodes (k-mers) are found in both reads (are both bold and underlined) and they allow the two sub-graphs corresponding to the individual reads to join. By traversing the joint graph we can reconstruct the string we originally started with.

It is important to note that constructing the de Bruijn graph can be done in linear time (with respect to the total length of the reads) assuming that we can identify in constant time which node corresponds to a k-mer. For relatively short k-mers, a hash table can be used for this purpose.

Differences between overlap graphs and de Bruijn graphs

Before we discuss how to use de Bruijn graphs to construct genome sequences, it is useful to spend a bit of time to identify some key differences between overlap graphs and de Bruijn graphs.

First, let's return to the overlap we have shown in the previous figure. Let's assume that one of the reads contains a small error, for example the G in the first read was incorrectly read as a T. The resulting de Bruijn graph is shown below.

As seen above, one small error causes three k-mers to change (CAG, AGA, and GAC become CAT, ATA, and TAC, respectively) and now no k-mers in the first read match k-mers in the second read, leading to a disconnected graph. Yet, computing the overlap between the reads using dynamic programming can easily overcome this error, thus the overlap graph will be able to reconstruct this genomic segment. For this reason, genome assembly algorithms using de Bruijn graph include complex error correction modules.

In contrast, however, the runtime of de Bruijn graph construction is linear, as we pointed out above, while the construction of an overlap graph requires aligning the reads to each other in a pairwise fashion, leading to an $O(N^2)$ algorithm for N reads. Furthermore, instead of the simple hash lookups used by de Bruijn graphs, overlap graphs rely on the fairly expensive dynamic programming alignment algorithm for each pair of reads. For this reason, algorithms using overlap graphs implement many optimizations such as the banded alignment technique we discussed in the alignment chapter that yields faster run times when the sequences are very similar to each other.

However, the most significant difference between the two approaches is the complexity of finding a path through the graph that represents the correct genome. As mentioned earlier, in an overlap graph we want to find a path that uses every node of the graph exactly once, i.e., a Hamiltonian path—problem that is NP-hard. What about de Bruijn graphs. We note that each edge in a de Bruijn graph corresponds to a pair of k-mers that exist within a read. Since we are trying to use all information in the reads, we are looking for a path that uses all the edges in the graph, and since we want to find a parsimonious solution, we will be looking for the shortest such path. This corresponds to the route inspection problem—a computational problem related to the Eulerian tour problem, and which has a simple polynomial time solution. Thus, solving the assembly problem using de Bruijn graphs is (at least theoretically) much more efficient.

In reality, the situation is more complex, and actually both formulations of assembly are computationally intractable for practical applications, however that discussion is beyond the scope of this chapter.

Exercises

Construct the de Bruijn graph for k = 3 (nodes of the graph represent 3-mers) for the string: ATTATAATATATTA

PreviousIntroduction to sequence assembly NextFinding Eulerian tours

Last updated 6 months ago