Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Greedy solution to the Eulerian tour problem
  • Spanning tree based solution to the Eulerian tour problem
  • Exercises
  1. Sequence assembly

Finding Eulerian tours

PreviousGraph formulations of sequence assemblyNextIntroduction to sequence annotation

Last updated 4 months ago

Let us focus for now on solving the Eulerian tour problem, solution which can then be extended to address the route inspection problem, thus solving the assembly problem for de Bruijn graphs. First a definition:

Eulerian tour problem. Given a graph G(V, E), find a tour through the graph (traversal that starts and ends at the same node) that visits every edge of the graph exactly once.

This problem was initially analyzed by the mathematician (hence the name) who lived at the time in a city named Königsberg. This city had seven bridges across a river that broke up the city into four different neighborhoods. Euler wondered if he could start a walking tour from his house and return home after traversing all the bridges but without crossing any of them twice. The corresponding graph is shown below.

Euler observed that a necessary condition for solving the problem was for each node to have an even degree (each node has an even number of neighbors). The reasoning is that as you "walk" along the graph, once you enter a node you must have a way to exit the node. Since you can't use an edge twice, you must choose an edge you haven't seen before, thus each edge through which you entered a node must have a corresponding edge through which you can leave, leading to the even degree. If a node has an odd degree, you will either get stuck because you enter the node using the last available edge, thus cannot leave it, or you cannot return back to it because you will no longer have an available edge. As you can tell from the figure above, the graph corresponding to the city of Königsberg has only nodes with odd degrees, thus Euler was not able to enjoy a perfect walk.

The description so far has focused on undirected graphs (where the edges can be traversed in either direction), however the Eulerian tour problem can also be addressed on directed graphs such as de Bruijn graphs. Imagine, for example, that the streets in Königsburg are all one way. Instead of simply requiring that each node have even degree, we must add the additional condition that the number of in-edges is equal to the number of out-edges at each node. Or, in the street metaphor, for every one way street entering a neighborhood there should be a one way street exiting the neighborhood.

Is the condition that all nodes have odd degrees sufficient for finding an Eulerian tour? In the following section we'll describe an algorithm that demonstrates that it is.

Greedy solution to the Eulerian tour problem

The greedy solution to the Eulerian tour problem follows the same strategy you might also employ if walking around Königsberg.

EulerianTour (graph G):
  return False if G is not Eulerian
  Tour = {}
  while nodes exist with unvisited edges
     select an arbitrary node n
     while n has at least one unvisited edge e
       add e to Tour
       mark e as visited
       set n = tail(e)

Essentially, this algorithm picks a node that has unvisited edges adjacent to it, then traverses the graph along unvisited edges until no such edges exist. The following panels demonstrate its execution in a simple graph.

Starting with node A in the top left panel in the figure below, we proceed to node C (thickened edge), then E, then F, then return to A. Since A still has unvisited edges, we choose one of them and move to node E, which still has an available edge to D, then we proceed to C, and then B, and then return to A (bottom right panel). At this point A does not have any unvisited edges adjacent to it and we can terminate the algorithm, reporting the full tour: A, C, E, F, E, D, C, B, A.

Note, however, that we got quite lucky. An alternative traversal is shown below. Assume here that we started at node F, followed to A, then E, then returned to F (the first row of images in the figure). At this point, the algorithm is stuck as node F has no more adjacent edges that have not been visited. As described above, the algorithm now looks for a new node that has unvisited edges, for example C, and continues the traversal as highlighted in red in the figure. Now we can proceed from C to A, to B, back to C, then D, E, and back to C again, completing another cycle. We have managed to visit every edge in the graph, exactly once, however we have not been able to do so with one cycle.

It turns out that there's a simple solution to this problem. We can select one of the nodes where the two cycles meet, then "stitch" the two cycles together by swapping the order in which we visit the edges. Take node A for example. In the original tour, we visited this node coming from F, then going to E, along the black tour, then coming from C and proceeding to B in the red tour. We can swap the traversal, by going to B after coming from F, thus breaking the black cycle and proceeding along the red one. Eventually the red cycle will each A again (arriving from C) and now we can return to the black cycle continuing to E, as shown by the thin arrows in the figure below. We have, thus, effectively combined the black and red cycles into a single cycle, thus solving the Eulerian tour problem.

It should be fairly easy to see that the algorithm runs in linear time in terms of the number of edges and nodes in the graph as we only visit each edge and node a constant number of times.

Note that this algorithm works even if the graph is directed, we simply have to take the orientation of each edge in consideration.

Spanning tree based solution to the Eulerian tour problem

The greedy solution proposed above is somewhat dissatisfying as we need to return and fix problems (join cycles) after the initial greedy stage. An elegant solution that avoids this problem uses a spanning tree of the graph to guide the traversal. As a reminder, a spanning tree of a graph is a tree that includes every node in the graph. Computing a spanning tree is computationally easy: we keep adding edges to the tree as long as they don't form cycles, i.e., an edge is added to the tree if at least one of its ends is not already in the tree. This more elegant algorithm proceeds as follows:

EulerianTour (graph G):
  compute spanning tree of graph G

  Tour = {}
  pick an arbitrary node n in the tree
  while n has unvisited edges adjacent to it:
    if there exist an edge e that is adjacent to n and is not part of
       the spanning tree:
       select e
    else:
       select an arbitrary unvisited edge e adjacent to n
    
    add e to Tour
    mark e as visited
    set n = tail (e)

Effectively, we use the spanning tree as an escape strategy. We only include spanning tree edges into the tour if no other edges are available, thus ensuring the greedy traversal cannot get stuck at any node until the tour is complete.

An example is shown below. The top left panel shows the spanning tree in thick edges, then the progressive construction of the tour is shown in red. Starting at node F, we select the edge to E as it is the only unvisited edge available that is not part of the spanning tree. At node E our only choice are spanning tree edges, and we select an arbitrary one, proceeding to node C. There we proceed along one of the non-spanning tree edges to D where again our only choice is a spanning tree edge to E. There, again, we have to follow the only unvisited edge remaining to A. At a, we select the unvisited edge that is not part of the spanning tree, returning to C, where we only have one choice remaining, moving to B, then along the only available edge to A and then, finally, returning to F along the last unvisited edge in the graph. In the end, the cycle we have found is: F, E, C, D, E, A, C, B, A, F.

The reason why this algorithm works is that there are no cycles in the graph that are completely built from tree edges or from non-tree edges, thus the greedy algorithm cannot get stuck.

In a directed graph, we will want to create a spanning tree that has all edges pointing towards its root, and start the guided traversal from the root of the tree, ensuring that there will always be an escape path leading back to the starting point.

Exercises

  1. Is the following graph Eulerian?

Leonhard Euler
Graph corresponding to the bridges of Königsberg. Each edge represents a bridge, while each node is a separate neighborhood.
Greedy algorithm for finding an Eulerian tour. Starting from top left, each panel highlights with thickened edges the growing Eulerian tour computed by the algorithm.
The greedy algorithm can get stuck, leading to two independent cycles highlighted in black and red.
Eulerian tour construction guided by a spanning tree. The spanning tree is highlighted by thick edges. The progressive construction of the tour is shown in red. This algorithm yields a single cycle, i.e., does not require fixing partial cycles.
Graph representing the bridges of the city of Königsberg. Four nodes are connected in a circle. The node on the West side is additionally connected to all the other nodes.
Nine panels highlighting copies of the same graph defined on nodes A through F. Each panel shows with thickened edges consecutive stages in the greedy algorithm for finding an Eulerian tour.
Nine panels highlighting copies of the same graph defined on nodes A through F. Each panel shows with thickened edges consecutive stages in the greedy algorithm for finding an Eulerian tour. Red and black edges represent the two cycles/tours constructed by the algorithm.
Four edges, two red and two black, highlight a node where the two cycles discovered by the Eulerian tour algorithm meet.  Thin arrows highlight how changing the order of visiting the edges around the shared node can result in the two cycles being stitched together into a single cycle.
Nine panels highlighting copies of the same graph defined on nodes A through F. Each panel shows with thickened edges consecutive stages in the spanning tree algorithm for finding an Eulerian tour. Black edges highlight a spanning tree. Red edges show the progression of the Eulerian tour algorithm, with the spanning tree edges only being used when no other options are available.
Graph that contains 7 nodes. 6 of the nodes are layed out in a hexagon with edges connecting the nodes on the periphery except for the top two nodes. A node in the center of the hexagon is connected to five of the nodes, omitting a connection to one of the nodes on the periphery that is already connected to two of its neighbors.