Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Introduction
  • Solving small parsimony
  • Sankoff's algorithm walk-through
  • From parsimony to likelihood
  • Solving large parsimony
  • Exercises
  1. Phylogenetic analysis

Trait-based phylogenetic inference

PreviousDistance-based phylogenetic analysisNextIntroduction to sequence assembly

Last updated 4 months ago

Introduction

Another way of thinking about phylogenetic analysis is to focus on the individual letters in a sequence, or on individual traits (or characters) in an organism. Using such an approach allows phylogenetic analysis even if there is no easy way to define a distance between the organisms' features. For example, wings, arms, and flippers are corresponding traits of organisms, but there's no obvious way to define how similar or different they are from each other. Before DNA sequencing became widely used, phylogenetic analyses primarily relied on this trait-based approach. This approach can also be used for DNA or protein sequences, by treating each individual letter as a different trait.

IMPORTANT: The Pevzner and Compeau textbook refers to this analysis as "character-based tree reconstruction". I have chosen to use the word trait in order to avoid the confusion between character, as in a characteristic of an organism; and character, as in a letter in a biological sequence.

We can try to formalize the problem of phylogenetic inference:

Phylogenetic inference: Given a set of organisms (or sequences) and associated traits (nucleotides or amino-acids in the case of sequences), identify the phylogenetic tree that best captures the shared evolutionary history of the organisms.

When reading this formulation, you should feel a bit uneasy. While the input is clear, as is the tree structure we are trying to develop, the objective function is certainly non-rigorous: best captures the shared evolutionary history.

For now, let's try to focus on one formulation of this objective—parsimony. By parsimony, we mean that we want to define the tree that implies the fewest changes between the traits (e.g., an arm evolving into a wing, or an A evolving into a C). We can, thus, add a bit more rigor to the problem described above:

Phylogenetic inference: Given a set of organisms (or sequences) and associated traits (nucleotides or amino-acids in the case of sequences), identify the phylogenetic tree that implies the fewest changes between traits.

If you think about this a little bit, you will note that there's no easy way to measure the number of changes in a tree, if all the traits are represented at the leaves. To address this problem, we redefine the problem once again:

Phylogenetic inference: Given a set of organisms (or sequences) and associated traits (nucleotides or amino-acids in the case of sequences), identify the phylogenetic tree and assignment of traits at internal nodes in the tree, that imply the fewest changes between traits.

Note that we've made the problem a bit more difficult by asking not just for a tree, but also for an assignment of traits at internal nodes. You'll see soon that this does not actually increase the computational complexity of solving the problem.

Before we proceed further, it is important to note that the search space is extremely large. There are (2n−3)!2n−2(n−2)!(2n -3)! \over 2^{n-2} (n-2)!2n−2(n−2)!(2n−3)!​ possible binary trees with n leaves, and for each of them you have an exponential number of possible assignments of traits to internal nodes. If we use DNA letters as traits, given there are O(n–1)O(n – 1)O(n–1) internal nodes in a tree with n leaves, we have O(4n−1)O(4^{n-1})O(4n−1) different assignments. Thus, the total number of possible solutions is the product of these two numbers:

4n−1(2n−3)!2n−2(n−2)!=2n(2n−3)!(n−2)!4^{n-1} {(2n -3)! \over 2^{n-2} (n-2)!} = 2^n {(2n-3)! \over (n-2)!}4n−12n−2(n−2)!(2n−3)!​=2n(n−2)!(2n−3)!​

By now you should be familiar with some strategies for searching for a solution in such a large space. A typical paradigm alternates between two steps:

  1. computing the score of a potential solution

  2. selecting a new solution that is likely to have a better score

You have seen examples in the Gibbs sampler and k-means clustering. In the context of phylogenetic analysis, the first step is called small parsimony, while the overall goal of reconstructing a tree with the smallest possible parsimony score is referred to as large parsimony.

Solving small parsimony

To briefly recap, we are now focusing on the first step of an algorithm for searching the large space of trees in order to find the phylogenetic tree that best explains the evolutionary history of a group of modern-day organisms. To simplify things, we will assume that the traits we are analyzing are letters in the DNA sequence of a gene that is found in all the organisms of interest. Thus, we can view each trait as an individual character with four possible values: A, C, G, and T. Given a tree, with characters at its leaves, we want to determine the characters assigned to internal nodes such that the parsimony score of the tree is minimized. The parsimony score is simply the number of edges in the tree that have different letters their opposite ends, indicating that an evolutionary change occurred. An example is shown below:

To formalize the problem we are trying to solve:

Small parsimony problem: Given a tree and single character labels at its leaves, determine the assignment of characters at internal nodes such that the parsimony score of the tree is minimized.

Before we proceed with a discussion of the algorithm, I would like to mention that usually, the leaves contain more than one character, e.g., the entire sequence of a gene. When computing the parsimony, however, each individual character is independent of the others, thus it is sufficient to solve the problem for a single character. If provided a sequence, the answer is simply the concatenation of the solutions for each separate character, and the total parsimony score is simply the sum of the parsimony scores for each character.

We will attempt to solve the small parsimony problem using dynamic programming. As we did in the case of alignment, during the first stage of the algorithm we will figure out the optimal score, then we will use backtracking to figure out the actual assignment of letters to each internal node of the tree.

In other words, we are minimizing the left and right children of the current node separately, and picking the assignment at each child that minimizes the sum of the transition from parent to child, plus the corresponding value in the child.

As in the case of the alignment algorithm, ties can be broken arbitrarily, yielding one of possibly many solutions that all have the same minimal parsimony score.

Sankoff's algorithm walk-through

To clarify the algorithm described above, we will follow through a small example comprising four nodes of a larger phylogenetic tree. We assume the top node is the root of this tree, and that some of the nodes have already been filled in during the post-order traversal computation of the parsimony scores. To make it easier to follow the algorithm, we gave each node a name (a – e). A figure showing this partial tree is shown below. The tree can also be represented in the Newick format as:

((c:[3,2,1,2], d:[1,3,2,3])b:[,,,], e:[1,3,2,2])a:[,,,]

Since the values stored at both children of node b are already computed, we have enough information to compute the values that are stored at node b. To compute the value that will be stored for A, we identify the lowest-cost choice on the left and right independently and then add up their values. On the left side (going from b to c) we have the following 4 choices, from which we take the minimum:

While on the right (going from b to d) we have:

Similarly:

And so on, yielding the tree:

Now both children of the root (node a) have their values computed, and we can use the same process to fill in the values at the root, yielding the tree:

We can now start the backtracking process. We note that the lowest score at the root (node a) corresponds to the letter A, thus we select that choice, shown as a shaded box in the figure below:

In the Newick format, this tree can be now represented as:

((c:G, d:A)b:A, e:A)a:A

I encourage you to draw arbitrary trees and assign arbitrary letters at the leaves, then try to run through the algorithm a couple of times to make sure you understand how it works, and to get the practice necessary to ensure you will be able to solve problems assigned on quizzes and exams.

From parsimony to likelihood

When we discussed sequence alignment, we made a transition from an edit distance formulation, where we tried to minimize the number of edits between two sequences, to an alignment score, that took into account substitutions scores represented in a matrix such as BLOSUM. Both variants could be solved with the same algorithm by simply switching from the minimum to the maximum between a number of choices. The same can be done in the case of phylogenetic tree inference. If we are given a score that defines how likely it is for a character to change into another during an evolutionary event, we can modify the recurrence described above:

A

C

G

T

A

8

2

3

2

C

2

8

2

3

G

3

2

8

2

T

2

3

2

8

Here the score represents the likelihood a particular letter would evolve into a different one. You see a higher likelihood that a letter stays the same (the numbers on the diagonal being higher) as well as a higher likelihood of a transition (mutations between As and Gs, and between Cs and Ts, which are similar molecules) than a transversion (mutations between As and Ts or Cs; or between Gs and Ts or Cs). As and Gs are chemically similar molecules called purines, while Cs and Ts are molecules called pyrimidines.

A

C

G

T

A

8

2

3

2

C

2

8

2

5

G

4

2

8

2

T

2

3

2

8

In the matrix above, the values for C-T and G-A (in bold and highlight) are higher than the values for T-C and A-G, accounting for this phenomenon. Sankoff's algorithm can be used without modification, but it's implementation has to be done carefully to ensure the correct directionality is respected.

Again, Sankoff's algorithm can be directly applied to solve this problem, except that now the information from the children will be aggregated by multiplication at each node (unlike scores, probabilities get multiplied). In practice, however, most algorithms dealing with probabilities work in logarithmic space to avoid under-flows (after just a few multiplications the numbers become infinitesimally small), thus reverting to an additive formulation for Sankoff's algorithm.

Solving large parsimony

While Sankoff's algorithm provides a way to evaluate (in polynomial time) the quality of a given tree, finding the tree that maximizes the quality is NP-hard. Most algorithms for constructing phylogenetic trees rely on heuristic search through the space of possible trees. A discussion of these algorithms is beyond the scope of the material presented here.

Exercises

  1. Using Sankoff's algorithm, calculate the maximum score for the phylogenetic tree shown below, as well as the characters at each internal node. Note that the scoring matrix is not symmetric—the score of evolving A into T is different from the score of evolving T into A. In the scoring matrix, assume that the row represents the parent and the column the child in the tree. For example, if a node is assigned A and it's child C, the score along the branch is 3. If the parent is C and the child is A, the score is only 2. At each internal node in the tree please provide both the four scores corresponding to each individual nucleotide being present at that node, and the inferred ancestral sequence at that node (by circling that letter at the node). If multiple ancestral sequences are possible at each node, please pick one of them (e.g. the answer A/T is not acceptable)

A

C

G

T

A

5

3

-1

0

C

2

5

0

-2

G

0

1

5

2

T

-1

-1

2

5

To get a better handle on how the algorithm proceeds, let us focus on just one node and its two children. Since we do not know what letter will be assigned to any of the nodes, we will have to keep track of all possible solutions. Specifically, at each node v, we will store an array of length 4 (the number of possible choices) that contains the parsimony score assuming the node is assigned the corresponding letter. More formally, we'll define sk(v)s_k(v)sk​(v) to be the score of the tree rooted at v if character k is assigned at v. For example, sA(v)s_A(v)sA​(v) represents the score of the tree rooted at v that has A at the root. See the figure below for an example.

Thus, at each node in the tree we will store the sk(V)s_k(V)sk​(V) array:

Since we know the characters at the leaves, and we do not want to allow the algorithm to select any other choice, we will set sk(v)=0s_k(v) = 0sk​(v)=0 for the value of k equal to the character assigned to the leaf, and sk(v)=∞s_k(v) = \inftysk​(v)=∞ for all other characters. An example is shown below for a tree that has leaf labels A and C.

All that remains to figure out is the recurrence equation. Here, we'll assume the values are computed for the children, and we want to compute them for the parent. Let us take the example in the figure above. Let us assume we want to fill in the value for sA(root)s_A(\text{root})sA​(root) . Of course the value depends on the actual choices for the two children, left(root)\text{left}(\text{root})left(root) and right(root)\text{right}(\text{root})right(root). If we select A at left(root)\text{left}(\text{root})left(root) and C at right(root)\text{right}(\text{root})right(root), we see that sA(root)=1s_A(\text{root}) = 1sA​(root)=1 because the right edge (from A to C) implies one evolutionary event. What if we select A at right(root)\text{right}(\text{root})right(root)? Now the right edge has a cost of 0 (from A to A), however the total cost on the right side is ∞\infty∞ because sA(right(v))=∞s_A(\text{right}(v)) = \inftysA​(right(v))=∞ and 0+∞=∞0 + \infty = \infty0+∞=∞ . Since we are trying to minimize parsimony, the first choice was better. In general:

sk(v)=min⁡i{si(left(v))+δi,k}+mini{si(right(v))+δi,k}s_k(v) = \min_i\{s_i(\text{left}(v)) + \delta_{i,k}\} + min_i\{ s_i(\text{right}(v)) +\delta_{i,k}\}sk​(v)=mini​{si​(left(v))+δi,k​}+mini​{si​(right(v))+δi,k​}

where δi,k\delta_{i, k}δi,k​ is the cost of going from letter i to letter k: δi,k=(i=k?0:1)\delta_{i,k} = (i = k ? 0 : 1)δi,k​=(i=k?0:1) . Alternatively written:

δi,k={0 if i=k1 if i≠k\delta_{i,k} = \left \{\begin{matrix}0 \text{ if } i = k \\ 1 \text{ if } i \ne k \end{matrix} \right .δi,k​={0 if i=k1 if i=k​

Using this recurrence, the tree can be filled in using a post-order traversal. The parsimony score of the entire tree is found at the location argminksk(root)\text{argmin}_k s_k(\text{root})argmink​sk​(root) , or the smallest value in the s array at the root of the tree.

To figure out the actual letters, we backtrack down the tree, starting with the smallest value in the array located at root. Since the left and right children are independent, we can recurse on the two sub-trees. At each step in the backtracking process, we know the letter assigned to the parent of the node, and we need to determine the letter assigned to the current node. That letter will be the one that led to the parent's letter to be chosen. Specifically, assume we are at node v in the tree and the parent is assigned letter k. We will select letter j at node v as argmini(si(v)+δi,k)\text{argmin}_i (s_i(v) + \delta_{i,k})argmini​(si​(v)+δi,k​) . We can then recurse on the two children of v and proceed until we reach the leaves.

The algorithm described in this section is due to and is commonly known as Sankoff's algorithm.

The array listed as an annotation for each node represents the score for the letters A, C, G, T, in this specific order. That is sA(c)=3,sG(d)=2s_A(c) = 3, s_G(d) = 2sA​(c)=3,sG​(d)=2, etc., using the notation we used earlier.

min⁡{sA(c)+0=3sC(c)+1=3sG(c)+1=2sT(c)+1=3=2\min\left \{ \begin{matrix} s_A(c) + 0 = 3 \\ s_C(c)+1 =3 \\ s_G(c) + 1 = 2 \\ s_T(c) + 1 = 3 \end{matrix}\right . = 2min⎩⎨⎧​sA​(c)+0=3sC​(c)+1=3sG​(c)+1=2sT​(c)+1=3​=2

min⁡{sA(d)+0=1sC(d)+1=4sG(d)+1=3sT(d)+1=4=1\min\left \{ \begin{matrix} s_A(d) + 0 = 1 \\ s_C(d)+1 =4 \\ s_G(d) + 1 = 3 \\ s_T(d) + 1 = 4 \end{matrix}\right . = 1min⎩⎨⎧​sA​(d)+0=1sC​(d)+1=4sG​(d)+1=3sT​(d)+1=4​=1

The value of sA(b)s_A(b)sA​(b) is the sum of the two values, i.e., 3.

sC(b)=min⁡{sA(c)+1=4sC(c)+0=2sG(c)+1=2sT(c)+1=3+min⁡{sA(d)+1=2sC(d)+0=3sG(d)+1=3sT(d)+1=4=4s_C(b) = \min\left \{ \begin{matrix} s_A(c) + 1 = 4 \\ s_C(c)+0 =2 \\ s_G(c) + 1 = 2 \\ s_T(c) + 1 = 3 \end{matrix}\right . + \min\left \{ \begin{matrix} s_A(d) + 1 = 2 \\ s_C(d)+0 =3 \\ s_G(d) + 1 = 3 \\ s_T(d) + 1 = 4 \end{matrix}\right . = 4sC​(b)=min⎩⎨⎧​sA​(c)+1=4sC​(c)+0=2sG​(c)+1=2sT​(c)+1=3​+min⎩⎨⎧​sA​(d)+1=2sC​(d)+0=3sG​(d)+1=3sT​(d)+1=4​=4

We now have to figure out what letter assignment at nodes b and e leads to the score sA(a)s_A(a)sA​(a). On the right, we see that the best choice is to select A, since it will yield a total cost of 1. On the left, the best choice is again A, with a total cost of 3, which summed up with the score from the right yields the value 4 represented at node a. Thus, we can make the assignments of letter A for both nodes b and e.

We can now proceed to figure out the letters stored at nodes c and d. At node c, the best choice is the letter G, which yields a total cost of 2 (1 from sG(c)s_G(c)sG​(c) ) and another 1 from the mutation from A to G). While the cost of going from A to A is 0, sA(c)=3s_A(c) = 3sA​(c)=3, thus choosing A at node c would be sub-optimal. At node d, A is the best choice, yielding a total score of 1. This results in the assignments:

sk(v)=max⁡i{si(left(v))+δi,k}+max⁡i{si(right(v))+δi,k}s_k(v) = \max_i \{s_i(\text{left}(v)) + \delta_{i,k}\} + \max_i\{s_i(\text{right}(v))+\delta_{i,k}\}sk​(v)=maxi​{si​(left(v))+δi,k​}+maxi​{si​(right(v))+δi,k​}

In this case the score δi,k\delta_{i, k}δi,k​ will be defined by a matrix such as:

Note that the matrix doesn't have to be symmetrical as shown above. The tree has a natural orientation, with evolution occurring from parent to child, thus, we can account for uni-directional evolutionary events, such as the mutation of Cs into Ts by deamination (discussed in the).

The scores represented by δi,k\delta_{i, k}δi,k​ can also be the actual likelihoods that the two letters have evolved into each other, and can take into account the length of the branch, i.e., the evolutionary time that has passed between two nodes in the tree, which will impact the likelihood that a letter mutates into another. That is, we can define δi,k(t)\delta_{i, k} (t)δi,k​(t) where t is the time that "elapsed" along an edge of the tree. In this case the phylogenetic inference is also referred to as maximum likelihood, as we are trying to find the tree that maximizes the likelihood that the organisms at the leaves are related to each other as described by the tree.

David Sankoff
early chapter on patterns in DNA
Representation of a tree that has DNA letters at the leaves and internal nodes. The number 1 is written on edges that have a different label at either end of the edge.
Parsimony score. Edges that represent evolutionary changes are highlighted with the number 1. The total parsimony score of this tree is 4.
two small trees both with leaves labeled A and C, but each having a different label at the root, the first tree  has an A and the second has a T.
The score at node v. sA(v)=1s_A(v) = 1sA​(v)=1 since one edge entails a change from A at the root to C at the leaf, while sT(v)=2s_T(v) = 2sT​(v)=2
Small three-node tree, root and two leaves. At all nodes there's a blank array that contains 4 cells for each of the 4 DNA letters.
sk(v)s_k(v)sk​(v) array at each node in the tree
Small three-node tree, root and two leaves. At all nodes there's an array that contains 4 cells for each of the 4 DNA letters.  At the leaves, the arrays are filled with either 0 for the letter represented in the leaf, or infinity symbol for the other letters.
Setting the scores at leaves.
Representation of a section of a phylogenetic trees containing nodea a through e. Node a is the root and  has children b and e. Node b has children d and e. Each node has an associated array containing the parismony scores for the letters A, C, G, and T in this order. The values stored at node c are 3, 2, 1, 2. The values stored at node d are 1, 3, 2, 3 . The values stored at node e are 1, 3, 2, 2. Nodes b and a (the root) do not yet have any values computed.
Representation of a section of a phylogenetic trees containing nodea a through e. Node a is the root and  has children b and e. Node b has children d and e. Each node has an associated array containing the parismony scores for the letters A, C, G, and T in this order. The values stored at node c are 3, 2, 1, 2. The values stored at node d are 1, 3, 2, 3 . The values stored at node e are 1, 3, 2, 2. The values stored at node b are 3,4,3,4. Node a (the root) does not yet have any values computed.
Representation of a section of a phylogenetic trees containing nodea a through e. Node a is the root and  has children b and e. Node b has children d and e. Each node has an associated array containing the parismony scores for the letters A, C, G, and T in this order. The values stored at node c are 3, 2, 1, 2. The values stored at node d are 1, 3, 2, 3 . The values stored at node e are 1, 3, 2, 2. The values stored at node b are 3,4,3,4. The values stored at the root (node a) are 4,6,5,6.
Representation of a section of a phylogenetic trees containing nodea a through e. Node a is the root and  has children b and e. Node b has children d and e. Each node has an associated array containing the parismony scores for the letters A, C, G, and T in this order. The values stored at node c are 3, 2, 1, 2. The values stored at node d are 1, 3, 2, 3 . The values stored at node e are 1, 3, 2, 2. The values stored at node b are 3,4,3,4. The values stored at the root (node a) are 4,6,5,6.  At node a, letter A is highlighted by a shaded box.
Representation of a section of a phylogenetic trees containing nodea a through e. Node a is the root and  has children b and e. Node b has children d and e. Each node has an associated array containing the parismony scores for the letters A, C, G, and T in this order. The values stored at node c are 3, 2, 1, 2. The values stored at node d are 1, 3, 2, 3 . The values stored at node e are 1, 3, 2, 2. The values stored at node b are 3,4,3,4. The values stored at the root (node a) are 4,6,5,6. At nodes a,b, and e, letter A is highlighted by a shaded box.
Representation of a section of a phylogenetic trees containing nodea a through e. Node a is the root and  has children b and e. Node b has children d and e. Each node has an associated array containing the parismony scores for the letters A, C, G, and T in this order. The values stored at node c are 3, 2, 1, 2. The values stored at node d are 1, 3, 2, 3 . The values stored at node e are 1, 3, 2, 2. The values stored at node b are 3,4,3,4. The values stored at the root (node a) are 4,6,5,6. At nodes a,b, d, and e, letter A is highlighted by a shaded box. At node c, letter G is highlighted by a shaded box.
Phylogenetic tree on the set of leaves, C, T, G, T, A, T. The structure of the tree is represented by the Newick string:  (((C, T), ((G, T), A)), T)