Introduction to phylogenetic inference
Last updated
Last updated
One form of hierarchical clustering intends to discover the evolutionary history that gave rise to a set of sequences or traits observed today. In this version, the distance between sequences attempts to capture the evolutionary relatedness of the sequences, or the time (usually measured as number of mutations) that separates the sequences from each other.
Thus, the goal is to reconstruct a tree that best captures the "heritage" of a set of sequences—the root of the tree being the most recent common ancestor of all the sequences. In the context of protein sequences, the BLOSUM and PAM matrices discussed during sequence alignment provide a reasonable approximation of the evolutionary distance. For nucleotide sequences, several models of evolution exist that attempt to estimate the likelihood a certain base will mutate into another base during a given evolutionary time.
There are a number of applications of such analyses. Initially, trees were used to try to understand the history of life on our planet. An early drawing of such a "tree of life" appears in Darwin's notes for his work "On the origin of species".
A phylogenetic tree of mammals, for example, allows you to answer questions such as: "Are manatees related to cows?". While this may sound like just basic curiosity, knowing how different animals are related to each other can provide useful information for developing medical and veterinary treatments.
rom the figures shown above, you may infer some features of phylogenetic trees. The trees may be represented in a rooted form (as the mammal tree above), where it is obvious which node in the tree represents the common ancestor of the organisms in the tree. They could also be represented in an unrooted form (as in the SARS-CoV-2 treee), where no particular node in the tree is highlighted as the origin from which the organisms evolved. Such a representation may be more "honest" in situations in which there is no scientific evidence for any particular origin. As an example, our current understanding of life defines three main domains of life: Bacteria, Archaea, and Eukarya. Bacteria and Archaea represent unicellular organisms and their cells do not have a nucleus. Eukarya represent unicellular and multi-cellular organisms (including humans, animals, plants, etc.), and their cells have a nucleus. While Bacteria and Archaea are similar (and until recently they were confused for each other), Archaea have several features that make them more similar to Eukarya. Currently we do not know if Archaea are more related to Bacteria or Eukarya, i.e., we do not know which domain evolved first from the first living organisms that developed on Earth.
Another feature of phylogenetic trees is that the data being represented in the them are located at the leaves. These are the sequences or features of organisms that we are observing today. The internal nodes of the trees represent sequences or features of organisms that existed at some point in history but are not currently living on our planet. Sometimes, fossils can be used to verify that the inferences made by phylogenetic analysis are accurate, or to estimate rates of evolution.
Phylogenetic trees are usually binary trees (each node has exactly two children), in keeping with the principle of parsimony—it's most likely that at a particular point in time, a particular species breaks off into two (rather than more) different species.
As a final feature, you will note that the trees in the figures above have branches of different lengths (particularly notable in the SARS-CoV-2 tree). These represent the amount of evolution, or the amount of time that occurred between evolutionary events (nodes in the tree). It is important to note that the speed of evolution changes (for example pathogens mutate more rapidly when they are challenged with drugs or vaccines), and one of the goals of phylogenetic analysis is to estimate this rate of evolution. For now, we will, however, ignore this factor.
This format uses parentheses to highlight the tree structure — each node is represented as a set of closed parentheses, and within each node, the children of the node are separated out by commas. For example(,)
is a tree node with two children, assuming the children do not have any name. A tree node with three children would look like (,,)
. If we assign names to leaves, e.g., a node with two children called A and B, we simply write those names in the node: (A, B)
. Furthermore, additional information about the nodes can be added, separated by a colon from the node being referenced. For example, if we want to indicate that child A is at distance 0.5 from the node we are representing, while child B is at distance 1, we can write: (A:0.5, B:1)
.
Of course, we are talking about trees here, and you should by now be familiar with the fact that trees are intimately connected to recursion, thus, this simple node representation can be applied recursively. Let's represent a complete binary tree that has 4 leaves, A, B, C, and D. Leaves A and B have the same parent, as do leaves C and D, then those corresponding parents are the children of the root. This tree can be represented as: ((A, B), (C,D))
. The "depth"of the different children doesn't have to be the same, for example: ((A, B), C)
represents a tree that has 3 leaves. A and B are siblings, and C is a sibling of the parent of A and B. Internal nodes can also receive labels by attaching the corresponding label at the end of the closing parenthesis for the node. Say we want to give the "name" E to the node (A, B)
, and to call the root R, the three-node tree we just represented will become: ((A, B)E, C)R
.
EXERCISE: Write code that converts a Newick-formatted file into corresponding data structure in your favorite programming language.
EXERCISE: Write code that prints out a tree data structure into a Newick-formatted file, using your favorite programming language.
hylogenetic trees can also be used to explore the evolution of pathogens with an eventual goal of being able to predict the emergence of new mutations or strains. A great example is the analysis of the various strains and variants of the SARS-CoV-2 virus during the COVID-19 pandemic. A tree of SARS-CoV-2 strains is shown below, generated at . This website allows you to explore the emergence of strains across time and geography and also includes similar analyses for flu.
In the section, we showed two ways in which DNA sequences are represented (the FASTA and fastq file formats). How do computational biologists store more complex information, such as phylogenetic trees? Below we'll briefly describe some of the file formats, the .