Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Introduction
  • Features of phylogenetic trees
  • Representing trees in a computer-readable format
  1. Phylogenetic analysis

Introduction to phylogenetic inference

PreviousHierarchical clusteringNextDistance-based phylogenetic analysis

Last updated 4 months ago

Introduction

One form of hierarchical clustering intends to discover the evolutionary history that gave rise to a set of sequences or traits observed today. In this version, the distance between sequences attempts to capture the evolutionary relatedness of the sequences, or the time (usually measured as number of mutations) that separates the sequences from each other.

Thus, the goal is to reconstruct a tree that best captures the "heritage" of a set of sequences—the root of the tree being the most recent common ancestor of all the sequences. In the context of protein sequences, the BLOSUM and PAM matrices discussed during sequence alignment provide a reasonable approximation of the evolutionary distance. For nucleotide sequences, several models of evolution exist that attempt to estimate the likelihood a certain base will mutate into another base during a given evolutionary time.

There are a number of applications of such analyses. Initially, trees were used to try to understand the history of life on our planet. An early drawing of such a "tree of life" appears in Darwin's notes for his work "On the origin of species".

Page from Darwin's notebook. Image source: https://en.wikipedia.org/wiki/Tree_of_life_(biology)

A phylogenetic tree of mammals, for example, allows you to answer questions such as: "Are manatees related to cows?". While this may sound like just basic curiosity, knowing how different animals are related to each other can provide useful information for developing medical and veterinary treatments.

Features of phylogenetic trees

rom the figures shown above, you may infer some features of phylogenetic trees. The trees may be represented in a rooted form (as the mammal tree above), where it is obvious which node in the tree represents the common ancestor of the organisms in the tree. They could also be represented in an unrooted form (as in the SARS-CoV-2 treee), where no particular node in the tree is highlighted as the origin from which the organisms evolved. Such a representation may be more "honest" in situations in which there is no scientific evidence for any particular origin. As an example, our current understanding of life defines three main domains of life: Bacteria, Archaea, and Eukarya. Bacteria and Archaea represent unicellular organisms and their cells do not have a nucleus. Eukarya represent unicellular and multi-cellular organisms (including humans, animals, plants, etc.), and their cells have a nucleus. While Bacteria and Archaea are similar (and until recently they were confused for each other), Archaea have several features that make them more similar to Eukarya. Currently we do not know if Archaea are more related to Bacteria or Eukarya, i.e., we do not know which domain evolved first from the first living organisms that developed on Earth.

Another feature of phylogenetic trees is that the data being represented in the them are located at the leaves. These are the sequences or features of organisms that we are observing today. The internal nodes of the trees represent sequences or features of organisms that existed at some point in history but are not currently living on our planet. Sometimes, fossils can be used to verify that the inferences made by phylogenetic analysis are accurate, or to estimate rates of evolution.

Phylogenetic trees are usually binary trees (each node has exactly two children), in keeping with the principle of parsimony—it's most likely that at a particular point in time, a particular species breaks off into two (rather than more) different species.

As a final feature, you will note that the trees in the figures above have branches of different lengths (particularly notable in the SARS-CoV-2 tree). These represent the amount of evolution, or the amount of time that occurred between evolutionary events (nodes in the tree). It is important to note that the speed of evolution changes (for example pathogens mutate more rapidly when they are challenged with drugs or vaccines), and one of the goals of phylogenetic analysis is to estimate this rate of evolution. For now, we will, however, ignore this factor.

Representing trees in a computer-readable format

This format uses parentheses to highlight the tree structure — each node is represented as a set of closed parentheses, and within each node, the children of the node are separated out by commas. For example(,) is a tree node with two children, assuming the children do not have any name. A tree node with three children would look like (,,) . If we assign names to leaves, e.g., a node with two children called A and B, we simply write those names in the node: (A, B). Furthermore, additional information about the nodes can be added, separated by a colon from the node being referenced. For example, if we want to indicate that child A is at distance 0.5 from the node we are representing, while child B is at distance 1, we can write: (A:0.5, B:1).

Of course, we are talking about trees here, and you should by now be familiar with the fact that trees are intimately connected to recursion, thus, this simple node representation can be applied recursively. Let's represent a complete binary tree that has 4 leaves, A, B, C, and D. Leaves A and B have the same parent, as do leaves C and D, then those corresponding parents are the children of the root. This tree can be represented as: ((A, B), (C,D)) . The "depth"of the different children doesn't have to be the same, for example: ((A, B), C) represents a tree that has 3 leaves. A and B are siblings, and C is a sibling of the parent of A and B. Internal nodes can also receive labels by attaching the corresponding label at the end of the closing parenthesis for the node. Say we want to give the "name" E to the node (A, B), and to call the root R, the three-node tree we just represented will become: ((A, B)E, C)R .

EXERCISE: Write code that converts a Newick-formatted file into corresponding data structure in your favorite programming language.

EXERCISE: Write code that prints out a tree data structure into a Newick-formatted file, using your favorite programming language.

Mammal evolutionary tree. Image source: https://commons.wikimedia.org/wiki/File:The_Ancestors_Tale_Mammals_cladogram.png

hylogenetic trees can also be used to explore the evolution of pathogens with an eventual goal of being able to predict the emergence of new mutations or strains. A great example is the analysis of the various strains and variants of the SARS-CoV-2 virus during the COVID-19 pandemic. A tree of SARS-CoV-2 strains is shown below, generated at . This website allows you to explore the emergence of strains across time and geography and also includes similar analyses for flu.

In the section, we showed two ways in which DNA sequences are represented (the FASTA and fastq file formats). How do computational biologists store more complex information, such as phylogenetic trees? Below we'll briefly describe some of the file formats, the .

https://nextstrain.org
introduction to biology
Newick format
Handwritten drawing of an evolutionary tree from Darwin's notes.
Tree showing relationships between different strains of SARS-CoV2
Phylogenetic tree of SARS-CoV-2 strains colored by clade (variant). From:
https://nextstrain.org
A tree showing evolutionary relationships between mammals.