Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Phylogenetic analysis

Distance-based phylogenetic analysis

Since phylogenetic analysis is essentially a form of hierarchical clustering, we can use similar algorithms to the ones discussed in the clustering chapter. As we have already discussed, the BLOSUM matrix we used in sequence alignment estimates the probability that two amino-acids would be mutated into each other during a certain evolutionary time, thus, the distance estimated by the alignment process between two sequences is a good approximation of the evolutionary distance between them. To construct a phylogenetic tree, we can, thus, start with a distance matrix constructed by aligning all the sequences to each other in a pair-wise fashion, then use one of the algorithms described in the clustering chapter to construct the tree.

A popular choice is the average linkage/average distance/UPGMA algorithm, where the distance between two clusters is defined as the average distance between the sequences in the two clusters:

davg(C1,C2)=∑p∈C1∑q∈C2d(p,q)∣C1∣∣C2∣d_{avg}(C_1,C_2) = {\sum_{p \in C_1} \sum_{q \in C_2} d(p, q) \over |C_1| |C_2|}davg​(C1​,C2​)=∣C1​∣∣C2​∣∑p∈C1​​∑q∈C2​​d(p,q)​

This algorithm can be extended to also estimate evolutionary distances, but we will not discuss this extension a this time.

Another popular algorithm is the neighbor-joining algorithm, that includes in its calculation of distance a term that takes into account not just the distance between two points, but also how far they are from the rest of the points.

There is a lot more that can be said about distance-based phylogenetic inference, but for now we will stop here. One important point to keep in mind, however, is that algorithms for distance-based phylogenetic inference are tractable — i.e., they terminate in polynomial time, since there are O(n)O(n)O(n) merges that take place (see Hierarchical clustering for a refresher) and each round of merging requires the computation of O(n2)O(n^2)O(n2) distances, each of which can be computed in polynomial time for the commonly used distances. As we will discuss in later sections, this is not true for other phylogenetic inference approaches.

PreviousIntroduction to phylogenetic inferenceNextTrait-based phylogenetic inference

Last updated 4 months ago