Distance-based phylogenetic analysis
Last updated
Last updated
Since phylogenetic analysis is essentially a form of hierarchical clustering, we can use similar algorithms to the ones discussed in the clustering chapter. As we have already discussed, the BLOSUM matrix we used in sequence alignment estimates the probability that two amino-acids would be mutated into each other during a certain evolutionary time, thus, the distance estimated by the alignment process between two sequences is a good approximation of the evolutionary distance between them. To construct a phylogenetic tree, we can, thus, start with a distance matrix constructed by aligning all the sequences to each other in a pair-wise fashion, then use one of the algorithms described in the clustering chapter to construct the tree.
A popular choice is the average linkage/average distance/UPGMA algorithm, where the distance between two clusters is defined as the average distance between the sequences in the two clusters:
This algorithm can be extended to also estimate evolutionary distances, but we will not discuss this extension a this time.
Another popular algorithm is the neighbor-joining algorithm, that includes in its calculation of distance a term that takes into account not just the distance between two points, but also how far they are from the rest of the points.
There is a lot more that can be said about distance-based phylogenetic inference, but for now we will stop here. One important point to keep in mind, however, is that algorithms for distance-based phylogenetic inference are tractable — i.e., they terminate in polynomial time, since there are merges that take place (see Hierarchical clustering for a refresher) and each round of merging requires the computation of distances, each of which can be computed in polynomial time for the commonly used distances. As we will discuss in later sections, this is not true for other phylogenetic inference approaches.