
Hierarchical clustering


Introduction

Hierarchical clustering is a variant of agglomerative clustering (starting with each point in a separate cluster, then iteratively merging clusters) that continues the merging process until only one cluster remains. Effectively, this process constructs a tree that represents the merge operations: each node has as children the clusters that were merged to create it. An example is shown below:

[Figure: Hierarchical clustering tree for 6 points; each internal node is labeled with the cluster of leaves located below it.]

Note that this process does not explicitly cluster the original set of points into distinct clusters; rather, the user has to break up the tree in a way that meets the needs of their specific application. A great metaphor for cutting up the tree into clusters is slicing a head of broccoli with a knife. The individual florets that fall out represent the distinct clusters.

Approaches for splitting the tree range from a flat cut at a particular height/level in the tree, to approaches that consider the properties of each edge (or cluster it represents) in deciding whether to prune the tree at that location. These approaches are, however, beyond the scope of this chapter.

An example of hierarchical clustering from an actual biological dataset is presented below:

[Figure: Hierarchical clustering of patients (columns) and genes (rows) from a study of rheumatoid arthritis (RA); the RA patients and healthy controls end up in distinct branches of the tree. Image source: Panga, Venugopal; Raghunathan, Srivatsan (2018). Hierarchical clustering of RA and control samples based on the gene expression of selected key molecules in GSE7307. PLOS ONE. https://doi.org/10.1371/journal.pone.0199530.g011]

Hierarchical clustering algorithm

In theory, each merging step could involve more than two clusters; however, most approaches for hierarchical clustering restrict themselves to pairwise merge operations. The typical algorithm has the following structure:

Hierarchical_clustering(Points)

1:  D[n][n] = pairwise_distances(Points)   # pairwise distance matrix
2:  while num(Points) > 1                  # while we have more than one cluster
3:    (p1, p2) = argmin D                  # select the two points closest to each other
4:    remove p1 and p2 from Points
5:    p1 = (p1 + p2)                       # create a new cluster by merging p1 and p2
6:    add p1 to Points
7:    D = pairwise_distances(Points)       # D now has one fewer row and column
8:  return(Points)

There are a few things to note about this algorithm:

  • Just as in the case of k-means clustering, we must define a distance function between the data points. Usually, for gene expression measurements, this is the Euclidean distance, but other definitions can be used depending on the data being clustered.

  • The function argmin on line 3 simply says: pick the two points (p1 and p2) that are the closest to each other, i.e., the points that generate the lowest value in the distance matrix.

  • On line 5 we create a new data point from the two points we just merged. On line 7, we recompute the pairwise distances between points, but now one of the points is the one we just created. How do we define the distance between this new point and the other ones in the set? For example, in the first figure on this page, what is the distance between the cluster (3, 4) and point 5? This is the key difference between the various hierarchical clustering algorithms, and we'll discuss it in more detail below.
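To make the pseudocode and the notes above concrete, here is a minimal Python sketch of the same algorithm. It is illustrative rather than a definitive implementation: the names hierarchical_clustering, cluster_distance, and leaves are our own, clusters are represented as nested tuples of labels, and the pairwise distances are assumed to be given up front as a dictionary of dictionaries (as in the walk-through below); for coordinate data they would first have to be computed with, e.g., the Euclidean distance.

import itertools

def leaves(cluster):
    # A cluster is either a single label (a string) or a pair of sub-clusters.
    if isinstance(cluster, str):
        return [cluster]
    return leaves(cluster[0]) + leaves(cluster[1])

def cluster_distance(c1, c2, d, linkage):
    # Collect all pairwise distances across the two clusters and let the
    # linkage function (min, max, or an average) reduce them to a single number.
    return linkage([d[p][q] for p in leaves(c1) for q in leaves(c2)])

def hierarchical_clustering(d, linkage=min):
    # d: dictionary of dictionaries of pairwise distances, e.g. d["A"]["B"] = 4
    clusters = list(d.keys())                  # every point starts as its own cluster
    while len(clusters) > 1:                   # while we have more than one cluster
        c1, c2 = min(itertools.combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], d, linkage))
        clusters.remove(c1)                    # remove the two closest clusters...
        clusters.remove(c2)
        clusters.append((c1, c2))              # ...and add their merge, recording both children
    return clusters[0]                         # the nested tuples encode the merge tree

Because each merged cluster keeps its two children, the returned value is the tree itself; single linkage corresponds to linkage=min and complete linkage to linkage=max, as discussed in the next section.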

Computing the distance between clusters

Let's try to formalize a bit better what we are trying to achieve. Assume we have a set of points and a distance matrix that determines the pairwise distance between the points. We want to come up with a way to define distances between a point and a cluster, or between two different clusters.

In the figure below we highlight the example of a cluster comprised of points 1 and 2, and we want to compute the distance between this cluster and point 3.

[Figure: To compute the distance between point 3 and cluster (1, 2), we need the pairwise distances between point 3 and both of the clustered points.]

Since we know the distances between the three points, perhaps we can simply use those values to define the distance between the cluster (1, 2) and point 3. A natural choice is to simply select the average distance, i.e., $d((1,2), 3) = \frac{d(1,3) + d(2,3)}{2}$.

How about the situation when we compare two different clusters, as shown in the figure below?

[Figure: When comparing two clusters, such as (1, 2) and (3, 4, 5), all pairwise distances that go across the two clusters need to be taken into account.]

In this case, we can still use the average, except that now we have to average over all pairs of points in the two clusters:

$d((1,2), (3,4,5)) = \frac{d(1,3) + d(1,4) + d(1,5) + d(2,3) + d(2,4) + d(2,5)}{6}$

More generally, for clusters $C_1$ and $C_2$, we can define:

$d(C_1, C_2) = \frac{\sum_{p \in C_1} \sum_{q \in C_2} d(p, q)}{|C_1| \, |C_2|}$

Here, for two points $p$ and $q$, $d(p, q)$ is the distance between the two points according to the distance matrix, which is either computed by the clustering code or simply provided as input to the clustering algorithm.

If we use this distance measure, the hierarchical clustering algorithm is called average linkage, or average neighbor, or UPGMA (standing for unweighted pair group method with arithmetic mean).

Of course, this is just one of the possible choices for a distance. Other common choices include:

  • $d_{\min}(C_1, C_2) = \min_{p \in C_1,\, q \in C_2} d(p, q)$, also referred to as single linkage or nearest neighbor clustering.

  • $d_{\max}(C_1, C_2) = \max_{p \in C_1,\, q \in C_2} d(p, q)$, also referred to as complete linkage or farthest neighbor clustering.
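In the Python sketch shown earlier, these choices correspond simply to different linkage functions passed to the (hypothetical) hierarchical_clustering routine, for example:

# Assuming the hierarchical_clustering sketch from above and a distance
# dictionary d such as the one used in the walk-through below:
single_linkage = min      # nearest neighbor
complete_linkage = max    # farthest neighbor

def upgma(dists):
    return sum(dists) / len(dists)   # average linkage (UPGMA)

tree = hierarchical_clustering(d, linkage=upgma)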

Hierarchical clustering walk-through

Let's do a walk-through of the algorithm for the following distance matrix:

       A   B   C   D   E   F
   A   0   4   2   1   3   5
   B   4   0   8   5   7   6
   C   2   8   0   3   2   9
   D   1   5   3   0   4   5
   E   3   7   2   4   0   3
   F   5   6   9   5   3   0

We'll start with single linkage clustering, i.e., the distance between the two clusters is determined by the shortest distance between points in the two clusters.
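If you want to check the steps below programmatically, the matrix can be encoded and fed to the illustrative sketch defined earlier in this chapter (hierarchical_clustering is our own function, not a library routine):

d = {
    "A": {"A": 0, "B": 4, "C": 2, "D": 1, "E": 3, "F": 5},
    "B": {"A": 4, "B": 0, "C": 8, "D": 5, "E": 7, "F": 6},
    "C": {"A": 2, "B": 8, "C": 0, "D": 3, "E": 2, "F": 9},
    "D": {"A": 1, "B": 5, "C": 3, "D": 0, "E": 4, "F": 5},
    "E": {"A": 3, "B": 7, "C": 2, "D": 4, "E": 0, "F": 3},
    "F": {"A": 5, "B": 6, "C": 9, "D": 5, "E": 3, "F": 0},
}

print(hierarchical_clustering(d, linkage=min))
# One possible output (ties may be broken differently):
# ('B', ('F', (('A', 'D'), ('C', 'E'))))
# i.e., the same tree as ((((A, D), (C, E)), F), B) derived by hand below.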

To start, we identify the lowest value in the matrix, in this case 1, and determine which points (or clusters) correspond to this value. Here it is points A and D. Thus, the first cluster created is (A, D), represented in the tree form as:

[Figure: Points A and D joined together to form the cluster (A, D).]

Now we have to update the distance matrix to account for the new cluster. We start by removing the two rows and columns corresponding to points A and D, and insert a new row and column for the cluster (A, D):

            (A, D)   B   C   E   F
   (A, D)      0     ?   ?   ?   ?
   B           ?     0   8   7   6
   C           ?     8   0   2   9
   E           ?     7   2   0   3
   F           ?     6   9   3   0

To compute the corresponding distances, we take the smallest of the pairwise distances between each point and the two points in the cluster. For example, $d(B, (A,D)) = \min(d(A,B), d(B,D)) = \min(4, 5) = 4$. We can then fill out the rest of the row and column corresponding to (A, D):

            (A, D)   B   C   E   F
   (A, D)      0     4   2   3   5
   B           4     0   8   7   6
   C           2     8   0   2   9
   E           3     7   2   0   3
   F           5     6   9   3   0

At this point, we again look for the smallest value in the matrix, now corresponding to either cluster ((A, D), C) or (C, E). Let us select the latter, yielding:

[Figure: The tree now contains two clusters, (A, D) and (C, E).]

Again we update the matrix to remove the rows and columns for C and E and create a row and column for the cluster (C, E):

            (A, D)   B   (C, E)   F
   (A, D)      0     4      ?     5
   B           4     0      ?     6
   (C, E)      ?     ?      0     ?
   F           5     6      ?     0

Now we have to compute the corresponding distances as before. We already walked through how to compute the distance between a point and a cluster, but we now also have to figure out $d((A,D), (C,E))$. To do so, we compare all pairs of points across the two clusters:

$d((A,D), (C,E)) = \min(d(A,C), d(A,E), d(D,C), d(D,E)) = \min(2, 3, 3, 4) = 2$
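The remaining new entries are computed the same way, e.g., $d(B, (C,E)) = \min(d(B,C), d(B,E)) = \min(8, 7) = 7$ and $d(F, (C,E)) = \min(d(C,F), d(E,F)) = \min(9, 3) = 3$.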

Thus, the new distance matrix is:

            (A, D)   B   (C, E)   F
   (A, D)      0     4      2     5
   B           4     0      7     6
   (C, E)      2     7      0     3
   F           5     6      3     0

The new lowest value is 2, corresponding to the cluster that combines (A, D) and (C, E):

[Figure: Clusters (A, D) and (C, E) are merged into the larger cluster ((A, D), (C, E)).]

Again we simplify the matrix, now by removing the rows/columns for (A, D) and (C, E) and creating a new row/column for (A, C, D, E):

                (A, C, D, E)   B   F
   (A, C, D, E)      0         ?   ?
   B                 ?         0   6
   F                 ?         6   0

We fill in this new matrix using the same formula as before, e.g., $d((A,C,D,E), F) = \min(d(A,F), d(C,F), d(D,F), d(E,F)) = \min(5, 9, 5, 3) = 3$.
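Similarly, $d((A,C,D,E), B) = \min(d(A,B), d(B,C), d(B,D), d(B,E)) = \min(4, 8, 5, 7) = 4$.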

                (A, C, D, E)   B   F
   (A, C, D, E)      0         4   3
   B                 4         0   6
   F                 3         6   0

Now the smallest value is for cluster ((A, C, D, E), F):

[Figure: Point F joins the cluster ((A, D), (C, E)), forming (((A, D), (C, E)), F).]
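The only distance left to recompute is $d((A,C,D,E,F), B) = \min(d(A,B), d(B,C), d(B,D), d(B,E), d(B,F)) = \min(4, 8, 5, 7, 6) = 4$.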

Yielding the new matrix:

                   (A, C, D, E, F)   B
   (A, C, D, E, F)        0          4
   B                      4          0

and one last merge, to create the cluster (A, B, C, D, E, F):

[Figure: The complete tree, after point B joins to form the final cluster ((((A, D), (C, E)), F), B).]

Exercises

  1. Redo the walk-through described above for complete linkage and average linkage (UPGMA) clustering.

  2. Given the following distance matrix, construct a UPGMA tree of the 6 sequences (A-F) shown below.

    a) Write out the formula for computing the distance between clusters (B, C) and (A, D).

    b) Fill in the intermediate distance matrices, including rows/columns for the clusters being created.

    c) Show the topology of the tree.

    d) How would the answer to a) change if you were performing complete linkage clustering?

    e) How would the answer to a) change if you were performing single linkage clustering?

       A   B   C   D   E   F
   A   0   2   5   7   8   8
   B   2   0   5   7   8   8
   C   5   5   0   6   7   7
   D   7   7   6   0   5   5
   E   8   8   7   5   0   2
   F   8   8   7   5   2   0

  3. Above we mentioned that complete linkage clustering tends to create tighter clusters than single linkage clustering. Please explain this intuition. Also, provide an example (a collection of sequences/points and corresponding distances) demonstrating this effect.

  4. Describe pseudo-code for breaking up a hierarchical clustering tree at a particular "depth", measured as the number of levels from the root. Note that your algorithm must take into account the fact that not all paths in the tree are the same length; the depth is measured along the longest path (the path from root to leaf that contains the most intermediate nodes).
