Hierarchical clustering
Hierarchical clustering is a variant of agglomerative clustering (starting with each point in a separate cluster, then iteratively merging clusters) that continues the merging process until only one cluster remains. Effectively, this process constructs a tree that represents the merge operations: each node has as children the clusters that were merged to create it. An example is shown below:
Note that this process does not explicitly cluster the original set of points into distinct clusters; rather, the user has to break up the tree in a way that meets the needs of their specific application. A great metaphor for cutting up the tree into clusters is slicing a head of broccoli with a knife. The individual florets that fall out represent the distinct clusters.
Approaches for splitting the tree range from a flat cut at a particular height/level in the tree, to approaches that consider the properties of each edge (or cluster it represents) in deciding whether to prune the tree at that location. These approaches are, however, beyond the scope of this chapter.
An example of hierarchical clustering from an actual biological dataset is presented below:
In theory, each merging step could involve more than two clusters, however most approaches for hierarchical clustering restrict themselves to pairwise merge operations. The typical algorithm has the following structure:
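A pseudocode sketch (the line numbers are the ones referenced in the notes below):

```
1:  compute the matrix of pairwise distances d(p, q) between all points
2:  while more than one cluster remains:
3:      (p1, p2) ← argmin over all pairs p ≠ q of d(p, q)
4:      create a new tree node with the clusters of p1 and p2 as its children
5:      create a new data point representing the merged cluster (p1, p2)
6:      remove p1 and p2 from the set of points and add the new point
7:      recompute the pairwise distances involving the new point
8:  return the tree rooted at the last node created
```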
There are a few things to note about this algorithm:
- Just as in the case of k-means clustering, we must define a distance function between the data points. Usually, for gene expression measurements, this is the Euclidean distance, but other definitions can be used depending on the data being clustered (a short example follows these notes).
- The function $\mathrm{argmin}$ on line 3 simply says: pick the two points ($p_1$ and $p_2$) that are the closest to each other, i.e., the points that generate the lowest value in the distance matrix. Here, for two points $p$ and $q$, $d(p, q)$ is the distance between the two points, according to the distance matrix that is either computed by the clustering code or simply provided as input to the clustering algorithm.
- On line 5 we create a new data point from the two points we just merged. On line 7, we recompute the pairwise distances between points, except that now one of the points is the one we just created. How do we define the distance between this new point and the other ones in the set? For example, in the first figure on this page, what is the distance between points (3, 4) and 5? This is the key difference between the various hierarchical clustering algorithms, and we'll discuss it in more detail below.
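As an aside on the first note, a minimal sketch of the Euclidean distance between two expression profiles (assuming NumPy is available; the values are made up for illustration):

```python
import numpy as np

# expression levels of two genes measured across the same four conditions
gene1 = np.array([2.1, 0.5, 3.3, 1.0])
gene2 = np.array([1.9, 0.7, 2.8, 1.4])

dist = np.linalg.norm(gene1 - gene2)  # Euclidean distance between the profiles
print(dist)
```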
Let's try to formalize a bit better what we are trying to achieve. Assume we have a set of points and a distance matrix that determines the pairwise distance between the points. We want to come up with a way to define distances between a point and a cluster, or between two different clusters.
In the figure below we highlight the example of a cluster composed of points 1 and 2, and we want to compute the distance between this cluster and point 3. Since we know the distances between the three points, perhaps we can simply use those values to define the distance between the cluster (1, 2) and point 3. A natural choice is to simply select the average distance, i.e., $d((1, 2), 3) = \frac{d(1, 3) + d(2, 3)}{2}$.
How about the situation when we compare two different clusters as shown in the figure below?
In this case, we can still use the average, except that now we have to average over all pairs of points in the two clusters. More generally, for clusters $C_1$ and $C_2$, we can define:

$$d(C_1, C_2) = \frac{1}{|C_1| \cdot |C_2|} \sum_{p \in C_1} \sum_{q \in C_2} d(p, q)$$
If we use this distance measure, the hierarchical clustering algorithm is called average linkage, or average neighbor, or UPGMA (standing for unweighted pair group method with arithmetic mean).
Of course, this is just one of the possible choices for a distance. Other common choices include:

- the minimum distance between points in the two clusters, $d(C_1, C_2) = \min_{p \in C_1,\, q \in C_2} d(p, q)$, also referred to as single linkage or nearest neighbor clustering;
- the maximum distance between points in the two clusters, $d(C_1, C_2) = \max_{p \in C_1,\, q \in C_2} d(p, q)$, also referred to as complete linkage or farthest neighbor clustering.
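A minimal sketch of all three rules in Python (`linkage_distance` and the dict-of-dicts layout for the distance matrix are illustrative choices, not a standard API):

```python
from itertools import product

def linkage_distance(d, c1, c2, method="average"):
    """Distance between clusters c1 and c2 (iterables of point labels),
    given d[p][q], the pairwise distance between individual points."""
    pair_dists = [d[p][q] for p, q in product(c1, c2)]
    if method == "single":      # minimum: nearest neighbor
        return min(pair_dists)
    if method == "complete":    # maximum: farthest neighbor
        return max(pair_dists)
    return sum(pair_dists) / len(pair_dists)  # average: UPGMA
```

With the distance matrix from the walkthrough below, for example, `linkage_distance(d, ('A', 'D'), ('C', 'E'), 'single')` evaluates to 2, matching the value computed by hand later on.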
Let's do a walk-through of the algorithm for the following distance matrix:
|       | A | B | C | D | E | F |
|-------|---|---|---|---|---|---|
| A     | 0 | 4 | 2 | 1 | 3 | 5 |
| B     | 4 | 0 | 8 | 5 | 7 | 6 |
| C     | 2 | 8 | 0 | 3 | 2 | 9 |
| D     | 1 | 5 | 3 | 0 | 4 | 5 |
| E     | 3 | 7 | 2 | 4 | 0 | 3 |
| F     | 5 | 6 | 9 | 5 | 3 | 0 |
We'll start with single linkage clustering, i.e., the distance between the two clusters is determined by the shortest distance between points in the two clusters.
To start, we identify the lowest value in the matrix, in this case 1, and determine which points (or clusters) correspond to this value. Here it is points A and D. Thus, the first cluster created is (A, D), represented in the tree form as:
Now we have to update the distance matrix to account for the new cluster. We start by removing the two rows and columns corresponding to points A and D, and insert a new row and column for the cluster (A, D):
|        | (A, D) | B | C | E | F |
|--------|--------|---|---|---|---|
| (A, D) | 0      |   |   |   |   |
| B      |        | 0 | 8 | 7 | 6 |
| C      |        | 8 | 0 | 2 | 9 |
| E      |        | 7 | 2 | 0 | 3 |
| F      |        | 6 | 9 | 3 | 0 |

To compute the corresponding distances, we will take the smallest value between the pairwise distances between each point and the two points in the cluster. For example, $d((A, D), B) = \min\{d(A, B), d(D, B)\} = \min\{4, 5\} = 4$. You can then fill out the rest of the row and column corresponding to (A, D):
|        | (A, D) | B | C | E | F |
|--------|--------|---|---|---|---|
| (A, D) | 0      | 4 | 2 | 3 | 5 |
| B      | 4      | 0 | 8 | 7 | 6 |
| C      | 2      | 8 | 0 | 2 | 9 |
| E      | 3      | 7 | 2 | 0 | 3 |
| F      | 5      | 6 | 9 | 3 | 0 |
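In code, this update step can be sketched as follows (a hypothetical helper operating on the dict-of-dicts distance matrix from the earlier sketch; single linkage only):

```python
def merge_rows(d, p1, p2):
    """Replace rows/columns p1 and p2 of the distance matrix d
    (a dict of dicts) with a single merged cluster, using the
    single-linkage rule: keep the smaller of the two distances."""
    new = (p1, p2)                      # label for the merged cluster
    others = [p for p in d if p not in (p1, p2)]
    d[new] = {new: 0}
    for p in others:
        d[new][p] = d[p][new] = min(d[p1][p], d[p2][p])
        del d[p][p1], d[p][p2]
    del d[p1], d[p2]
    return new
```

Calling `merge_rows(d, 'A', 'D')` on the original matrix produces exactly the (A, D) row and column shown above.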
At this point, we again look for the smallest value in the matrix (2), which now corresponds to either the pair ((A, D), C) or the pair (C, E). Let us select the latter, yielding:
Again we update the matrix to remove the rows and columns for C and E and create a row and column for the cluster (C, E):
|        | (A, D) | B | (C, E) | F |
|--------|--------|---|--------|---|
| (A, D) | 0      | 4 |        | 5 |
| B      | 4      | 0 |        | 6 |
| (C, E) |        |   | 0      |   |
| F      | 5      | 6 |        | 0 |
Now we have to compute the corresponding distances as before. We already walked through how we compute the distance between a point and a cluster, but we now also have to figure out $d((A, D), (C, E))$. Here we compare all pairs of points between the two clusters: $d((A, D), (C, E)) = \min\{d(A, C), d(A, E), d(D, C), d(D, E)\} = \min\{2, 3, 3, 4\} = 2$.
Thus, the new distance matrix is:
|        | (A, D) | B | (C, E) | F |
|--------|--------|---|--------|---|
| (A, D) | 0      | 4 | 2      | 5 |
| B      | 4      | 0 | 7      | 6 |
| (C, E) | 2      | 7 | 0      | 3 |
| F      | 5      | 6 | 3      | 0 |
The new lowest value is 2, corresponding to the cluster that combines (A, D) and (C, E):
Again we simplify the matrix, now by removing the rows/columns for (A, D) and (C, E) and creating a new row/column for (A, C, D, E):
|              | (A, C, D, E) | B | F |
|--------------|--------------|---|---|
| (A, C, D, E) | 0            |   |   |
| B            |              | 0 | 6 |
| F            |              | 6 | 0 |

We fill in this new matrix using the same formula as before, e.g., $d((A, C, D, E), B) = \min\{d(A, B), d(C, B), d(D, B), d(E, B)\} = \min\{4, 8, 5, 7\} = 4$:
|              | (A, C, D, E) | B | F |
|--------------|--------------|---|---|
| (A, C, D, E) | 0            | 4 | 3 |
| B            | 4            | 0 | 6 |
| F            | 3            | 6 | 0 |
Now the smallest value is 3, corresponding to the cluster ((A, C, D, E), F):
Yielding the new matrix:
|                 | (A, C, D, E, F) | B |
|-----------------|-----------------|---|
| (A, C, D, E, F) | 0               | 4 |
| B               | 4               | 0 |
and one last merge, to create the cluster (A, B, C, D, E, F):
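As a sanity check on this walkthrough, the same single linkage tree can be reproduced programmatically, assuming SciPy is available:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# the walkthrough's distance matrix, rows/columns in the order A..F
D = np.array([
    [0, 4, 2, 1, 3, 5],
    [4, 0, 8, 5, 7, 6],
    [2, 8, 0, 3, 2, 9],
    [1, 5, 3, 0, 4, 5],
    [3, 7, 2, 4, 0, 3],
    [5, 6, 9, 5, 3, 0],
], dtype=float)

# linkage() expects the condensed (upper-triangle) form of the matrix
Z = linkage(squareform(D), method="single")
print(Z)  # each row: the two clusters merged, merge distance, new cluster size
```

Swapping `method="single"` for `"complete"` or `"average"` gives the other linkage rules discussed above, and `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree if matplotlib is available. Note that the tie at distance 2 may be broken differently by SciPy, so the intermediate merges can differ from the walkthrough.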
Exercises

1. Redo the walk-through described above for complete linkage and average linkage (UPGMA) clustering.
2. Given the following distance matrix, construct a UPGMA tree of the 6 sequences (A-F) shown below.
   a) Write out the formula for computing the distance between clusters (B, C) and (A, D).
   b) Fill in the intermediate distance matrices, including rows/columns for the clusters being created.
   c) Show the topology of the tree.
   d) How would your answer to (a) change if you were performing complete linkage clustering?
   e) How would your answer to (a) change if you were performing single linkage clustering?
|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | 0 | 2 | 5 | 7 | 8 | 8 |
| B | 2 | 0 | 5 | 7 | 8 | 8 |
| C | 5 | 5 | 0 | 6 | 7 | 7 |
| D | 7 | 7 | 6 | 0 | 5 | 5 |
| E | 8 | 8 | 7 | 5 | 0 | 2 |
| F | 8 | 8 | 7 | 5 | 2 | 0 |
3. Above we mentioned that complete linkage clustering tends to create tighter clusters than single linkage clustering. Explain the intuition behind this claim, and provide an example (a collection of sequences/points and corresponding distances) demonstrating this effect.
4. Describe pseudo-code for breaking up a hierarchical clustering tree at a particular "depth", measured as the number of levels from the root. Note that your algorithm must take into account the fact that not all paths in the tree are the same length; the depth is measured along the longest path (the path from root to leaf that contains the most intermediate nodes).