
Hierarchical clustering


Introduction

Hierarchical clustering is a variant of agglomerative clustering (starting with each point in a separate cluster, then iteratively merging clusters) that continues the merging process until only one cluster remains. Effectively, this process constructs a tree that represents the merge operations: each node has as children the clusters that were merged to create it. An example is shown below:

[Figure: Hierarchical clustering tree for 6 points; each internal node is labeled with the cluster of leaves located below it.]

Note that this process does not explicitly cluster the original set of points into distinct clusters; rather, the user has to break up the tree in a way that meets the needs of their specific application. A great metaphor for cutting up the tree into clusters is slicing a head of broccoli with a knife. The individual florets that fall out represent the distinct clusters.

Approaches for splitting the tree range from a flat cut at a particular height/level in the tree, to approaches that consider the properties of each edge (or cluster it represents) in deciding whether to prune the tree at that location. These approaches are, however, beyond the scope of this chapter.

An example of hierarchical clustering from an actual biological dataset is presented below:

[Figure: Hierarchical clustering of patients (columns) and genes (rows) from a study of rheumatoid arthritis (RA); the RA patients and healthy controls end up in distinct branches of the tree. Image source: Panga, Venugopal; Raghunathan, Srivatsan (2018). Hierarchical clustering of RA and control samples based on the gene expression of selected key molecules in GSE7307. PLOS ONE. https://doi.org/10.1371/journal.pone.0199530.g011]

Hierarchical clustering algorithm

In theory, each merging step could involve more than two clusters; however, most approaches for hierarchical clustering restrict themselves to pairwise merge operations. The typical algorithm has the following structure:

Hierarchical_clustering(Points)

1:  D[n][n] = pairwise_distances(Points)   # pairwise distance matrix
2:  while num(Points) > 1                  # while we have more than one cluster
3:    (p1, p2) = argmin D                  # select the two points closest to each other
4:    remove p1 and p2 from Points
5:    p1 = (p1 + p2)                       # create a new cluster by merging p1 and p2
6:    add p1 to Points
7:    D = pairwise_distances(Points)       # D now has one fewer row and column
8:  return(Points)

There are a few things to note about this algorithm:

  • Just as in the case of k-means clustering, we must define a distance function between the data points. Usually, for gene expression measurements, this is the Euclidean distance, but other definitions can be used depending on the data being clustered.

  • The function argmin on line 3 simply says: pick the two points (p1 and p2) that are the closest to each other, i.e., the points that generate the lowest value in the distance matrix.

  • On line 5 we create a new data point from the two points we just merged. On line 7, we recompute the pairwise distances between points, but now one of the points is the one we just created. How do we define the distance between this new point and the other ones in the set? For example, in the first figure on this page, what is the distance between the cluster (3, 4) and point 5? This is the key difference between the various hierarchical clustering algorithms, and we'll discuss it in more detail below.
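To make the pseudocode and the notes above concrete, here is a minimal Python sketch of the same algorithm. It is illustrative rather than a definitive implementation: the names hierarchical_clustering, cluster_distance, and leaves are our own, clusters are represented as nested tuples of labels, and the pairwise distances are assumed to be given up front as a dictionary of dictionaries (as in the walk-through below); for coordinate data they would first have to be computed with, e.g., the Euclidean distance.

import itertools

def leaves(cluster):
    # A cluster is either a single label (a string) or a pair of sub-clusters.
    if isinstance(cluster, str):
        return [cluster]
    return leaves(cluster[0]) + leaves(cluster[1])

def cluster_distance(c1, c2, d, linkage):
    # Collect all pairwise distances across the two clusters and let the
    # linkage function (min, max, or an average) reduce them to a single number.
    return linkage([d[p][q] for p in leaves(c1) for q in leaves(c2)])

def hierarchical_clustering(d, linkage=min):
    # d: dictionary of dictionaries of pairwise distances, e.g. d["A"]["B"] = 4
    clusters = list(d.keys())                  # every point starts as its own cluster
    while len(clusters) > 1:                   # while we have more than one cluster
        c1, c2 = min(itertools.combinations(clusters, 2),
                     key=lambda pair: cluster_distance(pair[0], pair[1], d, linkage))
        clusters.remove(c1)                    # remove the two closest clusters...
        clusters.remove(c2)
        clusters.append((c1, c2))              # ...and add their merge, recording both children
    return clusters[0]                         # the nested tuples encode the merge tree

Because each merged cluster keeps its two children, the returned value is the tree itself; single linkage corresponds to linkage=min and complete linkage to linkage=max, as discussed in the next section.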

Computing the distance between clusters

Let's try to formalize a bit better what we are trying to achieve. Assume we have a set of points and a distance matrix that determines the pairwise distance between the points. We want to come up with a way to define distances between a point and a cluster, or between two different clusters.

In the figure below we highlight the example of a cluster comprised of points 1 and 2, and we want to compute the distance between this cluster and point 3.

[Figure: To compute the distance between point 3 and cluster (1, 2), we need the pairwise distances between point 3 and both of the clustered points.]

Since we know the distances between the three points, perhaps we can simply use those values to define the distance between the cluster (1, 2) and point 3. A natural choice is to simply select the average distance, i.e., $d((1,2), 3) = \frac{d(1,3) + d(2,3)}{2}$.

How about the situation when we compare two different clusters, as shown in the figure below?

[Figure: When comparing two clusters, such as (1, 2) and (3, 4, 5), all pairwise distances that go across the two clusters need to be taken into account.]

In this case, we can still use the average, except that now we have to average over all pairs of points in the two clusters:

$d((1,2), (3,4,5)) = \frac{d(1,3) + d(1,4) + d(1,5) + d(2,3) + d(2,4) + d(2,5)}{6}$

More generally, for clusters $C_1$ and $C_2$, we can define:

$d(C_1, C_2) = \frac{\sum_{p \in C_1} \sum_{q \in C_2} d(p, q)}{|C_1| \, |C_2|}$

Here, for two points $p$ and $q$, $d(p, q)$ is the distance between the two points according to the distance matrix, which is either computed by the clustering code or simply provided as input to the clustering algorithm.

If we use this distance measure, the hierarchical clustering algorithm is called average linkage, or average neighbor, or UPGMA (standing for unweighted pair group method with arithmetic mean).

Of course, this is just one of the possible choices for a distance. Other common choices include:

  • $d_{\min}(C_1, C_2) = \min_{p \in C_1,\, q \in C_2} d(p, q)$, also referred to as single linkage or nearest neighbor clustering.

  • $d_{\max}(C_1, C_2) = \max_{p \in C_1,\, q \in C_2} d(p, q)$, also referred to as complete linkage or farthest neighbor clustering.
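In the Python sketch shown earlier, these choices correspond simply to different linkage functions passed to the (hypothetical) hierarchical_clustering routine, for example:

# Assuming the hierarchical_clustering sketch from above and a distance
# dictionary d such as the one used in the walk-through below:
single_linkage = min      # nearest neighbor
complete_linkage = max    # farthest neighbor

def upgma(dists):
    return sum(dists) / len(dists)   # average linkage (UPGMA)

tree = hierarchical_clustering(d, linkage=upgma)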

Hierarchical clustering walk-through

Let's do a walk-through of the algorithm for the following distance matrix:

       A   B   C   D   E   F
   A   0   4   2   1   3   5
   B   4   0   8   5   7   6
   C   2   8   0   3   2   9
   D   1   5   3   0   4   5
   E   3   7   2   4   0   3
   F   5   6   9   5   3   0

We'll start with single linkage clustering, i.e., the distance between the two clusters is determined by the shortest distance between points in the two clusters.
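If you want to check the steps below programmatically, the matrix can be encoded and fed to the illustrative sketch defined earlier in this chapter (hierarchical_clustering is our own function, not a library routine):

d = {
    "A": {"A": 0, "B": 4, "C": 2, "D": 1, "E": 3, "F": 5},
    "B": {"A": 4, "B": 0, "C": 8, "D": 5, "E": 7, "F": 6},
    "C": {"A": 2, "B": 8, "C": 0, "D": 3, "E": 2, "F": 9},
    "D": {"A": 1, "B": 5, "C": 3, "D": 0, "E": 4, "F": 5},
    "E": {"A": 3, "B": 7, "C": 2, "D": 4, "E": 0, "F": 3},
    "F": {"A": 5, "B": 6, "C": 9, "D": 5, "E": 3, "F": 0},
}

print(hierarchical_clustering(d, linkage=min))
# One possible output (ties may be broken differently):
# ('B', ('F', (('A', 'D'), ('C', 'E'))))
# i.e., the same tree as ((((A, D), (C, E)), F), B) derived by hand below.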

To start, we identify the lowest value in the matrix, in this case 1, and determine which points (or clusters) correspond to this value. Here it is points A and D. Thus, the first cluster created is (A, D), represented in the tree form as:

[Figure: Points A and D joined together to form the cluster (A, D).]

Now we have to update the distance matrix to account for the new cluster. We start by removing the two rows and columns corresponding to points A and D, and insert a new row and column for the cluster (A, D):

            (A, D)   B   C   E   F
   (A, D)      0     ?   ?   ?   ?
   B           ?     0   8   7   6
   C           ?     8   0   2   9
   E           ?     7   2   0   3
   F           ?     6   9   3   0

To compute the corresponding distances, we take the smallest of the pairwise distances between each point and the two points in the cluster. For example, $d(B, (A,D)) = \min(d(A,B), d(B,D)) = \min(4, 5) = 4$. We can then fill out the rest of the row and column corresponding to (A, D):

            (A, D)   B   C   E   F
   (A, D)      0     4   2   3   5
   B           4     0   8   7   6
   C           2     8   0   2   9
   E           3     7   2   0   3
   F           5     6   9   3   0

At this point, we again look for the smallest value in the matrix, now corresponding to either cluster ((A, D), C) or (C, E). Let us select the latter, yielding:

[Figure: The tree now contains two clusters, (A, D) and (C, E).]

Again we update the matrix to remove the rows and columns for C and E and create a row and column for the cluster (C, E):

            (A, D)   B   (C, E)   F
   (A, D)      0     4      ?     5
   B           4     0      ?     6
   (C, E)      ?     ?      0     ?
   F           5     6      ?     0

Now we have to compute the corresponding distances as before. We already walked through how to compute the distance between a point and a cluster, but we now also have to figure out $d((A,D), (C,E))$. To do so, we compare all pairs of points across the two clusters:

$d((A,D), (C,E)) = \min(d(A,C), d(A,E), d(D,C), d(D,E)) = \min(2, 3, 3, 4) = 2$
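The remaining new entries are computed the same way, e.g., $d(B, (C,E)) = \min(d(B,C), d(B,E)) = \min(8, 7) = 7$ and $d(F, (C,E)) = \min(d(C,F), d(E,F)) = \min(9, 3) = 3$.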

Thus, the new distance matrix is:

            (A, D)   B   (C, E)   F
   (A, D)      0     4      2     5
   B           4     0      7     6
   (C, E)      2     7      0     3
   F           5     6      3     0

The new lowest value is 2, corresponding to the cluster that combines (A, D) and (C, E):

[Figure: Clusters (A, D) and (C, E) are merged into the larger cluster ((A, D), (C, E)).]

Again we simplify the matrix, now by removing the rows/columns for (A, D) and (C, E) and creating a new row/column for (A, C, D, E):

                (A, C, D, E)   B   F
   (A, C, D, E)      0         ?   ?
   B                 ?         0   6
   F                 ?         6   0

We fill in this new matrix using the same formula as before, e.g., $d((A,C,D,E), F) = \min(d(A,F), d(C,F), d(D,F), d(E,F)) = \min(5, 9, 5, 3) = 3$.
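Similarly, $d((A,C,D,E), B) = \min(d(A,B), d(B,C), d(B,D), d(B,E)) = \min(4, 8, 5, 7) = 4$.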

                (A, C, D, E)   B   F
   (A, C, D, E)      0         4   3
   B                 4         0   6
   F                 3         6   0

Now the smallest value is for cluster ((A, C, D, E), F):

[Figure: Point F joins the cluster ((A, D), (C, E)), forming (((A, D), (C, E)), F).]
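The only distance left to recompute is $d((A,C,D,E,F), B) = \min(d(A,B), d(B,C), d(B,D), d(B,E), d(B,F)) = \min(4, 8, 5, 7, 6) = 4$.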

Yielding the new matrix:

                   (A, C, D, E, F)   B
   (A, C, D, E, F)        0          4
   B                      4          0

and one last merge, to create the cluster (A, B, C, D, E, F):

[Figure: The complete tree, after point B joins to form the final cluster ((((A, D), (C, E)), F), B).]

Exercises

  1. Redo the walk-through described above for complete linkage and average linkage (UPGMA) clustering.

  2. Given the following distance matrix, construct a UPGMA tree of the 6 sequences (A-F) shown below.

    a) Write out the formula for computing the distance between clusters (B, C) and (A, D).

    b) Fill in the intermediate distance matrices, including rows/columns for the clusters being created.

    c) Show the topology of the tree.

    d) How would the answer to a) change if you were performing complete linkage clustering?

    e) How would the answer to a) change if you were performing single linkage clustering?

       A   B   C   D   E   F
   A   0   2   5   7   8   8
   B   2   0   5   7   8   8
   C   5   5   0   6   7   7
   D   7   7   6   0   5   5
   E   8   8   7   5   0   2
   F   8   8   7   5   2   0

  3. Above we mentioned that complete linkage clustering tends to create tighter clusters than single linkage clustering. Please explain this intuition. Also, provide an example (a collection of sequences/points and corresponding distances) demonstrating this effect.

  4. Describe pseudo-code for breaking up a hierarchical clustering tree at a particular "depth", measured as the number of levels from the root. Note that your algorithm must take into account the fact that not all paths in the tree are the same length; the depth is measured along the longest path (the path from root to leaf that contains the most intermediate nodes).
