K-means clustering

Introduction

The k-means clustering algorithm is a good prototype for the partitioning clustering paradigm. This approach is also very similar in spirit to the Gibbs sampling approach.

To avoid addressing the hard question "how many clusters are in the data?", we will start with a predetermined guess K, and instead ask "what is the best way to cluster the data points into exactly K clusters?". Of course we have to define what we mean by "best". The k-means algorithm attempts to identify clusters that are "compact", and defines compactness in terms of the distance between the points in a cluster and the center of mass (or the mean) of that cluster.

Given a set of points in a Euclidean space, we define the center of those points to be the point $C$ such that the distortion $\frac{1}{n} \sum_i d(\text{point}_i, C)^2$ is minimized over all possible choices of $C$.

Plugging in the equation described above for the Euclidean distance, we can re-define the distortion as:

$$\frac{1}{n} \sum_i \sum_j (x_{i,j} - x_{c,j})^2$$

where $x_{i,j}$ is the $j$th coordinate of point $i$, and $x_{c,j}$ is the $j$th coordinate of the chosen center. In two dimensions, this equation is equivalent to:

$$\frac{1}{n} \sum_i \left[ (x_i - x_c)^2 + (y_i - y_c)^2 \right]$$

Since we are trying to minimize this function with respect to $C$, it suffices to determine the coordinates of $C$ that zero out the partial derivatives of the function with respect to those coordinates. Please try to work out the math, but hopefully you'll find that the coordinates of $C$ are determined by the equation:

$$x_{c,j} = \frac{1}{n} \sum_i x_{i,j}$$

In English, the coordinates of C are simply the average of the coordinates of the points within the cluster. You will recognize this equation as that for the center of gravity or center of mass of a set of objects.
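
If you want to check your work, the key step is taking the partial derivative of the distortion with respect to the $j$th coordinate of $C$ and setting it to zero (only the terms involving that coordinate survive):

$$\frac{\partial}{\partial x_{c,j}} \left[ \frac{1}{n} \sum_i \sum_{j'} (x_{i,j'} - x_{c,j'})^2 \right] = -\frac{2}{n} \sum_i (x_{i,j} - x_{c,j}) = 0 \;\Longrightarrow\; x_{c,j} = \frac{1}{n} \sum_i x_{i,j}$$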

After this long detour, we can formally define the k-means clustering problem:

K-means clustering. Given a set of n points and a value K, find a set of K centers such that the sum of the squared distances between each point and its closest center is minimized over all possible choices of centers.

Note that this is an extension of the concept of distortion described above where distortion referred to just one center.
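
To make this objective concrete, here is a small Python sketch that computes the distortion of a given set of centers; the function name and the use of NumPy are our own choices, not part of the notes:

import numpy as np

def kmeans_distortion(data, centers):
    # data: an n-by-d array of points; centers: a K-by-d array of centers.
    # distances[i, j] = Euclidean distance between point i and center j.
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    # Each point contributes the squared distance to its nearest center.
    return np.sum(distances.min(axis=1) ** 2)

With a single center, this reduces (up to the 1/n factor) to the distortion defined earlier.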

It is important to note at this point that solving this problem seems hopeless: there are infinitely many possible choices for the K centers, since we are not constrained to pick centers that are part of the original set of points. Thus, we would need to search an infinite space of solutions and find an optimal one.

Lloyd's algorithm for k-means clustering

To try to make the problem more tractable, let's observe that once a set of centers is selected, we can easily calculate the distortion, i.e., the function that we are trying to minimize. We can simply compute the distance between each point and each center, and then sum up the squared distances between each point and its nearest center. Just like we did in the case of the Gibbs sampler, perhaps we can start with some guess for the location of the K centers, then intelligently adjust these centers to reduce the distortion.

To work through this, let's start with a visual example. In the figure below you see a set of points (white circles) and three arbitrary choices for the K=3 centers (points colored red, green, and blue). The space is colored based on the color of the nearest center, i.e., any point in the blue region is closer to the blue center than to the red or the green one. In computational geometry, this subdivision of space based on the distance to a set of points is called a Voronoi diagram. The figures below were created with the wonderful applet by Naftali Harris, available at https://www.naftaliharris.com/blog/visualizing-k-means-clustering/.

Figure: Initial step in k-means clustering. The points to be clustered are white circles. The points colored red, green, and blue are the initial (random) guesses for the centers. The space is colored based on the color of the nearest center point.

We can see that the distortion of this initial clustering is quite bad. The cluster of points on the bottom right side of the figure is not close to any of the chosen centers; instead, its points are assigned to the green and blue centers, which are quite far away from this cluster. Perhaps we can find a new set of centers that improves the distortion. If we only had a single cluster, we would already know how to find a center that minimizes the distortion for that cluster, using the "center of mass" idea discussed above. Let us try to use this idea to propose new centers for the points that are located in the red, green, and blue regions, respectively:

Figure: New cluster centers after finding the centers of mass of the three regions. The colors indicate which points were used to define each center of mass.

Note that the centers are now closer to the "clumps" of points in the figure; however, we still have quite a bit of distortion, since the green center is outside of the clump of points on the bottom right of the figure. We also note that half of the bottom left cluster was initially closest to the green center, but its points are now closest to the red center. Thus, the centers we chose in the last iteration are not a good estimate of the center of mass of the points that are assigned to them.

To correct this issue, we again recompute the centers:

Figure: New centers after a second iteration. Now the centers match the corresponding clouds of points.

At this point, you can see that the chosen centers correctly capture the three clusters found in the image. Furthermore, since all the points in each cluster are assigned to the correct center (e.g., there are no green points in the red region), this choice of centers is optimal from the point of view of the k-means criterion. Each cluster has a center that minimizes its distortion; hence, the overall distortion is minimized.

This is pretty impressive: even though there is an infinite number of possible choices for the centers, starting from an arbitrary choice and iteratively refining it allowed us to find the optimal answer in just a few iterations.

What we have described so far is the Lloyd algorithm for solving the k-means clustering problem. We summarize and formalize the algorithm below:

Lloyd_K_means(Data, k)
  Centers[1..k] = k randomly selected points from Data

  while Centers change
    # Centers to clusters
    for each point p in Data
      assign p to its nearest center

    # Clusters to centers
    for i in 1..k
      Centers[i] = center of mass of the points assigned to it

  return Centers
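
As a concrete companion to the pseudocode, here is a minimal Python sketch of Lloyd's algorithm. The function name lloyd_k_means, its interface, and the use of NumPy are our own choices, and the sketch assumes that no cluster ever becomes empty (a production implementation would have to handle that case):

import numpy as np

def lloyd_k_means(data, k, seed=None):
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct points chosen from the data.
    centers = data[rng.choice(len(data), size=k, replace=False)]

    while True:
        # Centers to clusters: assign each point to its nearest center.
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)

        # Clusters to centers: move each center to the center of mass
        # (coordinate-wise mean) of the points assigned to it.
        new_centers = np.array([data[assignment == i].mean(axis=0) for i in range(k)])

        # Stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            return new_centers, assignment
        centers = new_centers

For example, lloyd_k_means(points, 3) returns three centers and, for each point, the index of the center it was assigned to; kmeans_distortion(points, centers) from the earlier sketch can then be used to evaluate the quality of the result.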

There are a few things to note. First, the initial centers are not simply arbitrary points in space, rather they are chosen from the input points. This makes it easier to ensure that the random initial choice is somewhat close to the points we are trying to cluster. There are other approaches one could use for the initialization, such as ensuring the centers are somewhat well separated (to avoid the initial choice in the first figure above where the red and green centers were next to each other).
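
As an illustration of the "well separated" idea (this particular strategy is our own example, not something prescribed by the notes), one simple option is a farthest-first traversal: pick the first center at random, then repeatedly add the point that is farthest from all centers chosen so far:

import numpy as np

def farthest_first_centers(data, k, seed=None):
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen data point.
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.linalg.norm(data[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
        # The point farthest from the current centers becomes the next center.
        centers.append(data[np.argmax(d)])
    return np.array(centers)

A popular randomized variant of this idea, which picks each new center with probability proportional to its squared distance from the existing centers, is known as k-means++.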

A second observation is that the "centers to clusters" step in the algorithm requires you to find, for each point, the cluster center that is closest to it. When implementing this algorithm you need to maintain a data structure that remembers which point is assigned to which center.

A third observation is that our stopping condition assumes that the algorithm will reach a point where the set of centers no longer changes. Is it possible to end up in a situation where the algorithm never ends? For example, can a point alternate between two clusters? In other words, can you prove that the algorithm eventually converges, irrespective of what the points are?

Exercises

  1. What is the runtime of one iteration of Lloyd's algorithm, in terms of the number of centers k and the number of points n?
