K-means clustering

Introduction

The k-means clustering algorithm is a good prototype for the partitioning clustering paradigm. This approach is also very similar in spirit to the Gibbs sampling approach.

To avoid addressing the hard question "how many clusters are in the data?", we will start with a predetermined guess K, and instead ask "what is the best way to cluster the data points into exactly K clusters?". Of course we have to define what we mean by "best". The k-means algorithm attempts to identify clusters that are "compact", and defines compactness in terms of the distance between the points in a cluster and the center of mass (or the mean) of that cluster.

Given a set of points in a Euclidean space, we define the center of those points to be the point $C$ such that the distortion $\frac{1}{n} \sum_i d(\text{point}_i, C)^2$ is minimized over all possible choices of $C$.

Plugging in the equation described above for the Euclidean distance, we can re-define the distortion as:

$$\frac{1}{n} \sum_i \sum_j (x_{i,j} - x_{c,j})^2$$

where $x_{i,j}$ is the $j$th coordinate of point $i$, and $x_{c,j}$ is the $j$th coordinate of the chosen center. In two dimensions, this equation is equivalent to:

$$\frac{1}{n} \sum_i \left[ (x_i - x_c)^2 + (y_i - y_c)^2 \right]$$

Since we are trying to minimize this function with respect to $C$, it suffices to determine the coordinates of $C$ that zero out the partial derivatives of the function with respect to those coordinates. Please try to work out the math, but hopefully you'll find that the coordinates of $C$ are determined by the equation:

$$x_{c,j} = \frac{1}{n} \sum_i x_{i,j}$$

In English, the coordinates of C are simply the average of the coordinates of the points within the cluster. You will recognize this equation as that for the center of gravity or center of mass of a set of objects.
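
If you want to check your work, the key step is taking the partial derivative of the distortion with respect to the $j$th coordinate of $C$ and setting it to zero (only the terms involving that coordinate survive):

$$\frac{\partial}{\partial x_{c,j}} \left[ \frac{1}{n} \sum_i \sum_{j'} (x_{i,j'} - x_{c,j'})^2 \right] = -\frac{2}{n} \sum_i (x_{i,j} - x_{c,j}) = 0 \;\Longrightarrow\; x_{c,j} = \frac{1}{n} \sum_i x_{i,j}$$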

After this long detour, we can formally define the k-means clustering problem:

K-means clustering. Given a set of n points and a value K, find a set of K centers such that the sum of the squared distances between each point and its closest center is minimized over all possible choices of centers.

Note that this is an extension of the concept of distortion described above where distortion referred to just one center.
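
To make this objective concrete, here is a small Python sketch that computes the distortion of a given set of centers; the function name and the use of NumPy are our own choices, not part of the notes:

import numpy as np

def kmeans_distortion(data, centers):
    # data: an n-by-d array of points; centers: a K-by-d array of centers.
    # distances[i, j] = Euclidean distance between point i and center j.
    distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    # Each point contributes the squared distance to its nearest center.
    return np.sum(distances.min(axis=1) ** 2)

With a single center, this reduces (up to the 1/n factor) to the distortion defined earlier.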

It is important to note at this point that solving this problem seems hopeless: there are infinitely many possible choices for the K centers, since we are not constrained to pick centers that are part of the original set of points. Thus, we would need to search an infinite space of solutions and find an optimal one.

Lloyd's algorithm for k-means clustering

To try to make the problem more tractable, let's observe that once a set of centers is selected, we can easily calculate the distortion, i.e., the function that we are trying to minimize. We can simply compute the distance between each point and each center, and then sum up the squared distances between each point and its nearest center. Just like we did in the case of the Gibbs sampler, perhaps we can start with some guess for the location of the K centers, then intelligently adjust these centers to reduce the distortion.

To work through this, let's start with a visual example. In the figure below you see a set of points (white circles) and three arbitrary choices for the K=3 centers (points colored red, green, and blue). The space is colored based on the color of the nearest center, i.e., any point in the blue region is closer to the blue center than to the red or the green one. In computational geometry, this subdivision of space based on the distance to a set of points is called a Voronoi diagram. The figures below were created with the wonderful applet by Naftali Harris, available at https://www.naftaliharris.com/blog/visualizing-k-means-clustering/.

Figure: Initial step in k-means clustering. The points to be clustered are white circles. The points colored red, green, and blue are the initial (random) guesses for the centers. The space is colored based on the color of the nearest center point.

We can see that the distortion of this initial clustering is quite bad. The cluster of points on the bottom right side of the figure is not close to any of the chosen centers; instead, its points are assigned to the green and blue centers, which are quite far away from this cluster. Perhaps we can find a new set of centers that improves the distortion. If we only had a single cluster, we would already know how to find a center that minimizes the distortion for that cluster, using the "center of mass" idea discussed above. Let us try to use this idea to propose new centers for the points that are located in the red, green, and blue regions, respectively:

Figure: New cluster centers after finding the centers of mass of the three regions. The colors indicate which points were used to define each center of mass.

Note that the centers are now closer to the "clumps" of points in the figure; however, we still have quite a bit of distortion, since the green center is outside of the clump of points on the bottom right of the figure. We also note that half of the bottom left cluster was initially closest to the green center, but its points are now closest to the red center. Thus, the centers we chose in the last iteration are not a good estimate of the center of mass of the points that are assigned to them.

To correct this issue, we again recompute the centers:

Figure: New centers after a second iteration. Now the centers match the corresponding clouds of points.

At this point, you can see that the chosen centers correctly capture the three clusters found in the image. Furthermore, since all the points in each cluster are assigned to the correct center (e.g., there are no green points in the red region), this choice of centers is optimal from the point of view of the k-means criterion. Each cluster has a center that minimizes its distortion; hence, the overall distortion is minimized.

This is pretty impressive: even though there is an infinite number of possible choices for the centers, starting from an arbitrary choice and iteratively refining it allowed us to find the optimal answer in just a few iterations.

What we have described so far is the Lloyd algorithm for solving the k-means clustering problem. We summarize and formalize the algorithm below:

Lloyd_K_means(Data, k)
  Centers[1..k] = k randomly selected points from Data

  while Centers change
    # Centers to clusters
    for each point p in Data
      assign p to its nearest center

    # Clusters to centers
    for i in 1..k
      Centers[i] = center of mass of the points assigned to it

  return Centers
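
As a concrete companion to the pseudocode, here is a minimal Python sketch of Lloyd's algorithm. The function name lloyd_k_means, its interface, and the use of NumPy are our own choices, and the sketch assumes that no cluster ever becomes empty (a production implementation would have to handle that case):

import numpy as np

def lloyd_k_means(data, k, seed=None):
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct points chosen from the data.
    centers = data[rng.choice(len(data), size=k, replace=False)]

    while True:
        # Centers to clusters: assign each point to its nearest center.
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        assignment = distances.argmin(axis=1)

        # Clusters to centers: move each center to the center of mass
        # (coordinate-wise mean) of the points assigned to it.
        new_centers = np.array([data[assignment == i].mean(axis=0) for i in range(k)])

        # Stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            return new_centers, assignment
        centers = new_centers

For example, lloyd_k_means(points, 3) returns three centers and, for each point, the index of the center it was assigned to; kmeans_distortion(points, centers) from the earlier sketch can then be used to evaluate the quality of the result.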

There are a few things to note. First, the initial centers are not simply arbitrary points in space, rather they are chosen from the input points. This makes it easier to ensure that the random initial choice is somewhat close to the points we are trying to cluster. There are other approaches one could use for the initialization, such as ensuring the centers are somewhat well separated (to avoid the initial choice in the first figure above where the red and green centers were next to each other).
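
As an illustration of the "well separated" idea (this particular strategy is our own example, not something prescribed by the notes), one simple option is a farthest-first traversal: pick the first center at random, then repeatedly add the point that is farthest from all centers chosen so far:

import numpy as np

def farthest_first_centers(data, k, seed=None):
    rng = np.random.default_rng(seed)
    # Start from one randomly chosen data point.
    centers = [data[rng.integers(len(data))]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.linalg.norm(data[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
        # The point farthest from the current centers becomes the next center.
        centers.append(data[np.argmax(d)])
    return np.array(centers)

A popular randomized variant of this idea, which picks each new center with probability proportional to its squared distance from the existing centers, is known as k-means++.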

A second observation is that the "centers to clusters" step in the algorithm requires you to find, for each point, the cluster center that is closest to it. When implementing this algorithm you need to maintain a data structure that remembers which point is assigned to which center.

A third observation is that our stopping condition assumes that the algorithm will reach a point where the set of centers no longer changes. Is it possible to end up in a situation where the algorithm never ends? For example, can a point alternate between two clusters? In other words, can you prove that the algorithm eventually converges, irrespective of what the points are?

Exercises

  1. What is the runtime of one iteration of Lloyd's algorithm, in terms of the number of centers k and the number of points n?
