Introduction to data clustering
Data clustering is one of the key tools used in various incarnations of data mining: trying to make sense of large datasets. As a motivating example, let us return to an idea we discussed in the context of motif finding (you may want to re-read that chapter). There, the premise was that we somehow know that a set of genes is turned on or off by the same transcription factor, and we were looking for the signal in their upstream regions that was recognized by the transcription factor. We didn't discuss at that time how exactly we can figure out which genes are co-expressed. One technology used by biologists to measure the level of expression of genes (i.e., how much RNA is produced by each gene) is called a microarray. The microarray is a small plate or chip that has "printed" on it RNA segments that are complementary to the genes of the organism being studied. To measure the level of expression of genes, the RNA is extracted from the cells of the organism and then "washed" over the chip, allowing the RNA to bind to the complementary spots on the chip. Each spot thus captures an amount of RNA proportional to the abundance (level of expression) of the corresponding gene in the cell, information that can be read by a special reader device. Typically, this information is displayed as colored spots, with the hue representing the abundance of the corresponding gene.
Using this microarray device, we can therefore measure the expression of the genes of an organism and compare these levels of expression across multiple conditions (e.g., after adding nutrients, removing oxygen, etc.). All that remains is to figure out how to determine which genes behave in a similar way, indicating that they may be controlled by the same transcription factor. This is the primary goal of clustering. Specifically, given some definition of similarity between gene expression profiles (the abundance of each gene across the different conditions), we would like to determine groups of genes that are more similar to each other than to the other genes.
If you view the abundance of a gene under each condition as a coordinate in a c-dimensional space (where c is the number of conditions tested), the goal of clustering is to identify "clumps" of points in space, where each point is a gene.
IMPORTANT: Microarrays are not commonly used anymore (gene expression is now more commonly measured through DNA sequencing), but the conceptual idea of a sample as a collection of gene-level abundances, embodied in the microarray setup described above, is still a useful metaphor.
First, it is important to note that clustering algorithms need to determine, explicitly or implicitly, the distance between the objects being clustered. In the context of sequence data, such distances are defined by measures of sequence similarity such as edit distance. When clustering gene expression profiles, which can be seen as vectors in a high-dimensional space, distance measures such as the Euclidean distance or cosine similarity (the cosine of the angle between two vectors, related to the dot product of the two vectors) are commonly used. For now we'll focus on the Euclidean distance, defined as

$$d(x, y) = \sqrt{\sum_{i=1}^{c} (x_i - y_i)^2}$$

for vectors $x$ and $y$.
To make things concrete, assume we have two genes G1 and G2 with the following expression profiles across five conditions:

| Gene | Cond. 1 | Cond. 2 | Cond. 3 | Cond. 4 | Cond. 5 |
|------|---------|---------|---------|---------|---------|
| G1   | 0.1     | 0.03    | -0.2    | 0       | 0.4     |
| G2   | -0.2    | 0.1     | 0.15    | 0.3     | 0       |
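The Euclidean distance between these two profiles is

$$d(G1, G2) = \sqrt{(0.1 + 0.2)^2 + (0.03 - 0.1)^2 + (-0.2 - 0.15)^2 + (0 - 0.3)^2 + (0.4 - 0)^2} \approx 0.68$$

As a quick sanity check, here is a minimal sketch of this computation in Python (the function and variable names are our own, not from any particular library):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

g1 = [0.1, 0.03, -0.2, 0.0, 0.4]
g2 = [-0.2, 0.1, 0.15, 0.3, 0.0]

print(euclidean_distance(g1, g2))  # ~0.68
```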
Of course, this number is meaningless without context, and the clustering algorithm must compare the pairwise distances of different points when trying to figure out how to construct the clusters. In the general case one must compute all pairwise distances between sequences, a process that is computationally prohibitive for large datasets. Various heuristics can be used, in specific situations, to reduce the runtime, either by using simpler measures of sequence similarity or by avoiding the computation of all pairwise distances. For example, the triangle inequality can be used to bound the distance between sequences without actually computing it.
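To sketch how such a bound helps (a hedged illustration; the threshold-based pruning rule here is our own, not a specific published method): by the reverse triangle inequality, $d(b, c) \geq |d(a, b) - d(a, c)|$, so if the distances from two points to a common reference point $a$ are already known, some pairs can be ruled out without ever computing their distance:

```python
def can_skip(d_ab, d_ac, threshold):
    """If the lower bound |d(a,b) - d(a,c)| already exceeds the clustering
    threshold, b and c can never be close enough to co-cluster, so d(b, c)
    need not be computed at all."""
    return abs(d_ab - d_ac) > threshold

# b is far from the reference point a, while c is close to it:
print(can_skip(d_ab=0.9, d_ac=0.1, threshold=0.5))  # True: skip computing d(b, c)
```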
Before we discuss general paradigms for clustering, we need to define the goal of the clustering process. An intuitive definition is the good clustering principle: any two points from the same cluster should be closer to each other than any two points from different clusters.
Unfortunately, it is not always possible to find a clustering that satisfies this principle, and we may relax it by transforming it into an optimization problem, such as: find clusters such that the number of violations of the good clustering principle is minimized. Still, phrasing clustering as an optimization problem frequently leads to NP-hard computational problems. Therefore, in practice clustering is frequently based on heuristic approaches, which may provide some theoretical guarantees even if they cannot be guaranteed to solve the optimization problem exactly. We will discuss a couple of such approaches here.
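To make the relaxed objective concrete, here is a minimal sketch (our own illustrative code, not a standard algorithm) that counts how many within-cluster pairs violate the good clustering principle, i.e., are farther apart than the closest between-cluster pair:

```python
import math
from itertools import combinations

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def count_violations(points, labels):
    """Count within-cluster pairs that are farther apart than the
    closest between-cluster pair (violations of the principle)."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = euclidean_distance(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    if not between:  # a single cluster has no between-cluster pairs
        return 0
    closest_between = min(between)
    return sum(1 for d in within if d >= closest_between)
```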
Clustering algorithms can be classified into three main categories:
Agglomerative: the algorithm starts with each sequence as an independent cluster, then iteratively joins clusters together to create larger ones (see the sketch after this list).
Divisive: the algorithm starts with all the sequences in one cluster, then iteratively breaks up the cluster into a collection of smaller clusters.
Partitioning: the algorithm starts with a pre-defined set of clusters, then refines the cluster membership to improve cluster quality.
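As an example of the agglomerative strategy, here is a minimal sketch of single-linkage agglomerative clustering (a deliberately naive implementation for illustration; the stopping condition used here, a target number of clusters k, is just one of several reasonable choices):

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def single_linkage(points, k):
    """Naive agglomerative clustering: start with singleton clusters and
    repeatedly merge the two clusters whose closest members are nearest,
    stopping when k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, cluster index a, cluster index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean_distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

profiles = [[0.1, 0.03, -0.2, 0.0, 0.4],
            [-0.2, 0.1, 0.15, 0.3, 0.0],
            [0.12, 0.0, -0.18, 0.05, 0.38]]
print(single_linkage(profiles, k=2))  # [[0, 2], [1]]: the two similar profiles merge
```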
The various algorithms embed the optimization criterion in the iterative steps being taken, typically making, at each step, the choice that is most likely to improve the selected cluster "goodness" criterion. They also use a stopping condition to determine when a particular clustering has reached high enough quality.
There are many different ways to assess the quality of clusters; they are beyond the scope of this chapter, but you should be able to learn about them through a simple Internet search. An important general point about assessing the quality of clusters is that frequently one must balance two competing criteria: the "compactness" of the clusters (how similar the co-clustered points are to each other) and the number of clusters produced (a measure of the overall complexity of the clustering). This trade-off is typical in machine learning and statistical inference, and relates to the competition between the goodness of fit between a model and the data, and the complexity of the model. In fact, clustering is a form of machine learning, commonly referred to as unsupervised learning.
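To see why these criteria compete, here is a simple compactness score (an illustrative measure of our own choosing, not a standard validity index): the average within-cluster distance. Placing every point in its own cluster trivially drives it to zero, so compactness alone would always favor many tiny clusters and must be weighed against the complexity of the clustering.

```python
import math
from itertools import combinations

def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def avg_within_cluster_distance(points, clusters):
    """Average distance between co-clustered points (lower = more compact).
    Singleton clusters contribute no pairs, so all-singleton clusterings
    score zero; hence the need to also penalize the number of clusters."""
    dists = [euclidean_distance(points[i], points[j])
             for members in clusters
             for i, j in combinations(members, 2)]
    return sum(dists) / len(dists) if dists else 0.0
```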