
Introduction to data clustering



Introduction to clustering

Data clustering is one of the key tools used in various incarnations of data mining, the practice of trying to make sense of large datasets. As a motivating example, let us return to an idea we discussed in the context of motif finding (you may want to re-read that chapter). There, the premise was that we somehow know that a set of genes is turned on or off by the same transcription factor, and we were looking for the signal in their upstream region that is recognized by the transcription factor. We didn't discuss at that time how exactly we can figure out which genes are co-expressed. One technology used by biologists to measure the level of expression of genes (i.e., how much RNA is produced by each gene) is called a microarray. The microarray is a small plate or chip that has "printed" on it RNA segments that are complementary to the genes of the organism being studied. To measure the level of expression of genes, the RNA is extracted from the cells of the organism and then "washed" over the chip, allowing the RNA to bind to the complementary spots on the chip. Each spot thus captures an amount of RNA proportional to that RNA's abundance (level of expression) in the cell, information that can be read by a special reader device. Typically, this information is displayed as colored spots, with the hue representing the abundance of the corresponding gene.

A close-up of data from a microarray. Each dot represents a different gene and the color corresponds to the strength of the expression of the corresponding gene.

Using this microarray device, we can, therefore, measure the expression of the genes of an organism and compare these levels of expression across multiple conditions (e.g., after adding nutrients, removing oxygen, etc.). All that remains is to figure out how to determine which genes behave in a similar way, indicating that they may be controlled by the same transcription factor. This is the primary goal of clustering. Specifically, given some definition of similarity between gene expression profiles (the abundance of each gene across the different conditions), we would like to determine groups of genes that are more similar to each other than to other genes.

If you view the abundance of a gene under each condition as a coordinate in a c-dimensional space (where c is the number of conditions tested), the goal of clustering is to identify "clumps" of points in space, where each point is a gene.

IMPORTANT: Microarrays are not commonly used anymore, and gene expression is now more commonly measured through high-throughput sequencing (RNA-seq), but the conceptual idea of a sample as a collection of gene-level abundances, embodied in the microarray picture shown above, is still a useful metaphor.

Clustering paradigms and quality definitions

First, it is important to note that clustering algorithms need to determine, explicitly or implicitly, the distance between the objects being clustered. In the context of sequence data, such distances are defined by measures of sequence similarity such as edit distance. When clustering gene expression profiles, which can be seen as vectors in a high-dimensional space, distance measures such as Euclidean distance or cosine similarity (the cosine of the angle between two vectors, related to the dot product of the two vectors) are commonly used. For now we'll focus on the Euclidean distance, defined as

$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

for vectors x and y.
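Concretely, these two measures might be implemented as follows. This is a minimal sketch in Python (the function names are ours), with expression profiles represented as plain lists of per-condition abundances:

```python
import math

def euclidean(x, y):
    # Square the per-condition differences, sum them, take the root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    # Cosine of the angle between the two profiles: their dot product
    # divided by the product of their norms.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)
```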

To make things concrete, assume we have two genes G1 and G2 with the following expression profiles across conditions:

Gene | Cond. 1 | Cond. 2 | Cond. 3 | Cond. 4 | Cond. 5
G1 | 0.1 | 0.03 | -0.2 | 0 | 0.4
G2 | -0.2 | 0.1 | 0.15 | 0.3 | 0

$d(G_1, G_2) = \sqrt{(0.1 + 0.2)^2 + (0.03 - 0.1)^2 + (-0.2 - 0.15)^2 + (0 - 0.3)^2 + (0.4 - 0)^2} = \sqrt{0.3^2 + 0.07^2 + 0.35^2 + 0.3^2 + 0.4^2} = \sqrt{0.4674} \approx 0.6837$

Of course, the number is meaningless without context, and the clustering algorithm must compare the pairwise distances of different points when trying to figure out how to construct the clusters. In the general case one must compute all pairwise distances between sequences, a process that is computationally prohibitive for large datasets. In specific situations, various heuristics can be used to reduce the runtime, either by using simpler measures of sequence similarity or by avoiding the computation of all pairwise distances. For example, the triangle inequality can be used to bound the distance between two sequences without actually computing it.
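As a quick check of this arithmetic, and a sketch of the triangle-inequality trick just mentioned (the pivot profile below is a made-up reference point, not something from the data):

```python
from math import dist  # Euclidean distance, available in Python 3.8+

G1 = [0.1, 0.03, -0.2, 0.0, 0.4]
G2 = [-0.2, 0.1, 0.15, 0.3, 0.0]

print(round(dist(G1, G2), 4))  # 0.6837, matching the calculation above

# The triangle inequality gives |d(x, p) - d(y, p)| <= d(x, y) for any
# pivot p, so precomputed distances to a single pivot yield a lower
# bound on d(x, y) without computing it; pairs whose bound already
# exceeds some cutoff can be skipped entirely.
pivot = [0.0] * 5  # hypothetical reference profile
lower_bound = abs(dist(G1, pivot) - dist(G2, pivot))
assert lower_bound <= dist(G1, G2)
```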

Before we discuss general paradigms for clustering, we need to define the goal of the clustering process. An intuitive definition is the good clustering principle: any pair of points from the same cluster should be closer to each other than any pair of points from different clusters.

Unfortunately, it is not always possible to find a clustering that satisfies this principle, so we may relax it by turning it into an optimization problem, such as: find clusters such that the number of violations of the good clustering principle is minimized. Even then, phrasing clustering as an optimization problem frequently leads to NP-hard computational problems. In practice, therefore, clustering is usually based on heuristic approaches, which may provide some theoretical guarantees even if they cannot be guaranteed to solve the optimization problem exactly. We will discuss a couple of such approaches here.
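One way to make the relaxed objective concrete is to count the violations directly. Below is a naive sketch (the function name and data representation are ours; it enumerates every within-cluster pair against every between-cluster pair, so it is only practical for small inputs):

```python
from itertools import combinations

def count_violations(points, labels, dist):
    # Collect distances of within-cluster and between-cluster pairs.
    pairs = list(combinations(range(len(points)), 2))
    intra = [dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    inter = [dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    # A violation: a within-cluster pair at least as far apart as some
    # between-cluster pair.
    return sum(1 for a in intra for b in inter if a >= b)
```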

Clustering algorithms can be classified into three main categories:

  • Agglomerative: the algorithm starts with each sequence as an independent cluster, then iteratively joins clusters together to create larger clusters (a minimal sketch appears after this list).

  • Divisive: the algorithm starts with all the sequences in one cluster, then iteratively breaks up clusters into collections of smaller clusters.

  • Partitioning: the algorithm starts with a pre-defined set of clusters, then refines the cluster membership to improve cluster quality.
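To make the agglomerative paradigm concrete, here is a minimal single-linkage sketch (a toy implementation of ours, not an algorithm from these notes; it stops when a target number of clusters k remains, which is just one of many possible stopping conditions):

```python
def agglomerate(points, k, dist):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of
                # points drawn from the two clusters.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

# Example usage: agglomerate(profiles, k=2, dist=euclidean)
```

A divisive method would invert this loop, starting from one big cluster and repeatedly splitting, while a partitioning method would move points between a fixed number of clusters.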

The various algorithms embed the optimization criterion in the iterative steps being taken, typically making the choice that is most likely to optimize the cluster "goodness" criterion selected. They also use a stopping condition to determine when a particular clustering has reached high enough quality.

There are many different ways to assess the quality of clusters; they are beyond the scope of this chapter, but you should be able to learn about them through a simple Internet search. An important general point about assessing the quality of clusters is that one must frequently optimize two competing criteria: the "compactness" of the clusters (how similar the co-clustered points are to each other) and the number of clusters produced (a measure of the overall complexity of the clustering). This trade-off is typical in machine learning and statistical inference, and relates to the tension between the goodness of fit of a model to the data and the complexity of the model. In fact, clustering is a form of machine learning, commonly referred to as unsupervised learning.
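As a sketch of the compactness side of this trade-off, one commonly used measure (among many) is the within-cluster sum of squared distances to each cluster's centroid:

```python
def compactness(clusters):
    # Sum, over all clusters, of squared distances from each point to
    # its cluster's centroid (the per-dimension mean). Lower is tighter.
    total = 0.0
    for cluster in clusters:
        dims = range(len(cluster[0]))
        centroid = [sum(p[d] for p in cluster) / len(cluster) for d in dims]
        total += sum((p[d] - centroid[d]) ** 2 for p in cluster for d in dims)
    return total
```

Note that splitting the data into more clusters can only decrease this quantity, which is exactly why compactness must be balanced against the number of clusters.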
