# Semi-supervised hierarchical clustering

Hierarchical clustering is frequently used as an "unsupervised" clustering method, i.e., without any expectation for what the clusters/groups in the data should look like. The clustering tree that is produced by a hierarchical clustering algorithm is typically analyzed "by hand" to determine which branches in the tree can be "cut" to define clusters, i.e., to determine which subtrees appear to represent data elements that are more similar to each other than to other data elements found in the tree. What is true labels are known for some of the hierarchically-clustered data elements? Can the "cuts" in the tree be automatically determined in a way that maximizes the fit between clusters and the known labels?

Doing so requires us to define a measure of the distance between the partial clustering defined by the known labels and the clusters that would be obtained by cutting specific edges in the tree.  One way of defining this distance is through a concept called the variation of information, specifically, the difference between the entropy of the two clusterings and the mutual information between them:

$$VI\[C\_1, C\_2] = H(C\_1) + H(C\_2) - 2 I (C\_1, C\_2)$$

where $$H$$ is the entropy and $$I$$ is the mutual information. $$VI\[C\_1, C\_2] = 0$$ if and only if $$C\_1$$ and $$C\_2$$ are identical clusterings.

Let us denote by $$c^1\_1, c\_2^1, ..., c\_k^1$$ the $$k$$ clusters of $$C\_1$$ and similarly for the $$l$$ clusters of $$C\_2$$.

The entropy of clustering $$C\_1$$ can be computed by viewing each of its cluster as a random variable with probability $$p(c\_i^1) = {{|c\_i^1|} \over {|C\_1|}}$$. Thus, the entropy $$H(C\_1) = -\sum\_{i \le k} {|c\_i^1|\over |C\_1|} \log {|c\_i^1|\over |C\_1|}$$.

Similarly, the mutual information is computed by taking into account the joint probability of the two cluterings: $$p(c\_i^1, c\_j^2) = {|c\_i^1 \bigcap c\_j^2| \over n}$$ where $$n$$ is the number of data elements shared by $$C\_1$$ and $$C\_2$$. The mutual information is defined as: $$I(C\_1, C\_2) = -\sum\_{i \le k, j\le l} p(c\_i^1, c\_j^2) \log {p(c\_i^1, c\_j^2) \over p(c\_i^1) p(c\_j^2)}$$.

Navlakha et al. (Navlakha, White et al. 2010) showed that the VI distance has a nice decomposition property (it is easy to compute the VI distance between two clusterings given the distances between subsets of these clusterings) which allows the construction of a dynamic programming algorithm for finding the "perfect" cut-set in a hierarchical clustering tree – i.e. the set of cuts in the tree that maximizes the concordance between the resulting clusters and previously known clustering of a subset of the sequences. In brief, at any node $$v$$ in the tree we can compare the fit (in terms of VI distance from the known labels) of the clustering that includes the leaves dominated by $$v$$ in a single cluster, to the clustering that splits these leaves into separate clusters dominated by the children of $$v$$. This approach can be considered as semi-supervised as it relies on prior information about some of the objects being clustered.

## References

* Navlakha,S., White,J., Nagarajan,N., Pop,M., and Kingsford,C. *Finding biologically accurate clusterings in hierarchical tree decompositions using the variation of information*. J Comput Biol, 2010, 17, 503–16.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://mpop.gitbook.io/bioinformatics-lecture-notes/data-clustering/semi-supervised-hierarchical-clustering.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.