1. Spectral Clustering of ALL/AML Leukemias

The SOM result from the Golub paper produced two clusters: one with 25 nodes, the other with 13. Cluster 1 contained 24 ALL samples while cluster 2 contained 10 AML samples. Therefore our baseline is four misclassifications. Spectral clustering outperforms their result, achieving two misclassifications:

One problematic feature of spectral clustering is the requirement of knowing, a priori, the number of clusters. For this reason, we investigate a divisive clustering algorithm that works without this knowledge.

2. Divisive Clustering

To generate a k nearest neighbors (NN) graph, we compute the Euclidean distance in the space of the gene expression data (~7000 dimensions). Each of the 38 data points (corresponding to patients) is connected to its k-NN's. From the resulting graph, we compute the "betweenness centrality" of each edge, i.e. the number of shortest paths traversing each edge in the graph. We then run Newman's divisive clustering algorithm. Briefly:

  1. G' = Remove edge in G with highest betweenness centrality. Break ties randomly.
  2. Compute modularity score Q = modularity(G')
  3. Repeat until no edges remain
The G' with the highest modularity score represents the inherent structure in the original input graph. Intuitively, Q is the fraction of edges that lie within clusters minus the expected value of the fraction of edges that lie within clusters in a randomly constructed graph. The results for different values of k are shown below. Green nodes represent the known ALL samples, red are AML samples. The optimal result is obtained with k=6 NN (same value as with spectral clustering).


k=1 NN Graph

k=1 Divisive Clustering

k=2 NN Graph

k=2 Divisive Clustering

k=3 NN Graph

k=3 Divisive Clustering

k=4 NN Graph

k=4 Divisive Clustering

k=5 NN Graph

k=5 Divisive Clustering

k=6 NN Graph

k=6 Divisive Clustering

k=7 NN Graph

k=7 Divisive Clustering

k=8 NN Graph

k=8 Divisive Clustering

Beyond 9-NN, the graph is too connected to produce meaningful results: the clustering algorithm finds just one cluster. The number of clusters produced by divisive clustering as a function of k-NN: