The SOM result from the Golub paper produced two clusters: one with 25 nodes, the other with 13. Cluster 1 contained 24 ALL samples while cluster 2 contained 10 AML samples. Therefore our baseline is four misclassifications. Spectral clustering outperforms their result, achieving two misclassifications:
![]() |
|
One problematic feature of spectral clustering is the requirement of knowing, a priori, the number of clusters. For this reason, we investigate a divisive clustering algorithm that works without this knowledge.
2. Divisive Clustering
To generate a k nearest neighbors (NN) graph, we compute the Euclidean distance in the space of the gene expression data (~7000 dimensions). Each of the 38 data points (corresponding to patients) is connected to its k-NN's. From the resulting graph, we compute the "betweenness centrality" of each edge, i.e. the number of shortest paths traversing each edge in the graph. We then run Newman's divisive clustering algorithm. Briefly:
![]() k=1 NN Graph | ![]() k=1 Divisive Clustering |
![]() k=2 NN Graph | ![]() k=2 Divisive Clustering |
![]() k=3 NN Graph | ![]() k=3 Divisive Clustering |
![]() k=4 NN Graph | ![]() k=4 Divisive Clustering |
![]() k=5 NN Graph | ![]() k=5 Divisive Clustering |
![]() k=6 NN Graph | ![]() k=6 Divisive Clustering |
![]() k=7 NN Graph | ![]() k=7 Divisive Clustering |
![]() k=8 NN Graph | ![]() k=8 Divisive Clustering |
Beyond 9-NN, the graph is too connected to produce meaningful results: the clustering algorithm finds just one cluster. The number of clusters produced by divisive clustering as a function of k-NN: