The user specifies through a file, the members of the cluster to be analyzed. Ideally, this cluster should have the desired characteristic (i.e. survival) in common. The program reads in the patients and marks the patients specified.
For a specific gene g, the variance is calculated of the members within the cluster and the total variance (among all the patients). A ratio is calculated of the cluster variance compared to the total variance. A small ratio indicates that the variance within the cluster is small and outside the cluster is large. Thus, theoretically, genes which are responsible for the die-off should have a low ratio.
If the gene's ratio is below a certain user-specified threshold, the gene is considered to be a "Top Gene".
After determining the "Top Genes", a dataset is reconstructed from the genes and K-Median Clustering is run on the resulting dataset.
Professor Blummer added an update to this program to locate the 50 most dissimilar genes in the set of "Top Genes" identified. The method (courtesy of Professor Blumer) is as follows:
Kruskal's Minimum Spanning Tree algorithm was used until there are m (say 50) sets, then one representative was picked from each set. This is basically a type of hierarchical clustering.
1) Normalize each row (gene) so its vector length is 1
2) Find the distances d[i][j] between rows i less j and put them in a min-heap
3) Find m groups among the n genes by repeating the following n-m times:
a) Extract the the minimum d[i][j] from the heap until a value is found where i and j aren't already in the same group
b) Join the groups containing genes i and j
4) Pick one gene from each group
|