Top Genes




Top Genes based on Variance


Basic Ideas Behind K-Median Clustering | Method Employed | The Harvard Dataset | My Results | Future Plans

Basic Idea Behind K-Median Clustering:



When a group of patients dies at a particularly rapid rate, it should be possible to determine which genes are affecting this rapid die-off. To identify the genes responsible, I assumed that they are expressed in a similar manner within the cluster of quickly dying patients and in a different manner outside of the cluster.


Method Employed:




The user specifies, through a file, the members of the cluster to be analyzed. Ideally, the members of this cluster should have the desired characteristic (e.g., survival) in common. The program reads in all the patients and marks those specified in the file.

For each gene g, the program calculates the variance of g's expression among the members of the cluster and the total variance among all the patients. It then calculates the ratio of the cluster variance to the total variance. A small ratio indicates that the gene varies little within the cluster relative to how much it varies across all the patients. Thus, theoretically, genes which are responsible for the die-off should have a low ratio.

If the gene's ratio is below a certain user-specified threshold, the gene is considered to be a "Top Gene".
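The ratio and threshold test described above can be sketched roughly as follows. This is only an illustration, not the original program: the function names, the toy data, and the choice of population variance (rather than sample variance) are all assumptions.

```python
from statistics import pvariance


def variance_ratio(expression, in_cluster):
    """Cluster variance divided by total variance for one gene.

    expression: one expression value per patient.
    in_cluster: one boolean per patient, True if the patient is in the cluster.
    Population variance is used here; that choice is an assumption.
    """
    cluster_values = [x for x, flag in zip(expression, in_cluster) if flag]
    total_var = pvariance(expression)
    if total_var == 0:
        # The gene is constant across all patients, so it cannot discriminate.
        return float("inf")
    return pvariance(cluster_values) / total_var


def top_genes(data, in_cluster, threshold):
    """Return the names of genes whose variance ratio is below the threshold."""
    return [gene for gene, values in data.items()
            if variance_ratio(values, in_cluster) < threshold]


# Hypothetical toy data: 6 patients, the first 3 form the cluster of interest.
in_cluster = [True, True, True, False, False, False]
data = {
    "geneA": [5.0, 5.1, 4.9, 1.0, 9.0, 3.0],  # tight inside the cluster
    "geneB": [1.0, 9.0, 5.0, 2.0, 8.0, 4.0],  # spread out everywhere
}
selected = top_genes(data, in_cluster, threshold=0.4)
print(selected)
```

In this toy example only geneA passes, because its values barely vary inside the cluster but vary widely across all six patients.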

After determining the "Top Genes", a new dataset containing only those genes is constructed, and K-Median Clustering is run on the resulting dataset.

Professor Blumer added an update to this program to locate the 50 most dissimilar genes in the set of "Top Genes" identified. The method (courtesy of Professor Blumer) is as follows:
Kruskal's Minimum Spanning Tree algorithm is run until m (say 50) sets remain, and then one representative is picked from each set. This is basically a type of hierarchical clustering.

1) Normalize each row (gene) so its vector length is 1
2) Find the distances d[i][j] between all pairs of rows i < j and put them in a min-heap
3) Find m groups among the n genes by repeating the following n-m times:
a) Extract the minimum d[i][j] from the heap until a value is found where i and j aren't already in the same group
b) Join the groups containing genes i and j
4) Pick one gene from each group
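The four steps above might look roughly like the sketch below. It is an illustration rather than Professor Blumer's actual code: the function names are made up, Euclidean distance is assumed (the steps do not say which distance is used), groups are tracked with a union-find structure, and the representative of each group is simply its lowest-numbered gene.

```python
import heapq
import math


def normalize(row):
    """Step 1: scale a row (gene) so that its vector length is 1."""
    length = math.sqrt(sum(x * x for x in row))
    return [x / length for x in row] if length > 0 else list(row)


def pick_dissimilar(rows, m):
    """Merge n genes into m groups, then return one gene index per group."""
    n = len(rows)
    unit = [normalize(r) for r in rows]

    # Step 2: distances for all pairs i < j, stored in a min-heap.
    # Euclidean distance is an assumption.
    heap = [(math.dist(unit[i], unit[j]), i, j)
            for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)

    parent = list(range(n))  # union-find: each gene starts in its own group

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 3: each successful join reduces the group count by one,
    # so n - m joins leave m groups.
    groups = n
    while groups > m and heap:
        _, i, j = heapq.heappop(heap)
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # skip pairs that are already in the same group
            parent[root_j] = root_i
            groups -= 1

    # Step 4: pick one gene per group (here the lowest index, an assumption).
    representatives = {}
    for i in range(n):
        representatives.setdefault(find(i), i)
    return sorted(representatives.values())


# Hypothetical toy data: genes 0 and 1 point the same way, as do genes 2 and 3.
rows = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0], [1.0, 1.0]]
chosen = pick_dissimilar(rows, 3)
print(chosen)
```

Because every row is normalized first, genes whose expression profiles differ only in scale end up close together and are merged early, which is what removes the redundancy.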


The Harvard Dataset:




The Harvard Dataset can be found on the CAMDA website at www.camda.duke.edu/camda03/contest.asp. It contained five distinct groups of samples: adenocarcinomas (AD), squamous (SQ), carcinoid (COID), small cell (SMLC), and normal lung (NL). There were 202 patients within the dataset, and 12,600 genes were analyzed for each patient.


My Results:




The following table details the output of the Top Genes program used in conjunction with K-Median for clustering and a program which calculates the statistical likelihood (p-value) of the two curves being different by chance. The last column, "P-Value When Reduced to 50 Genes", comes from an addition to the Top Genes program by Professor Blumer. In this addition, the 50 most dissimilar genes were picked out to eliminate redundancy.

Variance Ratio    # Genes    P-Value    P-Value When Reduced to 50 Genes
0.3               93         0.074      0.13
0.35              125        0.116      0.001
0.4               158        0.012      0.003
0.44              202        0.006      0.19


Future Plans:




This program is in the process of being integrated with my K-Median algorithm.







Questions or Comments?
Email Me! Emily K. Mower