Top Genes

Top Genes based on Variance

Basic Idea Behind K-Median Clustering:

Given a group of patients who died at a particularly rapid rate, it should be possible to determine which genes contribute to that rapid die-off. To identify these genes, I assumed that they are expressed in a similar manner within the cluster of quickly dying patients and in a different manner outside of the cluster.

Method Employed:

The user specifies, through a file, the members of the cluster to be analyzed. Ideally, the members of this cluster should have the characteristic of interest (i.e. survival) in common. The program reads in all of the patients and marks those specified in the file.

For a specific gene g, two variances are calculated: the variance of g's expression among the members of the cluster, and the total variance among all the patients. The ratio of the cluster variance to the total variance is then computed. A small ratio indicates that the variance within the cluster is small relative to the variance across all patients. Thus, theoretically, genes which are responsible for the die-off should have a low ratio.

If the gene's ratio is below a certain user-specified threshold, the gene is considered to be a "Top Gene".
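
The ratio and threshold test described above can be sketched in a few lines of Python. This is a minimal illustration, not the original program: the function names, the toy data, and the use of population variance are assumptions, since the page does not say which variance estimator was used or how constant genes were handled.

```python
from statistics import pvariance


def variance_ratio(values, in_cluster):
    """Cluster variance divided by total variance for one gene.

    values:     expression of the gene for every patient
    in_cluster: parallel list of booleans marking the cluster members
    Population variance is an assumption made for this sketch.
    """
    cluster_values = [v for v, marked in zip(values, in_cluster) if marked]
    total = pvariance(values)
    if total == 0:
        # Assumption: a gene that never varies carries no information,
        # so give it an infinite ratio and it is never selected.
        return float("inf")
    return pvariance(cluster_values) / total


def top_genes(expression, in_cluster, threshold):
    """Indices of the genes whose ratio falls below the threshold."""
    return [g for g, row in enumerate(expression)
            if variance_ratio(row, in_cluster) < threshold]


# Hypothetical toy data: 3 genes x 6 patients, first three patients marked.
expression = [
    [5.0, 5.1, 4.9, 1.0, 9.0, 3.0],  # tight inside the cluster, spread overall
    [1.0, 9.0, 4.0, 2.0, 8.0, 5.0],  # spread everywhere
    [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # constant
]
in_cluster = [True, True, True, False, False, False]
selected = top_genes(expression, in_cluster, threshold=0.3)
```

With this toy data only the first gene is selected, because it is the only one whose expression is tight inside the marked cluster compared with all patients.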

After determining the "Top Genes", a dataset is reconstructed from the genes and K-Median Clustering is run on the resulting dataset.

Professor Blumer added an update to this program that locates the 50 most dissimilar genes in the set of "Top Genes" identified. The method (courtesy of Professor Blumer) is as follows:
Kruskal's Minimum Spanning Tree algorithm is run until m (say 50) sets remain, and then one representative is picked from each set. This is essentially a form of single-linkage hierarchical clustering.

1) Normalize each row (gene) so that its vector length is 1.
2) Find the distances d[i][j] between rows i and j for all i < j, and put them in a min-heap.
3) Find m groups among the n genes by repeating the following n-m times:
   a) Extract the minimum d[i][j] from the heap until a value is found where i and j aren't already in the same group.
   b) Join the groups containing genes i and j.
4) Pick one gene from each group.
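
The four steps above can be sketched as follows. This is an illustrative sketch rather than Professor Blumer's code: the function name, the choice of Euclidean distance, the union-find bookkeeping, and the rule of taking the lowest-numbered gene as each group's representative are all assumptions made for the example.

```python
import heapq
import math


def pick_dissimilar_genes(rows, m):
    """Group the genes into m groups, Kruskal-style, and return one index per group."""
    n = len(rows)

    # Step 1: scale each row to unit length (assumes no all-zero row).
    unit = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        unit.append([x / norm for x in row])

    # Step 2: distances for every pair i < j, stored in a min-heap.
    # Euclidean distance is an assumption; the page only says "distances".
    heap = [(math.dist(unit[i], unit[j]), i, j)
            for i in range(n) for j in range(i + 1, n)]
    heapq.heapify(heap)

    # Union-find structure that tracks which group each gene belongs to.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 3: keep joining the two closest groups until only m remain.
    groups = n
    while groups > m and heap:
        _, i, j = heapq.heappop(heap)
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # skip pairs that are already in the same group
            parent[root_j] = root_i
            groups -= 1

    # Step 4: one representative per group (assumption: lowest gene index).
    representative = {}
    for g in range(n):
        representative.setdefault(find(g), g)
    return sorted(representative.values())


# Hypothetical toy data: rows 0 and 1 point the same way, as do rows 2 and 3.
rows = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0], [1.0, 1.0]]
picked = pick_dissimilar_genes(rows, 3)
```

Here the two nearly parallel pairs are merged first, leaving three groups and one gene from each. Pushing every pair onto the heap costs O(n^2) memory, which is manageable for a few hundred "Top Genes" but would not scale to all 12,600 genes at once.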

The Harvard Dataset:

The Harvard Dataset can be found on the CAMDA website at www.camda.duke.edu/camda03/contest.asp. It contains five distinct tumor groups: adenocarcinomas (AD), squamous (SQ), carcinoid (COID), small cell (SMLC), and normal lung (NL). The dataset includes 202 patients, with 12,600 genes analyzed for each.

My Results:

The following table details the output of the Top Genes program when used in conjunction with K-Median for clustering and a program which calculates the statistical likelihood that the difference between the two survival curves arose by chance (p-value). The last column, "P-Value When Reduced to 50 Genes", reflects an addition to the Top Genes program by Professor Blumer, in which the 50 most dissimilar genes are picked out to eliminate redundancy.

Variance Ratio	# Genes	P-Value	P-Value When Reduced to 50 Genes
0.3	93	0.074	0.13
0.35	125	0.116	0.001
0.4	158	0.012	0.003
0.44	202	0.006	0.19

Future Plans:

This program is in the process of being integrated with my K-Median algorithm.

Questions or Comments?

Email Me! Emily K. Mower
