My Summer Internship!



My Summer Research!



The program:




This summer I am part of a CRA program, run by the Computing Research Association's Committee on the Status of Women in Computing Research. The committee's goal is to increase the number of women participating in Computer Science and Engineering research. I learned of this program through my mentor for the summer, Professor Lenore Cowen, a professor of Computer Science at Tufts University.


A bit of background:




Cancer research, once relegated to the laboratory, is now breaking free of its sterilized walls and entering the realm of the data analysts. Where it was once possible to analyze only one gene per tumor at a time, 10,000 genes can now be analyzed simultaneously. This wealth of information has been made possible by advances in microarray techniques, which have allowed researchers to vastly increase the number of genes reported.


Using the data:




Using these per-tumor gene expression profiles, it should be possible to find clusterings of the data that reflect various characteristics, such as survival. The data was provided by the CAMDA competition (see below). My role in this research project is to find meaningful clusters among the data. We are hoping that the resulting clusters will be indicative of survival; however, the clusters may simply reflect the conditions under which the data was gathered. Consequently, as the summer progresses, I will experiment with increasingly complex clustering algorithms to analyze the data in more detail.

The first dataset I analyzed was the Harvard dataset.

My first clustering algorithm was KNN Clustering.

I then used K-Median Clustering.
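As a rough illustration of the idea behind K-Median clustering, here is a minimal sketch. It is not the code used in this project: the Manhattan distance, the coordinate-wise median update, and the names `manhattan` and `k_medians` are all assumptions made for the example.

```python
from statistics import median


def manhattan(a, b):
    """Sum of absolute coordinate differences (assumed distance measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medians(points, initial_centers, max_iter=100):
    """Assign each point to its nearest center, then move each center to the
    coordinate-wise median of its members; repeat until labels stop changing."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = [median(col) for col in zip(*members)]
    return labels, centers


# Toy data: two well-separated groups of three points each.
points = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
labels, centers = k_medians(points, initial_centers=[[0, 0], [10, 10]])
```

In the tumor data each point would be one tumor and each coordinate one gene's expression value. Medians are less sensitive to outlying values than means, which is one common reason to prefer this variant.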

After the data was clustered, it was necessary to determine whether the clusters were differentiated based on survival. To do this, the Kaplan-Meier Estimation was used.
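To make the Kaplan-Meier step concrete, here is a small sketch of the standard product-limit estimate. The function name `kaplan_meier` and the toy numbers are made up for illustration; in practice one curve would be computed per cluster and the curves compared.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death was observed.

    times:  follow-up time for each patient
    events: 1 if the death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set.
        at_risk -= removed
    return curve


# Toy example: four patients, the second one censored at time 2.
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
```

A censored patient still counts toward the number at risk up to the time they leave the study, which is why the drop at time 3 is larger than it would be if that patient were simply discarded.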

It is also important to know how (or if) the datasets' gene selections overlap. If the datasets incorporate the same set of genes, or subsets of each other, it may be possible to find a universal set of significant genes.

In previous research, my mentor, Professor Lenore Cowen, and her research group determined a set of genes within the Harvard dataset that they believed to be significant. By determining the symbol names (not specified within the Harvard dataset), I was able to compare across three of the CAMDA datasets (Harvard, Ontario, and Michigan) to determine whether there was overlap and, if so, where. Please click here for my findings.
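The comparison itself amounts to set intersection once every dataset uses the same gene symbols. The sketch below shows one way it could be done; the function names and the symbols `GENE_A` through `GENE_E` are placeholders, not real genes or actual findings.

```python
def normalize(symbols):
    """Trim whitespace and upper-case symbols so matching is not case-sensitive."""
    return {s.strip().upper() for s in symbols}


def symbol_overlap(significant, datasets):
    """For each dataset, list the significant genes that it also contains."""
    sig = normalize(significant)
    return {
        name: sorted(sig & normalize(symbols))
        for name, symbols in datasets.items()
    }


def shared_by_all(significant, datasets):
    """Significant genes present in every dataset (a candidate universal set)."""
    common = normalize(significant)
    for symbols in datasets.values():
        common &= normalize(symbols)
    return sorted(common)


# Placeholder symbols only; these are not genes from the real datasets.
significant = ["GENE_A", "GENE_B", "GENE_C"]
datasets = {
    "Ontario": ["gene_a", "GENE_C", "GENE_D"],
    "Michigan": ["GENE_A ", "GENE_E"],
}
per_dataset = symbol_overlap(significant, datasets)
universal = shared_by_all(significant, datasets)
```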

I also looked at alternative methods for finding the Top Genes. One such method used a ratio of variances to determine the "Top Genes".
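This page does not spell out which variances go into the ratio, so the sketch below makes an assumption: each gene is scored by its between-group variation divided by its within-group variation, where the groups are some known labeling of the tumors (for example, short versus long survival). The names `variance_ratio` and `top_genes` and the toy values are hypothetical.

```python
from statistics import mean


def variance_ratio(values, labels):
    """Between-group sum of squares divided by within-group sum of squares.

    This particular ratio is an assumption made for illustration.
    """
    overall = mean(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    between = 0.0
    within = 0.0
    for vs in groups.values():
        m = mean(vs)
        between += len(vs) * (m - overall) ** 2
        within += sum((v - m) ** 2 for v in vs)
    # A gene with no spread inside its groups separates them perfectly.
    return between / within if within > 0 else float("inf")


def top_genes(expression, labels, n):
    """Return the n gene names with the largest variance ratio."""
    scores = {gene: variance_ratio(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy data: g1 separates the two groups cleanly, g2 does not.
expression = {"g1": [1, 2, 9, 10], "g2": [1, 9, 2, 10]}
labels = [0, 0, 1, 1]
best = top_genes(expression, labels, n=1)
```

A gene scores highly when its expression differs a lot between groups but stays consistent inside each group.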


CAMDA: The Goal:




Duke Bioinformatics Shared Resource (DBSR) created CAMDA, the Critical Assessment of Microarray Data Analysis, "a forum to critically assess different techniques used in microarray data mining" (Duke Bioinformatics Shared Resource). CAMDA is a challenge: each participant is presented with the same standard data set, and the goal is to find the method of analysis that best describes and clusters it.


Research Paper




Please click here for the PDF version of "Microarray Data Analysis of Survival Times of Patients with Lung Adenocarcinomas Using ADC and K-Medians Clustering."







Questions or Comments?
Email me: Emily K. Mower