|
Cancer research, once confined to the laboratory, is now breaking free of its sterilized walls
and entering the realm of the data analysts. Where it was once possible to analyze only one gene
per tumor at a time, and only roughly at that, 10,000 genes can now be analyzed simultaneously.
This wealth of information has been made possible by advances in microarray techniques,
which have allowed researchers to vastly increase the number of genes reported.
|
|
Using these per-tumor gene expression measurements, it should be possible to find
clusterings of the data that indicate various characteristics, such as survival.
The data was provided by the CAMDA competition (see below). My role in this research project is to
attempt to find meaningful clusters among the data. We are hoping that the resulting clusters will
be indicative of survival; however, the clusters may simply reflect the conditions under which the
data was gathered. Consequently, as the summer progresses, I will be experimenting with increasingly
complex clustering algorithms to analyze the details of the data more fully.
The first dataset I analyzed was the Harvard dataset.
My first clustering algorithm was KNN Clustering.
I then used K-Median Clustering.
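As a rough illustration of the K-Median idea, the sketch below alternates between assigning each tumor to its nearest center and moving each center to the per-gene median of its members. It is not the code used in this project: the function name `k_medians`, the Manhattan (L1) distance, the random initialization, and the toy data are all assumptions made for the example.

```python
import random
from statistics import median


def l1_distance(a, b):
    """Manhattan distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_medians(points, k, iterations=50, seed=0):
    """Cluster numeric vectors with K-medians; returns (labels, centers).

    Assumption: L1 distance and random initial centers drawn from the data.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: l1_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: each center becomes the per-dimension median of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [median(dim) for dim in zip(*members)]
    return labels, centers


# Toy data (made up): six "tumors" measured on two "genes".
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centers = k_medians(toy_points, k=2)
```

Using medians rather than means makes the centers less sensitive to a few extreme expression values, which is one common reason to prefer this variant.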
After the data was clustered, it was necessary to determine whether the clusters were differentiated by survival. To do this, Kaplan-Meier estimation was used.
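To make the survival comparison concrete, here is a minimal sketch of the Kaplan-Meier product-limit estimator, computed separately for each cluster. The function names `kaplan_meier` and `km_by_cluster`, the input layout (one follow-up time and one event flag per patient), and the toy values are assumptions for illustration, not the project's actual code or data.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    durations: follow-up time per patient.
    events: True if the death was observed, False if the patient was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


def km_by_cluster(durations, events, labels):
    """Compute one Kaplan-Meier curve per cluster label."""
    curves = {}
    for label in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        curves[label] = kaplan_meier(
            [durations[i] for i in idx], [events[i] for i in idx]
        )
    return curves


# Toy follow-up data (made up): times in months, False means censored.
toy_curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

If the curves for two clusters separate clearly, that suggests the clustering tracks survival; a formal comparison would still need a significance test such as the log-rank test.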
It is also important to know how (or if) the datasets' gene selections overlap. If the datasets incorporate the same set of genes, or subsets of each other, it may be possible to find a universal set of significant genes.
In previous research, my mentor, Professor Lenore Cowen, and her research group determined a set of genes within the Harvard dataset that they believed to be significant. By determining the symbol names (not specified within the Harvard dataset), I was able to compare across three of the CAMDA datasets (Harvard, Ontario, and Michigan) to determine if there was overlap and, if so, where. Please click here for my findings.
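The comparison by symbol name amounts to a set intersection. The sketch below assumes each dataset's genes are available as a list of symbol strings and that symbols match after trimming whitespace and ignoring case; the function name `symbol_overlap` and the placeholder symbols (`GENE_A`, and so on) are hypothetical and do not reflect the actual findings.

```python
def normalize(symbol):
    """Normalize a gene symbol so trivial formatting differences do not hide a match."""
    return symbol.strip().upper()


def symbol_overlap(reference_symbols, datasets):
    """Find which reference genes appear in each dataset, and in all of them.

    reference_symbols: symbols of the genes believed to be significant.
    datasets: dict mapping a dataset name to its list of gene symbols.
    Returns (per_dataset, shared_by_all).
    """
    reference = {normalize(s) for s in reference_symbols}
    per_dataset = {
        name: reference & {normalize(s) for s in symbols}
        for name, symbols in datasets.items()
    }
    # Genes from the reference set that every dataset contains.
    shared_by_all = set.intersection(reference, *per_dataset.values())
    return per_dataset, shared_by_all


# Placeholder symbols only; these are not real results.
per_dataset, shared = symbol_overlap(
    ["GENE_A", "GENE_B", "GENE_C"],
    {
        "Ontario": ["GENE_A", "GENE_B", "GENE_X"],
        "Michigan": ["gene_b ", "GENE_C"],
    },
)
```

In practice the same gene can appear under several aliases, so exact string matching like this is likely to undercount the true overlap.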
I also looked at alternative methods for finding the Top Genes. One such method used a ratio of variances to determine the "Top Genes".
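The text above does not say which variances were compared, so the following is only one plausible reading: score each gene by the variance of its group means divided by its average within-group variance, then keep the highest-scoring genes. The grouping (for example, survival classes or cluster labels), the function names `variance_ratio` and `top_genes`, and the toy values are all assumptions.

```python
from statistics import mean, pvariance


def variance_ratio(values, groups):
    """Between-group variance over mean within-group variance for one gene.

    Assumption: this particular ratio; the original method may differ.
    """
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    between = pvariance([mean(vs) for vs in by_group.values()])
    within = mean([pvariance(vs) for vs in by_group.values()])
    if within == 0:
        # No spread inside any group: perfectly separated, or constant everywhere.
        return float("inf") if between > 0 else 0.0
    return between / within


def top_genes(expression, groups, n):
    """Rank genes by variance ratio and return the n highest-scoring names.

    expression: dict mapping a gene name to its per-sample expression values.
    groups: one group label per sample, in the same order as the values.
    """
    scores = {gene: variance_ratio(vals, groups) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy expression values (made up) for six samples in two groups.
toy_expression = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.1, 4.9],  # differs strongly between groups
    "GENE_B": [1.0, 5.0, 3.0, 2.0, 4.0, 3.0],  # same mean in both groups
}
toy_top = top_genes(toy_expression, [0, 0, 0, 1, 1, 1], n=1)
```

A gene scores highly when it varies a lot between groups but little inside each group, which is the intuition behind most variance-based gene filters.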
|
|
Duke Bioinformatics Shared Resource (DBSR) created CAMDA, the Critical Assessment of Microarray Data Analysis, "a forum to critically assess different techniques used in microarray data mining" (Duke Bioinformatics Shared Resource). CAMDA is a challenge: each participant is presented with the same standard data set, and the goal is to find the method of analysis that best describes and clusters it.
|