My Journal

Relevant Links:

Clustering Methods and Analyses:

KNN Clustering

K-Median Clustering

Kaplan-Meier Estimation

Top Genes based on Variance

Cross-Validation

Significant Genes

Journal of Summer Activity

June 16, 2003 - August 31, 2003

Week 1:

I was introduced to the research project. Professor Cowen presented me with various papers explaining how the data we were to work with had been obtained. The papers also described past methods of analysis and their relative success or lack thereof. During this first week I also began reading about the KNN Clustering method.

Week 2:

I continued to work on the KNN Clustering method. I created a program that identified the genes correctly with a probability of approximately 0.9, depending on the value of K. For details on its method and functionality, please see: KNN Clustering. I also continued to read papers and began to read about the K-Median Clustering method.
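
For readers unfamiliar with the idea, here is a minimal sketch of k-nearest-neighbour labelling in Python. It is not the program described in this entry: the function name `knn_predict`, the Euclidean distance, and the majority vote are assumptions made for illustration, and the journal does not say which distance or voting rule was actually used.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point (an assumed metric).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and return their most common label.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A call such as `knn_predict(points, labels, new_point, k=5)` returns one label, and repeating it for several values of `k` shows how the accuracy depends on K.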

Week 3:

I created the K-Median Clustering program. The program separates the adenocarcinoma patients of the Harvard Dataset into one to four groups and returns the center of each group. For details on its method and functionality, please see: K-Median Clustering. I also began working on a cross-validation algorithm for the KNN Clustering program to increase its reliability.
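
The general shape of a K-Median clustering can be sketched as follows. This is an illustration under stated assumptions, not the program from this entry: the Manhattan distance, the choice of the first `k` points as starting centers, and the names `manhattan` and `k_median` are all hypothetical.

```python
import statistics


def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_median(points, k, iterations=20):
    """Return (centers, groups) after alternating assignment and median updates."""
    # Assumed initialisation: the first k points serve as starting centers.
    centers = [list(p) for p in points[:k]]
    groups = []
    for _ in range(iterations):
        # Assignment step: each point joins the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: manhattan(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center becomes the coordinate-wise median of its group.
        new_centers = [
            [statistics.median(column) for column in zip(*group)] if group else centers[i]
            for i, group in enumerate(groups)
        ]
        if new_centers == centers:
            break  # No center moved, so the grouping is stable.
        centers = new_centers
    return centers, groups
```

Running it with `k` set to 1, 2, 3, and 4 would give the one to four groups mentioned above, each with its center.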

Week 4:

I conducted a Kaplan-Meier analysis to determine whether the clusters resulting from my K-Median algorithm were at all related to survival. For information on the Kaplan-Meier estimate, please see Kaplan-Meier Estimate. I finalized the cross-validation test on the KNN Clustering program. The results can be seen here.

It is also important to determine whether there is overlap between the Harvard, Michigan, and Ontario datasets. If there is, it may be possible to determine universally significant genes. As a first step in this analysis, I determined whether the genes deemed significant by Professor Lenore Cowen and her research group in a previous analysis also exist within the other two datasets. The result of this analysis can be seen here. The difficulty with the Harvard Dataset, in terms of analysis, is that its genes are given by full name, not by symbol. This makes it difficult to compare the genes in that dataset with those in the Michigan and Ontario datasets, both of which identify their genes by symbol. However, since both of the latter datasets use symbol format, it is possible to determine which genes appear in both. All told, 2,569 genes appear in both datasets. This gene list can be accessed here (this page takes some time to load as it contains quite a bit of data; please be patient).
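
As a rough illustration of the Kaplan-Meier estimate used in this entry, the sketch below computes the survival curve for one group of patients. It is not the code used for the analysis: the function name `kaplan_meier` and the input layout (one follow-up time and one death/censored flag per patient) are assumptions. At each time with at least one death, the running survival probability is multiplied by (1 - deaths / patients still at risk).

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with a death.

    durations: follow-up time per patient.
    observed:  True if the death was observed, False if the patient was censored.
    """
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        deaths = 0
        leaving = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(records) and records[i][0] == t:
            if records[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after time t.
        at_risk -= leaving
    return curve
```

Computing one curve per cluster and comparing them is one way to see whether the clusters differ in survival. A formal comparison would need a statistical test, which this sketch does not include.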

Week 5:

I created Kaplan-Meier curves (will be up shortly) of the groupings of the Harvard Dataset, the Shortened Dataset, and the Overlapped Dataset. It seemed that my clusters were not particularly related to survival. I therefore began analyzing other possible causes. Thus far, none of the causes analyzed seems to be particularly influential. I am also continuing to work on developing an overlapped list between the Harvard, Ontario, and Michigan datasets. The Harvard Dataset presents an interesting problem, as it is not expressed in the same format as the other two. Therefore, a program must first be developed to translate the Harvard gene descriptions into gene names. After this is completed, determining the overlap should not prove to be a problem. I have completed a program which, given a set of patients, will determine the similarity of each gene's expression. I should change the program so that a cluster can be read from a file rather than entered manually each time. For next week I will be analyzing the Ontario and Michigan papers' claim to have identified a set of survival-oriented genes. On another note, my brother is coming to visit this weekend.
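
The translate-then-overlap idea described above might look something like the sketch below. Everything in it is illustrative: the function names, the lookup table from descriptions to symbols, and the case-insensitive matching are assumptions, and the toy gene lists are made up rather than taken from the datasets.

```python
def translate(descriptions, description_to_symbol):
    """Map full gene descriptions to symbols, skipping any with no known symbol."""
    return [
        description_to_symbol[d]
        for d in descriptions
        if d in description_to_symbol
    ]


def gene_overlap(*gene_lists):
    """Return the sorted symbols present in every list (case-insensitive, by assumption)."""
    normalised = [{g.strip().upper() for g in genes} for genes in gene_lists]
    return sorted(set.intersection(*normalised))


# Toy example with made-up list contents.
lookup = {
    "tumor protein p53": "TP53",
    "epidermal growth factor receptor": "EGFR",
}
harvard = translate(
    ["tumor protein p53", "epidermal growth factor receptor", "unknown gene"],
    lookup,
)
michigan = ["TP53", "EGFR", "KRAS"]
ontario = ["egfr", "TP53", "MYC"]
shared = gene_overlap(harvard, michigan, ontario)
```

The hard part in practice is building the lookup table, since a description with no matching symbol is simply dropped here.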

Week 6:

I continued my attempt to identify a set of genes indicative of survival. In my K=4 cluster of the Full Harvard Dataset, there was a group of 11 patients who all seemed to die rather quickly. I analyzed this group to find the set of genes which varied the least across the group. I calculated this least variation by determining the absolute value of the sum of all the distances between patients across a single gene. Genes with a sum of less than 2000 for all 11 patients were chosen for the first attempt. Kaplan-Meier curves were then plotted. The corresponding graph may show a difference in survival. A statistical analysis still needs to be run on the result. I also created a graph using a 1000 threshold. The corresponding graph may also show a difference in survival. Once again, a statistical analysis must be run. For next week I shall be working on creating a program which analyzes the genes using an iterative approach. I will look at the genes of the patients in the worst survival group (that is, the group that dies the quickest) and determine which genes have sums under a certain threshold. I will cluster based on these genes, then analyze the patients in the bottom group again. The ultimate goal of this will be to determine which genes are strongly linked to survival.
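
One possible reading of the least-variation measure above is sketched here. It assumes that the "sum of all the distances between patients across a single gene" means the sum of absolute differences over every pair of patients; the journal does not spell this out, so that reading, the function names, and the toy values are all assumptions.

```python
from itertools import combinations


def pairwise_spread(values):
    """Sum of absolute differences over every pair of patients for one gene."""
    return sum(abs(a - b) for a, b in combinations(values, 2))


def low_variation_genes(expression, threshold):
    """Return the genes whose pairwise spread falls below `threshold`.

    expression: dict mapping a gene name to its expression values, one per patient.
    """
    return [
        gene
        for gene, values in expression.items()
        if pairwise_spread(values) < threshold
    ]
```

With the 11-patient group as input, calling `low_variation_genes(expression, 2000)` and then `low_variation_genes(expression, 1000)` would mirror the two thresholds tried this week.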

Week 7:

This week I have continued my attempt to locate the Top Genes. I have not had much success.

Week 8:

I have continued my search for the Top Genes. This time I used my cluster four group from the Full Dataset to determine which genes are indicative of death.

Week 9:

I found a bug in my K-Median code. I decided this was undesirable and set out to fix it. Unfortunately, the fix obliterated all of my positive results. That was fun. Moving on, however, K-Median and Top Genes have now been fixed for the most part. These two programs are in the process of being integrated. Additionally, both were primarily designed to work on the Harvard Dataset. They have been changed to also function on the Michigan Dataset. I will report the results from these programs when I have results to report. Professor Anselm Blumer of Tufts University's Computer Science Department has taken over as head of the CAMDA project. He has been helping me to refine my K-Median program. One major change has been to normalize the vector of each gene to one. This has produced interesting preliminary results. However, some have been rather unpredictable. To eliminate this side effect, a cross-validation will be introduced to the code. On a side note, I signed up for my GREs today (August 26).
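
Normalizing each gene's vector to one could be done along these lines. The sketch assumes that "to one" means unit Euclidean length, which the journal does not state, and the function name is hypothetical.

```python
import math


def normalise_to_unit_length(vector):
    """Scale a gene's expression vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        # An all-zero vector has no direction, so it is returned unchanged.
        return list(vector)
    return [x / norm for x in vector]
```

After this step every gene contributes on the same scale, so genes with large raw expression values no longer dominate the distance calculation in the clustering.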

Week 10:

This is the last week of my summer research. So bizarre, it seems to have gone so quickly! Classes started today, which is pretty exciting, and it is nice to see friends again! As for my research, we have a meeting today at 6:40, so we shall see what the week holds in store for us. We have started writing the paper and will be turning it in by September 19, 2003. This has been a wonderful experience! Please click here for the paper.

Questions or Comments?

Email Me! Emily K. Mower