Since any point within the dataset could be a cluster center, in theory every single data point would have to be checked. For large k, this results in an unrealistic number of necessary calculations, so several approximation algorithms exist for that case. These algorithms do not necessarily give the ideal clustering, but their results are generally guaranteed to fall within a certain range of the optimum.
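For illustration (the exact dataset size may differ): choosing k centers from n patients means examining C(n, k) = n! / (k! (n - k)!) candidate sets. With n = 100 patients, k = 4 gives C(100, 4) = 3,921,225 combinations, which is manageable, while k = 25 gives roughly 2.4 x 10^23, far beyond practical reach.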
However, I was looking for at most four clusters, and could thus use a simpler yet more accurate algorithm.
I first separated out the adenocarcinoma patients from the Harvard Dataset A. Then, I created a distance matrix between all patients in the subclass (Professor Blumer updated this program to create a distance matrix based on genes normalized to 1). Once these tasks were complete, the clustering could begin.
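The following is a minimal sketch of how such a distance matrix might be built in Python. It assumes each patient is represented as a list of normalized expression values and uses Euclidean distance; both choices, and all names, are illustrative assumptions rather than the exact program used.

    import math

    def distance_matrix(patients):
        # patients: list of equal-length lists of normalized expression values
        # returns a symmetric n x n matrix of pairwise Euclidean distances
        n = len(patients)
        dist = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.sqrt(sum((a - b) ** 2
                                  for a, b in zip(patients[i], patients[j])))
                dist[i][j] = dist[j][i] = d
        return dist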
After asking the user for the desired number of clusters (at most four), I enumerated each patient, pair of patients, and so on, as a candidate set of cluster centers.
I then looped through the data. For each patient, I used the previously constructed distance matrix to determine which proposed cluster center that patient was closest to, and I added the pairwise distance between the patient and its closest cluster center to a running sum.
I continually updated the minimum sum; the candidate set with the minimum sum had the lowest total distance, and so represented the most ideal cluster center(s) found so far. I also continually updated the list of genes associated with each cluster center.
The cluster center(s) with the lowest total distance became the calculated ideal cluster center(s); a sketch of this exhaustive search appears below.
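Here is a minimal sketch of the search just described, assuming the distance matrix is stored as nested lists and patients are indexed by position; the function name and structure are hypothetical.

    from itertools import combinations

    def best_centers(dist, k):
        # dist: precomputed n x n distance matrix (nested lists)
        # k: desired number of clusters (here at most four)
        # returns the indices of the best center set and its total distance
        n = len(dist)
        best, best_sum = None, float("inf")
        for centers in combinations(range(n), k):
            # assign each patient to its nearest proposed center and
            # accumulate the total within-cluster distance
            total = sum(min(dist[p][c] for c in centers) for p in range(n))
            if total < best_sum:
                best, best_sum = centers, total
        return best, best_sum

With at most four clusters and on the order of a hundred patients, the search examines at most a few million candidate sets, consistent with the feasibility argument above.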
To adapt this method to the Harvard Dataset, I analyzed only the patients with information contained in the file survival.txt. This file, while not used in the clustering process itself, provides information on the survival of the patients. Since some of the patients in the full Harvard Dataset have no survival data recorded, including them often makes the resulting graphs incorrect.
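As a hypothetical sketch of this filtering step, the following reads survival.txt and collects the patient identifiers that have survival records; the assumed format (one whitespace-separated record per line, patient ID first) may differ from the actual file.

    def patients_with_survival(path="survival.txt"):
        # hypothetical format: one record per line, patient ID in column 1
        ids = set()
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:
                    ids.add(fields[0])
        return ids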