K-Median



K-Median Clustering


Basic Idea Behind K-Median Clustering | Basic Definitions | Method Employed | The Harvard Dataset | My Results | Future Plans

Basic Idea Behind K-Median Clustering:

The goal of K-Median clustering, like KNN clustering, is to separate the data into distinct groups based on the differences in the data. Upon completion, the analyst is left with k distinct groups, each with its own distinctive characteristics.

In this instance, the goal of K-Median clustering is to form k clusters of the adenocarcinoma data, each of which will have different survival times or rates.


Basic Definitions:

Cluster: - A subset of the data.

Cluster Center: - A member of the dataset which is the most representative of a particular cluster. It is the point which minimizes the sum of the distances of all other points within the cluster to itself.

Distance: - The Euclidean distance between two points.

Total Distance: - The sum of the distances from each point to its cluster center.

Calculated Ideal Cluster Center: - The cluster center determined by the algorithm.
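
The definitions above can also be written in symbols. The notation below is an assumption introduced for illustration (x and y are points described by G gene values, C is a cluster, c(C) is its cluster center, and D(C) is its total distance); it is not taken from the program itself.

```latex
d(x, y) = \sqrt{\sum_{g=1}^{G} (x_g - y_g)^2}
\qquad
c(C) = \arg\min_{c \in C} \sum_{x \in C} d(x, c)
\qquad
D(C) = \sum_{x \in C} d\bigl(x, c(C)\bigr)
```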


Method Employed:

Since any point within the dataset could be a cluster center, in theory every possible set of k data points would have to be checked. For large k, this results in an unrealistic number of calculations. Thus, there are several approximation algorithms which can be used for large k. These algorithms do not necessarily give the ideal clustering, but their results are generally guaranteed to be within a certain range of it.
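
To get a feel for how quickly the number of checks grows, here is a small sketch that counts the ways to choose k cluster centers from n data points. The function name and the dataset size of 100 are hypothetical and chosen only for illustration.

```python
from math import comb


def candidate_center_sets(n_points: int, k: int) -> int:
    """Number of distinct ways to pick k cluster centers from n_points data points."""
    return comb(n_points, k)


if __name__ == "__main__":
    n = 100  # hypothetical dataset size, for illustration only
    for k in (1, 2, 3, 4, 10):
        # Each candidate set would need one pass over all n points.
        print(k, candidate_center_sets(n, k))
```

For small k the count stays manageable, which is why checking every candidate set is realistic when only a handful of clusters is wanted.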

However, I was looking for at most four clusters, so I could use a simpler yet more accurate algorithm that checks every possible set of cluster centers.

I first separated out the adenocarcinoma patients from the Harvard Dataset A. Then, I created a distance matrix between all patients in the subclass (Professor Blumer updated this program to create a distance matrix based on genes normalized to 1). Once these tasks were complete, the clustering could begin.
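
A distance matrix of this kind could be built roughly as follows. This is a minimal sketch rather than the actual program: the function name and the toy data are made up, and the normalization step is left out because its details are not given here.

```python
from math import dist


def distance_matrix(profiles):
    """Return the symmetric matrix of pairwise Euclidean distances.

    profiles[i] is the list of gene expression values for patient i.
    """
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(profiles[i], profiles[j])
            # The distance is symmetric, so fill both halves at once.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Toy data: three made-up patients with two made-up gene values each.
toy_profiles = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
toy_matrix = distance_matrix(toy_profiles)
```

Computing the matrix once up front means the clustering step only has to look distances up, never recompute them.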

After asking the user for the desired number of clusters (at most four), I selected each patient, pair of patients, and so on (depending on the number of clusters) to be checked as a possible set of cluster centers.

I looped through the data. For each patient, I used the previously constructed distance matrix to determine which proposed cluster center that patient was closest to. Then, I added the distance between the patient and the closest cluster center to a running sum.
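
The loop described above might look something like this. It is a sketch under assumptions: the function name, the return values, and the toy distance matrix are all hypothetical, and ties are broken in favor of the center listed first.

```python
def assign_and_sum(dist, centers):
    """Assign every point to its closest proposed center.

    dist is a square pairwise distance matrix and centers is a sequence of
    row indices. Returns (total_distance, sizes), where sizes maps each
    center to the number of points assigned to it (the center included).
    """
    total = 0.0
    sizes = {c: 0 for c in centers}
    for point in range(len(dist)):
        # Look up the closest proposed center in the precomputed matrix.
        closest = min(centers, key=lambda c: dist[point][c])
        total += dist[point][closest]
        sizes[closest] += 1
    return total, sizes


# Made-up distances between four points forming two tight pairs.
toy_dist = [
    [0.0, 1.0, 5.0, 6.0],
    [1.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
]
total, sizes = assign_and_sum(toy_dist, (0, 3))
```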

I kept track of the minimum sum found so far. A lower sum means a lower total distance for the clustering, and therefore a better choice of cluster center(s). I also continually updated the list of genes associated with each cluster center.

The cluster center(s) with the lowest total distance became known as the calculated ideal cluster center.
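
Put together, the search over all candidate centers can be sketched as below. This illustrates the exhaustive approach described above and is not the original program; the function name and the toy points are assumptions.

```python
from itertools import combinations


def k_median_exhaustive(dist, k):
    """Try every set of k points as cluster centers.

    dist is a square pairwise distance matrix. Returns
    (best_total, best_centers) for the set with the lowest total distance.
    """
    n = len(dist)
    best_total = float("inf")
    best_centers = None
    for centers in combinations(range(n), k):
        # Each point contributes its distance to the closest proposed center.
        total = sum(min(dist[p][c] for c in centers) for p in range(n))
        if total < best_total:
            best_total, best_centers = total, centers
    return best_total, best_centers


# Made-up points on a line; distances are absolute differences.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
best_total, best_centers = k_median_exhaustive(toy_dist, 2)
```

Because every candidate set is tried, the result is the true minimum for the given distance matrix, at the cost of a running time that grows quickly with k.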

To adapt this method to the Harvard Dataset, I analyzed only the patients with information contained in the file survival.txt. This file, while not used in the clustering process, provides information on the survival of the patients. Because some of the patients in the full Harvard Dataset do not have survival data recorded, graphs that include them are oftentimes incorrect.


The Harvard Dataset:

The Harvard Dataset can be found on the CAMDA website at www.camda.duke.edu/camda03/contest.asp. It contains five distinct groups: adenocarcinomas (AD), squamous (SQ), carcinoid (COID), small cell (SMLC), and normal lung (NL). There are 202 patients within the dataset, and 12,600 genes were analyzed for each of them.


My Results:

Full Dataset

The K-Median program was run on all 12,600 genes of the Harvard Dataset. The only patients considered were those for whom survival information existed.

K   Total Distance   Center  Size   Center  Size   Center  Size   Center  Size
2   657.97           27      38     39      46     -       -      -       -
3   623.905          27      36     39      46     69      2      -       -
4   603.595          27      30     39      46     69      2      73      6

The Kaplan-Meier curves for the above clusterings can be seen here:
K=2 | K=3 | K=4


Shortened Dataset

The genes in the Shortened Dataset were earmarked as important by Cowen et al. in a paper not yet published. They can be seen here.

K   Total Distance   Center  Size   Center  Size   Center  Size   Center  Size
2   44.0214          31      33     64      51     -       -      -       -
3   41.7432          26      13     31      33     64      38     -       -
4   40.5592          5       19     26      13     31      23     64      29

The Kaplan-Meier curves for the above clusterings can be seen here:
K=2 | K=3 | K=4

Overlapped Dataset

The Overlapped Dataset contained the genes from the Shortened Dataset that were also contained in the Michigan and Ontario Datasets. The genes used are highlighted in green.

K   Distance         Center  Size   Center  Size   Center  Size   Center  Size
2   27.9204          49      44     59      40     -       -      -       -
3   26.6816          12      33     26      15     64      36     -       -
4   25.8564          26      14     43      14     59      28     64      28

The Kaplan-Meier curves for the above clusterings can be seen here:
K=2 | K=3 | K=4

Future Plans:

This program is in the process of being integrated with my Top Genes algorithm.







Questions or Comments?
Email Me! Emily K. Mower