Since any point within the dataset could be a cluster center, in theory every single data point would have to be checked. For large k, this results in an unrealistic number of necessary calculations, so several approximation algorithms exist for that case. These algorithms do not necessarily give the ideal clustering, but their results are generally guaranteed to fall within a certain range of the optimum.
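For illustration (the exact dataset size may differ): choosing k centers from n patients means examining C(n, k) = n! / (k! (n - k)!) candidate sets. With n = 100 patients, k = 4 gives C(100, 4) = 3,921,225 combinations, which is manageable, while k = 25 gives roughly 2.4 x 10^23, far beyond practical reach.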
However, I was looking for at most four clusters, and could thus use a simpler yet more accurate algorithm.
I first separated out the adenocarcinoma patients from the Harvard Dataset A. Then, I created a distance matrix between all patients in the subclass (Professor Blumer updated this program to create a distance matrix based on genes normalized to 1). Once these tasks were complete, the clustering could begin.
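The following is a minimal sketch of how such a distance matrix might be built in Python. It assumes each patient is represented as a list of normalized expression values and uses Euclidean distance; both choices, and all names, are illustrative assumptions rather than the exact program used.

    import math

    def distance_matrix(patients):
        # patients: list of equal-length lists of normalized expression values
        # returns a symmetric n x n matrix of pairwise Euclidean distances
        n = len(patients)
        dist = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.sqrt(sum((a - b) ** 2
                                  for a, b in zip(patients[i], patients[j])))
                dist[i][j] = dist[j][i] = d
        return dist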
After asking the user for the desired number of clusters (at most four), I enumerated each patient, pair of patients, and so on, as a candidate set of cluster centers.
I then looped through the data. For each patient, I used the previously constructed distance matrix to determine which proposed cluster center that patient was closest to, and I added the pairwise distance between the patient and its closest cluster center to a running sum.
I continually updated the minimum sum; the candidate set with the minimum sum had the lowest total distance, and so represented the most ideal cluster center(s) found so far. I also continually updated the list of genes associated with each cluster center.
The cluster center(s) with the lowest total distance became the calculated ideal cluster center(s); a sketch of this exhaustive search appears below.
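Here is a minimal sketch of the search just described, assuming the distance matrix is stored as nested lists and patients are indexed by position; the function name and structure are hypothetical.

    from itertools import combinations

    def best_centers(dist, k):
        # dist: precomputed n x n distance matrix (nested lists)
        # k: desired number of clusters (here at most four)
        # returns the indices of the best center set and its total distance
        n = len(dist)
        best, best_sum = None, float("inf")
        for centers in combinations(range(n), k):
            # assign each patient to its nearest proposed center and
            # accumulate the total within-cluster distance
            total = sum(min(dist[p][c] for c in centers) for p in range(n))
            if total < best_sum:
                best, best_sum = centers, total
        return best, best_sum

With at most four clusters and on the order of a hundred patients, the search examines at most a few million candidate sets, consistent with the feasibility argument above.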
To adapt this method to the Harvard Dataset, I analyzed only the patients with information contained in the file survival.txt. This file, while not used in the clustering process itself, provides information on the survival of the patients. Since some of the patients in the full Harvard Dataset have no survival data recorded, including them often makes the resulting graphs incorrect.
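As a hypothetical sketch of this filtering step, the following reads survival.txt and collects the patient identifiers that have survival records; the assumed format (one whitespace-separated record per line, patient ID first) may differ from the actual file.

    def patients_with_survival(path="survival.txt"):
        # hypothetical format: one record per line, patient ID in column 1
        ids = set()
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:
                    ids.add(fields[0])
        return ids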