My Project

The evolutionary relationships between organisms and species are generally represented as phylogenetic trees. Biologists can derive many thousands of trees for the same group of organisms. The Robinson-Foulds (RF) distance is one commonly used metric for analyzing these derived trees. It describes the difference between two trees spanning the same set of organisms. The RF distances between all pairs of trees in a tree set can be stored in an RF matrix. These matrices have an advantage over other methods of phylogenetic analysis, such as consensus trees: they retain more information about the original tree space and thus lose less biological significance in their formation. However, with larger tree sets, such matrices can become extremely large and thus impractical to use.
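
As an illustration of what an RF matrix holds, the sketch below computes RF distances by counting the non-trivial bipartitions (splits of the organisms induced by an internal edge) found in one tree but not the other. It is a minimal sketch only: the nested-tuple tree encoding and the names bipartitions, rf_distance, and rf_matrix are assumptions made for this example, not part of the planned application.

```python
from itertools import combinations


def leaf_sets(tree, out):
    """Return the leaves under `tree`; append one leaf set per internal node to `out`."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial splits of a tree, each stored as the side without a reference leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)  # arbitrary but consistent reference leaf
    splits = set()
    for clade in clades:
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits that separate zero or one leaf from the rest.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def rf_matrix(trees):
    """Symmetric matrix of pairwise RF distances (hypothetical in-memory layout)."""
    n = len(trees)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = rf_distance(trees[i], trees[j])
    return matrix


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = ((("A", "B"), "C"), "D")  # same unrooted topology as t1
print(rf_matrix([t1, t2, t3]))
```

Trees are compared as unrooted here, so t1 and t3 are at distance 0 even though they are drawn with different roots.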

The goal of this research is to encourage the use of RF matrices in phylogenetic analyses by developing an application that lets users interactively and efficiently obtain information from extremely large RF matrices. A large RF matrix is defined here as one built from a data set of 100,000 trees or more. The information the user will be able to obtain from the matrix includes:

  • All tree pairs that are a certain RF distance or range apart
  • Basic statistical information (mean, median, mode, histogram, etc.) of the data in the matrix
  • (possibly) A visual representation of the input or output data set.
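
The first two kinds of query listed above can be sketched on a small in-memory matrix as follows. This is an illustrative sketch under stated assumptions: the function names (upper_triangle, pairs_in_range, summarize) and the list-of-lists matrix are hypothetical, and a matrix of 100,000 trees would need a more compact or disk-backed representation than the one used here.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, median, mode


def upper_triangle(matrix):
    """Yield (i, j, distance) once for every tree pair with i < j."""
    for i, j in combinations(range(len(matrix)), 2):
        yield i, j, matrix[i][j]


def pairs_in_range(matrix, low, high):
    """All tree pairs whose RF distance lies in the inclusive range [low, high]."""
    return [(i, j, d) for i, j, d in upper_triangle(matrix) if low <= d <= high]


def summarize(matrix):
    """Basic statistics over the distinct tree pairs in the matrix."""
    distances = [d for _, _, d in upper_triangle(matrix)]
    return {
        "mean": mean(distances),
        "median": median(distances),
        "mode": mode(distances),
        "histogram": dict(Counter(distances)),
    }


# Made-up RF matrix for four trees, for illustration only.
example = [
    [0, 2, 0, 2],
    [2, 0, 2, 4],
    [0, 2, 0, 2],
    [2, 4, 2, 0],
]
print(pairs_in_range(example, 4, 4))
print(summarize(example))
```

Only the upper triangle is scanned, since the matrix is symmetric and its diagonal is always zero.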

The exact measure of efficiency is as yet undefined; however, it will depend on minimizing time, memory, and disk usage, which is one of the main challenges of this project. It will be properly defined after analysis of some initial results from the research.
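
To give a rough sense of the memory and disk challenge, the arithmetic below counts the distinct tree pairs in a matrix of 100,000 trees. The storage width of 2 bytes per distance is purely an assumption for illustration; the actual encoding has not been decided.

```python
def pair_count(num_trees):
    """Distinct tree pairs, i.e. entries in the upper triangle of the RF matrix."""
    return num_trees * (num_trees - 1) // 2


def triangle_bytes(num_trees, bytes_per_distance=2):
    """Storage for the upper triangle; bytes_per_distance is an assumed width."""
    return pair_count(num_trees) * bytes_per_distance


n = 100_000
print(pair_count(n))            # number of RF distances to store
print(triangle_bytes(n) / 1e9)  # approximate size in gigabytes
```

Under these assumptions, the smallest "large" tree set already yields about 5 billion distances, or roughly 10 GB even when only half of the symmetric matrix is kept.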

I hope that the application I create will encourage the use of distance matrices in phylogenetic analyses of large tree sets. By the end of this project, I hope to have a working user interface that can accomplish at least the aforementioned goals in a timely manner. If successful, this project will make phylogenetics research easier for biologists by allowing them to easily extract the data of interest from large RF matrices. The statistical and (possibly) visual capabilities of this program may simplify the process of finding patterns in large RF matrices. These capabilities could also guide researchers towards interesting or unusual parts of an RF matrix that they might otherwise have missed because of its size. Other possible applications of this project include aiding the analysis of tree convergence heuristics and the evaluation of generated consensus trees. The extracted data can also be used to analyze different methods of tree generation. With some modification, this program could be used on any distance matrix in any research field.