Introduction
My summer project is related to the background picture on the homepage of this website: the structure of a protein. An important problem in bioengineering is to design new protein structures that are viable, that is, structures that are stable. Related to this are the problems of measuring the similarity between protein structures and accurately classifying protein structures into their families. The Baker Lab is currently involved in The Rosetta Project, which runs computationally intensive algorithms to predict protein structures.
Prof. Gupta's group is trying to apply statistical learning algorithms to solve problems such as the protein structure prediction problem.
My Role
The problem that our group is trying to solve is the following: Given information about some data (specifically their similarity to each other),
how do you classify the data accurately?
There are two parts to this problem:
1. Determining a similarity metric that accurately models the actual relations between protein structures. For example, if the data
being considered are points in an x-y Euclidean plane, squared distance is a great way to compare points (points that are farther apart are
less similar; points that are closer together are more similar). But a protein is usually described by a set of 'features', so each protein is
represented by a vector of feature values (something like {# alpha helices, # beta sheets, # amino acids, and so forth}). Such feature vectors
do not naturally live in a Euclidean space where distance is meaningful, so it doesn't make sense to use something like squared distance.
I dabbled with various similarity metrics, learning more about what kinds of data they are good for, how they work, and so on.
2. The second problem is what I ended up spending most of my summer working on. The problem is: given the pairwise similarities between samples,
how would one predict the classes that the samples belong to? We developed a similarity-based parametric classifier to do this, and compared it
with existing classifiers such as nearest neighbors and support vector machines.
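
To make the two parts above concrete, here is a minimal Python sketch of one of the baselines mentioned, a nearest-neighbor classifier that only looks at similarities. It is not the parametric classifier our group developed. The function names, the toy points, and the conversion from squared distance to a similarity score are all assumptions made for illustration.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def similarity(a, b):
    """Hypothetical similarity: 1 for identical points, shrinking toward 0
    as the squared distance grows. Any other similarity could be swapped in."""
    return 1.0 / (1.0 + squared_distance(a, b))


def knn_predict(similarities, labels, k=3):
    """Predict a class from similarities alone.

    similarities[i] is the similarity between the query sample and
    training sample i; labels[i] is that training sample's class.
    The k most similar training samples vote on the class.
    """
    ranked = sorted(range(len(labels)), key=lambda i: similarities[i], reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters of points in the x-y plane.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B", "B"]

query = (0.5, 0.5)
sims = [similarity(query, p) for p in train]
prediction = knn_predict(sims, labels, k=3)
```

Note that knn_predict never sees the points themselves, only the similarity scores, so the same function would work with any similarity measure defined on protein feature vectors.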