DMP: Summer 2004

JOURNAL

Project Journal Entries:
ENTRY I
4 June 2004
"OVERVIEW AND PROTEIN STRUCTURE",

I was given a book on protein structure and expected to understand it. Of course, I am not totally lost in biological matters, but I am a Computer Science/Sociology major, to remind you. I struggled through the first several chapters and had to reread each section several times in order to comprehend the inner workings of this complex system. Once I got the basic background, the terms and the functionality of specific parts, it became interesting, not bothersome. I learnt how amino acids assemble into a polypeptide chain, how the chain forms secondary structures (helices and sheets), and how those fold into the tertiary structure, which is the biggest mystery of all. I realized that my research will be focused on this stage of folding into tertiary structure, since it's the least understood scientifically and is crucial to the correct functioning of the protein.

ENTRY II
11 June 2004
"MECHANISMS OF PROTEIN FOLDING",

I continued learning about mechanisms of protein folding, which brought me to a deeper realization of how little is objectively known about how tertiary structure forms. Scientifically, it's not established how exactly the secondary structure folds into a 3D shape, or what goes wrong in this complex folding process (things go wrong all the time - a lot of diseases are the result of such misfolding). Since experiments are incredibly time consuming, bioinformatics steps in with the capability of modeling and simulating these processes.
I also started looking at different ways of comparing two abstract trajectories, ways of quantifying their similarities and differences. This is relevant since such a comparison would allow us, for example, to compare two free energy plots of protein folding simulations to determine which simulation is more efficient. This goal is very important since a lot of researchers are coming up with various algorithms, but there is no way of quantitatively comparing them.

ENTRY III
18 June 2004
"TRAJECTORY COMPARISON MEASURES",

I was looking for trajectory comparison measures across all the different subfields of science. There is a lot of material where trajectories get utilized for a bigger goal, as a reflection or graphical representation of some other aspect of the research. There were almost no papers that would focus on the trajectory itself, as both a means and an end, no work that analyzed the path beyond standard statistical measures. Those, of course, were always available - mean, minima, maxima, standard deviation and dispersion. Otherwise, there is almost nothing new available to analyze trajectories.
This search was rather demanding, since I had to pick out the relevant ideas from papers on subjects I was often not proficient in. There were many things I didn't know, but I had to get at least a general idea of them to determine if the methods used for trajectory comparison were applicable and useful in our case.
I borrowed a Linear Algebra book from the library because I needed at least the basics of the discipline, and I haven't taken the course yet (I am taking it in the Fall semester). Most of the papers I read used matrices and the related notation, but my biggest reason to learn Linear Algebra was to comprehend the ideas behind PCA - principal component analysis - a dimensionality reduction technique that could prove useful in the research.

ENTRY IV
25 June 2004
"BIOLOGY FOR MODELERS",

I am still looking for measures of comparison of two trajectories, as well as comparison techniques for base comparison, but now I don't have that much time anymore. I started a free non-credit course at Rice University called "Biology for Modelers". Judging by the name, and considering that it's offered by the Math Department, it's no surprise that it is an introductory biochemistry course with an emphasis on research and modeling. I thought it would be a good idea to learn more about proteins and modeling them, since they are, in the long run, the focus of this research.
Even though the class is introductory, I dedicated all my time this week to this course. Every morning I would go to the lecture, twice a week we would have a lab, and the rest of the day I would spend reading and rereading chapters of the biochemistry book to prepare myself for the next day's lecture. This week flew by really fast - I was too busy to notice how fast the time went by.

ENTRY V
2 July 2004
"STATISTICAL MECHANICS VIEW ON PROTEIN FOLDING",

Remember how difficult it was for me to read all those papers from different areas of science and try to figure them out? I thought those papers uncovered fields of science "entirely alien" to me!!! But all of them were at least somewhat related to CS and modeling of different things, so the only things I didn't understand were some specific terms and mostly some math, which wasn't that hard to look up.
This week we met with the Chemistry professor Cecilia Clementi, who works in collaboration with Prof. Kavraki on the protein folding project. We discussed the subject, and I was given more papers to read. This time the papers had nothing to do with Computer Science; they were based on statistical mechanics! Reading those papers, I realized there are some areas of science I am totally unfamiliar with. It was an experience I could only compare to giving birth (I don't have children!) - painful and exhilarating at the same time. It was ... difficult ... interesting ... exciting.
By the way, I didn't really have any time to focus on the difficulties, because I was still attending the Biology for Modelers class.

ENTRY VI
9 July 2004
"INTERATOMIC DISTANCE MEASURE",

Searching for alternative ways of analyzing and comparing trajectories, I stumbled upon a paper by Christoph Best and Hans-Christian Hege mentioning an interesting idea for analyzing a protein folding trajectory. The idea was to compile a vector of interatomic distances – the distance between each pair of atoms in the protein structure – and watch it change over time, as the protein folds into its final conformation. The distances between different atoms are supposed to give you a rather good idea of how compact the structure is and which parts of the protein are coming closer to or moving away from each other. We decided that this measure could prove useful for us, so we resolved to test it.
So this week we finally got to code!!!
We have been working on a program that constructs an interatomic distance vector from a vector of Cartesian coordinates for a folding trajectory. It’s been fun! It feels great trying to apply a theoretical idea from one of those complex papers to something practical where we can see results.
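To make the idea concrete, here is a minimal Python sketch of the distance-vector construction. It is not our actual program: the function names, the pair ordering (i < j), and the representation of a frame as a list of (x, y, z) tuples are all assumptions made for illustration.

```python
import math
from itertools import combinations


def distance_vector(frame):
    """Distances between every pair of atoms (i < j) in one frame.

    `frame` is assumed to be a list of (x, y, z) tuples, one per atom.
    """
    return [math.dist(a, b) for a, b in combinations(frame, 2)]


def trajectory_distance_vectors(trajectory):
    """One distance vector per frame of the trajectory."""
    return [distance_vector(frame) for frame in trajectory]


# Toy trajectory (made-up coordinates): three beads, straight then bent.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
vectors = trajectory_distance_vectors([straight, bent])
```

For N atoms each vector has N(N-1)/2 entries, which is why the output grows so quickly with the length of the chain.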

ENTRY VII
16 July 2004
"INTERATOMIC DISTANCE MEASURE CNTD",

We finished coding the procedure for interatomic distance vectors with some help from a graduate student, Amarda Shehua, who is the person directly responsible for helping us with any problems/obstacles in our project. Sweet Amarda – she is so helpful and patient with us.
Anyway, with some help we finished this first piece of code!!!
Amarda also supplied us with five toy examples of folding trajectories. She created a string of 60 “atoms”/beads in VMD and bent it into five different shapes: symmetric 1 bend, asymmetric 1 bend, 2 bends, 3 bends, and a loop. This is the test material for our new code. We’ve been testing the code with it and it seems to work fine (after a couple of fine-tunings), but there is no obvious information in the output files about the shape of the final conformation and, moreover, the whole trajectory. I mean, at least there is no obvious way of seeing that info in the file. After consulting with the Professor we decided to move on to contact vectors.
A contact vector is a derivative of an interatomic distance vector: by establishing a certain threshold (cutoff), we determine if two atoms are in contact. If the distance between two atoms is less than the threshold, they are considered in contact; if they are further from each other than the threshold distance, they are not in contact.

ENTRY VIII
23 July 2004
"CONTACT VECTORS",

This week we’ve been working on developing a piece of code that would create the contact vectors. We just expanded our code that generates interatomic distance vectors so that it also creates the contact vectors. We generated two types of contact vectors: binary and real. In the binary contact vector, entries get assigned 1 for a contact and 0 otherwise, while in the real contact vector, non-contact pairs get assigned the real value of threshold over distance.
We ran our code with different cutoffs. You can’t tell if the results are good until you visualize them somehow, since the vectors are very big (the initial trajectories were rather big). So we plotted our contact vectors in Matlab. We tried plotting the contact vector produced with a cutoff of 2.00 angstrom. Most of the graph appeared to be in contact (all black). This could only mean that the cutoff is too big compared to the bond length.
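A small Python sketch of the two kinds of contact vectors, under a few assumptions of mine: the function names are made up, a pair at exactly the cutoff distance is treated as not in contact, and in the real-valued vector a contact is stored as 1.0 while a non-contact pair is stored as cutoff / distance.

```python
def binary_contacts(distances, cutoff):
    """1 if the pair is closer than the cutoff, 0 otherwise."""
    return [1 if d < cutoff else 0 for d in distances]


def real_contacts(distances, cutoff):
    """Real-valued contact vector.

    Assumption: contacts are stored as 1.0, and non-contact pairs as
    cutoff / distance, so the value decays as the atoms move apart.
    """
    return [1.0 if d < cutoff else cutoff / d for d in distances]


# Made-up distances, only to show the shape of the output.
distances = [0.5, 1.0, 2.0, 4.0]
binary = binary_contacts(distances, cutoff=1.0)
real = real_contacts(distances, cutoff=1.0)
```

The real-valued version keeps some information about how far apart the non-contact pairs are, which the binary version throws away.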

ENTRY IX
30 July 2004
"REASONABLE CUTOFF",

I was having so much fun this week!!!
I was playing around with Matlab the whole week. I was trying to find a good cutoff for our contact vectors. So I would run our procedure with a certain cutoff. The procedure runs on the entire folding trajectory of a particular conformation. So far we are only interested in the final shape. So after running the procedure, I would separate the contact vector for the last frame from the big (full trajectory) vector. Then I would plot that smaller vector in Matlab using this little script (that Amarda helped us with again!). So running, parsing, and plotting!!! This was great – I got to play around with pictures.
After going through this routine for cutoffs between 0.05 and 2.00, I found a good result at 0.2. But it was only good for real contact maps, while you couldn’t see anything on the binary maps. We are really interested in finding a cutoff that gives good binary maps, because it would be much easier to analyze the binary contact vector (you only have to deal with 0s and 1s instead of reals).
The best cutoff I found is 0.6 angstrom. After I removed 7 bonded neighbors from the Matlab graph, the graphical representation looked perfect – two clear-cut clusters of black dots representing each bend. There are two clusters for each bend because our vector only fills ½ of the square matrix we need to provide for plotting, so we replicate it symmetrically.
All this time Lin has been working on code to automatically detect the number of bends (events). After I provided her with the good cutoff and neighboring window, she had a much easier time finishing the code (she knew how many bonded neighbors to disregard in her clustering technique).
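The step from a contact vector to a plottable square matrix can be sketched in Python roughly as follows. This is an illustration, not the Matlab script we used: the function name is hypothetical, the vector is assumed to list pairs in (i < j) order, and "removing bonded neighbors" is assumed to mean blanking every pair whose chain indices differ by no more than the window.

```python
from itertools import combinations


def contact_matrix(contacts, n_atoms, window):
    """Symmetric n_atoms x n_atoms matrix built from a pairwise vector.

    Pairs with |i - j| <= window (assumed to be bonded neighbors along
    the chain) are left at 0 so they do not clutter the plot.
    """
    if len(contacts) != n_atoms * (n_atoms - 1) // 2:
        raise ValueError("contact vector length does not match n_atoms")
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    pairs = combinations(range(n_atoms), 2)  # yields (i, j) with i < j
    for value, (i, j) in zip(contacts, pairs):
        if j - i > window:
            matrix[i][j] = value
            matrix[j][i] = value  # mirror into the other half
    return matrix


# Toy case: 5 beads, every pair "in contact", neighbors within 2 removed.
toy = contact_matrix([1] * 10, n_atoms=5, window=2)
```

Mirroring the upper half into the lower half is what makes each bend show up as two clusters in the picture.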

ENTRY X
6 August 2004
"PLAYING WITH ENTIRE TRAJECTORY",

Lin’s code works for automatically detecting events in the last frame. Now we are really close to something – we could run that code on the entire folding trajectory. The trajectories we were given went from the unfolded to the folded state, but we decided we could see more if we analyzed the entire length of the transformation from unfolded to folded, and back to the unfolded state. So I wrote a small piece of code that expanded all the trajectories (except asymmetric 1 bend, which was given to us in that form from the beginning). I think this is great! We are finally approaching that initial goal of analyzing entire folding trajectories!!! I am very excited!
The way Lin wrote the code, it outputs an integer to signify the event taking place in that frame. For example, 1, 2, and 3 stand for 1, 2, and 3 bends respectively. She arbitrarily picked 9 for the loop. So when we run that code on the entire trajectory, it outputs a string of integers. Now, the appropriate step would be to analyze this string and conclude which type of event took place in the whole trajectory. We did this by looking for the largest number that was output in a row a significant number of times. We had to filter out some noise, which was rather insignificant anyway. Using those techniques, a procedure was created that outputs the type of event taking place in the entire trajectory.
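A minimal Python sketch of the expansion step, assuming a trajectory is simply a list of frames and that the folded frame should appear only once in the middle (both are assumptions of this sketch, as is the function name):

```python
def expand_trajectory(frames):
    """Append the frames in reverse order: unfolded -> folded -> unfolded.

    The last (folded) frame is not repeated, so a trajectory of n frames
    becomes one of 2n - 1 frames.
    """
    return frames + frames[-2::-1]


# Frame labels stand in for real coordinate frames here.
expanded = expand_trajectory(["unfolded", "halfway", "folded"])
```

Since the return path is just the forward path mirrored in time, any per-frame analysis should come out symmetric around the folded frame.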
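The last step can be sketched in Python like this. It is my reading of the rule rather than Lin's actual code: the function name, the `min_run` threshold, and the use of 0 for "no event" are assumptions, and "a significant number of times in a row" is interpreted as a consecutive run of at least `min_run` frames.

```python
from itertools import groupby


def trajectory_event(frame_events, min_run=3):
    """Largest per-frame label that persists for at least min_run frames.

    Shorter runs are treated as noise and ignored. Returns 0 (assumed to
    mean "no event") when no run is long enough.
    """
    sustained = [
        label
        for label, run in groupby(frame_events)
        if len(list(run)) >= min_run
    ]
    return max(sustained, default=0)


# Made-up label strings: the lone 2 and the lone 9 are filtered as noise.
two_bends = trajectory_event([0, 0, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 1, 1, 0, 0])
one_bend = trajectory_event([0, 0, 0, 1, 1, 1, 9, 1, 1, 1])
```

Because the loop was given the label 9, a sustained loop always wins over any number of bends under this rule.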
We got there!!!