Distributed Mentor Project 2005 Journal

Week 1: May 23 - May 27

This is my first week in the lab for the summer. I spent most of this week working with Brad, another undergraduate in the lab. Our first task was to modify the current server used to run experiments so that it would be able to resume a job from somewhere in the middle of an experiment. This has proved to be extremely beneficial! We have been trying to strain our server code by running it for extended periods of time. Sometimes during these tests, it will just stop running. It doesn't appear that the code has crashed...it just stops. With our new resume functionality, we are able to simply restart the job where it stopped. This has been really nice! We also implemented some statistics calculations that will be used later to help optimize experimentation.

Outside of the lab, things have been busy as well. I moved into a duplex in one of the neighborhoods near the university campus last Thursday. I have three very nice roommates and have enjoyed being able to walk to campus. I drive very little now...which is nice. There is a lot of unpacking to do though. Boxes are cluttered all over the living room and dining room. It should be really nice once everything gets organized.

Week 2: May 30 - June 3

This week has been a big change from last week. Drew, another undergraduate who will be working with us, arrived for the summer. He had been studying abroad for a semester, but he worked on this project last summer and fall. On Monday he, Brad, and I were given brief sketches of our projects for the summer. I still only have a vague understanding of my project, but I know that I will learn more as I go. I didn't start on my project right away, though. Drew and I spent the rest of the week gathering information on certain proteins in preparation for a data set that will be used in experiments later in the summer. We recorded information about the function of the protein, the EC classification number, and the ligands associated with the protein. So this has really been more of a biology week than a computer science week. The extent of my biology knowledge is at the introductory course level, so I'm running into many, many things that I don't know. I've been reading a lot of papers about these different proteins, their structures, and their functions (often as enzymes) and have been swimming in new biological terms. When I registered for Fall 2005 classes last spring, I had considered taking biochemistry but wasn't sure if I would. This week's work with my set of proteins has convinced me that biochemistry knowledge would indeed be useful.

Well, it's been two weeks since my move and there are still boxes all over the house! I really have to finish unpacking everything this weekend. I don't think I can stand to have my room be such a mess anymore. I went home to Round Rock last weekend to visit my family. Two of my cousins were graduating from high school, so there was a party. It was nice to see all of my aunts, uncles, and cousins.

Week 3: June 6 - June 10

On Monday of this week I learned that I will have the opportunity to participate in the W.M. Keck Center's Undergraduate Research Training Program. I had heard about this program from a friend who was involved with it last summer, and I knew that at least two undergraduates from my lab were participating in the program. I had not expected to be able to participate in the program since I had applied to the DMP rather than to this program. Then I got a call from Drew Monday morning asking if I was coming to the kick-off event. Dr. Kavraki and Brian, the graduate student with whom Drew, Brad, and I work, had arranged for the undergraduates in the lab who aren't officially part of Keck to be unfunded participants in the program. This is a program that is part of the Gulf Coast Consortia and focuses on interdisciplinary bioscience, so student participants come from majors all the way from biology and biochemistry to computer science and electrical engineering. Each week we go on site visits to learn more about current work and opportunities in computational biology. This Friday we went to the Texas Learning and Computation Center at the University of Houston.

This week Drew and I continued our search for appropriate proteins for our data set. We were starting from a set of approximately fifty structures. The group working on this project at Rice meets every week with a group from Baylor College of Medicine with whom we are collaborating. At last Friday's meeting, one of the Baylor graduate students suggested that we expand our search to include a data set that he had prepared for a previous student. So this week Drew and I started investigating the proteins in this older data set as well.

Week 4: June 13 - June 17

This week I continued to work on identifying a data set that will be appropriate for my later experiments. My goal is to have ten proteins in this set. For my project, the proteins in this data set must be diverse, each have multiple PDB structures, and each have a set of functional homologs that can be used for testing. For each protein I first find its EC number. Some proteins have no EC number, so these are discarded immediately. Next I look at all of the PDB structures that have this EC number. These structures make up the EC family, which defines the set of functional homologs. For our experiments, it is important that the EC family be rather large, so proteins that don't meet our minimum family size threshold are also discarded. For the proteins with appropriately sized EC families, I then use a program that performs pairwise alignment to find their 100% sequence homologs, which are other PDB structures of the same protein. If a protein has no 100% sequence homologs, it is discarded. Just getting to this point takes a really long time. I never realized how much work goes into developing a data set for experiments. Drew and I spent one and a half days this week just collecting this data for the older data set that I mentioned last week. There were about 400 structures in this older data set. Unfortunately most of these 400 structures proved not to be useful for our experiments because they either had no EC number, did not have a large enough set of functional homologs (where ten was the minimum threshold), or did not have any 100% sequence homologs.
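The three discard rules described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the Structure record, its field names, and the select_candidates helper are hypothetical, and it assumes the EC family and the 100% sequence homologs have already been looked up (in practice those came from PDB searches and a pairwise alignment program). The threshold of ten is the minimum family size mentioned above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MIN_FAMILY_SIZE = 10  # minimum number of functional homologs, as stated in the entry


@dataclass
class Structure:
    """Hypothetical record for one candidate structure (field names are assumptions)."""
    pdb_id: str
    ec_number: Optional[str] = None  # None when the protein has no EC number
    ec_family: List[str] = field(default_factory=list)  # structures sharing the EC number
    identical_homologs: List[str] = field(default_factory=list)  # 100% sequence homologs


def select_candidates(structures, min_family_size=MIN_FAMILY_SIZE):
    """Keep only the structures that pass all three filters, in input order."""
    kept = []
    for s in structures:
        if s.ec_number is None:
            continue  # rule 1: no EC number
        if len(s.ec_family) < min_family_size:
            continue  # rule 2: EC family too small
        if not s.identical_homologs:
            continue  # rule 3: no 100% sequence homologs
        kept.append(s)
    return kept
```

Applying the checks in this order matches the order in which I did them by hand, with the cheapest test (does an EC number exist at all?) first and the alignment-based test last.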

For those structures that we had already identified as promising members of the data set, I began to read the papers on each structure in more detail to determine which amino acids within the protein are functionally significant. Eventually I will use these functionally significant amino acids (or residues) to design motifs that will be used to search a set of target structures for matches. Each motif is simply a set of points in three dimensional space, where each point is an atom in a particular residue. Finding documentation on functionally significant residues has been hard! Some proteins are just not well studied and documented yet. I'm very uncertain about whether I'm picking out the correct amino acids. Based on the papers, it's often difficult for me (with my limited knowledge of biological concepts and jargon) to tell. They might mention one amino acid as THE most important, and then the paper might be sprinkled with references to others (some at the active site and some at other binding sites). I feel like I'm kind of guessing about which ones are functionally significant. This part of the work has been slow, simply because reading and digesting the papers is taking me a long time.
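To make the idea of a motif concrete, here is a minimal Python sketch of one way to represent it: a list of points, each tied to an atom in a particular residue. The class and field names are my own assumptions for illustration and are not the format used by the lab's code.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MotifPoint:
    """One motif point: an atom in a particular residue (names are assumptions)."""
    residue_name: str    # three-letter amino acid code, e.g. "HIS"
    residue_number: int  # position of the residue in the chain
    atom_name: str       # e.g. "CA" for the alpha carbon
    xyz: Tuple[float, float, float]  # coordinates in three-dimensional space


@dataclass
class Motif:
    """A motif is simply a collection of points taken from one structure."""
    pdb_id: str
    points: List[MotifPoint]

    def pairwise_distances(self) -> List[float]:
        """Distances between every pair of points, in order (0,1), (0,2), ..."""
        pts = self.points
        return [
            math.dist(pts[i].xyz, pts[j].xyz)
            for i in range(len(pts))
            for j in range(i + 1, len(pts))
        ]
```

The pairwise distances are included only because they are a simple way to describe the shape of a point set; whether the matching algorithm uses them is not something this sketch claims.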

Once I've determined the functionally significant residues, I'm going to use them to create a motif (or motifs) for each protein. I will create experiment files for these motifs and run them through the Match Augmentation algorithm using the set of functional homologs as the set of targets. This will allow me to see how well the motifs I've designed actually match (or don't match) a set of similar structures. I tried to make one of these experiment files by hand for one of my motifs late this week, but I discovered that I needed sequence analysis information for each residue. Hopefully, Brian can help me with this next week and I can get the initial experimentation started.

On the Keck trip this week, we went to the U.T. M.D. Anderson Cancer Center for a tour that focused on patients and clinical trials. I didn't find the site visit to be that useful, because there was little to no emphasis on the computational aspects of their research. The tour took us through some biology labs, but this didn't really mean much to the computer scientists and electrical engineers among us. There was an interesting presentation on the drug discovery process, with a focus on the clinical trials stage. I knew that designing a new drug was expensive and took many, many years, but I was once again reminded of just HOW EXPENSIVE the process is and just HOW LONG it takes. It's incredible.

Last weekend I went camping at Enchanted Rock in Fredericksburg, TX with four other friends who are also in Texas for the summer. Although I have lived in Texas my entire life, I had never been to Enchanted Rock. It was so much fun! We climbed the rock, which is this gigantic granite dome. (And, being the intelligent people that we are, we did all of our climbing during the hottest part of the afternoon!) Coming down the rock was fun, because some parts were so steep that we could just sit down and slide down the side. We spent one night at the campsite at the base of the rock. It had been a long time since I had seen so many stars in the sky. For a while, we just lay out on some of the rocks near our campsite and marveled at the stars. The sunrise the next morning was beautiful as well. A trip to Enchanted Rock is definitely something I would recommend to anyone visiting (or living in) Texas. I am planning to put some of my pictures from the trip up on this website.

Week 5: June 20 - June 24

I was finally able to make some of my experiment files this week! Brian had been out of town last week, but when he came back early this week, he helped me to find the sequence analysis information for each of my proteins and showed me how to incorporate this into my experiment files. It turns out that we have this information for most of the proteins already. Using an existing piece of code, I can make my experiment files automatically by providing an input file specifying the PDB file from the Protein Data Bank, the file containing the sequence analysis information, and a list of the residues that make up my motif. This means that I don't have to make all of the experiment files by hand, which is a good thing! The files that are produced, however, contain all of the residues in the protein. I am only interested in approximately eight or so residues, so I have been editing the files by hand to remove those residues that are not part of the motif. This just makes the size of my files smaller and easier to handle. After meeting with Brian early in the week, we decided that I should make two motifs for each protein, one that is primarily composed of residues that have been documented as functionally significant (supplemented where necessary with residues chosen based on sequence analysis information to increase the size of the motif) and another that is larger and includes more residues chosen based on sequence analysis information.

This week Brad, Drew, and I gave project updates for the first time at our weekly project meeting with Dr. Kavraki and the group from Baylor. I gave a short overview of my recent work identifying the data set and spoke briefly about the protein structures I have chosen so far and why they are appropriate for the data set. I was nervous about this, but it went okay. The graduate students from Baylor suggested some additional resources to help me determine the functionally significant residues in each of my proteins.

On Friday I met with Brian again to discuss in detail the progress of the data set construction. I had eight structures to show him, all including information about the documented functionally significant residues.

The Keck trip this week was to the Human Genome Sequencing Center at Baylor College of Medicine. I really, really liked this trip! Our tour guide had done his graduate work in computer science at Rice, so we got a tour that was definitely focused on the computational side of their research. We got to see their machine room, and they took us through some of their biology labs where they use robots to automate some of the tasks.

Last weekend it was both Father's Day and my grandma's birthday. My grandparents live in Houston, so my family came in from Round Rock, and my aunt, uncle, and cousins drove down from Dallas. My other uncle lives in Houston, so we were all able to be together for the weekend.

Week 6: June 27 - July 1

Well, I went home to Round Rock last weekend and spent Monday and Tuesday at home as well, so this was a short week. On Wednesday I added an eighth protein to my data set. I then continued to design motifs and make experimentation files for these eight proteins that I have chosen so far for my data set. This whole data set identification business is taking a long time. As I said earlier (week 4, I believe), I never knew that identifying a data set for experiments took so much work.

There was no Keck trip this week because of the 4th of July holiday. I met with Brian on Friday afternoon. We discussed the direction of the project and some experiments that we hope to run in a couple of weeks. He told me that at next Friday's project meeting, Brad, Drew, and I would each be giving a more formal presentation on the status of our projects. We are each going to prepare a short PowerPoint presentation and give about a ten minute talk.

Week 7: July 4 - July 8

Wow, this week was crazy!! First, I realized that we didn't have sequence analysis information for all of the protein structures that I have selected, because some of them are not the main structure in their family. Brian suggested that I use some multiple sequence alignment files that do contain information for the structures in question. Using this approach I should have been able to see how the sequences for my structures aligned with the main structure in the file. We have sequence analysis data for the main structure, so then I could just use that information. However, when I tried to compare the sequences, they hardly looked similar at all! Brian looked at them too but agreed that the sequences were too dissimilar to use the data from the main structure. We are going to ask the Baylor group if they can run new sequence analyses for each of my structures from the older data set. The structures I chose from the newer collection are fine.

The craziness of this week was heightened because of the looming Friday presentations. I had hoped to have a complete set of ten structures to show at the meeting. However, eight had to suffice. My presentation consisted of a few slides on the motivation, hypothesis, and roadmap for the project. I then had a slide for each of my protein structures, followed by slides discussing my planned experimentation. My slides took some work, because I tended to put way too much text on them. Brian had to guide me in trimming them down appropriately. For each of my proteins, I had an image I created using the PyMOL visualization tool. I have a lot of fun making PyMOL pictures! You load the PDB file for a structure, and the program creates an image of it on the screen that you can manipulate. You can rotate the structure, change how it's represented (surface, mesh, sticks, cartoon), and highlight specific residues. I made a green surface representation image of each of my proteins and colored the residues I had chosen for the motif orange. I also included the references for each protein structure on my slides. I took this opportunity to start keeping all of my references in a BibTeX file. Now when I start writing the paper, I won't have to retype all that information. Anyway, the creation of these images and slides meant some late nights up at Duncan Hall. The presentation went well, although it was rather anti-climactic. The meeting time had been changed and, due to some miscommunication, half of our project group didn't show up. Dr. Kavraki was the only person hearing our presentations. She asked us a lot of questions, though, and it was nice to give the presentation for the first time in a lower-pressure environment. I'll get to give the presentation again in a couple of weeks when I present at the full group (not just the project group) meeting.

And, of course, last weekend was the 4th of July holiday. One of my friends was in town visiting from Austin, and we had an excellent weekend. We went shopping, to the movies to see Batman Begins (on an IMAX screen!), and to some Houston restaurants that I had never tried. On Monday, I went with two of my roommates to Hermann Park, which is a large public park directly across Main Street from the Rice campus. There is a large outdoor theatre where the Houston Symphony played their 4th of July concert. While I was growing up in Austin, I had always wanted to go to the Austin Symphony's 4th of July concert in the park, so I was very excited to finally be able to go to a similar event. We took our blanket, sat out on the hill, listened to patriotic music and watched the fireworks afterward. The fireworks were some of the best I have ever seen!

Week 8: July 11 - July 15

This week I finally finished developing my data set! I am very pleased with the final set. I continued creating my experiment files for the motifs I have designed and continued to run my initial experimentation. The new sequence analysis data for some of my protein structures arrived on Friday, so I will soon be able to make files for these and run them through the initial experiments. Some of the experiments that I have already run showed very few matches. I suspected I had created motifs that were too specific, but Brian suggested that I might just need to raise the match thresholds. Apparently, this had been a problem in some earlier work as well.

The next step for my project will be to calculate how geometrically similar certain structures within each protein family are to other structures in the family. Brian gave me an existing piece of code that calculates this value for two structures. I tried testing it on two structures where I knew the value to expect. I was expecting a number close to 0.5, but I was getting a number closer to 26. Fortunately we were able to solve this problem.

Overall I had a hard time staying motivated this week. I didn't really take much of a break after classes ended in May, so I definitely need to go home for a week or two before classes start again in August. Hopefully next week will be better.

Week 9: July 18 - July 22

This week I concluded my initial experimentation. I also spent some time this week refining the motifs. Previously I had supplemented some of the motifs with residues chosen based on sequence analysis information. I decided to take these points out so that all of the motifs consisted purely of residues chosen based on documented biological functionality. I felt it was better to have the motifs be equal in this respect. I also had to reduce the number of motif points in some cases, because the experiments were simply taking too long to run with larger motifs. Nine motif points appeared to be the threshold for reasonable vs. unreasonable running time, so all of the motifs now have between five and eight points.

I then moved on to the next part of my project, calculating how geometrically similar certain structures within each protein family are to other structures. A piece of code exists to calculate this value once for two structures. However, I wanted to calculate it many, many times for a variety of structures. I didn't want to run all of these experiments by hand, changing the command line each time, so I decided to modify the code to accept a batch list style of input. This presented some problems because I was using arrays of structures in C++ incorrectly, but with Brian's help I was able to get the program to process the batch list appropriately. However, I'm still not getting the correct values (as compared to the individually run experiments), so my numbers are getting messed up somewhere.

On Thursday of this week we had a group meeting to update the group website. Dr. Kavraki wanted each of us (including the undergraduates) to update (or in my case, write) our CVs. I had written a resume in the past, but never a CV. Right now my CV is quite short, of course, but hopefully it will come in handy later to have started it.

Finally, this Friday on the Keck trip we went to the UT Health Science Center Graduate School of Biomedical Sciences to hear presentations from the Pharmacoinformatics Training Grant Program trainees. Pharmacoinformatics was an unfamiliar (but interesting sounding) field to me, so I really enjoyed the talks. This has been one of my favorite Keck events. Some of the trainees were talking about proteins that I have been studying this summer, so I found it very interesting.

Week 10: July 25 - July 29

This week I continued to tackle the problems I was having last week with the code that calculates the geometric similarity between structures. I was able to fix the numerical errors (I was simply giving input in the wrong order), so I was finally getting the values I expected. However, I encountered a memory leak in the code. It was written to be run once, but I was starting it once and then looping through the calculations hundreds of times. So I moved on to track down and fix the memory leak. I did find at least one source of the problem, but in fixing it I introduced a segmentation fault into the code. By this point I was getting pretty frustrated. Eventually I decided to simply write a script that would call the program multiple times. This would allow the program to start and then end after every iteration, so the memory leak wouldn't be a problem. I'm sure there was some more elegant way to fix this problem, but this worked for what I needed. After creating this script I was able to quickly run all of the protein structures in my data set through this code, computing the desired values of geometric similarity. I then wrote a parser to read the data output files and print them into a nice table format for analysis.
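The workaround described above (start the program fresh for every pair of structures, so any leaked memory is released when the process exits) can be sketched in Python. This is an illustration under assumptions, not the script I actually wrote: the function names are hypothetical, the sketch reads each result from the program's standard output instead of from output files, and STAND_IN_COMMAND is a placeholder that just adds the lengths of its two arguments so the example can run anywhere. The real similarity program and its command line are not shown here.

```python
import subprocess
import sys

# Placeholder for the real program: prints the combined length of its two arguments.
STAND_IN_COMMAND = [
    sys.executable, "-c",
    "import sys; print(len(sys.argv[1]) + len(sys.argv[2]))",
]


def run_batch(pairs, command_prefix):
    """Run the external program once per pair, each time in a new process."""
    results = []
    for a, b in pairs:
        completed = subprocess.run(
            list(command_prefix) + [a, b],
            capture_output=True,  # collect stdout instead of printing it
            text=True,            # decode output as str
            check=True,           # raise if the program exits with an error
        )
        results.append((a, b, completed.stdout.strip()))
    return results


def format_table(results):
    """Lay the collected values out as a simple fixed-width text table."""
    header = f"{'structure_a':<14}{'structure_b':<14}value"
    rows = [f"{a:<14}{b:<14}{value}" for a, b, value in results]
    return "\n".join([header] + rows)
```

Calling format_table(run_batch(pairs, STAND_IN_COMMAND)) gives one row per pair. Starting a new process for every pair is slower than looping inside one program, but for a few hundred runs the cost is small compared with the time it would have taken to find the leak.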

This week I also gave a talk at the weekly group meeting. I just used my presentation slides from earlier in the summer, although I shortened the presentation a little bit. I was nervous about the presentation, but it went fairly well.

This was my tenth week of the summer. The project isn't finished, but I will be going home next week for a break before classes start. I then plan to continue my work on this project during the academic year.