Week 1:

I came to Texas a few days before my internship at TAMU began and those few days were spent tidying up the apartment, "trying" to cook and wondering what the internship would be like!
My internship officially began on June 4th. I met a lot of people that day who helped me get settled in, most importantly my grad student mentors Shawna and Lydia. Both of them are working towards their Ph.D.s. They helped me a lot by giving me an overview of the kind of work being done and by making me feel comfortable in the lab.

I was to be working on a project in computational biology - namely protein folding. Since I have an interest in biology, particularly genetics and related stuff, I was pretty excited about the project. I started off by reading two papers:
- Protein folding by motion planning
- Paper on MME and MMC methods that will be published soon
The papers gave me a great overview of what work had been done and frankly, I was amazed at the ingenuity of the approach. I find interdisciplinary applications of computer science very interesting and this was just that sort of thing!!! While reading the papers, my numerous queries were patiently answered by Lydia and Shawna! I think I am going to be bothering them like that for quite some time!!

After reading the papers, I was shown the code that does all of this. The sheer number and size of the files suddenly made me realize how huge the project was!! I learnt how to compile and run the code. Once I got a feel for what was being done in the lab, my grad mentors discussed potential projects with me, all of which sounded extremely interesting!! Finally, the project chosen for me was the one that dealt with a new way of node generation. We had data from Ken Dill's group at UCSF which we were to use to generate nodes in the roadmap. This data was in the form of phi and psi angles and probabilities for different amino acids in a protein.
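Just to make the data concrete for myself (and for anyone reading this), here is roughly how I picture it being held in memory. This is only a sketch with made-up field names and a made-up file layout - NOT the actual format of the Dill group data or of the lab code:

    // Sketch only: assumed in-memory layout for the fragment data.
    // The field names and file format are illustrative, not the real ones.
    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct AngleSample {
      double phi;          // backbone dihedral phi, in degrees
      double psi;          // backbone dihedral psi, in degrees
      double probability;  // weight of this (phi, psi) pair for the residue
    };

    // One entry per residue: the weighted (phi, psi) samples available for it.
    using ResidueDistribution = std::vector<AngleSample>;

    // Assumed line format: "residueIndex phi psi probability".
    std::vector<ResidueDistribution> readFragmentData(const std::string& filename,
                                                      std::size_t numResidues) {
      std::vector<ResidueDistribution> data(numResidues);
      std::ifstream in(filename);
      std::string line;
      while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::size_t residue;
        AngleSample s;
        if (iss >> residue >> s.phi >> s.psi >> s.probability && residue < numResidues)
          data[residue].push_back(s);
      }
      return data;
    }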

With a definite idea of what project I would be working on, a rough outline of the steps that would be involved in the project was made. This can be found in the Research plan link on my home page. Meanwhile, I also started building up my website and what you are reading now is a result of that!! The USRG group at TAMU has also organized GRE preparation sessions for students which we have to attend every Tuesday. Though I do not plan to take the GRE in the next couple of years, I believe that the vocab and verbal preparation will do me good :)

I started with the actual coding on Wednesday. This mainly consisted of Shawna making room for my new code. This required a lot of changes, but then I had a skeleton of what I would be required to do. I was really excited about my upcoming tasks: writing a function to read in the input data and a function to randomly generate conformations. After talking my ideas over, I began the actual coding, and by Friday I had a working function that read in data from the file. Then I coded the function for generating random angles. I had most of the code done by Friday, except that my random numbers were always the same for every iteration of the loop. Finally I realized that the statement had to be OUTSIDE the loop and NOT inside!!! Duh...computer science IS about logic, isn't it?!
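For the record, here is the bug in miniature - a tiny sketch (using the standard C++ random facilities, not necessarily whatever the lab code actually uses) showing why seeding or constructing the random number generator inside the loop gives you the same angles every single time:

    // Seed the RNG once, OUTSIDE the loop. Constructing (or re-seeding) it
    // inside the loop would give the identical phi/psi pair every iteration,
    // which is exactly the bug I hit.
    #include <iostream>
    #include <random>

    int main() {
      std::mt19937 rng(12345);                                    // seed ONCE, here
      std::uniform_real_distribution<double> angle(-180.0, 180.0);

      for (int i = 0; i < 5; ++i) {
        double phi = angle(rng);   // different values each iteration
        double psi = angle(rng);
        std::cout << "conformation " << i << ": phi=" << phi
                  << " psi=" << psi << '\n';
      }
    }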


Let's see what next week brings!

Week 2:

Week 2 was mainly about checking the code I had written. I used the function to generate ten thousand nodes and it worked fine!! Some changes had to be made to assimilate it into the existing code, and then it could be used like any other node generation method. But after I was done with the coding, I didn't have a definite direction to my work because next came "testing", with no fixed way to do it. As a result, the week was not as fruitful as I had hoped. The data I got for the 10,000 nodes did not look very promising to me, but then I realized that it was because the nodes were not being filtered based on energy. Once that was done, the data looked much better. On Tuesday, I finally met with my advisor! She thought that I had made considerable progress but asked me to learn more about the metrics I was using, since I didn't know a lot about them. I was also asked to look at the data and try to sort it using different criteria like RMSD, Euclidean distance, potential energy and native contacts. Lydia helped me understand the metrics and got me started on the possibilities for sorting the data. Meanwhile, I also started learning the basics of a visualization program called Pymol. It is used to see the actual physical structure of various molecules, and I must add that the pictures are quite pretty!!
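The energy filtering I mentioned above really boils down to one comparison. A minimal sketch, with a hypothetical Node struct and a made-up cutoff, just to show the idea:

    // Keep a sampled node only if its potential energy is below the threshold.
    #include <vector>

    struct Node {
      std::vector<double> angles;  // phi/psi values of the conformation
      double energy;               // potential energy, filled in elsewhere
    };

    std::vector<Node> filterByEnergy(const std::vector<Node>& nodes,
                                     double maxEnergy) {
      std::vector<Node> kept;
      for (const Node& n : nodes)
        if (n.energy <= maxEnergy)   // discard high-energy, unrealistic samples
          kept.push_back(n);
      return kept;
    }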

I also spent quite some time thinking about a different aspect of the problem: the data we had was for several fragments which overlapped, so we had to come up with a way to select the correct data set for each residue. To help with this, I looked at each fragment conformation in Pymol to see how similar they were. It turned out that some conformations were quite close while others were really different. One thing I realized was that things could look quite different in the actual molecule than they do in the fragments. We still have to find a good way to pick the residue data, and I have some ideas for it.

After learning some more Pymol, I started writing code to sort the nodes on the basis of the criteria mentioned above. This was pretty easy, and once that was done, I looked at the output from the various sorted nodes. There were several possible combinations for sorting, and I am presently working on them. Since there was a LOT of data, there had to be a way to analyze it, and what better way than Matlab!! Unfortunately, since I hadn't used Matlab either, I had to learn it. Shawna showed me a lot of Matlab and its basic functions so that I could do something useful with it!!
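The sorting itself is conceptually simple. Here is a little sketch of sorting nodes by the different metrics; the struct and the comparison code are illustrative only (this is not the lab's Sorting code), though the metric names match what I described above:

    #include <algorithm>
    #include <vector>

    struct ScoredNode {
      double rmsd;            // RMSD to the native state
      double euclidean;       // Euclidean distance in angle space
      double energy;          // potential energy
      int    nativeContacts;  // number of native contacts present
    };

    enum class Metric { Rmsd, Euclidean, Energy, NativeContacts };

    void sortNodes(std::vector<ScoredNode>& nodes, Metric metric) {
      std::sort(nodes.begin(), nodes.end(),
                [metric](const ScoredNode& a, const ScoredNode& b) {
                  switch (metric) {
                    case Metric::Rmsd:           return a.rmsd < b.rmsd;
                    case Metric::Euclidean:      return a.euclidean < b.euclidean;
                    case Metric::Energy:         return a.energy < b.energy;
                    case Metric::NativeContacts: return a.nativeContacts > b.nativeContacts;
                  }
                  return false;  // unreachable; keeps the compiler happy
                });
    }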

I then began to plot all the data based on different metrics and compared the graphs to old ones. That was what I did on Friday and I will continue doing it next week!


Week 3:

I started off with analyzing the data that had been generated from my method - the Fragment method. I generated 10k, 20k and 50k nodes using my method. These nodes were then sorted using two combinations based on the various metrics. The best structures obtained from this sorting were then compared to the original ones using Pymol. Unfortunately, there turned out to be very few similarities between the actual conformation and my best conformations. To get an idea of the distribution of the nodes, I plotted graphs using Matlab and compared them to those generated by other methods. These graphs turned out to be quite different, and I spent quite some time analyzing them. After this, it seemed apparent that the method wasn't giving me the results I had expected. I have to admit that this was disappointing. To understand the limitations associated with the function, I made a list of its advantages and disadvantages along with possible reasons. From this, I was able to come up with possible changes that would improve the function. Meanwhile, I also started reading about the basics of protein folding from Fundamentals of Biochemistry by Stryer. This not only gave me a better idea of the process but also pointed out weaknesses in our approach. I still have to come up with ways to overcome these sorts of "inherent" problems with our approach.

I met with Lydia and Shawna on Wednesday to talk about the improvements I could make to the method. We decided to try and combine the Layers method with the Fragment method to see if we got a greater distribution of nodes. We also picked a simple way to choose data sets. I was happy to have something concrete to work on. After I was shown the appropriate code, I started modifying a copy of the Layers method. Understanding the function was fairly difficult, but after that was done, implementing the changes and merging the function into the existing class was pretty easy. As before, I generated nodes from this function and plotted graphs which were then compared to the graphs from the old methods. The distribution of nodes was greater, but it lacked some of the defining features of the other graph.

I am hoping that the changes in our sampling technique will substantially affect the graphs. My plan for next week is basically that - writing a function to pick the appropriate data sets for each residue. Depending on the results, this function will probably have to be modified. I also hope to play around with the mixed Layer - Fragment method to get the optimum division of nodes between the two methods. Something I REALLY hope I can do in the next few weeks is come up with a way to bias the generation of nodes by the Layers method. If I can do that, it will be a VERY big deal...I think reading up more about protein folding and putting together whatever little I have learnt till now will help me do this...

Besides this, there was the usual GRE workshop and technical writing seminar. I thought the technical writing seminar was quite helpful. The high point of the week was not one, but TWO trips to Post Oak Mall!! For someone who loves to shop, this was SUCH a treat :)



Week 4:

I guess I am now getting an idea of what research is really like - the highs and lows, the really happy times and the frustrating times!! I get disappointed at times, but then a tiny improvement makes my day! Well, this week was spent modifying the Fragment - Layers combination so that the user could control the percentage of nodes generated by the two methods. After looking around a little, I found the perfect way to do this. Adding a new input variable is a pain, so I decided to use a variable that was already present but wasn't used by my method. Luckily it took values from 0 to 1, and this was just perfect! Adding this variable to the code turned out to be tricky because several functions needed to be modified. After seg-faulting on me several times, the program finally decided to work!! With that done, I began to think about the input function. We had to read in 8 residues at a time from the same fragment. Since the fragments overlapped, there were several such data sets, and I also had to compensate for the first and last phi-psi's. After going through the algorithm and refining it a lot of times, I began coding. As it always happens with me, my logic was perfect but I got stuck on the tiniest of things - Unix file names need forward slashes and NOT the escaped backslashes that Windows paths use. When my program started reading the correct files, everything worked out fine. I usually work in a "test" class when I am experimenting with new functions, so another important step was to put this function into the Fragment class. This needed a lot more changes than I expected.
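The percentage-split idea from earlier in this entry is easy to show in isolation. A sketch with made-up function names (the real code lives inside the lab's node generation classes, and the generator functions here are placeholder stubs):

    #include <cstddef>
    #include <vector>

    struct Node { std::vector<double> angles; };

    Node generateFragmentNode() { return Node{}; }  // stand-in for the real sampler
    Node generateLayersNode()   { return Node{}; }  // stand-in for the real sampler

    std::vector<Node> generateMixed(std::size_t total, double fragmentFraction) {
      // fragmentFraction in [0, 1]: fraction of nodes from the Fragment method,
      // the remainder from the Layers method.
      std::size_t numFragment = static_cast<std::size_t>(total * fragmentFraction);
      std::vector<Node> nodes;
      nodes.reserve(total);
      for (std::size_t i = 0; i < numFragment; ++i)
        nodes.push_back(generateFragmentNode());
      for (std::size_t i = numFragment; i < total; ++i)
        nodes.push_back(generateLayersNode());
      return nodes;
    }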

Once I had the function working, I was quite eager to try it out. I was REALLY happy when the graphs told me that I had got MUCH better results. The distribution of nodes was much more uniform, the RMSD and Euclidean distances were lower - everything was better!! Happily, I began testing the new method to find the best values for the various parameters. I don't particularly enjoy testing, but I've come to accept it as an essential part of any coding. I also used my sorting functions to sort these nodes and then use the "best" nodes as seeds for the Layers method. For 10k nodes at least, the sorting did not seem to matter a whole lot.

I also compared my "best" nodes to the original ones in Pymol. After getting help from a friend on the very elusive Hbonds in Pymol, I was able to compare the Hbonds that define the secondary structure of the conformation. From the Hbonds comparison, it was obvious to me that the native contacts I was using as a metric were actually quite misleading. While we had about 90% of the native contacts, we didn't even have 10% of the Hbonds necessary for the secondary structure. Realizing this, I asked Shawna if we had a function to calculate the Hbonds. Luckily it turned out that we did, and I'll spend a lot of next week working on it.

I also spent time working on my technical report that I have to turn in at the end of the internship. I think it's a really good idea to start early on this, and I'm glad Dr. Amato asked me to do it. I met with her on Monday to discuss my progress and she had a few suggestions on how to proceed. Friday turned out to be a lot of fun! Lydia, Shawna, Kokil and I went to Chick-fil-A with two other students from the Biochemistry department. Lydia had coupons for stuff there so we had a great meal!! We also chatted a lot and it was nice to hang out informally with my grad mentors. I think we should do that more often!!

Week 5:

I'm halfway through my internship already!! It doesn't seem that long - it seems like I am just starting to understand stuff and apply what I know!

Last week was nice - especially because of the 4th of July holiday :) In terms of work, I made a decent amount of progress. I finally applied my method to the entire protein! The method works, but as usual, it's not good enough! We have a pretty good distribution of nodes but our conformations are still far from the native state. Last week, I tried something new - I constructed Ramachandran plots using the MolProbity program I found online. Implemented and maintained at Duke University, it helps you run various checks on your conformation, including Ramachandran plots. Though I could not draw a great many inferences from the plot, it was obvious that most of the angles belonged to the alpha helix region. I have to find a way to check which residues have angles falling in which region.
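Here is the kind of quick check I have in mind for binning residues by Ramachandran region. The region boxes below are coarse approximations I made up for illustration - the exact boundaries differ between references and are definitely not what MolProbity uses internally:

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    struct PhiPsi { double phi, psi; };  // degrees, in [-180, 180]

    // Very rough, made-up boxes for the major Ramachandran regions.
    std::string ramachandranRegion(const PhiPsi& a) {
      if (a.phi < 0 && a.psi > -120 && a.psi < 50)
        return "alpha (right-handed helix)";
      if (a.phi < 0 && (a.psi >= 90 || a.psi <= -150))
        return "beta (extended/sheet)";
      if (a.phi > 0 && a.psi > -90 && a.psi < 90)
        return "left-handed helix";
      return "other";
    }

    int main() {
      std::vector<PhiPsi> backbone = {{-60, -45}, {-120, 130}, {60, 40}};
      for (std::size_t i = 0; i < backbone.size(); ++i)
        std::cout << "residue " << i << ": "
                  << ramachandranRegion(backbone[i]) << '\n';
    }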

We have been using the Layers method in combination with the Fragment method to generate a uniform distribution of nodes. But this method doesn't do this in a smart way - it perturbs all the angles. Instead, the Rigidity Layers method assigns a probability to each residue based on its rigidity. This probability decides whether a particular residue will be perturbed. I carried out tests to see if this made a measurable difference; at this point, it doesn't seem to. I also implemented the Fragment method for the 12 residue fragment data. Comparison seems to show that the 12 residue fragments are better. The 12 residue fragment graphs have the funny property that they seem to have two peaks instead of one in each graph, and I have to find out why. I also constructed different types of graphs like histograms to compare the distribution of nodes within one method and between different methods.
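The per-residue perturbation idea behind Rigidity Layers can be sketched in a few lines. How the rigidity probabilities are actually computed in the lab code is not shown here; the names and numbers are illustrative:

    // A coin flip against each residue's probability decides whether its
    // angles get perturbed; flexible residues should carry higher probabilities.
    #include <cstddef>
    #include <random>
    #include <vector>

    void perturbConformation(std::vector<double>& phi,
                             std::vector<double>& psi,
                             const std::vector<double>& perturbProbability,
                             double maxDelta,
                             std::mt19937& rng) {
      std::uniform_real_distribution<double> coin(0.0, 1.0);
      std::uniform_real_distribution<double> delta(-maxDelta, maxDelta);
      for (std::size_t i = 0; i < phi.size(); ++i) {
        if (coin(rng) < perturbProbability[i]) {  // flexible residues move more often
          phi[i] += delta(rng);
          psi[i] += delta(rng);
        }
      }
    }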

Comparing our "best" structures with the native state using Pymol has indicated that native contacts maybe be misleading. So I decided to use Hbonds as a metric. For this I added to the code, the functions to get the number of Hbonds. I added number of HBonds to the Sorting class as well. I have to find out if these Hbonds are a better metric.

Besides the lab, last week was a lot of fun too!! I went to Texas Roadhouse, Chili's and Chipotle! Also ate great home-made Indian food and watched quite a few movies!!! My cooking is improving exponentially, it seems - just made cake and shrikhand, which were AWESOME!! Also visited the Bush Library which currently houses the miniature White House! Being an artist, I loved the tiny details in each room! If you get a chance, you should see it!

Week 6:

Last week was a little slow for me. I mainly did a lot of testing since I had recently tried a lot of variations in my method. These included the two different types of fragments (8 residue and 12 residue), the methods (Fragment, Layers and RigidityLayers) and lastly the variations in the percentage of sorted nodes that are picked as seeds. I tried almost all the variations possible and used percentages 10, 30, 50 and 70. It seems that the Fragment + RigidityLayers method with 12 residue fragments gives better results than the rest. The optimum percentage turned out to be 50, and I have been using these parameters for other tests.

Another problem I worked on was the quality of native contacts as a metric. When I was working on Hairpin1, I obtained as many contacts as were present in the native state, but our structure was far from being native-like. When I compared the structures using Pymol and displayed the hydrogen bonds, I realized that most of these bonds were missing in our structures. This made it apparent that native contacts could not be used as evidence of being close to the native state and hence were not a very good metric. To find out if this was the case with ProteinG too, I carried out several tests that examined the distribution of nodes when they were sorted by Hbonds and by native contacts. These plots were for the potential energy, RMSD and Euclidean distance of the nodes. The discrepancy became apparent when the RMSD of the sorted nodes was studied. Native contacts did appear to be misleading for Hairpin1 but not for ProteinG.

After validating the use of native contacts as a metric for ProteinG, I was sort of stuck because I was seriously tired of testing and analyzing graphs. So I worked on my report and added details to it. I also started reading another book on protein folding. Unfortunately, it seems as if all the books assume quite a bit of familiarity with the topic. Then it occurred to me that graphs may not be the best way to study the nodes I produced. So I decided to construct and analyze the roadmaps that I got from the nodes. From this I could find out how close we were to the native state in terms of secondary structure. I plan on comparing the secondary structure formation order (SSFO) to experimental ones and determining if my nodes actually follow that order or not. This will also tell me what secondary structure formation is most dominant in my nodes. This is what I will do next week and I think it will make a big difference. To help with this, I've added a new field to the protein class that can keep track of the method that created a configuration. This was quite a pain because I needed to change things in a number of places, but after quite a bit of debugging, it works fine!

Had to attend 2 talks for the USRG program and 1 for the CS REU program. The one on MRI was really cool and the one on grad admissions gave a lot of helpful pointers (pun intended!). Wrapped up last week with the new Harry Potter movie - unfortunately, it doesn't follow the book, but is a good movie as a whole. Can't wait for next weekend which is going to be spent READING the last book :D

Week 7:

Almost half of last week was spent trying to get the analyze roadmap part to work. This code is much more difficult to understand than the other parts, so debugging my code was tough. The program kept finding paths with repeating nodes, i.e. it kept going back to the same node instead of to a node closer to the native state. After testing different things, I finally got it to behave; I am still not sure exactly what the problem was, but the changes I made fixed it. With that done, I started the actual construction of maps. There were several alternatives for this:
- use a "pseudo" native state
- use just the native state in combination with the fragment nodes
- generate nodes from the native state and then construct maps

Each of the approaches needed quite a bit of code to be added. Additionally, since the code is interconnected, sometimes unexpected things changed. I implemented one method at a time and compared the results. The glaring flaw with the first approach was that the secondary structures it considered native were those from our "pseudo" native state, so the results did not make sense and couldn't be compared. There were two problems with the second approach: the native state wasn't in the roadmap to be analyzed, and its energy was way lower than that of my nodes. After grep-ing through several files and folders, I finally found the functions that would add the native state to the roadmap for me. This was necessary because building the roadmap again would have taken me several hours!! The other problem was solved by asking the program to ignore the huge energy difference. The results from this approach were not encouraging either: with my nodes being far from the native state, there was no guarantee that the SSFO obtained would be the correct one. Lastly, I added the native state as a seed. This is giving better results than the other two. I am still doing some tests on it.
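Since the repeating-node symptom took me so long to pin down, here is the kind of quick sanity check that would have helped: scan an extracted path and see whether any node ID shows up twice. The path representation (a plain list of node IDs) is hypothetical; the real roadmap code is more involved:

    #include <set>
    #include <vector>

    // Returns true if the path visits some node more than once.
    bool pathRevisitsNode(const std::vector<int>& pathNodeIds) {
      std::set<int> seen;
      for (int id : pathNodeIds)
        if (!seen.insert(id).second)  // insertion fails -> this ID was already seen
          return true;
      return false;
    }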

I met with my mentor on Wednesday to discuss my project. I have always been a little unhappy with my approach to the fragment problem because the answers aren't "good enough" for me. But I guess my faculty mentor and grad students finally convinced me that day that great results had not been expected in the first place. That wasn't great news, but my mentor had given me some more work to do, a variation in our method, so I was really happy! I had the code written and working that day. The next day I made modifications and ran small tests. Then I proceeded to building large maps. I started my job on Friday and when I came back on Monday, I realized that it was still running!!! I am not sure what took it that long; when I ran it again, it was done in less than a day. Next I plan to use the map evaluator code that Shawna gave me to build maps more "intelligently"!

My weekend was weird for more than one reason, one of them being that my pre-ordered Harry Potter didn't get to me on the 21st. Hope to get it soon.

Week 8:

I finally got my Harry Potter book on Monday and so Monday night was spent reading the book :) The fact that this appears at the TOP of my weekly entry should make it obvious how cool this was!! Anyway, I thought the book was great overall but like many of you I didn't like the epilogue et al.

Last week I mainly worked on my new method that biased nodes according to the MD data. After making some changes to the code, i.e. asking it to not read the same file a million times, I got my program to do its stuff in just over four hours. Very happy with that, I also added the map evaluation code so that the program would know for itself when to stop. This greatly reduced my running time; I was now done in 70 mins flat! Meanwhile, I also tried to construct maps so that one part was done by the Fragment method and a separate part was done by the Rigidity Layers method. Then I hoped to combine these two. As of now, this doesn't seem to work, as the evaluator can't seem to find paths. I have to investigate the reason for this. After building maps with my new method, I analyzed them. Though the general characteristics of the map (including graphs) were comparable to or even better than those of the old methods, one crucial piece of data didn't seem to match up - the SSFO. To understand this better, I modified a function in the analyze class to generate a contact map for me not according to contact time but according to RMSD. This was to help me understand roughly what structures were present at what stages. Shawna helped me out a lot with understanding the code, and after making the changes, it worked fine.
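The "know when to stop" part boils down to building the map in batches and asking the evaluator after each one. Everything in this sketch is a stand-in (stubbed so it compiles on its own), not the actual map evaluator interface Shawna gave me:

    // Build the roadmap in batches and stop as soon as the evaluator is
    // satisfied, instead of always generating a fixed (large) number of nodes.
    #include <cstddef>
    #include <iostream>

    static std::size_t totalNodes = 0;
    void addBatchOfNodes(std::size_t batchSize) { totalNodes += batchSize; }  // stub
    bool roadmapIsGoodEnough() { return totalNodes >= 5000; }                 // stub test

    void buildRoadmapIncrementally(std::size_t batchSize, std::size_t maxBatches) {
      for (std::size_t batch = 0; batch < maxBatches; ++batch) {
        addBatchOfNodes(batchSize);
        if (roadmapIsGoodEnough()) {  // stop early once the evaluation passes
          std::cout << "stopped after " << batch + 1 << " batches\n";
          return;
        }
      }
    }

    int main() { buildRoadmapIncrementally(1000, 50); }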

I analyzed old and new maps again using the modified analyzing function. The fact that I hadn't been very consistent in labelling my files caused me quite a bit of trouble at this stage, and I had to rerun programs. I wish I had paid more attention to the naming; it would have saved a lot of my time. Unfortunately, the new analysis didn't help me a lot and I am still figuring out the cause of the weird SSFO.

Last week I also started working on my poster. I made a rough outline for it and showed it to Shawna. After looking at a few old posters, I picked up material from those and put it into my poster. This was great for me because I had never made a poster before. With a sort of template, I began to fill out details particular to my project. By the end of the week, I had most of the data I needed on my poster. I never knew zooming in and zooming out could get so annoying at times!!

My weekend was rather boring because I moved to a new place and all the time was spent packing or unpacking!

Week 9:

The next-to-last week of my internship and the last week for the other students in my lab. It's amazing how fast the last few weeks go. Suddenly I feel that I have so much to do and so little time! Hope to sort out some stuff next week. Last week were the poster presentation sessions @ TAMU. There were 2 in all - the USRG one and the VPRO one. All efforts till Wednesday went into making the poster. This included running more tests, rerunning programs, constructing "user-friendly" images and so on. My poster also underwent a lot of changes with regards to content and placement. This was my first poster, so I guess it took me quite a bit of time to get the hang of what kind of information goes on a poster, e.g. text should NEVER be wordy, and the more images the better!

I had to rerun a couple of programs because I had not run them using the same parameters. I also had to create new images in Matlab using this data. I wish all file formats were portable!! So obviously I spent time converting my images into the right format, resizing them and then finally placing them on my poster. On the poster too, it was a challenge getting all those figures to the same size. I also got some pretty images from Pymol to put into the poster. I think these were really good because they were pretty much the most physically realistic images I could get. I also played around with the autoshapes and made a flowchart of my methods. It looks pretty easy when it's all done, but it wasn't very easy to get all the shapes to line up, be of the correct size, the arrows to be of the same thickness and the correct orientation and so on. I guess the biggest trouble is that the autoshapes don't have a built-in text box. I hope Microsoft implements that soon :)

I had trouble formulating my results section because I had done a lot of stuff but not all of it had been very good, so I wasn't quite sure how to state the results. I also had to pick the best way to show these results graphically. Shawna and Lydia helped me a lot there. Shawna suggested that we go from the conclusions --> results. It seemed like a funny approach but it was very effective. We started by listing the things we wanted people to take from my poster in terms of results. Those became my conclusions, and then we set up the existing results to support those conclusions. I think that was absolutely great and my poster looked a lot neater and more well thought out after that! After making (lots of!) corrections, I finally showed Dr. Amato my poster. She had a few corrections too and then I was all set :D. You can find my poster here.

The actual poster sessions were quite a bit of fun. The first one, for the VPRO, was on a smaller scale and I got only five or six people. But this session was good practice for me and I became quite comfortable explaining my poster. Prior to the poster session, we had a practice poster session in our lab where we had pizza and all the interns explained their posters. The second poster session on Friday was on a large scale with two sessions. We were also given breakfast and lunch. Since the computer science posters were in the afternoon, we had to evaluate 4 posters from the morning session. Though I was initially reluctant to do this, I later thought that it was great because I got to learn about projects I wouldn't have heard of otherwise. It also made me realize that there is SO MUCH you can do when it comes to research. And here I was doing my little thing and thinking that it was great. So looking at the posters was quite an eye-opener. I ended up looking at a lot more than 4 posters ultimately! In the afternoon I had about 15 to 20 people visiting my poster and asking questions. I also got 2 judges because we had a poster competition. The judges were not from Computer Science or Computational Biology, so I'm not sure if my contribution to the project was immediately obvious. I didn't end up winning a prize, but I thought that it was a great experience.

Week 10:

Today is the last day of my internship and I have already started missing things I was used to doing over the last 2 months! This week was mostly spent writing my report. The task took a lot longer than I had expected and I was glad that I had started writing the report early. Unfortunately, I hadn't been updating the report regularly, so the big changes I had made to my method were not reflected in it. As a result, I spent quite a bit of time writing up the Methods section. The Results section took up the most time and effort for me because it had a lot of pictures and text in it. After playing around with Latex, I gradually got used to the different things I could do. I created new images to put into my report and again ran new tests.

This week I also implemented a variation in our method that wiggled the native state according to MD data. It was pretty late to be doing new stuff, but I actually got it to work and it produced better results than the previous ones. I changed the code to pick probability distributions for consecutive residues from the same fragment instead of from different fragments. For this, I tried 2 things: one was to use ALL the given fragments and the other was to use just ONE set of non-overlapping fragments spanning the protein. This method produced a really big improvement in performance. One of the sets chosen gave results that resembled the expected ones pretty well.
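Picking one set of non-overlapping fragments is conceptually just stepping through the protein in fragment-length strides. A tiny sketch, assuming (like our data) a fragment starting at every residue and a fixed fragment length; the names are made up:

    #include <cstddef>
    #include <vector>

    // Start positions of a non-overlapping set of fragments covering the protein.
    std::vector<std::size_t> nonOverlappingFragmentStarts(std::size_t numResidues,
                                                          std::size_t fragmentLength) {
      std::vector<std::size_t> starts;
      for (std::size_t s = 0; s + fragmentLength <= numResidues; s += fragmentLength)
        starts.push_back(s);
      // Any leftover residues at the C-terminal end would still need a data set
      // assigned some other way; that detail is left out of this sketch.
      return starts;
    }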

This was a good thing to add to my paper. I could probably go on and on about what it was like to write the paper and it IS quite an experience but everything's worth it when you see the report all done :). My report can be found in the Final Report section of my homepage.



In all, I think this was an amazing opportunity for me to learn so much new stuff and actually write programs that mattered. It was great to get the big idea of how computer science is being applied to protein folding and biology in general. It was quite an eye-opener for me and I think I'm always going to remember this experience. I'd like to thank Shawna and Lydia for all their help over the summer because I wouldn't have been able to do half as much without them. I also thank Dr. Amato, who has been guiding me throughout the process and suggesting new ideas. For now, I hope to work remotely on my project - let's see how it goes! :)
But for anyone even remotely considering an internship in CS, I would encourage them to apply to the DMP program because it is a really great opportunity!