Week 10 (August 18 - August 25)

This week, I'm all alone in the lab. The campus is filling up with first years and the cadets are marching all over the place. Everything is very impressive and a little bit nostalgic, since I remember my own first year. In any case, I'm trying to finish up some of the project's open problems before I leave for Chicago.

Two major things have bothered me this week. First, per_length needs to be fixed, because we're getting values that exceed 100%. I've been trying to find a clean method to do that, but nothing simple works, since these BLAST hits overlap all the time (not just in consecutive order). Eventually, I added a new data structure to our algorithm and called it exons (though it's not really supposed to count exons). Every time we decide to glue a BLAST hit, we add its query component to this exon list, check for overlap and fix it. Basically, this keeps a list of all the portions of the cDNA query that are mapped in a specific "glued" part. It sounds very natural, but implementing it actually turned out to be pretty hard. First of all, we would like to keep the list of these portions in ascending order. That means that every time I want to insert a new portion, I have to shift everything after it one space to the right, and every time I change the limits of one portion, I need to check the whole list for overlap (a basic cleanup of our list, checking for non-redundancy). I'm still trying to wrap my head around the best way of doing it. There are still some problems in it, with some entries > 100%, so I need to debug it further, hopefully when I get back to Chicago.

Another idea that has been bugging me is whether it is possible to predict exon-intron boundaries. The truth is that with the previous exon_count and with the "exon" list, we're getting pretty close to it.
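To make the exon-list bookkeeping for per_length a bit more concrete, here is a minimal sketch in Python (the project itself is written in Perl). The function names `add_portion` and `per_length`, the inclusive 1-based coordinates and the re-sort-and-merge strategy are all assumptions for illustration, not the code we actually run. The idea is that once overlapping portions of the query are merged, the covered length can never exceed the query length.

```python
def add_portion(portions, start, end):
    """Return a new sorted, non-overlapping list after adding (start, end).

    Coordinates are assumed to be inclusive positions on the cDNA query.
    Hypothetical helper: it re-sorts and merges instead of shifting in place.
    """
    if start > end:  # tolerate reversed coordinates
        start, end = end, start
    merged = []
    for s, e in sorted(portions + [(start, end)]):
        # Overlapping or directly adjacent portions collapse into one.
        if merged and s <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def per_length(portions, query_length):
    """Percentage of the query covered by the merged portions."""
    covered = sum(e - s + 1 for s, e in portions)
    return 100.0 * covered / query_length


portions = []
for start, end in [(56, 245), (200, 300), (1, 50), (40, 60)]:
    portions = add_portion(portions, start, end)

print(portions)                   # one merged portion covering 1..300
print(per_length(portions, 400))  # 75.0
```

Re-sorting on every insert is wasteful for long lists, but it avoids the shift-and-recheck logic that makes the in-place version so easy to get wrong.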
All throughout this week, I have been torn between thinking strictly about obtaining a successful per_length value and actually determining exons. The tricky part is that determining exon-intron boundaries is not just about looking at the query, it's also about looking at the chromosome. Our gluing algorithm looks at both of these components, but for different purposes.

So far, the algorithm is behaving surprisingly well and I believe we used a good recipe for it. My question is, can we take it to the next level? Can we make it predict exon-intron boundaries? It seems to me that, to some degree, it can. Sometimes we get BLAST hits that map exactly to the exons, and we usually manage to glue them together and count the exons right. However, there is such a thing as "junk", i.e. hits that are not necessarily biologically significant, but that occur in unpredictable ways. For example, you can get a piece of the query from position 56 to 245 matching all over the place. That's not really tandem duplication, it's not really anything (well, not anything young enough because, otherwise, evolution could explain everything). However, it affects almost every "glued" piece that we have. So far, we haven't really cared about these hits, but if we decide to determine exons, we will have to. How do we choose which one is the better exon? Dr. Zhang suggested we choose the longest one, but then what happens when two matches just overlap? Then we would have to look at the corresponding matches on the chromosome. But what if those are very far apart? Are you going crazy yet? Because I slightly am, in this swirl of ideas.

This is probably a naive dream, but I really enjoy thinking about it, trying to make sense of all the ideas that come up. In any case, the program is done, but we will keep working on it (not to mention writing a paper on it). It was a really amazing experience and I would definitely like to continue it (in grad school? yes? please?). I hope you, the reader, enjoyed it as much as I did. Bye!

P.S. This is my favorite picture of the lab!
Week 4 (July 5 - July 11)

We are experiencing some problems with the "gluing" algorithm. The tests we previously ran on it didn't reveal them, so we were pretty lucky to discover them now. For example, we needed to handle the case in which BLAST matched our cDNA query in the reverse orientation, so that our BLAST hits had a higher start than end. That added a new parameter to our program, "direction". Breaking up the cases was a bit of a hassle to code up, too, with just the classic problems like string comparison and the representation of boolean values in Perl. As of now, though, everything should be all right. But who knows! (worried)

What frustrates us is that the algorithm runs really slowly and runs into all sorts of problems (like not being able to connect to the database after a while). The script has a lot of MySQL queries in it, so we suspect that's the main issue.

On another note, this week we also set up the "grouping" algorithm for detecting chimeric genes. Roughly, what we are looking for are multi-exon genes that, after translation, got inserted into the exon area of another gene. In terms of BLAST hits, we would have a cDNA query that matches in "chunks" (exons) on one gene, and entirely on another one. That gives us two cDNA queries that hit the same area on the chromosome. We decided to keep an eye on this type of interaction even before we run GeneWise. At this stage (after the "gluing" algorithm), what we have are contiguous portions of a chromosome that a certain cDNA query hit. The next step is to take those portions of the chromosome and determine which might be indicative of a chimeric gene. This means that we are looking for portions of the chromosome such that one overlaps the other almost entirely (95%). In a way, we are trying to eliminate redundancy in the portions of the chromosome that had hits on them. Queries that hit the same area are grouped together for future reference.
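Here is a rough Python sketch of the two ideas above: recording a "direction" for reversed hits, and grouping glued regions when one lies (almost) entirely inside another. The names `normalize`, `overlap_fraction` and `group_regions`, the greedy longest-first strategy, and measuring the 95% against the shorter region are my assumptions for illustration. The real script is in Perl and reads its hits from MySQL.

```python
def normalize(start, end):
    """Return (low, high, direction); reverse hits have start > end."""
    if start <= end:
        return start, end, "+"
    return end, start, "-"


def overlap_fraction(a, b):
    """Share of the shorter region covered by the overlap (inclusive coords)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if overlap <= 0:
        return 0.0
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return overlap / shorter


def group_regions(regions, threshold=0.95):
    """Group (query_id, start, end) regions located on one chromosome.

    Longest regions are visited first and become group anchors; a shorter
    region joins the first anchor that covers at least `threshold` of it.
    """
    groups = []
    by_length = sorted(regions, key=lambda r: abs(r[2] - r[1]), reverse=True)
    for query_id, start, end in by_length:
        low, high, _direction = normalize(start, end)
        for group in groups:
            if overlap_fraction((low, high), group["anchor"]) >= threshold:
                group["members"].append(query_id)
                break
        else:  # no anchor covers this region, so it starts a new group
            groups.append({"anchor": (low, high), "members": [query_id]})
    return groups


regions = [("Q1", 1000, 9000), ("Q2", 5200, 5000), ("Q3", 20000, 21000)]
for group in group_regions(regions):
    print(group["anchor"], group["members"])
```

In this toy input, Q2 is a reversed hit sitting inside the region glued for Q1, so the two queries end up in the same group, while Q3 stays on its own.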
We have been able to spot chimeric genes just by looking at the data. For example, you could see a cDNA query Q1 in which pieces of it hit one general area on the chromosome (which we expect to be a multi-exon gene) and then the entire cDNA sequence matching 100% on some other area (in the exon of another gene). There are more details to be checked in order to actually make sure that it is a chimeric gene but, in any case, that is the intuition. The algorithm itself is not that complicated (it deals with far fewer cases than the "gluing" one). Hopefully, we will get to use it soon.

My presentation on protein folding went really well! Once I started researching the field a little bit, I started encountering all these amazing questions that people are trying to answer. Just to give you a little taste of it, I want to mention two of the big problems in protein folding. Levinthal's paradox: a random search of the space of possible configurations (for a protein) would take longer than the age of the universe, yet proteins are able to reach their native conformation in a matter of seconds. Anfinsen's experiment showed that ribonuclease could refold into its native state following chemical denaturation, which implies that the information for correctly folding a protein is contained in its amino acid sequence. Starting with these two questions, you get all sorts of paths to follow. Some scientists are interested in describing the kinetics behind the folding process, some are interested in finding folding similarities between proteins (so that they can predict the folds of new proteins by homology, based on such similarity scores), and some are interested in just using first principles to arrive at a reasonable fold (lattice models etc.). Doing a review of this entire material was a very interesting challenge and I think it worked out decently. Choosing the specific paper to present was a hard decision, though.
At first, I noticed this paper called "Free Energy Estimates of All-atom Protein Structures Using Generalized Belief Propagation". GBP?!?!?!? Whaaaaat. Sorry for the rush of excitement, I just got really attached to GBP when I took the Computer Vision class. Not to mention that it sounds extremely badass. :D The machinery behind GBP involves Markov Random Fields, Bayesian Networks and factor graphs. It's usually used for inference in graphical models, which so far I've only seen in Vision. And now they're applying it to Comp Bio, which is awesome. Also, what got me going was the fact that... I could actually understand what they were talking about (well, I got the drift of the math and whatnot). Finally, my kind of stuff. (I know, it sounds pompous, but I'm scared of Bio ha ha) In any case, the paper had not really been published in a journal, so I was a little bit reluctant to present it.

Instead, I went to the RECOMB 2009 website and browsed through the published papers. Soon enough, another one caught my attention! Even better than before.... wait.. wait for it... "A Probabilistic Graphical Model for Ab Initio Folding"! Also, the paper was a product of our very own TTI in collaboration with the Departments of Chemistry and of Biochemistry and Molecular Biology at UofC. Are we looking at some sort of fusion between AI fields? :D This one used Conditional Random Fields and explored protein conformations in a continuous space according to their probabilities. Pretty cool!

In any case, I can see that I've talked way too much about this. Sorry for that, I hope you enjoyed my ranting, though. As a conclusion, everything turned out fine, it was an amazing experience aaaaand... I'm ready to end this week's journal. See you next week!
Week 3 (June 28 - July 4)

Confronted with the slow progress of running BLAST, Dr. Heath suggested we use a VT cluster that actually has a tool for running BLAST in parallel (called mpiBLAST, conceived by Wu Feng, director of VT's Synergy Lab!!!). That just took our research to the next level. I feel very empowered, ha ha. In any case, we ran into some issues with memory capacity, and after debating for a while how parallel computing works (trying to determine which data from before the crash we could still count on), we decided to redo the rest of the genome.

Since BLAST is up and running, we need to think about using GeneWise. We will use it to determine the gene structure (exons and introns) of the hits on the chromosome. However, there is one thing we need to figure out. Ideally, we would like to give GeneWise a cDNA and a portion of the chromosome. GeneWise, using the cDNA to do some matching, will return the position and structure of the gene(s) on the chromosome. The problem is, what portion of the chromosome should we input? Because we're BLAST-ing cDNA against the genome, our hits will basically only point to the exon areas of the gene. Therefore, we can't really give GeneWise solely the BLAST hits (even GeneWise recognizes the need for an ample framework / wiggle room, so they request that you add 15,000 base pairs to each end of your DNA sequence). That's why, ladies and gentlemen, we came up with the "gluing" algorithm. The algorithm tries to "glue" our hits on the genome into continuous "chunks" that make sense.
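As a very rough illustration of the gluing idea, here is a Python sketch that merges the hits of one query on one chromosome whenever they lie close together, then pads each chunk with the 15,000 base pairs mentioned above. The `max_gap` threshold, the function names and the choice to ignore the query side and the hit direction are all simplifying assumptions. The real algorithm has to handle many more cases.

```python
def glue_hits(hits, max_gap=10000):
    """Glue chromosome hits (start, end) into continuous chunks.

    `max_gap` is a hypothetical cutoff: hits further apart than this are
    treated as belonging to different chunks.
    """
    # Reverse-orientation hits have start > end, so order each pair first.
    ordered = sorted((min(s, e), max(s, e)) for s, e in hits)
    chunks = []
    for low, high in ordered:
        if chunks and low - chunks[-1][1] <= max_gap:
            # Close enough to the previous chunk: extend it.
            chunks[-1] = (chunks[-1][0], max(chunks[-1][1], high))
        else:
            chunks.append((low, high))
    return chunks


def pad_for_genewise(chunk, chromosome_length, flank=15000):
    """Add `flank` bases on each side, clipped to the chromosome ends."""
    low, high = chunk
    return max(1, low - flank), min(chromosome_length, high + flank)


hits = [(100000, 100200), (103000, 103400), (500000, 500300), (104500, 104000)]
for chunk in glue_hits(hits):
    print(chunk, "->", pad_for_genewise(chunk, chromosome_length=510000))
```

With these toy numbers, the three hits around position 100,000 are glued into one chunk, and the hit at 500,000 stays separate, with its padding clipped at the end of the chromosome.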
We've also started thinking about ways to detect chimeric genes and to build homolog families, which we might be able to do even before we run GeneWise. More on that, though, next week.

On another note, the bioinformatics lab has an informal weekly meeting (Easy Chair Journal Club) in which one of us presents an interesting paper. This week, Dr. Heath presented a paper on reduced amino acid alphabets and how doing alignments on such reduced alphabets might improve sensitivity to folding characteristics. What I liked a lot about this paper (and the field of protein folding prediction in general) is how little is known about the actual biological process and how Math & Comp Sci are trying to fill in the gaps. This kind of work in particular appeals to me because of its intent to mimic nature (and even to improve on nature by designing new proteins etc.). In any case, the paper brought up a discussion on protein folding algorithms and I offered to do next week's presentation precisely on this. I remember being very interested in it when we discussed it in my Comp. Bio. class at UofC, so I decided to give it a try. I plan to first give a general presentation of the field (with its current problems and limitations, and future challenges) and then pick an interesting paper to discuss.
Week 2 (June 21 - June 27)

It's decided! For now, I am going to take care of the chimp genome, ha ha. While Shilpa is preparing and testing the pipeline for the human genome, I am going to do the same thing with the chimp one. I first need to get comfortable with BioPerl and with actually running BLAST (as opposed to just studying the algorithm in class). Dr. Heath suggested it would be a good exercise for me, and I agree, since a lot of the time "getting your hands dirty" is more valuable than just reading about it in a book. (yes, I also philosophize about the nature of knowledge and experience - UofC trademark).

We've also started to think about how the project will evolve once we get our BLAST hits. We came up with this interesting problem of detecting nearest neighbours and making connections between different cDNA queries. I feel like we could use some complicated nearest-neighbour algorithms, but for now, the best thing is to stay put until we see what the results look like. We have also started outlining the algorithm that prepares our BLAST hits for GeneWise. The whole idea is a bit tricky, because it's not just an algorithm that clusters according to position on the chromosome... we also need to reflect the biological truth behind it and cluster with our final purpose in mind (to obtain a valid predicted gene structure).

As for the eternal problem of running BLAST, we're having trouble running the entire genome on a single machine. For now, we're trying to get around that by splitting the database into smaller ones and running BLAST on each of those. However, it will probably still take a long time, and that is a bit worrisome.
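For the curious, splitting the database can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the genome is a plain FASTA file, each batch is capped by a total number of bases (`max_bases`), and the names `read_fasta` and `split_records` are made up for the example. Formatting each batch as a BLAST database and running BLAST on it is not shown.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, max_bases):
    """Yield batches of records holding at most `max_bases` bases each.

    A single record longer than `max_bases` still gets a batch of its own.
    """
    batch, size = [], 0
    for header, seq in records:
        if batch and size + len(seq) > max_bases:
            yield batch
            batch, size = [], 0
        batch.append((header, seq))
        size += len(seq)
    if batch:
        yield batch


lines = [">chr1", "ACGT", "ACGT", ">chr2", "GG", ">chr3", "TTTTTT"]
for number, batch in enumerate(split_records(read_fasta(lines), max_bases=10)):
    print(number, [header for header, _seq in batch])
```

Each batch could then be written to its own FASTA file and turned into a smaller database, so that no single BLAST run has to hold the whole genome.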