DREU 2009 Ioana Bercea

Personal Mentor Colleagues Project Journal

Week 10 Week 9 Week 8 Week 7 Week 6 Week 5 Week 4 Week 3 Week 2 Week 1

Week 10(August 18- August 25)

This week, I'm all alone in the lab. The campus is getting filled with first years and the cadets are marching all over the place. Everything is very impressive and a little bit nostalgic, since I remember my own first year. In any case, I'm trying to finish up some of the problems of the project before I leave for Chicago. Two major things have bothered me this week.

Per_length needs to be fixed, because we're getting values that exceed 100%. I've been trying to find a clean method to do that, but it doesn't work, since these BLAST hits overlap all the time(not just in a consecutive order). Eventually, I added this new data structure to our algorithm, called it exons(though it's not really supposed to count exons). Every time we decide to glue a BLAST hit, we add the query component to this exon list, check for overlap and fix it. Basically, what this does is keep a list of all the portions of the cDNA query that are being mapped in a specific "glued" part. Sounds very natural but implementing it actually turned out to be pretty hard. First of all, we would like to keep the list of this portions in ascending order. That means that every time I want to insert a new portion, I have to shift everything one space to the right and everything I change the limits of one portions, I need to check for an overall overlap(a basic cleanup of our list, checking for non-redundancy). It's actually pretty hard to do, trying to wrap my head around the best way of doing it. There are still some problems in it, still some entries > 100%, so I need to further debug it, hopefully when I get back to Chicago.

Another idea that has been bugging me is whether it is possible to predict exon-intron boundaries. The truth is that with the previous exon_count and with the "exon" list, we're getting pretty close to it. All throughout this week, I have been conflicted between thinking strictly about obtaining a succesfull per_length value and actually determining exons. The tricky part is that determining exon-intron boundaries is not just about looking at the query, it's also about looking at the chromosome. Our gluinb algorithm looks at both these components, but for different purposes.
It looks at matches on the chromosome to determine if it should glue them or not, without really caring what happens back in the query(faithfull to our initial use for it: creating input for GeneWise)
It also looks at matches on the query to determine which portions of the query get matched, determine a rough exon count etc.

So far, the algorithm is behaving surprisingly well and I believe we used a good receipe for it. My question is, can we take it to the next level? Can we make it predict exon-intron boundaries? It seems to me that, to some degree, it can do that. Sometimes, we get BLAST hits that map exactly to the exons and we usually manage to glue them together and count the exons right. However, there exists such a thing as "junk", i.e. hits that are not neccessarily biologically significant, but that occur in unpredicted ways. For example, you can get a piece of the query from position 56 to 245 matching all over the place. That's not really tandem duplication, it's not really anything(well, not anything young enough because, otherwise, evolution could explain everything). However, it affects almost every "glued" piece that we have. So far, we don't really care about them, but if we decide to determine exons, we will have to. How do we choose which one is a better exon? Dr. Zhang suggested we should choose the longest one, but then what happens when two matches just overlap? Then we would have to look at the corresponding matches on the chromosome. But what if those ones are ery far apart? Are you going crazy yet? Because I slightly am, in this swirl of ideas.

This probably is a naive dream, but I really enjoy thinking about it, trying to make sense out of all the ideas that come up. In any case, the program is done, but we will keep on working on it(not to mention write a paper on it). It was a really amazing experience and I would definitely like to continue it(in gradschool?yes?please?). Hope you, the reader, enjoyed it as much as I did. Bye!

P.S. This is my favorite picture of the lab!

Week 9(August 10- August 17)

We're at our final step of our project. We have the potential retrogenes, now we just have to check for frameshift mutations and flanking regions. We came up with a formula for determining frameshift based on our BLAST results(took us a while to figure what every parameter actually refered to- since it was an mpiblast program, the output was a little bit unconvenational)The other big part is detecting parents. That would require us to do pairwise alignment between each retrogene and its potential parents and then choose the highest scoring match. Once we do that, we should be set.

However, before we do that. we need to make sure that we detect retrogenes as best as we can. That is why we've introduced a couple more parameters to our gluing algorithm: per_length, per_identity, bit_score and exon_count. Per_length is there to capture how much of the initial query we actually mapped.(we would expect full genes to have per_length > 97%). Per_indentity is the same as it is in BLAST, a measure of how good the match is(number of nucleotides that were identically matched, as opposed to matched to a gap, for example / length of the match), bit_score is well, the classic and exon_count is a little bit more complicated, talk about it later. Basically what we're trying to piece together BLAST hits and keep all the information available. Now the problem is how you manage all these quantities. First of all, bit_score is additive and per_identity can be averaged(at the expense of a little bit of over estimating it). However, align_length poses problems, because the query compotents of the BLAST hit might overlap, so we can't just add up all the lengths of the matches. More on this, however, next week.

Let me, in the meantime, explain the concept behind exon_length. Since GeneWise is returning very confusing(and sometimes extremely erronate) results, we decided to choose our gluing algorithm. Of course, we're still going to use GeneWise to detect chimeric genes(since we need exon-intron boundaries), but in the general retrogenes case, we only need to know whether a cDNA transcript comes from a multi or single exon gene. So that's where the gluing algorithm comes into play. At first, we thought about counting how many BLAST hits we glue together and having that as a good exon count. However, during our meeting with Dr. Zhang, we realized that there are cases in which two BLAST hits actually belong to the same exon. For example, this happens when the BLAST hits overlap on the chromosome OR when the distance between them is too short to be an intron. The cutoff we chose for a lower bond on intron length is 30. I read in some paper that there is an 8 nucleotide long intron, however, there is no use risking most of our data for that one little exception. (which we will still be able to detect as being multi-exon because of the other exons that will be detected).

All of this looks good in theory, but implementing it is a little bit hard, because there's a lot of overlap to be carefully examined. More on that, though, next week, when I also move to my 3rd VTech apartment! Gotta love it.

Week 8(August 2- August 9)

GeneWise is running, but there seem to be glitches that happen from time to time and mess up our output. Also, unless told so, GeneWise generates a huge amout to stderr, making our error files full of updates on the status, and not just warnings and actual errors. These files got so big that they filled out the memory space on our clusters and shut everything down. Also, the query ID's for the chimp are longer than the human one(they contain an additional 3 letter symbol to account for their species) and that makes GeneWise input loose a space.(meaning that our initial method of splitting a row by spaces needed a little adjustment)

In any case, let's just hope we're problem-free from now on. On the other hand,we designed a new grouping algorithm. The previous one grouped hits that overlapped on the chromosome. This one groups according to exons, which is more accurate for what we actually want. Basically, as long as we have more exons to read(from a particular gene) or as long as we're overlapping with a new exon(that belongs to a new gene), we keep on grouping. In the end, we get a bigger group then what we would get by just applying the regular grouping algorithm, but for sure the matches that would be grouped before are grouped in this one, as well. (so we're not splitting previous groups, we're at most doing a union of groups) The only downside is that, in the case in which our groups are bigger, we're doing more comparisons. But that shouldn't take that long. We opted for this alternative so that we avoid interacting with MySQL. If we wanted to use the old grouping algorithm, we would have had to query MySQL for each GeneWise result, which would have taken too long. It's interesting to see how these computational limitations(the fact that we can't afford quering MySQL so many times) actually added a theoretical feature to our design. I cannot stress enough how important such a thing is to experience, if you ever want to do more than just learn algorithms off a book.

On the fun side, though, this weekend we had a party at Dr. Zhang's. I learned how to marinate chicken Chinese- style! We grilled some burgers, chicken and some corn, that was pretty fun. I also witnessed flipping burgers with a chop stick which was so awesome looking. We then played charades and had fun acting around. The main highlights were Shubhi's rapping, Dr. Heath's "Alien vs Predator" and so much more. Check out Shilpa's website for funny pictures of the event.

Week 7(July 26- August 1)

This week has been dominated by our final struggle to make GeneWise work. We arranged everything so that it could be run on the cluster, including making 20 input files(for the 20 nodes), writing the submit files, testing errors, parsing etc. We could finally see it happening until ... guess what?!?!, the "gluing" algorithm failed again! Our very own baby. Truth be told, though, we are refining it more and more each time we actually encounter problems. Let me make myself understood, though. The algorithm doesn't fail as a script. It fails in the sense that it doesn't do what we want it to do. I guess such are the pains of parenthood.(hahaha) From the beginning, what we actually wanted to do with our algorithm is to account for possible introns in between the matches AND to avoid duplication. We started with imposing some rough conditions but, with time, we encountered all sorts of situations that required us to be more and more exact about what we wanted to do. Now, we are estimating intron length, we have a different overlap function... all of this to exactly make sure our goal is met. One conclusion I can fairly draw from this algorithm design is that you can't mess with the DNA! If you ever want to design an algorithm that interprets your DNA code, make sure you know exactly what you want him to do. Every restricting condition should be in there and as precise as possible. Otherwise, the DNA demon will come in and mess up your data. !!!!! We have run into literally ALL cases possible! And while that was extremely frustrating, it was for the best, because it required us to revise and even radically change our algorithm so that it's perfect.

In any case, GeneWise is running now! We're gonna wait the weekend to be done with it, and then we're going to proceed with the chimp genome. We also started thinking of what happens after the program is done, so we'e started to clean up our code, comment it well and organize it such that it'll be usable for other people who might work on thw project. That also means writing a documentation that will be used in both omy DREU final paper and in some possible future publishable paper. As opposed to last week, now we feel closer and more confident about the end. Most of the hard problems have been dealt with, though I can see making some more readjustment once we start getting retrogenes.

Week 6(July 19- July 25)

This week has been extremely busy. We still haven't started using GeneWise. We have tried to cut down the input, however, we didn't get far with that. At one point, we calculated the estimated running time, and realized that it would take 12 days on a cluster of 20 computers to run our whole input. !!!!!!! So we decided to cut down even more. For example, even though the literature recommends that we add 15,000 bp at the end of each input sequence, we decided that we add only about 1000. The way we constructed our input sequences(after the "gluing" algorithm), we already made sure we had some bit of space. We're also going to try to choose a frame in which to translate the nucleotide sequence in. Running GeneWise in all three frames is not a good idea. However, actually choosing a frame is very hard. We tried looking at the smallest nr of stop codons or the first ATG sequence, but neither of those methods were able to clearly distinguish between the three. Dr. Zhang suggested we use the longest ORF, which seems like a good criterium, considering that we will have to deal with any input genome and cDNA. In the meantime, I am working on eliminating the queries that we don't really need to know the structure of. In order to detect chimeric genes, we are mostly interested in single-exon sequences that matched to multi-exon ones. In our "grouping" algorithm, we grouped queries that matched to roughly the same place. So we decided to throw out groups that contain only multi-exon sequences. That is, sequences that we expect to correspond to multi-exon genes. In order to determine this, we are going to compare the cDNA length to the length of the match. In the case in which the match is multi-exon, the huge introns in between the exons will make the match considerably bigger than the cDNA length. The idea might be good, however, it's a bit hard to consider all the cases possible and at the same time remain faithfull to the biological truth.

On the other hand, we started thinking about how we could interpret the GeneWise results. I wrote the script to detect chimeric genes. It took a while, because having to deal with data constraints really makes things harder to deal with. Avoiding interaction with MySQL is a really good idea, but it complicates programming. Quering MySQL every time you need a piece of data is two code lines away. But using it 50,000 times is a couple of HOURS away, and we can't afford that. Thank God Perl allows you to interact with files really well and we can always just read the data line by line and use it properly. It's fun, though, having to come up with ingenious ways, rather than just adopt the easy way out. Even if thinking up these script hasn't taken long, designing and coding them sure has. And I'm not talking about bugs. By the way, I have managed to avoid debugging successfully. Perl is an easy and straightforward language(with the exception of a few different operators). In any case, coding has never been this easy.

I have 4 weeks left of the program, Shilpa has 2. I would really like us to obtain results. At least make GeneWise work and detect SOME sort of retrogenes. I am hoping that maybe, in my last two weeks, I will have time to work on the birth and death model. Of course, writing the paper will take up a lot of time as well. Oh, it feels like we're very close, but problems keep on showing up, slowing us down. Maybe this upcoming week, we will finally start running Genewise and start analyzing our results.

Week 5(July 12- July 18)

The "grouping" algorithm is ready to use. Writing it down was a lot like writing the "gluing" algorithm, expect that we perfected a little bit our Perl coding skills(functions and all that jazz). We also came up with an idea for speeding up our MySQL- related scripts. Instead of quering MySQL every time we want to pull a large chunk in, or worse, add an entry to our database(50,000 times), we decided to use intermediary text files that we could later load into MySQL in one single command line. We are trying to shift our coding policy towards using text files whenever we can, rather than working directly with MySQL. For example, as often as we can, we'd rather print our output into a tab delineated file that we could later upload into the desired database. Just accesing MySQL so many times, not to mention maintaining a live connection with it, has posed problems before, so we're trying to avoid contact with it as much as we can.

In the meantime, GeneWise has been giving us problems with installation. In order to make a compiler component to work, we need to set some environment variables, process which requires an entire set of premissions(in addition to the normal sudo access we had to our machines). Rob, our sys admin, finally came to the rescue(after an entire day of reading sys documentation) and fixed everything for us. GeneWise seems pretty hard to work with, as useful as it is. For example, it only takes as input sequences of amino acids. (when we run it with simple DNA, it seems to be trying to convert it into an amino acid sequence, but eventually fails at it) Doesn't seem like too big of a deal, right? However, we are having problems deciding which frame we should translate in. Ideally, we would choose the frame that produces no stop codons in its conversation, but there are cases in which none of the frames is without stop codons. GeneWise can also detect pseudogenes, so we also had the confusing case in which one frame gave us a pseudogene and rest didn't spot anything. So far, the only conclusion we have arrived at is that, well, we should run GeneWise with all the 3 possible frames and see what happens then.(we're gonna do just the 3 forward ones, instead of the entire 6 ones because GeneWise considers both directions when it matches, so the results would be redundant) We're going to read more about the matching algorithm in the GeneWise paper, to see how exactly frame choice influences prediction.

We also started thinking about a future paper. Shilpa has done some of the figures so far, am I'm in charge of writing the description and the pseudocode for the algorithms used. What I'm concerned more about is documenting our choices along the way. It's easy to forget about them, because Shilpa and I discuss them all the time and then proceed in our design once we make the choice. While it seems natural to us, it might not seem that straight-forward to an outside reader. I've started to document the questions that have arised and the explanations for our choice. They mostly refer to cases that arise logically(by just thinking about how the algorithm works) and that are solved by making some biological assumptions.(hopefully, those assumptions are well founded and not easily contestable)

Week 4(July 5- July 11)

We are experiencing some problems with the "gluing" algorithm. The tests we previously did on it didn't show such problems, so we got pretty lucky for discovering them now. For example, we needed to take care of the case in which BLAST matched our cDNA query in the reverse position, therefore, our BLAST hits had higher start than end. That added a new parameter to our program, "direction". Breaking up the cases was a bit of a hastle to code up, too, just classic problems like string comparison and boolean values representation in Perl. As of now, though, everything should be all right. But who knows! (worried) What frustrates us is that the algorithm is working really slow and encountering all sorts of problems(like not being able to connect to the database after a while). The script has a lot of MySQL queries in it, so we suspect that's the main issue.

On another note, this week we also set up the "grouping" algorithm for detecting chimeric genes. Roughly, what we are looking for are multi-exon genes that, after translation, got inserted into the exon area of another gene. In terms of BLAST hits, we would have a cDNA query that matches in "chunks"(exons) on one gene, and entirely on another one. That gives us two cDNA queries that hit the same area on the chromosome. We decided to keep an eye on this type of interaction even before we run Genewise. At this stage, what we have are contigous portions of a chromosome to which a certan cDNA query hit.(after the "gluing" algorithm) The next step now is to take those portions of the chromosome and determine which might be indicative of a chimeric gene. This means that we are looking for portions on the chromosome such that one overlaps entirely on the other one. (95%) In a way, we are trying to eliminate redundancy in the portions of the chromosome that had hits on them. Queries that hit the same area are grouped together for future reference. We have been able to spot chimeric genes just by looking at the data. For example, you could see cDNA query Q1 in which partitions of it hit this general area on the chromosome(which we expect to be a multi-exon gene) and then the entire cDNA sequence matching 100% on some other area(in the exon of another gene). There are more details to be check in order to actually make sure that it is a chimeric gene, but, in any case, such is the intuition. The algorithm itself is not that complicated(deals with a lot less cases than the "gluing" one). Hopefully, we will get to use it soon.

My presentation on protein folding went really well! Once I started researching the field a little bit, I started encountering all these amazing questions that people are trying to answer. Just to give you a little bit of taste of it, I want to mention to the big problems in protein folding. Levinthal's paradox: a random search of the state of possible configurations(for a protein) would take longer than the age of the universe, yet proteins are able to reach their native comformation in a matter of seconds. Anfinsen's experiment showed that ribonuclease could refold into native state following chemical denaturation, which implies that the information for correctly foldind a protein is contained in its amino acid sequence. Starting with these two questions, you get to all sorts of paths to follow. Some scientists are interested in describing the kinetics behind the folding process, some are interested in finding folding similarities between proteins(so that they could homologically predict the folds of new proteins based on such similary scores), some are interested in just using first principles to arrive at a reasonable fold.(lattice models etc) Doing a review on this entire material was a very interesting challenge and I think it worked out decently. Choosing the specific paper to present was a hard decision, though. At first, I noticed this paper called "Free Energy Estimates of All-atom Protein Structures Using Generalized Belief Propagation". GBP?!?!?!? Whaaaaat. Sorry for the rush of excitement, I just got really attached to GBP when I took the Computer Vision class. Not to mention that it sounds extremely badass. :D The machinery behind GBP involves Markov Random Fields, Bayesian Networks and factor graphs. It's usually used for inference in graphical models, which so far I've only seen in Vision. And now they're applying it to Comp Bio, which is awesome. Also, what got me going was the fact that... I could actually understand what they were talking about(well, got the drift of the Math and whatnot). Finally, my kind of stuff. (I know, it sounds pompousm but I'm scared of Bio ha ha) In any case, the paper had not really been published in a journal, so I was a little bit reluctant to present it. Instead, I went to the RECOMB 2009 website and browsed through the published papers. Soon enough, another one caught my attention! Even better than before.... wait.. wait for it... "A Probabilistic Graphical Model for Ab Initio Folding"! Also, the paper was product of our very own TTI in collaboration with the Departments of Chemistry and Biochemistry and Molecular Biology at UofC. Are we looking at some sort of fusion between AI fields? :D This one used Conditional Random Fields and explored protein conformations in a continous space according to their probabilities. Pretty cool! In any case, I can see that I've talked way too much about this. Sorry for that, I hope you enjoyed my ranting, though. As a conclusion, everything turned out fine, it was an amazing experience aaaaand... I'm ready to end this week's journal. See you next week!

Week 3(June 28- July 4)

Confronted with the slow progress of running BLAST, Dr. Heath suggested we use a VT cluster that actually has a script for running parallel BLAST(called mpiblast, conceived by VT director of the Sinergy Lab, Wu Feng!!!). That just took our research to the next level. I feel very empowered, ha ha. In any case, we encountered some issues with the memory capacity, and after debating for a while on how parallel computing works(trying to determine what data we can count on from before our crash happened), we decided to redo the rest of the genome. Since BLAST is up and running, we need to think about using GeneWise.

We will use it to determine the gene structure (exons and introns) of the hits on the chromosome. However, there is one thing we need to figure out. Ideally, we would like to input into GeneWise a cDNA and a portion of the chromosome. GeneWise, using the cDNA to do some matchings, will return to us the position and structure of the gene(s) on the chromosome. The problem is, what portion of the chromosome should we input? Because we're BLAST-ing cDNA against the genome, our hits will basically only point out to the exon area of the gene. Therefore, we can't really input into GeneWise solely the BLAST hits (even GeneWise realizes the need for an ample framework/ wiggle room, so they request you add 15,000 base pairs to each end of your DNA sequence). That's why, ladies and gentlemen, we came up with the "gluing "algorithm.

The algorithms tries to "glue" our hits on the genome into continuous "chunks" that make sense.
Basically, our input is: for each cDNA query, all the hits on the same chromosome.
Our expected output is: clusters of these hits that would capture the structure of a gene.
Preprocessing: we're going to order our hits according to subject_start (subject being the hit on the genome), so that we can go through the hits in an increasing order
Theoretical approach: we're not just going to try to cluster hits based on their position of the chromosome( of course, hits that overlap will be automatically glued, since GeneWise doesn't care which part of the gene hit the chromosome in that position). Hits that are too far away will not be automatically discarded, because the gap in between might represent an intron (and we do want to include introns when we input the sequence to GeneWise). On the other hand, we also need to take into consideration the fact that the same cDNA might have matched several genes on the chromosome (as would happen in the case of duplication). That's why we need to be careful with distinguishing when another cluster might begin (cluster roughly meaning a gene, in this case). In any case, details will be included in the final paper.

We've also started thinking about ways to detect chimeric genes and building homolog families, which we might be able to do even before we run GeneWise. More on that, though, next week.

On another note, the bioinformatics lab has every week this informal meeting (Easy Chair Journal Club) in which one of the us presents an interesting paper. This week, Dr. Heath presented a paper on reduced amino acid alphabets and how doing alignments on such reduced alphabets might improve sensitivity to folding characteristics. What I liked a lot about this paper (and the field of protein folding prediction in general) is how little is know of the actual biological process and how Math&Comp Sci are trying to fill in the gaps. This kind of work in particular appeals to me because of its intent to mimic nature. (and even better nature by designing new proteins etc) In any case, the paper brought up a discussion on protein folding algorithms and I offered to do next week's presentation precisely on this. I remember being very interested in it when we discussed it in my Comp. Bio. class at UofC, so I decided to give it a try. I plant to first do a general presentation of the field (with its current problems and limitations, and future challenges) and then pick an interesting paper to discuss.

Week 2(June 21- June 27)

It's decided! For now, I am going to take care of the Chimp genome, ha ha. While Shilpa is preparing and testing the pipeline for the human genome, I am going to do the same thing with the chimp one. I first need to get comfortable with BioPerl and actually running BLAST(as opposed to just studying the algorithm in class). Dr. Heath suggested it was a good exercise for me to do, and I agree, since a lot of the times, "getting your hand dirty" is more valuable than just reading about it in a book. (yes, I also philosophize about the nature of knowledge and experience- UofC trademark).

We've also started to think about how the project would evolve once we get our BLAST hits. We came up with this interesting problem of detecting nearest neighbours and making connections between different cDNA queries. I feel like we could use some complicated nearest-neighbour algorithms, but for now, the best thing is to stay put until we see how the results look like. We also have started outlining the algorithm that prepares our BLAST hits for GeneWise. The whole idea is a bit tricky, because it's not just an algorithm that clusters according to position on the chromosome... we also need to reflect the biological truth behind it and cluster according to our final purpose in mind (to obtain a valid predicted gene structure).

As for the eternal problem of running BLAST, we're having problems running the entire genome on a single machine. For now, we're trying to get around that by splitting the database into smaller ones and running BLAST on that. However, it should still take a long time and that is a bit worrisome.

Week 1(June 14- June 20)

I arrived in Blacksburg with my backpack, my laptop bag, and a newly bought New Yorker. I enjoy traveling very much but this year, I have had too much of it. When I arrived in Blacksburg, after three active months studying (in) Paris, I was ready for a homely place. And strangely, Blacksburg offered me that. Even though its downtown doesn't compare to Chicago's Michigan Avenue or Paris' Saint-Michel, somehow, Blacksburg has begun to grown on me as a very enjoyable place.

Also, the feeling of actually working in a lab (about to have my own cubicle, wow!) is amazing. I was always intimidated by the grad students, all with their own cubicles, toiling at their work. And now I get to have my own desk, which I can fill with research papers and coffee cups,ha ha. The project is looking very interesting, although the bio is a a bit of a challenge. Shilpa is reviewing some notions with me so that I get up to speed with the biological processes behind retro-transposition. Right now I am reading some introductory research papers Dr Zhang suggested. I had a meeting with her and we established that I am supposed to work on the pipeline with Shilpa, and, if things work out and we get results, develop a mathematical model for the birth and death of retrogenes. I hope we get to that point. I just took an intro to probabilities class in Paris and the whole birth and death framework is fresh in my mind, though I can foresee getting stuck with the bio aspect. We have a meeting next Wednesday so I'm hoping by that point I can fully get immersed in the pipeline.