I ended up staying late tonight because I thought I was super close to getting the semantic similarity confidence scoring method, GOSemSim, working. The key word there is "thought." Oh well. I also said I would spend some time coding up my website today, but got wrapped up in things. (Until now, my blog entries have just been sitting in a text file on my computer...)
Before I left, I kicked off a batch DSD run for the 100 PPI networks based on weighted literature counts in BioGRID. Hopefully everything will work... fingers crossed!
On a more personal note, I forced myself to do yoga again today despite being super sore from the last time. Summer is a great time for forming new habits (and breaking old ones), so I think this will be an excellent change. I also told myself I'd get a yoga mat and figure out how to lug it home if I can keep this up all week.
I got to my desk this morning expecting the BioGRID networks to be done running through DSD, only to find that the batch had crashed and burned after completing 2 of the 100. This was a bit of a bummer, but gave me some free time to get this website up and running! These DSD runs take much, much longer than the MINT ones did. Each file has ~6k nodes (proteins) and ~200k edges (interactions between the proteins). Even though my local machine has a 4-core processor, I only ran it on 2 cores because the machine could barely keep up with the cursor in a text editor - yikes. I could run it on one of the Tufts servers, but I'm still pretty unfamiliar with the systems here. My experience with them so far involves lots of CPU throttling as well, so everything just completes faster locally.
Towards the end of the day, we were able to get ahold of the code to take DSD results and predict functions using MajorityVote. It's all set to be figured out tomorrow when we have fresh brains.
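For anyone curious what majority voting means here, this is a rough sketch of the general idea and not the code we were given: a protein gets the function label that is most common among its k closest annotated neighbors by DSD distance. The names majority_vote, distances, labels, and k are all made up for illustration, and the real code may handle ties and neighbor selection differently.

```python
from collections import Counter

def majority_vote(protein, distances, labels, k=3):
    """Predict one function label for `protein` from its k nearest annotated neighbors.

    distances: dict mapping protein -> {other protein: DSD distance} (hypothetical layout)
    labels:    dict mapping annotated protein -> list of known function labels
    """
    # Rank annotated neighbors from closest to farthest and keep the first k.
    neighbors = sorted(
        (p for p in distances[protein] if p != protein and p in labels),
        key=lambda p: distances[protein][p],
    )[:k]

    # Each neighbor casts one vote for every label it carries.
    votes = Counter()
    for p in neighbors:
        votes.update(labels[p])

    if not votes:
        return None  # no annotated neighbors, so no prediction
    return votes.most_common(1)[0][0]
```

Calling majority_vote("P", distances, labels, k=3) returns a single label, or None if none of the neighbors are annotated.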
I guess being constantly on the go finally caught up to me. After waking up mentally fresh, but physically tired and, well, lazy, I decided to stay in my pj's and work from home.
After continuing to finagle the protein IDs for GOSemSim, I made a shocking discovery about the BioGRID network we are using, or rather, our assumptions about it. We were assuming that all of the protein interactions contained in this database were strictly between yeast proteins. It turns out that some interactions also involve a protein from another organism, such as human. Also, some interactions involve strangely named yeast proteins without systematic names. This is not so great for building yeast PPI networks. This resulted in a very, very long email thread between Lenore, Thomas, and me discussing the extent of the problem and what to do about it.
At the end of our discussion, the consensus was that we should definitely filter out all non-yeast interactions using taxonomic ID, and possibly filter further based on strict yeast nomenclature rules. We decided to talk about this at the undergrad meeting tomorrow and look at the overlap between the two filtered sets. We're still not sure how we'll resolve the discrepancy about the yeast nomenclature, but manual curation is still an option at this point.
In other news, I got GOSemSim working after straightening out these assumptions. Turns out it was largely failing when looking up these non-yeast proteins in a yeast protein database. The BioGRID DSD runs based on literature probabilities are still going as well. I also tweaked the website code further while GOSemSim and the BioGRID runs were going.
This morning I prepped the 100 score-generated networks based on GOSemSim. The weighted-literature DSD runs also finished, so I averaged them all together in preparation for analysis. Thankfully Thomas had filtered the proteins correctly without knowing about the issue with the database, and we were much relieved, given how long it took to run.
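The averaging step is simple enough to sketch. This is a simplified version that assumes each DSD result has already been loaded into a dict mapping a protein pair to its distance, with each undirected pair listed once per run; the function name and data layout are hypothetical, not what the actual scripts use.

```python
from collections import defaultdict

def average_distances(runs):
    """Average pairwise distances across several runs.

    runs: list of dicts, each mapping (protein_a, protein_b) -> distance.
    A pair is averaged over the runs in which it actually appears.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for run in runs:
        for pair, dist in run.items():
            key = tuple(sorted(pair))  # treat (A, B) and (B, A) as the same pair
            totals[key] += dist
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

Whether a pair missing from some runs should be averaged only over the runs that contain it, as done here, or dropped entirely is a judgment call that depends on the analysis.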
We also continued our discussion of the BioGRID database filtering in today's undergrad meeting. I'm so glad we were able to find this, because it affects a lot of our group's work with protein networks in a very visible way. We decided to filter based on yeast taxonomic ID and nomenclature, keeping only open reading frame (ORF) genes.
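Here's a rough sketch of what that two-step filter could look like. It assumes the interactions have already been parsed into (name_a, name_b, taxid_a, taxid_b) tuples, that yeast is identified by the S. cerevisiae S288C taxonomy ID 559292, and that a regular expression for nuclear ORF systematic names (like YAL001C or YBR111W-A) is a good enough nomenclature check. All of those are my assumptions for illustration; the pattern is simplified and skips things like mitochondrial ORFs.

```python
import re

# Assumed, simplified pattern for nuclear yeast ORF systematic names:
# Y + chromosome letter (A-P) + arm (L/R) + 3 digits + strand (W/C) + optional suffix.
ORF_PATTERN = re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$")

# Assumed taxonomy ID for S. cerevisiae S288C; check it against the actual data file.
YEAST_TAXID = "559292"

def is_yeast_orf(name):
    """Return True if `name` looks like a yeast ORF systematic name."""
    return bool(ORF_PATTERN.match(name))

def filter_interactions(rows, taxid=YEAST_TAXID):
    """Keep only interactions where both proteins are yeast ORFs.

    rows: iterable of (name_a, name_b, taxid_a, taxid_b) tuples (hypothetical layout).
    """
    kept = []
    for name_a, name_b, tax_a, tax_b in rows:
        # Step 1: both interactors must carry the yeast taxonomic ID.
        if tax_a != taxid or tax_b != taxid:
            continue
        # Step 2: both names must follow ORF nomenclature.
        if not (is_yeast_orf(name_a) and is_yeast_orf(name_b)):
            continue
        kept.append((name_a, name_b))
    return kept
```

Keeping the two checks separate makes it easy to count how many interactions each step removes, which is handy for comparing the overlap between the two filters.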
Enjoyed spending some time with family!
Happy Father's Day!