I spent most of this week improving my function-prediction program. One thing I forgot to mention previously is that the proteins and functionalities I used came from the third-level MIPS (Munich Information Center for Protein Sequences) database, along with FunCat (the Functional Catalogue). Two other students and I brainstormed how to incorporate DSD into what I was doing and develop new methods for better protein function prediction.

Here is a recap of the project. We have a list of proteins with known functionalities. We split the list in two and erase the functionalities of one half. Then we try to predict the functionalities of the erased group by making guesses based on the functionalities of each protein's neighbors in the network, and we check these guesses against the original functionalities we erased. Originally, I did this using a list of confidences (each one giving the likelihood that two nodes are connected). This week, we decided to incorporate DSD into our methods instead. There was already a file containing an upper triangular matrix of DSD distances between proteins. We used it to create a class with a distance method that takes two proteins as parameters and returns the DSD distance between them, read from the file.

We came up with two algorithms that use this distance. The first was a "multiple runs" method. In this, we took the list of unknowns and looked at each one's neighbors within a certain radius (set to the maximum DSD distance among its 10 closest known neighbors). Using this set of known and unknown neighbors, we guessed each unknown protein's functionality by majority vote. Ties were broken by choosing at random among the functionalities with the most votes. We then ran through the process a second time using the guesses from our first run, so even unknown proteins could contribute to predictions for other unknowns. The final predictions were then checked against the actual values that were erased. We ran this eight times and got an average accuracy of 49.17%.

The second was a "cascade" approach.
In this approach, we took the list of unknowns and first ranked them by the proportion of their t closest neighbors that are known (weighted by 1/DSD). We initially set t to 10. We then started by guessing the functionalities of the highest-ranked ("most popular") unknown proteins and incorporated those guesses when predicting the functionalities of the lower-ranked proteins. We ran this twelve times and got an average accuracy of 49.44%. In both methods, votes from known neighbors were weighted by 1/DSD, and votes from proteins that were originally unknown were weighted by 0.5/DSD.

On Saturday, I decided to head out to Boston Public Market again. As I casually strolled around, I ended up walking from Boston Public Market to the harbor and then to Quincy Market. After a quick lunch, I crossed the Charles River and walked to and around MIT.

On Sunday, I went to Revere Beach, where there was an international sand sculpting festival. It was the last day of the festival, so there was no actual sand sculpting going on, but the pieces were interesting to see. I was more excited by the beach (I'm from Colorado, so I haven't seen that many beaches). However, it was a cloudy day and the water was ice cold, so I did not end up swimming. I also tried fried dough for the first time, and I still need to burn those calories.
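
Coming back to the project for a moment: the weighted vote that both methods share can be sketched in a few lines of Python. This is only a rough illustration, not the actual program. The names `DSDMatrix`, `upper_rows`, `weighted_vote`, `labels`, and `guessed` are made up for the example, the upper triangular file is assumed to be already loaded into nested lists, each protein is assumed to carry a single functionality, and the radius rule is simplified to "the t closest labelled neighbors".

```python
import random
from collections import defaultdict


class DSDMatrix:
    """Hypothetical lookup over an upper triangular matrix of DSD distances."""

    def __init__(self, proteins, upper_rows):
        # Assumed layout: upper_rows[i][k] is the distance between
        # proteins[i] and proteins[i + 1 + k] (diagonal not stored).
        self.index = {name: i for i, name in enumerate(proteins)}
        self.upper_rows = upper_rows

    def distance(self, a, b):
        """Return the DSD distance between proteins a and b."""
        i, j = sorted((self.index[a], self.index[b]))
        if i == j:
            return 0.0
        return self.upper_rows[i][j - i - 1]


def weighted_vote(protein, labels, guessed, dsd, t=10, rng=random):
    """Guess a functionality for `protein` from its t closest labelled neighbors.

    labels  -- dict mapping protein -> functionality (known or guessed)
    guessed -- set of proteins in `labels` whose functionality is only a guess
    """
    neighbors = sorted(
        (p for p in labels if p != protein),
        key=lambda p: dsd.distance(protein, p),
    )[:t]

    votes = defaultdict(float)
    for p in neighbors:
        d = dsd.distance(protein, p)
        if d == 0:
            continue  # skip to avoid dividing by zero
        # Known neighbors vote with weight 1/DSD, guessed ones with 0.5/DSD.
        base = 0.5 if p in guessed else 1.0
        votes[labels[p]] += base / d

    if not votes:
        return None
    best = max(votes.values())
    # Ties are broken at random among the top-scoring functionalities.
    winners = sorted(f for f, v in votes.items() if v == best)
    return rng.choice(winners)


# Tiny made-up example: four proteins, D's label is a guess from an earlier pass.
proteins = ["A", "B", "C", "D"]
upper_rows = [[1.0, 2.0, 4.0], [1.0, 2.0], [1.0]]
dsd = DSDMatrix(proteins, upper_rows)
labels = {"B": "metabolism", "C": "transport", "D": "transport"}
prediction = weighted_vote("A", labels, guessed={"D"}, dsd=dsd)
```

In this toy case, B contributes 1/1 to "metabolism", while C and D together contribute 1/2 + 0.5/4 to "transport", so the closer known neighbor wins even though it is outnumbered. A second run or a cascade step would add the new guess to `labels` and `guessed` before predicting the next protein.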