Disappointingly, I wasn't able to get very much done today. With GOSemSim still running, future work based on that was still tied up. The undergrad dungeon was also empty except for me, interestingly. I did some more website coding but ultimately decided to leave a bit early and get some cleaning done at my apartment. I'm not slacking, my code is running... right? There will be plenty to do once we can begin the analysis.
The orphan Python scripts are starting to pile up.
Spent some time today configuring Git and GitHub so we can finally put these things in version control and manage them a little better. We've been doing pretty well as far as commenting and modularization go, but not so much on keeping track of these little scripts. Only the more globally useful ones made it into version control, and these generally tended to be the better-written ones. This just goes to show how important good coding practices are - if it's crappy and there's a better option, it just won't get used! If you're curious, my GitHub username is kdoroschak, and the repository is named tufts-dsd-confidence.
We got other things done too, but this is definitely the most useful.
Today we learned that DSD has two main output formats - 1) a matrix in which a DSD value is present for every pair of nodes in the graph, and 2) a list that only contains DSD values for pairs that are edges in the original PPI. We are (were) using #2 because the files end up being much smaller, and need to be using #1 so we're not losing information. Live and learn! It turns out that Tony and Inbar had the same realization around the same time, so we commiserated with them a bit before rerunning all the searches.
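For my own future reference, here's roughly what reading the full-matrix format could look like. The tab-delimited layout (protein names along the first row and first column) and the function name are assumptions on my part, so treat this as a sketch rather than the tool's actual format:

```python
import numpy as np

def read_dsd_matrix(path):
    """Read a full-matrix DSD file.

    Assumes a tab-delimited layout with protein names in the first row
    and first column; adjust if the DSD output you have looks different.
    """
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
        proteins = header[1:]  # skip the empty corner cell
        values = np.full((len(proteins), len(proteins)), np.nan)
        for i, line in enumerate(f):
            fields = line.rstrip("\n").split("\t")
            values[i, :] = [float(x) for x in fields[1:]]
    return proteins, values
```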
With this realization, we needed to rewrite the averaging script, which just became much more complicated now that it operates on multiple matrices instead of lists. We now need to manage 100 matrices of roughly 6000x6000 each, or about 3.6 billion total data points. I don't think I've worked on anything this massive before, so I'm not sure how well I balanced memory usage and CPU time. I erred on the side of using less memory so I can run it locally, but we'll see. I managed to finish the script by the end of the day, ready to be run tomorrow.
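To give a sense of the shape of the problem (names here are placeholders, not my actual script), the memory-light approach boils down to keeping a running sum and count per protein pair while only ever holding one matrix in memory:

```python
import numpy as np
from collections import defaultdict

def average_dsd_streaming(matrix_files, read_matrix):
    """Average DSD values across many matrices, one file at a time.

    `read_matrix` is any function returning (protein_names, values) for a file.
    Only one matrix plus the running sums/counts is held in memory at once,
    at the cost of a lot of slow per-element Python work.
    """
    sums = defaultdict(float)  # (protein_a, protein_b) -> running sum
    counts = defaultdict(int)  # (protein_a, protein_b) -> matrices it appears in
    for path in matrix_files:
        proteins, values = read_matrix(path)
        for i, a in enumerate(proteins):
            for j, b in enumerate(proteins):
                if not np.isnan(values[i, j]):
                    sums[a, b] += values[i, j]
                    counts[a, b] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}
```

With ~3.6 billion elements, those two inner Python loops are exactly where all the time goes.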
This morning, we finished running the stragglers from the Mint 100 networks and averaged them together using the new averaging script. Holy cow, it is slow - as in potentially all day and then some. I think I guessed wrong on that tradeoff. When I have time again, I'll definitely revisit this.
We had the undergrad meeting again this afternoon, and verified that we correctly preprocessed our PPI network, another potential source of issues (and rerunning). All good!
I completely overhauled the averaging code today, making drastic optimizations. Basically, instead of processing each element of each matrix individually, all of the matrices are read into memory as a single 3D numpy array, and then the averaging is done across the entire 3D array using speedy numpy functions implemented in C.
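As a sketch of the idea (and assuming the matrices have already been padded out to one shared master ordering, which is the bookkeeping problem I get to next), the averaging step collapses to a single numpy call:

```python
import numpy as np

def average_aligned_matrices(aligned_matrices):
    """Average DSD matrices that already share one master protein ordering.

    Each input is an (n x n) array with NaN wherever a protein pair is
    missing from that particular network; the full stack is held in memory.
    """
    stack = np.stack(aligned_matrices)  # shape: (n_matrices, n, n)
    # nanmean averages each pair over only the matrices where it is present;
    # pairs missing from every matrix come back as NaN (with a RuntimeWarning).
    return np.nanmean(stack, axis=0)
```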
The main problem now is keeping track of indices. Each of the 100 matrices will have the same number of proteins or fewer in its DSD matrix. We are not guaranteed any kind of order, so we have to map from the local DSD index to the protein name to the master DSD index. It sounds simple typing it out, but it can be quite mind-fuzzing in the thick of it.
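A sketch of that mapping step, with made-up names: each file carries its own protein ordering, so the trick is to translate local row/column positions into master positions by way of the protein names.

```python
import numpy as np

def to_master_indexing(local_proteins, local_values, master_index):
    """Place a locally-indexed DSD matrix into master-sized coordinates.

    local_proteins : protein names in this matrix's own (arbitrary) order
    local_values   : the DSD matrix as read from disk, indexed locally
    master_index   : dict mapping protein name -> row/column in the master matrix
    """
    n = len(master_index)
    aligned = np.full((n, n), np.nan)
    # local index -> protein name -> master index
    rows = np.array([master_index[name] for name in local_proteins])
    aligned[np.ix_(rows, rows)] = local_values
    return aligned
```

Feeding the aligned matrices into something like the nanmean sketch above is what lets the averaging itself stay fully vectorized.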
Finished the code and ran it. It still took about an hour to run, but I don't think we'll be able to get significant gains past that with ~15 GB of files to manage and a calculation to perform on every element. (At least not without spending a significant amount of time on it.) Just to be sure, we met with Andrew, a computer science graduate student, and he agrees that performance gains from here on out will probably be pretty small in comparison to what we've done so far.
I rented a bike from Tufts today, which was amazing. I explored more today than I have so far on my own, making it almost as far as Harvard Square (though I didn't dare to ride down there on a Saturday on an unfamiliar bike). I don't think I realized how much I missed biking!
SUPER hot outside today, so I stayed inside and cleaned the apartment. No air conditioning, but it was windy and shady, so it's better than nothing!