I have been assigned two major goals for the next two weeks.
1.) I am to test the extendibility of Brian's program, running it at many different thresholds to determine when it performs best and when it breaks.
2.) I am to create negative tests by comparing the same residues of the source proteins used in the positive experiments to proteins that should not match (i.e., target proteins of different experiments). This is an important aspect of the project that has yet to be explored. So far, all of the tests have checked whether Brian's code can find matching sequences. However, it is just as important that his code correctly identifies proteins that do not match. Eventually, we would like to run this program on protein combinations where we do not already know the outcome, and when we do, we would like to know that the matches are in fact correct.
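As a rough sketch of how the negative test pairs in goal 2 could be generated, each source protein can be paired with the target protein of every other experiment. The experiment names, the protein labels, and the function `negative_pairs` are all hypothetical placeholders for illustration, not part of Brian's program.

```python
def negative_pairs(experiments):
    """Pair each source protein with the targets of all *other* experiments.

    experiments: dict mapping an experiment name to a (source, target) tuple.
    Returns a list of (source, target) pairs that are expected NOT to match.
    """
    pairs = []
    for name_a, (source, _) in experiments.items():
        for name_b, (_, target) in experiments.items():
            if name_a != name_b:  # skip the known positive pairing
                pairs.append((source, target))
    return pairs


# Hypothetical positive experiments (placeholder labels only).
experiments = {
    "exp1": ("sourceA", "targetA"),
    "exp2": ("sourceB", "targetB"),
    "exp3": ("sourceC", "targetC"),
}

pairs = negative_pairs(experiments)
for source, target in pairs:
    print(source, "vs", target)
```

With n positive experiments this gives n * (n - 1) negative pairs, so the number of negative tests grows much faster than the number of positive ones.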
This week I have been running GeoHash at many different thresholds on two sets of proteins. Unfortunately, GeoHash takes around 30 minutes to run on one of the sets, which means that even with the cluster, it has taken a long time to test more than 200 threshold combinations. During the testing, I ran into one major problem. While attempting to fork the workload out to the cluster (16 connected computers), I somehow managed to create an infinite loop. Before Ara and I were able to stop the loop, 3,000 processes had been created and were running on Zeus, the computer that holds everyone's file systems. Fortunately, we were able to stop the processes without crashing the system. To avoid risking a crash on Zeus again, I have decided to run one threshold on each node instead of dividing GeoHash's workload for a single run among the nodes. This way I can still use all the nodes and put an equal amount of work on each of them.
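The one-threshold-per-node scheme can be sketched as a simple round-robin assignment of threshold combinations to nodes. This is only an illustration under assumptions: the node names, the 10 x 20 threshold grid (200 combinations, standing in for the "over 200" actually tested), and the function `assign_thresholds` are hypothetical, and the sketch only builds the schedule without launching anything.

```python
from itertools import product


def assign_thresholds(combos, nodes):
    """Assign threshold combinations to nodes in round-robin order.

    Each combination is one complete GeoHash run, so no single run
    needs to fork work across the cluster.
    """
    schedule = {node: [] for node in nodes}
    for i, combo in enumerate(combos):
        schedule[nodes[i % len(nodes)]].append(combo)
    return schedule


# Hypothetical names for the 16 cluster nodes.
nodes = [f"node{i:02d}" for i in range(1, 17)]

# Hypothetical grid of two threshold parameters (10 x 20 = 200 combinations).
combos = list(product(range(10), range(20)))

schedule = assign_thresholds(combos, nodes)
for node, work in schedule.items():
    print(node, len(work), "runs")
```

Because the assignment is round-robin, the run counts on any two nodes differ by at most one, which keeps the load roughly equal when each run takes a similar amount of time.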