I also had the opportunity to attend some of the graduate student orientation events this week. This was a great opportunity to learn about topics like choosing an advisor, developing a thesis, and managing the workload of graduate school.
Reflecting, this summer has been very productive. I have learned about databases, graphs and graphlets, and SQL. I have a much better idea of what research is like in an academic setting, and I have much more information about graduate school.
I spent most of this week debugging the code. Now I am working on getting everything commented, and writing the final paper.
I am also increasingly nervous that it will not be completely working in the next two weeks. I am trying to stay optimistic! Often times there are breakthroughs :) This weekend I may try to rewrite portions of the rotation algorithm to see if trying a new method works.
The first part was to write code to find the number of comparisons required. This required a bit of thought, but ended up being quite straight forward because we had enumerated the comparisons in the SQL code. The second part is to write code that tells me the number of times the shape will be found (since if you look for it certain ways, there will be repetition). This has been tricky! The first algorithm I tried was to make a list of all the small shapes that could make the larger shape. Then, I rearranged the matrix representation of the large shape in every arrangement of rows and columns. For each matrix, I then looked for the smaller matrices, in order to count how often they appeared.
However, this gave me a very high count. I realized this was because of symmetry - for example, a triangle would be counted 3 times, and a rectangle would be counted 4 times. In the databases, we are storing the shapes with particular rules which mean that a triangle is only counted one time. So, I'm working on how to count the rotations, to remove this.
There are a few factors to consider. There are graph independent factors - these are things like the number of comparisons required. So, if there is one overlapping node, one comparison is required. If there are two overlapping nodes, there are two comparisons required. There are also additional comparisons required to show that it is an induced path of length 5, which means no two points on the line are connected to eachother.
There are also graph dependent factors - these are dependent on what type of data you are using. For example, one graph might have very few paths of length 3, so it would be faster to look up two P3 shapes than to look up one P3 shape and one P2 shape.
Ideally, I will have code that will tell me what is the best small shapes to look for are, in terms of speed.
This week I'm focusing on a few long term things - mainly, getting ready to write the paper required at the end of the DREU. I have used latex, but never a research paper format. So, I now have a sample format.
I've also been reading papers to provide more background for the paper. The papers are a bit difficult to parse, in part because I'm not sure I have enough background to tell what is relevent. I'm especially interested in papers that might give me an idea for how to give the nodes in an order to specify the graph shape (to solve the problem I encountered on week 4). But I haven't found any papers that address that yet.
To do this, I used the code which I had already, which takes two small graphlets and returns everything they could combine to form. The algorithm is to first list all graphlets smaller than the input. Then, take every combination of two graphlets from the list. See what they can build, using the original code. If it builds the input graph, save the two shapes, and how they need to combine to form the input. Then, return the list of small graphlets that can build the input graphlet.
To make this slightly more efficient, I consider the size of the two small graphlets, compared to the input graphlet. For example, we don't need to test any combinations of the small graphlets where there are fewer nodes than in the input graphlet. Likewise, if either small graphlet has more nodes than the input graphlet, it won't work.
This is probably not the most efficient algorithm to solve this problem, but it does leverage the code that I already have. At the moment, I am only working with small graphlets and the time hasn't been an issue.
I am continuing to read papers. This involves making a citation trail and finding papers which have cited papers I have already read. In the next few days I'd like to start keeping track of the citations in latex. I haven't made a bibliography in latex in a while, so it may have a bit of a learning curve!
Modifying the python code has been very time consuming and a bit frustrating. There have been two parts of this - debugging the python code, and then debugging the SQL code that is generated. Then, I have to go back and update the python code so it will generate the working SQL code.
This week, I had the opportunity to give a three minute talk about my project to a group of other summer interns. It was a great opportunity to practice presentation skills! At the same intern meeting, I met the Dean of Computer Science and Informatics. He spoke about graduate school and told us a bit about the programs available to women and minorities at Indiana University.
But before I do that, I need to learn SQL! I've been working through chapters in two textbooks on databases, SQL, and a very small amount of relational algebra.
I also had the opportunity to tour the super computer Big Red II! Extremely cool!