Week 1 --- June 1 - 4

Settling in and starting up

The first week was all about just settling in here at Texas A&M, getting acquainted with the area and the people, and basically just figuring out the requirements of our research project.

After talking to Dr. Williams about possible projects I could work on, I decided to work on one about querying large RF matrices since it sounded interesting. Dr. Williams gave me a paper to read as background on what RF matrices are and how they are created. She said that I didn't really need to know the details of how they're made, but it would be a good idea to know how they work. I skimmed the paper and it was insightful. I found out that RF matrices essentially describe the pairwise differences between the trees in a set. They are generally symmetrical. The maximum RF distance in any RF matrix is the total number of taxa in each tree - 3. This also happens to be the number of unique bipartitions in a tree, I think. I read up on bipartitions before arriving at Texas A&M. They're essentially a very neat way of representing phylogenetic relationships. Anyway, this information will probably be useful since it means that I now know that I only really need to process one side of the diagonal whenever I look at an RF matrix.

I have already started creating small prototype algorithms in Python for extracting data from RF matrices. They take a lot of time and space, apparently, and I haven't even tried the larger matrices yet. I'm thinking of creating an array that uses the RF distance as its index and stores, for each RF distance, an array of tuples containing the tree pairs (i,j) that are that RF distance apart, and then querying that array as a starting point. I'll probably need to improve it a lot as the project progresses.
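A minimal sketch of that starting point, assuming the RF matrix is already in memory as a square list of lists. The function name and the toy values are made up for illustration; only the cells above the diagonal are visited, since the matrix is symmetric.

```python
from collections import defaultdict

def build_rf_index(matrix):
    """Group tree pairs (i, j) by the RF distance between them.

    Only cells above the diagonal are visited: the matrix is symmetric
    and the diagonal compares each tree with itself.
    """
    index = defaultdict(list)
    n = len(matrix)
    for i in range(n):
        for j in range(i + 1, n):
            index[matrix[i][j]].append((i, j))
    return dict(index)

# Toy 4-tree matrix with invented values, for illustration only.
toy = [
    [0, 2, 2, 1],
    [2, 0, 0, 3],
    [2, 0, 0, 3],
    [1, 3, 3, 0],
]
rf_index = build_rf_index(toy)
print(rf_index[2])  # tree pairs that are 2 RF apart
```

Looking up all pairs at a given distance is then a single dictionary access, at the cost of holding every pair in memory at once.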

Week 2 --- June 7 - 11

Day 1

The first day of the second week is going well so far. I'm trying out my (very basic) algorithm on larger tree sets today. Unfortunately, this is taking a while due to the large size of the data (around 10,000 trees, I think -- and that's a 'small' tree set). So I'm using the time when I'm waiting for the algorithm to finish to update this website and brainstorm different ideas for my algorithm.

The Rest of the Week -- Research/In the Lab

(I'm writing the entries for weeks 2-4 a teensy bit late. That might possibly affect my tone on some of what I was doing/thinking in those weeks. I'll try to prevent that as much as possible, but this is forewarning just in case.)

So this week I mostly worked on the representation of the index I will generate from the RF matrix for querying purposes. That sounded like a mouthful. Basically, what I have in mind at the moment is that users of the RF matrix are probably going to be most interested in tree pairs that are a certain RF distance away from each other. (If they just want all the RF distances associated with one or a group of trees, they can very easily obtain it from the RF matrix. I can add that feature in my application later very easily, I think) It's a pain to go through the matrix and search for all of the tree pairs that are a specific RF distance away each and every time. Especially on larger matrices -- a lot more time will be required. So what I'm doing instead is reordering the data in the RF matrix such that it is ordered by the RF distances between the tree pairs rather than by the tree identities (i.e. by the row and column of the RF matrix). I will use this new index as the primary input of my application for searching purposes.

Unfortunately, the problems I was facing with this were:

For larger matrices it is impossible to hold this entire index in the program's memory, at least on the PC I'm working on (I froze it while trying this. What was I thinking?)
Writing to file takes a lot of time (and memory too, depending on how you write to file. I froze the PC on this part too. I really have no experience with dealing with such large amounts of data. -- and I'm still working on the smaller matrices!)
The final index I generate takes a lot of disk space. (No, didn't freeze the PC on this one, but I was kind of surprised to find out that the index I generated was around 3 times the size of the original RF matrix I used to generate it. And that happened on each of the matrices I tried :/)

I resolved the memory problem (second bullet) thanks to the help of Suzanne, one of the grad students working with Dr. Williams. It was really simple, actually: just flush out the data you have so far. Makes you wonder why I wasn't doing that already :s. Oh yeah, it was because I still thought at that point in time that I would be able to store the entire index in the program's memory. Silly me.

Anyway, this week I decided to focus first on how to store the data in file so that it takes the least amount of disk space. I can speed up the process when I know exactly what I need to do better, I think. I tried various methods. I won't get into all of them; there were a lot and most of them were not very smart, now that I think of it. Suffice it to say all of them involved storing all of the row and column pairs associated with each RF value together. For example, line 1 in the file is associated with RF 0 and everything in that line is the row/column pairs that are 0 RF distances apart; line 2 would do the same for RF 1 and so on. That's just one of many methods, though. In any case, that still took up more space than the original RF matrix.

While talking to Dr. Williams about it in our weekly meeting (did I mention we had weekly meetings? Oh and a weekly progress report, too. It keeps you organized and keeps track of what you've been doing, even on weeks when you feel like you haven't done anything, according to Dr. Williams), she mentioned that if I stored just one value instead of the row/column pair, I would be storing less data and thus get smaller file sizes. By just one value I mean treating the 2D matrix as a 1D array and using one absolute value to describe the location of each cell in the matrix instead of two. I felt pretty silly for not thinking of that myself. It seemed so obvious after she said it. I felt even sillier for not thinking of it since I had done that for a project in class before, too, as I vaguely seem to recall. I think it had something to do with Java 1 and Sudoku. College freshman year.
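The single-value idea can be sketched as below. The row-major formula (row * n + col) is an assumption about how the cells are numbered; the entry does not say which numbering was actually used, and the function names are made up.

```python
def to_flat(row, col, n):
    """Collapse a (row, col) cell of an n x n matrix into one position."""
    return row * n + col

def from_flat(position, n):
    """Recover (row, col) from a flat position in an n x n matrix."""
    return divmod(position, n)

n_trees = 5
position = to_flat(2, 3, n_trees)
print(position)                      # 13
print(from_flat(position, n_trees))  # (2, 3)
```

Because the mapping is reversible, nothing is lost by writing one number per cell instead of two.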

So I tried that and it did decrease the file size for the smaller matrices but the file sizes were still too large for the indexes of the larger RF matrices. I had been noting from the start that the matrices had a lot of ranges in them. For example, tree 1 was 6 RF distances apart from trees 6, 7, 8, 9, 10 till 34, maybe. There were a lot of similar cases too in almost all of the RF matrices I observed. This is mostly because of the fact that the total number of possible tree pairs far outnumbers the total number of possible RF distances between them. (max RF = total taxa - 3. The max number of taxa I've seen is around 500. There are still at least a thousand trees--and thus the square of that many tree pairs--usually much, much more, per RF matrix. So there's bound to be ranges)

I put this idea into effect and stored only the start and end of ranges while using the condensed representation of the tree pairs (i.e. one index only). This greatly decreased the file sizes of the index. The index size was now around 10 GB less than the original RF matrix size for the larger RF matrices. I noted that the larger the RF matrix got, the greater the reduction in index size using this method.
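A rough sketch of the range idea, assuming the flat cell positions for one RF distance are available in increasing order. The function name is hypothetical and the real index format is not shown in this entry.

```python
def compress_ranges(sorted_positions):
    """Collapse runs of consecutive positions into (start, end) pairs."""
    ranges = []
    for position in sorted_positions:
        if ranges and position == ranges[-1][1] + 1:
            ranges[-1][1] = position              # extend the current run
        else:
            ranges.append([position, position])   # start a new run
    return [tuple(r) for r in ranges]

print(compress_ranges([5, 6, 7, 8, 12, 20, 21]))
# [(5, 8), (12, 12), (20, 21)]
```

Seven stored positions shrink to three pairs here; the saving grows with the length of the runs, which matches the observation that larger matrices benefited more.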

I forgot to mention that at this point I had assigned a separate file for each possible RF distance in my index. This made writing to file much simpler and I didn't have to get into the hassle of joining files and whatnot as a post processing thing.

I added a second file size minimization thing to the matrix too. I realized that the trees that were 0 RF distances away were identical. Therefore, they can be treated as aliases for each other. Which means that I only need to store the data for one of them and take note of which ones are identical. This can speed up processing time and minimize the index file size, theoretically. I was very happy when I tested this on my first RF matrix (3000 trees/ 60 taxa) since it had a lot of identical trees. However, it wasn't of much use on all of the RF matrices I tested. The effectiveness of this method--and the ranges one too, for that matter--is dependent on the landscape of the RF matrix.
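One way the alias bookkeeping could look, as a sketch only: each duplicate tree is mapped to the first tree it is identical to. It assumes an in-memory square matrix; the function name and toy values are invented.

```python
def find_aliases(matrix):
    """Map each duplicate tree to the first tree it is identical to."""
    alias = {}
    n = len(matrix)
    for i in range(n):
        if i in alias:
            continue  # tree i is itself a duplicate, so skip its row
        for j in range(i + 1, n):
            if j not in alias and matrix[i][j] == 0:
                alias[j] = i
    return alias

# Trees 0, 2 and 3 are identical in this made-up matrix.
toy = [
    [0, 4, 0, 0],
    [4, 0, 4, 4],
    [0, 4, 0, 0],
    [0, 4, 0, 0],
]
print(find_aliases(toy))  # {2: 0, 3: 0}
```

Rows belonging to aliased trees can then be skipped entirely, which is where both the speed-up and the smaller index would come from.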

The Rest of the Week -- Outside the Lab

Dr. Williams' lab had a meeting with two life scientists this week. Grant, one of the grad students in Dr. Williams' lab, was giving a presentation. I also went to the meeting. It was really cool. It made this entire project seem more…real, I guess. I mean, here are people who are actually using/might use what I'm developing right now. This meeting gave me a better understanding of my problem domain as well--things I didn't exactly know, or things I knew but their implications hadn't quite clicked in my mind yet. For example, I knew that the tree pairs in the RF matrix that were 0 RF distances apart were not different at all but it didn't quite click until now that they were identical and that meant I didn’t have to process all of them. I didn't quite follow the entire presentation word by word--there were too many things I didn't know or only had a vague idea about--but I followed enough to get a general idea of what was going on, thanks to the intro to bioinformatics class I took last semester.

I wish this meeting had taken place last week. It would have made the research plan thing I had to submit for the REU program here much simpler. As it is, Dr. Williams had to tell me to edit it a few times before I submitted it. (I'm kind of embarrassed of how bad it was the first time around. I wasn't sure of how general/specific I needed to be in it and how much background I was supposed to provide. It didn't help that I wasn't fully aware about the implications of my project at the time either. So, yeah. The more specific it is, the better it is--but a little background will not hurt if it is needed. I'm still not 100% on the implications of this project, but I think I'm getting there.)

Week 3 --- June 14 - 18

Research/In the Lab

This week I started profiling my code to make it faster. I'm pretty satisfied with the representation of my index itself. It doesn't seem like much for the smaller RF matrices but as they get larger, its benefits show. For example, the index size of a 9 taxa/135,135 trees RF matrix (35 GiB) was 24.6 GiB. I think that's a good reduction. On smaller matrix sizes, the reduction wasn't necessarily as much but the index size was not larger than the original matrix size, at least.

This algorithm works best on matrices with few taxa and a huge number of trees, since they're usually the ones with large ranges that can be condensed. Also, I don't think I can modify it for smaller sizes much further without making extracting the data from the index complex and time consuming.

Just reading the 135,135 tree RF matrix takes 8 minutes. Reading it and storing it as an int array/list takes around 14 minutes. I'm trying to optimize my algorithm for this RF matrix right now since it's the largest one I have. I know this is probably a bad idea since it's focusing on just one particular data set, but I will obviously test the algorithm on other data sets too, when I get it working on this one to resolve any bugs or errors.

I initially started out by doing a profile of just the individual functions I used. I was kind of really surprised because all of them took 0.01 seconds at most, and usually 0, according to the profile results. It turned out that I was using the wrong timer function. If you're using Python in Windows, time.clock() gives the best precision. In Linux, time.time() does. I was not aware of that before this week and so was using the wrong function. Why was I not using the built in profiler? Well, I don't have the pstats module installed on the PC I'm working on in the lab and the profiler requires that module to run. I can't install it myself since I don’t have access to install anything on the PC in the lab; I have to ask Dr. Williams or one of the grad students to do that for me. I don't want to bother them unnecessarily. I can work without the profiler too. Plus, I can always test the algorithm and/or profile it on my own laptop if I need to. I realize now that I could have tried using timeit for this too.
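For reference, a small sketch of the two timing approaches mentioned here: a manual timer wrapped around a section of code, and timeit. It uses time.perf_counter as the timer, which is a swapped-in choice and not something from this entry (time.clock() was removed in Python 3.8). parse_line is a hypothetical stand-in for the code being measured.

```python
import time
import timeit

def parse_line(line):
    """Stand-in for the code being measured: cast one text line to ints."""
    return [int(token) for token in line.split()]

line = " ".join(str(i % 7) for i in range(1000))

# Option 1: a manual wall-clock timer wrapped around a section of code.
start = time.perf_counter()
values = parse_line(line)
elapsed = time.perf_counter() - start

# Option 2: timeit repeats the call many times and returns the total time,
# which avoids reading 0 for very short operations.
runs = 200
per_call = timeit.timeit(lambda: parse_line(line), number=runs) / runs

print("single run: %.6f s, timeit average: %.6f s" % (elapsed, per_call))
```

Averaging over many runs is what makes very short operations measurable at all when a single run is below the timer's resolution.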

Anyway, it turned out that this sort of profiling (of individual statements) wasn't as useful as adding timers around parts of my code, especially in the main loop. I found out that the most time was spent on writing, casting+splitting, and string operations. (Casting + splitting = the process of converting the line read in from the RF matrix into an int array)

I spent a lot of time searching the internet for ways to speed up Python code and I found a few good sites. Unfortunately, very few techniques apply to my code. One thing that does is string concatenation. It's really slow in Python. It's much faster to use other techniques. I'm playing around with this right now. Another thing is using map and list comprehensions. I found out that using map to cast the entire line I read from the RF matrix into an int is much faster than explicitly casting them inside the loop, even though I'm going to have to loop through the read line anyway. That was kind of odd. I'm not sure about the internals of how map and list comprehensions work--something about implied loops that makes them faster. I'm looking for ways to apply that same method (or the rest of the ones I found) to make the algorithm faster. I added some minor speed ups wherever I could -- such as removing the dot operator wherever possible. This sped up the algorithm a (very) little bit.
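The two techniques above can be compared with a sketch like the following. The function names are made up, and the timings will differ between machines and Python versions, so no particular speed-up is claimed here.

```python
import timeit

line = " ".join(str(i % 9) for i in range(2000))

def cast_in_loop(text):
    """Cast each token to int inside an explicit loop."""
    out = []
    for token in text.split():
        out.append(int(token))
    return out

def cast_with_map(text):
    """Cast the whole split line at once with map."""
    return list(map(int, text.split()))

def concat_in_loop(values):
    """Build the output string piece by piece with +=."""
    text = ""
    for value in values:
        text += str(value) + " "
    return text.rstrip()

def join_once(values):
    """Build the output string with a single join."""
    return " ".join(map(str, values))

numbers = cast_with_map(line)
for name, func, arg in [
    ("cast_in_loop", cast_in_loop, line),
    ("cast_with_map", cast_with_map, line),
    ("concat_in_loop", concat_in_loop, numbers),
    ("join_once", join_once, numbers),
]:
    seconds = timeit.timeit(lambda: func(arg), number=200)
    print("%-15s %.4f s" % (name, seconds))
```

Both pairs produce identical results, so swapping one for the other only changes the running time, not the output.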

I also experimented with how much data I should write at one time. It seems that writing after processing 10 million RF matrix cells gives the best results. It doesn't eat up too much memory (according to the System Monitor, memory usage stays constant at around 27% of a 3.9 GB memory, without using the swap). Also, it took the least amount of time compared to the others I tried (1000-50 million processed cells).
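A simplified sketch of writing in batches. It counts buffered values rather than processed matrix cells, and BatchedWriter is a hypothetical helper, not the actual code. An in-memory io.StringIO stands in for the index file so the example runs anywhere.

```python
import io

class BatchedWriter:
    """Collect values in memory and write them out in batches."""

    def __init__(self, handle, batch_size):
        self.handle = handle
        self.batch_size = batch_size
        self.pending = []
        self.flush_count = 0

    def add(self, value):
        self.pending.append(value)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.handle.write("\n".join(map(str, self.pending)) + "\n")
            self.handle.flush()
            self.pending.clear()
            self.flush_count += 1

out = io.StringIO()          # stand-in for a real index file
writer = BatchedWriter(out, batch_size=4)
for value in range(10):
    writer.add(value)
writer.flush()               # write whatever is left at the end
print(writer.flush_count)    # 3
```

The batch size is the knob being tuned in the experiment above: larger batches mean fewer writes but more memory held between them.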

I thought I should try threading to speed up the write process. I haven't ever done threading before, and never anything like it in Python, so I'm probably going to stumble a lot. My initial attempts were pretty bad, which is to be expected. Anyway, Dr. Williams said not to worry about threading at the moment when we met this week.

I did some calculations to set some solid, numerical goals for my algorithm. The 135,135 tree RF matrix has around 18 billion cells -- 9 billion of which I have to process, at least. Since I'm writing out to file at every 10 million processed RF matrix cells, I need to bring the process time of each of those chunks down to an average of 7.89 seconds at most. 7.89 seconds means that the entire index will be generated in 2 hours. For 1 hour, I need to bring it down to 3.94 seconds and to 1.97 seconds for half an hour. Haha…yeah…it's around 34 seconds on a good day right now. :-/
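The arithmetic behind those targets can be reproduced with a few lines. This sketch assumes only the cells strictly above the diagonal are processed; the function name is made up.

```python
def seconds_per_chunk(n_trees, chunk_cells, target_hours):
    """Time budget per chunk so the whole run fits in target_hours."""
    cells = n_trees * (n_trees - 1) // 2   # cells strictly above the diagonal
    chunks = cells / chunk_cells
    return target_hours * 3600 / chunks

for hours in (2, 1, 0.5):
    budget = seconds_per_chunk(135135, 10_000_000, hours)
    print("%.1f h -> %.2f s per chunk" % (hours, budget))
# 2.0 h -> 7.89 s per chunk
# 1.0 h -> 3.94 s per chunk
# 0.5 h -> 1.97 s per chunk
```

That is roughly 913 chunks of 10 million cells each, so every second shaved off a chunk saves about 15 minutes overall.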

Dr. Williams and I talked about an implementation of the algorithm that requires storing only one number per RF distance in the program's memory and writing to file whenever a new one was observed. Okay, that's a horrible explanation, but suffice to say that it required a lot of writes, which caused the PC to freeze up and 2 of my 4 cores to be at 100 % throughout the run. I modified it so that it only wrote after I had processed a certain number of cells from the RF matrix and added a time.sleep(0.000001) after every write to fix that problem. I'm going to work on profiling this version of the file to speed it up.

I noticed that generally, the longer I wait between writes, the faster the processing is. However, the memory requirements go up too, since I store the data as a String.

I think that in itself is the problem. I think I should store it as an int and only convert it to string (with as little formatting as possible) when I write it to file. Hm, that might sound a bit confusing with all my talk of casting to int above. I cast to int as I read and use that for indexing purposes (I figured it would be faster that way than using the dictionary and skipping the entire cast to int part. I might be wrong). As I process the matrix, I store what I have to write to my index files as a String because you can only write a String to file in Python, as far as I know. Unfortunately, this also limits the amount of data I can store/write at one time and a lot of processing time is spent on String formatting etc. operations. I guess I'll still have to do them if I only do them during writes, but maybe I can come up with a faster way of doing them when I have all the data I need to convert in an int list rather than having to do the String operations at each and every one of the 9 billion cells I process. I'll work on that more next week.

Week 4 --- June 21 - 25

Research/In the Lab

This week started off pretty well for my project. I recognized that storing my values as strings to save time was counter productive since it froze the computer and really restricted how much data I could store at a time (more memory consumption than before) so I suggested storing them as an int in my weekly report. Dr. Williams agreed and suggested I have two int arrays, start and end. Start stores the start of the range while end stores the end of it. In the case of there being no range, both start and end store the same value. It took a little work and some slight modification to the original idea to implement it correctly and then remove all the bugs from it (I was confused about when to write what -- making a truth table helped. That, and trial and error) but when I was done, the results were great. It now took 12-16 seconds to process 10,000,000 RF Matrix cells whereas it used to take ~36 seconds to do that beforehand.
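A sketch of the start/end bookkeeping, with one pair of int lists per RF distance. The class name and the toy cell stream are made up, and the real implementation surely differs in detail (for example in when it writes to file).

```python
from collections import defaultdict

class RangeBins:
    """Parallel start/end int lists, one pair per RF distance."""

    def __init__(self):
        self.starts = defaultdict(list)
        self.ends = defaultdict(list)

    def add(self, rf, position):
        ends = self.ends[rf]
        if ends and position == ends[-1] + 1:
            ends[-1] = position                # extends the open range
        else:
            self.starts[rf].append(position)   # opens a new range
            ends.append(position)              # a lone value has start == end

bins = RangeBins()
# (flat position, RF distance) pairs in the order they would be read.
for position, rf in [(1, 6), (2, 6), (3, 6), (4, 2), (5, 6), (6, 6)]:
    bins.add(rf, position)

print(bins.starts[6], bins.ends[6])  # [1, 5] [3, 6]
print(bins.starts[2], bins.ends[2])  # [4] [4]
```

Keeping ints until write time means the string conversion happens once per stored range instead of once per processed cell.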

I kept the 10,000,000 cells metric for comparison purposes initially. I removed it later because I didn't really need it anymore. I decided it was better to keep track of how many cells were in my start/end arrays for a better understanding of the algorithm's performance. For example, now I can have a better understanding of what the write time actually means.

I now have two implementations of this same algorithm. In one of them, it writes out all of the data in the start/end arrays to file when a certain number of ints have been added to them. In the second, it sets a max size for each of the RF bins and only writes out to file the RF bins that have reached their max capacity. My reasoning was that the file overhead might be too large for some of the emptier bins and hence not worth it. Contrary to what I thought, both implementations took around the same amount of time--approximately 3.5 hrs. However, this might be different on different tree sets. I should experiment on those.
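The second variant (write a bin only when it is full) might look roughly like this. It is a sketch under assumptions: the rf_<distance>.txt file naming, the "start end" line format, and the class name are all invented, and a temporary directory stands in for the real index location.

```python
import os
import tempfile
from collections import defaultdict

class PerBinWriter:
    """Buffer ranges per RF distance; write a bin only when it is full."""

    def __init__(self, directory, max_per_bin):
        self.directory = directory
        self.max_per_bin = max_per_bin
        self.bins = defaultdict(list)

    def add(self, rf, start, end):
        bucket = self.bins[rf]
        bucket.append((start, end))
        if len(bucket) >= self.max_per_bin:
            self._flush_bin(rf)

    def _flush_bin(self, rf):
        bucket = self.bins[rf]
        if not bucket:
            return
        # One file per RF distance; the naming scheme is hypothetical.
        path = os.path.join(self.directory, "rf_%d.txt" % rf)
        with open(path, "a") as handle:
            handle.write("".join("%d %d\n" % pair for pair in bucket))
        bucket.clear()

    def close(self):
        # Write out the partially filled bins at the end of the run.
        for rf in list(self.bins):
            self._flush_bin(rf)

with tempfile.TemporaryDirectory() as tmp:
    writer = PerBinWriter(tmp, max_per_bin=2)
    for rf, start, end in [(6, 1, 3), (2, 4, 4), (6, 5, 6), (6, 9, 9)]:
        writer.add(rf, start, end)
    writer.close()
    with open(os.path.join(tmp, "rf_6.txt")) as handle:
        rf6_lines = handle.read().splitlines()

print(rf6_lines)  # ['1 3', '5 6', '9 9']
```

The first variant would instead keep one global counter and flush every bin together once the counter reaches its limit.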

The rest of the week I did some more profiling of my code in an attempt to make it faster. I came up with a couple of ideas but none of them were very effective. It's hard to speed it up when I'm barely doing anything at all in my main loop. Here's what I'm doing:

Read a line from the matrix
Split it and store only the part I'm interested in (one side of the diagonal)
Map each value in the split line as an int

Pseudo loops in mapping are faster than explicit ones, even if you have an explicit loop afterwards anyway.

Point being, casting is slow. Move on.

If the tree I am looking at has not been observed so far (i.e. is a unique tree) :

Process the int -- is it part of an existing range? Start of a new range?

Store the processed data in a list

Write the list to file after a certain number of elements have been stored in it

Increment some counters here and there and do a test on whether the trees observed are identical-- flag if they are
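The read, split and map steps above can be sketched as follows. This assumes the matrix file holds one whitespace-separated row per line in full square form, which is a guess about the file format; the function name and toy rows are made up.

```python
def upper_triangle_values(line, row):
    """Return the ints to the right of the diagonal in one matrix row."""
    tokens = line.split()
    return list(map(int, tokens[row + 1:]))

# Stand-in for lines read from the matrix file (invented values).
rows = ["0 2 2 1", "2 0 0 3", "2 0 0 3", "1 3 3 0"]

cells = []
for row, line in enumerate(rows):
    for offset, rf in enumerate(upper_triangle_values(line, row)):
        col = row + 1 + offset      # column of this value in the full matrix
        cells.append((row, col, rf))

print(cells)
# [(0, 1, 2), (0, 2, 2), (0, 3, 1), (1, 2, 0), (1, 3, 3), (2, 3, 3)]
```

Slicing before the cast means only the values that will actually be processed are converted to int.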

The most time is spent, obviously, on writing the data to file. I even sped that up by around .4 seconds, I think (that's a lot when you're doing it 135 thousand times). I don’t think I can speed it up any more than that. (Funny story--after modification, the write time was around .004 seconds. I was so extremely excited. Then I realized it was writing content from just one line. I wasn't as excited anymore.)

Next, I tried to speed up the counters. They're taking quite a bit of time. I have three counters: row, column, and condensedIndex of the original RF matrix. I don't know how to speed them up. I tried using external packages like Psyco and rewriting/compiling the code in Cython/Pyrexc but it didn't show much improvement. I might not be using the modules correctly, I don't know. I've never used anything like them before and I'm very rusty on my C. Also, I tried this on my laptop on a 3000 tree RF matrix (larger matrices take more space than I can afford to give on my laptop) since I can't install anything on my machine in my office. I didn't want to bother Dr. Williams or any of the grad students about it unless I was sure that the packages would help. I don't think they'd help.

I tried removing the check on the identical trees to see if it would make much of a difference on the speed of the algorithm on the 135,135 tree RF matrix, since that one had no identical trees at all. It barely helped at all; maybe 5-10 minutes at most. I think that if one is waiting for the algorithm for 3 hours, one can wait an extra 5-10 minutes, especially if it means a potential speed up on another tree set.

The only other option I can think of is writing this in C. I tried doing that the other day but I realized that it was taking me more time to simply remember the simplest C commands than to write the code. So I stopped. If I really need to, I'll write it in C but I'm fairly satisfied with my algorithm at the moment. It takes 3.5 hours to generate the index of a 135,135 tree RF matrix, and that's a one-time only thing, really. Plus, it decreased the size of the original matrix by around 10 GiB.

I asked Grant, the resident Python expert according to Dr. Williams, for advice on how to speed it up. He gave some good suggestions that I have been looking into but I haven't been able to properly speed up my program with them yet. There is one suggestion I have yet to look into and I think that one might be the most useful. In the original hashrf program, there was an option to obtain only one diagonal/triangle of the matrix. It would be great if I could obtain that through fasthashrf too. That could potentially speed up the split and read time of each line in the file. I'm not sure exactly how Python performs these operations so I might be wrong, but it's still worth a try. I'll ask Suzanne about the triangle thing and see.

But still, as I've said already, I'm not going to try extremely hard on this part anymore, unless Dr. Williams recommends that I do. I think I should move on to the next part of my project--i.e. the actual functionality.

The problem is that I've never dealt with such large files before so I don't know what a 'good' time is. I'm not even sure at the moment how large RF matrices I'm supposed to plan for. Is 135,135 trees RF matrix large? Medium? Very large? I wanted it (the 135,135 tree RF matrix's index process) in under 2 hours but it takes just an hour to read + split + cast the lines. So it takes around 2 hours to process+write them. I'm sure that write takes up most of the time out of this, so there really isn't much left to optimize. BUT I might be wrong. I'm going to submit the pseudocode of my algorithm to Dr. Williams in the next meeting. Maybe she has some ideas.

Week 5 --- June 28 - July 2

Portland!--The Conference

This week Dr. Williams' lab and I went to Portland for the iEvoBio conference. It's not as sudden as it sounds; we've been preparing for it for a while now. Dr. Williams was extremely nice and took me along. It was a memorable experience and an excellent ice breaker, in my opinion. The grad students and I traveled together there while Dr. Williams went there separately since she was out of town at the time.

It was a two day conference, with the first day overlapping with the end of the Evolution conference. The conference was held in Portland's convention center. It's a really pretty place. It was my first conference experience so I was pretty excited on that part. I didn't really know what to expect, either--what the format should be or anything like that. I did expect it to be somewhat disorganized since the organizers told Suzanne they wanted her to present a day or two before the presentation…yeah, she was not thrilled about the short notice.

I also didn't know what to pack with me. What type of clothes am I expected to wear in a conference? I didn’t really bring any real formal/business clothes with me so I packed the best I had. The weather forecast predicted that the weather would be in the 70s this week so I packed a light sweater as well. I am glad I did. It felt like spring there! It was pretty chilly. I did not expect it to be like that in the middle of summer. Also, I found out I shouldn't have worried about clothing either. Most people in the conference were in jeans and other casual outfits--but that's something that varies from conference to conference, obviously.

The conference format was not what I expected either. I expected something like a science fair or, well, I'm not sure what. For the most part, everyone in it sat in one room of the convention center and speakers went up to the front and presented. Some presentations were 5 minutes long while others were 15. I think a couple might have been half an hour long but we missed those, if they were. There were two parts of the conference in which the format was somewhat different--the Software Bazaar and the Birds of a Feather. In the Software Bazaar, a lot of small tables were joined together to form a large arc and presenters had set up their software on laptops placed on those tables. The rest of us could then walk around and talk to the presenters. The concept itself wasn't so bad but the implementation was lacking. There wasn't much room to walk around and there was generally just one laptop per presenter so it was hard to actually see what the software was about as we walked around. For the Birds of a Feather event, everyone signed up for topics they were interested in on a sign up sheet during the first day. On the second day, each topic was assigned a general sitting area, based on the number of people who signed up, and then people gathered into their small groups to discuss the topic. Some people decided to move around between groups when they wanted to and this was perfectly fine.

I tried to take notes on the presentations--ideas or things that might help in my own project--but I couldn't understand most of what was being presented. I didn’t have sufficient technical background in computational phylogenetics. Most of the presenters assumed the audience had that knowledge, it seemed. The easiest presentations to understand were those that were made by graduate students since they explained the background well. Anyway, I think I got one or two good ideas about visualization from these presentations. Oh yeah! There was a visualization challenge as part of the conference. One of the submissions was an interactive webapp for visualizing trees in 3D and it was really cool. I felt bad for the presenter who followed that person's presentation since his application did exactly the same thing but just didn't have the cool appearance factor.

The first day was a lot more crowded than the second since it overlapped with Evolution. In comparison, the second day the convention center seemed completely empty. The only shop that was still open inside was Starbucks--not that that many were open before either.

The material presented in the conference was pretty interesting. I found it odd, though, that on the one hand the presenters were saying that there isn't enough data for them to use while on the other hand they were saying that there's all this data they have and they're not exactly sure what the best way to manage it is. Issues of data sharing were raised as well. Or, well, that's the impression I got anyway.

It was also interesting to note that a lot of the presentations--even the computational/technical ones--were not really done by Computer Scientists. I remember this one presentation for a software and the presenter kept on saying something along the lines of 'if you want to know details of the software's technical aspects, contact this person because he's the one who made it. ' The first day's presentations were a lot more life science oriented while the second day's presentations were somewhat more computationally oriented. Suzanne presented on the day too. I liked her presentation. It was easy to follow. They tried making Grant present on the second day too (and told him about it an hour before the presentation time, I think) but I don't think that quite worked out because he did not present.

Another thing I noted about the conference was that some people don't know how to present/make oral presentations, even after years in the research field where presentation is pretty important. That just struck me as odd. All of the presentations by graduate students and above that I've seen so far have been pretty good. So it was kind of odd seeing some of these. One of the presenters wasn't really professional in his presentation either. He was using profanity and vulgar jokes as part of his presentation. It sounded like a presentation that would be made by a high school student or maybe an undergraduate student in the classroom; not something I would expect at a conference. Dr. Williams assured me that she hasn't seen anything like that in any of the conferences she's been to before.

Most of the other presentations were fine, though. I wish I could say more about their content but 1. I've forgotten a lot of it and can't find my notes and 2. I really didn't understand a lot of it because I didn’t have the right technical background but I did try to soak in as much information as I could. I did enjoy the conference, though. It was a new experience for me and I'm glad Dr. Williams gave me the opportunity to have this experience.

Portland! -- Outside the Conference

The time spent outside of the conference was quite memorable for me. I spent a lot of time with Grant and Suzanne and got to know them a lot better. They're really cool people.

Travel into and out of College Station was eventful. We were going to take a connecting flight to Houston and then fly from Houston to Portland. Unfortunately, the weather wasn't ideal for flights that day and our connecting flight kept on getting delayed. We were afraid we might miss our flight from Houston to Portland as well. We seriously considered driving all the way to Houston so that we wouldn’t miss that flight, but we were afraid we might miss the flight anyway if we did that. So we didn't.

It's interesting to note the flight delays. Our flight from Houston to Portland was delayed because it was waiting for a flight from San Antonio to arrive. The San Antonio flight was delayed because it was waiting for a flight from Houston to arrive. Houston airport was closed because of the weather.

We got really hungry in the airport too and there weren't any open food shops in the airport (and I forgot to pack any snacks with me). We endured it for a while but when we realized the flight might be delayed even further, we ended up ordering pizza. Five minutes after we ordered the pizza, someone said that our plane was going to arrive in fifteen minutes, so we cancelled the pizza delivery. We shouldn't have done that. We ended up waiting for at least an hour after that too. The airport management was nice enough to order pizzas for all of us after a while. I don’t think I was ever that happy to see/eat a non-Pizza Hut pizza before (I like Pizza Hut pizzas the best). Small airports have their benefits.

Around two other flights were delayed because of the weather too along with us. Both of them had flights scheduled after hours. Both of them left the airport before us. At least the airport staff put on the soccer world cup for us to watch. And there was free internet. And power outlets.

When we finally did arrive in Houston, we ran to our next flight's gate (it was also delayed). Okay, well, Grant and Suzanne ran. I tried but I didn’t have enough stamina and my bag was slowing me down. When we reached our gate, we found out that the gates had been changed and that our flight was boarding right then at a gate on the other end of the terminal. Haha. Yeah. Then we really had to run. Suzanne was running in the lead, with Grant following close behind her and me lagging greatly behind. We were covered in perspiration by the time we made it to the gate. Luckily, we made it in time and didn’t miss our flight. As it was, the plane stayed on the ground for what seemed like half an hour after boarding before it took off.

It was a long flight and there wasn't much to do on it. I didn’t bring a book or anything so I ended up writing the journal entries I was behind on. (So if the journal entries for the last two weeks sound somewhat cranky or annoyed, this is why).

Portland airport went well enough. We took a train to our hotel. The train stop was right at the airport, which was pretty neat. When we arrived at our hotel, we found out that they had cancelled our reservations since they hadn't received payment by 6 or something. I don't know what the deal was with that. The people at Texas A&M said that they had done everything. Apparently it isn't the first time something like this has happened, though. Anyway, the hotel was all out of rooms as well. It only had meeting rooms available with roll out beds and sofa beds. They offered to let us sleep in those. They agreed to wait for the university to send in the room's payment but took Grant and Suzanne's credit card numbers on file just in case. It was late and we didn’t want to look for other hotels so we stayed there.

Suzanne and I shared a room. It was apparently a presidential suite. It had a meeting table, a flat panel TV, a small kitchen type of place right next to the Jacuzzi…

When we entered the rooms, it was completely dark. We couldn’t find the light switches and had to use our cell phone lights to find them. There were two doors apart from the main entrance connected to our room. Both of them connected our room to two separate bedrooms (I think) that people were staying in. Suzanne locked those. We had to ask for new sheets and blankets too.

We went out to look for food at around 10:30, after getting our room settled. Suzanne and I felt like eating Thai food and there was exactly one Thai restaurant at walking distance open at the time according to the application on Grant's phone so we decided to go there. We couldn't find the restaurant at first. Pretty much all of the food places we passed were closed. The only places open were bars and (what I thought were) shady clubs. I think we smelled marijuana along the way, too. We entered a pretty shady area after a while without seeing the restaurant so we backtracked a bit only to find out that we had already passed it and that it was closed. Its closing time was not listed on the shop door.

So we went back towards our hotel and ended up eating at a Denny's nearby. I think I must now state that I rarely, if ever, eat out. I have never been to any American restaurants before. I have never been to a Denny's before. I did not know that pancakes are generally huge and that so are french toasts. Where I come from, french toasts are the size of a regular slice of bread and pancakes don't exist. Oh and I'm vegetarian whenever I eat out, since I can only eat Halaal meat which isn't readily available everywhere (I think I ended up annoying Grant and Suzanne a lot because of this). So yeah, this was a new experience for me as well. I ordered all you can eat pancakes. I ended up barely eating 2.

At around midnight, Suzanne and I heard what sounded like a very heated argument between a male and a female from the next room (one of the ones that was connected to our room by a door). We might have heard some violence too. They turned up the music a lot to muffle the sounds but it didn't help too much. The two participants sounded really angry. We did not know what was going on but it scared us and we didn't know what to do. We were afraid that if we said anything, the people in there would come after us, or something. I don’t know, they sounded really angry, strong, and scary. Suzanne ended up barricading our connecting door with chairs. It was more of a psychological relief than anything, I know. We found out later that at least one of the two people from the next room was part of the conference.

Oh and the air conditioning wasn't that great either. I was shivering all night. I think I woke up every hour.

So yeah, not the best of beginnings for the trip, but, in retrospect, it wasn’t that bad. It was an excellent ice breaker, in my opinion.

The next morning we moved to the hotel that Dr. Williams was staying in. They were nice enough to let us move in early. We ended up being late to the conference because we wanted to finish moving in first. I was actually really looking forward to seeing two of the presentations that we missed, but oh well. This was more important.

Speaking of Dr. Williams…it's too bad she caught a cold/flu at the beginning of the trip and was sick throughout it and for a while after we came back as well. She had to miss some of the conference too because she wasn't feeling well.

But, yeah. The rest of the trip went well enough. I won't get into more detail about it because this journal entry is already way too long. All in all, it was pretty informative and I enjoyed it. It was full of new experiences for me.

Week 6 --- July 5 - 9

Research/In the Lab:

Accomplished Previously:

Updating counters was taking up a good amount of time (~20%) in my algorithm so I tried to find ways to remove some counters. I was able to remove one counter, leading to a decrease of around 30 minutes in the time requirements of the 135,135 trees matrix. That means that it went down from 3.5 hours to 3 hours.

This counter stored the RF value observed in the previous matrix cell. Due to the way I stored data in my index, I needed to know this value. I changed the way I store data a little bit so that now I only focus on the RF value in the current matrix cell and its associated data and don't need to know about the previous one anymore.
Removing this variable reduced some CPU time needed in refreshing it on every cell processed, leading to the speedup.

I fixed a small bug in the algorithm that led to a 10-15 minute speedup in the 135,135 trees matrix. I was previously checking whether or not I should write to file (i.e. whether the writeToFile array for the current RF value was full) after every cell processed. However, I only really needed to do that after I added a new value to my writeToFile array, which is generally less frequent. I fixed the algorithm so that it performed this check only when needed.

Goal:

My goal for this week was to speed up the index generation time for the 135,135 trees matrix to 2.5 hours.

This Week:

ctypes

I tried using ctypes in Python this week. A website I came across recommended it and there didn't seem to be any harm in trying. From what I understand, it provides C-like containers for the data structures used in Python. For example, you can use a ctypes int array instead of a Python list.

The ctypes int array was the only thing I could actually apply to my program so I played around with it. I used timeit to compare a ctypes array with a Python list on two operations: assigning a value to an element, and incrementing an int stored in the container.

In both cases, the Python list was faster. To assign a value, the ctypes array took 0.32 s while the Python list took 0.15 s. To increment the same int by 1, the ctypes array took 0.54 s while the Python list took 0.17 s. All times are averages over 3 trials.
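A comparison of this kind can be sketched as follows. This is not the exact script behind the numbers above; the array size, the index used, and the repetition counts are assumptions, so absolute times will differ.

```python
import ctypes
import timeit

N = 1000
c_array = (ctypes.c_int * N)()   # C-style int array, zero-initialised
py_list = [0] * N                # plain Python list


def assign(container):
    container[5] = 42


def increment(container):
    container[5] += 1


def average_time(func, container, trials=3, number=20_000):
    """Average time over `trials` runs of `number` calls each."""
    times = timeit.repeat(lambda: func(container), repeat=trials, number=number)
    return sum(times) / trials


results = {
    (name, func.__name__): average_time(func, container)
    for name, container in (("ctypes", c_array), ("list", py_list))
    for func in (assign, increment)
}
for key, seconds in results.items():
    print(key, f"{seconds:.4f} s")
```

One plausible reason for the gap is that every element access on a ctypes array converts between a C int and a Python int object, while a list just hands back a reference it already holds.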

Identical Tree Checks

In RF matrices, if two trees are 0 distance apart, then those two trees are identical. Data about only one of them needs to be stored. My previous implementation of the algorithm checked which trees were identical and stored information about only the first occurrence of the identical trees. For all of the following instances of that same tree, only the tree IDs of the two trees that were identical were stored.

I noticed that not all of the matrices had any/many identical trees but I checked for identical trees on every processed matrix cell anyway. I experimented with removing this check. It seemed better to keep it rather than not. The 90,002 trees matrix has a couple of thousand identical trees. Without the check it takes more than 40 minutes longer to generate the index files than it would with the check. It might be a better idea to add a preprocessing step to search for 0's in all or half of the RF matrix and, based on the number of zeroes found in the matrix, carry out a version of the algorithm that either does or doesn't check for identical trees. This way, we'll generally spend around 7-8 extra minutes at the beginning of the indexing process but potentially save more than three quarters of an hour.
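The preprocessing idea could look something like this sketch. It assumes the matrix is available as a list of rows, and the function names and the `threshold` parameter are made up for illustration; only the half below the diagonal is scanned, since the matrix is symmetric and the diagonal is always 0.

```python
def count_identical_pairs(matrix):
    """Count zero entries strictly below the diagonal of a square RF matrix."""
    return sum(
        1
        for i, row in enumerate(matrix)
        for j in range(i)
        if row[j] == 0
    )


def should_check_identical(matrix, threshold=1):
    """Hypothetical policy: enable the check once `threshold` zeros exist."""
    return count_identical_pairs(matrix) >= threshold
```

For example, `should_check_identical([[0, 0, 2], [0, 0, 2], [2, 2, 0]])` reports that the check is worth running, because trees 0 and 1 are identical.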

Another problem with having no identical tree checks in matrices that had identical trees was that the resulting index files took up much more disk space than they should, because redundant information about the duplicate trees gets stored. This size increase went up to around 10 GB in the 90,002 trees RF matrix. My goal was to make my index files as small as possible without complicating the search process so including this check seemed necessary.

Profiling/Other results

I found out that, for the 135,135 trees RF matrix, the three steps required to take all rows from the RF matrix and store them in a usable form take around 50 minutes, and writing all the data to the index files takes approximately 40-45 minutes with the current implementation of my algorithm.

I finally got around to testing the times for generating the indexes for all of my tree sets. I stored all these times in an Excel file along with other properties of the tree sets/RF matrices and index files for analysis purposes.

Due Dates

I'm getting somewhat worried about all of the upcoming deadlines now. I need to have everything done in 1.5 weeks so that I'll have enough time to edit them.

Here are the major upcoming due dates (Today is the 9th of July) :

July 27 - Abstract Due
July 30 - Poster Due
Aug 04 - Research Paper Due
Aug 03 & 06 - Poster Presentations

I'm concerned about the references for my paper. I don't really have any real references. There really hasn't been much work done in this area. Dr. Williams said in our meeting that I could add in references for my background material, for example on indexing. I'm planning on using more of the papers published by Dr. Williams' lab as references.

Queries

I'm getting worried since I haven't really worked on the queries part of my project yet and I thought that was the main point of my project. I brainstormed the types of queries my index will support some more this week and implemented some rudimentary querying algorithms while I was waiting for my profiling results.

My index will support the following query types:

Trees that are a certain RF distance or range apart from each other
Data about one tree only
Statistical data about the entire matrix

The minimum and maximum RF distance observed
The range of possible RF distances
A histogram
The most observed RF distance
The tree that was the furthest away from the rest (I guess this would be the one with the largest number of large distances. I'm not entirely sure how else I would implement this)
The tree that was the closest to the rest of the trees (the one with the smallest sum of distances)

I'm not entirely sure if the last two types of statistical data listed above make much sense.
It might be a good idea to use existing libraries for the statistical analysis.
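As a rough sketch of the statistical queries listed above, assuming the whole matrix fits in memory as a list of rows (which won't hold for the large matrices). Using the largest sum of distances for the "furthest" tree is just one possible definition, picked here to mirror the "closest" one; the function and key names are hypothetical.

```python
from collections import Counter


def matrix_statistics(matrix):
    """Summary statistics over the lower triangle of a symmetric RF matrix."""
    distances = [row[j] for i, row in enumerate(matrix) for j in range(i)]
    histogram = Counter(distances)
    # Sum of each tree's distances to all others (the diagonal adds 0).
    totals = [sum(row) for row in matrix]
    return {
        "min": min(distances),
        "max": max(distances),
        "histogram": dict(histogram),
        "most_observed": histogram.most_common(1)[0][0],
        "closest_tree": totals.index(min(totals)),
        "furthest_tree": totals.index(max(totals)),  # assumed definition
    }
```

Ties for the closest or furthest tree go to the lowest tree ID here, which may or may not be what a real query should return.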

Meeting

In this week's meeting Dr. Williams suggested it might be a good time to incorporate my algorithm into fasthashrf, the software developed by her lab for calculating RF matrices. I'm going to start working on that next week.

Day 4 -- A Sick Person's Rant

So I got sick this week. Just a common cold, I think. Nothing big. But annoying as anything annoying can be. Oh yay! Now my eyes are starting to water :/ . I had to take the last two days off since I wasn't feeling well. I had to miss my meeting with Dr. Williams on Tuesday because of that and now I'm not sure what to do with my project. I've been working on profiling it and making it faster but I'm finding it difficult to come up with more ways to make it faster. I mean, well, there are still some things left to try but it seems like such a waste of time when I know the max speedup I'll probably obtain from those little edits is like 10 minutes. (I got 10 minutes more today! Yay! It was an obvious thing that I forgot to do before. Just a Python indent.)

But yeah, there has to be a limit to how much a person should profile a piece of code. The innermost loop is now 4-7 lines long (if you consider the whole write thing one line). I think it would be much more efficient to come up with a different technique altogether than trying to speed this one up. Buuuut I might be wrong, of course. I probably am. I don't know. I think threading/parallel/whatever would be a good idea but, just like it sounded like back there, I have no idea how to do that. I mean, sure, I could look it up/try to learn it, but maybe there's a better way to spend the time I might spend learning how to do that, you know? Well, I do have some idea, but I've never really done it the way I want to do it before and never ever done anything like it in Python. So, yeah.

It's so frustrating. I'm learning a lot here but what I'm learning more than what I'm actually learning is that I don't really know much. Does that make sense? There are so many techniques that I know exist out there for speeding up my code but I just don't know how to use them. I thought college was supposed to teach you stuff. Maybe I just haven't taken the right classes.

I know, I know. I sound like a whiny, ranting person right now. Okay, I am one right now. But I'm sick and kind of stuck. Okay, not really. Stuck, I mean. I am sick. I'm just tired of profiling ten lines of code over and over again without much results. Okay. There were results. Pretty decent ones. Just not the ones I wanted. Fine. I'll stop.

Week 7 --- July 12 - 16

Day 1

Today I started work on incorporating my algorithm as part of fasthashrf. It is contained in the habitat repository which uses git for version control. I have my own branch for the indexing part. Suzanne helped me get acquainted with habitat and the files and parts of code I will need to work with for my indexing part.

I had quite a few questions about how I should incorporate my algorithm into fasthashrf. Suzanne answered some and the rest I figured out as I played around with the fasthashrf files.

Fasthashrf calculates the RF matrix quickly and stores it in a hash table. It can then print the entire matrix out to screen if the user wants in the form of a matrix, list, or hashtable. This printing out process is done using one of two matrix traversal methods/functions, based on the input criteria. I am going to append my algorithm near the end of a copy of one of them.

The inputs I need for my algorithm are:

The RF matrix (one row at a time)
The total number of taxa
The total number of trees

My algorithm outputs:

Index files

The total number of taxa and trees are global variables and can easily be accessed. The RF matrix is extracted row by row from the hash table and stored in a string. That covers my inputs.

My output is created by my algorithm, so that's covered as well.

I did some of the preprocessing work required for running my algorithm, such as introducing a new command line option for the index (-x) in the proper locations in fasthashrf and calling the correct methods to initialize the RF matrix computation + indexing process once the command line is parsed.

I ran into some segmentation fault issues because of these flags at first. It was confusing because it only happened in some of the matrices with no apparent pattern behind the matrices it was crashing on. It has something to do with the preprocessing steps for the traversal method I am using, I think. I didn't call all of them in the beginning, leading to the segmentation faults.

Now I only need to convert my Python code to C++, the language that fasthashrf is written in.

One thing I'm still contemplating is whether or not I should remove the parts of the traversal method that print out the RF matrix to screen.

My algorithm currently looks something like this:

// processing only upper triangle of the matrix
k = 0
If line is not flagged as identical:
    Read RF matrix line
    Split/cast line to int array
    For each cell in line:
        If not consecutive
            start[currentRF][writePtr[currentRF]] = lastIndex[currentRF]
            writePtr[currentRF]++
        end[currentRF][writePtr[currentRF]] = k
        If writePtr[currentRF] == max
            // write start[currentRF] and end[currentRF] to file and reset writePtr[currentRF]
        k++
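One way to read the pseudocode above is as a run-length index: for each RF value, keep [start, end] ranges of consecutive cell numbers that hold that value. The Python sketch below follows that reading only. It leaves out the identical-line flag and the buffered writes, keeps everything in memory, and `build_index` is a hypothetical name.

```python
from collections import defaultdict


def build_index(rows):
    """Run-length index: RF value -> list of [start, end] cell-number runs.

    `rows` holds the upper-triangle cells of each matrix row, in order;
    k is a running cell counter, as in the pseudocode.
    """
    runs = defaultdict(list)
    k = 0
    for row in rows:
        for rf in row:
            rf_runs = runs[rf]
            if rf_runs and rf_runs[-1][1] == k - 1:
                rf_runs[-1][1] = k        # consecutive: extend the open run
            else:
                rf_runs.append([k, k])    # not consecutive: start a new run
            k += 1
    return dict(runs)
```

Storing runs instead of single cells pays off when the same RF value repeats across neighbouring cells, since a long run costs the same two numbers as a short one.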

Day 2

Today I actually implemented my algorithm as part of fasthashrf. I wrote my own C++ code for it instead of using applications to convert the Python code into C++ code. I couldn't quite figure the latter out properly and there seemed to be too much…noise…in the generated code that I didn't care about. Since I don't really care much about keeping the native Python data structures, map/loop comprehensions, etc., figuring out how to convert Python code to C++ with a third party application didn't seem to be worth the time and effort.

My C++ code looks ugly, probably, but it gets the job done. Actually, it gets it more than done. The 135,135 trees matrix now takes only 1.5 hours to process/generate an index. That's half the time it took when the algorithm was a standalone application. It's even less than the time requirement I was hoping for. I'm pretty excited with these results!!

The reason I say my code looks ugly is because I only know a little C and haven't really coded in that for a while now. I mostly copied the format/techniques used in the existing code in the fasthashrf files. I think my writing the index to file implementation is pretty bad/slow, but I couldn't figure out how else to do it and I was more concerned with actually having a working implementation done today than on how good the implementation was (as long as it wasn't too horrible).

I'm processing the lower triangle instead of the upper one in this version of the algorithm. Since I'm using the string representation of the RF matrix row, processing the upper triangle requires looping through the lower one until I reach the start point of the upper triangle and then processing that till the end of the row. If I process the lower triangle, I only have to loop till the end point of the lower triangle instead of the whole row, which saves time.
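The difference can be illustrated in Python (the real code is C++ and parses the string differently, so this is only an analogy with hypothetical function names): for row i, the lower triangle needs just the first i cells, so tokenizing can stop early, while the upper triangle forces a pass over the whole row.

```python
def lower_triangle_cells(row_string, i):
    """Parse only the first i cells of row i (columns 0..i-1)."""
    # maxsplit=i stops splitting after i separators; the unsplit rest is dropped.
    return [int(cell) for cell in row_string.split(maxsplit=i)[:i]]


def upper_triangle_cells(row_string, i):
    """Parse columns i+1..n-1; every earlier cell is still tokenized first."""
    return [int(cell) for cell in row_string.split()[i + 1:]]
```

Averaged over all rows, the lower-triangle version touches about half of each row's text, while the upper-triangle version always scans all of it.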

However, this version leads to some preprocessing problems. All of the index files have a 0 as the first entry in them. I'm going to fix that soon.

I might be able to make my algorithm even faster if I store the RF matrix row as an int array instead of as the string that I take as an input right now. I'll test this.

I ran out of disk space again today. It never fails to surprise me how only text files can take up around 400 GB of space :/

Day 3

Today I touched up the internals of my algorithm--things like malloc/free and data types used in my program (unsigned ints).

I changed the algorithm so that the lower triangle is processed rather than the upper one since I'm going to loop through the lower one in either case.

Adding my algorithm as part of the traversal method slows down the traversal by half, obviously, since more work is done by it now. It also cuts the index generation time in half.

There are still some minor bugs--or, well, considerations--that need to be resolved after some testing. Overall though, I think the incorporation of my algorithm into fasthashrf was fairly successful.

I wonder whether I should work on speeding up my C++ code as well. I'm kind of tired of profiling the algorithm, to be honest (although the results were pretty exciting :D ) . Also, I really think that working on the queries part should be my focus now. I'm fast running out of time to work on that part.

I think tomorrow I'll add some queries into the program. I might be able to use NumPy in this part, which might help speed the queries up.

I wonder how the queries interface should work. Should I make it into a desktop application? Should the UI be commandline based or have a GUI? Should it be a web application? Should the queries application be external or a part of fasthashrf? Should the application start automatically right after the indexing is complete?

Day 4

Today I conducted some logic/sanity tests on the program, both to make sure I didn't forget anything when converting it to C++ code, and because I'm now processing the lower triangle of the RF matrix rather than the upper one.

Things seem to be okay and sane.

I checked the index files at the points where one write ends and the next begins, looking for incorrect or accidentally omitted data. I used the smaller RF matrices (< 200 trees & < 20 taxa) for testing purposes. I did not come across any errors in my checks.

The logic behind the values stored in the index files seemed okay and values that should be stored in each index file were there.

While looking into the ranges of ints, longs, and doubles in C++ (source: wikipedia), I found out that I need to use the unsigned long data type instead of the unsigned int that I am currently using to avoid integer overflow and thus the storage of incorrect data in my index files.

I made a couple of minor changes to my algorithm to improve speed as well. Now I don’t reset the [2D] start/end arrays on write. Instead, I only reset the write pointer for those arrays. This makes sense if you look at the code. Basically, there's less work done on write now. I'm not too sure if this is a good idea or not but so far it hasn't done any apparent harm.
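The pointer-only reset can be sketched like this in Python, with a hypothetical `RunBuffer` class standing in for one RF value's start/end arrays (the real code is C++ with 2D arrays). In this sketch the trick is safe because only the slots below the write pointer are ever read.

```python
class RunBuffer:
    """Fixed-size start/end arrays for one RF value, reused across writes."""

    def __init__(self, capacity):
        self.start = [0] * capacity
        self.end = [0] * capacity
        self.write_ptr = 0  # number of valid entries

    def append(self, start, end):
        """Store one run; returns True when the buffer is full."""
        self.start[self.write_ptr] = start
        self.end[self.write_ptr] = end
        self.write_ptr += 1
        return self.write_ptr == len(self.start)

    def flush(self, sink):
        """Write the valid entries to `sink`, then rewind the pointer only."""
        n = self.write_ptr
        sink.extend(zip(self.start[:n], self.end[:n]))
        self.write_ptr = 0  # old values stay in place but are never read
```

Stale values do remain in the arrays after a flush, so any code that reads past the write pointer would pick up old data.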

I also implemented a rudimentary querying algorithm for searching for all trees that are x distance away from tree y. I have implemented a version to search the RF matrix and one to search my index. I implemented both algorithms in Python since that was easier for me for prototyping purposes. I haven't had a chance to properly test them yet.

Day 5

I worked more on the querying part of my project today.

I tested the time to search for all trees that are x RF distance away from tree y using the algorithms I developed yesterday. My code seems to have quite a few logic errors though, which I spent most of today resolving. I actually made a truth table on the white board in my office for developing the querying algorithms yesterday and used that to fix any logic errors in the code.

The search does not as yet look into the RF 0 file for aliases of found trees to ensure that the search results are complete. I need to add that part.

Search time for the RF matrix depends on the tree ID (y) that is being searched for. The larger the value, the more time required because that's the total number of matrix lines that have to be looped through in order to get the search results.
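A minimal sketch of the matrix-side search, assuming the matrix is a text file with one whitespace-separated row per line (the function name is hypothetical). It shows why the cost grows with y: every line before row y has to be read and skipped.

```python
import io


def trees_at_distance(matrix_file, y, x):
    """Return IDs of all trees exactly x away from tree y.

    Lines 0..y-1 are read and discarded before row y is reached.
    """
    for row_number, line in enumerate(matrix_file):
        if row_number == y:
            return [
                j
                for j, cell in enumerate(line.split())
                if j != y and int(cell) == x
            ]
    raise IndexError(f"matrix has no row {y}")


matrix_text = "0 2 4\n2 0 2\n4 2 0\n"
print(trees_at_distance(io.StringIO(matrix_text), 1, 2))
```

This sketch reads the full row y, which works when the file holds the complete symmetric matrix; with only one triangle on disk, the matching column in later rows would need to be scanned too.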

The querying results for my index files were somewhat strange, though, and I spent some time trying to debug the querying algorithm before finding out that the error lay somewhere in the index generation algorithm rather than in the querying one. This has me somewhat worried.

The problem I encountered is that, in the 9 taxa/135,135 trees RF 2 index file, the last row (/tree) indexed is 8003 although it should be at least 31,783 and probably even greater than that. Similar data omissions are in other index files for this data set as well. The data simply disappeared and it doesn't seem like I overwrote it at all, according to my debugging results.

The problem doesn't exist in the smaller RF matrices I tested so I thought this might be a problem with the type declaration of my variables. But that doesn't seem to be the case; I wrote out all of my variables and their declared types on paper and everything is the correct type. I didn't forget to run make before running the changed program, so that's not the problem either.

I don’t know what went wrong. :/

Week 8 --- July 19 - 23

Day 1

More debugging today! (lots of false leads and dead ends--where is this bug?!)

I don't understand what's going on, exactly. The index query results include extra results as well as the real results but are also missing some results. :s . Okay…that just sounds like it's entirely incorrect. I double checked and I'm converting the index value into its two tree pair constituents correctly so that can't be the problem. :s . I've checked and rechecked the index generation code many times and I can't find the problem there either. I am using print statements for debugging. The problem is that the index files of the small matrices don't have this problem; only the large ones seem to have it. So it takes a long time to get debugging results.

In the meantime, I've been working on my abstract. It's usually better to write the abstract after the paper has been written, but I don't think I'll be able to finish the paper first. I've been making my usual flow chart/map kind of diagram on paper for brainstorming purposes. The problem I'm facing is that my project goal is querying, not indexing, yet I've spent most of the time on indexing rather than on querying. I also don't really understand what I'm supposed to say my approach/methods were for this project. Prototype algorithms and polish them? Make an index and query it? :s

I wanted a draft of my abstract done by the end of today but I'm still in the brainstorming part because of the problems I just mentioned.

Day 2

This elusive bug is getting somewhat frustrating now. :-( …. :-| …. >>_<<

On a bright note, I finished zipping the index files (it takes a while). Dr. Williams said in one of our earlier meetings that I should do that so that we can see whether we should make the index files in binary or something. I didn't quite understand it. Zipping does seem to reduce the total index size on disk for each matrix quite a bit.

Day 3

I FIXED IT!!! :D :D :D :D :D :D :D

The bug…it's fixed!! Yay!!!

So it WAS a problem with the long/int type declaration. I forgot to consider the internal multiplication architecture used in my program. It appears that the integer overflow is not automatically dealt with if you multiply two ints and store the result in a long container. Either the multiplier or the multiplicand (or both) need to be longs.

I performed a multiplication to get the index value every row by multiplying the number of trees (an int constant) with the tree number I was on (a for loop counter). Making the loop counter an unsigned long fixed the problem.
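Python ints never overflow, so the sketch below imitates the C++ behaviour with `ctypes.c_uint32`, which silently keeps only the low 32 bits. It assumes that unsigned int is 32 bits wide on the machine in question and that the index value is row * number of trees + column; the function names are hypothetical.

```python
import ctypes

NUM_TREES = 135_135
UINT32_MAX = 2**32 - 1


def index_32bit(row, col):
    """row * NUM_TREES + col, truncated as 32-bit unsigned arithmetic would."""
    return ctypes.c_uint32(row * NUM_TREES + col).value


def index_wide(row, col):
    """The same value computed without truncation."""
    return row * NUM_TREES + col


def first_overflowing_row():
    """Smallest row whose product with NUM_TREES no longer fits in 32 bits."""
    return UINT32_MAX // NUM_TREES + 1


row = first_overflowing_row()
print(row, index_wide(row, 0), index_32bit(row, 0))
```

Under these assumptions, rows below the printed threshold give identical results in both versions, which fits the bug showing up only in the large matrices. With 135,135 trees the threshold works out to row 31,783.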

But, yeah. I am very happy right now. :D

Day 4

Till now, I've been printing out the RF matrix even when generating the index (it was part of the traversal method that I copied). I talked to Dr. Williams and Suzanne and it doesn't seem like there would be a need for the RF matrix if I'm making an index which contains all that data anyway. If the user wants it, he/she can print that out separately.

I removed the print RF matrix part from the function and the index generation time went down amazingly. The 9 taxa/135,135 trees tree set, whose index generation time was previously 1.5 hours now takes 13-17 minutes. Cool, right? Just the RF matrix print time is a little more than an hour for it. Cooler, right? :D This is even less time than my original time generation goal for this tree set (30 minutes). I'd given up on that goal long ago, too. So, yeah. Pretty exciting.

File I/O …wow. I knew it was a time drain, but I didn't realize how big a time drain it was until now.

I feel like I should be more excited and jumping around or something. For some reason though, I'm just…calm. I like this calmness. :-)

Day 5

My index needs to be reversed.

Great.

This wasn't an issue when I was doing the upper triangle but it is now. If I query the index files, I have to search the entire index file for all of the queries where I search for all trees x distance away from tree y. If the index were reversed, I'd have a known stopping point, which would reduce search time requirements.
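The stopping point can be sketched as follows. It assumes each index entry is a single cell number k = row * num_trees + col from the lower triangle (col < row), which is a simplification of the start/end ranges the real index files hold; the function name is hypothetical. Tree y can only appear in row y or in later rows, so with entries in descending order the scan can stop at the first entry that belongs to an earlier row.

```python
def trees_at_distance_from_index(entries_desc, num_trees, y):
    """Scan one RF value's entries (descending) for pairs involving tree y."""
    matches = []
    for k in entries_desc:
        if k < y * num_trees:
            break  # rows before y hold only columns < row < y: nothing left
        row, col = divmod(k, num_trees)
        if row == y:
            matches.append(col)
        elif col == y:
            matches.append(row)
    return matches
```

In ascending order the same scan has no such cut-off: row y starts somewhere in the middle of the file and every later row may still contain column y, so the whole file has to be read.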

Doing this as a post processing routine would take too long :/ . It would require a full copy of each index file. :/

Oh and my laptop broke. I don't know what I'm going to do about all the deadlines coming up. Repairing it might take a long time too. It really chose the perfect time to break. :/ . And my warranty ran out 6 days ago. Great. Just great.

Week 9 --- July 26 - 30

Busy Busy ...

It's been a busy week. Deadlines coming up. I won't go into minute details about everything I did this week but instead just give a very broad and brief overview of it.

Both the poster and abstract were due this week.

I changed the algorithm so that the data in the index files is now in reverse/descending order.

I conducted some experiments on my algorithm's performance. They are way too time consuming, so I was only able to turn in the results of two of them for my poster. The third experiment will have to be included in the paper only. I kept on making silly mistakes in the experiments which made them take even longer. The performance results aren't too bad. I'm just being careful of how I report them so that I don't misrepresent any data.

I've decided to focus on my indexing part rather than on the querying part of my paper since that is the crux of my work. I haven't spent enough time on the querying part to write about it in my paper. Dr. Williams said that this was fine.

I still need to work on my paper. I want to complete a draft by Monday to turn in to Dr. Williams to look over. I have a general outline done; all that's left is actually writing the thing. Unfortunately, my laptop is still broken. So I don't really have a way of working from my dorm. There are three or four PCs there for common use but it's so noisy where they're at that concentrating on work there becomes a difficult task. It was hard enough finishing my poster without it. I really don't like open office.

According to the lady at Geek Squad, it's a motherboard problem and the part needs to be replaced by the manufacturer. Thankfully, Dell sends their technicians to you to fix your laptop. So it should be fixed by Tuesday next week, giving me just enough time to edit my paper in peace on it the day before it's due. So that's good. In the meantime, I've been coming in to work early and leaving late to get everything done. I came in last weekend too and I'm planning on coming in this weekend as well. My laptop really chose the worst time to die on me.

Week 10 --- Aug 2 - 6

A Busy End to a Fun Summer

It seems hard to believe that this was all just one week. It feels like it was much longer. We had three poster presentations this week--one of which took two days. The final paper was due this week as well. My laptop was fixed (yay!) this week too.

Final Paper

I ended up focusing only on the indexing part of my project in my paper, since that's what I mostly worked on this summer anyway. I'm not really satisfied with my paper. There's a lot of room for improvement. I wish I could have completed some of the tests I was running before the paper was due. Ah well. It's over. I'm not going to think about it anymore. At least I don't have to write another paper for DREU. I can just use this one, however it is.

CS REU Umbrella Poster Session

All of the students doing research in the CS department at Texas A&M University this summer presented their posters to each other in this poster session. It was an informal kind of thing. It was essentially another brown bag lunch (in the same room as all the other brown bag lunches) in which everyone took turns putting up their poster at the front of the room and presenting it in 5 minutes for everyone else. Each presentation was followed by a maximum of 5 minutes of questions. Since there were so many of us, the event was divided into two sessions, one on Monday afternoon and the second the following morning at 10. Each session was two hours long. We were given the order we would present in but not the day we would present on.

I liked these poster sessions. They were a good rehearsal for our actual presentations as well as a good way of finding out what everyone else had been up to the entire summer. I was surprised at the number and quality of the questions asked in them as well. In all of the student presentations I've been to so far, only one or two questions were asked, but in these presentations a lot more were generally asked and they were pretty good questions too. They helped us prepare for the questions that might be asked in the following poster sessions and were also a good way of finding out the strong and weak points of our own presentations.

That said, I only got asked one question, but it was pretty early in the morning (for some of us, at least) and mine was the first presentation of the day. Also, it was pretty technical. Other presenters got asked a lot more.

The other presentations seemed impressive although I couldn't really follow some of them. It's hard presenting and explaining something you've been working extremely closely with for ten weeks to someone who potentially has no real understanding about your entire research area. Especially if your research project is very technical, which was the case with some of the presentations (mine included, maybe).

I really liked the presentation made by the people from the sketch recognition lab (and I'm not saying that just because one of the presenters is my friend). They explained their project and handled questions really well. Generally, the presenters who had the most confidence made the best presentations.

Suzanne was there to see my presentation. She gave me some good pointers on how to present better. I made the same mistake I always make--I didn't explain the background and motivation enough. According to Suzanne, the presentation should be like a story. You have to contextualize your project properly before you go on and explain it. The person listening should know why he/she is spending five minutes or however much listening to you--why should he/she care about your project? I didn't spend enough time on that part and just breezed into the technicalities and results. I talked to Grant and Suzanne some more afterwards too, and they gave some really good advice on how to present and shared some of their experiences with poster sessions. I think talking to them really helped me get rid of any nervousness I might have had in my presentations.

REU Poster Session

The second poster session was in one of the new-ish buildings at Texas A&M. It was open to everyone, I think. All of the summer REU students were presenting in it. Unfortunately, the area wasn't set up that well; there was barely any room to walk, which resulted in not a lot of people actually coming to see us present. There weren't any judges in this session. I only had a chance to present to one person. A second person stopped at my poster but he didn't really let me present. He was interested in what the matrix meant and how it was related to evolution but appeared unimpressed with the answer he received. I don't know exactly what he was expecting. Since not many people came to see our presentations this session, my neighboring presenters ended up juggling and playing around most of the time (free balls and pens were being given out somewhere, and free food/refreshments too). At one point everyone did the kick dance, or whatever it's called. That's how much free time we had. In the two hours of the poster session, most people I saw only presented twice.

USRG Poster Session

We had the opportunity to print out our posters again for the third poster session. The printing was funded by the CS department, which was pretty nice (it usually costs around 60 - 90 dollars, from what I hear). I redid my poster, keeping my previous presentations in mind. I included an additional image to explain the matrix better in case I received that question again. Suzanne and Grant gave me pointers on how to improve the poster. I think the final product turned out quite nice. I used glossy paper for it, too. AND I didn't have to use OpenOffice to make it, which was nice too.

The third poster session was hosted by USRG and was for all of the Engineering summer REU students. It was judged as well. We each had anywhere between 3 and 5 judges. We were also required to judge other presenters (there were two sessions; we judged in one and presented in the other). I judged a person who worked on a project for making fuel out of bird refuse. He ended up being one of my judges as well. Overall, I think the presentations went pretty well. The first and last ones weren't that great. The first judge didn't let me present either. She stopped me at my introduction and tried to find out everything about my project's background and field, which I wasn't prepared to present. I mean, I prepared background material to present, but not at the level of detail she was getting into. For example, it seemed like she wanted me to go into multiple sequence alignments and neighbor joining algorithms and I really wasn't prepared to present that. Plus, I wanted to present my own project, which I didn't really have time to do in the end. The last judge and I seemed to have had a miscommunication that never got resolved. He insisted that one of my graphs misrepresented my data when it didn't and then left thinking that I didn't know anything about file I/O. He didn't really let me finish presenting either. I feel bad whenever I think about it. I wish I could have presented better to him. I feel worse because he made a point of noting my university and saying that it had a good CS department and then left thinking I didn't know anything so basic. :/

Apart from that, the rest of the presentations went pretty well. I wasn't nervous at all. I focused my presentation a lot on my background/introduction part and results part. I only explained the techniques and other technical details in depth to those people who seemed interested in it or asked about it. It worked fine for me, I think.

One of the CS REU students won third place in the poster session, which was pretty cool. I think I missed his presentation during the brown bag lunch (the poster session only for CS REU umbrella people) though. I missed one or two of them. I think we can put the last two poster sessions on our resumes too. My roommate mentioned that we can at least put one of them on.

Final Thoughts

It still doesn't feel like the program has ended. I haven't even started packing yet. I'm excited to go home and get out of this heat and get some home food, but it hasn't sunk in that I'm going to be doing that in a few days. I hope the airplane ride is smooth this time. Travel in and out of College Station has been slightly eventful each time I've done it (not that I've done it many times). There's still so much left to do though. What I did was just the tip of the iceberg. I barely even started the querying part. I don't even know if my index will be useful at this point. Some of my results seemed to indicate it won't. Unfortunately, I didn't really have much time to analyze them. I wish I had more time to work on my project. I wish I had known that it would be a good idea to plan out the experiments early in the game, so that I would have been more focused and organized in how I spent my time. I just focused on creating a fast algorithm without paying enough attention to how the product of that algorithm would/should be used. ... I don't know.

I'm sorry this journal ended up being so inconsistent in its organization. I tried but, well, that's just how things turned out.

I enjoyed this summer. It was a completely new experience for me and it was fun. I learned a lot through it as well, both academically and non-academically.
