Dear Diary..

Weekly Journal

Week 1: May 22 - May 26

I spent my first couple of days reading as much as I could to understand the fundamentals of the project. I also jumped into my first MATLAB task: assume that every structure is fully represented by a vector of d dimensions. In the binary case, an element of 0 means the corresponding feature is missing from the structure, and 1 means it is present. My first assignment involved creating such feature vectors both from real data and from simple distributions. I then implemented three similarity algorithms and ran them on those feature vectors. On the theoretical side, Prof Gupta gave me a bunch of interesting (and hard!) books on the basics of information theory, entropy, and similarity. The mathematical definitions of these terms are often quite different from their English meanings, and I am still in the process of trying to fully understand them.

Week 2: May 29 - June 2

I began writing MATLAB simulations for two purposes:

Generate data from a distribution: In a simple example, if the chosen distribution is an exponential distribution, you want to generate a vector (an array) A of size n, where each element A(i) is drawn from that distribution (A(i) ~ exp). If you had many vectors generated from the same distribution, you might think of them as being from the same 'class', so a classifier should be able to tell that vectors generated from a particular distribution are similar. Likewise, if you generate A ~ exp and B ~ 1 - exp, where exp is a value between 0 and 1, these 2 vectors are completely different, and a classifier should be able to see that.

Similarity metrics: Many people have come up with neat ways to compare the similarity between different kinds of data. The simplest example is to think of 2 points in the x-y plane. We can say the points are similar if the (Euclidean) distance between them is small (for some chosen definition of small). But what if it doesn't make sense to find the distance between 2 points in the usual way, say, if the data actually represents the features of a protein? That's where other forms of similarity metrics come in. The similarity metrics I dabbled with were the simple Hamming distance, counting, one devised in a paper by Lin, and one devised by the group here.
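To make the two purposes above concrete, here is a minimal sketch in Python (not the MATLAB used for the project itself). It generates binary feature vectors for two made-up classes and compares them with the Hamming distance and a simple counting measure. The feature probabilities, the vector length, and the exact form of "counting" (number of features present in both vectors) are assumptions chosen for illustration; the Lin measure and the group's own measure are not reproduced here.

```python
import random


def binary_features(d, p_present, rng):
    """Return d binary features, each present (1) with probability p_present."""
    return [1 if rng.random() < p_present else 0 for _ in range(d)]


def hamming_distance(a, b):
    """Number of positions where the two vectors disagree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def shared_feature_count(a, b):
    """Counting similarity (assumed form): features present in both vectors."""
    return sum(1 for x, y in zip(a, b) if x == 1 and y == 1)


rng = random.Random(0)               # fixed seed so runs are repeatable
D = 200                              # vector length (assumed)
a1 = binary_features(D, 0.9, rng)    # class A: features mostly present
a2 = binary_features(D, 0.9, rng)
b1 = binary_features(D, 0.1, rng)    # class B: features mostly absent

print("within class:", hamming_distance(a1, a2), shared_feature_count(a1, a2))
print("across class:", hamming_distance(a1, b1), shared_feature_count(a1, b1))
```

Two vectors from the same class should come out much closer in Hamming distance than two vectors from different classes, which is exactly the kind of signal a similarity-based classifier can pick up on.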

I have used MATLAB in classes before, but I started to get more familiar with it (note to self: the first index is always 1, not 0!). Toward the end of the week, Luca said we should submit a paper to NIPS, a cool machine learning conference. So we'll be working a lot on that in the next week.
Oh, I've also been attending some technical talks. One was about creating a text-free user interface for illiterate people, which I thought was pretty interesting.
Coolest thing I did this week: Sit in on a Math jam session with Prof. Gupta and Luca - can't say I was a supremely active participant, but I did try to follow along. It was fun! No, really!

Week 3: June 5 - June 9

This week was dedicated to writing the paper that we will be submitting to NIPS. I mainly worked on simulations. I also learnt a lot about the process of paper writing. Reading (and understanding) technical papers is hard, but very interesting! In the meantime, I have also been reading up on some important concepts, such as Support Vector Machines in order to understand some work on it that I might have to deal with soon.
Coolest thing I learnt this week: Well, I actually already knew this before because I took a Signals and Systems class in school, but it's always fascinating to read about Shannon's theorem on communication (it has a fancier official name: the Nyquist-Shannon sampling theorem). It goes something like this: given an analog signal (like speech), if you sample the signal fast enough (at more than twice the highest frequency in the signal), you can completely recreate the signal from its samples. That is, even though it may seem like sampling involves losing some information (because you're only looking at 1 in every n sets of data), you actually don't lose anything at all! Absolutely riveting, isn't it?
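A small sketch of the idea, in Python: sample a 1 Hz sine wave at 8 Hz (comfortably more than twice its frequency) and rebuild its value at a time that falls between two samples using sinc interpolation. The frequencies, the number of samples, and the test time are all assumed values. The theorem needs infinitely many samples for an exact rebuild, so with a finite window the result is only a close approximation.

```python
import math

F_SIGNAL = 1.0           # signal frequency in Hz (assumed)
F_SAMPLE = 8.0           # sampling rate in Hz, more than 2 * F_SIGNAL
T = 1.0 / F_SAMPLE       # time between samples
N = 400                  # samples are taken at n = -N .. N


def signal(t):
    """The 'analog' signal: a pure sine wave."""
    return math.sin(2 * math.pi * F_SIGNAL * t)


def sinc(u):
    """Normalized sinc, sin(pi u) / (pi u), with sinc(0) = 1."""
    return 1.0 if u == 0 else math.sin(math.pi * u) / (math.pi * u)


# The only information we keep: the signal's value at the sample times.
samples = {n: signal(n * T) for n in range(-N, N + 1)}


def reconstruct(t):
    """Sinc interpolation: rebuild the signal at time t from the samples."""
    return sum(x * sinc(t / T - n) for n, x in samples.items())


t = 0.3                  # a time that falls between two sample instants
print("true value   :", signal(t))
print("reconstructed:", reconstruct(t))
```

The two printed values should agree closely, even though t = 0.3 was never sampled.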

Week 4: June 11 - June 16

This week I worked on investigating some work done on support vector machines by a German group. Luckily for me, the group was into writing clean, neat, heavily-commented, and easily usable code (a lesson for all us sloppy coders out there =)). So actually running their simulations was easy, but I spent some time understanding what exactly they are doing. Our ultimate goal is to compare our classifier with theirs. I might as well talk a bit about support vector machines (SVMs) since I spent so much of this week on them.
SVMs
Here's how I think of it: Say you have a cake, half of which has been topped with almond flakes, and the other half with sprinkles. Unfortunately, some of your friends are allergic to almonds, and the rest detest sprinkles because of their artificial colouring. You want to cut the cake so that the part with almonds is a separate piece from the part with the sprinkles. You can think of the knife forming a separating plane; on one side is the almond part, and on the other, the sprinkle part. This separating plane is called a hyperplane, and it has successfully classified every topping as either an almond or a sprinkle. Things are simple when a hyperplane can actually separate stuff. But what if the almonds and sprinkles are spread out so that the separating boundary is curvy? SVMs are classifiers that deal with creating a hyperplane (with an error margin) that can classify troublesome almonds/sprinkles well. This method of classifying is markedly different from the one we deal with in our work. But it would be cool to compare our method to theirs (hopefully we'll do at least as well).
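The cake picture can be sketched in a few lines of Python. This is not an SVM trainer: the hyperplane (the knife cut) is picked by hand, and the code only shows how a given hyperplane classifies toppings and how wide its margin is. The topping coordinates and the cut at x = 5 are made-up values.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b: positive on one side of the cut, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Label a topping by which side of the hyperplane it lies on."""
    return "almond" if decision_value(w, b, x) >= 0 else "sprinkle"


def margin(w, b, points):
    """Distance from the hyperplane to the closest point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(decision_value(w, b, p)) / norm for p in points)


# A hand-picked cut: the vertical line x = 5 (an assumption, not a trained SVM).
w, b = [1.0, 0.0], -5.0

almonds = [(7.0, 2.0), (8.5, 6.0), (6.0, 9.0)]     # made-up positions on the cake
sprinkles = [(1.0, 3.0), (3.5, 7.0), (4.0, 1.0)]

for topping in almonds + sprinkles:
    print(topping, "->", classify(w, b, topping))
print("margin of this cut:", margin(w, b, almonds + sprinkles))
```

An SVM's job is to search over all possible cuts for the one whose margin is as large as possible; this sketch only evaluates one candidate cut.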
Another thing I've been working on this week is finding datasets so that we can test our classifier out on real data. There are lots of datasets out there but we want those where we can figure out the domain of similarity, because our method needs to know that. This brings me to the coolest thing I've been working on this week: figuring out the domain of the similarity values generated by a similarity metric we call the Lin method. Math is cool! Unfortunately, I'm not that great at it, so I've been bugging Luca with questions.
Coolest thing I read this week: "Almost everything is almost equally probable": from one of the books Maya gave me to read on Information Theory. I felt a little dizzy after reading it, hope you did too ;)

Week 5: June 19 - June 23

Halfway done! Time really is soaring past me. More work towards the journal paper this week as well. I tried to use a webpage dataset. It's basically a long list of university websites, each manually classified as a 'student', 'faculty', or 'staff' page (by some poor undergrad, I bet ;). The only info our classifier is given is a measure of similarity between these pages. The similarity of pages a and b is 0 if there is no link from a to b or from b to a, 0.5 if there's a link in one direction only, and 1 if there are links both ways. Our classifier didn't do a great job classifying the data, sadly. So I will be investigating other datasets that have similarity metrics that provide slightly more information. Random note to self: stop hating on MATLAB! As annoying as it might be, it's really powerful.
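The link-based similarity is simple enough to write down directly. Here is a small Python sketch; the page names and the link table are hypothetical, and only the 0 / 0.5 / 1 rule comes from the dataset description.

```python
def link_similarity(links, a, b):
    """Return 0 for no links between a and b, 0.5 for one direction, 1 for both."""
    a_to_b = b in links.get(a, set())
    b_to_a = a in links.get(b, set())
    return (a_to_b + b_to_a) / 2


# Hypothetical pages: each key links out to the pages in its set.
links = {
    "alice_student": {"bob_faculty"},
    "bob_faculty": {"alice_student", "carol_staff"},
    "carol_staff": set(),
}

print(link_similarity(links, "alice_student", "bob_faculty"))   # links both ways
print(link_similarity(links, "bob_faculty", "carol_staff"))     # one link only
print(link_similarity(links, "alice_student", "carol_staff"))   # no links
```

With only three possible values, this measure carries very little information per pair of pages, which may be part of why a similarity-based classifier struggles with it.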
Coolest thing I read this week: "We are drowning in information and starving for knowledge" - Rutherford D. Roger. Quite a depressing way to start a book, don't you think? This quote is at the beginning of the statistics book that I've been referring to, "The Elements of Statistical Learning". Oh, and more random advice: take a stats class! weeee for weeekend!!
Second coolest thing (I know! It's been an exciting week!) that you Must check out: Erika's website. Erika is a DMP intern here at UW as well. Together, we plan to conquer the world and reign supreme.

Week 6: June 26 - June 30

More dataset work this week. The protein dataset that I spent most of my time on turned out to be fairly uninteresting; we do pretty well on it, though not as well as the German group's support vector machines. I ran into a few MATLAB bugs (grr debugging), so that took up some time as well. This week I also got to meet with other people who are working on the project from other groups. Our overall project is an interdisciplinary one, and draws from studies in Psychology, which I think is pretty cool.
Happy July 4th Weekend!
Oh, and check out Mayra's website (another DMP intern). Yes, she got webspace on cs.washington.edu. No, I didn't get space on ee.washington.edu. =(

Week 7: July 3 - July 7

Long weekends make for short weeks. I have been doing a whole bunch of simulations, and we found something that we might want to improve in our classifier. So I'm working on thinking of how to improve the classifier by trying out different methods, and comparing it to other classifiers. The frustrating part of working with datasets/experiments: a lot of time is spent in parsing and formatting data. The cool part: Concrete Results! I have to try and get better at predicting results instead of just hacking away at them. I have simultaneously also been documenting our experiments' results in a presentable (journal includable) manner. LaTeX is extremely pretty - I wish WinEdt (the LaTeX editor that I use) was freeware. If anyone knows of a free LaTeX editor, let me know please!

Week 8: July 10 - July 14

More writing, experiments, and simulation running. I used to wonder, when I read papers, why authors chose to write their paper in (what seemed to me to be) the most confusing, non-intuitive manner. Now that I am writing bits and pieces of a paper, I am beginning to realize how hard it is to explain a thought very clearly in as few words as possible. Especially with experiments, it is very important to state the exact conditions under which the experiments were run, so that they are easily reproducible by other interested groups. On Friday, Maya gave me a very basic (but Very Helpful) introduction to some of the most important classifiers out there. It was very interesting and yet intimidating, because I realized how far away I was from mastering, or even understanding, any of them. I'll try and reproduce what she told me. Basically, you can think of there being three broad categories of classifiers:
1. Modelling classes as distributions: Our classifier comes under this category, along with linear discriminant analysis, quadratic discriminant analysis etc. The basic idea is that these classifiers assume that every class is a Gaussian (normal) distribution with an unknown mean/covariance, and attempt to find the best mean and covariance for the distributions. Once the classes have been modelled, a test point is classified based on the probability that it could have come from each of the distributions.
2. Find a discriminator: SVMs and neural nets fall under this category. These classifiers aim to find a separator (like a hyperplane) that can divide the space into 2 parts: one for class 1 and the other for class 2. SVMs find the hyperplane with the largest margin in a higher-dimensional space (not the space of the samples themselves). We talked a little more about neural nets, but a lot of it was math that didn't really sink in well. The idea is to build a neural net and then optimize the parameters (a 'black art'). Yeah, not a good explanation =) I'll try reading more on it to get a more intuitive idea of what neural nets are. Erika said she worked with neural nets in her AI class at school - I never really thought of taking the AI class at Berkeley, mainly because most people said it wasn't an extremely helpful class. I guess I had a misconstrued idea of what 'Artificial Intelligence' means.
3. Non-parametric methods: The famous Nearest Neighbour method falls in this category. There's no model building involved here. In the case of one nearest neighbour, the idea is very simple. The test sample is said to be in the same class as the training sample that it is closest to.
And that was my very short crashcourse on classifiers.
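Of the three categories, the one-nearest-neighbour rule is short enough to sketch in full. The Python below is a minimal illustration, assuming plain Euclidean distance and a made-up two-class training set; it is not the classifier used in the project.

```python
import math


def nearest_neighbour(train, x):
    """Return the label of the training sample closest to x (Euclidean distance)."""
    _, label = min(train, key=lambda sample: math.dist(sample[0], x))
    return label


# Made-up training data: (point, class label) pairs.
train = [
    ((0.0, 0.0), "class 1"),
    ((1.0, 0.5), "class 1"),
    ((5.0, 5.0), "class 2"),
    ((6.0, 4.5), "class 2"),
]

print(nearest_neighbour(train, (0.8, 0.2)))   # closest to a class 1 sample
print(nearest_neighbour(train, (5.5, 5.2)))   # closest to a class 2 sample
```

There is no training step at all: the "model" is just the stored samples, which is what makes the method non-parametric.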

Week 9: July 17 - July 21

This week was fairly frustrating. All this time, I had been running the support vector machine with its default parameter values. Turns out that's not a good idea, because the PSVM actually needs its parameters cross-validated to suit the data. That basically meant rerunning a lot of code. In the meantime, I began writing my final report, a more arduous task than one would expect at first. Got a fair chunk of it done, and I should be polishing stuff up next week. Next week being my LAST WEEK! Gosh, I can hardly believe 10 weeks have just flown by. woosh! I feel like I've learnt so much but have only just peeked into the vast universe of machine learning. Oh, I also wrote some more parts of the 'Experiments' section of the journal paper. And I also helped Santhosh (a PhD student in Applied Math who works with Maya) get acquainted with my code. This made me realize how messy my code was. I plan to spend a sizeable amount of time next week cleaning stuff up.
Interesting Anecdote of the week (formerly known as 'coolest thing I did/saw/read this week' =)): I have been attending graduate student talks, which graduate students are required to give in order to graduate. This one was about security in group environments. The PhD student explained an algorithm that could be used to keep medical records confidential, and to provide secure communication between cars and service stations on a highway. It was all fairly interesting. Except, I got a quick glimpse into advisor-grad student relations when the advisor started bombarding his student with questions. She handled herself pretty well, but I dreaded to think of myself in a similar position of interrogation. I guess 5-6 years of intense research gives you confidence in your field of work. Moral of the story: Always go to breakfast talks! There are always yummy pastries (and as Erika couldn't stop pointing out, *two* kinds of orange juice :) ).
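For anyone wondering what cross-validating a parameter looks like, here is a rough Python sketch. It does not use the PSVM: a k-nearest-neighbour classifier stands in for it, with k playing the role of the parameter being tuned. The toy data, the candidate values, and the number of folds are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(data, k, folds):
    """Hold out each fold in turn, train on the rest, and score the held-out part."""
    correct = 0
    for f in range(folds):
        held_out = data[f::folds]
        train = [d for i, d in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(data)


def pick_parameter(data, candidates, folds=5):
    """Return the candidate with the best cross-validated accuracy, plus all scores."""
    scores = {k: cross_validated_accuracy(data, k, folds) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores


# Toy data: two well-separated clusters on a line (made up for illustration).
data = [((i * 0.1, 0.0), "a") for i in range(10)]
data += [((5.0 + i * 0.1, 0.0), "b") for i in range(10)]

best, scores = pick_parameter(data, candidates=[1, 3, 5])
print("scores:", scores)
print("chosen parameter:", best)
```

The point is that the parameter is chosen using data the classifier did not train on, rather than being left at whatever the default happens to be.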

Week 10: July 23 - July 28

***Drumrolls*** And...That's It! My last week at UW is over! Well, technically, I'm still going to be here another week (I'm taking my /gulp/ GREs /gulp/ next week), and will probably work a bit early next week. But wow, it's been such a great summer. Time has just flown by me. I've learnt soo much and actually been able to contribute to a real research project. I will upload my final report very soon (it's almost done) and it will have all the gory details of the project, but I think this journal has traced my learning trajectory pretty well. This week I worked on the journal paper some more and cleaned up almost all of the code, so Santhosh or anyone else should be able to use it fairly easily. I will be working on one more simulation next week. Thank You Prof Gupta, for being a great mentor. A Big thank you to the DMP program for organizing this internship program - it's been a wonderful learning experience. And a Huge shout-out to my awesome office mates, Erika and Will. I'll miss our long wikipedialogues, gossiping and general banter. It's been real =) I might be making more updates next week, but if not, it's over and out from the University of Washington. Cheers!