Weeks 0/1 (June 23 - July 2):
I've just been doing background reading to catch up to Emily and Heidy, mostly about the different strategies used in machine learning. We are focusing on support vector machines, so those are the most important to understand. It was also very nice to go out to lunch with the computer science department and researchers on my first day (the timing was a coincidence)! They took us to a Japanese place not far from the lab, and I got to meet everyone and get to know them in a context other than the lab.
So far, I've read:
-An article that Carla co-authored in American Scientist on the applications of data mining (Knowledge Discovery and Data Mining, Jan-Feb 1999)
-Chapters 1, 3, and 14 from Machine Learning (by Mitchell). Chapter one introduced the essential concepts behind machine learning, i.e. using a sample of labeled data to teach a computer how to recognize and classify new data. Chapter three explained decision trees and the danger of overfitting a tree, which hurts classification of unseen data.
-Two papers explaining the Bayesian approach to information retrieval and text classification. Essentially, they introduced and contrasted the two main variations of the Bayesian method in text classification--the model that uses the binary count of a word (simply whether it appears at all in a document) and the model that uses the actual count (how many times it occurs; sometimes this is normalized). A small sketch contrasting the two models appears after this list.
I used Bayes' method to build a very limited search engine in Information Retrieval at Oxford, so I am familiar with the bag-of-words approach and the representation of documents as vectors, but these papers were nice refreshers.
-Two papers on support vector machines, which we'll be using to classify the websites
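Since the two Bayesian models come up again later, here is a minimal toy sketch of the contrast, using scikit-learn (my choice of library here, not something from the papers) and made-up documents:

```python
# A toy contrast of the two Bayesian models described above, using
# scikit-learn. BernoulliNB looks only at whether a word appears;
# MultinomialNB uses the actual counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["she is a professor of computer science",
        "he teaches algorithms and he studies graphs",
        "her research interests include databases",
        "his lab works on compilers"]
labels = ["female", "male", "female", "male"]

# Binary model: a word either occurs in a document or it does not.
binary_features = CountVectorizer(binary=True).fit_transform(docs)
bernoulli = BernoulliNB().fit(binary_features, labels)

# Count model: how many times each word occurs (sometimes normalized).
count_features = CountVectorizer().fit_transform(docs)
multinomial = MultinomialNB().fit(count_features, labels)
```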
Week Two (July 4-9)
Well, a lot has happened in these four days... (plus, I got to see my family for the Fourth of July!) Over the week, I've mostly been working on a stemmer so that our vocabulary isn't quite so large (all the papers on SVMs use stemmers). I spent way too much time trying to make Weka's stemmer work--it turns out that I didn't have the right permissions, or something--so I implemented a version of the Porter stemmer instead. Linguistically, it's interesting to find the words for which the stemmer will not work--for example, 'inning' would be stemmed to 'in', even though the two are hardly related semantically. I've also added in some other exceptions to fit the problem we're handling (male/female classification), so that any word ending in 'men' is changed to end in 'man'. I also stem all forms of each pronoun to the nominative, which I think will help us.
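For illustration, here is a minimal sketch of the kind of wrapper I describe above, built on NLTK's off-the-shelf Porter stemmer; the exception lists are illustrative rather than the exact ones in my code:

```python
# A minimal sketch of the wrapper described above: NLTK's Porter stemmer plus
# hand-written exceptions for the male/female problem. The exception lists
# here are illustrative, not the exact ones in my code.
from nltk.stem import PorterStemmer

_porter = PorterStemmer()
_no_stem = {"inning"}                      # words the Porter rules would mangle
_nominative = {"him": "he", "his": "he",   # collapse pronoun forms to nominative
               "her": "she", "hers": "she"}

def stem(word):
    word = word.lower()
    if word in _no_stem:
        return word
    if word in _nominative:
        return _nominative[word]
    if word.endswith("men"):               # e.g. 'women' -> 'woman'
        return word[:-3] + "man"
    return _porter.stem(word)

print(stem("Her"), stem("women"), stem("inning"))   # she woman inning
```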
Another issue that we've had to deal with is that the data is very uneven--there are approximately five male faculty pages for every female page. Currently, we're just replicating the female data so that the distribution is more even, but it would be great to have the SVM train itself well enough to deal with such a skewed distribution.
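Here is a rough sketch (with toy data) of the kind of replication I mean: oversample the minority class with replacement until the two classes are even.

```python
# A rough sketch (toy data) of the rebalancing described above: replicate the
# minority-class pages, sampling with replacement, until the classes are even.
import random

def balance_by_replication(minority, majority, seed=0):
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra

# Toy example: roughly five male pages for every female page.
male_pages = ["male page %d" % i for i in range(50)]
female_pages = ["female page %d" % i for i in range(10)]
female_pages = balance_by_replication(female_pages, male_pages)
assert len(female_pages) == len(male_pages)
```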
Week Three (July 12-16)
Emily, Heidy, and I met with Andrea for code reviews for each of us (in a cafe! It was refreshing and relaxing to have a meeting away from computers). I've been making slight modifications to the stemmer all week and am looking at feature selection--i.e., which words are the most important for distinguishing male and female pages. 'He' and 'she' are not at all helpful for this, as it turns out: more male webpages use 'she' than female pages do. How discouraging.
I'm wondering now whether our bag-of-words approach is sufficient for the problem. If pronouns and names don't really work to distinguish male pages from female pages, what will? English isn't really set up as a gendered language, and it seems like now we just have to assume that women have a different enough vocabulary from men--and I'm not so sure about that. But we have been getting accuracy in the mid-90s, so maybe it is true?
Another thought about the stemmer: currently, I'm stemming 'her' and 'hers' to 'she', and similarly for 'he', 'his', and 'him'. I did this because I thought it would help bolster the count of the pronouns. But what if having pronouns that are not the grammatical subject (i.e. 'her', 'hers', etc.) is messing us up? Since most biography pages talk about the person they belong to, it would seem that the subject of each sentence, if it is a person, is likely the page's owner, whereas a pronoun in object position could more easily refer to anyone.
Outside work, life has been wonderfully lazy. Heidy and I go into Boston to buy iced bubble tea sometimes, which is the perfect drink for a summer like this. We've also discovered that we have similar tastes in authors, so we have a trip to the bookstore planned... yay!
Week Four (July 19-23)
Emily's webcrawler produced a lot of new universities, so all three of us have been doing a lot of labeling. I did Princeton, Tufts, and UCLA. Most of the data seems very good--that is, most of the pages belonging to faculty were written in the third person and were easier for us to label by looking only at the text (as opposed to relying on a picture or the layout of the website). We're hoping that the better data will mean that the words we previously thought would be helpful (like 'he' and 'she') will actually be very useful now.
We've also begun to think about tackling the initial problem of whether any given webpage belongs to a computer science faculty member. This distinction does not seem to be as well-defined as the male-female problem, as the CS-faculty vs. everything else problem can be broken down many ways (for example: person vs. non-person, cs-person vs. non-cs person, cs-student vs. cs-faculty, et cetera). Heidy and I don't think that the SVM will perform as well for this issue, so we will probably need to use active learning to help us solve the problem.
We had a meeting to discuss our ideas for active learning. Mine was to take advantage of the layout of the page: run the classifier separately on the top half and the bottom half (with the idea that the top half will probably be better for classification), and if the SVM classifies the two halves differently, that means we should look at the webpage ourselves. My decision to split the webpage into halves is somewhat arbitrary--I really just want to take advantage of the layout of the page, which seems to be what humans use to classify pages--so I'm not sure how well this will work or how simple it will be to implement.
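To make the idea concrete, here is a rough sketch of the disagreement check, with a scikit-learn SVM pipeline standing in for our actual classifier and toy pages as data:

```python
# A rough sketch of the half-page disagreement idea: train an SVM on whole
# pages, classify the top and bottom halves of each unlabeled page separately,
# and queue any disagreements for a human to label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

labeled_pages = ["professor of computer science research publications teaching",
                 "campus dining menus library hours parking information"]
labels = ["faculty", "other"]
clf = make_pipeline(CountVectorizer(), LinearSVC()).fit(labeled_pages, labels)

def needs_human_label(page_text):
    """Classify the top and bottom halves separately; flag disagreements."""
    words = page_text.split()
    top = " ".join(words[:len(words) // 2])
    bottom = " ".join(words[len(words) // 2:])
    return clf.predict([top])[0] != clf.predict([bottom])[0]

unlabeled = ["associate professor publications courses student clubs parking maps"]
to_review = [p for p in unlabeled if needs_human_label(p)]
```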
I've also been considering the impact of names on the male-female classifier. Although it doesn't seem like we need to improve it much more (one of the big tests we're running is consistently getting 99% correct, with 10-fold cross-validation!), it could conceivably help to identify male and female names, and this is sort of where my idea for the top half/bottom half started. Most personal websites have the owner's name rather prominently at the top of the page, and it is usually the name occurring the most. It would seem silly not to take advantage of this if we can, but there are actually quite a few roadblocks here: foreign and ambiguous names, unseen names, nicknames (e.g. Alex), and dropdown menus of the names of all faculty members. The dropdown menus are particularly annoying, as there is no way for my parser and stemmer to tell that the long list of random names used to be a dropdown menu, and Emily says it's hard to remove this. So, for now I'm not going to do anything about names, but I think that if we end up having to improve the male/female classifier more, I will try to figure this out.
Week Five (July 26-30):
There's a lot going on this week!
First of all, because we got 99% last week for the male/female classifier (and 98% when it was run on an unseen data set!), we are moving to work on the faculty/non-faculty classifier now. We are not sure if the same tactic that worked for the male-female classifier (removing words that appear fairly evenly between the two classifications) will work for the faculty/non-faculty problem, but we are trying it now. Since the vocabulary is so large, it will take days to run....
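For illustration, here is one way such a filter could look--this is a sketch of the general idea, not our exact heuristic: keep only words whose document frequency differs noticeably between the two classes.

```python
# A sketch of the general idea above (not our exact heuristic): drop any word
# whose document frequency is nearly the same in both classes, since it
# cannot help the SVM separate them.
from collections import Counter

def class_doc_freq(docs):
    df = Counter()
    for d in docs:
        df.update(set(d.lower().split()))
    return df

def uneven_words(class_a_docs, class_b_docs, min_ratio=2.0):
    df_a, df_b = class_doc_freq(class_a_docs), class_doc_freq(class_b_docs)
    rate_a = {w: df_a[w] / len(class_a_docs) for w in df_a}
    rate_b = {w: df_b[w] / len(class_b_docs) for w in df_b}
    keep = set()
    for w in set(rate_a) | set(rate_b):
        a = rate_a.get(w, 1e-6)              # small epsilon for unseen words
        b = rate_b.get(w, 1e-6)
        if max(a / b, b / a) >= min_ratio:   # appears unevenly -> keep it
            keep.add(w)
    return keep

vocab = uneven_words(["research grant lab students"],
                     ["course syllabus exam students"])
```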
Now, however, we're facing a little bit of a problem. Heidy and I have been labeling a website as faculty only if it belongs to someone who teaches and either is associated with the computer science department or whose research interests clearly include a field of computer science. Emily has been including all computer science people, regardless of whether they teach--that is, she includes students and researchers as well. Retrospectively, I think this is the way it should have been, but now we have to go back and fix some of the ones we classified....but maybe this will improve our faculty/non-faculty results. (Right now, they're around 80%, which is not great).
I am also going to try a new approach for creating the faculty vocabulary list (feature set)--Andrea mentioned, and I found, some great papers on the effect of bigrams on classifiers. One of the papers in particular seems extremely helpful, as one of its data sets, webkb, is similar to ours (it deals with universities and classifying students/faculty/etc). There are three main points of the paper that I think are very helpful: 1) they select their features very aggressively, and their results show that this works best; 2) their feature set is a mix of single words and bigrams; and 3) they select bigrams based not only on relevance but on redundancy with the words that compose them.
Before Caitlin and Dani (two other DREU girls working on another project at Tufts) left, Heidy and I went with them to the fine arts museum in Boston. I'd been to the Met dozens of times, but the fine arts museum was really something else. I'm so glad we went!
Week Six (August 2-6):
Finally, I got the bigram tests running. I tested different combinations of bigrams and unigrams up to about 5,000 features total. Because I cut down the number of documents I was testing/training on (about half the original set), the tests did not take very long to run. However, the bigram-enhanced feature set does not seem to offer any huge improvement over simple unigrams. The classifier improved from 69% accuracy to 75% accuracy (69% was without bigrams, using the KL-divergence to rank the unigrams). Although this is a significant improvement, 75% is no better than the classifier has been doing when trained on other feature sets. So, bigrams do improve accuracy over the KL-ranked unigram baseline, but perhaps KL-divergence is not the best way to rank the unigrams in the first place.
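For reference, here is a sketch of one standard way to rank unigrams by KL divergence--comparing the class distribution among documents containing a word to the overall class distribution. This is not necessarily the paper's exact equation, and the data here is toy data:

```python
# A sketch of one standard KL-divergence ranking of unigrams: score each word
# by how far the class distribution of documents containing it diverges from
# the overall class distribution, then keep the top-scoring words.
import math
from collections import Counter, defaultdict

def kl_scores(docs, labels):
    prior = Counter(labels)
    n = len(docs)
    word_class = defaultdict(Counter)        # word -> class -> document count
    for d, y in zip(docs, labels):
        for w in set(d.lower().split()):
            word_class[w][y] += 1
    scores = {}
    for w, by_class in word_class.items():
        total = sum(by_class.values())
        score = 0.0
        for c, prior_count in prior.items():
            p_c_given_w = by_class.get(c, 0) / total
            p_c = prior_count / n
            if p_c_given_w > 0:
                score += p_c_given_w * math.log(p_c_given_w / p_c)
        scores[w] = score
    return scores

docs = ["she studies databases", "he teaches compilers", "she leads the theory group"]
labels = ["female", "male", "female"]
scores = kl_scores(docs, labels)
top_unigrams = sorted(scores, key=scores.get, reverse=True)[:5000]
```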
There is also another problem with the bigram method. The paper (Boulis 2005) proposes and motivates a method of ranking bigrams based on the KL-divergence scores of both the bigram and its unigrams, but the table captions show that the reported results do not actually reflect feature sets chosen with this metric: they used the plain KL-divergence scores to rank the bigrams, which they clearly say is not the main contribution of the paper, since that had already been proposed elsewhere. When I implemented the bigram method, I used the equation they proposed--so maybe I shouldn't be surprised that it didn't work very well, considering they didn't even show results for it!
We have also been thinking about how to focus our paper, and one of our contributions seems to be the fact that we split up the problem of identifying female computer science faculty into a hierarchy of two easier problems. I've been reading papers about using hierarchies of classifiers and, surprisingly, there seems to be very little about using a hierarchy of classifiers for a minority-class problem. There is a lot of literature about using a hierarchy of classifiers when there is a hierarchy of classifications, and a good amount about what happens when some of the categories in the hierarchy are very sparsely represented, but I did not find any papers that used a hierarchy of SVMs for text-based classification of a minority class. So, I think that if we can quantify exactly how to construct the optimal hierarchy for classifying a minority class, it will be a huge contribution to the field. (That being said, generalizing the construction of such a hierarchy will be very difficult, especially with only a week left.)
Week Seven (August 9-13, and beyond):
Finally, we've finished the paper! Our last week in the lab was filled with refining the focus of our paper (since we had so much data to analyze and include) and running some last tests, to make sure that every single test is on the same set of documents, the test/train method is the same, etc. We've decided that the focus of the paper will essentially be feature selection, but we will not include my work on whether using bigrams was effective (this is just as well, considering how long it took me to compile the bigram vocabulary--it just isn't practical with a problem like ours). We're focusing on the effectiveness of the known method to pare down features--tf-idf--and ours, use-ratio.
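For the curious, here is a minimal sketch of the standard tf-idf side of that comparison, using scikit-learn as a stand-in (our use-ratio heuristic isn't reproduced here): score each term by its largest tf-idf weight in the corpus and keep only the top-scoring terms.

```python
# A minimal sketch of tf-idf-based feature paring (illustrative data; not our
# actual feature-selection code): rank terms by their highest tf-idf weight
# anywhere in the corpus and keep the top ones.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["she is an associate professor of computer science",
        "course listings registrar academic calendar",
        "his research group studies programming languages"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)                       # documents x terms
term_scores = np.asarray(tfidf.max(axis=0).todense()).ravel()
terms = np.array(vec.get_feature_names_out())
top_terms = terms[np.argsort(term_scores)[::-1][:1000]]
```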
We also talk about our choice to use two different stages of classification (i.e., a hierarchy) and whether it is possible to generalize a way to choose a hierarchy of classifications, given any problem. We did not come up with a generalization--though we were not specifically looking for one--and concluded that our choice to use a hierarchy was crucial to our results, because the classifier at the second stage (male vs. female web pages) had much more well-defined data than the classifier at the first stage (computer science webpages vs. everything else; "everything else" is the key to why this stage didn't work well). We are also happy to say that our heuristic outperformed the standard one in the second stage of classification! So, we expect that if we were to properly create a hierarchy--in which each step is as well-defined as our second step was--our heuristic could yield significant improvement over the standard approach.
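As a toy illustration of the hierarchy (illustrative data and pipelines, not our actual system): stage one filters computer-science pages from everything else, and stage two runs the male/female classifier only on what survives.

```python
# A toy two-stage hierarchy of SVM classifiers, sketching the structure
# described above with made-up training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stage1 = make_pipeline(CountVectorizer(), LinearSVC()).fit(
    ["professor algorithms publications research", "dining hall menu and hours"],
    ["cs", "other"])
stage2 = make_pipeline(CountVectorizer(), LinearSVC()).fit(
    ["she leads the systems lab", "he teaches compilers"],
    ["female", "male"])

def classify(page):
    if stage1.predict([page])[0] != "cs":   # stage one: CS page vs. everything else
        return "other"
    return stage2.predict([page])[0]        # stage two: male vs. female

print(classify("her research spans databases and theory"))
```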
On another note, it has also been an adventure learning how to write a technical paper! (I wrote the approach and the introduction).
Finally, a massive thank you to DREU for funding us this summer, and to Carla and Andrea for being such wonderful mentors (and for taking us out to lunch the last week--it was delicious, and a nice way to preemptively say goodbye before Andrea left for Alaska).
It's been a great experience, and I really enjoyed it!
Look at a copy of our final paper!
Go back to my homepage.