Info & Homepage
The blog below records my weekly progress during my internship at Tufts.
To return to the main homepage, please click here.
Week 1 (June 3rd - June 10th)
For those who have stumbled upon this blog unaware that it is a weekly journal pertaining to my research, please visit the site below for more information: http://www.eecs.tufts.edu/~hkhlaa0a/
We began our research with a clear goal: to build a system that crawls the web-pages belonging to our target group of graduate students, professors, and researchers in academia. Eager to jump-start our research and undaunted by this seemingly simple initial milestone, we fell blindly into an unforeseen pit of issues and dilemmas. We had overlooked a crucial step of our research process because we were so focused on our final goal.
Web-crawling turned out to be a much more problematic issue than we had anticipated. No member of our team, which included Carla, Andrea, Emily, and me, had any experience with web-crawling or web-based information retrieval. It was clear that the best choice was to find open-source web-crawlers that could help us achieve our first milestone. I stumbled upon a few web-crawlers that ranged from ridiculously complex to "gray-source". Unfortunately, the only web-crawler that seemed at all promising was the "gray-source" crawler4j. While Emily used her Java programming knowledge to attempt to work with this crawler, I began experimenting with the built-in GNU wget utility on Unix.
Regarding my social experience during the first week, the research professors invited all the undergraduate CS students to lunch in Halligan Hall. We all introduced ourselves and described the research we were doing. I found that Tufts University offers its undergraduates many opportunities for research, and it was very interesting to hear about the other students' projects. I was introduced to two other girls who were part of the DREU program, Caitlin and Dani; they were mainly doing Computational Biology research.
Week 2 (June 11th - June 18th)
To my dismay, I discovered that using wget wasn't very simple. wget retrieves information based on the directories allocated on the web servers, and as it turns out, a university's directory hierarchy does not necessarily follow its URL hierarchy. This was problematic since faculty pages were often hosted on dedicated web-servers rather than the Computer Science department's main server. The correct server and directory for the faculty pages varied for each university, making it impossible to spot a pattern for retrieving faculty pages through wget.
Although Emily had better luck retrieving some data with crawler4j, she still ran into many issues with the way the crawler functioned. crawler4j did not handle redirection and appended URLs well, giving us many "404" errors on university servers. Carla directed Emily to Alvin Couch, a web expert who has been helping us through the process of web-crawling, while I began our second step of labeling and splitting the relatively minute amount of data we were able to retrieve.
This week, I began hanging out with the other two girls from DREU, Caitlin and Dani. We ventured to many parts of Boston, including Chinatown, Faneuil Hall, and the Boston Harbor. It was quite an interesting experience, seeing that I had never really been to a major city in the U.S. before; although I lived in Cairo growing up, this was quite a different experience. Chinatown was the most interesting of all the places we visited: I ate dim sum for the first time and got to see many aspects of Chinese culture. A wonderful experience, overall.
Week 3 (June 21st - June 25th)
Stumped by the plethora of unexpected issues, our first two weeks were no doubt a chaotic mess of cluttered planning. As the second week steadily passed, these scattered ideas began to unify harmoniously into the infrastructure that would guide our research.
It was decided that I would begin organizing and splitting the little data we had acquired into the proper categories that would later be used for ARFF files in WEKA. From the data Emily fetched, she created a text file for each web-page that split the given text into a list of the words used. We applied a sample amateur list of stop words in order to eliminate common, uninformative words. After this, I had to manually align the text files with the labels we had made by hand; in the future, we will attempt to automate the alignment process. I then built a vocabulary of all the words used, together with the total number of times each word occurred. Looking at the distribution of words was quite interesting, but we realized that any word used fewer than 10 times will not contribute much to our classification process.
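For the curious, the word-counting step looks roughly like the sketch below. This is an illustrative Python version, not our exact code; the pages/ directory, the tiny stop-word list, and the 10-use cutoff are stand-ins for whatever we actually configured:

```python
from collections import Counter
from pathlib import Path

# A small, illustrative stop-word list; our amateur list was similarly ad hoc.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def page_words(path):
    """Split one fetched page's text file into lowercase words, dropping stop words."""
    text = Path(path).read_text(errors="ignore").lower()
    return [w for w in text.split() if w.isalpha() and w not in STOP_WORDS]

# Global vocabulary: every word mapped to its total number of uses across pages.
vocabulary = Counter()
for page in Path("pages").glob("*.txt"):
    vocabulary.update(page_words(page))

# Words used fewer than 10 times contribute little to classification.
frequent = {word: count for word, count in vocabulary.items() if count >= 10}
```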
I then created the ARFF file needed for each text file and ran it through WEKA. WEKA classified roughly 76% of the pages correctly, but only 15 of the 126 female pages were classified as female. The skewed distribution of males and females in the field is very problematic for this classification process, so here is a list of things that we shall experiment with next week:
- Replicating female data in the ARFF files in order to balance the uneven distribution of genders.
- Attempting to remove noise and irrelevant words from our vocabulary.
- Trying different machine learning algorithms and theorizing about why the results differ.
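For reference, the ARFF files we feed to WEKA have roughly the following shape. The attribute names here are hypothetical placeholders; in reality there is one numeric attribute per vocabulary word:

```
@relation gender_pages

@attribute research numeric
@attribute algorithm numeric
@attribute teaching numeric
@attribute gender {male, female}

@data
12,3,7,male
5,0,2,female
```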
My second lab partner arrived on the last day of this week; although she had not started work yet, Caitlin, Dani, and I took her to the Boston Museum of Fine Arts over the weekend. I spent a few years of my life studying Art History, so I was very excited to see the pieces I had studied. The museum contained a plethora of artwork, too much to be seen in one day, yet I managed to do it. I am a huge fan of Ancient Greek sculpture, so I very much enjoyed the Ancient Greek exhibit. Caitlin and Dani seemed taken with the Ancient Egyptian exhibit, but since I'm Egyptian, I had already seen much of that artwork in Egypt. The most exciting part about the museum trip was being able to see the Durer exhibit. Durer is known as the greatest artist of the Northern Renaissance, and the exhibit featured his collection of etchings. The intricate detail in his prints was absolutely breath-taking.
Week 4 (June 28th - July 2nd)
Lucy Simko, my second lab partner, finally arrived this week! She has just returned from her studies abroad at Oxford University. After discovering that Lucy is particularly well versed in Linguistics and Computational Linguistics, I began brainstorming ideas for reducing the noise scattered throughout our data. Lucy and I discussed the possibilities of different stemming algorithms and stop-word lists for eliminating the useless diction contained within our data. We hypothesized that such a process would help us optimize the accuracy of Support Vector Machines on our data.
After many hours of brainstorming, we decided to follow the general algorithm of the Snowball (Porter) stemmer, which can be found here: http://snowball.tartarus.org/algorithms/porter/stemmer.html
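To give a feel for what a Porter-style stemmer does, here is a tiny demonstration using NLTK's implementation, an off-the-shelf library used purely for illustration (as noted below, we wrote our own stemmer):

```python
from nltk.stem.porter import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()
for word in ["connected", "connection", "connecting", "studies", "studying"]:
    print(word, "->", stemmer.stem(word))
# All of the "connect" variants collapse to the single stem "connect",
# so the classifier sees one feature instead of three.
```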
It was decided that writing our own stemmer would be the most convenient choice, since we'd be able to easily modify the code to fit any future changes. As Lucy began writing the stemmer, I began looking into the optimization of the SVM. At the beginning of the week, our mentor Carla gave us a more in-depth lecture on SVMs so I would be able to understand the parameter options in Weka. The C parameter, which stands for cost, tends to be of the utmost importance to how well the algorithm performs. Carla explained that there is no way to know the best C parameter in advance, as it varies greatly between data-sets, so we decided we needed an automated process for finding the C parameter that optimizes the accuracy of the SVM.
I sought the help of Umaa, a graduate student who is an expert in Weka. She demonstrated how I could scan a range of C values using a bash script in order to determine the most accurate one for our data-set. I successfully wrote the script that would help us determine the most accurate C value for our SVM parameters.
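The script itself is simple in spirit: loop over candidate C values and run Weka's SMO with ten-fold cross-validation for each. Mine was a bash script; the Python sketch below shows the same idea, with the jar and ARFF paths as placeholders:

```python
import subprocess

WEKA_JAR = "weka.jar"      # placeholder path to the Weka install
DATA = "gender.arff"       # placeholder path to our data file

# Sweep the SMO complexity constant C over several orders of magnitude and
# print the ten-fold cross-validation report for each value.
for c in [0.01, 0.1, 1.0, 10.0, 100.0]:
    result = subprocess.run(
        ["java", "-cp", WEKA_JAR, "weka.classifiers.functions.SMO",
         "-C", str(c), "-t", DATA, "-x", "10"],
        capture_output=True, text=True)
    print(f"C = {c}\n{result.stdout}")
```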
In my previous entry, I mentioned the possibility of data replication in order to balance the skewed distribution of genders in the computer science field. When I ran the ARFF data file through the SVM WITHOUT replicated female data, the accuracy was absolutely horrific. I wrote a script in which the female data is replicated 7 times in order to roughly balance the skewed male/female distribution, and I was able to get the accuracy up to roughly 93%. I plan on using the replicated female data until we complete our noise-elimination process.
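The replication itself amounts to oversampling the minority class. A sketch of the idea, assuming the class label is the last attribute on each ARFF data row:

```python
def replicate_female_rows(data_rows, factor=7):
    """Write each female instance `factor` times so the genders roughly balance."""
    balanced = []
    for row in data_rows:
        copies = factor if row.strip().endswith("female") else 1
        balanced.extend([row] * copies)
    return balanced
```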
Also, it has been pouring this entire week, so there weren't many opportunities to explore Boston.
Week 5 (July 5th - July 9th)
Lucy and I began experimenting with different methods to raise the accuracy of our SVM classifier. We read a myriad of research papers discussing how effective stemming + IDF is. IDF (inverse document frequency) is an information-retrieval method which assigns each word a weight using the equation:
idf(w) = log(number of documents / number of documents containing w)
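In code, the weighting takes only a few lines. The sketch below (illustrative Python, not our pipeline) computes the IDF score for every word given one word-list per page:

```python
import math

def idf_scores(documents):
    """documents: a list of word lists, one per web-page."""
    n = len(documents)
    document_frequency = {}
    for doc in documents:
        for word in set(doc):              # count each page at most once per word
            document_frequency[word] = document_frequency.get(word, 0) + 1
    # Rare words receive large weights; ubiquitous words receive weights near zero.
    return {word: math.log(n / df) for word, df in document_frequency.items()}
```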
Our hypothesis was that the IDF method would not be very effective, and this hypothesis was indeed correct: the resulting accuracy was equivalent to the older accuracies, if not lower. We then discussed methods tailored to our particular problem and settled on a "Gender Ratio" method. In this method we kept track of how many times each word is used by males and by females, as well as its total word count, and afterwards calculated this formula:
(|male count - female count| / total word count) x 100
The resulting percentage gave us an indication of how discriminative a word is: a higher percentage entails a skewed distribution of usage between the genders, which is extremely helpful for our specific problem. We used this method to experiment with eliminating attributes whose ratios fell below thresholds ranging from 5-25%, as well as eliminating words used only 1-5 times. I ran a total of 120 tests on the SVM classifier, only to find that eliminating words with a percentage less than 25%, along with words used only once, gives us the highest accuracy. We successfully reached an accuracy of approximately 95.5% and were ecstatic with the results.
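The filtering rule we settled on can be sketched as follows (assuming, as seems natural, that a word's total count is the sum of its male and female counts):

```python
def gender_ratio(male_count, female_count):
    """Percentage skew of a word's usage: 0 means balanced, 100 means one-sided."""
    total = male_count + female_count
    return abs(male_count - female_count) / total * 100

def keep_word(male_count, female_count, min_ratio=25, min_uses=2):
    """Keep only words that are used more than once and are strongly gendered."""
    total = male_count + female_count
    return total >= min_uses and gender_ratio(male_count, female_count) >= min_ratio
```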
Our week mainly consisted of running SVM tests on varying data files. We will begin labeling new data next week.
Dani and I had a magnificent time at the Boston Harbor. After many days of rain, we finally got to experience a beautiful day out on the water. We found many gardens right across the river and mainly relaxed after a long week of work. I definitely plan on going back many times before I leave.
Week 6 (July 12th - July 16th)
As Emily crawled the web and retrieved a myriad of new web-pages, Emily, Lucy, and I began the mind-numbing process of labeling yet again. As mundane as this process sounds, it is essential to our machine learning algorithm: our team hypothesized that with a larger set of data, one can build a better SVM classifier. After labeling, Lucy and I repeated the process of stemming and removing words with a Gender Ratio weight of 25% or less. The web-pages retrieved were very promising in content IF they were actually a CS person's page, which still left us with the dilemma of how to identify the proper web-pages.
This week we began discussing the topic of Active Learning. There exist situations such as ours where labeling an abundant amount of data is expensive and time-consuming. In such a scenario, one can create a learning algorithm which actively queries the user for the labels it finds most crucial to the training process. There are many different algorithms one can use for Active Learning, and our mentors Carla and Andrea have been giving us lessons on them. We plan on using Active Learning both to differentiate a CS person's web-page from other pages and to differentiate male from female. Although our SVM gender classifier has been getting very accurate results, we need to perform more experiments in order to decide whether that classifier is sufficient on its own.
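The simplest active-learning loop is pool-based uncertainty sampling: the learner hands the human labeler the pages it is least sure about. Here is a sketch of that selection step; the predict_proba interface is an assumption borrowed from scikit-learn conventions, not our actual setup:

```python
import numpy as np

def most_uncertain(classifier, unlabeled_pool, batch_size=10):
    """Pick the unlabeled pages whose predicted class probabilities are closest to 50/50."""
    probabilities = classifier.predict_proba(unlabeled_pool)   # shape: (pages, 2)
    margins = np.abs(probabilities[:, 0] - probabilities[:, 1])
    return np.argsort(margins)[:batch_size]   # indices to hand to a human labeler
```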
The SVM cross-validation testing takes days to complete; we anticipate the new results after this weekend.
Week 7 (July 19th - July 23rd)
On Monday, Lucy and I found out that our SVM classifier had 99% accuracy with ten-fold cross-validation. The attributes selected were an accumulation of all stemmed vocabulary words, with those having a gender ratio of less than 25% eliminated. Although this may appear to be good news, our classifier cannot be deemed successful until we achieve the same accuracy on a separate data set. Our team began labeling a new set of web-pages so we could test the classifier on it. I hypothesized that the classifier would not achieve an accuracy in the 90's, and indeed, when we ran the new data set through the trained classifier, our accuracy with the SVM was on average in the low 80's. This indicates that our classifier was overfitting the data it was trained on.
As we waited for many tests to run, we began discussing our second text-classification problem: how can one tell whether a web-page is a CS person's page or not? This problem will definitely require active learning, since this classification is much broader and there are no strong keywords, like those in the gender classification, to help the classifier assess the web-pages.
Over the weekend, Lucy and I went to New York! She's a native, so I got to stay with her and her family. Her room had a beautiful view of the river and the Statue of Liberty. Of course, with me being the tourist and Lucy being the native, she took me to all the tourist-esque places around New York: the Metropolitan Museum of Art, the Nintendo Museum, the ferry, Times Square, and so on. The most enjoyable part of the trip was having Lucy take me to great restaurants. As an avid fan of food, I enjoyed the delicious New York delicacies that can only be found in the heart of the city.
Week 8 (July 26th - July 30th)
We began testing our use-ratio methodology on our second classification problem, Faculty versus Non-Faculty. As it turns out, it did not work nearly as well: our highest results were in the mid 70's, which demonstrates that this classification isn't as well-defined as the female classification. Emily began running active learning on our Faculty classifier using Naive Bayes. Active learning is clearly not necessary for our female classifier, as it has shown soaring accuracies. The low results for our Faculty classifier led Lucy and me to research more feature selection strategies that can improve this problem.
I had two close friends from my hometown visit me in Boston! The weekend consisted of me showing them around the city; we spent a great deal of our time in Harvard Square.
Week 9 (August 2nd - August 6th)
This week we refined a strategy to properly train and test our SVM classifier. Our process for obtaining the experimental results consists of a pipeline of sub-steps. We first retrieve a plethora of web-pages comprising a subset of Computer Science faculty pages and a subset of non-Computer-Science faculty pages. Each retrieved web-page is parsed with a stemming algorithm which also eliminates various stop-words. The stemming algorithm then constructs a bag-of-words, where each attribute corresponds to the number of occurrences of a word in a web-page. We then use four-fold cross-validation, training our SVM classifier on 3/4 of the data and testing on the remaining 1/4. A compilation of the words belonging to the training pages is used to create the feature vector for the SVM classifier, applying the feature selection strategy that was optimized for the particular classification. For each experiment run, we calculate the average and the standard deviation of the accuracy over all four folds.
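The cross-validation step of the pipeline, stripped to its essentials, looks like this (a Python sketch; train_and_score is a placeholder for training the SVM on one split and returning its test accuracy):

```python
import random
import statistics

def four_fold_accuracy(instances, train_and_score, seed=0):
    """Shuffle once, split into four folds, train on 3/4 and test on the held-out 1/4."""
    instances = list(instances)
    random.Random(seed).shuffle(instances)
    folds = [instances[i::4] for i in range(4)]
    scores = []
    for i, test_fold in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test_fold))
    return statistics.mean(scores), statistics.stdev(scores)
```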
Week 10 (August 9th - August 13th)
This week consisted of running a subset of our last experiments. We utilized the same four-fold cross-validation strategy as in the previous week. We constructed feature vectors using feature selection strategies such as IDF, Use-Ratio, and Word Count on our three classifiers: female versus others, faculty versus non-faculty, and female versus male. The results were quite interesting: all three feature selection strategies performed best when applied to our female classifier.
An initial test in which we attempted to identify the web-pages of female computer scientists without a hierarchical approach did not perform as well as one would anticipate. In our experiments, we compared the accuracy of our non-hierarchical classifier against a two-step classification method using various feature selection strategies. The first step in our hierarchical strategy is the classification of faculty versus non-faculty using the four-fold cross-validation mentioned above. Using the pre-existing labels corresponding to each web-page, we then fetched the instances that were correctly classified as faculty by our classifier. The use-ratio strategy performed best when applied to our faculty classifier, albeit with an accuracy of only 67%. We hypothesized that the lack of well-defined keywords in our faculty feature vector causes this low accuracy.
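The two-step idea is easy to state in code (a sketch; the classifier objects and their predict interface are placeholders for our trained WEKA models):

```python
def hierarchical_predict(page, faculty_classifier, gender_classifier):
    """Step 1: is this a faculty page at all? Step 2: only then classify gender."""
    if faculty_classifier.predict(page) != "faculty":
        return "other"
    return gender_classifier.predict(page)   # "female" or "male"
```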
Since this is our last week of the internship, we properly commented all of our code and documented our experiments so the project can be passed on to prospective research students. We began writing our research paper, a compilation of the work performed by Lucy and me. We went out to lunch one last time with our mentors Carla and Andrea. This research experience gave me great insight into future research and graduate school, and I immensely enjoyed it.