Weekly Progress Journal

Week 1

On the first day, I met Dr. Caragea and the seven other students who were also working in the Machine Learning Lab this summer. Then I spent the first couple of days brushing up on Java, which I hadn't seriously used since AP CompSci in high school.

I also read several research papers about classification techniques, artificial intelligence, and machine learning. Dr. Caragea explained that (almost) every week the group would be assigned a research paper to read, and then we would all meet to discuss it, with one of the grad students leading the discussion. At first, this was the most intimidating part of the research experience.


"Who understands this paper?"
This was NOT what I looked like during the first week.
(Creds to whoever made this GIF & Warner Bros)

Toward the end of the week, I received the dataset from CiteSeerX that I would be using this summer. To warm up, I calculated the overlaps between the global (title and abstract) and citation contexts for each paper. On Friday, I met with Dr. Caragea to talk about several approaches to classifying papers. We settled on one approach and discussed the proposed algorithm.
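For illustration, one simple way to measure this kind of overlap is the Jaccard similarity between the set of terms in the global context and the set of terms in the citation context. The sketch below is written in Python under that assumption; it is not the code used in the project, and the function name and sample strings are made up.

```python
def jaccard_overlap(global_text, citation_text):
    """Return |A ∩ B| / |A ∪ B| for the term sets of two text snippets."""
    # Assumption: whitespace tokenization and lowercasing are enough here.
    global_terms = set(global_text.lower().split())
    citation_terms = set(citation_text.lower().split())
    if not global_terms and not citation_terms:
        return 0.0  # define the overlap of two empty contexts as 0
    shared = global_terms & citation_terms
    combined = global_terms | citation_terms
    return len(shared) / len(combined)


# Hypothetical global context (title + abstract) and citation context.
global_context = "topic models for text"
citation_context = "text classification with topic models"
print(jaccard_overlap(global_context, citation_context))  # 3 shared terms out of 6
```

A value near 1 would mean the two contexts use almost the same vocabulary, while a value near 0 would mean they share few terms.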

Besides adjusting to work, I explored Denton. My apartment is near the main campus, which is about 15 minutes away from the Discovery Park campus where I work. There are many restaurants near the main campus, and the public transit system in Denton is surprisingly dependable and even pleasant to ride.

Week 2

I started implementing the algorithm discussed last week. The first step is the preprocessing phase, which involves generating input files for the rest of the algorithm. My first task was to create a bag of words model for the research papers. A bag of words is essentially a dictionary of terms and their corresponding frequencies.
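As a rough picture of what a bag of words looks like, here is a minimal Python sketch (not the project's Java code). The tokenizer is an assumption: it simply lowercases the text and keeps runs of letters, with no stemming or stopword removal.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each lowercase term in `text` to the number of times it occurs."""
    # Assumption: a term is any run of ASCII letters; punctuation and digits are dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical abstract text, used only to show the shape of the result.
abstract = "Learning to classify papers: classify papers by learning."
bag = bag_of_words(abstract)
print(bag["learning"], bag["classify"], bag["to"])
```

Word order is thrown away on purpose: only the terms and their counts are kept, which is what makes the model easy to turn into numeric features for a classifier.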

This week, I also created the first version of this website. After I submitted a link to the first version, I was told that the cool menu (which was based on this awesome tutorial) didn't work. So I sent the website link to several friends (free crowdsourcing / debuggers!). Unfortunately, they didn't find anything wrong with the menu either. But eventually I discovered the root of the problem: browsers. Having done all my testing in Chrome, I had forgotten to check for cross-browser compatibility with Safari and Firefox. I added about two lines to my Sass file and that fixed the problem.

Week 3

This week my goal was to generate ARFF files for the bag of words for the citation and global contexts. Thus began my introduction to WEKA. I found several online resources and tutorials. I played around with the WEKA graphical interface and then used the Java WEKA API to implement a custom stemmer and stopwords remover. Then, I discovered that I should have downloaded the stable version instead of the developer version. So I downloaded the stable version, which had an easier way to stem and remove stopwords.
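For readers who haven't seen one, an ARFF file is plain text: a header that declares the attributes, followed by one comma-separated row per instance. The Python sketch below writes a tiny dense ARFF file from bag-of-words counts. It is only an illustration of the file format, not the WEKA API, and the relation name, vocabulary, and class labels are hypothetical.

```python
def to_arff(relation, vocabulary, documents, classes):
    """Render bag-of-words counts as dense ARFF text.

    `documents` is a list of (counts, label) pairs, where `counts` maps
    a term to its frequency and `label` is one of `classes`.
    """
    lines = ["@relation " + relation, ""]
    # One numeric attribute per vocabulary term.
    # Assumption: no vocabulary term is itself named "class".
    for term in vocabulary:
        lines.append("@attribute " + term + " numeric")
    # The nominal class attribute lists every allowed label.
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for counts, label in documents:
        row = [str(counts.get(term, 0)) for term in vocabulary]
        lines.append(",".join(row + [label]))
    return "\n".join(lines) + "\n"


# Hypothetical two-paper dataset with made-up labels "ml" and "db".
vocabulary = ["graph", "learning"]
documents = [({"learning": 2}, "ml"), ({"graph": 1, "learning": 1}, "db")]
print(to_arff("papers", vocabulary, documents, ["ml", "db"]))
```

With a real vocabulary most counts are zero, so a sparse ARFF layout would probably be a better fit than the dense rows shown here.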

Week 4 + 5

I continued preprocessing input files. Over these two weeks I worked on generating classification dictionaries, splitting the dataset into several smaller sets for training and testing, and generating WEKA Instances. For the training sets, the Instances model the global context, citation context, and link diversity; for the test sets, they model only the global context and citation context.

I learned more about WEKA classifiers (and classifiers in general) and spent a bit of time wrapping my mind around how to compute the link diversity of a paper. Then I worked on the Bootstrap stage of the algorithm, which involves training the classifiers and using them to predict classes for the papers in the test sets.

Week 6

I moved on to the final stage, Iterative Classification, and produced preliminary results. This actually took some time, as some of the classifiers took a while to run.

Week 7

This week, I met with Dr. Caragea to discuss the results. We re-evaluated my implementation of the algorithm, and I had to make several changes. After this, I regenerated the files. I also divided the citation context so that we could test the cited and citing contexts separately.

I also took some time to update the website.

Week 8

This week I learned about different evaluation measures: accuracy, precision, recall, and F-measure. I continued working on refining the algorithm, but the results weren't matching up with Dr. Caragea's preliminary results. Dr. Caragea and I met to discuss this. We noticed that we had partitioned the data in different ways and that I had removed more words during the preprocessing phase. So Dr. Caragea sent me some of her input files. When I used those files, my Bootstrap results matched her preliminary results, so we decided to run the rest of the tests on those files instead.
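To make the four measures concrete, here is a small Python sketch that computes them for one class treated as the "positive" class. It uses the standard definitions (and the balanced F1 form of the F-measure); the labels and predictions are made up and are not results from the project.

```python
def binary_metrics(actual, predicted, positive):
    """Compute accuracy, precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative, 1 true negative.
actual = ["ml", "ml", "db", "db", "ml"]
predicted = ["ml", "db", "db", "ml", "ml"]
print(binary_metrics(actual, predicted, positive="ml"))
```

Accuracy counts every correct prediction, while precision and recall look only at the positive class: precision asks how many predicted positives were right, and recall asks how many actual positives were found.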

Week 9

I began working on the final report for DREU and continued generating results.

This week, UNT had a free showing of Mad Max. It was one of the most bizarre movies I'd ever seen, but I enjoyed it.

Week 10

I wrapped up as much of my research as possible. I still have a bit more work to do for the final report and this website.

Although I've been missing Memphis, I'm going to miss Denton and my fellow lab researchers. I learned a lot during these past ten weeks about machine learning, graduate school, research papers, and research in general. My time at UNT cemented my decision to apply to grad school this fall.

Picture of my last day in the lab.

Namchi Do, Summer 2015