The Project
This project is supervised by Dr. Jihie Kim, whose current research focuses on intelligent technologies for teaching and learning, at the Information Sciences Institute of the University of Southern California. I'd also like to thank CRA-W's DREU program for providing me with the opportunity to do this work.
For this project, we wish to 'correct' noisy student reply-to data from a forum used for a Computer Science class at USC. While the students do indicate who they are replying to, they are not always accurate. These inaccuracies introduce noise into our data, which is hampering other student modeling and data mining efforts. Thus, my project is to analyze the forum data, and eventually build a reply-to relationship model using supervised machine learning techniques. This process will involve developing an annotation manual for deciding reply-to relationships, designing feature-sets, choosing machine learning algorithms, and running machine learning experiments.
The Final Report
For a paper summarizing the work completed this summer, please see my Final Report. I will also list any publications both here and on my Publications page.
The Timeline
- Week 1: Complete a first-draft of the annotation manual.
- Week 3: Complete preliminary first round of annotation, get Kappa with another researcher.
- Week 4 and Week 5: Analyze data, begin implementing feature-set.
- Weeks 6-10: Finish feature-set, perform machine learning experiments.
The Blog
Week 10 (August 16-20, 2010)
I'm currently writing this 7 hours before my plane leaves, and it feels very surreal! I can't believe it's my last week already. It was a really productive week, too! I spoke with Jihie, who had a couple of suggestions on how to get the HMM program to work, and, well, now it works!! I also tried using J48 decision trees, since they also seemed to make sense with the data. For both HMMs and decision trees, I ran one straight 70/30 split of the data for training/testing as well as 10-fold cross-validation. Everything did better than random, so I'm happy! However, there's still a lot more to do on this project. I'd really like to be able to see it through to the end, but I'm really glad I got the chance to work on it, even if it's not completely finished. This week I also documented the HMM program (after I figured it out) and wrote my final paper. And packed! I did a lot of packing. We also had a farewell lunch today at California Pizza Kitchen, which most of the group was able to attend. I'm not really sure how to end this post, so, cheers!
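For anyone curious what the two evaluation setups look like, here is a minimal sketch in plain Python. It is not the code used in the project: the function names, the fixed random seed, and the idea of splitting a flat list of items (for example, thread IDs) are all assumptions made for illustration.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle the items once and cut them into a train and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary choice
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs so that every item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds, round-robin.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Hypothetical data: 100 thread IDs.
threads = list(range(100))
train, test = train_test_split(threads)
print(len(train), len(test))
for fold_train, fold_test in k_fold_splits(threads):
    pass  # train a model on fold_train, score it on fold_test
```

With a single split, the score depends on which items happen to land in the test set; cross-validation averages over ten different test sets, which is why it is worth running both.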
Weekend 9 (August 14-15, 2010)
Had a very relaxing weekend--started packing and cleaning a little, so it wouldn't be overwhelming next week, and read by the pool. I'm now about a quarter into the third book in the Dune series. All in all, a nice relaxing weekend.
Week 9 (August 9-13, 2010)
This week, I focused primarily on making sure everything I did this summer was documented. I completed documenting everything but the HMM program I've been using. While it seemed feasible that I could get a few preliminary machine learning experiments done, we are quickly running out of time, and I'd really like someone else to be able to continue the project, if they want to. I also did a little more annotation this week.
Weekend 8 (August 7-8, 2010)
A few of my friends from home visited this weekend on their way back from a vacation. Thursday we ate at a Korean restaurant in Koreatown, called O Dae San. I ordered dolsot bibimbap, and thoroughly enjoyed it. Friday night, we went to a dance club, also in Koreatown, and had probably the best time I've ever had at a club. Saturday and Sunday were a bit more relaxed--they left very early Sunday morning. I had a very good time.
Week 8 (August 2-6, 2010)
This week was annotating, annotating and more annotating! We got a very nice Kappa of 0.84 for the state transition work, so we decided the annotation manual was finalized. So, I mostly continued annotating for both reply-to relationships and state transitions.
Also, another undergrad from USC is starting to work with the same group I'm with, so I'm currently training him on how to annotate reply-to relationships.
Weekend 7 (July 31-August 1, 2010)
Friday night I went out to dinner with some friends to a Vietnamese place in Venice. Beforehand, we walked around a bit. Here's a picture:
Saturday and Sunday I wasn't feeling very adventurous, so I ended up reading by the pool both days. I ended up finishing the second book in the Dune series, Dune Messiah. Thoroughly enjoyed it.
Week 7 (July 26-30, 2010)
After annotating the threads, I computed a Kappa, which is a score of agreement that corrects for chance agreement. Our Kappa was around 0.65, which isn't great. So, the other person in the group that designed the annotation scheme and I sat down together, discussed our differences and made some changes and additions to the annotation manual.
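As a rough illustration of how an agreement score can correct for chance, here is a small sketch of Cohen's kappa for two annotators. I am assuming Cohen's variant here, since the post only says "Kappa"; the function name and the example labels are made up and are not from the project data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # estimated from how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical reply-to annotations: which earlier post each post answers.
annotator_1 = ["post1", "post1", "post2", "post3", "post1", "post4"]
annotator_2 = ["post1", "post2", "post2", "post3", "post1", "post1"]
print(cohen_kappa(annotator_1, annotator_2))
```

A kappa of 1 means perfect agreement and 0 means no better than chance, so two annotators who agree on most items can still get a modest score if one label dominates.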
I also began looking at a Hidden Markov Model package that had been used in the past, and the code that other members in the group used to build some preliminary models in the past. I figured out how to run it and initialize the HMM, and let it loose on a toy data set. So, I'll just have to change a few lines for the real data set.
Dr. Kim and I also talked (at the very beginning of this week) about my plans for the rest of the time I'm here. We got a little off-schedule due to the AERA paper, so we re-evaluated. Our new timeline looks like:
- Week 7: Get Kappa, investigate how HMM package works and familiarize myself with the code, discuss annotation disagreements, annotate 15 more threads to get a new Kappa.
- Week 8: Assuming a good Kappa, focus on annotating, and design/implement features as time permits.
- Week 9: Focus on feature-writing and machine learning experiments, continue annotating as time permits.
- Week 10: Guarantee everything is well-documented and accessible. Finish machine learning experiments, and implement any last-minute features.
Weekend 6 (July 24-25, 2010)
Erin, Sam, one of Sam's roommates, and I went swimming and played cards Friday night. I went for a run and then went to Malibu Saturday, so I got to see the northern part of Los Angeles County. Sunday morning, I went surfing on Venice Beach. Surfing's a lot harder than it looks! Regardless, I had a good time, even though I never stood up on the board. I can't wait to go again. Erin and I also went on the Water Bus, and walked around in the afternoon, with more card playing in the evening. All in all, a very fun and very eventful weekend!
Week 6 (July 19-23, 2010)
Busy week! Finally made it to Hollywood, with Erin and a few friends from my floor. We went to see Toy Story 3 at the El Capitan Theatre, and walked around the Walk of Fame afterwards. Disney set up a little fair/carnival outside of the theatre for kids (and adults!) after the show, so we had a lot of fun there, too! Here's a picture of me nerding out on the Walk of Fame:
Work-wise, I was really happy because I got the theoretical framework pretty smoothly worked in to my paper. I'm really pleased with the final result, and I submitted Thursday. I'd put the paper up here, but since it's a completely blind review, I'm not sure I should.
The rest of the week went smoothly, and I was trained on how to annotate problem-solving states for threads. Since I'm annotating reply-to relationships, it's not much more work to annotate states as well. Another student started this work, and trained me how to annotate for it. Pretty soon we'll be calculating a Kappa for our annotations.
Weekend 5 (July 17-18, 2010)
My boyfriend, Brett, flew in for the weekend, so we explored LA together. We went to Venice City Beach (while I'd been to Venice before, I hadn't gone to THE Venice City Beach), where Muscle Beach is. We also went to Burton Chace Park to look at the marina, and we spent a nice quiet afternoon by the pool. All in all, a nice, relaxing weekend!
No pictures yet, Brett needs to email them to me first!
Week 5 (July 12-16, 2010)
Dr. Kim suggested that maybe part of the reason I wasn't seeing any significant results was that I didn't have enough data. I was only using one semester's worth of forums, so she suggested I add another semester that was also tagged for the speech acts we were investigating. It did the trick!! All the patterns I was seeing were very prevalent in both semesters, and the extra data gave me statistical significance with a Student's t-test. So, I got busy writing the paper, now that I had exactly what I wanted to write about, and statistics to back it up. After a lot of drafts and a lot of input from Dr. Kim, we were almost ready to submit Thursday. We decided that a discussion of the theoretical framework would add a lot to the paper, so if I could find something, I should add it. Thursday morning, while looking for theoretical cites, I got the good news that the paper deadline was extended a week! So, I'll finish up adding in the theoretical framework, double check all the editing stuff, and add it in! Next week, I'm going to go back to annotation, and soon we're going to start the machine learning part of the project. Very excited!
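To give a feel for the kind of test involved, here is a minimal sketch of the two-sample Student's t statistic with pooled variance, using only the Python standard library. Whether the actual analysis used exactly this form of the test is an assumption on my part, and the function name and numbers below are invented for illustration, not taken from the forum data.

```python
import math
import statistics


def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic (equal variances assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Made-up thread lengths for two hypothetical groups of threads.
group_1 = [3, 4, 4, 5, 6, 7]
group_2 = [5, 6, 7, 8, 8, 9]
t_stat, df = student_t(group_1, group_2)
print(t_stat, df)
```

The t statistic is then compared against the t distribution with `df` degrees of freedom to get a p-value. More data shrinks the standard error, which is one reason a second semester of forums can turn a visible pattern into a significant one.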
Weekend 4 (July 10-11, 2010)
Went to a party at a friend's house Friday night, and had a very good time. Saturday morning, Sam, his roommate and I went to a little diner for breakfast, and went back to his house afterwards to watch the Germany/Uruguay World Cup match. Here's a picture of a house and a cactus that I took while walking back to their house:
Sunday, I went for a massage at a small place down the street from the place I'm staying. It was a spur of the moment decision to make the appointment Saturday, and I'm really glad I did. It was really nice--I had a full body massage, and a foot massage. During the foot massage, which was first, they gave me tea. Really good way to relax! It was my first massage ever, and I enjoyed it.
Week 4 (July 5-9, 2010)
This week, I further investigated the patterns that we thought we were seeing last week. We realized that not only does this pattern exist, but we can capture it using existing annotations, which means that we can use more of the data! (I've still only annotated about one third.) Basically, the pattern is that longer, more detailed replies have more following posts and longer threads. So far, though, the differences in the data set haven't been statistically significant. So, I'm going to keep seeing if any other significant patterns emerge.
Weekend 3 (July 3-4, 2010)
I had a very relaxed weekend. Went for a 5 mile walk/run by the beach Saturday morning. Saturday afternoon I went to a nearby shopping district to see about getting my Birkenstocks re-soled. Unfortunately, they don't do it, but they gave me the name of a place in Santa Monica I can go. So, I wandered around the shops for a while, and stopped in a bookstore. I got the second book in the Dune series (I finished the first one on the plane on the way here) and I got a book by Isaac Asimov called Nemesis. I also got a pair of flip-flops for the beach. I then proceeded to go home, go to the pool, and fall asleep while 'reading.' (I did get a chapter or two into the Dune book.)
Since Sunday was the Fourth of July, another girl who's doing the DREU program, Erin and I went to see the fireworks at Burton Chace Park. We had a very nice time, and we went back to my townhouse afterwards. Here's a picture of one of the fireworks:
Week 3 (June 28-July 2, 2010)
We decided I am definitely going to go for the AERA paper. The deadline's really soon--July 15th--so I've really been hunkering down. During the beginning of the week, I trained the other annotator to annotate the reply-to relationships, and started working on data analysis. I found a few interesting patterns that I think will fit well into the paper. Hopefully I'll have a very rough draft by Tuesday. Essentially, the focus of the paper will be how differences in dynamics in student-teacher interaction and student-student interaction in the forums impact the outcome of those particular threads. It's going slow, but it's going!
I've also been able to get a little hooping in while I've been here, and I'm getting a lot better. I definitely will eventually need some instruction, so I'm going to take some classes when I get back to Pittsburgh.
Let's finish this post off with a picture of me having fun taking a break at work:
Weekend 2 (June 26-27, 2010)
This Saturday, I went to Beverly Hills/UCLA area with a friend who works on my floor at ISI. The area of town we went to he called "Tehrangeles". We went to a very delicious Persian restaurant called Flame. We got ice cream afterwards--I had saffron and pistachio ice cream, and it was delicious!--and then we walked around the UCLA area. Definitely had a good time!!
Sunday, I went on a hike with Sam, another DREU student at ISI on the 12th floor. (I'm on 9, with the Intelligent Systems Division.) We went to Topanga State Park.
It was really, really different than all the state parks I've been to in Pennsylvania--for one thing, it wasn't muddy! It was very dusty, in fact, and the wildlife and plants were completely foreign. It was really cool, and I'm looking forward to going again. Here's a couple more pictures:
Week 2 (June 21-25, 2010)
I arrived in Marina del Rey over the weekend--I did the first week of the program in Pittsburgh (two weeks ago), so I could take a week off to go to the Intelligent Tutoring Systems conference held in Pittsburgh this year. I presented a poster-paper, and had a blast (yes, I was singing along loudly with the Educational Data Mining song). I met my roommates and their two dogs--we met off of Craigslist, and so far everything's going terrific! The Marina is really pretty, too. Here's a view from ISI's reception area:
Research-wise, we made a couple more changes to the annotation manual, and I started annotating in earnest. I'm probably about a third of the way done with the annotations by my estimate. I also wrote a couple scripts to go through and list the original poster's reply-to, my reply-to, and the order in which the posts were posted, to help with preliminary analysis. I'm gonna keep plugging!
In other news, I'm also getting better at using the espresso machine on the 11th floor.
Week 1 (June 7-11, 2010)
This week, I started working with the data. I annotated the first ten 'training' threads, and designed a preliminary annotation manual, complete with examples. These took a little longer than I expected. The manual is pretty good, but I'm definitely anticipating a lot of revisions! I'm also starting to get some ideas for features for the classification problem, so I'm jotting them down to revisit later.