Journal

dreu 2012

Week 10 - July 30th - August 3rd

My last week--unfortunately--so I've been busy wrapping things up. We performed a last set of 27 machine learning experiments: three data balances (undersampled, oversampled, and original), three different types of learning methods, and our three labeling types. So far oversampling is performing best, as is labeling the window of two utterances. This makes a lot of sense, intuitively--we're capturing the change between the end of a segment and the beginning of the next.
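The count of 27 comes from crossing three choices of three options each. A minimal sketch of that grid follows; the learner names are placeholders (the three methods aren't named here), and the labeling names are an assumption based on the three label types described in Week 9.

```python
from itertools import product

# All names below are illustrative assumptions, not the actual experiment config.
SAMPLINGS = ["undersampled", "oversampled", "original"]
LEARNERS = ["learner_a", "learner_b", "learner_c"]  # placeholders
LABELINGS = ["segment_initial", "segment_final", "two_utterance_window"]

def experiment_grid():
    """Return every (sampling, learner, labeling) combination."""
    return list(product(SAMPLINGS, LEARNERS, LABELINGS))

grid = experiment_grid()
print(len(grid))  # 3 * 3 * 3 = 27
```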

Have been having my last New York adventures--vintage shops and vegan food in Brooklyn with my friend Fae, finally got Shakespeare in the Park tickets with Robin, delicious soba and gelato with Zoë... I'm really going to miss this place.

Week 9 - July 23rd - 27th

I generated two new sets of segment boundary labels--we had been representing each segment with segment-initial labels, by labeling the first utterance of each segment, but it might be more natural to label the last utterance of a segment, or to create a window of two utterances and label whether or not there is a boundary between them, to capture the change between the end of one segment and the beginning of the next. I also wrote a script to oversample our positive class. Now we're working on getting our database of features updated with all the new features, and we'll run more experiments on the results next week.
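The three labelings are easy to confuse, so here is a minimal sketch of how each could be derived from the same segmentation. This is an illustration, not the actual script: the function names and the input format (one segment id per utterance) are assumptions.

```python
def initial_labels(seg_ids):
    """True for the first utterance of each segment."""
    return [i == 0 or seg_ids[i] != seg_ids[i - 1] for i in range(len(seg_ids))]

def final_labels(seg_ids):
    """True for the last utterance of each segment."""
    n = len(seg_ids)
    return [i == n - 1 or seg_ids[i] != seg_ids[i + 1] for i in range(n)]

def window_labels(seg_ids):
    """One label per adjacent pair: True if a boundary falls between them."""
    return [a != b for a, b in zip(seg_ids, seg_ids[1:])]

# Six utterances in three segments: [0 0 0] [1 1] [2]
seg_ids = [0, 0, 0, 1, 1, 2]
print(initial_labels(seg_ids))  # [True, False, False, True, False, True]
print(final_labels(seg_ids))    # [False, False, True, False, True, True]
print(window_labels(seg_ids))   # [False, False, True, False, True]
```

Note that the window labeling produces one fewer label than there are utterances, since it describes the gaps between them.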

Last weekend was amazing--my friend Derek was in town, and we went to two concerts: Fang Island on Friday night, which ranks among the best concerts I've been to--the band is really good live, their music is really fun to dance to, and the crowd was really good. The next day we met up with Zoë at Chelsea Market for dinner, and then went to see Sleigh Bells--they were great, but the crowd was less so, though it was still a really fun show. Then Sunday we had more delicious FISHTAG brunch with Robin, and I took Derek to the Natural History Museum and we looked at lots of dinosaur bones.

Week 8 - July 16th - 20th

This week I've been working on more features to improve our results--I wrote extractors for pause and proper noun features, and Yuan is working on extracting prosodic features--pitch, intensity, and so forth--from the audio files. We're hoping these will help us identify more positive instances.

My friend Robin and I have been going to brunch at a different place every week this summer, and I think we've finally found our favorite--a little place on 79th called FISHTAG, which I chose purely due to being amused by the name, but which turned out to be AMAZING. We already have plans to go back next week. I went to the New York Botanical Gardens this weekend with my suitemate and my friend Robin, and it was so much fun--they have this fairly wild area, which for being in the middle of the Bronx felt so much like being out in the woods! As much as I love New York, I've always lived in and I was surprised how much being surrounded by trees again relaxed me. They also had REALLY amazing, extensive greenhouses, and I took way too many pictures of plants. Would you like to see a few? Too bad, here are some:

Week 7 - July 9th - 13th

And the first results are...not great. We've got a bit over 80% average precision and recall, which seems okay, but if you look closer, the precision and (especially) recall for the positive utterances are much worse, around 46% and 7%, respectively. This is because the dataset is really unbalanced--we have about ten times as many negative utterances (utterances that aren't boundaries) as positive utterances. So the decision tree can get pretty good results by just labeling almost everything as false. Next we're going to try some things to improve the performance for positive utterances especially--oversampling (artificially increasing the size of the set of positive utterances, by randomly selecting two similar utterances and creating an utterance that averages their values), undersampling (randomly cutting out negative instances), and of course, adding more features.
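A minimal sketch of the two rebalancing ideas, under stated assumptions: "similar" is taken here to mean the nearest other positive instance by Euclidean distance, and utterances are plain lists of numeric feature values. This resembles SMOTE-style oversampling, but it is only an illustration; the function names are made up and it is not the actual script.

```python
import random

def oversample_by_averaging(positives, n_new, rng):
    """Create n_new synthetic positives, each the midpoint of a random
    positive instance and its nearest other positive (Euclidean distance).
    Assumes at least two positive instances."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(positives)
        others = [p for p in positives if p is not a]
        b = min(others, key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)))
        synthetic.append([(x + y) / 2 for x, y in zip(a, b)])
    return synthetic

def undersample(negatives, n_keep, rng):
    """Keep a random subset of the negative instances."""
    return rng.sample(negatives, n_keep)

rng = random.Random(0)  # fixed seed so runs are repeatable
positives = [[1.0, 0.0], [1.2, 0.2], [5.0, 5.0]]
negatives = [[float(i), float(-i)] for i in range(30)]

balanced_pos = positives + oversample_by_averaging(positives, 7, rng)
balanced_neg = undersample(negatives, 10, rng)
print(len(balanced_pos), len(balanced_neg))  # 10 10
```

Resampling should only touch the training data; evaluating on a resampled test set would hide exactly the imbalance problem described above.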

My friend Adina came to visit me this weekend! We are really good at taking stupid photos of each other, as you can see above. There was delicious Ethiopian food and lots of bubble tea and the Natural History museum (my favorite!) and I MISS HER ALREADY, come back Adinaaa.

Week 6 - July 2nd - 6th

We have features extracted! We wrote scripts to label each utterance with our initial nine features, and another to label them with the boundaries of the segments. Next week we'll put all of this into Weka, a machine learning toolkit, and get some initial results! I'm really excited about this.

It was also 4th of July week--my friend-from-high-school Robin, another friend of ours, Alec, who was visiting her, and I went to one of Robin's friends' apartments in Jersey City for food and fairly weird party games. There they are being incredibly patriotic! Afterwards we went to watch the fireworks, which were pretty much magnificent.

Week 5 - June 25th - 29th

This week we tested a bunch of part-of-speech taggers and compared the results--most POS taggers are trained on written text, not on speech, and so perform far worse on speech. We found one that should work well enough to use on the features we need parts of speech for. We also did a bit of hand-testing for the features that should be more difficult to extract, to see how promising they looked and how much time we should spend on them. We randomly selected a few dialogues, annotated them with the features by hand for each utterance, and then compared them to the segment boundaries that were already found, and looked at the precision and recall of each feature.
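For the hand-testing step, scoring one feature against the existing boundaries amounts to treating "the feature fires on this utterance" as a prediction of "this utterance is a boundary". A minimal sketch, with made-up example data and a hypothetical function name:

```python
def precision_recall(predicted, actual):
    """Precision and recall of boolean predictions against boolean labels."""
    pairs = [(bool(p), bool(a)) for p, a in zip(predicted, actual)]
    tp = sum(1 for p, a in pairs if p and a)      # fired on a boundary
    fp = sum(1 for p, a in pairs if p and not a)  # fired on a non-boundary
    fn = sum(1 for p, a in pairs if not p and a)  # missed a boundary
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative data only: one entry per utterance in a dialogue.
feature_fires = [True, False, True, False, False, True]
is_boundary = [True, False, False, False, True, True]

precision, recall = precision_recall(feature_fires, is_boundary)
print(round(precision, 2), round(recall, 2))  # 0.67 0.67
```

A feature with high precision but low recall can still be worth extracting, since other features may cover the boundaries it misses.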

I went to the Governors Ball music festival this weekend! It was SO GOOD. Got slightly sunburned and sore from dancing. I loved all the bands I expected to love, and through the magic of music festivals, have come home with a new favorite summery pop band, The Jezabels. They were the first band we were there for and one of my favorites. I also got to see Explosions in the Sky (seen in the photo above, taken by my friend Zoë), one of my absolute favorite bands, whom I'd never seen live before.

Week 4 - June 18th - 22nd

This week we spent a lot of time discussing the features we picked out in more depth--figuring out how, precisely, we would extract each one, which ones would be more or less useful, and which ones would be harder or easier to extract, to make a plan for how to extract each one. So now we've got a really solid plan going forward. We also got started doing a little work towards their actual extraction--I wrote a script to generate a list of "content words" and "discourse words" (words that are only for maintaining a dialogue) based on their relative frequencies in written and spoken text, for example.
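One way such a word list could be built is sketched below. The assumption here is that a word counts as a "discourse word" when it is much more frequent in speech than in writing; the ratio threshold, the smoothing floor, and the function names are all hypothetical, not the values or code actually used.

```python
from collections import Counter

def relative_freqs(tokens):
    """Map each word to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def split_words(spoken_tokens, written_tokens, ratio_threshold=2.0, floor=1e-6):
    """Split the spoken vocabulary into (content, discourse) word sets.
    A word is 'discourse' if its spoken/written frequency ratio is at
    least ratio_threshold; floor avoids dividing by zero for words
    never seen in the written sample."""
    spoken = relative_freqs(spoken_tokens)
    written = relative_freqs(written_tokens)
    content, discourse = set(), set()
    for word, f_spoken in spoken.items():
        ratio = f_spoken / max(written.get(word, 0.0), floor)
        (discourse if ratio >= ratio_threshold else content).add(word)
    return content, discourse

# Tiny made-up samples, just to show the shape of the input.
spoken = "um okay so the book um yeah the title okay".split()
written = "the book has a title and the title is on the book so".split()

content, discourse = split_words(spoken, written)
print(sorted(content))    # ['book', 'so', 'the', 'title']
print(sorted(discourse))  # ['okay', 'um', 'yeah']
```

With real corpora the threshold would need tuning, and rare words would need a minimum-count cutoff so that one-off spoken words don't land in the discourse list.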

Last weekend I went to the Met with one of my suitemates, and got to show off my (super useful!) knowledge of trivia about fiber arts and ancient languages. It's a great museum, and almost unbelievably huge--in about four hours we got through a tiny portion of it. I've sort of decided that I need to live in New York for at least as long as it takes me to see the whole museum.

I also got tired of my sad, plantless existence, so I bought a basil plant at the awesome farmer's market that's near me every Sunday. His name is Herodotus, and he is delicious on sandwiches.

Week 3 - June 11th - 15th

First, good news: we have our corpus! We got it this Thursday, so I'm looking forward to diving into that more next week.

This week, Yuan and I took a couple of samples of the corpus we did have and segmented them by hand, met to discuss our segmentations, and once we had more or less agreed, started to look for features we might be able to use to segment automatically. We picked out a few that intuitively seem useful to start with. The project is very exploratory, so we're really in the "try things out, and see what works" phase.

I also presented my first paper in the weekly reading group I'm in with the other students in Becky's lab for the summer. Presentations make me a bit nervous, but the discussions are always really interesting and I'm hoping presenting things regularly for a summer will make me better at them.

Week 2 - June 4th - 8th

The view from our 18th-story suite window is really spectacular. More so at night, when the city is glittery and you can see all the way to the east side of Manhattan. But that's impossible to photograph with my little point-and-shoot, so here's a picture of a rainbow I took on Sunday.

This week has been more of the same--I learned a third type of annotation, called Task Success Annotation, which they used for the Loqui corpus that I was working with for the DFUs last week. It identifies the different tasks people are addressing as they speak: this section is a book request by the patron, this section is a request for information by the librarian, etc.--it's a lot like a better-defined version of the segmentation task I did last week. Then I've been discussing and correcting the annotations I did last week with Yuan, another student working with Becky for the summer, and reading more papers. One of them was a study of what features make a word hard to recognize--low probability in the language model, various disfluencies (repeated words, uh or um, fragments of words), pitch and intensity range, etc. Since these features can be the same ones that might indicate a more low-stakes segment, we spent a while discussing which ones we thought might be language independent and therefore useful to us vs. language dependent--for example, we thought intensity range is likely to vary in all languages, and therefore be useful, whereas word length might not, because some languages just might not have that much variation in word length. So that's all been really interesting! We still don't have the actual corpus we're supposed to be working on, which is a bit frustrating.

So my weeks are filled with really interesting discussions about language, and my weekends are running around Manhattan, meeting various friends for various meals. Yeah. Life's pretty great.

Week 1 - May 28th - June 1st

As part of the DREU program I'm supposed to keep a weekly journal about my experiences here. Well, it's been off to a bit of a slow start but it's been really interesting! I've read a few papers to get background, and I've been practicing hand-annotating dialogue function units in a corpus from one of my mentor's previous projects, transcripts of people checking out books over the telephone. DFUs are, basically, the action that's being done by the utterance--is it a request for some action? To inform the other person? et cetera, et cetera. Or my favorite DFU that I've never gotten to use, the performative--something that changes the state of the world just by having been stated. Think a proclamation, a promise, or "I now pronounce you husband and wife." I don't think this is directly related to the work I'll be doing this summer, more just to get some background. I've also done a basic segmentation of a monologue, by the vaguely defined criterion of the intent of the speaker--something goes in one segment if it's intended to do the same things. This will more directly funnel into the things I'm going to be doing this summer. The other students I'm going to be working with have now arrived, so I think things will probably pick up next week.

In non-research-related news... I'm really liking New York! I wasn't sure I would, as I have never been that fond of cities, but once I settled in I've started to like it more and more. I'm sure it doesn't hurt that a ton of people from Oberlin live here, so I have plenty of friends around eager to show me around. Also, after a couple of extremely cold days, I'm happy to report that we have finally discovered where they hid the controls for the air conditioning in our apartment.

So that's that. Really excited about this summer!