Carol and I making Rebecca's present
clockwise from bottom left: Tabitha, Rebecca, Karina, Carol, Shylah, Jessica
Views of Pittsburgh from 42 stories up
My DMP colleague Carol during our trip to Phipps Conservatory
The Cathedral of Learning on the U Pitt campus
Heinz Memorial Chapel
4th of July
On my first day, I met the professor I'll be working with,
Dr. Rebecca Hwa.
I am also working with another DMP participant, Carol.
Most of my work this week has been reading and doing problem sets to get acquainted with the field of Natural Language Processing (NLP). Even coming from a small CS department,
I never dreamt I would receive so much individualized attention! Dr. Hwa (Rebecca) and I are meeting daily for the first two weeks to discuss the material. There is a lot of material to cover, so these meetings usually last two or three hours.
Rebecca enrolled us in the NLP group at U Pitt, which meets weekly so members can present and get feedback on each other's research and presentations. The group consists of professors and graduate students, and some use it as a practice session for future presentations at academic conferences. This week,
Beatriz Maeireizo Tokeshi presented her research on co-training systems that predict emotions in spoken dialogues. Both Carol and I will present our individual projects at this meeting in mid-July.
Rebecca also organized a weekly Machine Translation (MT) session, consisting of herself, Carol, myself, and two other graduate students, including
Chenhai. The purpose of this group is to discuss academic papers in the fields of NLP and MT. This week we discussed a paper on a
Maximum Entropy model.
Entropy measures the degree of uncertainty in a probability distribution; a maximum entropy model picks, out of all the distributions consistent with what has been observed, the one with the highest entropy. This makes such models useful in building and evaluating language models.
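As a toy illustration of entropy as a measure of uncertainty in a probability distribution (this is just my own sketch, not code from the paper we read):

```java
// Toy illustration of the entropy of a probability distribution:
// H(p) = -sum_i p_i * log2(p_i). A uniform distribution is the most
// uncertain and has maximum entropy; a peaked one has low entropy.
public class Entropy {
    public static double bits(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0) h -= pi * (Math.log(pi) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Four equally likely outcomes: about 2.0 bits of uncertainty.
        System.out.println(bits(new double[]{0.25, 0.25, 0.25, 0.25}));
        // A peaked distribution: much less uncertainty.
        System.out.println(bits(new double[]{0.97, 0.01, 0.01, 0.01}));
    }
}
```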
Mid-week, my spring semester Operating Systems class came alive for me. I learned that one server had had an issue with its Redundant Array of Independent Disks (RAID). I also learned that, being next to
Carnegie Mellon University,
the U Pitt system is part of the Andrew File System. It was exciting to have direct experience with some concepts I had just learned about in books.
I am living in the Shadyside neighborhood of Pittsburgh, which I really enjoy. I can walk the two miles to U Pitt, but I can also take the bus. Everything is conveniently located. People are friendly, the city is clean, and a lot of the buildings have old and ornate architecture. And it is so nice how drivers don't start to accelerate while I am in the crosswalk! It really is!
Back to Top
I am continuing to learn about probability and its relation to language and translation modeling through reading and daily meetings with Rebecca. As an exercise, I am working on a fun Java program which emulates a language prediction experiment from
Prediction and Entropy of Printed English, Claude E. Shannon, 1951.
Shannon proposed that humans were the most accurate algorithmic model for predicting the English language because they possess "implicitly, an enormous knowledge of the statistics of the (English) language". My program will take any text file and prompt the human user to guess the next character, one at a time. It will then use the number of guesses to estimate the uncertainty, or entropy, of this human model.
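The heart of such a program can be sketched as follows, assuming Shannon's upper-bound formula over tallies of how many guesses each character took (the class and tallies here are illustrative, not my finished program):

```java
// Sketch of the entropy estimate behind Shannon's guessing game.
// guessCounts[i] records how many characters the subject identified
// on exactly guess (i+1). Shannon's upper bound on the entropy of the
// text is H = -sum_i q_i * log2(q_i), where q_i is the fraction of
// characters identified on the i-th guess.
public class GuessingGameEntropy {
    public static double upperBound(int[] guessCounts) {
        int total = 0;
        for (int c : guessCounts) total += c;
        double h = 0.0;
        for (int c : guessCounts) {
            if (c == 0) continue;
            double q = (double) c / total;
            h -= q * (Math.log(q) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Hypothetical tallies: 50 characters guessed on try 1,
        // 25 on try 2, 15 on try 3, 10 on try 4.
        int[] counts = {50, 25, 15, 10};
        System.out.printf("Estimated entropy: %.3f bits/char%n", upperBound(counts));
    }
}
```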
In our weekly NLP meeting,
Dr. Janyce Wiebe
described her work in extracting detailed information about opinions from text, and creating summary representations from it. It is part of
ARDA's AQUAINT (Advanced Question Answering for Intelligence) program. Our weekly MT group continued with our maximum entropy paper. Some parts were over my head, but the concepts and the math were cool nonetheless.
I have met some of the computer science graduate students here. One grad student, Theresa, has studied Japanese for many years, including while pursuing her doctorate here at Pitt. I think pursuing related interests like this adds to one's worth in one's field, and to one's enjoyment of life.
This past weekend, I took in some sights. Carol showed me Phipps Conservatory,
which has rooms representing various plant types, including a room with many butterflies. We also checked out the view from the 36th floor of the
Cathedral of Learning.
I visited the non-denominational
Heinz Memorial Chapel, with its four pairs of 72-foot stained glass windows. I've been walking to and from work, and I often get caught in cool rainstorms while the sun is still out. Lots of rainbows!
Back to Top
This week Rebecca introduced me to the data we'll be working with. It is a corpus of Chinese sentences. Its words have been aligned with English words, and also tagged with the English words' parts of speech (called POS tagging). This has all been done by machine. I am glad we are dealing with Chinese, because I have long been interested in how it is structured. In fact, I got Rebecca to give me a mini-mini-tutorial, although I do not actually have to be familiar with the language in order to do this project.
I am working on a Java program that labels a monolingual corpus with its POS tags. In short, it labels the POS of the words in a corpus based on the probabilities of certain tags for certain words; the system has been trained with these probabilities. Chenhai is writing the POS training and tagging system for our projects. I am writing a simple POS tagger with a limited vocabulary and tag set, in order to more fully understand the process.
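The simplest version of such a probability-based tagger is a unigram (most-frequent-tag) model. This sketch is my own illustration of the idea, not Chenhai's system; the tag names and fallback choice are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal unigram POS tagger sketch: tag each word with its most
// frequent tag from training counts, falling back to a default tag
// for unseen words.
public class UnigramTagger {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Training: count how often each word appears with each tag.
    public void train(String word, String tag) {
        counts.computeIfAbsent(word, w -> new HashMap<>())
              .merge(tag, 1, Integer::sum);
    }

    // Tagging: choose the most frequent tag seen for the word.
    public String tag(String word) {
        Map<String, Integer> tags = counts.get(word);
        if (tags == null) return "NN"; // default tag for unseen words
        return tags.entrySet().stream()
                   .max(Map.Entry.comparingByValue())
                   .get().getKey();
    }

    public static void main(String[] args) {
        UnigramTagger t = new UnigramTagger();
        t.train("book", "NN");
        t.train("book", "NN");
        t.train("book", "VB");
        System.out.println(t.tag("book")); // prints "NN"
    }
}
```

A real tagger also conditions on context (the surrounding tags), but the word-level counts above are where the training probabilities come from.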
In our weekly NLP meeting,
Diane Litman and a colleague presented their work on
Spoken Dialogue for Intelligent Tutoring Systems (ITSPOKE),
which builds a machine physics tutor. The computer tutor "spoke" and used voice recognition to interpret students' voices. They statistically compared the human and machine tutors, evaluating how much the student learned, how many turns were taken, and turn length.
This week our Machine Translation group is reading
What's In a Translation Rule?, by Michel Galley, Mark Hopkins, Kevin Knight and Daniel Marcu.
The authors discuss new work in transforming a parse tree containing syntactic labels (Noun, Noun Phrase, etc.) for a corpus in one language, and using those transformations to align it to a second language. It was pretty dense reading, so we made a lot of progress and then kind of bottlenecked midway through the article.
Back to Top
This week was exciting because I got more of a sense of how my work is connected with Carol's and Chenhai's. You can read about this in more detail in the Research section of this site.
I also started using an xterm window that supports Chinese characters. Since the data sets we use to train and test the POS tagging system contain English and Chinese, it is great to be able to actually view the Chinese characters and not just their English keyboard representations.
We're wrapping up my study of POS tagging models. I am beginning to write a program that maps the different English and Chinese POS tags we are using to a common set of core tags, so that everything is compatible. I also need to write a program that formats our own sets of test and training data so they can be accepted by the POS tagging program.
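The core-tag mapping amounts to a lookup table from each language-specific tag to a shared tag. This is a rough sketch; the specific tag names are illustrative, not our actual tag sets:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of mapping language-specific POS tags to a shared core tag
// set, so taggers trained on different tag sets produce comparable
// output. The tag names below are illustrative examples only.
public class CoreTagMapper {
    private static final Map<String, String> CORE = new HashMap<>();
    static {
        // English (Penn Treebank style) tags
        CORE.put("NN", "NOUN");  CORE.put("NNS", "NOUN");
        CORE.put("VB", "VERB");  CORE.put("VBD", "VERB");
        CORE.put("JJ", "ADJ");
        // Chinese Treebank style tags
        CORE.put("VV", "VERB");  CORE.put("VA", "ADJ");
    }

    // Unrecognized tags fall into a catch-all OTHER category.
    public static String toCore(String tag) {
        return CORE.getOrDefault(tag, "OTHER");
    }

    public static void main(String[] args) {
        System.out.println(toCore("NNS")); // prints "NOUN"
        System.out.println(toCore("XYZ")); // prints "OTHER"
    }
}
```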
Our MT group worked through the bottleneck in last week's article, but it wasn't easy. And in our NLP group, Rebecca hosted a talk about presenting research at conferences. Professors and grad students gave advice and recounted their experiences.
Some fun links in this regard:
David Patterson's "How to Give a Bad Talk" website and
Geoff Pullum's Six Golden Rules.
I enjoyed checking out the Squirrel Hill neighborhood of Pittsburgh last weekend. Then this weekend, four of us went Downtown to a (free!) arts festival with music and vendors. Let's just say I had rocky road ice cream for dinner!
Back to Top
Last weekend and Monday meant quite a few hours of programming for me. (It's okay, I need the practice.) I finished the program that maps the different English and Chinese POS tags that we are using to a set of core tags. This allows tests run on these files to be comparable, because they can now contain the same set of tags. I got it working in English, but with one glitch -- the program couldn't make any sense of Chinese characters. So I found some Java code on the internet that showed me how to support a particular character encoding when reading a file. I had to modify it a bit, otherwise my 737 MB file would've tried to store itself in one variable --- PROBABLY not a good thing!!
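The general shape of the fix is to stream the file line by line through a reader with an explicit character encoding, rather than slurping the whole file. This sketch is my own illustration, with the encoding name and input as placeholders (in practice the stream would be a FileInputStream over the corpus file):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Sketch: read text line by line with an explicit character encoding,
// so multi-byte Chinese characters decode correctly and the whole
// file is never held in one variable at once.
public class EncodedReader {
    public static long countLines(InputStream in, String encoding) throws IOException {
        long lines = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, encoding))) {
            while (reader.readLine() != null) lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Two lines of Chinese text ("hello" / "world"), encoded as UTF-8.
        byte[] utf8 = "\u4f60\u597d\n\u4e16\u754c\n".getBytes("UTF-8");
        System.out.println(countLines(new ByteArrayInputStream(utf8), "UTF-8")); // prints "2"
    }
}
```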
I also finished the program that formats the corpus data to be input into the training and tagging programs. After testing and debugging, I am using my formatting and mapping programs on a data set containing 240,000 sentences, with English words mapped to Chinese. This is the set we are using to train the POS Tagging program. It is cool to have created the tools needed to work on such a large scale.
Rebecca and I started test-running Thursday. The craziness! Input and output files everywhere! And two versions of a program file ended up with the same name. D'oh! I put in some serious organizing on Friday, but I think I got a little burned out on command-line file maintenance for a while.
We are testing how well our data trains the POS tagging system. Basically, we used a very simple direct projection method to tag the Chinese words. Since this had not been very accurate in prior work on English-to-French projection, we did not expect it to be accurate for English-to-Chinese, because English and Chinese are much less similar than English and French. And we were right -- it was quite inaccurate! Now the fun part is analyzing the data and finding out how I might improve our projection technique so that accuracy increases.
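Direct projection itself is simple enough to sketch: each aligned Chinese word just receives the tag of the English word it is aligned to. The alignment representation, tags, and placeholder for unaligned words below are illustrative, not our exact data format:

```java
import java.util.Arrays;

// Sketch of direct POS-tag projection across a word alignment.
public class DirectProjection {
    // alignment[i] = index of the English word aligned to Chinese
    // word i, or -1 if the Chinese word has no English counterpart.
    public static String[] project(String[] englishTags, int[] alignment) {
        String[] chineseTags = new String[alignment.length];
        for (int i = 0; i < alignment.length; i++) {
            chineseTags[i] = alignment[i] >= 0 ? englishTags[alignment[i]] : "NONE";
        }
        return chineseTags;
    }

    public static void main(String[] args) {
        String[] engTags = {"PRP", "VB", "NN"};
        int[] align = {0, 2, -1}; // third Chinese word is unaligned
        System.out.println(Arrays.toString(project(engTags, align)));
        // prints "[PRP, NN, NONE]"
    }
}
```

The simplicity is also the weakness: wherever the two languages use different parts of speech for aligned words, the projected tag is simply wrong, which is why the projected training data needs cleaning up.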
But the action doesn't stop there! We are reading a large but well-written paper in our MT group.
It deals with text classification, or classifying documents into categories such as headline news, business, etc. It discusses extending an existing algorithm so that text classification systems can be trained with fewer labeled examples (which are expensive) and more unlabeled ones. Such a method would be advantageous for, say, an internet newsgroup or website that wants to learn its users' interests through text classification.
Back to Top
This week, I have been writing some programs that run preliminary studies, in order to explore some characteristics of the POS data. Some of these studies might suggest ways to improve the POS tagger's accuracy by improving the quality of the projected training data. Right now we are mostly replicating the methods that Yarowsky & Ngai used to improve English-to-French training data. Our expectation is that these methods may increase accuracy, but probably not greatly.
Some of the studies I did include identifying the distribution of POS tags over all occurrences of a distinct word in the corpus. I also surveyed the overall distribution of POS tags in the corpus. Other studies included measuring the composition of an OTHER tag to which we mapped certain categories, and a not-so-successful attempt at measuring the verbosity of Chinese compared to English. I am interested in this because English is considered a more verbose language than Chinese, whereas the opposite is true of French relative to English.
I also wrote a program that filters the data we use to train the tagger. Rebecca had suggested that we could filter the training set by altering the proportions of two different correspondences. So, my program allows us to specify the maximum percentage of English-words-with-no-Chinese-counterpart and the maximum percentage of Chinese-words-with-no-English-counterpart in a sentence. Any sentences that exceed these percentages are thrown out. Our hypothesis is that decreasing the allowable level of these missing correspondences may help increase the accuracy of a POS tagger trained on this data.
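The filtering criterion can be sketched as a simple threshold test per sentence pair; the thresholds and counts below are illustrative:

```java
// Sketch of the sentence filter described above: drop any sentence
// pair where the fraction of unaligned words on either side exceeds
// a specified threshold.
public class AlignmentFilter {
    // Returns true if the sentence pair should be kept for training.
    public static boolean keep(int englishWords, int unalignedEnglish,
                               int chineseWords, int unalignedChinese,
                               double maxUnalignedEng, double maxUnalignedChi) {
        double engRatio = (double) unalignedEnglish / englishWords;
        double chiRatio = (double) unalignedChinese / chineseWords;
        return engRatio <= maxUnalignedEng && chiRatio <= maxUnalignedChi;
    }

    public static void main(String[] args) {
        // 20-word English sentence with 2 unaligned words (10%),
        // 18-word Chinese sentence with 9 unaligned words (50%):
        // rejected under a 20% threshold on each side.
        System.out.println(keep(20, 2, 18, 9, 0.20, 0.20)); // prints "false"
    }
}
```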
I am finding this work pretty exciting because POS tag projection has been done only once, for English to French, and has not been attempted at all for other language pairs, including harder cases like English to Chinese.
In our NLP group, Theresa Wilson did a practice run of her
talk for an upcoming Summer 2004 conference in San Jose. Her work with
Dr. Janyce Wiebe and
Dr. Rebecca Hwa
is the first attempt by anyone to identify subjective clauses (those expressing opinion) and classify their strength.
Also, we wrapped up last week's mega-paper in our MT reading group.
Back to Top
It was a big week with our presentations, but I had fun this week too. Sunday was July 4th, so Carol and I went downtown to watch the symphony play and then see the fireworks. There was a big crowd, and when it was over, the crowd was like a big river pouring toward the bus stops and parking areas.
I worked hard this week getting ready for my presentation. I ran tests over the weekend and earlier this week so that I could have the results ready at the presentation. The tests involved filtering the training data according to different metrics, and testing the accuracy of the resulting tagger. Generally, the results show that certain filtering does improve the quality of the training data, although not by a huge amount. This is pretty much what we were expecting.
I haven't had much experience with shell scripting, so there I was, creating directories and running tests by hand, and taking quite a bit of time to do it. Then Rebecca showed me how to write a Perl script that did all the work for me. It ran my tests automatically, created directories, and then deleted them so I didn't run over my disk quota (only 10% remaining) or fill the near-full local /tmp directory. It was exciting! -- like watching cookies come out of the oven. Definitely something I'm going to learn more of.
Wednesday, both Carol and I presented the progress of our summer research at the weekly NLP meeting. I was pleased that my knowledge of the subject had started to come together. However, I was glad that Rebecca was there to help field questions.
The rest of the week I tidied up my program files and their comments since other people involved in Rebecca's project may be using them, and I also set the permissions for them from my account. I've started writing my final report. There are still more tests to run, but I figure I might as well write it down while it's still fresh from my presentation.
On Friday we had a nice treat. Carol, Rebecca and I had lunch at a nice Italian restaurant with Distributed Mentor Project participants from Carnegie Mellon University, along with CMU professor and DMP coordinator
Dr. Jessica Hodgins.
Wowwee! I did not go into work either day this weekend because things are slowing down a little. Saturday Carol and I went to the
National Aviary. Lucky for me, Carol knows the bus system well. It was great seeing the friendly birds, and ones from faraway places, but I felt bad for the birds living in small indoor enclosures.
My other complaint is that I do not have access to cable TV and it is killing me that I cannot watch the Tour de France!!
Sunday I read a lot of my remaining library books. One is an astrophysics-for-regular-people book (Before the Beginning, Martin Rees) and the other is my Chinese learner's book. It would take me a VERY long while to learn Chinese, but it's interesting to read how the spoken and written language is structured.
Back to Top
My Last Week
Earlier this week, I ran some new tests. We are using a technique called re-estimation to improve the quality of the POS-projected training data. Due to differences between the POS distributions of French and Chinese, we expected this technique to work better for the previous researchers than for our work. Our results showed that our expectation was correct, but nonetheless we got a nice jump in the accuracy of a POS tagger trained on data modified with re-estimation. Kind of an exciting development for my last week!
Also, I had to modify Rebecca's script for the new tests, so I got a little more practice. And I've spent the rest of the week getting my report near completion. I have been working on these nice Linux Fedora machines for so long now, I don't think I can deal with Windows XP anymore!
Mid-week, Carol organized an outing for some of us to see the musical The Music Man downtown. She had the idea, ordered the tickets, and found out where to go. Carol is wonderfully reliable, responsible and on-the-ball!
Friday, Carol and I gave Rebecca her present. We had made it last week at
Color Me Mine,
but I didn't write about it in my journal since it was a surprise. On the pottery dish we painted some of the mathematical equations we studied, as well as "thank you" on the lid in Chinese characters. I had to practice the characters and was relieved that I didn't mistakenly write something weird. hee hee. I think she liked it!
Well, this is my last week. At my request, Rebecca kindly accepted me on board for eight weeks instead of the more common ten. I am so grateful she was willing to work around my schedule, allowing me to have this very valuable experience! She also spent a very generous amount of time with me, teaching me the material and guiding me in my research. Thanks Rebecca!
And also much thanks to the Computing Research Association's
Distributed Mentor Project for allowing me to have this great opportunity.
Back to Top