My research this summer is in the field of natural language processing (NLP). NLP is concerned with designing systems that can deal with human languages, rather than only with programming languages. Whereas programming languages must be completely unambiguous, human languages are full of ambiguity, which makes NLP a challenging field. Some examples of NLP applications are speech recognition, some search engines, and machine translation.
I am working in machine translation and artificial intelligence, an area that strives to build systems that can teach themselves to translate human languages. Labeling words with their parts of speech, known as POS tagging, provides useful information about a language and is considered a necessary pre-processing step for many NLP applications, including machine translation. While human annotators can be highly accurate at POS tagging, manual annotation is costly in both time and money. In addition, some languages, such as English, have many annotated corpora (bodies of text), while less common languages often have few, if any.
To address this lack of annotated resources, researchers began to consider ways to make use of resources from well-annotated languages. One approach is called projection.
David Yarowsky and
are the only researchers to have published work on this approach, as of this writing.
After modifying their initial projection technique, they obtained high levels of accuracy for English-to-French projections.
Here is the article describing their algorithm.
In my work, I am attempting to replicate their results, this time for English-to-Chinese projections. Since translation issues differ between these two language pairs, I expect that I will need to alter the model to better serve English-to-Chinese projection.
Dr. Hwa's goal is to design a system that can not only project POS tags from one language to another, but also recognize when certain projections are inaccurate. Such projections will then be set aside for annotation by a human. Reaching this goal involves several of us students working together: while I work on the projection algorithm, UPitt student and DMP participant Carol Nichols is programming the annotation and data collection tool, and UPitt graduate student Chenhai Xi has implemented training and testing for the POS tagging system.
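
To make the idea of projection more concrete, here is a minimal sketch in Python. It is not the published algorithm and not our actual system; it only illustrates the general idea under some simplifying assumptions. I assume that a word alignment between the two sentences is already available as a list of (source index, target index) pairs, that each target word simply takes the tag of the source word(s) aligned to it, and that a target word is set aside for a human annotator whenever it has no aligned source word or receives conflicting tags. The function name `project_pos_tags` and this flagging rule are hypothetical and chosen only for illustration.

```python
from collections import Counter


def project_pos_tags(source_tags, alignment, target_length):
    """Project POS tags across a word alignment (illustrative sketch only).

    source_tags:   list of POS tags, one per source-language token.
    alignment:     list of (source_index, target_index) pairs (assumed given).
    target_length: number of tokens in the target-language sentence.

    Returns (projected, needs_review):
      projected[j]  is the tag assigned to target token j, or None.
      needs_review  lists target positions to hand to a human annotator.
    """
    # Collect the tags "voted" onto each target token by its aligned source tokens.
    votes = [Counter() for _ in range(target_length)]
    for src_i, tgt_j in alignment:
        votes[tgt_j][source_tags[src_i]] += 1

    projected = []
    needs_review = []
    for j, counter in enumerate(votes):
        if not counter:
            # No source word aligned here: nothing to project.
            projected.append(None)
            needs_review.append(j)
            continue
        ranked = counter.most_common()
        projected.append(ranked[0][0])
        # Assumed rule: conflicting source tags make the projection suspect.
        if len(ranked) > 1:
            needs_review.append(j)
    return projected, needs_review


if __name__ == "__main__":
    # English "the dog barks" projected onto French "le chien aboie",
    # pretending the aligner failed to link the two verbs.
    english_tags = ["DT", "NN", "VBZ"]
    alignment = [(0, 0), (1, 1)]
    projected, needs_review = project_pos_tags(english_tags, alignment, 3)
    print(projected)     # ['DT', 'NN', None]
    print(needs_review)  # [2]
```

In this toy example the third French word receives no tag, so it is the one that would be passed along to the annotation tool. A real system would need a more careful notion of which projections to trust, which is part of what makes the problem interesting.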