Final Research Report

Applying an Induced POS Tagger Model to English-Chinese Tag Projection

Karina Ivanetich

Department of Computer Science

Mills College

Distributed Mentor Project

Summer 2004

Abstract

Some languages, such as English, are rich in annotated resources, while many others have little or no annotated data. Human annotation, although highly accurate, is costly in both time and money. One solution is to project part-of-speech (POS) tags from an already-annotated language onto another language, a technique called POS tag projection. At present, David Yarowsky and Grace Ngai (2001) are the only researchers to have published a projection model. After modifying data projected from English to French, they obtained promising levels of accuracy.

In my work this summer, I began investigating POS tag projection from English to Chinese, under the guidance of Dr. Rebecca Hwa (Department of Computer Science, University of Pittsburgh). As a starting point, we retraced Yarowsky & Ngai's modifications, using English-to-Chinese POS tag projections. Since Chinese differs from English more than French does, we are finding that Yarowsky & Ngai's model does not apply as well to English-Chinese projection as it did to English-French projection. Our future work will focus on finding additional modifications suited to English-Chinese projection.

1 Overview

Labeling words' parts-of-speech provides useful information about languages and is considered a necessary pre-processing step for many Natural Language Processing applications, including machine translation. A Part-of-Speech tagger is a system that has been trained to label the words of a corpus with POS tags. The training of this system requires tagged training sentences (Figure One).

Tagged training sentences can be generated by having a human annotate, or label, the sentences with POS tags. This tagged text is then used as the data that trains a tagger. One drawback to this method is that annotation is expensive in terms of both time and money. In addition, although tagged corpora are plentiful in some languages, including English, these resources are scarce for most languages. While one option is to absorb the cost of human annotation, researchers David Yarowsky & Grace Ngai have introduced a second option: that of projected tagging. The advantage of this option is that a resource-rich language can be used to tag a language for which annotated resources are scarce.

In this option, one needs a parallel corpus containing an annotated language L and a second language M (Figure Two, a). The two languages are aligned (b) with an IBM Model 3 alignment tool (see Brown et al., 1993). The tags of Language L can then be projected onto Language M (c). Language M and its projected tags can then be used to train a system to tag parts of speech for Language M (the box under c).

2 Prior Research

Yarowsky & Ngai began with an English-French corpus of parliamentary proceedings, the English-French Canadian Hansards, and aligned it with EGYPT, a version of the IBM Model 3 alignment tool. They then projected tags directly from English to French. After they modified their projection process, accuracy improved substantially. Modifications included the following:

* "Aggressive Re-estimation"

* Subscripting Alignments of One-to-Many

Yarowsky & Ngai found that each French word was strongly concentrated on one or two POS tags. They also found that the distribution of raw projected tags for a given word contained numerous misassignments. To correct this problem, they re-weighted the tag probability distribution for each word so that the most frequent tag was most heavily weighted, the second most frequent tag had a discounted weight, and the i-th most frequent tags (i > 2) had no weight. This is termed "aggressive re-estimation".
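The re-weighting described above can be sketched roughly as follows. This is a minimal illustration, not Yarowsky & Ngai's implementation: the function name, the discount value `second_weight`, and the choice to give the top tag all remaining probability mass are assumptions, since this report does not state the exact weights used.

```python
from collections import Counter

def aggressive_reestimate(tag_counts, second_weight=0.3):
    """Re-weight one word's projected tag counts.

    second_weight is a hypothetical discount applied to the relative
    frequency of the second most frequent tag (an assumption; the
    report does not give the value).
    """
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    ranked = Counter(tag_counts).most_common()
    if len(ranked) == 1:
        return {ranked[0][0]: 1.0}
    top_tag = ranked[0][0]
    second_tag, second_count = ranked[1]
    # Discount the runner-up; tags ranked third or lower get no weight.
    p_second = second_weight * second_count / total
    # Assumption: the top tag receives the remaining probability mass.
    return {top_tag: 1.0 - p_second, second_tag: p_second}

weights = aggressive_reestimate({"NN": 6, "VB": 3, "JJ": 1})
```

In this example the third tag, "JJ", is dropped entirely, and the weights of the two remaining tags sum to one.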

One-to-Many alignments are those where one English word is aligned to more than one French word. This is a common occurrence because French is a more verbose language than English. In their example, a method was needed to determine which French word, "Les" or "lois", should receive the NNS (plural noun) tag from the English "Laws". Yarowsky & Ngai's solution is to subscript both French words according to their position in the French compound. Each word is then weighted such that the first word has a higher probability of being tagged a determiner, and the second word a higher probability of being tagged a plural noun.
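The subscripting step might look something like the sketch below. The function name and the numeric suffix notation (`NNS_1`, `NNS_2`) are illustrative assumptions; the report says only that the words are subscripted by their position in the compound.

```python
def subscript_projection(english_tag, target_words):
    """Mark each target word aligned to one English word with its position."""
    if len(target_words) == 1:
        # One-to-one alignment: project the tag unchanged.
        return [(target_words[0], english_tag)]
    # One-to-many alignment: append the position within the compound.
    return [(word, f"{english_tag}_{position}")
            for position, word in enumerate(target_words, start=1)]

pairs = subscript_projection("NNS", ["Les", "lois"])
```

A later step can then estimate, for each subscripted tag, which core tag it most likely corresponds to, for example a determiner for the first position and a plural noun for the second.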

Table One shows the accuracies of taggers trained on different models. Each row represents a different model used to train the tagger. Row one is the initial, direct projection attempt. Row two is accuracy after their modifications. Row three is the accuracy of a tagger trained on human-annotated data and serves as an ideal upper bound. Each column represents a different data set used to test the tagger. Column One is projected data, similar to the data used for training in row two. Column Two is hand-annotated monolingual French data. Both the English-French and monolingual French sets have been mapped to a common set of core tags so that they are compatible.

| Trained on | Tested on auto-projected English-French | Tested on hand-annotated French |
|---|---|---|
| Initial projection | .86 | .82 |
| Yarowsky & Ngai modified | .96 | .94 |
| Hand annotated | .97 | .98 |

TABLE ONE: Yarowsky & Ngai’s Evaluation of Tag Projection/Tag Induction Models. This table is an abbreviated version of Yarowsky & Ngai's published results.

Results show that accuracy from initial projection starts off moderately well (82-86%). This may be due to the fact that English and French are relatively similar. Yarowsky & Ngai's modifications bring accuracy up to 94-96%, which is near the accuracy of a tagger trained on human-annotated data.

3 Our Research: Testing the Accuracy of English-to-Chinese Projection

Ultimately, the goal of this work is to produce a model of English-to-Chinese projection such that the resulting training data can produce a highly accurate POS tagger. More immediately, our goal is to investigate how well the first published approach to POS tag projection, developed for English to French, works for tag projection from English to Chinese.

3.1 Resources

The corpus we used to project POS tags and to train the tagger was the Foreign Broadcast Information Service (FBIS) corpus. It consists of 240,911 sentences of English text, tagged by a trained tagging system and aligned to parallel Chinese text. The alignment tool was GIZA++, a version of IBM Model 3. We used the Chinese Treebank gold standard mainly for testing the tagger. It consists of 15,165 sentences of hand-annotated, monolingual Chinese text. Chenhai Xi (University of Pittsburgh, Department of Computer Science) wrote the Hidden Markov Model tagger, the program used to train the tagger (including re-estimation), and the program that tests the accuracy of the tagger.

3.2 Data Preparation

Since the training and tagger programs require input in Word_Tag format, I wrote a formatter program to produce it. The tagged English text of FBIS contains English POS tags, while the Chinese Treebank contains Chinese POS tags. It was therefore necessary to map both tag sets to a set of core POS tags common to both. I wrote a program that takes as input English or Chinese tags with their mappings to core tags, and outputs the formatted data with the new core tags. I ran this program to produce both formatted FBIS data with core tags and formatted Chinese Treebank data with core tags.
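As an illustration of the formatting and mapping step, a minimal sketch is given below. The mapping table, the core tag names, and the fallback to OTHER for unmapped tags are all hypothetical; the actual core tag set used in this work is not listed in this report.

```python
# Hypothetical mapping from English tags to core tags (illustration only).
CORE_TAGS = {"NN": "N", "NNS": "N", "VB": "V", "VBD": "V", "JJ": "J", "DT": "D"}

def to_core_format(tagged_tokens, tag_map, unknown="OTHER"):
    """Return one sentence as space-separated Word_Tag items with core tags.

    tagged_tokens is a list of (word, original_tag) pairs. Tags missing
    from tag_map fall back to `unknown` (an assumption of this sketch).
    """
    return " ".join(f"{word}_{tag_map.get(tag, unknown)}"
                    for word, tag in tagged_tokens)

line = to_core_format([("laws", "NNS"), ("pass", "VB")], CORE_TAGS)
```

The same function can serve both corpora by passing a different mapping table for Chinese Treebank tags.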

3.3 Initial Projection

Initially we wanted to project directly from English to Chinese. In doing so, we simply took the POS tag of the English word and projected it onto the aligned Chinese word. However, there were correspondences other than simple one-to-one correspondences, for which we had to make decisions.

1) One-to-None Correspondence: an English word for which there is no aligned Chinese word (English-No-Chinese). We decided to throw these correspondences out because there was no Chinese word onto which we could project a POS tag.

2) None-to-One Correspondence: a Chinese word for which there is no aligned English word (Chinese-No-English). We assigned these Chinese words an OTHER tag.

3) One-to-Many Correspondence: one Chinese word aligns with more than one English word. Recall that Yarowsky & Ngai had the opposite issue to contend with. While their concern was with alignments from English to the more verbose French, our concern is with alignments from English to the less verbose Chinese. Our current projection technique is more basic, however. We mapped the Chinese word to the tag of the last English word in the compound. Our reasoning is that many of these phrases are noun phrases, so the last tag is a noun.
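The three decisions above can be sketched for a single sentence pair as follows. The alignment representation (a list of English-index, Chinese-index pairs) and the function name are assumptions made for illustration; this is not the program used in this work.

```python
def project_tags(english_tags, num_chinese, alignment, unaligned_tag="OTHER"):
    """Project English tags onto Chinese word positions.

    english_tags: list of POS tags, one per English word.
    alignment: iterable of (english_index, chinese_index) pairs
               (a hypothetical representation of the aligner output).
    """
    # One-to-Many: for each Chinese word keep the LAST aligned English word.
    last_english = {}
    for e_idx, c_idx in alignment:
        if c_idx not in last_english or e_idx > last_english[c_idx]:
            last_english[c_idx] = e_idx
    projected = []
    for c_idx in range(num_chinese):
        if c_idx in last_english:
            projected.append(english_tags[last_english[c_idx]])
        else:
            # None-to-One: Chinese word with no English counterpart.
            projected.append(unaligned_tag)
    # One-to-None: unaligned English words never appear in `alignment`,
    # so their tags are dropped without any extra step.
    return projected

tags = project_tags(["D", "J", "N", "V"], 3, [(1, 0), (2, 0), (3, 2)])
```

Here the first Chinese word is aligned to two English words and receives the tag of the later one, the second Chinese word is unaligned and receives OTHER, and the unaligned first English word contributes nothing.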

4 Tests and Results

4.1 Initial Tests

Our initial tests, featured in Table Two, accomplished two things. First, they assured us that our core tag set had realistic granularity and was not too simplified. A tagger trained and tested on different hand-annotated Chinese Treebank sets, using the original Chinese tags, achieved 92.7% accuracy. A tagger trained and tested on the same Chinese Treebank sets, but with the mapped core tags, achieved 92.9%. Since these two figures are very close, we know that we will not create an artificially high accuracy level by mapping the data to an overly simplified set of core tags.

Second, we obtained an accuracy of 48.2% for a tagger trained on our directly projected, mapped FBIS data. Note that this is much lower than Yarowsky & Ngai's 82-86% for direct projection. This was expected, as English and Chinese differ in more complex ways than English and French do.

| | Test on Chinese Treebank | Test on Chinese Treebank, Core Tags |
|---|---|---|
| Train on Chinese Treebank | 92.7% | |
| Train on Chinese Treebank, Core Tags | | 92.9% |
| Train on FBIS, Core Tags | | 48.2% |

TABLE TWO: Results from a) training and testing on different Chinese Treebank text, 14,000 and 1,165 sentences respectively; b) training and testing on different Chinese Treebank text mapped to core tags, with the same number of sentences; c) training on FBIS and testing on Chinese Treebank mapped to core tags, 240,911 and 1,165 sentences respectively.

4.2 Filtering

With the goal of exploring what training data leads to better accuracy, our next step was to explore the effect of filtering the training data. Our preliminary study, featured in Table Three, suggested that filtering may have some effect on accuracy. Table Three shows the percentage of each type of correspondence relative to the total number of words in the corpus.

| Correspondence | Percentage of all words |
|---|---|
| None-to-One: Chinese-no-English | 20.1% |
| One-to-None: English-no-Chinese | 30.9% |
| One Chinese to Many English | 14.4% |
| One Chinese to One English | 34.6% |

TABLE THREE: Percentages of All Types of Correspondences, over all words. The first two results suggested that filtering the amount of One-to-None and None-to-One correspondences in the training set may have an effect on tagger accuracy.

None-to-One (Chinese-no-English) and One-to-None (English-no-Chinese) correspondences account for 20.1% and 30.9% of words respectively. Since these are non-trivial amounts, and since these correspondences are not as simply dealt with as One-to-One correspondences, we hypothesized that altering their presence in the training data might have an effect on the accuracy of the resulting tagger.

In order to filter the training data, I wrote a program that allows the user to specify the proportions of Chinese-no-English and English-no-Chinese allowable in each sentence. For example, one might choose to filter the training data so that all allowable sentences contained less than 40% Chinese-no-English correspondences and less than 30% English-no-Chinese correspondences.
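A per-sentence filter of this kind could be sketched as below. How the proportions are computed is an assumption: here the English-no-Chinese proportion is taken over the English words of the sentence and the Chinese-no-English proportion over its Chinese words. The function names, the alignment representation, and the default thresholds are hypothetical.

```python
def unaligned_fractions(num_english, num_chinese, alignment):
    """Return (English-no-Chinese, Chinese-no-English) fractions for a sentence.

    alignment: iterable of (english_index, chinese_index) pairs.
    """
    aligned_e = {e for e, _ in alignment}
    aligned_c = {c for _, c in alignment}
    enc = (num_english - len(aligned_e)) / num_english if num_english else 0.0
    cne = (num_chinese - len(aligned_c)) / num_chinese if num_chinese else 0.0
    return enc, cne

def keep_sentence(num_english, num_chinese, alignment,
                  max_english_no_chinese=0.3, max_chinese_no_english=0.4):
    """Keep a sentence only if both fractions are below their thresholds."""
    enc, cne = unaligned_fractions(num_english, num_chinese, alignment)
    return enc < max_english_no_chinese and cne < max_chinese_no_english

one_to_one = [(0, 0), (1, 1), (2, 2), (3, 3)]
kept = keep_sentence(4, 5, one_to_one)
```

In the usage line, one of five Chinese words is unaligned (a fraction of 0.2), so the sentence passes the default 0.4 threshold but would be rejected under a threshold of 0.1.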

Tables Four and Five show the accuracy of the tagger after training on variations of filtered data. All listed filtering proportions are 0.4 or less because filtering above these amounts had little effect. Note that accuracy does increase as Chinese-no-English correspondences decrease. However, an upper accuracy of 53.9% is still low. Recall that Yarowsky & Ngai achieved 94% after their final modifications. Again, this was expected, in that English and Chinese differ more than English and French do.

Interestingly, accuracy actually decreases as English-no-Chinese correspondences decrease. Presently, we cannot offer an explanation for this, although it is worth noting (in Table Four) that when English-no-Chinese is 0.1, training set size is quite low.

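Rows: English-no-Chinese proportion. Columns: Chinese-no-English proportion. Each cell: percentage accuracy / number of training sentences.

| English-no-Chinese | Chinese-no-English 0.4 | Chinese-no-English 0.3 | Chinese-no-English 0.2 | Chinese-no-English 0.1 |
|---|---|---|---|---|
| 0.4 | 48.0 / 183,714 | 49.6 / 139,722 | 51.5 / 76,605 | 53.9 / 17,768 |
| 0.3 | 48.0 / 162,831 | 49.4 / 122,302 | 51.1 / 64,310 | 53.6 / 13,265 |
| 0.2 | 47.1 / 81,981 | 48.7 / 55,672 | 51.1 / 24,952 | 52.8 / 4,312 |
| 0.1 | 45.2 / 7,417 | 46.4 / 4,683 | 48.7 / 2,180 | 46.1 / 729 |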

TABLE FOUR: Accuracy After Filtering. For each combination of training data filtered on certain proportions of English-no-Chinese and Chinese-no-English: the first figure is the percentage accuracy of the resulting tagger; the second figure is the number of sentences remaining in the training set.

TABLE FIVE: Accuracy After Filtering Bar Graph, created from the percentages in Table Four. Training set size is excluded here. The X-axis is the proportion of English-no-Chinese allowed in the training data. Colored bars within a group represent the proportion of Chinese-no-English allowed in the training data. The Y-axis is the percent accuracy of the resulting tagger. The scale of the Y-axis was chosen to maintain consistency with the improved methodology in Table Seven.

4.3 Re-estimation

4.3.1 Preliminary Testing

Our next step was to explore whether re-estimating the POS tags will improve accuracy. Recall that Yarowsky & Ngai found success with their technique of re-estimation, partly because of their observation that French words tended to center around one or two parts-of-speech. Therefore re-estimating the probability weights of the two most frequent tags eliminated some error. Our preliminary tests have suggested that we may not find as much success with this technique for English to Chinese projection, because the projected POS tags are not so focused around one or two POS tags.

Table Six shows the average number of distinct POS tags projected onto a word across its occurrences in the corpus, broken down by the word's frequency of occurrence. Table Seven shows how often the most frequent tag occurs for each unique word, again broken down by word frequency.

We calculated these results from a file generated by the tagger training program, which contains the POS tag distributions for each of 38,197 unique words in the corpus.

Results in Table Six suggest that the Chinese words, unlike French words, do not focus around a small number of projected POS tags. Words which occur between five and one hundred times average 4.3 different tags. Words which occur one hundred or more times average 6.7 different tags. In addition, it is unlikely that one of these tags predominates or that the other tags are simply erroneous. The results in Table Seven suggest that tags other than the most frequent still claim a substantial amount of occurrences. For example, the most frequent tag for all words claims an average 59% of occurrences, but this still leaves 41% for the remaining tags.
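Statistics of this kind can be computed from per-word tag distributions along the lines of the sketch below. The input format, the function name, the exact band boundaries, and the use of an unweighted average over unique words are assumptions; the report does not specify them.

```python
def tag_stats(distributions, bands=((1, 5), (5, 100), (100, None))):
    """Summarize projected tag distributions by word-frequency band.

    distributions: dict mapping word -> {tag: count} (hypothetical format).
    bands: (low, high) frequency ranges, low inclusive, high exclusive;
           high=None means no upper limit.
    Returns {band: (average number of distinct tags,
                    average share of the most frequent tag)}.
    """
    results = {}
    for low, high in bands:
        in_band = []
        for counts in distributions.values():
            freq = sum(counts.values())
            if freq >= low and (high is None or freq < high):
                in_band.append(counts)
        if not in_band:
            continue  # no words fall in this band
        avg_tags = sum(len(c) for c in in_band) / len(in_band)
        avg_top = sum(max(c.values()) / sum(c.values())
                      for c in in_band) / len(in_band)
        results[(low, high)] = (avg_tags, avg_top)
    return results

stats = tag_stats({
    "a": {"N": 3, "V": 1},
    "b": {"N": 2},
    "c": {"N": 60, "V": 30, "J": 10},
})
```

A low average share for the most frequent tag, as reported above, indicates that discarding all but the top two tags throws away a substantial portion of the projected evidence.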

4.3.2 Results of Adding Re-estimation

In Section 4.2 we presented results for tagger accuracy after training on variations of filtered data. In this section, we present results for tagger accuracy after both re-estimation and training on variations of filtered data.

In training the tagger, we followed Yarowsky & Ngai's manner of re-weighting the tag probabilities, as described in Section Two. The most frequent tag was most heavily weighted, the second most frequent had a discounted weight, and all others had no weight.

Tables Seven and Eight show the results of these tests. The size of each training set is the same as in the earlier tests in Section 4.2. Compared with those tests, accuracy is higher than that of taggers trained on data that was only filtered. The highest accuracy obtained after re-estimation is approximately 70%, where Chinese-no-English is 0.1 and English-no-Chinese is 0.4, 0.3, or 0.2. The highest accuracy on filtered-only data was 53.9%, where Chinese-no-English is 0.1 and English-no-Chinese is 0.4.

Interestingly, accuracy no longer decreases as English-no-Chinese correspondences decrease, except for the case where both correspondence percentages equal 0.1 and size of training set is the lowest.

Re-estimating the weight of the POS tags in the training data did increase the accuracy of the resulting tagger. However, we did not achieve the higher accuracy levels of 94-96% that Yarowsky & Ngai achieved. This was somewhat expected from the preliminary results we presented in the previous section. The POS tags projected onto the Chinese words did not concentrate around one or two tags as they did for French in Yarowsky & Ngai’s work.

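Rows: English-no-Chinese proportion. Columns: Chinese-no-English proportion. Each cell: percentage accuracy / number of training sentences.

| English-no-Chinese | Chinese-no-English 0.4 | Chinese-no-English 0.3 | Chinese-no-English 0.2 | Chinese-no-English 0.1 |
|---|---|---|---|---|
| 0.4 | 64.0 / 183,714 | 66.1 / 139,722 | 67.7 / 76,605 | 69.9 / 17,768 |
| 0.3 | 64.0 / 162,831 | 66.2 / 122,302 | 67.7 / 64,310 | 70.0 / 13,265 |
| 0.2 | 63.5 / 81,981 | 65.6 / 55,672 | 67.7 / 24,952 | 70.6 / 4,312 |
| 0.1 | 63.1 / 7,417 | 64.4 / 4,683 | 66.7 / 2,180 | 65.7 / 729 |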

TABLE SEVEN: Accuracy after re-estimation has been applied along with the filtering from Table Four. The first figure is the percentage accuracy of the resulting tagger; the second figure is the number of sentences remaining in the training set.

TABLE EIGHT: Accuracy After Re-estimation Bar Graph, created from the percentages in Table Seven. Training set size is excluded here. The X-axis is the proportion of English-no-Chinese allowed in the training data. Bars within a group represent the proportion of Chinese-no-English allowed in the training data. The Y-axis is the percent accuracy of the resulting tagger. Comparison with Table Four shows an increase in accuracy of roughly 16 to 20 percentage points.

5 Further Directions

Our results have shown that we will have to go beyond the current model to further improve the accuracy of a POS tagger trained on projected data. One way to do this is to investigate the effects of our decision to reduce a One-to-Many correspondence to a One-to-One correspondence by selecting only the last tag. Our motivation for this decision was that these compounds are often noun phrases, so choosing the last tag would correctly represent the Chinese word as a noun. However, it is possible that the last word is instead punctuation. In this case, our decision may be decreasing accuracy. Since One-to-Many correspondences make up 14.4% of total words (Table Three), investigating this problem may make a moderate difference in tagger accuracy.

6 Conclusion

A system that labels the words of a corpus with Part-of-Speech (POS) tags must be trained with sentences whose words have been tagged. Human annotation is costly, so it would be advantageous if languages with plentiful annotated resources could be used to tag other scarcely-annotated languages. One option is POS tag projection. David Yarowsky & Grace Ngai are presently the only researchers to have published an attempt at POS tag projection. They projected tags from English to French, identified and omitted or modified low-quality training data, and obtained promising results.

In my research this summer, we began investigating English-to-Chinese tag projection by retracing Yarowsky & Ngai's methodology. As English and Chinese differ more than English and French do, we are finding, as somewhat expected, that we need to work beyond the Yarowsky & Ngai model in order to improve the quality of the projected training data.

After English tags were initially projected onto Chinese in the training set, with no modifications, tagger accuracy was rather low (48.2%). Next we filtered the data by selecting out sentences with certain percentages of English-no-Chinese and Chinese-no-English correspondences. Tagger accuracy did increase as Chinese-no-English correspondences decreased. However, the best accuracy of 53.9% was, again, not very high.

Our next step was to re-estimate the weights of the two most frequent POS tags. This, combined with filtering, improved accuracy to approximately 70%. In this situation, increased accuracy resulted from decreasing both English-no-Chinese and Chinese-no-English correspondences.

In all these instances, our results were well under those of Yarowsky & Ngai. This was largely expected, and we attribute it most generally to the complex differences between English and Chinese.

Future work will focus on further improvements to the projection model that raise the accuracy of the resulting POS tagger.

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. Proceedings of HLT-2001.