
Annotating and Translating Non-Literal Expressions: A Pilot Study

 

Kaitlyn Mulcrone

University of Minnesota, Morris

mulcr002@morris.umn.edu

 

ABSTRACT

 

Non-native speakers and machine translation systems have difficulty understanding non-literal phrases like "keep to the shadows" or "up to a point". Here we present an analysis of non-literal expressions in newspaper articles. First, 120 articles were manually annotated for the presence of five classes of non-literal expressions. Then the sentences containing them were automatically translated with Google Translate into three languages. The original sentences, with definitions of the non-literal phrases, and the translations were given to bilingual speakers to annotate the translation errors. Errors in non-literal phrases occurred more often than other translation errors. We present a detailed error analysis.

 

Categories and Subject Descriptors

I.2.7 [Natural Language Processing]: Text analysis; H.3.1 [Content Analysis and Indexing]: Linguistic processing

 

Keywords

non-literal expressions, corpus, leads

 

1. INTRODUCTION

               In English, "sticking it to the man" has nothing to do with literally sticking an object to a man's body; instead it refers to defying or resisting authority. Phrases like this one are known as idioms, one of the types of non-literal expressions that are used quite often in speech and text. Non-literal expressions are phrases whose meaning differs from the meaning of the words that make them up. Many non-literal expressions are culturally dependent, which makes them hard for non-native speakers to understand. It could be said that machine translation systems currently share that cultural disconnect. Like non-native speakers, machine translation systems will not know the majority of these expressions without being taught them.

               Machine translation systems can now translate between many languages, some better than others, but the translations are still not as accurate as they could be. There are many different kinds of translation errors that need to be dealt with. Among them, are non-literal expressions a consistent source of error? The objective of this study is to discover whether non-literal expressions have a high translation error rate and whether focusing on improving non-literal expression translation would significantly reduce machine translation errors.

               This paper presents a two-part pilot study analyzing non-literal expressions in newspaper articles. Section 2 places the study among related work, explaining how it is similar to and differs from other studies. Section 3 describes the manual annotation process for non-literal expressions and presents the results. Section 4 analyzes the machine translation of the annotated sentences into three languages, focusing on translation errors. Finally, Section 5 concludes and discusses future work.

 

2. RELATED WORK

               There is interest among NLP researchers in identifying the difference between literal and non-literal language. Among different approaches, (Birke and Sarkar, 2006) use a clustering approach to identify whether verbs are being used literally or non-literally, and (Li et al., 2010) try to differentiate based on word sense in order to identify idioms. A more in-depth study of non-literal phrases examined 17 idioms that had both non-literal and literal meanings. The authors collected the surrounding words for cases where a phrase was used literally and cases where it was used non-literally, tested multiple sentences for each phrase with the phrase used both ways, and measured the accuracy of their algorithm. The research (Sporleder and Li, 2009) had positive results, but accuracy still varied across the different phrases.

               Research has also been done on translating multi-word expressions. For instance, (Carpuat and Diab, 2010) worked on two task-based integration strategies to help English-Arabic translation. Multi-word expressions, like non-literal expressions, sometimes have a meaning separate from that of their individual words, and integrating these phrases and their definitions is a start toward making non-literal expressions translate more smoothly.

 

3. NON-LITERAL PHRASES

               Very little annotation has been done on non-literal expressions. There is no automatic way to search for these phrases, so the annotation process has to be done manually. In the following sections, I will describe the corpus, or compilation of documents, that was created and used for this study, explain the annotation process, and provide some results.

 

3.1 Corpus

               The corpus used for this study consists of a collection of New York Times articles. It was generated for studies analyzing different aspects of writing or different writing styles. Articles from the years 2005 and 2006 were extracted from the New York Times annotated corpus (Sandhaus, 2008), and to add variety in writing style, the corpus includes articles from four genres: business, science, sports, and international relations. The articles from each genre were extracted using the method of (Louis, 2012). The business and sports articles were gathered using the section headers "Top/News/Business" and "Top/News/Sports." The international relations articles were those manually tagged "United States International Relations." The extraction of the science articles was more complicated. Their definition of a science article was a research article. They first extracted all articles with the tags "Medicine and Health, Computers and the Internet, Religion and Churches, Research, Space, Physics, Brain, Evolution, Disasters, Language and Languages, Environment." Then, because not all articles under these tags are research articles, they created a handmade dictionary of research-related terms and used it to remove all of the articles that did not use enough of those terms.

               Using those methods, the original version of the corpus contained both the full articles and each article's lead, or its first two paragraphs. This study and other studies with this corpus focus on the article leads, so all leads longer than 200 words were removed. Additionally, other articles were removed because their leads lacked content; for instance, articles whose lead was information about a piece of artwork or that were letters to the editor were deleted. The final reduced corpus contains 13,247 business articles, 2,974 science articles, 11,530 sports articles, and 2,929 international relations articles, for a total of 30,980 articles.
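
               As a rough illustration of the filtering just described, the Python sketch below checks a science article's use of research-related terms and a lead's length. The term list, threshold, and function names are assumptions for illustration only; they are not details taken from (Louis, 2012) or from this study.

import re

# Hypothetical stand-in for the handmade dictionary of research-related terms;
# the actual dictionary and cutoff used by (Louis, 2012) are not reproduced here.
RESEARCH_TERMS = {"study", "researchers", "experiment", "evidence", "data", "findings"}
MIN_RESEARCH_TERMS = 3      # assumed cutoff for keeping a science article
MAX_LEAD_WORDS = 200        # leads longer than 200 words were removed

def looks_like_research_article(article_text):
    # Keep a science-tagged article only if it uses enough research-related terms.
    tokens = set(re.findall(r"[a-z]+", article_text.lower()))
    return len(tokens & RESEARCH_TERMS) >= MIN_RESEARCH_TERMS

def keep_lead(lead_text):
    # Keep a lead only if it is at most 200 words long.
    return len(lead_text.split()) <= MAX_LEAD_WORDS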

 

3.2 The Annotation Process

               In order to look at more articles, the annotation was done on the leads rather than the full articles. Thirty leads were randomly selected from each genre, for a total of 120 leads. In addition to annotating non-literal expressions, I was asked to split them into different non-literal categories. Over the course of the annotation process I established four categories: NNL, ID, RID, and PV. A phrase is labeled NNL when it is a noun phrase used non-literally, ID when it is an idiom whose words have nothing to do with the non-literal meaning, RID when it is an idiom whose words relate to the non-literal meaning, and PV when it is a phrasal verb. Each lead was read by the same annotator and marked according to these categories.

 

3.3 Challenges

               Although the annotation process was manual, it was one of the first of its kind. There were many instances where it was difficult to decide whether a phrase was non-literal or not. Over the course of two weeks the definitions of the categories changed multiple times to guarantee that each category was distinguished from the others. During these weeks I went from reading 10 leads in each genre to the full 30 leads per genre. The leads were reevaluated on multiple occasions before the counts were finalized. There were many odd phrases that I made note of, but only four distinct categories emerged. For instance, many non-literal phrases also have a literal meaning, but I decided not to label literal instances of non-literal expressions.

 

3.4 Results 

               Once all the categories were well defined and all the leads reevaluated, I analyzed the data and found the following results. Out of the 120 leads, 56 contained non-literal expressions, slightly less than half. There were 80 instances of non-literal expressions in total, covering 73 unique expressions. The full list of unique expressions and various results can be found in the table at the end of the paper. The number of non-literal expressions did not vary much by genre, except that business had a few more leads without any non-literal expression than the rest. The exact counts for each genre and category can be found in Figure 1.

 

Non-Literal Phrases by Category

            NNL   ID   RID   PV   Total
Business      4    2     3    5      14
Science       6    0    12    6      24
Sports        5    2     5    5      17
Politics      2    2     8   13      25
Total        17    6    28   29      80

 

Figure 1: The table shows the number of non-literal phrases for each genre by the categories established in this study. The total of each category is the sum of the counts from each genre.

 

               With 73 unique non-literal expressions available, a small experiment was conducted to see how often these phrases occurred throughout the whole corpus and how many times they occurred literally versus non-literally. After searching the corpus and collecting all the sentences that contained one of the 73 non-literal phrases, the results showed that 45% of the phrases occurred only 1 to 2 times in the entire corpus. The rest of the results can be found in Figure 2; as the number of occurrences increased, the percentage of non-literal phrases decreased. This suggests that if another sample of leads were taken, it would be more likely to add new non-literal phrases to the list than to repeat ones already on it.

 

Number of Occurrences in the Corpus

# of Occurrences   Percentage
1-2                45%
3-10               21%
11-25              16%
25-100             14%
Over 100            4%

Figure 2: The table shows the percentage of non-literal phrases in each range of how many times the phrases occurred in the entire corpus.
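
               A minimal sketch of how percentages like those in Figure 2 could be computed, assuming the corpus sentences and the 73 phrases are available as Python lists. The simple substring matching is an assumption for illustration; the original counting procedure is not described in that detail.

from collections import Counter

def occurrence_bucket(n):
    # Map a phrase's occurrence count to the ranges used in Figure 2.
    if n <= 2:
        return "1-2"
    if n <= 10:
        return "3-10"
    if n <= 25:
        return "11-25"
    if n <= 100:
        return "25-100"
    return "Over 100"

def occurrence_percentages(phrases, corpus_sentences):
    # Count the sentences containing each phrase, bin the counts, and
    # report what percentage of the phrases falls into each bin.
    counts = {p: sum(p.lower() in s.lower() for s in corpus_sentences)
              for p in phrases}
    bins = Counter(occurrence_bucket(n) for n in counts.values())
    total = len(phrases)
    return {b: round(100 * c / total) for b, c in bins.items()}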

 

               Continuing to the next step, I took the compiled file of all corpus sentences containing one of the non-literal phrases, manually read each sentence, and marked whether the phrase was used literally or non-literally. These judgments were made over a couple of days, with over 1,000 sentences to categorize. Although most of the non-literal phrases occurred only a few times, there were a couple of common phrases that occurred over 100 times. Most of these were phrases that do occur literally, like "the man," which occurs more often literally than non-literally. The results showed that only 8 of the phrases occurred literally more often than non-literally, 5 phrases occurred literally as often as non-literally, and 8 occurred non-literally more often than literally. The other 52 phrases occurred only non-literally in the corpus; however, only 11 of those 52 phrases occurred more than 5 times. It is important to note that if a phrase occurred non-literally in all of its appearances in the corpus but appeared fewer than 5 times, there is a higher chance that the phrase could still occur literally outside the corpus. This chance decreases the more times the phrase appears non-literally in the corpus. There were also phrases that are certain to occur literally outside the corpus. For instance, "cool off" only occurred non-literally inside the corpus, but its literal definition is the more common one.

 

4. NON-LITERAL TRANSLATION

               The annotation process produced 67 unique sentences marked with non-literal expressions. With these sentences, a study was conducted to find out how often non-literal expressions are mistranslated by machine translation.

 

4.1 The Mark Up

               This study was conducted on three languages: Bulgarian, Tamil, and Mandarin. For each language a native speaker was found to check the English-to-native-language translations. Each native speaker was provided with a document containing all 67 non-literal expressions, their definitions, and the original annotated sentences that contained them. The native speakers were asked to translate each sentence using Google Translate and then state whether or not the non-literal expression was translated correctly. It was made clear that, to be translated correctly, the sentence had to convey the non-literal meaning. Because there are no similar studies of this nature, the native speakers were also asked to comment on other aspects of the translation, including other errors in the sentence, whether the expression was translated literally rather than non-literally, and whether the phrase was understandable but sounded off in the language. Figure 3 shows an example sentence and how it was marked up in each of the three languages. The native speakers were not given due dates, and each completed the mark-up at their own pace.

 

Mark-Up Example

Given:

Word & Non-literal definition: bum rap - an unfair punishment or false charge

Sentence: The word ''virtual'' has gotten <NNL>a bum rap</NNL> since it appeared a few centuries ago. It means ''the same, but not really.''

 

Bulgarian Mark-Up: The word ''virtual'' has gotten <NNL><GT>a bum rap</GT></NNL> since it appeared a few centuries ago. It means ''the same, but not really.''

Tamil Mark-Up: The word ''virtual'' has gotten <NNL><GT>a bum rap</GT></NNL> since it appeared a few centuries ago. It means ''the same, but not really.'' 

Mandarin Mark-Up: The word ''virtual'' has gotten <NNL><GT>a bum rap</GT></NNL> since it appeared a few centuries ago. It means ''the same, but not really.''

Figure 3: The table shows what the native speakers were given and how they tagged the sentence (CT = correct translation, GT = garbled translation). In this case all three languages had a garbled translation.
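
               The per-category tallies reported in Section 4.3 can be recovered mechanically from mark-ups in the style of Figure 3. The Python sketch below assumes each marked sentence uses nested category and verdict tags exactly as shown above; the actual tallying in this study was done by hand, so this is an illustration rather than the procedure used.

import re
from collections import defaultdict

CATEGORIES = ("NNL", "ID", "RID", "PV")
VERDICTS = ("GT", "CT")     # GT = garbled translation, CT = correct translation

def tally_markup(marked_sentences):
    # Count garbled and correct translations per non-literal category.
    counts = defaultdict(lambda: {"GT": 0, "CT": 0})
    for sentence in marked_sentences:
        for cat in CATEGORIES:
            for verdict in VERDICTS:
                pattern = rf"<{cat}><{verdict}>.*?</{verdict}></{cat}>"
                counts[cat][verdict] += len(re.findall(pattern, sentence))
    return counts

def garbled_percentage(counts, cat):
    # Share of phrases in a category whose translation was garbled.
    gt, ct = counts[cat]["GT"], counts[cat]["CT"]
    return 100 * gt / (gt + ct) if gt + ct else 0.0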

 

 

4.2 Challenges

               When the translation study was complete, each of the three native speakers had clearly stated whether or not each phrase translated correctly, but the additional information they provided varied greatly. Due to the lack of instruction, the different timelines on which the mark-ups were completed, and the differences in sentence structure between the three languages, the tags other than "translated correctly" and "garbled translation" varied widely between the three mark-ups. The Bulgarian mark-up provided 6 distinct tags with tag definitions. Bulgarian also happened to have the sentence structure most similar to English. The Tamil mark-up tagged only whether or not the phrase translated and added comments below rather than creating additional tags. The Mandarin mark-up followed the Bulgarian tags quite closely but added 2 more tags. We discovered that Tamil and Mandarin have sentence structures so strikingly different from English that, without some editing, almost none of the translated sentences made sense. We realized this halfway through the mark-ups, along with the possibility that the speakers were marking different things. Because of these discrepancies, the results section focuses only on whether the non-literal phrase translated correctly and not on any of the additional information provided by the native speakers.

               The languages for this study were chosen out of convenience, based on the first native speakers we could find, and it turns out they might not have been the best choices. The structural differences between Mandarin, Tamil, and English may have caused some major differences in the mark-ups. Still, the goal of the study was to look at the translation of the non-literal phrases in each sentence, and that remained a reasonable task even if the sentence structure got garbled through the translator.

 

4.3 Results 

               To discuss the results of this study I will first describe each language's results individually and then end with the combined results. Having divided the non-literal phrases into categories, we expected some categories to have a higher percentage of mistranslations than others. The RID group was expected to have the lowest mistranslation percentage because the words in the phrase hint at the non-literal meaning, whereas the ID category was expected to have a higher percentage because its words have no connection to the non-literal meaning. No direction was expected for the NNL and PV categories because there was little to indicate how those phrases would translate into different languages.

 

4.3.1 Bulgarian Results

            The Bulgarian mark-up matched our expectations, with RID having the lowest garbled translation percentage and ID having the highest. The sentence counts and garbled translation percentages for the Bulgarian mark-up can be found in Figure 4. For Bulgarian, the PV category translated better than the NNL category.

               Overall, the total percentage of non-literal phrases that did not translate was 53%, only a little over half. The percentage is this close to half because the two larger categories had more correct translations than mistranslations. For a translation system, mistranslating non-literal phrases half the time is a significant problem. However, 53% might not be a large enough percentage to make non-literal phrases a priority for improvement in translation systems.

 

The Bulgarian Mark-Up

          Garbled   Correctly Translated   Percentage Garbled
NNL            12                      5                  70%
ID              5                      1                  83%
RID            11                     16                  40%
PV             11                     12                  47%
Total          39                     34                  53%

Figure 4: The table is divided up by non-literal category and gives counts for how many sentences were translated correctly and how many had a garbled translation. All the categories are totaled at the bottom.

 

4.3.2 The Tamil Results

               The results of the Tamil mark-up differed from the Bulgarian mark-up which, given that the languages are different from each other and Tamil is more different from English than Bulgarian is, was not a surprise. Tamil did match the expectation that the RID category would have the lowest garbled translation percentage; however, ID did not have the highest percentage; NNL did. The NNL category was the hardest for Tamil to translate. A possible problem here is that the NNL category had 17 phrases while ID had only 6, nearly three times fewer. Because the ID category had so few phrases, the comparison might not be as reliable. The full results of the Tamil mark-up can be found in Figure 5.

               Overall, Tamil mistranslated 64% of the non-literal phrases, significantly more than half; in fact, every category had a garbled percentage over 50. Non-literal phrases are a problem for Tamil translation. However, as mentioned previously, the difference in sentence structure between Tamil and English causes more problems than the non-literal mistranslations do.

 

The Tamil Mark-Up

          Garbled   Correctly Translated   Percentage Garbled
NNL            14                      3                  82%
ID              4                      2                  66%
RID            14                     13                  51%
PV             15                      8                  65%
Total          47                     26                  64%

Figure 5: The table is divided up by non-literal category and gives counts for how many sentences were translated correctly and how many had a garbled translation. All the categories are totaled at the bottom.

 

4.3.3 The Mandarin Results

               Of the three languages, the Mandarin mark-up had the most garbled translations; however, the results did not match expectations. The ID category had the lowest garbled translation percentage when it was expected to have the highest. Again, this might be caused by having too few phrases in the category, but since the words in these phrases are not related at all to the non-literal meaning, it is still surprising. The NNL category had the highest garbled translation percentage, which is not surprising after seeing how difficult it was for Tamil. RID had the second highest percentage, which also went against expectations. The difference in sentence structure between English and Mandarin may have contributed to these mistranslations. The full results can be found in Figure 6.

               With the categories combined, the total percentage garbled was 67%, the highest of the three languages. As with the Tamil results, all the non-literal phrase categories had a garbled percentage over 50. However, the same caveat applies: there were more problems translating the sentence structure than translating the non-literal phrases.

 

The Mandarin Mark-Up

          Garbled   Correctly Translated   Percentage Garbled
NNL            14                      3                  82%
ID              3                      3                  50%
RID            18                      9                  66%
PV             14                      9                  60%
Total          49                     24                  67%

Figure 6: The table is divided up by non-literal category and gives counts for how many sentences were translated correctly and how many had a garbled translation. All the categories are totaled at the bottom.

 

4.3.4 The Combined Results

               To combine the mark-ups, we wanted to see how many phrases were a problem for all three languages and how many were a problem for none of them. The full chart of these results can be found in Figure 7. Overall, 37% of the phrases were mistranslated by all three languages, and only 18% of the phrases were translated correctly by all three. That 37% was the highest percentage of any group, followed by the group of phrases that two of the languages mistranslated.

 

The Combined Mark-Up

             NNL   ID   RID   PV   Total
All 3 Lng      9    3     7    8      27
2 Lng          6    1     9    7      23
1 Lng          2    1     3    4      10
No Lng         0    1     8    4      13

Figure 7: The table is divided by how many languages (Lng = languages) mistranslated the non-literal phrase. It gives counts for each category and the total combining the categories.
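
               A minimal sketch of how the Figure 7 breakdown could be tallied, assuming each phrase has a category label and a correct/incorrect verdict for each of the three languages. The four example entries mirror rows from the table at the end of the paper, but the data structure and function names are illustrative assumptions, not the tooling used in the study.

from collections import Counter

# Per-phrase verdicts: (category, correct-in-Bulgarian, correct-in-Tamil, correct-in-Mandarin).
# These example rows mirror entries in the table at the end of the paper.
verdicts = {
    "bum rap":    ("NNL", False, False, False),
    "great deal": ("NNL", True,  False, False),
    "make clear": ("RID", True,  True,  True),
    "call for":   ("PV",  True,  True,  True),
}

def combined_breakdown(verdicts):
    # Group phrases by how many of the three languages mistranslated them,
    # keeping the per-category split used in Figure 7.
    breakdown = Counter()
    for category, *per_language in verdicts.values():
        mistranslated = sum(not ok for ok in per_language)
        breakdown[(mistranslated, category)] += 1
    return breakdown

print(combined_breakdown(verdicts))
# e.g. Counter({(3, 'NNL'): 1, (2, 'NNL'): 1, (0, 'RID'): 1, (0, 'PV'): 1})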

 

5. CONCLUSIONS AND FUTURE WORK

               Non-literal phrases are part of everyday life, and although the majority of them evolve out of cultural meanings, they are found in one of the most common texts among readers: the news. In a sample of 120 leads, 73 unique non-literal phrases were found. Using those phrases, a study was conducted to see how well Google Translate could translate them.

               All three languages tested had trouble translating non-literal phrases. Bulgarian translated the most phrases correctly, while Tamil and Mandarin were similar in having a harder time with the phrases and also more difficulty translating sentence structure correctly. The extent to which sentence structure contributed to the mistranslation of non-literal phrases is unknown from this short pilot study, but it is something that needs more consideration in future work. Non-literal language is a problem translation systems need to keep in mind for the future. Is it a main priority, however? No; there seem to be other translation issues, with these languages at least, that have a higher priority. That does not mean that helping translation systems better translate non-literal expressions would not be beneficial. All three languages mistranslated the non-literal phrases at least half the time, so translation systems could greatly benefit from increasing the accuracy of translating these phrases.

               To continue this research, I would like to test the translation of more languages, especially more widely spoken languages like Spanish and French. Also, in light of the other translation errors and the sentence-structure issues, it would be worthwhile to redo the mark-ups of the languages already tested with more guidance. Instead of leaving the tagging up to each native speaker, the tags could be developed beforehand so that the native speakers work from a fixed tag set. That way the mark-ups would be more consistent and better suited for a combined analysis.

 

REFERENCES

 

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL-06.

 

M. Carpuat and M. Diab. 2010. Task-based evaluation of multiword expressions: a pilot study in statistical machine translation. In HLT-NAACL.

 

L. Li, B. Roth, and C. Sporleder. 2010. Topic Models For Word Sense Disambiguation And Token-Based Idiom Detection. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 1138–1147.

 

A. Louis. 2012. Predicting text quality: metrics for content, organization and style. Thesis Proposal, University of Pennsylvania.

 

E. Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, corpus number LDC2008T19, Philadelphia.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of EACL-09.

 

Full List of Unique Non-Literal Phrases

Phrase | Occurrences in corpus | Non-literal occurrences | Could be used literally? | Bulgarian | Tamil | Mandarin
(The last three columns indicate whether the phrase was translated correctly into that language.)

NNL
Cold snap | 2 | 2 | No | No | No | Yes
Administration chorus | 1 | 1 | No | Yes | No | No
Competitive flame | 1 | 1 | No | No | No | No
High jinks | 2 | 2 | No | No | No | No
The ___ bandwagon | 12 | 12 | Yes | No | No | No
___ machine | 95 | 11 | Yes | Yes | Yes | No
The man | 230 | 10 | Yes | No | No | No
Close call | 2 | 2 | No | No | No | No
Bum rap | 2 | 2 | No | No | No | No
The real deal | 4 | 4 | Rarely | No | Yes | Yes
___ person | 80 | 4 | Yes | No | No | No
Voice of reason | 1 | 1 | Rarely | Yes | No | No
Out of turn | 1 | 1 | No | No | No | No
Body blow | 2 | 2 | No | No | No | Yes
Forces of nature | 4 | 4 | No | Yes | No | No
Great deal | 16 | 12 | Yes | Yes | No | No
Off camera | 2 | 2 | No | No | No | No

RID
Went into effect | 8 | 8 | No | Yes | Yes | Yes
Killer in the kitchen | 1 | 1 | Yes | Yes | Yes | No*
Kept in the shadows | 1 | 1 | Rarely | No | No | No
Made it plain | 1 | 1 | Yes | Yes | No | No
Make clear | 9 | 9 | No | Yes | Yes | Yes
Stick with it | 2 | 2 | No | Yes | Yes | Yes
Hit the ___ market | 7 | 7 | No | No | Yes | No
Made the ___ team | 3 | 3 | No | No | No | No
Cast their ballots | 4 | 4 | Rarely | No | No | Yes
All signs pointed up | 2 | 2 | Rarely | No | No | No
Up to a point | 2 | 2 | Rarely | No | No | No
Drop ___ charges | 13 | 9 | No | Yes | Yes | Yes
Don't bother | 2 | 2 | Yes | No | No | No
Make do with | 4 | 4 | No | Yes | No | No
Get it | 41 | 5 | Yes | No | Yes | No
Fall asleep | 2 | 2 | No | Yes | Yes | No
Have an accident | 1 | 1 | No | Yes | Yes | Yes
Taking the hint | 1 | 1 | Rarely | Yes | Yes | No
Have a future | 2 | 2 | No | Yes | Yes | Yes
Everything went black | 1 | 1 | Rarely | Yes | No | No
Times change | 2 | 2 | No | No | Yes | No
Tag along | 2 | 2 | No | No | No | No
Put on the clock | 1 | 1 | Rarely | No | No | No
Stay ahead of the competition | 2 | 2 | No | Yes | No | Yes
Lost track of | 2 | 2 | No | Yes | No | No
Beat expectation | 10 | 10 | No | Yes | Yes | Yes
Sheds light on | 6 | 6 | Yes | Yes | No | No

ID
Ring a bell | 1 | 1 | No | No | No | No
Play ____ card | 12 | 1 | Yes | No | No | No
Sticking it to the man | 1 | 1 | No | No | No | No
Goes to the polls | 1 | 1 | No | No | No | Yes
In the face of | 43 | 43 | No | Yes | Yes | Yes
Shed its __ skin | 1 | 1 | Yes | No | Yes | Yes

PV
Talk up | 10 | 10 | Rarely | No | No | No
Feed off of | 3 | 2 | Yes | Yes | Close | No
Lock up | 13 | 7 | Yes | No | No | Yes
Cash in | 11 | 9 | Yes | No | No | No
Broke out in | 3 | 3 | No | Yes | No | No
Cool off | 1 | 1 | Yes | No | Close | No
Land in | 121 | 18 | No | No | Yes | No
Call for | 190 | 170 | Yes | Yes | Yes | Yes
Work out | 26 | 26 | No | No | No | No
Go up | 20 | 10 | Yes | No | No | No
Drop in | 11 | 4 | Yes | No | No | No
Hand down | 12 | 11 | Rarely | Yes | Yes | Yes
Plunge in | 11 | 1 | Yes | Yes | Yes | Yes
Spell out | 2 | 2 | Yes | Yes | Yes | Yes
Pull ___ out | 95 | 65 | Yes | Yes | Yes | No
Turn ___ over | 83 | 45 | Yes | No | No | Yes
Fell apart | 17 | 17 | Yes | Yes | No | Yes
Dream up | 4 | 4 | No | Yes | No | No
Count on | 39 | 39 | Rarely | Yes | No | No
Take on | 95 | 39 | Yes | No | No | No
Shook off | 3 | 3 | No | Yes | No | Yes
Point out | 25 | 23 | Yes | Yes | No | Yes
Line up | 78 | 21 | Yes | No | No | No