
Annotating and Translating Non-Literal Expressions: A Pilot Study

 

Kaitlyn Mulcrone

University of Minnesota, Morris

mulcr002@morris.umn.edu

 

ABSTRACT

 

Non-native speakers and machine translation systems have difficulty understanding non-literal phrases like "keep to the shadows" or "up to a point". Here we present an analysis of non-literal expressions in newspaper articles. First, 120 articles were manually annotated for the presence of five classes of non-literal expressions. Then the sentences containing them were automatically translated with Google Translate into three languages. The original sentences, with definitions of the non-literal phrases, and the translations were given to bilingual speakers to annotate the translation errors. Errors in non-literal phrases occurred more often than other translation errors. We present a detailed error analysis.

 

Categories and Subject Descriptors

I.2.7 [Natural Language Processing]: Text analysis; H.3.1 [Content Analysis and Indexing]: Linguistic processing

 

Keywords

non-literal expressions, corpus, leads

 

1. INTRODUCTION

               In English, "sticking it to the man" has nothing to do with literally sticking an object to a man's body; instead it refers to defying or resisting authority. Phrases like this one are known as idioms, one of the types of non-literal expressions that are used quite often in speech and text. Non-literal expressions are phrases whose meaning differs from the meaning of the words that make them up. Many non-literal expressions are culturally dependent, which makes them hard for non-native speakers to understand. It could be said that machine translation systems currently share that cultural disconnect. Like non-native speakers, machine translation systems will not know the majority of these expressions without being taught them.

               Machine translation systems can now translate between many languages, some better than others, but the translations are still not as accurate as they could be. There are many different kinds of translation errors that need to be dealt with. Among them, are non-literal expressions a consistent source of error? The objective of this study is to discover whether non-literal expressions have a high translation error rate and whether focusing on improving non-literal expression translation would significantly reduce machine translation errors.

               This paper presents a two-part pilot study analyzing non-literal expressions in newspaper articles. Section 2 places the study among related work, explaining how it is similar to and differs from other studies. Section 3 describes the manual annotation process for non-literal expressions and presents the results. Section 4 analyzes the machine translation of the annotated sentences into three languages, focusing on translation errors. Finally, Section 5 concludes and discusses future work.

 

2. RELATED WORK

               There is interest among NLP researchers in identifying the difference between literal and non-literal language. Among different approaches, (Birke and Sarkar, 2006) use a clustering approach to identify whether verbs are being used literally or non-literally, and (Li et al., 2010) try to differentiate based on word sense in order to identify idioms. A more in-depth study of non-literal phrases examined 17 idioms that had both non-literal and literal meanings. The authors collected the surrounding words for cases where a phrase was used literally and cases where it was used non-literally, tested multiple sentences for each phrase with the phrase used both ways, and measured the accuracy of their algorithm. The research (Sporleder and Li, 2009) had positive results, but accuracy still varied across the different phrases.

               Research has also been done on translating multi-word expressions. For instance, (Carpuat and Diab, 2010) worked on two task-based integration strategies to help English-Arabic translation. Multi-word expressions, like non-literal expressions, sometimes have a meaning separate from that of their individual words, and integrating these phrases and their definitions is a start toward making non-literal expressions translate more smoothly.

 

3. NON-LITERAL PHRASES

               Very little annotation has been done on non-literal expressions. There is no automatic way to search for these phrases, so the annotation process has to be done manually. In the following sections, I will describe the corpus, or compilation of documents, that was created and used for this study, explain the annotation process, and provide some results.

 

3.1 Corpus

               The corpus used for this study consists of a collection of New York Times articles. It was generated for studies analyzing different aspects of writing or different writing styles. Articles from the years 2005 and 2006 were extracted from the New York Times annotated corpus (Sandhaus, 2008), and to add variety in writing style, the corpus includes articles from four genres: business, science, sports, and international relations. The articles from each genre were extracted using the method of (Louis, 2012). The business and sports articles were gathered using the section headers "Top/News/Business" and "Top/News/Sports." The international relations articles were those manually tagged "United States International Relations." The extraction of the science articles was more complicated. Their definition of a science article was a research article. They first extracted all articles with the tags "Medicine and Health, Computers and the Internet, Religion and Churches, Research, Space, Physics, Brain, Evolution, Disasters, Language and Languages, Environment." Then, because not all articles under these tags are research articles, they created a handmade dictionary of research-related terms and used it to remove all of the articles that did not use enough of those terms.

               Using those methods, the original version of the corpus contained both the full articles and each article's lead, or its first two paragraphs. This study and other studies with this corpus focus on the article leads, so all leads longer than 200 words were removed. Additionally, other articles were removed because their leads lacked content; for instance, articles whose lead was information about a piece of artwork or that were letters to the editor were deleted. The final reduced corpus contains 13,247 business articles, 2,974 science articles, 11,530 sports articles, and 2,929 international relations articles, for a total of 30,980 articles.
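
               As a rough illustration of the filtering just described, the Python sketch below checks a science article's use of research-related terms and a lead's length. The term list, threshold, and function names are assumptions for illustration only; they are not details taken from (Louis, 2012) or from this study.

import re

# Hypothetical stand-in for the handmade dictionary of research-related terms;
# the actual dictionary and cutoff used by (Louis, 2012) are not reproduced here.
RESEARCH_TERMS = {"study", "researchers", "experiment", "evidence", "data", "findings"}
MIN_RESEARCH_TERMS = 3      # assumed cutoff for keeping a science article
MAX_LEAD_WORDS = 200        # leads longer than 200 words were removed

def looks_like_research_article(article_text):
    # Keep a science-tagged article only if it uses enough research-related terms.
    tokens = set(re.findall(r"[a-z]+", article_text.lower()))
    return len(tokens & RESEARCH_TERMS) >= MIN_RESEARCH_TERMS

def keep_lead(lead_text):
    # Keep a lead only if it is at most 200 words long.
    return len(lead_text.split()) <= MAX_LEAD_WORDS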

 

3.2 The Annotation Process

               In order to look at more articles, the annotation was done on the leads rather than the full articles. Thirty leads were randomly selected from each genre, for a total of 120 leads. In addition to annotating non-literal expressions, I was asked to split them into different non-literal categories. Over the course of the annotation process I established four categories: NNL, ID, RID, and PV. A phrase is labeled NNL when it is a noun phrase used non-literally, ID when it is an idiom whose words have nothing to do with the non-literal meaning, RID when it is an idiom whose words relate to the non-literal meaning, and PV when it is a phrasal verb. Each lead was read by the same annotator and marked according to these categories.

 

3.3 Challenges

               Although the annotation process was manual, it was one of the first of its kind. There were many instances where it was difficult to decide whether a phrase was non-literal or not. Over the course of two weeks the definitions of the categories changed multiple times to guarantee that each category was distinguished from the others. During these weeks I went from reading 10 leads in each genre to the full 30 leads per genre. The leads were reevaluated on multiple occasions before the counts were finalized. There were many odd phrases that I made note of, but only four distinct categories emerged. For instance, many non-literal phrases also have a literal meaning, but I decided not to label literal instances of non-literal expressions.

 

3.4 Results 

               Once all the categories were well defined and all the leads reevaluated, I analyzed the data and found the following results. Out of the 120 leads, 56 contained non-literal expressions, slightly less than half. There were 80 instances of non-literal expressions in total, covering 73 unique expressions. The full list of unique expressions and various results can be found in the table at the end of the paper. The number of non-literal expressions did not vary much by genre, except that business had a few more leads without any non-literal expression than the rest. The exact counts for each genre and category can be found in Figure 1.

 

Non-Literal Phrases by Category

            NNL   ID   RID   PV   Total
Business      4    2     3    5      14
Science       6    0    12    6      24
Sports        5    2     5    5      17
Politics      2    2     8   13      25
Total        17    6    28   29      80

 

Figure 1: The table shows the number of non-literal phrases for each genre by the categories established in this study. The total of each category is the sum of the counts from each genre.

 

               With 73 unique non-literal expressions available, a small experiment was conducted to see how often these phrases occurred throughout the whole corpus and how many times they occurred literally versus non-literally. After searching the corpus and collecting all the sentences that contained one of the 73 non-literal phrases, the results showed that 45% of the phrases occurred only 1 to 2 times in the entire corpus. The rest of the results can be found in Figure 2; as the number of occurrences increased, the percentage of non-literal phrases decreased. This suggests that if another sample of leads were taken, it would be more likely to add new non-literal phrases to the list than to repeat ones already on it.

 

Number of Occurrences in the Corpus

# of Occurrences   Percentage
1-2                45%
3-10               21%
11-25              16%
25-100             14%
Over 100            4%

Figure 2: The table shows the percentage of non-literal phrases in each range of how many times the phrases occurred in the entire corpus.
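
               A minimal sketch of how percentages like those in Figure 2 could be computed, assuming the corpus sentences and the 73 phrases are available as Python lists. The simple substring matching is an assumption for illustration; the original counting procedure is not described in that detail.

from collections import Counter

def occurrence_bucket(n):
    # Map a phrase's occurrence count to the ranges used in Figure 2.
    if n <= 2:
        return "1-2"
    if n <= 10:
        return "3-10"
    if n <= 25:
        return "11-25"
    if n <= 100:
        return "25-100"
    return "Over 100"

def occurrence_percentages(phrases, corpus_sentences):
    # Count the sentences containing each phrase, bin the counts, and
    # report what percentage of the phrases falls into each bin.
    counts = {p: sum(p.lower() in s.lower() for s in corpus_sentences)
              for p in phrases}
    bins = Counter(occurrence_bucket(n) for n in counts.values())
    total = len(phrases)
    return {b: round(100 * c / total) for b, c in bins.items()}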

 

               Continuing to the next step, I took the compiled file of all corpus sentences containing one of the non-literal phrases, manually read each sentence, and marked whether the phrase was used literally or non-literally. These judgments were made over a couple of days, with over 1,000 sentences to categorize. Although most of the non-literal phrases occurred only a few times, there were a couple of common phrases that occurred over 100 times. Most of these were phrases that do occur literally, like "the man," which occurs more often literally than non-literally. The results showed that only 8 of the phrases occurred literally more often than non-literally, 5 phrases occurred literally as often as non-literally, and 8 occurred non-literally more often than literally. The other 52 phrases occurred only non-literally in the corpus; however, only 11 of those 52 phrases occurred more than 5 times. It is important to note that if a phrase occurred non-literally in all of its appearances in the corpus but appeared fewer than 5 times, there is a higher chance that the phrase could still occur literally outside the corpus. This chance decreases the more times the phrase appears non-literally in the corpus. There were also phrases that are certain to occur literally outside the corpus. For instance, "cool off" only occurred non-literally inside the corpus, but its literal definition is the more common one.

 

4. NON-LITERAL TRANSLATION

               The annotation process produced 67 unique sentences marked with non-literal expressions. With these sentences, a study was conducted to find out how often non-literal expressions are mistranslated by machine translation.

 

4.1 The Mark Up

               This study was conducted on three languages: Bulgarian, Tamil, and Mandarin. For each language a native speaker was found to check the English-to-native-language translations. Each native speaker was provided with a document containing all 67 non-literal expressions, their definitions, and the original annotated sentences that contained them. The native speakers were asked to translate each sentence using Google Translate and then state whether or not the non-literal expression was translated correctly. It was made clear that, to be translated correctly, the sentence had to convey the non-literal meaning. Because there are no similar studies of this nature, the native speakers were also asked to comment on other aspects of the translation, including other errors in the sentence, whether the expression was translated literally rather than non-literally, and whether the phrase was understandable but sounded off in the language. Figure 3 shows an example sentence and how it was marked up in each of the three languages. The native speakers were not given due dates, and each completed the mark-up at their own pace.

 

Mark-Up Example

Given:

Word & Non-literal definition: bum rap - an unfair punishment or false charge

Sentence: The word ''virtual'' has gotten <NNL>a bum rap</NNL> since it appeared a few centuries ago. It means ''the same, but not really.''

 

Bulgarian Mark-Up: The word ''virtual'' has gotten <NNL><GT>a bum rap</GT></NNL> since it appeared a few centuries ago. It means ''the same, but not really.''

Tamil Mark-Up: The word ''virtual'' has gotten <NNL><GT>a bum rap</GT></NNL> since it appeared a few centuries ago. It means ''the same, but not really.'' 

Mandarin Mark-Up: The word ''virtual'' has gotten <NNL><GT>a bum rap</GT></NNL> since it appeared a few centuries ago. It means ''the same, but not really.''

Figure 3: The table shows what the native speakers were given and how they tagged the sentence (CT = correct translation, GT = garbled translation). In this case all three languages had a garbled translation.
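
               The per-category tallies reported in Section 4.3 can be recovered mechanically from mark-ups in the style of Figure 3. The Python sketch below assumes each marked sentence uses nested category and verdict tags exactly as shown above; the actual tallying in this study was done by hand, so this is an illustration rather than the procedure used.

import re
from collections import defaultdict

CATEGORIES = ("NNL", "ID", "RID", "PV")
VERDICTS = ("GT", "CT")     # GT = garbled translation, CT = correct translation

def tally_markup(marked_sentences):
    # Count garbled and correct translations per non-literal category.
    counts = defaultdict(lambda: {"GT": 0, "CT": 0})
    for sentence in marked_sentences:
        for cat in CATEGORIES:
            for verdict in VERDICTS:
                pattern = rf"<{cat}><{verdict}>.*?</{verdict}></{cat}>"
                counts[cat][verdict] += len(re.findall(pattern, sentence))
    return counts

def garbled_percentage(counts, cat):
    # Share of phrases in a category whose translation was garbled.
    gt, ct = counts[cat]["GT"], counts[cat]["CT"]
    return 100 * gt / (gt + ct) if gt + ct else 0.0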

 

 

4.2 Challenges

               When the translation study was complete, each of the three native speakers had clearly stated whether or not each phrase translated correctly, but the additional information they provided varied greatly. Due to the lack of instruction, the different timelines on which the mark-ups were completed, and the differences in sentence structure between the three languages, the tags other than "translated correctly" and "garbled translation" varied widely between the three mark-ups. The Bulgarian mark-up provided 6 distinct tags with tag definitions. Bulgarian also happened to have the sentence structure most similar to English. The Tamil mark-up tagged only whether or not the phrase translated and added comments below rather than creating additional tags. The Mandarin mark-up followed the Bulgarian tags quite closely but added 2 more tags. We discovered that Tamil and Mandarin have sentence structures so strikingly different from English that, without some editing, almost none of the translated sentences made sense. We realized this halfway through the mark-ups, along with the possibility that the speakers were marking different things. Because of these discrepancies, the results section focuses only on whether the non-literal phrase translated correctly and not on any of the additional information provided by the native speakers.

               The languages for this study were chosen out of convenience, based on the first native speakers we could find, and it turns out they might not have been the best choices. The structural differences between Mandarin, Tamil, and English may have caused some major differences in the mark-ups. Still, the goal of the study was to look at the translation of the non-literal phrases in each sentence, and that remained a reasonable task even if the sentence structure got garbled through the translator.

 

4.3 Results 

               To discuss the results of this study I will first describe each language's results individually and then end with the combined results. Having divided the non-literal phrases into categories, we expected some categories to have a higher percentage of mistranslations than others. The RID group was expected to have the lowest mistranslation percentage because the words in the phrase hint at the non-literal meaning, whereas the ID category was expected to have a higher percentage because its words have no connection to the non-literal meaning. No direction was expected for the NNL and PV categories because there was little to indicate how those phrases would translate into different languages.

 

4.3.1 Bulgarian Results

            The Bulgarian mark-up matched our expectations, with RID having the lowest garbled translation percentage and ID having the highest. The sentence counts and garbled translation percentages for the Bulgarian mark-up can be found in Figure 4. For Bulgarian, the PV category translated better than the NNL category.

               Overall, the total percentage of non-literal phrases that did not translate was 53%, only a little over half. The percentage is this close to half because the two larger categories had more correct translations than mistranslations. For a translation system, mistranslating non-literal phrases half the time is a significant problem. However, 53% might not be a large enough percentage to make non-literal phrases a priority for improvement in translation systems.

 

The Bulgarian Mark-Up

          Garbled   Correctly Translated   Percentage Garbled
NNL            12                      5                  70%
ID              5                      1                  83%
RID            11                     16                  40%
PV             11                     12                  47%
Total          39                     34                  53%

Figure 4: The table is divided up by non-literal category and gives counts for how many sentences were translated correctly and how many had a garbled translation. All the categories are totaled at the bottom.

 

4.3.2 The Tamil Results

               The results of the Tamil mark-up differed from the Bulgarian mark-up which, given that the languages are different from each other and Tamil is more different from English than Bulgarian is, was not a surprise. Tamil did match the expectation that the RID category would have the lowest garbled translation percentage; however, ID did not have the highest percentage; NNL did. The NNL category was the hardest for Tamil to translate. A possible problem here is that the NNL category had 17 phrases while ID had only 6, nearly three times fewer. Because the ID category had so few phrases, the comparison might not be as reliable. The full results of the Tamil mark-up can be found in Figure 5.

               Overall, Tamil mistranslated 64% of the non-literal phrases, significantly more than half; in fact, every category had a garbled percentage over 50. Non-literal phrases are a problem for Tamil translation. However, as mentioned previously, the difference in sentence structure between Tamil and English causes more problems than the non-literal mistranslations do.

 

The Tamil Mark-Up

          Garbled   Correctly Translated   Percentage Garbled
NNL            14                      3                  82%
ID              4                      2                  66%
RID            14                     13                  51%
PV             15                      8                  65%
Total          47                     26                  64%

Figure 5: The table is divided up by non-literal category and gives counts for how many sentences were translated correctly and how many had a garbled translation. All the categories are totaled at the bottom.

 

4.3.3 The Mandarin Results

               Of the three languages, the Mandarin mark-up had the most garbled translations; however, the results did not match expectations. The ID category had the lowest garbled translation percentage when it was expected to have the highest. Again, this might be caused by having too few phrases in the category, but since the words in these phrases are not related at all to the non-literal meaning, it is still surprising. The NNL category had the highest garbled translation percentage, which is not surprising after seeing how difficult it was for Tamil. RID had the second highest percentage, which also went against expectations. The difference in sentence structure between English and Mandarin may have contributed to these mistranslations. The full results can be found in Figure 6.

               With the categories combined, the total percentage garbled was 67%, the highest of the three languages. As with the Tamil results, all the non-literal phrase categories had a garbled percentage over 50. However, the same caveat applies: there were more problems translating the sentence structure than translating the non-literal phrases.

 

The Mandarin Mark-Up

          Garbled   Correctly Translated   Percentage Garbled
NNL            14                      3                  82%
ID              3                      3                  50%
RID            18                      9                  66%
PV             14                      9                  60%
Total          49                     24                  67%

Figure 6: The table is divided up by non-literal category and gives counts for how many sentences were translated correctly and how many had a garbled translation. All the categories are totaled at the bottom.

 

4.3.4 The Combined Results

               To combine the mark-ups, we wanted to see how many phrases were a problem for all three languages and how many were a problem for none of them. The full chart of these results can be found in Figure 7. Overall, 37% of the phrases were mistranslated by all three languages, and only 18% of the phrases were translated correctly by all three. That 37% was the highest percentage of any group, followed by the group of phrases that two of the languages mistranslated.

 

The Combined Mark-Up

             NNL   ID   RID   PV   Total
All 3 Lng      9    3     7    8      27
2 Lng          6    1     9    7      23
1 Lng          2    1     3    4      10
No Lng         0    1     8    4      13

Figure 7: The table is divided by how many languages (Lng = languages) mistranslated the non-literal phrase. It gives counts for each category and the total combining the categories.
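
               A minimal sketch of how the Figure 7 breakdown could be tallied, assuming each phrase has a category label and a correct/incorrect verdict for each of the three languages. The four example entries mirror rows from the table at the end of the paper, but the data structure and function names are illustrative assumptions, not the tooling used in the study.

from collections import Counter

# Per-phrase verdicts: (category, correct-in-Bulgarian, correct-in-Tamil, correct-in-Mandarin).
# These example rows mirror entries in the table at the end of the paper.
verdicts = {
    "bum rap":    ("NNL", False, False, False),
    "great deal": ("NNL", True,  False, False),
    "make clear": ("RID", True,  True,  True),
    "call for":   ("PV",  True,  True,  True),
}

def combined_breakdown(verdicts):
    # Group phrases by how many of the three languages mistranslated them,
    # keeping the per-category split used in Figure 7.
    breakdown = Counter()
    for category, *per_language in verdicts.values():
        mistranslated = sum(not ok for ok in per_language)
        breakdown[(mistranslated, category)] += 1
    return breakdown

print(combined_breakdown(verdicts))
# e.g. Counter({(3, 'NNL'): 1, (2, 'NNL'): 1, (0, 'RID'): 1, (0, 'PV'): 1})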

 

5. CONCLUSIONS AND FUTURE WORK

               Non-literal phrases are part of everyday life, and although the majority of them evolve out of cultural meanings, they are found in one of the most common texts among readers: the news. In a sample of 120 leads, 73 unique non-literal phrases were found. Using those phrases, a study was conducted to see how well Google Translate could translate them.

               All three languages tested had trouble translating non-literal phrases. Bulgarian translated the most phrases correctly, while Tamil and Mandarin were similar in having a harder time with the phrases and also more difficulty translating sentence structure correctly. The extent to which sentence structure contributed to the mistranslation of non-literal phrases is unknown from this short pilot study, but it is something that needs more consideration in future work. Non-literal language is a problem translation systems need to keep in mind for the future. Is it a main priority, however? No; there seem to be other translation issues, with these languages at least, that have a higher priority. That does not mean that helping translation systems better translate non-literal expressions would not be beneficial. All three languages mistranslated the non-literal phrases at least half the time, so translation systems could greatly benefit from increasing the accuracy of translating these phrases.

               To continue this research, I would like to test the translation of more languages, especially more widely spoken languages like Spanish and French. Also, in light of the other translation errors and the sentence-structure issues, it would be worthwhile to redo the mark-ups of the languages already tested with more guidance. Instead of leaving the tagging up to each native speaker, the tags could be developed beforehand so that the native speakers work from a fixed tag set. That way the mark-ups would be more consistent and better suited for a combined analysis.

 

REFERENCES

 

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL-06.

 

M. Carpuat and M. Diab. 2010. Task-based evaluation of multiword expressions: a pilot study in statistical machine translation. In HLT-NAACL.

 

L. Li, B. Roth, and C. Sporleder. 2010. Topic Models For Word Sense Disambiguation And Token-Based Idiom Detection. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10, pages 1138–1147.

 

A. Louis. 2012. Predicting text quality: metrics for content, organization and style. Thesis Proposal, University of Pennsylvania.

 

E. Sandhaus. 2008. The New York Times annotated corpus. Linguistic Data Consortium, corpus number LDC2008T19, Philadelphia.

C. Sporleder and L. Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of EACL-09.

 

Full List of Unique Non-Literal Phrases

Phrase | Occurrences in corpus | Non-literal occurrences | Could be used literally? | Bulgarian | Tamil | Mandarin
(The last three columns indicate whether the phrase was translated correctly into that language.)

NNL
Cold snap | 2 | 2 | No | No | No | Yes
Administration chorus | 1 | 1 | No | Yes | No | No
Competitive flame | 1 | 1 | No | No | No | No
High jinks | 2 | 2 | No | No | No | No
The ___ bandwagon | 12 | 12 | Yes | No | No | No
___ machine | 95 | 11 | Yes | Yes | Yes | No
The man | 230 | 10 | Yes | No | No | No
Close call | 2 | 2 | No | No | No | No
Bum rap | 2 | 2 | No | No | No | No
The real deal | 4 | 4 | Rarely | No | Yes | Yes
___ person | 80 | 4 | Yes | No | No | No
Voice of reason | 1 | 1 | Rarely | Yes | No | No
Out of turn | 1 | 1 | No | No | No | No
Body blow | 2 | 2 | No | No | No | Yes
Forces of nature | 4 | 4 | No | Yes | No | No
Great deal | 16 | 12 | Yes | Yes | No | No
Off camera | 2 | 2 | No | No | No | No

RID
Went into effect | 8 | 8 | No | Yes | Yes | Yes
Killer in the kitchen | 1 | 1 | Yes | Yes | Yes | No*
Kept in the shadows | 1 | 1 | Rarely | No | No | No
Made it plain | 1 | 1 | Yes | Yes | No | No
Make clear | 9 | 9 | No | Yes | Yes | Yes
Stick with it | 2 | 2 | No | Yes | Yes | Yes
Hit the ___ market | 7 | 7 | No | No | Yes | No
Made the ___ team | 3 | 3 | No | No | No | No
Cast their ballots | 4 | 4 | Rarely | No | No | Yes
All signs pointed up | 2 | 2 | Rarely | No | No | No
Up to a point | 2 | 2 | Rarely | No | No | No
Drop ___ charges | 13 | 9 | No | Yes | Yes | Yes
Don't bother | 2 | 2 | Yes | No | No | No
Make do with | 4 | 4 | No | Yes | No | No
Get it | 41 | 5 | Yes | No | Yes | No
Fall asleep | 2 | 2 | No | Yes | Yes | No
Have an accident | 1 | 1 | No | Yes | Yes | Yes
Taking the hint | 1 | 1 | Rarely | Yes | Yes | No
Have a future | 2 | 2 | No | Yes | Yes | Yes
Everything went black | 1 | 1 | Rarely | Yes | No | No
Times change | 2 | 2 | No | No | Yes | No
Tag along | 2 | 2 | No | No | No | No
Put on the clock | 1 | 1 | Rarely | No | No | No
Stay ahead of the competition | 2 | 2 | No | Yes | No | Yes
Lost track of | 2 | 2 | No | Yes | No | No
Beat expectation | 10 | 10 | No | Yes | Yes | Yes
Sheds light on | 6 | 6 | Yes | Yes | No | No

ID
Ring a bell | 1 | 1 | No | No | No | No
Play ____ card | 12 | 1 | Yes | No | No | No
Sticking it to the man | 1 | 1 | No | No | No | No
Goes to the polls | 1 | 1 | No | No | No | Yes
In the face of | 43 | 43 | No | Yes | Yes | Yes
Shed its __ skin | 1 | 1 | Yes | No | Yes | Yes

PV
Talk up | 10 | 10 | Rarely | No | No | No
Feed off of | 3 | 2 | Yes | Yes | Close | No
Lock up | 13 | 7 | Yes | No | No | Yes
Cash in | 11 | 9 | Yes | No | No | No
Broke out in | 3 | 3 | No | Yes | No | No
Cool off | 1 | 1 | Yes | No | Close | No
Land in | 121 | 18 | No | No | Yes | No
Call for | 190 | 170 | Yes | Yes | Yes | Yes
Work out | 26 | 26 | No | No | No | No
Go up | 20 | 10 | Yes | No | No | No
Drop in | 11 | 4 | Yes | No | No | No
Hand down | 12 | 11 | Rarely | Yes | Yes | Yes
Plunge in | 11 | 1 | Yes | Yes | Yes | Yes
Spell out | 2 | 2 | Yes | Yes | Yes | Yes
Pull ___ out | 95 | 65 | Yes | Yes | Yes | No
Turn ___ over | 83 | 45 | Yes | No | No | Yes
Fell apart | 17 | 17 | Yes | Yes | No | Yes
Dream up | 4 | 4 | No | Yes | No | No
Count on | 39 | 39 | Rarely | Yes | No | No
Take on | 95 | 39 | Yes | No | No | No
Shook off | 3 | 3 | No | Yes | No | Yes
Point out | 25 | 23 | Yes | Yes | No | Yes
Line up | 78 | 21 | Yes | No | No | No