PHILLY:New In The City

6/22/2012: Annotations

This week I spent most of my time on my own side project. Ani was interested in incorporating my knowledge as an English major into the project. She asked me to annotate 20 leads from each of the four categories, labeling any non-literal language. She started me off with three categories: NL for normal non-literal language, ID for idioms, and OTH for other, meaning things that didn’t fit into the other two categories. She had no set expectations and said I should feel free to add more categories. She also told me to watch out for humor, sarcasm, or irony.

I was a little worried at first because the first couple of article leads didn’t seem to have any non-literal language and I thought I wouldn’t be able to do it. However, by the third or fourth lead I started to see some phrases. The first time through I just tried to pick out the possible non-literal language. Then on the second pass I categorized. Ani told me to refrain from looking at lists of non-literal language because she wanted to see how I did on my own. I did, however, google phrases if I was on the fence about which category to put them in. There is a pretty extensive idioms dictionary online. If I didn’t know where to put a phrase, I looked to see if it was in that dictionary. If it wasn’t, I categorized it as NL.
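The fallback rule I used for on-the-fence phrases could be sketched in Python roughly like this. I did the lookup by hand, so the function name, the small idiom set, and the lowercase/whitespace normalization are all assumptions made for illustration.

```python
def fallback_category(phrase, idiom_dictionary):
    """Return "ID" if the phrase is listed in the idiom dictionary, else "NL".

    idiom_dictionary is a hypothetical set of lowercase idioms standing in
    for the online idioms dictionary, which was actually checked by hand.
    """
    # Normalize case and collapse repeated whitespace before the lookup.
    normalized = " ".join(phrase.lower().split())
    return "ID" if normalized in idiom_dictionary else "NL"


# Made-up entries, only to show the call shape.
example_idioms = {"kick the bucket", "on the fence"}
print(fallback_category("On the  fence", example_idioms))
print(fallback_category("a mountain of paperwork", example_idioms))
```

This only covers the tie-breaking step. Picking out the candidate phrases in the first place was still a judgment call.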

I didn’t end up finding any sarcasm or irony. However, there were a few lines I thought were funny. I also caught some words that are considered slang and made that a category, SL. And I found that sports writing uses a lot of common words that have a different meaning in sports, so I made a category for sports terms, ST.

We met on Wednesday, and Ani thought that I had picked out some good phrases but that our definitions of NL, ID, and OTH were too loose. She told me to rearrange the phrases, come up with more categories, and write definitions for each category.

I ended up adding a couple more categories and these definitions.

NL: Non-literal phrases without verbs

FNL: Familiar Non-literal phrases that can be explained by the words in the phrases

ID: Phrases where the meaning is not clear by the words

MID: Idioms that have been modified by the author

TR: Really common phrases which are also non-literal

HTT: Words or phrases that seem hard to translate

AST: Terms or phrases that have to do with a specific article

SL: Words that are not in the dictionary, slang terms

HM: Phrases that are funny

OTH: Words or phrases that seem to be non-literal but don’t fit into any of these categories

I also did a check of how many leads out of the 20 had each label and how many leads didn’t have any non-literal language at all. According to this small annotation it seemed that politics had the least amount of non-literal language. This seemed consistent with what we had seen from the articles so far. It seemed the political leads were purely informational.
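The check described above amounts to counting, for each label, how many leads contain it at least once, plus how many leads have no labels at all. Here is a minimal sketch in Python, assuming each lead's annotations are stored as a list of label strings. The data layout and the function name are my own assumptions, and the sample input is made up.

```python
from collections import Counter


def label_coverage(leads):
    """Count leads per label and leads with no non-literal language.

    leads: a list where each item is the list of labels found in one lead
    (an assumed layout, not the actual annotation file format).
    Returns (Counter mapping label -> number of leads containing it,
             number of leads with no labels).
    """
    per_label = Counter()
    unlabeled = 0
    for labels in leads:
        unique = set(labels)  # a label counts once per lead
        if not unique:
            unlabeled += 1
        per_label.update(unique)
    return per_label, unlabeled


# Made-up annotations for three leads, only to show the call shape.
sample = [["ID", "NL", "ID"], [], ["NL"]]
counts, empty = label_coverage(sample)
print(counts["NL"], counts["ID"], empty)
```

Running this once per category (Sports, Business, Science, Politics) would give the numbers needed to compare them.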

On a side note, I was working on a couple of other things. I was still in charge of trying to cut down the four categories so they would have more equal word counts. I found out that some of the leads were way too long to be leads. I collected the number of words in each lead and then showed the counts on a couple of graphs. I decided to cut them down to 400 words, but on second look Ani thought it might be better to cut down to 200. Her decision was mostly based on the fact that the Sports and Business categories still had at least double the leads of Science and Politics (and reducing them to a max of 200 did cut a lot of files from the bigger categories).
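The length cutoff could be sketched as below. This assumes that "cutting down" means dropping any lead longer than the limit rather than truncating it, and that words are counted by splitting on whitespace. Both are assumptions, as are the function names and the dictionary layout.

```python
def word_count(lead):
    """Count words by splitting on whitespace (an assumed definition)."""
    return len(lead.split())


def filter_by_length(leads_by_category, max_words=200):
    """Keep only the leads with at most max_words words in each category.

    leads_by_category: hypothetical dict mapping a category name to a
    list of lead texts.
    """
    return {
        category: [lead for lead in leads if word_count(lead) <= max_words]
        for category, leads in leads_by_category.items()
    }


# Made-up leads with a tiny limit, only to show the call shape.
sample = {"Sports": ["one two three", "one two three four five"],
          "Politics": ["one two"]}
kept = filter_by_length(sample, max_words=3)
for category, leads in kept.items():
    print(category, len(leads))
```

Comparing the number of leads left per category at 400 and at 200 is the kind of check that shows how much a lower limit evens out the categories.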