PHILLY:New In The City

6/15/2012: Completed Tasks at Week's End

It's the end of the week, and we met to talk about the results of our first tasks. I brought in a table showing how many files and how many words are in each domain (there are four domains: Business, Science, Sports, and Politics). To start, we collected all articles that fell into one of these categories over two years, 2005 and 2006. The New York Times does not separate articles into these categories, so this is something Annie did using the research tools she has become familiar with through her graduate project. Because at some point we would like to narrow the sample down to either the same number of files or the same number of words per domain, we can use this information to see which domain is the smallest. Knowing these totals will also be helpful in the meantime, so we can look at percentages instead of raw numbers.
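For anyone curious what a tally like this looks like in code, here is a minimal Python sketch. It is only an illustration, not the script used for the project: the `(domain, text)` input format, the function names, and the whitespace-based word count are all assumptions, and the sample sentences are made up.

```python
from collections import defaultdict

def domain_totals(articles):
    """Count files and words per domain.

    `articles` is an iterable of (domain, text) pairs (a hypothetical format).
    Words are counted by splitting on whitespace, which is a simplification.
    """
    totals = defaultdict(lambda: {"files": 0, "words": 0})
    for domain, text in articles:
        totals[domain]["files"] += 1
        totals[domain]["words"] += len(text.split())
    return dict(totals)

def as_percentages(totals, key):
    """Express each domain's count for `key` as a percentage of the grand total."""
    grand_total = sum(t[key] for t in totals.values())
    return {d: 100.0 * t[key] / grand_total for d, t in totals.items()}

# Made-up sample data, only to show the shape of the input.
sample = [
    ("Business", "Stocks rose as markets opened"),
    ("Sports", "The team won"),
    ("Sports", "The coach resigned today"),
]

totals = domain_totals(sample)
word_shares = as_percentages(totals, "words")

# The smallest domain by word count sets the ceiling for an equal-sized sample.
smallest = min(totals, key=lambda d: totals[d]["words"])
```

Picking the smallest domain first makes it easy to decide how far the other domains would need to be trimmed to match it.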

In addition to the total number of words, I also looked at unique words. Knowing the number of unique words in each category can be useful for seeing which domain is more repetitive and which uses a wider vocabulary.
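One common way to put "repetitive versus wide vocabulary" into a single number is the ratio of unique words to total words (often called the type-token ratio). The sketch below is an assumption about how that could be computed, not a description of what was actually run; the regular expression used for tokenizing and the function name are hypothetical choices.

```python
import re

def vocabulary_stats(text):
    """Return (total words, unique words, unique/total ratio) for a text.

    Tokenization here is a simple lowercase letters-and-apostrophes match,
    which is an assumption; a real tokenizer would handle more cases.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    unique = set(tokens)
    ratio = len(unique) / len(tokens) if tokens else 0.0
    return len(tokens), len(unique), ratio

# Made-up sentence: 9 words, 6 of them distinct.
total, unique, ratio = vocabulary_stats("The team won and the fans cheered the team")
```

A lower ratio suggests more repetition. This ratio tends to drop as a text gets longer, so comparing domains is fairer once they have been cut down to the same number of words.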

The final task I did was to look up the 100 most frequent words for each category. Most of the top words were the common articles “a” and “the,” followed by words like “and” and “that.” It isn’t until halfway through the list that you start to see more distinctive words. For some reason, some questionable single letters show up, like “c” and “f” (and yes, I did make sure that it was only counting these when they were standalone words and not just whenever the letter occurred). The only thing I can think of is that each is a common initial that appears a lot.
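As a rough illustration of how a frequency list like this can be built, and of how a middle initial ends up counted as a word, here is a short Python sketch. The tokenizing pattern, the function name, and the sample sentence are all made up for the example; the initials explanation is still only a guess.

```python
import re
from collections import Counter

def top_words(text, n=100):
    """Return the n most frequent whole-word tokens as (word, count) pairs.

    Tokens are runs of letters, lowercased, so "F." in a name becomes
    the standalone token "f" rather than a letter inside another word.
    """
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(tokens).most_common(n)

# Made-up sentence with a middle initial that shows up twice.
sample = "John F. Kennedy and Robert F. Kennedy met the press, and the press waited."
counts = dict(top_words(sample))
top_three = top_words(sample, n=3)
```

In this sample, `counts["f"]` is 2 even though "f" is never a real word, which is the kind of entry that a name-heavy corpus could produce.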

The other two researchers worked on tokenizing the corpus and comparing it to the word list I mentioned in an earlier entry, in which words are rated by age of acquisition, familiarity, concreteness, imagery, and ambiguity. They looked at things like coverage: which words appear in the list and which do not. Then, by domain, they looked at the average ratings on those five dimensions. Additionally, one of them looked at the polarity of the domains by looking at the words used in each.
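To make "coverage" and "average ratings" concrete, here is a minimal Python sketch. It is not the other researchers' code: the dictionary layout for the rated word list, the function name, and every rating value are invented for illustration, and the averages are taken over distinct words rather than over every occurrence, which is just one possible choice.

```python
def coverage_and_means(tokens, ratings, dimensions):
    """Compare a domain's vocabulary against a rated word list.

    tokens:     list of word tokens from one domain
    ratings:    {word: {dimension: score}} (hypothetical layout)
    dimensions: rating names to average, e.g. ["familiarity", "concreteness"]

    Returns (coverage, missing_words, mean_rating_per_dimension).
    """
    vocab = set(tokens)
    covered = sorted(w for w in vocab if w in ratings)
    missing = sorted(w for w in vocab if w not in ratings)
    coverage = len(covered) / len(vocab) if vocab else 0.0

    means = {}
    for dim in dimensions:
        values = [ratings[w][dim] for w in covered]
        means[dim] = sum(values) / len(values) if values else None
    return coverage, missing, means

# Invented ratings, only to show the shape of the data.
ratings = {
    "market": {"familiarity": 5.0, "concreteness": 4.0},
    "team": {"familiarity": 6.0, "concreteness": 5.0},
}
tokens = ["market", "team", "hedge", "team"]

coverage, missing, means = coverage_and_means(
    tokens, ratings, ["familiarity", "concreteness"]
)
```

Running the same function once per domain would give a set of per-domain averages that can be compared side by side.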