PHILLY:New In The City

6/12/2012: Down the Rabbit Hole and Bash

I read two more articles from that conference, both on literary analysis. One was specifically about Alice in Wonderland, a book I love. Most of the analysis in that project centered on the characters: the authors built character profiles from networks and looked at how the characters interact. I thought it was really interesting, and I think there is something to it. However, I was a little disappointed that they said nothing about the Cheshire Cat. He is my favorite character and also a pretty important one. I can understand how he got lost in this study, though. He does talk with Alice, but rarely, and mostly about big ambiguous ideas rather than about other characters, and the authors were using character names as cues.

The other article was really broad and mostly surveyed literary theory. I found it quite boring, although it mentioned other articles that I think I would be interested in reading.

You can find both articles here: http://www.aclweb.org/anthology-new/W/W12/#2500
Social Network Analysis of Alice in Wonderland

Towards a computational approach to literary analysis

After that little bit of reading, I finally started working with the New York Times corpus. I spent most of the day refreshing my bash scripting skills. I managed to write a bash script that takes the lead of one article and creates a word list from it. Then I ran the word list through another script that counts how many times each word appears in all of the leads in that directory. I picked out five articles at random and ran the scripts on them. The lists are full of worthless words like “the” and “and,” but at least it’s a start.
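For anyone curious, the two steps can be sketched roughly like this in bash. This is not my actual script, just a minimal version under some assumptions: the function names make_word_list and count_words are made up for illustration, and I am assuming every lead sits in its own .txt file in one directory.

```shell
# Sketch only: assumes bash and one lead per .txt file in a single directory.

# Print the unique lowercase words of one lead, one per line.
# Anything that is not a letter counts as a separator, so "don't"
# splits into "don" and "t".
make_word_list() {
    tr '[:upper:]' '[:lower:]' < "$1" |
        tr -cs '[:alpha:]' '\n' |
        sed '/^$/d' |
        sort -u
}

# For each word in the list file ($1), count its whole-word,
# case-insensitive occurrences across every .txt lead in the directory ($2).
count_words() {
    local list_file=$1 dir=$2 word count
    while IFS= read -r word; do
        count=$(cat "$dir"/*.txt | grep -oiwF -- "$word" | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$word" "$count"
    done < "$list_file"
}
```

Something like `make_word_list leads/article1.txt > words.list` followed by `count_words words.list leads` would then print each word with its count, separated by a tab. The paths here are placeholders. Keep the word list somewhere the `*.txt` glob won't pick it up, or the list itself gets counted.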

The most difficult thing I did today was getting rid of the punctuation attached to the words. I was able to get rid of commas and question marks just fine, but periods were a problem. It turned out I was missing a stupid extension. Google was my best friend today!
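Here is a small sketch of two ways to strip punctuation in bash. The function names strip_punct and strip_periods are hypothetical, and this is not necessarily how my script does it. One general thing worth knowing is that in sed and grep patterns an unescaped period matches any character, so a literal period needs a backslash.

```shell
# Sketch only: both functions read standard input and write standard output.

# Remove every ASCII punctuation character, including
# apostrophes and hyphens.
strip_punct() {
    tr -d '[:punct:]'
}

# Remove only periods. The backslash makes the dot literal;
# without it, 's/.//g' would delete every character on the line.
strip_periods() {
    sed 's/\.//g'
}
```

For example, piping a lead through `strip_punct` before building the word list keeps "answer." and "answer" from being counted as two different words.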

However, while I was running the scripts, I took a look at the actual structure of these leads and noticed a few things. One of them opened with a question, so even though I stripped the question marks for the count, I can’t forget about them. Questions make enticing leads, especially when it is a question people want answered and the article claims to have the answer. Another lead was a very descriptive scene, written almost the way an event would be described in a novel. One focused on common knowledge or the history of the topic. Another common device is quotes! People like to hear famous people say things. Controversial or trending topics are another: they get people fired up about something or grab their interest.

One thing that surprised me was the variety of subjects in the business section: music, hurricanes, and planes. I guess that makes sense, but it just wasn’t what I was expecting. I need to read more articles. I am too into literature.

I should also look at articles that are considered exceptional writing. I would guess that a good lead doesn't always indicate exceptional writing; however, exceptional writing should have a good lead. Anyway, that is something I will have to figure out how to check. I was having problems loading the text file that has that information.