CSE DREU: Summer 2012
Daily Log
Week 6
Week Goals
- Urban Dictionary stuff.
- Next week: test to see if the sentiment analysis works on these definitions to
guess the sentiment of slang.
- Get the sentiment analysis working with
the synonyms
- Set up run-length encoding dictionary
- Semantic polarity algorithm using SentiWordNet
- 5 week eval (due Friday)
- Debug the activity problem to finish the ALL section of the activity
feed?
- Update my resume.
- Progress Report
- Work on NSF grant/TAMU application?
Thursday, July 5 & Friday, July 6
- GRE class Thursday mornin'.
- Added a getter to my dictionary class.
- Found a spellchecker for Python on the Internet that uses edit distance (ED): http://norvig.com/spell-correct.html
- Using it in checkWordSentiment now, but there is still a problem with multiple duplicated letters, because the ED search only goes so far out before it quits. It's based on literature showing that spelling errors are usually within 1 edit distance of the target (this code goes up to 3). Therefore, I take unrecognized words and remove all duplicate letters to try to find the real word, on the idea that the de-duplicated form will be closer to the real word than the original, given how the spellchecker works.
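A minimal sketch of that duplicate-removal step (the helper name is mine, not from the spellchecker code):

```python
from itertools import groupby

def collapse_repeats(word):
    """Collapse every run of repeated letters down to one letter,
    so a stretched-out slang spelling moves back within reach of an
    edit-distance spellchecker, e.g. 'coooool' -> 'col'."""
    return ''.join(ch for ch, _ in groupby(word))
```

For example, 'coooool' collapses to 'col', which is only one edit away from 'cool', so the spellchecker can finish the correction from there.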
- Wrote some scripts to format the swearwords and the urban dictionary words into text files. I added these words into big.txt (actually called big copy 2.txt now) which is what the spellchecker is trained on.
- Edited my resume and put it into a LaTeX document so it's easier to
edit. It's way too long though.
- Wrote up my progress report. Gotta get it signed and turned in to
Theresa.
Wednesday, July 4
- Happy Independence Day! No work =D
Tuesday, July 3
- GRE class, as per usual.
- Wrote up and submitted the mid-term evaluation.
- Went to the Summer Scholars Luncheon Series.
- Set up the sentiment analysis with urban dictionary
- Now the program checks to see if the word is in WordNet. If it
isn't, it checks to see if it's in urban dictionary. If it is, then
it gets the first synonym of the word in UD (which unfortunately
could be a slang word again), and runs SentiWordNet on that instead.
- POS isn't always correct, unfortunately, so that messed things up
a bit. I avoided that problem for now by checking to see if the
word has a synset with that POS and, if not, checking whether it has
a synset without that POS. If it exists without that POS, we use
that one.
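That POS fallback might look like the following sketch; the lookup table is a toy stand-in for WordNet, and the names are mine:

```python
# Toy stand-in for a WordNet lookup: (word, pos) -> synset names.
# pos=None means "no POS restriction". The data is illustrative only.
TOY_SYNSETS = {
    ("happy", "a"): ["happy.a.01"],
    ("happy", None): ["happy.a.01"],
    ("run", "v"): ["run.v.01"],
    ("run", None): ["run.v.01", "run.n.01"],
}

def synsets_with_pos_fallback(word, pos):
    """Prefer synsets matching the tagged POS; if the tagger was
    wrong and none exist, fall back to synsets with no POS filter."""
    hits = TOY_SYNSETS.get((word, pos), [])
    return hits if hits else TOY_SYNSETS.get((word, None), [])
```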
- There are a number of words that aren't in WordNet (e.g.
articles), which are even listed on the website, so we should
just ignore those words altogether and not try to get their
sentiment for now, since we are looking at sentiment word by word
as opposed to in context.
- There is a lot of testing and debugging I need to do with this
method that I'm doing right now!
- I also need to code different implementations of the
sentiment algorithm to test next week:
- The dissertation I referenced below talks about a few different
methods, and I need to find more PUBLISHED reference material
(possibly what this guy referenced) to show why I am going to choose
my different methods, unless I just point out that they are as
simple as possible.
- The main ways to do the analysis all come down to
checking the difference between the pos and neg scores and
making sure they differ by a certain threshold.
- In the dissertation, the author looks at a number of
different aspects of the message, including parts of speech,
negation, the strength per part of speech, ratios, etc. I might
try to take this into account.
- I also need to add onto my list of IGNORE_WORDS.
- Now I keep a swearword dictionary, and if the word is in the
swearword dictionary, then we just mark it as full negative sentiment.
- Added a Dictionary class, so now a dictionary is wrapped
with a "hasKey" function that does the long check of digging into the
hash to see if the word exists in the dictionary.
- I should also make a function to get the word from the
dictionary
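A sketch of what that wrapper might look like, with the getter added; the method and attribute names are my assumptions, not the project's actual API:

```python
class Dictionary:
    """Sketch of a dictionary wrapper with a hasKey check and a
    getter (names assumed, not the project's actual code)."""

    def __init__(self):
        self.entries = {}

    def add(self, word, entry):
        self.entries[word.lower()] = entry

    def hasKey(self, word):
        # The "long check": dig into the hash for the word.
        return word.lower() in self.entries

    def getWord(self, word):
        # The getter suggested above; None if the word is missing.
        return self.entries.get(word.lower())

# Hypothetical usage: a swearword table where every entry is marked
# fully negative, stored as (pos, neg) scores.
swears = Dictionary()
swears.add("darn", (0.0, 1.0))
```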
Monday, July 2
- My computer wouldn't start up this morning which was super weird.
But it's on now.
- Four new pictures from Downtown Bryan!
- Thought about how I'm going to do the dictionary of run-length
encoding words.
- Encode the word, check if it already exists, if so, then get
sentiment
- Otherwise, look into hash...
- The hash should be keyed on words with all duplicate letters
removed. Each key points to the run-length-encoded words that expand
those letters, e.g. the key "below" would point to the entries
b1e1l1o1w1 ("below") and b1e1l2o1w1 ("bellow").
- I just have to figure out which one to pick.
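The encode-and-bucket scheme above can be sketched like this; the function names are mine, and it reproduces the "below"/"bellow" example:

```python
from itertools import groupby

def rle_encode(word):
    """Run-length encode a word: 'bellow' -> 'b1e1l2o1w1'."""
    return ''.join(f"{ch}{len(list(g))}" for ch, g in groupby(word))

def dedup_key(word):
    """Collapse repeated letters to form the hash key:
    'bellow' -> 'below'."""
    return ''.join(ch for ch, _ in groupby(word))

def build_rle_dict(words):
    """Bucket each word's RLE form under its de-duplicated key."""
    table = {}
    for w in words:
        table.setdefault(dedup_key(w), []).append(rle_encode(w))
    return table
```

A lookup would then encode the incoming word, check if that exact encoding is already present, and otherwise fall back to scanning the candidates under its de-duplicated key.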
- Read a couple more papers/sent analysis info (UPDATE
BIBLIOGRAPHY!):
- http://arrow.dit.ie/cgi/viewcontent.cgi?article=1019&context=scschcomdis (dissertation about using SentiWordNet for analysis)
- In this paper, used Penn Treebank POS tagger
- Word sense disambiguation is a large problem, didn't address
it in this paper. Instead:
- Evaluate scores for each synset
- If conflicting (pos & neg are the same for a term),
average the positive and average the negative
- Return the higher of the scores if the difference is
greater than some threshold (p. 121 when navigating the PDF)
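A sketch of that average-then-threshold step; the threshold value and function shape are my assumptions, not the dissertation's exact code:

```python
def resolve_sentiment(synset_scores, threshold=0.1):
    """Average the (pos, neg) scores across a word's synsets, then
    return the dominant polarity only if the gap beats the threshold.
    The threshold value here is an assumed placeholder."""
    if not synset_scores:
        return "neutral"
    pos = sum(p for p, n in synset_scores) / len(synset_scores)
    neg = sum(n for p, n in synset_scores) / len(synset_scores)
    if abs(pos - neg) <= threshold:
        return "neutral"
    return "positive" if pos > neg else "negative"
```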
- Sentiment Analysis: An Overview from University of Iowa
- It seems like the standard way to get sentence sentiment is
to average the word polarities over the number of words in the
sentence.
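That averaging step is simple enough to sketch directly (the function name is mine; scores assumed to be in [-1, 1]):

```python
def sentence_sentiment(word_polarities):
    """Average per-word polarity scores over the number of words in
    the sentence; an empty sentence is treated as neutral (0.0)."""
    if not word_polarities:
        return 0.0
    return sum(word_polarities) / len(word_polarities)
```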
- Information about NLTK as a part-of-speech tagger
- Created a part-of-speech tagger for our purposes
- Used the NLTK one and then translated the POS tags to WordNet/SentiWordNet
tags to help in picking the synsets of the words (sometimes they
have more than one meaning)
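The translation is conventionally done by tag prefix, since NLTK's default tagger emits Penn Treebank tags while WordNet uses single letters ('n', 'v', 'a', 'r'); a sketch, with the function name my own:

```python
def penn_to_wordnet(tag):
    """Map Penn Treebank POS tags (as produced by nltk.pos_tag)
    to the single-letter POS tags WordNet/SentiWordNet expect."""
    if tag.startswith("J"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, ...
    if tag.startswith("N"):
        return "n"   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("R"):
        return "r"   # adverbs: RB, RBR, RBS
    return None      # e.g. determiners: no WordNet POS
```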
- Updated dictionary.py with a createUrbanDictionary method that reads
in the words in the urban_dict csv file and updates the synonyms,
definitions, etc. as it should
- It has 3539 words (doesn't include any one-character words or
any words that have a non-alphabetic first or second character)
- Now I need to update the words sentiment by looking
through the synonyms!
- Still modularizing the program into separate files.