CSE DREU: Summer 2012
Daily Log
Week 6
Week Goals
- Urban Dictionary stuff.
- Next week: test to see if the sentiment analysis works on these definitions to
guess the sentiment of slang.
- Get the sentiment analysis working with
the synonyms
- Set up run-length encoding dictionary
- Semantic polarity algorithm using SentiWordNet
- 5 week eval (due Friday)
- Debug the activity problem to finish the ALL section of the activity
feed?
- Update my resume.
- Progress Report
- Work on NSF grant/TAMU application?
Thursday, July 5 & Friday, July 6
- GRE class Thursday mornin'.
- Added a getter to my dictionary class.
- Found a spellchecker for Python on the Internet that uses edit distance (ED): http://norvig.com/spell-correct.html
- Using it in checkWordSentiment now, but there is still a problem with multiple duplicated letters, because the ED search only goes so far out before it quits. It's based on literature showing that spelling errors are usually within 1 edit distance of the target (this code goes up to 3). Therefore, I take unrecognized words and remove all duplicate letters to try to find the real word, on the idea that the de-duplicated form will be closer to the real word than the original, given how the spellchecker works.
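A minimal sketch of that duplicate-removal step (the helper name is mine, not from the spellchecker code):

```python
from itertools import groupby

def collapse_repeats(word):
    """Collapse every run of repeated letters down to one letter,
    so a stretched-out slang spelling moves back within reach of an
    edit-distance spellchecker, e.g. 'coooool' -> 'col'."""
    return ''.join(ch for ch, _ in groupby(word))
```

For example, 'coooool' collapses to 'col', which is only one edit away from 'cool', so the spellchecker can finish the correction from there.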
- Wrote some scripts to format the swearwords and the urban dictionary words into text files. I added these words into big.txt (actually called big copy 2.txt now) which is what the spellchecker is trained on.
- Edited my resume and put it into a LaTeX document so it's easier to
edit. It's way too long though.
- Wrote up my progress report. Gotta get it signed and turned in to
Theresa.
Wednesday, July 4
- Happy Independence Day! No work =D
Tuesday, July 3
- GRE class, as per usual.
- Wrote up and submitted the mid-term evaluation.
- Went to the Summer Scholars Luncheon Series.
- Set up the sentiment analysis with urban dictionary
- Now the program checks to see if the word is in WordNet. If it
isn't, it checks to see if it's in urban dictionary. If it is, then
it gets the first synonym of the word in UD (which unfortunately
could be a slang word again), and runs SentiWordNet on that instead.
- POS isn't always correct, unfortunately, so that messed things up
a bit. I avoided that problem for now by checking to see if the
word has a synset with that POS and, if not, checking whether it has
a synset without that POS. If it exists without that POS, we use
that one.
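That POS fallback might look like the following sketch; the lookup table is a toy stand-in for WordNet, and the names are mine:

```python
# Toy stand-in for a WordNet lookup: (word, pos) -> synset names.
# pos=None means "no POS restriction". The data is illustrative only.
TOY_SYNSETS = {
    ("happy", "a"): ["happy.a.01"],
    ("happy", None): ["happy.a.01"],
    ("run", "v"): ["run.v.01"],
    ("run", None): ["run.v.01", "run.n.01"],
}

def synsets_with_pos_fallback(word, pos):
    """Prefer synsets matching the tagged POS; if the tagger was
    wrong and none exist, fall back to synsets with no POS filter."""
    hits = TOY_SYNSETS.get((word, pos), [])
    return hits if hits else TOY_SYNSETS.get((word, None), [])
```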
- There are a number of words that aren't in WordNet (e.g.
articles), which are even listed on the website, so we should
just ignore those words altogether and not try to get their
sentiment for now, since we are looking at sentiment word by word
as opposed to in context.
- There is a lot of testing and debugging I need to do with this
method that I'm doing right now!
- I also need to code different implementations of the
sentiment algorithm to test next week:
- The dissertation I referenced below talks about a few different
methods, and I need to find more PUBLISHED reference material
(possibly what this guy referenced) to show why I am going to choose
my different methods, unless I just point out that they are as
simple as possible.
- The main ways to do the analysis all come down to
checking the difference between the pos and neg scores and
making sure they differ by a certain threshold.
- In the dissertation, the author looks at a number of
different aspects of the message, including parts of speech,
negation, the strength per part of speech, ratios, etc. I might
try to take this into account.
- I also need to add onto my list of IGNORE_WORDS.
- Now I keep a swearword dictionary, and if the word is in the
swearword dictionary, then we just mark it as full negative sentiment.
- Added a Dictionary class, so now a dictionary is wrapped
with a "hasKey" function that does the long check of digging into the
hash to see if the word exists in the dictionary.
- I should also make a function to get the word from the
dictionary
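A sketch of what that wrapper might look like, with the getter added; the method and attribute names are my assumptions, not the project's actual API:

```python
class Dictionary:
    """Sketch of a dictionary wrapper with a hasKey check and a
    getter (names assumed, not the project's actual code)."""

    def __init__(self):
        self.entries = {}

    def add(self, word, entry):
        self.entries[word.lower()] = entry

    def hasKey(self, word):
        # The "long check": dig into the hash for the word.
        return word.lower() in self.entries

    def getWord(self, word):
        # The getter suggested above; None if the word is missing.
        return self.entries.get(word.lower())

# Hypothetical usage: a swearword table where every entry is marked
# fully negative, stored as (pos, neg) scores.
swears = Dictionary()
swears.add("darn", (0.0, 1.0))
```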
Monday, July 2
- My computer wouldn't start up this morning which was super weird.
But it's on now.
- Four new pictures from Downtown Bryan!
- Thought about how I'm going to do the dictionary of run-length
encoding words.
- Encode the word, check if it already exists, if so, then get
sentiment
- Otherwise, look into hash...
- The hash should be keyed on words with all duplicate letters
removed. Each key points to the run-length-encoded words that expand
those letters, e.g. the key "below" would point to the entries
b1e1l1o1w1 ("below") and b1e1l2o1w1 ("bellow").
- I just have to figure out which one to pick.
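The encode-and-bucket scheme above can be sketched like this; the function names are mine, and it reproduces the "below"/"bellow" example:

```python
from itertools import groupby

def rle_encode(word):
    """Run-length encode a word: 'bellow' -> 'b1e1l2o1w1'."""
    return ''.join(f"{ch}{len(list(g))}" for ch, g in groupby(word))

def dedup_key(word):
    """Collapse repeated letters to form the hash key:
    'bellow' -> 'below'."""
    return ''.join(ch for ch, _ in groupby(word))

def build_rle_dict(words):
    """Bucket each word's RLE form under its de-duplicated key."""
    table = {}
    for w in words:
        table.setdefault(dedup_key(w), []).append(rle_encode(w))
    return table
```

A lookup would then encode the incoming word, check if that exact encoding is already present, and otherwise fall back to scanning the candidates under its de-duplicated key.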
- Read a couple more papers/sent analysis info (UPDATE
BIBLIOGRAPHY!):
- http://arrow.dit.ie/cgi/viewcontent.cgi?article=1019&context=scschcomdis (dissertation about using SentiWordNet for analysis)
- In this paper, used Penn Treebank POS tagger
- Word sense disambiguation is a large problem, didn't address
it in this paper. Instead:
- Evaluate scores for each synset
- If conflicting (pos & neg are the same for a term),
average the positive and average the negative
- Return the higher of the scores if the difference is
greater than some threshold (p. 121 when navigating the PDF)
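A sketch of that average-then-threshold step; the threshold value and function shape are my assumptions, not the dissertation's exact code:

```python
def resolve_sentiment(synset_scores, threshold=0.1):
    """Average the (pos, neg) scores across a word's synsets, then
    return the dominant polarity only if the gap beats the threshold.
    The threshold value here is an assumed placeholder."""
    if not synset_scores:
        return "neutral"
    pos = sum(p for p, n in synset_scores) / len(synset_scores)
    neg = sum(n for p, n in synset_scores) / len(synset_scores)
    if abs(pos - neg) <= threshold:
        return "neutral"
    return "positive" if pos > neg else "negative"
```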
- Sentiment Analysis: An Overview from University of Iowa
- It seems like the standard way to get sentence sentiment is
to average the word polarities over the number of words in the
sentence.
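That averaging step is simple enough to sketch directly (the function name is mine; scores assumed to be in [-1, 1]):

```python
def sentence_sentiment(word_polarities):
    """Average per-word polarity scores over the number of words in
    the sentence; an empty sentence is treated as neutral (0.0)."""
    if not word_polarities:
        return 0.0
    return sum(word_polarities) / len(word_polarities)
```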
- Information about NLTK as a part-of-speech tagger
- Created a part-of-speech tagger for our purposes
- Used the NLTK one and then translated the POS tags to WordNet/SentiWordNet
tags to help in picking the synsets of the words (sometimes they
have more than one meaning)
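The translation is conventionally done by tag prefix, since NLTK's default tagger emits Penn Treebank tags while WordNet uses single letters ('n', 'v', 'a', 'r'); a sketch, with the function name my own:

```python
def penn_to_wordnet(tag):
    """Map Penn Treebank POS tags (as produced by nltk.pos_tag)
    to the single-letter POS tags WordNet/SentiWordNet expect."""
    if tag.startswith("J"):
        return "a"   # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"   # verbs: VB, VBD, VBG, ...
    if tag.startswith("N"):
        return "n"   # nouns: NN, NNS, NNP, NNPS
    if tag.startswith("R"):
        return "r"   # adverbs: RB, RBR, RBS
    return None      # e.g. determiners: no WordNet POS
```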
- Updated dictionary.py with a createUrbanDictionary method that reads
in the words in the urban_dict csv file and updates the synonyms,
definitions, etc. as it should
- It has 3539 words (doesn't include any one-character words or
any words that have a non-alphabetic first or second character)
- Now I need to update the words sentiment by looking
through the synonyms!
- Still modularizing the program into separate files.