CSE DREU: Summer 2012
Daily Log
Week 7
Week Goals
- Test to see if the sentiment analysis works on the UD definitions to
guess the sentiment of slang
- Find children's language research
- Find online language research paper
- Test and debug sentiment analysis on corpora
- Write testing file/program
- Add to "ignore_words"?
- Update bibliography?
- TAMU application!!!
- Start write up for future work?
Thursday, July 12 & Friday, July 13
- GRE class.
- Studied a lot of GRE vocab.
- Explored the website kidzworld. It's perfect and so silly.
- Started writing the HTMLparser to scrape info from the site.
- Finished the scraper. Getting messages from kidzworld.
- Got some info on kidzworld:
- Started running my sentiment classifier on the CSV files
- There are 7098 comments, but over 30,000 lines to analyze.
- Next week I will format the testing file and put the data into a csv
file.
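A rough sketch of what a scraper like this could look like, not the actual one from this log. It assumes Python 3's `html.parser` module and assumes each message sits in a `<div class="comment">`; that markup, the `CommentParser` class, and the `comments_to_csv` helper are all hypothetical names for illustration, since the real kidzworld page structure isn't recorded here.

```python
import csv
import io
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collects the text inside every <div class="comment"> (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while we are inside a comment div
        self.current = []     # text pieces of the comment being read
        self.comments = []    # finished comments, whitespace-normalized

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1   # track nested divs so we close correctly
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "comment" in classes:
                self.depth = 1
                self.current = []

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.current).split())
                self.comments.append(text)

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def comments_to_csv(comments, out):
    """Writes one (id, comment) row per message to an open file object."""
    writer = csv.writer(out)
    writer.writerow(["id", "comment"])
    for i, text in enumerate(comments):
        writer.writerow([i, text])


# Example with a made-up page fragment.
parser = CommentParser()
parser.feed('<div class="comment">Hi <b>there</b>!</div><div class="ad">x</div>')
buffer = io.StringIO()
comments_to_csv(parser.comments, buffer)
```

Writing to a file object rather than a path keeps the helper easy to test; passing `open("comments.csv", "w", newline="")` would put the rows on disk instead.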
Wednesday, July 11
- Happy free slurpee day!
- Picked up the two books I ordered yesterday from the library (came
so fast!!)
- The book The Texture of Internet is really good in terms of listing
the features/orthographic devices of "TXT" (Netspeak, CMC, online
language, etc.), which will be helpful for the final report.
- Started TESTING. so fun.....
- Realized I shouldn't check the POS of punctuation, so I changed the
code to skip it.
- Should I write up a testing program? Probably yes. One where it can
compare hand labeled sentiment to classified sentiment as well. This
will be good for the future. I'll do that later.
- When the post is referencing another person (which means it has
localhostkidgabmember in it), should we pay special attention to it
because we know it's talking about someone else?
- "Negative" words
- Dog (cad- scoundrel)
- Going (fail)
- Sit (sit)
- By (aside)
- guess
- did
- activity (natural process)
- not
- working (sour)
- have
- boot (kick)
- do (suffice)
- think
- comment (gossip)
- These are weird because the negativity comes from particular senses
of the words that we don't care about, although those senses could
sometimes be relevant (just not in the particular messages I'm checking)
- Changed the sentiment algorithm so that if a word's sentiment is
classified as negative, we use the algorithm from the dissertation:
look at the average positive and average negative scores and see if
they differ by a certain threshold (I didn't change it from 0.5, but
maybe I should?)
- This change resolved some issues with this set of messages (but
who knows what's going to happen on a different set =/ )
- Dog, sit, comment, by, ugly, and not are all still negative
- The ones that are supposed to be positive are still
classified as negative because the positive average is 0 while
the negative average is greater than 0. I could take this
requirement out of the if statement and see if that changes
things, but these changes feel super arbitrary, tuned just to
these specific words, and I don't know how they'll affect other
words.
- I could possibly change this so that
- "Bad" is not classified as negative anymore
- Maybe the threshold should be smaller to fix this
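The threshold check described above might look roughly like the sketch below. This is only a guess at the shape of the rule, not the dissertation's algorithm or the code from this project: the function name `recheck_negative`, the "neutral" fallback, and the example scores are all assumptions made for illustration.

```python
def recheck_negative(sense_scores, threshold=0.5):
    """Second-pass check for a word the first pass labeled negative.

    sense_scores: list of (positive, negative) score pairs, one per
    sense of the word (assumed SentiWordNet-style values in [0, 1]).
    Returns "negative" or "neutral" (the fallback label is an assumption).
    """
    if not sense_scores:
        return "neutral"  # nothing to average

    avg_pos = sum(pos for pos, _ in sense_scores) / len(sense_scores)
    avg_neg = sum(neg for _, neg in sense_scores) / len(sense_scores)

    # Keep the negative label only when the negative average beats the
    # positive average by at least the threshold.
    if avg_neg - avg_pos >= threshold:
        return "negative"
    return "neutral"


# Made-up scores, not real SentiWordNet entries.
strongly_negative = [(0.0, 0.75), (0.125, 0.625)]
barely_negative = [(0.0, 0.125), (0.0, 0.0)]

label_a = recheck_negative(strongly_negative)                 # gap 0.625
label_b = recheck_negative(barely_negative)                   # gap 0.0625
label_c = recheck_negative(barely_negative, threshold=0.05)   # smaller threshold
```

With these made-up numbers, the word whose negativity comes from a single weak sense drops the negative label at 0.5 but keeps it at 0.05, which shows how sensitive the result is to where the threshold sits.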
Tuesday, July 10
- GRE class in the morning
- Typed up the practice analytical essay we did in class-- still have
7 minutes to finish the whole thing, and then I need to send it to Cory
so he can look it over for me.
- Summers Scholars Luncheon = super fun! Good food too.
- Downloaded the WordNetDomains/WordNetAffect thing I found last week
(got approval to download it from the site), and it's sort of weird/hard
to understand. It doesn't look like there is a big enough set of synsets
to be helpful for our purposes, although the affect classification is
nice compared to just the positive/negative ratings from SentiWordNet.
- Did A LOT of research about Internet language, especially to help
with the paper and to focus on why we're looking at certain aspects of
language and how to analyze the sentiment in those situations.
- Saved most of the files to my language folder in my TAMU dropbox.
- Learned:
- CMC = Computer-Mediated Communication
- Paralanguage and prosody: terms from linguistics that have
to do with showing affect in language. For us, paralanguage on
the Internet is interesting because it explains the use of
capitalization, spacing, odd grammar and spelling (sometimes),
certain punctuation, and prolonged graphemes
(like they use in Croatian)/repeated letters.
- I need to talk to Dr. H about the iPad app thing.
- Have to write up my grad school application by tomorrow, so I can
have Steph and Dr. H look at it before next week perhaps.
Monday, July 9
- Updated my pictures. I went contra dancing this weekend!
- Worked on my TAMU grad school app.
- Emailed Dr. Gaz from the special ed department at TAMU
- Studied for GRE
- Ran the new sentiment classifier on the database. Doesn't do very
well and takes too much time :( Lots of testing left to do.
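With this much testing left, a small scoring helper could compare hand-labeled sentiment against what the classifier returns. The sketch below is one possible shape for it, not an existing part of the project; the `evaluate` function and the label strings are hypothetical.

```python
from collections import Counter


def evaluate(hand_labels, predicted_labels):
    """Compares hand labels to classifier output.

    Returns (accuracy, confusion), where confusion counts each
    (hand_label, predicted_label) pair.
    """
    if len(hand_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")

    confusion = Counter(zip(hand_labels, predicted_labels))
    correct = sum(n for (gold, pred), n in confusion.items() if gold == pred)
    accuracy = correct / len(hand_labels) if hand_labels else 0.0
    return accuracy, confusion


# Made-up labels for four messages.
hand = ["pos", "neg", "neg", "neu"]
predicted = ["pos", "neg", "pos", "neu"]
accuracy, confusion = evaluate(hand, predicted)
```

The confusion counts show which kind of mistake dominates (for example, negative messages read as positive), which is more useful for debugging than accuracy alone.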