Week 4
Week Goals
- Figure out how to link in Python code to SQL.
- Get my negative word spotter to work with the database.
- Figure out what words we are missing from WordNet or make my own dictionary of bad words.
- Look into WordNet.
- Make a PowerPoint slide about myself for the lightning talk on Thursday (due Tuesday night).
- Talk to Nick about the privacy problem with students marking posts as bullying.
- Debug the activity problem to finish the ALL section of the activity feed?
- Take the practice GRE as a diagnostic test.
- Start STUDYING for the GRE!!
- Update my resume.
Friday, June 22
- Started looking more into how WordNet and SentiNet work to see whether they will really help us.
- Had a meeting with Nick and Stephanie (notes should be posted on the Wiki soon).
- Talked to Stephanie about how we are going to keep track of the
bullying alerts on activities.
- Basically, we need every activity to have a "state," which I'm
thinking will be an ENUM of "none," "needs review," "under review,"
"yes," and "no."
- I tried to make a plug-in to add a bullying state column to the wp_bp_activity table, but for whatever reason I can't get it to work (erggg), and I don't know how to debug it because, of course, there are no error messages.
- Once this is implemented, the Python program I wrote will just update the bullying state to "needs review" when it catches cyberbullying or negative sentiment in a comment or update.
- I'm going to keep reading about NLTK, WordNet, and SentiNet and look back at the papers to see which algorithms are going to be the best for us. Then I will think about how we're going to add the necessary chat slang and misspelled words to the dictionary.
- Went over the answers to the GRE practice test. I really only made careless errors in the math, so I mostly need to study vocab for the verbal portion and practice writing essays quickly!!
    from sentiwordnet import SentiWordNetCorpusReader, SentiSynset
    swn_filename = 'SentiWordNet_3.0.0_20120510.txt'
    swn = SentiWordNetCorpusReader(swn_filename)
Thursday, June 21
- To Do: WordNet dictionary stuff! Cron job stuff.
- GRE Class is too early in the morning ..zzzz
- Talked to Steph about the bullying_checked variable and we decided it would be better to just save the ID of the last activity checked because this will be faster when we have lots of data to iterate through.
- Added the last_id_bully_check to the plugin that I made before (should probably remove the other variable eventually if we don't use it, but we can keep it now for debugging purposes?)
- Now the program only runs on activity_comments and activity_updates
that have activity IDs greater than the last one checked!
- I was worried that this would mess up when a user deletes a comment because I thought the activity ID would just increase by 1 each time, but the IDs stay in increasing order no matter what, even if a message or update is deleted.
- I also added in a log that keeps track of every time the program is run and what messages were checked.
- It's important that next I figure out how I'm going to have the system alert the admin that we found cyberbullying (e.g. how to connect this fact back to SQL and then to the admin). I need to ask Steph about this! (Emailed her about it)
- I figured out why the cron job wasn't running (it was just because I needed to use the full path to the dictionary, etc.), so I fixed that and it works now!
- Didn't get to the WordNet dictionary stuff today, but that is NEXT!
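The "last ID checked" idea above can be sketched in a few lines. This is only an illustration: check_new_activities, the (id, type, content) row layout, and the is_negative callback are made-up names, and the rows are passed in as plain tuples instead of being read from wp_bp_activity.

```python
import logging

# Only these activity types get checked.
CHECKED_TYPES = ("activity_comment", "activity_update")


def check_new_activities(rows, last_id, is_negative, log=None):
    """Check every comment/update whose ID is greater than last_id.

    rows is an iterable of (id, type, content) tuples.
    is_negative is a function that takes the text and returns True/False.
    Returns (list of flagged IDs, ID of the last activity checked).
    """
    log = log or logging.getLogger("bully_check")
    flagged = []
    newest_id = last_id
    for activity_id, activity_type, content in sorted(rows, key=lambda r: r[0]):
        if activity_id <= last_id or activity_type not in CHECKED_TYPES:
            continue
        log.info("checked activity %d: %r", activity_id, content)
        if is_negative(content):
            flagged.append(activity_id)
        newest_id = activity_id
    return flagged, newest_id
```

Gaps in the IDs (from deleted comments) don't matter here, because the comparison only relies on the IDs increasing. The returned ID is what would be stored back as last_id_bully_check for the next run.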
Wednesday, June 20
- To Do: get time working, look into how to do the cron job, read Naive Bayes article, and GRE practice test.
- Got the time stuff working.
- BuddyPress is working in GMT (5 hours ahead of here), so now the Python code is working from GMT instead of local time.
- I added in a column in the wp_bp_activity table for bullying_checked
(tinyint for true or false) which defaults to false. I added it through
SQL Buddy which I know is not good, but I couldn't figure out how to add
a column to a table through a plug-in :(
- When Steph can help me, I will ask her how to do that instead because it's probably better.
- I tried modifying the actual bp_activity_class so that every activity had a bullying_checked variable that it put into the table, but because the column didn't already exist it didn't work out correctly. I don't know if this is really necessary if we can just have it default through the database anyway, but again, I'll ask Steph.
- Read through the Text Classification/Naive Bayes chapter that Wenzhe showed me, and it'd be helpful if I actually had training data haha. However, the articles I read already had algorithms for this classification problem, so this article was better for me to just understand how text classification and feature selection worked.
- Tried setting up the cron job but it didn't run. I'll look more into it tomorrow. Time to start the GRE practice test!
- Took the GRE practice test. Booooooooo, I need to memorize some vocab, stat.
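A small sketch of the GMT time handling, assuming the stored times are strings like "2012-06-20 12:00:00" in GMT. The function names are made up for illustration. One common cause of an offset of a few hours in this kind of conversion is time.mktime, which treats its input as local time; calendar.timegm treats it as GMT/UTC instead.

```python
import calendar
import time
from datetime import datetime

MYSQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def gmt_string_to_epoch(date_recorded):
    """Convert a GMT datetime string to seconds since the Unix epoch."""
    parsed = datetime.strptime(date_recorded, MYSQL_DATETIME_FORMAT)
    # timegm interprets the time tuple as UTC, unlike time.mktime (local time).
    return calendar.timegm(parsed.timetuple())


def epoch_to_gmt_string(epoch_seconds):
    """Convert seconds since the Unix epoch (e.g. time.time()) to a GMT string."""
    return time.strftime(MYSQL_DATETIME_FORMAT, time.gmtime(epoch_seconds))
```

With these two helpers, a value from time.time() and a GMT datetime from the database can be compared in either direction without depending on the machine's local time zone.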
Tuesday, June 19
- GRE CLASS!
- Made my PowerPoint slide about myself and put it onto my website! (see it in about)
- Investigated this whole early decision thing for grad school at TAMU. I'm going to do it. The due date is July 18.
- Got Python to connect to the kidgab database with MySQLdb.
- Got my Python negative word spotter to run on the messages from the database.
- Put in a variable in wp_options called last_bully_check that sets the time when the last check was made so that we only check new messages we've never checked before.
- Working on getting UNIX_TIMESTAMP to work, because right now SQL is storing a datetime and Python is using time(), and the conversion between the two isn't working perfectly.
- Things are going pretty great though! So much progress!!
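For illustration, a negative word spotter in its simplest form might look like the sketch below. The word list and the function name find_negative_words are placeholders, not the actual dictionary or program; in the real setup the messages would come from the kidgab database through MySQLdb instead of being passed in as strings.

```python
import re

# Placeholder word list; the real dictionary is a separate file.
NEGATIVE_WORDS = {"stupid", "ugly", "loser"}


def find_negative_words(message, negative_words=NEGATIVE_WORDS):
    """Return the negative words found in a message, in order of appearance."""
    # Lowercase and split into word-like tokens so punctuation and
    # capitalization don't hide a match.
    tokens = re.findall(r"[a-z']+", message.lower())
    return [token for token in tokens if token in negative_words]
```

For example, find_negative_words("You are SO stupid, loser!") returns ["stupid", "loser"]. Exact matching like this misses misspellings and chat slang, which is why those need to be added to the dictionary separately.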
Monday, June 18
- Updated the meeting notes from Friday.
- Downloaded WordNet, ConceptNet, and SentiWordNet. Figuring out how to use WordNet with NLTK and reading a paper about SentiWordNet.
- Edited my website and added in PICTURES (now located on the first page). Took a while to figure out the permissions after adding in LightBox.
- Figuring out how to connect Python to MySQL. It looks like there is already a module built into Python (sqlite3), but it is for SQLite databases, so I don't think it can talk to MySQL. I tried downloading pymysql, but I'm not sure if it's all working correctly because installing it was stupidly difficult for me.
- Since I downloaded SentiWordNet, which is just one huge text file, I
also downloaded a Python interface for it from
here. I
explored how it works, and it's pretty cool: every WordNet entry in it
already has a positive, negative, and objective value, so we don't have to
worry about putting those in.
- Our task will be to add in things like misspellings, chat slang, and acronyms. A lot of swear words are already in the corpus.
- Basically, WordNet already has a corpus of entries that all have attributes (about 117,000 synsets, i.e. groups of synonyms). The WordNet reader module in Python reads the entries with all their attributes into a dictionary. SentiWordNet has a long txt file that includes most of the entries in WordNet (no more than a couple hundred off), so the SentiWordNet program that I found reads the text file and puts all of them into a LIST. I think that this should be in a dictionary instead? You can access the WordNet dictionary from the SentiWordNet list though.
- wordnet.py is in the Python folder site-packages\nltk\corpus\reader\.
- I was going to take the GRE practice test today but I'm really tired zzzzz
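A sketch of the "dictionary instead of a list" idea, assuming the SentiWordNet 3.0 text layout (tab-separated POS, ID, PosScore, NegScore, SynsetTerms, Gloss, with comment lines starting with "#"). This is not the interface that was downloaded; load_sentiwordnet and add_custom_terms are hypothetical helpers, and the objective score is computed as 1 minus the sum of the positive and negative scores.

```python
from collections import defaultdict


def load_sentiwordnet(lines):
    """Build {word: [(pos, neg, obj), ...]} from SentiWordNet-style lines."""
    scores = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a data line
        try:
            pos_score, neg_score = float(fields[2]), float(fields[3])
        except ValueError:
            continue
        obj_score = 1.0 - (pos_score + neg_score)
        # SynsetTerms looks like "word#1 other_word#3"; drop the sense number.
        for term in fields[4].split():
            word = term.rsplit("#", 1)[0]
            scores[word].append((pos_score, neg_score, obj_score))
    return dict(scores)


def add_custom_terms(scores, custom):
    """Add our own entries (misspellings, chat slang, acronyms).

    custom maps a word to a (pos, neg) pair chosen by hand.
    """
    for word, (pos_score, neg_score) in custom.items():
        entry = (pos_score, neg_score, 1.0 - (pos_score + neg_score))
        scores.setdefault(word, []).append(entry)
    return scores
```

Usage would be something like scores = load_sentiwordnet(open(swn_filename)), followed by add_custom_terms(scores, {"h8": (0.0, 0.875)}), where "h8" and its scores are invented for the example. A word can appear in several synsets, so each word maps to a list of score triples and the caller decides how to combine them.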