Penn State University
Research Journal


Information

Hi, my name is Ruyan Chen. I am a student at USC ('16) majoring in CS. You can find the main part of my website here. For the summer of 2013, I am working with Professor Anna Squicciarini at Penn State University, who is in the IST department. You can contact me through my email.

Project

Along with Candice McKune, I will be working on a project that detects and warns about deviance in a social networking context, focusing on forums. We hope to write a paper and create a web app by the end of these 10 weeks.

Final Paper

This week was mostly just settling in, reading parts of the thesis already written, and getting familiar with the programs we'll be using. We're reading data from a MySQL database (I learned the great power of an INNER JOIN!) and using R to model some of the user networks. There's not much to do this week since we can't officially start working for PSU until the background checks finish, but I spent a lot of time running errands like getting my new PSU card and getting to know the small, small, oh so small town.
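The power of an INNER JOIN is that it only keeps rows with a match on both sides, which is exactly what you want when pairing forum posts with the users who wrote them. A minimal sketch, using sqlite3 and a hypothetical `users`/`posts` schema standing in for the real database:

```python
import sqlite3

# Hypothetical in-memory schema standing in for the forum database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, thread_id INTEGER);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES (10, 1, 100), (11, 1, 101), (12, 2, 100);
""")

# INNER JOIN keeps only posts that have a matching user row,
# pairing each post with its author
rows = conn.execute("""
    SELECT users.name, posts.thread_id
    FROM posts
    INNER JOIN users ON posts.user_id = users.id
    ORDER BY posts.id
""").fetchall()
print(rows)  # [('alice', 100), ('alice', 101), ('bob', 100)]
```

The same SELECT ... INNER JOIN syntax works against MySQL; only the connection setup differs.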

The family we're staying with is pretty nice. I'm definitely not used to having to cook for myself every day, but it's good preparation for next semester.

I love Python because you can get a nice script running in under 10 lines of code. Sanitizing data isn't my favorite thing to do, but it's so nice to click a few buttons and just let the programs analyze your data for you.

Monday

Spent a very relaxing weekend at home catching up with friends. We made macarons, yum!

Tuesday

Trying to figure out different ways to plot something. Apparently tkplot() in R doesn't like more than 1000 nodes, which is a bit of a problem. Now I need the deviant users to be colored differently...
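In networkx terms (the library I ended up moving to), coloring a flagged subset comes down to building one color per node in node order; a small sketch with a hypothetical set of flagged users:

```python
import networkx as nx

# Tiny stand-in network; node order follows insertion order
G = nx.Graph()
G.add_edges_from([("u1", "u2"), ("u2", "u3"), ("u3", "u4")])
deviant = {"u2", "u4"}  # hypothetical set of flagged users

# One color per node, in G.nodes() order; a call like
# nx.draw(G, node_color=colors) would consume this list directly
colors = ["red" if n in deviant else "grey" for n in G.nodes()]
print(colors)
```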

Wednesday

We're having a meeting tomorrow, so I have to get all this graphing done. The good news is that our background checks went through! The bad news is that we have to start working for real... Generating plots takes a while because of all the data. MySQL calls also take a few minutes, but I think I generally have the R script down. Just a few more minor edits: it's really hard to read everything with all the nodes labeled, but it's also hard to read the graph without labels. The thesis we're using as a base for this project is so poorly written; I'm chugging through it while everything graphs.

Thursday

We're being split up into different task groups. I'm doing the graphing parts, mostly visualizing the network and trying to find some good metrics to use. The problem is that even for a small social group, ~200k edges is a lot to graph.

Monday

I switched to building the graph with networkx and Python because it's nice to be able to work with sets when I'm trying to filter out a lot of nodes. The nice thing is that you can export the graph, import it into R, and it plots a lot faster with a significant number of the (useless) nodes removed. The one problem was that I thought GraphML and GML were the same format, so there was a really weird segfault happening during the import/export.
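The filter-then-export step can be sketched in a few lines; the node names here are placeholders, and the buffer stands in for the file that R's igraph would read back (with `read_graph(..., format="graphml")`, matching the writer's format, not GML):

```python
import io
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])

# Sets make the filtering step cheap: keep only the nodes we care about
keep = {"a", "b", "c"}
H = G.subgraph(keep).copy()

# GraphML and GML are different formats; mixing up the writer and the
# reader is exactly the kind of thing that crashes an import
buf = io.BytesIO()
nx.write_graphml(H, buf)

print(sorted(H.nodes()), H.number_of_edges())
```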

Tuesday

Adding colors to the graph! We went to lunch with another lab group, which was really fun. Talked a lot about future plans... but I have plenty of time to decide if I want to pursue a master's/PhD. The weather's been pretty cool the last few days, but I like that. Found out that some of the data I've been using may be unreliable, so I have to go back and fix my SQL queries to join a few more tables together. Ugh, the pain...

Wednesday

We discussed the timeline for our project... basically W5/6 should combine Candice's work and mine, W7/8 should be spent on the web app, and W9/10 on the paper. I'm so bad at writing papers... that's intimidating. After significantly reducing the number of nodes, graphing is going a lot faster, but I'm still trying to come up with the right SQL queries to get all the data. I think I need to get rid of nodes later in the analysis so the numbers are more accurate...

Friday

I had a meeting with Anna today and we discussed moving forward with the graphing. We want to graph the interactions between users with at least 3 infractions, to see if they influence each other in any way. We also want to see the interactions between users in a group without looking at the threads they posted to.

Tuesday

I found a great query that lets you see only the rows where a certain field has occurred more than some number of times, but joining that with my original graphing query will be difficult: not only writing the query itself, but also handling the ton of data it's going to return, which I have no idea how to sanitize. I have the script to split up a bipartite graph, but I'm not sure how to keep edge attributes (like color), and running through and coloring all the individual edges is a waste of resources. Looking through the API for that... but it might have to stay as it is.
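The "field occurred more than N times" trick is a GROUP BY with a HAVING clause. A minimal sketch with sqlite3 and a hypothetical `posts` table; the same SQL runs against MySQL, and the result can be joined back into a bigger query as a subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (user_id INTEGER, thread_id INTEGER);
    INSERT INTO posts VALUES (1, 100), (1, 101), (1, 102),
                             (2, 100), (3, 101), (3, 102);
""")

# GROUP BY buckets the rows per user; HAVING filters whole buckets,
# keeping only users who posted at least twice
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS n
    FROM posts
    GROUP BY user_id
    HAVING COUNT(*) >= 2
    ORDER BY user_id
""").fetchall()
print(rows)  # [(1, 3), (3, 2)]
```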

Thursday

My major goal for graphing right now is to figure out if infracted users posted on the same thread, and to mark that thread. It doesn't sound that hard, but trying to make it as simple and fast as possible for hundreds of thousands of nodes makes it insanely difficult. The worst part is that when I write an algorithm, I'm not really sure it works until ~15-20 minutes later when things have finished running.
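One way to mark those threads in a single pass over the post records (rather than comparing users pairwise) is to bucket infracted posters by thread; the data here is hypothetical:

```python
from collections import defaultdict

# Hypothetical (user, thread) post records and a set of infracted users
posts = [("u1", "t1"), ("u2", "t1"), ("u3", "t2"), ("u1", "t2"), ("u4", "t3")]
infracted = {"u1", "u2"}

# One pass: collect the distinct infracted posters per thread
by_thread = defaultdict(set)
for user, thread in posts:
    if user in infracted:
        by_thread[thread].add(user)

# Mark threads where two or more distinct infracted users posted
marked = {t for t, users in by_thread.items() if len(users) >= 2}
print(marked)  # {'t1'}
```

This is linear in the number of posts, which matters when a run already takes 15-20 minutes.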

Friday

I'm putting graphing on hold right now and helping Candice work out some of the roadblocks that she's experiencing in the sentiment analysis. Most notably, the service we're using for the analysis, AlchemyAPI, provides an SDK that neither of us is able to figure out how to compile. But the nice thing is that they give us working examples and a very complicated makefile, so we're just editing one of the example C files and compiling that. It really bothers me though... there should be a simple way of generating a makefile, but for some reason the README and INSTALLATION files are very vague about it. Plus, being on OS X and not Ubuntu makes things difficult sometimes.

Monday

Back on my project now! I'm calculating some statistics for the networks and cleaning up code. It's amazing how I wrote 6 different scripts that do almost (but not quite) the same thing in a week, and now I have to get it all down to one script that takes options, because it's impossible for me to remember which file is for what at this point. The good thing is that there are only 3 graphing scripts, and they're mainly me tweaking the aesthetics.

Tuesday

Removed labels for graphing and started generating more metrics for graph analysis. Getting betweenness is ridiculously time consuming. What I don't understand is why you have to specify a subset of the nodes in a graph, but it'll generate betweenness for the whole thing anyway. I'm looking into reducing that time, but it doesn't seem likely. We're about to start integrating the two projects together. Candice just needs to process a bunch of the data and give me an output to start working on. In the meantime, I'm looking for easier ways to mark user interactions. A lot of really helpful functions in the library have to be ruled out simply because the graph is too large for them to be efficient.
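If the library in question is networkx, the node subset is a sampling knob rather than a restriction: exact betweenness runs shortest paths from every node (O(V·E) on unweighted graphs, which is what makes ~200k edges painful), while passing `k` samples that many pivot nodes and estimates scores for the whole graph. A sketch on a small built-in graph:

```python
import networkx as nx

G = nx.karate_club_graph()  # small stand-in for the forum network

# Exact betweenness: shortest paths from every single node
exact = nx.betweenness_centrality(G)

# Approximate betweenness: only k pivot nodes are sampled, but a score
# is still estimated for every node in the graph
approx = nx.betweenness_centrality(G, k=10, seed=42)

print(len(exact) == len(approx) == G.number_of_nodes())
```

That explains the confusing behavior: the subset controls the accuracy/speed trade-off, not which nodes get scored.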

Friday

Graphing has been all cleaned up, so now I'm working on parts of Candice's code so that we can join things together. Anna showed us some demo papers and we're probably going to write one of those. They're only 3-5 pages, so it doesn't seem too difficult, but I always have a problem getting the language the right amount of professional and understandable at the same time.

Tuesday

I figured out the AlchemyAPI install! It actually took installing another hash library to get it all working, but I have all the words on the bad word list hashed so no one has to look at those again (some of them were really weird, and it was awkward to show someone your code and have them see a wall of foul language). I accidentally hashed them with MD5 at first and then tried to hash the other set with SHA1, so there was a weird moment where I thought I had cut something off when saving to file. I converted the whole thing into a class that loads the hashed words in the constructor and frees the appropriate memory in the destructor. Classes make things so easy :)
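The actual code was C++, but the idea sketches cleanly in Python with hashlib and a couple of stand-in words. It also shows where the "cut off" scare came from: MD5 and SHA1 hex digests have different lengths, so a mixed set looks truncated:

```python
import hashlib

words = ["example", "sample"]  # stand-ins for the actual word list

md5_hashes = {hashlib.md5(w.encode()).hexdigest() for w in words}
sha1_hashes = {hashlib.sha1(w.encode()).hexdigest() for w in words}

# MD5 digests are 32 hex chars, SHA1 digests are 40, which is why a
# mixed md5/SHA1 set looks like some entries were truncated
print(len(next(iter(md5_hashes))), len(next(iter(sha1_hashes))))  # 32 40

def is_bad(word, hashed=md5_hashes):
    # Membership test never exposes the plaintext word list
    return hashlib.md5(word.encode()).hexdigest() in hashed
```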

Wednesday

Solved a problem we were having where the AlchemyAPI wants to return data through stdout, but we wanted to parse it and store only a small part of the information. That finishes two out of the four areas we need to work on.
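Our fix was in the C++ code, but the general capture-then-parse pattern is easy to illustrate in Python; `noisy_api_call` and its output lines here are hypothetical stand-ins for a library function that prints its results:

```python
import io
from contextlib import redirect_stdout

def noisy_api_call():
    # Stand-in for a library function that prints its result to stdout
    print("status: OK")
    print("sentiment: 0.73")

# Capture everything the call writes, then parse out just the field we want
buf = io.StringIO()
with redirect_stdout(buf):
    noisy_api_call()

sentiment = next(line.split(": ")[1]
                 for line in buf.getvalue().splitlines()
                 if line.startswith("sentiment"))
print(sentiment)  # 0.73
```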

Thursday

Not having a standard regex library in C++ frustrates me so much. I'm making it easier on the word checker by removing punctuation and lowercasing all the letters, but removing the text in quotes is going to be really difficult without regular expressions. I'm looking into the boost libraries now, but I'm still bothered that there's no simple regex_replace() function widely available in C++.
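For illustration, here is the same preprocessing in Python, where `re.sub` does in one call what I'd want `regex_replace()` (or boost's version) to do in C++; the sample sentence is made up:

```python
import re
import string

text = 'He said "this is QUOTED text" and then left!!!'

# Drop anything inside double quotes (the step that really needs a regex),
# then strip punctuation and lowercase for the word checker
no_quotes = re.sub(r'"[^"]*"', "", text)
cleaned = no_quotes.translate(str.maketrans("", "", string.punctuation)).lower()
cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
print(cleaned)  # he said and then left
```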
Note: I was on family vacation from 6/30-7/7, so week 7 (and subsequent weeks) are pushed back a week.

Wednesday

I finished laying out and putting together the GUI using Qt in C++. I'm working on loading the graph right now. It's the home stretch now; I just hope we'll have enough time to connect the GUI to the back-end. That, and I hope the queries don't take forever to run.

Friday

I'm connecting the buttons and trying to figure out the order to call all the functions in. Also tweaking some of the GUI to make it more understandable.

Monday

Having some trouble getting Python, R, and C++ to connect. I realize there are going to be a ton of dependencies on this... but first, I need to reorganize the file structure so it looks like one nice package.

Wednesday

I'm basically finished with the GUI and I'm moving on to data cleaning and collection. I'm writing the parts that extract data and build the objects from MySQL rows, which will be passed to Candice to analyze. We're trying to take out the Python script in between and just have the C++ spit out a GML file for R to graph.

Monday

Had our last meeting with Anna today and we planned out the last two weeks. I have to finish up some work on the GUI. I just finished the database query part and sent it to Candice. We're starting on our paper now, so I set up the structure of the LaTeX files (but I might just consolidate it into one file because it's not a long paper).

Tuesday

We tried to connect content relevance today and realized it didn't fit in with the rest of our metrics, so we emailed Anna and are touching up other parts of the code while we wait for her reply.

Wednesday

We spent most of today trying to put together all the files in preparation for transfer. So many code comments...!

This week I had to patch up a lot of leftover code. It turns out a lot of the code I had previously thought was written had not been. I spent the week writing main.cpp, finishing up the backend, and connecting it to the GUI.