My name is Megan Goodling. I am a rising junior at Davidson College. I am a computer science major and mathematics minor, and I will graduate in May of 2019. My email is email@example.com.
I play Division 1 soccer at Davidson. I love yoga, music, skiing, and exploring both the city and the outdoors. I love spending time in New York City, and have gone to many concerts and fun workout classes. I am living at Columbia's East Campus apartments with 4 other girls.
Dr. Kathleen McKeown of Columbia University is the Henry and Gertrude Rothschild Professor of Computer Science. She directs the Institute for Data Sciences and Engineering. Her primary interests lie in natural language processing and generation. Her research focuses on text summarization, statistical natural language, and the generation of multimedia explanation.
My research project is to create an interactive map using the Twitter data and an NLP emotion classifier that could potentially be a useful tool for analyzing Chicago gang activity.
The Twitter data used for the project is over 2 million tweets collected from members of a small gang in Chicago.
Since many users do not enable their Twitter geo data, our group wants to find a new way to locate users. One source of location information comes from places mentioned in tweet content and handles by the gang-affiliated users. This project looked at a way to get that data and visualize it by plotting it on the Chicago grid.
The emotion classifier I use on my data is one created for the specific vernacular used by the gang members. It was developed by Terra Blevins. I use it to represent tweets tagged as aggressive with a red line, and tweets tagged as loss with a blue line.
My goal for the project was to get an understanding of how computer science is used in academic research. I would love to someday create a useful and relevant tool, but I deeply value the experience and learning process regardless of results.
Final Report
This is my weekly blog - check back for new posts each week!
This is my final week in New York. I can’t believe how fast it has all gone by! I am sad to see it end!
This weekend I visited my family friends in Saratoga Springs, NY, and went to the horse races. My sister also briefly visited and we went to museums all over the city.
Over the weekend I constantly checked with Alyssa on the approval process, with no luck. I really started to worry and realized this would definitely be affecting how things go during the final week. However, these types of setbacks are part of the research process, as I have learned.
On Monday I went to the Columbia School of Social Work and met with the summer fellows, who will (hopefully) be annotating through my Mechanical Turk program at some point this summer. There were four of them and I introduced myself, explaining a bit about what I have done this summer and hope to continue doing. Since we were not able to annotate quite yet, I asked them to let me know as soon as they hear back from Amazon, and we went through my HIT form and a sample profile in the meantime.
I met with Dr. McKeown and we talked about the possibility that the annotating might not happen before my program ends. I told her that it would be okay and that I would be happy to look at the results remotely and pass them along for more analysis. At this point I feel like I have what I need to write my report on what I accomplished this summer. I will go ahead and get my Amazon project and batch set up so it can be completed when and if the fellows get approved. I am excited to have learned this skill of setting up a batch, but more importantly, I have caught a glimpse of what the results of the annotations will ultimately lead to: a tool to help society.
On Wednesday I wrote more of my report and attended a great research talk given by a professor at the University of Washington. She is one of my mentor’s colleagues and my whole lab went downtown to NYU to hear her speak. It was another great exposure to an interesting problem that NLP tools can address.
On Thursday and Friday I wrapped up my final report. I wrote about my attempt at mapping data parsed from the tweets, and also my experience setting up an Amazon Mechanical Turk program. I turned it in to my mentor for revision, and planned to post it after she looked it over. I tried to model my final report on the format of a scientific paper, as specified by the DREU, but it was tough since my project was a little bit all over the place.
I am thankful to Dr. McKeown, the DREU, Columbia University, and my family for making this whole summer possible. I had a great experience and would recommend it to all CS undergraduate students.
This week I planned on getting my whole mTurk project set up and ready to go for next week. My goal was to have a beta version ready and have Alyssa (the other undergrad working for my mentor) be my test subject to make sure it works.
On Monday I finished writing the HTML template for my HIT. I researched more about worker qualification and uploaded my first batch. One problem I am having is that the mTurk platform does not support emojis, and I know that the lab wants them to be included in the data annotations. They are not necessarily vital to this project, but I would have to find a way to get rid of them in all the data (I could write a short program that does that). I think it would be okay for emojis to be left out of the pilot program, and hopefully mTurk soon extends its platform so it can support all emojis.
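The "short program" for removing emojis could look something like the sketch below. It is only an illustration: the code-point ranges are an assumption that covers most common emoji rather than an exhaustive list, and `strip_emojis` is a made-up helper name.

```python
import re

# Assumed code-point ranges covering most emoji; not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, newer symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (used for flags)
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)


def strip_emojis(text):
    """Remove emoji characters and collapse any leftover whitespace."""
    without = EMOJI_PATTERN.sub("", text)
    return " ".join(without.split())


if __name__ == "__main__":
    print(strip_emojis("on the block \U0001F525\U0001F525 tonight \U0001F622"))
```

Running every tweet through a function like this before building the batch file would keep the unsupported characters out of the HIT data, at the cost of losing whatever meaning the emojis carried.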
On Monday night, Dr. McKeown spoke at Google NYC about her work in NLP. It was incredible! It was cool to hear about her work in other areas of NLP and the venue was awesome.
Alyssa just signed up on Tuesday to be an mTurk worker so she can practice my batch. It takes about a day to get approved, so I hoped she would be cleared by Thursday or Friday to test it out.
I met with William and Kyle and they gave me some great advice about how to modify my approach. Instead of individual tweets being HITs, we are going to do user profiles for each HIT, and the annotators will be able to keep a running log of information on the user. This also solves the emoji problem!
We also planned some sort of reward system to gamify the annotation process. We talked about offering Columbia swag for each user profile completed.
Tuesday I spent improving my HIT format and setting up logistics for the annotating for next week. I also attended a Google Research talk on NLP and barely understood any of it. I don’t have the math or CS background yet to get anything out of those kinds of talks, but it was still kind of interesting to hear.
Wednesday morning I met with Dr. McKeown to show her my HIT template and got some good feedback. I implemented some changes and figured out more logistics for next week with William and Rebecca. We had another group lunch and one student talked about RNNs. It was really hard for me to understand. Honestly I am not that interested in the hardcore math aspects of CS.
I wrote and finalized the instructions for the annotation process for next week. Then I went to another research talk: the lead software engineer from Bloomberg spoke about their software that helps in the financial market. It was much more fascinating than the Google talk because it was understandable and geared towards explaining the applications of the neural networks instead of the math behind them, which I appreciated.
On Wednesday night I went to dinner with all 4 of my roommates for the last time, since some of them start moving out this weekend! We had such a great summer together, and it was sad saying goodbye!
On Thursday I finished the instructions. Alyssa had yet to hear back from Amazon about her status as a worker, and I started worrying. Some online resources said that the supply of workers is very high, so Amazon isn’t accepting very many. I wanted to see how the data would come out once she annotated the tweets, so I could determine how I will find the winner of the treasure hunt. I’m also worried about how this will affect next week.
Friday, I went to the Columbia book store to pick up prizes for the treasure hunters: t-shirts for everyone for participating, and a sweatshirt for the person who finds the most data next week. I am still nervous about the worker status problem but hopefully things will work out next week.
This week I started off by finishing up my work with the location data. I came to the realization that this method was not effective, and that was a good thing to check off the list moving forward. Thus, I started to do a lot of research about Amazon Mechanical Turk and the basics of starting a project.
I met with Dr. McKeown Wednesday morning to discuss progress of last week and how I plan to spend the next 2.5 weeks. We talked about starting a mechanical turking project where we could instruct Chicago youths to annotate for location and gang affiliation. A side goal of the project would be to educate them about the power of social media and what information they provide online about themselves. My task will be to set up a pilot program for this idea over the rest of my time at Columbia.
Wednesday we also had a lunch with all the members in the NLP lab and a few from the speech lab. One PhD student gave a short talk/explanation about vanilla neural networks based on an article she sent out. It was pretty interesting and fun to get to know the other students.
Wednesday evening we had another gang project meeting. First we addressed a few problems that some of the PhD students were having, and then we focused on my potential work with the youths and annotation pilot. I got a lot of good ideas and will look to start putting something together tomorrow. We are going to extend the annotation process so that the turkers look for much more than just location and gang affiliation - they will search for any clues to personal information about the poster, such as name, birthday, age, address, family, etc.
Thursday I met with Jessica to get some advice on how to set up and start my mTurk project. She directed me to the documentation and told me how to create a qualification for users so I can manage who completes my HITs (i.e., the pilot users of my program). I started to set up the basics of the project through Amazon services.
Later that day, a PhD graduate from Columbia NLP came to give a talk about his work on Amazon search in Japan. It was so fascinating to hear from him and learn about the real world applications of what he learned here. I emailed him afterward to thank him for his time, and he directed me to a link with several internships for undergraduates.
I met with my mentor on Thursday afternoon and she had tons of ideas for my mTurk project. I am excited to get going on that through next week!
On Friday, I really got going on creating my mTurk project. I set up the basic framework and finished a lot of the HTML template that I will be using. I emailed Amazon support to ask a few questions about the qualification of users, since I want to limit the people who can take my HIT to specific people. They responded promptly and I will tackle that issue early next week.
Also on Friday, I went downtown to the Google office to meet with a Davidson alum. We had lunch and coffee and he told me about his experiences at many different companies. It was incredible to see the Google building!
At the beginning of this week I focused on methods to identify which gangs the users were a part of. I found some related work online, a paper posted about using feature extraction to identify members of a gang. However, the specific gang affiliation was not identified. I thought it could be useful to explore this avenue and use it in conjunction with the location analysis I have been working on.
I had been trying to narrow down my map by plotting an intersection of two places only if both were mentioned by the same user or by people in the same gang. However, it turned out that the data for that query was very limited and wasn’t very useful. For example, a given person might mention only one location, which produced a very uninsightful mapping.
This week was slow because my mentor is out of town and the Fourth of July was Tuesday. I am losing steam on this part of my project. I am ready to move on to the Mechanical Turk side of things, so hopefully we can turn to that next week. I only have a few more weeks left here and want to make the most of my experience.
I met with Dr. Owen Rambow, who is also working on the project. He was very helpful in helping me make sense of my somewhat problematic results and issues with the tweet dataset. We decided this would be the best way for me to proceed, at least until Dr. McKeown returned: I would search through manually and look for users who I could potentially assume were associated with O-block, and once I collected a network of users, I would run the location-finding script on their most recent tweets.
I first pared down the bigger list of 9000 users and looked for anyone who mentioned O block at any time. From there, I searched each of the ~200 users to see who fit the profile, and created a list. This also gave me a much more hands-on feel for the data as I read through the tweets, and it was very shocking at times.
The list I generated was only about 25 users. I searched the handles of each of the narrowed-down users, looked through their bios and first few pages of tweets for anything about O block or Chicago to verify, and made a decision based on intuition. Many gang members post pictures of guns or rap videos, and their language is becoming recognizable to me. When I finished going through the list, I decided that a list of 25 users was big enough to run my scripts and see if I could tell whether this approach would ever produce any relevant data.
The script to extract geo coded data was completely useless: not a single person in my list had geo data turned on. No surprise there. The script to find mentioned locations did not yield enough data either. Thus, the approach was tried, but it failed.
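The first filtering pass, from the full user list down to anyone who ever mentioned O block, can be sketched roughly as follows. This is not the script I actually ran: the `(user, text)` input shape, the `users_mentioning` name, and the spelling variants in the pattern are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern: matches "O block", "Oblock", and "o-block", ignoring case.
O_BLOCK = re.compile(r"\bo[\s-]?block\b", re.IGNORECASE)


def users_mentioning(tweets, pattern=O_BLOCK):
    """Return {user: number of matching tweets} for an iterable of (user, text) pairs."""
    counts = defaultdict(int)
    for user, text in tweets:
        if pattern.search(text):
            counts[user] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        ("a", "Free my bro from O Block"),
        ("b", "going to the store"),
        ("a", "oblock forever"),
        ("c", "roadblock ahead"),
    ]
    print(users_mentioning(sample))
```

The word boundary at the front keeps words like "roadblock" from matching. A pass like this only produces candidates, so each remaining user still has to be checked by hand.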
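A quick way to confirm how little geo data there is would be a check like the one below. It is a sketch that assumes each tweet is a dictionary with an optional `coordinates` entry, as in Twitter's standard tweet JSON. The format of my own dataset may differ, and `has_geo` and `geo_coverage` are illustrative names.

```python
def has_geo(tweet):
    """True if the tweet dict carries an exact point (assumed `coordinates` field)."""
    return tweet.get("coordinates") is not None


def geo_coverage(tweets):
    """Return (number of geo-tagged tweets, total number of tweets)."""
    tweets = list(tweets)
    tagged = sum(1 for tweet in tweets if has_geo(tweet))
    return tagged, len(tweets)


if __name__ == "__main__":
    sample = [
        {"text": "x", "coordinates": None},
        {"text": "y", "coordinates": {"type": "Point", "coordinates": [-87.6, 41.8]}},
        {"text": "z"},
    ]
    print(geo_coverage(sample))
```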
One notable thing that I have learned so far, as emphasized by Dr. Rambow in our meeting, is that bad results are not necessarily useless. They tell us what approaches do not work, and they must be crossed off before new methods can be attempted. So I am trying not to feel terribly discouraged about my efforts. Hopefully next week shall bring more progress.
In other news, a few fun things I have been up to in my spare time: I have been working out at various fitness studios and with the Columbia women’s team, and have met friends downtown for dinner! I also visited a friend up in Portland, Maine, last weekend to celebrate the Fourth. I have been reading a lot in my spare time and trying to explore the Upper West Side. Additionally, I have been to quite a few museums and plan to hit all of the big ones before I leave (in just a few weeks!).
The first day of this week, I stepped away from the Twitter data and focused on learning more about machine learning algorithms and tools. I followed 3 online tutorials that took me through somewhat simple examples of machine learning programs, and I got them running on my computer.
One tutorial had me create a classifier for various types of flowers. Another was an MNIST Softmax program for classifying hand-written digits. I also looked at YouTube videos for creating a simple neural network. Going through these tutorials helped me understand machine learning, and how big of a role math plays. I think it would be cool to look at a tutorial for machine learning in NLP, and create my own program for doing something with the Twitter data that we are looking at. I’m not sure what, but it could be interesting.
I also messed around with the TensorFlow Playground, which interactively lets the user manipulate variables, hidden layers, number of features, etc., to see how the output changes accordingly. It’s pretty cool, but I don’t really understand it.
I met with both Kevin and Dr. McKeown. My meeting with Kevin did not lead to much; he is moving away from looking at location of users and focusing on the relationship between aggression and loss tweets. It is also his last week, so he is finishing up and starting to conclude his work. However, we got lunch and it was still good to catch up.
At my meeting with Dr. McKeown, I expressed that I am feeling a little lost in my research, and she gave me some helpful guidance. We also had an upcoming meeting with the whole gangster intervention group, so she asked me to do a presentation of everything I had so far, and told me to ask questions to the whole group.
Another really good thing that came out of my Tuesday meeting with Dr. McKeown was that we brainstormed a cool way to incorporate an element of machine learning, which I will do at some point. She suggested a partnership with the Chicago youth in which they would serve as annotators for data, and in turn, we could educate them about how their posts reveal their location to the public. This was based on the fact that the gang leader Gakirah Barnes was assassinated after she posted about her whereabouts.
The data we would collect from the annotators would be: does this tweet tell you anything about the location or gang affiliation of the Tweeter? Does it tell you anything about opposition to another gang? Then we could train a model on this labeled data to predict these things.
This week I babysat for a family on the Upper West Side. It was fun to hang with the cute little girl and see the inside of a NYC apartment firsthand! Also, the babysitting rate here is insanely high!
I made a short slideshow for the Wednesday meeting to present my data and progress thus far, and I got some really good feedback. It helped me get a sense of the direction I should take my research. Also at the meeting there was discussion of major research grants for the group. It was cool to hear about that side of things.
Towards the end of the week I focused on how I could make my previous methods better, and I also gave some thought to some different approaches I could take to make my data more relevant and interesting. I looked at finding any geo coded data from the Twitter API, since Kevin was able to find a decent amount. Dr. McKeown also gave me some papers about Mechanical Turk, so I spent some time reading those as well.
This week I began by exploring the idea of mapping the locations on a very basic grid that would represent a map of the area. I planned on drawing a line on each street mentioned, and I’m hoping this will show some sort of convergence, indicating which areas the gang talks about most frequently.
After spending a lot of time exploring matplotlib, running examples on my machine, and trying to figure out what I wanted, I was able to generate some sort of representation in a grid-like format. I mapped each cross street to a list index, and set the scale to 0.5 streets per block to make my grid more accurate (since this is the case in this area). I have to manually enter in all the cross street names, and I am wondering if there is some sort of database to pull from instead in case I wanted to try it on another city.
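The street-to-index mapping described above can be sketched as follows. The street names are placeholders rather than my real list, `SCALE` simply reuses the 0.5 figure from above as the spacing between neighboring streets, and the helper names are made up for illustration.

```python
# Placeholder street lists; the real ones were typed in by hand.
EW_STREETS = ["1st St", "2nd St", "3rd St"]  # streets running east-west
NS_STREETS = ["A Ave", "B Ave", "C Ave"]     # streets running north-south
SCALE = 0.5  # assumed grid spacing between neighboring streets


def street_segment(name):
    """Return ((x0, y0), (x1, y1)) for the line drawn along a mentioned street."""
    if name in EW_STREETS:
        y = EW_STREETS.index(name) * SCALE
        return (0.0, y), ((len(NS_STREETS) - 1) * SCALE, y)
    if name in NS_STREETS:
        x = NS_STREETS.index(name) * SCALE
        return (x, 0.0), (x, (len(EW_STREETS) - 1) * SCALE)
    raise KeyError(name)


def intersection(ew_name, ns_name):
    """Return the (x, y) grid point where an east-west and a north-south street cross."""
    return NS_STREETS.index(ns_name) * SCALE, EW_STREETS.index(ew_name) * SCALE


if __name__ == "__main__":
    print(street_segment("2nd St"))
    print(intersection("3rd St", "B Ave"))
```

Each segment can then be handed to matplotlib, for example `plt.plot([x0, x1], [y0, y1])`, so that frequently mentioned streets pile up visually.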
Our group met on Wednesday night to discuss progress and how to move forward. It seemed that Kevin and I could really benefit from working together; however, he is hard to get ahold of and not super reliable.
After messing around with matplotlib a lot, my program was able to generate a somewhat discernible graph of intersections of blocks mentioned. My next task is to find these intersections in GPS coordinates, and this is what Kevin will look at to see if there is a correlation with violence. I will filter my data to just aggressive tweets.
I tried to find a good API that could find geo coordinates by cross streets. I found one that does so and I figured out how to query it. However, the problem is that some of the theoretical points I have lie on top of the park’s boundaries, and thus those streets don’t actually cross. As a result, I get a 404 error from my HTTP request. I need to figure out a better use of my data in terms of what I can plot.
I am having a hard time deciding what my data will actually be useful for. Just because I am able to get locations and streets mentioned in tweets, what does that mean? What does that translate to in a real world context? I am feeling a bit lost here and am not sure what my next move should be.
At first, when I was running my code on a sample set of 10,000 tweets, the data points seemed to be converging at the southwest corner of the park, which is where Gakirah Barnes’ gang is located. I got excited because I felt like that could be indicative of the Twitter presence in that area. However, when I ran the code on the full set of 2 million tweets, the points no longer seemed to cluster in one area and were essentially everywhere. Maybe that has to do with my visualization and plotting of the points; I’m not sure.
After varying the alpha values for the line weights, the larger dataset started to look better and still converged in the same area.
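One way to see why alpha matters so much with 2 million tweets: under standard alpha blending, n identical lines drawn on top of each other with opacity a combine to an opacity of 1 - (1 - a)^n, so even a small alpha saturates quickly. The sketch below works through that arithmetic. It assumes standard blending and does not use the actual values from my plots; both function names are illustrative.

```python
def stacked_opacity(alpha, n):
    """Combined opacity of n identical lines drawn on top of each other."""
    return 1.0 - (1.0 - alpha) ** n


def alpha_for_target(target, n):
    """Per-line alpha needed so that n overlapping lines reach `target` opacity."""
    return 1.0 - (1.0 - target) ** (1.0 / n)


if __name__ == "__main__":
    # Two half-transparent lines already cover three quarters of the background.
    print(stacked_opacity(0.5, 2))
    # Alpha needed so that 1000 overlapping lines are about 90% opaque.
    print(alpha_for_target(0.9, 1000))
```

Picking alpha from the number of mentions on the busiest street keeps those streets from going fully solid while the quieter ones stay faint but visible.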
Towards the end of the week, I reached a bit of a standstill since my mentor is out of town this week and was not able to meet with me. So I started doing my own research about the Mechanical Turk process and how machine learning algorithms use it. Another project Dr. McKeown suggested for me involves creating one of the forms for Amazon turkers to complete, and I find the process very interesting so I will look into it.
I started a machine learning tutorial and was so fascinated by the different types of classifiers I learned about. It was really cool working through the coding examples and seeing how supervised learning works. I think it is very valuable understanding to have.
I started out this week by creating a very small test set of tweets to help with my parsing. I included tweets that contained relevant data about location and gang affiliation. I used this small set to start creating my parser, and continued to add tweets to make it more robust for testing.
A very important thing I needed to consider, however, was how to organize my location data in terms of the best data structure to hold it all. I decided to create two lists: one containing all the users and information about locations and gangs mentioned, and another list of all locations mentioned and their tweet contents with aggression/loss score. I think this would allow me to create 2 maps and see which is more useful.
A major decision I made is that, instead of creating a tool that attempts to locate Twitter users, I will make a tool that finds the most mentioned areas by a select group of users in order to generate a sort of “heat map”. This will allow someone to locate the areas in which the most “action” is happening.
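The two lists and the "most mentioned" count could be organized along these lines. This is a sketch, not my actual code: the class and field names are made up, and the aggression/loss score is simplified to a single text label.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserRecord:
    """One entry in the per-user list (field names are illustrative)."""
    handle: str
    locations: list = field(default_factory=list)
    gangs: list = field(default_factory=list)


@dataclass
class LocationMention:
    """One entry in the per-location list."""
    location: str
    text: str
    label: str  # assumed values: "aggression", "loss", or "other"


def most_mentioned(mentions, top=3):
    """Return the `top` locations with their mention counts, most frequent first."""
    return Counter(m.location for m in mentions).most_common(top)


if __name__ == "__main__":
    mentions = [
        LocationMention("64th", "tweet one", "loss"),
        LocationMention("64th", "tweet two", "aggression"),
        LocationMention("King", "tweet three", "other"),
    ]
    print(most_mentioned(mentions, top=1))
```

The counts from `most_mentioned` are what would drive the "heat map", while the label on each mention decides the line color.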
When analyzing the data from the CSV, I had to be very careful because the data is often corrupted if there is a comma in one of the fields, messing up the format. Thus I downloaded Excel, which allows me to create files that are separated by tabs instead of commas. Tabs are much less frequently used in Twitter data, so this should work better.
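The same conversion can also be done without Excel. The sketch below uses Python's built-in `csv` module and assumes the original export puts quotes around any field that contains a comma; if it does not, the fields cannot be separated reliably by any tool. The `csv_to_tsv` name is illustrative.

```python
import csv
import io


def csv_to_tsv(csv_text):
    """Convert comma-separated text to tab-separated text, keeping quoted commas intact."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Replace stray tabs inside fields so they cannot be mistaken for delimiters.
        writer.writerow([cell.replace("\t", " ") for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    print(csv_to_tsv('user,text\nabc,"on 64th, all day"\n'))
```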
I spent a lot of time parsing the data and organizing my Python stripping code. I had to consider a lot of different tweet formats, such as retweets, emojis, and http links. It is requiring me to refresh the Python skills I learned 2 years ago and get back to the basics.
My parser worked well on a tiny dataset of 5 tweets, successfully pulling out gang and location mentions, and associating information with the correct user (e.g., if it’s a retweet). I fine-tuned my parser by trying it on a bigger set. I think it has trouble when there are specific emojis, because I get a UTF-8 error. I plan on asking Elsbeth about that later.
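A stripped-down version of this kind of parser might look like the sketch below. It assumes retweets start with `RT @handle:` and credits them to the original author, which is one reading of "the correct user". The location pattern (numbered streets and "<name> block") is a hypothetical stand-in for my real rules.

```python
import re

URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"^RT @(\w+):\s*(.*)$", re.DOTALL)
# Hypothetical location patterns: numbered streets ("64th") and "<name> block".
LOCATION = re.compile(r"\b(\d{1,3}(?:st|nd|rd|th)|\w+ block)\b", re.IGNORECASE)


def parse_tweet(poster, text):
    """Return (author, cleaned_text, locations) for one tweet.

    Retweets are credited to the original author instead of the poster.
    """
    author = poster
    match = RETWEET.match(text)
    if match:
        author, text = match.group(1), match.group(2)
    # Drop links, then collapse the whitespace they leave behind.
    cleaned = " ".join(URL.sub("", text).split())
    locations = [found.lower() for found in LOCATION.findall(cleaned)]
    return author, cleaned, locations


if __name__ == "__main__":
    print(parse_tweet("bob", "RT @alice: posted on 64th http://t.co/x"))
    print(parse_tweet("bob", "chilling on O Block"))
```

For the UTF-8 error, one thing worth trying is opening the file with `open(path, encoding="utf-8", errors="replace")`, so that undecodable bytes become placeholder characters instead of stopping the script. I have not confirmed that this is the cause.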
At our weekly meeting on Wednesday, each member of the group reported on their recent progress. I also met with Kevin, and it seems like he is going in a different direction than I am. Dr. McKeown is also mentoring another student, Alyssa, but her work is mostly on a separate project. I met with her a few times to help give ideas about data scraping (as I have experience with it), and she seems to be doing better with her project.
Dr. McKeown was unable to meet with me during our usual Thursday meeting, so we spoke on the phone instead. She gave me a great idea: instead of using a mapping package for Python or some other complicated visualization tool, I could just make a square grid, since the roads are essentially set up that way. This is what I plan on working on for the majority of next week. Hopefully the map leads to something insightful; we shall see.
This week was spent mostly exploring the two potential projects to see what jumped out at me most. My plan was to create a potential project idea for each, and then decide which one at the end of the week when I meet with Dr. McKeown to discuss.
For the gangster intervention project, I started reading through lots of tweets to get a sense of the dataset. I was having trouble seeing how location data could actually be found in the content. Thus I met with William Frey of the SAFE lab, which is Columbia’s space for investigating the online and offline behaviors of youths of color. He is also working on the gangster project and is familiar with the specific vernacular of the Chicago gang we are studying. He helped me look for patterns in the data, showing me what could allude to a person’s location and gang affiliation. I felt a little better about how I would approach the data. He showed me a somewhat outdated map of Chicago blocks coded by the minor gangs that exist block to block. This was helpful in giving me insight into how the gangs are organized.
For the project, I will need to generate a map, so I read a few resources about plotting on Python maps to refresh my memory. I have done this some in computer science classes, but not in a while. I will later need to investigate how I will create an interactive map.
Continuing my introductions to all of Dr. McKeown’s grad students, I spent some time talking to Jessica Ouyang about her work in NLP involving aligning sentences into a bigger paragraph. Her description was very technical and gave me a sense of what her research entails.
This week I also met with Elsbeth Turcan (Dr. McKeown’s grad student working on the energy project) to get an idea of how I could help with that project. She gave me 3 ideas of how I could do that: 1) I could modify some of the existing templates to make them more punchy or use alternative units (gallons of water, polar bears killed, etc.). 2) I could modify the newspaper summarizer to create more attention-grabbing text. 3) I could get to know in depth the machine learning algorithms used to analyze text. I figured out how to run a test of the message sending program, which would be a big step in helping me get started if I chose this project.
On Wednesday, I met with William again and attended a meeting about the gangster intervention project. At this point, I am beginning to feel more strongly about getting involved in this project, as I feel like it would give me more autonomy and feel like I’m doing my own project instead of making small changes to someone else’s. While the data is very messy and the project will be far from straight-forward, I am excited about its potential.
I created a simple Python script to make sure I was able to parse through a small subset of the Twitter data, and it seemed pretty simple to figure out. The major task will be creating a very detailed parser to create dictionaries and provide me with useful data.
My approach to this would be testing on a small set of tweets (maybe 20) and seeing what data I could get, and at the end, applying it to the huge dataset of over 2 million.
On Thursday I had a meeting with Dr. McKeown, and I presented a more detailed project proposal that would help guide me through the rest of the project. This can be seen in the research section of my website. I am excited about working on the gangster project and hope I am able to create an interesting and relevant tool. I feel much better going forward and am excited to see what the rest of the summer will lead to.
Hi! Welcome to my weekly blog, where I will discuss and reflect on my participation in the DREU program over the course of the summer. I will post once per week, documenting my current work, conversations I have with other computer scientists, successes and frustrations. I will also report on my exploration of the amazing New York City and detail my involvement in extracurricular activities. I hope you enjoy!
I met with my mentor, Dr. Kathleen McKeown, at the beginning of the week to discuss possible projects to get involved with. We also talked about my previous experience in computer science, and how I could get involved in a project that was challenging but feasible. Dr. McKeown was very welcoming and made me very excited to work in her group this summer. The two NLP projects she detailed in this initial meeting were:
1) a study of energy use in Columbia apartments, involving generating personalized feedback based on usage data in order to alter energy consumption behaviors, and
2) the analysis of gang activity via social media, in an attempt to predict and possibly prevent gang violence.
Both of these projects greatly interest me. I feel like they both exemplify how natural language processing can be applied to real world data and create a positive world impact. One course I took sophomore year was Digital Studies, and this gave me a great introduction to computational tools that can be used for the analysis of texts. My meeting with Dr. McKeown gave me more insight into how NLP can be directly applied to address a major problem.
This week, Dr. McKeown instructed me to read several papers as well as meet with several people involved with these and other NLP projects. I met with Elsbeth Turcan, a grad student who is working on the energy project, and Noura Farra, who is working on an NLP project involving less widely spoken languages. I also spoke on the phone with Fei-Tzin Lee, who is working on the gang activity project. Each was very informative and helped me understand a little more about NLP and their roles as researchers.
I also got to sit in on a meeting about the gang activity project. I listened to a data annotator give an update about the labeling of Twitter data and other aspects of the data processing. The meeting was held in the Northwest Corner Building of Columbia on the top floor, and the view was incredible! The photo at the top of this site comes from up there.
Additionally, I met with Kevin Li, a Columbia undergrad working on the gang project. He is a statistics major and is also interested in getting involved. He showed me around the campus and took me to a spot where Columbia students often study and get coffee. I got to know him a bit and am excited to work more on this project alongside him.
During this first week, I also explored online sources about machine learning, neural networks, and natural language processing tools. One thing I found particularly interesting was a selection of papers that I read about the ethics of the use of big data. As data becomes increasingly available to us through users’ online presences, it is important to consider how its use can affect the people who stand behind the numbers and figures.
Towards the end of the week, I met again with Dr. McKeown about possible ways to get involved on both projects. For the gang intervention project, a possible task for me would be to sort through the tweets and look for any information about the location of the user (e.g., specific blocks, streets, schools in Chicago). Then, I would plot this on a map. I would also plot the amount of blight or empty buildings, as well as arrests related to substance abuse, and look for a correlation between the data.
For the environment/energy project, I would likely work alongside Elsbeth and help with the templates or the new summarizer.
I plan on spending some time over the weekend and during the beginning of next week exploring and researching about these possible projects to see what jumps out at me.
The final thing I have worked on thus far is the creation of this website. I first considered using a site like WordPress to host my site, but then decided to create it from scratch instead. I have been wanting to practice and learn more HTML, and this seems like a prime opportunity to do so.
Thank you for reading and check back next week for my next post!