DREU 2017
Surina Puri
My experiences from CRA-W's (Computing Research Association-Women's)
DREU Program (Distributed Research Experiences for Undergraduates)

WEEK 01

CSE REU Bootcamp

As a DREU participant, I am thrilled to be participating in the Big Data Analytics REU at WashU. The CSE faculty at WashU organized a bootcamp during the first week, which included sessions on MATLAB, Unix, version control, and machine learning. Along with the busy bootcamp sessions, the first week also included fun outings with other REU students and professors. On Wednesday, I met Dr. Ottley and the other students working with her this summer. I was assigned the Predictive Visualization research project, which aims to predict users' actions as they interact with an interface. Based on interaction data, such as mouse clicks, the project aims to transform visualizations to assist users. For this project, I would be creating the visual interface for the user study, collecting interaction data, and then analyzing this data with machine learning algorithms to predict users' interactions. During the first week, I took tutorials on Python's machine learning library scikit-learn and on R. I also reviewed the research papers that Dr. Ottley had previously shared with the team and completed the CITI training required for conducting an online user study. On Friday, we discussed our progress and the next week's tasks with Dr. Ottley.

WEEK 02

Designing The User Study

On Monday we had a team meeting with Dr. Ottley to discuss our progress and goals. My research involves two major steps: first, designing the user study and collecting data; second, developing an algorithm to analyze the data. During the first half of the week I focused on designing the user study. The users would interact with a crime map of St. Louis and answer six questions, each of which requires the users to explore the map and click on points that share some common underlying feature(s) (all points in a neighborhood, all points of a category of crime, etc.). After finalizing the six questions, I used D3 to create the interface for each question. During the second half of the week, I integrated the visualization with the Amazon Mechanical Turk platform, which would be used to collect data online. By the end of the week, I had finalized the questions that the users would be answering and integrated the visualizations for the user study with the online data collection platform.

WEEK 03

Developing The Platform For Data Collection

This week I continued to work on the data collection platform. On the backend, I added data storage in JSON format and wrote Python scripts to convert the data to CSV format. On the frontend, I improved the UI of the user study by adding useful information and buttons to help users easily navigate the platform online. On Friday, we hosted the user study on a server and collected pilot data. This was a very useful exercise because we received great feedback from the users (other REU students) who participated in the pilot study. On Friday evening I went to Shakespeare in the Park with other REU students. The colorful production of Shakespeare's The Winter's Tale, set against the landscaped backdrop of Forest Park, was a treat to watch.
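A minimal sketch of what such a conversion script could look like (the field names here are illustrative assumptions, not the exact schema our study logged):

```python
import csv
import json

# Illustrative field names; the actual study logged its own schema.
FIELDS = ["participant", "task", "x", "y", "timestamp"]

def json_to_csv(json_path, csv_path):
    """Flatten a JSON file of logged click events into a CSV file."""
    with open(json_path) as f:
        records = json.load(f)  # expects a list of event dictionaries

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for record in records:
            writer.writerow(record)

if __name__ == "__main__":
    json_to_csv("interactions.json", "interactions.csv")
```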

WEEK 04

Integrating Visualization And Data Storage With The Online Data Collection Platform

This week, I continued to work on the user study platform. Based on the feedback received from the users in the pilot study, I rephrased some of the tasks to make sure that users click on points and interact with the map. I also changed the answer format from text entry to radio buttons. By the end of the week, the online user study platform was complete. In our weekly meeting, Dr. Ottley discussed two approaches that we would be taking to create predictive visualizations. One approach is to use a hidden Markov model to generate a list of points the user would be interested in based on previously clicked points. The other machine learning approach is to use a clustering algorithm to create multiple data models and predict which features the user is interested in based on the model that best fits the user's clicks. I chose to work on the clustering approach and would be using Python's scikit-learn to create clusters and models of the data.

WEEK 05

Switching Gears to Machine Learning - Clustering

This week I switched gears from the user study to the clustering approach. After taking a few tutorials on Python's scikit-learn and trying out a few clustering examples, I was ready to try clustering our data set. Before running k-means clustering, I needed to do some preprocessing: I scaled the data set and converted all non-numeric features (description, street address, etc.) to numeric values. After the preprocessing, I performed clustering on different features of the data set. Fig 1 shows the clustering of the data while considering two features, category of crime and neighborhood. The data set contains 8 categories of crime and 88 neighborhoods. The clustering accurately grouped points from the same range of neighborhoods together.

Fig 1: K-Means Results For St. Louis Crime Data Set With 2 Features - Category of Crime and Neighborhood

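As a rough sketch of this preprocessing and clustering step (the file name, column names, and the choice of 8 clusters, matching the 8 crime categories, are my illustrative assumptions):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative file and column names; the real crime data set has its own schema.
df = pd.read_csv("stl_crime.csv")

# Convert non-numeric features (description, street address, etc.) to numeric codes.
for column in df.select_dtypes(include="object").columns:
    df[column] = LabelEncoder().fit_transform(df[column].astype(str))

# Scale the features so that no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(df[["category", "neighborhood"]])

# Cluster on the two features; 8 clusters mirrors the 8 categories of crime.
df["cluster"] = KMeans(n_clusters=8, n_init=10).fit_predict(scaled)
```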

WEEK 06

Time to develop a new algorithm!

This week we needed to structure the algorithm that we would be developing to cluster the data set and analyze the users' input. In our weekly meeting, I discussed the approach with Dr. Ottley and we decided on the following main steps:

  • Feature selection/preprocessing of the data
  • Generating and storing all possible clusterings of the data set
  • Analyzing the users' input
  • Finding the best-fit data model
I had completed the first step of preprocessing and feature selection last week. This week I wrote a Python script to generate all possible clusterings of the data set. Since we had 6 features (Category, Neighborhood, Crime, Date, Time, Description), the number of clusterings is 6C1 + 6C2 + 6C3 + 6C4 + 6C5 + 6C6 = 2^6 - 1 = 63, one for each non-empty subset of features. This week Dr. Ottley and I also met Dr. Sanmay Das, whose research interests include AI, machine learning, and computational social science. We received his feedback on our approach and he also suggested another way through which we could predict users' interests.
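The enumeration itself is a small amount of code. Here is a sketch of how the 63 clusterings could be generated with itertools (the feature names and cluster count are illustrative assumptions):

```python
from itertools import combinations

import pandas as pd
from sklearn.cluster import KMeans

# Illustrative feature names; the real data set uses its own column labels.
FEATURES = ["category", "neighborhood", "crime", "date", "time", "description"]

def all_feature_subsets(features):
    """Yield every non-empty subset of the features: 2^6 - 1 = 63 for 6 features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

def generate_clusterings(df: pd.DataFrame, n_clusters: int = 8) -> dict:
    """Fit one k-means model per feature subset and collect the cluster labels."""
    clusterings = {}
    for subset in all_feature_subsets(FEATURES):
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(df[list(subset)])
        clusterings[subset] = labels
    return clusterings
```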

WEEK 07

Storing The Clustering Data

After a fulfilling 4th of July long weekend, I returned to work on Wednesday. The preprocessing and cluster generation parts of the algorithm were completed last week. Now, I needed to store the clusters so that I could efficiently index into the stored file and compare them against the users' input. I first used a pandas data frame to store the clustering data in a text file and indexed into it to compare the users' input against each cluster in the file. After implementing this approach, I realized that this storage format was not very efficient. Dr. Ottley guided me to store the clustering data in JSON format instead, so I replaced the pandas data frame with JSON and organized the clusterings in a better format.
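A sketch of the JSON layout, continuing from the clusterings dictionary in the previous sketch (the exact structure we settled on differed in its details):

```python
import json

def save_clusterings(clusterings, path="clusterings.json"):
    """Store each clustering as {feature subset: {cluster id: [point indices]}},
    so a user's clicked point can be looked up without re-running k-means."""
    serializable = {}
    for subset, labels in clusterings.items():
        clusters = {}
        for index, label in enumerate(labels):
            clusters.setdefault(str(label), []).append(index)
        serializable["+".join(subset)] = clusters
    with open(path, "w") as f:
        json.dump(serializable, f)

def load_clusterings(path="clusterings.json"):
    with open(path) as f:
        return json.load(f)
```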

WEEK 08

Improving The Algorithm

The next step of the algorithm was to compare the users' input to the preprocessed clusters. I first structured the algorithm to do this comparison by iterating over the clusterings, looking into each cluster, and calculating the percentage of the users' input points contained in each cluster. However, Dr. Ottley pointed out that this approach might not work if the algorithm were to run in real time. To make it run in real time, I restructured the algorithm to iterate over the users' input/clicks and do this comparison for each click. Our hypothesis is that the cluster containing the maximum percentage of the user's mouse clicks reveals the feature/clustering that the user is interested in. However, I encountered situations where more than one cluster had the maximum percentage. The algorithm could not decide on a winning interest in this case, and thus we needed to break ties among all the clusters it identified. I implemented two approaches to break ties: one based on inter-point distances, and the other based on the distance between points and the centroid of the cluster. I spent most of this week working on breaking ties.
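To make the idea concrete, here is a sketch of the per-click update and the centroid-based tie-breaker (the data structures and names are assumptions carried over from the earlier sketches, not our exact implementation):

```python
import numpy as np

def update_scores(scores, click_index, clusterings):
    """After each click, bump the count for every cluster that contains the
    clicked point. `clusterings` is the {feature subset: {cluster id:
    [point indices]}} structure stored in JSON above."""
    for subset, clusters in clusterings.items():
        for cluster_id, members in clusters.items():
            if click_index in members:
                key = (subset, cluster_id)
                scores[key] = scores.get(key, 0) + 1
    return scores

def break_tie(tied_keys, click_points, centroids):
    """Centroid-based tie-breaking: prefer the tied cluster whose centroid is
    closest, on average, to the clicked points. `click_points` holds the
    clicked points' coordinates and `centroids` maps a cluster key to its
    centroid; the inter-point distance variant is analogous."""
    def mean_distance(key):
        centroid = np.asarray(centroids[key])
        return float(np.mean([np.linalg.norm(np.asarray(p) - centroid)
                              for p in click_points]))
    return min(tied_keys, key=mean_distance)
```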

WEEK 09

Time to Test: Evaluating The Algorithm

On Monday we collected data by hosting the user study on Amazon Mechanical Turk, and I analyzed the users' data using our algorithm. The results were surprising: it predicted the users' interests with 100% accuracy for tasks 1 and 2, 0% accuracy for tasks 3 and 4, and 100% accuracy for tasks 5 and 6. It failed on tasks 3 and 4, the conditions where the users were interested in two features (Crime and Neighborhood). The algorithm was identifying the clustering/interest with the fewest clusters, which implies that each cluster in that clustering contained more points. This made me think that the algorithm had a bias towards clusters that contain more points. After discussing the problem with Dr. Ottley, we identified the error causing this bias: while calculating the confidence, I needed to divide the number of user input points present in the cluster by the size of the current cluster, instead of by the number of user input points as I had been doing. This change fixed the bias. I ran the algorithm again and obtained the results shown in the chart below. I discussed these results with Dr. Ottley. Based on the performance of the algorithm, we identified its strengths and weaknesses and discussed future work that we could do to remove its limitations. By the end of the week, I had completed the goals set forth at the start of the summer: developing an algorithm to predict users' interactions and evaluating it by designing a user study and analyzing the users' mouse clicks with the algorithm.

Prediction Accuracy of the Algorithm for Each Task

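For reference, a minimal sketch of the confidence calculation before and after the fix:

```python
def confidence(cluster_members, user_clicks):
    """Confidence that a cluster matches the user's interest. Dividing by
    len(user_clicks) biased the score towards large clusters, since a big
    cluster captures many clicks regardless of how specific it is; dividing
    by the cluster size removes that bias."""
    matches = sum(1 for click in user_clicks if click in cluster_members)
    # Biased version: return matches / len(user_clicks)
    return matches / len(cluster_members)
```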

WEEK 10

CSE REU Symposium 2017

On Monday and Tuesday I worked on making a poster for my research. Dr. Ottley organized a retreat for the team on Wednesday; we spent the afternoon grilling and enjoying a wonderful homemade meal at her house. It was a wonderful experience working with Dr. Ottley and all the other REU students in her lab. Not only would I be returning to Georgia Tech with new skills and knowledge, but I would also be taking back a summer full of amazing memories. I presented my poster at the CSE REU Symposium on Thursday. I received helpful feedback from other professors and students, along with pointers to improve my poster and presentation pitch. I will surely incorporate this feedback when I present this research at the Grace Hopper Celebration poster session in October. Friday was the last day of the REU. I completed documenting my code and reviewed it with Dr. Ottley. Though this REU program has ended, it has opened new avenues for me and has been an important step towards my goal of pursuing a career in research.