Week One: Getting to Know the Campus and Designing the Website

When arranging housing, I chose the easy route: the dorms. Most renters wanted residents who would stay around two months. I planned to stay eleven weeks.

So far, four things have consumed my time:

  • Creating the Website logo and CSS style sheet
  • Studying basic machine-learning algorithms
  • Gaining access to campus computers
  • Beginning this journal

To gain access to campus computers, I acquired my own Kansas State eID and the ability to remotely log into a Windows machine.

My favorite English classes covered technical editing and writing for the Web (or Web writing). The Web writing class has strongly influenced how I write. I prefer shorter paragraphs and frequently use bulleted lists. My hyperlinks usually end sentences and paragraphs or appear in lists because they can distract readers.


Week Two: Learning the Naive Bayes Algorithm

This week, my group has been studying the Naive Bayes algorithm. You can use this algorithm to create a document classifier.

Initially, you tokenize your documents and eliminate duplicate words. You then remove meaningless words (e.g. prepositions and articles) and stem the remaining words. For example, the verb "repeat" has several forms, including "repeats" and "repeating." You just retain the word "repeat." The resulting list becomes your vocabulary.
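
The steps above can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the suffix-stripping stemmer are hypothetical stand-ins, and a real project would use a proper stemmer.

```python
import re

# Hypothetical stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}

def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_vocabulary(documents):
    """Tokenize, drop stop words, stem, and remove duplicates."""
    vocabulary = set()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                vocabulary.add(crude_stem(token))
    return sorted(vocabulary)

docs = ["The student repeats the test", "Repeating a test in the lab"]
vocab = build_vocabulary(docs)
print(vocab)
```

Both "repeats" and "repeating" collapse to the single vocabulary entry "repeat".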

Your documents should have class labels. When you train your classifier, you determine the probability that a document in a given class contains a particular vocabulary word.

To use Naive Bayes, you have to assume that, given the class, words occur independently of one another. Only the class affects which words a document contains. Syntax plays no role.
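
To make the training and classification steps concrete, here is a minimal multinomial Naive Bayes sketch. It is not Weka's implementation. The add-one smoothing, the function names, and the toy training data are my own assumptions.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def classify(tokens, model):
    """Pick the class with the highest log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing (an assumption) avoids zero probabilities.
            smoothed = word_counts[label][token] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration.
training = [
    (["great", "fun"], "pos"),
    (["fun", "film"], "pos"),
    (["boring", "film"], "neg"),
]
model = train(training)
print(classify(["great", "fun"], model), classify(["boring"], model))
```

Each word contributes to the score separately, which is exactly the independence assumption: word order and syntax never enter the calculation.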

Although we need to understand the algorithm, we let a data-mining software platform called Weka perform the actual statistical analysis.

We want the department MySQL server to house our tweet database because the file size overwhelms personal computers. PostgreSQL, another relational database engine, originally housed the movie tweet database. We have to convert the file so that MySQL can read it.

Theoretically, we only have to change the data types in the CREATE TABLE statements. MySQL will tell us if it finds any other syntactic errors.
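
A rough sketch of that conversion follows. The type mapping covers only a few common PostgreSQL types and is an assumption, as are the table and column names; our actual schema may need other substitutions.

```python
import re

# Assumed mapping from PostgreSQL type names to MySQL equivalents.
TYPE_MAP = {
    r"\bcharacter varying\b": "VARCHAR",
    r"\btimestamp without time zone\b": "DATETIME",
    r"\bboolean\b": "TINYINT(1)",
}

def convert_create_table(statement):
    """Rewrite the data types in one CREATE TABLE statement."""
    for pattern, replacement in TYPE_MAP.items():
        statement = re.sub(pattern, replacement, statement, flags=re.IGNORECASE)
    return statement

# Hypothetical table definition, not the real tweet schema.
postgres_ddl = (
    "CREATE TABLE tweets (id bigint, body character varying(200), "
    "created timestamp without time zone);"
)
mysql_ddl = convert_create_table(postgres_ddl)
print(mysql_ddl)
```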

Week Three: Playing with Weka

When you use Weka, you can either create your .arff files yourself or have Weka do it for you. The final .arff files represent each document as a vector. Each vector has n features.

My features were my n vocabulary words. As Weka analyzed a document and created the vector, it reported how many times each vocabulary word appeared.
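
For readers who have never seen an .arff file, this sketch writes one by hand from word counts. The relation name, vocabulary, and documents are invented for illustration; Weka's own output has more options than this.

```python
from collections import Counter

def to_arff(relation, vocabulary, labeled_docs, classes):
    """Build ARFF text: one numeric attribute per word, class attribute last."""
    lines = ["@relation " + relation, ""]
    for word in vocabulary:
        lines.append("@attribute " + word + " numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for tokens, label in labeled_docs:
        counts = Counter(tokens)
        row = [str(counts[word]) for word in vocabulary]
        lines.append(",".join(row + [label]))
    return "\n".join(lines)

# Hypothetical vocabulary and documents.
vocabulary = ["fun", "boring", "film"]
documents = [(["fun", "fun", "film"], "pos"), (["boring", "film"], "neg")]
arff_text = to_arff("tweets", vocabulary, documents, ["pos", "neg"])
print(arff_text)
```

Each row under @data is one document vector: n word counts followed by the class label.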

I first had to convert raw documents using the TextDirectoryLoader. Given a directory that contained text documents, TextDirectoryLoader generated a single .arff file that contained all those documents. Each document also needed a class attribute. To ensure that TextDirectoryLoader included class attributes, I had to give each class a unique subdirectory.
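
The directory layout is easy to get wrong, so here is a small sketch that builds it. The file names and the sample documents are hypothetical; only the one-subdirectory-per-class structure comes from what TextDirectoryLoader expected.

```python
import tempfile
from pathlib import Path

def write_corpus(root, labeled_docs):
    """Write each document into a subdirectory named after its class."""
    root = Path(root)
    for index, (text, label) in enumerate(labeled_docs):
        class_dir = root / label
        class_dir.mkdir(parents=True, exist_ok=True)
        (class_dir / f"doc{index}.txt").write_text(text, encoding="utf-8")

# Temporary directory so the sketch leaves no files in the working directory.
corpus_root = Path(tempfile.mkdtemp())
write_corpus(corpus_root, [("what a fun film", "pos"), ("so boring", "neg")])
created = sorted(
    path.relative_to(corpus_root).as_posix() for path in corpus_root.rglob("*.txt")
)
print(created)
```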

As researchers become more sophisticated Weka users, I suspect that they prefer the command line and the Weka API rather than the Weka GUI, the Explorer. The next tool that I used, StringToWordVector, could tokenize, stem, and remove stop words (common, relatively meaningless words like articles and prepositions). However, StringToWordVector only processed the training data and test data simultaneously if I used the command line. Since I had both training and test data, I needed both corresponding .arff files to contain identical vocabularies--something that only "batch mode" could accomplish. I could have used the command line or Java code to trigger batch mode.
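
The idea behind batch mode can be shown without Weka. In this sketch, which is my own simplification and not Weka's code, the vocabulary comes from the training documents only and is then applied unchanged to the test documents.

```python
from collections import Counter

def fit_vocabulary(training_docs):
    """Collect the vocabulary from the training documents only."""
    return sorted({token for tokens in training_docs for token in tokens})

def vectorize(tokens, vocabulary):
    """Count vocabulary words; words outside the vocabulary are dropped."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Toy documents, invented for illustration.
train_docs = [["fun", "film"], ["boring", "film"]]
test_docs = [["fun", "plot"]]

vocabulary = fit_vocabulary(train_docs)
train_vectors = [vectorize(doc, vocabulary) for doc in train_docs]
test_vectors = [vectorize(doc, vocabulary) for doc in test_docs]
print(vocabulary, train_vectors, test_vectors)
```

The test word "plot" never becomes a feature, so both sets of vectors have the same length and the same column meanings.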

After I ran StringToWordVector, I found that the .arff files reported the class attributes first. The NaiveBayesMultinomial classifier threw an error until I finally put the class attributes last using the Reorder filter.

This week, I also talked to my mentor's colleague Eugene Vasserman. He is researching how to create security messages that successfully get people to change dangerous behaviors (e.g. visiting dangerous Websites). I may be helping him to create a dataset that contains cautionary tweets.

Week Four: Breaking Vim

Two other group members and I have been analyzing the tweet dataset. When we met last Tuesday, I learned that they had found more tweets than I had. While they analyzed the data using the database engines MySQL and PostgreSQL, I wrote Java code that performed the same task. My code told me that the file contained ~40% fewer tweets than their databases did!

The group helped me to figure out that Vim, a common UNIX/Linux text editor, had corrupted my file. I was using Vim to remove the code that PostgreSQL had left behind. I had also overlooked that the last few lines comprised PostgreSQL code because the original file was almost 12 GB.

Vim corrupted the original file subtly. According to Windows 7, the corrupted file was still almost 12 GB. Like my Java code, the command 'wc -l', which counts the lines in a file, also reported erroneous results.

I have to determine how many bits a long integer has if I want to know the maximum file size that Vim can handle. I assume that the department Linux servers all treat long integers as 64 bits. Their Vim should probably be able to handle a 12 GB file.
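
One quick way to check the size of a C long on a given machine is to ask Python. The sketch below assumes, as I do above, that the file-size limit follows the largest signed long; I have not verified that against Vim's source.

```python
import ctypes

# Width of the platform's C long, in bits (commonly 32 or 64).
long_bits = ctypes.sizeof(ctypes.c_long) * 8

# Largest value a signed long can hold.
max_signed_value = 2 ** (long_bits - 1) - 1

# A 12 GB file, measured in bytes.
file_size_bytes = 12 * 1024 ** 3

fits = file_size_bytes <= max_signed_value
print(long_bits, max_signed_value, fits)
```

With a 32-bit long the limit is about 2 GB, so a 12 GB file would not fit. With a 64-bit long it fits easily.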

Most likely, I interrupted Vim as it was saving the file. Otherwise Vim ate its 40% and left blank space behind.

When I left the meeting, I could use Unix commands to remove the PostgreSQL code. I have cleaned an uncorrupted tweet file and am getting accurate tweet counts now.

This week, my homework has included using Weka's Support Vector Machine (SVM) classifiers and tuning parameters.

Week Five: Cleaning Data

I have had trouble modifying the large tweet dataset this week. To even examine the data, I have to use Unix commands like head and grep. Apparently, Vim attempts to load the whole file into active memory...
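
The same kind of streaming that head and grep perform can be sketched in Python. Nothing here loads the whole file into memory. The sample file contents are invented stand-ins for the real dataset.

```python
import os
import tempfile
from itertools import islice

def head(path, count=10):
    """Return the first `count` lines without reading the rest of the file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, count)]

def count_lines(path, needle=None):
    """Count lines one at a time, optionally only those containing `needle`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle is None or needle in line)

# Tiny stand-in file; the real dataset is far too large to embed here.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("good film\nbad film\nCOPY tweets FROM stdin;\n")

first_two = head(tmp.name, 2)
total_lines = count_lines(tmp.name)
copy_lines = count_lines(tmp.name, "COPY")
os.remove(tmp.name)
print(first_two, total_lines, copy_lines)
```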

I have started cleaning the data. Initially, I am removing tweets that contain no ASCII letters. Some tweets only contain emoticons, numbers, or punctuation marks. I am using a really basic regular expression and grep to remove these tweets. Unfortunately, grep filters out only about 280,000 tweets. Approximately 55,000,000 remain.
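
The filter amounts to keeping any tweet with at least one ASCII letter. Here is a Python equivalent of that idea; the exact regular expression I passed to grep may differ, and the sample tweets are invented.

```python
import re

HAS_ASCII_LETTER = re.compile(r"[A-Za-z]")

def keep_tweet(text):
    """Keep a tweet only if it contains at least one ASCII letter."""
    return HAS_ASCII_LETTER.search(text) is not None

tweets = ["great movie :)", ":) :) !!!", "12345", "ok"]
kept = [tweet for tweet in tweets if keep_tweet(tweet)]
print(kept)
```

A test this simple also keeps tweets whose only letters sit inside emoticons such as ":D", which may be one reason it removes so few tweets.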

My biggest challenge may be to remove duplicate tweets. If I try to find duplicates now, I am going to need several hundred megabytes of memory or disk space or both. MySQL would need that much disk space to create a hash index. If I used radix sort, the program would need that much memory.

To find duplicates, I may be able to compare tweet IDs. Although the tweets appear to have IDs, I have no idea if Twitter has assigned them.

Comparing tweets is harder. Two tweets may be identical even if each tweet has slightly different whitespace characters. I also consider two tweets identical only if the same author has written them. When comparing tweets, I also have to compare the tweet authors.
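
A minimal sketch of that comparison follows. It normalizes whitespace and treats the author plus the normalized text as the key. It keeps every key in memory, so it illustrates the rule without solving the scale problem. The sample tweets are invented.

```python
def dedupe(tweets):
    """tweets: iterable of (author, text) pairs. Keeps the first occurrence."""
    seen = set()
    unique = []
    for author, text in tweets:
        # Collapse runs of whitespace and trim the ends before comparing.
        key = (author, " ".join(text.split()))
        if key not in seen:
            seen.add(key)
            unique.append((author, text))
    return unique

sample = [
    ("ann", "good  film"),
    ("ann", "good film "),
    ("bob", "good film"),
]
unique_tweets = dedupe(sample)
print(unique_tweets)
```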

I could shrink the dataset by using only tweets that contain emoticons. However, I have been having trouble finding good emoticon dictionaries. Web articles and scholarly papers talk about resources that no longer exist. The resources that I have found only contain a few unambiguously positive or negative emoticons.

I will also filter out non-English tweets. My mentor and I think that we can use spellchecking software. I have also read a paper recently that mentions an English/non-English classifier.

Week Six: Transforming, Rejecting, Deleting

(Still Cleaning)

I am still cleaning the training data. Later, my group will be using the same process to clean the testing data.

Cleaning Steps

  1. Convert all HTML symbol entities to their ASCII equivalents.
  2. Remove duplicate tweets using the Unix commands sort and uniq. Delete @username tags and URLs.
  3. Throw out tweets that contain only punctuation or numbers.
  4. Isolate tweets that solely comprise printable ASCII characters. Remove the others.
  5. Determine if tweets contain previously unidentified "positive" or "negative" emoticons. The new emoticons that are unambiguously "positive" or "negative" can act as class labels.
  6. Separate "positive" tweets and "negative" tweets. Remove any tweets that contain both emoticon types.
  7. Replace exaggerated emoticons. For example, convert ":DDDDD" to ":D" or ":)".
  8. Remove emoticons and punctuation. Otherwise the classifier may adopt emoticons and punctuation as features.
  9. Delete the "retweet" acronym "RT" anywhere that it appears.
  10. Evaluate whether a tweet contains enough English words. If a tweet is at least 70% "English", keep it. The spellchecking software defines what is "English."
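
A few of the steps above (1, the tag and URL part of 2, 7, and 9) can be sketched with Python's standard library. My group actually used awk, grep, and sed, so the patterns below are my own approximations, and the sample tweet is invented.

```python
import html
import re

def clean_tweet(text):
    """Apply a handful of the cleaning steps to one tweet."""
    text = html.unescape(text)                # step 1: "&amp;" becomes "&"
    text = re.sub(r"https?://\S+", " ", text) # step 2: delete URLs
    text = re.sub(r"@\w+", " ", text)         # step 2: delete @username tags
    text = re.sub(r":D+", ":D", text)         # step 7: ":DDDDD" becomes ":D"
    text = re.sub(r"\b(?:RT)+\b", " ", text)  # step 9: delete "RT" and "RTRT..."
    return " ".join(text.split())             # tidy the leftover whitespace

raw = "RT @bob loved it &amp; cried :DDDDD http://t.co/abc"
cleaned = clean_tweet(raw)
print(cleaned)
```

Note that html.unescape can also produce non-ASCII characters for some entities, so step 4 still has to run afterwards.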

Awk, grep, and sed have proven very helpful. My favorite command is awk '$0 !~ /[^ -~]/', which my group uses to isolate the tweets that contain only printable ASCII characters.
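
For comparison, the same test can be written in Python. The character class is identical to the one in the awk command: anything outside the range from space to tilde counts as non-printable.

```python
import re

# Matches any character outside the printable ASCII range (space through tilde).
NON_PRINTABLE_ASCII = re.compile(r"[^ -~]")

def is_printable_ascii(line):
    """True when the line contains only printable ASCII characters."""
    return NON_PRINTABLE_ASCII.search(line) is None

print(is_printable_ascii("good film :)"), is_printable_ascii("caf\u00e9"))
```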

Week Seven: Initiating Testing

Right now, I have yet to remove non-English tweets. I have classified this "not-so-clean" data using Weka's Naive Bayes implementation NaiveBayesMultinomial and 10-fold cross validation. Naive Bayes achieved 83% accuracy. However, it only classified 41% of the negative tweets correctly. The positive tweets outnumber the negative tweets nearly 5 to 1, so a classifier that labeled every tweet positive would already reach roughly 83% accuracy.
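
The arithmetic behind the class imbalance is worth spelling out. The sketch below treats the ratio as exactly 5 to 1, which is an approximation, so the outputs are rough estimates and not measured results.

```python
def majority_baseline(positive, negative):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(positive, negative) / (positive + negative)

def implied_positive_recall(accuracy, negative_recall, positive, negative):
    """Solve accuracy = (pos * pos_recall + neg * neg_recall) / total for pos_recall."""
    total = positive + negative
    return (accuracy * total - negative * negative_recall) / positive

baseline = majority_baseline(5, 1)
positive_recall = implied_positive_recall(0.83, 0.41, 5, 1)
print(round(baseline, 3), round(positive_recall, 3))
```

Under that assumption, the always-positive baseline sits near 0.83, and an overall accuracy of 83% with 41% negative recall implies that about 91% of the positive tweets were labeled correctly.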

Once I finish testing the "not-so-clean" data, I plan to remove most non-English tweets. My tool is the PyEnchant Library. This spellchecking software has a simple API and just takes text strings. Some spellchecking libraries correct GUI component text instead. (If you need a spellchecking library that wraps around Java Swing components, see JOrtho.)
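
Here is a sketch of the English test that I have in mind. The word checker is passed in as a function so that the example runs without PyEnchant installed. With PyEnchant, the checker could be the check method of enchant.Dict("en_US"). The regular-expression tokenizer and the tiny demo word set are assumptions for illustration only.

```python
import re

def english_ratio(text, is_english_word):
    """Fraction of the words in `text` that the checker accepts."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    known = sum(1 for word in words if is_english_word(word))
    return known / len(words)

def keep_if_english(text, is_english_word, threshold=0.7):
    """Keep a tweet when at least `threshold` of its words pass the checker."""
    return english_ratio(text, is_english_word) >= threshold

# Hypothetical stand-in for a spellchecking dictionary.
DEMO_WORDS = {"i", "loved", "this", "movie", "so", "much"}

def demo_checker(word):
    return word.lower() in DEMO_WORDS

print(keep_if_english("I loved this movie", demo_checker))
print(keep_if_english("me encanta this movie", demo_checker))
```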

Since I am using PyEnchant, I have decided to start learning Python. I like that Python has list comprehensions and first-class functions. When I get home, I should try to do some demonstrative-style homework problems.

My research group has discovered that almost 100,000 tweets (1/5 of the training set) contain the retweet acronym "RT". Since both positive and negative tweets frequently contain the acronym "RT", "RT" is a meaningless feature.

Tweets sometimes contain long repeated "RT" strings. For example, I have seen multiple Twitter authors generate the string "RTRTRTRTRTRTRTRTRTRT". These "RT" strings probably convey sentiment.

Week Eight: Classifying the Training Data

My "not-so-clean" dataset contains about 400,000 tweets. Weka can rapidly generate .arff files (its preferred format) and classify the data using Naive Bayes. I am also trying to classify the data using a Support Vector Machine package called LIBSVM. LIBSVM may need hours or even days to classify the data.

I believe that LIBSVM is having trouble converging. If I change a convergence threshold, LIBSVM classifies the data quickly. However, I am simultaneously sacrificing accuracy. LIBSVM is producing much worse results than Naive Bayes. When I run LIBSVM and use 10-fold cross validation, LIBSVM labels most tweets "negative".

I wonder if I should try using LIBLINEAR instead. LIBLINEAR avoids mapping data points to higher-dimensional spaces. As a result, it can often process large datasets faster than LIBSVM. More specifically, LIBLINEAR drops kernels, the functions that implicitly map data points into those spaces.

Some time today, I should access the university's high performance computing cluster, Beocat. Maybe I will get better performance?

My other project is writing the spellchecking module. Spellchecking 400,000 tweets may take considerable time. However, the module should significantly shrink the dataset. Then running LIBSVM becomes more feasible.

Week Nine: Waiting on Computers

I finished the spellchecking module. The spellchecked dataset contains around 160,000 tweets. Almost 20% are negative.

Naive Bayes achieves slightly better results when it classifies the spellchecked dataset. It gets an accuracy score of 86% (an increase of about 3 percentage points). I should expect these results because both datasets have a similar positive-to-negative ratio.

I am still finding LIBSVM challenging to use. LIBSVM needs 12-15 hours to classify the spellchecked tweets if it performs 10-fold cross validation.

To get the best results, I have to adjust several parameters. "Tuning" my parameters involves testing different parameter combinations (grid search). However, I may have to avoid using cross validation to verify test results for each parameter combination. LIBSVM might need 5 hours if one parameter combination test involved 3-fold cross validation.
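
A quick estimate shows why the search is expensive. The grid values below are hypothetical. The cost and gamma parameters are the ones usually tuned for an RBF kernel, and the 5-hour figure is my rough estimate from above.

```python
from itertools import product

def build_grid(costs, gammas):
    """Every (cost, gamma) pair that the search would have to try."""
    return list(product(costs, gammas))

def estimated_hours(combinations, hours_per_combination):
    """Total run time if each combination takes the same number of hours."""
    return len(combinations) * hours_per_combination

# Hypothetical parameter values, chosen only to show the arithmetic.
combos = build_grid([0.1, 1, 10], [0.01, 0.1])
total_hours = estimated_hours(combos, 5)
print(len(combos), total_hours)
```

Even this small grid of six combinations would take about 30 hours at 5 hours per combination.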

I initially used the linear kernel. My mentor asked me to try a different kernel because LIBSVM "learned nothing." LIBSVM labeled almost all tweets "positive."

I am still trying out different cleaning techniques. Weka apparently considers apostrophes token delimiters unless you tell it otherwise. When I tell Weka to leave the apostrophes alone, the resulting .arff files contain features like "smile\n". These '\n' characters should disappear because they finish lines!

Week Ten: Wrangling Weka

I spent considerable time this week troubleshooting Weka. This Monday, I tried installing Weka 3.7.11 on Beocat. Beocat refused to run Weka's Package Manager and to install LIBSVM.

The Weka Package Manager usually makes adding third-party software easy. However, the Beocat head node may stall resource-intensive jobs. (Beocat primarily runs scheduled batch jobs. Users have to request cores and memory resources.)

Previously, I had installed LIBSVM on my mentor's Mac. To solve my problem, I copied a "wekafiles" folder from the Mac to Beocat. The wekafiles folder contains the files that Package Manager installs.

Acquiring new software was also a problem earlier this summer. I failed to install PyEnchant successfully on the department's Linux server. Users generally need admin privileges to install PyEnchant. I have been using my mentor's Mac to spellcheck files instead.

Beocat documentation offers a potential solution: establish a Python virtual environment.

I have also made Weka behave strangely because I am using PyEnchant. When I spellcheck the tweets, I tokenize them using PyEnchant's tokenizer. I think that tokenizing the tweets this way causes Weka to make some stop words features. The .arff files that Weka has been generating contain features like "i".

Generally, stop words (e.g. "i", "and", "the", "is") appear too frequently to help classifiers distinguish document classes. Unless I ask Weka to ignore stop words, Weka normally throws them out.

Naive Bayes seems to be performing slightly better when the .arff files contain stop words. More specifically, Naive Bayes labels more tweets negative.

StringToWordVector, the class that I am using to generate .arff files, may be performing statistical analysis and choosing to retain stop words. However, no API documentation states that StringToWordVector actively analyzes whether it should make stop words features. I may eventually peruse the source code.

Although the noise improves my results, I prefer to fully understand my own procedure. I have copied Weka's stop word list and successfully forced Weka to apply it.

My other major project is generating test datasets. Most test tweets mention movie titles. The vast majority (around 85%) also contain URLs. To determine if tweets with URLs contain less useful information, I am creating six test datasets. Some test datasets contain all the test tweets. The others only comprise tweets that lack URLs.

Initially, I am extracting tweets that contain emoticons. This technique generates the smallest test datasets.

My mentor has also provided positive and negative word lists. I can use these words to annotate tweets the same way that I use emoticons. Two test datasets contain tweets that only comprise positive and neutral words or negative and neutral words. Two other test datasets contain tweets that have a majority of positive or negative words.
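
The two annotation rules can be sketched as follows. The word lists are short hypothetical stand-ins for my mentor's lists, and I am reading "majority" as simply more positive words than negative words, or the reverse.

```python
# Hypothetical stand-ins for the real positive and negative word lists.
POSITIVE = {"great", "love", "fun"}
NEGATIVE = {"boring", "awful", "hate"}

def count_sentiment(tokens):
    """Count how many tokens appear in each word list."""
    positive = sum(1 for token in tokens if token in POSITIVE)
    negative = sum(1 for token in tokens if token in NEGATIVE)
    return positive, negative

def label_strict(tokens):
    """Label only tweets whose sentiment words all point the same way."""
    positive, negative = count_sentiment(tokens)
    if positive and not negative:
        return "pos"
    if negative and not positive:
        return "neg"
    return None

def label_majority(tokens):
    """Label by whichever kind of sentiment word appears more often."""
    positive, negative = count_sentiment(tokens)
    if positive > negative:
        return "pos"
    if negative > positive:
        return "neg"
    return None

mixed = ["great", "fun", "but", "boring", "ending"]
print(label_strict(mixed), label_majority(mixed))
```

The mixed tweet gets no label under the strict rule but counts as positive under the majority rule, which is why the two rules produce different test datasets.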

This week, I got to see other REU students' posters, including my roommate's. The contributors were primarily science majors.

My mentor has decided that my paper should only describe how Naive Bayes performs. I hope that the research group eventually conducts SVM experiments and writes up the results. SVMs can produce impressive results if you have the time to find the right parameters.

Week Eleven: Approaching the Limit

The other REU students have gone home. My floor is empty.

The cleaning staff have cleaned the other rooms and left the doors wide open. I parade the empty corridor and wonder why only one dorm room contains high bunks.

Well, at least I have the showers to myself.

I am leaving Manhattan, Kansas soon. This has been a wonderful experience both because I have gotten to work with interesting people and because I have gotten to conduct research that interests me.

I have gotten my final test results. They seem similar to some results that published researchers have produced. The classifier achieves an accuracy score of nearly 70% when it classifies the balanced test sets. It also seems to be accurately labeling more negative tweets. However, I probably only see this improvement because the balanced test sets are almost 50% negative. The classifier has a 1 in 2 chance of assigning the correct label.

My paper is coming slowly. So far, my biggest quandary this week is whether I should write "the classifier sorts," "the classifier categorizes," or "the classifier classifies." My mother thinks that only elementary students would use "the classifier classifies." "The classifier sorts" seems incorrect because the classifier lacks the features and guarantees that sorting algorithms offer. "The classifier categorizes" has many syllables. This option is probably the best as long as no one tries to read the report aloud.

When I leave the dorm, I see that all the rooms now contain bunks. If the cleaning staff leave the rooms this way, incoming Kansas State students will have considerable storage space.