I just finished my final week of DREU! It's been a great experience. I primarily spent the week writing my final report, tidying up my code, and documenting everything I did this summer. I also found out that my poster was accepted to Grace Hopper!
Overall, I've enjoyed this summer a lot. Last year made me doubt how much I really liked research, because I was working on database performance and found it pretty uninteresting. But this summer has made me excited about research again. Working on problems that involve human-centered data interests me a lot; I may not go into speech processing, but this is certainly closer to what I want to be doing, research-wise. Even if it's immensely frustrating at times, being able to do research feels good.
This week, all of the undergraduates who did research in the Interaction Lab this summer gave our final presentations! I still have one more week, but there are several students who are leaving at the end of this week, so we presented our work during the lab meeting. It's always a good experience to get up and present, although there are certainly parts of it that are a little bit nerve-racking, like speaking in front of other people and being asked unpredictable questions. I personally enjoy presenting a lot, and I'm glad I had the opportunity!
I also worked with my PhD mentor, Elaine, to write an abstract for a poster at the Grace Hopper Celebration this fall! I love the Grace Hopper Celebration—I was able to go last year, and it was an awesome experience. I've already registered this year, and I'm really excited about it.
In non-research news, I found a really tasty donut shop near my apartment called Spudnuts Donuts—they apparently make some of their donuts with potato flour (hence the name!). I also got to have dinner with some of the PhD and undergraduate students in the lab on presentation day; that was a great time! I've really enjoyed being able to spend my summer in this lab so far.
This past week, USC has been hosting the Special Olympics, and Wednesday afternoon our entire lab went to cheer on the athletes as "Fans in the Stands." (As opposed to "Fans on the Field, Creating Distractions During the Shotput Finals"? It's ambiguous.) That was fun!
After performing data visualization on all of the files, I brought the results to the meeting with Elaine and David Traum. The meeting was very interesting, and seeing the ICT campus was great, too; the ICT does a lot of research that really interests me, and it was neat to see where it all takes place.
I annotated the American English files and calculated Cohen's kappa on the annotations, comparing the algorithm's output with the hand-annotated files. The kappa values were low at first (roughly .60, where 1 means perfect agreement), but after taking a balanced sample (1:1 ratio of speech:silence), kappa scores improved considerably, so that the majority of them were .8 or above. Some files even had kappa scores of .9 or, in one case, .95. I discussed these scores with my PhD student mentor, Elaine, who said that they were high enough to justify annotating the rest of the files.
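The agreement check described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the project: the function names `cohens_kappa` and `balanced_sample` are hypothetical, and balancing on the hand-annotated (reference) labels is an assumption about how the 1:1 sample was drawn.

```python
import random
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of frames where both label sets match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from how often each side uses each label.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used one identical label throughout
    return (observed - expected) / (1.0 - expected)

def balanced_sample(reference, predicted, seed=0):
    """Down-sample so every reference label appears equally often.

    Assumption: the balance is taken over the hand-annotated labels.
    """
    rng = random.Random(seed)
    by_label = {}
    for ref, pred in zip(reference, predicted):
        by_label.setdefault(ref, []).append((ref, pred))
    size = min(len(pairs) for pairs in by_label.values())
    sample = []
    for pairs in by_label.values():
        sample.extend(rng.sample(pairs, size))
    return [r for r, _ in sample], [p for _, p in sample]
```

For per-frame labels, `cohens_kappa(*balanced_sample(hand_labels, vad_labels))` would give the balanced score, where `hand_labels` and `vad_labels` are placeholder names for the two label lists.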
Next, I wrote a program that would generate annotations and store them in ELAN file format (.eaf), as well as Praat TextGrid format (.TextGrid). Now, approximately 60 audio files (12 hours of recordings) have been completely annotated. My next step will be to produce visualizations of this data for a meeting I have next week with Elaine and Dr. David Traum from the Institute for Creative Technologies (ICT). I'm looking forward to the meeting!
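As an illustration of the TextGrid side of that export, here is a minimal sketch that writes a single interval tier in the layout of Praat's long text format. It is not the program from the project: `to_textgrid`, the tier name, and the label text are hypothetical, the segments are assumed not to overlap, and the output is worth opening in Praat to confirm it loads as expected.

```python
def to_textgrid(segments, duration, tier_name="speech", label="speech"):
    """Build TextGrid text from (start, end) speech times in seconds.

    Assumes the segments do not overlap and lie within [0, duration].
    """
    # An interval tier covers the whole time range, so the gaps between
    # speech segments become intervals with empty text.
    intervals = []
    cursor = 0.0
    for start, end in sorted(segments):
        if start > cursor:
            intervals.append((cursor, start, ""))
        intervals.append((start, end, label))
        cursor = end
    if cursor < duration:
        intervals.append((cursor, duration, ""))

    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {duration}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        '        xmin = 0',
        f'        xmax = {duration}',
        f'        intervals: size = {len(intervals)}',
    ]
    for i, (xmin, xmax, text) in enumerate(intervals, start=1):
        lines += [
            f'        intervals [{i}]:',
            f'            xmin = {xmin}',
            f'            xmax = {xmax}',
            f'            text = "{text}"',
        ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file ending in .TextGrid is all that remains; the ELAN .eaf format is XML-based and is not covered by this sketch.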
On the weekend, I was able to visit the Museum of Natural History, which I thoroughly enjoyed; there was an exhibit on the development of Los Angeles from first recorded inhabitation to present. As someone from the northeast, I've found California history to be very interesting, in part because it never talks about the Revolutionary War (a northeast U.S.
I've been implementing a voice activity detection (VAD) algorithm to identify when there is a sound that matches the characteristics of human-produced noises. This algorithm, developed by Moattar and Homayounpour, divides audio (either recorded or streamed) into 10 ms “frames,” then calculates the initial energy, fundamental frequency, and spectral flatness measure of each frame.
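To make the per-frame measurements concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions rather than the implementation from the project: the function names are hypothetical, the frequency feature is taken as the strongest spectral peak (a simple stand-in, not a true fundamental-frequency estimator), and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def split_frames(samples, sample_rate, frame_ms=10):
    """Cut a list of samples into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def power_spectrum(frame):
    """Power at each non-negative frequency bin, via a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def frame_features(frame, sample_rate):
    """Return (energy, dominant frequency in Hz, spectral flatness)."""
    energy = sum(x * x for x in frame)
    power = power_spectrum(frame)[1:]  # drop the DC bin
    eps = 1e-12  # avoids log(0) and division by zero on silent frames
    geometric = math.exp(sum(math.log(p + eps) for p in power) / len(power))
    arithmetic = sum(power) / len(power) + eps
    flatness = geometric / arithmetic  # near 0 for tones, near 1 for noise
    peak_bin = max(range(len(power)), key=power.__getitem__) + 1
    dominant_hz = peak_bin * sample_rate / len(frame)
    return energy, dominant_hz, flatness
```

A frame would then be flagged as speech when enough of these features exceed their thresholds; the thresholds themselves are not shown here.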
After comparing my automatically generated annotations with the hand-annotated files, I found that the algorithm is so good at picking up voice activity that it annotates not only the primary speaker but also the speakers in the background, whose voices are occasionally picked up by the primary speaker's microphone. In order to rectify this, I added an additional condition for speech detection: an intensity threshold (using the root mean square of each audio frame). After implementing this change, the accuracy increased.
For my specific purposes, this particular VAD algorithm has some shortcomings, most notably that it makes decisions frame by frame, which is not always ideal for detecting speech at the sentence level. Rather than marking each “silent” frame as a pause, I have implemented a pause threshold of sorts, which only stops annotating speech once a pause of a certain length has been reached. Based on the average pause time between sentences in the original annotations, I have modified the VAD algorithm to treat only pauses of half a second (50 frames) or longer as breaks in speech.
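The pause threshold can be sketched as a pass over the per-frame speech decisions. This is a minimal illustration with a hypothetical function name, assuming 10 ms frames so that 50 frames correspond to half a second.

```python
def merge_short_pauses(frame_is_speech, min_pause_frames=50):
    """Turn per-frame booleans into (start, end) segments, end exclusive.

    A segment is closed only after min_pause_frames silent frames in a
    row, so shorter gaps stay inside the surrounding speech segment.
    """
    segments = []
    start = None        # first frame of the segment being built
    last_speech = None  # most recent frame flagged as speech
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_pause_frames:
            segments.append((start, last_speech + 1))
            start = None
    if start is not None:
        segments.append((start, last_speech + 1))
    return segments
```

Multiplying the frame indices by 0.01 converts each segment to seconds for the annotation files.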
After using the Google Voice Recognition API to generate transcriptions of the audio recordings of the "Naming" and "Story" tasks in the American English participant groups, I've shifted my focus to automatically annotating the speech events that occur among all speakers in a given interaction not only for the American English groups but also for the Mexican Spanish and Arabic speaking groups.
In addition to the "Naming" and "Story" tasks, there are three other tasks that each group performs. They are as follows: the "Pet Peeve" task, in which participants must discuss their pet peeves; the "Movie" task, which requires participants to (1) determine a movie that all of them have seen and (2) discuss its best and worst parts; and the "Cross-Cultural Experience" task, wherein participants discuss an interaction they've had with someone who is from a culture not their own.
Only the American English "Naming" and "Story" tasks have been annotated for audio analysis, which means that all four Mexican Spanish and Arabic groups will need to be annotated, in addition to the other three tasks for the American English participant groups. Since hand-annotating all of this data would be extremely time-consuming and, for me personally, impossible (due to lack of knowledge regarding Mexican Spanish and Arabic), my objective is to annotate these automatically, as thoroughly as possible.
The audio files I've been working with are from a study conducted jointly by the University of Texas at El Paso (UTEP) and the Institute for Creative Technologies (ICT), which is here in Los Angeles. The study consisted of five tasks given to groups of four participants; the data I've been analyzing comes from two of those tasks, the "naming" task and the "story" task, in which participants were given a toy and asked to come up with a name for it, and then to come up with a story about the toy. I put the machine learning problem aside for a little while in order to break the annotated tasks into small sound clips and run them through the Google Speech Recognition API to produce a transcription for each task.
This was also the Fourth of July weekend, so I got to watch fireworks from the roof of my apartment building! I don't live that far from downtown, so I was able to see a full 360-degree fireworks display from all the surrounding neighborhoods. It was beautiful! I also watched the FIFA Women's World Cup final. I started watching three minutes in, and when I saw that the score was 1-0 already, I thought I had started watching too late and there were only three minutes LEFT on the clock! But no, it was just good offense. That made for a good afternoon! Now I'm excited to go back to the lab after a long weekend off.
This week, I changed things up a little bit: I wrote a Python class that will read in an audio stream and detect whether or not someone is speaking. It's not the most elegant right now; it essentially uses root mean square (RMS) measurements of the audio data to determine if the decibel levels are high enough to (potentially) constitute speech. The problem, of course, is that it still detects laughter, coughing, and sudden loud noises (such as non-word exclamations), but it's a good starting point.
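The RMS check can be sketched as follows. This is a minimal illustration, not the class described above: the function names are hypothetical, the samples are assumed to be 16-bit integers (full scale 32768), and the -35 dB threshold is a placeholder that would need tuning on real recordings.

```python
import math

def rms(frame):
    """Root mean square of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def rms_dbfs(frame, full_scale=32768.0):
    """RMS level in decibels relative to full scale (0 dB is the maximum)."""
    level = rms(frame) / full_scale
    return 20.0 * math.log10(max(level, 1e-10))  # floor avoids log10(0)

def is_loud_enough(frame, threshold_db=-35.0):
    """True when the frame is loud enough to potentially be speech."""
    return rms_dbfs(frame) >= threshold_db
```

A level check like this responds to any loud sound, which is why laughter and coughing still pass it.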
I continued researching laughter vs. speech differentiation, drawing from various sources to form a more complete understanding. Additionally, I collected as much data as possible from the existing annotated files and began running an ML program using Python's scikit-learn. Currently, my challenge is paring down my feature vector so that there are a reasonable number of features, rather than the ridiculously long vectors I presently have. Another challenge is that our corpus is much smaller than the datasets that have been used for similar problems; I'm likely going to need to annotate more of the audio.
In addition to my work in the lab, I was able to meet up with a friend from Simmons in Little Tokyo! I also went to a lab picnic with Elaine, two other undergrads, and one of the high school students. On the way home, we got to hike up to a "scenic overlook," and it was beautiful!
This week, I began reading about machine learning algorithms, specifically SVM implementations, in more detail. I also began processing annotated .wav files to extract which annotations occur at which points in time and determine how they correspond with the audio data. My task is to develop a model that will differentiate statements from any other kind of noise, including laughter, coughing, and peripheral audio picked up from other speakers. I'm approaching this by using a 3-class classifier, since laughter and speech have more in common than silence and speech, but they still need to be separated. More to follow on how well this approach actually works!
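One way to line annotations up with audio data is to give every short frame a class label taken from the annotation that covers it. The sketch below is an assumption about how that could look, not the project's code: `label_frames` is a hypothetical name, annotations are assumed to be (start seconds, end seconds, label) tuples, and unannotated frames fall back to "silence" as the third class.

```python
def label_frames(annotations, n_frames, frame_ms=10, default="silence"):
    """Label each frame by the annotation that covers its midpoint.

    annotations: iterable of (start_s, end_s, label) tuples.
    Frames not covered by any annotation keep the default label.
    """
    labels = [default] * n_frames
    for i in range(n_frames):
        midpoint = (i + 0.5) * frame_ms / 1000.0
        for start_s, end_s, label in annotations:
            if start_s <= midpoint < end_s:
                labels[i] = label
                break  # first matching annotation wins
    return labels
```

The resulting list pairs one label with each frame, which is the shape a classifier would need for training on per-frame features.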
I just got here! Stay tuned for more updates (most likely by Friday!).