Research Journal

1 2 3 4 5 6 7 8 9 10

Week 1 (June 18 - 24) I met Professor Dutta for the first time this Thursday. We discussed the project I will be working on and the material I have to learn to be able to accomplish my goal. She gave me the names of some books to read; we also agreed to meet every Wednesday at 1:00. To chart our progress, Dr. Dutta gave me access to her research blog; (I originally posted my progress there, until I created this website; now (midway through) I use both. These first weeks are copied off my blog.) I spent some time getting used to my living arrangements. I'm living off-campus, but in the area, and the other students here are really interesting. My roommate is a PhD student who is lively and sweet - we hit it off right away and have a great time together. The kids here are a little wacky, but fun, too. There's definitely not a dull moment around here!
I think I met a new friend outside my building, too- we got into an animated discussion about Russia and the USSR and stayed out so late that I had to walk her home - she lives in an apartment building nearby. I really hope to see her again - even though she's younger than me, we're really well matched when it comes to knowledge and interests.
Return to top

Week 2 (June 25 - July 1) I spent some time this week reading the new book that Professor Dutta recommended, Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber (Chapter 7). It gave me a general overview of data mining methods, particularly classification and prediction methods. The book had only a very brief treatment of the method Professor Dutta had told me I would need, logistic regression, so I found a graduate student who obligingly lent me the book Data Mining: Practical Machine Learning Tools and Techniques. The book explained logistic regression in more detail and also has a full section on using Weka, the machine-learning software which we had agreed that I would use for this project. The graduate student also was kind enough to teach me how to use Weka, and I ran some simple tests using Weka’s built-in data sets.
Professor Dutta was not feeling well this week, so our meeting was cancelled. However, she sent me an assignment by email: three papers on the subject on seizure prediction and detection to read and summarize. I will be posting their names and the summaries as I complete them.
I took a subway to Brooklyn this week to visit an old school friend of mine. It was really nice to see her again and have fun together like we used to. Brooklyn is really different from Manhattan, but interesting too.
Return to top

Week 3 (July 2 - 8) I read the two papers assigned to me by Dr. Dutta and summarized them. I attempted to read the third paper, but had a difficult time understanding the mathematics involved and resolved to wait for our meeting so Dr. Dutta could help me with it.
As I'd hoped, Dr. Dutta and I discussed the two papers I had read. She pointed out to me several important points that were particularly relevant to my project. Then she briefly explained some of the difficult equations in the paper I had not understood and instructed me to read it again and summarize it for next week. The main part of the meeting was taken up by an explanation of the basis for my project. Dr. Dutta explained how the EEG data was garnered using a 96-channel micro-electrode array implanted directly in a patient’s brain (she even showed me pictures of the surgery - gory but fascinating!) and how it is analyzed using windows of a fixed width, say 5 seconds, one window at a time. She also explained how the data is filtered to produce recordings at various bandwidths, each one different. The hope is that patterns that may be missed at one bandwidth may be more obvious on another. We discussed how the data is represented as a table containing the various features that may be useful for classifying it, and the data points at each time interval. She also explained to me the basics of support vector regression and assigned me a paper on Support Vector Regression, as well as one on ROC curves.
Return to top

Week 4 (July 9 - 15) I had my work cut out for me this week: get familiar with the data format in the EEG database I would be using, run some data through Weka using SMOReg, Weka’s implementation of support-vector regression, and most daunting of all, read a paper on SVR that even Dr. Dutta admitted was “a difficult read”. After spending quite some time wading through it, and barely making it past the first section, I decided to get some help. My roommate came to the rescue with some much easier material on Support Vector Machines (which, I learnt, are not machines at all, but just a kind of classifying algorithm), which she patiently explained to me. Once I understood that, I was able to grasp the basics of the equations for SV regression given in the paper. I also reread the third paper from last week that I hadn’t understood, this time skipping the mathematics that had flummoxed me last time and trying to get what Dr. Dutta calls an “intuitive understanding” of the algorithm. I also looked around on the university of Freiberg’s website to get a feel for the data I will be working with, and used Weka’s SMOReg feature to implement a SVR classification on one of Weka's built-in datasets.
At our meeting this week, we got into the details of using SVR on the EEG data. Dr. Dutta explained that the goal of SVR was to use regression to find a line that best divides the data into classes. This is easy if the data is linearly separable (that is, distributed in a way that allows a neat line to be drawn between classes), but our data will not be. In such a case, one uses a kernel function (of which there are several) to transfer the data into a higher-dimensional space (usually higher than can be visualized by humans) where there does exist a clear dividing line, or hyperplane, between classes. To accomplish the goal of finding the hyperplane which leaves the maximum amount of space, or margin, between the classes, the regression algorithm has to find a weight to multiply every data value by which will yield an equation for the maximum margin hyperplane. This is accomplished by trial and error using pre-classified training samples. (Though in our case, the output is a continuous value rather than a simple binary class label.) For next week, I am to start playing around with EEG data, albeit not the dataset I will use in the end. I will be posting my progress as it occurs, and am excited to have some real programming to do - no more papers!
Return to top

Week 5 (July 16 - 22)

This week, I finally was able to do some programming! I downloaded a zipped folder containing 100 text files, each with 4096 data points from an EEG, from the University of Bonn’s website. This still is not the data I will be working with, but it’s close enough. I copied one file’s worth into MS Excel and used the graph-making function to turn it into a line graph. It looked like a real EEG! Here it is: My EEG from Bonn dataset file Z010

I then wrote a file-processing program in Java (it was nice to be back working in my “native tongue”!) to turn the simple text file into a .arff file that could be used by Weka. Since this was just a practice, my goal was to see if, knowing a set number of data points on an EEG, could the software accurately predict where the next point would be.
Since there are 4096 points, I split the data into 512 groups of 8 points each and labeled the first 7 points of each group as attributes, with the last one being the class label (really a continuous value, as used in SV Regression.) Then I opened the file in Weka and experimented with various ways of splitting the data between training and testing sets, as well as with different values for the C-value, a measure of the size of the margin set between support vectors. Using root mean squared error to measure my results, I got the best results using 70% of the data as a training and 30% as a test set, with a C of 10. The full results can be viewed here .

After spending a short time discussing my results from last week, (they were pretty self-evident) at our meeting, Dr. Dutta finally began explaining to me what Granger Causality, which has been piquing my curiosity as the title of my page for a while, is. It is basically a measure for determining whether one data set, in this case one channel, influences another. The way this is done is by using data values from two channels and seeing if the class label is predicted with better accuracy than when using only one channel, as I had been doing until now. If, for instance, if I achieve better accuracy using data from both channel 1 and channel 2 than I had for just channel 2, that means that channel 1 is affecting channel 2. If not, it means that 2 is affecting 1, or that the two channels are independent. My assignments for this week were to read the papers on Granger Causality listed below, and to run tests using two files from the Bonn data to see if one channel affects the other. Dr. Dutta said that not much work has been done on using SVR with Granger Causality, so if I get significant results, my work might be publishable. Definitely an interesting thought.

I really had a nice time this weekend! One of the girls from the lab invited a group of us to her parents' summer home in upstate New York for a long weekend. We took long walks in the country, picnicked near the lake, and barbecued on the lawn. There even was a fireworks display nearby! I love the country - just being around grass and trees relaxes me - so I really enjoyed the weekend and came back refreshed and ready for another week of work.
Return to top

Week 6 (July 23 - 29) I started my work for this week by modifying the program I had written earlier to accept inputs from two files. Then I used it to read in values from two files of the Bonn dataset (Z010 and Z011). Since each file is a recording of the same time at a different channel, I used corresponding values from the two files, which were taken from simultaneous readings, and merged them into a single dataset containing the same 512 samples as before, but each now with 14 attributes instead of 7. The class label was the 8th value from each set in the first file read in - the 8th value from the second set was ignored. Then, using the .arff files that were produced, I ran the same set of tests that I had on each file individually (see table here for results of Z010’s tests) on the two - one where z010 was the primary file, with z011 the secondary, and one where the opposite was true. If either file produced better results when combined with data from a second channel than it did alone, that would provide proof, according to Granger Causality, that the secondary channel influenced the primary one. The results of my testing can be found here. The first two columns set out the conditions of the test and the C- value (see above) used for each one. The third and fifth columns contain the results, measured using Root Mean Squared Error (RMSE) of tests on Z010 and Z011 individually. The fourth and sixth contain the results of today’s tests using the combined files. As of now, these results seem inconclusive to me. Though there are some improvements, they seem too small to be significant, and they are balanced by a large number of places where the error increased. I do find it interesting, however, that adding the data from another channel so little influenced the results. I suppose I will have to wait until our meeting to discuss it with Dr. Dutta. I hope to read the papers she assigned me tonight; perhaps they will shed some light on the matter.

I admitted to Dr. Dutta at our meeting that I hadn't read all the papers she had assigned; it was a short week and I was still having trouble with the mathematics, so I read only the parts of the first two that she had said were most important. She graciously explained to me what I had not understood; namely, the algorithm used to apply Granger Causality to a time series, such as the EEG records I am working with. I didn't quite understand it, and hope that she'll explain it again before I need it...

We discussed my test results from last week and agreed that they were pretty inconclusive; however, as Dr. Dutta pointed out, that does not mean that this line of research is not worth pursuing. It could just mean that those two channels had no effect on each other. Then Dr. Dutta assigned me the project I had been both looking forward to and dreading: running Granger Causality tests on all combinations of two files in the data set. Since there are 100 files, that means 100P2, or 9,900 combinations, plus the 100 individual files! Obviously, I can't run all those tests manually, so my job now is to write a program that will do the processing automatically. Since it's a big job, we agreed that I only have to do 5 of the files for next week. Oh, and she wants me to use the variance of the error as my way of measuring the results, rather than RMSE, which I have been using until now. It's a daunting task, but exciting as well. I'm glad to have a programming task that I'll really have to work at.

I took a bus downtown one day this week and went shopping. I got some great deals on all sorts of things - it's hard to beat New York City stores!
Return to top

Week 7 (July 30 - August 5) Well, it turns out that my project for this week was even harder than I thought it would be. I was supposed to automate the process of using Support Vector Regression to implement Granger Causality on each pair of files. That meant writing a program (in java, of course) that would loop through all the data files (currently text files) in the dataset, convert each one, and each pair (since channel x influencing y is a separate problem from y influencing x, it's a permutation problem), into a .arff file with the attribute values in the proper order, (this is where I was very grateful that I had built up to this in previous weeks - I already had that part of the code) then figure out how to run Weka from my code to perform the regression, then calculate the variance and output the results to a file. I originally had thought of finding a way to call Weka from my code, but Dr. Dutta suggested that since Weka is open-source software written in Java, I should open the Weka source files in Eclipse, locate the classes that I needed, and use them in my program.

I ran into problems right away. I easily found the jar file with the Weka source code, but had difficulty unpacking it. After much back-and-forth with a programmer friend, I managed to unpack the jar and import all 150+ classes into a new package. I know that 150 classes is not much for commercial software, but I had never worked with more than 15-20 at a time before, all of which I had either written or had at least a rough idea of what they did. Here, I was totally at sea. The package names gave me some clues, but not enough. I figured out that I probably needed the package entitled "weka.classifiers.functions.supportvectors", but other classes that I needed were not so clearly labeled. Even the ones that did have clear names were confusing: there was an SMOReg class in the weka.classifers.functions package, which inherited from Classifier, which sounded right, but there was also a RegSMO and a RegSMOImproved, which inherited from RegOptimizer, in the functions.supportvectors package! Which one was the right one? How did I use them? They seemed to contain mostly methods for displaying tooltips, which I didn't need at all. I managed to find buildClassifier and classifyInstance methods in two of the classes, but I couldn't figure out which class called those methods. After several fruitless hours of searching, I was desperate. I emailed Rivka, Dr. Dutta's other DREU intern, since I knew that she had also had to modify Weka code, and asked her for some pointers, but her experience wasn't very relevant to mine. Dr. Dutta wasn't able to help much from afar, and I was at my wits' end. Then I had a break: I found a class called Instance which contained methods for loading a data set from a file and preparing it to be processed. The class had some sample code in the comments, which I copied into my modified file-creation program. After some tinkering with the various SMO classes, I finally emerged with a program that worked! It looped through all the files specified, created all permutations of one-and-two-file combinations, and ran each one through an SV regression algorithm. Then it calculated the variance and output the results to the console. After spending so much time debugging - stepping through code step by slow step - I almost forgot how fast programs execute - my program creates several new files and prints out thousands of lines - all in a second or two! It seems almost magical.

At our meeting, Dr. Dutta and I discussed my program and agreed that my work for next week would be to improve it - make it print out the results to a CSV file that could be easily transferred to any program that could produce a color-coded matrix - a neat visual output that would make things easier to understand. I also have to work on generalizing my code - right now it only works with data files in a certain setup, with a set number of attributes, etc. I am to make it customizable, so the layout of the data files, the size of the window, (that is, number of time-series values used for each prediction) the c-value, etc. can all be changed by the user. I have also resolved to optimize my code - I was embarrassed by the clumsiness of my code right from the beginning, and would like to make it more elegant and streamlined. I think it is particularly important because I am using Java, which runs more slowly than other languages already. Once I start using large numbers of files, just running even optimized code will take a long time - I'd like to make sure that my code is not slower than it has to be.

Dr. Dutta also explained to me that the measure I had been using - the variance of the error - was not the final measure we need. What I really should be calculating was the equation for Granger Causality. It involves taking the natural log (ln, or log of e) of the sum of the squared error of a channel, divided by the squared error of that channel combined with another one. It shouldn't be too hard to implement.

I was extra busy this week because a professor asked me to tutor one of her students who was having trouble with a programming course. I really enjoyed doing it, though teaching someone else how to program was much harder than I'd thought! We worked through several assignments together and discussed concepts like inheritance and polymorphism. I taught her how to use the Java StringTokenizer and file processing classes, and she introduced me to the Calendar and DecimalFormat classes. You learn something new every day!
Return to top

Week 8 (August 6 - 12)
As I had expected, it wasn't too difficult to modify my code to allow for customization. I did, however, encounter difficulties when trying to use a window size that resulted in some data values being left over; Weka threw an exception that I couldn't figure out how to fix. I'll admit that I ignored it (on the next modification, it somehow got worked out on its own) and went on. I made the window size, number of data values in each file, and c-value all customizable. Then I ran my program on 15 of the 100 files and imported the resulting csv file into Microsoft Excel, where it made a neat 15-by-15 matrix with all values above 0 highlighted.
At our meeting, Dr. Dutta was very intrigued by my results. While most of the Granger Causality values were negative (that is, no correlation) or very low, there was a small block of 4 or 5 pairs that were high. All in all, the results looked promising, and Dr. Dutta asked me to run my program using all 100 files. Since that would mean generating 10,000 new files and running them through extensive processing, we decided to run my program on the server rather than overload my computer.

I also continued working on this website. I had originally hoped that my blog page would be a good enough record of my internship, but for various reasons, that didn't work out. I finally found hosting last week, but ran into some problems. I couldn't figure out how to ftp my files over to the server, and the supposedly graphical ssh client I downloaded only gave me a command line. I also had problems with my css files; I've done web development before, but can never manage to get the css to work exactly as I want it. My website ended up being a group effort: one graduate student let me use a computer with a graphical file-upload tool that was incredibly easy to use, and another, a really great web developer, helped me with my css issues. (I got a tutorial on using firebug, which I'd tried with little success before, so that was an extra benefit.)
Return to top

Week 9 (August 13 - 19) Well, this week I sure got an education in the research process! I was supposed to be running my program to test for Granger Causality on all 100 files in the set I had, using different window sizes and c-values. Then I was to update my program to be able to specify different kernels and set the parameters for each, and run it several more times. Dr. Dutta also wanted me to do the same tests on the other zipped folders on the website, so we could see if our results held true for more than one patient. It sounds like a lot, and it was, but I mostly had the programs written - modifying them slightly to work with the new options was a snap, and once I had the code, running it was easy. The lab admins gave me a login name and password for the server, and I began trying to upload my program and the data files there. The server runs Unix, which was very exciting for me, because it gave me a chance to actually use the commands I'd learned last year in a rather tedious class on Unix. I wasn't sure I'd remember anything, but it all came back and I was soon ls'ing and vi'ing like a pro. I had an ssh client called Putty on my computer, which was very convenient for connecting to the server, but I couldn't figure out how to upload files. After some searching, I finally downloaded Putty's file upload program, called pscp, which runs off the command line. It's rather annoying to have to type in a password every time I want to upload anything, but it works well, and it's giving me good experience using the Windows command line, something I'm weak in.

Once I had all the files uploaded to the server, I tried running the program. The server doesn't support the javac (compile) command, so I had to upload my already compiled .class file and run it from there. After spending a long time trying to work out path issues, I finally got it working! It was really exciting to watch the numbers zipping across the screen. As we'd expected, it was really computationally heavy - depending on the window size and c-value, it could take between half an hour and two hours (!) to run! After a while, I got tired of sitting and waiting for it to run, so I modified my program to accept command line input and wrote a shell script that would automatically run the program several times in a row with different inputs each time. This way, I could leave it running for hours on end and do other things meanwhile. The only problem was that the program terminated if my connection to the server was cut off in any way - if I lost Internet connection for a few seconds, or my laptop went onto standby. Because of that, I did have to sit over it, but at least I was free to work on my other project for the week: compiling a bibliography of related works on Granger Causality, particularly as related to EEG's.

Dr. Dutta wanted me to create the bibliography in Bibtex, the bibliography part of Latex, a standard type-setting markup language used for academic papers. I wasn't sure I could do it, but my roomate had been working on converting one of her papers to Latex, and the Bibtex part of it didn't seem so hard. She introduced me to Citeseer, a website where you can search a database of research papers with the Bibtex entries provided on the side. Citeseer didn't have everything, so I also used Google Scholar, which also searches research papers, but seems to me to be more comprehensive. Every paper I looked at seemed to lead me to three or four others through its citations, and I soon had nearly 20 open tabs on my browser which took me days to wade through. My bibliography is definitely a work in progress - pretty long already, but nowhere near finished.

As for the lesson - there I was, going my merry way, with my program humming along and producing neat output, hoping to have some promising results to show Dr. Dutta, when debugging an error in my program made me notice a minor inconsistency between the website's description of the data and the actual files. Puzzled, I emailed the contact on the website and was told, (paraphrasing) "It's not a big deal, but why are you using our data for Granger Causality analysis. Our data is univariate, and for GC you need bivariate data." Well, to make a long story short, the data we were working with was not actually recordings from 100 different channels from the same patient at the same time, but rather random samplings of different patients that had no connection to each other. So all the results we had gotten were total nonsense, and all that work went out the window. Well, not really, because hopefully my programs are easily adaptable to another dataset, so I'm actually more amused (the idea of us solemnly discussing our "promising results" when they were really nothing strikes me as rather funny) and grateful (that it was discovered before we went further with it). We decided to switch to using the University of Freiburg's dataset, if it fits our requirements (and you can be sure that I'll check this time!), and I learned a rather strong lesson about the importance of checking everything before starting and not relying on others. As an undergrad with no previous research experience, I'm used to teachers verifying everything before assigning a task to students, and had assumed that Dr. Dutta would have done the same here. As I found, that's not the way research works. Dr. Dutta had no prior knowledge about the dataset and expected me to evaluate everything very thorougly. This misunderstanding has really taught me something about research that I've heard before, but never really appreciated: nothing is spoonfed to you. Often, your mentor or advisor knows no more about your specific research than you do - you have to do all the work alone. It's a pretty scary thought, but somehow exciting as well, and this experience is definitely one that will help me in graduate school!
Return to top

Week 10 (August 20 - 26) This week somehow stretched into a lot more than one week. Maybe that's because once I switched to the Freiburg dataset (as expected, it wasn't too difficult) I had to run my program all over again with the new data. While there were only 6 files in each group instead of 100, each of the 6 files had 921,600 values in it! Running a long regression program on files with almost a million values takes a long time. At least Dr. Dutta got Sam, one of the admins, to run my shell script with nohup so that it would keep running even after I logged off. In fact, I agreed to continue monitoring the tests and recording the results for Dr. Dutta even after my official internship was over, because the program took so long to run that it was still running weeks after it was started. It's still working on some of the runs (with c-values of 10 and 100) but I have to finish up this website, so I'll just have to leave it going and end this journal. If you really want to know what happened, feel free to email me at the address below. I also wrote up my final paper, using our incomplete data. A link to it is here. I also finished writing up our bibliography and compiled it using Latex.

I really enjoyed my experience this summer. It gave me a taste of what research is like and what it feels like to belong to the scientific community. It really gave me the confidence to enter a PhD program in September with the knowledge that I enjoy academic research, and can't wait to get involved in a long-term project!
Return to top

Shoshana Gottesman

CRA-DREU Research Project, Summer 2009

Research Journal