Week 1 Touching Base

YY invited us to her house to introduce us to her current graduate students. I re-read the Flashback paper to better understand the technical details of the project. The paper can be found here: flashback. I started researching related papers on replaying processes and threads. We will be giving presentations from this list of papers every other week, and I will be presenting my first paper next Wed. I also spent some time setting up my working environment on Puccini, the computer that I will be using. YY gave us a couple of places to look for papers. One of the sites I found very useful is from U Penn. They have a very broad database of technical papers.

Here are a couple of interesting papers related to rollback and re-execution.

Execution replay and debugging
Deterministic Replay of Java Multithreaded Apps
IGOR: A System for Program Debugging via Reversible Execution
Starting Conditions for Post-Mortem Debugging using Deterministic Replay of Real-Time Systems
Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs

Week 2 Getting My Feet Wet

I did my presentation on Deterministic Replay by Choi. My presentation slides are ready for viewing here: Presentation 1. I think I was pretty nervous. Even though I spent about two days preparing, I was still quite nervous. I don't think I had ever given a presentation that was 30 minutes long. The comments were that I had too much content on my slides and that I was reading off of them, so I will need to watch out for that next time. It was also suggested that I keep my slides simpler, because busy slides can be distracting. It was a good experience, because I also got to hear how other people presented their papers. Hats off to Joe, another grad student in YY's research group. He gives his talks like a pro.

I also spent some time talking with Sudarshan, the graduate student who was working on the Flashback project previously. We came up with several things that I could add to the project. He is very knowledgeable about the project and gave me a high-level description. I began to familiarize myself with the kernel by reading kernel module documents. One of the best sources I found was this online book, which you can find at http://www.tldp.org/HOWTO/Module-HOWTO/x102.html  I would suggest it to anyone who wants to do some kernel hacking. I wrote my own tiny Hello World kernel module. I also did some reading up on Syscalltrack, a crucial tool that was used in Flashback. Documentation for Syscalltrack can be found at http://syscalltrack.sourceforge.net/examples.html

Week 3 Pushing forward

This week consisted primarily of code reading. I obtained the code for Flashback. At first it was quite difficult, because I didn't know where all the files were or which ones were important. It seemed that every file I opened contained thousands of lines of code. So I printed out the important parts of the files I needed. Then I did it the old-fashioned way, highlighting and commenting by hand, which to my surprise worked quite well. I'm beginning to link the function calls with the functions and get an understanding of how it all works. The idea is quite simple, but the code is quite complex.

I also figured out how to compile the kernel with the Flashback files added in. It is a tedious task for a rookie like me. Plus, it takes 10-20 minutes to make the modules and compile the kernel. I think it took a couple of hours just to get the config files right. However, I've got to admit it was a very fruitful experience.

One thing I forgot to mention is that we meet with YY every Wed to discuss the progress of our project. I'm not sure if every adviser does this, but I think every adviser should. Not only does she give us helpful comments, I think she also does a good job of encouraging us.

I think I will be adding support for multiple checkpoints to the project and exposing it in gdb, and perhaps automated checkpoints if I have extra time.

Week 4 Code Injections

I gave my second presentation this week. This time the paper was on IGOR, a system for re-execution of code. The presentation slides can be found here: Presentation 2. I think I did a better job this time, but lots of improvements are still needed. One thing I need to improve is my delivery: I need to be louder. I also think I need to understand the material more thoroughly so I can be better prepared for questions.

Now that I've gone through pages of code, I'm beginning to implement the user side of the project. I'm implementing the functionality to selectively read the correct log entry for the correct checkpoint. It is very hard to debug, because every time I want to run the program I need to unload and reload the modules, and my only way of debugging is to use kernel printk. As if that's not hard enough, every little mistake I make ends up crashing Puccini and forcing a restart. It can be very frustrating!

Week 5 User Land Progress

I've made some progress with the logging part of the kernel. I figured out how the log files are written to and read from for the playback, and I've added some functions that can tell apart the parts of the log that belong to each checkpoint. The idea is so simple, yet it's taking hours and hours to implement. One thing I'm getting used to is being a lot more careful when writing code; it's either that or reboot time.
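To make the idea concrete, here is a small user-space sketch of one way a log could be split up by checkpoint. This is not the Flashback code: the `log_entry` layout, the field names, and both helper functions are hypothetical, and the only assumption is that each logged system call is tagged with the id of the checkpoint it belongs to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record layout: each logged system call carries the id of
 * the checkpoint that was active when it was recorded. */
struct log_entry {
    int  checkpoint_id;  /* checkpoint this entry belongs to */
    int  syscall_nr;     /* system call number that was logged */
    long retval;         /* return value to hand back during replay */
};

/* Return the index of the first entry at or after `start` that belongs
 * to checkpoint `id`, or -1 if there is none. */
static int next_entry_for_checkpoint(const struct log_entry *log, size_t n,
                                     size_t start, int id)
{
    assert(log != NULL || n == 0);
    for (size_t i = start; i < n; i++) {
        if (log[i].checkpoint_id == id)
            return (int)i;
    }
    return -1;
}

/* Count how many entries in the log belong to checkpoint `id`. */
static size_t count_entries_for_checkpoint(const struct log_entry *log,
                                           size_t n, int id)
{
    size_t count = 0;
    assert(log != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (log[i].checkpoint_id == id)
            count++;
    }
    return count;
}
```

During replay, a reader built this way would call `next_entry_for_checkpoint` repeatedly, passing the previous index plus one as `start`, so that entries recorded for other checkpoints are skipped.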

Based on the output of my kernel printk's, I'm fairly convinced that my log writing and reading utilities are working correctly. I'm looking at the Syscalltrack code to figure out how to pass the correct information to the kernel handlers. I think that will be a hard task.

Week 6 Treading in Kernel Land

One of the biggest problems I ran into is that I need to pass an extra parameter to the kernel handlers through Syscalltrack, and I cannot find a way to do it. I tried to trace the function calls in Syscalltrack, but it seems that all the functions have the same number of parameters and the function calls are nested at least a dozen levels deep. I cannot seem to find a way around it. I've been trying this and that for some time.

Sudarshan came to my rescue and mentioned that they had the same problem when they initially tried to implement Flashback and needed to pass information to the kernel handlers. He explained that pretty much all Syscalltrack functions have the same set of parameters. They used a hack: they overloaded some of the parameters that aren't being used by Syscalltrack to carry the information needed by Flashback. He suggested that I try to find a parameter that wasn't being used and overload it.
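The trick can be sketched in plain C. This is only an illustration, not Syscalltrack's real interface: the `hook_fn` signature, the `spare` parameter, and the dispatch functions are all made up. The point is that when every layer shares one fixed signature, a value can ride through an argument that nobody else reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed signature shared by every function in the call chain.
 * `spare` stands in for a parameter the framework never looks at. */
typedef long (*hook_fn)(int syscall_nr, long arg, long spare);

/* Last checkpoint index the handler decoded, kept only for illustration. */
static int last_checkpoint_seen = -1;

/* The handler at the bottom of the chain reinterprets `spare` as the
 * checkpoint index it needs. */
static long replay_hook(int syscall_nr, long arg, long spare)
{
    (void)syscall_nr;
    last_checkpoint_seen = (int)spare;
    return arg;
}

/* Intermediate layers just forward their arguments unchanged, so the
 * overloaded value survives without touching any signature. */
static long inner_dispatch(hook_fn hook, int nr, long arg, long spare)
{
    assert(hook != NULL);
    return hook(nr, arg, spare);
}

static long outer_dispatch(hook_fn hook, int nr, long arg, long spare)
{
    return inner_dispatch(hook, nr, arg, spare);
}
```

The cost of this shortcut is that it silently depends on the spare parameter staying unused; if any layer ever starts reading it, the handler receives garbage.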

Week 7 Narrowing it down

This is one of those things that is easy in concept but hard to do. It took me almost two days to figure out which parameter I could use "safely". I'm very grateful, because I think that was the only parameter left unused, or I should say the only one that didn't crash the program. Otherwise I would have needed to implement another structure and change the initial parameters into a pointer to that struct, in which case I would have had to modify every instance of that parameter and add casts, which would have led to many days of testing and debugging.

After I got the parameter to be correctly passed to the kernel, I just needed to change the kernel task_structure to index the shadow process that will be doing the replay.

Week 8 Testing and GDB changes

I've added an array to the task_structure and made a couple of modifications to the main kernel files (fork.c, exit.c, etc.).

I've gotten the shadow process to index correctly. After I got the checkpoints to work correctly, I went ahead and made changes to the rollback and replay system call mechanisms. The rest of this week was spent testing it out and correcting bugs. One of the problems was a mismatch between the system calls and the log. I also had to make sure that the shadow task_structures were being cleaned up correctly.
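A user-space model of the bookkeeping looks roughly like this. It is a sketch under my own assumptions, not the kernel change itself: `struct task`, `MAX_CHECKPOINTS`, and the three functions are hypothetical stand-ins for a per-process array of shadow copies indexed by checkpoint, plus the cleanup that has to happen when the process exits.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_CHECKPOINTS 8   /* hypothetical limit, chosen for the sketch */

/* Simplified stand-in for a process descriptor. */
struct task {
    int pid;
    struct task *shadow[MAX_CHECKPOINTS];  /* one shadow copy per checkpoint */
};

/* Take checkpoint `idx`: store a shadow copy of the task in that slot.
 * Returns 0 on success, -1 if the slot is invalid, taken, or out of memory. */
static int add_checkpoint(struct task *t, int idx)
{
    assert(t != NULL);
    if (idx < 0 || idx >= MAX_CHECKPOINTS || t->shadow[idx] != NULL)
        return -1;
    struct task *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return -1;
    copy->pid = t->pid;
    for (int i = 0; i < MAX_CHECKPOINTS; i++)
        copy->shadow[i] = NULL;          /* shadows do not nest here */
    t->shadow[idx] = copy;
    return 0;
}

/* Look up the shadow for checkpoint `idx`, or NULL if there is none. */
static struct task *shadow_for(const struct task *t, int idx)
{
    assert(t != NULL);
    if (idx < 0 || idx >= MAX_CHECKPOINTS)
        return NULL;
    return t->shadow[idx];
}

/* Exit-time cleanup: free every shadow and clear its slot.
 * Returns how many shadows were released. */
static int release_shadows(struct task *t)
{
    int freed = 0;
    assert(t != NULL);
    for (int i = 0; i < MAX_CHECKPOINTS; i++) {
        if (t->shadow[i] != NULL) {
            free(t->shadow[i]);
            t->shadow[i] = NULL;
            freed++;
        }
    }
    return freed;
}
```

Clearing each slot after freeing it matters: a stale pointer left in the array is exactly the kind of mistake that shows up later as a crash during rollback.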

I also spent some time looking into gdb. I need to eventually add the functionality into gdb. I'm learning where the files are and finding out how to compile gdb, etc. Somehow there are two copies of some of the Flashback code, and I'm a little confused as to why.

I think it's a very good experience. I used gdb without much thought about how it's implemented, and now I get a chance to see how the work is done behind the curtain. I'm also familiarizing myself with pthreads.

Week 9 Fine Tuning

I did another presentation this week on “Optimal Tracing and Replay for Debugging Message-Passing Parallel Programs.” The slides can be found here: Presentation 3. I think this paper was a little shorter than the other papers I presented. I think I'm starting to get the hang of the paper presentations, and I'm a little less nervous. The comments were that I need to speak a little louder; it seems that my voice trails off towards the end. I need to learn more about the strategies of presenting a paper. It used to take me about a day and a half to prepare, and now it's taking me about a day. I noticed that the graduate students only need a couple of hours to do an exceptional job. I think reading more papers would definitely help.

I've added some functions to gdb that allow me to roll back and replay to a specific checkpoint from within gdb. I think I still need to do more digging with gdb. Gdb is a complicated piece of software, with lots of files and functions. I think I'll start looking into the automated checkpoints.

Week 10+ Final Week

I spent the rest of my time playing with gdb. My original intuition of how to approach the automated checkpoint was way off. I'm glad YY set me on the right track. I thought I needed to spawn another pthread and have it just run a while loop to handle the SIGALRM signal, and I was working on that for quite some time. YY suggested just setting it somewhere in the original gdb code. First of all, it would be very inefficient to run a while loop that does nothing, and second, I would have to do message passing, etc. It is more complicated than I presumed, and hopefully I'll be able to get it to work. I think the hard part is deciding where to put the checkpoint and testing it out.