Running TPC workloads for the Staged Database System
Mentoring Program Summer Research Project
The goal for the summer was to run TPC workloads on the Staged Database System that is currently being built at Carnegie Mellon University.
The purpose of a Staged Database System is to break the database system into modules and to encapsulate them into self-contained stages connected to each other through queues. The staged design remedies the weaknesses of modern DBMSs by providing solutions at both a hardware level (by optimally exploiting the underlying memory hierarchy and taking direct advantage of SMP systems) and a software engineering level (by aiming at a highly flexible, extensible platform that is easy to program and evolve).
Description and Results
I first had to understand the benchmarks as well as port and run the TPC workload in the system. Becoming familiar with the Staged System was not a goal for the summer since I had worked with it before, during the spring semester through an Independent Study.
The two workloads that we were most interested in were TPC-C and TPC-H. TPC-C is used for OLTP workloads, whereas TPC-H is for decision support workloads. I read both the TPC-C and TPC-H descriptions. The TPC-C handout describes in detail the five transactions, the tables involved, and so on. I also explored the http://www.tpcc.org website and learned how TPC-C is used and which companies and systems have the best results. I received a copy of a TPC-C Toolkit, which was created by a graduate student here at Carnegie Mellon. I spent one afternoon experimenting with it: loading tables, completing transactions, and looking at the output. With the help of both the handout and the toolkit, I produced a document detailing TPC-C and what is required in order to run it.
In order to run TPC-C transactions in the Staged System, a few more operators had to be implemented. The two most important ones that had not yet been implemented were insert and hash_join. Although I did some initial research on the insert operator and how the toolkit handles it, I did not do much work with it. I did, on the other hand, gain considerable experience with the hash_join operator. First I did some research into hash joins. I read about different algorithms and talked with my mentor about them. We also discussed issues dealing with partitioning. I then implemented a hash table that supported insertion, removal, and lookup. Initially, I decided to focus on the hash join itself, without worrying about partitioning. I learned a great deal while trying to implement the hash join. My C++ skills were holding me back a little, but they also improved with each bug I found. The most difficult aspect of the hash join implementation was my unfamiliarity with the SHORE management system. The hash join was being built on top of SHORE, so I had to be able to interact with it. The toolkit helped me a little, but it did not have all the functionality that I needed.
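The build-and-probe idea behind a hash join without partitioning can be illustrated with a minimal sketch. This is not the C++ operator built on SHORE for the Staged System; it is an in-memory Python illustration, and the function name, the row format (dictionaries), and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Yield one merged row for every build/probe pair with equal join keys."""
    # Build phase: index the (ideally smaller) input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each probe row's key and emit one row per match.
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            yield {**match, **row}


# Hypothetical sample data, loosely modeled on TPC-C warehouses and districts.
warehouses = [{"w_id": 1, "w_name": "north"}, {"w_id": 2, "w_name": "south"}]
districts = [
    {"d_id": 10, "d_w_id": 1},
    {"d_id": 11, "d_w_id": 1},
    {"d_id": 20, "d_w_id": 3},  # no matching warehouse, so it is dropped
]

joined = list(hash_join(warehouses, districts, "w_id", "d_w_id"))
```

The sketch assumes the build input fits in memory. Partitioning, which was deliberately set aside at first, is what a real operator would add to handle inputs that do not.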
Due to the amount of time that it took me to implement the hash join operator and the fact that there were many more aspects to the project than just the operators, my mentor decided it was best if instead of implementing insert, I did something different and worked with TPC-C and TPC-H workloads for DB2. This provided me with an opportunity to leave the low-level aspect of the project and start working with something more high-level.
My task this time was to work with Lisa, the other DMP participant at Carnegie Mellon, to hook TPC-C and TPC-H up to the interface. I was to create Python scripts that would perform the necessary operations to initialize the database, create the table spaces, populate the tables, run the actual toolkit, and stop the database. After meeting with my mentor, we decided on a few factors that should be controlled by the user: the number of warehouses requested, the buffer pool size, the number of users, the interval time of the toolkit, the warm-up time, and the thinking time.
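As a rough sketch of how such a script might group the user-controlled factors and order the operations, consider the following. It is not the actual script: the class name, field names, units, and default values are all hypothetical, and the steps are returned as plain descriptions rather than real DB2 or toolkit commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TpccRunConfig:
    """User-controlled factors; names, units, and defaults are assumptions."""
    warehouses: int = 10
    buffer_pool_pages: int = 1000
    users: int = 1
    interval_seconds: int = 60
    warmup_seconds: int = 0
    think_time_seconds: float = 0.0

    def __post_init__(self):
        # Reject values that could not describe a meaningful run.
        if min(self.warehouses, self.buffer_pool_pages, self.users) < 1:
            raise ValueError("warehouses, buffer pool size, and users must be >= 1")
        if self.interval_seconds <= 0:
            raise ValueError("interval time must be positive")
        if self.warmup_seconds < 0 or self.think_time_seconds < 0:
            raise ValueError("warm-up and thinking time cannot be negative")


def plan_steps(config):
    """Return the operations in the order described in the text above."""
    return [
        "initialize database",
        f"create table spaces (buffer pool: {config.buffer_pool_pages} pages)",
        f"populate tables for {config.warehouses} warehouses",
        f"run toolkit with {config.users} users",
        "stop database",
    ]


steps = plan_steps(TpccRunConfig(warehouses=10, users=5))
```

Validating the factors before any step runs is a design choice worth noting here, since a bad value would otherwise only surface after a long database build.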
Things went pretty smoothly with this part of the project. One downside was that it took the system a very long time to create the table spaces and populate the database. For instance, for TPC-C, for 10 warehouses, it took 30 minutes for the init() operations to complete. The more warehouses we added, the longer it took for this process. This made testing the scripts a bit difficult, but after perfecting init(), we kept the database in place and only recreated it when there were problems with it that were preventing other functions from running successfully.
Although my mentor and I initially considered working with the TPC-H toolkits as well, we did not get to this part of the project. There were several reasons for this. First, the TPC-H toolkit takes about 8 hours to create the table spaces, populate the tables, and initialize the database. The initial plan was to let it run overnight, but it crashed every time we tried this. I worked with Minglong, a graduate student here at Carnegie Mellon, who was very helpful with building the database. Due to her busy schedule and the fact that I was still working with TPC-C, it took about a week to have the database ready to go. With only one week left, my mentor and I decided that it would be better to focus on TPC-C and polish it. Although there were small issues here and there, Lisa and I were successful in producing a working TPC-C benchmark for DB2. We were very pleased with our results and our efforts. It was very rewarding to see all of the hours we spent in the lab produce something like the interface. It will be a helpful tool for the Staged Database System and will be useful when demonstrating the effectiveness and elegance of the system.
I enjoyed being able to work on various parts of the project. The hash join implementation and the work on the interface were completely different and required many different skills. It was good to mix these projects and get a feel for the entire project as opposed to only the engine or only the interface.
After the ten week period, the Staged Database System project is closer to being completed and much closer to being in a “presentable” state. Implementing the hash join was very important for the Staged system since it is one of the most widely used joins. The goal of the applet interface was to be able to view various statistics and performance charts for different benchmarks and systems. By having a working TPC-C benchmark for DB2, the presentation aspect of the system is one step closer to being completed. I felt that I really contributed to the Staged Database System and that we made great progress this summer.