[home page] . [participant] . [mentor] . [project description] . [weekly journal]
[week 1] . [week 2] . [week 3] . [week 4] . [week 5]
[week 6] . [week 7] . [week 8] . [week 9] . [week 10]
[personal entry 1] . [personal entry 2] . [personal entry 3]
Week 10: July 20 - 24
It's my last week at ISI! I've been very busy trying to pull everything together before I leave for the summer. After much discussion with the LBNL team, we've figured out the problem with file transfers. The encryption type of the Data Node at their site is different from the encryption that DataMover Lite uses, so we can't access the files with that software. They're working on a solution, but for the time being our client will format transfer requests with the encryption provided by the Data Node. It's frustrating that we haven't gotten a solution all worked out, but it's good to at least know what the problem is!
I've written up a poster to submit to SC09, a high-performance computing conference that will be held in Portland, Oregon, in November. Ann and I are hopeful that it will get accepted and I'll be able to attend! I've also created a transition document so that Ann and her students will be able to continue working with my code. It's been so hectic this summer, but I'm so happy with the progress of my program!
[return to top]
Week 9: July 13 - 17
We decided this week to focus primarily on the first phase of replication (in which we create a data transfer request by querying the metadata catalog for the locations), and leave the second phase (which publishes the data to the gateways) to be solved at a later time. At the end of last week, we were able to create the Data Mover Light transfer documents, but we still haven't been able to download physical files - the DML client is giving us an error. I've been in touch with the creators of DML at Lawrence Berkeley National Lab, so hopefully we'll find a solution soon!
I was able to add functionality to the client that lets it "query" for metadata with whatever level of granularity the user would like. The user can provide a specific model name that they want metadata for, or specify the time frequencies, run numbers, and other important variables to narrow down what they'd like to transfer. This addition makes the client *much* more functional, so I was very happy to get it completed!
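To give a feel for what that narrowing looks like, here's a minimal sketch - the field names and the filter-building function are illustrative stand-ins, not the actual ESG query interface:

    def build_query_filters(model=None, experiment=None, time_frequency=None, run=None):
        """Collect whichever constraints the user supplied into one filter dictionary."""
        candidates = {
            'model': model,
            'experiment': experiment,
            'time_frequency': time_frequency,
            'run': run,
        }
        # Keep only the fields the user actually filled in.
        return dict((key, value) for key, value in candidates.items() if value is not None)

    broad = build_query_filters(model='ccsm3')                                      # a whole model
    narrow = build_query_filters(model='ccsm3', time_frequency='mon', run='run1')   # a single run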
[return to top]
Week 8: July 6 - 10
I've made some progress this week on the actual construction of the replication client. The current version of the client can take an input file of dataset ids at the file level, get the physical locations and sizes of each file, and create an XML document for DataMover Lite! I've been trying to build the system very carefully to make sure it will always "fail gracefully", as my professors like to say - that all errors will be caught by my code, instead of the program throwing an exception and exiting at run time.
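Here's a rough sketch of that "fail gracefully" idea - the lookup helper and the output step are placeholders rather than the real client code, but the shape is the same: every problem is reported and handled instead of crashing the run.

    def lookup_location_and_size(dataset_id):
        """Stand-in for the real metadata query; pretend every file is 1 KB at one host."""
        return 'gsiftp://example-datanode.org/%s' % dataset_id, 1024

    def run_replication(input_path):
        try:
            with open(input_path) as handle:
                dataset_ids = [line.strip() for line in handle if line.strip()]
        except IOError as err:
            print('Could not read %s: %s' % (input_path, err))
            return 1

        entries = []
        for dataset_id in dataset_ids:
            try:
                url, size = lookup_location_and_size(dataset_id)
                entries.append((dataset_id, url, size))
            except Exception as err:
                print('Skipping %s: %s' % (dataset_id, err))   # note the failure and keep going

        if not entries:
            print('Nothing to transfer.')
            return 1
        # ...build the DataMover Lite XML document from `entries` here...
        return 0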
There is definitely a lot left to do. First, I need to figure out the best way to get all the filenames if only a model name or run frequency is provided. Second, I need to continue to modularize the system so that it can be changed easily when BDM is running, or if more information is needed for the file transfer. Most importantly, I need to look at the second phase of replication (where publication of the datasets takes place).
I represented ISI in this week's ESG conference call. It was very interesting - there are a lot of other groups working, and each is at a different stage of production for their part of the project. Something we had not discussed in a long time was the effects of versioning datasets - if changes or updates are made to a dataset, how does this affect the replicas? We'll try to consider this during development, but also need to contact the versioning and publishing teams when our client is further along to figure out how this will affect us.
[return to top]
Week 7: June 29 - July 3
This has been a really good week for the progress of our client! The ESG metadata queries now return the physical location and size of a specified file. Using this information, we can create the XML input files for DataMover Lite! This is such a good step because it's allowed me to begin writing code that creates the DML input files, and it's felt great to be able to SEE my progress!
There are still some issues to work through regarding the query returns - they only return locations and sizes at the file level, so we have to determine a way to get all the filenames if only a model, experiment, or run frequency is provided. Also, we can't physically transfer the data until I get a certificate that enables me to download data from the ESG site. Still, it's been motivating to feel like part of the client is coming together!
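As a tiny illustration, suppose one file's metadata came back shaped like the record below (the real query return is structured differently - this layout is hypothetical). "Harvesting" just means pulling out the two pieces a transfer entry needs:

    record = {
        'id': 'ncar_ccsm3_0.sresa1b.mon.run1.tas',
        'url': 'gsiftp://remote-datanode.example.org/data/tas_run1.nc',
        'size': 289406976,   # bytes
    }

    def harvest(record):
        """Keep only what the DML input file needs: where the file is and how big it is."""
        return record['url'], record['size']

    source_url, size_in_bytes = harvest(record)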
[return to top]
Personal Entry 3: Birthday, Halfway Point
For my 21st birthday, my mom came to visit! It was great having her here for the weekend - it was reassuring for her to see where I've been living and working, and it was fun for me to get to show her around! We went to Malibu on Saturday, and ate at a restaurant called Neptune's Net (apparently a local favorite) before spending the rest of the day on the beach. Sunday we walked through downtown Santa Monica, and then saw "Up!" that night. Monday, which was my actual birthday, we went to the Huntington, a large botanical garden with extensive art and library collections. Afterwards we had dinner at the Cafe del Rey, a restaurant near work and right on the Marina.
This weekend also marked the halfway point for my research! It doesn't feel like 5 weeks have passed already. There are still a bunch of things I need to accomplish for the project, and a ton of places I want to visit in Los Angeles, so I'll have to make the most of the second half of my summer!
[return to top]
Week 6: June 22 - 26
Our conference call on Tuesday was very productive. We expressed the need for the file locations to be returned through a metadata query, and one of the ESG team members said he will try to get this implemented within the week! That was great to hear, as he is very busy with other aspects of the project, and we had been worried that it would take much longer for this problem to be resolved.
We also discussed a number of important questions that had arisen during the planning process for our client - most importantly, how the user would specify what files they want to replicate and how the publication process would work. Talking with the heads of each development team was very useful, since each group has a different focus, and could explain what the priorities of their client were. We learned, for instance, about the time required to re-scan metadata from datasets, and discussed whether this will be an appropriate strategy for the publication step of the process.
Since it does not appear that the Bulk Data Movement client will be functional by the end of the summer, I have been familiarizing myself with DataMover Lite, another client that can transfer data from one site to another. I'm planning to build the client to interact with DataMover Lite, but to modularize it appropriately so when the BDM client is finished, we can easily modify our system to produce input for BDM.
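A minimal sketch of that modularization idea (the writer classes here are hypothetical, not the real DML or BDM interfaces): keep one "transfer document writer" seam, so the DML backend can be swapped for BDM later without touching the rest of the client.

    class DMLRequestWriter(object):
        def write(self, entries, path):
            # ...emit a DataMover Lite input document for (dataset_id, url, size) entries...
            print('writing DML document to %s' % path)

    class BDMRequestWriter(object):
        def write(self, entries, path):
            # ...emit a Bulk Data Movement request once that client is ready...
            print('writing BDM document to %s' % path)

    def build_transfer_request(entries, path, writer=DMLRequestWriter()):
        writer.write(entries, path)   # the rest of the client never needs to know which backend it got

    build_transfer_request([('dataset1', 'gsiftp://host/file.nc', 1024)], 'request.xml')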
[return to top]
Week 5: June 15 - 19
With the help of the (ever-patient) ESG team, I've been able to successfully query the gateways for metadata about a dataset. There was a lot more to the structure of the datasets than I knew initially - each data model has a number of associated experiments, each experiment can have a different time frequency, and then there are multiple files for each run of the experiment! When we are doing our file transfers, we will need to know the exact granularity of the request - does the user want to replicate an entire model? a whole experiment set? just one run? - so getting a handle on the structure will be very helpful.
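A toy version of that structure, just to show the nesting (the names and layout are made up, not the real catalog):

    # model -> experiment -> time frequency -> run -> files
    catalog = {
        'ccsm3': {                                    # model
            'sresa1b': {                              # experiment
                'mon': {                              # time frequency
                    'run1': ['tas_run1.nc', 'pr_run1.nc'],   # files for one run
                    'run2': ['tas_run2.nc', 'pr_run2.nc'],
                },
            },
        },
    }

    def descend(catalog, *levels):
        """Walk down only as far as the request specifies (model, experiment, ...)."""
        node = catalog
        for level in levels:
            node = node[level]
        return node   # a whole subtree for a broad request, or a file list for a narrow one

    print(descend(catalog, 'ccsm3', 'sresa1b', 'mon', 'run1'))   # one run's files
    print(descend(catalog, 'ccsm3'))                             # everything for the model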
Currently, though, the actual file locations are not returned when we do our queries. Without the physical locations of the files, we cannot initiate the bulk data transfers, so this puts our system at a standstill. Ann has scheduled a conference call for Tuesday of next week with our main contacts for the metadata query client, the bulk data movement client, and the publication client, so hopefully we can resolve this issue and start working on the coding of the replication client!
[return to top]
Week 4: June 8 - 12
I finally got all the pieces necessary for our data node set up this week! With the help of some of the developers of the software packages, I was able to install all the parts and get them running in conjunction with each other. After all the frustration of last week, it was very rewarding to have everything running correctly.
Now I have to determine how to query the metadata using the scripts provided by the ESG team, and how to parse that information. I've gone through a decent amount of the online Python tutorial, and have been looking at the ESG scripts (which are written in Python), so hopefully I will be able to write my own code once I have the metadata from the gateways!
[return to top]
Personal Entry 2: Trip to San Diego, Office Life
This weekend, I went to San Diego with Irene, a grad student who works in the same office as me, and her friend Omer. San Diego is a great place to visit - we stopped in La Jolla on the way there, walked around the bay and downtown, toured the botanical gardens at Balboa Park, visited the "Old Town" area, and drove out to Coronado Island. All the food we ate was delicious, and it was fun to be out walking for most of the day - but we were exhausted afterwards!
Along with Irene, I'm working in the same office as another grad student, Amer. Amer came to USC after having worked in Pakistan for a while as a network specialist, so he has a lot of experience with professional computing, and has been very helpful. He's always willing to explain things to me, and has been telling me about the process of being a PhD student! He's working as an RA in family housing on the USC campus, so he's been forwarding links to the programs he and other RAs create for their residents so I can participate. I'm glad that I have the two of them in my office to bounce ideas off of!
[return to top]
Week 3: June 1 - 5
This week I have been trying to set up a Data Node at the ISI site. It's been really hard - for whatever reason, one of the packages won't correctly install, so I've been at a standstill for a lot of the week. I've installed new versions of Python and Tcl/Tk, which solved the first problem, but now the installer is having trouble identifying the Fortran compiler on my system. Luckily I was able to get in touch with one of the developers of CDAT, the package I'm struggling with, so hopefully he'll be able to set me straight! In the process of trying to figure all this out, I've learned a lot about Unix commands and the software installation procedure, so it definitely hasn't been all bad!
I also received a copy of the protocol for communicating with the Bulk Data Movement (BDM) client. It's written in XML, so I spent some time looking into that format, and I think I'll be using Python's built-in module "ElementTree" in order to create XML inputs for the BDM client.
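As a first experiment, I tried building and writing a small document with ElementTree. The tag and attribute names below are invented for practice - they're not the actual BDM protocol:

    import xml.etree.ElementTree as ET

    request = ET.Element('transferRequest')
    entry = ET.SubElement(request, 'file')
    entry.set('source', 'gsiftp://remote-datanode.example.org/data/tas_run1.nc')
    entry.set('destination', 'file:///storage/replicas/tas_run1.nc')
    entry.set('size', '289406976')

    # Write the whole tree out as an XML file, the way a BDM-style input might be saved.
    ET.ElementTree(request).write('bdm_input.xml')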
[return to top]
Week 2: May 25 - 29
This week, I have been more focused on the actual implementation details of our replication client. Tuesday (after the long weekend) I participated in a conference call with two researchers involved in the metadata querying process for ESG. After talking with them and getting a sample interface from them, we were able to get a much clearer picture of the work that I will need to do for our client to be successful. First of all, since the metadata query will return ALL of the available metadata, a step is eliminated - rather than making a query to ask what the physical location of the data is, we can get all of the data initially and then harvest the physical location from it.
We also gained access to the source files for the metadata querying process. All the files are written in the programming language Python, and it was recommended that the replication client be written in Python as well, so a big first step is for me to learn it! I've been going through online tutorials to get a basic understanding of the language and its similarities to and differences from Java and C++, the two languages I've been taught through my CS courses. I also went through the source files we got and traced the methods dealing with metadata querying, so I would have a better idea of the process involved in querying the system.
Friday we talked with the team for the Bulk Data Movement client, which will actually transfer the data from one location to another. After discussing the system with them, it seems that it'll be more convenient for our system to simply create the input files for BDM, and then pass those files back to the user, who can submit them to BDM (rather than having our system replicate all the user functionality of BDM).
[return to top]
Personal Entry 1: Travel from PA
In order to get to Marina del Rey, California, I drove with my father and sister from my hometown in Pennsylvania - a total of 2,922 miles! The trip was LONG, but completely worth it - I had never seen most of the midwestern states, so it was a great opportunity to travel through them! On the way we stopped at a bunch of cities and attractions in order to keep ourselves entertained through five days of driving. Some of my favorites were:
St. Louis: This is such a gorgeous city! We went up in the arch by the riverfront, and then toured the large zoo that they have. The zoo is located in a huge complex of museums (surrounded by a park) and has all sorts of animals - including a great penguin exhibit!
Palo Duro Canyon State Park: This canyon is located in the panhandle of Texas, and was a great introduction to "the west"! It's a drive-through canyon, with tons of pull-off spots for visitors to take pictures. We even saw a rattlesnake!
Grand Canyon: The Grand Canyon was also absolutely gorgeous. It's so huge, and it's amazing to see the Colorado River winding through the center of it! I'm a little afraid of heights, so I made sure to keep away from the edges, but I got a ton of phenomenal photos and the stop was completely worth it!
[return to top]
Week 1: May 18 - 22
After driving cross-country from Pennsylvania, I finally arrived in Marina del Rey, California, to start my summer of research at the University of Southern California's Information Sciences Institute (ISI). I've spent most of this first week reading through introductory-level papers about the Earth System Grid (ESG), a collaboration of sites worldwide that hold important climate modeling data in local storage. By providing a common interface and search capabilities, ESG allows climate researchers to access all this information and perform analyses from their home sites.
ESG's architecture is composed of three main components - user-level Client Applications, Gateways, and Data Nodes. Data Nodes are storage sites where data is physically located. Gateways are associated with multiple Data Nodes, and provide search capabilities and metadata services so users can locate data of interest. Users then interact with the Gateways through Client Applications, which allow them to publish, download, and analyze data.
My responsibility while I'm here will be to create a replication client that allows a source Data Node to fully replicate a set or subset of available data from a target Data Node. The source Data Node will communicate with its parent Gateway to determine the location of the target data, download the data from the target Data Node to the source Data Node, get any associated metadata from the target Data Node's parent Gateway, and then "publish" the replicated data to the source Gateway. This replication process serves two primary goals by allowing important data to be "backed up" at multiple sites - first, if a site is unavailable, core data will remain accessible through other sites; second, replicas will be available at Data Nodes worldwide, so the latency for transferring data to another site will decrease.
Since clients are already built or are in development for each of these four key steps, my job will be to get all of them interacting with each other, so that someone at the source Data Node can simply input the names of the datasets they wish to replicate, without having to interact directly with any of the clients to complete the replication.
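As a rough picture of what that orchestration might look like, here's a sketch in which each helper is a trivial stand-in for one of those existing (or in-development) clients - none of this is their real API:

    def query_metadata(gateway, name):     return ['gsiftp://%s/%s.nc' % (gateway, name)]
    def transfer_files(urls, destination): print('transfer %s -> %s' % (urls, destination))
    def fetch_metadata(gateway, name):     return {'dataset': name}
    def publish(gateway, name, metadata):  print('publish %s to %s' % (name, gateway))

    def replicate(dataset_names, source_gateway, target_gateway):
        for name in dataset_names:
            locations = query_metadata(target_gateway, name)    # 1. where do the files live?
            transfer_files(locations, 'source-datanode')        # 2. pull them to the source node
            metadata = fetch_metadata(target_gateway, name)     # 3. grab the associated metadata
            publish(source_gateway, name, metadata)             # 4. publish the replica locally

    replicate(['ccsm3.sresa1b.mon.run1'], 'source-gw.example.org', 'target-gw.example.org')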
[return to top]