Nechama Gurwitz

CRA DMP Research Project
Summer 2006
Journal


Week 1
I began my research project by learning about tandem repeats in DNA as well as about the technology we will be using, a CGI interface for a C++ program. I read about how CGI works and how it is used to process forms and familiarized myself with the current program. As our first revision will be to allow the user to upload a file, rather than entering the DNA sequence, I concentrated on CGI techniques to deal with uploaded files. Finally, I designed a new website for the program which will enhance its appearance and functionality. The new website has a place for the user to upload a file, and I began revising the program to accept the file as input.
Old Website
New Website - Still in progress and does not currently work. Hopefully, by the end of the summer, the new website will be completely operational.
--Edit-- The new website has been moved to the place of the old one, so the links were removed


Week 2
This week I learned how to compile CGI programs on the server. Then I revised the current program to accept a file, test that it is in the correct format, and run the analysis on its contents. It took a while before the results from the file were identical to the results from the original program, but they seem correct now.
Week 3

Well, I was wrong about the results always being identical. After repeated tests, I noticed that, strangely enough, the program was giving slightly different results each time I ran it on the same DNA file. I spent most of the week trying to track down this bug. I corresponded with the programmer who wrote the CGI class that we are using, and he gave me some ideas which helped a little, although the results are still not perfectly consistent.
I also began looking at a program that Prof. Sokol wrote to filter the results, which we plan on incorporating into the current one.
Week 4

The program was printing out hundreds and hundreds of repeats for each DNA strand, far too many to be of use. I combined Prof. Sokol's filtering program into the main program to limit the number of matches reported. The filtering program finds similar, almost duplicate, matches that were reported, combines them, and also filters out the very small repeats. I formatted the combined results in a table for ease of reading.
We also tried uploading very large files, as FASTA files can be enormous, but have not yet figured out how to handle them.
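The filtering step can be sketched roughly as follows. This is not Prof. Sokol's program, whose actual criteria are not described here. It assumes each reported repeat is a start index, a period, and a total length, treats two repeats as near-duplicates when they share a period and their ranges overlap or touch, and drops anything shorter than a caller-chosen minimum. The names Repeat and filterRepeats are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical record for one reported tandem repeat.
struct Repeat {
    std::size_t start;   // index of the first character of the repeat
    std::size_t period;  // length of the repeating unit
    std::size_t length;  // total number of characters covered
};

// Drops repeats shorter than minLength, then merges repeats that have the
// same period and overlapping (or touching) ranges into a single entry.
std::vector<Repeat> filterRepeats(std::vector<Repeat> repeats,
                                  std::size_t minLength) {
    repeats.erase(
        std::remove_if(repeats.begin(), repeats.end(),
                       [minLength](const Repeat& r) {
                           return r.length < minLength;
                       }),
        repeats.end());

    // Sort by period, then by start, so near-duplicates become adjacent.
    std::sort(repeats.begin(), repeats.end(),
              [](const Repeat& a, const Repeat& b) {
                  return a.period != b.period ? a.period < b.period
                                              : a.start < b.start;
              });

    std::vector<Repeat> merged;
    for (const Repeat& r : repeats) {
        if (!merged.empty() && merged.back().period == r.period &&
            r.start <= merged.back().start + merged.back().length) {
            // Extend the previous entry to cover both ranges.
            Repeat& last = merged.back();
            std::size_t end =
                std::max(last.start + last.length, r.start + r.length);
            last.length = end - last.start;
        } else {
            merged.push_back(r);
        }
    }
    return merged;
}
```

Sorting first means one linear pass is enough to merge, which matters when the raw output contains hundreds of matches per sequence.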
Week 5

A big advantage of the filtered table is that the results are clearly organized and easy to see. However, the table shows neither a picture of the repeat nor its exact sequence. So I added a "View Repeat" link to each repeat, which opens a new CGI page in a pop-up window showing just that repeat. That was fairly complicated, as it combined both CGI and JavaScript processing.
We found an error in the program, where a function should have been changed when it was converted to a web application. Since it wasn't changed, the DNA indexes were printing incorrectly. We changed that to work with the new version.
I spent a lot of time uploading large files. We quickly moved up our limit from 2,048 characters to much more - it's currently working with 1.5 MB files, but crashes on the really big ones.
Week 6

After spending a lot of time testing big files, I finally got the program to work on 2 MB files. We decided that 2 MB will be the limit for online processing, since it takes such a long time to process anything bigger that my browser usually crashes. I'll hopefully revise the old (not CGI) program to work with big files and we'll let people download that. Even 2 MB takes close to an hour to upload and process... I hope these big files still work after the rest of the changes, because I'll be testing for many hours otherwise.
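One way a CGI program can enforce a limit like this is to check the request size before reading the body. The sketch below assumes the size is taken from the standard CGI CONTENT_LENGTH environment variable and that "2 MB" means 2 * 1024 * 1024 bytes. The CGI class we use may handle this differently, and the names kMaxUploadBytes and uploadWithinLimit are hypothetical.

```cpp
#include <cerrno>
#include <cstdlib>

// Assumed limit: 2 MB counted as binary megabytes.
const unsigned long kMaxUploadBytes = 2UL * 1024UL * 1024UL;

// Returns true only if contentLength is a plain decimal number that does
// not exceed maxBytes. A missing or malformed value is rejected.
bool uploadWithinLimit(const char* contentLength,
                       unsigned long maxBytes = kMaxUploadBytes) {
    if (contentLength == nullptr || *contentLength == '\0') {
        return false;  // variable not set or empty
    }
    char* end = nullptr;
    errno = 0;
    unsigned long size = std::strtoul(contentLength, &end, 10);
    if (errno != 0 || *end != '\0') {
        return false;  // overflow or trailing non-digit characters
    }
    return size <= maxBytes;
}
```

In a CGI program this would be called as uploadWithinLimit(std::getenv("CONTENT_LENGTH")). For a multipart form upload the reported length includes the form's boundary lines and field headers, so it is slightly larger than the file itself.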

Aside from "actg", the letters symbolizing DNA nucleotides, DNA files are full of the letter "N", which stands for an unknown nucleotide. Sometimes 75% of the file is unknown. We had been taking out all the "N"s before processing, which obviously skews the results. I'm now trying to work out an approach that keeps them in place but does not report a repeat consisting entirely, or mostly, of "N"s, as if every unknown were the same letter. We're trying to replace each N in the sequence with a random letter (so that there are no repeats of N's), but it's not working yet.
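The random-replacement idea can be sketched like this. It is only an illustration of the approach described above, not the code in the program. The function name replaceUnknowns and the explicit seed parameter are assumptions, with the seed added so that runs are reproducible.

```cpp
#include <random>
#include <string>

// Hypothetical helper: returns a copy of `sequence` in which every 'n' or
// 'N' is replaced by a pseudo-randomly chosen nucleotide letter. All other
// characters are left untouched, so positions in the sequence stay the same.
std::string replaceUnknowns(std::string sequence, unsigned seed) {
    static const char bases[] = {'a', 'c', 'g', 't'};
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pick(0, 3);
    for (char& ch : sequence) {
        if (ch == 'n' || ch == 'N') {
            ch = bases[pick(gen)];
        }
    }
    return sequence;
}
```

Because the sequence keeps its original length, reported indexes still line up with the uploaded file. Random letters can still form short repeats by chance, so this reduces spurious all-N matches rather than ruling them out.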


Week 7

I think that this week I finished putting up the program! I cleaned up the output from the program, and did some more testing, and it seems complete. We also spent some time debugging the original C++ (not CGI) program and put it up for download.
Week 8

I got the other tandem repeat program this week. It also finds tandem repeats, but it allows insertions and deletions in the repeat so that a sequence that has the pattern "atcg atc atcg atccg" can be considered a repeat. I converted the program to CGI and created an input form for it. The version of the program that I got is still not complete, so the results aren't that accurate, but the CGI part does work.
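To make the example concrete: each copy in "atcg atc atcg atccg" is at most one insertion or deletion away from "atcg". The sketch below computes the standard Levenshtein edit distance between two strings. It only illustrates the idea of tolerating insertions and deletions and is not the algorithm the edit distance program uses to find repeats. The function name editDistance is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Standard Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
// Uses two rows of the dynamic programming table to save memory.
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) {
        prev[j] = j;  // turning an empty prefix of a into b[0..j) takes j inserts
    }
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;  // turning a[0..i) into an empty string takes i deletes
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t substitute =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            std::size_t remove = prev[j] + 1;
            std::size_t insert = cur[j - 1] + 1;
            cur[j] = std::min({substitute, remove, insert});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

With this function, editDistance("atcg", "atc") and editDistance("atcg", "atccg") are both 1, which is why a program that tolerates a small edit distance can treat all four copies as one repeat.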
Week 9

Justin Tojera, the graduate student working on the edit distance program, sent me an updated version. I made the necessary changes to it and put it online. This version is much more accurate than the previous one, and the results seem more correct. It also finds much longer repeats than the other program did, so we had another problem with the N's. My previous replacement string for the N's was only a few hundred characters long, so when this program finds repeats that are a few thousand characters long, it will find repeats consisting totally of N's. I ended up writing a short program to permute the alphabet in different ways, which generated an 8,000 character string, which I'm using instead.
Week 10

I used this week to write my paper and tie everything up. I ran tests on the edit distance program to find the largest file size it can handle. It runs much more quickly than the other program, but I still spent a long time waiting for files to upload. I did not find a hard maximum, but it can definitely process files up to 11 MB, far more than we expected.
The next step would be creating a database of tandem repeats. There are lots of DNA sequences available, so Dr. Sokol wants to run the program on them and create a database with the results. Then biologists could select a sequence, and see the stored repeats, without uploading the sequence again and without processing the sequence again. I looked into databases, and we discussed what the data structure would be like.