Week 1

Old Website
New Website - Still in progress and does not currently work. Hopefully, by the end of the summer, the new website will be completely operational.

--Edit-- The new website has been moved to the place of the old one, so the links were removed.
Well, I was wrong about the results always being identical. After repeated tests, I noticed that, strangely enough, the program was giving slightly different results each time I ran it on the same DNA file. I spent most of the week trying to figure out this bug. I corresponded with the programmer who wrote the CGI class that we are using, and he gave me some ideas that helped a little, although the program is still not perfectly accurate. I also began looking at a program that Prof. Sokol wrote to filter the results, which we plan on incorporating into the current one.

The program was printing out hundreds and hundreds of repeats for each DNA strand, far too many to be of use. I combined Prof. Sokol's filtering program into the main program to limit the number of matches found. The filtering program finds similar, almost duplicate, matches that were reported, combines them, and also filters out the very small repeats. I formatted the combined results in a table for ease of reading. We also tried uploading very large files, as FASTA files can be enormous, but have not yet figured out how to handle them.

A big advantage of the filtered table is that the results are clearly organized and easy to see. However, the table does not show the repeat visually in a picture, and it does not show the exact sequence either. So I added a "View Repeat" link to each repeat, which opens a new CGI page in a pop-up window showing just that repeat. That was fairly complicated, as it combined both CGI and JavaScript processing. We found an error in the program, where a function should have been changed when the program was converted to a web application. Since it wasn't changed, the DNA indexes were printing incorrectly. We changed the function to work with the new version. I spent a lot of time uploading large files. We quickly moved up our limit from 2,048 characters to much more: it's currently working with 1.5 MB files, but crashes on the really big ones.
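The filtering step from the earlier entry (combining near-duplicate matches and dropping the very small repeats) can be sketched roughly as follows. This is not Prof. Sokol's program, whose details are not given here. The `Repeat` record, the `min_length` and `slack` parameters, and the rule that two reports count as duplicates when they share a period and overlap or nearly touch are all assumptions made for illustration, and Python is used only for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Repeat:
    start: int   # 0-based index of the first base in the repeat
    end: int     # index one past the last base
    period: int  # length of the repeated unit


def filter_repeats(repeats, min_length=10, slack=2):
    """Drop very short repeats, then merge near-duplicate reports.

    Assumption: two reports describe the same repeat when they have the
    same period and the later one starts no more than `slack` bases past
    the end of the region merged so far.
    """
    # Step 1: discard repeats shorter than min_length bases.
    kept = sorted(
        (r for r in repeats if r.end - r.start >= min_length),
        key=lambda r: (r.period, r.start),
    )
    # Step 2: sweep through reports grouped by period, merging overlaps.
    merged = []
    for r in kept:
        last = merged[-1] if merged else None
        if last and last.period == r.period and r.start <= last.end + slack:
            merged[-1] = Repeat(last.start, max(last.end, r.end), r.period)
        else:
            merged.append(r)
    # Report the surviving repeats in order of position.
    return sorted(merged, key=lambda r: r.start)
```

For example, two overlapping reports with period 4 covering bases 0-20 and 2-24 would come back as a single repeat covering 0-24, while a 3-base report would be dropped entirely.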
After spending a lot of time testing big files, I finally got the program to work on 2 MB files. We decided that 2 MB will be the limit for online processing, since anything bigger takes so long to process that my browser usually crashes. I'll hopefully revise the old (not CGI) program to work with big files and we'll let people download that. Even 2 MB takes close to an hour to upload and process... I hope these big files still work after the rest of the changes, because I'll be testing for many hours otherwise.

Aside from "actg", the letters symbolizing DNA nucleotides, DNA files are full of the letter "N", which stands for an unknown nucleotide. Sometimes 75% of the file is unknown. We had been taking out all the "N"s before processing, which obviously skews the results. I'm now trying to work out an approach that does not strip them but also does not treat a stretch consisting entirely, or mostly, of "N"s as a repeat of the same letter. We're trying to replace each N in the sequence with a random letter (so that there are no repeats of N's), but it's not working yet.

I think that this week I finished putting up the program! I cleaned up the output from the program and did some more testing, and it seems complete. We also spent some time debugging the original C++ (not CGI) program and put it up for download.

I got the other tandem repeat program this week. It also finds tandem repeats, but it allows insertions and deletions in the repeat so that a sequence that has the pattern "atcg atc atcg atccg" can be considered a repeat. I converted the program to CGI and created an input form for it. The version of the program that I got is still not complete, so the results aren't that accurate, but the CGI part does work.

Justin Tojera, the graduate student working on the edit distance program, sent me an updated version. I made the necessary changes to it and put it online.
This version is much more accurate than the previous one, and the results seem more correct. It also finds much longer repeats than the other program did, so we had another problem with the N's. My previous replacement string for the N's was only a few hundred characters long, so when this program finds repeats that are a few thousand characters long, it finds repeats consisting entirely of N's. I ended up writing a short program that permutes the alphabet in different ways to generate an 8,000-character string, which I'm using instead.

I used this week to write my paper to tie everything up. I ran tests on the edit distance program to see the largest file size it can handle. It runs much more quickly than the other program, but I still spent a long time waiting for files to upload. I haven't found a maximum size, but it can definitely process files up to 11 MB, far more than we expected. The next step would be creating a database of tandem repeats. There are lots of DNA sequences available, so Dr. Sokol wants to run the program on them and create a database with the results. Then biologists could select a sequence and see the stored repeats without uploading or processing the sequence again. I looked into databases, and we discussed what the data structure would be like.
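To make the database idea above a little more concrete, here is one possible shape for it: a minimal sketch using Python's built-in `sqlite3` module. The entries do not say which database or data structure was actually discussed, so the table names, the columns (`start_pos`, `period`, `copies`), and the helper functions `open_store`, `store_repeats`, and `lookup_repeats` are all hypothetical.

```python
import sqlite3

# Hypothetical layout: one row per sequence, one row per stored repeat.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sequences (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS repeats (
    id          INTEGER PRIMARY KEY,
    sequence_id INTEGER NOT NULL REFERENCES sequences(id),
    start_pos   INTEGER NOT NULL,
    period      INTEGER NOT NULL,
    copies      REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS repeats_by_sequence ON repeats(sequence_id);
"""


def open_store(path=":memory:"):
    """Open (or create) the repeat database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def store_repeats(conn, name, repeats):
    """Save the repeats found in one sequence.

    `repeats` is an iterable of (start_pos, period, copies) tuples.
    """
    cur = conn.execute("INSERT INTO sequences (name) VALUES (?)", (name,))
    seq_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO repeats (sequence_id, start_pos, period, copies) "
        "VALUES (?, ?, ?, ?)",
        [(seq_id, start, period, copies) for start, period, copies in repeats],
    )
    conn.commit()


def lookup_repeats(conn, name):
    """Return the stored repeats for a sequence without reprocessing it."""
    rows = conn.execute(
        "SELECT r.start_pos, r.period, r.copies FROM repeats r "
        "JOIN sequences s ON s.id = r.sequence_id "
        "WHERE s.name = ? ORDER BY r.start_pos",
        (name,),
    )
    return rows.fetchall()
```

With a layout like this, the repeat-finding program would run once per sequence and call something like `store_repeats`, and the website would only need `lookup_repeats` when a biologist selects a sequence.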