Scientists who study DNA often look for tandem repeats: patterns in which a sequence of nucleotides is repeated back to back. These repeats are used for genetic identification and to identify some genetic diseases. Because DNA strands can contain billions of nucleotides, an efficient and comprehensive program is necessary to find the repeats. Dr. Sokol has developed two such programs: one finds tandem repeats while allowing some mismatches in the repeats, and the other finds tandem repeats while allowing variations in the repeat sizes. I am working on putting the programs online in a user-friendly format that scientists will be able to use. (The first program was previously online in a very limited version.) This involves setting up a webpage for the form inputs, converting the C++ program to work as a CGI program, and combining that program with an additional post-processing program that filters the results. The same program will be available to download for use on desktop computers. Other enhancements will be made as they become necessary.
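To make the idea of a repeat "allowing some mismatches" concrete, here is a minimal C++ sketch. It is a naive illustration only, not Dr. Sokol's algorithm: it checks whether a block of nucleotides is followed immediately by a near copy of itself. The function name and its parameters (start, period, max_mismatches) are hypothetical and chosen just for this example.

```cpp
#include <cstddef>
#include <string>

// Naive illustration (not the algorithm used by the actual programs):
// returns true if the block seq[start, start + period) is immediately
// followed by a second block of the same length that differs from it
// in at most max_mismatches positions.
bool is_approx_tandem_pair(const std::string& seq,
                           std::size_t start,
                           std::size_t period,
                           std::size_t max_mismatches) {
    // Both blocks must fit entirely inside the sequence.
    if (period == 0 || start > seq.size()) return false;
    if ((seq.size() - start) / 2 < period) return false;

    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < period; ++i) {
        if (seq[start + i] != seq[start + period + i]) {
            ++mismatches;
            if (mismatches > max_mismatches) return false;  // too different
        }
    }
    return true;
}
```

For example, in the sequence ACGATGTT the blocks ACG and ATG differ in one position, so the call with start 0 and period 3 succeeds when one mismatch is allowed and fails when none are. A real search over billions of nucleotides needs a far more efficient method than testing every start and period this way.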
So far, I have nearly completed putting the first program online. It allows users to upload files of up to 2 MB, processes the files, filters the results, and outputs them in a clear table. There are several tasks that I still hope to accomplish. DNA sequences contain large numbers of unknown nucleotides, which the program had been removing and ignoring in the repeat search. This skews the indexes of all the other nucleotides and can cause the program to report matches between nucleotides that are actually separated by thousands of unknown characters. We hope to implement a way to deal with these unknown characters. When the online program is complete, I plan to convert it to a typical desktop program for scientists to download and use on their own computers. This will be faster for them and will allow them to process bigger files. Additionally, if I have time after this program is completed, Dr. Sokol has another program to put online.
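One possible way to handle the unknown characters, sketched here in C++ as an assumption rather than as the approach we will necessarily adopt, is to split the sequence into runs of known nucleotides and remember where each run began in the original sequence. The repeat search could then run on each segment separately, so no match spans a gap, and reported positions could be translated back to the original indexes. The names Segment, split_on_unknown, and to_original are hypothetical, and the sketch assumes unknown nucleotides are written as 'N' or 'n'.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A maximal run of known nucleotides, tagged with its position
// in the original (unfiltered) sequence.
struct Segment {
    std::size_t offset;  // 0-based index of the first character in the original
    std::string bases;   // the known nucleotides in this run
};

// Assumption: unknown nucleotides appear as 'N' or 'n' in the input.
bool is_unknown(char c) { return c == 'N' || c == 'n'; }

// Split the sequence at every run of unknown characters, keeping
// the original starting index of each remaining segment.
std::vector<Segment> split_on_unknown(const std::string& seq) {
    std::vector<Segment> segments;
    std::size_t i = 0;
    while (i < seq.size()) {
        // Skip over a run of unknown characters.
        while (i < seq.size() && is_unknown(seq[i])) ++i;
        // Collect the following run of known characters.
        std::size_t start = i;
        while (i < seq.size() && !is_unknown(seq[i])) ++i;
        if (i > start) {
            segments.push_back({start, seq.substr(start, i - start)});
        }
    }
    return segments;
}

// Translate a position inside a segment back to the original sequence.
std::size_t to_original(const Segment& s, std::size_t local) {
    return s.offset + local;
}
```

For the input ACGTNNNNACAC this yields two segments, ACGT at offset 0 and ACAC at offset 8, so a repeat found at local position 2 of the second segment would be reported at index 10 of the original sequence instead of index 6.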