Rachel Teo's DMP 2008 Adventure


All genomes consist of genes interspersed with sections of non-coding DNA. Generally, there is much more non-coding DNA than there are genes.

Traditionally, non-coding DNA was considered junk: mutations that had accrued in the genome over the generations. More recently, however, scientists have found introns and regulators in the non-coding regions, which suggests that the rest of the non-coding DNA might also have meaning.

Unfortunately, because non-coding regions of the genome tend to accrue mutations at a much higher rate than coding regions, it is difficult to tell which portions of the sequence are likely to be significant. Motif inference works with the given data to try to find the motif that recurs most frequently across the DNA of several divergent species. In the most basic case, for a motif that is n bases long, every possible n-mer is compared against the provided DNA and assigned a score. The final score for an n-mer is the sum of the highest score it achieves within each species’ genetic information. The n-mer with the highest final score is considered the one most likely to have been the original motif, and therefore the one most likely to have some sort of significance.
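The basic exhaustive approach can be sketched in a few lines of Python. This is only an illustration: the post does not say how an n-mer is scored against a sequence, so the sketch assumes the score of a window is simply the number of matching bases. The function names and the three toy sequences are made up for the example.

```python
from itertools import product


def best_match_score(motif, sequence):
    """Highest number of matching bases over every window of the sequence.

    Assumption: a window's score is its count of positions equal to the motif.
    """
    n = len(motif)
    best = 0
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        matches = sum(a == b for a, b in zip(motif, window))
        best = max(best, matches)
    return best


def motif_score(motif, sequences):
    """Final score: the sum of the best score found within each species."""
    return sum(best_match_score(motif, seq) for seq in sequences)


def exhaustive_search(sequences, n):
    """Score all 4^n possible n-mers and return the highest-scoring one."""
    candidates = ("".join(bases) for bases in product("ACGT", repeat=n))
    return max(candidates, key=lambda motif: motif_score(motif, sequences))


# Toy data: one short sequence per "species", each containing ACGT.
sequences = ["TTACGTAA", "GGACGTCC", "CACGTTTG"]
best = exhaustive_search(sequences, 4)
print(best, motif_score(best, sequences))
```

With these toy sequences the only 4-mer that appears in all three is ACGT, so it wins with a perfect score of 4 per sequence.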

The problem with motif inference is that there are 4^n possible n-mers, so the exhaustive approach quickly becomes intractable as n grows. Realistically speaking, it is impossible to score every single n-mer for larger values of n, which is where GAMI comes in.
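To get a feel for how fast 4^n grows, here is a tiny Python snippet that prints the number of candidate n-mers for a few motif lengths. The lengths are arbitrary examples, not values taken from GAMI.

```python
def search_space_size(n):
    """Number of distinct n-mers over the four bases A, C, G, T."""
    return 4 ** n


for n in (5, 10, 15, 20):
    print(f"n = {n:2d}: {search_space_size(n):,} possible n-mers")
```

At n = 10 there are already over a million candidates, and at n = 20 over a trillion, each of which would have to be compared against every sequence.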

GAMI uses a genetic algorithm to score only some of the possible n-mers. It starts with a randomly generated initial generation, whose members are combined to create a second generation, which is in turn combined to create a third, and so on. Throughout the process, each generation is scored, and a list of the motifs with the highest fitness levels is maintained, up to the population size specified by the user.
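The overall loop might look something like the Python sketch below. It is not GAMI's actual code: the names `evolve`, `toy_fitness` and `toy_breed` are hypothetical, the fitness function is a stand-in that counts matches against a made-up target, and parents are picked uniformly at random to keep the example short.

```python
import random

BASES = "ACGT"


def random_motif(n, rng):
    """One random n-mer for the initial generation."""
    return "".join(rng.choice(BASES) for _ in range(n))


def evolve(fitness, breed, n, population_size, generations, seed=0):
    """Score each generation and keep the best motifs seen so far.

    The kept list is capped at population_size, as described in the post.
    """
    rng = random.Random(seed)
    population = [random_motif(n, rng) for _ in range(population_size)]
    best = {}  # motif -> fitness, for the best motifs found so far
    for generation in range(generations):
        if generation > 0:
            # Combine members of the previous generation into a new one.
            population = [
                breed(rng.choice(population), rng.choice(population), rng)
                for _ in range(population_size)
            ]
        for motif in population:
            best[motif] = fitness(motif)
        ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
        best = dict(ranked[:population_size])
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins (assumptions, not GAMI's scoring or breeding rules).
TARGET = "ACGTAC"


def toy_fitness(motif):
    return sum(a == b for a, b in zip(motif, TARGET))


def toy_breed(parent_a, parent_b, rng):
    cut = len(parent_a) // 2
    child = list(parent_a[:cut] + parent_b[cut:])
    child[rng.randrange(len(child))] = rng.choice(BASES)
    return "".join(child)


hall_of_fame = evolve(toy_fitness, toy_breed, n=6,
                      population_size=20, generations=30)
print(hall_of_fame[:3])
```

In a real run the fitness function would be the cross-species motif score rather than a comparison against a known target.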

Each new generation is produced from pairs of parent motifs through crossover, mutation, or both. Generally, better motifs are weighted more heavily so that they are more likely to be chosen as parents of the next generation. Weaker motifs can still be chosen, however: if only the best motifs were allowed to be parents in each generation, the population could converge too quickly because of “inbreeding”. In crossover, two motifs exchange their data after a certain point, as determined by the user. In mutation, a certain unit in the motif is randomly changed to a new unit.
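Here is a rough Python sketch of the three operations described above. The function names are made up for illustration, and the selection scheme shown (fitness-proportional, using `random.choices`) is just one common way to weight better motifs more heavily; the post does not say which scheme GAMI uses.

```python
import random

BASES = "ACGT"


def select_parent(population, scores, rng):
    """Pick one parent, with better-scoring motifs more likely to be chosen.

    Weaker motifs keep a nonzero chance, which helps avoid "inbreeding".
    Assumption: scores are non-negative and not all zero.
    """
    return rng.choices(population, weights=scores, k=1)[0]


def crossover(parent_a, parent_b, point):
    """Swap everything after the crossover point between two motifs."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(motif, rng):
    """Change one randomly chosen base to a different base."""
    position = rng.randrange(len(motif))
    alternatives = [base for base in BASES if base != motif[position]]
    return motif[:position] + rng.choice(alternatives) + motif[position + 1:]


rng = random.Random(1)
population = ["AAAA", "CCCC", "GGGG", "TTTT"]
scores = [4, 3, 2, 1]

parent_a = select_parent(population, scores, rng)
parent_b = select_parent(population, scores, rng)
child_a, child_b = crossover(parent_a, parent_b, point=2)
print(child_a, child_b, mutate(child_a, rng))
```

For example, crossing AAAA and CCCC at point 2 gives AACC and CCAA.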

GAMI allows motif inference to be performed on the DNA of multiple divergent species with higher values of n: although not every possible n-mer is scored, the ones that are likely to score highly are.