Rachel Teo's DMP 2008 Adventure

Out of all the genetic material across the various species, only a very small percentage of the DNA is considered “coding DNA”, that is to say, DNA which codes for proteins. These coding regions are commonly referred to as genes, and they occur throughout the genetic material, interspersed with non-coding DNA. Some of the non-coding DNA provides extra instructions that regulate the production of proteins, but the purpose of the bulk of the non-coding DNA is still a mystery.

Motif inference refers to the process of analyzing the corresponding region of non-coding DNA across various species of organisms and trying to find out which motifs (short, recurring sequence patterns) have been retained through evolution. Generally, DNA that has some sort of functional significance is less likely to be mutated beyond recognition, so motifs that have been well conserved are better candidates for functional elements.

To find motifs, an algorithm can be written that takes a given “ideal” string of DNA that is n bases long (an n-mer) and compares it against every n-mer in the given data. Since the “ideal” motif is seldom known, every possible n-mer must be tested. Each of the n positions can hold one of four bases (A, C, G or T), so the comparison must be run 4^n times. The running time of the overall algorithm therefore grows as O(4^n), which makes the problem intractable for all but small values of n.
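
The exhaustive search described above could be sketched roughly as follows. This is only an illustration, not the scoring actually used in the project: it assumes that a candidate is scored by the total number of mismatches (Hamming distance) to its closest n-mer in each sequence, that lower scores are better, and that every sequence is at least n bases long. The function names are made up for the example.

```python
from itertools import product

BASES = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match_distance(candidate, sequence):
    """Smallest Hamming distance between the candidate and any n-mer in the sequence."""
    n = len(candidate)
    return min(
        hamming(candidate, sequence[i:i + n])
        for i in range(len(sequence) - n + 1)
    )

def score(candidate, sequences):
    """Assumed scoring: sum of best-match distances over all sequences (lower is better)."""
    return sum(best_match_distance(candidate, s) for s in sequences)

def brute_force_motif(sequences, n):
    """Try all 4^n possible n-mers and return the one with the lowest score."""
    candidates = ("".join(p) for p in product(BASES, repeat=n))
    return min(candidates, key=lambda c: score(c, sequences))

if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACGTAA", "CACGTCCC"]
    print(brute_force_motif(toy, 4))
```

For n = 4 this loop only visits 256 candidates, but for n = 15 it would already have to visit more than a billion, which is why the exhaustive approach does not scale.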

One solution is to use a genetic algorithm, which evolves a population of candidate n-mers toward ones that are likely to score well against the data without necessarily having to search through every single possibility.
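
A minimal sketch of what such a genetic algorithm might look like is given below. Everything in it is an assumption made for illustration, not the design used in this project: the mismatch-based score (lower is better), the keep-the-best-half selection, single-point crossover, per-base random mutation, and all parameter values and function names. It also assumes every sequence is at least n bases long and that the population has at least four members.

```python
import random

BASES = "ACGT"

def mismatch_score(candidate, sequences):
    """Assumed scoring: sum over sequences of the fewest mismatches to any n-mer."""
    n = len(candidate)
    total = 0
    for seq in sequences:
        total += min(
            sum(a != b for a, b in zip(candidate, seq[i:i + n]))
            for i in range(len(seq) - n + 1)
        )
    return total

def mutate(candidate, rate, rng):
    """Redraw each base at random with probability `rate` (may redraw the same base)."""
    return "".join(
        rng.choice(BASES) if rng.random() < rate else base
        for base in candidate
    )

def crossover(a, b, rng):
    """Single-point crossover: a prefix of one parent joined to a suffix of the other."""
    if len(a) < 2:
        return a
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def genetic_motif_search(sequences, n, population_size=50, generations=40,
                         mutation_rate=0.1, seed=0):
    """Evolve n-mers toward a low mismatch score; returns (best_candidate, its_score)."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(BASES) for _ in range(n))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Keep the better-scoring half unchanged, then refill with mutated offspring.
        population.sort(key=lambda c: mismatch_score(c, sequences))
        survivors = population[:population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            child = crossover(parent_a, parent_b, rng)
            children.append(mutate(child, mutation_rate, rng))
        population = survivors + children
    best = min(population, key=lambda c: mismatch_score(c, sequences))
    return best, mismatch_score(best, sequences)

if __name__ == "__main__":
    toy = ["TTACGTTT", "GGACGTAA", "CACGTCCC"]
    print(genetic_motif_search(toy, 4))
```

Each generation scores only population_size candidates, so the total work is roughly population_size times generations evaluations rather than 4^n. The trade-off is that the result is not guaranteed to be the best possible n-mer, only one that scores well.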