SELF ORGANIZING MAPS AND PROTEINS
Self organizing maps have been used for pattern recognition. Proteins consist of an amino acid sequence. Proteins with similar amino acid sequences tend to have similar structures. I have used MATLAB to graph the amino acids on a grid and then used a map of size 15x15 to self organize the training data. At the end of the program the map is turned into a 3-dimensional graph with a color map depicting the amount of data in each area. The most important part of this program is to properly encode the amino acid sequence. I have tested 3 different types of encoding.

The following are the names, letter codes and ASCII codes of the 20 amino acids:

A-alanine 65        Q-glutamine 81      L-leucine 76        S-serine 83
R-arginine 82       E-glutamic acid 69  K-lysine 75         T-threonine 84
N-asparagine 78     G-glycine 71        M-methionine 77     W-tryptophan 87
D-aspartic acid 68  H-histidine 72      F-phenylalanine 70  Y-tyrosine 89
C-cysteine 67       I-isoleucine 73     P-proline 80        V-valine 86

I designed a C++ program to take the letter codes and convert them to their ASCII values. Therefore myoglobin, which is 154 amino acid residues long, is coded as follows:

MADVKKNCLASLSLAPISKAQQAQVGKDFYKFFFTNHPDLRKYFKGAENFT
ADDVQKSDRFEKLGSGLLLSVHILANTFDNEDVFRAFCRETIDRHVGRGLDP
ALWKAFWSVWVAFLESKGGVSGDQKAAWDKLGTVFNDECQHQLAKHGLPHL

In ASCII it becomes:

Encoding I: 77 65 68 86 75 75 78 67 76 65 83 76 83 76 65 80 73 83 75 65 81 81 65 81 86 71 75 68 70 89 75 70 70 70 84 78 72 80 68 76 82 75 89 70 75 71 65 69 78 70 84 65 68 68 86 81 75 83 68 82 70 69 75 76 71 83 71 76 76 76 83 86 72 73 76 65 78 84 70 68 78 69 68 86 70 82 65 70 67 82 69 84 73 68 82 72 86 71 82 71 76 68 80 65 76 87 75 65 70 87 83 86 87 86 65 70 76 69 83 75 71 71 86 83 71 68 81 75 65 65 87 68 75 76 71 84 86 70 78 68 69 67 81 72 81 76 65 75 72 71 76 80 72 76

Encoding I plots the numbers on the grid in pairs: the first two numbers form the first input point, with 77 as x and 65 as y. Encoding II is similar but more linear: it takes each number and separates it into its digits. For instance, the first number is 77, so 7 is x and 7 is y.
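The original C++ encoder is not shown on this page, so the following is only a minimal sketch of the first two encodings as described above. The function names (`toAscii`, `encodingOne`, `encodingTwo`) are hypothetical, and the choice to form non-overlapping pairs in Encoding I is an assumption, since the text only states how the first point is built.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Convert each one-letter amino acid code to its ASCII value (e.g. 'M' -> 77).
std::vector<int> toAscii(const std::string& sequence) {
    std::vector<int> codes;
    codes.reserve(sequence.size());
    for (unsigned char residue : sequence) {
        codes.push_back(static_cast<int>(residue));
    }
    return codes;
}

// Encoding I: consecutive codes form (x, y) points.
// Assumption: pairs do not overlap, and a trailing unpaired code is dropped.
std::vector<std::pair<int, int>> encodingOne(const std::vector<int>& codes) {
    std::vector<std::pair<int, int>> points;
    for (std::size_t i = 0; i + 1 < codes.size(); i += 2) {
        points.emplace_back(codes[i], codes[i + 1]);
    }
    return points;
}

// Encoding II: each two-digit code is split into its tens digit (x)
// and its ones digit (y), so 77 becomes the point (7, 7).
std::vector<std::pair<int, int>> encodingTwo(const std::vector<int>& codes) {
    std::vector<std::pair<int, int>> points;
    points.reserve(codes.size());
    for (int code : codes) {
        points.emplace_back(code / 10, code % 10);
    }
    return points;
}
```

Calling `encodingOne(toAscii("MADV"))` under these assumptions yields the points (77, 65) and (68, 86), while `encodingTwo(toAscii("MA"))` yields (7, 7) and (6, 5). Because every upper-case letter has a two-digit ASCII code, Encoding II confines all points to the range 0 to 9 on each axis.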
Encoding II: 7 7 6 5 6 8 8 6 7 5 7 5 7 8 6 7 7 6 6 5 8 3 7 6 8 3 7 6 6 5 8 0 7 3 8 3 7 5 6 5 8 1 8 1 6 5 8 1 8 6 7 1 7 5 6 8 7 0 8 9 7 5 7 0 7 0 7 0 8 4 7 8 7 2 8 0 6 8 7 6 8 2 7 5 8 9 7 0 7 5 7 1 6 5 6 9 7 8 7 0 8 4 6 5 6 8 6 8 8 6 8 1 7 5 8 3 6 8 8 2 7 0 6 9 7 5 7 6 7 1 8 3 7 1 7 6 7 6 7 6 8 3 8 6 7 2 7 3 7 6 6 5 7 8 8 4 7 0 6 8 7 8 6 9 6 8 8 6 7 0 8 2 6 5 7 0 6 7 8 2 6 9 8 4 7 3 6 8 8 2 7 2 8 6 7 1 8 2 7 1 7 6 6 8 8 0 6 5 7 6 8 7 7 5 6 5 7 0 8 7 8 3 8 6 8 7 8 6 6 5 7 0 7 6 6 9 8 3 7 5 7 1 7 1 8 6 8 3 7 1 6 8 8 1 7 5 6 5 6 5 8 7 6 8 7 5 7 6 7 1 8 4 8 6 7 0 7 8 6 8 6 9 6 7 8 1 7 2 8 1 7 6 6 5 7 5 7 2 7 1 7 6 8 0 7 2 7 6

Encoding III takes the letter of the amino acid and assigns it an even number, equal to twice the letter's position in the alphabet.

A-2   Q-34  L-24  S-38
R-36  E-10  K-22  T-40
N-28  G-14  M-26  W-46
D-8   H-16  F-12  Y-50
C-6   I-18  P-32  V-44

Therefore Encoding III looks like this:

26 2 8 44 22 22 28 6 24 2 38 24 38 24 2 32 18 38 22 2 34 34 2 34 44 14 22 8 12 50 22 12 12 12 40 28 16 32 8 24 36 22 50 12 22 14 2 10 28 12 40 2 8 8 44 34 22 38 8 36 12 10 22 24 14 38 14 24 24 24 38 44 16 18 24 2 28 40 12 8 28 10 8 44 12 36 2 12 6 36 10 40 18 8 36 16 44 14 36 14 24 8 32 2 24 46 22 2 12 46 38 44 46 44 2 12 24 10 38 22 14 14 44 38 14 8 34 22 2 2 46 8 22 24 14 40 44 12 28 8 10 6 34 16 34 24 2 22 16 14 24 32 16 24

These input sequences are then fed into the MATLAB program, and random data is trained to recognize this protein.

RESULTS FROM ENCODING I
RESULTS FROM ENCODING II
RESULTS FROM ENCODING III
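The MATLAB program itself is not reproduced on this page. As a rough sketch of the self-organizing step described above, the following C++ code trains a 15x15 map on 2-D input points and then counts how many points land on each map cell, which is the quantity the color map is said to depict. Everything beyond the 15x15 size is an assumption made for illustration: the names (`randomMap`, `bestMatch`, `train`, `hitCounts`), the linearly decaying learning rate, the Gaussian neighborhood and the fixed random seed are not taken from the original program.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <random>
#include <utility>
#include <vector>

constexpr int kMapSize = 15;  // 15x15 map, as stated in the text

struct Node { double x; double y; };  // 2-D weight vector of one map cell
using Map = std::array<std::array<Node, kMapSize>, kMapSize>;
using Hits = std::array<std::array<int, kMapSize>, kMapSize>;
using Points = std::vector<std::pair<double, double>>;

// Start from random weights drawn uniformly from [lo, hi) (assumed initialization).
Map randomMap(double lo, double hi, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(lo, hi);
    Map map{};
    for (auto& row : map) {
        for (auto& node : row) {
            node.x = dist(rng);
            node.y = dist(rng);
        }
    }
    return map;
}

// Return (row, column) of the cell whose weights are closest to the point (x, y).
std::pair<int, int> bestMatch(const Map& map, double x, double y) {
    std::pair<int, int> best(0, 0);
    double bestDist = std::numeric_limits<double>::max();
    for (int r = 0; r < kMapSize; ++r) {
        for (int c = 0; c < kMapSize; ++c) {
            const double dx = map[r][c].x - x;
            const double dy = map[r][c].y - y;
            const double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = std::make_pair(r, c);
            }
        }
    }
    return best;
}

// Pull the best matching cell and its grid neighbors toward each input point.
void train(Map& map, const Points& points, int epochs) {
    for (int epoch = 0; epoch < epochs; ++epoch) {
        const double progress = static_cast<double>(epoch) / epochs;
        const double rate = 0.5 * (1.0 - progress);                        // assumed schedule
        const double radius = 1.0 + (kMapSize / 2.0) * (1.0 - progress);   // assumed schedule
        for (const auto& p : points) {
            const std::pair<int, int> bmu = bestMatch(map, p.first, p.second);
            for (int r = 0; r < kMapSize; ++r) {
                for (int c = 0; c < kMapSize; ++c) {
                    const double d2 = (r - bmu.first) * (r - bmu.first) +
                                      (c - bmu.second) * (c - bmu.second);
                    const double influence = std::exp(-d2 / (2.0 * radius * radius));
                    map[r][c].x += rate * influence * (p.first - map[r][c].x);
                    map[r][c].y += rate * influence * (p.second - map[r][c].y);
                }
            }
        }
    }
}

// Count how many input points map to each cell (the "amount of data in each area").
Hits hitCounts(const Map& map, const Points& points) {
    Hits hits{};
    for (const auto& p : points) {
        const std::pair<int, int> bmu = bestMatch(map, p.first, p.second);
        ++hits[bmu.first][bmu.second];
    }
    return hits;
}
```

For Encoding I, for example, one might initialize with `randomMap(65.0, 89.0, 42u)` so that the starting weights cover the range of upper-case ASCII codes, call `train` on the encoded points, and then pass the result of `hitCounts` to a plotting tool to obtain a surface comparable to the 3-dimensional graph described above.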