SELF ORGANIZING MAPS AND PROTEINS
Self-organizing maps have been used for pattern recognition. A protein consists of a
sequence of amino acids, and proteins with similar amino acid sequences tend to be
similar in structure. I used MATLAB to plot the encoded amino acids on a grid and then
used a map of size 15x15 to self-organize on the training data. At the end of the
program the map is turned into a 3-dimensional graph with a color map depicting the
amount of data in each area. The most important part of this program is properly
encoding the amino acid sequence. I have tested 3 different types of encoding.

The following are the names, letter codes, and ASCII codes of the 20 amino acids:
A-alanine        65    Q-glutamine      81    L-leucine        76    S-serine      83
R-arginine       82    E-glutamic acid  69    K-lysine         75    T-threonine   84
N-asparagine     78    G-glycine        71    M-methionine     77    W-tryptophan  87
D-aspartic acid  68    H-histidine      72    F-phenylalanine  70    Y-tyrosine    89
C-cysteine       67    I-isoleucine     73    P-proline        80    V-valine      86


I designed a C++ program to take each letter code and convert it to its ASCII code.

Myoglobin, which is 154 amino acid residues long, is written in letter codes as follows:

MADVKKNCLASLSLAPISKAQQAQVGKDFYKFFFTNHPDLRKYFKGAENFT
ADDVQKSDRFEKLGSGLLLSVHILANTFDNEDVFRAFCRETIDRHVGRGLDP
ALWKAFWSVWVAFLESKGGVSGDQKAAWDKLGTVFNDECQHQLAKHGLPHL

In ASCII it becomes:
Encoding I
77 65 68 86 75 75 78 67 76 65 83 76 83 76 65 80 73 83 75 65 81 81 65 81 86 71 75
68 70 89 75 70 70 70 84 78 72 80 68 76 82 75 89 70 75 71 65 69 78 70 84 65 68 68
86 81 75 83 68 82 70 69 75 76 71 83 71 76 76 76 83 86 72 73 76 65 78 84 70 68 78
69 68 86 70 82 65 70 67 82 69 84 73 68 82 72 86 71 82 71 76 68 80 65 76 87 75 65
70 87 83 86 87 86 65 70 76 69 83 75 71 71 86 83 71 68 81 75 65 65 87 68 75 76 71
84 86 70 78 68 69 67 81 72 81 76 65 75 72 71 76 80 72 76

Encoding I takes the numbers two at a time and plots each pair on the grid. The first input point is therefore 77 as x and 65 as y.
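
The C++ encoder itself is not shown here. As a rough illustration only, the following Python sketch converts letter codes to ASCII codes and groups them into (x, y) points. It assumes the numbers are consumed in consecutive, non-overlapping pairs; the function names ascii_codes and pair_points are made up for this example.

```python
# Sketch of Encoding I: letters -> ASCII codes -> (x, y) points.
# Assumption: values are paired consecutively without overlap.

def ascii_codes(sequence):
    """Return the ASCII code of each one-letter amino acid code."""
    return [ord(residue) for residue in sequence.upper()]


def pair_points(values):
    """Group a flat list into consecutive, non-overlapping (x, y) pairs."""
    if len(values) % 2:
        raise ValueError("need an even number of values to form pairs")
    return list(zip(values[0::2], values[1::2]))


# First ten residues of the myoglobin sequence listed above.
fragment = "MADVKKNCLA"
codes = ascii_codes(fragment)
points = pair_points(codes)
print(codes)
print(points)
```

The first pair produced for this fragment is (77, 65), matching the first input point described above.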

Encoding II is similar but more linear: it takes each number and separates it into its two digits. For instance, the first number is 77, therefore 7 is x and 7 is y.

Encoding II:
7 7 6 5 6 8 8 6 7 5 7 5 7 8 6 7 7 6 6 5 8 3 7 6 8 3 7 6 6 5 8 0 7 3 8 3 7 5 6 5 8 1 8 1 6
5 8 1 8 6 7 1 7 5 6 8 7 0 8 9 7 5 7 0 7 0 7 0 8 4 7 8 7 2 8 0 6 8 7 6 8 2 7 5 8 9 7 0 7 5
7 1 6 5 6 9 7 8 7 0 8 4 6 5 6 8 6 8 8 6 8 1 7 5 8 3 6 8 8 2 7 0 6 9 7 5 7 6 7 1 8 3 7 1 7
6 7 6 7 6 8 3 8 6 7 2 7 3 7 6 6 5 7 8 8 4 7 0 6 8 7 8 6 9 6 8 8 6 7 0 8 2 6 5 7 0 6 7 8 2
6 9 8 4 7 3 6 8 8 2 7 2 8 6 7 1 8 2 7 1 7 6 6 8 8 0 6 5 7 6 8 7 7 5 6 5 7 0 8 7 8 3 8 6 8
7 8 6 6 5 7 0 7 6 6 9 8 3 7 5 7 1 7 1 8 6 8 3 7 1 6 8 8 1 7 5 6 5 6 5 8 7 6 8 7 5 7 6 7 1
8 4 8 6 7 0 7 8 6 8 6 9 6 7 8 1 7 2 8 1 7 6 6 5 7 5 7 2 7 1 7 6 8 0 7 2 7 6
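
The digit-splitting step of Encoding II can be sketched in Python as below. This is an illustration, not the original program: it assumes every input is a two-digit ASCII code (true for the capital letters used here), and the names split_digits and digit_points are hypothetical.

```python
# Sketch of Encoding II: split each two-digit ASCII code into its digits.
# Assumption: every code lies between 10 and 99.

def digit_points(codes):
    """Return one (tens, ones) point per code, e.g. 77 -> (7, 7)."""
    points = []
    for code in codes:
        if not 10 <= code <= 99:
            raise ValueError(f"expected a two-digit code, got {code}")
        points.append(divmod(code, 10))
    return points


def split_digits(codes):
    """Return the flat digit list, as in the Encoding II listing."""
    return [digit for point in digit_points(codes) for digit in point]


# ASCII codes of the first four residues (M, A, D, V).
first_codes = [77, 65, 68, 86]
print(digit_points(first_codes))
print(split_digits(first_codes))
```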

Encoding III takes the letter of the amino acid and assigns it an even number: twice the letter's position in the alphabet.
A-2       Q-34      L-24      S-38
R-36      E-10      K-22      T-40
N-28      G-14      M-26      W-46
D-8       H-16      F-12      Y-50
C-6       I-18      P-32      V-44

Therefore Encoding III looks like this:

26 2 8 44 22 22 28 6 24 2 38 24 38 24 2 32 18 38 22 2 34 34 2 34 44 14 22 8 12 50
22 12 12 12 40 28 16 32 8 24 36 22 50 12 22 14 2 10 28 12 40 2 8 8 44 34 22 38 8
36 12 10 22 24 14 38 14 24 24 24 38 44 16 18 24 2 28 40 12 8 28 10 8 44 12 36 2
12 6 36 10 40 18 8 36 16 44 14 36 14 24 8 32 2 24 46 22 2 12 46 38 44 46 44 2 12
24 10 38 22 14 14 44 38 14 8 34 22 2 2 46 8 22 24 14 40 44 12 28 8 10 6 34 16 34
24 2 22 16 14 24 32 16 24
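
The values in the Encoding III listing are consistent with twice each letter's position in the alphabet (M is the 13th letter, so M becomes 26). Assuming that rule, a minimal Python sketch is shown below; the function name encode_even is made up for this example.

```python
# Sketch of Encoding III: each letter -> 2 * (position in the alphabet).
# Assumption: the rule is inferred from the listed values.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def encode_even(sequence):
    """Map each one-letter amino acid code to an even number."""
    values = []
    for residue in sequence.upper():
        if residue not in AMINO_ACIDS:
            raise ValueError(f"not a standard amino acid code: {residue}")
        values.append(2 * (ord(residue) - ord("A") + 1))
    return values


# First ten residues of the myoglobin sequence listed above.
encoded = encode_even("MADVKKNCLA")
print(encoded)
```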

These input sequences are then fed into the MATLAB program, and random
data is trained to recognize this protein.
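
The MATLAB program is not reproduced here. The following Python sketch shows one common way to train a 15x15 self-organizing map on 2-D input points and count how many inputs land on each node, which is the kind of quantity a color map of "amount of data in each area" could display. Everything in it is an assumption for illustration: the random initial weights, the linearly decaying learning rate and Gaussian neighborhood, the parameter values, and the names train_som, best_matching_unit, and hit_counts.

```python
import math
import random


def best_matching_unit(weights, x, y):
    """Return (row, col) of the node whose weight is closest to (x, y)."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, (wx, wy) in enumerate(row):
            dist = (x - wx) ** 2 + (y - wy) ** 2
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(points, rows=15, cols=15, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map on 2-D points and return the weight grid."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Assumption: nodes start at random positions inside the data's bounding box.
    weights = [[(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
                for _ in range(cols)] for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(points)
    step = 0
    for _ in range(epochs):
        order = list(points)
        rng.shuffle(order)
        for x, y in order:
            progress = step / total_steps
            lr = lr0 * (1.0 - progress)                  # learning rate decays
            sigma = max(sigma0 * (1.0 - progress), 0.5)  # neighborhood shrinks
            br, bc = best_matching_unit(weights, x, y)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_dist2 / (2.0 * sigma * sigma))
                    wx, wy = weights[r][c]
                    # Pull the node toward the input, scaled by lr and h.
                    weights[r][c] = (wx + lr * h * (x - wx),
                                     wy + lr * h * (y - wy))
            step += 1
    return weights


def hit_counts(weights, points):
    """Count how many input points map to each node of the trained map."""
    counts = [[0] * len(weights[0]) for _ in weights]
    for x, y in points:
        r, c = best_matching_unit(weights, x, y)
        counts[r][c] += 1
    return counts


# First five Encoding I points (ASCII pairs) as a tiny training set.
points = [(77, 65), (68, 86), (75, 75), (78, 67), (76, 65)]
weights = train_som(points)
counts = hit_counts(weights, points)
print(sum(map(sum, counts)))
```

With the full 154-residue input the same functions apply unchanged; only the points list grows.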

RESULTS FROM ENCODING I


RESULTS FROM ENCODING II


RESULTS FROM ENCODING III


