Protein Secondary Structure Prediction using Pattern Recognition Neural Network

P.V. Nageswara Rao 1 (nagesh@gitam.edu), T. Uma Devi 1, DSVGK Kaladhar 1, G.R. Sridhar 2, Allam Appa Rao 3
1 GITAM University, 2 Endocrine and Diabetes Centre, Visakhapatnam, 3 JNTUK, Kakinada, India

ABSTRACT
Proteins are key biological molecules with diverse functions. With newer technologies (genomics, proteomics) producing more data than can be annotated manually, in silico prediction of protein structure, and thereafter function, has been christened the Holy Grail of structural bioinformatics. Successful secondary structure prediction provides a starting point for direct tertiary structure modeling; in addition, it improves sequence analysis and sequence-structure binding for structure and function determination. Using machine learning and data mining, we developed a statistically based pattern recognition technique for predicting protein secondary structure from the component amino acid sequence. By applying this technique, a performance score of Q8 = 72.3% was achieved. This compares well with other established techniques, such as NN-I and GOR IV, which achieved Q3 scores of 64.05% and 63.19% respectively when predictions are made on single sequences alone.

Key words: Secondary Structure, Pattern Recognition, Neural Network.

1. INTRODUCTION
The prediction of protein structure from amino acid sequence has been a target of scientists since Anfinsen (1973) [1] showed that the information necessary for protein folding resides completely within the primary structure. The emergence of rapid methods of DNA sequencing, together with the translation of the genetic code into protein sequences, has boosted the need for automated methods of interpreting these linear sequences in terms of three-dimensional structure [2].
Although advanced molecular biology laboratory techniques have reduced the time necessary to determine a protein structure by X-ray crystallography, a crystal structure determination may still require many months. NMR techniques have helped in determining protein structure, but NMR is also costly and time-consuming, requires large amounts of highly soluble protein, and is severely limited by protein size [2]. Current experimental methods will therefore not meet present and future needs for protein structure determination.

2. RELATED WORKS
There are two main approaches to determining protein structure theoretically. The first is a molecular mechanics approach, based on the assumption that a correctly folded protein occupies a minimum energy conformation, most likely one near the global minimum of free energy. Potential energy is obtained by summing the terms due to bonded and non-bonded components, estimated from force field parameters, and can then be minimized as a function of atomic coordinates in order to reach the nearest local minimum [3,4]. This approach is very sensitive to the conformation of the molecule at the beginning of the simulation. One way to address this problem is to use molecular dynamics to simulate how the molecule moves away from that initial state; Newton's laws and Monte Carlo methods have been used to search for a global energy minimum. The molecular mechanics approach faces the problems of inaccurate force field parameters and a spectrum of multiple minima [2].

The second approach, predicting protein structure from sequence alone, is based on data sets of known protein structures and sequences. It attempts to find common features in these data sets that can be generalized to provide structural models of other proteins.
Many statistical methods use the frequencies with which the different amino acid types occur in helices, strands, and loops to predict the location of these structures in a sequence [5-10]. The main idea is that a segment or motif of a target protein whose sequence is similar to a segment or motif of known structure is assumed to have the same structure.

ISSN: 0975-5462 1752

Protein secondary structure prediction means the prediction of the formation of regular local structures such as α helices, β strands and coils. Solving the protein folding problem will pave the way to rapid progress in the fields of protein engineering and drug design. As the number of protein sequences is growing much faster than our ability to solve their structures experimentally in molecular biology laboratories, in silico prediction methods will narrow the gap between available sequences and structures. Previous research showed that it is promising to derive general rules for predicting protein structure from existing data and then apply them to unknown structures. Several methods have used this approach [5,11-14].

Many statistically based methods use the different frequencies of amino acid types in sequences to predict their location in the secondary structure conformations: helices, strands, and coils [5-10]. The basic idea is that a segment or motif of a target protein that has a sequence similar to a segment or motif of known structure is assumed to have the same structure. Unfortunately, for many proteins there is not enough homology to any protein sequence of known structure to allow application of this technique.

The GOR method was first proposed in [15] and is named after its authors, Garnier, Osguthorpe and Robson. It attempts to include information about a slightly longer segment of the polypeptide chain. Instead of considering the tendency of a single residue, position-dependent tendencies have been calculated for all residue types. The prediction is therefore influenced not only by the actual residue at a given position but also, to some extent, by neighbouring residues [16]. The propensity tables to some extent reflect the fact that positively charged residues are more often found at the C-terminal end of helices and negatively charged residues at the N-terminal end.

3. PROPOSED METHOD
The dssp database (http://swift.cmbi.kun.nl/gv/dssp/) is an archive of protein sequences with their secondary structures. Each file describes the primary structure of a protein and the secondary structure of each amino acid in a columnar fashion. A set of 625 non-redundant proteins with more than 25% sequence similarity was extracted. A sniffer was written to extract the sequence and its secondary structure from each .dssp file. A sample .dssp file is presented in Fig. 1.

==== Secondary Structure Definition by the program DSSP, updated CMBI version by ElmK / April 1,2000 ==== DATE=7-OCT-2009
REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983) 2577-2637
HEADER HYDROLASE 30-MAR-09 3GUP.
COMPND 2 MOLECULE: LYSOZYME;.
SOURCE 2 ORGANISM_SCIENTIFIC: ENTEROBACTERIA PHAGE T4;.
AUTHOR L.LIU,B.W.MATTHEWS.
324 2 2 2 0 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN).
16027.0 ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2).
234 72.2 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J), SAME NUMBER PER 100 RESIDUES.
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES.
24 7.4 TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES.
4 1.2 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES.
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-3), SAME NUMBER PER 100 RESIDUES.
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-2), SAME NUMBER PER 100 RESIDUES.
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-1), SAME NUMBER PER 100 RESIDUES.
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+0), SAME NUMBER PER 100 RESIDUES
0 0.0 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+1), SAME NUMBER PER 100 RESIDUES
14 4.3 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2), SAME NUMBER PER 100 RESIDUES
27 8.3 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3), SAME NUMBER PER 100 RESIDUES
163 50.3 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES
6 1.9 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** HISTOGRAMS OF ***.
0 0 0 0 3 1 2 4 2 0 0 2 2 2 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 RESIDUES PER ALPHA HELIX.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PARALLEL BRIDGES PER LADDER.
2 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ANTIPARALLEL BRIDGES PER LADDER.
0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LADDERS PER SHEET.
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 1 A M 0 0 56 0, 0.0 2, 0.3 0, 0.0 234, 0.1 0.000 360.0 360.0 360.0 147.0 1.6 15.1 50.8
2 2 A N > 0 0 16 156, 0.0 4, 2.7 232, 0.0 5, 0.2 0.934 360.0 80.4 162.6 176.4 4.0 14.6 53.7
3 3 A I H > S+ 0 0 10 2, 0.3 4, 2.6 1, 0.2 5, 0.2 0.824 125.8 52.5 63.1 31.9 4.1 12.9 57.1
4 4 A F H > S+ 0 0 16 2, 0.2 4, 2.4 1, 0.2 1, 0.2 0.947 112.6 43.1 68.9 48.3 2.2 15.7 58.8
5 5 A E H > S+ 0 0 76 2, 0.2 4, 2.0 1, 0.2 2, 0.2 0.901 113.4 54.2 61.2 38.7 0.6 15.7 56.4
6 6 A M H X S+ 0 0 0 4, 2.7 4, 2.1 2, 0.2 2, 0.2 0.944 112.1 41.6 58.9 54.3 0.7 11.9 56.5
7 7 A L H X>S+ 0 0 0 4, 2.6 4, 3.0 2, 0.2 5, 0.7 0.877 109.0 59.4 68.6 32.9 1.1 11.7 60.2
8 8 A R H X5S+ 0 0 95 4, 2.4 4, 1.3 5, 0.2 1, 0.2 0.926 109.4 45.4 60.2 37.5 3.5 14.6 60.2
9 9 A I H <5S+ 0 0 81 4, 2.0 2, 0.2 5, 0.2 1, 0.2 0.926 118.8 40.8 68.4 42.7 5.7 12.5 57.9
10 10 A D H <5S+ 0 0 7 4, 2.1 2, 0.2 1, 0.2 3, 0.2 0.880 129.9 24.9 77.2 38.5 5.3 9.3 60.0
11 11 A E H <5S 0 0 16 4, 3.0 19, 0.4 5, 0.2 3, 0.2 0.704 92.8 159.1 96.5 28.8 5.6 10.7 63.5
12 12 A G << 0 0 27 4, 1.3 2, 0.4 5, 0.7 18, 0.2 0.102 26.9 79.8 64.9 171.0 7.6 13.8 62.8
13 13 A L + 0 0 79 16, 0.2 2, 0.4 4, 0.1 16, 0.2 0.991 46.7 169.5 131.8 122.6 7.6 16.8 65.2
14 14 A R E A 28 0A 165 14, 1.7 14, 2.0 2, 0.4 4, 0.1 0.986 23.6 154.0 131.7 143.5 9.7 17.1 68.4
15 15 A L E S+ 0 0 69 2, 0.4 43, 1.7 12, 0.2 2, 0.3 0.516 73.8 56.8 104.9 3.3 9.4 19.7 71.1
16 16 A K E S C 57 0B 149 41, 0.2 41, 0.2 12, 0.1 10, 0.1 0.923 97.3 79.2 126.8 156.7 10.8 17.8 74.1
17 17 A I E + 0 0 24 39, 1.0 2, 0.3 2, 0.3 10, 0.2 0.180 55.6 165.7 49.4 127.5 9.8 14.5 75.8
18 18 A Y E A 26 0A 37 8, 3.2 8, 2.4 6, 0.1 2, 0.5 0.943 40.3 105.0 134.8 163.1 11.0 11.3 74.0
19 19 A K E A 25 0A 132 2, 0.3 6, 0.2 6, 0.2 2, 0.0 0.790 37.8 141.7 84.4 133.3 10.3 7.6 74.1
20 20 A D > 0 0 26 4, 2.4 3, 0.9 2, 0.5 1, 0.0 0.142 36.6 86.2 78.8 177.7 8.3 6.5 71.1
21 21 A a T 3 S+ 0 0 32 1, 0.2 1, 0.1 2, 0.1 2, 0.0 0.640 132.0 54.9 66.9 14.9 8.9 3.2 69.3
22 22 A E T 3 S 0 0 44 2, 0.2 1, 0.2 120, 0.1 120, 0.0 0.655 120.3 111.0 86.6 18.3 6.6 1.5 71.8
23 23 A G S < S+ 0 0 37 3, 0.9 2, 0.3 1, 0.3 2, 0.1 0.597 74.1 129.5 98.1 17.7 8.7 2.8 74.7
24 24 A Y 0 0 69 1, 0.0 4, 2.4 9, 0.0 1, 0.3 0.792 68.6 94.1 106.6 149.2 6.2 5.3 76.1
25 25 A Y E +AB 19 34A 36 9, 0.6 8, 3.0 11, 0.4 9, 1.1 0.435 56.8 165.6 66.0 130.9 6.8 9.0 76.9
26 26 A T E +AB 18 32A 2 8, 2.4 8, 3.2 6, 0.3 2, 0.2 0.877 15.6 179.2 149.3 150.5 5.7 11.1 73.9
27 27 A I E > + B 0 31A 0 4, 1.5 4, 2.2 2, 0.3 12, 0.2 0.937 51.9 12.4 150.0 174.7 6.1 14.8 72.5
28 28 A G E 4 S A 14 0A 1 14, 2.0 14, 1.7 2, 0.2 2, 1.0 0.393 125.7 8.7 55.8 129.7 5.1 17.0 69.5
29 29 A I T 4 S 0 0 6 34, 0.3 1, 0.2 16, 0.2 16, 0.2 0.732 129.1 50.3 99.7 79.3 3.6 15.0 66.7
30 30 A G T 4 S+ 0 0 11 2, 1.0 2, 0.6 19, 0.4 2, 0.2 0.682 83.0 167.8 71.4 20.4 3.3 11.6 68.2

Fig. 1. The dssp file showing the primary structure and secondary structure of a protein (shown up to 30 residues only).

Methodology: To predict the secondary structure of a protein, a Pattern Recognition Neural Network was designed. The network has one input layer, one hidden layer and one output layer. The protein sequence is presented through a sliding window of size W (varied from 15 to 29), and the prediction is made for the structural state of the central residue of the window. A protein segment of window size W is represented as a 20 x W array, so the input layer consists of 20 x W input units, i.e., W groups of 20 inputs, one group per window position. All the proteins used to train the neural network are encoded in this way and stored as vectors. Each target is represented as a boolean array of size 8, which indicates the secondary structural state of the amino acid at that position in the protein sequence. The secondary structural states defined according to dssp are H, I, G, E, B, T, S and C. Thus H is represented as 10000000, I as 01000000, and finally C as 00000001. The output layer of the neural network therefore consists of eight units, one for each of the considered structural states (or classes). The targets are assembled into a target matrix. The size of the hidden layer is taken as 2 x W + 1. The pattern recognition network is trained with the Scaled Conjugate Gradient algorithm.
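As an illustration of the extraction and encoding steps described above, the following Python sketch reads the residue records of a .dssp file and builds the 20 x W input vector and the 8-unit target for one window position. It is not the authors' implementation. The function names, the fixed column positions (amino acid at character 14 and structure state at character 17 of each record, as in the standard DSSP layout), the mapping of a blank state to C, and the all-zero padding for window positions beyond the chain ends are assumptions made here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residue types
STATES = "HIGEBTSC"                    # 8 states, in the order given in the text


def read_dssp_records(lines):
    """Return (sequence, states) from the residue records of a .dssp file."""
    sequence, states = [], []
    in_records = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):
            in_records = True          # residue records follow this header
            continue
        if not in_records or len(line) < 17:
            continue
        aa, ss = line[13], line[16]    # assumed fixed-width columns
        if aa == "!":                  # chain-break marker, not a residue
            continue
        if aa.islower():               # lowercase marks a bridged cysteine
            aa = "C"
        sequence.append(aa)
        states.append(ss if ss in STATES else "C")   # blank -> C (assumption)
    return "".join(sequence), "".join(states)


def encode_window(sequence, center, window):
    """One-hot encode the window centred on `center` as 20 * window values."""
    half = window // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        group = [0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:
                group[idx] = 1
        vector.extend(group)           # out-of-range positions stay all-zero
    return vector


def encode_target(state):
    """One-hot encode a state as 8 values, e.g. H -> 1,0,0,0,0,0,0,0."""
    return [1 if s == state else 0 for s in STATES]
```

For W = 15 this gives 20 x 15 = 300 inputs and, by the 2 x W + 1 rule, 31 hidden units; for W = 29 it gives 580 inputs and 59 hidden units.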
At each training cycle, the training sequences are presented to the network through the sliding window defined above, one residue at a time. Each hidden unit transforms the signals received from the input layer using a log-sigmoid transfer function, producing an output signal that lies between 0 and 1 and is close to one of these two values. Weights are adjusted so that the error between the observed output of each unit and the desired output specified by the target matrix is minimized.

Overfitting, a common problem when training neural networks, is addressed by dividing the data into three subsets: (i) the training set, which is used for computing the gradient and updating the network weights and biases; (ii) the validation set, whose error is monitored during the training process because it tends to increase when the data is overfitted; and (iii) the test set (not seen earlier by the neural network), whose error can be used to assess the quality of the division of the data set. Training stops automatically when any one of several conditions is met, such as the limit on epochs, the performance goal, or the limit on validation errors.
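The transfer function, the three-way data division and the validation-based stopping condition described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the 70/15/15 split ratios, the limit of six consecutive validation failures, and all function names are hypothetical choices, since the text does not give these values.

```python
import math
import random


def logsig(x):
    """Log-sigmoid transfer function, mapping any real input into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)                    # avoids overflow for large negative x
    return z / (1.0 + z)


def divide_indices(n, train=0.70, val=0.15, seed=0):
    """Shuffle sample indices and split them into train/validation/test.

    The ratios are assumptions; the remainder after train and validation
    forms the test set.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def stop_epoch(val_errors, max_fail=6):
    """Return the epoch at which training would stop.

    Training stops once the validation error has failed to improve on its
    best value for `max_fail` consecutive epochs (assumed rule); otherwise
    it runs through the last epoch supplied.
    """
    best, fails = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, fails = err, 0
        else:
            fails += 1
            if fails >= max_fail:
                return epoch
    return len(val_errors) - 1
```

Only the training indices would be used for weight updates; the validation errors collected after each epoch feed `stop_epoch`, and the test indices are held back for the final assessment.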

P.V. Nageswara Rao et al. / International Journal of Engineering Science and Technology

4. RESULTS AND DISCUSSION
To analyze the network response, a confusion matrix is computed by comparing the outputs of the trained network with the expected results (targets), as shown in Fig. 2.

Fig. 2. Confusion matrix showing the performance of the classifier.

The diagonal cells show the number of residue positions that were correctly classified for each structural class. The off-diagonal cells show the number of residue positions that were misclassified (e.g., helix predicted as coil). The rightmost cell in the last row shows the total percentage of correctly predicted residues (upper number) and the total percentage of incorrectly predicted residues (lower number). By applying this technique, a performance score of Q8 = 72.3% is achieved. This compares well with state-of-the-art techniques, such as NN-I and GOR IV, which achieved Q3 scores of 64.05% and 63.19% respectively when predictions are made on single sequences alone. The Receiver Operating Characteristic (ROC) curve, a plot of the true positive rate (sensitivity) versus the false positive rate (1 - specificity), is also drawn and shown in Fig. 3.
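The quantities discussed above can be computed from per-residue targets and predictions as in the following Python sketch. It is illustrative only: the function names are hypothetical, rows are assumed to hold the true state and columns the predicted state, and the true and false positive rates are computed one class against the rest.

```python
STATES = "HIGEBTSC"


def confusion_matrix(targets, predictions, states=STATES):
    """Count residues by (true state, predicted state)."""
    index = {s: i for i, s in enumerate(states)}
    matrix = [[0] * len(states) for _ in states]
    for t, p in zip(targets, predictions):
        matrix[index[t]][index[p]] += 1
    return matrix


def q_score(matrix):
    """Percentage of residues on the diagonal (Q8 for eight states)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total if total else 0.0


def rates(matrix, k):
    """True and false positive rates for class k, one class versus the rest."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                   # class k predicted as another
    fp = sum(row[k] for row in matrix) - tp    # another class predicted as k
    tn = total - tp - fn - fp
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Each call to `rates` yields a single (sensitivity, 1 - specificity) point; tracing a full ROC curve would additionally require sweeping a decision threshold over the network's output values for that class.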

Fig. 3. ROC curve showing the performance of the classifier.

5. CONCLUSION
The prediction accuracy can be improved by: (i) increasing the number of training vectors, with an appropriate distribution of all the classes; (ii) increasing the window size or adding more relevant information, such as biochemical properties of the amino acids; and (iii) increasing the number of hidden layers and neurons.

ACKNOWLEDGEMENTS
The authors would like to thank Acharya Nagarjuna University and GITAM University for providing computational facilities and access to e-journals to carry out this research.

REFERENCES
1. Anfinsen, C.B. (1973). Principles That Govern The Folding Of Protein Chains. Science. 181:223-230.
2. Holbrook, S.R., Muskal, S.M. and Kim, S.-H. (1990). Predicting Protein Structural Features With Artificial Neural Networks. Artificial Intelligence and Molecular Biology.
3. Weiner, P.K. and Kollman, P.A. (1981). AMBER: Assisted Model Building With Energy Refinement. A General Program For Modeling Molecules and Their Interactions. Journal of Computational Chemistry. 2:287-303.
4. Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S. and Weiner, P.K. (1984). A New Force Field For Molecular Mechanical Simulation Of Nucleic Acids and Proteins. Journal of the American Chemical Society. 106:765-784.
5. Chou, P.Y. and Fasman, G.D. (1974). Prediction of Protein Conformation. Biochemistry. 13:222-245.
6. Garnier, J., Osguthorpe, D.J. and Robson, B. (1978). Analysis Of The Accuracy and Implications Of Simple Methods For Predicting The Secondary Structure Of Globular Proteins. Journal of Molecular Biology. 120:97-120.
7. Lim, V.I. (1974). Algorithms For the Prediction of Alpha-Helical and Beta-Structural Regions in Globular Proteins. Journal of Molecular Biology. 88:873-894.
8. Blundell, T., Sibanda, B.L. and Pearl, L. (1983). Three-Dimensional Structure, Specificity and Catalytic Mechanism Of Renin. Nature. 304:273-275.

9. Greer, J. (1981). Comparative Model-Building Of The Mammalian Serine Proteases. Journal of Molecular Biology. 153:1027-1042.
10. Warme, P.K., Momany, F.A., Rumball, S.V., Tuttle, R.W. and Scheraga, H.A. (1974). Computation Of Structures Of Homologous Proteins. Alpha-Lactalbumin From Lysozyme. Biochemistry. 13:768-782.
11. Richardson, J.S. (1981). The Anatomy and Taxonomy of Protein Structures. Advances in Protein Chemistry. 34:168-339.
12. Krigbaum, W.R. and Knutton, S.P. (1973). Prediction of The Amount Of Secondary Structure in A Globular Protein From Its Amino Acid Composition. Proceedings of the National Academy of Sciences USA. 70(10):2809-2813.
13. Qian, N. and Sejnowski, T.J. (1988). Predicting The Secondary Structure Of Globular Proteins Using Neural Network Models. Journal of Molecular Biology. 202(4):865-884.
14. Crick, F. (1989). The Recent Excitement About Neural Networks. Nature. 337:129-132.
15. Garnier, J. and Robson, B. (1989). The GOR Method For Predicting Secondary Structures in Proteins. Prediction of Protein Structure and The Principles of Protein Conformation. New York: Plenum Press. 417-465.
16. Garnier, J. and Robson, B. (1989). The GOR Method For Predicting Secondary Structures in Proteins. Prediction of Protein Structure and The Principles of Protein Conformation. New York: Plenum Press. 417-465.