Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Hemashree Bordoloi and Kandarpa Kumar Sarma

Hemashree Bordoloi
Deptt. of Electronics & Communication Technology, Gauhati University, Guwahati
e-mail: hbmaini@gmail.com

Kandarpa Kumar Sarma
Deptt. of Electronics & Communication Technology, Gauhati University, Guwahati
e-mail: kandarpaks@gmail.com

S. Patnaik & Y.-M. Yang (Eds.): Soft Computing Techniques in Vision Sci., SCI 395, pp. 137-146. springerlink.com Springer-Verlag Berlin Heidelberg 2012

Abstract. Protein secondary structure prediction is the method of extracting locally defined protein structures from the sequence of amino acids. It is a challenging and illuminating part of the field of bioinformatics. Several methods attempt to meet these challenges, but the Artificial Neural Network (ANN) technique is turning out to be the most successful. In this work, an ANN based multi-level classifier is designed for predicting the secondary structure of proteins. The ANNs are first trained to recognize the amino acids in a sequence, and secondary structures are then derived from these amino acids. The final structure is derived from the majority of the predicted secondary structures. This work shows the prediction of the secondary structure of proteins employing ANNs, though it is initially restricted to four structures only.

Keywords: Artificial Neural Network, Protein Structure Prediction.

1 Introduction

Protein Secondary Structure Prediction (PSSP) is one of the most challenging and influential areas of research in bioinformatics, a field that deals with the prediction and analysis of macromolecules, i.e. DNA, RNA and proteins. Proteins are the fundamental molecules of all organisms. They are unique chains of amino acids that adopt unique three-dimensional structures, which allow them to carry out intricate biological functions. The specifications of each protein are given by the sequence of amino acids [1]. Proteins are the building blocks of all biological organisms, and the basis of proteins is amino acids. There are 20 different amino acids, which are made up of carbon, hydrogen, nitrogen, oxygen and sulphur. Proteins have four different levels of structure:

Primary: The sequence of amino acids.
Secondary: Locally defined, highly regular substructures. Alpha helices, beta sheets and coils are the three secondary structures of proteins.
Tertiary: The three-dimensional structure, or spatial arrangement, of the secondary structures.
Quaternary: A complex of several protein molecules or polypeptide chains.

Fig. 1 Four protein structures

The general approach to predicting the secondary structure of a protein is to compare the amino acid sequence of that protein with the sequences in known databases. In protein secondary structure prediction, amino acid sequences are the inputs and the resulting output is the conformation, or predicted structure, which is a combination of alpha helices, beta sheets and loops. In this work an ANN is trained with amino acids to predict the protein sequence. After the prediction of the protein sequence, a second ANN classifier is configured to determine the secondary structures.

2 Basic Biological Concept

The notion of homology is the basis of bioinformatics. In genomic bioinformatics, the function of a gene is predicted by the homology technique. It follows the rule that if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In structural bioinformatics, homology is used to determine which parts of a protein are important in structure formation and in interaction with other proteins. In the homology modeling technique, this information is used to predict the structure of a protein once the structure of a homologous protein is known. Presently it remains the only way to predict protein structures reliably. Determining the structures of viral proteins would enable researchers to design drugs for specific viruses [2].

The secondary structure has three regular forms: helical (alpha (α) helices), extended (beta (β) sheets) and loops (also called reverse turns or coils). In protein secondary structure prediction, the inputs are the amino acid sequences while the output is the predicted structure, also called the conformation, which is a combination of alpha helices, beta sheets and loops [2]. A typical protein sequence and its conformation class are shown below:

Protein Sequence: ABABABABCCQQFFFAAAQQAQQA
Conformation Class: HHHH EEEE HHHHHHHH

where H means helical, E means extended, and blanks are the remaining coiled conformations. A typical protein contains about 32% alpha helices, 21% beta sheets and 47% loops or non-regular structure [2]. Given a protein sequence of amino acids $a_1, a_2, \ldots, a_n$, the problem of secondary structure prediction is to predict whether each amino acid $a_i$ is in an α-helix, a β-sheet or neither.
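As a small illustration of this per-residue formulation, the sketch below compares a predicted conformation string with a known one and reports the three-state accuracy, i.e. the percentage of residues whose predicted state matches the known state. The function name `q3_accuracy`, the use of the letter 'C' for coil, and the toy strings are assumptions made for this example and are not taken from the work itself.

```python
from collections import Counter

# Three states per residue: helix, extended (sheet) and coil.
# Using 'C' for coil is an assumption of this sketch.
STATES = ("H", "E", "C")


def q3_accuracy(predicted: str, actual: str) -> float:
    """Return the percentage of residues predicted in their correct state."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("sequences must be non-empty")

    correct = Counter()  # correctly predicted residues, counted per state
    for p, a in zip(predicted, actual):
        if p not in STATES or a not in STATES:
            raise ValueError("unknown state symbol: %r / %r" % (p, a))
        if p == a:
            correct[a] += 1

    # Sum of correct helix, sheet and coil residues over all residues.
    total_correct = sum(correct[s] for s in STATES)
    return 100.0 * total_correct / len(actual)


# Toy example (hypothetical data): 12 residues, two of them mispredicted.
actual = "HHHHCCEEEECC"
predicted = "HHHCCCEEECCC"
print(q3_accuracy(predicted, actual))
```

Counting the correct residues per state keeps the sketch close to the usual definition of the measure, in which the correct helix, sheet and coil residues are summed before dividing by the total.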
If the actual secondary structure of each amino acid is known from studies in structural biology, then the three-state accuracy is the percentage of residues for which the prediction matches reality. It is referred to as three-state because each residue can be in one of three states: α, β or other (0) [3]. The three-state accuracy is given by

$$Q_3 = \frac{P_\alpha + P_\beta + P_{coil}}{T} \times 100\%$$

where $P_\alpha$, $P_\beta$ and $P_{coil}$ are the numbers of residues predicted correctly in the alpha helix, beta strand and coil (loop) states respectively, and $T$ is the total number of residues [2].

3 Necessities of PSSP

PSSP has become a highly influential area of research for the following reasons:

DNA, RNA and proteins, the macromolecules that form the basis of an organism, can be predicted and analyzed by PSSP.

The structure-function relationship can also be provided by PSSP. A particular protein structure is responsible for a particular function, which can be made known by PSSP. So by changing the structure of proteins or by synthesizing new proteins, functions could be added or removed, or desired functions could be obtained [5].

PSSP can determine the structure of viral proteins, which leads to the design of drugs for specific viruses.

PSSP reduces the sequence-structure gap [4]. One of the best examples of the sequence-structure gap comes from large-scale sequencing projects such as the Human Genome Project. In such projects, protein sequences are produced at a very fast rate, which results in a large gap between the number of known protein sequences (>150,000) and the number of known protein structures (>4,000). This gap is called the sequence-structure gap, and PSSP can reduce it.

Experimental techniques are not capable of determining the structure of some proteins, such as membrane proteins. So the prediction of protein structure using computational tools is of great interest [6].

4 Basics of Artificial Neural Network

An Artificial Neural Network (ANN) is a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It is made up of simple processing units called artificial neurons. It resembles the brain in two respects:

Knowledge is acquired by the network from its environment through a learning process.
Interneuron connection strengths, known as synaptic weights, are used to store the acquired knowledge [4].

Fig. 2 An Artificial Neural Network

5 Methodology

This work consists of several steps as described below:

1. Data Set: In our work we have considered four proteins: Hemoglobin, Sickle Cell Anemia, Myoglobin and Insulin. We have considered these four proteins because, though their functions are different, they are closely related to each other.

2. Coding of Proteins: A coding scheme is generated based on the chemical structure of the amino acids. BCD codes are used for coding. There is a unique BCD code for each component or symbol in the chemical structure. Each of the 20 amino acids is coded using the generated coding scheme. The four proteins, i.e. Hemoglobin, Sickle Cell Anemia, Myoglobin and Insulin, are then coded with the help of the coded amino acids.

3. Configuration of ANN: A fully connected Multi Layer Perceptron (MLP) feedforward neural network is used for our proposed work. The backpropagation algorithm is used to update the weights of the network. The network comprises only one hidden layer. Three ANN classifiers are designed for our work. The first network classifies the amino acids, the second network classifies the protein primary structure and the third network classifies the secondary structure of proteins. The parameters are listed in Table 1.

4. Training and Testing of ANN: The network is trained with the four coded protein structures. Testing is then done with the coded data to obtain the results.

Table 1 Parameters of Proposed ANN

ANN    Data Set Size                  Training Type    Maximum No. of Epochs    Variance in Training Data
MLP    Training: 1000; Testing: 985   TRAINLM          2500                     50%

6 System Model

The system model, as shown in Figure 3, comprises three classifiers arranged in two levels. The first level classifier uses amino acids as inputs and provides the identification of the amino acids. These are applied to the second level classifiers, which contain two MLPs. These classifiers receive the identified amino acids for predicting alpha, beta and coils.
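The coding step of the methodology can be illustrated with a minimal sketch. The code table used in this work is not listed in the text, so the symbol-to-digit assignment `SYMBOL_DIGIT`, the choice of coding each amino acid from its molecular formula, and all function names below are hypothetical. Only the idea of giving each symbol a unique 4-bit BCD code and building protein codes from coded amino acids comes from the description above.

```python
from typing import List

# Hypothetical assignment of one decimal digit per chemical symbol
# (an assumption of this sketch, not the coding table of the paper).
SYMBOL_DIGIT = {"C": 1, "H": 2, "N": 3, "O": 4, "S": 5}

# Molecular formulas of three amino acids as (symbol, atom count) pairs.
AMINO_ACID_FORMULA = {
    "G": [("C", 2), ("H", 5), ("N", 1), ("O", 2)],            # glycine, C2H5NO2
    "A": [("C", 3), ("H", 7), ("N", 1), ("O", 2)],            # alanine, C3H7NO2
    "C": [("C", 3), ("H", 7), ("N", 1), ("O", 2), ("S", 1)],  # cysteine, C3H7NO2S
}


def bcd_bits(number: int) -> List[int]:
    """Binary-coded decimal: four bits (most significant first) per decimal digit."""
    if number < 0:
        raise ValueError("BCD is defined here for non-negative integers only")
    bits: List[int] = []
    for digit_char in str(number):
        digit = int(digit_char)
        bits.extend((digit >> shift) & 1 for shift in (3, 2, 1, 0))
    return bits


def encode_amino_acid(letter: str) -> List[int]:
    """Concatenate the BCD code of each symbol with the BCD code of its count."""
    bits: List[int] = []
    for symbol, count in AMINO_ACID_FORMULA[letter]:
        bits += bcd_bits(SYMBOL_DIGIT[symbol]) + bcd_bits(count)
    return bits


def encode_sequence(sequence: str) -> List[int]:
    """Code a protein sequence by concatenating its coded amino acids."""
    return [bit for letter in sequence for bit in encode_amino_acid(letter)]


print(encode_amino_acid("G"))
print(len(encode_sequence("GAC")))
```

In this sketch the amino acids produce codes of different lengths, so a fixed-length ANN input would additionally need padding or truncation. How that is handled in the actual coding scheme is not stated.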

Fig. 3 System Model for proposed work

Epoch    Time       MSE          % Rate
1233     31 secs    $10^{-3}$    94%
1566     53 secs    $10^{-4}$    95%
1750     65 secs    $10^{-5}$    98%
2500     80 secs    $10^{-6}$    100%

7 Results

With the four protein structures, the ANN shows 100% accuracy during training when the learning phase is validated with the same set. The ANN is configured to handle an input of length 985 holding the coded values of the protein structure. Four such samples representing the classes were initially taken during training. The ANN successfully recognizes the required parameters. The training is carried out with the (error) backpropagation algorithm with Levenberg-Marquardt optimization. The ANN is given a performance goal of around $10^{-6}$, which is attained after a number of training epochs in around 80 seconds. With variations of up to ±20% in the training sequence, the ANN handles the recognition without any variation in performance. This shows its robustness to variations once it is trained properly. The results also validate the coding scheme used for the work.

Some problems, however, were observed due to the larger data sets which the ANN classifier was given to handle. These generate certain computational constraints, which shall be removed in subsequent stages of the work. The work shall also be extended to the prediction of unknown protein structures with subclasses. Further, we are working towards a GUI for our system.
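The training procedure reported in this section, a single-hidden-layer MLP trained by backpropagation until a performance goal on the MSE is met or an epoch limit is reached, can be sketched as follows. This is only an illustration under stated assumptions: plain online gradient descent is swapped in for the Levenberg-Marquardt optimization used in the work, the four 8-bit input patterns are toy data standing in for the coded proteins, and the names `MLP` and `train` as well as the learning rate and hidden layer size are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    x = max(-60.0, min(60.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


class MLP:
    """Fully connected feedforward network with a single hidden layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One row of weights per neuron; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w_hidden]
        output = [sigmoid(sum(w * v for w, v in zip(row, hidden + [1.0])))
                  for row in self.w_out]
        return hidden, output

    def train_epoch(self, samples, lr):
        """One pass of online backpropagation; returns the MSE seen during the pass."""
        squared_error = 0.0
        for x, target in samples:
            hidden, output = self.forward(x)
            # Output deltas: error times the derivative of the sigmoid.
            d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
            # Hidden deltas: output deltas propagated back through w_out.
            d_hid = [h * (1.0 - h) * sum(d_out[k] * self.w_out[k][j]
                                         for k in range(len(d_out)))
                     for j, h in enumerate(hidden)]
            for k, row in enumerate(self.w_out):
                for j, v in enumerate(hidden + [1.0]):
                    row[j] += lr * d_out[k] * v
            for j, row in enumerate(self.w_hidden):
                for i, v in enumerate(list(x) + [1.0]):
                    row[i] += lr * d_hid[j] * v
            squared_error += sum((t - o) ** 2 for t, o in zip(target, output))
        return squared_error / (len(samples) * len(samples[0][1]))


def train(net, samples, goal=1e-3, max_epochs=2500, lr=0.5):
    """Train until the MSE goal is met or max_epochs is reached; return the MSE history."""
    history = []
    for _ in range(max_epochs):
        mse = net.train_epoch(samples, lr)
        history.append(mse)
        if mse <= goal:
            break
    return history


# Toy data (hypothetical): four coded patterns, one class each, one-hot targets.
samples = [
    ([0, 0, 0, 1, 0, 0, 1, 0], [1, 0, 0, 0]),
    ([0, 0, 1, 0, 0, 1, 0, 1], [0, 1, 0, 0]),
    ([0, 0, 1, 1, 0, 0, 0, 1], [0, 0, 1, 0]),
    ([0, 1, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1]),
]
net = MLP(n_in=8, n_hidden=6, n_out=4)
history = train(net, samples)
print(len(history), history[-1])
```

The length of the returned history gives the number of epochs actually used, which corresponds to the epoch count reported against each MSE goal above. A tighter goal can be tried by passing a smaller `goal`, at the cost of more epochs.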

Fig. 4 Performance graph for goal 0.01

Fig. 5 Performance graph for goal 0.001

Fig. 6 Performance of Hemoglobin in percent

Fig. 7 Performance of Insulin in percent

Fig. 8 Performance of Myoglobin in percent

Fig. 9 Performance of Sickle Cell Anemia in percent

8 Conclusions

This work shows how a multi-level ANN classifier can be configured for PSSP. The work may be extended to include more protein structures, which shall make it a reliable setup for research in bioinformatics.

References

[1] Xiu-fen, Z., Zi-shu, P., Lishan, K., Chu-yu, Z.: The Evolutionary Computation Techniques for Protein Structure Prediction: A Survey. Wuhan University Journal of Natural Sciences 8(1B), Article ID: 1007-1202(2003)01Pr0297-06 (2003)
[2] Akkaladevi, S., Katangur, A.K., Belkasim, S., Pan, Y.: Protein Secondary Structure Prediction using Neural Network and Simulated Annealing Algorithm. In: Proceedings of the 26th Annual International Conference of the IEEE EMBS, San Francisco, CA, USA, September 1-5 (2004)
[3] Ghoting, A.: Protein Secondary Structure Prediction using Neural Networks
[4] Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Pearson Education, New Delhi (2003)
[5] Kim, H., Park, H.: Protein secondary structure prediction based on an improved support vector machine approach. Protein Eng. 16(8), 553-560 (2003)
[6] Afzal, A.: Applications of neural networks in protein structure prediction