Nonlinear Dimensional Reduction in Protein Secondary Structure Prediction


Gisele M. Simas, Silvia S. C. Botelho, Rafael Colares, Ulisses B. Corrêa, Neusa Grando, Guilherme P. Fickel, and Lucas Novelo
Centro de Ciências Computacionais (C³), Universidade Federal do Rio Grande (FURG), Brazil

1 Introduction

This chapter deals with a common problem in bioinformatics: protein secondary structure prediction. It studies the use of Artificial Neural Networks (ANNs), usually called Neural Networks (NNs), as a pre-processing stage in the prediction and classification of protein secondary structure from Blast profiles [Altschul et al., 1997] obtained from the amino acid sequence. This classification serves to reduce the search space of other methods that predict protein tertiary structure, which is, in turn, closely related to protein functionality [Lehninger, 1984]. Given the large dimension of the data to be treated, a pre-processing stage can increase predictor performance. Melo et al. [2003a; 2003b; 2004] used the traditional techniques of Principal Components Analysis (PCA) and Independent Components Analysis (ICA) for dimensional reduction, showing that such methods can contribute to the improvement of the predictors. We suggest the use of a nonlinear pre-processing stage and present a comparison between the results obtained and those of the linear dimensional reduction methods used by Melo et al. [2003a; 2003b; 2004]. In previous studies, we originally proposed the use of NNs through the Cascaded Nonlinear Components Analysis (C-NLPCA) method [Botelho et al., 2005; Simas et al., 2007] for the secondary structure prediction problem. This proposal is justified by the possibility of exploiting the nonlinear potential of NNs, together with the reduction, to acquire information useful for the classification phase. Moreover, the use of C-NLPCA prevents an unwanted simplification of the variability of the data, which could be caused by a linear method such as PCA or ICA. To validate the proposal, the reduced data are used as input for three Neural Networks with different topologies, trained by the Resilient Propagation (RPROP) method [Riedmiller & Braun, 1993]; each NN independently classifies the secondary structure of each amino acid, and these classifications are then combined to attain better results.

This chapter is organized into six sections. Section 2 presents an overview of the problem of predicting protein secondary structure. Section 3 briefly describes related work. Section 4 discusses the applied methodology and our pre-processing method for data reduction; the classifiers are likewise explained. Section 5 details our implementation, tests, and results. Finally, Section 6 presents the conclusion.

2 The Prediction of Secondary Structure

Proteins, long chains composed of numerous amino acids, are responsible for the structural and architectural functions of the cell. There are 20 different amino acids in nature. Two amino acids join in the protein molecule by a bond called a peptide bond, and a chain of more than about 70 amino acids is called a protein [Lehninger, 1984]. Proteins can have four types of hierarchical structure, depending on the spatial configuration and size of the chain and the types of amino acids that compose it:
Primary - the sequence of amino acids and its peptide bonds. All the spatial arrangements of the molecule originate from this structure.
Secondary - an amino acid chain where hydrogen bonds are established between distant amino acids, that is, folding, whose type is associated with the functions carried out by the protein.
Tertiary - at diverse points of the secondary structure, some radicals are ionized, and points of opposite electrical potential attract one another, causing the chain to fold onto itself and assume a spatial configuration. This structure confers biological activity on the protein.
Quaternary - proteins that contain more than one peptide chain (polypeptide chains) have a quaternary structure. It refers to the spatial arrangement of the polypeptide chains of a protein and the nature of their contacts.
Knowledge of the secondary structure of a protein serves to reduce the search space of its tertiary structure, thereby facilitating the determination of its functionality. Detailed knowledge of proteins in turn allows for the correction of cellular dysfunctions and the discovery of alternative treatments for diverse diseases. However, the rules governing a protein's three-dimensional conformation are still not scientifically clear. This absence of an apparent pattern makes analytical modeling of the problem difficult and makes Neural Networks a good method for this purpose. Thus, the problem of predicting protein secondary structure consists of classifying, from the primary structure (sequence of amino acids), each constituent amino acid into one of the recurrent substructures of the three-dimensional conformation of the protein, which can be grouped into α-helixes, β-sheets, and coils.
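To fix the notation used throughout the chapter, the sketch below shows the input/output contract of the prediction task: one secondary-structure label per residue. It is a minimal Python illustration with a made-up sequence and labels, not data from the CB396 bank.

    # Hypothetical example of the task: one label per residue.
    # H = alpha-helix, E = beta-sheet, C = coil; values are invented.
    primary   = "MKTAYIAKQR"   # primary structure (one letter per amino acid)
    predicted = "CHHHHHEECC"   # secondary structure classification

    assert len(primary) == len(predicted)
    for residue, label in zip(primary, predicted):
        print(residue, "->", label)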

3 Related Works

The algorithms developed so far to simulate the folding of sequences cannot accurately reproduce the laws that control this process. Neural Networks (NNs) therefore appear to be a promising method and are commonly used, demonstrating good results [Cuff & Barton, 1999; Guimaraes et al., 2003; Melo et al., 2003a; Qian & Sejnowski, 1988; Rost & Sander, 1994]. Qian and Sejnowski [1988] presented a study on determining the best NN configuration for the prediction of secondary structure. Their solution suggests that protein sequences be covered by overlapped windows, with the prediction referring to the central element of each window. This work used Multi-Layer Perceptrons (MLPs) trained with the traditional backpropagation algorithm to treat a database subset of 106 dissimilar (non-homologous) proteins. Following the success achieved by Qian and Sejnowski, other solutions were pursued, applying different training algorithms, NN topologies, and input data treatments [Baldi & Brunak, 2001; Rost, 2001]. Significant progress was achieved with the use of additional biological information as input for the NNs, more specifically sequence profiles of proteins. In contrast to sequences, which supply only local information, profiles capture distant relations between diverse sequences of a data bank. An example is the PHD predictor [Rost & Sander, 1993], which uses frequency profiles as input. Among more recent publications, a significant improvement was obtained by changing the type of input data to introduce more divergent information, particularly PSI-BLAST profiles [Pollastri et al., 2002]. This type of input was initially proposed by Jones [1999] in the PSIPRED predictor. PSI-BLAST [Altschul et al., 1997] can find distant homologues of protein sequences stored in a database. Another existing predictor is CONSENSUS, developed by Cuff and Barton [1999]. This predictor compares and combines four others: DSC [King & Sternberg, 1996], PHD [Rost & Sander, 1994], NNSSP [Salamov & Solovyev, 1995], and PREDATOR [Cuff & Barton, 1999], without performing data reduction. Meanwhile, Melo et al. [2003a; 2003b], taking into account the high dimension of the data to be treated, used a technique for statistical reduction of the input data, applying PCA to the PSI-BLAST profiles. The results were compared with those of Cuff and Barton [1999], and the new predictor constructed by Melo et al. [2003a; 2003b] presented better results. Among the predictors in previous work related to this study that use PSI-BLAST profiles as input, GMC [Guimaraes et al., 2003] presented the best performance; however, it does not use dimension reduction.

4 Methodology

The purpose of this study is to evaluate the improvements that a nonlinear method for dimension reduction of the data can bring to the prediction of secondary structure. Therefore, a procedure similar to that of Melo et al. [2003a; 2003b; 2004], which used a linear method, was carried out. Figure 1 shows our proposed method.

Figure 1: Proposed solution.

In summary, a protein sequence is presented to PSI-BLAST [Altschul et al., 1997], and the matrix of scores it produces is passed to a pre-processing stage. In this stage, the input data are reduced by the nonlinear C-NLPCA method and then passed to three classifiers, which receive the same input values. The results obtained by the classifiers are presented to the rule-combination stage, which gives the final classification. Thus, each amino acid of each protein is associated with a secondary substructure classification: α-helix, β-sheet, or coil.

4.1 Using Blast Profiles in the Prediction

Computationally, the primary structure of a protein can be represented by a character string of variable size, where each character represents one of the 20 existing amino acids. The order in which these amino acids appear in the sequence reflects the polypeptide bonds formed and is associated with the resulting secondary structure. The 396 protein sequences present in the CB396 data set [University of Dundee, 2005] were submitted to the PSI-BLAST search. The Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), developed by Altschul et al. [1997], searches for similar proteins in the chosen database, that is, proteins formed by similar sequences of amino acids, taking into account the order of their positions in the sequence. As output, it returns PSI-BLAST profiles composed of values representing the occurrence of each amino acid at each sequence position. Such profiles enable a significant improvement in the predictors because they generalize over a set of similar proteins, detect distant homologues of the searched sequence, and provide richer data. The searched protein thus receives scores that are stored in the PSSM (Position-Specific Score Matrix). This matrix has dimension 20×n (the 20 existing amino acids × the n amino acids that compose the protein). The PSSM matrices of all proteins are then passed to the reduction stage, described in the next section.

4.2 Dimension Reduction

On the assumption that NNs are a good approach for classifying protein secondary structures [Baldi & Brunak, 2001; Jones, 1999; Guimaraes et al., 2003; Pollastri et al., 2002; Rost, 2001], the selection of relevant information from the input sequences of the classifier NNs and the encoding process are crucial steps for effective prediction [Wang et al., 2001]. NNs with few inputs have few weights to adjust, generalize better, train faster, and avoid saturation. The performance of the predictor NN can therefore be increased by reducing the number of its inputs. A reduction stage can improve performance and eliminate redundant information present in the data. We propose the use of Cascaded Nonlinear Components Analysis (C-NLPCA), an approach based on NNs, for the dimension reduction of a large data set. In our approach, C-NLPCA receives the PSSM score matrix and eliminates the redundant information present in it. The PSSM score matrix of each protein is covered by overlapped windows of size w (Figure 2), displaced one position at a time until the original size of the protein is covered. The dimension of each window is p = 20×w (the frequencies of the 20 existing amino acids × the w amino acids belonging to the window). The reduction stage reduces the dimension of the window from p to q.
Each set of q components is then forwarded to the classification stage, whose task is to predict the secondary structure corresponding to the central amino acid of the window.
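To make the windowing concrete, the following minimal Python sketch (illustrative only; the authors' tool was written in C++, and how the chapter handles the protein ends is not stated, so zero-padding is assumed here) extracts overlapped 20×w windows from a PSSM and flattens each into the p-dimensional vector that labels its central residue:

    import numpy as np

    def pssm_windows(pssm, w=13):
        """Slide an overlapped window of w columns over a 20 x n PSSM.

        Yields (center_index, flattened_window) pairs; the matrix is
        zero-padded at the ends so every residue receives a window.
        """
        n = pssm.shape[1]
        half = w // 2
        padded = np.pad(pssm, ((0, 0), (half, half)))
        for center in range(n):
            window = padded[:, center:center + w]   # 20 x w block
            yield center, window.reshape(-1)        # p = 20*w values

    pssm = np.random.randn(20, 30)       # toy "PSSM" for a 30-residue protein
    for center, x in pssm_windows(pssm):
        assert x.shape == (260,)         # p = 20 x 13 = 260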

In the remainder of this section, we first present two linear reduction methods and then focus on the method proposed in this chapter, the C-NLPCA.

Figure 2: One window of the PSSM score matrix.

4.2.1 Principal Components Analysis (PCA)

PCA is a technique for the statistical analysis of a data set. It is concerned with extracting the factors that best represent the structure of interdependence between variables of large dimension. All the variables are analyzed simultaneously, each in relation to all the others, to determine the factors (principal components) that maximize the explanation of the variability existing in the data. However, PCA is indicated for the analysis of variables that have linear relations [Bishop, 2006]. In PCA, the data are approximated by a straight line, which minimizes the mean square error (MSE). In this study, we used the Expectation Maximization (EM) algorithm [Roweis, 1998] to calculate the PCA.

4.2.2 Independent Components Analysis (ICA)

ICA is a linear data transformation method used to find a representation that minimizes the statistical dependence between the represented components. To obtain the components of a vector x, ICA tries to find a linear transformation s = Wx in which the components s_i are as independent as possible [Melo et al., 2003b]. This is accomplished through the maximization of a function F(s_1, ..., s_m) capable of measuring the independence of the components. Both PCA and ICA are projection techniques onto their own spaces. The principal difference lies in the fact that PCA generates uncorrelated components, while ICA generates independent components [Bishop, 2006].
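The EM algorithm for PCA cited above admits a compact implementation. The sketch below, a Python/NumPy rendering of the two alternating steps in Roweis [1998] applied to toy random data, recovers a basis for the k leading principal directions (a full implementation would also orthonormalize the result):

    import numpy as np

    def em_pca(Y, k, iters=100):
        """EM for PCA (Roweis, 1998) on a d x N centered data matrix Y.

        Returns a d x k matrix whose columns span the k leading
        principal directions.
        """
        d, N = Y.shape
        C = np.random.randn(d, k)                    # random initial loadings
        for _ in range(iters):
            X = np.linalg.solve(C.T @ C, C.T @ Y)    # E-step: latent coordinates
            C = Y @ X.T @ np.linalg.inv(X @ X.T)     # M-step: new loadings
        return C

    Y = np.random.randn(260, 1000)       # toy data with the chapter's window size
    Y -= Y.mean(axis=1, keepdims=True)   # PCA assumes centered data
    C = em_pca(Y, k=80)                  # 80 components, as used in this study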

4.2.3 Cascaded Nonlinear Components Analysis (C-NLPCA)

In secondary structure prediction, the nonlinear behavior of the input data can be seen in the good performance of NNs in prediction applications [Baldi & Brunak, 2001; Jones, 1999; Guimaraes et al., 2003; Pollastri et al., 2002; Rost, 2001]. We intend to extend this nonlinear treatment to the pre-processing stage. In this chapter, the use of C-NLPCA is suggested (see details in [Botelho et al., 2005; Simas et al., 2007]) to provide a nonlinear mapping of the data, because linear methods such as PCA and ICA may introduce undesirable simplifications in the analysis of variables with nonlinear relations. The C-NLPCA method is based on cascading layers of simple NLPCA NNs [Kramer, 1991], aimed at the nonlinear treatment of high-dimensional data (see Figure 3).

Figure 3: C-NLPCA system.

NLPCA Analysis. The NNs for the analysis of nonlinear principal components, called NLPCAs, are MLPs composed of a reduction and an expansion stage. NNs with five layers are used: input (p neurons), hidden codification (m neurons), bottleneck (r neurons), hidden decoding (m neurons), and output (p neurons). The neurons of the codification and decoding layers use nonlinear activation functions, while those of the input, bottleneck, and output layers use linear ones. These characteristics let the NN model a family of nonlinear functions. The inputs, lying in a p-dimensional space, are mapped into an r-dimensional space when forwarded to the bottleneck layer. The activation values of the bottleneck-layer neurons supply the nonlinear principal components. Each NLPCA NN is trained to obtain a mapping between input and output that minimizes the following function:

\min \sum_{i=1}^{n} \| X_i - \hat{X}_i \|^2, \qquad (1)

where \hat{X}_i is the output of the NN for each input X_i. Since the output produced by the expansion contains the residual error of the samples, the residue X_i - \hat{X}_i can be used to obtain the second principal component, and so on, successively.
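A single NLPCA of this kind is essentially a five-layer autoencoder. The following sketch is an assumption-laden Python/PyTorch stand-in for the authors' C++ implementation, run on toy data: it wires up the p-m-r-m-p topology with the sizes and RPROP settings reported in Section 5 and minimizes Equation (1). The residue computed on the last line is what gets reinjected to extract the next component.

    import torch
    import torch.nn as nn

    # Five-layer NLPCA (Kramer, 1991): p -> m -> r -> m -> p, with sigmoidal
    # codification/decoding layers and linear input/bottleneck/output, using
    # the sizes reported in Section 5: p' = 10, m = 2, r = 1. Data are toy.
    p, m, r = 10, 2, 1
    encoder = nn.Sequential(nn.Linear(p, m), nn.Sigmoid(), nn.Linear(m, r))
    decoder = nn.Sequential(nn.Linear(r, m), nn.Sigmoid(), nn.Linear(m, p))

    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Rprop(params, etas=(0.5, 1.2), step_sizes=(1e-6, 50.0))

    X = torch.randn(500, p)              # samples standing in for window blocks
    for epoch in range(600):             # the chapter trains up to 600 epochs
        opt.zero_grad()
        u = encoder(X)                   # bottleneck: nonlinear principal component
        loss = ((X - decoder(u)) ** 2).sum()   # Equation (1)
        loss.backward()
        opt.step()

    residue = X - decoder(encoder(X))    # reinjected to extract the next component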

From NLPCA Sets to the C-NLPCA System. Due to the intrinsic saturation problem of NNs, the applicability of a single NLPCA is restricted to cases where p ≪ n [Botelho et al., 2005]. That is, a more formal relation can be established between the number of parameters (weights and biases) of the NLPCA and the number of samples n presented to the NN; this relation requires that 2m + r + p ≪ n. To avoid this limitation, we have proposed an architecture in which NLPCAs are grouped into layers. In the reduction stage, the data of initial dimension p are distributed over a series of small NLPCA NNs with p' input neurons each. Each NLPCA reduces its respective input from dimension p' to dimension 1. The reduced data (local principal components) are grouped again and reduced successively in subsequent layers (reduction layers). Finally, the bottleneck neuron of the last NLPCA NN supplies the first global principal component (C-NLPC).

Expansion Stage. After the reduction layers, a set of MLP NNs composes the expansion stage. The output value obtained at the bottleneck of the last NLPCA is used as input for MLP NNs with 1 input neuron and p' output neurons. These NNs, designated expansion NNs, are arranged successively, layer by layer, until output sets are reproduced whose final dimension equals that of the input presented to the system, p. During the expansion there is no training of NNs; only the value of the principal component is propagated. Since the local principal components are combined successively in the reduction stage, the C-NLPCA considers all neighborhood relations between the variables. The relations between all windows that compose the PSSM matrix of a given protein are also analyzed, since they are used as samples in the training.
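The cascade itself can be sketched independently of how each small NLPCA is trained. In the Python sketch below, a placeholder linear projection stands in for a trained p'→1 NLPCA bottleneck (the real method uses the five-layer network sketched above), and the zero-padding used to complete the last group at each layer is our assumption, not a detail given in the chapter:

    import numpy as np

    def nlpca1(block):
        """Placeholder for a trained p' -> 1 NLPCA bottleneck (see sketch above)."""
        w = np.ones(block.shape[-1]) / block.shape[-1]
        return block @ w                 # linear stand-in for the bottleneck value

    def cnlpca_first_component(x, p_prime=10):
        """Cascade p'-input NLPCAs layer by layer until one value remains."""
        level = np.asarray(x, dtype=float)
        while level.size > 1:
            pad = (-level.size) % p_prime          # zero-pad the last group
            level = np.pad(level, (0, pad)).reshape(-1, p_prime)
            level = np.array([nlpca1(b) for b in level])   # local components
        return level[0]                            # first global C-NLPC

    x = np.random.randn(260)             # one flattened 20 x 13 window
    u1 = cnlpca_first_component(x)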

Model Selection. Different solution models can be used to implement an NLPCA. These models differ in the following parameters: the topology of the NNs (number of neurons per layer), the criterion for selecting the best weights, the criterion for stopping the training, and the weight regularization values, among others. The complexity of the model can be increased by increasing the number of neurons in the hidden codification and decoding layers. Weight regularization can, in turn, restrict the nonlinearity of the model [Hsieh, 2007]. A nonlinear model with enough flexibility can find zigzag solutions with a smaller MSE; however, such solutions may not generalize to new samples, causing overfitting. Overfitting can be detected when two neighboring points of the input data are projected to distant points on the approximation curve obtained by the dimension reduction method. Thus, selecting the solution with the smallest MSE is not a sufficient criterion for the best response. Regularization (the addition of a weight penalty) has been used to control overfitting [Bishop, 1995]; a larger weight penalty tends to produce solutions with less nonlinearity. Considering these facts, for each NLPCA we analyzed N sets of randomly initialized weights. Each set was trained individually, and the most suitable one was then selected in one of two ways:

The set that minimizes the MSE:

\sum_{i=1}^{n} \| X_i - \hat{X}_i \|^2, \qquad (2)

The set that minimizes the following objective function:

J = \langle \| x - \hat{x} \|^2 \rangle + \langle u \rangle^2 + \left( \langle u^2 \rangle - 1 \right)^2 + P \sum_j \left( w_j^{(2)} \right)^2. \qquad (3)

In Equation 3, the first term is the MSE; the second and third terms restrain u (the principal component value) towards ⟨u⟩ = 0 and ⟨u²⟩ = 1; and the final term is a weight penalty (regularization) term, with P the weight penalty parameter and w^{(2)} the weights of the hidden codification layer (see [Hsieh, 2007] for details). According to Hsieh [2001], penalizing just this layer of weights is sufficient to limit the nonlinear modeling capacity of the model. After the dimension reduction, the secondary structure prediction was performed separately on the data obtained by each of the two selection criteria. The prediction (classification) method is described in the next section, and the results are compared in Section 5.

4.3 Classification: Using the NN Predictor

To validate the use of C-NLPCA in the pre-processing stage of predictors, we also implemented a classification stage [Guimaraes et al., 2003]. This stage receives the principal components as input and outputs a secondary structure classification for each amino acid of each protein. The proposed classifier combines three MLP NNs with distinct topologies, aiming at the choice of a better local minimum; the NNs differ in the number of neurons in the hidden layer. The output values are combined according to the rules established in the rule-combination stage. The NNs are trained by epoch with Resilient Propagation (RPROP) [Riedmiller & Braun, 1993].

4.4 Rules of Combination and Evaluation

The results obtained by the classifiers are then presented to the rule-combination stage. The combination of the three NNs of different topologies aims to improve the prediction of the secondary structures [Cuff & Barton, 1999; Pollastri et al., 2002]. The rules used for combining the results are Voting, Average, Product, Maximum, and Minimum [Melo et al., 2003a]. In the Voting rule, the result shown most frequently by the three networks is chosen as the final reply. The other rules follow the equations below:

\mathrm{Average} = \max_j \frac{1}{3} \sum_{i=1}^{3} NN_{ij}, \quad \mathrm{Product} = \max_j \prod_{i=1}^{3} NN_{ij}, \quad \mathrm{Maximum} = \max_j \min_i NN_{ij}, \quad \mathrm{Minimum} = \min_j \max_i NN_{ij}, \qquad (4)

where index i indicates the network considered and index j one of the three output neurons. Each output neuron represents one of the three possible classifications: α-helix, β-sheet, or coil. After the combination of the NNs, the results are evaluated using a reduced variation of the jack-knife process applied to subgroups of proteins (see [Melo et al., 2003a] for more details). The accuracy on each subgroup is measured by the Q3 value, obtained through the following equation:

Q3 = \frac{\text{correctly predicted amino acids}}{\text{total amino acids}} \times 100. \qquad (5)
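The combination rules and the Q3 measure reduce to a few array operations. The Python sketch below applies Equations (4) and (5) to toy activations, with outputs[i, j] holding network i's activation for class j:

    import numpy as np

    # outputs[i, j]: activation of network i for class j
    # (0 = alpha-helix, 1 = beta-sheet, 2 = coil); toy values only.
    outputs = np.array([[0.7, 0.2, 0.1],
                        [0.5, 0.3, 0.2],
                        [0.4, 0.4, 0.2]])

    votes = np.argmax(outputs, axis=1)              # Voting rule
    voting_cls = np.bincount(votes, minlength=3).argmax()
    average_cls = np.argmax(outputs.mean(axis=0))   # Average rule
    product_cls = np.argmax(outputs.prod(axis=0))   # Product rule
    maximum_cls = np.argmax(outputs.min(axis=0))    # Maximum rule, Eq. (4)
    minimum_cls = np.argmin(outputs.max(axis=0))    # Minimum rule, Eq. (4)

    # Q3 accuracy, Equation (5), over toy true/predicted labels.
    true = np.array([0, 1, 2, 0])
    pred = np.array([0, 1, 1, 0])
    q3 = 100.0 * (true == pred).mean()              # 75.0 here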

5 Implementation, Results, and Discussion

A tool was developed in C++ to implement the C-NLPCA pre-processing stage and the classifier NNs. This tool can be used to reduce a range of data sets from different domains; in this study, we are interested in reducing and analyzing protein secondary structure data. Figure 4 schematically summarizes the steps of the proposed solution. From the CB396 bank, we obtain the amino acid sequences of the proteins and the expected classifications of their secondary structures. The amino acid sequence of each protein is submitted to the PSI-BLAST software, which creates a 20×n PSSM score matrix (the 20 existing amino acids × the n amino acids that compose the protein). The PSSM of each protein is covered by windows of size 20×w; each window represents the amino acid located at its center. We used w = 13, resulting in 260 values per window. Through the C-NLPCA method, each window of 260 values is reduced to 80 principal components. The 80 principal components are used to train three predictor NNs, and the results of these NNs are combined. As a response, a secondary structure type is associated with each amino acid of each protein.

Figure 4: Summary of the solution steps.

Figure 5 illustrates the process for one protein. Below, we detail the solution steps and report the parameters used in this study.

Step 1: First, the amino acid sequences of the 396 proteins in the CB396 bank, as well as the expected classifications of their secondary structures, were obtained.

Step 2: Next, the 396 protein sequences were submitted to the PSI-BLAST search [Altschul et al., 1997], using the standard search parameters and three iterations. Each searched protein received scores that were stored in its PSSM matrix of dimension 20×n (the 20 existing amino acids × the n amino acids that compose the protein).
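For illustration, a reader for such a matrix might look as follows. This is a hypothetical parser assuming rows of the form '<position> <residue> <20 scores> ...'; the exact layout of a PSI-BLAST ASCII PSSM includes further header lines and columns, so a real implementation must be adapted to the actual file:

    import numpy as np

    def read_pssm(path):
        """Hypothetical reader: rows '<pos> <aa> <20 scores> ...' -> 20 x n array."""
        rows = []
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 22 and parts[0].isdigit() and parts[1].isalpha():
                    rows.append([float(v) for v in parts[2:22]])
        return np.array(rows).T          # shape (20, n): amino acids x positions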

Figure 5: Illustration of the solution steps.

Step 3: The PSSM of each protein is covered by overlapping windows of size p = 20×w (the 20 existing amino acids × the window parameter w). In this study, windows of w = 13 were analyzed, resulting in data of dimension 260 (20×13) to be processed and reduced. Each new window gains a new column of the PSSM matrix on the right and loses a column on the left (see the braces in Figure 5). A window of 260 values provides the biological information needed to predict the secondary structure of the amino acid located at its center, that is, the amino acid at the seventh column of the window (see the circles on the amino acids of the sequence in Figure 5). w = 13 was chosen because this value is reported in the literature as the typically optimal window size for secondary structure prediction [Baldi & Brunak, 2001].

Step 4: Each window of p = 260 values is reduced to q = 80 principal components. To obtain the dimension reduction, the C-NLPCA system described in Section 4.2.3 is used (see Figure 3). The input of dimension p = 260 is divided among a series of NLPCAs with p' = 10 input neurons each. Each execution of the C-NLPCA yields a dimension reduction from p = 260 to r = 1 (one principal component). Next, to obtain the second principal component, the difference between the input (expected values) and the obtained expansion values is reinjected into the system. Thus, the C-NLPCA is executed 80 times to obtain the 80 principal components. The reduction to q = 80 was chosen to allow comparison of our results with those of Melo et al. [2003a; 2003b; 2004]. The NLPCA NNs used in this study are MLPs with five layers:
Input: p' = 10 neurons with identity activation functions
Hidden codification: m = 2 neurons with sigmoidal activation functions
Bottleneck: r = 1 neuron with an identity activation function
Hidden decoding: m = 2 neurons with sigmoidal activation functions
Output: p' = 10 neurons with identity activation functions
To reduce the processing time, the NLPCA NNs were trained with the RPROP method, learning by epoch, using the parameters Δmin = 1e-6, Δmax = 50, η+ = 1.2, and η- = 0.5 (see [Riedmiller & Braun, 1993] for details). In the training of each NLPCA, we used 30 sets of weights randomly initialized in [-1, 1]. Each set was trained individually for a maximum of 600 epochs. Figure 6 shows the weight variability during the training process; the best results (local minima) are obtained after around 500 epochs. After the training process, the best set of weights was chosen in two different ways:
a set with the smallest J, using the weight penalty P = 10^-5
a set with the smallest MSE
All dimension reduction and secondary structure prediction processes were performed separately for the data obtained by the two selection criteria; the results are compared below.

Step 5: The 80 principal components associated with each amino acid are used to train the three MLP NN predictors. Each NN has three layers of neurons with sigmoidal activation functions:
input layer: 80 neurons (each neuron receives one principal component as input)
hidden layer: a different number of neurons in each NN
output layer: three neurons (each related to one type of secondary structure)
The three NNs used 30, 35, and 40 sigmoidal neurons in the hidden layer, respectively. The RPROP training method with the parameters Δmin = 1e-6, Δmax = 50, η+ = 1.2, and η- = 0.5 and weights randomly initialized in [-1, 1] was used. Finally, to measure the prediction accuracy, we used a simplified jack-knife method, dividing the set of proteins into seven subsets (see [Melo et al., 2003a]).
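RPROP adapts one step size per weight from the sign of successive gradients, which is what the Δ and η parameters above control. The sketch below implements one update of the iRprop⁻ variant (slightly simpler than the original backtracking scheme) with the chapter's parameter values:

    import numpy as np

    ETA_PLUS, ETA_MINUS = 1.2, 0.5     # eta+ and eta- from the chapter
    STEP_MIN, STEP_MAX = 1e-6, 50.0    # delta_min and delta_max

    def rprop_update(w, grad, prev_grad, step):
        """One RPROP epoch update for a weight array w (iRprop- variant)."""
        change = grad * prev_grad
        grow, shrink = change > 0, change < 0
        step[grow] = np.minimum(step[grow] * ETA_PLUS, STEP_MAX)
        step[shrink] = np.maximum(step[shrink] * ETA_MINUS, STEP_MIN)
        grad = np.where(shrink, 0.0, grad)   # suppress update after a sign flip
        w -= np.sign(grad) * step            # only the gradient sign is used
        return w, grad                       # returned grad is next prev_grad

    # Toy usage: weights, gradients, and per-weight step sizes of one layer.
    w, step = np.zeros(5), np.full(5, 0.1)
    prev_grad, grad = np.zeros(5), np.random.randn(5)
    w, prev_grad = rprop_update(w, grad, prev_grad, step)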

The figures below show the obtained results.

Figure 6: Analysis of the number of training epochs for three weight sets: a) MSE × epoch; b) objective function J × epoch.

Figure 7: First window of 260 values of a protein, the input data for the first C-NLPCA system. These 260 values are reduced to 80 principal components.

Figure 8: The first three principal components of the window shown in Figure 7, obtained by the C-NLPCA method (weight-set selection by smallest MSE), the C-NLPCA method (weight-set selection by smallest objective function J), and the PCA method. Left: principal component values; right: spectrum.

Figure 9: Expansion of the window shown in Figure 7, obtained from the principal components calculated by a) C-NLPCA (weight-set selection by smallest MSE) and b) C-NLPCA (weight-set selection by smallest objective function J).

Table 1 shows the accuracy of the predictors discussed in Section 3 together with the results obtained in this study.

Method                              Q3 (%)
PHD [Rost & Sander, 1994]           71.9
DSC [King & Sternberg, 1996]        68.4
PREDATOR [Cuff & Barton, 1999]      68.6
NNSSP [Salamov & Solovyev, 1995]    71.4
CONSENSUS [Cuff & Barton, 1999]     72.9
GMC [Guimaraes et al., 2003]        75.9
PCA [Melo et al., 2003a], 80 PCs    73.8
ICA [Melo et al., 2003b], 80 PCs    73.9
PCA [Melo et al., 2004], 180 PCs    74.5
ICA [Melo et al., 2004], 180 PCs    74.9
C-NLPCA (MSE), 80 PCs
C-NLPCA (J), 80 PCs

Table 1: The best results of our C-NLPCA approach compared with previous predictors.

The GMC predictor was developed by Guimaraes et al. [2003] with the same solution scheme adopted in this chapter, but without any data reduction, which makes the classifier training difficult. The predictors developed by Melo et al. [2003a; 2003b; 2004] use dimension reduction to 80 and 180 principal components through PCA and ICA. The predictor adopted in this study, using C-NLPCA reduction, presented better results than GMC; this is because the dimension reduction took data nonlinearity into account, besides providing a more effective training of the classifier NNs.

The C-NLPCA obtained a better performance than the predictors that used linear dimension reduction (PCA and ICA), even when these used a larger number of principal components. This demonstrates the importance of considering data nonlinearity. Compared with PCA and ICA, the C-NLPCA method is more computationally expensive, but even so, the results justify its use. In the classifier stage, the combination rules contributed to the increase in the precision of the results. Figure 10 shows that the Average rule produced the best results. The Neural Network with the most neurons in the hidden layer (NN3 in Figure 4) performed better than the others.

Figure 10: Prediction results obtained by the combination rules.

6 Conclusions

This chapter has presented a study on computational methods for the classification of protein secondary structures into α-helixes, β-sheets, and coils, using the C-NLPCA method as the dimension reduction stage for the input data of the classifiers. The protein primary structures contained in the CB396 bank were submitted to the PSI-Blast Web server to locate distant homologous proteins, and from these proteins PSI-Blast generated PSSM score matrices. Each matrix was passed to a nonlinear dimension reduction stage, and the reduced data were then passed to classifier NNs with different topologies, whose outputs were combined. A penalty approach applied to C-NLPCA enhances the performance of the method, avoiding overfitting problems. The results of C-NLPCA confirm the effectiveness of dimension reduction for the prediction of secondary structures: C-NLPCA reduces the input space of the classification stage, obtaining the nonlinear principal components of the amino acid sets.

References

[Altschul et al., 1997] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. (1997). Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Research.

[Baldi & Brunak, 2001] Baldi, P. & Brunak, S. (2001). Bioinformatics: The Machine Learning Approach. MIT Press.
[Bishop, 1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[Botelho et al., 2005] Botelho, S. S. C., Bem, R. A., Almeida, I. L., & Mata, M. M. (2005). C-NLPCA: Extracting Non-Linear Principal Components of Image Datasets. Simpósio Brasileiro de Sensoriamento Remoto.
[Cuff & Barton, 1999] Cuff, A. J. & Barton, J. G. (1999). Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction. Proteins: Structure, Function and Genetics, Vol. 34.
[Guimaraes et al., 2003] Guimaraes, K. S., Melo, J. C. B., & Cavalcanti, G. D. C. (2003). Combining Few Neural Networks for Effective Secondary Structure Prediction. Proceedings of the Third IEEE Symposium on Bioinformatics and Bioengineering.
[Hsieh, 2001] Hsieh, W. W. (2001). Nonlinear principal component analysis by neural networks. Tellus, Vol. 53A.
[Hsieh, 2007] Hsieh, W. W. (2007). Nonlinear principal component analysis of noisy data. Neural Networks, Vol. 20, No. 4.
[Jones, 1999] Jones, D. T. (1999). Protein Secondary Structure Prediction Based on Position-Specific Scoring Matrices. Journal of Molecular Biology, Vol. 292.
[Khattree & Naik, 2000] Khattree, R. & Naik, D. N. (2000). Multivariate Data Reduction and Discrimination with SAS Software. Cary, NC: SAS Institute Inc.
[King & Sternberg, 1996] King, R. & Sternberg, M. (1996). Identification and Application of the Concepts Important for Accurate and Reliable Protein Secondary Structure Prediction. Protein Science, Vol. 5.
[Kramer, 1991] Kramer, M. A. (1991). Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. AIChE Journal, Vol. 37.
[Lehninger, 1984] Lehninger, A. L. (1984). Principles of Biochemistry. Editora Sarvier, São Paulo.
[Melo et al., 2003a] Melo, J. C. B., Cavalcanti, G. D. C., & Guimaraes, K. S. (2003a). PCA Feature Extraction for Protein Structure Prediction. International Joint Conference on Neural Networks.
[Melo et al., 2003b] Melo, J. C. B., Cavalcanti, G. D. C., & Guimaraes, K. S. (2003b). Protein Secondary Structure Prediction with ICA Feature Extraction. Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, Special Session on Bioinformatics.
[Melo et al., 2004] Melo, J. C. B., Cavalcanti, G. D. C., & Guimaraes, K. S. (2004). Protein Secondary Structure Prediction: Efficient Neural Network and Feature Extraction. IEE Electronics Letters, Vol. 40, No. 21.
[Pollastri et al., 2002] Pollastri, G., Przybylski, D., Rost, B., & Baldi, P. (2002). Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, Vol. 47.
[Qian & Sejnowski, 1988] Qian, N. & Sejnowski, T. J. (1988). Predicting the Secondary Structure of Globular Proteins Using Neural Network Models. Journal of Molecular Biology, Vol. 202.
[Riedmiller & Braun, 1993] Riedmiller, M. & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Proceedings of the IEEE International Conference on Neural Networks.
[Rost & Sander, 1993] Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, Vol. 232.
[Rost & Sander, 1994] Rost, B. & Sander, C. (1994).
Combining Evolutionary Information and Neural Networks to Predict Protein Secondary Structure. Proteins, Vol. 19.
[Rost, 2001] Rost, B. (2001). Review: Protein secondary structure prediction continues to rise. Journal of Structural Biology, Vol. 134.

[Roweis, 1998] Roweis, S. (1998). EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.).
[Salamov & Solovyev, 1995] Salamov, A. & Solovyev, V. (1995). Prediction of Protein Secondary Structure by Combining Nearest-Neighbor Algorithms and Multiple Sequence Alignments. Journal of Molecular Biology, Vol. 247.
[Simas et al., 2007] Simas, G. M., Botelho, S. S. C., Grando, N., & Colares, R. G. (2007). Dimensional Reduction in the Protein Secondary Structure Prediction: Nonlinear Method Improvements. International Workshop on Hybrid Artificial Intelligence Systems.
[University of Dundee, 2005] University of Dundee and The Barton Group. (2005). CB396 data set.
[Wang et al., 2001] Wang, J. T. L., Ma, Q., Shasha, D., & Wu, C. H. (2001). New techniques for extracting features from protein sequences. IBM Systems Journal, Vol. 40, No. 2.


More information

IT og Sundhed 2010/11

IT og Sundhed 2010/11 IT og Sundhed 2010/11 Sequence based predictors. Secondary structure and surface accessibility Bent Petersen 13 January 2011 1 NetSurfP Real Value Solvent Accessibility predictions with amino acid associated

More information

POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH

POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH Abstract POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH A.H.M.A.Rahim S.K.Chakravarthy Department of Electrical Engineering K.F. University of Petroleum and Minerals Dhahran. Dynamic

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Learning and Memory in Neural Networks

Learning and Memory in Neural Networks Learning and Memory in Neural Networks Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of Edinburgh, UK. Neural networks consist of computational units

More information

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

Analysis of Multilayer Neural Network Modeling and Long Short-Term Memory

Analysis of Multilayer Neural Network Modeling and Long Short-Term Memory Analysis of Multilayer Neural Network Modeling and Long Short-Term Memory Danilo López, Nelson Vera, Luis Pedraza International Science Index, Mathematical and Computational Sciences waset.org/publication/10006216

More information

Bioinformatics: Secondary Structure Prediction

Bioinformatics: Secondary Structure Prediction Bioinformatics: Secondary Structure Prediction Prof. David Jones d.t.jones@ucl.ac.uk Possibly the greatest unsolved problem in molecular biology: The Protein Folding Problem MWMPPRPEEVARK LRRLGFVERMAKG

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data Data Mining and Knowledge Discovery, 11, 213 222, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. DOI: 10.1007/s10618-005-0001-y Accurate Prediction of Protein Disordered

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS Int. J. LifeSc. Bt & Pharm. Res. 2012 Kaladhar, 2012 Research Paper ISSN 2250-3137 www.ijlbpr.com Vol.1, Issue. 1, January 2012 2012 IJLBPR. All Rights Reserved PROTEIN SECONDARY STRUCTURE PREDICTION:

More information

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Víctor Robles a, Pedro Larrañaga b,josém.peña a, Ernestina Menasalvas a,maría S. Pérez a, Vanessa Herves a and Anita Wasilewska

More information

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Víctor Robles a, Pedro Larrañaga b,josém.peña a, Ernestina Menasalvas a,maría S. Pérez a, Vanessa Herves a and Anita Wasilewska

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB) Protein structure databases; visualization; and classifications 1. Introduction to Protein Data Bank (PDB) 2. Free graphic software for 3D structure visualization 3. Hierarchical classification of protein

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Nonlinear singular spectrum analysis by neural networks. William W. Hsieh and Aiming Wu. Oceanography/EOS, University of British Columbia,

Nonlinear singular spectrum analysis by neural networks. William W. Hsieh and Aiming Wu. Oceanography/EOS, University of British Columbia, Nonlinear singular spectrum analysis by neural networks William W. Hsieh and Aiming Wu Oceanography/EOS, University of British Columbia, Vancouver, B.C. V6T 1Z4, Canada tel: (64) 822-2821, fax: (64) 822-691

More information

#33 - Genomics 11/09/07

#33 - Genomics 11/09/07 BCB 444/544 Required Reading (before lecture) Lecture 33 Mon Nov 5 - Lecture 31 Phylogenetics Parsimony and ML Chp 11 - pp 142 169 Genomics Wed Nov 7 - Lecture 32 Machine Learning Fri Nov 9 - Lecture 33

More information

Supporting Information

Supporting Information Supporting Information Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction Connor W. Coley a, Regina Barzilay b, William H. Green a, Tommi S. Jaakkola b, Klavs F. Jensen

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Intelligent Modular Neural Network for Dynamic System Parameter Estimation

Intelligent Modular Neural Network for Dynamic System Parameter Estimation Intelligent Modular Neural Network for Dynamic System Parameter Estimation Andrzej Materka Technical University of Lodz, Institute of Electronics Stefanowskiego 18, 9-537 Lodz, Poland Abstract: A technique

More information

Bearing fault diagnosis based on EMD-KPCA and ELM

Bearing fault diagnosis based on EMD-KPCA and ELM Bearing fault diagnosis based on EMD-KPCA and ELM Zihan Chen, Hang Yuan 2 School of Reliability and Systems Engineering, Beihang University, Beijing 9, China Science and Technology on Reliability & Environmental

More information

Evaluation of the relative contribution of each STRING feature in the overall accuracy operon classification

Evaluation of the relative contribution of each STRING feature in the overall accuracy operon classification Evaluation of the relative contribution of each STRING feature in the overall accuracy operon classification B. Taboada *, E. Merino 2, C. Verde 3 blanca.taboada@ccadet.unam.mx Centro de Ciencias Aplicadas

More information

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer 2013 9. Protein Structure Prediction I Structure Prediction Overview Overview of problem variants Secondary structure prediction

More information

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

1-D Predictions. Prediction of local features: Secondary structure & surface exposure 1-D Predictions Prediction of local features: Secondary structure & surface exposure 1 Learning Objectives After today s session you should be able to: Explain the meaning and usage of the following local

More information

Neural Networks. Nethra Sambamoorthi, Ph.D. Jan CRMportals Inc., Nethra Sambamoorthi, Ph.D. Phone:

Neural Networks. Nethra Sambamoorthi, Ph.D. Jan CRMportals Inc., Nethra Sambamoorthi, Ph.D. Phone: Neural Networks Nethra Sambamoorthi, Ph.D Jan 2003 CRMportals Inc., Nethra Sambamoorthi, Ph.D Phone: 732-972-8969 Nethra@crmportals.com What? Saying it Again in Different ways Artificial neural network

More information

The Relative Importance of Input Encoding and Learning Methodology on Protein Secondary Structure Prediction

The Relative Importance of Input Encoding and Learning Methodology on Protein Secondary Structure Prediction Georgia State University ScholarWorks @ Georgia State University Computer Science Theses Department of Computer Science 6-9-2006 The Relative Importance of Input Encoding and Learning Methodology on Protein

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

COMPARING PERFORMANCE OF NEURAL NETWORKS RECOGNIZING MACHINE GENERATED CHARACTERS

COMPARING PERFORMANCE OF NEURAL NETWORKS RECOGNIZING MACHINE GENERATED CHARACTERS Proceedings of the First Southern Symposium on Computing The University of Southern Mississippi, December 4-5, 1998 COMPARING PERFORMANCE OF NEURAL NETWORKS RECOGNIZING MACHINE GENERATED CHARACTERS SEAN

More information

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Kyriaki Kitikidou, Elias Milios, Lazaros Iliadis, and Minas Kaymakis Democritus University of Thrace,

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Template Free Protein Structure Modeling Jianlin Cheng, PhD Template Free Protein Structure Modeling Jianlin Cheng, PhD Professor Department of EECS Informatics Institute University of Missouri, Columbia 2018 Protein Energy Landscape & Free Sampling http://pubs.acs.org/subscribe/archive/mdd/v03/i09/html/willis.html

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Multilayer Perceptron

Multilayer Perceptron Aprendizagem Automática Multilayer Perceptron Ludwig Krippahl Aprendizagem Automática Summary Perceptron and linear discrimination Multilayer Perceptron, nonlinear discrimination Backpropagation and training

More information