Nonlinear Dimensional Reduction in Protein Secondary Structure Prediction


Gisele M. Simas, Silvia S. C. Botelho, Rafael Colares, Ulisses B. Corrêa, Neusa Grando, Guilherme P. Fickel, and Lucas Novelo
Centro de Ciências Computacionais (C³), Universidade Federal do Rio Grande (FURG), Brazil

1 Introduction

This chapter deals with a common problem in bioinformatics: protein secondary structure prediction. It studies the use of Artificial Neural Networks (ANNs), usually called Neural Networks (NNs), as a pre-processing stage in the prediction and classification of protein secondary structure from Blast profiles [Altschul et al., 1997] obtained from the amino acid sequence. This classification serves to reduce the search space of other methods that predict protein tertiary structure, which is, in turn, closely related to protein functionality [Lehninger, 1984]. Given the large dimension of the data to be treated, a pre-processing stage can increase predictor performance. Melo et al. [2003a; 2003b; 2004] used the traditional techniques of Principal Components Analysis (PCA) and Independent Components Analysis (ICA) for dimensional reduction, showing that such methods can contribute to the improvement of the predictors. We suggest the use of a nonlinear pre-processing stage and present a comparison between the results obtained and those of the linear dimensional reduction methods used by Melo et al. [2003a; 2003b; 2004]. In previous studies, we originally proposed the use of NNs through the Cascaded Nonlinear Components Analysis (C-NLPCA) method [Botelho et al., 2005; Simas et al., 2007] for the secondary structure prediction problem. This proposal is justified by the possibility of exploiting the nonlinear potential of NNs, together with the reduction, to acquire information useful for the classification phase. Moreover, the use of C-NLPCA prevents an unwanted simplification of the variability of the data, which could be caused by a linear method such as PCA or ICA. To validate the proposal, the reduced data are used as input for three Neural Networks with different topologies, trained by the Resilient Propagation (RPROP) method [Riedmiller & Braun, 1993]; each NN independently classifies the secondary structure of each amino acid, and these classifications are then combined to attain better results.

This chapter is organized into six sections. Section 2 presents an overview of the problem of predicting protein secondary structure. Section 3 briefly describes related work. Section 4 discusses the applied methodology and our pre-processing method for data reduction; the classifiers are likewise explained. Section 5 details our implementation, tests, and results. Finally, Section 6 presents the conclusion.

2 The Prediction of Secondary Structure

Proteins, long chains composed of numerous amino acids, are responsible for the structural and architectural functions of the cell. There are 20 different amino acids in nature. Two amino acids join in the protein molecule by a bond called a peptide bond, and a chain of more than about 70 amino acids is called a protein [Lehninger, 1984]. Proteins can have four types of hierarchical structure, depending on the spatial configuration and size of the chain and the types of amino acids that compose it:
Primary - the sequence of amino acids and its peptide bonds. All the spatial arrangements of the molecule originate from this structure.
Secondary - an amino acid chain where hydrogen bonds are established between distant amino acids, that is, folding, whose type is associated with the functions carried out by the protein.
Tertiary - at diverse points of the secondary structure, some radicals are ionized, and points of opposite electrical potential attract one another, causing the chain to fold onto itself and assume a spatial configuration. This structure confers biological activity on the protein.
Quaternary - proteins that contain more than one peptide chain (polypeptide chains) have a quaternary structure. It refers to the spatial arrangement of the polypeptide chains of a protein and the nature of their contacts.
Knowledge of the secondary structure of a protein serves to reduce the search space of its tertiary structure, thereby facilitating the determination of its functionality. Detailed knowledge of proteins in turn allows for the correction of cellular dysfunctions and the discovery of alternative treatments for diverse diseases. However, the rules governing a protein's three-dimensional conformation are still not scientifically clear. This absence of an apparent pattern makes analytical modeling of the problem difficult and makes Neural Networks a good method for this purpose. Thus, the problem of predicting protein secondary structure consists of classifying, from the primary structure (sequence of amino acids), each constituent amino acid into one of the recurrent substructures of the three-dimensional conformation of the protein, which can be grouped into α-helixes, β-sheets, and coils.
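To fix the notation used throughout the chapter, the sketch below shows the input/output contract of the prediction task: one secondary-structure label per residue. It is a minimal Python illustration with a made-up sequence and labels, not data from the CB396 bank.

    # Hypothetical example of the task: one label per residue.
    # H = alpha-helix, E = beta-sheet, C = coil; values are invented.
    primary   = "MKTAYIAKQR"   # primary structure (one letter per amino acid)
    predicted = "CHHHHHEECC"   # secondary structure classification

    assert len(primary) == len(predicted)
    for residue, label in zip(primary, predicted):
        print(residue, "->", label)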

3 Related Works

The algorithms developed so far to simulate the folding of sequences cannot accurately reproduce the laws that control this process. Neural Networks (NNs) therefore appear to be a promising method and are commonly used, demonstrating good results [Cuff & Barton, 1999; Guimaraes et al., 2003; Melo et al., 2003a; Qian & Sejnowski, 1988; Rost & Sander, 1994]. Qian and Sejnowski [1988] presented a study on determining the best NN configuration for the prediction of secondary structure. Their solution suggests that protein sequences be covered by overlapped windows, with the prediction referring to the central element of each window. This work used Multi-Layer Perceptrons (MLPs) trained with the traditional backpropagation algorithm to treat a database subset of 106 dissimilar (non-homologous) proteins. Following the success achieved by Qian and Sejnowski, other solutions were pursued, applying different training algorithms, NN topologies, and input data treatments [Baldi & Brunak, 2001; Rost, 2001]. Significant progress was achieved with the use of additional biological information as input for the NNs, more specifically sequence profiles of proteins. In contrast to sequences, which supply only local information, profiles capture distant relations between diverse sequences of a data bank. An example is the PHD predictor [Rost & Sander, 1993], which uses frequency profiles as input. Among more recent publications, a significant improvement was obtained by changing the type of input data to introduce more divergent information, particularly PSI-BLAST profiles [Pollastri et al., 2002]. This type of input was initially proposed by Jones [1999] in the PSIPRED predictor. PSI-BLAST [Altschul et al., 1997] can find distant homologues of protein sequences stored in a database. Another existing predictor is CONSENSUS, developed by Cuff and Barton [1999]. This predictor compares and combines four others: DSC [King & Sternberg, 1996], PHD [Rost & Sander, 1994], NNSSP [Salamov & Solovyev, 1995], and PREDATOR [Cuff & Barton, 1999], without performing data reduction. Meanwhile, Melo et al. [2003a; 2003b], taking into account the high dimension of the data to be treated, used a technique for statistical reduction of the input data, applying PCA to the PSI-BLAST profiles. The results were compared with those of Cuff and Barton [1999], and the new predictor constructed by Melo et al. [2003a; 2003b] presented better results. Among the predictors in previous work related to this study that use PSI-BLAST profiles as input, GMC [Guimaraes et al., 2003] presented the best performance; however, it does not use dimension reduction.

4 Methodology

The purpose of this study is to evaluate the improvements that a nonlinear method for dimension reduction of the data can bring to the prediction of secondary structure. Therefore, a procedure similar to that of Melo et al. [2003a; 2003b; 2004], which used a linear method, was carried out. Figure 1 shows our proposed method.

Figure 1: Proposed solution.

In summary, a protein sequence is presented to PSI-BLAST [Altschul et al., 1997], and the matrix of scores it produces is passed to a pre-processing stage. In this stage, the input data are reduced by the nonlinear C-NLPCA method and then passed to three classifiers, which receive the same input values. The results obtained by the classifiers are presented to the rule-combination stage, which gives the final classification. Thus, each amino acid of each protein is associated with a secondary substructure classification: α-helix, β-sheet, or coil.

4.1 Using Blast Profiles in the Prediction

Computationally, the primary structure of a protein can be represented by a character string of variable size, where each character represents one of the 20 existing amino acids. The order in which these amino acids appear in the sequence reflects the polypeptide bonds formed and is associated with the resulting secondary structure. The 396 protein sequences present in the CB396 data set [University of Dundee, 2005] were submitted to the PSI-BLAST search. The Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), developed by Altschul et al. [1997], searches for similar proteins in the chosen database, that is, proteins formed by similar sequences of amino acids, taking into account the order of their positions in the sequence. As output, it returns PSI-BLAST profiles composed of values representing the occurrence of each amino acid at each sequence position. Such profiles enable a significant improvement in the predictors because they generalize over a set of similar proteins, detect distant homologues of the searched sequence, and provide richer data. The searched protein thus receives scores that are stored in the PSSM (Position-Specific Score Matrix). This matrix has dimension 20×n (the 20 existing amino acids × the n amino acids that compose the protein). The PSSM matrices of all proteins are then passed to the reduction stage, described in the next section.

4.2 Dimension Reduction

On the assumption that NNs are a good approach for classifying protein secondary structures [Baldi & Brunak, 2001; Jones, 1999; Guimaraes et al., 2003; Pollastri et al., 2002; Rost, 2001], the selection of relevant information from the input sequences of the classifier NNs and the encoding process are crucial steps for effective prediction [Wang et al., 2001]. NNs with few inputs have few weights to adjust, generalize better, train faster, and avoid saturation. The performance of the predictor NN can therefore be increased by reducing the number of its inputs. A reduction stage can improve performance and eliminate redundant information present in the data. We propose the use of Cascaded Nonlinear Components Analysis (C-NLPCA), an approach based on NNs, for the dimension reduction of a large data set. In our approach, C-NLPCA receives the PSSM score matrix and eliminates the redundant information present in it. The PSSM score matrix of each protein is covered by overlapped windows of size w (Figure 2), displaced one position at a time until the original size of the protein is covered. The dimension of each window is p = 20×w (the frequencies of the 20 existing amino acids × the w amino acids belonging to the window). The reduction stage reduces the dimension of the window from p to q.
Each set of q components is then forwarded to the classification stage, whose task is to predict the secondary structure corresponding to the central amino acid of the window.
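To make the windowing concrete, the following minimal Python sketch (illustrative only; the authors' tool was written in C++, and how the chapter handles the protein ends is not stated, so zero-padding is assumed here) extracts overlapped 20×w windows from a PSSM and flattens each into the p-dimensional vector that labels its central residue:

    import numpy as np

    def pssm_windows(pssm, w=13):
        """Slide an overlapped window of w columns over a 20 x n PSSM.

        Yields (center_index, flattened_window) pairs; the matrix is
        zero-padded at the ends so every residue receives a window.
        """
        n = pssm.shape[1]
        half = w // 2
        padded = np.pad(pssm, ((0, 0), (half, half)))
        for center in range(n):
            window = padded[:, center:center + w]   # 20 x w block
            yield center, window.reshape(-1)        # p = 20*w values

    pssm = np.random.randn(20, 30)       # toy "PSSM" for a 30-residue protein
    for center, x in pssm_windows(pssm):
        assert x.shape == (260,)         # p = 20 x 13 = 260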

In the remainder of this section, we first present two linear reduction methods and then focus on the method proposed in this chapter, the C-NLPCA.

Figure 2: One window of the PSSM score matrix.

4.2.1 Principal Components Analysis (PCA)

PCA is a technique for the statistical analysis of a data set. It is concerned with extracting the factors that best represent the structure of interdependence between variables of large dimension. All the variables are analyzed simultaneously, each in relation to all the others, to determine the factors (principal components) that maximize the explanation of the variability existing in the data. However, PCA is indicated for the analysis of variables that have linear relations [Bishop, 2006]. In PCA, the data are approximated by a straight line, which minimizes the mean square error (MSE). In this study, we used the Expectation Maximization (EM) algorithm [Roweis, 1998] to calculate the PCA.

4.2.2 Independent Components Analysis (ICA)

ICA is a linear data transformation method used to find a representation that minimizes the statistical dependence between the represented components. To obtain the components of a vector x, ICA tries to find a linear transformation s = Wx in which the components s_i are as independent as possible [Melo et al., 2003b]. This is accomplished through the maximization of a function F(s_1, ..., s_m) capable of measuring the independence of the components. Both PCA and ICA are projection techniques onto their own spaces. The principal difference lies in the fact that PCA generates uncorrelated components, while ICA generates independent components [Bishop, 2006].
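The EM algorithm for PCA cited above admits a compact implementation. The sketch below, a Python/NumPy rendering of the two alternating steps in Roweis [1998] applied to toy random data, recovers a basis for the k leading principal directions (a full implementation would also orthonormalize the result):

    import numpy as np

    def em_pca(Y, k, iters=100):
        """EM for PCA (Roweis, 1998) on a d x N centered data matrix Y.

        Returns a d x k matrix whose columns span the k leading
        principal directions.
        """
        d, N = Y.shape
        C = np.random.randn(d, k)                    # random initial loadings
        for _ in range(iters):
            X = np.linalg.solve(C.T @ C, C.T @ Y)    # E-step: latent coordinates
            C = Y @ X.T @ np.linalg.inv(X @ X.T)     # M-step: new loadings
        return C

    Y = np.random.randn(260, 1000)       # toy data with the chapter's window size
    Y -= Y.mean(axis=1, keepdims=True)   # PCA assumes centered data
    C = em_pca(Y, k=80)                  # 80 components, as used in this study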

4.2.3 Cascaded Nonlinear Components Analysis (C-NLPCA)

In secondary structure prediction, the nonlinear behavior of the input data can be seen in the good performance of NNs in prediction applications [Baldi & Brunak, 2001; Jones, 1999; Guimaraes et al., 2003; Pollastri et al., 2002; Rost, 2001]. We intend to extend this nonlinear treatment to the pre-processing stage. In this chapter, the use of C-NLPCA is suggested (see details in [Botelho et al., 2005; Simas et al., 2007]) to provide a nonlinear mapping of the data, because linear methods such as PCA and ICA may introduce undesirable simplifications in the analysis of variables with nonlinear relations. The C-NLPCA method is based on cascading layers of simple NLPCA NNs [Kramer, 1991], aimed at the nonlinear treatment of high-dimensional data (see Figure 3).

Figure 3: C-NLPCA system.

NLPCA Analysis. The NNs for the analysis of nonlinear principal components, called NLPCAs, are MLPs composed of a reduction and an expansion stage. NNs with five layers are used: input (p neurons), hidden codification (m neurons), bottleneck (r neurons), hidden decoding (m neurons), and output (p neurons). The neurons of the codification and decoding layers use nonlinear activation functions, while those of the input, bottleneck, and output layers use linear ones. These characteristics let the NN model a family of nonlinear functions. The inputs, lying in a p-dimensional space, are mapped into an r-dimensional space when forwarded to the bottleneck layer. The activation values of the bottleneck-layer neurons supply the nonlinear principal components. Each NLPCA NN is trained to obtain a mapping between input and output that minimizes the following function:

\min \sum_{i=1}^{n} \| X_i - \hat{X}_i \|^2, \qquad (1)

where \hat{X}_i is the output of the NN for each input X_i. Since the output produced by the expansion contains the residual error of the samples, the residue X_i - \hat{X}_i can be used to obtain the second principal component, and so on, successively.
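A single NLPCA of this kind is essentially a five-layer autoencoder. The following sketch is an assumption-laden Python/PyTorch stand-in for the authors' C++ implementation, run on toy data: it wires up the p-m-r-m-p topology with the sizes and RPROP settings reported in Section 5 and minimizes Equation (1). The residue computed on the last line is what gets reinjected to extract the next component.

    import torch
    import torch.nn as nn

    # Five-layer NLPCA (Kramer, 1991): p -> m -> r -> m -> p, with sigmoidal
    # codification/decoding layers and linear input/bottleneck/output, using
    # the sizes reported in Section 5: p' = 10, m = 2, r = 1. Data are toy.
    p, m, r = 10, 2, 1
    encoder = nn.Sequential(nn.Linear(p, m), nn.Sigmoid(), nn.Linear(m, r))
    decoder = nn.Sequential(nn.Linear(r, m), nn.Sigmoid(), nn.Linear(m, p))

    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Rprop(params, etas=(0.5, 1.2), step_sizes=(1e-6, 50.0))

    X = torch.randn(500, p)              # samples standing in for window blocks
    for epoch in range(600):             # the chapter trains up to 600 epochs
        opt.zero_grad()
        u = encoder(X)                   # bottleneck: nonlinear principal component
        loss = ((X - decoder(u)) ** 2).sum()   # Equation (1)
        loss.backward()
        opt.step()

    residue = X - decoder(encoder(X))    # reinjected to extract the next component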

From NLPCA Sets to the C-NLPCA System. Due to the intrinsic saturation problem of NNs, the applicability of a single NLPCA is restricted to cases where p ≪ n [Botelho et al., 2005]. That is, a more formal relation can be established between the number of parameters (weights and biases) of the NLPCA and the number of samples n presented to the NN; this relation requires that 2m + r + p ≪ n. To avoid this limitation, we have proposed an architecture in which NLPCAs are grouped into layers. In the reduction stage, the data of initial dimension p are distributed over a series of small NLPCA NNs with p' input neurons each. Each NLPCA reduces its respective input from dimension p' to dimension 1. The reduced data (local principal components) are grouped again and reduced successively in subsequent layers (reduction layers). Finally, the bottleneck neuron of the last NLPCA NN supplies the first global principal component (C-NLPC).

Expansion Stage. After the reduction layers, a set of MLP NNs composes the expansion stage. The output value obtained at the bottleneck of the last NLPCA is used as input for MLP NNs with 1 input neuron and p' output neurons. These NNs, designated expansion NNs, are arranged successively, layer by layer, until output sets are reproduced whose final dimension equals that of the input presented to the system, p. During the expansion there is no training of NNs; only the value of the principal component is propagated. Since the local principal components are combined successively in the reduction stage, the C-NLPCA considers all neighborhood relations between the variables. The relations between all windows that compose the PSSM matrix of a given protein are also analyzed, since they are used as samples in the training.
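The cascade itself can be sketched independently of how each small NLPCA is trained. In the Python sketch below, a placeholder linear projection stands in for a trained p'→1 NLPCA bottleneck (the real method uses the five-layer network sketched above), and the zero-padding used to complete the last group at each layer is our assumption, not a detail given in the chapter:

    import numpy as np

    def nlpca1(block):
        """Placeholder for a trained p' -> 1 NLPCA bottleneck (see sketch above)."""
        w = np.ones(block.shape[-1]) / block.shape[-1]
        return block @ w                 # linear stand-in for the bottleneck value

    def cnlpca_first_component(x, p_prime=10):
        """Cascade p'-input NLPCAs layer by layer until one value remains."""
        level = np.asarray(x, dtype=float)
        while level.size > 1:
            pad = (-level.size) % p_prime          # zero-pad the last group
            level = np.pad(level, (0, pad)).reshape(-1, p_prime)
            level = np.array([nlpca1(b) for b in level])   # local components
        return level[0]                            # first global C-NLPC

    x = np.random.randn(260)             # one flattened 20 x 13 window
    u1 = cnlpca_first_component(x)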

Model Selection. Different solution models can be used to implement an NLPCA. These models differ in the following parameters: the topology of the NNs (number of neurons per layer), the criterion for selecting the best weights, the criterion for stopping the training, and the weight regularization values, among others. The complexity of the model can be increased by increasing the number of neurons in the hidden codification and decoding layers. Weight regularization can, in turn, restrict the nonlinearity of the model [Hsieh, 2007]. A nonlinear model with enough flexibility can find zigzag solutions with a smaller MSE; however, such solutions may not generalize to new samples, causing overfitting. Overfitting can be detected when two neighboring points of the input data are projected to distant points on the approximation curve obtained by the dimension reduction method. Thus, selecting the solution with the smallest MSE is not a sufficient criterion for the best response. Regularization (the addition of a weight penalty) has been used to control overfitting [Bishop, 1995]; a larger weight penalty tends to produce solutions with less nonlinearity. Considering these facts, for each NLPCA we analyzed N sets of randomly initialized weights. Each set was trained individually, and the most suitable one was then selected in one of two ways:

The set that minimizes the MSE:

\sum_{i=1}^{n} \| X_i - \hat{X}_i \|^2, \qquad (2)

The set that minimizes the following objective function:

J = \langle \| x - \hat{x} \|^2 \rangle + \langle u \rangle^2 + \left( \langle u^2 \rangle - 1 \right)^2 + P \sum_j \left( w_j^{(2)} \right)^2. \qquad (3)

In Equation 3, the first term is the MSE; the second and third terms restrain u (the principal component value) towards ⟨u⟩ = 0 and ⟨u²⟩ = 1; and the final term is a weight penalty (regularization) term, with P the weight penalty parameter and w^{(2)} the weights of the hidden codification layer (see [Hsieh, 2007] for details). According to Hsieh [2001], penalizing just this layer of weights is sufficient to limit the nonlinear modeling capacity of the model. After the dimension reduction, the secondary structure prediction was performed separately on the data obtained by each of the two selection criteria. The prediction (classification) method is described in the next section, and the results are compared in Section 5.

4.3 Classification: Using the NN Predictor

To validate the use of C-NLPCA in the pre-processing stage of predictors, we also implemented a classification stage [Guimaraes et al., 2003]. This stage receives the principal components as input and outputs a secondary structure classification for each amino acid of each protein. The proposed classifier combines three MLP NNs with distinct topologies, aiming at the choice of a better local minimum; the NNs differ in the number of neurons in the hidden layer. The output values are combined according to the rules established in the rule-combination stage. The NNs are trained by epoch with Resilient Propagation (RPROP) [Riedmiller & Braun, 1993].

4.4 Rules of Combination and Evaluation

The results obtained by the classifiers are then presented to the rule-combination stage. The combination of the three NNs of different topologies aims to improve the prediction of the secondary structures [Cuff & Barton, 1999; Pollastri et al., 2002]. The rules used for combining the results are Voting, Average, Product, Maximum, and Minimum [Melo et al., 2003a]. In the Voting rule, the result shown most frequently by the three networks is chosen as the final reply. The other rules follow the equations below:

\mathrm{Average} = \max_j \frac{1}{3} \sum_{i=1}^{3} NN_{ij}, \quad \mathrm{Product} = \max_j \prod_{i=1}^{3} NN_{ij}, \quad \mathrm{Maximum} = \max_j \min_i NN_{ij}, \quad \mathrm{Minimum} = \min_j \max_i NN_{ij}, \qquad (4)

where index i indicates the network considered and index j one of the three output neurons. Each output neuron represents one of the three possible classifications: α-helix, β-sheet, or coil. After the combination of the NNs, the results are evaluated using a reduced variation of the jack-knife process applied to subgroups of proteins (see [Melo et al., 2003a] for more details). The accuracy on each subgroup is measured by the Q3 value, obtained through the following equation:

Q3 = \frac{\text{correctly predicted amino acids}}{\text{total amino acids}} \times 100. \qquad (5)
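The combination rules and the Q3 measure reduce to a few array operations. The Python sketch below applies Equations (4) and (5) to toy activations, with outputs[i, j] holding network i's activation for class j:

    import numpy as np

    # outputs[i, j]: activation of network i for class j
    # (0 = alpha-helix, 1 = beta-sheet, 2 = coil); toy values only.
    outputs = np.array([[0.7, 0.2, 0.1],
                        [0.5, 0.3, 0.2],
                        [0.4, 0.4, 0.2]])

    votes = np.argmax(outputs, axis=1)              # Voting rule
    voting_cls = np.bincount(votes, minlength=3).argmax()
    average_cls = np.argmax(outputs.mean(axis=0))   # Average rule
    product_cls = np.argmax(outputs.prod(axis=0))   # Product rule
    maximum_cls = np.argmax(outputs.min(axis=0))    # Maximum rule, Eq. (4)
    minimum_cls = np.argmin(outputs.max(axis=0))    # Minimum rule, Eq. (4)

    # Q3 accuracy, Equation (5), over toy true/predicted labels.
    true = np.array([0, 1, 2, 0])
    pred = np.array([0, 1, 1, 0])
    q3 = 100.0 * (true == pred).mean()              # 75.0 here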

5 Implementation, Results, and Discussion

A tool was developed in C++ to implement the C-NLPCA pre-processing stage and the classifier NNs. This tool can be used to reduce a range of data sets from different domains; in this study, we are interested in reducing and analyzing protein secondary structure data. Figure 4 schematically summarizes the steps of the proposed solution. From the CB396 bank, we obtain the amino acid sequences of the proteins and the expected classifications of their secondary structures. The amino acid sequence of each protein is submitted to the PSI-BLAST software, which creates a 20×n PSSM score matrix (the 20 existing amino acids × the n amino acids that compose the protein). The PSSM of each protein is covered by windows of size 20×w; each window represents the amino acid located at its center. We used w = 13, resulting in 260 values per window. Through the C-NLPCA method, each window of 260 values is reduced to 80 principal components. The 80 principal components are used to train three predictor NNs, and the results of these NNs are combined. As a response, a secondary structure type is associated with each amino acid of each protein.

Figure 4: Summary of the solution steps.

Figure 5 illustrates the process for one protein. Below, we detail the solution steps and report the parameters used in this study.

Step 1: First, the amino acid sequences of the 396 proteins in the CB396 bank, as well as the expected classifications of their secondary structures, were obtained.

Step 2: Next, the 396 protein sequences were submitted to the PSI-BLAST search [Altschul et al., 1997], using the standard search parameters and three iterations. Each searched protein received scores that were stored in its PSSM matrix of dimension 20×n (the 20 existing amino acids × the n amino acids that compose the protein).
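For illustration, a reader for such a matrix might look as follows. This is a hypothetical parser assuming rows of the form '<position> <residue> <20 scores> ...'; the exact layout of a PSI-BLAST ASCII PSSM includes further header lines and columns, so a real implementation must be adapted to the actual file:

    import numpy as np

    def read_pssm(path):
        """Hypothetical reader: rows '<pos> <aa> <20 scores> ...' -> 20 x n array."""
        rows = []
        with open(path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 22 and parts[0].isdigit() and parts[1].isalpha():
                    rows.append([float(v) for v in parts[2:22]])
        return np.array(rows).T          # shape (20, n): amino acids x positions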

Figure 5: Illustration of the solution steps.

Step 3: The PSSM of each protein is covered by overlapping windows of size p = 20×w (the 20 existing amino acids × the window parameter w). In this study, windows of w = 13 were analyzed, resulting in data of dimension 260 (20×13) to be processed and reduced. Each new window gains a new column of the PSSM matrix on the right and loses a column on the left (see the braces in Figure 5). A window of 260 values provides the biological information needed to predict the secondary structure of the amino acid located at its center, that is, the amino acid at the seventh column of the window (see the circles on the amino acids of the sequence in Figure 5). w = 13 was chosen because this value is reported in the literature as the typically optimal window size for secondary structure prediction [Baldi & Brunak, 2001].

Step 4: Each window of p = 260 values is reduced to q = 80 principal components. To obtain the dimension reduction, the C-NLPCA system described in Section 4.2.3 is used (see Figure 3). The input of dimension p = 260 is divided among a series of NLPCAs with p' = 10 input neurons each. Each execution of the C-NLPCA yields a dimension reduction from p = 260 to r = 1 (one principal component). Next, to obtain the second principal component, the difference between the input (expected values) and the obtained expansion values is reinjected into the system. Thus, the C-NLPCA is executed 80 times to obtain the 80 principal components. The reduction to q = 80 was chosen to allow comparison of our results with those of Melo et al. [2003a; 2003b; 2004]. The NLPCA NNs used in this study are MLPs with five layers:
Input: p' = 10 neurons with identity activation functions
Hidden codification: m = 2 neurons with sigmoidal activation functions
Bottleneck: r = 1 neuron with an identity activation function
Hidden decoding: m = 2 neurons with sigmoidal activation functions
Output: p' = 10 neurons with identity activation functions
To reduce the processing time, the NLPCA NNs were trained with the RPROP method, learning by epoch, using the parameters Δmin = 1e-6, Δmax = 50, η+ = 1.2, and η- = 0.5 (see [Riedmiller & Braun, 1993] for details). In the training of each NLPCA, we used 30 sets of weights randomly initialized in [-1, 1]. Each set was trained individually for a maximum of 600 epochs. Figure 6 shows the weight variability during the training process; the best results (local minima) are obtained after around 500 epochs. After the training process, the best set of weights was chosen in two different ways:
a set with the smallest J, using the weight penalty P = 10^-5
a set with the smallest MSE
All dimension reduction and secondary structure prediction processes were performed separately for the data obtained by the two selection criteria; the results are compared below.

Step 5: The 80 principal components associated with each amino acid are used to train the three MLP NN predictors. Each NN has three layers of neurons with sigmoidal activation functions:
input layer: 80 neurons (each neuron receives one principal component as input)
hidden layer: a different number of neurons in each NN
output layer: three neurons (each related to one type of secondary structure)
The three NNs used 30, 35, and 40 sigmoidal neurons in the hidden layer, respectively. The RPROP training method with the parameters Δmin = 1e-6, Δmax = 50, η+ = 1.2, and η- = 0.5 and weights randomly initialized in [-1, 1] was used. Finally, to measure the prediction accuracy, we used a simplified jack-knife method, dividing the set of proteins into seven subsets (see [Melo et al., 2003a]).
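RPROP adapts one step size per weight from the sign of successive gradients, which is what the Δ and η parameters above control. The sketch below implements one update of the iRprop⁻ variant (slightly simpler than the original backtracking scheme) with the chapter's parameter values:

    import numpy as np

    ETA_PLUS, ETA_MINUS = 1.2, 0.5     # eta+ and eta- from the chapter
    STEP_MIN, STEP_MAX = 1e-6, 50.0    # delta_min and delta_max

    def rprop_update(w, grad, prev_grad, step):
        """One RPROP epoch update for a weight array w (iRprop- variant)."""
        change = grad * prev_grad
        grow, shrink = change > 0, change < 0
        step[grow] = np.minimum(step[grow] * ETA_PLUS, STEP_MAX)
        step[shrink] = np.maximum(step[shrink] * ETA_MINUS, STEP_MIN)
        grad = np.where(shrink, 0.0, grad)   # suppress update after a sign flip
        w -= np.sign(grad) * step            # only the gradient sign is used
        return w, grad                       # returned grad is next prev_grad

    # Toy usage: weights, gradients, and per-weight step sizes of one layer.
    w, step = np.zeros(5), np.full(5, 0.1)
    prev_grad, grad = np.zeros(5), np.random.randn(5)
    w, prev_grad = rprop_update(w, grad, prev_grad, step)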

The figures below show the obtained results.

Figure 6: Analysis of the number of training epochs for three weight sets: a) MSE × epoch; b) objective function J × epoch.

Figure 7: First window of 260 values of a protein, the input data for the first C-NLPCA system. These 260 values are reduced to 80 principal components.

Figure 8: The first three principal components of the window shown in Figure 7, obtained by the C-NLPCA method (weight-set selection by smallest MSE), the C-NLPCA method (weight-set selection by smallest objective function J), and the PCA method. Left: principal component values; right: spectrum.

Figure 9: Expansion of the window shown in Figure 7, obtained from the principal components calculated by a) C-NLPCA (weight-set selection by smallest MSE) and b) C-NLPCA (weight-set selection by smallest objective function J).

Table 1 shows the accuracy of the predictors discussed in Section 3 together with the results obtained in this study.

Method                              Q3 (%)
PHD [Rost & Sander, 1994]           71.9
DSC [King & Sternberg, 1996]        68.4
PREDATOR [Cuff & Barton, 1999]      68.6
NNSSP [Salamov & Solovyev, 1995]    71.4
CONSENSUS [Cuff & Barton, 1999]     72.9
GMC [Guimaraes et al., 2003]        75.9
PCA [Melo et al., 2003a], 80 PCs    73.8
ICA [Melo et al., 2003b], 80 PCs    73.9
PCA [Melo et al., 2004], 180 PCs    74.5
ICA [Melo et al., 2004], 180 PCs    74.9
C-NLPCA (MSE), 80 PCs
C-NLPCA (J), 80 PCs

Table 1: The best results of our C-NLPCA approach compared with previous predictors.

The GMC predictor was developed by Guimaraes et al. [2003] with the same solution scheme adopted in this chapter, but without any data reduction, which makes the classifier training difficult. The predictors developed by Melo et al. [2003a; 2003b; 2004] use dimension reduction to 80 and 180 principal components through PCA and ICA. The predictor adopted in this study, using C-NLPCA reduction, presented better results than GMC; this is because the dimension reduction took data nonlinearity into account, besides providing a more effective training of the classifier NNs.

The C-NLPCA obtained a better performance than the predictors that used linear dimension reduction (PCA and ICA), even when these used a larger number of principal components. This demonstrates the importance of considering data nonlinearity. Compared with PCA and ICA, the C-NLPCA method is more computationally expensive, but even so, the results justify its use. In the classifier stage, the combination rules contributed to the increase in the precision of the results. Figure 10 shows that the Average rule produced the best results. The Neural Network with the most neurons in the hidden layer (NN3 in Figure 4) performed better than the others.

Figure 10: Prediction results obtained by the combination rules.

6 Conclusions

This chapter has presented a study on computational methods for the classification of protein secondary structures into α-helixes, β-sheets, and coils, using the C-NLPCA method as the dimension reduction stage for the input data of the classifiers. The protein primary structures contained in the CB396 bank were submitted to the PSI-Blast Web server to locate distant homologous proteins, and from these proteins PSI-Blast generated PSSM score matrices. Each matrix was passed to a nonlinear dimension reduction stage, and the reduced data were then passed to classifier NNs with different topologies, whose outputs were combined. A penalty approach applied to C-NLPCA enhances the performance of the method, avoiding overfitting problems. The results of C-NLPCA confirm the effectiveness of dimension reduction for the prediction of secondary structures: C-NLPCA reduces the input space of the classification stage, obtaining the nonlinear principal components of the amino acid sets.

References

[Altschul et al., 1997] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. (1997). Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Research.

[Baldi & Brunak, 2001] Baldi, P. & Brunak, S. (2001). Bioinformatics: The Machine Learning Approach. MIT Press.
[Bishop, 1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[Botelho et al., 2005] Botelho, S. S. C., Bem, R. A., Almeida, I. L., & Mata, M. M. (2005). C-NLPCA: Extracting Non-Linear Principal Components of Image Datasets. Simpósio Brasileiro de Sensoriamento Remoto.
[Cuff & Barton, 1999] Cuff, A. J. & Barton, J. G. (1999). Evaluation and Improvement of Multiple Sequence Methods for Protein Secondary Structure Prediction. Proteins: Structure, Function and Genetics, Vol. 34.
[Guimaraes et al., 2003] Guimaraes, K. S., Melo, J. C. B., & Cavalcanti, G. D. C. (2003). Combining Few Neural Networks for Effective Secondary Structure Prediction. Proceedings of the Third IEEE Symposium on Bioinformatics and Bioengineering.
[Hsieh, 2001] Hsieh, W. W. (2001). Nonlinear principal component analysis by neural networks. Tellus, Vol. 53A.
[Hsieh, 2007] Hsieh, W. W. (2007). Nonlinear principal component analysis of noisy data. Neural Networks, Vol. 20, No. 4.
[Jones, 1999] Jones, D. T. (1999). Protein Secondary Structure Prediction Based on Position-Specific Scoring Matrices. Journal of Molecular Biology, Vol. 292.
[Khattree & Naik, 2000] Khattree, R. & Naik, D. N. (2000). Multivariate Data Reduction and Discrimination with SAS Software. Cary, NC: SAS Institute Inc.
[King & Sternberg, 1996] King, R. & Sternberg, M. (1996). Identification and Application of the Concepts Important for Accurate and Reliable Protein Secondary Structure Prediction. Protein Science, Vol. 5.
[Kramer, 1991] Kramer, M. A. (1991). Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. AIChE Journal, Vol. 37.
[Lehninger, 1984] Lehninger, A. L. (1984). Principles of Biochemistry. Editora Sarvier, São Paulo.
[Melo et al., 2003a] Melo, J. C. B., Cavalcanti, G. D. C., & Guimaraes, K. S. (2003a). PCA Feature Extraction for Protein Structure Prediction. International Joint Conference on Neural Networks.
[Melo et al., 2003b] Melo, J. C. B., Cavalcanti, G. D. C., & Guimaraes, K. S. (2003b). Protein Secondary Structure Prediction with ICA Feature Extraction. Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, Special Session on Bioinformatics.
[Melo et al., 2004] Melo, J. C. B., Cavalcanti, G. D. C., & Guimaraes, K. S. (2004). Protein Secondary Structure Prediction: Efficient Neural Network and Feature Extraction. IEE Electronics Letters, Vol. 40, No. 21.
[Pollastri et al., 2002] Pollastri, G., Przybylski, D., Rost, B., & Baldi, P. (2002). Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, Vol. 47.
[Qian & Sejnowski, 1988] Qian, N. & Sejnowski, T. J. (1988). Predicting the Secondary Structure of Globular Proteins Using Neural Network Models. Journal of Molecular Biology, Vol. 202.
[Riedmiller & Braun, 1993] Riedmiller, M. & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. Proceedings of the IEEE International Conference on Neural Networks.
[Rost & Sander, 1993] Rost, B. & Sander, C. (1993). Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology, Vol. 232.
[Rost & Sander, 1994] Rost, B. & Sander, C. (1994).
Combining Evolutionary Information and Neural Networks to Predict Protein Secondary Structure. Proteins, Vol. 19.
[Rost, 2001] Rost, B. (2001). Review: Protein secondary structure prediction continues to rise. Journal of Structural Biology, Vol. 134.

[Roweis, 1998] Roweis, S. (1998). EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.).
[Salamov & Solovyev, 1995] Salamov, A. & Solovyev, V. (1995). Prediction of Protein Secondary Structure by Combining Nearest-Neighbor Algorithms and Multiple Sequence Alignments. Journal of Molecular Biology, Vol. 247.
[Simas et al., 2007] Simas, G. M., Botelho, S. S. C., Grando, N., & Colares, R. G. (2007). Dimensional Reduction in the Protein Secondary Structure Prediction: Nonlinear Method Improvements. International Workshop on Hybrid Artificial Intelligence Systems.
[University of Dundee, 2005] University of Dundee and The Barton Group. (2005). CB396 data set.
[Wang et al., 2001] Wang, J. T. L., Ma, Q., Shasha, D., & Wu, C. H. (2001). New techniques for extracting features from protein sequences. IBM Systems Journal, Vol. 40, No. 2.


More information

IT og Sundhed 2010/11

IT og Sundhed 2010/11 IT og Sundhed 2010/11 Sequence based predictors. Secondary structure and surface accessibility Bent Petersen 13 January 2011 1 NetSurfP Real Value Solvent Accessibility predictions with amino acid associated

More information

POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH

POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH Abstract POWER SYSTEM DYNAMIC SECURITY ASSESSMENT CLASSICAL TO MODERN APPROACH A.H.M.A.Rahim S.K.Chakravarthy Department of Electrical Engineering K.F. University of Petroleum and Minerals Dhahran. Dynamic

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

Learning and Memory in Neural Networks

Learning and Memory in Neural Networks Learning and Memory in Neural Networks Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of Edinburgh, UK. Neural networks consist of computational units

More information

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen Neural Networks - I Henrik I Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I Christensen (RIM@GT) Neural Networks 1 /

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

Analysis of Multilayer Neural Network Modeling and Long Short-Term Memory

Analysis of Multilayer Neural Network Modeling and Long Short-Term Memory Analysis of Multilayer Neural Network Modeling and Long Short-Term Memory Danilo López, Nelson Vera, Luis Pedraza International Science Index, Mathematical and Computational Sciences waset.org/publication/10006216

More information

Bioinformatics: Secondary Structure Prediction

Bioinformatics: Secondary Structure Prediction Bioinformatics: Secondary Structure Prediction Prof. David Jones d.t.jones@ucl.ac.uk Possibly the greatest unsolved problem in molecular biology: The Protein Folding Problem MWMPPRPEEVARK LRRLGFVERMAKG

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data

Accurate Prediction of Protein Disordered Regions by Mining Protein Structure Data Data Mining and Knowledge Discovery, 11, 213 222, 2005 c 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. DOI: 10.1007/s10618-005-0001-y Accurate Prediction of Protein Disordered

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS Int. J. LifeSc. Bt & Pharm. Res. 2012 Kaladhar, 2012 Research Paper ISSN 2250-3137 www.ijlbpr.com Vol.1, Issue. 1, January 2012 2012 IJLBPR. All Rights Reserved PROTEIN SECONDARY STRUCTURE PREDICTION:

More information

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Víctor Robles a, Pedro Larrañaga b,josém.peña a, Ernestina Menasalvas a,maría S. Pérez a, Vanessa Herves a and Anita Wasilewska

More information

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction

Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Bayesian Network Multi-classifiers for Protein Secondary Structure Prediction Víctor Robles a, Pedro Larrañaga b,josém.peña a, Ernestina Menasalvas a,maría S. Pérez a, Vanessa Herves a and Anita Wasilewska

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB) Protein structure databases; visualization; and classifications 1. Introduction to Protein Data Bank (PDB) 2. Free graphic software for 3D structure visualization 3. Hierarchical classification of protein

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Nonlinear singular spectrum analysis by neural networks. William W. Hsieh and Aiming Wu. Oceanography/EOS, University of British Columbia,

Nonlinear singular spectrum analysis by neural networks. William W. Hsieh and Aiming Wu. Oceanography/EOS, University of British Columbia, Nonlinear singular spectrum analysis by neural networks William W. Hsieh and Aiming Wu Oceanography/EOS, University of British Columbia, Vancouver, B.C. V6T 1Z4, Canada tel: (64) 822-2821, fax: (64) 822-691

More information

#33 - Genomics 11/09/07

#33 - Genomics 11/09/07 BCB 444/544 Required Reading (before lecture) Lecture 33 Mon Nov 5 - Lecture 31 Phylogenetics Parsimony and ML Chp 11 - pp 142 169 Genomics Wed Nov 7 - Lecture 32 Machine Learning Fri Nov 9 - Lecture 33

More information

Supporting Information

Supporting Information Supporting Information Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction Connor W. Coley a, Regina Barzilay b, William H. Green a, Tommi S. Jaakkola b, Klavs F. Jensen

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Intelligent Modular Neural Network for Dynamic System Parameter Estimation

Intelligent Modular Neural Network for Dynamic System Parameter Estimation Intelligent Modular Neural Network for Dynamic System Parameter Estimation Andrzej Materka Technical University of Lodz, Institute of Electronics Stefanowskiego 18, 9-537 Lodz, Poland Abstract: A technique

More information

Bearing fault diagnosis based on EMD-KPCA and ELM

Bearing fault diagnosis based on EMD-KPCA and ELM Bearing fault diagnosis based on EMD-KPCA and ELM Zihan Chen, Hang Yuan 2 School of Reliability and Systems Engineering, Beihang University, Beijing 9, China Science and Technology on Reliability & Environmental

More information

Evaluation of the relative contribution of each STRING feature in the overall accuracy operon classification

Evaluation of the relative contribution of each STRING feature in the overall accuracy operon classification Evaluation of the relative contribution of each STRING feature in the overall accuracy operon classification B. Taboada *, E. Merino 2, C. Verde 3 blanca.taboada@ccadet.unam.mx Centro de Ciencias Aplicadas

More information

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I

BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer Protein Structure Prediction I BIOINF 4120 Bioinformatics 2 - Structures and Systems - Oliver Kohlbacher Summer 2013 9. Protein Structure Prediction I Structure Prediction Overview Overview of problem variants Secondary structure prediction

More information

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

1-D Predictions. Prediction of local features: Secondary structure & surface exposure 1-D Predictions Prediction of local features: Secondary structure & surface exposure 1 Learning Objectives After today s session you should be able to: Explain the meaning and usage of the following local

More information

Neural Networks. Nethra Sambamoorthi, Ph.D. Jan CRMportals Inc., Nethra Sambamoorthi, Ph.D. Phone:

Neural Networks. Nethra Sambamoorthi, Ph.D. Jan CRMportals Inc., Nethra Sambamoorthi, Ph.D. Phone: Neural Networks Nethra Sambamoorthi, Ph.D Jan 2003 CRMportals Inc., Nethra Sambamoorthi, Ph.D Phone: 732-972-8969 Nethra@crmportals.com What? Saying it Again in Different ways Artificial neural network

More information

The Relative Importance of Input Encoding and Learning Methodology on Protein Secondary Structure Prediction

The Relative Importance of Input Encoding and Learning Methodology on Protein Secondary Structure Prediction Georgia State University ScholarWorks @ Georgia State University Computer Science Theses Department of Computer Science 6-9-2006 The Relative Importance of Input Encoding and Learning Methodology on Protein

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

COMPARING PERFORMANCE OF NEURAL NETWORKS RECOGNIZING MACHINE GENERATED CHARACTERS

COMPARING PERFORMANCE OF NEURAL NETWORKS RECOGNIZING MACHINE GENERATED CHARACTERS Proceedings of the First Southern Symposium on Computing The University of Southern Mississippi, December 4-5, 1998 COMPARING PERFORMANCE OF NEURAL NETWORKS RECOGNIZING MACHINE GENERATED CHARACTERS SEAN

More information

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters

Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Combination of M-Estimators and Neural Network Model to Analyze Inside/Outside Bark Tree Diameters Kyriaki Kitikidou, Elias Milios, Lazaros Iliadis, and Minas Kaymakis Democritus University of Thrace,

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Template Free Protein Structure Modeling Jianlin Cheng, PhD Template Free Protein Structure Modeling Jianlin Cheng, PhD Professor Department of EECS Informatics Institute University of Missouri, Columbia 2018 Protein Energy Landscape & Free Sampling http://pubs.acs.org/subscribe/archive/mdd/v03/i09/html/willis.html

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Multilayer Perceptron

Multilayer Perceptron Aprendizagem Automática Multilayer Perceptron Ludwig Krippahl Aprendizagem Automática Summary Perceptron and linear discrimination Multilayer Perceptron, nonlinear discrimination Backpropagation and training

More information