Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins


Mol Divers (2008) 12:41–45
FULL LENGTH PAPER

Bing Niu · Yu-Huan Jin · Kai-Yan Feng · Wen-Cong Lu · Yu-Dong Cai · Guo-Zheng Li

Received: 6 June 2007 / Accepted: 24 February 2008 / Published online: 28 May 2008
© Springer Science+Business Media, B.V. 2008

Abstract In this paper the AdaBoost algorithm, a popular and effective prediction method, is applied to predict the subcellular locations of prokaryotic and eukaryotic proteins on a dataset derived from SWISS-PROT. Its prediction ability was evaluated by the re-substitution test, Leave-One-Out Cross Validation (LOOCV) and the jackknife test. By comparing its results with those of some of the most popular predictors, such as the discriminant function, neural networks and SVM, we demonstrate that the AdaBoost predictor outperforms these predictors. We therefore conclude that the AdaBoost algorithm can be employed as a robust method to predict subcellular location. An online web server for predicting the subcellular location of prokaryotic and eukaryotic proteins is available at

Keywords AdaBoost · Subcellular location · Self-consistency · Jackknife test

B. Niu · W.-C. Lu (corresponding author), School of Materials Science and Engineering, Shanghai University, 149 Yan-Chang Road, Shanghai, China; e-mail: wclu@mail.shu.edu.cn
Y.-H. Jin · W.-C. Lu · Y.-D. Cai, Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai, China
K.-Y. Feng, Division of Imaging Science & Biomedical Engineering, The University of Manchester, Room G424 Stopford Building, Manchester M13 9PT, UK
Y.-D. Cai (corresponding author), Department of Combinatorics and Geometry, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China; e-mail: cyd@picb.ac.cn
G.-Z. Li, School of Computer Science & Engineering, Shanghai University, Shanghai, China

Introduction

Subcellular location, one of the key attributes of a gene product, relates closely to the function of a protein. It is well known that subcellular location plays a crucial role in many aspects of molecular biology, because it involves three essential features of a protein: its biological objective, its biochemical activity, and the place in the cell where the gene product is active. For example, nuclear proteins (e.g., histones, chromosomal proteins) are found in the nucleus of a cell. Knowing the localization of such proteins provides a gateway through which functional analysis may be explored, so the development of a robust algorithm to effectively predict the subcellular location of a new protein sequence is very useful.

Two approaches are commonly used to predict subcellular location [1,2]. One is based on the recognition of protein N-terminal sorting signals; Nakai [3,4] and von Heijne et al. [5] have worked extensively on predicting subcellular location using N-terminal sorting signals. The other is based on amino acid composition, and many attempts using such methods have been made to predict subcellular location [6-17]. In this paper, we input the amino acid compositions into a boosting predictor to tackle the prediction problem [18].

Materials and methods

Materials

The dataset comes from the Reinhardt and Hubbard database [8]. It contains 997 prokaryotic proteins (688 cytoplasmic, 107 extracellular and 202 periplasmic proteins) and 2,427 eukaryotic proteins (684 cytoplasmic, 325 extracellular, 321 mitochondrial and 1,097 nuclear proteins).

According to its amino acid composition, a protein domain can be regarded as a point or a vector in a 20-D space [18]. Suppose there is a set of $N$ proteins, each of which corresponds to a point in this 20-D space. The $k$th protein can then be expressed as

$$X_k = (x_{k,1}, x_{k,2}, \ldots, x_{k,20})^{\mathrm{T}} \quad (k = 1, 2, \ldots, N),$$

where $x_{k,1}, x_{k,2}, \ldots, x_{k,20}$ are the composition components of the 20 amino acids for the $k$th protein. The amino acid composition is taken as the input of the AdaBoost algorithm.
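For illustration only (the paper gives no code), a minimal Python sketch of this mapping from a sequence to its 20-D composition vector; the residue ordering and the function name are our own assumptions:

```python
# Sketch: map one protein sequence to its 20-D amino acid composition
# vector X_k = (x_{k,1}, ..., x_{k,20}).
# The AMINO_ACIDS ordering is an arbitrary assumption, not from the paper.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the 20 composition components for one protein."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    n = 0
    for residue in sequence.upper():
        if residue in counts:        # skip non-standard residues
            counts[residue] += 1
            n += 1
    return [counts[aa] / n for aa in AMINO_ACIDS] if n else [0.0] * 20

# A short hypothetical peptide; the components of one protein sum to 1.
x_k = composition("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
assert abs(sum(x_k) - 1.0) < 1e-9
```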

Methods

AdaBoost is one of the most popular and important methods in ensemble learning. It is derived from Boosting [19], a predictor that tries to improve the prediction results by combining three basic predictors. AdaBoost, introduced by Freund and Schapire [20], combines several basic, weak predictors to output more effective and applicable predictions. The geometrical interpretation of AdaBoost is that it establishes an accurate classification rule (a strong learning algorithm), improving the classification (prediction) ability each time another basic predictor (classifier) is added to the prediction task [21]. The AdaBoost predictor works as follows.

The input dataset can be expressed as $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y_i \in \{-1, +1\}$, where $x_i$ is a vector representing the amino acid composition and the label $y_i$ is the target class. AdaBoost calls a weak or base learning algorithm repeatedly in a series of rounds $t = 1, 2, \ldots, T$. First, each training sample is given an equal initial weight within its class:

$$\omega_{1,i} = \begin{cases} \dfrac{1}{2M}, & y_i = -1 \\ \dfrac{1}{2L}, & y_i = +1 \end{cases} \tag{1}$$

In Eq. 1, $L$ is the number of true samples (samples taken from one category) and $M$ is the number of false samples (samples taken from the other category); the sum of these weights equals 1. The set is then trained for $T$ rounds. The goal of this training is to find an optimal base classifier $h_t$ in each round and to ensemble these classifiers into a strong classifier. After the samples are trained by a base classifier $h_t$, the weight of each incorrectly classified sample is increased (and that of each correctly classified sample decreased), so that the incorrectly classified samples are more likely to be classified correctly in the next round. By doing this, each base classifier $h_t$ [22] will always focus on the hard samples in the training set. This is achieved by updating the weights as

$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{e^{-y_i \sum_{j=1}^{t} \alpha_j h_j(x_i)}}{\prod_{j=1}^{t} Z_j} = \frac{e^{-\mathrm{mrg}(x_i, y_i, f_t)}}{\prod_{j=1}^{t} Z_j}, \tag{2}$$

where $Z_t$ is a normalization factor, $h_t$ is the base classifier, $\alpha_t \in \mathbb{R}$ is a parameter that intuitively measures the importance assigned to $h_t$, and $\mathrm{mrg}(x_i, y_i, f_t) = y_i f_t(x_i)$ is the functional margin of the point $(x_i, y_i)$ with respect to the function $f_t = \sum_{j=1}^{t} \alpha_j h_j$. The normalization factor is

$$Z_t = \sum_{i=1}^{m} D_t(i) \exp(-\alpha_t y_i h_t(x_i)), \tag{3}$$

where $D_t(i)$ is the weight distribution on training example $i$ at round $t$ [23,24]. Hence, the final classifier $H$ is obtained by combining many base classifiers through a weighted majority vote:

$$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right). \tag{4}$$

The skeleton of AdaBoost is as follows [23]:

1. Consider a training set $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $y_i \in \{-1, +1\}$.
2. Initialize and normalize the weights.
3. Repeat for $t = 1, \ldots, T$, executing the following sub-steps: (1) train on the training set with the distribution $D_t$; (2) get the base classifier $h_t$ that results in the least error; (3) update the weights, focusing on the incorrectly classified samples, and set the new weights.
4. Output the final strong classifier $H$.
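The sketch below makes this loop concrete. It is not the authors' code (the paper used Weka's AdaBoost with Random Forests as the base learner); for brevity it uses a decision stump as the weak classifier, and all names are illustrative:

```python
import math

def train_stump(X, y, w):
    """Exhaustive search for the stump h(x) = s if x[j] > t else -s
    with the smallest weighted error under the current distribution w."""
    best = (float("inf"), 0, 0.0, 1)          # (error, feature j, threshold t, sign s)
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (s if xi[j] > t else -s) != yi)
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, T=20):
    """Binary AdaBoost (labels in {-1,+1}); returns [(alpha_t, h_t), ...]."""
    L = sum(1 for yi in y if yi == 1)          # "true" samples
    M = len(y) - L                             # "false" samples
    w = [1/(2*L) if yi == 1 else 1/(2*M) for yi in y]   # Eq. (1); sums to 1
    ensemble = []
    for _ in range(T):
        err, j, t, s = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        h = lambda x, j=j, t=t, s=s: s if x[j] > t else -s
        w = [wi * math.exp(-alpha * yi * h(xi))    # numerator of Eq. (2)
             for xi, yi, wi in zip(X, y, w)]
        Z = sum(w)                             # normalization factor Z_t, Eq. (3)
        w = [wi / Z for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    """Strong classifier H(x) = sign(sum_t alpha_t h_t(x)), Eq. (4)."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```

A real run would substitute a stronger base learner, such as the Random Forests the paper uses through Weka, for the stump.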
The AdaBoost method usually applies to two-class problems. For our four-class problem, the one-vs.-others strategy [25] was adopted to solve the multi-class problem (a sketch follows the Experiment paragraph below). Choosing a proper weak classifier is important for the AdaBoost algorithm to perform well. The following two criteria were used to identify a good weak classifier: (1) a weak classifier should cope with the reweighting of the data; (2) a weak classifier should not result in over-fitting. Hence, a special decision-tree method, Random Forests [26], was selected as the base learning machine (weak classifier).

Experiment

The software package Weka [27], which includes the AdaBoost algorithm, was run under Java [27] on a 1,400 MHz Pentium IV computer. All the learning input data were range-scaled to [0, 1] in this work.
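As a hedged sketch under the same assumptions as above (names hypothetical, reusing the adaboost function from the previous listing, which is not the Weka implementation actually used), the [0, 1] range scaling and the one-vs.-others reduction might look like:

```python
def range_scale(X):
    """Scale each feature to [0, 1] over the dataset, as done in the paper."""
    d = len(X[0])
    lo = [min(x[j] for x in X) for j in range(d)]
    hi = [max(x[j] for x in X) for j in range(d)]
    return [[(x[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(d)] for x in X]

def one_vs_others_train(X, labels, classes, T=20):
    """One binary AdaBoost per location c: class c (+1) vs. all others (-1) [25]."""
    return {c: adaboost(X, [1 if lab == c else -1 for lab in labels], T)
            for c in classes}

def one_vs_others_predict(models, x):
    """Assign x to the location whose ensemble sum(alpha_t * h_t(x)) is largest."""
    return max(models, key=lambda c: sum(a * h(x) for a, h in models[c]))
```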

Results and discussion

Success rate of self-consistency of the AdaBoost algorithm

In the present work, the examination of the self-consistency of the AdaBoost method was performed on the two datasets [8]. The rates of correct prediction for the three possible subcellular locations and for the four possible subcellular locations were all 100%. This indicates that, after being trained, the AdaBoost algorithm can grasp the complicated relationship between amino acid composition and subcellular location.

Success rate of Leave-One-Out Cross Validation (LOOCV) of the AdaBoost algorithm

The performance of the AdaBoost method was evaluated by LOOCV [28]. During this analysis, a model is built with N − 1 proteins and the remaining protein is predicted; each protein is left out from the model derivation and predicted in turn. As a result, the rate of correct prediction for the three possible subcellular locations of prokaryotic proteins was 915/997 = 91.8% (cytoplasmic: 680/688 = 98.8%; extracellular: 79/107 = 73.8%; periplasmic: 156/202 = 77.2%), and the rate of correct prediction for the four possible subcellular locations of eukaryotic proteins was 1,962/2,427 = 80.8% (cytoplasmic: 578/684 = 84.5%; extracellular: 248/325 = 76.3%; mitochondrial: 158/321 = 49.2%; nuclear: 978/1,097 = 89.2%).

Success rate of the jackknife test of the AdaBoost algorithm

To validate the generalization ability and reliability of AdaBoost, the jackknife test was also employed in this study. Each dataset was divided into two parts, a training set and a testing set: two-thirds of the samples were randomly chosen as the training set, and the remaining one-third as the testing set. As a result, the rate of correct prediction for the three possible subcellular locations of prokaryotic proteins was 303/332 = 91.3% (cytoplasmic: 224/229 = 97.8%; extracellular: 29/35 = 82.9%; periplasmic: 50/68 = 73.5%), and the rate of correct prediction for the four possible subcellular locations of eukaryotic proteins was 633/808 = 78.3% (cytoplasmic: 190/228 = 83.3%; extracellular: 75/108 = 69.4%; mitochondrial: 50/107 = 46.7%; nuclear: 318/365 = 87.1%).
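For concreteness, a sketch of the two evaluation protocols just described, reusing the hypothetical helpers above (illustrative only, not the Weka workflow actually run):

```python
import random

def loocv_accuracy(X, labels, classes, T=20):
    """Leave-one-out: for each protein, train on the other N-1 and predict it."""
    correct = 0
    for i in range(len(X)):
        models = one_vs_others_train(X[:i] + X[i+1:],
                                     labels[:i] + labels[i+1:], classes, T)
        correct += one_vs_others_predict(models, X[i]) == labels[i]
    return correct / len(X)

def jackknife_test(X, labels, classes, T=20, seed=0):
    """Random 2/3 training / 1/3 testing split, as in the paper's jackknife test."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = 2 * len(idx) // 3
    tr, te = idx[:cut], idx[cut:]
    models = one_vs_others_train([X[i] for i in tr], [labels[i] for i in tr],
                                 classes, T)
    hits = sum(one_vs_others_predict(models, X[i]) == labels[i] for i in te)
    return hits / len(te)
```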
Comparison with Support Vector Machines (SVMs), the discriminant function and neural networks

In this work, the AdaBoost method was compared with other pattern classification methods such as neural networks [8,9] and SVM [29,30]; see Tables 1 and 2 (self-consistency test) and Tables 3 and 4 (LOOCV) for details. As shown in Tables 1 and 2, the self-consistency rates of the AdaBoost algorithm reach 100% on both the prokaryotic and the eukaryotic dataset, higher than those of the neural networks and SVM.

For LOOCV, as shown in Table 3, the overall rate of correct prediction for the three possible subcellular locations of prokaryotic proteins using the AdaBoost algorithm is 8.8% higher than with neural networks and 6.4% higher than with SVM. The accuracy for cytoplasmic sequences is 5.6% higher than with neural networks and 3.1% higher than with SVM; for extracellular sequences, 6.7% higher than with neural networks; and for periplasmic sequences, 30.3% higher than with neural networks and 13.0% higher than with SVM.

For LOOCV on the eukaryotic dataset, as shown in Table 4, the overall rate of correct prediction for the four possible subcellular locations using the AdaBoost algorithm is higher than with neural networks or SVM. The accuracy for cytoplasmic sequences is 7.1% higher than with neural networks and 7.3% higher than with SVM; for extracellular sequences, 8.3% higher than with neural networks and 11.4% higher than with SVM; for mitochondrial sequences, 14.3% higher than with neural networks and 1.3% lower than with SVM; and for nuclear sequences, 9.0% higher than with neural networks and 3.8% higher than with SVM.

Table 1 Results of the self-consistency test on prokaryotic proteins (997 domains): correct rates (%) for the cytoplasmic, extracellular and periplasmic locations using neural networks, SVM, the discriminant function (90) and AdaBoost.

Table 2 Results of the self-consistency test on eukaryotic proteins (2,427 domains): correct rates (%) for the cytoplasmic, extracellular, mitochondrial and nuclear locations using neural networks, SVM and AdaBoost.

Table 3 Results of LOOCV on prokaryotic proteins (997 domains): correct rates (%) for the cytoplasmic, extracellular and periplasmic locations using neural networks, SVM, the discriminant function (87) and AdaBoost.

Table 4 Results of LOOCV on eukaryotic proteins (2,427 domains): correct rates (%) for the cytoplasmic, extracellular, mitochondrial and nuclear locations using neural networks, SVM and AdaBoost.

An online web server for predicting the subcellular location of prokaryotic and eukaryotic proteins is available at

In our study, the prediction accuracy of AdaBoost is better than that of the other methods, which means that our model predicts subcellular location well. Furthermore, protein function can be predicted from our results, because subcellular location involves the essential features of a protein (e.g., its biological objective and biochemical activity). In addition, our prediction results confirm that the amino acid composition contains sufficient information to distinguish proteins of various subcellular locations at a detailed level, a conclusion consistent with the biological analysis in reference [7]. For example, nuclear proteins are generally poor in hydrophobic residues but rich in charged residues, an observation that correlates with the high content of serine, threonine, proline, asparagine and glutamine residues [7]. In our results, the leave-one-out rate for nuclear proteins reaches 89.2%, which is good compared with the other locations (see Table 4).

Conclusion

We processed two datasets on subcellular location from the Reinhardt and Hubbard database using the AdaBoost algorithm. The performance of the AdaBoost algorithm was verified by the self-consistency test, LOOCV and the jackknife test, and high rates were obtained in all three. Taking these promising results into account, we conclude that AdaBoost is a robust and highly accurate classification technique that can be successfully applied to derive statistical models with good predictive capability for protein location and function. The AdaBoost algorithm should be a complementary tool to the existing pattern recognition methods in bioinformatics.

Acknowledgements This work was supported by the National Natural Science Foundation of China (No. ).

References

1. Eisenhaber F, Bork PW (1998) Subcellular localization of proteins based on sequence. Trends Cell Biol 8:

2. Nakai K (2000) Protein sorting signals and prediction of subcellular localization. Adv Protein Chem 54:
3. Nakai K, Kanehisa M (1991) Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins Struct Funct Genet 1:
4. Nakai K, Kanehisa M (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14:
5. Von Heijne G, Nielsen H, Engelbrecht J, Brunak S (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10:
6. Nakashima H, Nishikawa K (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J Mol Biol 238:
7. Cedano J, Aloy P, Pérez-Pons JA (1997) Relation between amino acid composition and cellular location of proteins. J Mol Biol 266:
8. Reinhardt A, Hubbard T (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res 9:
9. Cai YD, Chou KC (2000) Using neural networks for prediction of subcellular location of prokaryotic and eukaryotic proteins. Mol Cell Biol Res Commun 4:
10. Cai YD, Chou KC (2003) Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem Biophys Res Commun 2:
11. Cai YD, Chou KC (2004) Predicting subcellular localization of proteins in a hybridization space. Bioinformatics 7:
12. Cai YD, Chou KC (2004) Predicting 22 protein localizations in budding yeast. Biochem Biophys Res Commun 2:
13. Chou KC, Elrod DW (1998) Using discriminant function for prediction of subcellular location of prokaryotic proteins. Biochem Biophys Res Commun 252:
14. Chou KC, Elrod DW (1999) Prediction of membrane protein types and subcellular locations. Proteins Struct Funct Genet 34:
15. Chou KC, Elrod DW (1999) Protein subcellular location prediction. Protein Eng 2:
16. Chou KC, Cai YD (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J Mol Biol 48:
17. Yuan Z (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett 1:
18. Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet C, Ares JM, Haussler D, Chou KC (1995) A novel approach to predict protein structural classes in a (20-1)-D amino acid composition space. Proteins Struct Funct Genet 21:
19. Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 26(5):
20. Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 1:
21. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37:
22. Romero E (2004) Margin maximization with feed-forward neural networks: a comparative study with SVM and AdaBoost. Neurocomputing 57:
23. Schapire RE (2002) The boosting approach to machine learning: an overview. MSRI Workshop on Nonlinear Estimation and Classification
24. Duffy N, Helmbold D (2002) A geometric approach to leveraging weak learners. Theor Comput Sci 284:
25. Ding CHQ, Dubchak I (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17:
26. Breiman L (2001) Random forests. Mach Learn
27. Witten IH, Frank E (1999) Data mining: practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco
28. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic Press, London
29. Vapnik V (1998) Statistical learning theory. Wiley-Interscience, New York
30. Chen NY, Lu WC, Li GZ, Yang J (2004) Support vector machine in chemistry. World Scientific Publishing Company, Singapore
