Supplementary Materials for R3P-Loc Web-server

Shibiao Wan and Man-Wai Mak
email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk
June 2014

Contents
1 Introduction to R3P-Loc Server
2 Step-by-step Protocol Guide
  2.1 Inputting Protein Accession Numbers via Copy-and-Paste
  2.2 Inputting Protein Sequences via Copy-and-Paste
  2.3 File-Upload Function
  2.4 Emailing Function
3 Statistical Methods
4 Dataset Construction
1 Introduction to R3P-Loc Server

R3P-Loc is a subcellular-localization predictor that can handle datasets containing both single-label and multi-label proteins. The R3P-Loc server supports two species types (eukaryote and plant) and two input types: amino acid sequences in FASTA format and protein accession numbers in UniProtKB [1] format (see http://www.uniprot.org/manual/accession_numbers). The name R3P-Loc reflects the use of Ridge Regression and Random Projection for predicting the subcellular localization of both single-label and multi-label proteins: the predictor applies random projection to reduce the feature dimensions of an ensemble ridge regression classifier. Similar to many other GO-based predictors [2, 3, 4, 5], R3P-Loc uses gene ontology (GO) as the feature information. The specific algorithms can be found in the paper.

For eukaryotic proteins, R3P-Loc is designed to predict 22 subcellular locations of multi-label eukaryotic proteins. The 22 subcellular locations are: (1) acrosome; (2) cell membrane; (3) cell wall; (4) centrosome; (5) chloroplast; (6) cyanelle; (7) cytoplasm; (8) cytoskeleton; (9) endoplasmic reticulum; (10) endosome; (11) extracellular; (12) golgi apparatus; (13) hydrogenosome; (14) lysosome; (15) melanosome; (16) microsome; (17) mitochondrion; (18) nucleus; (19) peroxisome; (20) spindle pole body; (21) synapse; and (22) vacuole. When the eukaryote option is selected, the predictor is not designed to predict the subcellular localization of non-eukaryotic proteins; the prediction results for non-eukaryotic proteins in this case are therefore arbitrary and meaningless.
Figure 1: Interface of the R3P-Loc web-server.

For plant proteins, R3P-Loc is designed to predict 12 subcellular locations of multi-label plant proteins. The 12 subcellular locations are: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; and (12) vacuole. Note that (11) plastid here covers all plastid groups other than (3) chloroplast. The predictor is not designed to predict the subcellular localization of non-plant proteins; the prediction results for non-plant proteins are therefore arbitrary and meaningless.
Figure 2: Different formats and types of input.

2 Step-by-step Protocol Guide

Fig. 1 shows the interface of the R3P-Loc web-server. As can be seen, using R3P-Loc takes two steps:

1. Select the species type and input type. Fig. 2 shows the four combinations of species types and input types: eukaryotic protein amino acid sequences in FASTA format, eukaryotic protein UniProtKB accession numbers, plant protein amino acid sequences in FASTA format and plant protein UniProtKB accession numbers.
2. Input the query proteins in the form of either FASTA sequences or accession numbers. There are two ways to input the proteins: copy and paste the protein information into the textbox, or upload a file containing the proteins. Users may optionally provide an email address when they upload a file containing many FASTA sequences or accession numbers, in which case the prediction results will be emailed to them.

For users' convenience, several examples of eukaryote sequences, eukaryote accession numbers, plant sequences and plant accession numbers are provided on the R3P-Loc web-server. A help page is also provided to introduce the FASTA format and the UniProtKB accession number format, and the two benchmark datasets are downloadable from the hyperlinks on the web-server. Some brief instructions, covering the significance of subcellular localization prediction, specific information about R3P-Loc and some notes, are provided as well.

To make the R3P-Loc web-server easier to use, the different combinations of species types, input types and ways to input proteins are presented in the following subsections.

2.1 Inputting Protein Accession Numbers via Copy-and-Paste

Fig. 3 shows an example of using accession numbers (ACs) as input. R3P-Loc can handle one or more accession numbers per submission (the server accepts a maximum of 100 accession numbers per submission). Details of UniProtKB ACs can be found on the help page. After prediction, a results page similar to Fig. 4 is shown, which lists the input statistics, the prediction results and a link to a downloadable file containing the prediction results. Fig. 5 shows the details of the downloadable prediction-result file.

Figure 3: An example of using accession numbers as input.

2.2 Inputting Protein Sequences via Copy-and-Paste

Fig. 6 shows an example of using protein amino acid sequences as input. R3P-Loc can handle one or more protein sequences per submission (maximum 10; the updated server accepts a maximum of 50 sequences per submission). Details of the FASTA format can be found on the help page. After prediction, a results page similar to Fig. 7 is shown, which lists the input statistics, the prediction results and a link to a downloadable text file containing the prediction results. Fig. 8 shows the details of the prediction-result file. Besides the final subcellular locations, the prediction results also show the BLAST E-value for each query protein sequence.

Figure 4: Prediction results page for using accession numbers as input.

2.3 File-Upload Function

R3P-Loc allows users to upload a text file containing a list of accession numbers or sequences in FASTA format. Fig. 9 shows an example of uploading a file with a list of accession numbers without providing an email address. In this case, R3P-Loc presents the prediction results in HTML format, as shown in Fig. 10. A text file can also be downloaded from the results page. Fig. 11 shows an example of the downloadable file.

Figure 5: An example of the prediction-result file.

2.4 Emailing Function

To make it easier to send and further process the prediction results, an emailing function is included in R3P-Loc. By providing their email address as shown in Fig. 12, users will receive the prediction results by email. After prediction, an email with contents similar to those in Fig. 13 is sent to the designated email address. The email is titled "Results for your subloc prediction task from R3P-Loc Server" and is sent from the official email address of the R3P-Loc server, namely r3plocserver@gmail.com. Its contents read:

    Dear users,

    Thank you for using our R3P-Loc web-server to predict protein subcellular localization. Attached please find the prediction results of your submissions. You can find more information from our server website. Thank you again for your support.

    Best wishes,
    R3P-Loc Server

The prediction results are saved as an attachment to the email.

Figure 6: An example of using protein amino acid sequences as input.

Figure 7: Prediction results page for using protein sequences as input.

Figure 8: Details of the downloadable prediction-results file.
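Before submitting a large batch, it can help to check the input locally. The following is a minimal sketch, not part of the R3P-Loc server: the function names are hypothetical, the accession pattern is the regular expression published in the UniProt documentation, and the limit of 100 accession numbers is the one stated in Section 2.1. The sequence limit is passed in as a parameter because it differs between the original and the updated server.

```python
import re

# Accession-number pattern as published in the UniProt documentation.
UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

MAX_ACCESSIONS = 100  # per-submission limit stated in Section 2.1


def parse_accessions(text):
    """Split pasted text on whitespace and return (valid, invalid) tokens."""
    tokens = text.split()
    valid = [t for t in tokens if UNIPROT_AC.match(t)]
    invalid = [t for t in tokens if not UNIPROT_AC.match(t)]
    return valid, invalid


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Sequence lines that appear before the first '>' header are ignored.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_limit(items, limit):
    """Raise ValueError if a submission holds more entries than the limit."""
    if len(items) > limit:
        raise ValueError(f"{len(items)} entries exceed the limit of {limit}")
    return items


# Illustrative input only; the last token is deliberately malformed.
valid, invalid = parse_accessions("P12345 Q9Y6K9\nA0A024R161 not_an_ac")
check_limit(valid, MAX_ACCESSIONS)
records = parse_fasta(">seq1 example\nMKV\nLLA\n>seq2\nGGG\n")
```

Entries reported in `invalid` can then be corrected by hand before the list is pasted into the textbox or saved to a file for upload.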
Figure 9: An example of using a file with a list of accession numbers as input without providing an email address.

Figure 10: Prediction results page for using a file input.

Figure 11: Details of the downloadable prediction-results file.

Figure 12: An example of using a file with a list of protein sequences as input and providing an email address.

3 Statistical Methods

In statistical prediction, three methods are often used to test the generalization capabilities of predictors: independent tests, subsampling tests (or K-fold cross-validation) and jackknife tests (or leave-one-out cross-validation, LOOCV).

In independent tests, the training set and the test set are fixed, which yields a fixed accuracy for the predictors. However, the selection of the independent dataset often involves some arbitrariness [6], which inevitably biases the estimated accuracy.

For subsampling tests, we take five-fold cross-validation as an example. The whole dataset is randomly divided into 5 disjoint parts of equal size [7]; the last part may have 1-4 more examples than the other 4 parts so that every example is evaluated. One part of the dataset is then used as the test set and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part chosen as the test set. The number of possible ways of dividing the benchmark dataset is astronomical even for a small dataset, so different divisions lead to different results even for the same benchmark dataset, and subsampling tests are therefore still liable to statistical arbitrariness. Subsampling tests with a smaller K run faster than those with a larger K. Thus, subsampling tests are faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset and N > K. At the same time, subsampling tests are statistically acceptable and are usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out one by one and tested by the classifier trained on the remaining proteins. In this case, the arbitrariness is avoided because LOOCV yields a unique outcome for the predictors. Therefore, LOOCV is considered to be the most rigorous and bias-free method [8]. Hence, LOOCV was used to examine the performance of R3P-Loc against other state-of-the-art predictors.
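The difference between K-fold cross-validation and LOOCV can be illustrated with a short sketch. It is an illustration of the splitting schemes described above, not the code used to evaluate R3P-Loc; the function names and the `seed` parameter are assumptions introduced here. As in the description, the last fold absorbs the remainder when the dataset size is not divisible by K.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k disjoint folds.

    The first k-1 folds hold n // k items each; the last fold also takes
    the remainder, so it may hold up to k-1 extra items.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # a different seed gives a different split
    size = n // k
    folds = [indices[i * size:(i + 1) * size] for i in range(k - 1)]
    folds.append(indices[(k - 1) * size:])
    return folds


def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists, using each fold once as the test set."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def loocv_splits(n):
    """Yield (train, test) index lists for LOOCV, i.e. the case k = n.

    No random shuffling is involved: each sample is held out exactly once,
    which is why the outcome is unique.
    """
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


folds = k_fold_indices(23, 5)
fold_sizes = [len(f) for f in folds]  # [4, 4, 4, 4, 7]
```

With 23 samples and K = 5, the first four folds hold 4 samples each and the last holds 7. Changing `seed` changes which samples fall into which fold, which is the source of the arbitrariness noted above, whereas `loocv_splits` always produces the same N splits.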
Figure 13: An example of the email containing the prediction results.

4 Dataset Construction

R3P-Loc uses two benchmark datasets [2, 9] to evaluate its performance. Both were constructed using the same standard procedures with the same Swiss-Prot version and date of construction (i.e., Swiss-Prot 55.3 on 29-Apr-2008 for the benchmark plant dataset). They differ only in species (eukaryote or plant). Here, we take the plant dataset as an example to illustrate the procedure, which is as follows:

1. Go to the UniProt/Swiss-Prot official webpage (http://www.uniprot.org/);
2. Go to the "Search" section and select "Protein Knowledgebase (UniProtKB)" (the default) in the "Search in" option;
3. In the "Query" option, select or type "reviewed: yes";
4. Select "AND" in the "Advanced Search" option, and then select "Taxonomy [OC]" and type in "Viridiplantae";
5. Select "AND" in the "Advanced Search" option, and then select "Fragment: no";
6. Select "AND" in the "Advanced Search" option, and then select "Sequence length" and type in "50 -" (no less than 50);
7. Select "AND" in the "Advanced Search" option, and then select "Date entry integrated" and type in "-20080429";
8. Select "AND" in the "Advanced Search" option, and then select "Subcellular location: XXX Confidence: Experimental" (XXX denotes the specific subcellular location; here it covers 12 different locations: cell membrane; cell wall; chloroplast; cytoplasm; endoplasmic reticulum; extracellular; golgi apparatus; mitochondrion; nucleus; peroxisome; plastid; vacuole);
9. Further exclude those proteins that are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

Table 1: Breakdown of the multi-label eukaryotic protein dataset. The sequence identity is cut off at 25%. The superscript e denotes the eukaryotic dataset.

Label  Subcellular Location        No. of Locative Proteins
1      Acrosome                    14
2      Cell membrane               697
3      Cell wall                   49
4      Centrosome                  96
5      Chloroplast                 385
6      Cyanelle                    79
7      Cytoplasm                   2186
8      Cytoskeleton                139
9      Endoplasmic reticulum (ER)  457
10     Endosome                    41
11     Extracellular               1048
12     Golgi apparatus             254
13     Hydrogenosome               10
14     Lysosome                    57
15     Melanosome                  47
16     Microsome                   13
17     Mitochondrion               610
18     Nucleus                     2320
19     Peroxisome                  110
20     Spindle pole body (SPI)     68
21     Synapse                     47
22     Vacuole                     170
Total number of locative proteins (N_loc^e)  8897
Total number of actual proteins (N_act^e)    7766

Table 2: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p denotes the plant dataset.

Label  Subcellular Location   No. of Locative Proteins
1      Cell membrane          56
2      Cell wall              32
3      Chloroplast            286
4      Cytoplasm              182
5      Endoplasmic reticulum  42
6      Extracellular          22
7      Golgi apparatus        21
8      Mitochondrion          150
9      Nucleus                152
10     Peroxisome             21
11     Plastid                39
12     Vacuole                52
Total number of locative proteins (N_loc^p)  1055
Total number of actual proteins (N_act^p)    978
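In Tables 1 and 2 the number of locative proteins is larger than the number of actual proteins. Under the usual definition in the cited benchmark papers, a protein annotated with m locations is counted m times as a locative protein but only once as an actual protein. The sketch below illustrates this counting on toy data; the function name, protein names and annotations are hypothetical and are not taken from the benchmark datasets.

```python
from collections import Counter


def dataset_breakdown(annotations):
    """Return (per_location_counts, n_loc, n_act).

    `annotations` maps each protein identifier to its collection of
    subcellular locations. A protein with m locations contributes m
    locative proteins and 1 actual protein.
    """
    per_location = Counter()
    for locations in annotations.values():
        per_location.update(set(locations))  # count each location once per protein
    n_loc = sum(per_location.values())
    n_act = len(annotations)
    return per_location, n_loc, n_act


# Hypothetical toy annotations, for illustration only.
toy = {
    "protA": {"nucleus"},
    "protB": {"nucleus", "cytoplasm"},
    "protC": {"chloroplast", "mitochondrion", "cytoplasm"},
}
per_location, n_loc, n_act = dataset_breakdown(toy)
```

Here three actual proteins yield six locative proteins, in the same way that the per-location counts in each table sum to the locative total rather than to the actual total.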
After selecting the proteins, Blastclust (http://www.ncbi.nlm.nih.gov/web/newsltr/spring04/blastlab.html) was applied to reduce the redundancy in the dataset so that no pair of sequences has a sequence identity higher than 25%. The breakdown of the two benchmark datasets is given in Table 1 and Table 2. Both datasets are accessible from the Datasets page of the R3P-Loc web-server. The R3P-Loc server is available at http://bioinfo.eie.polyu.edu.hk/r3plocserver/.

References

[1] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh, "UniProt: the Universal Protein knowledgebase," Nucleic Acids Res, vol. 32, pp. D115-D119, 2004.
[2] K. C. Chou, Z. C. Wu, and X. Xiao, "iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins," PLoS ONE, vol. 6, no. 3, pp. e18258, 2011.
[3] S. Wan, M. W. Mak, and S. Y. Kung, "GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition," Journal of Theoretical Biology, vol. 323, pp. 40-48, 2013.
[4] S. Wan, M. W. Mak, and S. Y. Kung, "mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines," BMC Bioinformatics, vol. 13, pp. 290, 2012.
[5] S. Wan, M. W. Mak, and S. Y. Kung, "HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins," PLoS ONE, vol. 9, no. 3, pp. e89545, 2014.
[6] K. C. Chou and C. T. Zhang, "Review: Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275-349, 1995.
[7] S. Y. Mei, W. Fei, and S. G. Zhou, "Gene ontology based transfer learning for protein subcellular localization," BMC Bioinformatics, vol. 12, pp. 44, 2011.
[8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
[9] Z. C. Wu, X. Xiao, and K. C. Chou, "iLoc-Plant: A multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites," Molecular BioSystems, vol. 7, pp. 3287-3297, 2011.