Supplementary Materials for mPLR-Loc Web-server

Shibiao Wan and Man-Wai Mak
email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk
June 2014

Contents
1 Introduction to mPLR-Loc Server
2 Web-server Functions
2.1 Inputting Protein Accession Numbers via Copy-and-Paste
2.2 Inputting Protein Sequences via Copy-and-Paste
2.3 Inputting Protein Accession Numbers via File-Upload
2.4 Inputting Protein Sequences via File-Upload
3 Statistical Methods
4 Dataset Construction

1 Introduction to mPLR-Loc Server

mPLR-Loc is a subcellular-localization predictor that can deal with datasets containing both single-label and multi-label proteins. The mPLR-Loc server supports two species types (virus and plant) and two input types (amino acid sequences in FASTA format and protein accession numbers in UniProtKB [1] format; see http://www.uniprot.org/manual/accession_numbers).

mPLR-Loc stands for multi-label Penalized Logistic Regression for protein subcellular Localization: the predictor extracts feature information from gene ontology (GO) information and then processes it with a multi-label multi-class penalized logistic regression classifier with an adaptive decision strategy. Compared to traditional GO-based predictors [2, 3, 4, 5], mPLR-Loc can not only rapidly and accurately predict the subcellular localization of single- and multi-label proteins, but also provide probabilistic confidence scores for the prediction decisions. For each query protein, the mPLR-Loc web-server gives both the prediction results and a figure showing the probabilistic confidence score for each location. The specific algorithms can be found in the paper.

For virus proteins, mPLR-Loc is designed to predict 6 subcellular locations of multi-label viral proteins: (1) viral capsid; (2) host cell membrane; (3) host endoplasmic reticulum; (4) host cytoplasm; (5) host nucleus; and (6) secreted. The predictor is not designed for predicting the subcellular localization of non-viral proteins. Therefore, the prediction results for non-viral proteins are arbitrary and meaningless.

Figure 1: Interface of the mPLR-Loc web-server.

For plant proteins, mPLR-Loc is designed to predict 12 subcellular locations of multi-label plant proteins: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) Golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; and (12) vacuole. Note that (11) plastid covers all plastid groups other than (3) chloroplast. The predictor is not designed for predicting the subcellular localization of non-plant proteins. Therefore, the prediction results for non-plant proteins are arbitrary and meaningless.
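The exact adaptive decision strategy mentioned above is described in the paper rather than in this document. As a rough illustration of how per-location confidence scores could be turned into a multi-label prediction, the sketch below assumes one plausible rule: predict every location whose score reaches a fraction theta of the top score. The function name, the parameter theta and the scores are all hypothetical and are not taken from the mPLR-Loc implementation.

```python
def adaptive_decision(scores, theta=0.5):
    """Return the locations whose score is at least theta * (highest score).

    ASSUMPTION: this relative-threshold rule is only an illustration of an
    adaptive multi-label decision; it is not confirmed by the document.
    scores: dict mapping location name -> confidence score in [0, 1].
    theta:  fraction in [0, 1]; theta = 1 keeps only the top location(s).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    threshold = theta * max(scores.values())
    # The top-scoring location always passes, so at least one label is returned.
    return [loc for loc, s in scores.items() if s >= threshold]


# Hypothetical scores for a viral protein with two locations.
example_scores = {
    "viral capsid": 0.02,
    "host cell membrane": 0.95,
    "host endoplasmic reticulum": 0.93,
    "host cytoplasm": 0.10,
    "host nucleus": 0.05,
    "secreted": 0.01,
}
predicted = adaptive_decision(example_scores, theta=0.5)
```

With these made-up scores, both host cell membrane and host endoplasmic reticulum pass the threshold, which mirrors the kind of two-location result the server reports for multi-label proteins.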

2 Web-server Functions

Fig. 1 shows the interface of the mPLR-Loc web-server. As can be seen, there are two steps to use mPLR-Loc:

1. Select the species type and input type. Fig. 2 shows the four combinations of species types and input types: plant protein amino acid sequences in FASTA format, plant protein UniProtKB accession numbers, virus protein amino acid sequences in FASTA format and virus protein UniProtKB accession numbers.

2. Input the query proteins in the form of either FASTA sequences or accession numbers (ACs). There are two ways to input the proteins: copy-and-paste the protein information into the textbox, or upload a file containing the proteins. Inputting a batch of proteins in either format (ACs or amino acid sequences) is supported by the mPLR-Loc web-server for large-scale prediction.

Figure 2: Different formats and types of input (input format and type selection).

For users' convenience, several examples of plant sequences, plant accession numbers, virus sequences and virus accession numbers are provided in the mPLR-Loc web-server. In addition, the two benchmark datasets can be downloaded via hyperlinks in the web-server, and the new independent test set can be downloaded directly from the web-server. Brief instructions, covering the significance of subcellular localization prediction, specific information about mPLR-Loc and some notes, are also provided.

In addition to rapidly and accurately predicting the subcellular localization of single- and multi-label proteins, mPLR-Loc provides probabilistic confidence scores for the prediction decisions. For each query protein, a figure showing the probabilistic confidence in assigning the query protein to each location is also provided.

For readers' ease of using the mPLR-Loc web-server, the different combinations of species types, input types and ways to input proteins are presented in the following subsections.

2.1 Inputting Protein Accession Numbers via Copy-and-Paste

Fig. 3 shows an example of using accession numbers (ACs) as input. mPLR-Loc can deal with one or more accession numbers per submission (the server accepts a maximum of 100 accession numbers per submission). After prediction, a results page similar to Fig. 4 is shown, listing the input statistics and prediction results. Fig. 5(a) and Fig. 5(b) show the confidence scores for the two virus protein ACs that were input. The red bars represent the predicted locations, and the blue bars represent the locations in which the protein is predicted not to reside. As can be seen, the first virus AC is predicted as host nucleus with a probabilistic confidence of more than 0.9, while the second virus AC is predicted as host cell membrane and host endoplasmic reticulum, both with a confidence of more than 0.9.

Figure 3: An example of using accession numbers as input (select virus accession numbers; input accession numbers).

Figure 4: Prediction results page for using accession numbers as input.

Figure 5: Confidence scores of the mPLR-Loc server for the virus protein accession numbers input in Fig. 3. (a) The 1st virus accession number; (b) the 2nd virus accession number.

2.2 Inputting Protein Sequences via Copy-and-Paste

Fig. 6 shows an example of using protein amino acid sequences as input. mPLR-Loc can deal with one or more protein sequences per submission (the updated server accepts a maximum of 50 sequences per submission). After prediction, a results page similar to Fig. 7 is shown, listing the input statistics and prediction results. Within the prediction results, besides the final subcellular locations, the BLAST E-value is also shown for each query protein sequence. Fig. 8(a) and Fig. 8(b) show the confidence scores for the two plant protein sequences that were input.

Figure 6: An example of using protein amino acid sequences as input (select plant protein sequences; input protein sequences).

2.3 Inputting Protein Accession Numbers via File-Upload

mPLR-Loc allows users to upload a text file containing a list of accession numbers or sequences in FASTA format. Fig. 9 shows an example of uploading a file with a list of accession numbers. In this case, mPLR-Loc presents the prediction results in HTML format, as shown in Fig. 10. Fig. 11(a) and Fig. 11(b) show the confidence scores for the two plant protein accession numbers that were input.

Figure 7: Prediction results page for using protein sequences as input.

Figure 8: Confidence scores of the mPLR-Loc server for the plant protein sequences input in Fig. 6. (a) The 1st plant amino-acid sequence; (b) the 2nd plant amino-acid sequence.

2.4 Inputting Protein Sequences via File-Upload

mPLR-Loc allows users to upload a text file containing a list of accession numbers or sequences in FASTA format. Fig. 12 shows an example of uploading a file with a list of protein sequences. In this case, mPLR-Loc presents the prediction results in HTML format, as shown in Fig. 13. Fig. 14 shows the confidence scores for the plant protein sequence input.
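Before submitting a batch by copy-and-paste or file upload, it can be convenient to check locally that the input respects the per-submission limits stated above (100 accession numbers, 50 sequences). The following is a minimal sketch, not part of the server: the function names are hypothetical, accession numbers are assumed to be separated by whitespace, and the regular expression follows the accession number format published by UniProt.

```python
import re

MAX_ACCESSIONS = 100  # per-submission limit stated for accession numbers
MAX_SEQUENCES = 50    # per-submission limit stated for sequences

# ASSUMPTION: pattern for UniProtKB accession numbers as documented by UniProt.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def parse_accessions(text):
    """Split whitespace-separated accession numbers and check count and format."""
    tokens = text.split()
    bad = [t for t in tokens if not ACCESSION_RE.fullmatch(t)]
    if bad:
        raise ValueError(f"not valid accession numbers: {bad}")
    if len(tokens) > MAX_ACCESSIONS:
        raise ValueError(f"{len(tokens)} accession numbers exceed the limit of {MAX_ACCESSIONS}")
    return tokens


def parse_fasta(text):
    """Parse FASTA text into (header, sequence) pairs and check the count."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"{len(records)} sequences exceed the limit of {MAX_SEQUENCES}")
    return records
```

For example, `parse_accessions("P12345 Q8WZ42")` returns the two accession numbers unchanged, while a list of 101 entries raises an error before anything is sent to the server.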

Figure 9: An example of using a file with a list of accession numbers as input (select plant accession numbers; input file with a list of protein accession numbers).

Figure 10: Prediction results page for using a file with a list of accession numbers as input.

Figure 11: Confidence scores of the mPLR-Loc server for the plant protein accession numbers input in Fig. 9. (a) The 1st plant accession number; (b) the 2nd plant accession number.

Figure 12: An example of using a file with a list of protein sequences as input (select plant sequences; input file with a list of protein sequences).

3 Statistical Methods

In statistical prediction, three methods are often used to test the generalization capabilities of predictors: independent tests, subsampling tests (or K-fold cross-validation) and jackknife tests (or leave-one-out cross-validation, LOOCV).

In independent tests, the training set and the test set are fixed, which yields a fixed accuracy for a predictor. However, the selection of the independent dataset often involves some arbitrariness [6], which inevitably biases the estimated accuracy.

In subsampling tests, taking five-fold cross-validation as an example, the whole dataset is randomly divided into 5 disjoint parts of equal size; the last part may have 1-4 more examples than each of the first 4 parts so that every example is evaluated. One part is then used as the test set and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part as the test set. The number of possible ways of dividing the benchmark dataset is an astronomical figure even for a small dataset. Different divisions therefore lead to different results even for the same benchmark dataset, so subsampling tests are still liable to statistical arbitrariness. Subsampling tests with a smaller K run faster than those with a larger K. Thus, subsampling tests are faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset and N > K. At the same time, subsampling tests are statistically acceptable and are usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out one by one and tested by the classifier trained on the remaining proteins. In this case, arbitrariness is avoided because LOOCV yields a unique outcome for a predictor. Therefore, LOOCV is considered to be the most rigorous and bias-free method [7]. Hence, LOOCV was used to compare the performance of mPLR-Loc against other state-of-the-art predictors.
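The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the function names and the fixed random seed are assumptions, and the last fold absorbs the remainder as described for five-fold cross-validation. Setting K equal to N gives LOOCV.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split range(n_samples) into k disjoint folds.

    The first k - 1 folds have n_samples // k items each; the last fold
    also takes the remaining items, so every sample is tested exactly once.
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # the random division is where arbitrariness enters
    base = n_samples // k
    folds = []
    for i in range(k):
        start = i * base
        end = (i + 1) * base if i < k - 1 else n_samples
        folds.append(indices[start:end])
    return folds


def train_test_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Five-fold split of a dataset with 207 proteins (size of the virus benchmark).
five_fold_sizes = [len(f) for f in k_fold_indices(207, 5)]

# LOOCV is the special case k = n_samples: each round tests a single protein.
loocv_rounds = list(train_test_splits(207, 207))
```

With 207 proteins and K = 5, the first four folds hold 41 proteins each and the last holds 43. Changing the seed changes the five-fold partition, whereas the set of LOOCV rounds is the same for any seed, which is the uniqueness property noted above.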

Figure 13: Prediction results page for using a file with a list of protein sequences as input.

Figure 14: Confidence scores of the mPLR-Loc server for the plant protein sequences input in Fig. 12.

4 Dataset Construction

mPLR-Loc uses two benchmark datasets [8, 9] and a new independent test set [4] to evaluate its performance. All of them were constructed using the same standard procedure. They differ in species (virus or plant) and in the Swiss-Prot version and date of construction: Swiss-Prot 57.9, released on 22-Sept-2009, for the benchmark virus dataset; Swiss-Prot 55.3, released on 29-Apr-2008, for the benchmark plant dataset; and entries dated between 08-Mar-2011 and 18-Apr-2012 for the new plant dataset. Here, we take the new plant dataset as an example to illustrate the procedure:

1. Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/);
2. Go to the Search section and select Protein Knowledgebase (UniProtKB) (default) in the Search in option;
3. In the Query option, select or type reviewed: yes;
4. Select AND in the Advanced Search option, and then select Taxonomy [OC] and type in Viridiplantae;
5. Select AND in the Advanced Search option, and then select Fragment: no;
6. Select AND in the Advanced Search option, and then select Sequence length and type in 50 - (no less than 50);
7. Select AND in the Advanced Search option, and then select Date entry integrated and type in 20110308-20120418;
8. Select AND in the Advanced Search option, and then select Subcellular location: XXX Confidence: Experimental; (XXX means the specific subcellular location. Here it covers 12 different locations: cell membrane; cell wall; chloroplast; cytoplasm; endoplasmic reticulum; extracellular; Golgi apparatus; mitochondrion; nucleus; peroxisome; plastid; vacuole.)
9. Further exclude those proteins which are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

After selecting the proteins, Blastclust (http://www.ncbi.nlm.nih.gov/web/newsltr/spring04/blastlab.html) was applied to reduce the redundancy in the dataset so that no pair of sequences has a sequence identity higher than 25%.

Table 1: Breakdown of the multi-label virus protein dataset. The sequence identity is cut off at 25%. The superscript v stands for the virus dataset.

Label   Subcellular Location          No. of Locative Proteins
1       Viral capsid                  8
2       Host cell membrane            33
3       Host endoplasmic reticulum    20
4       Host cytoplasm                87
5       Host nucleus                  84
6       Secreted                      20
Total number of locative proteins (N_loc^v)   252
Total number of actual proteins (N_act^v)     207

Table 2: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p stands for the plant dataset.

Label   Subcellular Location          No. of Locative Proteins
1       Cell membrane                 56
2       Cell wall                     32
3       Chloroplast                   286
4       Cytoplasm                     182
5       Endoplasmic reticulum         42
6       Extracellular                 22
7       Golgi apparatus               21
8       Mitochondrion                 150
9       Nucleus                       152
10      Peroxisome                    21
11      Plastid                       39
12      Vacuole                       52
Total number of locative proteins (N_loc^p)   1055
Total number of actual proteins (N_act^p)     978

Table 3: Breakdown of the new plant dataset. The dataset was constructed from Swiss-Prot entries created between 08-Mar-2011 and 18-Apr-2012. The sequence identity of the dataset is below 25%.

Label   Subcellular Location          No. of Locative Proteins
1       Cell membrane                 16
2       Cell wall                     1
3       Chloroplast                   54
4       Cytoplasm                     38
5       Endoplasmic reticulum         9
6       Extracellular                 3
7       Golgi apparatus               7
8       Mitochondrion                 16
9       Nucleus                       46
10      Peroxisome                    6
11      Plastid                       1
12      Vacuole                       7
Total number of locative proteins             204
Total number of actual proteins               175
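The totals reported in Tables 1 to 3 can be cross-checked with a short script. The sketch below assumes the usual convention for these multi-label datasets, namely that a protein annotated with k locations is counted k times among the locative proteins, so the ratio of locative to actual proteins is the average number of locations per protein. The counts are copied from the tables; the variable and function names are illustrative only.

```python
# Per-location counts (labels in table order) and reported totals from Tables 1-3.
DATASETS = {
    "virus benchmark (Table 1)": {
        "counts": [8, 33, 20, 87, 84, 20],
        "n_loc": 252,
        "n_act": 207,
    },
    "plant benchmark (Table 2)": {
        "counts": [56, 32, 286, 182, 42, 22, 21, 150, 152, 21, 39, 52],
        "n_loc": 1055,
        "n_act": 978,
    },
    "new plant test set (Table 3)": {
        "counts": [16, 1, 54, 38, 9, 3, 7, 16, 46, 6, 1, 7],
        "n_loc": 204,
        "n_act": 175,
    },
}


def totals_consistent(entry):
    """Check that the per-location counts add up to the reported locative total."""
    return sum(entry["counts"]) == entry["n_loc"]


def mean_locations_per_protein(entry):
    """Average number of locations per protein, assuming the k-times counting convention."""
    return entry["n_loc"] / entry["n_act"]


for name, entry in DATASETS.items():
    status = "consistent" if totals_consistent(entry) else "MISMATCH"
    print(f"{name}: totals {status}, "
          f"{mean_locations_per_protein(entry):.2f} locations per protein on average")
```

In all three datasets the locative total exceeds the number of actual proteins, which reflects the presence of multi-location proteins.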

The breakdowns of the two benchmark datasets and the new plant dataset are listed in Table 1, Table 2 and Table 3, respectively. All the datasets are accessible from the Datasets page of the mPLR-Loc web-server. The mPLR-Loc server is available at http://bioinfo.eie.polyu.edu.hk/mplrlocserver/.

References

[1] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh, "UniProt: the Universal Protein knowledgebase," Nucleic Acids Res, vol. 32, pp. D115-D119, 2004.

[2] K. C. Chou, Z. C. Wu, and X. Xiao, "iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins," PLoS ONE, vol. 6, no. 3, pp. e18258, 2011.

[3] S. Wan, M. W. Mak, and S. Y. Kung, "GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition," Journal of Theoretical Biology, vol. 323, pp. 40-48, 2013.

[4] S. Wan, M. W. Mak, and S. Y. Kung, "mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines," BMC Bioinformatics, vol. 13, pp. 290, 2012.

[5] S. Wan, M. W. Mak, and S. Y. Kung, "HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins," PLoS ONE, vol. 9, no. 3, pp. e89545, 2014.

[6] K. C. Chou and C. T. Zhang, "Review: Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275-349, 1995.

[7] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.

[8] X. Xiao, Z. C. Wu, and K. C. Chou, "iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites," Journal of Theoretical Biology, vol. 284, pp. 42-51, 2011.

[9] Z. C. Wu, X. Xiao, and K. C. Chou, "iLoc-Plant: A multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites," Molecular BioSystems, vol. 7, pp. 3287-3297, 2011.