Supplementary Materials for mplr-loc Web-server


Shibiao Wan and Man-Wai Mak
email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk
June 2014

Contents
1 Introduction to mplr-loc Server
2 Web-server Functions
2.1 Inputting Protein Accession Numbers via Copy-and-Paste
2.2 Inputting Protein Sequences via Copy-and-Paste
2.3 Inputting Protein Accession Numbers via File-Upload
2.4 Inputting Protein Sequences via File-Upload
3 Statistical Methods
4 Dataset Construction

1 Introduction to mplr-loc Server

mplr-loc is a subcellular-localization predictor that handles datasets containing both single-label and multi-label proteins. The mplr-loc server supports two species types (virus and plant) and two input types: amino acid sequences in FASTA format and protein accession numbers in UniProtKB [1] format (see http://www.uniprot.org/manual/accession_numbers).

mplr-loc stands for multi-label Penalized Logistic Regression for protein subcellular Localization. The predictor extracts features from gene ontology (GO) information and passes them to a multi-label, multi-class penalized logistic regression classifier with an adaptive decision strategy. Compared with traditional GO-based predictors [2, 3, 4, 5], mplr-loc not only predicts the subcellular localization of single- and multi-label proteins rapidly and accurately, but also provides probabilistic confidence scores for its prediction decisions. For each query protein, the mplr-loc web-server returns both the prediction results and a figure showing the probabilistic confidence score for each location. The specific algorithms can be found in the paper.

For virus proteins, mplr-loc is designed to predict 6 subcellular locations of multi-label viral proteins: (1) viral capsid; (2) host cell membrane; (3) host endoplasmic reticulum; (4) host cytoplasm; (5) host nucleus; and (6) secreted. The predictor is not designed for non-viral proteins in this mode, so its prediction results for non-viral proteins are arbitrary and meaningless.

For plant proteins, mplr-loc is designed to predict 12 subcellular locations of multi-label plant proteins: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; and (12) vacuole. Note that (11) plastid covers all plastid groups other than (3) chloroplast. The predictor is not designed for non-plant proteins in this mode, so its prediction results for non-plant proteins are arbitrary and meaningless.

Figure 1: Interface of the mplr-loc web-server.
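To make the idea of turning per-location confidence scores into a multi-label decision concrete, here is a minimal Python sketch. It is not the decision rule used by mplr-loc, which is specified in the paper. The relative-threshold rule, the parameter name theta, its default value and the example scores are all assumptions chosen for illustration. Only the six location names come from the text above.

```python
# Location names follow the virus list given in this section.
VIRUS_LOCATIONS = [
    "viral capsid",
    "host cell membrane",
    "host endoplasmic reticulum",
    "host cytoplasm",
    "host nucleus",
    "secreted",
]


def predict_locations(scores, theta=0.5):
    """Return every location whose score is at least theta * max(scores).

    Assumption: this relative-threshold rule and the value of theta are
    illustrative only; the actual adaptive decision strategy is in the paper.
    """
    if len(scores) != len(VIRUS_LOCATIONS):
        raise ValueError("expected one score per location")
    cutoff = theta * max(scores)
    return [loc for loc, s in zip(VIRUS_LOCATIONS, scores) if s >= cutoff]


# Hypothetical confidence scores for one query protein (not server output).
example_scores = [0.02, 0.93, 0.95, 0.10, 0.05, 0.01]
predicted = predict_locations(example_scores)
print(predicted)
```

With a rule of this kind the top-scoring location is always returned, so a query protein is never left without a prediction, while additional locations are reported only when their scores are comparable to the best one.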

Figure 2: Different formats and types of input (input format and type selection).

2 Web-server Functions

Fig. 1 shows the interface of the mplr-loc web-server. Using mplr-loc takes two steps:

1. Select the species type and input type. Fig. 2 shows the four combinations of species types and input types: plant protein amino acid sequences in FASTA format, plant protein UniProtKB accession numbers, virus protein amino acid sequences in FASTA format and virus protein UniProtKB accession numbers.

2. Input the query proteins as either FASTA sequences or accession numbers (ACs). There are two ways to input the proteins: copy-and-paste the protein information into the textbox, or upload a file containing the proteins.

For large-scale prediction, the mplr-loc web-server supports inputting a batch of proteins in either format (ACs or amino acid sequences). For users' convenience, the web-server provides several examples of plant sequences, plant accession numbers, virus sequences and virus accession numbers. In addition, the two benchmark datasets can be downloaded from hyperlinks on the web-server, and the new independent test set can be downloaded directly from the web-server. Short instructions follow, covering the significance of subcellular localization prediction, specific information about mplr-loc and some usage notes.

Besides predicting the subcellular localization of single- and multi-label proteins rapidly and accurately, mplr-loc provides probabilistic confidence scores for its prediction decisions. For each query protein, a figure shows the probabilistic confidence in assigning the query protein to each location.

To make the mplr-loc web-server easier to use, the following subsections walk through the different combinations of species types, input types and ways to input proteins.

2.1 Inputting Protein Accession Numbers via Copy-and-Paste

Fig. 3 shows an example of using accession numbers (ACs) as input. mplr-loc accepts one or more accession numbers per submission, up to a maximum of 100. After prediction, a results page similar to Fig. 4 is shown, listing the input statistics and prediction results. Fig. 5(a) and Fig. 5(b) show the confidence scores for the two virus protein accession numbers that were input. Red bars mark the predicted locations; blue bars mark the locations to which the protein is predicted not to belong. As can be seen, the first virus AC is predicted as host nucleus with a probabilistic confidence of more than 0.9, while the second virus AC is predicted as host cell membrane and host endoplasmic reticulum, both with a confidence of more than 0.9.

Figure 3: An example of using accession numbers as input (select virus accession numbers; input accession numbers).

Figure 4: Prediction results page for using accession numbers as input.

Figure 5: Confidence scores of the mplr-loc server for the virus protein accession numbers input in Fig. 3. (a) The first virus accession number; (b) the second virus accession number.

2.2 Inputting Protein Sequences via Copy-and-Paste

Fig. 6 shows an example of using protein amino acid sequences as input. mplr-loc accepts one or more protein sequences per submission; the updated server allows a maximum of 50. After prediction, a results page similar to Fig. 7 is shown, listing the input statistics and prediction results. Besides the final subcellular locations, the prediction results also show the BLAST E-value for each query protein sequence. Fig. 8(a) and Fig. 8(b) show the confidence scores for the two plant protein sequences that were input.

Figure 6: An example of using protein amino acid sequences as input (select plant protein sequences; input protein sequences).

Figure 7: Prediction results page for using accession numbers as input.

Figure 8: Confidence scores of the mplr-loc server for the plant protein sequences input in Fig. 6. (a) The first plant amino-acid sequence; (b) the second plant amino-acid sequence.

2.3 Inputting Protein Accession Numbers via File-Upload

mplr-loc allows users to upload a text file containing a list of accession numbers or sequences in FASTA format. Fig. 9 shows an example of uploading a file with a list of accession numbers. In this case, mplr-loc presents the prediction results in HTML format, as shown in Fig. 10. Fig. 11(a) and Fig. 11(b) show the confidence scores for the two plant protein accession numbers that were input.

2.4 Inputting Protein Sequences via File-Upload

As in Section 2.3, the uploaded text file may contain either accession numbers or sequences in FASTA format. Fig. 12 shows an example of uploading a file with a list of protein sequences. In this case, mplr-loc presents the prediction results in HTML format, as shown in Fig. 13. Fig. 14 shows the confidence scores for the plant protein sequence input.
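Before submitting a batch, it can help to check the input locally against the per-submission limits stated above (100 accession numbers, 50 sequences). The following Python sketch shows one way to do so. The function names are hypothetical, the assumption that accession numbers are separated by whitespace is ours, and the example records are made-up placeholders. None of this is part of the server.

```python
MAX_ACCESSIONS = 100  # per-submission limit stated for accession numbers
MAX_SEQUENCES = 50    # per-submission limit stated for sequences


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_accessions(text):
    """Split pasted text into accession numbers (assumed whitespace-separated)."""
    return text.split()


def check_batch(items, limit):
    """Return items unchanged, or raise if the batch is empty or too large."""
    if not items:
        raise ValueError("no input proteins found")
    if len(items) > limit:
        raise ValueError(f"{len(items)} items exceeds the limit of {limit}")
    return items


# Made-up placeholder input, for illustration only.
demo_fasta = ">query1 demo\nMKTAYIAK\nQRQISF\n>query2\nMSSHHHH\n"
demo_records = check_batch(parse_fasta(demo_fasta), MAX_SEQUENCES)
demo_acs = check_batch(parse_accessions("AC0001 AC0002\nAC0003"), MAX_ACCESSIONS)
print(len(demo_records), len(demo_acs))
```

The same two parsers can be applied to the contents of a text file before using the file-upload route.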

Figure 9: An example of using a file with a list of accession numbers as input (select plant accession numbers; input file with a list of protein accession numbers).

Figure 10: Prediction results page for using a file with a list of accession numbers as input.

Figure 11: Confidence scores of the mplr-loc server for the plant protein accession numbers input in Fig. 9. (a) The first plant accession number; (b) the second plant accession number.

Figure 12: An example of using a file with a list of protein sequences as input (select plant sequences; input file with a list of protein sequences).

Figure 13: Prediction results page for using a file with a list of protein sequences as input.

Figure 14: Confidence scores of the mplr-loc server for the plant protein sequences input in Fig. 12.

3 Statistical Methods

In statistical prediction, three methods are often used to test the generalization capabilities of predictors: independent tests, subsampling tests (or K-fold cross-validation) and jackknife tests (or leave-one-out cross-validation, LOOCV for short).

In independent tests, the training set and the test set are fixed, which yields a single, fixed accuracy for a predictor. However, the selection of the independent dataset often involves some arbitrariness [6], so the resulting accuracy is not free of bias.

For subsampling tests, we take five-fold cross-validation as an example. The whole dataset is randomly divided into 5 disjoint parts of equal size; the last part may have 1-4 more examples than the other 4 parts so that every example is evaluated. One part is then used as the test set and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part as the test set. The number of possible ways to partition the benchmark dataset is astronomically large even for a small dataset. Different partitions therefore lead to different results even on the same benchmark dataset, so subsampling tests are still liable to statistical arbitrariness. Subsampling tests with a smaller K run faster than those with a larger K. Thus, subsampling tests are faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset and N > K. At the same time, subsampling tests are statistically acceptable and usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out one by one and tested by the classifier trained on the remaining proteins. In this case, arbitrariness is avoided because LOOCV yields a unique outcome for a predictor. LOOCV is therefore considered the most rigorous and bias-free of the three methods [7]. Hence, LOOCV was used to compare the performance of mplr-loc against other state-of-the-art predictors.
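The partitioning described in Section 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the evaluation code behind mplr-loc: the function names and the seed parameter are hypothetical, and the dataset size of 207 is taken from the number of actual proteins in the virus benchmark dataset (Table 1). As in the text, the last fold absorbs the remainder, and LOOCV is obtained by setting K equal to N.

```python
import random


def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint test folds.

    The first k-1 folds hold n // k items each; the last fold also takes the
    remainder, so every example is tested exactly once.
    """
    if not 1 < k <= n:
        raise ValueError("need 1 < k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # random partition; depends on the seed
    size = n // k
    folds = [idx[i * size:(i + 1) * size] for i in range(k - 1)]
    folds.append(idx[(k - 1) * size:])
    return folds


def train_test_splits(n, k, seed=0):
    """Yield (train, test) index lists for each of the k rounds."""
    folds = kfold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


five_fold = kfold_indices(207, 5)   # five-fold cross-validation
loocv = kfold_indices(207, 207)     # LOOCV: one protein per test fold
print([len(f) for f in five_fold])  # [41, 41, 41, 41, 43]
print(len(loocv))                   # 207
```

Changing the seed changes which proteins land in each of the five folds, which is the arbitrariness noted above. For LOOCV the set of single-protein test folds is the same for any seed, which is why its outcome is unique. The cost is N training runs instead of K.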

4 Dataset Construction

mplr-loc uses two benchmark datasets [8, 9] and a new independent test set [4] to evaluate its performance. All of them were constructed with the same standard procedures. They differ in species (virus or plant) and in the Swiss-Prot version and date of construction: Swiss-Prot 57.9, released on 22-Sept-2009, for the benchmark virus dataset; Swiss-Prot 55.3, released on 29-Apr-2008, for the benchmark plant dataset; and Swiss-Prot entries created between 08-Mar-2011 and 18-Apr-2012 for the new plant dataset. Here, we take the new plant dataset as an example to illustrate the procedures:

1. Go to the UniProt/SwissProt official webpage (http://www.uniprot.org/);
2. Go to the "Search" section and select "Protein Knowledgebase (UniProtKB)" (default) in the "Search in" option;
3. In the "Query" option, select or type "reviewed: yes";
4. Select "AND" in the "Advanced Search" option, and then select "Taxonomy [OC]" and type in "Viridiplantae";
5. Select "AND" in the "Advanced Search" option, and then select "Fragment: no";
6. Select "AND" in the "Advanced Search" option, and then select "Sequence length" and type in "50 -" (no less than 50);
7. Select "AND" in the "Advanced Search" option, and then select "Date entry integrated" and type in "20110308-20120418";
8. Select "AND" in the "Advanced Search" option, and then select "Subcellular location: XXX Confidence: Experimental", where XXX is the specific subcellular location. There are 12 different locations: cell membrane; cell wall; chloroplast; cytoplasm; endoplasmic reticulum; extracellular; golgi apparatus; mitochondrion; nucleus; peroxisome; plastid; vacuole;
9. Further exclude those proteins that are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

After the proteins were selected, Blastclust (http://www.ncbi.nlm.nih.gov/web/newsltr/spring04/blastlab.html) was applied to reduce the redundancy in the dataset so that no pair of sequences has a sequence identity higher than 25%.

Table 1: Breakdown of the multi-label virus protein dataset. The sequence identity is cut off at 25%. The superscript v stands for the virus dataset.

Label  Subcellular Location         No. of Locative Proteins
1      Viral capsid                 8
2      Host cell membrane           33
3      Host endoplasmic reticulum   20
4      Host cytoplasm               87
5      Host nucleus                 84
6      Secreted                     20
Total number of locative proteins (N_loc^v): 252
Total number of actual proteins (N_act^v): 207

Table 2: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p stands for the plant dataset.

Label  Subcellular Location         No. of Locative Proteins
1      Cell membrane                56
2      Cell wall                    32
3      Chloroplast                  286
4      Cytoplasm                    182
5      Endoplasmic reticulum        42
6      Extracellular                22
7      Golgi apparatus              21
8      Mitochondrion                150
9      Nucleus                      152
10     Peroxisome                   21
11     Plastid                      39
12     Vacuole                      52
Total number of locative proteins (N_loc^p): 1055
Total number of actual proteins (N_act^p): 978

Table 3: Breakdown of the new plant dataset. The dataset was constructed from Swiss-Prot entries created between 08-Mar-2011 and 18-Apr-2012. The sequence identity of the dataset is below 25%.

Label  Subcellular Location         No. of Locative Proteins
1      Cell membrane                16
2      Cell wall                    1
3      Chloroplast                  54
4      Cytoplasm                    38
5      Endoplasmic reticulum        9
6      Extracellular                3
7      Golgi apparatus              7
8      Mitochondrion                16
9      Nucleus                      46
10     Peroxisome                   6
11     Plastid                      1
12     Vacuole                      7
Total number of locative proteins: 204
Total number of actual proteins: 175
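The totals in Tables 1-3 can be checked with a short Python sketch. The per-location counts and the numbers of actual proteins are copied from the tables; the variable and function names are hypothetical. We also assume the usual convention that a protein found in m locations is counted as m locative proteins, which is why the locative totals exceed the numbers of actual proteins.

```python
# Per-location counts of locative proteins, copied from Tables 1-3.
virus = {
    "Viral capsid": 8, "Host cell membrane": 33,
    "Host endoplasmic reticulum": 20, "Host cytoplasm": 87,
    "Host nucleus": 84, "Secreted": 20,
}
plant = {
    "Cell membrane": 56, "Cell wall": 32, "Chloroplast": 286,
    "Cytoplasm": 182, "Endoplasmic reticulum": 42, "Extracellular": 22,
    "Golgi apparatus": 21, "Mitochondrion": 150, "Nucleus": 152,
    "Peroxisome": 21, "Plastid": 39, "Vacuole": 52,
}
new_plant = {
    "Cell membrane": 16, "Cell wall": 1, "Chloroplast": 54,
    "Cytoplasm": 38, "Endoplasmic reticulum": 9, "Extracellular": 3,
    "Golgi apparatus": 7, "Mitochondrion": 16, "Nucleus": 46,
    "Peroxisome": 6, "Plastid": 1, "Vacuole": 7,
}

# Numbers of actual (distinct) proteins, also from the tables.
actual = {"virus": 207, "plant": 978, "new_plant": 175}


def locative_total(table):
    """Sum the per-location counts to get the number of locative proteins."""
    return sum(table.values())


def mean_locations(table, n_actual):
    """Average number of locations per protein.

    Assumption: a protein with m locations counts as m locative proteins.
    """
    return locative_total(table) / n_actual


totals = {
    "virus": locative_total(virus),
    "plant": locative_total(plant),
    "new_plant": locative_total(new_plant),
}
print(totals)  # {'virus': 252, 'plant': 1055, 'new_plant': 204}
print(round(mean_locations(virus, actual["virus"]), 2))  # 1.22
```

Under that convention, the ratio of locative to actual proteins gives a rough sense of how much multi-labelling each dataset contains.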

The breakdowns of the two benchmark datasets and the new plant dataset are listed in Table 1, Table 2 and Table 3, respectively. All datasets are accessible from the Datasets page of the mplr-loc web-server. The mplr-loc server is available at http://bioinfo.eie.polyu.edu.hk/mplrlocserver/.

References

[1] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh, "UniProt: the Universal Protein knowledgebase," Nucleic Acids Res, vol. 32, pp. D115-D119, 2004.

[2] K. C. Chou, Z. C. Wu, and X. Xiao, "iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins," PLoS ONE, vol. 6, no. 3, pp. e18258, 2011.

[3] S. Wan, M. W. Mak, and S. Y. Kung, "GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition," Journal of Theoretical Biology, vol. 323, pp. 40-48, 2013.

[4] S. Wan, M. W. Mak, and S. Y. Kung, "mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines," BMC Bioinformatics, vol. 13, pp. 290, 2012.

[5] S. Wan, M. W. Mak, and S. Y. Kung, "HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins," PLoS ONE, vol. 9, no. 3, pp. e89545, 2014.

[6] K. C. Chou and C. T. Zhang, "Review: Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275-349, 1995.

[7] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.

[8] X. Xiao, Z. C. Wu, and K. C. Chou, "iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites," Journal of Theoretical Biology, vol. 284, pp. 42-51, 2011.

[9] Z. C. Wu, X. Xiao, and K. C. Chou, "iLoc-Plant: A multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites," Molecular BioSystems, vol. 7, pp. 3287-3297, 2011.