Supplementary Materials for R3P-Loc Web-server

Shibiao Wan and Man-Wai Mak
email: shibiao.wan@connect.polyu.hk, enmwmak@polyu.edu.hk
June 2014

Contents
1 Introduction to R3P-Loc Server
2 Step-by-step Protocol Guide
  2.1 Inputting Protein Accession Numbers via Copy-and-Paste
  2.2 Inputting Protein Sequences via Copy-and-Paste
  2.3 File-Upload Function
  2.4 Emailing Function
3 Statistical Methods
4 Dataset Construction

1 Introduction to R3P-Loc Server

R3P-Loc is a subcellular-localization predictor that handles datasets containing both single-label (single-location) and multi-label (multi-location) proteins. The R3P-Loc server supports two species types (eukaryote and plant) and two input types: amino acid sequences in FASTA format and protein accession numbers in UniProtKB [1] format (see http://www.uniprot.org/manual/accession_numbers).

The name R3P-Loc refers to the use of Ridge Regression and Random Projection for predicting the subcellular localization of both single-label and multi-label proteins: the predictor applies random projection to reduce the feature dimensions of an ensemble ridge regression classifier. Similar to many other GO-based predictors [2, 3, 4, 5], R3P-Loc uses gene ontology (GO) as the feature information. The specific algorithms can be found in the paper.

For eukaryotic proteins, R3P-Loc is designed to predict 22 subcellular locations of multi-label eukaryotic proteins: (1) acrosome; (2) cell membrane; (3) cell wall; (4) centrosome; (5) chloroplast; (6) cyanelle; (7) cytoplasm; (8) cytoskeleton; (9) endoplasmic reticulum; (10) endosome; (11) extracellular; (12) Golgi apparatus; (13) hydrogenosome; (14) lysosome; (15) melanosome; (16) microsome; (17) mitochondrion; (18) nucleus; (19) peroxisome; (20) spindle pole body; (21) synapse; and (22) vacuole. The predictor is not designed for non-eukaryotic proteins. When the eukaryote option is selected, the prediction results for non-eukaryotic proteins are therefore arbitrary and meaningless.
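The combination of random projection and ensemble ridge regression named above can be illustrated with a minimal NumPy sketch. It is not the R3P-Loc implementation, which is described in the paper. The function names, the Gaussian projection matrix, the regularization value, the score averaging and the thresholding rule are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, Y, rho):
    """Closed-form ridge regression: W = (X^T X + rho * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + rho * np.eye(d), X.T @ Y)


def fit_rp_ridge_ensemble(X, Y, proj_dim, n_members, rho=1.0, seed=0):
    """Train one ridge model per random projection of the feature vectors."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Gaussian random projection (assumed); maps d features to proj_dim.
        R = rng.standard_normal((X.shape[1], proj_dim)) / np.sqrt(proj_dim)
        W = fit_ridge(X @ R, Y, rho)
        members.append((R, W))
    return members


def predict_labels(members, X, threshold=0.5):
    """Average the member scores, then keep every label above the threshold.

    Assumed decision rule: each protein always keeps its top-scoring label,
    so no protein is left without a predicted location.
    """
    scores = np.mean([X @ R @ W for R, W in members], axis=0)
    labels = scores >= threshold
    labels[np.arange(len(X)), scores.argmax(axis=1)] = True
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.random((20, 50))             # 20 toy proteins, 50 toy features
    Y = (rng.random((20, 3)) > 0.6) * 1  # 3 toy locations, multi-label
    members = fit_rp_ridge_ensemble(X, Y, proj_dim=8, n_members=4)
    print(predict_labels(members, X).astype(int))
```

Each ensemble member sees a different low-dimensional view of the same features, so averaging their scores reduces the dependence on any single random projection.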

Figure 1: Interface of the R3P-Loc web-server.

For plant proteins, R3P-Loc is designed to predict 12 subcellular locations of multi-label plant proteins: (1) cell membrane; (2) cell wall; (3) chloroplast; (4) cytoplasm; (5) endoplasmic reticulum; (6) extracellular; (7) Golgi apparatus; (8) mitochondrion; (9) nucleus; (10) peroxisome; (11) plastid; and (12) vacuole. Note that (11) plastid covers all plastid groups other than (3) chloroplast. The predictor is not designed for non-plant proteins, so the prediction results for non-plant proteins are arbitrary and meaningless.

Figure 2: Different formats and types of input.

2 Step-by-step Protocol Guide

Fig. 1 shows the interface of the R3P-Loc web-server. Using R3P-Loc takes two steps:

1. Select the species type and input type. Fig. 2 shows the four combinations of species types and input types: eukaryotic protein amino acid sequences in FASTA format, eukaryotic protein UniProtKB accession numbers, plant protein amino acid sequences in FASTA format and plant protein UniProtKB accession numbers.

2. Input the query proteins as either FASTA sequences or accession numbers. There are two ways to input the proteins: copy and paste the protein information into the text box, or upload a file containing the proteins. Users may optionally provide an email address when they upload a file containing many FASTA sequences or accession numbers, in which case the prediction results will be emailed to them.

For users' convenience, several examples of eukaryotic sequences, eukaryotic accession numbers, plant sequences and plant accession numbers are provided on the R3P-Loc web-server. A help page introduces the FASTA format and the UniProtKB accession number format, and the two benchmark datasets can be downloaded from hyperlinks on the web-server. Brief instructions covering the significance of subcellular localization prediction, specific information about R3P-Loc and some notes are also provided. To make the web-server easier to use, the different combinations of species types, input types and input methods are presented in the following subsections.

2.1 Inputting Protein Accession Numbers via Copy-and-Paste

Fig. 3 shows an example of using accession numbers (ACs) as input. R3P-Loc accepts one or more accession numbers per submission (the server allows a maximum of 100 accession numbers per submission). Details of UniProtKB ACs can be found on the help page. After prediction, a results page similar to Fig. 4 is shown, listing the input statistics, the prediction results and a link to a downloadable file containing the prediction results. Fig. 5 shows the details of the downloadable prediction-result file.

Figure 3: An example of using accession numbers as input. (Callouts: select eukaryotic accession numbers; input accession numbers; press this button to predict.)

Figure 4: Prediction results page for using accession numbers as input.

Figure 5: An example of the prediction-result file.

2.2 Inputting Protein Sequences via Copy-and-Paste

Fig. 6 shows an example of using protein amino acid sequences as input. R3P-Loc accepts one or more protein sequences (maximum 10) per submission (note that the updated server allows a maximum of 50 sequences per submission). Details of the FASTA format can be found on the help page. After prediction, a results page similar to Fig. 7 is shown, listing the input statistics, the prediction results and a link to a downloadable text file containing the prediction results. Fig. 8 shows the details of the prediction-result file. Besides the final subcellular locations, the prediction results also show the BLAST E-value for each query protein sequence.

Figure 6: An example of using protein amino acid sequences as input. (Callouts: select eukaryotic protein sequences; input protein sequences; press this button to predict.)

2.3 File-Upload Function

R3P-Loc allows users to upload a text file containing a list of accession numbers or sequences in FASTA format. Fig. 9 shows an example of uploading a file with a list of accession numbers without providing an email address. In this case, R3P-Loc presents the prediction results in HTML format, as shown in Fig. 10. A text file can also be downloaded from the results page. Fig. 11 shows an example of the downloadable file.

2.4 Emailing Function

To make it easier to send and further process the prediction results, an emailing function is included in R3P-Loc. By providing their email address as shown in Fig. 12, users will receive the prediction results by email. After prediction, an email with contents similar to those in Fig. 13 is sent to the designated email address. The email is entitled "Results for your subloc prediction task from R3P-Loc Server" and is sent from the official email address of the R3P-Loc server, namely r3plocserver@gmail.com. Its contents read:

Dear users,
Thank you for using our R3P-Loc web-server to predict protein subcellular localization. Attached please find the prediction results of your submissions. You can find more information from our server website. Thank you again for your support.
Best wishes,
R3P-Loc Server

The prediction results are saved as an attachment within the email.

Figure 7: Prediction results page for using protein sequences as input.

Figure 8: Details of the downloadable prediction-results file.
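Before submitting, users may want to check their input against the limits stated in Sections 2.1 and 2.2. The sketch below is a hypothetical client-side check, not part of the R3P-Loc server. The accession pattern is the regular expression published in the UniProt help pages. The separators accepted between accession numbers, the function names and the choice of the updated limit of 50 sequences are assumptions.

```python
import re

# Accession number pattern as published in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
MAX_ACCESSIONS = 100  # per-submission limit stated in Section 2.1
MAX_SEQUENCES = 50    # limit of the updated server stated in Section 2.2


def parse_accessions(text):
    """Split pasted text on whitespace, commas or semicolons (assumed
    separators) and check every token against the accession pattern."""
    acs = [tok.upper() for tok in re.split(r"[\s,;]+", text.strip()) if tok]
    bad = [ac for ac in acs if not ACCESSION_RE.fullmatch(ac)]
    if bad:
        raise ValueError(f"not UniProtKB accession numbers: {bad}")
    if len(acs) > MAX_ACCESSIONS:
        raise ValueError(f"{len(acs)} accession numbers exceed {MAX_ACCESSIONS}")
    return acs


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"{len(records)} sequences exceed {MAX_SEQUENCES}")
    return records


if __name__ == "__main__":
    print(parse_accessions("P12345, Q9Y6K9"))
    print(parse_fasta(">example protein\nMKVLA\nGG"))
```

Rejecting malformed input locally avoids a round trip to the server for submissions that could not be processed in any case.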

Figure 9: An example of using a file with a list of accession numbers as input without providing an email address. (Callouts: select plant accession numbers; input file with a list of accession numbers; press this button to predict.)

Figure 10: Prediction results page for using a file as input.

Figure 11: Details of the downloadable prediction-results file.

Figure 12: An example of using a file with a list of protein sequences as input and providing an email address. (Callouts: select plant protein sequences; input file with a list of protein sequences; input email to receive and save results; press this button to predict.)

3 Statistical Methods

In statistical prediction, three methods are often used to test the generalization capability of predictors: independent tests, subsampling tests (K-fold cross-validation) and jackknife tests (leave-one-out cross-validation, LOOCV).

In independent tests, the training set and the test set are fixed, which yields a fixed accuracy for the predictors. However, the selection of the independent dataset often bears some sort of arbitrariness [6], which inevitably biases the estimated accuracy.

For subsampling tests, take five-fold cross-validation as an example. The whole dataset is randomly divided into 5 disjoint parts of equal size [7]. The last part may have 1-4 more examples than the other 4 parts so that every example is evaluated. One part is then used as the test set and the remaining parts are jointly used as the training set. This procedure is repeated five times, each time with a different part as the test set. The number of possible ways of dividing the benchmark dataset is astronomical even for a small dataset, so different divisions lead to different results for the same benchmark dataset, and the method remains liable to statistical arbitrariness. Subsampling tests with a smaller K run faster than those with a larger K. They are therefore faster than LOOCV, which can be regarded as N-fold cross-validation, where N is the number of samples in the dataset and N > K. At the same time, subsampling tests are statistically acceptable and usually regarded as less biased than independent tests.

In LOOCV, every protein in the benchmark dataset is singled out one by one and tested by the classifier trained on the remaining proteins. Arbitrariness is avoided because LOOCV yields a unique outcome for the predictors. LOOCV is therefore considered to be the most rigorous and bias-free method [8], and it was used to examine the performance of R3P-Loc against other state-of-the-art predictors.
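The partitioning behind K-fold cross-validation and LOOCV can be sketched as follows. This is an illustrative sketch using only the Python standard library, not the evaluation code used for R3P-Loc. The function names and the seed parameter are assumptions. As in the description above, the last part absorbs the remainder when the dataset size is not divisible by K.

```python
import random


def kfold_parts(n_samples, k, seed=0):
    """Randomly divide sample indices into k disjoint parts.

    The first k-1 parts hold floor(n_samples / k) indices each and the
    last part takes whatever remains.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a different seed gives a different division
    size = n_samples // k
    parts = [indices[i * size:(i + 1) * size] for i in range(k - 1)]
    parts.append(indices[(k - 1) * size:])
    return parts


def train_test_splits(parts):
    """Yield (train, test) index lists, using each part once as the test set."""
    for i, test in enumerate(parts):
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, test


def loocv_splits(n_samples):
    """LOOCV is the special case k = n_samples: one sample per test set.

    No shuffling is involved, so the set of splits is unique.
    """
    return train_test_splits([[i] for i in range(n_samples)])


if __name__ == "__main__":
    parts = kfold_parts(23, 5)
    print([len(p) for p in parts])  # part sizes for 23 samples and K = 5
    for train, test in loocv_splits(4):
        print(train, test)
```

Changing the seed in kfold_parts changes which samples share a part, which is the source of the arbitrariness discussed above. loocv_splits takes no seed because there is only one way to leave each sample out once.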

Figure 13: An example of the email containing the prediction results.

4 Dataset Construction

R3P-Loc uses two benchmark datasets [2, 9] to evaluate its performance. Both were constructed using the same standard procedures with the same Swiss-Prot version and date of construction (i.e., Swiss-Prot 55.3 on 29-Apr-2008 for the benchmark plant dataset). They differ in species (eukaryote or plant). Here, we take the plant dataset as an example to illustrate the procedure, which is as follows:

1. Go to the UniProt/Swiss-Prot official webpage (http://www.uniprot.org/);
2. Go to the Search section and select "Protein Knowledgebase (UniProtKB)" (default) in the "Search in" option;
3. In the Query option, select or type "reviewed: yes";
4. Select AND in the Advanced Search option, then select "Taxonomy [OC]" and type in "Viridiplantae";
5. Select AND in the Advanced Search option, then select "Fragment: no";
6. Select AND in the Advanced Search option, then select "Sequence length" and type in "50 -" (no less than 50);
7. Select AND in the Advanced Search option, then select "Date entry integrated" and type in "-20080429";
8. Select AND in the Advanced Search option, then select "Subcellular location: XXX Confidence: Experimental" (XXX denotes the specific subcellular location. Here it covers 12 different locations: cell membrane; cell wall; chloroplast; cytoplasm; endoplasmic reticulum; extracellular; Golgi apparatus; mitochondrion; nucleus; peroxisome; plastid; vacuole);
9. Further exclude proteins that are not experimentally annotated (this rechecks the proteins to guarantee that they are all experimentally annotated).

Table 1: Breakdown of the multi-label eukaryotic protein dataset. The sequence identity is cut off at 25%. The superscript e denotes the eukaryotic dataset.

Label  Subcellular Location          No. of Locative Proteins
1      Acrosome                      14
2      Cell membrane                 697
3      Cell wall                     49
4      Centrosome                    96
5      Chloroplast                   385
6      Cyanelle                      79
7      Cytoplasm                     2186
8      Cytoskeleton                  139
9      Endoplasmic reticulum (ER)    457
10     Endosome                      41
11     Extracellular                 1048
12     Golgi apparatus               254
13     Hydrogenosome                 10
14     Lysosome                      57
15     Melanosome                    47
16     Microsome                     13
17     Mitochondrion                 610
18     Nucleus                       2320
19     Peroxisome                    110
20     Spindle pole body (SPI)       68
21     Synapse                       47
22     Vacuole                       170
Total number of locative proteins (N_loc^e)    8897
Total number of actual proteins (N_act^e)      7766

Table 2: Breakdown of the multi-label plant protein dataset. The sequence identity is cut off at 25%. The superscript p denotes the plant dataset.

Label  Subcellular Location          No. of Locative Proteins
1      Cell membrane                 56
2      Cell wall                     32
3      Chloroplast                   286
4      Cytoplasm                     182
5      Endoplasmic reticulum         42
6      Extracellular                 22
7      Golgi apparatus               21
8      Mitochondrion                 150
9      Nucleus                       152
10     Peroxisome                    21
11     Plastid                       39
12     Vacuole                       52
Total number of locative proteins (N_loc^p)    1055
Total number of actual proteins (N_act^p)      978
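The totals in Table 1 and Table 2 can be cross-checked with a few lines of Python. The counts below are copied from the two tables. The helper name is hypothetical. The interpretation of the gap between locative and actual proteins assumes, as is usual for these benchmark datasets, that a protein found at m locations is counted as m locative proteins.

```python
# Per-location counts copied from Table 1 (eukaryote) and Table 2 (plant).
EUK_COUNTS = [14, 697, 49, 96, 385, 79, 2186, 139, 457, 41, 1048,
              254, 10, 57, 47, 13, 610, 2320, 110, 68, 47, 170]
PLANT_COUNTS = [56, 32, 286, 182, 42, 22, 21, 150, 152, 21, 39, 52]


def summarize(counts, n_actual):
    """Sum the per-location counts and relate them to the actual protein count."""
    n_locative = sum(counts)
    return {
        "n_locative": n_locative,
        "n_actual": n_actual,
        # Labels beyond the first one per protein (assumed counting rule).
        "extra_labels": n_locative - n_actual,
        "avg_labels_per_protein": n_locative / n_actual,
    }


if __name__ == "__main__":
    print(summarize(EUK_COUNTS, 7766))
    print(summarize(PLANT_COUNTS, 978))
```

The sums reproduce the table totals (8897 and 1055). Under the stated assumption, the averages of about 1.15 and 1.08 labels per protein show that both datasets contain multi-location proteins.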

After selecting the proteins, Blastclust (http://www.ncbi.nlm.nih.gov/web/newsltr/spring04/blastlab.html) was applied to reduce the redundancy in the dataset so that no sequence pair has a sequence identity higher than 25%. The breakdown of the two benchmark datasets is listed in Table 1 and Table 2. Both datasets are accessible from the Datasets page of the R3P-Loc web-server.

The R3P-Loc server is available at http://bioinfo.eie.polyu.edu.hk/r3plocserver/.

References

[1] R. Apweiler, A. Bairoch, C. H. Wu, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, D. A. Natale, C. O'Donovan, N. Redaschi, and L. S. Yeh, "UniProt: the Universal Protein knowledgebase," Nucleic Acids Res, vol. 32, pp. D115-D119, 2004.
[2] K. C. Chou, Z. C. Wu, and X. Xiao, "iLoc-Euk: A multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins," PLoS ONE, vol. 6, no. 3, pp. e18258, 2011.
[3] S. Wan, M. W. Mak, and S. Y. Kung, "GOASVM: A subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition," Journal of Theoretical Biology, vol. 323, pp. 40-48, 2013.
[4] S. Wan, M. W. Mak, and S. Y. Kung, "mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines," BMC Bioinformatics, vol. 13, pp. 290, 2012.
[5] S. Wan, M. W. Mak, and S. Y. Kung, "HybridGO-Loc: Mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins," PLoS ONE, vol. 9, no. 3, pp. e89545, 2014.
[6] K. C. Chou and C. T. Zhang, "Review: Prediction of protein structural classes," Critical Reviews in Biochemistry and Molecular Biology, vol. 30, no. 4, pp. 275-349, 1995.
[7] S. Y. Mei, W. Fei, and S. G. Zhou, "Gene ontology based transfer learning for protein subcellular localization," BMC Bioinformatics, vol. 12, pp. 44, 2011.
[8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
[9] Z. C. Wu, X. Xiao, and K. C. Chou, "iLoc-Plant: A multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites," Molecular BioSystems, vol. 7, pp. 3287-3297, 2011.