ProtoNet 4.0: A hierarchical classification of one million protein sequences

Noam Kaplan 1*, Ori Sasson 2, Uri Inbar 2, Moriah Friedlich 2, Menachem Fromer 2, Hillel Fleischer 2, Elon Portugaly 2, Nathan Linial 2 and Michal Linial 1

1 Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel
2 School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel
* Corresponding author

Corresponding author details:
Noam Kaplan
E-Mail: kaplann@cc.huji.ac.il
Address: The Hebrew University, Department of Biological Chemistry, Givat Ram Campus, Jerusalem, Israel, 91904
Telephone: 972-2-6585433

ABSTRACT

ProtoNet is an automatic hierarchical classification of the protein sequence space. ProtoNet version 4.0 (2004) presents the analysis of over one million proteins merged from the SwissProt and TrEMBL databases. In addition to rich visualization and analysis tools for navigating the clustering hierarchy, we incorporated several improvements that allow a simplified view of the scaffold of the proteins. An unsupervised, biologically valid method condenses the ProtoNet hierarchy to only 12% of its clusters. A large portion of these clusters were automatically assigned high-confidence biological names according to their correspondence with functional annotations. ProtoNet is available at: http://www.protonet.cs.huji.ac.il

INTRODUCTION

ProtoNet (1), launched in 2002, is an automatic hierarchical clustering of the SwissProt and TrEMBL (2) protein databases. The clustering process is based on an all-against-all BLAST (3) similarity search. The BLAST E-scores are used to perform a continuous bottom-up clustering process that joins, at each step, the two most similar protein clusters, resulting in a hierarchy of protein clusters at various degrees of biological granularity. This hierarchy is structured as a collection of trees, in which each root cluster contains all the proteins of its tree and the remaining clusters represent subdivisions of those proteins into smaller groups. Browsing this global hierarchical organization of the protein world provides several interesting biological insights about protein families and the evolution of structural and functional relations between proteins (4). Furthermore, ProtoNet can be used to assess the function of a novel protein sequence by finding the best matching cluster for it.

We describe here several new developments in ProtoNet 4.0, including the increase in the number of sequences analyzed from 114,033 SwissProt proteins (ProtoNet version 2.1) to 1,072,911 SwissProt and TrEMBL sequences (ProtoNet version 4.0), and an improved methodology for simplifying the scaffold of the protein hierarchy.

HIERARCHY CONDENSATION

Because of the immense size of the ProtoNet hierarchy and its number of protein clusters, navigating the full hierarchy is very difficult. Furthermore, many of the clusters are biologically irrelevant (for example, huge root clusters containing hundreds of thousands of unrelated proteins). To obtain a condensed yet biologically relevant view of the hierarchy, we searched for a process-intrinsic parameter that indicates which clusters are biologically relevant. The parameter we found measures the stability of a cluster during the clustering process, on the assumption that a stable cluster is also more biologically relevant. We found this assumption to hold: a small subset of highly stable clusters retains the biological validity of ProtoNet (Kaplan et al., in preparation). The default condensation of the ProtoNet hierarchy leaves 12% of the clusters. However, the ProtoNet website now offers an "advanced mode", in which advanced users can control the level of condensation and the method by which it is done, resulting in a larger or smaller set of clusters as required.
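
As a rough illustration of the bottom-up clustering and the stability-based condensation described above, the following Python sketch may help. It is not the ProtoNet implementation. Several choices are assumptions made only for this example: the merging rule (arithmetic mean of E-scores over all cross-cluster pairs), the value `NO_HIT = 100.0` assigned to pairs without a reported similarity, and the stability measure (the number of merge steps a cluster survives before being absorbed into its parent). The function names `build_hierarchy` and `condense` are hypothetical.

```python
from itertools import combinations

NO_HIT = 100.0  # assumed E-score for protein pairs with no reported similarity


def build_hierarchy(names, escore):
    """Bottom-up clustering: at each step merge the two most similar clusters.

    escore maps frozenset({a, b}) -> BLAST E-score (lower means more similar).
    Returns a list of cluster records with members, birth step and death step.
    """
    clusters = [{"members": frozenset([n]), "birth": 0, "death": None}
                for n in names]
    active = list(range(len(clusters)))
    step = 0

    def dist(i, j):
        # Assumed merging rule: mean E-score over all cross-cluster pairs.
        a, b = clusters[i]["members"], clusters[j]["members"]
        total = sum(escore.get(frozenset((x, y)), NO_HIT) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(active) > 1:
        i, j = min(combinations(active, 2), key=lambda p: dist(*p))
        if dist(i, j) >= NO_HIT:
            break  # nothing similar is left: remaining clusters are tree roots
        step += 1
        clusters[i]["death"] = clusters[j]["death"] = step
        clusters.append({"members": clusters[i]["members"] | clusters[j]["members"],
                         "birth": step, "death": None})
        active = [k for k in active if k not in (i, j)] + [len(clusters) - 1]

    for k in active:
        clusters[k]["death"] = step + 1  # roots survive to the end of the process
    return clusters


def condense(clusters, keep_fraction=0.12):
    """Keep the most stable merged clusters.

    Stability here is a hypothetical stand-in: death step minus birth step.
    """
    merged = [c for c in clusters if len(c["members"]) > 1]
    merged.sort(key=lambda c: c["death"] - c["birth"], reverse=True)
    keep = max(1, round(keep_fraction * len(merged)))
    return merged[:keep]
```

Calling `condense(build_hierarchy(names, escore))` keeps roughly 12% of the merged clusters by default, mirroring the default condensation level. Because merging stops once no similar pair remains, the result is a collection of trees rather than a single tree. This quadratic pair search is only suitable for toy inputs.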
Note that the condensation changes the trees from binary to non-binary trees, and some browsing options have been developed accordingly (see "Website Enhancements").

DATABASE UPDATES

ProtoNet has gone through a major update of all database sources. Primarily, the protein database from which the ProtoNet tree is constructed has been updated and extended to include the TrEMBL protein database as well as SwissProt. This results in a leap from 114,033 to 1,072,911 protein sequences. Although TrEMBL is not manually validated by experts, it provides a much more extensive view of the protein world, including whole genomes and a thorough representation of several key organisms (see Table 1).

Table 1

CLUSTER NAMES

Assigning a biological function to proteins is a major objective in bioinformatics. We have developed an automatic high-confidence method that assigns a functional annotation to ProtoNet protein clusters. The method finds the functional annotation from the InterPro (6), GO (5), SwissProt or ENZYME (7) databases that best fits the proteins of the cluster and gives it a score reflecting how well it fits the cluster. If this score is above a certain threshold, the annotation is assigned as the cluster's name. Understandably, not every protein cluster has an existing annotation that fits it well. Clusters whose best-fitting annotation does not pass the threshold remain nameless, possibly indicating a novel functional group or a cluster associated with mixed functions. By applying this method we were able to assign biological names to 78% of the clusters that contain 20 proteins or more. Because a high threshold is used, the assigned names carry high confidence.

WEBSITE ENHANCEMENTS

Several enhancements have been made to the ProtoNet website to allow easier and more in-depth analysis of the ProtoNet trees.

BROWSING CLUSTER NAMES

Cluster names are extremely useful for quickly browsing the ProtoNet trees, eliminating the need to check the proteins of each cluster in order to get an impression of the hierarchy. Furthermore, the assignment of a biological function to clusters suggests an easy scheme for assigning function to proteins of unknown function: a protein can be assigned the functions of all clusters to which it belongs. This scheme can be applied not only to each of the 1,072,911 proteins from which the ProtoNet hierarchy is constructed, but also to new protein sequences supplied by the user (through the "Classify your protein" option on the website, which finds the most suitable cluster for a new sequence).
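
The naming scheme and the resulting function assignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by ProtoNet: the text does not specify the scoring function or the threshold value, so the Jaccard overlap used as the fit score and the default `threshold=0.8` are placeholders, and the function names `name_cluster` and `protein_functions` are hypothetical.

```python
def name_cluster(cluster, annotations, threshold=0.8):
    """Return the best-fitting annotation for a cluster, or None.

    cluster: set of protein IDs.
    annotations: dict mapping an annotation term to the set of protein IDs
                 that carry it.
    Fit score (an assumption): Jaccard overlap between the cluster and the
    annotated protein set.
    """
    best_term, best_score = None, 0.0
    for term, proteins in annotations.items():
        union = cluster | proteins
        score = len(cluster & proteins) / len(union) if union else 0.0
        if score > best_score:
            best_term, best_score = term, score
    # Clusters whose best annotation misses the threshold stay nameless.
    return best_term if best_score >= threshold else None


def protein_functions(protein, clusters, annotations, threshold=0.8):
    """Collect the names of all named clusters that contain the protein."""
    names = []
    for cluster in clusters:
        if protein in cluster:
            name = name_cluster(cluster, annotations, threshold)
            if name is not None:
                names.append(name)
    return names
```

A score that penalizes both annotated proteins missing from the cluster and cluster members lacking the annotation is one way to capture "how well it fits the cluster"; other measures would serve equally well for the purpose of this sketch.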
BROWSING THE TREE

Subtree View: To cope with the change to a non-binary tree, we have introduced the ProtoBrowser (Figure 1), which shows the tree in the vicinity of the cluster being displayed. Instead of presenting only the branch of the tree to which the cluster belongs, the new display allows easy navigation to neighboring clusters and an enhanced global overview of biological protein families.

Figure 1

Functionality View: PANDORA (8) is a web-based tool that allows in-depth biological analysis of large protein sets. When trying to biologically interpret large protein clusters that contain hundreds of proteins, PANDORA is a natural choice. ProtoNet now offers a direct link from its cluster page to PANDORA, making it easy to see which biological groups the cluster is built from and to analyze them from several different biological aspects.

Similarity View: When viewing a protein cluster, it is sometimes helpful to look in depth at the sequence similarity between the proteins of the cluster. This can show, for example, whether there is a natural partitioning into subgroups or whether the similarity within the cluster is uniform (see example in Shachar and Linial, 2004). To address this, the website offers a cluster similarity matrix (Figure 2): an all-against-all matrix of all protein pairs in the cluster, color coded according to the BLAST E-score between the two proteins. The matrix also gives access to the BLAST result, which can be obtained simply by clicking on the appropriate cell.

Figure 2

MAINTENANCE AND UPDATING

The ProtoNet source databases are generally updated twice a year. The next ProtoNet release is planned to include the UniProt UniRef100 protein database. Other future plans include allowing users to select a subset of proteins from ProtoNet according to their needs (e.g. selecting for study only the proteins from the SwissProt database or only the proteins of a specific species) and expanding links to further biological databases such as OMIM (9) and DIP (10).

ACKNOWLEDGEMENTS

We thank the entire current and previous ProtoNet team for their endless support. Special thanks to Alex Savenok for the Web design as well as for the development of the visualization tools. We acknowledge fellowship support from the Sudarsky Center for Computational Biology (SCCB) to NK, UI, MF and MF. This study is partially supported by the EU NoE BioSapiens consortium and the CESG consortium supported by the NIH.

REFERENCES

1. Sasson, O., Vaaknin, A., Fleischer, H., Portugaly, E., Bilu, Y., Linial, N. and Linial, M. (2003) ProtoNet: hierarchical classification of the protein space. Nucleic Acids Res., 31, 348-352.
2. Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I. et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365-370.
3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
4. Shachar, O. and Linial, M. (2004) A robust method to detect structural and functional remote homologues. Proteins, in press.
5. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R. and Apweiler, R. (2004) The Gene Ontology Annotation (GOA) Database: sharing knowledge in UniProt with Gene Ontology. Nucleic Acids Res., 32, D262-D266.

6. Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P. et al. (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res., 31, 315-318.
7. Bairoch, A. (2000) The ENZYME database in 2000. Nucleic Acids Res., 28, 304-305.
8. Kaplan, N., Vaaknin, A. and Linial, M. (2003) PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res., 31, 5617-5626.
9. McKusick, V.A. (1998) Mendelian Inheritance in Man. Johns Hopkins University Press, Baltimore.
10. Salwinski, L., Miller, C.S., Smith, A.J., Pettit, F.K., Bowie, J.U. and Eisenberg, D. (2004) The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32, D449-D451.

FIGURE LEGENDS

Figure 1. The ProtoBrowser shows the near vicinity of a cluster in the ProtoNet hierarchy. Blue triangle-shaped icons represent protein clusters. The cluster currently being viewed is cluster A260586, which appears in the center in red. Clusters that include proteins with solved 3D structures are marked PDB.

Figure 2. Example of a cluster similarity matrix. Colored cells represent different degrees of similarity, ranging from white (no similarity: BLAST E-score higher than 100) to dark blue (high similarity: BLAST E-score close to 0). Cluster A222801 is roughly divided into 3 subsets: at the upper left of the diagonal are proteins that show no similarity to each other or to any other protein in the cluster; in the center of the diagonal is a subset of proteins that are similar to each other but to no other proteins; and at the bottom right of the diagonal is another subset of proteins that are similar to each other but not to other proteins.

TABLES

Table 1.

Species                   Proteins in ProtoNet 2.1   Proteins in ProtoNet 4.0
Homo sapiens              8,507                      47,641
Mus musculus              5,678                      41,813
Drosophila melanogaster   2,049                      22,603
Arabidopsis thaliana      1,680                      39,367
Plasmodium falciparum     153                        8,434