Annotation Error in Public Databases ALEXANDRA SCHNOES UNIVERSITY OF CALIFORNIA, SAN FRANCISCO OCTOBER 25, 2010

Similar documents
CS612 - Algorithms in Bioinformatics

Gene Ontology and overrepresentation analysis

Homology. and. Information Gathering and Domain Annotation for Proteins

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Week 10: Homology Modelling (II) - HHpred

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Protein Similarity Networks (PSNs)

BLAST. Varieties of BLAST

The CATH Database provides insights into protein structure/function relationships

Lecture 2. The Blast2GO annotation framework

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

Prediction of protein function from sequence analysis

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Introduction to Evolutionary Concepts

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

SUPPLEMENTARY INFORMATION

EBI web resources II: Ensembl and InterPro

A Protein Ontology from Large-scale Textmining?

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Similarity searching summary (2)

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Improving Protein Function Prediction using the Hierarchical Structure of the Gene Ontology

ATLAS of Biochemistry

Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy

SABIO-RK Integration and Curation of Reaction Kinetics Data Ulrike Wittig

Homology and Information Gathering and Domain Annotation for Proteins

CSCE555 Bioinformatics. Protein Function Annotation

BIOINFORMATICS: An Introduction

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Protein structure analysis. Risto Laakso 10th January 2005

TrAnsFuSE refines the search for protein function: oxidoreductaseswz

Structure to Function. Molecular Bioinformatics, X3, 2006

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Advanced practical course in genome bioinformatics DAY 6: Functional annotation. Petri Törönen Earlier version Patrik Koskinen

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Mitochondrial Genome Annotation

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

A Network Filtration Protocol for Elucidating. Relationships between Families in a Protein

Title 代謝経路からの酵素反応連続パターンの検出 京都大学化学研究所スーパーコンピュータシステム研究成果報告書 (2013), 2013:

Identifying Signaling Pathways

Protein Structure and Function Prediction using Kernel Methods.

Variable-Length Protein Sequence Motif Extraction Using Hierarchically-Clustered Hidden Markov Models

ProtoNet 4.0: A hierarchical classification of one million protein sequences

Bioinformatics. Macromolecular structure

Sifting through genomes with iterative-sequence clustering produces a large, phylogenetically diverse protein-family resource

Some Problems from Enzyme Families

Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling

Comparative Network Analysis

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Metabolic modelling. Metabolic networks, reconstruction and analysis. Esa Pitkänen Computational Methods for Systems Biology 1 December 2009

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Large-Scale Genomic Surveys

Project Manual Bio3055. Apoptosis: Caspase-1

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Supplementary Information

Protein Structure: Data Bases and Classification Ingo Ruczinski

2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms

CS 229 Project: A Machine Learning Framework for Biochemical Reaction Matching

Protein Science (1997), 6: Cambridge University Press. Printed in the USA. Copyright 1997 The Protein Society

Multiple sequence alignment

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

Protein function prediction based on sequence analysis

Computational methods for predicting protein-protein interactions

Implications of Structural Genomics Target Selection Strategies: Pfam5000, Whole Genome, and Random Approaches

Building a Homology Model of the Transmembrane Domain of the Human Glycine α-1 Receptor

V14 extreme pathways

IMMERSIVE GRAPH-BASED VISUALIZATION AND EXPLORATION OF BIOLOGICAL DATA RELATIONSHIPS

NetAffx GPCR annotation database summary December 12, 2001

Wataru Nemoto 1,2* and Hiroyuki Toh 1

METABOLIC PATHWAY PREDICTION/ALIGNMENT

V19 Metabolic Networks - Overview

Computational Molecular Biology (

Lecture Notes for Fall Network Modeling. Ernest Fraenkel

Machine-learning scoring functions for docking

Multiple Alignment using Hydrophobic Clusters : a tool to align and identify distantly related proteins

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Francisco M. Couto Mário J. Silva Pedro Coutinho

Gene function annotation

Robert Edgar. Independent scientist

ALGORITHMS FOR BIOLOGICAL NETWORK ALIGNMENT

DATE A DAtabase of TIM Barrel Enzymes

A profile-based protein sequence alignment algorithm for a domain clustering database

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Mining and classification of repeat protein structures

Functional Annotation

Computational Protein Function Prediction: Framework and Challenges

Cross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic

Homology modeling of Ferredoxin-nitrite reductase from Arabidopsis thaliana

-max_target_seqs: maximum number of targets to report

Detection of Protein Binding Sites II

Transcription:

Annotation Error in Public Databases ALEXANDRA SCHNOES UNIVERSITY OF CALIFORNIA, SAN FRANCISCO OCTOBER 25, 2010 1

New genomes (and metagenomes) sequenced every day... 2

3

3

3

3

3

3

3

3

3

Computational Function Prediction Needed Total Sequences Characterized Sequences 4

What about the error that results from large scale function prediction? 5

Our focus: commonly used protein sequence databases How prevalent is misannotation in common sequence databases? What can we learn about these annotation errors and annotation in general? 6

What is function? Many Possible Definitions Phenotype Enzymatic Reaction 7

What is function? Many Possible Definitions Phenotype Enzymatic Reaction 7

Concrete definition of function Substrate Product Chemical conversion Function can be mapped to specific residues Why use enzymes? 8

Functionally Diverse Enzyme Superfamilies 9

Functionally Diverse Enzyme Superfamilies Low % sequence ID Conserved mechanistic step Multifunctional 9

Functionally Diverse Enzyme Superfamilies Conserved mechanistic step Multifunctional Low % sequence ID Monofunctional Family Specific Residues % ID within families > % ID between families 9

What is needed for the misannotation analysis? Gold Standard Sequence Set Requirements Organized hierarchy & data Superfamily definitions Family definitions Sequences Sequence alignments Statistical models Functions are experimentally characterized Understand functional mechanism Structure Active site Functionally important residues Large set 10

What is needed for the misannotation analysis? Gold Standard Sequence Set Requirements Organized hierarchy & data Superfamily definitions Family definitions Sequences Sequence alignments Statistical models Functions are experimentally characterized 6 Superfamilies 5 Structural folds 37 Families 5/6 E.C. categories Genome Biol. 2006;7(1):R8. Understand functional mechanism Structure Active site Functionally important residues Large set 10

Sequence Models (HMMs) Evidence Codes Functionally Important Residues Hierarchically Organized Hand-Curated Sequence Alignments Gold Standard Sequence Set sfld.rbvi.ucsf.edu 11

Data Source: Commonly Used Sequence Databases NCBI Automated Large TrEMBL Automated Large KEGG Automated Swiss-Prot Curated Small 12

Analysis Question Given: A protein sequence annotated to a specific enzyme function Is that annotation correct? 13

General Process 14

15

15

15

15

Non-Family Members Family Members LC NC TC 15

16

16

Variable percent misannotation Manually curated Swiss- Prot is most accurate 17

Misannotation Problem is Getting Worse Number of Sequences 0 200 400 600 800 1000 1200 Sequences Deposited by Year and the Fraction Predicted to be Misannotated (NR DB) 0.5 0.4 0.3 0.2 0.1 Fraction Predicted Misannotated 1993 1995 1997 1999 2001 2003 2005 Year Incorrect Annotations Correct Annotations 18

What are the characteristics of these misannotations? 19

Sensitivity to threshold change 20

Non-Family Members LC Family Members NC TC TC Trusted Cutoff NC Noise Cutoff LC Lenient Cutoff Sensitivity to threshold change 20

Non-Family Members LC Family Members NC TC Non-Family Members LC Family Members Sensitivity to threshold change NC TC 20

21

NSA NSA No Superfamily Association 21

NSA SFA NSA No Superfamily Association SFA Superfamily Association Only 21

NSA SFA NSA No Superfamily Association SFA Superfamily Association Only MFR Missing Functionally Important Residues MFR 21

NSA SFA NSA No Superfamily Association SFA Superfamily Association Only MFR Missing Functionally Important Residues BTC Below Trusted Cutoff MFR BTC 21

Types of Misannotation MFR (6%) SFA (31%) NSA (9%) BTC (54%) NSA Misannotations due to overprediction Misannotations not due to overprediction NSA No Superfamily Association SFA Superfamily Association Only MFR Missing Functionally Important Residues BTC Below Trusted Cutoff SFA MFR BTC Biggest Problem Predicting function without sufficient evidence 21

Dipeptide Epimerase >gi 13786715 pdb 1HZY A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas 1Diminuta GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS >gi 1176259 sp P45548 PHP_ECOLI 2Phosphotriesterase homology protein MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD VMLRENPSQFFQ Unknown Function Dipeptide Epimerase 1 = 2 Dipeptide Epimerase 1! Dipeptide 1 & 2 Epimerase INCORRECT! Error Propagation 22

Dipeptide Epimerase >gi 13786715 pdb 1HZY A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas 1Diminuta GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS >gi 1176259 sp P45548 PHP_ECOLI 2Phosphotriesterase homology protein MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD VMLRENPSQFFQ Unknown Function Dipeptide Epimerase 1 = 2 Dipeptide Epimerase 1! Dipeptide 1 & 2 Epimerase INCORRECT! Error Propagation 22

BLAST sequence similarity network E-value 1 10 30 or lower Distance between nodes reflects level of sequence similarity Sequence similarity Correct annotation Incorrect annotation 23

BLAST sequence similarity network E-value 1 10 30 or lower Distance between nodes reflects level of sequence similarity Sequence similarity Correct annotation Incorrect annotation 23

Misannotations Cluster with each other Indication of error propagation BLAST sequence similarity network E-value 1 10 30 or lower Distance between nodes reflects level of sequence similarity Sequence similarity Correct annotation Incorrect annotation 23

In Conclusion... Misannotation is a serious problem Automated databases Across multiple folds, functions and superfamilies Hard to predict misannotation a priori Manual curation delivers the highest quality Misannotation problem is getting worse Overprediction is a common problem Error propagation appears to be a common source of misannotation 24

Acknowledgements Patricia Babbitt & lab Shoshana Brown Igor Dodevski University of Zürich Tanja Kortemme & Lab Colin Smith Jim Wells Lab Emily Crawford $$ Howard Hughes Pre-Doctoral Fellowship NIH & NSF PLoS Comput Biol. 2009 Dec;5(12):e1000605. 25 Wiki Commons & Science Magazine for some images