Annotation Error in Public Databases. Alexandra Schnoes, University of California, San Francisco. October 25, 2010. 1
New genomes (and metagenomes) sequenced every day... 2
Computational Function Prediction Needed [Figure: Total Sequences vs. Characterized Sequences] 4
What about the error that results from large scale function prediction? 5
Our focus: commonly used protein sequence databases. How prevalent is misannotation in common sequence databases? What can we learn about these annotation errors and annotation in general? 6
What is function? Many Possible Definitions: Phenotype; Enzymatic Reaction. 7
Why use enzymes? Concrete definition of function: chemical conversion of substrate to product. Function can be mapped to specific residues. 8
Functionally Diverse Enzyme Superfamilies. Superfamily: conserved mechanistic step; multifunctional; low % sequence ID. Family: monofunctional; family-specific residues; % ID within families > % ID between families. 9
What is needed for the misannotation analysis? Gold Standard Sequence Set Requirements:
Organized hierarchy & data: superfamily definitions; family definitions; sequences; sequence alignments; statistical models.
Functions are experimentally characterized.
Understand functional mechanism: structure; active site; functionally important residues.
Large set: 6 superfamilies; 5 structural folds; 37 families; 5/6 E.C. categories (Genome Biol. 2006;7(1):R8). 10
Gold Standard Sequence Set (sfld.rbvi.ucsf.edu): Hierarchically Organized; Hand-Curated Sequence Alignments; Sequence Models (HMMs); Functionally Important Residues; Evidence Codes. 11
Data Source: Commonly Used Sequence Databases. NCBI (automated, large); TrEMBL (automated, large); KEGG (automated); Swiss-Prot (curated, small). 12
Analysis Question. Given: a protein sequence annotated to a specific enzyme function. Is that annotation correct? 13
General Process 14
[Figure: Non-Family Members and Family Members, with cutoffs LC, NC, TC] 15
Variable percent misannotation. Manually curated Swiss-Prot is most accurate. 17
Misannotation Problem is Getting Worse [Figure: Sequences Deposited by Year and the Fraction Predicted to be Misannotated (NR DB). x-axis: Year (1993 to 2005); y-axes: Number of Sequences (0 to 1200) and Fraction Predicted Misannotated (0.1 to 0.5); series: Incorrect Annotations, Correct Annotations] 18
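The quantity named in the figure title, the fraction predicted misannotated per deposition year, is incorrect / (correct + incorrect). A minimal sketch in Python follows; the function name and the counts in the usage lines are hypothetical illustrations, not data from the talk.

```python
from typing import Dict, Tuple


def fraction_misannotated(
    counts_by_year: Dict[int, Tuple[int, int]]
) -> Dict[int, float]:
    """Map year -> fraction of sequences predicted misannotated.

    counts_by_year maps a deposition year to (correct, incorrect) counts.
    Years with no sequences are skipped to avoid dividing by zero.
    """
    fractions: Dict[int, float] = {}
    for year, (correct, incorrect) in sorted(counts_by_year.items()):
        total = correct + incorrect
        if total == 0:
            continue
        fractions[year] = incorrect / total
    return fractions


# Hypothetical counts for illustration only (not values from the study).
example = {1993: (9, 1), 1999: (150, 50), 2005: (600, 400)}
print(fraction_misannotated(example))
```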
What are the characteristics of these misannotations? 19
Sensitivity to threshold change [Figure: Non-Family Members and Family Members, with cutoffs LC, NC, TC]. TC: Trusted Cutoff; NC: Noise Cutoff; LC: Lenient Cutoff. 20
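How the number of sequences flagged as misannotated moves with the cutoff can be sketched as below. This is only an illustration under stated assumptions: each annotated sequence has one score against the model of the family it is annotated to, a sequence scoring below the chosen cutoff is counted as misannotated, and LC < NC < TC. Only the cutoff names come from the slides; the function name, cutoff values and scores are hypothetical.

```python
from typing import Dict, Iterable


def count_below_cutoffs(
    scores: Iterable[float], cutoffs: Dict[str, float]
) -> Dict[str, int]:
    """For each named cutoff, count the sequences scoring below it."""
    score_list = list(scores)
    return {
        name: sum(1 for s in score_list if s < cutoff)
        for name, cutoff in cutoffs.items()
    }


# Hypothetical cutoff values and scores, for illustration only.
cutoffs = {"TC": 300.0, "NC": 200.0, "LC": 100.0}
scores = [350.0, 310.0, 250.0, 150.0, 50.0]
print(count_below_cutoffs(scores, cutoffs))
```

With these made-up numbers, relaxing the cutoff from TC to LC shrinks the flagged set, which is the kind of sensitivity the slide refers to.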
Types of Misannotation: BTC (54%); SFA (31%); NSA (9%); MFR (6%).
NSA: No Superfamily Association. SFA: Superfamily Association Only. MFR: Missing Functionally Important Residues. BTC: Below Trusted Cutoff.
Misannotations not due to overprediction: NSA. Misannotations due to overprediction: SFA, MFR, BTC.
Biggest Problem: predicting function without sufficient evidence. 21
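The four categories can be read as a decision sequence. The sketch below is one possible ordering of the checks, not a confirmed description of the analysis pipeline: the `Evidence` fields, the function name and the order of tests are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of the evidence for one annotated sequence."""
    matches_superfamily: bool        # any association with the superfamily
    family_score: Optional[float]    # score vs. annotated family; None if no association
    trusted_cutoff: float            # trusted cutoff (TC) of that family model
    has_functional_residues: bool    # functionally important residues present


def misannotation_type(ev: Evidence) -> Optional[str]:
    """Return NSA, SFA, MFR or BTC, or None if no problem is detected.

    The order of the checks is an assumption for this sketch.
    """
    if not ev.matches_superfamily:
        return "NSA"  # No Superfamily Association
    if ev.family_score is None:
        return "SFA"  # Superfamily Association Only
    if not ev.has_functional_residues:
        return "MFR"  # Missing Functionally Important Residues
    if ev.family_score < ev.trusted_cutoff:
        return "BTC"  # Below Trusted Cutoff
    return None


# Hypothetical example: family member scoring under the trusted cutoff.
print(misannotation_type(Evidence(True, 120.0, 300.0, True)))
```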
Error Propagation
Sequence 1, labeled "Dipeptide Epimerase":
>gi|13786715|pdb|1HZY|A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
Sequence 2, labeled "Unknown Function":
>gi|1176259|sp|P45548|PHP_ECOLI Phosphotriesterase homology protein
MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
VMLRENPSQFFQ
1 = 2, so sequence 2 becomes "Dipeptide Epimerase" as well. 1 & 2 "Dipeptide Epimerase": INCORRECT! 22
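The propagation mechanism on this slide can be illustrated with a toy annotation-transfer loop: every unannotated sequence copies the label of its most similar sequence, so one wrong label spreads along the chain. This is a simplified sketch, not the procedure used by any particular database; the function name, the identifiers and the best-hit table are hypothetical.

```python
from typing import Dict, Optional


def transfer_annotations(
    annotations: Dict[str, Optional[str]],
    best_hit: Dict[str, str],
) -> Dict[str, Optional[str]]:
    """Copy labels from each sequence's best hit until nothing changes.

    annotations: sequence id -> label, or None if function is unknown.
    best_hit: sequence id -> id of its most similar sequence.
    """
    result = dict(annotations)
    changed = True
    while changed:
        changed = False
        for seq_id, label in list(result.items()):
            if label is not None:
                continue
            hit = best_hit.get(seq_id)
            inherited = result.get(hit) if hit is not None else None
            if inherited is not None:
                result[seq_id] = inherited  # label copied without new evidence
                changed = True
    return result


# Hypothetical ids: seq1 carries a (possibly wrong) label, the rest are unknown.
start = {"seq1": "Dipeptide Epimerase", "seq2": None, "seq3": None}
hits = {"seq2": "seq1", "seq3": "seq2"}
print(transfer_annotations(start, hits))
```

If the label on seq1 is wrong, seq2 and seq3 inherit the same error, even though seq3 was never compared with seq1 directly.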
BLAST sequence similarity network (E-value 1 × 10^-30 or lower). Distance between nodes reflects level of sequence similarity. Legend: sequence similarity; correct annotation; incorrect annotation. Misannotations cluster with each other: indication of error propagation. 23
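A network of this kind can be sketched with the standard library alone: keep pairwise hits at or below the E-value threshold as edges, then ask how often the neighbors of a misannotated node are themselves misannotated. The threshold is read here as 1e-30 from the slide; the function names, the tuple layout of the hits and the example data are assumptions for illustration, and the neighbor fraction is just one simple clustering measure, not the one used in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network(
    hits: Iterable[Tuple[str, str, float]], max_evalue: float = 1e-30
) -> Dict[str, Set[str]]:
    """Build an undirected graph from (id_a, id_b, evalue) pairwise hits."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b, evalue in hits:
        if a != b and evalue <= max_evalue:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def misannotated_neighbor_fraction(
    neighbors: Dict[str, Set[str]], misannotated: Set[str]
) -> float:
    """Fraction of edges leaving misannotated nodes that end at misannotated nodes."""
    total = 0
    shared = 0
    for node in misannotated:
        for other in neighbors.get(node, ()):
            total += 1
            if other in misannotated:
                shared += 1
    return shared / total if total else 0.0


# Hypothetical hits; the d-e pair is dropped because its E-value is too high.
hits = [("a", "b", 1e-50), ("b", "c", 1e-40), ("c", "d", 1e-35), ("d", "e", 1e-10)]
network = build_network(hits)
print(misannotated_neighbor_fraction(network, {"a", "b"}))
```

A value near 1 would mean misannotated sequences mostly neighbor each other, which is the pattern the slide interprets as a sign of error propagation.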
In Conclusion... Misannotation is a serious problem: automated databases; across multiple folds, functions and superfamilies; hard to predict misannotation a priori. Manual curation delivers the highest quality. Misannotation problem is getting worse. Overprediction is a common problem. Error propagation appears to be a common source of misannotation. 24
Acknowledgements: Patricia Babbitt & lab; Shoshana Brown; Igor Dodevski (University of Zürich); Tanja Kortemme & lab; Colin Smith; Jim Wells lab; Emily Crawford. Funding: Howard Hughes Pre-Doctoral Fellowship; NIH & NSF. Reference: PLoS Comput Biol. 2009 Dec;5(12):e1000605. Images: Wiki Commons & Science Magazine for some images. 25