Annotation Error in Public Databases. Alexandra Schnoes, University of California, San Francisco. October 25, 2010


2 New genomes (and metagenomes) sequenced every day...


12 Computational Function Prediction Needed (figure: total sequences vs. characterized sequences)

13 What about the error that results from large-scale function prediction?

14 Our focus: commonly used protein sequence databases. How prevalent is misannotation in common sequence databases? What can we learn about these annotation errors and annotation in general?

15 What is function? Many possible definitions: phenotype, enzymatic reaction.

17 Why use enzymes? Concrete definition of function: chemical conversion of substrate to product. Function can be mapped to specific residues.

18 Functionally Diverse Enzyme Superfamilies. Superfamily: multifunctional; conserved mechanistic step; low % sequence ID. Family: monofunctional; family-specific residues; % ID within families > % ID between families.

21 What is needed for the misannotation analysis? Gold standard sequence set requirements: (1) organized hierarchy & data: superfamily definitions, family definitions, sequences, sequence alignments, statistical models; (2) functions are experimentally characterized; (3) understand functional mechanism: structure, active site, functionally important residues; (4) large set. The set: 6 superfamilies, 5 structural folds, 37 families, 5/6 E.C. categories (Genome Biol. 2006;7(1):R8).

23 Gold Standard Sequence Set (sfld.rbvi.ucsf.edu): sequence models (HMMs); evidence codes; functionally important residues; hierarchically organized; hand-curated; sequence alignments.

24 Data Source: Commonly Used Sequence Databases. NCBI (automated, large); TrEMBL (automated, large); KEGG (automated); Swiss-Prot (curated, small).

25 Analysis Question. Given: a protein sequence annotated to a specific enzyme function. Is that annotation correct?

26 General Process (figure labels: non-family members, family members; cutoffs LC, NC, TC)

34 Variable percent misannotation. Manually curated Swiss-Prot is most accurate.

35 Misannotation Problem is Getting Worse (figure: Sequences Deposited by Year and the Fraction Predicted to be Misannotated (NR DB); axes: year, number of sequences, fraction predicted misannotated; series: incorrect annotations, correct annotations)

36 What are the characteristics of these misannotations?

37 Sensitivity to threshold change (figure labels: non-family members, family members). Cutoffs: TC = Trusted Cutoff; NC = Noise Cutoff; LC = Lenient Cutoff.
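The slide names three cutoffs but the transcript does not preserve their values. As a minimal sketch of what "sensitivity to threshold change" means in practice, the snippet below counts how many sequences would be accepted as family members under each cutoff. The scores, the cutoff values, and the assumed ordering LC < NC < TC are all invented for illustration and are not taken from the talk.

```python
from typing import Dict, List


def count_passing(scores: List[float], cutoffs: Dict[str, float]) -> Dict[str, int]:
    """Count how many scores meet or exceed each named cutoff."""
    return {name: sum(1 for s in scores if s >= cut) for name, cut in cutoffs.items()}


# Hypothetical model scores for six sequences annotated to one family.
scores = [12.0, 35.5, 48.0, 51.2, 77.9, 90.3]

# Hypothetical cutoff values; only the names (LC, NC, TC) come from the slide.
cutoffs = {"LC": 30.0, "NC": 50.0, "TC": 75.0}

passing = count_passing(scores, cutoffs)
for name, n in passing.items():
    print(f"{name}: {n} of {len(scores)} sequences pass")
```

Moving from the lenient to the trusted cutoff shrinks the accepted set, so the estimated amount of misannotation depends on which cutoff is chosen.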

40 Types of Misannotation. NSA: No Superfamily Association (9%). SFA: Superfamily Association Only (31%). MFR: Missing Functionally Important Residues (6%). BTC: Below Trusted Cutoff (54%). The chart separates misannotations due to overprediction from misannotations not due to overprediction. Biggest problem: predicting function without sufficient evidence.
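The four category names can be read as a decision cascade. The sketch below is one possible way to encode that reading; the `Evidence` fields, the order of the checks, and the example values are assumptions made for illustration, not the procedure used in the talk.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Evidence:
    """Hypothetical per-sequence evidence; field names are illustrative."""
    superfamily_hit: bool          # matches any model of the superfamily
    family_score: Optional[float]  # score against the annotated family's model, None if no hit
    trusted_cutoff: float          # trusted cutoff (TC) of that family model
    has_key_residues: bool         # functionally important residues are present


def classify(e: Evidence) -> str:
    """Assign one category; the order of the checks is an assumption."""
    if not e.superfamily_hit:
        return "NSA"  # No Superfamily Association
    if e.family_score is None:
        return "SFA"  # Superfamily Association Only
    if not e.has_key_residues:
        return "MFR"  # Missing Functionally Important Residues
    if e.family_score < e.trusted_cutoff:
        return "BTC"  # Below Trusted Cutoff
    return "consistent"


def percentages(items: List[Evidence]) -> Dict[str, float]:
    """Percentage of sequences falling into each category."""
    if not items:
        return {}
    counts = Counter(classify(e) for e in items)
    return {label: 100.0 * n / len(items) for label, n in counts.items()}


# Invented examples, one per outcome.
examples = [
    Evidence(False, None, 75.0, False),
    Evidence(True, None, 75.0, False),
    Evidence(True, 80.0, 75.0, False),
    Evidence(True, 40.0, 75.0, True),
    Evidence(True, 90.0, 75.0, True),
]
labels = [classify(e) for e in examples]
print(labels)
print(percentages(examples))
```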

46 Error Propagation (example: Dipeptide Epimerase)
Sequence 1:
>gi pdb 1HZY A Chain A, High Resolution Structure Of The Zinc-Containing Phosphotriesterase From Pseudomonas Diminuta
GDRINTVRGPITISEAGFTLTHEHICGSSAGFLRAWPEFFGSRKALAEKAVRGLRRARAAGVRTIVDVST
FDIGRDVSLLAEVSRAADVHIVAATGLWFDPPLSMRLRSVEELTQFFLREIQYGIEDTGIRAGIIKVATT
GKATPFQELVLKAAARASLATGVPVTTHTAASQRDGEQQAAIFESEGLSPSRVCIGHSDDTDDLSYLTAL
AARGYLIGLDHIPHSAIGLEDNASASALLGIRSWQTRALLIKALIDQGYMKQILVSNDWLFGFSSYVTNI
MDVMDRVNPDGMAFIPLRVIPFLREKGVPQETLAGITVTNPARFLSPTLRAS
Sequence 2:
>gi sp P45548 PHP_ECOLI Phosphotriesterase homology protein
MSFDPTGYTLAHEHLHIDLSGFKNNVDCRLDQYAFICQEMNDLMTRGVRNVIEMTNRYMGRNAQFMLDVM
RETGINVVACTGYYQDAFFPEHVATRSVQELAQEMVDEIEQGIDGTELKAGIIAEIGTSEGKITPLEEKV
FIAAALAHNQTGRPISTHTSFSTMGLEQLALLQAHGVDLSRVTVGHCDLKDNLDNILKMIDLGAYVQFDT
IGKNSYYPDEKRIAMLHALRDRGLLNRVMLSMDITRRSHLKANGGYGYDYLLTTFIPQLRQSGFSQADVD
VMLRENPSQFFQ
Slide labels: Unknown Function; Dipeptide Epimerase; 1 = 2; sequences 1 & 2 both labeled Dipeptide Epimerase: INCORRECT!

48 Misannotations cluster with each other: an indication of error propagation. BLAST sequence similarity network (edges at E-value 1 × 10^-30 or lower). Distance between nodes reflects level of sequence similarity. Legend: sequence similarity (edges); correct annotation; incorrect annotation.
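A similarity network of this kind can be sketched in a few lines: sequences are nodes, and an edge is drawn when a pairwise BLAST hit has an E-value at or below a threshold. The example below uses invented hits, invented annotation labels, and an illustrative threshold of 1e-30 (an assumption). It also computes a simple hypothetical clustering measure: among neighbors of incorrectly annotated sequences, the fraction that are themselves incorrectly annotated.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network(hits: Iterable[Tuple[str, str, float]],
                  max_evalue: float) -> Dict[str, Set[str]]:
    """Undirected graph with an edge for every hit at or below max_evalue."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b, evalue in hits:
        if a != b and evalue <= max_evalue:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def neighbor_agreement(graph: Dict[str, Set[str]], incorrect: Set[str]) -> float:
    """Fraction of (incorrect node, neighbor) pairs whose neighbor is also incorrect."""
    total = same = 0
    for node in incorrect:
        for neighbor in graph.get(node, ()):
            total += 1
            same += neighbor in incorrect
    return same / total if total else 0.0


# Invented pairwise hits: (sequence A, sequence B, E-value).
hits = [
    ("s1", "s2", 1e-50),
    ("s2", "s3", 1e-40),
    ("s3", "s4", 1e-35),
    ("s1", "s3", 1e-32),
    ("s4", "s5", 1e-10),  # too weak, no edge at the chosen threshold
]
incorrect = {"s1", "s2", "s3"}  # hypothetical misannotated sequences

graph = build_network(hits, max_evalue=1e-30)
fraction = neighbor_agreement(graph, incorrect)
print(f"{fraction:.2f} of neighbors of misannotated sequences are also misannotated")
```

A value near 1 would be consistent with the slide's observation that misannotations cluster together. On its own it does not prove error propagation.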

51 In Conclusion... Misannotation is a serious problem: in automated databases; across multiple folds, functions and superfamilies; hard to predict a priori. Manual curation delivers the highest quality. The misannotation problem is getting worse. Overprediction is a common problem. Error propagation appears to be a common source of misannotation.

52 Acknowledgements: Patricia Babbitt & lab; Shoshana Brown; Igor Dodevski (University of Zürich); Tanja Kortemme & lab; Colin Smith; Jim Wells lab; Emily Crawford. Funding: Howard Hughes Pre-Doctoral Fellowship; NIH & NSF. PLoS Comput Biol Dec;5(12):e. Wiki Commons & Science Magazine for some images.
