Functional Annotation

Rudolph Leonard
5 years ago

1 Functional Annotation

2 Outline: Introduction, Strategy, Pipeline, Databases

3 Now, what's next?

4 Functional Annotation: "Adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes." (Lincoln Stein, 2001)

5 Levels of annotation
1) Protein-level annotation: to name the proteins and assign them a function
2) Process-level annotation: to connect genes to biological processes

6 Three approaches
- Extrinsic approach: homolog search and information transfer; information (mostly) based on proteins with known function
- Intrinsic approach: ab initio; uses intrinsic characteristics of gene/protein features; gives additional, but limited, information for proteins of unknown function
- Pathways: biological processes and metabolic pathways; connects pathways and processes with genes and proteins

7 From the computer to the lab
- Multi-domain proteins
- Proteins with no match
- Integrating results
- Well-annotated genomes
- Data must make sense
- Integrate the work of bioinformaticians and experimental biologists

8 BLASTP Annotation Strategy

9 BLASTP
- First level of annotation: function prediction
- Second level of annotation: pathway and operon prediction

10 First level of annotation: function prediction

11 [Pipeline diagram: first level of annotation]
- Input: genes (consensus and non-consensus genes) and their proteins
- Intrinsic approach: LipoP, TMHMM, SignalP
- Extrinsic approach: BLASTP (NeMeSys, Swiss-Prot) and InterProScan (SMART, HAMAP, PROFILES, SUPERFAMILY, GENE3D, PRINTS, PFAM, PRODOM, PROSITE, PIR Superfamily, TIGRFAMs)
- Consensus script: annotation, Gene Ontology
- If no annotation: BLASTP against UniRef90
- Output: first level of annotation

14 NeMeSys: Neisseria meningitidis database, manually annotated

15 Why? The main aim of NeMeSys is to facilitate the identification of gene function, notably the discovery of genes essential for meningococcal pathogenesis and/or viability, and ultimately to narrow the gap between sequence and function in the meningococcus.

16 How?
- Manually annotated using MicroScope (a platform for microbial genome annotation).
- Created from the sequenced genome of N. meningitidis 8013, serogroup C (Central and Eastern Europe), and 4,500 transposon mutants.
- To maximize the potential of NeMeSys for functional analysis, multiple strains were manually (re)annotated:
  N. meningitidis MC58 (ST-32, serogroup B)
  N. meningitidis Z2491 (ST-4, serogroup A)
  N. meningitidis FAM18 (ST-11, serogroup C)
  N. meningitidis 053442 (ST-4821, serogroup C)
  N. meningitidis α14 (unencapsulated, carrier strain)
  N. lactamica (commensal)
  N. gonorrhoeae FA 1090 (clinical isolate)
  N. gonorrhoeae NCCP11945 (clinical isolate)

17 Data Sharing
- All information is stored within PkGDB (Prokaryotic Genome DataBase) in a thematic subdatabase named NeisseriaScope.
- The user has unlimited access to the whole array of exploratory tools and libraries.
- Up to 10 mutants can be requested.

18 MAGE interface [screenshot with panels A to E]

20 INTRINSIC APPROACH
- Ab initio approaches, not based on homology or any kind of similarity
- Based on neural networks (NN) or hidden Markov models (HMM)
- Try to find the function or subcellular location to better classify the protein
- Proposal: use 3 programs
  SignalP for signal peptides
  LipoP for lipoproteins
  TMHMM for transmembrane helices

21 Why do we need so much information?
- Predicting the destination/location of a protein gives more information about its function
- Subcellular location is very important in determining function
- Signal peptides and transmembrane proteins are potential vaccine and drug targets

22 SIGNAL PEPTIDES
- Based on a combination of several artificial neural networks (sliding window) and hidden Markov models
- Predicts the presence and location of signal peptide cleavage sites in amino acid sequences (C-score)
- The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction (S-score)

24 LIPOPROTEINS
- LipoP is used to make predictions about lipoproteins
- Discriminates between lipoproteins and other signal peptides
- HMM based
- High-confidence results with few false positives

25 TRANSMEMBRANE HELICES
- TMHMM finds the transmembrane helices that span the membrane
- Transmembrane helices have diverse functions
- Based on HMM
- Uses scores generated intrinsically for each query protein
- Results comparable to experimentally discovered TM proteins

27 Universal Protein Resource (UNIPROT)

28 Homology Search against UniProt
- BLASTP against UniProt
- Statistically significant BLAST hits usually signify sequence homology
- Protein sequence analysis allows protein classification
- As many as 1,736 genes (approx. 90%) are shared: these genes encode proteins displaying at least 30% amino acid identity over at least 80% of their length and are in synteny
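The identity and length cut-offs quoted on this slide can be expressed as a simple filter on BLASTP hits. The sketch below is only an illustration: the `Hit` class, its field names, and the use of alignment length divided by sequence length as the coverage measure are assumptions, and the synteny condition is not checked.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLASTP hit (hypothetical container, not part of any pipeline)."""
    query_id: str
    subject_id: str
    percent_identity: float  # e.g. the BLAST 'pident' column
    alignment_length: int    # aligned columns, gaps included
    query_length: int
    subject_length: int


def passes_homology_cutoffs(hit, min_identity=30.0, min_coverage=0.80):
    """Return True if the hit has >= 30% identity over >= 80% of both proteins.

    Coverage is approximated as alignment length / sequence length, which
    slightly overestimates coverage when the alignment contains gaps.
    """
    query_cov = hit.alignment_length / hit.query_length
    subject_cov = hit.alignment_length / hit.subject_length
    return (hit.percent_identity >= min_identity
            and min(query_cov, subject_cov) >= min_coverage)


if __name__ == "__main__":
    example = Hit("q1", "s1", 45.0, 270, 300, 310)
    print(passes_homology_cutoffs(example))
```

Requiring the cut-off on both the query and the subject avoids accepting a hit that covers only one domain of a longer multi-domain protein.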

29 UniProt
- Central resource for storing and interconnecting information from large and disparate sources
- Most comprehensive catalogue of protein sequence and functional annotation
- UniProt: protein data only; curated in Swiss-Prot, not in TrEMBL
- GenBank & RefSeq: protein and nucleic acid data; curated in RefSeq, not in GenBank
- UniProt: merger of resources from the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB)
- 3 database components, each addressing a key need in protein bioinformatics

30 [Diagram: UniProt components]
- UniProt NREF: clustering at 100%, 90% and 50%
- UniProt Knowledgebase: automated annotation and literature-based annotation
- UniProt Archive: automated merging of sequences
- Sources: Swiss-Prot, TrEMBL, PIR-PSD, RefSeq, EMBL/DDBJ/GenBank, Ensembl, PDB, patent data, other data

31 UniProtKB
- Expertly curated, comprehensive protein database
- Central access point for integrated protein information with cross-references to multiple sources
- Targets: relevant literature, numerous cross-references
- UniProtKB/Swiss-Prot: manually annotated records with information extracted from literature and curator-evaluated computational analysis
- UniProtKB/TrEMBL: high-quality computationally analyzed records enriched with automatic annotation and classification

32 InterProScan

33 What is InterPro? InterPro is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterization of a given protein family, domain or functional site.

34 InterPro member databases
No. | Member database | Short description
1 | PROSITE Patterns | Biologically significant amino acid patterns stored as regular expressions
2 | PROSITE Profiles | Stored as weight matrices (profiles) for detection of domains
3 | HAMAP Profiles | Similar to PROSITE profiles, specifically for Bacteria and Archaea
4 | PRINTS | Collection of protein family fingerprints
5 | PFAM | A curated database of protein families
6 | PRODOM | Database of protein domain families obtained through UniProt
7 | SMART | Allows identification and annotation of genetically mobile domains
8 | TIGRFAMs | Collection of protein families with curated multiple sequence alignments and HMMs
9 | PIR Superfamily | Classification based on evolutionary relationship of whole proteins
10 | SUPERFAMILY | Library of profile HMMs that represent all SCOP proteins
11 | GENE3D | Protein family database, supplementary to CATH
12 | PANTHER | Classifies proteins to facilitate high-throughput analysis

35 Important InterPro applications
Application | Short description
ProfileScan | Scans against PROSITE profiles. These profiles are based on weight matrices and are more sensitive for the detection of divergent protein families.
FingerPRINTScan | Scans against the fingerprints in the PRINTS database. These fingerprints are groups of motifs that together are more potent than single motifs by making use of the biological context inherent in a multiple-motif method.
HMMPfam | Scans the hidden Markov models (HMMs) that are present in the protein domain databases Pfam, TIGRFAMs and SMART.
BLASTProDom.pl | Scans the families in the ProDom database. ProDom is a comprehensive set of protein domain families automatically generated from the UniProtKB/Swiss-Prot sequence database using PSI-BLAST. In InterProScan the blastpgp program is used to scan the database.
SuperFamily | SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure.
HMMPIR | Scans the hidden Markov models (HMMs) that are present in the PIR Protein Sequence Database (PSD) of functionally annotated protein sequences, PIR-PSD.

37 Second level of annotation: pathway and operon prediction

38 [Pipeline diagram: second level of annotation]
- Input: first level of annotation
- Pathways: BLASTP against KEGG
- Operons: BLASTP against OPERON_DB; DOOR
- Consensus script
- Output: second level of annotation

41 KEGG: Kyoto Encyclopedia of Genes and Genomes
- KEGG/KAAS
- KEGG Orthology
- KEGG GENES
- KEGG PATHWAY
- Applications

42 KEGG
- Visualizes the functions of enzymes in a genome by mapping them onto biosynthetic pathways
- Consists of many databases, of which the most relevant to our purposes are GENES and PATHWAY

43 KAAS: KEGG Automatic Annotation Server
- KAAS input is a FASTA file
- Provides annotations through BLAST comparisons against the KEGG GENES database
- KEGG GENES: database of functionally annotated genes
- GENES contains 5.3 million genes from various genomes: 129 eukaryotic, 971 bacterial, 74 archaeal

44 Enzyme Commission (EC) numbers
- EC numbers identify enzyme-catalyzed reactions, NOT enzymes
- If several enzymes catalyze the same reaction, they receive the same EC number
- Can be used to identify proteins on a metabolic pathway, but not on reference pathways such as cellular processes

45 KEGG Orthology (KO)
- Orthologs: genes with common ancestry and the same function in different organisms
- Tend to have a similar sequence and location in genomes
- Using ortholog identifiers leads to a more specific classification system, since many related enzymes share the same EC number
- Links information in the GENES and PATHWAY databases

46 KEGG GENES
Category | Definition | Example
ENTRY | KO number | K00033
NAME | Enzyme reference names | E , PGD, gnd
DEFINITION | Enzyme identity and EC number | 6-phosphogluconate dehydrogenase [EC: ]
PATHWAY | Pathways involved in | ko00030 Pentose phosphate pathway
MODULE | Chemical reactions involved in and compounds acted on | M00006 Pentose phosphate pathway, oxidative phase, glucose 6P => ribulose 5P
CLASS | Type of pathway | Metabolism; Carbohydrate Metabolism; Pentose phosphate pathway
DBLINKS | Links to other database entries for the enzyme | COG: COG0362
GENES | Gene information for a particular organism | HSA: 5226(PGD)

48 KEGG PATHWAY database
- Graphical representations of biosynthetic pathways
- Predicts pathways by comparing the enzymes found in a genome with reference pathways
- Information in the GENES database is linked to the information in the PATHWAY database through KO identifiers
- If pathways are incomplete, the missing enzymes are visually detectable
- Both metabolic and reference pathways are mapped
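The idea of spotting missing enzymes can also be sketched programmatically as a set difference between the KO identifiers a reference pathway expects and those assigned to a genome. This is a minimal illustration, not how KEGG itself renders maps; the function name and the example KO sets are assumptions made for the sketch (only K00033 appears in these slides).

```python
def missing_orthologs(reference_kos, genome_kos):
    """Return the KO identifiers expected by a pathway but absent from a genome."""
    return sorted(set(reference_kos) - set(genome_kos))


if __name__ == "__main__":
    # Illustrative KO sets only; a real run would take them from KAAS output
    # and from the KEGG PATHWAY or MODULE definition.
    reference_module = {"K00036", "K01057", "K00033"}
    genome_hits = {"K00036", "K00033", "K00850"}
    print(missing_orthologs(reference_module, genome_hits))
```

A missing KO may reflect a true gap in the pathway or simply a gene that was not predicted or annotated, so the result is a list of candidates to check.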

49 Sample KEGG Pathway Map

51 Two-Component Systems: NarX phosphorylates NarL upon binding to nitrate and nitrite. This causes a downregulation of fumarate reductase expression and an upregulation of formate dehydrogenase expression.

52 KEGG: Why use it?
- It's a tool for visualizing the proteins of a genome in biosynthetic pathways
- Aids in checking annotations: missing proteins can be detected visually
- Can aid the comparative genomics group: comparisons of protein pathways between genomes

54 What is an operon?
- Operon: family of co-regulated genes
- Adjacent
- Same orientation
- Not separated by promoters/terminators
- Related functions
- Strong selective pressure, conserved
- Knowledge of operon -> FUNCTION
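The positional criteria above (adjacent genes in the same orientation) are enough to propose candidate operons from gene coordinates. The sketch below is a deliberately naive illustration and not the method of any tool in these slides: the `Gene` tuple, the single-replicon assumption, and the `max_gap` threshold of 50 bp are all assumptions.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based start coordinate
    end: int     # end coordinate, end >= start
    strand: str  # "+" or "-"


def candidate_operons(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group neighbouring genes that share a strand and lie within max_gap bp.

    Genes are assumed to come from a single replicon. Only groups of two or
    more genes are reported.
    """
    groups: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g.start):
        same_unit = (
            current
            and gene.strand == current[-1].strand
            and gene.start - current[-1].end <= max_gap
        )
        if same_unit:
            current.append(gene)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [gene]
    if len(current) > 1:
        groups.append(current)
    return [[g.name for g in group] for group in groups]


if __name__ == "__main__":
    genes = [
        Gene("a", 1, 100, "+"), Gene("b", 120, 300, "+"),
        Gene("c", 310, 500, "-"), Gene("d", 900, 1000, "-"),
        Gene("e", 1020, 1100, "-"),
    ]
    print(candidate_operons(genes))
```

Such groups are only candidates; promoters and terminators inside a group, which the slide lists as an exclusion criterion, are not detected here.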

55 Operon DB: database of predicted operons
- Microbial genomes
- Computer prediction of operon structures
- 500 genomes

56 Computational Approach
- Gene pair: adjacent, same strand, intergenic length separation
- P(gene pair in operon) = 1 P(conserved D) X P(SN S) - P chance P(conserved S)
- D, S = sets of gene pairs
- P chance = P(conserved S has homologs in other genomes)

57 Algorithm
- Identification of conserved pairs
- Identification of orthologs using BLAST
- Finding conserved gene clusters: Homology Teams software
- Evolutionary distance:
  D(G_1, G_2) = \frac{n(g_1) + n(g_2) - h(g_1; G_2) - h(g_2; G_1)}{n(g_1) + n(g_2)}
- Larger distance = greater probability of conservation
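The evolutionary distance on this slide can be read as the fraction of genes in the two genomes that have no homolog in the other genome. The sketch below follows that reading, which is an assumption: n is taken to be the number of genes in a genome and h the number of its genes with a homolog in the other genome. The function and parameter names are hypothetical.

```python
def evolutionary_distance(n1: int, n2: int, h12: int, h21: int) -> float:
    """Distance between two genomes from gene and homolog counts.

    n1, n2: number of genes in genome 1 and genome 2 (assumed meaning of n)
    h12:    genes of genome 1 with a homolog in genome 2 (assumed meaning of h)
    h21:    genes of genome 2 with a homolog in genome 1
    """
    total = n1 + n2
    if total == 0:
        raise ValueError("both genomes are empty")
    return (total - h12 - h21) / total


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(evolutionary_distance(2000, 1800, 1500, 1400))
```

Under this reading the distance is 0 when every gene has a homolog in the other genome and 1 when none does.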

58 Alternative possibilities considered
- Functionally unrelated genes may have the same order simply because they were adjacent in a common ancestor.
- Genes may be adjacent in two genomes by chance alone, or due to horizontal transfer of the gene pair.

59 Interface
1) List of genomes
2) List of gene pairs

60 Sample output

61 DOOR
- Universal, genome-specific prediction
- Reliability ∝ intergenic distances
- Different features (scored):
  Intergenic distance
  Conservation of gene pairs (neighborhood)
  Phylogenetic patterns
  Ratio between the lengths of two genes
  Frequencies of specific DNA motifs in the intergenic regions (MEME & CUBIC)

65 Doubts?
- No stand-alone version/code available
- Can't be automated
- New query runs?

66 QUESTIONS?

67 Mechanism: RAW, FASTA, EMBL, USER SPECIFIED

68 Features
- Pure Perl; the modular organization makes the implementation efficient for bulk sequence analysis.
- Indexes the corresponding databases, hence fast retrieval.
- User-specified post-processing and cut-offs make it possible to filter the final results.

69 RAW FORMAT OUTPUT
NF A5FDCE74AB7C3AD 272 HMMPIR PIRSF Prephenate dehydratase e 141 T 06-Aug-2005 IPR Prephenate dehydratase with ACT region Molecular Function:prephenate dehydratase activity (GO: ), Biological Process:L-phenylalanine biosynthesis (GO: )
Where:
- NF : is the id of the input sequence.
- 27A9BBAC0587AB84: is the crc64 (checksum) of the protein sequence (supposed to be unique).
- 272: is the length of the sequence (in AA).
- HMMPIR: is the analysis method launched.
- PIRSF001424: is the database member's entry for this match.
- Prephenate dehydratase: is the database member's description for the entry.
- 1: is the start of the domain match.
- 270: is the end of the domain match.
- 6.5e-141: is the e-value of the match (reported by the member database method).
- T: is the status of the match (T: true, ?: unknown).
- 06-Aug-2005: is the date of the run.
- IPR008237: is the corresponding InterPro entry (if iprlookup was requested by the user).
- Prephenate dehydratase with ACT region: is the description of the InterPro entry.
- Molecular Function:prephenate dehydratase activity (GO: ): is the GO (gene ontology) description for the InterPro entry.
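The field order described on this slide maps naturally onto a small parser. The sketch below assumes that the raw output is tab-separated with one match per line and that the InterPro and GO columns may be absent; the field names, the `parse_raw_line` helper and the sequence id `seq1` in the example are hypothetical.

```python
RAW_FIELDS = (
    "seq_id", "crc64", "length", "method", "entry", "entry_description",
    "start", "end", "evalue", "status", "date",
    "interpro_id", "interpro_description", "go_terms",
)


def parse_raw_line(line: str) -> dict:
    """Split one raw-format line into named fields (tab-separated is assumed)."""
    values = line.rstrip("\n").split("\t")
    # The last three columns are optional, so pad short lines with empty strings.
    values += [""] * (len(RAW_FIELDS) - len(values))
    record = dict(zip(RAW_FIELDS, values))
    for key in ("length", "start", "end"):
        record[key] = int(record[key])
    try:
        record["evalue"] = float(record["evalue"])
    except ValueError:
        record["evalue"] = None  # some methods may not report a numeric e-value
    return record


if __name__ == "__main__":
    # Example line assembled from the field values listed on the slide;
    # 'seq1' is a placeholder id and the GO column is left out.
    example = "\t".join([
        "seq1", "27A9BBAC0587AB84", "272", "HMMPIR", "PIRSF001424",
        "Prephenate dehydratase", "1", "270", "6.5e-141", "T", "06-Aug-2005",
        "IPR008237", "Prephenate dehydratase with ACT region",
    ])
    print(parse_raw_line(example)["entry"])
```

Keeping the fields in a plain dictionary makes it easy to apply the user-specified cut-offs mentioned earlier, for example on `evalue` or `status`.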

70 Gene Ontology Classification
- Cellular component (WHERE?): the gene product is located in or active in it
- Biological process (WHICH?): the gene product is associated with it
- Molecular function (WHAT?): the gene product performs it
Example: gene product = cytochrome c
- WHERE? Mitochondrial matrix and inner membrane
- WHAT? Oxidoreductase activity
- WHICH? Oxidative phosphorylation and induction of cell death

71 Gene Ontology Classification
- Specific function: specific name and gene symbol
- Likely (or unlikely) function: putative or homolog
- Generic function: protein family
- Unknown function: hypothetical, unknown function
From: JCVI Gene Naming and Annotation Conventions
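These naming levels could be applied by a consensus script as a simple mapping from confidence level to product name. The sketch below is only an illustration of that idea: the level labels, the function name, and the exact wording of the generated names ("putative ...", "... family protein", "hypothetical protein") are assumptions, not quotations from the JCVI conventions.

```python
from typing import Optional


def product_name(level: str,
                 function: Optional[str] = None,
                 family: Optional[str] = None) -> str:
    """Build a product name from a confidence level (hypothetical labels).

    level: "specific", "likely", "generic" or "unknown"
    """
    if level == "specific" and function:
        return function
    if level == "likely" and function:
        return f"putative {function}"
    if level == "generic" and family:
        return f"{family} family protein"
    # Anything else, including missing evidence, falls back to the unknown case.
    return "hypothetical protein"


if __name__ == "__main__":
    print(product_name("likely", function="prephenate dehydratase"))
```

Falling back to the least specific name whenever evidence is missing keeps the script from over-annotating proteins with no match.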

Review of exercise 1
tblastn -num_threads 2 -db contig -query DH10B.fasta -out blastout.xls -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length nident pident evalue"
Other options:
-max_target_seqs: maximum number of targets to report
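The tabular output requested by the command above has one hit per line, with the ten columns in the order given to -outfmt. A minimal sketch for reading such a file is shown below; the `read_hits` helper and the two sample rows are made up for illustration, and the column names simply repeat the -outfmt specifiers.

```python
import csv
import io

COLUMNS = ("qseqid sseqid qstart qend sstart send "
           "length nident pident evalue").split()
INT_COLUMNS = ("qstart", "qend", "sstart", "send", "length", "nident")
FLOAT_COLUMNS = ("pident", "evalue")


def read_hits(handle, max_evalue=1e-10):
    """Yield hits from tab-separated BLAST output as dictionaries.

    Rows with an e-value above max_evalue are skipped.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue:
            yield hit


if __name__ == "__main__":
    # Two invented rows in the same column order as the -outfmt string.
    sample = (
        "geneA\tcontig1\t1\t300\t1200\t2099\t300\t290\t96.67\t1e-150\n"
        "geneB\tcontig2\t5\t60\t10\t177\t56\t20\t35.71\t0.002\n"
    )
    for hit in read_hits(io.StringIO(sample)):
        print(hit["qseqid"], hit["sseqid"], hit["pident"])
```

For a real run, the file written by -out (blastout.xls in the exercise) can be opened with `open(path, newline="")` and passed to `read_hits` in place of the in-memory sample.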