Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Genome Annotation
Bioinformatics and Computational Biology
Frank Oliver Glöckner

Genome Analysis Roadmap
- Genome sequencing
- Assembly
- Gene prediction
- Protein targeting
- tRNA prediction
- Annotation

Genome annotation: From Sequence to Biology!
Definition: Annotation of a DNA sequence is the assignment of biologically relevant features to certain regions of the sequence.

Genome Annotation: Functional Assignment
- Translate the predicted coding region into the amino acid sequence
- Analyze the amino acid sequence
- Step 1: Search sequence databases for similar sequences (BLAST, FASTA)
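The first bullet of the functional assignment slide (translating a predicted coding region) can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and an in-frame, forward-strand sequence; the function name `translate_orf` and the example sequence are made up for this sketch and do not come from the lecture.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_orf(dna: str) -> str:
    """Translate an in-frame coding sequence up to the first stop codon.

    Hypothetical helper for illustration; codons containing ambiguous
    bases (e.g. N) are translated as 'X'.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_orf("ATGGCCAAATGA"))  # prints MAK
```

The resulting amino acid sequence is what would then be submitted to a similarity search such as BLASTP.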

Dogma of Similarity and Function
Detect similarity -> infer homology -> assign function
Challenge: Detection of Similarity!

Similarity search - sequence databases
- Nucleotide databases: GenBank, EMBL, DDBJ [INSDC]
- Protein databases, curated and non-curated: SWISS-PROT, PIR, TrEMBL [UniProt]; NR (Non-Redundant)
- EST databases, e.g. dbEST @ NCBI: expressed sequence tags, for eukaryotic systems (valuable information to confirm the transcription and splicing of predicted ORFs)

Non-redundant protein database
- Sources: GenBank/EMBL/DDBJ (translations), PIR, SWISS-PROT
- nr: non-redundant protein database
- Complementary (redundant) to UniProt/UniRef

Generation of functional evidences
- Input: list of ORFs
- Evidence types: patterns, profiles, fold, location
- Tools: BLAST, PROSITE, BLOCKS, PFAM, COG, SCOP, TMHMM, SignalP
- Automatic annotation, followed by manually refined annotation

Tools
- COG: Clusters of Orthologous Groups
- KEGG: Metabolic pathways
- GO: Gene Ontology
- SCOP: Characterize secondary structure elements and compare to database of protein folds
- Detect characteristics that indicate a cellular location of the protein
  - SignalP for signal peptide prediction: the protein does not act in the cytosol and can be located in the periplasm or outside the cell
  - TMHMM for detection of transmembrane regions: integral membrane protein or membrane-associated protein

Clusters of Orthologous Groups - COGs
http://www.ncbi.nlm.nih.gov/cog/
Definitions
- Orthologous genes: direct common ancestor; found in different species; assumed to have the same functions
- Paralogous genes: originate from a gene duplication event; found in the same species; can have different functions
- COGs: each COG consists of individual orthologous genes or orthologous groups of paralogs found in at least three genomes

COGs - Construction
- Members of a COG have evolved from a common ancestral gene by speciation
- Pairwise comparison of all proteins of all available genomes
- Determination of best hits (BeTs) for every protein
- Minimal COG: triangle of symmetrical BeTs (figure: [E. coli], [Saccharomyces], [Synechocystis])
Tatusov, 1997, Science, p. 631-637

COGs - Construction
Example: Isoleucine tRNA-synthetase
- Figure nodes: [Yeast cyto], [Yeast mito], [E. coli], [H. influenzae], [M. genitalium], [M. pneumoniae], [M. jannaschii], [Synechocystis]
- Yeast genes shown: YPL040c, YPL076c
- Figure annotations: BeTs to all bacterial and the archaeal genes; symmetrical BeTs to all bacterial genes; symmetrical BeT to the M. jannaschii gene
Tatusov, 1997, Science, p. 631-637
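The "triangle of symmetrical BeTs" idea can be illustrated with a toy best-hit table. The sketch below is not the published COG pipeline: the gene names, genome labels, and the `best_hit` table are invented for illustration, and later steps of the real procedure (merging triangles, handling paralogs) are left out.

```python
from itertools import combinations

# Hypothetical best-hit table (assumption, not real data):
# best_hit[(gene, target_genome)] = best-scoring gene in that genome.
best_hit = {
    ("ecoA", "yeast"): "yeaA", ("ecoA", "syn"): "synA",
    ("yeaA", "ecoli"): "ecoA", ("yeaA", "syn"): "synA",
    ("synA", "ecoli"): "ecoA", ("synA", "yeast"): "yeaA",
    ("ecoB", "yeast"): "yeaA",  # one-way hit only, so not symmetrical
}
genome_of = {"ecoA": "ecoli", "ecoB": "ecoli", "yeaA": "yeast", "synA": "syn"}


def is_symmetrical(gene1: str, gene2: str) -> bool:
    """True if each gene is the other's best hit in the other's genome."""
    return (best_hit.get((gene1, genome_of[gene2])) == gene2
            and best_hit.get((gene2, genome_of[gene1])) == gene1)


def bet_triangles() -> list:
    """Return gene triples from three genomes linked by symmetrical BeTs."""
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[gene] for gene in trio}) < 3:
            continue  # a minimal COG needs three different genomes
        if all(is_symmetrical(a, b) for a, b in combinations(trio, 2)):
            triangles.append(trio)
    return triangles


print(bet_triangles())  # prints [('ecoA', 'synA', 'yeaA')]
```

Here `ecoB` is excluded because its hit to `yeaA` is not reciprocated, which is the point of requiring symmetry.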

COGs - Status
Version 1, 1997
- 7 organisms: Escherichia coli, Haemophilus influenzae, Saccharomyces cerevisiae, Synechocystis sp., Mycoplasma genitalium, Mycoplasma pneumoniae, Methanococcus jannaschii
- 17,967 proteins
- 720 COGs
- 15 functional categories

COGs - Status
Version 2, December 2001: 44 genomes

COGs - Status
Version 3, 2003: 66 unicellular genomes and 7 eukaryotic genomes
Tatusov, 2003, BMC Bioinformatics, 4(1):41

COGs - Classification
Version 2, 18 functional categories

COGs - Classification
Version 3, 25 functional categories

COGs - Searching
COGNITOR
- Assigns new proteins to existing COGs
- Classification of proteins
- Prediction of function
- Provides a fast overview of the possible metabolism of an organism
COG website: http://www.ncbi.nlm.nih.gov/cog/

COG is retired

COGs - Results

COGs - Results

COGs - Result: Genomic context

COG: Functional cross-genome comparisons

Ontologies
Why ontologies?
Ontologies provide conceptualizations of domains of knowledge and facilitate both communication between researchers and the use of domain knowledge by computers for multiple purposes.
http://www.obofoundry.org/
http://bioportal.bioontology.org/
http://www.ebi.ac.uk/ontology-lookup/

The Gene Ontology Consortium
Started 1998
http://www.geneontology.org
A joint project of three eukaryotic model databases: FlyBase (Drosophila), Mouse Genome Informatics (MGI) and Saccharomyces Genome Database (SGD)
The three major goals are:
1. To develop a set of controlled, structured vocabularies to describe key domains of molecular biology, including gene product attributes and biological sequences
2. To apply GO terms in annotation of sequences, genes or gene products
3. To provide a centralized public resource allowing universal access to ontologies, annotation data sets and software tools

The GO consortium
http://www.geneontology.org/go.consortiumlist.shtml

GO: Three Ontologies
- Biological process: refers to a biological objective to which the gene or gene product contributes, e.g. cell growth and maintenance (broad) or translation (more specific)
- Molecular function: defined as the biochemical activity of a gene product (including specific binding to ligands and structures), e.g. enzyme (broad) or adenylate cyclase (more specific)
- Cellular component: refers to the place in the cell where a gene product is active, e.g. ribosome (multicomponent complex) or nuclear membrane (cellular structure)

GO Structure
Ontology structure
- Directed acyclic graphs (DAGs)
- A child can have more than one parent
Relationships
- is a = instance of the parent
- part of = component of the parent
Every GO term has a:
- Unique identifier
- Annotation (Oxford Dictionary of Molecular Biology)
- Source (literature or computational analysis)
- Evidence given in the source (e.g. experimental)

GO Network
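The DAG structure described on the GO Structure slide can be made concrete with a small sketch. The term identifiers below are invented placeholders (not real GO accessions), and the parent links only loosely follow the cellular component examples in the lecture; the point is that a term with two parents is handled naturally by a graph traversal.

```python
# Toy ontology fragment (assumption: placeholder identifiers, not real GO IDs).
# Each term maps to a list of (relationship, parent term) pairs.
parents = {
    "T:nuclear_membrane": [("part_of", "T:nucleus"), ("is_a", "T:membrane")],
    "T:nucleus": [("is_a", "T:organelle")],
    "T:membrane": [("is_a", "T:cellular_component")],
    "T:organelle": [("is_a", "T:cellular_component")],
    "T:cellular_component": [],
}


def ancestors(term: str) -> set:
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:  # a shared ancestor is visited only once
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("T:nuclear_membrane")))
```

Because the graph is acyclic, the traversal always terminates; the `seen` set simply avoids revisiting ancestors that are reachable along more than one path.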

GO Biological Process
One node can have more than one parent!
The Gene Ontology Consortium 2000, Nature Genetics vol. 25, p. 25

GO Molecular Function
The Gene Ontology Consortium 2000, Nature Genetics vol. 25, p. 25

GO Cellular Component
The Gene Ontology Consortium 2000, Nature Genetics vol. 25, p. 25

GO @ EBI: The GOA project
See http://www.ebi.ac.uk/goa for recent accomplishments in classification efforts: mapping of external catalogues to GO
- EC numbers to GO (EC2go)
- SWISS-PROT keywords to GO (spkw2go)
- InterPro entries to GO (interpro2go)
- TIGR functional roles to GO (TIGR2go)
- GenProtEC functional categories to GO (genprotec2go)
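Mapping files such as EC2go are plain text, so they are easy to load into a lookup table. The sketch below assumes the commonly seen line layout `external_id > GO:term name ; GO:accession` with `!` comment lines; that layout, the function name `parse_external2go`, and the example lines are assumptions for illustration and should be checked against the actual file before use.

```python
import re

# Assumed line layout: "<external id> > GO:<term name> ; GO:<7-digit accession>"
LINE_RE = re.compile(
    r"^(?P<ext>.+?)\s+>\s+GO:(?P<name>.+?)\s*;\s*(?P<go_id>GO:\d{7})\s*$"
)


def parse_external2go(lines) -> dict:
    """Build {external id: [(GO accession, GO term name), ...]}.

    Hypothetical helper; comment lines (starting with '!') and lines
    that do not match the assumed layout are skipped.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        match = LINE_RE.match(line)
        if match:
            mapping.setdefault(match["ext"], []).append(
                (match["go_id"], match["name"])
            )
    return mapping


# Illustrative input only, not copied from a real mapping file.
example_lines = [
    "! example comment line",
    "EC:4.2.1.3 > GO:aconitate hydratase activity ; GO:0003994",
]
print(parse_external2go(example_lines))
```

With such a table, an E.C. number assigned to an ORF can be turned into a candidate GO molecular function term in a single dictionary lookup.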

Mappings to GO

Example: BLASTP of ORF2114
Name? Functional classification?

Swiss-Prot

Example: BLASTP of ORF2114 against COG
- COG: Aconitase A
- C: Energy production and conversion
- TCA cycle

KEGG

Send to GOhst
http://www.godatabase.org/cgi-bin/amigo/go.cgi

Annotation of ORF2114
- Name: Aconitate hydratase (also Citrate hydro-lyase, Aconitase, Cis-aconitase)
- Deinococcus radiodurans (strain R1)
- Gene: acnb or citb
- E.C. 4.2.1.3
- COG: Aconitase A, C: Energy production and conversion, TCA cycle
- KEGG Pathway:
  - ko00020 Citrate cycle (TCA cycle)
  - ko00630 Glyoxylate and dicarboxylate metabolism
  - ko00720 Carbon fixation pathways in prokaryotes
- GO:
  - Recommended name: aconitate hydratase
  - Function: aconitate hydratase GO:0003994
  - Process: tricarboxylic acid cycle GO:0006099
  - Relationships shown in the figure: part of, is a

InterPro

TIGRFAMs

Rhodopirellula: Annotation
Glöckner et al., PNAS, 2003