Functional Annotation & Comparative Genomics. Lu Wang, Georgia Tech

Outline. Functional annotation: What is functional annotation? What needs to be annotated? Approaches to functional annotation. Pros/cons of available approaches. Comparative genomics: What is comparative genomics? Questions answered by comparative genomics. Approaches and tools.

What is functional annotation? http://www.biochem.arizona.edu/miesfeld/teaching/bioc471-2/pages/lecture7/lecture7.html

Take one step back. Genome Assembly: assemble the pieces right.

Gene Prediction: identify the words. When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.

Functional Annotation: identify the function (i.e., meaning) of each word. When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers. DATABASES: nat-u-ral-ist [nach-er-uh-list, nach-ruh-] noun 1. a person who studies or is an expert in natural history, especially a zoologist or botanist. 2. an adherent of naturalism in literature or art. Origin: 1580-90; natural + -ist. PROFILES: Origin of Species, The, noun ("On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life") a treatise (1859) by Charles Darwin setting forth his theory of evolution.

Comparative Genomics. When on board RMS Titanic, as painter, I was much struck with certain facts in the distribution of the inhabitants of United Kingdom, and in the socioeconomical relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of capitalism - that mystery of mysteries, as it has been called by one of our greatest philosophers. When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.

One more step back. Function? What is function?

To a cell biologist function might refer to the network of interactions in which the protein participates or to the location to a certain cellular compartment. To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme.

So what is functional annotation? Functional annotation consists of attaching biological information to genomic elements regarding: biochemical function; biological function; regulatory function; interactions.

What needs to be annotated?

What needs to be annotated? Proteins/coding portion; domains/motifs; signal peptides; transmembrane regions; non-coding RNAs; riboswitches; CRISPR; small RNAs; operons; other features needed to address the specific biological question(s).

Since proteins are really the building blocks. Proteins can be: enzymes; regulatory; receptors; virulence factors; transmembrane; structural; signal transduction; toxins; membrane.

Domain. A domain is a discrete structural unit that is assumed to fold independently of the rest of the protein, has its own function, and is ~20-100 aa long. Small subdomains can be assembled into larger domains. Figure: pyruvate kinase, a protein with three domains. http://en.wikipedia.org/wiki/protein_domain

Motif. "The sequences of many proteins contain short, conserved motifs that are involved in recognition and targeting activities, often separate. These motifs are linear, in the sense that three-dimensional organization is not required to bring distant segments of the molecule together to make the recognizable unit." - Tim Hunt (English biochemist) http://en.wikipedia.org/wiki/protein_domain

In short, motifs are short, conserved regions; they are usually the most conserved regions of domains and are critical for the domain to function. Figure: the human papillomavirus E7 oncoprotein mimic of the LxCxE motif (red) bound to the host retinoblastoma protein (dark grey), which is a tumor suppressor. (26th Feb 2014)

How do genes collectively perform a function? Operon: several genes with related functions that are regulated together, because one piece of mRNA codes for several related proteins. Polycistronic mRNA: mRNA coding for more than one polypeptide; it is found mostly in prokaryotes.

Approaches to Functional Annotation

Functional Annotation. Ab initio: based on intrinsic characteristics of gene/protein features, e.g., signal peptides (SignalP, LipoP) and transmembrane domains (TMHMM). Homology based: information transfer from an experimentally characterized system, e.g., BLAST, InterPro.

Ab initio approaches. Transmembrane (TM) regions and signal peptides have a distinct pattern of sequence composition. TM proteins are membrane-bound receptors and channels that are of particular pharmacological relevance (therapeutic or vaccine targets). Signal peptides direct proteins to their proper cellular or extracellular location.
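The "distinct pattern of sequence composition" can be illustrated with a sliding-window hydropathy scan. This is only a minimal sketch: it is not how TMHMM or SignalP work (those use trained models), the window of 19 residues and cutoff of 1.6 are the classic Kyte-Doolittle rule of thumb used here as adjustable assumptions, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence."""
    # Unknown residues (e.g. X) are scored as neutral (0.0); an assumption.
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Return merged (start, end) spans, 0-based and half-open, whose
    window mean hydropathy reaches the threshold."""
    segments = []
    for i, mean in enumerate(hydropathy_profile(seq, window)):
        if mean < threshold:
            continue
        start, end = i, i + window
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous span: extend it.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


# Toy sequence (hypothetical): charged flanks around a hydrophobic stretch.
toy = "DEKR" * 6 + "LIVAL" * 5 + "DEKR" * 6
print(candidate_tm_segments(toy))
```

A scan like this only flags candidates; dedicated predictors should be used for the actual annotation.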

Homology based approaches. Assumption: significant sequence similarity implies homology, i.e., shared ancestry, which often leads to shared function. Specifically: genes/proteins evolved to perform some function will retain that function; deleterious mutations will be weeded out by purifying selection; evolution is mostly dominated by divergence. Significant similarity will thus entail a high chance of shared origin and function.

Homology based approaches. Databases: NCBI, GenBank, RefSeq, EBI, SwissProt, UniProt, DDBJ, KEGG. Tools: BLAST, InterProScan, GO-based tools.

The Three Kingdoms

Primary vs. derivative sequence databases. Figure labels: Sequence Data From Sequencing Labs; GenBank; Curators; RefSeq; Genomes; PGAAP; UniGene.

Databases of Choice. RefSeq, SwissProt and UniProt all offer: very reliable data; a high level of annotation; minimal redundancy; integration with other databases.

Gene Ontology. Shulaev, V., Sargent, D. J., Crowhurst, R. N., Mockler, T. C., Folkerts, O., Delcher, A. L.,... & Salama, D. Y. (2010). The genome of woodland strawberry (Fragaria vesca). Nature genetics, 43(2), 109-116.

Analysis Tools, GO based: Blast2GO, GoMiner, and many more.

Analysis Tools - BLAST If you do this here.

Analysis Tools - BLAST One way of doing this

Analysis Tools - BLAST Alternatively, you can use the cloud-based version
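However BLAST is run, the downstream step is usually the same: read the hit table and keep the best hit per query. The sketch below assumes the standard 12-column tabular output of BLAST+ (`-outfmt 6`); the sequence IDs, the E-value cutoff and the function names are hypothetical and only for illustration.

```python
from collections import namedtuple

# Default column order of BLAST+ tabular output (-outfmt 6).
Hit = namedtuple(
    "Hit",
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore",
)


def parse_outfmt6(lines):
    """Yield one Hit per non-empty, non-comment line of tabular output."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(
            f[0], f[1], float(f[2]),
            int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]),
            float(f[10]), float(f[11]),
        )


def best_hits(hits, max_evalue=1e-5):
    """Keep, for each query, the hit with the highest bit score among
    hits passing the E-value cutoff (cutoff is an assumed default)."""
    best = {}
    for h in hits:
        if h.evalue > max_evalue:
            continue
        if h.qseqid not in best or h.bitscore > best[h.qseqid].bitscore:
            best[h.qseqid] = h
    return best


# Hypothetical example rows; real rows would come from a file such as
# the one written by: blastp -query proteins.faa -db refdb -outfmt 6
example = [
    "gene_001\tref_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t400",
    "gene_001\tref_B\t60.0\t180\t72\t2\t5\t184\t10\t189\t1e-40\t150",
    "gene_002\tref_C\t35.0\t90\t58\t4\t1\t90\t20\t109\t0.5\t30.1",
]
for query, hit in best_hits(parse_outfmt6(example)).items():
    print(query, hit.sseqid, hit.pident, hit.evalue)
```

Queries with no hit passing the cutoff (gene_002 in the toy data) are simply absent from the result, which maps naturally onto a "no match in the database" outcome.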

Analysis Tools - InterProScan

Analysis Tools - InterProScan: member database information

Signature database   Version     Signatures*   Integrated signatures**
CATH-Gene3D          3.5.0       2626          1726
HAMAP                201511.02   2045          2037
PANTHER              10.0        95118         4925
PIRSF                3.01        3285          3223
PRINTS               42.0        2106          2003
PROSITE patterns     20.119      1309          1291
PROSITE profiles     20.119      1136          1109
Pfam                 28.0        16230         15638
ProDom               2006.1      1894          1125
SMART                6.2         1008          996
SUPERFAMILY          1.75        2019          1405
TIGRFAMs             15.0        4488          4454

* Some signatures may not have matches to UniProtKB proteins.
** Not all signatures of a member database may be integrated at the time of an InterPro release.

Criteria for selecting methods
1. Method can scale (~30-60 genomes!!)
2. Currently being maintained
3. Applicable to prokaryotic sequences
4. Can be installed locally (and supports batch jobs if it has a GUI) OR can be included in a pipeline, i.e., has a command-line interface

Gene naming. You need to have a clear logic and support for assigning names to the predicted proteins. Your naming scheme should be consistent. A generally accepted scheme is as follows. High confidence match: function and annotation can be transferred. Multiple high confidence matches: assign a less specific name based on the majority. Low confidence matches: assign function as putative. Match to a hypothetical protein: conserved hypothetical protein. No match in the database: hypothetical protein. How high is high? Ask your data.
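The naming scheme above can be written down as a small decision function. This is a sketch under stated assumptions: the hit fields (`product`, `pident`, `coverage`), the cutoffs of 90% identity and 80% coverage, and the function name are all hypothetical, and "a less specific name" is approximated here by simply taking the majority product name, which in practice would need curation.

```python
from collections import Counter

HYPOTHETICAL = "hypothetical protein"


def assign_name(hits, min_identity=90.0, min_coverage=80.0):
    """Assign a product name from database hits.

    Each hit is a dict with keys 'product', 'pident' and 'coverage'.
    The default cutoffs are placeholders: "How high is high? Ask your data."
    """
    if not hits:
        # No match in the database.
        return HYPOTHETICAL

    informative = [
        h for h in hits if HYPOTHETICAL not in h["product"].lower()
    ]
    if not informative:
        # Only matches to hypothetical proteins.
        return "conserved " + HYPOTHETICAL

    high = [
        h for h in informative
        if h["pident"] >= min_identity and h["coverage"] >= min_coverage
    ]
    if high:
        # One or several high-confidence matches: take the majority name.
        counts = Counter(h["product"] for h in high)
        return counts.most_common(1)[0][0]

    # Only low-confidence matches: transfer the best one as putative.
    best = max(informative, key=lambda h: h["pident"])
    return "putative " + best["product"]


# Hypothetical hits for illustration.
print(assign_name([]))
print(assign_name([{"product": "DNA gyrase subunit A",
                    "pident": 55.0, "coverage": 70.0}]))
```

Keeping the rules in one function makes the scheme consistent across all genomes and easy to re-run when the cutoffs are revised.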

Automated Pipelines. They take in a whole genome assembly and spit out annotations. E.g.: PGAAP (Prokaryotic Genome Automatic Annotation Pipeline); CG-Pipeline (Computational Genomics Pipeline); RAST (Rapid Annotation using Subsystem Technology); KAAS (KEGG Automatic Annotation Server)?

CAUTION! PROS AND CONS OF ANNOTATION APPROACHES

The Assumption. "Given an unannotated protein, the homology transfer approach suggests searching for an annotated homolog and using the experimentally verified function of the latter to infer the function of the former." Punta, M., & Ofran, Y. (2008). The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS computational biology, 4(10), e1000160.

The Truth. Perutz et al. showed in 1960 that myoglobin and hemoglobin, the first two protein structures to be solved at atomic resolution using X-ray crystallography, have similar structures even though their sequences differ.

Molecular Evolution Refresher. Homolog? Paralog? Ortholog? Jensen, R. A. (2001). Orthologs and paralogs - we need to get it right. Genome Biology, 2(8), interactions1002.1-interactions1002.3.

Molecular Evolution Refresher. Orthologs are homologous genes that are the result of a speciation event. Paralogs are homologous genes that are the result of a duplication event. Jensen, R. A. (2001). Orthologs and paralogs - we need to get it right. Genome Biology, 2(8), interactions1002.1-interactions1002.3.

Homology - Pros and Cons. Homology is useful, but it is not the same as shared function; it simply implies common ancestry. Punta, M., & Ofran, Y. (2008). The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS computational biology, 4(10), e1000160.

Pros and Cons: There are no free lunches! Punta, M., & Ofran, Y. (2008). The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS computational biology, 4(10), e1000160.

Pros and Cons: There are no free lunches! The quality of a prediction is at most as good as the quality of the annotation in the database. A eukaryotic function predictor cannot be used for prokaryotes, and vice versa.

Comparative Genomics. Ciccarelli, F. D., Doerks, T., Von Mering, C., Creevey, C. J., Snel, B., & Bork, P. (2006). Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765), 1283-1287.

Comparative Genomics. In a nutshell, it's comparing similarities and differences in the genomes (proteins/genes/SNPs) of multiple organisms from the same or different species. It helps in answering questions about the present (lifestyle: virulent vs. avirulent; horizontally acquired segments) and the past (evolution).

Comparative Genomics. Biological questions of general interest: Are there rearrangements? Is the region(s) of interest syntenic across species? Are there gene gain/loss events leading to a specific trait? What factors confer virulence to the genome? Which organisms are more similar? Which are more distant?

Comparative Genomics. More specific questions from last year: Which genomic feature(s) is unique to N. meningitidis (Nm), H. influenzae (Hi) or H. haemolyticus (Hh)? Which region(s) is unique to a specific Nm serogroup? Which region(s) is unique to a specific Hi serotype? What is the genotype of a given sample?
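Questions of the form "which feature is unique to X?" reduce to set operations once every genome has been summarized as a set of gene families. The sketch below assumes that summary already exists (for example from protein clustering); the family IDs and function names are hypothetical.

```python
def unique_features(presence):
    """For each genome, return the gene families found in no other genome.

    presence maps a genome label to the set of gene-family IDs it carries.
    """
    result = {}
    for genome, families in presence.items():
        others = set().union(
            *(fams for other, fams in presence.items() if other != genome)
        )
        result[genome] = families - others
    return result


def core_features(presence):
    """Return the gene families shared by every genome."""
    if not presence:
        return set()
    return set.intersection(*presence.values())


# Hypothetical presence/absence data for three species.
presence = {
    "Nm": {"fam1", "fam2", "fam3", "capsule_X"},
    "Hi": {"fam1", "fam2", "fam4"},
    "Hh": {"fam1", "fam4", "fam5"},
}
print(unique_features(presence))
print(core_features(presence))
```

The same functions apply at a finer level, for instance with one set per serogroup or serotype instead of one per species.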

Comparative Genomics. For this year: Which genomic features can provide the power to distinguish NT Hi (nontypeable H. influenzae)?

Genomic Rearrangement. Darling, Aaron E., István Miklós, and Mark A. Ragan. "Dynamics of genome rearrangement in bacterial populations." PLoS Genetics 4.7 (2008): e1000128.

What is synteny? http://www.nature.com/scitable/topicpage/synteny-inferring-ancestral-genomes-44022

Synteny. Krause, A., Ramakumar, A., Bartels, D., Battistoni, F., Bekel, T., Boch, J.,... & Goesmann, A. (2006). Complete genome of the mutualistic, N2-fixing grass endophyte Azoarcus sp. strain BH72. Nature biotechnology, 24(11).

Horizontal Gene Transfer http://www.quora.com/why-do-prokaryotes-undergo-horizontal-gene-transfer-but-eukaryotes-dont

Last year http://compgenomics2015.biology.gatech.edu/images/c/c5/lecture6_ngs_for_confirmation_and_characterization_of_meninigitis_pathogens_v3.pdf

Analysis Tools. Homology based: BLAST, Protein Clusters, pathway analysis. Phylogenetics: MEGA, T-Coffee. Virulence: VFDB. Horizontal/lateral gene transfer: DarkHorse, Alien_hunter. Visualization.

Phylogenetic Analysis. There are a number of ways you can compare organisms/genomes: a 16S rRNA tree and MLST based methods (more traditional), and ANI based methods. All three can be visualized as a tree to assess the relatedness between the organisms. ANI has been shown to correlate well with DDH by Konstantinidis et al. Konstantinidis, K. T., Ramette, A., & Tiedje, J. M. (2006). The bacterial species definition in the genomic era. Philosophical Transactions of the Royal Society B: Biological Sciences, 361(1475), 1929-1940. Goris, J., Konstantinidis, K. T., Klappenbach, J. A., Coenye, T., Vandamme, P., & Tiedje, J. M. (2007). DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International journal of systematic and evolutionary microbiology, 57(1), 81-91.
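To make the ANI idea concrete, here is a minimal sketch of averaging identity over genome fragments. It loosely follows the fragment-based approach described in Goris et al. (2007), as I understand it: the query genome is cut into fragments, each fragment is searched against the reference, and only well-aligned fragments contribute to the mean. The fragment length and both cutoffs are assumptions exposed as parameters, the function name is hypothetical, and the search step itself is not shown.

```python
def average_nucleotide_identity(hits, fragment_length=1020,
                                min_alignable=0.7,
                                min_overall_identity=30.0):
    """Mean percent identity over fragments that align well enough.

    hits is an iterable of (pident, aligned_length) pairs, one best hit
    per query fragment. Returns None when no fragment passes the filters.
    """
    kept = []
    for pident, aligned_length in hits:
        # Fraction of the fragment covered by the alignment.
        alignable = aligned_length / fragment_length
        # Identity recalculated over the whole fragment length.
        overall = pident * aligned_length / fragment_length
        if alignable >= min_alignable and overall >= min_overall_identity:
            kept.append(pident)
    if not kept:
        return None
    return sum(kept) / len(kept)


# Hypothetical best hits for three fragments; the short third alignment
# is filtered out by the alignable-fraction cutoff.
fragment_hits = [(98.0, 1020), (96.0, 1000), (99.0, 200)]
print(average_nucleotide_identity(fragment_hits))
```

Pairwise values computed this way for all genome pairs form a distance-like matrix that can be clustered and drawn as a tree, which is how ANI fits alongside the 16S and MLST trees.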

Visualization is more than a thousand words

Visualization Tools: Circos

Visualization Tools: CGView

Visualization Tools: BRIG

Visualization Tools: Artemis, IGV

Visualization Tools: Mauve

Capsule switching breakpoint resolution. Rishishwar, L., Katz, L. S., Sharma, N. V., Rowe, L., Frace, M., Thomas, J. D.,... & Jordan, I. K. (2012). Genomic Basis of a Polyagglutinating Isolate of Neisseria meningitidis. Journal of bacteriology, 194(20), 5649-5656.

Come to Dr. Xin Wang's lecture and pay attention to the biological questions. The problems which biologists solved, or did not solve, with their tubes and plates are going to be solved by you with your genomic sequences.