Functional Annotation & Comparative Genomics Lu Wang, Georgia Tech
Outline. Functional annotation: What is functional annotation? What needs to be annotated? Approaches to functional annotation. Pros/cons of available approaches. Comparative genomics: What is comparative genomics? Questions answered by comparative genomics. Approaches and tools.
What is functional annotation? http://www.biochem.arizona.edu/miesfeld/teaching/bioc471-2/pages/lecture7/lecture7.html
Take one step back. Genome Assembly: Assemble the Pieces Right
Gene Prediction: Identify the words. "When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers."
Functional Annotation: Identify the function (i.e., meaning) of each word. "When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers." DATABASES: nat·u·ral·ist [nach-er-uh-list, nach-ruh-] noun 1. a person who studies or is an expert in natural history, especially a zoologist or botanist. 2. an adherent of naturalism in literature or art. Origin: 1580-90; natural + -ist. PROFILES: Origin of Species, The. noun (On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) a treatise (1859) by Charles Darwin setting forth his theory of evolution.
Comparative Genomics: "When on board RMS Titanic, as painter, I was much struck with certain facts in the distribution of the inhabitants of United Kingdom, and in the socioeconomical relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of capitalism - that mystery of mysteries, as it has been called by one of our greatest philosophers." vs. "When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers."
One more step back: Function? What is function?
To a cell biologist, function might refer to the network of interactions in which the protein participates or to the localization to a certain cellular compartment. To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme.
So what is functional annotation? Functional annotation consists of attaching biological information to genomic elements: biochemical function, biological function, regulatory function, and interactions.
What needs to be annotated?
What needs to be annotated? Proteins/coding portion; domains/motifs; signal peptides; transmembrane regions; non-coding RNAs; riboswitches; CRISPR; small RNAs; operons; other features that address the specific biological question(s).
Since proteins are really the building blocks. Proteins can be: enzymes, regulatory, receptors, virulence factors, transmembrane, structural, signal transduction, toxins, membrane.
Domain. A domain is a discrete structural unit that is assumed to fold independently of the rest of the protein, has its own function, and is ~20-100 aa long. Small subdomains can be assembled into larger domains. Figure: pyruvate kinase, a protein with three domains. http://en.wikipedia.org/wiki/protein_domain
Motif. "The sequences of many proteins contain short, conserved motifs that are involved in recognition and targeting activities, often separate. These motifs are linear, in the sense that three-dimensional organization is not required to bring distant segments of the molecule together to make the recognizable unit." - Tim Hunt (English biochemist) http://en.wikipedia.org/wiki/protein_domain
In short, motifs are short, conserved regions; are usually the most conserved regions of domains; and are critical for the domain to function. Figure: the human papillomavirus E7 oncoprotein mimic of the LxCxE motif (red) bound to the host retinoblastoma protein (dark grey), a tumor suppressor. 26th Feb 2014
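Because a linear motif such as LxCxE is just a short residue pattern, it can be searched for with a regular expression. The following is a minimal sketch, not a replacement for curated motif databases; the function name `find_motif` and the toy sequence are hypothetical and introduced only for illustration.

```python
import re

# LxCxE: Leu, any residue, Cys, any residue, Glu ("x" = any amino acid).
# The lookahead lets overlapping occurrences be reported as well.
LXCXE = re.compile(r"(?=(L.C.E))")


def find_motif(protein, pattern=LXCXE):
    """Return (1-based start position, matched residues) for every hit."""
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(protein.upper())]


# Hypothetical toy sequence, not a real protein.
toy = "MKTAYLDCPELLSAGLHCQE"
print(find_motif(toy))
```

A pattern hit only says the residues are present; whether the motif is functional still depends on context such as conservation and accessibility.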
How do genes collectively perform a function? Operon: several genes with related functions that are regulated together, because a single mRNA codes for several related proteins. Polycistronic mRNA: mRNA coding for more than one polypeptide; found mostly in prokaryotes.
Approaches to Functional Annotation
Functional Annotation. Ab initio: based on intrinsic characteristics of the gene/protein; features include signal peptides (SignalP, LipoP) and transmembrane domains (TMHMM). Homology-based: information transfer from an experimentally characterized system; tools include BLAST and InterPro.
Ab initio approaches. Transmembrane (TM) regions and signal peptides have distinct patterns of sequence composition. Many TM proteins are membrane-bound receptors and channels that are of particular pharmacological relevance (therapeutic or vaccine targets). Signal peptides direct proteins to their proper cellular or extracellular location.
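To illustrate what "a distinct pattern of sequence composition" can mean for TM regions, here is a sketch of a classic sliding-window hydropathy scan using the Kyte-Doolittle scale. This is a swapped-in, simpler technique, not how TMHMM works (TMHMM is based on a hidden Markov model). The window of 19 residues and the cutoff of 1.6 are values often quoted for this scale and should be treated as assumptions; the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_windows(seq, window=19):
    """Mean hydropathy of every full window; raises KeyError on non-standard residues."""
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Return merged 1-based (start, end) spans whose window mean >= threshold."""
    segments = []
    for i, score in enumerate(hydropathy_windows(seq, window)):
        if score < threshold:
            continue
        start, end = i + 1, i + window
        if segments and start <= segments[-1][1] + 1:
            # Overlaps or touches the previous span: extend it.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


# Hypothetical toy sequence: a hydrophobic stretch flanked by charged residues.
toy = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tm_segments(toy))
```

The spans are deliberately coarse; dedicated predictors model helix caps, loops, and topology, which a plain average cannot.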
Homology based approaches. Assumption: significant sequence similarity implies homology (shared ancestry), which often leads to shared function. Specifically: genes/proteins that evolved to perform some function will retain that function; deleterious mutations will be weeded out by purifying selection; evolution is mostly dominated by divergence. Significant similarity will thus entail a high chance of shared origin and function.
Homology based approaches. Databases: NCBI (GenBank, RefSeq); EBI (SwissProt, UniProt); DDBJ; KEGG. Tools: BLAST, InterProScan, GO-based tools.
The Three Kingdoms
Figure: primary vs. derivative sequence databases (labels: Genomes, PGAAP, sequence data from sequencing labs, GenBank, curators, RefSeq, UniGene).
Databases of Choice. RefSeq, SwissProt, and UniProt all offer: high reliability; a high level of annotation; minimal redundancy; integration with other databases.
Gene Ontology. Shulaev, V., Sargent, D. J., Crowhurst, R. N., Mockler, T. C., Folkerts, O., Delcher, A. L.,... & Salama, D. Y. (2010). The genome of woodland strawberry (Fragaria vesca). Nature genetics, 43(2), 109-116.
Analysis Tools: GO-based. Blast2GO, GoMiner, and many more.
Analysis Tools - BLAST: If you do this here.
Analysis Tools - BLAST: One way of doing this
Analysis Tools - BLAST: Alternatively, you can use the cloud-based version
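However BLAST is run, the hits usually need to be filtered before any annotation is transferred. The sketch below assumes BLAST+ tabular output produced with `-outfmt 6` and the default 12 columns; the E-value cutoff of 1e-5 and the function name `best_hits` are assumptions chosen for illustration, and the rows are made up.

```python
import io

# Default column order of BLAST+ tabular output (-outfmt 6, no custom fields).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: hit dict}, keeping the highest bit score per query."""
    best = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        if float(hit["evalue"]) > max_evalue:
            continue  # not significant under the assumed cutoff
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up rows for illustration only (not real BLAST results).
example = (
    "geneA\tsp|P1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t390\n"
    "geneA\tsp|P2\t60.0\t180\t70\t2\t5\t184\t3\t182\t1e-30\t150\n"
    "geneB\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25.0\n"
)
hits = best_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

In practice the handle would be an open file, e.g. `best_hits(open(path))`, and the cutoff should be tuned to the data.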
Analysis Tools - InterProScan
Analysis Tools - InterProScan: Member database information
Signature Database   Version     Signatures*   Integrated Signatures**
CATH-Gene3D          3.5.0       2626          1726
HAMAP                201511.02   2045          2037
PANTHER              10.0        95118         4925
PIRSF                3.01        3285          3223
PRINTS               42.0        2106          2003
PROSITE patterns     20.119      1309          1291
PROSITE profiles     20.119      1136          1109
Pfam                 28.0        16230         15638
ProDom               2006.1      1894          1125
SMART                6.2         1008          996
SUPERFAMILY          1.75        2019          1405
TIGRFAMs             15.0        4488          4454
* Some signatures may not have matches to UniProtKB proteins.
** Not all signatures of a member database may be integrated at the time of an InterPro release.
Criteria for selecting methods: 1. The method can scale (~30-60 genomes!). 2. It is currently being maintained. 3. It is applicable to prokaryotic sequences. 4. It can be installed locally (and supports batch jobs if it is GUI-based) OR it can be included in a pipeline, i.e., it has a command-line interface.
Gene naming. You need clear logic and support for assigning names to the predicted proteins, and your naming scheme should be consistent. A generally accepted scheme is as follows. High-confidence match: function and annotation can be transferred. Multiple high-confidence matches: assign a less specific name based on the majority. Low-confidence match: assign the function as putative. Match to a hypothetical protein: "conserved hypothetical protein". No match in the database: "hypothetical protein". How high is high? Ask your data.
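The naming scheme above can be sketched as a small decision function. This is only one possible reading of the rules: the 90% identity and 90% coverage thresholds are placeholder assumptions ("ask your data"), the function name `assign_name` is hypothetical, and "less specific name based on the majority" is simplified to the most common description among the high-confidence hits.

```python
from collections import Counter


def assign_name(hits, high_identity=90.0, high_coverage=90.0):
    """Assign a product name from database hits.

    hits: list of (description, percent_identity, percent_query_coverage).
    The thresholds are assumed placeholders, not recommended values.
    """
    if not hits:
        return "hypothetical protein"  # no match in the database
    high = [desc for desc, ident, cov in hits
            if ident >= high_identity and cov >= high_coverage]
    # Use the high-confidence hits if there are any, otherwise all hits.
    pool = high if high else [desc for desc, _, _ in hits]
    name, _ = Counter(pool).most_common(1)[0]  # majority description
    if "hypothetical" in name.lower():
        return "conserved hypothetical protein"
    return name if high else f"putative {name}"


print(assign_name([("DNA gyrase subunit A", 99.0, 100.0)]))
print(assign_name([("ABC transporter", 45.0, 80.0)]))
print(assign_name([]))
```

Keeping the thresholds as parameters makes it easy to re-run the naming once the score distribution of the actual data has been inspected.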
Automated Pipelines take in a whole genome assembly and output annotations. Examples: PGAAP (Prokaryotic Genome Automatic Annotation Pipeline); CG-Pipeline (Computational Genomics Pipeline); RAST (Rapid Annotation using Subsystem Technology); KAAS (KEGG Automatic Annotation Server)?
CAUTION! PROS AND CONS OF ANNOTATION APPROACHES
The Assumption. Given an unannotated protein, the homology transfer approach suggests searching for an annotated homolog and using the experimentally verified function of the latter to infer the function of the former. Punta, M., & Ofran, Y. (2008). The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS computational biology, 4(10), e1000160.
The Truth. Perutz et al. showed in 1960 that myoglobin and hemoglobin, the first two protein structures to be solved at atomic resolution using X-ray crystallography, have similar structures even though their sequences differ.
Molecular Evolution Refresher. Homolog? Paralog? Ortholog? Jensen, R. A. (2001). Orthologs and paralogs - we need to get it right. Genome Biology, 2(8), interactions1002.1-interactions1002.3.
Molecular Evolution Refresher. Orthologs are homologous genes that are the result of a speciation event. Paralogs are homologous genes that are the result of a duplication event. Jensen, R. A. (2001). Orthologs and paralogs - we need to get it right. Genome Biology, 2(8), interactions1002.1-interactions1002.3.
Homology - Pros and Cons. Homology is useful, but it is not the same as shared function; it simply implies common ancestry. Punta, M., & Ofran, Y. (2008). The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS computational biology, 4(10), e1000160.
Pros and Cons: There are no free lunches! Punta, M., & Ofran, Y. (2008). The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS computational biology, 4(10), e1000160.
Pros and Cons: There are no free lunches! The quality of a prediction is at most as good as the quality of the annotation in the database. A eukaryotic function predictor cannot be used for prokaryotes, and vice versa.
Comparative Genomics. Ciccarelli, F. D., Doerks, T., Von Mering, C., Creevey, C. J., Snel, B., & Bork, P. (2006). Toward automatic reconstruction of a highly resolved tree of life. Science, 311(5765), 1283-1287.
Comparative Genomics. In a nutshell, it is comparing similarities and differences in the genomes (proteins/genes/SNPs) of multiple organisms from the same or different species. It helps in answering questions about the present (lifestyle, e.g., virulent vs. avirulent; horizontally acquired segments) and the past (evolution).
Comparative Genomics. Biological questions of general interest: Are there rearrangements? Is the region (or regions) of interest syntenic across species? Are there gene gain/loss events leading to a specific trait? What factors confer virulence? Which organisms are more similar? Which are more distant?
Comparative Genomics. More specific questions from last year: Which genomic feature(s) are unique to N. meningitidis (Nm), H. influenzae (Hi), or H. haemolyticus (Hh)? Which region(s) are unique to a specific Nm serogroup? Which region(s) are unique to a specific Hi serotype? What is the genotype of a given sample?
Comparative Genomics. For this year: Which genomic features can provide the power to distinguish NT Hi (nontypeable H. influenzae)?
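Questions of the form "which features are unique to one group?" reduce to set operations once each genome is described by the features it carries. The sketch below assumes a presence table has already been built upstream (for example from homology clustering); the cluster IDs, the genome labels used as keys, and the function names are hypothetical.

```python
def unique_features(presence):
    """For each genome, return the features found in no other genome.

    presence: dict mapping genome name -> set of feature IDs.
    """
    unique = {}
    for genome, features in presence.items():
        others = set().union(*(f for g, f in presence.items() if g != genome))
        unique[genome] = features - others
    return unique


def core_features(presence):
    """Features shared by every genome in the table."""
    return set.intersection(*presence.values()) if presence else set()


# Hypothetical toy data: gene-cluster IDs per genome.
presence = {
    "Nm": {"c1", "c2", "c3"},
    "Hi": {"c1", "c4"},
    "Hh": {"c1", "c4", "c5"},
}
print(unique_features(presence))
print(core_features(presence))
```

The same pattern extends to groups of genomes (for example all isolates of one serogroup) by first taking the intersection within the group and then subtracting the union of everything outside it.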
Genomic Rearrangement. Darling, Aaron E., István Miklós, and Mark A. Ragan. "Dynamics of genome rearrangement in bacterial populations." PLoS Genetics 4.7 (2008): e1000128.
What is Synteny? http://www.nature.com/scitable/topicpage/synteny-inferring-ancestral-genomes-44022
Synteny. Krause, A., Ramakumar, A., Bartels, D., Battistoni, F., Bekel, T., Boch, J.,... & Goesmann, A. (2006). Complete genome of the mutualistic, N2-fixing grass endophyte Azoarcus sp. strain BH72. Nature biotechnology, 24(11).
Horizontal Gene Transfer http://www.quora.com/why-do-prokaryotes-undergo-horizontal-gene-transfer-but-eukaryotes-dont
Last year http://compgenomics2015.biology.gatech.edu/images/c/c5/lecture6_ngs_for_confirmation_and_characterization_of_meninigitis_pathogens_v3.pdf
Analysis Tools. Homology-based: BLAST, Protein Clusters, pathway analysis. Phylogenetics: MEGA, T-Coffee. Virulence: VFDB. Horizontal/lateral gene transfer: DarkHorse, Alien_hunter. Visualization.
Phylogenetic Analysis. There are a number of ways you can compare organisms/genomes: 16S rRNA trees and MLST-based methods (more traditional), and ANI-based methods. All three can be visualized as a tree to assess the relatedness between the organisms. ANI has been shown to correlate well with DNA-DNA hybridization (DDH) by Konstantinidis et al. Konstantinidis, K. T., Ramette, A., & Tiedje, J. M. (2006). The bacterial species definition in the genomic era. Philosophical Transactions of the Royal Society B: Biological Sciences, 361(1475), 1929-1940. Goris, J., Konstantinidis, K. T., Klappenbach, J. A., Coenye, T., Vandamme, P., & Tiedje, J. M. (2007). DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International journal of systematic and evolutionary microbiology, 57(1), 81-91.
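All of these methods ultimately feed pairwise similarities or distances into a tree builder. As a toy illustration, the sketch below computes the proportion of differing sites (p-distance) between sequences that are assumed to be already aligned, such as 16S rRNA genes or concatenated MLST loci. It is not an ANI implementation, and the sequences, isolate names, and function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = mismatches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        mismatches += x != y
    if compared == 0:
        raise ValueError("no comparable (ungapped) columns")
    return mismatches / compared


def distance_matrix(aligned):
    """Return {(name_a, name_b): distance} for all pairs, symmetric."""
    names = sorted(aligned)
    dist = {(n, n): 0.0 for n in names}
    for a, b in combinations(names, 2):
        dist[(a, b)] = dist[(b, a)] = p_distance(aligned[a], aligned[b])
    return dist


# Hypothetical toy alignment, not real sequence data.
aligned = {
    "isolate1": "ACGTACGTAC",
    "isolate2": "ACGTACGTAA",
    "isolate3": "ACGAACG-AA",
}
for pair, d in sorted(distance_matrix(aligned).items()):
    print(pair, round(d, 3))
```

A matrix like this is the kind of input that distance-based tree methods (for example neighbor joining, as offered in tools such as MEGA) work from.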
Visualization is more than a thousand words
Visualization Tools: Circos
Visualization Tools: CGView
Visualization Tools: BRIG
Visualization Tools: Artemis, IGV
Visualization Tools: Mauve
Capsule switching breakpoint resolution. Rishishwar, L., Katz, L. S., Sharma, N. V., Rowe, L., Frace, M., Thomas, J. D.,... & Jordan, I. K. (2012). Genomic Basis of a Polyagglutinating Isolate of Neisseria meningitidis. Journal of bacteriology, 194(20), 5649-5656.
Come to Dr. Xin Wang's lecture and pay attention to the biological questions. The problems that biologists solved, or did not solve, with their tubes and plates are going to be solved by you with your genomic sequences.