Dr. Dirk Gevers (1,2)
(1) Laboratorium voor Microbiologie
(2) Bioinformatics & Evolutionary Genomics

The bacterial species in the genomic era

The minimal prokaryotic genome

Mycoplasma genitalium
~108-121 genes not required for growth in the laboratory
~265-350 genes required for growth in the laboratory

[Figure: diagram of the genome of Mycoplasma genitalium - 480 proteins]

Using transposon mutagenesis (single-gene disruptions):
130 genes not required for growth in the laboratory
350 genes required for growth in the laboratory

[Figure: Haemophilus influenzae (1703) compared with Mycoplasma genitalium (468); 240 marked between the two]
C.M. Fraser et al., Science 1995
The minimal gene set

A synthetic minimal genome
The only way to better understand the minimal components of cellular life and the evolution of life
Koonin, E.V., NRM 2003

The prokaryotic core genome
How big? How stable? Common history?
The core genome is NOT the same as the minimal genome!

Why so small?
Non-orthologous gene displacement
Maybe, at deep divergences, many true orthologs fall below the radar screen of BLAST
Maybe a few radically reduced parasite genomes are skewing the analysis
But just maybe there really are only about 100 genes, mostly translational/transcriptional, in the true core; all else is mix and match
So what?!?
Doolittle, F., Genomes 2005, Halifax
The phylogenetic problem resolved?
Phylogenetic incongruence (HGT, hidden paralogies)
Patchy distribution (different gene content within species)
Loss of phylogenetic signal for deep branches

More data = more phylogenetic signal?
Do some genes resist both loss and HGT?
Daubin et al., GR 2002

All possible comparisons between gene phylogenies, using principal component analysis
120 genes with a common phylogenetic history
[Figure labels: phylogenetic artifact, lack of signal, HGT]
Daubin et al., GR 2002

"Among these, 205 contain exactly one gene per species. We consider these 205 genes to represent likely orthologs and, consequently, to be good candidates for use in inferring the organismal phylogeny and the extent of LGT"
Tested with the Shimodaira-Hasegawa (SH) test
Lerat et al., PLOS 2004
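The selection rule quoted from Lerat et al. (gene families with exactly one gene per species) can be sketched in a few lines. This is only an illustration on made-up data: the function name `single_copy_families`, the family names and the species labels are assumptions, not taken from the study.

```python
from collections import Counter


def single_copy_families(families, species):
    """Return the names of families with exactly one gene in every species."""
    wanted = set(species)
    selected = []
    for name, members in families.items():
        # Count how many genes each species contributes to this family.
        counts = Counter(sp for sp, _gene in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected.append(name)
    return sorted(selected)


# Hypothetical toy data: family name -> list of (species, gene id).
families = {
    "famA": [("sp1", "a1"), ("sp2", "a2"), ("sp3", "a3")],
    "famB": [("sp1", "b1"), ("sp1", "b1x"), ("sp2", "b2"), ("sp3", "b3")],  # paralog in sp1
    "famC": [("sp1", "c1"), ("sp3", "c3")],  # absent from sp2
}
species = ["sp1", "sp2", "sp3"]

candidates = single_copy_families(families, species)
print(candidates)
```

Families with a duplicate (possible hidden paralogy) or a missing species (patchy distribution) are dropped, which is why such a filter keeps only a small fraction of all families.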
"Failure of rejection is not the same as support"
W. Ford Doolittle

Genes with little signal may fail to reject many or even all topologies, but they cannot be said to support a certain topology
It is possible that a robust tree based on concatenated sequences is well supported because different constituent genes contribute strong support to different individual nodes of the tree, without any gene supporting that tree overall

A statistical test for each gene against many topologies
More nuanced than rejection or failure of rejection
Visualize the compatibility of all genes with all trees simultaneously
Susko et al., MBE 2006

Heat map = simultaneous display of all combinations of genes and test topologies, together with simultaneous clustering of both genes and topologies according to p-values
Clustering of genes identifies the core set of genes with a similar evolutionary history
Clustering of topologies identifies which trees are (nearly) equally supported (= number of best trees)
Susko et al., MBE 2006

Can we prove Darwin's theory of evolution?
Suitable evidence would be: a molecular phylogeny similar to the organism phylogeny
"Our phylogenetic analyses do not support treethinking.... Representations other than a tree should be investigated because a non-critical concatenation of markers could be highly misleading."
Bapteste et al., 2005

But for prokaryotes we don't have anything besides the molecular phylogeny to determine the organism phylogeny
= CIRCULAR REASONING to use concatenated genes as the organism phylogeny
We have to live with that!
We could believe in the concatenated phylogeny IF most genes supported the same phylogeny
We do live in an era in which we have an enormous amount of data (> 350 genomes)
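The difference between rejecting a topology and supporting one can be made concrete with a small table of per-gene p-values. The sketch below is a deliberately crude stand-in for the heat map and clustering of Susko et al., not their method: it groups genes by which topologies they reject at an assumed significance level of 0.05. All gene names and p-values are invented.

```python
from collections import defaultdict

ALPHA = 0.05  # assumed significance level, not taken from the slides


def rejection_patterns(pvalues, alpha=ALPHA):
    """Map each gene to a tuple of booleans, True where that topology is rejected."""
    return {gene: tuple(p < alpha for p in row) for gene, row in pvalues.items()}


def group_by_pattern(patterns):
    """Group genes that reject exactly the same set of topologies."""
    groups = defaultdict(list)
    for gene, pattern in patterns.items():
        groups[pattern].append(gene)
    return {pattern: sorted(genes) for pattern, genes in groups.items()}


# Hypothetical p-values: one row per gene, one column per test topology.
pvalues = {
    "geneA": [0.80, 0.01, 0.02],
    "geneB": [0.65, 0.03, 0.01],
    "geneC": [0.40, 0.55, 0.30],  # rejects nothing: little signal, not support
    "geneD": [0.02, 0.70, 0.01],
}

patterns = rejection_patterns(pvalues)
groups = group_by_pattern(patterns)
# Genes that fail to reject every topology carry too little signal to count as support.
uninformative = sorted(g for g, pat in patterns.items() if not any(pat))
print(groups)
print(uninformative)
```

Here geneA and geneB behave alike, geneD conflicts with them, and geneC is compatible with everything only because it cannot discriminate. A real analysis would cluster the p-values themselves rather than thresholded patterns.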
Problems with this study according to W. Ford Doolittle:
Genes were concatenated without evaluating congruence among them (failure of rejection is not support)
Individual genes are compared only with the concatenated genes => acceptance or rejection
Even if they were congruent, it is still no proof of Darwin's theory, as ONLY 31 genes were considered -> they can't be representative of the phylogenetic history of the organisms

"HGT is shown to make a substantial contribution to genome evolution (up to 25-30%)... therefore no tree can, in principle, fully reflect the course of evolution of species."
Eugene V. Koonin

"What does it mean... to speak of an organismal genealogy when nearly all of the genes in the cell - genes that give its general character - do not share a common history?"
C.R. Woese 2002

The microbial pan-genome

Intra-species comparisons
A strain is only a single representative of a species, the members of which can be genotypically and phenotypically much more diverse.
How many genomes are needed to fully describe a bacterial species?
Most analyses have revealed large differences in gene content between closely related strains.
Lawrence, COiM 2005
Medini et al., COiGD 2005
The microbial pan-genome

From this study:
This question was addressed by sequencing the genomes of 8 strains of Streptococcus agalactiae (group B Streptococcus, GBS)
A bacterial species can be described by its pan-genome (pan, Greek for "whole")
Each strain contains on average 1806 genes present in every strain (core genome) plus 439 genes that are absent in one or more strains (dispensable genome)
The present GBS pan-genome contains 2713 genes
Unique genes will continue to emerge even after hundreds or thousands of genomes have been sequenced
On average 33 new genes with every new strain sequenced
Core = essence of the species
Dispensable = diversity of the species (conserved + strain-specific)
The bacterial species will never be fully described, i.e. it has an open pan-genome
Claire Fraser, TIGR

[Figure: GBS pan-genome and GBS core genome; open versus closed pan-genome]

Core genome = essence, the basic aspects of the biology of a species and its major phenotypic traits
Dispensable genome = supplementary biochemical pathways and functions that are not essential for bacterial growth but confer selective advantages, such as adaptation to different niches, virulence, capsular serotype, antibiotic resistance, or colonization of a new host (conserved / unique)
Medini, COiGD 2005
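The definitions above reduce to set operations once every strain is represented as a set of gene-family identifiers. The following is a minimal sketch on invented data; the function names and gene labels are assumptions, and it expects at least one strain.

```python
def pan_genome(strains):
    """Split gene families into pan, core and dispensable sets.

    strains: dict mapping strain name -> set of gene family ids (non-empty dict).
    """
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)            # present in at least one strain
    core = set.intersection(*gene_sets)      # present in every strain
    dispensable = pan - core                 # absent from one or more strains
    return pan, core, dispensable


def new_genes_per_strain(ordered_sets):
    """Count the genes each successive strain adds to the running pan-genome."""
    seen = set()
    counts = []
    for genes in ordered_sets:
        counts.append(len(genes - seen))
        seen |= genes
    return counts


# Hypothetical toy strains.
strains = {
    "s1": {"g1", "g2", "g3"},
    "s2": {"g1", "g2", "g4"},
    "s3": {"g1", "g2", "g3", "g5"},
}

pan, core, dispensable = pan_genome(strains)
added = new_genes_per_strain([strains["s1"], strains["s2"], strains["s3"]])
print(len(pan), len(core), len(dispensable), added)
```

The per-strain counts depend on the order in which strains are added, so a figure such as "new genes per genome" is more meaningful when averaged over many orderings. If the counts keep levelling off above zero the pan-genome looks open; if they fall to zero it looks closed.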
Open pan-genome: typical for species that colonize multiple environments and have multiple ways of exchanging genetic material, e.g. Streptococci, Meningococci, H. pylori, Salmonellae and E. coli
Each new genome of Streptococci: + 50 genes (Streptococci are an ecologically and phenotypically uniform species)
Each new genome of E. coli: + 300 genes (E. coli might be too heterogeneous to be one species)

Closed pan-genome: more conserved; typical for species that live in isolated niches with limited access to the global microbial gene pool, e.g. B. anthracis, Mycobacterium tuberculosis, Buchnera aphidicola and Chlamydia trachomatis

Species genome concept

What have we learned from almost a decade of extensive genome sequencing with respect to currently named bacterial species?
Can we improve the species definition/concept?

Comparison between DDH and genome similarity
[Figure: DDH versus ANI]
Konstantinidis, PNAS 2005

Unpublished study of 28 sequenced strains:
[Figure: % DNA-DNA hybridization (y) versus % conserved DNA (x); linear fit y = 0.785x + 16.197, R^2 = 0.9486]
% conserved DNA: blastn of 1020-nt fragments with a 90% sequence identity cut-off
Goris, IJSEM (in press)

Comparison between 16S and genome similarity
[Figure: 16S sequence identity versus ANI]
Konstantinidis, PNAS 2005
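The linear fit reported for the 28 strains (y = 0.785x + 16.197, with y the % DNA-DNA hybridization and x the % conserved DNA) can be used in both directions. The sketch below simply evaluates and inverts that line; the function names are assumptions, and the 70% value is the DDH level conventionally used to delineate species.

```python
SLOPE = 0.785       # from the fit on the slide
INTERCEPT = 16.197  # from the fit on the slide


def predicted_ddh(conserved_dna_pct):
    """Predict % DDH from % conserved DNA using the reported linear fit."""
    return SLOPE * conserved_dna_pct + INTERCEPT


def conserved_dna_for_ddh(ddh_pct):
    """Invert the fit: % conserved DNA that corresponds to a given % DDH."""
    return (ddh_pct - INTERCEPT) / SLOPE


threshold = conserved_dna_for_ddh(70.0)
print(round(threshold, 2))            # about 68.54 % conserved DNA
print(round(predicted_ddh(100.0), 3))
```

Under this fit, the 70% DDH level corresponds to roughly 68.5% conserved DNA. The line is an empirical summary of one dataset, so values far outside the sampled range should not be read as predictions.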
Gene content diversity within species
[Figure: % conserved genes versus ANI]
Species may differ by up to 35% in gene content, or 20% when hypothetical genes and mobile elements are excluded
70% DDH = at least 80% of gene content shared
20% = on average 1000 genes!
Konstantinidis, PNAS 2005

Evolutionary relatedness should be coupled to ecological relatedness
This will give a more predictive species definition than a purely evolutionary one (what is the genetic basis for ecological distinctiveness?)

Clusters or continuum of diversity?
Do bacteria exhibit a genetic continuum in nature, or are there coherent sequence/genomic clusters on which a species definition/concept could be based?
[Figure: frequency versus % sequence identity for 333 Hsp60 sequences from Vibrio strains isolated from coastal bacterioplankton]
Current datasets in taxonomy consist of only a few representative strains per species; consequently, borders based on these datasets might not hold when more intraspecies diversity is included!
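One simple way to look at the "clusters or continuum" question is a histogram of pairwise sequence identities, like the frequency plot for the Hsp60 sequences. The sketch below assumes the sequences are already aligned and of equal length; the sequences, the bin width and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def identity_histogram(seqs, bin_width=10):
    """Count sequence pairs per identity bin; keys are the lower bin edges."""
    hist = Counter()
    for a, b in combinations(seqs, 2):
        pid = percent_identity(a, b)
        # Put 100% identity into the top bin instead of opening a new one.
        lower = min(int(pid // bin_width) * bin_width, 100 - bin_width)
        hist[lower] += 1
    return dict(sorted(hist.items()))


# Hypothetical aligned sequences forming two tight groups.
seqs = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCCG"]

hist = identity_histogram(seqs)
print(hist)
```

A histogram with separate peaks (high identity within groups, low identity between them) points to clusters, while an even spread of identities points to a continuum. With only a few strains per species the gaps may simply reflect sparse sampling, which is the caveat raised above.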
Unsolved questions:
How is selection acting on bacterial populations?
What is the recombination rate at the whole-genome level within a bacterial population?
Answers to these questions will advance our knowledge of the two basic processes on which our current species concepts are based.

One thing is certain: reconciling eukaryotic and bacterial species under the same biological species concept is NOT possible, because these organisms are too different in terms of evolutionary processes.