Bacterial pan-genomics

Dave Ussery
4th annual workshop on Comparative Microbial Genomics and Taxonomy, Petrópolis, Brazil
Lecture #3, Wednesday, 5 August 2009

Outline
What we DO know
What we DON'T know
What can we do about it?

[Figure: pan- and core-genome plot for 57 Escherichia and Shigella genomes, added in order from 1 : Ecoli_K12DH10B to 57 : Eferg_35469T. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: 0 to 15,000.]

Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial "pan-genome"

Hervé Tettelin a,b, Vega Masignani b,c, Michael J. Cieslewicz b,d,e, Claudio Donati c, Duccio Medini c, Naomi L. Ward a,f, Samuel V. Angiuoli a, Jonathan Crabtree a, Amanda L. Jones g, A. Scott Durkin a, Robert T. DeBoy a, Tanja M. Davidsen a, Marirosa Mora c, Maria Scarselli c, Immaculada Margarit y Ros c, Jeremy D. Peterson a, Christopher R. Hauser a, Jaideep P. Sundaram a, William C. Nelson a, Ramana Madupu a, Lauren M. Brinkac a, Robert J. Dodson a, Mary J. Rosovitz a, Steven A. Sullivan a, Sean C. Daugherty a, Daniel H. Haft a, Jeremy Selengut a, Michelle L. Gwinn a, Liwei Zhou a, Nikhat Zafar a, Hoda Khouri a, Diana Radune a, George Dimitrov a, Kisha Watkins a, Kevin J. B. O'Connor h, Shannon Smith i, Teresa R. Utterback i, Owen White a, Craig E. Rubens g, Guido Grandi c, Lawrence C. Madoff e,j, Dennis L. Kasper e,j, John L. Telford c, Michael R. Wessels d,e, Rino Rappuoli c,k,l, and Claire M. Fraser a,b,k,m

a Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850; c Chiron Vaccines, Via Fiorentina 1, 53100 Siena, Italy; d Division of Infectious Diseases, Children's Hospital, 300 Longwood Avenue, Boston, MA 02115; e Harvard Medical School, Boston, MA 02115; f Center of Marine Biotechnology, University of Maryland Biotechnology Institute, 701 East Pratt Street, Baltimore, MD 21202; g Children's Hospital and Regional Medical Center, 307 Westlake Avenue N, Seattle, WA 98109; h The Johns Hopkins University, 3400 North Charles Street, Baltimore, MD 21218; i J. Craig Venter Institute, 5 Research Place, Rockville, MD 20850; j Channing Laboratory, Brigham and Women's Hospital, 181 Longwood Avenue, Boston, MA 02115; and m George Washington University Medical Center, 2300 Eye Street NW, Washington, DC 20037

Contributed by Rino Rappuoli, August 5, 2005
Proc Natl Acad Sci USA, 102:13950-13955 (2005).

Fig. 2. GBS core genome. The number of shared genes is plotted as a function of the number n of strains sequentially added (see Materials and Methods). For each n, circles are the $8!/[(n-1)!\,(8-n)!]$ values obtained for ...

Fig. 3. GBS pan-genome. The number of specific genes is plotted as a function of the number n of strains sequentially added (see Materials and Methods). For each n, circles are the $8!/[(n-1)!\,(8-n)!]$ values obtained for ...
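The figure captions state that, with eight strains, each value of n carries 8!/[(n-1)!(8-n)!] circles. A minimal sketch of that count follows. The function name `n_points` and the reading of the formula as "choose the strain added last, then choose which n-1 of the other seven strains came before it" are assumptions made for illustration, not taken from the paper.

```python
from math import comb, factorial


def n_points(n, total=8):
    """Number of plotted values at position n: total! / ((n-1)! * (total-n)!)."""
    return factorial(total) // (factorial(n - 1) * factorial(total - n))


# One count per position n = 1..8 on the horizontal axis.
counts = {n: n_points(n) for n in range(1, 9)}

# Same numbers written as "pick the newest strain (8 ways) times
# pick the n-1 strains already present from the remaining 7".
alternative = {n: 8 * comb(7, n - 1) for n in range(1, 9)}

for n in range(1, 9):
    print(n, counts[n])
```

The counts are symmetric in n, so the middle of each plot carries the most points and the two ends carry only eight each.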

Table 12.1 The number of sequenced genomes and ongoing projects of various bacterial genera and the current numbers of available multiple genomes per species. Only the top-scoring genera are listed. The right-hand block gives the values added on the slide as of 5 August, 2009, in the same column order.

                 Published table                            As of 5 August, 2009
                 Sequencing projects   Number of species    Sequencing projects   Number of species
Genus            Finished  In progr.   Finished  In progr.  Finished  In progr.   Finished  In progr.
Streptococcus    27        76          8         16         43        114         11        25
Clostridium      15        59          10        25         23        86          11        37
Burkholderia     16        59          9         16         23        73          12        24
Bacillus         17        55          10        17         25        129         11        31
Salmonella       7         43          2         3          16        75          1         3
Escherichia      11        43          1         2          28        77          2         7
Vibrio           7         35          5         14         13        55          6         14
Mycobacterium    17        32          10        15         22        59          13        20
Listeria         4         29          3         6          6         34          3         6
Yersinia         11        25          3         7
Mycoplasma       13        24          11        17
Shewanella       16        24          11        15
Pseudomonas      14        24          7         8          19        51          7         11
Borrelia         3         23          3         9          8         30          8         10
Haemophilus      6         23          3         4          8         32          4         4
Staphylococcus   18        22          4         5          20        80          5         10
Campylobacter    9         22          2         2          10        28          6         11
Synechococcus    11        21          5         9          12        22          11        22
Francisella      7         16          1         2          9         28          3         3
Lactobacillus    11        15          10        12         19        65          15        25
Rickettsia       10        15          9         12         12        18          10        14

"Finished" and "In progr." under Number of species mean species with finished genomes and species with projects in progress. The 5 August values for Yersinia, Mycoplasma and Shewanella appear interleaved in the source as: 15 11 50 35 19 23 16 12 3 20 11 17.

D.W. Ussery et al., Computing for Comparative Microbial Genomics, Computational Biology 8, DOI 10.1007/978-1-84800-255-5_14, Springer-Verlag London Limited 2009

[Figure: bar chart of the number of genomes sequenced per genus (horizontal axis 0 to 120), showing finished and ongoing projects. Genera, top to bottom: Streptococcus, Burkholderia, Bacillus, Clostridium, Vibrio, Mycobacterium, Salmonella, Listeria, Escherichia, Mycoplasma, Shewanella, Pseudomonas, Yersinia, Haemophilus, Staphylococcus, Campylobacter, Synechococcus, Francisella, Lactobacillus, Rickettsia. Labelled bars include Streptococcus (43 and 71) and Burkholderia (23 and 50).]

News and Views, Nature 412, 597-598 (9 August 2001), doi:10.1038/35088167
Genome sequencing: The ABC of symbiosis
J. Allan Downie and J. Peter W. Young

"It is a truth universally acknowledged, that there are only two kinds of bacteria. One is Escherichia coli and the other is not. Anything that E. coli does is a universal truth about bacteria; anything it does not do must be a specialization."

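[Figure: four E. coli genomes shown side by side. Panel labels: Escherichia coli, strain K-12, isolate W3110 (4,641,433 bp, 4390 genes); E. coli CFT073; E. coli O157; E. coli E2348. Other genome sizes shown: 5,528,445 bp, 5,231,428 bp, 4,227,846 bp.]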

[Figure: size distribution of prokaryotic genomes (n=490), one row per phylum, with E. coli marked. Horizontal axis: genome size (Mbases), 0 to 10.]

Archaea: Crenarchaeota (n=12), Euryarchaeota (n=26), Nanoarchaeota (n=1)
Bacteria: Acidobacteria (n=2), Actinobacteria (n=39), Aquificae (n=1), Bacteroidetes/Chlorobi (n=12), Chlamydiae (n=11), Chloroflexi (n=2), Cyanobacteria (n=23), Deinococcus-Thermus (n=4), Firmicutes (n=106), Fusobacteria (n=1), Planctomycetes (n=1), Alphaproteobacteria (n=56), Betaproteobacteria (n=43), Deltaproteobacteria (n=14), Epsilonproteobacteria (n=11), Gammaproteobacteria (n=115), Spirochaetes (n=9), Thermotogae (n=1)

[Figure: E. coli core genes in 32 genomes. Number of core genes plotted against n genomes, with a histogram of counts. Result: ~1560 core genes. Source: Genome Biology, 2007, 8:R267, doi:10.1186/gb-2007-8-12-r267]

[Figure: E. coli pan-genome based on 32 genomes. Number of specific genes plotted against n genomes, with a histogram of counts. Result: 11,862 gene families and ~79 new genes/genome (cp. 441 new genes/genome, predicted previously). Source: Genome Biology, 2007, 8:R267, doi:10.1186/gb-2007-8-12-r267]

Total of 11,838 E. coli and Shigella gene families (32 genome sequences):
2,041 E. coli and Shigella core families
9,797 E. coli and Shigella strain-specific families

How large is the E. coli pan-genome? Can we estimate this?

[Figure: distribution of 11,838 E. coli and Shigella gene families. Horizontal axis: # homologs per family; vertical axis: # gene families, 0 to 4,000. Annotated bars: 3798 unique genes; 2041 genes found in all 32 genomes.]

[Figure: size of gene families in 53 Ecoli + Shigella genomes. Horizontal axis: families 1 to 20; vertical axis: 0 to 15,000. The largest family is labelled "IS2 transposase orfb".]

From: carsten@cbs.dtu.dk
Subject: A question for the Ecoli expert
Date: 5 August 2009 08:59:56 GMT-03:00
To: dave@cbs.dtu.dk

Hi Dave,
Was looking at the Ecoli pangenome. There is a transposase which seems ridiculously ubiquitous. In 52 Ecoli/Shigella genomes I find 16605 individual instances of this bugger with the coregenome script. I've been into the genbank annotations and at first glance this thing appears genuine. Is this described somewhere? I've found it listed as "IS2 transposase orfb", but it may go under other names as well.
-- /Carsten

[Figure: the 20 largest gene families, two panels. Left: size of gene families in 53 Ecoli + Shigella genomes (vertical axis 0 to 15,000; largest family labelled "IS2 transposase orfb"). Right: size of gene families in 45 Ecoli genomes (vertical axis 0 to 4,000).]

[Figure: bar chart, vertical axis 0 to 6000, built up one genome at a time for K-12 MG1655, K-12 W3110, K-12 DH10B and CFT073.]

Genome                 # genes   # new genes   # new gene families   pan-    core-
1 E. coli K-12 MG1655  4289      4289          4057                  4057    4057
2 E. coli K-12 W3110   4387      174           140                   4197    3959
3 E. coli K-12 DH10B   4126      86            81                    4278    3658
4 E. coli CFT073       5379      1710          1524                  5802    3227
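The four-genome example builds the pan- and core-genome by adding one genome at a time: each step adds the new gene families to the pan-genome (4057 + 140 = 4197, then + 81 = 4278, then + 1524 = 5802) and keeps in the core only the families present in every genome so far. The sketch below shows that bookkeeping on toy data. The function name `accumulate`, the genome names and the integer family identifiers are hypothetical, and the gene-family clustering step that assigns genes to families is assumed to have been done already.

```python
def accumulate(genomes):
    """Sequentially add genomes and track pan- and core-genome sizes.

    genomes: list of (name, set of gene-family ids), in the order added.
    Returns one row per genome: (name, new families, pan size, core size).
    """
    pan = set()
    core = None
    rows = []
    for name, families in genomes:
        new = families - pan                 # families not seen in earlier genomes
        pan |= families                      # pan-genome: union of everything so far
        if core is None:
            core = set(families)             # first genome: everything is core
        else:
            core &= families                 # core genome: intersection so far
        rows.append((name, len(new), len(pan), len(core)))
    return rows


# Toy input (hypothetical family ids, not real data).
toy = [
    ("genome_A", {1, 2, 3, 4}),
    ("genome_B", {1, 2, 3, 5}),
    ("genome_C", {1, 2, 6}),
]
rows = accumulate(toy)
for row in rows:
    print(row)
```

Because the result depends on the order in which genomes are added, plots of this kind are usually read together with the genome order listed next to them.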

[Figure: pan- and core-genome plot for 30 E. coli and Shigella genomes, starting from K12. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: number of gene families, 0 to 15,000. Horizontal axis: genomes 1 to 30. Genome labels are cut off in the plot.]

1 : E. coli K12MG16; 2 : E. coli O157RIM; 3 : E. coli O157EDL; 4 : S. flexneri 2a 30; 5 : E. coli CFT073; 6 : S. flexneri 2a 24
7 : S. sonnei Ss046; 8 : S. boydii Sb227; 9 : S. dysenteriae S; 10 : E. coli W3110; 11 : E. coli UTI89; 12 : E. coli 536
13 : S. flexneri 5 84; 14 : E. coli APEC O; 15 : E. coli 101-1; 16 : E. coli 53638; 17 : E. coli B171; 18 : E. coli B7A
19 : E. coli E22; 20 : E. coli E24377; 21 : E. coli F11; 22 : E. coli HS; 23 : E. coli B; 24 : E. coli E11001
25 : S. boydii BS51; 26 : E. coli 042; 27 : E. coli B03; 28 : E. coli E2348; 29 : E. coli H10407; 30 : E. coli RS218

[Figure: pan- and core-genome plot for 57 Escherichia and Shigella genomes. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: 0 to 15,000. Groups marked along the horizontal axis: K-12, O157, Shigella, E. albertii, E. fergusonii. Plot made for E. coli alliance (by DU) on 19 January, 2009.]

1 : Ecoli_K12DH10B; 2 : Ecoli_K12MG1655; 3 : Ecoli_K12W3110; 4 : Ecoli_B; 5 : Ecoli_B03; 6 : Ecoli_B171
7 : Ecoli_B7A; 8 : Ecoli_F11; 9 : Ecoli_H10407; 10 : Ecoli_HS; 11 : Ecoli_101-1; 12 : Ecoli_536
13 : Ecoli_53638; 14 : Ecoli_55989; 15 : Ecoli_8739; 16 : Ecoli_CFT073; 17 : Ecoli_E110019; 18 : Ecoli_E22
19 : Ecoli_E2348; 20 : Ecoli_E24377A; 21 : Ecoli_ED1a; 22 : Ecoli_IAI1; 23 : Ecoli_IAI39; 24 : Ecoli_LANL_ECA
25 : Ecoli_LANL_ECF; 26 : Ecoli_O103Oslo; 27 : Ecoli_O157_EC4042; 28 : Ecoli_O157_EC4045; 29 : Ecoli_O157_EC4076; 30 : Ecoli_O157_EC4113
31 : Ecoli_O157_EC4115; 32 : Ecoli_O157_EC4196; 33 : Ecoli_O157_EC4206; 34 : Ecoli_O157_EC4401; 35 : Ecoli_O157_EC4486; 36 : Ecoli_O157_EC4501
37 : Ecoli_O157_EC508; 38 : Ecoli_O157_EC869; 39 : Ecoli_O157_EDL933; 40 : Ecoli_O157_Sakai; 41 : Ecoli_042; 42 : Ecoli_RS218
43 : Ecoli_S88; 44 : Ecoli_SE11; 45 : Ecoli_SMS35; 46 : Ecoli_UMN026; 47 : Ecoli_UTI89; 48 : Ecoli_VR50
49 : Ecoli_APEC01; 50 : Sflex_2457; 51 : Sflex_2a301; 52 : Sflex_8401; 53 : Sboyd_Sb227; 54 : Sdyse_Sd197
55 : Ssone_Ss046; 56 : Ealbe_TW07627; 57 : Eferg_35469T

Microbial comparative pan-genomics using binomial mixture models
Lars Snipen 1, Trygve Almøy 1 and David W. Ussery 2
1 Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, Norway
2 Centre for Biological Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
Email: Lars Snipen - lars.snipen@umb.no; Trygve Almøy - trygve.almoy@umb.no; David W. Ussery - dave@cbs.dtu.dk
BMC Genomics 2009, 10: in the press [August 2009]

Abstract

Background: The size of the core- and pan-genome of bacterial species is a topic of increasing interest due to the growing number of sequenced prokaryote genomes, many from the same species. Attempts to estimate these quantities have been made, using regression methods or mixture models. We extend the latter approach by using statistical ideas developed for capture-recapture problems in ecology and epidemiology.

Results: We estimate core- and pan-genome sizes for 16 different bacterial species. The results reveal a complex dependency structure for most species, manifested as heterogeneous detection probabilities. Estimated pan-genome sizes range from small (around 2600 gene families) in Buchnera aphidicola to large (around 43000 gene families) in Escherichia coli. Results for Escherichia coli show that as more data become available, a larger diversity is estimated, indicating an extensive pool of rarely occurring genes in the population.

Conclusions: Analyzing pan-genomics data with binomial mixture models is a way to handle dependencies between genomes, which we find are always present. A bottleneck in the estimation procedure is the annotation of rarely occurring genes.

Table 1: Number of genomes refers to completed genomes at NCBI [8] at the end of January 2009. Number of components is the optimal choice of mixture components. The black bars (Coverage column, shown graphically on the slide) indicate pan-genome coverage, i.e. the current sample pan-genome size as a fraction of the estimated pan-genome size.

Species                       Genomes   Components
Campylobacter jejuni          5         3
Coxiella burnetii             5         3
Acinetobacter baumannii       6         4
Buchnera aphidicola           6         3
Helicobacter pylori           6         3
Rhodopseudomonas palustris    6         3
Streptococcus pneumoniae      6         3
Yersinia pestis               7         4
Francisella tularensis        7         4
Bacillus cereus               8         4
Clostridium botulinum         8         3
Prochlorococcus marinus       12        4
Streptococcus pyogenes        13        5
Salmonella enterica           14        5
Staphylococcus aureus         14        6
Escherichia coli              22        6

BMC Genomics 2009, 10: in the press [August 2009]

[Figure 2: observed and predicted core- and pan-genome sizes for 16 species. Horizontal axis: number of gene families, log scale from 64 to 65536. Species, top to bottom: Escherichia coli, Bacillus cereus, Rhodopseudomonas palustris, Salmonella enterica, Clostridium botulinum, Prochlorococcus marinus, Acinetobacter baumannii, Yersinia pestis, Staphylococcus aureus, Streptococcus pyogenes, Streptococcus pneumoniae, Coxiella burnetii, Campylobacter jejuni, Helicobacter pylori, Francisella tularensis, Buchnera aphidicola. Legend: + predicted core; observed core; observed genome; observed pan; c Chao lower pan; X predicted pan; - 90% naive bootstrap. Source: BMC Genomics 2009, 10: in the press [August 2009]]
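The legend entry "Chao lower pan" presumably refers to Chao's lower-bound estimator from the capture-recapture literature, which is an assumption here since the slide does not spell it out. In its usual form the estimate is the observed number of gene families plus y1^2 / (2 * y2), where y1 and y2 are the numbers of families seen in exactly one and exactly two genomes. A minimal sketch under that assumption, with a hypothetical function name and made-up counts:

```python
def chao_lower_bound(counts):
    """Chao's lower bound for the total number of gene families.

    counts: dict mapping g (number of genomes a family occurs in, g >= 1)
            to the number of families observed in exactly g genomes.
    """
    observed = sum(counts.values())          # families seen at least once
    y1 = counts.get(1, 0)                    # families seen in exactly one genome
    y2 = counts.get(2, 0)                    # families seen in exactly two genomes
    if y2 == 0:
        raise ValueError("estimator is undefined when no family occurs in exactly two genomes")
    return observed + y1 * y1 / (2 * y2)


# Made-up counts for four genomes (illustration only, not from the slides).
toy_counts = {1: 300, 2: 100, 3: 50, 4: 550}
estimate = chao_lower_bound(toy_counts)
print(estimate)
```

The estimate rises quickly when many families are singletons, which is one way to read the remark in the abstract that the annotation of rarely occurring genes is a bottleneck.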

[Figure: number of gene families against number of genomes, two panels: Francisella tularensis (0 to 7 genomes, vertical axis 0 to 600) and Escherichia coli (0 to 21 genomes, vertical axis 0 to 25000). Source: BMC Genomics 2009, 10: in the press [August 2009]]

$$\theta_g = \sum_{k=1}^{K} \pi_k \binom{G}{g} \rho_k^{g} (1-\rho_k)^{G-g}, \qquad g = 0, \ldots, G \tag{1}$$

where $\pi_k$ is the mixing proportion and $\rho_k$ is the detection probability.

[Figure 5: four panels, each with probability (0.0 to 1.0) on the vertical axis and number of genomes on the horizontal axis.]

Figure 5: An illustration of a three-component binomial mixture model when G = 10. The upper left panel shows the binomial PDF (red) for the detection probability $\rho_1 = 1.0$, i.e. the core genes, which are always detected in any genome. In the upper right panel a second component has a binomial PDF (green) where $\rho_2 = 0.9$, i.e. these genes are often detected. In the lower left panel a third component (blue) has $\rho_3 = 0.05$, and genes of this class are rarely observed. The lower right panel shows the combination with mixing proportions $\pi_1 = 0.2$, $\pi_2 = 0.1$ and $\pi_3 = 0.7$. BMC Genomics 2009, 10: in the press [August 2009]
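The three-component example in the Figure 5 caption can be evaluated directly from the mixture formula: the probability that a gene family is found in g of G genomes is the sum over components of the mixing proportion times a binomial probability with that component's detection probability. The sketch below is only an illustration of that formula with the caption's parameter values. The function name `mixture_pmf` is hypothetical, and no fitting or model selection is attempted.

```python
from math import comb


def mixture_pmf(G, pis, rhos):
    """Binomial mixture probabilities theta_g for g = 0..G.

    G:    number of genomes
    pis:  mixing proportions, one per component (should sum to 1)
    rhos: detection probabilities, one per component
    """
    theta = []
    for g in range(G + 1):
        total = 0.0
        for pi, rho in zip(pis, rhos):
            # Binomial probability of being detected in g out of G genomes.
            total += pi * comb(G, g) * rho**g * (1 - rho) ** (G - g)
        theta.append(total)
    return theta


# Parameter values from the Figure 5 caption.
G = 10
pis = [0.2, 0.1, 0.7]
rhos = [1.0, 0.9, 0.05]
theta = mixture_pmf(G, pis, rhos)
for g, value in enumerate(theta):
    print(g, round(value, 4))
```

With these values most of the probability mass sits at very low g (the rarely observed class) and at g = G (the core class), which gives the U-shaped distribution typical of gene-family frequency histograms.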

[Figure 3: relative contribution (0.0 to 1.0) per species. Species, top to bottom: Francisella tularensis, Coxiella burnetii, Yersinia pestis, Streptococcus pneumoniae, Helicobacter pylori, Clostridium botulinum, Rhodopseudomonas palustris, Streptococcus pyogenes, Prochlorococcus marinus, Staphylococcus aureus, Campylobacter jejuni, Acinetobacter baumannii, Buchnera aphidicola, Salmonella enterica, Bacillus cereus, Escherichia coli. Source: BMC Genomics 2009, 10: in the press [August 2009]]

[Figure: number of gene families (10000 to 50000) against number of genomes sampled (5 to 20). Source: BMC Genomics 2009, 10: in the press [August 2009]]

[Figure: protein length distributions. Horizontal axis: length of protein (aa), 0 to 2000; vertical axis: fraction of proteins in set. Sets: E. coli str. K-12 substr. MG1655, all proteins (4131); unique proteins (2614); not matching SWISS-PROT (16); matching SWISS-PROT (2592); E. coli singletons (6195).]

[Figure: gene family tree for 25 E. coli genomes. Distance measure: Manhattan distance (# gene families).]
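The tree is labelled with Manhattan distance measured in gene families. One plausible reading, assumed here rather than stated on the slide, is that each genome is represented by a 0/1 presence/absence vector over gene families, so the Manhattan distance between two genomes is the number of families present in one but not the other. A small sketch with hypothetical genome names and made-up profiles:

```python
def manhattan(a, b):
    """Manhattan distance between two equal-length presence/absence vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same gene families")
    return sum(abs(x - y) for x, y in zip(a, b))


# Rows: genomes; columns: gene families (1 = present, 0 = absent). Toy data.
profiles = {
    "genome_1": [1, 1, 1, 0, 0],
    "genome_2": [1, 1, 0, 1, 0],
    "genome_3": [1, 0, 0, 0, 1],
}

names = sorted(profiles)
distance = {
    (p, q): manhattan(profiles[p], profiles[q])
    for p in names
    for q in names
}
for p in names:
    print(p, [distance[(p, q)] for q in names])
```

A distance matrix like this could then be passed to any hierarchical clustering routine to draw a tree; which clustering method was used for the slide is not stated.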

[Figure: five panels, vertical axes in units of 1000 genes: Salmonella, E. coli / Shigella and Yersinia (up to 15); Pseudomonas and Vibrio (up to 20). Curves: pan-genome, core genome, novel genes, novel gene families.]

Fig. 12.5 The pan-genome and core genome for five different Proteobacterial genera. The Salmonella graph represents one species (Salmonella enterica), whereas the E. coli/Shigella figure contains both E. coli and four different Shigella species. The other graphs represent multiple species per genus. All graphs are drawn on the same scale.

D.W. Ussery et al., Computing for Comparative Microbial Genomics, Computational Biology 8, DOI 10.1007/978-1-84800-255-5_14, Springer-Verlag London Limited 2009

[Figure: Streptococcus [26 genomes]. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: 0 to 12000. Horizontal axis: genomes 1 to 26.]

1=Spyogenes_MGAS315; 2=Spyogenes_MGAS2096; 3=Spyogenes_MGAS10394; 4=Spyogenes_MGAS10750; 5=Spyogenes_MGAS10270; 6=Spyogenes_MGAS6180
7=Spyogenes_MGAS5005; 8=Spyogenes_MGAS9429; 9=Spyogenes_MGAS8232; 10=Spyogenes_M1_GAS; 11=Spyogenes_SSI-1; 12=Spyogenes_Manfredo
13=Spneumoniae_R6; 14=Spneumoniae_TIGR4; 15=Spneumoniae_D39; 16=Sthermophilus_CNRZ1066; 17=Sthermophilus_LMD-9; 18=Sthermophilus_LMG18311
19=Ssuis_98HAH33; 20=Ssuis_05ZYH33; 21=Sagalactiae_NEM316; 22=Sagalactiae_A909; 23=Sagalactiae_2603V_R; 24=Ssanguinis_SK36
25=Smutans_UA159; 26=Sgordonii_Challis_CH1

[Figure: Burkholderia [56 genomes]. Curves: pan-genome, core genome, novel genes, novel gene families. Vertical axis: number of genes and gene families, 10 000 to 40 000. Horizontal axis: genomes (n=56), grouped as B. cepacia complex and Pseudomallei group. Species labels: B. cenocepacia, B. ambifaria, B. dolosa, B. vietnamiensis, B. multivorans, B. ubonensis, B. lata, B. pseudomallei, B. mallei, B. oklahomensis, B. thailandensis, B. graminis, B. phytofirmans, B. phymatum, Burkholderia H160, B. xenovorans.]

[Figure: Bacillus [24 genomes]. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: 0 to 25000. Horizontal axis: genomes 1 to 24.]

1=Banthracis_A1055; 2=Banthracis_Ames0581; 3=Banthracis_Ames; 4=Banthracis_Australia94; 5=Banthracis_CNEVA-9066; 6=Banthracis_Kruger_B
7=Banthracis_Sterne; 8=Banthracis_USA6153; 9=Banthracis_Vollum; 10=Bcereus_AH187; 11=Bcereus_AH820; 12=Bcereus_ATCC10987_Main
13=Bcereus_ATCC14579_Main; 14=Bcereus_E33L_Main; 15=Bcereus_G9241; 16=Bcereus_NVH391-98; 17=Bsubtilis_168; 18=Bthuringiensis_97-27
19=Bthuringiensis_AlHakam; 20=Bamyloliquefaciens_FZB42; 21=Bclausii_KSM-K16; 22=Bhalodurans_C-125; 23=Blicheniformis_ATCC14580; 24=Bpumilus_SAFR-032

[Figure: pan- and core-genome plot for 20 Campylobacter genomes. Curves: New genes, New gene families, Core genome, Pan genome. Vertical axis: 0 to 6000. Horizontal axis: genomes 1 to 20.]

1 = Campylobacter jejuni 269.97; 2 = Campylobacter jejuni RM1221; 3 = Campylobacter jejuni CG8486; 4 = Campylobacter jejuni CF93-6; 5 = Campylobacter jejuni 84-25
6 = Campylobacter jejuni CG8421; 7 = Campylobacter jejuni 81-176A; 8 = Campylobacter jejuni 81-176B; 9 = Campylobacter jejuni 260.94; 10 = Campylobacter jejuni NCTC 11168
11 = Campylobacter jejuni HB93-13; 12 = Campylobacter jejuni 81116; 13 = Campylobacter jejuni M1; 14 = Campylobacter coli RM2228; 15 = Campylobacter lari RM2100
16 = Campylobacter concisus 13826; 17 = Campylobacter fetus 82-40; 18 = Campylobacter upsaliensis RM3195; 19 = Campylobacter hominis ATCC BAA-1381; 20 = Campylobacter curvus 525.92

[Figure: Clostridium [15 genomes]. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: 0 to 30000. Horizontal axis: genomes 1 to 15.]

1=Cperfringens_13; 2=Cperfringens_ATCC13124; 3=Cperfringens_SM101; 4=Cbotulinum_A_ATCC_19397; 5=Cbotulinum_F_Langeland
6=Cdifficile_630; 7=Ctetani_E88; 8=Cacetobutylicum_ATCC824; 9=Cbeijerincki_NCIMB_8052; 10=Ccellulolyticum_H10
11=Cnovyi_NT; 12=Cphytofermentans_ISDg; 13=Csp_OhILAs; 14=Cthermocellum_ATCC27405; 15=Ckluyveri_DSM555

[Figure: Vibrio [28 genomes]. Curves: new genes, core genome, pan genome. Vertical axis: 0 to 20000. Horizontal axis: genomes 1 to 28, grouped as V. cholerae, V. parahaemolyticus, V. vulnificus, P. profundum, V. profundum and others.]

[Figure: Mycobacterium [18 genomes]. Curves: new genes, new gene families, core genome, pan genome. Vertical axis: 0 to 20000. Horizontal axis: genomes 1 to 18.]

1=Mtuberculosis_CDC1551; 2=Mtuberculosis_C; 3=Mtuberculosis_F11; 4=Mtuberculosis_H37Ra; 5=Mtuberculosis_H37Rv; 6=Mtuberculosis_Haarlem
7=Msp_JLS; 8=Msp_KMS; 9=Msp_MCS; 10=Mavium_104; 11=Mavium_K-10; 12=Mbovis_AF2122_97
13=Mbovis_BCG; 14=Mgilvum_PYR-GCK; 15=Mleprae_TN; 16=Msmegmatis_MC2_155; 17=Mulcerans_Agy99; 18=Mvanbaalenii_PYR-1

Genus           Length      %AT   Core   Pan      # genomes
Streptococcus   ~2 Mbp      62%   800    10,000   26 [63]
Burkholderia    5-10 Mbp    31%   900    32,000   36 [55]
Bacillus        4-5 Mbp     63%   500    22,000   24 [48]
Clostridium     2-5 Mbp     72%   200    27,000   15 [43]
Vibrio          4-6 Mbp     53%   800    23,000   28 [35]
Mycobacterium   3-5 Mbp     43%   900    16,000   18 [30]
Salmonella      ~5 Mbp      48%   2300   8,400    24 [30]
Yersinia        ~5 Mbp      52%   2100   13,000   25 [23]

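[Figure: pie chart of gene categories. Labelled slices: Unknown 5.3%; Unk. conserved 5.3%; Partial Info. 4.0%. Other slice labels: Phage/IS, in common, pseudo in both, leader pep., cell process, lipo, RNA, enzymes, pc, carrier, ps, struct, mem, pf, pm, factor, pr, regul, pe, pt, trans.]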

Code   COGs    Domains   Description

Information storage and processing
J      245     10,572    Translation, ribosomal structure and biogenesis
A      25      137       RNA processing and modification
K      231     11,271    Transcription
L      238     10,338    Replication, recombination and repair
B      19      228       Chromatin structure and dynamics

Cellular processes and signaling
D      72      1,678     Cell cycle control, cell division, chromosome partitioning
Y      -       -         Nuclear structure
V      46      2,380     Defense mechanisms
T      152     7,683     Signal transduction mechanisms
M      188     7,853     Cell wall/membrane/envelope biogenesis
N      96      2,747     Cell motility
Z      12      128       Cytoskeleton
W      1       25        Extracellular structures
U      159     3,743     Intracellular trafficking, secretion, vesicular transport
O      203     6,206     Post-translational modification, protein turnover, chaperones

Metabolism
C      258     9,830     Energy production and conversion
G      230     10,816    Carbohydrate transport and metabolism
E      270     14,939    Amino acid transport and metabolism
F      95      3,922     Nucleotide transport and metabolism
H      179     6,582     Coenzyme transport and metabolism
I      94      5,201     Lipid transport and metabolism
P      212     9,232     Inorganic ion transport and metabolism
Q      88      4,055     Secondary metabolites biosynth., transport and catabolism

Poorly characterized
R      702     22,721    General function prediction only
S      1346    13,883    Function unknown

Fig. 11.1 Cluster of Orthologous Genes codes for functionally related gene categories. The table was reproduced from the NCBI website as it appeared in March 2008 (Obtained from http://www.ncbi.nlm.nih.gov/cog/grace/fiew.cgi). From chapter 11, "Of Proteins, Genomes, and Proteomes".

[Figure: pie chart of functional categories in the E. coli K-12 genome. Main sectors: Information, Cell proc., Metabolism, unknown, not in COGs.]

Legend: Translation; RNA processing and modification; Transcription; Replication, recombination and repair; Cell cycle control, mitosis and mei; Defense mechanisms; Signal transduction mechanisms; Cell wall/membrane biogenesis; Cell motility; Pili and flagella; Intracellular trafficking and secret; Posttranslational modification; Energy production and conversion; Carbohydrate transport and metabolism; Amino acid transport and metabolism; Nucleotide transport and metabolism; Coenzyme transport and metabolism; Lipid transport and metabolism; Inorganic ion transport and metabolism; Secondary metab biosynt, transp catab; General function prediction only; Function unknown; not in COGs.

[Figure: two pie charts of functional categories, E. coli core and E. coli pan. Main sectors in each: Information, Cell proc., Metabolism, unknown, not in COGs.]

Legend: Translation; RNA processing and modification; Transcription; Replication, recombination and repair; Cell cycle control, mitosis and mei; Defense mechanisms; Signal transduction mechanisms; Cell wall/membrane biogenesis; Cell motility; Pili and flagella; Intracellular trafficking and secret; Posttranslational modification; Energy production and conversion; Carbohydrate transport and metabolism; Amino acid transport and metabolism; Nucleotide transport and metabolism; Coenzyme transport and metabolism; Lipid transport and metabolism; Inorganic ion transport and metabolism; Secondary metab biosynt, transp catab; General function prediction only.

[Figure: three pie charts compared: E. coli K-12 genome, E. coli core, E. coli pan. Sectors: Unknown, Information, Cellular processes, Metabolism.]