MiGA: The Microbial Genome Atlas

December 12th, 2017. MiGA: The Microbial Genome Atlas. Jim Cole, Center for Microbial Ecology, Dept. of Plant, Soil & Microbial Sciences, Michigan State University, East Lansing, Michigan, U.S.A.

Where I'm From: Michigan State University, 50,000 students (11,000 graduate students). College of Agriculture & Natural Resources, Department of Plant, Soil & Microbial Sciences. Center for Microbial Ecology: studies the interactions of microbes with each other and with their environment. http://rdp.cme.msu.edu/

Relationship to Health Sciences. Microbiome: the microorganisms in a particular environment (including the body or a part of the body). Only 10% of the cells in your body are human! ~23,000 human genes; 1,000,000+ genes in the human microbiome.

Outline of My Presentation: Background Material; MiGA, the Microbial Genome Atlas.

One Representation of the Tree of Life

Can you name these bacteria? From: Ch. 2 -- Terrestrial Bacteria from Agricultural Soils, by Masoomeh Shams-Ghahfarokhi, Sanaz Kalantari and Mehdi Razzaghi-Abyaneh. DOI: 10.5772/45918

Elucidation of the three domains of life: Carl Woese (1928-2012). Used ribosomal RNA sequence as a phylogenetic marker. Discovered a 3rd kingdom: Archaea and Bacteria are separate domains, in contrast with the former Prokaryote hypothesis.

Phylogenetic Tree of Life: three domains of life, based on the work of Carl Woese and colleagues.

Ribosomes: Universal Marker. Subunits: 30S (16S rRNA) and 50S (23S and 5S rRNA). Protein synthesis factory. Core function present in all cellular organisms. Very little evidence of horizontal gene transfer. Historically easy to work with: purify by centrifugation and extract rRNA. Now we use PCR to amplify from genomic DNA. rRNA genes have conserved regions interspersed with highly variable regions. Conserved regions are used for both PCR primers and sequencing primers.

Diversity of uncultured organisms explored by rRNA sequencing. David A. Stahl, David J. Lane, Gary J. Olsen and Norman R. Pace. Science, New Series, Vol. 224, No. 4647 (Apr. 27, 1984), pp. 409-411. Published by: American Association for the Advancement of Science.

Hydrothermal Vent: Black Smoker

Explosion in rRNA Sequencing: By 2008, the majority of all bacterial sequences submitted to GenBank were 16S rRNA sequences. Less than 2% of these had a Latin name attached (valid or not) (R. Christen, 2008).

Growth of rRNA data. [Figure: number of sequences (in millions) by year, 1992 to 2015, split into environmental sequences and isolate sequences. Release 11.4: 3,333,501 sequences.]

Limits of rRNA Phylogeny: Slowly evolving, so it can't resolve species. Short sequence, ~1550 bases: high random error. Can add LSU rRNA, but the database is limited.

Genes Beyond rRNA: rRNA genes are slowly evolving and present in multiple copies. Other single-copy conserved genes are faster evolving. Many important ecological functions are encoded by genes that are horizontally transferred. Their evolutionary history does not match that of rRNA.

rplB vs 16S Pairwise Distances in One Order (RefSeq Genomes). Jiarong Guo

1995: Haemophilus influenzae genome published

Bacterial and Archaeal Genomes from cultured organisms (INSDC 9/4/2017) (Compare to >3 million rRNA genes): 8,227 complete genomes; 1,469 genomes with gaps; 46,773 scaffolds; 49,569 contigs only; 106,038 isolate genomes in total. It is now cheaper to obtain a draft genome than a single 16S rRNA sequence was 15 years ago!

Microbial Genomes from Uncultured Organisms. Single-cell genomes: single microbial cells are separated before sequencing. Issues: incomplete genomes; enzymatic DNA amplification causes artifacts. Metagenomic binning: genomes are grouped from metagenomic assemblies. Issues: incomplete genomes; may mix allelic variants; contamination is an issue.

Objectives of the MiGA project: How would you taxonomically classify a novel genome? How would you build a novel classification for a collection of genomes? In other words: to build the genome equivalent of the Ribosomal Database Project (RDP), based on the ANI/AAI approach.

Multi-Gene Phylogenetic Analysis. Option 1: use additional universal marker genes. Universal: transcription, translation, replication. Choose for no horizontal gene transfer. Unfortunately, few genes meet these criteria; 100-130 genes are commonly used. Option 2: compare all genes common between each pair of organisms (Average Identity). Uses a larger part of the available genome. Robust to missing data (partial genomes).

Introduction to the Pangenome. "Of Terms in Biology: The Pan-Genome" by Christoph Weigel, in Small Things Considered, June 12, 2014. schaechter.asmblog.org

Horizontal gene transfer occurs more readily between closely related organisms

Need to find comparable genes (for the ANI method). Homologous: the existence of shared ancestry between a pair of genes. Orthologous: inherited by two organisms from the same ancestral sequence. (Usually same function.) Paralogous: originally created by a duplication event within a single genome. (May have different functions.)

Reciprocal Best Matches: Likely Orthologs. [Figure: Strain A genes matched against Strain B genes.]

Best matches not reciprocal: Potential Paralogs? [Figure: Strain A genes matched against Strain B genes.]
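
The reciprocal test on these two slides can be sketched in a few lines. This is a minimal illustration, not MiGA's implementation: the gene names and bit scores are made up, and `best_hits` and `reciprocal_best_matches` are hypothetical helper names. It assumes you already have pairwise search scores in both directions (A genes searched against B, and B genes searched against A).

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict {(query, subject): score}, e.g. alignment bit scores.
    On ties, the first pair seen wins (an arbitrary choice for this sketch).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_matches(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores for searches in both directions.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50,
          ("a2", "b1"): 120, ("a3", "b3"): 90}
b_vs_a = {("b1", "a1"): 198, ("b1", "a2"): 118,
          ("b2", "a1"): 52, ("b3", "a3"): 91}

rbm = reciprocal_best_matches(a_vs_b, b_vs_a)
print(rbm)
```

Here a2's best match is b1, but b1's best match is a1, so the a2/b1 pair is not reciprocal and is left out as a potential paralog.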

ANI: Average Nucleotide Identity. AAI: Average Amino Acid Identity. hAAI: Heuristic AAI. Implementation: Rodriguez-R & Konstantinidis 2016, PeerJ Preprint.
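
As a rough illustration of what "average identity" means, the sketch below takes the percent identity of each matched gene pair between two genomes and returns the mean. It assumes the matches have already been found (nucleotide alignments for ANI, protein alignments for AAI); the function name and the input values are hypothetical, and real tools apply extra filters (alignment length, identity cutoffs) that are omitted here.

```python
def average_identity(match_identities):
    """Mean percent identity over the matched genes of a genome pair.

    match_identities: iterable of percent identities (0-100), one per match.
    Returns None when the genome pair shares no matches.
    """
    values = list(match_identities)
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical identities for three matched gene pairs.
identities = [98.0, 92.5, 95.5]
ani = average_identity(identities)
print(round(ani, 2))
```

Because the mean is taken only over genes found in both genomes, a partial genome contributes fewer matches but can still yield a usable estimate.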

Detect not previously described (novel) taxa. [Figure axis: % of genome pairs in a taxonomic rank.] Novel taxa are determined at the species, genus & phylum levels: novel species, <95% AAI; novel genus, <65% AAI; novel phylum, <45% AAI.
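
The cutoffs on this slide can be read as a simple decision rule. The sketch below applies them to the highest AAI found between a query genome and any reference; the function and constant names are hypothetical, and hard cutoffs are a simplification of whatever statistics the slide's figure summarizes.

```python
# Thresholds as listed on the slide (percent AAI to the closest reference),
# ordered from the highest rank to the lowest.
NOVELTY_THRESHOLDS = [("phylum", 45.0), ("genus", 65.0), ("species", 95.0)]


def novel_rank(max_aai):
    """Return the highest rank at which the query looks novel, or None."""
    for rank, threshold in NOVELTY_THRESHOLDS:
        if max_aai < threshold:
            return rank
    return None


for aai in (40.0, 60.0, 90.0, 97.0):
    print(aai, novel_rank(aai))
```

A query at 60% AAI to its closest reference would be reported as a novel genus (and therefore also a novel species) within a known phylum.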

Average Nucleotide Identity: a replacement for DDH. "Among available genome relatedness indices, average nucleotide identity (ANI) is one of the most robust measurements of genomic relatedness between strains, and has great potential in the taxonomy of bacteria and archaea as a substitute for the labour-intensive DNA-DNA hybridization (DDH) technique." Kim et al., IJSEM, February 2014, vol. 64, no. Pt 2, 346-351.

MiGA uses Average Identity

Hierarchical approach to genome classification. Footnotes: (1) Bacterium vs. archaeon, 1 CPU. (2) Two E. coli genomes, 1 CPU.

Hierarchical approach to genome classification. Footnotes: (1) In the NCBI RefSeq database. (2) Phylum, genus, or distant species.

Pre-clustering references: AAI or ANI distances; medoid clustering.

Query the clustering: medoid clustering.
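
One way to picture pre-clustering and querying is shown below: references are grouped around medoids using a distance matrix, and a query is then compared only to the medoids rather than to every reference. This is a plain alternating k-medoids sketch with made-up distances (think 100 minus AAI), not MiGA's actual code; the function names, the initial medoids, and the toy matrix are all assumptions.

```python
def assign(dist, medoids):
    """Assign every item to its nearest medoid (both are indices into dist)."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]


def k_medoids(dist, medoids, max_iter=100):
    """Alternating k-medoids on a symmetric distance matrix.

    dist: square list of lists; medoids: initial medoid indices.
    """
    medoids = list(medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i, lab in enumerate(labels) if lab == m]
            if not members:  # keep a medoid that lost all of its members
                new_medoids.append(m)
                continue
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


def nearest_medoid(query_dist, medoids):
    """query_dist: dict {medoid index: distance from the query genome}."""
    return min(medoids, key=lambda m: query_dist[m])


# Hypothetical distances among five reference genomes (two obvious groups).
dist = [
    [0, 2, 3, 40, 42],
    [2, 0, 2, 41, 43],
    [3, 2, 0, 39, 41],
    [40, 41, 39, 0, 4],
    [42, 43, 41, 4, 0],
]
medoids, labels = k_medoids(dist, medoids=[0, 4])
closest = nearest_medoid({1: 2.5, 3: 38.0}, medoids)
print(medoids, labels, closest)
```

The query needs two comparisons instead of five here; with nested clusterings the same idea would be repeated inside the chosen cluster.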

Input data types and project types: genome classification against reference; clade project.

MiGA's genome classification output (in part)

16S rRNA taxonomy and quality metrics

Genome contamination analysis with MyTaxa. MyTaxa* scan of an assembled Salmonella genome from a stool metagenome. Detecting chimeras, areas to focus on for manual checking, HGT. *Soon available through: http://enve-omics.gatech.edu/

So many bad-quality genomes. What to do? MyTaxa_Scan of a submitted genome: B. cereus in 95% of the sequence, Streptococcus pneumoniae in the rest and in the 16S. Detecting chimeras, areas to focus on for manual checking, HGT.

Clade project. See also the ogs.* utilities in the Enveomics Collection.

Pangenome calculation in a clade project. Enveomics Collection: Rodriguez-R & Konstantinidis, PeerJ 2016.
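
For a sense of what a pangenome summary contains, the sketch below counts orthology groups (OGs) by how many genomes carry them. The input format, the OG and genome names, and the `pangenome_summary` function are hypothetical; this does not reproduce the interface or output of the Enveomics ogs.* utilities.

```python
def pangenome_summary(ogs, n_genomes):
    """Summarize a clade's pangenome from orthology-group membership.

    ogs: dict {og_id: set of genome names carrying that group}.
    n_genomes: total number of genomes in the clade.
    """
    core = sorted(og for og, genomes in ogs.items() if len(genomes) == n_genomes)
    unique = sorted(og for og, genomes in ogs.items() if len(genomes) == 1)
    return {"pan": len(ogs), "core": core, "unique": unique}


# Hypothetical orthology groups for a three-genome clade.
ogs = {
    "OG1": {"g1", "g2", "g3"},
    "OG2": {"g1", "g2"},
    "OG3": {"g3"},
    "OG4": {"g1", "g2", "g3"},
}
summary = pangenome_summary(ogs, n_genomes=3)
print(summary)
```

Groups present in every genome form the core genome, groups found in a single genome are genome-specific, and the total number of groups is the pangenome size.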

Medoid clustering to call clades: very robust separation, even among closely related genomes of B. anthracis (>99.5% ANI).