Introduction to microbiota data analysis

Similar documents
Microbiome: 16S rrna Sequencing 3/30/2018

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

Taxonomy and Clustering of SSU rrna Tags. Susan Huse Josephine Bay Paul Center August 5, 2013

MiGA: The Microbial Genome Atlas

Amplicon Sequencing. Dr. Orla O Sullivan SIRG Research Fellow Teagasc

Section 18-1 Finding Order in Diversity

Introductory Microbiology Dr. Hala Al Daghistani

Assigning Taxonomy to Marker Genes. Susan Huse Brown University August 7, 2014

Taxonomical Classification using:

Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria

Microbial Diversity and Assessment (II) Spring, 2007 Guangyi Wang, Ph.D. POST103B

Objectives. Classification. Activity. Scientists classify millions of species

Exploring Microbes in the Sea. Alma Parada Postdoctoral Scholar Stanford University

Chapter 19. Microbial Taxonomy

Microbial Taxonomy and the Evolution of Diversity

Rapid Learning Center Chemistry :: Biology :: Physics :: Math

Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Microbial Taxonomy. Slowly evolving molecules (e.g., rrna) used for large-scale structure; "fast- clock" molecules for fine-structure.

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

Summary Finding Order in Diversity Modern Evolutionary Classification

Bacillus anthracis. Last Lecture: 1. Introduction 2. History 3. Koch s Postulates. 1. Prokaryote vs. Eukaryote 2. Classifying prokaryotes

Nature Biotechnology: doi: /nbt Supplementary Figure 1. Detailed overview of the primer-free full-length SSU rrna library preparation.

Chapter 17. Table of Contents. Objectives. Taxonomy. Classifying Organisms. Section 1 Biodiversity. Section 2 Systematics

Other resources. Greengenes (bacterial) Silva (bacteria, archaeal and eukarya)

A. Correct! Taxonomy is the science of classification. B. Incorrect! Taxonomy is the science of classification.

Probing diversity in a hidden world: applications of NGS in microbial ecology

Kingdoms in Eukarya: Protista, Fungi, Plantae, & Animalia Each Eukarya kingdom has distinguishing characteristics:

An Automated Phylogenetic Tree-Based Small Subunit rrna Taxonomy and Alignment Pipeline (STAP)

Characteristics of Life

Microbial Taxonomy. Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Introduction to Microbiology. CLS 212: Medical Microbiology Miss Zeina Alkudmani

Microbial analysis with STAMP

Bioinformatics Chapter 1. Introduction

Name: Class: Date: ID: A

Concept Modern Taxonomy reflects evolutionary history.

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

What makes things alive? CRITERIA FOR LIFE

9/19/2012. Chapter 17 Organizing Life s Diversity. Early Systems of Classification

CH. 18 Classification

Taxonomy. Taxonomy is the science of classifying organisms. It has two main purposes: to identify organisms to represent relationships among organisms

Outline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity?

Multiple Choice Write the letter on the line provided that best answers the question or completes the statement.

Zoology. Classification

The Effect of Primer Choice and Short Read Sequences on the Outcome of 16S rrna Gene Based Diversity Studies

Robert Edgar. Independent scientist

Ch 10. Classification of Microorganisms

Unit Two: Biodiversity. Chapter 4

Carolus Linnaeus System for Classifying Organisms. Unit 3 Lesson 2

Taxonomy and Biodiversity

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Classification Practice Test

CLASSIFICATION OF LIVING THINGS

FIG S1: Rarefaction analysis of observed richness within Drosophila. All calculations were

Sec$on 9. Evolu$onary Rela$onships

Introduction to Evolutionary Concepts

Classification of Living Things. Unit II pp 98

Classification and Viruses Practice Test

NAME: DATE: PER: CLASSIFICATION OF LIFE Powerpoint Notes

Test Bank for Microbiology A Systems Approach 3rd edition by Cowan

Biology Classification Unit 11. CLASSIFICATION: process of dividing organisms into groups with similar characteristics

Bergey s Manual Classification Scheme. Vertical inheritance and evolutionary mechanisms

Station A: #3. If two organisms belong to the same order, they must also belong to the same

A. Incorrect! In the binomial naming convention the Kingdom is not part of the name.

The Tree of Life. Chapter 17

Test Bank for Microbiology A Systems Approach 3rd edition by Cowan

Evolution and Taxonomy Laboratory

PHYLOGENY AND SYSTEMATICS

Organizing Life on Earth

Microbiota: Its Evolution and Essence. Hsin-Jung Joyce Wu "Microbiota and man: the story about us

TAXONOMY. The Science of Classifying Organisms

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Let s get started. So, what is science?

Microbiology / Active Lecture Questions Chapter 10 Classification of Microorganisms 1 Chapter 10 Classification of Microorganisms

Taxonomy Taxonomy: field of biology that identifies and classifies organisms

Speciation and Classification

Learning Outcome B1 13/10/2012. Student Achievement Indicators. Taxonomy: Scientific Classification. Student Achievement Indicators

Microbial Diversity. Yuzhen Ye I609 Bioinformatics Seminar I (Spring 2010) School of Informatics and Computing Indiana University

TAXONOMY. The Science of Classifying Organisms

Building the Tree of Life

Supplemental Online Results:

Origins of Life. Fundamental Properties of Life. Conditions on Early Earth. Evolution of Cells. The Tree of Life

Plant Names and Classification

Characteristics of Living Things Card Sort

Burton's Microbiology for the Health Sciences

CLASSIFICATION. Why Classify? 2/18/2013. History of Taxonomy Biodiversity: variety of organisms at all levels from populations to ecosystems.

UNIT 4 TAXONOMY AND CLASSIFICATION

es tion Nota Classific

Classification Classification key Kingdom Organism Species Class Genus Binomial Nomenclature

SECTION 17-1 REVIEW BIODIVERSITY. VOCABULARY REVIEW Distinguish between the terms in each of the following pairs of terms.

Study Guide. Biology 2101B. Science. Biodiversity. Adult Basic Education. Biology 2101A. Prerequisite: Credit Value: 1

Comparison of Three Fugal ITS Reference Sets. Qiong Wang and Jim R. Cole

Classification of Living Things

18-1 Finding Order in Diversity Slide 2 of 26

Whole Genome Alignments and Synteny Maps

8/23/2014. Phylogeny and the Tree of Life

Unit 5: Taxonomy. KEY CONCEPT Organisms can be classified based on physical similarities.

Comparative genomics: Overview & Tools + MUMmer algorithm

Interpreting the Molecular Tree of Life: What Happened in Early Evolution? Norm Pace MCD Biology University of Colorado-Boulder

Classification. Essential Question Why is it important to place living things into categories?

Amy Driskell. Laboratories of Analytical Biology National Museum of Natural History Smithsonian Institution, Wash. DC

Transcription:

Introduction to microbiota data analysis Natalie Knox, PhD Head Bacterial Genomics, Bioinformatics Core National Microbiology Laboratory, Public Health Agency of Canada 2 National Microbiology Laboratory

project manager 3 Who I am data analyst researcher biostatistician bioinformatician Computational biologist Consult and sound board Jack of all trades, master of none 4 The NML bioinformatics toolbox

Microbiology & Microbial ecology 101 The basics 5 Microbial ecology is the study of microbes in the environment and their interactions with each other. Microbes are the tiniest creatures on Earth, yet despite their small size, they have a huge impact on us and on our environment. 6

7 8 Microbial ecology - the big 3 1. Who is there? 2. How many are there? 3. What do they do? What are microbes? Bacteria Archaea Virus Parasite Helminths (parasitic worms) Fungi yeast, mold, mushrooms Protozoa Ciliates Amoebae flagellates

9 Tree of life: Six Kingdom Classification ry ka u E es ot Prokaryotes - 10 unicellular/multicellular Presence of nucleus and mitochondria - Unicellular Absence of nucleus and mitochondria Taxonomic Ranking System Life Life Bacteria Bacteria Eubacteria Animalia Proteobacteria Chordata Gammaproteobacteria Mammalia Enterobacteriales Primates Enterobacteriaceae Hominidae Escherichia Homo E. coli H. sapiens

11 Scientific Nomenclature for Bacteria binomial nomenclature Italicization Family, genus, species, subspecies Not strain or serovar Capitalization Genus and above Serovars Abbreviations genus after initial introduction (e.g. E. coli) sp. & spp. (e.g. Salmonella spp.) gen. nov. (genus novum) or sp. nov. (species nova) Plurals vs singular Genus / genera Species / species Phylum / phyla Introduction to Microbiota Studies 12 The Father of Taxonomy Carl Linnaeus

13 Let s start with some terminology Microbiota Represents the assemblage of microorganisms present in a given environment Microbiome Includes the microorganisms and their entire genetic content in a given environment Metagenome Gene and genomes of a microbiota 14 DNA-based microbiota survey methods Targeted amplicon sequencing Shot-gun metagenomics

15 DNA-based microbiota survey methods Targeted amplicon 16 Requires taxonomically informative marker Ubiquitous to target group Bacteria and archaea: 16S rrna, cpn60 Fungi: ITS, 18S rrna Easily amplified Highly curated and comprehensive reference database Amplification and primer bias/error Ideally single copy Low selective pressure What approach is best? Shot-gun metagenomics Unrestricted sequencing of all DNA present in a sample Eukaryotic, prokaryotic, virus Sampling all genomic content Sequencing depth Sample matrix Low or high biomass sample Functional profiling Assemblies challenging

17 DNA-based microbiota study categories Targeted amplicon Shotgun metagenomics Advantage Disadvantage Advantage Disadvantage Microbial target Narrow scope a priori knowledge required Wide scope (unbiased*) Virome assays complex & overwhelming host DNA Cost Relatively low Can be costly Moderate High Computational requirements Data analysis Many workflows & software available Relies on curated DB Complex & (few) research grade software * unbiased sequencing (wet-lab requires targeted approaches) 18 shot -gun metagenomics Bioinformatics rule of thumb Using the right tool for the job Targeted amplicon sequencing Relies on curated DB

19 More confusion... 20 Which approach is best? Consider the following: What are your research questions How much money and time do you want to invest? Both methods require moderate to expert knowledge High performance computing environment Most open-source software available Not always microbiologist friendly Often based on the command-line

Targeted amplicon sequencing The famous 16S rrna gene 21 22 The Bacterial Ribosome

23 Ribosome fun facts 24 Prokaryotes and Eukaryotes have ribosomes The ribosome is composed of 2 major components: Small subunit (SSU): reads the RNA Large subunit (LSU): join amino acids to form a polypeptide chain Unit of measurement is Svedberg unit Measure of the rate of sedimentation in centrifugation Prokaryotes have a 70S ribosome SSU = 30S (16S rrna) LSU = 50S (5S and 23S rrna) Eukaryotes have an 80S ribosome SSU = 40S (18S rrna) LSU = 60S (5S, 5.8S, and 28S rrna) Let s talk about 16S rrna Bacteria and archaea specific (mitochondria, chloroplasts) Multiple copy numbers (anywhere from x to z) copies of the ribosome gene operon dispersed throughout the genome e.g. E. coli has ~ 7 copies of same operon (rrna, rrnb, etc.) Eukaryotic ribosome gene operons typically occur in tandem arrays

25 Getting to know the 16S rrna gene in more depth Part of the 30S SSU of prokaryotic ribosome that binds to the RBS (Shine-Dalgarno sequence/ribosome Binding Site) Length = 1540 nucleotides bound to 21 proteins 3 end codes for the anti-rbs Helps bind SSU and LSU together (interacts with 23S in LSU) 26 Getting to know the 16S rrna gene intimately Most widely used taxonomic marker Conserved secondary structure acts as an anchor

27 Why is the 16S rrna gene ideal to characterize microbiota in a given environment? 28 Universally present in bacteria Low selective pressure Easily amplified Conserved with variable regions Adequate discriminatory power to genus level Variable regions can be adequately sequenced with most next generation sequencing technologies Other taxon markers ITS: Internal transcribed spacers (ITS1 and ITS2) for Fungi Variable lengths: ~ 360/232 bp each (600-700 bp) cpn60: chaperonin 60 (cpn60 group I) ~550 bp rpob: beta subunit of DNA polymerase ~370 bp 23S rrna ~ 2300 bp

Targeted amplicon sequencing Data analysis 29 30 Overall workflow of targeted-amplicon sequencing approach gdna extraction 16s rrna variable region amplification High-throughput sequencing Data analysis: QC and filtering building OTUs Downstream analytics & statistics

31 16S rrna sequencing methodology 32 Illumina sequencing methodology

33 Illumina sequencing methodology 34 Illumina clusters and base calling

35 Sequencing output * This is a fastq file. Remember, you will have 2 fastq files for each sample sequenced (forward and reverse) 36 The basics of a fastq file

37 Phred scores A quality score is a prediction of the probability of an error in sequencing base calling 38 Data Quality Control R1: Forward R2: Reverse

39 Wet-lab and sequencing aspects to consider while planning your project... 40 Sequencing platform Chemistry Read length Primers Sample matrix and expected microbial community DNA extraction protocol Sequencing concerns R2 R1 Amplicon Low base diversity libraries PhiX incorporation(~10-50%) Cluster density Amplicon length Sequencing overlap

Data analysis General overview and software 41 42 Data analysis software Pat Schloss (University of Michigan) Open-source written in C and C++ Few dependencies / easy to install platform independent Windows-compatible executable source code: Unix/Linux or Mac OS X environments Rob Knight (University of California San Diego) Open-source Backbone written in python Available as a virtual machine or in packaged container (miniconda) Many dependencies / install challenging Linux or MAC supported Container for different analysis software More visualizations Written in python Open-source

43 Data analysis workflow Trim and filter sequences Build contigs 44 Align sequences Taxonomic classification Build OTUs Downstream analysis The OTU Operational Taxonomic Unit A bin containing sequences of X % similarity - a sorting process OTU1 OTU2 OTU3 OTU4 * Suggested guidelines: 97% represents species level while 95% equivalent to genus level

45 Major steps to building an OTU 1. Alignment 2. Distance Matrix 3. Clustering 46 Health break

Data analysis Detailed workflow 47 48 16S rrna data analysis token - OTUs R2 R1 Alignment Goal: High quality and stable OTUs Are they the same? Considerations for creating OTUs clustering method Aligner database contig

49 Building contigs: sample output This is a fasta file 50 Trim and filter sequences Pre-processing step trim off primer sequences and barcodes cull sequences based on: quality scores sequence length (min. and max.) presence of ambiguous bases and homopolymers (sequencer dependent) Create non-redundant set of sequences redundant sequences are indexed for use in downstream analysis

51 Removing sequencing noise and chimeras Pre-clustering sequences (Huse et al. 2010) assumes rare sequences which are highly similar to abundant sequences are an artefact of pyrosequencing error default: difference=1 base mismatch A G T C C T G == A G G C C T G Identifying and removing chimeras chimera slayer (Broad Institute) and Uchime (Rob Knight s Lab) cpu intensive 52 Pitfalls of not removing PCR and sequencing artefacts Inaccurate results Falsely inflated diversity Mis-classified OTUs or presence of novel OTUs Excessively large distance matrix High memory usage and CPU time

53 Removing other non-prokaryotic junk Other sources of contaminating DNA Mitochondrial Chloroplastic rrna 54 Other sources of contamination: the kitome

55 The kitome : the reagent microbiome Lab reagent microbiome e.g. DNA extraction reagents DNA is everywhere! Autoclaving DNA-free Bleach and filter tips (a must!) Separate your DNA extraction and PCR setup stations Sample-to-sample contamination Some reagents produced by bacteria Sequence a negative control Even if no band present on gel! 56 Some considerations... 1. Sequence positive and negative control at every step a. Mock community - sequencing error rate b. Abundant donor - sanity check c. Negative control (extraction blank) - contaminant check 2. Low biomass samples more susceptible a. Aim for starting sample >103-104 cells 3. Careful sample collection 4. Random order processing (different kits for replicates) 5. Documentation is key (eg. lot numbers) 6. Critical evaluation of results

57 Mock community Known community structure No PCR inhibitors Used to calculate sequencing error rate Create your own or use HMP s 58

59 Phylotype vs. OTU-based approach OTU-based approach Phylotype No MSA Sequences assigned to bins based on similarity to database inaccurate 60 Based on MSA Secondary structure should be considered to conserve positional homology More CPU intensive E.g. typical user could not align more than ~10,000 full-length sequences - lack of RAM Phylotype-based analysis Build contigs Trim and filter sequences Align sequences Build OTUs Taxonomic classification Downstream analysis Align sequences Phylotype-based analysis: Sequences assigned to bin based on similarity to database sequence Limitations Marginally similar sequences may be assigned the same taxonomy Novel sequences may be incorrectly classified Are taxonomy dependent classification to species/genus level inaccurate

61 MSA: aligners Why can t we use generic aligners? Don t scale well Data stored in RAM They don t consider secondary structure of the 16S rrna Why should we care about secondary structure? Conserves positional homology Increases robustness of alignment (less sensitive to user supplied parameters) 62 MSA: alignment strategy for 16S rrna Not using a typical de novo MSA approach

63 MSA: aligners for 16S rrna analysis What are these special aligners? Profile-based aligners scale linearly in time and smaller memory footprint Considerations for choosing a 16S rrna profile-based aligner Alignment quality, scalability for large datasets, speed, database, open-source 64 MSA: 16S rrna aligners and databases RDP (ribosomal database project) aligner uses secondary structure models (HMM) BUT...variable regions poorly aligned (limited number of models) Alignment length not fixed SILVA database project (spin-off of ARB) uses SINA aligner (SILVA Incremental Aligner) k-mer search/partial order alignment (POA) their alignment is 50,000 columns long (archaea and euk. 18S) Better alignment of variable regions Greengenes aligner puts less emphasis on secondary structure but their database alignment does uses kmer search (k=7mers) to id closest match in ref DB followed by blastn for pairwise alignment use NAST (nearest alignment space termination) to re-insert gaps in candidate sequence positional homology retained allows for fixed column count in MSA reference alignment 7,682 columns (much shorter than SILVA)

65 MSA: comparison of alignments to different databases 66 MSA: more on sequence alignments Bottle neck of the analysis & many options! 1. Find closest template for each candidate using: kmer searching blastn (alignment dependent) suffix tree searching 2. Make pairwise alignment between candidate and de-gapped template sequences using one of the following algorithms: Needleman-Wunsch (fast; global alignment) Gotoh (slowest; global alignment) blastn (Smith-Waterman; local alignment) 3. Re-insert gaps using NAST algorithm candidate sequence alignment and original template alignment same length

67 MSA and databases: Key message Quality of candidate sequence alignment is dependent on reference alignment 68 Clustering Build distance matrix calculate pairwise distances Cluster cluster sequences into OTUs Three clustering methods: Nearest neighbour: Each of the sequences within an OTU are at most X% distant from its most similar sequence in the OTU. Furthest neighbour All of the sequences within an OTU are at most X% distant from all of the other sequences within the OTU. Average neighbour happy medium between NN and FN

69 The distance matrix A massive file! 70 OTU classification sample output

71 Classification and abundance count per sample 72 Evaluation of output

73 Understanding diversity indices Alpha diversity 74 Beta diversity Characterizing the microbiota Classify sequences Measure alpha and beta-diversity Alpha diversity: how many taxa are in a sample rarefaction curves Richness (i.e. Chao1 and ACE) and diversity (Shannon and Simpson) indices Beta diversity: how many taxa are shared between samples PCoA heat maps Unifrac venn diagrams phylogenetic trees And much more...

75 Data exploration 76 Downstream analysis Sample diversity Similarity of samples, associations with metadata Procrustes analysis PCoA ANOSIM Metastats LEfSe Functional prediction PICRUst

77 Final thoughts No standardized approach! Garbage in = garbage out Data QC, filtering, and de-noising critical Increasing sequencing throughput will inevitably increase cpu performance requirement Does it make sense biologically? 78 THANKS! Any questions? You can find me at: natalie.knox@canada.ca Special credits to: Gary Van Domselaar Morag Graham Jessica Forbes