The lineage model: The problem, Lineage model, Examples

The lineage model: A Bayesian approach to inferring community structure and evolutionary history from whole-genome metagenomic data
Jack O'Brien, Bowdoin College, with Daniel Falush and Xavier Didelot
Cambridge, UK - March 2014

What's the problem?
Suppose we want to focus in on a single species using shotgun metagenomic data from several samples. But the species may have evolved, the new variants are mixed across samples, and there will inevitably be errors. How can we infer the community representation of the variants within each of the samples?
"Haplotype phasing with an unknown number of haplotypes." - Mihai Pop, yesterday

Metagenomic data comes in two distinct flavors:
Amplicon sequencing samples a single conserved gene.
Whole-genome shotgun sequencing samples DNA from across genomes.
[Figure: fraction of total reads]

Overview: The problem; The lineage model itself; Examples
[Figure: inferred tree, pools 1-7]

Ecoevolutionary dynamics for E. coli

IMPORTANT USE CASES: human microbiomes, malaria infections, phytoplankton

BASIC SCIENTIFIC QUESTIONS
1. How do we infer the evolutionary history of the population? (What does the tree look like?)
2. How do we infer the ecological structure of the samples? (How are the taxa mixed together?)
Unfortunately, it is not possible to infer these directly, since there is not enough information in individual reads to infer the tree, assemblers often assume that an organism is clonal, and the reads may be drawn from samples that are mixtures of different taxa, so we employ a Bayesian approach.

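Bayesian phylogenetics: A statistical model for sequence data
$Y$ are sequences observed at the tips.
$\theta = (\Lambda, T)$, where $T$ is the underlying tree and $\Lambda$ specifies the mutation model.
Likelihood by pruning: $P(Y \mid T, \Lambda) = \prod_{t_i \in T} \sum_{s_j \in S} P(s_j)\, P(s_i \mid s_j, t_i, \lambda)$
[Figure: tree with branch length $t_i$ and tips A, G, A]
Specify prior distributions for $T$ and $\Lambda$, and infer $P(T, \Lambda \mid D)$, where $D$ denotes the observed data (here $Y$).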
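
The pruning likelihood on this slide can be illustrated with a minimal sketch. It is not the authors' code: the Jukes-Cantor (JC69) substitution model, the uniform root distribution, the nested-list tree encoding, and the function names are all assumptions made for illustration. The example tree uses tips A, G, A as in the slide's figure, with made-up branch lengths.

```python
import math

STATES = "ACGT"

def jc69_prob(a, b, t, rate=1.0):
    """Jukes-Cantor probability of ending in state b after time t, starting in a."""
    e = math.exp(-4.0 * rate * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, rate=1.0):
    """Return {state: P(tips below node | node is in state)}.

    A node is either a tip (a one-letter string) or a list of
    (child, branch_length) pairs (hypothetical encoding).
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    out = {s: 1.0 for s in STATES}
    for child, t in node:
        below = partial(child, rate)
        for s in STATES:
            # Sum over the unobserved state at the child end of the branch.
            out[s] *= sum(jc69_prob(s, c, t, rate) * below[c] for c in STATES)
    return out

def site_likelihood(root, rate=1.0):
    """Likelihood of one alignment column, with a uniform prior on the root state."""
    return sum(0.25 * p for p in partial(root, rate).values())

# Tips A, G, A; branch lengths are illustrative only.
tree = [([("A", 0.1), ("G", 0.1)], 0.05), ("A", 0.15)]
print(site_likelihood(tree))
```

Working from the tips toward the root means each internal node is visited once, so the cost grows linearly with the number of tips rather than exponentially with the number of internal state assignments.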

So where were we? Suppose we sample from N = 6 locations... what does the data look like?

Read count data
[Figure: reads CCTGGTGCGTGTC, TCCCTGGTCGGT, GTCCCTGGTCGG aligned against the reference genome CTGTAGAGGCTGTCCCTGGTCGGTTGTACAGCAACTGTAG; labels: A read, Read counts (C G C C G)]
For sample $i$, we align all the reads against the appropriate reference genome.
Suppose G is the $j$th variant in the genome. We observe read counts $d_{ij} = (r_{ij}, n_{ij}) = (4, 4)$.
The full data has all $N$ samples and $M$ variants: $D = [d_{ij} : i = 1, \ldots, N;\ j = 1, \ldots, M]$
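
As a rough illustration of the data layout, the sketch below builds $D$ as an $N \times M$ nested list of $(r, n)$ pairs. It assumes that $r$ counts reads matching the reference allele and $n$ counts the remaining reads at that site, which is one reading of the slides. The pileup strings, reference bases, and function names are hypothetical.

```python
def count_reads(bases, ref_base):
    """Return (r, n): reads matching the reference base, and reads that do not."""
    r = sum(1 for b in bases if b == ref_base)
    return (r, len(bases) - r)

def build_counts(pileups, ref_bases):
    """pileups[i][j] holds the bases observed in sample i at variant site j."""
    return [
        [count_reads(site, ref) for site, ref in zip(sample, ref_bases)]
        for sample in pileups
    ]

# Two samples, two variant sites; made-up bases for illustration.
pileups = [["GGGA", "CCCC"], ["GAAA", "CCTT"]]
ref_bases = "GC"
D = build_counts(pileups, ref_bases)
print(D)  # [[(3, 1), (4, 0)], [(1, 3), (2, 2)]]
```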

The lineage model jointly infers phylogeny and sample composition.
[Figure: pools by color (Pool 1, Pool 2, Pool 3); panels: (Assumed) reality, The lineage approximation]
IDEA
Each sample is a mixture of different lineages.
Each lineage defines an unobserved haplotype.
The lineages are connected by a phylogeny.
The lineage mixture specifies the read count distribution.

$P(\Theta \mid D)$, here $\Theta = (L, S, T, K, \xi)$
Lineages - L: 0 0 1 0 1 1 1 0 1 0 1 1
Mixtures - S:
0.25 0.01 0.1 0.09 0.55
0.33 0.24 0.01 0.02 0.4
0.1 0.2 0.27 0.3 0.13
Tree - T [Figure: tree with branch length $t_i$ and tips A, G, A]
K - number of lineages
ξ - error rate

Nuts and bolts
There are $i = 1, \ldots, N$ samples and $j = 1, \ldots, M$ variants.
For convenience, we assume biallelic variation.
We assume there are $K$ lineages, i.e. the tree has $K$ tips.
Each lineage $L_k$ defines a haplotype of allele states: $L_k = [l_{kj} : j = 1, \ldots, M] = [0\ 1\ 0\ 0\ 1]$
$S = [s_{ik}]$ gives the proportion of lineage $k$ in sample $i$.
Together $L$ and $S$ give the expected proportion of read counts in sample $i$ at variant $j$: $p_{ij} = \sum_{k=1}^{K} s_{ik}\, l_{kj}$.
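
The mixture formula $p_{ij} = \sum_k s_{ik} l_{kj}$ is a matrix product of $S$ ($N \times K$) with $L$ ($K \times M$). A minimal sketch follows. The first haplotype is the one shown on the slide; the second haplotype, the mixture proportions, and the function name are made up for illustration.

```python
def expected_proportions(S, L):
    """Return P with P[i][j] = sum over k of S[i][k] * L[k][j]."""
    K = len(L)
    M = len(L[0])
    return [
        [sum(s_i[k] * L[k][j] for k in range(K)) for j in range(M)]
        for s_i in S
    ]

L = [
    [0, 1, 0, 0, 1],  # haplotype from the slide
    [1, 1, 0, 1, 0],  # hypothetical second lineage
]
S = [
    [0.25, 0.75],  # sample 1: 25% lineage 1, 75% lineage 2
    [0.60, 0.40],  # sample 2
]
P = expected_proportions(S, L)
for row in P:
    print(row)
```

A site where every lineage carries allele 1 gives $p_{ij} = 1$ in every sample, and a site where no lineage carries it gives $p_{ij} = 0$, so only sites that differ between lineages are informative about $S$.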

Likelihood
Absent any sequencing errors, reference read counts within sample $i$ at variant $j$ arise i.i.d. with probability $p_{ij}$.
This gives a binomial likelihood for $d_{ij}$: $P(d_{ij} \mid L, S) = \binom{r_{ij} + n_{ij}}{n_{ij}}\, p_{ij}^{r_{ij}} (1 - p_{ij})^{n_{ij}}$.
Assuming that sites and samples are independent, the full data likelihood is $P(D \mid L, S) = \prod_{i=1}^{N} \prod_{j=1}^{M} P(d_{ij} \mid L, S)$.
We can include the effect of sequencing errors by replacing $p_{ij}$ with an error-adjusted value governed by a parameter ξ.
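
A minimal sketch of the binomial log-likelihood follows, working in log space to avoid underflow when $N \times M$ is large. The slides do not give the form of the error adjustment, so the symmetric flip used here, $p(1 - \xi) + (1 - p)\xi$, is an assumption, as are the function names.

```python
import math

def log_binom_coeff(r, n):
    """log of the binomial coefficient (r + n choose n)."""
    return math.lgamma(r + n + 1) - math.lgamma(r + 1) - math.lgamma(n + 1)

def adjust_for_error(p, xi):
    """ASSUMED error model: each read is miscalled with probability xi."""
    return p * (1.0 - xi) + (1.0 - p) * xi

def log_likelihood(D, P, xi=0.0):
    """Sum of binomial log-likelihoods over samples i and variants j.

    D[i][j] is the pair (r_ij, n_ij); P[i][j] is the expected proportion p_ij.
    """
    total = 0.0
    for d_row, p_row in zip(D, P):
        for (r, n), p in zip(d_row, p_row):
            q = adjust_for_error(p, xi)
            # Counts that are impossible under q have zero likelihood.
            if (q == 0.0 and r > 0) or (q == 1.0 and n > 0):
                return -math.inf
            term = log_binom_coeff(r, n)
            if r > 0:
                term += r * math.log(q)
            if n > 0:
                term += n * math.log(1.0 - q)
            total += term
    return total

# One sample, one variant, with the slide's counts (4, 4) and p = 0.5.
print(log_likelihood([[(4, 4)]], [[0.5]]))
```

With ξ = 0, a single discordant read at a site where $p_{ij}$ is exactly 0 or 1 drives the likelihood to zero, which is one practical reason to include an error parameter.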

Bayesian inference
$T$ specifies the tree; $\mu, \lambda$ are parameters.
$P(L, T, S, \lambda, \mu, \xi \mid D) \propto P(D \mid L, S, \xi)\, P(L \mid T, \lambda)\, P(S)\, P(T)\, P(\lambda)\, P(\mu)\, P(\xi)$
$P(D \mid L, S, \xi)$ is the binomial likelihood;
$P(L \mid T, \lambda)$ is a standard phylogenetic likelihood;
$P(T)$ is a coalescent;
$P(S)$ is $N$ realizations from $\mathrm{Dirichlet}(\mathbf{1}_K)$;
$P(\lambda), P(\xi)$ are simple.
Inference via Markov chain Monte Carlo.
Harmonic mean estimator to estimate Bayes factor to find $K$.
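
The harmonic mean estimator mentioned above approximates the marginal likelihood by the harmonic mean of the likelihood values visited by the MCMC chain. Here is a minimal sketch, computed in log space. The input lists stand in for per-iteration log-likelihoods from two runs with different $K$; they are made-up numbers, not results, and the function names are hypothetical.

```python
import math

def log_harmonic_mean(log_liks):
    """log of the harmonic mean of likelihoods, given their logs.

    log HM = log(T) - logsumexp(-log_liks), where T is the number of samples.
    """
    neg = [-x for x in log_liks]
    m = max(neg)
    lse = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_liks)) - lse

def log_bayes_factor(log_liks_a, log_liks_b):
    """Estimated log Bayes factor of model a over model b."""
    return log_harmonic_mean(log_liks_a) - log_harmonic_mean(log_liks_b)

# Illustrative values only: log-likelihood traces for K = 3 and K = 4.
trace_k3 = [-1052.3, -1049.8, -1051.1, -1050.4]
trace_k4 = [-1047.9, -1046.2, -1048.5, -1047.0]
print(log_bayes_factor(trace_k4, trace_k3))
```

This estimator is simple to compute from existing MCMC output, but it is dominated by the lowest likelihood values in the trace and is known to have high variance, so differences between adjacent values of $K$ should be interpreted cautiously.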

Simulations from the model
[Figure: panels Simulated and Inferred; legend: Pools 1-7]
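
A minimal sketch of how read counts might be simulated from the model, assuming the lineage haplotypes $L$ are given rather than evolved along a tree, every site has the same read depth, and there are no sequencing errors. None of these simplifications come from the slides, and the function names are hypothetical. Each row of $S$ is drawn from $\mathrm{Dirichlet}(\mathbf{1}_K)$ by normalizing independent Exp(1) variables.

```python
import random

def sample_mixtures(N, K, rng):
    """Draw N rows from Dirichlet(1, ..., 1) by normalizing Exp(1) variables."""
    S = []
    for _ in range(N):
        g = [rng.expovariate(1.0) for _ in range(K)]
        total = sum(g)
        S.append([x / total for x in g])
    return S

def simulate_counts(S, L, depth, rng):
    """Simulate (r, n) at each sample and site with a fixed read depth."""
    K, M = len(L), len(L[0])
    D = []
    for s_i in S:
        row = []
        for j in range(M):
            p = sum(s_i[k] * L[k][j] for k in range(K))
            # Binomial draw as a sum of independent Bernoulli(p) reads.
            r = sum(1 for _ in range(depth) if rng.random() < p)
            row.append((r, depth - r))
        D.append(row)
    return D

rng = random.Random(2014)  # fixed seed for reproducibility
L = [[0, 1, 0, 0, 1], [1, 1, 0, 1, 0], [0, 0, 0, 1, 1]]  # made-up haplotypes
S = sample_mixtures(N=7, K=3, rng=rng)
D = simulate_counts(S, L, depth=10, rng=rng)
print(D[0])
```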

Simulations from the model
[Figure: pool proportion by pool number (1-7) for Lineages 1-6, shown as Simulated, Inferred, and Combined; % similarity between simulated and inferred lineages (scale 30% to 90%)]

Simulations - reads and SNPs
[Figure, panel "Reads": fraction of concordant SNPs (0.5-1.0) against number of reads (2, 5, 10, 50); legend Mix : Err : SNP = 1.5 : 0.05 : 250, 4 : 0.15 : 250, 4 : 0.00 : 25, 10 : 0.00 : 1000]
[Figure, panel "SNPs": fraction of concordant SNPs (0.5-1.0) against number of SNPs (25, 100, 250, 1000); legend Mix : Err : Reads = 1.5 : 0.15 : 5, 1.5 : 0.00 : 5, 10 : 0.00 : 2, 4 : 0.05 : 10]

Simulations: island coalescent
Locations of haplotype 00001: 252 SNPs
Locations of haplotype 00010: 82 SNPs
Locations of haplotype 01001: 22 SNPs
[Figure: Pools 1-5]

A meromictic Antarctic lake (Lauro et al., The ISME Journal. 2011)

(ibid.) An important green sulphur bacterial species, Chlorobium limicola, a photosynthetic bacterium, stratifies across the lake's layers.

Lineage results on C. limicola
5 samples: 3 from lake, 1 with missing metadata.
Distinct lake, ocean and deep-water variants present.
[Figure legend: Sample: Ace 12m, Ace 23m, ?, Open ocean, Newcomb Bay]

Plasmodium falciparum
Most severe malaria is caused by Plasmodium falciparum, a single-celled protist. Manske et al. (2013) showed widespread mixture in clinical infections.

The parasite requires two organelles, a mitochondrion and an apicoplast, and each cell has only a single copy of each.

Malaria infections in northern Ghana
[Figure: lineage mixtures for Samples 1-20]
We find mixture levels consistent with the nuclear genome. Surprisingly, there's a lot of structure in the mixtures.

Where do we go from here...
Recombination?
Multiple species
Use experimental design
Better K estimation - reversible jump?
Genotyping and de Bruijn?
Cancer
Paired-end information
Better likelihood - multinomial-Dirichlet

Summary
WHAT'S BEEN DONE:
We can take read count data from metagenomic samples and produce estimates of phylogeny and community composition.
In simulations, this model works well.
In real examples, our results appear consistent with other methods, and seem to go beyond them in some places.
There are a lot of possible extensions to more involved experimental contexts, better statistical methods, and computational improvements.

References
J. O'Brien et al. A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data. Genetics (forthcoming).
Manske et al. 2013. Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing. Nature (in press).
F. Lauro et al. 2011. An integrative study of a meromictic lake ecosystem in Antarctica. ISME J. 5(1).

Acknowledgements

Thanks for listening!