
The lineage model
A Bayesian approach to inferring community structure and evolutionary history from whole-genome metagenomic data
Jack O'Brien, Bowdoin College
with Daniel Falush and Xavier Didelot
Cambridge, UK - March 2014

What's the problem?
Suppose we want to focus in on a single species using shotgun metagenomic data from several samples. But...
- the species may have evolved,
- the new variants are mixed across samples, and
- there will be inevitable errors.
How can we infer the community representation of the variants within each of the samples?
"Haplotype phasing with an unknown number of haplotypes." - Mihai Pop, yesterday

Metagenomic data comes in two distinct flavors:
- Amplicon sequencing samples a single conserved gene.
- Whole-genome shotgun sequencing samples DNA from across genomes.
[Figure: fraction of total reads.]

Overview
- The problem
- The lineage model itself
- Examples
[Figure: inferred result, with a legend for pools 1 to 7.]

Ecoevolutionary dynamics for E. coli

IMPORTANT USE CASES
- human microbiomes
- malaria infections
- phytoplankton

BASIC SCIENTIFIC QUESTIONS
1. How do we infer the evolutionary history of the population? (What does the tree look like?)
2. How do we infer the ecological structure of the samples? (How are the taxa mixed together?)
Unfortunately... it is not possible to infer these directly, since
- there is not enough information in individual reads to infer the tree,
- assemblers often assume that an organism is clonal, and
- the reads may be drawn from samples that are mixtures of different taxa,
...so we employ a Bayesian approach.

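Bayesian phylogenetics: a statistical model for sequence data
- $Y$ are the sequences observed at the tips.
- $\theta = (\Lambda, T)$, where $T$ is the underlying tree and $\Lambda$ specifies the mutation model.
- Likelihood by pruning:
\[ P(Y \mid T, \Lambda) = \prod_{t_i \in T} \sum_{s_j \in S} P(s_j)\, P(s_i \mid s_j, t_i, \Lambda) \]
[Figure: a tree with branch length $t_i$ and tip states A, G, A.]
- Specify prior distributions for $T$ and $\Lambda$, and infer the posterior $P(T, \Lambda \mid D)$ for the observed data $D$ (here, $D = Y$).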
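The pruning recursion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a Jukes-Cantor substitution model with a single rate stands in for the mutation model, the root distribution is uniform, and the three-tip tree (tips A, G, A) with its branch lengths is made up. The function names and the nested-list tree encoding are hypothetical, not taken from the talk.

```python
import math

STATES = "ACGT"

def jc_transition(t, rate=1.0):
    """Jukes-Cantor probabilities P(child state | parent state) over branch length t.

    Assumption: Jukes-Cantor is used purely as a stand-in mutation model.
    """
    e = math.exp(-4.0 * rate * t / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if a == b else diff for b in range(4)] for a in range(4)]

def partial_likelihood(node):
    """Return [P(tip data below node | node state = s) for s in STATES].

    A tip is a one-letter string; an internal node is a list of
    (child, branch_length) pairs. This encoding is hypothetical.
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    result = [1.0] * 4
    for child, t in node:
        child_partial = partial_likelihood(child)
        P = jc_transition(t)
        for a in range(4):
            # Sum over the child's possible states, then multiply across children.
            result[a] *= sum(P[a][b] * child_partial[b] for b in range(4))
    return result

def site_likelihood(tree):
    """Likelihood of one site, averaging the root partials over a uniform prior."""
    return sum(0.25 * x for x in partial_likelihood(tree))

# Made-up tree: ((A:0.1, G:0.1):0.05, A:0.15)
tree = [([("A", 0.1), ("G", 0.1)], 0.05), ("A", 0.15)]
print(site_likelihood(tree))
```

Each internal node is visited once, so the cost grows linearly with the number of branches rather than with the number of possible ancestral state assignments.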

So where were we?
Suppose we sample from N = 6 locations... what does the data look like?

[Figure: short reads aligned against the reference genome CTGTAGAGGCTGTCCCTGGTCGGTTGTACAGCAACTGTAG, labelled "A read", "Read counts", and "REFERENCE GENOME".]

Read count data
For sample $i$, we align all the reads against the appropriate reference genome. Suppose G is the $j$th variant in the genome. We observe read counts $d_{ij} = (r_{ij}, n_{ij}) = (4, 4)$.
The full data has all $N$ samples and $M$ variants:
\[ D = [\, d_{ij} : i = 1, \dots, N;\ j = 1, \dots, M \,] \]
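As a rough illustration of how such a table of read counts might be assembled, the sketch below tallies reference and non-reference base calls per sample and variant. It assumes that $r_{ij}$ counts reads matching the reference allele and $n_{ij}$ counts the rest; the function name, the pileup-as-string input format, and the base calls themselves are all made up for the example.

```python
def count_reads(pileups, reference_alleles):
    """Build D[i][j] = (r_ij, n_ij) from per-sample base calls.

    pileups[i][j] is a string of the bases observed in sample i at variant j
    (hypothetical input format); reference_alleles[j] is the reference base.
    """
    D = []
    for sample in pileups:
        row = []
        for bases, ref in zip(sample, reference_alleles):
            r = sum(1 for b in bases if b == ref)  # reads matching the reference
            row.append((r, len(bases) - r))        # remaining reads are non-reference
        D.append(row)
    return D

# Toy input: N = 2 samples, M = 3 variants (made-up base calls).
reference_alleles = ["G", "C", "T"]
pileups = [
    ["GGGGAAAA", "CCCCC", "TTA"],
    ["GGG", "CAAAA", ""],
]
D = count_reads(pileups, reference_alleles)
print(D)  # [[(4, 4), (5, 0), (2, 1)], [(3, 0), (1, 4), (0, 0)]]
```

A site with no coverage simply yields (0, 0), which contributes nothing to a binomial likelihood.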

The lineage model jointly infers phylogeny and sample composition.
[Figure: pools by color (Pool 1, Pool 2, Pool 3); panels "(Assumed) reality" and "The lineage approximation".]
IDEA
- Each sample is a mixture of different lineages.
- Each lineage defines an unobserved haplotype.
- The lineages are connected by a phylogeny.
- The lineage mixture specifies the read count distribution.

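$P(\Theta \mid D)$, where $\Theta = (L, S, T, K, \xi)$
- Lineages $L$: 0 0 1 0 1 1 1 0 1 0 1 1
- Mixtures $S$ (each row sums to 1):
  0.25 0.01 0.1 0.09 0.55
  0.33 0.24 0.01 0.02 0.4
  0.1 0.2 0.27 0.3 0.13
- Tree $T$ [Figure: a tree with branch length $t_i$ and tip states A, G, A.]
- $K$: number of lineages
- $\xi$: error rate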

Nuts and bolts
- There are $i = 1, \dots, N$ samples and $j = 1, \dots, M$ variants.
- For convenience, we assume biallelic variation.
- We assume there are $K$ lineages, i.e. the tree has $K$ tips.
- Each lineage $L_k$ defines a haplotype of allele states: $L_k = [\, l_{kj} : j = 1, \dots, M \,]$, for example $[0\ 1\ 0\ 0\ 1]$.
- $S = [s_{ik}]$ gives the proportion of lineage $k$ in sample $i$.
- Together, $L$ and $S$ give the expected proportion of read counts in sample $i$ at variant $j$:
\[ p_{ij} = \sum_{k=1}^{K} s_{ik}\, l_{kj}. \]
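The expected proportions are a matrix product of the mixtures and the lineage haplotypes. A minimal sketch, assuming L is stored as a K-by-M list of 0/1 rows and S as an N-by-K list of rows that each sum to 1; the function name and the toy values are hypothetical.

```python
def expected_proportions(S, L):
    """Return P with P[i][j] = sum over k of S[i][k] * L[k][j]."""
    K = len(L)
    M = len(L[0])
    for row in S:
        if len(row) != K:
            raise ValueError("each row of S needs one entry per lineage")
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each row of S must sum to 1")
    return [[sum(row[k] * L[k][j] for k in range(K)) for j in range(M)]
            for row in S]

# Toy values: K = 3 lineages, M = 5 variants, N = 2 samples.
L = [
    [0, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
]
S = [
    [0.5, 0.3, 0.2],
    [0.1, 0.0, 0.9],
]
P = expected_proportions(S, L)
for row in P:
    print([round(p, 3) for p in row])
# [0.3, 0.8, 0.2, 0.0, 0.7]
# [0.0, 0.1, 0.9, 0.0, 1.0]
```

Because each row of S sums to 1 and L is binary, every entry of P lies between 0 and 1, as a binomial probability must.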

Likelihood
Absent any sequencing errors, reference read counts within sample $i$ at variant $j$ arise i.i.d. with probability $p_{ij}$. This gives a binomial likelihood for $d_{ij}$:
\[ P(d_{ij} \mid L, S) = \binom{r_{ij} + n_{ij}}{n_{ij}}\, p_{ij}^{\,r_{ij}} (1 - p_{ij})^{n_{ij}}. \]
Assuming that sites and samples are independent, the full data likelihood is
\[ P(D \mid L, S) = \prod_{i=1}^{N} \prod_{j=1}^{M} P(d_{ij} \mid L, S). \]
We can include the effect of sequencing errors by replacing $p_{ij}$ with an adjusted probability that depends on a parameter $\xi$.
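The binomial likelihood is easiest to handle on the log scale. The sketch below sums the per-site log-probabilities over samples and variants. The slide does not specify how the error parameter enters, so the symmetric flip used in adjust_for_error is purely an assumption for illustration, as are the function names and the toy counts.

```python
import math

def adjust_for_error(p, xi):
    """Assumed error model (not from the talk): each read's allele is
    misread with probability xi, symmetrically in both directions."""
    return p * (1.0 - xi) + (1.0 - p) * xi

def log_binomial(r, n, p):
    """log of C(r + n, n) * p**r * (1 - p)**n, treating 0 * log(0) as 0."""
    if (p == 0.0 and r > 0) or (p == 1.0 and n > 0):
        return -math.inf  # the observed counts are impossible under p
    log_coef = math.lgamma(r + n + 1) - math.lgamma(r + 1) - math.lgamma(n + 1)
    term_r = r * math.log(p) if r > 0 else 0.0
    term_n = n * math.log(1.0 - p) if n > 0 else 0.0
    return log_coef + term_r + term_n

def log_likelihood(D, P, xi=0.0):
    """Sum of log P(d_ij | L, S) over samples i and variants j."""
    total = 0.0
    for d_row, p_row in zip(D, P):
        for (r, n), p in zip(d_row, p_row):
            total += log_binomial(r, n, adjust_for_error(p, xi))
    return total

# Toy data: one sample, two variants.
D = [[(4, 4), (5, 0)]]
P = [[0.5, 1.0]]
print(log_likelihood(D, P))           # no sequencing error
print(log_likelihood(D, P, xi=0.01))  # with the assumed error model
```

With no error term, any non-reference read at a site where the expected proportion is exactly 1 gives a likelihood of zero, which is one practical reason to include an error parameter.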

Bayesian inference
$T$ specifies the tree; $\mu, \lambda$ are parameters.
\[ P(L, T, S, \lambda, \mu, \xi \mid D) \propto P(D \mid L, S, \xi)\, P(L \mid T, \lambda)\, P(S)\, P(T)\, P(\lambda)\, P(\mu)\, P(\xi) \]
- $P(D \mid L, S, \xi)$ is the binomial likelihood;
- $P(L \mid T, \lambda)$ is a standard phylogenetic likelihood;
- $P(T)$ is a coalescent;
- $P(S)$ is $N$ realizations from $\mathrm{Dirichlet}(\mathbf{1}_K)$;
- $P(\lambda)$, $P(\xi)$ are simple.
Inference is via Markov chain Monte Carlo. A harmonic mean estimator is used to estimate Bayes factors and so to choose $K$.
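The harmonic mean step can be sketched as follows. Given log-likelihood values evaluated at posterior draws from an MCMC run with a fixed K, the estimator takes the harmonic mean of the likelihoods as the marginal likelihood, and the log Bayes factor between two values of K is the difference of the two log estimates. The function names and the log-likelihood traces below are made up for illustration; this is not the authors' implementation.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_marginal_harmonic_mean(log_likelihoods):
    """log of T / sum_t (1 / P(D | theta_t)) over T posterior draws theta_t."""
    if not log_likelihoods:
        raise ValueError("need at least one posterior draw")
    T = len(log_likelihoods)
    return math.log(T) - log_sum_exp([-ll for ll in log_likelihoods])

def log_bayes_factor(loglik_a, loglik_b):
    """Log Bayes factor of model a over model b from two MCMC traces."""
    return (log_marginal_harmonic_mean(loglik_a)
            - log_marginal_harmonic_mean(loglik_b))

# Made-up log-likelihood traces for runs with K = 3 and K = 4.
trace_k3 = [-105.2, -103.9, -104.4, -104.8]
trace_k4 = [-101.7, -102.3, -101.1, -102.0]
print(log_bayes_factor(trace_k4, trace_k3))  # positive values favor K = 4
```

Working with negated log-likelihoods inside log_sum_exp avoids underflow, since the raw likelihoods of genome-scale data are far too small to represent directly.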

Simulations from the model
[Figure: simulated and inferred panels, with a legend for pools 1 to 7.]

Simulations from the model
[Figure: pool proportion against pool number (1 to 7) for Lineages 1 to 6, showing simulated, inferred, and combined values; and inferred lineage (1 to 5) against simulated lineage (1 to 6), shaded by % similarity (30% to 90%).]

Simulations - reads and SNPs
[Figure, two panels: fraction of concordant SNPs (0.5 to 1.0) against number of reads (2, 5, 10, 50) and against number of SNPs (25, 100, 250, 1000).
Legend, Mix : Err : SNP: 1.5 : 0.05 : 250; 4 : 0.15 : 250; 4 : 0.00 : 25; 10 : 0.00 : 1000.
Legend, Mix : Err : Reads: 1.5 : 0.15 : 5; 1.5 : 0.00 : 5; 10 : 0.00 : 2; 4 : 0.05 : 10.]

Simulations: island coalescent
[Figure: locations across pools 1 to 5 of haplotype 00001 (252 SNPs), haplotype 00010 (82 SNPs), and haplotype 01001 (22 SNPs).]

A meromictic Antarctic lake (Lauro et al., The ISME Journal. 2011)

(ibid.) An important green sulphur bacterium, Chlorobium limicola, is photosynthetic and stratifies across the lake's layers.

Lineage results on C. limicola
- 5 samples: 3 from lake, 1 with missing metadata
- Distinct lake, ocean and deep-water variants present.
[Figure legend, Sample: Ace 12m, Ace 23m, ?, Open ocean, Newcomb Bay.]

Plasmodium falciparum
Most severe malaria is caused by Plasmodium falciparum, a single-celled protist. Manske et al. (2013) showed widespread mixture in clinical infections.

The parasite requires two organelles, a mitochondrion and an apicoplast, and each cell has only a single copy of each.

Malaria infections in northern Ghana
[Figure: results for samples 1 to 20.]
- We find mixture levels consistent with the nuclear genome.
- Surprisingly, there's a lot of structure in the mixtures.

Where do we go from here...
- Recombination?
- Multiple species
- Use experimental design
- Better K estimation - reversible jump?
- Genotyping and de Bruijn?
- Cancer
- Paired-end information
- Better likelihood - multinomial-Dirichlet

Summary
WHAT'S BEEN DONE:
- We can take read count data from metagenomic samples and produce estimates of phylogeny and community composition.
- In simulations, this model works well.
- In real examples, our results appear consistent with other methods, and seem to go beyond them in some places.
- There are a lot of possible extensions to more involved experimental contexts, better statistical methods, and computational improvements.

References
- J. O'Brien et al. A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data. Genetics (forthcoming).
- Manske et al. 2013. Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing. Nature, in press.
- F. Lauro et al. 2011. An integrative study of a meromictic lake ecosystem in Antarctica. ISME J. 5(1).

Acknowledgements

Thanks for listening!