Non-Parametric Bayesian Population Dynamics Inference Philippe Lemey and Marc A. Suchard Department of Microbiology and Immunology K.U. Leuven, Belgium, and Departments of Biomathematics, Biostatistics and Human Genetics University of California, Los Angeles SISMID Lemey and Suchard (UCLA) SISMID 1 / 16 Review: Continuous-Time Coalescent Time measured in N generation units [ (k ) ] N=const u k Exp N = N(t) Pr(u k > t t k+1 )=e (k 2) R t+t k+1 t k+1 2 N N(u) du u k are not independent any more Constant population size Exponential growth Lemey and Suchard (UCLA) SISMID 2 / 16
Review: Continuous-Time Coalescent Time measured in N generation units [ (k ) ] N=const u k Exp N = N(t) Pr(u k > t t k+1 )=e (k 2) R t+t k+1 t k+1 2 N N(u) du u k are not independent any more Constant population size Exponential growth Lemey and Suchard (UCLA) SISMID 2 / 16 Review: Continuous-Time Coalescent Time t 1 t 2 t 3 t 4 u 2 u 3 u 4 Time measured in N generation units [ (k ) ] N=const u k Exp N = N(t) Pr(u k > t t k+1 )=e (k 2) R t+t k+1 t k+1 2 N N(u) du u k are not independent any more Constant population size Exponential growth Lemey and Suchard (UCLA) SISMID 2 / 16
Review: Continuous-Time Coalescent Time t 1 t 2 t 3 t 4 u 2 u 3 u 4 Time measured in N generation units [ (k ) ] N=const u k Exp 2 N = N(t) Pr(u k > t t k+1 )=e (k 2) R t+t k+1 t k+1 N N(u) du u k are not independent any more N(t) = N Constant population size Exponential growth N(t) = Ne 100t Lemey and Suchard (UCLA) SISMID 2 / 16 Sequence Data Population Model Parameters accggaaacgcgcgaaatttacacggggg accggaaacgcgcgaaatttacacggggg accggaaacgcgcgaaatttacacggggg Sequence Data accggaaacgcgcgaaatttacacggggg accggaaacgcgcgaaatttacacggggg Genealogy Pop. N(t) Dynamics Time More Formally (Bayesian Approach): Pr (G, Q, θ D) Pr (D G, Q) Pr (Q) Pr (G θ) Pr (θ) G -genealogywithbranchlengths Q -substitutionmatrix θ -populationgeneticsparameters D -sequencedata Pr (G θ) - Coalescent prior Lemey and Suchard (UCLA) SISMID 3 / 16
Piecewise Constant Demographic Model Number of Lineages: 2 3 4 5 Isochronous Data N e (t) =θ k for t k < t t k 1. u 2,...,u n are independent Pr (u k θ k )= k(k 1) 2θ k e k(k 1)uk 2θ k u 2 u 3 u 4 t 1 t 2 t 3 t 4 t 5 = 0 u 5 Pr (F θ) n k=2 Pr (u k θ k ) Equivalent to estimating exponential mean from one observation. Need further restrictions to estimate θ! Lemey and Suchard (UCLA) SISMID 4 / 16 Piecewise Constant Demographic Model Number of Lineages: 2 3 4 3 4 3 1 Heterochronous Data w 20,...,w njn are independent Pr(w k0 θ k )= n k0 (n k0 1) e n k0 (n k0 1)w k0 2θ k 2θ k Pr(w kj θ k )=e n kj (n kj 1)w kj 2θ k, j>0 u 2 w 20 u 3 u 4 u 5 w 30 w 40 w w 50 w 52 41 w 51 Pr (F θ) Q n Q jk k=2 j=0 Pr `w kj θ k Equivalent to estimating exponential mean from one observation. Need further restrictions to estimate θ! Lemey and Suchard (UCLA) SISMID 4 / 16
Piecewise Constant Demographic Model Number of Lineages: 2 3 4 3 4 3 1 Heterochronous Data w 20,...,w njn are independent Pr(w k0 θ k )= n k0 (n k0 1) e n k0 (n k0 1)w k0 2θ k 2θ k Pr(w kj θ k )=e n kj (n kj 1)w kj 2θ k, j>0 u 2 w 20 u 3 u 4 u 5 w 30 w 40 w w 50 w 52 41 w 51 Pr (F θ) Q n Q jk k=2 j=0 Pr `w kj θ k Equivalent to estimating exponential mean from one observation. Need further restrictions to estimate θ! Lemey and Suchard (UCLA) SISMID 4 / 16 Current Approaches Strimmer and Pybus (2001) Make N e (t) constant across some inter-coalescent times Group inter-coalescent intervals with AIC Drummond et al. (2005) Multiple change-point model with fixed number of change-points Change-points allowed only at Coalescent events Joint estimation of phylogenies and population dynamics Opgen-Rhein et al. (2005) Multiple change-point model with random number of change-points Change-points allowed anywhere in interval (0, t 1 ] Posterior is approximated with rjmcmc Lemey and Suchard (UCLA) SISMID 5 / 16
Current Approaches Strimmer and Pybus (2001) Make N e (t) constant across some inter-coalescent times Group inter-coalescent intervals with AIC Drummond et al. (2005) Multiple change-point model with fixed number of change-points Change-points allowed only at Coalescent events Joint estimation of phylogenies and population dynamics Opgen-Rhein et al. (2005) Multiple change-point model with random number of change-points Change-points allowed anywhere in interval (0, t 1 ] Posterior is approximated with rjmcmc Lemey and Suchard (UCLA) SISMID 5 / 16 Current Approaches Strimmer and Pybus (2001) Make N e (t) constant across some inter-coalescent times Group inter-coalescent intervals with AIC Drummond et al. (2005) Multiple change-point model with fixed number of change-points Change-points allowed only at Coalescent events Joint estimation of phylogenies and population dynamics Opgen-Rhein et al. (2005) Multiple change-point model with random number of change-points Change-points allowed anywhere in interval (0, t 1 ] Posterior is approximated with rjmcmc Lemey and Suchard (UCLA) SISMID 5 / 16
Smoothing Prior (GMRF approach) Go to the log scale x k = log θ k [ Pr(x ω) ω (n 2)/2 exp ω 2 n 2 k=1 ] 1 (x k+1 x k ) 2 d k d 1 d 2 d 3 d n 2 x1 x 2 x3 x 4 x n 2 x n 1 Weighting Schemes 1 Uniform: d k = 1 2 Time-Aware: d k = u k+1+u k 2 Pr(x, ω) =Pr(x ω)pr(ω) Pr(ω) ω α 1 e βω,diffusepriorwithα = 0.01, β = 0.01 Lemey and Suchard (UCLA) SISMID 6 / 16 MCMC Algorithm Pr (G, Q, x D) Pr (D G, Q) Pr (Q) Pr (G x) Pr (x) Updating Population Size Trajectory Use fast GMRF sampling (Rue et al., 2001, 2004) Draw ω from an arbitrary univariate proposal distribution Use Gaussian approximation of Pr(x ω, G) to propose x Jointly accept/reject (ω, x ) in Metropolis-Hastings step Object-Oriented Reality? BEAST = Bayesian Evolutionary Analysis Sampling Trees Pr(G x, D, Q) -sampledbybeast Pr(Q G, D) -sampledbybeast Lemey and Suchard (UCLA) SISMID 7 / 16
Simulation: Constant Population Size Classical Skyline Plot ORMCP Model Beast MCP Effective Population Size 0.1 1.0 5.0 0.1 1.0 5.0 0.1 1.0 5.0 1.5 1.0 0.5 0.0 1.5 1.0 0.5 0.0 1.5 1.0 0.5 0.0 Uniform GMRF Time Aware GMRF Beast GMRF Effective Population Size 0.1 1.0 5.0 0.1 1.0 5.0 0.1 1.0 5.0 1.5 1.0 0.5 0.0 1.5 1.0 0.5 0.0 1.5 1.0 0.5 0.0 Lemey and Suchard (UCLA) SISMID 8 / 16 Simulation: Exponential Growth Classical Skyline Plot ORMCP Model Beast MCP Effective Population Size 0.001 0.1 1.0 0.001 0.1 1.0 0.001 0.1 1.0 0.014 0.010 0.006 0.002 0.014 0.010 0.006 0.002 0.014 0.010 0.006 0.002 Uniform GMRF Time Aware GMRF Beast GMRF Effective Population Size 0.001 0.1 1.0 0.001 0.1 1.0 0.001 0.1 1.0 0.014 0.010 0.006 0.002 0.014 0.010 0.006 0.002 0.014 0.010 0.006 0.002 Lemey and Suchard (UCLA) SISMID 9 / 16
Simulation: Exponential Growth with Bottleneck Classical Skyline Plot ORMCP Model Beast MCP Effective Population Size 0.001 0.01 1.0 0.15 0.10 0.05 0.00 0.001 0.01 1.0 0.15 0.10 0.05 0.00 0.001 0.01 1.0 0.15 0.10 0.05 0.00 Uniform GMRF Time Aware GMRF Beast GMRF Effective Population Size 0.001 0.01 1.0 0.15 0.10 0.05 0.00 0.001 0.01 1.0 0.15 0.10 0.05 0.00 0.001 0.01 1.0 0.15 0.10 0.05 0.00 Lemey and Suchard (UCLA) SISMID 10 / 16 Accuracy in Simulations Percent Error = TMRCA 0 N e (t) N e (t) dt 100, (1) N e (t) Table: Percent error in simulations. We compare percent errors, defined in equation (1), for the Opgen-Rhein multiple change-point (ORMCP), uniform and fixed-tree time-aware Gaussian Markov random field (GMRF) smoothing, BEAST multiple change-point (MCP) model, and BEAST GMRF smoothing. Model Constant Exponential Bottleneck ORMCP 14.0 1.7 7.4 Uniform GMRF 32.8 1.5 5.9 Time-Aware GMRF 2.8 1.2 4.8 BEAST MCP 38.2 1.6 5.2 BEAST GMRF 1.7 1.0 5.4 Lemey and Suchard (UCLA) SISMID 11 / 16
GMRF Precision Prior Sensitivity ω -GMRFprecision,controls smoothness Usually Pr(ω D) is sensitive to perturbations of Pr(ω) Not in our Coalescent model! Density 0.0 0.2 0.4 0.6 0.8 GMRF Precision Prior and Posterior prior posterior 10 8 6 4 2 0 2 log τ Posterior of log τ 6 5 4 3 GMRF Precision Sensitiviy to Prior 1 2 5 10 100 Prior Mean of τ Lemey and Suchard (UCLA) SISMID 12 / 16 HCV Epidemics in Egypt Scaled Effective Pop. Size 10 10 2 10 3 10 4 10 5 Estimated Genealogy Unconstrained Fixed Tree GMRF 250 150 50 0 Time (Years Before 1993) Scaled Effective Pop. Size 10 10 2 10 3 10 4 10 5 10 10 2 10 3 10 4 10 5 BEAST GMRF 250 150 50 0 Constrained Fixed Tree GMRF PAT Starts 250 150 50 0 Time (Years Before 1993) Random population sample No sign of population sub-structure Parenteral antischistosomal therapy (PAT) was practiced from 1920s to 1980s Bayes Factor 12,880 in favor of constant population size prior to 1920 Lemey and Suchard (UCLA) SISMID 13 / 16
Influenza Intra-Season Population Dynamics 1999 2000 Season Scaled Effective Pop. Size 10 100 10 3 10 4 October 1 40 30 20 10 0 Time (Weeks before 03/01/00) 10 100 10 3 10 4 100 80 60 40 20 0 Time (Weeks before 03/01/00) 2001 2002 Season Scaled Effective Pop. Size 10 100 10 3 10 4 October 1 60 50 40 30 20 10 0 Time (Weeks before 03/21/02) 10 100 10 3 10 4 70 50 30 10 0 Time (Weeks before 03/21/02) New York state hemagglutinin sequences serially sampled (Ghedin et al., 2005) Lemey and Suchard (UCLA) SISMID 14 / 16 Influenza Intra-Season Population Dynamics 2001 2002 Season Scaled Effective Pop. Size 10 100 10 3 10 4 October 1 60 50 40 30 20 10 0 Time (Weeks before 03/21/02) 10 100 10 3 10 4 70 50 30 10 0 Time (Weeks before 03/21/02) 2003 2004 Season Scaled Effective Pop. Size 1 10 100 10 3 10 4 10 5 October 1 80 60 40 20 0 Time (Weeks before 02/05/04) 1 10 100 10 3 10 4 10 5 50 40 30 20 10 0 Time (Weeks before 02/05/04) New York state hemagglutinin sequences serially sampled (Ghedin et al., 2005) Lemey and Suchard (UCLA) SISMID 14 / 16
Summary Genealogies inform us about population size trajectories Prior restrictions are necessary for non(semi)-parametric estimation of N e (t) Smoothing can be imposed by GMRF priors Software: The Skyride Implemented as a Coalescent prior in BEAST Exploits approximate Gibbs sampling Faster convergence? Better mixing? Reference: Minin,BloomquistandSuchard(2008)Molecular Biology & Evolution, 25,1459 1471. Lemey and Suchard (UCLA) SISMID 15 / 16 Active Ideas: GMRFs are Highly Generalizable Hierarchical Modeling Flu genes display similar (not equal) dynamics Incorporate multiple loci simultaneously Pool information for statistical power No need for strict equality Introducing Covariates Augment field at fixed observation times Formal statistical testing for: External factors (environment, drug tx) Population dynamics (bottle-necks, growth) Lemey and Suchard (UCLA) SISMID 16 / 16