Bayes Factors, posterior predictives, short intro to RJMCMC. Thermodynamic Integration

1 Bayes Factors, posterior predictives, short intro to RJMCMC. Thermodynamic Integration. Dave Campbell, 2016

2 Bayesian Statistical Inference: $P(\theta \mid Y) \propto P(Y \mid \theta)\,\pi(\theta)$. Once you have posterior samples you can compute the predictive distribution of future observations: $P(Y_{\text{new}} \mid \theta, y_{\text{old}})$

3 To do this you sample a $\theta^*$ from $P(\theta \mid Y)$ (sample 1 value from your collection of posterior samples). Generate simulated data from the likelihood: $P(Y_{\text{new}} \mid \theta^*)$. Repeat for a large sample of $\theta^*$ from $P(\theta \mid Y)$ to get at the posterior predictive distribution.
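The recipe on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course code: the likelihood is assumed to be N(θ, 1), and `posterior_samples` is a stand-in for real MCMC output (here drawn directly from a N(2.0, 0.3²) distribution). The names `posterior_predictive` and `simulate_normal` are hypothetical.

```python
import random

def posterior_predictive(posterior_samples, simulate, n_draws, rng):
    """Draw n_draws values of Y_new: pick theta* from the posterior sample,
    then simulate one Y_new from the likelihood P(Y_new | theta*)."""
    draws = []
    for _ in range(n_draws):
        theta_star = rng.choice(posterior_samples)   # one posterior draw
        draws.append(simulate(theta_star, rng))      # one simulated observation
    return draws

rng = random.Random(1)

# Stand-in for MCMC output (assumption): posterior of a normal mean, N(2.0, 0.3^2).
posterior_samples = [rng.gauss(2.0, 0.3) for _ in range(5000)]

# Assumed likelihood: Y_new | theta ~ N(theta, 1).
def simulate_normal(theta, rng):
    return rng.gauss(theta, 1.0)

y_new = posterior_predictive(posterior_samples, simulate_normal, 20000, rng)

mean = sum(y_new) / len(y_new)
var = sum((y - mean) ** 2 for y in y_new) / (len(y_new) - 1)
print(f"predictive mean ~ {mean:.2f}, predictive variance ~ {var:.2f}")
```

In this toy setting the predictive variance should come out near 1 + 0.3² = 1.09: the spread of the posterior is carried through to the simulated data on top of the sampling noise.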

4 Posterior predictive distribution: No need to use asymptotic normal assumptions or a single point and variance estimate for $\theta^*$. Any shaped distribution on $P(\theta \mid Y)$ naturally feeds its entire distribution through to the data generating process!

5 Obtaining $P(Y_{\text{new}} \mid \theta, y_{\text{old}})$ is related to obtaining a set of fake data samples for parametric bootstrap, except that the distribution assumption on the parameters doesn't require asymptotic arguments.

6 Uses: Another diagnostic tool: obtain a sample from $P(Y_{\text{new}} \mid \theta, y_{\text{old}})$ and see if it is similar to the data. Use the posterior predictive distribution for sequential experimental design: choose the new covariate points that optimize some criterion.
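One common way to make "see if it is similar to the data" concrete is a posterior predictive p-value for a chosen summary statistic. The sketch below is an assumption-laden illustration: the data `y_obs` are made up, the model is N(θ, 1) with a flat prior (so the posterior is N(ȳ, 1/n), sampled directly instead of by MCMC), and the helper names are hypothetical.

```python
import math
import random

def posterior_predictive_pvalue(y_obs, posterior_samples, simulate_dataset,
                                statistic, n_rep, rng):
    """Fraction of replicated data sets whose statistic is at least as large
    as the statistic of the observed data."""
    t_obs = statistic(y_obs)
    exceed = 0
    for _ in range(n_rep):
        theta_star = rng.choice(posterior_samples)
        y_rep = simulate_dataset(theta_star, len(y_obs), rng)
        if statistic(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_rep

rng = random.Random(3)

# Hypothetical data with one suspicious value (6.0).
y_obs = [0.2, -0.5, 1.1, 0.3, -0.8, 0.6, -0.1, 0.9,
         -1.2, 0.4, 0.0, -0.3, 0.7, -0.6, 6.0]
n = len(y_obs)
y_bar = sum(y_obs) / n

# Assumed model: y_i ~ N(theta, 1), flat prior, so theta | y ~ N(y_bar, 1/n).
posterior_samples = [rng.gauss(y_bar, 1.0 / math.sqrt(n)) for _ in range(4000)]

def simulate_dataset(theta, size, rng):
    return [rng.gauss(theta, 1.0) for _ in range(size)]

p_max = posterior_predictive_pvalue(y_obs, posterior_samples,
                                    simulate_dataset, max, 2000, rng)
print(f"posterior predictive p-value for the maximum: {p_max:.3f}")
```

A p-value near 0 or 1 says the fitted model rarely reproduces that feature of the data; here the replicated maxima almost never reach 6.0, which flags the N(θ, 1) model as a poor description of the largest observation.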

7 Hypothesis testing; model comparison

8 Ultimately we want inference on $P(M \mid Y)$. But computing the marginal likelihood is difficult: $P(M \mid Y) = \frac{\int_\Theta P(Y \mid \theta, M)\,\pi(\theta)\,\pi(M)\,d\theta}{P(Y)}$

9 Usually Bayesians make model decisions through Bayes Factors: $B_{12}(y) = \frac{w_1(y)}{w_2(y)}$, where for each model $w(y) = \int_\Theta \pi(\theta)\,f(y \mid \theta)\,d\theta$

10 Bayes Factor interpretation: $B_{12}(y) = \frac{w_1(y)}{w_2(y)}$, where for each model $w(y) = \int_\Theta \pi(\theta)\,f(y \mid \theta)\,d\theta$

11 The odds ratio for two models: posterior odds = Bayes Factor × prior odds. Uniform prior odds across models implies that posterior odds = Bayes Factor.

12 posterior odds = Bayes Factor × prior odds. So the Bayes factor is the amount of evidence for one model compared to another. BF = the change in odds when moving from the prior to the posterior.
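A small worked example may help fix the relation posterior odds = Bayes Factor × prior odds. The models and data below are assumptions chosen so both marginal likelihoods have closed forms: M1 fixes a coin's success probability at 0.5, M2 puts a Uniform(0, 1) prior on it, and the data are a hypothetical 7 successes in 10 trials.

```python
from math import comb

def marginal_fixed(k, n, theta=0.5):
    """w_1(y): marginal likelihood when theta is fixed (no free parameter)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def marginal_uniform(k, n):
    """w_2(y): binomial likelihood integrated over a Uniform(0, 1) prior.
    The integral has the closed form 1 / (n + 1)."""
    return 1.0 / (n + 1)

def posterior_odds(bayes_factor, prior_odds=1.0):
    """posterior odds = Bayes factor x prior odds."""
    return bayes_factor * prior_odds

k, n = 7, 10                                   # hypothetical data
b12 = marginal_fixed(k, n) / marginal_uniform(k, n)
odds = posterior_odds(b12)                     # uniform prior odds
prob_m1 = odds / (1.0 + odds)
print(f"B12 = {b12:.4f}, P(M1 | y) = {prob_m1:.4f}")
```

With these numbers $B_{12} = (120/1024) \times 11 \approx 1.29$, so the data shift the odds only slightly toward the fixed-coin model.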

13 Recall: $P(\theta \mid Y) = \frac{P(y \mid \theta)\,p(\theta)}{P(y)}$, with $P(Y) = \int P(y \mid \theta)\,p(\theta)\,d\theta$

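14 Newton & Raftery (1994): $P(\theta \mid Y) = \frac{P(y \mid \theta)\,P(\theta)}{P(Y)}$, so $\frac{P(\theta \mid Y)}{P(y \mid \theta)} = \frac{P(\theta)}{P(y)}$ and $\int \frac{P(\theta \mid Y)\,P(Y)}{P(y \mid \theta)}\,d\theta = \int P(\theta)\,d\theta = 1$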

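15 Newton & Raftery (1994): $\int \frac{P(\theta \mid Y)\,P(Y)}{P(y \mid \theta)}\,d\theta = 1$, so $E_{P(\theta \mid Y)}\left[\frac{1}{P(y \mid \theta)}\right] = \frac{1}{P(Y)}$. They estimated $P(Y)$ by $\hat{P}(Y) = \left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(y \mid \theta_i)}\right]^{-1}$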

16 Newton & Raftery (1994): $\hat{P}(Y) = \left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(y \mid \theta_i)}\right]^{-1}$. Compute this by calculating the likelihood for each value $\theta_i$ that was obtained from the posterior sampling step.

17 Newton & Raftery (1994): The harmonic mean estimator is very very very very very sensitive to outliers with extremely small values of $P(y \mid \theta)$. But it is asymptotically unbiased. Estimate $P(Y)$ by $\hat{P}(Y) = \left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(y \mid \theta_i)}\right]^{-1}$
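The estimator is short to code, and doing so makes the outlier sensitivity easy to see. The sketch below works on the log scale with a log-sum-exp step for numerical stability (an implementation choice, not something stated on the slides). The function name `log_harmonic_mean` and the input values are hypothetical.

```python
import math

def log_harmonic_mean(log_liks):
    """log of [ (1/N) * sum_i 1 / P(y | theta_i) ]^{-1}, computed stably.
    log_liks[i] = log P(y | theta_i) for posterior draws theta_i."""
    n = len(log_liks)
    neg = [-ll for ll in log_liks]            # log of 1 / P(y | theta_i)
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(v - m) for v in neg))  # log-sum-exp
    return -(log_sum - math.log(n))

# A case that can be checked by hand: likelihoods 0.5 and 0.25
# give [ (1/2) * (2 + 4) ]^{-1} = 1/3.
small = math.exp(log_harmonic_mean([math.log(0.5), math.log(0.25)]))

# Sensitivity: 1000 draws at log-likelihood -10, versus the same sample
# with a single draw replaced by one at -60.
stable = log_harmonic_mean([-10.0] * 1000)
with_outlier = log_harmonic_mean([-10.0] * 999 + [-60.0])
print(small, stable, with_outlier)
```

One draw out of 1000 with a tiny likelihood moves the estimated log marginal likelihood from -10 to about -53.1, because the sum of inverse likelihoods is dominated by its largest term.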

18 Calderhead and Girolami (2009) showed that the harmonic mean estimator can be massively biased for finite samples.

19 Thermodynamic Integration. Friel, N., Pettitt, A. "Marginal likelihood estimation via power posteriors." Journal of the Royal Statistical Society: Series B 70 (3). Calderhead, B., Girolami, M. "Estimating Bayes Factors Via Thermodynamic Integration and Population MCMC." Computational Statistics and Data Analysis 53 (2009).

20 In Parallel Tempering we sample from $P_m(\theta \mid Y) = \frac{P(y \mid \theta)^{\beta_m}\,P(\theta)}{P_m(y)}$. But we can get the marginal likelihood via: $\log p(y) = \int_0^1 \int \log p(y \mid \theta)\,P_m(\theta \mid Y)\,d\theta\,d\beta = \int_0^1 E_{P_m(\theta \mid Y)}\{\log p(y \mid \theta)\}\,d\beta$
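The slide states the identity without derivation. A short sketch of why it holds, assuming a proper prior (so the normalising constant at $\beta = 0$ is 1) and that differentiation under the integral sign is allowed. The symbol $z(\beta)$ is introduced here for the normalising constant of the power posterior at temperature $\beta$ and does not appear on the slides.

```latex
\begin{align*}
z(\beta) &= \int p(y \mid \theta)^{\beta}\, p(\theta)\, d\theta,
  \qquad z(0) = 1, \quad z(1) = p(y), \\
\frac{d}{d\beta} \log z(\beta)
  &= \frac{1}{z(\beta)} \int \log p(y \mid \theta)\; p(y \mid \theta)^{\beta}\, p(\theta)\, d\theta
   = E_{\theta \mid y, \beta}\{\log p(y \mid \theta)\}, \\
\log p(y) &= \log z(1) - \log z(0)
   = \int_0^1 E_{\theta \mid y, \beta}\{\log p(y \mid \theta)\}\, d\beta .
\end{align*}
```

The expectation in the last two lines is taken under the power posterior $p(y \mid \theta)^{\beta} p(\theta) / z(\beta)$, which is exactly what each tempered chain samples.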

21 $\log p(y) = \int_0^1 E_{P_m(\theta \mid Y)}\{\log p(y \mid \theta)\}\,d\beta$. Compute via 1-dimensional quadrature over the temperature! $\log p(y) \approx \frac{1}{2}\sum_m (\beta_m - \beta_{m-1})\,[E_m + E_{m-1}]$, where $E_m = E_{P_m(\theta \mid Y)}\{\log p(y \mid \theta)\}$

22 $\log p(y) \approx \frac{1}{2}\sum_m (\beta_m - \beta_{m-1})\,[E_m + E_{m-1}]$, where $E_m = E_{P_m(\theta \mid Y)}\{\log p(y \mid \theta)\}$. To compute log(marginal likelihoods) all we need is to define a good grid for temperatures. Calderhead and Girolami (2009) suggest β = (seq(from = 1, to = N) / N)^5
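The trapezoid rule over a fifth-power temperature grid can be tried on a model where the answer is known. Everything model-specific below is an assumption for illustration: y_i ~ N(θ, 1) with prior θ ~ N(0, 1), for which the power posterior is itself normal, so each E_m is estimated from direct draws, not from a tempered MCMC chain. The grid also includes β = 0 explicitly so the quadrature covers the whole interval [0, 1]. Function names are hypothetical.

```python
import math
import random

def temperature_grid(n_temps, power=5):
    """beta_i = (i / N)^power for i = 0..N: beta_0 = 0 (prior), beta_N = 1 (posterior)."""
    return [(i / n_temps) ** power for i in range(n_temps + 1)]

def thermodynamic_integration(betas, expected_log_liks):
    """Trapezoid rule: 0.5 * sum_m (beta_m - beta_{m-1}) * (E_m + E_{m-1})."""
    return 0.5 * sum(
        (betas[m] - betas[m - 1]) * (expected_log_liks[m] + expected_log_liks[m - 1])
        for m in range(1, len(betas))
    )

# Toy model (assumption): y_i ~ N(theta, 1), theta ~ N(0, 1).
rng = random.Random(42)
y = [rng.gauss(1.5, 1.0) for _ in range(20)]
n, s1, s2 = len(y), sum(y), sum(v * v for v in y)

def log_lik(theta):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * (s2 - 2 * theta * s1 + n * theta**2)

def expected_log_lik(beta, n_draws=2000):
    # For this conjugate model the power posterior is N(beta * s1 / prec, 1 / prec),
    # so we draw from it directly instead of running a tempered chain.
    prec = n * beta + 1.0
    mean, sd = beta * s1 / prec, math.sqrt(1.0 / prec)
    return sum(log_lik(rng.gauss(mean, sd)) for _ in range(n_draws)) / n_draws

betas = temperature_grid(30)
e_m = [expected_log_lik(b) for b in betas]
log_ml_ti = thermodynamic_integration(betas, e_m)

# Closed-form log marginal likelihood of the toy model, for comparison.
log_ml_exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
                - 0.5 * (s2 - s1**2 / (1 + n)))
print(f"TI estimate: {log_ml_ti:.3f}, exact: {log_ml_exact:.3f}")
```

The fifth-power grid packs most temperatures near β = 0, where the expected log-likelihood changes fastest. A uniform grid with the same number of points would typically give a noticeably worse trapezoid estimate on this example.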

23 Parallel Tempering To the Extreme! RStudio plots for the Galaxy data set (3 groups, density of one of the mean parameters vs temperature)

24 Parallel Tempering densities That dip just before temperature β = 1 is real. It is caused by the introduction of new modes

25 Compare the 3 group Galaxy to the 6 group Galaxy. Show plots of mean density vs temperature. $B_{12}(y) = \frac{w_1(y)}{w_2(y)}$. 25,000 iterations with 30 parallel chains.

26 Now, back to RStudio to compare the Galaxy data with k=3 groups vs k=6 groups. (the result: there is decisive evidence that the k=3 groups model is better)

27 Alternative to Bayes Factors: RJMCMC

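28 MODEL POSTERIOR PROBABILITY. Likelihood: $P(Y \mid \theta_j, M_j)$. Parameter prior: $P(\theta_j \mid M_j)$. Model prior: $P(M_j \mid \Omega)$ for $M_j \in \Omega$. The marginal posterior probability of a model is helpful when the answer is not clear: $P(M_j \mid Y, \Omega) = \frac{\int P(Y \mid \theta_j, M_j, \Omega)\,P(\theta_j \mid M_j, \Omega)\,P(M_j \mid \Omega)\,d\theta_j}{P(Y)}$, that is, $P(M_j \mid Y, \Omega) = \int P(\theta_j, M_j \mid Y, \Omega)\,d\theta_j$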

29 Our goal is to get $P(M_j \mid Y, \Omega) = \int P(\theta_j, M_j \mid Y, \Omega)\,d\theta_j$ in a single MCMC chain, even if $\Omega$ contains a lot of models. We need simulation methods that sample across models.

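30 REVERSIBLE JUMP MCMC. Biometrika (1995), 82, 4, pp. We can avoid extensive MCMC for each model and instead sample from $P(M_j \mid Y, \Omega)$ directly! We just adjust MCMC so at each iteration we: 1. Sample j, i.e. choose a model $M_{\text{new}}$. 2. Then propose a $\theta_{\text{new}}$ from $M_{\text{new}}$. 3. Keep $M_{\text{new}}$ and $\theta_{\text{new}}$ with probability $\alpha = \min\left(\frac{P(Y \mid \theta_{\text{new}}, M_{\text{new}})\,P(\theta_{\text{new}}, M_{\text{new}})\,P_{\text{new}}(v_{\text{new}})}{P(Y \mid \theta_{\text{old}}, M_{\text{old}})\,P(\theta_{\text{old}}, M_{\text{old}})\,P_{\text{old}}(v_{\text{old}})}\,J_{\text{old,new}},\ 1\right)$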

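31 We use auxiliary variables v to augment the dimension space so that, after augmentation, dim($M_{\text{old}}$) = dim($M_{\text{new}}$): $\alpha = \min\left(\frac{P(Y \mid \theta_{\text{new}}, M_{\text{new}})\,P(\theta_{\text{new}}, M_{\text{new}})\,P_{\text{new}}(v_{\text{new}})}{P(Y \mid \theta_{\text{old}}, M_{\text{old}})\,P(\theta_{\text{old}}, M_{\text{old}})\,P_{\text{old}}(v_{\text{old}})}\,J_{\text{old,new}},\ 1\right)$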

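32 JACOBIAN. We need the Jacobian for the transformation: $J = \left|\begin{array}{cccc} \frac{\partial \theta_{\text{new},1}}{\partial \theta_{\text{old},1}} & \cdots & \frac{\partial \theta_{\text{new},p_{\text{new}}}}{\partial \theta_{\text{old},1}} & \frac{\partial v_{\text{new}}}{\partial \theta_{\text{old},1}} \\ \vdots & \ddots & \vdots & \vdots \\ \frac{\partial \theta_{\text{new},1}}{\partial \theta_{\text{old},p_{\text{old}}}} & \cdots & \frac{\partial \theta_{\text{new},p_{\text{new}}}}{\partial \theta_{\text{old},p_{\text{old}}}} & \frac{\partial v_{\text{new}}}{\partial \theta_{\text{old},p_{\text{old}}}} \\ \frac{\partial \theta_{\text{new},1}}{\partial v_{\text{old}}} & \cdots & \frac{\partial \theta_{\text{new},p_{\text{new}}}}{\partial v_{\text{old}}} & \frac{\partial v_{\text{new}}}{\partial v_{\text{old}}} \end{array}\right|$. And the proposed value $\theta_{\text{new}}$ needs to allow the possibility of being accepted.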
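To see the three steps, the auxiliary variable and the Jacobian working together, here is a minimal reversible jump sketch on a toy pair of models chosen for this illustration (not from the slides): M1: y_i ~ N(0, 1) with no free parameter, and M2: y_i ~ N(μ, 1) with μ ~ N(0, 1). Moving from M1 to M2 draws an auxiliary v and sets μ = v, an identity map, so the Jacobian is 1. The within-model update uses the conjugate conditional, and the closed-form posterior model probability is computed only to check the sampler. All names are hypothetical.

```python
import math
import random

rng = random.Random(7)

# Toy setup (assumption): M1: y_i ~ N(0, 1);  M2: y_i ~ N(mu, 1), mu ~ N(0, 1).
y = [rng.gauss(0.5, 1.0) for _ in range(10)]
n, s1, s2 = len(y), sum(y), sum(v * v for v in y)
log_prior_model = {1: math.log(0.5), 2: math.log(0.5)}

def log_norm_pdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

def log_lik(mu):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * (s2 - 2 * mu * s1 + n * mu * mu)

def log_target(model, mu):
    """log of P(Y | theta, M) * P(theta | M) * P(M); M1 fixes the mean at 0."""
    if model == 1:
        return log_lik(0.0) + log_prior_model[1]
    return log_lik(mu) + log_norm_pdf(mu, 0.0, 1.0) + log_prior_model[2]

Q_MEAN, Q_SD = 0.0, 1.0   # distribution of the auxiliary variable v (a tuning choice)

def accept(log_alpha):
    return rng.random() < math.exp(min(0.0, log_alpha))

def rjmcmc(n_iter):
    model, mu, visits_m2 = 1, None, 0
    for _ in range(n_iter):
        if model == 1:
            # Birth: v ~ q, mu_new = v. The map is the identity, so |J| = 1.
            v = rng.gauss(Q_MEAN, Q_SD)
            log_alpha = (log_target(2, v) - log_target(1, None)
                         - log_norm_pdf(v, Q_MEAN, Q_SD))
            if accept(log_alpha):
                model, mu = 2, v
        else:
            # Death: the current mu plays the role of the auxiliary variable.
            log_alpha = (log_target(1, None) + log_norm_pdf(mu, Q_MEAN, Q_SD)
                         - log_target(2, mu))
            if accept(log_alpha):
                model, mu = 1, None
        if model == 2:
            # Within-model update; conjugacy gives mu | y ~ N(s1/(n+1), 1/(n+1)).
            mu = rng.gauss(s1 / (n + 1), math.sqrt(1.0 / (n + 1)))
            visits_m2 += 1
    return visits_m2 / n_iter

p_m2_rj = rjmcmc(50000)

# Closed-form check: log B21 = log w2 - log w1 for this conjugate pair.
log_b21 = -0.5 * math.log(1 + n) + s1 * s1 / (2 * (1 + n))
p_m2_exact = 1.0 / (1.0 + math.exp(-log_b21))
print(f"RJMCMC P(M2 | y) ~ {p_m2_rj:.3f}, exact: {p_m2_exact:.3f}")
```

The fraction of iterations spent in M2 estimates $P(M_2 \mid Y)$ directly, with no marginal likelihood computed along the way. Changing `Q_MEAN` and `Q_SD` leaves the target unchanged but alters how often jumps are accepted, which is the efficiency issue raised later in the deck.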

33 POTENTIAL PROBLEMS: $M_1$ and $M_2$ have different parameter dimensions. Often model parameters don't have an obvious transformation allowing an intuitive transition. The last accepted value might be from a different model and may require a large jump in the parameter space.

34 $M_1: Y \sim N(\beta_{1,0} + \beta_{1,1} X,\ \sigma_1^2)$. $M_2: Y \sim N(\beta_{2,0} + \beta_{2,1} X + \beta_{2,2} X^2,\ \sigma_2^2)$. Moving from $M_1$ to $M_2$ will require moving $\beta_{1,0}$ quite far to get to a reasonable location for $\beta_{2,0}$.

35 $M_1$: Galaxy with 3 Gaussians. $M_2$: Galaxy with 4 Gaussians. Moving from $M_1$ to $M_2$ can be done by dividing one of the current Gaussians. Moving from $M_2$ to $M_1$ can be done by merging 2 components.

36 RJMCMC: Beautiful in principle, nasty in practice. Needs: a transition function between parameters in multiple model spaces. Efficiency depends completely on this functional choice and the distribution for the auxiliary variables. Works well when we can use a birth / death process (change-point analysis).
