Bayesian Model Comparison
1 BS2 Statistical Inference, Lecture 11, Hilary Term 2009. February 26, 2009.
2 An integral of the form
$$I = \int_a^b e^{-\lambda g(y)}\, h(y)\, dy,$$
where $h(y)$ and $g(y)$ are smooth and $g$ has a local minimum at $y^* \in (a, b)$, can be approximated as
$$I = e^{-\lambda g(y^*)}\, h(y^*)\, \sqrt{\frac{2\pi}{\lambda g''(y^*)}}\left\{1 + O\!\left(\frac{1}{\lambda}\right)\right\}.$$
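A quick numerical illustration of this first-order approximation (a minimal sketch; the choices $g(y) = y^2/2$, $h(y) = 1/(1+y^2)$ and the interval $(-5, 5)$ are ours, not from the slides). The relative error should shrink roughly like $1/\lambda$:

```python
# Numerical check of the first-order Laplace approximation.
# Assumed example: g(y) = y^2/2 (minimum at y* = 0, g'' = 1), h(y) = 1/(1+y^2).
import numpy as np
from scipy.integrate import quad

def g(y): return 0.5 * y**2           # smooth, local minimum at y* = 0
def h(y): return 1.0 / (1.0 + y**2)   # smooth, positive weight

ystar, g2 = 0.0, 1.0                  # minimizer and g''(y*)

for lam in [5, 20, 100]:
    exact, _ = quad(lambda y: np.exp(-lam * g(y)) * h(y), -5, 5)
    laplace = np.exp(-lam * g(ystar)) * h(ystar) * np.sqrt(2 * np.pi / (lam * g2))
    print(f"lambda={lam:4d}  exact={exact:.6f}  laplace={laplace:.6f}  "
          f"rel.err={abs(exact - laplace) / exact:.2e}")
```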
3 A more accurate approximation is
$$I = e^{-\lambda g_\lambda(\tilde y_\lambda)} \sqrt{\frac{2\pi}{\lambda g_\lambda''(\tilde y_\lambda)}}\left\{1 + \frac{5\rho_3^2 - 3\rho_4}{24\lambda} + O(\lambda^{-2})\right\},$$
where now $\tilde y_\lambda$ minimizes
$$g_\lambda(y) = g(y) - \lambda^{-1}\log h(y),$$
and
$$\rho_3 = \frac{g_\lambda^{(3)}(\tilde y_\lambda)}{\{g_\lambda''(\tilde y_\lambda)\}^{3/2}}, \qquad \rho_4 = \frac{g_\lambda^{(4)}(\tilde y_\lambda)}{\{g_\lambda''(\tilde y_\lambda)\}^{2}}.$$
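A sanity check of the second-order term (our example, not from the slides): with $g(t) = t - \log t$ and $h \equiv 1$ we have $\int_0^\infty e^{-\lambda g(t)}\, dt = \Gamma(\lambda+1)/\lambda^{\lambda+1}$, and at $\tilde t = 1$ one gets $g'' = 1$, $\rho_3 = -2$, $\rho_4 = 6$, so the correction factor is $1 + 1/(12\lambda)$, recovering Stirling's series:

```python
# Numerical check (our example): the second-order Laplace formula applied to
# g(t) = t - log t, h = 1 reproduces Stirling's series for log Gamma.
import numpy as np
from scipy.special import gammaln

rho3, rho4 = -2.0, 6.0                       # g''' = -2, g'''' = 6 at t~ = 1
for lam in [2.0, 5.0, 20.0]:
    corr = 1 + (5 * rho3**2 - 3 * rho4) / (24 * lam)   # = 1 + 1/(12*lam)
    # log Gamma(lam+1) = (lam+1) log lam + log I, with I from the Laplace formula
    approx = (lam + 1) * np.log(lam) - lam + 0.5 * np.log(2 * np.pi / lam) + np.log(corr)
    print(lam, gammaln(lam + 1), approx)
```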
4 It holds approximately for large $n$ that the posterior distribution of $\theta$ is
$$\theta \mid X \sim N_d\{\hat\theta,\, j_n(\hat\theta)^{-1}\} = N_d\{\hat\theta,\, j(\hat\theta)^{-1}/n\}.$$
A more accurate approximation is obtained from the Laplace approximation to be
$$\pi^*(\theta) = \frac{\exp\{l(\theta)\}\pi(\theta)}{\int_\Theta \exp\{l(\theta)\}\pi(\theta)\, d\theta} = (2\pi/n)^{-d/2} \exp\{l(\theta) - l(\hat\theta)\}\, \frac{\pi(\theta)}{\pi(\hat\theta)}\, |j(\hat\theta)|^{1/2}\{1 + O(n^{-1})\}.$$
Note in particular the expression for the normalization constant
$$\int_\Theta f(x \mid \theta)\pi(\theta)\, d\theta = (2\pi/n)^{d/2}\, L(\hat\theta)\pi(\hat\theta)\, |j(\hat\theta)|^{-1/2}\{1 + O(n^{-1})\}.$$
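A numerical check of the normalization-constant formula (our example): for $n$ Bernoulli trials with $s$ successes and a uniform prior, the exact marginal likelihood is $B(s+1, n-s+1)$, while $\hat\theta = s/n$ and $j_n(\hat\theta) = n/\{\hat\theta(1-\hat\theta)\}$:

```python
# Laplace approximation (d = 1) to the Bernoulli marginal likelihood with a
# uniform prior, against the exact value log B(s+1, n-s+1). Our example.
import numpy as np
from scipy.special import betaln

def log_marginal_laplace(s, n):
    theta = s / n                                  # MLE
    loglik = s * np.log(theta) + (n - s) * np.log(1 - theta)
    j_n = n / (theta * (1 - theta))                # observed information j_n(theta_hat)
    # (2*pi)^{1/2} * L(theta_hat) * pi(theta_hat) * j_n^{-1/2}, with pi = 1
    return 0.5 * np.log(2 * np.pi) + loglik - 0.5 * np.log(j_n)

for n, s in [(20, 7), (200, 70), (2000, 700)]:
    print(n, betaln(s + 1, n - s + 1), log_marginal_laplace(s, n))
```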
5 We consider a number of competing models $M_j$, $j = 1, \dots, m$, for data $X$; for example, $M_1$ might specify that the expectation of a component $X_i$ of $X$ depends linearly on covariates $Y_i$, an alternative $M_2$ may specify that it has a quadratic dependence, whereas a third model $M_3$ might specify that the expectation does not depend on $Y_i$ at all. Associated with each of these models are parameter spaces $\Theta_j$ and prior distributions $\pi_j(\theta_j)$, as well as prior model probabilities $\pi_j$ for model $M_j$ being the correct description of affairs.
6 The posterior probability for model $M_j$ would then satisfy
$$\pi_j^* \propto \pi_j \int_{\Theta_j} f(x \mid \theta_j, M_j)\,\pi_j(\theta_j)\, d\theta_j,$$
i.e. it will as usual be proportional to the product of the marginal or integrated likelihood $L_j$ of model $M_j$ with the prior model probability $\pi_j$, where
$$L_j = f(x \mid M_j) = \int_{\Theta_j} f(x \mid \theta_j, M_j)\,\pi_j(\theta_j)\, d\theta_j.$$
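Turning marginal likelihoods and prior model probabilities into posterior model probabilities is a one-line normalization; a small sketch with made-up numbers (all values here are illustrative, not from the slides):

```python
# Posterior model probabilities pi_j* proportional to L_j * pi_j,
# computed stably on the log scale. All numbers are made up.
import numpy as np

log_L = np.array([-105.2, -103.8, -110.4])   # log marginal likelihoods L_j
prior = np.array([1/3, 1/3, 1/3])            # prior model probabilities pi_j
log_post = log_L + np.log(prior)
post = np.exp(log_post - log_post.max())     # subtract the max to avoid underflow
post /= post.sum()
print(post.round(3))
```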
7 Comparing two models yields
$$\frac{\pi_j^*}{\pi_k^*} = \frac{f(x \mid M_j)}{f(x \mid M_k)} \cdot \frac{\pi_j}{\pi_k} = \frac{\int_{\Theta_j} f(x \mid \theta_j, M_j)\,\pi_j(\theta_j)\, d\theta_j}{\int_{\Theta_k} f(x \mid \theta_k, M_k)\,\pi_k(\theta_k)\, d\theta_k} \cdot \frac{\pi_j}{\pi_k}.$$
The factor
$$B_{jk} = \frac{f(x \mid M_j)}{f(x \mid M_k)} = \frac{\int_{\Theta_j} f(x \mid \theta_j, M_j)\,\pi_j(\theta_j)\, d\theta_j}{\int_{\Theta_k} f(x \mid \theta_k, M_k)\,\pi_k(\theta_k)\, d\theta_k} = \frac{L_j}{L_k}$$
is known as the Bayes factor in favour of model $j$ over model $k$. Note that if the Bayesian model is taken to its full consequence, this is nothing but the usual likelihood ratio.
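As a concrete illustration (our example, not from the slides): for $n$ Bernoulli trials with $s$ successes, let $M_1$ fix the success probability at $1/2$ (so $L_1 = 2^{-n}$, with no free parameter) and let $M_2$ put a uniform prior on it (so $L_2 = B(s+1, n-s+1)$):

```python
# Hypothetical example: Bayes factor for M1 (theta = 1/2 fixed) against
# M2 (theta ~ Uniform(0,1)) with s successes in n Bernoulli trials.
import numpy as np
from scipy.special import betaln

def log_bayes_factor(s, n):
    log_L1 = n * np.log(0.5)              # marginal likelihood of M1
    log_L2 = betaln(s + 1, n - s + 1)     # integrated likelihood of M2
    return log_L1 - log_L2

print(np.exp(log_bayes_factor(s=12, n=20)))     # B_12 for a small sample
print(np.exp(log_bayes_factor(s=120, n=200)))   # same proportion, more data
```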
8 Recall that $\Sigma$ follows an inverse Wishart distribution if $K = \Sigma^{-1}$ follows a Wishart distribution, formally expressed as
$$\Sigma \sim \mathrm{IW}_d(\delta, \Psi) \iff K = \Sigma^{-1} \sim W_d(\delta + d - 1, \Psi^{-1}),$$
i.e. if the density of $K$ has the form
$$f(K \mid \delta, \Psi) \propto (\det K)^{\delta/2 - 1}\, e^{-\operatorname{tr}(\Psi K)/2}.$$
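A Monte Carlo sanity check of this correspondence (a sketch; we assume scipy's `invwishart` parametrization with degrees of freedom $\nu = \delta + d - 1$, so that inverting draws of $\Sigma$ should give Wishart draws with mean $\nu\,\Psi^{-1}$):

```python
# Sanity check of the IW/Wishart correspondence, assuming scipy's
# parametrization invwishart(df=nu, scale=Psi) with nu = delta + d - 1.
import numpy as np
from scipy.stats import invwishart

d, delta = 3, 5
Psi = np.eye(d)
nu = delta + d - 1                       # scipy degrees of freedom (assumed mapping)
rng = np.random.default_rng(0)
Sigmas = invwishart(df=nu, scale=Psi).rvs(size=2000, random_state=rng)
Ks = np.linalg.inv(Sigmas)               # K = Sigma^{-1} should be Wishart
print(Ks.mean(axis=0))                   # should be close to nu * Psi^{-1} = 7 * I
```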
9 The inverse Wishart distributions form a conjugate family for $\Sigma$. If the prior distribution of $\Sigma$ is $\mathrm{IW}_d(\delta, \Psi)$ and $W \mid \Sigma \sim W_d(n, \Sigma)$, the posterior density of $K$ is
$$f(K \mid \delta, \Psi, W) \propto (\det K)^{n/2}\, e^{-\operatorname{tr}(KW)/2}\, (\det K)^{\delta/2 - 1}\, e^{-\operatorname{tr}(\Psi K)/2} = (\det K)^{(n+\delta)/2 - 1}\, e^{-\operatorname{tr}\{(\Psi + W)K\}/2},$$
and hence the posterior distribution is simply $\mathrm{IW}_d(\delta + n, \Psi + W) = \mathrm{IW}_d(\delta^*, \Psi^*)$.
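Since conjugacy means the update only shifts hyperparameters, a sketch is essentially one line (the function name and the simulated scatter matrix are ours):

```python
# Conjugate update from the slide: IW_d(delta, Psi) prior plus
# W | Sigma ~ W_d(n, Sigma) gives the IW_d(delta + n, Psi + W) posterior.
import numpy as np

def iw_posterior(delta, Psi, W, n):
    """Return the posterior hyperparameters (delta*, Psi*)."""
    return delta + n, Psi + W

d, n = 3, 50
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W = X.T @ X                               # scatter matrix, W ~ W_d(n, I)
delta_star, Psi_star = iw_posterior(delta=5, Psi=np.eye(d), W=W, n=n)
print(delta_star)
print(Psi_star.round(2))
```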
10 To calculate the Bayes factor for independence we need the full form of the Wishart density for $K$:
$$f_d(K \mid \delta, \Psi) = c(d, \delta)^{-1}\, (\det \Psi)^{(\delta + d - 1)/2}\, (\det K)^{\delta/2 - 1}\, e^{-\operatorname{tr}(\Psi K)/2}.$$
The constant $c(d, \delta)$ is
$$c(d, \delta) = 2^{(\delta + d - 1)d/2}\, \pi^{d(d-1)/4} \prod_{i=1}^d \Gamma\{(\delta + d - i)/2\}.$$
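In practice one works with $\log c(d, \delta)$ to avoid overflow; a minimal sketch of the expression above using `scipy.special.gammaln`:

```python
# Log of the normalizing constant c(d, delta) of the Wishart density above.
import numpy as np
from scipy.special import gammaln

def log_c(d, delta):
    i = np.arange(1, d + 1)
    return ((delta + d - 1) * d / 2) * np.log(2) \
        + (d * (d - 1) / 4) * np.log(np.pi) \
        + gammaln((delta + d - i) / 2).sum()
```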
11 The marginal density of $W$ becomes
$$\begin{aligned}
f(W \mid \delta, \Psi) &= \int f(W \mid n, K)\, f(K \mid \delta, \Psi)\, dK\\
&= (\det W)^{(n-d-1)/2}\, c(d, n)^{-1} c(d, \delta)^{-1} (\det \Psi)^{(\delta+d-1)/2} \int (\det K)^{(n+\delta)/2 - 1}\, e^{-\operatorname{tr}\{K(W + \Psi)\}/2}\, dK\\
&= (\det W)^{(n-d-1)/2}\, c(d, n)^{-1} c(d, \delta)^{-1} (\det \Psi)^{(\delta+d-1)/2}\, \{\det(\Psi + W)\}^{-(\delta+n+d-1)/2}\, c(d, n + \delta)\\
&= (\det W)^{(n-d-1)/2}\, \frac{(\det \Psi)^{(\delta+d-1)/2}}{\{\det(\Psi + W)\}^{(\delta+n+d-1)/2}} \cdot \frac{c(d, n+\delta)}{c(d, n)\, c(d, \delta)}.
\end{aligned}$$
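The last line can be coded directly; a sketch reusing `log_c` from the block above (function names are ours):

```python
# Log marginal density log f(W | delta, Psi) for W | Sigma ~ W_d(n, Sigma)
# under the IW_d(delta, Psi) prior; reuses log_c from the sketch above.
import numpy as np

def log_f_W(W, n, delta, Psi):
    d = W.shape[0]
    _, ld_W = np.linalg.slogdet(W)        # log det W
    _, ld_Psi = np.linalg.slogdet(Psi)    # log det Psi
    _, ld_PW = np.linalg.slogdet(Psi + W)
    return (0.5 * (n - d - 1) * ld_W
            + 0.5 * (delta + d - 1) * ld_Psi
            - 0.5 * (delta + n + d - 1) * ld_PW
            + log_c(d, n + delta) - log_c(d, n) - log_c(d, delta))
```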
12 Consider now alternative models: $M_2$ with $\Sigma$ arbitrary, and $M_1$ with $\Sigma$ of block diagonal form, i.e. with
$$\Sigma = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}.$$
If the associated prior distributions are, for $M_2$, that $\Sigma \sim \mathrm{IW}_d(\delta, I_d)$, and, for $M_1$, that $\Sigma_{11} \sim \mathrm{IW}_r(\delta, I_r)$ and $\Sigma_{22} \sim \mathrm{IW}_s(\delta, I_s)$, we can now calculate the Bayes factor.
13 We get
$$\begin{aligned}
B_{12} &= \frac{f(W_{11} \mid \delta, I_r)\, f(W_{22} \mid \delta, I_s)}{f(W \mid \delta, I_d)}\\
&= \frac{(\det W_{11})^{(n-r-1)/2}\, (\det W_{22})^{(n-s-1)/2}}{(\det W)^{(n-d-1)/2}} \cdot \frac{\{\det(I_d + W)\}^{(\delta+n+d-1)/2}}{\{\det(I_r + W_{11})\}^{(\delta+n+r-1)/2}\, \{\det(I_s + W_{22})\}^{(\delta+n+s-1)/2}}\\
&\quad \times \frac{c(d, n)\, c(d, \delta)\, c(r, n+\delta)\, c(s, n+\delta)}{c(d, n+\delta)\, c(r, n)\, c(r, \delta)\, c(s, n)\, c(s, \delta)}.
\end{aligned}$$
Note the similarity between the first fraction and Wilks' $\Lambda$ for independence.
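With `log_f_W` from the sketch above, the Bayes factor for independence is three calls (the simulated data are ours; a positive $\log B_{12}$ favours the block-diagonal model $M_1$):

```python
# Log Bayes factor for independence of the first r against the last s
# coordinates, with identity scale matrices as on the slide.
import numpy as np

def log_B12_indep(W, n, delta, r):
    d = W.shape[0]
    s = d - r
    return (log_f_W(W[:r, :r], n, delta, np.eye(r))
            + log_f_W(W[r:, r:], n, delta, np.eye(s))
            - log_f_W(W, n, delta, np.eye(d)))

rng = np.random.default_rng(0)
n, r, s = 100, 2, 2
X = rng.standard_normal((n, r + s))   # blocks truly independent here
W = X.T @ X
print(log_B12_indep(W, n=n, delta=3, r=r))   # typically positive for such data
```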
14 In general the Bayes factor is difficult or impossible to calculate explicitly. Recall that for competing models $M_1$ and $M_2$ with parameters $\theta_1 \in \Theta_1 \subseteq \mathbb{R}^{d_1}$ and $\theta_2 \in \Theta_2 \subseteq \mathbb{R}^{d_2}$ and prior distributions $\pi_1, \pi_2$, the Bayes factor $B$ in favour of $M_1$ over $M_2$ is
$$B = \frac{f(x_1, \dots, x_n \mid M_1)}{f(x_1, \dots, x_n \mid M_2)} = \frac{\int_{\Theta_1} f(x \mid \theta_1, M_1)\,\pi_1(\theta_1)\, d\theta_1}{\int_{\Theta_2} f(x \mid \theta_2, M_2)\,\pi_2(\theta_2)\, d\theta_2}.$$
Recall the approximate expression obtained for the Bayesian marginal likelihood using Laplace's method:
$$\int_\Theta f(x \mid \theta)\pi(\theta)\, d\theta = (2\pi/n)^{d/2}\, L(\hat\theta)\pi(\hat\theta)\, |j(\hat\theta)|^{-1/2}\{1 + O(n^{-1})\}.$$
15 We then get
$$B = (2\pi)^{(d_1 - d_2)/2}\, n^{(d_2 - d_1)/2}\, \frac{L_1(\hat\theta_1)\,\pi_1(\hat\theta_1)\, |j_2(\hat\theta_2)|^{1/2}}{L_2(\hat\theta_2)\,\pi_2(\hat\theta_2)\, |j_1(\hat\theta_1)|^{1/2}}\{1 + O(n^{-1})\}.$$
To study the asymptotic behaviour of the Bayes factor we take logarithms and collect terms of similar order to get
$$\log B = n\{\bar l_n(\hat\theta_1) - \bar l_n(\hat\theta_2)\} + \frac{d_2 - d_1}{2}\log n + \log\{\pi_1(\hat\theta_1)/\pi_2(\hat\theta_2)\} + \frac{1}{2}\log\frac{|j_2(\hat\theta_2)|}{|j_1(\hat\theta_1)|} - \frac{d_2 - d_1}{2}\log(2\pi) + O(n^{-1}),$$
where $\bar l_n(\theta) = l(\theta)/n$ is the average log-likelihood.
16 The dominating terms are those on the first line, as all other terms are of smaller order as $n \to \infty$. Ignoring the latter we get
$$\log B \approx \{l(\hat\theta_1) - l(\hat\theta_2)\} - \frac{d_1 - d_2}{2}\log n.$$
The right-hand side is the Bayesian Information Criterion (BIC). It reflects that, for large $n$, the Bayes factor will favour the model with the highest maximized likelihood (the first term), but will also penalize the model having the largest number of parameters. The prior distributions $\pi_i$ do not enter the expression for BIC, which may or may not be seen as an advantage. Models with a high value of BIC would be preferred over models with a low value of BIC.
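A small simulated illustration (our example): comparing linear and quadratic regression models with the convention above, where the model with the higher per-model score $l(\hat\theta) - (d/2)\log n$ is preferred. With linear data, the quadratic model fits slightly better but pays a larger penalty:

```python
# BIC comparison in the slide's convention (higher is better) for nested
# polynomial regression models; the data-generating model is linear.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)                          # covariate
y = 1.0 + 2.0 * x + rng.standard_normal(n)         # true mean is linear

def bic(degree):
    X = np.vander(x, degree + 1)                   # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                     # MLE of the error variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    d = degree + 2                                 # mean parameters + variance
    return loglik - 0.5 * d * np.log(n)

print("linear    BIC:", bic(1))
print("quadratic BIC:", bic(2))                    # usually lower: penalized more
```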
17 One can get a more accurate approximation of the Bayes factor by adding the terms
$$-\frac{1}{2}\log\left|j_i(\hat\theta_i)\right| + \frac{d_i}{2}\log(2\pi),$$
but this correction is not increasing with $n$, so it is most commonly ignored. For the comparison of two models we get
$$\mathrm{BIC} = l(\hat\theta_1) - l(\hat\theta_2) - \frac{d_1 - d_2}{2}\log n = \log \mathrm{LR} - \frac{d_1 - d_2}{2}\log n.$$
Thus, in comparison with straight maximized likelihood, the simpler model gets preference by incurring a lower penalty.
18 In the nested case with $d_1 < d_2$, the deviance difference between the models is $D = -2\log \mathrm{LR}$, so
$$-2\,\mathrm{BIC} = D + (d_1 - d_2)\log n.$$
If the true value of the parameter satisfies $\theta_0 \in M_1 \subseteq M_2$, the deviance $D$ would under suitable regularity conditions be approximately $\chi^2(d_2 - d_1)$. The penalty term will thus dominate for large values of $n$, so the simpler model will eventually be chosen. In this sense, BIC will asymptotically choose the simplest model which is correct, often referred to as consistency of the BIC.
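A quick calculation (ours) of how fast the error probability vanishes when $M_1$ is true: the larger model is chosen only when $D > (d_2 - d_1)\log n$, which for $D \approx \chi^2(d_2 - d_1)$ has probability $P\{\chi^2(d_2 - d_1) > (d_2 - d_1)\log n\} \to 0$:

```python
# Probability that BIC picks the larger model when the simpler model M1
# is true, using D ~ chi^2(d2 - d1) against the penalty (d2 - d1) log n.
import numpy as np
from scipy.stats import chi2

d_diff = 3                                     # d2 - d1
for n in [10, 100, 10_000, 1_000_000]:
    print(n, chi2.sf(d_diff * np.log(n), df=d_diff))
```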