Foundations of Statistical Inference

Julien Berestycki, Department of Statistics, University of Oxford. SB2a, MT 2016.

Lecture 14: Variational Bayes

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." (John W. Tukey, 1915-2000)

Laplace approximation

The Laplace approximation provides a way of approximating a density whose normalisation constant we cannot evaluate, by fitting a Gaussian distribution at its mode (Pierre-Simon Laplace, 1749-1827). We write

$$p(z) = \frac{1}{Z}\, f(z),$$

where $p(z)$ is the probability density, $Z$ is the unknown normalising constant, and $f(z)$ is the main part of the density, which is easy to evaluate.

Observe that this is exactly the situation we face in Bayesian inference:

$$p(\theta \mid y) = \frac{1}{p(y)}\, p(\theta, y),$$

where $p(\theta \mid y)$ is the posterior density, $p(y)$ is the marginal likelihood, and $p(\theta, y)$ is the joint probability (likelihood times prior).

Deriving the Laplace approximation

Idea: take a second-order Taylor approximation of $\ell(\theta) = \log p(y, \theta)$ around its mode $\theta^*$. Since the first-order term vanishes at the mode,

$$\ell(\theta) \approx \ell(\theta^*) + \underbrace{\ell'(\theta^*)(\theta - \theta^*)}_{=0} + \frac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2
 = \ell(\theta^*) + \frac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2.$$

Recognise a Gaussian density:

$$\log \mathcal{N}(\theta \mid \mu, \sigma^2) = -\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(\theta - \mu)^2.$$

So approximate the posterior by $q(\theta) = \mathcal{N}(\theta \mid \mu, \sigma^2)$ with $\mu = \theta^*$ (the mode of the log-posterior) and $\sigma^2 = -1/\ell''(\theta^*)$ (the inverse of the negative curvature at the mode).

Computing integrals

More generally, assume $f(x)$ has a unique global maximum at $x_0$. Then

$$f(x) \approx f(x_0) - \frac{1}{2}\,|f''(x_0)|\,(x - x_0)^2,$$

so

$$\int_a^b e^{N f(x)}\,dx \approx e^{N f(x_0)} \int_a^b e^{-N |f''(x_0)|(x - x_0)^2/2}\,dx,$$

to obtain:

Lemma. $\displaystyle \int_a^b e^{N f(x)}\,dx \approx \sqrt{\frac{2\pi}{N\,|f''(x_0)|}}\; e^{N f(x_0)}$ as $N \to \infty$.

The Laplace approximation becomes better as $N$ grows.

In dimension d > 1

If $x \in \mathbb{R}^d$ then the Taylor expansion becomes

$$f(x) \approx f(x_0) + \frac{1}{2}(x - x_0)^T H (x - x_0),$$

where $H$ is the Hessian matrix of second derivatives of $f$ at $x_0$ (the gradient vanishes at the maximum). In that case it can be shown that:

Lemma. $\displaystyle \int e^{N f(x)}\,dx \approx \left(\frac{2\pi}{N}\right)^{d/2} |H(x_0)|^{-1/2}\, e^{N f(x_0)}$ as $N \to \infty$, where $|H(x_0)|$ denotes the absolute value of the determinant of the Hessian.
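
The following Python sketch (not part of the original notes) checks the one-dimensional lemma numerically for the toy choice $f(x) = -\cosh(x)$, which has its maximum at $x_0 = 0$ with $f(x_0) = -1$ and $|f''(x_0)| = 1$; the ratio of the Laplace value to brute-force quadrature should approach 1 as $N$ grows.

```python
import numpy as np
from scipy.integrate import quad

# Toy f with a unique maximum at x0 = 0: f(x) = -cosh(x), so f(x0) = -1 and |f''(x0)| = 1.
f = lambda x: -np.cosh(x)

for N in [1, 5, 20, 100]:
    # Integrate exp(N * (f(x) - f(x0))) to avoid underflow; the factor e^{N f(x0)} cancels in the ratio.
    integral, _ = quad(lambda x: np.exp(N * (f(x) - f(0.0))), -10, 10)
    laplace = np.sqrt(2 * np.pi / (N * 1.0))   # sqrt(2*pi / (N |f''(x0)|))
    print(f"N = {N:3d}   quadrature = {integral:.6f}   Laplace = {laplace:.6f}   ratio = {laplace / integral:.4f}")
```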

Using the Laplace approximation

Given a model with $\theta = (\theta_1, \ldots, \theta_p)$:

Step 1. Find the mode of the log-joint (the MAP estimate of $\theta$):
$$\theta^* = \arg\max_\theta \log p(\theta, y).$$

Step 2. Evaluate the curvature of the log-joint at the mode,
$$H = -D^2 \log p(\theta, y)\big|_{\theta = \theta^*},$$
the (negative) Hessian matrix.

Step 3. Obtain the Gaussian approximation
$$\mathcal{N}(\theta \mid \mu, \Sigma), \qquad \mu = \theta^*, \quad \Sigma = H^{-1}.$$
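
A minimal sketch of these three steps for a generic log-joint, using scipy for the optimisation and a simple central-difference Hessian (not part of the original notes; the log-joint used here, a normal likelihood with a wide normal prior on its mean, is just an illustrative stand-in):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_joint, theta_init, eps=1e-4):
    """Gaussian (Laplace) approximation N(mu, Sigma) to the density proportional to exp(log_joint)."""
    # Step 1: mode of the log-joint (MAP estimate).
    mode = minimize(lambda th: -log_joint(th), theta_init, method="BFGS").x
    p = len(mode)
    # Step 2: curvature at the mode, H = -D^2 log_joint, by central finite differences.
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei, ej = np.eye(p)[i] * eps, np.eye(p)[j] * eps
            H[i, j] = -(log_joint(mode + ei + ej) - log_joint(mode + ei - ej)
                        - log_joint(mode - ei + ej) + log_joint(mode - ei - ej)) / (4 * eps**2)
    # Step 3: Gaussian approximation with mu = mode and Sigma = H^{-1}.
    return mode, np.linalg.inv(H)

# Illustrative log-joint: y_i ~ N(theta, 1) with a N(0, 10^2) prior on theta.
y = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
log_joint = lambda th: -0.5 * np.sum((y - th[0])**2) - 0.5 * th[0]**2 / 100.0
mu, Sigma = laplace_approx(log_joint, theta_init=np.zeros(1))
print("mu =", mu, "Sigma =", Sigma)
```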

Example

Suppose the $y_i$ are iid $N(\mu, \sigma^2)$ with a flat prior on $\mu$ and on $\log\sigma$. The posterior is

$$p(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}-1} \exp\left\{ -\frac{(n-1)s^2 + n(\bar{y}-\mu)^2}{2\sigma^2} \right\},$$

where $\bar{y} = \frac{1}{n}\sum y_i$ and $s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$. Writing $\nu = \log\sigma$ we get

$$p(\mu, \nu \mid y) \propto f(\mu, \nu) = \exp\left\{ -\frac{(n-1)s^2 + n(\bar{y}-\mu)^2}{2 e^{2\nu}} - n\nu \right\}.$$

It is easy to check that

$$(\hat\mu, \hat\nu) = \operatorname{mode}(\mu, \nu \mid y) = \left( \bar{y},\; \tfrac{1}{2}\log\frac{(n-1)s^2}{n} \right).$$

The second-order derivatives are

$$\frac{\partial^2}{\partial\mu^2}\log f = -n e^{-2\nu}, \qquad
\frac{\partial^2}{\partial\mu\,\partial\nu}\log f = -2n(\bar{y}-\mu)e^{-2\nu}, \qquad
\frac{\partial^2}{\partial\nu^2}\log f = -2\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right] e^{-2\nu},$$

so that, at the mode,

$$H = \begin{pmatrix} \dfrac{n^2}{(n-1)s^2} & 0 \\ 0 & 2n \end{pmatrix},$$

and we have the Laplace approximation

$$(\mu, \nu) \approx \mathcal{N}\!\left( \begin{pmatrix} \bar{y} \\ \tfrac{1}{2}\log\frac{(n-1)s^2}{n} \end{pmatrix},\;
\begin{pmatrix} \dfrac{(n-1)s^2}{n^2} & 0 \\ 0 & \dfrac{1}{2n} \end{pmatrix} \right).$$
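
A small check on simulated data (not part of the original notes; the simulated sample and the optimiser's starting point are arbitrary choices) that the analytic mode above agrees with a numerical maximiser of $\log f$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=50)          # simulated data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# log f(mu, nu) as above, with nu = log(sigma)
def log_f(th):
    mu, nu = th
    return -((n - 1) * s2 + n * (ybar - mu)**2) / (2 * np.exp(2 * nu)) - n * nu

mode_analytic = np.array([ybar, 0.5 * np.log((n - 1) * s2 / n)])
mode_numeric = minimize(lambda th: -log_f(th), x0=np.array([0.0, 1.0]), method="Nelder-Mead").x

print("analytic mode: ", mode_analytic)
print("numerical mode:", mode_numeric)
print("Laplace variances for (mu, nu):", (n - 1) * s2 / n**2, 1.0 / (2 * n))
```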

Limitations of the Laplace method

The Laplace approximation is often too strong a simplification.
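
One way to see this (not in the original notes): for a skewed density such as Beta(2, 10), the Laplace approximation, a normal centred at the mode with variance given by the inverse curvature there, badly understates the right tail probabilities:

```python
import numpy as np
from scipy.stats import beta, norm

a, b = 2, 10
mode = (a - 1) / (a + b - 2)                              # mode of Beta(2, 10) = 0.1
curv = (a - 1) / mode**2 + (b - 1) / (1 - mode)**2        # -(d^2/dtheta^2) log density at the mode
laplace = norm(loc=mode, scale=np.sqrt(1.0 / curv))       # Laplace (Gaussian) approximation

for t in [0.2, 0.3, 0.4]:
    print(f"P(theta > {t}):  exact = {beta(a, b).sf(t):.4f}   Laplace = {laplace.sf(t):.4f}")
```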

Laplace method for computing the marginal likelihood

$$P(x) = \int P(x \mid \theta)\,\pi(\theta)\,d\theta
 = \int \exp\left\{ -N\left( -\tfrac{1}{N}\log P(x \mid \theta) - \tfrac{1}{N}\log\pi(\theta) \right) \right\} d\theta.$$

Define $h(\theta) = -\frac{1}{N}\log P(x \mid \theta) - \frac{1}{N}\log\pi(\theta)$, so that the integral we want to compute is of the form $\int \exp\{-N h(\theta)\}\,d\theta$.

Expanding around the minimiser $\theta^*$ of $h$,

$$h(\theta) \approx h(\theta^*) + \tfrac{1}{2} h''(\theta^*)(\theta - \theta^*)^2,$$

and we can approximate the integral as

$$\int e^{-N h(\theta)}\,d\theta \approx e^{-N h(\theta^*)} \int \exp\left\{ -\tfrac{N}{2} h''(\theta^*)(\theta - \theta^*)^2 \right\} d\theta.$$

Comparing to a normal pdf we have

$$\int e^{-N h(\theta)}\,d\theta \approx e^{-N h(\theta^*)}\,(2\pi)^{\frac12}\,\big(N h''(\theta^*)\big)^{-\frac12}
 = P(x \mid \theta^*)\,\pi(\theta^*)\,(2\pi)^{\frac12}\,\big(N h''(\theta^*)\big)^{-\frac12}.$$

Laplace's method

For a $d$-dimensional function the analogue of this result is

$$\int e^{-N f(x)}\,dx \approx e^{-N f(x_0)}\,(2\pi)^{\frac d2}\,N^{-\frac d2}\,|f''(x_0)|^{-\frac12},$$

where $x_0$ is the minimiser of $f$ and $|f''(x_0)|$ is the determinant of the Hessian of the function evaluated at $x_0$.
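
A sketch applying this formula to a Beta-Bernoulli model, where the exact marginal likelihood is available in closed form for comparison (not part of the original notes; the prior parameters, success probability, seed and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize_scalar

# x_1, ..., x_N iid Bernoulli(theta) with a Beta(a, b) prior on theta;
# the exact marginal likelihood is B(k + a, N - k + b) / B(a, b), with k the number of successes.
a, b = 2.0, 2.0
rng = np.random.default_rng(1)

for N in [10, 50, 500]:
    k = rng.binomial(1, 0.3, size=N).sum()

    log_joint = lambda th: ((k + a - 1) * np.log(th)
                            + (N - k + b - 1) * np.log(1 - th)
                            - betaln(a, b))

    # Mode of the log-joint and (minus) its second derivative there
    theta_star = minimize_scalar(lambda th: -log_joint(th),
                                 bounds=(1e-6, 1 - 1e-6), method="bounded").x
    curv = (k + a - 1) / theta_star**2 + (N - k + b - 1) / (1 - theta_star)**2

    log_ml_laplace = log_joint(theta_star) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(curv)
    log_ml_exact = betaln(k + a, N - k + b) - betaln(a, b)
    print(f"N = {N:4d}   exact = {log_ml_exact:9.4f}   Laplace = {log_ml_laplace:9.4f}")
```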

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) takes the approximation one step further, essentially by minimising the impact of the prior.

Firstly, the MAP estimate $\theta^*$ is replaced by the MLE $\hat\theta$, which is reasonable if the prior has a small effect.

Secondly, BIC only retains the terms that vary with $N$, since asymptotically the terms that are constant in $N$ do not matter.

Dropping the constant terms we get

$$\log P(X) \approx \log P(X \mid \hat\theta) - \frac{d}{2}\log N.$$
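
A sketch (not part of the original notes) comparing the BIC score $\log P(X \mid \hat\theta) - \frac{d}{2}\log N$ with the exact log marginal likelihood in the same Bernoulli setting, now under a uniform Beta(1, 1) prior; the gap between the two is O(1), so it becomes negligible relative to the leading terms as $N$ grows. The success probability and seed are arbitrary.

```python
import numpy as np
from scipy.special import betaln, xlogy

rng = np.random.default_rng(2)
for N in [10, 100, 1000, 10000]:
    k = rng.binomial(1, 0.3, size=N).sum()
    theta_mle = k / N
    loglik_mle = xlogy(k, theta_mle) + xlogy(N - k, 1 - theta_mle)   # xlogy handles k = 0 or k = N
    bic_score = loglik_mle - 0.5 * np.log(N)                         # d = 1 free parameter
    log_ml_exact = betaln(k + 1, N - k + 1)                          # uniform prior: marginal = B(k+1, N-k+1)
    print(f"N = {N:6d}   exact = {log_ml_exact:11.3f}   BIC = {bic_score:11.3f}   gap = {bic_score - log_ml_exact:6.3f}")
```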

Bayesian Information Criterion (BIC) - extra details

Why can we ignore the term $-\frac{1}{2}\log|h''(\hat\theta)|$?

Assume (as above) that we can ignore the prior, i.e. $\pi(\theta) = 1$, and that the data points $X_1, \ldots, X_N$ are iid. Then

$$h(\hat\theta) = -\frac{1}{N}\log P(X \mid \theta)\Big|_{\theta=\hat\theta}
 = -\frac{1}{N}\sum_{i=1}^N \log P(X_i \mid \theta)\Big|_{\theta=\hat\theta}.$$

The thing to notice about this term is that it is now the (negative) average log-likelihood.

Now consider the random variables $\log P(X_i \mid \theta)$ and apply the weak law of large numbers: $h(\hat\theta) \to -\mathbb{E}\big[\log P(X_i \mid \theta)\big]\big|_{\theta=\hat\theta}$, and the $(m,n)$th element of $h''(\hat\theta)$ converges to

$$-\frac{\partial^2}{\partial\theta_m\,\partial\theta_n}\mathbb{E}\big[\log P(X_i \mid \theta)\big]\Big|_{\theta=\hat\theta}.$$

These are constants, i.e. expected log-likelihoods (and their derivatives) for a single data point, so $h''(\hat\theta)$ does not grow with $N$ and its contribution can be ignored in the BIC approximation.
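
A small illustration of the law-of-large-numbers step (not in the original notes), for an $N(\mu, 1)$ model: the average log-likelihood at the MLE settles down to a constant as $N$ grows (here it tends to $-\tfrac12\log 2\pi - \tfrac12 \approx -1.419$), so its derivatives do not grow with $N$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = 2.0
for N in [10, 100, 1000, 10000, 100000]:
    x = rng.normal(mu_true, 1.0, size=N)
    mu_hat = x.mean()                                              # MLE of mu
    avg_loglik = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu_hat)**2)
    print(f"N = {N:7d}   average log-likelihood at MLE = {avg_loglik:.5f}")
```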

Variational Bayes

The idea of VB is to find an approximation $Q(\theta)$ to a given posterior distribution $P(\theta \mid X)$, that is $Q(\theta) \approx P(\theta \mid X)$, where $\theta$ is the vector of parameters. We then use $Q(\theta)$ to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood.

Question: how do we find a good approximate posterior $Q(\theta)$?

Kullback-Leibler (KL) divergence

The strategy we take is to find a distribution $Q(\theta)$ that minimises a measure of distance between $Q(\theta)$ and the posterior $P(\theta \mid X)$.

Definition. The Kullback-Leibler divergence $KL(q \,\|\, p)$ between two distributions $q(x)$ and $p(x)$ is

$$KL(q \,\|\, p) = \int \log\left[\frac{q(x)}{p(x)}\right] q(x)\,dx.$$

[Figure: the densities $q(x)$ and $p(x)$, and the integrand $q(x)\log(q(x)/p(x))$.]

Exercise. $KL(q \,\|\, p) \geq 0$, with $KL(q \,\|\, p) = 0$ iff $q = p$.
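
The definition is easy to evaluate numerically. The sketch below (not part of the original notes) computes $KL(q \,\|\, p)$ by quadrature for two normal distributions and checks it against the known closed form $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac12$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

q, p = norm(0.0, 1.0), norm(1.0, 2.0)   # q = N(0, 1), p = N(1, 4)

kl_quad, _ = quad(lambda x: np.exp(q.logpdf(x)) * (q.logpdf(x) - p.logpdf(x)), -12, 12)
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5

print(f"quadrature: {kl_quad:.6f}   closed form: {kl_exact:.6f}")   # both should be ~0.4431
```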

N(µ, σ²) approximations to a Gamma(10, 1) density

[Figure: four normal approximations to a Gamma(10, 1) density, with panel titles
µ = 10, σ² = 4, KL = 0.223;  µ = 9.11, σ² = 3.03, KL = 0.124;
µ = 13, σ² = 2.23, KL = 0.347;  µ = 9, σ² = 2, KL = 0.386.]

We consider the KL divergence between $Q(\theta)$ and $P(\theta \mid X)$:

$$\begin{aligned}
KL\big(Q(\theta) \,\|\, P(\theta \mid X)\big) &= \int \log\left[\frac{Q(\theta)}{P(\theta \mid X)}\right] Q(\theta)\,d\theta \\
&= \int \log\left[\frac{Q(\theta)\,P(X)}{P(\theta, X)}\right] Q(\theta)\,d\theta \\
&= \log P(X) - \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta.
\end{aligned}$$

The log marginal likelihood can then be written as

$$\log P(X) = F(Q(\theta)) + KL\big(Q(\theta) \,\|\, P(\theta \mid X)\big), \qquad (1)$$

where $F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta$.

Note. Since $KL(q \,\|\, p) \geq 0$ we have $\log P(X) \geq F(Q(\theta))$, so $F(Q(\theta))$ is a lower bound on the log marginal likelihood.
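
A numerical check of decomposition (1) (not part of the original notes), for a conjugate model where everything is available in closed form: $X_i \sim N(\theta, 1)$ with a $N(0, 4)$ prior on $\theta$. Here $Q$ is an arbitrary normal, and $F(Q) + KL$ should match $\log P(X)$ up to quadrature error:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, size=20)            # data: x_i ~ N(theta, 1)
n, tau2 = len(x), 4.0                        # prior: theta ~ N(0, tau2)

log_joint = lambda th: np.sum(norm(th, 1.0).logpdf(x)) + norm(0.0, np.sqrt(tau2)).logpdf(th)

# Exact posterior N(m_post, v_post) and exact log marginal likelihood
v_post = 1.0 / (n + 1.0 / tau2)
m_post = v_post * x.sum()
log_evidence = multivariate_normal(mean=np.zeros(n),
                                   cov=np.eye(n) + tau2 * np.ones((n, n))).logpdf(x)

Q = norm(0.5, 0.7)                           # an arbitrary approximating distribution Q(theta)
post = norm(m_post, np.sqrt(v_post))

F, _ = quad(lambda th: Q.pdf(th) * (log_joint(th) - Q.logpdf(th)), -10, 10)    # lower bound F(Q)
KL, _ = quad(lambda th: Q.pdf(th) * (Q.logpdf(th) - post.logpdf(th)), -10, 10) # KL(Q || posterior)

print(f"F(Q) + KL = {F + KL:.6f}   log P(X) = {log_evidence:.6f}")
```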

The mean field approximation

We now need to ask what form $Q(\theta)$ should take.

The most widely used approximation is known as the mean field approximation and assumes only that the approximate posterior has a factorised form

$$Q(\theta) = \prod_i Q(\theta_i).$$

The VB algorithm iteratively maximises $F(Q(\theta))$ with respect to the free distributions $Q(\theta_i)$, which is coordinate ascent in the function space of variational distributions.

We refer to each $Q(\theta_i)$ as a VB component. We update each component $Q(\theta_i)$ in turn, keeping $Q(\theta_j)$, $j \neq i$, fixed.

VB components

Lemma. The VB components take the form

$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big( \log P(X, \theta) \big) + \text{const}.$$

Proof. Writing $Q(\theta) = Q(\theta_i)\,Q(\theta_{-i})$, where $\theta_{-i} = \theta \setminus \theta_i$, the lower bound can be rewritten as

$$\begin{aligned}
F(Q(\theta)) &= \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta \\
&= \int \log\left[\frac{P(\theta, X)}{Q(\theta_i)\,Q(\theta_{-i})}\right] Q(\theta_i)\,Q(\theta_{-i})\,d\theta_i\,d\theta_{-i} \\
&= \int Q(\theta_i) \left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right] d\theta_i
   - \int Q(\theta_i)\,Q(\theta_{-i}) \log Q(\theta_i)\,d\theta_i\,d\theta_{-i}
   - \int Q(\theta_i)\,Q(\theta_{-i}) \log Q(\theta_{-i})\,d\theta_i\,d\theta_{-i}.
\end{aligned}$$

Since $\int Q(\theta_{-i})\,d\theta_{-i} = 1$ and $Q(\theta_{-i}) = \prod_{j \neq i} Q(\theta_j)$, this becomes

$$F(Q(\theta)) = \int Q(\theta_i) \left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right] d\theta_i
 - \int Q(\theta_i) \log Q(\theta_i)\,d\theta_i
 - \sum_{j \neq i} \int Q(\theta_j) \log Q(\theta_j)\,d\theta_j.$$

If we let

$$Q^*(\theta_i) = \frac{1}{Z} \exp\left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right],$$

where $Z$ is a normalising constant, and write $H(Q(\theta_j)) = -\int Q(\theta_j) \log Q(\theta_j)\,d\theta_j$ for the entropy of $Q(\theta_j)$, then

$$F(Q(\theta)) = \int Q(\theta_i) \log\frac{Q^*(\theta_i)}{Q(\theta_i)}\,d\theta_i + \log Z + \sum_{j \neq i} H(Q(\theta_j))
 = -KL\big(Q(\theta_i) \,\|\, Q^*(\theta_i)\big) + \log Z + \sum_{j \neq i} H(Q(\theta_j)).$$

We then see that $F(Q(\theta))$ is maximised when $Q(\theta_i) = Q^*(\theta_i)$, as this choice minimises the Kullback-Leibler divergence term.

Thus the update for $Q(\theta_i)$ is given by

$$Q(\theta_i) \propto \exp\left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right],$$

or equivalently

$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big( \log P(X, \theta) \big) + \text{const}.$$

VB algorithm

This implies a straightforward algorithm for variational inference:

1. Initialise all approximate posteriors (in the example below, $Q(\theta) = Q(\mu)Q(\tau)$), e.g. by setting them to their priors.
2. Cycle over the parameters, revising each $Q(\theta_i)$ given the current estimates of the others.
3. Loop until convergence.

Convergence is checked by calculating the VB lower bound at each step, i.e.

$$F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta.$$

The precise form of this term needs to be derived, and can be quite tricky.

Example 1

Consider applying VB to the hierarchical model

$$X_i \sim N(\mu, \tau^{-1}), \quad i = 1, \ldots, P, \qquad \mu \sim N\big(m, (\tau\beta)^{-1}\big), \qquad \tau \sim \Gamma(a, b).$$

Note. We are using a prior of the form $\pi(\tau, \mu) = \pi(\mu \mid \tau)\,\pi(\tau)$.

Let $\theta = (\mu, \tau)$ and assume $Q(\theta) = Q(\mu)\,Q(\tau)$. We will use the notation $\bar\theta_i = \mathbb{E}_{Q(\theta_i)}\theta_i$.

The log joint density is

$$\log P(X, \theta) = \frac{P}{2}\log\tau - \frac{\tau}{2}\sum_{i=1}^P (X_i - \mu)^2
 + \frac{1}{2}\log\tau - \frac{\tau\beta}{2}(\mu - m)^2 + (a-1)\log\tau - b\tau + K.$$

We can derive the VB updates one at a time. We start with $Q(\mu)$.

Note. We just need to focus on the terms involving $\mu$:

$$\log Q(\mu) = \mathbb{E}_{Q(\tau)}\big( \log P(X, \theta) \big) + C
 = -\frac{\bar\tau}{2}\left( \sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2 \right) + C,$$

where $\bar\tau = \mathbb{E}_{Q(\tau)}(\tau)$. We will be able to determine $\bar\tau$ when we derive the other component of the approximate density, $Q(\tau)$.

We can see this log density has the form of a normal distribution:

$$\log Q(\mu) = -\frac{\bar\tau}{2}\left( \sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2 \right) + C
 = -\frac{\beta^*}{2}(\mu - m^*)^2 + C',$$

where

$$\beta^* = (\beta + P)\,\bar\tau, \qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$

Thus $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$.

The second component of the VB approximation is derived as

$$\log Q(\tau) = \mathbb{E}_{Q(\mu)}\big( \log P(X, \theta) \big) + C
 = \left( \frac{P+1}{2} + a - 1 \right)\log\tau
 - \frac{\tau}{2}\,\mathbb{E}_{Q(\mu)}\!\left[ \sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2 \right] - b\tau + C.$$

We can see this log density has the form of a gamma distribution, namely $\Gamma(a^*, b^*)$ with

$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\sum_{i=1}^P\left( X_i^2 - 2 X_i \bar\mu + \overline{\mu^2} \right)
 + \frac{\beta}{2}\left( m^2 - 2 m \bar\mu + \overline{\mu^2} \right).$$

So overall we have:

1. $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$, where
$$\beta^* = (\beta + P)\,\bar\tau, \qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$

2. $Q(\tau) = \Gamma(\tau \mid a^*, b^*)$, where
$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\sum_{i=1}^P\left( X_i^2 - 2 X_i \bar\mu + \overline{\mu^2} \right)
 + \frac{\beta}{2}\left( m^2 - 2 m \bar\mu + \overline{\mu^2} \right).$$

To calculate these we need

$$\bar\tau = \frac{a^*}{b^*}, \qquad \bar\mu = m^*, \qquad \overline{\mu^2} = \beta^{*-1} + m^{*2}.$$
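
A direct implementation of these updates on simulated data (not part of the original notes; the priors, simulated sample and convergence tolerance are arbitrary choices), iterating until the variational parameters stabilise:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=3.0, scale=2.0, size=30)      # simulated data
P = len(X)

# Prior hyperparameters: mu ~ N(m, (tau*beta)^-1), tau ~ Gamma(a, b) (shape a, rate b)
m, beta, a, b = 0.0, 1.0, 1.0, 1.0

a_star, b_star = a, b                            # initialise Q(tau) at its prior
m_star, beta_star = m, beta * a / b

for it in range(200):
    tau_bar = a_star / b_star                    # E[tau] under the current Q(tau)
    # Update Q(mu) = N(m_star, 1/beta_star)
    beta_star = (beta + P) * tau_bar
    m_star = (beta * m + X.sum()) / (beta + P)
    # Update Q(tau) = Gamma(a_star, b_star), using E[mu] and E[mu^2] under Q(mu)
    mu_bar, mu2_bar = m_star, 1.0 / beta_star + m_star**2
    a_star = a + (P + 1) / 2
    b_new = (b + 0.5 * np.sum(X**2 - 2 * X * mu_bar + mu2_bar)
               + 0.5 * beta * (m**2 - 2 * m * mu_bar + mu2_bar))
    if abs(b_new - b_star) < 1e-10:
        b_star = b_new
        break
    b_star = b_new

print(f"Q(mu)  = N({m_star:.3f}, {1.0 / beta_star:.4f})")
print(f"Q(tau) = Gamma({a_star:.2f}, {b_star:.3f}),  E[tau] = {a_star / b_star:.4f}")
```

As the comparison with the exact posterior below suggests, the fitted $Q$ tends to centre in the right place but to understate the posterior spread.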

Example 1 (continued)

For this model the exact posterior was calculated in Lecture 6:

$$\pi(\tau, \mu \mid X) \propto \tau^{\tilde a - 1} \exp\left\{ -\tau\left[ \tilde b + \frac{\tilde\beta}{2}(\tilde m - \mu)^2 \right] \right\},$$

where

$$\tilde a = a + \frac{P+1}{2}, \qquad
\tilde b = b + \frac{1}{2}\sum_{i=1}^P (X_i - \bar X)^2 + \frac{P\beta}{2(P+\beta)}(\bar X - m)^2, \qquad
\tilde\beta = \beta + P, \qquad \tilde m = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$

We note some similarity between the VB updates and the true posterior parameters.

We can compare the true and VB posteriors when applied to a real dataset. We see that the VB approximations underestimate the posterior variances.

[Figure: true posterior versus VB posterior densities for the two model parameters; the VB densities are noticeably narrower.]

General comments

The property of VB underestimating the variance of the posterior is a general feature of the method whenever there is correlation between the $\theta_i$'s in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison, i.e. comparing the approximate marginal likelihoods between models.

VB is often much, much faster to implement than MCMC or other sampling-based methods.

The VB updates and lower bound can be tricky to derive, and sometimes further approximation is needed.

The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought or known to be multi-modal.