Foundations of Statistical Inference


1 Foundations of Statistical Inference. Julien Berestycki, Department of Statistics, University of Oxford. MT 2016.

2 Lecture 14 : Variational Bayes. "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." John W. Tukey

3-5 Laplace approximation

The Laplace approximation provides a way of approximating a density whose normalisation constant we cannot evaluate, by fitting a Gaussian distribution to its mode. [Portrait: Pierre-Simon Laplace (1749-1827)]

$$p(z) = \frac{1}{Z}\, f(z)$$

where $p(z)$ is the probability density, $Z$ is an unknown constant, and $f(z)$ is the main part of the density (easy to evaluate).

Observe that this is exactly the situation we face in Bayesian inference:

$$\underbrace{p(\theta \mid y)}_{\text{posterior density}} = \underbrace{\frac{1}{p(y)}}_{\text{marginal dist.}}\; \underbrace{p(\theta, y)}_{\text{joint proba. (likelihood $\times$ prior)}}$$

6-11 Deriving Laplace approximation

Idea: 2nd order Taylor approximation to $\ell(\theta) = \log p(y, \theta)$ around the mode $\theta^*$:

$$\ell(\theta) \approx \ell(\theta^*) + \underbrace{\ell'(\theta^*)}_{=0}(\theta - \theta^*) + \tfrac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2 = \ell(\theta^*) + \tfrac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2$$

Recognize a Gaussian density:

$$\log \mathcal{N}(\theta \mid \mu, \sigma^2) = -\log\sigma - \tfrac{1}{2}\log 2\pi - \tfrac{1}{2\sigma^2}(\theta - \mu)^2$$

So approximate the posterior by $q(\theta) = \mathcal{N}(\theta \mid \mu, \sigma^2)$ with $\mu = \theta^*$ (mode of the log-posterior) and $\sigma^2 = \big[-\ell''(\theta^*)\big]^{-1}$ (inverse of the negative curvature at the mode).
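The derivation translates directly into a few lines of numerical code. Below is a minimal sketch, with an illustrative unnormalised Beta(5, 3) target and a finite-difference second derivative (both my own choices, not from the slides).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative target: an unnormalised Beta(5, 3) density, using only log f(theta).
a, b = 5.0, 3.0
log_f = lambda t: (a - 1) * np.log(t) + (b - 1) * np.log(1 - t)

# Mode theta* of the log density
res = minimize_scalar(lambda t: -log_f(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_star = res.x

# Second derivative l''(theta*) by central differences
h = 1e-5
l2 = (log_f(theta_star + h) - 2 * log_f(theta_star) + log_f(theta_star - h)) / h ** 2

mu, sigma2 = theta_star, -1.0 / l2          # q(theta) = N(mu, sigma2)
print(mu, sigma2)                            # mode is (a - 1) / (a + b - 2) = 2/3
```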

12-13 Computing integrals

More generally, assume $f(x)$ has a unique global maximum at $x_0$. Then

$$f(x) \approx f(x_0) - \tfrac{1}{2}\,|f''(x_0)|\,(x - x_0)^2,$$

so

$$\int_a^b e^{N f(x)}\, dx \approx e^{N f(x_0)} \int_a^b e^{-N |f''(x_0)| (x - x_0)^2/2}\, dx.$$

To obtain

Lemma
$$\int_a^b e^{N f(x)}\, dx \sim \sqrt{\frac{2\pi}{N\,|f''(x_0)|}}\; e^{N f(x_0)} \quad \text{as } N \to \infty.$$

The Laplace approximation becomes better as $N$ grows.
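A quick numerical check of the lemma, for an illustrative $f$ of my own choosing:

```python
import numpy as np
from scipy.integrate import quad

# Check the lemma for f(x) = -(x - 1)^2 on [a, b] = [0, 3], so x0 = 1 and f''(x0) = -2.
f = lambda x: -(x - 1.0) ** 2
x0, f2 = 1.0, -2.0

for N in (5, 50, 500):
    exact, _ = quad(lambda x: np.exp(N * f(x)), 0.0, 3.0)
    approx = np.sqrt(2 * np.pi / (N * abs(f2))) * np.exp(N * f(x0))
    print(N, exact, approx)                  # the ratio tends to 1 as N grows
```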

14 In dimension d > 1

If $x \in \mathbb{R}^d$ then the Taylor expansion becomes
$$f(x) \approx f(x_0) + \tfrac{1}{2}(x - x_0)^T H (x - x_0)$$
where $H$ is the Hessian matrix of second derivatives of $f$ at $x_0$. In that case it can be shown that

Lemma
$$\int e^{N f(x)}\, dx \sim \left(\frac{2\pi}{N}\right)^{d/2} |\det H(x_0)|^{-1/2}\, e^{N f(x_0)} \quad \text{as } N \to \infty.$$

15-17 Using Laplace approximation

Given a model with $\theta = (\theta_1, \ldots, \theta_p)$:

Step 1 Find the mode of the log-joint (= MAP estimate of $\theta$):
$$\theta^* = \operatorname*{argmax}_\theta \log p(\theta, y)$$

Step 2 Evaluate the curvature of the log-joint at the mode:
$$H = D^2 \log p(\theta, y)\big|_{\theta = \theta^*} \quad \text{(the Hessian matrix)}$$

Step 3 Obtain the Gaussian approximation
$$\mathcal{N}(\theta \mid \mu, \Sigma), \qquad \mu = \theta^*, \quad \Sigma = (-H)^{-1}.$$
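Here is a hedged sketch of Steps 1-3 in Python; the bivariate log-joint, the finite-difference Hessian and all variable names are illustrative assumptions rather than anything prescribed in the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative log-joint: an unnormalised correlated Gaussian in two parameters.
def log_joint(theta):
    x, y = theta
    return -0.5 * (x ** 2 + 2.0 * (y - 1.0) ** 2 + x * (y - 1.0))

# Step 1: MAP estimate (mode of the log-joint)
theta_star = minimize(lambda t: -log_joint(t), x0=np.zeros(2)).x

# Step 2: Hessian of the log-joint at the mode, by central differences
def hessian(f, x, h=1e-5):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

H = hessian(log_joint, theta_star)

# Step 3: Gaussian approximation N(theta_star, Sigma) with Sigma = (-H)^{-1}
Sigma = np.linalg.inv(-H)
print(theta_star, Sigma)
```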

18 Example

Suppose the $y_i$ are iid $N(\mu, \sigma^2)$ with a flat prior on $\mu$ and on $\log\sigma$. The posterior is
$$p(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}-1} \exp\left(-\frac{(n-1)s^2 + n(\bar{y} - \mu)^2}{2\sigma^2}\right)$$
where $\bar{y} = \frac{1}{n}\sum y_i$ and $s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$. Writing $\nu = \log\sigma$ we get
$$p(\mu, \nu \mid y) \propto f(\mu, \nu) = \exp\left(-\frac{(n-1)s^2 + n(\bar{y} - \mu)^2}{2 e^{2\nu}} - n\nu\right)$$

19 Example

It is easy to check that
$$(\hat{\mu}, \hat{\nu}) = \operatorname{mode}(\mu, \nu \mid y) = \left(\bar{y},\; \tfrac{1}{2}\log\Big(\tfrac{n-1}{n}\, s^2\Big)\right)$$
Second order derivatives are
$$\frac{\partial^2}{\partial\mu^2}\log f = -n e^{-2\nu}, \qquad \frac{\partial^2}{\partial\mu\,\partial\nu}\log f = -2n(\bar{y} - \mu)e^{-2\nu}$$
and
$$\frac{\partial^2}{\partial\nu^2}\log f = -2\big[(n-1)s^2 + n(\bar{y} - \mu)^2\big]\, e^{-2\nu}$$
so that, evaluating at the mode,
$$-H(\hat{\mu}, \hat{\nu}) = \begin{pmatrix} \dfrac{n^2}{(n-1)s^2} & 0 \\ 0 & 2n \end{pmatrix}$$
and we have
$$(\mu, \nu) \;\approx\; \mathcal{N}\!\left(\begin{pmatrix} \bar{y} \\ \tfrac{1}{2}\log\big(\tfrac{n-1}{n}s^2\big) \end{pmatrix},\; \begin{pmatrix} \dfrac{(n-1)s^2}{n^2} & 0 \\ 0 & \dfrac{1}{2n} \end{pmatrix}\right)$$
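A small numerical sanity check of these formulas, on simulated data of my own choosing:

```python
import numpy as np

# Compare the analytic Laplace covariance with a finite-difference second derivative.
rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=50)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

def log_f(mu, nu):
    return -((n - 1) * s2 + n * (ybar - mu) ** 2) * np.exp(-2 * nu) / 2 - n * nu

mu_hat, nu_hat = ybar, 0.5 * np.log((n - 1) * s2 / n)
Sigma = np.diag([(n - 1) * s2 / n ** 2, 1 / (2 * n)])   # analytic Laplace covariance

h = 1e-5                                                # check the (mu, mu) entry numerically
d2_mu = (log_f(mu_hat + h, nu_hat) - 2 * log_f(mu_hat, nu_hat)
         + log_f(mu_hat - h, nu_hat)) / h ** 2
print(-1.0 / d2_mu, Sigma[0, 0])                        # the two should agree
```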

20 Limitations of Laplace method

The Laplace approximation is often too strong a simplification.

21-23 Laplace method for computing the marginal

$$P(x) = \int P(x \mid \theta)\,\pi(\theta)\, d\theta = \int \exp\left\{-N\left(-\tfrac{1}{N}\log P(x \mid \theta) - \tfrac{1}{N}\log\pi(\theta)\right)\right\} d\theta$$

Define
$$h(\theta) = -\tfrac{1}{N}\log P(x \mid \theta) - \tfrac{1}{N}\log\pi(\theta)$$
so that the integral we want to compute is of the form $\int \exp\{-N h(\theta)\}\, d\theta$. Since
$$h(\theta) \approx h(\theta^*) + \tfrac{1}{2} h''(\theta^*)(\theta - \theta^*)^2$$
we can approximate the integral as
$$\int e^{-N h(\theta)}\, d\theta \approx e^{-N h(\theta^*)} \int \exp\left\{-\tfrac{N}{2} h''(\theta^*)(\theta - \theta^*)^2\right\} d\theta.$$
Comparing to a normal pdf we have
$$\int e^{-N h(\theta)}\, d\theta \approx e^{-N h(\theta^*)}\,(2\pi)^{\frac{1}{2}}\,\big(N h''(\theta^*)\big)^{-\frac{1}{2}} = P(x \mid \theta^*)\,\pi(\theta^*)\,(2\pi)^{\frac{1}{2}}\,\big(N h''(\theta^*)\big)^{-\frac{1}{2}}$$
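As an illustration, the following sketch applies this marginal-likelihood approximation to a conjugate toy model (my own choice), where the exact answer is available for comparison:

```python
import numpy as np
from scipy.stats import norm

# Toy model: X_i ~ N(theta, 1), theta ~ N(0, 1); the log-joint is quadratic in theta.
rng = np.random.default_rng(1)
X = rng.normal(0.3, 1.0, size=25)
N = len(X)

log_joint = lambda t: norm.logpdf(t, 0, 1) + norm.logpdf(X, t, 1).sum()

theta_star = X.sum() / (N + 1)              # mode of the log-joint (exact here)
curv = -(N + 1)                             # second derivative of the log-joint

log_pX_laplace = log_joint(theta_star) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-curv)

# Exact log marginal likelihood: log p(X) = log p(theta0, X) - log p(theta0 | X)
post_mean, post_sd = X.sum() / (N + 1), np.sqrt(1.0 / (N + 1))
log_pX_exact = log_joint(0.0) - norm.logpdf(0.0, post_mean, post_sd)
print(log_pX_laplace, log_pX_exact)         # equal here, since the log-joint is Gaussian in theta
```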

24 Laplace's method

For a d-dimensional function the analogue of this result is
$$\int e^{-N f(x)}\, dx \approx e^{-N f(x_0)}\,(2\pi)^{\frac{d}{2}}\, N^{-\frac{d}{2}}\, |f''(x_0)|^{-\frac{1}{2}}$$
where $x_0$ is the minimiser of $f$ and $|f''(x_0)|$ is the determinant of the Hessian of the function evaluated at $x_0$.

25-28 Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) takes the approximation one step further, essentially by minimizing the impact of the prior.

Firstly, the MAP estimate $\theta^*$ is replaced by the MLE $\hat{\theta}$, which is reasonable if the prior has a small effect.

Secondly, BIC only retains the terms that vary with $N$, since asymptotically the terms that are constant in $N$ do not matter.

Dropping the constant terms we get
$$\log P(X) \approx \log P(X \mid \hat{\theta}) - \frac{d}{2}\log N$$
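A minimal sketch of using this approximation in practice, for an iid normal model with simulated data (the model, data, and names below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# BIC approximation to log P(X) for an iid normal model with mean and variance
# estimated by maximum likelihood (d = 2 free parameters).
rng = np.random.default_rng(2)
X = rng.normal(1.0, 2.0, size=200)
N, d = len(X), 2

mu_mle, sigma_mle = X.mean(), X.std()                  # ML estimates (1/N variance)
loglik = norm.logpdf(X, mu_mle, sigma_mle).sum()

bic_approx_log_marginal = loglik - 0.5 * d * np.log(N)
print(bic_approx_log_marginal)
```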

29 Bayesian Information Criterion (BIC) - extra details

Why can we ignore the term $\frac{1}{2}\log|f''(\hat{\theta})|^{-1}$? Assume (as above) that we can ignore the prior, i.e. $P(\theta) = 1$, and that the data points $X_1, \ldots, X_N$ are iid. Then
$$f''(\hat{\theta}) = -\frac{1}{N}\, D^2_\theta \log P(X \mid \theta)\Big|_{\theta = \hat{\theta}} = -\frac{1}{N}\sum_{i=1}^N D^2_\theta \log P(X_i \mid \theta)\Big|_{\theta = \hat{\theta}}$$
The thing to notice about this term is that it is now an average over the data points of single-observation log-likelihood curvatures.

30 Bayesian Information Criterion (BIC) - extra details

Now consider the random variables $D^2_\theta \log P(X_i \mid \theta)$ and apply the WLLN. The $(m, n)$th element of $f''(\hat{\theta})$ satisfies
$$\big[f''(\hat{\theta})\big]_{mn} \approx -\,\mathbb{E}\left[\frac{\partial^2}{\partial\theta_m\,\partial\theta_n}\log P(X_i \mid \theta)\right]_{\theta = \hat{\theta}}$$
and these are constants, i.e. expected log-likelihood curvatures for a single data point. So $f''(\hat{\theta})$ is constant in $N$, and can be ignored in the BIC approximation.

31-32 Variational Bayes

The idea of VB is to find an approximation $Q(\theta)$ to a given posterior distribution $P(\theta \mid X)$. That is,
$$Q(\theta) \approx P(\theta \mid X)$$
where $\theta$ is the vector of parameters. We then use $Q(\theta)$ to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood.

Question How do we find a good approximate posterior $Q(\theta)$?

33-35 Kullback-Leibler (KL) divergence

The strategy we take is to find a distribution $Q(\theta)$ that minimizes a measure of distance between $Q(\theta)$ and the posterior $P(\theta \mid X)$.

Definition The Kullback-Leibler divergence $KL(q \,\|\, p)$ between two distributions $q(x)$ and $p(x)$ is
$$KL(q \,\|\, p) = \int \log\left[\frac{q(x)}{p(x)}\right] q(x)\, dx$$

[Figure: the densities $q(x)$ and $p(x)$, and the integrand $q(x)\log(q(x)/p(x))$.]

Exercise Show that $KL(q \,\|\, p) \geq 0$, with $KL(q \,\|\, p) = 0$ iff $q = p$.

36 N(µ, σ²) approximations to a Gamma(10,1)

[Figure: four normal approximations to a Gamma(10, 1) density, with panels µ = 10, σ² = 4; µ = 9.11, σ² = 3.03; µ = 13, σ² = 2.23; and µ = 9, σ² = 2, each annotated with its KL divergence from the target.]
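KL values like those in this figure can be reproduced by numerical integration; the sketch below is my own reconstruction, not the code used for the slide. (Strictly, the divergence is infinite because a normal $q$ puts some mass on $x \leq 0$; the integral over $x > 0$ captures the relevant finite part when that mass is negligible.)

```python
import numpy as np
from scipy.stats import norm, gamma
from scipy.integrate import quad

# KL(q || p) between a normal q(x) and a Gamma(10, 1) target p(x), integrating over x > 0.
def kl_normal_vs_gamma(mu, sigma2, shape=10.0):
    q, p = norm(mu, np.sqrt(sigma2)), gamma(shape)      # scale = 1 by default
    integrand = lambda x: q.pdf(x) * (q.logpdf(x) - p.logpdf(x))
    val, _ = quad(integrand, 1e-8, 60.0)
    return val

for mu, s2 in [(10, 4), (9.11, 3.03), (13, 2.23), (9, 2)]:
    print(mu, s2, kl_normal_vs_gamma(mu, s2))
```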

37-41 We consider the KL divergence between $Q(\theta)$ and $P(\theta \mid X)$:

$$\begin{aligned}
KL(Q(\theta) \,\|\, P(\theta \mid X)) &= \int \log\left[\frac{Q(\theta)}{P(\theta \mid X)}\right] Q(\theta)\, d\theta \\
&= \int \log\left[\frac{Q(\theta)\, P(X)}{P(\theta, X)}\right] Q(\theta)\, d\theta \\
&= \log P(X) - \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta
\end{aligned}$$

The log marginal likelihood can then be written as
$$\log P(X) = F(Q(\theta)) + KL(Q(\theta) \,\|\, P(\theta \mid X)) \qquad (1)$$
where
$$F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta.$$

Note Since $KL(q \,\|\, p) \geq 0$ we have $\log P(X) \geq F(Q(\theta))$, so that $F(Q(\theta))$ is a lower bound on the log-marginal likelihood.
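Identity (1) can be checked numerically on a conjugate toy model where both the posterior and $\log P(X)$ are known exactly; the model, data and the deliberately mismatched $Q$ below are illustrative assumptions of mine.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Toy model: X_i ~ N(theta, 1), theta ~ N(0, 1), so posterior and log P(X) are exact.
rng = np.random.default_rng(3)
X = rng.normal(0.5, 1.0, size=20)
N = len(X)

post_sd = np.sqrt(1.0 / (1.0 + N))
post_mean = X.sum() / (1.0 + N)

log_joint = lambda t: norm.logpdf(t, 0, 1) + norm.logpdf(X, t, 1).sum()
log_pX = log_joint(0.0) - norm.logpdf(0.0, post_mean, post_sd)   # exact log marginal

q = norm(0.3, 0.5)                                               # a deliberately "wrong" Q(theta)
F, _ = quad(lambda t: q.pdf(t) * (log_joint(t) - q.logpdf(t)), -10, 10)
KL, _ = quad(lambda t: q.pdf(t) * (q.logpdf(t) - norm.logpdf(t, post_mean, post_sd)), -10, 10)

print(log_pX, F + KL)                                            # the two sides of (1) agree
```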

42-46 The mean field approximation

We now need to ask what form $Q(\theta)$ should take.

The most widely used approximation is known as the mean field approximation and assumes only that the approximate posterior has a factorized form
$$Q(\theta) = \prod_i Q(\theta_i)$$

The VB algorithm iteratively maximises $F(Q(\theta))$ with respect to the free distributions $Q(\theta_i)$, which is coordinate ascent in the function space of variational distributions.

We refer to each $Q(\theta_i)$ as a VB component. We update each component $Q(\theta_i)$ in turn, keeping $Q(\theta_j)$, $j \neq i$, fixed.

47-50 VB components

Lemma The VB components take the form
$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big(\log P(X, \theta)\big) + \text{const}$$

Proof Writing $Q(\theta) = Q(\theta_i)\, Q(\theta_{-i})$ where $\theta_{-i} = \theta \setminus \theta_i$, the lower bound can be re-written as
$$\begin{aligned}
F(Q(\theta)) &= \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta \\
&= \int\!\!\int \log\left[\frac{P(\theta, X)}{Q(\theta_i)\, Q(\theta_{-i})}\right] Q(\theta_i)\, Q(\theta_{-i})\, d\theta_i\, d\theta_{-i} \\
&= \int Q(\theta_i)\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right] d\theta_i
 - \int\!\!\int Q(\theta_i)\, Q(\theta_{-i}) \log Q(\theta_i)\, d\theta_i\, d\theta_{-i}
 - \int\!\!\int Q(\theta_i)\, Q(\theta_{-i}) \log Q(\theta_{-i})\, d\theta_i\, d\theta_{-i}
\end{aligned}$$

51-54 Continuing,
$$F(Q(\theta)) = \int Q(\theta_i)\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right] d\theta_i
 - \int Q(\theta_i)\log Q(\theta_i)\, d\theta_i
 - \sum_{j \neq i}\int Q(\theta_j)\log Q(\theta_j)\, d\theta_j$$

If we let
$$Q^*(\theta_i) = \frac{1}{Z}\exp\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right]$$
where $Z$ is a normalising constant, and write $H(Q(\theta_j)) = -\int Q(\theta_j)\log Q(\theta_j)\, d\theta_j$ for the entropy of $Q(\theta_j)$, then
$$F(Q(\theta)) = \int Q(\theta_i)\log\frac{Q^*(\theta_i)}{Q(\theta_i)}\, d\theta_i + \log Z + \sum_{j \neq i} H(Q(\theta_j))
 = -KL(Q(\theta_i) \,\|\, Q^*(\theta_i)) + \log Z + \sum_{j \neq i} H(Q(\theta_j))$$

55-57
$$F(Q(\theta)) = -KL(Q(\theta_i) \,\|\, Q^*(\theta_i)) + \log Z + \sum_{j \neq i} H(Q(\theta_j))$$

We then see that $F(Q(\theta))$ is maximised when $Q(\theta_i) = Q^*(\theta_i)$, as this choice minimises the Kullback-Leibler divergence term.

Thus the update for $Q(\theta_i)$ is given by
$$Q(\theta_i) \propto \exp\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right]$$
or
$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big(\log P(X, \theta)\big) + \text{const}$$

58-62 VB algorithm

This implies a straightforward algorithm for variational inference:

1 Initialize all approximate posteriors (in the example below, $Q(\theta) = Q(\mu)Q(\tau)$), e.g. by setting them to their priors.
2 Cycle over the parameters, revising each given the current estimates of the others.
3 Loop until convergence.

Convergence is checked by calculating the VB lower bound at each step, i.e.
$$F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta$$
The precise form of this term needs to be derived, and can be quite tricky. A generic sketch of the resulting loop is given below.
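In code, the algorithm is just a coordinate-ascent loop; the sketch below is schematic, with `update_component` and `elbo` as hypothetical model-specific placeholders.

```python
# Schematic shape of the VB loop above; update_component and elbo are model-specific
# placeholders, to be filled in as in the worked example that follows.
def cavi(components, update_component, elbo, max_iter=100, tol=1e-8):
    """Coordinate ascent on F(Q): revise each Q(theta_i) in turn, monitor the bound."""
    bounds = []
    for _ in range(max_iter):
        for name in components:                       # step 2: cycle over the parameters
            components[name] = update_component(name, components)
        bounds.append(elbo(components))               # VB lower bound F(Q(theta))
        if len(bounds) > 1 and abs(bounds[-1] - bounds[-2]) < tol:
            break                                     # step 3: stop at convergence
    return components, bounds
```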

63-67 Example 1

Consider applying VB to the hierarchical model
$$X_i \sim N(\mu, \tau^{-1}), \quad i = 1, \ldots, P, \qquad \mu \sim N\big(m, (\tau\beta)^{-1}\big), \qquad \tau \sim \Gamma(a, b)$$

Note We are using a prior of the form $\pi(\tau, \mu) = \pi(\mu \mid \tau)\,\pi(\tau)$.

Let $\theta = (\mu, \tau)$ and assume $Q(\theta) = Q(\mu)Q(\tau)$. We will use the notation $\langle\theta_i\rangle = \mathbb{E}_{Q(\theta_i)}[\theta_i]$ (and similarly $\langle\theta_i^2\rangle$).

The log joint density is
$$\log P(X, \theta) = \frac{P}{2}\log\tau - \frac{\tau}{2}\sum_{i=1}^P (X_i - \mu)^2 + \frac{1}{2}\log\tau - \frac{\tau\beta}{2}(\mu - m)^2 + (a-1)\log\tau - b\tau + K$$

68-72 We can derive the VB updates one at a time. We start with $Q(\mu)$.

Note We just need to focus on terms of the log joint density involving $\mu$:
$$\log Q(\mu) = \mathbb{E}_{Q(\tau)}\big(\log P(X, \theta)\big) + C
 = -\frac{\langle\tau\rangle}{2}\left(\sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2\right) + C$$
where $\langle\tau\rangle = \mathbb{E}_{Q(\tau)}(\tau)$. We will be able to determine $\langle\tau\rangle$ when we derive the other component of the approximate density, $Q(\tau)$.

73-76 We can see this log density has the form of a normal distribution:
$$\log Q(\mu) = -\frac{\langle\tau\rangle}{2}\left(\sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2\right) + C
 = -\frac{\beta^*}{2}(\mu - m^*)^2 + C'$$
where
$$\beta^* = (\beta + P)\,\langle\tau\rangle, \qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$
Thus $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$.

77-79 The second component of the VB approximation is derived from the terms of the log joint density involving $\tau$:
$$\log Q(\tau) = \mathbb{E}_{Q(\mu)}\big(\log P(X, \theta)\big) + C
 = \left(\frac{P+1}{2} + a - 1\right)\log\tau - \frac{\tau}{2}\Big\langle\sum_{i=1}^P (X_i - \mu)^2\Big\rangle - \frac{\tau\beta}{2}\big\langle(\mu - m)^2\big\rangle - b\tau + C$$
where $\langle\cdot\rangle$ denotes expectation under $Q(\mu)$.

80-81 We can see this log density has the form of a gamma distribution, namely $\Gamma(a^*, b^*)$ where
$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\left(\sum_{i=1}^P X_i^2 - 2\langle\mu\rangle\sum_{i=1}^P X_i + P\langle\mu^2\rangle\right) + \frac{\beta}{2}\left(m^2 - 2m\langle\mu\rangle + \langle\mu^2\rangle\right)$$

82-84 So overall we have

1 $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$ where
$$\beta^* = (\beta + P)\,\langle\tau\rangle \qquad (2) \qquad\qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P} \qquad (3)$$

2 $Q(\tau) = \Gamma(\tau \mid a^*, b^*)$ where
$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\left(\sum_{i=1}^P X_i^2 - 2\langle\mu\rangle\sum_{i=1}^P X_i + P\langle\mu^2\rangle\right) + \frac{\beta}{2}\left(m^2 - 2m\langle\mu\rangle + \langle\mu^2\rangle\right)$$

To calculate these we need
$$\langle\tau\rangle = \frac{a^*}{b^*}, \qquad \langle\mu\rangle = m^*, \qquad \langle\mu^2\rangle = \beta^{*-1} + m^{*2}.$$
A minimal implementation of these coupled updates is sketched below.
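These coupled updates give a very short algorithm: because $\beta^*$ depends on $\langle\tau\rangle$ and $b^*$ on $\langle\mu\rangle, \langle\mu^2\rangle$, the updates are iterated until they stabilise. The sketch below uses illustrative data and prior hyperparameters of my own choosing.

```python
import numpy as np

# Minimal sketch of the coupled VB updates for Example 1 (illustrative data and priors).
rng = np.random.default_rng(4)
X = rng.normal(2.0, 1.0, size=30)
P = len(X)
m, beta, a, b = 0.0, 1.0, 1.0, 1.0            # prior hyperparameters

E_tau = a / b                                  # initialise <tau> from the prior
for _ in range(50):
    # Q(mu) = N(m_star, 1 / beta_star)
    beta_star = (beta + P) * E_tau
    m_star = (beta * m + X.sum()) / (beta + P)
    E_mu, E_mu2 = m_star, 1.0 / beta_star + m_star ** 2

    # Q(tau) = Gamma(a_star, b_star)
    a_star = a + (P + 1) / 2
    b_star = (b
              + 0.5 * (np.sum(X ** 2) - 2 * E_mu * X.sum() + P * E_mu2)
              + 0.5 * beta * (m ** 2 - 2 * m * E_mu + E_mu2))
    E_tau = a_star / b_star

print(m_star, 1.0 / beta_star)                 # VB posterior mean and variance of mu
print(a_star, b_star)                          # VB posterior shape and rate for tau
```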

85 Example 1

For this model the exact posterior was calculated in Lecture 6:
$$\pi(\tau, \mu \mid X) \propto \tau^{a' - \frac{1}{2}} \exp\left(-\tau\Big\{b' + \frac{\beta'}{2}(m' - \mu)^2\Big\}\right)$$
where
$$a' = a + \frac{P}{2}, \qquad b' = b + \frac{1}{2}\sum_{i=1}^P (X_i - \bar{X})^2 + \frac{P\beta}{2(P+\beta)}(\bar{X} - m)^2, \qquad \beta' = \beta + P, \qquad m' = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$
We note some similarity between the VB updates and the true posterior parameters.

86 We can compare the true and VB posteriors when applied to a real dataset. We see that the VB approximation underestimates the posterior variances.

[Figure: true posterior density versus VB posterior density for each parameter; the VB densities are visibly narrower than the true posteriors.]

87-90 General comments

The property of VB underestimating the variance in the posterior is a general feature of the method whenever there is correlation between the $\theta_i$'s in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison, i.e. comparing the approximate marginal likelihoods between models.

VB is often much, much faster to run than MCMC or other sampling-based methods.

The VB updates and lower bound can be tricky to derive, and sometimes further approximation is needed.

The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought or known to be multi-modal.
