Foundations of Statistical Inference


1 Foundations of Statistical Inference. Julien Berestycki, Department of Statistics, University of Oxford. MT 2016.

2 Lecture 14 : Variational Bayes. "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." John W. Tukey

3-5 Laplace approximation

The Laplace approximation provides a way of approximating a density whose normalisation constant we cannot evaluate, by fitting a Gaussian distribution to its mode. [Portrait: Pierre-Simon Laplace (1749-1827)]

$$p(z) = \frac{1}{Z}\, f(z)$$

where $p(z)$ is the probability density, $Z$ is an unknown constant, and $f(z)$ is the main part of the density (easy to evaluate).

Observe that this is exactly the situation we face in Bayesian inference:

$$\underbrace{p(\theta \mid y)}_{\text{posterior density}} = \underbrace{\frac{1}{p(y)}}_{\text{marginal dist.}}\; \underbrace{p(\theta, y)}_{\text{joint proba. (likelihood $\times$ prior)}}$$

6-11 Deriving Laplace approximation

Idea: 2nd order Taylor approximation to $\ell(\theta) = \log p(y, \theta)$ around the mode $\theta^*$:

$$\ell(\theta) \approx \ell(\theta^*) + \underbrace{\ell'(\theta^*)}_{=0}(\theta - \theta^*) + \tfrac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2 = \ell(\theta^*) + \tfrac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2$$

Recognize a Gaussian density:

$$\log \mathcal{N}(\theta \mid \mu, \sigma^2) = -\log\sigma - \tfrac{1}{2}\log 2\pi - \tfrac{1}{2\sigma^2}(\theta - \mu)^2$$

So approximate the posterior by $q(\theta) = \mathcal{N}(\theta \mid \mu, \sigma^2)$ with $\mu = \theta^*$ (mode of the log-posterior) and $\sigma^2 = \big[-\ell''(\theta^*)\big]^{-1}$ (inverse of the negative curvature at the mode).
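The derivation translates directly into a few lines of numerical code. Below is a minimal sketch, with an illustrative unnormalised Beta(5, 3) target and a finite-difference second derivative (both my own choices, not from the slides).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative target: an unnormalised Beta(5, 3) density, using only log f(theta).
a, b = 5.0, 3.0
log_f = lambda t: (a - 1) * np.log(t) + (b - 1) * np.log(1 - t)

# Mode theta* of the log density
res = minimize_scalar(lambda t: -log_f(t), bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_star = res.x

# Second derivative l''(theta*) by central differences
h = 1e-5
l2 = (log_f(theta_star + h) - 2 * log_f(theta_star) + log_f(theta_star - h)) / h ** 2

mu, sigma2 = theta_star, -1.0 / l2          # q(theta) = N(mu, sigma2)
print(mu, sigma2)                            # mode is (a - 1) / (a + b - 2) = 2/3
```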

12-13 Computing integrals

More generally, assume $f(x)$ has a unique global maximum at $x_0$. Then

$$f(x) \approx f(x_0) - \tfrac{1}{2}\,|f''(x_0)|\,(x - x_0)^2,$$

so

$$\int_a^b e^{N f(x)}\, dx \approx e^{N f(x_0)} \int_a^b e^{-N |f''(x_0)| (x - x_0)^2/2}\, dx.$$

To obtain

Lemma
$$\int_a^b e^{N f(x)}\, dx \sim \sqrt{\frac{2\pi}{N\,|f''(x_0)|}}\; e^{N f(x_0)} \quad \text{as } N \to \infty.$$

The Laplace approximation becomes better as $N$ grows.
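A quick numerical check of the lemma, for an illustrative $f$ of my own choosing:

```python
import numpy as np
from scipy.integrate import quad

# Check the lemma for f(x) = -(x - 1)^2 on [a, b] = [0, 3], so x0 = 1 and f''(x0) = -2.
f = lambda x: -(x - 1.0) ** 2
x0, f2 = 1.0, -2.0

for N in (5, 50, 500):
    exact, _ = quad(lambda x: np.exp(N * f(x)), 0.0, 3.0)
    approx = np.sqrt(2 * np.pi / (N * abs(f2))) * np.exp(N * f(x0))
    print(N, exact, approx)                  # the ratio tends to 1 as N grows
```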

14 In dimension d > 1

If $x \in \mathbb{R}^d$ then the Taylor expansion becomes
$$f(x) \approx f(x_0) + \tfrac{1}{2}(x - x_0)^T H (x - x_0)$$
where $H$ is the Hessian matrix of second derivatives of $f$ at $x_0$. In that case it can be shown that

Lemma
$$\int e^{N f(x)}\, dx \sim \left(\frac{2\pi}{N}\right)^{d/2} |\det H(x_0)|^{-1/2}\, e^{N f(x_0)} \quad \text{as } N \to \infty.$$

15-17 Using Laplace approximation

Given a model with $\theta = (\theta_1, \ldots, \theta_p)$:

Step 1 Find the mode of the log-joint (= MAP estimate of $\theta$):
$$\theta^* = \operatorname*{argmax}_\theta \log p(\theta, y)$$

Step 2 Evaluate the curvature of the log-joint at the mode:
$$H = D^2 \log p(\theta, y)\big|_{\theta = \theta^*} \quad \text{(the Hessian matrix)}$$

Step 3 Obtain the Gaussian approximation
$$\mathcal{N}(\theta \mid \mu, \Sigma), \qquad \mu = \theta^*, \quad \Sigma = (-H)^{-1}.$$
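Here is a hedged sketch of Steps 1-3 in Python; the bivariate log-joint, the finite-difference Hessian and all variable names are illustrative assumptions rather than anything prescribed in the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative log-joint: an unnormalised correlated Gaussian in two parameters.
def log_joint(theta):
    x, y = theta
    return -0.5 * (x ** 2 + 2.0 * (y - 1.0) ** 2 + x * (y - 1.0))

# Step 1: MAP estimate (mode of the log-joint)
theta_star = minimize(lambda t: -log_joint(t), x0=np.zeros(2)).x

# Step 2: Hessian of the log-joint at the mode, by central differences
def hessian(f, x, h=1e-5):
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

H = hessian(log_joint, theta_star)

# Step 3: Gaussian approximation N(theta_star, Sigma) with Sigma = (-H)^{-1}
Sigma = np.linalg.inv(-H)
print(theta_star, Sigma)
```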

18 Example

Suppose the $y_i$ are iid $N(\mu, \sigma^2)$ with a flat prior on $\mu$ and on $\log\sigma$. The posterior is
$$p(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}-1} \exp\left(-\frac{(n-1)s^2 + n(\bar{y} - \mu)^2}{2\sigma^2}\right)$$
where $\bar{y} = \frac{1}{n}\sum y_i$ and $s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$. Writing $\nu = \log\sigma$ we get
$$p(\mu, \nu \mid y) \propto f(\mu, \nu) = \exp\left(-\frac{(n-1)s^2 + n(\bar{y} - \mu)^2}{2 e^{2\nu}} - n\nu\right)$$

19 Example

It is easy to check that
$$(\hat{\mu}, \hat{\nu}) = \operatorname{mode}(\mu, \nu \mid y) = \left(\bar{y},\; \tfrac{1}{2}\log\Big(\tfrac{n-1}{n}\, s^2\Big)\right)$$
Second order derivatives are
$$\frac{\partial^2}{\partial\mu^2}\log f = -n e^{-2\nu}, \qquad \frac{\partial^2}{\partial\mu\,\partial\nu}\log f = -2n(\bar{y} - \mu)e^{-2\nu}$$
and
$$\frac{\partial^2}{\partial\nu^2}\log f = -2\big[(n-1)s^2 + n(\bar{y} - \mu)^2\big]\, e^{-2\nu}$$
so that, evaluating at the mode,
$$-H(\hat{\mu}, \hat{\nu}) = \begin{pmatrix} \dfrac{n^2}{(n-1)s^2} & 0 \\ 0 & 2n \end{pmatrix}$$
and we have
$$(\mu, \nu) \;\approx\; \mathcal{N}\!\left(\begin{pmatrix} \bar{y} \\ \tfrac{1}{2}\log\big(\tfrac{n-1}{n}s^2\big) \end{pmatrix},\; \begin{pmatrix} \dfrac{(n-1)s^2}{n^2} & 0 \\ 0 & \dfrac{1}{2n} \end{pmatrix}\right)$$
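A small numerical sanity check of these formulas, on simulated data of my own choosing:

```python
import numpy as np

# Compare the analytic Laplace covariance with a finite-difference second derivative.
rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.5, size=50)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

def log_f(mu, nu):
    return -((n - 1) * s2 + n * (ybar - mu) ** 2) * np.exp(-2 * nu) / 2 - n * nu

mu_hat, nu_hat = ybar, 0.5 * np.log((n - 1) * s2 / n)
Sigma = np.diag([(n - 1) * s2 / n ** 2, 1 / (2 * n)])   # analytic Laplace covariance

h = 1e-5                                                # check the (mu, mu) entry numerically
d2_mu = (log_f(mu_hat + h, nu_hat) - 2 * log_f(mu_hat, nu_hat)
         + log_f(mu_hat - h, nu_hat)) / h ** 2
print(-1.0 / d2_mu, Sigma[0, 0])                        # the two should agree
```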

20 Limitations of Laplace method

The Laplace approximation is often too strong a simplification.

21-23 Laplace method for computing the marginal

$$P(x) = \int P(x \mid \theta)\,\pi(\theta)\, d\theta = \int \exp\left\{-N\left(-\tfrac{1}{N}\log P(x \mid \theta) - \tfrac{1}{N}\log\pi(\theta)\right)\right\} d\theta$$

Define
$$h(\theta) = -\tfrac{1}{N}\log P(x \mid \theta) - \tfrac{1}{N}\log\pi(\theta)$$
so that the integral we want to compute is of the form $\int \exp\{-N h(\theta)\}\, d\theta$. Since
$$h(\theta) \approx h(\theta^*) + \tfrac{1}{2} h''(\theta^*)(\theta - \theta^*)^2$$
we can approximate the integral as
$$\int e^{-N h(\theta)}\, d\theta \approx e^{-N h(\theta^*)} \int \exp\left\{-\tfrac{N}{2} h''(\theta^*)(\theta - \theta^*)^2\right\} d\theta.$$
Comparing to a normal pdf we have
$$\int e^{-N h(\theta)}\, d\theta \approx e^{-N h(\theta^*)}\,(2\pi)^{\frac{1}{2}}\,\big(N h''(\theta^*)\big)^{-\frac{1}{2}} = P(x \mid \theta^*)\,\pi(\theta^*)\,(2\pi)^{\frac{1}{2}}\,\big(N h''(\theta^*)\big)^{-\frac{1}{2}}$$
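As an illustration, the following sketch applies this marginal-likelihood approximation to a conjugate toy model (my own choice), where the exact answer is available for comparison:

```python
import numpy as np
from scipy.stats import norm

# Toy model: X_i ~ N(theta, 1), theta ~ N(0, 1); the log-joint is quadratic in theta.
rng = np.random.default_rng(1)
X = rng.normal(0.3, 1.0, size=25)
N = len(X)

log_joint = lambda t: norm.logpdf(t, 0, 1) + norm.logpdf(X, t, 1).sum()

theta_star = X.sum() / (N + 1)              # mode of the log-joint (exact here)
curv = -(N + 1)                             # second derivative of the log-joint

log_pX_laplace = log_joint(theta_star) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-curv)

# Exact log marginal likelihood: log p(X) = log p(theta0, X) - log p(theta0 | X)
post_mean, post_sd = X.sum() / (N + 1), np.sqrt(1.0 / (N + 1))
log_pX_exact = log_joint(0.0) - norm.logpdf(0.0, post_mean, post_sd)
print(log_pX_laplace, log_pX_exact)         # equal here, since the log-joint is Gaussian in theta
```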

24 Laplace's method

For a d-dimensional function the analogue of this result is
$$\int e^{-N f(x)}\, dx \approx e^{-N f(x_0)}\,(2\pi)^{\frac{d}{2}}\, N^{-\frac{d}{2}}\, |f''(x_0)|^{-\frac{1}{2}}$$
where $x_0$ is the minimiser of $f$ and $|f''(x_0)|$ is the determinant of the Hessian of the function evaluated at $x_0$.

25-28 Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) takes the approximation one step further, essentially by minimizing the impact of the prior.

Firstly, the MAP estimate $\theta^*$ is replaced by the MLE $\hat{\theta}$, which is reasonable if the prior has a small effect.

Secondly, BIC only retains the terms that vary with $N$, since asymptotically the terms that are constant in $N$ do not matter.

Dropping the constant terms we get
$$\log P(X) \approx \log P(X \mid \hat{\theta}) - \frac{d}{2}\log N$$
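A minimal sketch of using this approximation in practice, for an iid normal model with simulated data (the model, data, and names below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# BIC approximation to log P(X) for an iid normal model with mean and variance
# estimated by maximum likelihood (d = 2 free parameters).
rng = np.random.default_rng(2)
X = rng.normal(1.0, 2.0, size=200)
N, d = len(X), 2

mu_mle, sigma_mle = X.mean(), X.std()                  # ML estimates (1/N variance)
loglik = norm.logpdf(X, mu_mle, sigma_mle).sum()

bic_approx_log_marginal = loglik - 0.5 * d * np.log(N)
print(bic_approx_log_marginal)
```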

29 Bayesian Information Criterion (BIC) - extra details

Why can we ignore the term $\frac{1}{2}\log|f''(\hat{\theta})|^{-1}$? Assume (as above) that we can ignore the prior, i.e. $P(\theta) = 1$, and that the data points $X_1, \ldots, X_N$ are iid. Then
$$f''(\hat{\theta}) = -\frac{1}{N}\, D^2_\theta \log P(X \mid \theta)\Big|_{\theta = \hat{\theta}} = -\frac{1}{N}\sum_{i=1}^N D^2_\theta \log P(X_i \mid \theta)\Big|_{\theta = \hat{\theta}}$$
The thing to notice about this term is that it is now an average over the data points of single-observation log-likelihood curvatures.

30 Bayesian Information Criterion (BIC) - extra details

Now consider the random variables $D^2_\theta \log P(X_i \mid \theta)$ and apply the WLLN. The $(m, n)$th element of $f''(\hat{\theta})$ satisfies
$$\big[f''(\hat{\theta})\big]_{mn} \approx -\,\mathbb{E}\left[\frac{\partial^2}{\partial\theta_m\,\partial\theta_n}\log P(X_i \mid \theta)\right]_{\theta = \hat{\theta}}$$
and these are constants, i.e. expected log-likelihood curvatures for a single data point. So $f''(\hat{\theta})$ is constant in $N$, and can be ignored in the BIC approximation.

31-32 Variational Bayes

The idea of VB is to find an approximation $Q(\theta)$ to a given posterior distribution $P(\theta \mid X)$. That is,
$$Q(\theta) \approx P(\theta \mid X)$$
where $\theta$ is the vector of parameters. We then use $Q(\theta)$ to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood.

Question How do we find a good approximate posterior $Q(\theta)$?

33-35 Kullback-Leibler (KL) divergence

The strategy we take is to find a distribution $Q(\theta)$ that minimizes a measure of distance between $Q(\theta)$ and the posterior $P(\theta \mid X)$.

Definition The Kullback-Leibler divergence $KL(q \,\|\, p)$ between two distributions $q(x)$ and $p(x)$ is
$$KL(q \,\|\, p) = \int \log\left[\frac{q(x)}{p(x)}\right] q(x)\, dx$$

[Figure: the densities $q(x)$ and $p(x)$, and the integrand $q(x)\log(q(x)/p(x))$.]

Exercise Show that $KL(q \,\|\, p) \geq 0$, with $KL(q \,\|\, p) = 0$ iff $q = p$.

36 N(µ, σ²) approximations to a Gamma(10,1)

[Figure: four normal approximations to a Gamma(10, 1) density, with panels µ = 10, σ² = 4; µ = 9.11, σ² = 3.03; µ = 13, σ² = 2.23; and µ = 9, σ² = 2, each annotated with its KL divergence from the target.]
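KL values like those in this figure can be reproduced by numerical integration; the sketch below is my own reconstruction, not the code used for the slide. (Strictly, the divergence is infinite because a normal $q$ puts some mass on $x \leq 0$; the integral over $x > 0$ captures the relevant finite part when that mass is negligible.)

```python
import numpy as np
from scipy.stats import norm, gamma
from scipy.integrate import quad

# KL(q || p) between a normal q(x) and a Gamma(10, 1) target p(x), integrating over x > 0.
def kl_normal_vs_gamma(mu, sigma2, shape=10.0):
    q, p = norm(mu, np.sqrt(sigma2)), gamma(shape)      # scale = 1 by default
    integrand = lambda x: q.pdf(x) * (q.logpdf(x) - p.logpdf(x))
    val, _ = quad(integrand, 1e-8, 60.0)
    return val

for mu, s2 in [(10, 4), (9.11, 3.03), (13, 2.23), (9, 2)]:
    print(mu, s2, kl_normal_vs_gamma(mu, s2))
```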

37-41 We consider the KL divergence between $Q(\theta)$ and $P(\theta \mid X)$:

$$\begin{aligned}
KL(Q(\theta) \,\|\, P(\theta \mid X)) &= \int \log\left[\frac{Q(\theta)}{P(\theta \mid X)}\right] Q(\theta)\, d\theta \\
&= \int \log\left[\frac{Q(\theta)\, P(X)}{P(\theta, X)}\right] Q(\theta)\, d\theta \\
&= \log P(X) - \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta
\end{aligned}$$

The log marginal likelihood can then be written as
$$\log P(X) = F(Q(\theta)) + KL(Q(\theta) \,\|\, P(\theta \mid X)) \qquad (1)$$
where
$$F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta.$$

Note Since $KL(q \,\|\, p) \geq 0$ we have $\log P(X) \geq F(Q(\theta))$, so that $F(Q(\theta))$ is a lower bound on the log-marginal likelihood.
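Identity (1) can be checked numerically on a conjugate toy model where both the posterior and $\log P(X)$ are known exactly; the model, data and the deliberately mismatched $Q$ below are illustrative assumptions of mine.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Toy model: X_i ~ N(theta, 1), theta ~ N(0, 1), so posterior and log P(X) are exact.
rng = np.random.default_rng(3)
X = rng.normal(0.5, 1.0, size=20)
N = len(X)

post_sd = np.sqrt(1.0 / (1.0 + N))
post_mean = X.sum() / (1.0 + N)

log_joint = lambda t: norm.logpdf(t, 0, 1) + norm.logpdf(X, t, 1).sum()
log_pX = log_joint(0.0) - norm.logpdf(0.0, post_mean, post_sd)   # exact log marginal

q = norm(0.3, 0.5)                                               # a deliberately "wrong" Q(theta)
F, _ = quad(lambda t: q.pdf(t) * (log_joint(t) - q.logpdf(t)), -10, 10)
KL, _ = quad(lambda t: q.pdf(t) * (q.logpdf(t) - norm.logpdf(t, post_mean, post_sd)), -10, 10)

print(log_pX, F + KL)                                            # the two sides of (1) agree
```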

42-46 The mean field approximation

We now need to ask what form $Q(\theta)$ should take.

The most widely used approximation is known as the mean field approximation and assumes only that the approximate posterior has a factorized form
$$Q(\theta) = \prod_i Q(\theta_i)$$

The VB algorithm iteratively maximises $F(Q(\theta))$ with respect to the free distributions $Q(\theta_i)$, which is coordinate ascent in the function space of variational distributions.

We refer to each $Q(\theta_i)$ as a VB component. We update each component $Q(\theta_i)$ in turn, keeping $Q(\theta_j)$, $j \neq i$, fixed.

47-50 VB components

Lemma The VB components take the form
$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big(\log P(X, \theta)\big) + \text{const}$$

Proof Writing $Q(\theta) = Q(\theta_i)\, Q(\theta_{-i})$ where $\theta_{-i} = \theta \setminus \theta_i$, the lower bound can be re-written as
$$\begin{aligned}
F(Q(\theta)) &= \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta \\
&= \int\!\!\int \log\left[\frac{P(\theta, X)}{Q(\theta_i)\, Q(\theta_{-i})}\right] Q(\theta_i)\, Q(\theta_{-i})\, d\theta_i\, d\theta_{-i} \\
&= \int Q(\theta_i)\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right] d\theta_i
 - \int\!\!\int Q(\theta_i)\, Q(\theta_{-i}) \log Q(\theta_i)\, d\theta_i\, d\theta_{-i}
 - \int\!\!\int Q(\theta_i)\, Q(\theta_{-i}) \log Q(\theta_{-i})\, d\theta_i\, d\theta_{-i}
\end{aligned}$$

51-54 Continuing,
$$F(Q(\theta)) = \int Q(\theta_i)\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right] d\theta_i
 - \int Q(\theta_i)\log Q(\theta_i)\, d\theta_i
 - \sum_{j \neq i}\int Q(\theta_j)\log Q(\theta_j)\, d\theta_j$$

If we let
$$Q^*(\theta_i) = \frac{1}{Z}\exp\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right]$$
where $Z$ is a normalising constant, and write $H(Q(\theta_j)) = -\int Q(\theta_j)\log Q(\theta_j)\, d\theta_j$ for the entropy of $Q(\theta_j)$, then
$$F(Q(\theta)) = \int Q(\theta_i)\log\frac{Q^*(\theta_i)}{Q(\theta_i)}\, d\theta_i + \log Z + \sum_{j \neq i} H(Q(\theta_j))
 = -KL(Q(\theta_i) \,\|\, Q^*(\theta_i)) + \log Z + \sum_{j \neq i} H(Q(\theta_j))$$

55-57
$$F(Q(\theta)) = -KL(Q(\theta_i) \,\|\, Q^*(\theta_i)) + \log Z + \sum_{j \neq i} H(Q(\theta_j))$$

We then see that $F(Q(\theta))$ is maximised when $Q(\theta_i) = Q^*(\theta_i)$, as this choice minimises the Kullback-Leibler divergence term.

Thus the update for $Q(\theta_i)$ is given by
$$Q(\theta_i) \propto \exp\left[\int \log P(X, \theta)\, Q(\theta_{-i})\, d\theta_{-i}\right]$$
or
$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big(\log P(X, \theta)\big) + \text{const}$$

58-62 VB algorithm

This implies a straightforward algorithm for variational inference:

1 Initialize all approximate posteriors (in the example below, $Q(\theta) = Q(\mu)Q(\tau)$), e.g. by setting them to their priors.
2 Cycle over the parameters, revising each given the current estimates of the others.
3 Loop until convergence.

Convergence is checked by calculating the VB lower bound at each step, i.e.
$$F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\, d\theta$$
The precise form of this term needs to be derived, and can be quite tricky. A generic sketch of the resulting loop is given below.
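In code, the algorithm is just a coordinate-ascent loop; the sketch below is schematic, with `update_component` and `elbo` as hypothetical model-specific placeholders.

```python
# Schematic shape of the VB loop above; update_component and elbo are model-specific
# placeholders, to be filled in as in the worked example that follows.
def cavi(components, update_component, elbo, max_iter=100, tol=1e-8):
    """Coordinate ascent on F(Q): revise each Q(theta_i) in turn, monitor the bound."""
    bounds = []
    for _ in range(max_iter):
        for name in components:                       # step 2: cycle over the parameters
            components[name] = update_component(name, components)
        bounds.append(elbo(components))               # VB lower bound F(Q(theta))
        if len(bounds) > 1 and abs(bounds[-1] - bounds[-2]) < tol:
            break                                     # step 3: stop at convergence
    return components, bounds
```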

63-67 Example 1

Consider applying VB to the hierarchical model
$$X_i \sim N(\mu, \tau^{-1}), \quad i = 1, \ldots, P, \qquad \mu \sim N\big(m, (\tau\beta)^{-1}\big), \qquad \tau \sim \Gamma(a, b)$$

Note We are using a prior of the form $\pi(\tau, \mu) = \pi(\mu \mid \tau)\,\pi(\tau)$.

Let $\theta = (\mu, \tau)$ and assume $Q(\theta) = Q(\mu)Q(\tau)$. We will use the notation $\langle\theta_i\rangle = \mathbb{E}_{Q(\theta_i)}[\theta_i]$ (and similarly $\langle\theta_i^2\rangle$).

The log joint density is
$$\log P(X, \theta) = \frac{P}{2}\log\tau - \frac{\tau}{2}\sum_{i=1}^P (X_i - \mu)^2 + \frac{1}{2}\log\tau - \frac{\tau\beta}{2}(\mu - m)^2 + (a-1)\log\tau - b\tau + K$$

68-72 We can derive the VB updates one at a time. We start with $Q(\mu)$.

Note We just need to focus on terms of the log joint density involving $\mu$:
$$\log Q(\mu) = \mathbb{E}_{Q(\tau)}\big(\log P(X, \theta)\big) + C
 = -\frac{\langle\tau\rangle}{2}\left(\sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2\right) + C$$
where $\langle\tau\rangle = \mathbb{E}_{Q(\tau)}(\tau)$. We will be able to determine $\langle\tau\rangle$ when we derive the other component of the approximate density, $Q(\tau)$.

73-76 We can see this log density has the form of a normal distribution:
$$\log Q(\mu) = -\frac{\langle\tau\rangle}{2}\left(\sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2\right) + C
 = -\frac{\beta^*}{2}(\mu - m^*)^2 + C'$$
where
$$\beta^* = (\beta + P)\,\langle\tau\rangle, \qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$
Thus $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$.

77-79 The second component of the VB approximation is derived from the terms of the log joint density involving $\tau$:
$$\log Q(\tau) = \mathbb{E}_{Q(\mu)}\big(\log P(X, \theta)\big) + C
 = \left(\frac{P+1}{2} + a - 1\right)\log\tau - \frac{\tau}{2}\Big\langle\sum_{i=1}^P (X_i - \mu)^2\Big\rangle - \frac{\tau\beta}{2}\big\langle(\mu - m)^2\big\rangle - b\tau + C$$
where $\langle\cdot\rangle$ denotes expectation under $Q(\mu)$.

80-81 We can see this log density has the form of a gamma distribution, namely $\Gamma(a^*, b^*)$ where
$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\left(\sum_{i=1}^P X_i^2 - 2\langle\mu\rangle\sum_{i=1}^P X_i + P\langle\mu^2\rangle\right) + \frac{\beta}{2}\left(m^2 - 2m\langle\mu\rangle + \langle\mu^2\rangle\right)$$

82-84 So overall we have

1 $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$ where
$$\beta^* = (\beta + P)\,\langle\tau\rangle \qquad (2) \qquad\qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P} \qquad (3)$$

2 $Q(\tau) = \Gamma(\tau \mid a^*, b^*)$ where
$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\left(\sum_{i=1}^P X_i^2 - 2\langle\mu\rangle\sum_{i=1}^P X_i + P\langle\mu^2\rangle\right) + \frac{\beta}{2}\left(m^2 - 2m\langle\mu\rangle + \langle\mu^2\rangle\right)$$

To calculate these we need
$$\langle\tau\rangle = \frac{a^*}{b^*}, \qquad \langle\mu\rangle = m^*, \qquad \langle\mu^2\rangle = \beta^{*-1} + m^{*2}.$$
A minimal implementation of these coupled updates is sketched below.
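These coupled updates give a very short algorithm: because $\beta^*$ depends on $\langle\tau\rangle$ and $b^*$ on $\langle\mu\rangle, \langle\mu^2\rangle$, the updates are iterated until they stabilise. The sketch below uses illustrative data and prior hyperparameters of my own choosing.

```python
import numpy as np

# Minimal sketch of the coupled VB updates for Example 1 (illustrative data and priors).
rng = np.random.default_rng(4)
X = rng.normal(2.0, 1.0, size=30)
P = len(X)
m, beta, a, b = 0.0, 1.0, 1.0, 1.0            # prior hyperparameters

E_tau = a / b                                  # initialise <tau> from the prior
for _ in range(50):
    # Q(mu) = N(m_star, 1 / beta_star)
    beta_star = (beta + P) * E_tau
    m_star = (beta * m + X.sum()) / (beta + P)
    E_mu, E_mu2 = m_star, 1.0 / beta_star + m_star ** 2

    # Q(tau) = Gamma(a_star, b_star)
    a_star = a + (P + 1) / 2
    b_star = (b
              + 0.5 * (np.sum(X ** 2) - 2 * E_mu * X.sum() + P * E_mu2)
              + 0.5 * beta * (m ** 2 - 2 * m * E_mu + E_mu2))
    E_tau = a_star / b_star

print(m_star, 1.0 / beta_star)                 # VB posterior mean and variance of mu
print(a_star, b_star)                          # VB posterior shape and rate for tau
```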

85 Example 1

For this model the exact posterior was calculated in Lecture 6:
$$\pi(\tau, \mu \mid X) \propto \tau^{a' - \frac{1}{2}} \exp\left(-\tau\Big\{b' + \frac{\beta'}{2}(m' - \mu)^2\Big\}\right)$$
where
$$a' = a + \frac{P}{2}, \qquad b' = b + \frac{1}{2}\sum_{i=1}^P (X_i - \bar{X})^2 + \frac{P\beta}{2(P+\beta)}(\bar{X} - m)^2, \qquad \beta' = \beta + P, \qquad m' = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$
We note some similarity between the VB updates and the true posterior parameters.

86 We can compare the true and VB posteriors when applied to a real dataset. We see that the VB approximation underestimates the posterior variances.

[Figure: true posterior density versus VB posterior density for each parameter; the VB densities are visibly narrower than the true posteriors.]

87-90 General comments

The property of VB underestimating the variance in the posterior is a general feature of the method whenever there is correlation between the $\theta_i$'s in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison, i.e. comparing the approximate marginal likelihoods between models.

VB is often much, much faster to run than MCMC or other sampling-based methods.

The VB updates and lower bound can be tricky to derive, and sometimes further approximation is needed.

The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought or known to be multi-modal.
