Foundations of Statistical Inference
Julien Berestycki, Department of Statistics, University of Oxford. SB2a, MT 2016.
Lecture 14: Variational Bayes

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." (John W. Tukey)
Laplace approximation

The Laplace approximation provides a way of approximating a density whose normalisation constant we cannot evaluate, by fitting a Gaussian distribution to its mode (Pierre-Simon Laplace, 1749-1827):

p(z) = (1/Z) f(z),

where p(z) is the probability density, Z is the unknown normalising constant, and f(z) is the main part of the density (easy to evaluate).

Observe that this is exactly the situation we face in Bayesian inference:

p(θ | y) = (1/p(y)) p(θ, y),

where p(θ | y) is the posterior density, p(y) is the marginal likelihood, and p(θ, y) is the joint probability (likelihood × prior).
Deriving the Laplace approximation

Idea: second-order Taylor approximation to ℓ(θ) = log p(y, θ) around the mode θ*:

ℓ(θ) ≈ ℓ(θ*) + ℓ′(θ*)(θ − θ*) + ½ ℓ″(θ*)(θ − θ*)²,

where ℓ′(θ*) = 0 at the mode, so

ℓ(θ) ≈ ℓ(θ*) + ½ ℓ″(θ*)(θ − θ*)².

Recognize a Gaussian density:

log N(θ | µ, σ²) = −log σ − ½ log 2π − (1/(2σ²))(θ − µ)².

So approximate the posterior by q(θ) = N(θ | µ, σ²) with µ = θ* (the mode of the log-posterior) and σ² = −1/ℓ″(θ*) (the inverse of the negative curvature at the mode).
Computing integrals

More generally, assume f(x) has a unique global maximum at x₀. Then

f(x) ≈ f(x₀) − ½ |f″(x₀)| (x − x₀)²,

so

∫ₐᵇ e^{N f(x)} dx ≈ e^{N f(x₀)} ∫ₐᵇ e^{−N |f″(x₀)| (x − x₀)² / 2} dx.

Lemma. ∫ₐᵇ e^{N f(x)} dx ∼ √(2π / (N |f″(x₀)|)) e^{N f(x₀)} as N → ∞.

The Laplace approximation becomes better as N grows.
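The lemma above can be checked numerically. The sketch below, which is illustrative only, uses a made-up function f with its maximum at x₀ = 0 (the quartic term is there just so f is not exactly Gaussian) and compares a brute-force Riemann sum of ∫ e^{N f(x)} dx with the Laplace formula.

```python
import math

# A hypothetical f with a unique global maximum at x0 = 0 and f''(0) = -1.
def f(x):
    return -x**2 / 2 - x**4 / 12

N = 200
a, b = -3.0, 3.0

# Left-hand side: midpoint Riemann sum of exp(N f(x)) over [a, b].
M = 200_000
h = (b - a) / M
integral = sum(math.exp(N * f(a + (k + 0.5) * h)) for k in range(M)) * h

# Right-hand side: Laplace approximation sqrt(2*pi / (N |f''(x0)|)) * exp(N f(x0)).
x0 = 0.0
fpp = -1.0  # f''(0) for this particular f
laplace = math.sqrt(2 * math.pi / (N * abs(fpp))) * math.exp(N * f(x0))

print(integral, laplace, integral / laplace)
```

For N = 200 the two values agree to within about a tenth of a percent; the discrepancy shrinks as N grows, as the lemma states.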
In dimension d > 1

If x ∈ ℝᵈ then the Taylor expansion becomes

f(x) ≈ f(x₀) + ½ (x − x₀)ᵀ H (x − x₀),

where H is the Hessian matrix of second derivatives of f at x₀ (the first-order term vanishes at the maximum). In that case it can be shown that

Lemma. ∫ e^{N f(x)} dx ∼ (2π/N)^{d/2} |det H(x₀)|^{−1/2} e^{N f(x₀)} as N → ∞.
Using the Laplace approximation

Given a model with θ = (θ₁, …, θ_p):

Step 1. Find the mode of the log-joint (the MAP estimate of θ):

θ* = argmax_θ log p(θ, y).

Step 2. Evaluate the curvature of the log-joint at the mode:

H = −D² log p(θ, y) |_{θ=θ*},

where D² denotes the Hessian matrix of second derivatives.

Step 3. Obtain the Gaussian approximation

N(θ | µ, Σ), with µ = θ* and Σ = H⁻¹.
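The three steps above can be sketched in code. The example below is a hedged illustration, not part of the lecture: it uses a made-up Bernoulli model with a flat prior, so log p(θ, y) = s log θ + (n − s) log(1 − θ), finds the mode by a gradient-free golden-section search, and estimates the curvature by a finite difference.

```python
import math

# Hypothetical data: n Bernoulli trials with s successes, flat prior on theta.
n, s = 50, 15

def log_joint(theta):
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

# Step 1: find the mode (MAP estimate) by golden-section search on (0, 1).
lo, hi = 1e-6, 1 - 1e-6
phi = (math.sqrt(5) - 1) / 2
for _ in range(200):
    m1 = hi - phi * (hi - lo)
    m2 = lo + phi * (hi - lo)
    if log_joint(m1) < log_joint(m2):
        lo = m1
    else:
        hi = m2
theta_star = (lo + hi) / 2

# Step 2: curvature at the mode via a central second difference.
eps = 1e-5
H = -(log_joint(theta_star + eps) - 2 * log_joint(theta_star)
      + log_joint(theta_star - eps)) / eps**2

# Step 3: Gaussian approximation N(theta_star, 1/H).
mu, sigma2 = theta_star, 1 / H
print(mu, sigma2)
```

Here the mode is s/n = 0.3 and the Laplace variance is θ*(1 − θ*)/n = 0.0042, which can be verified against the analytic second derivative.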
Example

Suppose the yᵢ are iid N(µ, σ²) with a flat prior on µ and on log σ. The posterior is

p(µ, σ² | y) ∝ (σ²)^{−n/2 − 1} exp( −[(n−1)s² + n(ȳ − µ)²] / (2σ²) ),

where ȳ = (1/n) Σᵢ yᵢ and s² = (1/(n−1)) Σᵢ (yᵢ − ȳ)². Writing ν = log σ we get

p(µ, ν | y) ∝ f(µ, ν) = exp( −[(n−1)s² + n(ȳ − µ)²] e^{−2ν} / 2 − nν ).
Example (continued)

It is easy to check that

(µ̂, ν̂) = mode(µ, ν | y) = ( ȳ, ½ log( (n−1)s² / n ) ).

The second-order derivatives of log f are

∂²/∂µ² log f = −n e^{−2ν},  ∂²/(∂µ ∂ν) log f = −2n(ȳ − µ) e^{−2ν},

and

∂²/∂ν² log f = −2 [(n−1)s² + n(ȳ − µ)²] e^{−2ν}.

So the negative Hessian at the mode is

H(µ̂, ν̂) = [ n²/((n−1)s²)   0 ;  0   2n ]

and we have the approximation

(µ, ν) ≈ N( ( ȳ, ½ log((n−1)s²/n) ),  [ (n−1)s²/n²   0 ;  0   1/(2n) ] ).
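The mode formula above can be sanity-checked numerically. The sketch below uses made-up data (a fixed seed, n = 30 draws) and a crude grid search over (µ, ν) to confirm that log f peaks at (ȳ, ½ log((n−1)s²/n)); the grid resolution and data are assumptions for illustration only.

```python
import math, random

# Made-up data for the check.
random.seed(3)
n = 30
y = [random.gauss(2.0, 1.0) for _ in range(n)]
ybar = sum(y) / n
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)

def log_f(mu, nu):
    # log of the unnormalised posterior in (mu, nu = log sigma) parametrisation
    return -((n - 1) * s2 + n * (ybar - mu) ** 2) * math.exp(-2 * nu) / 2 - n * nu

# Crude grid search for the mode around plausible values.
mu_hat, nu_hat = max(
    ((ybar + i * 0.001, j * 0.001) for i in range(-200, 201)
                                   for j in range(-1000, 1001)),
    key=lambda p: log_f(*p),
)

nu_true = 0.5 * math.log((n - 1) * s2 / n)
print(mu_hat - ybar, nu_hat - nu_true)
```

Both differences come out at the grid resolution (about 10⁻³), consistent with the closed-form mode.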
Limitations of the Laplace method

The Laplace approximation is often too strong a simplification.
Laplace method for computing the marginal

P(x) = ∫ P(x | θ) π(θ) dθ = ∫ exp{ −N( −(1/N) log P(x | θ) − (1/N) log π(θ) ) } dθ.

Define h(θ) = −(1/N) log P(x | θ) − (1/N) log π(θ), so that the integral we want to compute is of the form ∫ exp{ −N h(θ) } dθ.

With θ* the minimiser of h,

h(θ) ≈ h(θ*) + ½ h″(θ*)(θ − θ*)²,

and we can approximate the integral as

∫ e^{−N h(θ)} dθ ≈ e^{−N h(θ*)} ∫ exp{ −(N/2) h″(θ*)(θ − θ*)² } dθ.

Comparing to a normal pdf we have

∫ e^{−N h(θ)} dθ ≈ e^{−N h(θ*)} (2π)^{1/2} (N h″(θ*))^{−1/2} = P(x | θ*) π(θ*) (2π)^{1/2} (N h″(θ*))^{−1/2}.
Laplace's method

For a d-dimensional function the analogue of this result is

∫ e^{−N f(x)} dx ≈ e^{−N f(x₀)} (2π)^{d/2} N^{−d/2} |f″(x₀)|^{−1/2},

where |f″(x₀)| is the determinant of the Hessian of the function evaluated at its minimiser x₀.
Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) takes the approximation one step further, essentially by minimizing the impact of the prior.

Firstly, the MAP estimate θ* is replaced by the MLE θ̂, which is reasonable if the prior has a small effect.

Secondly, BIC only retains the terms that vary with N, since asymptotically the terms that are constant in N do not matter.

Dropping the constant terms we get

log P(X) ≈ log P(X | θ̂) − (d/2) log N.
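As a quick illustration of the formula above, the sketch below computes the BIC approximation to log P(X) for a made-up Gaussian sample; the seed, sample size, and true parameters are assumptions chosen only so the example runs. A Gaussian has d = 2 free parameters (µ and σ²).

```python
import math, random

# Made-up data: N draws from a hypothetical N(2.0, 1.5^2).
random.seed(1)
N = 500
X = [random.gauss(2.0, 1.5) for _ in range(N)]

# MLEs for a Gaussian model.
mu_hat = sum(X) / N
var_hat = sum((x - mu_hat) ** 2 for x in X) / N

# Maximised log-likelihood log P(X | mle).
loglik = sum(-0.5 * math.log(2 * math.pi * var_hat)
             - (x - mu_hat) ** 2 / (2 * var_hat) for x in X)

d = 2  # number of free parameters (mu, sigma^2)
bic_approx = loglik - d / 2 * math.log(N)
print(loglik, bic_approx)
```

The penalty term (d/2) log N grows with the number of parameters, so when comparing models via their approximate log marginal likelihoods, BIC favours smaller models unless the fit improves enough.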
Bayesian Information Criterion (BIC) - extra details

Why can we ignore the term −½ log |f″(θ̂)|? Assume (as above) that we can ignore the prior, i.e. P(θ) = 1, and that the data points X₁, …, X_N are iid. Then

f″(θ̂) = −(1/N) [D² log P(X | θ)]_{θ=θ̂} = −(1/N) Σᵢ₌₁ᴺ [D² log P(Xᵢ | θ)]_{θ=θ̂}.

The thing to notice about this term is that it is now an average over data points, i.e. the curvature of the average log-likelihood.
Bayesian Information Criterion (BIC) - extra details (continued)

Now consider the random variables Yᵢ = log P(Xᵢ | θ) and apply the weak law of large numbers. The (m, n)th element of f″(θ̂) then satisfies

[f″(θ̂)]ₘₙ → −[ ∂²/(∂θₘ ∂θₙ) E( log P(Xᵢ | θ) ) ]_{θ=θ̂},

and these limits are constants, i.e. expected log-likelihood curvatures for a single data point. So f″(θ̂) is constant in N, and the term −½ log |f″(θ̂)| can be ignored in the BIC approximation.
Variational Bayes

The idea of VB is to find an approximation Q(θ) to a given posterior distribution P(θ | X), that is

Q(θ) ≈ P(θ | X),

where θ is the vector of parameters. We then use Q(θ) to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood.

Question: how do we find a good approximate posterior Q(θ)?
Kullback-Leibler (KL) divergence

The strategy we take is to find a distribution Q(θ) that minimizes a measure of distance between Q(θ) and the posterior P(θ | X).

Definition. The Kullback-Leibler divergence KL(q ‖ p) between two distributions q(x) and p(x) is

KL(q ‖ p) = ∫ log[ q(x) / p(x) ] q(x) dx.

[Figure: the densities q(x) and p(x) (left), and the integrand q(x) log(q(x)/p(x)) (right).]

Exercise: KL(q ‖ p) ≥ 0, and KL(q ‖ p) = 0 iff q = p.
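The definition can be checked numerically. The sketch below, an illustration rather than lecture material, integrates q log(q/p) on a grid for two Gaussians (parameters chosen arbitrarily) and compares against the known closed form KL(N(m₁, s₁²) ‖ N(m₂, s₂²)) = log(s₂/s₁) + (s₁² + (m₁ − m₂)²)/(2s₂²) − ½.

```python
import math

# Arbitrary example densities q = N(0, 1), p = N(1, 2^2).
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0

def normal_pdf(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

# Midpoint-rule integration of q(x) * log(q(x)/p(x)) over a wide grid.
lo, hi, M = -10.0, 10.0, 100_000
h = (hi - lo) / M
kl_numeric = sum(
    normal_pdf(x, m1, s1) * math.log(normal_pdf(x, m1, s1) / normal_pdf(x, m2, s2))
    for x in (lo + (k + 0.5) * h for k in range(M))
) * h

kl_exact = math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5
print(kl_numeric, kl_exact)
```

Both values are about 0.443 and positive, consistent with the exercise that KL(q ‖ p) ≥ 0. Note that KL is not symmetric: swapping q and p gives a different number.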
N(µ, σ²) approximations to a Gamma(10, 1)

[Figure: four normal approximations to a Gamma(10, 1) density, each shown with its KL divergence: (µ = 10, σ² = 4), (µ = 9.11, σ² = 3.03), (µ = 13, σ² = 2.23), and (µ = 9, σ² = 2).]
We consider the KL divergence between Q(θ) and P(θ | X):

KL(Q(θ) ‖ P(θ | X)) = ∫ log[ Q(θ) / P(θ | X) ] Q(θ) dθ
  = ∫ log[ Q(θ) P(X) / P(θ, X) ] Q(θ) dθ
  = log P(X) − ∫ log[ P(θ, X) / Q(θ) ] Q(θ) dθ.

The log marginal likelihood can then be written as

log P(X) = F(Q(θ)) + KL(Q(θ) ‖ P(θ | X))   (1)

where F(Q(θ)) = ∫ log[ P(θ, X) / Q(θ) ] Q(θ) dθ.

Note: since KL(q ‖ p) ≥ 0 we have log P(X) ≥ F(Q(θ)), so F(Q(θ)) is a lower bound on the log marginal likelihood.
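Identity (1) holds for any Q, and is easy to verify exactly in a discrete toy setting. The sketch below uses a hypothetical joint P(θ, X = x_obs) over three parameter values, with made-up numbers, and an arbitrary approximate posterior Q.

```python
import math

# Hypothetical joint P(theta, X = x_obs) over theta in {0, 1, 2} (made-up numbers).
joint = {0: 0.10, 1: 0.25, 2: 0.05}
PX = sum(joint.values())                       # marginal likelihood P(X = x_obs)
post = {t: j / PX for t, j in joint.items()}   # exact posterior P(theta | X)

# An arbitrary approximate posterior Q(theta).
Q = {0: 0.2, 1: 0.5, 2: 0.3}

F = sum(Q[t] * math.log(joint[t] / Q[t]) for t in Q)   # lower bound F(Q)
KL = sum(Q[t] * math.log(Q[t] / post[t]) for t in Q)   # KL(Q || posterior)

print(F + KL, math.log(PX))  # identity (1): these coincide
```

F + KL equals log P(X) exactly, whatever Q is, and F alone never exceeds log P(X); maximising F over Q is therefore the same as minimising the KL term.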
The mean field approximation

We now need to ask what form Q(θ) should take. The most widely used approximation is known as the mean field approximation, and assumes only that the approximate posterior has a factorized form

Q(θ) = ∏ᵢ Q(θᵢ).

The VB algorithm iteratively maximises F(Q(θ)) with respect to the free distributions Q(θᵢ); this is coordinate ascent in the function space of variational distributions. We refer to each Q(θᵢ) as a VB component, and we update each component Q(θᵢ) in turn, keeping Q(θⱼ), j ≠ i, fixed.
VB components

Lemma. The VB components take the form

log Q(θᵢ) = E_{Q(θ₋ᵢ)}( log P(X, θ) ) + const.

Proof. Writing Q(θ) = Q(θᵢ) Q(θ₋ᵢ), where θ₋ᵢ = θ \ θᵢ, the lower bound can be re-written as

F(Q(θ)) = ∫ log[ P(θ, X) / Q(θ) ] Q(θ) dθ
  = ∫∫ log[ P(θ, X) / (Q(θᵢ) Q(θ₋ᵢ)) ] Q(θᵢ) Q(θ₋ᵢ) dθᵢ dθ₋ᵢ
  = ∫ Q(θᵢ) [ ∫ log P(X, θ) Q(θ₋ᵢ) dθ₋ᵢ ] dθᵢ − ∫∫ Q(θᵢ) Q(θ₋ᵢ) log Q(θᵢ) dθᵢ dθ₋ᵢ − ∫∫ Q(θᵢ) Q(θ₋ᵢ) log Q(θ₋ᵢ) dθᵢ dθ₋ᵢ.
Continuing,

F(Q(θ)) = ∫ Q(θᵢ) [ ∫ log P(X, θ) Q(θ₋ᵢ) dθ₋ᵢ ] dθᵢ − ∫ Q(θᵢ) log Q(θᵢ) dθᵢ − Σ_{j≠i} ∫ Q(θⱼ) log Q(θⱼ) dθⱼ.

If we let

Q*(θᵢ) = (1/Z) exp[ ∫ log P(X, θ) Q(θ₋ᵢ) dθ₋ᵢ ],

where Z is a normalising constant, and write H(Q(θⱼ)) = −∫ Q(θⱼ) log Q(θⱼ) dθⱼ for the entropy of Q(θⱼ), then

F(Q(θ)) = ∫ Q(θᵢ) log[ Q*(θᵢ) / Q(θᵢ) ] dθᵢ + log Z + Σ_{j≠i} H(Q(θⱼ))
  = −KL(Q(θᵢ) ‖ Q*(θᵢ)) + log Z + Σ_{j≠i} H(Q(θⱼ)).
F(Q(θ)) = −KL(Q(θᵢ) ‖ Q*(θᵢ)) + log Z + Σ_{j≠i} H(Q(θⱼ)).

We then see that F(Q(θ)) is maximised when Q(θᵢ) = Q*(θᵢ), as this choice minimises the Kullback-Leibler divergence term. Thus the update for Q(θᵢ) is given by

Q(θᵢ) ∝ exp[ ∫ log P(X, θ) Q(θ₋ᵢ) dθ₋ᵢ ],

or equivalently

log Q(θᵢ) = E_{Q(θ₋ᵢ)}( log P(X, θ) ) + const.
VB algorithm

This implies a straightforward algorithm for variational inference:

1. Initialize all the approximate posteriors Q(θᵢ) in Q(θ) = ∏ᵢ Q(θᵢ), e.g. by setting them to their priors.
2. Cycle over the parameters, revising each given the current estimates of the others.
3. Loop until convergence.

Convergence is checked by calculating the VB lower bound at each step, i.e.

F(Q(θ)) = ∫ log[ P(θ, X) / Q(θ) ] Q(θ) dθ.

The precise form of this term needs to be derived, and can be quite tricky.
Example 1

Consider applying VB to the hierarchical model

Xᵢ ~ N(µ, τ⁻¹), i = 1, …, P,
µ ~ N(m, (τβ)⁻¹),
τ ~ Γ(a, b).

Note: we are using a prior of the form π(τ, µ) = π(µ | τ) π(τ).

Let θ = (µ, τ) and assume Q(θ) = Q(µ) Q(τ). We will use the notation ⟨θᵢ⟩ = E_{Q(θᵢ)}(θᵢ).

The log joint density is

log P(X, θ) = (P/2) log τ − (τ/2) Σᵢ₌₁ᴾ (Xᵢ − µ)² + ½ log τ − (τβ/2)(µ − m)² + (a − 1) log τ − bτ + K.
We can derive the VB updates one at a time. We start with Q(µ); note we just need to focus on the terms of log P(X, θ) involving µ:

log Q(µ) = E_{Q(τ)}( log P(X, θ) ) + C
  = −(⟨τ⟩/2) ( Σᵢ₌₁ᴾ (Xᵢ − µ)² + β(µ − m)² ) + C,

where ⟨τ⟩ = E_{Q(τ)}(τ). We will be able to determine ⟨τ⟩ when we derive the other component of the approximate density, Q(τ).
We can see this log density has the form of a normal distribution:

log Q(µ) = −(⟨τ⟩/2) ( Σᵢ₌₁ᴾ (Xᵢ − µ)² + β(µ − m)² ) + C = −(β′/2)(µ − m′)² + C′,

where

β′ = (β + P) ⟨τ⟩,
m′ = ( βm + Σᵢ₌₁ᴾ Xᵢ ) / (β + P).

Thus Q(µ) = N(µ | m′, β′⁻¹).
The second component of the VB approximation is derived from the terms of log P(X, θ) involving τ:

log Q(τ) = E_{Q(µ)}( log P(X, θ) ) + C
  = ( (P + 1)/2 + a − 1 ) log τ − (τ/2) E_{Q(µ)}( Σᵢ₌₁ᴾ (Xᵢ − µ)² + β(µ − m)² ) − bτ + C.
We can see this log density has the form of a gamma distribution, Γ(a′, b′), where

a′ = a + (P + 1)/2,
b′ = b + ½ ( Σᵢ₌₁ᴾ Xᵢ² − 2⟨µ⟩ Σᵢ₌₁ᴾ Xᵢ + P⟨µ²⟩ ) + (β/2) ( m² − 2m⟨µ⟩ + ⟨µ²⟩ ).
So overall we have:

1. Q(µ) = N(µ | m′, β′⁻¹), where

β′ = (β + P) ⟨τ⟩,   (2)
m′ = ( βm + Σᵢ₌₁ᴾ Xᵢ ) / (β + P).   (3)

2. Q(τ) = Γ(τ | a′, b′), where

a′ = a + (P + 1)/2,
b′ = b + ½ ( Σᵢ₌₁ᴾ Xᵢ² − 2⟨µ⟩ Σᵢ₌₁ᴾ Xᵢ + P⟨µ²⟩ ) + (β/2) ( m² − 2m⟨µ⟩ + ⟨µ²⟩ ).

To calculate these we need

⟨τ⟩ = a′/b′,   ⟨µ⟩ = m′,   ⟨µ²⟩ = β′⁻¹ + m′².
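The coupled updates above can be iterated directly. The sketch below is a minimal coordinate-ascent implementation of exactly these equations, under assumed inputs: made-up data (fixed seed, P = 100 draws) and illustrative prior hyperparameters m, β, a, b.

```python
import math, random

# Made-up data for the model X_i ~ N(mu, 1/tau).
random.seed(0)
P = 100
X = [random.gauss(1.0, 0.5) for _ in range(P)]
sum_x, sum_x2 = sum(X), sum(x * x for x in X)

# Illustrative prior hyperparameters (assumed values).
m, beta, a, b = 0.0, 1.0, 1.0, 1.0

E_tau = a / b  # initialise <tau> from the prior
for _ in range(100):
    # Update Q(mu) = N(m', 1/beta') using equations (2) and (3).
    beta_p = (beta + P) * E_tau
    m_p = (beta * m + sum_x) / (beta + P)
    E_mu, E_mu2 = m_p, 1 / beta_p + m_p**2

    # Update Q(tau) = Gamma(a', b') using the moments of Q(mu).
    a_p = a + (P + 1) / 2
    b_p = (b
           + 0.5 * (sum_x2 - 2 * E_mu * sum_x + P * E_mu2)
           + 0.5 * beta * (m * m - 2 * m * E_mu + E_mu2))
    E_tau = a_p / b_p

print(m_p, 1 / E_tau)  # approximate posterior mean of mu, and 1/<tau>
```

With these data the approximate posterior mean of µ lands near the sample mean (about 1) and 1/⟨τ⟩ near the sample variance (about 0.25), as one would expect; in practice convergence would be monitored via the lower bound F(Q(θ)) rather than a fixed iteration count.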
Example 1 (continued)

For this model the exact posterior was calculated in Lecture 6:

π(τ, µ | X) ∝ τ^{ã − 1} exp( −τ { b̃ + (β̃/2)(m̃ − µ)² } ),

where

ã = a + (P + 1)/2,
b̃ = b + ½ Σᵢ₌₁ᴾ (Xᵢ − X̄)² + ( Pβ / (2(P + β)) ) (X̄ − m)²,
β̃ = β + P,
m̃ = ( βm + Σᵢ₌₁ᴾ Xᵢ ) / (β + P).

We note some similarity between the VB updates and the true posterior parameters.
We can compare the true and VB posteriors when applied to a real dataset. We see that the VB approximations underestimate the posterior variances.

[Figure: true posterior density vs VB posterior density, for µ (left) and for τ (right).]
General comments

- The property of VB underestimating the variance of the posterior is a general feature of the method when there is correlation between the θᵢ's in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison, i.e. comparing the approximate marginal likelihoods between models.
- VB is often much, much faster to implement than MCMC or other sampling-based methods.
- The VB updates and lower bound can be tricky to derive, and sometimes further approximation is needed.
- The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought or known to be multi-modal.
Department of Statistics The University of Auckland https://www.stat.auckland.ac.nz/ brewer/ is a Monte Carlo method (not necessarily MCMC) that was introduced by John Skilling in 2004. It is very popular
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationDavid Giles Bayesian Econometrics
David Giles Bayesian Econometrics 1. General Background 2. Constructing Prior Distributions 3. Properties of Bayes Estimators and Tests 4. Bayesian Analysis of the Multiple Regression Model 5. Bayesian
More informationStatistical Machine Learning Lectures 4: Variational Bayes
1 / 29 Statistical Machine Learning Lectures 4: Variational Bayes Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 29 Synonyms Variational Bayes Variational Inference Variational Bayesian Inference
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationInstructor: Dr. Volkan Cevher. 1. Background
Instructor: Dr. Volkan Cevher Variational Bayes Approximation ice University STAT 631 / ELEC 639: Graphical Models Scribe: David Kahle eviewers: Konstantinos Tsianos and Tahira Saleem 1. Background These
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationIntroduction to Bayesian inference
Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions
More informationA Very Brief Summary of Bayesian Inference, and Examples
A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationExpectation Propagation for Approximate Bayesian Inference
Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationVariational Principal Components
Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings
More informationEstimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator
Estimation Theory Estimation theory deals with finding numerical values of interesting parameters from given set of data. We start with formulating a family of models that could describe how the data were
More informationIntegrated Non-Factorized Variational Inference
Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationSTAT 730 Chapter 4: Estimation
STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum
More informationIntroduction to Bayesian Methods. Introduction to Bayesian Methods p.1/??
to Bayesian Methods Introduction to Bayesian Methods p.1/?? We develop the Bayesian paradigm for parametric inference. To this end, suppose we conduct (or wish to design) a study, in which the parameter
More informationBayesian Asymptotics
BS2 Statistical Inference, Lecture 8, Hilary Term 2008 May 7, 2008 The univariate case The multivariate case For large λ we have the approximation I = b a e λg(y) h(y) dy = e λg(y ) h(y ) 2π λg (y ) {
More informationBayesian Regression Linear and Logistic Regression
When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationAn Introduction to Expectation-Maximization
An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2015 Julien Berestycki (University of Oxford) SB2a MT 2015 1 / 16 Lecture 16 : Bayesian analysis
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationCS281A/Stat241A Lecture 22
CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationModule 22: Bayesian Methods Lecture 9 A: Default prior selection
Module 22: Bayesian Methods Lecture 9 A: Default prior selection Peter Hoff Departments of Statistics and Biostatistics University of Washington Outline Jeffreys prior Unit information priors Empirical
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationChoosing among models
Eco 515 Fall 2014 Chris Sims Choosing among models September 18, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More information7. Estimation and hypothesis testing. Objective. Recommended reading
7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationLecture 8: Bayesian Estimation of Parameters in State Space Models
in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space
More informationSpring 2006: Examples: Laplace s Method; Hierarchical Models
36-724 Spring 2006: Examples: Laplace s Method; Hierarchical Models Brian Junker February 13, 2006 Second-order Laplace Approximation for E[g(θ) y] An Analytic Example Hierarchical Models Example of Hierarchical
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationBayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference
1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationQuantitative Biology II Lecture 4: Variational Methods
10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate
More informationComputing the MLE and the EM Algorithm
ECE 830 Fall 0 Statistical Signal Processing instructor: R. Nowak Computing the MLE and the EM Algorithm If X p(x θ), θ Θ, then the MLE is the solution to the equations logp(x θ) θ 0. Sometimes these equations
More informationApproximating mixture distributions using finite numbers of components
Approximating mixture distributions using finite numbers of components Christian Röver and Tim Friede Department of Medical Statistics University Medical Center Göttingen March 17, 2016 This project has
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationData Analysis and Uncertainty Part 2: Estimation
Data Analysis and Uncertainty Part 2: Estimation Instructor: Sargur N. University at Buffalo The State University of New York srihari@cedar.buffalo.edu 1 Topics in Estimation 1. Estimation 2. Desirable
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationUnderstanding Covariance Estimates in Expectation Propagation
Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationIntroduction to Bayesian Statistics
School of Computing & Communication, UTS January, 207 Random variables Pre-university: A number is just a fixed value. When we talk about probabilities: When X is a continuous random variable, it has a
More informationSparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference
Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Shunsuke Horii Waseda University s.horii@aoni.waseda.jp Abstract In this paper, we present a hierarchical model which
More informationBayesian statistics, simulation and software
Module 10: Bayesian prediction and model checking Department of Mathematical Sciences Aalborg University 1/15 Prior predictions Suppose we want to predict future data x without observing any data x. Assume:
More informationSTAT 830 Bayesian Estimation
STAT 830 Bayesian Estimation Richard Lockhart Simon Fraser University STAT 830 Fall 2011 Richard Lockhart (Simon Fraser University) STAT 830 Bayesian Estimation STAT 830 Fall 2011 1 / 23 Purposes of These
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationStat 451 Lecture Notes Numerical Integration
Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29
More informationProbabilistic and Bayesian Machine Learning
Probabilistic and Bayesian Machine Learning Lecture 1: Introduction to Probabilistic Modelling Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Why a
More informationMaster s Written Examination
Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth
More informationComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors. RicardoS.Ehlers
ComputationalToolsforComparing AsymmetricGARCHModelsviaBayes Factors RicardoS.Ehlers Laboratório de Estatística e Geoinformação- UFPR http://leg.ufpr.br/ ehlers ehlers@leg.ufpr.br II Workshop on Statistical
More informationModel comparison and selection
BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)
More informationStatistical Theory MT 2007 Problems 4: Solution sketches
Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the
More informationCheng Soon Ong & Christian Walder. Canberra February June 2017
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2017 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 679 Part XIX
More information7. Estimation and hypothesis testing. Objective. Recommended reading
7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing
More informationLecture 7 and 8: Markov Chain Monte Carlo
Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani
More informationAn Extended BIC for Model Selection
An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of Discussion Functionals Calculus of Variations Maximizing a Functional Finding Approximation to a Posterior Minimizing K-L divergence Factorized
More informationGeneral Bayesian Inference I
General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for
More informationIntroduction to Bayesian Methods
Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative
More informationBayesian Dropout. Tue Herlau, Morten Morup and Mikkel N. Schmidt. Feb 20, Discussed by: Yizhe Zhang
Bayesian Dropout Tue Herlau, Morten Morup and Mikkel N. Schmidt Discussed by: Yizhe Zhang Feb 20, 2016 Outline 1 Introduction 2 Model 3 Inference 4 Experiments Dropout Training stage: A unit is present
More informationCS 540: Machine Learning Lecture 2: Review of Probability & Statistics
CS 540: Machine Learning Lecture 2: Review of Probability & Statistics AD January 2008 AD () January 2008 1 / 35 Outline Probability theory (PRML, Section 1.2) Statistics (PRML, Sections 2.1-2.4) AD ()
More informationMachine Learning, Fall 2012 Homework 2
0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationan introduction to bayesian inference
with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena
More information1 Hypothesis Testing and Model Selection
A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection
More informationLecture 1b: Linear Models for Regression
Lecture 1b: Linear Models for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationChapter 4 - Fundamentals of spatial processes Lecture notes
Chapter 4 - Fundamentals of spatial processes Lecture notes Geir Storvik January 21, 2013 STK4150 - Intro 2 Spatial processes Typically correlation between nearby sites Mostly positive correlation Negative
More informationChapter 8: Sampling distributions of estimators Sections
Chapter 8: Sampling distributions of estimators Sections 8.1 Sampling distribution of a statistic 8.2 The Chi-square distributions 8.3 Joint Distribution of the sample mean and sample variance Skip: p.
More informationLecture 4: Dynamic models
linear s Lecture 4: s Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu
More informationLinear Models A linear model is defined by the expression
Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose
More informationLecture 2: Priors and Conjugacy
Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More information