Foundations of Statistical Inference

Julien Berestycki, Department of Statistics, University of Oxford. SB2a, MT 2016.

Lecture 14: Variational Bayes

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." (John W. Tukey, 1915-2000)

Laplace approximation

The Laplace approximation provides a way of approximating a density whose normalisation constant we cannot evaluate, by fitting a Gaussian distribution at its mode (Pierre-Simon Laplace, 1749-1827). We write

$$p(z) = \frac{1}{Z}\, f(z),$$

where $p(z)$ is the probability density, $Z$ is the unknown normalising constant, and $f(z)$ is the main part of the density, which is easy to evaluate.

Observe that this is exactly the situation we face in Bayesian inference:

$$p(\theta \mid y) = \frac{1}{p(y)}\, p(\theta, y),$$

where $p(\theta \mid y)$ is the posterior density, $p(y)$ is the marginal likelihood, and $p(\theta, y)$ is the joint probability (likelihood times prior).

Deriving the Laplace approximation

Idea: take a second-order Taylor approximation of $\ell(\theta) = \log p(y, \theta)$ around its mode $\theta^*$. Since the first-order term vanishes at the mode,

$$\ell(\theta) \approx \ell(\theta^*) + \underbrace{\ell'(\theta^*)(\theta - \theta^*)}_{=0} + \frac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2
 = \ell(\theta^*) + \frac{1}{2}\ell''(\theta^*)(\theta - \theta^*)^2.$$

Recognise a Gaussian density:

$$\log \mathcal{N}(\theta \mid \mu, \sigma^2) = -\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(\theta - \mu)^2.$$

So approximate the posterior by $q(\theta) = \mathcal{N}(\theta \mid \mu, \sigma^2)$ with $\mu = \theta^*$ (the mode of the log-posterior) and $\sigma^2 = -1/\ell''(\theta^*)$ (the inverse of the negative curvature at the mode).

Computing integrals

More generally, assume $f(x)$ has a unique global maximum at $x_0$. Then

$$f(x) \approx f(x_0) - \frac{1}{2}\,|f''(x_0)|\,(x - x_0)^2,$$

so

$$\int_a^b e^{N f(x)}\,dx \approx e^{N f(x_0)} \int_a^b e^{-N |f''(x_0)|(x - x_0)^2/2}\,dx,$$

to obtain:

Lemma. $\displaystyle \int_a^b e^{N f(x)}\,dx \approx \sqrt{\frac{2\pi}{N\,|f''(x_0)|}}\; e^{N f(x_0)}$ as $N \to \infty$.

The Laplace approximation becomes better as $N$ grows.

In dimension d > 1

If $x \in \mathbb{R}^d$ then the Taylor expansion becomes

$$f(x) \approx f(x_0) + \frac{1}{2}(x - x_0)^T H (x - x_0),$$

where $H$ is the Hessian matrix of second derivatives of $f$ at $x_0$ (the gradient vanishes at the maximum). In that case it can be shown that:

Lemma. $\displaystyle \int e^{N f(x)}\,dx \approx \left(\frac{2\pi}{N}\right)^{d/2} |H(x_0)|^{-1/2}\, e^{N f(x_0)}$ as $N \to \infty$, where $|H(x_0)|$ denotes the absolute value of the determinant of the Hessian.
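
The following Python sketch (not part of the original notes) checks the one-dimensional lemma numerically for the toy choice $f(x) = -\cosh(x)$, which has its maximum at $x_0 = 0$ with $f(x_0) = -1$ and $|f''(x_0)| = 1$; the ratio of the Laplace value to brute-force quadrature should approach 1 as $N$ grows.

```python
import numpy as np
from scipy.integrate import quad

# Toy f with a unique maximum at x0 = 0: f(x) = -cosh(x), so f(x0) = -1 and |f''(x0)| = 1.
f = lambda x: -np.cosh(x)

for N in [1, 5, 20, 100]:
    # Integrate exp(N * (f(x) - f(x0))) to avoid underflow; the factor e^{N f(x0)} cancels in the ratio.
    integral, _ = quad(lambda x: np.exp(N * (f(x) - f(0.0))), -10, 10)
    laplace = np.sqrt(2 * np.pi / (N * 1.0))   # sqrt(2*pi / (N |f''(x0)|))
    print(f"N = {N:3d}   quadrature = {integral:.6f}   Laplace = {laplace:.6f}   ratio = {laplace / integral:.4f}")
```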

Using the Laplace approximation

Given a model with $\theta = (\theta_1, \ldots, \theta_p)$:

Step 1. Find the mode of the log-joint (the MAP estimate of $\theta$):
$$\theta^* = \arg\max_\theta \log p(\theta, y).$$

Step 2. Evaluate the curvature of the log-joint at the mode,
$$H = -D^2 \log p(\theta, y)\big|_{\theta = \theta^*},$$
the (negative) Hessian matrix.

Step 3. Obtain the Gaussian approximation
$$\mathcal{N}(\theta \mid \mu, \Sigma), \qquad \mu = \theta^*, \quad \Sigma = H^{-1}.$$
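
A minimal sketch of these three steps for a generic log-joint, using scipy for the optimisation and a simple central-difference Hessian (not part of the original notes; the log-joint used here, a normal likelihood with a wide normal prior on its mean, is just an illustrative stand-in):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_joint, theta_init, eps=1e-4):
    """Gaussian (Laplace) approximation N(mu, Sigma) to the density proportional to exp(log_joint)."""
    # Step 1: mode of the log-joint (MAP estimate).
    mode = minimize(lambda th: -log_joint(th), theta_init, method="BFGS").x
    p = len(mode)
    # Step 2: curvature at the mode, H = -D^2 log_joint, by central finite differences.
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei, ej = np.eye(p)[i] * eps, np.eye(p)[j] * eps
            H[i, j] = -(log_joint(mode + ei + ej) - log_joint(mode + ei - ej)
                        - log_joint(mode - ei + ej) + log_joint(mode - ei - ej)) / (4 * eps**2)
    # Step 3: Gaussian approximation with mu = mode and Sigma = H^{-1}.
    return mode, np.linalg.inv(H)

# Illustrative log-joint: y_i ~ N(theta, 1) with a N(0, 10^2) prior on theta.
y = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
log_joint = lambda th: -0.5 * np.sum((y - th[0])**2) - 0.5 * th[0]**2 / 100.0
mu, Sigma = laplace_approx(log_joint, theta_init=np.zeros(1))
print("mu =", mu, "Sigma =", Sigma)
```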

Example

Suppose the $y_i$ are iid $N(\mu, \sigma^2)$ with a flat prior on $\mu$ and on $\log\sigma$. The posterior is

$$p(\mu, \sigma^2 \mid y) \propto (\sigma^2)^{-\frac{n}{2}-1} \exp\left\{ -\frac{(n-1)s^2 + n(\bar{y}-\mu)^2}{2\sigma^2} \right\},$$

where $\bar{y} = \frac{1}{n}\sum y_i$ and $s^2 = \frac{1}{n-1}\sum (y_i - \bar{y})^2$. Writing $\nu = \log\sigma$ we get

$$p(\mu, \nu \mid y) \propto f(\mu, \nu) = \exp\left\{ -\frac{(n-1)s^2 + n(\bar{y}-\mu)^2}{2 e^{2\nu}} - n\nu \right\}.$$

It is easy to check that

$$(\hat\mu, \hat\nu) = \operatorname{mode}(\mu, \nu \mid y) = \left( \bar{y},\; \tfrac{1}{2}\log\frac{(n-1)s^2}{n} \right).$$

The second-order derivatives are

$$\frac{\partial^2}{\partial\mu^2}\log f = -n e^{-2\nu}, \qquad
\frac{\partial^2}{\partial\mu\,\partial\nu}\log f = -2n(\bar{y}-\mu)e^{-2\nu}, \qquad
\frac{\partial^2}{\partial\nu^2}\log f = -2\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right] e^{-2\nu},$$

so that, at the mode,

$$H = \begin{pmatrix} \dfrac{n^2}{(n-1)s^2} & 0 \\ 0 & 2n \end{pmatrix},$$

and we have the Laplace approximation

$$(\mu, \nu) \approx \mathcal{N}\!\left( \begin{pmatrix} \bar{y} \\ \tfrac{1}{2}\log\frac{(n-1)s^2}{n} \end{pmatrix},\;
\begin{pmatrix} \dfrac{(n-1)s^2}{n^2} & 0 \\ 0 & \dfrac{1}{2n} \end{pmatrix} \right).$$
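
A small check on simulated data (not part of the original notes; the simulated sample and the optimiser's starting point are arbitrary choices) that the analytic mode above agrees with a numerical maximiser of $\log f$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=50)          # simulated data
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)

# log f(mu, nu) as above, with nu = log(sigma)
def log_f(th):
    mu, nu = th
    return -((n - 1) * s2 + n * (ybar - mu)**2) / (2 * np.exp(2 * nu)) - n * nu

mode_analytic = np.array([ybar, 0.5 * np.log((n - 1) * s2 / n)])
mode_numeric = minimize(lambda th: -log_f(th), x0=np.array([0.0, 1.0]), method="Nelder-Mead").x

print("analytic mode: ", mode_analytic)
print("numerical mode:", mode_numeric)
print("Laplace variances for (mu, nu):", (n - 1) * s2 / n**2, 1.0 / (2 * n))
```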

Limitations of the Laplace method

The Laplace approximation is often too strong a simplification.
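
One way to see this (not in the original notes): for a skewed density such as Beta(2, 10), the Laplace approximation, a normal centred at the mode with variance given by the inverse curvature there, badly understates the right tail probabilities:

```python
import numpy as np
from scipy.stats import beta, norm

a, b = 2, 10
mode = (a - 1) / (a + b - 2)                              # mode of Beta(2, 10) = 0.1
curv = (a - 1) / mode**2 + (b - 1) / (1 - mode)**2        # -(d^2/dtheta^2) log density at the mode
laplace = norm(loc=mode, scale=np.sqrt(1.0 / curv))       # Laplace (Gaussian) approximation

for t in [0.2, 0.3, 0.4]:
    print(f"P(theta > {t}):  exact = {beta(a, b).sf(t):.4f}   Laplace = {laplace.sf(t):.4f}")
```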

Laplace method for computing the marginal likelihood

$$P(x) = \int P(x \mid \theta)\,\pi(\theta)\,d\theta
 = \int \exp\left\{ -N\left( -\tfrac{1}{N}\log P(x \mid \theta) - \tfrac{1}{N}\log\pi(\theta) \right) \right\} d\theta.$$

Define $h(\theta) = -\frac{1}{N}\log P(x \mid \theta) - \frac{1}{N}\log\pi(\theta)$, so that the integral we want to compute is of the form $\int \exp\{-N h(\theta)\}\,d\theta$.

Expanding around the minimiser $\theta^*$ of $h$,

$$h(\theta) \approx h(\theta^*) + \tfrac{1}{2} h''(\theta^*)(\theta - \theta^*)^2,$$

and we can approximate the integral as

$$\int e^{-N h(\theta)}\,d\theta \approx e^{-N h(\theta^*)} \int \exp\left\{ -\tfrac{N}{2} h''(\theta^*)(\theta - \theta^*)^2 \right\} d\theta.$$

Comparing to a normal pdf we have

$$\int e^{-N h(\theta)}\,d\theta \approx e^{-N h(\theta^*)}\,(2\pi)^{\frac12}\,\big(N h''(\theta^*)\big)^{-\frac12}
 = P(x \mid \theta^*)\,\pi(\theta^*)\,(2\pi)^{\frac12}\,\big(N h''(\theta^*)\big)^{-\frac12}.$$

Laplace's method

For a $d$-dimensional function the analogue of this result is

$$\int e^{-N f(x)}\,dx \approx e^{-N f(x_0)}\,(2\pi)^{\frac d2}\,N^{-\frac d2}\,|f''(x_0)|^{-\frac12},$$

where $x_0$ is the minimiser of $f$ and $|f''(x_0)|$ is the determinant of the Hessian of the function evaluated at $x_0$.
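
A sketch applying this formula to a Beta-Bernoulli model, where the exact marginal likelihood is available in closed form for comparison (not part of the original notes; the prior parameters, success probability, seed and sample sizes are arbitrary choices):

```python
import numpy as np
from scipy.special import betaln
from scipy.optimize import minimize_scalar

# x_1, ..., x_N iid Bernoulli(theta) with a Beta(a, b) prior on theta;
# the exact marginal likelihood is B(k + a, N - k + b) / B(a, b), with k the number of successes.
a, b = 2.0, 2.0
rng = np.random.default_rng(1)

for N in [10, 50, 500]:
    k = rng.binomial(1, 0.3, size=N).sum()

    log_joint = lambda th: ((k + a - 1) * np.log(th)
                            + (N - k + b - 1) * np.log(1 - th)
                            - betaln(a, b))

    # Mode of the log-joint and (minus) its second derivative there
    theta_star = minimize_scalar(lambda th: -log_joint(th),
                                 bounds=(1e-6, 1 - 1e-6), method="bounded").x
    curv = (k + a - 1) / theta_star**2 + (N - k + b - 1) / (1 - theta_star)**2

    log_ml_laplace = log_joint(theta_star) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(curv)
    log_ml_exact = betaln(k + a, N - k + b) - betaln(a, b)
    print(f"N = {N:4d}   exact = {log_ml_exact:9.4f}   Laplace = {log_ml_laplace:9.4f}")
```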

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) takes the approximation one step further, essentially by minimising the impact of the prior.

Firstly, the MAP estimate $\theta^*$ is replaced by the MLE $\hat\theta$, which is reasonable if the prior has a small effect.

Secondly, BIC only retains the terms that vary with $N$, since asymptotically the terms that are constant in $N$ do not matter.

Dropping the constant terms we get

$$\log P(X) \approx \log P(X \mid \hat\theta) - \frac{d}{2}\log N.$$
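
A sketch (not part of the original notes) comparing the BIC score $\log P(X \mid \hat\theta) - \frac{d}{2}\log N$ with the exact log marginal likelihood in the same Bernoulli setting, now under a uniform Beta(1, 1) prior; the gap between the two is O(1), so it becomes negligible relative to the leading terms as $N$ grows. The success probability and seed are arbitrary.

```python
import numpy as np
from scipy.special import betaln, xlogy

rng = np.random.default_rng(2)
for N in [10, 100, 1000, 10000]:
    k = rng.binomial(1, 0.3, size=N).sum()
    theta_mle = k / N
    loglik_mle = xlogy(k, theta_mle) + xlogy(N - k, 1 - theta_mle)   # xlogy handles k = 0 or k = N
    bic_score = loglik_mle - 0.5 * np.log(N)                         # d = 1 free parameter
    log_ml_exact = betaln(k + 1, N - k + 1)                          # uniform prior: marginal = B(k+1, N-k+1)
    print(f"N = {N:6d}   exact = {log_ml_exact:11.3f}   BIC = {bic_score:11.3f}   gap = {bic_score - log_ml_exact:6.3f}")
```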

Bayesian Information Criterion (BIC) - extra details

Why can we ignore the term $-\frac{1}{2}\log|h''(\hat\theta)|$?

Assume (as above) that we can ignore the prior, i.e. $\pi(\theta) = 1$, and that the data points $X_1, \ldots, X_N$ are iid. Then

$$h(\hat\theta) = -\frac{1}{N}\log P(X \mid \theta)\Big|_{\theta=\hat\theta}
 = -\frac{1}{N}\sum_{i=1}^N \log P(X_i \mid \theta)\Big|_{\theta=\hat\theta}.$$

The thing to notice about this term is that it is now the (negative) average log-likelihood.

Now consider the random variables $\log P(X_i \mid \theta)$ and apply the weak law of large numbers: $h(\hat\theta) \to -\mathbb{E}\big[\log P(X_i \mid \theta)\big]\big|_{\theta=\hat\theta}$, and the $(m,n)$th element of $h''(\hat\theta)$ converges to

$$-\frac{\partial^2}{\partial\theta_m\,\partial\theta_n}\mathbb{E}\big[\log P(X_i \mid \theta)\big]\Big|_{\theta=\hat\theta}.$$

These are constants, i.e. expected log-likelihoods (and their derivatives) for a single data point, so $h''(\hat\theta)$ does not grow with $N$ and its contribution can be ignored in the BIC approximation.
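
A small illustration of the law-of-large-numbers step (not in the original notes), for an $N(\mu, 1)$ model: the average log-likelihood at the MLE settles down to a constant as $N$ grows (here it tends to $-\tfrac12\log 2\pi - \tfrac12 \approx -1.419$), so its derivatives do not grow with $N$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = 2.0
for N in [10, 100, 1000, 10000, 100000]:
    x = rng.normal(mu_true, 1.0, size=N)
    mu_hat = x.mean()                                              # MLE of mu
    avg_loglik = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu_hat)**2)
    print(f"N = {N:7d}   average log-likelihood at MLE = {avg_loglik:.5f}")
```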

Variational Bayes

The idea of VB is to find an approximation $Q(\theta)$ to a given posterior distribution $P(\theta \mid X)$, that is $Q(\theta) \approx P(\theta \mid X)$, where $\theta$ is the vector of parameters. We then use $Q(\theta)$ to approximate the marginal likelihood. In fact, what we do is find a lower bound for the marginal likelihood.

Question: how do we find a good approximate posterior $Q(\theta)$?

Kullback-Leibler (KL) divergence

The strategy we take is to find a distribution $Q(\theta)$ that minimises a measure of distance between $Q(\theta)$ and the posterior $P(\theta \mid X)$.

Definition. The Kullback-Leibler divergence $KL(q \,\|\, p)$ between two distributions $q(x)$ and $p(x)$ is

$$KL(q \,\|\, p) = \int \log\left[\frac{q(x)}{p(x)}\right] q(x)\,dx.$$

[Figure: the densities $q(x)$ and $p(x)$, and the integrand $q(x)\log(q(x)/p(x))$.]

Exercise. $KL(q \,\|\, p) \geq 0$, with $KL(q \,\|\, p) = 0$ iff $q = p$.
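
The definition is easy to evaluate numerically. The sketch below (not part of the original notes) computes $KL(q \,\|\, p)$ by quadrature for two normal distributions and checks it against the known closed form $\log(\sigma_2/\sigma_1) + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac12$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

q, p = norm(0.0, 1.0), norm(1.0, 2.0)   # q = N(0, 1), p = N(1, 4)

kl_quad, _ = quad(lambda x: np.exp(q.logpdf(x)) * (q.logpdf(x) - p.logpdf(x)), -12, 12)
kl_exact = np.log(2.0 / 1.0) + (1.0**2 + (0.0 - 1.0)**2) / (2 * 2.0**2) - 0.5

print(f"quadrature: {kl_quad:.6f}   closed form: {kl_exact:.6f}")   # both should be ~0.4431
```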

N(µ, σ²) approximations to a Gamma(10, 1) density

[Figure: four normal approximations to a Gamma(10, 1) density, with panel titles
µ = 10, σ² = 4, KL = 0.223;  µ = 9.11, σ² = 3.03, KL = 0.124;
µ = 13, σ² = 2.23, KL = 0.347;  µ = 9, σ² = 2, KL = 0.386.]

We consider the KL divergence between $Q(\theta)$ and $P(\theta \mid X)$:

$$\begin{aligned}
KL\big(Q(\theta) \,\|\, P(\theta \mid X)\big) &= \int \log\left[\frac{Q(\theta)}{P(\theta \mid X)}\right] Q(\theta)\,d\theta \\
&= \int \log\left[\frac{Q(\theta)\,P(X)}{P(\theta, X)}\right] Q(\theta)\,d\theta \\
&= \log P(X) - \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta.
\end{aligned}$$

The log marginal likelihood can then be written as

$$\log P(X) = F(Q(\theta)) + KL\big(Q(\theta) \,\|\, P(\theta \mid X)\big), \qquad (1)$$

where $F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta$.

Note. Since $KL(q \,\|\, p) \geq 0$ we have $\log P(X) \geq F(Q(\theta))$, so $F(Q(\theta))$ is a lower bound on the log marginal likelihood.
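
A numerical check of decomposition (1) (not part of the original notes), for a conjugate model where everything is available in closed form: $X_i \sim N(\theta, 1)$ with a $N(0, 4)$ prior on $\theta$. Here $Q$ is an arbitrary normal, and $F(Q) + KL$ should match $\log P(X)$ up to quadrature error:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, size=20)            # data: x_i ~ N(theta, 1)
n, tau2 = len(x), 4.0                        # prior: theta ~ N(0, tau2)

log_joint = lambda th: np.sum(norm(th, 1.0).logpdf(x)) + norm(0.0, np.sqrt(tau2)).logpdf(th)

# Exact posterior N(m_post, v_post) and exact log marginal likelihood
v_post = 1.0 / (n + 1.0 / tau2)
m_post = v_post * x.sum()
log_evidence = multivariate_normal(mean=np.zeros(n),
                                   cov=np.eye(n) + tau2 * np.ones((n, n))).logpdf(x)

Q = norm(0.5, 0.7)                           # an arbitrary approximating distribution Q(theta)
post = norm(m_post, np.sqrt(v_post))

F, _ = quad(lambda th: Q.pdf(th) * (log_joint(th) - Q.logpdf(th)), -10, 10)    # lower bound F(Q)
KL, _ = quad(lambda th: Q.pdf(th) * (Q.logpdf(th) - post.logpdf(th)), -10, 10) # KL(Q || posterior)

print(f"F(Q) + KL = {F + KL:.6f}   log P(X) = {log_evidence:.6f}")
```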

The mean field approximation

We now need to ask what form $Q(\theta)$ should take.

The most widely used approximation is known as the mean field approximation and assumes only that the approximate posterior has a factorised form

$$Q(\theta) = \prod_i Q(\theta_i).$$

The VB algorithm iteratively maximises $F(Q(\theta))$ with respect to the free distributions $Q(\theta_i)$, which is coordinate ascent in the function space of variational distributions.

We refer to each $Q(\theta_i)$ as a VB component. We update each component $Q(\theta_i)$ in turn, keeping $Q(\theta_j)$, $j \neq i$, fixed.

VB components

Lemma. The VB components take the form

$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big( \log P(X, \theta) \big) + \text{const}.$$

Proof. Writing $Q(\theta) = Q(\theta_i)\,Q(\theta_{-i})$, where $\theta_{-i} = \theta \setminus \theta_i$, the lower bound can be rewritten as

$$\begin{aligned}
F(Q(\theta)) &= \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta \\
&= \int \log\left[\frac{P(\theta, X)}{Q(\theta_i)\,Q(\theta_{-i})}\right] Q(\theta_i)\,Q(\theta_{-i})\,d\theta_i\,d\theta_{-i} \\
&= \int Q(\theta_i) \left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right] d\theta_i
   - \int Q(\theta_i)\,Q(\theta_{-i}) \log Q(\theta_i)\,d\theta_i\,d\theta_{-i}
   - \int Q(\theta_i)\,Q(\theta_{-i}) \log Q(\theta_{-i})\,d\theta_i\,d\theta_{-i}.
\end{aligned}$$

Since $\int Q(\theta_{-i})\,d\theta_{-i} = 1$ and $Q(\theta_{-i}) = \prod_{j \neq i} Q(\theta_j)$, this becomes

$$F(Q(\theta)) = \int Q(\theta_i) \left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right] d\theta_i
 - \int Q(\theta_i) \log Q(\theta_i)\,d\theta_i
 - \sum_{j \neq i} \int Q(\theta_j) \log Q(\theta_j)\,d\theta_j.$$

If we let

$$Q^*(\theta_i) = \frac{1}{Z} \exp\left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right],$$

where $Z$ is a normalising constant, and write $H(Q(\theta_j)) = -\int Q(\theta_j) \log Q(\theta_j)\,d\theta_j$ for the entropy of $Q(\theta_j)$, then

$$F(Q(\theta)) = \int Q(\theta_i) \log\frac{Q^*(\theta_i)}{Q(\theta_i)}\,d\theta_i + \log Z + \sum_{j \neq i} H(Q(\theta_j))
 = -KL\big(Q(\theta_i) \,\|\, Q^*(\theta_i)\big) + \log Z + \sum_{j \neq i} H(Q(\theta_j)).$$

We then see that $F(Q(\theta))$ is maximised when $Q(\theta_i) = Q^*(\theta_i)$, as this choice minimises the Kullback-Leibler divergence term.

Thus the update for $Q(\theta_i)$ is given by

$$Q(\theta_i) \propto \exp\left[ \int \log P(X, \theta)\, Q(\theta_{-i})\,d\theta_{-i} \right],$$

or equivalently

$$\log Q(\theta_i) = \mathbb{E}_{Q(\theta_{-i})}\big( \log P(X, \theta) \big) + \text{const}.$$

VB algorithm

This implies a straightforward algorithm for variational inference:

1. Initialise all approximate posteriors (in the example below, $Q(\theta) = Q(\mu)Q(\tau)$), e.g. by setting them to their priors.
2. Cycle over the parameters, revising each $Q(\theta_i)$ given the current estimates of the others.
3. Loop until convergence.

Convergence is checked by calculating the VB lower bound at each step, i.e.

$$F(Q(\theta)) = \int \log\left[\frac{P(\theta, X)}{Q(\theta)}\right] Q(\theta)\,d\theta.$$

The precise form of this term needs to be derived, and can be quite tricky.

Example 1

Consider applying VB to the hierarchical model

$$X_i \sim N(\mu, \tau^{-1}), \quad i = 1, \ldots, P, \qquad \mu \sim N\big(m, (\tau\beta)^{-1}\big), \qquad \tau \sim \Gamma(a, b).$$

Note. We are using a prior of the form $\pi(\tau, \mu) = \pi(\mu \mid \tau)\,\pi(\tau)$.

Let $\theta = (\mu, \tau)$ and assume $Q(\theta) = Q(\mu)\,Q(\tau)$. We will use the notation $\bar\theta_i = \mathbb{E}_{Q(\theta_i)}\theta_i$.

The log joint density is

$$\log P(X, \theta) = \frac{P}{2}\log\tau - \frac{\tau}{2}\sum_{i=1}^P (X_i - \mu)^2
 + \frac{1}{2}\log\tau - \frac{\tau\beta}{2}(\mu - m)^2 + (a-1)\log\tau - b\tau + K.$$

We can derive the VB updates one at a time. We start with $Q(\mu)$.

Note. We just need to focus on the terms involving $\mu$:

$$\log Q(\mu) = \mathbb{E}_{Q(\tau)}\big( \log P(X, \theta) \big) + C
 = -\frac{\bar\tau}{2}\left( \sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2 \right) + C,$$

where $\bar\tau = \mathbb{E}_{Q(\tau)}(\tau)$. We will be able to determine $\bar\tau$ when we derive the other component of the approximate density, $Q(\tau)$.

We can see this log density has the form of a normal distribution:

$$\log Q(\mu) = -\frac{\bar\tau}{2}\left( \sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2 \right) + C
 = -\frac{\beta^*}{2}(\mu - m^*)^2 + C',$$

where

$$\beta^* = (\beta + P)\,\bar\tau, \qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$

Thus $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$.

The second component of the VB approximation is derived as

$$\log Q(\tau) = \mathbb{E}_{Q(\mu)}\big( \log P(X, \theta) \big) + C
 = \left( \frac{P+1}{2} + a - 1 \right)\log\tau
 - \frac{\tau}{2}\,\mathbb{E}_{Q(\mu)}\!\left[ \sum_{i=1}^P (X_i - \mu)^2 + \beta(\mu - m)^2 \right] - b\tau + C.$$

We can see this log density has the form of a gamma distribution, namely $\Gamma(a^*, b^*)$ with

$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\sum_{i=1}^P\left( X_i^2 - 2 X_i \bar\mu + \overline{\mu^2} \right)
 + \frac{\beta}{2}\left( m^2 - 2 m \bar\mu + \overline{\mu^2} \right).$$

So overall we have:

1. $Q(\mu) = N(\mu \mid m^*, \beta^{*-1})$, where
$$\beta^* = (\beta + P)\,\bar\tau, \qquad m^* = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$

2. $Q(\tau) = \Gamma(\tau \mid a^*, b^*)$, where
$$a^* = a + \frac{P+1}{2}, \qquad
b^* = b + \frac{1}{2}\sum_{i=1}^P\left( X_i^2 - 2 X_i \bar\mu + \overline{\mu^2} \right)
 + \frac{\beta}{2}\left( m^2 - 2 m \bar\mu + \overline{\mu^2} \right).$$

To calculate these we need

$$\bar\tau = \frac{a^*}{b^*}, \qquad \bar\mu = m^*, \qquad \overline{\mu^2} = \beta^{*-1} + m^{*2}.$$
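
A direct implementation of these updates on simulated data (not part of the original notes; the priors, simulated sample and convergence tolerance are arbitrary choices), iterating until the variational parameters stabilise:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=3.0, scale=2.0, size=30)      # simulated data
P = len(X)

# Prior hyperparameters: mu ~ N(m, (tau*beta)^-1), tau ~ Gamma(a, b) (shape a, rate b)
m, beta, a, b = 0.0, 1.0, 1.0, 1.0

a_star, b_star = a, b                            # initialise Q(tau) at its prior
m_star, beta_star = m, beta * a / b

for it in range(200):
    tau_bar = a_star / b_star                    # E[tau] under the current Q(tau)
    # Update Q(mu) = N(m_star, 1/beta_star)
    beta_star = (beta + P) * tau_bar
    m_star = (beta * m + X.sum()) / (beta + P)
    # Update Q(tau) = Gamma(a_star, b_star), using E[mu] and E[mu^2] under Q(mu)
    mu_bar, mu2_bar = m_star, 1.0 / beta_star + m_star**2
    a_star = a + (P + 1) / 2
    b_new = (b + 0.5 * np.sum(X**2 - 2 * X * mu_bar + mu2_bar)
               + 0.5 * beta * (m**2 - 2 * m * mu_bar + mu2_bar))
    if abs(b_new - b_star) < 1e-10:
        b_star = b_new
        break
    b_star = b_new

print(f"Q(mu)  = N({m_star:.3f}, {1.0 / beta_star:.4f})")
print(f"Q(tau) = Gamma({a_star:.2f}, {b_star:.3f}),  E[tau] = {a_star / b_star:.4f}")
```

As the comparison with the exact posterior below suggests, the fitted $Q$ tends to centre in the right place but to understate the posterior spread.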

Example 1 (continued)

For this model the exact posterior was calculated in Lecture 6:

$$\pi(\tau, \mu \mid X) \propto \tau^{\tilde a - 1} \exp\left\{ -\tau\left[ \tilde b + \frac{\tilde\beta}{2}(\tilde m - \mu)^2 \right] \right\},$$

where

$$\tilde a = a + \frac{P+1}{2}, \qquad
\tilde b = b + \frac{1}{2}\sum_{i=1}^P (X_i - \bar X)^2 + \frac{P\beta}{2(P+\beta)}(\bar X - m)^2, \qquad
\tilde\beta = \beta + P, \qquad \tilde m = \frac{\beta m + \sum_{i=1}^P X_i}{\beta + P}.$$

We note some similarity between the VB updates and the true posterior parameters.

We can compare the true and VB posteriors when applied to a real dataset. We see that the VB approximations underestimate the posterior variances.

[Figure: true posterior versus VB posterior densities for the two model parameters; the VB densities are noticeably narrower.]

General comments

The property of VB underestimating the variance of the posterior is a general feature of the method whenever there is correlation between the $\theta_i$'s in the posterior, which is usually the case. This may not be important if the purpose of inference is model comparison, i.e. comparing the approximate marginal likelihoods between models.

VB is often much, much faster to implement than MCMC or other sampling-based methods.

The VB updates and lower bound can be tricky to derive, and sometimes further approximation is needed.

The VB algorithm will find a local mode of the posterior, so care should be taken when the posterior is thought or known to be multi-modal.