G8325: Variational Bayes

Vincent Dorie
Columbia University
Wednesday, November 2nd, 2011
Goal

[Figure: six panels (a) through (f) showing a sequence of distributions over the mean $\mu$ (horizontal axis, 0 to 2) and standard deviation $\sigma$ (vertical axis, roughly 0.2 to 1), reproduced from MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003. See http://www.inference.phy.cam.ac.uk/mackay/itila/.]
Expectation-Maximization: Setup

Latent variable model: $y \mid \theta \sim p_\eta(y \mid \theta)$, $\theta \sim p_\eta(\theta)$.

Likelihood:
$$p_\eta(y) = L(\eta) = \int p_\eta(y, \theta)\, d\theta = \int p_\eta(y \mid \theta)\, p_\eta(\theta)\, d\theta.$$
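A minimal numeric sketch (our example, not the slides'): when $y \mid \theta \sim N(\theta, \sigma^2)$ and $\theta \sim N(\eta, \tau^2)$, the integral has the closed form $p_\eta(y) = N(y;\, \eta,\, \sigma^2 + \tau^2)$, which we can check against quadrature. All values and names below are illustrative.

```python
# Numeric check (illustrative): for a conjugate Gaussian latent-variable
# model, the marginal likelihood p_eta(y) = \int p_eta(y|theta) p_eta(theta) dtheta
# has a closed form, so we can verify the integral directly.
import numpy as np
from scipy import integrate, stats

eta, sigma, tau = 1.0, 0.5, 2.0   # hyperparameter and scales (arbitrary)
y = 0.3                           # a single observation

# Integrand: p_eta(y | theta) * p_eta(theta)
f = lambda theta: stats.norm.pdf(y, theta, sigma) * stats.norm.pdf(theta, eta, tau)

numeric, _ = integrate.quad(f, -np.inf, np.inf)
closed_form = stats.norm.pdf(y, eta, np.sqrt(sigma**2 + tau**2))
print(numeric, closed_form)       # the two agree to quadrature accuracy
```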
Expectation-Maximization: E-M

Define the Q function:
$$Q(\eta \mid \eta^{(t)}) = E_{\theta \mid y;\, \eta^{(t)}}\left[\log p_\eta(y, \theta)\right].$$

We iterate:
$$\eta^{(t+1)} = \arg\sup_\eta\, Q(\eta \mid \eta^{(t)}).$$

Has an intuitive basis: average the complete-data log density over the latent variables, then maximize.
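The slides leave the model abstract; as an illustration (this example and all names are ours, not the lecture's), here is EM for a two-component Gaussian mixture with unit variances, where $\eta = (\pi, \mu_1, \mu_2)$ and the latent $\theta_i$ are component labels. The E-step computes the responsibilities that define $Q(\eta \mid \eta^{(t)})$; the M-step maximizes $Q$ in closed form.

```python
# Illustrative EM (not the lecture's example): two-component Gaussian
# mixture with unit variances, eta = (pi, mu1, mu2), latent labels in {1, 2}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 300)])

pi, mu1, mu2 = 0.5, -1.0, 1.0          # initial eta^(0) (arbitrary)
for t in range(100):
    # E-step: responsibilities r_i = p(label_i = 1 | y_i; eta^(t)),
    # the weights that define Q(eta | eta^(t)).
    a = pi * stats.norm.pdf(y, mu1, 1.0)
    b = (1 - pi) * stats.norm.pdf(y, mu2, 1.0)
    r = a / (a + b)
    # M-step: closed-form arg max of Q over eta.
    pi = r.mean()
    mu1 = (r * y).sum() / r.sum()
    mu2 = ((1 - r) * y).sum() / (1 - r).sum()

print(pi, mu1, mu2)   # approaches (0.4, -2, 2) up to label order
```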
Expectation-Maximization: E-M

$Q(\cdot \mid \hat\eta_{\mathrm{MLE}})$ is maximized at $\hat\eta_{\mathrm{MLE}}$, and $L(\eta^{(t+1)}) \ge L(\eta^{(t)})$.

Can optimize over any function which is defined as an integral, e.g. for the posterior
$$p(\eta \mid y) = \int p(\eta, \theta \mid y)\, d\theta \propto \int p(y \mid \theta, \eta)\, p(\theta, \eta)\, d\theta,$$
take
$$Q(\eta \mid \eta^{(t)}) = E_{\theta \mid y;\, \eta^{(t)}}\left[\log p(y \mid \theta, \eta)\, p(\theta, \eta)\right].$$
Expectation-Maximization: Closer look

$$\begin{aligned}
Q(\eta \mid \eta^{(t)}) &= \int \log p_\eta(y, \theta)\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&= \int \log \frac{p_\eta(y, \theta)}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta
 - \int \log \frac{1}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&= -D_{\mathrm{KL}}\!\left(p_{\eta^{(t)}}(\theta \mid y) \,\big\|\, p_\eta(y, \theta)\right) - H\!\left(p_{\eta^{(t)}}(\theta \mid y)\right),
\end{aligned}$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence and $H$ is the entropy.
Expectation-Maximization: Kullback-Leibler Divergence

$$D_{\mathrm{KL}}(f \,\|\, g) \triangleq \int \log\frac{f}{g}\, f,
\qquad
-D_{\mathrm{KL}}(f \,\|\, g) = \int \log\frac{g}{f}\, f \le \log \int \frac{g}{f}\, f = 0,$$
where the inequality is Jensen's. If $f$ and $g$ have common support, $D_{\mathrm{KL}}(f \,\|\, g) = 0$ iff $f = g$.

In addition, $H(p_{\eta^{(t)}}(\theta \mid y))$ does not depend on $\eta$, so maximizing $Q(\cdot \mid \eta^{(t)})$ is equivalent to minimizing $D_{\mathrm{KL}}(p_{\eta^{(t)}}(\theta \mid y) \,\|\, p_\eta(y, \theta))$.
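A quick numeric illustration of both properties, using discrete distributions on three points (the values are arbitrary):

```python
# Nonnegativity of KL, and equality exactly when the distributions match.
import numpy as np

def kl(f, g):
    return np.sum(f * np.log(f / g))

f = np.array([0.2, 0.5, 0.3])
g = np.array([0.3, 0.3, 0.4])
print(kl(f, g))   # strictly positive since f != g
print(kl(f, f))   # exactly 0.0
```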
Expectation-Maximization: Lower-bound property

$$\begin{aligned}
l(\eta) &= \log \int p_\eta(y, \theta)\, d\theta
 = \log \int \frac{p_\eta(y, \theta)}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&\ge \int \log \frac{p_\eta(y, \theta)}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta
 = -D_{\mathrm{KL}}\!\left(p_{\eta^{(t)}}(\theta \mid y) \,\big\|\, p_\eta(y, \theta)\right),
\end{aligned}$$

by Jensen's inequality. The Q function (plus the entropy term, which is constant in $\eta$) provides a lower bound on the log-likelihood.
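A numeric check of the bound for a discrete latent variable (the table of joint probabilities is invented for illustration): the bound holds with a gap equal to $D_{\mathrm{KL}}(q \,\|\, p_\eta(\theta \mid y))$ for an arbitrary $q$, and is tight when $q$ is the exact posterior.

```python
# Numeric check (illustrative): for a discrete latent theta, compare
# l(eta) = log sum_theta p(y, theta) with the bound
# E_q[log p(y, theta)] + H(q) for an arbitrary q and the exact posterior.
import numpy as np

p_joint = np.array([0.10, 0.25, 0.05])      # p_eta(y, theta), theta = 0, 1, 2
loglik = np.log(p_joint.sum())              # l(eta) = log p_eta(y)

def bound(q):
    return np.sum(q * np.log(p_joint / q))  # = E_q[log p(y,theta)] + H(q)

q_arbitrary = np.array([1/3, 1/3, 1/3])
q_posterior = p_joint / p_joint.sum()       # p_eta(theta | y)

print(bound(q_arbitrary) <= loglik)            # True: gap = KL(q || posterior)
print(np.isclose(bound(q_posterior), loglik))  # True: bound is tight
```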
Expectation-Maximization: Equivalent representation

Define
$$F(q, \eta) = E_{\theta \sim q}\left[\log p_\eta(y, \theta)\right] + H(q)
 = -D_{\mathrm{KL}}\!\left(q \,\big\|\, p_\eta(y, \theta)\right)
 = -D_{\mathrm{KL}}\!\left(q \,\big\|\, p_\eta(\theta \mid y)\right) + l(\eta),$$
where the last equality follows from Bayes' rule. E-M is coordinate ascent on this function:

1. For fixed $\eta$, $F$ is maximized at $q = p_\eta(\theta \mid y)$.
2. For $q$ fixed at $p_{\eta^{(t)}}(\theta \mid y)$, $F = Q(\eta \mid \eta^{(t)}) + H(p_{\eta^{(t)}}(\theta \mid y))$.
Expectation-Maximization: E-M Summary

1. Uses a distance or divergence function.
2. Produces a sequence of distributions which approximate the posterior distribution of the latent variables by minimizing the divergence.
3. Provides a lower bound on the log-likelihood.
Variational Bayes

In VB, we consider alternative ways of augmenting the model. Full Bayes: give all of $\theta$ a prior.

Let $q$ be an approximation to the posterior distribution of $\theta \mid y$; $q$ will be chosen so as to be the best within a certain class. (For a given iteration, $q^{(t+1)}$ will likely depend on some parameters from time $t$.)

EM: $F(q, \eta) = E_{\theta \sim q}[\log p_\eta(y, \theta)] + H(q)$.
VB: $F(q) = E_{\theta \sim q}[\log p(y, \theta)] + H(q)$.
Variational Bayes: VB Theory, Pictorial Representation

[Figure 2.3 from Beal (2003): the variational Bayesian EM (VBEM) algorithm. In the VBE step, the variational posterior over hidden variables $q_x(x)$ is set according to (2.60); in the VBM step, the variational posterior over parameters $q_\theta(\theta)$ is set according to (2.56). Each step is guaranteed to increase (or leave unchanged) the lower bound $F(q_x, q_\theta)$ on the log marginal likelihood $\ln p(y \mid m)$, shrinking $\mathrm{KL}(q_x q_\theta \,\|\, p(x, \theta \mid y))$.]
Variational Bayes: Calculus of Variations

For functionals of the form
$$J[q] = \int_a^b G(\theta, q, q')\, d\theta,$$
defined on a set of functions with continuous first derivatives and satisfying $q(a) = A$, $q(b) = B$, an extremum of $J[q]$ must satisfy Euler's equation
$$G_q - \frac{d}{d\theta} G_{q'} = 0.$$

Break $q$ into independent blocks, $q(\theta) = \prod_{i=1}^K q_i(\theta_i)$, and write the $F$ function as
$$\int E_{\theta_{[-j]} \sim q_{[-j]}}\left[\log p(y, \theta_j, \theta_{[-j]}) - \sum_{i=1}^K \log q_i(\theta_i)\right] q_j(\theta_j)\, d\theta_j.$$
Variational Bayes: Calculus of Variations

Add a Lagrange multiplier to enforce $\int q_j = 1$ and apply Euler's equation:
$$\begin{aligned}
0 &= E_{\theta_{[-j]} \sim q_{[-j]}}\left[\log p(y, \theta_j, \theta_{[-j]}) - \sum_{i=1}^K \log q_i(\theta_i)\right] - \frac{1}{q_j(\theta_j)}\, q_j(\theta_j) + \lambda_j \\
&= E_{\theta_{[-j]} \sim q_{[-j]}}\left[\log p(y, \theta_j, \theta_{[-j]})\right] - \log q_j(\theta_j) + \text{const} + \lambda_j,
\end{aligned}$$
so that
$$\log q_j(\theta_j) \propto E_{\theta_{[-j]} \sim q_{[-j]}}\left[\log p(y, \theta_j, \theta_{[-j]})\right].$$
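To see the stationarity condition in action, here is a minimal sketch (the 3x3 table and all names are invented for illustration): for a discrete "posterior" $p(\theta_1, \theta_2 \mid y)$, the update $\log q_j \propto E_{q_{[-j]}}[\log p]$ becomes a pair of matrix-vector operations, iterated to convergence.

```python
# The condition log q_j ∝ E_{q_[-j]}[log p] turned into an algorithm for a
# toy discrete posterior p(theta_1, theta_2 | y) (arbitrary 3x3 table).
import numpy as np

p = np.array([[0.20, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.30]])    # joint over (theta_1, theta_2)

q1 = np.full(3, 1/3)                  # q^(0): uniform blocks
q2 = np.full(3, 1/3)
for t in range(50):
    # log q1(i) ∝ sum_j q2(j) log p(i, j), then renormalize; same for q2.
    q1 = np.exp(np.log(p) @ q2); q1 /= q1.sum()
    q2 = np.exp(q1 @ np.log(p)); q2 /= q2.sum()

print(np.round(np.outer(q1, q2), 3))  # factored approximation to p
```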
Variational Bayes: Using VB

First, choose a divergence measure and a class of distributions. Then:

1. Write out the joint distribution of $\theta$ and $y$.
2. Initialize to some $q^{(0)}$.
3. Iterate $q_i^{(t+1)}$ by maximizing $F$ with $q_{[-i]}^{(t)}$ held constant.

$q$ may depend on some parameters.
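A sketch of this recipe on a concrete model (the standard conjugate-Gaussian example, not necessarily the lecture's; data, hyperparameters, and names are ours): take $y_i \sim N(\mu, \tau^{-1})$ with priors $\mu \sim N(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, and factor $q(\mu, \tau) = q(\mu)\, q(\tau)$. Both block updates are available in closed form.

```python
# Coordinate-ascent VB for a Normal with unknown mean and precision
# (illustrative sketch under the assumptions stated above).
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(3.0, 2.0, size=100)         # true mean 3, precision 1/4
N, ybar = len(y), y.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0     # prior hyperparameters (arbitrary)

E_tau = a0 / b0                            # q^(0): initialize E[tau]
for t in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N) holding q(tau) fixed.
    mu_N = (lam0 * mu0 + N * ybar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N) holding q(mu) fixed;
    # under q(mu), E[(y_i - mu)^2] = (y_i - mu_N)^2 + 1/lam_N.
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum((y - mu_N) ** 2) + N / lam_N
                      + lam0 * ((mu_N - mu0) ** 2 + 1 / lam_N))
    E_tau = a_N / b_N

print(mu_N, E_tau)   # roughly 3.0 and 0.25
```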
Variational Bayes: Factoring the Distributions

Split latent variables and parameters.

[Figure 2.4 from Beal (2003): graphical depiction of the hidden-variable / parameter factorisation, with parameters $\theta$, latent variables $x_1, x_2, x_3$, and observations $y_1, y_2, y_3$. (a) The generative graphical model. (b) Graph representing the exact posterior. (c) Posterior graph after the variational approximation, with $\theta$ decoupled from the $x_i$.]
Variational Bayes: Choice of $q^{(0)}$

Each update requires computing an expectation with respect to the previous approximation. If
$$p(y \mid \theta) = h(y) \exp\left\{\phi(\theta)^\top T(y) - a(\theta)\right\},
\qquad
p(\theta \mid \nu, \lambda) = g(\nu, \lambda) \exp\left\{\phi(\theta)^\top \nu - \lambda a(\theta)\right\},$$
then
$$q(\theta) = g(\tilde\nu, \tilde\lambda) \exp\left\{\phi(\theta)^\top \tilde\nu - \tilde\lambda a(\theta)\right\},
\qquad
\tilde\lambda = \lambda + 1, \quad \tilde\nu = \nu + T(y).$$
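To make the abstract update concrete (our mapping, not the slides'): for Bernoulli data, $\phi(\theta) = \log\frac{\theta}{1-\theta}$, $T(y) = y$, and $a(\theta) = -\log(1-\theta)$, so the conjugate prior with parameters $(\nu, \lambda)$ is $\mathrm{Beta}(\nu + 1,\, \lambda - \nu + 1)$, and absorbing one observation via $\tilde\nu = \nu + T(y)$, $\tilde\lambda = \lambda + 1$ recovers the familiar Beta posterior update.

```python
# Conjugate exponential-family update, specialized to Bernoulli data
# (illustrative mapping: phi(theta) = logit(theta), T(y) = y,
# a(theta) = -log(1 - theta); prior (nu, lam) <-> Beta(nu+1, lam-nu+1)).
def update(nu, lam, y):
    return nu + y, lam + 1     # nu~ = nu + T(y), lam~ = lam + 1

nu, lam = 1.0, 2.0             # Beta(2, 2) prior in natural form
for y in [1, 1, 0, 1]:         # four Bernoulli observations
    nu, lam = update(nu, lam, y)

alpha, beta = nu + 1, lam - nu + 1
print(alpha, beta)             # Beta(5, 3): three 1s and one 0 on Beta(2, 2)
```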
Variational Bayes: Uses of VB

1. Obtain an approximate posterior.
2. Approximate posterior modes.
3. Provide a lower bound on $p(y)$; in Bayesian model selection, on $p(y \mid M_i)$.

Online variants exist.