G8325: Variational Bayes


G8325: Variational Bayes. Vincent Dorie, Columbia University. Wednesday, November 2nd, 2011.

Goal

[Slide shows a figure reproduced from MacKay, Cambridge University Press (2003), http://www.inference.phy.cam.ac.uk/mackay/itila/: panels (a)-(f) plotting $\sigma$ against $\mu$, illustrating a sequence of approximations over the $(\mu, \sigma)$ plane.]

Expectation-Maximization: Setup

Latent variable model: $y \mid \theta \sim p_\eta(y \mid \theta)$, $\theta \sim p_\eta(\theta)$.

Likelihood:
$$p_\eta(y) = L(\eta) = \int p_\eta(y, \theta)\, d\theta = \int p_\eta(y \mid \theta)\, p_\eta(\theta)\, d\theta.$$
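To make the setup concrete, here is a minimal sketch (an assumed example, not from the slides) for a conjugate normal model, $y \mid \theta \sim N(\theta, \sigma^2)$ with $\theta \sim N(m, \tau^2)$, where $\eta = (\sigma, m, \tau)$ is held fixed. The marginal likelihood $p_\eta(y)$ is available in closed form as $N(y; m, \sigma^2 + \tau^2)$, and numerical integration of the joint agrees with it.

```python
# Minimal sketch (assumed example): marginal likelihood
# p_eta(y) = integral of p_eta(y|theta) p_eta(theta) d(theta) for a normal-normal model,
# with eta = (sigma, m, tau) treated as fixed hyperparameters.
import numpy as np
from scipy import stats
from scipy.integrate import quad

sigma, m, tau = 1.0, 0.0, 2.0   # eta: likelihood sd, prior mean, prior sd
y = 1.3                          # a single observation

# Numerical integration of the integrand p(y | theta) p(theta).
integrand = lambda theta: stats.norm.pdf(y, theta, sigma) * stats.norm.pdf(theta, m, tau)
marginal_numeric, _ = quad(integrand, -50, 50)

# Closed form: y ~ N(m, sigma^2 + tau^2) after integrating out theta.
marginal_exact = stats.norm.pdf(y, m, np.sqrt(sigma**2 + tau**2))

print(marginal_numeric, marginal_exact)  # the two agree to quadrature accuracy
```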

Expectation-Maximization: E-M

Define the Q function:
$$Q(\eta \mid \eta^{(t)}) = E_{\theta \mid y;\, \eta^{(t)}}\big[\log p_\eta(y, \theta)\big].$$

We iterate:
$$\eta^{(t+1)} = \arg\sup_\eta\, Q(\eta \mid \eta^{(t)}).$$

This has an intuitive basis.
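As a hedged illustration of the iteration $\eta^{(t+1)} = \arg\sup_\eta Q(\eta \mid \eta^{(t)})$, the sketch below (an assumed example, not from the slides) runs EM for a two-component Gaussian mixture with known unit variances and equal weights, so $\eta$ consists of the two component means and both steps have closed forms.

```python
# Sketch of the EM iteration for a 2-component Gaussian mixture with known
# unit variances and equal weights; eta = (mu1, mu2). Assumed example only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])          # eta^(0)
for t in range(50):
    # E-step: responsibilities r[n, k] = p(theta_n = k | y_n; eta^(t)),
    # the posterior over the latent component labels.
    dens = np.stack([stats.norm.pdf(y, mu[k], 1.0) for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize Q(eta | eta^(t)) = E[log p_eta(y, theta)], closed form here.
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu)   # approaches the component means (-2, 3), up to label swapping
```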

Expectation-Maximization: E-M

$Q(\cdot \mid \hat\eta_{MLE})$ is maximized at $\hat\eta_{MLE}$.

$L(\eta^{(t+1)}) \geq L(\eta^{(t)})$.

Can optimize over any function which is defined as an integral, e.g. for
$$p(\eta \mid y) = \int p(\eta, \theta \mid y)\, d\theta \propto \int p(y \mid \theta, \eta)\, p(\theta, \eta)\, d\theta,$$
take
$$Q(\eta \mid \eta^{(t)}) = E_{\theta \mid y;\, \eta^{(t)}}\big[\log p(y \mid \theta, \eta)\, p(\theta, \eta)\big].$$

Expectation-Maximization: Closer look

$$\begin{aligned}
Q(\eta \mid \eta^{(t)}) &= \int \log p_\eta(y, \theta)\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&= \int \log \frac{p_\eta(y, \theta)}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \;-\; \int \log \frac{1}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&= -D_{KL}\big(p_{\eta^{(t)}}(\theta \mid y) \,\big\|\, p_\eta(y, \theta)\big) - H\big(p_{\eta^{(t)}}(\theta \mid y)\big),
\end{aligned}$$

where $D_{KL}$ is the Kullback-Leibler divergence and $H$ is the entropy.

Expectation-Maximization: Kullback-Leibler Divergence

$$D_{KL}(f \,\|\, g) \equiv \int \log\frac{f}{g}\, f, \qquad D_{KL}(f \,\|\, g) \;\geq\; -\log \int \frac{g}{f}\, f \;=\; 0.$$

If $f$ and $g$ have common support, $D_{KL}(f \,\|\, g) = 0$ iff $f = g$.

In addition, $H\big(p_{\eta^{(t)}}(\theta \mid y)\big)$ does not depend on $\eta$, so maximizing $Q(\cdot \mid \eta^{(t)})$ is equivalent to minimizing $D_{KL}\big(p_{\eta^{(t)}}(\theta \mid y) \,\big\|\, p_\eta(y, \theta)\big)$.
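A quick numerical check of these two properties on discrete distributions (an assumed example, not part of the slides): the divergence is nonnegative and vanishes only when the two distributions coincide.

```python
# Numerical check (assumed example): D_KL(f || g) >= 0, with equality iff f == g,
# for discrete distributions with common support.
import numpy as np

def kl(f, g):
    """Kullback-Leibler divergence: sum_i f_i log(f_i / g_i)."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return np.sum(f * np.log(f / g))

f = np.array([0.2, 0.5, 0.3])
g = np.array([0.4, 0.4, 0.2])

print(kl(f, g), kl(g, f))   # both positive, and generally unequal (not symmetric)
print(kl(f, f))             # 0.0: equality holds iff the distributions match
```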

Expectation-Maximization: Lower-bound property

$$\begin{aligned}
\ell(\eta) = \log \int p_\eta(y, \theta)\, d\theta &= \log \int \frac{p_\eta(y, \theta)}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&\geq \int \log \frac{p_\eta(y, \theta)}{p_{\eta^{(t)}}(\theta \mid y)}\, p_{\eta^{(t)}}(\theta \mid y)\, d\theta \\
&= -D_{KL}\big(p_{\eta^{(t)}}(\theta \mid y) \,\big\|\, p_\eta(y, \theta)\big).
\end{aligned}$$

The Q function provides a lower bound on the log-likelihood.

Expectation-Maximization: Equivalent representation

Define
$$\begin{aligned}
F(q, \eta) &= E_{\theta \sim q}\big[\log p_\eta(y, \theta)\big] + H(q) \\
&= -D_{KL}\big(q \,\big\|\, p_\eta(y, \theta)\big) \\
&= -D_{KL}\big(q \,\big\|\, p_\eta(\theta \mid y)\big) + \ell(\eta),
\end{aligned}$$
where the last line follows from Bayes' rule. E-M is coordinate ascent on this function:

1. For fixed $\eta$, $F$ is maximized at $q = p_\eta(\theta \mid y)$.
2. For $q$ fixed at $p_{\eta^{(t)}}(\theta \mid y)$, $F = Q(\eta \mid \eta^{(t)}) + H\big(p_{\eta^{(t)}}(\theta \mid y)\big)$.
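The following sketch (an assumed example, not from the slides) checks the first coordinate-ascent fact numerically for a model with a single binary latent variable: over all distributions $q$ on $\{0, 1\}$, $F(q, \eta)$ is maximized at the exact posterior $p_\eta(\theta \mid y)$, where it attains the value $\ell(\eta)$.

```python
# Assumed example: for a binary latent variable, verify that F(q, eta) is
# maximized over q at the exact posterior p_eta(theta | y), where F = l(eta).
import numpy as np
from scipy import stats

eta, y = 0.3, 0.8                       # prior P(theta = 1) and one observation
joint = np.array([                      # p_eta(y, theta) for theta = 0, 1
    (1 - eta) * stats.norm.pdf(y, 0.0, 1.0),
    eta * stats.norm.pdf(y, 1.0, 1.0),
])
log_lik = np.log(joint.sum())           # l(eta) = log p_eta(y)
posterior = joint / joint.sum()         # p_eta(theta | y)

def F(w):
    """F(q, eta) = E_q[log p_eta(y, theta)] + H(q) for q = Bernoulli(w)."""
    q = np.array([1 - w, w])
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

grid = np.linspace(1e-6, 1 - 1e-6, 20001)
vals = np.array([F(w) for w in grid])
best = grid[vals.argmax()]

print(best, posterior[1])               # maximizer of F matches the posterior
print(vals.max(), log_lik)              # maximum of F matches the log-likelihood
```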

Expectation-Maximization: E-M Summary

1. Uses a distance or divergence function.
2. Produces a sequence of distributions which approximate the posterior distribution of the latent variables by minimizing the divergence.
3. Provides a lower bound on the log-likelihood.

Variational Bayes

In VB, we consider alternative ways of augmenting the model.

Full-Bayes: give all of $\theta$ a prior.

Let $q$ be an approximation to the posterior distribution of $\theta \mid y$; $q$ will be chosen so as to be the best in a certain class. (For a given iteration, $q^{(t+1)}$ will likely depend on some parameters from time $t$.)

EM: $F(q, \eta) = E_{\theta \sim q}\big[\log p_\eta(y, \theta)\big] + H(q)$.

VB: $F(q) = E_{\theta \sim q}\big[\log p(y, \theta)\big] + H(q)$.

Variational Bayes: VB Theory, Pictorial Representation

[Slide reproduces Figure 2.3 from Beal (2003): the variational Bayesian EM (VBEM) algorithm. In the VBE step, the variational posterior over hidden variables $q_x(x)$ is updated; in the VBM step, the variational posterior over parameters $q_\theta(\theta)$ is updated. Each step increases (or leaves unchanged) the lower bound $F\big(q_x(x), q_\theta(\theta)\big)$ on the log marginal likelihood $\ln p(y \mid m)$, shrinking the gap $KL\big(q_x\, q_\theta \,\big\|\, p(x, \theta \mid y)\big)$.]

Variational Bayes: Calculus of Variations

For functionals of the form
$$J[q] = \int_a^b G(\theta, q, q')\, d\theta,$$
defined on a set of functions with continuous first derivatives and satisfying $q(a) = A$, $q(b) = B$, a necessary condition for $J[q]$ to have an extremum is the Euler equation
$$\frac{\partial G}{\partial q} - \frac{d}{d\theta}\frac{\partial G}{\partial q'} = 0.$$

Break $q$ into independent blocks, $q(\theta) = \prod_{i=1}^K q_i(\theta_i)$, and write the $F$ function as
$$\int E_{\theta_{[-j]} \sim q_{[-j]}}\Big[\log p_{y,\theta_j}(\theta_{[-j]}) - \sum_{i=1}^K \log q_i(\theta_i)\Big]\, q_j(\theta_j)\, d\theta_j,$$
where $p_{y,\theta_j}(\theta_{[-j]})$ denotes the joint density $p(y, \theta)$ viewed as a function of $\theta_{[-j]}$ with $y$ and $\theta_j$ fixed.

Variational Bayes: Calculus of Variations

Add a Lagrange multiplier to enforce $\int q_j = 1$ and apply Euler's equation:
$$\begin{aligned}
0 &= E_{\theta_{[-j]} \sim q_{[-j]}}\Big[\log p_{y,\theta_j}(\theta_{[-j]}) - \sum_{i=1}^K \log q_i(\theta_i)\Big] - \frac{1}{q_j(\theta_j)}\, q_j(\theta_j) + \lambda_j \\
&= E_{\theta_{[-j]} \sim q_{[-j]}}\big[\log p_{y,\theta_j}(\theta_{[-j]})\big] - \log q_j(\theta_j) + \text{const} + \lambda_j,
\end{aligned}$$
so that, up to an additive constant fixed by normalization,
$$\log q_j(\theta_j) = E_{\theta_{[-j]} \sim q_{[-j]}}\big[\log p_{y,\theta_j}(\theta_{[-j]})\big] + \text{const}.$$

Variational Bayes: Using VB

First, choose a divergence measure and a class of distributions.

1. Write out the joint distribution of $\theta$ and $y$.
2. Initialize to some $q^{(0)}$.
3. Iterate: obtain $q_i^{(t+1)}$ by maximizing $F$ with $q_{[-i]}^{(t)}$ held constant.

$q$ may depend on some parameters. (A minimal numerical sketch of this recipe follows below.)
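As a minimal sketch of this recipe and of the update $\log q_j(\theta_j) = E_{q_{[-j]}}[\log p(y, \theta)] + \text{const}$ derived above (an assumed example, not from the slides), consider approximating a correlated bivariate Gaussian posterior $p(\theta \mid y) = N(\mu, \Sigma)$ by a factored $q(\theta) = q_1(\theta_1)\, q_2(\theta_2)$. Each mean-field update is itself Gaussian, with mean $\mu_j - (\Lambda_{jk}/\Lambda_{jj})(E[\theta_k] - \mu_k)$ and variance $1/\Lambda_{jj}$, where $\Lambda = \Sigma^{-1}$.

```python
# Minimal mean-field sketch (assumed example): approximate a correlated
# bivariate Gaussian "posterior" p(theta | y) = N(mu, Sigma) with a factored
# q(theta) = q1(theta1) q2(theta2), using the update
#   log q_j(theta_j) = E_{q_[-j]}[ log p(y, theta) ] + const,
# which here makes each q_j Gaussian with variance 1 / Lambda_jj.
import numpy as np

mu = np.array([1.0, -1.0])                 # exact posterior mean
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])             # exact posterior covariance
Lam = np.linalg.inv(Sigma)                 # precision matrix

m = np.zeros(2)                            # current means of q1, q2 (the q^(0))
for t in range(100):                       # coordinate updates, one block at a time
    for j in (0, 1):
        k = 1 - j
        m[j] = mu[j] - (Lam[j, k] / Lam[j, j]) * (m[k] - mu[k])

print(m)                                   # mean-field means converge to mu
print(1.0 / np.diag(Lam), np.diag(Sigma))  # q variances understate the true ones
```

As is typical for mean-field approximations, the factored $q$ recovers the exact posterior mean here, but its variances $1/\Lambda_{jj}$ underestimate the true marginal variances $\Sigma_{jj}$.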

Variational Bayes: Factoring the Distributions

Split latent variables and parameters.

[Slide reproduces Figure 2.4 from Beal (2003), a graphical depiction of the hidden-variable / parameter factorisation over parameters $\theta$, hidden variables $x_1, x_2, x_3$, and observations $y_1, y_2, y_3$: (a) the generative graphical model; (b) the graph representing the exact posterior; (c) the posterior graph after the variational approximation.]

Variational Bayes: Choice of $q^{(0)}$

Each update requires computing an expectation with respect to the previous approximation. If
$$p(y \mid \theta) = h(y) \exp\big\{\phi(\theta)^\top T(y) - a(\theta)\big\},$$
$$p(\theta \mid \nu, \lambda) = g(\nu, \lambda) \exp\big\{\phi(\theta)^\top \nu - \lambda\, a(\theta)\big\},$$
then
$$q(\theta) = g(\tilde\nu, \tilde\lambda) \exp\big\{\phi(\theta)^\top \tilde\nu - \tilde\lambda\, a(\theta)\big\}, \qquad \tilde\lambda = \lambda + 1, \quad \tilde\nu = \nu + T(y).$$
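As an assumed concrete instance of these formulas (not from the slides), take a Poisson likelihood with $\phi(\theta) = \log\theta$, $T(y) = y$, $a(\theta) = \theta$, and a conjugate Gamma-type prior of the form $g(\nu, \lambda)\exp\{\nu\log\theta - \lambda\theta\}$; the update $\tilde\lambda = \lambda + 1$, $\tilde\nu = \nu + T(y)$ then reproduces the usual Gamma posterior, which the sketch checks numerically.

```python
# Assumed instance of the conjugate update nu~ = nu + T(y), lambda~ = lambda + 1:
# Poisson likelihood with phi(theta) = log(theta), T(y) = y, a(theta) = theta,
# and a Gamma-type prior p(theta | nu, lam) proportional to exp{nu*log(theta) - lam*theta}.
import numpy as np
from scipy import stats
from scipy.integrate import quad

nu, lam = 2.0, 1.5     # prior natural parameters
y = 4                  # one Poisson observation

# Updated natural parameters.
nu_new, lam_new = nu + y, lam + 1.0

# In the standard parameterization this density is Gamma(shape = nu + 1, rate = lam).
posterior = stats.gamma(a=nu_new + 1.0, scale=1.0 / lam_new)
prior = stats.gamma(a=nu + 1.0, scale=1.0 / lam)

# Check against brute force: normalize likelihood * prior by quadrature.
unnorm = lambda th: stats.poisson.pmf(y, th) * prior.pdf(th)
Z, _ = quad(unnorm, 0, 60)

theta = 3.0
print(unnorm(theta) / Z, posterior.pdf(theta))   # the two densities agree
```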

Variational Bayes: Uses of VB

1. Obtain an approximate posterior.
2. Approximate posterior modes.
3. Provide a lower bound on $p(y)$; in Bayesian model selection, on $p(y \mid M_i)$.

Online variants exist.