Variational Bayes Approximation
Rice University STAT 631 / ELEC 639: Graphical Models
Instructor: Dr. Volkan Cevher
Scribe: David Kahle
Reviewers: Konstantinos Tsianos and Tahira Saleem

1. Background

These lecture notes are the seventh in a series of lecture notes taken from a course offered in the Fall of 2008 at Rice University entitled Graphical Models. The course was written and instructed by Dr. Volkan Cevher in the Department of Electrical and Computer Engineering. This particular set of notes was taken by David Kahle on September 16, 2008.

2. Introduction

In the last lecture we became particularly interested in making inference on the graph displayed in Figure 1, where $Z \in \mathbb{R}^m$ and $X \in \mathbb{R}^n$ are random vectors.

[Figure 1. Simple observed directed graph over $Z$ and $X$.]

For the purposes of this lecture, all random vectors will be assumed to exhibit densities with respect to the Lebesgue measure, denoted $p$ and subscripted by the corresponding random vector. For example, the random vector $Z$ exhibits the density $p_Z(z)$. The joint density of the random vectors $X$ and $Z$ is written $p_{X,Z}(x, z)$. The graphical model presented above is simple in order to emphasize the fact that we wish to make inference regarding the vector $Z$ provided we have information on the vector $X$. To that end, recall the crucial formula derived from the definition of conditional probability,
$$p_{X,Z}(x, z) = p_{Z \mid X}(z \mid x)\, p_X(x), \tag{1}$$
which is sometimes referred to as the product rule. As we well know, our primary interest lies in the factor $p_{Z \mid X}(z \mid x)$. The rest of this lecture is concerned with characterizing this density using the method of variational Bayes (VB) approximation. This is the second example of a deterministic scheme for approximating the conditional density for inferential procedures (the first being the Laplace approximation).

3. Motivation and the Kullback-Leibler Divergence

In this section we derive a pseudo distance metric on probability distributions known as the Kullback-Leibler divergence. The idea is that without some kind of measure of difference between probability measures, we have no benchmark for comparing the accuracy of different approximations. Our result will be the Kullback-Leibler divergence; it will soon become clear why it is called a divergence instead of a distance.

The divergence is derived as follows. Beginning with (1) and taking logs, we obtain
$$\log p_{X,Z}(x, z) = \log p_{Z \mid X}(z \mid x) + \log p_X(x), \tag{2}$$
which rearranges to
$$\log p_X(x) = \log p_{X,Z}(x, z) - \log p_{Z \mid X}(z \mid x). \tag{3}$$
Now, recall the property of logs which states $\log a - \log b = \log \frac{a}{c} - \log \frac{b}{c}$ (provided everything is defined appropriately). Appealing to this fact, we introduce another density, $q_Z(z)$, which is an arbitrary probability density (also with respect to the Lebesgue measure). Then (3) grants
$$\log p_X(x) = \log \frac{p_{X,Z}(x, z)}{q_Z(z)} - \log \frac{p_{Z \mid X}(z \mid x)}{q_Z(z)}, \tag{4}$$
and multiplying both sides by $q_Z(z)$ we obtain
$$q_Z(z) \log p_X(x) = q_Z(z) \log \frac{p_{X,Z}(x, z)}{q_Z(z)} - q_Z(z) \log \frac{p_{Z \mid X}(z \mid x)}{q_Z(z)}. \tag{5}$$
By integrating with respect to $z$, we have
$$\int q_Z(z) \log p_X(x)\, dz = \int q_Z(z) \log \frac{p_{X,Z}(x, z)}{q_Z(z)}\, dz - \int q_Z(z) \log \frac{p_{Z \mid X}(z \mid x)}{q_Z(z)}\, dz. \tag{6}$$
This is the key equation for our search. To summarize it we note that the left hand side of (6) is simply $\log p_X(x)$ and use short hand notation for the two terms on the right hand side (with the second including the negative sign). The short hand notation is defined as
$$\mathcal{L}(q_Z) := \int q_Z(z) \log \frac{p_{X,Z}(x, z)}{q_Z(z)}\, dz = \mathbb{E}_{q_Z}\!\left[\log \frac{p_{X,Z}(X, Z)}{q_Z(Z)}\right],$$
$$\mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right) := -\int q_Z(z) \log \frac{p_{Z \mid X}(z \mid x)}{q_Z(z)}\, dz = \int q_Z(z) \log \frac{q_Z(z)}{p_{Z \mid X}(z \mid x)}\, dz = \mathbb{E}_{q_Z}\!\left[\log \frac{q_Z(Z)}{p_{Z \mid X}(Z \mid X)}\right],$$
where $\mathbb{E}_{\Pi}$ represents the expected value operator with respect to the probability measure (or equivalent) $\Pi$. Thus, the fundamental relationship is concisely written
$$\log p_X(x) = \mathcal{L}(q_Z) + \mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right). \tag{7}$$
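
Since (7) drives everything that follows, a quick numerical check can make it concrete. The sketch below uses a toy conjugate Gaussian model chosen purely for illustration (none of its numbers come from these notes): it estimates $\mathcal{L}(q_Z)$ by Monte Carlo, adds the closed-form KL between two Gaussians, and compares the sum with $\log p_X(x)$.

```python
import numpy as np
from scipy import stats

# Toy conjugate model (an illustration, not part of the notes):
#   Z ~ N(0, 1),  X | Z = z ~ N(z, 1),
# so p_X = N(0, 2) and p_{Z|X}(z | x) = N(x/2, 1/2) are available in closed form.
x_obs = 1.3
log_px = stats.norm(0.0, np.sqrt(2.0)).logpdf(x_obs)

# An arbitrary approximating density q_Z = N(m, s^2).
m, s = 0.2, 0.9
q = stats.norm(m, s)

# L(q_Z) = E_q[ log p_{X,Z}(x, Z) - log q_Z(Z) ], estimated by Monte Carlo.
z = q.rvs(size=500_000, random_state=0)
log_joint = stats.norm(0.0, 1.0).logpdf(z) + stats.norm(z, 1.0).logpdf(x_obs)
elbo = np.mean(log_joint - q.logpdf(z))

# KL(q_Z || p_{Z|X}) between two Gaussians has a closed form.
mu_post, var_post = x_obs / 2.0, 0.5
kl = 0.5 * (np.log(var_post / s**2) + (s**2 + (m - mu_post)**2) / var_post - 1.0)

print(elbo + kl, log_px)  # the two numbers agree up to Monte Carlo error, per (7)
```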

The functional $\mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right)$ is known as the Kullback-Leibler divergence of $p_{Z \mid X}$ from $q_Z$ and is the pseudo metric which we are seeking. A few properties of KL are that, for all valid $p_{Z \mid X}$ and $q_Z$,

1. $\mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right) \geq 0$, and
2. $\mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right) = 0 \iff q_Z = p_{Z \mid X}$ a.e.

Proof of both of these facts is provided in Lemma 3.1 of [1]. (Note that what we are labeling the divergence and what Kullback and Leibler label the divergence are different functionals; our definition of the KL divergence is the one consistent with the literature, despite Kullback and Leibler defining it differently.) However, note that the Kullback-Leibler divergence is not symmetric, and thus not a true distance metric.

To conclude this section it is instructive to see the meaning of the KL divergence with an example. In Figure 2 we plot an attempt to approximate a normal distribution $p = N(2, 0.5)$ with three different log-normals $q_1, q_2, q_3$. Observe that the approximation that visually seems more accurate also has the smallest KL divergence. Moreover, it is worth noting that for the case where $\mathrm{KL}(p \,\|\, q_2) = 0.1115$ we have $\mathrm{KL}(q_2 \,\|\, p) = 0.1399$.

[Figure 2. Three different approximations. The better the approximation, the smaller the KL divergence.]
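
The asymmetry is easy to reproduce numerically. The sketch below integrates both directions of the divergence on a grid; the log-normal parameters are hypothetical stand-ins (the notes do not record the ones used for Figure 2), so the printed values will not match 0.1115 and 0.1399, but the two directions will generally disagree.

```python
import numpy as np
from scipy import stats

# Reproducing the flavor of the Figure 2 experiment: approximate p = N(2, 0.5)
# with a log-normal q. The log-normal parameters below are hypothetical.
p = stats.norm(2.0, np.sqrt(0.5))        # interpreting 0.5 as the variance
q = stats.lognorm(s=0.35, scale=2.0)     # one candidate approximation

# p puts (negligible but nonzero) mass on x <= 0 where q vanishes, so we
# integrate on a positive grid as a numerical convenience, not an exact KL.
x = np.linspace(1e-6, 12.0, 200_001)
dx = x[1] - x[0]

kl_pq = np.sum(p.pdf(x) * (p.logpdf(x) - q.logpdf(x))) * dx   # KL(p || q)
kl_qp = np.sum(q.pdf(x) * (q.logpdf(x) - p.logpdf(x))) * dx   # KL(q || p)
print(kl_pq, kl_qp)   # two different numbers: KL is not symmetric
```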

4. Variational Bayes Approximation

Recall that the choice of $q_Z(z)$ is entirely arbitrary so long as it is a probability density with respect to the Lebesgue measure. The idea in variational Bayes (VB) approximation is to select a simple approximation $q_Z$ to the complicated conditional density $p_{Z \mid X}$. The Kullback-Leibler divergence gives us a measure of the discrepancy between $q_Z$ and $p_{Z \mid X}$; so the goal is to find an approximation $q_Z$ which minimizes $\mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right)$.

Further consideration of (7) in light of our new task will prove beneficial. Note that the left hand side does not vary with $z$; moreover, in our graph we are considering an experiment where we observe $x$, so the quantity is fixed. (In this set of notes the upper case Roman characters such as $X$ denote random vectors, while the lower case Roman characters such as $x$, outside a density or integral equation, denote a single observation of $X$, as is common in the statistics literature.) It is generally referred to as the log marginal likelihood or the log evidence. Since it does not vary with $q_Z$, it is clear that the functionals $\mathcal{L}$ and KL are inversely related. Therefore, a minimization of KL amounts to a maximization of $\mathcal{L}$, i.e.,
$$q_Z^* := \arg\min_{q_Z \in Q} \mathrm{KL}\!\left(q_Z \,\|\, p_{Z \mid X}\right) = \arg\max_{q_Z \in Q} \mathcal{L}(q_Z), \tag{8}$$
where $q_Z^*$ is our approximation of interest and $Q$ denotes any set of valid probability densities. We will refer to $q_Z^*$ as the $Q$-VB approximation. It is also sometimes useful to note that by the first property of KL, $\log p_X(x) \geq \mathcal{L}(q_Z)$, and thus $e^{\mathcal{L}(q_Z)}$ provides a lower bound for the marginal density of $X$.

Finding $q_Z^*$ when $Q$ contains all probability densities is in general a difficult task. To make our analysis more tractable we can impose an independence structure on the random vector $Z$; that is, we will only consider $q_Z$'s which come from the set
$$Q := \left\{ q_Z(z) : q_Z(z) = \prod_{i=1}^m q_{Z_i}(z_i) \right\}, \tag{9}$$
where $q_{Z_i}(z_i)$ is the probability density of $Z_i$, the $i$th element of $Z$. Sometimes even this task proves difficult and more restrictions are imposed to make $Q$ even smaller. Such techniques are referred to as restricted variational Bayes (R-VB) techniques. For example, we could add the additional requirement that each $q_{Z_i}(z_i)$ be in the exponential family. For the rest of these notes, we will take $Q$ as defined in (9) for our $Q$ in (8).
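
To get a feel for what restricting $q_Z$ to the factorized family (9) costs, the sketch below uses a standard result for Gaussian targets (not derived in these notes): if the target density were a correlated bivariate Gaussian $N(\mu, \Sigma)$, the factorized $q$ minimizing $\mathrm{KL}(q \,\|\, p)$ has factors $N(\mu_i, \Lambda_{ii}^{-1})$ with $\Lambda = \Sigma^{-1}$. The numbers are hypothetical; the point is that the mean-field fit matches the means but understates the marginal variances.

```python
import numpy as np
from scipy import stats

# Hypothetical correlated Gaussian "posterior" p_{Z|X} = N(mu, Sigma).
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lambda = np.linalg.inv(Sigma)            # the precision matrix

# Mean-field optimum over the family (9): q_i = N(mu_i, 1 / Lambda_ii).
q1 = stats.norm(mu[0], np.sqrt(1.0 / Lambda[0, 0]))
q2 = stats.norm(mu[1], np.sqrt(1.0 / Lambda[1, 1]))

print(q1.var(), Sigma[0, 0])   # 0.36 vs 1.0: the factorized fit is narrower
```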

To find (8), we begin by looking at $\mathcal{L}(q_Z)$. In particular, it will be beneficial to look at the $j$th factor $q_{Z_j}$. For this reason, our mathematics is aimed at separating out the factors which depend on $q_{Z_j}$. Presented in the equations which follow is the derivation in terms of expectations; an equivalent form of the first part of the derivation in terms of integrals is provided in the Appendix.

So, from (9) and Fubini's theorem (which states that if $\int_{A \times B} |f(x,y)|\, d(x,y) < \infty$, then $\int_A \big( \int_B f(x,y)\, dy \big)\, dx = \int_B \big( \int_A f(x,y)\, dx \big)\, dy = \int_{A \times B} f(x,y)\, d(x,y)$), and writing $\mathbb{E}_{Q_{-j}}$ for the expectation with respect to $\prod_{k \neq j} q_{Z_k}$, we have
$$
\begin{aligned}
\mathcal{L}(q_Z)
&= \mathbb{E}_{q_Z}\!\left[ \log \frac{p_{X,Z}(X, Z)}{q_Z(Z)} \right]
 = \mathbb{E}_{q_Z}\!\left[ \log p_{X,Z}(X, Z) \right] - \sum_{k=1}^m \mathbb{E}_{q_{Z_k}}\!\left[ \log q_{Z_k}(Z_k) \right] \\
&= \mathbb{E}_{q_{Z_j}}\!\left[ \mathbb{E}_{Q_{-j}}\!\left[ \log p_{X,Z}(X, Z) \right] \right] - \mathbb{E}_{q_{Z_j}}\!\left[ \log q_{Z_j}(Z_j) \right] - \sum_{k \neq j} \mathbb{E}_{q_{Z_k}}\!\left[ \log q_{Z_k}(Z_k) \right] \\
&= \mathbb{E}_{q_{Z_j}}\!\left[ \log \frac{\exp\!\left\{ \mathbb{E}_{Q_{-j}}\!\left[ \log p_{X,Z}(X, Z) \right] \right\}}{q_{Z_j}(Z_j)} \right] - \sum_{k \neq j} \mathbb{E}_{q_{Z_k}}\!\left[ \log q_{Z_k}(Z_k) \right] \\
&= -\,\mathrm{KL}\!\left( q_{Z_j} \,\Big\|\, \exp\!\left\{ \mathbb{E}_{Q_{-j}}\!\left[ \log p_{X,Z}(X, Z) \right] \right\} \right) - \sum_{k \neq j} \mathbb{E}_{q_{Z_k}}\!\left[ \log q_{Z_k}(Z_k) \right].
\end{aligned}
$$
Now, we know from KL's first property that it is never negative. Thus, to maximize $\mathcal{L}(q_Z)$ with respect to $q_{Z_j}$ (holding the other factors fixed) we need to minimize the KL term in the last equation. From KL's second property, we know that it is minimized precisely when the two terms are equivalent a.e. This gives the nice formula we know must hold for the probability density $q_{Z_j}^*$, the $j$th factor of $q_Z^*$:
$$q_{Z_j}^*(z_j) \propto \exp\!\left\{ \mathbb{E}_{Q_{-j}}\!\left[ \log p_{X,Z}(X, Z) \right] \right\}, \tag{10}$$
where here $Z = (Z_1, \ldots, Z_{j-1}, z_j, Z_{j+1}, \ldots, Z_m)$.
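
Formula (10) is easiest to see in action on a model small enough to enumerate. The sketch below uses a hypothetical two-variable discrete table standing in for $p_{X,Z}(x, \cdot)$ at a fixed $x$; it alternates the update (10) over the two factors and compares the product $q_1 q_2$ with the exact posterior obtained by normalization.

```python
import numpy as np

# A minimal sketch of the coordinate update (10) on a toy discrete model:
# Z = (Z1, Z2), each taking values in {0, 1}, with a hypothetical table of
# joint values p_{X,Z}(x, z1, z2) at the observed x (unnormalized over z).
rng = np.random.default_rng(0)
joint = rng.random((2, 2)) + 0.1

q1 = np.array([0.5, 0.5])      # initial factors q_{Z1}, q_{Z2}
q2 = np.array([0.5, 0.5])

for _ in range(50):
    # q1*(z1) ∝ exp( E_{q2}[ log p(x, z1, Z2) ] )
    log_q1 = (np.log(joint) * q2[None, :]).sum(axis=1)
    q1 = np.exp(log_q1 - log_q1.max())
    q1 /= q1.sum()
    # q2*(z2) ∝ exp( E_{q1}[ log p(x, Z1, z2) ] )
    log_q2 = (np.log(joint) * q1[:, None]).sum(axis=0)
    q2 = np.exp(log_q2 - log_q2.max())
    q2 /= q2.sum()

posterior = joint / joint.sum()           # exact p_{Z|X}(. | x) by enumeration
print(np.outer(q1, q2))                   # mean-field approximation
print(posterior)                          # compare with the true posterior
```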

5. Example: A Univariate Gaussian

To understand the aforementioned approximation procedure, we show the analytic computations in the case of a univariate Gaussian distribution. Assume we have a data set $D = \{x_1, \ldots, x_N\}$ drawn from a distribution with unknown parameters $\mu, \tau$. Given the data, the likelihood function is
$$p(D \mid \mu, \tau) = \left( \frac{\tau}{2\pi} \right)^{N/2} \exp\!\left\{ -\frac{\tau}{2} \sum_{n=1}^N (x_n - \mu)^2 \right\}.$$
We may have some idea about the distribution of the unknown parameters $\mu, \tau$. To simplify the analysis here, we introduce the following conjugate priors:
$$p(\mu \mid \tau) = N\!\left(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\right), \qquad p(\tau) = \mathrm{Gamma}(\tau \mid a_0, b_0).$$
Recall that we are interested in estimating the posterior distribution over the unknown variables by an approximation $q(\mu, \tau)$. According to the mean field approximation we assume it takes a factorized form (which is not how the true posterior factorizes):
$$q(\mu, \tau) = q_\mu(\mu)\, q_\tau(\tau).$$
To find the optimal value for the factor $q_\mu(\mu)$ we apply the formula that we derived above:
$$
\begin{aligned}
\log q_\mu^*(\mu) &= \mathbb{E}_\tau\!\left[ \log p(D, \mu, \tau) \right]
 = \mathbb{E}_\tau\!\left[ \log \{ p(D \mid \mu, \tau)\, p(\mu \mid \tau)\, p(\tau) \} \right] \\
&= \mathbb{E}_\tau\!\left[ \log p(D \mid \mu, \tau) + \log p(\mu \mid \tau) + \log p(\tau) \right] \\
&= \mathbb{E}_\tau\!\left[ \log p(D \mid \mu, \tau) + \log p(\mu \mid \tau) \right] + \text{const} \\
&= -\frac{\mathbb{E}[\tau]}{2} \left\{ \lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^N (x_n - \mu)^2 \right\} + \text{const}.
\end{aligned}
$$
The next step is to complete the square over $\mu$ to obtain the form of a Gaussian $N(\mu \mid \mu_N, \lambda_N^{-1})$ for $q_\mu(\mu)$, where
$$\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\, \mathbb{E}[\tau].$$
A similar analysis for $q_\tau(\tau)$ shows that it follows a gamma distribution $\mathrm{Gamma}(\tau \mid a_N, b_N)$ where
$$a_N = a_0 + \frac{N + 1}{2}, \qquad b_N = b_0 + \frac{1}{2}\, \mathbb{E}_\mu\!\left[ \sum_{n=1}^N (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right].$$
The same analysis, together with more sophisticated examples, can be found in chapter 10 of [?]. It is important to notice that we can use the above derived formulas to iteratively compute more refined estimates of the model parameters: observe that the parameters of $q_\mu(\mu)$ depend on the mean value of $\tau$ and vice versa. To conclude this section we refer the reader to two excellent references on variational methods [?], [?].
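
The two updates can be run as the fixed-point iteration described above. Below is a minimal sketch of that loop on synthetic data; the data, prior hyperparameter values, and iteration count are assumptions made only for illustration.

```python
import numpy as np

# A minimal sketch of the iterative VB updates for the univariate Gaussian
# example (synthetic data and hyperparameter values are assumptions).
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=200)          # data D
N, xbar = x.size, x.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0                # prior hyperparameters

E_tau = a0 / b0                                       # initial guess for E[tau]
for _ in range(100):
    # q_mu(mu) = N(mu | mu_N, lam_N^{-1})
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N         # first two moments of q_mu
    # q_tau(tau) = Gamma(tau | a_N, b_N)
    a_N = a0 + (N + 1) / 2.0
    b_N = b0 + 0.5 * (np.sum(x**2 - 2 * x * E_mu + E_mu2)
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N                                  # feeds back into lam_N

print(mu_N, 1.0 / np.sqrt(E_tau))   # close to the sample mean and sample std
```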

6. Appendix

$$
\begin{aligned}
\mathcal{L}(q_Z)
&= \int q_Z(z) \log \frac{p_{X,Z}(x, z)}{q_Z(z)}\, dz \\
&= \int \!\cdots\! \int \prod_{i=1}^m q_{Z_i}(z_i) \log \frac{p_{X,Z}(x, z)}{\prod_{k=1}^m q_{Z_k}(z_k)}\, dz_1 \cdots dz_m \\
&= \int \!\cdots\! \int \prod_{i=1}^m q_{Z_i}(z_i) \left( \log p_{X,Z}(x, z) - \sum_{k=1}^m \log q_{Z_k}(z_k) \right) dz_1 \cdots dz_m \\
&= \int \!\cdots\! \int \prod_{i=1}^m q_{Z_i}(z_i) \log p_{X,Z}(x, z)\, dz_1 \cdots dz_m - \int \!\cdots\! \int \prod_{i=1}^m q_{Z_i}(z_i) \sum_{k=1}^m \log q_{Z_k}(z_k)\, dz_1 \cdots dz_m \\
&= \int q_{Z_j}(z_j) \left( \int \!\cdots\! \int \prod_{i \neq j} q_{Z_i}(z_i) \log p_{X,Z}(x, z)\, dz_{\backslash j} \right) dz_j - \sum_{k=1}^m \int \log q_{Z_k}(z_k)\, q_{Z_k}(z_k)\, dz_k \\
&= \int q_{Z_j}(z_j) \left( \int \!\cdots\! \int \prod_{i \neq j} q_{Z_i}(z_i) \log p_{X,Z}(x, z)\, dz_{\backslash j} \right) dz_j - \int \log q_{Z_j}(z_j)\, q_{Z_j}(z_j)\, dz_j - \sum_{k \neq j} \int \log q_{Z_k}(z_k)\, q_{Z_k}(z_k)\, dz_k.
\end{aligned}
$$

References

1. S. Kullback and R. A. Leibler, On information and sufficiency, Annals of Mathematical Statistics 22 (1951), no. 1, 79-86.