Two Useful Bounds for Variational Inference

John Paisley
Department of Computer Science
Princeton University, Princeton, NJ
jpaisley@princeton.edu

Technical Report, August 2010

Abstract

We review and derive two lower bounds on the expectation of the log-sum function in the context of variational inference. The first bound relies on a first-order Taylor expansion about an auxiliary parameter. The second bound relies on an auxiliary probability distribution. We show how these auxiliary parameters can be removed entirely from the model. We then discuss the advantage of keeping these parameters in certain cases, giving an example of a likelihood/prior pair for each case.

1 Introduction

Variational inference methods (Wainwright and Jordan, 2008) are useful for approximate posterior inference in hierarchical models. They also provide an approximate value of the log-evidence of a model. For example, let Y = Y_1, ..., Y_N be a set of observed data and X = X_1, ..., X_K a set of latent parameters. Let p(Y, X) be the joint likelihood of a given model and Q(X) be any probability distribution. Then by Jensen's inequality,

    \ln p(Y) = \ln \int_{\Omega_X} p(Y \mid X)\, p(X)\, dX = \ln \int_{\Omega_X} \frac{p(Y, X)}{Q(X)}\, Q(X)\, dX \ge \int_{\Omega_X} Q(X) \ln\left\{ \frac{p(Y, X)}{Q(X)} \right\} dX    (1)

In variational methods, this Q distribution is given a factorized form, Q(X) := \prod_k q(X_k), with the functional form of each factor predefined. It can be shown that maximizing this lower bound on the log-evidence with respect to Q(X) is equivalent to minimizing the Kullback-Leibler divergence between this Q distribution and the true posterior p(X | Y). Therefore, the factorized Q distribution can be interpreted as an approximation to the true posterior.
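As a quick numerical illustration of (1), the following sketch compares the exact log-evidence of a small conjugate Gaussian model with a Monte Carlo estimate of the right-hand side under a deliberately crude Q. The model, the choice of Q, and all numbers are illustrative assumptions, not taken from this report.

    # Check Jensen's inequality (1): ln p(Y) >= E_Q[ln p(Y, x) - ln Q(x)].
    # Assumed model for illustration: Y_i ~ N(x, sigma^2), prior x ~ N(0, tau^2).
    import numpy as np
    from scipy.stats import norm, multivariate_normal

    rng = np.random.default_rng(0)
    sigma, tau, n = 1.0, 2.0, 20
    Y = rng.normal(1.5, sigma, size=n)

    # Exact log-evidence: marginally, Y ~ N(0, sigma^2 I + tau^2 11^T).
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    log_evidence = multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(Y)

    # Monte Carlo estimate of the right-hand side of (1) for a crude Q(x) = N(0, 1).
    q_mu, q_sd = 0.0, 1.0
    x = rng.normal(q_mu, q_sd, size=100000)
    log_joint = norm.logpdf(Y[:, None], loc=x[None, :], scale=sigma).sum(axis=0) \
                + norm.logpdf(x, 0.0, tau)
    lower_bound = np.mean(log_joint - norm.logpdf(x, q_mu, q_sd))

    print(f"ln p(Y) = {log_evidence:.2f} >= lower bound = {lower_bound:.2f}")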

It is often the case that an integral in (1) is intractable, meaning that the expression

    \int_{\Omega_{X_I}} \prod_{i \in I} q(X_i) \ln p(Y \mid X_I)\, dX_I    (2)

cannot be analytically solved for some set of parameters indexed by I. In this case, approximating functions can be used to lower bound the terms of interest and provide analytical tractability. In this technical report, we review two simple lower bounds that are often useful for variational inference.

2 Two Lower Bounds Without Auxiliary Parameters

In this section, we present the two bounds of interest in this technical report and show how the auxiliary parameters used to derive each bound can be set such that they disappear from the model. The result is a function that depends only on the original parameters of interest and also represents the tightest lower bound on the original function given the functional form of the selected lower bound.

2.1 A lower bound on -E[ln Σ_k X_k]

The tightest lower bound on -E[ln Σ_k X_k] using a first-order Taylor expansion is

    -\mathbb{E}\left[\ln \sum_k X_k\right] \ge -\ln \sum_k \mathbb{E}[X_k]    (3)

Derivation: The function -ln(·) is convex. Therefore, a first-order Taylor expansion about the point ω > 0 produces the inequality

    -\ln \sum_k X_k \ge -\ln \omega - \frac{\sum_k X_k - \omega}{\omega}    (4)

Taking the expectation of both sides, then the derivative with respect to ω and setting it to zero, gives the value of ω that maximizes this bound for a given value of Σ_k E[X_k]. The derivative is

    \frac{df}{d\omega} = -\frac{1}{\omega} + \frac{1}{\omega^2} \sum_k \mathbb{E}[X_k]    (5)

Setting this to zero and solving for ω gives ω = Σ_k E[X_k]. Plugging this value back into (4) produces (3). Therefore, the parameter ω can be removed and all instances of -E[ln Σ_k X_k] can be replaced with -ln Σ_k E[X_k].
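The bound in (3) is straightforward to check numerically. The sketch below uses independent Gamma-distributed X_k with arbitrary parameters and a Monte Carlo estimate of the left-hand side; none of the values come from this report.

    # Check (3): -E[ln sum_k X_k] >= -ln sum_k E[X_k], with X_k ~ Gamma(a_k, b_k).
    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([0.5, 2.0, 1.0])                 # shape parameters a_k (illustrative)
    b = np.array([1.0, 3.0, 0.5])                 # rate parameters b_k (illustrative)
    X = rng.gamma(a, 1.0 / b, size=(1_000_000, 3))

    lhs = -np.mean(np.log(X.sum(axis=1)))         # Monte Carlo estimate of -E[ln sum_k X_k]
    rhs = -np.log(np.sum(a / b))                  # -ln sum_k E[X_k], since E[X_k] = a_k / b_k
    print(f"{lhs:.4f} >= {rhs:.4f}")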

2.2 A lower bound on E[ln Σ_k X_k]

The tightest lower bound on E[ln Σ_k X_k] using an auxiliary probability distribution is

    \mathbb{E}\left[\ln \sum_k X_k\right] \ge \ln \sum_k e^{\mathbb{E}[\ln X_k]}    (6)

Derivation: The function ln(·) is concave. Therefore, using an auxiliary probability vector, (p_1, ..., p_K), where p_k > 0 and Σ_k p_k = 1, it follows from Jensen's inequality that

    \ln \sum_k X_k = \ln \sum_k p_k \frac{X_k}{p_k} \ge \sum_k p_k \ln \frac{X_k}{p_k} = \sum_k p_k \ln X_k - \sum_k p_k \ln p_k    (7)

Taking the expectation of (7), then the derivative with respect to p_k, finding its proportionality and normalizing produces

    p_k = \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}}    (8)

Replacing p_1, ..., p_K in (7) with these values gives

    \mathbb{E}\left[\ln \sum_k X_k\right] \ge \sum_k \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}}\, \mathbb{E}[\ln X_k] - \sum_k \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}} \ln \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}}    (9)

The rightmost term can be expanded as

    -\sum_k \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}}\, \mathbb{E}[\ln X_k] + \sum_k \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}} \ln \sum_j e^{\mathbb{E}[\ln X_j]}    (10)

The first term in (10) cancels the first term on the right of the inequality in (9) to give

    \mathbb{E}\left[\ln \sum_k X_k\right] \ge \sum_k \frac{e^{\mathbb{E}[\ln X_k]}}{\sum_j e^{\mathbb{E}[\ln X_j]}} \ln \sum_j e^{\mathbb{E}[\ln X_j]}    (11)

The logarithm on the right can be brought outside of the summation over k, which then sums to one. This produces (6). As with the inequality of the previous section, the auxiliary parameters do not remain in the final bound; all instances of E[ln Σ_k X_k] can simply be replaced with ln Σ_k e^{E[ln X_k]}.
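The bound in (6) can be checked the same way, using the fact that E[ln X_k] = ψ(a_k) - ln b_k for X_k ~ Gamma(a_k, b_k). Together with Section 2.1, this sandwiches E[ln Σ_k X_k] between ln Σ_k e^{E[ln X_k]} and ln Σ_k E[X_k]. The parameter values below are arbitrary.

    # Check (6): E[ln sum_k X_k] >= ln sum_k exp(E[ln X_k]), with X_k ~ Gamma(a_k, b_k).
    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)
    a = np.array([0.5, 2.0, 1.0])
    b = np.array([1.0, 3.0, 0.5])
    X = rng.gamma(a, 1.0 / b, size=(1_000_000, 3))

    lhs = np.mean(np.log(X.sum(axis=1)))                  # Monte Carlo estimate of E[ln sum_k X_k]
    rhs = np.log(np.sum(np.exp(digamma(a) - np.log(b))))  # ln sum_k e^{E[ln X_k]}
    print(f"{lhs:.4f} >= {rhs:.4f}")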

3 The Advantage of Retaining Auxiliary Parameters

Given certain likelihood/prior combinations, retaining the auxiliary parameters ω and (p_1, ..., p_K) may significantly simplify model learning. This simplification comes in the form of analytical posterior updates for certain variational parameters, whereas using the tightest lower bounds given in Section 2 requires gradient methods. Gradient methods introduce potential issues: steepest ascent vs. Newton-type methods; setting vs. learning step sizes; checking for feasibility; assessing convergence vs. fixing the number of gradient steps when updating a parameter. In most of these concerns, there is usually a trade-off between accuracy and computation time. Analytical updates, on the other hand, are fast and move to the optimal value (i.e., the value that locally maximizes the lower bound) in one step.

In this section, we discuss two cases where keeping the auxiliary parameters ω and (p_1, ..., p_K) can be beneficial. In both cases, the lower bound approximation results in analytical updates when the variational q distribution is chosen to be of the same form as the prior. Therefore, the effect of each approximation can be viewed as altering the likelihood such that it is conjugate to the prior.

3.1 Working with -ln Σ_k X_k ≥ -ln ω - (Σ_k X_k - ω)/ω

Consider the following hierarchical process:

    Z_n \overset{iid}{\sim} \sum_k \frac{X_k}{\sum_j X_j}\, \delta_k, \qquad X_k \overset{ind}{\sim} \mathrm{Gamma}(a_k, b_k)    (12)

This process could appear in a mixture modeling setting, where X_1, ..., X_K are random variables that construct a discrete probability distribution and Z_n is a random index value that selects parameters, θ_{Z_n}, from a set. This would be followed by an observation, Y_n ~ F(θ_{Z_n}). When b_k = b for all k, this construction is equivalent to placing a Dirichlet prior on the normalized weights. If K → ∞ and a_1, a_2, ... are scaled weights from a Dirichlet process (Ferguson, 1973; Sethuraman, 1994), this is the hierarchical Dirichlet process (Teh et al., 2006). Furthermore, when the values b_1, b_2, ... are log-normal, this is the infinite logistic normal distribution (Paisley et al., 2010). For some of these models, we note that other representations are available that do not require the approximation discussed in this section.
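The sketch below simulates the process in (12) and checks the stated Dirichlet equivalence when b_k = b. The dimensions and parameter values are arbitrary illustrations.

    # Simulate (12): X_k ~ Gamma(a_k, b), then Z_n drawn from the normalized weights.
    import numpy as np

    rng = np.random.default_rng(0)
    K, N = 5, 10000
    a = np.array([2.0, 1.0, 0.5, 3.0, 1.5])    # shape parameters a_k (illustrative)
    b = 1.0                                     # common rate b_k = b

    X = rng.gamma(a, 1.0 / b)                   # X_k ~ Gamma(a_k, b)
    pi = X / X.sum()                            # normalized weights
    Z = rng.choice(K, size=N, p=pi)             # Z_n ~ sum_k pi_k delta_k

    # With a common rate, the normalized weight vector is Dirichlet(a_1, ..., a_K);
    # compare average normalized Gammas with the Dirichlet mean a_k / sum_j a_j.
    samples = rng.gamma(a, 1.0 / b, size=(20000, K))
    weights = samples / samples.sum(axis=1, keepdims=True)
    print("mean of normalized Gammas:", weights.mean(axis=0))
    print("Dirichlet mean:           ", a / a.sum())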

The posterior of X_{1:K} in this model is proportional to

    p(X_{1:K} \mid Z_{1:N}, a_{1:K}, b_{1:K}) \propto \prod_k \left( \frac{X_k}{\sum_j X_j} \right)^{\sum_{n=1}^N \mathbb{I}(Z_n = k)} X_k^{a_k - 1} e^{-b_k X_k}    (13)

Under a factorized Q distribution, the variational lower bound at nodes X_1, ..., X_K is

    \mathbb{E}_Q[\ln p(X_{1:K} \mid Z_{1:N}, a_{1:K}, b_{1:K})] + H[Q] = \sum_k \sum_{n=1}^N P_Q(Z_n = k)\, \mathbb{E}[\ln X_k] - N\, \mathbb{E}\left[\ln \sum_k X_k\right] + \sum_k (a_k - 1)\, \mathbb{E}[\ln X_k] - \sum_k b_k\, \mathbb{E}[X_k] + \sum_k H[q(X_k)]    (14)

Analytical calculation of all expectations in (14) is not possible because of the term -N E[ln Σ_k X_k]. Using the lower bound given in Section 2.1, this term is replaced with -N ln ω - N(Σ_k E[X_k] - ω)/ω, where ω > 0.

Before learning can proceed for X_1, ..., X_K, the form of their variational q distributions must be defined; we assume q distributions have been selected for all other parameters as well. As shown by Winn and Bishop (2005), the procedure for finding the optimal form and parameterization of a given q distribution is to exponentiate the variational lower bound with all expectations involving the parameter of interest not taken. When two distributions are conjugate, this will always produce an analytical result that is of the same form as the prior. Following this procedure with respect to the lower bound approximation of (14),

    q(X_k) \propto e^{\mathbb{E}_{-X_k}[\ln p(X_k \mid Z_{1:N}, a_{1:K}, b_{1:K})]} \propto X_k^{\,a_k + \sum_{n=1}^N P_Q(Z_n = k) - 1}\, e^{-(b_k + N/\omega) X_k}    (15)

Therefore, the optimal q distribution for X_k is q(X_k) = Gamma(X_k | a'_k, b'_k), with

    a'_k = a_k + \sum_{n=1}^N P_Q(Z_n = k)    (16)

    b'_k = b_k + N/\omega    (17)

The impact of the Taylor approximation is to alter the likelihood, making it conjugate to the prior. As motivated in the introduction, this requires setting the auxiliary parameter ω. For a given iteration, let ȧ_k and ḃ_k be the variational parameters resulting from the previous iteration. Then the auxiliary parameter ω can be set to

    \omega = \sum_k \dot{a}_k / \dot{b}_k    (18)

which, as motivated in Section 2.1, creates the tightest lower bound for the approximation. Of course, other options are available as well, such as incrementally updating ω after updating each (a'_k, b'_k), rather than after updating the entire set of K parameters.
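A minimal sketch of the resulting closed-form updates (16)-(18) follows. It assumes the responsibilities P_Q(Z_n = k) are supplied by some other part of the model (their update is not covered in this report), and all names and values are illustrative.

    # One pass of the analytical updates for q(X_k) = Gamma(a'_k, b'_k), eqs. (16)-(18).
    import numpy as np

    def update_q_x(r, a0, b0, a_prev, b_prev):
        """r[n, k] = P_Q(Z_n = k); a0, b0 are the prior Gamma parameters;
        a_prev, b_prev are the variational parameters from the previous iteration."""
        N = r.shape[0]
        omega = np.sum(a_prev / b_prev)        # (18): omega = sum_k E[X_k] under previous q
        a_new = a0 + r.sum(axis=0)             # (16)
        b_new = b0 + N / omega                 # (17)
        return a_new, b_new

    # Toy usage with placeholder responsibilities and arbitrary priors.
    rng = np.random.default_rng(0)
    N, K = 100, 4
    r = rng.dirichlet(np.ones(K), size=N)
    a0 = np.full(K, 1.0); b0 = np.full(K, 1.0)
    a_q, b_q = a0.copy(), b0.copy()
    for _ in range(10):
        a_q, b_q = update_q_x(r, a0, b0, a_q, b_q)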

3.2 Working with ln Σ_k X_k ≥ Σ_k p_k ln X_k - Σ_k p_k ln p_k

Consider the following hierarchical process:

    N_m \overset{ind}{\sim} \mathrm{Poisson}\left( \sum_k c_{k,m} X_k \right), \qquad X_k \overset{iid}{\sim} \mathrm{Gamma}(a, b)    (19)

An example of where this process arises is in non-negative matrix factorization. For example, this could be part of the generative process of a matrix of non-negative integers, Y ∈ N^{D×M}, and N_m would be the value in the m-th column of a particular row of Y (not indexed). In this framework, the data could be counts of word occurrences in documents, with Y_{d,m} equal to the number of times word d appears in document m.

The posterior of X_{1:K} in this model is proportional to

    p(X_{1:K} \mid N, c, a, b) \propto \prod_{m=1}^M \left( \sum_k c_{k,m} X_k \right)^{N_m} e^{-\sum_k c_{k,m} X_k} \prod_k X_k^{a-1} e^{-b X_k}    (20)

Under a factorized Q distribution, the variational lower bound at nodes X_1, ..., X_K is

    \mathbb{E}_Q[\ln p(X_{1:K} \mid N, c, a, b)] + H[Q] = \sum_{m=1}^M N_m\, \mathbb{E}\left[\ln \sum_k c_{k,m} X_k\right] - \sum_{m=1}^M \sum_k c_{k,m}\, \mathbb{E}[X_k] + \sum_k (a - 1)\, \mathbb{E}[\ln X_k] - \sum_k b\, \mathbb{E}[X_k] + \sum_k H[q(X_k)]    (21)

Analytical calculation of all expectations in (21) is not possible because of the term Σ_m N_m E[ln Σ_k c_{k,m} X_k]. Using the lower bound given in Section 2.2, this term is replaced with Σ_m N_m Σ_k p_k(m) E[ln c_{k,m} X_k] - Σ_m N_m Σ_k p_k(m) ln p_k(m), where p_k(m) > 0 and Σ_k p_k(m) = 1.

As in Section 3.1, q distributions must be defined for X_1, ..., X_K before learning can continue. We again follow the procedure outlined by Winn and Bishop (2005) for finding these distributions and their parameter settings:

    q(X_k) \propto e^{\mathbb{E}_{-X_k}[\ln p(X_k \mid N, c, a, b)]} \propto X_k^{\,a + \sum_{m=1}^M N_m p_k(m) - 1}\, e^{-(b + \sum_{m=1}^M c_{k,m}) X_k}    (22)

Therefore, the optimal q distribution for X_k is q(X_k) = Gamma(X_k | a'_k, b'_k), with

    a'_k = a + \sum_{m=1}^M N_m p_k(m)    (23)

    b'_k = b + \sum_{m=1}^M c_{k,m}    (24)

Again, the impact of the approximation is to modify the likelihood to a form that is conjugate to the prior. This requires values for p_k(m) for k = 1, ..., K and m = 1, ..., M. Following the same line of thought as in Section 3.1, let ȧ_k and ḃ_k be the variational parameters following a given iteration. Prior to the next iteration, the auxiliary distributions can be set to

    p_k(m) = \frac{e^{\ln c_{k,m} + \mathbb{E}[\ln X_k]}}{\sum_j e^{\ln c_{j,m} + \mathbb{E}[\ln X_j]}} = \frac{c_{k,m}\, e^{\psi(\dot{a}_k) - \ln \dot{b}_k}}{\sum_j c_{j,m}\, e^{\psi(\dot{a}_j) - \ln \dot{b}_j}}    (25)

This tightens the approximation for the current iteration. As in Section 3.1, the vector p_k(m) can be updated at other points in the inference process, such as after updating each (a'_k, b'_k).
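A minimal sketch of the corresponding update loop, combining (23)-(24) with the auxiliary probabilities p_k(m) set as in (25), follows. The data and parameter values are simulated placeholders for a single row of counts with known weights c_{k,m}.

    # Closed-form updates (23)-(25) for the Poisson model in (19).
    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)
    K, M = 4, 30
    a0, b0 = 1.0, 1.0                          # Gamma(a, b) prior on each X_k (illustrative)
    c = rng.uniform(0.5, 2.0, size=(K, M))     # known non-negative weights c_{k,m}
    X_true = rng.gamma(a0, 1.0 / b0, size=K)
    N_m = rng.poisson(c.T @ X_true)            # N_m ~ Poisson(sum_k c_{k,m} X_k), eq. (19)

    a_q = np.full(K, a0); b_q = np.full(K, b0) # variational Gamma(a'_k, b'_k) parameters
    for _ in range(50):
        # (25): p_k(m) proportional to c_{k,m} e^{E[ln X_k]}, with E[ln X_k] = psi(a'_k) - ln b'_k
        log_p = np.log(c) + (digamma(a_q) - np.log(b_q))[:, None]
        p = np.exp(log_p - log_p.max(axis=0))
        p /= p.sum(axis=0)
        a_q = a0 + (N_m[None, :] * p).sum(axis=1)   # (23)
        b_q = b0 + c.sum(axis=1)                    # (24)

    print("E[X_k] under q:", a_q / b_q)
    print("true X_k:      ", X_true)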

4 Summary

In this technical report, we have reviewed two simple lower bounds on the log-sum function with a focus on variational inference applications. We showed how retaining auxiliary parameters can aid the inference procedure in the form of analytical parameter updates. Hoffman et al. (2010) give another instance of this advantage in a matrix factorization model where the emissions are exponentially distributed. Their model includes the approximation in Section 2.1, and an approximation similar to Section 2.2 that uses an auxiliary probability distribution, but where the concave function is the negative inverse-sum function.

References

Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230.

Hoffman, M., Blei, D. M., and Cook, P. (2010). Bayesian nonparametric matrix factorization for recorded music. In ICML.

Paisley, J., Wang, C., and Blei, D. M. (2010). The infinite logistic normal distribution for nonparametric correlated topic modeling. In progress.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581.

Wainwright, M. and Jordan, M. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305.

Winn, J. and Bishop, C. (2005). Variational message passing. Journal of Machine Learning Research, 6:661-694.