
Power EP

Thomas Minka
Microsoft Research Ltd., Cambridge, UK
MSR-TR-2004-149, October 4, 2004

Abstract

This note describes power EP, an extension of Expectation Propagation (EP) that makes the computations more tractable, so that power EP is applicable to a wider variety of models than EP. Instead of minimizing KL-divergence at each step, power EP minimizes α-divergence. This minimization turns out to be equivalent to minimizing KL-divergence with the exact distribution raised to a power. By choosing this power to cancel exponents, the problem may be substantially simplified. The resulting approximation is not the same as regular EP, but in practice it is still very good, and it allows tackling problems which are intractable under regular EP.

1 Introduction

Expectation Propagation (EP) is a method for approximating a complex probability distribution p by a simpler distribution q (Minka, 2001b). To apply it, you write the distribution as a product of terms, p(x) = \prod_a t_a(x), and then you approximate the terms one by one. At each step, you need to evaluate an integral of the form \int t_a(x) q^{\setminus a}(x)\,dx (a definite integral). If the terms t_a are simple enough, this integral can be done quickly, usually analytically.

Unfortunately, there are many functions which cannot be factored into simple terms. For example, consider a product of T densities:

p(x) = \frac{1}{x^2+1} \cdot \frac{1}{(x-1)^2+1}    (1)

This function naturally factors into two terms, but the integrals \int \frac{1}{x^2+1}\, q^{\setminus a}(x)\,dx are not easy to evaluate, so EP bogs down. On the other hand, the integrals \int (x^2+1)\, q^{\setminus a}(x)\,dx are easy to evaluate. Using power EP, we can approximate p using only these easy integrals. Power EP is an extension of EP to employ integrals of the form \int t_a(x)^{\beta} q(x)\,dx, for any powers \beta that you choose. For example, in the T density problem we can let the powers be \beta = -1, which leads to easy integrals. The rest of the algorithm is essentially the same as EP, with a scaling step to account for the powers.
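As a concrete numerical illustration of why β = −1 helps (a sketch with a Gaussian q of my own choosing; this code is not part of the report): against a Gaussian, the integral of the inverted term x² + 1 is just a second moment with a closed form, while the original T-density term 1/(x² + 1) has to be handled by quadrature.

```python
import numpy as np

# Take q(x) = N(x; m, v) and compare the "easy" and "hard" integrals
# from the introduction. Grid quadrature stands in for the exact integrals.
m, v = 0.3, 0.8
xs = np.linspace(-30.0, 30.0, 200001)
dx = xs[1] - xs[0]
q = np.exp(-(xs - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# Easy: integral of (x^2 + 1) q(x) dx = E[x^2] + 1 = m^2 + v + 1 in closed form.
easy_numeric = np.sum((xs ** 2 + 1) * q) * dx
easy_closed = m ** 2 + v + 1

# Hard: integral of q(x) / (x^2 + 1) dx has no comparable closed form.
hard_numeric = np.sum(q / (xs ** 2 + 1)) * dx

print(easy_numeric, easy_closed, hard_numeric)
```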
Power EP was originally described by Minka & Lafferty (2002), briefly and without derivation. Later, Wiegerinck & Heskes (2002) discussed an algorithm called fractional belief propagation, describing its energy function and an interpretation in terms of α-divergence. In fact, fractional belief propagation is a special case of power EP, and all of their results also apply to power EP. This paper expands on both Minka & Lafferty (2002) and Wiegerinck & Heskes (2002), giving a unified perspective on power EP and the different ways of implementing it. The paper is organized as follows. Section 2 describes the algorithm in detail, section 3 describes the α-divergence interpretation, and section 4 gives the objective function.

2 EP and power EP

A set of distributions is called an exponential family if it can be written as

q(x) = \exp\Big(\sum_j g_j(x)\,\nu_j\Big)    (2)

where \nu_j are the parameters of the distribution and g_j are fixed features of the family, such as (1, x, x^2) in the Gaussian case. Because we will be working with unnormalized distributions, we make 1 a feature, whose corresponding parameter captures the scale of the distribution. The reason to use exponential families is closure under multiplication: the product of distributions in the family is also in the family.

Given a distribution p and an exponential family F, Expectation Propagation tries to find a distribution q \in F which is close to p in the sense of the KL-divergence:

KL(p\,\|\,q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx + \int \big(q(x) - p(x)\big)\,dx    (3)

Here we have written the divergence with a correction factor, so that it applies to unnormalized distributions. The basic operation in EP is KL-projection, defined as

proj[p] = \operatorname{argmin}_{q \in F} KL(p\,\|\,q)    (4)

(In this notation, the identity of F comes from context.) Substituting the form of q (2) into the KL-divergence, we find that the minimum is the unique member of F whose expectation of g_j matches that of p, for all j:

q = proj[p]  \Leftrightarrow  \int g_j(x)\, q(x)\,dx = \int g_j(x)\, p(x)\,dx    (5)

For example, if F is the set of Gaussians, then proj[p] means computing the mean and variance of p, and returning a Gaussian with that mean and variance.

Now, write the original distribution p as a product of terms:

p(x) = \prod_a t_a(x)    (6)

This defines the specific way in which we want to divide the network, and is not unique. By approximating each term t_a(x) by \tilde{t}_a(x), we get an approximation divided in the same way:

q(x) = \prod_a \tilde{t}_a(x)    (7)

For each term t_a, the algorithm first computes q^{\setminus a}(x), which represents the rest of the distribution. Then it minimizes KL-divergence over \tilde{t}_a, holding q^{\setminus a} fixed.
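The KL-projection of (4)–(5) can be sketched numerically for the Gaussian family (the function name, grid, and test distribution here are my own choices, not the paper's): compute the mass, mean and variance of an unnormalized p, and return the scaled Gaussian with those moments.

```python
import numpy as np

def proj_gaussian(p, xs):
    """Moment-match p with a scaled Gaussian: features g = (1, x, x^2)."""
    dx = xs[1] - xs[0]
    mass = np.sum(p) * dx                               # matches feature 1
    mean = np.sum(xs * p) * dx / mass                   # matches feature x
    var = np.sum(xs ** 2 * p) * dx / mass - mean ** 2   # matches feature x^2
    q = mass * np.exp(-(xs - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return q, mass, mean, var

xs = np.linspace(-20.0, 20.0, 100001)
p = np.exp(-np.abs(xs - 1.0))   # unnormalized, non-Gaussian (Laplace shape)
q, mass, mean, var = proj_gaussian(p, xs)
print(mass, mean, var)   # ≈ 2, 1, 2 for this Laplace(1, 1) shape
```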
This process can be written succinctly as follows:

q^{\setminus a}(x) = q(x)/\tilde{t}_a(x)    (8)
\tilde{t}_a(x)^{new} = \operatorname{argmin} KL\big(t_a(x)\, q^{\setminus a}(x)\,\|\,\tilde{t}_a(x)\, q^{\setminus a}(x)\big)    (9)
 = proj\big[t_a(x)\, q^{\setminus a}(x)\big] \big/ q^{\setminus a}(x)    (10)

In the Gaussian case, both q(x) and \tilde{t}_a(x) will be scaled Gaussians, so algorithmically (8) means "divide the Gaussians and their corresponding scale factors to get a new scaled Gaussian, and call it q^{\setminus a}(x)." Similarly, (9) means "construct a scaled Gaussian whose moments match t_a(x) q^{\setminus a}(x) and divide it by q^{\setminus a}(x), to get a new scaled Gaussian which replaces \tilde{t}_a(x)." This is the basic EP algorithm. Power EP extends the algorithm to use the following update (for any desired n_a):

q^{\setminus a}(x) = q(x)/\tilde{t}_a(x)^{1/n_a}    (11)
\tilde{t}_a(x)^{new} = \Big(proj\big[t_a(x)^{1/n_a}\, q^{\setminus a}(x)\big] \big/ q^{\setminus a}(x)\Big)^{n_a}    (12)
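The Gaussian bookkeeping described above can be sketched in code (the class and its representation are my own illustration, not prescribed by the paper): storing a scaled Gaussian by its log scale and natural parameters makes the divisions in (8) and (10) simple parameter subtractions.

```python
import numpy as np

class ScaledGaussian:
    """s * N(x; m, v) stored as (log-scale, precision 1/v, precision*mean m/v)."""
    def __init__(self, logs, prec, precmean):
        self.logs, self.prec, self.precmean = logs, prec, precmean
    def __mul__(self, other):   # multiply Gaussians: add natural parameters
        return ScaledGaussian(self.logs + other.logs,
                              self.prec + other.prec,
                              self.precmean + other.precmean)
    def __truediv__(self, other):   # divide Gaussians: subtract natural parameters
        return ScaledGaussian(self.logs - other.logs,
                              self.prec - other.prec,
                              self.precmean - other.precmean)
    def moments(self):
        v = 1.0 / self.prec
        return v * self.precmean, v   # (mean, variance)

a = ScaledGaussian(0.0, 2.0, 1.0)     # precision 2, mean 0.5
b = ScaledGaussian(0.0, 0.5, 0.25)    # precision 0.5, mean 0.5
m, v = (a * b / b).moments()
print(m, v)   # dividing b back out recovers a: mean 0.5, variance 0.5
```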

To see this another way, define the fractional terms f_a and \tilde{f}_a:

f_a(x) = t_a(x)^{1/n_a}    \tilde{f}_a(x) = \tilde{t}_a(x)^{1/n_a}    (13)

The update for \tilde{f}_a is:

q^{\setminus f_a}(x) = q(x)/\tilde{f}_a(x)    (14)
\tilde{f}_a(x)^{new} = proj\big[f_a(x)\, q^{\setminus f_a}(x)\big] \big/ q^{\setminus f_a}(x)    (15)

This is exactly like EP, except we then scale \tilde{f}_a in order to get \tilde{t}_a. For example, suppose we want to approximate p(x) = a(x)/b(x) with q(x) = \tilde{a}(x)/\tilde{b}(x). For a(x) we use n = 1, and for b(x) we use n = -1. Then the updates are:

q^{\setminus a}(x) = \frac{1}{\tilde{b}(x)}    (16)
\tilde{a}(x)^{new} = proj\big[a(x)\, q^{\setminus a}(x)\big] \big/ q^{\setminus a}(x)    (17)
q^{\setminus b}(x) = \frac{\tilde{a}(x)}{\tilde{b}(x)^2}    (18)
\tilde{b}(x)^{new} = proj\big[b(x)\, q^{\setminus b}(x)\big] \big/ q^{\setminus b}(x)    (19)

Notice the old \tilde{b}(x) is contained in q^{\setminus b}(x), creating a feedback loop. By iterating these equations, we get the following algorithm for power EP:

Power EP
Initialize \tilde{f}_a(x) for all a, and set q(x) to their product. Repeat until convergence:
For each a:

q^{\setminus f_a}(x) = q(x)/\tilde{f}_a(x)    (20)
q^*(x) = proj\big[f_a(x)\, q^{\setminus f_a}(x)\big]    (21)
\tilde{f}_a(x)^{new} = \tilde{f}_a(x)^{1-\gamma} \left(\frac{q^*(x)}{q^{\setminus f_a}(x)}\right)^{\gamma} = \tilde{f}_a(x) \left(\frac{q^*(x)}{q(x)}\right)^{\gamma}    (22)
q(x)^{new} = q(x) \left(\frac{\tilde{f}_a(x)^{new}}{\tilde{f}_a(x)}\right)^{n_a} = q(x) \left(\frac{q^*(x)}{q(x)}\right)^{n_a\gamma}    (23)

In (22), a damping factor \gamma has been used, which doesn't change the fixed points, but can help convergence. Choosing \gamma = 1/n_a makes q(x)^{new} = q^*(x), which is convenient computationally and also tends to have good convergence properties.

To get an intuitive sense of what power EP is doing, consider the case of positive integer n_a. This means that one fractional term f_a(x) is repeated, say, 3 times. Power EP approximates each copy with the same \tilde{f}_a(x). To update it, power EP removes one copy of \tilde{f}_a(x), computes a new \tilde{f}_a(x) for it, and then replicates that approximation to all the copies. If we applied regular EP across all copies, then the approximations would not be synchronized, but at convergence they would be the same, yielding the same answer as power EP (this is shown in section 4). Of course, in the non-integer case this construction is not possible and there is no equivalent EP algorithm.
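Each of (22) and (23) asserts that two written forms of the update are equal. This algebra is easy to check numerically (the grid and random stand-in functions are my own, chosen only to exercise the identities):

```python
import numpy as np

# Arbitrary positive functions stand in for f~_a, q^{\f_a} and q*; then
# q is formed as in (20), and the two forms of (22) and (23) are compared.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 101)
f_tilde = np.exp(rng.normal(size=x.size))   # current term approximation f~_a
q_rest = np.exp(rng.normal(size=x.size))    # q^{\f_a}(x)
q = q_rest * f_tilde                        # since q^{\f_a} = q / f~_a  (20)
q_star = np.exp(rng.normal(size=x.size))    # projected distribution q*  (21)
gamma, n_a = 0.4, -2.0

# (22): the damped term update, written both ways
f_new_1 = f_tilde ** (1 - gamma) * (q_star / q_rest) ** gamma
f_new_2 = f_tilde * (q_star / q) ** gamma

# (23): the resulting q update, written both ways
q_new_1 = q * (f_new_1 / f_tilde) ** n_a
q_new_2 = q * (q_star / q) ** (n_a * gamma)

print(np.allclose(f_new_1, f_new_2), np.allclose(q_new_1, q_new_2))
```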
The above discussion has focused on EP, but in fact the mean-field method has a very similar structure. All you do is reverse the direction of the divergence in (9). This way of writing the mean-field updates is known as Variational Message Passing (Winn & Bishop, 2005).

3 Alpha-divergence

This section shows how power EP can be understood in terms of minimizing α-divergence instead of KL-divergence. This interpretation is originally due to Wiegerinck & Heskes (2002), who mentioned it in the context of fractional belief propagation. The α-divergence (Amari, 1985; Trottini & Spezzaferri, 1999) is a generalization of KL-divergence with the following formula:

D_\alpha(p\,\|\,q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2}\, q(x)^{(1-\alpha)/2}\,dx\right)    (24)

The α-divergence is \geq 0, with equality iff p = q. It is convex with respect to p and q. When \alpha = 1, it reduces to KL(p\,\|\,q), and when \alpha = -1, it is KL(q\,\|\,p). Consider the following algorithm, call it α-EP:

α-EP
Initialize \tilde{t}_a(x) for all a, and set q(x) to their product. Repeat until convergence:
For each a:

q^{\setminus a}(x) = q(x)/\tilde{t}_a(x)    (25)
q(x)^{new} = \operatorname{argmin}_{q \in F} D_{\alpha_a}\big(t_a(x)\, q^{\setminus a}(x)\,\|\,q(x)\big)    (26)
\tilde{t}_a(x)^{new} = q(x)^{new}/q^{\setminus a}(x)    (27)

This algorithm is just like EP except it minimizes α-divergence at each step. It turns out that this algorithm has exactly the same fixed points as power EP. The power EP algorithm given in the previous section is just a particular way of implementing α-EP. To see the equivalence, set \alpha_a = (2/n_a) - 1. Then the minimization in (26) becomes

\operatorname{argmin}_{q \in F} \frac{n_a^2}{n_a - 1}\left(1 - \int \big(t_a(x)\, q^{\setminus a}(x)\big)^{1/n_a}\, q(x)^{1-1/n_a}\,dx\right)    (28)

This function is convex in the parameters of q(x), so the minimum is unique. Setting the gradient to zero gives a condition for the minimum:

q(x) = proj\Big[\big(t_a(x)\, q^{\setminus a}(x)\big)^{1/n_a}\, q(x)^{1-1/n_a}\Big]    (29)

The right-hand side is equivalent to q^*(x) from (21), so this amounts to q(x) = q^*(x), which is true at any fixed point of (23). Thus the inner loop of power EP is equivalent to minimizing α-divergence.
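Two of the properties claimed for (24) are easy to check numerically (the grid and the two Gaussians below are my own choices): D_α vanishes when p = q, and for α near 1 it approaches KL(p || q), which for two unit-variance Gaussians with means 0 and 0.5 equals 0.125.

```python
import numpy as np

xs = np.linspace(-12.0, 12.0, 100001)
dx = xs[1] - xs[0]

def gauss(m, v):
    return np.exp(-(xs - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def d_alpha(p, q, alpha):
    """Equation (24), evaluated by grid quadrature (for normalized p, q)."""
    integral = np.sum(p ** ((1 + alpha) / 2) * q ** ((1 - alpha) / 2)) * dx
    return 4.0 / (1 - alpha ** 2) * (1 - integral)

p, q = gauss(0.0, 1.0), gauss(0.5, 1.0)
kl_pq = np.sum(p * np.log(p / q)) * dx   # = 0.5^2 / 2 = 0.125 here

print(d_alpha(p, p, 0.5))          # ~0: divergence vanishes when p = q
print(d_alpha(p, q, 0.99), kl_pq)  # alpha near 1 recovers KL(p || q)
```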

4 The objective function

EP does not directly minimize the KL-divergence KL(p\,\|\,q), but rather a surrogate objective function, described by Minka (2001b; 2001a). This objective extends straightforwardly to power EP. As described in section 2, power EP can be interpreted as taking the fractional term f_i(x) = t_i(x)^{1/n_i} and copying it n_i times. The objective function has exactly this form: we take the EP objective and copy each term n_i times. The power EP objective is

\min_{\hat{p}_i} \max_{q} \; \sum_i n_i \int \hat{p}_i(x) \log\frac{\hat{p}_i(x)}{f_i(x)}\,dx + \Big(1 - \sum_i n_i\Big) \int q(x) \log q(x)\,dx    (30)

such that

\int g_j(x)\, \hat{p}_i(x)\,dx = \int g_j(x)\, q(x)\,dx    (31)
\int \hat{p}_i(x)\,dx = 1    (32)
\int q(x)\,dx = 1    (33)

Recall that g_j are the features of the exponential family F, so (31) means that proj[\hat{p}_i(x)] = q(x). The dual objective is

\min_{\nu} \max_{\lambda} \; \Big(\sum_i n_i - 1\Big) \log \int \exp\Big(\sum_j g_j(x)\nu_j\Big)dx - \sum_i n_i \log \int f_i(x) \exp\Big(\sum_j g_j(x)\lambda_{ij}\Big)dx    (34)

such that

\Big(\sum_i n_i - 1\Big)\nu_j = \sum_i n_i \lambda_{ij}    (35)

Define

\tau_{ij} = \nu_j - \lambda_{ij}    (36)

then it follows that

\nu_j = \sum_i n_i \tau_{ij}    (37)

Thus we can make the following definitions:

\tilde{f}_i(x) = \exp\Big(\sum_j g_j(x)\tau_{ij}\Big)    (38)
q(x) = \exp\Big(\sum_j g_j(x)\nu_j\Big) = \prod_i \tilde{f}_i(x)^{n_i}    (39)

By taking gradients, the stationary points of (34) are found to be:

q(x) = proj\big[f_i(x)\, q(x)/\tilde{f}_i(x)\big] \quad \text{for all } i    (40)

The right-hand side is equivalent to q^*(x) from (21), so this amounts to q(x) = q^*(x) for all i, which is true at any fixed point of power EP. Thus the power EP fixed points are exactly the stationary points of (34). The same result follows similarly for (30). When the n_i are positive integers, the power EP objective is equivalent to the EP objective with repeated terms. Thus both algorithms converge to the same solutions for this case.
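The step from the constraint (35) to (37), via the definition (36), is a one-line rearrangement; spelling it out:

```latex
\Big(\sum_i n_i - 1\Big)\nu_j
  = \sum_i n_i \lambda_{ij}
  = \sum_i n_i (\nu_j - \tau_{ij})
  = \Big(\sum_i n_i\Big)\nu_j - \sum_i n_i \tau_{ij}
\quad\Longrightarrow\quad
\nu_j = \sum_i n_i \tau_{ij}.
```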

References

Amari, S. (1985). Differential-geometrical methods in statistics. Springer-Verlag.

Minka, T. P. (2001a). The EP energy function and minimization schemes. research.microsoft.com/~minka/.

Minka, T. P. (2001b). Expectation propagation for approximate Bayesian inference. UAI (pp. 362-369).

Minka, T. P., & Lafferty, J. (2002). Expectation-propagation for the generative aspect model. UAI.

Trottini, M., & Spezzaferri, F. (1999). A generalized predictive criterion for model selection (Technical Report 702). CMU Statistics Dept. www.stat.cmu.edu/tr/.

Wiegerinck, W., & Heskes, T. (2002). Fractional belief propagation. NIPS 15.

Winn, J., & Bishop, C. (2005). Variational message passing. Journal of Machine Learning Research, 6, 661-694.