Expectation Propagation Algorithm

Shuang Wang
School of Electrical and Computer Engineering, University of Oklahoma, Tulsa, OK, 74135
Email: shuangwang@ou.edu

This note contains three parts. First, we review some preliminaries for EP. Then, the EP algorithm is described in the next section. Finally, the relationship between EP and other variational methods is discussed.

I. PRELIMINARIES

A. Exponential family

The exponential family of distributions over x is the set of distributions of the form

    p(x; \theta) = h(x) g(\theta) \exp(\theta^T u(x)),   (1)

where the measurement x may be scalar or vector, discrete or continuous, \theta are the parameters of the distribution, h(x) and u(x) are functions of x, and g(\theta) is a normalization factor satisfying

    g(\theta) \int h(x) \exp(\theta^T u(x)) dx = 1.   (2)

If the variables are discrete, the integration is simply replaced by summation. The exponential family has many properties that simplify computation. For example, if the likelihood function is a member of the exponential family, the posterior has a closed-form expression once a conjugate prior is chosen within the exponential family. Moreover, the exponential family has a wide range of members, such as the Gaussian, Bernoulli, discrete multinomial and Poisson distributions, so it is applicable to many different inference models.

B. Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence [1] quantifies the difference between a probability distribution p(x) and an approximating distribution q(x). For distributions p(x) and q(x) over continuous variables, the KL divergence is defined as

    D_{KL}(p(x) \| q(x)) = \int p(x) \log \frac{p(x)}{q(x)} dx,   (3)

where, for discrete variables, the integration is again replaced by summation. The KL divergence is a non-symmetric measure, i.e. in general D_{KL}(p(x) \| q(x)) \neq D_{KL}(q(x) \| p(x)). To give the reader an intuitive view of the difference between the two forms, assume the true distribution p(x) is multimodal and the candidate distribution q(x) is unimodal. Minimizing D_{KL}(q(x) \| p(x)) makes the approximate distribution q(x) pick out one of the modes of p(x); this form is commonly used in variational Bayes methods. In contrast, the best approximation q(x) obtained by minimizing D_{KL}(p(x) \| q(x)) averages across all the modes; this latter form is used in the approximate inference procedure of EP.
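As an illustration (not part of the original note), the sketch below fits a single Gaussian q(x) to an assumed bimodal mixture p(x) under both objectives: D_{KL}(p \| q) is minimized in closed form by matching moments (anticipating the result (9) derived below), while D_{KL}(q \| p) is minimized numerically.

```python
# Sketch: contrast the two KL objectives by fitting one Gaussian q to a
# bimodal mixture p (the mixture and the grid are illustrative assumptions).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]

# Bimodal target p(x): mixture of two well-separated Gaussians.
p = 0.5 * norm.pdf(xs, -3.0, 1.0) + 0.5 * norm.pdf(xs, 3.0, 1.0)

# min D_KL(p||q): closed form, match mean and variance of p (mode averaging).
mean_p = np.sum(xs * p) * dx
var_p = np.sum((xs - mean_p) ** 2 * p) * dx
print("KL(p||q) fit: mean %.2f, std %.2f" % (mean_p, np.sqrt(var_p)))

# min D_KL(q||p): optimize (mu, log sigma) numerically (mode seeking).
def kl_qp(params):
    mu, log_sigma = params
    q = norm.pdf(xs, mu, np.exp(log_sigma))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dx

res = minimize(kl_qp, x0=[-2.0, 0.0])  # start near the left mode
print("KL(q||p) fit: mean %.2f, std %.2f" % (res.x[0], np.exp(res.x[1])))
# Typical output: the KL(p||q) fit straddles both modes (mean ~0, std ~3.2),
# while the KL(q||p) fit locks onto a single mode (mean ~ -3, std ~1).
```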

Since this note focuses on reviewing the EP algorithm, we study the properties of minimizing D_{KL}(p(x) \| q(x)) first; the difference between minimizing D_{KL}(p(x) \| q(x)) and D_{KL}(q(x) \| p(x)) is discussed further later in this note.

To ensure a tractable solution when minimizing the KL divergence D_{KL}(p(x) \| q(x)), the approximate distribution q(x) is usually restricted to a member of the exponential family. Thus, according to (1), q(x) can be written as

    q(x; \theta) = h(x) g(\theta) \exp(\theta^T u(x)),   (4)

where \theta are the parameters of the distribution. Substituting q(x; \theta) into the KL divergence D_{KL}(p(x) \| q(x)) gives

    D_{KL}(p(x) \| q(x)) = -\ln g(\theta) - \theta^T E_{p(x)}[u(x)] + const,   (5)

where const collects all the terms that are independent of the parameters \theta. To minimize the KL divergence, we set the gradient of D_{KL}(p(x) \| q(x)) with respect to \theta to zero, which gives

    -\nabla \ln g(\theta) = E_{p(x)}[u(x)].   (6)

Moreover, taking the gradient of both sides of (2) with respect to \theta gives

    \nabla g(\theta) \int h(x) \exp(\theta^T u(x)) dx + g(\theta) \int h(x) \exp(\theta^T u(x)) u(x) dx = 0.   (7)

Rearranging and using (2) again, we get

    -\nabla \ln g(\theta) = E_{q(x)}[u(x)].   (8)

Comparing (6) and (8), we obtain

    E_{p(x)}[u(x)] = E_{q(x)}[u(x)].   (9)

Thus, from (9) we see that minimizing the KL divergence is equivalent to matching the expected sufficient statistics. For example, to minimize the KL divergence with a Gaussian q(x; \theta), we only need to set the mean and covariance of q(x; \theta) equal to the mean and covariance of p(x), respectively.

C. Assumed-density filtering (ADF)

ADF is a technique for constructing tractable approximations to complex probability distributions, and EP can be viewed as an extension of ADF. We therefore first provide a concise review of ADF and then extend it to the EP algorithm.
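The identity behind (6)-(9) can be checked numerically. The sketch below (an illustration, not from the note) writes a 1-D Gaussian in the form (1) with h(x) = 1 and u(x) = (x, x^2), and verifies that -\nabla \ln g(\theta) equals the expected sufficient statistics E_{q(x)}[u(x)] = (E[x], E[x^2]).

```python
# Sketch: verify -grad ln g(theta) = E_q[u(x)] for a 1-D Gaussian written in
# exponential-family form with u(x) = (x, x^2) and natural parameters
# theta = (mu/var, -1/(2 var)); this identity underlies moment matching (9).
import numpy as np

def log_g(theta):
    # g(theta) = 1 / int exp(theta1*x + theta2*x^2) dx, valid for theta2 < 0.
    t1, t2 = theta
    return -(0.5 * np.log(np.pi / -t2) + t1 ** 2 / (-4.0 * t2))

mu, var = 1.5, 0.7                        # target mean and variance
theta = np.array([mu / var, -0.5 / var])  # natural parameters

eps = 1e-6                                # central-difference gradient of ln g
grad = np.array([(log_g(theta + eps * e) - log_g(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

print("-grad ln g:", -grad)               # ~ [E[x], E[x^2]]
print("E_q[u(x)] :", [mu, var + mu ** 2])
```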

Let us consider Bayes' rule and suppose that the posterior distribution factorizes in the following form:

    p(x|y) = p(x) p(y|x) / p(y) = (1/Z) p_0(x) \prod_{i=1}^{N} p(y_i|x) = (1/Z) \prod_{i=0}^{N} p_i(x),   (10)

where Z is a normalization constant and p_i(x) is a simplified notation for each corresponding factor in (10), with p_0(x) = p(x) denoting the prior and p_i(x) = p(y_i|x) for i > 0. If the likelihood function p(y_i|x) has a complex form, direct evaluation of the posterior distribution is infeasible. For example, if each likelihood factor is a mixture of two Gaussians and there are N = 100 observations, the exact posterior is a mixture of 2^100 Gaussians. To solve this problem, approximate inference methods seek an approximate posterior distribution that is close to the true posterior p(x|y). The approximating distributions are usually chosen from the exponential family to ensure computational feasibility. The best approximation can then be found by minimizing the KL divergence:

    \theta^* = \arg\min_\theta D_{KL}(p(x|y) \| q(x; \theta)).   (11)

However, (11) is difficult to solve directly. ADF sidesteps this difficulty by iteratively including one factor of the true posterior at a time. At first, ADF chooses q(x; \theta_0) to best approximate the factor p_0(x) through

    \theta_0 = \arg\min_\theta D_{KL}(p_0(x) \| q(x; \theta)).   (12)

ADF then updates the approximation by incorporating the next factor p_i(x), until all the factors have been included, which gives the update rule

    \theta_i = \arg\min_\theta D_{KL}\left( (1/Z_i) p_i(x) q(x; \theta_{i-1}) \| q(x; \theta) \right),   (13)

where Z_i = \int p_i(x) q(x; \theta_{i-1}) dx normalizes the target. As shown in Section I-B, if q(x; \theta) is chosen from the exponential family, the optimal solution of (13) is obtained by matching the expected sufficient statistics of the approximate distribution q(x; \theta_i) to those of the target distribution (1/Z_i) p_i(x) q(x; \theta_{i-1}). Moreover, (13) shows that the current best approximation is built on the previous best approximation. For this reason, the estimation performance of ADF can be sensitive to the order in which the factors are processed, which may produce extremely poor approximations in some cases. The next section provides another perspective on the ADF update rule, which results in the EP algorithm and provides a way to avoid this drawback of ADF.
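The sketch below implements ADF for an assumed 1-D clutter-style model (the Gaussian prior, two-component mixture likelihood and data are illustrative choices, not from the note), computing the moments of p_i(x) q(x; \theta_{i-1}) on a grid and projecting back to a Gaussian after each factor. Re-running with the factors reversed shows the order sensitivity just described.

```python
# Sketch of ADF: prior p_0(x) = N(0, 100), likelihood factors
# p(y_i|x) = 0.5 N(y_i; x, 1) + 0.5 N(y_i; 0, 10). The exact posterior is a
# mixture of 2^N Gaussians; ADF keeps one Gaussian via moment matching.
import numpy as np
from scipy.stats import norm

xs = np.linspace(-20, 20, 8001)
dx = xs[1] - xs[0]

def adf(ys):
    q = norm.pdf(xs, 0.0, 10.0)          # q(x; theta_0): the Gaussian prior
    for y in ys:
        lik = 0.5 * norm.pdf(y, xs, 1.0) + 0.5 * norm.pdf(y, 0.0, 10.0)
        target = lik * q                  # p_i(x) q(x; theta_{i-1}), eq. (13)
        target /= target.sum() * dx
        mean = np.sum(xs * target) * dx   # project: match sufficient statistics
        var = np.sum((xs - mean) ** 2 * target) * dx
        q = norm.pdf(xs, mean, np.sqrt(var))
    return mean, var

ys = np.array([2.1, 1.7, 2.5, -9.0])      # the last point looks like clutter
print("ADF posterior : mean %.2f, var %.3f" % adf(ys))
print("reversed order: mean %.2f, var %.3f" % adf(ys[::-1]))
# The two orderings generally disagree: ADF is order sensitive.
```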

II. EXPECTATION PROPAGATION

Taking another perspective, ADF can be seen as sequentially approximating each factor p_i(x) by an approximate factor \tilde{p}_i(x), restricted to the exponential family, and then exactly updating the approximate distribution q(x; \theta) by multiplying these approximate factors together. This alternative view of ADF can be described as

    \tilde{p}_i(x) \propto q(x; \theta_i) / q(x; \theta_{i-1}),   (14)

and it is this view that produces the EP algorithm. EP initializes an approximate factor \tilde{p}_i(x) for each true factor p_i(x). At later iterations, EP revisits each approximate factor \tilde{p}_i(x) and refines it in the context of all the other current approximate factors, with only the revisited term replaced by its true factor p_i(x). After multiple iterations, the approximation is obtained according to (15):

    q(x; \theta^*) \propto \prod_i \tilde{p}_i(x).   (15)

Since the refinements do not depend on the order in which the factors are processed, EP provides a more accurate approximation than ADF. The general EP procedure is as follows:

1) Initialize the term approximations \tilde{p}_i(x), chosen from a member of the exponential family suited to the problem.
2) Compute the approximate distribution

    q(x; \theta) = (1/Z) \prod_i \tilde{p}_i(x),   (16)

where Z = \int \prod_i \tilde{p}_i(x) dx.
3) Until all \tilde{p}_i(x) converge:
   a) Choose a \tilde{p}_i(x) to refine.
   b) Remove \tilde{p}_i(x) from the current approximate distribution q(x; \theta), up to a normalization factor:

       q(x; \theta^{\setminus i}) \propto q(x; \theta) / \tilde{p}_i(x).   (17)

   c) Update q(x; \theta): first combine q(x; \theta^{\setminus i}), the true factor p_i(x) and the normalizer Z_i = \int q(x; \theta^{\setminus i}) p_i(x) dx, and then minimize the KL divergence through the moment-matching projection (9) (the Proj(·) operator):

       q(x; \theta) = Proj( (1/Z_i) q(x; \theta^{\setminus i}) p_i(x) ).   (18)

   d) Update \tilde{p}_i(x) as

       \tilde{p}_i(x) = Z_i q(x; \theta) / q(x; \theta^{\setminus i}).   (19)

4) Obtain the final approximate distribution through

    p(x|y) \approx q(x; \theta^*) \propto \prod_i \tilde{p}_i(x).   (20)
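The sketch below implements steps 1)-4) for the same assumed clutter-style model used in the ADF example, storing each site \tilde{p}_i(x) as an unnormalized Gaussian in natural parameters (precision and precision-times-mean) and computing the projection (18) on a grid. It omits the damping and negative-cavity-precision safeguards that a robust implementation would need.

```python
# Sketch of EP with Gaussian sites for the clutter-style model above.
import numpy as np
from scipy.stats import norm

xs = np.linspace(-20, 20, 8001)
dx = xs[1] - xs[0]
ys = np.array([2.1, 1.7, 2.5, -9.0])

tau0, nu0 = 1.0 / 100.0, 0.0             # prior site N(0, 100), kept exact
tau = np.zeros(len(ys))                   # site precisions
nu = np.zeros(len(ys))                    # site precision-times-means

for _ in range(20):
    for i, y in enumerate(ys):
        # (17) cavity: remove site i from the full approximation.
        tau_cav = tau0 + tau.sum() - tau[i]
        nu_cav = nu0 + nu.sum() - nu[i]
        cav = norm.pdf(xs, nu_cav / tau_cav, np.sqrt(1.0 / tau_cav))
        # (18) tilted distribution: cavity times the TRUE factor, projected
        # onto a Gaussian by moment matching on the grid.
        lik = 0.5 * norm.pdf(y, xs, 1.0) + 0.5 * norm.pdf(y, 0.0, 10.0)
        tilted = cav * lik
        tilted /= tilted.sum() * dx
        mean = np.sum(xs * tilted) * dx
        var = np.sum((xs - mean) ** 2 * tilted) * dx
        # (19) new site = projected q / cavity: natural parameters subtract.
        tau[i] = 1.0 / var - tau_cav
        nu[i] = mean / var - nu_cav

tau_q, nu_q = tau0 + tau.sum(), nu0 + nu.sum()   # (20) final approximation
print("EP posterior: mean %.2f, var %.3f" % (nu_q / tau_q, 1.0 / tau_q))
```

Unlike ADF, each sweep revisits every site in the context of all the others, so the result does not depend on the order in which the factors were first processed.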

A. Relationship with BP

This section shows that the BP algorithm is a special case of EP, and that EP provides an improved approximation for models in which BP is generally intractable. Let us first take a quick review of the BP algorithm. BP iteratively updates all variable nodes, then updates all factor nodes by sending messages in parallel, and finally updates the belief of each variable after each iteration. From another viewpoint, BP can be seen as updating the belief over a variable x_i by incorporating one factor node at a time. Under this perspective, the belief of variable x_i is updated as

    b(x_i) = m_{X_i \to f_s}(x_i) m_{f_s \to X_i}(x_i) / Z_i,   (21)

where Z_i = \int m_{X_i \to f_s}(x_i) m_{f_s \to X_i}(x_i) dx_i is the normalization factor. We can loosely interpret m_{X_i \to f_s}(x_i) and m_{f_s \to X_i}(x_i) as the prior and the likelihood message, respectively.

Suppose each likelihood message m_{f_s \to X_i}(x_i) has a complex form, e.g. a mixture of multiple Gaussian distributions. Then evaluating the exact beliefs over all variables is computationally infeasible. Instead of propagating the exact likelihood message m_{f_s \to X_i}(x_i), EP passes an approximate message \tilde{m}_{f_s \to X_i}(x_i), obtained by the projection operation shown in the general EP procedure. Moreover, \tilde{m}_{f_s \to X_i}(x_i) is usually chosen from the exponential family to keep the problem tractable. Thus, the approximate belief in EP has the form

    b(x_i) \approx q(x_i) \propto \prod_{s \in N(X_i)} \tilde{m}_{f_s \to X_i}(x_i).   (22)

To show that BP is a special case of EP, we further define the partial belief of a variable node as

    b(x_i)^{\setminus f_s} = b(x_i) / \tilde{m}_{f_s \to X_i}(x_i) = \prod_{s' \in N(X_i) \setminus s} \tilde{m}_{f_{s'} \to X_i}(x_i),   (23)

and the partial belief of a factor node as

    b(f_s)^{\setminus X_i} = b(f_s) / m_{X_i \to f_s}(x_i),   (24)

where b(f_s) = \prod_{j \in N(f_s)} m_{X_j \to f_s}(x_j) is defined as the belief of the factor node f_s. Comparing with (18) and (19), the factor-to-variable message update rule in EP can be written as

    \tilde{m}_{f_s \to X_i}(x_i) = Proj( b(x_i)^{\setminus f_s} m_{f_s \to X_i}(x_i) ) / b(x_i)^{\setminus f_s}
                                 = Proj( b(x_i)^{\setminus f_s} \int_{x_s \setminus x_i} f_s(x_s) b(f_s)^{\setminus X_i} ) / b(x_i)^{\setminus f_s},   (25)

where the integral runs over the continuous variables x_s \setminus x_i; for discrete variables, the integral is simply replaced by summation. Furthermore, the new belief b(x_i) is approximated as

    b(x_i) \approx q_i(x_i) = b(x_i)^{\setminus f_s} \tilde{m}_{f_s \to X_i}(x_i) / Z_i,   (26)

where Z_i = \int_{x_i} b(x_i)^{\setminus f_s} \tilde{m}_{f_s \to X_i}(x_i) dx_i.
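Anticipating the reduction discussed in the next paragraph, the cancellation in (25) can be written out explicitly. This is a sketch under the assumption that the exact message already lies in the approximating family, so that Proj(·) is the identity:

```latex
% If the exact message lies in the approximating family, Proj is the identity:
\tilde{m}_{f_s \to X_i}(x_i)
  = \frac{\operatorname{Proj}\!\Big( b(x_i)^{\setminus f_s}
      \int_{x_s \setminus x_i} f_s(x_s)\, b(f_s)^{\setminus X_i} \Big)}
         {b(x_i)^{\setminus f_s}}
  = \frac{ b(x_i)^{\setminus f_s}
      \int_{x_s \setminus x_i} f_s(x_s)\, b(f_s)^{\setminus X_i} }
         {b(x_i)^{\setminus f_s}}
  = \int_{x_s \setminus x_i} f_s(x_s)
      \prod_{j \in N(f_s) \setminus i} m_{X_j \to f_s}(x_j),
```

which is exactly the standard BP factor-to-variable message.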

Now, if the integral in (25) is tractable even without using the projection to approximate m_{f_s \to X_i} (e.g. in a linear Gaussian model), then b(x_i)^{\setminus f_s} in (25) cancels and the factor-to-variable message update rule in EP reduces to the standard BP case.

III. RELATIONSHIP WITH OTHER VARIATIONAL INFERENCE METHODS

In this section, we describe the relationship between EP and other variational inference algorithms, e.g. variational Bayes (VB). A Bayesian probabilistic model specifies the joint distribution p(x, y), where all the hidden variables in x are given prior distributions. The goal is to find the best approximation to the posterior distribution p(x|y). Consider the decomposition of the log joint distribution:

    \log p(x, y) = \log p(x|y) + \log p(y).   (27)

Rearranging (27) and integrating both sides of the rearranged equation against a given distribution q(x), with \int q(x) dx = 1, we get the log model evidence

    \log p(y) = \int q(x) \log p(y) dx = \int q(x) \log p(x, y) dx - \int q(x) \log p(x|y) dx.   (28)

Reformatting (28), we get

    \log p(y) = L(q(x)) + D_{KL}(q(x) \| p(x|y)),   (29)

where we define

    L(q(x)) = \int q(x) \log \frac{p(x, y)}{q(x)} dx,   (30)

    D_{KL}(q(x) \| p(x|y)) = \int q(x) \log \frac{q(x)}{p(x|y)} dx.   (31)

Since D_{KL}(q(x) \| p(x|y)) is a nonnegative functional, L(q(x)) is a lower bound on \log p(y). Maximizing the lower bound L(q(x)) with respect to the distribution q(x) is therefore equivalent to minimizing D_{KL}(q(x) \| p(x|y)), and the bound becomes tight when q(x) = p(x|y). However, working with the true posterior distribution p(x|y) may be intractable. Thus, we assume that the elements of x can be partitioned into M disjoint groups x_i, i = 1, 2, ..., M, and further assume that the approximate distribution q(x) factorizes with respect to these groups:

    q(x) = \prod_{i=1}^{M} q_i(x_i).   (32)

Note that this factorized approximation corresponds to mean field theory, which was developed in physics. Given the aforementioned assumptions, we now seek, among all distributions q(x) of this form, the one for which the lower bound L(q(x)) is largest. Since direct maximization of (30) with respect to q(x) is difficult, we instead optimize (30) with respect to each of the factors in (32) in turn. Substituting (32) into (30) and keeping only the terms that depend on q_j(x_j), we get

    L(q(x)) = \int q_j(x_j) \left[ \int \log p(x, y) \prod_{i \neq j} q_i(x_i) dx_i \right] dx_j - \int q_j(x_j) \log q_j(x_j) dx_j + const
            = -\int q_j(x_j) \log \frac{q_j(x_j)}{\tilde{p}(x_j, y)} dx_j + const
            = -D_{KL}(q_j(x_j) \| \tilde{p}(x_j, y)) + const,   (33)

where we define \tilde{p}(x_j, y) as

    \tilde{p}(x_j, y) = \exp\left( \int \log p(x, y) \prod_{i \neq j} q_i(x_i) dx_i \right) = \exp( E_{i \neq j}[\log p(x, y)] ).   (34)

Thus, if we keep all the factors q_i(x_i) for i \neq j fixed, maximizing (33) with respect to q_j(x_j) is equivalent to minimizing D_{KL}(q_j(x_j) \| \tilde{p}(x_j, y)). In practice, we first initialize all of the factors q_i(x_i), and then iteratively update each factor q_j(x_j) by minimizing D_{KL}(q_j(x_j) \| \tilde{p}(x_j, y)) until the algorithm converges.
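As an illustration of these coordinate updates (an assumed toy example in the spirit of [2], Ch. 10, not from the note), the sketch below applies the mean-field factorization (32) to a correlated 2-D Gaussian target, for which each optimal factor q_j is Gaussian with variance 1/\Lambda_{jj}. It also previews the variance under-estimation discussed next.

```python
# Sketch: mean-field VB, eqs. (33)-(34), for a 2-D Gaussian target with
# precision matrix L. The optimal factors are q_j(x_j) = N(m_j, 1/L_jj) with
# coupled means, so we iterate the coordinate updates until convergence.
import numpy as np

mu = np.array([0.0, 0.0])
L = np.array([[2.0, -1.6],                # precision matrix, strong correlation
              [-1.6, 2.0]])

m = np.array([1.0, -1.0])                 # initialize the factor means
for _ in range(50):                       # coordinate updates (Gauss-Seidel)
    m[0] = mu[0] - L[0, 1] / L[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - L[1, 0] / L[1, 1] * (m[0] - mu[0])

print("factor means            :", m)                       # -> true mean
print("true marginal variances :", np.diag(np.linalg.inv(L)))
print("mean-field variances    :", 1.0 / np.diag(L))
# The means are recovered exactly, but the mean-field variances 1/L_jj are
# smaller than the true marginal variances: VB under-estimates the variance.
```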

Now we can see that the key difference between EP and VB is the form of the KL divergence being minimized. The advantage of VB is that it provides a lower bound at each optimization step, so convergence is guaranteed. However, VB may under-estimate the variance. In EP, minimizing D_{KL}(p(x) \| q(x)) is equivalent to moment matching, but convergence is not guaranteed. However, EP has fixed points, and when it does converge, its approximation usually outperforms that of VB.

REFERENCES

[1] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951.
[2] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
[3] T. Minka, "Expectation propagation for approximate Bayesian inference," in Uncertainty in Artificial Intelligence, vol. 17, 2001, pp. 362-369.
[4] T. Minka, http://research.microsoft.com/en-us/um/people/minka/papers/ep/roadmap.html.