Generalizing expectation propagation with mixtures of exponential family distributions and an application to Bayesian logistic regression


Shiliang Sun, Shaojie He

Department of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai, P. R. China

Abstract

Expectation propagation (EP) is a widely used deterministic approximate inference algorithm in Bayesian machine learning. Traditional EP approximates an intractable posterior distribution through a set of local approximations that are updated iteratively. In this paper, we propose a generalized version of EP called generalized EP (GEP), a new approximate inference method based on the direct minimization of a KL divergence. However, when the variance of the gradient is large, the algorithm may need a long time to converge. We therefore use control variates and develop a variance-reduced version of this method called GEP-CV. We evaluate our approach on Bayesian logistic regression, where it provides faster convergence and better performance than other state-of-the-art approaches.

Key words: Expectation Propagation; Stochastic Optimization; Variance Reduction; Machine Learning

1 Introduction

Expectation propagation (EP) is a well-known method for deterministic approximate inference based on the minimization of the Kullback-Leibler (KL) divergence. Unlike variational inference (VI), EP minimizes the reverse form $\mathrm{KL}(p\,\|\,q)$ instead of $\mathrm{KL}(q\,\|\,p)$ [34]. Compared to VI, EP, which refines each approximate factor in turn in the context of all the remaining factors, can ensure that the approximation is most accurate in the regions of high posterior probability as defined by the remaining factors [1].

Corresponding author. E-mail addresses: shiliangsun@gmail.com, slsun@cs.ecnu.edu.cn (S. Sun).

Preprint submitted to Elsevier Science, 25 January 2019

For an approximation $q(\theta)$ in the exponential family, if the iterations do converge, the resulting solution will be a stationary point of a particular energy function [16], although each iteration of EP does not necessarily decrease the value of this energy function.

Although it has some advantages over VI, EP still has several shortcomings that we hope to solve or avoid. For example, EP, which approximates the true posterior with a single exponential family distribution, can be inferior, since the approximate distribution is unimodal. The true posteriors in practical applications [1] are often multi-modal, in which case minimizing $\mathrm{KL}(p\,\|\,q)$ leads to a worse approximation. In particular, if EP is applied to mixture probability distributions, the results are almost meaningless because the approximation tries to capture all of the modes of the probability distribution at once. Therefore, multi-modality should be explicitly considered in the approximate distribution. Moreover, the moment matching operation in EP requires the evaluation of expectations, so standard EP is limited to the class of models for which this evaluation is possible. Finally, since EP does not directly optimize its cost function, there is no guarantee that EP iterations will converge. It is thus desirable to introduce convergent substitutes.

To address these defects, we generalize the EP inference method in a way that resolves all of the above issues. It is reasonable to optimize the objective function directly, which does not involve moment matching evaluations and is guaranteed to converge. In this paper, we choose mixtures of exponential family distributions as the approximate distribution. They are made up of simple components, but the mixtures can be rich enough to approximate any distribution. Furthermore, sampling from exponential family distributions is comparatively easy, which benefits the adopted stochastic optimization approach. However, if the gradient of the cost function has a large variance, the stochastic optimization method may need a long time to converge, which often leads to inferior performance. Our approach therefore adopts a variance reduction technique and employs control variates to reduce the variance.

For variance reduction, the gradient can often be expressed as a Monte Carlo integral. Monte Carlo integration typically has an error variance of the form $\sigma^2/n$, with $n$ being the number of samples [27]. Although we get a better solution by sampling with a larger value of $n$, the computing time grows with $n$ [27]. Sometimes we can instead find a way to reduce $\sigma$: we construct a new Monte Carlo problem with the same solution as the original one but with a lower $\sigma$. Methods that do this are known as variance reduction techniques.
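To make the idea concrete, the following minimal numpy sketch (an illustration we add here, not taken from the paper) estimates $\mathbb{E}[e^X]$ for $X \sim \mathrm{Uniform}(0,1)$ twice: once with plain Monte Carlo and once after subtracting the control variate $g(X) = X$, whose expectation $1/2$ is known in closed form. Both estimators target the same integral, but the second has a much smaller $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)

f = np.exp(x)                 # integrand; E[f] = e - 1
g = x                         # control variate with known mean E[g] = 1/2

a = np.cov(f, g, ddof=0)[0, 1] / np.var(g)   # coefficient a = Cov(f, g) / Var(g)
f_cv = f - a * (g - 0.5)                     # same expectation, smaller variance

print("plain MC       :", f.mean(), "+/-", f.std() / np.sqrt(n))
print("control variate:", f_cv.mean(), "+/-", f_cv.std() / np.sqrt(n))
```

Both lines print estimates of $e - 1 \approx 1.718$; the control-variate estimator's standard error is roughly an order of magnitude smaller for the same $n$.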

Variance reduction techniques can be placed into several groups, such as antithetic variables, control variates, conditioning (also called Rao-Blackwellization), stratified sampling, importance sampling and reparameterization. Each technique has its own characteristics and scope of application. Among these methods, control variates and Rao-Blackwellization are the most commonly used and also yield the largest reductions in variance. Recently, Ranganath et al. [25] proposed a black box variational inference method combining both of these techniques, based on a stochastic optimization of the variational objective, to reduce the gradient variance. It can reach better predictive likelihoods than sampling methods.

In this paper, we focus on one of the most promising methods, control variates, which is widely used in many scenarios. If we want to solve an optimization problem, we can often write it in the form of stochastic gradient optimization. But stochastic optimization is hindered by the large variance of the noisy gradients, which may lead to slower convergence and worse performance. We can then use control variates to improve performance by reducing the variance of the noisy gradients. Control variates have the potential for widespread use in general stochastic optimization: they are applicable whenever the algorithm for the specific problem has the form of stochastic optimization. The choice of control variates should satisfy two principles: (i) they should have a large correlation with the gradient, and (ii) the computation of their expectation with respect to random data samples should not be expensive. Wang et al. [37] showed that control variates can be formed from low-order approximations to the noisy gradient. We discuss how the variance reduction technique is applied to our approach in Section 4.

In experiments, we use both synthesized and real-world datasets. We observe that our method shows faster convergence and better performance than other state-of-the-art approaches. We also observe that the variant of our method that uses variance reduction typically converges more quickly in terms of CPU time than the version without control variates.

The key contributions of this paper are summarized as follows.

- EP is a crucial deterministic approximate inference method, but the solution obtained by using a single exponential family distribution as the approximate distribution can be inferior. To solve this problem, we use mixtures of exponential family distributions to replace the original unimodal approximate distribution.

- In traditional EP, the number of approximate factors grows with the number of data points $N$. Unlike the traditional method, our method does not involve moment matching but directly optimizes the KL divergence, which avoids expensive memory consumption. Moreover, the proposed approximation method fits models in which the evaluation of expectations is intractable.

- Evaluating our method on Bayesian logistic regression, we obtain an improvement in classification accuracy compared to other standard approximate methods.

- We run our algorithm with different numbers of mixture components in the approximate distribution and discuss how the performance changes with this number.

The remainder of the paper is organized as follows. In Section 2, we describe the background of our work, including an introduction to EP and the main issues and progress of EP in recent years. Section 3 describes our algorithm GEP and presents its objective function; we then briefly introduce stochastic optimization and recall the global convergence of the Robbins-Monro stochastic approximation method. In Section 4, we present GEP's variance-reduced stochastic optimization and discuss how to construct the control variates. In Section 5, we describe the Bayesian logistic regression model used in the experiments and provide technical details for this model. In Section 6, we present the experimental setup and results, and the final section concludes the paper.

2 Expectation Propagation

In this section, we review the exponential family of distributions and the deterministic approximate inference algorithm EP [16], and discuss some of its issues.

2.1 Exponential Family Distributions

Exponential family distributions are very important in machine learning. One of the main reasons is that they can be handled efficiently and analytically through natural parameters or expected sufficient statistics [1]. EP [16] provides a general-purpose framework for approximating posterior beliefs by exponential family distributions. Another reason for considering the exponential family is that the likelihood function for i.i.d. data from an exponential family is a function of the sample average of the sufficient statistics $u(x)$, which has fixed dimensionality independent of the sample size, so new information can be incorporated without increasing the size of the parametric representation [29]. Moreover, even if a model does not give rise to posteriors in the exponential family, members of the exponential family can be used as approximate distributions. Next, we present the exponential family formally.

In probability and statistics, the exponential family is a set of probability distributions of a certain form, chosen for mathematical convenience on account of several useful algebraic properties. A random variable $x$ has an exponential family distribution if its probability mass function or density function admits the form
$$f(x) = h(x)\,g(\eta)\exp\{\eta^{T}u(x)\}, \quad \eta \in H. \quad (1)$$
The vectors $u(x)$ and $\eta$ are called, respectively, the sufficient statistic and the natural parameter. The set $H$ is the space of allowable natural parameter values. The function $h(x)$ is the base measure, and the function $g(\eta)$ can be interpreted as the coefficient that ensures the distribution is normalized and therefore satisfies
$$g(\eta)\int h(x)\exp\{\eta^{T}u(x)\}\,dx = 1. \quad (2)$$
A key exponential family result, obtained by taking gradients of both sides of (2) with respect to $\eta$, is that
$$-\nabla \ln g(\eta) = \mathbb{E}[u(x)], \quad (3)$$
where $\nabla \ln g(\eta)$ is the column vector of partial derivatives of $\ln g(\eta)$ with respect to each of the components of $\eta$.

The exponential family has many important properties. Since it has sufficient statistics that can summarize any amount of i.i.d. data with a fixed number of values, the posterior predictive distribution of an exponential family random variable with a conjugate prior can always be written in closed form (provided that the normalizing factor of the exponential family distribution can itself be written in closed form). Also, in the variational mean field approximation, the best approximate posterior distribution of an exponential family node (a node is a random variable in the context of graphical models) with a conjugate prior is in the same family as the node.
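As a concrete check of (1)-(3), the following sketch (our illustration; the numbers are arbitrary) writes the univariate Gaussian $\mathcal{N}(\mu,\sigma^2)$ in the form (1) with $u(x) = (x, x^2)$, $\eta = (\mu/\sigma^2, -1/(2\sigma^2))$ and $h(x) = 1/\sqrt{2\pi}$, and verifies numerically that $-\nabla\ln g(\eta) = \mathbb{E}[u(x)] = (\mu, \mu^2 + \sigma^2)$.

```python
import numpy as np

mu, sigma2 = 1.5, 0.7
eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])   # natural parameters

def log_g(eta):
    # ln g(eta) for the Gaussian written as h(x) g(eta) exp{eta^T u(x)},
    # with h(x) = 1/sqrt(2*pi) and u(x) = (x, x^2)
    return eta[0] ** 2 / (4.0 * eta[1]) + 0.5 * np.log(-2.0 * eta[1])

# identity (3): -grad ln g(eta) = E[u(x)] = (mu, mu^2 + sigma^2)
eps = 1e-6
grad = np.array([(log_g(eta + eps * np.eye(2)[i]) - log_g(eta - eps * np.eye(2)[i]))
                 / (2 * eps) for i in range(2)])
print(-grad)                        # ~ [1.5, 2.95]
print([mu, mu**2 + sigma2])         # [1.5, 2.95]
```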

2.2 Expectation Propagation

Minka [16] described a general method for approximate inference in hierarchical Bayesian models known as EP, building on earlier work on topics such as assumed density filtering [15] and loopy belief propagation [5]. EP pursues aims similar to variational inference but with the reverse form of the Kullback-Leibler divergence. A small number of numerical studies, including Minka [16], have shown that EP is convincingly better than variational Bayes, Monte Carlo and Laplace's method at similar computational cost.

Following Minka [18], given a target distribution $p(x)$, the goal of EP is to solve
$$\arg\min_{q \in \mathcal{Q}} \mathrm{KL}(p\,\|\,q), \quad (4)$$
where $q$ is a member of the set of exponential family distributions $\mathcal{Q}$. The key aspect of EP is to rely on a factorization of $p$, such that the posterior can be decomposed into a product of terms:
$$p(\theta \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\prod_i f_i(\theta), \quad (5)$$
where $p(\mathcal{D})$ is the model evidence, the factor $f_n(\theta)$ equals the likelihood $p(x_n \mid \theta)$ for each data point $x_n$, and the factor $f_0(\theta) = p(\theta)$ corresponds to the prior. The approximate distribution $q$ has the same factor structure:
$$q(\theta) = \frac{1}{Z}\prod_i \tilde{f}_i(\theta), \quad (6)$$
in which each factor $\tilde{f}_i(\theta)$ in the approximate distribution corresponds to the factor $f_i(\theta)$ in the true posterior, and $\frac{1}{Z}$ is the normalizing constant needed to ensure that $q(\theta)$ integrates to unity. In addition, each approximate factor $\tilde{f}_i(\theta)$ has the exponential family form.

An optimality result for the exponential family shows that the global solution of (4), combined with (3), is a moment matching solution [34]:
$$\mathbb{E}_q[u(\theta)] = \mathbb{E}_p[u(\theta)]. \quad (7)$$
We see that the optimal solution simply corresponds to matching the expected sufficient statistics. In the Gaussian case for $q$, this means that the best approximation of $p$ under this KL divergence is a Gaussian with the same mean and covariance as $p$. Of course, it is difficult to directly calculate the mean and covariance of $p$, and EP attempts to achieve this through iterative refinements of an approximation. EP iterates among simple local approximations to refine factors that approximate the corresponding posterior contributions.

We briefly review the EP algorithm, which is the basis of our new method. Suppose we observe a dataset containing $N$ i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^{N}$, and we have a probabilistic model $p(x \mid \theta)$ parameterized by a $d$-dimensional vector $\theta$ with prior $p_0(\theta)$. Bayesian inference involves calculating the typically intractable posterior distribution of the parameters,
$$p(\theta \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\prod_i f_i(\theta) \approx q(\theta) = \frac{1}{Z}\prod_i \tilde{f}_i(\theta), \quad (8)$$
where $Z$ is a normalization constant, and $q(\theta)$ is a tractable approximate distribution that will be updated by EP.
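The Gaussian case of (7) can be seen directly from samples. In the sketch below (our illustration, not the paper's), $p$ is a bimodal mixture; the $\mathrm{KL}(p\,\|\,q)$-optimal Gaussian must match the mean and variance of $p$, producing a single broad blanket over both modes, which is exactly the behavior that motivates the mixture approximation of Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)
# samples from a bimodal p: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2)
z = rng.integers(0, 2, size=50_000)
x = np.where(z == 0, rng.normal(-2.0, 0.5, z.size), rng.normal(2.0, 0.5, z.size))

# (7): the KL(p||q)-optimal Gaussian matches E[u] = (mean, second moment),
# i.e. it has the mean and variance of p -- a broad blanket over both modes
print("matched mean    :", x.mean())    # ~ 0
print("matched variance:", x.var())     # ~ 4.25, covering both modes
```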

The aim of EP is to update the approximate factors so that the contribution of each of the prior and likelihood terms to the posterior is captured, i.e., $\tilde{f}_n(\theta) \approx p(x_n \mid \theta)$. The ideal approach would be to minimize the KL divergence between the posterior distribution and the distribution formed by removing each term and replacing it with the corresponding approximate factor $\tilde{f}_n(\theta)$, namely $\mathrm{KL}\big[\frac{1}{p(\mathcal{D})}\prod_i f_i(\theta) \,\big\|\, \frac{1}{Z}\tilde{f}_n(\theta)\prod_{i \neq n} f_i(\theta)\big]$. But in general this minimization is intractable, because the KL divergence involves computing the full posterior. Instead, EP performs a tractable sequence of local approximations in four steps. First, choose a factor $\tilde{f}_j(\theta)$ to refine. Second, remove $\tilde{f}_j(\theta)$ from the posterior by division to obtain the cavity distribution $q^{\backslash j}(\theta) = q(\theta)/\tilde{f}_j(\theta)$. Note that we could instead find $q^{\backslash j}(\theta)$ from the product of the factors $i \neq j$, but in practice division is usually easier, since for a distribution from the widely used exponential family this calculation reduces to a subtraction of natural parameters. Third, the corresponding term in the true posterior is included to generate the tilted distribution $\hat{p}_j(\theta) \propto q^{\backslash j}(\theta) f_j(\theta)$, and we minimize $\mathrm{KL}\big[\hat{p}_j(\theta) \,\big\|\, q^{\backslash j}(\theta)\tilde{f}_j(\theta)/Z_j\big]$, where $Z_j$ is the normalizing constant. This minimization is tractable since it does not involve the true posterior distribution and, as mentioned before, it turns out to be moment matching for the exponential family. Finally, the updated factor is included in the approximate distribution. The four steps are then iterated until convergence.
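The following sketch (our toy illustration, not the paper's implementation) runs the four steps for a 1-D Bayesian logistic model with a $\mathcal{N}(0,1)$ prior and two likelihood factors $\sigma(y_n x_n \theta)$, $y_n \in \{-1,+1\}$; the data are arbitrary assumptions. Sites are stored as natural parameters (precision, precision times mean), so the cavity in the second step really is a subtraction; the tilted moments in the third step are computed by grid quadrature, since they have no closed form here.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

grid = np.linspace(-10.0, 10.0, 4001)   # quadrature grid for the tilted moments
dx = grid[1] - grid[0]
x_data = np.array([1.0, 2.0])           # toy inputs (assumption)
y_data = np.array([1.0, -1.0])          # toy labels in {-1, +1} (assumption)

# sites in natural parameters: precision lam and precision-times-mean nu;
# site 0 holds the exact N(0, 1) prior, sites 1..N approximate the likelihoods
lam = np.array([1.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.0])

for _ in range(20):
    for n in (1, 2):
        # step 2: cavity q^{\n} by subtracting natural parameters
        lam_c = lam.sum() - lam[n]
        nu_c = nu.sum() - nu[n]
        cavity = np.exp(-0.5 * lam_c * grid**2 + nu_c * grid)
        # step 3: tilted distribution and its matched moments
        tilted = cavity * sigmoid(y_data[n - 1] * x_data[n - 1] * grid)
        tilted /= tilted.sum() * dx
        m = (grid * tilted).sum() * dx
        v = ((grid - m) ** 2 * tilted).sum() * dx
        # step 4: updated site = moment-matched Gaussian minus the cavity
        lam[n] = 1.0 / v - lam_c
        nu[n] = m / v - nu_c

print("q(theta): mean =", nu.sum() / lam.sum(), " var =", 1.0 / lam.sum())
```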

2.3 Convergence and Memory Cost Issues

By the Brouwer fixed-point theorem [24], EP updates can be shown to always have a fixed point when the approximations are in the exponential family. For approximations $q(\theta)$ in the exponential family, if the iterations do converge, the resulting solution will be a stationary point of a particular energy function, which is superior to the solution of VI in many scenarios; for instance, VI provides poor approximations when nonsmooth likelihood functions are used [3]. However, one disadvantage of EP is that there is no assurance that the iterations will converge [16]. This is in contrast to VI, which maximizes a lower bound on the log marginal likelihood, so that its iterations are guaranteed not to decrease the bound. A convergent version of EP is therefore worth considering. Another issue is the prohibitive memory cost for large datasets and big models: a crucial limitation of EP is that the number of approximate factors needs to increase with the number of data points, which often leads to a prohibitively large memory overhead [12].

Recently, many works have tried to address these two issues, which hinder the widespread deployment of EP. For example, Heskes et al. [10] and Opper et al. [20] derived convergent double-loop implementations of EP, which, however, can be far slower than the original message passing procedure. Hasenclever et al. [8] proposed the stochastic natural-gradient EP (SNEP) method, which is also double-loop-like, although in practice they only performed a one-step inner-loop update to speed up training. Seeger and Nickisch [30] proposed a fast convergent algorithm for EP with covariance decoupling techniques [19,38]. For the second issue, Dehaene and Barthelmé [4] and Li and Turner [12] used factor tying (also called local parameter sharing) through the averaged EP (AEP) and stochastic EP (SEP) algorithms, which can decrease the prohibitive memory cost.

In contrast to AEP and SEP, Hernández-Lobato et al. [9] proposed a black-box method called black-box-α (BB-α) that directly optimizes the power EP [17] energy function (power EP is an extension of EP that makes the computations more tractable), which is equivalent to minimizing an α-divergence, rather than updating through power EP message passing as SEP and AEP do. The BB-α method is similar to our proposed method; the distinction is that the objective function is different. In particular, our objective function has a simpler form, and it is more convenient to compute its gradients. BB-α also has an analytic energy form that does not require double-loop procedures and can be directly optimized using gradient descent. This means that popular stochastic optimization methods can be used for large-scale learning with BB-α, and the method is provably convergent when the energy function is finite. BB-α can thus address both flaws of EP. However, using the power EP energy function as the objective has the disadvantage that there are many parameters to update, since each likelihood function has a corresponding parameter. Having seen the advantages of the BB-α method, we apply a similar idea to the KL divergence objective of EP. In addition, considering that true posteriors are multi-modal in many practical applications, we adopt a mixture of exponential family distributions as the approximate distribution, so that our method can approximate complicated distributions. We make experimental comparisons with BB-α in Section 6.

We summarize the existing EP works and our method with respect to the convergence and memory cost issues in Table 1. These two issues have been actively discussed in recent years, and this line of research is worth exploring.

[Table 1 about here.]

3 Generalizing EP with Mixtures of Exponential Family Distributions

In this section, we give the details of the proposed algorithm GEP with stochastic optimization.

3.1 GEP

We abbreviate our approximate inference method, generalized expectation propagation with mixtures of exponential family distributions, as GEP. Let $\Theta$ represent the set of hidden variables in the probabilistic model and $\mathcal{D}$ represent the observed data. The model may have hyperparameters, which are omitted here for clarity. The true posterior is given by
$$p(\Theta \mid \mathcal{D}) = \frac{p(\Theta,\mathcal{D})}{p(\mathcal{D})}. \quad (9)$$
Suppose the approximate distribution of the true posterior is $Q(\Theta)$. The original objective function of EP arises by assuming $Q(\Theta)$ is from a specific exponential family and minimizing the KL divergence
$$\mathrm{KL}(p(\Theta \mid \mathcal{D})\,\|\,Q(\Theta)) = \int_{\Theta} p(\Theta \mid \mathcal{D})\ln\frac{p(\Theta \mid \mathcal{D})}{Q(\Theta)}\,d\Theta. \quad (10)$$
Instead of optimizing this objective directly, typical EP adopts a factorized form for $Q(\Theta)$ and then iteratively updates the factors. Note that an estimate of the model evidence $p(\mathcal{D})$ can also be given when EP terminates. Since the $Q(\Theta)$ used in EP is a single exponential family distribution, many complicated posterior distributions may not be well characterized. In GEP, we instead assume that $Q(\Theta)$ is a mixture of exponential family distributions, i.e.,
$$Q(\Theta) = \sum_{m=1}^{M} \alpha_m Q_m(\Theta \mid \eta_m) = \sum_{m=1}^{M} \alpha_m h_m(\Theta)\,g_m(\eta_m)\exp\{\eta_m^{T}u_m(\Theta)\}, \quad (11)$$
where $\eta_m$ and $u_m(\Theta)$ are the natural parameters and sufficient statistics of the corresponding exponential family distribution, respectively. Thus, even if the posterior distribution is a complicated multi-modal distribution, our approximate distribution $Q$ can fit it better. Denote the set of parameters by $\Phi$, which is composed of $\{\alpha_m,\eta_m\}_{m=1}^{M}$.
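For a mixture of Gaussians, the two operations GEP needs, namely evaluating $\ln Q(\Theta)$ and sampling $\Theta \sim Q_m$, are both simple. A minimal sketch (our illustration, with arbitrary parameter values) follows; the log-sum-exp trick keeps the log-density numerically stable.

```python
import numpy as np

def log_mixture_pdf(theta, alpha, mu, sigma2):
    """log Q(theta) for a mixture of 1-D Gaussians via log-sum-exp."""
    comp = (-0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * (theta[:, None] - mu) ** 2 / sigma2)      # shape (n, M)
    a = np.log(alpha) + comp
    amax = a.max(axis=1, keepdims=True)
    return (amax + np.log(np.exp(a - amax).sum(axis=1, keepdims=True))).ravel()

def sample_mixture(n, alpha, mu, sigma2, rng):
    """Ancestral sampling: pick a component m ~ alpha, then theta ~ Q_m."""
    m = rng.choice(len(alpha), size=n, p=alpha)
    return rng.normal(mu[m], np.sqrt(sigma2[m]))

rng = np.random.default_rng(2)
alpha = np.array([0.3, 0.7]); mu = np.array([-2.0, 1.0]); sigma2 = np.array([0.5, 1.0])
theta = sample_mixture(5, alpha, mu, sigma2, rng)
print(log_mixture_pdf(theta, alpha, mu, sigma2))
```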

The initial objective function that GEP aims to minimize is
$$\mathrm{KL}(p\,\|\,q) = -\int_{\Theta} p(\Theta \mid \mathcal{D})\ln Q(\Theta)\,d\Theta + \mathrm{const} = -\frac{1}{p(\mathcal{D})}\int_{\Theta} p(\Theta,\mathcal{D})\ln Q(\Theta)\,d\Theta + \mathrm{const} = -\frac{1}{p(\mathcal{D})}\int_{\Theta} p(\Theta,\mathcal{D})\ln\Big[\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big]d\Theta + \mathrm{const}, \quad (12)$$
where const represents terms unrelated to $Q(\Theta)$. Since $p(\mathcal{D})$ is positive, we can give an equivalent objective function,
$$J = \int_{\Theta} p(\Theta,\mathcal{D})\ln\Big[\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big]d\Theta, \quad (13)$$
which should be maximized to find the parameters. However, in complicated probabilistic models, a closed form of this expectation may not be analytically computable, which hinders the direct optimization of the parameters. In general, gradient ascent can be used to maximize the objective function. We proceed to use stochastic gradient ascent to update the parameters, which is similar to the stochastic search method of Blei et al. [2].

3.2 Stochastic Optimization

Following the stochastic gradient ascent discussed above, we briefly recall stochastic optimization. Assume that we want to maximize a function $f(x)$, and that $h(x)$ is a random variable whose expectation is the gradient of $f(x)$, i.e., $\mathbb{E}[h(x)] = \nabla_x f(x)$. Let the learning rate $\rho_t$ be a non-negative scalar. At the $t$th iteration, the stochastic optimization update for $x$ is
$$x_{t+1} \leftarrow x_t + \rho_t h_t(x_t), \quad (14)$$
where, for any $t$, $h_t(x)$ is a sample of $h(x)$. If the learning rate $\rho_t$ follows the Robbins-Monro conditions
$$\sum_{t=1}^{+\infty}\rho_t = +\infty, \qquad \sum_{t=1}^{+\infty}\rho_t^2 < +\infty, \quad (15)$$
then $x_{t+1}$ will converge to the optimum $x^{*}$ (if $f$ is concave) or to a local optimum of $f$ (if not concave) [11].
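A minimal sketch of the update (14) under a schedule satisfying (15), on a toy concave objective $f(x) = -(x-3)^2$; the objective, the noise level and the schedule constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
tau, kappa = 10.0, 0.6        # rho_t = (tau + t)^(-kappa) satisfies (15) for kappa in (0.5, 1]

x = 0.0
for t in range(1, 5001):
    rho = (tau + t) ** (-kappa)
    # noisy gradient h_t(x) with E[h_t(x)] = grad f(x) for f(x) = -(x - 3)^2
    h = -2.0 * (x - 3.0) + rng.normal(scale=5.0)
    x += rho * h              # ascent step (14)

print(x)                      # converges toward the maximizer x* = 3
```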

The convergence analysis given in Robbins and Monro [26] proves that the Robbins-Monro stochastic approximation is convergent, i.e., its outputs are (globally or locally) optimal solutions, provided the learning rate satisfies the Robbins-Monro conditions. Stochastic optimization is one of the mainstream optimization approaches for solving large-scale machine learning problems; it has a solid theoretical foundation and good prospects for development. However, some inherent defects still need to be overcome. For example, if the gradient has a large variance, a stochastic optimization algorithm may take a long time to converge and yield bad performance. In this paper, we develop a variance-reduced version of the primary algorithm by making use of control variates to reduce the variance of the noisy gradient.

4 Variance Reduction for GEP

In this section, we apply variance reduction to our method: by reducing the variance of the gradients, one can hope to achieve convergence with a larger or even a constant step size and thus obtain a faster convergence rate. We first compute the gradient of the GEP objective so as to apply the stochastic approximation method, and then review the basic idea of the control variate method for variance reduction. Finally, we show how to choose the control variates that define the variance-reduced gradients replacing the plain gradients of the objective.

4.1 Gradient of the GEP Objective Function

Stochastic optimization requires a stochastic approximation of the gradients of the objective function with respect to the parameters. There are two different types of parameters in our approximate posterior: the mixture weights $\alpha$ and the natural parameters $\eta$ of the exponential family distributions. These gradients are given below (assuming the necessary regularity conditions):

$$\nabla_{\alpha_m} J = \nabla_{\alpha_m}\int_{\Theta} p(\Theta,\mathcal{D})\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} p(\Theta,\mathcal{D})\,\nabla_{\alpha_m}\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} \frac{p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\,Q_m(\Theta \mid \eta_m)\,d\Theta = \mathbb{E}_{Q_m}\bigg[\frac{p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\bigg], \quad (16)$$

$$\nabla_{\eta_m} J = \nabla_{\eta_m}\int_{\Theta} p(\Theta,\mathcal{D})\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} p(\Theta,\mathcal{D})\,\nabla_{\eta_m}\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} \frac{\alpha_m\,p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\,\nabla_{\eta_m} Q_m(\Theta \mid \eta_m)\,d\Theta = \int_{\Theta} \frac{\alpha_m\,p(\Theta,\mathcal{D})\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\,Q_m(\Theta \mid \eta_m)\,d\Theta = \mathbb{E}_{Q_m}\bigg[\frac{\alpha_m\,p(\Theta,\mathcal{D})\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\bigg], \quad (17)$$

where we have used the identity $\nabla_{\eta_m} Q_m(\Theta \mid \eta_m) = Q_m(\Theta \mid \eta_m)\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)$. We have thus expressed the derivatives of the objective as expectations with respect to the approximate distribution. Let $f(\Theta) := \frac{p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}$. It follows that $\nabla_{\alpha_m} J = \mathbb{E}_{Q_m}[f(\Theta)]$ and $\nabla_{\eta_m} J = \alpha_m\,\mathbb{E}_{Q_m}[f(\Theta)\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)]$. We can then stochastically approximate these expectations by Monte Carlo sampling from the specific exponential family distribution $Q_m$ to obtain noisy but unbiased gradient estimators,
$$\nabla_{\alpha_m} J \approx \frac{1}{S}\sum_{s=1}^{S}\frac{p(\Theta^{[s]},\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta^{[s]} \mid \eta_m)} = \frac{1}{S}\sum_{s=1}^{S} f(\Theta^{[s]}), \quad (18)$$
$$\nabla_{\eta_m} J \approx \frac{1}{S}\sum_{s=1}^{S}\frac{\alpha_m\,p(\Theta^{[s]},\mathcal{D})\,\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m)}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta^{[s]} \mid \eta_m)} = \frac{1}{S}\sum_{s=1}^{S}\alpha_m f(\Theta^{[s]})\,\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m), \quad (19)$$
where $\Theta^{[s]} \sim Q_m$ for $s = 1,\ldots,S$. We can therefore replace $\nabla_{\alpha_m} J$ and $\nabla_{\eta_m} J$ with the unbiased approximations (18) and (19).
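A minimal numpy sketch of the estimators (18) and (19) for a 1-D mixture-of-Gaussians approximation. The unnormalized joint and all parameter values are illustrative assumptions; for simplicity, the score is taken with respect to the component mean, matching the Gaussian-case gradients derived later in Section 5.2.

```python
import numpy as np

rng = np.random.default_rng(4)

def joint(theta):
    # illustrative unnormalized p(theta, D): a bimodal target (an assumption, not the paper's model)
    return np.exp(-(theta**2 - 1.0) ** 2)

alpha = np.array([0.5, 0.5])                                 # mixture weights alpha_m
mu = np.array([-0.5, 0.5]); sigma2 = np.array([1.0, 1.0])

def q_pdf(theta, k):
    return np.exp(-0.5 * (theta - mu[k]) ** 2 / sigma2[k]) / np.sqrt(2 * np.pi * sigma2[k])

S, m = 25, 0
theta = rng.normal(mu[m], np.sqrt(sigma2[m]), size=S)        # Theta^[s] ~ Q_m
mix = alpha[0] * q_pdf(theta, 0) + alpha[1] * q_pdf(theta, 1)
f = joint(theta) / mix                                       # f(Theta^[s])

grad_alpha_m = f.mean()                                      # estimator (18)
score_mu = (theta - mu[m]) / sigma2[m]                       # d log Q_m / d mu_m (Gaussian score)
grad_mu_m = alpha[m] * (f * score_mu).mean()                 # estimator (19), mean component
print(grad_alpha_m, grad_mu_m)
```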

We can now update the parameters $\alpha_m$ and $\eta_m$ by stochastic optimization. At each iteration $t$, we have
$$\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\nabla_{\alpha_m}J, \qquad \eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\nabla_{\eta_m}J, \quad (20)$$
where $\rho_t$ and $\gamma_t$ are proper stochastic gradient optimization step sizes. In addition, since $\alpha$ carries the constraint $\sum_{m=1}^{M}\alpha_m = 1$, a constrained optimization problem must be considered. We use the Lagrange multiplier method to solve this problem; its description and derivation are given in Section 4.3. Algorithm 1 describes the GEP method.

Algorithm 1 GEP
Input: data $x$, joint distribution $p$, mixture of exponential family distributions $Q$.
Initialize: $\alpha_m$ and $\eta_m$ randomly, $t = 1$.
1: repeat
2:   // Draw $S$ samples from the $m$th exponential family approximate distribution
3:   for $s = 1$ to $S$ do
4:     $\Theta^{[s]} \sim Q_m$
5:   end for
6:   $\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\nabla_{\alpha_m}J$
7:   $\eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\nabla_{\eta_m}J$, where $\rho_t$ and $\gamma_t$ are the $t$th values of Robbins-Monro sequences
8:   $t = t + 1$
9: until the change of $\alpha_m$ and $\eta_m$ (in terms of Euclidean distance) is less than $\varepsilon$ ($\varepsilon = 0.01$ in this paper)

After these gradients have been calculated, we need to find suitable control variates to replace them in order to obtain smaller variances and achieve faster convergence. With suitable control variates, we can derive variance-reduced gradients $\hat{\nabla}_{\alpha_m}J$ and $\hat{\nabla}_{\eta_m}J$ that replace $\nabla_{\alpha_m}J$ and $\nabla_{\eta_m}J$. The new gradient estimate is an unbiased estimate of the true gradient, with a lower variance when the control variate and the gradient are positively correlated. We then obtain
$$\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\hat{\nabla}_{\alpha_m}J, \qquad \eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\hat{\nabla}_{\eta_m}J. \quad (21)$$
The variance-reduced GEP method is described in Algorithm 2. We use a function that is highly correlated with the original function as the control variate; how to select control variates is discussed in Section 4.4.

Algorithm 2 GEP-CV
Input: data $x$, joint distribution $p$, mixture of exponential family distributions $Q$.
Initialize: $\alpha_m$ and $\eta_m$ randomly, $t = 1$.
1: repeat
2:   // Draw $S$ samples from the $m$th exponential family approximate distribution
3:   for $s = 1$ to $S$ do
4:     $\Theta^{[s]} \sim Q_m$
5:   end for
6:   for $s = 1$ to $S$ do
7:     $f[s] = \frac{p(\Theta^{[s]},\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta^{[s]} \mid \eta_m)}$
8:     $g[s]$ is the control variate of $f[s]$
9:     $g'[s]$ is $g[s]$ multiplied by $\alpha_m\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m)$
10:  end for
11:  $\hat{a} = \frac{\mathrm{Cov}(f,g)}{\mathrm{Var}(g)}$, estimated from a few samples
12:  Compute the variance-reduced gradients:
13:  $\hat{\nabla}_{\alpha_m}J \approx \frac{1}{S}\sum_{s=1}^{S}\big(f[s] - \hat{a}(g[s] - \mathbb{E}[g[s]])\big)$
14:  $\hat{\nabla}_{\eta_m}J \approx \frac{1}{S}\sum_{s=1}^{S}\big(f[s]\,\alpha_m\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m) - \hat{a}(g'[s] - \mathbb{E}[g'[s]])\big)$
15:  $\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\hat{\nabla}_{\alpha_m}J$
16:  $\eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\hat{\nabla}_{\eta_m}J$, where $\rho_t$ and $\gamma_t$ are the $t$th values of Robbins-Monro sequences
17: until the change of $\alpha_m$ and $\eta_m$ (in terms of Euclidean distance) is less than $\varepsilon$ ($\varepsilon = 0.01$ in this paper)

4.2 Variance Reduction with Control Variates

We now recall variance reduction with control variates, a well-known technique in Monte Carlo simulation (see, e.g., Rubinstein and Kroese [28]) designed to reduce the variance of the estimate of the expectation of a random variable. Generally speaking, variance reduction works by modifying a function of a random variable such that its expectation remains the same but its variance decreases. A control variate can be seen as a function $g(\theta)$ that approximates $f(\theta)$ well in the highly probable regions defined by $Q_m(\theta)$, but that also has a closed-form expectation under $Q_m$. Using $g$ and a scalar $a$, we first form the new function
$$\hat{f}(\theta) = f(\theta) - a\big(g(\theta) - \mathbb{E}_{Q_m}[g(\theta)]\big). \quad (22)$$
The next step is to set the value of $a$ to minimize the variance of $\hat{f}$. A simple calculation shows that
$$\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) - 2a\,\mathrm{Cov}(f,g) + a^2\,\mathrm{Var}(g). \quad (23)$$
Taking the derivative with respect to $a$ and setting it to zero gives the optimal value,
$$a^{*} = \frac{\mathrm{Cov}(f,g)}{\mathrm{Var}(g)}. \quad (24)$$

Usually this covariance and variance are unknown for the functions we encounter. We can approximate $a^{*}$ with $\hat{a}$, found by plugging the sample variance and covariance into (24) using samples from the algorithm (in the experiments, we estimate $\hat{a}$ by calculating the sample covariance and variance from minibatches). The potential reduction in variance is seen by plugging (24) into (23) and taking the ratio of the two variances,
$$\mathrm{Var}(\hat{f})/\mathrm{Var}(f) = 1 - \mathrm{Corr}(f,g)^2. \quad (25)$$
For a high Pearson correlation coefficient $\mathrm{Corr}(f,g)$ between $f$ and $g$, the control variate $g$ leads to a more significant variance reduction, and thus faster convergence is expected. We discuss how to construct the control variates in Section 4.4.
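A minimal sketch of (22), (24) and (25) on a minibatch of $S = 25$ samples. The functions are illustrative stand-ins: $g$ is the second-order expansion of $f = e^z$, so its expectation under $z \sim \mathcal{N}(0,1)$ is known to be $1.5$.

```python
import numpy as np

rng = np.random.default_rng(5)
S = 25                                   # minibatch of samples, as in the paper's setup
z = rng.normal(size=S)
f = np.exp(z)                            # stand-in for the noisy gradient terms f[s]
g = 1.0 + z + 0.5 * z**2                 # control variate: 2nd-order expansion of exp; E[g] = 1.5

a_hat = np.cov(f, g, ddof=1)[0, 1] / np.var(g, ddof=1)   # (24), estimated from the minibatch
f_hat = f - a_hat * (g - 1.5)                            # (22), using the known E[g]

r = np.corrcoef(f, g)[0, 1]
print("empirical var ratio:", f_hat.var() / f.var())
print("1 - Corr(f,g)^2    :", 1 - r**2)                  # (25)
```

The two printed values agree because, with $\hat{a}$ computed from the same sample, the plug-in variance ratio reduces exactly to $1 - \mathrm{Corr}(f,g)^2$.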

4.3 Lagrange Multipliers

Lagrange multiplier methods introduce Lagrange multipliers to solve optimization problems with constraints [36]. This is an exact method that optimizes the objective $f(X)$ subject to the Kuhn-Tucker conditions [31]. Constrained optimization problems generally have the form
$$\max f(X), \quad \text{s.t. } g(X) \le 0,\; h(X) = 0, \quad X = (x_1, x_2, \ldots, x_n), \quad (26)$$
where $X$ is a vector of real variables in continuous problems or a vector of discrete numbers in discrete problems, $f(X)$ is the objective function, $g(X) = [g_1(X),\ldots,g_k(X)]^{T}$ is a set of $k$ inequality constraints, and $h(X) = [h_1(X),\ldots,h_m(X)]^{T}$ is a set of $m$ equality constraints. Here $f(X)$, $g(X)$ and $h(X)$, as well as their derivatives, are continuous functions. Since Lagrangian methods cannot directly handle inequality constraints $g_i(X) \le 0$ under normal circumstances, we transform the inequality constraints into equality constraints by adding slack variables $z_i$, which results in $p_i(X) = g_i(X) + z_i^2$. The corresponding Lagrange function is defined as
$$L(X,\lambda,\mu) = f(X) + \lambda^{T}h(X) + \mu^{T}p(X), \quad (27)$$
where $\lambda = [\lambda_1,\ldots,\lambda_m]^{T}$ and $\mu = [\mu_1,\ldots,\mu_k]^{T}$ are two sets of Lagrange multipliers. According to classical optimization theory [31], all the optimal solutions of (27) obey the following set of necessary conditions:
$$\nabla_X L(X,\lambda,\mu) = 0, \quad \nabla_\lambda L(X,\lambda,\mu) = 0, \quad \nabla_\mu L(X,\lambda,\mu) = 0. \quad (28)$$
In this paper, we have no inequality constraints, so we let $\mu = 0$. The equality constraint is $h(\alpha) = \sum_{m=1}^{M}\alpha_m - 1 = 0$. The Lagrange function corresponding to the GEP method is
$$L(\alpha,\lambda) = J(\alpha) + \lambda\,h(\alpha) = \int_{\Theta} p(\Theta,\mathcal{D})\ln\Big[\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big]d\Theta + \lambda\Big(\sum_{m=1}^{M}\alpha_m - 1\Big), \quad (29)$$
where $\alpha = \{\alpha_1,\ldots,\alpha_M\}$ collects the parameters to be optimized. With the conditions
$$\nabla_\lambda L(\alpha,\lambda) = 0, \quad \nabla_{\alpha_1} L(\alpha,\lambda) = 0, \quad \ldots, \quad \nabla_{\alpha_M} L(\alpha,\lambda) = 0, \quad (30)$$
we reach a feasible local extremum $\alpha^{*}$ when all the gradients vanish.
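A toy instance of the conditions (29)-(30) (our illustration, not the GEP objective itself): maximizing $\sum_m c_m\ln\alpha_m$ subject to $\sum_m\alpha_m = 1$. Stationarity of the Lagrangian gives $c_m/\alpha_m + \lambda = 0$ for all $m$, and combined with the constraint, $\alpha_m = c_m/\sum_k c_k$. The sketch checks numerically that feasible perturbations do not improve the objective.

```python
import numpy as np

# toy problem: maximize f(alpha) = sum_m c_m * ln(alpha_m) s.t. sum_m alpha_m = 1.
# Stationarity of L(alpha, lambda) = f(alpha) + lambda * (sum(alpha) - 1) gives
# c_m / alpha_m + lambda = 0 for all m, hence alpha_m = c_m / sum(c).
c = np.array([2.0, 1.0, 3.0])
alpha_star = c / c.sum()

f = lambda a: np.sum(c * np.log(a))
# numeric check: perturbing along a feasible direction (entries summing to zero) never helps
d = np.array([1.0, -0.5, -0.5]) * 1e-3
print(f(alpha_star), ">=", f(alpha_star + d), "and", f(alpha_star - d))
```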

4.4 How to Choose Control Variates?

As discussed in Section 4.2, the stronger the correlation between $f$ and $g$, the greater the variance reduction. The variance of the estimator is directly linked to the convergence speed, and improving the correlation leads to faster convergence [7]. Thus, finding control variates that have a greater correlation with the gradient allows us to maintain fast convergence and potentially use larger step sizes. In this paper, we want to reduce the variance of the gradient of the objective with respect to the parameters $\Phi$, $\nabla_\Phi J$, so we need to calculate a variance-reduced gradient $\hat{\nabla}_\Phi J$ to replace $\nabla_\Phi J$. The principle behind our selection of control variates is that we form $g$ from some statistics of the hidden variables, e.g., low-order moments. The low-order moments roughly characterize the hidden variable distribution and do not depend on the parameters $\Phi$. The expectation of the control variates can thus be pre-computed while running the stochastic gradient algorithm. We use this rule to construct control variates throughout this paper.

The original GEP method in Algorithm 1 uses Monte Carlo samples to stochastically approximate the gradient. One problem with this method is that we may have to set a relatively small step size in order to reach the optimal solution. The need for a small learning rate is due to the potentially large variance of the stochastic gradient, which approximates the full gradient using a small batch or a single example; this leads to slower convergence. To achieve quicker convergence, we look for a more accurate approximation of the full gradient. Through Eqs. (16) and (17) in Section 4.1, we know the gradients of the objective with respect to $\Phi$, expressed as expectations of $f$. As in Algorithm 2, we construct the control variates as follows: to ensure that $f$ and $g$ are highly correlated and that the expectation of $g$ is easy to calculate, we choose the second-order Taylor expansion of $f$, or of a part of $f$, as the control variate $g$, which captures the second-order information about the individual function.¹ After estimating $\hat{a}$ on a small number of samples, we can calculate the variance-reduced gradient $\hat{\nabla}_\Phi J$. The control variate Monte Carlo estimate of the gradient using $S$ samples is thus
$$\hat{\nabla}_\Phi J \approx \frac{1}{S}\sum_{s=1}^{S}\big(f[s] - \hat{a}(g[s] - \mathbb{E}[g[s]])\big). \quad (31)$$
This construction retains much of the function's information and guarantees a high correlation between the control variate and the function. In the empirical studies, we show that this control variate reduces the variance of the gradient well. The resulting method is an explicit variance reduction technique for GEP, which we call GEP-CV (GEP with control variates). This improvement over GEP provides useful insights into the performance of the underlying optimization problem.

¹ Alternatives to Taylor expansion for constructing control variates exist as well [2].

5 Bayesian Logistic Regression for Classification with GEP

In this section, we illustrate GEP on Bayesian logistic regression for classification. We choose this Bayesian probabilistic model for our experimental validations. Logistic regression is an important linear classifier in machine learning and has been widely used in computer vision [1], bioinformatics [35], gene classification [13], neural signal processing [21,6,22], matrix data classification [32] and semi-supervised learning [33].

5.1 Bayesian Logistic Regression for Classification

Binary logistic regression takes $P$-dimensional data vectors $x \in \mathbb{R}^{N \times P}$, where $N$ is the number of data points and $P$ is the data dimension, and predicts the binary outputs $y \in \mathbb{R}^{N}$ with $y_n \in \{0,1\}$. The parameter is $\theta \in \mathbb{R}^{P}$. We model the prediction likelihood as $p(y_i \mid x_i,\theta) \sim \mathrm{Bern}(\sigma(\theta^{T}x_i))$, where $\sigma(\cdot)$ is the sigmoid function, $\sigma(b) = (1 + e^{-b})^{-1}$. Bayesian logistic regression places a prior distribution on the coefficient vector, which is drawn from a $P$-dimensional multivariate normal distribution with independent components, $\theta \sim \mathcal{N}(0, I_P)$. The joint distribution is $p(\mathcal{D},\theta) = p(\mathcal{D} \mid \theta)p(\theta) = \prod_n \sigma(\theta^{T}x_n)^{y_n}\{1 - \sigma(\theta^{T}x_n)\}^{1-y_n}\,\mathcal{N}(\theta \mid 0, I_P)$.

We would like to evaluate $p(\theta \mid \mathcal{D})$, but this is not available in closed form. Instead, for inference we define a mixture of exponential family distributions as the approximate distribution $q$,
$$q(\theta) = \sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m). \quad (32)$$
Since the approximate distribution $q_m$ needs to be selected from the exponential family, it is natural to choose the Gaussian; the natural parameters $\eta$ then correspond to the mean $\mu$ and variance $\sigma^2$ of the Gaussian. For convenience of calculation, we posit the specific approximate distribution $q_m(\theta \mid \eta_m) = \prod_{j=1}^{P}\mathcal{N}(\theta_j \mid \mu_{mj}, \sigma^2_{m_j})$ over $\theta$. That is, we model each $\theta_j$ as an independent Gaussian with mean $\mu_{mj}$ and variance $\sigma^2_{m_j}$, and we use GEP to learn the optimal values of $\eta_m = \{\mu_{mj}, \sigma^2_{m_j}\}_{j=1}^{P}$. We use the shorthand $\mu_m = (\mu_{m1},\ldots,\mu_{mP})$ and $\sigma^2_m = (\sigma^2_{m_1},\ldots,\sigma^2_{m_P})$.

For the data joint likelihood, we can decompose $p(y,\theta \mid x) = p(y \mid x,\theta)p(\theta)$ using the chain rule of probability (noting that $x$ is a constant). It is then straightforward to calculate
$$p(y \mid x,\theta) = \prod_{i=1}^{N}\sigma(\theta^{T}x_i)^{y_i}(1 - \sigma(\theta^{T}x_i))^{1-y_i}, \qquad p(\theta) = \prod_{j=1}^{P}\mathcal{N}(\theta_j \mid 0, 1). \quad (33)$$
The objective for this model is to maximize
$$J = \int_{\theta} p(y \mid x,\theta)\,p(\theta)\ln\Big[\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m)\Big]d\theta. \quad (34)$$
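A minimal sketch of the log of the joint $p(\mathcal{D},\theta)$ above (the synthetic data are an illustrative assumption); this is the quantity whose samples drive the gradient estimators (18)-(19).

```python
import numpy as np

def log_joint(theta, X, y):
    """log p(D, theta) = sum_n [y_n log sig + (1 - y_n) log(1 - sig)] + log N(theta; 0, I)."""
    z = X @ theta
    # numerically stable Bernoulli log-likelihood with logits z:
    # y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)) = y*z - log(1 + e^z)
    loglik = np.sum(y * z - np.logaddexp(0.0, z))
    logprior = -0.5 * theta @ theta - 0.5 * len(theta) * np.log(2 * np.pi)
    return loglik + logprior

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)); theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=100) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)
print(log_joint(theta_true, X, y))
```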

5.2 Parameter Optimization with GEP and GEP-CV

In this section, we show the parameter optimization of GEP and GEP-CV on the Bayesian logistic regression model. Recall from Section 3.1 that we have two different types of parameters: the mixture weights $\alpha$ and the natural parameters $\eta$ of the exponential family distributions. Gradients of the objective with respect to these two parameters, and their Monte Carlo approximations, were discussed in Section 4.1, so we can use stochastic optimization to optimize the parameters; the treatment of the constraint on $\alpha$ was given in Section 4.3. Below we focus on the optimization with respect to the natural parameters $\eta$ for the Bayesian logistic regression model.

In order to calculate $\nabla_{\eta_m}J$, we need to calculate $\nabla_{\eta_m}\log q_m(\theta \mid \eta_m)$. As our approximate distribution $q$ uses a mixture of Gaussian distributions, the natural parameters of a single component are the mean and variance of the corresponding Gaussian. Since $\sigma^2_{m_j}$ is constrained to be positive, we instead optimize over $\gamma_{mj} = \log(\sigma^2_{m_j})$. It is straightforward to see that
$$\nabla_{\mu_{mj}}\log q_m(\theta \mid \eta_m) = \nabla_{\mu_{mj}}\sum_{i=1}^{P}\Big(-\frac{\log(\sigma^2_{m_i})}{2} - \frac{(\theta_i - \mu_{mi})^2}{2\sigma^2_{m_i}}\Big) = \frac{\theta_j - \mu_{mj}}{\sigma^2_{m_j}}, \quad (35)$$
$$\nabla_{\gamma_{mj}}\log q_m(\theta \mid \eta_m) = \nabla_{\sigma^2_{m_j}}\Big(\sum_{i=1}^{P}\Big(-\frac{\log(\sigma^2_{m_i})}{2} - \frac{(\theta_i - \mu_{mi})^2}{2\sigma^2_{m_i}}\Big)\Big)\,\nabla_{\gamma_{mj}}(\sigma^2_{m_j}) = \Big(-\frac{1}{2\sigma^2_{m_j}} + \frac{(\theta_j - \mu_{mj})^2}{2(\sigma^2_{m_j})^2}\Big)\sigma^2_{m_j}. \quad (36)$$
Note that we use the chain rule in the derivation of $\nabla_{\gamma_{mj}}\log q_m(\theta \mid \eta_m)$. We can then update the parameters $\alpha_m$ and $\eta_m$ by stochastic optimization.
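A minimal sketch of (35)-(36) for a factorized Gaussian, with a finite-difference check of the mean gradient; the dimensions and parameter values are illustrative.

```python
import numpy as np

def log_q(theta, mu, gamma):
    """log of a factorized Gaussian with variances sigma2 = exp(gamma)."""
    s2 = np.exp(gamma)
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - 0.5 * (theta - mu) ** 2 / s2)

rng = np.random.default_rng(7)
theta = rng.normal(size=4); mu = rng.normal(size=4); gamma = rng.normal(size=4)
s2 = np.exp(gamma)

grad_mu = (theta - mu) / s2                          # (35)
grad_gamma = -0.5 + (theta - mu) ** 2 / (2 * s2)     # (36), after the chain rule

eps = 1e-6
fd = (log_q(theta, mu + eps * np.eye(4)[0], gamma) -
      log_q(theta, mu - eps * np.eye(4)[0], gamma)) / (2 * eps)
print(grad_mu[0], "~", fd)                           # finite-difference check
```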

Now, for Bayesian logistic regression, we show how to form our control variates. We choose the second-order Taylor expansion of the sigmoid function at $\xi$, denoted $\hat{\sigma}(z,\xi)$:
$$\hat{\sigma}(z,\xi) = \sigma(\xi) + \sigma(\xi)\sigma(-\xi)(z - \xi) + \frac{\sigma(\xi)(1 - 2\sigma(\xi))\sigma(-\xi)}{2}(z - \xi)^2. \quad (37)$$
As discussed in Section 4.1, we let $f_n = \frac{\sigma(x_n^{T}\theta)\,p(\theta)}{\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m)}$ for each observation $x_n$. Let $z_n = \theta^{T}x_n$ and $\xi_n = \hat{\mu}^{T}x_n$, where $\hat{\mu}$ is the current mean of $q_m$. For convenience, we use the second-order Taylor expansion of a part of $f_n$ to define our control variate $g_n$ as
$$g_n(z_n,\xi_n) = \frac{p(\theta)\,\hat{\sigma}(z_n,\xi_n)}{\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m)}. \quad (38)$$
The variance-reduced gradients $\hat{\nabla}_{\alpha_m}J$ and $\hat{\nabla}_{\eta_m}J$ are given by
$$\hat{\nabla}_{\alpha_m}J = \frac{1}{S}\sum_{s=1}^{S}\big(f_s - \hat{a}(g_s - \mathbb{E}[g_s])\big), \qquad \hat{\nabla}_{\eta_m}J = \frac{1}{S}\sum_{s=1}^{S}\big(f_s\,\alpha_m\nabla_{\eta_m}\ln q_m(\theta^{[s]} \mid \eta_m) - \hat{a}(g'_s - \mathbb{E}[g'_s])\big), \quad (39)$$
where $g'_s$ is $g_s$ multiplied by $\alpha_m\nabla_{\eta_m}\log q_m(\theta \mid \eta_m)$. Moreover, the expectation of $g_n$ can be computed in closed form as
$$\mathbb{E}[g_n] = p(\theta)\,\sigma(\xi_n)\bigg(1 + \theta^{T}\bar{x}_n\,\sigma(-\xi_n)\big(1 - \xi_n(1 - 2\sigma(\xi_n))\big) - \xi_n\sigma(-\xi_n) + \frac{(1 - 2\sigma(\xi_n))\sigma(-\xi_n)}{2}\big(\theta^{T}\theta\,(\mathrm{Var}(x_n) + \bar{x}_n\bar{x}_n^{T}) + \xi_n^{2}\big)\bigg)\Big/\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m), \quad (40)$$
which facilitates the use of control variates for variance reduction.
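A minimal sketch of the expansion (37), with the curvature term $\sigma''(\xi) = \sigma(\xi)(1-2\sigma(\xi))\sigma(-\xi)$ as reconstructed above. The approximation is accurate near the expansion point $\xi$ and degrades away from it, which is why $\xi_n$ is set to the current mean prediction $\hat{\mu}^{T}x_n$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_taylor2(z, xi):
    """Second-order Taylor expansion of the sigmoid around xi, as in (37)."""
    s, sm = sigmoid(xi), sigmoid(-xi)
    return s + s * sm * (z - xi) + 0.5 * s * (1.0 - 2.0 * s) * sm * (z - xi) ** 2

xi = 0.3
for z in (0.2, 0.3, 0.5, 1.5):
    print(z, sigmoid(z), sigmoid_taylor2(z, xi))   # close near xi, diverges far away
```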

6 Experiments

In this section, we perform experiments using the GEP methods for binary classification with Bayesian logistic regression.

6.1 Data and Set-up

For Bayesian logistic regression, we first compare our GEP methods with three methods, maximum likelihood estimation (MLE), variational inference (VI) and expectation propagation (EP), on synthesized data. The synthesized data are three-modal, generated from three 2-D Gaussian distributions with different means. This data set includes 7000 labeled examples living in 3-D space, where one dimension of all ones serves as an augmented representation. We randomly selected 5000 data points for training and used the remaining 2000 data points for testing.

We set the learning rates $\rho_t$ and $\gamma_t$ by decreasing the step sizes so as to meet the Robbins-Monro conditions, which guarantees convergence to local optimal solutions of $J$. For example, $\rho_t = \gamma_t = (\tau + t)^{-\kappa}$ with $\kappa \in (0.5, 1]$ and $\tau \ge 0$ satisfies this requirement, where $t$ is the iteration number, and $\tau$ and $\kappa$ are parameters for adjusting the learning rates. In our experiments, we set $\tau = 10$ and $\kappa = 0.6$ for GEP, and $\tau = 10$ and $\kappa = 0.99$ for VI. The difference in learning rates is because one of the GEP methods uses the variance reduction technique, which yields a smaller gradient variance; with variance reduction, a large learning rate allows faster convergence without sacrificing performance.

Next, we use seven real-world data sets from the UCI repository, Ionosphere, Madelon, Pima, Colon Cancer, WPBC, WDBC, and SPECTF, to compare our methods with BB-α [9], SVGD [23], and BBBVI [14]. These data sets range from 198 to 4400 labeled examples living in 9 to 501 dimensions. We used a minibatch size of 25, i.e., $S = 25$, and at most 2000 iterations in all our experiments.

6.2 Comparison Experiments on Synthetic Data

We first set GEP's number of mixture components in the approximate distribution to 1, i.e., the approximate distribution $q$ is a single Gaussian distribution rather than a mixture of Gaussians. This GEP setting can be considered comparable to classical EP. The prior and likelihood of Bayesian logistic regression were given in Section 5.1, except that the approximate distribution $q$ is now a single Gaussian. We can then calculate the likelihood function for MLE and the evidence lower bound for VI. The objective functions of MLE and VI that need to be maximized are

$$J_{\mathrm{MLE}}(\theta) = \log p(y \mid x,\theta) = \sum_i \big[y_i\log p(y_i = 1 \mid x_i) + (1 - y_i)\log(1 - p(y_i = 1 \mid x_i))\big] \quad (41)$$
and
$$J_{\mathrm{VI}}(\theta) = \mathbb{E}_q[\log p(y,\theta \mid x) - \log q(\theta)]. \quad (42)$$
As for EP, the approximate distribution $q(\theta)$ is written as a product of factors corresponding to the prior and likelihoods, $\frac{1}{Z}\prod_i \tilde{f}_i(\theta)$, and each factor is refined through moment matching until the algorithm converges. Note that GEP-CV denotes GEP with control variates. Following Xu et al. [39], we measure performance by computing the root mean square error (RMSE) with posterior mean values.

In Figure 1, the five lines from top to bottom are MLE, VI, GEP, EP and GEP-CV. The results in Figure 1 indicate that GEP-CV (the lowest line) is the best performing method and outperforms EP by a large margin, while MLE is the worst. Compared to GEP, GEP-CV has a lower RMSE and converges faster. This is because GEP-CV employs the variance reduction technique, which reduces the variance of the gradients; with variance reduction, a large step size allows the algorithm to converge faster without sacrificing performance. Fluctuations stem from the large variance of the gradients, and because of these fluctuations the number of iterations increases, i.e., convergence becomes slower. Compared to GEP, the performance of VI is slightly worse. Neither of these two methods involves variance reduction, which is why they show large fluctuations across iterations.

[Fig. 1 about here.]

In Figure 2, the three lines in the left part represent the three dimensions of the approximate mean at each iteration. The magnitude of the change of the approximate mean (in terms of Euclidean distance) is shown in the right part. We see that the change in the approximate mean of GEP-CV is clearly smoother than that of GEP, which means GEP-CV converges faster. The control variates provide a major reduction in variance, which is why GEP-CV converges faster and each update is more stable.

[Fig. 2 about here.]

6.3 Real Data Classification

In this experiment, we test the GEP-CV method on seven real-world binary classification datasets from the UCI machine learning repository. For comparison with GEP-CV, we use three state-of-the-art methods: BB-α [9], SVGD [23], and BBBVI [14]. We ran the evaluations with damped learning rates and stopped learning after convergence. In the experiment, we set $\alpha$ in BB-α to 1 and to $10^{-6}$, for which the algorithm is equivalent to EP and to VI, respectively (as proved in Hernández-Lobato et al. [9]). For the other settings of this experiment, we follow the settings of the above methods. We test performance in terms of test log-likelihood and test accuracy. We set up a mixture of 3 Gaussians ($M = 3$) as the approximate distribution in our GEP-CV method. We evaluate the performance of each method on 20 random training and test splits of the data, with 90% and 10% of the data, respectively. The results (averaged test accuracy and averaged test log-likelihood) are summarized in Tables 2 and 3. They show that the GEP-CV method performs better than BB-α, SVGD, and BBBVI.

[Table 2 about here.]

[Table 3 about here.]

6.4 GEP with Different Numbers of Components

In Table 4, we show the RMSE, test accuracy and test log-likelihood for GEP and GEP-CV on synthetic data with the number of components of the approximate distribution set to $M = 1, 2, 3, 5, 7$. We see that as $M$ increases, the overall performance of GEP-CV and GEP first improves and then degrades. This is probably because, for the 3-modal synthesized data, when the approximate distribution has 3 components ($M = 3$) it can best approximate the posterior distribution. When the number of components grows larger, the approximate distribution becomes more complicated and there are more parameters to update, which leads to inferior performance.

[Table 4 about here.]

7 Conclusion

In this paper, we proposed generalized expectation propagation (GEP), a deterministic approximate inference algorithm with guaranteed convergence and low memory consumption. This is fulfilled by considering the EP KL divergence as the objective function, taking Monte Carlo approximations of the gradients of this function and using variance reduction techniques.

Scalability to large datasets can be further achieved in the future by using stochastic gradient descent with minibatches of observations. In the experiments, we evaluated the GEP methods on multi-modal synthetic data and real-world data. We conclude that GEP offers impressive performance: it outperforms VI, EP, SVGD, and BBBVI, and it can be considered an alternative algorithm to EP that is guaranteed to be convergent. In addition, we also tested the performance of GEP with different numbers of components. For future work, it will be interesting to combine the method with other variance reduction techniques, such as the reparameterization trick, and to explore more advanced models with this method.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Project .

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer.
[2] David M. Blei, Michael I. Jordan, and John Paisley. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning.
[3] John P. Cunningham, Philipp Hennig, and Simon Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv preprint, pages 1-56.
[4] Guillaume Dehaene and Simon Barthelmé. Expectation propagation in the large data limit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1).
[5] Brendan J. Frey and David J. C. MacKay. A revolution: Belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems.

[6] Adam D. Gerson, Lucas C. Parra, and Paul Sajda. Cortical origins of response time variability during rapid discrimination of visual objects. NeuroImage, 28(2).
[7] Robert Mansel Gower, Nicolas Le Roux, and Francis R. Bach. Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. In International Conference on Artificial Intelligence and Statistics.
[8] Leonard Hasenclever, Stefan Webb, Thibaut Lienart, Sebastian Vollmer, Balaji Lakshminarayanan, Charles Blundell, and Yee Whye Teh. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 18(106):1-37.
[9] José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Daniel Hernández-Lobato, Thang D. Bui, and Richard E. Turner. Black-box alpha-divergence minimization. In International Conference on Machine Learning.
[10] Tom Heskes and Onno Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[11] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1).
[12] Yingzhen Li and Richard E. Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems.
[13] Jason Liao and Khew-Voon Chin. Logistic regression for disease classification using microarray data. Bioinformatics, 23(15).
[14] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems.
[15] Peter S. Maybeck and George M. Siouris. Stochastic models, estimation, and control. IEEE Transactions on Systems, Man, and Cybernetics, 10(5).
[16] Thomas Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[17] Thomas Minka. Power EP. Technical report, Microsoft Research.
[18] Thomas Minka. Divergence measures and message passing. Technical report, Microsoft Research, pages 1-17.
[19] Hannes Nickisch and Matthias W. Seeger. Convex variational Bayesian inference for large scale generalized linear models. In International Conference on Machine Learning.


Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Deterministic Approximation Methods in Bayesian Inference

Deterministic Approximation Methods in Bayesian Inference Deterministic Approximation Methods in Bayesian Inference Tobias Plötz Department of Computer Science Technical University of Darmstadt 64289 Darmstadt t_ploetz@rbg.informatik.tu-darmstadt.de Abstract

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Local Expectation Gradients for Doubly Stochastic. Variational Inference Local Expectation Gradients for Doubly Stochastic Variational Inference arxiv:1503.01494v1 [stat.ml] 4 Mar 2015 Michalis K. Titsias Athens University of Economics and Business, 76, Patission Str. GR10434,

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Black-Box α-divergence Minimization

Black-Box α-divergence Minimization Black-Box α-divergence Minimization José Miguel Hernández-Lobato 1 JMH@SEAS.HARVARD.EDU Yingzhen Li 2 YL494@CAM.AC.UK Mark Rowland 2 MR504@CAM.AC.UK Daniel Hernández-Lobato 3 DANIEL.HERNANDEZ@UAM.ES Thang

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17 MCMC for big data Geir Storvik BigInsight lunch - May 2 2018 Geir Storvik MCMC for big data BigInsight lunch - May 2 2018 1 / 17 Outline Why ordinary MCMC is not scalable Different approaches for making

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

More information

Expectation propagation as a way of life

Expectation propagation as a way of life Expectation propagation as a way of life Yingzhen Li Department of Engineering Feb. 2014 Yingzhen Li (Department of Engineering) Expectation propagation as a way of life Feb. 2014 1 / 9 Reference This

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Auto-Encoding Variational Bayes

Auto-Encoding Variational Bayes Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower

More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

Expectation Propagation in Factor Graphs: A Tutorial

Expectation Propagation in Factor Graphs: A Tutorial DRAFT: Version 0.1, 28 October 2005. Do not distribute. Expectation Propagation in Factor Graphs: A Tutorial Charles Sutton October 28, 2005 Abstract Expectation propagation is an important variational

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Stochastic Variational Inference

Stochastic Variational Inference Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

Advances in Variational Inference

Advances in Variational Inference 1 Advances in Variational Inference Cheng Zhang Judith Bütepage Hedvig Kjellström Stephan Mandt arxiv:1711.05597v1 [cs.lg] 15 Nov 2017 Abstract Many modern unsupervised or semi-supervised machine learning

More information

arxiv: v3 [stat.ml] 1 Jun 2016

arxiv: v3 [stat.ml] 1 Jun 2016 Black-Box α-divergence Minimization arxiv:5.03243v3 [stat.ml] Jun 206 José Miguel Hernández-Lobato JMH@SEAS.HARVARD.EDU Yingzhen Li 2 YL494@CAM.AC.UK Mark Rowland 2 MR504@CAM.AC.UK Daniel Hernández-Lobato

More information

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart 1 Motivation and Problem In Lecture 1 we briefly saw how histograms

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

Bayesian Deep Learning

Bayesian Deep Learning Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Predictive Variance Reduction Search

Predictive Variance Reduction Search Predictive Variance Reduction Search Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, Svetha Venkatesh Centre of Pattern Recognition and Data Analytics (PRaDA), Deakin University Email: v.nguyen@deakin.edu.au

More information

Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

More information

Variational Autoencoder

Variational Autoencoder Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational

More information

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, Masashi Sugiyama Conference on Uncertainty

More information

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Approximate Bayesian inference

Approximate Bayesian inference Approximate Bayesian inference Variational and Monte Carlo methods Christian A. Naesseth 1 Exchange rate data 0 20 40 60 80 100 120 Month Image data 2 1 Bayesian inference 2 Variational inference 3 Stochastic

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Probabilistic machine learning group, Aalto University Bayesian theory and methods, approximative integration, model

Probabilistic machine learning group, Aalto University  Bayesian theory and methods, approximative integration, model Aki Vehtari, Aalto University, Finland Probabilistic machine learning group, Aalto University http://research.cs.aalto.fi/pml/ Bayesian theory and methods, approximative integration, model assessment and

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a preprint version which may differ from the publisher's version. For additional information about this

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information