Generalizing expectation propagation with mixtures of exponential family distributions and an application to Bayesian logistic regression


Shiliang Sun, Shaojie He

Department of Computer Science and Technology, East China Normal University, 3663 North Zhongshan Road, Shanghai, P. R. China

Abstract

Expectation propagation (EP) is a widely used deterministic approximate inference algorithm in Bayesian machine learning. Traditional EP approximates an intractable posterior distribution through a set of local approximations that are updated iteratively. In this paper, we propose a generalized version of EP called generalized EP (GEP), a new approximate inference method based on the direct minimization of a KL divergence. However, when the variance of the gradient is large, the algorithm may need a long time to converge. We therefore use control variates and develop a variance-reduced version of this method called GEP-CV. We evaluate our approach on Bayesian logistic regression, where it provides faster convergence and better performance than other state-of-the-art approaches.

Key words: Expectation Propagation; Stochastic Optimization; Variance Reduction; Machine Learning

1 Introduction

Expectation propagation (EP) is a well-known method for deterministic approximate inference based on the minimization of the Kullback-Leibler (KL) divergence. Unlike variational inference (VI), EP minimizes the reverse form $\mathrm{KL}(p\,\|\,q)$ instead of $\mathrm{KL}(q\,\|\,p)$ [34]. Compared to VI, EP, which refines each approximate factor in turn in the context of all the remaining factors, can ensure that the approximation is most accurate in the regions of high posterior probability as defined by the remaining factors [1].

Corresponding author. E-mail addresses: shiliangsun@gmail.com, slsun@cs.ecnu.edu.cn (S. Sun).

Preprint submitted to Elsevier Science, 25 January 2019

For an approximation $q(\theta)$ in the exponential family, if the iterations do converge, the resulting solution will be a stationary point of a particular energy function [16], although each iteration of EP does not necessarily decrease the value of this energy function.

Although it has some advantages over VI, EP still has several shortcomings that we hope to solve or avoid. For example, EP, which approximates the true posterior with a single exponential family distribution, can be inferior, since the approximate distribution is unimodal. The true posteriors in practical applications [1] are often multi-modal, in which case minimizing $\mathrm{KL}(p\,\|\,q)$ leads to a worse approximation. In particular, if EP is applied to mixture probability distributions, the results are almost meaningless because the approximation tries to capture all of the modes of the probability distribution at once. Therefore, multi-modality should be explicitly considered in the approximate distribution. Moreover, the moment matching operation in EP requires the evaluation of expectations, so standard EP is limited to the class of models for which this evaluation is possible. Finally, since EP does not directly optimize its cost function, there is no guarantee that EP iterations will converge. It is thus desirable to introduce convergent substitutes.

To address these defects, we generalize the EP inference method in a way that resolves all of the above issues. It is reasonable to optimize the objective function directly, which does not involve moment matching evaluations and is guaranteed to converge. In this paper, we choose mixtures of exponential family distributions as the approximate distribution. They are made up of simple components, but the mixtures can be rich enough to approximate any distribution. Furthermore, sampling from exponential family distributions is comparatively easy, which benefits the adopted stochastic optimization approach. However, if the gradient of the cost function has a large variance, the stochastic optimization method may need a long time to converge, which often leads to inferior performance. Our approach therefore adopts a variance reduction technique and employs control variates to reduce the variance.

For variance reduction, the gradient can often be expressed as a Monte Carlo integral. Monte Carlo integration typically has an error variance of the form $\sigma^2/n$, with $n$ being the number of samples [27]. Although we get a better solution by sampling with a larger value of $n$, the computing time grows with $n$ [27]. Sometimes we can instead find a way to reduce $\sigma$: we construct a new Monte Carlo problem with the same solution as the original one but with a lower $\sigma$. Methods that do this are known as variance reduction techniques.
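To make the idea concrete, the following minimal numpy sketch (an illustration we add here, not taken from the paper) estimates $\mathbb{E}[e^X]$ for $X \sim \mathrm{Uniform}(0,1)$ twice: once with plain Monte Carlo and once after subtracting the control variate $g(X) = X$, whose expectation $1/2$ is known in closed form. Both estimators target the same integral, but the second has a much smaller $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)

f = np.exp(x)                 # integrand; E[f] = e - 1
g = x                         # control variate with known mean E[g] = 1/2

a = np.cov(f, g, ddof=0)[0, 1] / np.var(g)   # coefficient a = Cov(f, g) / Var(g)
f_cv = f - a * (g - 0.5)                     # same expectation, smaller variance

print("plain MC       :", f.mean(), "+/-", f.std() / np.sqrt(n))
print("control variate:", f_cv.mean(), "+/-", f_cv.std() / np.sqrt(n))
```

Both lines print estimates of $e - 1 \approx 1.718$; the control-variate estimator's standard error is roughly an order of magnitude smaller for the same $n$.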

Variance reduction techniques can be placed into several groups, such as antithetic variables, control variates, conditioning (also called Rao-Blackwellization), stratified sampling, importance sampling and reparameterization. Each technique has its own characteristics and scope of application. Among these methods, control variates and Rao-Blackwellization are the most commonly used and also yield the largest reductions in variance. Recently, Ranganath et al. [25] proposed a black box variational inference method combining both of these techniques, based on a stochastic optimization of the variational objective, to reduce the gradient variance. It can reach better predictive likelihoods than sampling methods.

In this paper, we focus on one of the most promising methods, control variates, which is widely used in many scenarios. If we want to solve an optimization problem, we can often write it in the form of stochastic gradient optimization. But stochastic optimization is hindered by the large variance of the noisy gradients, which may lead to slower convergence and worse performance. We can then use control variates to improve performance by reducing the variance of the noisy gradients. Control variates have the potential for widespread use in general stochastic optimization: they are applicable whenever the algorithm for the specific problem has the form of stochastic optimization. The choice of control variates should satisfy two principles: (i) they should have a large correlation with the gradient, and (ii) the computation of their expectation with respect to random data samples should not be expensive. Wang et al. [37] showed that control variates can be formed from low-order approximations to the noisy gradient. We discuss how the variance reduction technique is applied to our approach in Section 4.

In experiments, we use both synthesized and real-world datasets. We observe that our method shows faster convergence and better performance than other state-of-the-art approaches. We also observe that the variant of our method that uses variance reduction typically converges more quickly in terms of CPU time than the version without control variates.

The key contributions of this paper are summarized as follows.

- EP is a crucial deterministic approximate inference method, but the solution obtained by using a single exponential family distribution as the approximate distribution can be inferior. To solve this problem, we use mixtures of exponential family distributions to replace the original unimodal approximate distribution.

- In traditional EP, the number of approximate factors grows with the number of data points $N$. Unlike the traditional method, our method does not involve moment matching but directly optimizes the KL divergence, which avoids expensive memory consumption. Moreover, the proposed approximation method fits models in which the evaluation of expectations is intractable.

- Evaluating our method on Bayesian logistic regression, we obtain an improvement in classification accuracy compared to other standard approximate methods.

- We run our algorithm with different numbers of mixture components in the approximate distribution and discuss how the performance changes with this number.

The remainder of the paper is organized as follows. In Section 2, we describe the background of our work, including an introduction to EP and the main issues and progress of EP in recent years. Section 3 describes our algorithm GEP and presents its objective function; we then briefly introduce stochastic optimization and recall the global convergence of the Robbins-Monro stochastic approximation method. In Section 4, we present GEP's variance-reduced stochastic optimization and discuss how to construct the control variates. In Section 5, we describe the Bayesian logistic regression model used in the experiments and provide technical details for this model. In Section 6, we present the experimental setup and results, and the final section concludes the paper.

2 Expectation Propagation

In this section, we review the exponential family of distributions and the deterministic approximate inference algorithm EP [16], and discuss some of its issues.

2.1 Exponential Family Distributions

Exponential family distributions are very important in machine learning. One of the main reasons is that they can be handled efficiently and analytically through natural parameters or expected sufficient statistics [1]. EP [16] provides a general-purpose framework for approximating posterior beliefs by exponential family distributions. Another reason for considering the exponential family is that the likelihood function for i.i.d. data from an exponential family is a function of the sample average of the sufficient statistics $u(x)$, which has fixed dimensionality independent of the sample size, so new information can be incorporated without increasing the size of the parametric representation [29]. Moreover, even if a model does not give rise to posteriors in the exponential family, members of the exponential family can be used as approximate distributions. Next, we present the exponential family formally.

In probability and statistics, the exponential family is a set of probability distributions of a certain form, chosen for mathematical convenience on account of several useful algebraic properties. A random variable $x$ has an exponential family distribution if its probability mass function or density function admits the form
$$f(x) = h(x)\,g(\eta)\exp\{\eta^{T}u(x)\}, \quad \eta \in H. \quad (1)$$
The vectors $u(x)$ and $\eta$ are called, respectively, the sufficient statistic and the natural parameter. The set $H$ is the space of allowable natural parameter values. The function $h(x)$ is the base measure, and the function $g(\eta)$ can be interpreted as the coefficient that ensures the distribution is normalized and therefore satisfies
$$g(\eta)\int h(x)\exp\{\eta^{T}u(x)\}\,dx = 1. \quad (2)$$
A key exponential family result, obtained by taking gradients of both sides of (2) with respect to $\eta$, is that
$$-\nabla \ln g(\eta) = \mathbb{E}[u(x)], \quad (3)$$
where $\nabla \ln g(\eta)$ is the column vector of partial derivatives of $\ln g(\eta)$ with respect to each of the components of $\eta$.

The exponential family has many important properties. Since it has sufficient statistics that can summarize any amount of i.i.d. data with a fixed number of values, the posterior predictive distribution of an exponential family random variable with a conjugate prior can always be written in closed form (provided that the normalizing factor of the exponential family distribution can itself be written in closed form). Also, in the variational mean field approximation, the best approximate posterior distribution of an exponential family node (a node is a random variable in the context of graphical models) with a conjugate prior is in the same family as the node.
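As a concrete check of (1)-(3), the following sketch (our illustration; the numbers are arbitrary) writes the univariate Gaussian $\mathcal{N}(\mu,\sigma^2)$ in the form (1) with $u(x) = (x, x^2)$, $\eta = (\mu/\sigma^2, -1/(2\sigma^2))$ and $h(x) = 1/\sqrt{2\pi}$, and verifies numerically that $-\nabla\ln g(\eta) = \mathbb{E}[u(x)] = (\mu, \mu^2 + \sigma^2)$.

```python
import numpy as np

mu, sigma2 = 1.5, 0.7
eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])   # natural parameters

def log_g(eta):
    # ln g(eta) for the Gaussian written as h(x) g(eta) exp{eta^T u(x)},
    # with h(x) = 1/sqrt(2*pi) and u(x) = (x, x^2)
    return eta[0] ** 2 / (4.0 * eta[1]) + 0.5 * np.log(-2.0 * eta[1])

# identity (3): -grad ln g(eta) = E[u(x)] = (mu, mu^2 + sigma^2)
eps = 1e-6
grad = np.array([(log_g(eta + eps * np.eye(2)[i]) - log_g(eta - eps * np.eye(2)[i]))
                 / (2 * eps) for i in range(2)])
print(-grad)                        # ~ [1.5, 2.95]
print([mu, mu**2 + sigma2])         # [1.5, 2.95]
```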

2.2 Expectation Propagation

Minka [16] described a general method for approximate inference in hierarchical Bayesian models known as EP, building on earlier work on topics such as assumed density filtering [15] and loopy belief propagation [5]. EP pursues aims similar to variational inference but with the reverse form of the Kullback-Leibler divergence. A small number of numerical studies, including Minka [16], have shown that EP is convincingly better than variational Bayes, Monte Carlo and Laplace's method at similar computational cost.

Following Minka [18], given a target distribution $p(x)$, the goal of EP is to solve
$$\arg\min_{q \in \mathcal{Q}} \mathrm{KL}(p\,\|\,q), \quad (4)$$
where $q$ is a member of the set of exponential family distributions $\mathcal{Q}$. The key aspect of EP is to rely on a factorization of $p$, such that the posterior can be decomposed into a product of terms:
$$p(\theta \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\prod_i f_i(\theta), \quad (5)$$
where $p(\mathcal{D})$ is the model evidence, the factor $f_n(\theta)$ equals the likelihood $p(x_n \mid \theta)$ for each data point $x_n$, and the factor $f_0(\theta) = p(\theta)$ corresponds to the prior. The approximate distribution $q$ has the same factor structure:
$$q(\theta) = \frac{1}{Z}\prod_i \tilde{f}_i(\theta), \quad (6)$$
in which each factor $\tilde{f}_i(\theta)$ in the approximate distribution corresponds to the factor $f_i(\theta)$ in the true posterior, and $\frac{1}{Z}$ is the normalizing constant needed to ensure that $q(\theta)$ integrates to unity. In addition, each approximate factor $\tilde{f}_i(\theta)$ has the exponential family form.

An optimality result for the exponential family shows that the global solution of (4), combined with (3), is a moment matching solution [34]:
$$\mathbb{E}_q[u(\theta)] = \mathbb{E}_p[u(\theta)]. \quad (7)$$
We see that the optimal solution simply corresponds to matching the expected sufficient statistics. In the Gaussian case for $q$, this means that the best approximation of $p$ under this KL divergence is a Gaussian with the same mean and covariance as $p$. Of course, it is difficult to directly calculate the mean and covariance of $p$, and EP attempts to achieve this through iterative refinements of an approximation. EP iterates among simple local approximations to refine factors that approximate the corresponding posterior contributions.

We briefly review the EP algorithm, which is the basis of our new method. Suppose we observe a dataset containing $N$ i.i.d. samples $\mathcal{D} = \{x_n\}_{n=1}^{N}$, and we have a probabilistic model $p(x \mid \theta)$ parameterized by a $d$-dimensional vector $\theta$ with prior $p_0(\theta)$. Bayesian inference involves calculating the typically intractable posterior distribution of the parameters,
$$p(\theta \mid \mathcal{D}) = \frac{1}{p(\mathcal{D})}\prod_i f_i(\theta) \approx q(\theta) = \frac{1}{Z}\prod_i \tilde{f}_i(\theta), \quad (8)$$
where $Z$ is a normalization constant, and $q(\theta)$ is a tractable approximate distribution that will be updated by EP.
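The Gaussian case of (7) can be seen directly from samples. In the sketch below (our illustration, not the paper's), $p$ is a bimodal mixture; the $\mathrm{KL}(p\,\|\,q)$-optimal Gaussian must match the mean and variance of $p$, producing a single broad blanket over both modes, which is exactly the behavior that motivates the mixture approximation of Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)
# samples from a bimodal p: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2)
z = rng.integers(0, 2, size=50_000)
x = np.where(z == 0, rng.normal(-2.0, 0.5, z.size), rng.normal(2.0, 0.5, z.size))

# (7): the KL(p||q)-optimal Gaussian matches E[u] = (mean, second moment),
# i.e. it has the mean and variance of p -- a broad blanket over both modes
print("matched mean    :", x.mean())    # ~ 0
print("matched variance:", x.var())     # ~ 4.25, covering both modes
```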

The aim of EP is to update the approximate factors so that the contribution of each of the prior and likelihood terms to the posterior is captured, i.e., $\tilde{f}_n(\theta) \approx p(x_n \mid \theta)$. The ideal approach would be to minimize the KL divergence between the posterior distribution and the distribution formed by removing each term and replacing it with the corresponding approximate factor $\tilde{f}_n(\theta)$, namely $\mathrm{KL}\big[\frac{1}{p(\mathcal{D})}\prod_i f_i(\theta) \,\big\|\, \frac{1}{Z}\tilde{f}_n(\theta)\prod_{i \neq n} f_i(\theta)\big]$. But in general this minimization is intractable, because the KL divergence involves computing the full posterior. Instead, EP performs a tractable sequence of local approximations in four steps. First, choose a factor $\tilde{f}_j(\theta)$ to refine. Second, remove $\tilde{f}_j(\theta)$ from the posterior by division to obtain the cavity distribution $q^{\backslash j}(\theta) = q(\theta)/\tilde{f}_j(\theta)$. Note that we could instead find $q^{\backslash j}(\theta)$ from the product of the factors $i \neq j$, but in practice division is usually easier, since for a distribution from the widely used exponential family this calculation reduces to a subtraction of natural parameters. Third, the corresponding term in the true posterior is included to generate the tilted distribution $\hat{p}_j(\theta) \propto q^{\backslash j}(\theta) f_j(\theta)$, and we minimize $\mathrm{KL}\big[\hat{p}_j(\theta) \,\big\|\, q^{\backslash j}(\theta)\tilde{f}_j(\theta)/Z_j\big]$, where $Z_j$ is the normalizing constant. This minimization is tractable since it does not involve the true posterior distribution and, as mentioned before, it turns out to be moment matching for the exponential family. Finally, the updated factor is included in the approximate distribution. The four steps are then iterated until convergence.
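The following sketch (our toy illustration, not the paper's implementation) runs the four steps for a 1-D Bayesian logistic model with a $\mathcal{N}(0,1)$ prior and two likelihood factors $\sigma(y_n x_n \theta)$, $y_n \in \{-1,+1\}$; the data are arbitrary assumptions. Sites are stored as natural parameters (precision, precision times mean), so the cavity in the second step really is a subtraction; the tilted moments in the third step are computed by grid quadrature, since they have no closed form here.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

grid = np.linspace(-10.0, 10.0, 4001)   # quadrature grid for the tilted moments
dx = grid[1] - grid[0]
x_data = np.array([1.0, 2.0])           # toy inputs (assumption)
y_data = np.array([1.0, -1.0])          # toy labels in {-1, +1} (assumption)

# sites in natural parameters: precision lam and precision-times-mean nu;
# site 0 holds the exact N(0, 1) prior, sites 1..N approximate the likelihoods
lam = np.array([1.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 0.0])

for _ in range(20):
    for n in (1, 2):
        # step 2: cavity q^{\n} by subtracting natural parameters
        lam_c = lam.sum() - lam[n]
        nu_c = nu.sum() - nu[n]
        cavity = np.exp(-0.5 * lam_c * grid**2 + nu_c * grid)
        # step 3: tilted distribution and its matched moments
        tilted = cavity * sigmoid(y_data[n - 1] * x_data[n - 1] * grid)
        tilted /= tilted.sum() * dx
        m = (grid * tilted).sum() * dx
        v = ((grid - m) ** 2 * tilted).sum() * dx
        # step 4: updated site = moment-matched Gaussian minus the cavity
        lam[n] = 1.0 / v - lam_c
        nu[n] = m / v - nu_c

print("q(theta): mean =", nu.sum() / lam.sum(), " var =", 1.0 / lam.sum())
```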

2.3 Convergence and Memory Cost Issues

By the Brouwer fixed-point theorem [24], EP updates can be shown to always have a fixed point when the approximations are in the exponential family. For approximations $q(\theta)$ in the exponential family, if the iterations do converge, the resulting solution will be a stationary point of a particular energy function, which is superior to the solution of VI in many scenarios; for instance, VI provides poor approximations when nonsmooth likelihood functions are used [3]. However, one disadvantage of EP is that there is no assurance that the iterations will converge [16]. This is in contrast to VI, which maximizes a lower bound on the log marginal likelihood, so that its iterations are guaranteed not to decrease the bound. A convergent version of EP is therefore worth considering. Another issue is the prohibitive memory cost for large datasets and big models: a crucial limitation of EP is that the number of approximate factors needs to increase with the number of data points, which often leads to a prohibitively large memory overhead [12].

Recently, many works have tried to address these two issues, which hinder the widespread deployment of EP. For example, Heskes et al. [10] and Opper et al. [20] derived convergent double-loop implementations of EP, which, however, can be far slower than the original message passing procedure. Hasenclever et al. [8] proposed the stochastic natural-gradient EP (SNEP) method, which is also double-loop-like, although in practice they only performed a one-step inner-loop update to speed up training. Seeger and Nickisch [30] proposed a fast convergent algorithm for EP with covariance decoupling techniques [19,38]. For the second issue, Dehaene and Barthelmé [4] and Li and Turner [12] used factor tying (also called local parameter sharing) through the averaged EP (AEP) and stochastic EP (SEP) algorithms, which can decrease the prohibitive memory cost.

In contrast to AEP and SEP, Hernández-Lobato et al. [9] proposed a black-box method called black-box-α (BB-α) that directly optimizes the power EP [17] energy function (power EP is an extension of EP that makes the computations more tractable), which is equivalent to minimizing an α-divergence, rather than updating through power EP message passing as SEP and AEP do. The BB-α method is similar to our proposed method; the distinction is that the objective function is different. In particular, our objective function has a simpler form, and it is more convenient to compute its gradients. BB-α also has an analytic energy form that does not require double-loop procedures and can be directly optimized using gradient descent. This means that popular stochastic optimization methods can be used for large-scale learning with BB-α, and the method is provably convergent when the energy function is finite. BB-α can thus address both flaws of EP. However, using the power EP energy function as the objective has the disadvantage that there are many parameters to update, since each likelihood function has a corresponding parameter. Having seen the advantages of the BB-α method, we apply a similar idea to the KL divergence objective of EP. In addition, considering that true posteriors are multi-modal in many practical applications, we adopt a mixture of exponential family distributions as the approximate distribution, so that our method can approximate complicated distributions. We make experimental comparisons with BB-α in Section 6.

We summarize the existing EP works and our method with respect to the convergence and memory cost issues in Table 1. These two issues have been actively discussed in recent years, and this line of research is worth exploring.

[Table 1 about here.]

3 Generalizing EP with Mixtures of Exponential Family Distributions

In this section, we give the details of the proposed algorithm GEP with stochastic optimization.

3.1 GEP

We abbreviate our approximate inference method, generalized expectation propagation with mixtures of exponential family distributions, as GEP. Let $\Theta$ represent the set of hidden variables in the probabilistic model and $\mathcal{D}$ represent the observed data. The model may have hyperparameters, which are omitted here for clarity. The true posterior is given by
$$p(\Theta \mid \mathcal{D}) = \frac{p(\Theta,\mathcal{D})}{p(\mathcal{D})}. \quad (9)$$
Suppose the approximate distribution of the true posterior is $Q(\Theta)$. The original objective function of EP arises by assuming $Q(\Theta)$ is from a specific exponential family and minimizing the KL divergence
$$\mathrm{KL}(p(\Theta \mid \mathcal{D})\,\|\,Q(\Theta)) = \int_{\Theta} p(\Theta \mid \mathcal{D})\ln\frac{p(\Theta \mid \mathcal{D})}{Q(\Theta)}\,d\Theta. \quad (10)$$
Instead of optimizing this objective directly, typical EP adopts a factorized form for $Q(\Theta)$ and then iteratively updates the factors. Note that an estimate of the model evidence $p(\mathcal{D})$ can also be given when EP terminates. Since the $Q(\Theta)$ used in EP is a single exponential family distribution, many complicated posterior distributions may not be well characterized. In GEP, we instead assume that $Q(\Theta)$ is a mixture of exponential family distributions, i.e.,
$$Q(\Theta) = \sum_{m=1}^{M} \alpha_m Q_m(\Theta \mid \eta_m) = \sum_{m=1}^{M} \alpha_m h_m(\Theta)\,g_m(\eta_m)\exp\{\eta_m^{T}u_m(\Theta)\}, \quad (11)$$
where $\eta_m$ and $u_m(\Theta)$ are the natural parameters and sufficient statistics of the corresponding exponential family distribution, respectively. Thus, even if the posterior distribution is a complicated multi-modal distribution, our approximate distribution $Q$ can fit it better. Denote the set of parameters by $\Phi$, which is composed of $\{\alpha_m,\eta_m\}_{m=1}^{M}$.
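For a mixture of Gaussians, the two operations GEP needs, namely evaluating $\ln Q(\Theta)$ and sampling $\Theta \sim Q_m$, are both simple. A minimal sketch (our illustration, with arbitrary parameter values) follows; the log-sum-exp trick keeps the log-density numerically stable.

```python
import numpy as np

def log_mixture_pdf(theta, alpha, mu, sigma2):
    """log Q(theta) for a mixture of 1-D Gaussians via log-sum-exp."""
    comp = (-0.5 * np.log(2 * np.pi * sigma2)
            - 0.5 * (theta[:, None] - mu) ** 2 / sigma2)      # shape (n, M)
    a = np.log(alpha) + comp
    amax = a.max(axis=1, keepdims=True)
    return (amax + np.log(np.exp(a - amax).sum(axis=1, keepdims=True))).ravel()

def sample_mixture(n, alpha, mu, sigma2, rng):
    """Ancestral sampling: pick a component m ~ alpha, then theta ~ Q_m."""
    m = rng.choice(len(alpha), size=n, p=alpha)
    return rng.normal(mu[m], np.sqrt(sigma2[m]))

rng = np.random.default_rng(2)
alpha = np.array([0.3, 0.7]); mu = np.array([-2.0, 1.0]); sigma2 = np.array([0.5, 1.0])
theta = sample_mixture(5, alpha, mu, sigma2, rng)
print(log_mixture_pdf(theta, alpha, mu, sigma2))
```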

The initial objective function that GEP aims to minimize is
$$\mathrm{KL}(p\,\|\,q) = -\int_{\Theta} p(\Theta \mid \mathcal{D})\ln Q(\Theta)\,d\Theta + \mathrm{const} = -\frac{1}{p(\mathcal{D})}\int_{\Theta} p(\Theta,\mathcal{D})\ln Q(\Theta)\,d\Theta + \mathrm{const} = -\frac{1}{p(\mathcal{D})}\int_{\Theta} p(\Theta,\mathcal{D})\ln\Big[\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big]d\Theta + \mathrm{const}, \quad (12)$$
where const represents terms unrelated to $Q(\Theta)$. Since $p(\mathcal{D})$ is positive, we can give an equivalent objective function,
$$J = \int_{\Theta} p(\Theta,\mathcal{D})\ln\Big[\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big]d\Theta, \quad (13)$$
which should be maximized to find the parameters. However, in complicated probabilistic models, a closed form of this expectation may not be analytically computable, which hinders the direct optimization of the parameters. In general, gradient ascent can be used to maximize the objective function. We proceed to use stochastic gradient ascent to update the parameters, which is similar to the stochastic search method of Blei et al. [2].

3.2 Stochastic Optimization

Following the stochastic gradient ascent discussed above, we briefly recall stochastic optimization. Assume that we want to maximize a function $f(x)$, and that $h(x)$ is a random variable whose expectation is the gradient of $f(x)$, i.e., $\mathbb{E}[h(x)] = \nabla_x f(x)$. Let the learning rate $\rho_t$ be a non-negative scalar. At the $t$th iteration, the stochastic optimization update for $x$ is
$$x_{t+1} \leftarrow x_t + \rho_t h_t(x_t), \quad (14)$$
where, for any $t$, $h_t(x)$ is a sample of $h(x)$. If the learning rate $\rho_t$ follows the Robbins-Monro conditions
$$\sum_{t=1}^{+\infty}\rho_t = +\infty, \qquad \sum_{t=1}^{+\infty}\rho_t^2 < +\infty, \quad (15)$$
then $x_{t+1}$ will converge to the optimum $x^{*}$ (if $f$ is concave) or to a local optimum of $f$ (if not concave) [11].
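A minimal sketch of the update (14) under a schedule satisfying (15), on a toy concave objective $f(x) = -(x-3)^2$; the objective, the noise level and the schedule constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
tau, kappa = 10.0, 0.6        # rho_t = (tau + t)^(-kappa) satisfies (15) for kappa in (0.5, 1]

x = 0.0
for t in range(1, 5001):
    rho = (tau + t) ** (-kappa)
    # noisy gradient h_t(x) with E[h_t(x)] = grad f(x) for f(x) = -(x - 3)^2
    h = -2.0 * (x - 3.0) + rng.normal(scale=5.0)
    x += rho * h              # ascent step (14)

print(x)                      # converges toward the maximizer x* = 3
```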

The convergence analysis given in Robbins and Monro [26] proves that the Robbins-Monro stochastic approximation is convergent, i.e., its outputs are (globally or locally) optimal solutions, provided the learning rate satisfies the Robbins-Monro conditions. Stochastic optimization is one of the mainstream optimization approaches for solving large-scale machine learning problems; it has a solid theoretical foundation and good prospects for development. However, some inherent defects still need to be overcome. For example, if the gradient has a large variance, a stochastic optimization algorithm may take a long time to converge and yield bad performance. In this paper, we develop a variance-reduced version of the primary algorithm by making use of control variates to reduce the variance of the noisy gradient.

4 Variance Reduction for GEP

In this section, we apply variance reduction to our method: by reducing the variance of the gradients, one can hope to achieve convergence with a larger or even a constant step size and thus obtain a faster convergence rate. We first compute the gradient of the GEP objective so as to apply the stochastic approximation method, and then review the basic idea of the control variate method for variance reduction. Finally, we show how to choose the control variates that define the variance-reduced gradients replacing the plain gradients of the objective.

4.1 Gradient of the GEP Objective Function

Stochastic optimization requires a stochastic approximation of the gradients of the objective function with respect to the parameters. There are two different types of parameters in our approximate posterior: the mixture weights $\alpha$ and the natural parameters $\eta$ of the exponential family distributions. These gradients are given below (assuming the necessary regularity conditions):

$$\nabla_{\alpha_m} J = \nabla_{\alpha_m}\int_{\Theta} p(\Theta,\mathcal{D})\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} p(\Theta,\mathcal{D})\,\nabla_{\alpha_m}\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} \frac{p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\,Q_m(\Theta \mid \eta_m)\,d\Theta = \mathbb{E}_{Q_m}\bigg[\frac{p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\bigg], \quad (16)$$

$$\nabla_{\eta_m} J = \nabla_{\eta_m}\int_{\Theta} p(\Theta,\mathcal{D})\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} p(\Theta,\mathcal{D})\,\nabla_{\eta_m}\ln\Big(\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big)d\Theta = \int_{\Theta} \frac{\alpha_m\,p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\,\nabla_{\eta_m} Q_m(\Theta \mid \eta_m)\,d\Theta = \int_{\Theta} \frac{\alpha_m\,p(\Theta,\mathcal{D})\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\,Q_m(\Theta \mid \eta_m)\,d\Theta = \mathbb{E}_{Q_m}\bigg[\frac{\alpha_m\,p(\Theta,\mathcal{D})\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}\bigg], \quad (17)$$

where we have used the identity $\nabla_{\eta_m} Q_m(\Theta \mid \eta_m) = Q_m(\Theta \mid \eta_m)\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)$. We have thus expressed the derivatives of the objective as expectations with respect to the approximate distribution. Let $f(\Theta) := \frac{p(\Theta,\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)}$. It follows that $\nabla_{\alpha_m} J = \mathbb{E}_{Q_m}[f(\Theta)]$ and $\nabla_{\eta_m} J = \alpha_m\,\mathbb{E}_{Q_m}[f(\Theta)\,\nabla_{\eta_m}\ln Q_m(\Theta \mid \eta_m)]$. We can then stochastically approximate these expectations by Monte Carlo sampling from the specific exponential family distribution $Q_m$ to obtain noisy but unbiased gradient estimators,
$$\nabla_{\alpha_m} J \approx \frac{1}{S}\sum_{s=1}^{S}\frac{p(\Theta^{[s]},\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta^{[s]} \mid \eta_m)} = \frac{1}{S}\sum_{s=1}^{S} f(\Theta^{[s]}), \quad (18)$$
$$\nabla_{\eta_m} J \approx \frac{1}{S}\sum_{s=1}^{S}\frac{\alpha_m\,p(\Theta^{[s]},\mathcal{D})\,\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m)}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta^{[s]} \mid \eta_m)} = \frac{1}{S}\sum_{s=1}^{S}\alpha_m f(\Theta^{[s]})\,\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m), \quad (19)$$
where $\Theta^{[s]} \sim Q_m$ for $s = 1,\ldots,S$. We can therefore replace $\nabla_{\alpha_m} J$ and $\nabla_{\eta_m} J$ with the unbiased approximations (18) and (19).
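A minimal numpy sketch of the estimators (18) and (19) for a 1-D mixture-of-Gaussians approximation. The unnormalized joint and all parameter values are illustrative assumptions; for simplicity, the score is taken with respect to the component mean, matching the Gaussian-case gradients derived later in Section 5.2.

```python
import numpy as np

rng = np.random.default_rng(4)

def joint(theta):
    # illustrative unnormalized p(theta, D): a bimodal target (an assumption, not the paper's model)
    return np.exp(-(theta**2 - 1.0) ** 2)

alpha = np.array([0.5, 0.5])                                 # mixture weights alpha_m
mu = np.array([-0.5, 0.5]); sigma2 = np.array([1.0, 1.0])

def q_pdf(theta, k):
    return np.exp(-0.5 * (theta - mu[k]) ** 2 / sigma2[k]) / np.sqrt(2 * np.pi * sigma2[k])

S, m = 25, 0
theta = rng.normal(mu[m], np.sqrt(sigma2[m]), size=S)        # Theta^[s] ~ Q_m
mix = alpha[0] * q_pdf(theta, 0) + alpha[1] * q_pdf(theta, 1)
f = joint(theta) / mix                                       # f(Theta^[s])

grad_alpha_m = f.mean()                                      # estimator (18)
score_mu = (theta - mu[m]) / sigma2[m]                       # d log Q_m / d mu_m (Gaussian score)
grad_mu_m = alpha[m] * (f * score_mu).mean()                 # estimator (19), mean component
print(grad_alpha_m, grad_mu_m)
```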

We can now update the parameters $\alpha_m$ and $\eta_m$ by stochastic optimization. At each iteration $t$, we have
$$\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\nabla_{\alpha_m}J, \qquad \eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\nabla_{\eta_m}J, \quad (20)$$
where $\rho_t$ and $\gamma_t$ are proper stochastic gradient optimization step sizes. In addition, since $\alpha$ carries the constraint $\sum_{m=1}^{M}\alpha_m = 1$, a constrained optimization problem must be considered. We use the Lagrange multiplier method to solve this problem; its description and derivation are given in Section 4.3. Algorithm 1 describes the GEP method.

Algorithm 1 GEP
Input: data $x$, joint distribution $p$, mixture of exponential family distributions $Q$.
Initialize: $\alpha_m$ and $\eta_m$ randomly, $t = 1$.
1: repeat
2:   // Draw $S$ samples from the $m$th exponential family approximate distribution
3:   for $s = 1$ to $S$ do
4:     $\Theta^{[s]} \sim Q_m$
5:   end for
6:   $\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\nabla_{\alpha_m}J$
7:   $\eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\nabla_{\eta_m}J$, where $\rho_t$ and $\gamma_t$ are the $t$th values of Robbins-Monro sequences
8:   $t = t + 1$
9: until the change of $\alpha_m$ and $\eta_m$ (in terms of Euclidean distance) is less than $\varepsilon$ ($\varepsilon = 0.01$ in this paper)

After these gradients have been calculated, we need to find suitable control variates to replace them in order to obtain smaller variances and achieve faster convergence. With suitable control variates, we can derive variance-reduced gradients $\hat{\nabla}_{\alpha_m}J$ and $\hat{\nabla}_{\eta_m}J$ that replace $\nabla_{\alpha_m}J$ and $\nabla_{\eta_m}J$. The new gradient estimate is an unbiased estimate of the true gradient, with a lower variance when the control variate and the gradient are positively correlated. We then obtain
$$\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\hat{\nabla}_{\alpha_m}J, \qquad \eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\hat{\nabla}_{\eta_m}J. \quad (21)$$
The variance-reduced GEP method is described in Algorithm 2. We use a function that is highly correlated with the original function as the control variate; how to select control variates is discussed in Section 4.4.

Algorithm 2 GEP-CV
Input: data $x$, joint distribution $p$, mixture of exponential family distributions $Q$.
Initialize: $\alpha_m$ and $\eta_m$ randomly, $t = 1$.
1: repeat
2:   // Draw $S$ samples from the $m$th exponential family approximate distribution
3:   for $s = 1$ to $S$ do
4:     $\Theta^{[s]} \sim Q_m$
5:   end for
6:   for $s = 1$ to $S$ do
7:     $f[s] = \frac{p(\Theta^{[s]},\mathcal{D})}{\sum_{m=1}^{M}\alpha_m Q_m(\Theta^{[s]} \mid \eta_m)}$
8:     $g[s]$ is the control variate of $f[s]$
9:     $g'[s]$ is $g[s]$ multiplied by $\alpha_m\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m)$
10:  end for
11:  $\hat{a} = \frac{\mathrm{Cov}(f,g)}{\mathrm{Var}(g)}$, estimated from a few samples
12:  Compute the variance-reduced gradients:
13:  $\hat{\nabla}_{\alpha_m}J \approx \frac{1}{S}\sum_{s=1}^{S}\big(f[s] - \hat{a}(g[s] - \mathbb{E}[g[s]])\big)$
14:  $\hat{\nabla}_{\eta_m}J \approx \frac{1}{S}\sum_{s=1}^{S}\big(f[s]\,\alpha_m\nabla_{\eta_m}\ln Q_m(\Theta^{[s]} \mid \eta_m) - \hat{a}(g'[s] - \mathbb{E}[g'[s]])\big)$
15:  $\alpha_m^{(t+1)} = \alpha_m^{(t)} + \rho_t\hat{\nabla}_{\alpha_m}J$
16:  $\eta_m^{(t+1)} = \eta_m^{(t)} + \gamma_t\hat{\nabla}_{\eta_m}J$, where $\rho_t$ and $\gamma_t$ are the $t$th values of Robbins-Monro sequences
17: until the change of $\alpha_m$ and $\eta_m$ (in terms of Euclidean distance) is less than $\varepsilon$ ($\varepsilon = 0.01$ in this paper)

4.2 Variance Reduction with Control Variates

We now recall variance reduction with control variates, a well-known technique in Monte Carlo simulation (see, e.g., Rubinstein and Kroese [28]) designed to reduce the variance of the estimate of the expectation of a random variable. Generally speaking, variance reduction works by modifying a function of a random variable such that its expectation remains the same but its variance decreases. A control variate can be seen as a function $g(\theta)$ that approximates $f(\theta)$ well in the highly probable regions defined by $Q_m(\theta)$, but that also has a closed-form expectation under $Q_m$. Using $g$ and a scalar $a$, we first form the new function
$$\hat{f}(\theta) = f(\theta) - a\big(g(\theta) - \mathbb{E}_{Q_m}[g(\theta)]\big). \quad (22)$$
The next step is to set the value of $a$ to minimize the variance of $\hat{f}$. A simple calculation shows that
$$\mathrm{Var}(\hat{f}) = \mathrm{Var}(f) - 2a\,\mathrm{Cov}(f,g) + a^2\,\mathrm{Var}(g). \quad (23)$$
Taking the derivative with respect to $a$ and setting it to zero gives the optimal value,
$$a^{*} = \frac{\mathrm{Cov}(f,g)}{\mathrm{Var}(g)}. \quad (24)$$

Usually this covariance and variance are unknown for the functions we encounter. We can approximate $a^{*}$ with $\hat{a}$, found by plugging the sample variance and covariance into (24) using samples from the algorithm (in the experiments, we estimate $\hat{a}$ by calculating the sample covariance and variance from minibatches). The potential reduction in variance is seen by plugging (24) into (23) and taking the ratio of the two variances,
$$\mathrm{Var}(\hat{f})/\mathrm{Var}(f) = 1 - \mathrm{Corr}(f,g)^2. \quad (25)$$
For a high Pearson correlation coefficient $\mathrm{Corr}(f,g)$ between $f$ and $g$, the control variate $g$ leads to a more significant variance reduction, and thus faster convergence is expected. We discuss how to construct the control variates in Section 4.4.
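A minimal sketch of (22), (24) and (25) on a minibatch of $S = 25$ samples. The functions are illustrative stand-ins: $g$ is the second-order expansion of $f = e^z$, so its expectation under $z \sim \mathcal{N}(0,1)$ is known to be $1.5$.

```python
import numpy as np

rng = np.random.default_rng(5)
S = 25                                   # minibatch of samples, as in the paper's setup
z = rng.normal(size=S)
f = np.exp(z)                            # stand-in for the noisy gradient terms f[s]
g = 1.0 + z + 0.5 * z**2                 # control variate: 2nd-order expansion of exp; E[g] = 1.5

a_hat = np.cov(f, g, ddof=1)[0, 1] / np.var(g, ddof=1)   # (24), estimated from the minibatch
f_hat = f - a_hat * (g - 1.5)                            # (22), using the known E[g]

r = np.corrcoef(f, g)[0, 1]
print("empirical var ratio:", f_hat.var() / f.var())
print("1 - Corr(f,g)^2    :", 1 - r**2)                  # (25)
```

The two printed values agree because, with $\hat{a}$ computed from the same sample, the plug-in variance ratio reduces exactly to $1 - \mathrm{Corr}(f,g)^2$.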

4.3 Lagrange Multipliers

Lagrange multiplier methods introduce Lagrange multipliers to solve optimization problems with constraints [36]. This is an exact method that optimizes the objective $f(X)$ subject to the Kuhn-Tucker conditions [31]. Constrained optimization problems generally have the form
$$\max f(X), \quad \text{s.t. } g(X) \le 0,\; h(X) = 0, \quad X = (x_1, x_2, \ldots, x_n), \quad (26)$$
where $X$ is a vector of real variables in continuous problems or a vector of discrete numbers in discrete problems, $f(X)$ is the objective function, $g(X) = [g_1(X),\ldots,g_k(X)]^{T}$ is a set of $k$ inequality constraints, and $h(X) = [h_1(X),\ldots,h_m(X)]^{T}$ is a set of $m$ equality constraints. Here $f(X)$, $g(X)$ and $h(X)$, as well as their derivatives, are continuous functions. Since Lagrangian methods cannot directly handle inequality constraints $g_i(X) \le 0$ under normal circumstances, we transform the inequality constraints into equality constraints by adding slack variables $z_i$, which results in $p_i(X) = g_i(X) + z_i^2$. The corresponding Lagrange function is defined as
$$L(X,\lambda,\mu) = f(X) + \lambda^{T}h(X) + \mu^{T}p(X), \quad (27)$$
where $\lambda = [\lambda_1,\ldots,\lambda_m]^{T}$ and $\mu = [\mu_1,\ldots,\mu_k]^{T}$ are two sets of Lagrange multipliers. According to classical optimization theory [31], all the optimal solutions of (27) obey the following set of necessary conditions:
$$\nabla_X L(X,\lambda,\mu) = 0, \quad \nabla_\lambda L(X,\lambda,\mu) = 0, \quad \nabla_\mu L(X,\lambda,\mu) = 0. \quad (28)$$
In this paper, we have no inequality constraints, so we let $\mu = 0$. The equality constraint is $h(\alpha) = \sum_{m=1}^{M}\alpha_m - 1 = 0$. The Lagrange function corresponding to the GEP method is
$$L(\alpha,\lambda) = J(\alpha) + \lambda\,h(\alpha) = \int_{\Theta} p(\Theta,\mathcal{D})\ln\Big[\sum_{m=1}^{M}\alpha_m Q_m(\Theta \mid \eta_m)\Big]d\Theta + \lambda\Big(\sum_{m=1}^{M}\alpha_m - 1\Big), \quad (29)$$
where $\alpha = \{\alpha_1,\ldots,\alpha_M\}$ collects the parameters to be optimized. With the conditions
$$\nabla_\lambda L(\alpha,\lambda) = 0, \quad \nabla_{\alpha_1} L(\alpha,\lambda) = 0, \quad \ldots, \quad \nabla_{\alpha_M} L(\alpha,\lambda) = 0, \quad (30)$$
we reach a feasible local extremum $\alpha^{*}$ when all the gradients vanish.
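A toy instance of the conditions (29)-(30) (our illustration, not the GEP objective itself): maximizing $\sum_m c_m\ln\alpha_m$ subject to $\sum_m\alpha_m = 1$. Stationarity of the Lagrangian gives $c_m/\alpha_m + \lambda = 0$ for all $m$, and combined with the constraint, $\alpha_m = c_m/\sum_k c_k$. The sketch checks numerically that feasible perturbations do not improve the objective.

```python
import numpy as np

# toy problem: maximize f(alpha) = sum_m c_m * ln(alpha_m) s.t. sum_m alpha_m = 1.
# Stationarity of L(alpha, lambda) = f(alpha) + lambda * (sum(alpha) - 1) gives
# c_m / alpha_m + lambda = 0 for all m, hence alpha_m = c_m / sum(c).
c = np.array([2.0, 1.0, 3.0])
alpha_star = c / c.sum()

f = lambda a: np.sum(c * np.log(a))
# numeric check: perturbing along a feasible direction (entries summing to zero) never helps
d = np.array([1.0, -0.5, -0.5]) * 1e-3
print(f(alpha_star), ">=", f(alpha_star + d), "and", f(alpha_star - d))
```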

4.4 How to Choose Control Variates?

As discussed in Section 4.2, the stronger the correlation between $f$ and $g$, the greater the variance reduction. The variance of the estimator is directly linked to the convergence speed, and improving the correlation leads to faster convergence [7]. Thus, finding control variates that have a greater correlation with the gradient allows us to maintain fast convergence and potentially use larger step sizes. In this paper, we want to reduce the variance of the gradient of the objective with respect to the parameters $\Phi$, $\nabla_\Phi J$, so we need to calculate a variance-reduced gradient $\hat{\nabla}_\Phi J$ to replace $\nabla_\Phi J$. The principle behind our selection of control variates is that we form $g$ from some statistics of the hidden variables, e.g., low-order moments. The low-order moments roughly characterize the hidden variable distribution and do not depend on the parameters $\Phi$. The expectation of the control variates can thus be pre-computed while running the stochastic gradient algorithm. We use this rule to construct control variates throughout this paper.

The original GEP method in Algorithm 1 uses Monte Carlo samples to stochastically approximate the gradient. One problem with this method is that we may have to set a relatively small step size in order to reach the optimal solution. The need for a small learning rate is due to the potentially large variance of the stochastic gradient, which approximates the full gradient using a small batch or a single example; this leads to slower convergence. To achieve quicker convergence, we look for a more accurate approximation of the full gradient. Through Eqs. (16) and (17) in Section 4.1, we know the gradients of the objective with respect to $\Phi$, expressed as expectations of $f$. As in Algorithm 2, we construct the control variates as follows: to ensure that $f$ and $g$ are highly correlated and that the expectation of $g$ is easy to calculate, we choose the second-order Taylor expansion of $f$, or of a part of $f$, as the control variate $g$, which captures the second-order information about the individual function.¹ After estimating $\hat{a}$ on a small number of samples, we can calculate the variance-reduced gradient $\hat{\nabla}_\Phi J$. The control variate Monte Carlo estimate of the gradient using $S$ samples is thus
$$\hat{\nabla}_\Phi J \approx \frac{1}{S}\sum_{s=1}^{S}\big(f[s] - \hat{a}(g[s] - \mathbb{E}[g[s]])\big). \quad (31)$$
This construction retains much of the function's information and guarantees a high correlation between the control variate and the function. In the empirical studies, we show that this control variate reduces the variance of the gradient well. The resulting method is an explicit variance reduction technique for GEP, which we call GEP-CV (GEP with control variates). This improvement over GEP provides useful insights into the performance of the underlying optimization problem.

¹ Alternatives to Taylor expansion for constructing control variates exist as well [2].

5 Bayesian Logistic Regression for Classification with GEP

In this section, we illustrate GEP on Bayesian logistic regression for classification. We choose this Bayesian probabilistic model for our experimental validations. Logistic regression is an important linear classifier in machine learning and has been widely used in computer vision [1], bioinformatics [35], gene classification [13], neural signal processing [21,6,22], matrix data classification [32] and semi-supervised learning [33].

5.1 Bayesian Logistic Regression for Classification

Binary logistic regression takes $P$-dimensional data vectors $x \in \mathbb{R}^{N \times P}$, where $N$ is the number of data points and $P$ is the data dimension, and predicts the binary outputs $y \in \mathbb{R}^{N}$ with $y_n \in \{0,1\}$. The parameter is $\theta \in \mathbb{R}^{P}$. We model the prediction likelihood as $p(y_i \mid x_i,\theta) \sim \mathrm{Bern}(\sigma(\theta^{T}x_i))$, where $\sigma(\cdot)$ is the sigmoid function, $\sigma(b) = (1 + e^{-b})^{-1}$. Bayesian logistic regression places a prior distribution on the coefficient vector, which is drawn from a $P$-dimensional multivariate normal distribution with independent components, $\theta \sim \mathcal{N}(0, I_P)$. The joint distribution is $p(\mathcal{D},\theta) = p(\mathcal{D} \mid \theta)p(\theta) = \prod_n \sigma(\theta^{T}x_n)^{y_n}\{1 - \sigma(\theta^{T}x_n)\}^{1-y_n}\,\mathcal{N}(\theta \mid 0, I_P)$.

We would like to evaluate $p(\theta \mid \mathcal{D})$, but this is not available in closed form. Instead, for inference we define a mixture of exponential family distributions as the approximate distribution $q$,
$$q(\theta) = \sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m). \quad (32)$$
Since the approximate distribution $q_m$ needs to be selected from the exponential family, it is natural to choose the Gaussian; the natural parameters $\eta$ then correspond to the mean $\mu$ and variance $\sigma^2$ of the Gaussian. For convenience of calculation, we posit the specific approximate distribution $q_m(\theta \mid \eta_m) = \prod_{j=1}^{P}\mathcal{N}(\theta_j \mid \mu_{mj}, \sigma^2_{m_j})$ over $\theta$. That is, we model each $\theta_j$ as an independent Gaussian with mean $\mu_{mj}$ and variance $\sigma^2_{m_j}$, and we use GEP to learn the optimal values of $\eta_m = \{\mu_{mj}, \sigma^2_{m_j}\}_{j=1}^{P}$. We use the shorthand $\mu_m = (\mu_{m1},\ldots,\mu_{mP})$ and $\sigma^2_m = (\sigma^2_{m_1},\ldots,\sigma^2_{m_P})$.

For the data joint likelihood, we can decompose $p(y,\theta \mid x) = p(y \mid x,\theta)p(\theta)$ using the chain rule of probability (noting that $x$ is a constant). It is then straightforward to calculate
$$p(y \mid x,\theta) = \prod_{i=1}^{N}\sigma(\theta^{T}x_i)^{y_i}(1 - \sigma(\theta^{T}x_i))^{1-y_i}, \qquad p(\theta) = \prod_{j=1}^{P}\mathcal{N}(\theta_j \mid 0, 1). \quad (33)$$
The objective for this model is to maximize
$$J = \int_{\theta} p(y \mid x,\theta)\,p(\theta)\ln\Big[\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m)\Big]d\theta. \quad (34)$$
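A minimal sketch of the log of the joint $p(\mathcal{D},\theta)$ above (the synthetic data are an illustrative assumption); this is the quantity whose samples drive the gradient estimators (18)-(19).

```python
import numpy as np

def log_joint(theta, X, y):
    """log p(D, theta) = sum_n [y_n log sig + (1 - y_n) log(1 - sig)] + log N(theta; 0, I)."""
    z = X @ theta
    # numerically stable Bernoulli log-likelihood with logits z:
    # y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)) = y*z - log(1 + e^z)
    loglik = np.sum(y * z - np.logaddexp(0.0, z))
    logprior = -0.5 * theta @ theta - 0.5 * len(theta) * np.log(2 * np.pi)
    return loglik + logprior

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3)); theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=100) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)
print(log_joint(theta_true, X, y))
```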

5.2 Parameter Optimization with GEP and GEP-CV

In this section, we show the parameter optimization of GEP and GEP-CV on the Bayesian logistic regression model. Recall from Section 3.1 that we have two different types of parameters: the mixture weights $\alpha$ and the natural parameters $\eta$ of the exponential family distributions. Gradients of the objective with respect to these two parameters, and their Monte Carlo approximations, were discussed in Section 4.1, so we can use stochastic optimization to optimize the parameters; the treatment of the constraint on $\alpha$ was given in Section 4.3. Below we focus on the optimization with respect to the natural parameters $\eta$ for the Bayesian logistic regression model.

In order to calculate $\nabla_{\eta_m}J$, we need to calculate $\nabla_{\eta_m}\log q_m(\theta \mid \eta_m)$. As our approximate distribution $q$ uses a mixture of Gaussian distributions, the natural parameters of a single component are the mean and variance of the corresponding Gaussian. Since $\sigma^2_{m_j}$ is constrained to be positive, we instead optimize over $\gamma_{mj} = \log(\sigma^2_{m_j})$. It is straightforward to see that
$$\nabla_{\mu_{mj}}\log q_m(\theta \mid \eta_m) = \nabla_{\mu_{mj}}\sum_{i=1}^{P}\Big(-\frac{\log(\sigma^2_{m_i})}{2} - \frac{(\theta_i - \mu_{mi})^2}{2\sigma^2_{m_i}}\Big) = \frac{\theta_j - \mu_{mj}}{\sigma^2_{m_j}}, \quad (35)$$
$$\nabla_{\gamma_{mj}}\log q_m(\theta \mid \eta_m) = \nabla_{\sigma^2_{m_j}}\Big(\sum_{i=1}^{P}\Big(-\frac{\log(\sigma^2_{m_i})}{2} - \frac{(\theta_i - \mu_{mi})^2}{2\sigma^2_{m_i}}\Big)\Big)\,\nabla_{\gamma_{mj}}(\sigma^2_{m_j}) = \Big(-\frac{1}{2\sigma^2_{m_j}} + \frac{(\theta_j - \mu_{mj})^2}{2(\sigma^2_{m_j})^2}\Big)\sigma^2_{m_j}. \quad (36)$$
Note that we use the chain rule in the derivation of $\nabla_{\gamma_{mj}}\log q_m(\theta \mid \eta_m)$. We can then update the parameters $\alpha_m$ and $\eta_m$ by stochastic optimization.
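A minimal sketch of (35)-(36) for a factorized Gaussian, with a finite-difference check of the mean gradient; the dimensions and parameter values are illustrative.

```python
import numpy as np

def log_q(theta, mu, gamma):
    """log of a factorized Gaussian with variances sigma2 = exp(gamma)."""
    s2 = np.exp(gamma)
    return np.sum(-0.5 * np.log(2 * np.pi * s2) - 0.5 * (theta - mu) ** 2 / s2)

rng = np.random.default_rng(7)
theta = rng.normal(size=4); mu = rng.normal(size=4); gamma = rng.normal(size=4)
s2 = np.exp(gamma)

grad_mu = (theta - mu) / s2                          # (35)
grad_gamma = -0.5 + (theta - mu) ** 2 / (2 * s2)     # (36), after the chain rule

eps = 1e-6
fd = (log_q(theta, mu + eps * np.eye(4)[0], gamma) -
      log_q(theta, mu - eps * np.eye(4)[0], gamma)) / (2 * eps)
print(grad_mu[0], "~", fd)                           # finite-difference check
```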

Now, for Bayesian logistic regression, we show how to form our control variates. We choose the second-order Taylor expansion of the sigmoid function at $\xi$, denoted $\hat{\sigma}(z,\xi)$:
$$\hat{\sigma}(z,\xi) = \sigma(\xi) + \sigma(\xi)\sigma(-\xi)(z - \xi) + \frac{\sigma(\xi)(1 - 2\sigma(\xi))\sigma(-\xi)}{2}(z - \xi)^2. \quad (37)$$
As discussed in Section 4.1, we let $f_n = \frac{\sigma(x_n^{T}\theta)\,p(\theta)}{\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m)}$ for each observation $x_n$. Let $z_n = \theta^{T}x_n$ and $\xi_n = \hat{\mu}^{T}x_n$, where $\hat{\mu}$ is the current mean of $q_m$. For convenience, we use the second-order Taylor expansion of a part of $f_n$ to define our control variate $g_n$ as
$$g_n(z_n,\xi_n) = \frac{p(\theta)\,\hat{\sigma}(z_n,\xi_n)}{\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m)}. \quad (38)$$
The variance-reduced gradients $\hat{\nabla}_{\alpha_m}J$ and $\hat{\nabla}_{\eta_m}J$ are given by
$$\hat{\nabla}_{\alpha_m}J = \frac{1}{S}\sum_{s=1}^{S}\big(f_s - \hat{a}(g_s - \mathbb{E}[g_s])\big), \qquad \hat{\nabla}_{\eta_m}J = \frac{1}{S}\sum_{s=1}^{S}\big(f_s\,\alpha_m\nabla_{\eta_m}\ln q_m(\theta^{[s]} \mid \eta_m) - \hat{a}(g'_s - \mathbb{E}[g'_s])\big), \quad (39)$$
where $g'_s$ is $g_s$ multiplied by $\alpha_m\nabla_{\eta_m}\log q_m(\theta \mid \eta_m)$. Moreover, the expectation of $g_n$ can be computed in closed form as
$$\mathbb{E}[g_n] = p(\theta)\,\sigma(\xi_n)\bigg(1 + \theta^{T}\bar{x}_n\,\sigma(-\xi_n)\big(1 - \xi_n(1 - 2\sigma(\xi_n))\big) - \xi_n\sigma(-\xi_n) + \frac{(1 - 2\sigma(\xi_n))\sigma(-\xi_n)}{2}\big(\theta^{T}\theta\,(\mathrm{Var}(x_n) + \bar{x}_n\bar{x}_n^{T}) + \xi_n^{2}\big)\bigg)\Big/\sum_{m=1}^{M}\alpha_m q_m(\theta \mid \eta_m), \quad (40)$$
which facilitates the use of control variates for variance reduction.
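A minimal sketch of the expansion (37), with the curvature term $\sigma''(\xi) = \sigma(\xi)(1-2\sigma(\xi))\sigma(-\xi)$ as reconstructed above. The approximation is accurate near the expansion point $\xi$ and degrades away from it, which is why $\xi_n$ is set to the current mean prediction $\hat{\mu}^{T}x_n$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_taylor2(z, xi):
    """Second-order Taylor expansion of the sigmoid around xi, as in (37)."""
    s, sm = sigmoid(xi), sigmoid(-xi)
    return s + s * sm * (z - xi) + 0.5 * s * (1.0 - 2.0 * s) * sm * (z - xi) ** 2

xi = 0.3
for z in (0.2, 0.3, 0.5, 1.5):
    print(z, sigmoid(z), sigmoid_taylor2(z, xi))   # close near xi, diverges far away
```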

6 Experiments

In this section, we perform experiments using the GEP methods for binary classification with Bayesian logistic regression.

6.1 Data and Set-up

For Bayesian logistic regression, we first compare our GEP methods with three methods, maximum likelihood estimation (MLE), variational inference (VI) and expectation propagation (EP), on synthesized data. The synthesized data are three-modal, generated from three 2-D Gaussian distributions with different means. This data set includes 7000 labeled examples living in 3-D space, where one dimension of all ones serves as an augmented representation. We randomly selected 5000 data points for training and used the remaining 2000 data points for testing.

We set the learning rates $\rho_t$ and $\gamma_t$ by decreasing the step sizes so as to meet the Robbins-Monro conditions, which guarantees convergence to local optimal solutions of $J$. For example, $\rho_t = \gamma_t = (\tau + t)^{-\kappa}$ with $\kappa \in (0.5, 1]$ and $\tau \ge 0$ satisfies this requirement, where $t$ is the iteration number, and $\tau$ and $\kappa$ are parameters for adjusting the learning rates. In our experiments, we set $\tau = 10$ and $\kappa = 0.6$ for GEP, and $\tau = 10$ and $\kappa = 0.99$ for VI. The difference in learning rates is because one of the GEP methods uses the variance reduction technique, which yields a smaller gradient variance; with variance reduction, a large learning rate allows faster convergence without sacrificing performance.

Next, we use seven real-world data sets from the UCI repository, Ionosphere, Madelon, Pima, Colon Cancer, WPBC, WDBC, and SPECTF, to compare our methods with BB-α [9], SVGD [23], and BBBVI [14]. These data sets range from 198 to 4400 labeled examples living in 9 to 501 dimensions. We used a minibatch size of 25, i.e., $S = 25$, and at most 2000 iterations in all our experiments.

6.2 Comparison Experiments on Synthetic Data

We first set GEP's number of mixture components in the approximate distribution to 1, i.e., the approximate distribution $q$ is a single Gaussian distribution rather than a mixture of Gaussians. This GEP setting can be considered comparable to classical EP. The prior and likelihood of Bayesian logistic regression were given in Section 5.1, except that the approximate distribution $q$ is now a single Gaussian. We can then calculate the likelihood function for MLE and the evidence lower bound for VI. The objective functions of MLE and VI that need to be maximized are

$$J_{\mathrm{MLE}}(\theta) = \log p(y \mid x,\theta) = \sum_i \big[y_i\log p(y_i = 1 \mid x_i) + (1 - y_i)\log(1 - p(y_i = 1 \mid x_i))\big] \quad (41)$$
and
$$J_{\mathrm{VI}}(\theta) = \mathbb{E}_q[\log p(y,\theta \mid x) - \log q(\theta)]. \quad (42)$$
As for EP, the approximate distribution $q(\theta)$ is written as a product of factors corresponding to the prior and likelihoods, $\frac{1}{Z}\prod_i \tilde{f}_i(\theta)$, and each factor is refined through moment matching until the algorithm converges. Note that GEP-CV denotes GEP with control variates. Following Xu et al. [39], we measure performance by computing the root mean square error (RMSE) with posterior mean values.

In Figure 1, the five lines from top to bottom are MLE, VI, GEP, EP and GEP-CV. The results in Figure 1 indicate that GEP-CV (the lowest line) is the best performing method and outperforms EP by a large margin, while MLE is the worst. Compared to GEP, GEP-CV has a lower RMSE and converges faster. This is because GEP-CV employs the variance reduction technique, which reduces the variance of the gradients; with variance reduction, a large step size allows the algorithm to converge faster without sacrificing performance. Fluctuations stem from the large variance of the gradients, and because of these fluctuations the number of iterations increases, i.e., convergence becomes slower. Compared to GEP, the performance of VI is slightly worse. Neither of these two methods involves variance reduction, which is why they show large fluctuations across iterations.

[Fig. 1 about here.]

In Figure 2, the three lines in the left part represent the three dimensions of the approximate mean at each iteration. The magnitude of the change of the approximate mean (in terms of Euclidean distance) is shown in the right part. We see that the change in the approximate mean of GEP-CV is clearly smoother than that of GEP, which means GEP-CV converges faster. The control variates provide a major reduction in variance, which is why GEP-CV converges faster and each update is more stable.

[Fig. 2 about here.]

6.3 Real Data Classification

In this experiment, we test the GEP-CV method on seven real-world binary classification datasets from the UCI machine learning repository. For comparison with GEP-CV, we use three state-of-the-art methods: BB-α [9], SVGD [23], and BBBVI [14]. We ran the evaluations with damped learning rates and stopped learning after convergence. In the experiment, we set $\alpha$ in BB-α to 1 and to $10^{-6}$, for which the algorithm is equivalent to EP and to VI, respectively (as proved in Hernández-Lobato et al. [9]). For the other settings of this experiment, we follow the settings of the above methods. We test performance in terms of test log-likelihood and test accuracy. We set up a mixture of 3 Gaussians ($M = 3$) as the approximate distribution in our GEP-CV method. We evaluate the performance of each method on 20 random training and test splits of the data, with 90% and 10% of the data, respectively. The results (averaged test accuracy and averaged test log-likelihood) are summarized in Tables 2 and 3. They show that the GEP-CV method performs better than BB-α, SVGD, and BBBVI.

[Table 2 about here.]

[Table 3 about here.]

6.4 GEP with Different Numbers of Components

In Table 4, we show the RMSE, test accuracy and test log-likelihood for GEP and GEP-CV on synthetic data with the number of components of the approximate distribution set to $M = 1, 2, 3, 5, 7$. We see that as $M$ increases, the overall performance of GEP-CV and GEP first improves and then degrades. This is probably because, for the 3-modal synthesized data, when the approximate distribution has 3 components ($M = 3$) it can best approximate the posterior distribution. When the number of components grows larger, the approximate distribution becomes more complicated and there are more parameters to update, which leads to inferior performance.

[Table 4 about here.]

7 Conclusion

In this paper, we proposed generalized expectation propagation (GEP), a deterministic approximate inference algorithm with guaranteed convergence and low memory consumption. This is fulfilled by considering the EP KL divergence as the objective function, taking Monte Carlo approximations of the gradients of this function and using variance reduction techniques.

Scalability to large datasets can be further achieved in the future by using stochastic gradient descent with minibatches of observations. In the experiments, we evaluated the GEP methods on multi-modal synthetic data and real-world data. We conclude that GEP offers impressive performance: it outperforms VI, EP, SVGD, and BBBVI, and it can be considered an alternative algorithm to EP that is guaranteed to be convergent. In addition, we also tested the performance of GEP with different numbers of components. For future work, it will be interesting to combine the method with other variance reduction techniques, such as the reparameterization trick, and to explore more advanced models with this method.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under Project .

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer.
[2] David M. Blei, Michael I. Jordan, and John Paisley. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning.
[3] John P. Cunningham, Philipp Hennig, and Simon Lacoste-Julien. Gaussian probabilities and expectation propagation. arXiv preprint, pages 1-56.
[4] Guillaume Dehaene and Simon Barthelmé. Expectation propagation in the large data limit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1).
[5] Brendan J. Frey and David J. C. MacKay. A revolution: Belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems.

[6] Adam D. Gerson, Lucas C. Parra, and Paul Sajda. Cortical origins of response time variability during rapid discrimination of visual objects. NeuroImage, 28(2).
[7] Robert Mansel Gower, Nicolas Le Roux, and Francis R. Bach. Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. In International Conference on Artificial Intelligence and Statistics.
[8] Leonard Hasenclever, Stefan Webb, Thibaut Lienart, Sebastian Vollmer, Balaji Lakshminarayanan, Charles Blundell, and Yee Whye Teh. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server. Journal of Machine Learning Research, 18(106):1-37.
[9] José Miguel Hernández-Lobato, Yingzhen Li, Mark Rowland, Daniel Hernández-Lobato, Thang D. Bui, and Richard E. Turner. Black-box alpha-divergence minimization. In International Conference on Machine Learning.
[10] Tom Heskes and Onno Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[11] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1).
[12] Yingzhen Li and Richard E. Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems.
[13] Jason Liao and Khew-Voon Chin. Logistic regression for disease classification using microarray data. Bioinformatics, 23(15).
[14] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems.
[15] Peter S. Maybeck and George M. Siouris. Stochastic models, estimation, and control. IEEE Transactions on Systems, Man, and Cybernetics, 10(5).
[16] Thomas Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[17] Thomas Minka. Power EP. Technical report, Microsoft Research.
[18] Thomas Minka. Divergence measures and message passing. Technical report, Microsoft Research, pages 1-17.
[19] Hannes Nickisch and Matthias W. Seeger. Convex variational Bayesian inference for large scale generalized linear models. In International Conference on Machine Learning.


Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Deterministic Approximation Methods in Bayesian Inference

Deterministic Approximation Methods in Bayesian Inference Deterministic Approximation Methods in Bayesian Inference Tobias Plötz Department of Computer Science Technical University of Darmstadt 64289 Darmstadt t_ploetz@rbg.informatik.tu-darmstadt.de Abstract

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

Bias-Variance Tradeoff

Bias-Variance Tradeoff What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Local Expectation Gradients for Doubly Stochastic. Variational Inference Local Expectation Gradients for Doubly Stochastic Variational Inference arxiv:1503.01494v1 [stat.ml] 4 Mar 2015 Michalis K. Titsias Athens University of Economics and Business, 76, Patission Str. GR10434,

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Black-Box α-divergence Minimization

Black-Box α-divergence Minimization Black-Box α-divergence Minimization José Miguel Hernández-Lobato 1 JMH@SEAS.HARVARD.EDU Yingzhen Li 2 YL494@CAM.AC.UK Mark Rowland 2 MR504@CAM.AC.UK Daniel Hernández-Lobato 3 DANIEL.HERNANDEZ@UAM.ES Thang

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17

MCMC for big data. Geir Storvik. BigInsight lunch - May Geir Storvik MCMC for big data BigInsight lunch - May / 17 MCMC for big data Geir Storvik BigInsight lunch - May 2 2018 Geir Storvik MCMC for big data BigInsight lunch - May 2 2018 1 / 17 Outline Why ordinary MCMC is not scalable Different approaches for making

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

More information

Expectation propagation as a way of life

Expectation propagation as a way of life Expectation propagation as a way of life Yingzhen Li Department of Engineering Feb. 2014 Yingzhen Li (Department of Engineering) Expectation propagation as a way of life Feb. 2014 1 / 9 Reference This

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Auto-Encoding Variational Bayes

Auto-Encoding Variational Bayes Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower

More information

The Variational Gaussian Approximation Revisited

The Variational Gaussian Approximation Revisited The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much

More information

Expectation Propagation in Factor Graphs: A Tutorial

Expectation Propagation in Factor Graphs: A Tutorial DRAFT: Version 0.1, 28 October 2005. Do not distribute. Expectation Propagation in Factor Graphs: A Tutorial Charles Sutton October 28, 2005 Abstract Expectation propagation is an important variational

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Stochastic Variational Inference

Stochastic Variational Inference Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Expectation propagation for signal detection in flat-fading channels

Expectation propagation for signal detection in flat-fading channels Expectation propagation for signal detection in flat-fading channels Yuan Qi MIT Media Lab Cambridge, MA, 02139 USA yuanqi@media.mit.edu Thomas Minka CMU Statistics Department Pittsburgh, PA 15213 USA

More information

Advances in Variational Inference

Advances in Variational Inference 1 Advances in Variational Inference Cheng Zhang Judith Bütepage Hedvig Kjellström Stephan Mandt arxiv:1711.05597v1 [cs.lg] 15 Nov 2017 Abstract Many modern unsupervised or semi-supervised machine learning

More information

arxiv: v3 [stat.ml] 1 Jun 2016

arxiv: v3 [stat.ml] 1 Jun 2016 Black-Box α-divergence Minimization arxiv:5.03243v3 [stat.ml] Jun 206 José Miguel Hernández-Lobato JMH@SEAS.HARVARD.EDU Yingzhen Li 2 YL494@CAM.AC.UK Mark Rowland 2 MR504@CAM.AC.UK Daniel Hernández-Lobato

More information

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart

Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart 1 Motivation and Problem In Lecture 1 we briefly saw how histograms

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

Bayesian Deep Learning

Bayesian Deep Learning Bayesian Deep Learning Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp June 06, 2018 Mohammad Emtiyaz Khan 2018 1 What will you learn? Why is Bayesian inference

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

Variational Scoring of Graphical Model Structures

Variational Scoring of Graphical Model Structures Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational

More information

Predictive Variance Reduction Search

Predictive Variance Reduction Search Predictive Variance Reduction Search Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, Svetha Venkatesh Centre of Pattern Recognition and Data Analytics (PRaDA), Deakin University Email: v.nguyen@deakin.edu.au

More information

Variational Learning : From exponential families to multilinear systems

Variational Learning : From exponential families to multilinear systems Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

More information

Variational Autoencoder

Variational Autoencoder Variational Autoencoder Göker Erdo gan August 8, 2017 The variational autoencoder (VA) [1] is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational

More information

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions

Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, Masashi Sugiyama Conference on Uncertainty

More information

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Variational Inference. Sargur Srihari

Variational Inference. Sargur Srihari Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms

More information

Approximate Bayesian inference

Approximate Bayesian inference Approximate Bayesian inference Variational and Monte Carlo methods Christian A. Naesseth 1 Exchange rate data 0 20 40 60 80 100 120 Month Image data 2 1 Bayesian inference 2 Variational inference 3 Stochastic

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Probabilistic machine learning group, Aalto University Bayesian theory and methods, approximative integration, model

Probabilistic machine learning group, Aalto University  Bayesian theory and methods, approximative integration, model Aki Vehtari, Aalto University, Finland Probabilistic machine learning group, Aalto University http://research.cs.aalto.fi/pml/ Bayesian theory and methods, approximative integration, model assessment and

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a preprint version which may differ from the publisher's version. For additional information about this

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information