The χ-divergence for Approximate Inference

Size: px
Start display at page:

Download "The χ-divergence for Approximate Inference"

Transcription

1 Adji B. Dieng 1 Dustin Tran 1 Rajesh Ranganath 2 John Paisley 1 David M. Blei 1 CHIVI enjoys advantages of both EP and KLVI. Like EP, it produces overdispersed approximations; like KLVI, it oparxiv:161328v2 [stat.ml] 27 Feb 217 Abstract Variational inference enables Bayesian analysis for complex probabilistic models with massive data sets. It posits a family of approximating distributions and finds the member closest to the posterior. While successful, variational inference methods can run into pathologies; for example, they typically underestimate posterior uncertainty. In this paper we propose CHIVI, a complementary algorithm to traditional variational inference. CHIVI is a black box algorithm that minimizes the χ-divergence from the posterior to the family of approximating distributions and provides an upper bound of the model evidence. We studied CHIVI in several scenarios. On Bayesian probit regression and Gaussian process classification it yielded better classification error rates than expectation propagation (EP) and classical variational inference (VI). When modeling basketball data with a Cox process, it gave better estimates of posterior uncertainty. Finally, we show how to use the CHIVI upper bound and classical VI lower bound to sandwich estimate the model evidence. 1. Introduction Bayesian analysis provides a foundation for reasoning with probabilistic models (Bishop, 26; Murphy, 212; Barber, 212; Gelman et al., 214a). We first set a joint distribution p(x, z) of latent variables z and observed variables x. We then analyze data through the posterior, p(z x) = p(x, z) p(x). In typical applications, this posterior is difficult to compute because the marginal likelihood p(x) is intractable. This necessitates approximate posterior inference methods such as Monte Carlo (Robert & Casella, 24) and varia- 1 Columbia University, New York, NY, USA 2 Princeton University, Princeton, NJ, USA. Correspondence to: Adji B. Dieng <abd2141@columbia.edu>. tional inference (Jordan et al., 1999a; Wainwright & Jordan, 28). This paper focuses on variational inference. Variational inference approximates the posterior through optimization. The idea is to posit a family of approximating distributions and then to find the member of the family that is closest to the posterior. Typically, closeness is defined by the Kullback-Leibler divergence KL(q p), where is the variational family indexed by parameters λ. This approach, which we call KLVI, also provides the evidence lower bound (ELBO), a convenient lower bound of the model evidence log p(x). KLVI scales well (Hoffman et al., 213) and is suited to applications that use complex models to analyze large data sets. But it also has drawbacks. For one, it tends to favor underdispersed approximations relative to the exact posterior (Murphy, 212; Bishop, 26). Further, it faces difficulties with light-tailed posteriors when the variational distribution has heavier tails. For example, KLVI for Gaussian process classification uses a Gaussian approximating family; this leads to unstable optimization and a poor approximation (Hensman et al., 214). One alternative to KLVI is expectation propagation (EP), which enjoys good empirical performance on models with light-tailed posteriors (Minka, 21a; Kuss & Rasmussen, 25). Procedurally, EP reverses the arguments in the Kullback-Leibler (KL) divergence and performs local minimizations of KL(p q); this corresponds to iterative moment matching using partitions of the data set. 
Relative to KLVI, EP produces overdispersed approximations. But EP also has drawbacks. It is not guaranteed to converge (Minka, 21b, Figure 3.6); it does not provide an easy estimate of the marginal likelihood; and it does not optimize a well-defined global objective (Beal, 23). In this paper we develop χ-divergence variational inference (CHIVI), a variational inference algorithm that minimizes the χ-divergence between the variational family and the exact posterior. It is [( p(z x) ) 2 ] D χ 2(p q) = E q(z;λ) 1. (1)

2 timizes a well-defined objective and produces an approximation of the evidence. As we mentioned, KLVI optimizes a lower bound on the model evidence. The idea behind CHIVI is to optimize an upper bound, which we call the chi upper bound (CUBO), where minimizing the CUBO is equivalent to minimizing the χ-divergence. In providing an upper bound, CHIVI complements KLVI. For example, the CUBO and ELBO together to give sandwich estimates of the model evidence (See Figure 2). Sandwich estimates are useful for tasks like model selection (MacKay, 1992; Raftery, 1995), where lower bounds alone do not provide enough information. In more detail, CHIVI is a stochastic optimization algorithm that computes unbiased noisy gradients of the exponentiated CUBO. An advantage of this strategy is that it provides a black-box inference algorithm (Ranganath et al., 214). This means CHIVI does not need model-specific derivations; it only requires sampling from an approximating distribution, evaluating the model s complete log likelihood log p(x, z), and evaluating the score function of the approximating family λ log. Thus it is easy to apply to a wide class of models. Related work. Variational inference was originally developed in the 199s, adapting ideas from statistical physics to derive methods for approximate Bayesian inference (Hinton & Van Camp, 1993; Waterhouse et al., 1996; Jordan et al., 1999b). Though the most widely studied variational objective is KL(q p), there has been work on alternatives. The main alternative is EP, proposed by Opper & Winther (2) and Minka (21a), which locally minimizes the KL(p q). Recent work revisits EP from the perspective of distributed computing (Gelman et al., 214b; Xu et al., 214; Teh et al., 215; Li et al., 215) and also revisits Minka (24), which studies local minimizations with the general family of α-divergences (Minka, 24; Hernández- Lobato et al., 215; Li & Turner, 216). CHIVI relates to EP and its extensions in that it leads to overdispersed approximations relative to KLVI. However, unlike Minka (24); Hernández-Lobato et al. (215), CHIVI does not rely on tying local factors; it optimizes a well-defined global objective. In this sense, CHIVI relates to the recent work on alternative divergence measures for variational inference (Li & Turner, 216; Ranganath et al., 216), but with a focus on the χ-divergence. The closest related work is Li & Turner (216). They perform black-box variational inference using the reverse α-divergence D α (q p) which is a valid divergence 1 only when α >. However, no positive value 1 Satisfies D(p q) and D(p q) = p = q a.e. of α in D α (q p) leads to the χ-divergence. Minimizing D α (q p) is equivalent to maximizing the VR-bound 2 which is a lower bound of the model evidence. We minimize the χ-divergence, a monotonic function of the direct α-divergence D α (p q). This is equivalent to minimizing the CUBO, upper bound to the model evidence. Furthermore, we provide a different black-box algorithm for minimizing these upper bounds using Monte Carlo gradients of the exponentiated bound to reduce bias. In this sense, our work is complementary to the work of Li & Turner (216). The rest of the paper is organized as follow: In Section 2 we briefly review variational inference before presenting the χ-divergence and it s zero-avoiding property that makes it produce overdispersed posterior approximations. In Section 2 we also derive the CUBO and its extensions. 
These are a family of upper bounds to the model evidence that enable approximate posterior inference with the χ-divergence. We propose and prove the sandwich theorem that relates CUBO and ELBO for sandwich-estimating the model evidence. We close Section 2 by proposing CHIVI, a scalable black box variational inference algorithm that uses unbiased noisy gradients of the exponentiated CUBO. Section 3 illustrates the performance of CHIVI on two classification problems, Bayesian probit regression and Gaussian process classification using benchmark datasets, and on a Cox process model for basketball data from the National Basketball Association (NBA) season. When compared to KLVI and EP, we find that CHIVI often produces better error rates and more accurate estimates of posterior uncertainty. Finally we conclude and explore some extensions and relationships of this work to a f-divergence minimization framework and importance sampling in Section χ-divergence Variational Inference We present the χ-divergence for variational inference. We describe some of its properties and develop CHIVI, a black box algorithm that minimizes the χ-divergence for a large class of models Variational Inference and the χ-divergence Variational inference (VI) casts Bayesian inference as optimization (Jordan et al., 1999b; Wainwright & Jordan, 28). VI posits a family of approximating distributions and finds the closest member to the posterior. In its typical formulation, VI minimizes the Kullback- Leibler divergence from to p(z x). This divergence is computationally intractable because it involves the 2 This is name of the family of lower bounds provided in Li & Turner (216)

3 Figure 1. We consider the posterior (red) as a mixture of two Gaussians, and the variational family (blue) is a Gaussian. From left to right: behavior of the divergences KL(q p) and χ n for n = 1.1, 2., and 5.. KL(q p) is mode-seeking, and χ for increasing n favors more overdispersed approximations. posterior. Fortunately, minimizing KL(q p) is equivalent to maximizing a tractable alternative, [ p(x, z) ] ELBO(λ) = E q(z;λ) log. (2) This objective is known as the evidence lower bound (ELBO), and we term methods that maximize it KLVI. The ELBO is not only a tractable objective but also a lower bound to the model evidence log p(x). Maximizing the ELBO imposes properties on the resulting approximate posterior such as underestimation of its support (Murphy, 212; Bishop, 26); these properties may be undesirable, especially when dealing with light-tailed posteriors as is the case in Gaussian process classification. As an alternative, we consider the χ-divergence (Equation 1). CHIVI seeks to minimize this divergence with respect to the variational parameters λ. Like KL(q p), this objective depends on the posterior. We derive a tractable proxy in Section 2.3, whose optimization is equivalent to optimizing Equation 1. Moreover, this tractable objective is an upper bound on log p(x). Minimizing the χ-divergence induces useful properties on the approximate posterior distribution such as a zeroavoiding behavior. This property leads to the overestimation of the posterior s support. (See Appendix A.5 for more details on all these properties.) We emphasize the main propoerty below Zero-avoiding behavior Optimizing the χ-divergence leads to a variational distribution with a zero-avoiding behavior. Indeed the χ- divergence is infinite whenever = and p(z x) >. Therefore during optimization p(z x) > will force >. This means q avoids having zero mass at places where p has nonzero mass. Notice that the classical objective KL(q p) leads to approximate posteriors with the opposite behavior, called zero-forcing. Indeed KL(q p) is infinite when p(z x) = and >. Therefore the optimal variational distribution q will be when p(z x) =. This zero-forcing behavior may lead to degenerate solutions during optimization, for example when the approximating family q has heavier tails than the target posterior p. This is because variational distributions with the zero-forcing behavior are overconfident and underestimate the true posterior s support. The χ-divergence, similar to KL(p q), does not suffer from this. The zero-avoiding behavior forces its posterior approximations to overestimate the support of the true posterior distribution (Minka, 25). To gain more intuition on this important property, we explore a simple scenario. Consider the extension of the χ-divergence to the family of χ n -divergences for n > 1, [( p(z x) D χ n(p q) = E q(z;λ) 1. This is a valid divergence for any n > 1 3. Figure 1 shows that varying n in the χ n -divergence provides an explicit knob for controlling this zeroing behavior. KL(q p) favors the mixture component with the highest weight and underestimates the posterior s support. D χ 2(p q) also picks the component with highest weight but it overestimates the posterior s support. For n < 2, D χ n(p q) tries to find a middleground between the two mixture components. This is because when n = 1.1 p(z x) D χ 1.1(p q) = E q [( q(z;λ) )1.1 ]; this weakly penalizes not putting high mass at the mode of p. 
When n > 2, D χ n(p q) penalizes placing mass where p is not at its highest and thus favors the mode CUBO: the chi upper bound We derive a tractable objective for variational inference with the χ 2 -divergence and also generalize it to the χ n - divergence for n > 1. Consider the optimization problem of minimizing Equation 1. We seek to find a relationship between the χ 2 -divergence and log p(x). We take the following steps: [( p(z x) ) 2 ] E q(z;λ) = 1 + D χ 2(p(z x) ) [( p(x, z) ) 2 ] E q(z;λ) = p(x) 2 [1 + D χ 2(p(z x) )] Taking logarithms on both sides, we find a relationship analogous to how KL(q p) relates to the ELBO. Namely, 3 D χ n(p q) by Jensen s inequality and D χ n(p q) = p = q a.e.

4 the χ 2 -divergence satisfies 1 2 log(1 + D χ2(p(z x) )) = log p(x) + 1 [( p(x, z) ) 2 ] 2 log E q(z;λ). By monotonicity of log, and because log p(x) is constant, minimizing the χ 2 -divergence is equivalent to minimizing: L χ 2(λ) = 1 [( p(x, z) ) 2 ] 2 log E q(z;λ). Furthermore, by nonnegativity of the χ 2 -divergence, this quantity is an upper bound to the model evidence, log p(x) 1 2 log E q(z;λ) [( p(x, z) ) 2 ] = L χ 2(λ). We call this objective the chi upper bound (CUBO). A general upper bound. This derivation also follows for the χ n -divergence. The general upper bound is L χ n(λ) = 1 n log E q(z;λ) [( p(x, z) = CUBO n. (3) We have produced a family of bounds: When n < 1, CUBO n is a lower bound and minimizing it for these values of n does not minimize the χ-divergence( rather, when n < 1, we recover the α-divergence and the VR-bound (Li & Turner, 216). The bound is tight for n = 1, CUBO 1 = log p(x). For any n 1, CUBO n is an upper bound to the model evidence. We focus on n = 2. Sandwiching the model evidence. Equation 3 has practical value. We can simultaneously minimize the CUBO n and maximize the ELBO. This produces a sandwich on the model evidence, ELBO log p(x) CUBO n. (See Appendix A.8 for a simulated illustration.) Sandwiching can be used to better approximate the model evidence, perform model selection, and assess convergence. Comparing models using only lower bounds, i.e the ELBO is unreliable. The following sandwich theorem states that the gap induced by CUBO n and ELBO increases with n. This suggests that letting n as close to 1 as possible enables approximating log p(x) with higher precision. When we further decrease n to, CUBO n becomes a lower bound and tends to the ELBO. Theorem 1 (Sandwich Theorem): Define CUBO n and ELBO as in Equation 3 and Equation 2. CUBO n is a non-decreasing function of the order n of the χ-divergence. Furthermore, when n is allowed to go to zero, the upper bound becomes a lower bound and more specifically: lim n CUBO n = ELBO. See proof in Appendix A.1. This theorem has many implications since estimating log p(x) is important for many applications, such as the evidence framework (MacKay, 23), where the marginal likelihood is argued to embody an Occam s razor. It can also help estimate Bayes factors (Raftery, 1995). However, few works have analyzed its sandwich estimation (Grosse et al., 215). We study our variational approach to sandwich estimation of the model evidence in Section Optimizing the CUBO We derived the CUBO, an upper bound to the model evidence that can be used to minimize the χ-divergence. We now develop CHIVI, a black box algorithm that minimizes the CUBO n. The goal in CHIVI is to minimize the CUBO n with respect to variational parameters, CUBO n (λ) = 1 n log E q(z;λ) [( p(x, z), The expectation in the CUBO n is usually intractable. Thus we use Monte Carlo to construct stochastic gradients. One approach is to naively perform Monte Carlo on this objective, CUBO n (λ) 1 n log 1 S S s=1 [( p(x, z (s) ), q(z (s) ; λ) for S samples z (1),..., z (S). However, by Jensen s inequality, the log transform of the expectation implies that this is a biased estimate of CUBO n (λ). Gradients of this estimate are also biased. We use a different approach than Li & Turner (216) and consider the objective L = exp{n CUBO n (λ)}. By monotonicity of the exponential function, this objective admits the same optima as CUBO n (λ). We minimize it using its reparameterization gradients (Kingma & Welling, 214; Rezende et al., 214). 
These gradients apply to models with differentiable latent variables and have lower variance. More formally, assume z = g(λ, ɛ) where ɛ p(ɛ). Then ˆL = 1 B ( p(x, g(λ, ɛ (b) )) ) n B q(g(λ, ɛ (b) ); λ) b=1

5 is an unbiased estimator of L and its gradient is Algorithm 1: Scalable CHIVI for massive datasets n B ( p(x, g(λ, ɛ λ ˆL (b) )) ) n λ ( p(x, g(λ, ɛ (b) )) ) = log. B q(g(λ, ɛ (b) ); λ) q(g(λ, ɛ (b) ); λ) Input: Data x, Model p(x, z), Variational family. b=1 (4) Output: Variational parameters λ. Initialize λ randomly. (See Appendix A.7 for a more detailed derivation of both the score function gradient and the reparameterization gradient). Draw S samples z (1),..., z (S) from q(z; while not converged do λ). Computing Equation 4 requires all the data x. This does not scale to massive data sets. In such a setting, we can apply the average likelihood technique from EP (Li et al., 215; Dehaene & Barthelmé, 215). Consider data {x 1,..., x N }. Define the likelihood factor f i (z) = p(x i z). Now consider a data subsample, {x i1,..., x im }. The subsampled likelihood is f 1:M (z) = M f ij (z). j=1 We approximate the full likelihood by multiplying the subsampled likelihood with itself N M times Subsample data points {x i1,..., x im }. Compute the corresponding average likelihoods f M (z (1) ),..., fm (z (S) ). Set ρ t from a Robbins-Monro sequence. Set w (s) = p(z(s) ) f M (z (s) ) N q(z (s) ;λ t) Set c = max log w (s). s, s {1,..., S}. Set w (s) = exp(log w (s) c), s {1,..., S}. Update λ t+1 = [( ) n λ ] λ t (1 n) ρt S S s=1 w (s) log q(z (s) ; λ t ). end p(x z) f 1:M (z) N M. Using this proxy to the full dataset we derive CHIVI, an algorithm in which each iteration depends on only a minibatch of data. CHIVI is a black box algorithm for performing approximate inference with the χ n -divergence. Algorithm 1 summarizes the procedure. In practice, we subtract the maximum of the logarithm of the importance weights, defined as p(x, z) w =. to avoid underflow. Stochastic optimization theory still gives us convergence guarantees with this aproach (Sunehag et al., 29; Robbins & Monro, 1951). 3. Empirical Study We study CHIVI as an approximate inference algorithm and also as a means for model selection by sandwich estimating the model evidence. First, we study Bayesian probit regression with benchmark datasets where we compare the predictive performance of CHIVI against EP and KLVI where we also illustrate the sandwich gap for model selection. Second, we compare CHIVI to Laplace and EP on a model class for which EP is the method of choice: Gaussian process classification. Third, we analyze Cox processes, a type of spatial point process, to compare profiles of different NBA basketball players. We find CHIVI has better predictive power and yields better posterior uncertainty estimates. We also illustrate the sandwich of the model evidence for UCI datasets using mainly n = 2. All experiments were implemented in Edward (Tran et al., 216) Bayesian Probit Regression We analyze inference for Bayesian probit regression. First, we illustrate sandwich estimation on UCI datasets. Figure 2 illustrates the bounds of the log marginal likelihood given by the ELBO and the CUBO. Using both quantities provides a reliable approximation of the model evidence. The tightness of the gap depends on the order of the divergence n. When n gets close to 1 which corresponds to exact inference a large number of samples from the variational distribution is needed. In addition, these figures show convergence for CHIVI, which EP does not always satisfy. We also compared the predictive performance of CHIVI, EP, and KLVI. For large datasets, we apply Algorithm 1 with a minibatch size of 64 and 2 iterations for each batch. 
We computed the average classification error rate and the standard deviation using 5 random splits of the data. We split all the datasets with 9% of the data for training and 1% for testing. For the Covertype dataset, we implemented Bayesian probit regression to discriminate the class 1 against all other classes. Table 1 shows the average

6 Table 1. Test error for Bayesian probit regression. The lower the better. CHIVI (this paper) yields lower test error rates when compared to BBVI (Ranganath et al., 214) and EP on most datasets. Dataset BBVI EP CHIVI Pima 35 ± 6 34 ± 6 22 ± 48 Ionos.123 ± ± ± 5 Madelon 57 ± 5 45 ± 5 53 ± 29 Covertype.157 ± ± ± 14 Table 2. Test error for Gaussian process classification. The lower the better. CHIVI (this paper) yields lower test error rates when compared to Laplace and EP on most datasets. Dataset Laplace EP CHIVI Crabs ± 3 Pima - 45 ± ± 35 Sonar ± 35 Ionos 84 8 ± 4 69 ± 34 Heart ± 59 error rate for KLVI (as implemented by black box variational inference (BBVI)), EP, and CHIVI. CHIVI performs better for all but one dataset Gaussian Process Classification Gaussian process (GP) classification is an alternative to probit regression. The posterior is analytically intractable because the likelihood is not conjugate to the prior. Moreover, the posterior tends to be skewed. EP has been the method of choice for approximating the posterior (Kuss & Rasmussen, 25). We choose a factorized Gaussian for the variational distribution q and fit its mean and the log variance to avoid negative variances during optimization. With UCI benchmark datasets, we compared the predictive performance of CHIVI to EP and Laplace. Table 2 summarizes the results. The error rates for CHIVI correspond to the average of 1 error rates obtained by dividing the data into 1 folds, applying CHIVI to 9 folds to learn the variational parameters and performing prediction on the remainder. The kernel hyperparameters were chosen using grid search. The error rates for the other methods correspond to the best results reported in (Kuss & Rasmussen, 25) and (Kim & Ghahramani, 23). On all the datasets CHIVI performs as well or better than EP and Laplace Cox Processes Finally we study Cox processes. They are Poisson processes with stochastic rate functions. They capture dependence between the frequency of points in different regions Curry Demarcus Lebron Duncan CHIVI BBVI Table 3. Average L 1 error for posterior uncertainty estimates (ground truth from HMC). We find that CHIVI is similar to or better than BBVI at capturing posterior uncertainties. Demarcus Cousins, who plays center, stands out in particular. His shots are concentrated near the basket, so the posterior is uncertain over a large part of the court Figure 3. of a space. We apply Cox processes to model the spatial locations of shots (made and missed) from the NBA season; see also Miller et al. (214). The data are from 38 NBA players who took more than 15, shots in total. The n th player s set of M n shot attempts are x n = {x n,1,..., x n,mn }, and the location of the m th shot by the n th player in the basketball court is x n,m [ 25, 25] [, 4]. Let PP(λ) denote a Poisson process with intensity function λ, and K be a covariance matrix resulting from a kernel applied to every location of the court. The generative process for the n th player s shot is K i,j = k(x i, x j ) = σ 2 exp( 1 2φ 2 x i x j 2 ) f GP(, k(, )) and λ = exp(f) x n,k PP(λ) for k {1,..., M n }. The kernel of the Gaussian process encodes the spatial correlation between different areas of the basketball court. The model treats the N players as independent. But the kernel K introduces correlation between the shots attempted by a given player. Our goal is to infer the intensity functions λ(.) for each player. We compare the shooting profiles of different players using these inferred intensity surfaces. 
The results are shown in Figure 3. The shooting profiles of Demarcus Cousins and Stephen Curry are captured by both BBVI and CHIVI. BBVI has lower posterior uncertainty while CHIVI provides more overdispersed solutions. We plot the profiles for two more players, LeBron James and Tim Duncan, in the appendix. In Table 3, we compare the posterior uncertainty estimates of CHIVI and BBVI to that of HMC, a computationally expensive Markov chain Monte Carlo procedure that we treat as exact. We use the average L 1 distance from HMC as error measure. We do this on four different players: Stephen Curry, Demarcus Cousins, LeBron James, and Tim Duncan. We find that CHIVI is similar or better than BBVI, especially on players like Demarcus Cousins who shoot in a limited part of the court.

7 Sandwich Plot Using CHIVI and BBVI On Ionosphere Dataset upper bound lower bound 1.5 Sandwich Plot Using CHIVI and BBVI On Ionosphere Dataset upper bound lower bound 1.5 Sandwich Plot Using CHIVI and BBVI On Heart Dataset upper bound lower bound 1.5 Sandwich Plot Using CHIVI and BBVI On Crabs Dataset upper bound lower bound objective objective objective objective epoch epoch epoch epoch Figure 2. Sandwich gap via CHIVI and BBVI on different datasets. The first two plots correspond to a divergence with order 2 and 1.2 respectively on the Ionosphere dataset. As mentioned in our theoretical analysis and in the simulations in the appendix, the gap tightens when n 1. However there is a trade-off between tightening the gap and computational efficiency: when n 1 more samples from the variational distribution are needed for the algorithm to converge. The last two plots correspond to n = 2 for Heart and Crabs datasets respectively. More sandwich plots can be found in the appendix. 4. Discussion and Extensions We described CHIVI, a black box algorithm that minimizes the χ-divergence by minimizing the CUBO. We now describe how this algorithm can be extended to optimize f- divergences and to find an optimal proposal f-divergences The χ-divergence is a member of the general f-divergence family (Csiszár & Shields, 24). An f-divergence has the form ( p(x) ) D f (p q) = f q(x) dx, q(x) where f is a convex function such that f(1) =. For example, the divergence KL(q p) corresponds to choosing f(x) = log x and the divergence KL(p q) corresponds to f(x) = x log x. The α-divergence family is a subfamily of this larger family of divergences. The χ n -divergence corresponds to f(x) = x n 1. A key property is that any f-divergence can be rewritten as a Taylor sum of χ-divergences (Nielsen & Nock, 214). Expanding around a point r in the domain of f, 1 ( p(x) ) n D f (p q) = q(x) n! f (n) (r ) q(x) r dx = n= n= 1 n! f (n) (x ) χ n r (p q), where χ n r (p q) is a higher-order χ-divergence. CHIVI can be extended to approximately minimize any f-divergence at a given truncation level. As one example, the above equation implies that the χ 2 -divergence can be interpreted (up to proportion) as a second-order Taylor approximation of KL(p q). If desired, incorporating higher-order χ-divergences for posterior inference can better mimic properties of KL(p q), such as moment matching Importance sampling The χ-divergence also has deep connections to importance sampling (Minka, 25). Consider estimating the marginal likelihood p(x, z) p(x) = p(x, z) dz = q(z) dz q(z) using a proposal distribution q(z). We d like to learn the optimal proposal among a family parameterized by λ. The importance-sampled estimate of p(x) is p(x) = 1 S S s=1 p(x, z (s) ) q(z (s) ; λ), The variance of this estimator is Var( p(x)) = 1 S ( z(1),..., z (s). [( p(x, z (1) ) ) 2 ] E q(z;λ) p(x) 2). q(z (1) ; λ) One approach to choose is to find parameters which minimize the variance. Formally, this is equivalent to finding the minimum-variance unbiased estimator. Dropping constant terms, this is equivalent to minimizing the χ 2 -divergence. This idea originates from adaptive importance sampling based on maximizing the effective sample size (Kong et al., 1994; Cappé et al., 28) and has recently seen renewed interest in the context of online learning (Bouchard et al., 215) Summary We derived CHIVI, a black box algorithm for doing variational inference with the χ-divergence. 
We showed that CHIVI is an effective algorithm for Bayesian probit regression, Gaussian process classification, and Cox processes. We also showed how to use CHIVI in concert with KLVI to sandwich-estimate the model evidence.

8 The χ-divergence for Approximate Inference Curry Shot Chart Curry Posterior Intensity (KLQP) 225 Demarcus Shot Chart Curry Posterior Intensity (Chi) Curry Posterior Intensity (HMC) Demarcus Posterior Intensity (KLQP) 36 Demarcus Posterior Intensity (Chi) Demarcus Posterior Intensity (HMC) Curry Posterior Uncertainty (KLQP) Curry Posterior Uncertainty (Chi) Curry Posterior Uncertainty (HMC) Demarcus Posterior Uncertainty (KLQP) Demarcus Posterior Uncertainty (Chi) Demarcus Posterior Uncertainty (HMC) Figure 3. Basketball players shooting profiles as inferred by BBVI (Ranganath et al., 214), CHIVI (this paper), and HMC. The top row displays the raw data, consisting of made shots (green) and missed shots (red). The second and third rows display the posterior intensities inferred by BBVI, CHIVI, and HMC for Stephen Curry and Demarcus Cousins respectively. Both BBVI and CHIVI capture the shooting behavior of both players in terms of the posterior mean. The fourth and fifth rows display the posterior uncertainty inferred by BBVI, CHIVI, and HMC for Stephen Curry and Demarcus Cousins respectively. CHIVI tends to get higher posterior uncertainty for both players in areas where data is scarce compared to BBVI. This illustrates the variance underestimation problem of KLVI, which is not the case for CHIVI.

9 References Barber, D. Bayesian Reasoning and Machine Learning. Cambridge University Press, 212. Beal, M. J. Variational algorithms for approximate Bayesian inference. University of London, 23. Bishop, C. M. Pattern recognition. Machine Learning, 128, 26. Bouchard, G., Trouillon, T., Perez, J., and Gaidon, A. Online Learning to Sample. arxiv preprint arxiv: , 215. Cappé, O., Douc, R., Guillin, A., Marin, J. M., and Robert, C. P. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4): , 28. Csiszár, I. and Shields, P. C. Information Theory and Statistics: A Tutorial. Now Publishers Inc, 24. Dehaene, G. and Barthelmé, S. Expectation propagation in the large-data limit. In Neural Information Processing Systems, 215. Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. Chapman & Hall/CRC, 3rd edition, 214a. Gelman, A., Vehtari, A., Jylänki, P., Robert, C.n, Chopin, N., and Cunningham, J. P. Expectation propagation as a way of life. arxiv preprint arxiv: , 214b. Grosse, R. B., Ghahramani, Z., and Adams, R. P. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arxiv.org, November 215. Hensman, J., Zwießele, M., and Lawrence, N. D. Tilted variational Bayes. The Journal of Machine Learning Research, 214. Hernández-Lobato, J. M., Li, Y., Hernández-Lobato, D., Bui, T., and Turner, R. E. Black-box α-divergence minimization. arxiv preprint, 215. Hinton, G. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Computational Learning Theory, pp ACM, Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14: , 213. Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. Introduction to variational methods for graphical models. Machine Learning, 37: , 1999a. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2): , 1999b. Kim, H. and Ghahramani, Z. The em-ep algorithm for gaussian process classification. In Proceedings of the Workshop on Probabilistic Graphical Models for Classification (ECML), pp , 23. Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 214. Kong, A., Liu, J. S., and Wong, W. H. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425): , Kuss, M. and Rasmussen, C. E. Assessing approximate inference for binary Gaussian process classification. The Journal of Machine Learning Research, 6: , 25. Li, Y. and Turner, R. E. Variational inference with Rényi divergence. arxiv preprint arxiv: , 216. Li, Y., Hernández-Lobato, J. M., and Turner, R. E. Stochastic Expectation Propagation. In Neural Information Processing Systems, 215. MacKay, D. J. C. Bayesian interpolation. Neural computation, 4(3): , MacKay, D. J. C. Information Theory, Inference and Learning Algorithms. Cambridge university press, 23. Miller, A., Bornn, L., Adams, R., and Goldsberry, K. Factorized point process intensities: A spatial analysis of professional basketball. In ICML, pp , 214. Minka, T. Expectation propagation for approximate bayesian inference. In Uncertainty in Artificial Intelligence, pp Morgan Kaufmann Publishers Inc., 21a. Minka, T. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 21b. Minka, T. Power EP. 
Technical report, Microsoft Research, 24. Minka, T. Divergence measures and message passing. Technical report, Microsoft Research, 25. Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT press, 212. Nielsen, F. and Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):1 13, 214. Opper, M. and Winther, O. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12 (11): , 2.

10 Raftery, A. E. Bayesian model selection in social research. Sociological methodology, 25: , Ranganath, R., Gerrish, S., and Blei, D. M. Black box variational inference. In Artificial Intelligence and Statistics, 214. Ranganath, Rajesh, Altosaar, Jaan, Tran, Dustin, and Blei, David M. Operator variational inference. In Neural Information Processing Systems, 216. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, 214. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 4 47, Robert, C. and Casella, G. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, NY, 24. Sunehag, Peter, Trumpf, Jochen, Vishwanathan, SVN, Schraudolph, Nicol N, et al. Variable metric stochastic approximation theory. In AISTATS, pp , 29. Teh, Y. W., Hasenclever, L., Lienart, T., Vollmer, S., Webb, S., Lakshminarayanan, B., and Blundell, C. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server. arxiv preprint arxiv: , 215. Tran, Dustin, Kucukelbir, Alp, Dieng, Adji B, Rudolph, Maja, Liang, Dawen, and Blei, David M. Edward: A library for probabilistic modeling, inference, and criticism. arxiv preprint arxiv: , 216. Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends R in Machine Learning, 1(1-2):1 35, 28. Waterhouse, S., MacKay, D., and Robinson, T. Bayesian methods for mixtures of experts. In Neural Information Processing Systems, Xu, M., Lakshminarayanan, B., Teh, Y. W., Zhu, J., and Zhang, B. Distributed Bayesian posterior sampling via moment sharing. In Neural Information Processing Systems, 214. The χ-divergence for Approximate Inference

11 A. Supplementary Material A.1. Proof of Sandwich Theorem We denote by z the latent variable and x the data. Assume z R D. We first show that CUBO n is a nondecreasing function of the order n of the χ-divergence. Denote by the triplet (Ω, F, Q) the probability space induced by the variational distribution q where Ω is a subspace of R D, F is the corresponding Borel sigma algebra, and Q is absolutely continuous with respect to the Lebesgue measure µ and is such that dq(z) = q(z)dz. Define w = p(x,z) q(z). We can rewrite CUBO n as: CUBO n = 1 n log E q[w n ] ( ) = log (E q [w n ]) 1 n Since log is nondecreasing, it is enough to show n (E q [w n ]) 1 n is nondecreasing. This function is the L n norm in the space defined above: ( ) (E q [w n ]) 1 1 n = w n n dq Ω ( ) 1 = w n n q(z)dz This is a nondecreasing function of n by virtue of the Lyapunov inequality. We now show the second claim in the sandwich theorem, namely that the limit when n of CUBO n is the ELBO. Since CUBO n is a monotonic function of n and is bounded from below by ELBO, it admits a limit when n. Call this limit L. We show L = ELBO. On the one hand, since CUBO n ELBO for all n >, we have L ELBO. On the other hand, since log x x 1; x > we have CUBO n = 1 n log E q[w n ] 1 [ ] E q [w n ] 1 n Ω [ w n 1 ] = E q n f : n w n is differentiable [ ] and furthermore f w () = lim n 1 n n = log w. Therefore n > such that wn 1 n log w < 1 n < n. Since wn 1 n log w < wn 1 n log w, we have wn 1 n < 1 + log w which is E q-integrable. Therefore by Lebesgue s [ ] dominated [ convergence ] theorem: w lim n E n 1 w q n = E q lim n 1 n n = E q [log w] = ELBO. Since CUBO [ n converges ] when n w and CUBO n E n 1 q n n, we establish L [ ] w lim n E n 1 q n = ELBO. The conclusion follows. A.2. The CHIVI algorithm for small datasets Algorithm 2: CHIVI Input: Data x, Model p(x, z), Variational family. Output: Variational parameters λ. Initialize λ randomly. while not converged do Draw S samples z (1),..., z (S) from. Set ρ t from a Robbins-Monro sequence. Set log w (s) = log p(x, z (s) ) log q(z (s) ; λ t ), s {1,..., S}. Set c = max log w (s). s Set w (s) = exp(log w (s) c), s {1,..., S}. Update λ t+1 = [( ) n λ ] λ t (1 n) ρt S S s=1 w (s) log q(z (s) ; λ t ). end A.3. Approximately minimizing f-divergence with χ-divergence In this section we provide a proof that minimizing an f-divergence can be done by minimizing a sum of χ- divergences. Consider D f (p q) = f ( p(x) ) q(x)dx q(x) Without loss of generality assume f is analytic. The Taylor expansion of f around some point x is f(x) = f(x ) + f (x )(x x ) + Therefore i=2 f (i) (x ) (x x ) i i! ( [ p(x) ] ) D f (p q) = f(x ) + f (x ) E q(z λ) x q(x) [ f (i) (x ) ( p(x) ) i ] + E q(z λ) i! q(x) x i=2 = f(x ) + f (x )(1 x ) f (i) (1) [( p(x) ) i ] + E q(z λ) i! q(x) 1 i=2 where we switch summation and expectation by invoking Fubini s theorem.

12 In particular if we take x = 1 the linear terms are zero and we end up with: D f (p q) = = i=2 i=2 f (i) (1) [( p(x) ) i ] E q(z λ) i! q(x) 1 f (i) (1) D i! χ i(p q) If f is not analytic but k times differentiable for some k then the proof still holds considering the Taylor expansion of f up to the order k. A.4. Importance sampling In this section we establish the relationship between χ 2 - divergence minimization and importance sampling. Consider estimating the marginal likelihood I with importance sampling: I = p(x) = p(x, z)dz p(x, z) = q(z) q(z)dz = w(z)q(z)dz The Monte Carlo estimate of I is Î = 1 B B w(z (b) ) b=1 where z (1),..., z (B) q(z). The variance of Î is Var(Î) = 1 B [E q(z λ)(w(z (b) ) 2 ) (E q(z λ) (w(z (b) ))) 2 ] = 1 B [ E q(z λ) (( p(x, z (1) ) q(z (1) ) ) 2 ) p(x) 2] Therefore minimizing this variance is equivalent to minimizing the quantity (( p(x, z (1) ) ) 2 ) E q(z λ) q(z (1) ) which is equivalent to minimizing the χ 2 - divergence. A.5. General properties of the χ-divergence In this section we outline several properties of the χ- divergence. Conjugate symmetry Define f (u) = uf( 1 u ) to be the conjugate of f. f is also convex and satisfies f (1) =. Therefore Df (p q) is a valid divergence in the f-divergence family and: ( q(x) ) D f (q p) = f p(x)dx p(x) q(x) = p(x) f ( p(x) ) p(x)dx q(x) = D f (p q) D f (q p) is symmetric if and only if f = f which is not the case here. To symmetrize the divergence one can use D(p q) = D f (p q) + D f (p q) Invariance under parameter transformation. Let y = u(x) for some function u. Then by Jacobi p(x)dx = p(y)dy and q(x)dx = q(y)dy. x1 ( p(x) ) nq(x)dx D χ n(p(x) q(x)) = 1 x q(x) y1 ( p(y) dy ) nq(y)dy dx = 1 y q(y) dy dx y1 ( p(y) ) nq(y)dy = 1 q(y) y = D χ n(p(y) q(y)) Factorization for independent distributions. Consider taking p(x, y) = p 1 (x)p 2 (y) and q(x, y) = q 1 (x)q 2 (y). p(x, y) D χ n(p(x, y) q(x, y)) = dxdy q(x, y) n 1 p 1 (x) n p 2 (y) n = q 1 (x) n 1 dxdy q 2 (y) n 1 ( p 1 (x) n ) = q 1 (x) n 1 dx ( p 2 (y) n ) q 2 (y) n 1 dy = D χ n(p 1 (x) q 1 (x)) D χ n(p 2 (y) q 2 (y)) Therefore χ-divergence is multiplicative under independent distributions while KL is additive. Other properties. The χ-divergence enjoys some other properties that it shares with all members of the f- divergence family namely monotonicity with respect to the distributions and joint convexity. A.6. Derivation of the CUBO n In this section we outline the derivation of CUBO n, the upper bound to the marginal likelihood induced by the mini-

13 mization of the χ-divergence. By definition: [( p(z x) D χ n(p(z x) ) = E q(z;λ) 1 Following the derivation of ELBO, we seek an expression of log(p(x)) involving D χ n(p(z x) ). We achieve that as follows: [( p(z x) E q(z;λ) = 1 + D χ n(p(z x) ) [( p(x, z) E q(z;λ) = p(x) n [1 + D χ n(p(z x) )] This gives the relationship log p(x) = 1 [( p(x, z) n log E q(z;λ) 1 n log(1 + D χn(p(z x) )) log p(x) = CUBO n 1 n log(1 + D χn(p(z x) )) By nonnegativity of the divergence this last equation establishes the upper bound: A.7. Black Box Inference log p(x) CUBO n In this section we derive the score gradient and the reparameterization gradient for doing black box inference with the χ-divergence. CUBO n (λ) = 1 n log E q(z;λ) [( p(x, z) where λ is the set of variational parameters. To minimize CUBO n (λ) with respect to λ we need to resort to Monte Carlo. To minimize CUBO n (λ) we consider the equivalent minimization of exp{n CUBO(λ)}. This enables unbiased estimation of the noisy gradient used to perform black box inference with the χ-divergence. The score gradient The score gradient of our objective function is derived below: λ L = λ p(x, z) n 1 n dz = p(x, z) n λ 1 n dz = p(x, z) n (1 n) n λ dz p(x, z) = (1 n) ( )n λ dz = (1 n) p(x, z) ( )n λ log dz = (1 n)e q(z;λ) [( p(x, z) ) n λ log ] where we switched differentiation and integration by invoking Lebesgue s dominated convergence theorem. We estimate this gradient as was done in Paisley et al. with the unbiased estimator: (1 n) B B b=1 [( p(x, z (b) ) ) n λ log q(z (b) ; λ)] q(z (b) ); λ Reparameterization gradient The reparameterization gradient empirically has lower variance than the score gradient. We used it in our experiments. Denote by L the quantity exp{n CUBO} [( p(x, z) L = E q(z;λ) Assume z = g(λ, ɛ) where ɛ p(ɛ). Then ˆL = 1 B B b=1 ( p(x, g(λ, ɛ (b) )) ) n q(g(λ, ɛ (b) ); λ) is an unbiased estimator of L and its gradient is given by λ ˆL = n B = n B B b=1 B b=1 A.8. Simulation Studies ( p(x, g(λ, ɛ (b) )) ) n 1 λ ( p(x, g(λ, ɛ (b) )) ) q(g(λ, ɛ (b) ); λ) q(g(λ, ɛ (b) ); λ) ( p(x, g(λ, ɛ (b) )) ) n λ ( p(x, g(λ, ɛ (b) )) ) log. q(g(λ, ɛ (b) ); λ) q(g(λ, ɛ (b) ); λ) The following figures are results of various Monte Carlo simulations on the CUBO. L = exp{n CUBO(λ)}

14 (a) (a) (b) (b) Figure 4. Samples from the variational distribution resulting from a Bayesian neural network regression on synthetic data using KL(q p) (Figure 4a) and the χ-divergence Figure 4b. Note the overdispersion outside the [ 2, +2] region for the χ-divergence compared to KL(q p). (c) Figure 5. Sandwich gap via Monte Carlo simulations when the order of the χ-divergence is n = 4, n = 2, and n = 1.5 respectively. As we demonstrated theoretically, the gap closes as n decreases.

15 Sandwich Plot Using CHIVI and BBVI On Pima Dataset upper bound lower bound objective epoch Figure 6. Top row:sandwich plot on real (left) and synthetic (right) datasets. Bottom left: example of a situation where BBVI fails. The prior is Gaussian. The likelihood is a Heaviside function. The resulting posterior has light tails due to the truncation. BBVI fits a point mass while CHIVI fits a reasonable distribution. Bottom right: Monte Carlo simulation on CUBO n: the CUBO n is a nondecreasing function of the order n of the χ-divergence as we proved in Appendix A.1.

16 The χ-divergence for Approximate Inference James Shot Chart James Posterior Intensity (KLQP) 36 Duncan Shot Chart James Posterior Intensity (Chi) 32 4 James Posterior Intensity (HMC) Duncan Posterior Intensity (KLQP) Duncan Posterior Intensity (Chi) 8 Duncan Posterior Intensity (HMC) James Posterior Uncertainty (KLQP) James Posterior Uncertainty (Chi) James Posterior Uncertainty (HMC) Duncan Posterior Uncertainty (KLQP) Duncan Posterior Uncertainty (Chi) Duncan Posterior Uncertainty (HMC) Figure 7. More player profiles. Basketball players shooting profiles as inferred by BBVI (Ranganath et al., 214), CHIVI (this paper) and HMC. The top row displays the raw data, consisting of made shots (green) and missed shots (red). The second and third rows display the posterior intensities inferred by BBVI, CHIVI and HMC for Lebron James and Tim Duncan respectively. Both BBVI and CHIVI nicely capture the shooting behavior of both players in terms of their posterior mean.the fourth and fifth rows display the posterior uncertainty inferred by BBVI, CHIVI and HMC for Lebron James and Tim Duncan respectively. Here CHIVI and BBVI tend to get similar posterior uncertainty for Lebron James. CHIVI has better uncertainty for Tim Duncan.

Variational Inference via χ Upper Bound Minimization

Variational Inference via χ Upper Bound Minimization Variational Inference via χ Upper Bound Minimization Adji B. Dieng Columbia University Dustin Tran Columbia University Rajesh Ranganath Princeton University John Paisley Columbia University David M. Blei

More information

arxiv:submit/ [stat.ml] 12 Nov 2017

arxiv:submit/ [stat.ml] 12 Nov 2017 Variational Inference via χ Upper Bound Minimization arxiv:submit/26872 [stat.ml] 12 Nov 217 Adji B. Dieng Columbia University John Paisley Columbia University Dustin Tran Columbia University Abstract

More information

Black-box α-divergence Minimization

Black-box α-divergence Minimization Black-box α-divergence Minimization José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, Richard Turner, Harvard University, University of Cambridge, Universidad Autónoma de Madrid.

More information

An Overview of Edward: A Probabilistic Programming System. Dustin Tran Columbia University

An Overview of Edward: A Probabilistic Programming System. Dustin Tran Columbia University An Overview of Edward: A Probabilistic Programming System Dustin Tran Columbia University Alp Kucukelbir Eugene Brevdo Andrew Gelman Adji Dieng Maja Rudolph David Blei Dawen Liang Matt Hoffman Kevin Murphy

More information

Natural Gradients via the Variational Predictive Distribution

Natural Gradients via the Variational Predictive Distribution Natural Gradients via the Variational Predictive Distribution Da Tang Columbia University datang@cs.columbia.edu Rajesh Ranganath New York University rajeshr@cims.nyu.edu Abstract Variational inference

More information

Operator Variational Inference

Operator Variational Inference Operator Variational Inference Rajesh Ranganath Princeton University Jaan Altosaar Princeton University Dustin Tran Columbia University David M. Blei Columbia University Abstract Variational inference

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

arxiv: v3 [stat.ml] 15 Mar 2018

arxiv: v3 [stat.ml] 15 Mar 2018 Operator Variational Inference Rajesh Ranganath Princeton University Jaan ltosaar Princeton University Dustin Tran Columbia University David M. Blei Columbia University arxiv:1610.09033v3 [stat.ml] 15

More information

Stochastic Variational Inference

Stochastic Variational Inference Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

More information

Auto-Encoding Variational Bayes

Auto-Encoding Variational Bayes Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower

More information

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Local Expectation Gradients for Doubly Stochastic. Variational Inference Local Expectation Gradients for Doubly Stochastic Variational Inference arxiv:1503.01494v1 [stat.ml] 4 Mar 2015 Michalis K. Titsias Athens University of Economics and Business, 76, Patission Str. GR10434,

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Understanding Covariance Estimates in Expectation Propagation

Understanding Covariance Estimates in Expectation Propagation Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

More information

Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

Approximate Bayesian inference

Approximate Bayesian inference Approximate Bayesian inference Variational and Monte Carlo methods Christian A. Naesseth 1 Exchange rate data 0 20 40 60 80 100 120 Month Image data 2 1 Bayesian inference 2 Variational inference 3 Stochastic

More information

Expectation Propagation for Approximate Bayesian Inference
