The χ-divergence for Approximate Inference

Size: px
Start display at page:

Download "The χ-divergence for Approximate Inference"

Transcription

1 Adji B. Dieng 1 Dustin Tran 1 Rajesh Ranganath 2 John Paisley 1 David M. Blei 1 CHIVI enjoys advantages of both EP and KLVI. Like EP, it produces overdispersed approximations; like KLVI, it oparxiv:161328v2 [stat.ml] 27 Feb 217 Abstract Variational inference enables Bayesian analysis for complex probabilistic models with massive data sets. It posits a family of approximating distributions and finds the member closest to the posterior. While successful, variational inference methods can run into pathologies; for example, they typically underestimate posterior uncertainty. In this paper we propose CHIVI, a complementary algorithm to traditional variational inference. CHIVI is a black box algorithm that minimizes the χ-divergence from the posterior to the family of approximating distributions and provides an upper bound of the model evidence. We studied CHIVI in several scenarios. On Bayesian probit regression and Gaussian process classification it yielded better classification error rates than expectation propagation (EP) and classical variational inference (VI). When modeling basketball data with a Cox process, it gave better estimates of posterior uncertainty. Finally, we show how to use the CHIVI upper bound and classical VI lower bound to sandwich estimate the model evidence. 1. Introduction Bayesian analysis provides a foundation for reasoning with probabilistic models (Bishop, 26; Murphy, 212; Barber, 212; Gelman et al., 214a). We first set a joint distribution p(x, z) of latent variables z and observed variables x. We then analyze data through the posterior, p(z x) = p(x, z) p(x). In typical applications, this posterior is difficult to compute because the marginal likelihood p(x) is intractable. This necessitates approximate posterior inference methods such as Monte Carlo (Robert & Casella, 24) and varia- 1 Columbia University, New York, NY, USA 2 Princeton University, Princeton, NJ, USA. Correspondence to: Adji B. Dieng <abd2141@columbia.edu>. tional inference (Jordan et al., 1999a; Wainwright & Jordan, 28). This paper focuses on variational inference. Variational inference approximates the posterior through optimization. The idea is to posit a family of approximating distributions and then to find the member of the family that is closest to the posterior. Typically, closeness is defined by the Kullback-Leibler divergence KL(q p), where is the variational family indexed by parameters λ. This approach, which we call KLVI, also provides the evidence lower bound (ELBO), a convenient lower bound of the model evidence log p(x). KLVI scales well (Hoffman et al., 213) and is suited to applications that use complex models to analyze large data sets. But it also has drawbacks. For one, it tends to favor underdispersed approximations relative to the exact posterior (Murphy, 212; Bishop, 26). Further, it faces difficulties with light-tailed posteriors when the variational distribution has heavier tails. For example, KLVI for Gaussian process classification uses a Gaussian approximating family; this leads to unstable optimization and a poor approximation (Hensman et al., 214). One alternative to KLVI is expectation propagation (EP), which enjoys good empirical performance on models with light-tailed posteriors (Minka, 21a; Kuss & Rasmussen, 25). Procedurally, EP reverses the arguments in the Kullback-Leibler (KL) divergence and performs local minimizations of KL(p q); this corresponds to iterative moment matching using partitions of the data set. 
Relative to KLVI, EP produces overdispersed approximations. But EP also has drawbacks. It is not guaranteed to converge (Minka, 21b, Figure 3.6); it does not provide an easy estimate of the marginal likelihood; and it does not optimize a well-defined global objective (Beal, 23). In this paper we develop χ-divergence variational inference (CHIVI), a variational inference algorithm that minimizes the χ-divergence between the variational family and the exact posterior. It is [( p(z x) ) 2 ] D χ 2(p q) = E q(z;λ) 1. (1)

2 timizes a well-defined objective and produces an approximation of the evidence. As we mentioned, KLVI optimizes a lower bound on the model evidence. The idea behind CHIVI is to optimize an upper bound, which we call the chi upper bound (CUBO), where minimizing the CUBO is equivalent to minimizing the χ-divergence. In providing an upper bound, CHIVI complements KLVI. For example, the CUBO and ELBO together to give sandwich estimates of the model evidence (See Figure 2). Sandwich estimates are useful for tasks like model selection (MacKay, 1992; Raftery, 1995), where lower bounds alone do not provide enough information. In more detail, CHIVI is a stochastic optimization algorithm that computes unbiased noisy gradients of the exponentiated CUBO. An advantage of this strategy is that it provides a black-box inference algorithm (Ranganath et al., 214). This means CHIVI does not need model-specific derivations; it only requires sampling from an approximating distribution, evaluating the model s complete log likelihood log p(x, z), and evaluating the score function of the approximating family λ log. Thus it is easy to apply to a wide class of models. Related work. Variational inference was originally developed in the 199s, adapting ideas from statistical physics to derive methods for approximate Bayesian inference (Hinton & Van Camp, 1993; Waterhouse et al., 1996; Jordan et al., 1999b). Though the most widely studied variational objective is KL(q p), there has been work on alternatives. The main alternative is EP, proposed by Opper & Winther (2) and Minka (21a), which locally minimizes the KL(p q). Recent work revisits EP from the perspective of distributed computing (Gelman et al., 214b; Xu et al., 214; Teh et al., 215; Li et al., 215) and also revisits Minka (24), which studies local minimizations with the general family of α-divergences (Minka, 24; Hernández- Lobato et al., 215; Li & Turner, 216). CHIVI relates to EP and its extensions in that it leads to overdispersed approximations relative to KLVI. However, unlike Minka (24); Hernández-Lobato et al. (215), CHIVI does not rely on tying local factors; it optimizes a well-defined global objective. In this sense, CHIVI relates to the recent work on alternative divergence measures for variational inference (Li & Turner, 216; Ranganath et al., 216), but with a focus on the χ-divergence. The closest related work is Li & Turner (216). They perform black-box variational inference using the reverse α-divergence D α (q p) which is a valid divergence 1 only when α >. However, no positive value 1 Satisfies D(p q) and D(p q) = p = q a.e. of α in D α (q p) leads to the χ-divergence. Minimizing D α (q p) is equivalent to maximizing the VR-bound 2 which is a lower bound of the model evidence. We minimize the χ-divergence, a monotonic function of the direct α-divergence D α (p q). This is equivalent to minimizing the CUBO, upper bound to the model evidence. Furthermore, we provide a different black-box algorithm for minimizing these upper bounds using Monte Carlo gradients of the exponentiated bound to reduce bias. In this sense, our work is complementary to the work of Li & Turner (216). The rest of the paper is organized as follow: In Section 2 we briefly review variational inference before presenting the χ-divergence and it s zero-avoiding property that makes it produce overdispersed posterior approximations. In Section 2 we also derive the CUBO and its extensions. 
These are a family of upper bounds to the model evidence that enable approximate posterior inference with the χ-divergence. We propose and prove the sandwich theorem that relates CUBO and ELBO for sandwich-estimating the model evidence. We close Section 2 by proposing CHIVI, a scalable black box variational inference algorithm that uses unbiased noisy gradients of the exponentiated CUBO. Section 3 illustrates the performance of CHIVI on two classification problems, Bayesian probit regression and Gaussian process classification using benchmark datasets, and on a Cox process model for basketball data from the National Basketball Association (NBA) season. When compared to KLVI and EP, we find that CHIVI often produces better error rates and more accurate estimates of posterior uncertainty. Finally we conclude and explore some extensions and relationships of this work to a f-divergence minimization framework and importance sampling in Section χ-divergence Variational Inference We present the χ-divergence for variational inference. We describe some of its properties and develop CHIVI, a black box algorithm that minimizes the χ-divergence for a large class of models Variational Inference and the χ-divergence Variational inference (VI) casts Bayesian inference as optimization (Jordan et al., 1999b; Wainwright & Jordan, 28). VI posits a family of approximating distributions and finds the closest member to the posterior. In its typical formulation, VI minimizes the Kullback- Leibler divergence from to p(z x). This divergence is computationally intractable because it involves the 2 This is name of the family of lower bounds provided in Li & Turner (216)

3 Figure 1. We consider the posterior (red) as a mixture of two Gaussians, and the variational family (blue) is a Gaussian. From left to right: behavior of the divergences KL(q p) and χ n for n = 1.1, 2., and 5.. KL(q p) is mode-seeking, and χ for increasing n favors more overdispersed approximations. posterior. Fortunately, minimizing KL(q p) is equivalent to maximizing a tractable alternative, [ p(x, z) ] ELBO(λ) = E q(z;λ) log. (2) This objective is known as the evidence lower bound (ELBO), and we term methods that maximize it KLVI. The ELBO is not only a tractable objective but also a lower bound to the model evidence log p(x). Maximizing the ELBO imposes properties on the resulting approximate posterior such as underestimation of its support (Murphy, 212; Bishop, 26); these properties may be undesirable, especially when dealing with light-tailed posteriors as is the case in Gaussian process classification. As an alternative, we consider the χ-divergence (Equation 1). CHIVI seeks to minimize this divergence with respect to the variational parameters λ. Like KL(q p), this objective depends on the posterior. We derive a tractable proxy in Section 2.3, whose optimization is equivalent to optimizing Equation 1. Moreover, this tractable objective is an upper bound on log p(x). Minimizing the χ-divergence induces useful properties on the approximate posterior distribution such as a zeroavoiding behavior. This property leads to the overestimation of the posterior s support. (See Appendix A.5 for more details on all these properties.) We emphasize the main propoerty below Zero-avoiding behavior Optimizing the χ-divergence leads to a variational distribution with a zero-avoiding behavior. Indeed the χ- divergence is infinite whenever = and p(z x) >. Therefore during optimization p(z x) > will force >. This means q avoids having zero mass at places where p has nonzero mass. Notice that the classical objective KL(q p) leads to approximate posteriors with the opposite behavior, called zero-forcing. Indeed KL(q p) is infinite when p(z x) = and >. Therefore the optimal variational distribution q will be when p(z x) =. This zero-forcing behavior may lead to degenerate solutions during optimization, for example when the approximating family q has heavier tails than the target posterior p. This is because variational distributions with the zero-forcing behavior are overconfident and underestimate the true posterior s support. The χ-divergence, similar to KL(p q), does not suffer from this. The zero-avoiding behavior forces its posterior approximations to overestimate the support of the true posterior distribution (Minka, 25). To gain more intuition on this important property, we explore a simple scenario. Consider the extension of the χ-divergence to the family of χ n -divergences for n > 1, [( p(z x) D χ n(p q) = E q(z;λ) 1. This is a valid divergence for any n > 1 3. Figure 1 shows that varying n in the χ n -divergence provides an explicit knob for controlling this zeroing behavior. KL(q p) favors the mixture component with the highest weight and underestimates the posterior s support. D χ 2(p q) also picks the component with highest weight but it overestimates the posterior s support. For n < 2, D χ n(p q) tries to find a middleground between the two mixture components. This is because when n = 1.1 p(z x) D χ 1.1(p q) = E q [( q(z;λ) )1.1 ]; this weakly penalizes not putting high mass at the mode of p. 
When n > 2, D χ n(p q) penalizes placing mass where p is not at its highest and thus favors the mode CUBO: the chi upper bound We derive a tractable objective for variational inference with the χ 2 -divergence and also generalize it to the χ n - divergence for n > 1. Consider the optimization problem of minimizing Equation 1. We seek to find a relationship between the χ 2 -divergence and log p(x). We take the following steps: [( p(z x) ) 2 ] E q(z;λ) = 1 + D χ 2(p(z x) ) [( p(x, z) ) 2 ] E q(z;λ) = p(x) 2 [1 + D χ 2(p(z x) )] Taking logarithms on both sides, we find a relationship analogous to how KL(q p) relates to the ELBO. Namely, 3 D χ n(p q) by Jensen s inequality and D χ n(p q) = p = q a.e.

4 the χ 2 -divergence satisfies 1 2 log(1 + D χ2(p(z x) )) = log p(x) + 1 [( p(x, z) ) 2 ] 2 log E q(z;λ). By monotonicity of log, and because log p(x) is constant, minimizing the χ 2 -divergence is equivalent to minimizing: L χ 2(λ) = 1 [( p(x, z) ) 2 ] 2 log E q(z;λ). Furthermore, by nonnegativity of the χ 2 -divergence, this quantity is an upper bound to the model evidence, log p(x) 1 2 log E q(z;λ) [( p(x, z) ) 2 ] = L χ 2(λ). We call this objective the chi upper bound (CUBO). A general upper bound. This derivation also follows for the χ n -divergence. The general upper bound is L χ n(λ) = 1 n log E q(z;λ) [( p(x, z) = CUBO n. (3) We have produced a family of bounds: When n < 1, CUBO n is a lower bound and minimizing it for these values of n does not minimize the χ-divergence( rather, when n < 1, we recover the α-divergence and the VR-bound (Li & Turner, 216). The bound is tight for n = 1, CUBO 1 = log p(x). For any n 1, CUBO n is an upper bound to the model evidence. We focus on n = 2. Sandwiching the model evidence. Equation 3 has practical value. We can simultaneously minimize the CUBO n and maximize the ELBO. This produces a sandwich on the model evidence, ELBO log p(x) CUBO n. (See Appendix A.8 for a simulated illustration.) Sandwiching can be used to better approximate the model evidence, perform model selection, and assess convergence. Comparing models using only lower bounds, i.e the ELBO is unreliable. The following sandwich theorem states that the gap induced by CUBO n and ELBO increases with n. This suggests that letting n as close to 1 as possible enables approximating log p(x) with higher precision. When we further decrease n to, CUBO n becomes a lower bound and tends to the ELBO. Theorem 1 (Sandwich Theorem): Define CUBO n and ELBO as in Equation 3 and Equation 2. CUBO n is a non-decreasing function of the order n of the χ-divergence. Furthermore, when n is allowed to go to zero, the upper bound becomes a lower bound and more specifically: lim n CUBO n = ELBO. See proof in Appendix A.1. This theorem has many implications since estimating log p(x) is important for many applications, such as the evidence framework (MacKay, 23), where the marginal likelihood is argued to embody an Occam s razor. It can also help estimate Bayes factors (Raftery, 1995). However, few works have analyzed its sandwich estimation (Grosse et al., 215). We study our variational approach to sandwich estimation of the model evidence in Section Optimizing the CUBO We derived the CUBO, an upper bound to the model evidence that can be used to minimize the χ-divergence. We now develop CHIVI, a black box algorithm that minimizes the CUBO n. The goal in CHIVI is to minimize the CUBO n with respect to variational parameters, CUBO n (λ) = 1 n log E q(z;λ) [( p(x, z), The expectation in the CUBO n is usually intractable. Thus we use Monte Carlo to construct stochastic gradients. One approach is to naively perform Monte Carlo on this objective, CUBO n (λ) 1 n log 1 S S s=1 [( p(x, z (s) ), q(z (s) ; λ) for S samples z (1),..., z (S). However, by Jensen s inequality, the log transform of the expectation implies that this is a biased estimate of CUBO n (λ). Gradients of this estimate are also biased. We use a different approach than Li & Turner (216) and consider the objective L = exp{n CUBO n (λ)}. By monotonicity of the exponential function, this objective admits the same optima as CUBO n (λ). We minimize it using its reparameterization gradients (Kingma & Welling, 214; Rezende et al., 214). 
These gradients apply to models with differentiable latent variables and have lower variance. More formally, assume z = g(λ, ɛ) where ɛ p(ɛ). Then ˆL = 1 B ( p(x, g(λ, ɛ (b) )) ) n B q(g(λ, ɛ (b) ); λ) b=1

5 is an unbiased estimator of L and its gradient is Algorithm 1: Scalable CHIVI for massive datasets n B ( p(x, g(λ, ɛ λ ˆL (b) )) ) n λ ( p(x, g(λ, ɛ (b) )) ) = log. B q(g(λ, ɛ (b) ); λ) q(g(λ, ɛ (b) ); λ) Input: Data x, Model p(x, z), Variational family. b=1 (4) Output: Variational parameters λ. Initialize λ randomly. (See Appendix A.7 for a more detailed derivation of both the score function gradient and the reparameterization gradient). Draw S samples z (1),..., z (S) from q(z; while not converged do λ). Computing Equation 4 requires all the data x. This does not scale to massive data sets. In such a setting, we can apply the average likelihood technique from EP (Li et al., 215; Dehaene & Barthelmé, 215). Consider data {x 1,..., x N }. Define the likelihood factor f i (z) = p(x i z). Now consider a data subsample, {x i1,..., x im }. The subsampled likelihood is f 1:M (z) = M f ij (z). j=1 We approximate the full likelihood by multiplying the subsampled likelihood with itself N M times Subsample data points {x i1,..., x im }. Compute the corresponding average likelihoods f M (z (1) ),..., fm (z (S) ). Set ρ t from a Robbins-Monro sequence. Set w (s) = p(z(s) ) f M (z (s) ) N q(z (s) ;λ t) Set c = max log w (s). s, s {1,..., S}. Set w (s) = exp(log w (s) c), s {1,..., S}. Update λ t+1 = [( ) n λ ] λ t (1 n) ρt S S s=1 w (s) log q(z (s) ; λ t ). end p(x z) f 1:M (z) N M. Using this proxy to the full dataset we derive CHIVI, an algorithm in which each iteration depends on only a minibatch of data. CHIVI is a black box algorithm for performing approximate inference with the χ n -divergence. Algorithm 1 summarizes the procedure. In practice, we subtract the maximum of the logarithm of the importance weights, defined as p(x, z) w =. to avoid underflow. Stochastic optimization theory still gives us convergence guarantees with this aproach (Sunehag et al., 29; Robbins & Monro, 1951). 3. Empirical Study We study CHIVI as an approximate inference algorithm and also as a means for model selection by sandwich estimating the model evidence. First, we study Bayesian probit regression with benchmark datasets where we compare the predictive performance of CHIVI against EP and KLVI where we also illustrate the sandwich gap for model selection. Second, we compare CHIVI to Laplace and EP on a model class for which EP is the method of choice: Gaussian process classification. Third, we analyze Cox processes, a type of spatial point process, to compare profiles of different NBA basketball players. We find CHIVI has better predictive power and yields better posterior uncertainty estimates. We also illustrate the sandwich of the model evidence for UCI datasets using mainly n = 2. All experiments were implemented in Edward (Tran et al., 216) Bayesian Probit Regression We analyze inference for Bayesian probit regression. First, we illustrate sandwich estimation on UCI datasets. Figure 2 illustrates the bounds of the log marginal likelihood given by the ELBO and the CUBO. Using both quantities provides a reliable approximation of the model evidence. The tightness of the gap depends on the order of the divergence n. When n gets close to 1 which corresponds to exact inference a large number of samples from the variational distribution is needed. In addition, these figures show convergence for CHIVI, which EP does not always satisfy. We also compared the predictive performance of CHIVI, EP, and KLVI. For large datasets, we apply Algorithm 1 with a minibatch size of 64 and 2 iterations for each batch. 
We computed the average classification error rate and the standard deviation using 5 random splits of the data. We split all the datasets with 9% of the data for training and 1% for testing. For the Covertype dataset, we implemented Bayesian probit regression to discriminate the class 1 against all other classes. Table 1 shows the average

6 Table 1. Test error for Bayesian probit regression. The lower the better. CHIVI (this paper) yields lower test error rates when compared to BBVI (Ranganath et al., 214) and EP on most datasets. Dataset BBVI EP CHIVI Pima 35 ± 6 34 ± 6 22 ± 48 Ionos.123 ± ± ± 5 Madelon 57 ± 5 45 ± 5 53 ± 29 Covertype.157 ± ± ± 14 Table 2. Test error for Gaussian process classification. The lower the better. CHIVI (this paper) yields lower test error rates when compared to Laplace and EP on most datasets. Dataset Laplace EP CHIVI Crabs ± 3 Pima - 45 ± ± 35 Sonar ± 35 Ionos 84 8 ± 4 69 ± 34 Heart ± 59 error rate for KLVI (as implemented by black box variational inference (BBVI)), EP, and CHIVI. CHIVI performs better for all but one dataset Gaussian Process Classification Gaussian process (GP) classification is an alternative to probit regression. The posterior is analytically intractable because the likelihood is not conjugate to the prior. Moreover, the posterior tends to be skewed. EP has been the method of choice for approximating the posterior (Kuss & Rasmussen, 25). We choose a factorized Gaussian for the variational distribution q and fit its mean and the log variance to avoid negative variances during optimization. With UCI benchmark datasets, we compared the predictive performance of CHIVI to EP and Laplace. Table 2 summarizes the results. The error rates for CHIVI correspond to the average of 1 error rates obtained by dividing the data into 1 folds, applying CHIVI to 9 folds to learn the variational parameters and performing prediction on the remainder. The kernel hyperparameters were chosen using grid search. The error rates for the other methods correspond to the best results reported in (Kuss & Rasmussen, 25) and (Kim & Ghahramani, 23). On all the datasets CHIVI performs as well or better than EP and Laplace Cox Processes Finally we study Cox processes. They are Poisson processes with stochastic rate functions. They capture dependence between the frequency of points in different regions Curry Demarcus Lebron Duncan CHIVI BBVI Table 3. Average L 1 error for posterior uncertainty estimates (ground truth from HMC). We find that CHIVI is similar to or better than BBVI at capturing posterior uncertainties. Demarcus Cousins, who plays center, stands out in particular. His shots are concentrated near the basket, so the posterior is uncertain over a large part of the court Figure 3. of a space. We apply Cox processes to model the spatial locations of shots (made and missed) from the NBA season; see also Miller et al. (214). The data are from 38 NBA players who took more than 15, shots in total. The n th player s set of M n shot attempts are x n = {x n,1,..., x n,mn }, and the location of the m th shot by the n th player in the basketball court is x n,m [ 25, 25] [, 4]. Let PP(λ) denote a Poisson process with intensity function λ, and K be a covariance matrix resulting from a kernel applied to every location of the court. The generative process for the n th player s shot is K i,j = k(x i, x j ) = σ 2 exp( 1 2φ 2 x i x j 2 ) f GP(, k(, )) and λ = exp(f) x n,k PP(λ) for k {1,..., M n }. The kernel of the Gaussian process encodes the spatial correlation between different areas of the basketball court. The model treats the N players as independent. But the kernel K introduces correlation between the shots attempted by a given player. Our goal is to infer the intensity functions λ(.) for each player. We compare the shooting profiles of different players using these inferred intensity surfaces. 
The results are shown in Figure 3. The shooting profiles of Demarcus Cousins and Stephen Curry are captured by both BBVI and CHIVI. BBVI has lower posterior uncertainty while CHIVI provides more overdispersed solutions. We plot the profiles for two more players, LeBron James and Tim Duncan, in the appendix. In Table 3, we compare the posterior uncertainty estimates of CHIVI and BBVI to that of HMC, a computationally expensive Markov chain Monte Carlo procedure that we treat as exact. We use the average L 1 distance from HMC as error measure. We do this on four different players: Stephen Curry, Demarcus Cousins, LeBron James, and Tim Duncan. We find that CHIVI is similar or better than BBVI, especially on players like Demarcus Cousins who shoot in a limited part of the court.

7 Sandwich Plot Using CHIVI and BBVI On Ionosphere Dataset upper bound lower bound 1.5 Sandwich Plot Using CHIVI and BBVI On Ionosphere Dataset upper bound lower bound 1.5 Sandwich Plot Using CHIVI and BBVI On Heart Dataset upper bound lower bound 1.5 Sandwich Plot Using CHIVI and BBVI On Crabs Dataset upper bound lower bound objective objective objective objective epoch epoch epoch epoch Figure 2. Sandwich gap via CHIVI and BBVI on different datasets. The first two plots correspond to a divergence with order 2 and 1.2 respectively on the Ionosphere dataset. As mentioned in our theoretical analysis and in the simulations in the appendix, the gap tightens when n 1. However there is a trade-off between tightening the gap and computational efficiency: when n 1 more samples from the variational distribution are needed for the algorithm to converge. The last two plots correspond to n = 2 for Heart and Crabs datasets respectively. More sandwich plots can be found in the appendix. 4. Discussion and Extensions We described CHIVI, a black box algorithm that minimizes the χ-divergence by minimizing the CUBO. We now describe how this algorithm can be extended to optimize f- divergences and to find an optimal proposal f-divergences The χ-divergence is a member of the general f-divergence family (Csiszár & Shields, 24). An f-divergence has the form ( p(x) ) D f (p q) = f q(x) dx, q(x) where f is a convex function such that f(1) =. For example, the divergence KL(q p) corresponds to choosing f(x) = log x and the divergence KL(p q) corresponds to f(x) = x log x. The α-divergence family is a subfamily of this larger family of divergences. The χ n -divergence corresponds to f(x) = x n 1. A key property is that any f-divergence can be rewritten as a Taylor sum of χ-divergences (Nielsen & Nock, 214). Expanding around a point r in the domain of f, 1 ( p(x) ) n D f (p q) = q(x) n! f (n) (r ) q(x) r dx = n= n= 1 n! f (n) (x ) χ n r (p q), where χ n r (p q) is a higher-order χ-divergence. CHIVI can be extended to approximately minimize any f-divergence at a given truncation level. As one example, the above equation implies that the χ 2 -divergence can be interpreted (up to proportion) as a second-order Taylor approximation of KL(p q). If desired, incorporating higher-order χ-divergences for posterior inference can better mimic properties of KL(p q), such as moment matching Importance sampling The χ-divergence also has deep connections to importance sampling (Minka, 25). Consider estimating the marginal likelihood p(x, z) p(x) = p(x, z) dz = q(z) dz q(z) using a proposal distribution q(z). We d like to learn the optimal proposal among a family parameterized by λ. The importance-sampled estimate of p(x) is p(x) = 1 S S s=1 p(x, z (s) ) q(z (s) ; λ), The variance of this estimator is Var( p(x)) = 1 S ( z(1),..., z (s). [( p(x, z (1) ) ) 2 ] E q(z;λ) p(x) 2). q(z (1) ; λ) One approach to choose is to find parameters which minimize the variance. Formally, this is equivalent to finding the minimum-variance unbiased estimator. Dropping constant terms, this is equivalent to minimizing the χ 2 -divergence. This idea originates from adaptive importance sampling based on maximizing the effective sample size (Kong et al., 1994; Cappé et al., 28) and has recently seen renewed interest in the context of online learning (Bouchard et al., 215) Summary We derived CHIVI, a black box algorithm for doing variational inference with the χ-divergence. 
We showed that CHIVI is an effective algorithm for Bayesian probit regression, Gaussian process classification, and Cox processes. We also showed how to use CHIVI in concert with KLVI to sandwich-estimate the model evidence.

8 The χ-divergence for Approximate Inference Curry Shot Chart Curry Posterior Intensity (KLQP) 225 Demarcus Shot Chart Curry Posterior Intensity (Chi) Curry Posterior Intensity (HMC) Demarcus Posterior Intensity (KLQP) 36 Demarcus Posterior Intensity (Chi) Demarcus Posterior Intensity (HMC) Curry Posterior Uncertainty (KLQP) Curry Posterior Uncertainty (Chi) Curry Posterior Uncertainty (HMC) Demarcus Posterior Uncertainty (KLQP) Demarcus Posterior Uncertainty (Chi) Demarcus Posterior Uncertainty (HMC) Figure 3. Basketball players shooting profiles as inferred by BBVI (Ranganath et al., 214), CHIVI (this paper), and HMC. The top row displays the raw data, consisting of made shots (green) and missed shots (red). The second and third rows display the posterior intensities inferred by BBVI, CHIVI, and HMC for Stephen Curry and Demarcus Cousins respectively. Both BBVI and CHIVI capture the shooting behavior of both players in terms of the posterior mean. The fourth and fifth rows display the posterior uncertainty inferred by BBVI, CHIVI, and HMC for Stephen Curry and Demarcus Cousins respectively. CHIVI tends to get higher posterior uncertainty for both players in areas where data is scarce compared to BBVI. This illustrates the variance underestimation problem of KLVI, which is not the case for CHIVI.

9 References Barber, D. Bayesian Reasoning and Machine Learning. Cambridge University Press, 212. Beal, M. J. Variational algorithms for approximate Bayesian inference. University of London, 23. Bishop, C. M. Pattern recognition. Machine Learning, 128, 26. Bouchard, G., Trouillon, T., Perez, J., and Gaidon, A. Online Learning to Sample. arxiv preprint arxiv: , 215. Cappé, O., Douc, R., Guillin, A., Marin, J. M., and Robert, C. P. Adaptive importance sampling in general mixture classes. Statistics and Computing, 18(4): , 28. Csiszár, I. and Shields, P. C. Information Theory and Statistics: A Tutorial. Now Publishers Inc, 24. Dehaene, G. and Barthelmé, S. Expectation propagation in the large-data limit. In Neural Information Processing Systems, 215. Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. Chapman & Hall/CRC, 3rd edition, 214a. Gelman, A., Vehtari, A., Jylänki, P., Robert, C.n, Chopin, N., and Cunningham, J. P. Expectation propagation as a way of life. arxiv preprint arxiv: , 214b. Grosse, R. B., Ghahramani, Z., and Adams, R. P. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arxiv.org, November 215. Hensman, J., Zwießele, M., and Lawrence, N. D. Tilted variational Bayes. The Journal of Machine Learning Research, 214. Hernández-Lobato, J. M., Li, Y., Hernández-Lobato, D., Bui, T., and Turner, R. E. Black-box α-divergence minimization. arxiv preprint, 215. Hinton, G. and Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Computational Learning Theory, pp ACM, Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14: , 213. Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. Introduction to variational methods for graphical models. Machine Learning, 37: , 1999a. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2): , 1999b. Kim, H. and Ghahramani, Z. The em-ep algorithm for gaussian process classification. In Proceedings of the Workshop on Probabilistic Graphical Models for Classification (ECML), pp , 23. Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 214. Kong, A., Liu, J. S., and Wong, W. H. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425): , Kuss, M. and Rasmussen, C. E. Assessing approximate inference for binary Gaussian process classification. The Journal of Machine Learning Research, 6: , 25. Li, Y. and Turner, R. E. Variational inference with Rényi divergence. arxiv preprint arxiv: , 216. Li, Y., Hernández-Lobato, J. M., and Turner, R. E. Stochastic Expectation Propagation. In Neural Information Processing Systems, 215. MacKay, D. J. C. Bayesian interpolation. Neural computation, 4(3): , MacKay, D. J. C. Information Theory, Inference and Learning Algorithms. Cambridge university press, 23. Miller, A., Bornn, L., Adams, R., and Goldsberry, K. Factorized point process intensities: A spatial analysis of professional basketball. In ICML, pp , 214. Minka, T. Expectation propagation for approximate bayesian inference. In Uncertainty in Artificial Intelligence, pp Morgan Kaufmann Publishers Inc., 21a. Minka, T. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 21b. Minka, T. Power EP. 
Technical report, Microsoft Research, 24. Minka, T. Divergence measures and message passing. Technical report, Microsoft Research, 25. Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT press, 212. Nielsen, F. and Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Processing Letters, 21(1):1 13, 214. Opper, M. and Winther, O. Gaussian processes for classification: Mean-field algorithms. Neural Computation, 12 (11): , 2.

10 Raftery, A. E. Bayesian model selection in social research. Sociological methodology, 25: , Ranganath, R., Gerrish, S., and Blei, D. M. Black box variational inference. In Artificial Intelligence and Statistics, 214. Ranganath, Rajesh, Altosaar, Jaan, Tran, Dustin, and Blei, David M. Operator variational inference. In Neural Information Processing Systems, 216. Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, 214. Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 4 47, Robert, C. and Casella, G. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, NY, 24. Sunehag, Peter, Trumpf, Jochen, Vishwanathan, SVN, Schraudolph, Nicol N, et al. Variable metric stochastic approximation theory. In AISTATS, pp , 29. Teh, Y. W., Hasenclever, L., Lienart, T., Vollmer, S., Webb, S., Lakshminarayanan, B., and Blundell, C. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server. arxiv preprint arxiv: , 215. Tran, Dustin, Kucukelbir, Alp, Dieng, Adji B, Rudolph, Maja, Liang, Dawen, and Blei, David M. Edward: A library for probabilistic modeling, inference, and criticism. arxiv preprint arxiv: , 216. Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Foundations and Trends R in Machine Learning, 1(1-2):1 35, 28. Waterhouse, S., MacKay, D., and Robinson, T. Bayesian methods for mixtures of experts. In Neural Information Processing Systems, Xu, M., Lakshminarayanan, B., Teh, Y. W., Zhu, J., and Zhang, B. Distributed Bayesian posterior sampling via moment sharing. In Neural Information Processing Systems, 214. The χ-divergence for Approximate Inference

11 A. Supplementary Material A.1. Proof of Sandwich Theorem We denote by z the latent variable and x the data. Assume z R D. We first show that CUBO n is a nondecreasing function of the order n of the χ-divergence. Denote by the triplet (Ω, F, Q) the probability space induced by the variational distribution q where Ω is a subspace of R D, F is the corresponding Borel sigma algebra, and Q is absolutely continuous with respect to the Lebesgue measure µ and is such that dq(z) = q(z)dz. Define w = p(x,z) q(z). We can rewrite CUBO n as: CUBO n = 1 n log E q[w n ] ( ) = log (E q [w n ]) 1 n Since log is nondecreasing, it is enough to show n (E q [w n ]) 1 n is nondecreasing. This function is the L n norm in the space defined above: ( ) (E q [w n ]) 1 1 n = w n n dq Ω ( ) 1 = w n n q(z)dz This is a nondecreasing function of n by virtue of the Lyapunov inequality. We now show the second claim in the sandwich theorem, namely that the limit when n of CUBO n is the ELBO. Since CUBO n is a monotonic function of n and is bounded from below by ELBO, it admits a limit when n. Call this limit L. We show L = ELBO. On the one hand, since CUBO n ELBO for all n >, we have L ELBO. On the other hand, since log x x 1; x > we have CUBO n = 1 n log E q[w n ] 1 [ ] E q [w n ] 1 n Ω [ w n 1 ] = E q n f : n w n is differentiable [ ] and furthermore f w () = lim n 1 n n = log w. Therefore n > such that wn 1 n log w < 1 n < n. Since wn 1 n log w < wn 1 n log w, we have wn 1 n < 1 + log w which is E q-integrable. Therefore by Lebesgue s [ ] dominated [ convergence ] theorem: w lim n E n 1 w q n = E q lim n 1 n n = E q [log w] = ELBO. Since CUBO [ n converges ] when n w and CUBO n E n 1 q n n, we establish L [ ] w lim n E n 1 q n = ELBO. The conclusion follows. A.2. The CHIVI algorithm for small datasets Algorithm 2: CHIVI Input: Data x, Model p(x, z), Variational family. Output: Variational parameters λ. Initialize λ randomly. while not converged do Draw S samples z (1),..., z (S) from. Set ρ t from a Robbins-Monro sequence. Set log w (s) = log p(x, z (s) ) log q(z (s) ; λ t ), s {1,..., S}. Set c = max log w (s). s Set w (s) = exp(log w (s) c), s {1,..., S}. Update λ t+1 = [( ) n λ ] λ t (1 n) ρt S S s=1 w (s) log q(z (s) ; λ t ). end A.3. Approximately minimizing f-divergence with χ-divergence In this section we provide a proof that minimizing an f-divergence can be done by minimizing a sum of χ- divergences. Consider D f (p q) = f ( p(x) ) q(x)dx q(x) Without loss of generality assume f is analytic. The Taylor expansion of f around some point x is f(x) = f(x ) + f (x )(x x ) + Therefore i=2 f (i) (x ) (x x ) i i! ( [ p(x) ] ) D f (p q) = f(x ) + f (x ) E q(z λ) x q(x) [ f (i) (x ) ( p(x) ) i ] + E q(z λ) i! q(x) x i=2 = f(x ) + f (x )(1 x ) f (i) (1) [( p(x) ) i ] + E q(z λ) i! q(x) 1 i=2 where we switch summation and expectation by invoking Fubini s theorem.

12 In particular if we take x = 1 the linear terms are zero and we end up with: D f (p q) = = i=2 i=2 f (i) (1) [( p(x) ) i ] E q(z λ) i! q(x) 1 f (i) (1) D i! χ i(p q) If f is not analytic but k times differentiable for some k then the proof still holds considering the Taylor expansion of f up to the order k. A.4. Importance sampling In this section we establish the relationship between χ 2 - divergence minimization and importance sampling. Consider estimating the marginal likelihood I with importance sampling: I = p(x) = p(x, z)dz p(x, z) = q(z) q(z)dz = w(z)q(z)dz The Monte Carlo estimate of I is Î = 1 B B w(z (b) ) b=1 where z (1),..., z (B) q(z). The variance of Î is Var(Î) = 1 B [E q(z λ)(w(z (b) ) 2 ) (E q(z λ) (w(z (b) ))) 2 ] = 1 B [ E q(z λ) (( p(x, z (1) ) q(z (1) ) ) 2 ) p(x) 2] Therefore minimizing this variance is equivalent to minimizing the quantity (( p(x, z (1) ) ) 2 ) E q(z λ) q(z (1) ) which is equivalent to minimizing the χ 2 - divergence. A.5. General properties of the χ-divergence In this section we outline several properties of the χ- divergence. Conjugate symmetry Define f (u) = uf( 1 u ) to be the conjugate of f. f is also convex and satisfies f (1) =. Therefore Df (p q) is a valid divergence in the f-divergence family and: ( q(x) ) D f (q p) = f p(x)dx p(x) q(x) = p(x) f ( p(x) ) p(x)dx q(x) = D f (p q) D f (q p) is symmetric if and only if f = f which is not the case here. To symmetrize the divergence one can use D(p q) = D f (p q) + D f (p q) Invariance under parameter transformation. Let y = u(x) for some function u. Then by Jacobi p(x)dx = p(y)dy and q(x)dx = q(y)dy. x1 ( p(x) ) nq(x)dx D χ n(p(x) q(x)) = 1 x q(x) y1 ( p(y) dy ) nq(y)dy dx = 1 y q(y) dy dx y1 ( p(y) ) nq(y)dy = 1 q(y) y = D χ n(p(y) q(y)) Factorization for independent distributions. Consider taking p(x, y) = p 1 (x)p 2 (y) and q(x, y) = q 1 (x)q 2 (y). p(x, y) D χ n(p(x, y) q(x, y)) = dxdy q(x, y) n 1 p 1 (x) n p 2 (y) n = q 1 (x) n 1 dxdy q 2 (y) n 1 ( p 1 (x) n ) = q 1 (x) n 1 dx ( p 2 (y) n ) q 2 (y) n 1 dy = D χ n(p 1 (x) q 1 (x)) D χ n(p 2 (y) q 2 (y)) Therefore χ-divergence is multiplicative under independent distributions while KL is additive. Other properties. The χ-divergence enjoys some other properties that it shares with all members of the f- divergence family namely monotonicity with respect to the distributions and joint convexity. A.6. Derivation of the CUBO n In this section we outline the derivation of CUBO n, the upper bound to the marginal likelihood induced by the mini-

13 mization of the χ-divergence. By definition: [( p(z x) D χ n(p(z x) ) = E q(z;λ) 1 Following the derivation of ELBO, we seek an expression of log(p(x)) involving D χ n(p(z x) ). We achieve that as follows: [( p(z x) E q(z;λ) = 1 + D χ n(p(z x) ) [( p(x, z) E q(z;λ) = p(x) n [1 + D χ n(p(z x) )] This gives the relationship log p(x) = 1 [( p(x, z) n log E q(z;λ) 1 n log(1 + D χn(p(z x) )) log p(x) = CUBO n 1 n log(1 + D χn(p(z x) )) By nonnegativity of the divergence this last equation establishes the upper bound: A.7. Black Box Inference log p(x) CUBO n In this section we derive the score gradient and the reparameterization gradient for doing black box inference with the χ-divergence. CUBO n (λ) = 1 n log E q(z;λ) [( p(x, z) where λ is the set of variational parameters. To minimize CUBO n (λ) with respect to λ we need to resort to Monte Carlo. To minimize CUBO n (λ) we consider the equivalent minimization of exp{n CUBO(λ)}. This enables unbiased estimation of the noisy gradient used to perform black box inference with the χ-divergence. The score gradient The score gradient of our objective function is derived below: λ L = λ p(x, z) n 1 n dz = p(x, z) n λ 1 n dz = p(x, z) n (1 n) n λ dz p(x, z) = (1 n) ( )n λ dz = (1 n) p(x, z) ( )n λ log dz = (1 n)e q(z;λ) [( p(x, z) ) n λ log ] where we switched differentiation and integration by invoking Lebesgue s dominated convergence theorem. We estimate this gradient as was done in Paisley et al. with the unbiased estimator: (1 n) B B b=1 [( p(x, z (b) ) ) n λ log q(z (b) ; λ)] q(z (b) ); λ Reparameterization gradient The reparameterization gradient empirically has lower variance than the score gradient. We used it in our experiments. Denote by L the quantity exp{n CUBO} [( p(x, z) L = E q(z;λ) Assume z = g(λ, ɛ) where ɛ p(ɛ). Then ˆL = 1 B B b=1 ( p(x, g(λ, ɛ (b) )) ) n q(g(λ, ɛ (b) ); λ) is an unbiased estimator of L and its gradient is given by λ ˆL = n B = n B B b=1 B b=1 A.8. Simulation Studies ( p(x, g(λ, ɛ (b) )) ) n 1 λ ( p(x, g(λ, ɛ (b) )) ) q(g(λ, ɛ (b) ); λ) q(g(λ, ɛ (b) ); λ) ( p(x, g(λ, ɛ (b) )) ) n λ ( p(x, g(λ, ɛ (b) )) ) log. q(g(λ, ɛ (b) ); λ) q(g(λ, ɛ (b) ); λ) The following figures are results of various Monte Carlo simulations on the CUBO. L = exp{n CUBO(λ)}

14 (a) (a) (b) (b) Figure 4. Samples from the variational distribution resulting from a Bayesian neural network regression on synthetic data using KL(q p) (Figure 4a) and the χ-divergence Figure 4b. Note the overdispersion outside the [ 2, +2] region for the χ-divergence compared to KL(q p). (c) Figure 5. Sandwich gap via Monte Carlo simulations when the order of the χ-divergence is n = 4, n = 2, and n = 1.5 respectively. As we demonstrated theoretically, the gap closes as n decreases.

15 Sandwich Plot Using CHIVI and BBVI On Pima Dataset upper bound lower bound objective epoch Figure 6. Top row:sandwich plot on real (left) and synthetic (right) datasets. Bottom left: example of a situation where BBVI fails. The prior is Gaussian. The likelihood is a Heaviside function. The resulting posterior has light tails due to the truncation. BBVI fits a point mass while CHIVI fits a reasonable distribution. Bottom right: Monte Carlo simulation on CUBO n: the CUBO n is a nondecreasing function of the order n of the χ-divergence as we proved in Appendix A.1.

16 The χ-divergence for Approximate Inference James Shot Chart James Posterior Intensity (KLQP) 36 Duncan Shot Chart James Posterior Intensity (Chi) 32 4 James Posterior Intensity (HMC) Duncan Posterior Intensity (KLQP) Duncan Posterior Intensity (Chi) 8 Duncan Posterior Intensity (HMC) James Posterior Uncertainty (KLQP) James Posterior Uncertainty (Chi) James Posterior Uncertainty (HMC) Duncan Posterior Uncertainty (KLQP) Duncan Posterior Uncertainty (Chi) Duncan Posterior Uncertainty (HMC) Figure 7. More player profiles. Basketball players shooting profiles as inferred by BBVI (Ranganath et al., 214), CHIVI (this paper) and HMC. The top row displays the raw data, consisting of made shots (green) and missed shots (red). The second and third rows display the posterior intensities inferred by BBVI, CHIVI and HMC for Lebron James and Tim Duncan respectively. Both BBVI and CHIVI nicely capture the shooting behavior of both players in terms of their posterior mean.the fourth and fifth rows display the posterior uncertainty inferred by BBVI, CHIVI and HMC for Lebron James and Tim Duncan respectively. Here CHIVI and BBVI tend to get similar posterior uncertainty for Lebron James. CHIVI has better uncertainty for Tim Duncan.

Variational Inference via χ Upper Bound Minimization

Variational Inference via χ Upper Bound Minimization Variational Inference via χ Upper Bound Minimization Adji B. Dieng Columbia University Dustin Tran Columbia University Rajesh Ranganath Princeton University John Paisley Columbia University David M. Blei

More information

arxiv:submit/ [stat.ml] 12 Nov 2017

arxiv:submit/ [stat.ml] 12 Nov 2017 Variational Inference via χ Upper Bound Minimization arxiv:submit/26872 [stat.ml] 12 Nov 217 Adji B. Dieng Columbia University John Paisley Columbia University Dustin Tran Columbia University Abstract

More information

Black-box α-divergence Minimization

Black-box α-divergence Minimization Black-box α-divergence Minimization José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, Richard Turner, Harvard University, University of Cambridge, Universidad Autónoma de Madrid.

More information

An Overview of Edward: A Probabilistic Programming System. Dustin Tran Columbia University

An Overview of Edward: A Probabilistic Programming System. Dustin Tran Columbia University An Overview of Edward: A Probabilistic Programming System Dustin Tran Columbia University Alp Kucukelbir Eugene Brevdo Andrew Gelman Adji Dieng Maja Rudolph David Blei Dawen Liang Matt Hoffman Kevin Murphy

More information

Natural Gradients via the Variational Predictive Distribution

Natural Gradients via the Variational Predictive Distribution Natural Gradients via the Variational Predictive Distribution Da Tang Columbia University datang@cs.columbia.edu Rajesh Ranganath New York University rajeshr@cims.nyu.edu Abstract Variational inference

More information

Operator Variational Inference

Operator Variational Inference Operator Variational Inference Rajesh Ranganath Princeton University Jaan Altosaar Princeton University Dustin Tran Columbia University David M. Blei Columbia University Abstract Variational inference

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

arxiv: v3 [stat.ml] 15 Mar 2018

arxiv: v3 [stat.ml] 15 Mar 2018 Operator Variational Inference Rajesh Ranganath Princeton University Jaan ltosaar Princeton University Dustin Tran Columbia University David M. Blei Columbia University arxiv:1610.09033v3 [stat.ml] 15

More information

Stochastic Variational Inference

Stochastic Variational Inference Stochastic Variational Inference David M. Blei Princeton University (DRAFT: DO NOT CITE) December 8, 2011 We derive a stochastic optimization algorithm for mean field variational inference, which we call

More information

Auto-Encoding Variational Bayes

Auto-Encoding Variational Bayes Auto-Encoding Variational Bayes Diederik P Kingma, Max Welling June 18, 2018 Diederik P Kingma, Max Welling Auto-Encoding Variational Bayes June 18, 2018 1 / 39 Outline 1 Introduction 2 Variational Lower

More information

Local Expectation Gradients for Doubly Stochastic. Variational Inference

Local Expectation Gradients for Doubly Stochastic. Variational Inference Local Expectation Gradients for Doubly Stochastic Variational Inference arxiv:1503.01494v1 [stat.ml] 4 Mar 2015 Michalis K. Titsias Athens University of Economics and Business, 76, Patission Str. GR10434,

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Understanding Covariance Estimates in Expectation Propagation

Understanding Covariance Estimates in Expectation Propagation Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department

More information

Bayesian Inference Course, WTCN, UCL, March 2013

Bayesian Inference Course, WTCN, UCL, March 2013 Bayesian Course, WTCN, UCL, March 2013 Shannon (1948) asked how much information is received when we observe a specific value of the variable x? If an unlikely event occurs then one would expect the information

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

More information

Two Useful Bounds for Variational Inference

Two Useful Bounds for Variational Inference Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

More information

Approximate Bayesian inference

Approximate Bayesian inference Approximate Bayesian inference Variational and Monte Carlo methods Christian A. Naesseth 1 Exchange rate data 0 20 40 60 80 100 120 Month Image data 2 1 Bayesian inference 2 Variational inference 3 Stochastic

More information

Expectation Propagation for Approximate Bayesian Inference
