Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions
Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, Masashi Sugiyama
Conference on Uncertainty in Artificial Intelligence (UAI) 2016
Discussion led by Yan Kaganovsky, Duke University
Outline
1. Intro: Variational Inference and Proximal Methods
2. Proximal-Gradient Stochastic Variational Inference (PG-SVI)
3. Convergence of PG-SVI for Fixed Step Size
4. Experimental Results
1. Intro: Variational Inference and Proximal Methods
Variational Inference

Bayesian inference with a general latent variable model:
- Data vector y of length N
- Latent vector z of length D
Approximate the evidence p(y) with the ELBO:
  log p(y) = log ∫ q(z|λ) [ p(y, z) / q(z|λ) ] dz                                  (1)
           ≥ max_{λ∈S} L(λ),   L(λ) := E_{q(z|λ)}[ log ( p(y, z) / q(z|λ) ) ]       (2)
and the problem reduces to finding the parameters λ of q:
  λ* = arg max_{λ∈S} L(λ)                                                          (3)
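To make (1)-(3) concrete, here is a minimal Python sketch (not from the paper) that estimates L(λ) by Monte Carlo for a hypothetical conjugate toy model: a standard-normal prior on z, unit-variance Gaussian likelihoods, and q(z|λ) = N(m, s²). All names and values are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical toy model: z ~ N(0, 1), y_n | z ~ N(z, 1)
y = rng.normal(loc=1.0, scale=1.0, size=20)

def elbo_estimate(m, s, num_samples=10_000):
    """Monte Carlo estimate of L(lambda) = E_q[log p(y, z) - log q(z | lambda)]
    for q(z | lambda) = N(m, s^2)."""
    z = rng.normal(m, s, size=num_samples)
    log_prior = norm.logpdf(z, 0.0, 1.0)
    log_lik = norm.logpdf(y[:, None], z[None, :], 1.0).sum(axis=0)
    log_q = norm.logpdf(z, m, s)
    return np.mean(log_prior + log_lik - log_q)

print(elbo_estimate(m=0.5, s=0.5))   # higher is better; compare a few (m, s)
# q set to the exact posterior N(sum(y)/(N+1), 1/(N+1)) attains the largest ELBO (= log p(y)):
print(elbo_estimate(m=np.sum(y) / (len(y) + 1), s=np.sqrt(1 / (len(y) + 1))))
```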
Proximal-Gradient Methods: Gradient Descent

Form a linear approximation of the objective, L(λ) ≈ L(λ_k) + (λ − λ_k)^T ∇L(λ_k), and
optimize it within some distance of the previous solution. The simplest case,
  λ_{k+1} = arg min_{λ∈S} [ λ^T ( −∇L(λ_k) ) + (1/(2β_k)) ‖λ − λ_k‖²_2 ]            (4)
reduces to the plain gradient step
  λ_{k+1} = λ_k + β_k ∇L(λ_k)                                                       (5)
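For completeness (the step from (4) to (5) is implicit on the slide): ignoring the constraint set S, the subproblem in (4) is an unconstrained quadratic, so setting its gradient to zero gives
  −∇L(λ_k) + (1/β_k)(λ − λ_k) = 0   ⟹   λ_{k+1} = λ_k + β_k ∇L(λ_k),
which is exactly (5).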
Two problems with gradient descent:
- Impractical when we have a large dataset and some terms in the ELBO are intractable
- Uses the Euclidean distance and thus ignores the geometry of the variational-parameter space, leading to slow convergence
The Euclidean distance is a poor measure of dissimilarity between distributions:
- N(0, 10000) and N(10, 10000) yield ‖λ_1 − λ_2‖_2 = 10
- N(0, 0.01) and N(0.1, 0.01) yield ‖λ_1 − λ_2‖_2 = 0.1
Yet the first pair is nearly identical as distributions (the means differ by a tenth of a standard deviation), while the second pair differs by a full standard deviation.
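A quick numerical check of the two pairs above (a sketch, not from the paper), using the standard closed-form KL divergence between univariate Gaussians as the alternative notion of dissimilarity:

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

pairs = [((0.0, 10000.0), (10.0, 10000.0)),   # wide Gaussians, far-apart means
         ((0.0, 0.01), (0.1, 0.01))]          # narrow Gaussians, close means

for (mu1, var1), (mu2, var2) in pairs:
    euclid = np.linalg.norm([mu1 - mu2, var1 - var2])
    kl = kl_gauss(mu1, var1, mu2, var2)
    print(f"Euclidean {euclid:8.2f}   KL {kl:8.4f}")
```

The KL divergence comes out to about 0.005 for the first pair and 0.5 for the second, i.e. the reverse ordering of the Euclidean distances.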
Proximal-Gradient Methods: Natural Gradient

The problem is addressed by replacing the Euclidean distance with another divergence. The Natural-Gradient Method (Hoffman et al., 2013) uses the symmetric KL divergence:
  λ_{k+1} = arg min_{λ∈S} [ λ^T ( −∇L(λ_k) ) + (1/β_k) D_sym[ q(z|λ) ‖ q(z|λ_k) ] ]   (6)
This leads to the update
  λ_{k+1} = λ_k + β_k [ 2 G(λ_k) ]^{-1} ∇L(λ_k)                                       (7)
where G is the Fisher information matrix
  G(λ) := E_{q(z|λ)}[ (∇_λ log q(z|λ)) (∇_λ log q(z|λ))^T ]                           (8)
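As an illustration of (7)-(8), the sketch below runs the natural-gradient update on a hypothetical one-dimensional problem: q(z|λ) = N(m, s²) with λ = (m, log s), a toy objective L(λ) = E_q[log p(z)] + H[q] with p = N(3, 2²), the Fisher matrix estimated by Monte Carlo from (8), and the expectation's gradient estimated by the score-function identity that appears later in (22). The model, step size, and parameterization are illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: maximize L(lambda) = E_q[log p(z)] + H[q],  p = N(3, 2^2),
# with q(z | lambda) = N(m, s^2) and lambda = (m, log s).  The optimum is m = 3, s = 2.
mu_p, s_p = 3.0, 2.0

def grad_and_fisher(lam, num_samples=20_000):
    m, log_s = lam
    s = np.exp(log_s)
    z = m + s * rng.normal(size=num_samples)
    # Score of q w.r.t. (m, log s): d/dm log q = (z-m)/s^2,  d/d(log s) log q = ((z-m)/s)^2 - 1
    score = np.stack([(z - m) / s**2, ((z - m) / s) ** 2 - 1.0], axis=1)
    # Score-function gradient of E_q[log p(z)] plus the exact entropy gradient (dH/d(log s) = 1).
    log_p = -0.5 * ((z - mu_p) / s_p) ** 2
    grad = (score * log_p[:, None]).mean(axis=0) + np.array([0.0, 1.0])
    fisher = score.T @ score / num_samples          # Monte Carlo estimate of Eq. (8)
    return grad, fisher

lam = np.array([0.0, 0.0])                          # start at N(0, 1)
for _ in range(200):
    grad, fisher = grad_and_fisher(lam)
    lam = lam + 0.2 * np.linalg.solve(2.0 * fisher, grad)   # update (7)
print(lam[0], np.exp(lam[1]))                       # should approach (3, 2)
```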
Limitations of the natural gradient:
- Requires conditionally conjugate exponential-family models:
  - the model factorizes as p(z) = ∏_i p(z_i | pa_i) (the z_i are disjoint sets and pa_i are the parents of z_i in a directed acyclic graph)
  - each conditional distribution p(z_i | pa_i) is in the exponential family
- In the general case, computing the Fisher matrix is very costly
Proximal-Gradient Methods: KL-Based Methods

KL Proximal Variational Inference (Khan et al., 2015):
- Uses the KL divergence, as in Theis & Hoffman (2015)
- Splits the objective L := f + h, where f contains the difficult terms
- Leading to the iterations
  λ_{k+1} = arg min_{λ∈S} [ λ^T ( −∇f(λ_k) ) + h(λ) + (1/β_k) D_KL[ q(z|λ) ‖ q(z|λ_k) ] ]   (9)
Limitations:
- Exact gradient not feasible for large datasets
- Deterministic closed-form updates limited to simple models and Gaussian q
2. Proximal-Gradient Stochastic Variational Inference (PG-SVI)
The Proposed Method

The authors propose a proximal-gradient stochastic variational inference (PG-SVI) method:
  λ_{k+1} = arg min_{λ∈S} [ λ^T ( −∇̂f(λ_k) ) + h(λ) + (1/β_k) D[ λ ‖ λ_k ] ]   (10)
Contributions:
- Splitting L into a simple and a difficult term, similar to Khan et al. (2015)
- A stochastic approximation ∇̂f of the gradient of the difficult term
- Divergence functions D that incorporate the geometry of the parameter space
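A minimal skeleton of the iteration (10), just to fix the structure (a sketch, not the authors' implementation): `grad_f_hat` and `prox_step` are user-supplied; in the runnable toy below, D is the squared Euclidean distance and h ≡ 0, so the proximal subproblem has the closed-form solution λ_k + β ĝ. In the paper, D is instead chosen to match the geometry of q, and the examples on the following slides are set up so that the subproblem has a closed-form update.

```python
import numpy as np

rng = np.random.default_rng(0)

def pg_svi(grad_f_hat, prox_step, lam0, beta, num_iters):
    """Generic PG-SVI skeleton for update (10)."""
    lam = np.asarray(lam0, dtype=float)
    for _ in range(num_iters):
        g = grad_f_hat(lam)              # stochastic gradient of the difficult term f
        lam = prox_step(lam, g, beta)    # solve the proximal subproblem (10)
    return lam

# Euclidean prox step: argmin_l  -l^T g + ||l - lam||^2 / (2 beta)  =  lam + beta * g
euclidean_prox = lambda lam, g, beta: lam + beta * g

# Toy use: maximize f(lam) = -||lam - 3||^2 with a noisy gradient estimate.
grad_f_hat = lambda lam: -2.0 * (lam - 3.0) + 0.1 * rng.normal(size=lam.shape)
print(pg_svi(grad_f_hat, euclidean_prox, lam0=np.zeros(2), beta=0.1, num_iters=500))
```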
Splitting

The split is
  p(y, z) / q(z|λ) = c · p_d(z|λ) · p_e(z|λ)                                        (11)
Substituting into the ELBO gives
  L(λ) = E_q[ log p_d(z|λ) ] + E_q[ log p_e(z|λ) ] + log c,                          (12)
with f(λ) the first (difficult) term and h(λ) the second (easy) term.
The following assumptions are made:
- The function f is differentiable and L-smooth, i.e., its gradient is Lipschitz continuous: for all λ, λ' ∈ S,
    ‖∇f(λ) − ∇f(λ')‖ ≤ L ‖λ − λ'‖                                                    (13)
- The function h is a general convex function
Example 1: Gaussian Process Models

Non-Gaussian likelihood p(y_n | z_n), with z_n = f(x_n) a latent function drawn from a GP with mean zero and covariance K. The split is
  p(y, z) / q(z|λ) = [ ∏_{n=1}^N p(y_n | z_n) ] × [ N(z | 0, K) / N(z | m, V) ]      (14)
with p_d(z|λ) = ∏_n p(y_n | z_n) and p_e(z|λ) = N(z | 0, K) / N(z | m, V).
The ELBO is then split into
  f(λ) = Σ_n E_q[ log p(y_n | z_n) ]   and   h(λ) = D_KL[ N(m, V) ‖ N(0, K) ]        (15)
(h is convex and this leads to a closed-form update)
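The h(λ) term above is a Gaussian-to-Gaussian KL and is available in closed form. A minimal sketch using the standard multivariate-Gaussian KL formula (not code from the paper; shapes and values are illustrative):

```python
import numpy as np

def kl_gaussian_to_prior(m, V, K):
    """Closed-form KL( N(m, V) || N(0, K) ) for D-dimensional Gaussians
    (standard formula; this is the h(lambda) term for the GP model above)."""
    D = len(m)
    K_inv = np.linalg.inv(K)
    return 0.5 * (np.trace(K_inv @ V) + m @ K_inv @ m - D
                  + np.linalg.slogdet(K)[1] - np.linalg.slogdet(V)[1])

# Tiny illustration with an arbitrary SPD prior covariance K and variational (m, V).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)); K = A @ A.T + 3 * np.eye(3)
m = np.array([0.5, -0.2, 1.0]); V = 0.5 * np.eye(3)
print(kl_gaussian_to_prior(m, V, K))
print(kl_gaussian_to_prior(np.zeros(3), K, K))   # sanity check: ~0 when q matches the prior
```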
Example 2: Generalized Linear Models

  p(y, z) / q(z|λ) = [ ∏_{n=1}^N p(y_n | x_n^T z) ] × [ N(z | 0, I) / N(z | m, V) ]   (16)
with p_d(z|λ) = ∏_n p(y_n | x_n^T z) and p_e(z|λ) = N(z | 0, I) / N(z | m, V).
The ELBO is split into
  f(λ) = Σ_n E_q[ log p(y_n | x_n^T z) ]   and   h(λ) = D_KL[ N(m, V) ‖ N(0, I) ]     (17)
Example 3: Correlated Topic Model

Multinomial model with a latent Gaussian variable:
  p(z | µ, Σ) = N(z | µ, Σ)                                                           (18)
  p(t_n = k | z) = exp(z_k) / Σ_{j=1}^K exp(z_j)                                      (19)
  p(observing word v | t_n, β) = β_{v, t_n}                                           (20)
The split is
  p(y, z) / q(z|λ) = ∏_{n=1}^N [ Σ_{k=1}^K ( exp(z_k) / Σ_j exp(z_j) ) β_{n,k} ]^{y_n} × [ N(z | µ, Σ) / N(z | m, V) ]   (21)
with the first factor p_d(z|λ) and the second p_e(z|λ).
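A tiny sketch of the likelihood term inside p_d in (21), i.e. (19)-(20) with the topic assignment t_n marginalized out (shapes and values are hypothetical, not taken from the paper):

```python
import numpy as np

def word_prob(v, z, beta):
    """p(observing word v | z, beta) = sum_k softmax(z)_k * beta[v, k], i.e. (19)-(20)
    with t_n marginalized out.  beta is vocab_size x num_topics, z is num_topics."""
    theta = np.exp(z - z.max())          # softmax over topics, numerically stable
    theta /= theta.sum()
    return theta @ beta[v]

rng = np.random.default_rng(0)
K, vocab = 4, 10
beta = rng.dirichlet(np.ones(vocab), size=K).T    # each column sums to 1 over the vocabulary
z = rng.normal(size=K)                            # latent Gaussian topic logits
print(word_prob(3, z, beta))
print(sum(word_prob(v, z, beta) for v in range(vocab)))   # sums to 1 over the vocabulary
```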
Stochastic Approximations

Computing the gradient of an expectation term ∇_λ E_q[ f_n(z) ]:
  ĝ(λ, ξ_n) := (1/S) Σ_{s=1}^S f_n(z^{(s)}) ∇_λ log q(z^{(s)} | λ),   z^{(s)} ~ q(z|λ)   (22)
where ξ_n is the noise in the stochastic approximation ĝ.
The identity ∇_λ q(z|λ) = q(z|λ) ∇_λ log q(z|λ) was used.
The stochastic gradient is formed by randomly selecting a mini-batch of size M:
  ∇̂f(λ) = (N/M) Σ_{i=1}^M ĝ(λ, ξ_{n_i})                                                 (23)
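A quick numerical sanity check of the estimator (22) (a sketch, not the authors' code): for q(z|λ) = N(m, s²) with λ = (m, log s) and f(z) = z², the expectation E_q[f] = m² + s² is known, so the score-function estimate can be compared with the exact gradient. The mini-batch scaling N/M in (23) is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_grad(f, m, log_s, num_samples=200_000):
    """Score-function estimate of grad_lambda E_q[f(z)] for q = N(m, s^2),
    lambda = (m, log s), using grad q = q * grad log q as in Eq. (22)."""
    s = np.exp(log_s)
    z = m + s * rng.normal(size=num_samples)
    score = np.stack([(z - m) / s**2, ((z - m) / s) ** 2 - 1.0], axis=1)
    return (score * f(z)[:, None]).mean(axis=0)

# Known answer: f(z) = z^2 gives E_q[f] = m^2 + s^2, so d/dm = 2m and d/d(log s) = 2 s^2.
m, log_s = 1.5, np.log(0.7)
print(score_grad(lambda z: z**2, m, log_s))       # approx [3.0, 0.98]
print(2 * m, 2 * np.exp(log_s) ** 2)              # exact  [3.0, 0.98]
```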
3. Convergence of PG-SVI for Fixed Step Size
Additional Assumptions

- D(λ ‖ λ') > 0 for all λ ≠ λ'
- There exists α > 0 such that, for all λ, λ' generated by the algorithm,
    (λ − λ')^T ∇_λ D(λ ‖ λ') ≥ α ‖λ − λ'‖²                                           (24)
- The gradient estimate is unbiased: E[ ĝ(λ, ξ) ] = ∇f(λ)
- The variance of the gradient estimate is bounded: Var[ ĝ(λ, ξ_n) ] ≤ σ²
Convergence Proof for the Deterministic Case

Proposition 1. Let the assumptions above be satisfied. If we run t iterations with a fixed step size β_k = α/L for all k and an exact gradient ∇f(λ), then
  min_{k ∈ {0,1,...,t−1}} ‖λ_{k+1} − λ_k‖² ≤ 2 C_0 / (α t)                           (25)
where C_0 = L(λ*) − L(λ_0) is the initial sub-optimality.
Convergence Proof for the Stochastic Case

Proposition 3. If we run t iterations with a fixed step size β_k = γᾱ/L (where 0 < γ < 2 is a scalar) and a fixed batch size M_k = M for all k, with a stochastic gradient ∇̂f(λ), then
  E_{R,ξ}[ ‖λ_{R+1} − λ_R‖² ] ≤ (1 / (2 − γ)) [ 2 C_0 / (ᾱ t) + γ c σ² / (M L) ]      (26)
where c > 1/(2α), ᾱ := α − 1/(2c), and the expectation is with respect to the noise ξ due to the mini-batch selection and a random variable R drawn from Prob(R = k) = 1/t, k ∈ {0, 1, ..., t−1}.
4. Experimental Results
Gaussian Process Classification Experiment

- Zero-mean GP prior with a squared-exponential covariance function (hyperparameters set by cross-validation)
- Stochastic estimate of the gradient using a mini-batch size of 5 (Sonar, Ionosphere) and 20 (USPS)
- Number of MC samples for the expectation in the ELBO: 2000 (Sonar, USPS 3vs5) and 500 (Ionosphere)
- Parameters: mean m and Cholesky factor L of V
- Fixed step size for the proposed PG-SVI method
- Compared to adaptive stochastic methods (SGD, AdaGrad, RMSprop, etc.)
Gaussian Process Classification Results
Correlated Topic Model Experiment

- NIPS and Associated Press (AP) datasets
  - NIPS: 1500 documents from 1987-1999 (vocabulary size = 12,419; total words = 1.9M)
  - AP: 2,246 documents (vocabulary size = 10,473; total words = 436K)
- 50%/50% split for training and testing
- Compared to the Delta and Laplace methods from Wang & Blei (2013)
- Compared to mean-field from Blei & Lafferty (2007)
Correlated Topic Model Results