Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions


Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions
Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, Masashi Sugiyama
Conference on Uncertainty in Artificial Intelligence (UAI) 2016
Discussion led by Yan Kaganovsky, Duke University

Outline
1. Intro: Variational Inference and Proximal Methods
2. Proximal-Gradient Stochastic Variational Inference (PG-SVI)
3. Convergence of PG-SVI for Fixed Step Size
4. Experimental Results

Part 1: Intro: Variational Inference and Proximal Methods

Variational Inference
- Bayesian inference with a general latent variable model: a data vector y of length N and a latent vector z of length D.
- Approximate the log evidence \log p(y) with the ELBO:
  \log p(y) = \log \int q(z \mid \lambda) \, \frac{p(y, z)}{q(z \mid \lambda)} \, dz \;\ge\; \mathbb{E}_{q(z \mid \lambda)}\!\left[ \log \frac{p(y, z)}{q(z \mid \lambda)} \right] =: L(\lambda).
- The problem reduces to finding the parameters \lambda of q:
  \lambda^{*} = \arg\max_{\lambda \in S} L(\lambda).
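To make the bound concrete, here is a minimal numerical sketch (not from the slides; the one-dimensional conjugate model and all variable names are illustrative assumptions) that Monte Carlo-estimates the ELBO and compares it with the exact log evidence; for any choice of (m, v) the estimate should fall below \log p(y), and it becomes tight at the exact posterior.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy conjugate model: z ~ N(0, tau2), y | z ~ N(z, sigma2).
# Exact evidence: p(y) = N(y; 0, sigma2 + tau2).
tau2, sigma2, y = 1.0, 0.5, 1.3

def elbo_estimate(m, v, num_samples=100000, seed=0):
    rng = np.random.default_rng(seed)
    z = m + np.sqrt(v) * rng.standard_normal(num_samples)   # z ~ q(z | lambda) = N(m, v)
    log_joint = gaussian_logpdf(y, z, sigma2) + gaussian_logpdf(z, 0.0, tau2)
    log_q = gaussian_logpdf(z, m, v)
    return np.mean(log_joint - log_q)

log_evidence = gaussian_logpdf(y, 0.0, sigma2 + tau2)
print("log p(y)               :", log_evidence)
print("ELBO at a rough guess   :", elbo_estimate(m=0.0, v=1.0))
# Exact posterior: N(y*tau2/(sigma2+tau2), sigma2*tau2/(sigma2+tau2)); the bound is tight there.
print("ELBO at exact posterior :", elbo_estimate(m=y * tau2 / (sigma2 + tau2),
                                                 v=sigma2 * tau2 / (sigma2 + tau2)))
```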

Proximal-Gradient Methods: Gradient Descent
- Form a linear approximation of the objective around the current iterate, L(\lambda) \approx L(\lambda_k) + (\lambda - \lambda_k)^T \nabla L(\lambda_k), and take a step that stays within some distance of the previous solution.
- The simplest case penalizes the squared Euclidean distance,
  \lambda_{k+1} = \arg\min_{\lambda \in S} \left[ \lambda^T \big[-\nabla L(\lambda_k)\big] + \frac{1}{2\beta_k} \| \lambda - \lambda_k \|_2^2 \right],
  which reduces to the gradient update
  \lambda_{k+1} = \lambda_k + \beta_k \nabla L(\lambda_k).
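As a quick check (not spelled out on the slide), setting the gradient of the subproblem above to zero recovers the gradient update:
\[ -\nabla L(\lambda_k) + \frac{1}{\beta_k}(\lambda - \lambda_k) = 0 \;\Longrightarrow\; \lambda_{k+1} = \lambda_k + \beta_k \nabla L(\lambda_k). \]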

Proximal-Gradient Methods: Gradient Descent
Two problems with gradient descent:
- It is impractical when we have a large dataset and some terms in the ELBO are intractable.
- It uses the Euclidean distance and thus ignores the geometry of the variational-parameter space, which leads to slow convergence.

Proximal-Gradient Methods: Gradient Descent
The Euclidean distance is a poor measure of dissimilarity between distributions:
- N(0, 10000) and N(10, 10000) yield \| \lambda_1 - \lambda_2 \|_2 = 10, even though the two distributions are nearly indistinguishable.
- N(0, 0.01) and N(0.1, 0.01) yield \| \lambda_1 - \lambda_2 \|_2 = 0.1, even though the two distributions barely overlap.
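A small numerical sketch (not from the slides; it assumes the second parameter above denotes a variance) makes the contrast explicit by comparing the Euclidean distance between the mean parameters with the KL divergence between the two Gaussians in each pair:

```python
import numpy as np

def kl_gaussian(mu1, var1, mu2, var2):
    # KL[N(mu1, var1) || N(mu2, var2)] for univariate Gaussians
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

pairs = {
    "N(0, 10000) vs N(10, 10000)": (0.0, 10000.0, 10.0, 10000.0),
    "N(0, 0.01)  vs N(0.1, 0.01)": (0.0, 0.01, 0.1, 0.01),
}
for name, (m1, v1, m2, v2) in pairs.items():
    euclid = abs(m1 - m2)              # distance between the mean parameters
    kl = kl_gaussian(m1, v1, m2, v2)   # ~0.005 for the first pair, ~0.5 for the second
    print(f"{name}: Euclidean = {euclid:.3f}, KL = {kl:.4f}")
# The Euclidean distance says the first pair is 100x farther apart;
# the KL divergence says it is 100x closer.
```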

Proximal-Gradient Methods: Natural Gradient
- The problem is addressed by replacing the Euclidean distance with another divergence. The natural-gradient method (Hoffman et al., 2013) uses the symmetrized KL divergence:
  \lambda_{k+1} = \arg\min_{\lambda \in S} \left[ \lambda^T \big[-\nabla L(\lambda_k)\big] + \frac{1}{\beta_k} D_{\mathrm{sym}}\big[ q(z \mid \lambda) \,\|\, q(z \mid \lambda_k) \big] \right].
- This leads to the update
  \lambda_{k+1} = \lambda_k + \beta_k \big[ 2 \, G(\lambda_k) \big]^{-1} \nabla L(\lambda_k),
  where G is the Fisher information matrix
  G(\lambda) := \mathbb{E}_{q(z \mid \lambda)}\big[ \nabla_\lambda \log q(z \mid \lambda) \, \nabla_\lambda \log q(z \mid \lambda)^T \big].
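The step from the divergence form to the Fisher-preconditioned update is the standard second-order argument (not spelled out on the slide): near \lambda_k the symmetrized KL behaves like a quadratic in the Fisher metric,
\[ D_{\mathrm{sym}}\big[ q(z \mid \lambda) \,\|\, q(z \mid \lambda_k) \big] \approx (\lambda - \lambda_k)^T G(\lambda_k) (\lambda - \lambda_k), \]
and setting the gradient of the resulting subproblem to zero gives
\[ -\nabla L(\lambda_k) + \frac{2}{\beta_k} G(\lambda_k)(\lambda - \lambda_k) = 0 \;\Longrightarrow\; \lambda_{k+1} = \lambda_k + \beta_k \big[ 2 \, G(\lambda_k) \big]^{-1} \nabla L(\lambda_k). \]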

Proximal-Gradient Methods: Natural Gradient
Limitations of the natural gradient:
- It requires conditionally conjugate exponential-family models: the model factorizes as \prod_i p(z_i \mid \mathrm{pa}_i), where the z_i are disjoint sets and \mathrm{pa}_i are the parents of z_i in a directed acyclic graph, and each conditional distribution p(z_i \mid \mathrm{pa}_i) is in the exponential family.
- In the general case, computing the Fisher matrix is very costly.

Proximal-Gradient Methods: KL-Based Methods
KL Proximal Variational Inference (Khan et al., 2015):
- Uses the KL divergence as the proximity term, as in Theis & Hoffman (2015).
- Splits the objective into a difficult term f and a simple convex term h (so that L = f - h), linearizing only f, which leads to the iterations
  \lambda_{k+1} = \arg\min_{\lambda \in S} \left[ \lambda^T \big[-\nabla f(\lambda_k)\big] + h(\lambda) + \frac{1}{\beta_k} D_{KL}\big[ q(z \mid \lambda) \,\|\, q(z \mid \lambda_k) \big] \right].
Limitations:
- The exact gradient is not feasible for large datasets.
- The deterministic closed-form updates are limited to simple models and Gaussian q.

Part 2: Proximal-Gradient Stochastic Variational Inference (PG-SVI)

The Proposed Method
The authors propose a proximal-gradient stochastic variational inference (PG-SVI) method:
  \lambda_{k+1} = \arg\min_{\lambda \in S} \left[ \lambda^T \big[-\hat{\nabla} f(\lambda_k)\big] + h(\lambda) + \frac{1}{\beta_k} D\big( \lambda \,\|\, \lambda_k \big) \right].
Contributions:
- Splitting L into a simple term and a difficult term, similar to Khan et al. (2015).
- A stochastic approximation \hat{\nabla} f of the gradient of the difficult term.
- Divergence functions D that incorporate the geometry of the variational-parameter space.
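To show the shape of the iteration, here is a minimal, hypothetical Python sketch of a PG-SVI-style loop on a toy quadratic problem. It is not the paper's closed-form update: the divergence D is taken to be the squared Euclidean distance, the inner subproblem is solved by a few gradient steps, and the functions grad_f_hat, grad_h, grad_divergence, and pg_svi_step are made-up names for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f_hat(lam):
    # Noisy gradient of the "difficult" term f(lam) = -0.5 * ||lam - 3||^2
    return -(lam - 3.0) + 0.1 * rng.standard_normal(lam.shape)

def grad_h(lam):
    # Gradient of the "easy" convex term h(lam) = 0.5 * ||lam||^2
    return lam

def grad_divergence(lam, lam_k):
    # Gradient (in lam) of D(lam || lam_k) = 0.5 * ||lam - lam_k||^2
    return lam - lam_k

def pg_svi_step(lam_k, beta, n_inner=20, inner_lr=0.05):
    # One PG-SVI-style step: linearize f at lam_k, keep h and D exact, and solve
    # the subproblem  min_lam  -lam^T g + h(lam) + D(lam || lam_k) / beta
    # numerically with a few gradient steps.
    g = grad_f_hat(lam_k)
    lam = lam_k.copy()
    for _ in range(n_inner):
        lam -= inner_lr * (-g + grad_h(lam) + grad_divergence(lam, lam_k) / beta)
    return lam

lam = np.zeros(2)
for _ in range(200):
    lam = pg_svi_step(lam, beta=0.5)
print("final lambda:", lam)  # roughly [1.5, 1.5], the maximizer of f - h
```

With f(\lambda) = -\tfrac{1}{2}\|\lambda - 3\|^2 and h(\lambda) = \tfrac{1}{2}\|\lambda\|^2, the iterates settle near the maximizer of f - h, which is 1.5 in each coordinate.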

Splitting
The split is
  \frac{p(y, z)}{q(z \mid \lambda)} = c \; p_d(z \mid \lambda) \; p_e(z \mid \lambda),
where p_d collects the difficult terms and p_e the easy ones. Substituting into the ELBO gives
  L(\lambda) = \mathbb{E}_q\big[\log p_d(z \mid \lambda)\big] + \mathbb{E}_q\big[\log p_e(z \mid \lambda)\big] + \log c,
with f(\lambda) := \mathbb{E}_q[\log p_d(z \mid \lambda)] the difficult part and h(\lambda) := -\mathbb{E}_q[\log p_e(z \mid \lambda)] the easy part, so that L = f - h + \log c.
The following assumptions are made:
- The function f is differentiable and its gradient is L-Lipschitz continuous, i.e., for all \lambda, \lambda' \in S,
  \| \nabla f(\lambda) - \nabla f(\lambda') \| \le L \, \| \lambda - \lambda' \|.
- The function h is a general convex function.

Example 1: Gaussian Process Models
- Non-Gaussian likelihood p(y_n \mid z_n), where z_n = f(x_n) is a latent function drawn from a GP with mean zero and covariance K.
- The split is
  \frac{p(y, z)}{q(z \mid \lambda)} = \underbrace{\prod_{n=1}^{N} p(y_n \mid z_n)}_{p_d(z \mid \lambda)} \; \underbrace{\frac{\mathcal{N}(z \mid 0, K)}{\mathcal{N}(z \mid m, V)}}_{p_e(z \mid \lambda)}.
- The ELBO is split into
  f(\lambda) = \sum_n \mathbb{E}_q\big[\log p(y_n \mid z_n)\big] \quad \text{and} \quad h(\lambda) = D_{KL}\big[ \mathcal{N}(z \mid m, V) \,\|\, \mathcal{N}(z \mid 0, K) \big];
  h is convex, and this leads to a closed-form update.
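For reference (a standard identity, not shown on the slide), the h term above is available in closed form for D-dimensional Gaussians, so h and its gradient can be evaluated analytically:
\[ D_{KL}\big[ \mathcal{N}(m, V) \,\|\, \mathcal{N}(0, K) \big] = \frac{1}{2}\left[ \operatorname{tr}(K^{-1} V) + m^T K^{-1} m - D + \log\det K - \log\det V \right]. \]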

Example 2: Generalized Linear Model
- The split is
  \frac{p(y, z)}{q(z \mid \lambda)} = \underbrace{\prod_{n=1}^{N} p(y_n \mid x_n^T z)}_{p_d(z \mid \lambda)} \; \underbrace{\frac{\mathcal{N}(z \mid 0, I)}{\mathcal{N}(z \mid m, V)}}_{p_e(z \mid \lambda)}.
- The ELBO is split into
  f(\lambda) = \sum_n \mathbb{E}_q\big[\log p(y_n \mid x_n^T z)\big] \quad \text{and} \quad h(\lambda) = D_{KL}\big[ \mathcal{N}(z \mid m, V) \,\|\, \mathcal{N}(z \mid 0, I) \big].
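As a concrete instance of the difficult term, here is a hedged numpy sketch (not from the paper; the data, the factorized q(z) = \mathcal{N}(m, \mathrm{diag}(v)), and all function names are illustrative assumptions) that Monte Carlo-estimates f(\lambda) = \sum_n \mathbb{E}_q[\log p(y_n \mid x_n^T z)] for Bayesian logistic regression with labels y_n \in \{-1, +1\}:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: N points in D dimensions with labels in {-1, +1}
N, D = 200, 3
X = rng.standard_normal((N, D))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(N))

def f_estimate(m, v, S=50):
    """Monte Carlo estimate of sum_n E_q[log p(y_n | x_n^T z)]
    with q(z) = N(m, diag(v)) and a logistic likelihood."""
    eps = rng.standard_normal((S, D))
    z = m + np.sqrt(v) * eps              # S samples from q, shape (S, D)
    logits = y[None, :] * (z @ X.T)       # y_n * x_n^T z, shape (S, N)
    loglik = -np.logaddexp(0.0, -logits)  # log sigmoid(y_n * x_n^T z)
    return loglik.sum(axis=1).mean()      # sum over data, average over samples

m, v = np.zeros(D), np.ones(D)
print("MC estimate of f(lambda):", f_estimate(m, v))
```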

Example 3: Correlated Topic Model
- Multinomial model with a latent Gaussian variable:
  p(z \mid \mu, \Sigma) = \mathcal{N}(z \mid \mu, \Sigma), \qquad
  p(t_n = k \mid z) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad
  p(\text{observing a word } v \mid t_n, \beta) = \beta_{v, t_n}.
- The split is
  \frac{p(y, z)}{q(z \mid \lambda)} = \underbrace{\prod_{n=1}^{N} \left[ \frac{\sum_{k=1}^{K} \beta_{n,k} \exp(z_k)}{\sum_{j} \exp(z_j)} \right]^{y_n}}_{p_d(z \mid \lambda)} \; \underbrace{\frac{\mathcal{N}(z \mid \mu, \Sigma)}{\mathcal{N}(z \mid m, V)}}_{p_e(z \mid \lambda)}.
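A small numpy sketch (illustrative only; beta_n stands for the row of word-topic probabilities \beta_{n, \cdot} for word n, an assumption about the slide's notation) shows how one factor of \log p_d(z \mid \lambda) can be evaluated stably with the log-sum-exp trick:

```python
import numpy as np

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def log_pd_term(z, beta_n, y_n):
    """log of [ sum_k beta_{n,k} exp(z_k) / sum_j exp(z_j) ]^{y_n},
    computed as y_n * (logsumexp(z + log beta_n) - logsumexp(z))."""
    return y_n * (logsumexp(z + np.log(beta_n)) - logsumexp(z))

z = np.array([0.2, -1.0, 3.0, 0.5])      # latent Gaussian (K = 4 topics)
beta_n = np.array([0.1, 0.3, 0.4, 0.2])  # P(word n | topic k), k = 1..K
print(log_pd_term(z, beta_n, y_n=2))     # word n observed twice
```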

Stochastic Approximations
- The gradient of the expectation term \mathbb{E}_q[f_n(z)] is estimated by
  \hat{g}(\lambda, \xi_n) := \frac{1}{S} \sum_{s=1}^{S} f_n\big(z^{(s)}\big) \, \nabla_\lambda \log q\big(z^{(s)} \mid \lambda\big), \qquad z^{(s)} \sim q(z \mid \lambda),
  where \xi_n is the noise in the stochastic approximation \hat{g}.
- The identity \nabla_\lambda q(z \mid \lambda) = q(z \mid \lambda) \, \nabla_\lambda \log q(z \mid \lambda) was used.
- The stochastic gradient is formed by randomly selecting a mini-batch of size M:
  \hat{\nabla} f(\lambda) = \frac{N}{M} \sum_{i=1}^{M} \hat{g}(\lambda, \xi_{n_i}).
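A minimal sketch of this score-function estimator, under the illustrative assumptions q(z \mid \lambda) = \mathcal{N}(\lambda, 1) and f_n(z) = z^2, where the exact gradient \nabla_\lambda \mathbb{E}_q[z^2] = 2\lambda is known and can serve as a check:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_hat(lam, f_n, S=10000):
    """Score-function estimate of grad_lambda E_{q(z|lam)}[f_n(z)]
    for q(z | lam) = N(lam, 1), using grad_lam log q(z | lam) = z - lam."""
    z = lam + rng.standard_normal(S)
    return np.mean(f_n(z) * (z - lam))

lam = 0.7
print("score-function estimate:", g_hat(lam, f_n=lambda z: z ** 2))  # noisy, around 1.4
print("exact gradient         :", 2 * lam)
```

Scaling a mini-batch sum of such per-term estimates by N/M gives the \hat{\nabla} f(\lambda) used in the PG-SVI update.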

Part 3: Convergence of PG-SVI for Fixed Step Size

Additional Assumptions
- D(\lambda \,\|\, \lambda') > 0 for all \lambda \neq \lambda'.
- There exists an \alpha > 0 such that, for all \lambda, \lambda' generated by the algorithm,
  (\lambda - \lambda')^T \nabla_\lambda D(\lambda \,\|\, \lambda') \ge \alpha \, \| \lambda - \lambda' \|^2.
- The estimate of the gradient is unbiased: \mathbb{E}[\hat{g}(\lambda, \xi)] = \nabla f(\lambda).
- The variance of the gradient estimate is bounded: \mathrm{Var}[\hat{g}(\lambda, \xi_n)] \le \sigma^2.
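As a sanity check (not on the slide), the second assumption holds with \alpha = 1 for the squared Euclidean proximity term used earlier: with D(\lambda \,\|\, \lambda') = \tfrac{1}{2}\|\lambda - \lambda'\|^2 we have \nabla_\lambda D = \lambda - \lambda', so
\[ (\lambda - \lambda')^T \nabla_\lambda D(\lambda \,\|\, \lambda') = \|\lambda - \lambda'\|^2 \ge 1 \cdot \|\lambda - \lambda'\|^2. \]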

Convergence Proof for the Deterministic Case
Proposition 1. Let the assumptions made above be satisfied. If we run t iterations with a fixed step size \beta_k = \alpha / L for all k and an exact gradient \nabla f(\lambda), then
  \min_{k \in \{0, 1, \ldots, t-1\}} \| \lambda_{k+1} - \lambda_k \|^2 \le \frac{2 C_0}{\alpha t},
where C_0 = L(\lambda^{*}) - L(\lambda_0) is the initial sub-optimality.

Convergence Proof for the Stochastic Case
Proposition 3. If we run t iterations with a fixed step size \beta_k = \gamma \alpha' / L (where 0 < \gamma < 2 is a scalar) and a fixed batch size M_k = M for all k with a stochastic gradient \hat{\nabla} f(\lambda), then
  \mathbb{E}_{R, \xi}\big[ \| \lambda_{R+1} - \lambda_R \|^2 \big] \le \frac{1}{2 - \gamma} \left[ \frac{2 C_0}{\alpha' t} + \frac{\gamma c \sigma^2}{M L} \right],
where c > 1/(2\alpha), \alpha' := \alpha - 1/(2c), and the expectation is with respect to the noise \xi due to the mini-batch selection and a random variable R drawn from \mathrm{Prob}(R = k) = 1/t, k \in \{0, 1, \ldots, t-1\}.

Part 4: Experimental Results

Gaussian Process Classification Experiment
- Zero-mean GP prior with a squared-exponential covariance function (hyperparameters set by cross-validation).
- Stochastic estimate of the gradient using a mini-batch size of 5 (Sonar, Ionosphere) and 20 (USPS).
- Number of MC samples for the expectation in the ELBO: 2000 (Sonar, USPS 3v5) and 500 (Ionosphere).
- Variational parameters: the mean m and the Cholesky factor L of V.
- Fixed step size for the proposed PG-SVI method.
- Compared against adaptive stochastic methods (SGD, ADAGRAD, RMSprop, etc.).

Gaussian Process Classification Results
[Results figure in the original slides.]

Correlated Topic Model Experiment
- NIPS and Associated Press (AP) datasets:
  NIPS: 1,500 documents from 1987-1999 (vocabulary size 12,419; total words 1.9M).
  AP: 2,246 documents (vocabulary size 10,473; total words 436K).
- 50%/50% split for training and testing.
- Compared to the Delta and Laplace methods from Wang & Blei (2013).
- Compared to the mean-field method from Blei & Lafferty (2007).

Correlated Topic Model Results
[Results figure in the original slides.]