Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm


Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Qiang Liu and Dilin Wang, NIPS 2016
Discussion by Yunchen Pu, March 17, 2017

Introduction

Let $x \in \mathbb{R}^d$ be a random variable with prior $p_0(x)$, and let $D = \{D_k\}_{k=1}^N$ be a set of i.i.d. observations. Bayesian inference seeks the posterior distribution
$$p(x \mid D) = p_0(x)\,p_\theta(D \mid x) \Big/ \int p_0(x)\,p_\theta(D \mid x)\,dx.$$

1. Sampling: Markov chain Monte Carlo (MCMC), ...
2. Variational inference: approximate $p(x \mid D)$ with $q(x)$.
   - Mean-field: $q(x) = \prod_{i=1}^d q(x_i)$.
   - Black-box variational inference (e.g., VAE and normalizing flows): maximize the variational lower bound with stochastic gradient descent,
     $$(\theta, \phi) = \arg\max_{\theta,\phi}\; \mathbb{E}_{x \sim q_\phi(x)}\big[\log p_\theta(D \mid x)\big] - \mathrm{KL}\big(q_\phi(x) \,\|\, p_0(x)\big). \qquad (1)$$
3. Stein variational gradient descent: minimize $\mathrm{KL}\big(q(x) \,\|\, p(x \mid D)\big)$ by density transformation,
   $$q(x) = q_T \circ q_{T-1} \circ \cdots \circ q_1 \circ q_0(x). \qquad (2)$$
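For concreteness, here is a minimal numerical sketch (not from the slides) of the black-box variational objective in (1): a one-sample reparameterized estimate of the lower bound for a diagonal-Gaussian $q_\phi$ and a standard-normal prior. The function name, the placeholder quadratic log-likelihood, and the prior scale are assumptions made for illustration only.

```python
import numpy as np

def elbo_estimate(mu, log_sigma, log_lik, rng, prior_std=1.0):
    """One-sample reparameterized estimate of the lower bound in (1):
    E_{x ~ q_phi}[log p_theta(D | x)] - KL(q_phi || p_0),
    with q_phi = N(mu, diag(sigma^2)) and p_0 = N(0, prior_std^2 I)."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)
    x = mu + sigma * eps                      # reparameterization trick
    # Closed-form KL between the diagonal Gaussian q_phi and the Gaussian prior.
    kl = 0.5 * np.sum((sigma**2 + mu**2) / prior_std**2
                      - 1.0 - 2.0 * log_sigma + 2.0 * np.log(prior_std))
    return log_lik(x) - kl

# Toy usage: a made-up quadratic "log-likelihood" just to exercise the estimator.
rng = np.random.default_rng(0)
mu, log_sigma = np.zeros(3), np.zeros(3)
print(elbo_estimate(mu, log_sigma, lambda x: -0.5 * np.sum((x - 1.0)**2), rng))
```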

Stein variational gradient descent

Density transformation: assume $x \in \mathcal{X}$ is a random variable with distribution $q(x)$, and let $z = T(x)$, where $T : \mathcal{X} \to \mathcal{X}$ is a smooth transformation. The density of $z$ is
$$q_T(z) = q\big(T^{-1}(z)\big)\,\big|\det \nabla_z T^{-1}(z)\big|. \qquad (3)$$

Theorem. Assume $x$ is a random variable drawn from the distribution $q(x)$. Consider the transformation $T(x) = x + \epsilon\,\phi(x)$ and let $q_T(z)$ denote the distribution of $z = T(x)$. Then
$$\nabla_\epsilon \mathrm{KL}(q_T \,\|\, p)\big|_{\epsilon=0} = -\,\mathbb{E}_{x \sim q(x)}\big[\mathrm{trace}\big(\mathcal{A}_p \phi(x)\big)\big], \qquad (4)$$
where $\mathcal{A}_p \phi(x) = \nabla_x \log p(x)\,\phi(x)^\top + \nabla_x \phi(x)$.
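The identity (4) can be checked numerically. The sketch below (not from the slides) compares a finite-difference estimate of $\nabla_\epsilon \mathrm{KL}(q_T \,\|\, p)$ against the Stein-operator expectation in one dimension; the choices $q = \mathcal{N}(0,1)$, $p = \mathcal{N}(2,1)$, and $\phi(x) = \tanh(x)$ are arbitrary assumptions made for the check.

```python
import numpy as np

# Finite-difference check of identity (4) in one dimension, under assumed
# choices: q = N(0, 1), p = N(2, 1), and a test perturbation phi(x) = tanh(x).
rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)                 # samples from q = N(0, 1)

log_q = lambda z: -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
log_p = lambda z: -0.5 * (z - 2.0)**2 - 0.5 * np.log(2 * np.pi)
grad_log_p = lambda z: -(z - 2.0)                # score of p = N(2, 1)
phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z)**2             # phi'(x)

def kl_qT_p(eps):
    """Monte Carlo estimate of KL(q_T || p) for T(x) = x + eps*phi(x),
    using the 1D change of variables q_T(T(x)) = q(x) / |1 + eps*phi'(x)|."""
    z = x + eps * phi(x)
    log_qT = log_q(x) - np.log(np.abs(1.0 + eps * dphi(x)))
    return np.mean(log_qT - log_p(z))

h = 1e-3
fd_grad = (kl_qT_p(h) - kl_qT_p(-h)) / (2 * h)       # left-hand side of (4)
stein = -np.mean(grad_log_p(x) * phi(x) + dphi(x))   # right-hand side of (4)
print(fd_grad, stein)                                 # the two should agree closely
```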

Stein variational gradient descent

Lemma. Assume $\phi(x)$ lives in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}^d$ with kernel $k(\cdot,\cdot)$. The direction of steepest descent, i.e., the $\phi$ that maximizes the negative gradient in (4), is
$$\phi^*_{q,p}(\cdot) = \mathbb{E}_{x \sim q}\big[k(\cdot, x)\,\nabla_x \log p(x) + \nabla_x k(\cdot, x)\big], \qquad (5)$$
and the resulting gradient is $\nabla_\epsilon \mathrm{KL}(q_T \,\|\, p)\big|_{\epsilon=0} = -\,\|\phi^*_{q,p}\|^2_{\mathcal{H}^d}$.

Theorem. Assume $x$ is a random variable drawn from the distribution $q(x)$ and $T(x) = x + f(x)$, where $f \in \mathcal{H}^d$ lives in the RKHS, and let $q_T(z)$ denote the distribution of $z = T(x)$. Then
$$\nabla_f \mathrm{KL}(q_T \,\|\, p)\big|_{f=0} = -\,\phi^*_{q,p}(x). \qquad (6)$$
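The optimal direction (5) is what the algorithm later estimates from particles. Below is a minimal sketch (not from the slides) of that empirical estimate with an RBF kernel, evaluated at arbitrary query points; the function names, the bandwidth, and the standard-Gaussian score used in the toy usage are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_and_grad(x, query, h):
    """RBF kernel k(query, x) = exp(-||query - x||^2 / h) and its gradient with
    respect to x, for all (query, sample) pairs.
    Shapes: x (n, d), query (m, d) -> K (m, n), grad_x K (m, n, d)."""
    diff = query[:, None, :] - x[None, :, :]        # (m, n, d)
    K = np.exp(-np.sum(diff**2, axis=-1) / h)       # (m, n)
    grad_x_K = 2.0 / h * diff * K[:, :, None]       # d/dx of exp(-||query - x||^2 / h)
    return K, grad_x_K

def phi_star(x, query, grad_log_p, h):
    """Empirical estimate of (5): phi*(.) = E_{x~q}[k(., x) grad log p(x) + grad_x k(., x)]."""
    K, grad_x_K = rbf_kernel_and_grad(x, query, h)
    return (K @ grad_log_p(x) + grad_x_K.sum(axis=1)) / x.shape[0]

# Toy usage with an assumed standard-Gaussian target: the estimated direction
# should push samples drawn far from the origin back toward it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=1.0, size=(50, 2))
print(phi_star(samples, samples[:3], lambda x: -x, h=1.0))
```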

Stein variational gradient descent

Draw a set of initial particles from the initial distribution $q_0$, and then iteratively update the particles with an empirical version of the transform. The expectation in (5) is approximated with samples. A radial basis-function (RBF) kernel is employed, i.e., $k(x, x') = \exp\!\big(-\tfrac{1}{h}\|x - x'\|_2^2\big)$, where the bandwidth $h$ is the median of pairwise distances between the current samples. The normalizing constant is not needed:
$$\nabla_x \log p(x \mid D) = \nabla_x\big(\log p_\theta(D \mid x) + \log p_0(x)\big). \qquad (7)$$

Algorithm 1: Bayesian Inference via Variational Gradient Descent
Input: a target distribution with density function $p(x)$ and a set of initial particles $\{x_i^0\}_{i=1}^n$.
Output: a set of particles $\{x_i\}_{i=1}^n$ that approximates the target distribution.
For iteration $l$:
$$x_i^{l+1} \leftarrow x_i^l + \epsilon_l\,\hat{\phi}^*(x_i^l), \quad \text{where}\quad \hat{\phi}^*(x) = \frac{1}{n}\sum_{j=1}^n \Big[k(x_j^l, x)\,\nabla_{x_j^l}\log p(x_j^l) + \nabla_{x_j^l} k(x_j^l, x)\Big], \qquad (8)$$
and $\epsilon_l$ is the step size at the $l$-th iteration.
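Here is a minimal sketch of Algorithm 1 following (8). It is an illustration rather than the authors' implementation: it uses a fixed step size (the experiments use AdaGrad), a median-of-squared-distances bandwidth as one common variant of the median heuristic, and a placeholder 2-D standard-Gaussian target in the usage example.

```python
import numpy as np

def svgd(x, grad_log_p, n_iter=500, step=0.1):
    """Minimal sketch of Algorithm 1: iteratively move particles x (n, d)
    along the empirical steepest-descent direction in (8)."""
    n = x.shape[0]
    for _ in range(n_iter):
        diff = x[:, None, :] - x[None, :, :]            # (n, n, d), x_i - x_j
        sq_dists = np.sum(diff**2, axis=-1)             # (n, n)
        h = np.median(sq_dists) + 1e-8                  # median-heuristic bandwidth (one common variant)
        K = np.exp(-sq_dists / h)                       # k(x_j, x_i) = exp(-||x_i - x_j||^2 / h)
        grad_K = 2.0 / h * diff * K[:, :, None]         # grad_{x_j} k(x_j, x_i) for each pair (i, j)
        phi_hat = (K @ grad_log_p(x) + grad_K.sum(axis=1)) / n
        x = x + step * phi_hat                          # particle update (8)
    return x

# Toy usage with an assumed 2-D standard-Gaussian target; particles start far away.
rng = np.random.default_rng(0)
particles = svgd(rng.normal(loc=-10.0, size=(100, 2)), grad_log_p=lambda x: -x)
print(particles.mean(axis=0), particles.var(axis=0))    # should approach zero mean, unit variance
```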

Experiments: Toy Example on 1D Gaussian Mixture

Assume $p(x) = \tfrac{1}{3}\,\mathcal{N}(-2, 1) + \tfrac{2}{3}\,\mathcal{N}(2, 1)$ and $q_0(x) = \mathcal{N}(-10, 1)$.

Figure 1: Toy example with 1D Gaussian mixture, showing the particles at iterations 0, 50, 75, 100, 150, and 500. The red dashed lines are the target density function and the solid green lines are the densities of the particles at different iterations of our algorithm (estimated using a kernel density estimator). Note that the initial distribution is set to have almost zero overlap with the target distribution, and our method demonstrates the ability of escaping the local mode on the left to recover the mode on the right that is further away. We use n = 100 particles.

Figure 2: Same setting as Figure 1, except varying the number n of particles. Panels (a)-(c) show the log10 MSE of Monte Carlo versus Stein Variational Gradient Descent as a function of the sample size n, for estimating $E(x)$, $E(x^2)$, and $E(\cos(\omega x + b))$, respectively.
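To reproduce this toy example with the Algorithm 1 sketch above, one only needs the score of the mixture target. The sketch below (an illustration, not the authors' code; function names are hypothetical) computes $\nabla_x \log p(x)$ for the stated mixture and checks it against a finite difference of $\log p$.

```python
import numpy as np

def mixture_grad_log_p(x, weights=(1/3, 2/3), means=(-2.0, 2.0), std=1.0):
    """Score grad_x log p(x) of the toy target p(x) = 1/3 N(-2,1) + 2/3 N(2,1),
    evaluated elementwise (works for a particle array of shape (n, 1) as well)."""
    comps = [w * np.exp(-0.5 * ((x - m) / std)**2) / (std * np.sqrt(2 * np.pi))
             for w, m in zip(weights, means)]
    p = sum(comps)                                                    # mixture density p(x)
    dp = sum(c * (-(x - m) / std**2) for c, m in zip(comps, means))   # p'(x)
    return dp / p                                                     # grad log p = p'(x) / p(x)

def log_p(x):
    return np.log(sum(w * np.exp(-0.5 * (x - m)**2) / np.sqrt(2 * np.pi)
                      for w, m in zip((1/3, 2/3), (-2.0, 2.0))))

# Quick check against a central finite difference of log p at a few points.
xs = np.array([-10.0, -2.0, 0.0, 2.0])
h = 1e-5
print(mixture_grad_log_p(xs))
print((log_p(xs + h) - log_p(xs - h)) / (2 * h))   # should match closely
```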

Experiments: Bayesian Logistic Regression

Let $s \in \mathbb{R}^K$ be the feature vector and $y \in \{0, 1\}$ the label, with
$$p(y = 1 \mid s) = \sigma(\omega^\top s), \qquad \omega_k \sim \mathcal{N}(0, \alpha^{-1}), \qquad \alpha \sim \mathrm{Gamma}(1, 0.01).$$

Figure 3: Results on Bayesian logistic regression on the Covertype dataset, plotting testing accuracy (a) against the number of epochs with particle size n = 100, and (b) against the particle size n at 3000 iterations (about 2 epochs). Methods compared: Stein Variational Gradient Descent (our method), Stochastic Langevin (parallel and sequential SGLD), Particle Mirror Descent (PMD), and Doubly Stochastic variational inference (DSVI). We use n = 100 particles for our method, parallel SGLD, and PMD, and average the last 100 points for sequential SGLD. The particle-based methods (solid lines) in principle require 100 times more likelihood evaluations per iteration compared with DSVI and sequential SGLD (dashed lines), but are implemented efficiently using Matlab matrix operations (e.g., each iteration of parallel SGLD is only about 3 times slower than sequential SGLD). We partition the data into 80% for training and 20% for testing and average over 50 random trials. A mini-batch size of 50 is used for all algorithms.
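What SVGD needs for this model is the minibatch gradient of the log posterior with respect to the weights. Below is a minimal sketch (not the authors' Matlab code) with one simplification: the precision $\alpha$ is held fixed rather than given the Gamma(1, 0.01) hyperprior from the slide; the function names and the synthetic data in the usage example are likewise placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def blr_grad_log_post(w, S_batch, y_batch, n_total, alpha=0.01):
    """Minibatch estimate of grad_w log p(w | D) for Bayesian logistic regression
    with p(y=1|s) = sigmoid(w^T s) and, for simplicity, a fixed-precision Gaussian
    prior w_k ~ N(0, 1/alpha) (the slide places a Gamma(1, 0.01) hyperprior on alpha).
    Shapes: w (n_particles, K), S_batch (B, K), y_batch (B,)."""
    probs = sigmoid(w @ S_batch.T)                    # (n_particles, B)
    # Likelihood gradient sum_i (y_i - sigmoid(w^T s_i)) s_i, rescaled from the
    # minibatch of size B to the full dataset of size n_total.
    grad_lik = (y_batch[None, :] - probs) @ S_batch   # (n_particles, K)
    grad_lik *= n_total / S_batch.shape[0]
    grad_prior = -alpha * w                           # gradient of the log N(0, 1/alpha) prior
    return grad_lik + grad_prior

# Toy usage on synthetic data, just to show the shapes involved.
rng = np.random.default_rng(0)
S = rng.standard_normal((500, 5))
y = (S @ rng.standard_normal(5) + 0.1 * rng.standard_normal(500) > 0).astype(float)
w_particles = rng.standard_normal((100, 5))
idx = rng.choice(500, size=50, replace=False)         # mini-batch of 50, as in the slide
print(blr_grad_log_post(w_particles, S[idx], y[idx], n_total=500).shape)
```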

Experiments: Bayesian Neural Network

A feedforward neural network with one hidden layer is employed, with 50 hidden units for most datasets (100 units for Protein and Year, which are relatively large). ReLU(x) = max(0, x) is used as the activation function, whose weak derivative is I[x > 0] (Stein's identity also holds for weak derivatives; see Stein et al. [31]). All datasets are randomly partitioned into 90% for training and 10% for testing, and the results are averaged over 20 random trials, except for Protein and Year, on which 5 and 1 trials are repeated, respectively. The proposed algorithm is compared with probabilistic back-propagation (PBP), run using the default settings of the authors' code. For our algorithm, only 20 particles are used, with AdaGrad with momentum, as is standard in deep learning. The mini-batch size is 100, except for Year, on which 1000 is used. (A minimal sketch of the per-particle log-posterior gradient for such a network is given after the conclusion below.)

We find that our algorithm consistently improves over PBP both in terms of accuracy and speed (except on Yacht); this is encouraging since PBP was specifically designed for Bayesian neural networks. We also find that our results are comparable with the more recent results reported on the same datasets [e.g., 32-34], which leverage some advanced techniques that we can also benefit from.

| Dataset  | Avg. Test RMSE, PBP | Avg. Test RMSE, Ours | Avg. Test LL, PBP | Avg. Test LL, Ours | Avg. Time (Secs), PBP | Avg. Time (Secs), Ours |
|----------|---------------------|----------------------|-------------------|--------------------|-----------------------|------------------------|
| Boston   | 2.977 ± 0.093 | 2.957 ± 0.099 | -2.579 ± 0.052 | -2.504 ± 0.029 | 18   | 16  |
| Concrete | 5.506 ± 03    | 5.324 ± 04    | -3.137 ± 0.021 | -3.082 ± 0.018 | 33   | 24  |
| Energy   | 1.734 ± 0.051 | 1.374 ± 0.045 | -1.981 ± 0.028 | -1.767 ± 0.024 | 25   | 21  |
| Kin8nm   | 0.098 ± 0.001 | 0.090 ± 0.001 | 0.901 ± 0.010  | 0.984 ± 0.008  | 118  | 41  |
| Naval    | 0.006 ± 0.000 | 0.004 ± 0.000 | 3.735 ± 0.004  | 4.089 ± 0.012  | 173  | 49  |
| Combined | 4.052 ± 0.031 | 4.033 ± 0.033 | -2.819 ± 0.008 | -2.815 ± 0.008 | 136  | 51  |
| Protein  | 4.623 ± 0.009 | 4.606 ± 0.013 | -2.950 ± 0.002 | -2.947 ± 0.003 | 682  | 68  |
| Wine     | 0.614 ± 0.008 | 0.609 ± 0.010 | -0.931 ± 0.014 | -0.925 ± 0.014 | 26   | 22  |
| Yacht    | 0.778 ± 0.042 | 0.864 ± 0.052 | -1.211 ± 0.044 | -1.225 ± 0.042 | 25   | 25  |
| Year     | 8.733 ± NA    | 8.684 ± NA    | -3.586 ± NA    | -3.580 ± NA    | 7777 | 684 |

Conclusion

We propose a simple general-purpose variational inference algorithm for fast and scalable Bayesian inference. Future directions include more theoretical understanding of our method, more practical applications in deep learning models, and other potential applications of our basic theorem.
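As referenced above, here is a minimal sketch (an illustration, not the authors' implementation) of a per-particle log-posterior gradient for a one-hidden-layer ReLU regression network. Simplifying assumptions: fixed observation noise and a fixed-variance Gaussian prior on all weights (the paper treats these more carefully), and a full-batch gradient rather than the mini-batches used in the experiments; all function names are hypothetical. The ReLU weak derivative I[z > 0] mentioned above appears explicitly in the backward pass.

```python
import numpy as np

def unpack(theta, d_in, n_hidden):
    """Split one particle's flat parameter vector into (W1, b1, w2, b2)."""
    i = 0
    W1 = theta[i:i + d_in * n_hidden].reshape(n_hidden, d_in); i += d_in * n_hidden
    b1 = theta[i:i + n_hidden]; i += n_hidden
    w2 = theta[i:i + n_hidden]; i += n_hidden
    b2 = theta[i]
    return W1, b1, w2, b2

def grad_log_post(theta, X, y, d_in, n_hidden=50, noise_var=1.0, prior_var=1.0):
    """Gradient of an (assumed) unnormalized log posterior for a one-hidden-layer
    ReLU regression network: Gaussian likelihood N(y | f(x), noise_var) and a
    Gaussian N(0, prior_var) prior on all weights, via manual backpropagation."""
    W1, b1, w2, b2 = unpack(theta, d_in, n_hidden)
    Z = X @ W1.T + b1                     # pre-activations, (N, n_hidden)
    H = np.maximum(Z, 0.0)                # ReLU(z) = max(0, z)
    f = H @ w2 + b2                       # predictions, (N,)
    delta = (y - f) / noise_var           # d log-lik / d f, per data point
    dW2 = delta @ H
    db2 = delta.sum()
    dH = np.outer(delta, w2) * (Z > 0)    # ReLU weak derivative I[z > 0]
    dW1 = dH.T @ X
    db1 = dH.sum(axis=0)
    grad_lik = np.concatenate([dW1.ravel(), db1, dW2, [db2]])
    return grad_lik - theta / prior_var   # add the gradient of the Gaussian prior

# Toy usage with synthetic data, just to show the packing and shapes.
rng = np.random.default_rng(0)
d_in, n_hidden = 4, 50
X, y = rng.standard_normal((128, d_in)), rng.standard_normal(128)
theta = 0.1 * rng.standard_normal(d_in * n_hidden + n_hidden + n_hidden + 1)
print(grad_log_post(theta, X, y, d_in).shape)
```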