Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8

Introduction Let x R d be random variables with prior p 0 (x), and D = {D k } N k=1 is a set of i.i.d. observation. Bayesian inference seek the posterior distribution p(x D) = p 0 (x)p θ (D x)/ p 0 (x)p θ (D x)dx: 1 Sampling: Markov chain Monte Carlo(MCMC)... 2 Variational inference: approximate p(x D) with q(x) Mean-field: q(x) = d i=1 q(xi) Black box variational inference, e.g., VAE and Normalizing flow: maximize the variational lower bound with stochastic gradient descent (θ, φ) = arg max θ,φ E x q φ (x)[log p θ (D x)] KL(q φ (x) p 0(x)) (1) Stein variational gradient descent: Minimize KL(q(x) p(x D)) by density transformation: q(x) = q T q T 1 q 1 q 0(x) (2) March 17, 2017 2 / 8

Stein variational gradient descent Density transformation: assume x X is random variables with distribution q(x) and z = T (x) where T : X X is a smooth transformation. The density of z is q T (z) = q ( T 1 (z) ) det z T 1 (z) (3) Theorem Assume x is random variable drawn from distributions q(x). Consider the transformation T (x) = x + ɛφ(x) and let q T (z) represent the distribution of z = T (x). We have ) ( ɛ (KL(q T p) ɛ=0 = E x q(x) trace(ap φ(x)) ), (4) where A p φ(x) = x log p(x)φ(x) T + x φ(x) March 17, 2017 3 / 8

Stein variational gradient descent Lemma Assume φ(x) lives in a reproducing kernel Hilbert space (RKHS) with kernel k(, ), the direction of steepest descent that maximizes the gradient in (4) is: φ q,p( ) = E x q [k(, x) x log p(x) + x k(, x)], (5) ) and the gradient ɛ (KL(q T p) ɛ=0 = φ q,p(x) 2 H d Theorem Assume x is random variable drawn from distributions q(x) and T (x) = x + f (x) where f (x) H d lives in a RKHS. and let q T (z) represent the distribution of z = T (x). We have ) f (KL(q T p) f =0 = φ q,p(x). (6) March 17, 2017 4 / 8

distribution q 0, and then iteratively update the particles with an empirical version of March the17, transform 2017 in 5 / 8 Stein variational gradient descent The expectation in (4) is approximated with samples. A radial basis-function (RBF) kernel is employed, i.e., k(x, x ) = exp( 1 h x x 2 2 ), where the bandwidth, h, is the median of pairwise distances between current samples. Normalizing constant is not needed ( ) x log p(x D) = x log p θ (D x) + log p 0 (x) (7) Algorithm 1 Bayesian Inference via Variational Gradient Descent Input: A target distribution with density function p(x) and a set of initial particles {x 0 i }n i=1. Output: A set of particles {x i } n i=1 that approximates the target distribution. for iteration l do x l+1 i x l i + ɛ ˆφ l (x l i) where ˆφ (x) = 1 n [ k(x l n j, x) log x l p(xl j j) + x l j k(x l j, x) ], (8) where ɛ l is the step size at the l-th iteration. end for j=1

Figure 2: We use the same setting as Figure 1, except varying the number n of particles. March 17, (a)-(c) 2017show6 / 8 Experiments: Toy Example on 1D Gaussian Mixture Assume p(x) = 1/3N ( 2, 1) + 2/3N (2, 1) and q 0 (x) = N ( 10, 1) 0th Iteration 50th Iteration 75th Iteration 100th Iteration 150th Iteration 500th Iteration Figure 1: Toy example with 1D Gaussian mixture. The red dashed lines are the target density function and the solid green lines are the densities of the particles at different iterations of our algorithm (estimated using kernel density estimator). Note that the initial distribution is set to have almost zero overlap with the target distribution, and our method demonstrates the ability of escaping the local mode on the left to recover the mode on the left that is further away. We use n = 100 particles. Log10 MSE -0.5-1 -1.5 10 50 250 Sample Size (n) Log10 MSE 0-1 -2-3 10 50 250 Sample Size (n) Log10 MSE -2-2.5-3 -3.5 10 50 250 Sample Size (n) (a) Estimating E(x) (b) Estimating E(x 2 ) (c) Estimating E(cos(ωx + b)) Monte Carlo Stein Variational Gradient Descent

Experiments: Bayesian Logistic Regression Let s R K be the feature and y {0, 1} be the label with p(y = 1 s) = σ(ω T s), ω k N (0, α 1 ), α Gamma(1, 0.01) (8) Testing Accuracy 0.75 0.7 0.65 1 2 Number of Epoches (a) Particle size n = 100 Testing Accuracy 0.75 0.7 0.65 1 10 50 250 Particle Size (n) (b) Results at 3000 iteration ( 2 epoches) Stein Variational Gradient Descent (Our Method) Stochastic Langevin (Parallel SGLD) Particle Mirror Descent (PMD) Doubly Stochastic (DSVI) Stochastic Langevin (Sequential SGLD) Figure 3: Results on Bayesian logistic regression on Covertype dataset w.r.t. epochs and the particle size n. We use n = 100 particles for our method, parallel SGLD and PMD, and average the last 100 points for the sequential SGLD. The particle-based methods (solid lines) in principle require 100 times of likelihood evaluations compare with DVSI and sequential SGLD (dash lines) per iteration, but are implemented efficiently using Matlab matrix operation (e.g., each iteration of parallel SGLD is about 3 times slower than sequential SGLD). We partition the data into 80% for training and 20% for testing and average on 50 random trials. A mini-batch size of 50 is used for all the algorithms. likelihood evaluations compared with sequential SGLD. However, by leveraging the matrix operation in MATLAB, we find that each iteration of parallel SGLD is only 3 times more expensive than March 17, 2017 7 / 8

layers, and take 50 hidden units for most datasets, except that we take 100 units for Protein and Year which are relatively large; all the datasets are randomly partitioned into 90% for training and 10% for testing, and the results are averaged over 20 random trials, except for Protein and Year on which 5 and 1 trials are repeated, respectively. We use RELU(x) = max(0, x) as the active function, whose weak derivative is I[x > 0] (Stein s identity also holds for weak derivatives; see Stein et al. [31]). PBP is repeated using the default setting of the authors code 6. For our algorithm, we only use 20 particles, and use AdaGrad with momentum as what is standard in deep learning. The mini-batch size is 100 except for Year on which we use 1000. Experiments: Bayesian Neural Network A feedforward neural netwok is employed with one hidden layer and 50 hidden units. The number of particles is set as 20. The proposed algorithm is compared with probabilistic back-propagation (PBP). We find our algorithm consistently improves over PBP both in terms of the accuracy and speed (except on Yacht); this is encouraging since PBP were specifically designed for Bayesian neural network. We also find that our results are comparable with the more recent results reported on the same datasets [e.g., 32 34] which leverage some advanced techniques that we can also benefit from. Avg. Test RMSE Avg. Test LL Avg. Time (Secs) Dataset PBP Our Method PBP Our Method PBP Ours Boston 2.977 ± 0.093 2.957 ± 0.099 2.579 ± 0.052 2.504 ± 0.029 18 16 Concrete 5.506 ± 03 5.324 ± 04 3.137 ± 0.021 3.082 ± 0.018 33 24 Energy 1.734 ± 0.051 1.374 ± 0.045 1.981 ± 0.028 1.767 ± 0.024 25 21 Kin8nm 0.098 ± 0.001 0.090 ± 0.001 0.901 ± 0.010 0.984 ± 0.008 118 41 Naval 0.006 ± 0.000 0.004 ± 0.000 3.735 ± 0.004 4.089 ± 0.012 173 49 Combined 4.052 ± 0.031 4.033 ± 0.033 2.819 ± 0.008 2.815 ± 0.008 136 51 Protein 4.623 ± 0.009 4.606 ± 0.013 2.950 ± 0.002 2.947 ± 0.003 682 68 Wine 0.614 ± 0.008 0.609 ± 0.010 0.931 ± 0.014 0.925 ± 0.014 26 22 Yacht 0.778 ± 0.042 0.864 ± 0.052 1.211 ± 0.044 1.225 ± 0.042 25 25 Year 8.733 ± NA 8.684 ± NA 3.586 ± NA 3.580 ± NA 7777 684 6 Conclusion We propose a simple general purpose variational inference algorithm for fast and scalable Bayesian inference. Future directions include more theoretical understanding on our method, more practical applications in deep learning models, and other potential applications of our basic Theorem in March 17, 2017 8 / 8