Fast yet Simple Natural-Gradient Variational Inference in Complex Models


1 Fast yet Simple Natural-Gradient Variational Inference in Complex Models Mohammad Emtiyaz Khan, RIKEN Center for AI Project, Tokyo, Japan. Joint work with Wu Lin (UBC), Didrik Nielsen, Voot Tangkaratt (RIKEN), Yarin Gal (University of Oxford), Akash Srivastava (University of Edinburgh), Zuozhu Liu (SUTD, Singapore)

2 The Goal of My Research To understand the fundamental principles of learning from data and use them to develop algorithms that can learn like living beings. 2

3 Learning by exploring at the age of 6 months

4 Converged at the age of 12 months

5 Transfer Learning at 14 months 5

6 The Goal of My Research To understand the fundamental principles of learning from data and use them to develop algorithms that can learn like living beings. 6

7 Current Focus: Methods to Improve Deep Learning Data-efficiency, robustness, active learning, continual/online learning, exploration 7

8 Bayesian Inference Compute a probability distribution over the unknowns given the data, to know how much we don't know.

9 Uncertainty Estimation (Figure: scene and uncertainty of depth estimates, taken from Kendall et al. 2017)

10 Bayesian Inference is Difficult! Bayes' rule: p(w|D) = p(D|w) p(w) / \int p(D|w) p(w) dw, where D is the data and w are the parameters; the normalizing integral is intractable. Variational Inference (VI) using gradient methods (SGD/Adam): Gaussian VI, e.g., Bayes by Backprop (Blundell et al. 2015), Practical VI (Graves et al. 2011), Black-box VI (Ranganath et al. 2014), and many more. This talk: VI using natural-gradient methods, which are faster and simpler than gradient-based methods. Khan & Lin (AISTATS 2017), Khan et al. (ICML 2018), Khan & Nielsen (ISITA 2018)

11 Outline Background: Bayesian model and Variational Inference (VI), VI using gradient descent, VI using natural-gradient descent. Fast and simple natural-gradient VI. Results on Bayesian deep learning and RL.

12 BACKGROUND: Bayesian model; VI using gradient descent; Euclidean distance is inappropriate; VI using natural-gradient descent

13 A Bayesian Model Likelihood (neural network): p(D|w) = \prod_{i=1}^N p(y_i | f_w(x_i)). Prior: p(w) = ExpFamily(\lambda_0), e.g., for a Gaussian prior \lambda_0 = \{-\tfrac{1}{2}\Sigma_0^{-1}, \Sigma_0^{-1}\mu_0\}. Posterior: p(w|D) = p(D|w) p(w) / \int p(D|w) p(w) dw, with an intractable integral.

14 Variational Inference with Gradients Approximate p(w|D) by q_\lambda(w) = ExpFamily(\lambda), e.g., for a Gaussian q, \lambda = \{-\tfrac{1}{2}V^{-1}, V^{-1}m\}. Maximize the lower bound \max_\lambda L(\lambda) := E_{q_\lambda}[\log p(w)/q_\lambda(w)] (regularizer) + \sum_{i=1}^N E_{q_\lambda}[\log p(D_i|w)] (data-fit term) by gradient ascent: \lambda_{t+1} = \lambda_t + \rho_t \nabla_\lambda L_t, which is the solution of \arg\max_\lambda \lambda^T \nabla_\lambda L_t - \tfrac{1}{2\rho_t}\|\lambda - \lambda_t\|^2.
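
For concreteness, here is a minimal numpy sketch (not from the talk) of this gradient-based recipe: a Gaussian q(w) = N(m, v) is fitted to a toy 1-D Bayesian logistic-regression posterior by stochastic gradient ascent on the lower bound, using the reparameterization trick with a single Monte Carlo sample per step. The model, data, and step sizes are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D Bayesian logistic regression: prior w ~ N(0, 1), data generated with w = 1.5.
N = 100
x = rng.normal(size=N)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-1.5 * x))).astype(float)

def grad_log_joint(w):
    # gradient of log p(D, w) = sum_i log p(y_i | w*x_i) + log N(w | 0, 1)
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.sum((y - p) * x) - w

# Variational approximation q(w) = N(m, v); optimize (m, log v) by stochastic gradient ascent.
m, log_v = 0.0, 0.0
lr = 0.01
for t in range(2000):
    v = np.exp(log_v)
    eps = rng.normal()
    w = m + np.sqrt(v) * eps                            # reparameterization trick
    g = grad_log_joint(w)                               # gradient of the log joint at the sample
    grad_m = g                                          # single-sample estimate of dL/dm
    grad_logv = 0.5 * (g * eps * np.sqrt(v) + 1.0)      # dL/d(log v), incl. the entropy term 0.5*log v
    m += lr * grad_m
    log_v += lr * grad_logv

print("q(w) = N(%.3f, %.3f)" % (m, np.exp(log_v)))
```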

15 Euclidean Distance is inappropriate! (Amari 1999, Sato 2001, Honkela et al. 2010, Hoffman et al. 2013, Khan and Lin 2017)

16 VI using Natural-Gradient Descent Fisher Information Matrix (FIM): F(\lambda) := E_{q_\lambda}[\nabla_\lambda \log q_\lambda(w) \, \nabla_\lambda \log q_\lambda(w)^T]. Replace the Euclidean proximity term by the FIM-induced one, \arg\max_\lambda \lambda^T \nabla_\lambda L_t - \tfrac{1}{2\rho_t}(\lambda - \lambda_t)^T F(\lambda_t)(\lambda - \lambda_t), giving the update \lambda_{t+1} = \lambda_t + \rho_t F(\lambda_t)^{-1} \nabla_\lambda L_t. Natural gradient: \tilde{\nabla}_\lambda L_t := F(\lambda_t)^{-1} \nabla_\lambda L_t.

17 Natural gradients require computation of the FIM. Can we avoid this? Yes, by computing the gradient with respect to the expectation parameters of the exponential family.

18 Simple Natural-Gradients Part I 18

19 Expectation Parameters of Exp-Family Mean/expectation/moment parameters (Wainwright and Jordan, 2006): \mu(\lambda) := E_{q_\lambda}[\phi(w)], where \phi(w) are the sufficient statistics. For a Gaussian: E_q[w] = m and E_q[w w^T] = m m^T + V.

20 NatGrad Descent == Mirror Descent (Raskutti and Mukherjee 2015, Khan and Lin 2017) The natural-gradient update \lambda_{t+1} = \lambda_t + \rho_t F(\lambda_t)^{-1} \nabla_\lambda L_t is equivalent to mirror descent in the expectation parameters: \max_\mu (\mu - \mu_t)^T \nabla_\mu L_t - \tfrac{1}{\rho_t} KL[q_\mu \| q_{\mu_t}], whose stationarity condition \nabla_\mu L_t - \tfrac{1}{\rho_t}(\lambda - \lambda_t) = 0 recovers the same update. Key identities: \nabla_\mu L_t = F(\lambda_t)^{-1} \nabla_\lambda L_t =: \tilde{\nabla}_\lambda L_t and \nabla_\lambda L_t = F(\mu_t)^{-1} \nabla_\mu L_t =: \tilde{\nabla}_\mu L_t.
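
The identity \nabla_\mu L = F(\lambda)^{-1} \nabla_\lambda L can be checked numerically. The sketch below (illustrative, not from the talk) parameterizes a 1-D Gaussian by its natural parameters, estimates the Fisher information matrix by Monte Carlo from its definition on slide 16, and compares F^{-1} \nabla_\lambda L with the finite-difference gradient of the same objective taken with respect to the expectation parameters; the quadratic objective is an arbitrary stand-in for the variational lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Objective: L(q) = E_q[f(w)] + H[q] with a quadratic f(w) = -0.5*a*w^2 + b*w and q(w) = N(m, v).
a, b = 2.0, 1.0
def L_mv(m, v):
    return -0.5 * a * (m ** 2 + v) + b * m + 0.5 * np.log(2 * np.pi * np.e * v)

def mv_from_lam(lam):        # natural parameters (m/v, -1/(2v)) -> (m, v)
    v = -0.5 / lam[1]
    return lam[0] * v, v

def mv_from_mu(mu):          # expectation parameters (E[w], E[w^2]) -> (m, v)
    return mu[0], mu[1] - mu[0] ** 2

def num_grad(f, x, h=1e-5):  # central finite differences
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

m, v = 0.3, 0.7
lam = np.array([m / v, -0.5 / v])
mu = np.array([m, m ** 2 + v])

# Fisher information from its definition: F = E[(phi(w) - mu)(phi(w) - mu)^T] with phi(w) = (w, w^2).
w = rng.normal(m, np.sqrt(v), size=1_000_000)
F = np.cov(np.stack([w, w ** 2]))

grad_lam = num_grad(lambda l: L_mv(*mv_from_lam(l)), lam)
grad_mu = num_grad(lambda u: L_mv(*mv_from_mu(u)), mu)

print("F^{-1} grad_lambda:", np.linalg.solve(F, grad_lam))
print("grad_mu           :", grad_mu)   # should match up to Monte Carlo error
```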

21 Dually-Flat Riemannian Structure (see Amari's 2016 book; figure from Wainwright and Jordan, 2006). For VI, the natural gradient in the natural-parameter space is computationally simpler than in the expectation-parameter space.

22 Natural-Gradient Descent in the Natural-Parameter Space Model: p(D|w) = \prod_{i=1}^N p(y_i | f_w(x_i)), p(w) = ExpFamily(\lambda_0), q_\lambda(w) = ExpFamily(\lambda). Objective: \max_\lambda L(\lambda) := E_{q_\lambda}[\log p(w)/q_\lambda(w)] (conjugate term) + \sum_{i=1}^N E_{q_\lambda}[\log p(D_i|w)] (nonconjugate term). Natural gradient: \tilde{\nabla}_\lambda L = \nabla_\mu L = \lambda_0 - \lambda + \sum_{i=1}^N \nabla_\mu E_{q_\lambda}[\log p(D_i|w)] \big|_{\mu = \mu(\lambda)} (similar to SVI, Hoffman et al. 2013).

23 NGVI as Message Passing (Khan and Lin 2017, Khan and Nielsen 2018) \lambda_{t+1} = (1 - \rho_t)\lambda_t + \rho_t \left[ \lambda_0 + \sum_{i=1}^N \nabla_\mu E_{q_\lambda}[\log p(D_i|w)] \big|_{\mu = \mu(\lambda_t)} \right]. (Figure: factor graph over w with the prior factor and the per-data-point likelihood factors.) A generalization of Variational Message Passing (Winn and Bishop 2005) and stochastic variational inference (Hoffman et al. 2013) to nonconjugate models. A convergence proof is in Khan et al. (UAI).
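
As a runnable illustration of this update (again a sketch, not the speaker's code), the following fits a Gaussian q to a 1-D Bayesian logistic-regression posterior by iterating \lambda_{t+1} = (1-\rho)\lambda_t + \rho(\lambda_0 + \nabla_\mu E_q[\log p(D|w)]), with the \mu-gradient obtained from the Bonnet/Price identities (\nabla_m E_q[f] = E_q[\nabla f], \nabla_v E_q[f] = \tfrac{1}{2} E_q[\nabla^2 f]) and a small number of Monte Carlo samples. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D Bayesian logistic regression; prior w ~ N(0, 1), i.e. lambda_0 = (0, -1/2).
N = 200
x = rng.normal(size=N)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)
lam0 = np.array([0.0, -0.5])

def grad_hess_loglik(w):
    # gradient and Hessian of sum_i log p(y_i | w*x_i) for the logistic likelihood
    p = 1.0 / (1.0 + np.exp(-w * x))
    return np.sum((y - p) * x), -np.sum(p * (1 - p) * x ** 2)

lam = lam0.copy()        # start q at the prior
rho, S = 0.5, 20         # step size and number of MC samples (illustrative)
for t in range(100):
    v = -0.5 / lam[1]
    m = lam[0] * v
    # Monte Carlo estimates: grad_m E_q[log p(D|w)] = E_q[g(w)], grad_v E_q[log p(D|w)] = 0.5*E_q[h(w)].
    gs, hs = zip(*(grad_hess_loglik(m + np.sqrt(v) * rng.normal()) for _ in range(S)))
    g_m, g_v = np.mean(gs), 0.5 * np.mean(hs)
    # Chain rule to the expectation parameters mu = (m, m^2 + v):
    g_mu = np.array([g_m - 2.0 * g_v * m, g_v])
    lam = (1 - rho) * lam + rho * (lam0 + g_mu)

v = -0.5 / lam[1]
print("q(w) = N(%.3f, %.3f)" % (lam[0] * v, v))
```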

24 Approximate Bayesian Filter q_{t+1}(w) \propto [q_t(w)]^{1-\rho_t} \left[ p(w) \, e^{\nabla_\mu E_q[\log p(D_i|w)]^T \phi(w)} \right]^{\rho_t}: the new approximation combines the previous approximation, the ExpFamily prior, and an ExpFamily approximation of the DNN likelihood. At convergence, the optimal natural parameter is a sum of natural gradients: \lambda_* = \lambda_0 + \sum_{i=1}^N \nabla_\mu E_{q_*}[\log p(D_i|w)], i.e., q_*(w) \propto p(w) \prod_{i=1}^N e^{\phi(w)^T \nabla_\mu E_{q_*}[\log p(D_i|w)]}. Similar to EP, we get local approximations, but now they are natural gradients of the local factors.

25 Fast Natural-Gradients Part II: Application to Bayesian deep learning. Notation change: θ = w.

26 Natural-Gradient VI by using Weight-Perturbed Adam (Vadam) 26

27 Natural-gradient vs gradients Objective: \max_{\mu, \sigma^2} L(\mu, \sigma^2) := E_q[\log N(\theta|0, I)/N(\theta|\mu, \sigma^2)] + \sum_{i=1}^N E_q[\log p(D_i|\theta)], with a Gaussian prior N(\theta|0, I), a Gaussian variational approximation q = N(\theta|\mu, \sigma^2), and a non-Gaussian (DNN) likelihood. Natural-Gradient VI: \mu \leftarrow \mu + \beta\,\sigma^2 \circ \nabla_\mu L and \sigma^{-2} \leftarrow \sigma^{-2} - 2\beta\,\nabla_{\sigma^2} L. Existing methods (Graves et al. 2011, Blundell et al. 2015) instead take adaptive gradient steps directly on (\mu, \sigma): \mu \leftarrow \mu + \alpha\,\hat{\nabla}_\mu L / (\sqrt{s_\mu} + \delta) and \sigma \leftarrow \sigma + \alpha\,\hat{\nabla}_\sigma L / (\sqrt{s_\sigma} + \delta).

28 Doubly Stochastic Approximation Same objective, writing f_i(\theta) for the per-data-point (DNN) likelihood terms. Sample a minibatch M and a single \theta from q, and replace the exact gradients \nabla_\mu L and \nabla_{\sigma^2} L in the natural-gradient updates by stochastic estimates built from \frac{N}{M}\sum_{i \in M} \nabla_\theta f_i(\theta) and \frac{N}{M}\sum_{i \in M} \nabla^2_\theta f_i(\theta); a runnable sketch is given below.
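
Putting slides 27 and 28 together, here is a minimal numpy sketch (illustrative constants, diagonal Gaussian q, Bayesian logistic regression standing in for the DNN) of the doubly stochastic natural-gradient update: at each step it draws a minibatch and a single θ from q, forms stochastic estimates of the likelihood gradient and diagonal Hessian, and applies the two-line update from slide 27 with a N(0, I) prior.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy D-dimensional Bayesian logistic regression; prior theta ~ N(0, I) stands in for the DNN model.
N, D, M = 1000, 5, 64
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

mu = np.zeros(D)          # mean of q
prec = np.ones(D)         # diagonal precision sigma^{-2} of q
beta = 0.03               # step size (illustrative)

for t in range(5000):
    idx = rng.choice(N, size=M, replace=False)        # sample a minibatch ...
    theta = mu + rng.normal(size=D) / np.sqrt(prec)   # ... and a single theta from q
    p = 1.0 / (1.0 + np.exp(-X[idx] @ theta))
    g_lik = (N / M) * X[idx].T @ (y[idx] - p)             # stochastic gradient of the log-likelihood
    h_lik = (N / M) * (X[idx] ** 2).T @ (p * (1 - p))     # minus its diagonal Hessian (positive curvature)
    # Two-line natural-gradient update from slide 27, with the N(0, I) prior folded in:
    prec = (1 - beta) * prec + beta * (h_lik + 1.0)       # sigma^{-2} <- sigma^{-2} - 2*beta*grad_{sigma^2} L
    mu = mu + beta * (g_lik - mu) / prec                  # mu <- mu + beta * sigma^2 * grad_mu L

print("posterior mean  :", np.round(mu, 2))
print("posterior stddev:", np.round(1.0 / np.sqrt(prec), 2))
print("true weights    :", np.round(theta_true, 2))
```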

29 Hessian Approximation \frac{N}{M}\sum_{i \in M} \nabla^2_\theta f_i(\theta) \;\approx\; \frac{N}{M}\sum_{i \in M} [\nabla_\theta f_i(\theta)]^2 (Gauss-Newton) \;\approx\; N\left[\frac{1}{M}\sum_{i \in M} \nabla_\theta f_i(\theta)\right]^2 (Gradient-Magnitude)
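
The three curvature estimates can be compared directly on a toy problem. In the sketch below (illustrative), f_i is taken to be the per-example negative log-likelihood of a 1-D logistic regression; the exact minibatch Hessian sum is compared with the Gauss-Newton estimate (sum of squared per-example gradients) and the gradient-magnitude estimate (square of the mean gradient), matching the two approximations above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Per-example losses f_i(theta): negative log-likelihood of a 1-D logistic regression.
N, M = 1000, 64
x = rng.normal(size=N)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-x))).astype(float)
theta = 0.3

idx = rng.choice(N, size=M, replace=False)       # a minibatch
p = 1.0 / (1.0 + np.exp(-theta * x[idx]))
grads = -(y[idx] - p) * x[idx]                   # per-example gradients of f_i
hess = p * (1 - p) * x[idx] ** 2                 # per-example second derivatives of f_i

exact = (N / M) * np.sum(hess)                   # (N/M) * sum_i d^2 f_i
gauss_newton = (N / M) * np.sum(grads ** 2)      # (N/M) * sum_i [d f_i]^2
grad_magnitude = N * np.mean(grads) ** 2         # N * [(1/M) sum_i d f_i]^2

print("exact Hessian sum :", round(exact, 2))
print("Gauss-Newton      :", round(gauss_newton, 2))
print("gradient-magnitude:", round(grad_magnitude, 2))
```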

30 Natural Momentum Method Polyak's heavy-ball method: \theta_{t+1} = \theta_t + \alpha_t \nabla_\theta L_t + \gamma_t(\theta_t - \theta_{t-1}). Our natural-momentum method applies the same recursion in the natural-parameter space with the natural gradient (i.e., the gradient in the expectation parameters): \lambda_{t+1} = \lambda_t + \alpha_t \nabla_\mu L_t + \gamma_t(\lambda_t - \lambda_{t-1}).
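
The heavy-ball recursion itself is easy to state in code. The toy sketch below (illustrative) runs it on a simple quadratic; the talk's natural-momentum method applies the same recursion to the natural parameter \lambda with \nabla_\mu L_t in place of the ordinary gradient.

```python
import numpy as np

# Heavy-ball momentum on a toy quadratic L(x) = -0.5 * x^T A x (maximization, to mirror the slide).
A = np.diag([1.0, 10.0])
x = np.array([5.0, 5.0])
x_prev = x.copy()
alpha, gamma = 0.05, 0.9      # step size and momentum (illustrative)

for t in range(300):
    grad = -A @ x                                       # gradient of L at x
    x_next = x + alpha * grad + gamma * (x - x_prev)    # x_{t+1} = x_t + alpha*grad_t + gamma*(x_t - x_{t-1})
    x_prev, x = x, x_next

print(x)   # approaches the maximizer at the origin
```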

31 Natural-Gradient VI by using Weight-Perturbed Adam (Vadam) 31
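
The slides do not spell out the full Vadam pseudo-code, so the following is only a sketch that combines the ingredients shown so far (weight perturbation with a single sample from q, the gradient-magnitude curvature estimate, Adam-style momentum and bias correction); the exact scaling constants and bias corrections of the published Vadam algorithm may differ. Variable names, the toy logistic-regression model, and all hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy Bayesian logistic regression; prior theta ~ N(0, prior_prec * I).
N, D, M = 1000, 5, 32
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

prior_prec = 1.0
lam = prior_prec / N                        # per-example prior precision
alpha, gamma1, gamma2 = 0.01, 0.9, 0.999    # Adam-like constants (illustrative)
mu, m, s = np.zeros(D), np.zeros(D), np.zeros(D)

for t in range(1, 5001):
    sigma = 1.0 / np.sqrt(N * (s + lam))            # scale of q read off from the Adam-like accumulator
    theta = mu + sigma * rng.normal(size=D)         # weight perturbation: a single sample from q
    idx = rng.choice(N, size=M, replace=False)
    p = 1.0 / (1.0 + np.exp(-X[idx] @ theta))
    g = -(1.0 / M) * X[idx].T @ (y[idx] - p)        # minibatch gradient of the average NLL at theta
    m = gamma1 * m + (1 - gamma1) * (g + lam * mu)  # momentum on gradient plus prior term
    s = gamma2 * s + (1 - gamma2) * g ** 2          # gradient-magnitude (Adam-style) scale
    m_hat = m / (1 - gamma1 ** t)
    s_hat = s / (1 - gamma2 ** t)
    mu = mu - alpha * m_hat / (np.sqrt(s_hat) + lam)

print("mean        :", np.round(mu, 2))
print("stddev      :", np.round(1.0 / np.sqrt(N * (s + lam)), 2))
print("true weights:", np.round(theta_true, 2))
```

Consistent with slide 35, running this sketch with a minibatch larger than 1 makes the gradient-magnitude scale a cruder curvature estimate, so the reported posterior spread should be treated as approximate.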

32 Results 32

33 Illustration VOGN uses Gauss-Newton; Vadam uses Gradient-Magnitude.

34 Quality of Posterior Approximation VOGN-1 uses Gauss-Newton with a minibatch of size 1; Vadam uses Gradient-Magnitude with minibatch > 1.

35 Effect of Minibatch on the Accuracy As we decrease the minibatch size, the accuracy improves, but the stochastic noise increases, which might slow down the algorithm.

36 Vadam performs comparably to BBVI 1 layer, 50 hidden units with ReLU, on energy (N=768, D=8) and kin8nm (N=8192, D=8); 5 MC samples for Vadam, 10 for BBVI; minibatch of 32.

37 Gauss-Newton Converges Fast 1 layer, 50 hidden units with tanh, on Breast Cancer (N=683, D=10); minibatch of size 1 with 16 MC samples and step-sizes = 0.01. (Figure: log-loss vs. epoch for BBVI, Vadam, VOGN, and CVI-GGN.)

38 Avoiding Local Minima An example taken from Casella and Robert's book. Vadam reaches the flat minimum, but GD gets stuck at a local minimum. Related ideas: optimization by smoothing, Gaussian homotopy/blurring, etc.; Entropy-SGD, SGLD, etc.

39 Parameter-Space Noise for Deep RL On the OpenAI Gym Cheetah task with DDPG, using a DNN with [400, 300] ReLU units. VadaGrad (noise updated with natural gradients): reward 5264. SGD (noise updated with standard gradients). SGD (no noise injection): reward 3674. (Rückstieß et al. 2010, Fortunato et al. 2017, Plappert et al.)

40 Improving the Marginal Value of Adam/AdaGrad SGD and Vadam reach a better minimum than Adam and AdaGrad. 2-layer LSTM on the War & Peace dataset; the example is taken from Wilson et al. (The Marginal Value of Adaptive Gradient Methods).

41 Summary 41

42 Related Work Natural-gradient methods for VI: Sato 2001, Honkela et al. 2010, Hoffman et al. 2013. Gradient methods for VI: Ranganath et al. 2014, Graves et al. 2011, Blundell et al. 2015, Salimans and Knowles 2013. Zhang et al. (ICML 2018): very similar to our ICML paper and to our previous work on the Variational Adaptive-Newton method. Mandt et al. 2017: SGD as VI. Global optimization methods: optimization by smoothing, graduated optimization, Gaussian homotopy, etc.; Entropy-SGD, noisy networks for exploration, etc.

43 References 43

44 Thanks! I am looking for post-docs, research assistants, and interns See details at 44
