Bayesian Deep Learning


1 Bayesian Deep Learning Mohammad Emtiyaz Khan RIKEN Center for AI Project, Tokyo

2 Keywords: Bayesian Statistics (Gaussian distribution, Bayes' rule); Continuous Optimization (gradient descent, least squares); Deep Learning (stochastic gradient descent, RMSprop); Information Geometry (Riemannian manifolds, natural gradients). 2

3 The Goal of My Research To understand the fundamental principles of learning from data and use them to develop algorithms that can learn like living beings. 3

4 Learning by exploring at the age of 6 months

5 Converged at the age of 12 months

6 Transfer Learning at 14 months 6

7 The Goal of My Research To understand the fundamental principles of learning from data and use them to develop algorithms that can learn like living beings. 7

8 Bayesian Inference: Compute the posterior distribution instead of just a point estimate (e.g., MLE). The posterior is a natural representation of all past information, which can then be sequentially updated with new information. This is useful for active learning, sequential experiment design, continual learning, and RL, but also for global optimization, causality, etc. Eventually, it is a step toward ML methods that can learn like humans (data-efficient, robust, causal). 8

9 Uncertainty in Deep Learning To estimate the confidence in the predictions of a deep-learning system 9

10 Example: Which is a Better Fit? [Plot of earthquake frequency vs. magnitude, with two candidate fits: 57% Red, 43% Blue.] Real data from Tohoku (Japan). Example taken from Nate Silver's book The Signal and the Noise. 10

11 Example: Which is a Better Fit? [Same plot of earthquake frequency vs. magnitude.] The question matters when the data is scarce and noisy, e.g., in medicine and robotics.

12 Uncertainty for Image Segmentation. Figure (taken from Kendall et al. 2017): illustrating the difference between aleatoric and epistemic uncertainty for semantic segmentation on the CamVid dataset. Panels: (a) Input Image, (b) Ground Truth, (c) Semantic Segmentation, (d) Aleatoric Uncertainty, (e) Epistemic Uncertainty. Aleatoric uncertainty captures noise inherent in the observations; in (d) the model exhibits increased aleatoric uncertainty on object boundaries and for objects far from the camera. 12

13 Variational Inference (VI) Approximate the posterior using optimization Popular in reinforcement learning, unsupervised learning, online learning, active learning etc. We need accurate VI algorithms that are general (apply to many models), scalable (for large data and models), fast (converge quickly), simple (easy to implement). This talk: New algorithms with such features. 13

14 Gradient vs Natural-Gradient. Gradient Descent (GD): relies on stochastic and automatic gradients; simple, general, and scalable, but can have suboptimal convergence. Examples: Practical VI (2011), Black-box VI (2014), Bayes by Backprop (2015), ADVI (2015), and many more. Natural-Gradient Descent (NGD): fast convergence, but computationally difficult, therefore not simple, general, or scalable (Sato (2001), Riemannian CG (2010), Stochastic VI (2013), etc.). This talk: fast and simple NGD for complex models, such as those containing deep networks. 14

15 Talk Outline: Variational inference with gradient descent and natural-gradient descent. NGD with Conjugate-Computation VI: a generalization of forward-backward algorithms, SVI, and message passing (AISTATS 2017). Deep nets (ICML 2018, NeurIPS 2018). Generalizations and extensions: structured VAEs (ICLR 2018), mixtures of exponential-family approximations, evolution strategies (arXiv 2017), etc. 15

16 Variational Inference: Gradient Descent (GD) vs Natural-Gradient Descent (NGD) 16

17 A Naïve Method. [Diagram: a prior distribution generates the parameters of a neural network, which maps the input to the output data.] 17

18 Bayesian Inference. Bayes' rule: p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ. The (narrow) posterior distribution refines the (wide) prior, but the normalizing integral is intractable. 18

19 Variational Inference. The intractable posterior over the parameters θ given the data D is replaced by a variational approximation q_λ(θ) with natural parameters λ. Maximize the Evidence Lower Bound (ELBO): L(λ) = E_{q_λ}[log p(D | θ)] − KL(q_λ(θ) ‖ p(θ)). 19

20 Relationship to Other Fields. Perturbation to avoid local minima: Gaussian homotopy and continuation methods, smoothed/graduated optimization. Online learning: exponentially weighted averaging. Reinforcement learning: with a structured distribution, q is the policy × environment (Levine 2018). 20

21 Gradient Descent. Maximize the Evidence Lower Bound (ELBO) with gradient descent (GD): λ_{t+1} = λ_t + ρ_t ∇_λ L(λ_t). 21
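
To make this baseline concrete, here is a minimal sketch of gradient-based VI for a mean-field Gaussian approximation using the reparameterization trick (my own illustration, not code from the talk; the toy Bayesian linear-regression model, the learning rate rho, and all variable names are assumptions):

import numpy as np

# Toy data for Bayesian linear regression: y = X w + noise (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)

def grad_log_joint(w):
    # Gradient of log p(y | X, w) + log p(w), with unit noise and N(0, I) prior.
    return X.T @ (y - X @ w) - w

# Mean-field Gaussian q(w) = N(mu, diag(exp(log_sigma)^2)).
mu, log_sigma = np.zeros(2), np.zeros(2)
rho = 0.01  # learning rate

for t in range(2000):
    eps = rng.normal(size=2)
    w = mu + np.exp(log_sigma) * eps                    # reparameterized sample
    g = grad_log_joint(w)
    grad_mu = g                                         # ELBO gradient w.r.t. the mean
    grad_log_sigma = g * eps * np.exp(log_sigma) + 1.0  # the +1 comes from the entropy term
    mu += rho * grad_mu                                 # plain gradient ascent on the ELBO
    log_sigma += rho * grad_log_sigma

print("approximate posterior mean:", mu)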

22 VI with Natural-Gradient Descent (Sato 2001, Honkela et al. 2010, Hoffman et al. 2013). NGD: λ_{t+1} = λ_t + ρ_t F(λ_t)^{-1} ∇_λ L(λ_t), where F is the Fisher Information Matrix (FIM). Fast convergence due to optimization in a Riemannian manifold (not Euclidean space), but it requires additional computation. Can we simplify/reduce this computation? 22

23 Can we simplify NGD computation? Yes, by using algorithms such as message passing / backprop. Conjugate-Computation VI, Khan and Lin, AISTATS 2017.

24 The key idea: Expectation Parameters. The expectation/moment/mean parameters are the expectations of the sufficient statistics, μ = E_q[T(θ)]; for Gaussians, it's the mean and the correlation matrix. A key relationship: the natural gradient with respect to the natural parameter equals the gradient with respect to the expectation parameter, F(λ)^{-1} ∇_λ L = ∇_μ L. NGD therefore becomes λ_{t+1} = λ_t + ρ_t ∇_μ L. 24
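
This identity follows from standard exponential-family facts (a short derivation, not spelled out on the slide):

$$q_\lambda(\theta) = h(\theta)\exp\{\lambda^\top T(\theta) - A(\lambda)\}, \qquad \mu = \mathbb{E}_{q_\lambda}[T(\theta)] = \nabla_\lambda A(\lambda), \qquad F(\lambda) = \nabla_\lambda^2 A(\lambda) = \nabla_\lambda \mu,$$

so by the chain rule $\nabla_\lambda \mathcal{L} = (\nabla_\lambda \mu)\,\nabla_\mu \mathcal{L} = F(\lambda)\,\nabla_\mu \mathcal{L}$, and hence the natural gradient is $F(\lambda)^{-1}\nabla_\lambda \mathcal{L} = \nabla_\mu \mathcal{L}$.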

25 Conjugate-Computation VI (CVI) In a conjugate model, this is equivalent to simply adding the natural parameters of the factors of a model. This is a type of conjugate computation, and enables simple updates for complex models. 25
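
One way to write this conjugate computation explicitly (a sketch of the standard conditionally-conjugate update in my notation, where λ_0 is the prior's natural parameter and λ_i is the natural-parameter contribution of the i-th likelihood factor under the current q): for such models the gradient with respect to the expectation parameter is ∇_μ L = λ_0 + Σ_i λ_i − λ_t, so the NGD step reduces to

$$\lambda_{t+1} = (1-\rho_t)\,\lambda_t + \rho_t\Big(\lambda_0 + \sum_i \lambda_i\Big),$$

i.e., a convex combination of the old natural parameters and a sum of natural parameters, with no Fisher matrix to form or invert.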

26 CVI on Bayesian Linear Regression. The negative log of (likelihood × prior) is, up to constants and scaling, (y − Xθ)ᵀ(y − Xθ) + δ θᵀθ; with a Gaussian approximation q, the natural gradient of the ELBO is obtained from the gradients with respect to the expectation parameters. 26

27 NGD == Newton's Method. For ρ = 1, the update converges to the least-squares solution in one step (Newton's method); gradient descent is suboptimal in comparison. This property generalizes to all conjugate models, where the forward-backward algorithm returns the natural gradients of the ELBO. 27
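
A tiny numerical check of this claim (my own sketch, not code from the talk; the prior precision delta and all names are assumptions): for Bayesian linear regression, one Newton step with ρ = 1 lands exactly on the regularized least-squares solution, while plain gradient descent does not.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
delta = 1.0                                   # prior precision (assumed)

def grad(w):                                  # gradient of (1/2)||y - Xw||^2 + (delta/2)||w||^2
    return -X.T @ (y - X @ w) + delta * w

H = X.T @ X + delta * np.eye(3)               # Hessian (constant for this quadratic objective)
w0 = rng.normal(size=3)                       # arbitrary starting point

w_newton = w0 - np.linalg.solve(H, grad(w0))  # one Newton step (rho = 1)
w_exact = np.linalg.solve(H, X.T @ y)         # regularized least-squares solution
print(np.allclose(w_newton, w_exact))         # True: converged in one step

w_gd = w0.copy()                              # plain gradient descent from the same start
for _ in range(20):
    w_gd -= 0.001 * grad(w_gd)
print(np.linalg.norm(w_gd - w_exact))         # typically still visibly away from the solution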

28 Conditionally-Conjugate Models. VMP: sequential updates with ρ = 1. SVI: update the local variables with ρ = 1 and the global variables with ρ ∈ (0, 1). For CVI, ρ can follow any schedule, and the updates can be sequential or parallel. [Diagram: global variables, local latent variables, and data.] Images taken from Hoffman et al. (2013) and 28

29 Convergence Rates for CVI. With L the Lipschitz constant of the (nonconvex) ELBO, σ² the gradient-noise variance, α the strong-convexity constant of the Fisher Information Matrix, and M the mini-batch size, the expected squared update norm E[‖(λ_k − λ_{k+1})/ρ_k‖²] is bounded by a term of order 2LC₀²/t plus a noise term of order cσ²/M (with constants depending on α). See Khan et al. (UAI); the proof is based on Ghadimi, Lan, and Zhang (2014). 29

30 NGD for Deep Learning. Using CVI for Bayesian deep learning with a Gaussian approximation, the natural-gradient update reduces to a Newton-like step. 30
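
For reference, the Gaussian-approximation update as I understand it from the CVI line of work (a sketch; the exact form and notation in the papers may differ): with q_t = N(μ_t, S_t^{-1}),

$$S_{t+1} = (1-\rho)\,S_t + \rho\,\mathbb{E}_{q_t}\!\left[-\nabla^2_\theta \log p(\mathcal{D},\theta)\right], \qquad \mu_{t+1} = \mu_t + \rho\, S_{t+1}^{-1}\, \mathbb{E}_{q_t}\!\left[\nabla_\theta \log p(\mathcal{D},\theta)\right],$$

i.e., a Newton-like step built from averaged gradients and Hessians, which is why back-propagated gradients and Hessians (next slides) are all that is needed.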

31 CVI for Bayesian Neural Network. [Diagram: likelihood defined by a neural network, prior, and Gaussian approximation; the update needs only the back-propagated gradient and Hessian.] 31

32 CVI for Bayesian Neural Network (continued). [The update written in terms of the back-propagated gradient and Hessian.] 32

33 Variational Adam (Vadam) for Mean-Field (ICML 2018). Start from an adaptive learning-rate method (e.g., Adam) and approximate the Hessian by the square of the gradients; with γ = 1 this gives Variational Adam (Vadam), which updates a mean and a variance for each weight. Steps: 1. Select a minibatch. 2. Compute the gradient using backpropagation. 3. Compute a scale vector to adapt the learning rate. 4. Take a gradient step. 33
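
A minimal sketch of what one such update could look like (my own illustration of the idea on this slide, not the authors' implementation; it omits Adam's momentum and bias correction, and the prior precision delta, learning rate alpha, moving-average rate beta, and all names are assumptions):

import numpy as np

def vadam_like_step(mu, s, grad_fn, delta=1.0, alpha=1e-3, beta=0.999, rng=None):
    """One simplified Vadam-like update for a mean-field Gaussian N(mu, 1/(s + delta)).

    grad_fn(w) returns the minibatch gradient of the negative log-likelihood,
    evaluated at the sampled weights w (computed by backpropagation in practice).
    """
    rng = rng or np.random.default_rng()
    sigma = 1.0 / np.sqrt(s + delta)                          # per-weight standard deviation
    w = mu + sigma * rng.normal(size=mu.shape)                # weight-perturbation sample from q
    g = grad_fn(w)                                            # back-propagated gradient at the sample
    s = beta * s + (1.0 - beta) * g**2                        # Hessian approximated by squared gradients
    mu = mu - alpha * (g + delta * mu) / (np.sqrt(s) + delta) # adapted (scaled) gradient step
    return mu, s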

34 Illustration: Classification. Logistic regression (30 data points, 2-dimensional input), sampled from a Gaussian mixture with 2 components. 34

35 Adam vs Vadam (on Logistic Regression). [Panels: Adam, Vadam (mean), Vadam (samples).] M = 5, ρ = 0.01, γ =

36 Adam vs Vadam (on Neural Nets). [Panels: Adam, Vadam (mean), Vadam (samples).] (By Runa E.) 36

37 LeNet-5 on CIFAR-10. [Plots of log loss and error, comparing VOGN and Adam.] (By Anirudh Jain) 37

38 Faster, Simpler, and More Robust. Regression on the Australian-Scale dataset using deep neural nets for various minibatch sizes. [Plots comparing the existing method (BBVI) with our methods (Vadam and VOGN).] 38

39 Faster, Simpler, and More Robust. Results on MNIST digit classification (for various values of the Gaussian prior precision parameter). [Plots comparing the existing method (BBVI) with our method (Vadam).] 39

40 Parameter-Space Noise for Deep RL. On OpenAI Gym Cheetah with DDPG, using a DNN with [400, 300] ReLU units. Vadam (noise using natural gradients): reward 5264. SGD (noise using standard gradients); SGD (no noise): reward 2038. See Rückstieß et al. 2010, Fortunato et al. 2017, Plappert et al.

41 Avoiding Local Minima. An example taken from Casella and Robert's book. Vadam reaches the flat minimum, but GD gets stuck in a local minimum. Related: optimization by smoothing, Gaussian homotopy/blurring etc., Entropy SGLD. 41

42 Stochastic, Low-Rank, Approximate, Natural-Gradient (SLANG) (NeurIPS 2018). Low-rank plus diagonal covariance matrix; SLANG's cost is linear in D. [Diagram: a D×L low-rank factor plus a D-dimensional diagonal, updated from back-propagated gradients via a fast eigendecomposition (fast_eig).] 42
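
To see why the low-rank-plus-diagonal structure keeps the cost linear in D, here is a small sketch (my illustration, not the SLANG code; SLANG itself maintains this structure on the precision matrix and updates it with a fast eigendecomposition, whereas the sketch uses the covariance form purely to show the O(DL) scaling of sampling):

import numpy as np

def sample_lowrank_diag_gaussian(mean, B, d, n_samples, rng=None):
    """Draw samples from N(mean, B @ B.T + diag(d)) in O(D * L) time per sample.

    mean: (D,) array, B: (D, L) low-rank factor, d: (D,) positive diagonal.
    """
    rng = rng or np.random.default_rng()
    n_low = rng.normal(size=(n_samples, B.shape[1]))   # noise along the L low-rank directions
    n_diag = rng.normal(size=(n_samples, mean.size))   # independent per-coordinate noise
    return mean + n_low @ B.T + n_diag * np.sqrt(d)

# Tiny usage example with assumed sizes: D = 10000 weights, rank L = 5.
D, L = 10_000, 5
samples = sample_lowrank_diag_gaussian(np.zeros(D), 0.01 * np.ones((D, L)),
                                        0.1 * np.ones(D), n_samples=3)
print(samples.shape)   # (3, 10000)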

43 SLANG is Faster than GD Classification on USPS with BNNs 43

44 Generalization and Extensions 44

45 Deep Nets + Graphical Models (ICLR 2018, published as a conference paper). Examples: Neural Nets + GMM (latent mixture model); Neural Nets + Linear Dynamical System (latent state-space model). [Figure 1 from the paper: panels (a) and (c) show generative models that combine deep networks (NN) with PGMs, while (b) and (d) show the corresponding structured inference networks (SIN).] 45

46 Amortized Inference on VAE + Probabilistic Graphical Models (PGM) (ICLR 2018). Graphical model + deep model, with a structured inference network as the encoder and the PGM+DNN generative model as the decoder. Training uses backprop on the DNN and forward-backward on the PGM. 46

47 Going Beyond the Exponential Family. Fast and simple NGD for approximations outside the exponential family: scale mixtures of Gaussians (e.g., the T-distribution), finite mixtures of Gaussians, matrix-variate Gaussians, skew-Gaussians. The updates can be implemented using message passing and back-propagation. 47

48 Summary of the Talk. Fast yet simple NGD for VI using Conjugate-Computation VI (AISTATS 2017): a generalization of the forward-backward algorithm, stochastic VI, variational message passing, etc. Beyond conjugacy: extends fast and simple NGD to deep nets (ICML 2018, NeurIPS 2018). Generalizations and extensions: VAEs (ICLR 2018), mixtures of exponential families, evolution strategies (arXiv 2017), etc. 48

49 Related Works. Sorry if I missed some important work! Please email me. 49

50 EM, Forward-Backward, and VI Sato (1998), Fast Learning of On-line EM Algorithm. Sato (2001), Online Model Selection Based on the Variational Bayes. Jordan et al. (1999), An Introduction to Variational Methods for Graphical Models. Winn and Bishop (2005), Variational Message Passing. Knowles and Minka (2011), Non-conjugate Variational Message Passing for Multinomial and Binary Regression. 50

51 NGD: Author Name Starting with an H Honkela et al. (2007), Natural Conjugate Gradient in Variational Inference. Honkela et al. (2010), Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes. Hensman et al. (2012), Fast Variational Inference in the Conjugate Exponential Family. Hoffman et al. (2013), Stochastic Variational Inference. 51

52 NGD: Author Name Starting with an S. Salimans and Knowles (2013), Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression: approximate natural-gradient steps. Sheth and Khardon (2016), Monte Carlo Structured SVI for Two-Level Non-Conjugate Models: applies to models with two levels of hierarchy. Salimbeni et al. (2018), Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models: fast convergence on GP models. 52

53 NGD for Bayesian Deep Learning Zhang et al. (2018), Noisy Natural Gradient as Variational Inference For Bayesian deep learning (similar to Variational Adam). 53

54 Issues and Open Problems Automatic natural-gradient computation. Good implementation of message passing. Gradient with respect to covariance matrices. Structured approximation for covariance. Comparisons on really large problems. Applications. Flexible posterior approximations. 54

55 References Available at 55

56 References Available at 56

57 A 5-page review 57

58 Acknowledgement. RIKEN AIP: Wu Lin (now at UBC), Didrik Nielsen (now at DTU), Voot Tangkaratt, Nicolas Hubacher, Masashi Sugiyama, Shun-ichi Amari. Interns at RIKEN AIP: Zuozhu Liu (SUTD, Singapore), Aaron Mishkin (UBC), Frederik Kunstner (EPFL). Collaborators: Mark Schmidt (UBC), Yarin Gal (University of Oxford), Akash Srivastava (University of Edinburgh), Reza Babanezhad (UBC). 58

59 Thanks! Slides, papers, and code available at 59
