Bayesian Deep Learning


1 Bayesian Deep Learning Mohammad Emtiyaz Khan RIKEN Center for AI Project, Tokyo

2 Keywords: Bayesian Statistics (Gaussian distribution, Bayes' rule); Continuous Optimization (gradient descent, least squares); Deep Learning (stochastic gradient descent, RMSprop); Information Geometry (Riemannian manifolds, natural gradients). 2

3 The Goal of My Research To understand the fundamental principles of learning from data and use them to develop algorithms that can learn like living beings. 3

4 Learning by exploring at the age of 6 months

5 Converged at the age of 12 months

6 Transfer Learning at 14 months 6

7 The Goal of My Research To understand the fundamental principles of learning from data and use them to develop algorithms that can learn like living beings. 7

8 Bayesian Inference: Compute the posterior distribution instead of just a point estimate (e.g., MLE). The posterior is a natural representation of all past information, which can then be sequentially updated with new information. This is useful for active learning, sequential experiment design, continual learning, and RL, but also for global optimization, causality, etc. Eventually, it is a step toward ML methods that can learn like humans (data-efficient, robust, causal). 8

9 Uncertainty in Deep Learning To estimate the confidence in the predictions of a deep-learning system 9

10 Example: Which is a Better Fit? [Plot of earthquake frequency vs. magnitude, with two candidate fits: 57% Red, 43% Blue.] Real data from Tohoku (Japan). Example taken from Nate Silver's book The Signal and the Noise. 10

11 Example: Which is a Better Fit? [Same plot of earthquake frequency vs. magnitude.] The question matters when the data is scarce and noisy, e.g., in medicine and robotics.

12 Uncertainty for Image Segmentation. Figure (taken from Kendall et al. 2017): illustrating the difference between aleatoric and epistemic uncertainty for semantic segmentation on the CamVid dataset. Panels: (a) Input Image, (b) Ground Truth, (c) Semantic Segmentation, (d) Aleatoric Uncertainty, (e) Epistemic Uncertainty. Aleatoric uncertainty captures noise inherent in the observations; in (d) the model exhibits increased aleatoric uncertainty on object boundaries and for objects far from the camera. 12

13 Variational Inference (VI) Approximate the posterior using optimization Popular in reinforcement learning, unsupervised learning, online learning, active learning etc. We need accurate VI algorithms that are general (apply to many models), scalable (for large data and models), fast (converge quickly), simple (easy to implement). This talk: New algorithms with such features. 13

14 Gradient vs Natural-Gradient. Gradient Descent (GD): relies on stochastic and automatic gradients; simple, general, and scalable, but can have suboptimal convergence. Examples: Practical VI (2011), Black-box VI (2014), Bayes by Backprop (2015), ADVI (2015), and many more. Natural-Gradient Descent (NGD): fast convergence, but computationally difficult, therefore not simple, general, or scalable (Sato (2001), Riemannian CG (2010), Stochastic VI (2013), etc.). This talk: fast and simple NGD for complex models, such as those containing deep networks. 14

15 Talk Outline: Variational inference with gradient descent and natural-gradient descent. NGD with Conjugate-Computation VI: a generalization of forward-backward algorithms, SVI, and message passing (AISTATS 2017). Deep nets (ICML 2018, NeurIPS 2018). Generalizations and extensions: structured VAEs (ICLR 2018), mixtures of exponential-family approximations, evolution strategies (arXiv 2017), etc. 15

16 Variational Inference: Gradient Descent (GD) vs Natural-Gradient Descent (NGD) 16

17 A Naïve Method. [Diagram: a prior distribution generates the parameters of a neural network, which maps the input to the output data.] 17

18 Bayesian Inference. Bayes' rule: p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ. The (narrow) posterior distribution refines the (wide) prior, but the normalizing integral is intractable. 18

19 Variational Inference. The intractable posterior over the parameters θ given the data D is replaced by a variational approximation q_λ(θ) with natural parameters λ. Maximize the Evidence Lower Bound (ELBO): L(λ) = E_{q_λ}[log p(D | θ)] − KL(q_λ(θ) ‖ p(θ)). 19

20 Relationship to Other Fields. Perturbation to avoid local minima: Gaussian homotopy and continuation methods, smoothed/graduated optimization. Online learning: exponentially weighted averaging. Reinforcement learning: with a structured distribution, q is the policy × environment (Levine 2018). 20

21 Gradient Descent. Maximize the Evidence Lower Bound (ELBO) with gradient descent (GD): λ_{t+1} = λ_t + ρ_t ∇_λ L(λ_t). 21
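
To make this baseline concrete, here is a minimal sketch of gradient-based VI for a mean-field Gaussian approximation using the reparameterization trick (my own illustration, not code from the talk; the toy Bayesian linear-regression model, the learning rate rho, and all variable names are assumptions):

import numpy as np

# Toy data for Bayesian linear regression: y = X w + noise (illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=50)

def grad_log_joint(w):
    # Gradient of log p(y | X, w) + log p(w), with unit noise and N(0, I) prior.
    return X.T @ (y - X @ w) - w

# Mean-field Gaussian q(w) = N(mu, diag(exp(log_sigma)^2)).
mu, log_sigma = np.zeros(2), np.zeros(2)
rho = 0.01  # learning rate

for t in range(2000):
    eps = rng.normal(size=2)
    w = mu + np.exp(log_sigma) * eps                    # reparameterized sample
    g = grad_log_joint(w)
    grad_mu = g                                         # ELBO gradient w.r.t. the mean
    grad_log_sigma = g * eps * np.exp(log_sigma) + 1.0  # the +1 comes from the entropy term
    mu += rho * grad_mu                                 # plain gradient ascent on the ELBO
    log_sigma += rho * grad_log_sigma

print("approximate posterior mean:", mu)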

22 VI with Natural-Gradient Descent (Sato 2001, Honkela et al. 2010, Hoffman et al. 2013). NGD: λ_{t+1} = λ_t + ρ_t F(λ_t)^{-1} ∇_λ L(λ_t), where F is the Fisher Information Matrix (FIM). Fast convergence due to optimization in a Riemannian manifold (not Euclidean space), but it requires additional computation. Can we simplify/reduce this computation? 22

23 Can we simplify NGD computation? Yes, by using algorithms such as message passing / backprop. Conjugate-Computation VI, Khan and Lin, AISTATS 2017.

24 The key idea: Expectation Parameters. The expectation/moment/mean parameters are the expectations of the sufficient statistics, μ = E_q[T(θ)]; for Gaussians, it's the mean and the correlation matrix. A key relationship: the natural gradient with respect to the natural parameter equals the gradient with respect to the expectation parameter, F(λ)^{-1} ∇_λ L = ∇_μ L. NGD therefore becomes λ_{t+1} = λ_t + ρ_t ∇_μ L. 24
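
This identity follows from standard exponential-family facts (a short derivation, not spelled out on the slide):

$$q_\lambda(\theta) = h(\theta)\exp\{\lambda^\top T(\theta) - A(\lambda)\}, \qquad \mu = \mathbb{E}_{q_\lambda}[T(\theta)] = \nabla_\lambda A(\lambda), \qquad F(\lambda) = \nabla_\lambda^2 A(\lambda) = \nabla_\lambda \mu,$$

so by the chain rule $\nabla_\lambda \mathcal{L} = (\nabla_\lambda \mu)\,\nabla_\mu \mathcal{L} = F(\lambda)\,\nabla_\mu \mathcal{L}$, and hence the natural gradient is $F(\lambda)^{-1}\nabla_\lambda \mathcal{L} = \nabla_\mu \mathcal{L}$.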

25 Conjugate-Computation VI (CVI) In a conjugate model, this is equivalent to simply adding the natural parameters of the factors of a model. This is a type of conjugate computation, and enables simple updates for complex models. 25
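
One way to write this conjugate computation explicitly (a sketch of the standard conditionally-conjugate update in my notation, where λ_0 is the prior's natural parameter and λ_i is the natural-parameter contribution of the i-th likelihood factor under the current q): for such models the gradient with respect to the expectation parameter is ∇_μ L = λ_0 + Σ_i λ_i − λ_t, so the NGD step reduces to

$$\lambda_{t+1} = (1-\rho_t)\,\lambda_t + \rho_t\Big(\lambda_0 + \sum_i \lambda_i\Big),$$

i.e., a convex combination of the old natural parameters and a sum of natural parameters, with no Fisher matrix to form or invert.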

26 CVI on Bayesian Linear Regression. The negative log of (likelihood × prior) is, up to constants and scaling, (y − Xθ)ᵀ(y − Xθ) + δ θᵀθ; with a Gaussian approximation q, the natural gradient of the ELBO is obtained from the gradients with respect to the expectation parameters. 26

27 NGD == Newton's Method. For ρ = 1, the update converges to the least-squares solution in one step (Newton's method); gradient descent is suboptimal in comparison. This property generalizes to all conjugate models, where the forward-backward algorithm returns the natural gradients of the ELBO. 27
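
A tiny numerical check of this claim (my own sketch, not code from the talk; the prior precision delta and all names are assumptions): for Bayesian linear regression, one Newton step with ρ = 1 lands exactly on the regularized least-squares solution, while plain gradient descent does not.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=100)
delta = 1.0                                   # prior precision (assumed)

def grad(w):                                  # gradient of (1/2)||y - Xw||^2 + (delta/2)||w||^2
    return -X.T @ (y - X @ w) + delta * w

H = X.T @ X + delta * np.eye(3)               # Hessian (constant for this quadratic objective)
w0 = rng.normal(size=3)                       # arbitrary starting point

w_newton = w0 - np.linalg.solve(H, grad(w0))  # one Newton step (rho = 1)
w_exact = np.linalg.solve(H, X.T @ y)         # regularized least-squares solution
print(np.allclose(w_newton, w_exact))         # True: converged in one step

w_gd = w0.copy()                              # plain gradient descent from the same start
for _ in range(20):
    w_gd -= 0.001 * grad(w_gd)
print(np.linalg.norm(w_gd - w_exact))         # typically still visibly away from the solution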

28 Conditionally-Conjugate Models. VMP: sequential updates with ρ = 1. SVI: update the local variables with ρ = 1 and the global variables with ρ ∈ (0, 1). For CVI, ρ can follow any schedule, and the updates can be sequential or parallel. [Diagram: global variables, local latent variables, and data.] Images taken from Hoffman et al. (2013) and 28

29 Convergence Rates for CVI. With L the Lipschitz constant of the (nonconvex) ELBO, σ² the gradient-noise variance, α the strong-convexity constant of the Fisher Information Matrix, and M the mini-batch size, the expected squared update norm E[‖(λ_k − λ_{k+1})/ρ_k‖²] is bounded by a term of order 2LC₀²/t plus a noise term of order cσ²/M (with constants depending on α). See Khan et al. (UAI); the proof is based on Ghadimi, Lan, and Zhang (2014). 29

30 NGD for Deep Learning. Using CVI for Bayesian deep learning with a Gaussian approximation, the natural-gradient update reduces to a Newton-like step. 30
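
For reference, the Gaussian-approximation update as I understand it from the CVI line of work (a sketch; the exact form and notation in the papers may differ): with q_t = N(μ_t, S_t^{-1}),

$$S_{t+1} = (1-\rho)\,S_t + \rho\,\mathbb{E}_{q_t}\!\left[-\nabla^2_\theta \log p(\mathcal{D},\theta)\right], \qquad \mu_{t+1} = \mu_t + \rho\, S_{t+1}^{-1}\, \mathbb{E}_{q_t}\!\left[\nabla_\theta \log p(\mathcal{D},\theta)\right],$$

i.e., a Newton-like step built from averaged gradients and Hessians, which is why back-propagated gradients and Hessians (next slides) are all that is needed.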

31 CVI for Bayesian Neural Network. [Diagram: likelihood defined by a neural network, prior, and Gaussian approximation; the update needs only the back-propagated gradient and Hessian.] 31

32 CVI for Bayesian Neural Network (continued). [The update written in terms of the back-propagated gradient and Hessian.] 32

33 Variational Adam (Vadam) for Mean-Field (ICML 2018). Start from an adaptive learning-rate method (e.g., Adam) and approximate the Hessian by the square of the gradients; with γ = 1 this gives Variational Adam (Vadam), which updates a mean and a variance for each weight. Steps: 1. Select a minibatch. 2. Compute the gradient using backpropagation. 3. Compute a scale vector to adapt the learning rate. 4. Take a gradient step. 33
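
A minimal sketch of what one such update could look like (my own illustration of the idea on this slide, not the authors' implementation; it omits Adam's momentum and bias correction, and the prior precision delta, learning rate alpha, moving-average rate beta, and all names are assumptions):

import numpy as np

def vadam_like_step(mu, s, grad_fn, delta=1.0, alpha=1e-3, beta=0.999, rng=None):
    """One simplified Vadam-like update for a mean-field Gaussian N(mu, 1/(s + delta)).

    grad_fn(w) returns the minibatch gradient of the negative log-likelihood,
    evaluated at the sampled weights w (computed by backpropagation in practice).
    """
    rng = rng or np.random.default_rng()
    sigma = 1.0 / np.sqrt(s + delta)                          # per-weight standard deviation
    w = mu + sigma * rng.normal(size=mu.shape)                # weight-perturbation sample from q
    g = grad_fn(w)                                            # back-propagated gradient at the sample
    s = beta * s + (1.0 - beta) * g**2                        # Hessian approximated by squared gradients
    mu = mu - alpha * (g + delta * mu) / (np.sqrt(s) + delta) # adapted (scaled) gradient step
    return mu, s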

34 Illustration: Classification. Logistic regression (30 data points, 2-dimensional input), sampled from a Gaussian mixture with 2 components. 34

35 Adam vs Vadam (on Logistic Regression). [Panels: Adam, Vadam (mean), Vadam (samples).] M = 5, ρ = 0.01, γ =

36 Adam vs Vadam (on Neural Nets). [Panels: Adam, Vadam (mean), Vadam (samples).] (By Runa E.) 36

37 LeNet-5 on CIFAR-10. [Plots of log loss and error, comparing VOGN and Adam.] (By Anirudh Jain) 37

38 Faster, Simpler, and More Robust. Regression on the Australian-Scale dataset using deep neural nets for various minibatch sizes. [Plots comparing the existing method (BBVI) with our methods (Vadam and VOGN).] 38

39 Faster, Simpler, and More Robust. Results on MNIST digit classification (for various values of the Gaussian prior precision parameter). [Plots comparing the existing method (BBVI) with our method (Vadam).] 39

40 Parameter-Space Noise for Deep RL. On OpenAI Gym Cheetah with DDPG, using a DNN with [400, 300] ReLU units. Vadam (noise using natural gradients): reward 5264. SGD (noise using standard gradients); SGD (no noise): reward 2038. See Rückstieß et al. 2010, Fortunato et al. 2017, Plappert et al.

41 Avoiding Local Minima. An example taken from Casella and Robert's book. Vadam reaches the flat minimum, but GD gets stuck in a local minimum. Related: optimization by smoothing, Gaussian homotopy/blurring etc., Entropy SGLD. 41

42 Stochastic, Low-Rank, Approximate, Natural-Gradient (SLANG) (NeurIPS 2018). Low-rank plus diagonal covariance matrix; SLANG's cost is linear in D. [Diagram: a D×L low-rank factor plus a D-dimensional diagonal, updated from back-propagated gradients via a fast eigendecomposition (fast_eig).] 42
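
To see why the low-rank-plus-diagonal structure keeps the cost linear in D, here is a small sketch (my illustration, not the SLANG code; SLANG itself maintains this structure on the precision matrix and updates it with a fast eigendecomposition, whereas the sketch uses the covariance form purely to show the O(DL) scaling of sampling):

import numpy as np

def sample_lowrank_diag_gaussian(mean, B, d, n_samples, rng=None):
    """Draw samples from N(mean, B @ B.T + diag(d)) in O(D * L) time per sample.

    mean: (D,) array, B: (D, L) low-rank factor, d: (D,) positive diagonal.
    """
    rng = rng or np.random.default_rng()
    n_low = rng.normal(size=(n_samples, B.shape[1]))   # noise along the L low-rank directions
    n_diag = rng.normal(size=(n_samples, mean.size))   # independent per-coordinate noise
    return mean + n_low @ B.T + n_diag * np.sqrt(d)

# Tiny usage example with assumed sizes: D = 10000 weights, rank L = 5.
D, L = 10_000, 5
samples = sample_lowrank_diag_gaussian(np.zeros(D), 0.01 * np.ones((D, L)),
                                        0.1 * np.ones(D), n_samples=3)
print(samples.shape)   # (3, 10000)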

43 SLANG is Faster than GD Classification on USPS with BNNs 43

44 Generalization and Extensions 44

45 Deep Nets + Graphical Models (ICLR 2018, published as a conference paper). Examples: Neural Nets + GMM (latent mixture model); Neural Nets + Linear Dynamical System (latent state-space model). [Figure 1 from the paper: panels (a) and (c) show generative models that combine deep networks (NN) with PGMs, while (b) and (d) show the corresponding structured inference networks (SIN).] 45

46 Amortized Inference on VAE + Probabilistic Graphical Models (PGM) (ICLR 2018). Graphical model + deep model, with a structured inference network as the encoder and the PGM+DNN generative model as the decoder. Training uses backprop on the DNN and forward-backward on the PGM. 46

47 Going Beyond the Exponential Family. Fast and simple NGD for approximations outside the exponential family: scale mixtures of Gaussians (e.g., the T-distribution), finite mixtures of Gaussians, matrix-variate Gaussians, skew-Gaussians. The updates can be implemented using message passing and back-propagation. 47

48 Summary of the Talk. Fast yet simple NGD for VI using Conjugate-Computation VI (AISTATS 2017): a generalization of the forward-backward algorithm, stochastic VI, variational message passing, etc. Beyond conjugacy: extends fast and simple NGD to deep nets (ICML 2018, NeurIPS 2018). Generalizations and extensions: VAEs (ICLR 2018), mixtures of exponential families, evolution strategies (arXiv 2017), etc. 48

49 Related Works. Sorry if I missed some important work! Please email me. 49

50 EM, Forward-Backward, and VI Sato (1998), Fast Learning of On-line EM Algorithm. Sato (2001), Online Model Selection Based on the Variational Bayes. Jordan et al. (1999), An Introduction to Variational Methods for Graphical Models. Winn and Bishop (2005), Variational Message Passing. Knowles and Minka (2011), Non-conjugate Variational Message Passing for Multinomial and Binary Regression. 50

51 NGD: Author Name Starting with an H Honkela et al. (2007), Natural Conjugate Gradient in Variational Inference. Honkela et al. (2010), Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes. Hensman et al. (2012), Fast Variational Inference in the Conjugate Exponential Family. Hoffman et al. (2013), Stochastic Variational Inference. 51

52 NGD: Author Name Starting with an S. Salimans and Knowles (2013), Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression: approximate natural-gradient steps. Sheth and Khardon (2016), Monte Carlo Structured SVI for Two-Level Non-Conjugate Models: applies to models with two levels of hierarchy. Salimbeni et al. (2018), Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models: fast convergence on GP models. 52

53 NGD for Bayesian Deep Learning Zhang et al. (2018), Noisy Natural Gradient as Variational Inference For Bayesian deep learning (similar to Variational Adam). 53

54 Issues and Open Problems Automatic natural-gradient computation. Good implementation of message passing. Gradient with respect to covariance matrices. Structured approximation for covariance. Comparisons on really large problems. Applications. Flexible posterior approximations. 54

55 References Available at 55

56 References Available at 56

57 A 5-page review 57

58 Acknowledgement. RIKEN AIP: Wu Lin (now at UBC), Didrik Nielsen (now at DTU), Voot Tangkaratt, Nicolas Hubacher, Masashi Sugiyama, Shun-ichi Amari. Interns at RIKEN AIP: Zuozhu Liu (SUTD, Singapore), Aaron Mishkin (UBC), Frederik Kunstner (EPFL). Collaborators: Mark Schmidt (UBC), Yarin Gal (University of Oxford), Akash Srivastava (University of Edinburgh), Reza Babanezhad (UBC). 58

59 Thanks! Slides, papers, and code available at 59
