
Probabilistic & Bayesian deep learning Andreas Damianou Amazon Research Cambridge, UK Talk at University of Sheffield, 19 March 2019

In this talk Not in this talk: CRFs, Boltzmann machines,...

A standard neural network

[Diagram: inputs x_1, ..., x_q pass through basis functions φ_1, ..., φ_q, which are weighted by w_1, ..., w_q and combined into the output Y.]

Define φ_j = φ(x_j) and f_j = w_j φ_j (ignore the bias for now). Once we've learned all the w's with back-prop, f (and hence the whole network) becomes deterministic. What does that imply?

A trained neural network is deterministic. Implications?

Generalization: overfitting occurs, hence the need for ad-hoc regularizers: dropout, early stopping, ...

Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations.

No predictive uncertainty: uncertainty needs to be propagated across the model to be reliable.

Need for uncertainty: reinforcement learning, critical predictive systems, active learning, semi-automatic systems, scarce-data scenarios, ...

BNN with priors on its weights

[Diagram: two copies of the network from before, with inputs x_1, ..., x_q and basis functions φ_1, ..., φ_q. On the left the weights are point estimates w_1, ..., w_q; on the right each weight is replaced by a distribution q(w_1), ..., q(w_q).]

BNN with priors on its weights

[Diagram: the same two networks; sampling from the weight distributions yields concrete weight values, e.g. 0.23, 0.90, 0.12, 0.54.]
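
Not from the slides: a minimal numpy sketch of the idea behind these figures. Each weight carries a distribution q(w) = N(µ, σ²); every forward pass draws concrete values, so repeated passes give a spread of predictions rather than a single point. All names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Bayesian" layer sizes; every weight has a factorised Gaussian q(w) = N(mu, sigma^2).
D_in, D_hidden = 1, 20
mu1, sig1 = rng.normal(size=(D_in, D_hidden)), 0.1 * np.ones((D_in, D_hidden))
mu2, sig2 = rng.normal(size=(D_hidden, 1)),    0.1 * np.ones((D_hidden, 1))

def forward(x):
    """One stochastic forward pass: draw the weights from q(w), then act like a normal NN."""
    W1 = mu1 + sig1 * rng.standard_normal(mu1.shape)
    W2 = mu2 + sig2 * rng.standard_normal(mu2.shape)
    return np.tanh(x @ W1) @ W2

x = np.linspace(-3, 3, 50)[:, None]
preds = np.stack([forward(x) for _ in range(100)])    # 100 draws of the network
mean, std = preds.mean(axis=0), preds.std(axis=0)     # predictive mean and uncertainty
```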

Probabilistic re-formulation

DNN: y = g(W, x) = w_1 φ(w_2 φ(... x))

Training by minimizing a loss:

arg min_W  Σ_{i=1}^N (g(W, x_i) − y_i)²  +  λ Σ_i |w_i|
                 (fit)                       (regularizer)

Equivalent probabilistic view for regression, maximizing the posterior probability:

arg max_W  log p(y | x, W)  +  log p(W)
                 (fit)          (regularizer)

where p(y | x, W) is Gaussian and p(W) is Laplace.
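
A small numerical illustration (not from the talk) of this equivalence: with a Gaussian likelihood and a Laplace prior, the negative log posterior is the squared-error loss plus an L1 penalty, up to scaling. The network g and all constants below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.uniform(-1, 1, size=(N, 1))
y = np.sin(3 * x) + 0.1 * rng.standard_normal((N, 1))

def g(W, x):                            # a tiny fixed-architecture net: y = w1 * phi(w2 * x)
    w2, w1 = W
    return np.tanh(x @ w2) @ w1

sigma2, b = 0.1 ** 2, 1.0               # Gaussian noise variance, Laplace prior scale
lam = 2 * sigma2 / b                    # with this lambda the two objectives share a minimiser

def loss(W):                            # squared-error fit + L1 regulariser
    fit = np.sum((g(W, x) - y) ** 2)
    reg = sum(np.sum(np.abs(w)) for w in W)
    return fit + lam * reg

def neg_log_posterior(W):               # -log p(y|x,W) - log p(W), constants dropped
    fit = np.sum((g(W, x) - y) ** 2) / (2 * sigma2)
    reg = sum(np.sum(np.abs(w)) for w in W) / b
    return fit + reg

W = [rng.standard_normal((1, 10)), rng.standard_normal((10, 1))]
print(loss(W) / (2 * sigma2), neg_log_posterior(W))   # identical: same objective up to scale
```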

Integrating out weights

Define D = (x, y). Remember Bayes' rule:

p(w | D) = p(D | w) p(w) / p(D),    where    p(D) = ∫ p(D | w) p(w) dw

For Bayesian inference, the weights also need to be integrated out. This gives us a properly defined posterior over the parameters.
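
As a sanity check of Bayes' rule (not part of the talk), here is a crude grid approximation of the posterior for a single weight in a linear model y = w·x + noise; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 20)
y = 0.7 * x + 0.05 * rng.standard_normal(20)           # data generated with true w = 0.7

w_grid = np.linspace(-2, 2, 2001)
dw = w_grid[1] - w_grid[0]
prior = np.exp(-0.5 * w_grid ** 2) / np.sqrt(2 * np.pi)                   # p(w) = N(0, 1)
lik = np.array([np.prod(np.exp(-0.5 * ((y - w * x) / 0.05) ** 2)
                        / (0.05 * np.sqrt(2 * np.pi))) for w in w_grid])  # p(D | w)

evidence = np.sum(lik * prior) * dw                     # p(D) = integral of p(D|w) p(w) dw
posterior = lik * prior / evidence                      # Bayes' rule: p(w | D)
print(w_grid[np.argmax(posterior)])                     # close to the true 0.7
```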

(Bayesian) Occam's Razor

"A plurality is not to be posited without necessity." (W. of Ockham)

"Everything should be made as simple as possible, but not simpler." (A. Einstein)

The evidence is higher for the model that is not unnecessarily complex but still explains the data D.

Bayesian Inference Remember: Separation of Model and Inference

Inference

p(D) (and hence p(w | D)) is difficult to compute because of the nonlinear way in which w appears through g.

Attempt at variational inference:

KL( q(w; θ) || p(w | D) )  =  log p(D)  −  L(θ)
        (minimize)                         (maximize)

where

L(θ) = E_{q(w;θ)}[ log p(D, w) ] + H[ q(w; θ) ]

The first term, F = E_{q(w;θ)}[ log p(D, w) ], is still problematic. Solution: Monte Carlo (MC). Such approaches can be formulated as black-box inference.

[Figure slides omitted. Credit: Tushar Tank]

Inference: Score function method

∇_θ F = ∇_θ E_{q(w;θ)}[ log p(D, w) ]
      = E_{q(w;θ)}[ log p(D, w) ∇_θ log q(w; θ) ]
      ≈ (1/K) Σ_{k=1}^K log p(D, w^(k)) ∇_θ log q(w^(k); θ),    w^(k) ~ iid q(w; θ)

(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
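
A toy check of the score-function estimator (my own illustration, not from the slides), with a stand-in function f in place of log p(D, w) so the answer can be verified analytically:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.5, 1.2
K = 200_000

def f(w):                                # stand-in for log p(D, w); any black-box function works
    return w ** 2

w = mu + sigma * rng.standard_normal(K)          # w^(k) ~ q(w; theta) = N(mu, sigma^2)
score_mu = (w - mu) / sigma ** 2                 # d/dmu log q(w; mu, sigma)
grad_mu_est = np.mean(f(w) * score_mu)           # score-function (REINFORCE) estimator

print(grad_mu_est, 2 * mu)                       # analytic d/dmu E_q[w^2] = 2*mu
```

The estimator only needs to evaluate f, never differentiate it, which is what makes it "black box"; its known drawback is high variance.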

Inference: Reparameterization gradient

Reparameterize w as a transformation T of a simpler variable ε:  w = T(ε; θ),  so that q(ε) is independent of θ.

∇_θ F = ∇_θ E_{q(w;θ)}[ log p(D, w) ]
      = E_{q(ε)}[ ∇_w log p(D, w)|_{w=T(ε;θ)} ∇_θ T(ε; θ) ]

For example: w ~ N(µ, σ)  gives  T: w = µ + σ ε,  ε ~ N(0, 1).

MC by sampling from q(ε) (thus obtaining samples of w through T).

(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)
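
The same toy expectation estimated with the reparameterization trick (again my own illustration); in practice the derivative of log p(D, w) would come from autodiff rather than being written by hand.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.5, 1.2
K = 200_000

def df_dw(w):                     # derivative of the stand-in f(w) = w^2
    return 2 * w

eps = rng.standard_normal(K)                  # eps ~ N(0, 1), independent of theta
w = mu + sigma * eps                          # w = T(eps; theta) = mu + sigma * eps
grad_mu_est = np.mean(df_dw(w) * 1.0)         # dT/dmu = 1, chain rule
grad_sigma_est = np.mean(df_dw(w) * eps)      # dT/dsigma = eps

print(grad_mu_est, 2 * mu)                    # analytic: d/dmu E[w^2] = 2*mu
print(grad_sigma_est, 2 * sigma)              # analytic: d/dsigma E[w^2] = 2*sigma
```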

Black-box VI (github.com/blei-lab/edward)

Priors on weights (what we saw before)

[Diagram: a deep network X → φ-layer → H → ... → H → φ-layer → Y, with priors placed on the weights between layers.]

From NN to GP

[Diagram: the weighted basis-function layers between X, the hidden layers H and the output Y are replaced by GP mappings G.]

NN:  H_2 = W_2 φ(H_1)
GP:  φ is ∞-dimensional, so  H_2 = f_2(H_1; θ_2) + ε

NN:  prior p(W)        GP:  prior p(f(·))

The VAE can be seen as a special case of this.
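
One concrete way to see the "infinitely many basis functions give a kernel" step (my addition, not a slide) is via random Fourier features (Rahimi and Recht, 2007): with randomly drawn frequencies, a finite feature map approximates the RBF covariance, and the approximation tightens as the number of features grows.

```python
import numpy as np

rng = np.random.default_rng(5)
ell = 1.0                                         # RBF lengthscale
x1, x2 = np.array([0.3]), np.array([1.1])

def rff(x, Omega, b, M):
    """Random Fourier features: phi(x)^T phi(x') -> RBF kernel as M -> infinity."""
    return np.sqrt(2.0 / M) * np.cos(Omega @ x + b)

for M in [10, 100, 10_000]:
    Omega = rng.standard_normal((M, 1)) / ell     # frequencies ~ N(0, 1/ell^2)
    b = rng.uniform(0, 2 * np.pi, size=M)
    k_hat = rff(x1, Omega, b, M) @ rff(x2, Omega, b, M)
    print(M, k_hat)

print("exact RBF:", np.exp(-np.sum((x1 - x2) ** 2) / (2 * ell ** 2)))
```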

From NN to GP

[Diagram as before.]

The real world is perfectly described by unobserved latent variables Ĥ. But we only observe noisy, high-dimensional data Y. We try to interpret the world and infer the latents: H ≈ Ĥ.

Inference:  p(H | Y) = p(Y | H) p(H) / p(Y)

Deep Gaussian processes

Uncertainty about parameters: check. Uncertainty about structure?

A deep GP simultaneously brings in: a prior on the weights; a kernelized input/latent space; stochasticity in the warping.

Introducing Gaussian Processes: A Gaussian distribution depends on a mean and a covariance matrix. A Gaussian process depends on a mean and a covariance function.
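
To make the "covariance function" idea concrete, here is a short numpy sketch (not from the talk) that draws functions from a GP prior with a squared-exponential covariance; the lengthscale and variance values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.linspace(-3, 3, 100)[:, None]

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance function k(x, x')."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

K = rbf(X, X) + 1e-8 * np.eye(len(X))              # small jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ rng.standard_normal((len(X), 5))     # 5 functions drawn from the GP prior
```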

Samples from a 1-D Gaussian

[Sequence of figure slides.]

How do GPs solve the overfitting problem (i.e. regularize)?

Answer: integrate over the function itself! So we average over all possible function forms under a (GP) prior.

Recap (θ are hyperparameters):

MAP:       arg max_w  p(y | w, φ(x)) p(w),          e.g.  y = φ(x)ᵀ w + ε
Bayesian:  arg max_θ  ∫ p(y | f) p(f | x, θ) df,    e.g.  y = f(x; θ) + ε,  with p(f | x, θ) the GP prior

The Bayesian approach (GP) automatically balances the data fit against the complexity penalty.
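
For a Gaussian likelihood this averaging over functions is available in closed form; below is a minimal GP regression sketch (standard textbook equations, not code from the talk), including the log marginal likelihood that performs the fit-versus-complexity trade-off.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(15, 1))
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)
Xs = np.linspace(-4, 4, 200)[:, None]
noise = 0.1 ** 2

def rbf(A, B, ls=1.0, var=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return var * np.exp(-0.5 * d2 / ls ** 2)

K = rbf(X, X) + noise * np.eye(len(X))
Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
alpha = np.linalg.solve(K, y)
mean = Ks.T @ alpha                                   # posterior predictive mean
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)             # posterior predictive covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None) + noise)

# log marginal likelihood: the quantity that balances data fit against complexity
_, logdet = np.linalg.slogdet(K)
lml = -0.5 * (y.T @ alpha).item() - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)
```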

Deep Gaussian processes

Define a recursive stacked construction:

f(x)                                      GP
f_L(f_{L−1}(... f_1(x)))                  deep GP

Compare to:

φ(x)ᵀ w                                   NN
φ( φ( ... φ(x)ᵀ w_1 ... )ᵀ w_{L−1} )ᵀ w_L  DNN

[Diagram: x → f_1 → h_1 → f_2 → h_2 → f_3 → y.]

Two-layered DGP

[Diagram: input X → f_1 → unobserved layer H → f_2 → output Y.]
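
A quick way to get intuition for this construction (my own sketch, not from the slides) is to sample from a two-layer deep GP prior: draw a function from a GP, then feed its outputs into a second GP draw.

```python
import numpy as np

rng = np.random.default_rng(8)

def rbf(A, B, ls=1.0, var=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return var * np.exp(-0.5 * d2 / ls ** 2)

def gp_sample(X, ls=1.0):
    K = rbf(X, X, ls) + 1e-6 * np.eye(len(X))      # jitter keeps the Cholesky stable
    return np.linalg.cholesky(K) @ rng.standard_normal((len(X), 1))

X = np.linspace(-3, 3, 200)[:, None]
H = gp_sample(X)          # layer 1: h = f_1(x),  f_1 ~ GP
Y = gp_sample(H)          # layer 2: y = f_2(h),  f_2 ~ GP evaluated at the layer-1 outputs
# Y is a draw from a 2-layer deep GP prior: y = f_2(f_1(x))
```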

Inference in DGPs

It's tricky. An additional difficulty comes from the fact that the kernel couples all points, so p(y | x) no longer factorizes. Indeed,

p(y | x, w) = ∫ p(y | f) p(f | x, w) df

but w appears inside the nonlinear kernel function:

p(f | x, w) = N( f | 0, K(X, X; w) )

The resulting objective function

By integrating over all unknowns, we obtain approximate posteriors q(H), q(f) over them, and we end up with a well-regularized variational lower bound on the true marginal likelihood:

F = Data fit  −  KL( q(H_1) || p(H_1) )  +  Σ_{l=2}^{L} H( q(H_l) )
                    (regularization)           (regularization)

Regularization is achieved through uncertainty propagation.

An example where uncertainty propagation matters: Recurrent learning.

Dynamics/memory: Deep Recurrent Gaussian Process

Avatar control Figure: The generated motion with a step function signal, starting with walking (blue), switching to running (red) and switching back to walking (blue). https://youtu.be/fr-oegxv6yy https://youtu.be/at0hmtopgjc https://youtu.be/fuf-uz83vmw Videos: Switching between learned speeds Interpolating (un)seen speed Constant unseen speed

[Figure slides comparing RGP, GPNARX, MLP-NARX and LSTM.]

Stochastic warping

[Diagram: deep network X → φ-layers → hidden layers H, with N(0, I) noise injected at each hidden layer, → Y.]

Inference: we need to infer posteriors on H. Define q(H) = N(µ, Σ). Amortize: {µ, Σ} = MLP(Y; θ).

Amortized inference

Solution: reparameterization through a recognition model g:

µ_1^(n) = g_1(y^(n)),    µ_l^(n) = g_l(µ_{l−1}^(n)),    g_l = MLP(θ_l) deterministic

so that  µ_l^(n) = g_l(... g_1(y^(n)))

(Dai, Damianou, Gonzalez, Lawrence, ICLR 2016)
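
A bare-bones sketch of such a recognition model (illustrative only; the layer sizes, random weights and single-sample reparameterization are my assumptions, not the architecture from Dai et al.):

```python
import numpy as np

rng = np.random.default_rng(9)
D_y, D_h, D_hidden = 10, 2, 32

# recognition model g(y; theta): an MLP that outputs the parameters of q(h | y)
W1 = rng.standard_normal((D_y, D_hidden)) * 0.1
W2_mu = rng.standard_normal((D_hidden, D_h)) * 0.1
W2_logvar = rng.standard_normal((D_hidden, D_h)) * 0.1

def recognition(y):
    hidden = np.tanh(y @ W1)
    return hidden @ W2_mu, hidden @ W2_logvar     # mu(y), log sigma^2(y)

Y = rng.standard_normal((5, D_y))                 # a small batch of observations
mu, logvar = recognition(Y)
eps = rng.standard_normal(mu.shape)
H = mu + np.exp(0.5 * logvar) * eps               # reparameterised samples from q(h | y)
```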

Bayesian approach recap

Three ways of introducing uncertainty / noise in a NN: treat the weights w as distributions (BNNs); stochasticity in the warping function φ (VAE); Bayesian non-parametrics applied to DNNs (DGP).

Connection between Bayesian and traditional approaches

There's another view of Bayesian NNs: they can be used as a means of explaining / motivating DNN techniques (e.g. dropout) in a more mathematically grounded way.

Dropout as variational inference (Kingma et al., 2015; Gal and Ghahramani, 2016). Low-rank NN weight approximations and deep Gaussian processes (Hensman and Lawrence, 2014; Damianou, 2015; Louizos and Welling, 2016; Cutajar et al., 2016)...
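
As an example of this connection, here is a minimal sketch of the MC-dropout recipe associated with Gal and Ghahramani (2016): keep dropout switched on at prediction time and treat repeated stochastic forward passes as approximate posterior samples. The toy network and dropout rate below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(10)
D_in, D_hidden = 1, 50
W1 = rng.standard_normal((D_in, D_hidden))
W2 = rng.standard_normal((D_hidden, 1)) / np.sqrt(D_hidden)
p_keep = 0.8

def forward(x, dropout=True):
    h = np.tanh(x @ W1)
    if dropout:                                   # dropout kept ON at prediction time
        mask = rng.random(h.shape) < p_keep
        h = h * mask / p_keep
    return h @ W2

x = np.linspace(-3, 3, 100)[:, None]
preds = np.stack([forward(x) for _ in range(100)])   # repeated stochastic passes
mean, std = preds.mean(axis=0), preds.std(axis=0)    # approximate predictive mean / uncertainty
```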

Summary

Motivation for probabilistic and Bayesian reasoning; three ways of incorporating uncertainty in DNNs; inference is more challenging when uncertainty has to be propagated; connection between Bayesian and traditional NN approaches.