Bayesian Deep Learning

Mohammad Emtiyaz Khan
AIP (RIKEN), Tokyo
http://emtiyaz.github.io
emtiyaz.khan@riken.jp
June 06, 2018

What will you learn?

Why is Bayesian inference useful?
1. It allows for computation of uncertainty.
2. It can prevent overfitting in a principled way.

Why is it computationally challenging?
3. It usually requires solving hard integrals.

How does Variational Inference (VI) simplify the computational challenge?
4. Integration is converted to an optimization problem.
5. Optimization gives a lower bound to the integral.

How to perform variational inference?
6. By using stochastic gradient descent.

1 Why be Bayesian?

What are the main reasons behind the recent success of machine-learning and deep-learning methods, e.g., in fields such as computer vision, speech recognition, and recommendation systems? Existing methods focus on fitting the data well (e.g., using maximum likelihood), but this may be problematic.

Which is a better fit? Consider real data where the y-axis is the frequency of an event and the x-axis is its magnitude.

This example is taken from Nate Silver's book. Which model is a better fit, blue or red?

Uncertainty Estimation

In many applications, it is also important to estimate the uncertainty of our unknowns and/or predictions. In this lecture, we will learn about a Bayesian way to do so.

[Figure 1 (taken from Kendall et al., 2017): a scene with the uncertainty of its depth estimates, showing (a) the input image and (b) the ground truth.]

2 Bayesian Model for Regression

Regression
Regression relates input variables to the output variable, either to predict outputs for new inputs and/or to understand the effect of the inputs on the output.

Dataset for regression
In regression, the data consists of pairs (x_n, y_n), where x_n is a vector of D inputs and y_n is the n-th output. The number of pairs N is the data size and D is the dimensionality.

Prediction
In prediction, we wish to predict the output for a new input vector, i.e., find a regression function that approximates the output well enough given the inputs:

y_n ≈ f_θ(x_n), for all n,

where θ is the parameter of the regression model.

Likelihood
Assume the y_n to be independent samples from an exponential-family distribution whose mean parameter is equal to f_θ(x_n):

$$p(y \mid X) = \prod_{n=1}^{N} p(y_n \mid f_\theta(x_n)).$$

For example, when p is a Gaussian, the log-likelihood results in the mean-square error.

Examples of f_θ:
- Linear model: θ^T x_n
- Generalized linear model: f(θ^T x_n)
- Deep neural network: f_1(Θ_1 f_2(Θ_2 ... f_L(Θ_L x_n)))
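To make the Gaussian case concrete, here is a minimal sketch (not part of the original notes): the log-likelihood of a linear model under Gaussian noise is a scaled, shifted negative sum of squared errors. The function name `gaussian_log_likelihood` and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_log_likelihood(theta, X, y, noise_var=1.0):
    """sum_n log N(y_n | theta^T x_n, noise_var): a scaled, shifted negative MSE."""
    residuals = y - X @ theta
    return (-0.5 * np.sum(residuals ** 2) / noise_var
            - 0.5 * len(y) * np.log(2 * np.pi * noise_var))

# Toy data: N = 50 pairs (x_n, y_n) with D = 3 inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# Maximizing this over theta is the same as minimizing the mean-square error.
print(gaussian_log_likelihood(np.zeros(3), X, y))
```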

Maximum A Posteriori (MAP)
By using a regularizer log p(θ), we can estimate θ by maximizing the following objective, denoted by L_MAP:

$$\mathcal{L}_{MAP}(\theta) := \sum_{n=1}^{N} \log p(y_n \mid f_\theta(x_n)) + \log p(\theta).$$

Example: an L_2 regularizer. The θ_MAP obtained by maximizing L_MAP is known as the maximum a posteriori estimate.

Stochastic Gradient Descent (SGD)
When N is large, choose a random pair (x_n, y_n) from the training set and approximate the gradient:

$$\frac{\partial \mathcal{L}_{MAP}}{\partial \theta} \approx \frac{\partial}{\partial \theta} \Big[ N \log p(y_n \mid f_\theta(x_n)) + \log p(\theta) \Big].$$

Using the above stochastic gradient, take a step:

$$\theta_{t+1} = \theta_t + \rho_t \frac{\partial \mathcal{L}_{MAP}}{\partial \theta},$$

where t is the iteration and ρ_t is a learning rate.
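The sketch below runs this stochastic update for an assumed linear-Gaussian setup (unit noise variance, N(0, I) prior, i.e., an L_2 regularizer); the helper name `map_stochastic_grad` and the step-size schedule are illustrative choices, not the lecture's code.

```python
import numpy as np

def map_stochastic_grad(theta, x_n, y_n, N):
    """d/dtheta of N * log p(y_n | theta^T x_n) + log p(theta),
    for a Gaussian likelihood (unit noise) and a N(0, I) prior."""
    grad_loglik = N * (y_n - x_n @ theta) * x_n   # likelihood term
    grad_logprior = -theta                        # prior (L2 regularizer) term
    return grad_loglik + grad_logprior

rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(size=N)

theta = np.zeros(D)
for t in range(1, 2001):
    n = rng.integers(N)                      # pick a random pair (x_n, y_n)
    rho = 0.1 / (N * np.sqrt(t))             # decaying learning rate rho_t
    theta = theta + rho * map_stochastic_grad(theta, X[n], y[n], N)

print(theta)  # should be close to theta_true
```

Note the sign convention: because L_MAP is maximized, the update steps in the direction of the gradient, matching the update rule above.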

The Joint Distribution
How can we obtain a distribution over θ? We can make use of Bayes' rule. We can view p(θ) as a prior distribution and define the following joint distribution:

$$\log p(y, \theta \mid X) := \sum_{n=1}^{N} \log p(y_n \mid f_\theta(x_n)) + \log p(\theta) + \text{cnst} = \log\left\{ \left[ \prod_{n=1}^{N} p(y_n \mid f_\theta(x_n)) \right] p(\theta) \right\} + \text{cnst}.$$

For example, an L_2 regularizer corresponds to a Gaussian prior distribution.

The Posterior Distribution
The posterior distribution is defined using Bayes' rule:

$$p(\theta \mid y, X) = \frac{p(y, \theta \mid X)}{p(y \mid X)} = \frac{\left[ \prod_{n=1}^{N} p(y_n \mid f_\theta(x_n)) \right] p(\theta)}{\int \left[ \prod_{n=1}^{N} p(y_n \mid f_\theta(x_n)) \right] p(\theta) \, d\theta}.$$

The integral over all possible θ defines the normalizing constant of the posterior distribution. Without it, we cannot know the true spread of the distribution, which gives us a notion of uncertainty or confidence.

[Figure taken from Nickisch and Rasmussen, 2008.]

Difficult Integral
The normalizing constant of the posterior distribution is difficult to compute in general:

$$p(y \mid X) = \int \left[ \prod_{n=1}^{N} p(y_n \mid f_\theta(x_n)) \right] p(\theta) \, d\theta.$$

There are (at least) three reasons behind this:
1) Too many factors (large N).
2) Too many dimensions (large D).
3) Non-conjugacy (e.g., a non-Gaussian likelihood with a Gaussian prior).

3 Variational Inference

Integration to Optimization.

Variational Distribution
Approximate the posterior:

p(θ | y, X) ≈ q_λ(θ).

q_λ is called the variational distribution, e.g., a Gaussian distribution with λ := {µ, Σ}.

Variational Lower Bound
Given q_λ, we can obtain a lower bound on the integral:

$$\log p(y \mid X) \geq \mathcal{L}_{VI}(\lambda) := \sum_{n=1}^{N} \mathbb{E}_{q_\lambda}\!\left[ \log p(y_n \mid f_\theta(x_n)) \right] + \mathbb{E}_{q_\lambda}\!\left[ \log \frac{p(\theta)}{q_\lambda(\theta)} \right].$$

This is also known as the variational objective, and it is very similar to the MAP objective:

$$\mathcal{L}_{MAP}(\theta) = \sum_{n=1}^{N} \log p(y_n \mid f_\theta(x_n)) + \log p(\theta).$$

So we can just use SGD!
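As a minimal sketch (assumed linear-Gaussian setup; `log_joint`, `log_q`, and `elbo_estimate` are illustrative names, not from the lecture), the lower bound can be estimated by Monte Carlo with samples from q_λ; additive constants of the log densities are dropped, so the printed value is the bound up to a constant. The stochastic gradients on the next pages differentiate exactly this kind of estimate.

```python
import numpy as np

def log_joint(theta, X, y):
    """log p(y | X, theta) + log p(theta) for a Gaussian likelihood (unit
    noise variance) and a N(0, I) prior; additive constants are dropped."""
    return -0.5 * np.sum((y - X @ theta) ** 2) - 0.5 * np.sum(theta ** 2)

def log_q(theta, mu, sigma):
    """log q_lambda(theta) for q_lambda = N(mu, diag(sigma^2)), constants dropped."""
    return np.sum(-0.5 * ((theta - mu) / sigma) ** 2 - np.log(sigma))

def elbo_estimate(mu, sigma, X, y, rng):
    """One-sample Monte Carlo estimate of L_VI(lambda)."""
    theta = mu + sigma * rng.normal(size=mu.shape)   # sample theta from q_lambda
    return log_joint(theta, X, y) - log_q(theta, mu, sigma)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
mu, sigma = np.zeros(3), np.ones(3)

# Averaging many one-sample estimates reduces the Monte Carlo noise.
print(np.mean([elbo_estimate(mu, sigma, X, y, rng) for _ in range(100)]))
```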

Stochastic Gradients I
How do we compute a stochastic gradient of L_VI? The reparameterization trick is one method to do so [Kingma and Welling, 2013]. Let q_λ := N(θ | µ, diag(σ²)); then

$$\theta(\lambda) = \mu + \sigma \circ \epsilon, \quad \epsilon \sim \mathcal{N}(\epsilon \mid 0, I)$$

is a sample from q_λ. We have written θ as a function of λ := {µ, σ}. Here is a stochastic gradient with one Monte Carlo sample θ(λ):

$$\frac{\partial \mathcal{L}_{VI}(\lambda)}{\partial \lambda} \approx \frac{\partial \theta}{\partial \lambda} \frac{\partial \mathcal{L}_{MAP}(\theta)}{\partial \theta} - \frac{\partial \theta}{\partial \lambda} \frac{\partial \log q_\lambda(\theta)}{\partial \theta} - \frac{\partial \log q_\lambda(\theta)}{\partial \lambda}.$$

This can be done whenever q_λ is reparameterizable.
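The sketch below writes this estimator out for the assumed linear-Gaussian model with a mean-field Gaussian q_λ; the analytic gradients, the name `reparam_gradient`, and the step-size schedule are illustrative assumptions rather than the lecture's code.

```python
import numpy as np

def grad_log_joint(theta, X, y):
    """d/dtheta [log p(y | X, theta) + log p(theta)] for the Gaussian
    likelihood (unit noise variance) and N(0, I) prior used earlier."""
    return X.T @ (y - X @ theta) - theta

def reparam_gradient(mu, sigma, X, y, rng):
    """One-sample reparameterization-trick gradient of L_VI w.r.t. (mu, sigma)."""
    eps = rng.normal(size=mu.shape)
    theta = mu + sigma * eps                     # theta(lambda)
    g_theta = grad_log_joint(theta, X, y)        # dL_MAP / dtheta
    dlogq_dtheta = -(theta - mu) / sigma ** 2    # d log q / dtheta
    dlogq_dmu = (theta - mu) / sigma ** 2        # d log q / dmu   (theta held fixed)
    dlogq_dsigma = (theta - mu) ** 2 / sigma ** 3 - 1.0 / sigma
    # dtheta/dmu = I and dtheta/dsigma = diag(eps), so the three-term formula gives:
    grad_mu = (g_theta - dlogq_dtheta) - dlogq_dmu
    grad_sigma = eps * (g_theta - dlogq_dtheta) - dlogq_dsigma
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

mu, log_sigma = np.zeros(3), np.zeros(3)         # optimize log(sigma) to keep sigma > 0
for t in range(1, 3001):
    sigma = np.exp(log_sigma)
    g_mu, g_sigma = reparam_gradient(mu, sigma, X, y, rng)
    rho = 0.002 / np.sqrt(t)
    mu += rho * g_mu                             # gradient *ascent* on the lower bound
    log_sigma += rho * g_sigma * sigma           # chain rule: d/dlog(sigma) = sigma * d/dsigma
print(mu, np.exp(log_sigma))                     # approximate posterior mean and std. dev.
```

For this Gaussian q_λ the entropy terms in the µ-gradient cancel analytically, but the code keeps all three terms to mirror the formula above.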

Stochastic Gradients II
REINFORCE is another approach to computing a stochastic gradient of L_VI(λ) [Williams, 1992]. It uses the log-derivative trick,

$$\frac{\partial q_\lambda}{\partial \lambda} = q_\lambda \frac{\partial \log q_\lambda}{\partial \lambda},$$

to express the gradient of the expected MAP objective as

$$\frac{\partial}{\partial \lambda} \mathbb{E}_{q_\lambda}\!\left[ \mathcal{L}_{MAP}(\theta) \right] = \mathbb{E}_{q_\lambda}\!\left[ \mathcal{L}_{MAP}(\theta) \frac{\partial \log q_\lambda(\theta)}{\partial \lambda} \right].$$

The REINFORCE gradient estimator with one Monte Carlo sample θ(λ) is given by

$$\frac{\partial \mathcal{L}_{VI}(\lambda)}{\partial \lambda} \approx \mathcal{L}_{MAP}(\theta) \frac{\partial \log q_\lambda(\theta)}{\partial \lambda} - \big(1 + \log q_\lambda(\theta)\big) \frac{\partial \log q_\lambda(\theta)}{\partial \lambda}.$$

REINFORCE is more widely applicable compared to the reparameterization trick, but has higher variance.
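For comparison, here is a minimal sketch of this single-sample estimator on the same assumed linear-Gaussian model; the helpers `log_joint`, `log_q`, and `reinforce_gradient` are illustrative names. Constants of the log densities are dropped, which leaves the estimator unbiased (since the score has zero mean) but affects its variance.

```python
import numpy as np

def log_joint(theta, X, y):
    """L_MAP(theta) for the Gaussian likelihood and N(0, I) prior (constants dropped)."""
    return -0.5 * np.sum((y - X @ theta) ** 2) - 0.5 * np.sum(theta ** 2)

def log_q(theta, mu, sigma):
    """log q_lambda(theta) for q_lambda = N(mu, diag(sigma^2)) (constants dropped)."""
    return np.sum(-0.5 * ((theta - mu) / sigma) ** 2 - np.log(sigma))

def reinforce_gradient(mu, sigma, X, y, rng):
    """One-sample REINFORCE gradient of L_VI w.r.t. (mu, sigma)."""
    theta = mu + sigma * rng.normal(size=mu.shape)       # one sample from q_lambda
    dlogq_dmu = (theta - mu) / sigma ** 2                # score function w.r.t. mu
    dlogq_dsigma = (theta - mu) ** 2 / sigma ** 3 - 1.0 / sigma
    weight = log_joint(theta, X, y) - (1.0 + log_q(theta, mu, sigma))
    return weight * dlogq_dmu, weight * dlogq_dsigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)
mu, sigma = np.zeros(3), np.ones(3)

# Per-sample variance of the mu-gradient at a fixed (mu, sigma): typically much
# larger than that of the reparameterized estimator sketched on the previous page.
grads_mu = [reinforce_gradient(mu, sigma, X, y, rng)[0] for _ in range(1000)]
print(np.var(grads_mu, axis=0))
```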

Homework

1. Derive the expression for REINFORCE.
2. VI for linear regression:
   (a) Derive the variational lower bound for a linear-regression problem. Assume a Gaussian prior on θ.
   (b) What kind of posterior approximation is appropriate in this case?
   (c) When will the lower bound be tight?
   (d) What is the memory and computational complexity of SGD?

References

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Nickisch and Rasmussen, 2008] Nickisch, H. and Rasmussen, C. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(10).

[Williams, 1992] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256.
