The connection of dropout and Bayesian statistics


The connection of dropout and Bayesian statistics: interpretation of dropout as approximate Bayesian modelling of NNs. References: Yarin Gal's thesis, http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf; Dropout: Nitish Srivastava (University of Toronto) and Geoffrey Hinton (Google, University of Toronto), Srivastava et al., Journal of Machine Learning Research 15 (2014). Presented by Beate Sick, IDP, ZHAW. 1

Outline:
- Main principles of frequentists and Bayesians (a reminder)
- Neural networks and how they get trained (a reminder)
- What is dropout and what is the connection to Bayesian inference?
- How to do approximate Bayesian inference in deep NNs
- How to use dropout to get an uncertainty measure for predictions?
- How can we use the Bayesian approach to make better models?
2

The frequentist's and the Bayesian's view on data and model parameters.
Frequentist:
- Data are a repeatable random sample, e.g. 10 coin tosses with frequencies 2x tail, 8x head; model: Bernoulli with the head probability θ as parameter.
- The underlying model parameters are fixed during this repeatable data sampling.
- We use maximum likelihood to determine the parameter value under which the observed data are most likely.
- A 95% CI means that when repeating the experiment many times, 95% of all 95% CIs will cover the true fixed parameter.
[Figure: distribution of the data under the fixed parameter, P(X | θ), over the outcomes tail/head.]
Bayesian:
- Data are observed once and seen as fixed.
- Parameters are described probabilistically.
- Before observing the data we formulate a prior belief about the parameter distribution.
- After observing the data we update it to the posterior distribution of the model parameters.
- A 95% credibility interval is an interval of parameter values that covers 95% of the posterior distribution.
[Figure: prior distribution P(θ) and posterior distribution P(θ | X, Y).]
3

Bayesian modeling has fewer problems with complex models.
Frequentist's strategy: you can only use a complex model if you have enough data! [Figure: a fit with 2 model parameters vs. a fit with 6 model parameters.*]
Bayesian's strategy: use the model complexity you believe in. Do not just use the best-fitting model; use the full posterior distribution over parameter settings. This leads to vague predictions, since many different parameter settings have significant posterior probability. [Figure: several models with 6 parameters, all with significant posterior probability.*]
Prediction for a new input x* via marginalization over ω:
p(y* | x*, X, Y) = ∫ p(y* | x*, ω) p(ω | X, Y) dω
* Image credits: Hinton's Coursera course.
4
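As a minimal sketch (not from the slides), this marginalization integral is usually approximated by Monte Carlo: draw weight samples from the posterior (or an approximation of it) and average the resulting predictions. The `posterior_samples` iterable and the `predict` function below are hypothetical placeholders.

```python
import numpy as np

def predictive_distribution(x_star, posterior_samples, predict):
    """Monte-Carlo approximation of p(y* | x*, X, Y) = ∫ p(y* | x*, w) p(w | X, Y) dw.

    posterior_samples: iterable of weight vectors w drawn from p(w | X, Y)
                       (or from an approximating distribution q).
    predict:           function (x, w) -> vector of class probabilities p(y* | x*, w).
    """
    probs = np.array([predict(x_star, w) for w in posterior_samples])
    return probs.mean(axis=0)   # average the per-sample predictions
```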

Logistic regression or a neural net with 1 neuron. [Figure: input vector (x1, x2), bias b, weights w1, w2 on the i-th input, pre-activation z, output y.] The neuron computes z = b + w1 x1 + w2 x2. Different non-linear transformations are used to get from z to the output y: the logistic function y = 1 / (1 + e^(-z)), or the ReLU y = max(0, z). 5
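As an illustration (not part of the slides), a minimal NumPy sketch of this single neuron; the inputs, weights and bias are made-up numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# hypothetical values: two inputs, two weights and a bias
x = np.array([0.5, -1.2])
w = np.array([0.8, 0.3])
b = 0.1

z = b + w @ x                 # linear pre-activation z = b + w1*x1 + w2*x2
print(sigmoid(z), relu(z))    # two possible non-linear outputs y
```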

A neural network is a parametric model. We observe inputs X = {x_i}, i = 1, ..., N, and outputs Y = {y_i}, i = 1, ..., N. [Figure: a network with weight matrices W1 and W2.] Using a NN as the output-generating model, we can do parametric inference for the connection weights W of the NN, which are the parameters defining the model. The output is y = f^W(x). 6

Fitting a NN classifier via softmax likelihood optimization. Determine all z's and y's in the forward pass. [Figure: a two-layer network with weight matrices W1 and W2, hidden activations z1, y1 and output pre-activations z2,1, z2,2, z2,3.] The class probabilities are given by the softmax, P(y = k) = p_k = e^(z_k) / Σ_{j ∈ class ids} e^(z_j). The negative log-likelihood is the cost function C = -Σ_k I_k log p_k, where I_k is the indicator for the true class k. Find the set of weights that minimizes C! Update the weights in the backward pass via gradient descent, θ_i^(t+1) = θ_i^(t) - η ∂C/∂θ_i^(t), with learning rate η, taking the gradient with the chain rule (via backpropagation). 7
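For concreteness, a small NumPy sketch of one such training step for a linear softmax classifier (a one-layer special case, not the exact network on the slide); the data, class count and learning rate are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # 8 toy inputs with 4 features
t = rng.integers(0, 3, size=8)       # true class labels (3 classes)
W = np.zeros((4, 3))                 # weights theta (bias omitted for brevity)
eta = 0.1                            # learning rate

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# forward pass: z = X W, p_k = softmax(z)
p = softmax(X @ W)

# cost C = -sum_k I_k log p_k, averaged over the batch
C = -np.log(p[np.arange(len(t)), t]).mean()

# backward pass: dC/dz = p - I (one-hot), chain rule gives dC/dW = X^T (p - I)
I = np.eye(3)[t]
grad_W = X.T @ (p - I) / len(t)
W = W - eta * grad_W                  # gradient-descent update
print("cost:", C)
```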

- Main principles of frequentists and Bayesians (a reminder)
- Neural networks and how they get trained (a reminder)
- What is dropout and what is the connection to Bayesian inference?
- How to do approximate Bayesian inference in deep NNs
- How to use dropout to get an uncertainty measure for predictions?
- How can we use the Bayesian approach to make better models?
8

Dropout. [Figure: the full net without dropout vs. a thinned net with dropout.] At each training step we remove random nodes with a probability p, resulting in a sparse version of the full net, and we use backpropagation to update the weights. -> In each training step we train a different NN model, but all models share weights: a given connection weight has the same current value in all models. Srivastava et al., Journal of Machine Learning Research 15 (2014). 9
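As an illustrative sketch (not code from the paper), the training-time forward pass with dropout can be implemented by multiplying a layer's activations with a Bernoulli mask; the layer sizes and the keep/drop convention here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop):
    """Randomly zero out units of activation vector h with probability p_drop."""
    mask = rng.binomial(1, 1.0 - p_drop, size=h.shape)   # 1 = keep, 0 = drop
    return h * mask

# toy two-layer forward pass with dropout on the hidden layer
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

h = np.maximum(0.0, x @ W1 + b1)        # hidden activations (ReLU)
h = dropout_forward(h, p_drop=0.5)      # train another "thinned" sub-network
logits = h @ W2 + b2                    # backprop would update only the kept paths
```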

Why dropout can be a good idea. Dropout reduces the complexity of the model and thus overfitting? Using dropout during training implies: in each training step we train a different sparse model; only the weights connected to units that were not dropped are updated; sharing weights across these models may act as a regularizer; and by averaging over these models we should be able to reduce noise and overfitting. At test time we use the trained net without dropout, but we need to make a small adjustment! 10

Why dropout can be a good idea. Use the trained net without dropout during test time. Q: Suppose that with all inputs present at test time the output of this neuron is a = w0*x + w1*y. What would its output be during training time, in expectation (e.g. if p = 0.5)? With dropout, each of the four input patterns (x, y), (x, 0), (0, y), (0, 0) occurs with probability 1/4, so E[a] = 1/4 (w0*x + w1*y) + 1/4 (w0*x) + 1/4 (w1*y) + 1/4 * 0 = 1/2 (w0*x + w1*y). Using all units in the test-time forward pass therefore gives an output that is 2x the expected output during training (this is exact for linear units without hidden layers). => We have to compensate by scaling the weights down at test time, multiplying them by p = 0.5. 11
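A quick NumPy check of this expectation argument (illustrative only; the weights and inputs are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
w0, w1, x, y = 0.7, -0.4, 2.0, 3.0
p_keep = 0.5                                  # with p = 0.5, keep and drop are equally likely

a_test = w0 * x + w1 * y                      # test time: all inputs present

# training time: average output of the dropped-out neuron over many masks
masks = rng.binomial(1, p_keep, size=(100000, 2))
a_train = (masks * [w0 * x, w1 * y]).sum(axis=1).mean()

print(a_test, a_train)                        # a_train ≈ 0.5 * a_test
print(p_keep * a_test)                        # rescaling the weights recovers a_train
```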

Why dropout can be a good idea. The training data consists of many different pictures of Oliver Dürr and Albert Einstein. We need a huge number of neurons to extract good features which help to distinguish Oliver from Einstein. Dropout forces the network to learn redundant and independent features, e.g. "has a mustache", "is well shaved", "holds no mobile", "wears no glasses", "has dark eyes". 12

We will see that dropout is an approximation of full Bayesian learning in a Bayesian neural network. Dropout: at each training step we remove random nodes with a probability p. Bayesian NN: at each training step we update the posterior distribution of the weights; at test time we use the posterior to determine the predictive distribution. 13

Bayesian modeling in the framework of NNs. In Bayesian modeling we define a prior distribution over the parameters W, e.g. a Gaussian ω ~ N(0, I), defining p(w_i). For classification NN tasks we assume a softmax likelihood, P(y = k | ω, x) = p_k = e^(f_{k,ω}(x)) / Σ_{k'} e^(f_{k',ω}(x)). Given a dataset X, Y we then look for the posterior distribution capturing the most probable model parameters given the observed data: p(ω | X, Y) = p(Y | ω, X) p(ω) / p(Y | X), i.e. posterior = likelihood × prior / normalizer, where the normalizer is the marginal likelihood. 14

Marginalization steps in Bayesian statistics. We can use the posterior distribution p(w | X, Y) = p(Y | w, X) p(w) / p(Y | X) to determine the most probable weight vector (MAP: maximum a posteriori), to determine credible intervals, or to sample weight vectors from the posterior distribution. For the posterior we need to determine the normalizer, also called the model evidence, by integrating over all possible parameters weighted by their probabilities: p(Y | X) = ∫ p(Y | X, ω) p(ω) dω. This process is also called marginalization over ω, leading to the name marginal likelihood. 15

Using a neural network for prediction. A deep NN or deep CNN performs a hierarchical non-linear transformation of its input, y = f^ω(x). These models give us only point estimates with no uncertainty information. Remark: for shallow NNs, CIs for the weights can be determined if the network is identified (White, 1998). With Bayesian modeling we can get a measure of uncertainty by evaluating the posterior distribution of the NN weights. 16

- Main principles of frequentists and Bayesians (a reminder)
- Neural networks and how they get trained (a reminder)
- What is dropout and what is the connection to Bayesian inference?
- How to do approximate Bayesian inference in deep NNs
- How to use dropout to get an uncertainty measure for predictions?
- How can we use the Bayesian approach to make better models?
17

Full Bayesian learning and ideas for approximations. Full Bayesian learning means computing the full posterior distribution over all possible parameter settings. This is extremely computationally intensive for all but the simplest models, but it allows us to use complicated models even without much data. Approximating full Bayesian learning is the only way forward if we have models with many (>100) parameters or otherwise complex models. We can use Markov Chain Monte Carlo (MCMC) methods to sample parameter vectors according to their posterior probabilities given the observed data, p(ω | X, Y), and use them to estimate the distribution. Alternatively, we can approximate the posterior distribution of the model parameters via a) Integrated Nested Laplace Approximations (INLA), or b) variational inference: replacing the posterior distribution given the observed data with a member of a simpler distribution family that minimizes the Kullback-Leibler divergence to the posterior. 18

Variational inference approximates full Bayesian learning. Approximate p(ω | X, Y) with a simple distribution q_θ(ω). Minimize the Kullback-Leibler divergence of q from the posterior w.r.t. the variational parameters θ:
KL(q_θ(ω) || p(ω | X, Y)) = ∫ q_θ(ω) log [ q_θ(ω) / p(ω | X, Y) ] dω = E_{q_θ}[ log q_θ(ω) - log p(ω | X, Y) ].
Minimizing the KL means wiggling the parameters θ of q to find the value of θ for which q resembles the posterior distribution as closely as possible. The term "variational" is used because you pick the best q in a family Q; it derives from the calculus of variations, which deals with optimization problems. A particular q in Q is specified by setting the variational parameters θ, which are adjusted during optimization. 19
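As a toy illustration of this idea (not from the slides), one can fit a 1-D Gaussian q_θ to a known target density by minimizing a Monte-Carlo estimate of KL(q || p) over the variational parameters θ = (μ, σ); the target density and the grid search are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical target "posterior": p(w) = N(2, 0.5^2)
def log_p(w):
    return -0.5 * ((w - 2.0) / 0.5) ** 2 - np.log(0.5 * np.sqrt(2 * np.pi))

def kl_mc(mu, sigma, n=20000):
    """Monte-Carlo estimate of KL(q || p) = E_q[log q(w) - log p(w)]."""
    w = rng.normal(mu, sigma, size=n)                       # samples from q_theta
    log_q = -0.5 * ((w - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.mean(log_q - log_p(w))

# crude optimization over the variational parameters theta = (mu, sigma)
best = min((kl_mc(m, s), m, s)
           for m in np.linspace(0.0, 4.0, 41)
           for s in np.linspace(0.1, 1.5, 29))
print("smallest KL at (mu, sigma) =", best[1:], "KL ≈", round(best[0], 3))
```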

Variational inference to approximate Bayesian learning (ctd.). Minimizing the KL divergence of q from the posterior distribution p w.r.t. θ,
KL(q_θ(ω) || p(ω | X, Y)) = ∫ q_θ(ω) log [ q_θ(ω) / p(ω | X, Y) ] dω,
is equivalent to maximizing a lower bound of the log marginal likelihood w.r.t. θ (also called the evidence lower bound, ELBO):
L_VI(θ) := ∫ q_θ(ω) log p(Y | X, ω) dω - KL(q_θ(ω) || p(ω)) <= L = log p(Y | X).
The key trick here, as in most variational methods, is to write the true log-likelihood L as the log of an expectation under some q. We can then get a lower bound via Jensen's inequality, which tells us that the log of an expectation >= the expectation of the log (since log is concave and q is a distribution):
L = log p(Y | X) = log ∫ p(Y | X, ω) p(ω) dω = log ∫ p(Y | X, ω) p(ω) [ q_θ(ω) / q_θ(ω) ] dω = log E_q[ p(Y | X, ω) p(ω) / q_θ(ω) ]
>= E_q[ log ( p(Y | X, ω) p(ω) / q_θ(ω) ) ] = E_q[ log p(Y | X, ω) ] + E_q[ log ( p(ω) / q_θ(ω) ) ]
= ∫ q_θ(ω) log p(Y | X, ω) dω - KL(q_θ(ω) || p(ω)). 20

Variational inference to approximate Bayesian learning (ctd.). The best θ, which makes q_θ a good approximation of the posterior, is the θ that maximizes the following objective, a lower bound of L:
L_VI(θ) := ∫ q_θ(ω) log p(Y | X, ω) dω - KL(q_θ(ω) || p(ω)).
Since this integral is not tractable for almost all q, we use MC integration to approximate it. We sample ω̂ from q_θ, and in each sampling step the integral is replaced by log p(Y | X, ω̂), leading to a new objective:
L̂(θ) := log p(Y | X, ω̂) - KL(q_θ(ω) || p(ω)).
In approximate Bayesian statistics it is known that L̂ is an unbiased estimator of L_VI. 21

Variational inference to approximate Bayesian learning (ctd.).
L_VI(θ) := ∫ q_θ(ω) log p(Y | X, ω) dω - KL(q_θ(ω) || p(ω))
L̂(θ) := log p(Y | X, ω̂) - KL(q_θ(ω) || p(ω))
Results from stochastic inference guarantee that optimizing L̂ w.r.t. θ will converge to the same optimum as optimizing L_VI w.r.t. θ. To optimize L̂ stepwise, we sample in each step one ω̂ from q_θ, do one step of optimization of L̂ w.r.t. θ, and then update θ before sampling the next ω̂. For inference, repeatedly do: a) sample ω̂ ~ q_θ(ω); b) do one optimization step on L̂(θ) w.r.t. θ. What kind of q-distribution should we use so that Bayesian learning resembles dropout? 22
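A schematic sketch of this loop (illustrative only; the helper functions `sample_q`, `grad_log_lik` and `grad_kl` are hypothetical, and in practice the gradient of the likelihood term w.r.t. θ is obtained via a reparameterization of the sampling step):

```python
def stochastic_variational_inference(theta, n_steps, lr,
                                     sample_q, grad_log_lik, grad_kl):
    """Schematic stochastic VI loop.

    sample_q(theta)            -> one weight sample w_hat ~ q_theta
    grad_log_lik(w_hat, theta) -> gradient w.r.t. theta of log p(Y | X, w_hat)
    grad_kl(theta)             -> gradient w.r.t. theta of KL(q_theta || prior)
    """
    for _ in range(n_steps):
        w_hat = sample_q(theta)                              # a) sample from q_theta
        grad = grad_log_lik(w_hat, theta) - grad_kl(theta)   # gradient of L_hat(theta)
        theta = theta + lr * grad                            # b) one ascent step on L_hat
    return theta
```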

Define the structure of the approximate distribution q. We define q to factorize over the weight matrices W_i of the layers 1 to L:
q_θ(ω) = Π_i q_{M_i}(W_i).
For each W_i we define q so that a sample is the product of a mean weight matrix M_i and a diagonal matrix which holds Bernoulli variables on the diagonal:
W_i = M_i diag(z_i), with z_{i,j} ~ Bernoulli(p_i) and W_i ~ q_{M_i}(W_i),
the approximate posterior of the model parameters; M_i is the variational parameter of q. Sampling the diagonal elements z_{i,j} from a Bernoulli is identical to randomly setting columns of M_i to zero, which is identical to randomly setting units of the network to zero -> dropout! For derivations and proofs see: http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf 23
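A tiny NumPy sketch of sampling from this q (the layer dimensions and the Bernoulli parameter are made-up, and p_i is taken here as the keep probability):

```python
import numpy as np

rng = np.random.default_rng(0)

K_in, K_out = 5, 3
M = rng.normal(size=(K_out, K_in))       # variational parameter: mean weight matrix M_i
p_keep = 0.5                             # Bernoulli parameter p_i (assumed keep probability)

z = rng.binomial(1, p_keep, size=K_in)   # z_{i,j} ~ Bernoulli(p_i)
W = M @ np.diag(z)                       # W_i = M_i diag(z): zeroes whole columns of M_i

h_in = rng.normal(size=K_in)             # layer computes h_out = W @ h_in
print(np.allclose(W @ h_in, M @ (z * h_in)))  # True: identical to dropping input units
```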

Why this structure of the approximate distribution q? It resembles dropout! We know what to do at test time: use the full NN and scale the weights, W_i = p_i M_i. Bernoullis are a computationally cheap way to get multi-modality. This q constrains the weights to be near zero: for this q structure the variance of the weights is Var(W_i) = p_i (1 - p_i) M_i M_i^T, and we know that posterior uncertainty decreases with more data, implying that for a fixed dropout probability p_i the average weights need to decrease with more data. The strongest regularization is achieved with dropout probability p_i = 0.5. 24

- Main principles of frequentists and Bayesians (a reminder)
- Neural networks and how they get trained (a reminder)
- What is dropout and what is the connection to Bayesian inference?
- How to do approximate Bayesian inference in deep NNs
- How to use dropout to get an uncertainty measure for predictions?
- How can we use the Bayesian approach to make better models?
25

DL with dropout as approximate Bayesian inference allows us to get a prediction distribution for uncertainty estimation. To get an approximation of the posterior, during training we randomly set columns of M_i to zero (do dropout) and update the weights by doing one optimization step. To sample from the learned approximate posterior, we simply keep doing dropout at test time when using the trained NN for prediction. From the resulting predictions we can estimate the predictive distribution and, from this, different uncertainty measures such as the variance. 26
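A minimal sketch of this test-time procedure ("MC dropout"), assuming a hypothetical `forward` function that runs the trained network with a fresh dropout mask on every call:

```python
import numpy as np

def mc_dropout_predict(forward, x_star, T=100):
    """Estimate the predictive distribution by T stochastic forward passes.

    forward(x) must apply dropout (i.e. sample new Bernoulli masks) on every call
    and return a vector of class probabilities or a regression output.
    """
    samples = np.stack([forward(x_star) for _ in range(T)])
    mean = samples.mean(axis=0)          # MC estimate of the predictive mean
    var = samples.var(axis=0)            # a simple uncertainty measure
    return mean, var
```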

Can these insights help us to make better models? Yes: just do dropout also in the convolutional layers, during training and at test time! MC dropout amounts to performing several stochastic forward passes through the network and averaging the results. By doing so, Yarin Gal was able to outperform state-of-the-art error rates on MNIST and CIFAR-10 without changing the architecture of the CNNs used or anything else besides dropout. [Table legend: "Standard" means no dropout at test time, "all" means dropout was applied in all layers, "MC dropout" means dropout was also applied at test time and the results were averaged.] 27

The dropout approach speeds up reinforcement learning. Example: train a Roomba vacuum cleaner via reinforcement learning (http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html#graph_canvas): reward for collecting dirt, penalty for bumping into walls. Without dropout: take a random action with probability ε and the model-predicted action otherwise (epsilon-greedy). With dropout: do several stochastic forward passes through the dropout network and choose the action with the highest value (Thompson sampling). 28
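A schematic sketch contrasting the two action-selection rules described above (the `q_values` function, the value of epsilon, and the use of a single stochastic pass for the dropout-based choice are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(q_values, state, n_actions, eps=0.1):
    """q_values(state, dropout=False) -> deterministic Q-value estimates per action."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))       # explore: random action
    return int(np.argmax(q_values(state, dropout=False)))

def dropout_thompson_action(q_values, state):
    """One stochastic forward pass through the dropout network ~ sampling a Q-function."""
    return int(np.argmax(q_values(state, dropout=True)))
```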

Thank you for your attention 29

How good is the dropout approach? Compare the performance of the dropout method with well-established and published variational inference (VI) and probabilistic back-propagation (PBP) results on the basis of the root mean square error (RMSE) and the log-likelihood (LL). The better the model fits, the smaller the RMSE. The smaller the uncertainty of the model, the larger the LL.