Dropout as a Bayesian Approximation: Insights and Applications


Dropout as a Bayesian Approximation: Insights and Applications
Yarin Gal and Zoubin Ghahramani
Discussion by: Chunyuan Li, Jan. 15, 2016

Main idea

In the framework of variational inference, the authors show that the standard algorithm of SGD training with dropout is essentially optimising a stochastic lower bound of a Gaussian process whose kernel is defined by the neural network.

Outline

Preliminaries: Dropout, Gaussian Processes, Variational Inference

Dropout Procedure

Training stage: a unit is present with probability p.
Testing stage: the unit is always present and its weights are multiplied by p.

Intuition: training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing; at test time a single neural network approximates averaging the outputs of this collection.
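As a minimal illustration of the two stages (mine, not from the slides; the sizes, ReLU activation and linear output layer are assumptions), the scaled test-time network approximates the average over the thinned networks:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                        # probability that a unit is present
x = rng.normal(size=(1, 8))    # a single input
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))
relu = lambda a: np.maximum(a, 0.0)

def forward_train(x):
    """Training stage: each hidden unit is kept with probability p."""
    b = rng.binomial(1, p, size=16)          # binary dropout variables
    return (relu(x @ W1) * b) @ W2

def forward_test(x):
    """Testing stage: units always present, weights multiplied by p."""
    return relu(x @ W1) @ (p * W2)

# Averaging many thinned networks is approximated by the single scaled network.
mc_average = np.mean([forward_train(x) for _ in range(10000)], axis=0)
print(mc_average.ravel(), forward_test(x).ravel())
```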

Dropout for a One-Hidden-Layer Neural Network

Dropout on individual units:
  $\hat{y} = \big( g((x \circ b_1) W_1) \circ b_2 \big) W_2$   (1)
Input $x$ and output $\hat{y}$; $g$: activation function; weights $W_l \in \mathbb{R}^{K_{l-1} \times K_l}$; $b_l$ are binary dropout variables (the bias term is ignored for simplicity).

This is equivalent to multiplying the global weight matrices by the binary vectors, dropping out entire rows:
  $\hat{y} = g\big(x\, \mathrm{diag}(b_1) W_1\big)\, \mathrm{diag}(b_2) W_2$   (2)

Application to regression:
  $L = \frac{1}{2N} \sum_{n=1}^{N} \|y_n - \hat{y}_n\|_2^2 + \lambda \big(\|W_1\|_2^2 + \|W_2\|_2^2\big)$   (3)
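A short sketch of Eq. (2)-(3) (my own; shapes, tanh activation, keep probabilities and the regularisation weight are assumptions): dropout implemented by multiplying the weight matrices with diag(b_l), followed by the L2-regularised regression objective.

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q, K, D, lam = 32, 4, 50, 1, 1e-3
X = rng.normal(size=(N, Q))
Y = rng.normal(size=(N, D))
W1 = rng.normal(size=(Q, K))
W2 = rng.normal(size=(K, D))
g = np.tanh                     # activation function

def dropout_loss(X, Y, W1, W2, p1=0.9, p2=0.9):
    b1 = rng.binomial(1, p1, size=Q)      # drops entire rows of W1
    b2 = rng.binomial(1, p2, size=K)      # drops entire rows of W2
    Y_hat = g(X @ (np.diag(b1) @ W1)) @ (np.diag(b2) @ W2)      # Eq. (2)
    return (np.sum((Y - Y_hat) ** 2) / (2 * N)
            + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2)))        # Eq. (3)

print(dropout_loss(X, Y, W1, W2))
```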

Gaussian Processes

$f$ is the GP function:
  $\underbrace{p(f \mid X, Y)}_{\text{posterior}} \propto \underbrace{p(f)}_{\text{prior}}\; \underbrace{p(Y \mid X, f)}_{\text{likelihood}}$

Application to regression:
  $F \mid X \sim \mathcal{N}\big(0, K(X, X)\big), \qquad Y \mid F \sim \mathcal{N}\big(F, \tau^{-1} I_N\big)$
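For concreteness, a small GP-regression sketch of the model above (mine); the squared-exponential kernel is used purely for illustration, whereas the paper's construction (next slides) uses a kernel defined by the network nonlinearity.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel, an illustrative stand-in for K(X, X)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(2)
tau = 100.0                                    # observation precision
X = np.linspace(-3, 3, 20)[:, None]
Y = np.sin(X) + rng.normal(scale=tau**-0.5, size=X.shape)
Xs = np.linspace(-3, 3, 100)[:, None]          # test inputs

K = rbf(X, X) + np.eye(len(X)) / tau           # K(X, X) + tau^{-1} I_N
mean = rbf(Xs, X) @ np.linalg.solve(K, Y)      # GP posterior predictive mean
print(mean[:5].ravel())
```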

Variational Inference

The predictive distribution:
  $p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w)\, \underbrace{p(w \mid X, Y)}_{\approx\, q(w)}\, dw$   (4)

Objective: $\arg\min_q \mathrm{KL}\big(q(w) \,\|\, p(w \mid X, Y)\big)$

Variational prediction: $q(y^* \mid x^*) = \int p(y^* \mid x^*, w)\, q(w)\, dw$

Log evidence lower bound:
  $\mathcal{L}_{\text{VI}} = \int q(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q(w) \,\|\, p(w)\big)$   (5)

Objective: $\arg\max_q \mathcal{L}_{\text{VI}}$, i.e. a variational distribution $q(w)$ that explains the data well while still being close to the prior.
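A generic sketch (not from the slides) of the variational prediction: the predictive mean under q(w) is estimated by Monte Carlo, averaging model outputs over samples from q. The helpers sample_q and predict and the toy linear model are placeholders of mine.

```python
import numpy as np

def variational_predictive_mean(x_star, sample_q, predict, T=1000):
    """Monte Carlo estimate of the predictive mean under q(w):
    E_q[f_w(x*)] ~= (1/T) * sum_t predict(x*, w_t),  w_t ~ q(w)."""
    return np.mean([predict(x_star, sample_q()) for _ in range(T)], axis=0)

# Toy example: a linear model with a factorised Gaussian q(w) (assumed).
rng = np.random.default_rng(8)
m, s = np.array([1.0, -2.0]), 0.1                  # variational mean and std
y_star = variational_predictive_mean(
    np.array([0.5, 1.5]),
    sample_q=lambda: m + s * rng.normal(size=2),   # w ~ q(w) = N(m, s^2 I)
    predict=lambda x, w: x @ w)                    # f_w(x) = x^T w
print(y_star)
```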

A Single-Layer Neural Network Example

Setup: $Q$: input dimension, $K$: number of hidden units, $D$: output dimension.
Goal: learn $W_1 \in \mathbb{R}^{Q \times K}$ and $W_2 \in \mathbb{R}^{K \times D}$ to map $X \in \mathbb{R}^{N \times Q}$ to $Y \in \mathbb{R}^{N \times D}$.
Idea: introduce $W_1$ and $W_2$ into the GP approximation.

Introduce $W_1$

Define the kernel of the GP:
  $K(x, x') = \int p(w)\, g(w^\top x)\, g(w^\top x')\, dw$   (6)

Resort to Monte Carlo integration, with generative process:
  $w_k \sim p(w), \quad W_1 = [w_k]_{k=1}^{K}, \quad \hat{K}(x, x') = \frac{1}{K} \sum_{k=1}^{K} g(w_k^\top x)\, g(w_k^\top x')$
  $F \mid X, W_1 \sim \mathcal{N}(0, \hat{K}), \qquad Y \mid F \sim \mathcal{N}(F, \tau^{-1} I_N)$   (7)

where $K$ is the number of hidden units.
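A quick sketch of the Monte Carlo estimate in Eq. (6)-(7); the prior p(w) = N(0, I_Q), the tanh nonlinearity and the dimensions are illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(3)
Q, K = 4, 2000                          # input dim, number of hidden units
g = np.tanh

x  = rng.normal(size=Q)
xp = rng.normal(size=Q)
W1 = rng.normal(size=(Q, K))            # columns w_k ~ p(w) = N(0, I_Q)

# (1/K) * sum_k g(w_k^T x) g(w_k^T x'), the Monte Carlo kernel estimate
K_hat = np.mean(g(W1.T @ x) * g(W1.T @ xp))
print(K_hat)
```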

Introduce $W_2$

Analytically integrating with respect to $F$, the predictive distribution
  $p(Y \mid X) = \int p(Y \mid F)\, p(F \mid X, W_1)\, p(W_1)\, dW_1\, dF$   (8)
can be rewritten as
  $p(Y \mid X) = \int \mathcal{N}\big(Y;\, 0,\, \Phi \Phi^\top + \tau^{-1} I_N\big)\, p(W_1)\, dW_1$   (9)
where $\Phi = [\phi_n]_{n=1}^{N}$ and $\phi = \sqrt{\tfrac{1}{K}}\, g(W_1^\top x)$.

For $W_2 = [w_d]_{d=1}^{D}$ with $w_d \sim \mathcal{N}(0, I_K)$:
  $\mathcal{N}\big(y_d;\, 0,\, \Phi \Phi^\top + \tau^{-1} I_N\big) = \int \mathcal{N}\big(y_d;\, \Phi w_d,\, \tau^{-1} I_N\big)\, \mathcal{N}(w_d;\, 0, I_K)\, dw_d$
so that
  $p(Y \mid X) = \int p(Y \mid X, W_1, W_2)\, p(W_1)\, p(W_2)\, dW_1\, dW_2$   (10)

Variational Inference in the Approximate Model

  $q(W_1, W_2) = q(W_1)\, q(W_2)$   (11)

To mimic dropout, $q(W_1)$ is factorised over the input dimensions, each factor being a Gaussian mixture with two components:
  $q(W_1) = \prod_{q=1}^{Q} q(w_q), \qquad q(w_q) = p_1\, \mathcal{N}(m_q, \sigma^2 I_K) + (1 - p_1)\, \mathcal{N}(0, \sigma^2 I_K)$   (12)
with $p_1 \in [0, 1]$ and $m_q \in \mathbb{R}^K$.

Similarly for $q(W_2)$:
  $q(W_2) = \prod_{k=1}^{K} q(w_k), \qquad q(w_k) = p_2\, \mathcal{N}(m_k, \sigma^2 I_D) + (1 - p_2)\, \mathcal{N}(0, \sigma^2 I_D)$   (13)
with $p_2 \in [0, 1]$ and $m_k \in \mathbb{R}^D$.

Optimise over the variational parameters, in particular $M_1 = [m_q]_{q=1}^{Q}$ and $M_2 = [m_k]_{k=1}^{K}$.
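A sketch (with assumed dimensions and hyper-parameters) of drawing W_1 from the mixture in Eq. (12): choosing the component per row with probability p_1 and adding sigma-scaled noise gives diag(b_1) M_1 + sigma * epsilon_1, which anticipates the re-parameterisation used two slides later.

```python
import numpy as np

rng = np.random.default_rng(4)
Q, K, p1, sigma = 4, 50, 0.9, 1e-3
M1 = rng.normal(size=(Q, K))                     # variational means [m_q]

def sample_W1(M1):
    b = rng.binomial(1, p1, size=(Q, 1))         # mixture component per row
    eps = rng.normal(size=(Q, K))
    return b * M1 + sigma * eps                  # row q is m_q + noise, or pure noise

print(sample_W1(M1).shape)
```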

Evidence Lower Bound for Regression

Log evidence lower bound:
  $\mathcal{L}_{\text{GP-VI}} = \underbrace{\int q(W_1, W_2) \log p(Y \mid X, W_1, W_2)\, dW_1\, dW_2}_{\mathcal{L}_1} - \underbrace{\mathrm{KL}\big(q(W_1, W_2) \,\|\, p(W_1, W_2)\big)}_{\mathcal{L}_2}$   (14)

Approximation of $\mathcal{L}_2$: for large enough $K$ the KL divergence term can be approximated as
  $\mathrm{KL}\big(q(W_1) \,\|\, p(W_1)\big) \approx QK\big(\sigma^2 - \log(\sigma^2) - 1\big) + \frac{p_1}{2} \sum_{q=1}^{Q} m_q^\top m_q$   (15)
and similarly for $\mathrm{KL}\big(q(W_2) \,\|\, p(W_2)\big)$, following Proposition 1 on the KL of a mixture of Gaussians in the appendix.
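A direct transcription of approximation (15) as a helper function; the signs and constant factors follow the reconstruction above, so treat it as a sketch rather than an authoritative formula.

```python
import numpy as np

def kl_approx(M, p, sigma):
    """Large-K approximation of KL(q(W) || p(W)) as in Eq. (15)."""
    Q, K = M.shape
    return Q * K * (sigma**2 - np.log(sigma**2) - 1) + 0.5 * p * np.sum(M * M)

M1 = np.zeros((4, 50))                 # illustrative variational means
print(kl_approx(M1, p=0.9, sigma=1e-3))
```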

$\mathcal{L}_1$: Monte Carlo Integration

Re-parameterisation:
  $\mathcal{L}_1 = \int q(b_1, \epsilon_1, b_2, \epsilon_2) \log p\big(Y \mid X, W_1(b_1, \epsilon_1), W_2(b_2, \epsilon_2)\big)\, db_1\, d\epsilon_1\, db_2\, d\epsilon_2$   (16)
  $W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1))\, \sigma \epsilon_1$
  $W_2 = \mathrm{diag}(b_2)(M_2 + \sigma \epsilon_2) + (1 - \mathrm{diag}(b_2))\, \sigma \epsilon_2$   (17)
where
  $\epsilon_1 \sim \mathcal{N}(0, I_{Q \times K}), \quad b_{1,q} \sim \mathrm{Bernoulli}(p_1), \quad \epsilon_2 \sim \mathcal{N}(0, I_{K \times D}), \quad b_{2,k} \sim \mathrm{Bernoulli}(p_2)$   (18)

Take a single sample:
  $\mathcal{L}_{\text{GP-MC}} = \underbrace{\log p(Y \mid X, \hat{W}_1, \hat{W}_2)}_{\mathcal{L}_{1\text{-MC}}} - \mathrm{KL}\big(q(W_1, W_2) \,\|\, p(W_1, W_2)\big)$   (19)

Optimising $\mathcal{L}_{\text{GP-MC}}$ converges to the same limit as $\mathcal{L}_{\text{GP-VI}}$.
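A sketch of drawing a single re-parameterised sample (W1_hat, W2_hat) per Eq. (17)-(18), which is what the single-sample estimate in Eq. (19) plugs into the log-likelihood; dimensions and hyper-parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
Q, K, D, p1, p2, sigma = 4, 50, 1, 0.9, 0.9, 1e-3
M1 = rng.normal(size=(Q, K))
M2 = rng.normal(size=(K, D))

def sample_weights():
    b1 = np.diag(rng.binomial(1, p1, size=Q))        # b_{1,q} ~ Bernoulli(p1)
    b2 = np.diag(rng.binomial(1, p2, size=K))        # b_{2,k} ~ Bernoulli(p2)
    eps1 = rng.normal(size=(Q, K))                   # eps_1 ~ N(0, I_{QxK})
    eps2 = rng.normal(size=(K, D))                   # eps_2 ~ N(0, I_{KxD})
    W1 = b1 @ (M1 + sigma * eps1) + (np.eye(Q) - b1) @ (sigma * eps1)   # Eq. (17)
    W2 = b2 @ (M2 + sigma * eps2) + (np.eye(K) - b2) @ (sigma * eps2)
    return W1, W2

W1_hat, W2_hat = sample_weights()
print(W1_hat.shape, W2_hat.shape)
```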

$\mathcal{L}_1$: Monte Carlo Integration for Regression

  $\mathcal{L}_{1\text{-MC}} = \log p(Y \mid X, \hat{W}_1, \hat{W}_2) = \sum_{d=1}^{D} \log \mathcal{N}\big(y_d;\, \hat{\Phi} \hat{w}_d,\, \tau^{-1} I_N\big)$
  $\qquad = -\frac{ND}{2} \log(2\pi) + \frac{ND}{2} \log(\tau) - \sum_{d=1}^{D} \frac{\tau}{2} \|y_d - \hat{\Phi} \hat{w}_d\|_2^2$   (20)

Sum over the rows instead of the columns of $\hat{Y}$:
  $\sum_{d=1}^{D} \frac{\tau}{2} \|y_d - \hat{\Phi} \hat{w}_d\|_2^2 = \sum_{n=1}^{N} \frac{\tau}{2} \|y_n - \hat{y}_n\|_2^2$   (21)
where $\hat{y}_n = \hat{\phi}_n \hat{W}_2 = \sqrt{\tfrac{1}{K}}\, g(x_n \hat{W}_1)\, \hat{W}_2$.
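A small synthetic check of Eq. (20)-(21) (my own construction, with arbitrary data and weights): the column-wise and row-wise sums of squared errors coincide, so the Gaussian log-likelihood can be written per data point.

```python
import numpy as np

rng = np.random.default_rng(6)
N, Q, K, D, tau = 32, 4, 50, 3, 10.0
X = rng.normal(size=(N, Q))
Y = rng.normal(size=(N, D))
W1_hat = rng.normal(size=(Q, K))
W2_hat = rng.normal(size=(K, D))
g = np.tanh

Phi = np.sqrt(1.0 / K) * g(X @ W1_hat)            # Phi_hat, rows phi_n
Y_hat = Phi @ W2_hat

col_sum = sum(0.5 * tau * np.sum((Y[:, d] - Y_hat[:, d])**2) for d in range(D))
row_sum = sum(0.5 * tau * np.sum((Y[n] - Y_hat[n])**2) for n in range(N))
L1_mc = -0.5 * N * D * np.log(2 * np.pi) + 0.5 * N * D * np.log(tau) - col_sum  # Eq. (20)
assert np.isclose(col_sum, row_sum)                                             # Eq. (21)
print(L1_mc)
```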

Recover SGD Training with Dropout

Take the approximations for $\mathcal{L}_1$ and $\mathcal{L}_2$ for the bound and ignore the constant terms in $\tau$ and $\sigma$:
  $\mathcal{L}_{\text{GP-MC}} \propto -\frac{\tau}{2} \sum_{n=1}^{N} \|y_n - \hat{y}_n\|_2^2 - \frac{p_1}{2} \|M_1\|_2^2 - \frac{p_2}{2} \|M_2\|_2^2$   (22)

Letting $\sigma$ tend to 0 in $W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1))\, \sigma \epsilon_1$ gives
  $\hat{W}_1 \approx \mathrm{diag}(\hat{b}_1) M_1, \qquad \hat{W}_2 \approx \mathrm{diag}(\hat{b}_2) M_2$   (23)
  $\hat{y}_n = \sqrt{\tfrac{1}{K}}\, g(x_n \hat{W}_1)\, \hat{W}_2 = \sqrt{\tfrac{1}{K}}\, g\big(x_n\, \mathrm{diag}(\hat{b}_1) M_1\big)\, \mathrm{diag}(\hat{b}_2) M_2$   (24)

which recovers the dropout regression objective (3), up to constants and scaling.
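The practical upshot, sketched below with an illustrative network of mine: keeping dropout on at test time and averaging T stochastic forward passes gives an approximate predictive mean and a model-uncertainty estimate (MC dropout); the observation noise tau^{-1} would be added on top for the full predictive variance.

```python
import numpy as np

rng = np.random.default_rng(7)
Q, K, D, p1, p2, T = 4, 50, 1, 0.9, 0.9, 200
M1 = rng.normal(size=(Q, K))
M2 = rng.normal(size=(K, D))
g = np.tanh
x = rng.normal(size=(1, Q))

def stochastic_forward(x):
    b1 = rng.binomial(1, p1, size=Q)      # resample dropout masks at TEST time
    b2 = rng.binomial(1, p2, size=K)
    return np.sqrt(1.0 / K) * g((x * b1) @ M1) * b2 @ M2

samples = np.stack([stochastic_forward(x) for _ in range(T)])
print("predictive mean:", samples.mean(axis=0).ravel())
print("model-uncertainty std:", samples.std(axis=0).ravel())
```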

More Applications

Convolutional Neural Networks: Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, arXiv:1506.02158, 2015.
Recurrent Neural Networks: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, arXiv:1512.05287, 2015.
Reinforcement Learning: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, arXiv:1506.02142, 2015.