Bayesian Dropout. Tue Herlau, Morten Mørup and Mikkel N. Schmidt. Discussed by: Yizhe Zhang, Feb 20, 2016



Outline
1. Introduction
2. Model
3. Inference
4. Experiments

Dropout
Training stage: a unit is present with probability p.
Testing stage: the unit is always present and its weights are multiplied by p.
Training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing.
At test time, a single neural network approximates averaging the outputs of these thinned networks.
Dropout acts as regularization [S. Wager 2013]; dropout on units can be understood as dropout on weights.
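To make the train/test asymmetry concrete, here is a minimal sketch (my own illustration, not from the paper) of unit dropout in a single linear layer; the layer sizes and p = 0.5 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W, p, train=True):
    """Linear layer with unit dropout.

    Training: each input unit is kept with probability p (a fresh Bernoulli mask per pass).
    Testing:  all units are kept and activations are scaled by p, approximating the
              average over the 2^K thinned networks."""
    if train:
        mask = rng.binomial(1, p, size=x.shape)  # keep each unit with probability p
        return (mask * x) @ W
    return (p * x) @ W                           # deterministic scaling at test time

# Toy usage with arbitrary sizes.
x = rng.normal(size=(4, 8))   # 4 examples, 8 units
W = rng.normal(size=(8, 3))
y_train = dropout_forward(x, W, p=0.5, train=True)
y_test = dropout_forward(x, W, p=0.5, train=False)
```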

Bayesian dropout
Consider training a regression model by optimizing the quadratic loss:

$E(\theta) = \frac{1}{2}\sum_i \big(y_i - f_\theta(X_i)\big)^2$.  (1)

The authors claim that, for dropout, the quadratic loss becomes

$\tilde E(\theta) = \frac{1}{2}\sum_i \big(y_i - f_{\tilde\theta_i}(X_i)\big)^2$,  (2)

where $\tilde\theta_i \sim p(\tilde\theta \mid \theta)$ and $p(\tilde\theta \mid \theta)$ is the dropout distribution. In the Bayesian setup, if one imposes a prior on $\theta$, the generative process can be written as

$\theta \sim p(\theta), \quad y_i \mid \theta \sim \mathcal N\big(f_\theta(X_i), \sigma^2\big)$.  (3)

Adding a dropout step, the above generative model becomes

$\theta \sim p(\theta), \quad \tilde\theta \sim p(\tilde\theta \mid \theta), \quad y_i \mid \tilde\theta \sim \mathcal N\big(f_{\tilde\theta}(X_i), \sigma^2\big)$.  (4)
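As a hedged sketch of the perturbed loss (2), the snippet below evaluates one stochastic realisation for a linear model $f_\theta(X_i) = X_i^{\mathsf T}\theta$ with Bernoulli dropout on the weights and one mask per example; the toy data and keep-probability are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_quadratic_loss(theta, X, y, keep_prob):
    """One realisation of (2): each example i gets its own perturbed parameters
    theta_tilde_i ~ p(theta_tilde | theta), here an elementwise Bernoulli(keep_prob) mask."""
    n, d = X.shape
    masks = rng.binomial(1, keep_prob, size=(n, d))
    preds = np.einsum("ij,ij->i", X, masks * theta)  # f_{theta_tilde_i}(X_i) = X_i^T (m_i * theta)
    return 0.5 * np.sum((y - preds) ** 2)

# Toy data (illustrative only).
X = rng.normal(size=(50, 5))
theta_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)
print(dropout_quadratic_loss(theta_true, X, y, keep_prob=0.8))
```

Averaging this quantity over many sampled masks approximates $\tilde E(\theta)$.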

Bayesian dropout
Dropout does not favor or learn any particular perturbation of the parameters, whereas a standard Bayesian formulation of dropout would. The authors claim that such a Bayesian specification allows the dropout distribution (i.e., $p(\tilde\theta \mid x, \theta)$) to adapt to the data. This is not desirable, because it reduces the effective number of perturbed models. Instead, they use a proposal distribution¹ $p(x, \tilde\theta, \theta)$ to approximate a target joint distribution² $q(x, \tilde\theta, \theta)$. From (4),

$q(x, \tilde\theta, \theta) = q(x \mid \tilde\theta)\, q(\tilde\theta \mid \theta)\, q(\theta)$.  (5)

For the proposal distribution, they enforce that the dropout distribution does not depend on the data:

$p(\tilde\theta \mid x, \theta) = q(\tilde\theta \mid \theta)$  (Bayesian dropout).  (6)

This specification implies $p(x, \tilde\theta, \theta) = q(\tilde\theta \mid \theta)\, p(x, \theta)$.

¹ Called the posterior in the paper.
² Called the prior in the paper.

Bayesian dropout
Minimize the KL divergence between $p(x, \tilde\theta, \theta)$ and $q(x, \tilde\theta, \theta)$:

$S[p, q] = \int dx\, d\tilde\theta\, d\theta\; p(x, \tilde\theta, \theta) \log \dfrac{p(x, \tilde\theta, \theta)}{q(x, \tilde\theta, \theta)}$,  (7)

under the constraint that

$\int d\theta\, p(x, \theta) = \delta(x - x')$,  (8)

and that $p(x, \tilde\theta, \theta)$ is a valid density. Solve this using functional derivatives (with Lagrange multipliers):
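A brief sketch of the functional-derivative step (my reconstruction; the intermediate notation is not the paper's): with the restriction $p(x, \tilde\theta, \theta) = q(\tilde\theta \mid \theta)\, p(x, \theta)$ from (6), the constrained minimization reduces to a Lagrangian in $p(x, \theta)$.

```latex
% Lagrangian: KL objective plus a multiplier \lambda(x) enforcing constraint (8).
\mathcal L[p] =
  \int dx\, d\theta\; p(x,\theta)
  \Big[\log\frac{p(x,\theta)}{q(\theta)}
       - \int d\tilde\theta\, q(\tilde\theta\mid\theta)\,\log q(x\mid\tilde\theta)\Big]
  + \int dx\, \lambda(x)\Big(\int d\theta\, p(x,\theta) - \delta(x-x')\Big).

% Stationarity, \delta\mathcal L / \delta p(x,\theta) = 0:
\log\frac{p(x,\theta)}{q(\theta)} + 1
  - \int d\tilde\theta\, q(\tilde\theta\mid\theta)\,\log q(x\mid\tilde\theta)
  + \lambda(x) = 0,

% so, absorbing the multiplier into the normalizer Z(x) and conditioning on the observed x,
p(\theta\mid x) \propto q(\theta)\,
  \exp\!\Big[\int d\tilde\theta\, q(\tilde\theta\mid\theta)\,\log q(x\mid\tilde\theta)\Big],
% which is (9) once q(x\mid\tilde\theta) factorizes as \prod_{i=1}^N q(x_i\mid h_i(\tilde\theta)).
```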

Approximate inference
Omitting details, solving the functional derivative gives ($Z(x)$ is the normalization constant)

$p(\theta \mid x) = \dfrac{1}{Z(x)}\, q(\theta)\, \exp\!\Big[\int d\tilde\theta\, q(\tilde\theta \mid \theta) \log \prod_{i=1}^{N} q\big(x_i \mid h_i(\tilde\theta)\big)\Big]$.  (9)

Analytically computing the expectation $\int d\tilde\theta\, q(\tilde\theta \mid \theta) \log q(x_i \mid h_i(\tilde\theta))$ is hard. There are two strategies for approximately computing this expectation:

Gaussian approximation (which they call the analytical approximation):
$\int d\tilde\theta\, q(\tilde\theta \mid \theta) \log q\big(x_i \mid h_i(\tilde\theta)\big) \approx \mathbb{E}_{\mathcal N(h;\, \mu_\theta, \sigma^2_\theta)}\big[\log q(x_i \mid h)\big]$.  (10)

Variational Bayes, using an unnormalized exponential-family proposal distribution $r_\eta(\theta) = \exp[T(\theta)\eta]\, \nu(\theta)$ to approximate $p(\theta \mid x)$³:
$\hat\eta = \operatorname{argmin}_\eta\, \mathrm{KL}\big(r_\eta(\theta)\, \|\, p(\theta \mid x)\big)$.  (11)

³ They originally write this as $\hat\eta = \operatorname{argmin}_\eta\, \mathrm{KL}\big(r_\eta(\theta)\, \|\, p(x, \theta)\big)$.
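A minimal sketch (mine, with assumed specifics) of the inner expectation for one data point, taking a linear predictor $h_i(\tilde\theta) = X_i^{\mathsf T}(m \odot \theta)$ with Bernoulli dropout and a Gaussian likelihood: the exact expectation is estimated by Monte Carlo, and the Gaussian approximation (10) replaces $h_i(\tilde\theta)$ by a normal with matched mean and variance, for which the expected log-likelihood has a closed form.

```python
import numpy as np

rng = np.random.default_rng(2)

def expected_loglik_mc(x_i, y_i, theta, keep_prob, sigma2, n_samples=5000):
    """Monte Carlo estimate of E_{q(theta_tilde | theta)}[log N(y_i; h_i(theta_tilde), sigma2)]."""
    masks = rng.binomial(1, keep_prob, size=(n_samples, theta.size))
    h = (masks * theta) @ x_i
    return np.mean(-0.5 * np.log(2 * np.pi * sigma2) - (y_i - h) ** 2 / (2 * sigma2))

def expected_loglik_gauss(x_i, y_i, theta, keep_prob, sigma2):
    """Gaussian ('analytical') approximation (10): h ~ N(mu_theta, s2_theta) with
    moments matched to the Bernoulli dropout distribution."""
    mu = keep_prob * (x_i @ theta)
    s2 = keep_prob * (1 - keep_prob) * np.sum((x_i * theta) ** 2)
    return -0.5 * np.log(2 * np.pi * sigma2) - ((y_i - mu) ** 2 + s2) / (2 * sigma2)

# Toy comparison on a single data point.
x_i = rng.normal(size=5)
theta = rng.normal(size=5)
print(expected_loglik_mc(x_i, 0.3, theta, keep_prob=0.8, sigma2=1.0))
print(expected_loglik_gauss(x_i, 0.3, theta, keep_prob=0.8, sigma2=1.0))
```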

Approximate inference
The parameters of the proposal distribution are obtained by minimizing (11):

$\hat\eta = \Big[\int d\nu(\theta)\, r_\eta(\theta)\, T(\theta)^{\mathsf T} T(\theta)\Big]^{-1} \Big[\int d\nu(\theta)\, r_\eta(\theta)\, T(\theta)^{\mathsf T} \log p(\theta \mid x)\Big]$,  (12)

where the integrals are computed by Monte Carlo integration:

$\hat\eta = \hat C^{-1} \hat g$,  (13)

$\hat g = \frac{1}{S} \sum_{s=1}^{S} q(\tilde\theta_s \mid \theta_s)\, T(\theta_s)^{\mathsf T} \log\big[q(x \mid \tilde\theta_s)\, q(\theta_s)\big]$,  (14)

$\hat C = \frac{1}{S} \sum_{s=1}^{S} T(\theta_s)^{\mathsf T} T(\theta_s)$.  (15)
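As a structural illustration (not the paper's code) of the linear system (13)–(15), the sketch below projects a toy unnormalized log-target, standing in for $\log[q(x \mid \tilde\theta)\, q(\theta)]$, onto exponential-family sufficient statistics $T(\theta) = [1, \theta, \theta^2]$ by Monte Carlo; the fixed sampling distribution is a stand-in for draws tied to $r_\eta$, and the dropout weight $q(\tilde\theta_s \mid \theta_s)$ in (14) is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(3)

def T(theta):
    """Sufficient statistics of the exponential-family proposal; the constant
    absorbs the unknown normalizer of the target."""
    return np.stack([np.ones_like(theta), theta, theta ** 2], axis=-1)

def log_target(theta):
    """Toy unnormalized log-target, an unnormalized N(2, 0.5^2); its coefficients
    in the basis [1, theta, theta^2] are (-8, 8, -2), so eta_hat should recover them."""
    return -2.0 * (theta - 2.0) ** 2

S = 20_000
theta_s = rng.normal(0.0, 3.0, size=S)     # stand-in sampling distribution
Ts = T(theta_s)                            # shape (S, 3)

g_hat = Ts.T @ log_target(theta_s) / S     # cf. (14), without the dropout weight
C_hat = Ts.T @ Ts / S                      # cf. (15)
eta_hat = np.linalg.solve(C_hat, g_hat)    # cf. (13): eta_hat = C^{-1} g
print(eta_hat)                             # approximately [-8, 8, -2]
```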


Bayesian linear regression
Model setup. The inferred distribution of the model parameters is given by (equation omitted in the transcription), where f denotes the binary dropout probability (f = 0: no dropout).

Simulated data (Bayesian linear regression)
A higher dropout proportion leads to heavier shrinkage. The results are comparable with ridge regression.
Figure: Left: comparison of shrinkage. Right: performance of ridge regression and dropout.

Real-world data (Bayesian linear regression)
On the Amazon review dataset, they claim their method achieves lower prediction error.
Figure: Comparison between ridge regression and Bayesian dropout (lower error is better).

Bayesian logistic regression
J: the number of Gaussian-noise junk features added to the data.
Figure: Comparison between the Laplace approximation and variational Bayes on the Bayesian logistic regression model.
