Dropout as a Bayesian Approximation: Insights and Applications
Yarin Gal and Zoubin Ghahramani
Discussion by: Chunyuan Li
Jan. 15, 2016
Main idea
In the framework of variational inference, the authors show that standard SGD training of a neural network with dropout is essentially optimising a stochastic (Monte Carlo) approximation of the variational lower bound of a Gaussian process whose kernel takes the form of the neural network.
Outline
Preliminaries
- Dropout
- Gaussian Processes
- Variational Inference
Dropout
Procedure
- Training stage: a unit is present with probability p
- Testing stage: the unit is always present and its weights are multiplied by p
Intuition
- Training a neural network with dropout can be seen as training a collection of 2^K thinned networks with extensive weight sharing
- At test time, a single neural network approximates averaging the outputs of all thinned networks (a minimal sketch follows below)
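A minimal sketch of the two stages, assuming hypothetical NumPy helpers not taken from the slides: units are kept with probability p during training, and activations are rescaled by p at test time.

```python
import numpy as np

def dropout_train(h, p, rng):
    # Training stage: each unit is present with probability p.
    mask = rng.binomial(1, p, size=h.shape)
    return h * mask

def dropout_test(h, p):
    # Testing stage: every unit is present; scaling by p matches the
    # expected activation seen during training.
    return h * p
```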
Dropout for One-Hidden-Layer Neural Networks
Dropout on local units:
    \hat{y} = ( g((x \circ b_1) W_1) \circ b_2 ) W_2    (1)
- Input x and output y; g: activation function
- Weights W_l \in \mathbb{R}^{K_{l-1} \times K_l}
- b_l are binary dropout variables
Equivalent to multiplying the global weight matrices by the binary vectors, dropping out entire rows:
    \hat{y} = g(x (\mathrm{diag}(b_1) W_1)) (\mathrm{diag}(b_2) W_2)    (2)
Application to regression:
    L = \frac{1}{2N} \sum_{n=1}^{N} \| y_n - \hat{y}_n \|_2^2 + \lambda ( \|W_1\|^2 + \|W_2\|^2 )    (3)
The bias terms are ignored for simplicity.
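A minimal sketch of Eqs. (2)-(3), assuming NumPy and an illustrative tanh nonlinearity (names and defaults are my own, not from the slides):

```python
import numpy as np

def forward_dropout(x, W1, W2, p1, p2, rng, g=np.tanh):
    # Eq. (2): drop entire rows of the weight matrices via binary vectors.
    b1 = rng.binomial(1, p1, size=W1.shape[0])
    b2 = rng.binomial(1, p2, size=W2.shape[0])
    h = g(x @ (np.diag(b1) @ W1))
    return h @ (np.diag(b2) @ W2)

def dropout_loss(Y, Y_hat, W1, W2, lam):
    # Eq. (3): squared error plus weight decay (bias terms ignored).
    N = Y.shape[0]
    data_term = np.sum((Y - Y_hat) ** 2) / (2 * N)
    return data_term + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
```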
Gaussian Processes
f is the GP function:
    \underbrace{p(f \mid X, Y)}_{\text{posterior}} \propto \underbrace{p(f)}_{\text{prior}} \, \underbrace{p(Y \mid X, f)}_{\text{likelihood}}
Application to regression:
    F \mid X \sim N(0, K(X, X)),    Y \mid F \sim N(F, \tau^{-1} I_N)
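A sketch of the GP regression generative process above, assuming a standard squared-exponential kernel purely as a placeholder (the later slides replace it with the neural-network kernel):

```python
import numpy as np

def rbf_kernel(X, X2, lengthscale=1.0):
    # Placeholder kernel K(X, X); any valid covariance function works here.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X @ X2.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def sample_gp_regression(X, tau, rng):
    # F | X ~ N(0, K(X, X));  Y | F ~ N(F, tau^{-1} I_N)
    N = X.shape[0]
    K = rbf_kernel(X, X) + 1e-8 * np.eye(N)   # jitter for numerical stability
    F = rng.multivariate_normal(np.zeros(N), K)
    Y = F + rng.normal(0.0, tau ** -0.5, size=N)
    return F, Y
```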
Variational Inference
The predictive distribution:
    p(y^* \mid x^*, X, Y) = \int p(y^* \mid x^*, w) \, \underbrace{p(w \mid X, Y)}_{\approx \, q(w)} \, dw    (4)
Objective: \arg\min_q \, KL( q(w) \,\|\, p(w \mid X, Y) )
Variational prediction: q(y^* \mid x^*) = \int p(y^* \mid x^*, w) \, q(w) \, dw
Log evidence lower bound:
    L_{VI} = \int q(w) \log p(Y \mid X, w) \, dw - KL( q(w) \,\|\, p(w) )    (5)
Objective: \arg\max_q L_{VI}
A variational distribution q(w) that explains the data well while still being close to the prior.
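A generic one-liner for estimating Eq. (5) by Monte Carlo; `sample_q`, `log_lik`, and `kl_q_p` are placeholder callables of my own, standing in for whatever variational family is chosen:

```python
import numpy as np

def elbo_mc(sample_q, log_lik, kl_q_p, S=10):
    # L_VI ~= (1/S) * sum_s log p(Y | X, w_s) - KL(q || p),  w_s ~ q(w)
    ws = [sample_q() for _ in range(S)]
    expected_ll = np.mean([log_lik(w) for w in ws])
    return expected_ll - kl_q_p()
```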
A Single-Layer Neural Network Example
Setup: Q: input dimension, K: number of hidden units, D: output dimension
Goal: learn W_1 \in \mathbb{R}^{Q \times K} and W_2 \in \mathbb{R}^{K \times D} to map X \in \mathbb{R}^{N \times Q} to Y \in \mathbb{R}^{N \times D}
Idea: introduce W_1 and W_2 into the GP approximation
Introduce W_1
Define the kernel of the GP:
    K(x, x') = \int p(w) \, g(w^T x) \, g(w^T x') \, dw    (6)
Resort to Monte Carlo integration, with the generative process:
    w_k \sim p(w),    W_1 = [w_k]_{k=1}^{K},    \hat{K}(x, x') = \frac{1}{K} \sum_{k=1}^{K} g(w_k^T x) \, g(w_k^T x')
    F \mid X, W_1 \sim N(0, \hat{K}),    Y \mid F \sim N(F, \tau^{-1} I_N)    (7)
where K is the number of hidden units.
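A sketch of the Monte Carlo kernel in Eq. (7), assuming an illustrative standard-normal p(w) and tanh nonlinearity (both are my own choices, not fixed by the slides):

```python
import numpy as np

def mc_kernel(X, X2, W1, g=np.tanh):
    # K_hat(x, x') = (1/K) * sum_k g(w_k^T x) g(w_k^T x'):
    # the empirical covariance induced by K random features.
    K = W1.shape[1]
    Phi, Phi2 = g(X @ W1), g(X2 @ W1)
    return (Phi @ Phi2.T) / K

rng = np.random.default_rng(0)
Q, K = 3, 2000
W1 = rng.standard_normal((Q, K))   # columns w_k ~ N(0, I_Q), an example p(w)
X = rng.standard_normal((5, Q))
print(mc_kernel(X, X, W1).shape)   # (5, 5); approaches the exact kernel (6) as K grows
```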
Introduce W_2
Analytically integrating with respect to F, the predictive distribution
    p(Y \mid X) = \int p(Y \mid F) \, p(F \mid X, W_1) \, p(W_1) \, dW_1 \, dF    (8)
can be rewritten as
    p(Y \mid X) = \int N(Y; 0, \hat{\Phi} \hat{\Phi}^T + \tau^{-1} I_N) \, p(W_1) \, dW_1    (9)
where \hat{\Phi} = [\hat{\phi}_n]_{n=1}^{N}, \hat{\phi} = \sqrt{1/K} \, g(W_1^T x).
For W_2 = [w_d]_{d=1}^{D} with w_d \sim N(0, I_K),
    N(y_d; 0, \hat{\Phi} \hat{\Phi}^T + \tau^{-1} I_N) = \int N(y_d; \hat{\Phi} w_d, \tau^{-1} I_N) \, N(w_d; 0, I_K) \, dw_d
so that
    p(Y \mid X) = \int p(Y \mid X, W_1, W_2) \, p(W_1) \, p(W_2) \, dW_1 \, dW_2    (10)
Variational Inference in the Approximate Model
    q(W_1, W_2) = q(W_1) \, q(W_2)    (11)
To mimic dropout, q(W_1) is factorised over the input dimensions, each factor a two-component Gaussian mixture:
    q(W_1) = \prod_{q=1}^{Q} q(w_q),    q(w_q) = p_1 N(m_q, \sigma^2 I_K) + (1 - p_1) N(0, \sigma^2 I_K)    (12)
with p_1 \in [0, 1], m_q \in \mathbb{R}^K.
Similarly for q(W_2):
    q(W_2) = \prod_{k=1}^{K} q(w_k),    q(w_k) = p_2 N(m_k, \sigma^2 I_D) + (1 - p_2) N(0, \sigma^2 I_D)    (13)
with p_2 \in [0, 1], m_k \in \mathbb{R}^D.
Optimise over the parameters, especially M_1 = [m_q]_{q=1}^{Q} and M_2 = [m_k]_{k=1}^{K}.
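A sketch of drawing one sample of W_1 from the mixture in Eq. (12), assuming NumPy (the helper name is hypothetical):

```python
import numpy as np

def sample_W1(M1, sigma, p1, rng):
    # Each row w_q is drawn from p1 * N(m_q, sigma^2 I) + (1 - p1) * N(0, sigma^2 I):
    # with probability p1 the row is centred at m_q, otherwise at 0.
    Q, K = M1.shape
    b = rng.binomial(1, p1, size=(Q, 1))      # mixture component per row
    eps = rng.standard_normal((Q, K))
    return b * M1 + sigma * eps
```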
L_{GP-VI} for Regression
Log evidence lower bound:
    L_{GP-VI} = \underbrace{\int q(W_1, W_2) \log p(Y \mid X, W_1, W_2) \, dW_1 \, dW_2}_{L_1} - \underbrace{KL( q(W_1, W_2) \,\|\, p(W_1, W_2) )}_{L_2}    (14)
Approximation of L_2: for large enough K we can approximate the KL divergence term as
    KL( q(W_1) \,\|\, p(W_1) ) \approx QK (\sigma^2 - \log \sigma^2 - 1) + \frac{p_1}{2} \sum_{q=1}^{Q} m_q^T m_q    (15)
and similarly for KL( q(W_2) \,\|\, p(W_2) ), following Proposition 1 on the KL of a mixture of Gaussians in the appendix.
L_1: Monte Carlo Integration
Reparameterisation:
    L_1 = \int q(b_1, \epsilon_1, b_2, \epsilon_2) \log p(Y \mid X, W_1(b_1, \epsilon_1), W_2(b_2, \epsilon_2)) \, db_1 \, d\epsilon_1 \, db_2 \, d\epsilon_2    (16)
    W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1)) \sigma \epsilon_1
    W_2 = \mathrm{diag}(b_2)(M_2 + \sigma \epsilon_2) + (1 - \mathrm{diag}(b_2)) \sigma \epsilon_2    (17)
where
    \epsilon_1 \sim N(0, I_{Q \times K}),    b_{1,q} \sim \mathrm{Bernoulli}(p_1),
    \epsilon_2 \sim N(0, I_{K \times D}),    b_{2,k} \sim \mathrm{Bernoulli}(p_2)    (18)
Take a single sample:
    L_{GP-MC} = \underbrace{\log p(Y \mid X, \hat{W}_1, \hat{W}_2)}_{L_{1-MC}} - KL( q(W_1, W_2) \,\|\, p(W_1, W_2) )    (19)
Optimising L_{GP-MC} converges to the same limit as L_{GP-VI}.
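A sketch of the single-sample estimate in Eqs. (17)-(19); `log_lik` and `kl_term` are placeholders for the Gaussian log-likelihood (20) and the KL approximation (15):

```python
import numpy as np

def sample_weights(M, sigma, p, rng):
    # Eq. (17): W = diag(b)(M + sigma*eps) + (I - diag(b)) * sigma*eps
    #             = diag(b) M + sigma*eps
    b = rng.binomial(1, p, size=(M.shape[0], 1))
    eps = rng.standard_normal(M.shape)
    return b * M + sigma * eps

def L_gp_mc(log_lik, kl_term, M1, M2, sigma, p1, p2, rng):
    # Eq. (19): one-sample Monte Carlo estimate of the lower bound.
    W1_hat = sample_weights(M1, sigma, p1, rng)
    W2_hat = sample_weights(M2, sigma, p2, rng)
    return log_lik(W1_hat, W2_hat) - kl_term(M1, M2)
```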
L_{1-MC} for Regression
    L_{1-MC} = \log p(Y \mid X, \hat{W}_1, \hat{W}_2) = \sum_{d=1}^{D} \log N(y_d; \hat{\Phi} \hat{w}_d, \tau^{-1} I_N)
             = -\frac{ND}{2} \log(2\pi) + \frac{ND}{2} \log \tau - \sum_{d=1}^{D} \frac{\tau}{2} \| y_d - \hat{\Phi} \hat{w}_d \|_2^2    (20)
Sum over the rows instead of the columns of \hat{Y}:
    \sum_{d=1}^{D} \frac{\tau}{2} \| y_d - \hat{\Phi} \hat{w}_d \|_2^2 = \sum_{n=1}^{N} \frac{\tau}{2} \| y_n - \hat{y}_n \|_2^2    (21)
where \hat{y}_n = \hat{\phi}_n \hat{W}_2 = \sqrt{1/K} \, g(x_n \hat{W}_1) \hat{W}_2.
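A quick numerical check of the row/column identity in Eq. (21), with dimensions chosen arbitrarily for illustration:

```python
import numpy as np

# Summing squared residuals over the D columns of Y - Phi_hat W2_hat
# equals summing over its N rows (both give the squared Frobenius norm).
rng = np.random.default_rng(0)
N, K, D, tau = 8, 4, 3, 2.0
Phi_hat = rng.standard_normal((N, K))
W2_hat = rng.standard_normal((K, D))
Y = rng.standard_normal((N, D))
R = Y - Phi_hat @ W2_hat
col_sum = sum(tau / 2 * np.sum(R[:, d] ** 2) for d in range(D))
row_sum = sum(tau / 2 * np.sum(R[n, :] ** 2) for n in range(N))
assert np.isclose(col_sum, row_sum)
```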
Recover SGD Training with Dropout
Take the approximations for L_1 and L_2 in the bound, and ignore constant terms in \tau and \sigma:
    L_{GP-MC} \propto -\frac{\tau}{2} \sum_{n=1}^{N} \| y_n - \hat{y}_n \|_2^2 - \frac{p_1}{2} \| M_1 \|_2^2 - \frac{p_2}{2} \| M_2 \|_2^2    (22)
Letting \sigma tend to 0 in
    W_1 = \mathrm{diag}(b_1)(M_1 + \sigma \epsilon_1) + (1 - \mathrm{diag}(b_1)) \sigma \epsilon_1
gives
    \hat{W}_1 \approx \mathrm{diag}(\hat{b}_1) M_1,    \hat{W}_2 \approx \mathrm{diag}(\hat{b}_2) M_2    (23)
    \hat{y}_n = \sqrt{1/K} \, g(x_n \hat{W}_1) \hat{W}_2 \approx \sqrt{1/K} \, g(x_n (\mathrm{diag}(\hat{b}_1) M_1)) (\mathrm{diag}(\hat{b}_2) M_2)    (24)
Maximising (22) is therefore equivalent, up to scaling, to minimising the dropout objective (3), with the weight-decay constant determined by p_1, p_2, and \tau.
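A sketch of the recovered objective (22) under \sigma \to 0, assuming NumPy and a tanh nonlinearity; the exact constants relating \lambda, \tau, and N are omitted, and the function name is my own:

```python
import numpy as np

def gp_mc_objective(X, Y, M1, M2, p1, p2, tau, rng, g=np.tanh):
    # Eq. (22) with sigma -> 0: a single dropout sample of the means M1, M2.
    b1 = rng.binomial(1, p1, size=M1.shape[0])
    b2 = rng.binomial(1, p2, size=M2.shape[0])
    K = M1.shape[1]
    Y_hat = np.sqrt(1.0 / K) * g(X @ (np.diag(b1) @ M1)) @ (np.diag(b2) @ M2)
    data_term = -tau / 2 * np.sum((Y - Y_hat) ** 2)
    reg_term = -p1 / 2 * np.sum(M1 ** 2) - p2 / 2 * np.sum(M2 ** 2)
    # Maximising this is, up to scaling, minimising the dropout loss (3).
    return data_term + reg_term
```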
More Applications
Convolutional Neural Networks
- Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, arXiv:1506.02158, 2015
Recurrent Neural Networks
- A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, arXiv:1512.05287, 2015
Reinforcement Learning
- Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, arXiv:1506.02142, 2015