Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016
Outline Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary
Bayesian inference on latent variable models. y: observed data; x: latent variable; p_θ(x, y): probabilistic model. Purpose: we are (very) interested in inferring the posterior distribution p_θ(x | y); this enables learning the parameters of latent variable models and deep learning. Difficulty: p(x | y) = p(x, y) / p(y) is most often intractable.
Non-variational approximate inference methods. Point estimate of p_θ(x | y) (MAP): fast, but prone to overfitting. Markov chain Monte Carlo (MCMC): asymptotically unbiased, but expensive and slow to assess convergence.
Variational Inference. Introduce a variational distribution q_φ(x) or q_φ(x | y) to approximate the true posterior; φ are the variational parameters. Objective: minimize w.r.t. φ the KL divergence D_KL(q_φ(x | y) ‖ p_θ(x | y)); q_φ(x | y) = p_θ(x | y) achieves zero KL divergence.
Lower Bound. From the marginal log-likelihood to the lower bound:
log p_θ(y) = E_q[log (p_θ(y, x) / q_φ(x | y))] + D_KL(q_φ(x | y) ‖ p_θ(x | y)) ≥ E_q[log p_θ(y, x) − log q_φ(x | y)] ≜ L
Objective: maximize the lower bound L w.r.t. (θ, φ). Non-gradient-based optimization technique: mean-field VB with fixed-point equations; efficient, but intractable or not applicable in many cases.
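To make the bound concrete, here is a minimal numerical sketch (my own toy example, not from the slides): a conjugate Gaussian model in which log p_θ(y) is available in closed form, so a Monte Carlo estimate of L can be compared against it.

```python
# Toy example (assumed, not from the slides): x ~ N(0, 1), y | x ~ N(x, 1),
# so p(y) = N(y; 0, 2) exactly.  L = E_q[log p(y, x) - log q(x|y)] <= log p(y),
# with a gap equal to KL(q(x|y) || p(x|y)).
import numpy as np
from scipy.stats import norm

y = 1.3                                  # a single observation
m, s = 0.4, 0.9                          # variational q(x|y) = N(m, s^2), deliberately off

rng = np.random.default_rng(0)
x = m + s * rng.standard_normal(100_000)                      # samples from q
elbo = np.mean(norm.logpdf(y, x, 1)                           # log p(y|x)
               + norm.logpdf(x, 0, 1)                         # log p(x)
               - norm.logpdf(x, m, s))                        # -log q(x|y)
print(elbo, norm.logpdf(y, 0, np.sqrt(2)))                    # ELBO <= exact log p(y)
```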
Outline Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary
Reparameterized Gradient Estimator. Consider a general form of the lower bound L = E_{q_φ(x|y)}[f(y, x)]. Monte Carlo gradient approximation at iteration t: sample ɛ_t from some base distribution p(ɛ); transform x_t = g_φ(ɛ_t), s.t. x_t ∼ q_φ(x | y); compute ∇_φ f(y, x_t) to approximate ∇_φ L. The reparameterization has to exist, e.g. for the Gaussian, Laplace, Student's t, etc.
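A minimal sketch of one such iteration, assuming PyTorch for the gradient ∇_φ f(y, x_t) and a diagonal Gaussian q_φ(x | y) with φ = (µ, log σ); the integrand f below is a stand-in, not the one from the slides.

```python
import torch

def f(y, x):                       # stand-in integrand f(y, x); in VI this would be
    return -((y - x) ** 2).sum()   # e.g. log p_theta(y, x) - log q_phi(x | y)

y = torch.tensor([0.5, -1.2])
mu = torch.zeros(2, requires_grad=True)         # variational parameters phi = (mu, log_sigma)
log_sigma = torch.zeros(2, requires_grad=True)

eps = torch.randn(2)                            # eps_t ~ p(eps) = N(0, I)
x = mu + log_sigma.exp() * eps                  # x_t = g_phi(eps_t), so x_t ~ q_phi(x | y)
loss = -f(y, x)                                 # maximizing L  <=>  minimizing -f
loss.backward()                                 # mu.grad, log_sigma.grad approximate -grad_phi L
print(mu.grad, log_sigma.grad)
```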
Reparameterization Trick
Gaussian Backpropagation. For x ∼ N(µ, C), we have the following identities:
∇_{µ_i} E_{N(µ,C)}[f(x)] = E_{N(µ,C)}[∇_{x_i} f(x)]
∇_{C_{ij}} E_{N(µ,C)}[f(x)] = (1/2) E_{N(µ,C)}[∇²_{x_i,x_j} f(x)]
∇²_{C_{i,j},C_{k,l}} E_{N(µ,C)}[f(x)] = (1/4) E_{N(µ,C)}[∇⁴_{x_i,x_j,x_k,x_l} f(x)]
∇²_{µ_i,C_{k,l}} E_{N(µ,C)}[f(x)] = (1/2) E_{N(µ,C)}[∇³_{x_i,x_k,x_l} f(x)]
These give unbiased estimators of ∇^k E[f] for k = 1, 2; the corresponding higher-order derivatives need to be calculated w.r.t. f.
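The first two identities can be sanity-checked numerically. A small sketch (my own example) uses a quadratic f(x) = x^⊤ A x + b^⊤ x with symmetric A, for which E_{N(µ,C)}[f] = µ^⊤Aµ + tr(AC) + b^⊤µ, so both sides are known in closed form.

```python
# Monte Carlo check (my own example) of grad_mu E[f] = E[grad_x f] and
# grad_C E[f] = 1/2 E[Hess_x f] for a quadratic f with symmetric A.
import numpy as np

rng = np.random.default_rng(0)
d = 3
A = rng.standard_normal((d, d)); A = (A + A.T) / 2            # symmetric
b = rng.standard_normal(d)
mu = rng.standard_normal(d)
R = np.tril(rng.standard_normal((d, d)), -1) * 0.3 + np.eye(d)
C = R @ R.T                                                   # covariance C = R R^T

x = mu + rng.standard_normal((200_000, d)) @ R.T              # x ~ N(mu, C)
grad_f = x @ (2 * A) + b                                      # grad_x f(x) per sample

print(np.allclose(grad_f.mean(axis=0), 2 * A @ mu + b, atol=0.05))  # first identity (MC)
print(np.allclose(0.5 * (2 * A), A))                                # second identity (Hessian is constant 2A)
```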
Reparameterized Gaussian Backpropagation. x ∼ N(µ, RR^⊤), thus x = µ + Rɛ where ɛ ∼ N(0, I). New identities:
∇_R E_{N(µ,C)}[f(x)] = E_{ɛ∼N(0,I_{d_z})}[ɛ g^⊤]
∇²_{µ,R} E_{N(µ,C)}[f(x)] = E_{ɛ∼N(0,I_{d_z})}[ɛ ⊗ H]
∇²_R E_{N(µ,C)}[f(x)] = E_{ɛ∼N(0,I_{d_z})}[(ɛɛ^⊤) ⊗ H]
where ⊗ is the Kronecker product, and the gradient g and Hessian H are evaluated at µ + Rɛ in terms of f(x). It is still easy to obtain unbiased estimators; Hessian-vector multiplication is cheap due to the fact that (A ⊗ B) vec(V) = vec(B V A^⊤).
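The vec/Kronecker identity used for the Hessian-vector multiplication can be verified directly; a quick NumPy check, where the column-major reshape plays the role of vec:

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, V = (rng.standard_normal((4, 4)) for _ in range(3))

vec = lambda M: M.reshape(-1, order="F")        # column-major vectorization
lhs = np.kron(A, B) @ vec(V)
rhs = vec(B @ V @ A.T)
print(np.allclose(lhs, rhs))                    # True: (A kron B) vec(V) = vec(B V A^T)
```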
Bayesian Logreg. Prior N(0, Λ), where Λ is diagonal. Variational distribution q(β | µ, D), where D is diagonal for simplicity. [Figures: lower bound vs. time (s) and fitted regression coefficients vs. index on the A9a dataset, comparing DSVI, L-BFGS-SGVI, and HFSGVI.]
Outline Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary
Model Formulation. Gaussian latent variable with prior p(x) = N(0, I). Generative model p_θ(y | x) characterizes a non-linear transformation, e.g. an MLP. Recognition model q_φ(x | y) = N(µ, D), where φ = [µ, D] = MLP(y; W, b); denote ψ = (W, b). Objective function: L = E_q[log p_θ(y | x) + log p(x) − log q_φ(x | y)], where log p_θ(y | x) is the reconstruction error and log p(x) − log q_φ(x | y) is the regularization. Unlike variational EM, (θ, ψ) are optimized simultaneously by a gradient-based algorithm.
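A minimal sketch of this formulation in PyTorch (my own implementation; the layer sizes, tanh activations, and Bernoulli output are illustrative choices, not fixed by the slides). The recognition MLP outputs (µ, D), the latent sample is reparameterized, and (θ, ψ) are updated jointly by a single optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_y=784, d_h=200, d_z=20):
        super().__init__()
        self.enc = nn.Linear(d_y, d_h)                 # recognition MLP (W_1, b_1)
        self.enc_mu = nn.Linear(d_h, d_z)              # (W_2, b_2)
        self.enc_logvar = nn.Linear(d_h, d_z)          # (W_3, b_3), D = diag(exp(logvar))
        self.dec = nn.Linear(d_z, d_h)                 # generative MLP (W_4, b_4)
        self.dec_out = nn.Linear(d_h, d_y)             # (W_5, b_5)

    def forward(self, y):
        h_e = torch.tanh(self.enc(y))
        mu, logvar = self.enc_mu(h_e), self.enc_logvar(h_e)
        x = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterized x ~ q(x|y)
        h_d = torch.tanh(self.dec(x))
        logits = self.dec_out(h_d)                     # Bernoulli p_theta(y|x)
        # log p(y|x): reconstruction; log p(x) - log q(x|y): analytic Gaussian KL
        rec = -F.binary_cross_entropy_with_logits(logits, y, reduction="sum")
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
        return -(rec - kl)                             # negative lower bound -L

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)    # (theta, psi) updated jointly
y = torch.rand(64, 784).round()                        # stand-in binary batch
loss = model(y); opt.zero_grad(); loss.backward(); opt.step()
```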
Unrolled VAE. Input (data) → hidden encoder layer h_e via (W_1, b_1) → Gaussian latent layer z ∼ N(µ, Σ), with µ from (W_2, b_2) and Σ from (W_3, b_3) → hidden decoder layer h_d via (W_4, b_4) → output via (W_5, b_5); an additional pair (W_6, b_6) is used alongside (W_5, b_5) if the data are continuous.
Back to Backpropagation. Fast gradient computation:
∇_{ψ_l} E_{N(µ,C)}[f(x)] = E_{ɛ∼N(0,I)}[ g^⊤ ∂(µ + Rɛ)/∂ψ_l ]
∇²_{ψ_{l1},ψ_{l2}} E_{N(µ,C)}[f(x)] = E_{ɛ∼N(0,I)}[ (∂(µ + Rɛ)/∂ψ_{l1})^⊤ H (∂(µ + Rɛ)/∂ψ_{l2}) + g^⊤ ∂²(µ + Rɛ)/(∂ψ_{l1} ∂ψ_{l2}) ]
O(d_z²) algorithmic complexity for both the 1st and 2nd derivatives.
Back to Backpropagation. For any F, H_ψ v = lim_{γ→0} [∇F(ψ + γv) − ∇F(ψ)] / γ, hence
H_ψ v = ∂/∂γ ∇F(ψ + γv) |_{γ=0}
      = ∂/∂γ E_{N(0,I)}[ g^⊤ ∂(µ(ψ) + R(ψ)ɛ)/∂ψ |_{ψ+γv} ] |_{γ=0}
      = E_{N(0,I)}[ ∂/∂γ ( g^⊤ ∂(µ(ψ) + R(ψ)ɛ)/∂ψ |_{ψ+γv} ) |_{γ=0} ]
PCG only requires H_ψ v to solve the linear system Hp = g. For K iterations of PCG, the relative tolerance satisfies e < exp(−2K/√c), where c is the condition number of the matrix; thus c can be nearly as large as O(K²). Complexity for each iteration: O(K d d_z²) vs. O(d d_z²).
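The Hessian-vector product that PCG needs can also be formed by double backpropagation rather than the finite-difference limit above; a short PyTorch sketch (my own illustration, with a toy objective F standing in for the lower bound):

```python
import torch

psi = torch.randn(10, requires_grad=True)            # parameters psi
def F(p):                                             # toy objective F(psi)
    return (p ** 3).sum() + (p[:5] * p[5:]).sum()

v = torch.randn(10)                                   # direction for the HVP
g, = torch.autograd.grad(F(psi), psi, create_graph=True)   # grad F(psi), kept on the graph
hvp, = torch.autograd.grad((g * v).sum(), psi)              # H_psi v, no explicit Hessian
print(hvp)
```

The resulting hvp can be handed to a conjugate-gradient routine to solve Hp = g without ever materializing H.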
Theoretical Perspective. If f is an L-Lipschitz differentiable function and ɛ ∼ N(0, I_{d_z}), then
E[(f(ɛ) − E[f(ɛ)])²] ≤ L²π²/4
P( |(1/M) Σ_{m=1}^M f(ɛ_m) − E[f(ɛ)]| ≥ t ) ≤ 2 exp(−2Mt²/(π²L²))
In most applications, M = 1 is used for the MC integration.
VAE Experiments. Manifold of the generative model obtained by setting d_z = 2.
VAE Experiments. [Figures: lower bound vs. time (s) on the Frey Face and Olivetti Face datasets, comparing Ada, L-BFGS-SGVI, and HFSGVI (train and test curves).]
Outline Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary
Semi-supervised VAE (NIPS 2014). Generative model: p(ℓ) = Cat(ℓ | π), p(x) = N(0, I), p(y | ℓ, x) = MLP, where ℓ denotes the class label. Recognition model: q(x | y, ℓ) = N(µ_φ(y, ℓ), D_φ(y)) and q(ℓ | y) = Cat(ℓ | π_φ(y)); the parameter functions are also MLPs.
Neural Variational Inference (ICML 2014). Sigmoid belief networks. Generative model: h_L → h_{L−1} → ⋯ → h_1 → x (the observation); the recognition model reverses the arrow directions. Learning signal with a control variate for variance reduction, borrowing the idea of baselines from RL:
∇_φ L = E_q[(log p_θ(x, z) − log q_φ(z | x)) ∇_φ log q_φ(z | x)]
      = E_q[(log p_θ(x, z) − log q_φ(z | x) − C_ξ(x)) ∇_φ log q_φ(z | x)]
(θ, φ, ξ) are learned jointly.
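A minimal sketch of this score-function estimator for a single Bernoulli latent layer, assuming PyTorch; the learned baseline C_ξ(x) is replaced by a constant for brevity, and the tiny model below is purely illustrative.

```python
import torch

x = torch.tensor([1.0, 0.0, 1.0])                    # observed binary vector
phi = torch.zeros(3, requires_grad=True)             # q_phi(z|x) = Bernoulli(sigmoid(phi))
theta = torch.zeros(3, requires_grad=True)           # p_theta(x|z), with p(z) = Bernoulli(0.5)

q = torch.distributions.Bernoulli(logits=phi)
z = q.sample()                                        # discrete, non-reparameterizable sample
log_p_xz = torch.distributions.Bernoulli(logits=theta * z).log_prob(x).sum() \
         + torch.distributions.Bernoulli(probs=0.5 * torch.ones(3)).log_prob(z).sum()
learning_signal = (log_p_xz - q.log_prob(z).sum()).detach()
baseline = 0.0                                        # stands in for the learned C_xi(x)

# surrogate whose gradient is (learning_signal - baseline) * grad_phi log q_phi(z|x)
surrogate = (learning_signal - baseline) * q.log_prob(z).sum()
surrogate.backward()
print(phi.grad)
```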
Dynamic Modeling. DRAW (dynamic VAE with LSTM, ICML 2015; reviewed before). DSBN (NIPS 2015): its generative model is similar to an HMM.
Dynamic Modeling, ctd. hgcHMM (my model). [Figure panels: (g) NVI, (h) Doctor, (i) GibbsEM.]
Bayesian Dark Knowledge (NIPS 2015). Teacher model: deep neural network T(y | x, θ) with prior p(θ | λ). Student model: deep neural network S(y | x, ω) with prior p(ω | γ). Two-step training (or "distilled SGLD", the term used in the paper). For a mini-batch (X, Y) of size B:
SGLD update of θ:
θ_{t+1} = θ_t + (η_t/2) ( ∇_θ log p(θ | λ) + (N/B) Σ_{x_i∈X} ∇_θ log p(y_i | x_i, θ) ) + N(0, η_t)
SGD update of ω using (noisy) inputs X only; ỹ_i is obtained by feeding x_i into the current teacher model:
ω_{t+1} = ω_t − ρ_t ( −(1/B) Σ_{x_i∈X} ∇_ω log p(ỹ_i | x_i, ω) + γ ω_t )
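A schematic of the two updates (my own PyTorch paraphrase, not the authors' code; log_lik, teacher_predict, and student_log_lik are hypothetical placeholders for the teacher/student log-likelihoods and the teacher's predictive pass):

```python
import torch

def sgld_step(theta, X, Y, log_lik, eta_t, lam, N):
    """One SGLD update of the teacher parameters theta on a mini-batch of size B."""
    B = X.shape[0]
    log_prior = -0.5 * lam * (theta ** 2).sum()            # log p(theta | lambda), Gaussian prior
    log_post = log_prior + (N / B) * log_lik(theta, X, Y)  # stochastic log-posterior estimate
    g, = torch.autograd.grad(log_post, theta)
    with torch.no_grad():
        new_theta = theta + 0.5 * eta_t * g + torch.randn_like(theta) * eta_t ** 0.5
    return new_theta.requires_grad_(True)

def distill_step(omega, X_prime, teacher_predict, student_log_lik, rho_t, gamma):
    """One SGD update of the student omega towards the current teacher's predictions."""
    B = X_prime.shape[0]
    y_tilde = teacher_predict(X_prime)                     # feed x_i to the current teacher
    loss = -(1.0 / B) * student_log_lik(omega, X_prime, y_tilde) + 0.5 * gamma * (omega ** 2).sum()
    g, = torch.autograd.grad(loss, omega)
    with torch.no_grad():
        new_omega = omega - rho_t * g
    return new_omega.requires_grad_(True)
```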
Outline Preliminaries Stochastic Backpropagation Variational Auto-Encoding Related Work Summary
Summary. Variational inference framework: minimize the divergence between the generative model and the recognition model, i.e. maximize the lower bound.