Variational Inference via Stochastic Backpropagation

1 Variational Inference via Stochastic Backpropagation Kai Fan February 27, 2016

2 Outline
- Preliminaries
- Stochastic Backpropagation
- Variational Auto-Encoding
- Related Work
- Summary

6 Bayesian inference on latent variable model
- $y$: observed data
- $x$: latent variable
- $p_\theta(x, y)$: probabilistic model
Purpose: we are (very) interested in inferring the posterior distribution $p_\theta(x \mid y)$.
- Enables learning parameters in latent variable models
- Deep learning
Difficulty: $p(x \mid y) = \frac{p(x, y)}{p(y)}$ is most often intractable, because the marginal $p(y) = \int p(x, y)\,dx$ rarely has a closed form.

7 Non-variational approximate inference methods
Point estimate of $p_\theta(x \mid y)$ (MAP)
- Pro: fast
- Con: overfitting
Markov chain Monte Carlo (MCMC)
- Pro: asymptotically unbiased
- Con: expensive; slow to assess convergence

8 Variational Inference
Introduce a variational distribution $q_\phi(x)$ or $q_\phi(x \mid y)$ to approximate the true posterior.
- $\phi$: variational parameters
Objective: minimize, w.r.t. $\phi$, the KL divergence $D_{KL}(q_\phi(x \mid y) \,\|\, p_\theta(x \mid y))$.
- $q_\phi(x \mid y) = p_\theta(x \mid y)$ achieves zero KL divergence.

9 Lower Bound
From marginal log-likelihood to lower bound:
$$\log p_\theta(y) = \mathbb{E}_q\left[\log \frac{p_\theta(y, x)}{q_\phi(x \mid y)}\right] + D_{KL}(q_\phi(x \mid y) \,\|\, p_\theta(x \mid y)) \ge \mathbb{E}_q[\log p_\theta(y, x) - \log q_\phi(x \mid y)] \triangleq \mathcal{L}$$
Objective: maximize the lower bound $\mathcal{L}$.
Non-gradient-based optimization technique: mean-field VB with fixed-point equations
- Pro: efficient
- Con: intractable or not applicable in many cases
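
The decomposition on this slide can be checked numerically. The sketch below assumes a toy conjugate model that is not in the slides, chosen only because every term has a closed form: prior $x \sim \mathcal{N}(0, 1)$, likelihood $y \mid x \sim \mathcal{N}(x, 1)$, and a Gaussian $q(x \mid y) = \mathcal{N}(m, s^2)$. All function names are illustrative.

```python
import math

def log_marginal(y):
    # log p(y) for the assumed toy model, where marginally y ~ N(0, 2)
    return -0.5 * math.log(2 * math.pi * 2.0) - y * y / 4.0

def elbo(y, m, s2):
    # E_q[log p(y|x)] + E_q[log p(x)] + entropy of q, with q = N(m, s2)
    e_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s2)
    e_logprior = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + s2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s2)
    return e_loglik + e_logprior + entropy

def kl_to_posterior(y, m, s2):
    # KL(q || p(x|y)); the exact posterior of the toy model is N(y/2, 1/2)
    mp, vp = y / 2.0, 0.5
    return 0.5 * (math.log(vp / s2) + (s2 + (m - mp) ** 2) / vp - 1.0)

y, m, s2 = 1.3, 0.2, 0.8
gap = log_marginal(y) - (elbo(y, m, s2) + kl_to_posterior(y, m, s2))
print(gap)  # should vanish up to floating-point rounding
```

Because the KL term is non-negative, any choice of $(m, s^2)$ gives a lower bound on $\log p(y)$, and the bound is tight only when $q$ equals the exact posterior.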

11 Reparameterized Gradient Estimator
Consider a general form of the lower bound, $\mathcal{L} = \mathbb{E}_{q_\phi(x \mid y)}[f(y, x)]$.
Monte Carlo gradient approximation at iteration $t$:
- sample $\epsilon_t$ from some base distribution $p(\epsilon)$
- transform $x_t = g_\phi(\epsilon_t)$, s.t. $x_t \sim q_\phi(x \mid y)$
- compute $\nabla_\phi f(y, x_t)$ to approximate $\nabla_\phi \mathcal{L}$
A reparameterization has to exist, e.g. for the Gaussian, Laplace, Student's t, etc.
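
The three steps above can be sketched for a one-dimensional Gaussian. This is a minimal illustration under assumptions not in the slides: the integrand is the toy function $f(x) = x^2$, the variational family is $\mathcal{N}(\mu, \sigma^2)$, and many samples are averaged so the estimate can be compared with the exact gradient. The function name is hypothetical.

```python
import random

def reparam_gradients(mu, sigma, num_samples=200_000, seed=0):
    """Monte Carlo gradients of E_{x ~ N(mu, sigma^2)}[f(x)] for the toy f(x) = x^2."""
    rng = random.Random(seed)
    grad_mu, grad_sigma = 0.0, 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # sample from the base distribution p(eps)
        x = mu + sigma * eps        # transformation x = g_phi(eps)
        df_dx = 2.0 * x             # derivative of the toy integrand
        grad_mu += df_dx            # chain rule: dx/dmu = 1
        grad_sigma += df_dx * eps   # chain rule: dx/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Exact values for comparison: E[x^2] = mu^2 + sigma^2,
# so the true gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

The randomness enters only through `eps`, so the gradient passes through the deterministic map `mu + sigma * eps`. That is what makes the per-sample gradient an unbiased estimate.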

12 Reparameterization Trick

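13 Gaussian Backpropagation
For $x \sim \mathcal{N}(\mu, C)$, we have the following identities:
$$\nabla_{\mu_i} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[\nabla_{x_i} f(x)]$$
$$\nabla_{C_{ij}} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \frac{1}{2}\,\mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[\nabla^2_{x_i, x_j} f(x)]$$
$$\nabla^2_{C_{ij}, C_{kl}} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \frac{1}{4}\,\mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[\nabla^4_{x_i, x_j, x_k, x_l} f(x)]$$
$$\nabla^2_{\mu_i, C_{kl}} \mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[f(x)] = \frac{1}{2}\,\mathbb{E}_{\mathcal{N}(x \mid \mu, C)}[\nabla^3_{x_i, x_k, x_l} f(x)]$$
- These give unbiased estimators of $\nabla^k \mathbb{E}[f]$, $k = 1, 2$.
- Higher-order derivatives of $f$ need to be calculated.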
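
As a quick sanity check of the first two identities, consider a scalar example that is not in the slides: $f(x) = x^2$ with $x \sim \mathcal{N}(\mu, c)$, for which the expectation is available in closed form.

```latex
% Toy check with f(x) = x^2 and scalar x ~ N(mu, c)
\begin{align*}
\mathbb{E}[f(x)] &= \mu^2 + c, \\
\nabla_{\mu}\,\mathbb{E}[f(x)] &= 2\mu = \mathbb{E}[2x] = \mathbb{E}[\nabla_x f(x)], \\
\nabla_{c}\,\mathbb{E}[f(x)] &= 1 = \tfrac{1}{2}\,\mathbb{E}[2] = \tfrac{1}{2}\,\mathbb{E}[\nabla_x^2 f(x)].
\end{align*}
```

The derivative with respect to the covariance costs one extra order of differentiation of $f$, which is why second derivatives of the expectation require up to fourth derivatives of $f$.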

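14 Reparameterized Gaussian Backpropagation
$x \sim \mathcal{N}(\mu, RR^\top)$, thus $x = \mu + R\epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$.
New identities:
$$\nabla_R\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{d_z})}[\epsilon g^\top]$$
$$\nabla^2_{\mu, R}\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{d_z})}[\epsilon^\top \otimes H]$$
$$\nabla^2_R\, \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_{d_z})}[(\epsilon\epsilon^\top) \otimes H]$$
where $\otimes$ is the Kronecker product, and the gradient $g$ and Hessian $H$ of $f$ are evaluated at $\mu + R\epsilon$.
- Unbiased estimators are still easy to obtain.
- Hessian-vector multiplication is cheap due to the fact that $(A \otimes B)\,\mathrm{vec}(V) = \mathrm{vec}(AVB)$ (with row-wise $\mathrm{vec}$ and symmetric $B$, as is the case for $H$).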
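
The Kronecker identity is what lets a Hessian-vector product avoid forming the large Kronecker matrix. One way to read it, assumed here, is with row-wise vec, where the general form is $(A \otimes B)\,\mathrm{vec}(V) = \mathrm{vec}(A V B^\top)$. This reduces to $\mathrm{vec}(AVB)$ when $B$ is symmetric. The sketch below checks the general form on small made-up matrices, and the helper names are illustrative.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def vec_rows(V):
    """Row-wise vec: stack the rows of V into one flat list."""
    return [v for row in V for v in row]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 1.5]]   # deliberately non-symmetric
V = [[1.0, 0.0], [-2.0, 3.0]]

# Left side: build the full Kronecker product, then multiply by vec(V).
lhs = [sum(k * v for k, v in zip(row, vec_rows(V))) for row in kron(A, B)]
# Right side: two small matrix products, no Kronecker product needed.
rhs = vec_rows(matmul(matmul(A, V), transpose(B)))
print(lhs)
print(rhs)
```

For $n \times n$ factors, the left side touches an $n^2 \times n^2$ matrix while the right side only needs two $n \times n$ products.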

15 Bayesian Logistic Regression
- Prior $\mathcal{N}(0, \Lambda)$, where $\Lambda$ is diagonal
- Variational distribution $q(\beta \mid \mu, D)$, where $D$ is diagonal for simplicity
[Figure: results on the A9a dataset. Panels: lower bound vs. time (s); regression coefficient value vs. index. Methods compared: DSVI, L-BFGS-SGVI, HFSGVI.]

17 Model Formulation
- Gaussian latent variable, prior $p(x) = \mathcal{N}(0, I)$
- Generative model $p_\theta(y \mid x)$, characterized by a non-linear transformation, e.g. an MLP
- Recognition model $q_\phi(x \mid y) = \mathcal{N}(\mu, D)$, where $\phi = [\mu, D] = \mathrm{MLP}(y; W, b)$; denote $\psi = (W, b)$
Objective function:
$$\mathcal{L} = \mathbb{E}_{q_\phi(x \mid y)}[\log p_\theta(y \mid x) + \log p(x) - \log q_\phi(x \mid y)]$$
- $\log p_\theta(y \mid x)$: reconstruction error
- $\log p(x) - \log q_\phi(x \mid y)$: regularization
Unlike variational EM (VEM), $(\theta, \psi)$ are optimized simultaneously by a gradient-based algorithm.
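
With the standard normal prior and a diagonal Gaussian recognition model, the expected regularization term $\mathbb{E}_q[\log p(x) - \log q(x \mid y)]$ is the negative KL divergence from $q$ to the prior, which has a well-known closed form. A minimal sketch, assuming $D$ holds the diagonal variances (the function name is hypothetical):

```python
import math

def regularization_term(mu, d):
    """E_q[log p(x) - log q(x|y)] = -KL(N(mu, diag(d)) || N(0, I)).

    mu: list of means; d: list of diagonal variances (all > 0).
    """
    return 0.5 * sum(1.0 + math.log(v) - m * m - v for m, v in zip(mu, d))

print(regularization_term([0.0, 0.0], [1.0, 1.0]))   # q equals the prior
print(regularization_term([1.0, -0.5], [0.3, 2.0]))  # q differs from the prior
```

The term is zero when $q$ matches the prior and negative otherwise, so it pulls the encoder output toward $\mathcal{N}(0, I)$ while the reconstruction term pulls it toward informative codes.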

18 Unrolled VAE
[Figure: unrolled VAE network, from bottom to top. Input: x; hidden encoder layer h_e (W1, b1); Gaussian latent layer z ~ N(mu, Sigma), with mu and Sigma produced by (W2, b2) and (W3, b3); hidden decoder layer h_d (W4, b4); output y (W5, b5), or (W5, b5), (W6, b6) if x is continuous.]

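19 Back to Backpropagation
Fast gradient computation:
$$\nabla_{\psi_l} \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[g^\top \frac{\partial(\mu + R\epsilon)}{\partial \psi_l}\right]$$
$$\nabla^2_{\psi_{l_1} \psi_{l_2}} \mathbb{E}_{\mathcal{N}(\mu, C)}[f(x)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\frac{\partial(\mu + R\epsilon)}{\partial \psi_{l_1}}^\top H\, \frac{\partial(\mu + R\epsilon)}{\partial \psi_{l_2}} + g^\top \frac{\partial^2(\mu + R\epsilon)}{\partial \psi_{l_1} \partial \psi_{l_2}}\right]$$
$O(d_z^2)$ algorithmic complexity for both the 1st and 2nd derivatives.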

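20 Back to Backpropagation
For any $F$, $H_\psi v = \lim_{\gamma \to 0} \frac{\nabla F(\psi + \gamma v) - \nabla F(\psi)}{\gamma}$, so
$$H_\psi v = \frac{\partial \nabla F(\psi + \gamma v)}{\partial \gamma}\bigg|_{\gamma=0} = \frac{\partial}{\partial \gamma}\, \mathbb{E}_{\mathcal{N}(0, I)}\left[g^\top \frac{\partial(\mu(\psi) + R(\psi)\epsilon)}{\partial \psi}\bigg|_{\psi \leftarrow \psi + \gamma v}\right]\bigg|_{\gamma=0} = \mathbb{E}_{\mathcal{N}(0, I)}\left[\frac{\partial}{\partial \gamma}\left(g^\top \frac{\partial(\mu(\psi) + R(\psi)\epsilon)}{\partial \psi}\bigg|_{\psi \leftarrow \psi + \gamma v}\right)\right]\bigg|_{\gamma=0}$$
- Preconditioned conjugate gradient (PCG) only requires $H_\psi v$ to solve the linear system $Hp = g$.
- For $K$ iterations of PCG, the relative tolerance satisfies $e < \exp(-2K/\sqrt{c})$, where $c$ is the condition number of the matrix. Thus, $c$ can be nearly as large as $O(K^2)$.
- Complexity per iteration: $O(K d\, d_z^2)$ vs. $O(d\, d_z^2)$.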
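
The point that a linear solve needs only Hessian-vector products can be illustrated on a small example. The sketch below makes several assumptions that are not in the slides: the objective is a toy quadratic with a hand-written gradient, the Hessian-vector product uses the finite-difference quotient with a small fixed $\gamma$ rather than the exact derivative, and plain conjugate gradient is used without a preconditioner. All names are illustrative.

```python
def grad_F(psi):
    """Gradient of the toy objective F(psi) = 0.5 * psi^T A psi - b^T psi."""
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, -2.0, 0.5]
    return [sum(a * p for a, p in zip(row, psi)) - bi for row, bi in zip(A, b)]

def hess_vec(grad, psi, v, gamma=1e-4):
    """Finite-difference product: (grad(psi + gamma*v) - grad(psi)) / gamma."""
    shifted = grad([p + gamma * vi for p, vi in zip(psi, v)])
    base = grad(psi)
    return [(s - b0) / gamma for s, b0 in zip(shifted, base)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(hv, g, max_iter=50, tol=1e-8):
    """Solve H p = g using only Hessian-vector products hv(v); no preconditioner."""
    p = [0.0] * len(g)
    r = list(g)            # residual g - H p for the initial guess p = 0
    d = list(r)            # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        Hd = hv(d)
        curvature = dot(d, Hd)
        if curvature <= 0.0:   # guard: stop if H does not look positive definite
            break
        alpha = rr / curvature
        p = [pi + alpha * di for pi, di in zip(p, d)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return p

psi = [0.3, -0.1, 0.7]
g = grad_F(psi)
p = conjugate_gradient(lambda v: hess_vec(grad_F, psi, v), g)
print(p)
```

The Hessian is never stored: each iteration costs two gradient evaluations, which mirrors the per-iteration complexity comparison on the slide.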

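21 Theoretical Perspective
If $f$ is an $L$-Lipschitz differentiable function and $\epsilon \sim \mathcal{N}(0, I_{d_z})$, then
$$\mathbb{E}\left[(f(\epsilon) - \mathbb{E}[f(\epsilon)])^2\right] \le \frac{L^2 \pi^2}{4}$$
$$\mathbb{P}\left(\left|\frac{1}{M}\sum_{m=1}^{M} f(\epsilon_m) - \mathbb{E}[f(\epsilon)]\right| \ge t\right) \le 2 e^{-\frac{2 M t^2}{\pi^2 L^2}}$$
In most applications, $M = 1$ is used for the MC integration.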
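
To get a feel for the two bounds, the sketch below evaluates them and compares the variance bound with an empirical variance. The test function $f(\epsilon) = \sin(\epsilon)$ with $L = 1$ and scalar $\epsilon$ is an assumption made for illustration, and the function names are hypothetical.

```python
import math
import random

def variance_bound(L):
    """Upper bound L^2 * pi^2 / 4 on Var[f(eps)] for an L-Lipschitz f."""
    return (L ** 2) * (math.pi ** 2) / 4.0

def tail_bound(M, t, L):
    """Upper bound 2 * exp(-2 M t^2 / (pi^2 L^2)) on P(|sample mean - E f| >= t)."""
    return 2.0 * math.exp(-2.0 * M * t * t / (math.pi ** 2 * L ** 2))

# Empirical variance of the 1-Lipschitz toy function f(eps) = sin(eps), eps ~ N(0, 1).
rng = random.Random(0)
values = [math.sin(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(values) / len(values)
empirical_var = sum((v - mean) ** 2 for v in values) / len(values)

print(empirical_var, variance_bound(1.0))
print(tail_bound(M=1, t=1.0, L=1.0), tail_bound(M=100, t=1.0, L=1.0))
```

The tail bound decays exponentially in $M$, but for small $M$ and $t$ it can exceed 1 and is then uninformative. The variance bound still holds in that regime.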

22 VAE Experiments
Manifold of the generative model, obtained by setting $d_z = 2$.

23 VAE Experiments
[Figure: lower bound vs. time (s) on the Frey Face and Olivetti Face datasets. Curves: Ada, L-BFGS-SGVI, and HFSGVI, each on train and test.]

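25 Semi-supervised VAE (NIPS 2014)
Here $\mathbf{y}$ denotes the observed data, $y$ the class label, and $x$ the latent variable.
Generative model: $p(y) = \mathrm{Cat}(y \mid \pi)$; $p(x) = \mathcal{N}(0, I)$; $p(\mathbf{y} \mid y, x) = \mathrm{MLP}$
Recognition model: $q(x \mid \mathbf{y}, y) = \mathcal{N}(\mu_\phi(\mathbf{y}, y), D_\phi(\mathbf{y}))$ and $q(y \mid \mathbf{y}) = \mathrm{Cat}(y \mid \pi(\mathbf{y}))$; the parameter functions are also MLPs.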

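26 Neural Variational Inference (ICML 2014)
Sigmoid belief networks
- Generative model: $h_L \to h_{L-1} \to \dots \to h_1 \to y$
- Recognition model: reverse the arrow direction
Learning signal, or control variate, for variance reduction, borrowing an idea from RL (on this slide $x$ is the observation and $z$ the latent variable):
$$\nabla_\phi \mathcal{L} = \mathbb{E}_q[(\log p_\theta(x, z) - \log q_\phi(z \mid x))\, \nabla_\phi \log q_\phi(z \mid x)] = \mathbb{E}_q[(\log p_\theta(x, z) - \log q_\phi(z \mid x) - C_\xi(x))\, \nabla_\phi \log q_\phi(z \mid x)]$$
$(\theta, \phi, \xi)$ are learned jointly.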
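
The second equality holds because $\mathbb{E}_q[\nabla_\phi \log q_\phi(z \mid x)] = 0$, so subtracting any quantity that does not depend on $z$ leaves the mean unchanged while it can shrink the variance. The sketch below shows this exactly for a single Bernoulli latent by enumerating both outcomes. The learning-signal values and the constant baseline are made up for illustration, whereas the slide's $C_\xi(x)$ is a learned, input-dependent baseline.

```python
import math

def estimator_moments(phi, signal, baseline):
    """Exact mean and variance of (signal(z) - baseline) * d/dphi log q(z)
    for a single Bernoulli latent z with q(z = 1) = sigmoid(phi)."""
    p = 1.0 / (1.0 + math.exp(-phi))
    terms = []
    for z, prob in ((0, 1.0 - p), (1, p)):
        score = z - p   # d/dphi log q(z) for a Bernoulli with logit phi
        terms.append((prob, (signal[z] - baseline) * score))
    mean = sum(prob * val for prob, val in terms)
    var = sum(prob * (val - mean) ** 2 for prob, val in terms)
    return mean, var

signal = {0: -3.0, 1: -1.0}   # made-up values of log p(x, z) - log q(z | x)
mean_plain, var_plain = estimator_moments(0.4, signal, baseline=0.0)
mean_base, var_base = estimator_moments(0.4, signal, baseline=-2.0)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

A baseline close to the typical size of the learning signal removes most of the variance, which is the motivation for learning $C_\xi(x)$ alongside $\theta$ and $\phi$.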

27 Dynamic Modeling DRAW (Dynamic VAE with LSTM, ICML 2015, reviewed before) DSBN (NIPS 2015), generative model is similar to HMM

28 Dynamic Modeling, ctd.
hgchmm, my model
[Figure panels: (g) NVI, (h) Doctor, (i) GibbsEM]

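29 Bayesian Dark Knowledge (NIPS 2015)
Teacher model: deep neural network $T(y \mid x, \theta)$, prior $p(\theta \mid \lambda)$
Student model: deep neural network $S(y \mid x, \omega)$, prior $p(\omega \mid \gamma)$
Two-step training (or "distilled SGLD", the term used in the paper) on mini-batch data $(X, Y)$ of size $B$:
- SGLD update of $\theta$:
$$\Delta\theta_{t+1} = \frac{\eta_t}{2}\left(\nabla_\theta \log p(\theta \mid \lambda) + \frac{N}{B}\sum_{x_i \in X} \nabla_\theta \log p(y_i \mid x_i, \theta)\right) + \mathcal{N}(0, \eta_t)$$
- SGD update of $\omega$ with noisy data $X$ only; $\tilde{y}_i$ is obtained by feeding $x_i$ to the current teacher model:
$$\Delta\omega_{t+1} = -\rho_t\left(-\frac{1}{B}\sum_{x_i \in X} \nabla_\omega \log p(\tilde{y}_i \mid x_i, \omega) + \gamma\,\omega_t\right)$$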
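
The SGLD step for the teacher can be sketched on a problem small enough to check by hand. Everything specific below is an assumption for illustration and not from the slides or the paper: the model is a scalar Gaussian mean ($y_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 100)$) instead of a neural network, the step size is held constant instead of following a schedule $\eta_t$, and the student update is omitted.

```python
import random

def sgld_samples(data, num_steps=5000, batch_size=10, step=1e-3,
                 prior_var=100.0, seed=0):
    """SGLD for the mean theta of a toy model y_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    n = len(data)
    theta, samples = 0.0, []
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)
        grad_log_prior = -theta / prior_var
        grad_log_lik = sum(y - theta for y in batch)   # sum over the mini-batch
        drift = 0.5 * step * (grad_log_prior + (n / batch_size) * grad_log_lik)
        noise = rng.gauss(0.0, step ** 0.5)            # N(0, eta): variance = step
        theta += drift + noise
        samples.append(theta)
    return samples

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
samples = sgld_samples(data)
kept = samples[500:]   # discard an initial burn-in period
print(sum(kept) / len(kept), sum(data) / len(data))
```

The factor `n / batch_size` rescales the mini-batch gradient to the full dataset, matching the $N/B$ term in the update, and the injected noise has variance equal to the step size. With this weak prior the average of the retained samples should sit close to the data mean.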

31 Summary
- Minimize the difference between the generative model and the recognition model
- Variational inference framework
