Deep Generative Models

1 Deep Generative Models. Durk Kingma, Max Welling. Deep Probabilistic Models Workshop, Wednesday, 1st of Oct, 2014. D.P. Kingma

2 Deep generative models. Transformations between Bayes nets and neural nets: "Transformation between Bayes nets and Neural Nets", ICML 14. Deep generative models of images: "Auto-Encoding Variational Bayes", ICLR 14. "Semi-Supervised Learning with Deep Generative Models", NIPS 14; with Shakir & Danilo (DeepMind).

3 Part 1: Transformations between Bayes nets and Neural nets

4 Neural Nets and Deep Generative Models (1). Directed latent-variable models: can model complex (multimodal) distributions over observed variables. Feedforward neural nets: typically used for learning a single conditional, $\log p(y \mid x) = f(x, y)$, e.g. classification/regression with a deep net. But: not good for modelling complex distributions, e.g. multi-modal distributions.

5 Neural Nets and Deep Generative Models (2). Natural idea: neural nets as components of deep latent-variable models. Learn complex distributions over data. The inner loop of learning requires posterior inference, and exact inference is intractable. Efficient gradient-based approximate inference methods: stochastic variational inference (biased but fast) [Hoffman & Blei 2012] [Kingma & Welling 2013] [Deepmind 2014]; sampling-based methods (unbiased but slower), e.g. HMC. Can we increase the efficiency of gradient-based posterior inference? $p(z \mid x) = p(x, z) / p(x)$, so $p(z \mid x) \propto p(x, z)$ and $\nabla_z \log p(z \mid x) = \nabla_z \log p(x, z)$.
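
The gradient identity on this slide can be checked numerically. The sketch below assumes a toy model that is not from the talk (z ~ N(0, 1), x | z ~ N(z, 1), whose exact posterior is N(x/2, 1/2)) and compares finite-difference gradients of the log-joint and the log-posterior with respect to z.

```python
import math

def log_joint(z, x):
    # Hypothetical toy model (an assumption): z ~ N(0, 1), x | z ~ N(z, 1)
    log_prior = -0.5 * z**2 - 0.5 * math.log(2 * math.pi)
    log_lik = -0.5 * (x - z) ** 2 - 0.5 * math.log(2 * math.pi)
    return log_prior + log_lik

def log_posterior(z, x):
    # Exact posterior for this toy model: z | x ~ N(x / 2, 1 / 2)
    mean, var = x / 2.0, 0.5
    return -0.5 * (z - mean) ** 2 / var - 0.5 * math.log(2 * math.pi * var)

def grad_z(f, z, x, h=1e-5):
    # Central finite difference with respect to z
    return (f(z + h, x) - f(z - h, x)) / (2 * h)

x_obs, z0 = 1.3, -0.4
g_joint = grad_z(log_joint, z0, x_obs)
g_post = grad_z(log_posterior, z0, x_obs)
print(g_joint, g_post)  # both should be close to x_obs - 2 * z0
```

The two values agree because log p(x) does not depend on z, which is why gradient-based samplers such as HMC only need the joint density.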

6 Idea: Reparameterizations (Gaussian example). Centered form (CP): $p_\theta(z \mid pa) = \mathcal{N}(z; \mu, \sigma^2 I)$, with $\mu = f_\theta(pa)$ (e.g. a neural net). Differentiable Non-Centered Form (DNCP): $\epsilon \sim \mathcal{N}(0, I)$, $z = f_\theta(pa) + \sigma \epsilon$. Neural net perspective: hidden unit with injected noise.

7 Idea: Reparameterizations. Centered form (CP): $p_\theta(z \mid pa)$. Differentiable Non-Centered Form (DNCP): $\epsilon \sim p(\epsilon)$, $z = g(pa, \epsilon, \theta)$. Neural net perspective: hidden unit with injected noise.
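
A minimal NumPy sketch of the two forms for the Gaussian case follows. The one-layer tanh map standing in for the neural net, the shapes, and the value of sigma are all assumptions chosen for illustration, not details from the talk.

```python
import numpy as np

def f(pa, W, b):
    # Hypothetical stand-in for the neural net: a single tanh layer
    return np.tanh(pa @ W + b)

def sample_cp(pa, W, b, sigma, rng):
    # Centered form: draw z directly from N(f(pa), sigma^2 I)
    mu = f(pa, W, b)
    return rng.normal(loc=mu, scale=sigma)

def sample_dncp(pa, W, b, sigma, rng):
    # Non-centered form: draw parameter-free noise, then transform it
    mu = f(pa, W, b)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
pa = rng.standard_normal((10000, 3))   # parent values
W = rng.standard_normal((3, 2))
b = np.zeros(2)
sigma = 0.5

z_cp = sample_cp(pa, W, b, sigma, rng)
z_dncp, eps = sample_dncp(pa, W, b, sigma, rng)
```

Both functions produce samples with the same conditional distribution. The difference is that in the non-centered form z is a deterministic, differentiable function of the parents and the noise, so gradients can flow through it.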

8 Reparameterizations Can be performed for a broad class of distributions

9 So: we have a choice. Which form to use for learning / inference?

10 Posterior correlation analysis (1). Squared posterior correlation $\rho^2$: second-order metric of posterior dependency between a pair of latent variables A and B.

11 Posterior correlation analysis (2). Squared correlations for CP: [equation lost in extraction]. Squared correlations for DNCP: [equation lost in extraction].

12 Posterior correlation analysis (3). Inequality: all terms cancel, with a very simple result: $\rho^2_{pa(z)_i, z} > \rho^2_{pa(z)_i, \epsilon} \iff \sigma^2_{z \mid pa(z)} < \sigma^2_{z \mid c(z)}$, i.e., DNCP leads to more efficient inference when latent variable 'z' is more strongly bound to its parents than to its children.

13 Posterior correlation analysis (4): "beauty-and-beast" pair (image: Disney).

14 2D linear-Gaussian posterior example. [Figure: posterior panels for σz = 50, σz = 1, and σz = 0.02, plotted in (z1, z2) and (ε1, ε2) coordinates; panel labels list the values 0.00, 0.41, 1.00 and 0.58, 0.41, 0.01.]

15 Robust HMC Sampler. Proposal distribution is a mix between two proposal distributions: [equation lost in extraction], where we use ρ = 0.5 in experiments.

16 Experiment: HMC sampling with Dynamic Bayesian network (DBN)
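
To make the effect of σz concrete, here is a sketch that computes posterior correlations in closed form for a small linear-Gaussian chain. The chain itself (z1 ~ N(0, 1), z2 | z1 ~ N(z1, σz²), x | z2 ~ N(z2, σx²) with σx = 1) is an assumption made for illustration and is not claimed to be the model behind the figure, so its numbers need not match the panel labels.

```python
import numpy as np

def corr_from_precision(prec):
    # Correlation of a 2D Gaussian, given its precision matrix
    cov = np.linalg.inv(prec)
    return cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

def posterior_correlations(sigma_z, sigma_x=1.0):
    # Assumed chain: z1 ~ N(0,1), z2|z1 ~ N(z1, sigma_z^2), x|z2 ~ N(z2, sigma_x^2)
    a, c = 1.0 / sigma_z**2, 1.0 / sigma_x**2
    # CP: posterior precision over (z1, z2)
    prec_cp = np.array([[1.0 + a, -a],
                        [-a, a + c]])
    # DNCP: z2 = z1 + sigma_z * eps2; posterior precision over (z1, eps2)
    prec_dncp = np.array([[1.0 + c, sigma_z * c],
                          [sigma_z * c, 1.0 + sigma_z**2 * c]])
    return corr_from_precision(prec_cp), corr_from_precision(prec_dncp)

results = {s: posterior_correlations(s) for s in (50.0, 1.0, 0.02)}
for s, (r_cp, r_dncp) in results.items():
    print(f"sigma_z={s}: rho_CP={r_cp:+.3f}, rho_DNCP={r_dncp:+.3f}")
```

In this toy chain the DNCP posterior is the less correlated one when σz is smaller than σx (z2 tightly bound to its parent), and the CP posterior is the less correlated one when σz is larger, which is the qualitative pattern the inequality on the earlier slide describes.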

17 Autocorrelation results on DBN

18 Part 2: The Variational Auto-Encoder

19 Deep latent variable model: $z \sim p(z)$, $x \sim p(x \mid z)$

20 Monte Carlo EM

21 Monte Carlo EM - End result

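22 MAP inference with L-BFGS

23 The Variational Auto-Encoder (ICLR 14). Q: $\tilde{p}(x)\, q(z \mid x)$ (from x to z). P: $p(z)\, p(x \mid z)$ (from z to x). $\mathcal{L} = -D_{KL}(\tilde{p}(x)\, q(z \mid x) \,\|\, p(x, z)) \le \log p(x)$

24 Why is this an auto-encoder? The reparameterized $Q(z \mid x)$ maps x and noise ε to z, and $P(x \mid z)$ maps z back to x. $\mathcal{L} = -D_{KL}(\tilde{p}(x)\, q(z \mid x) \,\|\, p(x, z)) = E_{\tilde{p}(x)} E_{p(\epsilon)} [\log p(x \mid z) + \log p(z) - \log q(z \mid x)]$. The term $\log p(x \mid z)$ is the reconstruction error; $\log p(z) - \log q(z \mid x)$ are regularization terms (dictated by the bound).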
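
The bound on this slide can be estimated with a single noise sample per data point. The sketch below is one way to do that in NumPy under stated assumptions: a Gaussian encoder and Bernoulli decoder, both reduced to hypothetical linear maps (`W_mu`, `W_ls`, `W`), a standard normal prior, and random binary data. None of these choices come from the talk.

```python
import numpy as np

def elbo_estimate(x, enc, dec, rng):
    # Encoder q(z|x) = N(mu, diag(sigma^2)); linear maps are hypothetical stand-ins
    mu = x @ enc["W_mu"]
    log_sigma = x @ enc["W_ls"]
    eps = rng.standard_normal(mu.shape)      # eps ~ p(eps) = N(0, I)
    z = mu + np.exp(log_sigma) * eps         # reparameterized sample of z
    # Decoder p(x|z): independent Bernoulli pixels with logits z @ W
    logits = z @ dec["W"]
    log_px_z = np.sum(x * logits - np.logaddexp(0.0, logits), axis=1)
    # Prior p(z) = N(0, I) and encoder density q(z|x), both evaluated at z
    log_pz = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
    log_qz_x = -0.5 * np.sum(eps**2 + np.log(2 * np.pi) + 2 * log_sigma, axis=1)
    # Minibatch average stands in for the expectation over the data
    return np.mean(log_px_z + log_pz - log_qz_x)

rng = np.random.default_rng(0)
x = (rng.random((8, 6)) < 0.5).astype(float)     # toy binary "images"
enc = {"W_mu": 0.1 * rng.standard_normal((6, 2)),
       "W_ls": 0.1 * rng.standard_normal((6, 2))}
dec = {"W": 0.1 * rng.standard_normal((2, 6))}
value = elbo_estimate(x, enc, dec, rng)
print(value)
```

Because z is a differentiable function of the encoder outputs and the noise, an automatic-differentiation library could take gradients of this estimate with respect to all weights; this sketch only evaluates it.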

25 Variational Auto-Encoder trained on MNIST (2D Latent space)

26 3D latent space

27 Labeled Faces in the Wild

28 Labeled Faces in the Wild

29 Semi-supervised learning with deep generative models ("Model 2.5")

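30 Approach 1: DLGM / VAE as feature extractor for semi-supervised classifier. Q: $\tilde{p}(x)\, q(z \mid x)$. P: $p(z)\, p(x \mid z)$. $\mathcal{L} = -D_{KL}(\tilde{p}(x)\, q(z \mid x) \,\|\, p(x, z))$

31 Approach 2: DLGM / VAE as regularizer of neural net classifier. Q: $\tilde{p}(x)\, q(y \mid x)\, q(z \mid x, y)$. P: $p(y)\, p(z)\, p(x \mid y, z)$. $\mathcal{L} = E_{\tilde{p}_l(x, y)}[\log q(y \mid x)] + \mathcal{L}_{reg}$, with $\mathcal{L}_{reg} = -D_{KL}(\tilde{p}_l(x, y)\, q(z \mid x, y) \,\|\, p(x, y, z)) - D_{KL}(\tilde{p}_u(x)\, q(y \mid x)\, q(z \mid x, y) \,\|\, p(x, y, z))$

32 Approach 3. [Diagram: inference model Q and generative model P, each over the variables x, z1, z2, and y.]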

33 Results. Table 1: Benchmark results of semi-supervised classification on MNIST with few labels. Columns: N labeled, NN, CNN, TSVM, EmbNN, CAE, MTC, M1, M2, M. Only the last three columns survive (* marks the lost leading cells of each row):
* | (± 0.25) | (± 1.71) | 3.54 (± 0.03)
* | 5.72 (± 0.049) | 4.94 (± 0.13) | 2.85 (± 0.1)
* | 4.24 (± 0.07) | 3.60 (± 0.56) | 2.76 (± 0.30)
* | 3.49 (± 0.04) | 3.92 (± 0.63) | 2.63 (± 0.24)

34 Analogical reasoning (will improve)

35 Next steps. SSL on larger images: SVHN, CIFAR-10 (this NIPS paper); ImageNet (future papers). SSL on videos: YouTube. Applications of analogical reasoning.

36 Thanks!

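37 RMSProp. Initialization: $m \leftarrow 0$, $v \leftarrow 0$. Update parameters$(w, g)$: $m \leftarrow \beta_1 g + (1 - \beta_1) m$; $v \leftarrow \beta_2 g^2 + (1 - \beta_2) v$; $w \leftarrow w - \alpha\, m / \sqrt{v}$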

38 Estimation based on exp. mov. avg. Assume: g is a random variable. Goal: from exponential moving averages of i.i.d. draws $g_i$, estimate the moments $E[g]$ and $E[g^2]$.
$m_t = \beta_1 g_t + (1 - \beta_1) m_{t-1} = \beta_1 \sum_{i=1}^{t} (1 - \beta_1)^{t-i} g_i$ (as function of past samples)
$E[m_t] = E\left[\beta_1 \sum_{i=1}^{t} (1 - \beta_1)^{t-i} g_i\right]$ (taking expectations)
$= E[g]\, \beta_1 \sum_{i=1}^{t} (1 - \beta_1)^{t-i}$ (because i.i.d.)
$= E[g]\, (1 - (1 - \beta_1)^t)$ (further simplification)
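
A quick numerical check of this result: with constant draws g_i = c the expectation is exact, so the moving average started at zero should equal c(1 − (1 − β1)^t), and dividing by that factor should recover c. The values of β1, t, and c below are arbitrary choices for illustration.

```python
def ema(samples, beta1):
    # Slide convention: m <- beta1 * g + (1 - beta1) * m, starting from m = 0
    m = 0.0
    for g in samples:
        m = beta1 * g + (1.0 - beta1) * m
    return m

beta1, t, c = 0.1, 5, 3.0
m_t = ema([c] * t, beta1)                     # constant draws, so E[g] = c
predicted = c * (1.0 - (1.0 - beta1) ** t)    # E[g] * (1 - (1 - beta1)^t)
corrected = m_t / (1.0 - (1.0 - beta1) ** t)  # bias-corrected estimate of E[g]
print(m_t, predicted, corrected)
```

The uncorrected average underestimates E[g] early on because of the zero initialization; the division removes that bias.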

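39 Fixed RMSProp (AdaM). Initialization: $m \leftarrow 0$, $v \leftarrow 0$. Update parameters$(w, g)$: $m \leftarrow \beta_1 g + (1 - \beta_1) m$; $v \leftarrow \beta_2 g^2 + (1 - \beta_2) v$; $w \leftarrow w - \alpha\, \gamma_t\, m / \sqrt{v}$, where $\gamma_t = \sqrt{1 - (1 - \beta_2)^t} \,/\, (1 - (1 - \beta_1)^t)$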
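
The update on this slide can be sketched in NumPy as follows, using the slide's convention that β1 and β2 weight the newest gradient. The name `gamma_t` for the correction factor, the default hyperparameters, the small `eps` added to the denominator, and the quadratic test objective are assumptions made for this sketch.

```python
import numpy as np

def init_state(w):
    return {"m": np.zeros_like(w), "v": np.zeros_like(w), "t": 0}

def adam_step(w, g, state, alpha=0.01, beta1=0.1, beta2=0.001, eps=1e-8):
    # Moving averages of the gradient and the squared gradient
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * g + (1 - beta1) * state["m"]
    state["v"] = beta2 * g**2 + (1 - beta2) * state["v"]
    # Correction for the zero initialization of m and v
    gamma_t = np.sqrt(1 - (1 - beta2) ** t) / (1 - (1 - beta1) ** t)
    # eps is not on the slide; it only guards against division by zero
    return w - alpha * gamma_t * state["m"] / (np.sqrt(state["v"]) + eps)

# Toy objective (an assumption): f(w) = 0.5 * ||w||^2, whose gradient is w
w0 = np.array([1.0, -2.0])
state = init_state(w0)
first = adam_step(w0, w0, state)

w = first
for _ in range(300):
    w = adam_step(w, w, state)
print(first, w)
```

On the first step the correction makes the update approximately `alpha * sign(g)`, independent of the gradient's scale, which is the behaviour the uncorrected RMSProp update lacks when m and v start at zero.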

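40 VAE, 10 epochs. [Figure: results for β1 = 1 and β1 = 0.1 at several β2 values (β2 = 0.01 and two values lost in extraction); annotation: "More like AdaGrad (infinite memory)".]

41 VAE, 100 epochs. [Figure: results for β1 = 1 and β1 = 0.1 at several β2 values (β2 = 0.01 and two values lost in extraction); annotation: "More like AdaGrad (infinite memory)".]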

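42 Now possible: Interpolation between AdaGrad and RMSProp. Initialization: $m \leftarrow 0$, $v \leftarrow 0$. Update parameters$(w, g)$: $m \leftarrow \beta_1 g + (1 - \beta_1) m$; $v \leftarrow \beta_2 g^2 + (1 - \beta_2) v$; $w \leftarrow w - \alpha\, \gamma_t\, m / \sqrt{t\, v}$, where $\gamma_t = \sqrt{1 - (1 - \beta_2)^t} \,/\, (1 - (1 - \beta_1)^t)$. This algorithm is equivalent to AdaGrad when β1 = 1 and β2 = ε (infinitesimal).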
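
The AdaGrad limit can be checked numerically. The sketch below assumes the update divides by the square root of t times v and writes the correction factor as `gamma_t`; it uses β2 = 1e-8 as a stand-in for "infinitesimal" and a made-up gradient sequence. It compares the resulting steps with plain AdaGrad steps, α g_t / sqrt(sum of squared gradients so far).

```python
import numpy as np

def interpolated_steps(grads, alpha, beta1, beta2):
    # Steps of the assumed variant: alpha * gamma_t * m / sqrt(t * v)
    m, v, steps = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta1 * g + (1 - beta1) * m
        v = beta2 * g**2 + (1 - beta2) * v
        gamma_t = np.sqrt(1 - (1 - beta2) ** t) / (1 - (1 - beta1) ** t)
        steps.append(alpha * gamma_t * m / np.sqrt(t * v))
    return np.array(steps)

def adagrad_steps(grads, alpha):
    # AdaGrad: alpha * g_t / sqrt(sum of squared gradients up to step t)
    g = np.asarray(grads)
    return alpha * g / np.sqrt(np.cumsum(g**2))

grads = np.random.default_rng(0).standard_normal(50) + 0.5  # made-up gradients
a = interpolated_steps(grads, alpha=0.1, beta1=1.0, beta2=1e-8)
b = adagrad_steps(grads, alpha=0.1)
print(np.max(np.abs(a - b)))
```

With β1 = 1 the first moving average is just the current gradient, and with a tiny β2 the second one is proportional to the running sum of squared gradients, so the two step sequences agree up to terms of order t times β2.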

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Diederik (Durk) Kingma, Danilo J. Rezende (*), Max Welling, Shakir Mohamed (**). Stochastic Gradient Variational Inference Bayesian