Chapter 20. Deep Generative Models

Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models

Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution function, p data, or Generate samples from a (likely implicit) distribution

Peng et al.: Deep Learning and Practice 3 Why Study Generative Models? Manipulation of high-dimensional, multi-modal distributions Potential uses in reinforcement learning, such as future state prediction Training with missing data (e.g. missing labels) and prediction on them Generation of realistic samples etc.

Peng et al.: Deep Learning and Practice 4 Taxonomy of Generative Models

Peng et al.: Deep Learning and Practice 5 Explicit density, p model (x; θ) Tractable (trained with the ordinary ML) Intractable/approximate (trained with approximate inference and/or MCMC approximations) Implicit density Single-step sample generation via a network Multi-step sample generation via Markov chains

Peng et al.: Deep Learning and Practice 6 Generative dversarial Networks (GN) differentiable generation network G, paired with a discriminator D for training Generator G maps latent noises z p(z) to visible variables x Conceptually, a graphical model with the same structure as VE x = G(z) can be regarded as a sample drawn from some p g (x) Generator is what we are concerned with

Peng et al.: Deep Learning and Practice 7 Discriminator D divides inputs into real and fake classes n ordinary binary classifier trained supervisedly Inputs are training examples (real) and generated samples (fake)

Peng et al.: Deep Learning and Practice 8

Peng et al.: Deep Learning and Practice 9 Training GNs: Two-Player Minimax Game D(x; θ (D) ), G(z; θ (G) ) are implemented with neural networks, and each has their own cost to minimize Discriminator cost (cross-entropy cost) J (D) (θ (D), θ (G) ) = E x pdata log D(x) E z pz log(1 D(G(z))) where D(x) denotes the probability of x being real Generator cost J (G) (θ (D), θ (G) ) = J (D) (θ (D), θ (G) ) Note that the sum of all players costs is zero (zero-sum game)

Peng et al.: Deep Learning and Practice 10 The entire game can be summarized with a value function V (θ (D), θ (G) ) J (D) (θ (D), θ (G) ) and the objective is to find a generator θ (G) = arg min θ (G) max θ (D) V (θ (D), θ (G) )

Peng et al.: Deep Learning and Practice 11 Optimization vs. Game The solution to an optimization problem is generally a local minimum of an objective function in parameter space, e.g. arg min θ (G),θ (D) V (θ (D), θ (G) ) where both θ (G), θ (D) are optimized simultaneously The solution to a game problem is generally a saddle point of an objective function in parameter space, e.g. arg min θ (G) max θ (D) V (θ (D), θ (G) ) where θ (G), θ (D) are optimized in turn by controlling one of them at a time with the other fixed

Peng et al.: Deep Learning and Practice 12 The Optimal Discriminator For a given generator G, the optimal discriminator is seen to be D G(x) = p data (x) p data (x) + p g (x) which can be obtained by having δ δd(x) J (D) (x) = 0 When given enough capacity, the discriminator obtains an estimate at every x p data(x) p g(x) This is the key that sets GNs apart from other generative models

Peng et al.: Deep Learning and Practice 13

Peng et al.: Deep Learning and Practice 14 The generator is to learn a model by following a discriminator uphill

Peng et al.: Deep Learning and Practice 15 The Optimal Generator Given DG (x) and enough capacity, the optimal generator is to minimize the Jensen-Shannon divergence between p data and p g arg min p g = arg min p g = arg min p g E x pdata log D G(x) + E x pg log(1 D G(G(x))) p data (x) E x pdata log p data (x) + p g (x) + E x p g log ( log(4) + KL p data p ) data + p g + KL 2 = arg min p g log(4) + 2 JSD(p data p g ) p g (x) p data (x) + p g (x) ( p g p ) data + p g 2 The minimum is achieved when p g = p data, i.e. JSD(p data p g ) = 0

Peng et al.: Deep Learning and Practice 16 Remarks The optimization is done w.r.t. p g directly The analysis for the discriminator is done w.r.t. D(x) Enough capacity in both contexts means that D G (x) and p g(x) can be implemented by D(x; θ (D) ) and G(z; θ (G) ), respectively

Peng et al.: Deep Learning and Practice 17 Implementation

Peng et al.: Deep Learning and Practice 18 (Convergence) If G and D have enough capacity, and at each step of lgorithm I, the discriminator is allowed to reach its optimum D G (x) given G, and p g is updated to improve the criterion (reduce the cost) E x pdata log D G(x) + E x pg log(1 D G(G(x))) then p g converges to p data Nothing is said about the convergence when optimization is done based on simultaneous stochastic gradient descent in parameter space

Peng et al.: Deep Learning and Practice 19 Toy problem Non-Convergence of Gradient Descent min x max y V (x, y) = xy x, y are optimized based on gradient descent with a tiny learning rate x(t + Δt) = x(t) Δt V (x(t), y(t)) x(t) y(t + Δt) = y(t) + Δt V (x(t), y(t)) y(t) This amounts to solving x (t) = y(t) y (t) = x(t) x (t) = x(t)

Peng et al.: Deep Learning and Practice 20 which has a solution of the form x(t) = x(0) cos(t) + y(0) sin(t) y(t) = x(0) sin(t) + y(0) cos(t) 2 (x(t), y(t)) (x(0), y(0)) (x, y ) 20 1 V (x, y) 0 y 0 1 20 4 2 0 2 4 5 x 0 5 y 2 2 1 0 1 2 x

Peng et al.: Deep Learning and Practice 21 Other Games Zero-sum game does not perform well in learning generator: gradients of J (G) w.r.t D(G(z)) vanish when the discriminator performs well

Peng et al.: Deep Learning and Practice 22 Heuristic, non-saturating game (to ensure non-zero gradients) J (G) = E z log D(G(z)) Maximum likelihood game (to minimize KL divergence) J (G) = E z exp(σ 1 (D(G(z)))

Peng et al.: Deep Learning and Practice 23 Mode Collapse Problem The generator learns to map different z to the same x Top: Data distribution (Mixture Gaussian) Bottom: Learned generator distribution over time The generator distribution produces only a single mode at a time and does not converge in this example This is acceptable in some applications but not all

Peng et al.: Deep Learning and Practice 24 Learned Representation The generator can learn a distributed representation that disentangles high-level concepts, e.g. gender vs. wearing glasses

Peng et al.: Deep Learning and Practice 25 DCGN There are many different implementations for generators, such as DCGN, LPGN, and more (study by yourself)

Peng et al.: Deep Learning and Practice 26 Deep Boltzmann Machines (DBM) n energy-based generative model with an explicit density over binary visible v and hidden h (1), h (2), h (3) variables where p(v, h (1), h (2), h (3) ; θ) = 1 Z(θ) exp( E(v, h(1), h (2), h (3) ; θ)) E(v, h (1), h (2), h (3) ; θ) = v T W (1) h (1) h (1)T W (2) h (2) h (2)T W (3) h (3) and θ = {W (1), W (2), W (3) } Note that bias terms are omitted for simplicity

Peng et al.: Deep Learning and Practice 27 Graphical model for DBM, where odd layers can be separated from even layers to reveal a bipartite structure s a result, variables in odd layers are conditionally independent given even layers and vice versa; this enables block Gibbs sampling

Peng et al.: Deep Learning and Practice 28 Likewise, it is seen that variables in a layer are conditionally independent given the neighbouring layers In the case of two hidden layers, we have p(v i = 1 h (1) ) = σ(w (1) i,: h (1) ) p(h (1) i = 1 v, h (2) ) = σ(v T W (1) :,i + W (2) i,: h (2) ) p(h (2) i = 1 h (1) ) = σ(h (1)T W (2) :,i ) However, the posterior distribution of all hidden layers given the visible layer does not factorize because of interactions between layers p(h (1), h (2) v) j p(h (1) j v) k p(h (2) k v) pproximate inference needs to be sought

Peng et al.: Deep Learning and Practice 29 DBM Mean Field Inference To construct a factorial Q(h v) for approximating p(h v) p(h (1), h (2) v) Q(h v) = j q(h (1) j v) k q(h (2) k v) In the present case, all hidden variables h (1) j, h (2) k are binary; these q(h v) must have a functional form of the Bernoulli distribution, i.e. q(h (1) j q(h (2) k v) = (ĥ(1) j v) = (ĥ(2) k ) h(1) j )h(2) (1 ĥ(1) j k (1 ĥ (2) k ) (1 h(1) j ), i )(1 h(2) k ), k where ĥ(1) j, ĥ(2) k [0, 1] are the corresponding parameters

Peng et al.: Deep Learning and Practice 30 Carrying out the expectation (needs some work) q j (h j v) = exp(e q j (log p(v, h (1), h (2) ; θ))) yields the following fixed-point update equations ( ĥ (1) j = σ v i W (1) i,j + i k W (2) j,k ĥ(2) k ), j ĥ (2) k = σ j W (2) j,k ĥ(1) j, k

Peng et al.: Deep Learning and Practice 31 DBM Parameter Learning DBM learning has to confront both the intractable inference p(h v) and the intractable partition function Z(θ) Combined variational inference, learning, and MCMC is necessary The objective then becomes to find W (1), W (2) that minimize L(Q, θ) = v i W (1) i,j ĥ(1) j + ĥ (1) j W (2) j,k ĥ(2) k log Z(θ)+H(Q) i j j which can be done via gradient descent (study lgorithm 20.1) θ = θ ε θ L(Q, θ) k In general, layer-wise pre-training is needed to arrive at a good model

Peng et al.: Deep Learning and Practice 32

Peng et al.: Deep Learning and Practice 33 Topics Not Covered Optimization for training deep models (Chapter 8) Representation learning (Chapter 15) Back-prop through random operations (REINFORCE, Chapter 20) BM for real-valued data (Chapter 20) Generative Stochastic Networks (Chapter 20) Deep Belief Networks (Chapter 20) Other generative models (Chapter 20)