Energy-Based Generative Adversarial Network
J. Zhao, M. Mathieu and Y. LeCun

Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning
D. Wang and Q. Liu

Presented by Chunyuan Li
Preliminaries: vanilla GAN

1. Real data distribution $P_r(x)$, where $x \in \mathcal{X}$;
2. Fake data distribution $P_g(x)$, implemented via a generator as $x = G(z)$, $z \sim P(z)$, where $z \in \mathcal{Z}$.

Objective: $\min_G \max_D V(D, G)$.

Discriminator $D : \mathcal{X} \to [0, 1]$:
$$\mathcal{L}_d = -\mathbb{E}_{x \sim P_r}[\log D(x)] - \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] \quad (1)$$
$D(x)$: the probability that $x$ comes from the real data rather than the generator.

Generator $G : \mathcal{Z} \to \mathcal{X}$:
$$\mathcal{L}_g = \mathbb{E}_{x \sim P_g}[\log(1 - D(x))] = \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))] \quad (2)$$
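As a reference point, a minimal PyTorch sketch of losses (1) and (2); the networks `D` (sigmoid output) and `G`, the data batch `x_real`, and the noise `z` are hypothetical placeholders, not taken from either paper:

```python
# Sketch of the vanilla GAN losses (1) and (2), assuming `D` outputs
# probabilities in [0, 1] and `G` maps noise z to data space.
import torch

def gan_losses(D, G, x_real, z):
    eps = 1e-8                      # numerical stability for log
    d_real = D(x_real)              # D(x), x ~ P_r
    d_fake = D(G(z))                # D(G(z)), z ~ P(z)
    # Eq. (1): the discriminator maximizes log D(x) + log(1 - D(G(z))),
    # i.e. minimizes the negated sum.
    loss_d = -(torch.log(d_real + eps) + torch.log(1 - d_fake + eps)).mean()
    # Eq. (2): the generator minimizes log(1 - D(G(z))).
    loss_g = torch.log(1 - d_fake + eps).mean()
    return loss_d, loss_g
```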
Ideas of the two papers

1. EBGAN
   - An energy-based formulation for adversarial training
   - An autoencoder is employed as the energy function
   - A repelling force for diversity
2. SteinGAN
   - Amortized SVGD algorithm
   - A probabilistic version of EBGAN
EBGAN

Discriminator $D : \mathcal{X} \to \mathbb{R}$:
$$\mathcal{L}_d = \mathbb{E}_{x \sim P_r}[D(x)] + \mathbb{E}_{x \sim P_g}[\max(0, m - D(x))] \quad (3)$$
where $m$ is a positive margin.

Generator $G : \mathcal{Z} \to \mathcal{X}$:
$$\mathcal{L}_g = \mathbb{E}_{x \sim P_g}[D(x)] = \mathbb{E}_{z \sim P(z)}[D(G(z))] \quad (4)$$
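A sketch of the margin losses (3) and (4); here `D` is assumed to be an energy network returning a per-sample scalar, and the margin value is illustrative:

```python
# Sketch of the EBGAN losses (3) and (4); `D` is an energy function
# X -> R (not a probability) and `m` is the positive margin.
import torch

def ebgan_losses(D, G, x_real, z, m=10.0):
    e_real = D(x_real)                       # energy of real samples
    e_fake = D(G(z))                         # energy of generated samples
    # Eq. (3): push real energies down, fake energies up toward the margin m.
    loss_d = e_real.mean() + torch.clamp(m - e_fake, min=0).mean()
    # Eq. (4): the generator lowers the energy of its own samples.
    loss_g = e_fake.mean()
    return loss_d, loss_g
```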
EBGAN: Autoencoder as the energy function

Discriminator:
$$D(x) = \|Dec(Enc(x)) - x\| \quad (5)$$
$$\mathcal{L}_d = \mathbb{E}_{x \sim P_r}[\|Dec(Enc(x)) - x\|] + \mathbb{E}_{x \sim P_g}[\max(0, m - \|Dec(Enc(x)) - x\|)] \quad (6)$$

Generator:
$$\mathcal{L}_g = \mathbb{E}_{z \sim P(z)}\|Dec(Enc(G(z))) - G(z)\| \quad (7)$$

Figure: EBGAN architecture.
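A possible implementation of energy (5), assuming hypothetical `enc`/`dec` modules for the encoder and decoder:

```python
# Sketch of the autoencoder energy (5): the per-sample
# reconstruction error ||Dec(Enc(x)) - x||.
import torch

def autoencoder_energy(enc, dec, x):
    recon = dec(enc(x))
    # Flatten each sample and take its Euclidean reconstruction error.
    return (recon - x).flatten(start_dim=1).norm(dim=1)
```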
Repelling regularizer

$$\mathcal{L}_{PT}(S) = \frac{1}{bs(bs - 1)} \sum_i \sum_{j \neq i} \left( \frac{S_i^\top S_j}{\|S_i\| \|S_j\|} \right)^2 \quad (8)$$

- $bs$: batch size
- $S_i$: the $i$-th sample in minibatch $S$, taken at the feature level $Enc(G(z))$

The term decreases the similarity between sample representations within the minibatch, making them as mutually orthogonal as possible, which in turn increases the diversity of the generated samples.

Regularized generator loss:
$$\mathcal{L}'_g = \mathcal{L}_g + \lambda \mathcal{L}_{PT}, \quad \lambda = 0.1 \quad (9)$$
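Equation (8) reduces to a few tensor operations; a sketch assuming `S` is the $(bs, d)$ matrix of encoder features:

```python
# Sketch of the repelling (pulling-away) term (8) on a minibatch of
# encoder features S = Enc(G(z)), shape (bs, d).
import torch

def pulling_away_term(S):
    bs = S.shape[0]
    S = S / S.norm(dim=1, keepdim=True)                # row-normalize: S_i / ||S_i||
    cos2 = (S @ S.t()).pow(2)                          # squared cosine similarities
    off_diag = cos2 - torch.eye(bs, device=S.device)   # zero out the i == j terms
    return off_diag.sum() / (bs * (bs - 1))
```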
SteinGAN: SVGD for $P$

Idea: approximate the target distribution with a set of particles.

SVGD updates the particles iteratively, $x_i \leftarrow x_i + \epsilon \phi(x_i)$, where $\phi(x)$ is a particle gradient direction chosen to maximally decrease the KL divergence:
$$\phi^* = \underset{\phi \in \mathcal{F}}{\arg\max} \left\{ -\frac{d}{d\epsilon} \mathrm{KL}\left(P_g^{[\epsilon\phi]} \,\Big\|\, P\right) \Big|_{\epsilon=0} \right\} \quad (10)$$

- $P_g^{[\epsilon\phi]}$ denotes the approximate density of the updated particles
- $\mathcal{F}$ is the set of perturbation directions that we optimize over

When $\mathcal{F}$ is an RKHS, there exists a closed-form solution, approximated with $n$ particles:
$$\phi^*(x_i) \equiv \Delta x_i = \frac{1}{n} \sum_{j=1}^{n} \left[ \nabla_{x_j} \log p(x_j)\, k(x_j, x_i) + \nabla_{x_j} k(x_j, x_i) \right] \quad (11)$$

- 1st term: drives the particles toward the high-probability regions of $P$
- 2nd term: a repulsive force that encourages diversity
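A sketch of one SVGD update (11) with an RBF kernel; `grad_log_p`, an assumed callable returning the score $\nabla_x \log p(x)$ of the target, and the step/bandwidth values are illustrative:

```python
# Sketch of one SVGD particle update, Eq. (11), with an RBF kernel.
import torch

def svgd_step(x, grad_log_p, eps=1e-2, h=1.0):
    n = x.shape[0]
    diff = x.unsqueeze(1) - x.unsqueeze(0)        # diff[i, j] = x_i - x_j
    k = torch.exp(-diff.pow(2).sum(-1) / h**2)    # k(x_j, x_i), shape (n, n)
    # grad_{x_j} k(x_j, x_i) = (2 / h^2) (x_i - x_j) k(x_j, x_i)
    repulse = (2.0 / h**2) * (diff * k.unsqueeze(-1)).sum(1)
    phi = (k @ grad_log_p(x) + repulse) / n       # Eq. (11): attract + repel
    return x + eps * phi                          # x_i <- x_i + eps * phi(x_i)
```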
Amortized SVGD

Disadvantage of SVGD: generation/inference cannot improve based on experience from past tasks.

Solution: train an $\eta$-parameterized neural network $f(\eta; z)$, $z \sim \mathcal{N}(0, I)$, to mimic the SVGD dynamics. An incremental approach iteratively adjusts $\eta$ so that the change in the network outputs fits the change in SVGD:
$$\eta^{t+1} \leftarrow \underset{\eta}{\arg\min} \sum_{i=1}^{m} \left\| f(\eta; z_i) - f(\eta^t; z_i) - \epsilon \Delta x_i \right\|^2 \quad (12)$$

Combined with the Taylor expansion $f(\eta; z_i) \approx f(\eta^t; z_i) + \partial_\eta f(\eta^t; z_i)(\eta - \eta^t)$:
$$\eta^{t+1} \leftarrow \eta^t + \epsilon \sum_{i=1}^{m} \partial_\eta f(\eta^t; z_i)\, \Delta x_i \quad (13)$$

This is a chain rule that backpropagates the Stein variational gradient to the network parameter $\eta$.
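A sketch of update (13), reusing `svgd_step` from the previous sketch; the surrogate-loss trick (backpropagating a fixed direction through the generator output) is one standard way to realize the chain rule, assumed here rather than taken from the paper:

```python
# Sketch of the amortized update (13): push the SVGD directions
# Delta x_i back into the generator parameters eta.
import torch

def amortized_svgd_update(f, opt, z, grad_log_p, eps=1e-2):
    x = f(z)                                                 # particles x_i = f(eta; z_i)
    x0 = x.detach()
    delta = (svgd_step(x0, grad_log_p, eps) - x0).detach()   # eps * phi(x_i)
    # d(-<x, delta>)/d(eta) = -sum_i (d_eta f(z_i))^T delta_i, so one descent
    # step moves eta along sum_i d_eta f(z_i)^T (eps phi(x_i)), matching (13)
    # up to the optimizer's learning rate.
    loss = -(x * delta).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```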
Maximum likelihood with energy functions

Take the empirical data distribution as $P_r$ and build an $\omega$-parameterized model $p(x|\omega)$ with energy function $f(x, \omega)$:
$$p(x|\omega) = \exp\{-f(x, \omega) - F(\omega)\}, \quad F(\omega) = \log \int \exp\{-f(x, \omega)\}\, dx \quad (14)$$
where $F(\omega)$ is the log-partition function that normalizes the model density $p_\omega(x)$.

The MLE for the entire dataset is:
$$\max_\omega \left\{ L(\omega; x) = \mathbb{E}_{x \sim P_r}[\log p(x|\omega)] \right\} \quad (15)$$

Gradient:
$$\nabla_\omega L = -\mathbb{E}_{P_r}[\nabla_\omega f(x, \omega)] + \mathbb{E}_{p_\omega(x)}[\nabla_\omega f(x, \omega)] \quad (16)$$

The expectation in the 2nd term is intractable; $P_g$ is used as its approximation.
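Approximating the model expectation in (16) with generator samples yields a simple surrogate loss; in this sketch `f` is the energy network and `x_fake` a batch assumed to be drawn from the generator:

```python
# Sketch of the gradient estimator for (16): the intractable expectation
# under p_omega(x) is approximated with generator samples x_fake ~ P_g.
import torch

def energy_mle_loss(f, x_real, x_fake):
    # Minimizing this loss follows -grad_omega L in Eq. (16):
    # push f down on data, up on approximate model samples.
    return f(x_real).mean() - f(x_fake.detach()).mean()
```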
Amortized SVGD for GAN

Reformulate SteinGAN with the notation of EBGAN:
$$D(x) = f(\omega, x) \quad \text{and} \quad G(z) = f(\eta, z) \quad (17)$$

1. Discriminator (update $\omega$ with SGD):
$$\mathcal{L}_d = \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{x \sim P_g}[D(x)] \quad (18)$$
Decreases the energy of the true data and increases the energy of the simulated data.

2. Generator (update $\eta$ with amortized SVGD):
$$\mathcal{L}_g = \mathbb{E}_{x \sim P_g}[D(x)] = \mathbb{E}_{z \sim P(z)}[D(G(z))] \quad (19)$$
Decreases the energy of the simulated data to better fit $p(x|\omega)$.

3. Specification of $D(x)$ and $k(x, x')$:
$$D(x) = \|Dec(Enc(x)) - x\| \quad (20)$$
$$k(x, x') = \exp\left(-\frac{1}{h^2} \|Enc(x) - Enc(x')\|^2\right), \quad h \text{ is the bandwidth} \quad (21)$$
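Putting the pieces together, a sketch of one SteinGAN iteration under notation (17)–(21); for simplicity it reuses the plain RBF `svgd_step` above rather than the encoder-feature kernel (21), and all names are illustrative:

```python
# Sketch of one SteinGAN iteration: `D` is the autoencoder energy (20),
# `G` the generator; the SVGD target is p(x|omega) ∝ exp(-D(x)).
import torch

def steingan_iteration(D, G, opt_d, opt_g, x_real, z, eps=1e-2):
    # Discriminator, Eq. (18): energy down on data, up on samples.
    loss_d = D(x_real).mean() - D(G(z).detach()).mean()
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator, Eq. (19) via amortized SVGD; since p(x|omega) ∝ exp(-D(x)),
    # the score is grad_x log p(x) = -grad_x D(x).
    def grad_log_p(x):
        x = x.detach().requires_grad_(True)
        return -torch.autograd.grad(D(x).sum(), x)[0]

    amortized_svgd_update(G, opt_g, z, grad_log_p, eps)
```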
Inception Score (IS): Evaluation Metric

A pretrained Inception model $p(y|x)$ is applied to every generated image.

- C1: For higher quality of a single image, a conditional $p(y|x)$ with lower entropy is expected.
- C2: For higher diversity across all images, a marginal $\hat{p}(y) = \mathbb{E}_{z \sim P(z)}\, p(y|x = G(z))$ with higher entropy is expected.

Combining C1 and C2, the IS is obtained as:
$$\exp\left(\mathbb{E}_{P_g}\left[\mathrm{KL}(p(y|x) \,\|\, \hat{p}(y))\right]\right) \quad (22)$$
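A sketch of (22) given a matrix of Inception posteriors for the generated images; the input name and the stabilizing `eps` are assumptions:

```python
# Sketch of the Inception Score (22); `p_yx` is a (num_images, num_classes)
# tensor of class posteriors p(y|x) from the pretrained Inception model.
import torch

def inception_score(p_yx, eps=1e-12):
    p_y = p_yx.mean(dim=0, keepdim=True)       # marginal p_hat(y)
    # Per-image KL(p(y|x) || p_hat(y)), then average and exponentiate.
    kl = (p_yx * ((p_yx + eps).log() - (p_y + eps).log())).sum(dim=1)
    return kl.mean().exp()
```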
Results

1. EBGAN
   - MNIST dataset
   - Tested various architectures and optimization methods (512 experiments)
2. SteinGAN
   - CIFAR-10 dataset
   - Inception Scores: Real Training Set 9.848, DCGAN 7.368, SteinGAN 7.428