MMD GAN & Fisher GAN


1 MMD GAN & Fisher GAN
MMD GAN: Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos (CMU, IBM Research)
Fisher GAN: Youssef Mroueh and Tom Sercu (IBM Research)
Presented by Rui-Yi (Roy) Zhang, December 1, 2017

2 Outline
1. Overview
2. MMD GAN
3. Fisher GAN
4. Conclusion

3 Preliminaries: Generative Adversarial Nets (GANs)
Real data distribution $P_X$; fake data distribution $P_\theta$, implemented via a generator as $x = G(z)$, $z \sim P(z)$, where $z \in \mathcal{Z}$.
Objective: $\min_G \max_D V(D, G)$.
Discriminator $D : \mathcal{X} \to [0, 1]$:
$$L_d = \mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{x \sim P_\theta}[\log(1 - D(x))] \qquad (1)$$
$D(x)$: the probability that $x$ comes from the real data rather than the generator.
Generator $G : \mathcal{Z} \to \mathcal{X}$:
$$L_g = \mathbb{E}_{x \sim P_\theta}[\log(1 - D(x))] = \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))] \qquad (2)$$
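As a rough illustration (a sketch, not code from the papers), the objectives in Eqs. (1)-(2) translate directly into loss functions for a discriminator network that outputs probabilities; D and G below are assumed to be PyTorch modules, with D ending in a sigmoid.

```python
# Minimal sketch of the GAN objectives in Eqs. (1)-(2); D is assumed to end in a
# sigmoid so that D(x) lies in [0, 1], and x_fake = G(z) for z ~ P(z).
import torch

def discriminator_objective(D, x_real, x_fake, eps=1e-8):
    # V(D, G) = E_{x~P_X}[log D(x)] + E_{x~P_theta}[log(1 - D(x))];
    # the discriminator ascends this (equivalently, minimizes its negation).
    return (torch.log(D(x_real) + eps).mean()
            + torch.log(1.0 - D(x_fake) + eps).mean())

def generator_loss(D, x_fake, eps=1e-8):
    # L_g = E_{x~P_theta}[log(1 - D(x))], minimized over the generator (Eq. (2)).
    return torch.log(1.0 - D(x_fake) + eps).mean()
```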

4 Maximum Mean Discrepancy (MMD)
Given two distributions $P$ and $Q$, and a kernel $\kappa$, the square of the MMD distance is defined as
$$M_\kappa(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}^2 \qquad (3)$$
$$= \mathbb{E}_P[\kappa(x, x')] - 2\,\mathbb{E}_{P,Q}[\kappa(x, y)] + \mathbb{E}_Q[\kappa(y, y')] \qquad (4)$$
If $\kappa$ is a characteristic kernel, then $M_\kappa(P, Q) = 0$ iff $P = Q$ (one kind of two-sample test).
In practice we use finite samples from the distributions to estimate the MMD distance. Given $X = \{x_1, \ldots, x_n\} \sim P$ and $Y = \{y_1, \ldots, y_n\} \sim Q$, one estimator of $M_\kappa(P, Q)$ is
$$\hat{M}_\kappa(X, Y) = \frac{1}{n(n-1)} \sum_{i \neq i'} \kappa(x_i, x_{i'}) - \frac{2}{n^2} \sum_{i, j} \kappa(x_i, y_j) + \frac{1}{n(n-1)} \sum_{j \neq j'} \kappa(y_j, y_{j'}) \qquad (5)$$
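To make the sums in Eq. (5) concrete, here is a small sketch (not from the paper) of the estimator with a single Gaussian kernel; X and Y are assumed to be PyTorch tensors holding one sample per row.

```python
# Sketch of the unbiased MMD^2 estimator in Eq. (5) with a Gaussian kernel;
# X is (n, d) and Y is (m, d).
import torch

def gaussian_kernel(a, b, sigma=1.0):
    d2 = torch.cdist(a, b, p=2) ** 2            # pairwise squared distances
    return torch.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    n, m = X.shape[0], Y.shape[0]
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Drop the diagonal of the within-sample terms (unbiased estimator).
    term_xx = (Kxx.sum() - Kxx.diagonal().sum()) / (n * (n - 1))
    term_yy = (Kyy.sum() - Kyy.diagonal().sum()) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()
```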

5 MMD with Kernel Learning
GMMN [1] trains $g_\theta$ with a pre-specified kernel $\kappa$:
$$\min_\theta M_\kappa(P_X, P_\theta) \qquad (6)$$
MMD GAN trains $g_\theta$ against the worst case over a set $\mathcal{K}$ of possible characteristic kernels (more difficult to optimize):
$$\min_\theta \max_{\kappa \in \mathcal{K}} M_\kappa(P_X, P_\theta) \qquad (7)$$
If $f_\phi$ is an injective function and $\kappa$ is characteristic, the resulting composed kernel $\tilde{\kappa} = \kappa \circ f_\phi$ is also characteristic. In practice MMD GAN chooses Gaussian kernels:
$$\tilde{\kappa}(x, x') = \exp\big(-\|f_\phi(x) - f_\phi(x')\|^2\big)$$
[1] Li, Yujia, Kevin Swersky, and Richard Zemel. "Generative Moment Matching Networks." ICML 2015.
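The composed kernel above is simply a Gaussian kernel evaluated on encoder features. A hypothetical sketch, where f_phi stands for any feature-extractor module:

```python
# Sketch of the composed kernel kappa_tilde(x, x') = exp(-||f_phi(x) - f_phi(x')||^2);
# f_phi is a placeholder encoder (any torch.nn.Module mapping inputs to feature vectors).
import torch

def composed_gaussian_kernel(f_phi, x, x_prime, bandwidth=1.0):
    hx, hx_prime = f_phi(x), f_phi(x_prime)          # embed into the code space
    d2 = torch.cdist(hx, hx_prime, p=2) ** 2         # squared feature distances
    return torch.exp(-d2 / bandwidth)
```

Such a kernel could be plugged into the MMD estimator sketched earlier; the paper's exact kernel choices (e.g., the set of bandwidths) may differ.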

6 MMD GAN
Assume $g_\theta$ is locally Lipschitz; the gradient $\nabla_\theta \big( \max_\phi M_{f_\phi}(P_X, P_{g_\theta(Z)}) \big)$ has to be bounded: weight clipping is used as a Lipschitz approximation.
Approximate the injective function $f_\phi$ by an autoencoder. The objective is relaxed to
$$\min_\theta \max_\phi M_{f_{\phi_e}}\big(P(X), P(g_\theta(Z))\big) - \lambda\, \mathbb{E}_{y \in X \cup g(Z)} \big\| y - f_{\phi_d}(f_{\phi_e}(y)) \big\|.$$
E1: kernel selection via learning.
E2: $f_{\phi_e}$ acts as a feature transformation; the kernel two-sample test is performed on the code space.

7 MMD GAN Algorithm

Algorithm 1: MMD GAN
input: learning rate $\alpha$, clipping parameter $c$, batch size $B$, number of discriminator iterations per generator update $n_c$.
initialize generator parameters $\theta$ and discriminator parameters $\phi$
while $\theta$ has not converged do
    for $t = 1, \ldots, n_c$ do
        Sample minibatches $\{x_i\}_{i=1}^B \sim P(X)$ and $\{z_j\}_{j=1}^B \sim P(Z)$
        $g_\phi \leftarrow \nabla_\phi \big[ M_{f_{\phi_e}}(P(X), P(g_\theta(Z))) - \lambda\, \mathbb{E}_{y \in X \cup g(Z)} \| y - f_{\phi_d}(f_{\phi_e}(y)) \| \big]$
        $\phi \leftarrow \phi + \alpha \cdot \mathrm{RMSProp}(\phi, g_\phi)$
        $\phi \leftarrow \mathrm{clip}(\phi, -c, c)$
    Sample minibatches $\{x_i\}_{i=1}^B \sim P(X)$ and $\{z_j\}_{j=1}^B \sim P(Z)$
    $g_\theta \leftarrow \nabla_\theta\, M_{f_{\phi_e}}(P(X), P(g_\theta(Z)))$
    $\theta \leftarrow \theta - \alpha \cdot \mathrm{RMSProp}(\theta, g_\theta)$
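Below is a condensed, hypothetical PyTorch-style rendering of one outer iteration of Algorithm 1. The names g_theta, f_enc, f_dec (the autoencoder playing the role of the discriminator), mmd2 (an estimator such as the one sketched earlier), data_loader, sample_z, and the RMSprop optimizers opt_phi / opt_theta are all placeholders; the reconstruction penalty is applied to the real batch only for brevity.

```python
# Hypothetical sketch of one iteration of Algorithm 1 (MMD GAN); all names are placeholders.
import torch

def mmd_gan_iteration(g_theta, f_enc, f_dec, mmd2, data_loader, sample_z,
                      opt_phi, opt_theta, lam=1.0, clip=0.01, n_c=5):
    for _ in range(n_c):                                  # n_c discriminator updates
        x = next(data_loader)
        fake = g_theta(sample_z(x.shape[0])).detach()     # stop gradients to g_theta
        recon = (f_dec(f_enc(x)) - x).pow(2).mean()       # autoencoder penalty (real batch)
        loss_phi = -(mmd2(f_enc(x), f_enc(fake)) - lam * recon)
        opt_phi.zero_grad(); loss_phi.backward(); opt_phi.step()
        for p in list(f_enc.parameters()) + list(f_dec.parameters()):
            p.data.clamp_(-clip, clip)                    # weight clipping (Lipschitz)
    x = next(data_loader)
    fake = g_theta(sample_z(x.shape[0]))
    loss_theta = mmd2(f_enc(x), f_enc(fake))              # generator minimizes MMD on the code space
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()
```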

8 MMD GAN Experiments I
[Figure: generated samples. Panels: (a) WGAN MNIST, (b) WGAN CelebA, (c) WGAN LSUN; (d) MMD GAN MNIST, (e) MMD GAN CelebA, (f) MMD GAN LSUN.]

9 MMD GAN Experiments II
[Table: Inception scores (mean ± std) for real data, DFM, ALI, Improved GANs, MMD GAN, WGAN, GMMN-C, and GMMN-D.]
[Table: Computation time.]
[Figure: samples on (a) MNIST, (b) CelebA, (c) LSUN Bedrooms.]

10 Gradient penalty & without reconstruction loss
[Figure: MMD GAN results using gradient penalty and without auto-encoder reconstruction loss during training. Panels: (a) CIFAR-10, (b) CelebA.]

11 Fisher's Linear Discriminant Analysis (LDA)
Utilize the label information (fake or real in a GAN) to find informative projections. Two-class Fisher LDA maximizes the objective
$$J(v) = \frac{v^\top S_B v}{v^\top S_W v} \qquad (8)$$
where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix:
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top \qquad (9)$$
$$S_W = \sum_{i \in \{1,2\}} \sum_{x \in \mathcal{C}_i} (x - \mu_i)(x - \mu_i)^\top \qquad (10)$$
Equivalent constrained optimization:
$$\min_v\; -\tfrac{1}{2}\, v^\top S_B v \qquad (11)$$
$$\text{s.t.}\quad v^\top S_W v = 1 \qquad (12)$$
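For intuition (our own note, not on the slide): under the constraint $v^\top S_W v = 1$, the maximizer of $J(v)$ is the leading generalized eigenvector of $(S_B, S_W)$, which in the two-class case has the closed form $v^* \propto S_W^{-1}(\mu_1 - \mu_2)$. A minimal NumPy sketch:

```python
# Two-class Fisher LDA direction, v* proportional to S_W^{-1} (mu_1 - mu_2);
# X1 and X2 are (n_i, d) NumPy arrays holding the samples of each class.
import numpy as np

def fisher_lda_direction(X1, X2):
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of (x - mu_i)(x - mu_i)^T over both classes.
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    v = np.linalg.solve(S_W, mu1 - mu2)
    return v / np.linalg.norm(v)
```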

12 Fisher IPM
Integral Probability Metric (IPM) framework: given two probability distributions $P, Q \in \mathcal{P}(\mathcal{X})$, the IPM indexed by a symmetric function space $\mathcal{F}$ is defined as
$$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \Big\{ \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{x \sim Q} f(x) \Big\}. \qquad (13)$$
The Fisher IPM for a function space $\mathcal{F}$ is defined as
$$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \frac{\mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)]}{\sqrt{\tfrac{1}{2}\mathbb{E}_{x \sim P} f^2(x) + \tfrac{1}{2}\mathbb{E}_{x \sim Q} f^2(x)}}. \qquad (14)$$
The constrained form:
$$d_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F},\; \tfrac{1}{2}\mathbb{E}_{x \sim P} f^2(x) + \tfrac{1}{2}\mathbb{E}_{x \sim Q} f^2(x) = 1} \mathcal{E}(f) := \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)]. \qquad (15)$$

13 Fisher GAN
The generator $g_\theta$ minimizes the Fisher IPM: $\min_{g_\theta} d_{\mathcal{F}_p}(P_X, P_\theta)$.
Given samples $\{x_i,\, i = 1 \ldots N\}$ from $P_X$ and samples $\{z_j,\, j = 1 \ldots M\}$ from $p_z$, we solve the following empirical problem:
$$\min_{g_\theta} \sup_{f_p \in \mathcal{F}_p} \hat{\mathcal{E}}(f_p, g_\theta) := \frac{1}{N} \sum_{i=1}^{N} f_p(x_i) - \frac{1}{M} \sum_{j=1}^{M} f_p(g_\theta(z_j)) \qquad (16)$$
$$\text{subject to}\quad \hat{\Omega}(f_p, g_\theta) = \frac{1}{2N} \sum_{i=1}^{N} f_p^2(x_i) + \frac{1}{2M} \sum_{j=1}^{M} f_p^2(g_\theta(z_j)) = 1 \qquad (17)$$
Fisher GAN with the Augmented Lagrangian Method (ALM):
$$\mathcal{L}_F(p, \theta, \lambda) = \hat{\mathcal{E}}(f_p, g_\theta) + \lambda\,\big(1 - \hat{\Omega}(f_p, g_\theta)\big) - \frac{\rho}{2}\,\big(\hat{\Omega}(f_p, g_\theta) - 1\big)^2 \qquad (18)$$
where $\lambda$ is the Lagrange multiplier and $\rho > 0$ (a hyper-parameter) is the quadratic penalty weight.
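A hypothetical PyTorch-style sketch of one Fisher GAN iteration under the augmented Lagrangian (18) follows; f_p (critic), g_theta (generator), data_loader, sample_z, and the optimizers are placeholders, and the multiplier update shown is the standard ALM ascent step on the constraint violation.

```python
# Hypothetical sketch of one Fisher GAN iteration with the augmented Lagrangian of Eq. (18).
import torch

def fisher_gan_iteration(f_p, g_theta, data_loader, sample_z,
                         opt_p, opt_theta, lam, rho=1e-6, n_c=5):
    for _ in range(n_c):                                       # critic updates
        x = next(data_loader)
        fake = g_theta(sample_z(x.shape[0])).detach()
        fx, ffake = f_p(x), f_p(fake)
        E_hat = fx.mean() - ffake.mean()                       # Eq. (16)
        Omega_hat = 0.5 * (fx.pow(2).mean() + ffake.pow(2).mean())  # Eq. (17)
        L_F = E_hat + lam * (1 - Omega_hat) - 0.5 * rho * (Omega_hat - 1) ** 2
        opt_p.zero_grad(); (-L_F).backward(); opt_p.step()     # gradient ascent on the critic
        lam = lam + rho * (Omega_hat.item() - 1.0)             # ALM multiplier update
    x = next(data_loader)
    fake = g_theta(sample_z(x.shape[0]))
    loss_theta = -f_p(fake).mean()   # generator minimizes E_hat (only the fake term depends on theta)
    opt_theta.zero_grad(); loss_theta.backward(); opt_theta.step()
    return lam                                                 # carry the multiplier forward
```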

14 Fisher GAN

15 Fisher IPM Interpretations
A whitened mean-matching interpretation. Consider the function space $\mathcal{F}_{v,\omega}$:
$$\mathcal{F}_{v,\omega} = \big\{ f(x) = \langle v, \Phi_\omega(x) \rangle \;\big|\; v \in \mathbb{R}^m,\ \Phi_\omega : \mathcal{X} \to \mathbb{R}^m \big\},$$
with the mean and covariance feature embeddings as in McGan:
$$\mu_\omega(P) = \mathbb{E}_{x \sim P}\big(\Phi_\omega(x)\big) \quad \text{and} \quad \Sigma_\omega(P) = \mathbb{E}_{x \sim P}\big(\Phi_\omega(x)\Phi_\omega(x)^\top\big).$$
The Fisher IPM on $\mathcal{F}_{v,\omega}$ can be written as
$$d_{\mathcal{F}_{v,\omega}}(P, Q) = \max_\omega \max_v \frac{\langle v,\, \mu_\omega(P) - \mu_\omega(Q) \rangle}{\sqrt{v^\top \big(\tfrac{1}{2}\Sigma_\omega(P) + \tfrac{1}{2}\Sigma_\omega(Q) + \gamma I_m\big) v}}, \qquad (19)$$
i.e., mean matching with a Mahalanobis distance:
$$d_{\mathcal{F}_{v,\omega}}(P, Q) = \max_\omega \sqrt{\big(\mu_\omega(P) - \mu_\omega(Q)\big)^\top \Sigma_\omega^{-1}(P; Q)\,\big(\mu_\omega(P) - \mu_\omega(Q)\big)}.$$

16 Fisher IPM Interpretations
[Figure: Illustration of Fisher IPM with neural networks. $\Phi_\omega$ is a convolutional neural network which defines the embedding space $\mathbb{R}^m$; real samples ($P$) and fake samples ($Q$) are mapped through $\Phi_\omega$. $v$ is the direction in this embedding space with maximal mean separation $\langle v, \mu_\omega(P) - \mu_\omega(Q) \rangle$, constrained by the hyperellipsoid $v^\top \Sigma_\omega(P; Q)\, v = 1$.]

17 Fisher GAN Theory
Theorem (Chi-squared distance at full capacity). Consider the Fisher IPM for $\mathcal{F}$ being the space of all measurable functions endowed with the measure $\tfrac{1}{2}(P + Q)$, i.e. $\mathcal{F} := L_2(\mathcal{X}, \tfrac{P+Q}{2})$. Define the Chi-squared distance between two distributions:
$$\chi_2(P, Q) = \sqrt{\int_{\mathcal{X}} \frac{(P(x) - Q(x))^2}{\tfrac{P(x)+Q(x)}{2}}\, dx} \qquad (20)$$
The following holds true for any $P, Q$, $P \neq Q$:
1) The Fisher IPM for $\mathcal{F} = L_2(\mathcal{X}, \tfrac{P+Q}{2})$ equals the Chi-squared distance defined above: $d_{\mathcal{F}}(P, Q) = \chi_2(P, Q)$.
2) The optimal critic of the Fisher IPM on $L_2(\mathcal{X}, \tfrac{P+Q}{2})$ is
$$f_\chi(x) = \frac{1}{\chi_2(P, Q)} \cdot \frac{P(x) - Q(x)}{\tfrac{P(x)+Q(x)}{2}}.$$
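One can check the two claims by substituting the optimal critic $f_\chi$ into the constrained form (15): the constraint holds with equality and the objective attains $\chi_2(P, Q)$,
$$\tfrac{1}{2}\,\mathbb{E}_{x \sim P} f_\chi^2(x) + \tfrac{1}{2}\,\mathbb{E}_{x \sim Q} f_\chi^2(x) = \int_{\mathcal{X}} \frac{P(x)+Q(x)}{2}\, f_\chi^2(x)\, dx = \frac{1}{\chi_2^2(P, Q)} \int_{\mathcal{X}} \frac{(P(x)-Q(x))^2}{\tfrac{P(x)+Q(x)}{2}}\, dx = 1,$$
$$\mathbb{E}_{x \sim P}[f_\chi(x)] - \mathbb{E}_{x \sim Q}[f_\chi(x)] = \int_{\mathcal{X}} \big(P(x)-Q(x)\big) f_\chi(x)\, dx = \frac{1}{\chi_2(P, Q)} \int_{\mathcal{X}} \frac{(P(x)-Q(x))^2}{\tfrac{P(x)+Q(x)}{2}}\, dx = \chi_2(P, Q).$$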

18 Fisher GAN Experiments I
[Figure: Samples and plots of the loss $\hat{\mathcal{E}}(\cdot)$, Lagrange multiplier $\lambda$, and constraint $\hat{\Omega}(\cdot)$ versus $g_\theta$ iterations on 3 benchmark datasets: (a) LSUN, (b) CelebA, (c) CIFAR-10. We see that during training, as $\lambda$ grows slowly, the constraint becomes tight.]

19 Fisher GAN Experiments II
Table: CIFAR-10 inception scores; Layer Normalization (LN) with ResNets.

Unsupervised:
    ALI                  5.34 ± .05
    BEGAN                5.6
    DCGAN                6.16 ± .07
    Improved GAN         6.86 ± .06
    EGAN-Ent-VI          7.07 ± .10
    DFM                  7.7 ± .13
    WGAN-GP ResNet       7.86 ± .07
    Fisher GAN ResNet    7.90 ± .05

Supervised:
    SteinGAN             6.35
    DCGAN (with labels)  6.58
    Improved GAN         8.09 ± .07
    Fisher GAN ResNet    8.16 ± .1
    AC-GAN               8.5 ± .07
    SGAN-no-joint        8.37 ± .08
    WGAN-GP ResNet       8.4 ± .10
    SGAN                 8.59 ± .1

20 Conclusion
[Table: Comparison of GANs (rows: Standard GAN; WGAN, McGan; WGAN-GP; MMD GAN; Fisher GAN) against the criteria Stability, Unconstrained, Efficient, Representation capacity, and Computation power (SSL).]
