Wasserstein GAN. Juho Lee. Jan 23, 2017

1 Wasserstein GAN Juho Lee Jan 23, 2017

2 Wasserstein GAN (WGAN)
- arXiv submission by Martin Arjovsky, Soumith Chintala, and Léon Bottou.
- A new GAN model that minimizes the Earth-Mover's distance (Wasserstein-1 distance).
- Stabilizes GAN training with far less mode collapse.
- Provides meaningful learning curves that are useful for debugging.

3 Towards principled methods for training generative adversarial networks
- ICLR 2017 (oral), Martin Arjovsky and Léon Bottou.
- Why do generator updates get worse as the discriminator gets better?
- Why is GAN training massively unstable?
- The impact of the log D(G(z)) trick: does it still follow the JSD?

4 Learning probability distributions
Given a set of observations $\{x_i\}_{i=1}^n$, assume a model distribution $P_\theta$ from a parametric family.
Select a distance measure $\rho(P_\theta, P_r)$ between the model distribution and the real distribution $P_r$.
Convergence: as $t \to \infty$, $\theta_t \to \theta$, so $P_{\theta_t} \to P_\theta$ in the sense that $\rho(P_{\theta_t}, P_\theta) \to 0$.
Desirable condition: the mapping $\theta \mapsto \rho(P_r, P_\theta)$ is continuous.

5 Distances between probability distributions I
Let $(\mathcal{X}, \Sigma)$ be a measurable space, where $\mathcal{X}$ is a compact metric set and $\Sigma$ is the Borel $\sigma$-algebra.
The Total Variation (TV) distance:
$$\delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)|.$$
The Kullback-Leibler (KL) divergence:
$$KL(P_r \,\|\, P_\theta) = \int \log\!\left(\frac{P_r(x)}{P_\theta(x)}\right) P_r(x)\, d\mu(x),$$
where both $P_r$ and $P_\theta$ are assumed to be absolutely continuous (and therefore admit densities) with respect to the same measure $\mu$ on $\mathcal{X}$.
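
As a small illustration (not from the slides), for discrete distributions both quantities reduce to sums over atoms; the snippet below computes them for two made-up probability vectors.

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])   # "real" distribution P_r (illustrative values)
q = np.array([0.4, 0.4, 0.2])   # "model" distribution P_theta

tv = 0.5 * np.abs(p - q).sum()  # total variation: sup_A |P_r(A) - P_theta(A)|
kl = entropy(p, q)              # KL(P_r || P_theta), natural log

print(f"TV = {tv:.3f}, KL = {kl:.4f}")
```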

6 Distances between probability distributions II
The Jensen-Shannon (JS) divergence:
$$JS(P_r, P_\theta) = \tfrac{1}{2} KL(P_r \,\|\, P_m) + \tfrac{1}{2} KL(P_\theta \,\|\, P_m), \qquad P_m := (P_r + P_\theta)/2.$$
The Earth-Mover's (EM) distance, or Wasserstein-1 distance:
$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|],$$
where $\Pi(P_r, P_\theta)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P_r$ and $P_\theta$.
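
In one dimension the EM distance between empirical samples has a closed form; the sketch below (not part of the slides) uses scipy.stats.wasserstein_distance on made-up Gaussian samples, where the distance is simply the mean shift.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two empirical 1-D distributions: "real" samples and mean-shifted "model" samples.
real = rng.normal(loc=0.0, scale=1.0, size=10_000)
model = rng.normal(loc=2.0, scale=1.0, size=10_000)

# Earth-Mover's (Wasserstein-1) distance between the two empirical measures;
# for two Gaussians differing only by a mean shift it equals the shift (~2 here).
print(wasserstein_distance(real, model))
```

In higher dimensions there is no such one-line routine, which is part of what motivates the dual formulation on slide 18.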

7 Distances between probability distributions III
Example: let $Z \sim \mathrm{Unif}([0, 1])$, let $P_0$ be the distribution of $(0, Z) \in \mathbb{R}^2$, and let $P_\theta$ be the distribution of $(\theta, Z)$. Then
$$KL(P_\theta \,\|\, P_0) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0, \end{cases} \qquad
JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0, \end{cases}$$
$$\delta(P_0, P_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0, \end{cases} \qquad
W(P_0, P_\theta) = |\theta|.$$
Only the EM distance is continuous in $\theta$, so only it provides a usable learning signal as the two supports approach each other.
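
A quick numerical sanity check of the last identity (added here, not in the slides): with equally weighted empirical samples of the same size, the optimal coupling is an optimal one-to-one assignment, which scipy can solve exactly.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, theta = 500, 0.7

# Empirical samples from P_0 (points (0, z)) and P_theta (points (theta, z)).
p0 = np.column_stack([np.zeros(n), rng.uniform(size=n)])
pt = np.column_stack([np.full(n, theta), rng.uniform(size=n)])

# With n equally weighted atoms on each side, the infimum over couplings is
# attained at an assignment, so empirical W1 = mean matched Euclidean distance.
cost = cdist(p0, pt)                     # pairwise Euclidean distances
row, col = linear_sum_assignment(cost)   # optimal matching
print(cost[row, col].mean())             # close to |theta| = 0.7
```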

8 Instability of GAN I
Original objective function:
$$L(D, g_\theta) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))].$$
For a fixed generator, the optimal discriminator is
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)},$$
and
$$L(D^*, g_\theta) = 2\,JS(P_r, P_g) - 2 \log 2.$$
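
For completeness (this step is not spelled out on the slide), substituting $D^*$ into $L$ recovers the JSD expression:
$$
\begin{aligned}
L(D^*, g_\theta)
&= \mathbb{E}_{x \sim P_r}\!\left[\log \frac{P_r(x)}{P_r(x)+P_g(x)}\right]
 + \mathbb{E}_{x \sim P_g}\!\left[\log \frac{P_g(x)}{P_r(x)+P_g(x)}\right] \\
&= KL\!\left(P_r \,\Big\|\, \tfrac{P_r+P_g}{2}\right)
 + KL\!\left(P_g \,\Big\|\, \tfrac{P_r+P_g}{2}\right) - 2\log 2 \\
&= 2\, JS(P_r, P_g) - 2\log 2 .
\end{aligned}
$$
Hence, with a perfect discriminator, training the generator amounts to minimizing the JS divergence, which slide 7 showed is constant when the supports do not overlap.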

9 Instability of GAN II
Theorem 1. Let $P_r$ and $P_g$ be two distributions whose supports are contained in two closed manifolds $\mathcal{M}$ and $\mathcal{P}$ that don't perfectly align and don't have full dimension. We further assume that $P_r$ and $P_g$ are continuous on their respective manifolds, meaning that if a set $A$ has measure 0 in $\mathcal{M}$, then $P_r(A) = 0$ (and analogously for $P_g$). Then there exists an optimal discriminator $D^*: \mathcal{X} \to [0, 1]$ that has accuracy 1, and for almost any $x$ in $\mathcal{M} \cup \mathcal{P}$, $D^*$ is smooth in a neighbourhood of $x$ and $\nabla_x D^*(x) = 0$.

10 Instability of GAN III
Theorem 2 (Vanishing gradients on the generator). Let $g_\theta: \mathcal{Z} \to \mathcal{X}$ be a differentiable function that induces a distribution $P_g$. If some regularity conditions are satisfied, $\|D - D^*\| < \epsilon$, and $\mathbb{E}_{z \sim p(z)}[\|J_\theta g_\theta(z)\|_2^2] \le M^2$, then
$$\left\| \nabla_\theta\, \mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))] \right\|_2 < M \frac{\epsilon}{1 - \epsilon}.$$

11 Instability of GAN IV

12 Instability of GAN V

13 The log D trick I
For the generator, instead of minimizing $\mathbb{E}_{z \sim p(z)}[\log(1 - D(g_\theta(z)))]$, minimize $-\mathbb{E}_{z \sim p(z)}[\log D(g_\theta(z))]$. This does not change the fixed points.
Theorem 3. Let $D^* = \frac{P_r}{P_r + P_g}$ be the optimal discriminator for a fixed $\theta = \theta_0$. Then
$$\mathbb{E}_{z \sim p(z)}\!\left[ -\nabla_\theta \log D^*(g_\theta(z)) \big|_{\theta=\theta_0} \right] = \nabla_\theta \left[ KL(P_{g_\theta} \,\|\, P_r) - 2\,JS(P_{g_\theta}, P_r) \right] \Big|_{\theta=\theta_0}.$$
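
A quick way to see why the trick matters (a sketch added here, not from the slides): write the discriminator output for a generated sample as $D = \sigma(a)$, with $a$ the pre-sigmoid logit. Then
$$\frac{\partial}{\partial a} \log\bigl(1 - \sigma(a)\bigr) = -\sigma(a), \qquad \frac{\partial}{\partial a} \bigl(-\log \sigma(a)\bigr) = \sigma(a) - 1.$$
When the discriminator confidently rejects a fake ($\sigma(a) \to 0$), the first gradient vanishes while the second stays near $-1$, so the alternative loss keeps pushing the generator. Theorems 3 and 4 describe the price paid: an odd $KL - 2\,JS$ objective and heavy-tailed gradient noise.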

14 The log D trick II
Theorem 4 (under some conditions). $\mathbb{E}_{z \sim p(z)}[-\nabla_\theta \log D(g_\theta(z))]$ follows a centered Cauchy distribution, with infinite expectation and variance.

15 Why should we use Wasserstein distance I
Theorem 5. Let $P_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable over another space $\mathcal{Z}$. Let $g: \mathcal{Z} \times \mathbb{R}^d \to \mathcal{X}$ be a function, denoted $g_\theta(z)$, with $z$ the first coordinate and $\theta$ the second. Let $P_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(P_r, P_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(P_r, P_\theta)$ is continuous everywhere and differentiable almost everywhere.
Statements 1 and 2 are false for the Jensen-Shannon and KL divergences.
If we choose $g_\theta$ to be any feedforward neural network parametrized by $\theta$, and $p(z)$ to be a prior with $\mathbb{E}_{z \sim p(z)}[\|z\|] < \infty$, then regularity assumption 1 is satisfied.

16 Why should we use Wasserstein distance II
Theorem 6. Let $P$ be a distribution on a compact space $\mathcal{X}$ and $(P_n)_{n \in \mathbb{N}}$ a sequence of distributions on $\mathcal{X}$. Then, as $n \to \infty$,
1. $\delta(P_n, P) \to 0$ if and only if $JS(P_n, P) \to 0$.
2. $W(P_n, P) \to 0$ if and only if $P_n \xrightarrow{D} P$, where $\xrightarrow{D}$ denotes convergence in distribution.
3. $KL(P_n \,\|\, P) \to 0$ or $KL(P \,\|\, P_n) \to 0$ implies the statements in 1, and the statements in 1 imply those in 2.
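
As a concrete instance (added for clarity, not on the slide), the parallel-lines example from slide 7 shows that these implications cannot be reversed: taking $\theta_n \to 0$ with $\theta_n \neq 0$,
$$W(P_{\theta_n}, P_0) = |\theta_n| \to 0 \quad\text{and}\quad P_{\theta_n} \xrightarrow{D} P_0, \qquad\text{while}\qquad JS(P_{\theta_n}, P_0) = \log 2, \quad \delta(P_{\theta_n}, P_0) = 1 \quad\text{for all } n.$$
So convergence in the EM distance is a strictly weaker, and therefore easier to achieve, requirement than convergence in JS, TV, or KL.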

17 Why should we use Wasserstein distance III

18 Approximating the Earth-Mover's distance
By the Kantorovich-Rubinstein duality [1],
$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)],$$
where the supremum is over all 1-Lipschitz functions $f: \mathcal{X} \to \mathbb{R}$. 1-Lipschitz can be replaced by $K$-Lipschitz, in which case the supremum becomes $K \cdot W(P_r, P_\theta)$.
Theorem 7. Let $P_r$ be any distribution, and let $P_\theta$ be the distribution of $g_\theta(Z)$ satisfying assumption 1. Then there exists a solution $f: \mathcal{X} \to \mathbb{R}$ to the problem
$$\max_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)],$$
and we have
$$\nabla_\theta W(P_r, P_\theta) = -\mathbb{E}_{z \sim p(z)}[\nabla_\theta f(g_\theta(z))]$$
when both terms are well-defined.

19 WGAN algorithm
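
The algorithm appears only as a figure on this slide. Below is a minimal PyTorch-style sketch of the training loop described in the WGAN paper: n_critic critic updates with weight clipping, then one generator update, both with RMSProp. The toy data and network sizes are illustrative placeholders; the learning rate, clipping constant, and n_critic follow the paper's defaults.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, data_dim = 8, 2

# Toy "real" data: a shifted 2-D Gaussian, standing in for the data distribution P_r.
def sample_real(batch_size):
    return torch.randn(batch_size, data_dim) * 0.5 + torch.tensor([2.0, -1.0])

generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
critic = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # no sigmoid

lr, clip, n_critic, batch_size = 5e-5, 0.01, 5, 64
opt_g = torch.optim.RMSprop(generator.parameters(), lr=lr)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=lr)

for step in range(1001):
    # Critic: maximize E[f(x)] - E[f(g(z))], i.e. minimize the negative of it.
    for _ in range(n_critic):
        real = sample_real(batch_size)
        fake = generator(torch.randn(batch_size, latent_dim)).detach()
        loss_c = critic(fake).mean() - critic(real).mean()
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # Crude enforcement of the Lipschitz constraint by weight clipping.
        with torch.no_grad():
            for p in critic.parameters():
                p.clamp_(-clip, clip)

    # Generator: minimize -E[f(g(z))].
    fake = generator(torch.randn(batch_size, latent_dim))
    loss_g = -critic(fake).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    if step % 200 == 0:
        # -loss_c estimates E[f(x)] - E[f(g(z))], the critic's surrogate for W(P_r, P_theta).
        print(f"step {step}: estimated W ~ {-loss_c.item():.3f}")
```

The printed estimate of $W(P_r, P_\theta)$ is the "meaningful learning curve" mentioned on slide 2: unlike the original GAN discriminator loss, it tracks sample quality as training progresses.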

20 Experiments I

21 Experiments II

22 [1] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009.
