Nishant Gurnani, GAN Reading Group, April 14th

Why are these Papers Important?
Recently a large number of GAN frameworks have been proposed: BGAN, LSGAN, DCGAN, DiscoGAN, ...
Wasserstein GAN is yet another GAN training algorithm; however, it is backed by rigorous theory in addition to good performance.
WGAN removes the need to balance generator updates with discriminator updates, thus removing a key source of training instability in the original GAN.
Empirical results show a correlation between discriminator loss and perceptual quality, thus providing a rough measure of training progress.
Goal: convince you that WGAN is the best GAN.

Outline
Introduction
Different Distances
Standard GAN
Wasserstein GAN
Improved Wasserstein GAN

What does it mean to learn a probability distribution?
When learning generative models, we assume the data we have comes from some unknown distribution P_r. We want to learn a distribution P_θ that approximates P_r, where θ are the parameters of the distribution. There are two approaches for doing this:
1. Directly learn the probability density function P_θ and then optimize it through maximum likelihood estimation.
2. Learn a function g_θ that transforms an existing distribution Z into P_θ. Here g_θ is some differentiable function, Z is a common distribution (usually uniform or Gaussian), and P_θ = g_θ(Z).

Maximum Likelihood approach is problematic
Recall that for continuous distributions P and Q the KL divergence is
$KL(P \| Q) = \int_x P(x) \log \frac{P(x)}{Q(x)}\, dx$
and, given the density P_θ, the MLE objective is
$\max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^m \log P_\theta(x^{(i)})$
In the limit (as m → ∞), samples follow the data distribution P_r, so
$\lim_{m\to\infty}\ \max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^m \log P_\theta(x^{(i)}) = \max_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_\theta(x)\, dx$
$\qquad = \min_{\theta \in \mathbb{R}^d} -\int_x P_r(x) \log P_\theta(x)\, dx$
$\qquad = \min_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_r(x)\, dx - \int_x P_r(x) \log P_\theta(x)\, dx$   (adding $\int_x P_r(x) \log P_r(x)\, dx$, which does not depend on θ)
$\qquad = \min_{\theta \in \mathbb{R}^d} KL(P_r \| P_\theta)$
Note that if P_θ = 0 at an x where P_r > 0, the KL divergence goes to +∞ (bad for the MLE if P_θ has low dimensional support).
The typical remedy is to add a noise term to the model distribution to ensure the distribution is defined everywhere.
This unfortunately introduces some error, and empirically people have needed to add a lot of random noise to make models train.
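To see the support issue numerically, here is a small sketch (distributions chosen only for illustration, not from the slides) of the discrete KL divergence blowing up when the model misses part of the data's support, and becoming finite again once a little noise spreads mass everywhere:

```python
import numpy as np

def kl(p, q):
    """Discrete KL(p || q); returns +inf if q is zero where p is positive."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p_r      = [0.5, 0.5, 0.0]           # data puts mass on the first two points
p_bad    = [1.0, 0.0, 0.0]           # model misses a point where p_r > 0
p_smooth = [0.49, 0.49, 0.02]        # a little noise everywhere keeps KL finite

print(kl(p_r, p_bad))     # inf
print(kl(p_r, p_smooth))  # small finite value (~0.02)
```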

Transforming an existing distribution
Shortcomings of the maximum likelihood approach motivate the second approach: learning a g_θ (a generator) that transforms a known distribution Z. Advantages of this approach:
Unlike densities, this approach can represent distributions confined to a low dimensional manifold.
It is very easy to generate samples: given a trained g_θ, simply sample random noise z ∼ Z and evaluate g_θ(z).
VAEs and GANs are well known examples of this approach. VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms. GANs offer much more flexibility, but their training is unstable.
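As a minimal illustration of the second approach, here is a sketch of sampling from P_θ = g_θ(Z) with a small PyTorch generator; the architecture, latent dimension, and Gaussian prior are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# g_theta: a differentiable map from noise z to the data space
g_theta = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, data_dim),
)

# Sampling from P_theta = g_theta(Z): draw z ~ N(0, I) and push it through g_theta
z = torch.randn(128, latent_dim)
samples = g_theta(z)       # 128 samples from the implicit distribution P_theta
print(samples.shape)       # torch.Size([128, 2])
```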

Transforming an existing distribution
To train g_θ (and by extension P_θ), we need a measure of distance between distributions, i.e. d(P_r, P_θ).
Distance Properties
A distance d is weaker than a distance d' if every sequence that converges under d' also converges under d.
Given a distance d, we can treat d(P_r, P_θ) as a loss function.
We can minimize d(P_r, P_θ) with respect to θ as long as the mapping θ → P_θ is continuous (true if g_θ is a neural network).
How do we define d?

Different Distances

Distance definitions
Notation
χ - compact metric set (such as the space of images [0, 1]^d)
Σ - set of all Borel subsets of χ
Prob(χ) - space of probability measures defined on χ
Total Variation (TV) distance
$\delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)|$
Kullback-Leibler (KL) divergence
$KL(P_r \| P_\theta) = \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)}\, d\mu(x)$
where both P_r and P_θ are assumed to be absolutely continuous with respect to the same measure µ defined on χ.

Distance Definitions
Jensen-Shannon (JS) divergence
$JS(P_r, P_\theta) = \tfrac{1}{2} KL(P_r \| P_m) + \tfrac{1}{2} KL(P_\theta \| P_m)$
where P_m is the mixture (P_r + P_θ)/2.
Earth-Mover (EM) distance or Wasserstein-1
$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}[\,\|x - y\|\,]$
where Π(P_r, P_θ) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_θ.
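To make these definitions concrete, the sketch below (illustrative distributions assumed, not from the slides) evaluates all four quantities for two discrete distributions with disjoint supports; TV, KL, and JS saturate or blow up, while the Wasserstein-1 distance still reflects how far apart the supports are (scipy handles the 1-D case).

```python
import numpy as np
from scipy.stats import wasserstein_distance

def tv(p, q):
    return 0.5 * float(np.abs(p - q).sum())        # sup_A |P(A) - Q(A)| for discrete P, Q

def kl(p, q):
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    m = 0.5 * (p + q)                              # mixture (P_r + P_theta) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

support = np.array([0.0, 1.0, 2.0, 3.0])
p_r     = np.array([0.5, 0.5, 0.0, 0.0])           # mass on {0, 1}
p_theta = np.array([0.0, 0.0, 0.5, 0.5])           # mass on {2, 3}: disjoint supports

print(tv(p_r, p_theta))                            # 1.0   (saturated)
print(kl(p_r, p_theta))                            # inf
print(js(p_r, p_theta))                            # log 2 (saturated)
print(wasserstein_distance(support, support, p_r, p_theta))  # 2.0 (still informative)
```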

Understanding the EM distance
Main Idea
Probability distributions are defined by how much mass they put on each point.
Imagine we started with distribution P_r and wanted to move mass around to change the distribution into P_θ.
Moving mass m by distance d costs m·d effort.
The earth mover distance is the minimal effort we need to spend.
Transport Plan
Each γ ∈ Π is a transport plan; to execute the plan, for all x, y, move γ(x, y) mass from x to y.

Understanding the EM distance
What properties does the plan need to satisfy to transform P_r into P_θ?
The amount of mass that leaves x is $\int_y \gamma(x, y)\, dy$. This must equal P_r(x), the amount of mass originally at x.
The amount of mass that enters y is $\int_x \gamma(x, y)\, dx$. This must equal P_θ(y), the amount of mass that ends up at y.
This shows why the marginals of γ ∈ Π must be P_r and P_θ. For scoring, the effort spent is
$\int_x \int_y \gamma(x, y)\, \|x - y\|\, dy\, dx = \mathbb{E}_{(x,y)\sim\gamma}[\,\|x - y\|\,]$
Computing the infimum of this over all valid γ gives the earth mover distance.
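To make the transport-plan formulation concrete, the following sketch (not from the slides; supports and weights are arbitrary illustrative assumptions) computes the EM distance between two small discrete distributions by solving the linear program over γ subject to the two marginal constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Discrete supports and marginals (arbitrary illustrative choices)
xs = np.array([0.0, 1.0, 2.0])          # support of P_r
ys = np.array([0.0, 1.0, 2.0])          # support of P_theta
p_r     = np.array([0.6, 0.3, 0.1])
p_theta = np.array([0.1, 0.3, 0.6])

n, m = len(xs), len(ys)
cost = np.abs(xs[:, None] - ys[None, :]).ravel()   # |x - y| for each pair, flattened

# Equality constraints: row sums of gamma equal P_r, column sums equal P_theta
A_eq = np.zeros((n + m, n * m))
for i in range(n):                      # mass leaving x_i
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):                      # mass entering y_j
    A_eq[n + j, j::m] = 1.0
b_eq = np.concatenate([p_r, p_theta])

res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                          # earth mover distance W(P_r, P_theta)
print(res.x.reshape(n, m))              # the optimal transport plan gamma(x, y)
```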

Learning Parallel Lines Example
Let Z ∼ U[0, 1] and let P_0 be the distribution of (0, Z) ∈ R², uniform on a straight vertical line passing through the origin. Now let g_θ(z) = (θ, z), with θ a single real parameter.
We'd like our optimization algorithm to learn to move θ to 0. As θ → 0, the distance d(P_0, P_θ) should decrease.

Learning Parallel Lines Example
For many common distance functions, this doesn't happen:
$\delta(P_0, P_\theta) = 1$ if θ ≠ 0, and $0$ if θ = 0
$JS(P_0, P_\theta) = \log 2$ if θ ≠ 0, and $0$ if θ = 0
$KL(P_\theta \| P_0) = KL(P_0 \| P_\theta) = +\infty$ if θ ≠ 0, and $0$ if θ = 0
$W(P_0, P_\theta) = |\theta|$
Only the EM distance varies continuously with θ, and so only it provides a usable learning signal as θ → 0.
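As a quick numerical sanity check of the example (a sketch assuming the optional POT library, `pip install pot`, is available; the value of θ and sample size are arbitrary), the empirical EM distance between samples of P_0 and P_θ comes out to roughly |θ|:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (assumed dependency)

theta = 0.5
z = np.random.rand(500)
P0     = np.stack([np.zeros_like(z), z], axis=1)         # samples of (0, Z)
Ptheta = np.stack([np.full_like(z, theta), z], axis=1)   # samples of (theta, Z)

a = b = np.full(len(z), 1.0 / len(z))                    # uniform sample weights
M = ot.dist(P0, Ptheta, metric='euclidean')              # pairwise Euclidean costs
print(ot.emd2(a, b, M))                                  # ~= |theta| = 0.5
```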

Theoretical Justification
Theorem (1)
Let P_r be a fixed distribution over χ. Let Z be a random variable over another space Z. Let g : Z × R^d → χ be a function, denoted g_θ(z), with z the first coordinate and θ the second. Let P_θ denote the distribution of g_θ(Z). Then,
1. If g is continuous in θ, so is W(P_r, P_θ).
2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(P_r, P_θ) is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(P_r, P_θ) and all the KLs.

Theoretical Justification
Theorem (2)
Let P be a distribution on a compact space χ and (P_n)_{n∈N} be a sequence of distributions on χ. Then, considering all limits as n → ∞,
1. The following statements are equivalent:
   δ(P_n, P) → 0
   JS(P_n, P) → 0
2. The following statements are equivalent:
   W(P_n, P) → 0
   P_n →_D P, where →_D represents convergence in distribution for random variables
3. KL(P_n ‖ P) → 0 or KL(P ‖ P_n) → 0 imply the statements in (1).
4. The statements in (1) imply the statements in (2).

Standard GAN

Generative Adversarial Networks
Recall that the GAN training strategy is to define a game between two competing networks.
The generator network G maps a source of noise to the input space.
The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two.
The generator is trained to fool the discriminator.

Generative Adversarial Networks
Formally, we can express the game between the generator G and the discriminator D with the minimax objective
$\min_G \max_D\ \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x}\sim P_g}[\log(1 - D(\tilde{x}))]$
where P_r is the data distribution and P_g is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$.
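As a rough illustration (not from the slides), the minimax objective is typically implemented as two alternating binary cross-entropy losses; the sketch below assumes PyTorch modules D and G where D ends in a sigmoid so its output is a probability, and the tensor names are placeholders.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # assumes D ends in a sigmoid, so D(x) is a probability in (0, 1)

def discriminator_loss(D, real, fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; equivalently, minimize these BCE terms
    return (bce(D(real), torch.ones(real.size(0), 1)) +
            bce(D(fake.detach()), torch.zeros(fake.size(0), 1)))

def generator_loss(D, fake):
    # Non-saturating variant commonly used in practice: maximize E[log D(G(z))]
    # (BCE against the "real" label) instead of minimizing E[log(1 - D(G(z)))] directly
    return bce(D(fake), torch.ones(fake.size(0), 1))
```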

Generative Adversarial Networks
Remarks
If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on x.
This is expensive and often leads to vanishing gradients as the discriminator saturates.
In practice, this requirement is relaxed, and the generator and discriminator are updated simultaneously.

Wasserstein GAN

Kantorovich-Rubinstein Duality
Unfortunately, computing the Wasserstein distance exactly is intractable:
$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y)\sim\gamma}[\,\|x - y\|\,]$
However, the Kantorovich-Rubinstein duality (Villani 2008) shows that W is equivalent to
$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_\theta}[f(x)]$
where the supremum is taken over all 1-Lipschitz functions f.
Calculating this is still intractable, but now it's easier to approximate.

Wasserstein GAN Approximation
Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K · W(P_r, P_θ) instead.
Suppose we have a parametrized function family {f_w}_{w ∈ W}, where w are the weights and W is the set of all possible weights. Furthermore, suppose these functions are all K-Lipschitz for some K. Then we have
$\max_{w \in \mathcal{W}}\ \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{x\sim P_\theta}[f_w(x)] \;\le\; \sup_{\|f\|_L \le K}\ \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_\theta}[f(x)] \;=\; K \cdot W(P_r, P_\theta)$
If {f_w}_{w ∈ W} contains the true supremum among K-Lipschitz functions, this gives the distance exactly. In practice this won't be true!

Wasserstein GAN Algorithm
Looping all this back to generative models, we'd like to train P_θ = g_θ(Z) to match P_r.
Intuitively, given a fixed g_θ, we can compute the optimal f_w for the Wasserstein distance.
We can then backpropagate through W(P_r, g_θ(Z)) to get the gradient for θ:
$\nabla_\theta W(P_r, P_\theta) = \nabla_\theta \big( \mathbb{E}_{x\sim P_r}[f_w(x)] - \mathbb{E}_{z\sim Z}[f_w(g_\theta(z))] \big) = -\mathbb{E}_{z\sim Z}[\nabla_\theta f_w(g_\theta(z))]$

Wasserstein GAN Algorithm
The training process now has three steps:
1. For a fixed θ, compute an approximation of W(P_r, P_θ) by training f_w to convergence.
2. Once we have the optimal f_w, compute the θ gradient −E_{z∼Z}[∇_θ f_w(g_θ(z))] by sampling several z ∼ Z.
3. Update θ, and repeat the process.
Important Detail
The entire derivation only works when the function family {f_w}_{w ∈ W} is K-Lipschitz. To guarantee this is true, the authors use weight clamping: the weights w are constrained to lie within [−c, c] by clipping w after every update to w.

Wasserstein GAN Algorithm (figure)
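Below is a minimal sketch of the three-step loop above, assuming PyTorch modules critic (playing f_w) and generator (playing g_θ), a data iterator, and paper-style hyperparameters (several critic steps per generator step, clipping constant c = 0.01); names, shapes, and the optimizers are placeholders, not the authors' code.

```python
import torch

def wgan_step(critic, generator, data_iter, opt_c, opt_g,
              latent_dim=64, n_critic=5, clip=0.01):
    # 1. Approximate W(P_r, P_theta): train the critic for several steps
    for _ in range(n_critic):
        real = next(data_iter)
        z = torch.randn(real.size(0), latent_dim)
        fake = generator(z).detach()
        # maximize E[f_w(real)] - E[f_w(fake)]  <=>  minimize the negative
        loss_c = -(critic(real).mean() - critic(fake).mean())
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # enforce the Lipschitz constraint by clamping every weight to [-c, c]
        for p in critic.parameters():
            p.data.clamp_(-clip, clip)

    # 2-3. Use the (approximately optimal) critic to update theta
    z = torch.randn(64, latent_dim)
    loss_g = -critic(generator(z)).mean()   # minimize -E[f_w(g_theta(z))]
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return -loss_c.item()                   # rough estimate of K * W(P_r, P_theta)
```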

Comparison with Standard GANs
In GANs, the discriminator maximizes
$\frac{1}{m}\sum_{i=1}^m \log D(x^{(i)}) + \frac{1}{m}\sum_{i=1}^m \log(1 - D(g_\theta(z^{(i)})))$
where we constrain D(x) to always be a probability p ∈ (0, 1).
In WGANs, nothing requires f_w to output a probability, hence it is referred to as a critic instead of a discriminator.
Although GANs are formulated as a min-max problem, in practice we never train D to convergence.
Consequently, we're updating G against an objective that roughly aims towards the JS divergence, but doesn't go all the way.
In contrast, because the Wasserstein distance is differentiable nearly everywhere, we can (and should) train f_w to convergence before each generator update, to get as accurate an estimate of W(P_r, P_θ) as possible.

Improved Wasserstein GAN

Properties of optimal WGAN critic
An open question is how to effectively enforce the Lipschitz constraint on the critic.
Previously (Arjovsky et al. 2017) we've seen that you can clip the weights of the critic to lie within the compact space [−c, c].
To understand why weight clipping is problematic for the WGAN critic, we need to understand the properties of the optimal WGAN critic.

Properties of optimal WGAN critic
If the optimal critic D* under the Kantorovich-Rubinstein dual is differentiable, and x is a point from our generator distribution P_θ, then there is a point y from the true distribution P_r such that, at every point x_t = (1 − t)x + ty on the straight line between x and y, the gradient of D* points directly toward y.
In other words, $\nabla D^*(x_t) = \frac{y - x_t}{\|y - x_t\|}$.
This implies that the optimal WGAN critic has gradients with norm 1 almost everywhere under P_r and P_θ.

Gradient penalty
We consider an alternative way to enforce the Lipschitz constraint on the training objective.
A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere.
This implies we should directly constrain the gradient norm of our critic function with respect to its input.
Enforcing a soft version of this constraint yields the gradient penalty.

WGAN with gradient penalty (figure)
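As a concrete sketch of the soft constraint above, the WGAN-GP objective of Gulrajani et al. (2017) adds a penalty term $\lambda\, \mathbb{E}_{\hat{x}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$ to the critic loss, where x̂ is sampled uniformly along straight lines between real and generated samples. The PyTorch snippet below is a minimal illustration under those assumptions (the critic module, tensor shapes, and λ = 10 are placeholders), not the exact code behind the algorithm in the figure.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Sample x_hat uniformly along straight lines between real and generated points
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

    # Gradient of the critic output with respect to its input
    out = critic(x_hat)
    grads = torch.autograd.grad(outputs=out, inputs=x_hat,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True)[0]

    # Penalize deviation of the input-gradient norm from 1 (two-sided penalty)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

# Critic loss with penalty (sketch):
# loss_c = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
```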


More information

Machine learning - HT Maximum Likelihood

Machine learning - HT Maximum Likelihood Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce

More information

Advanced Machine Learning & Perception

Advanced Machine Learning & Perception Advanced Machine Learning & Perception Instructor: Tony Jebara Topic 6 Standard Kernels Unusual Input Spaces for Kernels String Kernels Probabilistic Kernels Fisher Kernels Probability Product Kernels

More information

Generative Models and Optimal Transport

Generative Models and Optimal Transport Generative Models and Optimal Transport Marco Cuturi Joint work / work in progress with G. Peyré, A. Genevay (ENS), F. Bach (INRIA), G. Montavon, K-R Müller (TU Berlin) Statistics 0.1 : Density Fitting

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)

More information

Optimal Transport in Risk Analysis

Optimal Transport in Risk Analysis Optimal Transport in Risk Analysis Jose Blanchet (based on work with Y. Kang and K. Murthy) Stanford University (Management Science and Engineering), and Columbia University (Department of Statistics and

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

A new Hellinger-Kantorovich distance between positive measures and optimal Entropy-Transport problems

A new Hellinger-Kantorovich distance between positive measures and optimal Entropy-Transport problems A new Hellinger-Kantorovich distance between positive measures and optimal Entropy-Transport problems Giuseppe Savaré http://www.imati.cnr.it/ savare Dipartimento di Matematica, Università di Pavia Nonlocal

More information

Lecture 16 Deep Neural Generative Models

Lecture 16 Deep Neural Generative Models Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed

More information

Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization

Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization Supplementary Materials for: f-gan: Training Generative Neural Samplers using Variational Divergence Minimization Sebastian Nowozin, Botond Cseke, Ryota Tomioka Machine Intelligence and Perception Group

More information

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information