Nishant Gurnani GAN Reading Group April 14th, 2017 1 / 107
Why are these Papers Important? 2 / 107
Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN, DCGAN, DiscoGAN... 3 / 107
Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN, DCGAN, DiscoGAN... Wasserstein GAN is yet another GAN training algorithm, however it is backed up by rigorous theory in addition to good performance 4 / 107
Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN, DCGAN, DiscoGAN... Wasserstein GAN is yet another GAN training algorithm, however it is backed up by rigorous theory in addition to good performance WGAN removes the need to balance generator updates with discriminator updates, thus removing a key source of training instability in the original GAN 5 / 107
Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN, DCGAN, DiscoGAN... Wasserstein GAN is yet another GAN training algorithm, however it is backed up by rigorous theory in addition to good performance WGAN removes the need to balance generator updates with discriminator updates, thus removing a key source of training instability in the original GAN Empirical results show correlation between discriminator loss and perceptual quality thus providing a rough measure of training progress 6 / 107
Why are these Papers Important? Recently a large number of GAN frameworks have been proposed - BGAN, LSGAN, DCGAN, DiscoGAN... Wasserstein GAN is yet another GAN training algorithm, however it is backed up by rigorous theory in addition to good performance WGAN removes the need to balance generator updates with discriminator updates, thus removing a key source of training instability in the original GAN Empirical results show correlation between discriminator loss and perceptual quality thus providing a rough measure of training progress Goal - Convince you that WGAN is the best GAN 7 / 107
Outline Introduction Different Distances Standard GAN Wasserstein GAN Improved Wasserstein GAN 8 / 107
Outline Introduction Different Distances Standard GAN Wasserstein GAN Improved Wasserstein GAN 9 / 107
What does it mean to learn a probability distribution? When learning generative models, we assume the data we have comes from some unknown distribution P r. Want to learn a distribution P θ that approximates P r, where θ are the parameters of the distribution. There are two approaches for doing this: 10 / 107
What does it mean to learn a probability distribution? When learning generative models, we assume the data we have comes from some unknown distribution P r. Want to learn a distribution P θ that approximates P r, where θ are the parameters of the distribution. There are two approaches for doing this: 1. Directly learn the probability density function P θ and then optimize through maximum likelihood estimation 11 / 107
What does it mean to learn a probability distribution? When learning generative models, we assume the data we have comes from some unknown distribution P r. Want to learn a distribution P θ that approximates P r, where θ are the parameters of the distribution. There are two approaches for doing this: 1. Directly learn the probability density function P θ and then optimize through maximum likelihood estimation 2. Learn a function that transforms an existing distribution Z into P θ. Here, g θ is some differentiable function, Z is a common distribution (usually uniform or Gaussian), and P θ = g θ (Z) 12 / 107
Maximum Likelihood approach is problematic Recall that for continuous distributions P and Q the KL divergence is: KL(P Q) = P(x) log P(x) Q(x) dx and given function P θ, the MLE objective is 1 max θ R d m x m log P θ (x (i) ) i=1 In the limit (as m ), samples will appear based on the data distribution P r, so lim max 1 m θ R d m m i=1 log P θ (x (i) ) = max P r (x) log P θ (x)dx θ R d x 13 / 107
Maximum Likelihood approach is problematic lim max 1 m θ R d m m log P θ (x (i) ) = max P r (x) log P θ (x)dx θ R d x = min P r (x) log P θ (x)dx θ R d x = min P r (x) log P r (x)dx θ R d i=1 x = min θ R d KL(P r P θ ) x P r (x) log P θ 14 / 107
Maximum Likelihood approach is problematic lim max 1 m θ R d m m log P θ (x (i) ) = max P r (x) log P θ (x)dx θ R d x = min P r (x) log P θ (x)dx θ R d x = min P r (x) log P r (x)dx θ R d i=1 x = min θ R d KL(P r P θ ) Note if P θ = 0 at an x where P r > 0, the KL divergence goes to + (bad for the MLE if P θ has low dimensional support) x P r (x) log P θ 15 / 107
Maximum Likelihood approach is problematic lim max 1 m θ R d m m log P θ (x (i) ) = max P r (x) log P θ (x)dx θ R d x = min P r (x) log P θ (x)dx θ R d x = min P r (x) log P r (x)dx θ R d i=1 x = min θ R d KL(P r P θ ) Note if P θ = 0 at an x where P r > 0, the KL divergence goes to + (bad for the MLE if P θ has low dimensional support) Typical remedy is to add a noise term to the model distribution to ensure distribution is defined everywhere x P r (x) log P θ 16 / 107
Maximum Likelihood approach is problematic lim max 1 m θ R d m m log P θ (x (i) ) = max P r (x) log P θ (x)dx θ R d x = min P r (x) log P θ (x)dx θ R d x = min P r (x) log P r (x)dx θ R d i=1 x = min θ R d KL(P r P θ ) Note if P θ = 0 at an x where P r > 0, the KL divergence goes to + (bad for the MLE if P θ has low dimensional support) Typical remedy is to add a noise term to the model distribution to ensure distribution is defined everywhere This unfortunately introduces some error, and empirically people have needed to add a lot of random noise to make models train x P r (x) log P θ 17 / 107
Transforming an existing distribution Shortcomings of the maximum likelihood approach motivate the second approach of learning a g θ (a generator) to transform a known distribution Z. Advantages of this approach: 18 / 107
Transforming an existing distribution Shortcomings of the maximum likelihood approach motivate the second approach of learning a g θ (a generator) to transform a known distribution Z. Advantages of this approach: Unlike densities, this approach can represent distributions confined to a low dimensional manifold 19 / 107
Transforming an existing distribution Shortcomings of the maximum likelihood approach motivate the second approach of learning a g θ (a generator) to transform a known distribution Z. Advantages of this approach: Unlike densities, this approach can represent distributions confined to a low dimensional manifold It s very easy to generate samples - given a trained g θ, simply sample random noise z Z, and evaluate g θ (z) 20 / 107
Transforming an existing distribution Shortcomings of the maximum likelihood approach motivate the second approach of learning a g θ (a generator) to transform a known distribution Z. Advantages of this approach: Unlike densities, this approach can represent distributions confined to a low dimensional manifold It s very easy to generate samples - given a trained g θ, simply sample random noise z Z, and evaluate g θ (z) VAEs and GANs are well known examples of this approach VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms. 21 / 107
Transforming an existing distribution Shortcomings of the maximum likelihood approach motivate the second approach of learning a g θ (a generator) to transform a known distribution Z. Advantages of this approach: Unlike densities, this approach can represent distributions confined to a low dimensional manifold It s very easy to generate samples - given a trained g θ, simply sample random noise z Z, and evaluate g θ (z) VAEs and GANs are well known examples of this approach VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms. GANs offer much more flexibility but their training is unstable. 22 / 107
Transforming an existing distribution To train g θ (and by extension P θ ), we need a measure of distance between distributions i.e. d(p r, P θ ). 23 / 107
Transforming an existing distribution To train g θ (and by extension P θ ), we need a measure of distance between distributions i.e. d(p r, P θ ). Distance Properties 24 / 107
Transforming an existing distribution To train g θ (and by extension P θ ), we need a measure of distance between distributions i.e. d(p r, P θ ). Distance Properties Distance d is weaker than distance d if every sequence that converges under d converges under d 25 / 107
Transforming an existing distribution To train g θ (and by extension P θ ), we need a measure of distance between distributions i.e. d(p r, P θ ). Distance Properties Distance d is weaker than distance d if every sequence that converges under d converges under d Given a distance d, we can treat d(p r, P θ ) as a loss function 26 / 107
Transforming an existing distribution To train g θ (and by extension P θ ), we need a measure of distance between distributions i.e. d(p r, P θ ). Distance Properties Distance d is weaker than distance d if every sequence that converges under d converges under d Given a distance d, we can treat d(p r, P θ ) as a loss function We can minimize d(p r, P θ ) with respect to θ as long as the mapping θ P θ is continuous (true if g θ is a neural network) 27 / 107
Transforming an existing distribution To train g θ (and by extension P θ ), we need a measure of distance between distributions i.e. d(p r, P θ ). Distance Properties Distance d is weaker than distance d if every sequence that converges under d converges under d Given a distance d, we can treat d(p r, P θ ) as a loss function We can minimize d(p r, P θ ) with respect to θ as long as the mapping θ P θ is continuous (true if g θ is a neural network) How do we define d? 28 / 107
Introduction Different Distances Standard GAN Wasserstein GAN Improved Wasserstein GAN 29 / 107
Distance definitions 30 / 107
Distance definitions Notation χ - compact metric set (such as the space of images [0, 1] d ) Σ - set of all Borel subsets of χ Prob(χ) - space of probability measures defined on χ 31 / 107
Distance definitions Notation χ - compact metric set (such as the space of images [0, 1] d ) Σ - set of all Borel subsets of χ Prob(χ) - space of probability measures defined on χ Total Variation (TV) distance δ(p r, P θ ) = sup P r (A) P θ (A) A Σ 32 / 107
Distance definitions Notation χ - compact metric set (such as the space of images [0, 1] d ) Σ - set of all Borel subsets of χ Prob(χ) - space of probability measures defined on χ Total Variation (TV) distance δ(p r, P θ ) = sup P r (A) P θ (A) A Σ Kullback-Leibler (KL) divergence KL(P r P θ ) = P r (x) log P r (x) P θ (x) dµ(x) x where both P r and P θ are assumed to be absolutely continuous with respect to a same measure µ defined on χ. 33 / 107
Distance Definitions 34 / 107
Distance Definitions Jensen-Shannon (JS) Divergence JS(P r, P θ ) = KL(P r P m ) + KL(P θ P m ) where P m is the mixture (P r + P θ )/2 35 / 107
Distance Definitions Jensen-Shannon (JS) Divergence JS(P r, P θ ) = KL(P r P m ) + KL(P θ P m ) where P m is the mixture (P r + P θ )/2 Earth-Mover (EM) distance or Wasserstein-1 W (P r, P θ ) = inf E (x,y) γ[ x y ] γ Π((P r,p θ ) where Π(P r, P θ ) denotes the set of all join distributions γ(x, y) whose marginals are respectively P r and P θ 36 / 107
Understanding the EM distance 37 / 107
Understanding the EM distance Main Idea Probability distributions are defined by how much mass they put on each point. 38 / 107
Understanding the EM distance Main Idea Probability distributions are defined by how much mass they put on each point. Imagine we started with distribution P r, and wanted to move mass around to change the distribution into P θ. 39 / 107
Understanding the EM distance Main Idea Probability distributions are defined by how much mass they put on each point. Imagine we started with distribution P r, and wanted to move mass around to change the distribution into P θ. Moving mass m by distance d costs m d effort. 40 / 107
Understanding the EM distance Main Idea Probability distributions are defined by how much mass they put on each point. Imagine we started with distribution P r, and wanted to move mass around to change the distribution into P θ. Moving mass m by distance d costs m d effort. The earth mover distance is the minimal effort we need to spend. 41 / 107
Understanding the EM distance Main Idea Probability distributions are defined by how much mass they put on each point. Imagine we started with distribution P r, and wanted to move mass around to change the distribution into P θ. Moving mass m by distance d costs m d effort. The earth mover distance is the minimal effort we need to spend. Transport Plan Each γ Π is a transport plan and to execute the plan, for all x, y move γ(x, y) mass from x to y. 42 / 107
Understanding the EM distance 43 / 107
Understanding the EM distance What properties does the plan need to satisfy to transform P r into P θ? 44 / 107
Understanding the EM distance What properties does the plan need to satisfy to transform P r into P θ? The amount of mass that leaves x is y γ(x, y)dy. This must equal P r (x), the amount of mass originally at x. 45 / 107
Understanding the EM distance What properties does the plan need to satisfy to transform P r into P θ? The amount of mass that leaves x is y γ(x, y)dy. This must equal P r (x), the amount of mass originally at x. The amount of mass that enters y is x γ(x, y)dx. This must equal P θ (y), the amount of mass that ends up at y. 46 / 107
Understanding the EM distance What properties does the plan need to satisfy to transform P r into P θ? The amount of mass that leaves x is y γ(x, y)dy. This must equal P r (x), the amount of mass originally at x. The amount of mass that enters y is x γ(x, y)dx. This must equal P θ (y), the amount of mass that ends up at y. This shows why the marginals of γ Π must be P r and P θ. For scoring, the effort spent is γ(x, y) x y dydx = E (x,y) γ [ x y ] x y 47 / 107
Understanding the EM distance What properties does the plan need to satisfy to transform P r into P θ? The amount of mass that leaves x is y γ(x, y)dy. This must equal P r (x), the amount of mass originally at x. The amount of mass that enters y is x γ(x, y)dx. This must equal P θ (y), the amount of mass that ends up at y. This shows why the marginals of γ Π must be P r and P θ. For scoring, the effort spent is γ(x, y) x y dydx = E (x,y) γ [ x y ] x y Computing the infimum of this over all valid γ gives the earth mover distance 48 / 107
Learning Parallel Lines Example 49 / 107
Learning Parallel Lines Example Let Z U[0, 1] and let P 0 be the distribution of (0, Z) R 2, uniform on a straight vertical line passing through the origin. Now let g θ (z) = (θ, z) with θ a single real parameter. 50 / 107
Learning Parallel Lines Example Let Z U[0, 1] and let P 0 be the distribution of (0, Z) R 2, uniform on a straight vertical line passing through the origin. Now let g θ (z) = (θ, z) with θ a single real parameter. 51 / 107
Learning Parallel Lines Example Let Z U[0, 1] and let P 0 be the distribution of (0, Z) R 2, uniform on a straight vertical line passing through the origin. Now let g θ (z) = (θ, z) with θ a single real parameter. We d like our optimization algorithm to learn to move θ to 0. As θ 0, the distance d(p 0, P θ ) should decrease. 52 / 107
Learning Parallel Lines Example For many common distance functions, this doesn t happen. 53 / 107
Learning Parallel Lines Example For many common distance functions, this doesn t happen. { 1 if θ 0 δ(p 0, P θ ) = 0 if θ = 0 { log 2 if θ 0 JS(P 0, P θ ) = 0 if θ = 0 { + if θ 0 KL(P θ, P 0 ) = KL(P 0, P θ ) = 0 if θ = 0 W (P 0, P θ ) = θ 54 / 107
Theoretical Justification Theorem (1) Let P r be a fixed distribution over χ. Let Z be a random variable over another space Z. Let g : Z R d χ be a function, that will be denote g θ (z) with z the first coordinate and θ the second. Let P θ denote the distribution of g θ (). Then, 1. If g is continuous in θ, so is W (P r, P θ ). 2. If g is locally Lipschitz and satisfies regularity assumption 1, then W (P r, P θ ) is continuous everywhere, and differentiable almost everywhere 3. Statements 1-2 are false for the Jensen-Shannon divergence JS(P r, P θ ) and all the KLs. 55 / 107
Theoretical Justification Theorem (2) Let P be a distribution on a compact space χ and (P n ) n N be a sequence of distributions on χ. Then, considering all limits as n, 1. The following statements are equivalent δ(p n, P) 0 JS(Pn, P) 0 2. The following statements are equivalent W (P n, P) 0 Pn D P where D represents convergence in distribution for random variables 3. KL(P n P) 0 or KL(P P n ) 0 imply the statements in (1). 4. The statements in (1) imply the statements in (2). 56 / 107
Introduction Different Distances Standard GAN Wasserstein GAN Improved Wasserstein GAN 57 / 107
Generative Adversarial Networks Recall that the GAN training strategy is to define a game between two competing networks. 58 / 107
Generative Adversarial Networks Recall that the GAN training strategy is to define a game between two competing networks. The generator network G maps a source of noise to the input space. 59 / 107
Generative Adversarial Networks Recall that the GAN training strategy is to define a game between two competing networks. The generator network G maps a source of noise to the input space. The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two. 60 / 107
Generative Adversarial Networks Recall that the GAN training strategy is to define a game between two competing networks. The generator network G maps a source of noise to the input space. The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator. 61 / 107
Generative Adversarial Networks Formally we can express the game between the generator G and the discriminator D with the minimax objective: min G max E x P r [log(d(x))] + E x Pg [log(1 D( x)] D where P r is the data distribution and P g is the model distribution implicitly defined by x = G(z), z p(z) 62 / 107
Generative Adversarial Networks Remarks 63 / 107
Generative Adversarial Networks Remarks If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on x 64 / 107
Generative Adversarial Networks Remarks If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on x This is expensive and often leads to vanishing gradients as the discriminator saturates 65 / 107
Generative Adversarial Networks Remarks If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on x This is expensive and often leads to vanishing gradients as the discriminator saturates In practice, this requirement is relaxed, and the generator and discriminator are update simultaneously 66 / 107
Introduction Different Distances Standard GAN Wasserstein GAN Improved Wasserstein GAN 67 / 107
Kantorivich-Rubinstein Duality Unfortunately, computing the Wasserstein distance exactly is intractable. W (P r, P θ ) = inf E (x,y) γ[ x y ] γ Π((P r,p θ ) 68 / 107
Kantorivich-Rubinstein Duality Unfortunately, computing the Wasserstein distance exactly is intractable. W (P r, P θ ) = inf E (x,y) γ[ x y ] γ Π((P r,p θ ) However, a result from the Kantorivich-Rubinstein Duality (Villani 2008) shows W is equivalent to W (P r, P θ ) = sup E x Pr [f (x)] E x Pθ [f (x)] f L 1 where the supremum is taken over all 1-Lipschitz functions 69 / 107
Kantorivich-Rubinstein Duality Unfortunately, computing the Wasserstein distance exactly is intractable. W (P r, P θ ) = inf E (x,y) γ[ x y ] γ Π((P r,p θ ) However, a result from the Kantorivich-Rubinstein Duality (Villani 2008) shows W is equivalent to W (P r, P θ ) = sup E x Pr [f (x)] E x Pθ [f (x)] f L 1 where the supremum is taken over all 1-Lipschitz functions Calculating this is still intractable, but now it s easier to approximate. 70 / 107
Wasserstein GAN Approximation 71 / 107
Wasserstein GAN Approximation Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K W (P r, P θ ) instead. 72 / 107
Wasserstein GAN Approximation Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K W (P r, P θ ) instead. Suppose we have a parametrized function family {f w } w W, where w are the weights and W is the set of all possible weights Furthermore suppose these functions are all K-Lipschitz for some K. 73 / 107
Wasserstein GAN Approximation Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K W (P r, P θ ) instead. Suppose we have a parametrized function family {f w } w W, where w are the weights and W is the set of all possible weights Furthermore suppose these functions are all K-Lipschitz for some K. Then we have max E x P r [f w (x)] E x Pθ [f w (x)] sup E x Pr [f (x)] E x Pθ [f (x)] w W f L K = K W (P r, P θ ) 74 / 107
Wasserstein GAN Approximation Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K W (P r, P θ ) instead. Suppose we have a parametrized function family {f w } w W, where w are the weights and W is the set of all possible weights Furthermore suppose these functions are all K-Lipschitz for some K. Then we have max E x P r [f w (x)] E x Pθ [f w (x)] sup E x Pr [f (x)] E x Pθ [f (x)] w W f L K = K W (P r, P θ ) If {f w } w W contains the true supremum among K-Lipschitz functions, this gives the distance exactly. 75 / 107
Wasserstein GAN Approximation Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K W (P r, P θ ) instead. Suppose we have a parametrized function family {f w } w W, where w are the weights and W is the set of all possible weights Furthermore suppose these functions are all K-Lipschitz for some K. Then we have max E x P r [f w (x)] E x Pθ [f w (x)] sup E x Pr [f (x)] E x Pθ [f (x)] w W f L K = K W (P r, P θ ) If {f w } w W contains the true supremum among K-Lipschitz functions, this gives the distance exactly. In practice this won t be true! 76 / 107
Wasserstein GAN Algorithm Looping all this back to generative models, we d like to train P θ = g θ (z) to match P r. 77 / 107
Wasserstein GAN Algorithm Looping all this back to generative models, we d like to train P θ = g θ (z) to match P r. Intuitively, given a fixed g θ, we can compute the optimal f w for the Wasserstein distance. 78 / 107
Wasserstein GAN Algorithm Looping all this back to generative models, we d like to train P θ = g θ (z) to match P r. Intuitively, given a fixed g θ, we can compute the optimal f w for the Wasserstein distance. We can then backpropagate through W (P r, g θ (Z)) to get the gradient for θ. θ W (P r, P θ ) = θ (E x Pr [f w (x)] E z Z [f w (g θ (z))]) = E z Z [ θ f w (g θ (z))] 79 / 107
Wasserstein GAN Algorithm The training process now has three steps: 80 / 107
Wasserstein GAN Algorithm The training process now has three steps: 1. For a fixed θ, compute an approximation of W (P r, P θ ) by training f w to convergence 81 / 107
Wasserstein GAN Algorithm The training process now has three steps: 1. For a fixed θ, compute an approximation of W (P r, P θ ) by training f w to convergence 2. Once we have the optimal f w, compute the θ gradient E z Z [ θ f w (g θ (z))] by sampling several z Z 82 / 107
Wasserstein GAN Algorithm The training process now has three steps: 1. For a fixed θ, compute an approximation of W (P r, P θ ) by training f w to convergence 2. Once we have the optimal f w, compute the θ gradient E z Z [ θ f w (g θ (z))] by sampling several z Z 3. Update θ, and repeat the process 83 / 107
Wasserstein GAN Algorithm The training process now has three steps: 1. For a fixed θ, compute an approximation of W (P r, P θ ) by training f w to convergence 2. Once we have the optimal f w, compute the θ gradient E z Z [ θ f w (g θ (z))] by sampling several z Z 3. Update θ, and repeat the process Important Detail The entire derivation only works when the function family {f w } w W is K-Lipschitz. 84 / 107
Wasserstein GAN Algorithm The training process now has three steps: 1. For a fixed θ, compute an approximation of W (P r, P θ ) by training f w to convergence 2. Once we have the optimal f w, compute the θ gradient E z Z [ θ f w (g θ (z))] by sampling several z Z 3. Update θ, and repeat the process Important Detail The entire derivation only works when the function family {f w } w W is K-Lipschitz. To guarantee this is true, the authors use weight clamping. The weights w are constrained to lie within [ c, c], by clipping w after every update to w. 85 / 107
Wasserstein GAN Algorithm 86 / 107
Comparison with Standard GANs 87 / 107
Comparison with Standard GANs In GANs, the discriminator maximizes 1 m m log D(x (i) ) + 1 m i=1 m log(1 D(g θ (z (i) ))) where we constrain D(x) to always be a probability p (0, 1) i=1 88 / 107
Comparison with Standard GANs In GANs, the discriminator maximizes 1 m m log D(x (i) ) + 1 m i=1 m log(1 D(g θ (z (i) ))) where we constrain D(x) to always be a probability p (0, 1) In WGANs, nothing requires the f w to output a probability and hence it is referred to as a critic instead of a discriminator i=1 89 / 107
Comparison with Standard GANs In GANs, the discriminator maximizes 1 m m log D(x (i) ) + 1 m i=1 m log(1 D(g θ (z (i) ))) where we constrain D(x) to always be a probability p (0, 1) In WGANs, nothing requires the f w to output a probability and hence it is referred to as a critic instead of a discriminator Although GANs are formulated as a min max problem, in practice we never train D to convergence i=1 90 / 107
Comparison with Standard GANs In GANs, the discriminator maximizes 1 m m log D(x (i) ) + 1 m i=1 m log(1 D(g θ (z (i) ))) where we constrain D(x) to always be a probability p (0, 1) In WGANs, nothing requires the f w to output a probability and hence it is referred to as a critic instead of a discriminator Although GANs are formulated as a min max problem, in practice we never train D to convergence Consequently, we re updating G against an objective that kind of aims towards the JS divergence, but doesn t go all the way i=1 91 / 107
Comparison with Standard GANs In GANs, the discriminator maximizes 1 m m log D(x (i) ) + 1 m i=1 m log(1 D(g θ (z (i) ))) where we constrain D(x) to always be a probability p (0, 1) In WGANs, nothing requires the f w to output a probability and hence it is referred to as a critic instead of a discriminator Although GANs are formulated as a min max problem, in practice we never train D to convergence Consequently, we re updating G against an objective that kind of aims towards the JS divergence, but doesn t go all the way In constrast, because the Wasserstein distance is differentiable nearly everywhere, we can (and should) train f w to convergence before each generator update, to get as accurate an estimate of W (P r, P θ ) as possible i=1 92 / 107
Introduction Different Distances Standard GAN Wasserstein GAN Improved Wasserstein GAN 93 / 107
Properties of optimal WGAN critic 94 / 107
Properties of optimal WGAN critic An open question is how to effectively enforce the Lipschitz constraint on the critic? 95 / 107
Properties of optimal WGAN critic An open question is how to effectively enforce the Lipschitz constraint on the critic? Previously (Arjovksy et. al 2017) we ve seen that you can clip the weights of the critic to lie within the compact space [ c, c] 96 / 107
Properties of optimal WGAN critic An open question is how to effectively enforce the Lipschitz constraint on the critic? Previously (Arjovksy et. al 2017) we ve seen that you can clip the weights of the critic to lie within the compact space [ c, c] To understand why weight clipping is problematic in WGAN critic we need to understand what are the properties of the optimal WGAN critic? 97 / 107
Properties of optimal WGAN critic 98 / 107
Properties of optimal WGAN critic If the optimal critic under the Kantorovich-Rubinstein dual D is differentiable, and x is a point from our generator distribution P θ, then there is a point y sampled from the true distribution P r such that the gradient of D at all points x t = (1 t)x + ty lie on a straight line between x and y. 99 / 107
Properties of optimal WGAN critic If the optimal critic under the Kantorovich-Rubinstein dual D is differentiable, and x is a point from our generator distribution P θ, then there is a point y sampled from the true distribution P r such that the gradient of D at all points x t = (1 t)x + ty lie on a straight line between x and y. In other words, D (x t ) = y xt y x. t 100 / 107
Properties of optimal WGAN critic If the optimal critic under the Kantorovich-Rubinstein dual D is differentiable, and x is a point from our generator distribution P θ, then there is a point y sampled from the true distribution P r such that the gradient of D at all points x t = (1 t)x + ty lie on a straight line between x and y. In other words, D (x t ) = y xt y x. t This implies that the optimal WGAN critic has gradients with norm 1 almost everywhere under P r and P θ 101 / 107
Gradient penalty 102 / 107
Gradient penalty We consider an alternative method to enforce the Lipschitz constraint on the training objective. 103 / 107
Gradient penalty We consider an alternative method to enforce the Lipschitz constraint on the training objective. A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere. 104 / 107
Gradient penalty We consider an alternative method to enforce the Lipschitz constraint on the training objective. A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere. This implies we should directly constrain the gradient norm of our critic function with respect to its input. 105 / 107
Gradient penalty We consider an alternative method to enforce the Lipschitz constraint on the training objective. A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere. This implies we should directly constrain the gradient norm of our critic function with respect to its input. Enforcing a soft version of this we get: 106 / 107
WGAN with gradient penalty 107 / 107