Nishant Gurnani. GAN Reading Group. April 14th, 2017

Why are these Papers Important?
- Recently a large number of GAN frameworks have been proposed: BGAN, LSGAN, DCGAN, DiscoGAN, ...
- Wasserstein GAN is yet another GAN training algorithm; however, it is backed by rigorous theory in addition to good performance.
- WGAN removes the need to balance generator updates with discriminator updates, thus removing a key source of training instability in the original GAN.
- Empirical results show a correlation between discriminator loss and perceptual quality, which provides a rough measure of training progress.

Outline
- Introduction
- Different Distances
- Standard GAN
- Wasserstein GAN
- Improved Wasserstein GAN

What does it mean to learn a probability distribution?
When learning generative models, we assume the data we have comes from some unknown distribution $P_r$. We want to learn a distribution $P_\theta$ that approximates $P_r$, where $\theta$ are the parameters of the distribution. There are two approaches for doing this:
1. Directly learn the probability density function $P_\theta$ and optimize it through maximum likelihood estimation.
2. Learn a function that transforms an existing distribution $Z$ into $P_\theta$. Here, $g_\theta$ is some differentiable function, $Z$ is a common distribution (usually uniform or Gaussian), and $P_\theta = g_\theta(Z)$.

Maximum Likelihood approach is problematic
Recall that for continuous distributions $P$ and $Q$ the KL divergence is
$$KL(P \,\|\, Q) = \int_x P(x) \log \frac{P(x)}{Q(x)} \, dx$$
and, given a function $P_\theta$, the MLE objective is
$$\max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)})$$
In the limit (as $m \to \infty$), samples appear according to the data distribution $P_r$, so
$$\lim_{m \to \infty} \max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)}) = \max_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_\theta(x) \, dx$$

Maximum Likelihood approach is problematic
$$\begin{aligned}
\lim_{m \to \infty} \max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \log P_\theta(x^{(i)})
&= \max_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_\theta(x) \, dx \\
&= \min_{\theta \in \mathbb{R}^d} - \int_x P_r(x) \log P_\theta(x) \, dx \\
&= \min_{\theta \in \mathbb{R}^d} \int_x P_r(x) \log P_r(x) \, dx - \int_x P_r(x) \log P_\theta(x) \, dx \\
&= \min_{\theta \in \mathbb{R}^d} KL(P_r \,\|\, P_\theta)
\end{aligned}$$
(The third line adds $\int_x P_r(x) \log P_r(x) \, dx$, which does not depend on $\theta$ and so does not change the minimizer.)
- Note that if $P_\theta(x) = 0$ at an $x$ where $P_r(x) > 0$, the KL divergence goes to $+\infty$ (bad for the MLE if $P_\theta$ has low-dimensional support).
- The typical remedy is to add a noise term to the model distribution to ensure the distribution is defined everywhere.
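To make this failure mode concrete, here is a minimal sketch on a discrete toy model. The outcome set, the probabilities, and the helper names `avg_log_likelihood` and `smooth` are illustrative assumptions, not part of the slides; mixing with a uniform distribution stands in for "adding a noise term".

```python
import math

def avg_log_likelihood(samples, model):
    """Average log-likelihood (1/m) * sum_i log P_theta(x_i) for a discrete model."""
    total = 0.0
    for x in samples:
        p = model.get(x, 0.0)
        if p == 0.0:
            return -math.inf  # a single zero-probability sample ruins the objective
        total += math.log(p)
    return total / len(samples)

def smooth(model, outcomes, eps):
    """Mix the model with a uniform 'noise' distribution so every outcome has mass."""
    uniform = 1.0 / len(outcomes)
    return {x: (1.0 - eps) * model.get(x, 0.0) + eps * uniform for x in outcomes}

outcomes = ["a", "b", "c"]
samples = ["a", "a", "b", "c"]           # data drawn from P_r; "c" does occur
model = {"a": 0.5, "b": 0.5, "c": 0.0}   # P_theta puts no mass on "c"

print(avg_log_likelihood(samples, model))                         # -inf
print(avg_log_likelihood(samples, smooth(model, outcomes, 0.1)))  # finite
```

The smoothed model is still a valid distribution, but its likelihood now depends on the arbitrary choice of `eps`, which is the "fiddling with noise terms" the slides refer to.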

Transforming an existing distribution
Shortcomings of the maximum likelihood approach motivate the second approach: learning a $g_\theta$ (a generator) to transform a known distribution $Z$. Advantages of this approach:
- Unlike densities, this approach can represent distributions confined to a low-dimensional manifold.
- It's very easy to generate samples: given a trained $g_\theta$, simply sample random noise $z \sim Z$ and evaluate $g_\theta(z)$.
VAEs and GANs are well-known examples of this approach. VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms.

Transforming an existing distribution
To train $g_\theta$ (and by extension $P_\theta$), we need a measure of distance between distributions, i.e. $d(P_r, P_\theta)$.
Distance Properties
- A distance $d$ is weaker than a distance $d'$ if every sequence that converges under $d'$ also converges under $d$.
- Given a distance $d$, we can treat $d(P_r, P_\theta)$ as a loss function.
- We can minimize $d(P_r, P_\theta)$ with respect to $\theta$ as long as the mapping $\theta \mapsto P_\theta$ is continuous (true if $g_\theta$ is a neural network).

Distance definitions
Notation
- $\chi$: a compact metric set (such as the space of images $[0, 1]^d$)
- $\Sigma$: the set of all Borel subsets of $\chi$
- $\mathrm{Prob}(\chi)$: the space of probability measures defined on $\chi$
Total Variation (TV) distance
$$\delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)|$$
Kullback-Leibler (KL) divergence
$$KL(P_r \,\|\, P_\theta) = \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)} \, d\mu(x)$$
where both $P_r$ and $P_\theta$ are assumed to be absolutely continuous with respect to the same measure $\mu$ defined on $\chi$.

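Distance Definitions
Jensen-Shannon (JS) divergence
$$JS(P_r, P_\theta) = \tfrac{1}{2} KL(P_r \,\|\, P_m) + \tfrac{1}{2} KL(P_\theta \,\|\, P_m)$$
where $P_m$ is the mixture $(P_r + P_\theta)/2$. With these standard $\tfrac{1}{2}$ weights, JS is bounded above by $\log 2$.
Earth-Mover (EM) distance or Wasserstein-1
$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big]$$
where $\Pi(P_r, P_\theta)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are respectively $P_r$ and $P_\theta$.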
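The four definitions can be evaluated directly for discrete distributions on a one-dimensional grid. The sketch below is an illustration under assumptions not in the slides: the function names are hypothetical, JS uses the standard 1/2-weighted form, and the Wasserstein-1 distance uses the one-dimensional identity "integral of the absolute difference of the CDFs" rather than solving the infimum over plans.

```python
import math

def total_variation(p, q):
    """For discrete distributions, sup_A |P(A) - Q(A)| equals half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL(p || q); infinite as soon as q misses a point where p has mass."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

def js(p, q):
    """Jensen-Shannon divergence with the mixture m = (p + q) / 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, spacing=1.0):
    """Wasserstein-1 on an evenly spaced 1-D grid: sum of |CDF_p - CDF_q| * spacing."""
    cost, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        cost += abs(cdf_p - cdf_q) * spacing
    return cost

p = [1.0, 0.0, 0.0, 0.0]  # all mass on grid point 0
q = [0.0, 0.0, 0.0, 1.0]  # all mass on grid point 3
print(total_variation(p, q), kl(p, q), js(p, q), wasserstein_1d(p, q))
```

Moving the mass of `q` closer to grid point 0 changes only the Wasserstein value; the other three stay the same as long as the supports do not overlap.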

Understanding the EM distance
Main Idea
- Probability distributions are defined by how much mass they put on each point.
- Imagine we start with distribution $P_r$ and want to move mass around to change the distribution into $P_\theta$.
- Moving mass $m$ by distance $d$ costs $m \cdot d$ effort.
- The earth mover distance is the minimal effort we need to spend.

Understanding the EM distance
What properties does the plan need to satisfy to transform $P_r$ into $P_\theta$?
- The amount of mass that leaves $x$ is $\int_y \gamma(x, y) \, dy$. This must equal $P_r(x)$, the amount of mass originally at $x$.
- The amount of mass that enters $y$ is $\int_x \gamma(x, y) \, dx$. This must equal $P_\theta(y)$, the amount of mass that ends up at $y$.
- This shows why the marginals of $\gamma \in \Pi$ must be $P_r$ and $P_\theta$.
- The effort spent by a plan $\gamma$ is
$$\int_x \int_y \gamma(x, y) \, \|x - y\| \, dy \, dx = \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big]$$
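As a small illustration of these constraints, the sketch below checks the marginals of two candidate transport plans between discrete distributions and compares their effort. The point locations, the plans, and the helper names `marginals` and `plan_cost` are assumptions chosen for the example.

```python
def marginals(gamma):
    """Row sums = mass leaving each x, column sums = mass arriving at each y."""
    rows = [sum(row) for row in gamma]
    cols = [sum(col) for col in zip(*gamma)]
    return rows, cols

def plan_cost(gamma, xs, ys):
    """Effort of a plan: sum over (x, y) of gamma(x, y) * |x - y|."""
    return sum(gamma[i][j] * abs(xs[i] - ys[j])
               for i in range(len(xs)) for j in range(len(ys)))

xs, p_r = [0.0, 1.0], [0.5, 0.5]      # P_r: half the mass at 0, half at 1
ys, p_theta = [0.0, 2.0], [0.5, 0.5]  # P_theta: half the mass at 0, half at 2

plan_a = [[0.5, 0.0], [0.0, 0.5]]  # keep the mass at 0, move 1 -> 2
plan_b = [[0.0, 0.5], [0.5, 0.0]]  # move 0 -> 2 and 1 -> 0

for plan in (plan_a, plan_b):
    assert marginals(plan) == (p_r, p_theta)  # both plans are valid couplings
    print(plan_cost(plan, xs, ys))            # 0.5 for plan_a, 1.5 for plan_b
```

Both plans satisfy the marginal constraints, but only the cheaper one matters for the EM distance, which is the minimum effort over all valid plans.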

Learning Parallel Lines Example
Let $Z \sim U[0, 1]$ and let $P_0$ be the distribution of $(0, Z) \in \mathbb{R}^2$, uniform on a straight vertical line passing through the origin. Now let $g_\theta(z) = (\theta, z)$ with $\theta$ a single real parameter. We'd like our optimization algorithm to learn to move $\theta$ to 0. As $\theta \to 0$, the distance $d(P_0, P_\theta)$ should decrease.
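Sampling from $P_\theta$ in this example only requires drawing noise and applying the generator. The following sketch assumes the hypothetical helper names `g` and `sample_p_theta` and a fixed seed for reproducibility.

```python
import random

def g(theta, z):
    """Generator from the example: maps noise z to the point (theta, z)."""
    return (theta, z)

def sample_p_theta(theta, n, seed=0):
    """Draw n samples from P_theta = g_theta(Z) with Z ~ U[0, 1]."""
    rng = random.Random(seed)
    return [g(theta, rng.random()) for _ in range(n)]

points = sample_p_theta(theta=0.5, n=5)
for x, y in points:
    print(f"({x:.2f}, {y:.2f})")  # every sample has first coordinate equal to theta
```

All samples lie on the line with first coordinate $\theta$, so $P_\theta$ has no density on $\mathbb{R}^2$: it is supported on a one-dimensional set.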

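Learning Parallel Lines Example
For many common distance functions, this doesn't happen:
$$\delta(P_0, P_\theta) = \begin{cases} 1 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$$
$$JS(P_0, P_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$$
$$KL(P_\theta \,\|\, P_0) = KL(P_0 \,\|\, P_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0 \\ 0 & \text{if } \theta = 0 \end{cases}$$
$$W(P_0, P_\theta) = |\theta|$$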
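A short derivation of these values for $\theta \neq 0$ may help; it is a sketch that relies only on the two lines having disjoint supports, and it takes JS in its standard $\tfrac{1}{2}$-weighted form with $P_m = (P_0 + P_\theta)/2$.

```latex
% Assumes \theta \neq 0, so the supports of P_0 and P_\theta are disjoint.
\begin{aligned}
\delta(P_0, P_\theta) &\ge |P_0(A) - P_\theta(A)| = |1 - 0| = 1
  \quad \text{for } A = \{0\} \times [0, 1],
  \text{ and } \delta \le 1 \text{ always, so } \delta = 1. \\
JS(P_0, P_\theta) &= \tfrac{1}{2} \int \log \frac{dP_0}{dP_m} \, dP_0
  + \tfrac{1}{2} \int \log \frac{dP_\theta}{dP_m} \, dP_\theta
  = \tfrac{1}{2} \log 2 + \tfrac{1}{2} \log 2 = \log 2, \\
&\quad \text{since } \frac{dP_0}{dP_m} = 2 \text{ on the support of } P_0
  \text{ and } \frac{dP_\theta}{dP_m} = 2 \text{ on the support of } P_\theta. \\
KL(P_0 \,\|\, P_\theta) &= +\infty
  \quad \text{because } P_\theta \text{ assigns zero mass where } P_0 \text{ has mass (and symmetrically).} \\
W(P_0, P_\theta) &\ge |\theta|
  \quad \text{because any } x = (0, z) \text{ and } y = (\theta, z') \text{ satisfy } \|x - y\| \ge |\theta|, \\
W(P_0, P_\theta) &\le |\theta|
  \quad \text{via the plan that moves each } (0, z) \text{ to } (\theta, z).
\end{aligned}
```

Only $W$ varies continuously with $\theta$, so only $W$ provides a gradient that points toward $\theta = 0$.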

Theoretical Justification
Theorem (1). Let $P_r$ be a fixed distribution over $\chi$. Let $Z$ be a random variable over another space $\mathcal{Z}$. Let $g : \mathcal{Z} \times \mathbb{R}^d \to \chi$ be a function, denoted $g_\theta(z)$ with $z$ the first coordinate and $\theta$ the second. Let $P_\theta$ denote the distribution of $g_\theta(Z)$. Then,
1. If $g$ is continuous in $\theta$, so is $W(P_r, P_\theta)$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(P_r, P_\theta)$ is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence $JS(P_r, P_\theta)$ and all the KLs.

Theoretical Justification
Theorem (2). Let $P$ be a distribution on a compact space $\chi$ and $(P_n)_{n \in \mathbb{N}}$ be a sequence of distributions on $\chi$. Then, considering all limits as $n \to \infty$,
1. The following statements are equivalent:
   - $\delta(P_n, P) \to 0$
   - $JS(P_n, P) \to 0$
2. The following statements are equivalent:
   - $W(P_n, P) \to 0$
   - $P_n \xrightarrow{D} P$, where $\xrightarrow{D}$ represents convergence in distribution for random variables
3. $KL(P_n \,\|\, P) \to 0$ or $KL(P \,\|\, P_n) \to 0$ imply the statements in (1).
4. The statements in (1) imply the statements in (2).

Generative Adversarial Networks
Recall that the GAN training strategy is to define a game between two competing networks.
- The generator network $G$ maps a source of noise to the input space.
- The discriminator network $D$ receives either a generated sample or a true data sample and must distinguish between the two.

Generative Adversarial Networks
Formally, we can express the game between the generator $G$ and the discriminator $D$ with the minimax objective:
$$\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim P_g}[\log(1 - D(\tilde{x}))]$$
where $P_r$ is the data distribution and $P_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$.

Generative Adversarial Networks
Remarks
- If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions on $x$.
- This is expensive and often leads to vanishing gradients as the discriminator saturates.
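The first remark can be unpacked with the well-known closed form of the optimal discriminator. The derivation below is a sketch that assumes $P_r$ and $P_g$ have densities and uses the standard $\tfrac{1}{2}$-weighted JS divergence; $V(G, D)$ is shorthand (not used in the slides) for the minimax objective.

```latex
% Standard result for the original GAN objective; P_m = (P_r + P_g) / 2.
\begin{aligned}
D^*(x) &= \frac{P_r(x)}{P_r(x) + P_g(x)}, \\
V(G, D^*) &= \mathbb{E}_{x \sim P_r}\!\left[\log \frac{P_r(x)}{P_r(x) + P_g(x)}\right]
           + \mathbb{E}_{x \sim P_g}\!\left[\log \frac{P_g(x)}{P_r(x) + P_g(x)}\right] \\
          &= KL(P_r \,\|\, P_m) + KL(P_g \,\|\, P_m) - 2 \log 2 \\
          &= 2 \, JS(P_r, P_g) - 2 \log 2 .
\end{aligned}
```

Since $2 \log 2$ is constant, minimizing $V(G, D^*)$ over $G$ is the same as minimizing $JS(P_r, P_g)$, which is flat whenever the two supports are disjoint.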

Kantorovich-Rubinstein Duality
Unfortunately, computing the Wasserstein distance exactly is intractable:
$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big]$$
However, the Kantorovich-Rubinstein duality (Villani 2008) shows that $W$ is equivalent to
$$W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)]$$
where the supremum is taken over all 1-Lipschitz functions.

Wasserstein GAN Approximation
- Note that if we replace the supremum over 1-Lipschitz functions with the supremum over $K$-Lipschitz functions, then the supremum is $K \cdot W(P_r, P_\theta)$ instead.
- Suppose we have a parametrized function family $\{f_w\}_{w \in \mathcal{W}}$, where $w$ are the weights and $\mathcal{W}$ is the set of all possible weights.
- Furthermore, suppose these functions are all $K$-Lipschitz for some $K$. Then we have
$$\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{x \sim P_\theta}[f_w(x)] \;\le\; \sup_{\|f\|_L \le K} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \;=\; K \cdot W(P_r, P_\theta)$$
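The factor $K$ follows from a rescaling argument: $f$ is $K$-Lipschitz exactly when $h = f / K$ is 1-Lipschitz. The symbol $h$ is introduced here only for this sketch.

```latex
% Substitute f = K h, where h ranges over 1-Lipschitz functions.
\begin{aligned}
\sup_{\|f\|_L \le K} \Big( \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \Big)
&= \sup_{\|h\|_L \le 1} \Big( \mathbb{E}_{x \sim P_r}[K h(x)] - \mathbb{E}_{x \sim P_\theta}[K h(x)] \Big) \\
&= K \sup_{\|h\|_L \le 1} \Big( \mathbb{E}_{x \sim P_r}[h(x)] - \mathbb{E}_{x \sim P_\theta}[h(x)] \Big) \\
&= K \cdot W(P_r, P_\theta).
\end{aligned}
```

The inequality for the parametrized family then holds because $\{f_w\}_{w \in \mathcal{W}}$ is a subset of all $K$-Lipschitz functions, so its maximum cannot exceed the supremum over the larger set.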

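Wasserstein GAN Algorithm
Looping all this back to generative models, we'd like to train $P_\theta = g_\theta(Z)$ to match $P_r$. Intuitively, given a fixed $g_\theta$, we can compute the optimal $f_w$ for the Wasserstein distance. We can then backpropagate through $W(P_r, g_\theta(Z))$ to get the gradient for $\theta$:
$$\nabla_\theta W(P_r, P_\theta) = \nabla_\theta \Big( \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim Z}[f_w(g_\theta(z))] \Big) = -\mathbb{E}_{z \sim Z}\big[\nabla_\theta f_w(g_\theta(z))\big]$$
The first expectation does not depend on $\theta$, so only the second term contributes.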

Wasserstein GAN Algorithm
The training process now has three steps:
1. For a fixed $\theta$, compute an approximation of $W(P_r, P_\theta)$ by training $f_w$ to convergence.
2. Once we have the optimal $f_w$, compute the $\theta$ gradient $-\mathbb{E}_{z \sim Z}[\nabla_\theta f_w(g_\theta(z))]$ by sampling several $z \sim Z$.
3. Update $\theta$, and repeat the process.
Important Detail
The entire derivation only works when the function family $\{f_w\}_{w \in \mathcal{W}}$ is $K$-Lipschitz.
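The three steps can be sketched on a one-dimensional toy problem. Everything specific here is an assumption made for illustration, not the slides' setup: the generator is $g_\theta(z) = \theta + z$, the critic is linear, $f_w(x) = w x$, "training to convergence" is approximated by a fixed number of ascent steps, and the Lipschitz constraint is enforced by clipping $w$ to $[-c, c]$ (so $f_w$ is $c$-Lipschitz).

```python
import random

def train_toy_wgan(real_mean=3.0, outer_steps=200, critic_steps=50, batch=32,
                   lr_critic=0.5, lr_gen=0.05, clip=1.0, seed=0):
    rng = random.Random(seed)

    def noise():
        return rng.uniform(-0.5, 0.5)

    theta = 0.0  # generator parameter: g_theta(z) = theta + z
    w = 0.0      # critic parameter: f_w(x) = w * x, Lipschitz constant |w| <= clip
    for _ in range(outer_steps):
        # Step 1: for fixed theta, ascend E[f_w(real)] - E[f_w(fake)] in w.
        for _ in range(critic_steps):
            real = [real_mean + noise() for _ in range(batch)]
            fake = [theta + noise() for _ in range(batch)]
            grad_w = sum(real) / batch - sum(fake) / batch
            w = max(-clip, min(clip, w + lr_critic * grad_w))  # weight clipping
        # Step 2: theta gradient is -E_z[d/dtheta f_w(g_theta(z))]; here that derivative is w.
        grad_theta = -w
        # Step 3: gradient descent on theta, then repeat.
        theta -= lr_gen * grad_theta
    return theta, w

theta, w = train_toy_wgan()
print(round(theta, 2), round(w, 2))
```

In this toy setting the generator parameter drifts toward the data mean at a rate set by the clipping constant, then hovers around it; the critic's sign tells the generator which direction to move.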

Comparison with Standard GANs
- In GANs, the discriminator maximizes
$$\frac{1}{m} \sum_{i=1}^{m} \log D(x^{(i)}) + \frac{1}{m} \sum_{i=1}^{m} \log\big(1 - D(g_\theta(z^{(i)}))\big)$$
where we constrain $D(x)$ to always be a probability $p \in (0, 1)$.
- In WGANs, nothing requires $f_w$ to output a probability, hence it is referred to as a critic instead of a discriminator.
- Although GANs are formulated as a min-max problem, in practice we never train $D$ to convergence.
- Consequently, we're updating $G$ against an objective that only approximately aims towards the JS divergence, without reaching it.
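The difference in output constraints shows up directly when the two empirical objectives are written out. This is a minimal sketch with hypothetical function names; the inputs are the already-computed network outputs on a batch.

```python
import math

def gan_discriminator_objective(d_real, d_fake):
    """(1/m) sum log D(x_i) + (1/m) sum log(1 - D(g(z_i))); outputs must lie in (0, 1)."""
    if not all(0.0 < p < 1.0 for p in list(d_real) + list(d_fake)):
        raise ValueError("discriminator outputs must be probabilities in (0, 1)")
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def wgan_critic_objective(f_real, f_fake):
    """(1/m) sum f_w(x_i) - (1/m) sum f_w(g(z_i)); scores are unconstrained reals."""
    return sum(f_real) / len(f_real) - sum(f_fake) / len(f_fake)

print(gan_discriminator_objective([0.5, 0.5], [0.5, 0.5]))  # 2 * log(0.5)
print(wgan_critic_objective([2.0, 4.0], [-1.0, 1.0]))        # 3.0
```

The critic objective accepts any real-valued scores, which is why the Lipschitz constraint, rather than a sigmoid output, is what keeps it bounded.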

Properties of optimal WGAN critic
- An open question is how to effectively enforce the Lipschitz constraint on the critic.
- Previously (Arjovsky et al. 2017) we've seen that you can clip the weights of the critic to lie within the compact space $[-c, c]$.
- To understand why weight clipping is problematic in a WGAN critic, we need to understand the properties of the optimal WGAN critic.

Properties of optimal WGAN critic
If the optimal critic $D^*$ under the Kantorovich-Rubinstein dual is differentiable, and $x$ is a point from our generator distribution $P_\theta$, then there is a point $y$ sampled from the true distribution $P_r$ such that the gradient of $D^*$ at every point $x_t = (1 - t)x + ty$ on the straight line between $x$ and $y$ points directly towards $y$, i.e. $\nabla D^*(x_t) = \frac{y - x_t}{\|y - x_t\|}$.

Gradient penalty
- We consider an alternative method to enforce the Lipschitz constraint on the training objective.
- A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere.
- This implies we should directly constrain the gradient norm of our critic function with respect to its input.

WGAN with gradient penalty
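The slide content for this step did not survive extraction, so the following is only a sketch of one common way to penalize the critic's gradient norm. It assumes the widely used form $\lambda \, (\|\nabla_{\hat{x}} f(\hat{x})\| - 1)^2$ evaluated at random interpolates $\hat{x}$ between real and generated samples, with $\lambda = 10$; the finite-difference gradient and all function names are illustrative stand-ins for automatic differentiation.

```python
import math
import random

def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at the point x (a list)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def gradient_penalty(critic, real, fake, lam=10.0, seed=0):
    """Average of lam * (||grad critic(x_hat)|| - 1)^2 over interpolated points x_hat."""
    rng = random.Random(seed)
    total = 0.0
    for x, x_tilde in zip(real, fake):
        eps = rng.random()  # interpolation weight in [0, 1)
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x, x_tilde)]
        norm = math.sqrt(sum(g * g for g in numerical_gradient(critic, x_hat)))
        total += (norm - 1.0) ** 2
    return lam * total / len(real)

def linear_critic(weights):
    """Toy critic f(x) = <weights, x>, whose input gradient is `weights` everywhere."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x))

real = [[0.0, 1.0], [1.0, 1.0]]
fake = [[2.0, 0.0], [3.0, -1.0]]
print(gradient_penalty(linear_critic([0.6, 0.8]), real, fake))  # near 0: unit gradient norm
print(gradient_penalty(linear_critic([3.0, 4.0]), real, fake))  # near 160 = 10 * (5 - 1)^2
```

In training, a term like this would be added to the critic loss in place of weight clipping, so the constraint is encouraged through the objective rather than imposed on the weights.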
