

Nishant Gurnani, GAN Reading Group, April 14th, 2017

Why are these Papers Important?
- Recently a large number of GAN frameworks have been proposed: BGAN, LSGAN, DCGAN, DiscoGAN, ...
- Wasserstein GAN is yet another GAN training algorithm, but it is backed by rigorous theory in addition to good performance.
- WGAN removes the need to balance generator updates with discriminator updates, removing a key source of training instability in the original GAN.
- Empirical results show a correlation between discriminator loss and perceptual quality, providing a rough measure of training progress.
- Goal: convince you that WGAN is the best GAN.

Outline
- Introduction
- Different Distances
- Standard GAN
- Wasserstein GAN
- Improved Wasserstein GAN

Introduction

What does it mean to learn a probability distribution?
When learning generative models, we assume our data comes from some unknown distribution P_r. We want to learn a distribution P_θ that approximates P_r, where θ are the parameters of the distribution. There are two approaches for doing this:
1. Directly learn the probability density function P_θ and optimize it through maximum likelihood estimation.
2. Learn a function that transforms an existing distribution Z into P_θ. Here g_θ is some differentiable function, Z is a common distribution (usually uniform or Gaussian), and P_θ = g_θ(Z).

Maximum Likelihood approach is problematic
Recall that for continuous distributions P and Q the KL divergence is

KL(P \| Q) = \int_x P(x) \log \frac{P(x)}{Q(x)} \, dx

and given a density P_θ, the MLE objective is

\max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^m \log P_\theta(x^{(i)})

In the limit (as m → ∞), samples appear according to the data distribution P_r, so

\lim_{m \to \infty} \max_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^m \log P_\theta(x^{(i)}) = \max_{\theta} \int_x P_r(x) \log P_\theta(x) \, dx
= \min_{\theta} -\int_x P_r(x) \log P_\theta(x) \, dx
= \min_{\theta} \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)} \, dx \quad \text{(adding } \int_x P_r(x) \log P_r(x) \, dx \text{, which is constant in } \theta\text{)}
= \min_{\theta \in \mathbb{R}^d} KL(P_r \| P_\theta)

- Note: if P_θ = 0 at a point x where P_r > 0, the KL divergence goes to +∞ (bad for the MLE if P_θ has low dimensional support).
- The typical remedy is to add a noise term to the model distribution to ensure the density is defined everywhere.
- This unfortunately introduces some error, and empirically people have needed to add a lot of random noise to make models train. A numerical illustration follows below.
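A quick numerical illustration of the blow-up and the noise remedy (a minimal sketch assuming NumPy; the three-point distributions are invented for illustration):

    import numpy as np

    # KL(P || Q) for discrete distributions; terms where P = 0 contribute zero.
    def kl(p, q):
        with np.errstate(divide="ignore", invalid="ignore"):
            return float(np.sum(np.where(p > 0, p * np.log(p / q), 0.0)))

    p_r     = np.array([0.5, 0.5, 0.0])   # "data" distribution
    p_theta = np.array([0.0, 0.5, 0.5])   # model puts zero mass where p_r > 0

    print(kl(p_r, p_theta))               # inf: the MLE objective blows up
    p_noisy = 0.9 * p_theta + 0.1 / 3     # mix in a little uniform noise
    print(kl(p_r, p_noisy))               # finite, at the cost of a blurrier model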

Transforming an existing distribution
Shortcomings of the maximum likelihood approach motivate the second approach of learning a g_θ (a generator) to transform a known distribution Z. Advantages of this approach:
- Unlike densities, this approach can represent distributions confined to a low dimensional manifold.
- It's very easy to generate samples: given a trained g_θ, simply sample random noise z ∼ Z and evaluate g_θ(z) (see the sketch below).
- VAEs and GANs are well known examples of this approach. VAEs focus on the approximate likelihood of the examples and so share the limitation that you need to fiddle with additional noise terms. GANs offer much more flexibility, but their training is unstable.
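A minimal sampling sketch assuming PyTorch; the layer sizes and the 64-dimensional Gaussian noise are arbitrary choices for illustration:

    import torch
    import torch.nn as nn

    # A toy generator g_theta: a differentiable map from noise to data space.
    g_theta = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))

    z = torch.randn(1000, 64)   # z ~ Z, here a standard Gaussian
    samples = g_theta(z)        # 1000 draws from P_theta = g_theta(Z), no density needed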

Transforming an existing distribution
To train g_θ (and by extension P_θ), we need a measure of distance between distributions, i.e. d(P_r, P_θ).

Distance Properties
- A distance d' is weaker than a distance d if every sequence that converges under d also converges under d'.
- Given a distance d, we can treat d(P_r, P_θ) as a loss function.
- We can minimize d(P_r, P_θ) with respect to θ as long as the mapping θ ↦ P_θ is continuous (true if g_θ is a neural network).
- How do we define d?

Different Distances

Distance definitions
Notation:
- \chi: a compact metric set (such as the space of images [0, 1]^d)
- \Sigma: the set of all Borel subsets of \chi
- Prob(\chi): the space of probability measures defined on \chi

Total Variation (TV) distance:

\delta(P_r, P_\theta) = \sup_{A \in \Sigma} |P_r(A) - P_\theta(A)|

Kullback-Leibler (KL) divergence:

KL(P_r \| P_\theta) = \int_x P_r(x) \log \frac{P_r(x)}{P_\theta(x)} \, d\mu(x)

where both P_r and P_θ are assumed to be absolutely continuous with respect to the same measure μ defined on \chi.

Distance Definitions
Jensen-Shannon (JS) divergence:

JS(P_r, P_\theta) = \tfrac{1}{2} KL(P_r \| P_m) + \tfrac{1}{2} KL(P_\theta \| P_m)

where P_m is the mixture (P_r + P_θ)/2.

Earth-Mover (EM) distance or Wasserstein-1:

W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]

where \Pi(P_r, P_\theta) denotes the set of all joint distributions γ(x, y) whose marginals are respectively P_r and P_θ.

Understanding the EM distance
Main Idea: probability distributions are defined by how much mass they put on each point.
- Imagine we start with distribution P_r, and want to move mass around to change the distribution into P_θ.
- Moving mass m by distance d costs m · d effort.
- The earth mover distance is the minimal effort we need to spend.

Transport Plan: each γ ∈ Π is a transport plan; to execute the plan, for all x, y, move γ(x, y) mass from x to y.

Understanding the EM distance
What properties does the plan need to satisfy to transform P_r into P_θ?
- The amount of mass that leaves x is \int_y \gamma(x, y) \, dy. This must equal P_r(x), the amount of mass originally at x.
- The amount of mass that enters y is \int_x \gamma(x, y) \, dx. This must equal P_θ(y), the amount of mass that ends up at y.
- This shows why the marginals of γ ∈ Π must be P_r and P_θ. For scoring, the effort spent is

\int_x \int_y \gamma(x, y) \|x - y\| \, dy \, dx = \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]

- Computing the infimum of this over all valid γ gives the earth mover distance (a small discrete example follows below).
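To make the plan concrete, here is a small discrete version solved as a linear program over γ (a sketch assuming SciPy; the three-point masses are invented for illustration):

    import numpy as np
    from scipy.optimize import linprog

    x = np.array([0.0, 1.0, 2.0])                    # support points
    p = np.array([0.5, 0.5, 0.0])                    # source masses (P_r)
    q = np.array([0.0, 0.5, 0.5])                    # target masses (P_theta)
    n = len(x)
    cost = np.abs(x[:, None] - x[None, :]).ravel()   # |x - y| for each (x, y) pair

    # Marginal constraints: each row of gamma sums to p[i], each column to q[j].
    A_eq, b_eq = [], np.concatenate([p, q])
    for i in range(n):
        row = np.zeros((n, n)); row[i, :] = 1.0; A_eq.append(row.ravel())
    for j in range(n):
        col = np.zeros((n, n)); col[:, j] = 1.0; A_eq.append(col.ravel())

    res = linprog(cost, A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
    gamma = res.x.reshape(n, n)   # the optimal transport plan
    print(res.fun)                # earth mover distance: 1.0 for this example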

Learning Parallel Lines Example
Let Z ∼ U[0, 1] and let P_0 be the distribution of (0, Z) ∈ R^2, uniform on a straight vertical line passing through the origin. Now let g_θ(z) = (θ, z), with θ a single real parameter.
We'd like our optimization algorithm to learn to move θ to 0. As θ → 0, the distance d(P_0, P_θ) should decrease.

Learning Parallel Lines Example
For many common distance functions, this doesn't happen:
- \delta(P_0, P_\theta) = 1 if θ ≠ 0, and 0 if θ = 0
- JS(P_0, P_\theta) = \log 2 if θ ≠ 0, and 0 if θ = 0
- KL(P_\theta \| P_0) = KL(P_0 \| P_\theta) = +\infty if θ ≠ 0, and 0 if θ = 0
- W(P_0, P_\theta) = |\theta|
A quick numerical check of these values appears below.
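A 1-D surrogate of this example (all mass at 0 versus all mass at θ) makes the table easy to verify; a sketch assuming SciPy, whose jensenshannon returns the square root of the JS divergence:

    import numpy as np
    from scipy.stats import wasserstein_distance
    from scipy.spatial.distance import jensenshannon

    p, q = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # disjoint supports when theta != 0
    for theta in [1.0, 0.1, 0.001]:
        w = wasserstein_distance([0.0], [theta])   # W(P_0, P_theta) = |theta| -> 0
        js = jensenshannon(p, q) ** 2              # stuck at log 2 ~= 0.693 for any theta != 0
        print(f"theta={theta}: W={w:.4f}, JS={js:.4f}")

Only W shrinks as θ → 0, so only W provides a usable training signal here.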

Theoretical Justification
Theorem (1). Let P_r be a fixed distribution over \chi. Let Z be a random variable over another space \mathcal{Z}. Let g : \mathcal{Z} \times \mathbb{R}^d \to \chi be a function, denoted g_θ(z) with z the first coordinate and θ the second. Let P_θ denote the distribution of g_θ(Z). Then:
1. If g is continuous in θ, so is W(P_r, P_θ).
2. If g is locally Lipschitz and satisfies regularity Assumption 1, then W(P_r, P_θ) is continuous everywhere and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(P_r, P_θ) and all the KLs.

Theoretical Justification
Theorem (2). Let P be a distribution on a compact space \chi and (P_n)_{n \in \mathbb{N}} be a sequence of distributions on \chi. Then, considering all limits as n → ∞:
1. The following statements are equivalent:
   - \delta(P_n, P) \to 0
   - JS(P_n, P) \to 0
2. The following statements are equivalent:
   - W(P_n, P) \to 0
   - P_n \xrightarrow{D} P, where \xrightarrow{D} represents convergence in distribution for random variables
3. KL(P_n \| P) \to 0 or KL(P \| P_n) \to 0 imply the statements in (1).
4. The statements in (1) imply the statements in (2).

Standard GAN

Generative Adversarial Networks
Recall that the GAN training strategy is to define a game between two competing networks:
- The generator network G maps a source of noise to the input space.
- The discriminator network D receives either a generated sample or a true data sample and must distinguish between the two.
- The generator is trained to fool the discriminator.

Generative Adversarial Networks
Formally, we can express the game between the generator G and the discriminator D with the minimax objective:

\min_G \max_D \; \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim P_g}[\log(1 - D(\tilde{x}))]

where P_r is the data distribution and P_g is the model distribution implicitly defined by \tilde{x} = G(z), z ∼ p(z). A minimal sketch of this objective appears below.
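A minimal sketch assuming PyTorch; D, G, and the noise dimension are hypothetical placeholders, and D is assumed to end in a sigmoid so its output is a probability in (0, 1):

    import torch
    import torch.nn.functional as F

    # One step of the minimax game for hypothetical networks D and G.
    def gan_step(D, G, real, z_dim=64):
        fake = G(torch.randn(real.size(0), z_dim))
        ones, zeros = torch.ones(real.size(0), 1), torch.zeros(real.size(0), 1)
        # Discriminator ascends log D(x) + log(1 - D(G(z))), written as BCE minimization.
        d_loss = F.binary_cross_entropy(D(real), ones) + \
                 F.binary_cross_entropy(D(fake.detach()), zeros)
        # Generator: the common non-saturating variant, maximizing log D(G(z)).
        g_loss = F.binary_cross_entropy(D(fake), ones)
        return d_loss, g_loss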

Generative Adversarial Networks: Remarks
- If the discriminator is trained to optimality before each generator parameter update, minimizing the value function amounts to minimizing the Jensen-Shannon divergence between the data and model distributions.
- This is expensive and often leads to vanishing gradients as the discriminator saturates.
- In practice, this requirement is relaxed, and the generator and discriminator are updated simultaneously.

Wasserstein GAN

Kantorovich-Rubinstein Duality
Unfortunately, computing the Wasserstein distance exactly is intractable:

W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma}[\|x - y\|]

However, the Kantorovich-Rubinstein duality (Villani 2008) shows W is equivalent to

W(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)]

where the supremum is taken over all 1-Lipschitz functions.
Calculating this is still intractable, but now it's easier to approximate.

Wasserstein GAN Approximation
- Note that if we replace the supremum over 1-Lipschitz functions with the supremum over K-Lipschitz functions, then the supremum is K · W(P_r, P_θ) instead.
- Suppose we have a parametrized function family \{f_w\}_{w \in \mathcal{W}}, where w are the weights and \mathcal{W} is the set of all possible weights. Furthermore, suppose these functions are all K-Lipschitz for some K. Then we have

\max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{x \sim P_\theta}[f_w(x)] \;\le\; \sup_{\|f\|_L \le K} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_\theta}[f(x)] \;=\; K \cdot W(P_r, P_\theta)

- If \{f_w\}_{w \in \mathcal{W}} contains the true supremum among K-Lipschitz functions, this gives the distance exactly. In practice this won't be true!

Wasserstein GAN Algorithm
Looping all this back to generative models, we'd like to train P_θ = g_θ(Z) to match P_r.
- Intuitively, given a fixed g_θ, we can compute the optimal f_w for the Wasserstein distance.
- We can then backpropagate through W(P_r, g_θ(Z)) to get the gradient for θ (the first expectation does not depend on θ, hence the minus sign):

\nabla_\theta W(P_r, P_\theta) = \nabla_\theta \left( \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim Z}[f_w(g_\theta(z))] \right) = -\mathbb{E}_{z \sim Z}[\nabla_\theta f_w(g_\theta(z))]

Wasserstein GAN Algorithm
The training process now has three steps:
1. For a fixed θ, compute an approximation of W(P_r, P_θ) by training f_w to convergence.
2. Once we have the optimal f_w, compute the θ gradient -\mathbb{E}_{z \sim Z}[\nabla_\theta f_w(g_\theta(z))] by sampling several z ∼ Z.
3. Update θ, and repeat the process.

Important Detail
The entire derivation only works when the function family \{f_w\}_{w \in \mathcal{W}} is K-Lipschitz. To guarantee this is true, the authors use weight clipping: the weights w are constrained to lie within [-c, c] by clipping w after every update.

Wasserstein GAN Algorithm
(The slide shows the paper's algorithm pseudocode as a figure; a sketch of the loop follows below.)
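A minimal sketch of the same loop assuming PyTorch; critic, generator, and real_loader are hypothetical stand-ins for your models and data, and the hyperparameters are the paper's stated defaults (RMSProp, lr = 5e-5, c = 0.01, n_critic = 5):

    import torch

    def train_wgan(critic, generator, real_loader, z_dim=64,
                   n_critic=5, clip_c=0.01, lr=5e-5):
        opt_c = torch.optim.RMSprop(critic.parameters(), lr=lr)
        opt_g = torch.optim.RMSprop(generator.parameters(), lr=lr)
        for real in real_loader:
            # Step 1: approximate W(P_r, P_theta) by training the critic f_w.
            for _ in range(n_critic):
                z = torch.randn(real.size(0), z_dim)
                loss_c = critic(generator(z).detach()).mean() - critic(real).mean()
                opt_c.zero_grad(); loss_c.backward(); opt_c.step()
                for p in critic.parameters():        # weight clipping keeps f_w K-Lipschitz
                    p.data.clamp_(-clip_c, clip_c)
            # Steps 2-3: generator descends -E_z[f_w(g_theta(z))], then repeat.
            z = torch.randn(real.size(0), z_dim)
            loss_g = -critic(generator(z)).mean()
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()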

Comparison with Standard GANs
In GANs, the discriminator maximizes

\frac{1}{m} \sum_{i=1}^m \log D(x^{(i)}) + \frac{1}{m} \sum_{i=1}^m \log(1 - D(g_\theta(z^{(i)})))

where we constrain D(x) to always be a probability p ∈ (0, 1).
- In WGANs, nothing requires f_w to output a probability, hence it is referred to as a critic instead of a discriminator.
- Although GANs are formulated as a min-max problem, in practice we never train D to convergence. Consequently, we're updating G against an objective that roughly aims towards the JS divergence but doesn't go all the way.
- In contrast, because the Wasserstein distance is differentiable nearly everywhere, we can (and should) train f_w to convergence before each generator update, to get as accurate an estimate of W(P_r, P_θ) as possible.

Improved Wasserstein GAN

Properties of the optimal WGAN critic
- An open question is how to effectively enforce the Lipschitz constraint on the critic.
- Previously (Arjovsky et al. 2017) we've seen that you can clip the weights of the critic to lie within the compact space [-c, c].
- To understand why weight clipping is problematic in the WGAN critic, we need to understand the properties of the optimal WGAN critic.

Properties of the optimal WGAN critic
If the optimal critic D^* under the Kantorovich-Rubinstein dual is differentiable, and x is a point from our generator distribution P_θ, then there is a point y from the true distribution P_r such that at every point x_t = (1 - t)x + ty on the straight line between x and y, the gradient of D^* points from x_t toward y. In other words,

\nabla D^*(x_t) = \frac{y - x_t}{\|y - x_t\|}

This implies that the optimal WGAN critic has gradients with norm 1 almost everywhere under P_r and P_θ.

Gradient penalty
- We consider an alternative method to enforce the Lipschitz constraint on the training objective.
- A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere.
- This implies we should directly constrain the gradient norm of our critic function with respect to its input. Enforcing a soft version of this constraint, we get the objective on the final slide.

WGAN with gradient penalty
(The slide shows the WGAN-GP objective as a figure.)
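For reference, the critic objective of Gulrajani et al. (2017) is

L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right]

where \hat{x} is sampled uniformly along straight lines between pairs of points drawn from P_r and P_g, and λ = 10 in the paper. A minimal penalty sketch, assuming PyTorch and batched vector inputs of shape (batch, d):

    import torch

    def gradient_penalty(critic, real, fake, lam=10.0):
        # x_hat: uniform interpolates between real and generated points.
        eps = torch.rand(real.size(0), 1)
        x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
        # create_graph=True so the penalty itself can be backpropagated through.
        grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
        return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()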