Gradient descent GAN optimization is locally stable

1 Gradient descent GAN optimization is locally stable
Advances in Neural Information Processing Systems, 2017
Vaishnavh Nagarajan, J. Zico Kolter
Carnegie Mellon University
05 January 2018
Presented by: Kevin Liang

2 Generative Adversarial Networks (GANs)
Learn a generative model as a game between a generator (G) and a discriminator (D), with the former trying to fool the latter into thinking the generated images are real:
$$V(\theta_G, \theta_D) = \mathbb{E}_{x \sim p_{\text{real}}}[\log D(x)] + \mathbb{E}_{z \sim p_{\text{latent}}}[\log(1 - D(G(z)))] \tag{1}$$
Karras et al. 2017: Progressive Growing of GANs for Improved Quality, Stability, and Variation
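As a rough illustration of equation (1), the sketch below estimates $V(\theta_G, \theta_D)$ by Monte Carlo for a toy one-dimensional problem. The affine generator, the logistic discriminator, and the Gaussian data are all assumptions made for this example; none of them come from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gan_value(theta_g, theta_d, x_real, z):
    """Monte Carlo estimate of V(theta_G, theta_D) from equation (1)."""
    a, c = theta_g  # hypothetical generator G(z) = a*z + c
    w, b = theta_d  # hypothetical discriminator D(x) = sigmoid(w*x + b)
    d_real = sigmoid(w * x_real + b)
    d_fake = sigmoid(w * (a * z + c) + b)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

rng = np.random.default_rng(0)
x_real = rng.normal(loc=2.0, scale=1.0, size=10_000)  # assumed "real" data
z = rng.normal(size=10_000)                           # latent samples

# A discriminator that outputs 1/2 everywhere gives V = 2*log(1/2) = -log(4).
v_blind = gan_value((1.0, 0.0), (0.0, 0.0), x_real, z)
# A discriminator that separates the two distributions achieves a larger V.
v_informed = gan_value((1.0, 0.0), (1.0, -1.0), x_real, z)
print(v_blind, v_informed)
```

The discriminator is rewarded for raising $V$ and the generator for lowering it, which is why the uninformative discriminator sits at the lower value here.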

3 GAN Optimization
Typically treated as a min-max optimization problem:
$$\min_{\theta_G} \max_{\theta_D} V(\theta_G, \theta_D) \tag{2}$$
While GANs have become extremely popular, many of their convergence properties are not well explored, and GANs are typically hard to train (mode collapse, vanishing gradients, etc.).

4 GAN Optimization Instability? Assuming good equilibria exist, are GANs locally stable?

5 GAN Optimization Difficulties
Min-max objective:
$$V(\theta_G, \theta_D) = \mathbb{E}_{x \sim p_{\text{real}}}[\log D(x)] + \mathbb{E}_{z \sim p_{\text{latent}}}[\log(1 - D(G(z)))]$$

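6 Non-linear Systems
GAN updates:
$$\dot{\theta}_D = \nabla_{\theta_D} V(\theta_G, \theta_D), \qquad \dot{\theta}_G = -\nabla_{\theta_G} V(\theta_G, \theta_D) \tag{3}$$
Linearization Theorem: If the Jacobian of a dynamical system
$$J = \left. \frac{\partial \dot{\theta}}{\partial \theta} \right|_{\theta = \theta^*} \tag{4}$$
evaluated at the equilibrium point $\theta^*$ is Hurwitz (all of its eigenvalues have strictly negative real parts), then the ODE converges to $\theta^*$ at an exponential rate from any initialization in some non-empty region around $\theta^*$.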
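The Hurwitz condition in the linearization theorem is easy to check numerically. The sketch below is illustrative only: the two 2x2 Jacobians are hypothetical, and forward Euler integration of the linearized system stands in for the ODE.

```python
import numpy as np

def is_hurwitz(J):
    """True if every eigenvalue of J has a strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(J).real < 0))

def euler_flow(J, theta0, step=0.01, n_steps=2000):
    """Integrate the linearized ODE d(theta)/dt = J @ theta with forward Euler."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta + step * (J @ theta)
    return theta

# Hypothetical Jacobians, with the equilibrium placed at theta* = 0.
J_stable = np.array([[-1.0, 2.0], [-2.0, -0.5]])   # eigenvalues -0.75 +/- 1.98i
J_rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])   # eigenvalues +/- i

for name, J in [("stable", J_stable), ("rotation", J_rotation)]:
    end = euler_flow(J, [1.0, 1.0])
    print(name, is_hurwitz(J), np.linalg.norm(end))
```

The pure rotation is the borderline case: its eigenvalues lie on the imaginary axis, so the theorem gives no convergence guarantee for it.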

7 Generalized set-up
Consider a slightly more generalized form:
$$V(\theta_G, \theta_D) = \mathbb{E}_{x \sim p_{\text{real}}}[f(D(x))] + \mathbb{E}_{z \sim p_{\text{latent}}}[f(-D(G(z)))] \tag{5}$$
where $f : \mathbb{R} \to \mathbb{R}$ is a concave function.
Specific cases:
Traditional GAN: $f(x) = -\log(1 + \exp(-x))$
Wasserstein GAN: $f(x) = x$
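To see why the traditional GAN is a special case, it helps to read $D$ in the generalized form as a real-valued score and to assume that the probability used in equation (1) is the logistic sigmoid $\sigma$ of that score. That reading is an assumption of this short derivation, not something stated on the slide. With $f(x) = -\log(1 + \exp(-x))$:

```latex
\begin{align*}
f(x) &= -\log\bigl(1 + e^{-x}\bigr) = \log \sigma(x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}, \\
f(D(x)) &= \log \sigma(D(x)), \\
f(-D(G(z))) &= \log \sigma(-D(G(z))) = \log\bigl(1 - \sigma(D(G(z)))\bigr).
\end{align*}
```

Substituting these two terms into the generalized objective gives back the log-likelihood form of equation (1), with $\sigma(D(\cdot))$ playing the role of the discriminator's probability output.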

8 Assumptions
Assumption I. $p_{\theta_G^*} = p_{\text{data}}$ and $D_{\theta_D^*}(x) = 0, \ \forall x \in \text{supp}(p_{\text{data}})$
Assumption I (Non-realizable). The discriminator is linear in its parameters $\theta_D$, and furthermore, for any equilibrium point $(\theta_D^*, \theta_G^*)$, $D_{\theta_D^*}(x) = 0, \ \forall x \in \text{supp}(p_{\text{data}}) \cup \text{supp}(p_{\theta_G^*})$
Defines good equilibria for the context of the proof. It also implies that the discriminator and generator are both powerful enough that there are no bad equilibria near such good ones.

9 Assumptions (cont.)
Assumption II. The function $f$ satisfies $f''(0) < 0$ and $f'(0) \neq 0$.
The loss function must be strictly concave around 0, the value the discriminator takes at equilibrium.
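Assumption II can be checked numerically for the two choices of $f$ listed in the generalized set-up. The sketch below uses central finite differences; the helper names, the step size, and the tolerance are choices made for this example.

```python
import numpy as np

def f_gan(x):
    # Traditional GAN choice: f(x) = -log(1 + exp(-x))
    return -np.log1p(np.exp(-x))

def f_wgan(x):
    # Wasserstein GAN choice: f(x) = x
    return x

def derivatives_at_zero(f, h=1e-3):
    """Central finite-difference estimates of f'(0) and f''(0)."""
    first = (f(h) - f(-h)) / (2 * h)
    second = (f(h) - 2 * f(0.0) + f(-h)) / h**2
    return first, second

def satisfies_assumption_ii(f, tol=1e-6):
    """Check f''(0) < 0 and f'(0) != 0 up to a numerical tolerance."""
    first, second = derivatives_at_zero(f)
    return bool(second < -tol and abs(first) > tol)

print(derivatives_at_zero(f_gan), satisfies_assumption_ii(f_gan))
print(derivatives_at_zero(f_wgan), satisfies_assumption_ii(f_wgan))
```

For the traditional GAN, $f'(0) = 1/2$ and $f''(0) = -1/4$, so the assumption holds. The linear $f(x) = x$ has $f''(0) = 0$ and therefore does not satisfy it.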

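10 Assumptions (cont.)
Property I. $g : \Theta \to \mathbb{R}$ satisfies Property I at $\theta^* \in \Theta$ if for any $\theta \in \text{Null}\left(\nabla^2_\theta g(\theta)\big|_{\theta^*}\right)$ the function is locally constant along $\theta$ at $\theta^*$; i.e. $\exists \epsilon > 0$ such that for all $\epsilon' \in (-\epsilon, \epsilon)$, $g(\theta^*) = g(\theta^* + \epsilon' \theta)$
Assumption III. At an equilibrium $(\theta_D^*, \theta_G^*)$, the functions
$$\mathbb{E}_{p_{\text{data}}}\left[D_{\theta_D}^2(x)\right] \quad \text{and} \quad \left. \left\| \mathbb{E}_{p_{\text{data}}}[\nabla_{\theta_D} D_{\theta_D}(x)] - \mathbb{E}_{p_{\theta_G}}[\nabla_{\theta_D} D_{\theta_D}(x)] \right\|^2 \right|_{\theta_D = \theta_D^*}$$
must satisfy Property I in the discriminator and generator space respectively.
Allows for small perturbations of $(\theta_D, \theta_G)$ in certain directions to still yield equilibrium discriminators and generators, as defined in Assumption I.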

11 Assumptions (cont.)
Assumption IV. $\exists \epsilon_G > 0$ such that $\forall \theta_G \in B_{\epsilon_G}(\theta_G^*)$, $\text{supp}(p_{\theta_G}) = \text{supp}(p_{\text{data}})$.
Assumption IV (Relaxed). $\exists \epsilon_G > 0$ such that $\forall x \in \bigcup_{\theta_G \in B_{\epsilon_G}(\theta_G^*)} \text{supp}(p_{\theta_G})$, $D_{\theta_D^*}(x) = 0$.
All generators within a sufficiently small neighborhood of the equilibrium have the same support as the true distribution; or (weaker), the union of the supports of such generators is small enough that the equilibrium discriminator is still zero.

12 Main Result
$$J = \begin{bmatrix} J_{DD} & J_{DG} \\ -J_{DG}^T & J_{GG} \end{bmatrix} = \begin{bmatrix} J_{DD} & J_{DG} \\ -J_{DG}^T & 0 \end{bmatrix}$$
Contribution: Under suitable conditions on the representational powers of the discriminator and the generator, the resulting GAN dynamical system is locally exponentially stable.
In other words, any initialization sufficiently close to the equilibrium will converge to the equilibrium.
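A small numerical experiment can make the block structure of this Jacobian concrete. The sketch below assumes, purely for illustration, a symmetric negative definite $J_{DD}$ and a full-column-rank $J_{DG}$; both blocks are random and hypothetical, and these two conditions are assumptions of the example rather than statements quoted from the slides.

```python
import numpy as np

def gan_jacobian(J_DD, J_DG):
    """Assemble J = [[J_DD, J_DG], [-J_DG^T, 0]] from its blocks."""
    n_g = J_DG.shape[1]
    return np.block([[J_DD, J_DG], [-J_DG.T, np.zeros((n_g, n_g))]])

rng = np.random.default_rng(0)
n_d, n_g = 4, 3  # hypothetical numbers of discriminator and generator parameters

A = rng.normal(size=(n_d, n_d))
J_DD = -(A @ A.T + 0.1 * np.eye(n_d))  # assumed symmetric negative definite
J_DG = rng.normal(size=(n_d, n_g))     # assumed full column rank

J = gan_jacobian(J_DD, J_DG)
max_real_part = np.linalg.eigvals(J).real.max()
print(np.linalg.matrix_rank(J_DG), max_real_part)
```

Under these assumptions every eigenvalue has a negative real part even though the generator block is zero: the skew-symmetric coupling through $J_{DG}$ passes the discriminator's damping on to the generator coordinates.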

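13 Stabilizing optimization via gradient-based regularization
GAN updates with an added regularizer:
$$\theta_D := \theta_D + \alpha \nabla_{\theta_D} V(\theta_G, \theta_D)$$
$$\theta_G := \theta_G - \alpha \nabla_{\theta_G} \left( V(\theta_G, \theta_D) + \eta \left\| \nabla_{\theta_D} V(\theta_G, \theta_D) \right\|^2 \right)$$
Transforms the Jacobian to the following:
$$J = \begin{bmatrix} J_{DD} & J_{DG} \\ -J_{DG}^T (I + 2\eta J_{DD}) & -2\eta J_{DG}^T J_{DG} \end{bmatrix}$$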
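The effect of the regularizer on the spectrum can be illustrated on a tiny example. The sketch below builds the regularized Jacobian from its blocks and compares $\eta = 0$ with $\eta = 0.5$ for a hypothetical case in which $J_{DD} = 0$; the specific matrices and the value of $\eta$ are assumptions chosen so the eigenvalues are easy to verify by hand.

```python
import numpy as np

def regularized_jacobian(J_DD, J_DG, eta):
    """Blocks [[J_DD, J_DG], [-J_DG^T (I + 2 eta J_DD), -2 eta J_DG^T J_DG]]."""
    n_d = J_DD.shape[0]
    lower_left = -J_DG.T @ (np.eye(n_d) + 2.0 * eta * J_DD)
    lower_right = -2.0 * eta * J_DG.T @ J_DG
    return np.block([[J_DD, J_DG], [lower_left, lower_right]])

# Hypothetical blocks: no damping in the discriminator, invertible coupling.
J_DD = np.zeros((2, 2))
J_DG = np.array([[1.0, 0.0], [0.0, 2.0]])

# eta = 0 recovers the unregularized Jacobian; its eigenvalues are +/- i and +/- 2i.
plain = np.linalg.eigvals(regularized_jacobian(J_DD, J_DG, eta=0.0)).real.max()
# eta = 0.5 moves every eigenvalue into the open left half-plane.
regularized = np.linalg.eigvals(regularized_jacobian(J_DD, J_DG, eta=0.5)).real.max()
print(plain, regularized)
```

In this example the unregularized system only rotates around the equilibrium, while the $-2\eta J_{DG}^T J_{DG}$ block adds damping on the generator side and makes the Jacobian Hurwitz.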

14 Experimental Results

15 Experimental Results

arXiv:1706.04156v3 [cs.LG] 13 Jan 2018. Gradient descent GAN optimization is locally stable. Vaishnavh Nagarajan, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, vaishnavh@cs.cmu.edu