The Numerics of GANs


The Numerics of GANs
Lars Mescheder¹, Sebastian Nowozin², Andreas Geiger¹
¹ Autonomous Vision Group, MPI Tübingen
² Machine Intelligence and Perception Group, Microsoft Research
Presented by: Dinghan Shen, Feb 2, 2018

Main idea

Motivation:
- While very powerful, GANs are notoriously hard to train.
- This work aims to stabilize the training of GANs from the perspective of finding local Nash-equilibria of smooth games.

Contributions:
- Identify the main reasons why simultaneous gradient ascent often fails to find local Nash-equilibria.
- Design a new, more robust algorithm for finding Nash-equilibria of smooth two-player games.

Background

Consider GANs as smooth two-player games. A differentiable two-player game is defined by two utility functions $f(\phi, \theta)$ and $g(\phi, \theta)$, defined over a common space $(\phi, \theta) \in \Omega_1 \times \Omega_2$; $\Omega_1$ corresponds to the possible actions of player 1, $\Omega_2$ to the possible actions of player 2. The goal of player 1 is to maximize $f$, whereas player 2 tries to maximize $g$. In the context of GANs, $\Omega_1$ is the set of possible parameter values for the generator, whereas $\Omega_2$ is the set of possible parameter values for the discriminator.

Background

Our goal is to find a Nash-equilibrium of the game, i.e. a point $\bar x = (\bar\phi, \bar\theta)$ given by the two conditions:

$\bar\phi \in \operatorname{argmax}_\phi f(\phi, \bar\theta)$ and $\bar\theta \in \operatorname{argmax}_\theta g(\bar\phi, \theta)$. (1)

We call a point $(\bar\phi, \bar\theta)$ a local Nash-equilibrium if (1) holds in a local neighborhood of $(\bar\phi, \bar\theta)$.

The gradient vector field of a differentiable two-player game is defined as:

$v(\phi, \theta) = \begin{pmatrix} \nabla_\phi f(\phi, \theta) \\ \nabla_\theta g(\phi, \theta) \end{pmatrix}$. (2)

Corollary: For zero-sum games, $v'(\bar x)$ is negative semi-definite for any local Nash-equilibrium $\bar x$. Conversely, if $\bar x$ is a stationary point of $v(x)$ and $v'(\bar x)$ is negative definite, then $\bar x$ is a local Nash-equilibrium.
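To make the vector field concrete, here is a minimal sketch in Python/JAX (my own illustration, not the authors' code) that builds $v(\phi, \theta)$ from the two utilities via autodiff, using the classic bilinear zero-sum game $f(\phi, \theta) = \phi \cdot \theta$, $g = -f$ as a toy example:

```python
import jax
import jax.numpy as jnp

# Toy zero-sum game: player 1 maximizes f, player 2 maximizes g = -f.
# The bilinear game f(phi, theta) = phi . theta is the standard example
# whose gradient dynamics cycle around the equilibrium at the origin.
def f(phi, theta):
    return jnp.dot(phi, theta)

def g(phi, theta):
    return -f(phi, theta)

def v(phi, theta):
    """Gradient vector field v = (grad_phi f, grad_theta g), as in Eq. (2)."""
    return (jax.grad(f, argnums=0)(phi, theta),
            jax.grad(g, argnums=1)(phi, theta))

phi0 = jnp.array([1.0])
theta0 = jnp.array([1.0])
print(v(phi0, theta0))  # (theta0, -phi0): the field rotates around the origin
```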

Simultaneous Gradient Ascent

The de facto standard algorithm for finding Nash-equilibria of general smooth two-player games is Simultaneous Gradient Ascent (SimGA). Its main idea: iteratively update the parameters of the two players by simultaneously applying gradient ascent to the two utility functions. The paper shows that the two major failure causes for this algorithm are: 1) eigenvalues of the Jacobian of the associated gradient vector field with zero real part; 2) eigenvalues with a large imaginary part.
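Continuing the toy example from the previous slide, a minimal sketch of the SimGA update (step size and iteration count are illustrative choices):

```python
# Simultaneous gradient ascent: both players step along their own
# gradients, evaluated at the *same* current iterate.
h = 0.1  # step size (illustrative choice)
phi, theta = phi0, theta0
for _ in range(100):
    d_phi, d_theta = v(phi, theta)
    phi, theta = phi + h * d_phi, theta + h * d_theta
print(phi, theta)
```

For the bilinear game the iterates spiral outward instead of converging: the Jacobian of $v$ has purely imaginary eigenvalues, which is exactly the first failure mode named above.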

Convergence Theory

In numerics, we often consider functions of the form

$F(x) = x + h\,G(x)$. (3)

Finding fixed points of $F$ (points with $F(x) = x$) is then equivalent to finding solutions of the nonlinear equation $G(x) = 0$. The Jacobian of $F$ is given by:

$F'(x) = I + h\,G'(x)$. (4)

In general neither $F'(x)$ nor $G'(x)$ is symmetric, so both can have complex eigenvalues.

Convergence Theory

Lemma: Assume that $A \in \mathbb{R}^{n \times n}$ only has eigenvalues with negative real part and let $h > 0$. Then the eigenvalues of the matrix $I + hA$ lie in the unit ball if and only if

$h < \dfrac{1}{|\Re(\lambda)|} \cdot \dfrac{2}{1 + \left(\Im(\lambda)/\Re(\lambda)\right)^2}$ (5)

for all eigenvalues $\lambda$ of $A$.

As $q = \left|\Im(\lambda)/\Re(\lambda)\right|$ goes to infinity, we have to choose $h$ in $O(q^{-2})$, which can quickly become extremely small. As a result, if $G'(\bar x)$ has an eigenvalue with small absolute real part but large imaginary part, $h$ needs to be chosen extremely small to still achieve convergence.
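A small numeric check of the lemma, assuming NumPy; the matrix $A$ below is an arbitrary example with eigenvalues $-0.1 \pm 3i$ (small real part, large imaginary part, so $q = 30$):

```python
import numpy as np

# A has eigenvalues -0.1 +/- 3i.
A = np.array([[-0.1, 3.0],
              [-3.0, -0.1]])
lam = np.linalg.eigvals(A)

# Step-size bound from Eq. (5): h < (1/|Re(l)|) * 2 / (1 + (Im(l)/Re(l))^2).
bounds = (1.0 / np.abs(lam.real)) * 2.0 / (1.0 + (lam.imag / lam.real) ** 2)
h_max = bounds.min()
print(h_max)  # ~0.0222: q = 30 already forces a tiny step size

# Verify: spectral radius of I + hA is < 1 just below the bound, > 1 above.
for h in (0.9 * h_max, 1.1 * h_max):
    print(h, np.abs(np.linalg.eigvals(np.eye(2) + h * A)).max())
```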

Consensus Optimization

Finding stationary points of the vector field $v(x)$ is equivalent to solving the equation $v(x) = 0$. In the context of two-player games this means solving the two equations:

$\nabla_\phi f(\phi, \theta) = 0$ and $\nabla_\theta g(\phi, \theta) = 0$. (6)

A simple strategy for finding such stationary points is to minimize $L(x) = \frac{1}{2}\|v(x)\|^2$ over $x$. Unfortunately, in practice the authors found this did not work well. They therefore consider a modified vector field $w(x)$:

$w(x) = v(x) - \gamma \nabla L(x)$. (7)
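A minimal sketch of the consensus update on the toy game from the earlier slides (again my own illustration; the value of $\gamma$ is an arbitrary choice), with $\nabla L$ obtained by autodiff:

```python
def L(phi, theta):
    """Regularizer L = 0.5 * ||v(phi, theta)||^2."""
    d_phi, d_theta = v(phi, theta)
    return 0.5 * (jnp.sum(d_phi ** 2) + jnp.sum(d_theta ** 2))

def w(phi, theta, gamma):
    """Modified field w(x) = v(x) - gamma * grad L(x), as in Eq. (7)."""
    d_phi, d_theta = v(phi, theta)
    return (d_phi - gamma * jax.grad(L, argnums=0)(phi, theta),
            d_theta - gamma * jax.grad(L, argnums=1)(phi, theta))

gamma, h = 0.5, 0.1
phi, theta = phi0, theta0
for _ in range(200):
    d_phi, d_theta = w(phi, theta, gamma)
    phi, theta = phi + h * d_phi, theta + h * d_theta
print(phi, theta)  # contracts toward the Nash-equilibrium at (0, 0)
```

For the bilinear game the update matrix becomes $\begin{pmatrix} 1 - h\gamma & h \\ -h & 1 - h\gamma \end{pmatrix}$, whose eigenvalues have modulus $\sqrt{(1 - h\gamma)^2 + h^2} < 1$ for these parameters, so the cycling of SimGA turns into a contraction.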

Consensus Optimization

In the context of two-player games, this corresponds to the modified utilities:

$\tilde f(\phi, \theta) = f(\phi, \theta) - \gamma L(\phi, \theta)$ and $\tilde g(\phi, \theta) = g(\phi, \theta) - \gamma L(\phi, \theta)$. (8)

Consensus Optimization

Lemma: Assume that $A \in \mathbb{R}^{n \times n}$ is negative semi-definite. Let $q(\gamma)$ be the maximum of $|\Im(\lambda)/\Re(\lambda)|$ (possibly infinite) with respect to $\lambda$, where $\lambda$ denotes the eigenvalues of $A - \gamma A^T A$ and $\Re(\lambda)$ and $\Im(\lambda)$ denote their real and imaginary parts respectively. Moreover, assume that $A$ is invertible with $|Av| \ge \rho |v|$ for $\rho > 0$ and let

$c = \min_{v \in S(\mathbb{C}^n)} \dfrac{|\bar v^T (A + A^T) v|}{|\bar v^T (A - A^T) v|}$ (9)

where $S(\mathbb{C}^n)$ denotes the unit sphere in $\mathbb{C}^n$. Then

$q(\gamma) \le \dfrac{1}{c + 2\gamma\rho^2}$. (10)

Thus, the imaginary-to-real-part quotient can be made arbitrarily small by an appropriate choice of $\gamma$, which can lead to better convergence properties near a local Nash-equilibrium.
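A quick numeric illustration of this lemma, assuming NumPy and reusing the example matrix from the step-size check: increasing $\gamma$ drives the imaginary-to-real quotient $q(\gamma)$ of the eigenvalues of $A - \gamma A^T A$ down:

```python
import numpy as np

A = np.array([[-0.1, 3.0],
              [-3.0, -0.1]])  # eigenvalues -0.1 +/- 3i, so q(0) = 30

for gamma in (0.0, 0.1, 0.5, 1.0):
    lam = np.linalg.eigvals(A - gamma * A.T @ A)
    q = np.abs(lam.imag / lam.real).max()
    print(gamma, q)  # q falls from 30 toward 0 as gamma grows
```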

Experiments

(a) Simultaneous Gradient Ascent (b) Consensus Optimization

Figure: Comparison of simultaneous gradient ascent and consensus optimization on a circular mixture of Gaussians. The images depict, from left to right, the resulting densities of the algorithm after 0, 5000, 10000 and 20000 iterations, as well as the target density (in red).

Experiments

(a) Discriminator loss (b) Generator loss (c) Inception score

Figure: (a) and (b): Comparison of the generator and discriminator loss on a DC-GAN architecture with 3 convolutional layers trained on CIFAR-10, for consensus optimization (without batch normalization) and alternating gradient ascent (with batch normalization). While alternating gradient ascent leads to highly fluctuating losses, consensus optimization successfully stabilizes training and makes the losses almost constant. (c): Comparison of the Inception score over time, computed using 6400 samples. On this architecture both methods have comparable rates of convergence, and consensus optimization achieves slightly better end results.

Experiments

(a) CIFAR-10 (b) CelebA

Figure: Samples generated from a model where both the generator and discriminator are given as in DCGAN, but without batch normalization. For CelebA, we also use a constant number of filters in each layer and add additional ResNet layers.