Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks


Tengyuan Liang (University of Chicago) and James Stokes (University of Pennsylvania)

arXiv:submit [stat.ML], 16 Feb 2018

Abstract. Motivated by the pursuit of a systematic computational and algorithmic understanding of Generative Adversarial Networks (GANs), we present a simple yet unified non-asymptotic local convergence theory for smooth two-player games, which subsumes several discrete-time gradient-based saddle point dynamics. The analysis reveals the surprising nature of the off-diagonal interaction term as both a blessing and a curse. On the one hand, this interaction term explains the origin of the slow-down effect in the convergence of Simultaneous Gradient Ascent (SGA) to stable Nash equilibria. On the other hand, for unstable equilibria, exponential convergence can be proved thanks to the interaction term, for three modified dynamics proposed to stabilize GAN training: Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). The analysis uncovers the intimate connections among these stabilizing techniques, and provides a detailed characterization of the choice of learning rate.

Key words and phrases: Generative adversarial networks, saddle point dynamics, smooth two-player games, local Nash equilibrium, non-asymptotic convergence, stability.

1 INTRODUCTION

In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form

(1.1)  min_{θ∈R^p} max_{ω∈R^q} U(θ, ω).

The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack principled understanding at both the computational and the algorithmic level.¹ GAN optimization is a special case of (1.1), which has been developed for learning

(tengyuan.liang@chicagobooth.edu, stokesj@sas.upenn.edu)

¹ We refer the readers to Huszár (2017)'s concise post on
this issue.

a complex and multi-modal probability distribution (P_real over X) through learning a generator function g_θ that transforms the input distribution (P_input over Z) to match the target P_real. Ignoring parameter regularization, the value function corresponding to a GAN is of the form

(1.2)  U(θ, ω) = E h₁(f_ω(X)) + E h₂(f_ω(g_θ(Z))),  X ∼ P_real, Z ∼ P_input,

where (θ, ω) ∈ R^p × R^q parametrize the generator function g_θ : Z → X and the discriminator function f_ω : X → R, respectively. The original GAN (Goodfellow et al., 2014), for example, corresponds to choosing h₁(t) = log σ(t), h₂(t) = log(1 − σ(t)), where σ is the sigmoid function; Wasserstein GAN (Arjovsky et al., 2017) considers h₁(t) = t, h₂(t) = −t; f-GAN (Nowozin et al., 2016) proposes h₁(t) = t, h₂(t) = −f*(t), where f* denotes the Fenchel dual of f. Recently, several attempts have been made to understand whether GANs learn the target distribution in the statistical sense (Liu et al., 2017; Arora and Zhang, 2017; Liang, 2017; Arora et al., 2017).

Optimization of GANs, and of value functions of the form (1.1) at large, is hard both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead resort to the more modest problem of searching for a local saddle point (θ*, ω*) such that no player has an incentive to deviate locally:

U(θ*, ω*) ≤ U(θ, ω*) for θ in an open neighborhood of θ*,
U(θ*, ω) ≤ U(θ*, ω*) for ω in an open neighborhood of ω*.

For smooth value functions, the above conditions are equivalent to the following solution concept.

Definition 1 (Local Nash Equilibrium). (θ*, ω*) is called a local Nash equilibrium if
(1) ∇_θ U(θ*, ω*) = 0, ∇_ω U(θ*, ω*) = 0;
(2) ∇²_θθ U(θ*, ω*) ⪰ 0, ∇²_ωω U(θ*, ω*) ⪯ 0.

Here we use ∇²_θω U(θ, ω) ∈ R^{p×q} to denote the off-diagonal term ∂²U/∂θ∂ω, and name it the interaction term throughout the paper; ∇_θ U(θ, ω) ∈ R^p denotes the gradient ∂U/∂θ, and ∇²_θθ U(θ, ω) ∈ R^{p×p} the Hessian ∂²U/∂θ∂θ.

In practice, discrete-time dynamical systems are employed to numerically approach the saddle points
of U(θ, ω), as is the case in GANs (Goodfellow et al., 2014) and in primal-dual methods for non-linear optimization. The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system:

(1.3)  θ_{t+1} = θ_t − η ∇_θ U(θ_t, ω_t),  ω_{t+1} = ω_t + η ∇_ω U(θ_t, ω_t),

where η is the step size, or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, whose asymptotic convergence has been established in Singh et al. (2000), Cherukuri et al. (2017), Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities due to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, which we hope will render it more informative in practice:

Non-asymptotic convergence speed. In practice one is concerned with a finite step size η > 0, which is typically subject to extensive hyper-parameter tuning. Detailed characterizations of the convergence speed, and theoretical insights on the choice of learning rate, can be helpful.

Unified, simple analysis for modified saddle point dynamics. Several attempts to fix GAN optimization have been put forth by independent researchers, who modify the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) using very different insights. A unified analysis that reveals the deeper connections amongst these proposals helps to better understand saddle point dynamics at large.

In this paper we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). More concretely, we provide the following theoretical contributions about the crucial effect of the off-diagonal interaction term ∇²_θω U(θ, ω) in two-player games.

Stable case: curse of the interaction term. Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent on strongly-convex functions. In addition, we quantitatively isolate the slow-down in the convergence rate of two-player SGA, compared to single-player gradient descent, due to the presence of the off-diagonal interaction term ∇²_θω U(θ, ω).

Unstable case: blessing of the interaction term. For unstable Nash equilibria, SGA diverges for any non-zero learning rate. We discover a unified non-asymptotic analysis that encompasses the three proposed modified dynamics OMD, CO and PM. The analysis shows that all these algorithms, at a high level, share the same idea of utilizing the curvature introduced by the interaction term ∇²_θω U(θ, ω).
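As a concrete sanity check on the stable case, the SGA iteration (1.3) can be run on a small quadratic game using the learning-rate recipe η = α/β of Theorem 1 (stated in Section 2). The matrices below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

# Illustrative quadratic game U = (1/2) th^T A th - (1/2) w^T B w + th^T C w,
# whose unique (stable) Nash equilibrium is the origin.
A = np.diag([2.0, 1.0])          # grad^2_theta U  (positive definite)
B = np.diag([2.0, 1.0])          # -grad^2_omega U (positive definite)
C = np.eye(2)                    # interaction term grad^2_{theta,omega} U
M = np.block([[A, C], [-C.T, B]])

alpha = min(np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(B).min())
beta = np.linalg.eigvalsh(M.T @ M).max()
eta = alpha / beta               # learning rate suggested by Theorem 1

# One SGA step shrinks the distance to the equilibrium by at most sqrt(1 - alpha^2/beta).
op_norm = np.linalg.norm(np.eye(4) - eta * M, 2)
assert op_norm <= np.sqrt(1 - alpha**2 / beta) + 1e-12

z = np.ones(4)                   # stacked (theta_0, omega_0)
for _ in range(300):
    z = z - eta * (M @ z)        # SGA, Eqn (1.3), linearized at the equilibrium
assert np.linalg.norm(z) < 1e-8  # exponential ("linear-rate") convergence
```

Despite the complex eigenvalues of M (the iterates spiral), the operator-norm bound guarantees monotone contraction of the distance.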
Unlike the slow sub-linear rate of convergence experienced by single-player gradient descent on non-strongly convex functions,² OMD/CO/PM effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bi-linear games.

The organization of the paper is as follows. In Section 2 we consider the (admittedly idealized) situation where the value function locally satisfies strict convexity/concavity. We show non-asymptotic exponential convergence to Nash equilibria for SGA and identify an optimized learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that the proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4, we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems using value functions of the GAN form (1.2) under objective evaluation metrics. Detailed proofs are deferred to Section 5.

2 STABLE CASE: NON-ASYMPTOTIC LOCAL CONVERGENCE

In this section we establish the non-asymptotic convergence of the SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can

² In fact, Nesterov (2013) constructed a convex but non-strongly convex function on which all first-order methods suffer a slow sub-linear rate of convergence (in the optimization literature, "linear rate" refers to exponential convergence speed).

be intuitively pictured as cycling inwards to these saddle points, where the distance to the saddle point of interest is exponentially contracting. First let's introduce the notion of stable equilibrium.

Definition 2 (Stable Local Nash Equilibrium). (θ*, ω*) is called a stable local Nash equilibrium if
(1) ∇_θ U(θ*, ω*) = 0, ∇_ω U(θ*, ω*) = 0;
(2) ∇²_θθ U(θ*, ω*) ≻ 0, ∇²_ωω U(θ*, ω*) ≺ 0.

The above notion of stability is stronger than Definition 1 in the sense that ∇²_θθ U(θ*, ω*) and −∇²_ωω U(θ*, ω*) have smallest eigenvalues bounded away from 0.

Assumption 1 (Local Strong Convexity-Concavity). Consider U(θ, ω) : R^p × R^q → R that is smooth and twice differentiable, and let (θ*, ω*) be a stable local Nash equilibrium as in Definition 2. Assume that for some r > 0 there exists an open neighborhood of (θ*, ω*) such that for all (θ, ω) ∈ B((θ*, ω*), r) the following strong convexity-concavity condition holds:

∇²_θθ U(θ, ω) ≻ 0,  ∇²_ωω U(θ, ω) ≺ 0.

It will prove convenient to introduce some notation before stating the main theorem. Let us define the following block-wise abbreviation for the matrix of second derivatives,

(2.1)  [ ∇²_θθ U(θ, ω)   ∇²_θω U(θ, ω) ; −∇²_ωθ U(θ, ω)   −∇²_ωω U(θ, ω) ] =: [ A_θω  C_θω ; −C_θωᵀ  B_θω ],

and define α, β as

(2.2)  α := min_{(θ,ω)∈B((θ*,ω*),r)} λ_min( [ A_θω  0 ; 0  B_θω ] ),
       β := max_{(θ,ω)∈B((θ*,ω*),r)} λ_max( [ A_θω² + C_θω C_θωᵀ   A_θω C_θω − C_θω B_θω ; C_θωᵀ A_θω − B_θω C_θωᵀ   B_θω² + C_θωᵀ C_θω ] ),

where λ_max(M), λ_min(M) denote the largest and smallest eigenvalues of a matrix M.

Theorem 1 (Exponential Convergence: SGA). Consider U(θ, ω) : R^p × R^q → R that satisfies Assumption 1 for some radius r > 0 near a stable local Nash equilibrium (θ*, ω*) as in Definition 2, and an initialization satisfying (θ₀, ω₀) ∈ B((θ*, ω*), r). Then the SGA dynamics (1.3) with fixed learning rate η = α/β (α, β defined in Eqn (2.2)) obtains an ε-minimizer, in the sense (θ_T, ω_T) ∈ B((θ*, ω*), ε), as long as

T ≥ T_SGA := (2β/α²) · log(r/ε).

Remark 1. It is interesting to compare the convergence speed of the saddle point dynamics to conventional gradient descent in one variable for a strongly-convex function. We remind the reader that, to obtain an ε-minimizer of a strongly-convex function, one needs the following number of iterations
of gradient descent:

T_GD := max{ (λ_max(A_θω)/λ_min(A_θω)) · log(r/ε), (λ_max(B_θω)/λ_min(B_θω)) · log(r/ε) },

depending on whether we are optimizing with respect to θ or ω, respectively. It is now evident that, due to the presence of C_θω, the convergence of two-player SGA to a saddle point can be significantly slower than the convergence of single-player gradient descent. Note the following inequality,

λ_max( [ A_θω² + C_θω C_θωᵀ   A_θω C_θω − C_θω B_θω ; C_θωᵀ A_θω − B_θω C_θωᵀ   B_θω² + C_θωᵀ C_θω ] ) ≥ λ_max(A_θω)²,

which follows from the fact that the LHS is lower-bounded by λ_max(A_θω² + C_θω C_θωᵀ) ≥ λ_max(A_θω²) = λ_max(A_θω)². Therefore the convergence of SGA is slower than that of conventional GD: T_SGA ≥ T_GD. We would like to emphasize that, for saddle point convergence, the slow-down effect of the interaction term C_θω is explicit in our non-asymptotic analysis.

The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be seen in the following way. The presence of the off-diagonal anti-symmetric component in Eqn (2.1) means that the associated linear operator of the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, due to the explicit choice of η, the distance to the stable Nash equilibrium shrinks exponentially fast.

The local exponential stability in the infinitesimal/asymptotic case η → 0 has already been studied in a nice paper (Nagarajan and Kolter, 2017, Theorem 3.1 therein), by showing that the Jacobian matrix of a particular form of GAN objective is Hurwitz (all eigenvalues have strictly negative real parts). There are two distinct differences in our result: (1) we provide non-asymptotic convergence with specific guidance on the choice of learning rate η; (2) our analysis goes through the singular values (which, for a general matrix, are rather different from the moduli of the eigenvalues) instead of involving the complex eigenvalues, and this simple technique generalizes to three other modified saddle point dynamics, which we discuss in the next section.

3 UNSTABLE CASE: LOCAL BI-LINEAR PROBLEM

Oscillation and instability for SGA occur when the problem is
non-strongly convex-concave, as in the bi-linear game (or, more precisely, at least linear in one player). This observation was first pointed out using a very simple linear game, U(x, y) = xy, in Salimans et al. (2016). More generally, as a result of Theorem 1, this phenomenon occurs when the local Nash equilibrium is non-stable:

λ_min( [ A_θ*ω*  0 ; 0  B_θ*ω* ] ) = 0.

Let's consider the extreme case A_θ*ω* = B_θ*ω* = 0. In this case we will show that (1) an improved analysis for Optimistic Mirror Descent (OMD), introduced in Daskalakis et al. (2017), (2) a modified version of Predictive Methods (PM), motivated from Yadav et al. (2017), and (3) Consensus Optimization (CO), introduced in Mescheder et al. (2017), can fix the oscillation problem and provide exponential convergence to the unstable Nash equilibrium, via a novel unified non-asymptotic analysis. Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics to utilize the curvature generated by the interaction term C_θ*ω*, which we refer to as the blessing of the interaction term, to contrast with the slow-down effect of

the interaction term in the strongly convex-concave case. Once again, as alluded to in the introduction, this fast linear-rate convergence result (in the non-strongly convex-concave two-player game) should be compared to the significantly slower sub-linear convergence rate for all first-order methods in convex but non-strongly convex optimization (one player). The latter was proved by a lower bound argument in Nesterov (2013, Theorem 2.1.7). We are ready to state the main result proved in this section, informally.

Theorem 3.1 (Informal: Unstable Case). All three modified dynamics enjoy a last-iterate exponential convergence guarantee in the bi-linear game.

We start with the minimal unstable bilinear game, both to explain why oscillation can happen when the game is non-strongly convex-concave and to reveal and understand the new features of the modified dynamics. This bilinear game can be motivated by considering the lowest-order truncation of the Taylor expansion of a general smooth two-player game around a Nash equilibrium (θ*, ω*) = (0, 0):

(3.1)  U(θ, ω) = ½ θᵀAθ + ½ ωᵀBω + θᵀCω + higher order terms.

Now consider the simple bi-linear game U(θ, ω) = θᵀCω. With the SGA dynamics defined in (1.3), one can easily verify that

‖θ_{t+1}‖² + ‖ω_{t+1}‖² ≥ (1 + η²λ_min(CCᵀ)) (‖θ_t‖² + ‖ω_t‖²).

Therefore, while the continuous limit η → 0 cycles around a sphere, with any practical learning rate η > 0 the distance to the Nash equilibrium can increase exponentially instead of converging. Per Theorem 1 and the discussion above, instability for SGA only occurs when the local game is approximately bi-linear. From now on we focus on the simplest unstable form of the game, the bi-linear game, to isolate the main idea behind fixing the unstable problem.

3.1 (Improved) Optimistic Mirror Descent

Daskalakis et al. (2017) employed Optimistic Mirror Descent (OMD), motivated by online learning, to solve the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for
bi-linear games. We note that although the last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), the exponential convergence was, to the best of our knowledge, not known. Roughly speaking, for an initialization at distance r, Daskalakis et al. (2017, Theorem 1) asserts that after t ≳ 1/η iterations the last iterate of OMD is within distance r/2, for a small enough learning rate η (and under some additional assumptions); the convergence only happens in the limit of vanishing step size η → 0. In contrast, we prove that after t iterations with a properly chosen step size η, the last iterate of OMD is within distance (1 − λ_min²(C)/(4λ_max²(C)))^{t/2} · 2r. This improved theory also explains the exponential convergence found in simulations.

Lemma 3.1 (Exponential Convergence: OMD). Consider the bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank. Then the OMD dynamics

(3.2)  θ_{t+1} = θ_t − 2η ∇_θ U(θ_t, ω_t) + η ∇_θ U(θ_{t−1}, ω_{t−1}),
       ω_{t+1} = ω_t + 2η ∇_ω U(θ_t, ω_t) − η ∇_ω U(θ_{t−1}, ω_{t−1}),

with the learning rate η = 1/(2λ_max(C)) obtains an ε-minimizer, in the sense (θ_T, ω_T) ∈ B(0, ε), as long as

T ≥ T_OMD := (8λ_max²(C)/λ_min²(C) + 4) · log( 4r (1 + λ_min²(C)/λ_max²(C)) / ε ),

under the assumption that ‖(θ₀, ω₀)‖, ‖(θ₁, ω₁)‖ ≤ r.

3.2 (Modified) Predictive Methods

From a very different, ODE-based motivation, Yadav et al. (2017) proposed Predictive Methods (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predicted future location and then perform the update. In this section we propose and analyze a modified version of the predictive method (for simultaneous gradient updates), inspired by Yadav et al. (2017). Consider the following modified PM dynamics:

(3.3)  predictive step:  θ_{t+1/2} = θ_t − γ ∇_θ U(θ_t, ω_t),  ω_{t+1/2} = ω_t + γ ∇_ω U(θ_t, ω_t);
       gradient step:    θ_{t+1} = θ_t − η ∇_θ U(θ_{t+1/2}, ω_{t+1/2}),  ω_{t+1} = ω_t + η ∇_ω U(θ_{t+1/2}, ω_{t+1/2}).

Lemma 3.2 (Exponential Convergence: PM). Consider the bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank, and fix some γ > 0. Then the PM dynamics in Eqn (3.3) with learning rate

η = γλ_min²(C) / (λ_max²(C) + γ²λ_max⁴(C))

obtains an ε-minimizer, in the sense (θ_T, ω_T) ∈ B(0, ε), as long as

T ≥ T_PM := ( (γ²λ_max⁴(C) + λ_max²(C)) / (γ²λ_min⁴(C)) ) · log(r/ε),

under the assumption that ‖(θ₀, ω₀)‖ ≤ r.

3.3 Consensus Optimization

Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The authors' motivation is to add an extra potential vector field (the consensus part) on top of the existing curl vector field (for the SGA of the bi-linear problem), in order to attract the dynamics to the critical points. Mescheder et al. (2017) and Nagarajan and Kolter (2017) analyzed the infinitesimal-flow version of consensus optimization and intuitively showed that it pushes the real parts of the eigenvalues away from 0 to ensure asymptotic convergence. In this section we provide a simple convergence analysis of the discretized dynamics, of the same flavor as

the previous section. An upshot of the analysis is that it sheds light on a possible choice of learning rate. Recall that the regularization term defining consensus optimization is given by

(3.4)  R(θ, ω) = ( ‖∇_θ U(θ, ω)‖² + ‖∇_ω U(θ, ω)‖² ) / 2.

We are ready to state the lemma. Surprisingly, we find that consensus optimization coincides with the modified predictive method for the bi-linear game.

Lemma 3.3 (Exponential Convergence: CO). Consider the bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank. Recall R(θ, ω) defined in Eqn (3.4), and fix some γ > 0. Then the CO dynamics, with the same learning rate η as in Lemma 3.2,

(3.5)  θ_{t+1} = θ_t − η ( ∇_θ U(θ_t, ω_t) + γ ∇_θ R(θ_t, ω_t) ),
       ω_{t+1} = ω_t + η ( ∇_ω U(θ_t, ω_t) − γ ∇_ω R(θ_t, ω_t) ),

converges exponentially fast in the same way as the PM dynamics in Lemma 3.2.

4 EXPERIMENTS

Lucic et al. (2017) recently conducted a large-scale study of GANs and found that improvements from algorithmic changes mostly disappear after taking into account hyper-parameter tuning and randomness of initialization. They conclude that future GAN research should be based on more systematic and objective evaluation procedures. Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-the-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment.

We focus on the Wasserstein GAN formulation, so that the value function is given by

(4.1)  U(θ, ω) = E_{X∼P_real} f_ω(X) − E_{Z∼N(0,I_k)} f_ω(g_θ(Z)),

where f_ω : R^d → R is a multi-layer neural network with L hidden layers and rectifier non-linearities, and the input distribution P_input was chosen to be k-dimensional standard Gaussian noise. Following Gulrajani et al. (2017), we impose the Lipschitz-1 constraint on the discriminator network using the two-sided gradient penalty term Λ(ω) introduced in Gulrajani et al. (2017, Eqn 3). The consensus optimization
loss is defined with respect to the value function as in (3.4), without including the gradient penalty. The combined loss functions of the discriminator and generator are, respectively,

(4.2)  L_dis(θ, ω) = −U(θ, ω) + γ R(θ, ω) + λ Λ(ω),
(4.3)  L_gen(θ, ω) = U(θ, ω) + γ R(θ, ω).

The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then locked to λ = γ = 1 throughout. In order to make close contact with our theoretical formalism, we optimize the above loss functions on an alternating schedule, using vanilla gradient descent with a fixed learning rate of η = 10⁻³. To ensure reproducibility, all algorithms were independently implemented on top of the TFGAN and Keras frameworks.
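As a minimal, self-contained analogue of the algorithm comparison in this section, one can run the Section-3 dynamics on the bi-linear game U(θ, ω) = θᵀCω, where the behavior is fully transparent. The matrix C, step sizes and iteration counts below are our own illustrative choices:

```python
import numpy as np

C = np.diag([1.0, 0.8, 0.6])                 # full-rank interaction matrix
lam_max, lam_min = 1.0, 0.6                  # largest/smallest singular values of C

def run(dynamics, steps=400):
    th = w = np.ones(3)
    th_prev, w_prev = th, w
    for _ in range(steps):
        th, w, th_prev, w_prev = dynamics(th, w, th_prev, w_prev)
    return np.sqrt(th @ th + w @ w)          # distance to the Nash equilibrium (0, 0)

eta_sga = 0.1
def sga(th, w, *_):                          # Eqn (1.3)
    return th - eta_sga * C @ w, w + eta_sga * C.T @ th, th, w

eta_omd = 1 / (2 * lam_max)                  # Lemma 3.1
def omd(th, w, th_prev, w_prev):             # Eqn (3.2): extrapolate with stale gradients
    th_new = th - 2 * eta_omd * C @ w + eta_omd * C @ w_prev
    w_new = w + 2 * eta_omd * C.T @ th - eta_omd * C.T @ th_prev
    return th_new, w_new, th, w

gamma = 1.0
eta_pm = gamma * lam_min**2 / (lam_max**2 + gamma**2 * lam_max**4)  # Lemma 3.2
def pm(th, w, *_):                           # Eqn (3.3)
    th_half, w_half = th - gamma * C @ w, w + gamma * C.T @ th      # predictive step
    return th - eta_pm * C @ w_half, w + eta_pm * C.T @ th_half, th, w

r0 = np.sqrt(6.0)                            # initial distance
assert run(sga) > r0                         # SGA spirals outward on the bilinear game
assert run(omd) < 1e-6 * r0                  # OMD: last-iterate exponential convergence
assert run(pm) < 1e-6 * r0                   # PM: likewise
```

The same qualitative ranking (SGA diverges, the modified dynamics contract) is what the full GAN experiments probe in a nonlinear setting.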

4.1 Learning the Covariance of a Multivariate Gaussian

Consider the problem of learning the covariance matrix Σ ∈ S^d₊₊ of a d-dimensional multivariate Gaussian distribution P_real = N(0, Σ) with non-degenerate covariance Σ. Note that the learning problem is well-specified if we choose the generator function g_θ : R^k → R^d to be a simple linear transformation of the k-dimensional latent space (k ≥ d). Although the GAN approach is clearly overkill for this simple density estimation problem, we find this example illuminating because it affords some analytical tractability for the otherwise intractable general GAN value function (1.2). Specifically, if we choose the discriminator function f_ω : R^d → R to be a neural network with L = 1 hidden layer consisting of H hidden units with rectifier nonlinearities, and set the biases to zero, then the explicit functional forms of discriminator and generator are, respectively,

(4.4)  f_ω(x) = Σ_{i=1}^H v_i (w_iᵀx) 1{w_iᵀx ≥ 0},  g_θ(z) = V z,

where ω = {w_i ∈ R^d, v_i ∈ R : i ≤ H} and θ = {V ∈ R^{d×k}} are the discriminator and generator parameters, respectively. If, moreover, we express the covariance matrix as Σ = AAᵀ, then the value function can be expressed in closed form as

(4.5)  U(θ, ω) = const · Σ_{i=1}^H v_i ( ‖Aᵀw_i‖ − ‖Vᵀw_i‖ ).

The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, if one solves for the condition of being a Nash equilibrium, one does not conclude that VVᵀ = Σ; the result depends on the rank of the matrix [w₁, …, w_H].

The evaluation of the different optimization algorithms involved comparing the target density P_real = N(0, Σ) and the analytical generator density P_fake = N(0, VVᵀ) after t training iterations (Fig 1). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices, ‖Σ − VVᵀ‖_F. The covariance learning experiments were conducted in the well-specified and over-parametrized regime (k = 16, d = 2), using H = 128 hidden units for the discriminator network. We also
performed a head-to-head comparison of simultaneous and alternating update methods on this distribution, and found a negligible difference (Fig 2).

4.2 Mixture of Gaussians

In practical applications, GANs are typically trained using the empirical distribution of samples, where the samples are drawn from an idealized multi-modal probability distribution. To capture the notion of a multi-modal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance of small width σ. In contrast to previous visual-based evaluations, we estimate the Wasserstein-1 distance W₁(P_real, P_fake) between the target density P_real and the distribution P_fake of the random variable g_θ(Z) implied by the trained generator network. The estimate is obtained by solving a linear program which computes the exact Wasserstein-1 distance between the sample estimates P̂_real = (1/m) Σ_{i=1}^m δ_{X_i} and P̂_fake = (1/m) Σ_{i=1}^m δ_{g_θ(Z_i)}, respectively, and approaches the population version W₁(P_real, P_fake) as the number of samples m → ∞.

The experiments with the mixture of Gaussians used 2-dimensional Gaussian input (k = 2). Both the generator and discriminator networks consisted of L = 4 hidden layers with H = 128 units per hidden layer. The estimate of the Wasserstein-1 distance was calculated using a sample size of

m = 512 after training for t iterations. It is clear from Fig 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig 1) shows that the exponential separation between consensus optimization and the competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings demand a deeper understanding of the global structure of the landscape, which is not captured by our local stability analysis.

Fig 1. Evaluation metrics for covariance learning (left, ‖Σ − VVᵀ‖_F) and mixture-of-Gaussians learning (right, Wasserstein distance), for the different dynamical systems (SGA, Predictive, Optimism, Consensus), after t training iterations and 16 random seeds. Note that for covariance learning we use a log scale on the y-axis.

Fig 2. Comparison of alternating/simultaneous gradient updates on the covariance learning evaluation metric (Frobenius norm) after t training iterations and 16 random seeds.
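The exact empirical Wasserstein-1 computation described above can be sketched as follows: for two uniform samples of equal size m, the optimal transport LP admits a permutation as an optimal coupling, so the problem reduces to an assignment problem. The helper name `w1_empirical` is ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_empirical(x, y):
    """Exact Wasserstein-1 distance between two equal-size empirical samples
    with uniform weights: solve the assignment problem on pairwise distances."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # m x m distances
    rows, cols = linear_sum_assignment(cost)                       # optimal matching
    return cost[rows, cols].mean()

# Tiny check: two point clouds offset horizontally by 1 have W1 = 1.
x = np.array([[0.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [1.0, 1.0]])
assert abs(w1_empirical(x, y) - 1.0) < 1e-12
```

For large m a general-purpose LP solver (as used in the paper) or specialized OT solvers scale better, but the assignment formulation makes the "exact" claim concrete.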

Fig 3. Density plots of the worst/best generator distributions, ranked by Wasserstein-1 distance, across all baselines amongst 16 random seeds after training. Left: SGA (W₁ = 2.17). Right: OMD (W₁ = 0.66).

5 TECHNICAL PROOFS

Proof of Theorem 1. Define the line interpolation between the two points, θ(x) = xθ_t + (1−x)θ*, ω(x) = xω_t + (1−x)ω*. Then the SGA dynamics can be written as

[ θ_{t+1} − θ* ; ω_{t+1} − ω* ] = [ θ_t − θ* ; ω_t − ω* ] − η [ ∇_θ U(θ_t, ω_t) ; −∇_ω U(θ_t, ω_t) ]
  = ∫₀¹ ( I − η [ ∇²_θθ U(θ(x), ω(x))   ∇²_θω U(θ(x), ω(x)) ; −∇²_ωθ U(θ(x), ω(x))   −∇²_ωω U(θ(x), ω(x)) ] ) dx · [ θ_t − θ* ; ω_t − ω* ].

Assume one can prove that, for some r > 0 and all (θ, ω) ∈ B((θ*, ω*), r), with a proper choice of η the largest singular value is bounded above by 1:

‖ I − η [ ∇²_θθ U(θ, ω)   ∇²_θω U(θ, ω) ; −∇²_ωθ U(θ, ω)   −∇²_ωω U(θ, ω) ] ‖_op < 1.

Then, due to the convexity of the operator norm, the SGA dynamics is locally contracting, because

‖ [ θ_{t+1} − θ* ; ω_{t+1} − ω* ] ‖ ≤ ∫₀¹ ‖ I − η [ ⋯ ] ‖_op dx · ‖ [ θ_t − θ* ; ω_t − ω* ] ‖ < ‖ [ θ_t − θ* ; ω_t − ω* ] ‖.

Let's analyze the singular values of I − η [ ∇²_θθ U   ∇²_θω U ; −∇²_ωθ U   −∇²_ωω U ],

assuming ∇²_θθ U(θ, ω) ≻ 0 and ∇²_ωω U(θ, ω) ≺ 0. Abbreviate

[ ∇²_θθ U   ∇²_θω U ; −∇²_ωθ U   −∇²_ωω U ] =: [ A  C ; −Cᵀ  B ].

The largest singular value of I − η[ A  C ; −Cᵀ  B ] is the square root of the largest eigenvalue of the symmetric matrix

( I − η[ A  C ; −Cᵀ  B ] )ᵀ ( I − η[ A  C ; −Cᵀ  B ] ) = [ (I − ηA)² + η²CCᵀ   η²(AC − CB) ; η²(CᵀA − BCᵀ)   (I − ηB)² + η²CᵀC ].

It is clear that when η = 0 the largest eigenvalue of the above matrix is 1. Observe that

[ (I − ηA)² + η²CCᵀ   η²(AC − CB) ; η²(CᵀA − BCᵀ)   (I − ηB)² + η²CᵀC ]
  = I − 2η [ A  0 ; 0  B ] + η² [ A² + CCᵀ   AC − CB ; CᵀA − BCᵀ   B² + CᵀC ]
  ⪯ ( 1 − 2η λ_min([ A  0 ; 0  B ]) + η² λ_max([ A² + CCᵀ   AC − CB ; CᵀA − BCᵀ   B² + CᵀC ]) ) I.

If we choose η to be

η = min_{(θ,ω)∈B((θ*,ω*),r)} λ_min([ A_θω  0 ; 0  B_θω ]) / max_{(θ,ω)∈B((θ*,ω*),r)} λ_max([ A_θω² + C_θω C_θωᵀ   A_θω C_θω − C_θω B_θω ; C_θωᵀ A_θω − B_θω C_θωᵀ   B_θω² + C_θωᵀ C_θω ]) = α/β,

then we find

( I − η[ A  C ; −Cᵀ  B ] )ᵀ ( I − η[ A  C ; −Cᵀ  B ] ) ⪯ ( 1 − α²/β ) I.

In this case,

sup_{(θ,ω)∈B((θ*,ω*),r)} ‖ I − η[ ⋯ ] ‖_op ≤ ( 1 − α²/β )^{1/2},  so  ‖ [ θ_{t+1} − θ* ; ω_{t+1} − ω* ] ‖ ≤ ( 1 − α²/β )^{1/2} · ‖ [ θ_t − θ* ; ω_t − ω* ] ‖.

Therefore, to obtain an ε-minimizer, a number of steps T ≥ (2β/α²) · log(r/ε) suffices. ∎

Proof of Lemma 3.1. Recall that on the bi-linear game the OMD dynamics iteratively update

[ θ_{t+1} ; ω_{t+1} ] = ( I − 2ηJ ) [ θ_t ; ω_t ] + ηJ [ θ_{t−1} ; ω_{t−1} ],  where J := [ 0  C ; −Cᵀ  0 ].

Define the following matrices, writing J̃ := −J² = [ CCᵀ  0 ; 0  CᵀC ]:

(5.1)  R₁ = ( I − 2ηJ + (I − 4η²J̃)^{1/2} ) / 2,
(5.2)  R₂ = ( I − 2ηJ − (I − 4η²J̃)^{1/2} ) / 2.

It is easy to verify that

R₁ + R₂ = I − 2ηJ,  R₁R₂ = R₂R₁ = ( (I − 2ηJ)² − (I − 4η²J̃) ) / 4 = −ηJ.

The commutative property R₁R₂ = R₂R₁ follows from a singular value decomposition argument: letting C = UDVᵀ be the SVD of C (D diagonal), one finds

C (I − 4η²CᵀC)^{1/2} = UD (I − 4η²D²)^{1/2} Vᵀ = U (I − 4η²D²)^{1/2} D Vᵀ = (I − 4η²CCᵀ)^{1/2} C.

Using the above equality, J commutes with (I − 4η²J̃)^{1/2}, and the commutative property follows. Now we have the following relations for OMD:

[ θ_{t+1} ; ω_{t+1} ] − R₁ [ θ_t ; ω_t ] = R₂ ( [ θ_t ; ω_t ] − R₁ [ θ_{t−1} ; ω_{t−1} ] ) = R₂ᵗ ( [ θ₁ ; ω₁ ] − R₁ [ θ₀ ; ω₀ ] ),
[ θ_{t+1} ; ω_{t+1} ] − R₂ [ θ_t ; ω_t ] = R₁ ( [ θ_t ; ω_t ] − R₂ [ θ_{t−1} ; ω_{t−1} ] ) = R₁ᵗ ( [ θ₁ ; ω₁ ] − R₂ [ θ₀ ; ω₀ ] ).

Hence

(5.3)  ( R₁ − R₂ ) [ θ_{t+1} ; ω_{t+1} ] = R₁^{t+1} ( [ θ₁ ; ω₁ ] − R₂ [ θ₀ ; ω₀ ] ) − R₂^{t+1} ( [ θ₁ ; ω₁ ] − R₁ [ θ₀ ; ω₀ ] ).
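The algebraic identities behind (5.1)–(5.3) can be checked numerically; this is our own sketch, with an illustrative random C and a step size chosen small enough that I − 4η²J̃ stays positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
C = rng.standard_normal((n, n))              # full rank almost surely
eta = 1 / (4 * np.linalg.norm(C, 2))         # 4*eta^2*lam_max(C)^2 = 1/4 < 1
Z = np.zeros((n, n))
J = np.block([[Z, C], [-C.T, Z]])            # bilinear-game vector field
Jt = np.block([[C @ C.T, Z], [Z, C.T @ C]])  # Jt = -J @ J

# Symmetric square root of I - 4 eta^2 Jt via its eigendecomposition.
w, V = np.linalg.eigh(np.eye(2 * n) - 4 * eta**2 * Jt)
S = V @ np.diag(np.sqrt(w)) @ V.T
I = np.eye(2 * n)
R1 = (I - 2 * eta * J + S) / 2               # Eqn (5.1)
R2 = (I - 2 * eta * J - S) / 2               # Eqn (5.2)

assert np.allclose(R1 + R2, I - 2 * eta * J)
assert np.allclose(R1 @ R2, -eta * J)
assert np.allclose(R1 @ R2, R2 @ R1)         # commutativity via the SVD argument
```

Because J commutes with Jt (and hence with S), the quadratic z_{t+1} = (R₁+R₂)z_t − R₁R₂ z_{t−1} reproduces the OMD recursion exactly.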

Let's analyze the eigenvalues of R₁ and R₂. Writing R₁ = ( I + (I − 4η²J̃)^{1/2} )/2 − ηJ and using Jᵀ = −J together with the commutativity above, we have

R₁R₁ᵀ = ( ( I + (I − 4η²J̃)^{1/2} ) / 2 )² + η²J̃,  R₂R₂ᵀ = ( ( I − (I − 4η²J̃)^{1/2} ) / 2 )² + η²J̃.

For a singular value λ of C, the corresponding eigenvalue of R₁R₁ᵀ is ( 1 + (1 − 4η²λ²)^{1/2} )/2, and that of R₂R₂ᵀ is ( 1 − (1 − 4η²λ²)^{1/2} )/2, so for η small enough the spectral radii satisfy the strict inequality ‖R₁‖_op < 1. Concretely, for example, with η = 1/(2λ_max(C)),

‖R₁‖²_op ≤ 1 − λ_min²(C)/(4λ_max²(C)),  ‖R₂‖²_op ≤ 1/2.

Due to Eqn (5.3), the RHS of (5.3) is upper bounded:

‖ R₁^{t+1} ( [θ₁; ω₁] − R₂ [θ₀; ω₀] ) − R₂^{t+1} ( [θ₁; ω₁] − R₁ [θ₀; ω₀] ) ‖ ≤ ( 1 − λ_min²(C)/(4λ_max²(C)) )^{t/2} · 2( ‖(θ₁, ω₁)‖ + ‖(θ₀, ω₀)‖ ) ≤ ( 1 − λ_min²(C)/(4λ_max²(C)) )^{t/2} · 4r.

Moreover, the LHS of (5.3) satisfies ‖ (R₁ − R₂) [θ_{t+1}; ω_{t+1}] ‖ ≥ σ_min(R₁ − R₂) · ‖ [θ_{t+1}; ω_{t+1}] ‖. To sum up, one can ensure ‖(θ_{t+1}, ω_{t+1})‖ ≤ ε whenever

t ≥ ( 8λ_max²(C)/λ_min²(C) + 4 ) · log( 4r ( 1 + λ_min²(C)/λ_max²(C) ) / ε ),

which completes the proof. ∎
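Lemma 3.3's claim that CO coincides with PM on the bi-linear game comes down to a single matrix identity: in the notation below, G² = −H. A quick numerical check, with illustrative constants of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta, gamma = 3, 0.1, 0.5
C = rng.standard_normal((n, n))
Z = np.zeros((n, n))
G = np.block([[Z, -C], [C.T, Z]])             # SGA update direction: z' = z + eta*G z
H = np.block([[C @ C.T, Z], [Z, C.T @ C]])    # Hessian of R(theta, omega), Eqn (3.4)
I = np.eye(2 * n)

K_pm = I + eta * G @ (I + gamma * G)  # PM: gradient step at the predicted point, Eqn (3.3)
K_co = I + eta * (G - gamma * H)      # CO: SGA step plus the consensus term, Eqn (3.5)
assert np.allclose(G @ G, -H)         # the identity driving the coincidence
assert np.allclose(K_pm, K_co)        # identical linear dynamics
```

Expanding K_pm = I + ηG + ηγG² and substituting G² = −H gives K_co term by term, which is exactly the statement proved below.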

Proof of Lemma 3.3. In this case consensus optimization satisfies the following update:

(5.4)  [ θ_{t+1} ; ω_{t+1} ] = ( I − η [ γCCᵀ  C ; −Cᵀ  γCᵀC ] ) [ θ_t ; ω_t ].

Again, let's analyze the singular values of the operator K := I − η[ γCCᵀ  C ; −Cᵀ  γCᵀC ], or equivalently the eigenvalues of KKᵀ:

KKᵀ = [ (I − ηγCCᵀ)² + η²CCᵀ   0 ; 0   (I − ηγCᵀC)² + η²CᵀC ].

Now consider the largest eigenvalue of (I − ηγCCᵀ)² + η²CCᵀ for a fixed γ, with a properly chosen η. Using the SVD C = UDVᵀ, we obtain

(I − ηγCCᵀ)² + η²CCᵀ = U ( (I − ηγD²)² + η²D² ) Uᵀ ⪯ ( 1 − 2γλ_min²(C)η + (γ²λ_max⁴(C) + λ_max²(C))η² ) I = ( 1 − γ²λ_min⁴(C)/(γ²λ_max⁴(C) + λ_max²(C)) ) I,

with η = γλ_min²(C)/(λ_max²(C) + γ²λ_max⁴(C)). ∎

Proof of Lemma 3.2. In the simple bi-linear game case,

[ θ_{t+1} ; ω_{t+1} ] = [ θ_t − ηC(ω_t + γCᵀθ_t) ; ω_t + ηCᵀ(θ_t − γCω_t) ] = [ I − ηγCCᵀ   −ηC ; ηCᵀ   I − ηγCᵀC ] [ θ_t ; ω_t ].

Note this linear system is the same as that of Lemma 3.3 in Eqn (5.4). Therefore the convergence analysis follows in the same way as in Lemma 3.3. ∎

REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint, 2017.

Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint, 2017.

Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv preprint, 2017.

Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1), 2017.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

Ferenc Huszár. GANs are broken in more than one way: The numerics of GANs. 2017. URL: inference.vc/my-notes-on-the-numerics-of-gans/

Tengyuan Liang. How well can generative adversarial networks learn densities: A nonparametric view. arXiv preprint, 2017.

Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. arXiv preprint, 2017.

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal?
a large-scale study arxiv preprint arxiv: Lars Mescheder Sebastian Nowozin and Andreas Geiger he numerics of gans arxiv preprint arxiv: Luke Metz Ben Poole David Pfau and Jascha Sohl-Dickstein Unrolled generative adversarial networks arxiv preprint arxiv: Vaishnavh Nagarajan and J Zico Kolter Gradient descent gan optimization is locally stable In Advances in Neural Information Processing Systems pages Yurii Nesterov Introductory lectures on convex optimization: A basic course volume 87 Springer Science & Business Media 13 Sebastian Nowozin Botond seke and Ryota omioka f-gan: raining generative neural samplers using variational divergence minimization In Advances in Neural Information Processing Systems pages David Pfau and Oriol Vinyals onnecting generative adversarial networks and actor-critic methods arxiv preprint arxiv: im Salimans Ian Goodfellow Wojciech Zaremba Vicki heung Alec Radford and Xi hen Improved techniques for training gans In Advances in Neural Information Processing Systems pages Satinder Singh Michael Kearns and Yishay Mansour Nash convergence of gradient dynamics in general-sum games In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence pages Morgan Kaufmann Publishers Inc Abhay Yadav Sohil Shah Zheng Xu David Jacobs and om Goldstein Stabilizing adversarial nets with prediction methods Accepted at ILR
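As a numerical sanity check of the contraction analysis in the proof of Lemma 3.3 (a sketch, not part of the original paper; the interaction matrix E, the dimension n, and the regularization weight γ below are arbitrary illustrative choices), one can build the operator K for a random square E, set η to the prescribed value, and verify that the squared spectral norm of K is at most 1 − γ²λ²min(EEᵀ)/(λmax(EEᵀ) + γ²λ²max(EEᵀ)):

```python
import numpy as np

# Bilinear game U(theta, omega) = theta^T E omega. Consensus optimization
# (and equally the predictive method) gives the linear update z_{t+1} = K z_t with
#   K = [[I - eta*gamma*E E^T, -eta*E], [eta*E^T, I - eta*gamma*E^T E]].
rng = np.random.default_rng(0)
n, gamma = 4, 0.5
E = rng.standard_normal((n, n))  # square E so that lambda_min(E E^T) > 0 generically

d2 = np.linalg.svd(E, compute_uv=False) ** 2  # eigenvalues of E E^T
lam_min, lam_max = d2.min(), d2.max()

# Learning rate prescribed by the proof of Lemma 3.3.
eta = gamma * lam_min / (lam_max + gamma**2 * lam_max**2)

I = np.eye(n)
K = np.block([
    [I - eta * gamma * (E @ E.T), -eta * E],
    [eta * E.T,                    I - eta * gamma * (E.T @ E)],
])

# Claimed contraction factor: ||K||^2 <= bound < 1, hence exponential convergence.
sq_norm = np.linalg.norm(K, 2) ** 2  # squared largest singular value of K
bound = 1 - gamma**2 * lam_min**2 / (lam_max + gamma**2 * lam_max**2)
print(sq_norm, bound)
```

Because KKᵀ is block diagonal, the squared spectral norm is exactly max over singular values d of (1 − ηγd²)² + η²d², so the printed value should sit below the bound for any full-rank E.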


More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information