A VARIATIONAL INEQUALITY PERSPECTIVE ON GENERATIVE ADVERSARIAL NETWORKS


Gauthier Gidel¹*, Hugo Berard¹,³*, Gaëtan Vignoud¹, Pascal Vincent¹,²,³, Simon Lacoste-Julien¹,²
¹ Mila & DIRO, Université de Montréal. ² CIFAR fellow. ³ Facebook Artificial Intelligence Research, Montreal.

ABSTRACT

Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notoriously difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a novel computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) form a generative modeling approach known for producing realistic natural images (Karras et al., 2018) as well as high quality super-resolution (Ledig et al., 2017) and style transfer (Zhu et al., 2017). Nevertheless, GANs are also known to be difficult to train, often displaying an unstable behavior (Goodfellow, 2016). Much recent work has tried to tackle these training difficulties, usually by proposing new formulations of the GAN objective (Nowozin et al., 2016; Arjovsky et al., 2017). Each of these formulations can be understood as a two-player game, in the sense of game theory (Von Neumann and Morgenstern, 1944), and can be addressed as a variational inequality problem (VIP) (Harker and Pang, 1990), a framework that encompasses traditional saddle point optimization algorithms (Korpelevich, 1976).

Solving such GAN games is traditionally approached by running variants of stochastic gradient descent (SGD) initially developed for optimizing supervised neural network objectives. Yet it is known that for some games (Goodfellow, 2016, §8.2) SGD exhibits oscillatory behavior and fails to converge. This oscillatory behavior, which does not arise from stochasticity, highlights a fundamental problem: while a direct application of basic gradient descent is an appropriate method for regular minimization problems, it is not a sound optimization algorithm for the kind of two-player games of GANs. This constitutes a fundamental issue for GAN training, and calls for the use of more principled methods with more reassuring convergence guarantees.

Contributions. We point out that multi-player games can be cast as variational inequality problems (VIPs) and consequently the same applies to any GAN formulation posed as a minimax or non-zero-sum game. We present two techniques from this literature, namely averaging and extrapolation, widely used to solve VIPs but which had not been explored in the context of GANs before.¹ We extend standard GAN training methods such as SGD or Adam into variants that incorporate these techniques (Alg. 3 and Alg. 4 are new). We also explain that the oscillations of basic SGD for GAN training previously noticed (Goodfellow, 2016) can be explained by standard variational inequality optimization results, and we illustrate how averaging and extrapolation can fix this issue.
*Equal contribution, correspondence to firstname.lastname@umontreal.ca.
¹ Independent works (Mertikopoulos et al., 2018) and (Yazıcı et al., 2018) respectively explored extrapolation and averaging in the context of GANs. More details in the related work, §6.

We introduce a new technique, called extrapolation from the past, that only requires one gradient computation per iteration, compared to extrapolation which requires computing the gradient twice. We prove its convergence in the stochastic variational inequality setting, i.e. when applied to SGD. Finally, we test these techniques in the context of standard GAN training. We observe a 4%-6% improvement on the inception score (Salimans et al., 2016) of WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017) on the CIFAR-10 dataset.

Outline. §2 presents the background on GANs and optimization, and shows how to cast this optimization as a VIP. §3 presents standard techniques to optimize variational inequalities in a batch setting, as well as our new one, extrapolation from the past. §4 considers these methods in the stochastic setting, yielding three corresponding variants of SGD, and provides their respective convergence rates. §5 develops how to combine these techniques with already existing algorithms. §6 discusses the related work and §7 presents experimental results.

2 GAN OPTIMIZATION AS A VARIATIONAL INEQUALITY PROBLEM

2.1 GAN FORMULATIONS

The purpose of generative modeling is to generate samples from a distribution q_θ that matches best the true distribution p of the data. The generative adversarial network training strategy can be understood as a game between two players called generator and discriminator. The former produces a sample that the latter has to classify as real or fake data. The final goal is to build a generator able to produce sufficiently realistic samples to fool the discriminator. In the original GAN paper (Goodfellow et al., 2014), the GAN objective is formulated as a zero-sum game where the cost function of the discriminator D_ϕ is given by the negative log-likelihood of the binary classification task between real data and fake data generated from q_θ by the generator,

  min_θ max_ϕ L(θ, ϕ) where L(θ, ϕ) = E_{x∼p}[log D_ϕ(x)] + E_{x′∼q_θ}[log(1 − D_ϕ(x′))].  (1)

However, Goodfellow et al. (2014) recommend to use in practice a second formulation, called the non-saturating GAN. This formulation is a non-zero-sum game where the aim is to jointly minimize:

  L^(θ)(θ, ϕ) = −E_{x′∼q_θ}[log D_ϕ(x′)] and L^(ϕ)(θ, ϕ) = −E_{x∼p}[log D_ϕ(x)] − E_{x′∼q_θ}[log(1 − D_ϕ(x′))].  (2)

The dynamics of this formulation have the same stationary points as the zero-sum one (1) but are claimed to provide much stronger gradients early in learning (Goodfellow et al., 2014).
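To make the difference between the two formulations concrete, here is a minimal NumPy sketch contrasting the saturating generator loss inside (1) with the non-saturating loss L^(θ) of (2). The logistic discriminator on random toy data is a hypothetical setup of ours, not the paper's:

```python
import numpy as np

# Minimal sketch (toy setup): a logistic discriminator D(x) = sigmoid(phi @ x)
# evaluated on fake samples x'. Compares the saturating generator loss from Eq. (1)
# with the non-saturating loss L^(theta) of Eq. (2).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
phi = rng.normal(size=5)             # discriminator parameters (toy)
x_fake = rng.normal(size=(64, 5))    # minibatch of generator samples

d_fake = sigmoid(x_fake @ phi)       # D_phi(x') for each fake sample

# Saturating objective: the generator minimizes E[log(1 - D(x'))], cf. Eq. (1).
loss_saturating = np.mean(np.log(1.0 - d_fake))
# Non-saturating objective: the generator minimizes E[-log D(x')], cf. Eq. (2).
loss_non_saturating = np.mean(-np.log(d_fake))

# With respect to the logit s = phi @ x': d/ds log(1 - sigmoid(s)) = -sigmoid(s),
# which vanishes when D(x') ~ 0 (a confident discriminator), whereas
# d/ds [-log sigmoid(s)] = sigmoid(s) - 1 stays large in that regime,
# which is the "stronger gradients early in learning" claim.
print(loss_saturating, loss_non_saturating)
```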
2.2 EQUILIBRIUM

The minimax formulation (1) is theoretically convenient because a large literature on games studies this problem and provides guarantees on the existence of equilibria. Nevertheless, practical considerations lead the GAN literature to consider a different objective for each player, as formulated in (2). In that case, the two-player game problem (Von Neumann and Morgenstern, 1944) consists in finding the following Nash equilibrium:

  θ* ∈ argmin_{θ∈Θ} L^(θ)(θ, ϕ*) and ϕ* ∈ argmin_{ϕ∈Φ} L^(ϕ)(θ*, ϕ).  (3)

Only when L^(θ) = −L^(ϕ) is the game called a zero-sum game, and only then can (3) be formulated as a minimax problem. One important point to notice is that the two optimization problems in (3) are coupled and have to be considered jointly from an optimization point of view. Standard GAN objectives are non-convex (i.e. each cost function is non-convex), and thus such (pure) equilibria may not exist. As far as we know, not much is known about the existence of these equilibria for non-convex losses (see Heusel et al. (2017) and references therein for some results).

In our theoretical analysis in §4, our assumptions (monotonicity (24) of the operator and convexity of the constraint set) imply the existence of an equilibrium. In this paper, we focus on ways to optimize these games, assuming that an equilibrium exists. As is often standard in non-convex optimization, we also focus on finding points satisfying the necessary stationary conditions.

As we mentioned previously, one difficulty that emerges in the optimization of such games is that the two different cost functions of (3) have to be minimized jointly in θ and ϕ. Fortunately, the optimization literature has for a long time studied so-called variational inequality problems, which generalize the stationary conditions for two-player game problems.

2.3 VARIATIONAL INEQUALITY PROBLEM FORMULATION

We first consider the local necessary conditions that characterize the solution of the smooth two-player game (3), defining stationary points, which will motivate the definition of a variational inequality. In the unconstrained setting, a stationary point is a couple (θ*, ϕ*) with zero gradient:

  ∇_θ L^(θ)(θ*, ϕ*) = ∇_ϕ L^(ϕ)(θ*, ϕ*) = 0.  (4)

When constraints are present, a stationary point (θ*, ϕ*) is such that the directional derivative of each cost function is non-negative in any feasible direction (i.e. there is no feasible descent direction):

  ∇_θ L^(θ)(θ*, ϕ*)ᵀ(θ − θ*) ≥ 0 and ∇_ϕ L^(ϕ)(θ*, ϕ*)ᵀ(ϕ − ϕ*) ≥ 0, ∀(θ, ϕ) ∈ Θ × Φ.  (5)

Defining ω := (θ, ϕ), ω* := (θ*, ϕ*) and Ω := Θ × Φ, Eq. (5) can be compactly formulated as:

  F(ω*)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω, where F(ω) := [∇_θ L^(θ)(θ, ϕ); ∇_ϕ L^(ϕ)(θ, ϕ)].  (6)

These stationary conditions can be generalized to any continuous vector field: let Ω ⊂ ℝ^d and F : Ω → ℝ^d be a continuous mapping. The variational inequality problem (Harker and Pang, 1990), depending on F and Ω, is:

  find ω* ∈ Ω such that F(ω*)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω.  (VIP)

We call the optimal set the set Ω* of ω ∈ Ω verifying (VIP). The intuition behind it is that any ω* ∈ Ω* is a fixed point of the dynamics of F constrained to Ω. We have thus shown that both saddle point optimization and non-zero-sum game optimization, which encompass the large majority of GAN variants proposed in the literature, can be cast as VIPs. In the next section, we turn to suitable optimization techniques for such problems.

3 OPTIMIZATION OF VARIATIONAL INEQUALITIES (BATCH SETTING)

Let us begin by looking at techniques that were developed in the optimization literature to solve (VIP). We present the intuitions behind them as well as their performance on a simple bilinear problem (see Fig. 1). Our goal here is to provide mathematical insights into the techniques of averaging (§3.1) and extrapolation (§3.2), to inspire their application when extending other optimization algorithms. We then propose a novel variant of the extrapolation technique in §3.3: extrapolation from the past. We here treat the batch setting, i.e. we consider that the operator F(ω) defined in Eq. (6) yields an exact full gradient. We will present extensions of these techniques to the stochastic setting later in §4.

The two standard methods studied in the VIP literature are the gradient method (Bruck, 1977) and the extragradient method (Korpelevich, 1976). The iterates of the basic gradient method are given by ω_{t+1} = P_Ω[ω_t − ηF(ω_t)], where P_Ω[·] is the projection onto the constraint set (if constraints are present²) associated to (VIP). These iterates are known to converge linearly under an additional assumption on the operator³ (Chen and Rockafellar, 1997), but they oscillate for a bilinear operator, as shown in Fig. 1. On the other hand, the uniform average of these iterates converges for any bounded monotone operator with a O(1/√t) rate (Nedić and Ozdaglar, 2009), motivating the presentation of averaging in §3.1.
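As a rough illustration of this behavior, the following sketch runs the projected gradient method with online uniform averaging on the 2D bilinear operator F(θ, ϕ) = (ϕ, −θ) studied in §3.1 below; the ball constraint is our choice for illustration only:

```python
import numpy as np

# Minimal sketch: projected gradient method omega_{t+1} = P_Omega[omega_t - eta F(omega_t)]
# on the bilinear operator F(theta, phi) = (phi, -theta), whose solution is (0, 0).
# Unconstrained, the squared norm grows by (1 + eta^2) per step; projected onto a ball
# (a compact Omega), the iterates keep circling while their uniform average improves.
def project_ball(w, radius=2.0):     # P_Omega for Omega = {w : ||w|| <= radius}
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

eta = 0.1
w = np.array([1.0, 1.0])
avg = np.zeros(2)
for t in range(2000):
    grad = np.array([w[1], -w[0]])   # F(omega_t)
    w = project_ball(w - eta * grad) # projected gradient step
    avg += (w - avg) / (t + 1)       # online uniform averaging, cf. Eq. (8) below
print(np.linalg.norm(w))             # stays near the boundary, oscillating
print(np.linalg.norm(avg))           # the average drifts toward the solution
```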
By contrast, the extragradient method (extrapolated gradient) does not require any averaging to converge for monotone operators (in the batch setting), and can even converge at the faster O(1/t) rate (Nesterov, 2007). The idea of this method is to compute a lookahead step (see the intuition on extrapolation in §3.2) in order to compute a more stable direction to follow.

² An example of a constraint for GANs is to clip the parameters of the discriminator (Arjovsky et al., 2017).
³ Strong monotonicity, a generalization of strong convexity. See §A.

3.1 AVERAGING

More generally, we consider a weighted averaging scheme with weights ρ_t ≥ 0. This weighted averaging scheme was proposed for the first time for (batch) VIPs by Bruck (1977),

  ω̄_T = (1/S_T) Σ_{t=0}^{T−1} ρ_t ω_t, S_T = Σ_{t=0}^{T−1} ρ_t.  (7)

Averaging schemes can be efficiently implemented in an online fashion, noticing that

  ω̄_t = (1 − ρ̃_t) ω̄_{t−1} + ρ̃_t ω_t, where 0 ≤ ρ̃_t ≤ 1.  (8)

For instance, setting ρ̃_t = 1/t provides uniform averaging (ρ_t = 1), and ρ̃_t = 1 − β < 1 provides geometric averaging, also known as exponential moving averaging (ρ_t = β^{T−t}). Averaging is experimentally compared with the other techniques presented in this section in Fig. 1.

In order to illustrate how averaging tackles the oscillatory behavior in game optimization, we consider a toy example where the discriminator and the generator are linear: D_ϕ(x) = ϕᵀx and G_θ(z) = θz (implicitly defining q_θ). By replacing these expressions in the WGAN objective,⁴ we get the following bilinear objective:

  min_{θ∈Θ} max_{ϕ∈Φ, ‖ϕ‖≤1} ϕᵀE[x] − ϕᵀθ E[z].  (9)

A similar task was presented by Nagarajan and Kolter (2017), who consider a quadratic discriminator instead of a linear one and show that gradient descent is not necessarily asymptotically stable. The bilinear objective has been extensively used (Goodfellow, 2016; Mescheder et al., 2018; Yadav et al., 2018) to highlight the difficulties of gradient descent for saddle point optimization. Yet, ways to cope with this issue were proposed decades ago in the context of mathematical programming.

Simplifying further by setting the dimension to 1 and centering the equilibrium at the origin, Eq. (9) becomes:

  min_{θ∈ℝ} max_{ϕ∈ℝ} θ·ϕ, with (θ*, ϕ*) = (0, 0).  (10)

The operator associated with this minimax game is F(θ, ϕ) = (ϕ, −θ). There are several ways to compute the discrete updates of this dynamics. The two most common ones are the simultaneous and the alternating gradient update rules,

  Simultaneous: θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_t;  Alternating: θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_{t+1}.  (11)

Interestingly, these two choices give rise to completely different behaviors. The norm of the simultaneous updates diverges geometrically, whereas the alternating iterates are bounded but do not converge to the equilibrium. As a consequence, their respective uniform averages have a different behavior, as highlighted in the following proposition (more details and proof in §B.1):

Proposition 1. The simultaneous iterates diverge geometrically and the alternating iterates defined in (11) are bounded but do not converge to 0, as

  Simultaneous: θ²_{t+1} + ϕ²_{t+1} = (1 + η²)(θ²_t + ϕ²_t),  Alternating: θ²_t + ϕ²_t = Θ(θ²_0 + ϕ²_0),  (12)

where u_t = Θ(v_t) ⇔ ∃α, β > 0 : α v_t ≤ u_t ≤ β v_t. The uniform average (θ̄_t, ϕ̄_t) = (1/t) Σ_{s=0}^{t−1} (θ_s, ϕ_s) of the simultaneous updates (resp. the alternating updates) diverges (resp. converges to 0), as

  Simultaneous: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0)(1 + η²)^t / (η² t²) ),  Alternating: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0) / (η² t²) ).  (13)

This sublinear convergence result, proved in §B, underlines the benefits of averaging when the sequence of iterates is bounded (i.e. for the alternating update rule). When the sequence of iterates is not bounded (i.e. for simultaneous updates), averaging fails to ensure convergence. This proposition also shows how alternating updates may have better convergence properties than simultaneous updates.

⁴ Wasserstein GAN (WGAN) proposed by Arjovsky et al. (2017) boils down to the following minimax formulation: min_{θ∈Θ} max_{ϕ∈Φ, ‖D_ϕ‖_L≤1} E_{x∼p}[D_ϕ(x)] − E_{x′∼q_θ}[D_ϕ(x′)].
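A minimal numerical check of Proposition 1 (plain NumPy; the step-size and horizon are arbitrary choices of ours):

```python
import numpy as np

# Numerical check of Proposition 1 on min_theta max_phi theta*phi (Eq. 10):
# simultaneous iterates diverge, alternating iterates stay bounded, and the
# uniform average of the alternating iterates converges to the equilibrium (0, 0).
eta = 0.1
ts, ps = 1.0, 1.0            # simultaneous iterates (theta, phi)
ta, pa = 1.0, 1.0            # alternating iterates
avg = np.zeros(2)
for t in range(5000):
    ts, ps = ts - eta * ps, ps + eta * ts        # simultaneous update (Eq. 11, left)
    ta = ta - eta * pa                           # alternating update (Eq. 11, right)
    pa = pa + eta * ta                           # ... uses the fresh theta_{t+1}
    avg += (np.array([ta, pa]) - avg) / (t + 1)  # online averaging of alternating iterates
print(ts**2 + ps**2)         # ~ (1 + eta^2)^t: diverges geometrically
print(ta**2 + pa**2)         # bounded, but does not converge to 0
print(avg)                   # ~ O(1/(eta*t)): converges to (0, 0)
```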

3.2 EXTRAPOLATION

Another technique used in the variational inequality literature to prevent oscillations is extrapolation. This concept is anterior to the extragradient method, since Korpelevich (1976) mentions that the idea of extrapolated "prices" to give stability had already been formulated by Polyak (1963, Chap. II). The idea behind this technique is to compute the gradient at an (extrapolated) point different from the current point from which the update is performed, stabilizing the dynamics:

  Compute extrapolated point: ω_{t+1/2} = P_Ω[ω_t − ηF(ω_t)],  (14)
  Perform update step: ω_{t+1} = P_Ω[ω_t − ηF(ω_{t+1/2})].  (15)

Note that, even in the unconstrained case, this method is intrinsically different from Nesterov's momentum⁵ (Nesterov, 1983, Eq. 2.2.9) because of this lookahead step for the gradient computation:

  Nesterov's method: ω_{t+1/2} = ω_t − ηF(ω_t), ω_{t+1} = ω_{t+1/2} + β(ω_{t+1/2} − ω_{t−1/2}).  (16)

Nesterov's method does not converge when trying to optimize (10). One intuition explaining why extrapolation provides better convergence properties than the standard gradient method comes from the framework of Euler's methods (see for instance (Atkinson, 2003) for more details on that topic). Indeed, if we consider a first order approximation of ω_{t+1/2}, we have ω_{t+1/2} = ω_{t+1} + o(η), and consequently the update step (15) is close to an implicit method step:

  Implicit step: ω_{t+1} = ω_t − ηF(ω_{t+1}).  (17)

In the literature on Euler's methods, implicit methods are known to be more stable and to benefit from better convergence properties (Atkinson, 2003) than explicit methods. They are not often used in practice, though, since they require solving a potentially non-linear system at each iteration. Taking back the simplified WGAN toy example (10) from §3.1, we get the following update rules:

  Implicit: θ_{t+1} = θ_t − ηϕ_{t+1}, ϕ_{t+1} = ϕ_t + ηθ_{t+1};  Extrapolation: θ_{t+1} = θ_t − η(ϕ_t + ηθ_t), ϕ_{t+1} = ϕ_t + η(θ_t − ηϕ_t).  (18)

In the following proposition, we see that the respective convergence rates of the implicit method and extrapolation are highly similar. Keeping in mind that the latter has the major advantage of being more practical, this proposition clearly underlines the benefits of extrapolation (more details and proof in §B.2):

Proposition 2. The squared norm of the iterates, N_t = θ²_t + ϕ²_t, where the update rules of θ_t and ϕ_t are defined in (18), decreases geometrically for any η < 1, as

  Implicit: N_{t+1} = (1 − η² + η⁴ + O(η⁶)) N_t,  Extrapolation: N_{t+1} = (1 − η² + η⁴) N_t.  (19)

3.3 EXTRAPOLATION FROM THE PAST

One issue with extrapolation is that the algorithm "wastes" a gradient (14): we need to compute the gradient at two different points for every single update of the parameters. We thus propose a new technique, which we call extrapolation from the past, that only requires computing one gradient for every update. The idea of this technique is to store and re-use the previous extrapolated gradient to compute the new extrapolation point:

  Extrapolation from the past: ω_{t+1/2} = P_Ω[ω_t − ηF(ω_{t−1/2})],  (20)
  Perform update step: ω_{t+1} = P_Ω[ω_t − ηF(ω_{t+1/2})], and store F(ω_{t+1/2}).  (21)

This update scheme can be related to (Chiang et al., 2012, Alg. 1) in the context of online learning and to the optimistic mirror descent (Rakhlin and Sridharan, 2013; Daskalakis et al., 2018). In the unconstrained case, (20) and (21) reduce to:

  Optimistic mirror descent (OMD): ω_{t+1/2} = ω_{t−1/2} − 2ηF(ω_{t−1/2}) + ηF(ω_{t−3/2}).  (22)

However, our technique comes from a different perspective: it was motivated by VIPs and inspired from the extragradient method.
Furthermore, our technique extends to constrained optimization, as shown in (20) and (21). It is not clear whether or not a single projection added to (22) provides a provably converging algorithm. Using the VIP point of view, we are able to prove a linear convergence rate for a projected version of extrapolation from the past (see details and proof of Theorem 1 in §B.3). We also extend these results to the stochastic operator setting in §4.

⁵ Sutskever (2013, §7.2) showed the equivalence between standard momentum and Nesterov's formulation.
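The following sketch runs both extrapolation (14)-(15) and extrapolation from the past (20)-(21) on the unconstrained toy game (10); note how the second variant performs a single evaluation of F per iteration by storing the previous extrapolated gradient:

```python
import numpy as np

# Minimal sketch: extrapolation (Eqs. 14-15) and extrapolation from the past
# (Eqs. 20-21) on the unconstrained bilinear game of Eq. (10), with
# F(theta, phi) = (phi, -theta). Both converge to (0, 0).
def F(w):
    return np.array([w[1], -w[0]])

eta = 0.1
w_extra = np.array([1.0, 1.0])
w_past = np.array([1.0, 1.0])
g_stored = F(w_past)                     # F(omega_{-1/2}), initialized at omega_0
for t in range(500):
    # Extragradient: two evaluations of F per iteration.
    w_half = w_extra - eta * F(w_extra)  # extrapolation step, Eq. (14)
    w_extra = w_extra - eta * F(w_half)  # update step, Eq. (15)
    # Extrapolation from the past: one evaluation of F per iteration.
    w_half = w_past - eta * g_stored     # Eq. (20), reuses the stored gradient
    g_stored = F(w_half)                 # single new evaluation, stored for next step
    w_past = w_past - eta * g_stored     # Eq. (21)
print(np.linalg.norm(w_extra), np.linalg.norm(w_past))  # both -> 0
```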

Figure 1: Comparison of the basic gradient method (as well as Adam) with the techniques presented in §3 on the optimization of (9), plotted in the (θ, ϕ) plane: Adam (γ = 0.01), gradient method (γ = 0.1), averaging (γ = 2.0), extrapolation from the past (γ = 0.5) and extrapolation (γ = 0.6). Only the algorithms advocated in this paper (averaging, extrapolation and extrapolation from the past) converge quickly to the solution. Each marker represents 20 iterations. We compare these algorithms on a non-convex objective in §G.1.

Algorithm 1 AvgSGD
  Let ω_0 ∈ Ω.
  for t = 0 … T − 1 do
    Sample ξ_t ∼ P (minibatch)
    d_t ← F(ω_t, ξ_t)
    ω_{t+1} ← P_Ω[ω_t − η_t d_t]
  end for
  Return ω̄_T = Σ_t η_t ω_t / Σ_t η_t

Algorithm 2 AvgExtraSGD
  Let ω_0 ∈ Ω.
  for t = 0 … T − 1 do
    Sample ξ_t, ξ′_t ∼ P (minibatches)
    d_t ← F(ω_t, ξ_t)
    ω′_t ← P_Ω[ω_t − η_t d_t]   (extrapolation step)
    d′_t ← F(ω′_t, ξ′_t)
    ω_{t+1} ← P_Ω[ω_t − η_t d′_t]
  end for
  Return ω̄_T = Σ_t η_t ω′_t / Σ_t η_t

Algorithm 3 AvgPastExtraSGD
  Let ω_0 ∈ Ω and d_{−1} = 0.
  for t = 0 … T − 1 do
    Sample ξ_t ∼ P (minibatch)
    ω′_t ← P_Ω[ω_t − η_t d_{t−1}]   (extrapolation from the past)
    d_t ← F(ω′_t, ξ_t)
    ω_{t+1} ← P_Ω[ω_t − η_t d_t]
  end for
  Return ω̄_T = Σ_t η_t ω′_t / Σ_t η_t

Figure 2: Three variants of SGD using the techniques introduced in §3.

Theorem 1 (Linear convergence of extrapolation from the past). If F is μ-strongly monotone (see §A for the definition of strong monotonicity) and L-Lipschitz, then the updates (20) and (21) with η = 1/(4L) provide linearly converging iterates,

  ‖ω_t − ω*‖² ≤ (1 − μ/(4L))^t ‖ω_0 − ω*‖², ∀t ≥ 0.  (23)

In comparison to the results from (Daskalakis et al., 2018), which hold only for a bilinear objective, we provide a faster convergence rate (linear vs sublinear) on the last iterate for a general (strongly monotone) operator F and any projection onto a convex Ω. One thing to notice is that the operator of a bilinear objective is not strongly monotone, but in that case one can use the standard extrapolation method (14), which converges linearly for a (constrained or not) bilinear game (Tseng, 1995, Cor. 3.3).

4 OPTIMIZATION OF VIPs WITH STOCHASTIC GRADIENTS

In this section, we consider extensions of the techniques presented in §3 for optimizing (VIP) to the context of a stochastic operator. In this case, at each time step we no longer have access to the exact gradient F(ω), but to an unbiased stochastic estimate of it, F(ω, ξ), where ξ ∼ P and F(ω) := E_{ξ∼P}[F(ω, ξ)]. This is motivated by the GAN formulation, where we only have access to a finite sample estimate of the expected gradient, computed on a mini-batch. For GANs, ξ is thus a mini-batch of points coming from the true data distribution p and the generator distribution q_θ.

For our analysis, we require at least one of the two following assumptions on the stochastic operator:

Assumption 1. Bounded variance by σ²: E_ξ[‖F(ω) − F(ω, ξ)‖²] ≤ σ², ∀ω ∈ Ω.

Assumption 2. Bounded expected squared norm by M²: E_ξ[‖F(ω, ξ)‖²] ≤ M², ∀ω ∈ Ω.

Assump. 1 is standard in stochastic variational analysis, while Assump. 2 is a stronger assumption sometimes made in stochastic convex optimization. To illustrate how strong Assump. 2 is, note that it does not hold for an unconstrained bilinear objective like our example (10) in §3. It is thus mainly reasonable for bounded constraint sets. Note that in practice we have σ ≤ M.

We now present and analyze three algorithms, variants of SGD that are appropriate to solve (VIP). The first one, Alg. 1 (AvgSGD), is the stochastic extension of the gradient method for solving (VIP); Alg. 2 (AvgExtraSGD) uses extrapolation and Alg. 3 (AvgPastExtraSGD) uses extrapolation from the past. A fourth variant (Alg. 5) is proposed in §D. These three algorithms return an average of the iterates. The proofs of the theorems presented in this section are in §F.
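For concreteness, here is a minimal NumPy transcription of Alg. 3 (AvgPastExtraSGD). The names `oracle` and `project` are placeholders of ours, standing for the stochastic operator F(ω, ξ) and the projection P_Ω, and d_{−1} = 0 is one possible initialization convention:

```python
import numpy as np

# Minimal sketch of Alg. 3 (AvgPastExtraSGD) for a generic stochastic operator.
# `oracle(w, rng)` must return an unbiased estimate of F(w); `project` is P_Omega.
def avg_past_extra_sgd(oracle, project, w0, eta, T, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    d = np.zeros_like(w0)              # stored gradient d_{t-1}, initialized to 0
    w_avg, s = np.zeros_like(w0), 0.0
    for t in range(T):
        w_half = project(w - eta * d)  # extrapolation from the past
        d = oracle(w_half, rng)        # d_t <- F(omega'_t, xi_t), stored for next step
        w = project(w - eta * d)       # update step
        w_avg += eta * (w_half - w_avg) / (s + eta)  # online weighted average of omega'_t
        s += eta
    return w_avg
```

For example, with `oracle = lambda w, rng: np.array([w[1], -w[0]])` (the noiseless toy operator of §3.1) and `project` the identity, the returned average converges to the equilibrium.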

To handle constraints such as parameter clipping (Arjovsky et al., 2017), we present a projected version of these algorithms, where P_Ω[ω′] denotes the projection of ω′ onto Ω (see §A). Note that when Ω = ℝ^d, the projection is the identity mapping (unconstrained setting). In order to prove the convergence of these three algorithms, we will assume that F is monotone:

  (F(ω) − F(ω′))ᵀ(ω − ω′) ≥ 0, ∀ω, ω′ ∈ Ω.  (24)

If F can be written as (6), this implies that the cost functions are convex.⁶ Note, however, that in general GANs parametrized with neural networks lead to non-monotone VIPs.

Assumption 3. F is monotone and Ω is a compact convex set such that max_{ω,ω′∈Ω} ‖ω − ω′‖² ≤ R².

In that setting, the quantity g(ω′) := max_{ω∈Ω} F(ω)ᵀ(ω′ − ω) is well defined and is equal to 0 if and only if ω′ is a solution of (VIP). Moreover, if we are optimizing a zero-sum game, we have ω = (θ, ϕ), Ω = Θ × Φ and F(θ, ϕ) = [∇_θL(θ, ϕ); −∇_ϕL(θ, ϕ)]. Hence, the quantity h(θ′, ϕ′) := max_{ϕ∈Φ} L(θ′, ϕ) − min_{θ∈Θ} L(θ, ϕ′) is well defined and equal to 0 if and only if (θ′, ϕ′) is a Nash equilibrium of the game. The two functions g and h are called merit functions (more details on the concept of merit functions in §C). In the following, we call

  Err(ω) = max_{(θ′,ϕ′)∈Θ×Φ} L(θ, ϕ′) − L(θ′, ϕ), if F(θ, ϕ) = [∇_θL(θ, ϕ); −∇_ϕL(θ, ϕ)];
  Err(ω) = max_{ω′∈Ω} F(ω′)ᵀ(ω − ω′), otherwise.  (25)

Averaging. Alg. 1 (AvgSGD) presents the stochastic gradient method with averaging, which reduces to the standard (simultaneous) SGD updates for the two-player games used in the GAN literature, but returning an average of the iterates.

Theorem 2. Under Assump. 1, 2 and 3, SGD with averaging (Alg. 1) with a constant step-size η gives

  E[Err(ω̄_T)] ≤ R²/(2ηT) + η(M² + σ²)/2, where ω̄_T = (1/T) Σ_{t=0}^{T−1} ω_t, ∀T ≥ 1.  (26)

Thm. 2 uses a similar proof as (Nemirovski et al., 2009). The constant term η(M² + σ²)/2 in (26) is called the variance term. This type of bound is standard in stochastic optimization. We also provide in §F a similar rate with an extra log factor when η_t = η/√t. We show in §E that this variance term is smaller than the one of the SGD with prediction method (Yadav et al., 2018).

Extrapolations. Alg. 2 (AvgExtraSGD) adds an extrapolation step compared to Alg. 1 in order to reduce the oscillations due to the game between the two players. A theoretical consequence is that it has a smaller variance term than (26). As discussed previously, Assump. 2, made in Thm. 2 for the convergence of Alg. 1, is very strong in the unbounded setting. One advantage of SGD with extrapolation is that Thm. 3 does not require this assumption.

Theorem 3. (Juditsky et al., 2011, Thm. 1) Under Assump. 1 and 3, if E_ξ[F(·, ξ)] is L-Lipschitz, then SGD with extrapolation and averaging (Alg. 2) using a constant step-size η ≤ 1/(√3 L) gives

  E[Err(ω̄_T)] ≤ R²/(ηT) + (7/2) η σ², where ω̄_T = (1/T) Σ_{t=0}^{T−1} ω′_t, ∀T ≥ 1.  (27)

Since in practice σ ≤ M, the variance term in (27) is significantly smaller than the one in (26). To summarize, SGD with extrapolation provides better convergence guarantees, but requires two gradient computations and two samples per iteration. This motivates our new method, Alg. 3 (AvgPastExtraSGD), which uses extrapolation from the past and achieves the best of both worlds.

Theorem 4. Under Assump. 1 and 3, if E_ξ[F(·, ξ)] is L-Lipschitz, then SGD with extrapolation from the past (Alg. 3) using a constant step-size η ≤ 1/(2√3 L) gives averaged iterates that converge as

  E[Err(ω̄_T)] ≤ R²/(ηT) + (13/2) η σ², where ω̄_T = (1/T) Σ_{t=0}^{T−1} ω′_t, ∀T ≥ 1.  (28)

The bound is similar to the one provided in Thm. 3, but each iteration of Alg. 3 is computationally half the cost of an iteration of Alg. 2.
⁶ The convexity of the cost functions in (3) is a necessary condition (not sufficient) for the operator to be monotone. In the context of a zero-sum game, the convexity of the cost functions is a sufficient condition.

5 COMBINING THE TECHNIQUES WITH ESTABLISHED ALGORITHMS

In the previous sections we presented several techniques that converge on a simple bilinear example. These techniques can be combined in practice with existing algorithms. We propose to combine them with two standard algorithms used for training deep neural networks: the Adam optimizer (Kingma and Ba, 2015) and the SGD optimizer (Robbins and Monro, 1951). Note that in the case of a two-player game (3), the previous results can be generalized to gradient updates with a different step-size for each player, by simply rescaling the objectives L^(θ) and L^(ϕ) by a different scaling factor. A detailed pseudo-code for Adam with extrapolation step (Extra-Adam) is given in Algorithm 4.

Algorithm 4 Extra-Adam: proposed Adam with extrapolation step.
  input: step-size η, decay rates for the moment estimates β₁, β₂, access to the stochastic gradients ∇ℓ_t(·) and to the projection P_Ω[·] onto the constraint set Ω, initial parameter ω_0, averaging scheme (ρ_t)_{t≥1}.
  for t = 0 … T − 1 do
    Option 1: Standard extrapolation. Sample a new minibatch and compute the stochastic gradient: g_t ← ∇ℓ_t(ω_t).
    Option 2: Extrapolation from the past. Load the previously saved stochastic gradient: g_t = ∇ℓ_{t−1/2}(ω_{t−1/2}).
    Update the estimate of the first moment for extrapolation: m_{t+1/2} ← β₁ m_t + (1 − β₁) g_t.
    Update the estimate of the second moment for extrapolation: v_{t+1/2} ← β₂ v_t + (1 − β₂) g²_t.
    Correct the bias for the moments: m̂_{t+1/2} ← m_{t+1/2}/(1 − β₁^{2t+1}), v̂_{t+1/2} ← v_{t+1/2}/(1 − β₂^{2t+1}).
    Perform the extrapolation step from the iterate at time t: ω_{t+1/2} ← P_Ω[ω_t − η m̂_{t+1/2}/(√v̂_{t+1/2} + ε)].
    Sample a new minibatch and compute the stochastic gradient: g_{t+1/2} ← ∇ℓ_{t+1/2}(ω_{t+1/2}).
    Update the estimate of the first moment: m_{t+1} ← β₁ m_{t+1/2} + (1 − β₁) g_{t+1/2}.
    Update the estimate of the second moment: v_{t+1} ← β₂ v_{t+1/2} + (1 − β₂) g²_{t+1/2}.
    Compute the bias-corrected first and second moments: m̂_{t+1} ← m_{t+1}/(1 − β₁^{2t+2}), v̂_{t+1} ← v_{t+1}/(1 − β₂^{2t+2}).
    Perform the update step from the iterate at time t: ω_{t+1} ← P_Ω[ω_t − η m̂_{t+1}/(√v̂_{t+1} + ε)].
  end for
  Output: ω_{T−1/2}, ω_T, or ω̄_T = Σ_{t=0}^{T−1} ρ_{t+1} ω_{t+1/2} / Σ_{t=0}^{T−1} ρ_{t+1} (see (8) for online averaging).
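Below is a minimal unconstrained NumPy sketch of Option 1 of Algorithm 4. The exact bias-correction bookkeeping above may differ in small details from this sketch, which simply counts one Adam moment update per half-step; `grad` is a placeholder of ours for the stochastic gradient oracle:

```python
import numpy as np

# Minimal sketch of Extra-Adam (Alg. 4, Option 1) in the unconstrained case
# (P_Omega = identity). `grad(w, rng)` stands for the stochastic gradient.
def extra_adam(grad, w0, eta=1e-4, beta1=0.5, beta2=0.9, eps=1e-8, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    m = np.zeros_like(w0)   # first-moment estimate
    v = np.zeros_like(w0)   # second-moment estimate
    k = 0                   # number of moment updates so far (two per iteration)
    def adam_direction(g):
        nonlocal m, v, k
        k += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**k)             # bias correction
        v_hat = v / (1 - beta2**k)
        return m_hat / (np.sqrt(v_hat) + eps)
    for t in range(T):
        g = grad(w, rng)                       # gradient at omega_t
        w_half = w - eta * adam_direction(g)   # extrapolation step
        g_half = grad(w_half, rng)             # gradient at the extrapolated point
        w = w - eta * adam_direction(g_half)   # update step, taken from omega_t
    return w
```

Note that the moment estimates are shared across the extrapolation and update half-steps, mirroring the pseudo-code: both gradients feed a single pair (m, v).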
6 RELATED WORK

The extragradient method is the standard algorithm to optimize variational inequalities. This algorithm was originally introduced by Korpelevich (1976) and extended by Nesterov (2007) and Nemirovski (2004). Stochastic versions of the extragradient method have recently been analyzed (Juditsky et al., 2011; Yousefian et al., 2014; Iusem et al., 2017) for stochastic variational inequalities with bounded constraints. A linearly convergent variance-reduced version of the stochastic gradient method was proposed by Palaniappan and Bach (2016) for strongly monotone variational inequalities.

Several methods to stabilize GANs consist in transforming a zero-sum formulation into a more general game that can no longer be cast as a saddle point problem. This is the case of the non-saturating formulation of GANs (Goodfellow et al., 2014; Fedus et al., 2018), the DCGANs (Radford et al., 2016), and the gradient penalty⁷ for WGANs (Gulrajani et al., 2017). Yadav et al. (2018) propose an optimization method for GANs based on AltSGD using a momentum-based step on the generator. Daskalakis et al. (2018) proposed a method inspired from game theory. Li et al. (2017) suggest dualizing the GAN objective to reformulate it as a maximization problem, and Mescheder et al. (2017) propose to add the norm of the gradient to the objective, providing an interesting perspective on GANs that interprets training as the search for a two-player game equilibrium. A study of the continuous version of two-player games has been conducted by Ratliff et al. (2016). Interesting non-convex results were proved, for a new notion of regret minimization, by Hazan et al. (2017), and in the context of GANs by Grnarova et al. (2018).

The technique of unrolling steps proposed by Metz et al. (2017) can be confused with extrapolation but is actually fundamentally different: the perspective there is to try to construct the "true generator objective function" by unrolling for K steps the updates of the discriminator before updating the generator. Nevertheless, the fact that this "true generator function" may not be found with a satisfying accuracy may lead to a different behavior than the one expected.

Regarding the averaging technique, some recent works appear to have already successfully used geometric averaging (7) for GANs in practice, but only briefly mention it (Karras et al., 2018; Mescheder et al., 2018). By contrast, the present work formally motivates and justifies the use of averaging for GANs by relating them to the VIP perspective, and sheds light on its underlying intuitions in §3.1. Another independent work (Yazıcı et al., 2018) made a similar attempt, but in the context of regret minimization in games. Mertikopoulos et al. (2018) also independently explored extrapolation, providing asymptotic convergence results (i.e. without any rate of convergence) in the context of coherent saddle points. The coherence assumption is slightly weaker than monotonicity.

⁷ The gradient penalty is only added to the discriminator cost function. Since this gradient penalty also depends on the generator, WGAN-GP cannot be cast as a saddle point problem and is actually a non-zero-sum game.

7 EXPERIMENTS

Our goal in this experimental section is not to provide new state-of-the-art results with architectural improvements or a new GAN formulation, but to show that using the techniques introduced earlier (which come with theoretical guarantees in the monotone case) allows us to optimize standard GANs in a better way. These techniques are orthogonal to the design of new formulations of GAN optimization objectives and to architectural choices, and can thus potentially be used for the training of any type of GAN.

We compare the following optimization algorithms. The baselines are SGD and Adam, using either simultaneous updates on the generator and on the discriminator (denoted SimAdam and SimSGD) or k updates on the discriminator alternating with 1 update on the generator (denoted AltSGD{k} and AltAdam{k}).⁸ Variants that use extrapolation are denoted ExtraSGD (Alg. 2) and ExtraAdam (Alg. 4). Variants using extrapolation from the past are PastExtraSGD (Alg. 3) and PastExtraAdam (Alg. 4). We also present results using the averaged iterates as output, adding Avg as a prefix of the algorithm name when we use (uniform) averaging.

7.1 BILINEAR SADDLE POINT (STOCHASTIC)

We first evaluate the performance of the various stochastic algorithms on a simple (n = 10³, d = 10³) finite-sum bilinear objective (a monotone operator) constrained to [−1, 1]^d:

  (1/n) Σ_{i=1}^n ( θᵀ M^(i) ϕ + θᵀ a^(i) + ϕᵀ b^(i) ),  (29)

solved by (θ*, ϕ*) such that M̄ϕ* = −ā and M̄ᵀθ* = −b̄, where ā = (1/n) Σ_{i=1}^n a^(i), b̄ = (1/n) Σ_{i=1}^n b^(i) and M̄ = (1/n) Σ_{i=1}^n M^(i). The matrices and vectors M^(i), a^(i), b^(i), 1 ≤ i ≤ n, were randomly generated, but ensuring that (θ*, ϕ*) belongs to [−1, 1]^d. Results are shown in Fig. 3: the methods combining averaging with extrapolation (AvgExtraSGD and AvgPastExtraSGD) perform the best on this task.

Figure 3: Performance (distance to the optimum versus number of gradient computations divided by n) of the considered stochastic optimization algorithms (AltAdam5, AltSGD1, ExtraSGD, PastExtraSGD, and their averaged variants) on the bilinear problem (29). Each method uses its respective optimal step-size found by grid-search.
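A small-scale sketch of this experiment follows. The reduced problem sizes, the shared offset vectors a and b (instead of per-sample ones), and the choice of AvgPastExtraSGD as solver are simplifications of ours, not the paper's exact setup:

```python
import numpy as np

# Minimal sketch of the finite-sum bilinear problem of Eq. (29) with a box
# constraint, solved with extrapolation from the past plus uniform averaging.
rng = np.random.default_rng(0)
n, d, eta, T = 50, 10, 0.05, 20000
M = rng.normal(size=(n, d, d)) / np.sqrt(d)
theta_star = rng.uniform(-1, 1, size=d)    # plant a solution inside [-1, 1]^d
phi_star = rng.uniform(-1, 1, size=d)
Mbar = M.mean(axis=0)
a = -Mbar @ phi_star                       # so that Mbar phi* + a = 0
b = -Mbar.T @ theta_star                   # so that Mbar^T theta* + b = 0

def oracle(w, rng):                        # unbiased estimate of F at w = (theta, phi)
    i = rng.integers(n)
    theta, phi = w[:d], w[d:]
    return np.concatenate([M[i] @ phi + a, -(M[i].T @ theta + b)])

def project(w):                            # P_Omega for Omega = [-1, 1]^{2d}
    return np.clip(w, -1.0, 1.0)

w = np.zeros(2 * d)
d_stored = np.zeros(2 * d)
w_avg = np.zeros(2 * d)
for t in range(T):
    w_half = project(w - eta * d_stored)   # extrapolation from the past
    d_stored = oracle(w_half, rng)
    w = project(w - eta * d_stored)
    w_avg += (w_half - w_avg) / (t + 1)    # uniform averaging
w_star = np.concatenate([theta_star, phi_star])
print(np.linalg.norm(w_avg - w_star))      # should be small, cf. Fig. 3
```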
7.2 WGAN AND WGAN-GP ON CIFAR10

We now evaluate the proposed techniques in the context of GAN training, which is a challenging stochastic optimization problem where the objectives of both players are non-convex. We propose to evaluate the Adam variants of the different optimization algorithms (see Alg. 4 for Adam with extrapolation) by training two different architectures on the CIFAR10 dataset (Krizhevsky and Hinton, 2009). First, we consider a constrained zero-sum game by training the DCGAN architecture (Radford et al., 2016) with the WGAN objective and weight clipping, as proposed in Arjovsky et al. (2017). Then we compare the different methods on a state-of-the-art architecture, by training a ResNet with the WGAN-GP objective, similar to Gulrajani et al. (2017).

⁸ In the original WGAN paper (Arjovsky et al., 2017) the authors use k = 5.

[Table 1: inception scores per method (rows: SimAdam, AltAdam5, ExtraAdam, PastExtraAdam, OptimAdam) and averaging scheme (columns: no averaging, uniform avg and EMA for WGAN (DCGAN); no averaging and uniform avg for WGAN-GP (ResNet)). Legible entries for WGAN (DCGAN) without averaging: SimAdam 6.05, ExtraAdam 6.38, PastExtraAdam 5.98, OptimAdam 5.74; the remaining entries are not legible in this transcription.]

Table 1: Best inception scores (averaged over 5 runs) achieved on CIFAR10 for every considered Adam variant. OptimAdam is the related Optimistic Adam (Daskalakis et al., 2018) algorithm. EMA denotes exponential moving average (with β = 0.999, see Eq. (8)). We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines.

Figure 4: Left: mean and standard deviation of the inception score computed over 5 runs for each method on WGAN trained on CIFAR10. To keep the graph readable we show only SimAdam, but AltAdam performs similarly. Middle: samples from a ResNet generator trained with the WGAN-GP objective using AvgExtraAdam. Right: WGAN-GP trained on CIFAR10; mean and standard deviation of the inception score computed over 5 runs for each method using the best performing learning rate, plotted over wall-clock time; all experiments were run on an NVIDIA Quadro GP100 GPU. We see that ExtraAdam converges faster than the Adam baselines.

Models are evaluated using the inception score (Salimans et al., 2016) computed on 10,000 samples. For each algorithm, we did an extensive search over the hyperparameters of Adam; we fixed β₁ = 0.5 and β₂ = 0.9 for all methods as they seemed to perform well. We want to emphasize that, as proposed by Heusel et al. (2017), it is quite important to set different learning rates for the generator and the discriminator. Experiments were run with 5 random seeds for 500,000 updates of the generator.

Table 1 reports the best inception score achieved on this problem by each considered method. We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines (see §G.4 for more experiments on averaging). Fig. 4 shows training curves for each method (for their best performing learning rate), as well as samples from a ResNet generator trained with ExtraAdam on a WGAN-GP objective. For both tasks, using an extrapolation step and averaging with Adam (ExtraAdam) outperformed all other methods. ExtraAdam with averaging is able to reach state-of-the-art results on CIFAR10, similar to Miyato et al. (2018). We also observed that methods based on extrapolation are less sensitive to the choice of learning rate and can be used with higher learning rates with less degradation; see §G.3 for more details.

8 CONCLUSION

We newly addressed GAN objectives in the framework of variational inequalities. We tapped into the optimization literature to provide more principled, sound techniques to optimize such games. We leveraged these techniques to develop practical optimization algorithms suitable for a wide range of GAN training objectives (including non-zero-sum games and projections onto constraints). We experimentally verified that this could yield better trained models, achieving, to our knowledge, the best inception score when optimizing a WGAN objective on the reference unmodified DCGAN architecture (Radford et al., 2016).
The presented techniques address a fundamental problem in GAN training in a principled way, and are orthogonal to the design of new GAN architectures and objectives. They are thus likely to be widely applicable, and to benefit future development of GANs.

ACKNOWLEDGMENTS

This research was partially supported by the Canada Excellence Research Chair in Data Science for Real-time Decision-making and by the NSERC Discovery Grant RGPIN. Gauthier Gidel would like to acknowledge Benoît Joly and Florestan Martin-Baillon for bringing a fresh point of view on the proof of Proposition 1.

REFERENCES

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
K. E. Atkinson. An Introduction to Numerical Analysis. John Wiley & Sons, 2003.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
R. E. Bruck. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 1977.
G. H. Chen and R. T. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 1997.
C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In COLT, 2012.
G. P. Crespi, A. Guerraggio, and M. Rocca. Minty variational inequality and optimization: scalar and vector case. In Generalized Convexity, Generalized Monotonicity and Applications, 2005.
C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In ICLR, 2018.
G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In AISTATS, 2017.
I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint, 2016.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR, 2018.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.
E. Hazan, K. Singh, and C. Zhang. Efficient regret minimization in non-convex games. In ICML, 2017.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
G. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, Canada, 2009.
T. Larsson and M. Patriksson. A class of gap functions for variational inequalities. Mathematical Programming, 1994.
C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
Y. Li, A. Schwing, K.-C. Wang, and R. Zemel. Dualing GANs. In NIPS, 2017.
P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint, 2018.
L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In NIPS, 2017.
A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 2009.
A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 2004.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.
Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 2007.
S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS, 2016.
B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 1963.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, 2013.
L. J. Ratliff, S. A. Burden, and S. S. Sastry. On the characterization of local Nash equilibria in continuous games. IEEE Transactions on Automatic Control, 2016.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
I. Sutskever. Training Recurrent Neural Networks. PhD thesis, University of Toronto, 2013.
P. Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 1995.
J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. In ICLR, 2018.
Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. arXiv preprint, 2018.
F. Yousefian, A. Nedić, and U. V. Shanbhag. Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems. In CDC. IEEE, 2014.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

A DEFINITIONS

In this section we recall usual definitions and lemmas from convex analysis. We start with the definitions and lemmas regarding the projection mapping.

A.1 PROJECTION MAPPING

Definition 1. The projection P_Ω onto Ω is defined as,

  P_Ω(ω′) ∈ argmin_{ω∈Ω} ‖ω − ω′‖².  (30)

When Ω is a convex set, this projection is unique. This is a consequence of the following lemma, which we will use in the following sections: the non-expansiveness of the projection onto a convex set.

Lemma 1. Let Ω be a convex set; the projection mapping P_Ω : ℝ^d → Ω is non-expansive, i.e.,

  ‖P_Ω(ω) − P_Ω(ω′)‖ ≤ ‖ω − ω′‖, ∀ω, ω′ ∈ ℝ^d.  (31)

This is a standard convex analysis result which can be found for instance in (Boyd and Vandenberghe, 2004). The following lemma is also standard in convex analysis and its proof uses similar arguments as the proof of Lemma 1.

Lemma 2. Let ω ∈ Ω and ω⁺ = P_Ω(ω + u); then, for all ω′ ∈ Ω, we have

  ‖ω⁺ − ω′‖² ≤ ‖ω − ω′‖² + 2uᵀ(ω⁺ − ω′) − ‖ω⁺ − ω‖².  (32)

Proof of Lemma 2. We start by simply developing,

  ‖ω⁺ − ω′‖² = ‖(ω⁺ − ω) + (ω − ω′)‖²
  = ‖ω − ω′‖² + 2(ω⁺ − ω)ᵀ(ω − ω′) + ‖ω⁺ − ω‖²
  = ‖ω − ω′‖² + 2(ω⁺ − ω)ᵀ(ω⁺ − ω′) − ‖ω⁺ − ω‖².

Then, since ω⁺ is the projection onto the convex set Ω of ω + u, we have (ω⁺ − (ω + u))ᵀ(ω⁺ − ω′) ≤ 0, ∀ω′ ∈ Ω, i.e. (ω⁺ − ω)ᵀ(ω⁺ − ω′) ≤ uᵀ(ω⁺ − ω′), leading to the result of the lemma.

A.2 SMOOTHNESS AND MONOTONICITY OF THE OPERATOR

Another important property used is the Lipschitzness of an operator.

Definition 2. A mapping F : ℝ^p → ℝ^d is said to be L-Lipschitz if

  ‖F(ω) − F(ω′)‖ ≤ L‖ω − ω′‖, ∀ω, ω′ ∈ Ω.  (33)

Definition 3. A differentiable function f : Ω → ℝ is said to be μ-strongly convex if

  f(ω) ≥ f(ω′) + ∇f(ω′)ᵀ(ω − ω′) + (μ/2)‖ω − ω′‖², ∀ω, ω′ ∈ Ω.  (34)

Definition 4. A function (θ, ϕ) ↦ L(θ, ϕ) is said to be convex-concave if L(·, ϕ) is convex for all ϕ ∈ Φ and L(θ, ·) is concave for all θ ∈ Θ. L is said to be μ-strongly convex-concave if (θ, ϕ) ↦ L(θ, ϕ) − (μ/2)‖θ‖² + (μ/2)‖ϕ‖² is convex-concave.

Definition 5. For μ > 0, an operator F : Ω → ℝ^d is said to be μ-strongly monotone if

  (F(ω) − F(ω′))ᵀ(ω − ω′) ≥ μ‖ω − ω′‖², ∀ω, ω′ ∈ Ω.  (35)
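As a quick sanity check of Lemma 1 for a concrete convex set, the following snippet verifies non-expansiveness of the projection onto a box (coordinate-wise clipping, the same constraint used for WGAN weight clipping):

```python
import numpy as np

# Numerical check of Lemma 1 (non-expansiveness of the projection) for the box
# Omega = [-1, 1]^d, whose projection is coordinate-wise clipping.
rng = np.random.default_rng(0)
proj = lambda w: np.clip(w, -1.0, 1.0)
for _ in range(10000):
    u, v = rng.normal(size=5) * 3, rng.normal(size=5) * 3
    assert np.linalg.norm(proj(u) - proj(v)) <= np.linalg.norm(u - v) + 1e-12
print("non-expansiveness holds on all sampled pairs")
```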

B GRADIENT METHODS ON UNCONSTRAINED BILINEAR GAMES

In this section we prove the results provided in §3, namely Proposition 1, Proposition 2 and Theorem 1. For Propositions 1 and 2, let us recall the context: we want to derive properties of some gradient methods on the following simple illustrative example,

  min_{θ∈ℝ} max_{ϕ∈ℝ} θ·ϕ.  (36)

B.1 PROOF OF PROPOSITION 1

Let us first recall the proposition:

Proposition 1. The simultaneous iterates diverge geometrically and the alternating iterates defined in (11) are bounded but do not converge to 0, as

  Simultaneous: θ²_{t+1} + ϕ²_{t+1} = (1 + η²)(θ²_t + ϕ²_t),  Alternating: θ²_t + ϕ²_t = Θ(θ²_0 + ϕ²_0),  (37)

where u_t = Θ(v_t) ⇔ ∃α, β > 0 : α v_t ≤ u_t ≤ β v_t. The uniform average (θ̄_t, ϕ̄_t) = (1/t) Σ_{s=0}^{t−1} (θ_s, ϕ_s) of the simultaneous updates (resp. the alternating updates) diverges (resp. converges to 0), as

  Simultaneous: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0)(1 + η²)^t / (η² t²) ),  Alternating: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0) / (η² t²) ).  (38)

Proof. Let us start with the simultaneous update rule:

  θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_t.  (39)

Then we have,

  θ²_{t+1} + ϕ²_{t+1} = (θ_t − ηϕ_t)² + (ϕ_t + ηθ_t)²  (40)
  = (1 + η²)(θ²_t + ϕ²_t).  (41)

The update rule (39) also gives us,

  ηϕ_t = θ_t − θ_{t+1} and ηθ_t = ϕ_{t+1} − ϕ_t.  (42)

Summing these equations for 0 ≤ t ≤ T − 1, we get

  η²T²(θ̄²_T + ϕ̄²_T) = (θ_0 − θ_T)² + (ϕ_T − ϕ_0)²  (43)
  = ((1 + η²)^T + 1)(θ²_0 + ϕ²_0) − 2(θ_0θ_T + ϕ_0ϕ_T)  (44)
  = Θ( (1 + η²)^T (θ²_0 + ϕ²_0) ),  (45)

where the last line uses the Cauchy-Schwarz inequality |θ_0θ_T + ϕ_0ϕ_T| ≤ (1 + η²)^{T/2} (θ²_0 + ϕ²_0) = o((1 + η²)^T).

Let us now continue with the alternating update rule,

  θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_{t+1} = ϕ_t + η(θ_t − ηϕ_t).  (46)

Then we have

  [θ_{t+1}; ϕ_{t+1}] = M [θ_t; ϕ_t], where M := [[1, −η], [η, 1 − η²]].  (47)

By simple linear algebra, for η < 2 the matrix M has two complex conjugate eigenvalues,

  λ_± = 1 − η²/2 ± (iη/2)√(4 − η²),  (48)

and their squared magnitude is equal to det(M) = 1 − η² + η² = 1. We can diagonalize M, meaning that there exists an invertible matrix P such that M = P diag(λ_+, λ_−) P⁻¹. Then we have

  [θ_t; ϕ_t] = M^t [θ_0; ϕ_0] = P diag(λ^t_+, λ^t_−) P⁻¹ [θ_0; ϕ_0],  (49)

and consequently, since |λ_±| = 1,

  θ²_t + ϕ²_t = ‖P diag(λ^t_+, λ^t_−) P⁻¹ [θ_0; ϕ_0]‖²_ℂ ≤ ‖P‖² ‖P⁻¹‖² (θ²_0 + ϕ²_0),  (50)

where ‖·‖_ℂ is the norm in ℂ² and ‖P‖ := max_{u∈ℂ², ‖u‖_ℂ≤1} ‖Pu‖_ℂ is the induced matrix norm. The same way, we have

  θ²_0 + ϕ²_0 = ‖P diag(λ^{−t}_+, λ^{−t}_−) P⁻¹ [θ_t; ϕ_t]‖²_ℂ ≤ ‖P‖² ‖P⁻¹‖² (θ²_t + ϕ²_t).  (51)

Hence, if θ²_0 + ϕ²_0 > 0, the sequence (θ_t, ϕ_t) is bounded but does not converge to 0. Moreover, the update rule gives us

  η Σ_{t=0}^{T−1} ϕ_t = θ_0 − θ_T and η Σ_{t=0}^{T−1} θ_{t+1} = ϕ_T − ϕ_0  (52)

(so the average of θ is taken over t = 1, …, T; this index shift does not affect the rate). Consequently, since θ²_T + ϕ²_T = O(θ²_0 + ϕ²_0), we get

  θ̄²_T + ϕ̄²_T = O( (θ²_0 + ϕ²_0) / (η² T²) ).  (53)

B.2 IMPLICIT AND EXTRAPOLATION METHODS

In this section we prove a slightly more precise version of Proposition 2:

Proposition 2 (precise version). The squared norm of the iterates, N_t = θ²_t + ϕ²_t, where the update rules of θ_t and ϕ_t are defined in (18), decreases geometrically for any 0 < η < 1, as

  Implicit: N_{t+1} = N_t / (1 + η²),  Extrapolation: N_{t+1} = (1 − η² + η⁴) N_t, ∀t ≥ 0.  (54)

Proof. Let us recall the update rule for the implicit method: θ_{t+1} = θ_t − ηϕ_{t+1}, ϕ_{t+1} = ϕ_t + ηθ_{t+1}. Then,

  (1 + η²) θ_{t+1} = θ_t − ηϕ_t, (1 + η²) ϕ_{t+1} = ϕ_t + ηθ_t,  (55)

implying that

  (1 + η²)² (θ²_{t+1} + ϕ²_{t+1}) = (θ_t − ηϕ_t)² + (ϕ_t + ηθ_t)²  (56)
  = (1 + η²)(θ²_t + ϕ²_t),  (57)

and hence

  θ²_{t+1} + ϕ²_{t+1} = (θ²_t + ϕ²_t) / (1 + η²).  (58)

For the extrapolation method we have the update rule

  θ_{t+1} = θ_t − η(ϕ_t + ηθ_t), ϕ_{t+1} = ϕ_t + η(θ_t − ηϕ_t),  (59)

implying that

  θ²_{t+1} + ϕ²_{t+1} = (θ_t − η(ϕ_t + ηθ_t))² + (ϕ_t + η(θ_t − ηϕ_t))²  (60)
  = θ²_t + ϕ²_t − 2η²(θ²_t + ϕ²_t) + η²((θ_t − ηϕ_t)² + (ϕ_t + ηθ_t)²)  (61)
  = (1 − η² + η⁴)(θ²_t + ϕ²_t).  (62)
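The two contraction factors of Proposition 2 can be checked numerically, using the closed form (55) for the implicit step:

```python
import numpy as np

# Numerical check of Proposition 2: per-step contraction of N_t = theta^2 + phi^2
# for the implicit method (factor 1/(1 + eta^2)) and extrapolation (1 - eta^2 + eta^4).
eta = 0.3
th, ph = 1.0, -2.0
for _ in range(5):
    N = th**2 + ph**2
    # Implicit step, solved in closed form via Eq. (55):
    th_i = (th - eta * ph) / (1 + eta**2)
    ph_i = (ph + eta * th) / (1 + eta**2)
    print((th_i**2 + ph_i**2) / N, 1 / (1 + eta**2))     # factors match
    # Extrapolation step, Eq. (18):
    th_e = th - eta * (ph + eta * th)
    ph_e = ph + eta * (th - eta * ph)
    print((th_e**2 + ph_e**2) / N, 1 - eta**2 + eta**4)  # factors match
    th, ph = th_e, ph_e
```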

B.3 EXTRAPOLATION FROM THE PAST

Let us recall what we call projected extrapolation from the past, where we use the notation ω′_t = ω_{t+1/2} for compactness:

  Extrapolation from the past: ω′_t = P_Ω[ω_t − ηF(ω′_{t−1})],  (63)
  Perform update step: ω_{t+1} = P_Ω[ω_t − ηF(ω′_t)], and store F(ω′_t),  (64)

where P_Ω[·] is the projection onto the constraint set Ω. Recall that an operator F : Ω → ℝ^d is said to be μ-strongly monotone if

  (F(ω) − F(ω′))ᵀ(ω − ω′) ≥ μ‖ω − ω′‖².  (65)

If F is strongly monotone, we can prove the following theorem:

Theorem 1. If F is μ-strongly monotone (see §A for the definition of strong monotonicity) and L-Lipschitz, then the updates (20) and (21) with η = 1/(4L) provide linearly converging iterates,

  ‖ω_t − ω*‖² ≤ (1 − μ/(4L))^t ‖ω_0 − ω*‖², ∀t ≥ 0.  (66)

Proof. In order to prove this theorem, we will prove the slightly more general result

  ‖ω_{t+1} − ω*‖² + (3/16)‖ω′_t − ω′_{t−1}‖² ≤ (1 − μ/(4L)) ( ‖ω_t − ω*‖² + (3/16)‖ω′_{t−1} − ω′_{t−2}‖² ),  (67)

implying, with the convention ω′_{−1} = ω′_{−2} = ω_0 (which makes the second term vanish at t = 0), that

  ‖ω_t − ω*‖² ≤ ‖ω_t − ω*‖² + (3/16)‖ω′_{t−1} − ω′_{t−2}‖² ≤ (1 − μ/(4L))^t ‖ω_0 − ω*‖².  (68)

Let us first prove three technical lemmas.

Lemma 3. If F is μ-strongly monotone, we have, for any solution ω* of (VIP),

  μ ( (1/2)‖ω_t − ω*‖² − ‖ω_t − ω′_t‖² ) ≤ F(ω′_t)ᵀ(ω′_t − ω*).  (69)

Proof. By the strong monotonicity of F and the optimality of ω*,

  μ‖ω′_t − ω*‖² ≤ F(ω*)ᵀ(ω′_t − ω*) + μ‖ω′_t − ω*‖² ≤ F(ω′_t)ᵀ(ω′_t − ω*),  (70)

and then we use the inequality (1/2)‖ω_t − ω*‖² − ‖ω_t − ω′_t‖² ≤ ‖ω′_t − ω*‖² (a consequence of ‖a + b‖² ≤ 2‖a‖² + 2‖b‖²) to get the claimed result.

Lemma 4. If F is L-Lipschitz, we have, for any ω ∈ Ω,

  2η_t F(ω′_t)ᵀ(ω′_t − ω) ≤ ‖ω_t − ω‖² − ‖ω_{t+1} − ω‖² + η²_t L² ‖ω′_t − ω′_{t−1}‖² − ‖ω′_t − ω_t‖².  (71)

Proof. Applying Lemma 2 with (ω, u, ω⁺, ω′) = (ω_t, −η_tF(ω′_t), ω_{t+1}, ω) and with (ω, u, ω⁺, ω′) = (ω_t, −η_tF(ω′_{t−1}), ω′_t, ω_{t+1}), we get

  ‖ω_{t+1} − ω‖² ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω_{t+1} − ω) − ‖ω_{t+1} − ω_t‖²  (72)

and

  ‖ω′_t − ω_{t+1}‖² ≤ ‖ω_t − ω_{t+1}‖² − 2η_t F(ω′_{t−1})ᵀ(ω′_t − ω_{t+1}) − ‖ω′_t − ω_t‖².  (73)

Summing (72) and (73), we get

  ‖ω_{t+1} − ω‖² ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω_{t+1} − ω) − 2η_t F(ω′_{t−1})ᵀ(ω′_t − ω_{t+1}) − ‖ω′_t − ω_t‖² − ‖ω′_t − ω_{t+1}‖²  (74, 75)
  = ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) − ‖ω′_t − ω_t‖² − ‖ω′_t − ω_{t+1}‖² − 2η_t (F(ω′_{t−1}) − F(ω′_t))ᵀ(ω′_t − ω_{t+1}).  (76)

Then, we can use Young's inequality, 2aᵀb ≤ ‖a‖² + ‖b‖², to get

  ‖ω_{t+1} − ω‖² ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) + η²_t ‖F(ω′_t) − F(ω′_{t−1})‖² + ‖ω′_t − ω_{t+1}‖² − ‖ω′_t − ω_t‖² − ‖ω′_t − ω_{t+1}‖²  (77)
  = ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) + η²_t ‖F(ω′_t) − F(ω′_{t−1})‖² − ‖ω′_t − ω_t‖²
  ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) + η²_t L² ‖ω′_t − ω′_{t−1}‖² − ‖ω′_t − ω_t‖².  (78)

Lemma 5. For all t ≥ 0, if we set ω′_{−1} = ω′_{−2} = ω_0 and use a constant step-size η, we have

  ‖ω′_t − ω′_{t−1}‖² ≤ 2‖ω′_t − ω_t‖² + 2η²L² ‖ω′_{t−1} − ω′_{t−2}‖².  (79)

Proof. We start with the triangle inequality,

  ‖ω′_t − ω′_{t−1}‖ ≤ ‖ω′_t − ω_t‖ + ‖ω_t − ω′_{t−1}‖.  (80)

Moreover, since ω_t and ω′_{t−1} are both projections of points built from ω_{t−1}, and since the projection is non-expansive, we have

  ‖ω_t − ω′_{t−1}‖ ≤ ‖(ω_{t−1} − ηF(ω′_{t−1})) − (ω_{t−1} − ηF(ω′_{t−2}))‖  (81)
  = η‖F(ω′_{t−1}) − F(ω′_{t−2})‖  (82)
  ≤ ηL‖ω′_{t−1} − ω′_{t−2}‖.  (83)

Combining (80) and (83) with (a + b)² ≤ 2a² + 2b², we get

  ‖ω′_t − ω′_{t−1}‖² ≤ 2‖ω′_t − ω_t‖² + 2‖ω_t − ω′_{t−1}‖²  (84, 85)
  ≤ 2‖ω′_t − ω_t‖² + 2η²L² ‖ω′_{t−1} − ω′_{t−2}‖².  (86)

Proof of Theorem 1. Let ω* ∈ Ω* be an optimal point of (VIP). Combining Lemma 3 (multiplied by 2η_t) and Lemma 4 with ω = ω*, we get

  2η_t μ ( (1/2)‖ω_t − ω*‖² − ‖ω_t − ω′_t‖² ) ≤ ‖ω_t − ω*‖² − ‖ω_{t+1} − ω*‖² + η²_t L² ‖ω′_t − ω′_{t−1}‖² − ‖ω′_t − ω_t‖²,

leading to

  ‖ω_{t+1} − ω*‖² ≤ (1 − η_t μ)‖ω_t − ω*‖² + η²_t L² ‖ω′_t − ω′_{t−1}‖² − (1 − 2η_t μ)‖ω′_t − ω_t‖².  (87)

Now set η_t = η = 1/(4L); since μ ≤ L, we have ημ ≤ 1/4 and η²L² = 1/16. Using Lemma 5 on the middle term gives

  ‖ω_{t+1} − ω*‖² ≤ (1 − ημ)‖ω_t − ω*‖² + (1/128)‖ω′_{t−1} − ω′_{t−2}‖² − (1 − 2ημ − 1/8)‖ω′_t − ω_t‖²
  ≤ (1 − ημ)‖ω_t − ω*‖² + (1/128)‖ω′_{t−1} − ω′_{t−2}‖² − (3/8)‖ω′_t − ω_t‖².  (88)

Adding (3/16)‖ω′_t − ω′_{t−1}‖² to both sides, bounding it with Lemma 5, (3/16)‖ω′_t − ω′_{t−1}‖² ≤ (3/8)‖ω′_t − ω_t‖² + (3/128)‖ω′_{t−1} − ω′_{t−2}‖², and using 1/128 + 3/128 = 1/32 ≤ (3/16)(1 − ημ), we get

  ‖ω_{t+1} − ω*‖² + (3/16)‖ω′_t − ω′_{t−1}‖² ≤ (1 − μ/(4L)) ( ‖ω_t − ω*‖² + (3/16)‖ω′_{t−1} − ω′_{t−2}‖² ),  (89)

which is (67) and concludes the proof.

C MORE ON MERIT FUNCTIONS

In this section, we present how to handle an unbounded constraint set Ω with a more refined merit function than (25), which was used in the main paper. Let F be the continuous operator and Ω the constraint set associated with the VIP

  find ω* ∈ Ω such that F(ω*)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω.  (VIP)

When the operator F is monotone, we have F(ω*)ᵀ(ω − ω*) ≤ F(ω)ᵀ(ω − ω*), ∀ω, ω*. Hence, in this case, (VIP) implies a stronger formulation sometimes called the Minty variational inequality (Crespi et al., 2005):

  find ω* ∈ Ω such that F(ω)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω.  (MVI)

This formulation is stronger in the sense that if (MVI) holds for some ω* ∈ Ω, then (VIP) holds too. A merit function useful for our analysis can be derived from this formulation. Roughly, a merit function is a convergence measure. More formally, a function g : Ω → ℝ is called a merit function if g is non-negative and such that g(ω) = 0 ⇔ ω ∈ Ω* (Larsson and Patriksson, 1994). A way to derive a merit function from (MVI) would be to use g(ω*) = sup_{ω∈Ω} F(ω)ᵀ(ω* − ω), which is zero if and only if (MVI) holds for ω*. To deal with unbounded constraint sets (leading to a potentially infinite-valued function outside of the optimal set), we use the restricted merit function (Nesterov, 2007):

  Err_R(ω_t) = max_{ω∈Ω, ‖ω−ω_0‖≤R} F(ω)ᵀ(ω_t − ω).  (90)

This function acts as a merit function for (VIP) on the interior of the open ball of radius R around ω_0, as shown in Lemma 1 of Nesterov (2007). That is, let Ω_R = Ω ∩ {ω : ‖ω − ω_0‖ < R}. Then, for any point ω̂ ∈ Ω_R, we have:

  Err_R(ω̂) = 0 ⇔ ω̂ ∈ Ω* ∩ Ω_R.  (91)

The reference point ω_0 is arbitrary, but in practice it is usually the initialization point of the algorithm. R has to be big enough to ensure that Ω_R contains a solution. Err_R measures how much (MVI) is violated on the restriction Ω_R. Such a merit function is standard in the variational inequality literature; a similar one is used in (Nemirovski, 2004; Juditsky et al., 2011). When F is derived from the gradients of a zero-sum game as in (93) below, we can define a more interpretable merit function. One has to be careful, though, when extending properties from the minimization setting to the saddle point setting (e.g. the merit function used by Yadav et al. (2018) is vacuous for a bilinear game, as explained in §C.2).

In the appendix, we adopt a set of assumptions a little more general than those of the main paper:

Assumption 4. F is a monotone operator, Ω is convex and closed, and R is set big enough such that R > ‖ω_0 − ω*‖ for some solution ω* ∈ Ω*.

Contrary to Assumption 3, in Assumption 4 the constraint set is no longer assumed to be bounded. Assumption 4 is more general than Assumption 3, since if Ω is bounded then Assumption 4 holds with R set to the diameter of Ω.

C.1 MORE GENERAL MERIT FUNCTIONS

In the appendix, we denote by Err_R^(VI) the restricted merit function defined in (90). Let us recall its definition:

  Err_R^(VI)(ω_t) = max_{ω∈Ω, ‖ω−ω_0‖≤R} F(ω)ᵀ(ω_t − ω).  (92)

When the objective is a saddle point problem, i.e.

  F(θ, ϕ) = [∇_θL(θ, ϕ); −∇_ϕL(θ, ϕ)],  (93)

and L is convex-concave (see Definition 4 in §A), we can use another merit function than (92) on Ω_R that is more interpretable and more directly related to the cost function of the minimax formulation:

  Err_R^(SP)(θ_t, ϕ_t) = max_{(θ,ϕ)∈Θ×Φ, ‖(θ,ϕ)−(θ_0,ϕ_0)‖≤R} L(θ_t, ϕ) − L(θ, ϕ_t).  (94)

In particular, if the equilibrium (θ*, ϕ*) ∈ Ω* ∩ Ω_R and L(·, ϕ*) and L(θ*, ·) are μ-strongly convex (resp. μ-strongly concave) (see §A), then the merit function for saddle points upper bounds the distance from (θ, ϕ) ∈ Ω_R to the equilibrium, as:

  Err_R^(SP)(θ, ϕ) ≥ (μ/2) ( ‖θ − θ*‖² + ‖ϕ − ϕ*‖² ).  (95)

In the appendix, we provide our convergence results with the merit functions (92) and (94), depending on the setup:

  Err_R(ω) = Err_R^(SP)(ω) if F is a saddle point operator (93), and Err_R(ω) = Err_R^(VI)(ω) otherwise.  (96)

C.2 ON THE IMPORTANCE OF THE MERIT FUNCTION

In this section, we illustrate the fact that one has to be careful when extending results and properties from the minimization setting to the minimax setting (and consequently to the variational inequality setting). Another candidate merit function for saddle point optimization would be to naturally extend the suboptimality f(ω) − f(ω*) used in standard minimization (i.e. find ω* the minimizer of f) to the gap P(θ, ϕ) = L(θ, ϕ*) − L(θ*, ϕ). In a previous analysis of a modification of the stochastic gradient descent (SGD) method for GANs, Yadav et al. (2018) gave their convergence rate on P, which they called the primal-dual gap. Unfortunately, if we do not assume that the function L is strongly convex-concave (a stronger assumption defined in §A, which fails for a bilinear objective, for example), P may not be a merit function: it can be 0 for a non-optimal point (see for instance the discussion on the differences between (94) and P in (Gidel et al., 2017, Section 3)). In particular, for the simple 2D bilinear example L(θ, ϕ) = θ·ϕ, we have θ* = ϕ* = 0 and thus P(θ, ϕ) = 0, ∀θ, ϕ.

C.3 VARIATIONAL INEQUALITIES FOR NON-CONVEX COST FUNCTIONS

When the cost functions defined in (3) are non-convex, the operator F is no longer monotone. Nevertheless, (VIP) and (MVI) can still be defined, though a solution to (MVI) is less likely to exist. We note that (VIP) is a local condition for F (as it only evaluates F at the point ω*). On the other hand, an appealing property of (MVI) is that it is a global condition. In the context of the minimization of a function f, for example (where F = ∇f), if ω* solves (MVI), then ω* is a global minimum of f, and not just a stationary point (see Proposition 2.2 from Crespi et al. (2005)).

A less restrictive way to consider variational inequalities in the non-monotone setting is to use a local version of (MVI). If the cost functions are locally convex around the optimal couple (θ*, ϕ*), and if our iterates eventually fall and stay in that neighborhood, then we can consider our restricted merit function Err_R(·) with a well-suited constant R and apply our convergence results for monotone operators.

D ANOTHER WAY OF IMPLEMENTING EXTRAPOLATION IN SGD

We now introduce another way to combine extrapolation and SGD. This extension is very similar to Alg. 2; the only difference is that it re-uses the minibatch sample of the extrapolation step for the update of the current point. The intuition is that this correlates the gradient estimator of the extrapolation step with that of the update step, leading to a better correction of the oscillations, which are also due to the stochasticity. One emerging issue (for the analysis) of this method is that, since ω′_t depends on ξ_t, the quantity F(ω′_t, ξ_t) is a biased estimator of F(ω′_t).

Algorithm 5 Re-used minibatches for stochastic extrapolation (ReExtraSGD)
1: Let $\omega_0 \in \Omega$
2: for $t = 0 \ldots T-1$ do
3:   Sample $\xi_t \sim P$
4:   $\omega'_t = P_\Omega[\omega_t - \eta_t F(\omega_t, \xi_t)]$   (extrapolation step)
5:   $\omega_{t+1} = P_\Omega[\omega_t - \eta_t F(\omega'_t, \xi_t)]$   (update step with the same sample)
6: end for
7: Return $\bar\omega_T = \sum_{t=0}^{T-1} \eta_t \omega'_t \big/ \sum_{t=0}^{T-1} \eta_t$

Theorem 5. Assume that $\|\omega'_t - \omega_0\| \le R$ for all $t \ge 0$, where $(\omega'_t)_{t \ge 0}$ are the iterates of Alg. 5. Under Assumptions 1 and 4, for any $T \ge 1$, Alg. 5 with constant step-size $\eta \le \frac{1}{2L}$ has the following convergence properties:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta\sigma^2}{2} + 2\eta(4L^2R^2 + \sigma^2)$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.

Particularly, $\eta \propto 1/\sqrt{T}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] = O(1/\sqrt{T})$.

The assumption that the sequence of iterates provided by the algorithm is bounded is strong, but it has also been made for instance in (Yadav et al., 2018). The proof of this result is provided in §F.

E VARIANCE COMPARISON BETWEEN AVGSGD AND SGD WITH PREDICTION METHOD

To compare the variance term of AvgSGD in (6) with the one of the SGD with prediction method (Yadav et al., 2018), we need to have the same convergence certificate. Fortunately, their proof can be adapted to our convergence criterion (using Lemma 6 in §F), revealing an extra $\sigma^2/2$ in the variance term from their paper. The resulting variance can be summarized with our notation as $(M^2(1+L)^2 + \sigma^2)/\sqrt{T}$, where $L$ is the Lipschitz constant of the operator $F$. Since $M^2 \ge \sigma^2$, their variance term is then roughly $(1+L)^2$ times larger than the one provided by the AvgSGD method.

F PROOF OF THE THEOREMS

This section is dedicated to the proofs of the theorems provided in this paper, in a slightly more general form working with the merit function defined in (96). First we prove an additional lemma necessary for the proofs of our theorems.

Lemma 6. Let $F$ be a monotone operator and let $(\omega_t)$, $(\omega'_t)$, $(z_t)$, $(\Delta_t)$, $(\xi_t)$ and $(\zeta_t)$ be six random sequences such that, for all $\omega \in \Omega$ with $\|\omega - \omega_0\| \le R$ and all $t \ge 0$,

$\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_t - N_{t+1} + \frac{\eta_t^2}{2}\big(M_1(\omega_t, \xi_t) + M_2(\omega'_t, \zeta_t)\big) + \eta_t \Delta_t^\top(z_t - \omega)$,

where $N_t = N(\omega_t, \omega'_t, \omega'_{t-1}) \ge 0$ and we extend $(\omega'_t)$ with $\omega'_{-1} = \omega'_{-2} = \omega_0$. Assume moreover that $N_0 \le R^2/2$, $\mathbb{E}[\|\Delta_t\|^2] \le \sigma^2$, $\mathbb{E}[\Delta_t \mid z_t, \Delta_0, \ldots, \Delta_{t-1}] = 0$, $\mathbb{E}[M_1(\omega_t, \xi_t)] \le M_1$ and $\mathbb{E}[M_2(\omega'_t, \zeta_t)] \le M_2$. Then,

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (M_1 + M_2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$,  (97)

where $\bar\omega_T = \sum_{t=0}^{T-1}\eta_t\omega'_t / S_T$ and $S_T = \sum_{t=0}^{T-1}\eta_t$.

Proof of Lemma 6. We sum the assumed inequality for $0 \le t \le T-1$ to get

$\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_0 + \sum_{t=0}^{T-1}\Big[\frac{\eta_t^2}{2}\big(M_1(\omega_t,\xi_t) + M_2(\omega'_t,\zeta_t)\big) + \eta_t \Delta_t^\top(z_t - \omega)\Big]$.  (98)

We will then upper bound the last sum on the right-hand side. Write $\Delta_t^\top(z_t - \omega) = \Delta_t^\top(z_t - u_t) + \Delta_t^\top(u_t - \omega)$, where $u_{t+1} = P_\Omega(u_t - \eta_t \Delta_t)$ and $u_0 = \omega_0$. Then,

$\|u_{t+1} - \omega\|^2 \le \|u_t - \omega\|^2 - 2\eta_t \Delta_t^\top(u_t - \omega) + \eta_t^2\|\Delta_t\|^2$,  (99)

leading to

$\eta_t \Delta_t^\top(z_t - \omega) \le \eta_t \Delta_t^\top(z_t - u_t) + \frac{1}{2}\big(\|u_t - \omega\|^2 - \|u_{t+1} - \omega\|^2\big) + \frac{\eta_t^2}{2}\|\Delta_t\|^2$.  (100)

Then, noticing that $u_0 = \omega_0$ and plugging the telescoping sum back into (98), we get

$\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_0 + \frac{\|\omega_0 - \omega\|^2}{2} + \sum_{t=0}^{T-1}\Big[\frac{\eta_t^2}{2}\big(M_1(\omega_t,\xi_t) + M_2(\omega'_t,\zeta_t) + \|\Delta_t\|^2\big) + \eta_t \Delta_t^\top(z_t - u_t)\Big]$.

If $F$ is the operator of a convex-concave saddle point (5), we get, with $\omega'_t = (\theta_t, \varphi_t)$,

$F(\omega'_t)^\top(\omega'_t - \omega) = \nabla_\theta \mathcal{L}(\theta_t, \varphi_t)^\top(\theta_t - \theta) - \nabla_\varphi \mathcal{L}(\theta_t, \varphi_t)^\top(\varphi_t - \varphi)$
$\ge \mathcal{L}(\theta_t, \varphi) - \mathcal{L}(\theta_t, \varphi_t) + \mathcal{L}(\theta_t, \varphi_t) - \mathcal{L}(\theta, \varphi_t)$  (by convexity and concavity)
$= \mathcal{L}(\theta_t, \varphi) - \mathcal{L}(\theta, \varphi_t)$.

Then, by convexity of $\mathcal{L}(\cdot, \varphi)$ and concavity of $\mathcal{L}(\theta, \cdot)$, we have

$\frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \ge \frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t\big(\mathcal{L}(\theta_t, \varphi) - \mathcal{L}(\theta, \varphi_t)\big) \ge \mathcal{L}(\bar\theta_T, \varphi) - \mathcal{L}(\theta, \bar\varphi_T)$.  (101)

Otherwise, if the operator $F$ is only monotone, since $F(\omega'_t)^\top(\omega'_t - \omega) \ge F(\omega)^\top(\omega'_t - \omega)$ we have

$\frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \ge \frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t F(\omega)^\top(\omega'_t - \omega) = F(\omega)^\top(\bar\omega_T - \omega)$.  (102)

In both cases, using $N_0 \le R^2/2$ and $\|\omega_0 - \omega\| \le R$, the remaining right-hand side no longer depends on $\omega$, so we can maximize the left-hand side with respect to $\omega$ to get

$S_T\,\mathrm{Err}_R(\bar\omega_T) \le R^2 + \sum_{t=0}^{T-1}\Big[\frac{\eta_t^2}{2}\big(M_1(\omega_t,\xi_t) + M_2(\omega'_t,\zeta_t) + \|\Delta_t\|^2\big) + \eta_t \Delta_t^\top(z_t - u_t)\Big]$.  (103)

Then, taking the expectation, since $\mathbb{E}[\Delta_t \mid z_t, u_t] = \mathbb{E}[\Delta_t \mid z_t, \Delta_0, \ldots, \Delta_{t-1}] = 0$ ($u_t$ is a function of $\Delta_0, \ldots, \Delta_{t-1}$), $\mathbb{E}[\|\Delta_t\|^2] \le \sigma^2$, $\mathbb{E}[M_1(\omega_t, \xi_t)] \le M_1$ and $\mathbb{E}[M_2(\omega'_t, \zeta_t)] \le M_2$, we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (M_1 + M_2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (104)

F.1 PROOF OF THM. 2

First let us state Theorem 2 in its general form.

Theorem 2. Under Assumptions 1, 2 and 4, Alg. 1 with constant step-size $\eta$ has the following convergence rate for all $T \ge 1$:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta(M^2 + \sigma^2)}{2}$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega_t$.  (105)

Particularly, $\eta = R\sqrt{\frac{2}{(M^2+\sigma^2)T}}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le R\sqrt{\frac{2(M^2+\sigma^2)}{T}}$.

Proof of Theorem 2. Let $\omega \in \Omega$ be any point such that $\|\omega_0 - \omega\| \le R$. Then,

$\|\omega_{t+1} - \omega\|^2 = \|P_\Omega(\omega_t - \eta_t F(\omega_t, \xi_t)) - \omega\|^2$
$\le \|\omega_t - \eta_t F(\omega_t, \xi_t) - \omega\|^2$  (projections onto a convex set are non-expansive, Lemma 1)
$= \|\omega_t - \omega\|^2 - 2\eta_t F(\omega_t, \xi_t)^\top(\omega_t - \omega) + \eta_t^2\|F(\omega_t, \xi_t)\|^2$.

Then we can make the quantity $F(\omega_t)^\top(\omega_t - \omega)$ appear on the left-hand side:

$2\eta_t F(\omega_t)^\top(\omega_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 + \eta_t^2\|F(\omega_t, \xi_t)\|^2 + 2\eta_t\big(F(\omega_t) - F(\omega_t, \xi_t)\big)^\top(\omega_t - \omega)$.  (106)

We can sum (106) for $0 \le t \le T-1$ to get

$2\sum_{t=0}^{T-1}\eta_t F(\omega_t)^\top(\omega_t - \omega) \le \sum_{t=0}^{T-1}\big(\|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2\big) + \sum_{t=0}^{T-1}\big[\eta_t^2\|F(\omega_t,\xi_t)\|^2 + 2\eta_t \Delta_t^\top(\omega_t - \omega)\big]$,  (107)

where we noted $\Delta_t = F(\omega_t) - F(\omega_t, \xi_t)$. By monotonicity, $F(\omega_t)^\top(\omega_t - \omega) \ge F(\omega)^\top(\omega_t - \omega)$, so we get

$2 S_T F(\omega)^\top(\bar\omega_T - \omega) \le \sum_{t=0}^{T-1}\big(\|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2\big) + \sum_{t=0}^{T-1}\big[\eta_t^2\|F(\omega_t,\xi_t)\|^2 + 2\eta_t \Delta_t^\top(\omega_t - \omega)\big]$,

where $S_T = \sum_{t=0}^{T-1}\eta_t$ and $\bar\omega_T = \frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t\omega_t$. We will then upper bound the last sum on the right-hand side: $\Delta_t^\top(\omega_t - \omega) = \Delta_t^\top(\omega_t - u_t) + \Delta_t^\top(u_t - \omega)$, where $u_{t+1} = P_\Omega(u_t - \eta_t \Delta_t)$ and $u_0 = \omega_0$. Then,

$\|u_{t+1} - \omega\|^2 \le \|u_t - \omega\|^2 - 2\eta_t \Delta_t^\top(u_t - \omega) + \eta_t^2\|\Delta_t\|^2$,  (108)

leading to

$2\eta_t \Delta_t^\top(\omega_t - \omega) \le 2\eta_t \Delta_t^\top(\omega_t - u_t) + \|u_t - \omega\|^2 - \|u_{t+1} - \omega\|^2 + \eta_t^2\|\Delta_t\|^2$.  (109)

Then, noticing that $u_0 = \omega_0$ and telescoping back into (107), we get

$2 S_T F(\omega)^\top(\bar\omega_T - \omega) \le 2\|\omega_0 - \omega\|^2 + \sum_{t=0}^{T-1}\big[\eta_t^2(\|F(\omega_t,\xi_t)\|^2 + \|\Delta_t\|^2) + 2\eta_t \Delta_t^\top(\omega_t - u_t)\big]$
$\le 2R^2 + \sum_{t=0}^{T-1}\big[\eta_t^2(\|F(\omega_t,\xi_t)\|^2 + \|\Delta_t\|^2) + 2\eta_t \Delta_t^\top(\omega_t - u_t)\big]$.

Since the right-hand side does not depend on $\omega$, we can maximize over $\omega$ to get

$2 S_T\,\mathrm{Err}_R(\bar\omega_T) \le 2R^2 + \sum_{t=0}^{T-1}\big[\eta_t^2(\|F(\omega_t,\xi_t)\|^2 + \|\Delta_t\|^2) + 2\eta_t \Delta_t^\top(\omega_t - u_t)\big]$.  (110)

Noticing that $\mathbb{E}[\Delta_t \mid \omega_t, u_t] = 0$ (the estimates of $F$ are unbiased), that $\mathbb{E}[\|F(\omega_t, \xi_t)\|^2] \le M^2$ by Assumption 2, and that $\mathbb{E}[\|\Delta_t\|^2] \le \sigma^2$ by Assumption 1, we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (M^2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (111)

Particularly, for $\eta_t = \eta$ and $\eta_t = \frac{\eta}{\sqrt{t+1}}$ we respectively get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta(M^2 + \sigma^2)}{2}$  (112)

and

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2/\eta + \eta(M^2 + \sigma^2)(1 + \ln(T+1))}{4(\sqrt{T+1} - 1)}$.  (113)

F.2 PROOF OF THM. 3

Theorem 3. Under Assumptions 1 and 4, if $\mathbb{E}_\xi[F]$ is $L$-Lipschitz, then Alg. 2 with a constant step-size $\eta \le \frac{1}{\sqrt{3}L}$ has the following convergence rate for any $T \ge 1$:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{7\eta\sigma^2}{2}$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.  (114)

Particularly, $\eta = \frac{R}{\sigma}\sqrt{\frac{2}{7T}}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \sqrt{14}\,\frac{R\sigma}{\sqrt{T}}$.

Proof of Thm. 3. Let $\omega \in \Omega$ be any point such that $\|\omega_0 - \omega\| \le R$. The update rules are $\omega_{t+1} = P_\Omega(\omega_t - \eta_t F(\omega'_t, \zeta_t))$ and $\omega'_t = P_\Omega(\omega_t - \eta_t F(\omega_t, \xi_t))$. We start by applying Lemma 2 for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega'_t, \zeta_t), \omega, \omega_{t+1})$ and for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega_t, \xi_t), \omega_{t+1}, \omega'_t)$:

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega_{t+1} - \omega) - \|\omega_{t+1} - \omega_t\|^2$
$\|\omega'_t - \omega_{t+1}\|^2 \le \|\omega_t - \omega_{t+1}\|^2 - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2$.

Then, summing them, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega_{t+1} - \omega) - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$,  (115)

leading to

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) + 2\eta_t\big(F(\omega'_t, \zeta_t) - F(\omega_t, \xi_t)\big)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$.

Then, with $2a^\top b \le \|a\|^2 + \|b\|^2$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 + \eta_t^2\|F(\omega'_t, \zeta_t) - F(\omega_t, \xi_t)\|^2$.

Using the inequality $\|a + b + c\|^2 \le 3(\|a\|^2 + \|b\|^2 + \|c\|^2)$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 + 3\eta_t^2\big(\|F(\omega_t) - F(\omega_t, \xi_t)\|^2 + \|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2 + \|F(\omega'_t) - F(\omega_t)\|^2\big)$.

Then we can use the $L$-Lipschitzness of $F$ to get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 + 3\eta_t^2\big(\|F(\omega_t) - F(\omega_t, \xi_t)\|^2 + \|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2 + L^2\|\omega'_t - \omega_t\|^2\big)$.

As we restricted the step-size to $\eta_t \le \frac{1}{\sqrt{3}L}$, we have $3\eta_t^2 L^2 \le 1$ and hence

$\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le \frac{1}{2}\big(\|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2\big) + \eta_t\big(F(\omega'_t) - F(\omega'_t, \zeta_t)\big)^\top(\omega'_t - \omega) + \frac{3\eta_t^2}{2}\|F(\omega_t) - F(\omega_t, \xi_t)\|^2 + \frac{3\eta_t^2}{2}\|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2$.

We get a particular case of the hypothesis of Lemma 6, with $N_t = \frac{1}{2}\|\omega_t - \omega\|^2$, $M_1(\omega_t, \xi_t) = 3\|F(\omega_t) - F(\omega_t, \xi_t)\|^2$, $M_2(\omega'_t, \zeta_t) = 3\|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2$, $\Delta_t = F(\omega'_t) - F(\omega'_t, \zeta_t)$ and $z_t = \omega'_t$. By Assumption 1, $M_1 = M_2 = 3\sigma^2$, and by the fact that

$\mathbb{E}[F(\omega'_t) - F(\omega'_t, \zeta_t) \mid \omega'_t, \Delta_0, \ldots, \Delta_{t-1}] = \mathbb{E}\big[\mathbb{E}[F(\omega'_t) - F(\omega'_t, \zeta_t) \mid \omega'_t] \mid \Delta_0, \ldots, \Delta_{t-1}\big] = 0$,

the hypotheses of Lemma 6 hold and we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + 7\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (116)

F.3 PROOF OF THM. 4

Theorem 4. Under Assumption 1, if $\mathbb{E}_\xi[F]$ is $L$-Lipschitz, then AvgPastExtraSGD (Alg. 3) with a constant step-size $\eta \le \frac{1}{3L}$ has the following convergence rate for any $T \ge 1$:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{13\eta\sigma^2}{2}$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.  (117)

Particularly, $\eta = \frac{R}{\sigma}\sqrt{\frac{2}{13T}}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \sqrt{26}\,\frac{R\sigma}{\sqrt{T}}$.

First let us recall the update rule:

$\omega'_t = P_\Omega[\omega_t - \eta_t F(\omega'_{t-1}, \xi_{t-1})]$, $\quad \omega_{t+1} = P_\Omega[\omega_t - \eta_t F(\omega'_t, \xi_t)]$.  (118)

Lemma 7. We have, for any $\omega \in \Omega$,

$2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 - \|\omega'_t - \omega_t\|^2 + 3\eta_t^2 L^2\|\omega'_t - \omega'_{t-1}\|^2 + 3\eta_t^2\big[\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big]$.  (119)

Proof. Applying Lemma 2 for $(\omega, u, \omega^+, \omega') = (\omega_t, \eta_t F(\omega'_t, \xi_t), \omega_{t+1}, \omega)$ and for $(\omega, u, \omega^+, \omega') = (\omega_t, \eta_t F(\omega'_{t-1}, \xi_{t-1}), \omega'_t, \omega_{t+1})$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega_{t+1} - \omega) - \|\omega_{t+1} - \omega_t\|^2$  (120)

and

$\|\omega'_t - \omega_{t+1}\|^2 \le \|\omega_t - \omega_{t+1}\|^2 - 2\eta_t F(\omega'_{t-1}, \xi_{t-1})^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2$.  (121)

Summing (120) and (121), we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2 - 2\eta_t\big(F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_t, \xi_t)\big)^\top(\omega'_t - \omega_{t+1})$.  (125)

Then we can use the inequality of arithmetic and geometric means, $2a^\top b \le \|a\|^2 + \|b\|^2$, to get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + \eta_t^2\|F(\omega'_t, \xi_t) - F(\omega'_{t-1}, \xi_{t-1})\|^2 - \|\omega'_t - \omega_t\|^2$.  (128)

Using the inequality $\|a + b + c\|^2 \le 3(\|a\|^2 + \|b\|^2 + \|c\|^2)$, we get

$\|F(\omega'_t, \xi_t) - F(\omega'_{t-1}, \xi_{t-1})\|^2 \le 3\big(\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + \|F(\omega'_t) - F(\omega'_{t-1})\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big)$  (129)

$\le 3\big(\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + L^2\|\omega'_t - \omega'_{t-1}\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big)$,  (130)

where we used the $L$-Lipschitzness of $F$ for the last inequality. Combining (128) with (130), we get (119). □

Lemma 8. For all $t \ge 0$, if we set $\omega'_{-1} = \omega'_{-2} = \omega_0$, we have

$\|\omega'_t - \omega'_{t-1}\|^2 \le 2\|\omega'_t - \omega_t\|^2 + 6\eta_{t-1}^2\big(\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-1})\|^2 + L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2 + \|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2\big)$.  (132)

Proof. We start with $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$:

$\|\omega'_t - \omega'_{t-1}\|^2 \le 2\|\omega'_t - \omega_t\|^2 + 2\|\omega_t - \omega'_{t-1}\|^2$.  (133)

Moreover, since the projection is non-expansive, we have

$\|\omega_t - \omega'_{t-1}\|^2 \le \|(\omega_{t-1} - \eta_{t-1}F(\omega'_{t-1}, \xi_{t-1})) - (\omega_{t-1} - \eta_{t-1}F(\omega'_{t-2}, \xi_{t-2}))\|^2 = \eta_{t-1}^2\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-2}, \xi_{t-2})\|^2$  (135)

$\le 3\eta_{t-1}^2\big(\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-1})\|^2 + L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2 + \|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2\big)$,  (136)

where in the last line we used the same inequality as in (130). Combining (133) and (136) gives (132). □

Proof of Theorem 4. Combining Lemma 7 and Lemma 8, we get

$2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 - (1 - 6\eta_t^2 L^2)\|\omega'_t - \omega_t\|^2 + 18\eta_t^2\eta_{t-1}^2 L^2\big(\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-1})\|^2 + L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2 + \|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2\big) + 3\eta_t^2\big[\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big]$.  (140)

Then, for $\eta_t \le \frac{1}{3L}$, choosing

$N_t = \frac{1}{2}\|\omega_t - \omega\|^2 + 3\eta_t^2 L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2$

and adding $\eta_t\big(F(\omega'_t) - F(\omega'_t, \xi_t)\big)^\top(\omega'_t - \omega)$ to both sides of (140), one checks (using Lemma 8 once more to bound $\|\omega'_t - \omega'_{t-1}\|^2$) that all the $\|\omega'_t - \omega_t\|^2$ terms can be absorbed, yielding an inequality of the form required by Lemma 6:

$\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_t - N_{t+1} + \frac{\eta_t^2}{2}M_2(\omega'_t, \xi_t) + \eta_t \Delta_t^\top(z_t - \omega)$,  (141)

where $M_1(\omega_t, \xi_t) = 0$,

$M_2(\omega'_t, \xi_t) = 3\|F(\omega'_t) - F(\omega'_t, \xi_t)\|^2 + 6\|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2 + 3\|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2$,

$\Delta_t = F(\omega'_t) - F(\omega'_t, \xi_t)$ and $z_t = \omega'_t$. By Assumption 1, $M_2 = 12\sigma^2$, and by the fact that

$\mathbb{E}[F(\omega'_t) - F(\omega'_t, \xi_t) \mid \omega'_t, \Delta_0, \ldots, \Delta_{t-1}] = \mathbb{E}\big[\mathbb{E}[F(\omega'_t) - F(\omega'_t, \xi_t) \mid \omega'_t] \mid \Delta_0, \ldots, \Delta_{t-1}\big] = 0$,

the hypotheses of Lemma 6 hold and we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + 13\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (142)

F.4 PROOF OF THEOREM 5

Theorem 5 was introduced in §D. It concerns Algorithm 5, which is another way to implement extrapolation in SGD. Let us first restate this theorem.

Theorem 5. Assume that $\|\omega'_t - \omega_0\| \le R$ for all $t \ge 0$, where $(\omega'_t)_{t\ge 0}$ are the iterates of Alg. 5. Under Assumptions 1 and 4, for any $T \ge 1$, Alg. 5 with constant step-size $\eta \le \frac{1}{2L}$ has the following convergence properties:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta\sigma^2}{2} + 2\eta(4L^2R^2 + \sigma^2)$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.

Particularly, $\eta \propto 1/\sqrt{T}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] = O(1/\sqrt{T})$.

Proof of Thm. 5. Let $\omega \in \Omega$ be any point such that $\|\omega_0 - \omega\| \le R$. The update rules are $\omega_{t+1} = P_\Omega(\omega_t - \eta_t F(\omega'_t, \xi_t))$ and $\omega'_t = P_\Omega(\omega_t - \eta_t F(\omega_t, \xi_t))$. We start in the same way as the proof of Thm. 3, by applying Lemma 2 for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega'_t, \xi_t), \omega, \omega_{t+1})$ and for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega_t, \xi_t), \omega_{t+1}, \omega'_t)$:

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega_{t+1} - \omega) - \|\omega_{t+1} - \omega_t\|^2$
$\|\omega'_t - \omega_{t+1}\|^2 \le \|\omega_t - \omega_{t+1}\|^2 - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2$.

Then, summing them, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega_{t+1} - \omega) - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$,  (143)

leading to

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + 2\eta_t\big(F(\omega'_t, \xi_t) - F(\omega_t, \xi_t)\big)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$.

Then, with $2a^\top b \le \|a\|^2 + \|b\|^2$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + \eta_t^2\|F(\omega'_t, \xi_t) - F(\omega_t, \xi_t)\|^2 - \|\omega'_t - \omega_t\|^2$.

Using the Lipschitzness assumption (for each sample, $\omega \mapsto F(\omega, \xi)$ is $L$-Lipschitz), we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + (\eta_t^2 L^2 - 1)\|\omega'_t - \omega_t\|^2$.

Then we add $2\eta_t F(\omega'_t)^\top(\omega'_t - \omega)$ to both sides to get

$2\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 - 2\eta_t\big(F(\omega'_t, \xi_t) - F(\omega'_t)\big)^\top(\omega'_t - \omega) + (\eta_t^2 L^2 - 1)\|\omega'_t - \omega_t\|^2$.  (144)

Here, unfortunately, we cannot use Lemma 6 directly, because $F(\omega'_t, \xi_t)$ is biased. We will therefore deal with the quantity $A_t = \big(F(\omega'_t, \xi_t) - F(\omega'_t)\big)^\top(\omega - \omega'_t)$. We have

$A_t = \big(F(\omega'_t, \xi_t) - F(\omega_t, \xi_t)\big)^\top(\omega - \omega'_t) + \big(F(\omega_t) - F(\omega'_t)\big)^\top(\omega - \omega'_t) + \big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega_t - \omega'_t) + \big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega - \omega_t)$

$\le 2L\|\omega'_t - \omega_t\|\,\|\omega'_t - \omega\| + \|F(\omega_t, \xi_t) - F(\omega_t)\|\,\|\omega_t - \omega'_t\| + \big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega - \omega_t)$

(using Cauchy-Schwarz and the $L$-Lipschitzness of $F$). Then, applying the inequality $ab \le \frac{\delta}{2}a^2 + \frac{1}{2\delta}b^2$ with suitable choices of $\delta$ to the first two products, and using $\|\omega'_t - \omega\| \le \|\omega'_t - \omega_0\| + \|\omega_0 - \omega\| \le 2R$ (the assumption of the theorem) together with $\eta_t \le \frac{1}{2L}$, all the $\|\omega'_t - \omega_t\|^2$ terms in (144) can be absorbed and we obtain

$2\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 + 2\eta_t\big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega - \omega_t) + 4\eta_t^2\big(4L^2R^2 + \|F(\omega_t, \xi_t) - F(\omega_t)\|^2\big)$.

Once again, this inequality is a particular case of the hypothesis of Lemma 6, with $N_t = \frac{1}{2}\|\omega_t - \omega\|^2$, $M_1(\omega_t, \xi_t) = 4\big(4L^2R^2 + \|F(\omega_t, \xi_t) - F(\omega_t)\|^2\big)$, $M_2(\omega'_t, \zeta_t) = 0$, $z_t = \omega_t$ and $\Delta_t = F(\omega_t) - F(\omega_t, \xi_t)$. By Assumption 1, $\mathbb{E}[M_1(\omega_t, \xi_t)] \le 16L^2R^2 + 4\sigma^2$, and $\mathbb{E}[\Delta_t \mid \omega_t, \Delta_0, \ldots, \Delta_{t-1}] = \mathbb{E}\big[\mathbb{E}[\Delta_t \mid \omega_t] \mid \Delta_0, \ldots, \Delta_{t-1}\big] = 0$, so we can use Lemma 6 and get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (16L^2R^2 + 5\sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (145)
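Before moving to the additional experiments, the guarantees above can be sanity-checked numerically; the snippet below (ours, with an illustrative noise level and step-size, not from the paper) runs simultaneous SGD and ExtraSGD on the noisy 2D bilinear game of §C.2 and compares the last SGD iterate with the uniform average of the extrapolated iterates.

```python
import numpy as np

# Quick numerical sanity check (our sketch): on the noisy bilinear game
# L(theta, phi) = theta * phi, plain simultaneous SGD spirals outward,
# while the uniform average of ExtraSGD's extrapolated iterates
# approaches the equilibrium (0, 0), as the averaged guarantees predict.

rng = np.random.default_rng(0)
F = lambda w: np.array([w[1], -w[0]])            # simultaneous gradient field
noisy_F = lambda w: F(w) + 0.1 * rng.standard_normal(2)

eta, T = 0.1, 5000
w_sgd = np.array([1.0, 1.0])
w_ext = np.array([1.0, 1.0])
avg = np.zeros(2)
for t in range(T):
    w_sgd = w_sgd - eta * noisy_F(w_sgd)         # simultaneous SGD step
    w_half = w_ext - eta * noisy_F(w_ext)        # extrapolation step
    w_ext = w_ext - eta * noisy_F(w_half)        # update step (fresh sample)
    avg += w_half / T                            # uniform average of w'_t
print(np.linalg.norm(w_sgd))   # large: SGD diverges on this game
print(np.linalg.norm(avg))     # small: averaged ExtraSGD converges
```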

G ADDITIONAL EXPERIMENTAL RESULTS

G.1 TOY NON-CONVEX GAN (2D AND DETERMINISTIC)

We now consider a task similar to (Mescheder et al., 2018), where the discriminator is linear, $D_\varphi(\omega) = \varphi^\top\omega$, the generator is a Dirac distribution at $\theta$, $q_\theta = \delta_\theta$, and the distribution we try to match is also a Dirac, $p = \delta_{\omega^*}$. The minimax formulation from Goodfellow et al. (2014) gives:

$\min_\theta \max_\varphi \; -\log\big(1 + e^{-\varphi^\top\omega^*}\big) - \log\big(1 + e^{\varphi^\top\theta}\big)$.  (146)

Note that, as observed by Nagarajan and Kolter (2017), this objective is concave-concave, making it hard to optimize. We compare the methods on this objective where we take $\omega^* = 2$; the position of the equilibrium is thus shifted to $(\theta, \varphi) = (2, 0)$. The convergence and the gradient vector field are shown in Figure 5. We observe that, depending on the initialization, some methods can fail to converge, but extrapolation (18) seems to perform better than the rest of the methods.

Figure 5: Comparison of five algorithms (described in Section 3) on the non-convex GAN objective (146), using the optimal step-size for each method. Left: the gradient vector field and the dynamics of the different methods. Right: the distance to the optimum as a function of the number of iterations for each method.
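For reference, here is a small sketch (ours, not the paper's code) of the objective (146) with $\omega^* = 2$: it implements the simultaneous-gradient field plotted in Figure 5 and runs the extrapolation method on it. The step-size and initialization are illustrative choices for stable convergence, not the figure's exact settings.

```python
import numpy as np

# Sketch of the toy objective (146) with omega_star = 2: the operator
# F = (dL/dtheta, -dL/dphi) defines the simultaneous vector field, and
# extrapolation is run on it. Step-size / init are illustrative; as the
# text notes, convergence on this concave-concave game depends on them.

omega_star = 2.0
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def F(w):
    theta, phi = w
    d_theta = -phi * sig(phi * theta)          # dL/dtheta (theta descends)
    d_phi = omega_star * sig(-phi * omega_star) - theta * sig(phi * theta)
    return np.array([d_theta, -d_phi])         # phi ascends, hence -dL/dphi

w, eta = np.array([1.0, 0.5]), 0.2
for _ in range(5000):
    w_half = w - eta * F(w)                    # extrapolation step
    w = w - eta * F(w_half)                    # update step
print(w)                                       # should approach (2, 0)
```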

G.2 DCGAN WITH WGAN-GP OBJECTIVE

We also trained the DCGAN architecture of the WGAN experiments, but this time with the WGAN-GP objective. The results are shown in Table 2. The best results are achieved with uniform averaging of AltAdam5; however, its iterations require updating the discriminator 5 times for every generator update. With a small drop in best final score, ExtraAdam can train WGAN-GP significantly faster (see Fig. 6, right), as the discriminator and generator are each updated only twice.

Table 2: Best inception scores (averaged over 5 runs) achieved on CIFAR10 for every considered Adam variant, with the WGAN-GP (DCGAN) model. OptimAdam is the related Optimistic Adam (Daskalakis et al., 2018) algorithm. We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines (in italic).

Method          no averaging    uniform avg
SimAdam         6.00 ±          ± .08
AltAdam5        6.5 ±           ± .05
ExtraAdam       6. ±            ± .05
PastExtraAdam   6.7 ±           ± 0.13

Figure 6: DCGAN architecture with WGAN-GP trained on CIFAR10: mean and standard deviation of the inception score computed over 5 runs for each method using the best performing learning rate, plotted over number of generator updates (left) and wall-clock time (right); all experiments were run on an NVIDIA Quadro GP100 GPU. We see that ExtraAdam converges faster than the Adam baselines.
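As a structural illustration of how ExtraAdam interleaves the extrapolation and update steps, here is a minimal runnable PyTorch sketch on a toy WGAN-like game (our code, not the paper's released implementation: the paper's Algorithm 4 handles Adam's moment estimates more carefully, whereas here both half-steps update them, and the toy losses omit the gradient penalty).

```python
import copy
import torch
import torch.nn as nn

# Toy 1D WGAN-like game with linear players; purely illustrative.
gen, disc = nn.Linear(1, 1), nn.Linear(1, 1)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3, betas=(0.5, 0.9))
sample = lambda: (torch.randn(64, 1) + 2.0, torch.randn(64, 1))  # (real, noise)

def losses(batch):
    real, z = batch
    fake = gen(z)
    d_loss = disc(fake.detach()).mean() - disc(real).mean()  # critic loss
    g_loss = -disc(fake).mean()                              # generator loss
    return d_loss, g_loss

def grad_step(batch, apply=True):
    d_loss, g_loss = losses(batch)
    opt_g.zero_grad()
    g_loss.backward()        # fills gen grads (also leaks into disc)
    opt_d.zero_grad()        # discard the leaked disc grads
    d_loss.backward()        # fills disc grads
    if apply:
        opt_d.step(); opt_g.step()

for t in range(1000):
    snap_g = copy.deepcopy(gen.state_dict())
    snap_d = copy.deepcopy(disc.state_dict())
    grad_step(sample())                  # extrapolation: move to look-ahead point
    grad_step(sample(), apply=False)     # gradients AT the look-ahead point
    gen.load_state_dict(snap_g)          # restore values (grads are kept)
    disc.load_state_dict(snap_d)
    opt_d.step(); opt_g.step()           # apply look-ahead grads from the snapshot
```

The key design point is that the update step applies gradients evaluated at the look-ahead point while stepping from the snapshot, which is what distinguishes extrapolation from two plain Adam steps.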

G.3 COMPARISON OF THE METHODS WITH THE SAME LEARNING RATE

In this section we compare how the methods presented in Section 7 perform with the same step-size. We follow the same protocol as in the experimental section. In Figure 7 we plot the inception score provided by each training method as a function of the number of generator updates. Note that these plots slightly advantage AltAdam5, because each of its iterations is a bit more costly (it performs 5 discriminator updates for each generator update). Nevertheless, the goal of this experiment is not to show that AltAdam5 is faster, but to show that ExtraAdam is less sensitive to the choice of learning rate and can be used with higher learning rates with less degradation. In the same way, in Figure 8 we compare the sample quality on the WGAN-GP experiment (DCGAN architecture) of AltAdam5 and AvgExtraAdam for different step-sizes. We notice that for AvgExtraAdam the sample quality does not change significantly, whereas the sample quality of AltAdam5 seems to be quite sensitive to step-size tuning. We think that robustness to step-size tuning is a key property for an optimization algorithm, in order to save as much time as possible for tuning the other hyperparameters of the learning procedure, such as regularization.

Figure 7: Inception score on CIFAR10 for WGAN-GP (DCGAN) over number of generator updates for different learning rates: (a) learning rate of $10^{-3}$; (b) learning rate of $10^{-4}$. We can see that AvgExtraAdam is less sensitive to the choice of learning rate.

Figure 8: Comparison of the sample quality on the WGAN-GP (DCGAN) experiment for different methods and learning rates $\eta$: (a) AvgExtraAdam with $\eta = 10^{-3}$; (b) AvgExtraAdam with $\eta = 10^{-4}$; (c) AltAdam5 with $\eta = 10^{-3}$; (d) AltAdam5 with $\eta = 10^{-4}$.

G.4 COMPARISON OF THE METHODS WITH AND WITHOUT AVERAGING

In this section we compare how uniform averaging improves the performance of the methods presented in Section 7. We follow the same protocol as in the experimental section. In Figures 9 and 10, we plot the inception score provided by each training method as a function of the number of generator updates, with and without averaging. We notice that uniform averaging seems to improve the inception score; nevertheless, the samples look a bit more blurry (see Figure 11). This is confirmed by our results on the Fréchet Inception Distance (FID, Figure 12), which is more sensitive to blurriness.

Figure 9: Inception score on CIFAR10 for WGAN over number of generator updates, (a) with and (b) without averaging. We can see that averaging improves the inception score.

Figure 10: Inception score on CIFAR10 for WGAN-GP (DCGAN) over number of generator updates, (a) with and (b) without averaging.

Figure 11: Comparison of the samples of a WGAN trained with the different methods, with and without averaging: (a) PastExtraAdam without averaging; (b) PastExtraAdam with averaging; (c) AltAdam5 without averaging; (d) AltAdam5 with averaging. Although averaging improves the inception score, the samples seem more blurry.

Figure 12: The Fréchet Inception Distance (FID) from Heusel et al. (2017), computed using 50,000 samples, on the WGAN experiments. ReExtraAdam refers to Alg. 5 introduced in §D. We can see that averaging performs worse here than when comparing with the inception score. We observed that the samples generated using averaging are a little more blurry and that the FID is more sensitive to blurriness, thus providing an explanation for this observation.
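Finally, for completeness, here is a sketch (ours, not the paper's code) of the uniform averaging protocol used throughout this section: an online uniform average of the generator's parameters is maintained in a frozen copy, and it is this averaged copy that would be evaluated for the "with averaging" curves. Batch-norm buffers are not averaged in this simplification.

```python
import copy
import torch

# Online uniform (Polyak-style) parameter averaging for any nn.Module.
# Call update_average once per generator update; after t+1 updates the
# copy holds the uniform mean of all iterates seen so far.

def make_average(gen):
    avg = copy.deepcopy(gen)
    for p in avg.parameters():
        p.requires_grad_(False)   # the averaged copy is never trained
    return avg

@torch.no_grad()
def update_average(avg, gen, t):
    # avg <- avg * t/(t+1) + gen * 1/(t+1)
    for p_avg, p in zip(avg.parameters(), gen.parameters()):
        p_avg.mul_(t / (t + 1.0)).add_(p, alpha=1.0 / (t + 1.0))
```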
