A VARIATIONAL INEQUALITY PERSPECTIVE ON GENERATIVE ADVERSARIAL NETWORKS


Gauthier Gidel¹*, Hugo Berard¹,³*, Gaëtan Vignoud¹, Pascal Vincent¹,²,³, Simon Lacoste-Julien¹,²
¹ Mila & DIRO, Université de Montréal. ² CIFAR fellow. ³ Facebook Artificial Intelligence Research, Montreal.

ABSTRACT

Generative adversarial networks (GANs) form a generative modeling approach known for producing appealing samples, but they are notoriously difficult to train. One common way to tackle this issue has been to propose new formulations of the GAN objective. Yet, surprisingly few studies have looked at optimization methods designed for this adversarial training. In this work, we cast GAN optimization problems in the general variational inequality framework. Tapping into the mathematical programming literature, we counter some common misconceptions about the difficulties of saddle point optimization and propose to extend techniques designed for variational inequalities to the training of GANs. We apply averaging, extrapolation and a novel computationally cheaper variant that we call extrapolation from the past to the stochastic gradient method (SGD) and Adam.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) form a generative modeling approach known for producing realistic natural images (Karras et al., 2018) as well as high quality super-resolution (Ledig et al., 2017) and style transfer (Zhu et al., 2017). Nevertheless, GANs are also known to be difficult to train, often displaying an unstable behavior (Goodfellow, 2016). Much recent work has tried to tackle these training difficulties, usually by proposing new formulations of the GAN objective (Nowozin et al., 2016; Arjovsky et al., 2017). Each of these formulations can be understood as a two-player game, in the sense of game theory (Von Neumann and Morgenstern, 1944), and can be addressed as a variational inequality problem (VIP) (Harker and Pang, 1990), a framework that encompasses traditional saddle point optimization algorithms (Korpelevich, 1976).

Solving such GAN games is traditionally approached by running variants of stochastic gradient descent (SGD) initially developed for optimizing supervised neural network objectives. Yet it is known that for some games (Goodfellow, 2016, §8.2) SGD exhibits oscillatory behavior and fails to converge. This oscillatory behavior, which does not arise from stochasticity, highlights a fundamental problem: while a direct application of basic gradient descent is an appropriate method for regular minimization problems, it is not a sound optimization algorithm for the kind of two-player games of GANs. This constitutes a fundamental issue for GAN training, and calls for the use of more principled methods with more reassuring convergence guarantees.

Contributions. We point out that multi-player games can be cast as variational inequality problems (VIPs) and consequently the same applies to any GAN formulation posed as a minimax or non-zero-sum game. We present two techniques from this literature, namely averaging and extrapolation, widely used to solve VIPs but which had not been explored in the context of GANs before.¹ We extend standard GAN training methods such as SGD or Adam into variants that incorporate these techniques (Alg. 3 and Alg. 4 are new). We also explain that the oscillations of basic SGD for GAN training previously noticed (Goodfellow, 2016) can be explained by standard variational inequality optimization results, and we illustrate how averaging and extrapolation can fix this issue.
*Equal contribution, correspondence to firstname.lastname@umontreal.ca.
¹ Independent works (Mertikopoulos et al., 2018) and (Yazıcı et al., 2018) respectively explored extrapolation and averaging in the context of GANs. More details in the related work, §6.

We introduce a new technique, called extrapolation from the past, that only requires one gradient computation per iteration, compared to extrapolation which requires computing the gradient twice. We prove its convergence in the stochastic variational inequality setting, i.e. when applied to SGD. Finally, we test these techniques in the context of standard GAN training. We observe a 4%-6% improvement on the inception score (Salimans et al., 2016) of WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017) on the CIFAR-10 dataset.

Outline. §2 presents the background on GANs and optimization, and shows how to cast this optimization as a VIP. §3 presents standard techniques to optimize variational inequalities in a batch setting, as well as our new one, extrapolation from the past. §4 considers these methods in the stochastic setting, yielding three corresponding variants of SGD, and provides their respective convergence rates. §5 develops how to combine these techniques with already existing algorithms. §6 discusses the related work and §7 presents experimental results.

2 GAN OPTIMIZATION AS A VARIATIONAL INEQUALITY PROBLEM

2.1 GAN FORMULATIONS

The purpose of generative modeling is to generate samples from a distribution q_θ that matches best the true distribution p of the data. The generative adversarial network training strategy can be understood as a game between two players called generator and discriminator. The former produces a sample that the latter has to classify as real or fake data. The final goal is to build a generator able to produce sufficiently realistic samples to fool the discriminator. In the original GAN paper (Goodfellow et al., 2014), the GAN objective is formulated as a zero-sum game where the cost function of the discriminator D_ϕ is given by the negative log-likelihood of the binary classification task between real data and fake data generated from q_θ by the generator,

  min_θ max_ϕ L(θ, ϕ) where L(θ, ϕ) = E_{x∼p}[log D_ϕ(x)] + E_{x′∼q_θ}[log(1 − D_ϕ(x′))].  (1)

However, Goodfellow et al. (2014) recommend to use in practice a second formulation, called the non-saturating GAN. This formulation is a non-zero-sum game where the aim is to jointly minimize:

  L^(θ)(θ, ϕ) = −E_{x′∼q_θ}[log D_ϕ(x′)] and L^(ϕ)(θ, ϕ) = −E_{x∼p}[log D_ϕ(x)] − E_{x′∼q_θ}[log(1 − D_ϕ(x′))].  (2)

The dynamics of this formulation have the same stationary points as the zero-sum one (1) but are claimed to provide much stronger gradients early in learning (Goodfellow et al., 2014).
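To make the difference between the two formulations concrete, here is a minimal NumPy sketch contrasting the saturating generator loss inside (1) with the non-saturating loss L^(θ) of (2). The logistic discriminator on random toy data is a hypothetical setup of ours, not the paper's:

```python
import numpy as np

# Minimal sketch (toy setup): a logistic discriminator D(x) = sigmoid(phi @ x)
# evaluated on fake samples x'. Compares the saturating generator loss from Eq. (1)
# with the non-saturating loss L^(theta) of Eq. (2).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
phi = rng.normal(size=5)             # discriminator parameters (toy)
x_fake = rng.normal(size=(64, 5))    # minibatch of generator samples

d_fake = sigmoid(x_fake @ phi)       # D_phi(x') for each fake sample

# Saturating objective: the generator minimizes E[log(1 - D(x'))], cf. Eq. (1).
loss_saturating = np.mean(np.log(1.0 - d_fake))
# Non-saturating objective: the generator minimizes E[-log D(x')], cf. Eq. (2).
loss_non_saturating = np.mean(-np.log(d_fake))

# With respect to the logit s = phi @ x': d/ds log(1 - sigmoid(s)) = -sigmoid(s),
# which vanishes when D(x') ~ 0 (a confident discriminator), whereas
# d/ds [-log sigmoid(s)] = sigmoid(s) - 1 stays large in that regime,
# which is the "stronger gradients early in learning" claim.
print(loss_saturating, loss_non_saturating)
```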
2.2 EQUILIBRIUM

The minimax formulation (1) is theoretically convenient because a large literature on games studies this problem and provides guarantees on the existence of equilibria. Nevertheless, practical considerations lead the GAN literature to consider a different objective for each player, as formulated in (2). In that case, the two-player game problem (Von Neumann and Morgenstern, 1944) consists in finding the following Nash equilibrium:

  θ* ∈ argmin_{θ∈Θ} L^(θ)(θ, ϕ*) and ϕ* ∈ argmin_{ϕ∈Φ} L^(ϕ)(θ*, ϕ).  (3)

Only when L^(θ) = −L^(ϕ) is the game called a zero-sum game, and only then can (3) be formulated as a minimax problem. One important point to notice is that the two optimization problems in (3) are coupled and have to be considered jointly from an optimization point of view. Standard GAN objectives are non-convex (i.e. each cost function is non-convex), and thus such (pure) equilibria may not exist. As far as we know, not much is known about the existence of these equilibria for non-convex losses (see Heusel et al. (2017) and references therein for some results).

In our theoretical analysis in §4, our assumptions (monotonicity (24) of the operator and convexity of the constraint set) imply the existence of an equilibrium. In this paper, we focus on ways to optimize these games, assuming that an equilibrium exists. As is often standard in non-convex optimization, we also focus on finding points satisfying the necessary stationary conditions.

As we mentioned previously, one difficulty that emerges in the optimization of such games is that the two different cost functions of (3) have to be minimized jointly in θ and ϕ. Fortunately, the optimization literature has for a long time studied so-called variational inequality problems, which generalize the stationary conditions for two-player game problems.

2.3 VARIATIONAL INEQUALITY PROBLEM FORMULATION

We first consider the local necessary conditions that characterize the solution of the smooth two-player game (3), defining stationary points, which will motivate the definition of a variational inequality. In the unconstrained setting, a stationary point is a couple (θ*, ϕ*) with zero gradient:

  ∇_θ L^(θ)(θ*, ϕ*) = ∇_ϕ L^(ϕ)(θ*, ϕ*) = 0.  (4)

When constraints are present, a stationary point (θ*, ϕ*) is such that the directional derivative of each cost function is non-negative in any feasible direction (i.e. there is no feasible descent direction):

  ∇_θ L^(θ)(θ*, ϕ*)ᵀ(θ − θ*) ≥ 0 and ∇_ϕ L^(ϕ)(θ*, ϕ*)ᵀ(ϕ − ϕ*) ≥ 0, ∀(θ, ϕ) ∈ Θ × Φ.  (5)

Defining ω := (θ, ϕ), ω* := (θ*, ϕ*) and Ω := Θ × Φ, Eq. (5) can be compactly formulated as:

  F(ω*)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω, where F(ω) := [∇_θ L^(θ)(θ, ϕ); ∇_ϕ L^(ϕ)(θ, ϕ)].  (6)

These stationary conditions can be generalized to any continuous vector field: let Ω ⊂ ℝ^d and F : Ω → ℝ^d be a continuous mapping. The variational inequality problem (Harker and Pang, 1990), depending on F and Ω, is:

  find ω* ∈ Ω such that F(ω*)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω.  (VIP)

We call the optimal set the set Ω* of ω ∈ Ω verifying (VIP). The intuition behind it is that any ω* ∈ Ω* is a fixed point of the dynamics of F constrained to Ω. We have thus shown that both saddle point optimization and non-zero-sum game optimization, which encompass the large majority of GAN variants proposed in the literature, can be cast as VIPs. In the next section, we turn to suitable optimization techniques for such problems.

3 OPTIMIZATION OF VARIATIONAL INEQUALITIES (BATCH SETTING)

Let us begin by looking at techniques that were developed in the optimization literature to solve (VIP). We present the intuitions behind them as well as their performance on a simple bilinear problem (see Fig. 1). Our goal here is to provide mathematical insights into the techniques of averaging (§3.1) and extrapolation (§3.2), to inspire their application when extending other optimization algorithms. We then propose a novel variant of the extrapolation technique in §3.3: extrapolation from the past. We here treat the batch setting, i.e. we consider that the operator F(ω) defined in Eq. (6) yields an exact full gradient. We will present extensions of these techniques to the stochastic setting later in §4.

The two standard methods studied in the VIP literature are the gradient method (Bruck, 1977) and the extragradient method (Korpelevich, 1976). The iterates of the basic gradient method are given by ω_{t+1} = P_Ω[ω_t − ηF(ω_t)], where P_Ω[·] is the projection onto the constraint set (if constraints are present²) associated to (VIP). These iterates are known to converge linearly under an additional assumption on the operator³ (Chen and Rockafellar, 1997), but they oscillate for a bilinear operator, as shown in Fig. 1. On the other hand, the uniform average of these iterates converges for any bounded monotone operator with a O(1/√t) rate (Nedić and Ozdaglar, 2009), motivating the presentation of averaging in §3.1.
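As a rough illustration of this behavior, the following sketch runs the projected gradient method with online uniform averaging on the 2D bilinear operator F(θ, ϕ) = (ϕ, −θ) studied in §3.1 below; the ball constraint is our choice for illustration only:

```python
import numpy as np

# Minimal sketch: projected gradient method omega_{t+1} = P_Omega[omega_t - eta F(omega_t)]
# on the bilinear operator F(theta, phi) = (phi, -theta), whose solution is (0, 0).
# Unconstrained, the squared norm grows by (1 + eta^2) per step; projected onto a ball
# (a compact Omega), the iterates keep circling while their uniform average improves.
def project_ball(w, radius=2.0):     # P_Omega for Omega = {w : ||w|| <= radius}
    n = np.linalg.norm(w)
    return w if n <= radius else w * (radius / n)

eta = 0.1
w = np.array([1.0, 1.0])
avg = np.zeros(2)
for t in range(2000):
    grad = np.array([w[1], -w[0]])   # F(omega_t)
    w = project_ball(w - eta * grad) # projected gradient step
    avg += (w - avg) / (t + 1)       # online uniform averaging, cf. Eq. (8) below
print(np.linalg.norm(w))             # stays near the boundary, oscillating
print(np.linalg.norm(avg))           # the average drifts toward the solution
```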
By contrast, the extragradient method (extrapolated gradient) does not require any averaging to converge for monotone operators (in the batch setting), and can even converge at the faster O(1/t) rate (Nesterov, 2007). The idea of this method is to compute a lookahead step (see the intuition on extrapolation in §3.2) in order to compute a more stable direction to follow.

² An example of a constraint for GANs is to clip the parameters of the discriminator (Arjovsky et al., 2017).
³ Strong monotonicity, a generalization of strong convexity. See §A.

3.1 AVERAGING

More generally, we consider a weighted averaging scheme with weights ρ_t ≥ 0. This weighted averaging scheme was proposed for the first time for (batch) VIPs by Bruck (1977),

  ω̄_T = (1/S_T) Σ_{t=0}^{T−1} ρ_t ω_t, S_T = Σ_{t=0}^{T−1} ρ_t.  (7)

Averaging schemes can be efficiently implemented in an online fashion, noticing that

  ω̄_t = (1 − ρ̃_t) ω̄_{t−1} + ρ̃_t ω_t, where 0 ≤ ρ̃_t ≤ 1.  (8)

For instance, setting ρ̃_t = 1/t provides uniform averaging (ρ_t = 1), and ρ̃_t = 1 − β < 1 provides geometric averaging, also known as exponential moving averaging (ρ_t = β^{T−t}). Averaging is experimentally compared with the other techniques presented in this section in Fig. 1.

In order to illustrate how averaging tackles the oscillatory behavior in game optimization, we consider a toy example where the discriminator and the generator are linear: D_ϕ(x) = ϕᵀx and G_θ(z) = θz (implicitly defining q_θ). By replacing these expressions in the WGAN objective,⁴ we get the following bilinear objective:

  min_{θ∈Θ} max_{ϕ∈Φ, ‖ϕ‖≤1} ϕᵀE[x] − ϕᵀθ E[z].  (9)

A similar task was presented by Nagarajan and Kolter (2017), who consider a quadratic discriminator instead of a linear one and show that gradient descent is not necessarily asymptotically stable. The bilinear objective has been extensively used (Goodfellow, 2016; Mescheder et al., 2018; Yadav et al., 2018) to highlight the difficulties of gradient descent for saddle point optimization. Yet, ways to cope with this issue were proposed decades ago in the context of mathematical programming.

Simplifying further by setting the dimension to 1 and centering the equilibrium at the origin, Eq. (9) becomes:

  min_{θ∈ℝ} max_{ϕ∈ℝ} θ·ϕ, with (θ*, ϕ*) = (0, 0).  (10)

The operator associated with this minimax game is F(θ, ϕ) = (ϕ, −θ). There are several ways to compute the discrete updates of this dynamics. The two most common ones are the simultaneous and the alternating gradient update rules,

  Simultaneous: θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_t;  Alternating: θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_{t+1}.  (11)

Interestingly, these two choices give rise to completely different behaviors. The norm of the simultaneous updates diverges geometrically, whereas the alternating iterates are bounded but do not converge to the equilibrium. As a consequence, their respective uniform averages have a different behavior, as highlighted in the following proposition (more details and proof in §B.1):

Proposition 1. The simultaneous iterates diverge geometrically and the alternating iterates defined in (11) are bounded but do not converge to 0, as

  Simultaneous: θ²_{t+1} + ϕ²_{t+1} = (1 + η²)(θ²_t + ϕ²_t),  Alternating: θ²_t + ϕ²_t = Θ(θ²_0 + ϕ²_0),  (12)

where u_t = Θ(v_t) ⇔ ∃α, β > 0 : α v_t ≤ u_t ≤ β v_t. The uniform average (θ̄_t, ϕ̄_t) = (1/t) Σ_{s=0}^{t−1} (θ_s, ϕ_s) of the simultaneous updates (resp. the alternating updates) diverges (resp. converges to 0), as

  Simultaneous: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0)(1 + η²)^t / (η² t²) ),  Alternating: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0) / (η² t²) ).  (13)

This sublinear convergence result, proved in §B, underlines the benefits of averaging when the sequence of iterates is bounded (i.e. for the alternating update rule). When the sequence of iterates is not bounded (i.e. for simultaneous updates), averaging fails to ensure convergence. This proposition also shows how alternating updates may have better convergence properties than simultaneous updates.

⁴ Wasserstein GAN (WGAN) proposed by Arjovsky et al. (2017) boils down to the following minimax formulation: min_{θ∈Θ} max_{ϕ∈Φ, ‖D_ϕ‖_L≤1} E_{x∼p}[D_ϕ(x)] − E_{x′∼q_θ}[D_ϕ(x′)].
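A minimal numerical check of Proposition 1 (plain NumPy; the step-size and horizon are arbitrary choices of ours):

```python
import numpy as np

# Numerical check of Proposition 1 on min_theta max_phi theta*phi (Eq. 10):
# simultaneous iterates diverge, alternating iterates stay bounded, and the
# uniform average of the alternating iterates converges to the equilibrium (0, 0).
eta = 0.1
ts, ps = 1.0, 1.0            # simultaneous iterates (theta, phi)
ta, pa = 1.0, 1.0            # alternating iterates
avg = np.zeros(2)
for t in range(5000):
    ts, ps = ts - eta * ps, ps + eta * ts        # simultaneous update (Eq. 11, left)
    ta = ta - eta * pa                           # alternating update (Eq. 11, right)
    pa = pa + eta * ta                           # ... uses the fresh theta_{t+1}
    avg += (np.array([ta, pa]) - avg) / (t + 1)  # online averaging of alternating iterates
print(ts**2 + ps**2)         # ~ (1 + eta^2)^t: diverges geometrically
print(ta**2 + pa**2)         # bounded, but does not converge to 0
print(avg)                   # ~ O(1/(eta*t)): converges to (0, 0)
```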

3.2 EXTRAPOLATION

Another technique used in the variational inequality literature to prevent oscillations is extrapolation. This concept is anterior to the extragradient method, since Korpelevich (1976) mentions that the idea of extrapolated "prices" to give stability had already been formulated by Polyak (1963, Chap. II). The idea behind this technique is to compute the gradient at an (extrapolated) point different from the current point from which the update is performed, stabilizing the dynamics:

  Compute extrapolated point: ω_{t+1/2} = P_Ω[ω_t − ηF(ω_t)],  (14)
  Perform update step: ω_{t+1} = P_Ω[ω_t − ηF(ω_{t+1/2})].  (15)

Note that, even in the unconstrained case, this method is intrinsically different from Nesterov's momentum⁵ (Nesterov, 1983, Eq. 2.2.9) because of this lookahead step for the gradient computation:

  Nesterov's method: ω_{t+1/2} = ω_t − ηF(ω_t), ω_{t+1} = ω_{t+1/2} + β(ω_{t+1/2} − ω_{t−1/2}).  (16)

Nesterov's method does not converge when trying to optimize (10). One intuition explaining why extrapolation provides better convergence properties than the standard gradient method comes from the framework of Euler's methods (see for instance (Atkinson, 2003) for more details on that topic). Indeed, if we consider a first order approximation of ω_{t+1/2}, we have ω_{t+1/2} = ω_{t+1} + o(η), and consequently the update step (15) is close to an implicit method step:

  Implicit step: ω_{t+1} = ω_t − ηF(ω_{t+1}).  (17)

In the literature on Euler's methods, implicit methods are known to be more stable and to benefit from better convergence properties (Atkinson, 2003) than explicit methods. They are not often used in practice, though, since they require solving a potentially non-linear system at each iteration. Taking back the simplified WGAN toy example (10) from §3.1, we get the following update rules:

  Implicit: θ_{t+1} = θ_t − ηϕ_{t+1}, ϕ_{t+1} = ϕ_t + ηθ_{t+1};  Extrapolation: θ_{t+1} = θ_t − η(ϕ_t + ηθ_t), ϕ_{t+1} = ϕ_t + η(θ_t − ηϕ_t).  (18)

In the following proposition, we see that the respective convergence rates of the implicit method and extrapolation are highly similar. Keeping in mind that the latter has the major advantage of being more practical, this proposition clearly underlines the benefits of extrapolation (more details and proof in §B.2):

Proposition 2. The squared norm of the iterates, N_t = θ²_t + ϕ²_t, where the update rules of θ_t and ϕ_t are defined in (18), decreases geometrically for any η < 1, as

  Implicit: N_{t+1} = (1 − η² + η⁴ + O(η⁶)) N_t,  Extrapolation: N_{t+1} = (1 − η² + η⁴) N_t.  (19)

3.3 EXTRAPOLATION FROM THE PAST

One issue with extrapolation is that the algorithm "wastes" a gradient (14): we need to compute the gradient at two different points for every single update of the parameters. We thus propose a new technique, which we call extrapolation from the past, that only requires computing one gradient for every update. The idea of this technique is to store and re-use the previous extrapolated gradient to compute the new extrapolation point:

  Extrapolation from the past: ω_{t+1/2} = P_Ω[ω_t − ηF(ω_{t−1/2})],  (20)
  Perform update step: ω_{t+1} = P_Ω[ω_t − ηF(ω_{t+1/2})], and store F(ω_{t+1/2}).  (21)

This update scheme can be related to (Chiang et al., 2012, Alg. 1) in the context of online learning and to the optimistic mirror descent (Rakhlin and Sridharan, 2013; Daskalakis et al., 2018). In the unconstrained case, (20) and (21) reduce to:

  Optimistic mirror descent (OMD): ω_{t+1/2} = ω_{t−1/2} − 2ηF(ω_{t−1/2}) + ηF(ω_{t−3/2}).  (22)

However, our technique comes from a different perspective: it was motivated by VIPs and inspired from the extragradient method.
Furthermore, our technique extends to constrained optimization, as shown in (20) and (21). It is not clear whether or not a single projection added to (22) provides a provably converging algorithm. Using the VIP point of view, we are able to prove a linear convergence rate for a projected version of extrapolation from the past (see details and proof of Theorem 1 in §B.3). We also extend these results to the stochastic operator setting in §4.

⁵ Sutskever (2013, §7.2) showed the equivalence between standard momentum and Nesterov's formulation.
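The following sketch runs both extrapolation (14)-(15) and extrapolation from the past (20)-(21) on the unconstrained toy game (10); note how the second variant performs a single evaluation of F per iteration by storing the previous extrapolated gradient:

```python
import numpy as np

# Minimal sketch: extrapolation (Eqs. 14-15) and extrapolation from the past
# (Eqs. 20-21) on the unconstrained bilinear game of Eq. (10), with
# F(theta, phi) = (phi, -theta). Both converge to (0, 0).
def F(w):
    return np.array([w[1], -w[0]])

eta = 0.1
w_extra = np.array([1.0, 1.0])
w_past = np.array([1.0, 1.0])
g_stored = F(w_past)                     # F(omega_{-1/2}), initialized at omega_0
for t in range(500):
    # Extragradient: two evaluations of F per iteration.
    w_half = w_extra - eta * F(w_extra)  # extrapolation step, Eq. (14)
    w_extra = w_extra - eta * F(w_half)  # update step, Eq. (15)
    # Extrapolation from the past: one evaluation of F per iteration.
    w_half = w_past - eta * g_stored     # Eq. (20), reuses the stored gradient
    g_stored = F(w_half)                 # single new evaluation, stored for next step
    w_past = w_past - eta * g_stored     # Eq. (21)
print(np.linalg.norm(w_extra), np.linalg.norm(w_past))  # both -> 0
```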

Figure 1: Comparison of the basic gradient method (as well as Adam) with the techniques presented in §3 on the optimization of (9), plotted in the (θ, ϕ) plane: Adam (γ = 0.01), gradient method (γ = 0.1), averaging (γ = 2.0), extrapolation from the past (γ = 0.5) and extrapolation (γ = 0.6). Only the algorithms advocated in this paper (averaging, extrapolation and extrapolation from the past) converge quickly to the solution. Each marker represents 20 iterations. We compare these algorithms on a non-convex objective in §G.1.

Algorithm 1 AvgSGD
  Let ω_0 ∈ Ω.
  for t = 0 … T − 1 do
    Sample ξ_t ∼ P (minibatch)
    d_t ← F(ω_t, ξ_t)
    ω_{t+1} ← P_Ω[ω_t − η_t d_t]
  end for
  Return ω̄_T = Σ_t η_t ω_t / Σ_t η_t

Algorithm 2 AvgExtraSGD
  Let ω_0 ∈ Ω.
  for t = 0 … T − 1 do
    Sample ξ_t, ξ′_t ∼ P (minibatches)
    d_t ← F(ω_t, ξ_t)
    ω′_t ← P_Ω[ω_t − η_t d_t]   (extrapolation step)
    d′_t ← F(ω′_t, ξ′_t)
    ω_{t+1} ← P_Ω[ω_t − η_t d′_t]
  end for
  Return ω̄_T = Σ_t η_t ω′_t / Σ_t η_t

Algorithm 3 AvgPastExtraSGD
  Let ω_0 ∈ Ω and d_{−1} = 0.
  for t = 0 … T − 1 do
    Sample ξ_t ∼ P (minibatch)
    ω′_t ← P_Ω[ω_t − η_t d_{t−1}]   (extrapolation from the past)
    d_t ← F(ω′_t, ξ_t)
    ω_{t+1} ← P_Ω[ω_t − η_t d_t]
  end for
  Return ω̄_T = Σ_t η_t ω′_t / Σ_t η_t

Figure 2: Three variants of SGD using the techniques introduced in §3.

Theorem 1 (Linear convergence of extrapolation from the past). If F is μ-strongly monotone (see §A for the definition of strong monotonicity) and L-Lipschitz, then the updates (20) and (21) with η = 1/(4L) provide linearly converging iterates,

  ‖ω_t − ω*‖² ≤ (1 − μ/(4L))^t ‖ω_0 − ω*‖², ∀t ≥ 0.  (23)

In comparison to the results from (Daskalakis et al., 2018), which hold only for a bilinear objective, we provide a faster convergence rate (linear vs sublinear) on the last iterate for a general (strongly monotone) operator F and any projection onto a convex Ω. One thing to notice is that the operator of a bilinear objective is not strongly monotone, but in that case one can use the standard extrapolation method (14), which converges linearly for a (constrained or not) bilinear game (Tseng, 1995, Cor. 3.3).

4 OPTIMIZATION OF VIPs WITH STOCHASTIC GRADIENTS

In this section, we consider extensions of the techniques presented in §3 for optimizing (VIP) to the context of a stochastic operator. In this case, at each time step we no longer have access to the exact gradient F(ω), but to an unbiased stochastic estimate of it, F(ω, ξ), where ξ ∼ P and F(ω) := E_{ξ∼P}[F(ω, ξ)]. This is motivated by the GAN formulation, where we only have access to a finite sample estimate of the expected gradient, computed on a mini-batch. For GANs, ξ is thus a mini-batch of points coming from the true data distribution p and the generator distribution q_θ.

For our analysis, we require at least one of the two following assumptions on the stochastic operator:

Assumption 1. Bounded variance by σ²: E_ξ[‖F(ω) − F(ω, ξ)‖²] ≤ σ², ∀ω ∈ Ω.

Assumption 2. Bounded expected squared norm by M²: E_ξ[‖F(ω, ξ)‖²] ≤ M², ∀ω ∈ Ω.

Assump. 1 is standard in stochastic variational analysis, while Assump. 2 is a stronger assumption sometimes made in stochastic convex optimization. To illustrate how strong Assump. 2 is, note that it does not hold for an unconstrained bilinear objective like our example (10) in §3. It is thus mainly reasonable for bounded constraint sets. Note that in practice we have σ ≤ M.

We now present and analyze three algorithms, variants of SGD that are appropriate to solve (VIP). The first one, Alg. 1 (AvgSGD), is the stochastic extension of the gradient method for solving (VIP); Alg. 2 (AvgExtraSGD) uses extrapolation and Alg. 3 (AvgPastExtraSGD) uses extrapolation from the past. A fourth variant (Alg. 5) is proposed in §D. These three algorithms return an average of the iterates. The proofs of the theorems presented in this section are in §F.
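For concreteness, here is a minimal NumPy transcription of Alg. 3 (AvgPastExtraSGD). The names `oracle` and `project` are placeholders of ours, standing for the stochastic operator F(ω, ξ) and the projection P_Ω, and d_{−1} = 0 is one possible initialization convention:

```python
import numpy as np

# Minimal sketch of Alg. 3 (AvgPastExtraSGD) for a generic stochastic operator.
# `oracle(w, rng)` must return an unbiased estimate of F(w); `project` is P_Omega.
def avg_past_extra_sgd(oracle, project, w0, eta, T, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    d = np.zeros_like(w0)              # stored gradient d_{t-1}, initialized to 0
    w_avg, s = np.zeros_like(w0), 0.0
    for t in range(T):
        w_half = project(w - eta * d)  # extrapolation from the past
        d = oracle(w_half, rng)        # d_t <- F(omega'_t, xi_t), stored for next step
        w = project(w - eta * d)       # update step
        w_avg += eta * (w_half - w_avg) / (s + eta)  # online weighted average of omega'_t
        s += eta
    return w_avg
```

For example, with `oracle = lambda w, rng: np.array([w[1], -w[0]])` (the noiseless toy operator of §3.1) and `project` the identity, the returned average converges to the equilibrium.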

To handle constraints such as parameter clipping (Arjovsky et al., 2017), we present a projected version of these algorithms, where P_Ω[ω′] denotes the projection of ω′ onto Ω (see §A). Note that when Ω = ℝ^d, the projection is the identity mapping (unconstrained setting). In order to prove the convergence of these three algorithms, we will assume that F is monotone:

  (F(ω) − F(ω′))ᵀ(ω − ω′) ≥ 0, ∀ω, ω′ ∈ Ω.  (24)

If F can be written as (6), this implies that the cost functions are convex.⁶ Note, however, that in general GANs parametrized with neural networks lead to non-monotone VIPs.

Assumption 3. F is monotone and Ω is a compact convex set such that max_{ω,ω′∈Ω} ‖ω − ω′‖² ≤ R².

In that setting, the quantity g(ω′) := max_{ω∈Ω} F(ω)ᵀ(ω′ − ω) is well defined and is equal to 0 if and only if ω′ is a solution of (VIP). Moreover, if we are optimizing a zero-sum game, we have ω = (θ, ϕ), Ω = Θ × Φ and F(θ, ϕ) = [∇_θL(θ, ϕ); −∇_ϕL(θ, ϕ)]. Hence, the quantity h(θ′, ϕ′) := max_{ϕ∈Φ} L(θ′, ϕ) − min_{θ∈Θ} L(θ, ϕ′) is well defined and equal to 0 if and only if (θ′, ϕ′) is a Nash equilibrium of the game. The two functions g and h are called merit functions (more details on the concept of merit functions in §C). In the following, we call

  Err(ω) = max_{(θ′,ϕ′)∈Θ×Φ} L(θ, ϕ′) − L(θ′, ϕ), if F(θ, ϕ) = [∇_θL(θ, ϕ); −∇_ϕL(θ, ϕ)];
  Err(ω) = max_{ω′∈Ω} F(ω′)ᵀ(ω − ω′), otherwise.  (25)

Averaging. Alg. 1 (AvgSGD) presents the stochastic gradient method with averaging, which reduces to the standard (simultaneous) SGD updates for the two-player games used in the GAN literature, but returning an average of the iterates.

Theorem 2. Under Assump. 1, 2 and 3, SGD with averaging (Alg. 1) with a constant step-size η gives

  E[Err(ω̄_T)] ≤ R²/(2ηT) + η(M² + σ²)/2, where ω̄_T = (1/T) Σ_{t=0}^{T−1} ω_t, ∀T ≥ 1.  (26)

Thm. 2 uses a similar proof as (Nemirovski et al., 2009). The constant term η(M² + σ²)/2 in (26) is called the variance term. This type of bound is standard in stochastic optimization. We also provide in §F a similar rate with an extra log factor when η_t = η/√t. We show in §E that this variance term is smaller than the one of the SGD with prediction method (Yadav et al., 2018).

Extrapolations. Alg. 2 (AvgExtraSGD) adds an extrapolation step compared to Alg. 1 in order to reduce the oscillations due to the game between the two players. A theoretical consequence is that it has a smaller variance term than (26). As discussed previously, Assump. 2, made in Thm. 2 for the convergence of Alg. 1, is very strong in the unbounded setting. One advantage of SGD with extrapolation is that Thm. 3 does not require this assumption.

Theorem 3. (Juditsky et al., 2011, Thm. 1) Under Assump. 1 and 3, if E_ξ[F(·, ξ)] is L-Lipschitz, then SGD with extrapolation and averaging (Alg. 2) using a constant step-size η ≤ 1/(√3 L) gives

  E[Err(ω̄_T)] ≤ R²/(ηT) + (7/2) η σ², where ω̄_T = (1/T) Σ_{t=0}^{T−1} ω′_t, ∀T ≥ 1.  (27)

Since in practice σ ≤ M, the variance term in (27) is significantly smaller than the one in (26). To summarize, SGD with extrapolation provides better convergence guarantees, but requires two gradient computations and two samples per iteration. This motivates our new method, Alg. 3 (AvgPastExtraSGD), which uses extrapolation from the past and achieves the best of both worlds.

Theorem 4. Under Assump. 1 and 3, if E_ξ[F(·, ξ)] is L-Lipschitz, then SGD with extrapolation from the past (Alg. 3) using a constant step-size η ≤ 1/(2√3 L) gives averaged iterates that converge as

  E[Err(ω̄_T)] ≤ R²/(ηT) + (13/2) η σ², where ω̄_T = (1/T) Σ_{t=0}^{T−1} ω′_t, ∀T ≥ 1.  (28)

The bound is similar to the one provided in Thm. 3, but each iteration of Alg. 3 is computationally half the cost of an iteration of Alg. 2.
⁶ The convexity of the cost functions in (3) is a necessary condition (not sufficient) for the operator to be monotone. In the context of a zero-sum game, the convexity of the cost functions is a sufficient condition.

5 COMBINING THE TECHNIQUES WITH ESTABLISHED ALGORITHMS

In the previous sections we presented several techniques that converge on a simple bilinear example. These techniques can be combined in practice with existing algorithms. We propose to combine them with two standard algorithms used for training deep neural networks: the Adam optimizer (Kingma and Ba, 2015) and the SGD optimizer (Robbins and Monro, 1951). Note that in the case of a two-player game (3), the previous results can be generalized to gradient updates with a different step-size for each player, by simply rescaling the objectives L^(θ) and L^(ϕ) by a different scaling factor. A detailed pseudo-code for Adam with extrapolation step (Extra-Adam) is given in Algorithm 4.

Algorithm 4 Extra-Adam: proposed Adam with extrapolation step.
  input: step-size η, decay rates for the moment estimates β₁, β₂, access to the stochastic gradients ∇ℓ_t(·) and to the projection P_Ω[·] onto the constraint set Ω, initial parameter ω_0, averaging scheme (ρ_t)_{t≥1}.
  for t = 0 … T − 1 do
    Option 1: Standard extrapolation. Sample a new minibatch and compute the stochastic gradient: g_t ← ∇ℓ_t(ω_t).
    Option 2: Extrapolation from the past. Load the previously saved stochastic gradient: g_t = ∇ℓ_{t−1/2}(ω_{t−1/2}).
    Update the estimate of the first moment for extrapolation: m_{t+1/2} ← β₁ m_t + (1 − β₁) g_t.
    Update the estimate of the second moment for extrapolation: v_{t+1/2} ← β₂ v_t + (1 − β₂) g²_t.
    Correct the bias for the moments: m̂_{t+1/2} ← m_{t+1/2}/(1 − β₁^{2t+1}), v̂_{t+1/2} ← v_{t+1/2}/(1 − β₂^{2t+1}).
    Perform the extrapolation step from the iterate at time t: ω_{t+1/2} ← P_Ω[ω_t − η m̂_{t+1/2}/(√v̂_{t+1/2} + ε)].
    Sample a new minibatch and compute the stochastic gradient: g_{t+1/2} ← ∇ℓ_{t+1/2}(ω_{t+1/2}).
    Update the estimate of the first moment: m_{t+1} ← β₁ m_{t+1/2} + (1 − β₁) g_{t+1/2}.
    Update the estimate of the second moment: v_{t+1} ← β₂ v_{t+1/2} + (1 − β₂) g²_{t+1/2}.
    Compute the bias-corrected first and second moments: m̂_{t+1} ← m_{t+1}/(1 − β₁^{2t+2}), v̂_{t+1} ← v_{t+1}/(1 − β₂^{2t+2}).
    Perform the update step from the iterate at time t: ω_{t+1} ← P_Ω[ω_t − η m̂_{t+1}/(√v̂_{t+1} + ε)].
  end for
  Output: ω_{T−1/2}, ω_T, or ω̄_T = Σ_{t=0}^{T−1} ρ_{t+1} ω_{t+1/2} / Σ_{t=0}^{T−1} ρ_{t+1} (see (8) for online averaging).
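Below is a minimal unconstrained NumPy sketch of Option 1 of Algorithm 4. The exact bias-correction bookkeeping above may differ in small details from this sketch, which simply counts one Adam moment update per half-step; `grad` is a placeholder of ours for the stochastic gradient oracle:

```python
import numpy as np

# Minimal sketch of Extra-Adam (Alg. 4, Option 1) in the unconstrained case
# (P_Omega = identity). `grad(w, rng)` stands for the stochastic gradient.
def extra_adam(grad, w0, eta=1e-4, beta1=0.5, beta2=0.9, eps=1e-8, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    m = np.zeros_like(w0)   # first-moment estimate
    v = np.zeros_like(w0)   # second-moment estimate
    k = 0                   # number of moment updates so far (two per iteration)
    def adam_direction(g):
        nonlocal m, v, k
        k += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**k)             # bias correction
        v_hat = v / (1 - beta2**k)
        return m_hat / (np.sqrt(v_hat) + eps)
    for t in range(T):
        g = grad(w, rng)                       # gradient at omega_t
        w_half = w - eta * adam_direction(g)   # extrapolation step
        g_half = grad(w_half, rng)             # gradient at the extrapolated point
        w = w - eta * adam_direction(g_half)   # update step, taken from omega_t
    return w
```

Note that the moment estimates are shared across the extrapolation and update half-steps, mirroring the pseudo-code: both gradients feed a single pair (m, v).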
6 RELATED WORK

The extragradient method is the standard algorithm to optimize variational inequalities. This algorithm was originally introduced by Korpelevich (1976) and extended by Nesterov (2007) and Nemirovski (2004). Stochastic versions of the extragradient method have recently been analyzed (Juditsky et al., 2011; Yousefian et al., 2014; Iusem et al., 2017) for stochastic variational inequalities with bounded constraints. A linearly convergent variance-reduced version of the stochastic gradient method was proposed by Palaniappan and Bach (2016) for strongly monotone variational inequalities.

Several methods to stabilize GANs consist in transforming a zero-sum formulation into a more general game that can no longer be cast as a saddle point problem. This is the case of the non-saturating formulation of GANs (Goodfellow et al., 2014; Fedus et al., 2018), the DCGANs (Radford et al., 2016), and the gradient penalty⁷ for WGANs (Gulrajani et al., 2017). Yadav et al. (2018) propose an optimization method for GANs based on AltSGD using a momentum-based step on the generator. Daskalakis et al. (2018) proposed a method inspired from game theory. Li et al. (2017) suggest dualizing the GAN objective to reformulate it as a maximization problem, and Mescheder et al. (2017) propose to add the norm of the gradient to the objective, providing an interesting perspective on GANs that interprets training as the search for a two-player game equilibrium. A study of the continuous version of two-player games has been conducted by Ratliff et al. (2016). Interesting non-convex results were proved, for a new notion of regret minimization, by Hazan et al. (2017), and in the context of GANs by Grnarova et al. (2018).

The technique of unrolling steps proposed by Metz et al. (2017) can be confused with extrapolation but is actually fundamentally different: the perspective there is to try to construct the "true generator objective function" by unrolling for K steps the updates of the discriminator before updating the generator. Nevertheless, the fact that this "true generator function" may not be found with a satisfying accuracy may lead to a different behavior than the one expected.

Regarding the averaging technique, some recent works appear to have already successfully used geometric averaging (7) for GANs in practice, but only briefly mention it (Karras et al., 2018; Mescheder et al., 2018). By contrast, the present work formally motivates and justifies the use of averaging for GANs by relating them to the VIP perspective, and sheds light on its underlying intuitions in §3.1. Another independent work (Yazıcı et al., 2018) made a similar attempt, but in the context of regret minimization in games. Mertikopoulos et al. (2018) also independently explored extrapolation, providing asymptotic convergence results (i.e. without any rate of convergence) in the context of coherent saddle points. The coherence assumption is slightly weaker than monotonicity.

⁷ The gradient penalty is only added to the discriminator cost function. Since this gradient penalty also depends on the generator, WGAN-GP cannot be cast as a saddle point problem and is actually a non-zero-sum game.

7 EXPERIMENTS

Our goal in this experimental section is not to provide new state-of-the-art results with architectural improvements or a new GAN formulation, but to show that using the techniques introduced earlier (which come with theoretical guarantees in the monotone case) allows us to optimize standard GANs in a better way. These techniques are orthogonal to the design of new formulations of GAN optimization objectives and to architectural choices, and can thus potentially be used for the training of any type of GAN.

We compare the following optimization algorithms. The baselines are SGD and Adam, using either simultaneous updates on the generator and on the discriminator (denoted SimAdam and SimSGD) or k updates on the discriminator alternating with 1 update on the generator (denoted AltSGD{k} and AltAdam{k}).⁸ Variants that use extrapolation are denoted ExtraSGD (Alg. 2) and ExtraAdam (Alg. 4). Variants using extrapolation from the past are PastExtraSGD (Alg. 3) and PastExtraAdam (Alg. 4). We also present results using the averaged iterates as output, adding Avg as a prefix of the algorithm name when we use (uniform) averaging.

7.1 BILINEAR SADDLE POINT (STOCHASTIC)

We first evaluate the performance of the various stochastic algorithms on a simple (n = 10³, d = 10³) finite-sum bilinear objective (a monotone operator) constrained to [−1, 1]^d:

  (1/n) Σ_{i=1}^n ( θᵀ M^(i) ϕ + θᵀ a^(i) + ϕᵀ b^(i) ),  (29)

solved by (θ*, ϕ*) such that M̄ϕ* = −ā and M̄ᵀθ* = −b̄, where ā = (1/n) Σ_{i=1}^n a^(i), b̄ = (1/n) Σ_{i=1}^n b^(i) and M̄ = (1/n) Σ_{i=1}^n M^(i). The matrices and vectors M^(i), a^(i), b^(i), 1 ≤ i ≤ n, were randomly generated, but ensuring that (θ*, ϕ*) belongs to [−1, 1]^d. Results are shown in Fig. 3: the methods combining averaging with extrapolation (AvgExtraSGD and AvgPastExtraSGD) perform the best on this task.

Figure 3: Performance (distance to the optimum versus number of gradient computations divided by n) of the considered stochastic optimization algorithms (AltAdam5, AltSGD1, ExtraSGD, PastExtraSGD, and their averaged variants) on the bilinear problem (29). Each method uses its respective optimal step-size found by grid-search.
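A small-scale sketch of this experiment follows. The reduced problem sizes, the shared offset vectors a and b (instead of per-sample ones), and the choice of AvgPastExtraSGD as solver are simplifications of ours, not the paper's exact setup:

```python
import numpy as np

# Minimal sketch of the finite-sum bilinear problem of Eq. (29) with a box
# constraint, solved with extrapolation from the past plus uniform averaging.
rng = np.random.default_rng(0)
n, d, eta, T = 50, 10, 0.05, 20000
M = rng.normal(size=(n, d, d)) / np.sqrt(d)
theta_star = rng.uniform(-1, 1, size=d)    # plant a solution inside [-1, 1]^d
phi_star = rng.uniform(-1, 1, size=d)
Mbar = M.mean(axis=0)
a = -Mbar @ phi_star                       # so that Mbar phi* + a = 0
b = -Mbar.T @ theta_star                   # so that Mbar^T theta* + b = 0

def oracle(w, rng):                        # unbiased estimate of F at w = (theta, phi)
    i = rng.integers(n)
    theta, phi = w[:d], w[d:]
    return np.concatenate([M[i] @ phi + a, -(M[i].T @ theta + b)])

def project(w):                            # P_Omega for Omega = [-1, 1]^{2d}
    return np.clip(w, -1.0, 1.0)

w = np.zeros(2 * d)
d_stored = np.zeros(2 * d)
w_avg = np.zeros(2 * d)
for t in range(T):
    w_half = project(w - eta * d_stored)   # extrapolation from the past
    d_stored = oracle(w_half, rng)
    w = project(w - eta * d_stored)
    w_avg += (w_half - w_avg) / (t + 1)    # uniform averaging
w_star = np.concatenate([theta_star, phi_star])
print(np.linalg.norm(w_avg - w_star))      # should be small, cf. Fig. 3
```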
7.2 WGAN AND WGAN-GP ON CIFAR10

We now evaluate the proposed techniques in the context of GAN training, which is a challenging stochastic optimization problem where the objectives of both players are non-convex. We propose to evaluate the Adam variants of the different optimization algorithms (see Alg. 4 for Adam with extrapolation) by training two different architectures on the CIFAR10 dataset (Krizhevsky and Hinton, 2009). First, we consider a constrained zero-sum game by training the DCGAN architecture (Radford et al., 2016) with the WGAN objective and weight clipping, as proposed in Arjovsky et al. (2017). Then we compare the different methods on a state-of-the-art architecture, by training a ResNet with the WGAN-GP objective, similar to Gulrajani et al. (2017).

⁸ In the original WGAN paper (Arjovsky et al., 2017) the authors use k = 5.

[Table 1: inception scores per method (rows: SimAdam, AltAdam5, ExtraAdam, PastExtraAdam, OptimAdam) and averaging scheme (columns: no averaging, uniform avg and EMA for WGAN (DCGAN); no averaging and uniform avg for WGAN-GP (ResNet)). Legible entries for WGAN (DCGAN) without averaging: SimAdam 6.05, ExtraAdam 6.38, PastExtraAdam 5.98, OptimAdam 5.74; the remaining entries are not legible in this transcription.]

Table 1: Best inception scores (averaged over 5 runs) achieved on CIFAR10 for every considered Adam variant. OptimAdam is the related Optimistic Adam (Daskalakis et al., 2018) algorithm. EMA denotes exponential moving average (with β = 0.999, see Eq. (8)). We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines.

Figure 4: Left: mean and standard deviation of the inception score computed over 5 runs for each method on WGAN trained on CIFAR10. To keep the graph readable we show only SimAdam, but AltAdam performs similarly. Middle: samples from a ResNet generator trained with the WGAN-GP objective using AvgExtraAdam. Right: WGAN-GP trained on CIFAR10; mean and standard deviation of the inception score computed over 5 runs for each method using the best performing learning rate, plotted over wall-clock time; all experiments were run on an NVIDIA Quadro GP100 GPU. We see that ExtraAdam converges faster than the Adam baselines.

Models are evaluated using the inception score (Salimans et al., 2016) computed on 10,000 samples. For each algorithm, we did an extensive search over the hyperparameters of Adam; we fixed β₁ = 0.5 and β₂ = 0.9 for all methods as they seemed to perform well. We want to emphasize that, as proposed by Heusel et al. (2017), it is quite important to set different learning rates for the generator and the discriminator. Experiments were run with 5 random seeds for 500,000 updates of the generator.

Table 1 reports the best inception score achieved on this problem by each considered method. We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines (see §G.4 for more experiments on averaging). Fig. 4 shows training curves for each method (for their best performing learning rate), as well as samples from a ResNet generator trained with ExtraAdam on a WGAN-GP objective. For both tasks, using an extrapolation step and averaging with Adam (ExtraAdam) outperformed all other methods. ExtraAdam with averaging is able to reach state-of-the-art results on CIFAR10, similar to Miyato et al. (2018). We also observed that methods based on extrapolation are less sensitive to the choice of learning rate and can be used with higher learning rates with less degradation; see §G.3 for more details.

8 CONCLUSION

We newly addressed GAN objectives in the framework of variational inequalities. We tapped into the optimization literature to provide more principled, sound techniques to optimize such games. We leveraged these techniques to develop practical optimization algorithms suitable for a wide range of GAN training objectives (including non-zero-sum games and projections onto constraints). We experimentally verified that this could yield better trained models, achieving, to our knowledge, the best inception score when optimizing a WGAN objective on the reference unmodified DCGAN architecture (Radford et al., 2016).
The presented techniques address a fundamental problem in GAN training in a principled way, and are orthogonal to the design of new GAN architectures and objectives. They are thus likely to be widely applicable, and to benefit future development of GANs.

ACKNOWLEDGMENTS

This research was partially supported by the Canada Excellence Research Chair in Data Science for Real-time Decision-making and by the NSERC Discovery Grant RGPIN. Gauthier Gidel would like to acknowledge Benoît Joly and Florestan Martin-Baillon for bringing a fresh point of view on the proof of Proposition 1.

REFERENCES

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
K. E. Atkinson. An Introduction to Numerical Analysis. John Wiley & Sons, 2003.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
R. E. Bruck. On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 1977.
G. H. Chen and R. T. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 1997.
C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In COLT, 2012.
G. P. Crespi, A. Guerraggio, and M. Rocca. Minty variational inequality and optimization: scalar and vector case. In Generalized Convexity, Generalized Monotonicity and Applications, 2005.
C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. In ICLR, 2018.
W. Fedus, M. Rosca, B. Lakshminarayanan, A. M. Dai, S. Mohamed, and I. Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. In ICLR, 2018.
G. Gidel, T. Jebara, and S. Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In AISTATS, 2017.
I. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint, 2016.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
P. Grnarova, K. Y. Levy, A. Lucchi, T. Hofmann, and A. Krause. An online learning approach to generative adversarial networks. In ICLR, 2018.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.
E. Hazan, K. Singh, and C. Zhang. Efficient regret minimization in non-convex games. In ICML, 2017.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
A. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 2017.
A. Juditsky, A. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 2011.
T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
G. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 1976.
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, Canada, 2009.
T. Larsson and M. Patriksson. A class of gap functions for variational inequalities. Mathematical Programming, 1994.
C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
Y. Li, A. Schwing, K.-C. Wang, and R. Zemel. Dualing GANs. In NIPS, 2017.
P. Mertikopoulos, H. Zenati, B. Lecouat, C.-S. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint, 2018.
L. Mescheder, S. Nowozin, and A. Geiger. The numerics of GANs. In NIPS, 2017.
L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, 2018.
L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein. Unrolled generative adversarial networks. In ICLR, 2017.
T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
V. Nagarajan and J. Z. Kolter. Gradient descent GAN optimization is locally stable. In NIPS, 2017.
A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 2009.
A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 2004.
A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.
Y. Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
Y. Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 2007.
S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
B. Palaniappan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In NIPS, 2016.
B. T. Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 1963.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, 2013.
L. J. Ratliff, S. A. Burden, and S. S. Sastry. On the characterization of local Nash equilibria in continuous games. IEEE Transactions on Automatic Control, 2016.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
I. Sutskever. Training Recurrent Neural Networks. PhD thesis, University of Toronto, 2013.
P. Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 1995.
J. Von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
A. Yadav, S. Shah, Z. Xu, D. Jacobs, and T. Goldstein. Stabilizing adversarial nets with prediction methods. In ICLR, 2018.
Y. Yazıcı, C.-S. Foo, S. Winkler, K.-H. Yap, G. Piliouras, and V. Chandrasekhar. The unusual effectiveness of averaging in GAN training. arXiv preprint, 2018.
F. Yousefian, A. Nedić, and U. V. Shanbhag. Optimal robust smoothing extragradient algorithms for stochastic variational inequality problems. In CDC. IEEE, 2014.
J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

A DEFINITIONS

In this section we recall usual definitions and lemmas from convex analysis. We start with the definitions and lemmas regarding the projection mapping.

A.1 PROJECTION MAPPING

Definition 1. The projection P_Ω onto Ω is defined as,

  P_Ω(ω′) ∈ argmin_{ω∈Ω} ‖ω − ω′‖².  (30)

When Ω is a convex set, this projection is unique. This is a consequence of the following lemma, which we will use in the following sections: the non-expansiveness of the projection onto a convex set.

Lemma 1. Let Ω be a convex set; the projection mapping P_Ω : ℝ^d → Ω is non-expansive, i.e.,

  ‖P_Ω(ω) − P_Ω(ω′)‖ ≤ ‖ω − ω′‖, ∀ω, ω′ ∈ ℝ^d.  (31)

This is a standard convex analysis result which can be found for instance in (Boyd and Vandenberghe, 2004). The following lemma is also standard in convex analysis and its proof uses similar arguments as the proof of Lemma 1.

Lemma 2. Let ω ∈ Ω and ω⁺ = P_Ω(ω + u); then, for all ω′ ∈ Ω, we have

  ‖ω⁺ − ω′‖² ≤ ‖ω − ω′‖² + 2uᵀ(ω⁺ − ω′) − ‖ω⁺ − ω‖².  (32)

Proof of Lemma 2. We start by simply developing,

  ‖ω⁺ − ω′‖² = ‖(ω⁺ − ω) + (ω − ω′)‖²
  = ‖ω − ω′‖² + 2(ω⁺ − ω)ᵀ(ω − ω′) + ‖ω⁺ − ω‖²
  = ‖ω − ω′‖² + 2(ω⁺ − ω)ᵀ(ω⁺ − ω′) − ‖ω⁺ − ω‖².

Then, since ω⁺ is the projection onto the convex set Ω of ω + u, we have (ω⁺ − (ω + u))ᵀ(ω⁺ − ω′) ≤ 0, ∀ω′ ∈ Ω, i.e. (ω⁺ − ω)ᵀ(ω⁺ − ω′) ≤ uᵀ(ω⁺ − ω′), leading to the result of the lemma.

A.2 SMOOTHNESS AND MONOTONICITY OF THE OPERATOR

Another important property used is the Lipschitzness of an operator.

Definition 2. A mapping F : ℝ^p → ℝ^d is said to be L-Lipschitz if

  ‖F(ω) − F(ω′)‖ ≤ L‖ω − ω′‖, ∀ω, ω′ ∈ Ω.  (33)

Definition 3. A differentiable function f : Ω → ℝ is said to be μ-strongly convex if

  f(ω) ≥ f(ω′) + ∇f(ω′)ᵀ(ω − ω′) + (μ/2)‖ω − ω′‖², ∀ω, ω′ ∈ Ω.  (34)

Definition 4. A function (θ, ϕ) ↦ L(θ, ϕ) is said to be convex-concave if L(·, ϕ) is convex for all ϕ ∈ Φ and L(θ, ·) is concave for all θ ∈ Θ. L is said to be μ-strongly convex-concave if (θ, ϕ) ↦ L(θ, ϕ) − (μ/2)‖θ‖² + (μ/2)‖ϕ‖² is convex-concave.

Definition 5. For μ > 0, an operator F : Ω → ℝ^d is said to be μ-strongly monotone if

  (F(ω) − F(ω′))ᵀ(ω − ω′) ≥ μ‖ω − ω′‖², ∀ω, ω′ ∈ Ω.  (35)
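As a quick sanity check of Lemma 1 for a concrete convex set, the following snippet verifies non-expansiveness of the projection onto a box (coordinate-wise clipping, the same constraint used for WGAN weight clipping):

```python
import numpy as np

# Numerical check of Lemma 1 (non-expansiveness of the projection) for the box
# Omega = [-1, 1]^d, whose projection is coordinate-wise clipping.
rng = np.random.default_rng(0)
proj = lambda w: np.clip(w, -1.0, 1.0)
for _ in range(10000):
    u, v = rng.normal(size=5) * 3, rng.normal(size=5) * 3
    assert np.linalg.norm(proj(u) - proj(v)) <= np.linalg.norm(u - v) + 1e-12
print("non-expansiveness holds on all sampled pairs")
```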

B GRADIENT METHODS ON UNCONSTRAINED BILINEAR GAMES

In this section we prove the results provided in §3, namely Proposition 1, Proposition 2 and Theorem 1. For Propositions 1 and 2, let us recall the context: we want to derive properties of some gradient methods on the following simple illustrative example,

  min_{θ∈ℝ} max_{ϕ∈ℝ} θ·ϕ.  (36)

B.1 PROOF OF PROPOSITION 1

Let us first recall the proposition:

Proposition 1. The simultaneous iterates diverge geometrically and the alternating iterates defined in (11) are bounded but do not converge to 0, as

  Simultaneous: θ²_{t+1} + ϕ²_{t+1} = (1 + η²)(θ²_t + ϕ²_t),  Alternating: θ²_t + ϕ²_t = Θ(θ²_0 + ϕ²_0),  (37)

where u_t = Θ(v_t) ⇔ ∃α, β > 0 : α v_t ≤ u_t ≤ β v_t. The uniform average (θ̄_t, ϕ̄_t) = (1/t) Σ_{s=0}^{t−1} (θ_s, ϕ_s) of the simultaneous updates (resp. the alternating updates) diverges (resp. converges to 0), as

  Simultaneous: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0)(1 + η²)^t / (η² t²) ),  Alternating: θ̄²_t + ϕ̄²_t = Θ( (θ²_0 + ϕ²_0) / (η² t²) ).  (38)

Proof. Let us start with the simultaneous update rule:

  θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_t.  (39)

Then we have,

  θ²_{t+1} + ϕ²_{t+1} = (θ_t − ηϕ_t)² + (ϕ_t + ηθ_t)²  (40)
  = (1 + η²)(θ²_t + ϕ²_t).  (41)

The update rule (39) also gives us,

  ηϕ_t = θ_t − θ_{t+1} and ηθ_t = ϕ_{t+1} − ϕ_t.  (42)

Summing these equations for 0 ≤ t ≤ T − 1, we get

  η²T²(θ̄²_T + ϕ̄²_T) = (θ_0 − θ_T)² + (ϕ_T − ϕ_0)²  (43)
  = ((1 + η²)^T + 1)(θ²_0 + ϕ²_0) − 2(θ_0θ_T + ϕ_0ϕ_T)  (44)
  = Θ( (1 + η²)^T (θ²_0 + ϕ²_0) ),  (45)

where the last line uses the Cauchy-Schwarz inequality |θ_0θ_T + ϕ_0ϕ_T| ≤ (1 + η²)^{T/2} (θ²_0 + ϕ²_0) = o((1 + η²)^T).

Let us now continue with the alternating update rule,

  θ_{t+1} = θ_t − ηϕ_t, ϕ_{t+1} = ϕ_t + ηθ_{t+1} = ϕ_t + η(θ_t − ηϕ_t).  (46)

Then we have

  [θ_{t+1}; ϕ_{t+1}] = M [θ_t; ϕ_t], where M := [[1, −η], [η, 1 − η²]].  (47)

By simple linear algebra, for η < 2 the matrix M has two complex conjugate eigenvalues,

  λ_± = 1 − η²/2 ± (iη/2)√(4 − η²),  (48)

and their squared magnitude is equal to det(M) = 1 − η² + η² = 1. We can diagonalize M, meaning that there exists an invertible matrix P such that M = P diag(λ_+, λ_−) P⁻¹. Then we have

  [θ_t; ϕ_t] = M^t [θ_0; ϕ_0] = P diag(λ^t_+, λ^t_−) P⁻¹ [θ_0; ϕ_0],  (49)

and consequently, since |λ_±| = 1,

  θ²_t + ϕ²_t = ‖P diag(λ^t_+, λ^t_−) P⁻¹ [θ_0; ϕ_0]‖²_ℂ ≤ ‖P‖² ‖P⁻¹‖² (θ²_0 + ϕ²_0),  (50)

where ‖·‖_ℂ is the norm in ℂ² and ‖P‖ := max_{u∈ℂ², ‖u‖_ℂ≤1} ‖Pu‖_ℂ is the induced matrix norm. The same way, we have

  θ²_0 + ϕ²_0 = ‖P diag(λ^{−t}_+, λ^{−t}_−) P⁻¹ [θ_t; ϕ_t]‖²_ℂ ≤ ‖P‖² ‖P⁻¹‖² (θ²_t + ϕ²_t).  (51)

Hence, if θ²_0 + ϕ²_0 > 0, the sequence (θ_t, ϕ_t) is bounded but does not converge to 0. Moreover, the update rule gives us

  η Σ_{t=0}^{T−1} ϕ_t = θ_0 − θ_T and η Σ_{t=0}^{T−1} θ_{t+1} = ϕ_T − ϕ_0  (52)

(so the average of θ is taken over t = 1, …, T; this index shift does not affect the rate). Consequently, since θ²_T + ϕ²_T = O(θ²_0 + ϕ²_0), we get

  θ̄²_T + ϕ̄²_T = O( (θ²_0 + ϕ²_0) / (η² T²) ).  (53)

B.2 IMPLICIT AND EXTRAPOLATION METHODS

In this section we prove a slightly more precise version of Proposition 2:

Proposition 2 (precise version). The squared norm of the iterates, N_t = θ²_t + ϕ²_t, where the update rules of θ_t and ϕ_t are defined in (18), decreases geometrically for any 0 < η < 1, as

  Implicit: N_{t+1} = N_t / (1 + η²),  Extrapolation: N_{t+1} = (1 − η² + η⁴) N_t, ∀t ≥ 0.  (54)

Proof. Let us recall the update rule for the implicit method: θ_{t+1} = θ_t − ηϕ_{t+1}, ϕ_{t+1} = ϕ_t + ηθ_{t+1}. Then,

  (1 + η²) θ_{t+1} = θ_t − ηϕ_t, (1 + η²) ϕ_{t+1} = ϕ_t + ηθ_t,  (55)

implying that

  (1 + η²)² (θ²_{t+1} + ϕ²_{t+1}) = (θ_t − ηϕ_t)² + (ϕ_t + ηθ_t)²  (56)
  = (1 + η²)(θ²_t + ϕ²_t),  (57)

and hence

  θ²_{t+1} + ϕ²_{t+1} = (θ²_t + ϕ²_t) / (1 + η²).  (58)

For the extrapolation method we have the update rule

  θ_{t+1} = θ_t − η(ϕ_t + ηθ_t), ϕ_{t+1} = ϕ_t + η(θ_t − ηϕ_t),  (59)

implying that

  θ²_{t+1} + ϕ²_{t+1} = (θ_t − η(ϕ_t + ηθ_t))² + (ϕ_t + η(θ_t − ηϕ_t))²  (60)
  = θ²_t + ϕ²_t − 2η²(θ²_t + ϕ²_t) + η²((θ_t − ηϕ_t)² + (ϕ_t + ηθ_t)²)  (61)
  = (1 − η² + η⁴)(θ²_t + ϕ²_t).  (62)
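The two contraction factors of Proposition 2 can be checked numerically, using the closed form (55) for the implicit step:

```python
import numpy as np

# Numerical check of Proposition 2: per-step contraction of N_t = theta^2 + phi^2
# for the implicit method (factor 1/(1 + eta^2)) and extrapolation (1 - eta^2 + eta^4).
eta = 0.3
th, ph = 1.0, -2.0
for _ in range(5):
    N = th**2 + ph**2
    # Implicit step, solved in closed form via Eq. (55):
    th_i = (th - eta * ph) / (1 + eta**2)
    ph_i = (ph + eta * th) / (1 + eta**2)
    print((th_i**2 + ph_i**2) / N, 1 / (1 + eta**2))     # factors match
    # Extrapolation step, Eq. (18):
    th_e = th - eta * (ph + eta * th)
    ph_e = ph + eta * (th - eta * ph)
    print((th_e**2 + ph_e**2) / N, 1 - eta**2 + eta**4)  # factors match
    th, ph = th_e, ph_e
```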

B.3 EXTRAPOLATION FROM THE PAST

Let us recall what we call projected extrapolation from the past, where we use the notation ω′_t = ω_{t+1/2} for compactness:

  Extrapolation from the past: ω′_t = P_Ω[ω_t − ηF(ω′_{t−1})],  (63)
  Perform update step: ω_{t+1} = P_Ω[ω_t − ηF(ω′_t)], and store F(ω′_t),  (64)

where P_Ω[·] is the projection onto the constraint set Ω. Recall that an operator F : Ω → ℝ^d is said to be μ-strongly monotone if

  (F(ω) − F(ω′))ᵀ(ω − ω′) ≥ μ‖ω − ω′‖².  (65)

If F is strongly monotone, we can prove the following theorem:

Theorem 1. If F is μ-strongly monotone (see §A for the definition of strong monotonicity) and L-Lipschitz, then the updates (20) and (21) with η = 1/(4L) provide linearly converging iterates,

  ‖ω_t − ω*‖² ≤ (1 − μ/(4L))^t ‖ω_0 − ω*‖², ∀t ≥ 0.  (66)

Proof. In order to prove this theorem, we will prove the slightly more general result

  ‖ω_{t+1} − ω*‖² + (3/16)‖ω′_t − ω′_{t−1}‖² ≤ (1 − μ/(4L)) ( ‖ω_t − ω*‖² + (3/16)‖ω′_{t−1} − ω′_{t−2}‖² ),  (67)

implying, with the convention ω′_{−1} = ω′_{−2} = ω_0 (which makes the second term vanish at t = 0), that

  ‖ω_t − ω*‖² ≤ ‖ω_t − ω*‖² + (3/16)‖ω′_{t−1} − ω′_{t−2}‖² ≤ (1 − μ/(4L))^t ‖ω_0 − ω*‖².  (68)

Let us first prove three technical lemmas.

Lemma 3. If F is μ-strongly monotone, we have, for any solution ω* of (VIP),

  μ ( (1/2)‖ω_t − ω*‖² − ‖ω_t − ω′_t‖² ) ≤ F(ω′_t)ᵀ(ω′_t − ω*).  (69)

Proof. By the strong monotonicity of F and the optimality of ω*,

  μ‖ω′_t − ω*‖² ≤ F(ω*)ᵀ(ω′_t − ω*) + μ‖ω′_t − ω*‖² ≤ F(ω′_t)ᵀ(ω′_t − ω*),  (70)

and then we use the inequality (1/2)‖ω_t − ω*‖² − ‖ω_t − ω′_t‖² ≤ ‖ω′_t − ω*‖² (a consequence of ‖a + b‖² ≤ 2‖a‖² + 2‖b‖²) to get the claimed result.

Lemma 4. If F is L-Lipschitz, we have, for any ω ∈ Ω,

  2η_t F(ω′_t)ᵀ(ω′_t − ω) ≤ ‖ω_t − ω‖² − ‖ω_{t+1} − ω‖² + η²_t L² ‖ω′_t − ω′_{t−1}‖² − ‖ω′_t − ω_t‖².  (71)

Proof. Applying Lemma 2 with (ω, u, ω⁺, ω′) = (ω_t, −η_tF(ω′_t), ω_{t+1}, ω) and with (ω, u, ω⁺, ω′) = (ω_t, −η_tF(ω′_{t−1}), ω′_t, ω_{t+1}), we get

  ‖ω_{t+1} − ω‖² ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω_{t+1} − ω) − ‖ω_{t+1} − ω_t‖²  (72)

and

  ‖ω′_t − ω_{t+1}‖² ≤ ‖ω_t − ω_{t+1}‖² − 2η_t F(ω′_{t−1})ᵀ(ω′_t − ω_{t+1}) − ‖ω′_t − ω_t‖².  (73)

Summing (72) and (73), we get

  ‖ω_{t+1} − ω‖² ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω_{t+1} − ω) − 2η_t F(ω′_{t−1})ᵀ(ω′_t − ω_{t+1}) − ‖ω′_t − ω_t‖² − ‖ω′_t − ω_{t+1}‖²  (74, 75)
  = ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) − ‖ω′_t − ω_t‖² − ‖ω′_t − ω_{t+1}‖² − 2η_t (F(ω′_{t−1}) − F(ω′_t))ᵀ(ω′_t − ω_{t+1}).  (76)

Then, we can use Young's inequality, 2aᵀb ≤ ‖a‖² + ‖b‖², to get

  ‖ω_{t+1} − ω‖² ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) + η²_t ‖F(ω′_t) − F(ω′_{t−1})‖² + ‖ω′_t − ω_{t+1}‖² − ‖ω′_t − ω_t‖² − ‖ω′_t − ω_{t+1}‖²  (77)
  = ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) + η²_t ‖F(ω′_t) − F(ω′_{t−1})‖² − ‖ω′_t − ω_t‖²
  ≤ ‖ω_t − ω‖² − 2η_t F(ω′_t)ᵀ(ω′_t − ω) + η²_t L² ‖ω′_t − ω′_{t−1}‖² − ‖ω′_t − ω_t‖².  (78)

Lemma 5. For all t ≥ 0, if we set ω′_{−1} = ω′_{−2} = ω_0 and use a constant step-size η, we have

  ‖ω′_t − ω′_{t−1}‖² ≤ 2‖ω′_t − ω_t‖² + 2η²L² ‖ω′_{t−1} − ω′_{t−2}‖².  (79)

Proof. We start with the triangle inequality,

  ‖ω′_t − ω′_{t−1}‖ ≤ ‖ω′_t − ω_t‖ + ‖ω_t − ω′_{t−1}‖.  (80)

Moreover, since ω_t and ω′_{t−1} are both projections of points built from ω_{t−1}, and since the projection is non-expansive, we have

  ‖ω_t − ω′_{t−1}‖ ≤ ‖(ω_{t−1} − ηF(ω′_{t−1})) − (ω_{t−1} − ηF(ω′_{t−2}))‖  (81)
  = η‖F(ω′_{t−1}) − F(ω′_{t−2})‖  (82)
  ≤ ηL‖ω′_{t−1} − ω′_{t−2}‖.  (83)

Combining (80) and (83) with (a + b)² ≤ 2a² + 2b², we get

  ‖ω′_t − ω′_{t−1}‖² ≤ 2‖ω′_t − ω_t‖² + 2‖ω_t − ω′_{t−1}‖²  (84, 85)
  ≤ 2‖ω′_t − ω_t‖² + 2η²L² ‖ω′_{t−1} − ω′_{t−2}‖².  (86)

Proof of Theorem 1. Let ω* ∈ Ω* be an optimal point of (VIP). Combining Lemma 3 (multiplied by 2η_t) and Lemma 4 with ω = ω*, we get

  2η_t μ ( (1/2)‖ω_t − ω*‖² − ‖ω_t − ω′_t‖² ) ≤ ‖ω_t − ω*‖² − ‖ω_{t+1} − ω*‖² + η²_t L² ‖ω′_t − ω′_{t−1}‖² − ‖ω′_t − ω_t‖²,

leading to

  ‖ω_{t+1} − ω*‖² ≤ (1 − η_t μ)‖ω_t − ω*‖² + η²_t L² ‖ω′_t − ω′_{t−1}‖² − (1 − 2η_t μ)‖ω′_t − ω_t‖².  (87)

Now set η_t = η = 1/(4L); since μ ≤ L, we have ημ ≤ 1/4 and η²L² = 1/16. Using Lemma 5 on the middle term gives

  ‖ω_{t+1} − ω*‖² ≤ (1 − ημ)‖ω_t − ω*‖² + (1/128)‖ω′_{t−1} − ω′_{t−2}‖² − (1 − 2ημ − 1/8)‖ω′_t − ω_t‖²
  ≤ (1 − ημ)‖ω_t − ω*‖² + (1/128)‖ω′_{t−1} − ω′_{t−2}‖² − (3/8)‖ω′_t − ω_t‖².  (88)

Adding (3/16)‖ω′_t − ω′_{t−1}‖² to both sides, bounding it with Lemma 5, (3/16)‖ω′_t − ω′_{t−1}‖² ≤ (3/8)‖ω′_t − ω_t‖² + (3/128)‖ω′_{t−1} − ω′_{t−2}‖², and using 1/128 + 3/128 = 1/32 ≤ (3/16)(1 − ημ), we get

  ‖ω_{t+1} − ω*‖² + (3/16)‖ω′_t − ω′_{t−1}‖² ≤ (1 − μ/(4L)) ( ‖ω_t − ω*‖² + (3/16)‖ω′_{t−1} − ω′_{t−2}‖² ),  (89)

which is (67) and concludes the proof.

C MORE ON MERIT FUNCTIONS

In this section, we present how to handle an unbounded constraint set Ω with a more refined merit function than (25), which was used in the main paper. Let F be the continuous operator and Ω the constraint set associated with the VIP

  find ω* ∈ Ω such that F(ω*)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω.  (VIP)

When the operator F is monotone, we have F(ω*)ᵀ(ω − ω*) ≤ F(ω)ᵀ(ω − ω*), ∀ω, ω*. Hence, in this case, (VIP) implies a stronger formulation sometimes called the Minty variational inequality (Crespi et al., 2005):

  find ω* ∈ Ω such that F(ω)ᵀ(ω − ω*) ≥ 0, ∀ω ∈ Ω.  (MVI)

This formulation is stronger in the sense that if (MVI) holds for some ω* ∈ Ω, then (VIP) holds too. A merit function useful for our analysis can be derived from this formulation. Roughly, a merit function is a convergence measure. More formally, a function g : Ω → ℝ is called a merit function if g is non-negative and such that g(ω) = 0 ⇔ ω ∈ Ω* (Larsson and Patriksson, 1994). A way to derive a merit function from (MVI) would be to use g(ω*) = sup_{ω∈Ω} F(ω)ᵀ(ω* − ω), which is zero if and only if (MVI) holds for ω*. To deal with unbounded constraint sets (leading to a potentially infinite-valued function outside of the optimal set), we use the restricted merit function (Nesterov, 2007):

  Err_R(ω_t) = max_{ω∈Ω, ‖ω−ω_0‖≤R} F(ω)ᵀ(ω_t − ω).  (90)

This function acts as a merit function for (VIP) on the interior of the open ball of radius R around ω_0, as shown in Lemma 1 of Nesterov (2007). That is, let Ω_R = Ω ∩ {ω : ‖ω − ω_0‖ < R}. Then, for any point ω̂ ∈ Ω_R, we have:

  Err_R(ω̂) = 0 ⇔ ω̂ ∈ Ω* ∩ Ω_R.  (91)

The reference point ω_0 is arbitrary, but in practice it is usually the initialization point of the algorithm. R has to be big enough to ensure that Ω_R contains a solution. Err_R measures how much (MVI) is violated on the restriction Ω_R. Such a merit function is standard in the variational inequality literature; a similar one is used in (Nemirovski, 2004; Juditsky et al., 2011). When F is derived from the gradients of a zero-sum game as in (93) below, we can define a more interpretable merit function. One has to be careful, though, when extending properties from the minimization setting to the saddle point setting (e.g. the merit function used by Yadav et al. (2018) is vacuous for a bilinear game, as explained in §C.2).

In the appendix, we adopt a set of assumptions a little more general than those of the main paper:

Assumption 4. F is a monotone operator, Ω is convex and closed, and R is set big enough such that R > ‖ω_0 − ω*‖ for some solution ω* ∈ Ω*.

Contrary to Assumption 3, in Assumption 4 the constraint set is no longer assumed to be bounded. Assumption 4 is more general than Assumption 3, since if Ω is bounded then Assumption 4 holds with R set to the diameter of Ω.

C.1 MORE GENERAL MERIT FUNCTIONS

In the appendix, we denote by Err_R^(VI) the restricted merit function defined in (90). Let us recall its definition:

  Err_R^(VI)(ω_t) = max_{ω∈Ω, ‖ω−ω_0‖≤R} F(ω)ᵀ(ω_t − ω).  (92)

When the objective is a saddle point problem, i.e.

  F(θ, ϕ) = [∇_θL(θ, ϕ); −∇_ϕL(θ, ϕ)],  (93)

and L is convex-concave (see Definition 4 in §A), we can use another merit function than (92) on Ω_R that is more interpretable and more directly related to the cost function of the minimax formulation:

  Err_R^(SP)(θ_t, ϕ_t) = max_{(θ,ϕ)∈Θ×Φ, ‖(θ,ϕ)−(θ_0,ϕ_0)‖≤R} L(θ_t, ϕ) − L(θ, ϕ_t).  (94)

In particular, if the equilibrium (θ*, ϕ*) ∈ Ω* ∩ Ω_R and L(·, ϕ*) and L(θ*, ·) are μ-strongly convex (resp. μ-strongly concave) (see §A), then the merit function for saddle points upper bounds the distance from (θ, ϕ) ∈ Ω_R to the equilibrium, as:

  Err_R^(SP)(θ, ϕ) ≥ (μ/2) ( ‖θ − θ*‖² + ‖ϕ − ϕ*‖² ).  (95)

In the appendix, we provide our convergence results with the merit functions (92) and (94), depending on the setup:

  Err_R(ω) = Err_R^(SP)(ω) if F is a saddle point operator (93), and Err_R(ω) = Err_R^(VI)(ω) otherwise.  (96)

C.2 ON THE IMPORTANCE OF THE MERIT FUNCTION

In this section, we illustrate the fact that one has to be careful when extending results and properties from the minimization setting to the minimax setting (and consequently to the variational inequality setting). Another candidate merit function for saddle point optimization would be to naturally extend the suboptimality f(ω) − f(ω*) used in standard minimization (i.e. find ω* the minimizer of f) to the gap P(θ, ϕ) = L(θ, ϕ*) − L(θ*, ϕ). In a previous analysis of a modification of the stochastic gradient descent (SGD) method for GANs, Yadav et al. (2018) gave their convergence rate on P, which they called the primal-dual gap. Unfortunately, if we do not assume that the function L is strongly convex-concave (a stronger assumption defined in §A, which fails for a bilinear objective, for example), P may not be a merit function: it can be 0 for a non-optimal point (see for instance the discussion on the differences between (94) and P in (Gidel et al., 2017, Section 3)). In particular, for the simple 2D bilinear example L(θ, ϕ) = θ·ϕ, we have θ* = ϕ* = 0 and thus P(θ, ϕ) = 0, ∀θ, ϕ.

C.3 VARIATIONAL INEQUALITIES FOR NON-CONVEX COST FUNCTIONS

When the cost functions defined in (3) are non-convex, the operator F is no longer monotone. Nevertheless, (VIP) and (MVI) can still be defined, though a solution to (MVI) is less likely to exist. We note that (VIP) is a local condition for F (as it only evaluates F at the point ω*). On the other hand, an appealing property of (MVI) is that it is a global condition. In the context of the minimization of a function f, for example (where F = ∇f), if ω* solves (MVI), then ω* is a global minimum of f, and not just a stationary point (see Proposition 2.2 from Crespi et al. (2005)).

A less restrictive way to consider variational inequalities in the non-monotone setting is to use a local version of (MVI). If the cost functions are locally convex around the optimal couple (θ*, ϕ*), and if our iterates eventually fall and stay in that neighborhood, then we can consider our restricted merit function Err_R(·) with a well-suited constant R and apply our convergence results for monotone operators.

D ANOTHER WAY OF IMPLEMENTING EXTRAPOLATION IN SGD

We now introduce another way to combine extrapolation and SGD. This extension is very similar to Alg. 2; the only difference is that it re-uses the minibatch sample of the extrapolation step for the update of the current point. The intuition is that this correlates the gradient estimator of the extrapolation step with that of the update step, leading to a better correction of the oscillations, which are also due to the stochasticity. One emerging issue (for the analysis) of this method is that, since ω′_t depends on ξ_t, the quantity F(ω′_t, ξ_t) is a biased estimator of F(ω′_t).

Algorithm 5 Re-used minibatches for stochastic extrapolation (ReExtraSGD)
1: Let $\omega_0 \in \Omega$
2: for $t = 0 \ldots T-1$ do
3:   Sample $\xi_t \sim P$
4:   $\omega'_t = P_\Omega[\omega_t - \eta_t F(\omega_t, \xi_t)]$   (extrapolation step)
5:   $\omega_{t+1} = P_\Omega[\omega_t - \eta_t F(\omega'_t, \xi_t)]$   (update step with the same sample)
6: end for
7: Return $\bar\omega_T = \sum_{t=0}^{T-1} \eta_t \omega'_t \big/ \sum_{t=0}^{T-1} \eta_t$

Theorem 5. Assume that $\|\omega'_t - \omega_0\| \le R$ for all $t \ge 0$, where $(\omega'_t)_{t \ge 0}$ are the iterates of Alg. 5. Under Assumptions 1 and 4, for any $T \ge 1$, Alg. 5 with constant step-size $\eta \le \frac{1}{2L}$ has the following convergence properties:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta\sigma^2}{2} + 2\eta(4L^2R^2 + \sigma^2)$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.

Particularly, $\eta \propto 1/\sqrt{T}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] = O(1/\sqrt{T})$.

The assumption that the sequence of iterates provided by the algorithm is bounded is strong, but it has also been made for instance in (Yadav et al., 2018). The proof of this result is provided in §F.

E VARIANCE COMPARISON BETWEEN AVGSGD AND SGD WITH PREDICTION METHOD

To compare the variance term of AvgSGD in (6) with the one of the SGD with prediction method (Yadav et al., 2018), we need to have the same convergence certificate. Fortunately, their proof can be adapted to our convergence criterion (using Lemma 6 in §F), revealing an extra $\sigma^2/2$ in the variance term from their paper. The resulting variance can be summarized with our notation as $(M^2(1+L)^2 + \sigma^2)/\sqrt{T}$, where $L$ is the Lipschitz constant of the operator $F$. Since $M^2 \ge \sigma^2$, their variance term is then roughly $(1+L)^2$ times larger than the one provided by the AvgSGD method.

F PROOF OF THE THEOREMS

This section is dedicated to the proofs of the theorems provided in this paper, in a slightly more general form working with the merit function defined in (96). First we prove an additional lemma necessary for the proofs of our theorems.

Lemma 6. Let $F$ be a monotone operator and let $(\omega_t)$, $(\omega'_t)$, $(z_t)$, $(\Delta_t)$, $(\xi_t)$ and $(\zeta_t)$ be six random sequences such that, for all $\omega \in \Omega$ with $\|\omega - \omega_0\| \le R$ and all $t \ge 0$,

$\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_t - N_{t+1} + \frac{\eta_t^2}{2}\big(M_1(\omega_t, \xi_t) + M_2(\omega'_t, \zeta_t)\big) + \eta_t \Delta_t^\top(z_t - \omega)$,

where $N_t = N(\omega_t, \omega'_t, \omega'_{t-1}) \ge 0$ and we extend $(\omega'_t)$ with $\omega'_{-1} = \omega'_{-2} = \omega_0$. Assume moreover that $N_0 \le R^2/2$, $\mathbb{E}[\|\Delta_t\|^2] \le \sigma^2$, $\mathbb{E}[\Delta_t \mid z_t, \Delta_0, \ldots, \Delta_{t-1}] = 0$, $\mathbb{E}[M_1(\omega_t, \xi_t)] \le M_1$ and $\mathbb{E}[M_2(\omega'_t, \zeta_t)] \le M_2$. Then,

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (M_1 + M_2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$,  (97)

where $\bar\omega_T = \sum_{t=0}^{T-1}\eta_t\omega'_t / S_T$ and $S_T = \sum_{t=0}^{T-1}\eta_t$.

Proof of Lemma 6. We sum the assumed inequality for $0 \le t \le T-1$ to get

$\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_0 + \sum_{t=0}^{T-1}\Big[\frac{\eta_t^2}{2}\big(M_1(\omega_t,\xi_t) + M_2(\omega'_t,\zeta_t)\big) + \eta_t \Delta_t^\top(z_t - \omega)\Big]$.  (98)

We will then upper bound the last sum on the right-hand side. Write $\Delta_t^\top(z_t - \omega) = \Delta_t^\top(z_t - u_t) + \Delta_t^\top(u_t - \omega)$, where $u_{t+1} = P_\Omega(u_t - \eta_t \Delta_t)$ and $u_0 = \omega_0$. Then,

$\|u_{t+1} - \omega\|^2 \le \|u_t - \omega\|^2 - 2\eta_t \Delta_t^\top(u_t - \omega) + \eta_t^2\|\Delta_t\|^2$,  (99)

leading to

$\eta_t \Delta_t^\top(z_t - \omega) \le \eta_t \Delta_t^\top(z_t - u_t) + \frac{1}{2}\big(\|u_t - \omega\|^2 - \|u_{t+1} - \omega\|^2\big) + \frac{\eta_t^2}{2}\|\Delta_t\|^2$.  (100)

Then, noticing that $u_0 = \omega_0$ and plugging the telescoping sum back into (98), we get

$\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_0 + \frac{\|\omega_0 - \omega\|^2}{2} + \sum_{t=0}^{T-1}\Big[\frac{\eta_t^2}{2}\big(M_1(\omega_t,\xi_t) + M_2(\omega'_t,\zeta_t) + \|\Delta_t\|^2\big) + \eta_t \Delta_t^\top(z_t - u_t)\Big]$.

If $F$ is the operator of a convex-concave saddle point (5), we get, with $\omega'_t = (\theta_t, \varphi_t)$,

$F(\omega'_t)^\top(\omega'_t - \omega) = \nabla_\theta \mathcal{L}(\theta_t, \varphi_t)^\top(\theta_t - \theta) - \nabla_\varphi \mathcal{L}(\theta_t, \varphi_t)^\top(\varphi_t - \varphi)$
$\ge \mathcal{L}(\theta_t, \varphi) - \mathcal{L}(\theta_t, \varphi_t) + \mathcal{L}(\theta_t, \varphi_t) - \mathcal{L}(\theta, \varphi_t)$  (by convexity and concavity)
$= \mathcal{L}(\theta_t, \varphi) - \mathcal{L}(\theta, \varphi_t)$.

Then, by convexity of $\mathcal{L}(\cdot, \varphi)$ and concavity of $\mathcal{L}(\theta, \cdot)$, we have

$\frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \ge \frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t\big(\mathcal{L}(\theta_t, \varphi) - \mathcal{L}(\theta, \varphi_t)\big) \ge \mathcal{L}(\bar\theta_T, \varphi) - \mathcal{L}(\theta, \bar\varphi_T)$.  (101)

Otherwise, if the operator $F$ is only monotone, since $F(\omega'_t)^\top(\omega'_t - \omega) \ge F(\omega)^\top(\omega'_t - \omega)$ we have

$\frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \ge \frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t F(\omega)^\top(\omega'_t - \omega) = F(\omega)^\top(\bar\omega_T - \omega)$.  (102)

In both cases, using $N_0 \le R^2/2$ and $\|\omega_0 - \omega\| \le R$, the remaining right-hand side no longer depends on $\omega$, so we can maximize the left-hand side with respect to $\omega$ to get

$S_T\,\mathrm{Err}_R(\bar\omega_T) \le R^2 + \sum_{t=0}^{T-1}\Big[\frac{\eta_t^2}{2}\big(M_1(\omega_t,\xi_t) + M_2(\omega'_t,\zeta_t) + \|\Delta_t\|^2\big) + \eta_t \Delta_t^\top(z_t - u_t)\Big]$.  (103)

Then, taking the expectation, since $\mathbb{E}[\Delta_t \mid z_t, u_t] = \mathbb{E}[\Delta_t \mid z_t, \Delta_0, \ldots, \Delta_{t-1}] = 0$ ($u_t$ is a function of $\Delta_0, \ldots, \Delta_{t-1}$), $\mathbb{E}[\|\Delta_t\|^2] \le \sigma^2$, $\mathbb{E}[M_1(\omega_t, \xi_t)] \le M_1$ and $\mathbb{E}[M_2(\omega'_t, \zeta_t)] \le M_2$, we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (M_1 + M_2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (104)

F.1 PROOF OF THM. 2

First let us state Theorem 2 in its general form.

Theorem 2. Under Assumptions 1, 2 and 4, Alg. 1 with constant step-size $\eta$ has the following convergence rate for all $T \ge 1$:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta(M^2 + \sigma^2)}{2}$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega_t$.  (105)

Particularly, $\eta = R\sqrt{\frac{2}{(M^2+\sigma^2)T}}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le R\sqrt{\frac{2(M^2+\sigma^2)}{T}}$.

Proof of Theorem 2. Let $\omega \in \Omega$ be any point such that $\|\omega_0 - \omega\| \le R$. Then,

$\|\omega_{t+1} - \omega\|^2 = \|P_\Omega(\omega_t - \eta_t F(\omega_t, \xi_t)) - \omega\|^2$
$\le \|\omega_t - \eta_t F(\omega_t, \xi_t) - \omega\|^2$  (projections onto a convex set are non-expansive, Lemma 1)
$= \|\omega_t - \omega\|^2 - 2\eta_t F(\omega_t, \xi_t)^\top(\omega_t - \omega) + \eta_t^2\|F(\omega_t, \xi_t)\|^2$.

Then we can make the quantity $F(\omega_t)^\top(\omega_t - \omega)$ appear on the left-hand side:

$2\eta_t F(\omega_t)^\top(\omega_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 + \eta_t^2\|F(\omega_t, \xi_t)\|^2 + 2\eta_t\big(F(\omega_t) - F(\omega_t, \xi_t)\big)^\top(\omega_t - \omega)$.  (106)

We can sum (106) for $0 \le t \le T-1$ to get

$2\sum_{t=0}^{T-1}\eta_t F(\omega_t)^\top(\omega_t - \omega) \le \sum_{t=0}^{T-1}\big(\|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2\big) + \sum_{t=0}^{T-1}\big[\eta_t^2\|F(\omega_t,\xi_t)\|^2 + 2\eta_t \Delta_t^\top(\omega_t - \omega)\big]$,  (107)

where we noted $\Delta_t = F(\omega_t) - F(\omega_t, \xi_t)$. By monotonicity, $F(\omega_t)^\top(\omega_t - \omega) \ge F(\omega)^\top(\omega_t - \omega)$, so we get

$2 S_T F(\omega)^\top(\bar\omega_T - \omega) \le \sum_{t=0}^{T-1}\big(\|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2\big) + \sum_{t=0}^{T-1}\big[\eta_t^2\|F(\omega_t,\xi_t)\|^2 + 2\eta_t \Delta_t^\top(\omega_t - \omega)\big]$,

where $S_T = \sum_{t=0}^{T-1}\eta_t$ and $\bar\omega_T = \frac{1}{S_T}\sum_{t=0}^{T-1}\eta_t\omega_t$. We will then upper bound the last sum on the right-hand side: $\Delta_t^\top(\omega_t - \omega) = \Delta_t^\top(\omega_t - u_t) + \Delta_t^\top(u_t - \omega)$, where $u_{t+1} = P_\Omega(u_t - \eta_t \Delta_t)$ and $u_0 = \omega_0$. Then,

$\|u_{t+1} - \omega\|^2 \le \|u_t - \omega\|^2 - 2\eta_t \Delta_t^\top(u_t - \omega) + \eta_t^2\|\Delta_t\|^2$,  (108)

leading to

$2\eta_t \Delta_t^\top(\omega_t - \omega) \le 2\eta_t \Delta_t^\top(\omega_t - u_t) + \|u_t - \omega\|^2 - \|u_{t+1} - \omega\|^2 + \eta_t^2\|\Delta_t\|^2$.  (109)

Then, noticing that $u_0 = \omega_0$ and telescoping back into (107), we get

$2 S_T F(\omega)^\top(\bar\omega_T - \omega) \le 2\|\omega_0 - \omega\|^2 + \sum_{t=0}^{T-1}\big[\eta_t^2(\|F(\omega_t,\xi_t)\|^2 + \|\Delta_t\|^2) + 2\eta_t \Delta_t^\top(\omega_t - u_t)\big]$
$\le 2R^2 + \sum_{t=0}^{T-1}\big[\eta_t^2(\|F(\omega_t,\xi_t)\|^2 + \|\Delta_t\|^2) + 2\eta_t \Delta_t^\top(\omega_t - u_t)\big]$.

Since the right-hand side does not depend on $\omega$, we can maximize over $\omega$ to get

$2 S_T\,\mathrm{Err}_R(\bar\omega_T) \le 2R^2 + \sum_{t=0}^{T-1}\big[\eta_t^2(\|F(\omega_t,\xi_t)\|^2 + \|\Delta_t\|^2) + 2\eta_t \Delta_t^\top(\omega_t - u_t)\big]$.  (110)

Noticing that $\mathbb{E}[\Delta_t \mid \omega_t, u_t] = 0$ (the estimates of $F$ are unbiased), that $\mathbb{E}[\|F(\omega_t, \xi_t)\|^2] \le M^2$ by Assumption 2, and that $\mathbb{E}[\|\Delta_t\|^2] \le \sigma^2$ by Assumption 1, we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (M^2 + \sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (111)

Particularly, for $\eta_t = \eta$ and $\eta_t = \frac{\eta}{\sqrt{t+1}}$ we respectively get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta(M^2 + \sigma^2)}{2}$  (112)

and

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2/\eta + \eta(M^2 + \sigma^2)(1 + \ln(T+1))}{4(\sqrt{T+1} - 1)}$.  (113)

F.2 PROOF OF THM. 3

Theorem 3. Under Assumptions 1 and 4, if $\mathbb{E}_\xi[F]$ is $L$-Lipschitz, then Alg. 2 with a constant step-size $\eta \le \frac{1}{\sqrt{3}L}$ has the following convergence rate for any $T \ge 1$:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{7\eta\sigma^2}{2}$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.  (114)

Particularly, $\eta = \frac{R}{\sigma}\sqrt{\frac{2}{7T}}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \sqrt{14}\,\frac{R\sigma}{\sqrt{T}}$.

Proof of Thm. 3. Let $\omega \in \Omega$ be any point such that $\|\omega_0 - \omega\| \le R$. The update rules are $\omega_{t+1} = P_\Omega(\omega_t - \eta_t F(\omega'_t, \zeta_t))$ and $\omega'_t = P_\Omega(\omega_t - \eta_t F(\omega_t, \xi_t))$. We start by applying Lemma 2 for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega'_t, \zeta_t), \omega, \omega_{t+1})$ and for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega_t, \xi_t), \omega_{t+1}, \omega'_t)$:

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega_{t+1} - \omega) - \|\omega_{t+1} - \omega_t\|^2$
$\|\omega'_t - \omega_{t+1}\|^2 \le \|\omega_t - \omega_{t+1}\|^2 - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2$.

Then, summing them, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega_{t+1} - \omega) - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$,  (115)

leading to

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) + 2\eta_t\big(F(\omega'_t, \zeta_t) - F(\omega_t, \xi_t)\big)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$.

Then, with $2a^\top b \le \|a\|^2 + \|b\|^2$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 + \eta_t^2\|F(\omega'_t, \zeta_t) - F(\omega_t, \xi_t)\|^2$.

Using the inequality $\|a + b + c\|^2 \le 3(\|a\|^2 + \|b\|^2 + \|c\|^2)$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 + 3\eta_t^2\big(\|F(\omega_t) - F(\omega_t, \xi_t)\|^2 + \|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2 + \|F(\omega'_t) - F(\omega_t)\|^2\big)$.

Then we can use the $L$-Lipschitzness of $F$ to get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \zeta_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 + 3\eta_t^2\big(\|F(\omega_t) - F(\omega_t, \xi_t)\|^2 + \|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2 + L^2\|\omega'_t - \omega_t\|^2\big)$.

As we restricted the step-size to $\eta_t \le \frac{1}{\sqrt{3}L}$, we have $3\eta_t^2 L^2 \le 1$ and hence

$\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le \frac{1}{2}\big(\|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2\big) + \eta_t\big(F(\omega'_t) - F(\omega'_t, \zeta_t)\big)^\top(\omega'_t - \omega) + \frac{3\eta_t^2}{2}\|F(\omega_t) - F(\omega_t, \xi_t)\|^2 + \frac{3\eta_t^2}{2}\|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2$.

We get a particular case of the hypothesis of Lemma 6, with $N_t = \frac{1}{2}\|\omega_t - \omega\|^2$, $M_1(\omega_t, \xi_t) = 3\|F(\omega_t) - F(\omega_t, \xi_t)\|^2$, $M_2(\omega'_t, \zeta_t) = 3\|F(\omega'_t) - F(\omega'_t, \zeta_t)\|^2$, $\Delta_t = F(\omega'_t) - F(\omega'_t, \zeta_t)$ and $z_t = \omega'_t$. By Assumption 1, $M_1 = M_2 = 3\sigma^2$, and by the fact that

$\mathbb{E}[F(\omega'_t) - F(\omega'_t, \zeta_t) \mid \omega'_t, \Delta_0, \ldots, \Delta_{t-1}] = \mathbb{E}\big[\mathbb{E}[F(\omega'_t) - F(\omega'_t, \zeta_t) \mid \omega'_t] \mid \Delta_0, \ldots, \Delta_{t-1}\big] = 0$,

the hypotheses of Lemma 6 hold and we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + 7\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (116)

F.3 PROOF OF THM. 4

Theorem 4. Under Assumption 1, if $\mathbb{E}_\xi[F]$ is $L$-Lipschitz, then AvgPastExtraSGD (Alg. 3) with a constant step-size $\eta \le \frac{1}{3L}$ has the following convergence rate for any $T \ge 1$:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{13\eta\sigma^2}{2}$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.  (117)

Particularly, $\eta = \frac{R}{\sigma}\sqrt{\frac{2}{13T}}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \sqrt{26}\,\frac{R\sigma}{\sqrt{T}}$.

First let us recall the update rule:

$\omega'_t = P_\Omega[\omega_t - \eta_t F(\omega'_{t-1}, \xi_{t-1})]$, $\quad \omega_{t+1} = P_\Omega[\omega_t - \eta_t F(\omega'_t, \xi_t)]$.  (118)

Lemma 7. We have, for any $\omega \in \Omega$,

$2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 - \|\omega'_t - \omega_t\|^2 + 3\eta_t^2 L^2\|\omega'_t - \omega'_{t-1}\|^2 + 3\eta_t^2\big[\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big]$.  (119)

Proof. Applying Lemma 2 for $(\omega, u, \omega^+, \omega') = (\omega_t, \eta_t F(\omega'_t, \xi_t), \omega_{t+1}, \omega)$ and for $(\omega, u, \omega^+, \omega') = (\omega_t, \eta_t F(\omega'_{t-1}, \xi_{t-1}), \omega'_t, \omega_{t+1})$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega_{t+1} - \omega) - \|\omega_{t+1} - \omega_t\|^2$  (120)

and

$\|\omega'_t - \omega_{t+1}\|^2 \le \|\omega_t - \omega_{t+1}\|^2 - 2\eta_t F(\omega'_{t-1}, \xi_{t-1})^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2$.  (121)

Summing (120) and (121), we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2 - 2\eta_t\big(F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_t, \xi_t)\big)^\top(\omega'_t - \omega_{t+1})$.  (125)

Then we can use the inequality of arithmetic and geometric means, $2a^\top b \le \|a\|^2 + \|b\|^2$, to get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + \eta_t^2\|F(\omega'_t, \xi_t) - F(\omega'_{t-1}, \xi_{t-1})\|^2 - \|\omega'_t - \omega_t\|^2$.  (128)

Using the inequality $\|a + b + c\|^2 \le 3(\|a\|^2 + \|b\|^2 + \|c\|^2)$, we get

$\|F(\omega'_t, \xi_t) - F(\omega'_{t-1}, \xi_{t-1})\|^2 \le 3\big(\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + \|F(\omega'_t) - F(\omega'_{t-1})\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big)$  (129)

$\le 3\big(\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + L^2\|\omega'_t - \omega'_{t-1}\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big)$,  (130)

where we used the $L$-Lipschitzness of $F$ for the last inequality. Combining (128) with (130), we get (119). □

Lemma 8. For all $t \ge 0$, if we set $\omega'_{-1} = \omega'_{-2} = \omega_0$, we have

$\|\omega'_t - \omega'_{t-1}\|^2 \le 2\|\omega'_t - \omega_t\|^2 + 6\eta_{t-1}^2\big(\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-1})\|^2 + L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2 + \|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2\big)$.  (132)

Proof. We start with $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$:

$\|\omega'_t - \omega'_{t-1}\|^2 \le 2\|\omega'_t - \omega_t\|^2 + 2\|\omega_t - \omega'_{t-1}\|^2$.  (133)

Moreover, since the projection is non-expansive, we have

$\|\omega_t - \omega'_{t-1}\|^2 \le \|(\omega_{t-1} - \eta_{t-1}F(\omega'_{t-1}, \xi_{t-1})) - (\omega_{t-1} - \eta_{t-1}F(\omega'_{t-2}, \xi_{t-2}))\|^2 = \eta_{t-1}^2\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-2}, \xi_{t-2})\|^2$  (135)

$\le 3\eta_{t-1}^2\big(\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-1})\|^2 + L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2 + \|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2\big)$,  (136)

where in the last line we used the same inequality as in (130). Combining (133) and (136) gives (132). □

Proof of Theorem 4. Combining Lemma 7 and Lemma 8, we get

$2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 - (1 - 6\eta_t^2 L^2)\|\omega'_t - \omega_t\|^2 + 18\eta_t^2\eta_{t-1}^2 L^2\big(\|F(\omega'_{t-1}, \xi_{t-1}) - F(\omega'_{t-1})\|^2 + L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2 + \|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2\big) + 3\eta_t^2\big[\|F(\omega'_t, \xi_t) - F(\omega'_t)\|^2 + \|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2\big]$.  (140)

Then, for $\eta_t \le \frac{1}{3L}$, choosing

$N_t = \frac{1}{2}\|\omega_t - \omega\|^2 + 3\eta_t^2 L^2\|\omega'_{t-1} - \omega'_{t-2}\|^2$

and adding $\eta_t\big(F(\omega'_t) - F(\omega'_t, \xi_t)\big)^\top(\omega'_t - \omega)$ to both sides of (140), one checks (using Lemma 8 once more to bound $\|\omega'_t - \omega'_{t-1}\|^2$) that all the $\|\omega'_t - \omega_t\|^2$ terms can be absorbed, yielding an inequality of the form required by Lemma 6:

$\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le N_t - N_{t+1} + \frac{\eta_t^2}{2}M_2(\omega'_t, \xi_t) + \eta_t \Delta_t^\top(z_t - \omega)$,  (141)

where $M_1(\omega_t, \xi_t) = 0$,

$M_2(\omega'_t, \xi_t) = 3\|F(\omega'_t) - F(\omega'_t, \xi_t)\|^2 + 6\|F(\omega'_{t-1}) - F(\omega'_{t-1}, \xi_{t-1})\|^2 + 3\|F(\omega'_{t-2}) - F(\omega'_{t-2}, \xi_{t-2})\|^2$,

$\Delta_t = F(\omega'_t) - F(\omega'_t, \xi_t)$ and $z_t = \omega'_t$. By Assumption 1, $M_2 = 12\sigma^2$, and by the fact that

$\mathbb{E}[F(\omega'_t) - F(\omega'_t, \xi_t) \mid \omega'_t, \Delta_0, \ldots, \Delta_{t-1}] = \mathbb{E}\big[\mathbb{E}[F(\omega'_t) - F(\omega'_t, \xi_t) \mid \omega'_t] \mid \Delta_0, \ldots, \Delta_{t-1}\big] = 0$,

the hypotheses of Lemma 6 hold and we get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + 13\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (142)

F.4 PROOF OF THEOREM 5

Theorem 5 was introduced in §D. It concerns Algorithm 5, which is another way to implement extrapolation in SGD. Let us first restate this theorem.

Theorem 5. Assume that $\|\omega'_t - \omega_0\| \le R$ for all $t \ge 0$, where $(\omega'_t)_{t\ge 0}$ are the iterates of Alg. 5. Under Assumptions 1 and 4, for any $T \ge 1$, Alg. 5 with constant step-size $\eta \le \frac{1}{2L}$ has the following convergence properties:

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{R^2}{\eta T} + \frac{\eta\sigma^2}{2} + 2\eta(4L^2R^2 + \sigma^2)$, where $\bar\omega_T = \frac{1}{T}\sum_{t=0}^{T-1}\omega'_t$.

Particularly, $\eta \propto 1/\sqrt{T}$ gives $\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] = O(1/\sqrt{T})$.

Proof of Thm. 5. Let $\omega \in \Omega$ be any point such that $\|\omega_0 - \omega\| \le R$. The update rules are $\omega_{t+1} = P_\Omega(\omega_t - \eta_t F(\omega'_t, \xi_t))$ and $\omega'_t = P_\Omega(\omega_t - \eta_t F(\omega_t, \xi_t))$. We start in the same way as the proof of Thm. 3, by applying Lemma 2 for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega'_t, \xi_t), \omega, \omega_{t+1})$ and for $(\omega, u, \omega', \omega^+) = (\omega_t, \eta_t F(\omega_t, \xi_t), \omega_{t+1}, \omega'_t)$:

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega_{t+1} - \omega) - \|\omega_{t+1} - \omega_t\|^2$
$\|\omega'_t - \omega_{t+1}\|^2 \le \|\omega_t - \omega_{t+1}\|^2 - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2$.

Then, summing them, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega_{t+1} - \omega) - 2\eta_t F(\omega_t, \xi_t)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$,  (143)

leading to

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + 2\eta_t\big(F(\omega'_t, \xi_t) - F(\omega_t, \xi_t)\big)^\top(\omega'_t - \omega_{t+1}) - \|\omega'_t - \omega_t\|^2 - \|\omega'_t - \omega_{t+1}\|^2$.

Then, with $2a^\top b \le \|a\|^2 + \|b\|^2$, we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + \eta_t^2\|F(\omega'_t, \xi_t) - F(\omega_t, \xi_t)\|^2 - \|\omega'_t - \omega_t\|^2$.

Using the Lipschitzness assumption (for each sample, $\omega \mapsto F(\omega, \xi)$ is $L$-Lipschitz), we get

$\|\omega_{t+1} - \omega\|^2 \le \|\omega_t - \omega\|^2 - 2\eta_t F(\omega'_t, \xi_t)^\top(\omega'_t - \omega) + (\eta_t^2 L^2 - 1)\|\omega'_t - \omega_t\|^2$.

Then we add $2\eta_t F(\omega'_t)^\top(\omega'_t - \omega)$ to both sides to get

$2\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 - 2\eta_t\big(F(\omega'_t, \xi_t) - F(\omega'_t)\big)^\top(\omega'_t - \omega) + (\eta_t^2 L^2 - 1)\|\omega'_t - \omega_t\|^2$.  (144)

Here, unfortunately, we cannot use Lemma 6 directly, because $F(\omega'_t, \xi_t)$ is biased. We will therefore deal with the quantity $A_t = \big(F(\omega'_t, \xi_t) - F(\omega'_t)\big)^\top(\omega - \omega'_t)$. We have

$A_t = \big(F(\omega'_t, \xi_t) - F(\omega_t, \xi_t)\big)^\top(\omega - \omega'_t) + \big(F(\omega_t) - F(\omega'_t)\big)^\top(\omega - \omega'_t) + \big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega_t - \omega'_t) + \big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega - \omega_t)$

$\le 2L\|\omega'_t - \omega_t\|\,\|\omega'_t - \omega\| + \|F(\omega_t, \xi_t) - F(\omega_t)\|\,\|\omega_t - \omega'_t\| + \big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega - \omega_t)$

(using Cauchy-Schwarz and the $L$-Lipschitzness of $F$). Then, applying the inequality $ab \le \frac{\delta}{2}a^2 + \frac{1}{2\delta}b^2$ with suitable choices of $\delta$ to the first two products, and using $\|\omega'_t - \omega\| \le \|\omega'_t - \omega_0\| + \|\omega_0 - \omega\| \le 2R$ (the assumption of the theorem) together with $\eta_t \le \frac{1}{2L}$, all the $\|\omega'_t - \omega_t\|^2$ terms in (144) can be absorbed and we obtain

$2\eta_t F(\omega'_t)^\top(\omega'_t - \omega) \le \|\omega_t - \omega\|^2 - \|\omega_{t+1} - \omega\|^2 + 2\eta_t\big(F(\omega_t, \xi_t) - F(\omega_t)\big)^\top(\omega - \omega_t) + 4\eta_t^2\big(4L^2R^2 + \|F(\omega_t, \xi_t) - F(\omega_t)\|^2\big)$.

Once again, this inequality is a particular case of the hypothesis of Lemma 6, with $N_t = \frac{1}{2}\|\omega_t - \omega\|^2$, $M_1(\omega_t, \xi_t) = 4\big(4L^2R^2 + \|F(\omega_t, \xi_t) - F(\omega_t)\|^2\big)$, $M_2(\omega'_t, \zeta_t) = 0$, $z_t = \omega_t$ and $\Delta_t = F(\omega_t) - F(\omega_t, \xi_t)$. By Assumption 1, $\mathbb{E}[M_1(\omega_t, \xi_t)] \le 16L^2R^2 + 4\sigma^2$, and $\mathbb{E}[\Delta_t \mid \omega_t, \Delta_0, \ldots, \Delta_{t-1}] = \mathbb{E}\big[\mathbb{E}[\Delta_t \mid \omega_t] \mid \Delta_0, \ldots, \Delta_{t-1}\big] = 0$, so we can use Lemma 6 and get

$\mathbb{E}[\mathrm{Err}_R(\bar\omega_T)] \le \frac{2R^2 + (16L^2R^2 + 5\sigma^2)\sum_{t=0}^{T-1}\eta_t^2}{2 S_T}$.  (145)
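Before moving to the additional experiments, the guarantees above can be sanity-checked numerically; the snippet below (ours, with an illustrative noise level and step-size, not from the paper) runs simultaneous SGD and ExtraSGD on the noisy 2D bilinear game of §C.2 and compares the last SGD iterate with the uniform average of the extrapolated iterates.

```python
import numpy as np

# Quick numerical sanity check (our sketch): on the noisy bilinear game
# L(theta, phi) = theta * phi, plain simultaneous SGD spirals outward,
# while the uniform average of ExtraSGD's extrapolated iterates
# approaches the equilibrium (0, 0), as the averaged guarantees predict.

rng = np.random.default_rng(0)
F = lambda w: np.array([w[1], -w[0]])            # simultaneous gradient field
noisy_F = lambda w: F(w) + 0.1 * rng.standard_normal(2)

eta, T = 0.1, 5000
w_sgd = np.array([1.0, 1.0])
w_ext = np.array([1.0, 1.0])
avg = np.zeros(2)
for t in range(T):
    w_sgd = w_sgd - eta * noisy_F(w_sgd)         # simultaneous SGD step
    w_half = w_ext - eta * noisy_F(w_ext)        # extrapolation step
    w_ext = w_ext - eta * noisy_F(w_half)        # update step (fresh sample)
    avg += w_half / T                            # uniform average of w'_t
print(np.linalg.norm(w_sgd))   # large: SGD diverges on this game
print(np.linalg.norm(avg))     # small: averaged ExtraSGD converges
```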

G ADDITIONAL EXPERIMENTAL RESULTS

G.1 TOY NON-CONVEX GAN (2D AND DETERMINISTIC)

We now consider a task similar to (Mescheder et al., 2018), where the discriminator is linear, $D_\varphi(\omega) = \varphi^\top\omega$, the generator is a Dirac distribution at $\theta$, $q_\theta = \delta_\theta$, and the distribution we try to match is also a Dirac, $p = \delta_{\omega^*}$. The minimax formulation from Goodfellow et al. (2014) gives:

$\min_\theta \max_\varphi \; -\log\big(1 + e^{-\varphi^\top\omega^*}\big) - \log\big(1 + e^{\varphi^\top\theta}\big)$.  (146)

Note that, as observed by Nagarajan and Kolter (2017), this objective is concave-concave, making it hard to optimize. We compare the methods on this objective where we take $\omega^* = 2$; the position of the equilibrium is thus shifted to $(\theta, \varphi) = (2, 0)$. The convergence and the gradient vector field are shown in Figure 5. We observe that, depending on the initialization, some methods can fail to converge, but extrapolation (18) seems to perform better than the rest of the methods.

Figure 5: Comparison of five algorithms (described in Section 3) on the non-convex GAN objective (146), using the optimal step-size for each method. Left: the gradient vector field and the dynamics of the different methods. Right: the distance to the optimum as a function of the number of iterations for each method.
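For reference, here is a small sketch (ours, not the paper's code) of the objective (146) with $\omega^* = 2$: it implements the simultaneous-gradient field plotted in Figure 5 and runs the extrapolation method on it. The step-size and initialization are illustrative choices for stable convergence, not the figure's exact settings.

```python
import numpy as np

# Sketch of the toy objective (146) with omega_star = 2: the operator
# F = (dL/dtheta, -dL/dphi) defines the simultaneous vector field, and
# extrapolation is run on it. Step-size / init are illustrative; as the
# text notes, convergence on this concave-concave game depends on them.

omega_star = 2.0
sig = lambda x: 1.0 / (1.0 + np.exp(-x))

def F(w):
    theta, phi = w
    d_theta = -phi * sig(phi * theta)          # dL/dtheta (theta descends)
    d_phi = omega_star * sig(-phi * omega_star) - theta * sig(phi * theta)
    return np.array([d_theta, -d_phi])         # phi ascends, hence -dL/dphi

w, eta = np.array([1.0, 0.5]), 0.2
for _ in range(5000):
    w_half = w - eta * F(w)                    # extrapolation step
    w = w - eta * F(w_half)                    # update step
print(w)                                       # should approach (2, 0)
```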

G.2 DCGAN WITH WGAN-GP OBJECTIVE

We also trained the DCGAN architecture of the WGAN experiments, but this time with the WGAN-GP objective. The results are shown in Table 2. The best results are achieved with uniform averaging of AltAdam5; however, its iterations require updating the discriminator 5 times for every generator update. With a small drop in best final score, ExtraAdam can train WGAN-GP significantly faster (see Fig. 6, right), as the discriminator and generator are each updated only twice.

Table 2: Best inception scores (averaged over 5 runs) achieved on CIFAR10 for every considered Adam variant, with the WGAN-GP (DCGAN) model. OptimAdam is the related Optimistic Adam (Daskalakis et al., 2018) algorithm. We see that the techniques of extrapolation and averaging consistently enable improvements over the baselines (in italic).

Method          no averaging    uniform avg
SimAdam         6.00 ±          ± .08
AltAdam5        6.5 ±           ± .05
ExtraAdam       6. ±            ± .05
PastExtraAdam   6.7 ±           ± 0.13

Figure 6: DCGAN architecture with WGAN-GP trained on CIFAR10: mean and standard deviation of the inception score computed over 5 runs for each method using the best performing learning rate, plotted over number of generator updates (left) and wall-clock time (right); all experiments were run on an NVIDIA Quadro GP100 GPU. We see that ExtraAdam converges faster than the Adam baselines.
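As a structural illustration of how ExtraAdam interleaves the extrapolation and update steps, here is a minimal runnable PyTorch sketch on a toy WGAN-like game (our code, not the paper's released implementation: the paper's Algorithm 4 handles Adam's moment estimates more carefully, whereas here both half-steps update them, and the toy losses omit the gradient penalty).

```python
import copy
import torch
import torch.nn as nn

# Toy 1D WGAN-like game with linear players; purely illustrative.
gen, disc = nn.Linear(1, 1), nn.Linear(1, 1)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3, betas=(0.5, 0.9))
sample = lambda: (torch.randn(64, 1) + 2.0, torch.randn(64, 1))  # (real, noise)

def losses(batch):
    real, z = batch
    fake = gen(z)
    d_loss = disc(fake.detach()).mean() - disc(real).mean()  # critic loss
    g_loss = -disc(fake).mean()                              # generator loss
    return d_loss, g_loss

def grad_step(batch, apply=True):
    d_loss, g_loss = losses(batch)
    opt_g.zero_grad()
    g_loss.backward()        # fills gen grads (also leaks into disc)
    opt_d.zero_grad()        # discard the leaked disc grads
    d_loss.backward()        # fills disc grads
    if apply:
        opt_d.step(); opt_g.step()

for t in range(1000):
    snap_g = copy.deepcopy(gen.state_dict())
    snap_d = copy.deepcopy(disc.state_dict())
    grad_step(sample())                  # extrapolation: move to look-ahead point
    grad_step(sample(), apply=False)     # gradients AT the look-ahead point
    gen.load_state_dict(snap_g)          # restore values (grads are kept)
    disc.load_state_dict(snap_d)
    opt_d.step(); opt_g.step()           # apply look-ahead grads from the snapshot
```

The key design point is that the update step applies gradients evaluated at the look-ahead point while stepping from the snapshot, which is what distinguishes extrapolation from two plain Adam steps.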

G.3 COMPARISON OF THE METHODS WITH THE SAME LEARNING RATE

In this section we compare how the methods presented in Section 7 perform with the same step-size. We follow the same protocol as in the experimental section. In Figure 7 we plot the inception score provided by each training method as a function of the number of generator updates. Note that these plots slightly advantage AltAdam5, because each of its iterations is a bit more costly (it performs 5 discriminator updates for each generator update). Nevertheless, the goal of this experiment is not to show that AltAdam5 is faster, but to show that ExtraAdam is less sensitive to the choice of learning rate and can be used with higher learning rates with less degradation. In the same way, in Figure 8 we compare the sample quality on the WGAN-GP experiment (DCGAN architecture) of AltAdam5 and AvgExtraAdam for different step-sizes. We notice that for AvgExtraAdam the sample quality does not change significantly, whereas the sample quality of AltAdam5 seems to be quite sensitive to step-size tuning. We think that robustness to step-size tuning is a key property for an optimization algorithm, in order to save as much time as possible for tuning the other hyperparameters of the learning procedure, such as regularization.

Figure 7: Inception score on CIFAR10 for WGAN-GP (DCGAN) over number of generator updates for different learning rates: (a) learning rate of $10^{-3}$; (b) learning rate of $10^{-4}$. We can see that AvgExtraAdam is less sensitive to the choice of learning rate.

Figure 8: Comparison of the sample quality on the WGAN-GP (DCGAN) experiment for different methods and learning rates $\eta$: (a) AvgExtraAdam with $\eta = 10^{-3}$; (b) AvgExtraAdam with $\eta = 10^{-4}$; (c) AltAdam5 with $\eta = 10^{-3}$; (d) AltAdam5 with $\eta = 10^{-4}$.

G.4 COMPARISON OF THE METHODS WITH AND WITHOUT AVERAGING

In this section we compare how uniform averaging improves the performance of the methods presented in Section 7. We follow the same protocol as in the experimental section. In Figures 9 and 10, we plot the inception score provided by each training method as a function of the number of generator updates, with and without averaging. We notice that uniform averaging seems to improve the inception score; nevertheless, the samples look a bit more blurry (see Figure 11). This is confirmed by our results on the Fréchet Inception Distance (FID, Figure 12), which is more sensitive to blurriness.

Figure 9: Inception score on CIFAR10 for WGAN over number of generator updates, (a) with and (b) without averaging. We can see that averaging improves the inception score.

Figure 10: Inception score on CIFAR10 for WGAN-GP (DCGAN) over number of generator updates, (a) with and (b) without averaging.

Figure 11: Comparison of the samples of a WGAN trained with the different methods, with and without averaging: (a) PastExtraAdam without averaging; (b) PastExtraAdam with averaging; (c) AltAdam5 without averaging; (d) AltAdam5 with averaging. Although averaging improves the inception score, the samples seem more blurry.

Figure 12: The Fréchet Inception Distance (FID) from Heusel et al. (2017), computed using 50,000 samples, on the WGAN experiments. ReExtraAdam refers to Alg. 5 introduced in §D. We can see that averaging performs worse here than when comparing with the inception score. We observed that the samples generated using averaging are a little more blurry and that the FID is more sensitive to blurriness, thus providing an explanation for this observation.
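Finally, for completeness, here is a sketch (ours, not the paper's code) of the uniform averaging protocol used throughout this section: an online uniform average of the generator's parameters is maintained in a frozen copy, and it is this averaged copy that would be evaluated for the "with averaging" curves. Batch-norm buffers are not averaged in this simplification.

```python
import copy
import torch

# Online uniform (Polyak-style) parameter averaging for any nn.Module.
# Call update_average once per generator update; after t+1 updates the
# copy holds the uniform mean of all iterates seen so far.

def make_average(gen):
    avg = copy.deepcopy(gen)
    for p in avg.parameters():
        p.requires_grad_(False)   # the averaged copy is never trained
    return avg

@torch.no_grad()
def update_average(avg, gen, t):
    # avg <- avg * t/(t+1) + gen * 1/(t+1)
    for p_avg, p in zip(avg.parameters(), gen.parameters()):
        p_avg.mul_(t / (t + 1.0)).add_(p, alpha=1.0 / (t + 1.0))
```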
