Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks


Tengyuan Liang (University of Chicago) and James Stokes (University of Pennsylvania)

arXiv:submit [stat.ML], 16 Feb 2018

Abstract. Motivated by the pursuit of a systematic computational and algorithmic understanding of Generative Adversarial Networks (GANs), we present a simple yet unified non-asymptotic local convergence theory for smooth two-player games, which subsumes several discrete-time gradient-based saddle point dynamics. The analysis reveals the surprising nature of the off-diagonal interaction term as both a blessing and a curse. On the one hand, this interaction term explains the origin of the slow-down effect in the convergence of Simultaneous Gradient Ascent (SGA) to stable Nash equilibria. On the other hand, for unstable equilibria, exponential convergence can be proved thanks to the interaction term, for three modified dynamics proposed to stabilize GAN training: Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). The analysis uncovers the intimate connections among these stabilizing techniques, and provides a detailed characterization of the choice of learning rate.

Key words and phrases: Generative adversarial networks, saddle point dynamics, smooth two-player games, local Nash equilibrium, non-asymptotic convergence, stability.

1 INTRODUCTION

In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form

(1.1)  min_{θ∈R^p} max_{ω∈R^q} U(θ, ω).

The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack principled understanding at both the computational and the algorithmic level.¹ GAN optimization is a special case of (1.1), which has been developed for learning

(tengyuan.liang@chicagobooth.edu, stokesj@sas.upenn.edu)

¹ We refer the readers to Huszár (2017)'s concise post on
this issue.

a complex and multi-modal probability distribution (P_real over X) through learning a generator function g_θ that transforms the input distribution (P_input over Z) to match the target P_real. Ignoring parameter regularization, the value function corresponding to a GAN is of the form

(1.2)  U(θ, ω) = E h₁(f_ω(X)) + E h₂(f_ω(g_θ(Z))),  X ∼ P_real, Z ∼ P_input,

where (θ, ω) ∈ R^p × R^q parametrize the generator function g_θ : Z → X and the discriminator function f_ω : X → R, respectively. The original GAN (Goodfellow et al., 2014), for example, corresponds to choosing h₁(t) = log σ(t), h₂(t) = log(1 − σ(t)), where σ is the sigmoid function; Wasserstein GAN (Arjovsky et al., 2017) considers h₁(t) = t, h₂(t) = −t; f-GAN (Nowozin et al., 2016) proposes h₁(t) = t, h₂(t) = −f*(t), where f* denotes the Fenchel dual of f. Recently, several attempts have been made to understand whether GANs learn the target distribution in the statistical sense (Liu et al., 2017; Arora and Zhang, 2017; Liang, 2017; Arora et al., 2017).

Optimization of GANs, and of value functions of the form (1.1) at large, is hard both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead resort to the more modest problem of searching for a local saddle point (θ*, ω*) such that no player has an incentive to deviate locally:

U(θ*, ω*) ≤ U(θ, ω*) for θ in an open neighborhood of θ*,
U(θ*, ω) ≤ U(θ*, ω*) for ω in an open neighborhood of ω*.

For smooth value functions, the above conditions are equivalent to the following solution concept.

Definition 1 (Local Nash Equilibrium). (θ*, ω*) is called a local Nash equilibrium if
(1) ∇_θ U(θ*, ω*) = 0, ∇_ω U(θ*, ω*) = 0;
(2) ∇²_θθ U(θ*, ω*) ⪰ 0, ∇²_ωω U(θ*, ω*) ⪯ 0.

Here we use ∇²_θω U(θ, ω) ∈ R^{p×q} to denote the off-diagonal term ∂²U/∂θ∂ω, and name it the interaction term throughout the paper; ∇_θ U(θ, ω) ∈ R^p denotes the gradient ∂U/∂θ, and ∇²_θθ U(θ, ω) ∈ R^{p×p} the Hessian ∂²U/∂θ∂θ.

In practice, discrete-time dynamical systems are employed to numerically approach the saddle points
of U(θ, ω), as is the case in GANs (Goodfellow et al., 2014) and in primal-dual methods for non-linear optimization. The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system:

(1.3)  θ_{t+1} = θ_t − η ∇_θ U(θ_t, ω_t),  ω_{t+1} = ω_t + η ∇_ω U(θ_t, ω_t),

where η is the step size, or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, whose asymptotic convergence has been established in Singh et al. (2000), Cherukuri et al. (2017), Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities due to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, which we hope will render it more informative in practice:

Non-asymptotic convergence speed. In practice one is concerned with a finite step size η > 0, which is typically subject to extensive hyper-parameter tuning. Detailed characterizations of the convergence speed, and theoretical insights on the choice of learning rate, can be helpful.

Unified, simple analysis for modified saddle point dynamics. Several attempts to fix GAN optimization have been put forth by independent researchers, who modify the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) using very different insights. A unified analysis that reveals the deeper connections amongst these proposals helps to better understand saddle point dynamics at large.

In this paper we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). More concretely, we provide the following theoretical contributions about the crucial effect of the off-diagonal interaction term ∇²_θω U(θ, ω) in two-player games.

Stable case: curse of the interaction term. Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent on strongly-convex functions. In addition, we quantitatively isolate the slow-down in the convergence rate of two-player SGA, compared to single-player gradient descent, due to the presence of the off-diagonal interaction term ∇²_θω U(θ, ω).

Unstable case: blessing of the interaction term. For unstable Nash equilibria, SGA diverges for any non-zero learning rate. We discover a unified non-asymptotic analysis that encompasses the three proposed modified dynamics OMD, CO and PM. The analysis shows that all these algorithms, at a high level, share the same idea of utilizing the curvature introduced by the interaction term ∇²_θω U(θ, ω).
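As a concrete sanity check on the stable case, the SGA iteration (1.3) can be run on a small quadratic game using the learning-rate recipe η = α/β of Theorem 1 (stated in Section 2). The matrices below are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

# Illustrative quadratic game U = (1/2) th^T A th - (1/2) w^T B w + th^T C w,
# whose unique (stable) Nash equilibrium is the origin.
A = np.diag([2.0, 1.0])          # grad^2_theta U  (positive definite)
B = np.diag([2.0, 1.0])          # -grad^2_omega U (positive definite)
C = np.eye(2)                    # interaction term grad^2_{theta,omega} U
M = np.block([[A, C], [-C.T, B]])

alpha = min(np.linalg.eigvalsh(A).min(), np.linalg.eigvalsh(B).min())
beta = np.linalg.eigvalsh(M.T @ M).max()
eta = alpha / beta               # learning rate suggested by Theorem 1

# One SGA step shrinks the distance to the equilibrium by at most sqrt(1 - alpha^2/beta).
op_norm = np.linalg.norm(np.eye(4) - eta * M, 2)
assert op_norm <= np.sqrt(1 - alpha**2 / beta) + 1e-12

z = np.ones(4)                   # stacked (theta_0, omega_0)
for _ in range(300):
    z = z - eta * (M @ z)        # SGA, Eqn (1.3), linearized at the equilibrium
assert np.linalg.norm(z) < 1e-8  # exponential ("linear-rate") convergence
```

Despite the complex eigenvalues of M (the iterates spiral), the operator-norm bound guarantees monotone contraction of the distance.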
Unlike the slow sub-linear rate of convergence experienced by single-player gradient descent on non-strongly convex functions,² OMD/CO/PM effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bi-linear games.

The organization of the paper is as follows. In Section 2 we consider the (admittedly idealized) situation where the value function locally satisfies strict convexity/concavity. We show non-asymptotic exponential convergence to Nash equilibria for SGA and identify an optimized learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that the proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4, we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems using value functions of the GAN form (1.2) under objective evaluation metrics. Detailed proofs are deferred to Section 5.

2 STABLE CASE: NON-ASYMPTOTIC LOCAL CONVERGENCE

In this section we establish the non-asymptotic convergence of the SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can

² In fact, Nesterov (2013) constructed a convex but non-strongly convex function on which all first-order methods suffer a slow sub-linear rate of convergence (in the optimization literature, "linear rate" refers to exponential convergence speed).

be intuitively pictured as cycling inwards to these saddle points, where the distance to the saddle point of interest is exponentially contracting. First let's introduce the notion of stable equilibrium.

Definition 2 (Stable Local Nash Equilibrium). (θ*, ω*) is called a stable local Nash equilibrium if
(1) ∇_θ U(θ*, ω*) = 0, ∇_ω U(θ*, ω*) = 0;
(2) ∇²_θθ U(θ*, ω*) ≻ 0, ∇²_ωω U(θ*, ω*) ≺ 0.

The above notion of stability is stronger than Definition 1 in the sense that ∇²_θθ U(θ*, ω*) and −∇²_ωω U(θ*, ω*) have smallest eigenvalues bounded away from 0.

Assumption 1 (Local Strong Convexity-Concavity). Consider U(θ, ω) : R^p × R^q → R that is smooth and twice differentiable, and let (θ*, ω*) be a stable local Nash equilibrium as in Definition 2. Assume that for some r > 0 there exists an open neighborhood of (θ*, ω*) such that for all (θ, ω) ∈ B((θ*, ω*), r) the following strong convexity-concavity condition holds:

∇²_θθ U(θ, ω) ≻ 0,  ∇²_ωω U(θ, ω) ≺ 0.

It will prove convenient to introduce some notation before stating the main theorem. Let us define the following block-wise abbreviation for the matrix of second derivatives,

(2.1)  [ ∇²_θθ U(θ, ω)   ∇²_θω U(θ, ω) ; −∇²_ωθ U(θ, ω)   −∇²_ωω U(θ, ω) ] =: [ A_θω  C_θω ; −C_θωᵀ  B_θω ],

and define α, β as

(2.2)  α := min_{(θ,ω)∈B((θ*,ω*),r)} λ_min( [ A_θω  0 ; 0  B_θω ] ),
       β := max_{(θ,ω)∈B((θ*,ω*),r)} λ_max( [ A_θω² + C_θω C_θωᵀ   A_θω C_θω − C_θω B_θω ; C_θωᵀ A_θω − B_θω C_θωᵀ   B_θω² + C_θωᵀ C_θω ] ),

where λ_max(M), λ_min(M) denote the largest and smallest eigenvalues of a matrix M.

Theorem 1 (Exponential Convergence: SGA). Consider U(θ, ω) : R^p × R^q → R that satisfies Assumption 1 for some radius r > 0 near a stable local Nash equilibrium (θ*, ω*) as in Definition 2, and an initialization satisfying (θ₀, ω₀) ∈ B((θ*, ω*), r). Then the SGA dynamics (1.3) with fixed learning rate η = α/β (α, β defined in Eqn (2.2)) obtains an ε-minimizer, in the sense (θ_T, ω_T) ∈ B((θ*, ω*), ε), as long as

T ≥ T_SGA := (2β/α²) · log(r/ε).

Remark 1. It is interesting to compare the convergence speed of the saddle point dynamics to conventional gradient descent in one variable for a strongly-convex function. We remind the reader that, to obtain an ε-minimizer of a strongly-convex function, one needs the following number of iterations
of gradient descent:

T_GD := max{ (λ_max(A_θω)/λ_min(A_θω)) · log(r/ε), (λ_max(B_θω)/λ_min(B_θω)) · log(r/ε) },

depending on whether we are optimizing with respect to θ or ω, respectively. It is now evident that, due to the presence of C_θω, the convergence of two-player SGA to a saddle point can be significantly slower than the convergence of single-player gradient descent. Note the following inequality,

λ_max( [ A_θω² + C_θω C_θωᵀ   A_θω C_θω − C_θω B_θω ; C_θωᵀ A_θω − B_θω C_θωᵀ   B_θω² + C_θωᵀ C_θω ] ) ≥ λ_max(A_θω)²,

which follows from the fact that the LHS is lower-bounded by λ_max(A_θω² + C_θω C_θωᵀ) ≥ λ_max(A_θω²) = λ_max(A_θω)². Therefore the convergence of SGA is slower than that of conventional GD: T_SGA ≥ T_GD. We would like to emphasize that, for saddle point convergence, the slow-down effect of the interaction term C_θω is explicit in our non-asymptotic analysis.

The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be seen in the following way. The presence of the off-diagonal anti-symmetric component in Eqn (2.1) means that the associated linear operator of the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, due to the explicit choice of η, the distance to the stable Nash equilibrium shrinks exponentially fast.

The local exponential stability in the infinitesimal/asymptotic case η → 0 has already been studied in a nice paper (Nagarajan and Kolter, 2017, Theorem 3.1 therein), by showing that the Jacobian matrix of a particular form of GAN objective is Hurwitz (all eigenvalues have strictly negative real parts). There are two distinct differences in our result: (1) we provide non-asymptotic convergence with specific guidance on the choice of learning rate η; (2) our analysis goes through the singular values (which, for a general matrix, are rather different from the moduli of the eigenvalues) instead of involving the complex eigenvalues, and this simple technique generalizes to three other modified saddle point dynamics, which we discuss in the next section.

3 UNSTABLE CASE: LOCAL BI-LINEAR PROBLEM

Oscillation and instability for SGA occur when the problem is
non-strongly convex-concave, as in the bi-linear game (or, more precisely, at least linear in one player). This observation was first pointed out using a very simple linear game, U(x, y) = xy, in Salimans et al. (2016). More generally, as a result of Theorem 1, this phenomenon occurs when the local Nash equilibrium is non-stable:

λ_min( [ A_θ*ω*  0 ; 0  B_θ*ω* ] ) = 0.

Let's consider the extreme case A_θ*ω* = B_θ*ω* = 0. In this case we will show that (1) an improved analysis for Optimistic Mirror Descent (OMD), introduced in Daskalakis et al. (2017), (2) a modified version of Predictive Methods (PM), motivated from Yadav et al. (2017), and (3) Consensus Optimization (CO), introduced in Mescheder et al. (2017), can fix the oscillation problem and provide exponential convergence to the unstable Nash equilibrium, via a novel unified non-asymptotic analysis. Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics to utilize the curvature generated by the interaction term C_θ*ω*, which we refer to as the blessing of the interaction term, to contrast with the slow-down effect of

the interaction term in the strongly convex-concave case. Once again, as alluded to in the introduction, this fast linear-rate convergence result (in the non-strongly convex-concave two-player game) should be compared to the significantly slower sub-linear convergence rate for all first-order methods in convex but non-strongly convex optimization (one player). The latter was proved by a lower bound argument in Nesterov (2013, Theorem 2.1.7). We are ready to state the main result proved in this section, informally.

Theorem 3.1 (Informal: Unstable Case). All three modified dynamics enjoy a last-iterate exponential convergence guarantee in the bi-linear game.

We start with the minimal unstable bilinear game, both to explain why oscillation can happen when the game is non-strongly convex-concave and to reveal and understand the new features of the modified dynamics. This bilinear game can be motivated by considering the lowest-order truncation of the Taylor expansion of a general smooth two-player game around a Nash equilibrium (θ*, ω*) = (0, 0):

(3.1)  U(θ, ω) = ½ θᵀAθ + ½ ωᵀBω + θᵀCω + higher order terms.

Now consider the simple bi-linear game U(θ, ω) = θᵀCω. With the SGA dynamics defined in (1.3), one can easily verify that

‖θ_{t+1}‖² + ‖ω_{t+1}‖² ≥ (1 + η²λ_min(CCᵀ)) (‖θ_t‖² + ‖ω_t‖²).

Therefore, while the continuous limit η → 0 cycles around a sphere, with any practical learning rate η > 0 the distance to the Nash equilibrium can increase exponentially instead of converging. Per Theorem 1 and the discussion above, instability for SGA only occurs when the local game is approximately bi-linear. From now on we focus on the simplest unstable form of the game, the bi-linear game, to isolate the main idea behind fixing the unstable problem.

3.1 (Improved) Optimistic Mirror Descent

Daskalakis et al. (2017) employed Optimistic Mirror Descent (OMD), motivated by online learning, to solve the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for
bi-linear games. We note that although the last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), the exponential convergence was, to the best of our knowledge, not known. Roughly speaking, for an initialization at distance r, Daskalakis et al. (2017, Theorem 1) asserts that after t ≳ 1/η iterations the last iterate of OMD is within distance r/2, for a small enough learning rate η (and under some additional assumptions); the convergence only happens in the limit of vanishing step size η → 0. In contrast, we prove that after t iterations with a properly chosen step size η, the last iterate of OMD is within distance (1 − λ_min²(C)/(4λ_max²(C)))^{t/2} · 2r. This improved theory also explains the exponential convergence found in simulations.

Lemma 3.1 (Exponential Convergence: OMD). Consider the bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank. Then the OMD dynamics

(3.2)  θ_{t+1} = θ_t − 2η ∇_θ U(θ_t, ω_t) + η ∇_θ U(θ_{t−1}, ω_{t−1}),
       ω_{t+1} = ω_t + 2η ∇_ω U(θ_t, ω_t) − η ∇_ω U(θ_{t−1}, ω_{t−1}),

with the learning rate η = 1/(2λ_max(C)) obtains an ε-minimizer, in the sense (θ_T, ω_T) ∈ B(0, ε), as long as

T ≥ T_OMD := (8λ_max²(C)/λ_min²(C) + 4) · log( 4r (1 + λ_min²(C)/λ_max²(C)) / ε ),

under the assumption that ‖(θ₀, ω₀)‖, ‖(θ₁, ω₁)‖ ≤ r.

3.2 (Modified) Predictive Methods

From a very different, ODE-based motivation, Yadav et al. (2017) proposed Predictive Methods (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predicted future location and then perform the update. In this section we propose and analyze a modified version of the predictive method (for simultaneous gradient updates), inspired by Yadav et al. (2017). Consider the following modified PM dynamics:

(3.3)  predictive step:  θ_{t+1/2} = θ_t − γ ∇_θ U(θ_t, ω_t),  ω_{t+1/2} = ω_t + γ ∇_ω U(θ_t, ω_t);
       gradient step:    θ_{t+1} = θ_t − η ∇_θ U(θ_{t+1/2}, ω_{t+1/2}),  ω_{t+1} = ω_t + η ∇_ω U(θ_{t+1/2}, ω_{t+1/2}).

Lemma 3.2 (Exponential Convergence: PM). Consider the bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank, and fix some γ > 0. Then the PM dynamics in Eqn (3.3) with learning rate

η = γλ_min²(C) / (λ_max²(C) + γ²λ_max⁴(C))

obtains an ε-minimizer, in the sense (θ_T, ω_T) ∈ B(0, ε), as long as

T ≥ T_PM := ( (γ²λ_max⁴(C) + λ_max²(C)) / (γ²λ_min⁴(C)) ) · log(r/ε),

under the assumption that ‖(θ₀, ω₀)‖ ≤ r.

3.3 Consensus Optimization

Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The authors' motivation is to add an extra potential vector field (the consensus part) on top of the existing curl vector field (for the SGA of the bi-linear problem), in order to attract the dynamics to the critical points. Mescheder et al. (2017) and Nagarajan and Kolter (2017) analyzed the infinitesimal-flow version of consensus optimization and intuitively showed that it pushes the real parts of the eigenvalues away from 0 to ensure asymptotic convergence. In this section we provide a simple convergence analysis of the discretized dynamics, of the same flavor as

the previous section. An upshot of the analysis is that it sheds light on a possible choice of learning rate. Recall that the regularization term defining consensus optimization is given by

(3.4)  R(θ, ω) = ( ‖∇_θ U(θ, ω)‖² + ‖∇_ω U(θ, ω)‖² ) / 2.

We are ready to state the lemma. Surprisingly, we find that consensus optimization coincides with the modified predictive method for the bi-linear game.

Lemma 3.3 (Exponential Convergence: CO). Consider the bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank. Recall R(θ, ω) defined in Eqn (3.4), and fix some γ > 0. Then the CO dynamics, with the same learning rate η as in Lemma 3.2,

(3.5)  θ_{t+1} = θ_t − η ( ∇_θ U(θ_t, ω_t) + γ ∇_θ R(θ_t, ω_t) ),
       ω_{t+1} = ω_t + η ( ∇_ω U(θ_t, ω_t) − γ ∇_ω R(θ_t, ω_t) ),

converges exponentially fast in the same way as the PM dynamics in Lemma 3.2.

4 EXPERIMENTS

Lucic et al. (2017) recently conducted a large-scale study of GANs and found that improvements from algorithmic changes mostly disappear after taking into account hyper-parameter tuning and randomness of initialization. They conclude that future GAN research should be based on more systematic and objective evaluation procedures. Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-the-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment.

We focus on the Wasserstein GAN formulation, so that the value function is given by

(4.1)  U(θ, ω) = E_{X∼P_real} f_ω(X) − E_{Z∼N(0,I_k)} f_ω(g_θ(Z)),

where f_ω : R^d → R is a multi-layer neural network with L hidden layers and rectifier non-linearities, and the input distribution P_input was chosen to be k-dimensional standard Gaussian noise. Following Gulrajani et al. (2017), we impose the Lipschitz-1 constraint on the discriminator network using the two-sided gradient penalty term Λ(ω) introduced in Gulrajani et al. (2017, Eqn 3). The consensus optimization
loss is defined with respect to the value function as in (3.4), without including the gradient penalty. The combined loss functions of the discriminator and generator are, respectively,

(4.2)  L_dis(θ, ω) = −U(θ, ω) + γ R(θ, ω) + λ Λ(ω),
(4.3)  L_gen(θ, ω) = U(θ, ω) + γ R(θ, ω).

The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then locked to λ = γ = 1 throughout. In order to make close contact with our theoretical formalism, we optimize the above loss functions on an alternating schedule, using vanilla gradient descent with a fixed learning rate of η = 10⁻³. To ensure reproducibility, all algorithms were independently implemented on top of the TFGAN and Keras frameworks.
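As a minimal, self-contained analogue of the algorithm comparison in this section, one can run the Section-3 dynamics on the bi-linear game U(θ, ω) = θᵀCω, where the behavior is fully transparent. The matrix C, step sizes and iteration counts below are our own illustrative choices:

```python
import numpy as np

C = np.diag([1.0, 0.8, 0.6])                 # full-rank interaction matrix
lam_max, lam_min = 1.0, 0.6                  # largest/smallest singular values of C

def run(dynamics, steps=400):
    th = w = np.ones(3)
    th_prev, w_prev = th, w
    for _ in range(steps):
        th, w, th_prev, w_prev = dynamics(th, w, th_prev, w_prev)
    return np.sqrt(th @ th + w @ w)          # distance to the Nash equilibrium (0, 0)

eta_sga = 0.1
def sga(th, w, *_):                          # Eqn (1.3)
    return th - eta_sga * C @ w, w + eta_sga * C.T @ th, th, w

eta_omd = 1 / (2 * lam_max)                  # Lemma 3.1
def omd(th, w, th_prev, w_prev):             # Eqn (3.2): extrapolate with stale gradients
    th_new = th - 2 * eta_omd * C @ w + eta_omd * C @ w_prev
    w_new = w + 2 * eta_omd * C.T @ th - eta_omd * C.T @ th_prev
    return th_new, w_new, th, w

gamma = 1.0
eta_pm = gamma * lam_min**2 / (lam_max**2 + gamma**2 * lam_max**4)  # Lemma 3.2
def pm(th, w, *_):                           # Eqn (3.3)
    th_half, w_half = th - gamma * C @ w, w + gamma * C.T @ th      # predictive step
    return th - eta_pm * C @ w_half, w + eta_pm * C.T @ th_half, th, w

r0 = np.sqrt(6.0)                            # initial distance
assert run(sga) > r0                         # SGA spirals outward on the bilinear game
assert run(omd) < 1e-6 * r0                  # OMD: last-iterate exponential convergence
assert run(pm) < 1e-6 * r0                   # PM: likewise
```

The same qualitative ranking (SGA diverges, the modified dynamics contract) is what the full GAN experiments probe in a nonlinear setting.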

4.1 Learning the Covariance of a Multivariate Gaussian

Consider the problem of learning the covariance matrix Σ ∈ S^d₊₊ of a d-dimensional multivariate Gaussian distribution P_real = N(0, Σ) with non-degenerate covariance Σ. Note that the learning problem is well-specified if we choose the generator function g_θ : R^k → R^d to be a simple linear transformation of the k-dimensional latent space (k ≥ d). Although the GAN approach is clearly overkill for this simple density estimation problem, we find this example illuminating because it affords some analytical tractability for the otherwise intractable general GAN value function (1.2). Specifically, if we choose the discriminator function f_ω : R^d → R to be a neural network with L = 1 hidden layer consisting of H hidden units with rectifier nonlinearities, and set the biases to zero, then the explicit functional forms of discriminator and generator are, respectively,

(4.4)  f_ω(x) = Σ_{i=1}^H v_i (w_iᵀx) 1{w_iᵀx ≥ 0},  g_θ(z) = V z,

where ω = {w_i ∈ R^d, v_i ∈ R : i ≤ H} and θ = {V ∈ R^{d×k}} are the discriminator and generator parameters, respectively. If, moreover, we express the covariance matrix as Σ = AAᵀ, then the value function can be expressed in closed form as

(4.5)  U(θ, ω) = const · Σ_{i=1}^H v_i ( ‖Aᵀw_i‖ − ‖Vᵀw_i‖ ).

The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, if one solves for the condition of being a Nash equilibrium, one does not conclude that VVᵀ = Σ; the result depends on the rank of the matrix [w₁, …, w_H].

The evaluation of the different optimization algorithms involved comparing the target density P_real = N(0, Σ) and the analytical generator density P_fake = N(0, VVᵀ) after t training iterations (Fig 1). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices, ‖Σ − VVᵀ‖_F. The covariance learning experiments were conducted in the well-specified and over-parametrized regime (k = 16, d = 2), using H = 128 hidden units for the discriminator network. We also
performed a head-to-head comparison of simultaneous and alternating update methods on this distribution, and found a negligible difference (Fig 2).

4.2 Mixture of Gaussians

In practical applications, GANs are typically trained using the empirical distribution of samples, where the samples are drawn from an idealized multi-modal probability distribution. To capture the notion of a multi-modal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance of small width σ. In contrast to previous visual-based evaluations, we estimate the Wasserstein-1 distance W₁(P_real, P_fake) between the target density P_real and the distribution P_fake of the random variable g_θ(Z) implied by the trained generator network. The estimate is obtained by solving a linear program which computes the exact Wasserstein-1 distance between the sample estimates P̂_real = (1/m) Σ_{i=1}^m δ_{X_i} and P̂_fake = (1/m) Σ_{i=1}^m δ_{g_θ(Z_i)}, respectively, and approaches the population version W₁(P_real, P_fake) as the number of samples m → ∞.

The experiments with the mixture of Gaussians used 2-dimensional Gaussian input (k = 2). Both the generator and discriminator networks consisted of L = 4 hidden layers with H = 128 units per hidden layer. The estimate of the Wasserstein-1 distance was calculated using a sample size of

m = 512 after training for t iterations. It is clear from Fig 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig 1) shows that the exponential separation between consensus optimization and the competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings demand a deeper understanding of the global structure of the landscape, which is not captured by our local stability analysis.

Fig 1. Evaluation metrics for covariance learning (left, ‖Σ − VVᵀ‖_F) and mixture-of-Gaussians learning (right, Wasserstein distance), for the different dynamical systems (SGA, Predictive, Optimism, Consensus), after t training iterations and 16 random seeds. Note that for covariance learning we use a log scale on the y-axis.

Fig 2. Comparison of alternating/simultaneous gradient updates on the covariance learning evaluation metric (Frobenius norm) after t training iterations and 16 random seeds.
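The exact empirical Wasserstein-1 computation described above can be sketched as follows: for two uniform samples of equal size m, the optimal transport LP admits a permutation as an optimal coupling, so the problem reduces to an assignment problem. The helper name `w1_empirical` is ours:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1_empirical(x, y):
    """Exact Wasserstein-1 distance between two equal-size empirical samples
    with uniform weights: solve the assignment problem on pairwise distances."""
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)  # m x m distances
    rows, cols = linear_sum_assignment(cost)                       # optimal matching
    return cost[rows, cols].mean()

# Tiny check: two point clouds offset horizontally by 1 have W1 = 1.
x = np.array([[0.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [1.0, 1.0]])
assert abs(w1_empirical(x, y) - 1.0) < 1e-12
```

For large m a general-purpose LP solver (as used in the paper) or specialized OT solvers scale better, but the assignment formulation makes the "exact" claim concrete.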

Fig 3. Density plots of the worst/best generator distributions, ranked by Wasserstein-1 distance, across all baselines amongst 16 random seeds after training. Left: SGA (W₁ = 2.17). Right: OMD (W₁ = 0.66).

5 TECHNICAL PROOFS

Proof of Theorem 1. Define the line interpolation between the two points, θ(x) = xθ_t + (1−x)θ*, ω(x) = xω_t + (1−x)ω*. Then the SGA dynamics can be written as

[ θ_{t+1} − θ* ; ω_{t+1} − ω* ] = [ θ_t − θ* ; ω_t − ω* ] − η [ ∇_θ U(θ_t, ω_t) ; −∇_ω U(θ_t, ω_t) ]
  = ∫₀¹ ( I − η [ ∇²_θθ U(θ(x), ω(x))   ∇²_θω U(θ(x), ω(x)) ; −∇²_ωθ U(θ(x), ω(x))   −∇²_ωω U(θ(x), ω(x)) ] ) dx · [ θ_t − θ* ; ω_t − ω* ].

Assume one can prove that, for some r > 0 and all (θ, ω) ∈ B((θ*, ω*), r), with a proper choice of η the largest singular value is bounded above by 1:

‖ I − η [ ∇²_θθ U(θ, ω)   ∇²_θω U(θ, ω) ; −∇²_ωθ U(θ, ω)   −∇²_ωω U(θ, ω) ] ‖_op < 1.

Then, due to the convexity of the operator norm, the SGA dynamics is locally contracting, because

‖ [ θ_{t+1} − θ* ; ω_{t+1} − ω* ] ‖ ≤ ∫₀¹ ‖ I − η [ ⋯ ] ‖_op dx · ‖ [ θ_t − θ* ; ω_t − ω* ] ‖ < ‖ [ θ_t − θ* ; ω_t − ω* ] ‖.

Let's analyze the singular values of I − η [ ∇²_θθ U   ∇²_θω U ; −∇²_ωθ U   −∇²_ωω U ],

assuming ∇²_θθ U(θ, ω) ≻ 0 and ∇²_ωω U(θ, ω) ≺ 0. Abbreviate

[ ∇²_θθ U   ∇²_θω U ; −∇²_ωθ U   −∇²_ωω U ] =: [ A  C ; −Cᵀ  B ].

The largest singular value of I − η[ A  C ; −Cᵀ  B ] is the square root of the largest eigenvalue of the symmetric matrix

( I − η[ A  C ; −Cᵀ  B ] )ᵀ ( I − η[ A  C ; −Cᵀ  B ] ) = [ (I − ηA)² + η²CCᵀ   η²(AC − CB) ; η²(CᵀA − BCᵀ)   (I − ηB)² + η²CᵀC ].

It is clear that when η = 0 the largest eigenvalue of the above matrix is 1. Observe that

[ (I − ηA)² + η²CCᵀ   η²(AC − CB) ; η²(CᵀA − BCᵀ)   (I − ηB)² + η²CᵀC ]
  = I − 2η [ A  0 ; 0  B ] + η² [ A² + CCᵀ   AC − CB ; CᵀA − BCᵀ   B² + CᵀC ]
  ⪯ ( 1 − 2η λ_min([ A  0 ; 0  B ]) + η² λ_max([ A² + CCᵀ   AC − CB ; CᵀA − BCᵀ   B² + CᵀC ]) ) I.

If we choose η to be

η = min_{(θ,ω)∈B((θ*,ω*),r)} λ_min([ A_θω  0 ; 0  B_θω ]) / max_{(θ,ω)∈B((θ*,ω*),r)} λ_max([ A_θω² + C_θω C_θωᵀ   A_θω C_θω − C_θω B_θω ; C_θωᵀ A_θω − B_θω C_θωᵀ   B_θω² + C_θωᵀ C_θω ]) = α/β,

then we find

( I − η[ A  C ; −Cᵀ  B ] )ᵀ ( I − η[ A  C ; −Cᵀ  B ] ) ⪯ ( 1 − α²/β ) I.

In this case,

sup_{(θ,ω)∈B((θ*,ω*),r)} ‖ I − η[ ⋯ ] ‖_op ≤ ( 1 − α²/β )^{1/2},  so  ‖ [ θ_{t+1} − θ* ; ω_{t+1} − ω* ] ‖ ≤ ( 1 − α²/β )^{1/2} · ‖ [ θ_t − θ* ; ω_t − ω* ] ‖.

Therefore, to obtain an ε-minimizer, a number of steps T ≥ (2β/α²) · log(r/ε) suffices. ∎

Proof of Lemma 3.1. Recall that on the bi-linear game the OMD dynamics iteratively update

[ θ_{t+1} ; ω_{t+1} ] = ( I − 2ηJ ) [ θ_t ; ω_t ] + ηJ [ θ_{t−1} ; ω_{t−1} ],  where J := [ 0  C ; −Cᵀ  0 ].

Define the following matrices, writing J̃ := −J² = [ CCᵀ  0 ; 0  CᵀC ]:

(5.1)  R₁ = ( I − 2ηJ + (I − 4η²J̃)^{1/2} ) / 2,
(5.2)  R₂ = ( I − 2ηJ − (I − 4η²J̃)^{1/2} ) / 2.

It is easy to verify that

R₁ + R₂ = I − 2ηJ,  R₁R₂ = R₂R₁ = ( (I − 2ηJ)² − (I − 4η²J̃) ) / 4 = −ηJ.

The commutative property R₁R₂ = R₂R₁ follows from a singular value decomposition argument: letting C = UDVᵀ be the SVD of C (D diagonal), one finds

C (I − 4η²CᵀC)^{1/2} = UD (I − 4η²D²)^{1/2} Vᵀ = U (I − 4η²D²)^{1/2} D Vᵀ = (I − 4η²CCᵀ)^{1/2} C.

Using the above equality, J commutes with (I − 4η²J̃)^{1/2}, and the commutative property follows. Now we have the following relations for OMD:

[ θ_{t+1} ; ω_{t+1} ] − R₁ [ θ_t ; ω_t ] = R₂ ( [ θ_t ; ω_t ] − R₁ [ θ_{t−1} ; ω_{t−1} ] ) = R₂ᵗ ( [ θ₁ ; ω₁ ] − R₁ [ θ₀ ; ω₀ ] ),
[ θ_{t+1} ; ω_{t+1} ] − R₂ [ θ_t ; ω_t ] = R₁ ( [ θ_t ; ω_t ] − R₂ [ θ_{t−1} ; ω_{t−1} ] ) = R₁ᵗ ( [ θ₁ ; ω₁ ] − R₂ [ θ₀ ; ω₀ ] ).

Hence

(5.3)  ( R₁ − R₂ ) [ θ_{t+1} ; ω_{t+1} ] = R₁^{t+1} ( [ θ₁ ; ω₁ ] − R₂ [ θ₀ ; ω₀ ] ) − R₂^{t+1} ( [ θ₁ ; ω₁ ] − R₁ [ θ₀ ; ω₀ ] ).
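The algebraic identities behind (5.1)–(5.3) can be checked numerically; this is our own sketch, with an illustrative random C and a step size chosen small enough that I − 4η²J̃ stays positive definite:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
C = rng.standard_normal((n, n))              # full rank almost surely
eta = 1 / (4 * np.linalg.norm(C, 2))         # 4*eta^2*lam_max(C)^2 = 1/4 < 1
Z = np.zeros((n, n))
J = np.block([[Z, C], [-C.T, Z]])            # bilinear-game vector field
Jt = np.block([[C @ C.T, Z], [Z, C.T @ C]])  # Jt = -J @ J

# Symmetric square root of I - 4 eta^2 Jt via its eigendecomposition.
w, V = np.linalg.eigh(np.eye(2 * n) - 4 * eta**2 * Jt)
S = V @ np.diag(np.sqrt(w)) @ V.T
I = np.eye(2 * n)
R1 = (I - 2 * eta * J + S) / 2               # Eqn (5.1)
R2 = (I - 2 * eta * J - S) / 2               # Eqn (5.2)

assert np.allclose(R1 + R2, I - 2 * eta * J)
assert np.allclose(R1 @ R2, -eta * J)
assert np.allclose(R1 @ R2, R2 @ R1)         # commutativity via the SVD argument
```

Because J commutes with Jt (and hence with S), the quadratic z_{t+1} = (R₁+R₂)z_t − R₁R₂ z_{t−1} reproduces the OMD recursion exactly.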

Let's analyze the eigenvalues of R₁ and R₂. Writing R₁ = ( I + (I − 4η²J̃)^{1/2} )/2 − ηJ and using Jᵀ = −J together with the commutativity above, we have

R₁R₁ᵀ = ( ( I + (I − 4η²J̃)^{1/2} ) / 2 )² + η²J̃,  R₂R₂ᵀ = ( ( I − (I − 4η²J̃)^{1/2} ) / 2 )² + η²J̃.

For a singular value λ of C, the corresponding eigenvalue of R₁R₁ᵀ is ( 1 + (1 − 4η²λ²)^{1/2} )/2, and that of R₂R₂ᵀ is ( 1 − (1 − 4η²λ²)^{1/2} )/2, so for η small enough the spectral radii satisfy the strict inequality ‖R₁‖_op < 1. Concretely, for example, with η = 1/(2λ_max(C)),

‖R₁‖²_op ≤ 1 − λ_min²(C)/(4λ_max²(C)),  ‖R₂‖²_op ≤ 1/2.

Due to Eqn (5.3), the RHS of (5.3) is upper bounded:

‖ R₁^{t+1} ( [θ₁; ω₁] − R₂ [θ₀; ω₀] ) − R₂^{t+1} ( [θ₁; ω₁] − R₁ [θ₀; ω₀] ) ‖ ≤ ( 1 − λ_min²(C)/(4λ_max²(C)) )^{t/2} · 2( ‖(θ₁, ω₁)‖ + ‖(θ₀, ω₀)‖ ) ≤ ( 1 − λ_min²(C)/(4λ_max²(C)) )^{t/2} · 4r.

Moreover, the LHS of (5.3) satisfies ‖ (R₁ − R₂) [θ_{t+1}; ω_{t+1}] ‖ ≥ σ_min(R₁ − R₂) · ‖ [θ_{t+1}; ω_{t+1}] ‖. To sum up, one can ensure ‖(θ_{t+1}, ω_{t+1})‖ ≤ ε whenever

t ≥ ( 8λ_max²(C)/λ_min²(C) + 4 ) · log( 4r ( 1 + λ_min²(C)/λ_max²(C) ) / ε ),

which completes the proof. ∎
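Lemma 3.3's claim that CO coincides with PM on the bi-linear game comes down to a single matrix identity: in the notation below, G² = −H. A quick numerical check, with illustrative constants of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eta, gamma = 3, 0.1, 0.5
C = rng.standard_normal((n, n))
Z = np.zeros((n, n))
G = np.block([[Z, -C], [C.T, Z]])             # SGA update direction: z' = z + eta*G z
H = np.block([[C @ C.T, Z], [Z, C.T @ C]])    # Hessian of R(theta, omega), Eqn (3.4)
I = np.eye(2 * n)

K_pm = I + eta * G @ (I + gamma * G)  # PM: gradient step at the predicted point, Eqn (3.3)
K_co = I + eta * (G - gamma * H)      # CO: SGA step plus the consensus term, Eqn (3.5)
assert np.allclose(G @ G, -H)         # the identity driving the coincidence
assert np.allclose(K_pm, K_co)        # identical linear dynamics
```

Expanding K_pm = I + ηG + ηγG² and substituting G² = −H gives K_co term by term, which is exactly the statement proved below.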

Proof of Lemma 3.3. In this case consensus optimization satisfies the following update:

(5.4)  [ θ_{t+1} ; ω_{t+1} ] = ( I − η [ γCCᵀ  C ; −Cᵀ  γCᵀC ] ) [ θ_t ; ω_t ].

Again, let's analyze the singular values of the operator K := I − η[ γCCᵀ  C ; −Cᵀ  γCᵀC ], or equivalently the eigenvalues of KKᵀ:

KKᵀ = [ (I − ηγCCᵀ)² + η²CCᵀ   0 ; 0   (I − ηγCᵀC)² + η²CᵀC ].

Now consider the largest eigenvalue of (I − ηγCCᵀ)² + η²CCᵀ for a fixed γ, with a properly chosen η. Using the SVD C = UDVᵀ, we obtain

(I − ηγCCᵀ)² + η²CCᵀ = U ( (I − ηγD²)² + η²D² ) Uᵀ ⪯ ( 1 − 2γλ_min²(C)η + (γ²λ_max⁴(C) + λ_max²(C))η² ) I = ( 1 − γ²λ_min⁴(C)/(γ²λ_max⁴(C) + λ_max²(C)) ) I,

with η = γλ_min²(C)/(λ_max²(C) + γ²λ_max⁴(C)). ∎

Proof of Lemma 3.2. In the simple bi-linear game case,

[ θ_{t+1} ; ω_{t+1} ] = [ θ_t − ηC(ω_t + γCᵀθ_t) ; ω_t + ηCᵀ(θ_t − γCω_t) ] = [ I − ηγCCᵀ   −ηC ; ηCᵀ   I − ηγCᵀC ] [ θ_t ; ω_t ].

Note this linear system is the same as that of Lemma 3.3 in Eqn (5.4). Therefore the convergence analysis follows in the same way as in Lemma 3.3. ∎

REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint, 2017.

Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint, 2017.

Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv preprint, 2017.

Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1), 2017.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint, 2017.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.

Ferenc Huszár. GANs are broken in more than one way: The numerics of GANs. 2017. URL: inference.vc/my-notes-on-the-numerics-of-gans/

Tengyuan Liang. How well can generative adversarial networks learn densities: A nonparametric view. arXiv preprint, 2017.

Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. arXiv preprint, 2017.

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal?
a large-scale study arxiv preprint arxiv: Lars Mescheder Sebastian Nowozin and Andreas Geiger he numerics of gans arxiv preprint arxiv: Luke Metz Ben Poole David Pfau and Jascha Sohl-Dickstein Unrolled generative adversarial networks arxiv preprint arxiv: Vaishnavh Nagarajan and J Zico Kolter Gradient descent gan optimization is locally stable In Advances in Neural Information Processing Systems pages Yurii Nesterov Introductory lectures on convex optimization: A basic course volume 87 Springer Science & Business Media 13 Sebastian Nowozin Botond seke and Ryota omioka f-gan: raining generative neural samplers using variational divergence minimization In Advances in Neural Information Processing Systems pages David Pfau and Oriol Vinyals onnecting generative adversarial networks and actor-critic methods arxiv preprint arxiv: im Salimans Ian Goodfellow Wojciech Zaremba Vicki heung Alec Radford and Xi hen Improved techniques for training gans In Advances in Neural Information Processing Systems pages Satinder Singh Michael Kearns and Yishay Mansour Nash convergence of gradient dynamics in general-sum games In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence pages Morgan Kaufmann Publishers Inc Abhay Yadav Sohil Shah Zheng Xu David Jacobs and om Goldstein Stabilizing adversarial nets with prediction methods Accepted at ILR
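As a numerical sanity check of the contraction analysis in the proof of Lemma 3.3 (a sketch, not part of the original paper; the interaction matrix E, the dimension n, and the regularization weight γ below are arbitrary illustrative choices), one can build the operator K for a random square E, set η to the prescribed value, and verify that the squared spectral norm of K is at most 1 − γ²λ²min(EEᵀ)/(λmax(EEᵀ) + γ²λ²max(EEᵀ)):

```python
import numpy as np

# Bilinear game U(theta, omega) = theta^T E omega. Consensus optimization
# (and equally the predictive method) gives the linear update z_{t+1} = K z_t with
#   K = [[I - eta*gamma*E E^T, -eta*E], [eta*E^T, I - eta*gamma*E^T E]].
rng = np.random.default_rng(0)
n, gamma = 4, 0.5
E = rng.standard_normal((n, n))  # square E so that lambda_min(E E^T) > 0 generically

d2 = np.linalg.svd(E, compute_uv=False) ** 2  # eigenvalues of E E^T
lam_min, lam_max = d2.min(), d2.max()

# Learning rate prescribed by the proof of Lemma 3.3.
eta = gamma * lam_min / (lam_max + gamma**2 * lam_max**2)

I = np.eye(n)
K = np.block([
    [I - eta * gamma * (E @ E.T), -eta * E],
    [eta * E.T,                    I - eta * gamma * (E.T @ E)],
])

# Claimed contraction factor: ||K||^2 <= bound < 1, hence exponential convergence.
sq_norm = np.linalg.norm(K, 2) ** 2  # squared largest singular value of K
bound = 1 - gamma**2 * lam_min**2 / (lam_max + gamma**2 * lam_max**2)
print(sq_norm, bound)
```

Because KKᵀ is block diagonal, the squared spectral norm is exactly max over singular values d of (1 − ηγd²)² + η²d², so the printed value should sit below the bound for any full-rank E.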


More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information