Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks
Tengyuan Liang (University of Chicago, tengyuan.liang@chicagobooth.edu) and James Stokes (University of Pennsylvania, stokesj@sas.upenn.edu)

arXiv, stat.ML, February 16, 2018

Abstract. Motivated by the pursuit of a systematic computational and algorithmic understanding of Generative Adversarial Networks (GANs), we present a simple yet unified non-asymptotic local convergence theory for smooth two-player games, which subsumes several discrete-time gradient-based saddle point dynamics. The analysis reveals the surprising nature of the off-diagonal interaction term as both a blessing and a curse. On the one hand, this interaction term explains the origin of the slow-down effect in the convergence of Simultaneous Gradient Ascent (SGA) to stable Nash equilibria. On the other hand, for the unstable equilibria, exponential convergence can be proved thanks to the interaction term, for three modified dynamics which have been proposed to stabilize GAN training: Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). The analysis uncovers the intimate connections among these stabilizing techniques, and provides detailed characterization on the choice of learning rate.

Key words and phrases: Generative adversarial networks, saddle point dynamics, smooth two-player games, local Nash equilibrium, non-asymptotic convergence, stability.

1 INTRODUCTION

In this paper we consider the non-asymptotic local convergence and stability of discrete-time gradient-based optimization algorithms for solving smooth two-player zero-sum games of the form

(1.1)  min_{θ ∈ R^p} max_{ω ∈ R^q} U(θ, ω).

The motivation behind our non-asymptotic analysis follows from the observation that Generative Adversarial Networks (GANs) lack principled understanding at both the computational and the algorithmic level (we refer the reader to Huszár (2017)'s concise post on this issue). GAN optimization is a special case of (1.1), which has been developed for learning
a complex and multi-modal probability distribution (based on samples from P_real over X) through learning a generator function g_θ(·) that transforms the input distribution (P_input over Z) to match the target P_real. Ignoring parameter regularization, the value function corresponding to a GAN is of the form

(1.2)  U(θ, ω) = E h₁(f_ω(X)) + E h₂(f_ω(g_θ(Z))),  X ∼ P_real, Z ∼ P_input,

where (θ, ω) ∈ R^p × R^q parametrizes the generator function g_θ : Z → X and the discriminator function f_ω : X → R, respectively. The original GAN (Goodfellow et al., 2014), for example, corresponds to choosing h₁(t) = log σ(t), h₂(t) = log(1 − σ(t)), where σ is the sigmoid function; Wasserstein GAN (Arjovsky et al., 2017) considers h₁(t) = t, h₂(t) = −t; f-GAN (Nowozin et al., 2016) proposes to use h₁(t) = t, h₂(t) = −f*(t), where f* denotes the Fenchel dual of f. Recently, several attempts have been made to understand whether GANs learn the target distribution in the statistical sense (Liu et al., 2017; Arora and Zhang, 2017; Liang, 2017; Arora et al., 2017).

Optimization of GANs, and of value functions of the form (1.1) at large, is hard both in theory and in practice (Singh et al., 2000; Pfau and Vinyals, 2016; Salimans et al., 2016). Global optimization of a general value function with multiple saddle points is impractical and unstable, so we instead resort to the more modest problem of searching for a local saddle point (θ*, ω*) such that no player has the incentive to deviate locally:

U(θ*, ω*) ≤ U(θ, ω*) for θ in an open neighborhood of θ*,
U(θ*, ω*) ≥ U(θ*, ω) for ω in an open neighborhood of ω*.

For smooth value functions, the above conditions are equivalent to the following solution concept.

Definition 1 (Local Nash Equilibrium). (θ*, ω*) is called a local Nash equilibrium if
(1) ∇_θ U(θ*, ω*) = 0, ∇_ω U(θ*, ω*) = 0;
(2) ∇²_θθ U(θ*, ω*) ⪰ 0, ∇²_ωω U(θ*, ω*) ⪯ 0.

Here we use ∇²_θω U(θ, ω) ∈ R^{p×q} to denote the off-diagonal term ∂²U/∂θ∂ω, and name it the interaction term throughout the paper; ∇_θ U(θ, ω) ∈ R^p denotes the gradient ∂U/∂θ, and ∇²_θθ U(θ, ω) ∈ R^{p×p} the Hessian ∂²U/∂θ∂θ. In practice, discrete-time dynamical systems are employed to numerically approach the saddle points
of U(θ, ω), as is the case in GANs (Goodfellow et al., 2014) and in primal-dual methods for non-linear optimization (Singh et al., 2000). The simplest possibility is Simultaneous Gradient Ascent (SGA), which corresponds to the following discrete-time dynamical system:

(1.3)  θ_{t+1} = θ_t − η ∇_θ U(θ_t, ω_t),
       ω_{t+1} = ω_t + η ∇_ω U(θ_t, ω_t),

where η is the step size or learning rate. In the limit of vanishing step size, SGA approximates a continuous-time autonomous dynamical system, the asymptotic convergence of which has been established in Singh et al. (2000); Cherukuri et al. (2017); Nagarajan and Kolter (2017). In practice, however, it has been widely reported that the discrete-time SGA dynamics for GAN optimization suffers from instabilities, due to the possibility of complex eigenvalues in the operator of the dynamical system (Salimans et al., 2016; Metz et al., 2016; Nagarajan and Kolter, 2017; Mescheder et al., 2017; Heusel et al., 2017). We believe room for improvement still exists in the current theory, which we hope will render it more informative in practice:
Non-asymptotic convergence speed. In practice one is concerned with a finite step size η > 0, which is typically subject to extensive hyper-parameter tuning. Detailed characterizations of the convergence speed, and theoretical insights on the choice of learning rate, can be helpful.

Unified simple analysis for modified saddle point dynamics. Several attempts to fix GAN optimization have been put forth by independent researchers, which modify the dynamics (Mescheder et al., 2017; Daskalakis et al., 2017; Yadav et al., 2017) using very different insights. A unified analysis that reveals the deeper connections amongst these proposals helps to better understand the saddle point dynamics at large.

In this paper we address the above points by studying the theory of non-asymptotic convergence of SGA and related discrete-time saddle point dynamics, namely Optimistic Mirror Descent (OMD), Consensus Optimization (CO) and Predictive Method (PM). More concretely, we provide the following theoretical contributions about the crucial effect of the off-diagonal interaction term ∇²_θω U(θ, ω) in two-player games:

Stable case: curse of the interaction term. Locally, SGA converges exponentially fast to a stable Nash equilibrium with a carefully chosen learning rate. This can be viewed as a generalization (rather than a special case) of the local convergence guarantee for single-player gradient descent on strongly convex functions. In addition, we quantitatively isolate the slow-down in the convergence rate of two-player SGA, compared to single-player gradient descent, due to the presence of the off-diagonal interaction term ∇²_θω U(θ, ω) in the two-player game.

Unstable case: blessing of the interaction term. For unstable Nash equilibria, SGA diverges for any non-zero learning rate. We discover a unified non-asymptotic analysis that encompasses the three proposed modified dynamics, OMD, CO and PM. The analysis shows that, at a high level, all these algorithms share the same idea of utilizing the curvature introduced by the interaction term ∇²_θω U(θ, ω).
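As a quick numerical illustration of the stable-case slow-down (our own toy instance, not an experiment from the paper), consider the scalar game U(θ, ω) = ½θ² − ½ω² + cθω. Here A = B = 1 and C = c, so α = 1 and β = 1 + c², and the theory's prescribed learning rate is η = α/β = 1/(1 + c²); a larger interaction c slows the contraction.

```python
import numpy as np

def sga(c, eta, steps=100, z0=(1.0, 1.0)):
    """Run simultaneous gradient ascent on U(t, w) = t^2/2 - w^2/2 + c*t*w."""
    t, w = z0
    for _ in range(steps):
        gt = t + c * w          # dU/dtheta
        gw = -w + c * t         # dU/domega
        t, w = t - eta * gt, w + eta * gw   # simultaneous update
    return np.hypot(t, w)       # distance to the equilibrium (0, 0)

# Theory for this instance: alpha = 1, beta = 1 + c^2, eta = 1/(1 + c^2).
weak = sga(c=0.5, eta=1 / (1 + 0.5**2))
strong = sga(c=3.0, eta=1 / (1 + 3.0**2))
print(weak, strong)  # both contract, but the strong-interaction run is far slower
```

After 100 steps both runs have moved toward the saddle point, but the run with c = 3 is many orders of magnitude farther away than the run with c = 0.5, matching the per-step contraction factor (1 − α²/β)^{1/2} = (c²/(1 + c²))^{1/2}.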
Unlike the slow sub-linear rate of convergence experienced by single-player gradient descent on non-strongly convex functions (in fact, Nesterov (2013) constructed a convex function that is non-strongly convex such that all first-order methods suffer a slow sub-linear rate of convergence; in the optimization literature, "linear rate" refers to exponential convergence speed), OMD/CO/PM effectively exploit the interaction term to achieve exponential convergence to unstable Nash equilibria. The analysis also provides specific advice on the choice of learning rate for each procedure, albeit restricted to the simple case of bi-linear games.

The organization of the paper is as follows. In Section 2 we consider the (admittedly idealized) situation when, locally, the value function satisfies strict convexity/concavity. We show non-asymptotic exponential convergence to Nash equilibria for SGA and identify an optimized learning rate. To reveal and understand the new features of the modified dynamics, we study in Section 3 the minimal unstable bilinear game, showing that the proposed stabilizing techniques all achieve exponential convergence to unstable Nash equilibria. Finally, in Section 4 we take a step closer to the real world by numerically evaluating each of the proposed dynamical systems using value functions of GAN form (1.2) under objective evaluation metrics. Detailed proofs are deferred to Section 5.

2 STABLE CASE: NON-ASYMPTOTIC LOCAL CONVERGENCE

In this section we establish the non-asymptotic convergence of the SGA dynamics to saddle points that are stable local Nash equilibria. With a properly chosen learning rate, the local convergence can be intuitively pictured as cycling inwards to these saddle points, where the distance to the saddle point of interest is exponentially contracting. First, let us introduce the notion of stable equilibrium.

Definition 2 (Stable Local Nash Equilibrium). (θ*, ω*) is called a stable local Nash equilibrium if
(1) ∇_θ U(θ*, ω*) = 0, ∇_ω U(θ*, ω*) = 0;
(2) ∇²_θθ U(θ*, ω*) ≻ 0, ∇²_ωω U(θ*, ω*) ≺ 0.

The above notion of stability is stronger than Definition 1 in the sense that ∇²_θθ U(θ*, ω*) and −∇²_ωω U(θ*, ω*) have smallest eigenvalues bounded away from 0.

Assumption 1 (Local Strong Convexity-Concavity). Consider U(θ, ω) : R^p × R^q → R that is smooth and twice differentiable, and let (θ*, ω*) be a stable local Nash equilibrium as in Definition 2. Assume that for some r > 0 there exists an open neighborhood of (θ*, ω*) such that for all (θ, ω) ∈ B((θ*, ω*), r) the following strong convexity-concavity condition holds:

∇²_θθ U(θ, ω) ≻ 0,  ∇²_ωω U(θ, ω) ≺ 0.

It will prove convenient to introduce some notation before stating the main theorem. Let us define the following block-wise abbreviation for the matrix of second derivatives,

(2.1)  [ ∇²_θθ U(θ, ω)   ∇²_θω U(θ, ω) ;  ∇²_ωθ U(θ, ω)   ∇²_ωω U(θ, ω) ]  =:  [ A_θω   C_θω ;  C_θωᵀ   −B_θω ],

and define α, β as

(2.2)  α := min_{(θ,ω) ∈ B((θ*,ω*), r)} λ_min( [ A_θω   0 ;  0   B_θω ] ),
       β := max_{(θ,ω) ∈ B((θ*,ω*), r)} λ_max( [ A_θω² + C_θω C_θωᵀ    A_θω C_θω − C_θω B_θω ;  C_θωᵀ A_θω − B_θω C_θωᵀ    B_θω² + C_θωᵀ C_θω ] ),

where λ_max(M), λ_min(M) denote the largest and smallest eigenvalues of the matrix M.

Theorem 1 (Exponential Convergence: SGA). Consider U(θ, ω) : R^p × R^q → R that satisfies Assumption 1 for some radius r > 0 near a stable local Nash equilibrium (θ*, ω*) as in Definition 2, with an initialization satisfying (θ₀, ω₀) ∈ B((θ*, ω*), r). Then the SGA dynamics (1.3) with fixed learning rate η = α/β (α, β defined in Eqn. (2.2)) obtains an ε-minimizer, (θ_T, ω_T) ∈ B((θ*, ω*), ε), as long as

T ≥ T_SGA := (β/α²) log(r/ε).

Remark 1. It is interesting to compare the convergence speed of the saddle point dynamics to conventional gradient descent in one variable on a strongly convex function. We remind the reader that to obtain an ε-minimizer of a strongly convex function, one needs the following number of iterations
of gradient descent:

T_GD := max{ (λ_max(A_θω)/λ_min(A_θω))² , (λ_max(B_θω)/λ_min(B_θω))² } · log(r/ε),

depending on whether we are optimizing with respect to θ or ω, respectively. It is now evident that, due to the presence of the interaction term C_θω, the convergence of two-player SGA to a saddle point can be significantly slower than the convergence of single-player gradient descent. Note the following inequality,

λ_max( [ A_θω² + C_θω C_θωᵀ    A_θω C_θω − C_θω B_θω ;  C_θωᵀ A_θω − B_θω C_θωᵀ    B_θω² + C_θωᵀ C_θω ] )  ≥  λ_max(A_θω² + C_θω C_θωᵀ)  ≥  λ_max(A_θω)²,

which follows from the fact that the LHS is lower-bounded by the largest eigenvalue of its top-left block, λ_max(A_θω² + C_θω C_θωᵀ), which in turn is at least λ_max(A_θω²) = λ_max(A_θω)². Therefore the convergence of SGA is slower than that of conventional GD: T_SGA ≥ T_GD. We would like to emphasize that, for saddle point convergence, the slow-down effect of the interaction term C_θω is explicit in our non-asymptotic analysis.

The intuition that the discrete-time SGA dynamics cycles inward to a stable Nash equilibrium exponentially fast can be seen in the following way. The presence of the off-diagonal anti-symmetric component in Eqn. (2.1) means that the associated linear operator of the discrete-time dynamics has complex eigenvalues, which results in periodic cycling behavior. However, due to the explicit choice of η, the distance to the stable Nash equilibrium shrinks exponentially fast.

The local exponential stability in the infinitesimal/asymptotic case η → 0 has already been studied in a nice paper (Nagarajan and Kolter, 2017, Theorem 3.1 therein), by showing that the Jacobian matrix of a particular form of GAN objective is Hurwitz (all eigenvalues have strictly negative real parts). There are two distinct differences in our result: (1) we provide non-asymptotic convergence with specific guidance on the choice of learning rate η; (2) our analysis goes through the singular values (which are rather different from the moduli of the eigenvalues for a general matrix) instead of involving the complex eigenvalues, and this simple technique generalizes to three other modified saddle point dynamics, which we discuss in the next section.

3 UNSTABLE CASE: LOCAL BI-LINEAR PROBLEM

Oscillation and instability for SGA occur when the problem is
non-strongly convex-concave, as in the bi-linear game (or, more precisely, at least linear in one player). This observation was first pointed out using a very simple linear game U(x, y) = xy in Salimans et al. (2016). More generally, as a result of Theorem 1, this phenomenon occurs when the local Nash equilibrium is non-stable:

λ_min( [ A_θ*ω*   0 ;  0   B_θ*ω* ] ) = 0.

Let us consider the extreme case A_θ*ω* = B_θ*ω* = 0. In this case we will show that (1) an improved analysis for Optimistic Mirror Descent (OMD), introduced in Daskalakis et al. (2017), (2) a modified version of Predictive Methods (PM), motivated from Yadav et al. (2017), and (3) Consensus Optimization (CO), introduced in Mescheder et al. (2017), can fix the oscillation problem and provide exponential convergence to the unstable Nash equilibrium, using a novel unified non-asymptotic analysis. Our analysis shows that these stabilizing techniques, at a high level, all manipulate the dynamics to utilize the curvature generated by the interaction term C_θ*ω*, which we refer to as the blessing of the interaction term, to contrast with the slow-down effect of the interaction term in the strongly convex-concave case. Once again, as alluded to in the introduction, this fast linear-rate convergence result (in the non-strongly convex-concave two-player game) should be compared to the significantly slower sub-linear convergence rate of all first-order methods in convex but non-strongly convex optimization (one player); the latter was proved by a lower bound argument in Nesterov (2013, Theorem 2.1.7). We are ready to state, informally, the main result proved in this section.

Theorem 3.1 (Informal: Unstable Case). All three modified dynamics in the bi-linear game enjoy a last-iterate exponential convergence guarantee.

We start with the minimal unstable bilinear game, explaining why oscillation can happen when the game is non-strongly convex-concave, and to reveal and understand the new features of the modified dynamics. This bilinear game can be motivated by considering the lowest-order truncation of the Taylor expansion of a general smooth two-player game around a Nash equilibrium:

(3.1)  U(θ, ω) = ½ θᵀAθ − ½ ωᵀBω + θᵀCω + higher order terms,

assuming that (θ*, ω*) = (0, 0). Now consider the simple bi-linear game U(θ, ω) = θᵀCω. With the SGA dynamics defined in (1.3), one can easily verify that

‖θ_{t+1}‖² + ‖ω_{t+1}‖² ≥ (1 + η² λ_min(CCᵀ)) ( ‖θ_t‖² + ‖ω_t‖² ).

Therefore, in the continuous limit η → 0 the dynamics cycles around a sphere, while with any practical learning rate η > 0 the distance to the Nash equilibrium increases exponentially instead of converging. Per Theorem 1 and the discussion above, instability for SGA only occurs when the local game is approximately bi-linear. From now on we focus on the simplest unstable form of the game, the bi-linear game, to isolate the main idea behind fixing the instability.

3.1 (Improved) Optimistic Mirror Descent

Daskalakis et al. (2017) employed Optimistic Mirror Descent (OMD), motivated by online learning, to solve the instability problem in GANs. Here we provide a stronger result, showing that the last iterate of OMD enjoys exponential convergence for
bi-linear games. We note that, although the last-iterate convergence of this OMD procedure was already rigorously proved in Daskalakis et al. (2017), the exponential convergence was, to the best of our knowledge, not known. Roughly speaking, for an initialization at distance r, Daskalakis et al. (2017, Theorem 1) only guarantees that the last iterate of OMD approaches a neighborhood of the equilibrium for a small enough learning rate η (under some additional assumptions), and convergence only happens in the limit of vanishing step size η → 0. In contrast, we prove that after t iterations, with a properly chosen step size η, the last iterate of OMD is within distance

(1 − λ_min(C)²/(4 λ_max(C)²))^{t/2} · 2r

of the equilibrium. This improved theory also explains the exponential convergence found in simulations.

Lemma 3.1 (Exponential Convergence: OMD). Consider a bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank, and let λ_max(C), λ_min(C) denote the largest and smallest singular values of C. Then the OMD dynamics

(3.2)  θ_{t+1} = θ_t − 2η ∇_θ U(θ_t, ω_t) + η ∇_θ U(θ_{t−1}, ω_{t−1}),
       ω_{t+1} = ω_t + 2η ∇_ω U(θ_t, ω_t) − η ∇_ω U(θ_{t−1}, ω_{t−1}),

with learning rate η = 1/(2 λ_max(C)), obtains an ε-minimizer, (θ_T, ω_T) ∈ B(0, ε), as long as

T ≥ T_OMD := (8 λ_max(C)²/λ_min(C)²) · log( 4r (1 + λ_min(C)²/(4 λ_max(C)²)) / ε ),

under the assumption that ‖(θ₀, ω₀)‖, ‖(θ₁, ω₁)‖ ≤ r.

3.2 (Modified) Predictive Methods

From a very different motivation, in terms of ODEs, Yadav et al. (2017) proposed Predictive Methods (PM) to fix the instability problem. The intuition is to evaluate the gradient at a predicted future location, then perform the update. In this section we propose and analyze a modified version of the predictive method (for simultaneous gradient updates) inspired by Yadav et al. (2017). Consider the following modified PM dynamics:

(3.3)  predictive step:  θ_{t+1/2} = θ_t − γ ∇_θ U(θ_t, ω_t),   ω_{t+1/2} = ω_t + γ ∇_ω U(θ_t, ω_t);
       gradient step:    θ_{t+1} = θ_t − η ∇_θ U(θ_{t+1/2}, ω_{t+1/2}),   ω_{t+1} = ω_t + η ∇_ω U(θ_{t+1/2}, ω_{t+1/2}).

Lemma 3.2 (Exponential Convergence: PM). Consider a bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank. Fix some γ > 0. Then the PM dynamics in Eqn. (3.3) with learning rate

η = γ λ_min(C)² / ( λ_max(C)² + γ² λ_max(C)⁴ )

obtains an ε-minimizer, (θ_T, ω_T) ∈ B(0, ε), as long as

T ≥ T_PM := ( λ_max(C)² + γ² λ_max(C)⁴ ) / ( γ² λ_min(C)⁴ ) · log(r/ε),

under the assumption that ‖(θ₀, ω₀)‖ ≤ r.

3.3 Consensus Optimization

Consensus Optimization (CO) is another elegant attempt to fix the aforementioned problem, proposed in Mescheder et al. (2017). The authors' motivation is to add an extra potential vector field (the consensus part) on top of the current curl vector field (for the SGA of the bi-linear problem), in order to attract the dynamics to the critical points. Mescheder et al. (2017); Nagarajan and Kolter (2017) analyzed the infinitesimal-flow version of consensus optimization, and intuitively showed that it pushes the real parts of the eigenvalues away from 0 to ensure asymptotic convergence. In this section we provide a simple convergence analysis of the discretized dynamics, of the same flavor as
the previous section. An upshot of the analysis is that it sheds light on a possible choice of learning rate. Recall that the regularization term defining consensus optimization is given by

(3.4)  R(θ, ω) = ( ‖∇_θ U(θ, ω)‖² + ‖∇_ω U(θ, ω)‖² ) / 2.

We are ready to state the lemma. Surprisingly, we find that consensus optimization coincides with the modified predictive method for the bi-linear game.

Lemma 3.3 (Exponential Convergence: CO). Consider a bi-linear game U(θ, ω) = θᵀCω. Assume p = q and C is full rank. Recall R(θ, ω) defined in Eqn. (3.4), and fix some γ > 0. Then the CO dynamics

(3.5)  θ_{t+1} = θ_t − η ( ∇_θ U(θ_t, ω_t) + γ ∇_θ R(θ_t, ω_t) ),
       ω_{t+1} = ω_t + η ( ∇_ω U(θ_t, ω_t) − γ ∇_ω R(θ_t, ω_t) ),

with the same learning rate η as in Lemma 3.2, converges exponentially fast in the same way as the PM dynamics in Lemma 3.2.

4 EXPERIMENTS

Lucic et al. (2017) recently conducted a large-scale study of GANs and found that improvements from algorithmic changes mostly disappear after taking into account hyper-parameter tuning and randomness of initialization. They conclude that future GAN research should be based on more systematic and objective evaluation procedures. Inspired by this conclusion, we conduct a systematic evaluation of the proposed optimization algorithms on two basic density learning problems, and introduce corresponding objective evaluation metrics. The goal of this analysis is not to achieve state-of-the-art performance, but rather to compare and contrast the existing proposals in a carefully controlled learning environment.

We focus on the Wasserstein GAN formulation, so that the value function is given by

(4.1)  U(θ, ω) = E_{X ∼ P_real} f_ω(X) − E_{Z ∼ N(0, I_k)} f_ω(g_θ(Z)),

where f_ω : R^d → R is a multi-layer neural network with L hidden layers and rectifier non-linearities, and the input distribution P_input was chosen to be k-dimensional standard Gaussian noise. Following Gulrajani et al. (2017), we impose the Lipschitz-1 constraint on the discriminator network using the two-sided gradient penalty term Λ(ω) introduced in Gulrajani et al. (2017, Eqn. (3)). The consensus optimization loss is defined with respect to the value function as in (3.4), without including the gradient penalty. The combined loss functions of the discriminator and generator are, respectively,

(4.2)  L_dis(θ, ω) = −U(θ, ω) + γ R(θ, ω) + λ Λ(ω),
(4.3)  L_gen(θ, ω) = U(θ, ω) + γ R(θ, ω).

The coefficients of the gradient penalty and consensus optimization terms were determined by a coarse parameter search and then locked to λ = γ = 1 throughout. In order to make close contact with our theoretical formalism, we optimize the above loss functions on an alternating schedule, using vanilla gradient descent with a fixed learning rate of η = 10⁻³. To ensure reproducibility, all algorithms were independently implemented on top of the TFGAN and Keras frameworks.
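Before turning to the GAN experiments, the bi-linear-game guarantees of Section 3 are easy to sanity-check numerically. The sketch below is our own toy instance (the matrix `C`, the initialization, and the step sizes are arbitrary illustrative choices, not the paper's code): SGA spirals outward, while OMD and the PM map, which by Lemma 3.3 coincides with CO on the bi-linear game, contract exponentially.

```python
import numpy as np

C = np.diag([1.0, 0.5])            # full-rank interaction matrix; singular values 1.0 and 0.5
z0 = (np.ones(2), np.ones(2))      # initialization (theta_0, omega_0)

def norm(th, om):
    return float(np.hypot(np.linalg.norm(th), np.linalg.norm(om)))

def sga(eta, n=400):
    th, om = z0
    for _ in range(n):
        th, om = th - eta * C @ om, om + eta * C.T @ th
    return norm(th, om)

def pm(eta, gamma, n=400):
    # Predictive step, then gradient step at the predicted point.
    # By Lemma 3.3, consensus optimization induces the same linear map here.
    th, om = z0
    for _ in range(n):
        th_h, om_h = th - gamma * C @ om, om + gamma * C.T @ th
        th, om = th - eta * C @ om_h, om + eta * C.T @ th_h
    return norm(th, om)

def omd(eta, n=400):
    # Optimistic update: step with twice the current gradient minus the previous one.
    (th, om) = (th_p, om_p) = z0
    for _ in range(n):
        th_new = th - 2 * eta * C @ om + eta * C @ om_p
        om_new = om + 2 * eta * C.T @ th - eta * C.T @ th_p
        th_p, om_p, th, om = th, om, th_new, om_new
    return norm(th, om)

print(sga(0.2))        # grows: SGA spirals outward on the bi-linear game
print(omd(0.5))        # eta = 1/(2 * lambda_max(C)) as in Lemma 3.1; contracts
print(pm(0.25, 1.0))   # contracts as well
```

The distances after 400 iterations behave as the lemmas predict: SGA's distance to the equilibrium has blown up, while OMD and PM/CO have driven it many orders of magnitude below its initial value of 2.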
4.1 Learning the Covariance of a Multivariate Gaussian

Consider the problem of learning the covariance matrix Σ ∈ S^d_{++} of a d-dimensional multivariate Gaussian distribution P_real = N(0, Σ) with non-degenerate covariance Σ. Note that the learning problem is well-specified if we choose the generator function g_θ : R^k → R^d to be a simple linear transformation of the k-dimensional latent space (k ≥ d). Although the GAN approach is clearly overkill for this simple density estimation problem, we find this example illuminating because it affords some analytical tractability for the otherwise intractable general GAN value function (1.2). Specifically, if we choose the discriminator function f_ω : R^d → R to be a neural network with L = 1 hidden layer consisting of H hidden units with rectifier non-linearities, and set the biases to zero, then the explicit functional forms of the discriminator and generator are, respectively,

(4.4)  f_ω(x) = Σ_{i=1}^H v_i ⟨w_i, x⟩ 1{⟨w_i, x⟩ ≥ 0},   g_θ(z) = V z,

where ω = {w_i ∈ R^d, v_i ∈ R : i ≤ H} and θ = {V ∈ R^{d×k}} are the discriminator and generator parameters, respectively. If, moreover, we express the covariance matrix as Σ = AAᵀ, then the value function can be expressed in closed form as

(4.5)  U(θ, ω) = const · Σ_{i=1}^H v_i ( ‖Aᵀw_i‖ − ‖Vᵀw_i‖ ).

The above analytical form of the value function sheds some light on the nature of the local Nash equilibrium solution concept. In particular, if one solves the condition for being a Nash equilibrium, one does not conclude that VVᵀ = Σ: the result depends on the rank of the matrix [w_1, …, w_H].

The evaluation of the different optimization algorithms involved comparing the target density P_real = N(0, Σ) and the analytical generator density P_fake = N(0, VVᵀ) after training (Fig. 1). For simplicity, we chose the evaluation metric to be the Frobenius norm of the difference between the covariance matrices, ‖Σ − VVᵀ‖_F. The covariance learning experiments were conducted in the well-specified and over-parametrized regime (k = 16), using H = 18 hidden units for the discriminator network. We also performed a head-to-head comparison of simultaneous and alternating update methods on this distribution and found negligible difference (Fig. 2).

4.2 Mixture of Gaussians

In practical applications GANs are typically trained using the empirical distribution of the samples, where the samples are drawn from an idealized multi-modal probability distribution. To capture the notion of a multi-modal data distribution, we focus on a mixture of 8 Gaussians with means located at the vertices of a regular octagon inscribed in the unit circle, where each component has a fixed diagonal covariance of width σ = 3. In contrast to previous visual-based evaluations, we estimate the Wasserstein-1 distance W₁(P_real, P_fake) between the target density P_real and the distribution P_fake of the random variable g_θ(Z) implied by the trained generator network. The estimate is obtained by solving a linear program which computes the exact Wasserstein-1 distance between the sample estimates P̂_real = (1/m) Σ_{i=1}^m δ_{X_i} and P̂_fake = (1/m) Σ_{i=1}^m δ_{g_θ(Z_i)}, respectively, and approaches the population value W₁(P_real, P_fake) as the number of samples m → ∞. The experiments with the mixture of Gaussians used a k-dimensional standard Gaussian as input. Both the generator and discriminator networks consisted of L = 4 hidden layers with H = 18 units per hidden layer. The estimate of the Wasserstein-1 distance was calculated using a sample size of
m = 51, after training for t iterations. It is clear from Fig. 3 that the Wasserstein-1 distance correlates closely with the visual fit to the target distribution. The empirical evaluation (Fig. 1) shows that the exponential separation between consensus optimization and the competing algorithms disappears on the mixture distribution, suggesting that the qualitative ranking is not robust to the choice of loss landscape. These findings demand a deeper understanding of the global structure of the landscape, which is not captured by our local stability analysis.

Fig. 1: Evaluation metrics for covariance learning (left, ‖Σ − VVᵀ‖_F) and mixture-of-Gaussians learning (right, Wasserstein distance) for the different dynamical systems (SGA, Predictive, Optimism, Consensus) after t training iterations and 16 random seeds. Note that for covariance learning we use a log scale on the y-axis.

Fig. 2: Comparison of alternating/simultaneous gradient updates on the covariance learning evaluation metric (Frobenius norm) after t training iterations and 16 random seeds.
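The linear-program estimate of W₁ described above can be sketched concretely: with equal sample sizes and uniform weights 1/m, the optimal transport plan can be taken to be a permutation, so the LP reduces to a min-cost matching. The helper below is our own illustration (assuming SciPy is available), not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein1(x, y):
    """Exact W1 between two equal-size empirical distributions.

    With uniform weights 1/m the optimal-transport linear program is
    attained at a permutation matrix, so it reduces to an assignment problem.
    """
    # pairwise Euclidean distances between the two point clouds
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)   # min-cost matching
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 2))
print(wasserstein1(x, x))                           # ~0: identical samples
print(wasserstein1(x, x + np.array([3.0, 4.0])))    # ~5: a pure translation moves every point by 5
```

The second check uses the fact that for a translation by a vector v, the identity matching costs ‖v‖ per point, and the 1-Lipschitz function f(z) = ⟨z, v/‖v‖⟩ certifies that no coupling can do better.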
Fig. 3: Density plots of the worst/best generator distributions, ranked by Wasserstein-1 distance across all baselines amongst 16 random seeds after training. Left: SGA (W₁ = 17). Right: OMD (W₁ = 66).

5 TECHNICAL PROOFS

Proof of Theorem 1. Define the line interpolation between two points, θ(x) = xθ_t + (1 − x)θ*, ω(x) = xω_t + (1 − x)ω*. Then the SGA dynamics can be written as

[θ_{t+1} − θ*; ω_{t+1} − ω*] = [θ_t − θ*; ω_t − ω*] − η [∇_θ U(θ_t, ω_t); −∇_ω U(θ_t, ω_t)]
  = ( I − η ∫₀¹ [ ∇²_θθ U(θ(x), ω(x))   ∇²_θω U(θ(x), ω(x)) ;  −∇²_ωθ U(θ(x), ω(x))   −∇²_ωω U(θ(x), ω(x)) ] dx ) [θ_t − θ*; ω_t − ω*].

Assume one can prove that, for some r > 0 and all (θ, ω) ∈ B((θ*, ω*), r), with a proper choice of η the largest singular value is bounded above by 1:

‖ I − η [ ∇²_θθ U(θ, ω)   ∇²_θω U(θ, ω) ;  −∇²_ωθ U(θ, ω)   −∇²_ωω U(θ, ω) ] ‖_op < 1.

Then, by convexity of the operator norm, the SGA dynamics is locally contracting:

‖ [θ_{t+1} − θ*; ω_{t+1} − ω*] ‖ ≤ ∫₀¹ ‖ I − η [ ⋯ ] ‖_op dx · ‖ [θ_t − θ*; ω_t − ω*] ‖ < ‖ [θ_t − θ*; ω_t − ω*] ‖.

Let us analyze the singular values of I − ηM, where, in the notation of (2.1),

M := [ A   C ;  −Cᵀ   B ],  A = ∇²_θθ U(θ, ω) ≻ 0,  B = −∇²_ωω U(θ, ω) ≻ 0,  C = ∇²_θω U(θ, ω).

The largest singular value of I − ηM is the square root of the largest eigenvalue of the symmetric matrix

(I − ηM)ᵀ(I − ηM) = [ (I − ηA)² + η²CCᵀ    η²(AC − CB) ;  η²(CᵀA − BCᵀ)    (I − ηB)² + η²CᵀC ].

It is clear that when η = 0 the largest eigenvalue of the above matrix is 1. Observe

(I − ηM)ᵀ(I − ηM) = I − 2η [ A   0 ;  0   B ] + η² [ A² + CCᵀ    AC − CB ;  CᵀA − BCᵀ    B² + CᵀC ]
  ⪯ ( 1 − 2η λ_min([ A  0 ; 0  B ]) + η² λ_max([ A² + CCᵀ   AC − CB ;  CᵀA − BCᵀ   B² + CᵀC ]) ) · I.

If we choose η = α/β, with α, β as in (2.2), then

(I − ηM)ᵀ(I − ηM) ⪯ (1 − α²/β) · I.

In this case

sup_{(θ,ω) ∈ B((θ*,ω*), r)} ‖ I − ηM ‖_op ≤ (1 − α²/β)^{1/2},

so ‖[θ_{t+1} − θ*; ω_{t+1} − ω*]‖ ≤ (1 − α²/β)^{1/2} ‖[θ_t − θ*; ω_t − ω*]‖. Therefore, to obtain an ε-minimizer, one requires a number of steps of order (β/α²) log(r/ε). ∎

Proof of Lemma 3.1. Writing z_t := [θ_t; ω_t], the OMD dynamics iteratively updates

z_{t+1} = (I − 2ηE) z_t + ηE z_{t−1},  where E := [ 0   C ;  −Cᵀ   0 ].
Define the following matrices, where E := [ 0  C ; −Cᵀ  0 ] is the anti-symmetric operator of the OMD recursion:

(5.1)  R₁ = ½ [ (I − 2ηE) + (I + 4η²E²)^{1/2} ],
(5.2)  R₂ = ½ [ (I − 2ηE) − (I + 4η²E²)^{1/2} ].

It is easy to verify that

R₁ + R₂ = I − 2ηE,   R₁R₂ = R₂R₁ = ¼ [ (I − 2ηE)² − (I + 4η²E²) ] = −ηE.

The commutative property R₁R₂ = R₂R₁ follows from a singular value decomposition argument: letting C = UDVᵀ be the SVD of C (D diagonal), one finds

(I − 4η²CCᵀ)^{1/2} C = U (I − 4η²D²)^{1/2} D Vᵀ = U D (I − 4η²D²)^{1/2} Vᵀ = C (I − 4η²CᵀC)^{1/2},

so E commutes with (I + 4η²E²)^{1/2} (note that E² = −diag(CCᵀ, CᵀC)). Using this, the recursion z_{t+1} = (R₁ + R₂) z_t − R₁R₂ z_{t−1} gives the relations

z_{t+1} − R₁ z_t = R₂ (z_t − R₁ z_{t−1}),   z_{t+1} − R₂ z_t = R₁ (z_t − R₂ z_{t−1}).

Hence

(5.3)  (R₁ − R₂) z_{t+1} = R₁^{t+1} (z₁ − R₂ z₀) − R₂^{t+1} (z₁ − R₁ z₀).

Let us analyze the spectra of R₁ and R₂. Since E commutes with S := (I + 4η²E²)^{1/2} and Eᵀ = −E,

R₁R₁ᵀ = ¼ (I + S)² − η²E² = ¼ (I + S)² + η² diag(CCᵀ, CᵀC),
R₂R₂ᵀ = ¼ (I − S)² − η²E² = ¼ (I − S)² + η² diag(CCᵀ, CᵀC).

For η small enough the spectral radius satisfies the strict inequality ‖R₁‖_op < 1. Concretely, take η = 1/(2λ_max(C)): for every eigenvalue s² of diag(CCᵀ, CᵀC) (s a singular value of C), the corresponding eigenvalue of R₁R₁ᵀ is

¼ (1 + (1 − 4η²s²)^{1/2})² + η²s² = ½ (1 + (1 − 4η²s²)^{1/2}) ≤ 1 − η²s²,

and similarly the eigenvalues of R₂R₂ᵀ equal ½ (1 − (1 − 4η²s²)^{1/2}) ≤ ½. Therefore

‖R₁‖_op ≤ (1 − λ_min(C)²/(4λ_max(C)²))^{1/2},   ‖R₂‖_op ≤ 1/√2 ≤ ‖R₁‖_op.

Due to Eqn. (5.3), the RHS is upper bounded:

‖ R₁^{t+1}(z₁ − R₂z₀) − R₂^{t+1}(z₁ − R₁z₀) ‖ ≤ ( ‖R₁‖_op^{t+1} + ‖R₂‖_op^{t+1} ) ( ‖z₁‖ + ‖z₀‖ ) ≤ 4r ( 1 − λ_min(C)²/(4λ_max(C)²) )^{(t+1)/2}.

To sum up, combining this decay of the RHS with a lower bound on the LHS of Eqn. (5.3) in terms of ‖z_{t+1}‖, one can ensure ‖(θ_{t+1}, ω_{t+1})‖ ≤ ε once

t ≥ (8 λ_max(C)²/λ_min(C)²) · log( 4r (1 + λ_min(C)²/(4λ_max(C)²)) / ε ),

which is the claimed T_OMD. ∎
Proof of Lemma 3.3. In this case consensus optimization satisfies the following linear update:

(5.4)  [θ_{t+1}; ω_{t+1}] = ( I − η [ γCCᵀ   C ;  −Cᵀ   γCᵀC ] ) [θ_t; ω_t].

Let us analyze the singular values of the operator K := I − η [ γCCᵀ  C ; −Cᵀ  γCᵀC ], or equivalently the eigenvalues of KKᵀ. Since the cross terms cancel (CCᵀ·C = C·CᵀC),

KKᵀ = [ (I − ηγCCᵀ)² + η²CCᵀ    0 ;  0    (I − ηγCᵀC)² + η²CᵀC ].

Now consider the largest eigenvalue of (I − ηγCCᵀ)² + η²CCᵀ for a fixed γ, with a properly chosen η. Using the SVD C = UDVᵀ, we obtain

(I − ηγCCᵀ)² + η²CCᵀ = U [ (I − ηγD²)² + η²D² ] Uᵀ
  ⪯ ( 1 − 2ηγ λ_min(C)² + η² ( γ² λ_max(C)⁴ + λ_max(C)² ) ) · I
  = ( 1 − γ² λ_min(C)⁴ / ( γ² λ_max(C)⁴ + λ_max(C)² ) ) · I,

with η = γ λ_min(C)² / ( λ_max(C)² + γ² λ_max(C)⁴ ). ∎

Proof of Lemma 3.2. In the simple bi-linear game, the PM dynamics (3.3) reads

[θ_{t+1}; ω_{t+1}] = [θ_t; ω_t] − η [ C ω_{t+1/2} ;  −Cᵀ θ_{t+1/2} ] = ( I − η [ γCCᵀ   C ;  −Cᵀ   γCᵀC ] ) [θ_t; ω_t].

Note that this linear system is the same as that in Lemma 3.3; therefore the convergence analysis follows in the same way as for Lemma 3.3. ∎

REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint, 2017.

Sanjeev Arora and Yi Zhang. Do GANs actually learn the distribution? An empirical study. arXiv preprint, 2017.

Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv preprint, 2017.
Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1).

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems.

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems.

Ferenc Huszár. GANs are broken in more than one way: The numerics of GANs. 2017. URL vc/my-notes-on-the-numerics-of-gans/.

Tengyuan Liang. How well can generative adversarial networks learn densities: A nonparametric view. arXiv preprint.

Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. arXiv preprint.

Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv preprint.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. arXiv preprint.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint.

Vaishnavh Nagarajan and J. Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems.

David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. arXiv preprint.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems.

Satinder Singh, Michael Kearns, and Yishay Mansour. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.

Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, and Tom Goldstein. Stabilizing adversarial nets with prediction methods. Accepted at ICLR.
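The contraction argument in the two proofs above can be checked numerically: build the operator K for the bilinear game and verify that its largest singular value stays below the bound derived from the chosen step size. A minimal NumPy sketch follows; the matrix C and the regularization parameter gamma are illustrative choices, not values from the paper.

```python
import numpy as np

# Sanity check for Lemmas 3.3/3.4: for the bilinear game U(theta, omega) = theta^T C omega,
# the consensus-optimization / predictive-method operator
#   K = [[I - eta*gamma*C C^T, -eta*C], [eta*C^T, I - eta*gamma*C^T C]]
# should be a strict contraction when
#   eta = gamma*lmin / (lmax + gamma^2*lmax^2),
# where lmin, lmax are the extreme eigenvalues of C C^T.
rng = np.random.default_rng(0)
n = 4
C = rng.standard_normal((n, n))   # illustrative interaction matrix (assumption)
gamma = 0.5                       # illustrative regularization parameter (assumption)

eigs = np.linalg.eigvalsh(C @ C.T)
lmin, lmax = eigs[0], eigs[-1]
eta = gamma * lmin / (lmax + gamma**2 * lmax**2)

I = np.eye(n)
K = np.block([
    [I - eta * gamma * C @ C.T, -eta * C],
    [eta * C.T, I - eta * gamma * C.T @ C],
])

# Largest singular value of K versus the bound from the proof:
# sigma_max(K)^2 <= 1 - gamma^2*lmin^2 / (gamma^2*lmax^2 + lmax) < 1.
sigma_max = np.linalg.norm(K, 2)
bound = np.sqrt(1 - gamma**2 * lmin**2 / (gamma**2 * lmax**2 + lmax))
print(f"sigma_max(K) = {sigma_max:.6f}, bound = {bound:.6f}")
```

Since sigma_max(K) < 1, iterating the linear system drives (theta_t, omega_t) to the equilibrium at a geometric rate, which is exactly the non-asymptotic convergence the lemmas assert.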
More information