arxiv: v1 [math.oc] 4 Oct 2018

Size: px

Start display at page:

Download "arxiv: v1 [math.oc] 4 Oct 2018"

Janel Hood
5 years ago
Views:

1 Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning arxiv: v [math.oc 4 Oct 08 Hassan Rafique Department of Mathmatics University of Iowa Qihang Lin Tippie College of Business University of Iowa Mingrui Liu Department of Computer Science University of Iowa Tianbao Yang Department of Computer Science University of Iowa October 5, 08 Abstract Min-max saddle-point problems have broad applications in many tasks in machine learning, e.g., distributionally robust learning, learning with non-decomposable loss, or learning with uncertain data. Although convex-concave saddle-point problems have been broadly studied with efficient algorithms and solid theories available, it remains a challenge to design provably efficient algorithms for non-convex saddle-point problems, especially when the objective function involves an expectation or a large-scale finite sum. Motivated by recent literature on non-convex non-smooth minimization, this paper studies a family of non-convex min-max problems where the minimization component is non-convex weakly convex and the maximization component is concave. We propose a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively. We establish the computation complexities of both methods for finding a nearly stationary point of the corresponding minimization problem. Introduction The main goal of this paper is to design provably efficient algorithms for solving saddle-point aka min-max problems of the following form that exhibits non-convexity in the part of minimization: min x R pmax y Rqfx,y ry+gx. Many machine learning tasks are formulated as the above problem. Examples with more details given later include learning with non-decomposable loss Fan et al., 07; Liu et al., 08; Ying et al., 06, learning with uncertain data Chen et al., 07, and distributionally robust optimization Namkoong & Duchi, 07, 06, etc. Although many previous studies have considered the min-max formulation and proposed efficient algorithms, most of them focus on the convex-concave family, in which both r and g are convex and fx, y is convex in x given y and is concave in y given x. However, convex-concave formulations cannot cover some important new methods/technologies/paradigms arising in machine learning. Hence, it becomes an emergent task to design provably efficient algorithms for solving that exhibits non-convexity structure.

2 Solving non-convex min-max problems is more challenging than solving non-convex minimization problems. Although there is increasing interest on non-convex optimization problems, most of the existing algorithms are designed for the minimization problem without considering the max-structure of the problem when it appears, and therefore are not directly applicable to. For example, stochastic gradient descent SGD for a minimization problem assumes that a stochastic gradient is available at each iteration for the objective of the minimization problem, which might be impossible for if the maximization over y is non-trivial or if f contains expectations. When designing an optimization algorithm for, the important questions are whether the algorithm has a polynomial runtime and what quality it guarantees in the output. In the recent studies for non-convex minimization problems Davis & Drusvyatskiy, 08e,c,a,b; Davis & Grimmer, 07; Drusvyatskiy, 07; Drusvyatskiy & Paquette, 08; Ghadimi & Lan, 03, 06; Lan & Yang, 08; Paquette et al., 08; Reddi et al., 06a,b, polynomial-time algorithms have been developed for finding a nearly stationary point that is close to a point where the subdifferential of objective function almost contains zero. Following this stream of work, we would naturally ask a question whether it is possible to design a polynomial time algorithm for that finds a nearly stationary point of the minimization problem, i.e., min x R p {ψx := max y R q[fx,y ry+gx}. In this paper, we give an affirmative answer for a special family of that is weakly non-convex in x and is concave in y. We refer to this family of problems as weakly-convex-concave min-max WCC-MinMax problems. Our answer is based on two primal-dual stochastic first-order methods for the WCC-MinMax problem under different settings. It is worth mentioning that WCC-MinMax problems have important applications in machine learning, which is actually the main motivation for us to pursue numerical methods for WCC-MinMax problems. We discuss some of these applications below and summarize our main contributions at the end of this section.. Applications in Machine Learning Distributionally Robust Optimization for Variance-based Regularization. In a recent work by Namkoong & Duchi 07, a distributionally robust optimization framework of the following form is studied: min max x X y Y n y i f i x ry, i= wheref i x denotes the loss of the model denoted byxon thei-th data point,x R d is a closed convex set, Y = {y R n n i= y i =, y i 0i =,...,n} is a simplex, andr : Y R is a closed convex function. They showed that when ry is the indicator function of the constraint set {y : n i= y i /n ρ} for some ρ > 0, the above min-max formulation achieves an effect that minimizes not only the bias but also the variance of the prediction, which could yield better generalization in some cases. In practice, one may also consider a regularized variant where ry = λvy,/n for some λ > 0, where V, denotes some distance measure of two vectors e.g., KL divergence, Euclidean distance. While optimization algorithms for solving convex-concave fromulation were developed Namkoong & Duchi, 06, it is still underexplored for problems with non-convex losses. When f i x is a non-convex loss function e.g., the loss function associated with a deep neural network, is non-convex in terms of x but is concave in y. Besides learning deep neural networks, non-convex loss functions also arise in robust learning Loh, 07; Xu et al., 08, when there exist outliers in the data or the data follows heavy-tailed distribution. For example, Xu et al. 08 studied learning with non-convex truncated loss fx;a,b = φ α lx;a,b, where a R d denotes the feature vector of an example and b denotes it corresponding output, lx; a, b denotes a standard convex loss function and φ α denotes a truncation function e.g., φ α s = αlog+s/α,α > 0.

3 Learning with non-decomposable loss. In some practical scenarios, one considers minimizing nondecomposable loss - meaning that the objective is not given as a summation of the losses of all training examples. We present two examples below. The first example is to minimize the sum of top-k losses Fan et al., 07; Dekel & Singer, 006, i.e., min x X k f πi x i= whereπ i n i= is a permutation of{,...,n} such thatf π x f π x f πn x for anyx. Whenk =, it reduces to the minimization of maximum loss, i.e., min x X max i f i x Shalev-Shwartz & Wexler, 06. The advantage in statistics of minimizing the sum of top-k losses has been considered Fan et al., 07. However, provable stochastic optimization algorithms for solving the above problem with nonconvex loss functions is not available. In order to transform the above problem into the min-max form, we note that the sum of top-k elements of a vector can be written as Peetre s K-functional Bennett & Sharpley, 988. In particular, Peetre s K-functional with respect top,p, and B > 0, is defined as v Kp,p,B := min w p w,z,w+z=v +B z p. It was shown that v K,,k = k i= v π i, where v π v π v πn Dekel & Singer, 006. Using Peetre s K-functional, we can write the sum of top-k nonnegative losses as min fx K,,k = min max x X x X y Y n y i f i x, 3 where fx = f x,...,f n x and Y = {y R n : y J,,/B }. Here, y Jq,q,s := max y q,s y q is called the Peetre s J-functional that is the conjugate dual of Peetre s K-functional v Kp,p,/s for /p i +/q i =, i =,. The second example of learning with non-decomposable loss is AUC maximization for binary classification with non-linear models. In particular, let gx; a denote the prediction score of a non-linear model parameterized by x e.g., deep neural networks on an example a. Given a set of examples {a i,b i } n i= where b i {, }, the AUC maximization problem over this set with a squared loss is written as i= min x X n + n n n [ gx;a i gx;a j I bi = b j =, i= j= where n + and n represent the numbers of positive and negative b i s, respectively, and I A = if event A holds and I A = 0 otherwise. It was shown that the above problem can written equivalently as a min-max problem that facilitates stochastic optimization Ying et al., 06: min max x X,z,z R y R n n f i x,z,z,y 4 i= where f i x,z,z,y = gx;a i z I b i= +gx;a i z I b i= Ibi= ++ygx;a i I b i= y, p + p p p + 3

4 p + = n i= I b i =/n and p = n i= I b i = /n. Let x = x,z,z. It is clear that f i x,y is nonconvex in x and strongly concave in y when gx; a is a non-convex function of x. Robust learning from Multiple Distributions. As an last example, we consider robust learning with multiple distributions. Let P,...,P m denote m distributions of data, which might be different perturbed versions with observable data of the underlying true distribution P 0. Robust learning from multiple distributions is to minimize the maximum of expected loss over themdistributions, i.e., min x X max E a P i [Fx;a = min i {,...,m} max x X y Y i= m y i f i x;a, 5 where Y = {y R n n i= y i =, y i 0 i =,...,n} is a simplex, Fx;a a non-convex loss function e.g., the loss defined by a deep neural network of a model parameterized by x on an example a, and f i x = E a Pi [Fx;a denotes the expected loss on distribution P i. It is an instance of that is non-convex in x and concave in y R m. It is notable that such a formulation also has applications in adversarial machine learning Madry et al., 07, in which P i corresponds to one distribution that one can follow to generate adversarial examples. The goal of adversarial machine learning is to learn a model that is robust to different kinds of adversarial examples.. Contributions In spite of our assumption on the concavity of f in y, there still remain great challenges for developing provable efficient algorithms for due to the non-convexity of f in x. One possible approach to solve is to view it as a non-convex minimization problem min x R p {ψx := max y R q[fx,y ry+gx} and solve it with existing first-order methods for non-convex optimization. By doing so, in iteration t of the first-order method, one needs to compute a subgradient of ψx t at iterate x t, which may require applying another algorithm to find an optimal or nearly-optimal solution, denoted by y t, for the concave maximization problem max y R q[fx,y ry. Then, one may compute a stochastic subgradient of fx t,y t with respect to x to update x t. However, this approach may have the following issues: i the maximization subproblem might not be easily solved; for example the maximization problem in 5 involves evaluating the expectation over a P i, which could be expensive; ii solving the maximization approximately introduces error in computing the subgradient of ψx that makes the convergence analysis more difficult. Moreover, when f involves expectation, an unbiased stochastic subgradient of fx, y at x,y = x t,y t is no longer a unbiased stochastic subgradient ofψx at x = x t. The major contribution of this work is the development and analysis of several efficient stochastic algorithms for tackling the WCC-MinMax problem without solving the maximization problem explicitly at each iteration. We establish the convergence of the proposed algorithms in terms of the stationarity of the corresponding minimization objective ψx. Since ψx could be non-smooth, we use the norm of its Moreau envelope s gradient as a convergence measure. The key assumptions we make are that fx,y is weakly convex x and concave in y and that the domains of r and g are both convex and compact. We consider two different settings in : i f is given as the expectation of a stochastic function and ii f can be represented as an average of finitely many smooth functions. The applications we discuss in Section. belows to at least one of these settings. Motivated by the proximally guided technique introduced by Davis & Grimmer 07, we develop a different stochastic first-order method for each two settings and analyze their time complexities. When f is given as the expectation of a stochastic function, we present a stochastic primal-dual subgradient algorithm for that updates the solutions using the stochastic subgradient of f. When 4

5 fx,y is strongly concave in y, this method achieves a time complexity of Õ/ǫ4 for finding a nearly ǫ-stationary solution see Section 3 for its definition. When fx, y is concave but not strongly concave in y, our method has a time complexity of Õ/ǫ6. The algorithm is applicable to distributionally robust optimization, minimizing the sum of top-k losses 3, AUC maximization with non-convex prediction models 4, and robust learning from multiple distribution 5. When f can be represented as an average of n smooth functions, we present a stochastic variancereduced gradient method for that achieves an improved iteration complexity. In particular, when f is strongly concave iny, our method has a time complexity ofõn/ǫ for finding a nearly stationary solution. This algorithm is also applicable to distributionally robust optimization, minimizing the sum of top-k losses 3, and AUC maximization with non-convex prediction models 4. Related Work In this section, we review some closely related work. We focus on stochastic algorithms for solving nonconvex min-max optimization problems and their applications for machine learning problems. Recently, there is growing interest on designing and analyzing stochastic algorithms for solving nonconvex minimization problems. Assuming stochastic gradient of the minimization objective function is available at each iteration, stochastic gradient descent SGD Ghadimi & Lan, 03 and its variants e.g., stochastic heavy-ball method Yang et al., 06; Ghadimi & Lan, 06 have been analyzed for finding a stationary point of a smooth objective function. When the objective function is summation of a finite number of smooth functions, accelerated stochastic algorithms with improved complexity were proposed based on variance reduction techniques Reddi et al., 06b,a; Lan & Yang, 08; Allen-Zhu, 07b; Allen-Zhu & Hazan, 06. These algorithms have been applied for learning deep neural networks by minimizing empirical risk over training data Krizhevsky et al., 0. However, none of them are directly applicable to solving the considered non-convex min-max problems in this paper. Several recent studies have tried to extend the algorithms and theories to non-convex minimization problems with non-smooth objective functions by assuming the objective is weakly convex Davis & Drusvyatskiy, 08e,d; Drusvyatskiy & Paquette, 08; Davis & Grimmer, 07; Chen et al., 08 or relatively weakly convex Zhang & He, 08. Tools based on the Moreau envelope are used in these studies to tackle the non-smoothness of the objective function. Since we also consider non-smooth objective functions, these tools will be reviewed in the next section. It is notable that the proposed algorithms fall in the same framework as that developed in Davis & Grimmer, 07; Chen et al., 08, i.e., the updates are divided into multiple stages and each stage a stochastic algorithm is employed for solving a regularized convex problem. However, they did not explicitly tackle weakly-convex-concave problems in their papers. A family of weakly convex composite optimization was studied in Drusvyatskiy & Paquette, 08, which is formulated as min x Rphcx +gx 6 where g : R p R {+ } is closed and convex, h : R q R is Lipschitz-continuous and convex and c : R p R q is smooth with a Lipschitz-continuous Jacobian matrix. This family of problem is a special case of considered in this work. In fact, we can reformulate 6 into by writing it as min x R pmax y R qy cx h y+gx 7 5

6 where h z = sup y R q{y z hy}, the Fenchel conjugate of h. Drusvyatskiy & Paquette 08 proposed a prox-linear method for solving 6 where the iterate x t is update by approximately solving x t+ argminh cx t + cx t x x t +gx+ x R p η x xt 8 where c is the Jacobian matrix of c and η > 0 is a step length. Drusvyatskiy & Paquette 08 proposed using accelerated gradient method to solve 8 or its dual. However, this approach requires the exact evaluation of c and c which is computationally expensive, for example, when c involves high-dimensional expectations or is given as a sum of a large number of functions see and 5. One approach to avoid this issue so that the prox-linear method by Drusvyatskiy & Paquette 08 can be still applied is to solve 8 as a min-max problem x t+ argminmax x R p y R qy cx t + cx t x x t h y+gx+ η x xt 9 which can be solved by primal-dual first-order methods based on the sample approximation of c and c. However, such an approach requires c to be smooth to ensure the convergence of the main iterate in the prox-linear method, which restricts its applicability. In contrast, the proposed algorithms in this paper do not require c to be smooth though smoothness allows us to design faster algorithms and can be applied to the general problem instead of just 6. Non-convex min-max problems have recently received increasing attention in machine learning. Chen et al. 07 considered robust learning from multiple datasets to tackle uncertainty in the data, which is a special case of 5. They proposed and analyzed an algorithm by assuming that the minimization problem given y can be solved to a certain optimality ratio with respect to the global minimizer. However, such assumption does not hold in reality since the minimization problem is a non-convex problem. Qian et al. 08 considered a similar problem to 5 and analyzed a primal-dual style stochastic algorithm. There are several differences between their results and our results: i they require E a Pi [fx;a to be a smooth function, while we only assume it is a weakly convex function that could be non-smooth; ii they provide convergence for the partial gradient of fx, y with respect to the primal variable x, while we provide convergence for the gradient of the Moreau envelope for the minimization objective ψx := max y R q[fx,y ry+gx; iii when the objective fx, y has a finite-sum structure with smoothness properties, we also provide an accelerated stochastic algorithm based on variance reduction. These differences make our results much stronger and broader. Sinha et al. 07 considered a particular non-convex min-max problem for adversarial learning. For optimization, they restricted the regularization parameter to be large and showed that the minimization objective is smooth under some smoothness assumptions offx,y and then applied the convergence analysis of SGD for smooth functions. Besides the convergence measures based on gradients, there are studies trying to analyze the convergence to min-max saddle points Cherukuri et al., 07. However, such convergence usually require much stronger assumptions about the problem. 3 Preliminaries We present some preliminaries in this section. According to the majority of the applications of, we use the Euclidean norm and a generic norm in R p and R q, respectively. The dual norm of is denoted by. We consider first-order optimization algorithm for solving the following problem min x R p { ψx := max y R q[fx,y ry+gx }. 0 6

7 Given a function h : R d R {+ }, we define the Fréchet subdifferential of h as hx = {ζ R d hx hx+ζ x x+o x x, x x}, where each element in hx is called a Fréchet subgradient of h at x. We define x fx,y as the subgradient of fx,y with respective to x for a fixed y and y [ fx,y as the subgradient of fx,y with respective toy for a fixedx. LetX := domg andy := domr. Letd y : Y R be a-strongly convex function with respect to which is differentiable in inty. The Bregman divergence associated to d y is defined as V y : Y inty R satisfying V y y,y := d y y d y y d y y,y y. Note that we have V y y,y y y. We say a function h : Y R isµ-strongly convex with respect tov y if hy hy +ζ y y +µv y y,y for anyy,y Y inty and ζ hy. We say a function h : X R isρ-weakly convex if hx hx +ζ x x ρ x x for anyx,x X and ζ hx. The assumptions made throughout this paper are the following: Assumption. f is real-valued and the following statements hold: A. fx,y is ρ-weakly convex in x, i.e., fx,y fx,y+ζ x x ρ x x for x,x X, y Y and ζ x fx,y B. fx,y is concave in y, i.e., fx,y fx,y ζ y y for x X, y,y Y and ζ y [ fx,y. C. r is closed and µ-strongly convex with respect tov y with µ 0, and g is closed and convex on X. D. X := domg and Y := domr are non-empty and convex compact sets. Remark. In fact, the boundedness assumption of X = domg can be removed without affecting our main results. Similarly, when µ > 0, the boundedness assumption of Y = domr can be also removed. We choose to keep the boundedness assumptions only for keeping the proofs of these results relatively simple. Under Assumption, ψ is ρ-weakly convex and thus non-convex so that finding the global optimal solution in general is difficult. An alternative goal is to find a stationary point of 0 which is defined as a point x X such that 0 ψx, Because of the iterative nature of optimization algorithms, such a stationary point generally can only be approached in the limit as the number of iterations increases to infinity. With finitely many iterations, a more realistic goal is to find an ǫ-stationary point, i.e., a point ˆx X satisfying Dist0, ψˆx := min ζ ψˆx ζ ǫ. 7

8 However, when the objective function is non-smooth, computing an ǫ-stationary point is still not an easy task even for convex optimization problem. A simple example ismin x R x where the only stationary point is 0 but any x 0 is not an ǫ-stationary point ǫ < no matter how close it is to 0. This situation is likely to occur in problem 0 because of the potential non-smoothness of f and r. Therefore, following Davis & Drusvyatskiy 08a, Davis & Grimmer 07, Davis & Drusvyatskiy 08b and Zhang & He 08, we consider the Moreau envelope of ψ in 0, which is ψ γ x := min z R p { ψz+ γ z x where γ > 0. Since ψ is ρ-weakly convex by Assumption, if γ > ρ, ψ γ is a smooth function whose gradient is ψx = γ x prox γψ x 4 where prox γψ x is the proximal point ofxdefined as } 3 { prox γψ x := argmin ψz+ } z R p γ z x. 5 Note that, when γ > ρ, the minimization in 3 and 4 is strongly convex so that prox γψ x is uniquely defined. According to the argument by Davis & Drusvyatskiy 08a, Davis & Grimmer 07, Davis & Drusvyatskiy 08b and Zhang & He 08, the norm of the gradient ψ γ x can be used as measure of the quality of a solution x. In fact, let x = prox γψ x. The definition of the Moreau envelope directly implies that for any x R d, x x = γ ψ γ x 6 ψx ψ x 7 Dist0, ψx = ψ γ x. 8 Therefore, if we can find a nearlyǫ-stationary point, which is defined as a point x X with ψ γ x ǫ, we can ensure x x γǫ and Dist0, ψx ǫ, namely, x is closed to ǫ-stationary point i.e., x. Because of this, we will focus on developing a first-order method for 0 and analyze its computational complexity for finding a nearly ǫ-stationary point. 4 Proximally Guided Approach The method we proposed is largely inspired by the proximally guided stochastic subgradient method by Davis & Grimmer 07. Davis & Grimmer 07 consider min x R p ψx with ψx being ρ-weakly convex. The problem they considered is more general and does not necessarily have the minimax structure in 0. After choosing a point x X, their method uses the standard stochastic subgradient SSG method to approximately solve the following subproblem min ψx+ x X γ x x, 9 which is strongly convex when γ > ρ. Then, x is updated by the solution returned from the SSG method. They show that their method finds x with ψ γ x ǫ a nearly ǫ-stationary point within Õ total ǫ 4 8

9 iterations of the SSG method. Davis & Grimmer 07 assume that one can compute a deterministic or stochastic subgradient of ψx in order to apply the SSG method to solve 9. However, this requires solving max y R q [fx,y ry exactly which can be time consuming or impossible, especially when f involves expectations. In order to address this challenge, we view the subproblem 9 also as a min-max saddle-point problem according to the structure of 0, and then solve 9 using primal-dual stochastic first-order methods. In particular, we consider min x R pmax y R q{φ γ,λx,y; x,ȳ}, 0 where φ γ,λ x,y; x,ȳ := fx,y+ γ x x λ V yy,ȳ ry+gx and x,ȳ X inty. Under Assumption, it is easy to show that, when γ > ρ, φ γ,λx,y; x,ȳ is µ x -strongly convex inxfor anyy Y and µ y -concave with respect to V y iny for anyx X with µ x := γ ρand µ y := λ +µ for any x,ȳ X inty. Hence, problem 0 is a convex-concave min-max problem which is relatively easy to solve compared to. We then apply primal-dual stochastic first-order methods to find an approximate solution for 0 and use it to update x,ȳ. Compared to subproblem 9 considered by Davis & Grimmer 07, we deduct a proximal term λ V yy,ȳ from the maximization in 0 along with adding a proximal term γ x x for the minimization in 0. This is only to make our subproblem 0 strongly convex-concave so that we can solve it efficiently using first-order methods. We will follow the convention that λ = 0 if λ = +. Therefore, when ry has been strongly convex µ > 0 already, we will not need the term λ V yy,ȳ and we will choose λ = +. In this case, 0 is equivalent to 9. When ry is not strongly convex µ = 0, we will have to choose a different value forλin each main iteration and the value will increase to infinity as the iterations proceed to make sure 0 closely approximates 9. 5 Minimization of Maximum of Stochastic Functions In this section, we will present our first algorithm which is designed for when the following assumptions are satisfied in addition to Assumption. Assumption. The following statements hold. A. fx,y = E[Fx,y,ξ, where ξ is a random variable and Fx,y,ξ is real-valued, ρ-weakly convex inxfor any y Y, and µ-concave with respect to V y in y for any x X almost surely. B. For any x,y X Y and any realization of ξ, we can compute g x, g y x Fx,y,ξ y [ Fx,y,ξ such that Eg x, Eg y x fx,y y [ fx,y. C. E g j x M x and E g j y y, M y for any x,y X Y for some M x > 0 and M y > 0. We then propose a proximally guided stochastic mirror descent PG-SMD method for the saddle-point problem 0. In this method, we approximately solve 0 with x,ȳ = x t,ȳ t and λ = λ t in the main iteration t using the primal-dual stochastic mirror descent SMD method by Nemirovski et al

10 Algorithm PG-SMD: Proximally Guided Stochastic Mirror Descent Method : Choose x 0 X,ȳ 0 Y. Input ρ > 0, γ 0,/ρ. : fort = 0,...,T do 3: Set x 0 = x t and y 0 = ȳ t. 4: Choose j t, λ t > 0, µ t y = λ t +µ, ηx j = µ and xj+ ηj y = µ t y j+ for j = 0,...,j t. 5: for j = 0,...,j t do 6: Sample ξ j from ξ and compute g x j, g y j x Fx j,y j,ξ j y [ Fx j,y j,ξ j 7: Solve x j+ = min x X x g x j + ηx j x x j + γ x xt +gx y j+ = min y Y y g j y + η j yv y y,y j + λ t V y y,ȳ t +ry 8: end for 9: 0: end for x t+ = j t j t j t + j=0 j +x j, ȳ t+ = j t j +y j j t j t + j=0 Then, x t+,ȳ t+ in iteration t+ is chosen to be the approximate solution we obtained from solving 0 with x,ȳ = x t,ȳ t. We present this PG-SMD method in Algorithm with details. Note that Line 5-8 of Algorithm are exactly the steps of the primal-dual SMD method by Nemirovski et al. 009 when applied to 0 with x,ȳ = x t,ȳ t and λ = λ t. To analyze the convergence property of Algorithm, we first present the convergence result of SMD for solving 0. Different from the analysis in Nemirovski et al. 009, we show a better convergence rate for SMD by exploiting the strong convexity and strong concavity in 0. The step size in Algorithm and the analysis we use to ensure this better convergence rate are borrowed from Lacoste-Julien et al. 0 who consider a strongly convex minimization problem. Proposition. Let x t = prox γψ x t, where x t is generated in Algorithm. Algorithm guarantees [ E max φ γ,λ x t+,y; x t,ȳ t min φ γ,λx,ȳ t+ ; x t,ȳ t y Y x X 0Mx + 0M y j t + µ x µ t + 4D x +4µ t y y γ D y +4Q g +4Q r. and µ x xt+ x t 0Mx + 0M y j t + µ x µ t y + 4D x γ +4µt yd y +4Q g +4Q r + D y λ t. where D x := max x x,x X x, D y := max V yy,y, Q g := max x X gx min x X gx and y,y Y Q r := max y Y ry min y Y ry 0

11 Proof. For the simplicity of notation, we writeφ γ,λ x,y; x t, x t as φx,y and define ḡx = γ x xt +gx and ry = λ V yy,ȳ t +ry. By the standard analysis of the SMD method in Nemirovski et al. 009, we can show that φx j,y φx,y j +x j x j x y j y j y ηj x M x + ηx j + ρ x x j ηx j + x x j+ γ ḡxj+ +ḡx j + ηj y M y + ηyv j y y,y j +µ y V y y,y j+ ry j+ + ry j, 3 for anyx X and anyy Y where j x Let x 0,ỹ 0 = x 0,y 0 and let η j y := gj x Egj x and j y := g j y Eg j y. x j+ = min x, j x + x X ỹ j+ = min y Y y, j y ηx j x x j + γ x xt +gx + V y y,ỹ j + λ V yy,ȳ t +ry η j y for j = 0,,...,j t. Using a similar analysis as in the SMD method again, we can obtain the following inequality similar to 3. x j x j x +ỹj y j y + ηx j x x j ηx j 4ηj xm x + 4ηj y M y η j y + γ x x j+ ḡ x j+ +ḡ x j + ηyv j y y,ỹ j +µ y V y y,ỹ j+ rỹ j+ + rỹ j. 4 By the definition of µ x, µ t y, ηj x and ηx, j we can obtain the following inequality by multiplying 3 and 4 by j + and adding them together for j = 0,,...,j t : j + φx j,y φx,y j +j +x j x j j x j +y j ỹ j j y 5j +M x j +j +µx ρj + j +j + + x x j +µx + j + x x j+ µ x j γ + 5j +M y + µ yj +j + V y y,y j µy j +j + +µ y j + V y y,y j+ µ y j + j +ḡx j+ +j +ḡx j j + ry j+ +j + ry j + j +j +µ x j +j x x j +µx + j + x x j+ 4 4 γ + j +j +µ y V y y,ỹ j µy j +j + +µ y j + V y y,ỹ j+ j +ḡ x j+ +j +ḡ x j j + rỹ j+ +j + rỹ j.

12 Summing the inequality above forj = 0,,...,j t and organizing terms yields j t j + φx j,y φx,y j j=0 j t j=0 [ j +x j x j j x +j +y j ỹ j j y + 5j tmx + j tdx + 5j tmy +j t µ y Dy µ x γ µ +j tq g +j t Q r. y Dividing the inequality above by j t jtjt+ j=0 j + = and applying the convexity φ in x and the concavity of φ iny, we obtain φ x t+,y φx,ȳ t+ j t j t + + j t + j t j=0 [ j +x j x j j x +j +y j ỹ j j y 0M x µ x + 4D x γ + 0M y µ y +4µ y D y +4Q g +4Q r. Finally, by maximizing both sides of the inequality above over x X and y Y and taking expectation, we obtain. Here, we use the fact E j x = 0 and E j y = 0 conditioning on all the stochastic events in iterations 0,,...,j. We follow the convention that λ = 0 if λ = +. Hence, we have µ x xt+ x t ψ x t+ + E E γ xt+ x t ψxt γ xt x t [ max φ γ,+ x t+,y; x t,ȳ t min φ γ,+ x,ȳ t+ ; x t,ȳ t y Y x X [ max φ γ,λ x t+,y; x t,ȳ t min φ γ,λx,ȳ t+ ; x t,ȳ t y Y x X which implies. Here, we use the µ x -strong convexity of φ γ,λ inx. + D y λ Theorem. Letx t = prox γψ x t, where x t is generated in Algorithm. Supposeτ is an random index uniformly distributed on {0,,...,T }. Also, suppose γ = ρ, λ t = + and j t = t+if µ > 0 and λ t = t+ and j t = t+ if µ = 0. Algorithm ensures EDist0, ψx τ γ E xτ x τ ǫ, with a total number of gradient calculations at mostõǫ 4 if µ > 0 and Õǫ 6 if µ = 0. Proof. By the definition of x t, we have ψx t + γ xt x t ψ x t 5

13 Borrowing the inequalities derived by Davis & Grimmer 07 in the proof of their Theorem, we have x t x t x t+ x t = x t x t + x t+ x t x t x t x t+ x t Let E t := 0Mx + 0M y j t + µ x µ t y Using 6, Proposition, and the relationship that x t x t + x t x t+ x t x t+ = x t x t x t x t+ + x t x t+ 3 xt x t +4 x t x t D x γ +4µt yd y +4Q g +4Q r ψ x t+ + γ xt+ x t ψxt γ xt x t max y Y φ γ,λ t x t+,y; x t,ȳ t min x X max y Y φ γ,λ t x,y; x t,ȳ t + D y λ t. we have max y Y φ γ,λ t x t+,y; x t,ȳ t min x X φ γ,λ t x,ȳ t+ ; x t,ȳ t + D y λ t, ψ x t+ ψx t + γ xt ψx t + γ 3 xt x t γ xt+ x t +E t + D y λ t ψx t + 6γ xt x t + 4 γµ x x t +4 xt x t+ E t + D y λ t +E t + D y λ t +E t + D y λ t. Combining this inequality with 5 leads to ψ x t+ ψ x t γ xt x t + 4 6γ xt x t + + E t + D y γµ x λ t ψ x t 4 3γ xt x t + + E t + D y γµ x λ t Adding the inequalities for t = 0,...,T yields T ψ x T ψ x 0 4 3γ xt x t + + γµ x Rearranging the inequality above gives T x t 4 x t 3γ ψ x 0 ψ + + γµ x T E t + D y λ t T E t + D y λ t 3

14 where ψ = min x X ψx. Let τ be a uniform random variable supported on {0,,...,T }. Then, we have E x τ x τ 3γ T 4 ψ x 0 ψ + + E t + D y T γµ x λ t Suppose µ > 0. We choose j t = t+ and λ t = + such that λ t = 0 and µ t y = µ. Therefore, T E t + D T y 0Mx = + 0M y + 4D x +4µDy λ t t+3 µ x µ γ +4Q g +4Q r 0Mx lnt + + 0M y + 4D x +4µDy µ x µ γ +4Q g +4Q r. To ensure γ E xτ x τ ǫ, we only need to set T = Õǫ and the the total number of gradient calculations required to compute x τ is bounded by T TT +3 j t = = Õǫ 4 Suppose µ = 0. We choose j t = t+ and λ t = t+ such that λ t = µ t y = t+. Therefore, T To ensure E t + D y λ t T [ = t+ lnt + 0Mx +0t+My + 4D x + 4D y µ x γ 0M x µ x +0M y + 4D x γ +4D y +4Q g +4Q r t+ +4Q g +4Q r EDist0, ψ + X x τ γ E xτ x τ ǫ, + D y t+ we only need to set T = Õǫ and the the total number of gradient calculations required to compute x τ is bounded by T j t = T3 +6T +9T = 3 Õǫ 6 When µ > 0, the complexity of Algorithm for finding a nearly ǫ-stationary point is Õǫ 4 which matches the complexity of the methods by Davis & Drusvyatskiy 08a, Davis & Grimmer 07, Davis & Drusvyatskiy 08b and Zhang & He 08. The complexity of Algorithm significantly increases to Õǫ 6 when µ = 0. However, in either case, we are not aware of any other existing first-order methods that can provably find a nearly ǫ-stationary point of problem under Assumption and. 4

15 6 Minimization of Maximum of Deterministic Functions In this section, we will present our second algorithm which is designed for when the following assumptions are satisfied in addition to Assumption. Assumption 3. The following statements hold. A. fx,y = n n i= f ix,y, where f i is real-valued, ρ-weakly convex in x for any y Y, and µ-concave with respect to V y iny for any x X. Moreover, µ > 0. B. f i is differentiable. Also, x f i isl x -Lipschitz continuous and y f i isl y -Lipschitz continuous. Here, x f i and y f i are the partial gradients of f i with respect toxand y, respectively. Different from the setting of the previous section, we are able to evaluate the value and the gradient of f but the computational challenge still arises when the number of functions n is large. We then propose a proximally guided stochastic variance-reduced gradient PG-SVRG method for the saddle-point problem 0 under Assumption and 3. In contrast to Algorithm, we apply the stochastic variance-reduced gradient SVRG method with the Bregman divergences defined in to approximately solve 0 with x = x t, λ = +, and λ V yy,ȳ = 0 at main iteration t. Then, we choose x t+ as the obtained approximate solution for iteration t +. The SVRG method was originally developed for finite-sum minimization Johnson & Zhang, 03; Defazio et al., 04; Xiao & Zhang, 04; Allen-Zhu & Yuan, 06; Allen-Zhu, 07a and it has been extended by Palaniappan & Bach 06; Shi et al. 07 for finite-sum min-max problems. Generally speaking, the SVRG method consists of multiple stages and, in each stage, a fixed number of stochastic gradient steps are performed. At the beginning of each stage, the SVRG method computes a deterministic gradient at a reference point, which is later used in every stochastic gradient step to construct a variancereduced stochastic gradient of the objective function. The reference point is updated only once in each stage followed by an update of the deterministic gradient. We present the PG-SVRG method in Algorithm. This algorithm consists of three levels of loops. The outermost loop is designed for updating x in 0. In each outermost iteration index byt, 0 is solved by the standard SVRG method with x = x t. Each iteration index by k in the middle-level loop corresponds to a stage of SVRG where the deterministic gradients ĝ k x and ĝ k y are computed at the reference point ˆx k,ŷ k. Each iteration index by j in the innermost loop performs a similar primal-dual gradient step as in Algorithm except that the variance-reduced gradients g k x and g k y are used. Proposition. Let x t = prox γψ x t, where x t is generated in Algorithm. [ E max and 3 4 φ γ,+ x t+,y; x t,ȳ t min y Y kt 4 + Λ φ γ,+ x,ȳ t+ ; x t,ȳ t x X [µx D x +µ yd y. 30 µ x xt+ x t where D x andd y are defined as in Proposition. kt Λ [µx Dx +µ ydy 3 5

16 Algorithm PGSVRG: Proximally Guided Stochastic Variance-Reduced Gradient Method : Choose x 0 X,ȳ 0 Y. Input ρ > 0, γ 0,/ρ. : fort = 0,...,T do 3: Set ˆx 0 = x t and ŷ 0 = ȳ t. 4: Choosek t,µ t y = µ,ηx j = µ xλ t,ηy j = µ t yλ t, andj t = Λ tlog4 forj =,...,j t, where Λ t = 5max{L x,l y} min{µ x,µt y }. 5: for k = 0,...,k t do 6: Compute ĝ k x = x fˆx k,ŷ k and ĝ k y = y fˆx k,ŷ k 7 7: Set x 0,y 0 = ˆx k,ŷ k 8: for j = 0,...,j t do 9: Sample l from {,,...,n} uniformly and compute f l x k,y k and g x j g y j = ĝ x k x f l ˆx k,ŷ k + x f l x j,y j 8 = ĝ y k y f l ˆx k,ŷ k + y f l x j,y j 9 0: Solve x j+ y j+ = min x X x g x j + ηx j x x j + γ x xt +gx = min y Y y g j y + η j yv y y,y j +ry : end for : ˆx k+ = x jt, ȳ k+ = y jt 3: end for 4: x t+ = x kt, ȳ t+ = y kt 5: end for Proof. We first analyze the convergence property of SVRG in solving 0 with x,ȳ = x t,ȳ t and λ = +, which are performed in the outermost iterationt in Algorithm. Let s focus on thekth iteration of the middle loop in Algorithm. LetE j represent the conditional expectation conditioning onx 0,y 0 = ˆx k,ŷ k and all random events that happen before iteration j of the innermost iteration of Algorithm. Again, for the simplicity of notation, we writeφ γ,λ x,y; x t, x t asφx,y and define ḡx = γ x xt +gx and ry = λ V yy,ȳ t +ry. The definition of x j+,y j+ and the optimality conditions of x t,y t imply γ + x x j+ ηx j +x j+ g x j +ḡx j+ + xj x j+ ηx j x g j x +ḡx+ x xt η j x 3 6

17 and µ t y + η j y V y y,y j+ y j+ g j y + ry j+ + V yy j+,y j η j y y g j y + ry+ V yy,y j η j y 33 for anyx X and y Y. We define Px := fx,y t +ḡx fx t,y t ḡx t 34 Dy := fx t,y ry fx t,y t + ry t. 35 Note that min x X Px = Px t = 0 and max y Y Dy = Dy t = 0. By the µ x -strong convexity of Px with respect to Euclidean distance and the µ t y -strong convexity of Dy with respect to Bregman divergence V y, we can show that Px µ x x x t and Dy µ t yv y y,y t 36 We choose x = x t in 3 and y = y t in 33, and then add 3 and 33 together. After organizing terms, we obtain γ + t x x j+ ηx j + xj x j+ ηx j + µ y + ηy j V y y t,y j+ + V yy j+,y j ηy j + Px j+ Dy j+ xt x j η j x x t x j+ g x j yt y j+ g y j V yy t,y j ηy j fx,y t +fx t,y We reorganize the right hand side of the inequality above as follows x t x j+ g x j yt y j+ g y j = x t x j [ g j x x fx j,y j +x j x j+ [ g j x xfx j,y j y t y j [ g y j y fx j,y j y j y j+ [ g j y y fx j,y j +x t x j+ x fx j,y j y t y j+ y fx j,y j 37 Next, we study the three lines on the right hand side of 3, respectively. Since the random index l is independent of x i and y i, we have [ E t x t x j [ g x j xfx j,y j y t y j [ g y j y fx j,y j = 0 38 by the definition of g j x and g j y. By the definition of g j x, Cauchy-Schwarz inequality and Young s in- 7

18 equality, we have [ E t x j x j+ [ g x j x fx j,y j a t E t x j x j+ + a t E t x fˆx k,ŷ k x f l ˆx k,ŷ k + x f l x j,y j x fx j,y j E t x j x j+ a +a tl x xj x t +a tl x ˆxk x t t +4a t L x V yy j,y t +4a t L x V yŷ k,y t 39 Similarly, we can prove that [ E t y j y j+ [ y fx j,y j g j b t V y y j+,y j +b t L y ˆxk x t +b tl y xj x t. y +4b t L yv y y j,y t +4b t L yv y ŷ k,y t. 40 Next, we consider the third line in the right hand side of 37. We first show the following upper bound on x t x j+ x fx j,y j. x t x j+ x fx j,y j 4 x t x j+ x fx j+,y j+ + c t x t x j+ + c t L x xj x j+ + c t L x yj y j+ fx t,y j+ fx j+,y j+ + ρ xt x j+ + c t x t x j+ + c t L x xj x j+ +c tl x V yy j+,y j Similarly, we have the following upper bound on y t y j+ y fx j,y j y t y j+ y fx j,y j 4 fx j+,y t +fx j+,y j+ + d t V y y j+,y t + d t L y x j x j+ +d t L yv y y j+,y j Applying 38, 39, 40, 4, 4 to 37 lead to µ x + Et x t x j+ + c t η j x 4a t L x +4b tl y + η j x t x x j + µ y + ηy j d t + a t L x +b t L t y x ˆx k + 4a t L x +4b t L y + +c t L x a +d tl y x j x j+ t ηx j + 4a t L x +4b tl y + η j y E t V y y t,y j+ + Px j+ Dy j+ Vy y t,ŷ k V y y t,y t b t +c t L x +d tl y η j y V y y j+,y j 43 8

19 min{µ x,µ t y} Choosing a t = b t = 48 max{l x,l y},c t = µ x, and d t = µ y in the inequality above, we obtain µx Et x t x j+ + ηx j µx 6 + t x x j ηx j max{ } L x,l y min { } µ x,µ t y ηx j + µ y µ y 6 + η j y + ηy j x j x j+ E t V y y t,y j+ + Px j+ Dy j+ V y y t,y t + µ x Given that ηx j = µ xλ t and ηy j = µ t yλ t withλ t = 5 max{l x,l y} above becomes µ x E t x t x j /+3Λ t +Λ t xt ˆx k + µ y 6 V yy t,ŷ k + 5 max{ } L x,l y min { µ x,µ y} t ηy j V y y j+,y j 44 min{µ x,µt y } +µ t ye t V y y t,y j+ + [ µx x t x j µ x x t ˆx k +Λ t +µ t yv y y t,y t +µ t yv y y t,ŷ k. after organizing terms, the inequality Px j+ Dy j+ Applying this inequality recursively forj = 0,,...,j t and organizing terms, we have. µ x E t x t x jt 3/+3Λ t + +µ t y E tv y y t,y jt + jt [ µx x t x 0 µ x x t ˆx k +µ t yv y y t,ŷ k +Λ t +µ t y V yy t,y 0. Px j t Dy jt Recall that x 0,y 0 = ˆx k,ŷ k and x jt,y jt = ˆx k+,ŷ k+. If we choose j t = + 3/+3Λ t log4, the inequality implies 3 4 µ x E t x t ˆx k+ +µ t y E tv y y t [ µ x x t ˆx k +µ t yv y y t,ŷ k,ŷ k+ + +Λ t Pˆx k+ Dŷ k+ 45 Note that, because ry is µ t y-strongly convex, using a standard analysis, we can show that the function of xmax y Y fx,y ry is smooth and its gradient isl x +L x /µt y -Lipschitz continuous. Similarly, because ḡx is µ x -strongly convex, we can show that the function of y min x X fx,y +ḡx is smooth and its gradient isl y +L y /µ x-lipschitz continuous. Therefore, we define Px := max y Rqφx,y and Dy := min x Rpφx,y 46 9

20 and we can easily show that Px Dy Px Dy+ L x µ x + L x µ x x x t µ t y + L y µ t y + L y µ µ t yv y y t,y x for anyx X and y Y. Choosing x,y = ˆx k+,ŷ k+ and applying 45 yield Pˆx k+ Dŷ k+ 3 [ 4 +Λ µ x x t ˆx k t +µ y V y y t,ŷ k Applying this inequality and 45 recursively for k = 0,,...,k t, we have kt [ 3 Pˆx kt Dŷ kt 4 +Λ µ x x t ˆx 0 t +µ y V y y t,ŷ 0 According to the facts that ˆx kt,ŷ kt = x t+,ȳ t+ and ˆx 0,ŷ 0 = x t,ȳ t kt [ 3 P x t+ Dȳ t+ 4 +Λ µ x x t x t t +µ y V y y t,ȳ t, which completes the proof of is implied by theµ x -strong convexity ofφinxusing the same proof at the end of Proposition. Theorem. Letx t = prox γψ x t, where x t is generated in Algorithm. Supposeτ is an random index uniformly distributed on {0,,...,T }. Also, suppose, γ = ρ and Algorithm ensures k t = log t+ + γµ x 4 + Λ t γ E xτ x τ ǫ, µ x D x +µ t yd y with a total number of gradient calculations at mostõn+ max{l x,l y } min{ρ,µ } ǫ. Proof. Let F t := 3 kt Λ [µx Dx +µ t ydy. Using the same argument in the beginning of the proof of Theorem but withλ t = +, we have E x τ x τ 3γ T 4 ψ x 0 ψ + + F t + D y T γµ x λ t where ψ = min x X ψx. We choose k t = + 4log λ t = 0 and µ t y = µ. Therefore, T t+ 4 γµ x Λ µx Dx +µt y D y and λ t = + such that F t + D y λ t = T t+ π 6.. 0

21 To ensure γ E xτ x τ ǫ, we only need to set T = Oǫ and the the total number of gradient calculations required to compute x τ is bounded by T k t n+j t = Õn+Λ tǫ Therefore, whenµ > 0, the complexity of Algorithm for finding a nearlyǫ-stationary point isõnǫ which is better than the complexity Õǫ 4 by Algorithm in the dependency on ǫ. 7 Conclusion Min-max saddle-point problems appear frequently in many real-world applications of machine learning and statistics. A large volume of literature studies the numerical algorithms for solving min-max problem under the assumption that the objective function is convex in the minimization part while concave in the maximization part. Recently, as the use of non-convex loss functions increases in machine learning models, especially in deep neural network, many min-max problems arise in practice where the minimization part is non-convex although the maximization part is still concave. For this type of problems, the existing saddlepoint algorithms for convex problems no longer have a good theoretical guarantee, and solving the min-max problem as a pure minimization problem requires computing the subgradient of the objective function, which further requires solving a non-trivial maximization problem. To contribute to the numerical solvers for non-convex min-max problems, we focus on a class of weaklyconvex-concave min-max problems where the minimization part is weakly convex and the maximization part is concave. This class of problems covers many important applications such as robust learning and AUC maximization problem. We consider two different scenarios: i the min-max objective function involves expectation and ii the objective function is smooth and has a finite-sum structure. In both cases, we propose a proximally guided optimization algorithm where we solve a sequence of a convex-concave min-max subproblems, each of which is constructed by adding a strongly convex term to the original function. Stochastic mirror descent method and stochastic variance-reduced gradient method have been used to solve the subproblem in scenario i and ii, respectively. We analyze the computational complexity of our methods for finding a nearly ǫ-stationary for the original problem under both scenarios. References Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM Symposium on Theory of Computing, STOC 7, 07a. Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning ICML, pp , 07b. Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33nd International Conference on Machine Learning ICML, pp , 06. URL

22 Zeyuan Allen-Zhu and Yang Yuan. Improved svrg for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of The 33rd International Conference on Machine Learning, pp , 06. C. Bennett and R.C. Sharpley. Interpolation of Operators. Pure and Applied Mathematics. Elsevier Science, 988. ISBN URL Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems 30 NIPS, pp Zaiyi Chen, Tianbao Yang, Jinfeng Yi, Bowen Zhou, and Enhong Chen. Universal stagewise learning for non-convex problems with convergence on averaged solutions. CoRR, /abs/ , 08. Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: Conditions for asymptotic stability of saddle points. SIAM J. Control and Optimization, 55:486 5, 07. Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate ok /4 on weakly convex functions. arxiv preprint arxiv: , 08a. Damek Davis and Dmitriy Drusvyatskiy. Complexity of finding near-stationary points of convex functions stochastically. arxiv preprint arxiv: , 08b. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. arxiv preprint arxiv: , 08c. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. CoRR, abs/ , 08d. Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate ok /4 on weakly convex functions. CoRR, /abs/ , 08e. Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arxiv preprint arxiv: , 07. Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems NIPS, pp , 04. Ofer Dekel and Yoram Singer. Support vector machines on a budget. In NIPS, pp , 006. D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, Jul 08. Dmitriy Drusvyatskiy. The proximal point method revisited. arxiv preprint arxiv: , 07. Yanbo Fan, Siwei Lyu, Yiming Ying, and Bao-Gang Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems NIPS, pp , 07. Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 34:34 368, 03.

23 Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 56-:59 99, 06. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp , 03. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems NIPS, pp. 06 4, 0. Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O/t convergence rate for the projected stochastic subgradient method. arxiv preprint arxiv:.00, 0. Guanghui Lan and Yu Yang. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/ , 08. Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic auc maximization with o/n-convergence rate. In Proceedings of the 35th International Conference on Machine Learning ICML, 08. Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust m- estimators. The Annals of Statistics, 45: , doi: 0.4/6-AOS47. URL Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/ , 07. URL Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems NIPS, pp. 08 6, 06. Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems NIPS, pp , 07. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 94: , 009. Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pp , 06. Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst for gradient-based nonconvex optimization. pp. 0, 08. Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, and Hao Li. Robust optimization over multiple domains. CoRR, abs/ , 08. URL Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczós, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning ICML, pp JMLR.org, 06a. 3

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The