arxiv: v1 [math.oc] 4 Oct 2018

Size: px
Start display at page:

Download "arxiv: v1 [math.oc] 4 Oct 2018"

Transcription

1 Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning arxiv: v [math.oc 4 Oct 08 Hassan Rafique Department of Mathmatics University of Iowa Qihang Lin Tippie College of Business University of Iowa Mingrui Liu Department of Computer Science University of Iowa Tianbao Yang Department of Computer Science University of Iowa October 5, 08 Abstract Min-max saddle-point problems have broad applications in many tasks in machine learning, e.g., distributionally robust learning, learning with non-decomposable loss, or learning with uncertain data. Although convex-concave saddle-point problems have been broadly studied with efficient algorithms and solid theories available, it remains a challenge to design provably efficient algorithms for non-convex saddle-point problems, especially when the objective function involves an expectation or a large-scale finite sum. Motivated by recent literature on non-convex non-smooth minimization, this paper studies a family of non-convex min-max problems where the minimization component is non-convex weakly convex and the maximization component is concave. We propose a proximally guided stochastic subgradient method and a proximally guided stochastic variance-reduced method for expected and finite-sum saddle-point problems, respectively. We establish the computation complexities of both methods for finding a nearly stationary point of the corresponding minimization problem. Introduction The main goal of this paper is to design provably efficient algorithms for solving saddle-point aka min-max problems of the following form that exhibits non-convexity in the part of minimization: min x R pmax y Rqfx,y ry+gx. Many machine learning tasks are formulated as the above problem. Examples with more details given later include learning with non-decomposable loss Fan et al., 07; Liu et al., 08; Ying et al., 06, learning with uncertain data Chen et al., 07, and distributionally robust optimization Namkoong & Duchi, 07, 06, etc. Although many previous studies have considered the min-max formulation and proposed efficient algorithms, most of them focus on the convex-concave family, in which both r and g are convex and fx, y is convex in x given y and is concave in y given x. However, convex-concave formulations cannot cover some important new methods/technologies/paradigms arising in machine learning. Hence, it becomes an emergent task to design provably efficient algorithms for solving that exhibits non-convexity structure.

2 Solving non-convex min-max problems is more challenging than solving non-convex minimization problems. Although there is increasing interest on non-convex optimization problems, most of the existing algorithms are designed for the minimization problem without considering the max-structure of the problem when it appears, and therefore are not directly applicable to. For example, stochastic gradient descent SGD for a minimization problem assumes that a stochastic gradient is available at each iteration for the objective of the minimization problem, which might be impossible for if the maximization over y is non-trivial or if f contains expectations. When designing an optimization algorithm for, the important questions are whether the algorithm has a polynomial runtime and what quality it guarantees in the output. In the recent studies for non-convex minimization problems Davis & Drusvyatskiy, 08e,c,a,b; Davis & Grimmer, 07; Drusvyatskiy, 07; Drusvyatskiy & Paquette, 08; Ghadimi & Lan, 03, 06; Lan & Yang, 08; Paquette et al., 08; Reddi et al., 06a,b, polynomial-time algorithms have been developed for finding a nearly stationary point that is close to a point where the subdifferential of objective function almost contains zero. Following this stream of work, we would naturally ask a question whether it is possible to design a polynomial time algorithm for that finds a nearly stationary point of the minimization problem, i.e., min x R p {ψx := max y R q[fx,y ry+gx}. In this paper, we give an affirmative answer for a special family of that is weakly non-convex in x and is concave in y. We refer to this family of problems as weakly-convex-concave min-max WCC-MinMax problems. Our answer is based on two primal-dual stochastic first-order methods for the WCC-MinMax problem under different settings. It is worth mentioning that WCC-MinMax problems have important applications in machine learning, which is actually the main motivation for us to pursue numerical methods for WCC-MinMax problems. We discuss some of these applications below and summarize our main contributions at the end of this section.. Applications in Machine Learning Distributionally Robust Optimization for Variance-based Regularization. In a recent work by Namkoong & Duchi 07, a distributionally robust optimization framework of the following form is studied: min max x X y Y n y i f i x ry, i= wheref i x denotes the loss of the model denoted byxon thei-th data point,x R d is a closed convex set, Y = {y R n n i= y i =, y i 0i =,...,n} is a simplex, andr : Y R is a closed convex function. They showed that when ry is the indicator function of the constraint set {y : n i= y i /n ρ} for some ρ > 0, the above min-max formulation achieves an effect that minimizes not only the bias but also the variance of the prediction, which could yield better generalization in some cases. In practice, one may also consider a regularized variant where ry = λvy,/n for some λ > 0, where V, denotes some distance measure of two vectors e.g., KL divergence, Euclidean distance. While optimization algorithms for solving convex-concave fromulation were developed Namkoong & Duchi, 06, it is still underexplored for problems with non-convex losses. When f i x is a non-convex loss function e.g., the loss function associated with a deep neural network, is non-convex in terms of x but is concave in y. Besides learning deep neural networks, non-convex loss functions also arise in robust learning Loh, 07; Xu et al., 08, when there exist outliers in the data or the data follows heavy-tailed distribution. For example, Xu et al. 08 studied learning with non-convex truncated loss fx;a,b = φ α lx;a,b, where a R d denotes the feature vector of an example and b denotes it corresponding output, lx; a, b denotes a standard convex loss function and φ α denotes a truncation function e.g., φ α s = αlog+s/α,α > 0.

3 Learning with non-decomposable loss. In some practical scenarios, one considers minimizing nondecomposable loss - meaning that the objective is not given as a summation of the losses of all training examples. We present two examples below. The first example is to minimize the sum of top-k losses Fan et al., 07; Dekel & Singer, 006, i.e., min x X k f πi x i= whereπ i n i= is a permutation of{,...,n} such thatf π x f π x f πn x for anyx. Whenk =, it reduces to the minimization of maximum loss, i.e., min x X max i f i x Shalev-Shwartz & Wexler, 06. The advantage in statistics of minimizing the sum of top-k losses has been considered Fan et al., 07. However, provable stochastic optimization algorithms for solving the above problem with nonconvex loss functions is not available. In order to transform the above problem into the min-max form, we note that the sum of top-k elements of a vector can be written as Peetre s K-functional Bennett & Sharpley, 988. In particular, Peetre s K-functional with respect top,p, and B > 0, is defined as v Kp,p,B := min w p w,z,w+z=v +B z p. It was shown that v K,,k = k i= v π i, where v π v π v πn Dekel & Singer, 006. Using Peetre s K-functional, we can write the sum of top-k nonnegative losses as min fx K,,k = min max x X x X y Y n y i f i x, 3 where fx = f x,...,f n x and Y = {y R n : y J,,/B }. Here, y Jq,q,s := max y q,s y q is called the Peetre s J-functional that is the conjugate dual of Peetre s K-functional v Kp,p,/s for /p i +/q i =, i =,. The second example of learning with non-decomposable loss is AUC maximization for binary classification with non-linear models. In particular, let gx; a denote the prediction score of a non-linear model parameterized by x e.g., deep neural networks on an example a. Given a set of examples {a i,b i } n i= where b i {, }, the AUC maximization problem over this set with a squared loss is written as i= min x X n + n n n [ gx;a i gx;a j I bi = b j =, i= j= where n + and n represent the numbers of positive and negative b i s, respectively, and I A = if event A holds and I A = 0 otherwise. It was shown that the above problem can written equivalently as a min-max problem that facilitates stochastic optimization Ying et al., 06: min max x X,z,z R y R n n f i x,z,z,y 4 i= where f i x,z,z,y = gx;a i z I b i= +gx;a i z I b i= Ibi= ++ygx;a i I b i= y, p + p p p + 3

4 p + = n i= I b i =/n and p = n i= I b i = /n. Let x = x,z,z. It is clear that f i x,y is nonconvex in x and strongly concave in y when gx; a is a non-convex function of x. Robust learning from Multiple Distributions. As an last example, we consider robust learning with multiple distributions. Let P,...,P m denote m distributions of data, which might be different perturbed versions with observable data of the underlying true distribution P 0. Robust learning from multiple distributions is to minimize the maximum of expected loss over themdistributions, i.e., min x X max E a P i [Fx;a = min i {,...,m} max x X y Y i= m y i f i x;a, 5 where Y = {y R n n i= y i =, y i 0 i =,...,n} is a simplex, Fx;a a non-convex loss function e.g., the loss defined by a deep neural network of a model parameterized by x on an example a, and f i x = E a Pi [Fx;a denotes the expected loss on distribution P i. It is an instance of that is non-convex in x and concave in y R m. It is notable that such a formulation also has applications in adversarial machine learning Madry et al., 07, in which P i corresponds to one distribution that one can follow to generate adversarial examples. The goal of adversarial machine learning is to learn a model that is robust to different kinds of adversarial examples.. Contributions In spite of our assumption on the concavity of f in y, there still remain great challenges for developing provable efficient algorithms for due to the non-convexity of f in x. One possible approach to solve is to view it as a non-convex minimization problem min x R p {ψx := max y R q[fx,y ry+gx} and solve it with existing first-order methods for non-convex optimization. By doing so, in iteration t of the first-order method, one needs to compute a subgradient of ψx t at iterate x t, which may require applying another algorithm to find an optimal or nearly-optimal solution, denoted by y t, for the concave maximization problem max y R q[fx,y ry. Then, one may compute a stochastic subgradient of fx t,y t with respect to x to update x t. However, this approach may have the following issues: i the maximization subproblem might not be easily solved; for example the maximization problem in 5 involves evaluating the expectation over a P i, which could be expensive; ii solving the maximization approximately introduces error in computing the subgradient of ψx that makes the convergence analysis more difficult. Moreover, when f involves expectation, an unbiased stochastic subgradient of fx, y at x,y = x t,y t is no longer a unbiased stochastic subgradient ofψx at x = x t. The major contribution of this work is the development and analysis of several efficient stochastic algorithms for tackling the WCC-MinMax problem without solving the maximization problem explicitly at each iteration. We establish the convergence of the proposed algorithms in terms of the stationarity of the corresponding minimization objective ψx. Since ψx could be non-smooth, we use the norm of its Moreau envelope s gradient as a convergence measure. The key assumptions we make are that fx,y is weakly convex x and concave in y and that the domains of r and g are both convex and compact. We consider two different settings in : i f is given as the expectation of a stochastic function and ii f can be represented as an average of finitely many smooth functions. The applications we discuss in Section. belows to at least one of these settings. Motivated by the proximally guided technique introduced by Davis & Grimmer 07, we develop a different stochastic first-order method for each two settings and analyze their time complexities. When f is given as the expectation of a stochastic function, we present a stochastic primal-dual subgradient algorithm for that updates the solutions using the stochastic subgradient of f. When 4

5 fx,y is strongly concave in y, this method achieves a time complexity of Õ/ǫ4 for finding a nearly ǫ-stationary solution see Section 3 for its definition. When fx, y is concave but not strongly concave in y, our method has a time complexity of Õ/ǫ6. The algorithm is applicable to distributionally robust optimization, minimizing the sum of top-k losses 3, AUC maximization with non-convex prediction models 4, and robust learning from multiple distribution 5. When f can be represented as an average of n smooth functions, we present a stochastic variancereduced gradient method for that achieves an improved iteration complexity. In particular, when f is strongly concave iny, our method has a time complexity ofõn/ǫ for finding a nearly stationary solution. This algorithm is also applicable to distributionally robust optimization, minimizing the sum of top-k losses 3, and AUC maximization with non-convex prediction models 4. Related Work In this section, we review some closely related work. We focus on stochastic algorithms for solving nonconvex min-max optimization problems and their applications for machine learning problems. Recently, there is growing interest on designing and analyzing stochastic algorithms for solving nonconvex minimization problems. Assuming stochastic gradient of the minimization objective function is available at each iteration, stochastic gradient descent SGD Ghadimi & Lan, 03 and its variants e.g., stochastic heavy-ball method Yang et al., 06; Ghadimi & Lan, 06 have been analyzed for finding a stationary point of a smooth objective function. When the objective function is summation of a finite number of smooth functions, accelerated stochastic algorithms with improved complexity were proposed based on variance reduction techniques Reddi et al., 06b,a; Lan & Yang, 08; Allen-Zhu, 07b; Allen-Zhu & Hazan, 06. These algorithms have been applied for learning deep neural networks by minimizing empirical risk over training data Krizhevsky et al., 0. However, none of them are directly applicable to solving the considered non-convex min-max problems in this paper. Several recent studies have tried to extend the algorithms and theories to non-convex minimization problems with non-smooth objective functions by assuming the objective is weakly convex Davis & Drusvyatskiy, 08e,d; Drusvyatskiy & Paquette, 08; Davis & Grimmer, 07; Chen et al., 08 or relatively weakly convex Zhang & He, 08. Tools based on the Moreau envelope are used in these studies to tackle the non-smoothness of the objective function. Since we also consider non-smooth objective functions, these tools will be reviewed in the next section. It is notable that the proposed algorithms fall in the same framework as that developed in Davis & Grimmer, 07; Chen et al., 08, i.e., the updates are divided into multiple stages and each stage a stochastic algorithm is employed for solving a regularized convex problem. However, they did not explicitly tackle weakly-convex-concave problems in their papers. A family of weakly convex composite optimization was studied in Drusvyatskiy & Paquette, 08, which is formulated as min x Rphcx +gx 6 where g : R p R {+ } is closed and convex, h : R q R is Lipschitz-continuous and convex and c : R p R q is smooth with a Lipschitz-continuous Jacobian matrix. This family of problem is a special case of considered in this work. In fact, we can reformulate 6 into by writing it as min x R pmax y R qy cx h y+gx 7 5

6 where h z = sup y R q{y z hy}, the Fenchel conjugate of h. Drusvyatskiy & Paquette 08 proposed a prox-linear method for solving 6 where the iterate x t is update by approximately solving x t+ argminh cx t + cx t x x t +gx+ x R p η x xt 8 where c is the Jacobian matrix of c and η > 0 is a step length. Drusvyatskiy & Paquette 08 proposed using accelerated gradient method to solve 8 or its dual. However, this approach requires the exact evaluation of c and c which is computationally expensive, for example, when c involves high-dimensional expectations or is given as a sum of a large number of functions see and 5. One approach to avoid this issue so that the prox-linear method by Drusvyatskiy & Paquette 08 can be still applied is to solve 8 as a min-max problem x t+ argminmax x R p y R qy cx t + cx t x x t h y+gx+ η x xt 9 which can be solved by primal-dual first-order methods based on the sample approximation of c and c. However, such an approach requires c to be smooth to ensure the convergence of the main iterate in the prox-linear method, which restricts its applicability. In contrast, the proposed algorithms in this paper do not require c to be smooth though smoothness allows us to design faster algorithms and can be applied to the general problem instead of just 6. Non-convex min-max problems have recently received increasing attention in machine learning. Chen et al. 07 considered robust learning from multiple datasets to tackle uncertainty in the data, which is a special case of 5. They proposed and analyzed an algorithm by assuming that the minimization problem given y can be solved to a certain optimality ratio with respect to the global minimizer. However, such assumption does not hold in reality since the minimization problem is a non-convex problem. Qian et al. 08 considered a similar problem to 5 and analyzed a primal-dual style stochastic algorithm. There are several differences between their results and our results: i they require E a Pi [fx;a to be a smooth function, while we only assume it is a weakly convex function that could be non-smooth; ii they provide convergence for the partial gradient of fx, y with respect to the primal variable x, while we provide convergence for the gradient of the Moreau envelope for the minimization objective ψx := max y R q[fx,y ry+gx; iii when the objective fx, y has a finite-sum structure with smoothness properties, we also provide an accelerated stochastic algorithm based on variance reduction. These differences make our results much stronger and broader. Sinha et al. 07 considered a particular non-convex min-max problem for adversarial learning. For optimization, they restricted the regularization parameter to be large and showed that the minimization objective is smooth under some smoothness assumptions offx,y and then applied the convergence analysis of SGD for smooth functions. Besides the convergence measures based on gradients, there are studies trying to analyze the convergence to min-max saddle points Cherukuri et al., 07. However, such convergence usually require much stronger assumptions about the problem. 3 Preliminaries We present some preliminaries in this section. According to the majority of the applications of, we use the Euclidean norm and a generic norm in R p and R q, respectively. The dual norm of is denoted by. We consider first-order optimization algorithm for solving the following problem min x R p { ψx := max y R q[fx,y ry+gx }. 0 6

7 Given a function h : R d R {+ }, we define the Fréchet subdifferential of h as hx = {ζ R d hx hx+ζ x x+o x x, x x}, where each element in hx is called a Fréchet subgradient of h at x. We define x fx,y as the subgradient of fx,y with respective to x for a fixed y and y [ fx,y as the subgradient of fx,y with respective toy for a fixedx. LetX := domg andy := domr. Letd y : Y R be a-strongly convex function with respect to which is differentiable in inty. The Bregman divergence associated to d y is defined as V y : Y inty R satisfying V y y,y := d y y d y y d y y,y y. Note that we have V y y,y y y. We say a function h : Y R isµ-strongly convex with respect tov y if hy hy +ζ y y +µv y y,y for anyy,y Y inty and ζ hy. We say a function h : X R isρ-weakly convex if hx hx +ζ x x ρ x x for anyx,x X and ζ hx. The assumptions made throughout this paper are the following: Assumption. f is real-valued and the following statements hold: A. fx,y is ρ-weakly convex in x, i.e., fx,y fx,y+ζ x x ρ x x for x,x X, y Y and ζ x fx,y B. fx,y is concave in y, i.e., fx,y fx,y ζ y y for x X, y,y Y and ζ y [ fx,y. C. r is closed and µ-strongly convex with respect tov y with µ 0, and g is closed and convex on X. D. X := domg and Y := domr are non-empty and convex compact sets. Remark. In fact, the boundedness assumption of X = domg can be removed without affecting our main results. Similarly, when µ > 0, the boundedness assumption of Y = domr can be also removed. We choose to keep the boundedness assumptions only for keeping the proofs of these results relatively simple. Under Assumption, ψ is ρ-weakly convex and thus non-convex so that finding the global optimal solution in general is difficult. An alternative goal is to find a stationary point of 0 which is defined as a point x X such that 0 ψx, Because of the iterative nature of optimization algorithms, such a stationary point generally can only be approached in the limit as the number of iterations increases to infinity. With finitely many iterations, a more realistic goal is to find an ǫ-stationary point, i.e., a point ˆx X satisfying Dist0, ψˆx := min ζ ψˆx ζ ǫ. 7

8 However, when the objective function is non-smooth, computing an ǫ-stationary point is still not an easy task even for convex optimization problem. A simple example ismin x R x where the only stationary point is 0 but any x 0 is not an ǫ-stationary point ǫ < no matter how close it is to 0. This situation is likely to occur in problem 0 because of the potential non-smoothness of f and r. Therefore, following Davis & Drusvyatskiy 08a, Davis & Grimmer 07, Davis & Drusvyatskiy 08b and Zhang & He 08, we consider the Moreau envelope of ψ in 0, which is ψ γ x := min z R p { ψz+ γ z x where γ > 0. Since ψ is ρ-weakly convex by Assumption, if γ > ρ, ψ γ is a smooth function whose gradient is ψx = γ x prox γψ x 4 where prox γψ x is the proximal point ofxdefined as } 3 { prox γψ x := argmin ψz+ } z R p γ z x. 5 Note that, when γ > ρ, the minimization in 3 and 4 is strongly convex so that prox γψ x is uniquely defined. According to the argument by Davis & Drusvyatskiy 08a, Davis & Grimmer 07, Davis & Drusvyatskiy 08b and Zhang & He 08, the norm of the gradient ψ γ x can be used as measure of the quality of a solution x. In fact, let x = prox γψ x. The definition of the Moreau envelope directly implies that for any x R d, x x = γ ψ γ x 6 ψx ψ x 7 Dist0, ψx = ψ γ x. 8 Therefore, if we can find a nearlyǫ-stationary point, which is defined as a point x X with ψ γ x ǫ, we can ensure x x γǫ and Dist0, ψx ǫ, namely, x is closed to ǫ-stationary point i.e., x. Because of this, we will focus on developing a first-order method for 0 and analyze its computational complexity for finding a nearly ǫ-stationary point. 4 Proximally Guided Approach The method we proposed is largely inspired by the proximally guided stochastic subgradient method by Davis & Grimmer 07. Davis & Grimmer 07 consider min x R p ψx with ψx being ρ-weakly convex. The problem they considered is more general and does not necessarily have the minimax structure in 0. After choosing a point x X, their method uses the standard stochastic subgradient SSG method to approximately solve the following subproblem min ψx+ x X γ x x, 9 which is strongly convex when γ > ρ. Then, x is updated by the solution returned from the SSG method. They show that their method finds x with ψ γ x ǫ a nearly ǫ-stationary point within Õ total ǫ 4 8

9 iterations of the SSG method. Davis & Grimmer 07 assume that one can compute a deterministic or stochastic subgradient of ψx in order to apply the SSG method to solve 9. However, this requires solving max y R q [fx,y ry exactly which can be time consuming or impossible, especially when f involves expectations. In order to address this challenge, we view the subproblem 9 also as a min-max saddle-point problem according to the structure of 0, and then solve 9 using primal-dual stochastic first-order methods. In particular, we consider min x R pmax y R q{φ γ,λx,y; x,ȳ}, 0 where φ γ,λ x,y; x,ȳ := fx,y+ γ x x λ V yy,ȳ ry+gx and x,ȳ X inty. Under Assumption, it is easy to show that, when γ > ρ, φ γ,λx,y; x,ȳ is µ x -strongly convex inxfor anyy Y and µ y -concave with respect to V y iny for anyx X with µ x := γ ρand µ y := λ +µ for any x,ȳ X inty. Hence, problem 0 is a convex-concave min-max problem which is relatively easy to solve compared to. We then apply primal-dual stochastic first-order methods to find an approximate solution for 0 and use it to update x,ȳ. Compared to subproblem 9 considered by Davis & Grimmer 07, we deduct a proximal term λ V yy,ȳ from the maximization in 0 along with adding a proximal term γ x x for the minimization in 0. This is only to make our subproblem 0 strongly convex-concave so that we can solve it efficiently using first-order methods. We will follow the convention that λ = 0 if λ = +. Therefore, when ry has been strongly convex µ > 0 already, we will not need the term λ V yy,ȳ and we will choose λ = +. In this case, 0 is equivalent to 9. When ry is not strongly convex µ = 0, we will have to choose a different value forλin each main iteration and the value will increase to infinity as the iterations proceed to make sure 0 closely approximates 9. 5 Minimization of Maximum of Stochastic Functions In this section, we will present our first algorithm which is designed for when the following assumptions are satisfied in addition to Assumption. Assumption. The following statements hold. A. fx,y = E[Fx,y,ξ, where ξ is a random variable and Fx,y,ξ is real-valued, ρ-weakly convex inxfor any y Y, and µ-concave with respect to V y in y for any x X almost surely. B. For any x,y X Y and any realization of ξ, we can compute g x, g y x Fx,y,ξ y [ Fx,y,ξ such that Eg x, Eg y x fx,y y [ fx,y. C. E g j x M x and E g j y y, M y for any x,y X Y for some M x > 0 and M y > 0. We then propose a proximally guided stochastic mirror descent PG-SMD method for the saddle-point problem 0. In this method, we approximately solve 0 with x,ȳ = x t,ȳ t and λ = λ t in the main iteration t using the primal-dual stochastic mirror descent SMD method by Nemirovski et al

10 Algorithm PG-SMD: Proximally Guided Stochastic Mirror Descent Method : Choose x 0 X,ȳ 0 Y. Input ρ > 0, γ 0,/ρ. : fort = 0,...,T do 3: Set x 0 = x t and y 0 = ȳ t. 4: Choose j t, λ t > 0, µ t y = λ t +µ, ηx j = µ and xj+ ηj y = µ t y j+ for j = 0,...,j t. 5: for j = 0,...,j t do 6: Sample ξ j from ξ and compute g x j, g y j x Fx j,y j,ξ j y [ Fx j,y j,ξ j 7: Solve x j+ = min x X x g x j + ηx j x x j + γ x xt +gx y j+ = min y Y y g j y + η j yv y y,y j + λ t V y y,ȳ t +ry 8: end for 9: 0: end for x t+ = j t j t j t + j=0 j +x j, ȳ t+ = j t j +y j j t j t + j=0 Then, x t+,ȳ t+ in iteration t+ is chosen to be the approximate solution we obtained from solving 0 with x,ȳ = x t,ȳ t. We present this PG-SMD method in Algorithm with details. Note that Line 5-8 of Algorithm are exactly the steps of the primal-dual SMD method by Nemirovski et al. 009 when applied to 0 with x,ȳ = x t,ȳ t and λ = λ t. To analyze the convergence property of Algorithm, we first present the convergence result of SMD for solving 0. Different from the analysis in Nemirovski et al. 009, we show a better convergence rate for SMD by exploiting the strong convexity and strong concavity in 0. The step size in Algorithm and the analysis we use to ensure this better convergence rate are borrowed from Lacoste-Julien et al. 0 who consider a strongly convex minimization problem. Proposition. Let x t = prox γψ x t, where x t is generated in Algorithm. Algorithm guarantees [ E max φ γ,λ x t+,y; x t,ȳ t min φ γ,λx,ȳ t+ ; x t,ȳ t y Y x X 0Mx + 0M y j t + µ x µ t + 4D x +4µ t y y γ D y +4Q g +4Q r. and µ x xt+ x t 0Mx + 0M y j t + µ x µ t y + 4D x γ +4µt yd y +4Q g +4Q r + D y λ t. where D x := max x x,x X x, D y := max V yy,y, Q g := max x X gx min x X gx and y,y Y Q r := max y Y ry min y Y ry 0

11 Proof. For the simplicity of notation, we writeφ γ,λ x,y; x t, x t as φx,y and define ḡx = γ x xt +gx and ry = λ V yy,ȳ t +ry. By the standard analysis of the SMD method in Nemirovski et al. 009, we can show that φx j,y φx,y j +x j x j x y j y j y ηj x M x + ηx j + ρ x x j ηx j + x x j+ γ ḡxj+ +ḡx j + ηj y M y + ηyv j y y,y j +µ y V y y,y j+ ry j+ + ry j, 3 for anyx X and anyy Y where j x Let x 0,ỹ 0 = x 0,y 0 and let η j y := gj x Egj x and j y := g j y Eg j y. x j+ = min x, j x + x X ỹ j+ = min y Y y, j y ηx j x x j + γ x xt +gx + V y y,ỹ j + λ V yy,ȳ t +ry η j y for j = 0,,...,j t. Using a similar analysis as in the SMD method again, we can obtain the following inequality similar to 3. x j x j x +ỹj y j y + ηx j x x j ηx j 4ηj xm x + 4ηj y M y η j y + γ x x j+ ḡ x j+ +ḡ x j + ηyv j y y,ỹ j +µ y V y y,ỹ j+ rỹ j+ + rỹ j. 4 By the definition of µ x, µ t y, ηj x and ηx, j we can obtain the following inequality by multiplying 3 and 4 by j + and adding them together for j = 0,,...,j t : j + φx j,y φx,y j +j +x j x j j x j +y j ỹ j j y 5j +M x j +j +µx ρj + j +j + + x x j +µx + j + x x j+ µ x j γ + 5j +M y + µ yj +j + V y y,y j µy j +j + +µ y j + V y y,y j+ µ y j + j +ḡx j+ +j +ḡx j j + ry j+ +j + ry j + j +j +µ x j +j x x j +µx + j + x x j+ 4 4 γ + j +j +µ y V y y,ỹ j µy j +j + +µ y j + V y y,ỹ j+ j +ḡ x j+ +j +ḡ x j j + rỹ j+ +j + rỹ j.

12 Summing the inequality above forj = 0,,...,j t and organizing terms yields j t j + φx j,y φx,y j j=0 j t j=0 [ j +x j x j j x +j +y j ỹ j j y + 5j tmx + j tdx + 5j tmy +j t µ y Dy µ x γ µ +j tq g +j t Q r. y Dividing the inequality above by j t jtjt+ j=0 j + = and applying the convexity φ in x and the concavity of φ iny, we obtain φ x t+,y φx,ȳ t+ j t j t + + j t + j t j=0 [ j +x j x j j x +j +y j ỹ j j y 0M x µ x + 4D x γ + 0M y µ y +4µ y D y +4Q g +4Q r. Finally, by maximizing both sides of the inequality above over x X and y Y and taking expectation, we obtain. Here, we use the fact E j x = 0 and E j y = 0 conditioning on all the stochastic events in iterations 0,,...,j. We follow the convention that λ = 0 if λ = +. Hence, we have µ x xt+ x t ψ x t+ + E E γ xt+ x t ψxt γ xt x t [ max φ γ,+ x t+,y; x t,ȳ t min φ γ,+ x,ȳ t+ ; x t,ȳ t y Y x X [ max φ γ,λ x t+,y; x t,ȳ t min φ γ,λx,ȳ t+ ; x t,ȳ t y Y x X which implies. Here, we use the µ x -strong convexity of φ γ,λ inx. + D y λ Theorem. Letx t = prox γψ x t, where x t is generated in Algorithm. Supposeτ is an random index uniformly distributed on {0,,...,T }. Also, suppose γ = ρ, λ t = + and j t = t+if µ > 0 and λ t = t+ and j t = t+ if µ = 0. Algorithm ensures EDist0, ψx τ γ E xτ x τ ǫ, with a total number of gradient calculations at mostõǫ 4 if µ > 0 and Õǫ 6 if µ = 0. Proof. By the definition of x t, we have ψx t + γ xt x t ψ x t 5

13 Borrowing the inequalities derived by Davis & Grimmer 07 in the proof of their Theorem, we have x t x t x t+ x t = x t x t + x t+ x t x t x t x t+ x t Let E t := 0Mx + 0M y j t + µ x µ t y Using 6, Proposition, and the relationship that x t x t + x t x t+ x t x t+ = x t x t x t x t+ + x t x t+ 3 xt x t +4 x t x t D x γ +4µt yd y +4Q g +4Q r ψ x t+ + γ xt+ x t ψxt γ xt x t max y Y φ γ,λ t x t+,y; x t,ȳ t min x X max y Y φ γ,λ t x,y; x t,ȳ t + D y λ t. we have max y Y φ γ,λ t x t+,y; x t,ȳ t min x X φ γ,λ t x,ȳ t+ ; x t,ȳ t + D y λ t, ψ x t+ ψx t + γ xt ψx t + γ 3 xt x t γ xt+ x t +E t + D y λ t ψx t + 6γ xt x t + 4 γµ x x t +4 xt x t+ E t + D y λ t +E t + D y λ t +E t + D y λ t. Combining this inequality with 5 leads to ψ x t+ ψ x t γ xt x t + 4 6γ xt x t + + E t + D y γµ x λ t ψ x t 4 3γ xt x t + + E t + D y γµ x λ t Adding the inequalities for t = 0,...,T yields T ψ x T ψ x 0 4 3γ xt x t + + γµ x Rearranging the inequality above gives T x t 4 x t 3γ ψ x 0 ψ + + γµ x T E t + D y λ t T E t + D y λ t 3

14 where ψ = min x X ψx. Let τ be a uniform random variable supported on {0,,...,T }. Then, we have E x τ x τ 3γ T 4 ψ x 0 ψ + + E t + D y T γµ x λ t Suppose µ > 0. We choose j t = t+ and λ t = + such that λ t = 0 and µ t y = µ. Therefore, T E t + D T y 0Mx = + 0M y + 4D x +4µDy λ t t+3 µ x µ γ +4Q g +4Q r 0Mx lnt + + 0M y + 4D x +4µDy µ x µ γ +4Q g +4Q r. To ensure γ E xτ x τ ǫ, we only need to set T = Õǫ and the the total number of gradient calculations required to compute x τ is bounded by T TT +3 j t = = Õǫ 4 Suppose µ = 0. We choose j t = t+ and λ t = t+ such that λ t = µ t y = t+. Therefore, T To ensure E t + D y λ t T [ = t+ lnt + 0Mx +0t+My + 4D x + 4D y µ x γ 0M x µ x +0M y + 4D x γ +4D y +4Q g +4Q r t+ +4Q g +4Q r EDist0, ψ + X x τ γ E xτ x τ ǫ, + D y t+ we only need to set T = Õǫ and the the total number of gradient calculations required to compute x τ is bounded by T j t = T3 +6T +9T = 3 Õǫ 6 When µ > 0, the complexity of Algorithm for finding a nearly ǫ-stationary point is Õǫ 4 which matches the complexity of the methods by Davis & Drusvyatskiy 08a, Davis & Grimmer 07, Davis & Drusvyatskiy 08b and Zhang & He 08. The complexity of Algorithm significantly increases to Õǫ 6 when µ = 0. However, in either case, we are not aware of any other existing first-order methods that can provably find a nearly ǫ-stationary point of problem under Assumption and. 4

15 6 Minimization of Maximum of Deterministic Functions In this section, we will present our second algorithm which is designed for when the following assumptions are satisfied in addition to Assumption. Assumption 3. The following statements hold. A. fx,y = n n i= f ix,y, where f i is real-valued, ρ-weakly convex in x for any y Y, and µ-concave with respect to V y iny for any x X. Moreover, µ > 0. B. f i is differentiable. Also, x f i isl x -Lipschitz continuous and y f i isl y -Lipschitz continuous. Here, x f i and y f i are the partial gradients of f i with respect toxand y, respectively. Different from the setting of the previous section, we are able to evaluate the value and the gradient of f but the computational challenge still arises when the number of functions n is large. We then propose a proximally guided stochastic variance-reduced gradient PG-SVRG method for the saddle-point problem 0 under Assumption and 3. In contrast to Algorithm, we apply the stochastic variance-reduced gradient SVRG method with the Bregman divergences defined in to approximately solve 0 with x = x t, λ = +, and λ V yy,ȳ = 0 at main iteration t. Then, we choose x t+ as the obtained approximate solution for iteration t +. The SVRG method was originally developed for finite-sum minimization Johnson & Zhang, 03; Defazio et al., 04; Xiao & Zhang, 04; Allen-Zhu & Yuan, 06; Allen-Zhu, 07a and it has been extended by Palaniappan & Bach 06; Shi et al. 07 for finite-sum min-max problems. Generally speaking, the SVRG method consists of multiple stages and, in each stage, a fixed number of stochastic gradient steps are performed. At the beginning of each stage, the SVRG method computes a deterministic gradient at a reference point, which is later used in every stochastic gradient step to construct a variancereduced stochastic gradient of the objective function. The reference point is updated only once in each stage followed by an update of the deterministic gradient. We present the PG-SVRG method in Algorithm. This algorithm consists of three levels of loops. The outermost loop is designed for updating x in 0. In each outermost iteration index byt, 0 is solved by the standard SVRG method with x = x t. Each iteration index by k in the middle-level loop corresponds to a stage of SVRG where the deterministic gradients ĝ k x and ĝ k y are computed at the reference point ˆx k,ŷ k. Each iteration index by j in the innermost loop performs a similar primal-dual gradient step as in Algorithm except that the variance-reduced gradients g k x and g k y are used. Proposition. Let x t = prox γψ x t, where x t is generated in Algorithm. [ E max and 3 4 φ γ,+ x t+,y; x t,ȳ t min y Y kt 4 + Λ φ γ,+ x,ȳ t+ ; x t,ȳ t x X [µx D x +µ yd y. 30 µ x xt+ x t where D x andd y are defined as in Proposition. kt Λ [µx Dx +µ ydy 3 5

16 Algorithm PGSVRG: Proximally Guided Stochastic Variance-Reduced Gradient Method : Choose x 0 X,ȳ 0 Y. Input ρ > 0, γ 0,/ρ. : fort = 0,...,T do 3: Set ˆx 0 = x t and ŷ 0 = ȳ t. 4: Choosek t,µ t y = µ,ηx j = µ xλ t,ηy j = µ t yλ t, andj t = Λ tlog4 forj =,...,j t, where Λ t = 5max{L x,l y} min{µ x,µt y }. 5: for k = 0,...,k t do 6: Compute ĝ k x = x fˆx k,ŷ k and ĝ k y = y fˆx k,ŷ k 7 7: Set x 0,y 0 = ˆx k,ŷ k 8: for j = 0,...,j t do 9: Sample l from {,,...,n} uniformly and compute f l x k,y k and g x j g y j = ĝ x k x f l ˆx k,ŷ k + x f l x j,y j 8 = ĝ y k y f l ˆx k,ŷ k + y f l x j,y j 9 0: Solve x j+ y j+ = min x X x g x j + ηx j x x j + γ x xt +gx = min y Y y g j y + η j yv y y,y j +ry : end for : ˆx k+ = x jt, ȳ k+ = y jt 3: end for 4: x t+ = x kt, ȳ t+ = y kt 5: end for Proof. We first analyze the convergence property of SVRG in solving 0 with x,ȳ = x t,ȳ t and λ = +, which are performed in the outermost iterationt in Algorithm. Let s focus on thekth iteration of the middle loop in Algorithm. LetE j represent the conditional expectation conditioning onx 0,y 0 = ˆx k,ŷ k and all random events that happen before iteration j of the innermost iteration of Algorithm. Again, for the simplicity of notation, we writeφ γ,λ x,y; x t, x t asφx,y and define ḡx = γ x xt +gx and ry = λ V yy,ȳ t +ry. The definition of x j+,y j+ and the optimality conditions of x t,y t imply γ + x x j+ ηx j +x j+ g x j +ḡx j+ + xj x j+ ηx j x g j x +ḡx+ x xt η j x 3 6

17 and µ t y + η j y V y y,y j+ y j+ g j y + ry j+ + V yy j+,y j η j y y g j y + ry+ V yy,y j η j y 33 for anyx X and y Y. We define Px := fx,y t +ḡx fx t,y t ḡx t 34 Dy := fx t,y ry fx t,y t + ry t. 35 Note that min x X Px = Px t = 0 and max y Y Dy = Dy t = 0. By the µ x -strong convexity of Px with respect to Euclidean distance and the µ t y -strong convexity of Dy with respect to Bregman divergence V y, we can show that Px µ x x x t and Dy µ t yv y y,y t 36 We choose x = x t in 3 and y = y t in 33, and then add 3 and 33 together. After organizing terms, we obtain γ + t x x j+ ηx j + xj x j+ ηx j + µ y + ηy j V y y t,y j+ + V yy j+,y j ηy j + Px j+ Dy j+ xt x j η j x x t x j+ g x j yt y j+ g y j V yy t,y j ηy j fx,y t +fx t,y We reorganize the right hand side of the inequality above as follows x t x j+ g x j yt y j+ g y j = x t x j [ g j x x fx j,y j +x j x j+ [ g j x xfx j,y j y t y j [ g y j y fx j,y j y j y j+ [ g j y y fx j,y j +x t x j+ x fx j,y j y t y j+ y fx j,y j 37 Next, we study the three lines on the right hand side of 3, respectively. Since the random index l is independent of x i and y i, we have [ E t x t x j [ g x j xfx j,y j y t y j [ g y j y fx j,y j = 0 38 by the definition of g j x and g j y. By the definition of g j x, Cauchy-Schwarz inequality and Young s in- 7

18 equality, we have [ E t x j x j+ [ g x j x fx j,y j a t E t x j x j+ + a t E t x fˆx k,ŷ k x f l ˆx k,ŷ k + x f l x j,y j x fx j,y j E t x j x j+ a +a tl x xj x t +a tl x ˆxk x t t +4a t L x V yy j,y t +4a t L x V yŷ k,y t 39 Similarly, we can prove that [ E t y j y j+ [ y fx j,y j g j b t V y y j+,y j +b t L y ˆxk x t +b tl y xj x t. y +4b t L yv y y j,y t +4b t L yv y ŷ k,y t. 40 Next, we consider the third line in the right hand side of 37. We first show the following upper bound on x t x j+ x fx j,y j. x t x j+ x fx j,y j 4 x t x j+ x fx j+,y j+ + c t x t x j+ + c t L x xj x j+ + c t L x yj y j+ fx t,y j+ fx j+,y j+ + ρ xt x j+ + c t x t x j+ + c t L x xj x j+ +c tl x V yy j+,y j Similarly, we have the following upper bound on y t y j+ y fx j,y j y t y j+ y fx j,y j 4 fx j+,y t +fx j+,y j+ + d t V y y j+,y t + d t L y x j x j+ +d t L yv y y j+,y j Applying 38, 39, 40, 4, 4 to 37 lead to µ x + Et x t x j+ + c t η j x 4a t L x +4b tl y + η j x t x x j + µ y + ηy j d t + a t L x +b t L t y x ˆx k + 4a t L x +4b t L y + +c t L x a +d tl y x j x j+ t ηx j + 4a t L x +4b tl y + η j y E t V y y t,y j+ + Px j+ Dy j+ Vy y t,ŷ k V y y t,y t b t +c t L x +d tl y η j y V y y j+,y j 43 8

19 min{µ x,µ t y} Choosing a t = b t = 48 max{l x,l y},c t = µ x, and d t = µ y in the inequality above, we obtain µx Et x t x j+ + ηx j µx 6 + t x x j ηx j max{ } L x,l y min { } µ x,µ t y ηx j + µ y µ y 6 + η j y + ηy j x j x j+ E t V y y t,y j+ + Px j+ Dy j+ V y y t,y t + µ x Given that ηx j = µ xλ t and ηy j = µ t yλ t withλ t = 5 max{l x,l y} above becomes µ x E t x t x j /+3Λ t +Λ t xt ˆx k + µ y 6 V yy t,ŷ k + 5 max{ } L x,l y min { µ x,µ y} t ηy j V y y j+,y j 44 min{µ x,µt y } +µ t ye t V y y t,y j+ + [ µx x t x j µ x x t ˆx k +Λ t +µ t yv y y t,y t +µ t yv y y t,ŷ k. after organizing terms, the inequality Px j+ Dy j+ Applying this inequality recursively forj = 0,,...,j t and organizing terms, we have. µ x E t x t x jt 3/+3Λ t + +µ t y E tv y y t,y jt + jt [ µx x t x 0 µ x x t ˆx k +µ t yv y y t,ŷ k +Λ t +µ t y V yy t,y 0. Px j t Dy jt Recall that x 0,y 0 = ˆx k,ŷ k and x jt,y jt = ˆx k+,ŷ k+. If we choose j t = + 3/+3Λ t log4, the inequality implies 3 4 µ x E t x t ˆx k+ +µ t y E tv y y t [ µ x x t ˆx k +µ t yv y y t,ŷ k,ŷ k+ + +Λ t Pˆx k+ Dŷ k+ 45 Note that, because ry is µ t y-strongly convex, using a standard analysis, we can show that the function of xmax y Y fx,y ry is smooth and its gradient isl x +L x /µt y -Lipschitz continuous. Similarly, because ḡx is µ x -strongly convex, we can show that the function of y min x X fx,y +ḡx is smooth and its gradient isl y +L y /µ x-lipschitz continuous. Therefore, we define Px := max y Rqφx,y and Dy := min x Rpφx,y 46 9

20 and we can easily show that Px Dy Px Dy+ L x µ x + L x µ x x x t µ t y + L y µ t y + L y µ µ t yv y y t,y x for anyx X and y Y. Choosing x,y = ˆx k+,ŷ k+ and applying 45 yield Pˆx k+ Dŷ k+ 3 [ 4 +Λ µ x x t ˆx k t +µ y V y y t,ŷ k Applying this inequality and 45 recursively for k = 0,,...,k t, we have kt [ 3 Pˆx kt Dŷ kt 4 +Λ µ x x t ˆx 0 t +µ y V y y t,ŷ 0 According to the facts that ˆx kt,ŷ kt = x t+,ȳ t+ and ˆx 0,ŷ 0 = x t,ȳ t kt [ 3 P x t+ Dȳ t+ 4 +Λ µ x x t x t t +µ y V y y t,ȳ t, which completes the proof of is implied by theµ x -strong convexity ofφinxusing the same proof at the end of Proposition. Theorem. Letx t = prox γψ x t, where x t is generated in Algorithm. Supposeτ is an random index uniformly distributed on {0,,...,T }. Also, suppose, γ = ρ and Algorithm ensures k t = log t+ + γµ x 4 + Λ t γ E xτ x τ ǫ, µ x D x +µ t yd y with a total number of gradient calculations at mostõn+ max{l x,l y } min{ρ,µ } ǫ. Proof. Let F t := 3 kt Λ [µx Dx +µ t ydy. Using the same argument in the beginning of the proof of Theorem but withλ t = +, we have E x τ x τ 3γ T 4 ψ x 0 ψ + + F t + D y T γµ x λ t where ψ = min x X ψx. We choose k t = + 4log λ t = 0 and µ t y = µ. Therefore, T t+ 4 γµ x Λ µx Dx +µt y D y and λ t = + such that F t + D y λ t = T t+ π 6.. 0

21 To ensure γ E xτ x τ ǫ, we only need to set T = Oǫ and the the total number of gradient calculations required to compute x τ is bounded by T k t n+j t = Õn+Λ tǫ Therefore, whenµ > 0, the complexity of Algorithm for finding a nearlyǫ-stationary point isõnǫ which is better than the complexity Õǫ 4 by Algorithm in the dependency on ǫ. 7 Conclusion Min-max saddle-point problems appear frequently in many real-world applications of machine learning and statistics. A large volume of literature studies the numerical algorithms for solving min-max problem under the assumption that the objective function is convex in the minimization part while concave in the maximization part. Recently, as the use of non-convex loss functions increases in machine learning models, especially in deep neural network, many min-max problems arise in practice where the minimization part is non-convex although the maximization part is still concave. For this type of problems, the existing saddlepoint algorithms for convex problems no longer have a good theoretical guarantee, and solving the min-max problem as a pure minimization problem requires computing the subgradient of the objective function, which further requires solving a non-trivial maximization problem. To contribute to the numerical solvers for non-convex min-max problems, we focus on a class of weaklyconvex-concave min-max problems where the minimization part is weakly convex and the maximization part is concave. This class of problems covers many important applications such as robust learning and AUC maximization problem. We consider two different scenarios: i the min-max objective function involves expectation and ii the objective function is smooth and has a finite-sum structure. In both cases, we propose a proximally guided optimization algorithm where we solve a sequence of a convex-concave min-max subproblems, each of which is constructed by adding a strongly convex term to the original function. Stochastic mirror descent method and stochastic variance-reduced gradient method have been used to solve the subproblem in scenario i and ii, respectively. We analyze the computational complexity of our methods for finding a nearly ǫ-stationary for the original problem under both scenarios. References Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM Symposium on Theory of Computing, STOC 7, 07a. Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning ICML, pp , 07b. Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33nd International Conference on Machine Learning ICML, pp , 06. URL

22 Zeyuan Allen-Zhu and Yang Yuan. Improved svrg for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of The 33rd International Conference on Machine Learning, pp , 06. C. Bennett and R.C. Sharpley. Interpolation of Operators. Pure and Applied Mathematics. Elsevier Science, 988. ISBN URL Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems 30 NIPS, pp Zaiyi Chen, Tianbao Yang, Jinfeng Yi, Bowen Zhou, and Enhong Chen. Universal stagewise learning for non-convex problems with convergence on averaged solutions. CoRR, /abs/ , 08. Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortés. Saddle-point dynamics: Conditions for asymptotic stability of saddle points. SIAM J. Control and Optimization, 55:486 5, 07. Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate ok /4 on weakly convex functions. arxiv preprint arxiv: , 08a. Damek Davis and Dmitriy Drusvyatskiy. Complexity of finding near-stationary points of convex functions stochastically. arxiv preprint arxiv: , 08b. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. arxiv preprint arxiv: , 08c. Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. CoRR, abs/ , 08d. Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate ok /4 on weakly convex functions. CoRR, /abs/ , 08e. Damek Davis and Benjamin Grimmer. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arxiv preprint arxiv: , 07. Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems NIPS, pp , 04. Ofer Dekel and Yoram Singer. Support vector machines on a budget. In NIPS, pp , 006. D. Drusvyatskiy and C. Paquette. Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, Jul 08. Dmitriy Drusvyatskiy. The proximal point method revisited. arxiv preprint arxiv: , 07. Yanbo Fan, Siwei Lyu, Yiming Ying, and Bao-Gang Hu. Learning with average top-k loss. In Advances in Neural Information Processing Systems NIPS, pp , 07. Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 34:34 368, 03.

23 Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 56-:59 99, 06. Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp , 03. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems NIPS, pp. 06 4, 0. Simon Lacoste-Julien, Mark Schmidt, and Francis Bach. A simpler approach to obtaining an O/t convergence rate for the projected stochastic subgradient method. arxiv preprint arxiv:.00, 0. Guanghui Lan and Yu Yang. Accelerated stochastic algorithms for nonconvex finite-sum and multi-block optimization. CoRR, abs/ , 08. Mingrui Liu, Xiaoxuan Zhang, Zaiyi Chen, Xiaoyu Wang, and Tianbao Yang. Fast stochastic auc maximization with o/n-convergence rate. In Proceedings of the 35th International Conference on Machine Learning ICML, 08. Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust m- estimators. The Annals of Statistics, 45: , doi: 0.4/6-AOS47. URL Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/ , 07. URL Hongseok Namkoong and John C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems NIPS, pp. 08 6, 06. Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems NIPS, pp , 07. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 94: , 009. Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pp , 06. Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, and Zaid Harchaoui. Catalyst for gradient-based nonconvex optimization. pp. 0, 08. Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, and Hao Li. Robust optimization over multiple domains. CoRR, abs/ , 08. URL Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczós, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on International Conference on Machine Learning ICML, pp JMLR.org, 06a. 3

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

A Unified Analysis of Stochastic Momentum Methods for Deep Learning A Unified Analysis of Stochastic Momentum Methods for Deep Learning Yan Yan,2, Tianbao Yang 3, Zhe Li 3, Qihang Lin 4, Yi Yang,2 SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

arxiv: v1 [math.oc] 5 Feb 2018

arxiv: v1 [math.oc] 5 Feb 2018 for Convex-Concave Saddle Point Problems without Strong Convexity Simon S. Du * 1 Wei Hu * arxiv:180.01504v1 math.oc] 5 Feb 018 Abstract We consider the convex-concave saddle point problem x max y fx +

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

arxiv: v2 [math.oc] 1 Nov 2017

arxiv: v2 [math.oc] 1 Nov 2017 Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Composite nonlinear models at scale

Composite nonlinear models at scale Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)

More information

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Yi Zhou Department of ECE The Ohio State University zhou.1172@osu.edu Zhe Wang Department of ECE The Ohio State University

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH

SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Under review as a conference paper at ICLR 9 SGD CONVERGES TO GLOAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Anonymous authors Paper under double-blind review ASTRACT Stochastic gradient descent (SGD)

More information

Efficiency of minimizing compositions of convex functions and smooth maps

Efficiency of minimizing compositions of convex functions and smooth maps Efficiency of minimizing compositions of convex functions and smooth maps D. Drusvyatskiy C. Paquette Abstract We consider global efficiency of algorithms for minimizing a sum of a convex function and

More information

BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS

BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS Songtao Lu, Ioannis Tsaknakis, and Mingyi Hong Department of Electrical

More information

arxiv: v1 [stat.ml] 20 May 2017

arxiv: v1 [stat.ml] 20 May 2017 Stochastic Recursive Gradient Algorithm for Nonconvex Optimization arxiv:1705.0761v1 [stat.ml] 0 May 017 Lam M. Nguyen Industrial and Systems Engineering Lehigh University, USA lamnguyen.mltd@gmail.com

More information

arxiv: v1 [math.oc] 9 Oct 2018

arxiv: v1 [math.oc] 9 Oct 2018 Cubic Regularization with Momentum for Nonconvex Optimization Zhe Wang Yi Zhou Yingbin Liang Guanghui Lan Ohio State University Ohio State University zhou.117@osu.edu liang.889@osu.edu Ohio State University

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

constrained convex optimization, level-set technique, first-order methods, com-

constrained convex optimization, level-set technique, first-order methods, com- 3 4 5 6 7 8 9 0 3 4 5 6 7 8 A LEVEL-SET METHOD FOR CONVEX OPTIMIZATION WITH A FEASIBLE SOLUTION PATH QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI Abstract. Large-scale constrained convex optimization

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Fast Stochastic Maximization with O1/n)-Convergence Rate Mingrui Liu 1 Xiaoxuan Zhang 1 Zaiyi Chen Xiaoyu Wang 3 ianbao Yang 1 Abstract In this paper we consider statistical learning with area under ROC

More information

ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD

ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD ON PROXIMAL POINT-TYPE ALGORITHMS FOR WEAKLY CONVEX FUNCTIONS AND THEIR CONNECTION TO THE BACKWARD EULER METHOD TIM HOHEISEL, MAXIME LABORDE, AND ADAM OBERMAN Abstract. In this article we study the connection

More information

arxiv: v3 [cs.lg] 15 Sep 2018

arxiv: v3 [cs.lg] 15 Sep 2018 Asynchronous Stochastic Proximal Methods for onconvex onsmooth Optimization Rui Zhu 1, Di iu 1, Zongpeng Li 1 Department of Electrical and Computer Engineering, University of Alberta School of Computer

More information

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 5, No. 1, pp. 115 19 c 015 Society for Industrial and Applied Mathematics DUALITY BETWEEN SUBGRADIENT AND CONDITIONAL GRADIENT METHODS FRANCIS BACH Abstract. Given a convex optimization

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Stochastic Composition Optimization

Stochastic Composition Optimization Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators

More information

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Jianhui Chen 1, Tianbao Yang 2, Qihang Lin 2, Lijun Zhang 3, and Yi Chang 4 July 18, 2016 Yahoo Research 1, The

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Approximate Second Order Algorithms Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Why Second Order Algorithms? Invariant under affine transformations e.g. stretching a function preserves the convergence

More information

Randomized Smoothing for Stochastic Optimization

Randomized Smoothing for Stochastic Optimization Randomized Smoothing for Stochastic Optimization John Duchi Peter Bartlett Martin Wainwright University of California, Berkeley NIPS Big Learn Workshop, December 2011 Duchi (UC Berkeley) Smoothing and

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

Distirbutional robustness, regularizing variance, and adversaries

Distirbutional robustness, regularizing variance, and adversaries Distirbutional robustness, regularizing variance, and adversaries John Duchi Based on joint work with Hongseok Namkoong and Aman Sinha Stanford University November 2017 Motivation We do not want machine-learned

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Dual Proximal Gradient Method

Dual Proximal Gradient Method Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

More information

Expanding the reach of optimal methods

Expanding the reach of optimal methods Expanding the reach of optimal methods Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with C. Kempton (UW), M. Fazel (UW), A.S. Lewis (Cornell), and S. Roy (UW) BURKAPALOOZA! WCOM

More information

Variance-Reduced and Projection-Free Stochastic Optimization

Variance-Reduced and Projection-Free Stochastic Optimization Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Adaptive restart of accelerated gradient methods under local quadratic growth condition

Adaptive restart of accelerated gradient methods under local quadratic growth condition Adaptive restart of accelerated gradient methods under local quadratic growth condition Olivier Fercoq Zheng Qu September 6, 08 arxiv:709.0300v [math.oc] 7 Sep 07 Abstract By analyzing accelerated proximal

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,

More information

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Characterization of Gradient Dominance and Regularity Conditions for Neural Networks Yi Zhou Ohio State University Yingbin Liang Ohio State University Abstract zhou.1172@osu.edu liang.889@osu.edu The past

More information