arxiv: v1 [cs.lg] 6 Sep 2018


Aryan Mokhtari (aryanm@mit.edu), Asuman Ozdaglar (asuman@mit.edu), Ali Jadbabaie (jadbabai@mit.edu)
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract

In this paper, we focus on escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. To be more precise, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho < 1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second-order stationary point (SOSP) in at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall arithmetic operations to reach an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.

1. Introduction

There has been a recent revival of interest in nonconvex optimization, due to obvious applications in machine learning. While the modern history of the subject goes back six or seven decades, the recent attention to the topic stems from new applications as well as the availability of modern analytical and computational tools, providing a new perspective on classical problems. Following this trend, in this paper we focus on the problem of minimizing a smooth nonconvex function over a convex set as follows:

$$\text{minimize } f(x), \quad \text{subject to } x \in \mathcal{C}, \tag{1}$$

where $x \in \mathbb{R}^d$ is the decision variable, $\mathcal{C} \subseteq \mathbb{R}^d$ is a closed convex set, and $f : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function over $\mathcal{C}$.
It is well known that finding the global minimum of Problem (1) is hard. Equally well known is the fact that for certain nonconvex problems, all local minimizers are global. These include, for example, matrix completion (Ge et al., 2016), phase retrieval (Sun et al., 2016), and dictionary learning (Sun et al., 2017). For such problems, finding the global minimum of (1) reduces to the problem of finding one of its local minima. Given the well-known hardness results in finding stationary points, recent focus has shifted to characterizing approximate stationary points. When the objective function $f$ is convex, finding an $\epsilon$-first-order stationary point is often sufficient since it leads to finding an approximate local (and hence global) minimum. However, in the nonconvex setting, even when the problem is unconstrained, i.e., $\mathcal{C} = \mathbb{R}^d$, convergence to a first-order stationary point (FOSP) is not enough, as the critical point to which convergence is established might be a saddle point. It is therefore natural to look at higher-order derivatives and search for a second-order stationary point. Indeed, under the assumption that all the saddle points are strict (formally defined later), in both unconstrained and constrained settings, convergence to a second-order stationary point (SOSP) implies convergence to a local minimum. While convergence to an SOSP has been thoroughly investigated in the recent literature for the unconstrained setting, the iteration complexity of the convex-constrained setting has not been studied yet.

Contributions. Our main contribution in this paper is to propose a generic framework which generates a sequence of iterates converging to an approximate second-order stationary point for the constrained nonconvex problem in (1), when the convex set $\mathcal{C}$ has a specific structure that allows for approximate minimization of a quadratic loss over the feasible set. The proposed framework consists of two main stages: first, it utilizes first-order information to reach a first-order stationary point; next, it incorporates second-order information to escape from a stationary point if it is a local maximizer or a strict saddle point. We show that the proposed approach leads to an $(\epsilon,\gamma)$-second-order stationary point (SOSP) for Problem (1) (see Definition 3). The proposed approach utilizes advances in constant-factor optimization of nonconvex quadratic programs (Vavasis, 1991; Ye, 1992; Fu et al., 1998; Nemirovski et al., 1999) that find a $\rho$-approximate solution over $\mathcal{C}$ in polynomial time, where $\rho$ is a positive constant smaller than 1 that depends on the structure of $\mathcal{C}$. When such an approximate solution exists, the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-SOSP of Problem (1) in at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We show that linear and quadratic constraints satisfy the required condition for the convex set $\mathcal{C}$.
In particular, for the case that $\mathcal{C}$ is defined as a set of $m$ quadratic constraints, we can achieve an $(\epsilon,\gamma)$-SOSP after at most $O(\max\{\tau\epsilon^{-2}, d^3 m^7 \gamma^{-3}\})$ arithmetic operations, where $d$ is the dimension of the problem and $\tau$ is the number of arithmetic operations required to solve a linear program over $\mathcal{C}$ or to project a point onto $\mathcal{C}$. We further extend our results to the stochastic setting and show that we can reach an $(\epsilon,\gamma)$-SOSP after computing at most $O(\max\{\epsilon^{-4}, \epsilon^{-2}\rho^{-4}\gamma^{-4}, \rho^{-7}\gamma^{-7}\})$ stochastic gradients and $O(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$ stochastic Hessians.

1.1 Related work

Unconstrained case. The rich literature on nonconvex optimization provides a plethora of algorithms for reaching stationary points of a smooth unconstrained minimization problem. Convergence to first-order stationary points (FOSP) has been widely studied for both deterministic (Nesterov, 2013; Cartis et al., 2010; Agarwal et al., 2017; Carmon et al., 2016, 2017a,b,c) and stochastic settings (Reddi et al., 2016b,a; Allen Zhu and Hazan, 2016; Lei et al., 2017). Stronger results which indicate convergence to an SOSP are also established. Numerical optimization methods such as trust-region methods (Cartis et al., 2012a; Curtis et al., 2017; Martínez and Raydan, 2017) and cubic regularization algorithms (Nesterov and Polyak, 2006; Cartis et al., 2011a,b) can reach an approximate second-order stationary point in a finite number of iterations; however, the computational complexity of each iteration is typically large due to the cost of solving trust-region or regularized cubic subproblems. Recently, a new line of research has emerged that focuses on the overall computational cost to achieve an SOSP.
These results build on the idea of escaping from strict saddle points by perturbing the iterates with properly chosen noise (Ge et al., 2015; Jin et al., 2017a,b), or by incorporating second-order information and updating the iterates using the eigenvector corresponding to the smallest eigenvalue of the Hessian (Carmon et al., 2016; Allen-Zhu, 2017; Xu and Yang, 2017; Allen-Zhu and Li, 2017; Royer and Wright, 2017; Agarwal et al., 2017; Reddi et al., 2018; Paternain et al., 2017).

Constrained case. Asymptotic convergence to first-order and second-order stationary points for the constrained optimization problem in (1) has been studied in the numerical optimization community (Burke et al., 1990; Conn et al., 1993; Facchinei and Lucidi, 1998; Di Pillo et al., 2005). Recently, finite-time analysis for convergence to an FOSP of the generic smooth constrained problem in (1) has received a lot of attention. In particular, Lacoste-Julien (2016) shows that the sequence of iterates generated by the Frank-Wolfe update converges to an $\epsilon$-FOSP of Problem (1) after $O(\epsilon^{-2})$ iterations. Ghadimi et al. (2016) consider the norm of the gradient mapping as a measure of non-stationarity and show that the projected gradient method has the same complexity of $O(\epsilon^{-2})$. Further, Ghadimi and Lan (2016) prove a similar complexity for the accelerated projected gradient method with a slightly better dependency on the Lipschitz constant of the gradients. Adaptive cubic regularization methods in (Cartis et al., 2012b, 2013, 2015) improve these results using second-order information and obtain an $\epsilon$-FOSP of Problem (1) after at most $O(\epsilon^{-3/2})$ iterations.

Finite-time analysis for convergence to an SOSP has also been studied for linear constraints. To be more precise, Bian et al. (2015) study convergence to an SOSP of (1) when the set $\mathcal{C}$ is a linear constraint of the form $x \ge 0$ and propose a trust-region interior point method that obtains an $(\epsilon, \sqrt{\epsilon})$-SOSP in $O(\epsilon^{-3/2})$ iterations. Haeser et al. (2017) extend their results to the case that the objective function is potentially not differentiable or not twice differentiable on the boundary of the feasible region. Cartis et al. (2017) focus on the general convex constraint case and introduce a trust-region algorithm that requires $O(\epsilon^{-3})$ iterations to obtain an SOSP; however, each iteration of their proposed method requires access to the exact solution of a nonconvex quadratic program (finding its global minimum) which, in general, could be computationally prohibitive. To the best of our knowledge, our paper provides the first finite-time overall computational complexity analysis for reaching an SOSP of Problem (1) with nonlinear constraints.

2. Preliminaries and Definitions

In the case of unconstrained minimization of the objective function $f$, the first-order and second-order necessary conditions for a point $x^*$ to be a local minimum of $f$ are $\nabla f(x^*) = 0_d$ and $\nabla^2 f(x^*) \succeq 0_{d \times d}$, respectively. If a point satisfies these conditions it is called a second-order stationary point (SOSP). If the second condition becomes strict, i.e., $\nabla^2 f(x^*) \succ 0$, then we recover the sufficient conditions for a local minimum.
However, to derive finite-time convergence bounds for achieving an SOSP, these conditions should be relaxed. In other words, the goal should be to find an approximate SOSP where the approximation error can be arbitrarily small. For the case of unconstrained minimization, a point $x^*$ is called an $(\epsilon,\gamma)$-second-order stationary point if it satisfies

$$\|\nabla f(x^*)\| \le \epsilon \quad \text{and} \quad \nabla^2 f(x^*) \succeq -\gamma I_d, \tag{2}$$

where $\epsilon$ and $\gamma$ are arbitrary positive constants. In the following definition we formally define strict saddle points for the unconstrained version of Problem (1), i.e., when $\mathcal{C} = \mathbb{R}^d$.

Definition 1 Consider Problem (1) when $\mathcal{C} = \mathbb{R}^d$. Then, $x$ is called a $\delta$-strict saddle point if it is a saddle point, i.e., $\nabla f(x) = 0$ and $\nabla^2 f(x)$ is indefinite, and the smallest eigenvalue of the Hessian $\nabla^2 f(x)$ evaluated at $x$ is strictly smaller than $-\delta$, i.e., $\lambda_{\min}(\nabla^2 f(x)) < -\delta$.

Using the definitions of a $\delta$-strict saddle and an $(\epsilon,\gamma)$-SOSP, we obtain that an $(\epsilon,\gamma)$-SOSP is a local minimum for the unconstrained optimization problem if all the saddle points are $\delta$-strict and the condition $\gamma \le \delta$ is satisfied. To study the constrained setting, we first state the necessary conditions for a local minimum of Problem (1).

Proposition 2 (Bertsekas (1999)) If $x^* \in \mathcal{C}$ is a local minimum of the function $f$ over the convex set $\mathcal{C}$, then

$$\nabla f(x^*)^\top (x - x^*) \ge 0, \quad \text{for all } x \in \mathcal{C}, \tag{3}$$

$$(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) \ge 0, \quad \text{for all } x \in \mathcal{C} \text{ s.t. } \nabla f(x^*)^\top (x - x^*) = 0. \tag{4}$$

The conditions in (3) and (4) are the first-order and second-order necessary optimality conditions, respectively. By making the inequality in (4) strict, i.e., $(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) > 0$, we recover the sufficient conditions for a local minimum when $\mathcal{C}$ is polyhedral (Bertsekas, 1999). Further, if the inequality in (4) is replaced by $(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) \ge \delta \|x - x^*\|^2$ for some $\delta > 0$, we obtain the sufficient conditions for a local minimum of Problem (1) for any convex constraint $\mathcal{C}$; see (Bertsekas, 1999). If a point $x^*$ satisfies the conditions in (3) and (4) it is an SOSP of Problem (1).

As in the unconstrained setting, the first-order and second-order optimality conditions may not be satisfied in a finite number of iterations, and we focus on finding an approximate SOSP.

Definition 3 Recall the twice continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ and the closed convex set $\mathcal{C} \subseteq \mathbb{R}^d$ introduced in Problem (1). We call $x^* \in \mathcal{C}$ an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) if the following conditions are satisfied:

$$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon, \quad \text{for all } x \in \mathcal{C}, \tag{5}$$

$$(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) \ge -\gamma, \quad \text{for all } x \in \mathcal{C} \text{ s.t. } 0 \ge \nabla f(x^*)^\top (x - x^*) \ge -\epsilon. \tag{6}$$

If a point only satisfies the first condition, we call it an $\epsilon$-first-order stationary point.

To further clarify the conditions in Definition 3, we first elaborate on the conditions in Proposition 2 for stationary points. The condition in (3) ensures that there is no feasible direction that makes the linear term in the Taylor series of $f$ around $x^*$ negative. If there exists such a direction, then by choosing a very small stepsize we can find a feasible point that decreases the objective function value, and, therefore, $x^*$ cannot be a local minimum. The condition in (5) relaxes this requirement and checks that the inner product $\nabla f(x^*)^\top (x - x^*)$ is not too negative for any $x \in \mathcal{C}$. In other words, it ensures that the function value does not decrease by more than $O(\epsilon)$ along any feasible direction. Note that if $\nabla f(x^*)^\top (x - x^*)$ is strictly positive for all $x \in \mathcal{C}$, then in a small neighborhood of $x^*$ any feasible direction increases the function value and hence $x^*$ is a local minimum.
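As a small illustration of the first-order condition (5), the sketch below assumes that $\mathcal{C}$ is a box, a choice made here only because the minimum of a linear function over a box can be computed coordinate by coordinate. The function names and the test objective $f(x) = x_1^2 - x_2^2$ are hypothetical and serve only to show how the quantity $\min_{x \in \mathcal{C}} \nabla f(x^*)^\top (x - x^*)$ is compared with $-\epsilon$.

```python
def fosp_gap_box(grad, x, lower, upper):
    """Return min over the box [lower, upper] of grad^T (y - x)."""
    gap = 0.0
    for g, xi, lo, hi in zip(grad, x, lower, upper):
        # The objective is separable, so each coordinate is minimized on its own:
        # pick the lower bound when g > 0 and the upper bound otherwise.
        yi = lo if g > 0 else hi
        gap += g * (yi - xi)
    return gap


def is_eps_fosp_box(grad, x, lower, upper, eps):
    """Condition (5): grad^T (y - x) >= -eps for every y in the box."""
    return fosp_gap_box(grad, x, lower, upper) >= -eps


def grad_f(x):
    # Gradient of the illustrative objective f(x) = x1^2 - x2^2 (an assumption).
    return [2.0 * x[0], -2.0 * x[1]]


lower, upper = [-1.0, -1.0], [1.0, 1.0]
origin, point = [0.0, 0.0], [0.5, 0.0]
gap_origin = fosp_gap_box(grad_f(origin), origin, lower, upper)
gap_point = fosp_gap_box(grad_f(point), point, lower, upper)
```

In this toy example the origin passes the first-order test even though it is a saddle point of the objective, which is exactly the situation the second-order condition (6) is meant to detect.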
To ensure that $x^*$ satisfies the necessary conditions for a local minimum, among the feasible directions $x - x^*$ that are orthogonal to the gradient, i.e., $\nabla f(x^*)^\top (x - x^*) = 0$, we must ensure that the function value is non-decreasing. The condition in (4) guarantees that among the feasible directions that satisfy $\nabla f(x^*)^\top (x - x^*) = 0$, there is no direction that makes the quadratic term in the Taylor series of $f$ around $x^*$ negative. We emphasize that, since the first-order optimality condition gives $\nabla f(x^*)^\top (x - x^*) \ge 0$ for any $x \in \mathcal{C}$, the quadratic condition in (4) only needs to hold for the directions with $\nabla f(x^*)^\top (x - x^*) = 0$: for those with $\nabla f(x^*)^\top (x - x^*) > 0$ the function value is increasing in a small neighborhood of $x^*$. By the same argument, for the relaxed version of (4) it is not required to check the quadratic condition for points $x \in \mathcal{C}$ satisfying $\nabla f(x^*)^\top (x - x^*) > 0$; however, as we relax the first-order optimality condition, i.e., the inner product $\nabla f(x^*)^\top (x - x^*)$ is allowed to be a small negative value, we need to ensure that the quadratic condition holds for all points with $0 \ge \nabla f(x^*)^\top (x - x^*) \ge -\epsilon$. The conditions in (5) and (6) together ensure that in a neighborhood of $x^*$ with radius $\alpha$ the function value does not decrease by more than $O(\alpha\epsilon)$ through the linear term and by more than $O(\alpha^2\gamma)$ through the quadratic term. We further formally define strict saddle points for the constrained optimization problem in (1).

Definition 4 A point $x^* \in \mathcal{C}$ is a $\delta$-strict saddle point of Problem (1) if (i) for all $x \in \mathcal{C}$ the condition $\nabla f(x^*)^\top (x - x^*) \ge 0$ holds, and (ii) there exists a point $y$ such that

$$(y - x^*)^\top \nabla^2 f(x^*) (y - x^*) < -\delta, \quad y \in \mathcal{C} \quad \text{and} \quad \nabla f(x^*)^\top (y - x^*) = 0. \tag{7}$$

According to Definitions 3 and 4, if all saddle points are $\delta$-strict and $\gamma \le \delta$, any $(\epsilon,\gamma)$-SOSP of Problem (1) is an approximate local minimum. We emphasize that in this paper we do not assume that all saddles are strict to prove convergence to an SOSP.
We formally defined strict saddles only to clarify that, if all the saddles are strict, convergence to an approximate SOSP is equivalent to convergence to an approximate local minimum. Our goal throughout the rest of the paper is to design an algorithm which finds an $(\epsilon,\gamma)$-SOSP of Problem (1). To do so, we first assume the following conditions are satisfied.

Assumption 1 The gradients $\nabla f$ are $L$-Lipschitz continuous over the set $\mathcal{C}$, i.e., for any $x, \tilde{x} \in \mathcal{C}$,

$$\|\nabla f(x) - \nabla f(\tilde{x})\| \le L \|x - \tilde{x}\|. \tag{8}$$

Assumption 2 The Hessians $\nabla^2 f$ are $M$-Lipschitz continuous over the set $\mathcal{C}$, i.e., for any $x, \tilde{x} \in \mathcal{C}$,

$$\|\nabla^2 f(x) - \nabla^2 f(\tilde{x})\| \le M \|x - \tilde{x}\|. \tag{9}$$

Assumption 3 The diameter of the compact convex set $\mathcal{C}$ is upper bounded by a constant $D$, i.e.,

$$\max_{x, \hat{x} \in \mathcal{C}} \|x - \hat{x}\| \le D. \tag{10}$$

Assumptions 1 and 2 ensure that the objective function gradients and Hessians, respectively, are Lipschitz continuous over the convex set $\mathcal{C}$. Assumption 3 guarantees that the distance between any two points in the convex set $\mathcal{C}$ is bounded above by a constant.

3. Main Result

In this section, we introduce a generic framework to reach an $(\epsilon,\gamma)$-SOSP of the nonconvex function $f$ over the convex set $\mathcal{C}$, when $\mathcal{C}$ has a specific structure as we describe below. In particular, we focus on the case when we can solve a quadratic program (QP) of the form

$$\text{minimize } x^\top A x + b^\top x + c \quad \text{subject to } x \in \mathcal{C}, \tag{11}$$

up to a constant factor $\rho \le 1$ with a finite number of arithmetic operations. Here, $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix, $b \in \mathbb{R}^d$ is a vector, and $c \in \mathbb{R}$ is a scalar. To clarify the notion of solving a problem up to a constant factor $\rho$, consider $x^*$ as a global minimizer of (11). Then, we say Problem (11) is solved up to a constant factor $\rho \in (0,1]$ if we have found a feasible solution $\tilde{x} \in \mathcal{C}$ such that

$$x^{*\top} A x^* + b^\top x^* + c \;\le\; \tilde{x}^\top A \tilde{x} + b^\top \tilde{x} + c \;\le\; \rho\,(x^{*\top} A x^* + b^\top x^* + c). \tag{12}$$

Note that here, w.l.o.g., we have assumed that the optimal objective function value $x^{*\top} A x^* + b^\top x^* + c$ is non-positive. A larger constant $\rho$ implies that the approximate solution is more accurate. If $\tilde{x}$ satisfies the condition in (12), we call it a $\rho$-approximate solution of Problem (11). Indeed, if $\rho = 1$ then $\tilde{x}$ is a global minimizer of Problem (11).

In Algorithm 1, we introduce a generic framework that finds an $(\epsilon,\gamma)$-SOSP of Problem (1) whose running time is polynomial in $\epsilon^{-1}$, $\gamma^{-1}$, $\rho^{-1}$ and $d$, when we can find a $\rho$-approximate solution of a quadratic problem of the form (11) in a time that is polynomial in $d$. The proposed scheme consists of two major stages.
In the first phase, as mentioned in Steps 2-4, we use a first-order update, i.e., a gradient-based update, to find an $\epsilon$-FOSP, i.e., we update the decision variable $x$ according to a first-order update until we reach a point $x_t \in \mathcal{C}$ that satisfies the condition

$$\nabla f(x_t)^\top (x - x_t) \ge -\epsilon, \quad \text{for all } x \in \mathcal{C}. \tag{13}$$

In Section 4, we study in detail projected gradient descent and conditional gradient algorithms for the first-order phase of the proposed framework. Interestingly, both of these algorithms require at most $O(\epsilon^{-2})$ iterations to reach an $\epsilon$-first-order stationary point.

The second stage of the proposed scheme uses second-order information of the objective function $f$ to escape from the stationary point if it is a local maximum or a strict saddle point. To be more precise, if we assume that $x_t$ is a feasible point satisfying the condition (13), we then aim to find a descent direction by solving the quadratic program

$$\text{minimize } q(u) := (u - x_t)^\top \nabla^2 f(x_t) (u - x_t) \quad \text{subject to } u \in \mathcal{C}, \;\; -\epsilon \le \nabla f(x_t)^\top (u - x_t) \le 0, \tag{14}$$

up to a constant factor $\rho$, where $\rho \in (0,1]$.

Algorithm 1 Generic framework for escaping saddles in constrained optimization
Require: Stepsize $\sigma > 0$. Initialize $x_0 \in \mathcal{C}$
 1: for $t = 1, 2, \ldots$ do
 2:   if $x_t$ is not an $\epsilon$-first-order stationary point then
 3:     Compute $x_{t+1}$ using first-order information (Frank-Wolfe or projected gradient descent)
 4:   else
 5:     Find $u_t$: a $\rho$-approximate solution of (14)
 6:     if $q(u_t) < -\rho\gamma$ then
 7:       Compute the updated variable $x_{t+1} = (1 - \sigma) x_t + \sigma u_t$
 8:     else
 9:       Return $x_t$ and stop
10:     end if
11:   end if
12: end for

If we define $q(u^*)$ as the optimal objective function value of the program in (14), we focus on the cases in which we can obtain a feasible point $u_t$ which is a $\rho$-approximate solution of Problem (14), i.e., $u_t$ satisfies the constraints in (14) and

$$q(u^*) \le q(u_t) \le \rho\, q(u^*). \tag{15}$$

The problem formulation in (14) can be transformed into the quadratic program in (11); see Section 5 for more details. Note that the constant $\rho$ is independent of $\epsilon$, $\gamma$, and $d$ and only depends on the structure of the convex set $\mathcal{C}$. For instance, if $\mathcal{C}$ is defined in terms of $m$ quadratic constraints one can find a $\rho = m^{-2}$ approximate solution of (14) after at most $\tilde{O}(md^3)$ arithmetic operations (Section 5).

After computing a feasible point $u_t$ satisfying the condition in (15), we check the quadratic objective function value at the point $u_t$, and if the inequality $q(u_t) < -\rho\gamma$ holds, we follow the update

$$x_{t+1} = (1 - \sigma) x_t + \sigma u_t, \tag{16}$$

where $\sigma$ is a positive stepsize. Otherwise, we stop the process and return $x_t$ as an $(\epsilon,\gamma)$-second-order stationary point of Problem (1). To check this claim, note that Algorithm 1 stops if we reach a point $x_t$ that satisfies the first-order stationary condition $\nabla f(x_t)^\top (x - x_t) \ge -\epsilon$, and the objective function value for the $\rho$-approximate solution of the quadratic subproblem is at least $-\rho\gamma$, i.e., $q(u_t) \ge -\rho\gamma$. The second condition together with the fact that $q(u_t)$ satisfies (15) gives $\rho\, q(u^*) \ge q(u_t) \ge -\rho\gamma$, which implies that $q(u^*) \ge -\gamma$.
Therefore, for any $x \in \mathcal{C}$ with $-\epsilon \le \nabla f(x_t)^\top (x - x_t) \le 0$, it holds that

$$(x - x_t)^\top \nabla^2 f(x_t) (x - x_t) \ge -\gamma. \tag{17}$$

These two observations show that the outcome of the proposed framework in Algorithm 1 is an $(\epsilon,\gamma)$-SOSP of Problem (1). It remains to characterize the number of iterations that Algorithm 1 needs to perform before reaching an $(\epsilon,\gamma)$-SOSP, which we formally state in the following theorem.

Theorem 5 Consider the problem in (1). Suppose the conditions in Assumptions 1-3 are satisfied. If in the first-order stage, i.e., Steps 2-4, we use the update of Frank-Wolfe or projected gradient descent, the framework proposed in Algorithm 1 finds an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) after at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

The result in Theorem 5 shows that if the convex constraint $\mathcal{C}$ is such that one can solve the quadratic subproblem in (14) $\rho$-approximately, then the proposed generic framework finds an $(\epsilon,\gamma)$-SOSP of Problem (1) after at most $O(\epsilon^{-2})$ first-order and $O(\rho^{-3}\gamma^{-3})$ second-order updates.

To prove the claim in Theorem 5, we first review first-order conditional gradient and projected gradient algorithms and show that if the current iterate is not a first-order stationary point, by following either of these updates the objective function value decreases by a constant of $O(\epsilon^2)$ (Section 4). We then focus on the second stage of Algorithm 1, which corresponds to the case that the current iterate is an $\epsilon$-FOSP and we need to solve the quadratic program in (14) approximately (Section 5). In this case, we show that if the iterate is not an $(\epsilon,\gamma)$-SOSP, by following the update in (16) the objective function value decreases at least by a constant of $O(\rho^3\gamma^3)$. Finally, by combining these two results it can be shown that Algorithm 1 finds an $(\epsilon,\gamma)$-SOSP after at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

4. First-Order Step: Convergence to a First-Order Stationary Point

In this section, we study two different first-order methods for the first stage of Algorithm 1. The result in this section can also be used independently for convergence to an FOSP of Problem (1) satisfying

$$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon, \quad \text{for all } x \in \mathcal{C}, \tag{18}$$

where $\epsilon > 0$ is a positive constant. Although for Algorithm 1 we assume that $\mathcal{C}$ has a specific structure as mentioned in (11), the results in this section hold for any closed and compact convex set $\mathcal{C}$. To keep our result as general as possible, in this section we study both conditional gradient and projection-based methods when they are used in the first stage of the proposed generic framework.

4.1 Conditional gradient update

The conditional gradient (Frank-Wolfe) update has two steps. We first solve the linear program

$$v_t = \operatorname*{argmax}_{v \in \mathcal{C}} \{-\nabla f(x_t)^\top v\}. \tag{19}$$

Then, we compute the updated variable $x_{t+1}$ according to the update

$$x_{t+1} = (1 - \eta) x_t + \eta v_t, \tag{20}$$

where $\eta$ is a stepsize. In the following proposition, we show that if the current iterate is not an $\epsilon$-first-order stationary point, then by updating the variable according to (19)-(20) the objective function value decreases. The proof of the following proposition is adapted from (Lacoste-Julien, 2016).

Proposition 6 Consider the optimization problem in (1). Suppose Assumptions 1 and 3 hold. Set the stepsize in (20) to $\eta = \epsilon/(D^2 L)$.
Then, if the iterate $x_t$ at step $t$ is not an $\epsilon$-first-order stationary point, the objective function value at the updated variable $x_{t+1}$ satisfies the inequality

$$f(x_{t+1}) \le f(x_t) - \frac{\epsilon^2}{2 D^2 L}. \tag{21}$$

The result in Proposition 6 shows that by following the update of the conditional gradient method the objective function value decreases by $O(\epsilon^2)$, if an $\epsilon$-FOSP is not achieved. As a consequence of this result, the FW algorithm reaches an $\epsilon$-first-order stationary point after at most $O(\epsilon^{-2})$ iterations, or equivalently, after solving at most $O(\epsilon^{-2})$ linear programs.

Remark 7 In step 3 of Algorithm 1 we first check if $x_t$ is an $\epsilon$-FOSP. This can be done by evaluating

$$-\min_{x \in \mathcal{C}} \{\nabla f(x_t)^\top (x - x_t)\} = \max_{x \in \mathcal{C}} \{-\nabla f(x_t)^\top x\} + \nabla f(x_t)^\top x_t \tag{22}$$

and comparing the optimal value with $\epsilon$. Note that the linear program in (22) is the same as the one in (19). Therefore, by checking the first-order optimality condition of $x_t$, the variable $v_t$ is already computed, and we need to solve only one linear program per iteration.
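The conditional gradient step (19)-(20), together with the stationarity check of Remark 7, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the set $\mathcal{C}$ is taken to be a box so that the linear program has a closed-form solution, and the function names, the test objective $f(x) = x_1^2 - x_2^2$ and the constant $L = 2$ are hypothetical choices for this example.

```python
def frank_wolfe_step_box(grad, x, lower, upper, eps, L):
    """One conditional gradient step on a box; returns (next point, is_fosp)."""
    # Linear oracle (19): v maximizes -grad^T v over the box, coordinate-wise.
    v = [lo if g > 0 else hi for g, lo, hi in zip(grad, lower, upper)]
    # Remark 7: the same v gives min over the box of grad^T (y - x).
    gap = sum(g * (vi - xi) for g, vi, xi in zip(grad, v, x))
    if gap >= -eps:
        return x, True  # x already satisfies the eps-FOSP condition (18)
    # Squared diameter of the box, used in the stepsize eta = eps / (D^2 L).
    d_sq = sum((hi - lo) ** 2 for lo, hi in zip(lower, upper))
    eta = eps / (d_sq * L)
    # Update (20): convex combination of x and v, so the iterate stays feasible.
    x_next = [(1.0 - eta) * xi + eta * vi for xi, vi in zip(x, v)]
    return x_next, False


def f(x):
    # Illustrative objective (an assumption); its gradient is 2-Lipschitz.
    return x[0] ** 2 - x[1] ** 2


def grad_f(x):
    return [2.0 * x[0], -2.0 * x[1]]


lower, upper = [-1.0, -1.0], [1.0, 1.0]
eps, L = 0.1, 2.0
x0 = [0.5, 0.0]
x1, done = frank_wolfe_step_box(grad_f(x0), x0, lower, upper, eps, L)
decrease = f(x0) - f(x1)
guaranteed = eps ** 2 / (2.0 * 8.0 * L)  # eps^2 / (2 D^2 L) with D^2 = 8
```

On this example the observed decrease is larger than the amount guaranteed by (21), as expected from a worst-case bound.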

4.2 Projected gradient update

The projected gradient descent (PGD) update consists of two steps: (i) descending along the negative gradient direction and (ii) projecting the updated variable onto the convex constraint set. These two steps can be combined and the update can be explicitly written as

$$x_{t+1} = \pi_{\mathcal{C}}\{x_t - \eta \nabla f(x_t)\}, \tag{23}$$

where $\pi_{\mathcal{C}}(\cdot)$ is the Euclidean projection onto the convex set $\mathcal{C}$ and $\eta$ is a positive stepsize. In the following proposition, we first show that by following the update of PGD the objective function value decreases by a constant until we reach an $\epsilon$-FOSP. Further, we show that the number of iterations required for PGD to reach an $\epsilon$-FOSP is $O(\epsilon^{-2})$.

Proposition 8 Consider Problem (1). Suppose Assumptions 1 and 3 are satisfied. Further, assume that the gradients $\nabla f(x)$ are uniformly bounded by $K$ for all $x \in \mathcal{C}$. If the stepsize of the projected gradient descent method defined in (23) is set to $\eta = 1/L$, the objective function value decreases by

$$f(x_{t+1}) \le f(x_t) - \frac{\epsilon^2 L}{2 (K + LD)^2}. \tag{24}$$

Moreover, the iterates reach a first-order stationary point satisfying (18) after at most $O(\epsilon^{-2})$ iterations.

Proposition 8 shows that by following the update of PGD the function value decreases by $O(\epsilon^2)$ until we reach an $\epsilon$-FOSP. It further shows that PGD obtains an $\epsilon$-FOSP satisfying (18) after at most $O(\epsilon^{-2})$ iterations. To the best of our knowledge, this result is also novel, since the only convergence guarantee for PGD in (Ghadimi et al., 2016) is in terms of the number of iterations to reach a point with a gradient mapping norm less than $\epsilon$, while our result characterizes the number of iterations to satisfy (18).

Remark 9 To use the PGD update in the first stage of Algorithm 1 one needs to define a criterion to check whether $x_t$ is an $\epsilon$-FOSP or not. However, in PGD we do not solve the linear program $\min_{x \in \mathcal{C}} \{\nabla f(x_t)^\top (x - x_t)\}$. This issue can be resolved by checking the condition $\|x_t - x_{t+1}\| \le \epsilon/(K + LD)$, which is a sufficient condition for the condition in (18).
In other words, if this condition holds we stop and $x_t$ is an $\epsilon$-FOSP; otherwise, the result in (24) holds and the function value decreases. For more details please check the proof of Proposition 8.

5. Second-Order Step: Escaping Saddle Points

In this section, we study the second stage of the framework in Algorithm 1, which corresponds to the case that the current iterate is an $\epsilon$-FOSP. Note that when we reach a critical point the goal is to find a feasible point $u \in \mathcal{C}$ in the region $-\epsilon \le \nabla f(x_t)^\top (u - x_t) \le 0$ that makes the quadratic form $(u - x_t)^\top \nabla^2 f(x_t) (u - x_t)$ smaller than $-\gamma$. To achieve this goal we need to check the minimum value of this quadratic form over the constraints, i.e., we need to solve the quadratic program in (14) up to a constant factor $\rho \in (0,1]$. In the following proposition, we show that the updated variable according to (16) decreases the objective function value if the condition $q(u_t) < -\rho\gamma$ holds.

Proposition 10 Consider the quadratic program in (14). Let $u_t$ be a $\rho$-approximate solution of the quadratic subproblem in (14). Suppose that Assumptions 2 and 3 hold. Further, set the stepsize to $\sigma = \rho\gamma/(MD^3)$. If the quadratic objective function value $q$ evaluated at $u_t$ satisfies the condition $q(u_t) < -\rho\gamma$, then the updated variable according to (16) satisfies the inequality

$$f(x_{t+1}) \le f(x_t) - \frac{\rho^3\gamma^3}{3 M^2 D^6}. \tag{25}$$
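The decrease in (25) can be checked numerically on a toy instance. The sketch below is only an illustration under assumptions made here: the objective $f(x) = x_1^2 - x_2^2 + x_2^3/3$ (whose Hessian is 2-Lipschitz), the box $[-1,1]^2$ with diameter $D = 2\sqrt{2}$, the values $\rho = \gamma = 1$, and the function names are all hypothetical. At $x_t = 0$ the gradient vanishes, and $u_t = (0, 1)$ minimizes $q$ over the box, so it is a valid solution of (14) with $\rho = 1$.

```python
import math


def escape_step(x, u, rho, gamma, M, D):
    """Update (16) with the stepsize sigma = rho * gamma / (M * D^3)."""
    sigma = rho * gamma / (M * D ** 3)
    x_next = [(1.0 - sigma) * xi + sigma * ui for xi, ui in zip(x, u)]
    return x_next, sigma


def f(x):
    # Illustrative objective (an assumption); Hessian is diag(2, -2 + 2*x2).
    return x[0] ** 2 - x[1] ** 2 + x[1] ** 3 / 3.0


def q(u, x):
    # Quadratic form (u - x)^T Hessian(x) (u - x) for the objective above.
    z0, z1 = u[0] - x[0], u[1] - x[1]
    return 2.0 * z0 ** 2 + (-2.0 + 2.0 * x[1]) * z1 ** 2


rho, gamma, M = 1.0, 1.0, 2.0
D = 2.0 * math.sqrt(2.0)      # diameter of the box [-1, 1]^2
x_t = [0.0, 0.0]              # first-order stationary point (gradient is zero)
u_t = [0.0, 1.0]              # minimizer of q over the box, so rho = 1 is valid

x_next, sigma = escape_step(x_t, u_t, rho, gamma, M, D)
decrease = f(x_t) - f(x_next)
bound = rho ** 3 * gamma ** 3 / (3.0 * M ** 2 * D ** 6)
```

Here $q(u_t) = -2 < -\rho\gamma$, so the escape step is taken, and the observed decrease exceeds the worst-case amount $\rho^3\gamma^3/(3M^2D^6)$.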

The only unanswered question is how to solve the quadratic subproblem in (14) up to a constant factor $\rho \in (0,1]$ in polynomial time. For general $\mathcal{C}$, the quadratic subproblem could be NP-hard (Murty and Kabadi, 1987); however, for some special choices of the convex constraint $\mathcal{C}$, e.g., linear and quadratic constraints, this quadratic program (QP) can be solved either exactly or approximately up to a constant factor. In the following section, we focus on the quadratic constraint case. The discussion on the linear constraint case is available in Section 8.1 in the supplementary material.

5.1 Quadratic constraints case

Consider the case that the convex set $\mathcal{C}$ is the intersection of $m$ ellipsoids, i.e.,

$$\mathcal{C} := \{x \in \mathbb{R}^d \mid x^\top Q_i x + s_i^\top x + r_i \le 0, \text{ for all } i = 1, \ldots, m\}, \tag{26}$$

where $r_i \in \mathbb{R}$, $s_i \in \mathbb{R}^d$, and $Q_i \in \mathbb{S}^d_+$. Under this assumption, the QP in (14) can be written as

$$\min_u \; (u - x_t)^\top \nabla^2 f(x_t) (u - x_t) \quad \text{s.t. } u^\top Q_i u + s_i^\top u + r_i \le 0, \text{ for } i = 1, \ldots, m, \quad -\epsilon \le \nabla f(x_t)^\top (u - x_t) \le 0. \tag{27}$$

Note that the linear constraints $-\epsilon \le \nabla f(x_t)^\top (u - x_t) \le 0$ can be easily handled by writing them in the form of quadratic constraints. To do so, first define a new optimization variable $z := u - x_t$ to obtain

$$\min_z \; z^\top \nabla^2 f(x_t) z \quad \text{s.t. } z^\top Q_i z + \tilde{s}_i^\top z + \tilde{r}_i \le 0, \text{ for } i = 1, \ldots, m, \quad -\epsilon \le \nabla f(x_t)^\top z \le 0, \tag{28}$$

where $\tilde{s}_i = s_i + 2 Q_i x_t$ and $\tilde{r}_i = r_i + s_i^\top x_t + x_t^\top Q_i x_t$. We can simply replace $\nabla f(x_t)^\top z \le 0$ by the quadratic constraint $z^\top Q_{m+1} z + \tilde{s}_{m+1}^\top z + \tilde{r}_{m+1} \le 0$ with parameters $Q_{m+1} = 0$, $\tilde{s}_{m+1} = \nabla f(x_t)$, and $\tilde{r}_{m+1} = 0$. Similarly, we can write $\nabla f(x_t)^\top z \ge -\epsilon$ as the quadratic constraint $z^\top Q_{m+2} z + \tilde{s}_{m+2}^\top z + \tilde{r}_{m+2} \le 0$ with parameters $Q_{m+2} = 0$, $\tilde{s}_{m+2} = -\nabla f(x_t)$, and $\tilde{r}_{m+2} = -\epsilon$. Therefore, the problem in (28) can be written as

$$\min_z \; z^\top \nabla^2 f(x_t) z \quad \text{s.t. } z^\top Q_i z + \tilde{s}_i^\top z + \tilde{r}_i \le 0, \text{ for } i = 1, \ldots, m+2. \tag{29}$$

Note that the matrices $Q_i \in \mathbb{S}^d_+$ are positive semidefinite, while the Hessian $\nabla^2 f(x_t) \in \mathbb{S}^d$ might be indefinite. Indeed, the optimal objective function value of the program in (29) is equal to the optimal objective function value of (27).
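The change of variables $z = u - x_t$ can be verified with a few lines of code. Expanding $(z + x_t)^\top Q_i (z + x_t) + s_i^\top (z + x_t) + r_i$ for a symmetric $Q_i$ gives the shifted linear term $s_i + 2 Q_i x_t$ and the shifted constant $r_i + s_i^\top x_t + x_t^\top Q_i x_t$. The sketch below, in which the helper names and the numerical values are hypothetical, checks that a constraint evaluated at $u$ coincides with the shifted constraint evaluated at $z$.

```python
def mat_vec(Q, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(qij * xj for qij, xj in zip(row, x)) for row in Q]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def quad_constraint(Q, s, r, x):
    """Evaluate x^T Q x + s^T x + r."""
    return dot(x, mat_vec(Q, x)) + dot(s, x) + r


def shift_constraint(Q, s, r, x_t):
    """Parameters of the same constraint in the variable z = u - x_t (Q symmetric)."""
    Qx = mat_vec(Q, x_t)
    s_shift = [si + 2.0 * qxi for si, qxi in zip(s, Qx)]
    r_shift = r + dot(s, x_t) + dot(x_t, Qx)
    return Q, s_shift, r_shift


# Illustrative data (assumptions): one ellipsoidal constraint in two dimensions.
Q = [[2.0, 0.0], [0.0, 1.0]]
s = [1.0, -1.0]
r = -3.0
x_t = [0.5, -0.25]
u = [1.0, 0.5]
z = [ui - xi for ui, xi in zip(u, x_t)]

Q_shift, s_shift, r_shift = shift_constraint(Q, s, r, x_t)
value_in_u = quad_constraint(Q, s, r, u)
value_in_z = quad_constraint(Q_shift, s_shift, r_shift, z)
```

Because the two values agree for every $u$, the feasible sets of (27) and (28) correspond to each other under the shift, which is why the optimal values coincide.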
The program in (29) is a Quadratically Constrained Quadratic Program (QCQP), and one can find an approximate solution for this program as we state in the following proposition.

Proposition 11 (Fu et al. (1998)) Consider Problem (29). There exists a polynomial time method that obtains a $\frac{1-\zeta}{m^2(1+\zeta)^2}$-approximation by at most $O(d^3(m \log(1/\delta) + \log(1/\zeta) + \log d))$ arithmetic operations, where $\delta$ is the ratio of the radius of the largest inscribing sphere over that of the smallest circumscribing sphere of the feasible set.

The result in Proposition 11 indicates that we can solve the QCQP in (29) with the approximation factor $\rho \approx 1/m^2$ for $m \ge 1$. According to this result the complexity of solving the QCQP in (14) is $\tilde{O}(md^3)$ when the constraint $\mathcal{C}$ is defined as $m$ convex quadratic constraints. As the total number of calls to the second-order stage is at most $O(\rho^{-3}\gamma^{-3}) = O(m^6\gamma^{-3})$, we obtain that the total number of arithmetic operations for the second-order stage is at most $\tilde{O}(m^7 d^3 \gamma^{-3})$.

The main idea of the algorithm proposed by Fu et al. (1998) is to approximate the feasible set by an inscribing ellipsoid and minimize the quadratic objective function over this ellipsoid to find a global minimizer $\tilde{x}$, and then use $\tilde{x}$ as an approximate global minimizer for the original QP. If by increasing the radius of the ellipsoid the enlarged ellipsoid contains the feasible set, then using the result in (Ye, 1992) it follows that $\tilde{x}$ is a good approximate solution of the original QCQP.

Remark 12 If $m = 1$, one can also use the S-Procedure (Pólik and Terlaky, 2007) to solve (29) accurately. Further, Nemirovski et al. (1999) showed that if the sum of the matrices $Q_i$ is positive definite and the constraints are homogeneous, i.e., $s_i = 0$, one can find a $\frac{1}{2\ln(2m\mu)}$-approximate solution of (29) by solving its SDP relaxation, where $\mu := \min\{m, \max_i \operatorname{rank}(Q_i)\}$.

6. Stochastic Extension

In this section, we focus on stochastic constrained minimization problems. Consider the optimization problem in (1) when the objective function $f$ is defined as an expectation of a set of stochastic functions $F : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with inputs $x \in \mathbb{R}^d$ and $\Theta \in \mathbb{R}^d$, where $\Theta$ is a random variable with probability distribution $P$. To be more precise, we consider the optimization problem

$$\text{minimize } f(x) := \mathbb{E}[F(x, \Theta)], \quad \text{subject to } x \in \mathcal{C}. \tag{30}$$

Our goal is to find a point which satisfies the necessary optimality conditions with high probability. Consider the vector $d_t = (1/b_g) \sum_{i=1}^{b_g} \nabla F(x_t, \theta_i)$ and the matrix $H_t = (1/b_H) \sum_{i=1}^{b_H} \nabla^2 F(x_t, \theta_i)$ as stochastic approximations of the gradient $\nabla f(x_t)$ and Hessian $\nabla^2 f(x_t)$, respectively. Here $b_g$ and $b_H$ are the gradient and Hessian batch sizes, respectively, and the vectors $\theta_i$ are the realizations of the random variable $\Theta$. In Algorithm 2, we present the stochastic variant of our proposed scheme for finding an $(\epsilon,\gamma)$-SOSP of Problem (30). We only focus on a stochastic variant of the Frank-Wolfe method for simplicity, but a similar stochastic extension can also be shown for the projected gradient descent method.

Algorithm 2
Require: Stepsize $\sigma_t > 0$. Initialize $x_0 \in \mathcal{C}$
 1: for $t = 1, 2, \ldots$ do
 2:   Compute $v_t = \operatorname{argmax}_{v \in \mathcal{C}} \{-d_t^\top v\}$
 3:   if $d_t^\top (v_t - x_t) < -\epsilon$ then
 4:     Compute $x_{t+1} = (1 - \eta) x_t + \eta v_t$
 5:   else
 6:     Find $u_t$: a $\rho$-approximate solution of $\min_u (u - x_t)^\top H_t (u - x_t)$ s.t. $u \in \mathcal{C}$, $-\epsilon - r \le d_t^\top (u - x_t) \le r$
 7:     if $q(u_t) < -\rho\gamma$ then
 8:       Compute the updated variable $x_{t+1} = (1 - \sigma) x_t + \sigma u_t$
 9:     else
10:       Return $x_t$ and stop
11:     end if
12:   end if
13: end for

Algorithm 2 differs from Algorithm 1 in using stochastic gradients $d_t$ and Hessians $H_t$ in lieu of the exact gradients $\nabla f(x_t)$ and Hessians $\nabla^2 f(x_t)$. The second major difference is the inequality constraint in step 6. Here, instead of using the constraint $-\epsilon \le d_t^\top (u - x_t) \le 0$, we need to use $-\epsilon - r \le d_t^\top (u - x_t) \le r$, where $r > 0$ is a properly chosen constant. This modification is needed to ensure that if a point satisfies this constraint it also satisfies $-\epsilon \le \nabla f(x_t)^\top (u - x_t) \le 0$ with high probability. To prove our main result we assume that the following conditions also hold.

Assumption 4 The variances of the stochastic gradients and Hessians are uniformly bounded by constants $\nu^2$ and $\xi^2$, respectively, i.e., for any $x \in \mathcal{C}$ and $\theta$ we can write

$$\mathbb{E}\left[\|\nabla F(x, \theta) - \nabla f(x)\|^2\right] \le \nu^2, \qquad \mathbb{E}\left[\|\nabla^2 F(x, \theta) - \nabla^2 f(x)\|^2\right] \le \xi^2. \tag{31}$$

The conditions in Assumption 4, which require access to unbiased estimates of the gradient and Hessian with bounded variances, are customary in stochastic optimization. In the following theorem, we characterize the iteration complexity of Algorithm 2 to reach an $(\epsilon,\gamma)$-SOSP of Problem (30) with high probability.

Theorem 13 Consider the optimization problem defined in (30). Suppose the conditions in Assumptions 1-4 are satisfied. If the batch sizes are $b_g = O(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ and $b_H = O(\rho^{-2}\gamma^{-2})$ and we set $r = O(\rho^2\gamma^2)$, then the outcome of the proposed framework outlined in Algorithm 2 is an $(\epsilon,\gamma)$-second-order stationary point of Problem (30) with high probability. Further, the total number of iterations to reach such a point is at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.

The result in Theorem 13 indicates that the total number of iterations to reach an $(\epsilon,\gamma)$-SOSP is at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$. As each iteration requires at most $O(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ stochastic gradient and $O(\rho^{-2}\gamma^{-2})$ stochastic Hessian evaluations, the total numbers of stochastic gradient and Hessian computations to reach an $(\epsilon,\gamma)$-SOSP are $O(\max\{\epsilon^{-2}\rho^{-4}\gamma^{-4}, \epsilon^{-4}, \rho^{-7}\gamma^{-7}\})$ and $O(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$, respectively.

7. Conclusion

In this paper, we studied the problem of finding an $(\epsilon,\gamma)$-second-order stationary point (SOSP) of a generic smooth constrained nonconvex minimization problem.
We proposed a procedure that obtains an $(\epsilon,\gamma)$-SOSP after at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations if the constraint set $C$ is such that one can find a $\rho$-approximate solution of a quadratic optimization problem with the constraint set $C$. We showed that our results hold for the case that $C$ is defined by a set of linear constraints or by a finite number of convex quadratic constraints. We further extended our results to the stochastic setting and characterized the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.

8. Appendix

8.1 The linear constraints case

Consider the quadratic program in (14), for the case that the convex constraint set $C$ is defined as $C = \{x \in \mathbb{R}^d \mid Ax \leq b\}$, where the matrix $A \in \mathbb{R}^{m \times d}$ and the vector $b \in \mathbb{R}^m$ are given. In this case, the QP in (14) is equivalent to

$$\min_u \ (u - x_t)^\top \nabla^2 f(x_t)(u - x_t) \quad \text{s.t.} \quad Au \leq b, \quad -\epsilon \leq \nabla f(x_t)^\top (u - x_t) \leq 0. \tag{32}$$

Define a new variable $z := u - x_t$ and the vector $\hat{b} = b - Ax_t$ to rewrite the problem as

$$\min_z \ z^\top \nabla^2 f(x_t) z \quad \text{s.t.} \quad Az \leq \hat{b}, \quad -\epsilon \leq \nabla f(x_t)^\top z \leq 0. \tag{33}$$

Then, we can simply write the extra linear inequalities as $-\nabla f(x_t)^\top z \leq \epsilon$ and $\nabla f(x_t)^\top z \leq 0$. Using these modifications we can write the problem in (33) as

$$\min_y \ y^\top \nabla^2 f(x_t) y \quad \text{s.t.} \quad \hat{A}y \leq \hat{b}, \tag{34}$$

where $\hat{A} \in \mathbb{R}^{(m+2) \times d}$ and $\hat{b} \in \mathbb{R}^{m+2}$. The QP in (34) could in general be NP-hard; however, in polynomial time one can find an approximate solution for this program, as we state in the following proposition.

Proposition 14 Consider the optimization problem in (34). There exists a polynomial time algorithm, based on solving a ball-constrained quadratic problem, to compute a $(1/m^2)$-approximate solution (Vavasis, 1991; Ye, 1992). Further, if the constraint is the polytope $\{y \mid -e \leq y \leq e\}$ for some $e \in \mathbb{R}^d$, there exists a polynomial time algorithm with complexity $O(d^3\log(\log(d)))$ which reaches a $(1/m)$-approximate solution (Fu et al., 1998, Section 4).

8.2 Proof of Proposition 2

The claim in (3) follows from Proposition 2.1.2 in (Bertsekas, 1999). The proof for the claim in (4) is similar to the proof of Proposition 2.1.2 in (Bertsekas, 1999), and we mention it for completeness. We prove the claim in (4) by contradiction. Suppose that $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) < 0$ for some $x \in C$ satisfying $\nabla f(x^*)^\top (x - x^*) = 0$. By the mean value theorem, for any $\epsilon > 0$ there exists an $\alpha \in [0,1]$ such that

$$f(x^* + \epsilon(x - x^*)) = f(x^*) + \epsilon \nabla f(x^*)^\top (x - x^*) + \frac{\epsilon^2}{2}(x - x^*)^\top \nabla^2 f(x^* + \alpha\epsilon(x - x^*))(x - x^*). \tag{35}$$

Use the relation $\nabla f(x^*)^\top (x - x^*) = 0$ to simplify the right hand side to

$$f(x^* + \epsilon(x - x^*)) = f(x^*) + \frac{\epsilon^2}{2}(x - x^*)^\top \nabla^2 f(x^* + \alpha\epsilon(x - x^*))(x - x^*). \tag{36}$$

Note that since $(x - x^*)^\top \nabla^2 f(x^*)(x - x^*) < 0$ and the Hessian is continuous, we have for all sufficiently small $\epsilon > 0$ that $(x - x^*)^\top \nabla^2 f(x^* + \alpha\epsilon(x - x^*))(x - x^*) < 0$. This observation and the expression in (36) imply that for sufficiently small $\epsilon$ we have $f(x^* + \epsilon(x - x^*)) < f(x^*)$. Note that the point $x^* + \epsilon(x - x^*)$ for all $\epsilon \in [0,1]$ belongs to the set $C$ and satisfies $\nabla f(x^*)^\top ((x^* + \epsilon(x - x^*)) - x^*) = 0$. Therefore, we obtain a contradiction of the local optimality of $x^*$.
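As an illustration of the reformulation in Section 8.1, the following minimal sketch stacks the two extra linear inequalities onto the shifted constraint $Az \leq b - Ax_t$, which gives a system with $m+2$ rows as in (34). The function name `stack_linear_constraints` and the small box example are hypothetical and chosen only for illustration; they are not part of the original paper.

```python
import numpy as np

def stack_linear_constraints(A, b, x_t, grad, eps):
    """Return (A_hat, b_hat) such that A_hat @ z <= b_hat encodes
    A z <= b - A x_t,  -grad^T z <= eps,  and  grad^T z <= 0,
    where z = u - x_t (hypothetical helper, illustration only)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x_t = np.asarray(x_t, dtype=float)
    grad = np.asarray(grad, dtype=float)
    b_shift = b - A @ x_t                       # right-hand side after the change of variable
    A_hat = np.vstack([A, -grad[None, :], grad[None, :]])
    b_hat = np.concatenate([b_shift, [eps, 0.0]])
    return A_hat, b_hat

# Assumed toy instance: the box [-1, 1]^2 written as A x <= b (m = 4 rows).
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.ones(4)
x_t = np.array([0.5, 0.0])
grad = np.array([1.0, -1.0])   # stands in for the gradient at x_t
eps = 0.1

A_hat, b_hat = stack_linear_constraints(A, b, x_t, grad, eps)
```

In this toy instance `A_hat` has $m + 2 = 6$ rows, and a direction $z$ is feasible exactly when it keeps $x_t + z$ inside the box and its inner product with the gradient lies in $[-\epsilon, 0]$.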
8.3 Proof of Proposition 6

First consider the definition $G(x_t) = \max_{x \in C}\{-\nabla f(x_t)^\top (x - x_t)\}$, which is also known as the Frank-Wolfe gap (Lacoste-Julien, 2016). This quantity measures how close the point $x_t$ is to being a first-order stationary point. If $G(x_t) \leq \epsilon$, then $x_t$ is an $\epsilon$-first-order stationary point. Let us assume that $G(x_t) > \epsilon$. Then, based on the Lipschitz continuity of gradients and the definition of $G(x_t)$ we can write

$$\begin{aligned} f(x_{t+1}) &\leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\ &= f(x_t) + \eta \nabla f(x_t)^\top (v_t - x_t) + \frac{L\eta^2}{2}\|v_t - x_t\|^2 \\ &\leq f(x_t) - \eta G(x_t) + \frac{\eta^2 D^2 L}{2}, \end{aligned} \tag{37}$$

where the last inequality follows from $\|v_t - x_t\| \leq D$. Replacing the stepsize $\eta$ by its value $\epsilon/(D^2 L)$ and $G(x_t)$ by its lower bound $\epsilon$ leads to

$$f(x_{t+1}) \leq f(x_t) - \frac{\epsilon^2}{2D^2 L}. \tag{38}$$

This result implies that if the current point $x_t$ is not an $\epsilon$-first-order stationary point, then by following the update of the Frank-Wolfe algorithm the objective function value decreases by at least $\epsilon^2/(2D^2 L)$. Therefore, after at most $2D^2 L(f(x_0) - f(x^*))/\epsilon^2$ iterations we either reach the global minimum or one of the iterates $x_t$ satisfies $G(x_t) \leq \epsilon$, which implies that

$$\nabla f(x_t)^\top (x - x_t) \geq -\epsilon, \quad \text{for all } x \in C, \tag{39}$$

and the claim in Proposition 6 follows.

8.4 Proof of Proposition 8

First, note that based on the projection property we know that

$$(x_t - \eta \nabla f(x_t) - x_{t+1})^\top (x - x_{t+1}) \leq 0, \quad \text{for all } x \in C. \tag{40}$$

Therefore, by setting $x = x_t$ we obtain that

$$\eta \nabla f(x_t)^\top (x_{t+1} - x_t) \leq -\|x_t - x_{t+1}\|^2. \tag{41}$$

Hence, we can replace the inner product $\nabla f(x_t)^\top (x_{t+1} - x_t)$ by its upper bound $-\|x_t - x_{t+1}\|^2/\eta$ to obtain

$$\begin{aligned} f(x_{t+1}) &\leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\ &\leq f(x_t) - \frac{\|x_t - x_{t+1}\|^2}{\eta} + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\ &= f(x_t) - \frac{L}{2}\|x_{t+1} - x_t\|^2, \end{aligned} \tag{42}$$

where the equality follows by setting $\eta = 1/L$. Indeed, if $x_{t+1} = x_t$ then we are at a first-order stationary point; however, we need a finite time analysis. To do so, note that for any $x \in C$ we have

$$(x_t - \eta \nabla f(x_t) - x_{t+1})^\top (x - x_{t+1}) \leq 0. \tag{43}$$

Therefore, for any $x \in C$ it holds that

$$\nabla f(x_t)^\top (x - x_{t+1}) \geq L(x_t - x_{t+1})^\top (x - x_{t+1}), \tag{44}$$

which implies that

$$\begin{aligned} \nabla f(x_t)^\top (x - x_t) &\geq \nabla f(x_t)^\top (x_{t+1} - x_t) + L(x_t - x_{t+1})^\top (x - x_{t+1}) \\ &\geq -K\|x_{t+1} - x_t\| - LD\|x_t - x_{t+1}\| \\ &\geq -(K + LD)\|x_t - x_{t+1}\|, \end{aligned} \tag{45}$$

where $K$ is an upper bound on the norm of the gradient over the convex set $C$. Therefore, we can write

$$\min_{x \in C} \nabla f(x_t)^\top (x - x_t) \geq -(K + LD)\|x_t - x_{t+1}\|. \tag{46}$$

Combining these results, we obtain that we should compute the norm $\|x_t - x_{t+1}\|$ at each iteration and check whether it is larger than $\epsilon/(K + LD)$ or not. If the norm is larger than the threshold then

$$f(x_{t+1}) \leq f(x_t) - \frac{\epsilon^2 L}{2(K + LD)^2}. \tag{47}$$

If the norm is smaller than the threshold then we stop and the iterate $x_t$ satisfies the inequality

$$\nabla f(x_t)^\top (x - x_t) \geq -\epsilon, \quad \text{for all } x \in C. \tag{48}$$

Note that this process cannot take more than $O\left(\frac{f(x_0) - f(x^*)}{\epsilon^2}\right)$ iterations.

8.5 Proof of Proposition 10

The Taylor expansion of the function $f$ around the point $x_t$ and the $M$-Lipschitz continuity of the Hessians imply that

$$f(x_{t+1}) \leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{1}{2}(x_{t+1} - x_t)^\top \nabla^2 f(x_t)(x_{t+1} - x_t) + \frac{M}{6}\|x_{t+1} - x_t\|^3. \tag{49}$$

Replace $x_{t+1} - x_t$ by the expression $\sigma(u_t - x_t)$ to obtain

$$f(x_{t+1}) \leq f(x_t) + \sigma \nabla f(x_t)^\top (u_t - x_t) + \frac{\sigma^2}{2}(u_t - x_t)^\top \nabla^2 f(x_t)(u_t - x_t) + \frac{M\sigma^3}{6}\|u_t - x_t\|^3. \tag{50}$$

Since $u_t$ is a $\rho$-approximate solution for the subproblem in (14) with the objective function value $q(u_t) \leq -\rho\gamma$, we can substitute the quadratic term $(u_t - x_t)^\top \nabla^2 f(x_t)(u_t - x_t)$ by its upper bound $-\rho\gamma$. Additionally, the vector $u_t$ is chosen such that $\nabla f(x_t)^\top (u_t - x_t) \leq 0$ and therefore the linear term in (50) can be eliminated. Further, the cubic term $\|u_t - x_t\|^3$ is upper bounded by $D^3$ since both $u_t$ and $x_t$ belong to the convex set $C$. Applying these substitutions into (50) yields

$$f(x_{t+1}) \leq f(x_t) - \frac{\sigma^2 \rho\gamma}{2} + \frac{\sigma^3 M D^3}{6}. \tag{51}$$

By setting $\sigma = \rho\gamma/(MD^3)$ in (51) it follows that

$$f(x_{t+1}) \leq f(x_t) - \frac{\rho^3\gamma^3}{2M^2 D^6} + \frac{\rho^3\gamma^3}{6M^2 D^6} = f(x_t) - \frac{\rho^3\gamma^3}{3M^2 D^6}. \tag{52}$$

Therefore, in this case, the objective function value decreases at least by a fixed value of $O(\rho^3\gamma^3)$.

8.6 Proof of Theorem 5

At each iteration, either the first-order optimality condition is not satisfied and the function value decreases by a constant of $O(\epsilon^2)$, or this condition is satisfied and we use a second-order update which leads to an objective function value decrement of $O(\rho^3\gamma^3)$. This shows that if the algorithm has not reached an $(\epsilon,\gamma)$-second-order stationary point, the objective function value decreases at least by $O(\min\{\epsilon^2, \rho^3\gamma^3\})$. Therefore, we either reach the global minimum or converge to an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) after at most $O\left(\frac{f(x_0) - f(x^*)}{\min\{\epsilon^2, \rho^3\gamma^3\}}\right)$ iterations, which can also be written as $O((f(x_0) - f(x^*))\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$.
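The first-order stage analyzed in Section 8.3 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `frank_wolfe_stage`, the toy objective $f(x) = \frac{1}{2}x^\top Q x$ with $Q = \mathrm{diag}(1,-1)$, and the box $[-1,1]^2$ are all hypothetical choices. The sketch uses the stepsize $\eta = \epsilon/(D^2 L)$ and stops once the Frank-Wolfe gap $G(x_t)$ is at most $\epsilon$.

```python
import numpy as np

def frank_wolfe_stage(grad_f, lmo, x0, eps, L, D, max_iter=100_000):
    """Run Frank-Wolfe steps with stepsize eps / (D^2 L) until the
    Frank-Wolfe gap G(x) = max_{v in C} -grad_f(x)^T (v - x) is <= eps.
    Returns (x, gap, number of updates performed)."""
    x = np.asarray(x0, dtype=float)
    eta = eps / (D ** 2 * L)
    for t in range(max_iter):
        g = grad_f(x)
        v = lmo(g)                      # linear minimization oracle: argmin_{v in C} g^T v
        gap = float(g @ (x - v))        # Frank-Wolfe gap at x
        if gap <= eps:
            return x, gap, t
        x = (1.0 - eta) * x + eta * v   # convex combination keeps x inside C
    g = grad_f(x)
    return x, float(g @ (x - lmo(g))), max_iter

# Assumed toy problem: nonconvex quadratic over the box [-1, 1]^2.
Q = np.diag([1.0, -1.0])
grad_f = lambda x: Q @ x
lmo = lambda g: -np.sign(g)             # minimizes g^T v over the box
L_smooth = 1.0                          # spectral norm of Q
D = 2.0 * np.sqrt(2.0)                  # diameter of the box
eps = 0.05

x_out, gap_out, iters = frank_wolfe_stage(grad_f, lmo, [0.9, 0.3], eps, L_smooth, D)
```

For this toy problem $f(x_0) = 0.36$ and the minimum of $f$ over the box is $-0.5$, so the bound $2D^2L(f(x_0) - f(x^*))/\epsilon^2$ from Section 8.3 evaluates to $5504$ iterations; the sketch is expected to stop well within that budget.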

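8.7 Proof of Theorem 13

In this proof, for notational convenience, we define $\epsilon' = \epsilon/2$ and $\gamma' = \gamma/2$. First, note that the condition in Assumption 4 and the fact that $\nabla F(x,\theta)$ and $\nabla^2 F(x,\theta)$ are unbiased estimators of the gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)$ imply that the variances of the batch gradient $d_t$ and the batch Hessian $H_t$ approximations are upper bounded by

$$\mathbb{E}\left[\|d_t - \nabla f(x_t)\|^2\right] \leq \frac{\nu^2}{b_g}, \qquad \mathbb{E}\left[\|H_t - \nabla^2 f(x_t)\|^2\right] \leq \frac{\xi^2}{b_H}. \tag{53}$$

Here we assume that $b_g$ and $b_H$ satisfy the following conditions,

$$b_g = \max\left\{\frac{324\nu^2 M^2 D^8}{\rho^4\gamma'^4}, \frac{16D^2\nu^2}{\epsilon'^2}\right\}, \qquad b_H = \frac{81D^4\xi^2}{\rho^2\gamma'^2}. \tag{54}$$

We further set the parameter $r$ as

$$r = \frac{\rho^2\gamma'^2}{18MD^3}. \tag{55}$$

Now we proceed to analyze the complexity of Algorithm 2. First, consider the case that the current iterate $x_t$ satisfies the inequality $d_t^\top (v_t - x_t) < -\epsilon'$ and therefore we perform the first-order update in step 4. In this case, we can show that

$$\begin{aligned} f(x_{t+1}) &\leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\ &= f(x_t) + \eta \nabla f(x_t)^\top (v_t - x_t) + \frac{\eta^2 L}{2}\|v_t - x_t\|^2 \\ &\leq f(x_t) + \eta d_t^\top (v_t - x_t) + \eta(\nabla f(x_t) - d_t)^\top (v_t - x_t) + \frac{\eta^2 L D^2}{2} \\ &\leq f(x_t) - \eta\epsilon' + \eta D\|\nabla f(x_t) - d_t\| + \frac{\eta^2 L D^2}{2}, \end{aligned} \tag{56}$$

where in the last inequality we used $d_t^\top (v_t - x_t) < -\epsilon'$ and the fact that both $v_t$ and $x_t$ belong to the set $C$ and therefore $\|x_t - v_t\| \leq D$. Consider $\mathcal{F}_t$ as the sigma algebra that measures all sources of randomness up to step $t$. Then, computing the expected value of both sides of (56) given $\mathcal{F}_t$ leads to

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \leq f(x_t) - \eta\epsilon' + \frac{\eta D\nu}{\sqrt{b_g}} + \frac{\eta^2 L D^2}{2}, \tag{57}$$

where we used the inequality $\mathbb{E}[X] \leq \sqrt{\mathbb{E}[X^2]}$ when $X$ is a positive random variable. Replace the stepsize $\eta$ by its value $\epsilon'/(D^2 L)$ and the batch size $b_g$ by its lower bound $16D^2\nu^2/\epsilon'^2$ to obtain

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \leq f(x_t) - \frac{\epsilon'^2}{4D^2 L}. \tag{58}$$

Hence, in this case, the objective function value decreases in expectation by a constant of $O(\epsilon'^2)$.

Now we proceed to study the case that the current iterate $x_t$ does not satisfy the inequality $d_t^\top (v_t - x_t) < -\epsilon'$ and we need to perform the second-order update in step 8. In this case, we can show that

$$\begin{aligned} f(x_{t+1}) &\leq f(x_t) + \nabla f(x_t)^\top (x_{t+1} - x_t) + \frac{1}{2}(x_{t+1} - x_t)^\top \nabla^2 f(x_t)(x_{t+1} - x_t) + \frac{M}{6}\|x_{t+1} - x_t\|^3 \\ &\leq f(x_t) + \sigma \nabla f(x_t)^\top (u_t - x_t) + \frac{\sigma^2}{2}(u_t - x_t)^\top \nabla^2 f(x_t)(u_t - x_t) + \frac{\sigma^3 M D^3}{6} \\ &\leq f(x_t) + \sigma d_t^\top (u_t - x_t) + \sigma(\nabla f(x_t) - d_t)^\top (u_t - x_t) + \frac{\sigma^2}{2}(u_t - x_t)^\top H_t (u_t - x_t) \\ &\quad + \frac{\sigma^2}{2}(u_t - x_t)^\top (\nabla^2 f(x_t) - H_t)(u_t - x_t) + \frac{\sigma^3 M D^3}{6}. \end{aligned} \tag{59}$$

Note that $u_t$ is a $\rho$-approximate solution for the subproblem in step 6 of Algorithm 2, with the objective function value less than $-\rho\gamma'$. This observation implies that the quadratic term $(u_t - x_t)^\top H_t (u_t - x_t)$ is bounded above by $-\rho\gamma'$. Further, the linear term $d_t^\top (u_t - x_t)$ is less than $r$ according to the constraint of the subproblem. Applying these substitutions and using the Cauchy-Schwarz inequality multiple times leads to

$$f(x_{t+1}) \leq f(x_t) + \sigma r + \sigma D\|d_t - \nabla f(x_t)\| - \frac{\sigma^2 \rho\gamma'}{2} + \frac{\sigma^2 D^2}{2}\|H_t - \nabla^2 f(x_t)\| + \frac{\sigma^3 M D^3}{6}. \tag{60}$$

Compute the conditional expected value of both sides of (60) and use the inequalities in (53) to obtain

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \leq f(x_t) + \sigma r + \frac{\sigma D\nu}{\sqrt{b_g}} - \frac{\sigma^2 \rho\gamma'}{2} + \frac{\sigma^2 D^2 \xi}{2\sqrt{b_H}} + \frac{\sigma^3 M D^3}{6}. \tag{61}$$

By setting the stepsize $\sigma = \rho\gamma'/(MD^3)$ in (61) it follows that

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \leq f(x_t) - \frac{\rho^3\gamma'^3}{3M^2 D^6} + \frac{r\rho\gamma'}{MD^3} + \frac{\rho\gamma'\nu}{MD^2\sqrt{b_g}} + \frac{\rho^2\gamma'^2\xi}{2M^2 D^4\sqrt{b_H}}. \tag{62}$$

Moreover, setting $r = \frac{\rho^2\gamma'^2}{18MD^3}$ and $b_H = \frac{81D^4\xi^2}{\rho^2\gamma'^2}$, and replacing $b_g$ by its lower bound $\frac{324\nu^2 M^2 D^8}{\rho^4\gamma'^4}$, makes each of the last three terms in (62) at most $\frac{\rho^3\gamma'^3}{18M^2 D^6}$, which leads to

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \leq f(x_t) - \frac{\rho^3\gamma'^3}{6M^2 D^6}. \tag{63}$$

Hence, in this case, the expected objective function value decreases by a constant of $O(\rho^3\gamma'^3)$. By combining the results in (58) and (63), we obtain that if the iterate $x_t$ is not the final iterate, the objective function value at step $t+1$ satisfies the following inequality

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \leq f(x_t) - \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2 D^6}\right\}. \tag{64}$$

Let us define $T$ as the number of iterations we perform until Algorithm 2 stops. We use an argument similar to Wald's lemma to derive an upper bound on the expected number of iterations $T$ that we need to run the algorithm. Note that

$$\begin{aligned} \mathbb{E}[f(x_0) - f(x_T)] &= \mathbb{E}\left[\sum_{t=1}^{T}(f(x_{t-1}) - f(x_t))\right] \\ &= \mathbb{E}\left[\mathbb{E}\left[\sum_{t=1}^{T}(f(x_{t-1}) - f(x_t)) \,\Big|\, T\right]\right] \\ &= \sum_{k=1}^{\infty} \mathbb{E}\left[\sum_{t=1}^{k}(f(x_{t-1}) - f(x_t))\right] P(T = k) \\ &= \sum_{k=1}^{\infty}\sum_{t=1}^{k} \mathbb{E}[f(x_{t-1}) - f(x_t)]\, P(T = k) \\ &\geq \sum_{k=1}^{\infty}\sum_{t=1}^{k} \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2 D^6}\right\} P(T = k) \\ &= \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2 D^6}\right\} \sum_{k=1}^{\infty} k\, P(T = k) \\ &= \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2 D^6}\right\} \mathbb{E}[T]. \end{aligned} \tag{65}$$

The first equality holds by simplifying the sum, for the second equality we use the fact that $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$, in the third equality we use the expression $\mathbb{E}[\mathbb{E}[X \mid Y]] = \sum_y \mathbb{E}[X \mid Y = y]P(Y = y)$, in the fourth equality we exchange sum and expectation, and the inequality is true based on the result in (64). Note that to derive this result we have also assumed that the function differences $f(x_{t-1}) - f(x_t)$ are independent of each other and also independent of the total number of iterations $T$.

Based on the result in (65), we can write that $\mathbb{E}[T] \leq \mathbb{E}[f(x_0) - f(x_T)] / \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2 D^6}\right\}$. We further know that $f(x_T) \geq f(x^*)$, which implies that

$$\mathbb{E}[T] \leq (f(x_0) - f(x^*)) \max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2 D^6}{\rho^3\gamma'^3}\right\}. \tag{66}$$

Using Markov's inequality we can show that

$$P(T \leq a) \geq 1 - \frac{(f(x_0) - f(x^*)) \max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2 D^6}{\rho^3\gamma'^3}\right\}}{a}. \tag{67}$$

Set $a = \frac{f(x_0) - f(x^*)}{\delta} \max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2 D^6}{\rho^3\gamma'^3}\right\}$ to obtain that

$$P\left(T \leq \frac{f(x_0) - f(x^*)}{\delta} \max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2 D^6}{\rho^3\gamma'^3}\right\}\right) \geq 1 - \delta. \tag{68}$$

Therefore, it follows that with high probability the total number of iterations $T$ that Algorithm 2 runs is at most $O(\max\{\epsilon'^{-2}, \rho^{-3}\gamma'^{-3}\})$. As $\epsilon' = \epsilon/2$ and $\gamma' = \gamma/2$, we obtain that the overall number of iterations is at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.

Now it remains to show that the outcome of Algorithm 2 is an $(\epsilon,\gamma)$-SOSP of Problem (30) with high probability. Let us assume that $x_t$ is the final output of Algorithm 2. Then, we know that $x_t$ satisfies the conditions

$$d_t^\top (x - x_t) \geq -\epsilon' \quad \text{for all } x \in C, \tag{69}$$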

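and

$$(x - x_t)^\top H_t (x - x_t) \geq -\gamma' \quad \text{for all } x \in C \text{ with } -\epsilon - r \leq d_t^\top (x - x_t) \leq r. \tag{70}$$

First, we use the condition in (69) to show that $x_t$ satisfies the first-order optimality condition with high probability. Note that for any $x \in C$ it holds that

$$\nabla f(x_t)^\top (x - x_t) = d_t^\top (x - x_t) + (\nabla f(x_t) - d_t)^\top (x - x_t) \geq d_t^\top (x - x_t) - D\|\nabla f(x_t) - d_t\|. \tag{71}$$

Now compute the minimum of both sides of (71) over all $x \in C$ to obtain

$$\begin{aligned} \min_{x \in C}\{\nabla f(x_t)^\top (x - x_t)\} &\geq \min_{x \in C}\{d_t^\top (x - x_t) - D\|\nabla f(x_t) - d_t\|\} \\ &= \min_{x \in C}\{d_t^\top (x - x_t)\} - D\|\nabla f(x_t) - d_t\| \\ &\geq -\epsilon' - D\|\nabla f(x_t) - d_t\|, \end{aligned} \tag{72}$$

where the equality holds since $D\|\nabla f(x_t) - d_t\|$ does not depend on $x$, and the last inequality is implied by (69). Since $\mathbb{E}\left[\|\nabla f(x_t) - d_t\|^2\right] \leq \nu^2/b_g$, we obtain from Markov's inequality that, for any $\epsilon'' > 0$,

$$P\left(\|\nabla f(x_t) - d_t\| \leq \epsilon''\right) \geq 1 - \frac{\nu^2}{b_g \epsilon''^2}. \tag{73}$$

Therefore, by combining the results in (72) and (73) we obtain that

$$P\left(\min_{x \in C}\{\nabla f(x_t)^\top (x - x_t)\} \geq -(\epsilon' + D\epsilon'')\right) \geq 1 - \frac{\nu^2}{b_g \epsilon''^2}. \tag{74}$$

Now by setting $\epsilon'' = \epsilon'/D$ it follows from (74) that with probability at least $1 - \nu^2 D^2/(b_g \epsilon'^2)$ the final iterate $x_t$ satisfies

$$\nabla f(x_t)^\top (x - x_t) \geq -2\epsilon' \quad \text{for all } x \in C. \tag{75}$$

Replacing $\epsilon'$ by $\epsilon/2$ leads to

$$\nabla f(x_t)^\top (x - x_t) \geq -\epsilon \quad \text{for all } x \in C. \tag{76}$$

It remains to show that with high probability the final iterate $x_t$ satisfies the second-order optimality condition. First, consider the sets

$$A_t = \{x \in C \mid -\epsilon \leq \nabla f(x_t)^\top (x - x_t) \leq 0\} \quad \text{and} \quad B_t = \{x \in C \mid -\epsilon - r \leq d_t^\top (x - x_t) \leq r\}. \tag{77}$$

We proceed to show that with high probability $A_t \subseteq B_t$. If $y$ satisfies the condition

$$-\epsilon \leq \nabla f(x_t)^\top (y - x_t) \leq 0, \tag{78}$$

then it can be shown that

$$d_t^\top (y - x_t) = \nabla f(x_t)^\top (y - x_t) + (d_t - \nabla f(x_t))^\top (y - x_t) \leq D\|d_t - \nabla f(x_t)\|. \tag{79}$$

Since $\mathbb{E}\left[\|\nabla f(x_t) - d_t\|^2\right] \leq \nu^2/b_g$, we obtain from Markov's inequality that

$$P\left(\|\nabla f(x_t) - d_t\| \leq \frac{r}{D}\right) \geq 1 - \frac{\nu^2 D^2}{b_g r^2}. \tag{80}$$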
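The Markov-inequality steps in (73) and (80) rest on the batch variance bound in (53). As a rough numerical sanity check, the sketch below simulates mini-batch gradients under an assumed Gaussian noise model whose per-sample noise has expected squared norm exactly $\nu^2$, and compares the empirical quantities with the two bounds. The noise model, the helper name `batch_gradient_errors`, and all numerical values are assumptions made only for illustration.

```python
import numpy as np

def batch_gradient_errors(true_grad, nu, batch_size, trials, rng):
    """Simulate `trials` mini-batch gradients d_t and return ||d_t - true_grad||.
    Per-sample noise is Gaussian with E||noise||^2 = nu^2 (assumed model)."""
    d = true_grad.shape[0]
    noise = rng.normal(scale=nu / np.sqrt(d), size=(trials, batch_size, d))
    d_t = true_grad + noise.mean(axis=1)          # average over the batch
    return np.linalg.norm(d_t - true_grad, axis=1)

rng = np.random.default_rng(0)
nu, b_g, threshold = 1.0, 64, 0.25                # threshold plays the role of r / D

errors = batch_gradient_errors(np.ones(5), nu, b_g, trials=5000, rng=rng)

mean_sq_error = float(np.mean(errors ** 2))       # empirical E||d_t - grad||^2
variance_bound = nu ** 2 / b_g                    # right-hand side of (53)
failure_rate = float(np.mean(errors > threshold)) # empirical P(||d_t - grad|| > r/D)
markov_bound = nu ** 2 / (b_g * threshold ** 2)   # nu^2 D^2 / (b_g r^2) as in (80)
```

Under this noise model the empirical mean squared error should be close to $\nu^2/b_g$, and the empirical failure rate should not exceed the Markov bound, which is typically quite loose.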

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

arxiv: v1 [math.oc] 9 Oct 2018

arxiv: v1 [math.oc] 9 Oct 2018 Cubic Regularization with Momentum for Nonconvex Optimization Zhe Wang Yi Zhou Yingbin Liang Guanghui Lan Ohio State University Ohio State University zhou.117@osu.edu liang.889@osu.edu Ohio State University

More information

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic

More information

Mini-Course 1: SGD Escapes Saddle Points

Mini-Course 1: SGD Escapes Saddle Points Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: min x f (x) GD does iterative updates x t+1 = x t η t f (x t ) Gradient Descent

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Yi Zhou Department of ECE The Ohio State University zhou.1172@osu.edu Zhe Wang Department of ECE The Ohio State University

More information

SVRG Escapes Saddle Points

SVRG Escapes Saddle Points DUKE UNIVERSITY SVRG Escapes Saddle Points by Weiyao Wang A thesis submitted to in partial fulfillment of the requirements for graduating with distinction in the Department of Computer Science degree of

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

On the fast convergence of random perturbations of the gradient flow.

On the fast convergence of random perturbations of the gradient flow. On the fast convergence of random perturbations of the gradient flow. Wenqing Hu. 1 (Joint work with Chris Junchi Li 2.) 1. Department of Mathematics and Statistics, Missouri S&T. 2. Department of Operations

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Second-Order Methods for Stochastic Optimization

Second-Order Methods for Stochastic Optimization Second-Order Methods for Stochastic Optimization Frank E. Curtis, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

The Trust Region Subproblem with Non-Intersecting Linear Constraints

The Trust Region Subproblem with Non-Intersecting Linear Constraints The Trust Region Subproblem with Non-Intersecting Linear Constraints Samuel Burer Boshi Yang February 21, 2013 Abstract This paper studies an extended trust region subproblem (etrs in which the trust region

More information

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Clément Royer - University of Wisconsin-Madison Joint work with Stephen J. Wright MOPTA, Bethlehem,

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué

Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Optimisation non convexe avec garanties de complexité via Newton+gradient conjugué Clément Royer (Université du Wisconsin-Madison, États-Unis) Toulouse, 8 janvier 2019 Nonconvex optimization via Newton-CG

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization E5295/5B5749 Convex optimization with engineering applications Lecture 8 Smooth convex unconstrained and equality-constrained minimization A. Forsgren, KTH 1 Lecture 8 Convex optimization 2006/2007 Unconstrained

More information

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf.

Suppose that the approximate solutions of Eq. (1) satisfy the condition (3). Then (1) if η = 0 in the algorithm Trust Region, then lim inf. Maria Cameron 1. Trust Region Methods At every iteration the trust region methods generate a model m k (p), choose a trust region, and solve the constraint optimization problem of finding the minimum of

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

4TE3/6TE3. Algorithms for. Continuous Optimization

4TE3/6TE3. Algorithms for. Continuous Optimization 4TE3/6TE3 Algorithms for Continuous Optimization (Algorithms for Constrained Nonlinear Optimization Problems) Tamás TERLAKY Computing and Software McMaster University Hamilton, November 2005 terlaky@mcmaster.ca

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization Frank E. Curtis Department of Industrial and Systems Engineering, Lehigh University Daniel P. Robinson Department

More information

Non-convex optimization. Issam Laradji

Non-convex optimization. Issam Laradji Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective

More information

DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO

DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO QUESTION BOOKLET EECS 227A Fall 2009 Midterm Tuesday, Ocotober 20, 11:10-12:30pm DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO You have 80 minutes to complete the midterm. The midterm consists

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

A random perturbation approach to some stochastic approximation algorithms in optimization.
