arxiv: v1 [cs.lg] 6 Sep 2018

arXiv:1809.02162v1 [cs.LG] 6 Sep 2018

Aryan Mokhtari (aryanm@mit.edu), Asuman Ozdaglar (asuman@mit.edu), Ali Jadbabaie (jadbabai@mit.edu)
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Abstract

In this paper, we focus on escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $C$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $C$ is simple for a quadratic objective function. To be more precise, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $C$ in polynomial time, where $\rho < 1$ is a positive constant that depends on the structure of the set $C$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second-order stationary point (SOSP) in at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall arithmetic operations to reach an SOSP when the convex set $C$ can be written as a set of quadratic constraints. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.

1. Introduction

There has been a recent revival of interest in non-convex optimization, due to obvious applications in machine learning. While the modern history of the subject goes back six or seven decades, the recent attention to the topic stems from new applications as well as availability of modern analytical and computational tools, providing a new perspective on classical problems. Following this trend, in this paper we focus on the problem of minimizing a smooth nonconvex function over a convex set as follows:

$$\text{minimize } f(x), \quad \text{subject to } x \in C, \tag{1}$$

where $x \in \mathbb{R}^d$ is the decision variable, $C \subseteq \mathbb{R}^d$ is a closed convex set, and $f : \mathbb{R}^d \to \mathbb{R}$ is a twice continuously differentiable function over $C$.
It is well known that finding the global minimum of Problem (1) is hard. Equally well-known is the fact that for certain nonconvex problems, all local minimizers are global. These include, for example, matrix completion (Ge et al., 2016), phase retrieval (Sun et al., 2016), and dictionary learning (Sun et al., 2017). For such problems, finding the global minimum of (1) reduces to the problem of finding one of its local minima.

Given the well-known hardness results for finding stationary points, recent focus has shifted to characterizing approximate stationary points. When the objective function $f$ is convex, finding an $\epsilon$-first-order stationary point is often sufficient since it leads to finding an approximate local (and hence global) minimum. However, in the nonconvex setting, even when the problem is unconstrained, i.e., $C = \mathbb{R}^d$, convergence to a first-order stationary point (FOSP) is not enough, as the critical point to which convergence is established might be a saddle point. It is therefore natural to look at higher-order derivatives and search for second-order stationary points. Indeed, under the assumption that all the saddle points are strict (formally defined later), in both unconstrained and constrained settings, convergence to a second-order stationary point (SOSP) implies convergence to a local minimum. While convergence to an SOSP has been thoroughly investigated in the recent literature for the unconstrained setting, the iteration complexity of the convex-constrained setting has not been studied yet.

Contributions. Our main contribution in this paper is to propose a generic framework which generates a sequence of iterates converging to an approximate second-order stationary point for the constrained nonconvex problem in (1), when the convex set $C$ has a specific structure that allows for approximate minimization of a quadratic loss over the feasible set. The proposed framework consists of two main stages: first, it utilizes first-order information to reach a first-order stationary point; next, it incorporates second-order information to escape from a stationary point if it is a local maximizer or a strict saddle point. We show that the proposed approach leads to an $(\epsilon,\gamma)$-second-order stationary point (SOSP) for Problem (1) (see Definition 3). The proposed approach utilizes advances in constant-factor optimization of nonconvex quadratic programs (Vavasis, 1991; Ye, 1992; Fu et al., 1998; Nemirovski et al., 1999) that find a $\rho$-approximate solution over $C$ in polynomial time, where $\rho$ is a positive constant smaller than 1 that depends on the structure of $C$. When such an approximate solution exists, the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-SOSP of Problem (1) in at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We show that linear and quadratic constraints satisfy the required condition for the convex set $C$.
In particular, for the case that $C$ is defined as a set of $m$ quadratic constraints, we can achieve an $(\epsilon,\gamma)$-SOSP after at most $O(\max\{\tau\epsilon^{-2}, d^3 m^7 \gamma^{-3}\})$ arithmetic operations, where $d$ is the dimension of the problem and $\tau$ is the number of required arithmetic operations to solve a linear program over $C$ or to project a point onto $C$. We further extend our results to the stochastic setting and show that we can reach an $(\epsilon,\gamma)$-SOSP after computing at most $O(\max\{\epsilon^{-4}, \epsilon^{-2}\rho^{-4}\gamma^{-4}, \rho^{-7}\gamma^{-7}\})$ stochastic gradients and $O(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$ stochastic Hessians.

1.1 Related work

Unconstrained case. The rich literature on nonconvex optimization provides a plethora of algorithms for reaching stationary points of a smooth unconstrained minimization problem. Convergence to first-order stationary points (FOSP) has been widely studied for both deterministic (Nesterov, 2013; Cartis et al., 2010; Agarwal et al., 2017; Carmon et al., 2016, 2017a,b,c) and stochastic settings (Reddi et al., 2016b,a; Allen-Zhu and Hazan, 2016; Lei et al., 2017). Stronger results which indicate convergence to an SOSP are also established. Numerical optimization methods such as trust-region methods (Cartis et al., 2012a; Curtis et al., 2017; Martínez and Raydan, 2017) and cubic regularization algorithms (Nesterov and Polyak, 2006; Cartis et al., 2011a,b) can reach an approximate second-order stationary point in a finite number of iterations; however, the computational complexity of each iteration can be relatively large due to the cost of solving trust-region or regularized cubic subproblems. Recently, a new line of research has emerged that focuses on the overall computational cost to achieve an SOSP.
These results build on the idea of escaping from strict saddle points by perturbing the iterates with properly chosen noise (Ge et al., 2015; Jin et al., 2017a,b), or by incorporating second-order information and updating the iterates using the eigenvector corresponding to the smallest eigenvalue of the Hessian (Carmon et al., 2016; Allen-Zhu, 2017; Xu and Yang, 2017; Allen-Zhu and Li, 2017; Royer and Wright, 2017; Agarwal et al., 2017; Reddi et al., 2018; Paternain et al., 2017).

Constrained case. Asymptotic convergence to first-order and second-order stationary points for the constrained optimization problem in (1) has been studied in the numerical optimization community (Burke et al., 1990; Conn et al., 1993; Facchinei and Lucidi, 1998; Di Pillo et al., 2005). Recently, finite-time analysis for convergence to an FOSP of the generic smooth constrained problem in (1) has received a lot of attention. In particular, Lacoste-Julien (2016) shows that the sequence of iterates generated by the update of Frank-Wolfe converges to an $\epsilon$-FOSP of Problem (1) after $O(\epsilon^{-2})$ iterations. Ghadimi et al. (2016) consider the norm of the gradient mapping as a measure of non-stationarity and show that the projected gradient method has the same complexity of $O(\epsilon^{-2})$. Further, Ghadimi and Lan (2016) prove a similar complexity for the accelerated projected gradient method with a slightly better dependency on the Lipschitz constant of gradients. Adaptive cubic regularization methods in (Cartis et al., 2012b, 2013, 2015) improve these results using second-order information and obtain an $\epsilon$-FOSP of Problem (1) after at most $O(\epsilon^{-3/2})$ iterations.

Finite-time analysis for convergence to an SOSP has also been studied for linear constraints. To be more precise, Bian et al. (2015) study convergence to an SOSP of (1) when the set $C$ is a linear constraint of the form $x \ge 0$ and propose a trust region interior point method that obtains an $(\epsilon, \sqrt{\epsilon})$-SOSP in $O(\epsilon^{-3/2})$ iterations. Haeser et al. (2017) extend their results to the case that the objective function is potentially not differentiable or not twice differentiable on the boundary of the feasible region. Cartis et al. (2017) focus on the general convex constraint case and introduce a trust region algorithm that requires $O(\epsilon^{-3})$ iterations to obtain an SOSP; however, each iteration of their proposed method requires access to the exact solution of a nonconvex quadratic program (finding its global minimum) which, in general, could be computationally prohibitive. To the best of our knowledge, our paper provides the first finite-time overall computational complexity analysis for reaching an SOSP of Problem (1) with nonlinear constraints.

2. Preliminaries and Definitions

In the case of unconstrained minimization of the objective function $f$, the first-order and second-order necessary conditions for a point $x^*$ to be a local minimum of $f$ are $\nabla f(x^*) = 0_d$ and $\nabla^2 f(x^*) \succeq 0_{d \times d}$, respectively. If a point satisfies these conditions it is called a second-order stationary point (SOSP). If the second condition becomes strict, i.e., $\nabla^2 f(x^*) \succ 0$, then we recover the sufficient conditions for a local minimum.
However, to derive finite-time convergence bounds for achieving an SOSP, these conditions should be relaxed. In other words, the goal should be to find an approximate SOSP where the approximation error can be arbitrarily small. For the case of unconstrained minimization, a point $x^*$ is called an $(\epsilon,\gamma)$-second-order stationary point if it satisfies

$$\|\nabla f(x^*)\| \le \epsilon \quad \text{and} \quad \nabla^2 f(x^*) \succeq -\gamma I_d, \tag{2}$$

where $\epsilon$ and $\gamma$ are arbitrary positive constants. In the following definition we formally define strict saddle points for the unconstrained version of Problem (1), i.e., when $C = \mathbb{R}^d$.

Definition 1 Consider Problem (1) when $C = \mathbb{R}^d$. Then, $x$ is called a $\delta$-strict saddle point if it is a saddle point, i.e., $\nabla f(x) = 0$ and $\nabla^2 f(x)$ is indefinite, and the smallest eigenvalue of the Hessian evaluated at $x$ is strictly smaller than $-\delta$, i.e., $\lambda_{\min}(\nabla^2 f(x)) < -\delta$.

Using the definitions of a $\delta$-strict saddle and an $(\epsilon,\gamma)$-SOSP, we obtain that an $(\epsilon,\gamma)$-SOSP is a local minimum for the unconstrained optimization problem if all the saddle points are $\delta$-strict and the condition $\gamma \le \delta$ is satisfied.

To study the constrained setting, we first state the necessary conditions for a local minimum of Problem (1).

Proposition 2 (Bertsekas (1999)) If $x^* \in C$ is a local minimum of the function $f$ over the convex set $C$, then

$$\nabla f(x^*)^\top (x - x^*) \ge 0, \quad \text{for all } x \in C, \tag{3}$$

$$(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) \ge 0, \quad \text{for all } x \in C \text{ s.t. } \nabla f(x^*)^\top (x - x^*) = 0. \tag{4}$$

The conditions in (3) and (4) are the first-order and second-order necessary optimality conditions, respectively. By making the inequality in (4) strict, i.e., $(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) > 0$, we recover the sufficient conditions for a local minimum when $C$ is polyhedral (Bertsekas, 1999). Further, if the inequality in (4) is replaced by $(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) \ge \delta \|x - x^*\|^2$ for some $\delta > 0$, we obtain the sufficient conditions for a local minimum of Problem (1) for any convex constraint $C$; see (Bertsekas, 1999). If a point $x^*$ satisfies the conditions in (3) and (4) it is an SOSP of Problem (1).

As in the unconstrained setting, the first-order and second-order optimality conditions may not be satisfied in a finite number of iterations, and we focus on finding an approximate SOSP.

Definition 3 Recall the twice continuously differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ and the convex closed set $C \subseteq \mathbb{R}^d$ introduced in Problem (1). We call $x^* \in C$ an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) if the following conditions are satisfied:

$$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon, \quad \text{for all } x \in C, \tag{5}$$

$$(x - x^*)^\top \nabla^2 f(x^*) (x - x^*) \ge -\gamma, \quad \text{for all } x \in C \text{ s.t. } 0 \ge \nabla f(x^*)^\top (x - x^*) \ge -\epsilon. \tag{6}$$

If a point only satisfies the first condition, we call it an $\epsilon$-first-order stationary point.

To further clarify the conditions in Definition 3, we first elaborate on the conditions in Proposition 2 for stationary points. The condition in (3) ensures that there is no feasible direction that makes the linear term in the Taylor series of $f$ around $x^*$ negative. If there exists such a direction, then by choosing a very small stepsize we can find a feasible point that decreases the objective function value, and, therefore, $x^*$ cannot be a local minimum. The condition in (5) relaxes this requirement and checks that the inner product $\nabla f(x^*)^\top (x - x^*)$ is not too negative for any $x \in C$. In other words, it ensures that the function value does not decrease by more than $O(\epsilon)$ along any feasible direction. Note that if $\nabla f(x^*)^\top (x - x^*)$ is strictly positive for all $x \in C$, then we can ensure that in a small neighborhood of $x^*$ any feasible direction increases the function value and hence $x^*$ is a local minimum.
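As an illustration, the two conditions of Definition 3 can be probed numerically on a toy instance. The sketch below is not part of the paper: the objective $f(x) = x_1^2 - x_2^2$, the box $C = [-1,1]^2$, the grid resolution, and the function names are all assumptions made for the example, and the range in (6) is read as $[-\epsilon, 0]$, consistent with the constraint of the subproblem (14). A finite grid can only refute the conditions; it does not certify them in general.

```python
from itertools import product

HESSIAN = ((2.0, 0.0), (0.0, -2.0))  # constant Hessian of f(x) = x1^2 - x2^2


def grad(x):
    # Gradient of the assumed toy objective f(x) = x1^2 - x2^2.
    return (2.0 * x[0], -2.0 * x[1])


def is_approx_sosp(x_star, eps, gamma, n=20, tol=1e-12):
    """Test conditions (5) and (6) at x_star on a grid over C = [-1, 1]^2."""
    g = grad(x_star)
    ticks = [-1.0 + 2.0 * i / n for i in range(n + 1)]
    for x in product(ticks, repeat=2):
        d = (x[0] - x_star[0], x[1] - x_star[1])
        lin = g[0] * d[0] + g[1] * d[1]
        if lin < -eps - tol:
            return False  # first-order condition (5) is violated
        if lin <= tol:
            # Linear term lies in [-eps, 0]: the quadratic test (6) applies.
            quad = sum(d[i] * HESSIAN[i][j] * d[j]
                       for i in range(2) for j in range(2))
            if quad < -gamma - tol:
                return False  # second-order condition (6) is violated
    return True
```

On this instance the origin is a first-order stationary point that fails (6) for any $\gamma < 2$, because the direction toward $(0, \pm 1)$ has curvature $-2$, while the boundary point $(0, 1)$ passes both tests.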
To ensure that $x^*$ satisfies the necessary conditions for a local minimum, among the feasible directions $x - x^*$ that are orthogonal to the gradient, i.e., $\nabla f(x^*)^\top (x - x^*) = 0$, we must ensure that the function value is non-decreasing. The condition in (4) guarantees that among the feasible directions that satisfy $\nabla f(x^*)^\top (x - x^*) = 0$, there is no direction that makes the quadratic term in the Taylor series of $f$ around $x^*$ negative. We would like to emphasize that, since the first-order optimality condition gives $\nabla f(x^*)^\top (x - x^*) \ge 0$ for any $x \in C$, the quadratic condition in (4) only needs to hold for the directions with $\nabla f(x^*)^\top (x - x^*) = 0$: for those with $\nabla f(x^*)^\top (x - x^*) > 0$ the function value is increasing in a small neighborhood of $x^*$. Using the same argument, for the relaxed version of (4) it is not required to check the quadratic condition for points $x \in C$ satisfying $\nabla f(x^*)^\top (x - x^*) > 0$; however, as we relax the first-order optimality condition, i.e., the inner product $\nabla f(x^*)^\top (x - x^*)$ is allowed to be a small negative value, we need to ensure that the quadratic condition holds for all points satisfying $0 \ge \nabla f(x^*)^\top (x - x^*) \ge -\epsilon$. The conditions in (5) and (6) together ensure that in a neighborhood of $x^*$ with radius $\alpha$ the function value does not decrease by more than $O(\alpha\epsilon)$ through the linear term and by more than $O(\alpha^2\gamma)$ through the quadratic term. We further formally define strict saddle points for the constrained optimization problem in (1).

Definition 4 A point $x^* \in C$ is a $\delta$-strict saddle point of Problem (1) if (i) for all $x \in C$ the condition $\nabla f(x^*)^\top (x - x^*) \ge 0$ holds, and (ii) there exists a point $y$ such that

$$(y - x^*)^\top \nabla^2 f(x^*) (y - x^*) < -\delta, \quad y \in C \quad \text{and} \quad \nabla f(x^*)^\top (y - x^*) = 0. \tag{7}$$

According to Definitions 3 and 4, if all saddle points are $\delta$-strict and $\gamma \le \delta$, any $(\epsilon,\gamma)$-SOSP of Problem (1) is an approximate local minimum. We emphasize that in this paper we do not assume that all saddles are strict to prove convergence to an SOSP.
We formally defined strict saddles just to clarify that if all the saddles are strict then convergence to an approximate SOSP is equivalent to convergence to an approximate local minimum. Our goal throughout the rest of the paper is to design an algorithm which finds an $(\epsilon,\gamma)$-SOSP of Problem (1). To do so, we first assume the following conditions are satisfied.

Assumption 1 The gradients $\nabla f$ are $L$-Lipschitz continuous over the set $C$, i.e., for any $x, \tilde{x} \in C$,

$$\|\nabla f(x) - \nabla f(\tilde{x})\| \le L \|x - \tilde{x}\|. \tag{8}$$

Assumption 2 The Hessians $\nabla^2 f$ are $M$-Lipschitz continuous over the set $C$, i.e., for any $x, \tilde{x} \in C$,

$$\|\nabla^2 f(x) - \nabla^2 f(\tilde{x})\| \le M \|x - \tilde{x}\|. \tag{9}$$

Assumption 3 The diameter of the compact convex set $C$ is upper bounded by a constant $D$, i.e.,

$$\max_{x, \hat{x} \in C} \|x - \hat{x}\| \le D. \tag{10}$$

Assumptions 1 and 2 ensure that the objective function gradients and Hessians, respectively, are Lipschitz continuous over the convex set $C$. Assumption 3 guarantees that the distance between any two points in the convex set $C$ is bounded above by a constant.

3. Main Result

In this section, we introduce a generic framework to reach an $(\epsilon,\gamma)$-SOSP of the non-convex function $f$ over the convex set $C$, when $C$ has a specific structure as we describe below. In particular, we focus on the case when we can solve a quadratic program (QP) of the form

$$\text{minimize } x^\top A x + b^\top x + c \quad \text{subject to } x \in C, \tag{11}$$

up to a constant factor $\rho \le 1$ with a finite number of arithmetic operations. Here, $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix, $b \in \mathbb{R}^d$ is a vector, and $c \in \mathbb{R}$ is a scalar. To clarify the notion of solving a problem up to a constant factor $\rho$, consider $x^*$ as a global minimizer of (11). Then, we say Problem (11) is solved up to a constant factor $\rho \in (0,1]$ if we have found a feasible solution $\tilde{x} \in C$ such that

$$x^{*\top} A x^* + b^\top x^* + c \;\le\; \tilde{x}^\top A \tilde{x} + b^\top \tilde{x} + c \;\le\; \rho\,(x^{*\top} A x^* + b^\top x^* + c). \tag{12}$$

Note that here, w.l.o.g., we have assumed that the optimal objective function value $x^{*\top} A x^* + b^\top x^* + c$ is non-positive. A larger constant $\rho$ implies that the approximate solution is more accurate. If $\tilde{x}$ satisfies the condition in (12), we call it a $\rho$-approximate solution of Problem (11). Indeed, if $\rho = 1$ then $\tilde{x}$ is a global minimizer of Problem (11).

In Algorithm 1, we introduce a generic framework that finds an $(\epsilon,\gamma)$-SOSP of Problem (1) whose running time is polynomial in $\epsilon^{-1}$, $\gamma^{-1}$, $\rho^{-1}$ and $d$, when we can find a $\rho$-approximate solution of a quadratic problem of the form (11) in a time that is polynomial in $d$. The proposed scheme consists of two major stages.
In the first phase, as mentioned in Steps 2-4, we use a first-order update, i.e., a gradient-based update, to find an $\epsilon$-FOSP, i.e., we update the decision variable $x$ according to a first-order update until we reach a point $x_t \in C$ that satisfies the condition

$$\nabla f(x_t)^\top (x - x_t) \ge -\epsilon, \quad \text{for all } x \in C. \tag{13}$$

In Section 4, we study in detail projected gradient descent and conditional gradient algorithms for the first-order phase of the proposed framework. Interestingly, both of these algorithms require at most $O(\epsilon^{-2})$ iterations to reach an $\epsilon$-first-order stationary point.

The second stage of the proposed scheme uses second-order information of the objective function $f$ to escape from the stationary point if it is a local maximum or a strict saddle point. To be more precise, if we assume that $x_t$ is a feasible point satisfying the condition (13), we then aim to find a descent direction by solving the following quadratic program

$$\text{minimize } q(u) := (u - x_t)^\top \nabla^2 f(x_t) (u - x_t) \quad \text{subject to } u \in C, \;\; -\epsilon \le \nabla f(x_t)^\top (u - x_t) \le 0, \tag{14}$$

up to a constant factor $\rho$ where $\rho \in (0,1]$.

Algorithm 1 Generic framework for escaping saddles in constrained optimization
Require: Stepsize $\sigma > 0$. Initialize $x_0 \in C$
 1: for $t = 1, 2, \ldots$ do
 2:   if $x_t$ is not an $\epsilon$-first-order stationary point then
 3:     Compute $x_{t+1}$ using first-order information (Frank-Wolfe or projected gradient descent)
 4:   else
 5:     Find $u_t$: a $\rho$-approximate solution of (14)
 6:     if $q(u_t) < -\rho\gamma$ then
 7:       Compute the updated variable $x_{t+1} = (1 - \sigma) x_t + \sigma u_t$;
 8:     else
 9:       Return $x_t$ and stop.
10:     end if
11:   end if
12: end for

If we define $q(u^*)$ as the optimal objective function value of the program in (14), we focus on the cases that we can obtain a feasible point $u_t$ which is a $\rho$-approximate solution of Problem (14), i.e., $u_t$ satisfies the constraints in (14) and

$$q(u^*) \le q(u_t) \le \rho\, q(u^*). \tag{15}$$

The problem formulation in (14) can be transformed into the quadratic program in (11); see Section 5 for more details. Note that the constant $\rho$ is independent of $\epsilon$, $\gamma$, and $d$ and only depends on the structure of the convex set $C$. For instance, if $C$ is defined in terms of $m$ quadratic constraints one can find a $\rho = m^{-2}$ approximate solution of (14) after at most $\tilde{O}(m d^3)$ arithmetic operations (Section 5).

After computing a feasible point $u_t$ satisfying the condition in (15), we check the quadratic objective function value at the point $u_t$, and if the inequality $q(u_t) < -\rho\gamma$ holds, we follow the update

$$x_{t+1} = (1 - \sigma) x_t + \sigma u_t, \tag{16}$$

where $\sigma$ is a positive step size. Otherwise, we stop the process and return $x_t$ as an $(\epsilon,\gamma)$-second-order stationary point of Problem (1). To check this claim, note that Algorithm 1 stops if we reach a point $x_t$ that satisfies the first-order stationary condition $\nabla f(x_t)^\top (x - x_t) \ge -\epsilon$, and the objective function value for the $\rho$-approximate solution of the quadratic subproblem is at least $-\rho\gamma$, i.e., $q(u_t) \ge -\rho\gamma$. The second condition, together with the fact that $q(u_t)$ satisfies (15), implies that $q(u^*) \ge -\gamma$.
Therefore, for any $x \in C$ with $-\epsilon \le \nabla f(x_t)^\top (x - x_t) \le 0$, it holds that

$$(x - x_t)^\top \nabla^2 f(x_t) (x - x_t) \ge -\gamma. \tag{17}$$

These two observations show that the outcome of the proposed framework in Algorithm 1 is an $(\epsilon,\gamma)$-SOSP of Problem (1). Now it remains to characterize the number of iterations that Algorithm 1 needs to perform before reaching an $(\epsilon,\gamma)$-SOSP, which we formally state in the following theorem.

Theorem 5 Consider the problem in (1). Suppose the conditions in Assumptions 1-3 are satisfied. If in the first-order stage, i.e., Steps 2-4, we use the update of Frank-Wolfe or projected gradient descent, the framework proposed in Algorithm 1 finds an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) after at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

The result in Theorem 5 shows that if the convex constraint $C$ is such that one can solve the quadratic subproblem in (14) $\rho$-approximately, then the proposed generic framework finds an $(\epsilon,\gamma)$-SOSP of Problem (1) after at most $O(\epsilon^{-2})$ first-order and $O(\rho^{-3}\gamma^{-3})$ second-order updates.

To prove the claim in Theorem 5, we first review first-order conditional gradient and projected gradient algorithms and show that if the current iterate is not a first-order stationary point, by following either of these updates the objective function value decreases by a constant of $O(\epsilon^2)$ (Section 4). We then focus on the second stage of Algorithm 1, which corresponds to the case that the current iterate is an $\epsilon$-FOSP and we need to solve the quadratic program in (14) approximately (Section 5). In this case, we show that if the iterate is not an $(\epsilon,\gamma)$-SOSP, by following the update in (16) the objective function value decreases at least by a constant of $O(\rho^3\gamma^3)$. Finally, by combining these two results it can be shown that Algorithm 1 finds an $(\epsilon,\gamma)$-SOSP after at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations.

4. First-Order Step: Convergence to a First-Order Stationary Point

In this section, we study two different first-order methods for the first stage of Algorithm 1. The result in this section can also be independently used for convergence to an FOSP of Problem (1) satisfying

$$\nabla f(x^*)^\top (x - x^*) \ge -\epsilon, \quad \text{for all } x \in C, \tag{18}$$

where $\epsilon > 0$ is a positive constant. Although for Algorithm 1 we assume that $C$ has a specific structure as mentioned in (11), the results in this section hold for any compact convex set $C$. To keep our result as general as possible, in this section we study both conditional gradient and projection-based methods when they are used in the first stage of the proposed generic framework.

4.1 Conditional gradient update

The conditional gradient (Frank-Wolfe) update has two steps. We first solve the linear program

$$v_t = \operatorname*{argmax}_{v \in C} \{ -\nabla f(x_t)^\top v \}. \tag{19}$$

Then, we compute the updated variable $x_{t+1}$ according to the update

$$x_{t+1} = (1 - \eta) x_t + \eta v_t, \tag{20}$$

where $\eta$ is a stepsize. In the following proposition, we show that if the current iterate is not an $\epsilon$-first-order stationary point, then by updating the variable according to (19)-(20) the objective function value decreases. The proof of the following proposition is adapted from (Lacoste-Julien, 2016).

Proposition 6 Consider the optimization problem in (1). Suppose Assumptions 1 and 3 hold. Set the stepsize in (20) to $\eta = \epsilon/(D^2 L)$. Then, if the iterate $x_t$ at step $t$ is not an $\epsilon$-first-order stationary point, the objective function value at the updated variable $x_{t+1}$ satisfies the inequality

$$f(x_{t+1}) \le f(x_t) - \frac{\epsilon^2}{2 D^2 L}. \tag{21}$$

The result in Proposition 6 shows that by following the update of the conditional gradient method the objective function value decreases by $O(\epsilon^2)$, if an $\epsilon$-FOSP is not achieved. As a consequence of this result, the FW algorithm reaches an $\epsilon$-first-order stationary point after at most $O(\epsilon^{-2})$ iterations, or equivalently, after solving at most $O(\epsilon^{-2})$ linear programs.

Remark 7 In step 3 of Algorithm 1 we first check if $x_t$ is an $\epsilon$-FOSP. This can be done by evaluating

$$\min_{x \in C} \{ \nabla f(x_t)^\top (x - x_t) \} = -\max_{x \in C} \{ -\nabla f(x_t)^\top x \} - \nabla f(x_t)^\top x_t \tag{22}$$

and comparing the optimal value with $-\epsilon$. Note that the linear program in (22) is the same as the one in (19). Therefore, by checking the first-order optimality condition of $x_t$, the variable $v_t$ is already computed, and we need to solve only one linear program per iteration.
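The conditional gradient stage, including the shared linear program of Remark 7, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the objective $f(x) = x_1^2 - x_2^2$, the box $C = [-1,1]^2$ (so $L = 2$ and $D^2 = 8$), and all function names are chosen for the example. Over a box, the linear program (19) has a closed-form vertex solution.

```python
LO, HI = -1.0, 1.0   # assumed toy feasible set C = [-1, 1]^2
L_SMOOTH = 2.0       # Lipschitz constant of the gradient of f below
DIAM_SQ = 8.0        # squared diameter D^2 of the box


def f(x):
    return x[0] ** 2 - x[1] ** 2


def grad(x):
    return (2.0 * x[0], -2.0 * x[1])


def fw_gap(x):
    """Solve the linear program (19) over the box and return (gap, v).

    gap is the minimum over x' in C of grad(x)^T (x' - x), attained at v.
    """
    g = grad(x)
    v = tuple(LO if gi > 0 else HI for gi in g)
    gap = sum(gi * (vi - xi) for gi, vi, xi in zip(g, v, x))
    return gap, v


def frank_wolfe(x, eps, max_iter=100000):
    """Run update (20) with eta = eps / (D^2 L) until x is an eps-FOSP."""
    eta = eps / (DIAM_SQ * L_SMOOTH)
    decreases = []  # per-step objective decrease, to compare against (21)
    for _ in range(max_iter):
        gap, v = fw_gap(x)  # one LP serves both the test and the step
        if gap >= -eps:
            return x, decreases
        x_next = tuple((1.0 - eta) * xi + eta * vi for xi, vi in zip(x, v))
        decreases.append(f(x) - f(x_next))
        x = x_next
    raise RuntimeError("iteration budget exhausted")
```

Calling `frank_wolfe((0.5, 0.3), 0.1)` returns a point satisfying (18) together with the list of per-step decreases, each of which can be compared with the guarantee $\epsilon^2/(2D^2L)$ of Proposition 6.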

4.2 Projected gradient update

The projected gradient descent (PGD) update consists of two steps: (i) descending through the gradient direction and (ii) projecting the updated variable onto the convex constraint set. These two steps can be combined together and the update can be explicitly written as

$$x_{t+1} = \pi_C\{ x_t - \eta \nabla f(x_t) \}, \tag{23}$$

where $\pi_C(\cdot)$ is the Euclidean projection onto the convex set $C$ and $\eta$ is a positive stepsize. In the following proposition, we first show that by following the update of PGD the objective function value decreases by a constant until we reach an $\epsilon$-FOSP. Further, we show that the number of required iterations for PGD to reach an $\epsilon$-FOSP is of $O(\epsilon^{-2})$.

Proposition 8 Consider Problem (1). Suppose Assumptions 1 and 3 are satisfied. Further, assume that the gradients $\nabla f(x)$ are uniformly bounded by $K$ for all $x \in C$. If the stepsize of the projected gradient descent method defined in (23) is set to $\eta = 1/L$, the objective function value decreases by

$$f(x_{t+1}) \le f(x_t) - \frac{\epsilon^2 L}{2 (K + L D)^2}. \tag{24}$$

Moreover, iterates reach a first-order stationary point satisfying (18) after at most $O(\epsilon^{-2})$ iterations.

Proposition 8 shows that by following the update of PGD the function value decreases by $O(\epsilon^2)$ until we reach an $\epsilon$-FOSP. It further shows PGD obtains an $\epsilon$-FOSP satisfying (18) after at most $O(\epsilon^{-2})$ iterations. To the best of our knowledge, this result is also novel, since the only convergence guarantee for PGD in (Ghadimi et al., 2016) is in terms of the number of iterations to reach a point with a gradient mapping norm less than $\epsilon$, while our result characterizes the number of iterations to satisfy (18).

Remark 9 To use the PGD update in the first stage of Algorithm 1 one needs to define a criterion to check if $x_t$ is an $\epsilon$-FOSP or not. However, in PGD we do not solve the linear program $\min_{x \in C} \{ \nabla f(x_t)^\top (x - x_t) \}$. This issue can be resolved by checking the condition $\|x_t - x_{t+1}\| \le \epsilon/(K + L D)$, which is a sufficient condition for the condition in (18).
In other words, if this condition holds we stop and $x_t$ is an $\epsilon$-FOSP; otherwise, the result in (24) holds and the function value decreases. For more details please check the proof of Proposition 8.

5. Second-Order Step: Escaping Saddle Points

In this section, we study the second stage of the framework in Algorithm 1, which corresponds to the case that the current iterate is an $\epsilon$-FOSP. Note that when we reach a critical point the goal is to find a feasible point $u\in C$ in the region $-\epsilon \le \nabla f(x_t)^\top(u-x_t) \le 0$ that makes the inner product $(u-x_t)^\top\nabla^2 f(x_t)(u-x_t)$ smaller than $-\gamma$. To achieve this goal we need to check the minimum value of this inner product over the constraints, i.e., we need to solve the quadratic program in (14) up to a constant factor $\rho\in(0,1]$. In the following proposition, we show that the updated variable according to (16) decreases the objective function value if the condition $q(u_t) < -\rho\gamma$ holds.

Proposition 10 Consider the quadratic program in (14). Let $u_t$ be a $\rho$-approximate solution for the quadratic subproblem in (14). Suppose that Assumptions 2 and 3 hold. Further, set the stepsize to $\sigma = \rho\gamma/(MD^3)$. If the quadratic objective function value $q$ evaluated at $u_t$ satisfies the condition $q(u_t) < -\rho\gamma$, then the updated variable according to (16) satisfies the inequality

$$f(x_{t+1}) \le f(x_t) - \frac{\rho^3\gamma^3}{3M^2D^6}. \tag{25}$$
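As a sanity check of Proposition 10, the following sketch evaluates one second-order step on a one-dimensional toy problem with exact rational arithmetic. Everything about the instance is an assumption made for illustration: the objective $f(x) = x^3/3 - x^2/2$ on $C = [-1, 1]$, the point $x_t = 0$ (where the gradient vanishes and the Hessian equals $-1$), the values $\gamma = 1/2$ and $\rho = 1$, and the endpoint search that stands in for a QP solver.

```python
from fractions import Fraction as Fr

def f(x):
    return x ** 3 / 3 - x ** 2 / 2  # toy objective (assumption)

def grad(x):
    return x ** 2 - x

def hess(x):
    return 2 * x - 1

lo, hi = Fr(-1), Fr(1)
D = hi - lo              # diameter of C = [-1, 1]
M = Fr(2)                # |f'''| = 2, so the Hessian is 2-Lipschitz
rho, gamma = Fr(1), Fr(1, 2)
x_t = Fr(0)              # grad(x_t) = 0, so x_t is first-order stationary

def q(u):
    # Quadratic objective of the subproblem, (u - x_t) * f''(x_t) * (u - x_t).
    return (u - x_t) * hess(x_t) * (u - x_t)

# Since grad(x_t) = 0, the gradient constraint holds for every u in C.
# In one dimension the quadratic is minimized at an endpoint or at x_t itself.
u_t = min([lo, hi, x_t], key=q)

sigma = rho * gamma / (M * D ** 3)
x_next = (1 - sigma) * x_t + sigma * u_t
decrease = f(x_t) - f(x_next)
bound = rho ** 3 * gamma ** 3 / (3 * M ** 2 * D ** 6)
```

Here $q(u_t) = -1 < -\rho\gamma$, the stepsize is $\sigma = 1/32$, and the observed decrease $49/98304$ exceeds the guaranteed $\rho^3\gamma^3/(3M^2D^6) = 16/98304$.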

The only unanswered question is how to solve the quadratic subproblem in (14) up to a constant factor $\rho\in(0,1]$ in polynomial time. For general $C$, the quadratic subproblem could be NP-hard (Murty and Kabadi, 1987); however, for some special choices of the convex constraint $C$, e.g., linear and quadratic constraints, this quadratic program (QP) can be solved either exactly or approximately up to a constant factor. In the following section, we focus on the quadratic constraint case. The discussion on the linear constraint case is available in Section 8.1 in the supplementary material.

5.1 Quadratic constraints case

Consider the case that the convex set $C$ is the intersection of $m$ ellipsoids, i.e.,

$$C := \{x\in\mathbb{R}^d \mid x^\top Q_i x + s_i^\top x + r_i \le 0, \ \text{for all } i = 1,\dots,m\}, \tag{26}$$

where $r_i\in\mathbb{R}$, $s_i\in\mathbb{R}^d$, and $Q_i\in\mathbb{S}^d_+$. Under this assumption, the QP in (14) can be written as

$$\min_u \ (u-x_t)^\top\nabla^2 f(x_t)(u-x_t) \quad \text{s.t.} \quad u^\top Q_i u + s_i^\top u + r_i \le 0 \ \text{for } i=1,\dots,m, \quad -\epsilon \le \nabla f(x_t)^\top(u-x_t) \le 0. \tag{27}$$

Note that the linear constraints $-\epsilon \le \nabla f(x_t)^\top(u-x_t) \le 0$ can be easily handled by writing them in the form of quadratic constraints. To do so, first define a new optimization variable $z := u - x_t$ to obtain

$$\min_z \ z^\top\nabla^2 f(x_t) z \quad \text{s.t.} \quad z^\top Q_i z + \tilde{s}_i^\top z + \tilde{r}_i \le 0 \ \text{for } i=1,\dots,m, \quad -\epsilon \le \nabla f(x_t)^\top z \le 0, \tag{28}$$

where $\tilde{s}_i = s_i + 2Q_i x_t$ and $\tilde{r}_i = r_i + s_i^\top x_t + x_t^\top Q_i x_t$. We can simply replace $\nabla f(x_t)^\top z \le 0$ by the quadratic constraint $z^\top \tilde{Q}_{m+1} z + \tilde{s}_{m+1}^\top z + \tilde{r}_{m+1} \le 0$ with parameters $\tilde{Q}_{m+1} = 0$, $\tilde{s}_{m+1} = \nabla f(x_t)$, and $\tilde{r}_{m+1} = 0$. Similarly, we can write $\nabla f(x_t)^\top z \ge -\epsilon$ as the quadratic constraint $z^\top \tilde{Q}_{m+2} z + \tilde{s}_{m+2}^\top z + \tilde{r}_{m+2} \le 0$ with parameters $\tilde{Q}_{m+2} = 0$, $\tilde{s}_{m+2} = -\nabla f(x_t)$, and $\tilde{r}_{m+2} = -\epsilon$. Therefore, with $\tilde{Q}_i = Q_i$ for $i = 1,\dots,m$, the problem in (28) can be written as

$$\min_z \ z^\top\nabla^2 f(x_t) z \quad \text{s.t.} \quad z^\top \tilde{Q}_i z + \tilde{s}_i^\top z + \tilde{r}_i \le 0, \ \text{for } i=1,\dots,m+2. \tag{29}$$

Note that the matrices $\tilde{Q}_i\in\mathbb{S}^d_+$ are positive semidefinite, while the Hessian $\nabla^2 f(x_t)\in\mathbb{S}^d$ might be indefinite. Indeed, the optimal objective function value of the program in (29) is equal to the optimal objective function value of (27).
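The change of variables leading to (29) can be checked numerically with a short sketch. The helper names (`shift_constraint`, `gradient_slab_constraints`) and the small integer instance are hypothetical; the code assumes each $Q_i$ is symmetric, which is what makes the shifted linear term equal to $s_i + 2Q_i x_t$.

```python
def matvec(Q, x):
    return [sum(qij * xj for qij, xj in zip(row, x)) for row in Q]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quad(Q, s, r, x):
    # Value of x^T Q x + s^T x + r.
    return dot(x, matvec(Q, x)) + dot(s, x) + r

def shift_constraint(Q, s, r, x_t):
    # Rewrite u^T Q u + s^T u + r <= 0 in terms of z = u - x_t (Q symmetric).
    Qx = matvec(Q, x_t)
    s_tilde = [si + 2 * qi for si, qi in zip(s, Qx)]
    r_tilde = r + dot(s, x_t) + dot(x_t, Qx)
    return Q, s_tilde, r_tilde

def gradient_slab_constraints(g, eps):
    # Encode -eps <= g^T z <= 0 as two constraints with a zero quadratic part.
    d = len(g)
    zero = [[0] * d for _ in range(d)]
    upper = (zero, list(g), 0)                # g^T z <= 0
    lower = (zero, [-gi for gi in g], -eps)   # -g^T z - eps <= 0
    return upper, lower

# Small integer instance (an assumption) so the comparison is exact.
Q, s, r = [[2, 0], [0, 1]], [1, -1], -4
x_t, z = [1, 1], [-1, 2]
u = [zi + xi for zi, xi in zip(z, x_t)]
g, eps = [1, -2], 1
upper, lower = gradient_slab_constraints(g, eps)
```

For this instance `quad(*shift_constraint(Q, s, r, x_t), z)` and `quad(Q, s, r, u)` both evaluate to 2, so the shifted constraint agrees with the original one at the corresponding point.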
The program in (29) is a Quadratically Constrained Quadratic Program (QCQP), and one can find an approximate solution for this program as we state in the following proposition.

Proposition 11 (Fu et al. (1998)) Consider Problem (29). There exists a polynomial time method that obtains a $\frac{1-\zeta}{m^2(1+\zeta)^2}$-approximation by at most $O(d^3(m\log(1/\delta) + \log(1/\zeta) + \log d))$ arithmetic operations, where $\delta$ is the ratio of the radius of the largest inscribing sphere over that of the smallest circumscribing sphere of the feasible set.

The result in Proposition 11 indicates that we can solve the QCQP in (29) with the approximation factor $\rho \approx 1/m^2$ for $m \ge 1$. According to this result the complexity of solving the QCQP in (14) is $\tilde{O}(md^3)$ when the constraint $C$ is defined as $m$ convex quadratic constraints. As the total number of calls to the second-order stage is at most $O(\rho^{-3}\gamma^{-3}) = O(m^6\gamma^{-3})$, we obtain that the total number of arithmetic operations for the second-order stage is at most $\tilde{O}(m^7d^3\gamma^{-3})$.

The main idea of the algorithm proposed by Fu et al. (1998) is to approximate the feasible set by an inscribing ellipsoid and minimize the quadratic objective function over this ellipsoid to find a global minimizer over the ellipsoid, which is then used as an approximate global minimizer for the original QP. If by increasing the radius of the ellipsoid the enlarged ellipsoid contains the feasible set, then using the result in (Ye, 1992) it follows that this point is a good approximate solution of the original QCQP.

Remark 12 If $m = 1$, one can also use the S-Procedure (Pólik and Terlaky, 2007) to solve (29) accurately. Further, Nemirovski et al. (1999) showed that if the sum of the matrices $Q_i$ is positive definite and the constraints are homogeneous, i.e., $s_i = 0$, one can find a $\frac{1}{2\ln(2m\mu)}$-approximate solution of (29) by solving its SDP relaxation, where $\mu := \min\{m, \max_i \operatorname{rank}(Q_i)\}$.

6. Stochastic Extension

In this section, we focus on stochastic constrained minimization problems. Consider the optimization problem in (1) when the objective function $f$ is defined as an expectation of a set of stochastic functions $F : \mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ with inputs $x\in\mathbb{R}^d$ and $\Theta\in\mathbb{R}^d$, where $\Theta$ is a random variable with probability distribution $P$. To be more precise, we consider the optimization problem

$$\text{minimize } f(x) := \mathbb{E}[F(x,\Theta)], \quad \text{subject to } x\in C. \tag{30}$$

Our goal is to find a point which satisfies the necessary optimality conditions with high probability. Consider the vector $d_t = (1/b_g)\sum_{i=1}^{b_g}\nabla F(x_t,\theta_i)$ and the matrix $H_t = (1/b_H)\sum_{i=1}^{b_H}\nabla^2 F(x_t,\theta_i)$ as stochastic approximations of the gradient $\nabla f(x_t)$ and Hessian $\nabla^2 f(x_t)$, respectively. Here $b_g$ and $b_H$ are the gradient and Hessian batch sizes, respectively, and the vectors $\theta_i$ are the realizations of the random variable $\Theta$. In Algorithm 2, we present the stochastic variant of our proposed scheme for finding an $(\epsilon,\gamma)$-SOSP of Problem (30). We only focus on a stochastic variant of the Frank-Wolfe method for simplicity, but a similar stochastic extension can also be shown for the projected gradient descent method.

Algorithm 2 differs from Algorithm 1 in using stochastic gradients $d_t$ and Hessians $H_t$ in lieu of exact gradients $\nabla f(x_t)$ and Hessians $\nabla^2 f(x_t)$. The second major difference is the inequality constraint in step 6. Here instead of using the constraint $-\epsilon \le d_t^\top(u-x_t) \le 0$ we need to use $-\epsilon - r \le d_t^\top(u-x_t) \le r$, where $r > 0$ is a properly chosen constant. This modification is needed to ensure that if a point satisfies this constraint it also satisfies $-\epsilon \le \nabla f(x_t)^\top(u-x_t) \le 0$ with high probability.

Algorithm 2
Require: Stepsize $\sigma_t > 0$. Initialize $x_0\in C$
1: for $t = 1, 2, \dots$ do
2:   Compute $v_t = \operatorname{argmax}_{v\in C}\{-d_t^\top v\}$
3:   if $d_t^\top(v_t - x_t) < -\epsilon$ then
4:     Compute $x_{t+1} = (1-\eta)x_t + \eta v_t$
5:   else
6:     Find $u_t$: a $\rho$-approximate solution of $\min_u (u-x_t)^\top H_t(u-x_t)$ s.t. $u\in C$, $-\epsilon - r \le d_t^\top(u-x_t) \le r$
7:     if $q(u_t) < -\rho\gamma$ then
8:       Compute the updated variable $x_{t+1} = (1-\sigma)x_t + \sigma u_t$
9:     else
10:      Return $x_t$ and stop
11:    end if
12:  end if
13: end for

To prove our main result we assume that the following conditions also hold.

Assumption 4 The variances of the stochastic gradients and Hessians are uniformly bounded by constants $\nu^2$ and $\xi^2$, respectively, i.e., for any $x\in C$ and $\theta$ we can write

$$\mathbb{E}\left[\|\nabla F(x,\theta) - \nabla f(x)\|^2\right] \le \nu^2, \qquad \mathbb{E}\left[\|\nabla^2 F(x,\theta) - \nabla^2 f(x)\|^2\right] \le \xi^2. \tag{31}$$

The conditions in Assumption 4, which require access to unbiased estimates of the gradient and Hessian with bounded variances, are customary in stochastic optimization. In the following theorem, we characterize the iteration complexity of Algorithm 2 to reach an $(\epsilon,\gamma)$-SOSP of Problem (30) with high probability.

Theorem 13 Consider the optimization problem defined in (30). Suppose the conditions in Assumptions 1-4 are satisfied. If the batch sizes are $b_g = O(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ and $b_H = O(\rho^{-2}\gamma^{-2})$ and we set $r = O(\rho^2\gamma^2)$, then the outcome of the proposed framework outlined in Algorithm 2 is an $(\epsilon,\gamma)$-second-order stationary point of Problem (30) with high probability. Further, the total number of iterations to reach such a point is at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.

The result in Theorem 13 indicates that the total number of iterations to reach an $(\epsilon,\gamma)$-SOSP is at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$. As each iteration requires at most $O(\max\{\rho^{-4}\gamma^{-4}, \epsilon^{-2}\})$ stochastic gradient and $O(\rho^{-2}\gamma^{-2})$ stochastic Hessian evaluations, the total number of stochastic gradient and Hessian computations to reach an $(\epsilon,\gamma)$-SOSP is $O(\max\{\epsilon^{-2}\rho^{-4}\gamma^{-4}, \epsilon^{-4}, \rho^{-7}\gamma^{-7}\})$ and $O(\max\{\epsilon^{-2}\rho^{-3}\gamma^{-3}, \rho^{-5}\gamma^{-5}\})$, respectively.

7. Conclusion

In this paper, we studied the problem of finding an $(\epsilon,\gamma)$-second-order stationary point (SOSP) of a generic smooth constrained nonconvex minimization problem.
We proposed a procedure that obtains an $(\epsilon,\gamma)$-SOSP after at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations if the constraint set $C$ is such that one can find a $\rho$-approximate solution of a quadratic optimization problem with the constraint set $C$. We showed that our results hold for the case that $C$ is a set of linear constraints or has the form of a finite number of convex quadratic constraints. We further extended our results to the stochastic setting and characterized the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.

8. Appendix

8.1 The linear constraints case

Consider the quadratic program in (14) for the case that the convex constraint set $C$ is defined as $C = \{x\in\mathbb{R}^d \mid Ax \le b\}$, where the matrix $A\in\mathbb{R}^{m\times d}$ and the vector $b\in\mathbb{R}^m$ are given. In this case, the QP in (14) is equivalent to

$$\min_u \ (u-x_t)^\top\nabla^2 f(x_t)(u-x_t) \quad \text{s.t.} \quad Au \le b, \quad -\epsilon \le \nabla f(x_t)^\top(u-x_t) \le 0. \tag{32}$$

Define a new variable $z := u - x_t$ and the vector $\hat{b} = b - Ax_t$ to rewrite the problem as

$$\min_z \ z^\top\nabla^2 f(x_t) z \quad \text{s.t.} \quad Az \le \hat{b}, \quad -\epsilon \le \nabla f(x_t)^\top z \le 0. \tag{33}$$

Then, we can simply write the extra linear inequalities as $-\nabla f(x_t)^\top z \le \epsilon$ and $\nabla f(x_t)^\top z \le 0$. Using these modifications we can write the problem in (33) as

$$\min_y \ y^\top\nabla^2 f(x_t) y \quad \text{s.t.} \quad \hat{A}y \le \hat{b}, \tag{34}$$

where $\hat{A}\in\mathbb{R}^{(m+2)\times d}$ and $\hat{b}\in\mathbb{R}^{m+2}$. The QP in (34) in general could be NP-hard; however, in polynomial time one can find an approximate solution for this program as we state in the following proposition.

Proposition 14 Consider the optimization problem in (34). There exists a polynomial time algorithm, based on solving a ball-constrained quadratic problem, to compute a $(1/m^2)$-approximate solution (Vavasis, 1991; Ye, 1992). Further, if the constraint is the polytope $\{y \mid -e \le y \le e\}$ for some $e\in\mathbb{R}^d$, there exists a polynomial time algorithm with complexity $O(d^3\log(\log(d)))$ which reaches a $(1/m)$-approximate solution (Fu et al., 1998, Section 4).

8.2 Proof of Proposition 2

The claim in (3) follows from Proposition 2.1.2 in (Bertsekas, 1999). The proof of the claim in (4) is similar to the proof of Proposition 2.1.2 in (Bertsekas, 1999), and we mention it for completeness. We prove the claim in (4) by contradiction. Suppose that $(x-x^*)^\top\nabla^2 f(x^*)(x-x^*) < 0$ for some $x\in C$ satisfying $\nabla f(x^*)^\top(x-x^*) = 0$. By the mean value theorem, for any $\epsilon > 0$ there exists an $\alpha\in[0,1]$ such that

$$f(x^*+\epsilon(x-x^*)) = f(x^*) + \epsilon\nabla f(x^*)^\top(x-x^*) + \frac{\epsilon^2}{2}(x-x^*)^\top\nabla^2 f(x^*+\alpha\epsilon(x-x^*))(x-x^*). \tag{35}$$

Use the relation $\nabla f(x^*)^\top(x-x^*) = 0$ to simplify the right-hand side to

$$f(x^*+\epsilon(x-x^*)) = f(x^*) + \frac{\epsilon^2}{2}(x-x^*)^\top\nabla^2 f(x^*+\alpha\epsilon(x-x^*))(x-x^*). \tag{36}$$

Note that since $(x-x^*)^\top\nabla^2 f(x^*)(x-x^*) < 0$ and the Hessian is continuous, we have for all sufficiently small $\epsilon > 0$ that $(x-x^*)^\top\nabla^2 f(x^*+\alpha\epsilon(x-x^*))(x-x^*) < 0$. This observation and the expression in (36) imply that for sufficiently small $\epsilon$ we have $f(x^*+\epsilon(x-x^*)) < f(x^*)$. Note that the point $x^*+\epsilon(x-x^*)$ belongs to the set $C$ for all $\epsilon\in[0,1]$ and satisfies the equality $\nabla f(x^*)^\top((x^*+\epsilon(x-x^*)) - x^*) = 0$. Therefore, we obtain a contradiction with the local optimality of $x^*$.
8.3 Proof of Proposition 6

First consider the definition $G(x_t) = \max_{x\in C}\{-\nabla f(x_t)^\top(x-x_t)\}$, which is also known as the Frank-Wolfe gap (Lacoste-Julien, 2016). This constant measures how close the point $x_t$ is to being a first-order stationary point. If $G(x_t) \le \epsilon$, then $x_t$ is an $\epsilon$-first-order stationary point. Let us assume that $G(x_t) > \epsilon$. Then, based on the Lipschitz continuity of the gradients and the definition of $G(x_t)$ we can write

$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top(x_{t+1}-x_t) + \frac{L}{2}\|x_{t+1}-x_t\|^2 \\ &= f(x_t) + \eta\nabla f(x_t)^\top(v_t-x_t) + \frac{L\eta^2}{2}\|v_t-x_t\|^2 \\ &\le f(x_t) - \eta G(x_t) + \frac{\eta^2D^2L}{2}, \end{aligned} \tag{37}$$

where the last inequality follows from $\|v_t-x_t\| \le D$. Replacing the stepsize $\eta$ by its value $\epsilon/(D^2L)$ and $G(x_t)$ by its lower bound $\epsilon$ leads to

$$f(x_{t+1}) \le f(x_t) - \frac{\epsilon^2}{2D^2L}. \tag{38}$$

This result implies that if the current point $x_t$ is not an $\epsilon$-first-order stationary point, by following the update of the Frank-Wolfe algorithm the objective function value decreases by $\epsilon^2/(2D^2L)$. Therefore, after at most $2D^2L(f(x_0)-f(x^*))/\epsilon^2$ iterations we either reach the global minimum or one of the iterates $x_t$ satisfies $G(x_t) \le \epsilon$, which implies that

$$\nabla f(x_t)^\top(x-x_t) \ge -\epsilon, \quad \text{for all } x\in C, \tag{39}$$

and the claim in Proposition 6 follows.

8.4 Proof of Proposition 8

First, note that based on the projection property we know that

$$(x_t - \eta\nabla f(x_t) - x_{t+1})^\top(x-x_{t+1}) \le 0, \quad \text{for all } x\in C. \tag{40}$$

Therefore, by setting $x = x_t$ we obtain that

$$\eta\nabla f(x_t)^\top(x_{t+1}-x_t) \le -\|x_t-x_{t+1}\|^2. \tag{41}$$

Hence, we can replace the inner product $\nabla f(x_t)^\top(x_{t+1}-x_t)$ by its upper bound $-\|x_t-x_{t+1}\|^2/\eta$ to obtain

$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top(x_{t+1}-x_t) + \frac{L}{2}\|x_{t+1}-x_t\|^2 \\ &\le f(x_t) - \frac{\|x_t-x_{t+1}\|^2}{\eta} + \frac{L}{2}\|x_{t+1}-x_t\|^2 \\ &= f(x_t) - \frac{L}{2}\|x_{t+1}-x_t\|^2, \end{aligned} \tag{42}$$

where the equality follows by setting $\eta = 1/L$. Indeed, if $x_{t+1} = x_t$ then we are at a first-order stationary point; however, we need a finite time analysis. To do so, note that for any $x\in C$ we have

$$(x_t - \eta\nabla f(x_t) - x_{t+1})^\top(x-x_{t+1}) \le 0. \tag{43}$$

Therefore, for any $x\in C$ it holds that

$$\nabla f(x_t)^\top(x-x_{t+1}) \ge L(x_t-x_{t+1})^\top(x-x_{t+1}), \tag{44}$$

which implies that

$$\begin{aligned} \nabla f(x_t)^\top(x-x_t) &\ge \nabla f(x_t)^\top(x_{t+1}-x_t) + L(x_t-x_{t+1})^\top(x-x_{t+1}) \\ &\ge -K\|x_{t+1}-x_t\| - LD\|x_t-x_{t+1}\| \\ &= -(K+LD)\|x_t-x_{t+1}\|, \end{aligned} \tag{45}$$

where $K$ is an upper bound on the norm of the gradient over the convex set $C$. Therefore, we can write

$$\min_{x\in C}\nabla f(x_t)^\top(x-x_t) \ge -(K+LD)\|x_t-x_{t+1}\|. \tag{46}$$

Combining these results, we obtain that we should compute the norm $\|x_t-x_{t+1}\|$ at each iteration and check whether it is larger than $\epsilon/(K+LD)$ or not. If the norm is larger than the threshold then

$$f(x_{t+1}) \le f(x_t) - \frac{\epsilon^2L}{2(K+LD)^2}. \tag{47}$$

If the norm is smaller than the threshold then we stop and the iterate $x_t$ satisfies the inequality

$$\nabla f(x_t)^\top(x-x_t) \ge -\epsilon, \quad \text{for all } x\in C. \tag{48}$$

Note that this process cannot take more than $O\left(\frac{f(x_0)-f(x^*)}{\epsilon^2}\right)$ iterations.

8.5 Proof of Proposition 10

The Taylor expansion of the function $f$ around the point $x_t$ and the $M$-Lipschitz continuity of the Hessians imply that

$$f(x_{t+1}) \le f(x_t) + \nabla f(x_t)^\top(x_{t+1}-x_t) + \frac{1}{2}(x_{t+1}-x_t)^\top\nabla^2 f(x_t)(x_{t+1}-x_t) + \frac{M}{6}\|x_{t+1}-x_t\|^3. \tag{49}$$

Replace $x_{t+1}-x_t$ by the expression $\sigma(u_t-x_t)$ to obtain

$$f(x_{t+1}) \le f(x_t) + \sigma\nabla f(x_t)^\top(u_t-x_t) + \frac{\sigma^2}{2}(u_t-x_t)^\top\nabla^2 f(x_t)(u_t-x_t) + \frac{M\sigma^3}{6}\|u_t-x_t\|^3. \tag{50}$$

Since $u_t$ is a $\rho$-approximate solution for the subproblem in (14) with the objective function value $q(u_t) < -\rho\gamma$, we can substitute the quadratic term $(u_t-x_t)^\top\nabla^2 f(x_t)(u_t-x_t)$ by its upper bound $-\rho\gamma$. Additionally, the vector $u_t$ is chosen such that $\nabla f(x_t)^\top(u_t-x_t) \le 0$ and therefore the linear term in (50) can be eliminated. Further, the cubic term $\|u_t-x_t\|^3$ is upper bounded by $D^3$ since both $u_t$ and $x_t$ belong to the convex set $C$. Applying these substitutions into (50) yields

$$f(x_{t+1}) \le f(x_t) - \frac{\sigma^2\rho\gamma}{2} + \frac{\sigma^3MD^3}{6}. \tag{51}$$

By setting $\sigma = \rho\gamma/(MD^3)$ in (51) it follows that

$$f(x_{t+1}) \le f(x_t) - \frac{\rho^3\gamma^3}{2M^2D^6} + \frac{\rho^3\gamma^3}{6M^2D^6} = f(x_t) - \frac{\rho^3\gamma^3}{3M^2D^6}. \tag{52}$$

Therefore, in this case, the objective function value decreases at least by a fixed value of $O(\rho^3\gamma^3)$.

8.6 Proof of Theorem 5

At each iteration, either the first-order optimality condition is not satisfied and the function value decreases by a constant of $O(\epsilon^2)$, or this condition is satisfied and we use a second-order update which leads to an objective function value decrement of $O(\rho^3\gamma^3)$.
This shows that if the algorithm has not reached an $(\epsilon,\gamma)$-second-order stationary point the objective function value decreases at least by $O(\min\{\epsilon^2, \rho^3\gamma^3\})$. Therefore, we either reach the global minimum or converge to an $(\epsilon,\gamma)$-second-order stationary point of Problem (1) after at most $O\left(\frac{f(x_0)-f(x^*)}{\min\{\epsilon^2, \rho^3\gamma^3\}}\right)$ iterations, which can also be written as $O((f(x_0)-f(x^*))\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$.
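For concreteness, the bound can be evaluated with the explicit constants of the Frank-Wolfe variant, namely the per-iteration decreases $\epsilon^2/(2D^2L)$ from (21) and $\rho^3\gamma^3/(3M^2D^6)$ from (25). The function name `iteration_bound` and the numerical values below are illustrative assumptions only.

```python
import math

def iteration_bound(f_gap, eps, gamma, rho, L, M, D):
    # f_gap stands for f(x_0) - f(x^*). Each iteration decreases f by at least
    # min{eps^2/(2 D^2 L), rho^3 gamma^3/(3 M^2 D^6)}, so the number of
    # iterations is at most f_gap times the larger of the two reciprocals.
    first_order = 2 * D ** 2 * L / eps ** 2
    second_order = 3 * M ** 2 * D ** 6 / (rho * gamma) ** 3
    return math.ceil(f_gap * max(first_order, second_order))

# Hypothetical problem constants.
bound = iteration_bound(f_gap=3.0, eps=0.1, gamma=0.5, rho=1.0, L=1.0, M=2.0, D=2.0)
```

With these values the second-order term dominates ($6144$ against roughly $800$), giving a bound of $18432$ iterations. This illustrates how quickly the $\rho^{-3}\gamma^{-3}$ factor can take over when $\gamma$ is small or the diameter $D$ is large.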

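8.7 Proof of Theorem 13

In this proof, for notational convenience, we define $\epsilon' = \epsilon/2$ and $\gamma' = \gamma/2$. First, note that the condition in Assumption 4 and the fact that $\nabla F(x,\theta)$ and $\nabla^2 F(x,\theta)$ are unbiased estimators of the gradient $\nabla f(x)$ and Hessian $\nabla^2 f(x)$ imply that the variances of the batch gradient $d_t$ and the batch Hessian $H_t$ approximations are upper bounded by

$$\mathbb{E}\left[\|d_t - \nabla f(x_t)\|^2\right] \le \frac{\nu^2}{b_g}, \qquad \mathbb{E}\left[\|H_t - \nabla^2 f(x_t)\|^2\right] \le \frac{\xi^2}{b_H}. \tag{53}$$

Here we assume that $b_g$ and $b_H$ satisfy the following conditions,

$$b_g = \max\left\{\frac{324\nu^2M^2D^8}{\rho^4\gamma'^4}, \frac{16D^2\nu^2}{\epsilon'^2}\right\}, \qquad b_H = \frac{81D^4\xi^2}{\rho^2\gamma'^2}. \tag{54}$$

We further set the parameter $r$ as

$$r = \frac{\rho^2\gamma'^2}{18MD^3}. \tag{55}$$

Now we proceed to analyze the complexity of Algorithm 2. First, consider the case that the current iterate $x_t$ satisfies the inequality $d_t^\top(v_t-x_t) < -\epsilon'$ and therefore we perform the first-order update in step 4. In this case, we can show that

$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top(x_{t+1}-x_t) + \frac{L}{2}\|x_{t+1}-x_t\|^2 \\ &= f(x_t) + \eta\nabla f(x_t)^\top(v_t-x_t) + \frac{\eta^2L}{2}\|v_t-x_t\|^2 \\ &\le f(x_t) + \eta d_t^\top(v_t-x_t) + \eta(\nabla f(x_t)-d_t)^\top(v_t-x_t) + \frac{\eta^2LD^2}{2} \\ &\le f(x_t) - \eta\epsilon' + \eta D\|\nabla f(x_t)-d_t\| + \frac{\eta^2LD^2}{2}, \end{aligned} \tag{56}$$

where in the last inequality we used $d_t^\top(v_t-x_t) < -\epsilon'$ and the fact that both $v_t$ and $x_t$ belong to the set $C$ and therefore $\|x_t-v_t\| \le D$. Consider $\mathcal{F}_t$ as the sigma algebra that measures all sources of randomness up to step $t$. Then, computing the expected value of both sides of (56) given $\mathcal{F}_t$ leads to

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \le f(x_t) - \eta\epsilon' + \frac{\eta D\nu}{\sqrt{b_g}} + \frac{\eta^2LD^2}{2}, \tag{57}$$

where we used the inequality $\mathbb{E}[X] \le \sqrt{\mathbb{E}[X^2]}$ when $X$ is a positive random variable. Replace the stepsize $\eta$ by its value $\epsilon'/(D^2L)$ and the batch size $b_g$ by its lower bound $16D^2\nu^2/\epsilon'^2$ to obtain

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \le f(x_t) - \frac{\epsilon'^2}{4D^2L}. \tag{58}$$

Hence, in this case, the objective function value decreases in expectation by a constant of $O(\epsilon'^2)$.

Now we proceed to study the case that the current iterate $x_t$ does not satisfy the inequality $d_t^\top(v_t-x_t) < -\epsilon'$ and we need to perform the second-order update in step 8. In this case, we can show that

$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \nabla f(x_t)^\top(x_{t+1}-x_t) + \frac{1}{2}(x_{t+1}-x_t)^\top\nabla^2 f(x_t)(x_{t+1}-x_t) + \frac{M}{6}\|x_{t+1}-x_t\|^3 \\ &\le f(x_t) + \sigma\nabla f(x_t)^\top(u_t-x_t) + \frac{\sigma^2}{2}(u_t-x_t)^\top\nabla^2 f(x_t)(u_t-x_t) + \frac{\sigma^3MD^3}{6} \\ &= f(x_t) + \sigma d_t^\top(u_t-x_t) + \sigma(\nabla f(x_t)-d_t)^\top(u_t-x_t) + \frac{\sigma^2}{2}(u_t-x_t)^\top H_t(u_t-x_t) \\ &\quad + \frac{\sigma^2}{2}(u_t-x_t)^\top(\nabla^2 f(x_t)-H_t)(u_t-x_t) + \frac{\sigma^3MD^3}{6}. \end{aligned} \tag{59}$$

Note that $u_t$ is a $\rho$-approximate solution for the subproblem in step 6 of Algorithm 2, with the objective function value less than $-\rho\gamma'$. This observation implies that the quadratic term $(u_t-x_t)^\top H_t(u_t-x_t)$ is bounded above by $-\rho\gamma'$. Further, the linear term $d_t^\top(u_t-x_t)$ is less than $r$ according to the constraint of the subproblem. Applying these substitutions and using the Cauchy-Schwarz inequality multiple times leads to

$$f(x_{t+1}) \le f(x_t) + \sigma r + \sigma D\|d_t-\nabla f(x_t)\| - \frac{\sigma^2\rho\gamma'}{2} + \frac{\sigma^2D^2}{2}\|H_t-\nabla^2 f(x_t)\| + \frac{\sigma^3MD^3}{6}. \tag{60}$$

Compute the conditional expected value of both sides of (60) and use the inequalities in (53) to obtain

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \le f(x_t) + \sigma r + \frac{\sigma D\nu}{\sqrt{b_g}} - \frac{\sigma^2\rho\gamma'}{2} + \frac{\sigma^2D^2\xi}{2\sqrt{b_H}} + \frac{\sigma^3MD^3}{6}. \tag{61}$$

By setting the stepsize $\sigma = \rho\gamma'/(MD^3)$ in (61) it follows that

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \le f(x_t) - \frac{\rho^3\gamma'^3}{3M^2D^6} + \frac{r\rho\gamma'}{MD^3} + \frac{\rho\gamma'\nu}{MD^2\sqrt{b_g}} + \frac{\rho^2\gamma'^2\xi}{2M^2D^4\sqrt{b_H}}. \tag{62}$$

Moreover, setting $r = \frac{\rho^2\gamma'^2}{18MD^3}$ and $b_H = \frac{81D^4\xi^2}{\rho^2\gamma'^2}$, and replacing $b_g$ by its lower bound $\frac{324\nu^2M^2D^8}{\rho^4\gamma'^4}$, makes each of the last three terms in (62) at most $\frac{\rho^3\gamma'^3}{18M^2D^6}$, which leads to

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \le f(x_t) - \frac{\rho^3\gamma'^3}{6M^2D^6}. \tag{63}$$

Hence, in this case, the expected objective function value decreases by a constant of $O(\rho^3\gamma'^3)$. By combining the results in (58) and (63), we obtain that if the iterate $x_t$ is not the final iterate the objective function value at step $t+1$ satisfies the following inequality

$$\mathbb{E}[f(x_{t+1}) \mid \mathcal{F}_t] \le f(x_t) - \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2D^6}\right\}. \tag{64}$$

Let us define $T$ as the number of iterations we perform until Algorithm 2 stops. We use an argument similar to Wald's lemma to derive an upper bound on the expected number of iterations $T$ that we need to run the algorithm. Note that

$$\begin{aligned} \mathbb{E}[f(x_0)-f(x_T)] &= \mathbb{E}\left[\sum_{t=1}^{T}(f(x_{t-1})-f(x_t))\right] \\ &= \mathbb{E}\left[\mathbb{E}\left[\sum_{t=1}^{T}(f(x_{t-1})-f(x_t)) \,\Big|\, T\right]\right] \\ &= \sum_{k=1}^{\infty}\mathbb{E}\left[\sum_{t=1}^{k}(f(x_{t-1})-f(x_t))\right]P(T=k) \\ &= \sum_{k=1}^{\infty}\sum_{t=1}^{k}\mathbb{E}[f(x_{t-1})-f(x_t)]\,P(T=k) \\ &\ge \sum_{k=1}^{\infty}\sum_{t=1}^{k}\min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2D^6}\right\}P(T=k) \\ &= \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2D^6}\right\}\sum_{k=1}^{\infty}k\,P(T=k) \\ &= \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2D^6}\right\}\mathbb{E}[T]. \end{aligned} \tag{65}$$

The first equality holds by telescoping the sum, for the second equality we use the fact that $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$, in the third equality we use the expression $\mathbb{E}[\mathbb{E}[X \mid Y]] = \sum_y \mathbb{E}[X \mid Y=y]P(Y=y)$, in the fourth equality we exchange sum and expectation, and the inequality is true based on the result in (64). Note that to derive this result we have also assumed that the function differences $f(x_{t-1})-f(x_t)$ are independent of each other and also independent of the total number of iterations $T$.

Based on the result in (65), we can write that $\mathbb{E}[T] \le \mathbb{E}[f(x_0)-f(x_T)] / \min\left\{\frac{\epsilon'^2}{4LD^2}, \frac{\rho^3\gamma'^3}{6M^2D^6}\right\}$. We further know that $f(x_T) \ge f(x^*)$, which implies that

$$\mathbb{E}[T] \le (f(x_0)-f(x^*))\max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2D^6}{\rho^3\gamma'^3}\right\}. \tag{66}$$

Using Markov's inequality we can show that

$$P(T \le a) \ge 1 - \frac{(f(x_0)-f(x^*))\max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2D^6}{\rho^3\gamma'^3}\right\}}{a}. \tag{67}$$

Set $a = \frac{f(x_0)-f(x^*)}{\delta}\max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2D^6}{\rho^3\gamma'^3}\right\}$ to obtain that

$$P\left(T \le \frac{f(x_0)-f(x^*)}{\delta}\max\left\{\frac{4LD^2}{\epsilon'^2}, \frac{6M^2D^6}{\rho^3\gamma'^3}\right\}\right) \ge 1-\delta. \tag{68}$$

Therefore, it follows that with high probability the total number of iterations $T$ that Algorithm 2 runs is at most $O(\max\{\epsilon'^{-2}, \rho^{-3}\gamma'^{-3}\})$. As $\epsilon' = \epsilon/2$ and $\gamma' = \gamma/2$, we obtain that the overall number of iterations is at most $O(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ with high probability.

Now it remains to show that the outcome of Algorithm 2 is an $(\epsilon,\gamma)$-SOSP of Problem (30) with high probability. Let us assume that $x_t$ is the final output of Algorithm 2. Then, we know that $x_t$ satisfies the conditions

$$d_t^\top(x-x_t) \ge -\epsilon' \quad \text{for all } x\in C, \tag{69}$$

and

$$(x-x_t)^\top H_t(x-x_t) \ge -\gamma' \quad \text{for all } x\in C \text{ with } -\epsilon'-r \le d_t^\top(x-x_t) \le r. \tag{70}$$

First, we use the condition in (69) to show that $x_t$ satisfies the first-order optimality condition with high probability. Note that for any $x\in C$ it holds that

$$\nabla f(x_t)^\top(x-x_t) = d_t^\top(x-x_t) + (\nabla f(x_t)-d_t)^\top(x-x_t) \ge d_t^\top(x-x_t) - D\|\nabla f(x_t)-d_t\|. \tag{71}$$

Now compute the minimum of both sides of (71) over all $x\in C$ to obtain

$$\begin{aligned} \min_{x\in C}\{\nabla f(x_t)^\top(x-x_t)\} &\ge \min_{x\in C}\{d_t^\top(x-x_t) - D\|\nabla f(x_t)-d_t\|\} \\ &= \min_{x\in C}\{d_t^\top(x-x_t)\} - D\|\nabla f(x_t)-d_t\| \\ &\ge -\epsilon' - D\|\nabla f(x_t)-d_t\|, \end{aligned} \tag{72}$$

where the equality holds since $D\|\nabla f(x_t)-d_t\|$ does not depend on $x$, and the last inequality is implied by (69). Since $\mathbb{E}[\|\nabla f(x_t)-d_t\|^2] \le \nu^2/b_g$ we obtain from Markov's inequality that, for any $\epsilon'' > 0$,

$$P\left(\|\nabla f(x_t)-d_t\| \le \epsilon''\right) \ge 1 - \frac{\nu^2}{b_g\epsilon''^2}. \tag{73}$$

Therefore, by combining the results in (72) and (73) we obtain that

$$P\left(\min_{x\in C}\{\nabla f(x_t)^\top(x-x_t)\} \ge -(\epsilon'+D\epsilon'')\right) \ge 1 - \frac{\nu^2}{b_g\epsilon''^2}. \tag{74}$$

Now by setting $\epsilon'' = \epsilon'/D$ it follows from (74) that with probability at least $1 - \nu^2D^2/(b_g\epsilon'^2)$ the final iterate $x_t$ satisfies

$$\nabla f(x_t)^\top(x-x_t) \ge -2\epsilon' \quad \text{for all } x\in C. \tag{75}$$

Replacing $\epsilon'$ by $\epsilon/2$ leads to

$$\nabla f(x_t)^\top(x-x_t) \ge -\epsilon \quad \text{for all } x\in C. \tag{76}$$

It remains to show that with high probability the final iterate $x_t$ satisfies the second-order optimality condition. First, consider the sets

$$A_t = \{x\in C \mid -\epsilon' \le \nabla f(x_t)^\top(x-x_t) \le 0\} \quad \text{and} \quad B_t = \{x\in C \mid -\epsilon'-r \le d_t^\top(x-x_t) \le r\}. \tag{77}$$

We proceed to show that with high probability $A_t \subseteq B_t$. If $y$ satisfies the condition

$$-\epsilon' \le \nabla f(x_t)^\top(y-x_t) \le 0, \tag{78}$$

then it can be shown that

$$d_t^\top(y-x_t) = \nabla f(x_t)^\top(y-x_t) + (d_t-\nabla f(x_t))^\top(y-x_t) \le D\|d_t-\nabla f(x_t)\|. \tag{79}$$

Since $\mathbb{E}[\|\nabla f(x_t)-d_t\|^2] \le \nu^2/b_g$ we obtain from Markov's inequality that

$$P\left(\|\nabla f(x_t)-d_t\| \le \frac{r}{D}\right) \ge 1 - \frac{\nu^2D^2}{b_g r^2}. \tag{80}$$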