arxiv: v1 [cs.lg] 17 Nov 2017

Similar documents
arxiv: v4 [math.oc] 11 Jun 2018

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

arxiv: v4 [math.oc] 24 Apr 2017

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

Mini-Course 1: SGD Escapes Saddle Points

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

arxiv: v4 [math.oc] 5 Jan 2016

SVRG Escapes Saddle Points

Neural Network Training

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

SVRG++ with Non-uniform Sampling

Adaptive Negative Curvature Descent with Applications in Non-convex Optimization

Advanced computational methods X Selected Topics: SGD

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

A Subsampling Line-Search Method with Second-Order Results

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

arxiv: v2 [math.oc] 1 Nov 2017

Noisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get

Introduction to gradient descent

Large-scale Stochastic Optimization

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

How to Characterize the Worst-Case Performance of Algorithms for Nonconvex Optimization

Accelerating Stochastic Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Linear Discrimination Functions

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

Nonlinear Optimization Methods for Machine Learning

Overparametrization for Landscape Design in Non-convex Optimization

Stochastic and online algorithms

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Stochastic Cubic Regularization for Fast Nonconvex Optimization

Summary and discussion of: Dropout Training as Adaptive Regularization

Unconstrained optimization

arxiv: v1 [math.oc] 1 Jul 2016

Stochastic Gradient Descent with Variance Reduction

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Day 3 Lecture 3. Optimizing deep networks

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

Gradient Descent. Dr. Xiaowei Huang

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

arxiv: v2 [math.oc] 5 May 2018

Sub-Sampled Newton Methods for Machine Learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

Stochastic Optimization

Lecture 2: From Classical to Quantum Model of Computation

arxiv: v3 [math.oc] 8 Jan 2019

Optimization Tutorial 1. Basic Gradient Descent

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

On the fast convergence of random perturbations of the gradient flow.

A summary of Deep Learning without Poor Local Minima

Linear Regression (continued)

U.C. Berkeley Better-than-Worst-Case Analysis Handout 3 Luca Trevisan May 24, 2018

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

Overview of gradient descent optimization algorithms. HYUNG IL KOO Based on

STAT 200C: High-dimensional Statistics

Non-convex optimization. Issam Laradji

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

A random perturbation approach to some stochastic approximation algorithms in optimization.

Stochastic Quasi-Newton Methods

Generalization theory

Full-information Online Learning

Lecture 1: Supervised Learning

ECS171: Machine Learning

1 What a Neural Network Computes

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Adaptive Online Gradient Descent

1 Lyapunov theory of stability

Optimization and Gradient Descent

Machine Learning. Lecture 9: Learning Theory. Feng Li.

The Kernel Trick, Gram Matrices, and Feature Extraction. CS6787 Lecture 4 Fall 2017

arxiv: v2 [stat.ml] 24 Apr 2017

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

Convex Optimization Lecture 16

Linear Regression. CSL603 - Fall 2017 Narayanan C Krishnan

Linear Models for Regression

Linear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan

Introduction to Machine Learning (67577) Lecture 7

Optimization methods

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Comparison of Modern Stochastic Optimization Algorithms

Lecture 6 Optimization for Deep Neural Networks

Big Data Analytics: Optimization and Randomization

Overfitting, Bias / Variance Analysis

Comments. Assignment 3 code released. Thought questions 3 due this week. Mini-project: hopefully you have started. implement classification algorithms

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Sub-Sampled Newton Methods

Proximal and First-Order Methods for Convex Optimization

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Lecture 8. Instructor: Haipeng Luo

Coordinate Descent and Ascent Methods

Optimality Conditions for Constrained Optimization

FORMULATION OF THE LEARNING PROBLEM

A strongly polynomial algorithm for linear systems having a binary solution

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Active Learning: Disagreement Coefficient

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC

Transcription:

Neon: Finding Local Minima via First-Order Oracles (version ) Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research Yuanzhi Li yuanzhil@cs.princeton.edu Princeton University arxiv:7.06673v [cs.lg] 7 Nov 07 November 7, 07 Abstract We propose a reduction for non-convex optimization that can () turn a stationary-point finding algorithm into a local-minimum finding one, and () replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm s performance. As applications, our reduction turns Natasha into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding algorithms outperforming some best known results. Introduction Nonconvex optimization has become increasing popular due its ability to capture modern machine learning tasks in large scale. Most notably, training deep neural networks corresponds to minimizing a function f(x) = n n i= f i(x) over x R d that is non-convex, where each training sample i corresponds to one loss function f i ( ) in the summation. This average structure allows one to perform stochastic gradient descent (SGD) which uses a random f i (x) corresponding to computing backpropagation once to approximate f(x) and performs descent updates. Motivated by such large-scale machine learning applications, we wish to design faster first-order non-convex optimization methods that outperform the performance of gradient descent, both in the online and offline settings. In this paper, we say an algorithm is online if its complexity is independent of n (so n can be infinite), and offline otherwise. In recently years, researchers across different communities have gathered together to tackle this challenging question. By far, known theoretical approaches mostly fall into one of the following two categories. First-order methods for stationary points. In analyzing first-order methods, we denote by gradient complexity T the number of computations of f i (x). To achieve an ε-approximate stationary point namely, a point x with f(x) ε it is a folklore that gradient descent (GD) is offline and needs T O ( ) n ε, while stochastic gradient decent (SGD) is online and needs T O ( ) ( ε. In recent years, the offline complexity has been improved to T O n /3 ) 4 ε by the The result of this paper was briefly discussed at a Berkeley Simons workshop on Oct 6 and internally presented at Microsoft on Oct 30. We started to prepare this manuscript on Nov, after being informed of the independent and similar work of Xu and Yang [8]. Their result appeared on arxiv on Nov 3. To respect the fact that their work appeared online before us, we have adopted their algorithm name Neon and called our new algorithm Neon. We encourage readers citing this work to also cite [8].

SVRG method [3, 3], and the online complexity has been improved to T O ( ε 0/3 ) by the SCSG method [8]. Both of them rely on the so-called variance-reduction technique, originally discovered for convex problems [, 6, 4, 6]. These algorithms SVRG and SCSG are only capable of finding stationary points, which may not necessarily be approximate local minima and are arguably bad solutions for neural-network training [9, 0, 4]. Therefore, can we turn stationary-point finding algorithms into local-minimum finding ones? Hessian-vector methods for local minima. It is common knowledge that using information about the Hessian, one can find ε-approximate local minima namely, a point x with f(x) ε and also f(x) ε /C I. In 006, Nesterov and Polyak [0] showed that one can find an ε- approximate in O( ) iterations, but each iteration requires an (offline) computation as heavy as ε.5 inverting the matrix f(x). To fix this issue, researchers propose to study the so-called Hessian-free methods that, in addition to gradient computations, also compute Hessian-vector products. That is, instead of using the full matrix f i (x) or f(x), these methods also compute f i (x) v for indices i and vectors v. For Hessian-free methods, we denote by gradient complexity T the number of computations of f i (x) plus that of f i (x) v. The hope of using Hessian-vector products is to improve the complexity T as a function of ε. Such improvement was first shown possible independently by [, 7] for the offline setting, with complexity T ( ) n + n3/4 ε.5 ε so is better than that of gradient descent. In the online setting, the.75 first improvement was by Natasha which gives complexity T ( ) ε []. 3.5 Unfortunately, it is argued by some researchers that Hessian-vector products are not general enough and may not be as simple to implement as evaluating gradients [8]. Therefore, can we turn Hessian-free methods into first-order ones, without hurting their performance?. From Hessian-Vector Products to First-Order Methods Recall by inition of derivative we have f i (x) v = lim q 0 { f i(x+qv) f i (x) q }. Given any Hessian-free method, at least at a high level, can we replace every occurrence of f i (x) v with w = f i(x+qv) f i (x) q for some small q > 0? Note the error introduced in this approximation is f i (x) v w q v. Therefore, as long as the original algorithm is sufficiently stable to adversarial noise, and as long as q is small enough, this can convert Hessian-free algorithms into first-order ones. In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search) subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art Hessian-free methods that have rigorous proofs (see [,, 7]). It solves the following simple task: negative-curvature search (NC-search) given point x 0, decide if f(x 0 ) δi or find a unit vector v such that v f(x 0 )v δ. Online Setting. In the online setting, NC-search can be solved by Oja s algorithm [] which We say A δi if all the eigenvalues of A are no smaller than δ. In this high-level introduction, we focus only on the case when δ = ε /C for some constant C. Hessian-free methods are useful because when f i( ) is explicitly given, computing its gradient is in the same complexity as computing its Hessian-vector product [, 5], using backpropagation.

costs Õ(/δ ) computations of Hessian-vector products (first proved in [6] and applied to NC-search in []). In this paper, we propose a method Neon online which solves the NC-search problem via only stochastic first-order updates. That is, starting from x = x 0 + ξ where ξ is some random perturbation, we keep updating x t+ = x t η( f i (x t ) f i (x 0 )). In the end, the vector x T x 0 gives us enough information about the negative curvature. Theorem (informal). Our Neon online algorithm solves NC-search using Õ(/δ ) stochastic gradients, without Hessian-vector product computations. This complexity Õ(/δ ) matches that of Oja s algorithm, and is information-theoretically optimal (up to log factors), see the lower bound in [6]. We emphasize that the independent work Neon by Xu and Yang [8] is actually the first recorded theoretical result that proposed this approach. However, Neon needs Õ(/δ3 ) stochastic gradients, because it uses full gradient descent to find NC (on a sub-sampled objective) inspired by [5] and the power method; instead, Neon online uses stochastic gradients and is based on our prior work on Oja s algorithm [6]. By plugging Neon online into Natasha [], we achieve the following corollary (see Figure (c)): Theorem (informal). Neon online turns Natasha into a stochastic first-order method, without hurting its performance. That is, it finds an (ε, δ)-approximate local minimum in T = Õ( + ε 3.5 ε 3 δ + ) δ stochastic gradient computations, without Hessian-vector product computations. 5 (We say x is an approximate local minimum if f(x) ε and f(x) δi.) Offline Setting. There are a number of ways to solve the NC-search problem in the offline setting using Hessian-vector products. Most notably, power method uses Õ(n/δ) computations of Hessianvector products, Lanscoz method [7] uses Õ(n/ δ) computations, and shift-and-invert [] on top of SVRG [6] (that we call SI+SVRG) uses Õ(n + n3/4 / δ) computations. In this paper, we convert Lanscoz s method and SI+SVRG into first-order ones: Theorem 3 (informal). Our Neon det algorithm solves NC-search using Õ(/ δ) full gradients (or equivalently Õ(n/ δ) stochastic gradients), and our Neon svrg solves NC-search using Õ(n + n 3/4 / δ) stochastic gradients. We emphasize that, although analyzed in the online setting only, the work Neon by Xu and Yang [8] also applies to the offline setting, and seems to be the first result to solve NC-search using first-order gradients with a theoretical proof. However, Neon uses Õ(/δ) full gradients instead of Õ(/ δ). Their approach is inspired by [5], but our Neon det is based on Chebyshev approximation theory (see textbook [7]) and its recent stability analysis [5]. By putting Neon det and Neon svrg into the CDHS method of Carmon et al. [7], we have 3 Theorem 4 (informal). Neon det turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)-approximate local minimum in Õ( + ) ε.75 δ full gradient computations. 3.5 Neon svrg turns CDHS into a first-order method without hurting its performance: it finds an (ε, δ)- approximate local minimum in T = Õ( n + n ) + n3/4 + n3/4 ε.5 δ 3 ε.75 δ stochastic gradient computations. 3.5 3 We note that the original paper of CDHS only proved such complexity results (although requiring Hessian-vector products) for the special case of δ ε /. In such a case, it requires either Õ( ) ε full gradient computations or.75 Õ ( ) n + n3/4 ε.5 ε stochastic gradient computations..75 3

T T=δ -5 ε -5 T=δ -7 Neon+SGD Neon+SGD T T=δ -5 ε -5 T=δ -6 Neon+SCSG Neon+SCSG T T=δ -5 ε -5 T=δ -6 Neon+Natasha Neon+Natasha ε -4 T=δ -3 ε - T=ε -4 ε ε /3 ε 4/7 ε / ε /4 δ ε -4 ε -3.33 T=δ -3 ε - T=ε -3.33 ε ε /3 ε 4/9 ε / ε /4 δ ε -3.75 ε -3.6 ε -3.5 ε ε ε T=δ - ε -3 T=ε-3.5 ε 3/4 ε 3/5 /4 / δ (a) (b) (c) Figure : Neon vs Neon for finding (ε, δ)-approximate local minima. We emphasize that Neon and Neon are based on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs. One should perhaps compare Neon det to the interesting work convex until guilty by Carmon et al. [8]. Their method finds ε-approximate stationary points using Õ(/ε.75 ) full gradients, and is arguably the first first-order method achieving a convergence rate better than /ε of GD. Unfortunately, it is unclear if their method guarantees local minima. In comparison, Neon det on CDHS achieves the same complexity but guarantees its output to be an approximate local minimum. Remark.. All the cited works in this sub-section requires the objective to have () Lipschitzcontinuous Hessian (a.k.a. second-order smoothness) and () Lipschitz-continuous gradient (a.k.a. Lipschitz smoothness). One can argue that () and () are both necessary for finding approximate local minima, but if only finding approximate stationary points, then only () is necessary. We shall formally discuss our assumptions in Section.. From Stationary Points to Local Minima Given any (first-order) algorithm that finds only stationary points (such as GD, SGD, or SCSG [8]), we can hope for using the NC-search routine to identify whether or not its output x satisfies f(x) δi. If so, then automatically x becomes an (ε, δ)-approximate local minima so we can terminate. If not, then we can go in its negative curvature direction to further decrease the objective. In the independent work of Xu and Yang [8], they proposed to apply their Neon method for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local minima. In this paper, we use Neon instead. We show the following theorem: Theorem 5 (informal). To find an (ε, δ)-approximate local minima, (a) Neon+SGD needs T = Õ( ε 4 + ε δ 3 + δ 5 ) stochastic gradients; (b) Neon+SCSG needs T = Õ( ε 0/3 + ε δ 3 + δ 5 ) stochastic gradients; and (c) Neon+GD needs T = Õ( n ε + n δ 3.5 ) (so Õ ( ε + δ 3.5 ) full gradients). (d) Neon+SVRG needs T = Õ( n /3 ε + n δ 3 + n5/ ε δ / + n3/4 δ 3.5 ) stochastic gradients. We make several comments as follows. (a) We compare Neon+SGD to Ge et al. [3], where the authors showed SGD plus perturbation needs T = Õ(poly(d)/ε4 ) stochastic gradients to find (ε, ε /4 )-approximate local minima. This is the perhaps first time that a theoretical guarantee for finding local minima is given using first-order oracles. 4

To some extent, Theorem 5a is superior because we have () removed the poly(d) factor, 4 () achieved T = Õ(/ε4 ) as long as δ ε /3, and (3) a much simpler analysis. We also remark that, if using Neon instead of Neon, one achieves a slightly worse complexity T = Õ( + ) ε 4 δ, see Figure (a) for a comparison. 5 7 (b) Neon+SCSG turns SCSG into a local-minimum finding algorithm. Again, if using Neon instead of Neon, one gets a slightly worse complexity T = Õ( + + ) ε 0/3 ε δ 3 δ, see Figure (b). 6 (c) We compare Neon+GD to Jin et al. [5], where the authors showed GD plus perturbation needs Õ(/ε ) full gradients to find (ε, ε / )-approximate local minima. This is perhaps the first time that one can convert a stationary-point finding method (namely GD) into a local minimum-finding one, without hurting its performance. To some extent, Theorem 5c is better because we use Õ(/ε ) full gradients as long as δ ε 4/7. (d) Our result for Neon+SVRG does not seem to be recorded anywhere, even if Hessian-vector product computations are allowed. Limitation. We note that there is limitation of using Neon (or Neon) to turn an algorithm finding stationary points to that finding local minima. Namely, given any algorithm A, if the gradient complexity for A to find an ε-approximate stationary point is T, then after this conversion, it finds (ε, δ)-approximate local minima in a gradient complexity that is at least T. This is because the new algorithm, after combining Neon and A, tries to alternatively find stationary points (using A) and escape from saddle points (using Neon). Therefore, it must pay at least complexity T. In contrast, methods such as Natasha swing by saddle points instead of go to saddle points and then escape. This has enabled it to achieve a smaller complexity T = O(ε 3.5 ) for δ ε /4. Preliminaries Throughout this paper, we denote by the Euclidean norm. We use i R [n] to denote that i is generated from [n] = {,,..., n} uniformly at random. We denote by I[event] the indicator function of probabilistic events. We denote by A the spectral norm of matrix A. For symmetric matrices A and B, we write A B to indicate that A B is positive semiinite (PSD). Therefore, A σi if and only if all eigenvalues of A are no less than σ. We denote by λ min (A) and λ max (A) the minimum and maximum eigenvalue of a symmetric matrix A. Recall some initions on smoothness (for other equivalent initions, see textbook [9]) Definition.. For a function f : R d R, f is L-Lipschitz smooth (or L-smooth for short) if x, y R d, f(x) f(y) L x y. f is second-order L -Lipschitz smooth (or L -second-order smooth for short) if x, y R d, f(x) f(y) L x y. The following fact says the variance of a random variable decreases by a factor m if we choose m independent copies and average them. It is trivial to prove, see for instance [8]. 4 We are aware that the original authors of [3] have a different proof to remove its poly(d) factor, but have not found it online at this moment. 5 Their complexity might be improvable to Õ( ) + ε 4 δ with a slight change of the algorithm, but not beyond. 6 5

algorithm stationary SGD (folklore) O ( ε 4 ) local minima perturbed SGD [3] Õ ( poly(d) ε 4 ) local minima Neon+SGD [8] Õ ( ε 4 + δ 7 ) gradient complexity T local minima Neon+SGD Õ ( ε 4 + ε δ 3 + δ 5 ) stationary SCSG [8] O ( ε 0/3 ) local minima Neon+SCSG [8] O ( ) + ε 0/3 ε δ + 3 δ 6 local minima Neon+SCSG O ( ) + ε 0/3 ε δ + 3 δ 5 local minima Natasha [] Õ ( ε + 3.5 ε 3 δ + ) δ 5 local minima Neon+Natasha [8] Õ( ε + 3.5 ε 3 δ + ) δ 6 local minima Neon+Natasha Õ ( ε + 3.5 ε 3 δ + ) δ 5 stationary GD (folklore) O ( ) n ε local minima perturbed GD [5] Õ ( ) n ε local minima Neon+GD Õ ( n ε + n δ 3.5 ) Hessianvector products variance bound Lip. smooth nd - order smooth no needed needed no (only for δ ε /4 ) no needed needed needed no needed needed needed no needed needed needed no needed needed no no needed needed needed no needed needed needed needed needed needed needed no needed needed needed no needed needed needed no no needed no (only for δ ε / ) no no needed needed no no needed needed stationary SVRG [3, 3] O ( n /3 ε + n ) no no needed no local minima Neon+SVRG Õ ( ) n /3 ε + n δ + n5/ + n3/4 3 ε δ / δ no no needed needed 3.5 stationary guilty [8] Õ ( ) n ε no no needed needed.75 Õ ( ) n local minima FastCubic [] ε + n3/4.5 ε.75 needed no needed needed (only for δ ε / ) local minima CDHS [7] Õ ( ) n ε + n.5 δ + n3/4 3 ε + n3/4.75 δ 3.5 local minima Neon+CDHS Õ ( ) n ε + n.5 δ + n3/4 3 ε + n3/4.75 δ 3.5 needed no needed needed no no needed needed Table : Complexity for finding f(x) ε and f(x) δi. Following tradition, in these complexity bounds, we assume variance and smoothness parameters as constants, and only show the dependency on n, d, ε. Remark. Variance bounds is needed for online methods. Remark. Lipschitz smoothness is needed for finding approximate stationary points. Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima. Fact.. If v,..., v n R d satisfy n i= v i = 0, and S is a non-empty, uniform random subset of [n]. Then [ E S i S v i ] =. Problem and Assumptions n S (n ) S n i [n] v i I[ S <n] S n i [n] v i. Throughout the paper we study the following minimization problem { min x R d f(x) = } n n i= f i(x) (.) 6

Algorithm Neon online (f, x 0, δ, p) Input: Function f(x) = n n i= f i(x), vector x 0, negative curvature δ > 0, confidence p (0, ]. : for j =,, Θ(log /p) do boost the confidence : v j Neon online weak (f, x 0, δ, p); 3: if v j then 4: m Θ( L log /p ), v Θ( δ δ L )v. 5: Draw i,..., i m R [n]. 6: z j = m m v j= (v ) ( f ij (x 0 + v ) f ij (x 0 ) ) 7: if z j 3δ/4 return v = v j 8: end if 9: end for 0: return v =. Algorithm Neon online weak (f, x 0, δ, p) : η δ C 0 L log(d/p), T C 0 log(d/p) ηδ, for sufficiently large constant C 0 : ξ Gaussian random vector with norm σ. σ = η(d/p) C 0 δ L L 3 3: x x 0 + ξ. 4: for t to T do 5: x t+ x t η ( f i (x t ) f i (x 0 )) where i R [n]. 6: if x t+ x 0 r then return v = x t+ x 0 7: end for 8: return v = ; x t+ x 0 r = (d/p) C 0 σ where both f( ) and each f i ( ) can be nonconvex. We wish to find (ε, δ)-local minima which are points x satisfying We need the following three assumptions Each f i (x) is L-Lipschitz smooth. f(x) ε and f(x) δi. Each f i (x) is second-order L -Lipschitz smooth. (In fact, the gradient complexity of Neon in this paper only depends polynomially on the second-order smoothness of f(x) (rather than f i (x)), and the time complexity depends logarithmically on the second-order smoothness of f i (x). To make notations simple, we decide to simply assume each f i (x) is L -second-order smooth.) Stochastic gradients have bounded variance: x R d : E i R [n] f(x) f i (x) V. (This assumption is needed only for online algorithms.) 3 Neon in the Online Setting We propose Neon online formally in Algorithm. It repeatedly invokes Neon online weak in Algorithm, whose goal is to solve the NC-search problem with confidence /3 only; then Neon online invokes Neon online weak repeatedly for log(/p) times to boost the confidence to p. We prove the following theorem: 7

Theorem (Neon online ). Let f(x) = n n i= f i(x) where each f i is L-smooth and L -secondorder smooth. For every point x 0 R d, every δ > 0, every p (0, ], the output satisfies that, with probability at least p:. If v =, then f(x 0 ) δi. v = Neon online (f, x 0, δ, p). If v, then v = and v f(x 0 )v δ. Moreover, the total number of stochastic gradient evaluations O ( log (d/p)l δ ). The proof of Theorem immediately follows from Lemma 3. and Lemma 3. below. Lemma 3. (Neon online weak ). In the same setting as Theorem, the output v = Neononline weak (f, x 0, δ, p) satisfies If λ min ( f(x 0 )) δ, then with probability at least /3, v and v f(x 0 )v 5 Proof sketch of Lemma 3.. We explain why Neon online weak works as follows. Starting from a randomly perturbed point x = x 0 + ξ, it keeps updating x t+ x t η ( f i (x t ) f i (x 0 )) for some random index i [n], and stops either when T iterations are reached, or when x t+ x 0 > r. Therefore, we have x t x 0 r throughout the iterations, and thus can approximate f i (x 0 )(x t x 0 ) using f i (x t ) f i (x 0 ), up to error O(r ). This is a small term based on our choice of r. Ignoring the error term, our updates look like x t+ x 0 = ( I η f i (x 0 ) ) (x t x 0 ). This is exactly the same as Oja s algorithm [] which is known to approximately compute the minimum eigenvector of f(x 0 ) = n n i= f i(x 0 ). Using the recent optimal convergence analysis of Oja s algorithm [6], one can conclude that after T = Θ ( log r ) σ ηλ iterations, where λ = max{0, λmin ( f(x 0 ))}, then we not only have that x t+ x 0 is blown up, but also it aligns well with the minimum eigenvector of f(x 0 ). In other words, if λ δ, then the algorithm must stop before T. Finally, one has to carefully argue that the error does not blow up in this iterative process. We er the proof details to Appendix A.. Our Lemma 3. below tells us we can verify if the output v of Neon online weak to additive δ 4 ), so we can boost the success probability to p. 00 δ. is indeed correct (up Lemma 3. (verification). In the same setting as Theorem, let vectors x, v R d. If i,..., i m R [n] and ine z = m m j= v ( f ij (x + v) f ij (x)) Then, if v δ 8L and m = Θ( L log /p ), with probability at least p, δ z δ 4. v v f(x)v v 4 Neon in the Deterministic Setting We propose Neon det formally in Algorithm 3 and prove the following theorem: 8

Algorithm 3 Neon det (f, x 0, δ, p) Input: A function f, vector x 0, negative curvature target δ > 0, failure probability p (0, ]. : T C log(d/p) L δ. for sufficiently large constant C. : ξ Gaussian random vector with norm σ; σ = (d/p) C δ T 3 L 3: x x 0 + ξ. y ξ, y 0 0 4: for t to T do 5: y t+ = M(y t ) y t ; M(y) = L ( f(x0 + y) f(x0)) + ( ) 3δ 4L y 6: x t+ = x 0 + y t+ M(y t ). 7: if x t+ x 0 r then return x t+ x 0 8: end for 9: return. x t+ x 0. r = (d/p) C σ Theorem 3 (Neon det ). Let f(x) be a function that is L-smooth and L -second-order smooth. For every point x 0 R d, every δ > 0, every p (0, ], the output v = Neon det (f, x 0, δ, p) satisfies that, with probability at least p:. If v =, then f(x 0 ) δi.. If v, then v = and v f(x 0 )v δ. Moreover, the total number full gradient evaluations is O ( log (d/p) L δ ). Proof sketch of Theorem 3. We explain the high-level intuition of Neon det and the proof of Theorem 3 as follows. Define M = L f(x 0 ) + ( 3δ 4L) I. We immediately notice that all eigenvalues of f(x 0 ) in [ 3δ 4, L] are mapped to the eigenvalues of M in [, ], and any eigenvalue of f(x 0 ) smaller than δ is mapped to eigenvalue of M greater than + δ 4L. Therefore, as long as T Ω ( ) L δ, if we compute xt + = x 0 + M T ξ for some random vector ξ, by the theory of power method, x T + x 0 must be a negative-curvature direction of f(x 0 ) with value δ. There are two issues with this approach. The first issue is that, the degree T of this matrix polynomial M T can be reduced to T = Ω ( ) L δ if the so-called Chebyshev polynomial is used. Claim 4.. Let T t (x) be the t-th Chebyshev polynomial of the first kind, ined as [7]: T 0 (x) =, T (x) = x, T n+ (x) = x T n (x) T n (x) [, ] if x [, ]; then T t (x) satisfies: T t (x) ( x + ) t ( x, x + ) ] t x if x >. [ Since T t (x) stays between [, ] when x [, ], and grows to ( + x ) t for x, we can use T T (M) in replacement of M T. Then, any eigenvalue of M that is above + δ 4L shall grow in a speed like ( + δ/l) T, so it suffices to choose T Ω ( ) L σ. This is quadratically faster than applying the power method, so in Neon det we wish to compute x t+ x 0 + T t (M) ξ. The second issue is that, since we cannot compute Hessian-vector products, we have to use the 9

gradient difference to approximate it; that is, we can only use M(y) to approximate My where M(y) = ( L ( f(x 0 + y) f(x 0 )) + 3δ ) y. 4L How does error propagate if we compute T t (M) ξ by replacing M with M? Note that this is a very non-trivial question, because the coefficients of the polynomial T t (x) is as large as O(t). It turns out, the way that error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely, y 0 = 0, y = ξ, y t = M(y t ) y t and setting x T + = x 0 + y T + M(y T ), then this x T + is sufficiently close to the exact value x 0 + T t (M) ξ. This is known as the stability theory of computing Chebyshev polynomials, and is proved in our prior work [5]. We er all the proof details to Appendix B.. 5 Neon in the SVRG Setting Recall that the shift-and-invert (SI) approach [] on top of the SVRG method [6] solves the minimum eigenvector problem as follows. Given any matrix A = f(x 0 ) and suppose its eigenvalues are λ λ d. Then, if λ > λ, we can ine positive semiinite matrix B = (λi+a), and then apply power method to find an (approximate) maximum eigenvector of B, which necessarily is an (approximate) minimum eigenvector of A. The SI approach specifies a binary search routine to determine the shifting constant λ, and ensures that B = (λi + A) is always well conditioned, meaning that it suffices to apply power method on B for logarithmic number of iterations. In other words, the task of computing the minimum eigenvector of A reduces to computing matrix-vector products By for poly-logarithmic number of times. Moreover, the stability of SI was shown in a number of papers, including [] and [4]. This means, it suffices for us to compute By approximately. However, how to compute By for an arbitrary vector y. It turns out, this is equivalent to minimizing a convex quadratic function that is of a finite sum form g(z) = z (λi + A)z + y z = n n z (λi + f i (x 0 ))z + y z. Therefore, one can apply the a variant of the SVRG method (arguably first discovered by Shalev- Shwartz [6]) to solve this task. In each iteration, SVRG needs to evaluate a stochastic gradient (λi + f i (x 0 ))z + y at some point z for some random i [n]. Instead of evaluating it exactly (which require a Hessian-vector product), we use f i (x 0 +z) f i (x 0 ) to approximate f i (x 0 ) z. Of course, one needs to show also that the SVRG method is stable to noise. Using similar techniques as the previous two sections, one can show that the error term is proportional to O( z ), and thus as long as we bound the norm of z is bounded (just like we did in the previous two sections), this should not affect the performance of the algorithm. We decide to ignore the detailed theoretical proof of this result, because it will complicate this paper. Theorem 3 (Neon svrg ). Let f(x) = n n i= f i(x) where each f i is L-smooth and L -second-order smooth. For every point x 0 R d, every δ > 0, every p (0, ], the output v = Neon svrg (f, x 0, δ, p) satisfies that, with probability at least p:. If v =, then f(x 0 ) δi.. If v, then v = and v f(x 0 )v δ. 0 i=

Moreover, the total number stochastic gradient evaluations is Õ( n + n3/4 L δ ). 6 Applications of Neon We show how Neon online can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha, CDHS. Unfortunately, we are unaware of a generic statement for applying Neon online to any algorithm. Therefore, we have to prove them individually. 6 Throughout this section, we assume that some starting vector x 0 R d and upper bound f is given to the algorithm, and it satisfies f(x 0 ) min x {f(x)} f. This is only for the purpose of proving theoretical bounds. In practice, because f only appears in specifying the number of iterations, can just run enough number of iterations and then halt the algorithm, without the necessity of knowing f. 6. Auxiliary Claims Claim 6.. For any x, using O(( V + ) log ε p ) stochastic gradients, we can decide with probability p: either f(x) ε or f(x) ε. Proof. Suppose we generate m = O(log p ) random uniform subsets S,..., S m of [n], each of cardinality B = max{ 3ε V, }. Then, denoting by v j = B i S j f i (x), we have according to [ Fact. that E Sj vj f(x) ] V B = ε 3. In other words, with probability at least / over the randomness of S j, we have v j f(x) v j f(x) ε 4. Since m = O(log p ), we have with probability at least p, it satisfies that at least m/ + of the vectors v j satisfy vj f(x) ε 4. Now, if we select v = v j where j [m] is the index that gives the median value of v j, then it satisfies v j f(x) ε 4. Finally, we can check if v j 3ε 4. If so, then we conclude that f(x) ε, and if not, we conclude that f(x) ε. Claim 6.. If v is a unit vector and v f(y)v δ, suppose we choose y = y ± δ L v where the sign is random, then f(y) E[f(y )] δ3. L Proof. Letting η = δ L, then by the second-order smoothness, f(y) E[f(y )] E [ f(y), y y (y y ) f(y)(y y ) L 6 y y 3] 6. Neon on SGD and GD = η v f(y)v L η 3 6 v 3 η δ 4 L η 3 6 = δ3 L. To apply Neon to turn SGD into an algorithm finding approximate local minima, we propose the following process Neon+SGD (see Algorithm 4). In each iteration t, we first apply SGD with minibatch size O( ε ) (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon online to decide if it has a negative curvature, if so, then we move in the direction of the negative curvature (see Line 0). We have the following theorem: 6 This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in minibatch SGD we have f(x t) E[f(x t+)] Ω( f(x t) ) but in SCSG we have f(x t) E[f(x t+)] E[Ω( f(x t+) )].

Algorithm 4 Neon+SGD(f, x 0, p, ε, δ) Input: function f( ), starting vector x 0, confidence p (0, ), ε > 0 and δ > 0. : K O ( L f + L ) f δ 3 ε ; : for t 0 to K do f is any upper bound on f(x 0) min x{f(x)} 3: S a uniform random subset of [n] with cardinality S = B = max{ 8V, }; ε 4: x t+/ x t L S i S f i(x t ); 5: if f(x t ) ε then estimate f(xt) using O(ε V log(k/p)) stochastic gradients 6: x t+ x t+/ ; 7: else necessarily f(x t) ε 8: v Neon online (x t, δ, p K ); 9: if v = then return x t ; necessarily f(x t) δi 0: else x t+ x t ± δ L v; necessarily v f(x t)v δ/ : end if : end for 3: will not reach this line (with probability p). Algorithm 5 Neon+GD(f, x 0, p, ε, δ) Input: function f( ), starting vector x 0, confidence p (0, ), ε > 0 and δ > 0. : K O ( L f + L ) f δ 3 ε ; : for t 0 to K do f is any upper bound on f(x 0) min x{f(x)} 3: x t+/ x t L f(x t); 4: if f(x t ) ε then 5: x t+ x t+/ ; 6: else 7: v Neon det (x t, δ, p K ); 8: if v = then return x t ; necessarily f(x t) δi 9: else x t+ x t ± δ L v; necessarily v f(x t)v δ/ 0: end if : end for : will not reach this line (with probability p). Theorem 5a. With probability at least ( p, Neon+SGD outputs an (ε, δ)-approximate local minimum in gradient complexity T = Õ ( V + ) ( L ε f + L ) f δ 3 ε + L L δ f ). δ 3 Corollary 6.3. Treating f, V, L, L as constants, we have T = Õ( ε 4 + ε δ 3 + δ 5 ). One can similarly (and more easily) give an algorithm Neon+GD, which is the same as Neon+SGD except that the mini-batch SGD is replaced with a full gradient descent, and the use of Neon online is replaced with Neon det. We have the following theorem: Theorem 5c. With probability( at least p, Neon+GD ) outputs an (ε, δ)-approximate local minimum in gradient complexity Õ L f + L/ L ε δ / f full gradient computations. δ 3 We only prove Theorem 5a in Appendix C and the proof of Theorem 5c is only simpler.

6.3 Neon on SCSG and SVRG Background. We first recall the main idea of the SVRG method for non-convex optimization [3, 3]. It is an offline method but is what SCSG is built on. SVRG divides iterations into epochs, each of length n. It maintains a snapshot point x for each epoch, and computes the full gradient f( x) only for snapshots. Then, in each iteration t at point x t, SVRG ines gradient estimator f(x t ) = f i (x t ) f i ( x) + f( x) which satisfies E i [ f(x t )] = f(x t ), and performs update x t+ x t α f(x t ) for learning rate α. The SCSG method of Lei et al. [8] proposed a simple fix to turn SVRG into an online method. They changed the epoch length of SVRG from n to B /ε, and then replaced the computation of f( x) with S i S f i( x) where S is a random subset of [n] with cardinality S = B. To make this approach even more general, they also analyzed SCSG in the mini-batch setting, with mini-batch size b {,,..., B}. 7 Their Theorem 3. [8] says that, Lemma 6.4 ([8]). There exist constant C > such that, if we run SCSG for an epoch of size B (so using O(B) stochastic gradients) 8 with mini-batch b {,,..., B} starting from a point x t and moving to x + t, then E [ f(x + t ) ] C L(b/B) /3( f(x t ) E[f(x + t )]) + 6V B. Our Approach. In principle, one can apply the same idea of Neon+SGD on SCSG to turn it into an algorithm finding approximate local minima. Unfortunately, this is not quite possible because the left hand side of Lemma 6.4 is on E [ f(x + t ) ], as opposed to f(x t ) in SGD (see (C.)). This means, instead of testing whether x t is a good local minimum (as we did in Neon+SGD), this time we need to test whether x + t is a good local minimum. This creates some extra difficulty so we need a different proof. Remark 6.5. As for the parameters of SCSG, we simply use B = max{, 48V }. However, choosing ε mini-batch size b = does not necessarily give the best complexity, so a tradeoff b = Θ( (ε +V)ε 4 L 6 ) } δ 9 L 3 is needed. (A similar tradeoff was also discovered by the authors of Neon [8].) Note that this quantity b may be larger than B, and if this happens, SCSG becomes essentially equivalent to one iteration of SGD with mini-batch size b. Instead of analyzing this boundary case b > B separately, we decide to simply run Neon+SGD whenever b > B happens, to simplify our proof. We show the following theorem (proved in Appendix C) Theorem 5b. With probability at least ( /3, Neon+SCSG outputs an (ε, δ)-approximate local ( minimum in gradient complexity T = Õ L f )( V ) + L L ε δ + f L ). ε δ + L ε 4/3 V /3 f δ 3 (To provide the simplest proof, we have shown Theorem 5b only with probability /3. One can for instance boost the confidence to p by running log p copies of Neon+SCSG.) Corollary 6.6. Treating f, V, L, L as constants, we have T = Õ( ε 0/3 + ε δ 3 + δ 5 ). 7 That is, they reduced the epoch length to B, and replaced fi(xt) fi( x) with b S i S ( fi(x t) f i( x) ) for some S that is a random subset of [n] with cardinality S = b. 8 We remark that Lei et al. [8] only showed that an epoch runs in an expectation of O(B) stochastic gradients. We assume it is exact here to simplify proofs. One can for instance stop SCSG after O(B log ) stochastic gradient p computations, and then Lemma 6.4 will succeed with probability p. 3

Algorithm 6 Neon+SCSG(f, x 0, ε, δ) Input: function f( ), starting vector x 0, ε > 0 and δ > 0. : B max{, 48V }; b max {, Θ( (ε +V)ε 4 L 6 ε ) } ; δ 9 L 3 : if b > B then return Neon+SGD(f, x 0, /3, ε, δ); for cleaner analysis purpose, see Remark 6.5 3: K Θ ( Lb /3 f ) ε 4/3 V ; /3 f is any upper bound on f(x 0) min x{f(x)} 4: for t 0 to K do 5: x t+/ apply SCSG on x t for one epoch of size B = max{θ(v/ε ), }; 6: if f(x t+/ ) ε then estimate f(xt) using O(ε V log K) stochastic gradients 7: x t+ x t+/ ; 8: else necessarily f(x t+/ ) ε 9: v Neon online (f, x t+/, δ, /0K); 0: if v = then return x t+/ ; necessarily f(x t+/ ) δi : else x t+ x t+/ ± δ L v; necessarily v f(x t+/ )v δ/ : end if 3: end for 4: will not reach this line (with probability /3). As for SVRG, it is an offline method and its one-epoch lemma looks like 9 E [ f(x + t ) ] C Ln /3( f(x t ) E[f(x + t )]). If one replaces the use of Lemma 6.4 with this new inequality, and replace the use of Neon online with Neon svrg, then we get the following theorem: Theorem 5d. With probability at least ( /3, Neon+SVRG outputs an (ε, δ)-approximate local ( minimum in gradient complexity T = Õ L f )( n + n 3/4 ) ) L δ. + L ε n /3 f δ 3 For a clean presentation of this paper, we ignore the pseudocode and proof because they are only simpler than Neon+SCSG. 6.4 Neon on Natasha and CDHS The recent results Carmon et al. [7] (that we refer to CDHS) and Natasha [] are both Hessianfree methods where the only Hessian-vector product computations come from the exact NC-search process we study in this paper. Therefore, by replacing their NC-search with Neon, we can directly turn them into first-order methods without the necessity of computing Hessian-vector products. We state the following two theorems where the proofs are exactly the same as the papers [7] and []. We directly state them by assuming f, V, L, L are constants, to simplify our notions. Theorem. One can replace Oja s algorithm with Neon online in Natasha without hurting its performance, turning it into a first-order stochastic method. Treating f, V, L, L as constants, Natasha finds an (ε, δ)-approximate local minimum in T = Õ ( + ε 3.5 ε 3 δ + ) δ stochastic gradient computations. 5 Theorem 4. One can replace Lanczos method with Neon det or Neon svrg in CDHS without hurting it performance, turning it into a first-order method. 9 There are at least three different variants of SVRG [3, 8, 3]. We have adopted the lemma of [8] for simplicity. 4

Treating f, L, L as constants, CDHS finds an (ε, δ)-approximate local minimum in either Õ ( + ) ε.75 δ full gradient computations (if Neon det is used) or in T = Õ( n + n ) + n3/4 + n3/4 3.5 ε.5 δ 3 ε.75 δ 3.5 stochastic gradient computations (if Neon svrg is used). Acknowledgements We would like to thank Tianbao Yang and Yi Xu for helpful feedbacks on this manuscript. A Missing Proofs for Section 3 A. Auxiliary Lemmas Appendix We use the following lemma to approximate hessian-vector products: Lemma A.. If f(x) is L -second-order smooth, then for every point x R d and every vector v R d, we have: f(x + v) f(x) f(x)v L v. Proof of Lemma A.. We can write f(x+v) f(x) = t=0 f(x+tv)vdt. Subtracting f(x)v we have: f(x + v) f(x) f(x)v ( = f(x + tv) f(x) ) vdt t=0 t=0 f(x + tv) f(x) v dt L v. We need the following auxiliary lemma about martingale concentration bound: Lemma A.. Consider random events {F t } t and random variables x,..., x T 0 and a,..., A T [ ρ, ρ] for ρ [0, /] where each x t and a t only depend on F,..., F t. Letting x 0 = 0 and suppose there exist constant b 0 and λ > 0 such that for every t : x t x t ( a t ) + b and E[a t F,..., F t ] λ. ] Then, we have for every p (0, ): Pr [x T T be λt +ρ T log T p p. Proof. We know that x T ( a T )x T + b ( a T ) (( a T )x T + b) + b = ( a T )( a T )x T + ( a T )b + b T T ( a s )b s= t=s For each s [T ], we consider the random process ine as t s : y t+ = ( a t )y t, y s = b 5

Therefore log y t+ = log( a t ) + log y t For log( a t ) [ ρ, ρ] and E[log( a t ) F, F t ] λ. Thus, we can apply Azuma-Hoeffding inequality on log y t to conclude that ] Pr [y T be λt +ρ T log T p p/t. Taking union bound over s we complete the proof. A. Proof of Lemma 3. Proof of Lemma 3.. Let i t [n] be the random index i chosen when computing x t+ from x t in Line 5 of Neon online weak. We will write the update rule of x t in terms of the Hessian before we stop. By Lemma A., we know that for every t, fit (x t ) f it (x 0 ) f it (x 0 )(x t x 0 ) L x t x 0. Therefore, there exists error vector ξ t R d with ξ t L x t x 0 such that For notational simplicity, let us denote by then it satisfies (x t+ x 0 ) = (x t x 0 ) η f it (x 0 )(x t x 0 ) + ηξ t. z t = x t x 0, A t = B t + R t where B t = f it (x 0 ), R t = ξ tzt z t z t+ = z t ηb t z t + ηξ t = (I ηa t )z t. We have R t L z t L r. By the L-smoothness of f it, we know B t L and thus A t B t + R t B t + L r L. Now, ine Φ t = z t+ zt+ = (I ηa t) (I ηa )ξξ (I ηa ) (I ηa t ) and w t = zt z t. Then, before we stop, we have: (Tr(Φ t )) / ) Tr(Φ t ) = Tr(Φ t ) ( ηwt A t w t + η wt A t w t Tr(Φ t ) ( ηwt A t w t + 4η L ) Tr(Φ t ) ( ηwt B t w t + η R t + 4η L ) ( Tr(Φ t ) ηwt B t w t + 8η L ). Above, is because our choice of parameter satisfies r η L L. Therefore, ( log (Tr(Φ t )) log (Tr(Φ t )) + log ηwt B t w t + 8η L ). z t = Letting λ = λ min ( f(x 0 )) = λ min (E Bt [B t ]), since the randomness of B t is independent of w t, we know that wt [ ] B t w t [ L, L] and for every w t, it satisfies E Bt w t B t w t w t λ. Which (by concavity of log) also implies that E[log ( ηwt B t w t + 8η L ) ] ηλ and log( ηwt B t w t + 8η L ) [ (ηl + 8η L ), ηl + 8η L ] [ 6ηL, 3ηL]. Hence, applying Azuma-Hoeffding inequality on log(φ t ) we have [ Pr log(φ t ) log(φ 0 ) ηλt + 6ηL t log ] p. p 6

In other words, with probability at least p, Neon online weak will not terminate until t T 0, where T 0 is given by the equation (recall Φ 0 = z = σ ): ηλt 0 + 6ηL T 0 log ( ) r p = log. Next, we turn to accuracy. Let true vector v t+ = (I ηb t ) (I ηb )ξ and we have t t z t+ v t+ = (I ηa s )ξ (I ηb s )ξ = (I ηb t )(z t v t ) ηr t z t. s= s= Thus, if we call u t = z t v t with u = 0, then, before the algorithm stops, we have: σ u t+ (I ηb t )u t η R t z t ηl r ( ) Using Young s inequality a + b ( + β) a + β + b for every β > 0, we have: u t+ ( + η ) ( (I ηb t )u t + 8L r 4 u t η u ) tb t u t u t + 4η L + 8L r 4. Above, assumes without loss of generality that L (as otherwise we can re-scale the problem). Therefore, applying martingale concentration Lemma A., we know ] Pr [ u t 6L r te ηλt+8ηl t log t p p. Now we can apply the recent of Oja s algorithm [6, Theorem 4]. By our choice of parameter η, we have: with probability at least 99/00 the following holds:. Norm growth: v t e (ηλ η L )t σ/d. ( ). Negative curvature: v t f(x 0 )v t ( η)λ + O log(d) v t ηt. Then let us consider the case: λ δ, let us consider a fixed T ined as T = log dr σ ηλ η L = C 0 (log d/p + log(d)) ηλ η L < T. At this point, by the norm growth property, we know that w.p. 99/00, v T r and by our choice of parameters, we know that ( e ηλt +ηl T log p r ). σ which implies that with probability at least 98/00, u T 6dL r T e ηλt+8ηl v T e (ηλ η L )T σ 6dL r T σ T log T p e 8ηL T log T p +η L T 6dL r T e 6 log T p 6dL r T σ σp δ 00L 00. Here, uses our choice of parameters so η L T log T p ( ) L η λ log r σ log T p log T p. 7

is due to r σ δ p δp 300dηL L log dr 600dT L L. Thus, with probability at least 98/00, z T = σ v T + u T r. This means Neon online weak must terminate within T T iterations. Moreover, recall with probability at last p, Neon online weak will not terminate before iteration T 0. Thus, at the point of t [T 0, T ] of termination, we have with probability at least 98/00, ( ) ( ) log(d) log(d) v t Av t v t ( η)λ + O ηt ( η)λ + O Moreover, we also know that u t + v t = z t r, therefore, u t 6L r T e ηλt+ηl u t + v t r T log T p. ηt 0 5 6 δ. By inition of T, we know that r = e (ηλ η L )T σ/(d), so we can show that ut before. Together, we have: v t 00 as z t Az t z t = v t z t z t Az t v t 3 vt Av t + 4L z t v t v t 4 v t 3 ( v t Av t 4 v t + 4L u ) t 3 ( v t Av t v t 4 v t + ) 4 δ 5 00 δ. Putting everything together we complete the proof. A.3 Proof of Lemma 3. Proof of Lemma 3.. By Lemma A., we know for every i [n], v ( f i (x + v) f i (x) f i (x) ) v L v 3. Letting z j = v ( f j (x + v) f j (x)), we know that z,, z m are i.i.d. random variables with z j L v + L v 3. By Chernoff bound, we know that [ Pr z E[z] ( L v + L v 3 ) m log ] p p Since we also have E[z] v f(x)v L v 3 from Lemma A., we conclude that [ z Pr v v f(x)v v (L + L v ) m log ] p + L v p. Plugging in our assumption on v and our choice of m finishes the proof. B Missing Proofs for Section 4 B. Stable Computation of Chebyshev Polynomials We recall the following result from [5, Section 6.] regarding one way to stably compute Chebyshev polynomials. Suppose we want to compute s N = N T k (M) c k R d where M R d d is symmetric and each c k is in R d. (B.) k=0 8

Definition B. (inexact backward recurrence). Let M be an approximate algorithm that satisfies M(u) Mu ε u for every u R d. Then, ine inexact backward recurrence to be bn+ = 0, bn = c N, and r {N,..., 0}: b r = M ( br+ ) br+ + c r R d, and ine the output as ŝ N = b 0 M( b ). If ε = 0, then ŝ N = s N. The following theorem gives an error analysis [5, Theorem 6.4]. Theorem B. (stable Chebyshev sum). For every N N, suppose the eigenvalues of M are in [a, b] and suppose there are parameters C U, C T, ρ, C c 0 satisfying { k {0,,..., N}: ρ k c k C c x [a, b]: Tk (x) C T ρ k and U k (x) C U ρ k}. Then, if the inexact backward recurrence in Def. B. is applied with ε 4NC U, we have B. Proof of Theorem 3 ŝ N s N ε ( + NC T )NC U C c. Proof of Theorem 3. We can without loss of generality assume δ L. For notation simplicity, let us denote ( A = f(x 0 ), M = ( L f(x 0 ) + 3δ ) ) I and λ = λ min (A). 4L Then, we know that the eigenvalues of M lie in [, + λ 3δ/4 ] L. We wish to iteratively compute x t+ x 0 +T t (M) ξ, where T t is the t-th Chebyshev polynomial of the first kind. However, we cannot multiply M to vectors (because we are not allowed to use Hessian-vector products). We ine M(y) = ( L ( f(x 0 + y) f(x 0 )) + 3δ ) y. 4L and shall use it to approximate My and then apply backward recurrence y 0 = 0, y = ξ, y t = M(y t ) y t. If we set x t+ = x 0 + y t+ M(y t ), following Def. B., it satisfies x t+ x 0 T t (M) ξ. Now, letting x t+ = x 0 +T t (M)ξ be the exact solution, we wish to bound the error x t+ x t+. Throughout the iterations of Neon det, we have y t = M(y t ) y t = (x 0 x t + y t ) y t = y t y t = (x t x 0 ). Since we have x t x 0 r for each t before termination, we know y t tr. Using this upper bound we can approximate Hessian-vector product by gradient difference. Lemma A. gives us M(y t ) My t L L y t L rt L y t. Now, recall from Claim 4. that [, ] if x [, ]; T t (x) ( x + ) t ( x, x + ) ] t x if x >. [ [ ] We can apply Theorem B. with the eigenvalues of M in [a, b] = 0, + λ 3δ/4 L and { ρ = max + λ 3δ/4 } + λ 3δ/4 (λ 3δ/4) + L L L,, C c = ρ t σ, C T = C U =. 9

Theorem B. tells us that, for every t before termination, x t+ x t+ 3L rt 3 ρ t σ. L In order to prove Theorem 3, in the rest of the proof, it suffices for us to show that, if λ min ( f(x 0 )) δ, then with probability at least p, it satisfies v, v =, and v f(x 0 )v δ. In other words, we can assume λ δ. The value λ δ implies ρ + δ >, so we can let T = log 4dr pσ log ρ T. By Claim 4., we know that T T (M) ρt pσ. Thus, with probability at least p, x T + x 0 = T T (M)ξ r. Moreover, at iteration T, we have: x T + x T + 3L rt 3 ρt σ L Here, uses the fact that r 3L rt 3 σ L δp. 800dL T 3 = dr 4dr pσ 8dL T 3 r L p δ 00L r 6 r. This means x T + x 0 r so the algorithm must terminate before iteration T T. On the other hand, since T t (M) ρ t, we know that the algorithm will not terminate until t T 0 for T 0 = log r σ log ρ. At the time t T 0 of termination, by the property of Chebyshev polynomial Claim 4., we know. T t (ρ) ρt ρt 0 r 4σ = (d/p)θ().. x [, ], T t (x) [, ]. Since all the eigenvalues of A that are 3/4δ are mapped to the eigenvalues of M that are in [, ], and the smallest eigenvalue of A is mapped to the eigenvalue ρ of M. So we have, with probability at least p, letting v t = x t+ x 0 it satisfies Therefore, denoting by z t = x t+ x 0, we have z t Az t z t v t Av t v t 5 8 δ. = v t z t z t Az t v t 5 vt Av t + 4L z t v t v t 6 v t 5 6 5 6 ( v t Av t + 4L v t z t v t v t ( v t Av t v t + ) 5 δ 5 00 δ. This finishes the proof because we have shown that, with probability at least p, the output v = zt z satisfies and t v f(x 0 )v δ. ) 0

C Missing Proofs for Section 6 C. Proof of Theorem 5a Proof of Theorem 5a. Since both estimating f(x t ) in Line 5 (see Claim 6.) and invoking Neon online (see Theorem ) succeed with high probability, we can assume that they always succeed. This means whenever we output x t in an iteration, it already satisfies f(x t ) ε and f(x t ) δi. Therefore, it remains to show that the algorithm must output some x t in an iteration, as well as to compute the final complexity. Recall (from classical SGD theory) if we update x t+/ x t α S the learning rate and S = B, then f(x t ) E S [f(x t+/ )] E S [ f(xt ), x t x t+/ L x t x t+/ ] i S f i(x t ) where α > 0 is = α f(x t ) α L [ E i S S S f i(x t ) ] = ( α α L) f(xt ) α L [ E f(xt ) i S S S f i(x t ) ] ( α α L ) f(xt ) α L Above, is due to the smoothness of f( ) and is due to Fact.. Now, if we choose α = L and B = max{ 8V ε, }, then we have f(x t ) E S [f(x t+/ )] α V B. ( f(xt ) ε ). (C.) 8 In other words, as long as Line 6 is reached, we have f(x t ) E[f(x t+ )] Ω(ε /L). On the other hand, whenever Line 0 is reached, then we must have v f(y 0 )v δ. By Claim 6., we must have f(x t ) E[f(x t+ )] Ω(δ 3 /L ). In sum, if we choose K = O ( L f + L ) f δ 3 ε, then the algorithm must terminate and return xt in one of its iterations. This ensures that Line 3 will not be reached. As for the total complexity, we note that each iteration of Neon+SGD is dominated by Õ(B) = Õ( V + ) stochastic gradient ε computations in Line 4 and Line 5, totaling Õ(( V L + )K), as well as Õ( ) stochastic gradient ε δ computations by Neon online, but the latter will not happen for more than O( L f ) times. Therefore, δ 3 the total gradient complexity is Õ (( V L L + )K + ) ( f ε δ δ 3 = Õ ( V ε + )( L f δ 3 + L f ) L L + ) f ε δ δ 3. C. Proof of Theorem 5b Proof of Theorem 5b. We first note in the special case b = Θ( (ε +V)ε 4 L 6 ) B, or equivalently ( δ 9 L 3 ) δ 3 O( L ε L ), Theorem 5a gives us gradient complexity T = Õ V L ε f + L L δ 3 δ f so we are done. δ 3 Therefore, in the rest of the proof we assume Θ( (ε +V)ε 4 L 6 ) < B and thus b B is well ined. δ 9 L 3 Since both estimating f(x t ) in Line 6 (see Claim 6.) and invoking Neon online (see Theorem ) succeed with high probability, we can assume that they always succeed. This means whenever we output x t+/ in an iteration, it already satisfies f(x t+/ ) ε and f(x t+/ ) δi. Therefore, it remains to show that the algorithm must output some x t in an iteration, as well as to compute the final complexity.