arxiv: v4 [math.oc] 24 Apr 2017


Finding Approximate Local Minima Faster than Gradient Descent

arXiv:1611.01146v4 [math.OC] 24 Apr 2017

Naman Agarwal (namana@cs.princeton.edu), Princeton University
Zeyuan Allen-Zhu (zeyuan@csail.mit.edu), Institute for Advanced Study
Brian Bullins (bbullins@cs.princeton.edu), Princeton University
Elad Hazan (ehazan@cs.princeton.edu), Princeton University
Tengyu Ma (tengyu@cs.princeton.edu), Princeton University

November 3, 2016

Abstract

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

1 Introduction

Finding a global minimizer of a non-convex optimization problem is NP-hard. Thus, the standard goal of efficient non-convex optimization algorithms is instead to find a local minimum. This problem has become increasingly important as the state of the art in machine learning is attained by non-convex models, many of which are variants of deep neural networks. Experiments in [0,, ] suggest that fast convergence to a local minimum is sufficient for training neural nets, while convergence to critical points (points with vanishing gradients) is not. Theoretical works have also affirmed the same phenomenon for other machine learning problems (see [5, 6, 8, 9] and the references therein).

In this paper we give a provable linear-time algorithm for finding an approximate local minimum in smooth non-convex optimization. It applies to a general setting of machine learning optimization, and in particular to the optimization problem of training deep neural networks. Furthermore, the running time bound of our algorithm is the fastest known even for the more lenient task of computing a point with vanishing gradient (called a critical point), for a wide range of parameters.

Formally, the problem of unconstrained mathematical optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e.

$$\min_{x \in \mathbb{R}^d} f(x). \qquad (1.1)$$

If $f$ is convex, the above formulation is convex optimization and is solvable in (randomized) polynomial time even if only a valuation oracle to $f$ is provided. A crucial property of convex functions is that local optimality implies global optimality, allowing for greedy algorithms to reach the global optimum efficiently. Unfortunately, this is no longer the case if $f$ is non-convex; indeed, even a degree-four polynomial can be NP-hard to optimize [3], or even just to check whether a point is not a local minimum [5]. Thus, for non-convex optimization one has to settle for the more modest goal of reaching approximate local optimality efficiently.

Of particular interest to machine learning is the optimization of functions $f : \mathbb{R}^d \to \mathbb{R}$ of the finite-sum form

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x). \qquad (1.2)$$

Such functions arise when minimizing loss over a training set, where each example $i$ in the set corresponds to one loss function $f_i$ in the summation.

We say that the function $f$ is second-order smooth if it has Lipschitz continuous gradient and Lipschitz continuous Hessian. We say that a point $x$ is an ε-approximate local minimum if it satisfies (following the tradition of [8]):

$$\|\nabla f(x)\| \le \varepsilon \quad \text{and} \quad \nabla^2 f(x) \succeq -\sqrt{\varepsilon}\, I,$$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. We say that a point $x$ is an ε-critical point if it satisfies the gradient condition above, but not necessarily the second-order condition. Critical points include saddle points in addition to local minima.

We remark that ε-approximate local minima (even with ε = 0) are not necessarily close to any local minimum, neither in domain nor in function value. However, if we assume in addition that the function satisfies the (robust) strict-saddle property [5, 4] (see Section 2 for the precise definition), then an ε-approximate local minimum is guaranteed to be close to a local minimum for sufficiently small ε.

Our main theorem (Theorem 1) states the time required for the proposed algorithm FastCubic to find an ε-approximate local minimum for second-order smooth functions.
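To make the two definitions above concrete, here is a minimal sketch (not part of the paper) that classifies a point of a two-dimensional function from its gradient and Hessian. The helper names `min_eig_sym2` and `classify_point` are hypothetical, and the closed-form eigenvalue formula only applies to symmetric 2x2 matrices.

```python
import math

def min_eig_sym2(h11, h12, h22):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.hypot(0.5 * (h11 - h22), h12)
    return mean - radius

def classify_point(grad, hess, eps):
    """Classify a 2-D point using the conditions stated in the text.

    grad is (g1, g2); hess is (h11, h12, h22).
    Returns 'approx-local-min', 'critical-only', or 'neither'.
    """
    grad_small = math.hypot(grad[0], grad[1]) <= eps            # ||grad f|| <= eps
    curvature_ok = min_eig_sym2(*hess) >= -math.sqrt(eps)       # Hessian >= -sqrt(eps) I
    if grad_small and curvature_ok:
        return "approx-local-min"
    if grad_small:
        return "critical-only"
    return "neither"

# f(x, y) = x^2 - y^2 at the origin: gradient (0, 0), Hessian diag(2, -2), a saddle.
saddle = classify_point((0.0, 0.0), (2.0, 0.0, -2.0), eps=1e-2)
# f(x, y) = x^2 + y^2 at the origin: gradient (0, 0), Hessian diag(2, 2).
bowl = classify_point((0.0, 0.0), (2.0, 0.0, 2.0), eps=1e-2)
```

The saddle of $x^2 - y^2$ passes the gradient test but fails the curvature test, which is exactly the gap between an ε-critical point and an ε-approximate local minimum.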

Theorem 1 (informal). Ignoring smoothness parameters, the running time of FastCubic to return an ε-approximate local minimum is

$$\tilde{O}\Big(\Big(\frac{n}{\varepsilon^{3/2}} + \frac{n^{3/4}}{\varepsilon^{7/4}}\Big) \cdot T_{h,1}\Big) \text{ for (1.2)}, \quad \text{or} \quad \tilde{O}\Big(\frac{1}{\varepsilon^{7/4}} \cdot T_h\Big) \text{ for the general (1.1)}.$$

Above, $T_h$ is the time to compute a Hessian-vector product for $\nabla^2 f(x)$ and $T_{h,1}$ is that for an arbitrary $\nabla^2 f_i(x)$.

The full statement of Theorem 1 can be found in Section 2. Hessian-vector products can be computed in linear time, meaning $T_{h,1} = O(d)$ and $T_h = O(nd)$, for many machine learning problems such as generalized linear models and training neural networks [, 9]. We explain this more generally in Appendix A. Therefore,

Corollary 1.1. Algorithm FastCubic returns an ε-approximate local minimum for the optimization problem of training a neural network in time

$$\tilde{O}\Big(\frac{nd}{\varepsilon^{3/2}} + \frac{n^{3/4} d}{\varepsilon^{7/4}}\Big).$$

Another important aspect of our algorithm is that even in terms of just reaching an ε-critical point, i.e. a point that satisfies $\|\nabla f(x)\| \le \varepsilon$ without any second-order guarantee, FastCubic is faster than all previous results (see Table 1 for a comparison).

The fastest methods to find critical points for a smooth non-convex function are gradient descent and its extensions, jointly known as first-order methods. These methods are extremely efficient in terms of per-iteration complexity; however, they necessarily suffer from a $1/\varepsilon^2$ convergence rate [7]. To the best of our knowledge, in previous results only higher-order methods seem capable of breaking this $1/\varepsilon^2$ bottleneck [8].

For certain ranges of parameters, our FastCubic finds local minima even faster than first-order methods, even though they only find critical points. This is depicted in Table 1.
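Table 1: Comparison of known methods. "Total Time" is the time to achieve $\|\nabla f(x)\| \le \varepsilon$.

Paper | Total Time | Second-Order Guarantee
Gradient Descent (GD) | $O(nd/\varepsilon^2)$ | n/a
SVRG [] | $O(nd + n^{2/3} d/\varepsilon^2)$ | n/a
SGD [0] | $O(d/\varepsilon^4)$ | n/a
noisy SGD [6] (a) | $O(d^{C_1}/\varepsilon^4)$ | $\nabla^2 f(x) \succeq -\varepsilon^{1/C_2} I$
cubic regularization [8] | $\tilde{O}\big((nd^{\omega-1} + d^{\omega})/\varepsilon^{3/2}\big)$ | $\nabla^2 f(x) \succeq -\varepsilon^{1/2} I$
this paper | $\tilde{O}(nd/\varepsilon^{7/4})$ | $\nabla^2 f(x) \succeq -\varepsilon^{1/2} I$
this paper | $\tilde{O}(nd/\varepsilon^{3/2} + n^{3/4} d/\varepsilon^{7/4})$ | $\nabla^2 f(x) \succeq -\varepsilon^{1/2} I$

(a) Here $C_1$, $C_2$ are two constants that are not explicitly written. We believe $C_1 \ge 4$.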
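The claim $T_{h,1} = O(d)$ made before Corollary 1.1 can be illustrated for one generalized linear model. The sketch below is not taken from the paper: it assumes the logistic loss $f_i(x) = \log(1 + \exp(-b_i \langle a_i, x \rangle))$ with labels $b_i \in \{-1, +1\}$, for which $\nabla^2 f_i(x)\, v = \sigma(z)(1 - \sigma(z)) \langle a_i, v \rangle a_i$ with $z = b_i \langle a_i, x \rangle$. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def grad_i(a, b, x):
    """Gradient of f_i(x) = log(1 + exp(-b * <a, x>))."""
    s = -b * sigmoid(-b * dot(a, x))
    return [s * aj for aj in a]

def hvp_i(a, b, x, v):
    """Hessian-vector product of f_i with v: two inner products, O(d) time.

    The d x d Hessian is never formed explicitly.
    """
    p = sigmoid(b * dot(a, x))
    scale = p * (1.0 - p) * dot(a, v)
    return [scale * aj for aj in a]

def hvp(data, x, v):
    """Hessian-vector product of f = (1/n) * sum_i f_i, O(nd) time."""
    n = len(data)
    out = [0.0] * len(x)
    for a, b in data:
        hi = hvp_i(a, b, x, v)
        out = [o + h / n for o, h in zip(out, hi)]
    return out
```

Each call to `hvp_i` touches the example vector a constant number of times, so the per-example cost matches the cost of one gradient evaluation up to a constant factor.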

1.1 Related work

Methods that Provably Reach Critical Points. Recall that only a gradient oracle is needed to reach a critical point. The most commonly used algorithm in practice for training non-convex learning machines such as deep neural networks is stochastic gradient descent (SGD), also known as stochastic approximation [30], and its derivatives. Some practical enhancements widely used in practice are based on Nesterov's acceleration [6] and adaptive regularization []. The variance reduction technique, introduced in [3], was extremely successful in convex optimization, but only recently was a non-convex counterpart with theoretical benefits introduced [].

Methods that Provably Reach Local Minima. The recent work of Ge et al. [7] showed that a noise-injected version of SGD in fact converges to local minima instead of critical points, as long as the underlying non-convex function is strict-saddle. Their theoretical running time is a large polynomial in the dimension and not competitive with our method (see Table 1). The work of Lee et al. [4] shows that gradient descent, starting from a random point, almost surely converges to a local minimum of a strict-saddle function. The rates of convergence and precise step sizes that are required are, however, yet unknown.

If second-order information (i.e., the Hessian oracle) is provided, the cubic-regularization method of Nesterov and Polyak [8] converges in $O(1/\varepsilon^{3/2})$ iterations. However, each iteration of Nesterov-Polyak requires solving a cubic function which, in general, takes time super-linear in the input representation.

One natural direction is to apply an approximate trust region solver, such as the linear-time solver of [], to approximately solve the cubic regularization subroutine of Nesterov-Polyak. However, the approximation needed by a naive calculation makes this approach even slower than vanilla gradient descent.
Our main challenge is to obtain approximate second-order local minima and simultaneously improve upon gradient descent.

Independently of this paper and concurrently, Carmon et al. [7] develop an accelerated gradient descent method that achieves the same running time for finding an approximate local minimum as in our paper. (Footnote: To be precise, their manuscript appeared online approximately 4 hours before ours.) Remarkably, the same running time is obtained via a very different technique.

1.2 Our Techniques

Our algorithm is based on the cubic regularization method of Nesterov and Polyak [8, 9, 8]. At a high level, cubic regularization states that if we can minimize a cubic function

$$m(h) \triangleq g^\top h + \frac{1}{2} h^\top H h + \frac{L_2}{6} \|h\|^3$$

exactly, where $g = \nabla f(x)$, $H = \nabla^2 f(x)$, and $L_2$ is the second-order smoothness of the function $f$, then we can iteratively perform updates $x \leftarrow x + h$, and this algorithm converges to an ε-approximate local minimum in $O(1/\varepsilon^{3/2})$ iterations. Unfortunately, solving this cubic minimization problem exactly, to the best of our knowledge, requires a running time of $O(d^{\omega})$ where ω is the matrix multiplication constant. Getting around this requires five observations.

The first observation is that minimizing $m(h)$ up to a constant multiplicative approximation (plus a few other constraints) is sufficient for showing an iteration complexity of $O(1/\varepsilon^{3/2})$. (Footnote: More specifically, we need $m_t(h) \le \frac{1}{C} \min_h \{m_t(h)\}$ for some constant $C$. In addition, we need to have good bounds on $\|h\|$ and $\|\nabla m(h)\|$.) The proof techniques to show this observation are based on extending Nesterov and Polyak.

The second observation is that the minimizer $h^*$ of $m(h)$ must be of the form $h^* = -(H + \lambda^* I)^{+} g + v$, where $\lambda^* \ge 0$ is some constant satisfying $H + \lambda^* I \succeq 0$, $v$ is the smallest eigenvector of $H$, and $+$ denotes the pseudo-inverse of a matrix. This can be viewed as moving in a mixture direction

between choosing $h \leftarrow v$ and choosing $h$ to follow a shifted Newton's direction $h \leftarrow -(H + \lambda^* I)^{+} g$. Intuitively, we wish to reduce both the computation of $(H + \lambda^* I)^{+} g$ and $v$ to Hessian-vector products.

The first task of computing $(H + \lambda^* I)^{+} g$ can be slow, and even if $H + \lambda^* I$ is strictly positive definite, computing it has a complexity depending on the (possibly huge) condition number of $H + \lambda^* I$ [34]. The third observation is that it suffices to pick some $\lambda' > \lambda^*$ so that both (1) the condition number of $H + \lambda' I$ is small and (2) the vectors $(H + \lambda' I)^{-1} g$ and $(H + \lambda^* I)^{+} g$ are close. This relies on the structure of $m(h)$.

The second task of computing $v$ has a complexity depending on $1/\sqrt{\delta}$ where δ is the target additive error [3, 4]. The fourth observation is that the choice $\delta = \sqrt{\varepsilon}$ suffices for the outer loop of cubic regularization to make sufficient progress. This reduces the complexity to compute $v$.

Finally, finding the correct value $\lambda^*$ itself is as hard as minimizing $m_t(h)$. The fifth step is to design an iterative scheme that makes only a logarithmic number of guesses on $\lambda^*$. This procedure either finds the correct one (via binary search), or finds an approximate one, $\lambda'$, for which $(H + \lambda' I)^{-1} g$ and $(H + \lambda^* I)^{+} g$ are sufficiently close.

Putting all the observations together, and balancing all the parameters, we can obtain a cubic minimization subroutine (see FastCubicMin in Algorithm 2) that runs in time $O(nd + n^{3/4} d/\varepsilon^{1/4})$.

2 Preliminaries and Main Theorem

We use $\|\cdot\|$ to denote the Euclidean norm of a vector and the spectral norm of a matrix. For a symmetric matrix $M$ we denote by $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ respectively the maximum and minimum eigenvalues of $M$. We denote by $A \succeq B$ that $A - B$ is positive semidefinite (PSD). For a PSD matrix $M$, we denote by $M^{+}$ its pseudo-inverse if $M$ is not strictly positive definite.

We make the following Lipschitz continuity assumptions for the gradient and Hessian of the target function $f$. Namely, there exist $L, L_2 > 0$ such that

$$\forall x, y \in \mathbb{R}^d : \quad \|\nabla^2 f(x)\| \le L \quad \text{and} \quad \|\nabla^2 f(x) - \nabla^2 f(y)\| \le L_2 \|x - y\|. \qquad (2.1)$$

Definition 2.1. We assume the following complexity parameters on the access to $f(x)$:
- Let $T_g \in \mathbb{R}$ be the time complexity to compute $\nabla f(x)$ for any $x \in \mathbb{R}^d$.
- Let $T_h \in \mathbb{R}$ be the time complexity to compute $(\nabla^2 f(x))\, v$ for any $x, v \in \mathbb{R}^d$.

Definition 2.2. We say that $f$ is of finite-sum form if $f = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ and $\|\nabla^2 f_i(x)\| \le L$ for each $i \in [n]$. In this case, we define $T_{h,1}$ to be the time complexity to compute $(\nabla^2 f_i(x))\, v$ for arbitrary $x, v \in \mathbb{R}^d$ and $i \in [n]$.

Next we define the strict-saddle function, for which an ε-approximate local minimum is almost equivalent to a local minimum [5, 4].

Definition 2.3 (strict saddle). Suppose $f(\cdot) : \mathbb{R}^d \to \mathbb{R}$ is twice differentiable. For $\alpha, \beta, \gamma \ge 0$, we say $f$ is $(\alpha, \beta, \gamma)$-strict saddle if every $x \in \mathbb{R}^d$ satisfies at least one of the following three conditions:
1. $\|\nabla f(x)\| \ge \alpha$.
2. $\lambda_{\min}(\nabla^2 f(x)) \le -\beta$.
3. There exists a local minimum $x^*$ that is γ-close to $x$ in Euclidean distance.

We see that if a function is $(\alpha, \beta, \gamma)$-strict saddle, then for $\varepsilon < \min\{\alpha, \beta^2\}$ an ε-approximate local minimum is γ-close to some local minimum.
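The closing remark of Definition 2.3 follows by ruling out the first two conditions. A short derivation, using the approximate-local-minimum conditions as stated in Section 1 (that is, with smoothness parameters ignored), could read:

```latex
\begin{aligned}
\|\nabla f(x)\| \le \varepsilon < \alpha
  &\;\Longrightarrow\; \text{condition 1 } \big(\|\nabla f(x)\| \ge \alpha\big) \text{ fails}, \\
\lambda_{\min}\big(\nabla^2 f(x)\big) \ge -\sqrt{\varepsilon} > -\beta
  &\;\Longrightarrow\; \text{condition 2 } \big(\lambda_{\min}(\nabla^2 f(x)) \le -\beta\big) \text{ fails},
\end{aligned}
```

where $\sqrt{\varepsilon} < \beta$ uses $\varepsilon < \beta^2$. Since at least one of the three conditions must hold at every point, condition 3 holds, so $x$ is γ-close to a local minimum.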

Algorithm 1 FastCubic($f$, $x_0$, $\varepsilon$, $L$, $L_2$)
Input: $f(x)$ that satisfies (2.1) with $L$ and $L_2$; a starting vector $x_0$; a target accuracy ε.
1: $\kappa \leftarrow \big(\frac{900}{\varepsilon L_2}\big)^{1/2}$.
2: for $t = 0$ to $\infty$ do
3:   $m_t(h) \triangleq \nabla f(x_t)^\top h + \frac{1}{2} h^\top \nabla^2 f(x_t) h + \frac{L_2}{6} \|h\|^3$
4:   $(\lambda, v, v_{\min}) \leftarrow$ FastCubicMin$(\nabla f(x_t), \nabla^2 f(x_t), L, L_2, \kappa)$
5:   $h' \leftarrow$ either $v$ or λv min, whichever gives the smaller value for $m_t(h)$;
6:   Set $x_{t+1} \leftarrow x_t + h'$
7:   if $m_t(h') > -\frac{\varepsilon^{3/2}}{c \sqrt{L_2}}$ then return $x_{t+1}$.   (here $c$ is a constant; we proved $c = 2.4 \times 10^6$ works)
8: end for

2.1 Main Results

The finite-sum setting captures much of supervised learning, including Neural Networks and Generalized Linear Models. The main theorem which we show in our paper is as follows:

Theorem 1. FastCubic (Algorithm 1) starts from a point $x_0$ and outputs a point $x$ such that

$$\|\nabla f(x)\| \le \varepsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{L_2 \varepsilon}$$

in total time (denoting by $D \triangleq f(x_0) - f(x^*)$)

$$\tilde{O}\Big(\frac{D \sqrt{L_2}}{\varepsilon^{3/2}} \cdot T_g + \frac{D L^{1/2} L_2^{1/4}}{\varepsilon^{7/4}} \cdot T_h\Big),$$

or

$$\tilde{O}\Big(\frac{D \sqrt{L_2}}{\varepsilon^{3/2}} \cdot (T_g + n T_{h,1}) + \frac{D n^{3/4} L^{1/2} L_2^{1/4}}{\varepsilon^{7/4}} \cdot T_{h,1}\Big)$$

in the finite-sum setting (see Definition 2.2). Here $\tilde{O}$ hides logarithmic factors in $L$, $L_2$, $1/\varepsilon$, $d$, and in $\max_x \{\|\nabla f(x)\|\}$.

Two Known Subroutines. Our running time of FastCubic relies on the following recent results for approximate matrix inverse and approximate PCA:

Theorem 2.4 (Approximate Matrix Inverse). Suppose matrix $M \in \mathbb{R}^{d \times d}$ satisfies $\|M\| \le L$ and $\lambda I + M \succeq \delta I$ for constants $\lambda, \delta, L > 0$. Let $\kappa \triangleq \frac{\lambda + L}{\delta}$. Then, we can compute a vector $x$ satisfying

$$\|x - (\lambda I + M)^{-1} b\| \le \varepsilon \|b\| \qquad (2.2)$$

using accelerated gradient descent (AGD) in $O\big(\kappa^{1/2} \log(\kappa/\varepsilon)\big)$ iterations, each requiring $O(d)$ time plus the time needed to multiply $M$ with a vector.

Moreover, suppose $M = \frac{1}{n} \sum_{i=1}^{n} M_i$ where each $M_i$ is symmetric and satisfies $\|M_i\| \le L$. If $M_i b$ can be computed in time O(d ) for each $i$ and vector $b$, then accelerated SVRG [4, 33] computes a vector $x$ that satisfies equation (2.2) in time O ( max{n, n 3/4 κ / } d log (κ/ε) ).

We refer to the running time for this computation as $T_{\mathrm{inverse}}(\kappa, \varepsilon)$ and the algorithm as $A$. Above, the SVRG-based running time shall be used only towards our finite-sum case in Definition 2.2.

Theorem 2.5 (AppxPCA [3, 3, 4]).
Let $M \in \mathbb{R}^{d \times d}$ be a symmetric matrix with eigenvalues $1 \ge \lambda_1 \ge \cdots \ge \lambda_d \ge 0$. With probability at least $1 - p$, AppxPCA produces a unit vector $w$ satisfying $w^\top M w \ge (1 - \delta)(1 - \varepsilon) \lambda_{\max}(M)$. The total running time is $\tilde{O}(T_{\mathrm{inverse}}(1/\delta, \varepsilon \delta))$.

3 Our Fast Cubic Regularization Algorithm

Recall that the cubic regularization method of Nesterov and Polyak [8] studies the following upper bound on the change in objective value as we move from a point $x_t$ to $x_t + h$: (it follows simply

from the Taylor series truncated to the third order)

$$\forall h \in \mathbb{R}^d : \quad f(x_t + h) - f(x_t) \le m_t(h) \triangleq \nabla f(x_t)^\top h + \frac{1}{2} h^\top \nabla^2 f(x_t) h + \frac{L_2}{6} \|h\|^3. \qquad (3.1)$$

Denote by $h^*$ an arbitrary minimizer of $m_t(h)$. We propose in this paper a subroutine FastCubicMin to minimize $m_t(h)$ approximately. Note that FastCubicMin returns two vectors $v$ and $v_{\min}$. We then choose $h'$ to be either $v$ or λv min, whichever gives a smaller value for $m_t(h)$.

Before discussing the details of FastCubicMin, let us first state a main theorem for FastCubicMin (see footnote 3):

Theorem 2 (Guarantees of FastCubicMin). The algorithm FastCubicMin finds a vector $h'$ that satisfies:
(a) It produces a vector $h'$ satisfying $m_t(h') \le 0$ and either $3000\, m_t(h') \le m_t(h^*)$ or $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800 \sqrt{L_2}}$.
(b) If $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{300 \sqrt{L_2}}$, then $\|h'\| \le \|h^*\| + \frac{\sqrt{\varepsilon}}{4 \sqrt{L_2}}$ and $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$.
(c) FastCubicMin runs in time (using $\tilde{O}$ to hide logarithmic factors in $L$, $L_2$, $1/\varepsilon$, $d$, $\|\nabla f(x_t)\|$):
  - $\tilde{O}\Big(\frac{L^{1/2}}{(\varepsilon L_2)^{1/4}} \cdot T_h\Big)$, where $T_h$ is the time to multiply $\nabla^2 f(x_t)$ with a vector;
  - $\tilde{O}\Big(\max\Big\{n, \frac{n^{3/4} L^{1/2}}{(\varepsilon L_2)^{1/4}}\Big\} \cdot T_{h,1}\Big)$, where $T_{h,1}$ is the time to multiply $\nabla^2 f_i(x_t)$ with a vector.

Above, the first guarantee promises that we are either done (because $m_t(h^*)$ is close to zero), or we obtain a 1/3000 multiplicative approximation to $m_t(h^*)$. Our second guarantee in Theorem 2 promises that when we are done (because $m_t(h^*)$ is close to zero), the output vector $h'$ and $h^*$ are roughly similar in Euclidean norm and $h'$ has a small gradient $\|\nabla m_t(h')\|$. Our third guarantee gives the time complexity of FastCubicMin.

Now, our final algorithm FastCubic for finding the ε-approximate local minimum of $f(x)$ is included in Algorithm 1. It simply iteratively calls FastCubicMin to find an approximate minimizer, and it then stops whenever $m_t(h') > -\frac{\varepsilon^{3/2}}{c \sqrt{L_2}}$ for some large constant $c$.

Roadmap. In Section 4 we show why Theorem 2 implies Theorem 1. All the remaining sections are for the purpose of proving Theorem 2. Because our FastCubicMin is very technical, instead of stating what the algorithm is right away, we decide to take a different path.
In Section 5, we first state a lemma characterizing what $h^*$ looks like. In Section 6, we provide a set of sufficient conditions which look similar to the characterization of $h^*$, and show that as long as these conditions are met, Theorem 2-a and 2-b follow easily. Finally, in Section 7, we state FastCubicMin and explain why it satisfies these sufficient conditions and why it runs in the aforementioned time.

Footnote 3: To present the simplest result, we have not tried to improve the constant dependency in this paper.

4 Theorem 2 implies Theorem 1

In this section, we show that Theorem 2 implies Theorem 1. The argument relies on Lemma 4.1 (proved in Appendix B), which gives a sufficient condition for reaching an ε-approximate local minimum.
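The outer loop of Algorithm 1 can be sketched in one dimension, where the cubic model has a closed-form minimizer. This is an illustration under stated assumptions, not the paper's method: the exact 1-D solve stands in for FastCubicMin, the stopping constant `c=100.0` is a hypothetical choice that happens to suffice in this 1-D exact setting (it is not the paper's constant), and the test function $f(x) = -\cos x$ is chosen because it has $L = L_2 = 1$.

```python
import math

def cubic_step_1d(g, H, L2):
    """Exact minimizer of m(h) = g*h + H*h^2/2 + (L2/6)*|h|^3 in one dimension.

    Solves (H + lam) * h = -g with |h| = 2 * lam / L2 and H + lam >= 0,
    which gives lam = (-H + sqrt(H^2 + 2 * L2 * |g|)) / 2.
    Returns the pair (h, m(h)).
    """
    lam = 0.5 * (-H + math.sqrt(H * H + 2.0 * L2 * abs(g)))
    if H + lam > 0.0:
        h = -g / (H + lam)
    else:
        # g == 0 and H <= 0: both signs are minimizers; take the positive one.
        h = 2.0 * lam / L2
    m = g * h + 0.5 * H * h * h + (L2 / 6.0) * abs(h) ** 3
    return h, m

def cubic_regularization_1d(grad, hess, x0, eps, L2, c=100.0, max_iter=1000):
    """Repeat x <- x + h until the model decrease is below eps^1.5 / (c * sqrt(L2))."""
    x = x0
    for _ in range(max_iter):
        h, m = cubic_step_1d(grad(x), hess(x), L2)
        x = x + h
        if m > -eps ** 1.5 / (c * math.sqrt(L2)):
            return x
    return x

# f(x) = -cos(x): f'(x) = sin(x), f''(x) = cos(x). The start x0 = pi is a local
# maximum, so the gradient is (numerically) zero and gradient descent stalls there.
x_final = cubic_regularization_1d(math.sin, math.cos, math.pi, eps=1e-4, L2=1.0)
```

Starting from the critical point $x_0 = \pi$, the first step uses only negative curvature (the model minimizer has $|h| = 2$), after which the iterates approach the local minimum at $2\pi$.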

Lemma 4.1. If $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800 \sqrt{L_2}}$ and $h'$ is an approximate minimizer of $m_t(h)$ satisfying $\|h'\| \le \|h^*\| + \frac{\sqrt{\varepsilon}}{4 \sqrt{L_2}}$ and $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, then we have that $\|\nabla f(x_t + h')\| \le \varepsilon$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\sqrt{L_2 \varepsilon}$.

Proof of Theorem 1 from Theorem 2. When FastCubic terminates, we have $m_t(h') > -\frac{\varepsilon^{3/2}}{c \sqrt{L_2}}$; therefore, it satisfies $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800 \sqrt{L_2}}$ according to Theorem 2-a. Combining this with Theorem 2-b and Lemma 4.1, we conclude that in the last iteration of FastCubic, our output satisfies $\|\nabla f(x_t + h')\| \le \varepsilon$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\sqrt{L_2 \varepsilon}$. This finishes the proof with respect to the accuracy conditions.

As for the running time, in every iteration except for the last one, FastCubic satisfies $m_t(h') \le -\Omega\big(\varepsilon^{3/2}/\sqrt{L_2}\big)$. Therefore by (3.1), we must have decreased the objective by at least $\Omega\big(\varepsilon^{3/2}/\sqrt{L_2}\big)$ in this round, and this cannot happen for more than $O\big((f(x_0) - f^*) \sqrt{L_2}/\varepsilon^{3/2}\big)$ iterations. The final running time of FastCubic follows from this bound together with Theorem 2-c.

Therefore, in the rest of the paper it suffices to study FastCubicMin and prove Theorem 2.

5 Characterization Lemma of the Minimizer $h^*$

For notational simplicity, in this and the subsequent sections we focus on the following problem:

$$\text{minimize} \quad m(h) \triangleq g^\top h + \frac{1}{2} h^\top H h + \frac{L_2}{6} \|h\|^3,$$

where $H$ is a symmetric matrix with $\|H\| \le L$. Recall from the previous section that we have denoted by $h^*$ an arbitrary minimizer of $m(h)$. We have the following lemma which characterizes $h^*$ (a variant of this lemma has appeared in [8], and we prove it in the appendix for the sake of completeness):

Lemma 5.1. We have that $h^*$ is a minimizer of $m(h)$ if and only if there exists $\lambda^* \ge 0$ such that

$$H + \lambda^* I \succeq 0, \qquad (H + \lambda^* I) h^* = -g, \qquad \|h^*\| = \frac{2 \lambda^*}{L_2}.$$

The objective value in this case is given by

$$m(h^*) = -\frac{1}{2} g^\top (H + \lambda^* I)^{+} g - \frac{2 (\lambda^*)^3}{3 L_2^2} \le 0.$$

The following corollary comes from Lemma 5.1 and its proof:

Corollary 5.2. The value $\lambda^*$ in Lemma 5.1 is unique, and for every λ satisfying $H + \lambda I \succ 0$, we have

$$\|(H + \lambda I)^{-1} g\| > \frac{2 \lambda}{L_2} \iff \lambda^* > \lambda \qquad \text{and} \qquad \|(H + \lambda I)^{-1} g\| < \frac{2 \lambda}{L_2} \iff \lambda^* < \lambda.$$

In the above characterization, we have a crude upper bound on $\lambda^*$:

Proposition 5.3. We have λ B max { + g, } with $\lambda^*$ defined in Lemma 5.1.
Proof. We have $\|(H + B I)^{-1} g\| \le \frac{\|g\|}{\lambda_{\min}(H + B I)} \le \frac{\|g\|}{B - L} < \frac{2B}{L_2}$ and therefore $\lambda^* \le B$ due to Corollary 5.2.
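Lemma 5.1 and Corollary 5.2 suggest a simple way to locate $\lambda^*$ numerically: the sign of $\|(H + \lambda I)^{-1} g\| - 2\lambda/L_2$ tells on which side of $\lambda^*$ a guess lies. The sketch below is an illustration only. It assumes $H$ is diagonal (so the inverse is elementwise), assumes $g \ne 0$ with a nonzero component along the smallest eigenvalue (the "easy case"), and grows the upper bracket by doubling instead of using the bound $B$ of Proposition 5.3. The function names are hypothetical.

```python
import math

def solve_shifted(h_diag, g, lam):
    """Return -(H + lam*I)^{-1} g for a diagonal H (requires h_i + lam > 0)."""
    return [-gi / (hi + lam) for hi, gi in zip(h_diag, g)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def find_lambda_star(h_diag, g, L2, iters=200):
    """Bisection on lam for ||(H + lam*I)^{-1} g|| = 2*lam/L2 (easy case only)."""
    min_h = min(h_diag)
    lo = max(0.0, -min_h)            # H + lam*I must be positive semidefinite
    hi = 2.0 * lo + 1.0              # strictly inside the positive definite range
    while norm(solve_shifted(h_diag, g, hi)) >= 2.0 * hi / L2:
        hi *= 2.0                    # grow until hi is above lam*
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + min_h <= 0.0:
            lo = mid                 # H + mid*I is singular, so mid is too small
        elif norm(solve_shifted(h_diag, g, mid)) > 2.0 * mid / L2:
            lo = mid                 # Corollary 5.2: lam* > mid
        else:
            hi = mid                 # Corollary 5.2: lam* <= mid
    return hi

# Example: H = diag(-1, 2) is indefinite, so lam* must exceed 1.
lam_star = find_lambda_star([-1.0, 2.0], [0.5, 1.0], L2=1.0)
h_star = solve_shifted([-1.0, 2.0], [0.5, 1.0], lam_star)
```

The resulting `h_star` satisfies $(H + \lambda^* I) h^* = -g$ by construction, and the bisection enforces $\|h^*\| = 2\lambda^*/L_2$ up to floating-point accuracy, so the gradient $g + H h + \frac{L_2}{2} \|h\| h$ of the cubic model vanishes at it.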

6 Sufficient Conditions for Theorem 2-a and 2-b

Without worrying about the design of FastCubicMin at this moment, let us first state a set of sufficient conditions under which the assumptions in Theorem 2-a can be satisfied.

Main Lemma 1. Consider an algorithm that outputs a real λ [0, B], a vector $v \in \mathbb{R}^d$, and a unit vector $v_{\min} \in \mathbb{R}^d$. Additionally, suppose numbers $\kappa, \varepsilon \ge 0$ satisfy the following conditions:

ε 0000 (max {κ,,, g, (H + λi), B}) 0 (6.1)
(H + (λ ε)I) 0 (6.2)

Moreover, suppose that the outputs $(\lambda, v, v_{\min})$ satisfy one of the following two cases:

Case 1: $\|(H + \lambda I)^{-1} g\| \in \big[\frac{2\lambda}{L_2} - \varepsilon, \frac{2\lambda}{L_2} + \varepsilon\big]$ and $\|v + (H + \lambda I)^{-1} g\| \le \varepsilon$.

Case 2: The following conditions are satisfied:
(a) $\lambda \ge \lambda^*$ and $\lambda + \lambda_{\min}(H) \le \frac{1}{\kappa}$
(b) $\|(H + \lambda I)^{-1} g\| \le \frac{2\lambda}{L_2}$ and $\|v + (H + \lambda I)^{-1} g\| \le \varepsilon$
(c) vmin Hv min λ min (H) + 0κ

Then, at least one of the two choices h { v, λv } min satisfies either m(h ) 3000m(h ) or m(h ) 3 κ 3.

Let us compare such sufficient conditions to the characterization Lemma 5.1. In Case 1, up to a very small error ε, we have essentially found a vector $v$ that satisfies $v \approx -(H + \lambda I)^{-1} g$ and $\|v\| \approx \frac{2\lambda}{L_2}$. Therefore, this $v$ should be close to $h^*$ for obvious reasons. (This is the simple case.) In Case 2, we have only found a vector $v$ that satisfies $v \approx -(H + \lambda I)^{-1} g$ and $\|v\| \le \frac{2\lambda}{L_2}$. In this case, we also compute an approximate lowest eigenvector $v_{\min}$ of $H$ up to an additive /0κ accuracy (see Case 2-c). We will make sure that, as long as the conditions in 2-a hold, either $v$ or λv min will be an approximate minimizer for $m(h)$. (This is the hard case.)

Proof of Main Lemma 1. We first consider Case 1. According to Corollary 5.2, if ε = 0 then $v$ is a minimizer of $m(h)$. The following claim extends this argument to the setting when ε > 0:

Claim 6.1. If λ and $v$ satisfy Case 1 and ε satisfies (6.1), then m(v) m(h ) + 50κ 3

From the above claim it follows that either m(h ) 8 κ 3, or otherwise m(h ).m(v), which satisfies the conditions of the theorem.

We now consider Case 2, and in this case we make the following two claims:

Claim 6.2.
If λ min (H) κ then m(h ) 500 min { m(v), m ( λvmin ) }. 500κ 3

Claim 6.3. If λ min (H) κ then m(h ) m(v) 6 κ 3.

The lemma now follows from the two claims because we can output the vector $h'$ which has the lowest value of $m(h')$ amongst the two choices h { v, λ v } min. This satisfies either m(h ) 3000m(h ) or m(h ) 3. κ 3

The missing proofs of the three claims are deferred to Appendix D.

The next main lemma shows that, under the same sufficient conditions as Main Lemma 1, we also have that Theorem 2-b holds. (Its proof is contained in Appendix E.)

Main Lemma 2. In the same setting as Main Lemma 1, suppose $m(h^*) \ge -\frac{\varepsilon^{3/2}}{300 \sqrt{L_2}}$. Then the output vector $v$ satisfies the following conditions: v h + 3 and m(v) ε κ 4 + 5 κ.

7 Main Algorithms for Theorem 2

We are now ready to state our main algorithm FastCubicMin and sketch why it satisfies the sufficient conditions in Main Lemma 1.

As described in Algorithm 2, our algorithm starts with a very large choice λ 0 B and decreases it gradually. At each iteration $i$, it computes an approximate inverse $v$ satisfying $\|v + (H + \lambda_i I)^{-1} g\| \le \varepsilon$ with respect to the current $\lambda_i$. Then there are three cases, depending on whether $\|v\|$ is approximately equal to, larger than, or smaller than $\frac{2 \lambda_i}{L_2}$. At a high level, if it is equal, then we have met Case 1 in Main Lemma 1; if it is larger, then we can binary search the correct value of $\lambda^*$ in the interval $[\lambda_i, \lambda_{i-1}]$; and if it is smaller, then we need to compute an approximate eigenvector and carefully choose the next point $\lambda_{i+1}$.

We state our main lemma below regarding the correctness and running time of FastCubicMin.

Main Lemma 3. FastCubicMin in Algorithm 2 outputs a real λ [0, B], a vector $v \in \mathbb{R}^d$, and a unit vector $v_{\min} \in \mathbb{R}^d$ satisfying one of the two sufficient conditions in Main Lemma 1. We also have that the procedure can be implemented in a total running time of
- $\tilde{O}(\sqrt{L \kappa} \cdot T_h)$ if accelerated gradient descent is used in Theorem 2.4 to invert matrices;
- $\tilde{O}(\max\{n, n^{3/4} \sqrt{L \kappa}\} \cdot T_{h,1})$ if we use accelerated SVRG as the subprocedure $A$ in Theorem 2.4.
Here $\tilde{O}$ hides logarithmic factors in $L$, $L_2$, κ, $d$, $B$.

We prove the correctness half of Main Lemma 3, and defer its running time analysis to Appendix G.

7.1 Correctness Half of Main Lemma 3

We will now establish the correctness of our algorithm. We first observe that the BinarySearch subroutine returns (λ, v, ) that satisfies Case 1 of Main Lemma 1.

Fact 7.1.
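BinarySearch outputs a pair λ and $v$ such that $\|(H + \lambda I)^{-1} g\| \in \big[\frac{2\lambda}{L_2} - \varepsilon, \frac{2\lambda}{L_2} + \varepsilon\big]$ and $\|v + (H + \lambda I)^{-1} g\| \le \varepsilon$.

Proof. The latter is guaranteed by Line 3 in BinarySearch, and the former is implied by the latter because $\|(H + \lambda I)^{-1} g\| \in \big[\|v\| - \varepsilon/2, \|v\| + \varepsilon/2\big] \subseteq \big[\frac{2\lambda}{L_2} - \varepsilon, \frac{2\lambda}{L_2} + \varepsilon\big]$.

We also establish the following invariants regarding the values $\lambda_i$. (Proof in Appendix F.)

Lemma 7.1. The following statements hold for all $i$ until FastCubicMin terminates:
(a) λ i [0, B], λ i + λ max (H) 3B
(b) λ i + λ min (H) 3 0κ
(c) $\lambda_{i+1} + \lambda_{\min}(H) \ge \frac{3}{4} (\lambda_i + \lambda_{\min}(H))$ unless $\lambda_{i+1} = 0$
Moreover, when FastCubicMin terminates at Line 20 we have $\lambda_i + \lambda_{\min}(H) \le \frac{1}{\kappa}$.

We now prove that the output $(\lambda, v, v_{\min})$ of FastCubicMin satisfies the sufficient conditions of Main Lemma 1.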

Algorithm 2 FastCubicMin($g$, $H$, $L$, $L_2$, κ) (main algorithm for cubic minimization)
Input: $g$ a vector, $H$ a symmetric matrix, parameters κ, $L$ and $L_2$ which satisfy $-L I \preceq H \preceq L I$.
Output: $(\lambda, v, v_{\min})$
1: B + ( g + κ.
2: ε / 0000 ( max {, g, 3κ 0, B, }) 0
3: λ 0 B.
4: for $i = 0$ to $\infty$ do
5:   Compute $v$ such that $\|v + (H + \lambda_i I)^{-1} g\| \le \varepsilon$.
6:   if $\|v\| \in \big[\frac{2 \lambda_i}{L_2} - \varepsilon, \frac{2 \lambda_i}{L_2} + \varepsilon\big]$ then
7:     return (λ i, v, ).
8:   else if $\|v\| > \frac{2 \lambda_i}{L_2} + \varepsilon$ then
9:     return BinarySearch(λ = λ i, λ = λ i, ε).
10:  else if $\|v\| < \frac{2 \lambda_i}{L_2} - \varepsilon$ then
11:    Let Power Method find a vector $w$ that is a 9/0-appx leading eigenvector of $(H + \lambda_i I)^{-1}$: 9 0 λ max((H + λ i I) ) w (H + λ i I) w λ max ((H + λ i I) ).
12:    Compute a vector w such that w (H + λ i I) w ˆε w w ˆε.
13:    (assignment not recoverable from the source; a stray "60B." appears near this step)
14:    if > κ then
15:      λi+ λ i.
16:      if λi+ > 0 then λ i+ λ i+ else λ i+ 0
17:    else
18:      Use AppxPCA to find any unit vector $v_{\min}$ such that vmin Hv min λ min (H) + 0κ.
19:      Flip the sign of $v_{\min}$ so that $g^\top v_{\min} \le 0$.
20:      return $(\lambda_i, v, v_{\min})$.
21:    end if
22:  end if
23: end for

Algorithm 3 BinarySearch(λ, λ, ε) (binary search subroutine)
Input: λ λ, (H + λ I) g λ, (H + λ I) g λ, λ + λ min (H) > 0
Output: (λ, v, )
1: for $t = 1$ to $\infty$ do
2:   λ mid λ +λ
3:   Compute vector $v$ such that $\|v + (H + \lambda_{\mathrm{mid}} I)^{-1} g\| \le \varepsilon/2$
4:   if v [λ mid ε, λ mid + ε] then
5:     return (λ mid, v, )
6:   else if v + ε λ mid then
7:     λ λ mid
8:   else if v ε λ mid then
9:     λ λ mid
10:  end if
11: end for
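Line 5 of Algorithm 2 only needs an approximate solve with $H + \lambda_i I$, which Theorem 2.4 obtains from matrix-vector products alone. The following is a minimal sketch of that idea, not the paper's implementation: it uses the standard constant-momentum form of accelerated gradient descent on the quadratic $\frac{1}{2} x^\top (\lambda I + M) x - b^\top x$, with a fixed iteration count chosen by the caller rather than the $O(\kappa^{1/2} \log(\kappa/\varepsilon))$ schedule. The name `agd_inverse` and its signature are hypothetical.

```python
import math

def agd_inverse(matvec, b, lam, L, delta, iters):
    """Approximate (lam*I + M)^{-1} b using only products v -> M v.

    Assumes ||M|| <= L and lam*I + M >= delta*I, as in Theorem 2.4.
    """
    beta = lam + L                          # upper bound on eigenvalues of lam*I + M
    kappa = beta / delta                    # condition number bound
    momentum = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    x = [0.0] * len(b)
    y = list(x)
    for _ in range(iters):
        My = matvec(y)
        # Gradient of the quadratic at y: (lam*I + M) y - b.
        grad = [lam * yi + mi - bi for yi, mi, bi in zip(y, My, b)]
        x_next = [yi - gi / beta for yi, gi in zip(y, grad)]
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_next, x)]
        x = x_next
    return x

# Example with an indefinite symmetric M (eigenvalues about 0.56 and -0.96).
M = [[-0.9, 0.3], [0.3, 0.5]]

def matvec(v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

x_approx = agd_inverse(matvec, [1.0, -2.0], lam=1.0, L=1.0, delta=0.03, iters=600)
```

To obtain the vector $v$ of Line 5 one would call the routine with `b` set to $-g$ and `lam` set to $\lambda_i$. Each iteration costs one product with $M$ plus $O(d)$ arithmetic.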

Correctness Proof of Main emma 3. We carefully verify these sufficient conditions: emma 7. implies λ i [0, B]. λ i + λ min (H) 3 0κ from emma 7. implies (H + λ ii) 4κ. It is now immediate that the choice of ε on ine satisfies the Condition (6.) in the assumption of Main emma. Since ε 0κ and λ i + λ min (H) 3 0κ it follows that (H + (λ i ε)i) 0 which proves Condition (6.) in Main emma. We now verify Case and in the assumption of Main emma. At the beginning of the algorithm, our choice λ 0 = B ensures (using Proposition 5.3) that (H + λ 0 I) g < λ 0. et us now consider the various places where the algorithm outputs: If FastCubicMin terminates at ine 7, then we have v + (H + λ i I) g ε and additionally (H + λ i I) g [ v ε, v + ε ] [λ i ε, λ i + ε]. Therefore, the output meets Case requirement of Main emma with λ = λ i. If FastCubicMin terminates at ine 9, then (H + λ i I) g > v ε λ i. Obviously, we must have i in this case because (H + λ 0 I) g < λ 0. Therefore, ine 0 must have been reached at the previous iteration, so it implies (H + λ i I) g < λ i. Together, these two imply that we can call BinarySearch with (λ i, λ i ). Owing to Fact 7., the subroutine outputs a pair (λ, v) satisfying the Case requirement of Main emma. If FastCubicMin terminates on ine 0, we verify that Case of Main emma with λ = λ i holds. We first have (H + λ i I) g v + ε λ i. By Corollary 5., we also have that λ i λ. emma 7. tells us λ i satisfies λ i +λ min (H) κ. Vector v satisfies v + (H + λ i I) g ε. Vector v min satisfies v min Hv min λ min (H) + 0κ. In sum, we have verified that all the assumptions of Main emma hold. Final Proof of Theorem. Theorem is a direct corollary of our main lemmas. Main emma 3 ensures that the assumptions of Main emma and Main emma both hold. Now, using the special choice of κ in FastCubic, Theorem -a immediately comes from Main emma ; Theorem -b immediately comes from Main emma ; and Theorem -c immediately comes from Main emma 3. 
This finishes the proof of the theorem.

Acknowledgements

We thank Ben Recht for very helpful suggestions and corrections to a previous version. Z. Allen-Zhu is supported by an NSF Grant, no. CCF-1412958, and a Microsoft Research Grant, no. 0518584. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft.

Appendix

A Computing Hessian-Vector Product in Linear Time

In this section we sketch the intuition for why Hessian-vector products can be computed in linear time in many interesting problems, especially in machine learning. We start by showing that the gradient can be computed in linear time. The algorithm is often referred to as back-propagation,

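which dates back to Werbos's PhD thesis [35], and has been popularized by Rumelhart et al. [3] for training neural networks.

Claim A.1 (back-propagation, informally stated). Suppose a real-valued function $f : \mathbb{R}^d \to \mathbb{R}$ can be evaluated by a differentiable circuit of size $N$. Then the gradient $\nabla f$ can be computed in time $O(N + d)$ (using a circuit of size $O(N + d)$). (Technically, we assume that the gradient of each gate can be computed in $O(1)$ time.)

The claim follows from simple induction and the chain rule, and is left to the reader. In the training of neural networks, the size of the circuit that computes the objective $f$ is often proportional to (or equal to) the number of parameters $d$. Thus the gradient $\nabla f$ can be computed in time $O(d)$ using a circuit of size $O(d)$.

Next, we consider computing $\nabla^2 f(x) \cdot v$ where $v \in \mathbb{R}^d$. Let $g(x) := \langle \nabla f(x), v \rangle$ be a function from $\mathbb{R}^d$ to $\mathbb{R}$. Then it suffices to compute the gradient of $g$, since $\nabla^2 f(x) \cdot v = \nabla g(x)$. We observe that $g(x)$ can be evaluated in linear time using a circuit of size $O(d)$, since we have shown that $\nabla f(x)$ can. Thus, applying Claim A.1 again to the function $g$ (here we assume that the original circuits are twice differentiable), we conclude that $\nabla g(x)$ can also be computed in linear time.

B Proof of Lemma B.1 and Corollary 4.

Lemma B.1. For all $h' \in \mathbb{R}^d$, it holds that

$$\|\nabla f(x_t + h')\| \le L_2 \|h'\|^2 + \|\nabla m_t(h')\| \quad\text{and}\quad \lambda_{\min}\big(\nabla^2 f(x_t + h')\big) \ge -\Big(\tfrac{3 L_2^2}{2} \max\{0, -m_t(h^*)\}\Big)^{1/3} - L_2 \|h'\| .$$

Proof of Lemma B.1. Let us denote $g = \nabla f(x_t)$ and $H = \nabla^2 f(x_t)$ in this proof. We begin by proving the first-order condition. Note that $\nabla m_t(h) = g + Hh + \frac{L_2}{2}\|h\| h$. Recall that $h^*$ is a minimizer of $m_t(h)$. The characterization result in Lemma 5.1 shows $H + \frac{L_2}{2}\|h^*\| I \succeq 0$, and thus

$$g^\top h^* + (h^*)^\top H h^* + \tfrac{L_2}{2}\|h^*\|^3 = \nabla m_t(h^*)^\top h^* = 0, \tag{B.1}$$

$$(h^*)^\top H h^* + \tfrac{L_2}{2}\|h^*\|^3 = (h^*)^\top \big(H + \tfrac{L_2}{2}\|h^*\| I\big) h^* \ge 0. \tag{B.2}$$

They imply

$$m_t(h^*) = g^\top h^* + \tfrac{1}{2}(h^*)^\top H h^* + \tfrac{L_2}{6}\|h^*\|^3 \overset{①}{=} -\tfrac{1}{2}(h^*)^\top H h^* - \tfrac{L_2}{3}\|h^*\|^3 \overset{②}{\le} \tfrac{L_2}{4}\|h^*\|^3 - \tfrac{L_2}{3}\|h^*\|^3 = -\tfrac{L_2}{12}\|h^*\|^3, \tag{B.3}$$

where ① uses (B.1) and ② uses (B.2).

We compute the norm of the gradient at a point $x_t + h'$ for any $h' \in \mathbb{R}^d$:

$$\begin{aligned}
\|\nabla f(x_t + h')\| &\le \|\nabla f(x_t + h') - \nabla m_t(h')\| + \|\nabla m_t(h')\| \\
&= \Big\| \nabla f(x_t) + \int_0^1 \nabla^2 f(x_t + \tau h') h' \, d\tau - \big(g + Hh' + \tfrac{L_2}{2}\|h'\| h'\big) \Big\| + \|\nabla m_t(h')\| \\
&\le \Big\| \int_0^1 \big(\nabla^2 f(x_t + \tau h') - H\big) h' \, d\tau \Big\| + \tfrac{L_2}{2}\|h'\|^2 + \|\nabla m_t(h')\| \\
&\overset{③}{\le} L_2 \|h'\|^2 \int_0^1 \tau \, d\tau + \tfrac{L_2}{2}\|h'\|^2 + \|\nabla m_t(h')\| = L_2\|h'\|^2 + \|\nabla m_t(h')\|,
\end{aligned} \tag{B.4}$$

where ③ follows from the Lipschitz continuity of the Hessian. This proves the first conclusion of the lemma.

As for the second-order condition, we first note that for all $h' \in \mathbb{R}^d$, by the Lipschitz continuity of the Hessian, we have $\|\nabla^2 f(x_t + h') - \nabla^2 f(x_t)\| \le L_2 \|h'\|$. This implies

$$\lambda_{\min}\big(\nabla^2 f(x_t + h')\big) \ge \lambda_{\min}\big(\nabla^2 f(x_t)\big) - L_2\|h'\|, \tag{B.5}$$

because if two symmetric matrices $A$ and $B$ satisfy $\|A - B\| \le p$, then $\lambda_{\min}(A) \ge \lambda_{\min}(B) - p$ as well. We consider two cases. If $\lambda_{\min}(\nabla^2 f(x_t)) \ge 0$, then we have

$$\lambda_{\min}\big(\nabla^2 f(x_t + h')\big) \ge -L_2\|h'\|. \tag{B.6}$$

Otherwise, we consider the case where $\lambda_{\min}(\nabla^2 f(x_t)) = \lambda_{\min}(H) < 0$. Let $\nu_d$ be the normalized eigenvector corresponding to $\lambda_{\min}(H)$, and define $\tilde h := \mathrm{sign}(g^\top \nu_d) \frac{2\lambda_{\min}(H)}{L_2} \nu_d$, so that $g^\top \tilde h \le 0$. We calculate $m_t(\tilde h)$ as follows:

$$\begin{aligned}
m_t(\tilde h) &= g^\top \tilde h + \tfrac{1}{2}\tilde h^\top H \tilde h + \tfrac{L_2}{6}\|\tilde h\|^3 \le \tfrac{1}{2}\tilde h^\top H \tilde h + \tfrac{L_2}{6}\|\tilde h\|^3 \\
&= \frac{2(\lambda_{\min}(H))^2}{L_2^2}\, \nu_d^\top H \nu_d + \frac{4|\lambda_{\min}(H)|^3}{3L_2^2} \overset{①}{=} \frac{2(\lambda_{\min}(H))^3}{L_2^2} + \frac{4|\lambda_{\min}(H)|^3}{3L_2^2} \overset{②}{=} \frac{2(\lambda_{\min}(H))^3}{3L_2^2},
\end{aligned} \tag{B.7}$$

where ① uses $\nu_d^\top H \nu_d = \lambda_{\min}(H)$, and ② uses the assumption that $\lambda_{\min}(H) < 0$. Since by definition $m_t(h^*) \le m_t(\tilde h)$, we can deduce from inequality (B.7) that

$$\lambda_{\min}\big(\nabla^2 f(x_t)\big) = \lambda_{\min}(H) \ge -\Big(\frac{3L_2^2\,|m_t(h^*)|}{2}\Big)^{1/3}. \tag{B.8}$$

Now we put together inequalities (B.5) and (B.8), and obtain

$$\lambda_{\min}\big(\nabla^2 f(x_t + h')\big) \ge -\Big(\frac{3L_2^2\,|m_t(h^*)|}{2}\Big)^{1/3} - L_2\|h'\|. \tag{B.9}$$

Combining (B.6) and (B.9) we finish the proof of Lemma B.1.

Corollary 4. If $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800\sqrt{L_2}}$ and $h'$ is an approximate minimizer of $m_t(h)$ satisfying $\|h'\| \le \|h^*\| + \frac{1}{4}\sqrt{\varepsilon/L_2}$ and $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, then we have $\|\nabla f(x_t + h')\| \le \varepsilon$ and $\lambda_{\min}(\nabla^2 f(x_t + h')) \ge -\sqrt{L_2 \varepsilon}$.

Proof of Corollary 4. First of all, our assumption $m_t(h^*) \ge -\frac{\varepsilon^{3/2}}{800\sqrt{L_2}}$, along with inequality (B.3), gives $\|h^*\|^3 \le \frac{12}{800}(\varepsilon/L_2)^{3/2}$ and hence $\|h^*\| \le \frac{1}{4}\sqrt{\varepsilon/L_2}$. This, together with our assumption on $h'$, implies $\|h'\| \le \frac{1}{2}\sqrt{\varepsilon/L_2}$. Since we also assume $\|\nabla m_t(h')\| \le \frac{\varepsilon}{2}$, we have from Lemma B.1 that

$$\|\nabla f(x_t + h')\| \le L_2\|h'\|^2 + \|\nabla m_t(h')\| \le \frac{\varepsilon}{4} + \frac{\varepsilon}{2} \le \varepsilon.$$

For the second-order condition, we can again apply Lemma B.1 to get

$$\lambda_{\min}\big(\nabla^2 f(x_t + h')\big) \ge -\Big(\tfrac{3L_2^2}{2}\max\{0, -m_t(h^*)\}\Big)^{1/3} - L_2\|h'\| \ge -\Big(\frac{3 L_2^{3/2}\varepsilon^{3/2}}{1600}\Big)^{1/3} - \frac{1}{2}\sqrt{L_2\varepsilon} \ge -\sqrt{L_2\varepsilon}.$$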
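The constants in the proof of the corollary above can be checked numerically. The short Python sketch below evaluates the worst case allowed by the hypotheses, assuming the thresholds read as $m_t(h^*) \ge -\varepsilon^{3/2}/(800\sqrt{L_2})$, $\|h'\| \le \|h^*\| + \frac{1}{4}\sqrt{\varepsilon/L_2}$ and $\|\nabla m_t(h')\| \le \varepsilon/2$, together with the bound $m_t(h^*) \le -\frac{L_2}{12}\|h^*\|^3$ from (B.3). The function name and the sample values of $\varepsilon$ and $L_2$ are arbitrary illustrative choices.

```python
import math

def corollary_bounds(eps, L2):
    """Worst-case quantities in the corollary's proof, under the assumed thresholds."""
    m_star = -eps ** 1.5 / (800.0 * math.sqrt(L2))      # smallest allowed m_t(h*)
    # (B.3): m_t(h*) <= -(L2/12) ||h*||^3, hence ||h*||^3 <= 12 |m_t(h*)| / L2.
    h_star = (12.0 * abs(m_star) / L2) ** (1.0 / 3.0)
    h_prime = h_star + 0.25 * math.sqrt(eps / L2)        # largest allowed ||h'||
    # First-order bound of Lemma B.1: L2 ||h'||^2 + ||grad m_t(h')||.
    grad_bound = L2 * h_prime ** 2 + eps / 2.0
    # Second-order bound of Lemma B.1: -(3 L2^2 |m_t(h*)| / 2)^(1/3) - L2 ||h'||.
    curv_bound = -(1.5 * L2 ** 2 * abs(m_star)) ** (1.0 / 3.0) - L2 * h_prime
    return h_star, h_prime, grad_bound, curv_bound

eps, L2 = 1e-2, 3.0
h_star, h_prime, grad_bound, curv_bound = corollary_bounds(eps, L2)
```

In this worst case the gradient bound is roughly $0.75\,\varepsilon$ and the curvature bound is roughly $-0.62\sqrt{L_2\varepsilon}$, so both conclusions hold with some slack.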

C Proof of Lemma 5.1 and Corollary 5.2

We begin by proving a few lemmas that characterize the system of equations.

Lemma C.1. Consider the following system of equations/inequalities in variables $\lambda, h$:

$$H + \lambda I \succeq 0, \qquad (H + \lambda I) h = -g, \qquad \|h\| = \frac{2\lambda}{L_2}. \tag{C.1}$$

The following statements hold for any solution $(\lambda', h')$ of the above system. There is a unique value $\lambda'$ that satisfies the above equations. $\lambda'$ is such that $\lambda' \ge -\lambda_{\min}(H)$. If $\lambda' > -\lambda_{\min}(H)$, then the corresponding $h'$ is also unique and is given by $h' = -(H + \lambda' I)^{-1} g$. If $\lambda' = -\lambda_{\min}(H)$, then $g^\top v = 0$ for any vector $v$ belonging to the eigenspace corresponding to $\lambda_{\min}(H)$; consequently the corresponding $h'$ is of the form $h' = -(H + \lambda' I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$.

Proof of Lemma C.1. Note that $H + \lambda I \succeq 0$ ensures that for any solution $\lambda'$, we have $\lambda' \ge -\lambda_{\min}(H)$. Furthermore, for any $\lambda' > -\lambda_{\min}(H)$, the corresponding $h'$ is uniquely defined by $h' = -(H + \lambda' I)^{-1} g$ since $H + \lambda' I$ is invertible. If indeed $\lambda' = -\lambda_{\min}(H)$, then the equation $(H - \lambda_{\min}(H) I) h = -g$ has a solution. This implies that $g$ has no component in the null space of $H - \lambda_{\min}(H) I$, or equivalently that it has no component in the eigenspace corresponding to $\lambda_{\min}(H)$. We also have that every solution of $(H - \lambda_{\min}(H) I) h = -g$ is necessarily of the form

$$h = -(H - \lambda_{\min}(H) I)^{+} g + \gamma v$$

for some $\gamma$ and some $v$ in the lowest eigenspace of $H$.

We will now prove the uniqueness of $\lambda'$ by contradiction. Consider two distinct values $\lambda', \lambda''$ that satisfy the system (C.1). If both $\lambda', \lambda'' > -\lambda_{\min}(H)$, we get that $\|(H + \lambda' I)^{-1} g\| = \frac{2\lambda'}{L_2}$ and $\|(H + \lambda'' I)^{-1} g\| = \frac{2\lambda''}{L_2}$. Now note that $\|(H + \lambda I)^{-1} g\|$ is a strictly decreasing function over the domain $\lambda \in (-\lambda_{\min}(H), \infty)$, while $\frac{2\lambda}{L_2}$ is strictly increasing over the same domain. Therefore the above two equations cannot be satisfied for two distinct $\lambda', \lambda'' > -\lambda_{\min}(H)$, which is a contradiction.

Suppose now without loss of generality that $\lambda' = -\lambda_{\min}(H)$. Then the corresponding solution is of the form $h' = -(H + \lambda' I)^{+} g + \gamma v$ for some $\gamma$ and some $v$ in the lowest eigenspace of $H$, and $g$ has no component in the lowest eigenspace of $H$. It follows that $\|(H - \lambda_{\min}(H) I)^{+} g\| \ge \|(H + \lambda I)^{-1} g\|$ for any $\lambda > -\lambda_{\min}(H)$. By a similar argument as in the first case, the two conditions

$$\|-(H + \lambda' I)^{+} g + \gamma v\| = \frac{2\lambda'}{L_2} \quad\text{and}\quad \|(H + \lambda'' I)^{-1} g\| = \frac{2\lambda''}{L_2}$$

cannot both be satisfied for $\lambda'' > \lambda' = -\lambda_{\min}(H)$. Indeed, since $\gamma v$ is orthogonal to $(H + \lambda' I)^{+} g$, they would give $\frac{2\lambda'}{L_2} = \|h'\| \ge \|(H + \lambda' I)^{+} g\| \ge \|(H + \lambda'' I)^{-1} g\| = \frac{2\lambda''}{L_2}$, contradicting $\lambda'' > \lambda'$. This finishes the proof of Lemma C.1.

Lemma C.2. Let $(\lambda, h)$ be a solution of the system (C.1). Then we have that

$$m(h) = -\frac{1}{2} g^\top (H + \lambda I)^{+} g - \frac{2\lambda^3}{3L_2^2}.$$
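As a sanity check of the closed form in Lemma C.2, consider the one-dimensional case, where the system (C.1) can be solved explicitly. The Python sketch below assumes the cubic model $m(h) = g h + \frac{1}{2} H h^2 + \frac{L_2}{6}|h|^3$ and the norm condition $|h| = 2\lambda/L_2$; the helper names and the sample values are hypothetical.

```python
import math

def cubic_model(h, g, H, L2):
    """One-dimensional cubic model m(h) = g*h + 0.5*H*h^2 + (L2/6)*|h|^3."""
    return g * h + 0.5 * H * h * h + (L2 / 6.0) * abs(h) ** 3

def solve_system_1d(g, H, L2):
    """Solve H + lam >= 0, (H + lam) * h = -g, |h| = 2*lam/L2 for scalar H and g != 0.

    Eliminating h gives lam^2 + H*lam - L2*|g|/2 = 0; we take the positive root.
    """
    lam = 0.5 * (-H + math.sqrt(H * H + 2.0 * L2 * abs(g)))
    h = -g / (H + lam)
    return lam, h

g, H, L2 = 2.0, -1.0, 1.0            # a negative-curvature instance
lam, h = solve_system_1d(g, H, L2)
direct = cubic_model(h, g, H, L2)    # value of m at the solution of the system
formula = -0.5 * g * g / (H + lam) - 2.0 * lam ** 3 / (3.0 * L2 ** 2)
```

For these values the two quantities `direct` and `formula` agree up to floating-point error, and evaluating `cubic_model` on a grid around `h` confirms that the solution of the system is the global minimizer, as Lemma 5.1 asserts.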

Proof of Lemma C.2. By the definition of the system (C.1), any solution $(\lambda, h)$ of the system is such that there exists some $\gamma$ with $h = -(H + \lambda I)^{+} g + \gamma v_0$, where $v_0$ is in the null space of $H + \lambda I$ if it exists; otherwise $\gamma = 0$. This gives us the following:

$$m(h) = g^\top h + \tfrac{1}{2} h^\top H h + \tfrac{L_2}{6}\|h\|^3 \overset{①}{=} -\tfrac{1}{2} h^\top (H + \lambda I) h - \tfrac{\lambda}{2}\|h\|^2 + \tfrac{L_2}{6}\|h\|^3 \overset{②}{=} -\tfrac{1}{2} g^\top (H + \lambda I)^{+} g - \frac{2\lambda^3}{3L_2^2}.$$

Equality ① follows because $(H + \lambda I) h = -g$. Equality ② follows because $h = -(H + \lambda I)^{+} g + \gamma v_0$ and $\|h\| = \frac{2\lambda}{L_2}$.

Lemma 5.1. $h^*$ is a minimizer of $m(h)$ if and only if there exists $\lambda^* \ge 0$ such that

$$H + \lambda^* I \succeq 0, \qquad (H + \lambda^* I) h^* = -g, \qquad \|h^*\| = \frac{2\lambda^*}{L_2}.$$

The objective value in this case is given by

$$m(h^*) = -\frac{1}{2} g^\top (H + \lambda^* I)^{+} g - \frac{2(\lambda^*)^3}{3L_2^2} \le 0.$$

Proof of Lemma 5.1. We first compute that

$$\nabla m(h) = g + Hh + \tfrac{L_2}{2}\|h\| h \quad\text{and}\quad \nabla^2 m(h) = H + \tfrac{L_2}{2}\|h\| I + \tfrac{L_2}{2}\|h\| \Big(\frac{h}{\|h\|}\Big)\Big(\frac{h}{\|h\|}\Big)^\top .$$

For the forward direction, suppose $h^*$ is a minimizer of $m(h)$. Let $\lambda^* = \frac{L_2}{2}\|h^*\|$. Then, the necessary conditions $\nabla m(h^*) = 0$ and $\nabla^2 m(h^*) \succeq 0$ can be written as

$$g + (H + \lambda^* I) h^* = 0 \quad\text{and}\quad w^\top \Big( H + \lambda^* I + \lambda^* \Big(\frac{h^*}{\|h^*\|}\Big)\Big(\frac{h^*}{\|h^*\|}\Big)^\top \Big) w \ge 0 \quad \forall w \in \mathbb{R}^d. \tag{C.2}$$

From this we see $(H + \lambda^* I) h^* = -g$ and $\|h^*\| = \frac{2\lambda^*}{L_2}$, and the only thing left to verify is $H + \lambda^* I \succeq 0$. Note that if $h^* = 0$, then the second inequality in (C.2) directly implies $H + \lambda^* I \succeq 0$. Thus, we only need to focus on $h^* \ne 0$.

We want to show that $w^\top (H + \lambda^* I) w \ge 0$ for every $w \in \mathbb{R}^d$. Now, if $w^\top h^* = 0$ then this trivially follows from (C.2), so it suffices to focus on those $w$ that satisfy $w^\top h^* \ne 0$. Since $w$ and $h^*$ are not orthogonal, there exists $\gamma \in \mathbb{R} \setminus \{0\}$ such that $\|h^* + \gamma w\| = \|h^*\|$. (This can be done by squaring both sides and solving the resulting equation for its nonzero root $\gamma$.) Squaring both sides we have

$$2(\gamma w)^\top h^* + \gamma^2 \|w\|^2 = 0. \tag{C.3}$$

Now we bound the difference. Since $\|h^* + \gamma w\| = \|h^*\|$, the cubic terms cancel and

$$\begin{aligned}
m(h^* + \gamma w) - m(h^*) &= g^\top \big((h^* + \gamma w) - h^*\big) + \tfrac{1}{2}(h^* + \gamma w)^\top H (h^* + \gamma w) - \tfrac{1}{2}(h^*)^\top H h^* \\
&\overset{①}{=} \big(h^* - (h^* + \gamma w)\big)^\top (H + \lambda^* I) h^* + \tfrac{1}{2}(h^* + \gamma w)^\top H (h^* + \gamma w) - \tfrac{1}{2}(h^*)^\top H h^* \\
&\overset{②}{=} \tfrac{\lambda^* \gamma^2}{2}\|w\|^2 + \big(h^* - (h^* + \gamma w)\big)^\top H h^* + \tfrac{1}{2}(h^* + \gamma w)^\top H (h^* + \gamma w) - \tfrac{1}{2}(h^*)^\top H h^* \\
&= \tfrac{\lambda^* \gamma^2}{2}\|w\|^2 + \tfrac{1}{2}(h^*)^\top H h^* - (h^* + \gamma w)^\top H h^* + \tfrac{1}{2}(h^* + \gamma w)^\top H (h^* + \gamma w) \\
&= \tfrac{\lambda^* \gamma^2}{2}\|w\|^2 + \tfrac{\gamma^2}{2} w^\top H w = \tfrac{\gamma^2}{2} w^\top (H + \lambda^* I) w,
\end{aligned} \tag{C.4}$$

where ① and ② follow from (C.2) and (C.3), respectively. Since $h^*$ is a minimizer of $m(h)$, we immediately have $m(h^* + \gamma w) - m(h^*) = \frac{\gamma^2}{2} w^\top (H + \lambda^* I) w \ge 0$, and we conclude that $H + \lambda^* I \succeq 0$.

For the backward direction, we make use of Lemma C.1 and Lemma C.2, showing that the conditions in (C.1) determine the minimizer up to its norm. First we note that the function $m(h)$ is continuous, bounded from below, and tends to $+\infty$ when $\|h\| \to \infty$, so there exists at least one minimizer. Suppose now there exists a $\lambda^*$ and a corresponding $h^*$ such that $(\lambda^*, h^*)$ is a solution to the system (C.1). The backward direction requires us to prove that $h^*$ must be a minimizer of $m(h)$. By Lemma C.1 we get the following two cases.

If $\lambda^* > -\lambda_{\min}(H)$, then $(\lambda^*, h^*)$ is the only solution to the system (C.1). By the proof of the forward direction we see that any minimizer of $m(h)$ must satisfy system (C.1), and therefore $h^*$ must be the minimizer.

If the above is not the case, then $\lambda^* = -\lambda_{\min}(H)$. Let $h'$ be any minimizer of $m(h)$. Lemma C.1 and the proof of the forward direction ensure that $(\lambda^*, h')$ also satisfies the system (C.1). By Lemma C.2 we get $m(h^*) = m(h')$, and therefore $h^*$ is a minimizer too.

Corollary 5.2. This value $\lambda^*$ is unique, and for every $\lambda$ satisfying $H + \lambda I \succ 0$, we have

$$\|(H + \lambda I)^{-1} g\| > \frac{2\lambda}{L_2} \iff \lambda^* > \lambda \quad\text{and}\quad \|(H + \lambda I)^{-1} g\| < \frac{2\lambda}{L_2} \iff \lambda^* < \lambda.$$

Proof of Corollary 5.2. The uniqueness of $\lambda^*$ follows from Lemma C.1. To prove the second part we first make some observations about the function $p(y) := \frac{2y}{L_2} - \|(H + yI)^{-1} g\|$ defined on the domain $y \in (-\lambda_{\min}(H), \infty)$.
Note that $p(y)$ is continuous and strictly increasing over the domain, and $p(y) \to \infty$ as $y \to \infty$. The corollary requires us to show that

$$p(\lambda) < 0 \iff \lambda^* > \lambda \quad\text{and}\quad p(\lambda) > 0 \iff \lambda^* < \lambda.$$

We begin by showing the first equivalence. To see the backward direction, note that if $\lambda^* > \lambda > -\lambda_{\min}(H)$, then by the characterization of $\lambda^*$ in Lemma C.1 we have $\|(H + \lambda^* I)^{-1} g\| = \frac{2\lambda^*}{L_2}$, i.e. $p(\lambda^*) = 0$, which implies that $p(\lambda) < 0$ as $p(y)$ is a strictly increasing function. For the forward direction, note that since $p(y)$ is continuous and strictly increasing, the range of the function contains $[p(\lambda), \infty)$. Since $p(\lambda) < 0$, there must exist a $\lambda' > \lambda$ such that $p(\lambda') = 0$, which by the characterization in Lemma C.1 implies $\lambda^* = \lambda' > \lambda$ and finishes the proof.

Now we will prove that $p(\lambda) > 0 \iff \lambda^* < \lambda$. To see the forward direction, note that if $\lambda^* \ge \lambda$ then $p(\lambda^*) = 0$ and $p(\lambda) > 0$, which contradicts the fact that $p(y)$ is strictly increasing. For the

backward direction we consider two cases. Firstly if λ > λ min (H) the conclusion follows similarly by the monotonicity of p(y). If λ = λ min then by emma C., we have that g has no component in the lowest eigenspace of H and therefore if we extend p(y) to λ min (H) by defining p( λ min (H)) λ min(h) (H λ min (H)I) + g we get that p(y) is increasing in the domain y [ λ min (H), ). Now from the characterization of the solution in emma C. we can see that p( λ min (H)) 0 and therefore by monotonicity p(λ) > 0. This finishes the proof. D Proof of Main emma D. Proof of Claim 6. Claim 6.. If λ and v satisfy Case and ε satisfies (6.), then m(v) m(h ) + 50κ 3 Proof of Claim 6.. Note that by the conditions of the theorem we have that (H+(λ ε)i) eq and (H + (λ ε)i) g λ ε and (H + (λ + ε)i) g λ ε, according to Corollary 5. we must have This also implies (using our assumption on ε) Next, consider the value m(v) m(v) = g v + v Hv λ [λ ε, λ + ε] v [λ 5 ε, λ + 5 ε]. + 6 v 3 = g v + v (H + λi)v ( λ v v ) 6 We bound the two parts on the right hand side of (D.) separately. The first part g v + v (H + λi)v g (H + λi) g + g ε + (H + λi) g ε + (H + λi) ε g (H + λi) g + 000κ 3 g (H + λ I) g 3 g (H + λ I) g + g (H + λi) (H + (λ + ε)i) ε + + 500κ 3 Above, inequalities and 3 use the assumption on ε in (6.), and inequality uses (H + λi) (H + (λ + ε)i) = (H + λ I) ε(h + λ I) (H + (λ + ε)i) (D.). (D.) (D.3) 000κ 3 Note that (H+λ I) 0 by Equations (D.) and (6.). The second part of (D.) can be bounded as follows ( λ v v ) (λ 5 ε) ( λ ) ε 6 λ + 5 ε 6 (λ ) 3 3 000 ε(λ ) (λ ) 3 3 500κ 3 7

Above, inequality uses λ B (owing to Proposition 5.3) and our assumption on ε from (6.). Putting these together we get that m(v) m(h ) + 50κ 3. D. Proofs for Claims 6. and 6.3 For notational simplicity, let us rotate the space into the basis in the eigenspace of H; let the i-th dimension correspond to the i-th largest eigenvalue λ i of H. We have λ λ... λ d = λ min. et g i denote the i-th coordinate of g in this basis. emma 5. implies where we denote by m(h ) = S = i i:λ i +λ κ gi λ i + λ (λ ) 3 3 =: S + S (λ ) 3 3. (D.4) g i λ i + λ From Corollary 5. we can also obtain i:λ i +λ >0 Now the assumption (H + λi) g λ We begin with a few auxiliary claims. i S = i:0<λ i +λ κ g i λ i + λ gi (λ i + λ ) 4(λ ). (D.5) is equivalent to gi (λ i + λ) 4λ Claim D.. If λ min (H) κ then S 000 m ( λvmin Proof of Claim D.. We compute that gi S = λ i + λ = i:0<λ i +λ κ i:0<λ i +λ κ ) g i (λ i + λ ) (λ i + λ ) κ i:0<λ i +λ κ g i (λ i + λ ) (D.6) 4 κ (λ ) 6 λ min 3. (D.7) Above, uses (D.5), and follows because we have λ min (H) κ in the assumption and have λ λ min (H) + κ in the assumption of Case of Main emma. et us now consider the value of the vector λv min. We have that ( ) λvmin m = λg v min + λ vmin Hv min 8 + λ3 48 λg v min + λ λ min 6 + λ3 48 λg v min + λ λ min 6 λ λ min 4 λg v min + λ λ min 48 Above, is because our assumption λ min (H) κ and assumption v minhv min λ min (H) + 0κ together imply v min Hv min λ min. follows from λ min(h) κ and λ λ min(h) + κ. Now, recall that the sign of v min is chosen so g v min is non-positive, and therefore by our 8

assumptions λ min (H) κ and λ λ min(h) + κ, we get the following inequality: ( ) λvmin m λ min 3 48 Putting inequalities (D.8) and (D.7) together finishes the proof of Claim D.. (D.8) We also show the following lemma, the proof of which can be seen from inequality (D.3), as part of the proof of Claim 6. above. emma D.. If we have λ, v such that (H + λi) g λ and v + (H + λi) g ε with ε satisfying condition (6.) then we have that g v + v (H + λi)v Claim D.3. S 4m(v) 50κ 3 Proof of Claim D.3. We have that m(v) = g v + v (H + λi)v g (H + λi) g λ v + 6 v 3 = g (H + λi) ( g λ v ) 6 v + ( λ 3 ε g (H + λi) g + ) ( λ 6 + ε 3 000κ 3 000κ 3 ) + 000κ 3 3 g (H + λi) g λ3 3 + 500κ 3 g (H + λi) g + 500κ 3 (D.9) Above, is due to emma D.; uses our condition on v which gives v [λ 3 ε, λ + 3 ε]; 3 uses our condition (6.) on ε. We now bound S. For this purpose first we note that if λ i + λ κ and λ λ κ then Therefore, the sum S satisfies S = gi λ i + λ i:λ i +λ κ (λ i + λ ) /κ + λ i + λ λ i + λ. i:0<λ i +λ κ gi (λ i + λ) (g (H + λi) g) 4m(v) 50κ 3 (Note that we have H + λi 0.) This finishes the proof of Claim D.3. { ( )} Claim 6.. If λ min (H) κ then m(h ) 500 min m(v), m λvmin 500κ 3 Proof of Claim 6.. We derive that m(h ) = (S + S ) (λ ) 3 3 m(v) 4 m(v) 3 500κ 3 + 500 m 500κ 3 + 500 m 9 (S + S ) 6 λ min 3 ) 3 6 λ min 3 ( λvmin ( λvmin ) 3

{ 500 min m(v), m ( λvmin )} 500κ 3 Above, uses equation (D.4), inequality follows because we have λ min (H) κ in the assumption and have λ λ min (H) + κ in the assumption of Case of Main emma ; inequality 3 uses Claim D.3 and Claim D.; and inequality 4 uses (D.8). This finishes the proof of Claim 6.. Claim 6.3. If λ min (H) κ then m(h ) m(v) 6 κ 3 Proof of Claim 6.3. This time we lower bound S slightly differently: 4 S κ (λ ) 6 κ 3 where comes from the second to last inequality from (D.7) and comes from λ λ min (H) + κ κ using our assumption in Case of Main emma. Putting these together we get that m(h ) = (S + S ) (λ ) 3 3 m(v) 500κ 3 5 6 κ 3 m(v) κ 3. Above, comes from (D.4), uses Claim D.3, lower bound (D.0) and (λ ) 3 6 3 3κ 3 (D.0) λ E Proof of Main emma Main emma. In the same setting as Main emma, suppose m(h ) ε3/ output vector v satisfies the following conditions: v h + 3 and m(v) ε κ 4 + 5 κ. Proof of Main emma. et s first note that from the value given in emma 5., 300. Then the (λ ) 3 3 m(h ) 3/ ε 3/. (E.) 00 If Case occurs, we have v (H + λi) g + ε λ + ε + ε 3 λ + 5 ε 4 h + 0κ. Above, inequalities and both use the assumptions of Case ; inequality 3 uses the fact that λ [λ ε, λ + ε] which again follows from the assumptions of Case (see (D.)); inequality 4 uses h = λ from emma 5. as well as our assumption (6.) on ε. As for the quantity m(v), we bound it as follows m(v) = g + Hv + v v g + (H + λi)v + λ v + v H + λi ε + λ v + v 3 ( + B) ε + λ(λ + 3 ε) + (λ + 3 ε) = ( + B) ε + 6λ 4 + 5 ελ + 9 ε 6(λ + ε) + ( + 3B) ε + 9 ε 5 6(λ ) + ( + 56B) ε + 5 ε 6 ε 4 + 5 κ. Above, inequality uses triangle inequality; inequality uses v + (H + λi) g ε; inequality 3 uses H + λi + B and v λ + 3 ε which comes from our upper bound on v above; 4 uses the fact that λ [λ ε, λ + ε] which again follows from the assumptions of Case (see 0

(D.)); inequality 5 uses λ B; and inequality 6 uses (E.) together with our assumption (6.) on ε. If Case occurs, we have v (H + λi) g + ε λ + ε 3 (λ + /κ) + ε h + 3 κ. (E.) Above, inequalities and both use the assumptions of Case ; inequality 3 uses λ λ min (H)+ /κ from our assumption of Case as well as λ min (H) λ which comes from emma 5.; inequality 4 uses h = λ from emma 5. as well as our assumption (6.) on ε. The quantity m(v) can be bounded in an analogous manner as Case : m(v) H + λi ε + λ v + v ( + B) ε + 6λ + 0κ 6(λ + κ ) + 0κ (λ ) + 5 κ λ(λ + ε) + (λ + ε) 3 ε 4 + 5 κ. Above, inequality uses our assumption (6.) on ε; inequality uses λ λ + κ which appeared in (E.); inequality 3 uses (E.). F Proof of emma 7. emma 7.. The following statements hold for all i until FastCubicMin terminates (a) λ i [0, B], λ i + λ max (H) 3B (b) λ i + λ min (H) 3 0κ (c) λ i+ + λ min (H) 3 4 (λ i + λ min (H)) unless λ i+ = 0 Moreover when FastCubicMin terminates at ine 0 we have λ i + λ min (H) κ. Proof of emma 7.. The lemma follows via induction. To see (a) and (b) at the base case i = 0, recall that the definitions of B and together ensure λ 0 + λ max (H) 3B and λ 0 + λ min (H) 3 0κ. Also λ 0 [0, B]. Suppose now for some i 0 properties (a) and (b) hold. It is easy to check that λ i λ i and thus we have λ i + λ max (H) B and λ i B. This implies property (a) at iteration i + also hold. We now proceed to show property (c) at iteration i and property (b) at iteration i +. Recall that the algorithm ensures 9 0 λ max((h + λ i I) ) w (H + λ i I) w λ max ((H + λ i I) ), and by the definition of w we have 9 0 λ max((h + λ i I) ) ˆε w w ˆε λ max ((H + λ i I) ). (F.) Now, since 3 0κ λ i + λ min (H) 3B from the inductive assumption, it follows from the choice of ˆε that ˆε 30B 0(λ i + λ min (H)) = λ max((h + λ i I) ). (F.) 0 Plugging Equation (F.) into Equation (F.) we get 8 0 λ i + λ min (H) = 8 0 λ max((h + λ i I) ) w w ˆε λ max ((H + λ i I) ) = λ i + λ min (H).

Inverting this chain of inequalities, we have λ i + λ min (H) 5(λ i + λ min (H)) 8. (F.3) From this we derive the following implications: κ = (λ i + λ min (H)) κ (F.4) > κ = (λ i + λ min (H)) > 4 5κ (F.5) If Condition (F.4) happens, our algorithm FastCubicMin outputs on ine 0; in such a case (F.4) implies our desired inequality λ i + λ min (H) κ. If Condition (F.5) happens, our choice λ i+ λ i and Equation (F.3) together imply that 3 4 (λ i + λ min (H)) λ i+ + λ min (H) 6 (λ i + λ min (H)) Combining this with (F.5) we get that 3 4 (λ i + λ min (H)) λ i+ + λ min (H) ( ) 4 3 6 5κ 0κ. Therefore, we conclude that property (c) at iteration i holds and property (b) at iteration i + hold because λ i+ λ i+. This finishes the proof of emma 7.. G Proof of Main emma 3: Running Time Half Having proven the correctness of the algorithm, we now aim to bound the overall running time of FastCubicMin, completing the proof of Main emma 3. We prove in Appendix H the following lemma: emma G.. If λ +λ min (H) c (0, ) then BinarySearch ends in O ( log( (λ λ )B c ε ) ) iterations. Since in our FastCubicMin algorithm, it satisfies λ i B and λ i +λ min (H) 3 taken together with our choice of ε we have: Claim G.. Claim G.3. Each invocation of BinarySearch ends in O ( log(/ ε) ) iterations. FastCubicMin ends in at most O(log(Bκ)) outer loops. 0κ (see emma 7.), Proof. According to emma 7. we have 3 4 (λ i + λ min (H)) λ i + λ min (H) so the quantity λ i + λ min (H) decreases by a constant factor per iteration (except possibly λ i = 0 the last outer loop in which case we shall terminate in one more iteration). On one hand, we have began with λ 0 + λ min (H) 3B. On the other hand, we always have λ i + λ min (H) 3 0κ according to emma 7.. Therefore, the total number of outer loops is at most O(log(Bκ)). G. 
Matrix Inverse Since the key component of the running time is the computation of (H+λ i I) b for different vectors b we will first bound the condition number of the matrix (H + λ i I) via the following lemma Claim G.4. Through out the execution of FastCubicMin and BinarySearch whenever we compute (H + λ i I) λ b for some vector b it satisfies i + λ i +λ min (H) 0κ. Proof of Claim G.4. We first focus on ine 5 and ine of FastCubicMin. There are two cases. If λ i, then according to I H I we can bound 3 because the left hand λ i + λ i +λ min (H)