arxiv: v4 [math.oc] 24 Apr 2017

Size: px
Start display at page:

Download "arxiv: v4 [math.oc] 24 Apr 2017"

Transcription

1 Finding Approximate ocal Minima Faster than Gradient Descent arxiv:6.046v4 [math.oc] 4 Apr 07 Naman Agarwal namana@cs.princeton.edu Princeton University Zeyuan Allen-Zhu zeyuan@csail.mit.edu Institute for Advanced Study Elad Hazan ehazan@cs.princeton.edu Princeton University November 3, 06 Abstract Tengyu Ma tengyu@cs.princeton.edu Princeton University Brian Bullins bbullins@cs.princeton.edu Princeton University We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.

2 Introduction Finding a global minimizer of a non-convex optimization problem is NP-hard. Thus, the standard goal of efficient non-convex optimization algorithms is instead to find a local minimum. This problem has become increasingly important as the state-of-the-art in machine learning is attained by non-convex models, many of which are variants of deep neural networks. Experiments in [0,, ] suggest that fast convergence to a local minimum is sufficient for training neural nets, while convergence to critical points (points with vanishing gradients) is not. Theoretical works have also affirmed the same phenomenon for other machine learning problems (see [5, 6, 8, 9] and the references therein). In this paper we give a provable linear-time algorithm for finding an approximate local minimum in smooth non-convex optimization. It applies to a general setting of machine learning optimization, and in particular to the optimization problem of training deep neural networks. Furthermore, the running time bound of our algorithm is the fastest known even for the more lenient task of computing a point with vanishing gradient (called a critical point), for a wide range of parameters. Formally, the problem of unconstrained mathematical optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e. min f(x). (.) x Rd If f is convex, the above formulation is convex optimization and is solvable in (randomized) polynomial time even if only a valuation oracle to f is provided. A crucial property of convex functions is that local optimality implies global optimality, allowing for greedy algorithms to reach the global optimum efficiently. Unfortunately, this is no longer the case if f is nonconvex; indeed, even a degree four polynomial can be NP-hard to optimize [3], or even just to check whether a point is not a local minimum [5]. Thus, for non-convex optimization one has to settle for the more modest goal of reaching approximate local optimality efficiently. Note that a particular interest to machine learning is the optimization of functions f : R d R of the finite-sum form f(x) = n f i (x). (.) n Such functions arise when minimizing loss over a training set, where each example i in the set corresponds to one loss function f i in the summation. i= We say that the function f is second-order smooth if it has ipschitz continuous gradient and ipschitz continuous Hessian. We say that a point x is an ε-approximate local minimum if it satisfies (following the tradition of [8]): f(x) ε and f(x) εi, where denotes the Euclidean norm of a vector. We say that a point x is an ε-critical point if it satisfies the gradient condition above, but not necessarily the second-order condition. Critical points include saddle points in addition to local minima. We remark that ε-approximate local minima (even with ε = 0) are not necessarily close to any local minimum, neither in domain nor in function value. However, if we assume in addition the function satisfies the (robust) strict-saddle property [5, 4] (see Section for the precise definition), then an ε-approximate local minimum is guaranteed to be close to a local minimum for sufficiently small ε. Our main theorem below states the time required for the proposed algorithm FastCubic to find an ε-approximate local minimum for second-order smooth functions.

3 Theorem (informal). Ignoring smoothness parameters, the running time of FastCubic to return an ε-approximate local minimum is ( ) ( n ) ( ) n3/4 Õ + ε3/ ε 7/4 T h, for (.) or Õ ε 7/4 T h for the general (.). Above, T h is the time to compute Hessian-vector product for f(x) and T h, is that for an arbitrary f i (x). The full statement of Theorem can be found in Section. Hessian-vector products can be computed in linear time meaning T h, = O(d) and T h = O(nd) for many machine learning problems such as generalized linear models and training neural networks [, 9]. We explain this more generally in Appendix A. Therefore, Corollary.. Algorithm FastCubic returns an ε-approximate local minimum for the optimization problem of training a neural network in time ( ) nd Õ ε 3/ + n3/4 d ε 7/4. Another important aspect of our algorithm is that even in terms of just reaching an ε-critical point, i.e. a point that satisfies f(x) ε without any second-order guarantee, FastCubic is faster than all previous results (see Table for a comparison). The fastest methods to find critical points for a smooth non-convex function are gradient descent and its extensions, jointly known as first-order methods. These methods are extremely efficient in terms of per-iteration complexity; however, they necessarily suffer from a /ε convergence rate [7], to the best of our knowledge, in previous results only higher-order methods seem capable of breaking this /ε bottleneck [8]. For certain ranges of parameters, our FastCubic finds local minima even faster than first-order methods, even though they only find critical points. This is depicted in Table. Paper Total Time Achieving f(x) ε Second-Order Guarantee Gradient Descent (GD) O ( ) nd ε n/a SVRG [] O ( ) nd + n/3 d ε n/a SGD [0] O ( ) d ε n/a 4 noisy SGD [6] a O ( ) d C ε f(x) ε /C I 4 cubic regularization [8] Õ ( ) nd ω +d ω f(x) ε / I ε 3/ this paper Õ ( ) nd f(x) ε / I ε 7/4 this paper Õ ( ) nd + n3/4 d f(x) ε / I ε 3/ ε 7/4 Table : Comparison of known methods. a Here C, C are two constants that are not explicitly written. We believe C 4.

4 . Related work Methods that Provably Reach Critical Points. Recall that only a gradient oracle is needed to reach a critical point. The most commonly used algorithm in practice for training non-convex learning machines such as deep neural networks is stochastic gradient descent (SGD), also known as stochastic approximation [30] and its derivatives. Some practical enhancements widely used in practice are based on Nesterov s acceleration [6] and adaptive regularization []. The variance reduction technique, introduced in [3], was extremely successful in convex optimization, but only recently there was a non-convex counterpart with theoretical benefits introduced []. Methods that Provably Reach ocal Minima. The recent work of Ge et al. [7] showed that a noise-injected version of SGD in fact converges to local minima instead of critical points, as long as the underlying non-convex function is strict-saddle. Their theoretical running time is a large polynomial in the dimension and not competitive with our method (see Table ). The work of ee et al. [4] shows that gradient descent, starting from a random point, almost surely converges to a local minimum of a strict-saddle function. The rates of convergence and precise step-sizes that are required are, however, yet unknown. If second-order information (i.e., the Hessian oracle) is provided, the cubic-regularization method of Nesterov and Polyak [8] converges in O( ) iterations. However, each iteration of Nesterovε 3/ Polyak requires solving a cubic function which, in general, takes time super-linear in the input representation. One natural direction is to apply an approximate trust region solver, such as the linear-time solver of [], to approximately solve the cubic regularization subroutine of Nesterov-Polyak. However, the approximation needed by a naive calculation makes this approach even slower than vanilla gradient descent. Our main challenge is to obtain approximate second-order local-minima and simultaneously improve upon gradient descent. Independently of this paper and concurrently, Carmon et al. [7] develop an accelerated gradient descent method that achieves the same running time for finding an approximate local minimum as in our paper. Remarkably, the same running time is obtained via a very different technique.. Our Techniques Our algorithm is based on the cubic regularization method of Nesterov and Polyak [8, 9, 8]. At a high level, cubic regularization states that if we can minimize a cubic function m(h) g h + h Hh + 6 h 3 exactly, where g = f(x), H = f(x), and is the second-order smoothness of the function f, then we can iteratively perform updates x x + h, and this algorithm converges to an ε-approximate local minimum in O(/ε 3/ ) iterations. Unfortunately, solving this cubic minimization problem exactly, to the best of our knowledge, requires a running time of O(d ω ) where ω is the matrix multiplication constant. Getting around this requires five observations. The first observation is that, minimizing m(h) up to a constant multiplicative approximation (plus a few other constraints) is sufficient for showing an iteration complexity of O(/ε 3/ ). The proof techniques to show this observation are based on extending Nesterov and Polyak. The second observation is that the minimizer h of m(h) must be of the form h = (H+λ I) + g+ v, where λ 0 is some constant satisfying H + λ I 0, and v is the smallest eigenvector of H and + denotes the pseudo-inverse of a matrix. This can be viewed as moving in a mixture direction To be precise, their manuscript appeared online approximately 4 hours before ours. More specifically, we need m t(h) C min h{m t(h)} for some constant C. In addition, we need to have good bounds on h and m(h). 3

5 between choosing h v, and choosing h to follow a shifted Newton s direction h (H + λ I) + g. Intuitively, we wish to reduce both the computation of (H+λ I) + g and v to Hessian-vector products. The first task of computing (H + λ I) + g can be slow, and even if H + λ I is strictly positivedefinite, computing it has a complexity depending on the (possibly huge) condition number of H+λ I [34]. The third observation is that it suffices to pick some λ > λ so both () the condition number of H + λ I is small and () the vectors (H + λ I) g and (H + λ I) g are close. This relies on the structure of m(h). The second task of computing v has a complexity depending on / δ where δ is the target additive error [3, 4]. The fourth observation is that the choice δ = ε suffices for the outer loop of cubic regularization to make sufficient progress. This reduces the complexity to compute v. Finally, finding the correct value λ itself is as hard as minimizing m t (h). The fifth step is to design an iterative scheme that makes only logarithmic number of guesses on λ. This procedure either finds the correct one (via binary search), or finds an approximate one, λ, but satisfying (H + λ I) g and (H + λ I) g being sufficiently close. Putting all the observations together, and balancing all the parameters, we can obtain a cubic minimization subroutine (see FastCubicMin in Algorithm ) that runs in time O(nd + n 3/4 d/ε /4 ). Preliminaries and Main Theorem We use to denote the Euclidean norm of a vector and the spectral norm of a matrix. For a symmetric matrix M we denote by λ max (M) and λ min (M) respectively the maximum and minimum eigenvalues of M. We denote by A B that A B is positive semidefinite (PSD). For a PSD matrix M, we denote by M + its pseudo-inverse if M is not strictly positive definite. We make the following ipschitz continuity assumptions for the gradient and Hessian of the target function f. Namely, there exist, > 0 such that Definition.. x, y R d : f(x) and f(x) f(y) x y. (.) We assume the following complexity parameters on the access to f(x): et T g R be the time complexity to compute f(x) for any x R d. et T h R be the time complexity to compute ( f(x) ) v for any x, v R d. Definition.. We say that f is of finite-sum form if f = n n i= f i(x) and f i (x) for each i [n]. In this case, we define T h, to be the time complexity to compute ( f i (x) ) v for arbitrary x, v R d and i [n]. Next we define the strict-saddle function for which an ε-approximate local minimum is almost equivalent to a local minimum [5, 4]. Definition.3 (strict saddle). Suppose f( ) : R d R is twice differentiable. For α, β, γ 0, we say f is (α, β, γ)-strict saddle if every x R d satisfies at least one of the following three conditions:. f(x) α.. λ min ( f) β. 3. There exists a local minimum x that is γ-close to x in Euclidean distance. We see that if a function is (α, β, γ)-strict saddle, then for ε < min{α, β } an ε-approximate local minimum is γ-close to some local minimum. 4

6 Algorithm FastCubic(f, x 0, ε,, ) Input: f(x) that satisfies (.) with and ; a starting vector x 0 ; a target accuracy ε. : κ ( ) 900 /. ε : for t = 0 to do 3: m t (h) f(x t ) h + h f(x t)h ( + 6 h 3 ) 4: (λ, v, v min ) FastCubicMin f(x t ), f(x t ),,, κ 5: h either v or λv min 6: Set x t+ x t + h whichever gives smaller value for m t(h); 7: if m t (h ) > ε3/ c then return x t+. c is a constant; we proved c = works 8: end for. Main Results The finite-sum setting captures much of supervised learning, including Neural Networks and Generalized inear Models. The main theorem which we show in our paper is as follows: Theorem. FastCubic (Algorithm ) starts from a point x 0 and outputs a point x such that in total time (denoting by D f(x 0 ) f(x )) ( ) Õ D T ε 3/ g + D/4 T ε 7/4 h, or f(x) ε and λ min ( f(x)) ε ( Õ D (T ) ε 3/ g + nt h, + Dn 3/4 /4 ) T ε 7/4 h, in the finite-sum setting (see Definition.). Here Õ hides logarithmic factors in,, /ε, d, and in max x { f(x) }. Two Known Subroutines. Our running time of FastCubic relies on the following recent results for approximate matrix inverse and approximate PCA: Theorem.4 (Approximate Matrix Inverse). Suppose matrix M R d d satisfies M and λi + M δi for constants λ, δ, > 0. et κ λ+ δ. Then, we can compute vector x satisfying x (λi + M) b ε b, (.) using Accelerated gradient descent (AGD) in O ( κ / log(κ/ε) ) iterations, each requiring O(d) time plus the time needed to multiply M with a vector. Moreover, suppose M = n n i= M i where each M i is symmetric and satisfies M i. If M i b can be computed in time O(d ) for each i and vector b, then accelerated SVRG [4, 33] computes a vector x that satisfies equation (.) in time O ( max{n, n 3/4 κ / } d log (κ/ε) ). We refer to the running time for this computation as T inverse (κ, ε) and the algorithm as A. Above, the SVRG based running time shall be used only towards our finite-sum case in Definition.. Theorem.5 (AppxPCA [3, 3, 4]). et M R d d be a symmetric matrix with eigenvalues λ λ d 0. With probability at least p, AppxPCA produces a unit vector w satisfying w Mw ( δ )( ε)λ max (M). The total running time is Õ(T inverse(/δ, εδ )). 3 Our Fast Cubic Regularization Algorithm Recall that the cubic regularization method of Nesterov and Polyak [8] studies the following upper bound on the change in objective value as we move from a point x t to x t + h: (it follows simply 5

7 from the Taylor series truncated to the third order) h R d : f(x t + h) f(x t ) m t (h) f(x t ) h + h f(x t )h + 6 h 3. (3.) Denote by h an arbitrary minimizer of m t (h). We propose in this paper a subroutine FastCubicMin to minimizes m t (h) approximately. Note that FastCubicMin returns two vectors v and v min. We then choose h to be either v or λv min, whichever gives a smaller value for m t(h). Before discussing the details of FastCubicMin, let us first state a main theorem for FastCubicMin: 3 Theorem (Guarantees of FastCubicMin). satisfies (a) It produces a vector h satisfying m t (h ) 0 and The algorithm FastCubicMin finds a vector h that either 3000m t (h ) m t (h ) or m t (h ) ε3/ 800. (b) If m(h ) ε3/ 300, then h h + ε 4 and m t (h ) ε. (c) FastCubicMin Õ( runs in time: (using Õ to hide logarithmic factors in,, /ε, d, f(x t ) ) ) T (ε) /4 h where T h is the time to multiply f(x t ) to a vector; ( Õ max { n, n 3/4 ) } Th, where T (ε) /4 h, is the time to multiply f i (x t ) with a vector. Above, the first guarantee promises that we are either done (because m t (h ) is close to zero), or we obtain a /3000 multiplicative approximation to m t (h ). Our second guarantee in Theorem promises that when we are done (because m t (h ) is close to zero), the output vector h and h are roughly similar in Euclidean norm and have a small gradient m t (h ). Our third guarantee gives the time complexity of FastCubicMin. Now, our final algorithm FastCubic for finding the ε-approximate local minimum of f(x) is included in Algorithm. It simply iteratively calls FastCubicMin to find an approximate minimizer, and it then stops whenever m t (h ) > ε3/ c for some large constant c. Roadmap. In Section 4 we show why Theorem implies Theorem. All the remaining sections are for the purpose of proving Theorem. Because our FastCubicMin is very technical, instead of stating what the algorithm is right away, we decide to take a different path. In Section 5, we first state a lemma characterizing what h looks like. In Section 6, we provide a set of sufficient conditions which look similar to the characterization of h, and show that as long as these conditions are met, Theorem -a and -b follow easily. Finally, in Section 7, we state FastCubicMin and explain why it satisfies these sufficient conditions and why it runs in the aforementioned time. 4 Theorem implies Theorem In this section, we show that Theorem implies Theorem. It relies on the following lemma (proved in Appendix B) regarding the sufficient condition for us to reach an ε-approximate local minimum. 3 To present the simplest result, we have not tried to improve the constant dependency in this paper. 6

8 emma 4.. If m t (h ) ε3/ 800 and h is an approximate minimizer of m t (h) satisfying h h + ε 4 and m t (h ) ε, then we have that f(x t + h ) ε and λ min ( f(x t + h )) ε. Proof of Theorem from Theorem. When FastCubic terminates, we have m t (h ) > ε3/ c ; therefore, it satisfies m t (h ) ε3/ 800 according to Theorem -a. Combining this with Theorem -b and Corollary 4., we conclude that in the last iteration of FastCubic, our output satisfies f(x t +h ) ε and λ min ( f(x t + h )) ε. This finishes the proof with respect to the accuracy conditions. As for the running time, in every iteration except for the last one, FastCubic satisfies m t (h ) Ω ( ) ( ε 3/. Therefore by (3.), we must have decreased the objective by at least Ω ε 3/ ) in this round, and this cannot happen for more than O ( (f(x 0 ) f ) ) ε iterations. The final running time 3/ of FastCubic follows from this bound together with Theorem -c. Therefore, in the rest of the paper it suffices to study FastCubicMin and prove Theorem. 5 Characterization emma of the Minimizer h For notational simplicity in this and the subsequent sections we focus on the following problem: minimize m(h) g h + h Hh + 6 h 3 where H is a symmetric matrix with H. Recall from the previous section that we have denoted by h an arbitrary minimizer of m(h). We have the following lemma which characterizes h : (a variant of this lemma has appeared in [8], and we prove it in the appendix for the sake of completeness) emma 5.. We have h is a minimizer of m(h) if and only if there exists λ 0 such that H + λ I 0, (H + λ I)h = g, h = λ. The objective value in this case is given by m(h ) = g (H + λ I) + g (λ ) The following corollary comes from emma 5. and its proof: Corollary 5.. have The value λ in emma 5. is unique, and for every λ satisfying H + λi 0, we (H + λi) g > λ λ > λ and (H + λi) g < λ λ < λ. In the above characterization, we have a crude upper bound on λ : Proposition 5.3. We have λ B max { + g, } with λ defined in emma 5.. Proof. We have (H + BI) g Corollary 5.. g λ min (H+BI) g B < B and therefore λ B due to 7

9 6 Sufficient Conditions for Theorem -a and -b Without worrying about the design of FastCubicMin at this moment, let us first state a set of sufficient conditions under which the assumptions in Theorem -a can be satisfied. Main emma. Consider an algorithm that outputs a real λ [0, B], a vector v R d, and a unit vector v min R d. Additionally, suppose numbers κ, ε 0 satisfying the following conditions: ε 0000 (max {κ,,, g, (H + λi), B}) 0 (6.) (H + (λ ε)i) 0 (6.) Moreover, suppose that the outputs (λ, v, v min ) satisfy one of the following two cases: Case : (H + λi) g [λ ε, λ + ε] and v + (H + λi) g ε Case : The following conditions are satisfied: (a) λ λ and λ + λ min (H) κ (b) (H + λi) g λ and v + (H + λi) g ε (c) vmin Hv min λ min (H) + 0κ Then, at least one of the two choices h { v, λv } min satisfy either m(h ) 3000m(h ) or m(h ) 3 κ 3. et us compare such sufficient conditions to the characterization emma 5.. In Case, up to a very small error ε, we have essentially found a vector v that satisfies v (H + λi) g and v λ. Therefore, this v should be close to h for obvious reason. (This is the simple case.) In Case, we have only found a vector v that satisfies v (H + λi) g and v λ. In this case, we also compute an approximate lowest eigenvector v min of λ min (H) up to an additive /0κ accuracy (see case -c). We will make sure that, as long as the conditions in -a hold, then either v or λv min will be an approximate minimizer for m t(h). (This is the hard case.) Proof of Main emma. We first consider Case. According to Corollary 5., if ε = 0 then v is a minimizer of m(h). The following claim extends this argument to the setting when ε > 0: Claim 6.. If λ and v satisfy Case and ε satisfies (6.), then m(v) m(h ) + 50κ 3 From the above lemma it follows that either m(h ) 8 satisfies the conditions of the theorem. κ 3 otherwise m(h ).m(v) which We now consider Case, and in this case we make the following two claims: { ( )} Claim 6.. If λ min (H) κ then m(h ) 500 min m(v), m λvmin. 500κ 3 Claim 6.3. If λ min (H) κ then m(h ) m(v) 6 κ 3. emma now follows from the two claims because we can output the vector h which has the lowest value of m(h ) amongst the two choices h { v, λ v } min d. This satisfies either m(h ) 3000m(h ) or m(h ) 3. κ 3 The missing proofs of the three claims are deferred to Appendix D. 8

10 The next main lemma shows that, under the same sufficient conditions as Main emma, we also have that Theorem -b holds. (Its proof is contained in Appendix E.) Main emma. In the same setting as Main emma, suppose m(h ) ε3/ output vector v satisfies the following conditions: v h + 3 and m(v) ε κ κ. 7 Main Algorithms for Theorem 300. Then the We are now ready to state our main algorithm FastCubicMin and sketch why it satisfies the sufficient conditions in Main emma. As described in Algorithm, our algorithm starts with a very large choice λ 0 B and decreases it gradually. At each iteration i, it computes an approximate inverse v satisfying v + (H + λ i I) g ε with respect to the current λ i. Then there are three cases, depending on whether v is approximately equal to, larger than, or smaller than λ i. At a high level, if it is equal, then we have met Case in Main emma ; if it is larger, then we can binary search the correct value of λ in the interval [λ i, λ i ]; and if it is smaller, then we need to compute an approximate eigenvector and carefully choose the next point λ i+. We state our main lemma below regarding the correctness and running time of FastCubicMin. Main emma 3. FastCubicMin in Algorithm outputs a real λ [0, B], a vector v R d, and a unit vector v min R d satisfying one of the two sufficient conditions in Main emma. We also have that the procedure can be implemented in a total running time of Õ( κ T h ) if Accelerated Gradient Descent is used in Theorem.4 to invert matrices. Õ( max{n, n 3/4 κ } T h, ) if we use accelerated SVRG as the subprocedure A in Theorem.4. Here Õ hides logarithmic factors in,, κ, d, B. We prove the correctness half of Main emma 3, and defer its running time analysis to Appendix G. 7. Correctness Half of Main emma 3 We will now establish the correctness of our algorithm. We first observe that the BinarySearch subroutine returns (λ, v, ) that satisfies Case of Main emma. Fact 7.. BinarySearch outputs a pair λ and v such that (H + λi) g [λ ε, λ + ε] and v + (H + λi) g ε. Proof. The latter is guaranteed by line 3 in BinarySearch, and the former is implied by the latter because (H + λi) g [ v ε/, v + ε/ ] [ λ ε, λ + ε ]. We also establish the following invariants regarding the values λ i. (Proof in Appendix F.) emma 7.. The following statements hold for all i until FastCubicMin terminates (a) λ i [0, B], λ i + λ max (H) 3B (b) λ i + λ min (H) 3 0κ (c) λ i+ + λ min (H) 3 4 (λ i + λ min (H)) unless λ i+ = 0 Moreover when FastCubicMin terminates at ine 0 we have λ i + λ min (H) κ. We now prove the output (λ, v, v min ) of FastCubicMin satisfies the sufficient conditions of Main emma. 9

11 Algorithm FastCubicMin(g, H,,, κ) (main algorithm for cubic minimization) Input: g a vector, H a symmetric matrix, parameters κ, and which satisfies I H I. Output: (λ, v, v min ) : B + ( g + κ. : ε / 0000 ( max {, g, 3κ 0, B, }) 0 3: λ 0 B. 4: for i = 0 to do 5: Compute v such that v + (H + λ i I) g ε. 6: if v [λ i ε, λ i + ε] then 7: return (λ i, v, ). 8: else if v > λ i + ε then 9: return BinarySearch(λ = λ i, λ = λ i, ε). 0: else if v < λ i ε then : et Power Method find vector w that is 9/0-appx leading eigenvector of (H + λ i I) : 9 0 λ max((h + λ i I) ) w (H + λ i I) w λ max ((H + λ i I) ). : Compute a vector w such that w (H + λ i I) w ˆε w w ˆε. 3: 4: if > κ then 5: λi+ λ i. 6: if λi+ > 0 then λ i+ λ i+ else λ i+ 0 7: else 8: 9: Use AppxPCA to find any unit vector v min such that vmin Hv min λ min (H) + 0κ. Flip the sign of v min so that g v min 0. 0: return (λ i, v, v min ). : end if : end if 3: end for 60B. Algorithm 3 BinarySearch(λ, λ, ε) (binary search subroutine) Input: λ λ, (H + λ I) g λ, (H + λ I) g λ, λ + λ min (H) > 0 Output: (λ, v, ) : for t = to do : λ mid λ +λ 3: Compute vector v such that v + (H + λ mid I) g ε/ 4: if v [λ mid ε, λ mid + ε] then 5: return (λ mid, v, ) 6: else if v + ε λ mid then 7: λ λ mid 8: else if v ε λ mid then 9: λ λ mid 0: end if : end for 0

12 Correctness Proof of Main emma 3. We carefully verify these sufficient conditions: emma 7. implies λ i [0, B]. λ i + λ min (H) 3 0κ from emma 7. implies (H + λ ii) 4κ. It is now immediate that the choice of ε on ine satisfies the Condition (6.) in the assumption of Main emma. Since ε 0κ and λ i + λ min (H) 3 0κ it follows that (H + (λ i ε)i) 0 which proves Condition (6.) in Main emma. We now verify Case and in the assumption of Main emma. At the beginning of the algorithm, our choice λ 0 = B ensures (using Proposition 5.3) that (H + λ 0 I) g < λ 0. et us now consider the various places where the algorithm outputs: If FastCubicMin terminates at ine 7, then we have v + (H + λ i I) g ε and additionally (H + λ i I) g [ v ε, v + ε ] [λ i ε, λ i + ε]. Therefore, the output meets Case requirement of Main emma with λ = λ i. If FastCubicMin terminates at ine 9, then (H + λ i I) g > v ε λ i. Obviously, we must have i in this case because (H + λ 0 I) g < λ 0. Therefore, ine 0 must have been reached at the previous iteration, so it implies (H + λ i I) g < λ i. Together, these two imply that we can call BinarySearch with (λ i, λ i ). Owing to Fact 7., the subroutine outputs a pair (λ, v) satisfying the Case requirement of Main emma. If FastCubicMin terminates on ine 0, we verify that Case of Main emma with λ = λ i holds. We first have (H + λ i I) g v + ε λ i. By Corollary 5., we also have that λ i λ. emma 7. tells us λ i satisfies λ i +λ min (H) κ. Vector v satisfies v + (H + λ i I) g ε. Vector v min satisfies v min Hv min λ min (H) + 0κ. In sum, we have verified that all the assumptions of Main emma hold. Final Proof of Theorem. Theorem is a direct corollary of our main lemmas. Main emma 3 ensures that the assumptions of Main emma and Main emma both hold. Now, using the special choice of κ in FastCubic, Theorem -a immediately comes from Main emma ; Theorem -b immediately comes from Main emma ; and Theorem -c immediately comes from Main emma 3. This finishes the proof of Theorem. Acknowledgements We thank Ben Recht for very helpful suggestions and corrections to a previous version. Z. Allen-Zhu is supported by an NSF Grant, no. CCF-4958, and a Microsoft Research Grant, no Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft. A Appendix Computing Hessian-Vector Product in inear Time In this section we sketch the intuition regarding why Hessian-vector products can be computed in linear time in many interesting (especially machine learning) problems. We start by showing that the gradient can be computed in linear time. The algorithm is often referred to as back-propagation,

13 which dates back to Werbos s PhD thesis [35], and has been popularized by Rumelharte et al. [3] for training neural networks. Claim A. (back-propagation, informally stated). Suppose a real-valued function f : R d R can be evaluated by a differentiable circuit of size N. Then, the gradient f can be computed in time O(N + d) (using a circuit of size O(N + d)). 4 The claim follows from simple induction and chain-rule, and is left to the readers. In the training of neural networks, often the size of circuits that computes the objective f is proportional to (or equal to) the number of parameters d. Thus the gradient f can be computed in time O(d) using a circuit of size d. Next, we consider computing f(x) v where v R d. et g(x) := f(x), v be a function from R d to R. Then, we see that if suffices to compute the gradient of g, since f(x) v = g(x). We observe that g(x) can be evaluated in linear time using circuit of size O(d) since we ve shown f(x) can. Thus, using Claim A. again on function g, 5 we conclude that g(x) can also be computed in linear time. B Proof of emma B. and Corollary 4. emma B.. For all h R d, it satisfies f(x t +h ) h + m t (h ) and λ min ( f(x t +h )) ( ) 3 max{0, m t(h )} /3 h. Proof of emma B.. et us denote by g = f(x t ) and H = f(x t ) in this proof. We begin by proving the first order condition. Note that we have m t (h) = g + Hh + h h. Recall h is a minimizer of argmin m t (h). The characterization result in emma 5. shows H + h I 0, and thus They imply g h + (h ) Hh + h 3 = m t (h ) h = 0 (h ) Hh + ( ) h 3 = (h ) H + h I h 0. m t (h ) = g h + (h ) Hh where uses (B.) and uses (B.). + h h 3 3 h 3 = h 3 = (h ) Hh 3 h 3 We compute the norm of the gradient at a point x t + h for any h R d : f(x t + h ) f(x t + h ) m t (h ) + m t (h ) ( = f(x t ) + f(x t + τh )h dτ g + Hh + h h ) + mt (h ) 0 4 Technically, we assume that the gradient of each gate can be computed in O() time 5 We assume here that the original circuits are twice differentiable (B.) (B.) (B.3)

14 0 ( f(x t + τh ) H ) h dτ + h + m t (h ) 3 h τdτ + h + m t (h ) = h + m t (h ) 0 (B.4) where 3 follows from the ipschitz continuity on the Hessian (.). This proves the first conclusion of the lemma. As for the second-order condition, we first note that for all h R d, by the ipschitz continuity on the Hessian (.), we have f(x t + h ) f(x t ) h. However, this implies λ min ( f(x t + h )) λ min ( f(x t )) h. (B.5) because if two matrices A and B satisfies A B p, then it must satisfy λmin (A) λ min (B) p as well. We consider two cases: if λ min ( f(x t )) 0, then we have λ min ( f(x t + h )) h. (B.6) Otherwise, we consider the case where λ min ( f(x t )) = λ min (H) < 0. et ν d be the normalized eigenvector corresponding to λ min (H), and define h sign(g ν d ) λ min(h) ν d. We calculate m t ( h) as follows: m t ( h) = g h h H h + = (λ min(h)) h 3 h H h + 6 h 3 = (λ min(h)) νd Hν d + 4 λ min(h) λ min(h) 3 (λ min (H)) 3 = 3 3, (B.7) where uses νd Hν d = λ min (H) < 0, and uses the assumption that λ min (H) < 0. Since by definition m t (h ) m t ( h), we can deduce from inequality (B.7) that ( 3 λ min ( m t (h ) /3 ) f(x t )) = λ min (H). (B.8) Now we put together inequalities (B.5) and (B.8), and obtain ( 3 λ min ( f(x t + h m t (h ) /3 ) )) h. (B.9) Combining (B.6) and (B.9) we finish the proof of emma B.. Corollary 4.. If m t (h ) ε3/ 800 and h is an approximate minimizer of m t (h) satisfying h h + ε 4 and m t (h ) ε, then we have that f(x t + h ) ε and λ min ( f(x t + h )) ε. Proof of Corollary 4.. First of all, our assumption that m t (h ) ε3/ ε 800, along with inequality (B.3), tells us that h 4. This, together with our assumption on h, implies h Since we also assume m t (h ) ε, we have from emma B. that f(x t + h ) h + m t (h ) ε 4 + ε ε. ε. For the second-order condition, we can again apply emma B. to get ( 3 λ min ( f(x t + h max{0, m t (h ) /3 ) /3 )} (3 )) h 3/ ε 3/ ε ε

15 C Proof of emma 5. and Corollary 5. We begin by proving a few lemmas that characterize the system of equations. emma C.. Consider the following system of equations/inequalities in variables λ, h: H + λi 0, (H + λi)h = g, h = λ. (C.) The following statements hold for any solution (λ, h ) of the above system: There is a unique value λ that satisfies the above equations. λ is such that λ λ min (H). If λ > λ min (H), then the corresponding h is also unique and is given by h = (H+λI) g. If λ = λ min (H) then g v = 0 for any vector v belonging to the eigenspace corresponding to λ min (H). Subsequently we also have that the corresponding h is of the form h = (H + λi) + g + γv for some γ and v in the lowest eigenspace of H. Proof of emma C.. Note that H + λi 0 ensures that for any solution λ, we have λ λ min (H). Furthermore, for any λ > λ min (H), the corresponding h is uniquely defined by h = (H + λi) g since H + λ I is invertible. If indeed λ = λ min (H), then we have that the equation (H λ min (H)I)h = g has a solution. This implies that g has no component in the null space of H λ min (H)I, or equivalently that it has no component in the eigenspace corresponding to λ min (H). We also have that every solution of (H λ min (H)I)h = g is necessarily of the form for some γ and v in the lowest eigenspace of H. h = (H λ min I) + g + γv We will now prove the uniqueness of λ by contradiction. Consider two distinct values of λ, λ that satisfy the system (C.). If both λ, λ > λ min (H) we get that (H + λ I) g = λ and (H + λ I) g = λ. Now note that (H + λi) g is a strictly decreasing function over the domain λ ( λ min (H), ) and λ is strictly increasing over the same domain. Therefore the above two equations cannot be satisfied for two distinct λ, λ > λ min (H) which is a contradiction. Suppose now without loss of generality that λ = λ min (H). Then we have that the corresponding solution is of the form h = (H + λi) + g + γv for some γ and v in the lowest eigenspace of H and g has no component in the lowest eigenspace of H. It follows that (H λ min (H)I) + g (H + λi) g for any λ > λ min (H). By a similar argument as in the first case, we can now see that the following conditions, (H + λ I) + g + γv min (H) = λ and (H + λ I) g = λ, cannot both be satisfied for λ > λ = λ min (H), giving us a contradiction. This finishes the proof of emma C.. emma C.. et (λ, h) be a solution of the system (C.). Then we have that m(h) = g (H + λi) + g λ3 3. 4

16 Proof of emma C.. By the definition of the system (C.), any solution λ, h to the system should be such that there exists some γ such that h = (H + λi) + g + γv 0 where v 0 is in the null space of H + λi if it exists; otherwise γ = 0. This gives us the following: m(h) = g h + h Hh + 6 h 3 = h (H + λi)h λ h + 6 h 3 = g (H + λi) + g λ3 3. Equality follows because (H + λi)h = g. Equality follows because h = (H + λi) + g + γv 0 and h = λ. emma 5.. h is a minimizer of m(h) if and only if there exists λ 0 such that H + λ I 0, (H + λ I)h = g, h = λ. The objective value in this case is given by m(h ) = g (H + λ I) + g (λ ) Proof of emma 5.. We first compute that m(h) = g + Hh + h h and m(h) = H + h I + ( ) ( ) h h h. h h For the forward direction, suppose h is a minimizer of m(h). et λ = h. Then, the necessary conditions m(h ) = 0 and m(h ) 0 can be written as ( ( ) ( ) ) h g + (H + λ I)h = 0 and w H + λ I + λ h h h w 0, w R n. (C.) From this we see (H+λ I)h = g and h = λ, and the only thing left to verify is H+λ I 0. Note that if h = 0, then the second inquality in (C.) directly implies H + λ I 0. Thus, we only need to focus on h 0. We want to show that w (H + λ I)w 0 for every w R d. Now, if w h = 0 then this trivially follows from (C.), so it suffices to focus on those w that satisfies w h 0. Since w and h are not orthogonal, there exists γ R\{0} such that h + γw = h. (This can be done by squaring both sides and solving the linear system in λ.) Squaring both sides we have (γw) h + γ w = 0. (C.3) Now we bound the difference m(h + γw) m(h ) = g ((h + γw) h ) + (h + γw) H(h + γw) = (h (h + γw)) (H + λ I)h + (h + γw) H(h + γw) h Hh h Hh = λ γ w + (h (h + γw)) Hh + (h + γw) H(h + γw) h Hh = λ γ w + h Hh (h + γw) Hh + (h + γw) H(h + γw) 5

17 = λ γ w + γ w Hw = γ w (H + λ I)w, (C.4) where and follow from (C.) and (C.3), respectively. Since h is a minimizer of m(h), we immediately have m(h + γw) m(h ) = γ w (H + λ I)w 0, and we conclude that (H + λ I) 0. For the backward direction, we will make use emma C. and emma C.. First we note that the function m(h) is continuous and bounded from below, and there exists at least one minimizer h. Suppose now there exists a λ and a corresponding h such that (λ, h ) is a solution to the system C.. The backward direction requires us to prove that h must be a minimizer of m(h). By emma C. we get the following two cases. We prove the backward direction by showing that the conditions in Equation C. determine the minimizer up to its norm. To this end we will use emma C. and emma C.. First we note that the function m(h) is continuous, bounded from below, and tends to + when h, so there exists at least one minimizer h. Suppose now there exists a λ and a corresponding h such that (λ, h ) is a solution to the system (C.). The backward direction requires us to prove that h must be a minimizer of m(h). By emma C. we get the following two cases. If λ > λ min (H) then (λ, h ) is the only solution to the system (C.). By the proof of the forward direction we see that any minimizer of m(h) must satisfy system (C.) and therefore h must be the minimizer. If above is not the case, then λ = λ min (H). et h be any minimizer of m(h). emma C. and the proof of the forward direction ensures that (λ, h ) also satisfies the system (C.). By emma C. we get m(h ) = m(h ) and therefore h is a minimizer too. Corollary 5.. This value λ is unique, and for every λ satisfying H + λi 0, we have (H + λi) g > λ λ > λ and (H + λi) g < λ λ < λ. Proof of Corollary 5.. The uniqueness of λ follows from emma C.. To prove the second part we first make some observations about the function p(y) y (H + yi) g defined on the domain y ( λ min (H), ). Note that p(y) is continuous and strictly increasing over the domain and p(y) as y. The corollary requires us to show that p(λ) < 0 λ > λ and p(λ) > 0 λ < λ. We begin by showing the first equivalence. To see the backward direction note that if λ > λ > λ min (H), by the characterization of λ in emma C. we have that (H + λ I) g = λ i.e. p(λ ) = 0 which implies that p(λ) < 0 as p(y) is a strictly increasing function. For the forward direction note that since p(y) is continuous and strictly increasing we see that the range of the function contains [p(λ), ). Since p(λ) < 0 there must exist a λ > λ such that p(λ ) = 0 which by the characterization in emma C. finishes the proof. Now we will prove that p(λ) > 0 λ < λ. To see the forward direction note that if λ λ then p(λ ) = 0 and p(λ) > 0 which contradicts the fact that p(y) is strictly increasing. For the 6

18 backward direction we consider two cases. Firstly if λ > λ min (H) the conclusion follows similarly by the monotonicity of p(y). If λ = λ min then by emma C., we have that g has no component in the lowest eigenspace of H and therefore if we extend p(y) to λ min (H) by defining p( λ min (H)) λ min(h) (H λ min (H)I) + g we get that p(y) is increasing in the domain y [ λ min (H), ). Now from the characterization of the solution in emma C. we can see that p( λ min (H)) 0 and therefore by monotonicity p(λ) > 0. This finishes the proof. D Proof of Main emma D. Proof of Claim 6. Claim 6.. If λ and v satisfy Case and ε satisfies (6.), then m(v) m(h ) + 50κ 3 Proof of Claim 6.. Note that by the conditions of the theorem we have that (H+(λ ε)i) eq and (H + (λ ε)i) g λ ε and (H + (λ + ε)i) g λ ε, according to Corollary 5. we must have This also implies (using our assumption on ε) Next, consider the value m(v) m(v) = g v + v Hv λ [λ ε, λ + ε] v [λ 5 ε, λ + 5 ε]. + 6 v 3 = g v + v (H + λi)v ( λ v v ) 6 We bound the two parts on the right hand side of (D.) separately. The first part g v + v (H + λi)v g (H + λi) g + g ε + (H + λi) g ε + (H + λi) ε g (H + λi) g + 000κ 3 g (H + λ I) g 3 g (H + λ I) g + g (H + λi) (H + (λ + ε)i) ε κ 3 Above, inequalities and 3 use the assumption on ε in (6.), and inequality uses (H + λi) (H + (λ + ε)i) = (H + λ I) ε(h + λ I) (H + (λ + ε)i) (D.). (D.) (D.3) 000κ 3 Note that (H+λ I) 0 by Equations (D.) and (6.). The second part of (D.) can be bounded as follows ( λ v v ) (λ 5 ε) ( λ ) ε 6 λ + 5 ε 6 (λ ) ε(λ ) (λ ) κ 3 7

19 Above, inequality uses λ B (owing to Proposition 5.3) and our assumption on ε from (6.). Putting these together we get that m(v) m(h ) + 50κ 3. D. Proofs for Claims 6. and 6.3 For notational simplicity, let us rotate the space into the basis in the eigenspace of H; let the i-th dimension correspond to the i-th largest eigenvalue λ i of H. We have λ λ... λ d = λ min. et g i denote the i-th coordinate of g in this basis. emma 5. implies where we denote by m(h ) = S = i i:λ i +λ κ gi λ i + λ (λ ) 3 3 =: S + S (λ ) 3 3. (D.4) g i λ i + λ From Corollary 5. we can also obtain i:λ i +λ >0 Now the assumption (H + λi) g λ We begin with a few auxiliary claims. i S = i:0<λ i +λ κ g i λ i + λ gi (λ i + λ ) 4(λ ). (D.5) is equivalent to gi (λ i + λ) 4λ Claim D.. If λ min (H) κ then S 000 m ( λvmin Proof of Claim D.. We compute that gi S = λ i + λ = i:0<λ i +λ κ i:0<λ i +λ κ ) g i (λ i + λ ) (λ i + λ ) κ i:0<λ i +λ κ g i (λ i + λ ) (D.6) 4 κ (λ ) 6 λ min 3. (D.7) Above, uses (D.5), and follows because we have λ min (H) κ in the assumption and have λ λ min (H) + κ in the assumption of Case of Main emma. et us now consider the value of the vector λv min. We have that ( ) λvmin m = λg v min + λ vmin Hv min 8 + λ3 48 λg v min + λ λ min 6 + λ3 48 λg v min + λ λ min 6 λ λ min 4 λg v min + λ λ min 48 Above, is because our assumption λ min (H) κ and assumption v minhv min λ min (H) + 0κ together imply v min Hv min λ min. follows from λ min(h) κ and λ λ min(h) + κ. Now, recall that the sign of v min is chosen so g v min is non-positive, and therefore by our 8

20 assumptions λ min (H) κ and λ λ min(h) + κ, we get the following inequality: ( ) λvmin m λ min 3 48 Putting inequalities (D.8) and (D.7) together finishes the proof of Claim D.. (D.8) We also show the following lemma, the proof of which can be seen from inequality (D.3), as part of the proof of Claim 6. above. emma D.. If we have λ, v such that (H + λi) g λ and v + (H + λi) g ε with ε satisfying condition (6.) then we have that g v + v (H + λi)v Claim D.3. S 4m(v) 50κ 3 Proof of Claim D.3. We have that m(v) = g v + v (H + λi)v g (H + λi) g λ v + 6 v 3 = g (H + λi) ( g λ v ) 6 v + ( λ 3 ε g (H + λi) g + ) ( λ 6 + ε 3 000κ 3 000κ 3 ) + 000κ 3 3 g (H + λi) g λ κ 3 g (H + λi) g + 500κ 3 (D.9) Above, is due to emma D.; uses our condition on v which gives v [λ 3 ε, λ + 3 ε]; 3 uses our condition (6.) on ε. We now bound S. For this purpose first we note that if λ i + λ κ and λ λ κ then Therefore, the sum S satisfies S = gi λ i + λ i:λ i +λ κ (λ i + λ ) /κ + λ i + λ λ i + λ. i:0<λ i +λ κ gi (λ i + λ) (g (H + λi) g) 4m(v) 50κ 3 (Note that we have H + λi 0.) This finishes the proof of Claim D.3. { ( )} Claim 6.. If λ min (H) κ then m(h ) 500 min m(v), m λvmin 500κ 3 Proof of Claim 6.. We derive that m(h ) = (S + S ) (λ ) 3 3 m(v) 4 m(v) 3 500κ m 500κ m 9 (S + S ) 6 λ min 3 ) 3 6 λ min 3 ( λvmin ( λvmin ) 3

21 { 500 min m(v), m ( λvmin )} 500κ 3 Above, uses equation (D.4), inequality follows because we have λ min (H) κ in the assumption and have λ λ min (H) + κ in the assumption of Case of Main emma ; inequality 3 uses Claim D.3 and Claim D.; and inequality 4 uses (D.8). This finishes the proof of Claim 6.. Claim 6.3. If λ min (H) κ then m(h ) m(v) 6 κ 3 Proof of Claim 6.3. This time we lower bound S slightly differently: 4 S κ (λ ) 6 κ 3 where comes from the second to last inequality from (D.7) and comes from λ λ min (H) + κ κ using our assumption in Case of Main emma. Putting these together we get that m(h ) = (S + S ) (λ ) 3 3 m(v) 500κ κ 3 m(v) κ 3. Above, comes from (D.4), uses Claim D.3, lower bound (D.0) and (λ ) κ 3 (D.0) λ E Proof of Main emma Main emma. In the same setting as Main emma, suppose m(h ) ε3/ output vector v satisfies the following conditions: v h + 3 and m(v) ε κ κ. Proof of Main emma. et s first note that from the value given in emma 5., 300. Then the (λ ) 3 3 m(h ) 3/ ε 3/. (E.) 00 If Case occurs, we have v (H + λi) g + ε λ + ε + ε 3 λ + 5 ε 4 h + 0κ. Above, inequalities and both use the assumptions of Case ; inequality 3 uses the fact that λ [λ ε, λ + ε] which again follows from the assumptions of Case (see (D.)); inequality 4 uses h = λ from emma 5. as well as our assumption (6.) on ε. As for the quantity m(v), we bound it as follows m(v) = g + Hv + v v g + (H + λi)v + λ v + v H + λi ε + λ v + v 3 ( + B) ε + λ(λ + 3 ε) + (λ + 3 ε) = ( + B) ε + 6λ ελ + 9 ε 6(λ + ε) + ( + 3B) ε + 9 ε 5 6(λ ) + ( + 56B) ε + 5 ε 6 ε κ. Above, inequality uses triangle inequality; inequality uses v + (H + λi) g ε; inequality 3 uses H + λi + B and v λ + 3 ε which comes from our upper bound on v above; 4 uses the fact that λ [λ ε, λ + ε] which again follows from the assumptions of Case (see 0

22 (D.)); inequality 5 uses λ B; and inequality 6 uses (E.) together with our assumption (6.) on ε. If Case occurs, we have v (H + λi) g + ε λ + ε 3 (λ + /κ) + ε h + 3 κ. (E.) Above, inequalities and both use the assumptions of Case ; inequality 3 uses λ λ min (H)+ /κ from our assumption of Case as well as λ min (H) λ which comes from emma 5.; inequality 4 uses h = λ from emma 5. as well as our assumption (6.) on ε. The quantity m(v) can be bounded in an analogous manner as Case : m(v) H + λi ε + λ v + v ( + B) ε + 6λ + 0κ 6(λ + κ ) + 0κ (λ ) + 5 κ λ(λ + ε) + (λ + ε) 3 ε κ. Above, inequality uses our assumption (6.) on ε; inequality uses λ λ + κ which appeared in (E.); inequality 3 uses (E.). F Proof of emma 7. emma 7.. The following statements hold for all i until FastCubicMin terminates (a) λ i [0, B], λ i + λ max (H) 3B (b) λ i + λ min (H) 3 0κ (c) λ i+ + λ min (H) 3 4 (λ i + λ min (H)) unless λ i+ = 0 Moreover when FastCubicMin terminates at ine 0 we have λ i + λ min (H) κ. Proof of emma 7.. The lemma follows via induction. To see (a) and (b) at the base case i = 0, recall that the definitions of B and together ensure λ 0 + λ max (H) 3B and λ 0 + λ min (H) 3 0κ. Also λ 0 [0, B]. Suppose now for some i 0 properties (a) and (b) hold. It is easy to check that λ i λ i and thus we have λ i + λ max (H) B and λ i B. This implies property (a) at iteration i + also hold. We now proceed to show property (c) at iteration i and property (b) at iteration i +. Recall that the algorithm ensures 9 0 λ max((h + λ i I) ) w (H + λ i I) w λ max ((H + λ i I) ), and by the definition of w we have 9 0 λ max((h + λ i I) ) ˆε w w ˆε λ max ((H + λ i I) ). (F.) Now, since 3 0κ λ i + λ min (H) 3B from the inductive assumption, it follows from the choice of ˆε that ˆε 30B 0(λ i + λ min (H)) = λ max((h + λ i I) ). (F.) 0 Plugging Equation (F.) into Equation (F.) we get 8 0 λ i + λ min (H) = 8 0 λ max((h + λ i I) ) w w ˆε λ max ((H + λ i I) ) = λ i + λ min (H).

23 Inverting this chain of inequalities, we have λ i + λ min (H) 5(λ i + λ min (H)) 8. (F.3) From this we derive the following implications: κ = (λ i + λ min (H)) κ (F.4) > κ = (λ i + λ min (H)) > 4 5κ (F.5) If Condition (F.4) happens, our algorithm FastCubicMin outputs on ine 0; in such a case (F.4) implies our desired inequality λ i + λ min (H) κ. If Condition (F.5) happens, our choice λ i+ λ i and Equation (F.3) together imply that 3 4 (λ i + λ min (H)) λ i+ + λ min (H) 6 (λ i + λ min (H)) Combining this with (F.5) we get that 3 4 (λ i + λ min (H)) λ i+ + λ min (H) ( ) κ 0κ. Therefore, we conclude that property (c) at iteration i holds and property (b) at iteration i + hold because λ i+ λ i+. This finishes the proof of emma 7.. G Proof of Main emma 3: Running Time Half Having proven the correctness of the algorithm, we now aim to bound the overall running time of FastCubicMin, completing the proof of Main emma 3. We prove in Appendix H the following lemma: emma G.. If λ +λ min (H) c (0, ) then BinarySearch ends in O ( log( (λ λ )B c ε ) ) iterations. Since in our FastCubicMin algorithm, it satisfies λ i B and λ i +λ min (H) 3 taken together with our choice of ε we have: Claim G.. Claim G.3. Each invocation of BinarySearch ends in O ( log(/ ε) ) iterations. FastCubicMin ends in at most O(log(Bκ)) outer loops. 0κ (see emma 7.), Proof. According to emma 7. we have 3 4 (λ i + λ min (H)) λ i + λ min (H) so the quantity λ i + λ min (H) decreases by a constant factor per iteration (except possibly λ i = 0 the last outer loop in which case we shall terminate in one more iteration). On one hand, we have began with λ 0 + λ min (H) 3B. On the other hand, we always have λ i + λ min (H) 3 0κ according to emma 7.. Therefore, the total number of outer loops is at most O(log(Bκ)). G. Matrix Inverse Since the key component of the running time is the computation of (H+λ i I) b for different vectors b we will first bound the condition number of the matrix (H + λ i I) via the following lemma Claim G.4. Through out the execution of FastCubicMin and BinarySearch whenever we compute (H + λ i I) λ b for some vector b it satisfies i + λ i +λ min (H) 0κ. Proof of Claim G.4. We first focus on ine 5 and ine of FastCubicMin. There are two cases. If λ i, then according to I H I we can bound 3 because the left hand λ i + λ i +λ min (H)

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic

More information

arxiv: v4 [math.oc] 11 Jun 2018

arxiv: v4 [math.oc] 11 Jun 2018 Natasha : Faster Non-Convex Optimization han SGD How to Swing By Saddle Points (version 4) arxiv:708.08694v4 [math.oc] Jun 08 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research, Redmond August 8,

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC

Non-Convex Optimization in Machine Learning. Jan Mrkos AIC Non-Convex Optimization in Machine Learning Jan Mrkos AIC The Plan 1. Introduction 2. Non convexity 3. (Some) optimization approaches 4. Speed and stuff? Neural net universal approximation Theorem (1989):

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

arxiv: v1 [cs.lg] 17 Nov 2017

arxiv: v1 [cs.lg] 17 Nov 2017 Neon: Finding Local Minima via First-Order Oracles (version ) Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research Yuanzhi Li yuanzhil@cs.princeton.edu Princeton University arxiv:7.06673v [cs.lg] 7

More information

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo

Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Approximate Second Order Algorithms Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Why Second Order Algorithms? Invariant under affine transformations e.g. stretching a function preserves the convergence

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization

More information

Mini-Course 1: SGD Escapes Saddle Points

Mini-Course 1: SGD Escapes Saddle Points Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: min x f (x) GD does iterative updates x t+1 = x t η t f (x t ) Gradient Descent

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Second-Order Stochastic Optimization for Machine Learning in Linear Time

Second-Order Stochastic Optimization for Machine Learning in Linear Time Journal of Machine Learning Research 8 (207) -40 Submitted 9/6; Revised 8/7; Published /7 Second-Order Stochastic Optimization for Machine Learning in Linear Time Naman Agarwal Computer Science Department

More information

Least Sparsity of p-norm based Optimization Problems with p > 1

Least Sparsity of p-norm based Optimization Problems with p > 1 Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from

More information

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017

Non-Convex Optimization. CS6787 Lecture 7 Fall 2017 Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

This manuscript is for review purposes only.

This manuscript is for review purposes only. 1 2 3 4 5 6 7 8 9 10 11 12 THE USE OF QUADRATIC REGULARIZATION WITH A CUBIC DESCENT CONDITION FOR UNCONSTRAINED OPTIMIZATION E. G. BIRGIN AND J. M. MARTíNEZ Abstract. Cubic-regularization and trust-region

More information

Advanced computational methods X Selected Topics: SGD

Advanced computational methods X Selected Topics: SGD Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Non-convex optimization. Issam Laradji

Non-convex optimization. Issam Laradji Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

1 Kernel methods & optimization

1 Kernel methods & optimization Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Gradient Descent. Dr. Xiaowei Huang

Gradient Descent. Dr. Xiaowei Huang Gradient Descent Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Three machine learning algorithms: decision tree learning k-nn linear regression only optimization objectives are discussed,

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Convex Functions and Optimization

Convex Functions and Optimization Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized

More information

Midterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015

Midterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015 Midterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015 The test lasts 1 hour and 15 minutes. No documents are allowed. The use of a calculator, cell phone or other equivalent electronic

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Yi Zhou Department of ECE The Ohio State University zhou.1172@osu.edu Zhe Wang Department of ECE The Ohio State University

More information

Lecture 18 Oct. 30, 2014

Lecture 18 Oct. 30, 2014 CS 224: Advanced Algorithms Fall 214 Lecture 18 Oct. 3, 214 Prof. Jelani Nelson Scribe: Xiaoyu He 1 Overview In this lecture we will describe a path-following implementation of the Interior Point Method

More information

Math 471 (Numerical methods) Chapter 3 (second half). System of equations

Math 471 (Numerical methods) Chapter 3 (second half). System of equations Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Math 113 Homework 5. Bowei Liu, Chao Li. Fall 2013

Math 113 Homework 5. Bowei Liu, Chao Li. Fall 2013 Math 113 Homework 5 Bowei Liu, Chao Li Fall 2013 This homework is due Thursday November 7th at the start of class. Remember to write clearly, and justify your solutions. Please make sure to put your name

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017

Stochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017 Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Linear Regression (continued)

Linear Regression (continued) Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression

More information

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Machine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function

More information

Optimization for neural networks

Optimization for neural networks 0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make

More information

arxiv: v2 [math.oc] 1 Nov 2017

arxiv: v2 [math.oc] 1 Nov 2017 Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of

More information

Algorithms for Learning Good Step Sizes

Algorithms for Learning Good Step Sizes 1 Algorithms for Learning Good Step Sizes Brian Zhang (bhz) and Manikant Tiwari (manikant) with the guidance of Prof. Tim Roughgarden I. MOTIVATION AND PREVIOUS WORK Many common algorithms in machine learning,

More information

Stochastic Cubic Regularization for Fast Nonconvex Optimization

Stochastic Cubic Regularization for Fast Nonconvex Optimization Stochastic Cubic Regularization for Fast Nonconvex Optimization Nilesh Tripuraneni Mitchell Stern Chi Jin Jeffrey Regier Michael I. Jordan {nilesh tripuraneni,mitchell,chijin,regier}@berkeley.edu jordan@cs.berkeley.edu

More information

Notes for CS542G (Iterative Solvers for Linear Systems)

Notes for CS542G (Iterative Solvers for Linear Systems) Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,

More information

Trust Regions. Charles J. Geyer. March 27, 2013

Trust Regions. Charles J. Geyer. March 27, 2013 Trust Regions Charles J. Geyer March 27, 2013 1 Trust Region Theory We follow Nocedal and Wright (1999, Chapter 4), using their notation. Fletcher (1987, Section 5.1) discusses the same algorithm, but

More information

Lecture 6 Optimization for Deep Neural Networks

Lecture 6 Optimization for Deep Neural Networks Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Clément Royer - University of Wisconsin-Madison Joint work with Stephen J. Wright MOPTA, Bethlehem,

More information

Solution of the 8 th Homework

Solution of the 8 th Homework Solution of the 8 th Homework Sangchul Lee December 8, 2014 1 Preinary 1.1 A simple remark on continuity The following is a very simple and trivial observation. But still this saves a lot of words in actual

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

SVRG Escapes Saddle Points

SVRG Escapes Saddle Points DUKE UNIVERSITY SVRG Escapes Saddle Points by Weiyao Wang A thesis submitted to in partial fulfillment of the requirements for graduating with distinction in the Department of Computer Science degree of

More information

MATH SOLUTIONS TO PRACTICE MIDTERM LECTURE 1, SUMMER Given vector spaces V and W, V W is the vector space given by

MATH SOLUTIONS TO PRACTICE MIDTERM LECTURE 1, SUMMER Given vector spaces V and W, V W is the vector space given by MATH 110 - SOLUTIONS TO PRACTICE MIDTERM LECTURE 1, SUMMER 2009 GSI: SANTIAGO CAÑEZ 1. Given vector spaces V and W, V W is the vector space given by V W = {(v, w) v V and w W }, with addition and scalar

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute

More information

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ). Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.

More information

Bare-bones outline of eigenvalue theory and the Jordan canonical form

Bare-bones outline of eigenvalue theory and the Jordan canonical form Bare-bones outline of eigenvalue theory and the Jordan canonical form April 3, 2007 N.B.: You should also consult the text/class notes for worked examples. Let F be a field, let V be a finite-dimensional

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

Introduction to gradient descent

Introduction to gradient descent 6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

Chapter 8 Gradient Methods

Chapter 8 Gradient Methods Chapter 8 Gradient Methods An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Introduction Recall that a level set of a function is the set of points satisfying for some constant. Thus, a point

More information

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL) Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

UNCONSTRAINED OPTIMIZATION PAUL SCHRIMPF OCTOBER 24, 2013

UNCONSTRAINED OPTIMIZATION PAUL SCHRIMPF OCTOBER 24, 2013 PAUL SCHRIMPF OCTOBER 24, 213 UNIVERSITY OF BRITISH COLUMBIA ECONOMICS 26 Today s lecture is about unconstrained optimization. If you re following along in the syllabus, you ll notice that we ve skipped

More information

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science EAD 115 Numerical Solution of Engineering and Scientific Problems David M. Rocke Department of Applied Science Multidimensional Unconstrained Optimization Suppose we have a function f() of more than one

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

Symmetric Matrices and Eigendecomposition

Symmetric Matrices and Eigendecomposition Symmetric Matrices and Eigendecomposition Robert M. Freund January, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 Symmetric Matrices and Convexity of Quadratic Functions

More information

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems

The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

The Skorokhod reflection problem for functions with discontinuities (contractive case)

The Skorokhod reflection problem for functions with discontinuities (contractive case) The Skorokhod reflection problem for functions with discontinuities (contractive case) TAKIS KONSTANTOPOULOS Univ. of Texas at Austin Revised March 1999 Abstract Basic properties of the Skorokhod reflection

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Unconstrained Optimization

Unconstrained Optimization 1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015 2 / 36 3 / 36 4 / 36 5 / 36 1. preliminaries 1.1 local approximation

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

5 + 9(10) + 3(100) + 0(1000) + 2(10000) =

5 + 9(10) + 3(100) + 0(1000) + 2(10000) = Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,

More information

Jim Lambers MAT 610 Summer Session Lecture 2 Notes

Jim Lambers MAT 610 Summer Session Lecture 2 Notes Jim Lambers MAT 610 Summer Session 2009-10 Lecture 2 Notes These notes correspond to Sections 2.2-2.4 in the text. Vector Norms Given vectors x and y of length one, which are simply scalars x and y, the

More information