arxiv: v4 [math.oc] 24 Apr 2017
|
|
- Daniela Morgan
- 5 years ago
- Views:
Transcription
1 Finding Approximate ocal Minima Faster than Gradient Descent arxiv:6.046v4 [math.oc] 4 Apr 07 Naman Agarwal namana@cs.princeton.edu Princeton University Zeyuan Allen-Zhu zeyuan@csail.mit.edu Institute for Advanced Study Elad Hazan ehazan@cs.princeton.edu Princeton University November 3, 06 Abstract Tengyu Ma tengyu@cs.princeton.edu Princeton University Brian Bullins bbullins@cs.princeton.edu Princeton University We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.
2 Introduction Finding a global minimizer of a non-convex optimization problem is NP-hard. Thus, the standard goal of efficient non-convex optimization algorithms is instead to find a local minimum. This problem has become increasingly important as the state-of-the-art in machine learning is attained by non-convex models, many of which are variants of deep neural networks. Experiments in [0,, ] suggest that fast convergence to a local minimum is sufficient for training neural nets, while convergence to critical points (points with vanishing gradients) is not. Theoretical works have also affirmed the same phenomenon for other machine learning problems (see [5, 6, 8, 9] and the references therein). In this paper we give a provable linear-time algorithm for finding an approximate local minimum in smooth non-convex optimization. It applies to a general setting of machine learning optimization, and in particular to the optimization problem of training deep neural networks. Furthermore, the running time bound of our algorithm is the fastest known even for the more lenient task of computing a point with vanishing gradient (called a critical point), for a wide range of parameters. Formally, the problem of unconstrained mathematical optimization is stated in general terms as that of finding the minimum value that a function attains over Euclidean space, i.e. min f(x). (.) x Rd If f is convex, the above formulation is convex optimization and is solvable in (randomized) polynomial time even if only a valuation oracle to f is provided. A crucial property of convex functions is that local optimality implies global optimality, allowing for greedy algorithms to reach the global optimum efficiently. Unfortunately, this is no longer the case if f is nonconvex; indeed, even a degree four polynomial can be NP-hard to optimize [3], or even just to check whether a point is not a local minimum [5]. Thus, for non-convex optimization one has to settle for the more modest goal of reaching approximate local optimality efficiently. Note that a particular interest to machine learning is the optimization of functions f : R d R of the finite-sum form f(x) = n f i (x). (.) n Such functions arise when minimizing loss over a training set, where each example i in the set corresponds to one loss function f i in the summation. i= We say that the function f is second-order smooth if it has ipschitz continuous gradient and ipschitz continuous Hessian. We say that a point x is an ε-approximate local minimum if it satisfies (following the tradition of [8]): f(x) ε and f(x) εi, where denotes the Euclidean norm of a vector. We say that a point x is an ε-critical point if it satisfies the gradient condition above, but not necessarily the second-order condition. Critical points include saddle points in addition to local minima. We remark that ε-approximate local minima (even with ε = 0) are not necessarily close to any local minimum, neither in domain nor in function value. However, if we assume in addition the function satisfies the (robust) strict-saddle property [5, 4] (see Section for the precise definition), then an ε-approximate local minimum is guaranteed to be close to a local minimum for sufficiently small ε. Our main theorem below states the time required for the proposed algorithm FastCubic to find an ε-approximate local minimum for second-order smooth functions.
3 Theorem (informal). Ignoring smoothness parameters, the running time of FastCubic to return an ε-approximate local minimum is ( ) ( n ) ( ) n3/4 Õ + ε3/ ε 7/4 T h, for (.) or Õ ε 7/4 T h for the general (.). Above, T h is the time to compute Hessian-vector product for f(x) and T h, is that for an arbitrary f i (x). The full statement of Theorem can be found in Section. Hessian-vector products can be computed in linear time meaning T h, = O(d) and T h = O(nd) for many machine learning problems such as generalized linear models and training neural networks [, 9]. We explain this more generally in Appendix A. Therefore, Corollary.. Algorithm FastCubic returns an ε-approximate local minimum for the optimization problem of training a neural network in time ( ) nd Õ ε 3/ + n3/4 d ε 7/4. Another important aspect of our algorithm is that even in terms of just reaching an ε-critical point, i.e. a point that satisfies f(x) ε without any second-order guarantee, FastCubic is faster than all previous results (see Table for a comparison). The fastest methods to find critical points for a smooth non-convex function are gradient descent and its extensions, jointly known as first-order methods. These methods are extremely efficient in terms of per-iteration complexity; however, they necessarily suffer from a /ε convergence rate [7], to the best of our knowledge, in previous results only higher-order methods seem capable of breaking this /ε bottleneck [8]. For certain ranges of parameters, our FastCubic finds local minima even faster than first-order methods, even though they only find critical points. This is depicted in Table. Paper Total Time Achieving f(x) ε Second-Order Guarantee Gradient Descent (GD) O ( ) nd ε n/a SVRG [] O ( ) nd + n/3 d ε n/a SGD [0] O ( ) d ε n/a 4 noisy SGD [6] a O ( ) d C ε f(x) ε /C I 4 cubic regularization [8] Õ ( ) nd ω +d ω f(x) ε / I ε 3/ this paper Õ ( ) nd f(x) ε / I ε 7/4 this paper Õ ( ) nd + n3/4 d f(x) ε / I ε 3/ ε 7/4 Table : Comparison of known methods. a Here C, C are two constants that are not explicitly written. We believe C 4.
4 . Related work Methods that Provably Reach Critical Points. Recall that only a gradient oracle is needed to reach a critical point. The most commonly used algorithm in practice for training non-convex learning machines such as deep neural networks is stochastic gradient descent (SGD), also known as stochastic approximation [30] and its derivatives. Some practical enhancements widely used in practice are based on Nesterov s acceleration [6] and adaptive regularization []. The variance reduction technique, introduced in [3], was extremely successful in convex optimization, but only recently there was a non-convex counterpart with theoretical benefits introduced []. Methods that Provably Reach ocal Minima. The recent work of Ge et al. [7] showed that a noise-injected version of SGD in fact converges to local minima instead of critical points, as long as the underlying non-convex function is strict-saddle. Their theoretical running time is a large polynomial in the dimension and not competitive with our method (see Table ). The work of ee et al. [4] shows that gradient descent, starting from a random point, almost surely converges to a local minimum of a strict-saddle function. The rates of convergence and precise step-sizes that are required are, however, yet unknown. If second-order information (i.e., the Hessian oracle) is provided, the cubic-regularization method of Nesterov and Polyak [8] converges in O( ) iterations. However, each iteration of Nesterovε 3/ Polyak requires solving a cubic function which, in general, takes time super-linear in the input representation. One natural direction is to apply an approximate trust region solver, such as the linear-time solver of [], to approximately solve the cubic regularization subroutine of Nesterov-Polyak. However, the approximation needed by a naive calculation makes this approach even slower than vanilla gradient descent. Our main challenge is to obtain approximate second-order local-minima and simultaneously improve upon gradient descent. Independently of this paper and concurrently, Carmon et al. [7] develop an accelerated gradient descent method that achieves the same running time for finding an approximate local minimum as in our paper. Remarkably, the same running time is obtained via a very different technique.. Our Techniques Our algorithm is based on the cubic regularization method of Nesterov and Polyak [8, 9, 8]. At a high level, cubic regularization states that if we can minimize a cubic function m(h) g h + h Hh + 6 h 3 exactly, where g = f(x), H = f(x), and is the second-order smoothness of the function f, then we can iteratively perform updates x x + h, and this algorithm converges to an ε-approximate local minimum in O(/ε 3/ ) iterations. Unfortunately, solving this cubic minimization problem exactly, to the best of our knowledge, requires a running time of O(d ω ) where ω is the matrix multiplication constant. Getting around this requires five observations. The first observation is that, minimizing m(h) up to a constant multiplicative approximation (plus a few other constraints) is sufficient for showing an iteration complexity of O(/ε 3/ ). The proof techniques to show this observation are based on extending Nesterov and Polyak. The second observation is that the minimizer h of m(h) must be of the form h = (H+λ I) + g+ v, where λ 0 is some constant satisfying H + λ I 0, and v is the smallest eigenvector of H and + denotes the pseudo-inverse of a matrix. This can be viewed as moving in a mixture direction To be precise, their manuscript appeared online approximately 4 hours before ours. More specifically, we need m t(h) C min h{m t(h)} for some constant C. In addition, we need to have good bounds on h and m(h). 3
5 between choosing h v, and choosing h to follow a shifted Newton s direction h (H + λ I) + g. Intuitively, we wish to reduce both the computation of (H+λ I) + g and v to Hessian-vector products. The first task of computing (H + λ I) + g can be slow, and even if H + λ I is strictly positivedefinite, computing it has a complexity depending on the (possibly huge) condition number of H+λ I [34]. The third observation is that it suffices to pick some λ > λ so both () the condition number of H + λ I is small and () the vectors (H + λ I) g and (H + λ I) g are close. This relies on the structure of m(h). The second task of computing v has a complexity depending on / δ where δ is the target additive error [3, 4]. The fourth observation is that the choice δ = ε suffices for the outer loop of cubic regularization to make sufficient progress. This reduces the complexity to compute v. Finally, finding the correct value λ itself is as hard as minimizing m t (h). The fifth step is to design an iterative scheme that makes only logarithmic number of guesses on λ. This procedure either finds the correct one (via binary search), or finds an approximate one, λ, but satisfying (H + λ I) g and (H + λ I) g being sufficiently close. Putting all the observations together, and balancing all the parameters, we can obtain a cubic minimization subroutine (see FastCubicMin in Algorithm ) that runs in time O(nd + n 3/4 d/ε /4 ). Preliminaries and Main Theorem We use to denote the Euclidean norm of a vector and the spectral norm of a matrix. For a symmetric matrix M we denote by λ max (M) and λ min (M) respectively the maximum and minimum eigenvalues of M. We denote by A B that A B is positive semidefinite (PSD). For a PSD matrix M, we denote by M + its pseudo-inverse if M is not strictly positive definite. We make the following ipschitz continuity assumptions for the gradient and Hessian of the target function f. Namely, there exist, > 0 such that Definition.. x, y R d : f(x) and f(x) f(y) x y. (.) We assume the following complexity parameters on the access to f(x): et T g R be the time complexity to compute f(x) for any x R d. et T h R be the time complexity to compute ( f(x) ) v for any x, v R d. Definition.. We say that f is of finite-sum form if f = n n i= f i(x) and f i (x) for each i [n]. In this case, we define T h, to be the time complexity to compute ( f i (x) ) v for arbitrary x, v R d and i [n]. Next we define the strict-saddle function for which an ε-approximate local minimum is almost equivalent to a local minimum [5, 4]. Definition.3 (strict saddle). Suppose f( ) : R d R is twice differentiable. For α, β, γ 0, we say f is (α, β, γ)-strict saddle if every x R d satisfies at least one of the following three conditions:. f(x) α.. λ min ( f) β. 3. There exists a local minimum x that is γ-close to x in Euclidean distance. We see that if a function is (α, β, γ)-strict saddle, then for ε < min{α, β } an ε-approximate local minimum is γ-close to some local minimum. 4
6 Algorithm FastCubic(f, x 0, ε,, ) Input: f(x) that satisfies (.) with and ; a starting vector x 0 ; a target accuracy ε. : κ ( ) 900 /. ε : for t = 0 to do 3: m t (h) f(x t ) h + h f(x t)h ( + 6 h 3 ) 4: (λ, v, v min ) FastCubicMin f(x t ), f(x t ),,, κ 5: h either v or λv min 6: Set x t+ x t + h whichever gives smaller value for m t(h); 7: if m t (h ) > ε3/ c then return x t+. c is a constant; we proved c = works 8: end for. Main Results The finite-sum setting captures much of supervised learning, including Neural Networks and Generalized inear Models. The main theorem which we show in our paper is as follows: Theorem. FastCubic (Algorithm ) starts from a point x 0 and outputs a point x such that in total time (denoting by D f(x 0 ) f(x )) ( ) Õ D T ε 3/ g + D/4 T ε 7/4 h, or f(x) ε and λ min ( f(x)) ε ( Õ D (T ) ε 3/ g + nt h, + Dn 3/4 /4 ) T ε 7/4 h, in the finite-sum setting (see Definition.). Here Õ hides logarithmic factors in,, /ε, d, and in max x { f(x) }. Two Known Subroutines. Our running time of FastCubic relies on the following recent results for approximate matrix inverse and approximate PCA: Theorem.4 (Approximate Matrix Inverse). Suppose matrix M R d d satisfies M and λi + M δi for constants λ, δ, > 0. et κ λ+ δ. Then, we can compute vector x satisfying x (λi + M) b ε b, (.) using Accelerated gradient descent (AGD) in O ( κ / log(κ/ε) ) iterations, each requiring O(d) time plus the time needed to multiply M with a vector. Moreover, suppose M = n n i= M i where each M i is symmetric and satisfies M i. If M i b can be computed in time O(d ) for each i and vector b, then accelerated SVRG [4, 33] computes a vector x that satisfies equation (.) in time O ( max{n, n 3/4 κ / } d log (κ/ε) ). We refer to the running time for this computation as T inverse (κ, ε) and the algorithm as A. Above, the SVRG based running time shall be used only towards our finite-sum case in Definition.. Theorem.5 (AppxPCA [3, 3, 4]). et M R d d be a symmetric matrix with eigenvalues λ λ d 0. With probability at least p, AppxPCA produces a unit vector w satisfying w Mw ( δ )( ε)λ max (M). The total running time is Õ(T inverse(/δ, εδ )). 3 Our Fast Cubic Regularization Algorithm Recall that the cubic regularization method of Nesterov and Polyak [8] studies the following upper bound on the change in objective value as we move from a point x t to x t + h: (it follows simply 5
7 from the Taylor series truncated to the third order) h R d : f(x t + h) f(x t ) m t (h) f(x t ) h + h f(x t )h + 6 h 3. (3.) Denote by h an arbitrary minimizer of m t (h). We propose in this paper a subroutine FastCubicMin to minimizes m t (h) approximately. Note that FastCubicMin returns two vectors v and v min. We then choose h to be either v or λv min, whichever gives a smaller value for m t(h). Before discussing the details of FastCubicMin, let us first state a main theorem for FastCubicMin: 3 Theorem (Guarantees of FastCubicMin). satisfies (a) It produces a vector h satisfying m t (h ) 0 and The algorithm FastCubicMin finds a vector h that either 3000m t (h ) m t (h ) or m t (h ) ε3/ 800. (b) If m(h ) ε3/ 300, then h h + ε 4 and m t (h ) ε. (c) FastCubicMin Õ( runs in time: (using Õ to hide logarithmic factors in,, /ε, d, f(x t ) ) ) T (ε) /4 h where T h is the time to multiply f(x t ) to a vector; ( Õ max { n, n 3/4 ) } Th, where T (ε) /4 h, is the time to multiply f i (x t ) with a vector. Above, the first guarantee promises that we are either done (because m t (h ) is close to zero), or we obtain a /3000 multiplicative approximation to m t (h ). Our second guarantee in Theorem promises that when we are done (because m t (h ) is close to zero), the output vector h and h are roughly similar in Euclidean norm and have a small gradient m t (h ). Our third guarantee gives the time complexity of FastCubicMin. Now, our final algorithm FastCubic for finding the ε-approximate local minimum of f(x) is included in Algorithm. It simply iteratively calls FastCubicMin to find an approximate minimizer, and it then stops whenever m t (h ) > ε3/ c for some large constant c. Roadmap. In Section 4 we show why Theorem implies Theorem. All the remaining sections are for the purpose of proving Theorem. Because our FastCubicMin is very technical, instead of stating what the algorithm is right away, we decide to take a different path. In Section 5, we first state a lemma characterizing what h looks like. In Section 6, we provide a set of sufficient conditions which look similar to the characterization of h, and show that as long as these conditions are met, Theorem -a and -b follow easily. Finally, in Section 7, we state FastCubicMin and explain why it satisfies these sufficient conditions and why it runs in the aforementioned time. 4 Theorem implies Theorem In this section, we show that Theorem implies Theorem. It relies on the following lemma (proved in Appendix B) regarding the sufficient condition for us to reach an ε-approximate local minimum. 3 To present the simplest result, we have not tried to improve the constant dependency in this paper. 6
8 emma 4.. If m t (h ) ε3/ 800 and h is an approximate minimizer of m t (h) satisfying h h + ε 4 and m t (h ) ε, then we have that f(x t + h ) ε and λ min ( f(x t + h )) ε. Proof of Theorem from Theorem. When FastCubic terminates, we have m t (h ) > ε3/ c ; therefore, it satisfies m t (h ) ε3/ 800 according to Theorem -a. Combining this with Theorem -b and Corollary 4., we conclude that in the last iteration of FastCubic, our output satisfies f(x t +h ) ε and λ min ( f(x t + h )) ε. This finishes the proof with respect to the accuracy conditions. As for the running time, in every iteration except for the last one, FastCubic satisfies m t (h ) Ω ( ) ( ε 3/. Therefore by (3.), we must have decreased the objective by at least Ω ε 3/ ) in this round, and this cannot happen for more than O ( (f(x 0 ) f ) ) ε iterations. The final running time 3/ of FastCubic follows from this bound together with Theorem -c. Therefore, in the rest of the paper it suffices to study FastCubicMin and prove Theorem. 5 Characterization emma of the Minimizer h For notational simplicity in this and the subsequent sections we focus on the following problem: minimize m(h) g h + h Hh + 6 h 3 where H is a symmetric matrix with H. Recall from the previous section that we have denoted by h an arbitrary minimizer of m(h). We have the following lemma which characterizes h : (a variant of this lemma has appeared in [8], and we prove it in the appendix for the sake of completeness) emma 5.. We have h is a minimizer of m(h) if and only if there exists λ 0 such that H + λ I 0, (H + λ I)h = g, h = λ. The objective value in this case is given by m(h ) = g (H + λ I) + g (λ ) The following corollary comes from emma 5. and its proof: Corollary 5.. have The value λ in emma 5. is unique, and for every λ satisfying H + λi 0, we (H + λi) g > λ λ > λ and (H + λi) g < λ λ < λ. In the above characterization, we have a crude upper bound on λ : Proposition 5.3. We have λ B max { + g, } with λ defined in emma 5.. Proof. We have (H + BI) g Corollary 5.. g λ min (H+BI) g B < B and therefore λ B due to 7
9 6 Sufficient Conditions for Theorem -a and -b Without worrying about the design of FastCubicMin at this moment, let us first state a set of sufficient conditions under which the assumptions in Theorem -a can be satisfied. Main emma. Consider an algorithm that outputs a real λ [0, B], a vector v R d, and a unit vector v min R d. Additionally, suppose numbers κ, ε 0 satisfying the following conditions: ε 0000 (max {κ,,, g, (H + λi), B}) 0 (6.) (H + (λ ε)i) 0 (6.) Moreover, suppose that the outputs (λ, v, v min ) satisfy one of the following two cases: Case : (H + λi) g [λ ε, λ + ε] and v + (H + λi) g ε Case : The following conditions are satisfied: (a) λ λ and λ + λ min (H) κ (b) (H + λi) g λ and v + (H + λi) g ε (c) vmin Hv min λ min (H) + 0κ Then, at least one of the two choices h { v, λv } min satisfy either m(h ) 3000m(h ) or m(h ) 3 κ 3. et us compare such sufficient conditions to the characterization emma 5.. In Case, up to a very small error ε, we have essentially found a vector v that satisfies v (H + λi) g and v λ. Therefore, this v should be close to h for obvious reason. (This is the simple case.) In Case, we have only found a vector v that satisfies v (H + λi) g and v λ. In this case, we also compute an approximate lowest eigenvector v min of λ min (H) up to an additive /0κ accuracy (see case -c). We will make sure that, as long as the conditions in -a hold, then either v or λv min will be an approximate minimizer for m t(h). (This is the hard case.) Proof of Main emma. We first consider Case. According to Corollary 5., if ε = 0 then v is a minimizer of m(h). The following claim extends this argument to the setting when ε > 0: Claim 6.. If λ and v satisfy Case and ε satisfies (6.), then m(v) m(h ) + 50κ 3 From the above lemma it follows that either m(h ) 8 satisfies the conditions of the theorem. κ 3 otherwise m(h ).m(v) which We now consider Case, and in this case we make the following two claims: { ( )} Claim 6.. If λ min (H) κ then m(h ) 500 min m(v), m λvmin. 500κ 3 Claim 6.3. If λ min (H) κ then m(h ) m(v) 6 κ 3. emma now follows from the two claims because we can output the vector h which has the lowest value of m(h ) amongst the two choices h { v, λ v } min d. This satisfies either m(h ) 3000m(h ) or m(h ) 3. κ 3 The missing proofs of the three claims are deferred to Appendix D. 8
10 The next main lemma shows that, under the same sufficient conditions as Main emma, we also have that Theorem -b holds. (Its proof is contained in Appendix E.) Main emma. In the same setting as Main emma, suppose m(h ) ε3/ output vector v satisfies the following conditions: v h + 3 and m(v) ε κ κ. 7 Main Algorithms for Theorem 300. Then the We are now ready to state our main algorithm FastCubicMin and sketch why it satisfies the sufficient conditions in Main emma. As described in Algorithm, our algorithm starts with a very large choice λ 0 B and decreases it gradually. At each iteration i, it computes an approximate inverse v satisfying v + (H + λ i I) g ε with respect to the current λ i. Then there are three cases, depending on whether v is approximately equal to, larger than, or smaller than λ i. At a high level, if it is equal, then we have met Case in Main emma ; if it is larger, then we can binary search the correct value of λ in the interval [λ i, λ i ]; and if it is smaller, then we need to compute an approximate eigenvector and carefully choose the next point λ i+. We state our main lemma below regarding the correctness and running time of FastCubicMin. Main emma 3. FastCubicMin in Algorithm outputs a real λ [0, B], a vector v R d, and a unit vector v min R d satisfying one of the two sufficient conditions in Main emma. We also have that the procedure can be implemented in a total running time of Õ( κ T h ) if Accelerated Gradient Descent is used in Theorem.4 to invert matrices. Õ( max{n, n 3/4 κ } T h, ) if we use accelerated SVRG as the subprocedure A in Theorem.4. Here Õ hides logarithmic factors in,, κ, d, B. We prove the correctness half of Main emma 3, and defer its running time analysis to Appendix G. 7. Correctness Half of Main emma 3 We will now establish the correctness of our algorithm. We first observe that the BinarySearch subroutine returns (λ, v, ) that satisfies Case of Main emma. Fact 7.. BinarySearch outputs a pair λ and v such that (H + λi) g [λ ε, λ + ε] and v + (H + λi) g ε. Proof. The latter is guaranteed by line 3 in BinarySearch, and the former is implied by the latter because (H + λi) g [ v ε/, v + ε/ ] [ λ ε, λ + ε ]. We also establish the following invariants regarding the values λ i. (Proof in Appendix F.) emma 7.. The following statements hold for all i until FastCubicMin terminates (a) λ i [0, B], λ i + λ max (H) 3B (b) λ i + λ min (H) 3 0κ (c) λ i+ + λ min (H) 3 4 (λ i + λ min (H)) unless λ i+ = 0 Moreover when FastCubicMin terminates at ine 0 we have λ i + λ min (H) κ. We now prove the output (λ, v, v min ) of FastCubicMin satisfies the sufficient conditions of Main emma. 9
11 Algorithm FastCubicMin(g, H,,, κ) (main algorithm for cubic minimization) Input: g a vector, H a symmetric matrix, parameters κ, and which satisfies I H I. Output: (λ, v, v min ) : B + ( g + κ. : ε / 0000 ( max {, g, 3κ 0, B, }) 0 3: λ 0 B. 4: for i = 0 to do 5: Compute v such that v + (H + λ i I) g ε. 6: if v [λ i ε, λ i + ε] then 7: return (λ i, v, ). 8: else if v > λ i + ε then 9: return BinarySearch(λ = λ i, λ = λ i, ε). 0: else if v < λ i ε then : et Power Method find vector w that is 9/0-appx leading eigenvector of (H + λ i I) : 9 0 λ max((h + λ i I) ) w (H + λ i I) w λ max ((H + λ i I) ). : Compute a vector w such that w (H + λ i I) w ˆε w w ˆε. 3: 4: if > κ then 5: λi+ λ i. 6: if λi+ > 0 then λ i+ λ i+ else λ i+ 0 7: else 8: 9: Use AppxPCA to find any unit vector v min such that vmin Hv min λ min (H) + 0κ. Flip the sign of v min so that g v min 0. 0: return (λ i, v, v min ). : end if : end if 3: end for 60B. Algorithm 3 BinarySearch(λ, λ, ε) (binary search subroutine) Input: λ λ, (H + λ I) g λ, (H + λ I) g λ, λ + λ min (H) > 0 Output: (λ, v, ) : for t = to do : λ mid λ +λ 3: Compute vector v such that v + (H + λ mid I) g ε/ 4: if v [λ mid ε, λ mid + ε] then 5: return (λ mid, v, ) 6: else if v + ε λ mid then 7: λ λ mid 8: else if v ε λ mid then 9: λ λ mid 0: end if : end for 0
12 Correctness Proof of Main emma 3. We carefully verify these sufficient conditions: emma 7. implies λ i [0, B]. λ i + λ min (H) 3 0κ from emma 7. implies (H + λ ii) 4κ. It is now immediate that the choice of ε on ine satisfies the Condition (6.) in the assumption of Main emma. Since ε 0κ and λ i + λ min (H) 3 0κ it follows that (H + (λ i ε)i) 0 which proves Condition (6.) in Main emma. We now verify Case and in the assumption of Main emma. At the beginning of the algorithm, our choice λ 0 = B ensures (using Proposition 5.3) that (H + λ 0 I) g < λ 0. et us now consider the various places where the algorithm outputs: If FastCubicMin terminates at ine 7, then we have v + (H + λ i I) g ε and additionally (H + λ i I) g [ v ε, v + ε ] [λ i ε, λ i + ε]. Therefore, the output meets Case requirement of Main emma with λ = λ i. If FastCubicMin terminates at ine 9, then (H + λ i I) g > v ε λ i. Obviously, we must have i in this case because (H + λ 0 I) g < λ 0. Therefore, ine 0 must have been reached at the previous iteration, so it implies (H + λ i I) g < λ i. Together, these two imply that we can call BinarySearch with (λ i, λ i ). Owing to Fact 7., the subroutine outputs a pair (λ, v) satisfying the Case requirement of Main emma. If FastCubicMin terminates on ine 0, we verify that Case of Main emma with λ = λ i holds. We first have (H + λ i I) g v + ε λ i. By Corollary 5., we also have that λ i λ. emma 7. tells us λ i satisfies λ i +λ min (H) κ. Vector v satisfies v + (H + λ i I) g ε. Vector v min satisfies v min Hv min λ min (H) + 0κ. In sum, we have verified that all the assumptions of Main emma hold. Final Proof of Theorem. Theorem is a direct corollary of our main lemmas. Main emma 3 ensures that the assumptions of Main emma and Main emma both hold. Now, using the special choice of κ in FastCubic, Theorem -a immediately comes from Main emma ; Theorem -b immediately comes from Main emma ; and Theorem -c immediately comes from Main emma 3. This finishes the proof of Theorem. Acknowledgements We thank Ben Recht for very helpful suggestions and corrections to a previous version. Z. Allen-Zhu is supported by an NSF Grant, no. CCF-4958, and a Microsoft Research Grant, no Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of NSF or Microsoft. A Appendix Computing Hessian-Vector Product in inear Time In this section we sketch the intuition regarding why Hessian-vector products can be computed in linear time in many interesting (especially machine learning) problems. We start by showing that the gradient can be computed in linear time. The algorithm is often referred to as back-propagation,
13 which dates back to Werbos s PhD thesis [35], and has been popularized by Rumelharte et al. [3] for training neural networks. Claim A. (back-propagation, informally stated). Suppose a real-valued function f : R d R can be evaluated by a differentiable circuit of size N. Then, the gradient f can be computed in time O(N + d) (using a circuit of size O(N + d)). 4 The claim follows from simple induction and chain-rule, and is left to the readers. In the training of neural networks, often the size of circuits that computes the objective f is proportional to (or equal to) the number of parameters d. Thus the gradient f can be computed in time O(d) using a circuit of size d. Next, we consider computing f(x) v where v R d. et g(x) := f(x), v be a function from R d to R. Then, we see that if suffices to compute the gradient of g, since f(x) v = g(x). We observe that g(x) can be evaluated in linear time using circuit of size O(d) since we ve shown f(x) can. Thus, using Claim A. again on function g, 5 we conclude that g(x) can also be computed in linear time. B Proof of emma B. and Corollary 4. emma B.. For all h R d, it satisfies f(x t +h ) h + m t (h ) and λ min ( f(x t +h )) ( ) 3 max{0, m t(h )} /3 h. Proof of emma B.. et us denote by g = f(x t ) and H = f(x t ) in this proof. We begin by proving the first order condition. Note that we have m t (h) = g + Hh + h h. Recall h is a minimizer of argmin m t (h). The characterization result in emma 5. shows H + h I 0, and thus They imply g h + (h ) Hh + h 3 = m t (h ) h = 0 (h ) Hh + ( ) h 3 = (h ) H + h I h 0. m t (h ) = g h + (h ) Hh where uses (B.) and uses (B.). + h h 3 3 h 3 = h 3 = (h ) Hh 3 h 3 We compute the norm of the gradient at a point x t + h for any h R d : f(x t + h ) f(x t + h ) m t (h ) + m t (h ) ( = f(x t ) + f(x t + τh )h dτ g + Hh + h h ) + mt (h ) 0 4 Technically, we assume that the gradient of each gate can be computed in O() time 5 We assume here that the original circuits are twice differentiable (B.) (B.) (B.3)
14 0 ( f(x t + τh ) H ) h dτ + h + m t (h ) 3 h τdτ + h + m t (h ) = h + m t (h ) 0 (B.4) where 3 follows from the ipschitz continuity on the Hessian (.). This proves the first conclusion of the lemma. As for the second-order condition, we first note that for all h R d, by the ipschitz continuity on the Hessian (.), we have f(x t + h ) f(x t ) h. However, this implies λ min ( f(x t + h )) λ min ( f(x t )) h. (B.5) because if two matrices A and B satisfies A B p, then it must satisfy λmin (A) λ min (B) p as well. We consider two cases: if λ min ( f(x t )) 0, then we have λ min ( f(x t + h )) h. (B.6) Otherwise, we consider the case where λ min ( f(x t )) = λ min (H) < 0. et ν d be the normalized eigenvector corresponding to λ min (H), and define h sign(g ν d ) λ min(h) ν d. We calculate m t ( h) as follows: m t ( h) = g h h H h + = (λ min(h)) h 3 h H h + 6 h 3 = (λ min(h)) νd Hν d + 4 λ min(h) λ min(h) 3 (λ min (H)) 3 = 3 3, (B.7) where uses νd Hν d = λ min (H) < 0, and uses the assumption that λ min (H) < 0. Since by definition m t (h ) m t ( h), we can deduce from inequality (B.7) that ( 3 λ min ( m t (h ) /3 ) f(x t )) = λ min (H). (B.8) Now we put together inequalities (B.5) and (B.8), and obtain ( 3 λ min ( f(x t + h m t (h ) /3 ) )) h. (B.9) Combining (B.6) and (B.9) we finish the proof of emma B.. Corollary 4.. If m t (h ) ε3/ 800 and h is an approximate minimizer of m t (h) satisfying h h + ε 4 and m t (h ) ε, then we have that f(x t + h ) ε and λ min ( f(x t + h )) ε. Proof of Corollary 4.. First of all, our assumption that m t (h ) ε3/ ε 800, along with inequality (B.3), tells us that h 4. This, together with our assumption on h, implies h Since we also assume m t (h ) ε, we have from emma B. that f(x t + h ) h + m t (h ) ε 4 + ε ε. ε. For the second-order condition, we can again apply emma B. to get ( 3 λ min ( f(x t + h max{0, m t (h ) /3 ) /3 )} (3 )) h 3/ ε 3/ ε ε
15 C Proof of emma 5. and Corollary 5. We begin by proving a few lemmas that characterize the system of equations. emma C.. Consider the following system of equations/inequalities in variables λ, h: H + λi 0, (H + λi)h = g, h = λ. (C.) The following statements hold for any solution (λ, h ) of the above system: There is a unique value λ that satisfies the above equations. λ is such that λ λ min (H). If λ > λ min (H), then the corresponding h is also unique and is given by h = (H+λI) g. If λ = λ min (H) then g v = 0 for any vector v belonging to the eigenspace corresponding to λ min (H). Subsequently we also have that the corresponding h is of the form h = (H + λi) + g + γv for some γ and v in the lowest eigenspace of H. Proof of emma C.. Note that H + λi 0 ensures that for any solution λ, we have λ λ min (H). Furthermore, for any λ > λ min (H), the corresponding h is uniquely defined by h = (H + λi) g since H + λ I is invertible. If indeed λ = λ min (H), then we have that the equation (H λ min (H)I)h = g has a solution. This implies that g has no component in the null space of H λ min (H)I, or equivalently that it has no component in the eigenspace corresponding to λ min (H). We also have that every solution of (H λ min (H)I)h = g is necessarily of the form for some γ and v in the lowest eigenspace of H. h = (H λ min I) + g + γv We will now prove the uniqueness of λ by contradiction. Consider two distinct values of λ, λ that satisfy the system (C.). If both λ, λ > λ min (H) we get that (H + λ I) g = λ and (H + λ I) g = λ. Now note that (H + λi) g is a strictly decreasing function over the domain λ ( λ min (H), ) and λ is strictly increasing over the same domain. Therefore the above two equations cannot be satisfied for two distinct λ, λ > λ min (H) which is a contradiction. Suppose now without loss of generality that λ = λ min (H). Then we have that the corresponding solution is of the form h = (H + λi) + g + γv for some γ and v in the lowest eigenspace of H and g has no component in the lowest eigenspace of H. It follows that (H λ min (H)I) + g (H + λi) g for any λ > λ min (H). By a similar argument as in the first case, we can now see that the following conditions, (H + λ I) + g + γv min (H) = λ and (H + λ I) g = λ, cannot both be satisfied for λ > λ = λ min (H), giving us a contradiction. This finishes the proof of emma C.. emma C.. et (λ, h) be a solution of the system (C.). Then we have that m(h) = g (H + λi) + g λ3 3. 4
16 Proof of emma C.. By the definition of the system (C.), any solution λ, h to the system should be such that there exists some γ such that h = (H + λi) + g + γv 0 where v 0 is in the null space of H + λi if it exists; otherwise γ = 0. This gives us the following: m(h) = g h + h Hh + 6 h 3 = h (H + λi)h λ h + 6 h 3 = g (H + λi) + g λ3 3. Equality follows because (H + λi)h = g. Equality follows because h = (H + λi) + g + γv 0 and h = λ. emma 5.. h is a minimizer of m(h) if and only if there exists λ 0 such that H + λ I 0, (H + λ I)h = g, h = λ. The objective value in this case is given by m(h ) = g (H + λ I) + g (λ ) Proof of emma 5.. We first compute that m(h) = g + Hh + h h and m(h) = H + h I + ( ) ( ) h h h. h h For the forward direction, suppose h is a minimizer of m(h). et λ = h. Then, the necessary conditions m(h ) = 0 and m(h ) 0 can be written as ( ( ) ( ) ) h g + (H + λ I)h = 0 and w H + λ I + λ h h h w 0, w R n. (C.) From this we see (H+λ I)h = g and h = λ, and the only thing left to verify is H+λ I 0. Note that if h = 0, then the second inquality in (C.) directly implies H + λ I 0. Thus, we only need to focus on h 0. We want to show that w (H + λ I)w 0 for every w R d. Now, if w h = 0 then this trivially follows from (C.), so it suffices to focus on those w that satisfies w h 0. Since w and h are not orthogonal, there exists γ R\{0} such that h + γw = h. (This can be done by squaring both sides and solving the linear system in λ.) Squaring both sides we have (γw) h + γ w = 0. (C.3) Now we bound the difference m(h + γw) m(h ) = g ((h + γw) h ) + (h + γw) H(h + γw) = (h (h + γw)) (H + λ I)h + (h + γw) H(h + γw) h Hh h Hh = λ γ w + (h (h + γw)) Hh + (h + γw) H(h + γw) h Hh = λ γ w + h Hh (h + γw) Hh + (h + γw) H(h + γw) 5
17 = λ γ w + γ w Hw = γ w (H + λ I)w, (C.4) where and follow from (C.) and (C.3), respectively. Since h is a minimizer of m(h), we immediately have m(h + γw) m(h ) = γ w (H + λ I)w 0, and we conclude that (H + λ I) 0. For the backward direction, we will make use emma C. and emma C.. First we note that the function m(h) is continuous and bounded from below, and there exists at least one minimizer h. Suppose now there exists a λ and a corresponding h such that (λ, h ) is a solution to the system C.. The backward direction requires us to prove that h must be a minimizer of m(h). By emma C. we get the following two cases. We prove the backward direction by showing that the conditions in Equation C. determine the minimizer up to its norm. To this end we will use emma C. and emma C.. First we note that the function m(h) is continuous, bounded from below, and tends to + when h, so there exists at least one minimizer h. Suppose now there exists a λ and a corresponding h such that (λ, h ) is a solution to the system (C.). The backward direction requires us to prove that h must be a minimizer of m(h). By emma C. we get the following two cases. If λ > λ min (H) then (λ, h ) is the only solution to the system (C.). By the proof of the forward direction we see that any minimizer of m(h) must satisfy system (C.) and therefore h must be the minimizer. If above is not the case, then λ = λ min (H). et h be any minimizer of m(h). emma C. and the proof of the forward direction ensures that (λ, h ) also satisfies the system (C.). By emma C. we get m(h ) = m(h ) and therefore h is a minimizer too. Corollary 5.. This value λ is unique, and for every λ satisfying H + λi 0, we have (H + λi) g > λ λ > λ and (H + λi) g < λ λ < λ. Proof of Corollary 5.. The uniqueness of λ follows from emma C.. To prove the second part we first make some observations about the function p(y) y (H + yi) g defined on the domain y ( λ min (H), ). Note that p(y) is continuous and strictly increasing over the domain and p(y) as y. The corollary requires us to show that p(λ) < 0 λ > λ and p(λ) > 0 λ < λ. We begin by showing the first equivalence. To see the backward direction note that if λ > λ > λ min (H), by the characterization of λ in emma C. we have that (H + λ I) g = λ i.e. p(λ ) = 0 which implies that p(λ) < 0 as p(y) is a strictly increasing function. For the forward direction note that since p(y) is continuous and strictly increasing we see that the range of the function contains [p(λ), ). Since p(λ) < 0 there must exist a λ > λ such that p(λ ) = 0 which by the characterization in emma C. finishes the proof. Now we will prove that p(λ) > 0 λ < λ. To see the forward direction note that if λ λ then p(λ ) = 0 and p(λ) > 0 which contradicts the fact that p(y) is strictly increasing. For the 6
18 backward direction we consider two cases. Firstly if λ > λ min (H) the conclusion follows similarly by the monotonicity of p(y). If λ = λ min then by emma C., we have that g has no component in the lowest eigenspace of H and therefore if we extend p(y) to λ min (H) by defining p( λ min (H)) λ min(h) (H λ min (H)I) + g we get that p(y) is increasing in the domain y [ λ min (H), ). Now from the characterization of the solution in emma C. we can see that p( λ min (H)) 0 and therefore by monotonicity p(λ) > 0. This finishes the proof. D Proof of Main emma D. Proof of Claim 6. Claim 6.. If λ and v satisfy Case and ε satisfies (6.), then m(v) m(h ) + 50κ 3 Proof of Claim 6.. Note that by the conditions of the theorem we have that (H+(λ ε)i) eq and (H + (λ ε)i) g λ ε and (H + (λ + ε)i) g λ ε, according to Corollary 5. we must have This also implies (using our assumption on ε) Next, consider the value m(v) m(v) = g v + v Hv λ [λ ε, λ + ε] v [λ 5 ε, λ + 5 ε]. + 6 v 3 = g v + v (H + λi)v ( λ v v ) 6 We bound the two parts on the right hand side of (D.) separately. The first part g v + v (H + λi)v g (H + λi) g + g ε + (H + λi) g ε + (H + λi) ε g (H + λi) g + 000κ 3 g (H + λ I) g 3 g (H + λ I) g + g (H + λi) (H + (λ + ε)i) ε κ 3 Above, inequalities and 3 use the assumption on ε in (6.), and inequality uses (H + λi) (H + (λ + ε)i) = (H + λ I) ε(h + λ I) (H + (λ + ε)i) (D.). (D.) (D.3) 000κ 3 Note that (H+λ I) 0 by Equations (D.) and (6.). The second part of (D.) can be bounded as follows ( λ v v ) (λ 5 ε) ( λ ) ε 6 λ + 5 ε 6 (λ ) ε(λ ) (λ ) κ 3 7
19 Above, inequality uses λ B (owing to Proposition 5.3) and our assumption on ε from (6.). Putting these together we get that m(v) m(h ) + 50κ 3. D. Proofs for Claims 6. and 6.3 For notational simplicity, let us rotate the space into the basis in the eigenspace of H; let the i-th dimension correspond to the i-th largest eigenvalue λ i of H. We have λ λ... λ d = λ min. et g i denote the i-th coordinate of g in this basis. emma 5. implies where we denote by m(h ) = S = i i:λ i +λ κ gi λ i + λ (λ ) 3 3 =: S + S (λ ) 3 3. (D.4) g i λ i + λ From Corollary 5. we can also obtain i:λ i +λ >0 Now the assumption (H + λi) g λ We begin with a few auxiliary claims. i S = i:0<λ i +λ κ g i λ i + λ gi (λ i + λ ) 4(λ ). (D.5) is equivalent to gi (λ i + λ) 4λ Claim D.. If λ min (H) κ then S 000 m ( λvmin Proof of Claim D.. We compute that gi S = λ i + λ = i:0<λ i +λ κ i:0<λ i +λ κ ) g i (λ i + λ ) (λ i + λ ) κ i:0<λ i +λ κ g i (λ i + λ ) (D.6) 4 κ (λ ) 6 λ min 3. (D.7) Above, uses (D.5), and follows because we have λ min (H) κ in the assumption and have λ λ min (H) + κ in the assumption of Case of Main emma. et us now consider the value of the vector λv min. We have that ( ) λvmin m = λg v min + λ vmin Hv min 8 + λ3 48 λg v min + λ λ min 6 + λ3 48 λg v min + λ λ min 6 λ λ min 4 λg v min + λ λ min 48 Above, is because our assumption λ min (H) κ and assumption v minhv min λ min (H) + 0κ together imply v min Hv min λ min. follows from λ min(h) κ and λ λ min(h) + κ. Now, recall that the sign of v min is chosen so g v min is non-positive, and therefore by our 8
20 assumptions λ min (H) κ and λ λ min(h) + κ, we get the following inequality: ( ) λvmin m λ min 3 48 Putting inequalities (D.8) and (D.7) together finishes the proof of Claim D.. (D.8) We also show the following lemma, the proof of which can be seen from inequality (D.3), as part of the proof of Claim 6. above. emma D.. If we have λ, v such that (H + λi) g λ and v + (H + λi) g ε with ε satisfying condition (6.) then we have that g v + v (H + λi)v Claim D.3. S 4m(v) 50κ 3 Proof of Claim D.3. We have that m(v) = g v + v (H + λi)v g (H + λi) g λ v + 6 v 3 = g (H + λi) ( g λ v ) 6 v + ( λ 3 ε g (H + λi) g + ) ( λ 6 + ε 3 000κ 3 000κ 3 ) + 000κ 3 3 g (H + λi) g λ κ 3 g (H + λi) g + 500κ 3 (D.9) Above, is due to emma D.; uses our condition on v which gives v [λ 3 ε, λ + 3 ε]; 3 uses our condition (6.) on ε. We now bound S. For this purpose first we note that if λ i + λ κ and λ λ κ then Therefore, the sum S satisfies S = gi λ i + λ i:λ i +λ κ (λ i + λ ) /κ + λ i + λ λ i + λ. i:0<λ i +λ κ gi (λ i + λ) (g (H + λi) g) 4m(v) 50κ 3 (Note that we have H + λi 0.) This finishes the proof of Claim D.3. { ( )} Claim 6.. If λ min (H) κ then m(h ) 500 min m(v), m λvmin 500κ 3 Proof of Claim 6.. We derive that m(h ) = (S + S ) (λ ) 3 3 m(v) 4 m(v) 3 500κ m 500κ m 9 (S + S ) 6 λ min 3 ) 3 6 λ min 3 ( λvmin ( λvmin ) 3
21 { 500 min m(v), m ( λvmin )} 500κ 3 Above, uses equation (D.4), inequality follows because we have λ min (H) κ in the assumption and have λ λ min (H) + κ in the assumption of Case of Main emma ; inequality 3 uses Claim D.3 and Claim D.; and inequality 4 uses (D.8). This finishes the proof of Claim 6.. Claim 6.3. If λ min (H) κ then m(h ) m(v) 6 κ 3 Proof of Claim 6.3. This time we lower bound S slightly differently: 4 S κ (λ ) 6 κ 3 where comes from the second to last inequality from (D.7) and comes from λ λ min (H) + κ κ using our assumption in Case of Main emma. Putting these together we get that m(h ) = (S + S ) (λ ) 3 3 m(v) 500κ κ 3 m(v) κ 3. Above, comes from (D.4), uses Claim D.3, lower bound (D.0) and (λ ) κ 3 (D.0) λ E Proof of Main emma Main emma. In the same setting as Main emma, suppose m(h ) ε3/ output vector v satisfies the following conditions: v h + 3 and m(v) ε κ κ. Proof of Main emma. et s first note that from the value given in emma 5., 300. Then the (λ ) 3 3 m(h ) 3/ ε 3/. (E.) 00 If Case occurs, we have v (H + λi) g + ε λ + ε + ε 3 λ + 5 ε 4 h + 0κ. Above, inequalities and both use the assumptions of Case ; inequality 3 uses the fact that λ [λ ε, λ + ε] which again follows from the assumptions of Case (see (D.)); inequality 4 uses h = λ from emma 5. as well as our assumption (6.) on ε. As for the quantity m(v), we bound it as follows m(v) = g + Hv + v v g + (H + λi)v + λ v + v H + λi ε + λ v + v 3 ( + B) ε + λ(λ + 3 ε) + (λ + 3 ε) = ( + B) ε + 6λ ελ + 9 ε 6(λ + ε) + ( + 3B) ε + 9 ε 5 6(λ ) + ( + 56B) ε + 5 ε 6 ε κ. Above, inequality uses triangle inequality; inequality uses v + (H + λi) g ε; inequality 3 uses H + λi + B and v λ + 3 ε which comes from our upper bound on v above; 4 uses the fact that λ [λ ε, λ + ε] which again follows from the assumptions of Case (see 0
22 (D.)); inequality 5 uses λ B; and inequality 6 uses (E.) together with our assumption (6.) on ε. If Case occurs, we have v (H + λi) g + ε λ + ε 3 (λ + /κ) + ε h + 3 κ. (E.) Above, inequalities and both use the assumptions of Case ; inequality 3 uses λ λ min (H)+ /κ from our assumption of Case as well as λ min (H) λ which comes from emma 5.; inequality 4 uses h = λ from emma 5. as well as our assumption (6.) on ε. The quantity m(v) can be bounded in an analogous manner as Case : m(v) H + λi ε + λ v + v ( + B) ε + 6λ + 0κ 6(λ + κ ) + 0κ (λ ) + 5 κ λ(λ + ε) + (λ + ε) 3 ε κ. Above, inequality uses our assumption (6.) on ε; inequality uses λ λ + κ which appeared in (E.); inequality 3 uses (E.). F Proof of emma 7. emma 7.. The following statements hold for all i until FastCubicMin terminates (a) λ i [0, B], λ i + λ max (H) 3B (b) λ i + λ min (H) 3 0κ (c) λ i+ + λ min (H) 3 4 (λ i + λ min (H)) unless λ i+ = 0 Moreover when FastCubicMin terminates at ine 0 we have λ i + λ min (H) κ. Proof of emma 7.. The lemma follows via induction. To see (a) and (b) at the base case i = 0, recall that the definitions of B and together ensure λ 0 + λ max (H) 3B and λ 0 + λ min (H) 3 0κ. Also λ 0 [0, B]. Suppose now for some i 0 properties (a) and (b) hold. It is easy to check that λ i λ i and thus we have λ i + λ max (H) B and λ i B. This implies property (a) at iteration i + also hold. We now proceed to show property (c) at iteration i and property (b) at iteration i +. Recall that the algorithm ensures 9 0 λ max((h + λ i I) ) w (H + λ i I) w λ max ((H + λ i I) ), and by the definition of w we have 9 0 λ max((h + λ i I) ) ˆε w w ˆε λ max ((H + λ i I) ). (F.) Now, since 3 0κ λ i + λ min (H) 3B from the inductive assumption, it follows from the choice of ˆε that ˆε 30B 0(λ i + λ min (H)) = λ max((h + λ i I) ). (F.) 0 Plugging Equation (F.) into Equation (F.) we get 8 0 λ i + λ min (H) = 8 0 λ max((h + λ i I) ) w w ˆε λ max ((H + λ i I) ) = λ i + λ min (H).
23 Inverting this chain of inequalities, we have λ i + λ min (H) 5(λ i + λ min (H)) 8. (F.3) From this we derive the following implications: κ = (λ i + λ min (H)) κ (F.4) > κ = (λ i + λ min (H)) > 4 5κ (F.5) If Condition (F.4) happens, our algorithm FastCubicMin outputs on ine 0; in such a case (F.4) implies our desired inequality λ i + λ min (H) κ. If Condition (F.5) happens, our choice λ i+ λ i and Equation (F.3) together imply that 3 4 (λ i + λ min (H)) λ i+ + λ min (H) 6 (λ i + λ min (H)) Combining this with (F.5) we get that 3 4 (λ i + λ min (H)) λ i+ + λ min (H) ( ) κ 0κ. Therefore, we conclude that property (c) at iteration i holds and property (b) at iteration i + hold because λ i+ λ i+. This finishes the proof of emma 7.. G Proof of Main emma 3: Running Time Half Having proven the correctness of the algorithm, we now aim to bound the overall running time of FastCubicMin, completing the proof of Main emma 3. We prove in Appendix H the following lemma: emma G.. If λ +λ min (H) c (0, ) then BinarySearch ends in O ( log( (λ λ )B c ε ) ) iterations. Since in our FastCubicMin algorithm, it satisfies λ i B and λ i +λ min (H) 3 taken together with our choice of ε we have: Claim G.. Claim G.3. Each invocation of BinarySearch ends in O ( log(/ ε) ) iterations. FastCubicMin ends in at most O(log(Bκ)) outer loops. 0κ (see emma 7.), Proof. According to emma 7. we have 3 4 (λ i + λ min (H)) λ i + λ min (H) so the quantity λ i + λ min (H) decreases by a constant factor per iteration (except possibly λ i = 0 the last outer loop in which case we shall terminate in one more iteration). On one hand, we have began with λ 0 + λ min (H) 3B. On the other hand, we always have λ i + λ min (H) 3 0κ according to emma 7.. Therefore, the total number of outer loops is at most O(log(Bκ)). G. Matrix Inverse Since the key component of the running time is the computation of (H+λ i I) b for different vectors b we will first bound the condition number of the matrix (H + λ i I) via the following lemma Claim G.4. Through out the execution of FastCubicMin and BinarySearch whenever we compute (H + λ i I) λ b for some vector b it satisfies i + λ i +λ min (H) 0κ. Proof of Claim G.4. We first focus on ine 5 and ine of FastCubicMin. There are two cases. If λ i, then according to I H I we can bound 3 because the left hand λ i + λ i +λ min (H)
Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationThird-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima
Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic
More informationarxiv: v4 [math.oc] 11 Jun 2018
Natasha : Faster Non-Convex Optimization han SGD How to Swing By Saddle Points (version 4) arxiv:708.08694v4 [math.oc] Jun 08 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research, Redmond August 8,
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationNon-Convex Optimization in Machine Learning. Jan Mrkos AIC
Non-Convex Optimization in Machine Learning Jan Mrkos AIC The Plan 1. Introduction 2. Non convexity 3. (Some) optimization approaches 4. Speed and stuff? Neural net universal approximation Theorem (1989):
More informationDesign and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016
Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationarxiv: v1 [cs.lg] 17 Nov 2017
Neon: Finding Local Minima via First-Order Oracles (version ) Zeyuan Allen-Zhu zeyuan@csail.mit.edu Microsoft Research Yuanzhi Li yuanzhil@cs.princeton.edu Princeton University arxiv:7.06673v [cs.lg] 7
More informationApproximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo
Approximate Second Order Algorithms Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo Why Second Order Algorithms? Invariant under affine transformations e.g. stretching a function preserves the convergence
More informationOptimization Tutorial 1. Basic Gradient Descent
E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.
More informationHow to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India
How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization
More informationMini-Course 1: SGD Escapes Saddle Points
Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: min x f (x) GD does iterative updates x t+1 = x t η t f (x t ) Gradient Descent
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationUnconstrained optimization
Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout
More informationUnconstrained minimization of smooth functions
Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and
More informationSecond-Order Stochastic Optimization for Machine Learning in Linear Time
Journal of Machine Learning Research 8 (207) -40 Submitted 9/6; Revised 8/7; Published /7 Second-Order Stochastic Optimization for Machine Learning in Linear Time Naman Agarwal Computer Science Department
More informationLeast Sparsity of p-norm based Optimization Problems with p > 1
Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from
More informationNon-Convex Optimization. CS6787 Lecture 7 Fall 2017
Non-Convex Optimization CS6787 Lecture 7 Fall 2017 First some words about grading I sent out a bunch of grades on the course management system Everyone should have all their grades in Not including paper
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationThis manuscript is for review purposes only.
1 2 3 4 5 6 7 8 9 10 11 12 THE USE OF QUADRATIC REGULARIZATION WITH A CUBIC DESCENT CONDITION FOR UNCONSTRAINED OPTIMIZATION E. G. BIRGIN AND J. M. MARTíNEZ Abstract. Cubic-regularization and trust-region
More informationAdvanced computational methods X Selected Topics: SGD
Advanced computational methods X071521-Selected Topics: SGD. In this lecture, we look at the stochastic gradient descent (SGD) method 1 An illustrating example The MNIST is a simple dataset of variety
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee227c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee227c@berkeley.edu
More informationNOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained
NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationNon-convex optimization. Issam Laradji
Non-convex optimization Issam Laradji Strongly Convex Objective function f(x) x Strongly Convex Objective function Assumptions Gradient Lipschitz continuous f(x) Strongly convex x Strongly Convex Objective
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More information1 Kernel methods & optimization
Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationGradient Descent. Dr. Xiaowei Huang
Gradient Descent Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Three machine learning algorithms: decision tree learning k-nn linear regression only optimization objectives are discussed,
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationConvex Functions and Optimization
Chapter 5 Convex Functions and Optimization 5.1 Convex Functions Our next topic is that of convex functions. Again, we will concentrate on the context of a map f : R n R although the situation can be generalized
More informationMidterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015
Midterm for Introduction to Numerical Analysis I, AMSC/CMSC 466, on 10/29/2015 The test lasts 1 hour and 15 minutes. No documents are allowed. The use of a calculator, cell phone or other equivalent electronic
More informationDay 3 Lecture 3. Optimizing deep networks
Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient
More informationLine Search Methods for Unconstrained Optimisation
Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More informationNonlinear Optimization Methods for Machine Learning
Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks
More informationSecond Order Optimization Algorithms I
Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The
More informationConvergence of Cubic Regularization for Nonconvex Optimization under KŁ Property
Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Yi Zhou Department of ECE The Ohio State University zhou.1172@osu.edu Zhe Wang Department of ECE The Ohio State University
More informationLecture 18 Oct. 30, 2014
CS 224: Advanced Algorithms Fall 214 Lecture 18 Oct. 3, 214 Prof. Jelani Nelson Scribe: Xiaoyu He 1 Overview In this lecture we will describe a path-following implementation of the Interior Point Method
More informationMath 471 (Numerical methods) Chapter 3 (second half). System of equations
Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular
More informationMath 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.
Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,
More informationLecture 5 : Projections
Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationMath 113 Homework 5. Bowei Liu, Chao Li. Fall 2013
Math 113 Homework 5 Bowei Liu, Chao Li Fall 2013 This homework is due Thursday November 7th at the start of class. Remember to write clearly, and justify your solutions. Please make sure to put your name
More informationWorst-Case Complexity Guarantees and Nonconvex Smooth Optimization
Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationStochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017
Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!
More informationDeep Linear Networks with Arbitrary Loss: All Local Minima Are Global
homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via nonconvex optimization Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationOptimization for neural networks
0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make
More informationarxiv: v2 [math.oc] 1 Nov 2017
Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of
More informationAlgorithms for Learning Good Step Sizes
1 Algorithms for Learning Good Step Sizes Brian Zhang (bhz) and Manikant Tiwari (manikant) with the guidance of Prof. Tim Roughgarden I. MOTIVATION AND PREVIOUS WORK Many common algorithms in machine learning,
More informationStochastic Cubic Regularization for Fast Nonconvex Optimization
Stochastic Cubic Regularization for Fast Nonconvex Optimization Nilesh Tripuraneni Mitchell Stern Chi Jin Jeffrey Regier Michael I. Jordan {nilesh tripuraneni,mitchell,chijin,regier}@berkeley.edu jordan@cs.berkeley.edu
More informationNotes for CS542G (Iterative Solvers for Linear Systems)
Notes for CS542G (Iterative Solvers for Linear Systems) Robert Bridson November 20, 2007 1 The Basics We re now looking at efficient ways to solve the linear system of equations Ax = b where in this course,
More informationTrust Regions. Charles J. Geyer. March 27, 2013
Trust Regions Charles J. Geyer March 27, 2013 1 Trust Region Theory We follow Nocedal and Wright (1999, Chapter 4), using their notation. Fletcher (1987, Section 5.1) discusses the same algorithm, but
More informationLecture 6 Optimization for Deep Neural Networks
Lecture 6 Optimization for Deep Neural Networks CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago April 12, 2017 Things we will look at today Stochastic Gradient Descent Things
More information8 Numerical methods for unconstrained problems
8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57
More information5 Handling Constraints
5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest
More informationComplexity analysis of second-order algorithms based on line search for smooth nonconvex optimization
Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Clément Royer - University of Wisconsin-Madison Joint work with Stephen J. Wright MOPTA, Bethlehem,
More informationSolution of the 8 th Homework
Solution of the 8 th Homework Sangchul Lee December 8, 2014 1 Preinary 1.1 A simple remark on continuity The following is a very simple and trivial observation. But still this saves a lot of words in actual
More informationSeptember Math Course: First Order Derivative
September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationSVRG Escapes Saddle Points
DUKE UNIVERSITY SVRG Escapes Saddle Points by Weiyao Wang A thesis submitted to in partial fulfillment of the requirements for graduating with distinction in the Department of Computer Science degree of
More informationMATH SOLUTIONS TO PRACTICE MIDTERM LECTURE 1, SUMMER Given vector spaces V and W, V W is the vector space given by
MATH 110 - SOLUTIONS TO PRACTICE MIDTERM LECTURE 1, SUMMER 2009 GSI: SANTIAGO CAÑEZ 1. Given vector spaces V and W, V W is the vector space given by V W = {(v, w) v V and w W }, with addition and scalar
More informationIPAM Summer School Optimization methods for machine learning. Jorge Nocedal
IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep
More informationAn Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees
An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute
More informationConnectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).
Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.
More informationBare-bones outline of eigenvalue theory and the Jordan canonical form
Bare-bones outline of eigenvalue theory and the Jordan canonical form April 3, 2007 N.B.: You should also consult the text/class notes for worked examples. Let F be a field, let V be a finite-dimensional
More informationConvex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization
Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f
More informationIntroduction to gradient descent
6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our
More informationNonlinear Programming
Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week
More informationChapter 8 Gradient Methods
Chapter 8 Gradient Methods An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Introduction Recall that a level set of a function is the set of points satisfying for some constant. Thus, a point
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More informationStochastic Gradient Descent with Variance Reduction
Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationDS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.
DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1
More informationUNCONSTRAINED OPTIMIZATION PAUL SCHRIMPF OCTOBER 24, 2013
PAUL SCHRIMPF OCTOBER 24, 213 UNIVERSITY OF BRITISH COLUMBIA ECONOMICS 26 Today s lecture is about unconstrained optimization. If you re following along in the syllabus, you ll notice that we ve skipped
More informationEAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science
EAD 115 Numerical Solution of Engineering and Scientific Problems David M. Rocke Department of Applied Science Multidimensional Unconstrained Optimization Suppose we have a function f() of more than one
More information08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms
(February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops
More informationSymmetric Matrices and Eigendecomposition
Symmetric Matrices and Eigendecomposition Robert M. Freund January, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 Symmetric Matrices and Convexity of Quadratic Functions
More informationThe Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems
The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems Weinan E 1 and Bing Yu 2 arxiv:1710.00211v1 [cs.lg] 30 Sep 2017 1 The Beijing Institute of Big Data Research,
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationThe Skorokhod reflection problem for functions with discontinuities (contractive case)
The Skorokhod reflection problem for functions with discontinuities (contractive case) TAKIS KONSTANTOPOULOS Univ. of Texas at Austin Revised March 1999 Abstract Basic properties of the Skorokhod reflection
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationTHE INVERSE FUNCTION THEOREM
THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)
More informationUnconstrained Optimization
1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015 2 / 36 3 / 36 4 / 36 5 / 36 1. preliminaries 1.1 local approximation
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More information5 + 9(10) + 3(100) + 0(1000) + 2(10000) =
Chapter 5 Analyzing Algorithms So far we have been proving statements about databases, mathematics and arithmetic, or sequences of numbers. Though these types of statements are common in computer science,
More informationJim Lambers MAT 610 Summer Session Lecture 2 Notes
Jim Lambers MAT 610 Summer Session 2009-10 Lecture 2 Notes These notes correspond to Sections 2.2-2.4 in the text. Vector Norms Given vectors x and y of length one, which are simply scalars x and y, the
More information