arXiv: v1 [cs.LG] 17 Nov 2017


Neon2: Finding Local Minima via First-Order Oracles (version 1)

Zeyuan Allen-Zhu (zeyuan@csail.mit.edu), Microsoft Research
Yuanzhi Li (yuanzhil@cs.princeton.edu), Princeton University

November 17, 2017

Abstract

We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into local-minimum finding algorithms outperforming some best known results.

1 Introduction

Nonconvex optimization has become increasingly popular due to its ability to capture modern machine learning tasks at large scale. Most notably, training deep neural networks corresponds to minimizing a function $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ over $x \in \mathbb{R}^d$ that is non-convex, where each training sample $i$ corresponds to one loss function $f_i(\cdot)$ in the summation. This average structure allows one to perform stochastic gradient descent (SGD), which uses a random $\nabla f_i(x)$ (corresponding to computing backpropagation once) to approximate $\nabla f(x)$ and performs descent updates.

Motivated by such large-scale machine learning applications, we wish to design faster first-order non-convex optimization methods that outperform gradient descent, both in the online and offline settings. In this paper, we say an algorithm is online if its complexity is independent of $n$ (so $n$ can be infinite), and offline otherwise. In recent years, researchers across different communities have gathered together to tackle this challenging question. By far, known theoretical approaches mostly fall into one of the following two categories.

First-order methods for stationary points.
In analyzing first-order methods, we denote by gradient complexity $T$ the number of computations of $\nabla f_i(x)$. To achieve an $\varepsilon$-approximate stationary point (namely, a point $x$ with $\|\nabla f(x)\| \le \varepsilon$) it is folklore that gradient descent (GD) is offline and needs $T \propto O(n/\varepsilon^2)$, while stochastic gradient descent (SGD) is online and needs $T \propto O(1/\varepsilon^4)$. In recent years, the offline complexity has been improved to $T \propto O(n^{2/3}/\varepsilon^2)$ by the SVRG method [3, 3], and the online complexity has been improved to $T \propto O(1/\varepsilon^{10/3})$ by the SCSG method [8]. Both of them rely on the so-called variance-reduction technique, originally discovered for convex problems [, 6, 4, 6].

These algorithms SVRG and SCSG are only capable of finding stationary points, which may not necessarily be approximate local minima and are arguably bad solutions for neural-network training [9, 0, 4]. Therefore, can we turn stationary-point finding algorithms into local-minimum finding ones?

Hessian-vector methods for local minima. It is common knowledge that using information about the Hessian, one can find $\varepsilon$-approximate local minima, namely, a point $x$ with $\|\nabla f(x)\| \le \varepsilon$ and also $\nabla^2 f(x) \succeq -\varepsilon^{1/C} I$ (see Footnote 1). In 2006, Nesterov and Polyak [0] showed that one can find an $\varepsilon$-approximate local minimum in $O(1/\varepsilon^{1.5})$ iterations, but each iteration requires an (offline) computation as heavy as inverting the matrix $\nabla^2 f(x)$.

To fix this issue, researchers propose to study the so-called Hessian-free methods that, in addition to gradient computations, also compute Hessian-vector products. That is, instead of using the full matrix $\nabla^2 f_i(x)$ or $\nabla^2 f(x)$, these methods also compute $\nabla^2 f_i(x) \cdot v$ for indices $i$ and vectors $v$ (see Footnote 2). For Hessian-free methods, we denote by gradient complexity $T$ the number of computations of $\nabla f_i(x)$ plus that of $\nabla^2 f_i(x) \cdot v$. The hope of using Hessian-vector products is to improve the complexity $T$ as a function of $\varepsilon$.

Such improvement was first shown possible independently by [, 7] for the offline setting, with complexity $T \propto \big(\frac{n}{\varepsilon^{1.5}} + \frac{n^{3/4}}{\varepsilon^{1.75}}\big)$, which is better than that of gradient descent. In the online setting, the first improvement was by Natasha2, which gives complexity $T \propto \frac{1}{\varepsilon^{3.25}}$ [].

Unfortunately, it is argued by some researchers that Hessian-vector products are not general enough and may not be as simple to implement as evaluating gradients [8]. Therefore, can we turn Hessian-free methods into first-order ones, without hurting their performance?

Footnote (on the title): The result of this paper was briefly discussed at a Berkeley Simons workshop on Oct 6 and internally presented at Microsoft on Oct 30. We started to prepare this manuscript on Nov, after being informed of the independent and similar work of Xu and Yang [8]. Their result appeared on arXiv on Nov 3. To respect the fact that their work appeared online before ours, we have adopted their algorithm name Neon and called our new algorithm Neon2. We encourage readers citing this work to also cite [8].
1.1 From Hessian-Vector Products to First-Order Methods

Recall that by definition of derivative we have $\nabla^2 f_i(x) \cdot v = \lim_{q \to 0} \big\{ \frac{\nabla f_i(x + qv) - \nabla f_i(x)}{q} \big\}$. Given any Hessian-free method, at least at a high level, can we replace every occurrence of $\nabla^2 f_i(x) \cdot v$ with $w = \frac{\nabla f_i(x + qv) - \nabla f_i(x)}{q}$ for some small $q > 0$? Note the error introduced in this approximation is $\|\nabla^2 f_i(x) \cdot v - w\| \propto q \|v\|^2$. Therefore, as long as the original algorithm is sufficiently stable to adversarial noise, and as long as $q$ is small enough, this can convert Hessian-free algorithms into first-order ones.

In this paper, we demonstrate this idea by converting negative-curvature-search (NC-search) subroutines into first-order processes. NC-search is a key subroutine used in state-of-the-art Hessian-free methods that have rigorous proofs (see [,, 7]). It solves the following simple task:

negative-curvature search (NC-search): given point $x_0$, decide if $\nabla^2 f(x_0) \succeq -\delta I$ or find a unit vector $v$ such that $v^\top \nabla^2 f(x_0) v \le -\frac{\delta}{2}$.

Online Setting. In the online setting, NC-search can be solved by Oja's algorithm [], which costs $\tilde{O}(1/\delta^2)$ computations of Hessian-vector products (first proved in [6] and applied to NC-search in []).

In this paper, we propose a method Neon2^online which solves the NC-search problem via only stochastic first-order updates. That is, starting from $x_1 = x_0 + \xi$ where $\xi$ is some random perturbation, we keep updating $x_{t+1} = x_t - \eta(\nabla f_i(x_t) - \nabla f_i(x_0))$. In the end, the vector $x_T - x_0$ gives us enough information about the negative curvature.

Theorem 1 (informal). Our Neon2^online algorithm solves NC-search using $\tilde{O}(1/\delta^2)$ stochastic gradients, without Hessian-vector product computations.

This complexity $\tilde{O}(1/\delta^2)$ matches that of Oja's algorithm, and is information-theoretically optimal (up to log factors); see the lower bound in [6].

We emphasize that the independent work Neon by Xu and Yang [8] is actually the first recorded theoretical result that proposed this approach. However, Neon needs $\tilde{O}(1/\delta^3)$ stochastic gradients, because it uses full gradient descent to find NC (on a sub-sampled objective), inspired by [5] and the power method; instead, Neon2^online uses stochastic gradients and is based on our prior work on Oja's algorithm [6].

By plugging Neon2^online into Natasha2 [], we achieve the following corollary (see Figure 1(c)):

Theorem 2 (informal). Neon2^online turns Natasha2 into a stochastic first-order method, without hurting its performance. That is, it finds an $(\varepsilon, \delta)$-approximate local minimum in $T = \tilde{O}\big(\frac{1}{\varepsilon^{3.25}} + \frac{1}{\varepsilon^{3}\delta} + \frac{1}{\delta^{5}}\big)$ stochastic gradient computations, without Hessian-vector product computations. (We say $x$ is an approximate local minimum if $\|\nabla f(x)\| \le \varepsilon$ and $\nabla^2 f(x) \succeq -\delta I$.)

Offline Setting. There are a number of ways to solve the NC-search problem in the offline setting using Hessian-vector products. Most notably, the power method uses $\tilde{O}(n/\delta)$ computations of Hessian-vector products, the Lanczos method [7] uses $\tilde{O}(n/\sqrt{\delta})$ computations, and shift-and-invert [] on top of SVRG [6] (that we call SI+SVRG) uses $\tilde{O}(n + n^{3/4}/\sqrt{\delta})$ computations.

Footnote 1: We say $A \succeq -\delta I$ if all the eigenvalues of $A$ are no smaller than $-\delta$. In this high-level introduction, we focus only on the case when $\delta = \varepsilon^{1/C}$ for some constant $C$.

Footnote 2: Hessian-free methods are useful because when $f_i(\cdot)$ is explicitly given, computing its gradient is in the same complexity as computing its Hessian-vector product [, 5], using backpropagation.
In this paper, we convert the Lanczos method and SI+SVRG into first-order ones:

Theorem 3 (informal). Our Neon2^det algorithm solves NC-search using $\tilde{O}(1/\sqrt{\delta})$ full gradients (or equivalently $\tilde{O}(n/\sqrt{\delta})$ stochastic gradients), and our Neon2^svrg solves NC-search using $\tilde{O}(n + n^{3/4}/\sqrt{\delta})$ stochastic gradients.

We emphasize that, although analyzed in the online setting only, the work Neon by Xu and Yang [8] also applies to the offline setting, and seems to be the first result to solve NC-search using first-order gradients with a theoretical proof. However, Neon uses $\tilde{O}(1/\delta)$ full gradients instead of $\tilde{O}(1/\sqrt{\delta})$. Their approach is inspired by [5], but our Neon2^det is based on Chebyshev approximation theory (see textbook [7]) and its recent stability analysis [5].

By putting Neon2^det and Neon2^svrg into the CDHS method of Carmon et al. [7], we have (see Footnote 3):

Theorem 4 (informal). Neon2^det turns CDHS into a first-order method without hurting its performance: it finds an $(\varepsilon, \delta)$-approximate local minimum in $\tilde{O}\big(\frac{1}{\varepsilon^{1.75}} + \frac{1}{\delta^{3.5}}\big)$ full gradient computations. Neon2^svrg turns CDHS into a first-order method without hurting its performance: it finds an $(\varepsilon, \delta)$-approximate local minimum in $T = \tilde{O}\big(\frac{n}{\varepsilon^{1.5}} + \frac{n}{\delta^{3}} + \frac{n^{3/4}}{\varepsilon^{1.75}} + \frac{n^{3/4}}{\delta^{3.5}}\big)$ stochastic gradient computations.

Footnote 3: We note that the original paper of CDHS only proved such complexity results (although requiring Hessian-vector products) for the special case of $\delta \ge \varepsilon^{1/2}$. In such a case, it requires either $\tilde{O}\big(\frac{1}{\varepsilon^{1.75}}\big)$ full gradient computations or $\tilde{O}\big(\frac{n}{\varepsilon^{1.5}} + \frac{n^{3/4}}{\varepsilon^{1.75}}\big)$ stochastic gradient computations.
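The two first-order ideas of this subsection, replacing a Hessian-vector product by a difference of gradients and iterating $x_{t+1} = x_t - \eta(\nabla f_i(x_t) - \nabla f_i(x_0))$ from a small random perturbation, can be sketched on a toy problem. This is a minimal illustration and not the paper's Algorithm 2: the two-component objective, the step size `eta`, the radii `sigma` and `r`, and the function names are all assumptions chosen so that the behaviour is easy to check by hand.

```python
import math
import random

# Toy finite sum (an assumption for illustration):
#   f_i(x) = 0.5 * x^T A_i x + 0.25 * sum_k x_k^4,
# so grad f_i(x) = A_i x + x^3 (elementwise), and the Hessian of the average f
# at x0 = 0 is the average of the A_i, here diag(1.0, -0.5).
A = [
    [[2.0, 0.0], [0.0, -1.5]],
    [[0.0, 0.0], [0.0, 0.5]],
]


def grad_i(i, x):
    """Gradient of the i-th component at x."""
    return [sum(A[i][r][c] * x[c] for c in range(2)) + x[r] ** 3 for r in range(2)]


def hvp_finite_difference(i, x, v, q=1e-4):
    """Approximate (Hessian of f_i at x) times v by a difference of two gradients."""
    shifted = [x[k] + q * v[k] for k in range(2)]
    g1, g0 = grad_i(i, shifted), grad_i(i, x)
    return [(g1[k] - g0[k]) / q for k in range(2)]


def nc_search_sketch(x0, eta=0.1, sigma=1e-3, r=0.1, steps=2000, seed=0):
    """Iterate x <- x - eta * (grad f_i(x) - grad f_i(x0)) starting at x0 + xi.

    Returns a unit vector along x - x0 once the iterate leaves the ball of
    radius r around x0, or None if that never happens within `steps` steps.
    """
    rng = random.Random(seed)
    xi = [rng.gauss(0.0, 1.0) for _ in range(2)]
    xi_norm = math.sqrt(sum(c * c for c in xi))
    x = [x0[k] + sigma * xi[k] / xi_norm for k in range(2)]
    for _ in range(steps):
        i = rng.randrange(len(A))          # one random component per step
        gt, g0 = grad_i(i, x), grad_i(i, x0)
        x = [x[k] - eta * (gt[k] - g0[k]) for k in range(2)]
        diff = [x[k] - x0[k] for k in range(2)]
        dist = math.sqrt(sum(c * c for c in diff))
        if dist >= r:
            return [c / dist for c in diff]
    return None


v = nc_search_sketch([0.0, 0.0])
w = hvp_finite_difference(0, [0.3, -0.2], [1.0, 1.0])
```

In this toy setting the returned direction `v` lines up with the second coordinate, where the averaged Hessian has eigenvalue $-0.5$, and `w` agrees with the exact Hessian-vector product of the first component up to an error proportional to `q`.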

[Figure 1: Neon vs Neon2 for finding $(\varepsilon, \delta)$-approximate local minima. Panels: (a) Neon+SGD vs Neon2+SGD, (b) Neon+SCSG vs Neon2+SCSG, (c) Neon+Natasha2 vs Neon2+Natasha2. Each panel plots the gradient complexity $T$ against $\delta$.] We emphasize that Neon2 and Neon are based on the same high-level idea, but Neon is arguably the first-recorded result to turn stationary-point finding algorithms (such as SGD, SCSG) into local-minimum finding ones, with theoretical proofs.

One should perhaps compare Neon2^det to the interesting work "convex until guilty" by Carmon et al. [8]. Their method finds $\varepsilon$-approximate stationary points using $\tilde{O}(1/\varepsilon^{1.75})$ full gradients, and is arguably the first first-order method achieving a convergence rate better than the $1/\varepsilon^2$ of GD. Unfortunately, it is unclear if their method guarantees local minima. In comparison, Neon2^det on CDHS achieves the same complexity but guarantees its output to be an approximate local minimum.

Remark 1.1. All the cited works in this sub-section require the objective to have (1) Lipschitz-continuous Hessian (a.k.a. second-order smoothness) and (2) Lipschitz-continuous gradient (a.k.a. Lipschitz smoothness). One can argue that (1) and (2) are both necessary for finding approximate local minima, but if only finding approximate stationary points, then only (2) is necessary. We shall formally discuss our assumptions in Section 2.

1.2 From Stationary Points to Local Minima

Given any (first-order) algorithm that finds only stationary points (such as GD, SGD, or SCSG [8]), we can hope to use the NC-search routine to identify whether or not its output $x$ satisfies $\nabla^2 f(x) \succeq -\delta I$. If so, then automatically $x$ becomes an $(\varepsilon, \delta)$-approximate local minimum, so we can terminate. If not, then we can go in its negative curvature direction to further decrease the objective.
In the independent work of Xu and Yang [8], they proposed to apply their Neon method for NC-search, and thus turned SGD and SCSG into first-order methods finding approximate local minima. In this paper, we use Neon2 instead. We show the following theorem:

Theorem 5 (informal). To find an $(\varepsilon, \delta)$-approximate local minimum,
(a) Neon2+SGD needs $T = \tilde{O}\big(\frac{1}{\varepsilon^{4}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{5}}\big)$ stochastic gradients;
(b) Neon2+SCSG needs $T = \tilde{O}\big(\frac{1}{\varepsilon^{10/3}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{5}}\big)$ stochastic gradients;
(c) Neon2+GD needs $T = \tilde{O}\big(\frac{n}{\varepsilon^{2}} + \frac{n}{\delta^{3.5}}\big)$ (so $\tilde{O}\big(\frac{1}{\varepsilon^{2}} + \frac{1}{\delta^{3.5}}\big)$ full gradients); and
(d) Neon2+SVRG needs $T = \tilde{O}\big(\frac{n^{2/3}}{\varepsilon^{2}} + \frac{n}{\delta^{3}} + \frac{n^{5/12}}{\varepsilon^{2}\delta^{1/2}} + \frac{n^{3/4}}{\delta^{3.5}}\big)$ stochastic gradients.

We make several comments as follows.

(a) We compare Neon2+SGD to Ge et al. [3], where the authors showed SGD plus perturbation needs $T = \tilde{O}(\mathrm{poly}(d)/\varepsilon^{4})$ stochastic gradients to find $(\varepsilon, \varepsilon^{1/4})$-approximate local minima. This is perhaps the first time that a theoretical guarantee for finding local minima is given using first-order oracles.

To some extent, Theorem 5a is superior because we have (1) removed the $\mathrm{poly}(d)$ factor (see Footnote 4), (2) achieved $T = \tilde{O}(1/\varepsilon^{4})$ as long as $\delta \ge \varepsilon^{2/3}$, and (3) a much simpler analysis. We also remark that, if using Neon instead of Neon2, one achieves a slightly worse complexity $T = \tilde{O}\big(\frac{1}{\varepsilon^{4}} + \frac{1}{\delta^{7}}\big)$ (see Footnote 5); see Figure 1(a) for a comparison.

(b) Neon2+SCSG turns SCSG into a local-minimum finding algorithm. Again, if using Neon instead of Neon2, one gets a slightly worse complexity $T = \tilde{O}\big(\frac{1}{\varepsilon^{10/3}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{6}}\big)$; see Figure 1(b).

(c) We compare Neon2+GD to Jin et al. [5], where the authors showed GD plus perturbation needs $\tilde{O}(1/\varepsilon^{2})$ full gradients to find $(\varepsilon, \varepsilon^{1/2})$-approximate local minima. This is perhaps the first time that one can convert a stationary-point finding method (namely GD) into a local-minimum finding one, without hurting its performance. To some extent, Theorem 5c is better because we use $\tilde{O}(1/\varepsilon^{2})$ full gradients as long as $\delta \ge \varepsilon^{4/7}$.

(d) Our result for Neon2+SVRG does not seem to be recorded anywhere, even if Hessian-vector product computations are allowed.

Limitation. We note that there is a limitation in using Neon2 (or Neon) to turn an algorithm finding stationary points into one finding local minima. Namely, given any algorithm $\mathcal{A}$, if the gradient complexity for $\mathcal{A}$ to find an $\varepsilon$-approximate stationary point is $T$, then after this conversion, it finds $(\varepsilon, \delta)$-approximate local minima in a gradient complexity that is at least $T$. This is because the new algorithm, after combining Neon2 and $\mathcal{A}$, alternates between finding stationary points (using $\mathcal{A}$) and escaping from saddle points (using Neon2). Therefore, it must pay at least complexity $T$. In contrast, methods such as Natasha2 swing by saddle points instead of going to saddle points and then escaping. This has enabled it to achieve a smaller complexity $T = O(\varepsilon^{-3.25})$ for $\delta \ge \varepsilon^{1/4}$.

2 Preliminaries

Throughout this paper, we denote by $\|\cdot\|$ the Euclidean norm. We use $i \in_R [n]$ to denote that $i$ is generated from $[n] = \{1, 2, \dots, n\}$ uniformly at random.
We denote by $\mathbb{I}[\text{event}]$ the indicator function of probabilistic events. We denote by $\|A\|_2$ the spectral norm of matrix $A$. For symmetric matrices $A$ and $B$, we write $A \succeq B$ to indicate that $A - B$ is positive semidefinite (PSD). Therefore, $A \succeq -\sigma I$ if and only if all eigenvalues of $A$ are no less than $-\sigma$. We denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the minimum and maximum eigenvalue of a symmetric matrix $A$.

Recall some definitions on smoothness (for other equivalent definitions, see textbook [9]).

Definition 2.1. For a function $f : \mathbb{R}^d \to \mathbb{R}$,
- $f$ is $L$-Lipschitz smooth (or $L$-smooth for short) if $\forall x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$.
- $f$ is second-order $L_2$-Lipschitz smooth (or $L_2$-second-order smooth for short) if $\forall x, y \in \mathbb{R}^d$, $\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L_2 \|x - y\|$.

The following fact says the variance of a random variable decreases by a factor $m$ if we choose $m$ independent copies and average them. It is trivial to prove, see for instance [8].

Footnote 4: We are aware that the original authors of [3] have a different proof to remove its $\mathrm{poly}(d)$ factor, but have not found it online at this moment.

Footnote 5: Their complexity might be improvable to $\tilde{O}\big(\frac{1}{\varepsilon^{4}} + \frac{1}{\delta^{6}}\big)$ with a slight change of the algorithm, but not beyond.
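The variance fact referred to above can be checked numerically for small instances. The sketch below assumes the standard sampling-without-replacement form of the statement: for vectors $v_1, \dots, v_n$ summing to zero and a uniformly random subset $S$ of fixed size $m$, $\mathbb{E}\big\|\frac{1}{m}\sum_{i \in S} v_i\big\|^2 = \frac{n-m}{(n-1)m} \cdot \frac{1}{n}\sum_{i} \|v_i\|^2$. The data and the function names are hypothetical, and the expectation is computed exactly by enumerating all subsets.

```python
from itertools import combinations


def exact_mean_sq_norm(vectors, m):
    """Exact E_S || (1/m) * sum_{i in S} v_i ||^2 over all size-m subsets S."""
    n, d = len(vectors), len(vectors[0])
    total, count = 0.0, 0
    for subset in combinations(range(n), m):
        avg = [sum(vectors[i][k] for i in subset) / m for k in range(d)]
        total += sum(c * c for c in avg)
        count += 1
    return total / count


def closed_form(vectors, m):
    """(n - m) / ((n - 1) * m) times the average squared norm (needs n >= 2)."""
    n = len(vectors)
    mean_sq = sum(sum(c * c for c in v) for v in vectors) / n
    return (n - m) / ((n - 1) * m) * mean_sq


# Hypothetical data: four vectors in R^2 whose sum is zero.
vs = [[1.0, 2.0], [-3.0, 0.5], [0.5, -1.5], [1.5, -1.0]]
results = {m: (exact_mean_sq_norm(vs, m), closed_form(vs, m)) for m in range(1, 5)}
```

For every subset size the enumerated value and the closed form coincide, and both are at most $\frac{1}{m}$ times the average squared norm, with equality to zero when $S = [n]$.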

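| goal | algorithm | gradient complexity $T$ | Hessian-vector products | variance bound | Lip. smooth | 2nd-order smooth |
|---|---|---|---|---|---|---|
| stationary | SGD (folklore) | $O\big(\frac{1}{\varepsilon^{4}}\big)$ | no | needed | needed | no |
| local minima | perturbed SGD [3] | $\tilde{O}\big(\frac{\mathrm{poly}(d)}{\varepsilon^{4}}\big)$ (only for $\delta \ge \varepsilon^{1/4}$) | no | needed | needed | needed |
| local minima | Neon+SGD [8] | $\tilde{O}\big(\frac{1}{\varepsilon^{4}} + \frac{1}{\delta^{7}}\big)$ | no | needed | needed | needed |
| local minima | Neon2+SGD | $\tilde{O}\big(\frac{1}{\varepsilon^{4}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{5}}\big)$ | no | needed | needed | needed |
| stationary | SCSG [8] | $O\big(\frac{1}{\varepsilon^{10/3}}\big)$ | no | needed | needed | no |
| local minima | Neon+SCSG [8] | $\tilde{O}\big(\frac{1}{\varepsilon^{10/3}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{6}}\big)$ | no | needed | needed | needed |
| local minima | Neon2+SCSG | $\tilde{O}\big(\frac{1}{\varepsilon^{10/3}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{5}}\big)$ | no | needed | needed | needed |
| local minima | Natasha2 [] | $\tilde{O}\big(\frac{1}{\varepsilon^{3.25}} + \frac{1}{\varepsilon^{3}\delta} + \frac{1}{\delta^{5}}\big)$ | needed | needed | needed | needed |
| local minima | Neon+Natasha2 [8] | $\tilde{O}\big(\frac{1}{\varepsilon^{3.25}} + \frac{1}{\varepsilon^{3}\delta} + \frac{1}{\delta^{6}}\big)$ | no | needed | needed | needed |
| local minima | Neon2+Natasha2 | $\tilde{O}\big(\frac{1}{\varepsilon^{3.25}} + \frac{1}{\varepsilon^{3}\delta} + \frac{1}{\delta^{5}}\big)$ | no | needed | needed | needed |
| stationary | GD (folklore) | $O\big(\frac{n}{\varepsilon^{2}}\big)$ | no | no | needed | no |
| local minima | perturbed GD [5] | $\tilde{O}\big(\frac{n}{\varepsilon^{2}}\big)$ (only for $\delta \ge \varepsilon^{1/2}$) | no | no | needed | needed |
| local minima | Neon2+GD | $\tilde{O}\big(\frac{n}{\varepsilon^{2}} + \frac{n}{\delta^{3.5}}\big)$ | no | no | needed | needed |
| stationary | SVRG [3, 3] | $O\big(\frac{n^{2/3}}{\varepsilon^{2}} + n\big)$ | no | no | needed | no |
| local minima | Neon2+SVRG | $\tilde{O}\big(\frac{n^{2/3}}{\varepsilon^{2}} + \frac{n}{\delta^{3}} + \frac{n^{5/12}}{\varepsilon^{2}\delta^{1/2}} + \frac{n^{3/4}}{\delta^{3.5}}\big)$ | no | no | needed | needed |
| stationary | "guilty" [8] | $\tilde{O}\big(\frac{n}{\varepsilon^{1.75}}\big)$ | no | no | needed | needed |
| local minima | FastCubic [] | $\tilde{O}\big(\frac{n}{\varepsilon^{1.5}} + \frac{n^{3/4}}{\varepsilon^{1.75}}\big)$ (only for $\delta \ge \varepsilon^{1/2}$) | needed | no | needed | needed |
| local minima | CDHS [7] | $\tilde{O}\big(\frac{n}{\varepsilon^{1.5}} + \frac{n}{\delta^{3}} + \frac{n^{3/4}}{\varepsilon^{1.75}} + \frac{n^{3/4}}{\delta^{3.5}}\big)$ | needed | no | needed | needed |
| local minima | Neon2+CDHS | $\tilde{O}\big(\frac{n}{\varepsilon^{1.5}} + \frac{n}{\delta^{3}} + \frac{n^{3/4}}{\varepsilon^{1.75}} + \frac{n^{3/4}}{\delta^{3.5}}\big)$ | no | no | needed | needed |

Table 1: Complexity for finding $\|\nabla f(x)\| \le \varepsilon$ and $\nabla^2 f(x) \succeq -\delta I$. Following tradition, in these complexity bounds, we assume variance and smoothness parameters as constants, and only show the dependency on $n, d, \varepsilon$.
Remark 1. A variance bound is needed for online methods.
Remark 2. Lipschitz smoothness is needed for finding approximate stationary points.
Remark 3. Second-order Lipschitz smoothness is needed for finding approximate local minima.

Fact 2.2. If $v_1, \dots, v_n \in \mathbb{R}^d$ satisfy $\sum_{i=1}^{n} v_i = \vec{0}$, and $S$ is a non-empty, uniform random subset of $[n]$, then

$$\mathbb{E}\Big[\Big\|\frac{1}{|S|}\sum_{i \in S} v_i\Big\|^2\Big] = \frac{n - |S|}{(n-1)|S|} \cdot \frac{1}{n}\sum_{i \in [n]} \|v_i\|^2 \le \frac{\mathbb{I}[|S| < n]}{|S|} \cdot \frac{1}{n}\sum_{i \in [n]} \|v_i\|^2 .$$

2.1 Problem and Assumptions

Throughout the paper we study the following minimization problem

$$\min_{x \in \mathbb{R}^d} \Big\{ f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) \Big\} \qquad (2.1)$$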

where both $f(\cdot)$ and each $f_i(\cdot)$ can be nonconvex. We wish to find $(\varepsilon, \delta)$-local minima, which are points $x$ satisfying

$$\|\nabla f(x)\| \le \varepsilon \quad \text{and} \quad \nabla^2 f(x) \succeq -\delta I .$$

We need the following three assumptions:
- Each $f_i(x)$ is $L$-Lipschitz smooth.
- Each $f_i(x)$ is second-order $L_2$-Lipschitz smooth. (In fact, the gradient complexity of Neon2 in this paper only depends polynomially on the second-order smoothness of $f(x)$ (rather than $f_i(x)$), and the time complexity depends logarithmically on the second-order smoothness of $f_i(x)$. To make notations simple, we decide to simply assume each $f_i(x)$ is $L_2$-second-order smooth.)
- Stochastic gradients have bounded variance: $\forall x \in \mathbb{R}^d : \mathbb{E}_{i \in_R [n]} \|\nabla f(x) - \nabla f_i(x)\|^2 \le V$. (This assumption is needed only for online algorithms.)

3 Neon2 in the Online Setting

We propose Neon2^online formally in Algorithm 1. It repeatedly invokes Neon2^online_weak in Algorithm 2, whose goal is to solve the NC-search problem with confidence $2/3$ only; then Neon2^online invokes Neon2^online_weak repeatedly for $\log(1/p)$ times to boost the confidence to $1 - p$.

Algorithm 1 Neon2^online$(f, x_0, \delta, p)$
Input: Function $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, vector $x_0$, negative curvature $\delta > 0$, confidence $p \in (0, 1]$.
 1: for $j = 1, 2, \dots, \Theta(\log 1/p)$ do    // boost the confidence
 2:   $v_j \leftarrow$ Neon2^online_weak$(f, x_0, \delta, p)$;
 3:   if $v_j \ne \perp$ then
 4:     $m \leftarrow \Theta\big(\frac{L^2 \log 1/p}{\delta^2}\big)$, $v' \leftarrow \Theta\big(\frac{\delta}{L_2}\big) v_j$.
 5:     Draw $i_1, \dots, i_m \in_R [n]$.
 6:     $z_j = \frac{1}{m \|v'\|_2^2} \sum_{j=1}^{m} (v')^\top \big(\nabla f_{i_j}(x_0 + v') - \nabla f_{i_j}(x_0)\big)$
 7:     if $z_j \le -3\delta/4$ return $v = v_j$
 8:   end if
 9: end for
10: return $v = \perp$.

Algorithm 2 Neon2^online_weak$(f, x_0, \delta, p)$
 1: $\eta \leftarrow \frac{\delta}{C_0^2 L^2 \log(d/p)}$, $T \leftarrow \frac{C_0^2 \log(d/p)}{\eta \delta}$, for sufficiently large constant $C_0$
 2: $\xi \leftarrow$ Gaussian random vector with norm $\sigma$.    // σ = η(d/p) C 0 δ L L 3
 3: $x_1 \leftarrow x_0 + \xi$.
 4: for $t \leftarrow 1$ to $T$ do
 5:   $x_{t+1} \leftarrow x_t - \eta\big(\nabla f_i(x_t) - \nabla f_i(x_0)\big)$ where $i \in_R [n]$.
 6:   if $\|x_{t+1} - x_0\|_2 \ge r$ then return $v = \frac{x_{t+1} - x_0}{\|x_{t+1} - x_0\|_2}$    // $r = (d/p)^{C_0} \sigma$
 7: end for
 8: return $v = \perp$;

We prove the following theorem:

Theorem 1 (Neon2^online). Let $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ where each $f_i$ is $L$-smooth and $L_2$-second-order smooth. For every point $x_0 \in \mathbb{R}^d$, every $\delta > 0$, every $p \in (0, 1]$, the output $v =$ Neon2^online$(f, x_0, \delta, p)$ satisfies that, with probability at least $1 - p$:
1. If $v = \perp$, then $\nabla^2 f(x_0) \succeq -\delta I$.
2. If $v \ne \perp$, then $\|v\|_2 = 1$ and $v^\top \nabla^2 f(x_0) v \le -\frac{\delta}{2}$.
Moreover, the total number of stochastic gradient evaluations is $O\big(\frac{\log^2(d/p) L^2}{\delta^2}\big)$.

The proof of Theorem 1 immediately follows from Lemma 3.1 and Lemma 3.2 below.

Lemma 3.1 (Neon2^online_weak). In the same setting as Theorem 1, the output $v =$ Neon2^online_weak$(f, x_0, \delta, p)$ satisfies: if $\lambda_{\min}(\nabla^2 f(x_0)) \le -\delta$, then with probability at least $2/3$, $v \ne \perp$ and $v^\top \nabla^2 f(x_0) v \le -\frac{51}{100}\delta$.

Proof sketch of Lemma 3.1. We explain why Neon2^online_weak works as follows. Starting from a randomly perturbed point $x_1 = x_0 + \xi$, it keeps updating $x_{t+1} \leftarrow x_t - \eta(\nabla f_i(x_t) - \nabla f_i(x_0))$ for some random index $i \in [n]$, and stops either when $T$ iterations are reached, or when $\|x_{t+1} - x_0\|_2 > r$. Therefore, we have $\|x_t - x_0\|_2 \le r$ throughout the iterations, and thus can approximate $\nabla^2 f_i(x_0)(x_t - x_0)$ using $\nabla f_i(x_t) - \nabla f_i(x_0)$, up to error $O(r^2)$. This is a small term based on our choice of $r$.

Ignoring the error term, our updates look like $x_{t+1} - x_0 = \big(I - \eta \nabla^2 f_i(x_0)\big)(x_t - x_0)$. This is exactly the same as Oja's algorithm [], which is known to approximately compute the minimum eigenvector of $\nabla^2 f(x_0) = \frac{1}{n}\sum_{i=1}^{n} \nabla^2 f_i(x_0)$. Using the recent optimal convergence analysis of Oja's algorithm [6], one can conclude that after $T_1 = \Theta\big(\frac{\log(r/\sigma)}{\eta \lambda}\big)$ iterations, where $\lambda = \max\{0, -\lambda_{\min}(\nabla^2 f(x_0))\}$, we not only have that $\|x_{t+1} - x_0\|_2$ is blown up, but also that it aligns well with the minimum eigenvector of $\nabla^2 f(x_0)$. In other words, if $\lambda \ge \delta$, then the algorithm must stop before $T$.

Finally, one has to carefully argue that the error does not blow up in this iterative process. We defer the proof details to Appendix A.

Our Lemma 3.2 below tells us we can verify if the output $v$ of Neon2^online_weak is indeed correct (up to additive $\frac{\delta}{4}$), so we can boost the success probability to $1 - p$.

Lemma 3.2 (verification). In the same setting as Theorem 1, let vectors $x, v \in \mathbb{R}^d$. If $i_1, \dots, i_m \in_R [n]$ and define

$$z = \frac{1}{m}\sum_{j=1}^{m} v^\top \big(\nabla f_{i_j}(x + v) - \nabla f_{i_j}(x)\big) .$$

Then, if $\|v\| \le \frac{\delta}{8 L_2}$ and $m = \Theta\big(\frac{L^2 \log 1/p}{\delta^2}\big)$, with probability at least $1 - p$,

$$\Big| \frac{z}{\|v\|_2^2} - \frac{v^\top \nabla^2 f(x) v}{\|v\|_2^2} \Big| \le \frac{\delta}{4} .$$

4 Neon2 in the Deterministic Setting

We propose Neon2^det formally in Algorithm 3 and prove the following theorem:

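Algorithm 3 Neon2^det$(f, x_0, \delta, p)$
Input: A function $f$, vector $x_0$, negative curvature target $\delta > 0$, failure probability $p \in (0, 1]$.
 1: $T \leftarrow \frac{C_1^2 \log(d/p) \sqrt{L}}{\sqrt{\delta}}$.    // for sufficiently large constant $C_1$
 2: $\xi \leftarrow$ Gaussian random vector with norm $\sigma$;    // σ = (d/p) C δ T 3 L
 3: $x_1 \leftarrow x_0 + \xi$. $y_1 \leftarrow \xi$, $y_0 \leftarrow 0$
 4: for $t \leftarrow 1$ to $T$ do
 5:   $y_{t+1} = 2\mathcal{M}(y_t) - y_{t-1}$;    // $\mathcal{M}(y) := -\frac{1}{L}\big(\nabla f(x_0 + y) - \nabla f(x_0)\big) + \big(1 - \frac{3\delta}{4L}\big) y$
 6:   $x_{t+1} = x_0 + y_{t+1} - \mathcal{M}(y_t)$.
 7:   if $\|x_{t+1} - x_0\|_2 \ge r$ then return $\frac{x_{t+1} - x_0}{\|x_{t+1} - x_0\|_2}$.    // $r = (d/p)^{C_1} \sigma$
 8: end for
 9: return $\perp$.

Theorem 3 (Neon2^det). Let $f(x)$ be a function that is $L$-smooth and $L_2$-second-order smooth. For every point $x_0 \in \mathbb{R}^d$, every $\delta > 0$, every $p \in (0, 1]$, the output $v =$ Neon2^det$(f, x_0, \delta, p)$ satisfies that, with probability at least $1 - p$:
1. If $v = \perp$, then $\nabla^2 f(x_0) \succeq -\delta I$.
2. If $v \ne \perp$, then $\|v\|_2 = 1$ and $v^\top \nabla^2 f(x_0) v \le -\frac{\delta}{2}$.
Moreover, the total number of full gradient evaluations is $O\big(\frac{\log^2(d/p) \sqrt{L}}{\sqrt{\delta}}\big)$.

Proof sketch of Theorem 3. We explain the high-level intuition of Neon2^det and the proof of Theorem 3 as follows. Define $M = -\frac{1}{L}\nabla^2 f(x_0) + \big(1 - \frac{3\delta}{4L}\big) I$. We immediately notice that
- all eigenvalues of $\nabla^2 f(x_0)$ in $[-\frac{3\delta}{4}, L]$ are mapped to eigenvalues of $M$ in $[-1, 1]$, and
- any eigenvalue of $\nabla^2 f(x_0)$ smaller than $-\delta$ is mapped to an eigenvalue of $M$ greater than $1 + \frac{\delta}{4L}$.

Therefore, as long as $T \ge \tilde{\Omega}\big(\frac{L}{\delta}\big)$, if we compute $x_{T+1} = x_0 + M^T \xi$ for some random vector $\xi$, by the theory of the power method, $x_{T+1} - x_0$ must be a negative-curvature direction of $\nabla^2 f(x_0)$ with value $\le -\frac{\delta}{2}$. There are two issues with this approach.

The first issue is that the degree $T$ of this matrix polynomial $M^T$ can be reduced to $T = \tilde{\Omega}\big(\frac{\sqrt{L}}{\sqrt{\delta}}\big)$ if the so-called Chebyshev polynomial is used.

Claim 4.1. Let $T_t(x)$ be the $t$-th Chebyshev polynomial of the first kind, defined as [7]:

$$T_0(x) := 1, \quad T_1(x) := x, \quad T_{n+1}(x) := 2x \cdot T_n(x) - T_{n-1}(x) ;$$

then $T_t(x)$ satisfies:

$$T_t(x) \in \begin{cases} [-1, 1] & \text{if } x \in [-1, 1]; \\ \big[\tfrac{1}{2}\big(x + \sqrt{x^2 - 1}\big)^t, \ \big(x + \sqrt{x^2 - 1}\big)^t\big] & \text{if } x > 1. \end{cases}$$

Since $T_t(x)$ stays within $[-1, 1]$ when $x \in [-1, 1]$, and grows to $\approx (1 + \sqrt{x^2 - 1})^t$ for $x \ge 1$, we can use $T_T(M)$ in replacement of $M^T$.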
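The growth gap between $x^t$ and $T_t(x)$ stated in Claim 4.1 can be checked numerically with scalars. In the sketch below, `gamma` stands in for the eigenvalue gap $\delta/(4L)$; its value and the degree `t` are assumed purely for illustration, and the function name is hypothetical.

```python
import math


def chebyshev(t, x):
    """Evaluate T_t(x) with the recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)."""
    prev, cur = 1.0, x            # T_0(x) and T_1(x)
    if t == 0:
        return prev
    for _ in range(t - 1):
        prev, cur = cur, 2.0 * x * cur - prev
    return cur


gamma = 0.01                      # assumed stand-in for delta / (4L)
t = 50                            # assumed polynomial degree
x = 1.0 + gamma

power_growth = x ** t             # growth of the plain power method
cheb_growth = chebyshev(t, x)     # growth of the Chebyshev polynomial
root = x + math.sqrt(x * x - 1.0)

# Largest |T_t(y)| over a grid of points y in [-1, 1].
inside = max(abs(chebyshev(t, -1.0 + 2.0 * k / 200)) for k in range(201))
```

With these values the power method amplifies an eigenvalue of $1.01$ by less than a factor of 2 after 50 steps, while the Chebyshev polynomial of the same degree amplifies it by several hundred and keeps everything in $[-1, 1]$ bounded by 1.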
Then, any eigenvalue of $M$ that is above $1 + \frac{\delta}{4L}$ shall grow at a speed like $(1 + \sqrt{\delta/L})^T$, so it suffices to choose $T \ge \tilde{\Omega}\big(\frac{\sqrt{L}}{\sqrt{\delta}}\big)$. This is quadratically faster than applying the power method, so in Neon2^det we wish to compute $x_{t+1} \approx x_0 + T_t(M)\,\xi$.

The second issue is that, since we cannot compute Hessian-vector products, we have to use the gradient difference to approximate them; that is, we can only use $\mathcal{M}(y)$ to approximate $My$, where

$$\mathcal{M}(y) := -\frac{1}{L}\big(\nabla f(x_0 + y) - \nabla f(x_0)\big) + \Big(1 - \frac{3\delta}{4L}\Big) y .$$

How does error propagate if we compute $T_t(M)\,\xi$ by replacing $M$ with $\mathcal{M}$? Note that this is a very non-trivial question, because the coefficients of the polynomial $T_t(x)$ are as large as $2^{O(t)}$.

It turns out that the way error propagates depends on how the Chebyshev polynomial is calculated. If the so-called backward recurrence formula is used, namely,

$$y_0 = 0, \quad y_1 = \xi, \quad y_t = 2\mathcal{M}(y_{t-1}) - y_{t-2}$$

and setting $x_{T+1} = x_0 + y_{T+1} - \mathcal{M}(y_T)$, then this $x_{T+1}$ is sufficiently close to the exact value $x_0 + T_T(M)\,\xi$. This is known as the stability theory of computing Chebyshev polynomials, and is proved in our prior work [5].

We defer all the proof details to Appendix B.

5 Neon2 in the SVRG Setting

Recall that the shift-and-invert (SI) approach [] on top of the SVRG method [6] solves the minimum eigenvector problem as follows. Given any matrix $A = \nabla^2 f(x_0)$, suppose its eigenvalues are $\lambda_1 \le \cdots \le \lambda_d$. Then, if $\lambda > -\lambda_1$, we can define the positive semidefinite matrix $B = (\lambda I + A)^{-1}$, and then apply the power method to find an (approximate) maximum eigenvector of $B$, which necessarily is an (approximate) minimum eigenvector of $A$.

The SI approach specifies a binary search routine to determine the shifting constant $\lambda$, and ensures that $B = (\lambda I + A)^{-1}$ is always well conditioned, meaning that it suffices to apply the power method on $B$ for a logarithmic number of iterations. In other words, the task of computing the minimum eigenvector of $A$ reduces to computing matrix-vector products $By$ a poly-logarithmic number of times. Moreover, the stability of SI was shown in a number of papers, including [] and [4]. This means it suffices for us to compute $By$ approximately.

However, how does one compute $By$ for an arbitrary vector $y$? It turns out that this is equivalent to minimizing a convex quadratic function that is of a finite-sum form

$$g(z) := \frac{1}{2} z^\top (\lambda I + A) z + y^\top z = \frac{1}{2n}\sum_{i=1}^{n} z^\top \big(\lambda I + \nabla^2 f_i(x_0)\big) z + y^\top z .$$
Therefore, one can apply a variant of the SVRG method (arguably first discovered by Shalev-Shwartz [6]) to solve this task. In each iteration, SVRG needs to evaluate a stochastic gradient $(\lambda I + \nabla^2 f_i(x_0)) z + y$ at some point $z$ for some random $i \in [n]$. Instead of evaluating it exactly (which requires a Hessian-vector product), we use $\nabla f_i(x_0 + z) - \nabla f_i(x_0)$ to approximate $\nabla^2 f_i(x_0) \cdot z$.

Of course, one needs to show also that the SVRG method is stable to noise. Using similar techniques as in the previous two sections, one can show that the error term is proportional to $O(\|z\|_2^2)$, and thus as long as the norm of $z$ is bounded (just like we did in the previous two sections), this should not affect the performance of the algorithm. We decide to omit the detailed theoretical proof of this result, because it would complicate this paper.

Theorem 3 (Neon2^svrg). Let $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$ where each $f_i$ is $L$-smooth and $L_2$-second-order smooth. For every point $x_0 \in \mathbb{R}^d$, every $\delta > 0$, every $p \in (0, 1]$, the output $v =$ Neon2^svrg$(f, x_0, \delta, p)$ satisfies that, with probability at least $1 - p$:
1. If $v = \perp$, then $\nabla^2 f(x_0) \succeq -\delta I$.
2. If $v \ne \perp$, then $\|v\|_2 = 1$ and $v^\top \nabla^2 f(x_0) v \le -\frac{\delta}{2}$.

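Moreover, the total number of stochastic gradient evaluations is $\tilde{O}\big(n + \frac{n^{3/4}\sqrt{L}}{\sqrt{\delta}}\big)$.

6 Applications of Neon2

We show how Neon2^online can be applied to existing algorithms such as SGD, GD, SCSG, SVRG, Natasha2, CDHS. Unfortunately, we are unaware of a generic statement for applying Neon2^online to any algorithm. Therefore, we have to prove them individually (see Footnote 6).

Throughout this section, we assume that some starting vector $x_0 \in \mathbb{R}^d$ and upper bound $\Delta_f$ are given to the algorithm, satisfying $f(x_0) - \min_x\{f(x)\} \le \Delta_f$. This is only for the purpose of proving theoretical bounds. In practice, because $\Delta_f$ only appears in specifying the number of iterations, one can just run a sufficient number of iterations and then halt the algorithm, without needing to know $\Delta_f$.

6.1 Auxiliary Claims

Claim 6.1. For any $x$, using $O\big((\frac{V}{\varepsilon^2} + 1)\log\frac{1}{p}\big)$ stochastic gradients, we can decide with probability $1 - p$: either $\|\nabla f(x)\| \ge \frac{\varepsilon}{2}$ or $\|\nabla f(x)\| \le \varepsilon$.

Proof. Suppose we generate $m = O(\log\frac{1}{p})$ random uniform subsets $S_1, \dots, S_m$ of $[n]$, each of cardinality $B = \max\{\frac{32V}{\varepsilon^2}, 1\}$. Then, denoting by $v_j = \frac{1}{B}\sum_{i \in S_j} \nabla f_i(x)$, we have according to Fact 2.2 that $\mathbb{E}_{S_j}\big[\|v_j - \nabla f(x)\|^2\big] \le \frac{V}{B} = \frac{\varepsilon^2}{32}$. In other words, with probability at least $1/2$ over the randomness of $S_j$, we have $\big|\|v_j\| - \|\nabla f(x)\|\big| \le \|v_j - \nabla f(x)\| \le \frac{\varepsilon}{4}$. Since $m = O(\log\frac{1}{p})$, with probability at least $1 - p$, at least $m/2 + 1$ of the vectors $v_j$ satisfy $\big|\|v_j\| - \|\nabla f(x)\|\big| \le \frac{\varepsilon}{4}$. Now, if we select $v = v_j$ where $j \in [m]$ is the index that gives the median value of $\|v_j\|$, then it satisfies $\big|\|v_j\| - \|\nabla f(x)\|\big| \le \frac{\varepsilon}{4}$. Finally, we can check if $\|v_j\| \le \frac{3\varepsilon}{4}$. If so, then we conclude that $\|\nabla f(x)\| \le \varepsilon$, and if not, we conclude that $\|\nabla f(x)\| \ge \frac{\varepsilon}{2}$.

Claim 6.2. If $v$ is a unit vector and $v^\top \nabla^2 f(y) v \le -\frac{\delta}{2}$, suppose we choose $y' = y \pm \frac{\delta}{L_2} v$ where the sign is random, then $f(y) - \mathbb{E}[f(y')] \ge \frac{\delta^3}{12 L_2^2}$.

Proof. Letting $\eta = \frac{\delta}{L_2}$, then by the second-order smoothness,

$$f(y) - \mathbb{E}[f(y')] \ge \mathbb{E}\Big[\langle \nabla f(y), y - y' \rangle - \frac{1}{2}(y - y')^\top \nabla^2 f(y)(y - y') - \frac{L_2}{6}\|y - y'\|^3\Big] = -\frac{\eta^2}{2} v^\top \nabla^2 f(y) v - \frac{L_2 \eta^3}{6}\|v\|^3 \ge \frac{\eta^2 \delta}{4} - \frac{L_2 \eta^3}{6} = \frac{\delta^3}{12 L_2^2} .$$

6.2 Neon2 on SGD and GD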
To apply Neon2 to turn SGD into an algorithm finding approximate local minima, we propose the following process Neon2+SGD (see Algorithm 4). In each iteration $t$, we first apply SGD with mini-batch size $O(\frac{1}{\varepsilon^2})$ (see Line 4). Then, if SGD finds a point with small gradient, we apply Neon2^online to decide if it has a negative curvature; if so, then we move in the direction of the negative curvature (see Line 10). We have the following theorem:

Footnote 6: This is because stationary-point finding algorithms have somewhat different guarantees. For instance, in mini-batch SGD we have $f(x_t) - \mathbb{E}[f(x_{t+1})] \ge \Omega(\|\nabla f(x_t)\|^2)$ but in SCSG we have $f(x_t) - \mathbb{E}[f(x_{t+1})] \ge \mathbb{E}[\Omega(\|\nabla f(x_{t+1})\|^2)]$.

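Algorithm 4 Neon2+SGD$(f, x_0, p, \varepsilon, \delta)$
Input: function $f(\cdot)$, starting vector $x_0$, confidence $p \in (0, 1)$, $\varepsilon > 0$ and $\delta > 0$.
 1: $K \leftarrow O\big(\frac{L_2^2 \Delta_f}{\delta^3} + \frac{L \Delta_f}{\varepsilon^2}\big)$;    // $\Delta_f$ is any upper bound on $f(x_0) - \min_x\{f(x)\}$
 2: for $t \leftarrow 0$ to $K$ do
 3:   $S \leftarrow$ a uniform random subset of $[n]$ with cardinality $|S| = B := \max\{\frac{8V}{\varepsilon^2}, 1\}$;
 4:   $x_{t+1/2} \leftarrow x_t - \frac{1}{L|S|}\sum_{i \in S} \nabla f_i(x_t)$;
 5:   if $\|\nabla f(x_t)\| \ge \frac{\varepsilon}{2}$ then    // estimate $\|\nabla f(x_t)\|$ using $O(\varepsilon^{-2} V \log(K/p))$ stochastic gradients
 6:     $x_{t+1} \leftarrow x_{t+1/2}$;
 7:   else    // necessarily $\|\nabla f(x_t)\| \le \varepsilon$
 8:     $v \leftarrow$ Neon2^online$(x_t, \delta, \frac{p}{K})$;
 9:     if $v = \perp$ then return $x_t$;    // necessarily $\nabla^2 f(x_t) \succeq -\delta I$
10:     else $x_{t+1} \leftarrow x_t \pm \frac{\delta}{L_2} v$;    // necessarily $v^\top \nabla^2 f(x_t) v \le -\delta/2$
11:   end if
12: end for
13: will not reach this line (with probability $\ge 1 - p$).

Algorithm 5 Neon2+GD$(f, x_0, p, \varepsilon, \delta)$
Input: function $f(\cdot)$, starting vector $x_0$, confidence $p \in (0, 1)$, $\varepsilon > 0$ and $\delta > 0$.
 1: $K \leftarrow O\big(\frac{L_2^2 \Delta_f}{\delta^3} + \frac{L \Delta_f}{\varepsilon^2}\big)$;    // $\Delta_f$ is any upper bound on $f(x_0) - \min_x\{f(x)\}$
 2: for $t \leftarrow 0$ to $K$ do
 3:   $x_{t+1/2} \leftarrow x_t - \frac{1}{L}\nabla f(x_t)$;
 4:   if $\|\nabla f(x_t)\| \ge \frac{\varepsilon}{2}$ then
 5:     $x_{t+1} \leftarrow x_{t+1/2}$;
 6:   else
 7:     $v \leftarrow$ Neon2^det$(x_t, \delta, \frac{p}{K})$;
 8:     if $v = \perp$ then return $x_t$;    // necessarily $\nabla^2 f(x_t) \succeq -\delta I$
 9:     else $x_{t+1} \leftarrow x_t \pm \frac{\delta}{L_2} v$;    // necessarily $v^\top \nabla^2 f(x_t) v \le -\delta/2$
10:   end if
11: end for
12: will not reach this line (with probability $\ge 1 - p$).

Theorem 5a. With probability at least $1 - p$, Neon2+SGD outputs an $(\varepsilon, \delta)$-approximate local minimum in gradient complexity

$$T = \tilde{O}\Big(\Big(\frac{V}{\varepsilon^2} + 1\Big)\Big(\frac{L_2^2 \Delta_f}{\delta^3} + \frac{L \Delta_f}{\varepsilon^2}\Big) + \frac{L^2}{\delta^2} \cdot \frac{L_2^2 \Delta_f}{\delta^3}\Big) .$$

Corollary 6.3. Treating $\Delta_f, V, L, L_2$ as constants, we have $T = \tilde{O}\big(\frac{1}{\varepsilon^{4}} + \frac{1}{\varepsilon^{2}\delta^{3}} + \frac{1}{\delta^{5}}\big)$.

One can similarly (and more easily) give an algorithm Neon2+GD, which is the same as Neon2+SGD except that the mini-batch SGD is replaced with a full gradient descent, and the use of Neon2^online is replaced with Neon2^det. We have the following theorem:

Theorem 5c. With probability at least $1 - p$, Neon2+GD outputs an $(\varepsilon, \delta)$-approximate local minimum in $\tilde{O}\Big(\frac{L \Delta_f}{\varepsilon^2} + \frac{L^{1/2}}{\delta^{1/2}} \cdot \frac{L_2^2 \Delta_f}{\delta^3}\Big)$ full gradient computations.

We only prove Theorem 5a in Appendix C; the proof of Theorem 5c is only simpler.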
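The control flow of Algorithm 5 (take gradient steps while the gradient is large, otherwise search for negative curvature and either stop or step along it) can be sketched on a two-dimensional toy function. This is a simplified illustration under loud assumptions: the objective $f(x, y) = \frac{x^2}{2} + \frac{y^4}{4} - \frac{y^2}{2}$, the constants `L`, `L2`, `EPS`, `DELTA`, and all function names are hypothetical, and the NC-search is a plain power iteration on gradient differences standing in for Neon2^det, with no Chebyshev acceleration and none of its guarantees.

```python
import math
import random

L, L2 = 4.0, 8.0           # assumed smoothness constants on the region visited
EPS, DELTA = 1e-2, 0.5     # assumed accuracy targets


def grad(p):
    """Gradient of the toy objective f(x, y) = x^2/2 + y^4/4 - y^2/2."""
    x, y = p
    return (x, y ** 3 - y)


def norm(v):
    return math.hypot(v[0], v[1])


def nc_search(p, iters=200, radius=1e-3, seed=0):
    """Power iteration on I - H/L using gradient differences only.

    Returns a unit vector whose estimated curvature is <= -DELTA/2, else None.
    """
    rng = random.Random(seed)
    g0 = grad(p)
    y = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    for _ in range(iters):
        s = radius / norm(y)
        y = (s * y[0], s * y[1])                  # keep y small so the difference tracks H y
        g = grad((p[0] + y[0], p[1] + y[1]))
        y = (y[0] - (g[0] - g0[0]) / L, y[1] - (g[1] - g0[1]) / L)
    v = (y[0] / norm(y), y[1] / norm(y))
    g = grad((p[0] + radius * v[0], p[1] + radius * v[1]))
    curvature = (v[0] * (g[0] - g0[0]) + v[1] * (g[1] - g0[1])) / radius
    return v if curvature <= -DELTA / 2 else None


def gd_with_nc_search(p, max_iters=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(max_iters):
        g = grad(p)
        if norm(g) >= EPS / 2:
            p = (p[0] - g[0] / L, p[1] - g[1] / L)   # plain gradient step
            continue
        v = nc_search(p)
        if v is None:
            return p                                  # small gradient, no negative curvature found
        sign = rng.choice((-1.0, 1.0))                # random sign, as in x_t +/- (delta/L2) v
        step = sign * DELTA / L2
        p = (p[0] + step * v[0], p[1] + step * v[1])
    return p


result = gd_with_nc_search((1.0, 0.0))
```

Started on the $x$-axis, plain gradient descent would converge to the saddle at the origin; here the curvature step moves the iterate off the axis and the run ends near one of the minima $(0, \pm 1)$.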

6.3 Neon2 on SCSG and SVRG

Background. We first recall the main idea of the SVRG method for non-convex optimization [3, 3]. It is an offline method, but it is what SCSG is built on. SVRG divides iterations into epochs, each of length $n$. It maintains a snapshot point $\tilde{x}$ for each epoch, and computes the full gradient $\nabla f(\tilde{x})$ only for snapshots. Then, in each iteration $t$ at point $x_t$, SVRG defines the gradient estimator $\tilde{\nabla} f(x_t) = \nabla f_i(x_t) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$, which satisfies $\mathbb{E}_i[\tilde{\nabla} f(x_t)] = \nabla f(x_t)$, and performs the update $x_{t+1} \leftarrow x_t - \alpha \tilde{\nabla} f(x_t)$ for learning rate $\alpha$.

The SCSG method of Lei et al. [8] proposed a simple fix to turn SVRG into an online method. They changed the epoch length of SVRG from $n$ to $B \approx 1/\varepsilon^2$, and replaced the computation of $\nabla f(\tilde{x})$ with $\frac{1}{|S|}\sum_{i \in S} \nabla f_i(\tilde{x})$, where $S$ is a random subset of $[n]$ with cardinality $|S| = B$. To make this approach even more general, they also analyzed SCSG in the mini-batch setting, with mini-batch size $b \in \{1, 2, \dots, B\}$.^7 Their Theorem 3. [8] says that,

Lemma 6.4 ([8]). There exists a constant $C > 1$ such that, if we run SCSG for an epoch of size $B$ (so using $O(B)$ stochastic gradients)^8 with mini-batch $b \in \{1, 2, \dots, B\}$ starting from a point $x_t$ and moving to $x_t^+$, then
$\mathbb{E}\big[\|\nabla f(x_t^+)\|^2\big] \le C \cdot L (b/B)^{1/3}\big(f(x_t) - \mathbb{E}[f(x_t^+)]\big) + \frac{6V}{B}$.

Our Approach. In principle, one can apply the same idea as in Neon2+SGD to SCSG to turn it into an algorithm finding approximate local minima. Unfortunately, this is not quite possible because the left-hand side of Lemma 6.4 is on $\mathbb{E}[\|\nabla f(x_t^+)\|^2]$, as opposed to $\|\nabla f(x_t)\|^2$ in SGD (see (C.1)). This means that, instead of testing whether $x_t$ is a good local minimum (as we did in Neon2+SGD), this time we need to test whether $x_t^+$ is a good local minimum. This creates some extra difficulty, so we need a different proof.

Remark 6.5. As for the parameters of SCSG, we simply use $B = \max\{1, \frac{48V}{\varepsilon^2}\}$. However, choosing mini-batch size $b = 1$ does not necessarily give the best complexity, so a tradeoff $b = \Theta\big(\frac{(\varepsilon^2 + V)\varepsilon^4 L_2^6}{\delta^9 L^3}\big)$ is needed. (A similar tradeoff was also discovered by the authors of Neon [8].) Note that this quantity $b$ may be larger than $B$, and if this happens, SCSG becomes essentially equivalent to one iteration of SGD with mini-batch size $b$. Instead of analyzing this boundary case $b > B$ separately, we simply run Neon2+SGD whenever $b > B$ happens, to simplify our proof.

We show the following theorem (proved in Appendix C):

Theorem 5b. With probability at least $2/3$, Neon2+SCSG outputs an $(\varepsilon, \delta)$-approximate local minimum in gradient complexity
$T = \tilde{O}\Big(\big(\frac{L \Delta_f}{\varepsilon^{4/3} V^{1/3}} + \frac{L_2^2 \Delta_f}{\delta^3}\big)\big(\frac{V}{\varepsilon^2} + \frac{L^2}{\delta^2}\big) + \frac{L \Delta_f}{\varepsilon^2} \cdot \frac{L^2}{\delta^2}\Big)$.

(To provide the simplest proof, we have shown Theorem 5b only with probability $2/3$. One can for instance boost the confidence to $1-p$ by running $\log\frac{1}{p}$ copies of Neon2+SCSG.)

Corollary 6.6. Treating $\Delta_f, V, L, L_2$ as constants, we have $T = \tilde{O}\big(\frac{1}{\varepsilon^{10/3}} + \frac{1}{\varepsilon^2 \delta^3} + \frac{1}{\delta^5}\big)$.

7 That is, they reduced the epoch length to $B$, and replaced $\nabla f_i(x_t) - \nabla f_i(\tilde{x})$ with $\frac{1}{b}\sum_{i \in S'}\big(\nabla f_i(x_t) - \nabla f_i(\tilde{x})\big)$ for some $S'$ that is a random subset of $[n]$ with cardinality $|S'| = b$.

8 We remark that Lei et al. [8] only showed that an epoch runs in an expectation of $O(B)$ stochastic gradients. We assume it is exact here to simplify proofs. One can for instance stop SCSG after $O(B \log\frac{1}{p})$ stochastic gradient computations, and then Lemma 6.4 will succeed with probability $\ge 1-p$.
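The unbiasedness of the SVRG gradient estimator recalled in the Background paragraph can be checked numerically. The Python sketch below does so on scalar quadratics $f_i(x) = \frac{1}{2} a_i (x - c_i)^2$; the coefficients, the evaluation points, and the helper names are illustrative assumptions and do not come from the paper.

```python
def grad_i(i, x, a, c):
    """Gradient of the i-th component f_i(x) = 0.5 * a[i] * (x - c[i])**2."""
    return a[i] * (x - c[i])


def full_grad(x, a, c):
    """Gradient of the average f(x) = (1/n) * sum_i f_i(x)."""
    n = len(a)
    return sum(grad_i(i, x, a, c) for i in range(n)) / n


def svrg_estimator(i, x, snapshot, snapshot_grad, a, c):
    """grad f_i(x) - grad f_i(snapshot) + grad f(snapshot)."""
    return grad_i(i, x, a, c) - grad_i(i, snapshot, a, c) + snapshot_grad


a = [1.0, 2.0, 0.5, 4.0]
c = [0.0, 1.0, -2.0, 3.0]
snapshot = 0.7
snapshot_grad = full_grad(snapshot, a, c)  # computed once per epoch

x_t = 1.3
indices = range(len(a))
estimates = [svrg_estimator(i, x_t, snapshot, snapshot_grad, a, c) for i in indices]
mean_estimate = sum(estimates) / len(estimates)  # expectation over uniform i
exact = full_grad(x_t, a, c)

# At the snapshot itself the estimator has no variance at all.
at_snapshot = [
    svrg_estimator(i, snapshot, snapshot, snapshot_grad, a, c) for i in indices
]

print(mean_estimate, exact)
```

The second check shows why the construction reduces variance: the closer $x_t$ stays to the snapshot, the less the estimator depends on the sampled index.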

Algorithm 6 Neon2+SCSG$(f, x_0, \varepsilon, \delta)$
Input: function $f(\cdot)$, starting vector $x_0$, $\varepsilon > 0$ and $\delta > 0$.
1: $B \leftarrow \max\{1, \frac{48V}{\varepsilon^2}\}$; $b \leftarrow \max\big\{1, \Theta\big(\frac{(\varepsilon^2 + V)\varepsilon^4 L_2^6}{\delta^9 L^3}\big)\big\}$;
2: if $b > B$ then return Neon2+SGD$(f, x_0, 1/3, \varepsilon, \delta)$; ▷ for cleaner analysis purpose, see Remark 6.5
3: $K \leftarrow \Theta\big(\frac{L b^{1/3} \Delta_f}{\varepsilon^{4/3} V^{1/3}}\big)$; ▷ $\Delta_f$ is any upper bound on $f(x_0) - \min_x\{f(x)\}$
4: for $t \leftarrow 0$ to $K-1$ do
5:   $x_{t+1/2} \leftarrow$ apply SCSG on $x_t$ for one epoch of size $B = \max\{\Theta(V/\varepsilon^2), 1\}$;
6:   if $\|\nabla f(x_{t+1/2})\| \ge \frac{\varepsilon}{2}$ then ▷ estimate $\|\nabla f(x_{t+1/2})\|$ using $O(\varepsilon^{-2} V \log K)$ stochastic gradients
7:     $x_{t+1} \leftarrow x_{t+1/2}$;
8:   else ▷ necessarily $\|\nabla f(x_{t+1/2})\| \le \varepsilon$
9:     $v \leftarrow$ Neon2$^{\mathsf{online}}(f, x_{t+1/2}, \delta, \frac{1}{20K})$;
10:    if $v = \bot$ then return $x_{t+1/2}$; ▷ necessarily $\nabla^2 f(x_{t+1/2}) \succeq -\delta I$
11:    else $x_{t+1} \leftarrow x_{t+1/2} \pm \frac{\delta}{L_2} v$; ▷ necessarily $v^\top \nabla^2 f(x_{t+1/2}) v \le -\delta/2$
12:   end if
13: end for
14: will not reach this line (with probability $\ge 2/3$).

As for SVRG, it is an offline method and its one-epoch lemma looks like^9
$\mathbb{E}\big[\|\nabla f(x_t^+)\|^2\big] \le C \cdot L n^{-1/3}\big(f(x_t) - \mathbb{E}[f(x_t^+)]\big)$.
If one replaces the use of Lemma 6.4 with this new inequality, and replaces the use of Neon2$^{\mathsf{online}}$ with Neon2$^{\mathsf{svrg}}$, then we get the following theorem:

Theorem 5d. With probability at least $2/3$, Neon2+SVRG outputs an $(\varepsilon, \delta)$-approximate local minimum in gradient complexity
$T = \tilde{O}\Big(\big(\frac{L \Delta_f}{\varepsilon^2 n^{1/3}} + \frac{L_2^2 \Delta_f}{\delta^3}\big)\big(n + n^{3/4}\frac{\sqrt{L}}{\sqrt{\delta}}\big)\Big)$.

For a clean presentation of this paper, we omit the pseudocode and proof because they are only simpler than those of Neon2+SCSG.

9 There are at least three different variants of SVRG [3, 8, 3]. We have adopted the lemma of [8] for simplicity.

6.4 Neon2 on Natasha2 and CDHS

The recent results of Carmon et al. [7] (which we refer to as CDHS) and Natasha2 [] are both Hessian-free methods in which the only Hessian-vector product computations come from the exact NC-search process we study in this paper. Therefore, by replacing their NC-search with Neon2, we can directly turn them into first-order methods without the necessity of computing Hessian-vector products. We state the following two theorems, whose proofs are exactly the same as in the papers [7] and []. We state them assuming $\Delta_f, V, L, L_2$ are constants, to simplify our notation.

Theorem 2. One can replace Oja's algorithm with Neon2$^{\mathsf{online}}$ in Natasha2 without hurting its performance, turning it into a first-order stochastic method. Treating $\Delta_f, V, L, L_2$ as constants, Natasha2 finds an $(\varepsilon, \delta)$-approximate local minimum in $T = \tilde{O}\big(\frac{1}{\varepsilon^{3.25}} + \frac{1}{\varepsilon^3 \delta} + \frac{1}{\delta^5}\big)$ stochastic gradient computations.

Theorem 4. One can replace the Lanczos method with Neon2$^{\mathsf{det}}$ or Neon2$^{\mathsf{svrg}}$ in CDHS without hurting its performance, turning it into a first-order method.

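Treating $\Delta_f, L, L_2$ as constants, CDHS finds an $(\varepsilon, \delta)$-approximate local minimum in either $\tilde{O}\big(\frac{1}{\varepsilon^{1.75}} + \frac{1}{\delta^{3.5}}\big)$ full gradient computations (if Neon2$^{\mathsf{det}}$ is used) or in $T = \tilde{O}\big(\frac{n}{\varepsilon^{1.5}} + \frac{n}{\delta^3} + \frac{n^{3/4}}{\varepsilon^{1.75}} + \frac{n^{3/4}}{\delta^{3.5}}\big)$ stochastic gradient computations (if Neon2$^{\mathsf{svrg}}$ is used).

Acknowledgements

We would like to thank Tianbao Yang and Yi Xu for helpful feedback on this manuscript.

Appendix

A Missing Proofs for Section 3

A.1 Auxiliary Lemmas

We use the following lemma to approximate Hessian-vector products:

Lemma A.1. If $f(x)$ is $L_2$-second-order smooth, then for every point $x \in \mathbb{R}^d$ and every vector $v \in \mathbb{R}^d$, we have
$\|\nabla f(x+v) - \nabla f(x) - \nabla^2 f(x) v\| \le L_2 \|v\|^2$.

Proof of Lemma A.1. We can write $\nabla f(x+v) - \nabla f(x) = \int_{t=0}^{1} \nabla^2 f(x+tv)\, v\, dt$. Subtracting $\nabla^2 f(x) v$ we have:
$\|\nabla f(x+v) - \nabla f(x) - \nabla^2 f(x) v\| = \Big\|\int_{t=0}^{1} \big(\nabla^2 f(x+tv) - \nabla^2 f(x)\big) v\, dt\Big\| \le \int_{t=0}^{1} \|\nabla^2 f(x+tv) - \nabla^2 f(x)\| \|v\|\, dt \le L_2 \|v\|^2$.

We need the following auxiliary lemma about a martingale concentration bound:

Lemma A.2. Consider random events $\{\mathcal{F}_t\}_{t \ge 1}$ and random variables $x_1, \dots, x_T \ge 0$ and $a_1, \dots, a_T \in [-\rho, \rho]$ for $\rho \in [0, 1/2]$, where each $x_t$ and $a_t$ only depend on $\mathcal{F}_1, \dots, \mathcal{F}_t$. Letting $x_0 = 0$, suppose there exist constants $b \ge 0$ and $\lambda > 0$ such that for every $t \ge 1$:
$x_t \le x_{t-1}(1 - a_t) + b$ and $\mathbb{E}[a_t \mid \mathcal{F}_1, \dots, \mathcal{F}_{t-1}] \ge -\lambda$.
Then, we have for every $p \in (0,1)$: $\Pr\Big[x_T \ge T b\, e^{\lambda T + 2\rho\sqrt{T \log\frac{T}{p}}}\Big] \le p$.

Proof. We know that
$x_T \le (1 - a_T) x_{T-1} + b \le (1 - a_T)\big((1 - a_{T-1}) x_{T-2} + b\big) + b = (1 - a_T)(1 - a_{T-1}) x_{T-2} + (1 - a_T) b + b \le \cdots \le b + \sum_{s=2}^{T} \prod_{t=s}^{T} (1 - a_t)\, b$.
For each $s \in [T]$, we consider the random process defined as
$\forall t \ge s$: $y_{t+1} = (1 - a_t) y_t$, $y_s = b$.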

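Therefore $\log y_{t+1} = \log(1 - a_t) + \log y_t$. Since $\log(1 - a_t) \in [-2\rho, \rho]$ and $\mathbb{E}[\log(1 - a_t) \mid \mathcal{F}_1, \dots, \mathcal{F}_{t-1}] \le \lambda$, we can apply the Azuma-Hoeffding inequality on $\log y_t$ to conclude that
$\Pr\Big[y_T \ge b\, e^{\lambda T + 2\rho\sqrt{T \log\frac{T}{p}}}\Big] \le p/T$.
Taking a union bound over $s$ we complete the proof.

A.2 Proof of Lemma 3.1

Proof of Lemma 3.1. Let $i_t \in [n]$ be the random index $i$ chosen when computing $x_{t+1}$ from $x_t$ in Line 5 of Neon2$^{\mathsf{online}}_{\mathsf{weak}}$. We will write the update rule of $x_t$ in terms of the Hessian before we stop. By Lemma A.1, we know that for every $t \ge 1$,
$\|\nabla f_{i_t}(x_t) - \nabla f_{i_t}(x_0) - \nabla^2 f_{i_t}(x_0)(x_t - x_0)\| \le L_2 \|x_t - x_0\|^2$.
Therefore, there exists an error vector $\xi_t \in \mathbb{R}^d$ with $\|\xi_t\| \le L_2 \|x_t - x_0\|^2$ such that
$(x_{t+1} - x_0) = (x_t - x_0) - \eta \nabla^2 f_{i_t}(x_0)(x_t - x_0) + \eta \xi_t$.
For notational simplicity, let us denote by
$z_t := x_t - x_0$, $A_t := B_t + R_t$ where $B_t := \nabla^2 f_{i_t}(x_0)$, $R_t := -\frac{\xi_t z_t^\top}{\|z_t\|^2}$;
then it satisfies
$z_{t+1} = z_t - \eta B_t z_t + \eta \xi_t = (I - \eta A_t) z_t$.
We have $\|R_t\| \le L_2 \|z_t\| \le L_2 r$. By the $L$-smoothness of $f_{i_t}$, we know $\|B_t\| \le L$ and thus $\|A_t\| \le \|B_t\| + \|R_t\| \le \|B_t\| + L_2 r \le 2L$.

Now, define $\Phi_t := z_{t+1} z_{t+1}^\top = (I - \eta A_t) \cdots (I - \eta A_1)\, \xi \xi^\top (I - \eta A_1)^\top \cdots (I - \eta A_t)^\top$ and $w_t := \frac{z_t}{\|z_t\|} = \frac{z_t}{(\mathrm{Tr}(\Phi_{t-1}))^{1/2}}$. Then, before we stop, we have:
$\mathrm{Tr}(\Phi_t) = \mathrm{Tr}(\Phi_{t-1})\big(1 - 2\eta w_t^\top A_t w_t + \eta^2 w_t^\top A_t^\top A_t w_t\big)$
$\le \mathrm{Tr}(\Phi_{t-1})\big(1 - 2\eta w_t^\top A_t w_t + 4\eta^2 L^2\big)$
$\le \mathrm{Tr}(\Phi_{t-1})\big(1 - 2\eta w_t^\top B_t w_t + 2\eta \|R_t\| + 4\eta^2 L^2\big)$
$\overset{①}{\le} \mathrm{Tr}(\Phi_{t-1})\big(1 - 2\eta w_t^\top B_t w_t + 8\eta^2 L^2\big)$.
Above, ① is because our choice of parameter satisfies $r \le \frac{\eta L^2}{L_2}$. Therefore,
$\log \mathrm{Tr}(\Phi_t) \le \log \mathrm{Tr}(\Phi_{t-1}) + \log\big(1 - 2\eta w_t^\top B_t w_t + 8\eta^2 L^2\big)$.
Letting $\lambda = -\lambda_{\min}(\nabla^2 f(x_0)) = -\lambda_{\min}(\mathbb{E}_{B_t}[B_t])$, since the randomness of $B_t$ is independent of $w_t$, we know that $w_t^\top B_t w_t \in [-L, L]$ and, for every $w_t$, it satisfies $\mathbb{E}_{B_t}[w_t^\top B_t w_t \mid w_t] \ge -\lambda$. This (by the concavity of $\log$) also implies that $\mathbb{E}\big[\log(1 - 2\eta w_t^\top B_t w_t + 8\eta^2 L^2)\big] \le 2\eta\lambda$ and $\log(1 - 2\eta w_t^\top B_t w_t + 8\eta^2 L^2) \in [-2(2\eta L + 8\eta^2 L^2), 2\eta L + 8\eta^2 L^2] \subseteq [-6\eta L, 3\eta L]$. Hence, applying the Azuma-Hoeffding inequality on $\log \mathrm{Tr}(\Phi_t)$ we have
$\Pr\Big[\log \mathrm{Tr}(\Phi_t) - \log \mathrm{Tr}(\Phi_0) \ge 2\eta\lambda t + 16\eta L\sqrt{t \log\frac{1}{p}}\Big] \le p$.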

17 In other words, with probability at least p, Neon online weak will not terminate until t T 0, where T 0 is given by the equation (recall Φ 0 = z = σ ): ηλt 0 + 6ηL T 0 log ( ) r p = log. Next, we turn to accuracy. Let true vector v t+ = (I ηb t ) (I ηb )ξ and we have t t z t+ v t+ = (I ηa s )ξ (I ηb s )ξ = (I ηb t )(z t v t ) ηr t z t. s= s= Thus, if we call u t = z t v t with u = 0, then, before the algorithm stops, we have: σ u t+ (I ηb t )u t η R t z t ηl r ( ) Using Young s inequality a + b ( + β) a + β + b for every β > 0, we have: u t+ ( + η ) ( (I ηb t )u t + 8L r 4 u t η u ) tb t u t u t + 4η L + 8L r 4. Above, assumes without loss of generality that L (as otherwise we can re-scale the problem). Therefore, applying martingale concentration Lemma A., we know ] Pr [ u t 6L r te ηλt+8ηl t log t p p. Now we can apply the recent of Oja s algorithm [6, Theorem 4]. By our choice of parameter η, we have: with probability at least 99/00 the following holds:. Norm growth: v t e (ηλ η L )t σ/d. ( ). Negative curvature: v t f(x 0 )v t ( η)λ + O log(d) v t ηt. Then let us consider the case: λ δ, let us consider a fixed T ined as T = log dr σ ηλ η L = C 0 (log d/p + log(d)) ηλ η L < T. At this point, by the norm growth property, we know that w.p. 99/00, v T r and by our choice of parameters, we know that ( e ηλt +ηl T log p r ). σ which implies that with probability at least 98/00, u T 6dL r T e ηλt+8ηl v T e (ηλ η L )T σ 6dL r T σ T log T p e 8ηL T log T p +η L T 6dL r T e 6 log T p 6dL r T σ σp δ 00L 00. Here, uses our choice of parameters so η L T log T p ( ) L η λ log r σ log T p log T p. 7

18 is due to r σ δ p δp 300dηL L log dr 600dT L L. Thus, with probability at least 98/00, z T = σ v T + u T r. This means Neon online weak must terminate within T T iterations. Moreover, recall with probability at last p, Neon online weak will not terminate before iteration T 0. Thus, at the point of t [T 0, T ] of termination, we have with probability at least 98/00, ( ) ( ) log(d) log(d) v t Av t v t ( η)λ + O ηt ( η)λ + O Moreover, we also know that u t + v t = z t r, therefore, u t 6L r T e ηλt+ηl u t + v t r T log T p. ηt δ. By inition of T, we know that r = e (ηλ η L )T σ/(d), so we can show that ut before. Together, we have: v t 00 as z t Az t z t = v t z t z t Az t v t 3 vt Av t + 4L z t v t v t 4 v t 3 ( v t Av t 4 v t + 4L u ) t 3 ( v t Av t v t 4 v t + ) 4 δ 5 00 δ. Putting everything together we complete the proof. A.3 Proof of Lemma 3. Proof of Lemma 3.. By Lemma A., we know for every i [n], v ( f i (x + v) f i (x) f i (x) ) v L v 3. Letting z j = v ( f j (x + v) f j (x)), we know that z,, z m are i.i.d. random variables with z j L v + L v 3. By Chernoff bound, we know that [ Pr z E[z] ( L v + L v 3 ) m log ] p p Since we also have E[z] v f(x)v L v 3 from Lemma A., we conclude that [ z Pr v v f(x)v v (L + L v ) m log ] p + L v p. Plugging in our assumption on v and our choice of m finishes the proof. B Missing Proofs for Section 4 B. Stable Computation of Chebyshev Polynomials We recall the following result from [5, Section 6.] regarding one way to stably compute Chebyshev polynomials. Suppose we want to compute s N = N T k (M) c k R d where M R d d is symmetric and each c k is in R d. (B.) k=0 8

Definition B.1 (inexact backward recurrence). Let $\mathcal{M}$ be an approximate algorithm that satisfies $\|\mathcal{M}(u) - Mu\| \le \varepsilon \|u\|$ for every $u \in \mathbb{R}^d$. Then, define the inexact backward recurrence to be
$\hat{b}_{N+1} := 0$, $\hat{b}_N := c_N$, and $\forall r \in \{N-1, \dots, 0\}$: $\hat{b}_r := 2\mathcal{M}(\hat{b}_{r+1}) - \hat{b}_{r+2} + c_r \in \mathbb{R}^d$,
and define the output as $\hat{s}_N := \hat{b}_0 - \mathcal{M}(\hat{b}_1)$. If $\varepsilon = 0$, then $\hat{s}_N = s_N$.

The following theorem gives an error analysis [5, Theorem 6.4].

Theorem B.2 (stable Chebyshev sum). For every $N \in \mathbb{N}^*$, suppose the eigenvalues of $M$ are in $[a, b]$ and suppose there are parameters $C_U \ge 1$, $C_T \ge 1$, $\rho \ge 1$, $C_c \ge 0$ satisfying
$\forall k \in \{0, 1, \dots, N\}$: $\rho^k \|c_k\| \le C_c$, and $\forall x \in [a, b]$: $|T_k(x)| \le C_T \rho^k$ and $|U_k(x)| \le C_U \rho^k$.
Then, if the inexact backward recurrence in Def. B.1 is applied with $\varepsilon \le \frac{1}{4 N C_U}$, we have
$\|\hat{s}_N - s_N\| \le \varepsilon \cdot 2(1 + 2 N C_T) N C_U C_c$.

B.2 Proof of Theorem 3

Proof of Theorem 3. We can without loss of generality assume $\delta \le L$. For notational simplicity, let us denote
$A := \nabla^2 f(x_0)$, $M := -\frac{1}{L}\nabla^2 f(x_0) + \big(1 - \frac{3\delta}{4L}\big) I$ and $\lambda := -\lambda_{\min}(A)$.
Then, we know that the eigenvalues of $M$ lie in $\big[-1, 1 + \frac{\lambda - 3\delta/4}{L}\big]$. We wish to iteratively compute $x_{t+1} \leftarrow x_0 + T_t(M)\xi$, where $T_t$ is the $t$-th Chebyshev polynomial of the first kind. However, we cannot multiply $M$ to vectors (because we are not allowed to use Hessian-vector products). We define
$\mathcal{M}(y) := -\frac{1}{L}\big(\nabla f(x_0 + y) - \nabla f(x_0)\big) + \big(1 - \frac{3\delta}{4L}\big) y$
and shall use it to approximate $My$, and then apply the backward recurrence
$y_0 = 0$, $y_1 = \xi$, $y_t = 2\mathcal{M}(y_{t-1}) - y_{t-2}$.
If we set $x_{t+1} = x_0 + y_{t+1} - \mathcal{M}(y_t)$, following Def. B.1, it satisfies $x_{t+1} - x_0 \approx T_t(M)\xi$. Now, letting $x^*_{t+1} := x_0 + T_t(M)\xi$ be the exact solution, we wish to bound the error $\|x^*_{t+1} - x_{t+1}\|$.

Throughout the iterations of Neon2$^{\mathsf{det}}$, we have
$y_t = 2\mathcal{M}(y_{t-1}) - y_{t-2} = 2(x_0 - x_t + y_t) - y_{t-2} \implies y_t - y_{t-2} = 2(x_t - x_0)$.
Since we have $\|x_t - x_0\| \le r$ for each $t$ before termination, we know $\|y_t\| \le 2tr$. Using this upper bound we can approximate the Hessian-vector product by a gradient difference. Lemma A.1 gives us
$\|\mathcal{M}(y_t) - M y_t\| \le \frac{L_2}{L}\|y_t\|^2 \le \frac{2 L_2 r t}{L}\|y_t\|$.
Now, recall from Claim 4.1 that
$T_t(x) \in [-1, 1]$ if $x \in [-1, 1]$; $T_t(x) \in \big[\frac{1}{2}(x + \sqrt{x^2 - 1})^t, (x + \sqrt{x^2 - 1})^t\big]$ if $x > 1$.
We can apply Theorem B.2
with the eigenvalues of M in [a, b] = 0, + λ 3δ/4 L and { ρ = max + λ 3δ/4 } + λ 3δ/4 (λ 3δ/4) + L L L,, C c = ρ t σ, C T = C U =. 9

20 Theorem B. tells us that, for every t before termination, x t+ x t+ 3L rt 3 ρ t σ. L In order to prove Theorem 3, in the rest of the proof, it suffices for us to show that, if λ min ( f(x 0 )) δ, then with probability at least p, it satisfies v, v =, and v f(x 0 )v δ. In other words, we can assume λ δ. The value λ δ implies ρ + δ >, so we can let T = log 4dr pσ log ρ T. By Claim 4., we know that T T (M) ρt pσ. Thus, with probability at least p, x T + x 0 = T T (M)ξ r. Moreover, at iteration T, we have: x T + x T + 3L rt 3 ρt σ L Here, uses the fact that r 3L rt 3 σ L δp. 800dL T 3 = dr 4dr pσ 8dL T 3 r L p δ 00L r 6 r. This means x T + x 0 r so the algorithm must terminate before iteration T T. On the other hand, since T t (M) ρ t, we know that the algorithm will not terminate until t T 0 for T 0 = log r σ log ρ. At the time t T 0 of termination, by the property of Chebyshev polynomial Claim 4., we know. T t (ρ) ρt ρt 0 r 4σ = (d/p)θ().. x [, ], T t (x) [, ]. Since all the eigenvalues of A that are 3/4δ are mapped to the eigenvalues of M that are in [, ], and the smallest eigenvalue of A is mapped to the eigenvalue ρ of M. So we have, with probability at least p, letting v t = x t+ x 0 it satisfies Therefore, denoting by z t = x t+ x 0, we have z t Az t z t v t Av t v t 5 8 δ. = v t z t z t Az t v t 5 vt Av t + 4L z t v t v t 6 v t ( v t Av t + 4L v t z t v t v t ( v t Av t v t + ) 5 δ 5 00 δ. This finishes the proof because we have shown that, with probability at least p, the output v = zt z satisfies and t v f(x 0 )v δ. ) 0
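The backward recurrence of Definition B.1, which the proof above runs with gradient differences in place of matrix-vector products, can be illustrated in the exact scalar case. The Python sketch below replaces the matrix $M$ by a single number $m$ (an assumption made purely for illustration, as are the coefficient values and function names) and checks the recurrence against the direct three-term evaluation of $\sum_k T_k(m) c_k$.

```python
def chebyshev_sum_backward(m, coeffs):
    """Evaluate sum_k T_k(m) * coeffs[k] by the backward recurrence
    b_r = 2 m b_{r+1} - b_{r+2} + c_r, with output b_0 - m * b_1."""
    b_next, b_next2 = 0.0, 0.0  # b_{r+1} and b_{r+2}
    for c in reversed(coeffs):
        b_cur = 2.0 * m * b_next - b_next2 + c
        b_next2, b_next = b_next, b_cur
    # After the loop, b_next holds b_0 and b_next2 holds b_1.
    return b_next - m * b_next2


def chebyshev_sum_direct(m, coeffs):
    """Same sum via T_0 = 1, T_1 = m, T_{k+1} = 2 m T_k - T_{k-1}."""
    total = 0.0
    t_prev, t_cur = 1.0, m
    for k, c in enumerate(coeffs):
        if k == 0:
            t_k = 1.0
        elif k == 1:
            t_k = m
        else:
            t_k = 2.0 * m * t_cur - t_prev
            t_prev, t_cur = t_cur, t_k
        total += c * t_k
    return total


coeffs = [0.3, -1.0, 0.5, 2.0, 0.25]
inside = (chebyshev_sum_backward(0.4, coeffs), chebyshev_sum_direct(0.4, coeffs))
outside = (chebyshev_sum_backward(1.05, coeffs), chebyshev_sum_direct(1.05, coeffs))

# With a single nonzero coefficient the sum is T_3(m); T_3(x) = 4x^3 - 3x.
t3 = chebyshev_sum_backward(1.1, [0.0, 0.0, 0.0, 1.0])
print(inside, outside, t3)
```

Choosing a single nonzero coefficient $c_N = \xi$ reproduces the recurrence $y_0 = 0$, $y_1 = \xi$, $y_t = 2\mathcal{M}(y_{t-1}) - y_{t-2}$ used in the proof, and an argument slightly above 1 shows the growth that amplifies the negative-curvature direction.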

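C Missing Proofs for Section 6

C.1 Proof of Theorem 5a

Proof of Theorem 5a. Since both estimating $\|\nabla f(x_t)\|$ in Line 5 (see Claim 6.1) and invoking Neon2$^{\mathsf{online}}$ (see Theorem 1) succeed with high probability, we can assume that they always succeed. This means that whenever we output $x_t$ in an iteration, it already satisfies $\|\nabla f(x_t)\| \le \varepsilon$ and $\nabla^2 f(x_t) \succeq -\delta I$. Therefore, it remains to show that the algorithm must output some $x_t$ in an iteration, as well as to compute the final complexity.

Recall (from classical SGD theory) that if we update $x_{t+1/2} \leftarrow x_t - \frac{\alpha}{|S|}\sum_{i \in S} \nabla f_i(x_t)$, where $\alpha > 0$ is the learning rate and $|S| = B$, then
$f(x_t) - \mathbb{E}_S[f(x_{t+1/2})] \overset{①}{\ge} \mathbb{E}_S\big[\langle \nabla f(x_t), x_t - x_{t+1/2} \rangle - \frac{L}{2}\|x_t - x_{t+1/2}\|^2\big]$
$= \alpha \|\nabla f(x_t)\|^2 - \frac{\alpha^2 L}{2}\, \mathbb{E}_S\Big[\big\|\frac{1}{|S|}\sum_{i \in S} \nabla f_i(x_t)\big\|^2\Big]$
$= \big(\alpha - \frac{\alpha^2 L}{2}\big)\|\nabla f(x_t)\|^2 - \frac{\alpha^2 L}{2}\, \mathbb{E}_S\Big[\big\|\nabla f(x_t) - \frac{1}{|S|}\sum_{i \in S} \nabla f_i(x_t)\big\|^2\Big]$
$\overset{②}{\ge} \big(\alpha - \frac{\alpha^2 L}{2}\big)\|\nabla f(x_t)\|^2 - \frac{\alpha^2 L}{2} \cdot \frac{V}{B}$.
Above, ① is due to the smoothness of $f(\cdot)$ and ② is due to Fact 2.1. Now, if we choose $\alpha = \frac{1}{L}$ and $B = \max\{\frac{8V}{\varepsilon^2}, 1\}$, then we have
$f(x_t) - \mathbb{E}_S[f(x_{t+1/2})] \ge \frac{1}{2L}\big(\|\nabla f(x_t)\|^2 - \frac{\varepsilon^2}{8}\big)$. (C.1)
In other words, as long as Line 6 is reached, we have $f(x_t) - \mathbb{E}[f(x_{t+1})] \ge \Omega(\varepsilon^2 / L)$. On the other hand, whenever Line 10 is reached, we must have $v^\top \nabla^2 f(x_t) v \le -\frac{\delta}{2}$. By Claim 6.2, we must have $f(x_t) - \mathbb{E}[f(x_{t+1})] \ge \Omega(\delta^3 / L_2^2)$. In sum, if we choose $K = O\big(\frac{L_2^2 \Delta_f}{\delta^3} + \frac{L \Delta_f}{\varepsilon^2}\big)$, then the algorithm must terminate and return $x_t$ in one of its iterations. This ensures that Line 13 will not be reached.

As for the total complexity, we note that each iteration of Neon2+SGD is dominated by $\tilde{O}(B) = \tilde{O}\big(\frac{V}{\varepsilon^2} + 1\big)$ stochastic gradient computations in Line 4 and Line 5, totaling $\tilde{O}\big((\frac{V}{\varepsilon^2} + 1) K\big)$, as well as $\tilde{O}\big(\frac{L^2}{\delta^2}\big)$ stochastic gradient computations by Neon2$^{\mathsf{online}}$, but the latter will not happen more than $O\big(\frac{L_2^2 \Delta_f}{\delta^3}\big)$ times. Therefore, the total gradient complexity is
$\tilde{O}\Big(\big(\frac{V}{\varepsilon^2} + 1\big) K + \frac{L^2}{\delta^2} \cdot \frac{L_2^2 \Delta_f}{\delta^3}\Big) = \tilde{O}\Big(\big(\frac{V}{\varepsilon^2} + 1\big)\big(\frac{L_2^2 \Delta_f}{\delta^3} + \frac{L \Delta_f}{\varepsilon^2}\big) + \frac{L^2}{\delta^2} \cdot \frac{L_2^2 \Delta_f}{\delta^3}\Big)$.

C.2 Proof of Theorem 5b

Proof of Theorem 5b. We first note that in the special case $b = \Theta\big(\frac{(\varepsilon^2 + V)\varepsilon^4 L_2^6}{\delta^9 L^3}\big) \ge B$, or equivalently $\delta^3 \le O\big(\frac{L_2^2 \varepsilon^2}{L}\big)$, Theorem 5a gives us gradient complexity $T = \tilde{O}\big(\frac{V}{\varepsilon^2} \cdot \frac{L_2^2 \Delta_f}{\delta^3} + \frac{L^2}{\delta^2} \cdot \frac{L_2^2 \Delta_f}{\delta^3}\big)$, so we are done. Therefore, in the rest of the proof we assume $\Theta\big(\frac{(\varepsilon^2 + V)\varepsilon^4 L_2^6}{\delta^9 L^3}\big) < B$, and thus $b \le B$ is well defined.

Since both estimating $\|\nabla f(x_{t+1/2})\|$ in Line 6 (see Claim 6.1) and invoking Neon2$^{\mathsf{online}}$ (see Theorem 1) succeed with high probability, we can assume that they always succeed. This means that whenever we output $x_{t+1/2}$ in an iteration, it already satisfies $\|\nabla f(x_{t+1/2})\| \le \varepsilon$ and $\nabla^2 f(x_{t+1/2}) \succeq -\delta I$. Therefore, it remains to show that the algorithm must output some $x_{t+1/2}$ in an iteration, as well as to compute the final complexity.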
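The equivalence invoked at the start of this proof can be checked in one line, up to constant factors. The sketch below assumes the parameter choices of Remark 6.5, namely $B = \max\{1, 48V/\varepsilon^2\}$ and $b = \Theta\big(\frac{(\varepsilon^2 + V)\varepsilon^4 L_2^6}{\delta^9 L^3}\big)$; the intermediate steps are a reader's aid and are not spelled out in the original proof.

```latex
\begin{align*}
b \ge B
&\iff \frac{(\varepsilon^2 + V)\,\varepsilon^4 L_2^6}{\delta^9 L^3}
      \ge \Omega\Big(\frac{\varepsilon^2 + V}{\varepsilon^2}\Big)
  && \text{since } \max\{1, 48V/\varepsilon^2\}
     = \Theta\big((\varepsilon^2 + V)/\varepsilon^2\big) \\
&\iff \delta^9 \le O\Big(\frac{\varepsilon^6 L_2^6}{L^3}\Big)
 \iff \delta^3 \le O\Big(\frac{\varepsilon^2 L_2^2}{L}\Big).
\end{align*}
```

In this regime $\frac{L \Delta_f}{\varepsilon^2} \le O\big(\frac{L_2^2 \Delta_f}{\delta^3}\big)$, so the iteration count of Theorem 5a is dominated by its negative-curvature term, which is why no $\frac{L \Delta_f}{\varepsilon^2}$ term appears in the complexity quoted for the special case.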


More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information
