Approximate Second Order Algorithms
Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo
Why Second Order Algorithms?
- Invariant under affine transformations: stretching a function preserves the convergence rate of Newton's method.
- Example: consider $f(x) = x^2$ and $g(x) = f(x/2) = x^2/4$. Gradient descent takes smaller steps on the stretched function, whereas Newton's method solves either one in a single step.
- Thus, second-order methods potentially require less hyperparameter tuning and hopefully improve training speed.
- First-order methods that achieve the theoretical lower bound already exist. Can we further improve? The number of iterations to converge may balance the per-iteration cost.
- Disadvantages:
  - What if $H(x)$ is not invertible? Use the Moore-Penrose pseudo-inverse.
  - Computing $H^{-1}(x)\nabla f(x)$ is expensive. Approximate it.
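A minimal numerical sketch of this example; the starting point, step size, and finite-difference derivatives are our own illustrative choices, not from the slides.

```python
def f(x): return x ** 2            # original quadratic
def g(x): return (x / 2) ** 2      # "stretched" copy: g(x) = f(x/2) = x^2 / 4

def grad(h, x, eps=1e-6):          # central-difference first derivative
    return (h(x + eps) - h(x - eps)) / (2 * eps)

def hess(h, x, eps=1e-4):          # central-difference second derivative
    return (h(x + eps) - 2 * h(x) + h(x - eps)) / eps ** 2

x0, lr = 5.0, 0.4
for name, h in [("f", f), ("g", g)]:
    x_gd = x0 - lr * grad(h, x0)                  # progress depends on the scaling of h
    x_newton = x0 - grad(h, x0) / hess(h, x0)     # lands at the minimizer 0 for any quadratic
    print(f"{name}: GD step -> {x_gd:.2f}, Newton step -> {x_newton:.2f}")
```

Gradient descent reaches a different point for $f$ and $g$, while the Newton step is identical for both, which is the affine-invariance claim.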
Stochastic Newton Step?
- Suppose we are minimizing $f(x) = \sum_{k=1}^{m} f_k(x)$ with each $f_k$ $\mu$-strongly convex and $L$-smooth.
- For the analysis of second-order algorithms we also need another assumption: $H(x)$ is $M$-Lipschitz, $\|H(x) - H(y)\| \le M\|x - y\|$.
- Naive generalization of SGD: take steps using $H_k^{-1}(x)\nabla f_k(x)$ for a single random sample $k$.
- Problem: estimating the curvature from a single sample hurts performance.
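A tiny sketch of this naive single-sample Newton update on a toy regularized least-squares sum; the data, loss, and constants below are our own, purely to make the failure mode concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, mu = 200, 5, 0.1
V = rng.normal(size=(m, d))                     # data vectors v_k
y = V @ np.ones(d) + rng.normal(size=m)         # noisy targets
x = np.zeros(d)

# f_k(x) = 0.5 (v_k . x - y_k)^2 + 0.5 mu ||x||^2   (strongly convex and smooth)
for _ in range(200):
    k = rng.integers(m)
    g_k = V[k] * (V[k] @ x - y[k]) + mu * x     # per-sample gradient
    H_k = np.outer(V[k], V[k]) + mu * np.eye(d) # per-sample Hessian
    # Naive "stochastic Newton": each step jumps to the minimizer of the sampled
    # component alone, so the iterate bounces around instead of converging.
    x = x - np.linalg.solve(H_k, g_k)

x_star = np.linalg.solve(V.T @ V / m + mu * np.eye(d), V.T @ y / m)
print(np.linalg.norm(x - x_star))               # stays far from the full-sum minimizer
```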
Hessian-Free (HF) Optimization
- To avoid computing $H$, we compute Hessian-vector products $Hv$ for arbitrary vectors $v$, using finite differences of gradients: $Hv \approx \dfrac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}$ for a small $\epsilon$.
- To avoid inverting $H$ to obtain the Newton direction $Hy = -\nabla f(x)$, we solve $\min_y \nabla f(x)^T y + \tfrac{1}{2} y^T H y$ using Conjugate Gradient.
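A sketch of both ingredients on a toy quadratic; the objective, dimensions, and tolerances are our own choices, and CG is taken from SciPy rather than the HF paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
d = 20
Q = rng.normal(size=(d, d))
A = Q @ Q.T + np.eye(d)                 # positive-definite toy curvature
b = rng.normal(size=d)

def grad(x):                            # gradient of f(x) = 0.5 x^T A x - b^T x
    return A @ x - b

def hvp(x, v, eps=1e-5):
    # Hessian-vector product by finite differences of gradients:
    # H v ~ (grad(x + eps v) - grad(x)) / eps
    return (grad(x + eps * v) - grad(x)) / eps

x = np.zeros(d)
H = LinearOperator((d, d), matvec=lambda v: hvp(x, v))
# Newton direction: solve H y = -grad(x) by conjugate gradient, never forming H
y, info = cg(H, -grad(x))
x = x + y
print(np.linalg.norm(grad(x)))          # near zero: one Newton step solves a quadratic
```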
Hessian-Free Optimization
- Off-the-shelf HF algorithms are not feasible for large-scale problems; several modifications are needed.
- Damping: make a more conservative curvature estimate by adding $\lambda\|d\|^2$ to the curvature, adjusting $\lambda$ depending on the reduction ratio $\rho = \dfrac{f(x+p) - f(x)}{q_x(p) - q_x(0)}$, where $q_x$ is the local quadratic model.
- Computing matrix-vector products: use $Gv$ instead of $Hv$, where $G$ is the Gauss-Newton approximation of the Hessian, which is positive semidefinite.
- Terminating conditions for CG: CG solves $Ax = b$ not by minimizing $\|Ax - b\|^2$ but by minimizing the quadratic $\phi(x) = \tfrac{1}{2}x^T A x - b^T x$. $\phi(x)$ decreases with every step, whereas $\|Ax - b\|^2$ fluctuates a lot before tending toward 0. Terminate when the relative improvement of $\phi(x)$ over the last $k$ steps drops below a constant $k\epsilon$.
- Many methods exist to better precondition CG.
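A hand-rolled CG sketch combining the damping term and the $\phi$-based stopping rule described above. The window size, $\lambda$, $\epsilon$, and the Gauss-Newton toy problem are illustrative choices, not the constants from the HF paper.

```python
import numpy as np

def damped_cg(Avp, b, lam=1e-2, max_iter=250, k_window=10, eps_stop=5e-4):
    """CG on (A + lam*I) y = b. Stops when the relative improvement of
    phi(y) = 0.5 y^T (A + lam*I) y - b^T y over the last k_window steps
    falls below k_window * eps_stop (Martens-style criterion, constants ours)."""
    Ad = lambda v: Avp(v) + lam * v            # damping: a more conservative curvature
    y = np.zeros_like(b)
    r = b - Ad(y)
    p = r.copy()
    phis = []
    for i in range(max_iter):
        Ap = Ad(p)
        alpha = (r @ r) / (p @ Ap)
        y = y + alpha * p
        r_new = r - alpha * Ap
        phi = 0.5 * y @ Ad(y) - b @ y          # decreases monotonically, unlike ||Ay - b||
        phis.append(phi)
        if i >= k_window and phi < 0 and (phi - phis[-1 - k_window]) / phi < k_window * eps_stop:
            break                              # relative progress over the window has stalled
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return y

# The curvature is accessed only through matrix-vector products, e.g. a
# Gauss-Newton product G v = J^T (J v), which is positive semidefinite.
rng = np.random.default_rng(1)
d = 50
J = rng.normal(size=(d, d))
direction = damped_cg(lambda v: J.T @ (J @ v), rng.normal(size=d))
```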
Lower Bounds
- First-order methods require $\Omega\big(m + \sqrt{m\kappa}\log\tfrac{1}{\epsilon}\big)$ oracle calls to the gradient to achieve an $\epsilon$-approximate solution.
  - Linear dependence on the condition number is obtained by SVRG and SAGA; the minimal bound is attained by Katyusha and AccSDCA.
- Second-order methods:
  - The algorithm can use at most $m/2$ Hessians per update.
  - Indices $k \in [m]$ are sampled uniformly at random.
  - Input dimension: $d = O(1 + \sqrt{\kappa m})$.
  - Oracle calls: $\Omega\big(m + \sqrt{m\kappa}\log\tfrac{1}{\epsilon}\big)$.
  - The bound is better by a logarithmic factor via a randomized construction.
Lower Bounds Discussion
- This lower bound suggests that second-order algorithms cannot improve the rates of optimization by much.
- Even with such oracles, simple second-order algorithms are unattractive: computing the Hessian costs $O(md^2)$ and inverting it costs $O(d^3)$.
- Because of the assumption that the algorithm cannot use the Hessians of all samples $H_k(x)$, $k \in [m]$, we lose the quadratic convergence rate.
- Suggestion: if an algorithm does not satisfy this assumption, it may achieve faster convergence than the bound presented.
- LiSSA-Sample uses leverage scores to sample the Hessians non-uniformly and, in the high-accuracy regime, achieves a convergence rate faster than any first-order algorithm.
Overview
- $\min_x f(x) = \frac{1}{m}\sum_{k=1}^{m} f_k(x) + \frac{\lambda}{2}\|x\|^2$, where each $f_k(x)$ is $\mu$-strongly convex, $L$-smooth, and has an $M$-Lipschitz Hessian.
- LiSSA and LiSSA-Sample focus on Generalized Linear Models (GLMs): $f_k(x) = \ell(v_k^T x, y_k)$ with $\ell$ $\mu$-strongly convex and $L$-smooth, e.g. linear regression with mean squared error loss on data $(v_k, y_k)$.
- This results in rank-one per-sample Hessians $H_k(x) = \alpha_k v_k v_k^T$.
- Condition number: $\kappa = \dfrac{\max_x \lambda_{\max}(\nabla^2 f(x))}{\min_x \lambda_{\min}(\nabla^2 f(x))}$.
- $\mu$-strong convexity and $L$-smoothness imply $\mu I \preceq \nabla^2 f(x) \preceq L I$.
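A small sketch of the rank-one Hessian structure, using logistic loss as our own illustrative choice of $\ell$ (the slides only require a smooth, strongly convex GLM loss).

```python
import numpy as np

# For a GLM component f_k(x) = l(v_k . x, y_k), the chain rule gives
#   grad f_k(x) = l'(v_k . x, y_k) v_k   and   H_k(x) = l''(v_k . x, y_k) v_k v_k^T,
# i.e. a rank-one Hessian with alpha_k = l''(v_k . x, y_k).
rng = np.random.default_rng(0)
d = 4
v, y_k = rng.normal(size=d), 1.0
x = rng.normal(size=d)

# Logistic loss l(z, y) = log(1 + exp(-y z)), with y in {-1, +1}
z = v @ x
s = 1.0 / (1.0 + np.exp(-y_k * z))             # sigma(y z)
alpha = s * (1.0 - s)                          # l''(z, y)
H_k = alpha * np.outer(v, v)                   # rank-one per-sample Hessian

# Sanity check against a finite-difference Hessian of f_k
def grad_fk(x):
    return -y_k * v / (1.0 + np.exp(y_k * (v @ x)))

eps = 1e-5
H_fd = np.column_stack([(grad_fk(x + eps * e) - grad_fk(x - eps * e)) / (2 * eps)
                        for e in np.eye(d)])
print(np.allclose(H_k, H_fd, atol=1e-6))       # True
```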
LiSSA Description (LiSSA and LiSSA-Sample)
- Key Idea 1 (Estimator): avoid direct inversion of the Hessian by using the matrix Taylor (Neumann) series $A^{-1} = \sum_{i=0}^{\infty}(I - A)^i$, valid when $0 \prec A \preceq I$.
- Key Idea 2 (Concentration): with a sufficient number of random matrices, the matrix Bernstein inequality gives a tail bound. For $A_i \in \mathbb{R}^{d\times d}$ drawn i.i.d. from $P_A$ with $\mathbb{E}[A_i] = 0$ and $\|A_i\| \le R$:
  $P\big(\big\|\sum_i A_i\big\| \ge t\big) \le d\exp\big(-\tfrac{t^2}{4R^2}\big)$.
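A tiny numerical check of the truncated matrix Taylor (Neumann) series; the matrix is a random toy and is rescaled by its spectral norm so the series converges, an assumption we add for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q = rng.normal(size=(d, d))
A = Q @ Q.T + np.eye(d)
A = A / np.linalg.norm(A, 2)                # rescale so that 0 < A <= I and the series converges

# Truncated Neumann / matrix Taylor series: A^{-1} ~ sum_{i=0}^{N} (I - A)^i
N = 300
approx, term = np.zeros((d, d)), np.eye(d)
for _ in range(N + 1):
    approx += term
    term = term @ (np.eye(d) - A)

print(np.linalg.norm(approx - np.linalg.inv(A)))   # small, and shrinking as N grows
```

LiSSA never forms these matrix powers explicitly; it applies the same recursion to a vector using one sampled Hessian per term, which is what the pseudocode on the next slide does.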
LiSSA Pseudocode
- Run any (fast) first-order algorithm to obtain $x_0$ such that $\|x_0 - x^*\| \le \dfrac{1}{4\kappa_\ell M}$ (in practice, use some estimate).
- For each iteration $t = 0, \dots, T-1$:
  - Compute the full gradient $\nabla f(x_t) = \frac{1}{m}\sum_k \nabla f_k(x_t)$.
  - For each $i \in [S_1]$, where $S_1$ is a parameter, set $X_i = \nabla f(x_t)$ and iterate the inner loop $S_2 = 2\kappa_\ell \ln(2\kappa_\ell)$ times: compute the Hessian $\tilde H$ of a single random sample and update $X_i \leftarrow \nabla f(x_t) + (I - \tilde H)X_i$.
  - Update $x_{t+1} = x_t - \frac{1}{S_1}\sum_{i \in [S_1]} X_i$.
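A compact sketch of this update on a toy ridge-regularized least-squares sum. The problem, the values of $S_1$ and $S_2$, the zero initialization, and the rescaling by an estimated smoothness constant (so each sampled Hessian satisfies $H_k \preceq I$, as the Neumann series requires) are all our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 500, 10, 0.1
V = rng.normal(size=(m, d))
y = V @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

# f(x) = (1/m) sum_k 0.5 (v_k.x - y_k)^2 + 0.5 lam ||x||^2
def full_grad(x):
    return V.T @ (V @ x - y) / m + lam * x

L_est = np.max(np.sum(V * V, axis=1)) + lam     # upper bound on ||H_k||, so H_k / L_est <= I

def lissa_direction(x, S1=5, S2=400):
    g = full_grad(x)
    estimates = []
    for _ in range(S1):
        X = g.copy()                            # X estimates (H / L_est)^{-1} grad
        for _ in range(S2):
            k = rng.integers(m)
            # X <- grad + (I - H_k / L_est) X, using the rank-one structure
            # H_k = v_k v_k^T + lam I, so each inner step costs only O(d)
            X = g + X - (V[k] * (V[k] @ X) + lam * X) / L_est
        estimates.append(X)
    return np.mean(estimates, axis=0) / L_est   # undo the 1/L_est rescaling

x = np.zeros(d)                                 # the slides warm-start with a first-order method
for t in range(5):
    x = x - lissa_direction(x)
    print(t, np.linalg.norm(full_grad(x)))      # gradient norm should shrink across iterations
```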
LiSSA Analysis
- Time complexity (number of iterations times per-iteration cost): achieving $f(x_T) - f(x^*) \le \epsilon$ requires time $O\big((m + \kappa_\ell^3)\, d \log\tfrac{1}{\epsilon}\big)$ for small $\epsilon$, with high probability.
LiSSA-Sample Description
- Key Idea 1 (Hessian sketch): replace the exact Hessian by a sketch $B$ and compute $BH^{-1}\nabla f(x) = \arg\min_y \nabla f(x)^T y + \tfrac{1}{2} y^T H(x) B^{-1} y$.
- Leverage score: a measure of how much a sample deviates from the other observations.
- Sample $O(d\log d)$ Hessians uniformly at random, without replacement; use these to compute (generalized) leverage scores for all samples.
- $B = \sum_{k=1}^{m} \dfrac{H_k}{p_k}\,\mathrm{Bernoulli}(p_k)$, where $p_k \propto$ leverage score of sample $k$.
LiSSA-Sample Pseudocode
Repeat $\log\log\tfrac{1}{\epsilon}$ times:
1. Sample $O(d\log d)$ Hessians $H_k \sim p_k$, where $p_k \propto$ leverage score, to compute $B = \sum_k H_k(x)$ such that $\tfrac{1}{2}B \preceq H(x) \preceq 2B$.
2. Minimize the quadratic objective (approximately): $y \approx BH^{-1}\nabla f(x) = \arg\min_y \nabla f(x)^T y + \tfrac{1}{2} y^T H(x) B^{-1} y$.
3. Approximately solve for $u$ to obtain $H^{-1}\nabla f(x)$: $u \approx B^{-1}y$.
4. Update: $x \leftarrow x - u$.
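A sketch of step 1, the leverage-score-weighted Hessian sample. For clarity it uses exact leverage scores and sampling with replacement (the algorithm only estimates the scores, as the next slide describes, and the slides state a Bernoulli variant); the sizes and the constant in front of $d\log d$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20000, 20
V = rng.normal(size=(m, d)) * rng.uniform(0.1, 3.0, size=(m, 1))   # rows of very different scales
alpha = rng.uniform(0.5, 1.5, size=m)          # per-sample curvature, H_k = alpha_k v_k v_k^T

H = (V * alpha[:, None]).T @ V                 # full Hessian: sum_k alpha_k v_k v_k^T

# Exact leverage scores tau_k = alpha_k v_k^T H^{-1} v_k; they sum to d,
# so p_k = tau_k / sum(tau) is a sampling distribution.
Hinv = np.linalg.inv(H)
tau = alpha * np.einsum('ij,jk,ik->i', V, Hinv, V)
p = tau / tau.sum()

c = int(10 * d * np.log(d))                    # O(d log d) sampled Hessians
idx = rng.choice(m, size=c, p=p)
B = np.zeros((d, d))
for k in idx:                                  # importance-weighted sum of sampled Hessians
    B += alpha[k] * np.outer(V[k], V[k]) / (c * p[k])

# Spectral check: eigenvalues of B^{-1/2} H B^{-1/2} should lie roughly in [1/2, 2]
w, U = np.linalg.eigh(B)
B_inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T
print(np.linalg.eigvalsh(B_inv_sqrt @ H @ B_inv_sqrt))
```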
Computing Leverage Scores Efficiently
- Computing leverage scores requires $\gamma_i = v_i^T H^{-1} v_i = \|A^{-1}v_i\|_2^2$, where $H = \sum_{k=1}^{d\log d} H_k = AA^T$.
- Instead, randomly sample $G \in \mathbb{R}^{O(\log m)\times d}$ with i.i.d. normal entries and compute $\tilde{\gamma}_i = \|GA^{-1}v_i\|_2^2$.
- By the Johnson-Lindenstrauss lemma, with high probability $\tilde{\gamma}_i \in \big[\tfrac{1}{2}\|A^{-1}v_i\|_2^2,\ 2\|A^{-1}v_i\|_2^2\big]$.
- All of this takes $O(d^2\log m + md)$ time. Note: $d^2 \le md \le \kappa d$.
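A sketch of this Johnson-Lindenstrauss estimate, using a Cholesky factor $L$ (so $H = LL^T$) in place of the factor $A$; sizes and the constant in the sketch dimension are our own, and the savings matter in the regime where $d$ is much larger than $\log m$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5000, 300
V = rng.normal(size=(m, d))
H = V.T @ V                                    # stand-in for the sampled-Hessian sum, H = A A^T

# Exact generalized leverage scores: gamma_i = v_i^T H^{-1} v_i, one d x d solve per sample
exact = np.einsum('ij,ij->i', V @ np.linalg.inv(H), V)

# JL estimate: gamma_i ~ ||G L^{-1} v_i||^2 with a k x d Gaussian sketch, k = O(log m).
# After the one-time setup of M = G L^{-1}, every score costs O(k d) instead of O(d^2).
k = int(16 * np.log(m))
L = np.linalg.cholesky(H)
G = rng.normal(size=(k, d)) / np.sqrt(k)       # scaled so that E||G z||^2 = ||z||^2
M = G @ np.linalg.inv(L)                       # M = G L^{-1}, computed once
approx = np.sum((V @ M.T) ** 2, axis=1)        # ||M v_i||^2 for every sample i

ratio = approx / exact
print(ratio.min(), ratio.max())                # within a factor of ~2 with high probability
```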
LiSSA-Sample Analysis
- In the high-accuracy regime (i.e. $\epsilon$ small), LiSSA-Sample enjoys a convergence rate (with high probability) of
  $O\big(md\log\tfrac{1}{\epsilon} + (d + \sqrt{\kappa d})\,d\log^2\tfrac{1}{\epsilon}\log\log\tfrac{1}{\epsilon}\big) = \tilde{O}\big((md + d\sqrt{\kappa d})\log^2\tfrac{1}{\epsilon}\big)$ when $\kappa > \tfrac{m}{d}$ (using $d^2 \le md$).
- This is faster than accelerated first-order methods: $O\big((md + d\sqrt{\kappa m})\log\tfrac{1}{\epsilon}\big)$.
Extensions to Nonconvex Optimization
1. The same authors used a similar observation to extend the algorithm to non-convex optimization, proving a convergence rate (to a local minimum) that is faster than gradient descent.
2. Similar techniques, such as non-uniform sampling or sketching, could be used to speed up the Saddle-Free Newton method proposed by Dauphin et al., which takes steps $|H|^{-1}\nabla f(x)$.
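A minimal sketch of the saddle-free Newton direction $|H|^{-1}\nabla f(x)$ on a toy two-dimensional saddle; the example function and the eigenvalue floor are our own illustrative choices.

```python
import numpy as np

def saddle_free_direction(H, g):
    """|H|^{-1} g: replace the Hessian eigenvalues by their absolute values, so
    negative-curvature (saddle) directions are descended rather than ascended."""
    w, U = np.linalg.eigh(H)                   # H = U diag(w) U^T for symmetric H
    w_abs = np.maximum(np.abs(w), 1e-8)        # floor for numerical stability
    return U @ ((U.T @ g) / w_abs)

# Toy saddle f(x) = 0.5 x_0^2 - 0.5 x_1^2 at x = (0.5, -0.5): gradient is (0.5, 0.5).
# Plain Newton jumps straight to the saddle at the origin; the saddle-free step
# moves downhill along the negative-curvature coordinate instead.
x = np.array([0.5, -0.5])
H = np.diag([1.0, -1.0])
g = np.array([0.5, 0.5])
print("Newton:     ", x - np.linalg.solve(H, g))          # [0, 0]  (the saddle)
print("Saddle-free:", x - saddle_free_direction(H, g))    # [0, -1] (escapes the saddle)
```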
LiSSA Experiments
Empirical study: sketch size and convergence speed
Sketched Hessian vs. computing the exact Hessian
The red curves show that sketching introduces deviation in finding the optimal point; several independent trials are plotted to verify this. As the sketch size increases, the iterates converge to the center path.
References
- Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), Montreal, Canada, June 19-23, 2017.
- Naman Agarwal, Brian Bullins, and Elad Hazan. Second-Order Stochastic Optimization for Machine Learning in Linear Time. The Journal of Machine Learning Research, 18(1):4148-4187, 2017.
- Alekh Agarwal and Leon Bottou. A Lower Bound for the Optimization of Finite Sums. Journal of Machine Learning Research, 2015.
- Yossi Arjevani and Ohad Shamir. Oracle Complexity of Second-Order Methods for Finite-Sum Problems. arXiv preprint, 2016.
- Naman Agarwal et al. Finding Approximate Local Minima Faster than Gradient Descent. arXiv preprint, 2016.
- Cohen et al. Uniform Sampling for Matrix Approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), 2015.
- James Martens. Deep Learning via Hessian-Free Optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
- Dauphin et al. Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization. In Advances in Neural Information Processing Systems (NIPS), 2014.