Approximate Second Order Algorithms. Seo Taek Kong, Nithin Tangellamudi, Zhikai Guo


Why Second Order Algorithms?
- Invariance under affine transformations: stretching a function does not change the convergence of Newton's method. Example: consider $f(x) = x^2$ and $g(x) = f(x/2) = x^2/4$. Gradient descent takes smaller steps on the second function, whereas Newton's method solves both in a single step (worked out below).
- Consequently, second order methods potentially require less hyperparameter tuning and will hopefully improve training speed.
- First order methods that achieve the theoretical lower bound already exist. Can we improve further? A smaller number of iterations to converge may offset a higher per-iteration cost.
- Disadvantages:
  - What if $H(x)$ is not invertible? Use the Moore-Penrose pseudo-inverse.
  - Computing $H^{-1}(x)\nabla f(x)$ is expensive, so we approximate it.
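
To make the scaling point concrete, here is the single step written out for the two quadratics above, with gradient descent step size $\eta$:

\begin{align*}
\text{GD on } f:&\quad x_1 = x_0 - \eta f'(x_0) = (1 - 2\eta)\,x_0, \\
\text{GD on } g:&\quad x_1 = x_0 - \eta g'(x_0) = \bigl(1 - \tfrac{\eta}{2}\bigr)\,x_0, \\
\text{Newton on } f:&\quad x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)} = x_0 - \frac{2x_0}{2} = 0, \\
\text{Newton on } g:&\quad x_1 = x_0 - \frac{g'(x_0)}{g''(x_0)} = x_0 - \frac{x_0/2}{1/2} = 0.
\end{align*}

Gradient descent's progress depends on the scaling (an $\eta$ tuned for $f$ is not right for $g$), while Newton's step lands on the minimizer $x^* = 0$ in one iteration for both.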

Stochastic Newton Step?
- Suppose we are minimizing $f(x) = \sum_{k=1}^{m} f_k(x)$ with each $f_k$ $\mu$-strongly convex and $L$-smooth.
- For the analysis of second order algorithms we also need another constraint: $H(x)$ is $M$-Lipschitz, i.e. $\|H(x) - H(y)\| \le M\,\|x - y\|$.
- Naive generalization of SGD: step along $H_k^{-1}(x)\,\nabla f_k(x)$ for a randomly sampled $k$. The stochastic estimation of the curvature hurts performance.

Hessian-Free (HF) Optimization
- To avoid computing $H$ explicitly, we instead compute Hessian-vector products $Hv$ for arbitrary vectors $v$, which can be approximated for a small $\epsilon$ by
  $$Hv \approx \frac{\nabla f(x + \epsilon v) - \nabla f(x)}{\epsilon}.$$
- To avoid inverting $H$ to obtain the Newton step from $Hy = -\nabla f(x)$, we minimize the local quadratic model
  $$q_x(y) = f(x) + y^T \nabla f(x) + \tfrac{1}{2}\, y^T H y$$
  over $y$ using Conjugate Gradient (CG), which only ever needs products $Hy$.
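
As an illustration only (a minimal sketch, not code from the cited papers), the following puts the two pieces together: a finite-difference Hessian-vector product and a plain conjugate gradient loop that solves $Hy = -\nabla f(x)$ without ever forming $H$. The toy quadratic and all constants are arbitrary choices.

```python
# Hessian-free Newton step on a toy problem: H is never formed; only
# Hessian-vector products Hv are used, approximated by finite differences.
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    """Finite-difference approximation of the Hessian-vector product H(x) v."""
    return (grad_fn(x + eps * v) - grad_fn(x)) / eps

def cg_solve(matvec, b, max_iter=50, tol=1e-8):
    """Conjugate Gradient for A y = b, given only the map y -> A y."""
    y = np.zeros_like(b)
    r = b - matvec(y)                      # residual
    p = r.copy()                           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        y += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return y

# One Hessian-free Newton step for f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
x = np.zeros(2)
step = cg_solve(lambda v: hvp(grad, x, v), -grad(x))   # solves H y = -grad f(x)
x_new = x + step                                       # minimizer of the quadratic
```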

Hessian-Free Optimization (continued)
- Off-the-shelf HF algorithms are not feasible for large scale problems. Several modifications make them practical:
- Damping: make a more conservative curvature estimate by adding $\lambda \|d\|^2$ to the curvature term, with $\lambda$ adjusted according to the reduction ratio
  $$\rho = \frac{f(x + p) - f(x)}{q_x(p) - q_x(0)}.$$
- Computing matrix-vector products: use $Gv$ instead of $Hv$, where $G$ is the Gauss-Newton approximation of the Hessian, which is positive semidefinite.
- Terminating conditions for CG: CG finds the solution of $Ax = b$ not by minimizing $\|Ax - b\|^2$ but by minimizing the quadratic $\phi(x) = \tfrac{1}{2} x^T A x - b^T x$. $\phi(x)$ decreases at every step, whereas $\|Ax - b\|^2$ fluctuates a lot before tending towards 0. Terminate when the relative improvement of $\phi(x)$ over the last $k$ steps drops below $k\epsilon$ for a small constant $\epsilon$ (sketched below).
- Many methods exist to better precondition CG.
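
A small sketch of the $\phi$-based termination test described above; the window length $k$ and the tolerance are illustrative values, not necessarily the exact constants used in practice.

```python
# Stop CG when the relative improvement of phi(x) = 0.5 x^T A x - b^T x
# over the last k iterations falls below k * eps.
def should_stop(phi_values, k=10, eps=5e-4):
    """phi_values: phi evaluated after each CG iteration so far."""
    i = len(phi_values) - 1
    if i < k or phi_values[i] >= 0:        # need a full window and phi < 0
        return False
    # phi is decreasing and negative, so this ratio is the relative progress
    return (phi_values[i] - phi_values[i - k]) / phi_values[i] < k * eps
```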

Lower Bounds
- First order methods require $\Omega\bigl(m + \sqrt{m\kappa}\,\log\tfrac{1}{\epsilon}\bigr)$ oracle calls to the gradient to achieve an $\epsilon$-approximate solution.
  - Linear dependence on the condition number, $O\bigl((m + \kappa)\log\tfrac{1}{\epsilon}\bigr)$, is obtained by SVRG, SAGA, ...
  - The minimal bound is attained by Katyusha and AccSDCA.
- Second order methods, under the following oracle model:
  - The algorithm can use at most $m/2$ Hessians per update.
  - Indices $k \in [m]$ are sampled uniformly at random.
  - Input dimension of the hard construction: $d = O(1 + \sqrt{\kappa m})$.
  - Oracle calls: $\Omega\bigl(m + \sqrt{m\kappa}\,\log\tfrac{1}{\epsilon}\bigr)$.
  - A randomized construction improves the bound by a logarithmic factor.

Lower Bounds: Discussion
- This lower bound suggests that second order algorithms cannot improve the rate of optimization by much.
- Moreover, even with such oracles, a plain second order method must compute the Hessian ($O(md^2)$) and invert it ($O(d^3)$), so simple second order algorithms are not attractive.
- Because of the assumption that the algorithm cannot use the Hessians of all samples $\{H_k(x)\}_{k \in [m]}$, we lose the quadratic convergence rate of exact Newton.
- Suggestion: an algorithm that does not satisfy this assumption may achieve faster convergence than the bound presented. LiSSA-Sample uses leverage score sampling to pick Hessians non-uniformly and, in the high accuracy regime, achieves a convergence rate faster than any first order algorithm.

Overview
- Objective: $\min_x f(x) = \frac{1}{m}\sum_{k=1}^{m} f_k(x) + \lambda \|x\|^2$, where each $f_k(x)$ is $\mu$-strongly convex, $L$-smooth, and has an $M$-Lipschitz Hessian.
- LiSSA and LiSSA-Sample focus on Generalized Linear Models (GLMs): $f_k(x) = \ell(v_k^T x, y_k)$ with $\ell$ $\mu$-strongly convex and $L$-smooth, e.g. linear regression with the mean squared error loss on data $(v_k, y_k)$.
- This results in rank-one per-sample Hessians $H_k(x) = \alpha_k v_k v_k^T$ (derived below).
- Condition number: $\kappa = \dfrac{\max_x \lambda_{\max}\bigl(\nabla^2 f(x)\bigr)}{\min_x \lambda_{\min}\bigl(\nabla^2 f(x)\bigr)}$.
- $\mu$-strong convexity and $L$-smoothness imply $\mu I \preceq \nabla^2 f(x) \preceq L I$.
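
For completeness, the rank-one form of the per-sample Hessian follows from the chain rule (writing $\ell'$, $\ell''$ for derivatives of $\ell$ with respect to its first argument, and ignoring the regularizer, which only adds a multiple of $I$):

\begin{align*}
f_k(x) &= \ell(v_k^{\top} x,\, y_k), \\
\nabla f_k(x) &= \ell'(v_k^{\top} x,\, y_k)\, v_k, \\
H_k(x) = \nabla^2 f_k(x) &= \ell''(v_k^{\top} x,\, y_k)\, v_k v_k^{\top} = \alpha_k\, v_k v_k^{\top}, \qquad \alpha_k := \ell''(v_k^{\top} x,\, y_k).
\end{align*}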

LiSSA Description (applies to LiSSA and LiSSA-Sample)
- Key Idea 1 (Estimator): avoid direct inversion of the Hessian by using the recursive form of the matrix Taylor (Neumann) series,
  $$A^{-1} = \sum_{i=0}^{\infty} (I - A)^{i}, \qquad \text{valid for } 0 \prec A \preceq I.$$
- Key Idea 2 (Concentration): with a sufficient number of random matrices, a matrix Bernstein inequality gives a tail bound of the form: for i.i.d. $A_1, \dots, A_n \sim P_A$ with $A_i \in \mathbb{R}^{d \times d}$, $\mathbb{E}[A_i] = 0$ and $\|A_i\| \le R$,
  $$P\!\left(\Bigl\|\tfrac{1}{n}\sum_{i=1}^{n} A_i\Bigr\| \ge t\right) \le d \exp\!\left(-\frac{n t^2}{4 R^2}\right).$$
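
Written recursively, the truncated series used by the estimator is

$$\hat{A}^{-1}_{0} = I, \qquad \hat{A}^{-1}_{j} = I + (I - A)\,\hat{A}^{-1}_{j-1} = \sum_{i=0}^{j} (I - A)^{i} \;\longrightarrow\; A^{-1} \ \text{as } j \to \infty, \quad \text{for } 0 \prec A \preceq I.$$

LiSSA applies this recursion to the vector $\nabla f(x)$ rather than to full matrices, replacing $A$ at each step by the Hessian of a freshly sampled $f_k$, scaled so that its spectrum lies in $(0, 1]$.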

LiSSA Pseudocode
- Run any (fast) first order algorithm to obtain $x_0$ such that $\|x_0 - x^*\| \le \frac{1}{4 \kappa_\ell M}$ (in practice, use some estimate).
- For each iteration $t = 0, \dots, T-1$:
  - Compute the full gradient $\nabla f(x_t) = \frac{1}{m}\sum_k \nabla f_k(x_t)$.
  - For each $i \in [S_1]$, where $S_1$ is a parameter: initialize $X_i = \nabla f(x_t)$ and iterate the inner loop $2\kappa_\ell \ln(2\kappa_\ell)$ times: compute the Hessian $\tilde{H}$ of a single random sample and update $X_i = \nabla f(x_t) + (I - \tilde{H})\, X_i$.
  - Update $x_{t+1} = x_t - \frac{1}{S_1} \sum_{i \in [S_1]} X_i$.
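
A minimal, self-contained sketch of the idea on a toy regularized least-squares problem (the data, the row scaling, and the values $S_1 = 5$, $S_2 = 100$ are illustrative choices, not the constants from the paper):

```python
# LiSSA-style estimate of the Newton direction H^{-1} grad f(x) for
# f(x) = (1/2m) sum_k (v_k^T x - y_k)^2 + (lam/2) ||x||^2.
# Per-sample Hessians H_k = v_k v_k^T + lam*I are used one at a time
# in the truncated Neumann recursion  X <- grad + (I - H_k) X.
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 200, 10, 0.1
V = rng.normal(size=(m, d))
V = 0.7 * V / np.linalg.norm(V, axis=1, keepdims=True)   # ||H_k|| < 1 (Neumann series)
y = V @ rng.normal(size=d) + 0.01 * rng.normal(size=m)

def full_grad(x):
    return V.T @ (V @ x - y) / m + lam * x

def lissa_direction(x, S1=5, S2=100):
    """Average of S1 independent chains, each of depth S2."""
    g = full_grad(x)
    estimates = []
    for _ in range(S1):
        X = g.copy()
        for _ in range(S2):
            k = rng.integers(m)                       # one random sample
            Hk_X = V[k] * (V[k] @ X) + lam * X        # H_k X without forming H_k
            X = g + X - Hk_X                          # X <- grad + (I - H_k) X
        estimates.append(X)
    return np.mean(estimates, axis=0)

x = np.zeros(d)
for t in range(10):                                   # outer Newton-like iterations
    x = x - lissa_direction(x)
```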

LiSSA Analysis
- Time complexity (number of iterations times per-iteration cost): reaching $f(x_T) - f(x^*) \le \epsilon$ requires time $O\bigl((m + \kappa_\ell^{3})\, d\, \log\tfrac{1}{\epsilon}\bigr)$ for small $\epsilon$, with high probability.

LiSSA-Sample Description
- Key Idea 1: Hessian sketch. For a sketch $B$ of the Hessian,
  $$B H^{-1} \nabla f(x) = \arg\min_y \Bigl\{ -\nabla f(x)^T y + \tfrac{1}{2}\, y^T H(x) B^{-1} y \Bigr\}.$$
- Leverage score: a measure of how much a sample deviates from the other observations.
- Sample $O(d \log d)$ Hessians uniformly at random, without replacement, and use them to compute (generalized) leverage scores for all samples.
- Then form $B = \sum_{k=1}^{m} \frac{1}{p_k}\, H_k \cdot \mathrm{Bernoulli}(p_k)$, where $p_k \propto$ (leverage score of sample $k$).

LiSSA-Sample Pseudocode
Repeat $O\bigl(\log\log\tfrac{1}{\epsilon}\bigr)$ times:
1. Sample $O(d \log d)$ Hessians $H_k$ with probabilities $p_k \propto$ leverage score (reweighted as on the previous slide) to compute $B = \sum_k H_k(x)$ such that $\tfrac{1}{2} B \preceq H(x) \preceq 2B$.
2. Approximately minimize the quadratic objective: $y \approx B H^{-1} \nabla f(x) = \arg\min_y \bigl\{ -\nabla f(x)^T y + \tfrac{1}{2}\, y^T H(x) B^{-1} y \bigr\}$.
3. Approximately solve for $u$: $H^{-1} \nabla f(x) \approx u \approx B^{-1} y$ (if $y = B H^{-1}\nabla f(x)$ exactly, then $B^{-1} y = H^{-1}\nabla f(x)$, so this recovers the Newton direction).
4. Update: $x \leftarrow x - u$.

Computing Leverage Scores Efficiently
- Computing the leverage scores requires
  $$\gamma_i = v_i^T H^{-1} v_i = \|A^{-1} v_i\|_2^2, \qquad \text{where } H = \sum_{k=1}^{O(d \log d)} H_k = A A^T.$$
- Instead, randomly sample $G \in \mathbb{R}^{O(\log(md)) \times d}$ with i.i.d. normal entries and compute $\tilde{\gamma}_i = \|G A^{-1} v_i\|_2^2$.
- By the Johnson-Lindenstrauss lemma, with high probability $\tilde{\gamma}_i \in \bigl[\tfrac{1}{2}\|A^{-1} v_i\|_2^2,\; 2\|A^{-1} v_i\|_2^2\bigr]$.
- All of this takes $O\bigl(d^2 \log(md) + md + d\bigr)$. Note: $d^2 \le md \le \kappa d$ in the regime of interest.
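
A small numerical sketch of the Johnson-Lindenstrauss shortcut above; the matrix $A$ here is an arbitrary well-conditioned stand-in for a factor of $H = A A^T$, and the constant in the $O(\log m)$ sketch size is an illustrative choice.

```python
# Approximate leverage-type scores gamma_i = ||A^{-1} v_i||^2 for many v_i
# by projecting with a Gaussian sketch G of O(log m) rows.
import numpy as np

rng = np.random.default_rng(0)
m, d = 5000, 50
V = rng.normal(size=(m, d))                    # rows v_i
W = rng.normal(size=(d, d))
A = W @ W.T / d + np.eye(d)                    # well-conditioned stand-in factor

# Exact scores: gamma_i = ||A^{-1} v_i||_2^2 (one linear solve per column).
exact = np.linalg.norm(np.linalg.solve(A, V.T), axis=0) ** 2

# JL approximation with r = O(log m) Gaussian rows; the 1/sqrt(r) scaling
# makes E ||G z||^2 = ||z||^2.
r = int(40 * np.log(m))
G = rng.normal(size=(r, d)) / np.sqrt(r)
GAinv = G @ np.linalg.inv(A)                   # r x d, computed once
approx = np.linalg.norm(GAinv @ V.T, axis=0) ** 2

ratio = approx / exact                         # concentrates around 1,
print(ratio.min(), ratio.max())                # typically well within a factor of 2
```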

LiSSA-Sample Analysis
- In the high accuracy regime (i.e. small $\epsilon$), LiSSA-Sample enjoys, with high probability, a total running time of
  $$O\!\left( md \log\tfrac{1}{\epsilon} + \bigl(d + \sqrt{\kappa d}\bigr)\, d \,\log^2\tfrac{1}{\epsilon}\, \log\log\tfrac{1}{\epsilon} \right) = \tilde{O}\!\left( \bigl(md + d\sqrt{\kappa d}\bigr) \log^2\tfrac{1}{\epsilon} \right) \quad \text{when } \kappa > \tfrac{m}{d} \;(\text{and } d^2 \le md).$$
- This is faster than accelerated first order methods: $O\!\left( \bigl(md + d\sqrt{\kappa m}\bigr) \log\tfrac{1}{\epsilon} \right)$.

Extensions to Nonconvex Optimization
1. The same authors used a similar observation to extend the algorithm to non-convex optimization, proving a convergence rate (to a local minimum) that is faster than gradient descent.
2. Similar techniques, such as non-uniform sampling or sketching, can be used with the Saddle-Free Newton method proposed by Dauphin et al., whose step is $|H|^{-1}\nabla f(x)$, i.e. the Hessian with its eigenvalues replaced by their absolute values.
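
For illustration only (not from the papers above), a minimal sketch of the saddle-free Newton direction via an eigendecomposition, on the classic saddle $f(x, y) = x^2 - y^2$; the eigenvalue floor `eps` is my own choice.

```python
# Saddle-free Newton direction |H|^{-1} grad: eigendecompose H and replace
# each eigenvalue by its absolute value (floored away from zero).
import numpy as np

def saddle_free_step(H, grad, eps=1e-6):
    eigvals, eigvecs = np.linalg.eigh(H)                 # H assumed symmetric
    abs_inv = 1.0 / np.maximum(np.abs(eigvals), eps)
    return eigvecs @ (abs_inv * (eigvecs.T @ grad))

# Near the saddle (0, 0) of f(x, y) = x^2 - y^2:
H = np.array([[2.0, 0.0], [0.0, -2.0]])
g = np.array([0.2, -0.2])                                # gradient at (0.1, 0.1)
print(saddle_free_step(H, g))                            # [0.1, -0.1]
# Plain Newton, H^{-1} g = [0.1, 0.1], would jump straight to the saddle;
# the saddle-free step instead moves downhill along the negative-curvature direction.
```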

LiSSA Experiments

Empirical study: sketch size and convergence speed

Sketched Hessian vs. computing the exact Hessian

Red curves: using a sketched Hessian introduces deviation in the located optimum; independent trials are shown to verify this. As the sketch size increases, the trajectories converge to the central path.

References
- Zeyuan Allen-Zhu. Katyusha: The First Direct Acceleration of Stochastic Gradient Methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), Montreal, Canada, June 2017.
- Naman Agarwal, Brian Bullins, and Elad Hazan. Second-Order Stochastic Optimization for Machine Learning in Linear Time. Journal of Machine Learning Research, 18(1):4148-4187, 2017.
- Alekh Agarwal and Leon Bottou. A Lower Bound for the Optimization of Finite Sums. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
- Yossi Arjevani and Ohad Shamir. Oracle Complexity of Second-Order Methods for Finite-Sum Problems. arXiv preprint, 2016.
- Naman Agarwal et al. Finding Approximate Local Minima Faster than Gradient Descent. arXiv preprint, 2016.
- Michael B. Cohen et al. Uniform Sampling for Matrix Approximation. In Proceedings of the 6th Conference on Innovations in Theoretical Computer Science (ITCS), 2015.
- James Martens. Deep Learning via Hessian-Free Optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
- Yann N. Dauphin et al. Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization. In Advances in Neural Information Processing Systems (NIPS), 2014.