Stochastic Variance-Reduced Cubic Regularized Newton Method

Dongruo Zhou, Pan Xu and Quanquan Gu
Dongruo Zhou: Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22904, USA; e-mail: dz4bd@virginia.edu
Pan Xu: Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA; e-mail: px3ds@virginia.edu
Quanquan Gu: Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA; e-mail: qg5w@virginia.edu
arXiv:1802.04796v1 [cs.LG] 13 Feb 2018. February 9, 2018.

Abstract

We propose a stochastic variance-reduced cubic regularized Newton method for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient, along with a semi-stochastic Hessian, which are specifically designed for the cubic regularization method. We show that our algorithm is guaranteed to converge to an (ε, √ε)-approximate local minimum within Õ(n^{4/5}/ε^{3/2}) second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. Our work also sheds light on the application of variance reduction techniques to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory.

1 Introduction

We study the following finite-sum optimization problem:

    min_{x ∈ R^d} F(x) = (1/n) Σ_{i=1}^n f_i(x),    (1.1)

where F(x) and each f_i(x) can be non-convex. Such problems are common in machine learning, where each f_i(x) is a loss function on a training example (LeCun et al., 2015). Since F(x) is non-convex, finding its global minimum is generally NP-hard (Hillar and Lim, 2013). As a result, one possible goal is to find an approximate first-order stationary point (ε-stationary point):

    ‖∇F(x)‖_2 ≤ ε,

for some given ε > 0. A lot of studies have been devoted to this problem, including gradient descent (GD), stochastic gradient descent (SGD) (Robbins and Monro, 1951), and their extensions (Ghadimi and Lan, 2013; Reddi et al., 2016a; Allen-Zhu and Hazan, 2016; Ghadimi and Lan, 2016).
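To make the finite-sum setting and the ε-stationarity check concrete, here is a minimal Python sketch; the quadratic per-example loss and all function names are illustrative choices of ours, not objectives used in this paper.

    import numpy as np

    def make_finite_sum(A, b):
        """Toy finite-sum objective F(x) = (1/n) sum_i f_i(x), with
        f_i(x) = 0.5 * (a_i^T x - b_i)^2 standing in for a per-example loss."""
        n = A.shape[0]
        F = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
        grad_F = lambda x: A.T @ (A @ x - b) / n
        return F, grad_F

    def is_eps_stationary(grad_F, x, eps):
        """First-order stationarity check: ||grad F(x)||_2 <= eps."""
        return np.linalg.norm(grad_F(x)) <= eps

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
    F, grad_F = make_finite_sum(A, b)
    print(is_eps_stationary(grad_F, np.zeros(5), eps=1e-3))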

Nevertheless, first-order stationary points can be non-degenerate saddle points or even local maxima in non-convex optimization, which are undesirable. Therefore, a more reasonable objective is to find an approximate second-order stationary point (Nesterov and Polyak, 2006), which is also known as an (ε_g, ε_h)-approximate local minimum of F(x):

    ‖∇F(x)‖_2 ≤ ε_g,    λ_min(∇²F(x)) ≥ −ε_h,    (1.2)

for some given constants ε_g, ε_h > 0. In fact, in some machine learning problems like matrix completion (Ge et al., 2016), every local minimum is a global minimum, suggesting that finding an approximate local minimum is a better choice than a stationary point, and is good enough in many applications.

One of the most popular methods to achieve this goal is perhaps the cubic regularized Newton method, which was introduced by Nesterov and Polyak (2006) and solves the following kind of subproblem in each iteration:

    h(x) = argmin_{h ∈ R^d} m(h, x) = ⟨∇F(x), h⟩ + (1/2)⟨∇²F(x)h, h⟩ + (θ/6)‖h‖_2^3,    (1.3)

where θ > 0 is a regularization parameter. Nesterov and Polyak (2006) proved that, fixing a starting point x_0 and performing the update rule x_t = x_{t−1} + h(x_{t−1}), the algorithm outputs a sequence {x_t} that converges to a local minimum provided that the function is Hessian Lipschitz. However, to solve the subproblem (1.3) one needs to calculate the full gradient ∇F(x) and Hessian ∇²F(x), which is a big overhead in large-scale machine learning problems because n is often very large.

Some recent studies presented various algorithms to avoid the calculation of the full gradient and Hessian in cubic regularization. Kohler and Lucchi (2017) used a subsampling technique to obtain approximate gradients and Hessians instead of exact ones, and Xu et al. (2017b) also used subsampled Hessians. Both of them can reduce the computational complexity in some circumstances. However, just like other sampling-based algorithms such as subsampled Newton methods (Erdogdu and Montanari, 2015; Xu et al., 2016; Roosta-Khorasani and Mahoney, 2016a,b; Ye et al., 2017), their convergence rates are worse than that of the Newton method, especially when one needs a high-accuracy solution (i.e., when the optimization error ε is small). This is because the subsampling size needed to achieve a certain accuracy may be even larger than the full sample size n. Therefore, a natural question arises:

When we need a high-accuracy local minimum, is there an algorithm that can output an approximate local minimum with better second-order oracle complexity than the cubic regularized Newton method?

In this paper, we give an affirmative answer to the above question. We propose a novel cubic regularization algorithm named Stochastic Variance-Reduced Cubic regularization (SVRC), which incorporates variance reduction techniques (Johnson and Zhang, 2013; Xiao and Zhang, 2014; Allen-Zhu and Hazan, 2016; Reddi et al., 2016a) into the cubic-regularized Newton method. The key component in our algorithm is a novel semi-stochastic gradient, together with a semi-stochastic Hessian, that are specifically designed for cubic regularization.
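To illustrate one update of the cubic regularized Newton method, the sketch below builds the model m(h, x) from (1.3) and minimizes it with plain gradient descent; this solver choice and its step-size heuristic are assumptions for illustration (gradient-descent subproblem solvers are discussed in the related work), not a prescription of this paper.

    import numpy as np

    def solve_cubic_subproblem(g, H, theta, steps=200, lr=None):
        """Approximately minimize m(h) = <g, h> + 0.5 <H h, h> + (theta/6) ||h||^3
        by gradient descent on m; grad m(h) = g + H h + (theta/2) ||h|| h."""
        if lr is None:
            lr = 1.0 / (np.linalg.norm(H, 2) + theta + 1e-12)  # crude step size
        h = np.zeros_like(g)
        for _ in range(steps):
            h -= lr * (g + H @ h + 0.5 * theta * np.linalg.norm(h) * h)
        return h

    def cubic_newton_step(x, grad_F, hess_F, theta):
        """One full cubic regularized Newton update x <- x + argmin_h m(h, x)."""
        return x + solve_cubic_subproblem(grad_F(x), hess_F(x), theta)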

Furthermore, we prove that, for Hessian Lipschitz functions, to attain an (ε, √(ρε))-approximate local minimum, our proposed algorithm requires O(n + n^{4/5}/ε^{3/2}) Second-order Oracle (SO) calls and O(1/ε^{3/2}) Cubic Subproblem Oracle (CSO) calls. Here an SO call represents an evaluation of the triple (f_i(x), ∇f_i(x), ∇²f_i(x)), and a CSO call denotes an evaluation of the exact (or inexact) solution of the cubic subproblem (1.3). Compared with the original cubic regularization algorithm (Nesterov and Polyak, 2006), which requires O(n/ε^{3/2}) SO calls and O(1/ε^{3/2}) CSO calls, our proposed algorithm reduces the SO calls by a factor of Ω(n^{1/5}). We also carry out experiments on real data to demonstrate the superior performance of our algorithm.

Our major contributions are summarized as follows:

- We present a novel cubic regularization method with improved oracle complexity. To the best of our knowledge, this is the first algorithm that outperforms cubic regularization without any loss in convergence rate. This is in sharp contrast to subsampled cubic regularization methods (Kohler and Lucchi, 2017; Xu et al., 2017a), which suffer from worse convergence rates than cubic regularization.

- We also extend our algorithm to the case with inexact solutions to the cubic regularization subproblem. Similar to previous work (Cartis et al., 2011; Xu et al., 2017a), we lay out a set of sufficient conditions under which the output of the inexact algorithm is still guaranteed to have the same convergence rate and oracle complexity as the exact algorithm. This further sheds light on the practical implementation of our algorithm.

- As far as we know, our work is the first to rigorously demonstrate the advantage of variance reduction for second-order optimization algorithms. Although there exist a few studies (Lucchi et al., 2015; Moritz et al., 2016; Rodomanov and Kropotov, 2016) using variance reduction to accelerate Newton's method, none of them delivers a faster rate of convergence than the standard Newton method.

Notation. We use [n] to denote the index set {1, 2, ..., n}. We use ‖v‖_2 to denote the vector Euclidean norm. For a symmetric matrix H ∈ R^{d×d}, we denote its eigenvalues by λ_1(H) ≥ ... ≥ λ_d(H), its spectral norm by ‖H‖_2 = max{|λ_1(H)|, |λ_d(H)|}, and its Schatten r-norm by ‖H‖_{S_r} = (Σ_{i=1}^d |λ_i(H)|^r)^{1/r} for r ≥ 1. We write A ⪯ B if λ_1(A − B) ≤ 0 for symmetric matrices A, B ∈ R^{d×d}. We also note that ‖A − B‖_2 ≤ C implies −C·I ⪯ A − B ⪯ C·I for any C > 0. We call ξ a Rademacher random variable if P(ξ = 1) = P(ξ = −1) = 1/2. We use f_n = O(g_n) to denote that f_n ≤ C g_n for some constant C > 0, and f_n = Õ(g_n) to hide logarithmic factors in g_n.

2 Related Work

In this section, we briefly review the relevant work in the literature.

The most related work to ours is the cubic regularized Newton method, which was originally proposed in Nesterov and Polyak (2006). Cartis et al. (2011) presented an adaptive framework for cubic regularization, which uses an adaptive estimate of the local Lipschitz constant and an approximate solution to the cubic subproblem. To overcome the computational burden of gradient and Hessian matrix evaluations, Kohler and Lucchi (2017); Xu et al. (2017b,a) proposed to use subsampled gradients and Hessians in cubic regularization.
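For contrast with the semi-stochastic estimators introduced in Section 3, the following minimal sketch shows the plain subsampled gradient and Hessian used by the subsampled methods above; the per-example oracles grad_fi(i, x) and hess_fi(i, x) are assumed to be user-supplied callables.

    import numpy as np

    def subsampled_grad_hess(x, grad_fi, hess_fi, n, b_g, b_h, rng):
        """Plain subsampling: average the gradient/Hessian over random
        mini-batches, with no variance-reduction correction."""
        Ig = rng.choice(n, size=b_g, replace=False)
        Ih = rng.choice(n, size=b_h, replace=False)
        g = np.mean([grad_fi(i, x) for i in Ig], axis=0)
        H = np.mean([hess_fi(i, x) for i in Ih], axis=0)
        return g, H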

On the other hand, in order to solve the cubic subproblem (1.3) more efficiently, Carmon and Duchi (2016) proposed to use gradient descent, while Agarwal et al. (2017) proposed a sophisticated algorithm based on approximate matrix inverse and approximate PCA. Tripuraneni et al. (2017) proposed a refined stochastic cubic regularization algorithm based on the above subproblem solvers. However, none of the aforementioned variants of cubic regularization outperforms the original cubic regularization method in terms of oracle complexity.

Another important line of related research is variance reduction, which has been extensively studied for large-scale finite-sum optimization problems. Variance reduction was first proposed for convex finite-sum optimization (Roux et al., 2012; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014); it uses a semi-stochastic gradient to reduce the variance of the stochastic gradient and improves the gradient complexity of both stochastic gradient descent (SGD) and gradient descent (GD). Representative algorithms include Stochastic Average Gradient (SAG) (Roux et al., 2012), Stochastic Variance Reduced Gradient (SVRG) (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014), to mention a few. For non-convex finite-sum optimization, Garber and Hazan (2015); Shalev-Shwartz (2016) proposed algorithms for the setting where each individual function may be non-convex but their sum is still convex. Later on, Reddi et al. (2016a) and Allen-Zhu and Hazan (2016) extended the SVRG algorithm to general non-convex finite-sum optimization, which outperforms SGD and GD in terms of gradient complexity as well. However, to the best of our knowledge, it is still an open problem whether variance reduction can also improve the oracle complexity of second-order optimization algorithms.

Last but not least is the line of research which aims to escape from non-degenerate saddle points by finding a negative curvature direction. There is a vast literature on algorithms that escape saddle points by using gradient and negative curvature information instead of considering the subproblem (1.3). Ge et al. (2015) and Jin et al. (2017a) showed that simple (stochastic) gradient descent with perturbation can escape from saddle points. Carmon et al. (2016); Royer and Wright (2017); Allen-Zhu (2017) showed that by calculating the negative curvature using Hessian information, one can find an (ε, √ε)-local minimum faster than with first-order methods. Recent work (Allen-Zhu and Li, 2017; Jin et al., 2017b; Xu et al., 2017c) proposed first-order algorithms that can escape from saddle points without using Hessian information.

For a better comparison of our algorithm with the most related algorithms in terms of SO and CSO oracle complexities, we summarize the results in Table 1. It can be seen from Table 1 that our algorithm (SVRC) achieves the lowest SO and CSO oracle complexity compared with the original cubic regularization method (Nesterov and Polyak, 2006), which employs full gradient and Hessian evaluations, and the subsampled cubic method (Kohler and Lucchi, 2017; Xu et al., 2017b). In particular, our algorithm reduces the SO oracle complexity of cubic regularization by a factor of n^{1/5} for finding an (ε, √ε)-local minimum. We provide a more detailed discussion in the main theory section.

^1 This is the refined rate proved by Xu et al. (2017b) for the subsampled cubic regularization algorithm proposed in Kohler and Lucchi (2017).

Table 1: Comparison between different methods for finding an (ε, √ε)-local minimum, in terms of second-order oracle (SO) complexity and cubic subproblem oracle (CSO) complexity.

    Algorithm | SO calls | CSO calls | Gradient Lipschitz | Hessian Lipschitz
    Cubic regularization (Nesterov and Polyak, 2006) | O(n/ε^{3/2}) | O(1/ε^{3/2}) | no | yes
    Subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017b) | Õ(n/ε^{3/2} + 1/ε^{5/2})^1 | O(1/ε^{3/2}) | yes | yes
    SVRC (this paper) | Õ(n + n^{4/5}/ε^{3/2}) | O(1/ε^{3/2}) | no | yes

3 The Proposed Algorithm

In this section, we present a novel algorithm which utilizes stochastic variance reduction techniques to improve the cubic regularization method. To reduce the computational burden of gradient and Hessian matrix evaluations in the cubic regularization update (1.3), subsampled gradients and Hessian matrices have been used in subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017b) and stochastic cubic regularization (Tripuraneni et al., 2017). Nevertheless, the stochastic gradient and Hessian matrix have large variances, which undermine the convergence performance. Inspired by SVRG (Johnson and Zhang, 2013), we propose to use a semi-stochastic version of the gradient and Hessian matrix, which controls the variances automatically.

Specifically, our algorithm has two loops. At the beginning of the s-th iteration of the outer loop, we denote x̃^s = x_0^{s+1}. We first calculate the full gradient g^s = ∇F(x̃^s) and Hessian matrix H^s = ∇²F(x̃^s), which are stored for further reference in the inner loop. At the t-th iteration of the inner loop, we calculate the following semi-stochastic gradient and Hessian matrix:

    v_t^{s+1} = (1/b_g) Σ_{i ∈ I_g} [∇f_i(x_t^{s+1}) − ∇f_i(x̃^s) + g^s] − [(1/b_g) Σ_{i ∈ I_g} (∇²f_i(x̃^s) − H^s)](x_t^{s+1} − x̃^s),    (3.1)
    U_t^{s+1} = (1/b_h) Σ_{j ∈ I_h} [∇²f_j(x_t^{s+1}) − ∇²f_j(x̃^s)] + H^s,    (3.2)

where I_g and I_h are batch index sets, and the batch sizes will be decided later. In each inner iteration, we solve the following cubic regularization subproblem:

    h_t^{s+1} = argmin_h m_t^{s+1}(h) = ⟨v_t^{s+1}, h⟩ + (1/2)⟨U_t^{s+1} h, h⟩ + (M_{s+1,t}/6)‖h‖_2^3.    (3.3)

Then we perform the update x_{t+1}^{s+1} = x_t^{s+1} + h_t^{s+1} in the t-th iteration of the inner loop.
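A minimal sketch of the semi-stochastic estimators (3.1) and (3.2) follows; the per-example oracles grad_fi(i, x) and hess_fi(i, x), returning ∇f_i(x) and ∇²f_i(x), are assumed to be user-supplied.

    import numpy as np

    def semi_stochastic_grad_hess(x, x_snap, g_snap, H_snap,
                                  grad_fi, hess_fi, n, b_g, b_h, rng):
        """Semi-stochastic gradient (3.1) and Hessian (3.2) built around the
        snapshot x_snap, at which the full gradient g_snap and full Hessian
        H_snap were precomputed."""
        Ig = rng.choice(n, size=b_g, replace=False)
        Ih = rng.choice(n, size=b_h, replace=False)
        # (3.1): batch gradient difference plus a second-order correction term
        grad_diff = np.mean([grad_fi(i, x) - grad_fi(i, x_snap) for i in Ig], axis=0)
        hess_corr = np.mean([hess_fi(i, x_snap) for i in Ig], axis=0) - H_snap
        v = grad_diff + g_snap - hess_corr @ (x - x_snap)
        # (3.2): batch Hessian difference around the snapshot Hessian
        U = np.mean([hess_fi(j, x) - hess_fi(j, x_snap) for j in Ih], axis=0) + H_snap
        return v, U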

The proposed algorithm is displayed in Algorithm 1.

Algorithm 1 Stochastic Variance-Reduced Cubic Regularization (SVRC)
1: Input: batch sizes b_g, b_h, penalty parameters M_{s,t} for s = 1,...,S, t = 0,...,T, starting point x̃^1.
2: Initialization
3: for s = 1,...,S do
4:     x_0^{s+1} = x̃^s
5:     g^s = ∇F(x̃^s) = (1/n) Σ_{i=1}^n ∇f_i(x̃^s),  H^s = ∇²F(x̃^s) = (1/n) Σ_{i=1}^n ∇²f_i(x̃^s)
6:     for t = 0,...,T−1 do
7:         Sample index sets I_g, I_h with |I_g| = b_g, |I_h| = b_h
8:         v_t^{s+1} = (1/b_g) Σ_{i ∈ I_g} [∇f_i(x_t^{s+1}) − ∇f_i(x̃^s) + g^s] − [(1/b_g) Σ_{i ∈ I_g} ∇²f_i(x̃^s) − H^s](x_t^{s+1} − x̃^s)
9:         U_t^{s+1} = (1/b_h) Σ_{j ∈ I_h} [∇²f_j(x_t^{s+1}) − ∇²f_j(x̃^s)] + H^s
10:        h_t^{s+1} = argmin_h ⟨v_t^{s+1}, h⟩ + (1/2)⟨U_t^{s+1} h, h⟩ + (M_{s+1,t}/6)‖h‖_2^3
11:        x_{t+1}^{s+1} = x_t^{s+1} + h_t^{s+1}
12:    end for
13:    x̃^{s+1} = x_T^{s+1}
14: end for
15: Output: x_out, chosen uniformly at random from the iterates x_t^s, t = 0,...,T, s = 1,...,S.

There are two notable features of our estimators of the full gradient and Hessian in each inner loop, compared with those used in SVRG (Johnson and Zhang, 2013). The first is that our gradient and Hessian estimators consist of mini-batches of stochastic gradients and Hessians. The second is that we use second-order information when constructing the gradient estimator v_t^{s+1}, while classical SVRG only uses first-order information to build it. Intuitively speaking, both features make for a more accurate estimate of the true gradient and Hessian with affordable oracle calls. Note that similar approximations of the gradient and Hessian matrix have been studied in recent work by Gower et al. (2017) and Wai et al. (2017), where this kind of estimator was used for traditional SVRG in the convex setting, which radically differs from our setting.
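Putting the pieces together, the following sketch mirrors the structure of Algorithm 1. It reuses the helpers semi_stochastic_grad_hess and solve_cubic_subproblem sketched earlier, keeps the penalty parameter constant at M, and omits practical details such as inexact subproblem termination; it is an illustration of the loop structure, not the tuned implementation used in Section 6.

    import numpy as np

    def svrc(x0, grad_fi, hess_fi, n, S, T, b_g, b_h, M, seed=0):
        """Two-loop structure of Algorithm 1 (SVRC)."""
        rng = np.random.default_rng(seed)
        x_snap = x0.copy()                      # tilde{x}^1
        iterates = []
        for s in range(S):
            # full gradient and Hessian at the snapshot point
            g_snap = np.mean([grad_fi(i, x_snap) for i in range(n)], axis=0)
            H_snap = np.mean([hess_fi(i, x_snap) for i in range(n)], axis=0)
            x = x_snap.copy()                   # x_0^{s+1} = tilde{x}^s
            for t in range(T):
                v, U = semi_stochastic_grad_hess(x, x_snap, g_snap, H_snap,
                                                 grad_fi, hess_fi, n, b_g, b_h, rng)
                h = solve_cubic_subproblem(v, U, M)   # cubic subproblem (3.3)
                x = x + h
                iterates.append(x.copy())
            x_snap = x                          # tilde{x}^{s+1} = x_T^{s+1}
        # the analysis outputs a uniformly random iterate
        return iterates[rng.integers(len(iterates))]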

4 Main Theory

We first lay down the following Hessian Lipschitz assumption, which is necessary for our analysis and widely used in the literature (Nesterov and Polyak, 2006; Xu et al., 2016; Kohler and Lucchi, 2017).

Assumption 4.1 (Hessian Lipschitz). There exists a constant ρ > 0 such that for all x, y and all i ∈ [n],

    ‖∇²f_i(x) − ∇²f_i(y)‖_2 ≤ ρ‖x − y‖_2.

In fact, this is the only assumption we need to prove our theoretical results. The Hessian Lipschitz assumption plays a central role in controlling how fast the second-order information changes. It is obvious that Assumption 4.1 implies the Hessian Lipschitzness of F, which, according to Nesterov and Polyak (2006), is also equivalent to the following lemma.

Lemma 4.2. Let the function F : R^d → R satisfy the ρ-Hessian Lipschitz assumption. Then for any x, y, h ∈ R^d, it holds that

    ‖∇²F(x) − ∇²F(y)‖_2 ≤ ρ‖x − y‖_2,
    F(x + h) ≤ F(x) + ⟨∇F(x), h⟩ + (1/2)⟨∇²F(x)h, h⟩ + (ρ/6)‖h‖_2^3,
    ‖∇F(x + h) − ∇F(x) − ∇²F(x)h‖_2 ≤ (ρ/2)‖h‖_2^2.

We then define the optimal function gap between the initial point x_0 and the global minimum of F.

Definition 4.3 (Optimal Gap). For the function F(·) and the initial point x_0, let Δ_F be

    Δ_F = inf{Δ ∈ R : F(x_0) − F* ≤ Δ},  where F* = inf_{x ∈ R^d} F(x).

Without loss of generality, we assume Δ_F < +∞ throughout this paper.

Before we present the nonasymptotic convergence results for Algorithm 1, we define

    µ(x_t^{s+1}) = max{ ‖∇F(x_t^{s+1})‖_2^{3/2}, −λ_min^3(∇²F(x_t^{s+1}))/[M_{s+1,t}]^{3/2} }.    (4.1)

By the definition in (4.1), µ(x_t^{s+1}) < ε^{3/2} holds if and only if

    ‖∇F(x_t^{s+1})‖_2 < ε,    λ_min(∇²F(x_t^{s+1})) > −√(M_{s+1,t} ε).    (4.2)

Therefore, in order to find an (ε, √(ρε))-local minimum of the non-convex function F, it suffices to find a point x_t^{s+1} which satisfies µ(x_t^{s+1}) < ε^{3/2}, with M_{s+1,t} = O(ρ) for all s, t.

Next we define our oracles formally.

Definition 4.4 (Second-order Oracle). Given an index i and a point x, one second-order oracle (SO) call returns the triple

    [f_i(x), ∇f_i(x), ∇²f_i(x)].    (4.3)

Definition 4.5 (Cubic Subproblem Oracle). Given a vector g ∈ R^d, a Hessian matrix H and a positive constant θ, one Cubic Subproblem Oracle (CSO) call returns h_sol, the exact solution

    h_sol = argmin_{h ∈ R^d} ⟨g, h⟩ + (1/2)⟨h, Hh⟩ + (θ/6)‖h‖_2^3.

Remark 4.6. The second-order oracle is a special form of the Information Oracle introduced by Nesterov, which returns the gradient, Hessian and all higher-order derivatives of the objective function F(x). Here, our second-order oracle only returns first- and second-order information at some point of a single component f_i instead of F. We argue that this is a reasonable adaptation because in this paper we focus on finite-sum objective functions. The Cubic Subproblem Oracle returns an exact or inexact solution of (3.3), which plays an important role in both theory and practice.
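For small problems, the progress measure µ(x) in (4.1) and the local-minimum test (4.2) can be evaluated directly; the sketch below uses a dense eigendecomposition, which is only sensible for moderate d.

    import numpy as np

    def mu(x, grad_F, hess_F, M):
        """mu(x) = max{ ||grad F(x)||^{3/2}, -lambda_min(hess F(x))^3 / M^{3/2} },
        as in (4.1); the second term only matters when lambda_min is negative."""
        g_norm = np.linalg.norm(grad_F(x))
        lam_min = np.linalg.eigvalsh(hess_F(x))[0]
        return max(g_norm ** 1.5, (-lam_min) ** 3 / M ** 1.5)

    def is_approx_local_min(x, grad_F, hess_F, M, eps):
        """Equivalent to (4.2): ||grad F(x)||_2 < eps and
        lambda_min(hess F(x)) > -sqrt(M * eps)."""
        return mu(x, grad_F, hess_F, M) < eps ** 1.5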

Now we are ready to give a general convergence result for Algorithm 1.

Theorem 4.7. Let A, B, α and β be arbitrary positive constants, and choose M_{s,t} = M for each s, t. Define the parameter sequences {Θ_t}_{t=0}^T and {c_t}_{t=0}^T as follows:

    c_t = (Θ_t/M^{3/2} + 1)·ρ^{3/2}/(A^{1/2} b_g^{3/4}) + (Θ_t/M^3 + 1)·Cρ^3 (log d)^{3/2}/(B^2 b_h^{3/2}) + c_{t+1}(1 + 1/α^2 + 2/β^{1/2}),
    Θ_t = (3M − 2ρ − 4A − 4B)/12 − c_{t+1}(1 + 2α + β),
    c_T = 0,    (4.4)

where ρ is the Hessian Lipschitz constant, M is the regularization parameter of Algorithm 1, and C is an absolute constant. If the batch size satisfies b_h > 25 log d, M = O(ρ), and Θ_t > 0 for all t, then the output of Algorithm 1 satisfies

    E[µ(x_out)] ≤ E[F(x_0) − F*]/(γ_n S T),    (4.5)

where γ_n = min_t Θ_t/(15M^{3/2}).

Remark 4.8. To ensure that x_out is an (ε, √(ρε))-local minimum, we can set the right-hand side of (4.5) to be less than ε^{3/2}. This immediately implies that the total iteration complexity of Algorithm 1 is S T = O(E[F(x_0) − F*]/ε^{3/2}), which matches the iteration complexity of cubic regularization (Nesterov and Polyak, 2006).

Remark 4.9. Note that there is a log d term in the expression of the parameter c_t, and it is only related to the Hessian batch size b_h. The log d term comes from matrix concentration inequalities and is believed to be unavoidable (Tropp et al., 2015). In other words, the Hessian batch size b_h has an inevitable dependence on the dimension d, unlike the gradient batch size b_g.

The iteration complexity result in Theorem 4.7 depends on the series of parameters defined in (4.4). In the following corollary, we show how to choose these parameters in practice to achieve a better oracle complexity.

Corollary 4.10. Let the batch sizes b_g and b_h satisfy b_g = b_h/log d = 1400 n^{2/5}. Set the parameters in Theorem 4.7 as A = B = 125ρ, α = 2n^{1/10}, β = 4n^{2/5}, with Θ_t and c_t defined as in (4.4). Let the cubic regularization parameter be M = 2000ρ and the epoch length be T = n^{1/5}. Then Algorithm 1 converges to an (ε, √(ρε))-local minimum within

    O(n + Δ_F √ρ n^{4/5}/ε^{3/2}) SO calls and O(Δ_F √ρ/ε^{3/2}) CSO calls.    (4.6)

Remark 4.11. Corollary 4.10 states that we can reduce the SO calls by setting the batch sizes b_g, b_h as functions of n. In contrast, in order to achieve an (ε, √(ρε))-local minimum, the original cubic regularization method of Nesterov and Polyak (2006) needs O(n/ε^{3/2}) second-order oracle calls, which is worse than ours by a factor of n^{1/5}. And subsampled cubic regularization (Kohler and Lucchi, 2017; Xu et al., 2017b) requires Õ(n/ε^{3/2} + 1/ε^{5/2}) SO calls, which is even worse.
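For reference, the parameter choices of Corollary 4.10 can be instantiated as below; the constants are copied verbatim from the corollary, and in practice they are tuned per dataset (see Section 6).

    import numpy as np

    def svrc_parameters(n, d, rho):
        """Batch sizes, epoch length and penalty parameter from Corollary 4.10."""
        b_g = int(np.ceil(1400 * n ** 0.4))      # b_g = 1400 n^{2/5}
        b_h = int(np.ceil(b_g * np.log(d)))      # b_h = b_g log d
        T = max(1, int(np.ceil(n ** 0.2)))       # epoch length T = n^{1/5}
        M = 2000 * rho                           # cubic penalty M = 2000 rho
        return b_g, b_h, T, M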

5 Practical Algorithm with Inexact Oracle

In practice, the exact solution to the cubic subproblem (3.3) often cannot be obtained. Instead, one can only get an approximate solution from some inexact solver. Thus we replace the CSO oracle in Definition 4.5 with the following inexact CSO oracle:

    h_sol ≈ argmin_{h ∈ R^d} ⟨g, h⟩ + (1/2)⟨h, Hh⟩ + (θ/6)‖h‖_2^3.

To analyze the performance of Algorithm 1 with an inexact cubic subproblem solver, we replace the exact solver in Line 10 of Algorithm 1 with

    h̃_t^{s+1} ≈ argmin_h m_t^{s+1}(h).    (5.1)

In order to characterize the above inexact solution, we present the following sufficient condition, under which the inexact solution still ensures the same oracle complexity as the exact solution.

Condition 5.1 (Inexact condition). For each s, t and a given δ > 0, we say that h̃_t^{s+1} satisfies the δ-inexact condition if

    ‖∇m_t^{s+1}(h̃_t^{s+1})‖_2 ≤ δ^{3/2},
    m_t^{s+1}(h̃_t^{s+1}) ≤ −(M_{s+1,t}/12)‖h̃_t^{s+1}‖_2^3 + δ,
    ‖h_t^{s+1}‖_2^3 ≤ ‖h̃_t^{s+1}‖_2^3 + δ,

where h_t^{s+1} denotes the exact minimizer of m_t^{s+1}.

Remark 5.2. Similar inexact conditions have been studied in the cubic regularization literature. For instance, Nesterov and Polyak (2006) presented a practical way to solve the cubic subproblem without a termination condition. Cartis et al. (2011); Kohler and Lucchi (2017) presented termination criteria for approximate solutions to the cubic subproblem, which are slightly different from our condition. In general, the termination criteria in Cartis et al. (2011); Kohler and Lucchi (2017) contain a non-linear equation, which is hard to verify and less practical. In contrast, our inexact condition only contains inequalities, which are easy to verify in practice.

Next we give the convergence result with the inexact CSO oracle.

Theorem 5.3. Let h̃_t^{s+1} be the output of each inner loop of Algorithm 1 satisfying Condition 5.1. Let A, B, α, β > 0 be arbitrary constants, let M_{s,t} = M for each s, t, and let Θ_t and c_t be defined as in (4.4) for 1 ≤ t ≤ T. If the batch size satisfies b_h > 25 log d, M = O(ρ), and Θ_t > 0 for all t, then the output of Algorithm 1 with the inexact subproblem solver satisfies

    E[µ(x_out)] ≤ E[F(x_0) − F*]/(γ_n S T) + δ',

where δ' = δ(Θ + Θ/(2M^{3/2}) + 1) and γ_n = min_t Θ_t/(15M^{3/2}).

Remark 5.4. By the definition of µ(x) in (4.1) and (4.2), in order to attain an (ε, √(ρε))-local minimum we require E[µ(x_out)] ≤ ε^{3/2} and thus δ' < ε^{3/2}, which implies that δ in Condition 5.1 should satisfy δ < ε^{3/2}/(Θ + Θ/(2M^{3/2}) + 1). Thus the total iteration complexity of Algorithm 1 is O(Δ_F/(γ_n ε^{3/2})). With the same choice of parameters, Algorithm 1 with the inexact oracle achieves the same reduction in SO calls.

Corollary 5.5. Under Condition 5.1, and under the same conditions as in Corollary 4.10, the output of Algorithm 1 with the inexact subproblem solver satisfies E[µ(x_out)] ≤ ε^{3/2} + δ_f within

    O(n + Δ_F √ρ n^{4/5}/ε^{3/2}) SO calls and O(Δ_F √ρ/ε^{3/2}) CSO calls,

where δ_f = O(ρδ).
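A simple checker for the δ-inexact condition, following the inequalities of Condition 5.1 as stated above; the last inequality references the exact minimizer, so in practice it is ensured by the solver rather than tested (pass h_exact=None to skip it). This is a sketch of the verification logic only, not part of the algorithm's analysis.

    import numpy as np

    def satisfies_inexact_condition(m, grad_m, h_tilde, M, delta, h_exact=None):
        """Check the delta-inexact condition for an approximate subproblem
        solution h_tilde, given the cubic model m and its gradient grad_m."""
        cube = lambda h: np.linalg.norm(h) ** 3
        ok = (np.linalg.norm(grad_m(h_tilde)) <= delta ** 1.5
              and m(h_tilde) <= -(M / 12.0) * cube(h_tilde) + delta)
        if h_exact is not None:
            ok = ok and cube(h_exact) <= cube(h_tilde) + delta
        return ok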

[Figure 1: six panels, (a) a9a, (b) covtype, (c) ijcnn1; (d) a9a, (e) covtype, (f) ijcnn1, x-axis: time in seconds.]

Figure 1: Logarithmic function value gap for nonconvex regularized logistic regression on different datasets. (a), (b) and (c) present the oracle complexity comparison; (d), (e) and (f) present the runtime comparison.

6 Experiments

In this section, we present numerical experiments on different non-convex Empirical Risk Minimization (ERM) problems and on different datasets to validate the advantage of our proposed algorithm in finding approximate local minima.

[Figure 2: six panels, (a) a9a, (b) covtype, (c) ijcnn1; (d) a9a, (e) covtype, (f) ijcnn1, x-axis: time in seconds.]

Figure 2: Logarithmic function value gap for nonlinear least squares on different datasets. (a), (b) and (c) present the oracle complexity comparison; (d), (e) and (f) present the runtime comparison.

Baselines: We compare our algorithm with adaptive cubic regularization (Cartis et al., 2011), subsampled cubic regularization (Kohler and Lucchi, 2017), stochastic cubic regularization (Tripuraneni et al., 2017) and gradient cubic regularization (Carmon and Duchi, 2016). All of the above algorithms are carefully tuned for a fair comparison.

The calculation of SO calls: Here we list the SO calls each algorithm needs in one loop. For subsampled cubic regularization, each loop costs (B_g + B_h) SO calls, where B_g and B_h denote the subsampling sizes of the gradient and the Hessian. For stochastic cubic regularization, each loop costs (n_g + n_h) SO calls, where we use n_g and n_h to denote the subsampling sizes of the gradient and the Hessian-vector operator. Gradient cubic regularization and adaptive cubic regularization cost n SO calls in each loop. Finally, we define the number of epochs as the number of SO calls divided by n.

Parameter tuning and subproblem solver: For each algorithm and each dataset, we choose different b_g, b_h and T for the best performance. Meanwhile, we use two different strategies for choosing M_{s,t}: the first is to fix M_{s,t} = M in each iteration, which is proved to enjoy good convergence performance; the other is to choose M_{s,t} = α/(1 + β)^{s + t/T} with α, β > 0 for each iteration. The latter choice is similar to the choice of the penalty parameter in the subsampled and stochastic cubic regularization baselines, and it sometimes makes the algorithms behave better in our experiments. As the solver for the subproblem (3.3) in each loop, we use the Lanczos-type method introduced in Cartis et al. (2011).

Datasets: The datasets we use are a9a, covtype and ijcnn1, which are common datasets for ERM problems. The detailed information about these datasets is given in Table 2.
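Under the oracle accounting described above, the per-outer-iteration SO cost of our algorithm and the corresponding epoch count can be tallied as in the bookkeeping sketch below (this helper is illustrative and not part of the experimental code).

    def svrc_so_calls_per_outer_loop(n, T, b_g, b_h):
        """SO accounting for one outer iteration of SVRC: n calls for the full
        snapshot gradient/Hessian plus (b_g + b_h) calls per inner iteration;
        the epoch count divides total SO calls by n."""
        so_calls = n + T * (b_g + b_h)
        return so_calls, so_calls / n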

[Figure 3: six panels, (a) a9a, (b) covtype, (c) ijcnn1; (d) a9a, (e) covtype, (f) ijcnn1, x-axis: time in seconds.]

Figure 3: Logarithmic function value gap for robust linear regression on different datasets. (a), (b) and (c) present the oracle complexity comparison; (d), (e) and (f) present the runtime comparison.

Table 2: Overview of the datasets used in our experiments

    Dataset | sample size n | dimension d
    a9a | 32,561 | 123
    covtype | 581,012 | 54
    ijcnn1 | 35,000 | 22

Non-convex regularized logistic regression: The first non-convex problem we choose is a binary logistic regression problem with the non-convex regularizer Σ_{i=1}^d λ w_{(i)}^2/(1 + w_{(i)}^2) (Reddi et al., 2016b). More specifically, suppose we are given training data {x_i, y_i}_{i=1}^n, where x_i ∈ R^d and y_i ∈ {0, 1} are the feature vector and label of the i-th data point. The minimization problem is

    min_{w ∈ R^d} (1/n) Σ_{i=1}^n −y_i log φ(x_i^T w) − (1 − y_i) log[1 − φ(x_i^T w)] + Σ_{i=1}^d λ w_{(i)}^2/(1 + w_{(i)}^2),

where φ(x) = 1/(1 + exp(−x)) is the sigmoid function. We fix λ = 10 in our experiments. The experiment results are shown in Figure 1.
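For completeness, here is a sketch of the per-example oracle (f_i, ∇f_i, ∇²f_i) for the objective above, i.e. what one SO call returns in this experiment; numerical safeguards such as clipping the sigmoid output are omitted, and attaching the regularizer to every f_i is one of several equivalent ways to write the problem as a finite sum.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def logistic_nonconvex_example(w, x_i, y_i, lam):
        """Per-example loss, gradient and Hessian for non-convex regularized
        logistic regression."""
        p = sigmoid(x_i @ w)
        loss = (-y_i * np.log(p) - (1 - y_i) * np.log(1 - p)
                + np.sum(lam * w ** 2 / (1 + w ** 2)))
        # gradient: logistic part plus d/dw of lam * w^2 / (1 + w^2)
        grad = (p - y_i) * x_i + 2 * lam * w / (1 + w ** 2) ** 2
        # Hessian: rank-one logistic part plus diagonal regularizer curvature
        hess = (p * (1 - p) * np.outer(x_i, x_i)
                + np.diag(2 * lam * (1 - 3 * w ** 2) / (1 + w ** 2) ** 3))
        return loss, grad, hess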

Nonlinear least squares: The second problem is a non-linear least squares problem which focuses on the task of binary linear classification (Xu et al., 2017a). Given training data {x_i, y_i}_{i=1}^n, where x_i ∈ R^d and y_i ∈ {0, 1} are the feature vector and label of the i-th data point, the minimization problem is

    min_{w ∈ R^d} (1/n) Σ_{i=1}^n [y_i − φ(x_i^T w)]^2,

where φ(x) = 1/(1 + exp(−x)) is the sigmoid function. The experiment results are shown in Figure 2.

Robust linear regression: The third problem is a robust linear regression problem where we use the non-convex robust loss function log(x^2/2 + 1) (Barron, 2017) instead of the square loss in least squares regression. Given a training sample {x_i, y_i}_{i=1}^n, where x_i ∈ R^d and y_i ∈ {0, 1} are the feature vector and label of the i-th data point, the minimization problem is

    min_{w ∈ R^d} (1/n) Σ_{i=1}^n η(y_i − x_i^T w),

where η(x) = log(x^2/2 + 1). The experimental results are shown in Figure 3.

From Figures 1, 2 and 3, we can see that our algorithm outperforms all the other baseline algorithms on all the datasets. The only exceptions are the non-linear least squares problem and the robust linear regression problem on the covtype dataset, where our algorithm behaves a little worse than one of the baselines in the high-accuracy regime in terms of epoch counts. However, even in this setting, our algorithm still outperforms the other baselines in terms of CPU time.

7 Conclusions

In this paper, we propose a novel second-order algorithm for non-convex optimization called SVRC. It is the first algorithm that improves the oracle complexity of cubic regularization and its subsampled variants in a certain regime by using variance reduction techniques. We also show that a similar oracle complexity holds with an inexact oracle. Under both settings our algorithm outperforms the state of the art.

References

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E. and Ma, T. (2017). Finding approximate local minima for nonconvex optimization in linear time.

Allen-Zhu, Z. (2017). Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694.

Allen-Zhu, Z. and Hazan, E. (2016). Variance reduction for faster non-convex optimization. In International Conference on Machine Learning.

Allen-Zhu, Z. and Li, Y. (2017). Neon2: Finding local minima via first-order oracles. ArXiv e-prints abs/1711.06673. Full version available at http://arxiv.org/abs/1711.06673.

Barron, J. T. (2017). A more general robust loss function. arXiv preprint arXiv:1701.03077.

Carmon, Y. and Duchi, J. C. (2016). Gradient descent efficiently finds the cubic-regularized non-convex Newton step.

Carmon, Y., Duchi, J. C., Hinder, O. and Sidford, A. (2016). Accelerated methods for non-convex optimization.

Cartis, C., Gould, N. I. and Toint, P. L. (2011). Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming 127 245-295.

Defazio, A., Bach, F. and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems.

Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. MIT Press.

Garber, D. and Hazan, E. (2015). Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647.

Ge, R., Huang, F., Jin, C. and Yuan, Y. (2015). Escaping from saddle points - online stochastic gradient for tensor decomposition. In Conference on Learning Theory.

Ge, R., Lee, J. D. and Ma, T. (2016). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems.

Ghadimi, S. and Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 2341-2368.

Ghadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156 59-99.

Gower, R. M., Roux, N. L. and Bach, F. (2017). Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods. arXiv preprint arXiv:1710.07462.

Hillar, C. J. and Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM) 60 45.

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M. and Jordan, M. I. (2017a). How to escape saddle points efficiently.

Jin, C., Netrapalli, P. and Jordan, M. I. (2017b). Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.

Kohler, J. M. and Lucchi, A. (2017). Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature 521 436-444.

Lucchi, A., McWilliams, B. and Hofmann, T. (2015). A variance reduced stochastic Newton method. arXiv preprint arXiv:1503.08316.

Moritz, P., Nishihara, R. and Jordan, M. (2016). A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics.

Nesterov, Y. and Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming 108 177-205.

Reddi, S. J., Hefny, A., Sra, S., Poczos, B. and Smola, A. (2016a). Stochastic variance reduction for nonconvex optimization 314-323.

Reddi, S. J., Sra, S., Póczos, B. and Smola, A. (2016b). Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on. IEEE.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics 400-407.

Rodomanov, A. and Kropotov, D. (2016). A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In International Conference on Machine Learning.

Roosta-Khorasani, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: Globally convergent algorithms.

Roosta-Khorasani, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: Local convergence rates.

Roux, N. L., Schmidt, M. and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems.

Royer, C. W. and Wright, S. J. (2017). Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131.

Shalev-Shwartz, S. (2016). SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning.

Tripuraneni, N., Stern, M., Jin, C., Regier, J. and Jordan, M. I. (2017). Stochastic cubic regularization for fast nonconvex optimization. arXiv preprint arXiv:1711.02838.

Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8 1-230.

Wai, H.-T., Shi, W., Nedic, A. and Scaglione, A. (2017). Curvature-aided incremental aggregated gradient method. arXiv preprint arXiv:1710.08936.

Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24 2057-2075.

Xu, P., Roosta-Khorasani, F. and Mahoney, M. W. (2017a). Second-order optimization for non-convex machine learning: An empirical study. arXiv preprint arXiv:1708.07827.

Xu, P., Roosta-Khorasani, F. and Mahoney, M. W. (2017b). Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164.

Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C. and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling.

Xu, Y., Jin, R. and Yang, T. (2017c). Neon+: Accelerated gradient methods for extracting negative curvature for non-convex optimization. arXiv preprint arXiv:1712.01033.

Ye, H., Luo, L. and Zhang, Z. (2017). Approximate Newton methods and their local convergence. In International Conference on Machine Learning.