Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Bo Liu, Department of Computer Science, Rutgers University (lb507@cs.rutgers.edu)
Xiao-Tong Yuan, BDAT Lab, Nanjing University of Information Science and Technology (xtyuan1980@gmail.com)
Qingshan Liu, BDAT Lab, Nanjing University of Information Science and Technology (qsliu@nuist.edu.cn)
Dimitris N. Metaxas, Department of Computer Science, Rutgers University (dnm@cs.rutgers.edu)

Abstract

In this paper, we present a distributed greedy pursuit method for non-convex sparse learning under a cardinality constraint. Given training samples randomly partitioned across multiple machines, the proposed method alternates between local inexact optimization of a Newton-type approximation and centralized aggregation of the global result. Theoretical analysis shows that for a general class of objective functions with Lipschitz continuous Hessian, the method converges linearly with a contraction factor that scales inversely with the data size. Numerical results confirm the high communication efficiency of our method when applied to large-scale sparse learning tasks.

1 Introduction

Setup. We are interested in distributed computing methods for solving the following cardinality-constrained empirical risk minimization (ERM) problem:
$$\min_{w \in \mathbb{R}^p} F(w) = \frac{1}{m}\sum_{j=1}^m F_j(w) = \frac{1}{mn}\sum_{j=1}^m\sum_{i=1}^n f(w; x_{ji}, y_{ji}), \quad \text{subject to } \|w\|_0 \le k, \qquad (1)$$
where $f$ is a convex loss function, $\|w\|_0$ denotes the number of non-zero entries in $w$, and we assume the training data $D = \{D_1, D_2, \dots, D_m\}$ with $N = nm$ samples is evenly and randomly distributed over $m$ machines; each machine $j$ locally stores and accesses the $n$ training samples $D_j = \{x_{ji}, y_{ji}\}_{i=1}^n$. We refer to the above model as $\ell_0$-ERM in this paper. Due to the presence of the cardinality constraint, the problem is non-convex and NP-hard in general.

Related work. Iterative hard thresholding (IHT) methods have demonstrated superior scalability in solving (1) [1, 2]. It is known from [2] that if $F(w)$ is $L$-smooth and $\mu_s$-strongly convex over $s$-sparse vectors, then with sparsity level $k = O\big((L/\mu_s)^2\,\bar k\big)$, IHT-style methods reach the estimation error level $\|w^{(t)} - \bar w\| = O\big(\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty/\mu_s\big)$ after $O\big(\frac{L}{\mu_s}\log\frac{\mu_s\|w^{(0)} - \bar w\|}{\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}\big)$ rounds of iteration. A distributed implementation of IHT was considered in [4] for compressive sensing. However, the linear dependence of the iteration complexity on the restricted condition number $L/\mu_s$ makes it inefficient in ill-conditioned settings.

Significant interest has recently been dedicated to designing distributed algorithms that have the flexibility to adapt to communication-computation trade-offs [3, 7, 6]. There has been a recent surge of Newton-type algorithms for distributed model learning [6, 5]. It was proved in [6] that if $F(w)$ is quadratic with condition number $L/\mu$, the communication complexity (with high probability) to reach $\epsilon$-precision is $O\big(\frac{(L/\mu)^2}{n}\log(mp)\log\frac{1}{\epsilon}\big)$, which has an improved dependence on the condition number $L/\mu$, a quantity that can scale as large as $O(\sqrt{mn})$ in regularized learning problems.

Open problem. Despite the attractiveness of distributed approximate/inexact Newton-type methods in classical regularized ERM learning, it remains unclear whether this type of method generalizes equally well, both in theory and practice, to the non-convex $\ell_0$-ERM model defined in (1).

OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning (NIPS 2017).

Our contribution. In this paper, we give an affirmative answer to the above open question by developing a novel DANE-type distributed greedy pursuit method. We show for our method that the parameter estimation error bound $\|w^{(t)} - \bar w\| = O\big(\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty/\mu_s\big)$ can be guaranteed with high probability after $O\big(\frac{1}{1-\theta}\log\frac{\mu_s\|w^{(0)} - \bar w\|}{\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}\big)$ rounds of communication, with contraction factor $\theta = O\big(\frac{L}{\mu_s}\sqrt{\frac{\log(mp)}{n}}\big)$. Compared to the bound of distributed IHT [4], this bound has a much improved dependence on the restricted condition number when the data size is sufficiently large. In sharp contrast, the required sample complexity in [7] for $\ell_1$-regularized distributed learning is $n = O\big(s^2(L/\mu_s)^2\log p\big)$, which is clearly inferior to ours.

Notation and definitions. We denote by $\mathcal{H}_k(x)$ the truncation operator that preserves the top $k$ (in magnitude) entries of a vector $x$ and sets the remaining entries to zero. The notation $\mathrm{supp}(x)$ represents the index set of nonzero entries of $x$. We conventionally define $\|x\|_\infty = \max_i |[x]_i|$ and $x_{\min} = \min_{i \in \mathrm{supp}(x)} |[x]_i|$. For an index set $S$, we define $[x]_S$ and $[A]_{SS}$ as the restriction of $x$ to $S$ and the restriction of the rows and columns of $A$ to $S$, respectively. For an integer $n$, we abbreviate the set $\{1, \dots, n\}$ as $[n]$. For any integer $s > 0$, we say $f(w)$ is restricted $\mu_s$-strongly-convex and $L_s$-smooth if, for all $w, w'$ with $\|w - w'\|_0 \le s$,
$$\frac{\mu_s}{2}\|w - w'\|^2 \le f(w) - f(w') - \langle \nabla f(w'), w - w' \rangle \le \frac{L_s}{2}\|w - w'\|^2.$$
Suppose that $f(w)$ is twice continuously differentiable. We say $f(w)$ has restricted Lipschitz Hessian with constant $\beta_s \ge 0$ (or $\beta_s$-RLH) if, for all $w, w'$ with $\|w - w'\|_0 \le s$, $\|\nabla^2_{SS} f(w) - \nabla^2_{SS} f(w')\| \le \beta_s\|w - w'\|$, where we use the abbreviation $\nabla^2_{SS} f := [\nabla^2 f]_{SS}$.

2 Distributed Inexact Newton-type Pursuit

2.1 Algorithm

The Distributed Inexact Newton-type PurSuit (DINPS) algorithm is outlined in Algorithm 1. Starting from an initial $k$-sparse approximation $w^{(0)}$, the procedure generates a sequence of intermediate $k$-sparse iterates $\{w^{(t)}\}_{t \ge 1}$ via distributed local sparse estimation and global synchronization among machines. More precisely, each iteration of DINPS can be decomposed into the following three consecutive main steps:

Map-reduce gradient computation. In this first step, we evaluate the global gradient $\nabla F(w^{(t-1)}) = \frac{1}{m}\sum_{j=1}^m \nabla F_j(w^{(t-1)})$ at the current iterate via simple map-reduce averaging and distribute it to all machines for local computation.

Local inexact sparse approximation. In this step, each machine $j$ constructs a local objective function at the current iterate as in (2), and then inexactly estimates a local $k$-sparse solution satisfying (3). This inexact sparse optimization step can be implemented using IHT-style algorithms, which have been witnessed to offer fast and accurate solutions for centralized $\ell_0$-estimation [8]; a schematic sketch of such a local solver is given at the end of this subsection.

Centralized results aggregation. The master machine simply assigns to $w^{(t)}$ the first received $k$-sparse local output $w^{(t)}_{j_t}$ at the current round of iteration. Such an aggregation scheme is asynchronous by nature and hence robust to computation-power imbalance and communication delay.

The simplest way of initialization is to set $w^{(0)} = 0$, i.e., starting the iteration from scratch. Since the data samples are assumed to be evenly and randomly distributed over the machines, another reasonable option is to minimize one of the local $\ell_0$-ERM problems, say $w^{(0)} \in \arg\min_{\|w\|_0 \le k} F_1(w)$, which is expected to be reasonably close to the global solution. A similar local initialization strategy was also considered for EDSL [7].
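The following is a minimal single-machine sketch of the local inexact sparse approximation step, assuming a least-squares local risk $F_j(w) = \frac{1}{2n}\|X_j w - y_j\|^2$. It is not the authors' implementation: the helper names (`hard_threshold`, `local_iht_step`), the fixed step size, and the iteration budget are illustrative choices of ours; any IHT-style solver satisfying the precision condition (3) could be substituted.

```python
import numpy as np

def hard_threshold(x, k):
    """H_k(x): keep the k largest-magnitude entries of x, zero out the rest."""
    z = np.zeros_like(x)
    if k <= 0:
        return z
    idx = np.argpartition(np.abs(x), -k)[-k:]
    z[idx] = x[idx]
    return z

def local_iht_step(w_prev, grad_F, X_j, y_j, k, eta=1.0, gamma=0.0,
                   step=0.1, iters=100):
    """Approximately solve  min_{||w||_0 <= k} P_j(w; w_prev, eta, gamma)
    by projected gradient (IHT), where
      P_j(w) = <eta*grad_F(w_prev) - grad_Fj(w_prev), w>
               + (gamma/2)*||w - w_prev||^2 + F_j(w)
    and F_j is the local least-squares risk (1/2n)||X_j w - y_j||^2 (our
    illustrative choice of loss)."""
    n = X_j.shape[0]
    grad_Fj_prev = X_j.T @ (X_j @ w_prev - y_j) / n       # local gradient at w_prev
    shift = eta * grad_F - grad_Fj_prev                    # linear correction term of P_j
    w = hard_threshold(w_prev, k)
    for _ in range(iters):
        grad_Fj = X_j.T @ (X_j @ w - y_j) / n              # gradient of F_j at w
        grad_P = shift + gamma * (w - w_prev) + grad_Fj    # gradient of P_j at w
        w = hard_threshold(w - step * grad_P, k)           # gradient step + H_k projection
    return w
```

In a distributed run, `grad_F` would be the map-reduce averaged global gradient broadcast in step (S1), and each worker would return its `local_iht_step` output to the master.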
2.2 Main results

Deterministic result. The following is a deterministic result on the parameter estimation error bound of DINPS when the objective functions are twice differentiable with restricted Lipschitz Hessian.

Theorem 1. Let $\bar w$ be a $\bar k$-sparse target vector with $\bar k \le k$. Assume that each component $F_j(w)$ is $\mu_{3k}$-strongly-convex and has $\beta_{3k}$-RLH. Let $H_j = \nabla^2 F_j(\bar w)$ and $\bar H = \frac{1}{m}\sum_{j=1}^m H_j$. Assume that $\max_j \|H_j - \eta\bar H\| \le \theta\mu_{3k}/4$ for some $\theta \in (0, 1)$ and that $\epsilon \le \bar k\eta^2\|\nabla F(\bar w)\|_\infty^2/(2\mu_{3k})$. Assume further that $\|w^{(0)} - \bar w\| \le \frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}}$ and $\|\nabla F(\bar w)\|_\infty \le \frac{\theta^2\mu_{3k}^2}{8.94\eta(1+\eta)\beta_{3k}\sqrt{\bar k}}$. Set $\gamma = 0$. Then
$$\|w^{(t)} - \bar w\| \le \frac{5.47\eta\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}{(1-\theta)\mu_{3k}}$$
after
$$t \ge \frac{1}{1-\theta}\log\frac{\mu_{3k}\|w^{(0)} - \bar w\|}{\eta\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}$$
rounds of iteration.

Algorithm 1: Distributed Inexact Newton-type PurSuit (DINPS)

  Input: Loss functions $\{F_j(w)\}_{j=1}^m$ distributed over $m$ different machines, sparsity level $k$, parameters $\gamma \ge 0$ and $\eta > 0$. Typically set $\gamma = 0$ and $\eta = 1$.
  Initialization: Set $w^{(0)} = 0$ or estimate $w^{(0)} \in \arg\min_{\|w\|_0 \le k} F_1(w)$.
  for $t = 1, 2, \dots$ do
      /* Map-reduce gradient evaluation */
      (S1) Compute $\nabla F(w^{(t-1)}) = \frac{1}{m}\sum_{j=1}^m \nabla F_j(w^{(t-1)})$ and broadcast it to all workers;
      /* Local sparse optimization */
      (S2) for all workers $j = 1, \dots, m$ in parallel do
          (i) Construct a local objective function:
          $$P_j(w; w^{(t-1)}, \eta, \gamma) := \langle \eta\nabla F(w^{(t-1)}) - \nabla F_j(w^{(t-1)}), w\rangle + \frac{\gamma}{2}\|w - w^{(t-1)}\|^2 + F_j(w); \qquad (2)$$
          (ii) Estimate a $k$-sparse $w^{(t)}_j$ by approximately solving $\min_{\|w\|_0 \le k} P_j(w; w^{(t-1)}, \eta, \gamma)$ up to sparsity level $k \ge \bar k$ and $\epsilon$-precision, i.e., for any $\bar k$-sparse vector $w'$:
          $$P_j(w^{(t)}_j; w^{(t-1)}, \eta, \gamma) \le P_j(w'; w^{(t-1)}, \eta, \gamma) + \epsilon; \qquad (3)$$
      end
      /* Centralized results aggregation */
      (S3) Set $w^{(t)} = w^{(t)}_{j_t}$ as the first received local solution;
  end
  Output: $w^{(t)}$.

Proof sketch. The proof is rooted in the following key inequality:
$$\|w^{(t)} - \bar w\| \le \frac{\max_j\|H_j - \eta\bar H\|}{\mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{(1+\eta)\beta_{3k}}{\mu_{3k}}\|w^{(t-1)} - \bar w\|^2 + \frac{4.47\eta\sqrt{\bar k}}{\mu_{3k}}\|\nabla F(\bar w)\|_\infty.$$
Given this inequality, we can prove by induction that $\|w^{(t)} - \bar w\| \le \frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}}$ holds for all $t \ge 0$, which then implies the recursive form $\|w^{(t)} - \bar w\| \le \theta\|w^{(t-1)} - \bar w\| + \frac{4.47\eta\sqrt{\bar k}}{\mu_{3k}}\|\nabla F(\bar w)\|_\infty$.

Remark. The main message conveyed by Theorem 1 is the following: provided that $w^{(0)}$ is properly initialized and the estimation error level $\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty$ is sufficiently small, DINPS for RLH objectives exhibits linear convergence with contraction factor $\theta = O(\max_j\|H_j - \eta\bar H\|/\mu_{3k})$.

Stochastic result. We now turn to a stochastic setting in which the samples are uniformly randomly distributed over the $m$ machines.

Corollary 1. Let $\bar w$ be a $\bar k$-sparse target vector with $\bar k \le k$. Assume that the samples are uniformly randomly distributed over the $m$ machines and that each component $F_j(w)$ is $\mu_{3k}$-strongly-convex and has $\beta_{3k}$-RLH. Assume that $\|\nabla^2 f(w; x_{ji}, y_{ji})\| \le L$ holds for all $j \in [m]$ and $i \in [n]$. Let $H_j = \nabla^2 F_j(\bar w)$ and $\bar H = \frac{1}{m}\sum_{j=1}^m H_j$. Set $\gamma = 0$ and $\eta = 1$. Assume that $\epsilon \le \bar k\|\nabla F(\bar w)\|_\infty^2/(2\mu_{3k})$, $\|w^{(0)} - \bar w\| \le \frac{\theta\mu_{3k}}{4\beta_{3k}}$, and $\|\nabla F(\bar w)\|_\infty \le \frac{\theta^2\mu_{3k}^2}{17.88\beta_{3k}\sqrt{\bar k}}$. For any $\delta \in (0, 1)$, if $n > \frac{512L^2\log(mp/\delta)}{\mu_{3k}^2}$, then with probability at least $1 - \delta$, Algorithm 1 will output a solution $w^{(t)}$ satisfying
$$\|w^{(t)} - \bar w\| \le \frac{5.47\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}{(1-\theta)\mu_{3k}}$$
after
$$t \ge \frac{1}{1-\theta}\log\frac{\mu_{3k}\|w^{(0)} - \bar w\|}{\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}$$
rounds of iteration, with $\theta = \frac{L}{\mu_{3k}}\sqrt{\frac{512\log(mp/\delta)}{n}}$.

Proof sketch. Based on the matrix concentration bound from [6] (restated as Lemma 2 in the supplementary material), we can show that $\max_j\|H_j - \bar H\| \le \theta\mu_{3k}/4$ holds with probability at least $1 - \delta$. The condition on $n$ guarantees $\theta \in (0, 1)$.

Remark. Corollary 1 indicates that in the considered statistical setting, the contraction factor $\theta$ can be made arbitrarily small provided that the sample size $n = O\big(L^2\log(mp)/\mu_{3k}^2\big)$ is sufficiently large. This sample size complexity is clearly superior to the corresponding $n = O\big(\bar k^2(L/\mu_{3k})^2\log p\big)$ complexity established in [7] for distributed $\ell_1$-regularized sparse learning.
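To make the role of the per-machine sample size $n$ concrete, the short numerical sketch below (ours, not from the paper) measures $\max_j\|H_j - \bar H\|$ for a random least-squares design split evenly across $m$ machines; the dimension, machine count, and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 50, 8                      # dimension and number of machines (illustrative)

def max_hessian_gap(n):
    """Empirically measure max_j ||H_j - H_bar||_2 for a least-squares design,
    where H_j = X_j^T X_j / n is machine j's local Hessian."""
    H_list = []
    for _ in range(m):
        X = rng.standard_normal((n, p))
        H_list.append(X.T @ X / n)
    H_bar = sum(H_list) / m
    return max(np.linalg.norm(H - H_bar, 2) for H in H_list)

for n in [200, 800, 3200, 12800]:
    print(n, round(max_hessian_gap(n), 3))
# The printed gap shrinks roughly like 1/sqrt(n).  Since the contraction factor
# of Theorem 1 (with eta = 1) scales as theta = O(max_j ||H_j - H_bar|| / mu_3k),
# it can be driven arbitrarily small by increasing the per-machine sample size,
# which is the message of Corollary 1.
```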

3 Experiment

We evaluate the empirical performance of DINPS on a set of simulated sparse linear regression tasks. A synthetic $N \times p$ design matrix is generated with each data sample $x_i$ drawn from the Gaussian distribution $N(0, \Sigma)$, where $\Sigma_{j,k} = 1$ if $j = k$ and $\Sigma_{j,k} = 0.5^{|j-k|}/10$ otherwise. A sparse model parameter $\bar w \in \mathbb{R}^p$ is generated with its top $\bar k$ entries uniformly randomly valued in the interval $(0, 0.1)$ and all other entries set to zero. The response variables $\{y_i\}_{i=1}^N$ are generated by $y_i = \bar w^\top x_i + \epsilon_i$ with $\epsilon_i \sim N(0, 0.01)$. A distributed implementation of IHT (which we refer to as Dist-IHT) [4] is considered as a baseline algorithm for the communication-efficiency comparison. We set $p = 1000$. The value of $k$ and the parameter-update step sizes are chosen by grid search for optimal performance.

We first fix the number of training samples at $N = 10^4$ and vary the number of machines over $m = 2, 4, 8$. The convergence curves of the considered algorithms in terms of rounds of communication are shown in Figure 1(a). Since Dist-IHT collects the gradient updates from all workers before conducting gradient descent and hard thresholding on the master machine, its convergence behavior is independent of the number of machines. Our DINPS algorithm adopts a more greedy parameter-update strategy by solving an $\ell_0$-constrained minimization problem on the worker machines. As a result, it requires far fewer communication rounds between the workers and the master than Dist-IHT.

[Figure 1: Linear regression (denoted by $F_{lr}$) model training convergence versus communication round: (a) varying numbers of machines $m$ with $N = 10^4$; (b) varying numbers of training samples $N$ with $m = 10$. Convergence is measured by $\|w - \bar w\|/\|\bar w\|$; the curves compare Dist-IHT with DINPS.]

We have also tested the case where the number of machines is fixed at $m = 10$ and the number of training samples varies over $N = 5 \times 10^4$, $10^5$, and $2 \times 10^5$. Figure 1(b) shows the convergence curves of the considered algorithms. Again we observe the superior communication efficiency of DINPS over Dist-IHT.
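For reference, the synthetic data described above can be generated along the following lines. This is our own sketch, not the authors' released code: the true sparsity `k_bar`, the machine count `m = 4` (one of the settings above), the interpretation of $N(0, 0.01)$ as variance, and the exact off-diagonal covariance scaling $0.5^{|j-k|}/10$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k_bar = 10_000, 1000, 50         # N and p from the paper; k_bar is our choice

# Covariance: 1 on the diagonal, 0.5^{|j-k|}/10 off the diagonal (our reading of the text).
idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :]) / 10.0
np.fill_diagonal(Sigma, 1.0)

# Design matrix and sparse ground-truth parameter.
X = rng.multivariate_normal(np.zeros(p), Sigma, size=N)
w_bar = np.zeros(p)
w_bar[:k_bar] = rng.uniform(0.0, 0.1, size=k_bar)   # top k_bar entries uniform in (0, 0.1)

# Responses: y_i = w_bar^T x_i + eps_i, with eps_i ~ N(0, 0.01) (read as variance).
y = X @ w_bar + rng.normal(0.0, np.sqrt(0.01), size=N)

# Partition the samples evenly across m machines, as assumed by DINPS.
m = 4
X_parts = np.array_split(X, m)
y_parts = np.array_split(y, m)
```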

4 Conclusion

As a novel inexact Newton-type greedy pursuit method for distributed $\ell_0$-ERM, DINPS has an improved dependence on the restricted condition number compared with prior distributed IHT methods.

References

[1] A. Beck and Y. C. Eldar. Sparsity constrained nonlinear optimization: Optimality conditions and algorithms. SIAM Journal on Optimization, 23(3):1480-1509, 2013.

[2] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, 2014.

[3] M. I. Jordan, J. D. Lee, and Y. Yang. Communication-efficient distributed statistical inference. arXiv preprint arXiv:1605.07689, 2016.

[4] S. Patterson, Y. C. Eldar, and I. Keidar. Distributed compressed sensing for static and time-varying networks. IEEE Transactions on Signal Processing, 62(19):4931-4946, 2014.

[5] S. J. Reddi, J. Konečný, P. Richtárik, B. Póczós, and A. Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.

[6] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, 2014.

[7] J. Wang, M. Kolar, N. Srebro, and T. Zhang. Efficient distributed learning with sparsity. In International Conference on Machine Learning, 2017.

[8] X.-T. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In International Conference on Machine Learning, 2014.

Abstract

This supplementary document contains the technical proofs of the algorithm analysis in the 10th NIPS Workshop on Optimization for Machine Learning submission entitled "Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning".

A A key lemma

The following lemma is key to our analysis.

Lemma 1. Let $\bar w$ be a $\bar k$-sparse target vector with $\bar k \le k$. Assume that each component $F_j(w)$ is $\mu_{3k}$-strongly-convex and that $\eta F(w) - F_j(w) - \frac{\gamma}{2}\|w\|^2$ has $\alpha_{3k}$-RLG (restricted Lipschitz gradient, defined analogously to RLH with the gradient in place of the Hessian). Then
$$\|w^{(t)} - \bar w\| \le \frac{\alpha_{3k}}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{3.47\eta\sqrt{\bar k}}{\gamma + \mu_{3k}}\|\nabla F(\bar w)\|_\infty + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}.$$
Moreover, assume that each $F_j(w)$ has $\beta_{3k}$-RLH. Let $H_j = \nabla^2 F_j(\bar w)$ and $\bar H = \frac{1}{m}\sum_{j=1}^m H_j$. Then
$$\|w^{(t)} - \bar w\| \le \frac{\gamma + \max_j\|H_j - \eta\bar H\|}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{(1+\eta)\beta_{3k}}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\|^2 + \frac{3.47\eta\sqrt{\bar k}}{\gamma + \mu_{3k}}\|\nabla F(\bar w)\|_\infty + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}.$$

Proof. Recall that $w^{(t)} = w^{(t)}_{j_t}$. By assumption we know that $F_{j_t}(w)$ is $\mu_{3k}$-strongly-convex, and hence $P_{j_t}(w; w^{(t-1)}, \eta, \gamma)$ is $(\gamma + \mu_{3k})$-strongly-convex. Let $S^{(t)} = \mathrm{supp}(w^{(t)})$ and $\bar S = \mathrm{supp}(\bar w)$, and consider $S = S^{(t)} \cup S^{(t-1)} \cup \bar S$. Then
$$P_{j_t}(w^{(t)}; w^{(t-1)}, \eta, \gamma) \ge P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma) + \langle\nabla P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma), w^{(t)} - \bar w\rangle + \frac{\gamma + \mu_{3k}}{2}\|w^{(t)} - \bar w\|^2$$
$$= P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma) + \langle\nabla_S P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma), w^{(t)} - \bar w\rangle + \frac{\gamma + \mu_{3k}}{2}\|w^{(t)} - \bar w\|^2$$
$$\overset{\xi_1}{\ge} P_{j_t}(w^{(t)}; w^{(t-1)}, \eta, \gamma) - \epsilon - \|\nabla_S P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma)\|\,\|w^{(t)} - \bar w\| + \frac{\gamma + \mu_{3k}}{2}\|w^{(t)} - \bar w\|^2,$$
where $\xi_1$ follows from the definition of $w^{(t)}$ as an approximate $k$-sparse minimizer of $P_{j_t}(w; w^{(t-1)}, \eta, \gamma)$ up to precision $\epsilon$. By rearranging both sides of the above inequality with proper elementary calculation we get
$$\|w^{(t)} - \bar w\| \le \frac{\|\nabla_S P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$= \frac{\|\eta\nabla_S F(w^{(t-1)}) - \nabla_S F_{j_t}(w^{(t-1)}) + \gamma(\bar w - w^{(t-1)}) + \nabla_S F_{j_t}(\bar w)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$= \frac{\|\eta\nabla_S F(w^{(t-1)}) - \eta\nabla_S F(\bar w) - (\nabla_S F_{j_t}(w^{(t-1)}) - \nabla_S F_{j_t}(\bar w)) + \gamma(\bar w - w^{(t-1)}) + \eta\nabla_S F(\bar w)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$\le \frac{\|(\eta\nabla_S F(w^{(t-1)}) - \nabla_S F_{j_t}(w^{(t-1)}) - \gamma w^{(t-1)}) - (\eta\nabla_S F(\bar w) - \nabla_S F_{j_t}(\bar w) - \gamma\bar w)\|}{\gamma + \mu_{3k}} + \frac{\eta\|\nabla_S F(\bar w)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$\overset{\zeta_1}{\le} \frac{\alpha_{3k}}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{3.47\eta\sqrt{\bar k}}{\gamma + \mu_{3k}}\|\nabla F(\bar w)\|_\infty + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}},$$

where $\zeta_1$ follows from the assumption that $\eta F(w) - F_{j_t}(w) - \frac{\gamma}{2}\|w\|^2$ has $\alpha_{3k}$-RLG. This shows the validity of the first part.

Next we prove the second part. Similar to the above argument, we have
$$\|w^{(t)} - \bar w\| \le \frac{\|\nabla_S P_{j_t}(\bar w; w^{(t-1)}, \eta, \gamma)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$\le \frac{\|\gamma(w^{(t-1)} - \bar w) - \eta(\nabla_S F(w^{(t-1)}) - \nabla_S F(\bar w)) + (\nabla_S F_{j_t}(w^{(t-1)}) - \nabla_S F_{j_t}(\bar w))\|}{\gamma + \mu_{3k}} + \frac{\eta\|\nabla_S F(\bar w)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$\le \frac{\|\gamma(w^{(t-1)} - \bar w) - \big(\eta\nabla^2_{SS}F(\bar w) - \nabla^2_{SS}F_{j_t}(\bar w)\big)(w^{(t-1)} - \bar w)\|}{\gamma + \mu_{3k}} + \frac{\eta\|\nabla_S F(w^{(t-1)}) - \nabla_S F(\bar w) - \nabla^2_{SS}F(\bar w)(w^{(t-1)} - \bar w)\|}{\gamma + \mu_{3k}}$$
$$\quad + \frac{\|\nabla_S F_{j_t}(w^{(t-1)}) - \nabla_S F_{j_t}(\bar w) - \nabla^2_{SS}F_{j_t}(\bar w)(w^{(t-1)} - \bar w)\|}{\gamma + \mu_{3k}} + \frac{\eta\|\nabla_S F(\bar w)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$\le \frac{\gamma + \|\eta\nabla^2_{SS}F(\bar w) - \nabla^2_{SS}F_{j_t}(\bar w)\|}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{(1+\eta)\beta_{3k}}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\|^2 + \frac{\eta\|\nabla_S F(\bar w)\|}{\gamma + \mu_{3k}} + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}$$
$$\le \frac{\gamma + \max_j\|H_j - \eta\bar H\|}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{(1+\eta)\beta_{3k}}{\gamma + \mu_{3k}}\|w^{(t-1)} - \bar w\|^2 + \frac{3.47\eta\sqrt{\bar k}}{\gamma + \mu_{3k}}\|\nabla F(\bar w)\|_\infty + \sqrt{\frac{2\epsilon}{\gamma + \mu_{3k}}}.$$
This proves the second part.

B Proof of Theorem 1

Proof. We first claim that $\|w^{(t)} - \bar w\| \le \frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}}$ holds for all $t \ge 0$. This can be shown by induction. Obviously the claim holds for $t = 0$. Now suppose that $\|w^{(t-1)} - \bar w\| \le \frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}}$ for some $t \ge 1$. Since $\gamma = 0$, according to Lemma 1 we have
$$\|w^{(t)} - \bar w\| \le \frac{\max_j\|H_j - \eta\bar H\|}{\mu_{3k}}\|w^{(t-1)} - \bar w\| + \frac{(1+\eta)\beta_{3k}}{\mu_{3k}}\|w^{(t-1)} - \bar w\|^2 + \frac{3.47\eta\sqrt{\bar k}}{\mu_{3k}}\|\nabla F(\bar w)\|_\infty + \sqrt{\frac{2\epsilon}{\mu_{3k}}}$$
$$\overset{\zeta_1}{\le} \frac{\theta}{4}\|w^{(t-1)} - \bar w\| + \frac{(1+\eta)\beta_{3k}}{\mu_{3k}}\|w^{(t-1)} - \bar w\|^2 + \frac{4.47\eta\sqrt{\bar k}}{\mu_{3k}}\|\nabla F(\bar w)\|_\infty$$
$$\le \theta\|w^{(t-1)} - \bar w\| + \frac{4.47\eta\sqrt{\bar k}}{\mu_{3k}}\|\nabla F(\bar w)\|_\infty$$
$$\overset{\zeta_2}{\le} \theta\cdot\frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}} + \frac{\theta^2\mu_{3k}}{2(1+\eta)\beta_{3k}} = \frac{\theta^2\mu_{3k}}{(1+\eta)\beta_{3k}} \le \frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}},$$
where $\zeta_1$ follows from $\gamma = 0$ and the assumption on $\epsilon$, and $\zeta_2$ follows from $\theta \le 0.5$ and the condition $\|\nabla F(\bar w)\|_\infty \le \frac{\theta^2\mu_{3k}^2}{8.94\eta(1+\eta)\beta_{3k}\sqrt{\bar k}}$. Thus by induction $\|w^{(t)} - \bar w\| \le \frac{\theta\mu_{3k}}{2(1+\eta)\beta_{3k}}$ holds for all $t \ge 1$. It then follows from the inequality $\zeta_1$ above that, for all $t \ge 0$,
$$\|w^{(t)} - \bar w\| \le \theta\|w^{(t-1)} - \bar w\| + \frac{4.47\eta\sqrt{\bar k}}{\mu_{3k}}\|\nabla F(\bar w)\|_\infty.$$
By recursively applying the above inequality we get
$$\|w^{(t)} - \bar w\| \le \theta^t\|w^{(0)} - \bar w\| + \frac{4.47\eta\sqrt{\bar k}}{(1-\theta)\mu_{3k}}\|\nabla F(\bar w)\|_\infty.$$

Based on the inequality $1 - x \le \exp(-x)$, we need
$$t \ge \frac{1}{1-\theta}\log\frac{(1-\theta)\mu_{3k}\|w^{(0)} - \bar w\|}{\eta\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}$$
steps of iteration to achieve the precision $\|w^{(t)} - \bar w\| \le \frac{5.47\eta\sqrt{\bar k}\,\|\nabla F(\bar w)\|_\infty}{(1-\theta)\mu_{3k}}$. This proves the desired complexity bound.

C Proof of Corollary 1

Lemma 2, which is based on a matrix concentration bound, indicates that the Hessian $H_j$ is close to $\bar H$ when the sample size is sufficiently large. The same result can be found in [6].

Lemma 2. Assume that $\|\nabla^2 f(w; x_{ji}, y_{ji})\| \le L$ holds for all $j \in [m]$ and $i \in [n]$. Then for each $j$, with probability at least $1 - \delta$ over the samples,
$$\max_j\|H_j - \bar H\| \le \sqrt{\frac{32L^2\log(mp/\delta)}{n}},$$
where $H_j = \frac{1}{n}\sum_{i=1}^n \nabla^2 f(\bar w; x_{ji}, y_{ji})$ and $\bar H = \frac{1}{m}\sum_{j=1}^m H_j$.

From Lemma 2 we get that $\max_j\|H_j - \bar H\| \le \theta\mu_{3k}/4$ holds with probability at least $1 - \delta$. Since $n > \frac{512L^2\log(mp/\delta)}{\mu_{3k}^2}$, we have $\theta \in (0, 1)$. By invoking Theorem 1 we get the desired result.