Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Jialei Wang (University of Chicago, jialei@uchicago.edu) and Tong Zhang (Tencent AI Lab, tongzhang@tongzhang-ml.org)

Abstract

We present a novel minibatch stochastic optimization method for empirical risk minimization problems. The method efficiently leverages variance-reduced first-order and sub-sampled higher-order information to accelerate convergence. We prove improved iteration complexity over state-of-the-art methods under suitable conditions.

OPTML 2017: 10th NIPS Workshop on Optimization for Machine Learning (NIPS 2017).

1 Introduction

We consider the following finite-sum optimization problem:

    min_w f(w) := (1/n) ∑_{i=1}^n f_i(w),    (1)

which arises in many machine learning problems. In the context of regularized loss minimization with linear predictors, we have f_i(w) = ℓ(w^⊤ x_i; b_i) + (λ/2)‖w‖². Let x_1, x_2, ..., x_n ∈ R^d be the feature vectors of n data samples, and let b_1, ..., b_n ∈ R or {1, −1} be the corresponding target variables of interest. Problem (1) covers many popular models used in machine learning: for example, when ℓ(w^⊤ x_i; b_i) = log(1 + exp(−b_i ⟨w, x_i⟩)) we obtain ℓ_2-regularized logistic regression. The ubiquity of such finite-sum optimization problems and the massive scale of modern datasets motivate significant research effort on efficient optimization algorithms for solving (1). Throughout the paper, we assume each component function is L-smooth and λ-strongly convex.

In recent years there have been significant advances in developing fast optimization algorithms for (1), and we refer the reader to [Bottou et al. (2016)] for a comprehensive survey of these developments. For large-scale problems of the form (1), randomized methods are particularly efficient because of their low per-iteration computation. Below we briefly review two lines of research: i) randomized variance-reduced first-order methods; ii) randomized methods leveraging second-order information.

Variance Reduced First-order Methods. The key technique for developing fast stochastic first-order methods is variance reduction, which drives the variance of the randomized update direction to zero as the iterate approaches the optimum. Representative methods in this category include SAG [Roux et al. (2012)], SVRG [Johnson and Zhang (2013)], SDCA [Shalev-Shwartz and Zhang (2013)], and SAGA [Defazio et al. (2014)]. To find a solution that reaches ε-suboptimality, these methods require O((n + L/λ) log(1/ε)) calls to a first-order oracle that computes the gradient of an individual function. In the large condition number regime (e.g., L/λ > n), using an acceleration technique (such as Catalyst [Shalev-Shwartz and Zhang (2016); Frostig et al. (2015); Lin et al. (2015a)], APCG [Lin et al. (2015b)], SPDC [Zhang and Xiao (2015)], or Katyusha [Allen-Zhu (2016)]) further improves the iteration complexity to O((n + √(nL/λ)) log(1/ε)).
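To make the variance-reduction idea concrete, the following is a minimal NumPy sketch (written for this note, not taken from the paper) of the SVRG-style variance-reduced gradient for ℓ_2-regularized logistic regression; the function names, step size, and minibatch size are illustrative assumptions.

import numpy as np

def logistic_full_grad(w, X, b, lam):
    # Gradient of f(w) = (1/n) sum_i log(1 + exp(-b_i <w, x_i>)) + (lam/2) ||w||^2
    margins = -b * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-margins))     # sigmoid(-b_i <w, x_i>)
    return -(X.T @ (sigma * b)) / len(b) + lam * w

def svrg_direction(w, w_ref, grad_ref, X, b, lam, batch_size, rng):
    # Variance-reduced direction: g_B(w) - g_B(w_ref) + grad f(w_ref),
    # where g_B is the gradient restricted to a random minibatch B.
    idx = rng.choice(len(b), size=batch_size, replace=False)
    return (logistic_full_grad(w, X[idx], b[idx], lam)
            - logistic_full_grad(w_ref, X[idx], b[idx], lam)
            + grad_ref)

# Toy usage: one variance-reduced step from the reference point.
rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 1e-3
X = rng.standard_normal((n, d))
b = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))
w_ref = np.zeros(d)                            # reference predictor (w tilde)
grad_ref = logistic_full_grad(w_ref, X, b, lam)
w = w_ref - 0.1 * svrg_direction(w_ref, w_ref, grad_ref, X, b, lam, 32, rng)

In expectation over the minibatch, the returned direction equals ∇f(w), and its variance vanishes as both w and the reference point approach the optimum, which is what removes the need for a decaying step size.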
Leveraging Second-order Information. Second-order information is often useful for improving the convergence of optimization algorithms. However, for large-scale problems, obtaining and inverting the exact Hessian matrix is often computationally expensive, which makes vanilla Newton methods ill-suited to solving (1). There have therefore been emerging studies on designing randomized algorithms that effectively utilize approximate Hessian information. One line of such research is the sub-sampled Newton method, which approximates the Hessian matrix based on a sub-sampled minibatch of data. Several works, such as [Byrd et al. (2011); Erdogdu and Montanari (2015); Roosta-Khorasani and Mahoney (2016a,b); Bollapragada et al. (2016)], established local linear convergence rates for different variants of sub-sampled Newton methods when the size of the sub-sampled data is large enough.

All of the aforementioned methods employ the full first-order gradient, and they require calling the second-order oracle to compute the Hessian and solving the resulting linear system at every iteration. Therefore their computational complexity (as compared in Table 2 of [Xu et al. (2016)]) is often worse than that of stochastic algorithms such as SVRG. Another line of research considers randomness in both first- and second-order information in order to design algorithms with lower per-iteration cost. In particular, [Byrd et al. (2016)] combined minibatch SGD with a limited-memory BFGS (L-BFGS) [Liu and Nocedal (1989)] type update. Inspired by this, [Moritz et al. (2016); Gower et al. (2016); Wang et al. (2017b)] proposed to combine variance-reduced stochastic gradients with L-BFGS, and proved linear convergence for strongly convex and smooth objectives. However, it is theoretically hard to guarantee the quality of the approximate Hessian produced by the L-BFGS update, so the iteration complexities obtained by [Moritz et al. (2016); Gower et al. (2016); Wang et al. (2017b)] can be pessimistic, and can be much worse than that of vanilla SVRG.

In this paper, we work in this direction and make the following contributions: i) we propose a novel approach that combines the advantages of variance-reduced first-order methods and sub-sampled Newton methods, in an efficient way that does not require expensive Hessian matrix computation and inversion; the method extends naturally to composite optimization problems with non-smooth regularization; ii) we show theoretically that, under certain conditions, the proposed approach improves the state-of-the-art iteration complexity, and we demonstrate empirically that it can substantially improve the convergence of existing methods.

2 Minibatch Stochastic Variance Reduced Proximal Iterations

In this section we present the proposed approach for minimizing (1). The high-level description is given in Algorithm 1, which contains two main innovations: i) simultaneous incorporation of first-order and higher-order information via sub-sampling; ii) allowing larger minibatch sizes through acceleration with inexact minimization. We explain these features below.

Algorithm 1  MB-SVRP: Minibatch Stochastic Variance Reduced Proximal Iterations

  Parameters: η, θ, b, ν, ε̄.
  Initialize w̃_0 = 0.
  Sample b items from [n] to form a minibatch B.
  for s = 1, 2, ... do
      Calculate ṽ = (1/n) ∑_{i=1}^n ∇f_i(w̃_{s−1}).
      Initialize y_0 = w_0 = w̃_{s−1}.
      for t = 1, 2, ..., m do
          Sample b items from [n] to form a minibatch B_t.
          Find w_t that approximately solves (2), i.e. f̃_t(w_t) − min_w f̃_t(w) ≤ ε̄, where

              f̃_t(w) := (η/(2b)) (w − w_{t−1})^⊤ [∑_{i∈B} ∇²f_i(w_{t−1})] (w − w_{t−1})
                         + ⟨ (η/b) ∑_{i∈B_t} ∇f_i(y_{t−1}) + η ṽ − (η/b) ∑_{i∈B_t} ∇f_i(w̃_{s−1}), w ⟩
                         + (θ/2) ‖w − y_{t−1}‖².    (2)

          Update: y_t = w_t + ν (w_t − w_{t−1}).
      end for
      Update w̃_s = w_m.
  end for
  Return w̃_s.
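The following is a schematic NumPy transcription of Algorithm 1 as reconstructed above (not the authors' implementation); the callables grad_i, full_grad, and solve_subproblem, as well as the default parameter values, are assumptions made for illustration, and the inner subproblem (2) is delegated entirely to the caller.

import numpy as np

def mb_svrp(grad_i, full_grad, solve_subproblem, n, d,
            b=32, m=50, stages=10, nu=0.5, seed=0):
    # grad_i(i, w): gradient of the i-th component function at w.
    # full_grad(w): gradient of f at w.
    # solve_subproblem(w_prev, y_prev, g, B): approximate minimizer of (2),
    #   built from the sub-sampled Hessian on B and the VR gradient g.
    rng = np.random.default_rng(seed)
    w_tilde = np.zeros(d)
    B = rng.choice(n, size=b, replace=False)   # Hessian minibatch, fixed throughout
    for s in range(stages):
        v_tilde = full_grad(w_tilde)           # anchor gradient at the reference point
        w_prev = y_prev = w_tilde
        for t in range(m):
            B_t = rng.choice(n, size=b, replace=False)
            # variance-reduced minibatch gradient at the momentum point y_{t-1}
            g = (np.mean([grad_i(i, y_prev) for i in B_t], axis=0)
                 + v_tilde
                 - np.mean([grad_i(i, w_tilde) for i in B_t], axis=0))
            w_t = solve_subproblem(w_prev, y_prev, g, B)
            y_prev = w_t + nu * (w_t - w_prev) # momentum: y_t = w_t + nu (w_t - w_{t-1})
            w_prev = w_t
        w_tilde = w_prev                       # w_tilde_s = w_m
    return w_tilde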

Sampling both First-order and Higher-order Information. We first discuss the main building block of the algorithm. At iteration t, given the previous iterate w_{t−1}, consider the following update rule:

    w_t ← argmin_w { (η/(2b)) (w − w_{t−1})^⊤ [∑_{i∈B} ∇²f_i(w_{t−1})] (w − w_{t−1}) + (θ/2) ‖w − w_{t−1}‖²
                     + ⟨ (η/b) ∑_{i∈B_t} ∇f_i(w_{t−1}) + η ∇f(w̃) − (η/b) ∑_{i∈B_t} ∇f_i(w̃), w ⟩ }    (3)

        = w_{t−1} − η [ (η/b) ∑_{i∈B} ∇²f_i(w_{t−1}) + θ I ]^{−1} [ (1/b) ∑_{i∈B_t} ∇f_i(w_{t−1}) + ∇f(w̃) − (1/b) ∑_{i∈B_t} ∇f_i(w̃) ],    (4)

where B and B_t are minibatches sampled at random from {1, ..., n}, both of size b; η and θ are step-size parameters, and w̃ is a reference predictor used for reducing variance. The main feature of update rule (3) is that it incorporates both noisy first-order and noisy higher-order information via minibatch sampling. The sub-sampled Hessian term in (3) injects noisy higher-order information about the objective function, so the rule can be viewed as a variant of the sub-sampled Newton method combined with a minibatch stochastic gradient with variance reduction. Equivalently, (4) is a preconditioned minibatch SVRG update, with (η/b) ∑_{i∈B} ∇²f_i(w_{t−1}) + θI as the preconditioning matrix (see the sketch at the end of this section for a concrete instance).

Large Minibatch Size via Acceleration with Inexact Minimization. Using (3) as the building block, we propose the MB-SVRP (minibatch stochastic variance reduced proximal iterations) method, detailed in Algorithm 1. At the beginning of the algorithm, we form a minibatch B by sampling from {1, ..., n} and fix it for the whole optimization process. Then, following the SVRG method [Johnson and Zhang (2013)], the algorithm is divided into multiple stages, indexed by s. At each stage, we iteratively solve a minimization problem of the form (2), based on the randomly sampled minibatch B_t. Compared with the simple update rule (3), the major difference in Algorithm 1 is a momentum scheme that maintains two sequences {w_t, y_t}, inspired by Nesterov's acceleration technique [Nesterov (2004)] and its recent SVRG variant [Nitanda (2014)]. The main theoretical advantage over minibatch SVRG without momentum [Konečný et al. (2016)] is that this acceleration allows a much larger minibatch size (up to O(√n)) without slowing down the convergence. Since it is often expensive to find the exact minimizer of (2), we settle for an approximate minimizer with objective suboptimality ε̄. When θ is chosen so that f̃_t(w) has enough strong convexity, the subproblems can be solved to high accuracy efficiently; for example, using an SVRG-type algorithm [Johnson and Zhang (2013)] to solve (2), we can find an approximate solution with small suboptimality ε̄ within a constant number of passes over the data in the minibatch B ∪ B_t, provided θ is set appropriately.
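As a concrete instance of the closed-form step (4), the sketch below specializes to ridge regression, where each ∇²f_i = x_i x_i^⊤ + λI is available exactly; the η/b and θ scalings follow the reconstruction of (3)–(4) given above and should be checked against the original paper before reuse.

import numpy as np

def mb_svrp_step(w_prev, w_ref, full_grad_ref, X, y, lam, B, B_t, eta, theta):
    # One preconditioned variance-reduced step of the form (4) for ridge
    # regression: f_i(w) = 0.5 (x_i^T w - y_i)^2 + (lam/2) ||w||^2, so
    # grad f_i(w) = x_i (x_i^T w - y_i) + lam w and hess f_i(w) = x_i x_i^T + lam I.
    b, d = len(B), len(w_prev)
    H = X[B].T @ X[B] / b + lam * np.eye(d)    # (1/b) sum_{i in B} hess f_i

    def batch_grad(w):
        residual = X[B_t] @ w - y[B_t]
        return X[B_t].T @ residual / len(B_t) + lam * w

    # variance-reduced gradient: (1/b) sum_{i in B_t} [grad f_i(w_prev) - grad f_i(w_ref)] + grad f(w_ref)
    g = batch_grad(w_prev) - batch_grad(w_ref) + full_grad_ref
    # preconditioned step: solve (eta * H + theta * I) delta = eta * g
    delta = np.linalg.solve(eta * H + theta * np.eye(d), eta * g)
    return w_prev - delta

For general losses the sub-sampled Hessian on B would be recomputed at w_prev; in the quadratic case above it is constant, which is why the example stays so compact.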

3 Theoretical Results

Besides strong convexity and smoothness, we need the following Hessian Lipschitz condition, together with the notions of effective dimension and statistical leverage (a small numerical illustration of the latter two quantities appears at the end of this section).

Condition 1. For each component function f_i(w), i ∈ [n], the Hessian matrix is M-Lipschitz, i.e.,

    ‖∇²f_i(w) − ∇²f_i(w′)‖ ≤ M ‖w − w′‖  for all w, w′ ∈ R^d and i ∈ [n].

Definition 1 (Effective dimension). Let σ_1, ..., σ_d be the top d eigenvalues of H_0(w*) = (1/n) ∑_{i=1}^n ℓ″(w*^⊤ x_i; b_i) x_i x_i^⊤. The effective dimension d_λ of H_0(w*) (for some λ ≥ 0) is defined as d_λ = ∑_{j=1}^d σ_j / (σ_j + λ).

Definition 2 (Statistical leverage at λ). Let H_λ(w*) = (1/n) ∑_{i=1}^n ℓ″(w*^⊤ x_i; b_i) x_i x_i^⊤ + λI. We say the statistical leverage of the data matrix X is bounded by ρ at λ if

    max_{i∈[n]} ‖H_λ^{−1/2}(w*) ℓ″(w*^⊤ x_i; b_i)^{1/2} x_i‖ / sqrt( (1/n) ∑_{j=1}^n ‖H_λ^{−1/2}(w*) ℓ″(w*^⊤ x_j; b_j)^{1/2} x_j‖² ) ≤ ρ.

Theorem 3. Consider Algorithm 1 on problems with L-smooth and λ-strongly convex component functions that satisfy Condition 1 with Hessian Lipschitz parameter M. Suppose we sample the minibatch B uniformly from {1, ..., n} and set the tuning parameters as

    θ = max{λ, L/b},   b ≤ min{ √n, (λ/(ρ²d_λ)) (L/λ)^{1/3} },   η ≤ min{ b³λ/((ρ²d_λ)² L), bλ/(ρ²d_λ L) },

and suppose we start from an initialization point sufficiently close to the optimum: ‖w_0 − w*‖ ≤ R = O(λ⁴/(L² M)). Given ε > 0, let the subproblem tolerance satisfy ε̄ ≤ (λ/(7L))⁵ ε. Then, for MB-SVRP to find an approximate solution w̃_s of (1) reaching expected ε-objective suboptimality, E f(w̃_s) − min_w f(w) ≤ ε, the total number of oracle calls to solve (2) is no more than

    O( max{ (L/λ)^{1/3}, ρ²d_λ L/(λ n^{1/2}) } log(1/ε) ).

Note that in Algorithm 1 we can use SVRG to solve each subproblem (2), which leads to the following result.

Corollary 4. Assume the conditions of Theorem 3 hold. Moreover, if we use SVRG to solve each subproblem (2) up to suboptimality ε̄, then the total number of gradient evaluations used by the whole MB-SVRP algorithm is upper bounded by

    O( max{ ρ²d_λ (L/λ)^{2/3}, ρ²d_λ L/(λ n) } log(L/λ) log(1/ε) + n log(1/ε) ).

Acceleration. Corollary 4 states that when the condition number is not too large, L/λ ≲ n^{5/4}, MB-SVRP requires only a logarithmic number of passes over the data to find a highly accurate solution. When the condition number is large, MB-SVRP can still be slow. By using the accelerated proximal point framework proposed in [Shalev-Shwartz and Zhang (2016); Frostig et al. (2015); Lin et al. (2015a)], it is possible to obtain an accelerated convergence rate with a milder dependence on the condition number. Table 1 summarizes the main results.

Table 1: Comparison of the iteration complexity of various finite-sum quadratic optimization algorithms when the effective dimension and statistical leverage are bounded. We compare different relative scales of the condition number κ = L/λ and the sample size n, ignoring logarithmic factors.

                  κ ≤ n              κ = n^{4/3}        κ = n^{3/2}        κ = n²                   κ = n³
    SVRG          n                  n^{4/3}            n^{3/2}            n²                       n³
    MB-SVRP       ρ²d_λ n            ρ²d_λ n            ρ²d_λ n            ρ²d_λ n^{3/2}            ρ²d_λ n^{5/2}
    Acc-SVRG      n                  n^{7/6}            n^{5/4}            n^{3/2}                  n²
    Acc-MB-SVRP   (ρ²d_λ)^{1/2} n    (ρ²d_λ)^{1/2} n    (ρ²d_λ)^{1/2} n    (ρ²d_λ)^{1/2} n^{5/4}    (ρ²d_λ)^{1/2} n^{7/4}
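To illustrate Definitions 1 and 2 numerically, the snippet below computes the effective dimension d_λ and the smallest leverage bound ρ for a data matrix under squared loss (so ℓ″ ≡ 1 and w* plays no role); it is an illustration written for this note, not code from the paper.

import numpy as np

def effective_dimension(X, lam):
    # d_lambda = sum_j sigma_j / (sigma_j + lambda), with sigma_j the
    # eigenvalues of H_0 = (1/n) X^T X (squared loss: curvature weights are 1).
    n = X.shape[0]
    sigma = np.linalg.eigvalsh(X.T @ X / n)
    return float(np.sum(sigma / (sigma + lam)))

def statistical_leverage(X, lam):
    # Smallest rho satisfying Definition 2 under squared loss:
    # max_i ||H_lam^{-1/2} x_i|| / sqrt((1/n) sum_j ||H_lam^{-1/2} x_j||^2).
    n, d = X.shape
    H_lam = X.T @ X / n + lam * np.eye(d)
    scores = np.einsum('ij,ij->i', X, np.linalg.solve(H_lam, X.T).T)  # x_i^T H_lam^{-1} x_i
    return float(np.sqrt(scores.max() / scores.mean()))

X = np.random.default_rng(0).standard_normal((500, 20))
print(effective_dimension(X, lam=0.1), statistical_leverage(X, lam=0.1))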

References

[Allen-Zhu (2016)] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arXiv preprint arXiv:1603.05953, 2016.

[Bollapragada et al. (2016)] R. Bollapragada, R. Byrd, and J. Nocedal. Exact and inexact subsampled Newton methods for optimization. arXiv preprint arXiv:1609.08502, 2016.

[Bottou et al. (2016)] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

[Byrd et al. (2011)] R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

[Byrd et al. (2016)] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

[Cotter et al. (2011)] A. Cotter, O. Shamir, N. Srebro, and K. Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pages 1647–1655, 2011.

[Defazio et al. (2014)] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[Dekel et al. (2012)] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.

[Devolder et al. (2014)] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

[Erdogdu and Montanari (2015)] M. A. Erdogdu and A. Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pages 3052–3060, 2015.

[Frostig et al. (2015)] R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of The 32nd International Conference on Machine Learning, pages 2540–2548, 2015.

[Gower et al. (2016)] R. Gower, D. Goldfarb, and P. Richtárik. Stochastic block BFGS: squeezing more curvature out of data. In International Conference on Machine Learning, pages 1869–1878, 2016.

[Johnson and Zhang (2013)] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[Konečný et al. (2016)] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.

[Lan (2012)] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.

[Lin et al. (2015a)] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[Lin et al. (2015b)] Q. Lin, Z. Lu, and L. Xiao. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015.
[Liu and Nocedal (1989)] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

[Moritz et al. (2016)] P. Moritz, R. Nishihara, and M. Jordan. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pages 249–258, 2016.

[Nesterov (2004)] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2004.

[Nitanda (2014)] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pages 1574–1582, 2014.

[Roosta-Khorasani and Mahoney (2016a)] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods I: globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016.

[Roosta-Khorasani and Mahoney (2016b)] F. Roosta-Khorasani and M. W. Mahoney. Sub-sampled Newton methods II: local convergence rates. arXiv preprint arXiv:1601.04738, 2016.

[Roux et al. (2012)] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[Schmidt et al. (2011)] M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458–1466, 2011.

[Shalev-Shwartz and Zhang (2013)] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[Shalev-Shwartz and Zhang (2016)] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.

[Wang et al. (2017a)] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Proceedings of The 30th Conference on Learning Theory, 2017.

[Wang et al. (2017b)] X. Wang, S. Ma, D. Goldfarb, and W. Liu. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM Journal on Optimization, 27(2):927–956, 2017.

[Xu et al. (2016)] P. Xu, J. Yang, F. Roosta-Khorasani, C. Ré, and M. W. Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.

[Zhang and Xiao (2015)] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of The 32nd International Conference on Machine Learning, pages 353–361, 2015.