Randomized Dual Coordinate Ascent with Arbitrary Sampling

Zheng Qu, Peter Richtárik, Tong Zhang

November 21, 2014

(Zheng Qu and Peter Richtárik: School of Mathematics, The University of Edinburgh, United Kingdom. Tong Zhang: Department of Statistics, Rutgers University, New Jersey, USA, and Big Data Lab, Baidu Inc, China. Acknowledgments: the first two authors would like to acknowledge support from the EPSRC Grant EP/K02325X/1, "Accelerated Coordinate Descent Methods for Big Data Optimization".)

Abstract

We study the problem of minimizing the average of a large number of smooth convex functions penalized with a strongly convex regularizer. We propose and analyze a novel primal-dual method (Quartz) which at every iteration samples and updates a random subset of the dual variables, chosen according to an arbitrary distribution. In contrast to typical analysis, we directly bound the decrease of the primal-dual error (in expectation), without the need to first analyze the dual error. Depending on the choice of the sampling, we obtain efficient serial, parallel and distributed variants of the method. In the serial case, our bounds match the best known bounds for SDCA (both with uniform and importance sampling). With standard mini-batching, our bounds predict initial data-independent speedup as well as additional data-driven speedup which depends on spectral and sparsity properties of the data. We calculate theoretical speedup factors and find that they are excellent predictors of actual speedup in practice. Moreover, we illustrate that it is possible to design an efficient mini-batch importance sampling. The distributed variant of Quartz is the first distributed SDCA-like method with an analysis for non-separable data.

1 Introduction

In this paper we consider a primal-dual pair of structured convex optimization problems which has (in several variants of varying degrees of generality) attracted a lot of attention in the past few years in the machine learning and optimization communities [8, 9, 29, 27, 30, 28, 37].

1.1 The problem

Let $A_1,\dots,A_n$ be a collection of $d$-by-$m$ real matrices and $\phi_1,\dots,\phi_n$ be $1/\gamma$-smooth convex functions from $\mathbb{R}^m$ to $\mathbb{R}$, where $\gamma>0$. Further, let $g:\mathbb{R}^d\to\mathbb{R}$ be a 1-strongly convex function and $\lambda>0$ a regularization parameter. We are interested in solving the following primal problem:

$$\min_{w=(w_1,\dots,w_d)\in\mathbb{R}^d} \left[ P(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n \phi_i(A_i^\top w) + \lambda g(w) \right]. \qquad (1)$$

In the machine learning context, the matrices $\{A_i\}$ are interpreted as examples/samples, $w$ is a linear predictor, the function $\phi_i$ is the loss incurred by the predictor on example $A_i$, $g$ is a regularizer, $\lambda$ is a regularization parameter, and (1) is the regularized empirical risk minimization problem.

However, the above problem has many other applications outside machine learning. In this paper we are especially interested in problems where $n$ is very big (millions, billions), and much larger than $d$. This is often the case in big data applications.

Let $g^*:\mathbb{R}^d\to\mathbb{R}$ be the convex conjugate of $g$ and, for each $i$, let $\phi_i^*:\mathbb{R}^m\to\mathbb{R}$ be the convex conjugate of $\phi_i$. (In this paper, the convex (Fenchel) conjugate of a function $\xi:\mathbb{R}^k\to\mathbb{R}$ is the function $\xi^*:\mathbb{R}^k\to\mathbb{R}$ defined by $\xi^*(u)=\sup_{s\in\mathbb{R}^k}\{\langle s,u\rangle-\xi(s)\}$, where $\langle\cdot,\cdot\rangle$ is the standard inner product and $\|\cdot\|$ the L2 norm.) Associated with the primal problem (1) is the Fenchel dual problem:

$$\max_{\alpha=(\alpha_1,\dots,\alpha_n)\in\mathbb{R}^N=\mathbb{R}^{nm}} \left[ D(\alpha) \stackrel{\text{def}}{=} -f(\alpha)-\psi(\alpha) \right], \qquad (2)$$

where $\alpha=(\alpha_1,\dots,\alpha_n)\in\mathbb{R}^N=\mathbb{R}^{nm}$ is obtained by stacking the dual variables (blocks) $\alpha_i\in\mathbb{R}^m$, $i=1,\dots,n$, on top of each other, and the functions $f$ and $\psi$ are defined by

$$f(\alpha)\stackrel{\text{def}}{=}\lambda g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n A_i\alpha_i\right), \qquad (3)$$

$$\psi(\alpha)\stackrel{\text{def}}{=}\frac{1}{n}\sum_{i=1}^n \phi_i^*(-\alpha_i). \qquad (4)$$

Note that $f$ is convex and smooth and $\psi$ is strongly convex and block separable.

1.2 Contributions

We now briefly list the main contributions of this work.

Quartz. We propose a new algorithm, which we call Quartz, for simultaneously solving the primal (1) and dual (2) problems. (Strange as it may seem, this algorithm name appeared to one of the authors of this paper in a dream. According to Wikipedia, quartz is the second most abundant mineral in the Earth's continental crust; there are many different varieties of quartz, several of which are semi-precious gemstones. Our method also comes in many variants. It later came as a surprise to the authors that the name could be interpreted as QU And Richtárik and Tong Zhang. Whether the subconscious mind of the sleeping coauthor who dreamed up the name knew about this connection or not is not known.) On the dual side, at each iteration our method selects and updates a random subset (sampling) $\hat S\subseteq\{1,\dots,n\}$ of the dual variables/blocks. We assume that these sets are i.i.d. throughout the iterations. However, we do not impose any additional assumptions on the distribution apart from the necessary requirement that each block $i\in[n]$ needs to be chosen with a positive probability: $p_i\stackrel{\text{def}}{=}\mathbb{P}(i\in\hat S)>0$. Quartz is the first SDCA-like method analyzed for an arbitrary sampling. The dual updates are then used to perform an update to the primal variable $w$ and the process is repeated. Our primal updates are different (less aggressive) from those used in SDCA [29] and Prox-SDCA [27].

Main result. We prove that starting from an initial pair $(w^0,\alpha^0)$, Quartz finds a pair $(w,\alpha)$ for which $P(w)-D(\alpha)\le\epsilon$ (in expectation) in at most

$$\max_i\left(\frac{1}{p_i}+\frac{v_i}{p_i\lambda\gamma n}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right) \qquad (5)$$

iterations.

The parameters $v_1,\dots,v_n$ are assumed to satisfy the following ESO (expected separable overapproximation) inequality:

$$\mathbb{E}_{\hat S}\left[\left\|\sum_{i\in\hat S} A_i h_i\right\|^2\right] \le \sum_{i=1}^n p_i v_i \|h_i\|^2. \qquad (6)$$

Moreover, the parameters are needed to run the method (they determine the stepsizes), and hence it is critical that they can be cheaply computed before the method starts. As we will show, for many samplings of interest this can be done in the time required to read the data $\{A_i\}$. We wish to point out that (6) always holds for some parameters $\{v_i\}$. Indeed, the left hand side is a quadratic function of $h$ and hence the inequality holds for large-enough $v_i$. Having said that, the size of these parameters directly influences the complexity, and hence one wants to obtain as tight bounds as possible.

Arbitrary sampling. As described above, Quartz uses an arbitrary sampling for picking the dual variables to be updated in each iteration. To the best of our knowledge, only a single paper exists in the literature where a stochastic method using an arbitrary sampling was analyzed: the NSync method of Richtárik and Takáč [22] for unconstrained minimization of a strongly convex function. Assumption (6) was for the first time introduced there (in a more general form); we are using it here in the special case of a quadratic function. However, NSync is not a primal-dual method. Besides NSync, the closest works to ours in terms of the generality of the sampling are the PCDM algorithm of Richtárik and Takáč [23], the SPCDM method of Fercoq and Richtárik [7] and the APPROX method of Fercoq and Richtárik [6]. All these are randomized coordinate descent methods, and all were analyzed for arbitrary uniform samplings (i.e., samplings satisfying $\mathbb{P}(i\in\hat S)=\mathbb{P}(i'\in\hat S)$ for all $i,i'\in[n]$). Again, none of these methods were analyzed in a primal-dual framework.

Direct primal-dual analysis. Virtually all methods for solving (1) by performing stochastic steps in the dual (2), such as SDCA [29], SDCA for the SVM dual [30], Prox-SDCA [27], ASDCA [28] and APCG [15], are analyzed by first establishing dual convergence and then proving that the duality gap is bounded by the dual residual. The SPDC method of Zhang and Xiao [36], which is a stochastic coordinate update variant of the Chambolle-Pock method [3], is an exception. Our analysis is novel, and directly primal-dual in nature. As a result, our proof is more direct, and the logarithmic term in our bound has a simpler form.

Flexibility: many important variants. Our method is very flexible: by specializing it to specific samplings, we obtain numerous variants, some similar (but not identical) to existing methods in the literature, and some very new and of significance to big data optimization.

Serial uniform sampling. If $\hat S$ always picks a single block, uniformly at random ($p_i=1/n$), then the dual updates of Quartz are similar to those of SDCA [29] and Prox-SDCA [27]. The leading term in the complexity bound (5) becomes $n+\max_i\lambda_{\max}(A_i^\top A_i)/(\lambda\gamma)$, which matches the bounds obtained in these papers. However, our logarithmic term is simpler.

Serial optimal sampling (importance sampling). If $\hat S$ always picks a single block, with $p_i$ chosen so as to minimize the complexity bound (5), we obtain the same importance sampling as that recently used in the IProx-SDCA method [37]. Our bound becomes $n+\frac{1}{n}\sum_i\lambda_{\max}(A_i^\top A_i)/(\lambda\gamma)$, which matches the bound in [37]. Again, our logarithmic term is better.

τ-nice sampling. If we now let $\hat S$ be a random subset of $[n]$ of size $\tau$ chosen uniformly at random (this sampling is called $\tau$-nice in [23]), we obtain a mini-batch (parallel) variant of Quartz. There are only a handful of primal-dual stochastic methods which use mini-batching. The first such method was a mini-batch version of SDCA specialized to training L2-regularized linear SVMs with hinge loss [30]. Besides this, two accelerated mini-batch methods have recently been proposed: ASDCA of Shalev-Shwartz and Zhang [28] and SPDC of Zhang and Xiao [36]. The complexity bound of Quartz specialized to the $\tau$-nice sampling is different and, despite Quartz not being an accelerated method, can be better in certain regimes (we will do a detailed comparison in Section 4).

Distributed sampling. To the best of our knowledge, no other samplings than those described above were used in stochastic primal-dual methods. However, there are many additional interesting samplings proposed for randomized coordinate descent, but never applied to the primal-dual framework. For instance, we can use the distributed sampling which led to the development of the Hydra algorithm [21] (distributed coordinate descent) and its accelerated variant Hydra 2 (Hydra squared) [5]. Using this sampling, Quartz can be efficiently implemented in a distributed environment: partition the examples across the nodes of a cluster, and let each node in each iteration update a random subset of the variables corresponding to the examples it owns.

Product sampling. We describe a novel sampling, which we call product sampling, that can be both non-serial and non-uniform. This is the first time such a sampling has been described and an SDCA-like method using it analyzed. For suitable data (if the examples can be partitioned into several groups no two of which share a feature), this sampling can lead to linear or nearly linear speedup when compared to the serial uniform sampling.

Other samplings. While we develop the analysis of Quartz for an arbitrary sampling, we do not compute the ESO parameters $\{v_i\}$ for any other samplings in this paper. However, there are several other interesting choices. We refer the reader to [23] and [22] for further examples of uniform and non-uniform samplings, respectively. All that must be done for any new $\hat S$ is to find parameters $v_i$ for which (6) holds, and the complexity of the new variant of Quartz is then given by (5).

Further data-driven speedup. Existing mini-batch stochastic primal-dual methods achieve linear speedup up to a certain mini-batch size which depends on $n$, $\lambda$ and $\gamma$. Quartz obtains this data-independent speedup, but also obtains further data-driven speedup. This is caused by the fact that Quartz uses more aggressive dual stepsizes, informed by the data through the ESO parameters $\{v_i\}$. The smaller these constants, the better the speedup. For instance, we will show that higher data sparsity leads to smaller $\{v_i\}$ and hence to better speedup. To illustrate this, consider the $\tau$-nice sampling (hence $p_i=\tau/n$ for all $i$) and the extreme case of perfectly sparse data (each feature $j\in[d]$ appearing in a single example $A_i$). Then (6) holds with $v_i=\lambda_{\max}(A_i^\top A_i)$ for all $i$, and hence the leading term in (5) becomes $n/\tau+\max_i\lambda_{\max}(A_i^\top A_i)/(\gamma\lambda\tau)$, predicting perfect speedup in the mini-batch size $\tau$. We derive theoretical speedup factors and show that these are excellent predictors of the actual behavior of the method in an implementation. This was previously observed for the PCDM method [23] (which is not primal-dual).

Quartz vs purely primal and purely dual methods. In the special case when $\hat S$ is the serial uniform sampling, the complexity of Quartz is similar to the bounds recently obtained by several purely primal stochastic and semi-stochastic gradient methods (all having reduced variance of the gradient estimate) such as SAG [25], SVRG [11], S2GD [14], SAGA [4], mS2GD [12] and MISO [16]. In the case of the serial optimal sampling, relevant purely primal methods with similar guarantees are ProxSVRG [33] and S2CD [13]. A mini-batch primal method, mS2GD, was analyzed in [12], achieving a similar bound to Quartz specialized to the $\tau$-nice sampling. Purely dual stochastic coordinate descent methods with similar bounds to Quartz (for both the serial uniform and serial optimal sampling, for problems of varying similarity and generality when compared to (2)) include SCD [26], RCDM [19], UCDC/RCDC [24], ICD [32] and RCD [18]. These methods were then generalized to the $\tau$-nice sampling in SHOTGUN [2], further generalized to arbitrary uniform samplings in PCDM [23], SPCDM [7] and APPROX [6] (which is an accelerated method), and to arbitrary (even nonuniform) samplings in NSync [22]. Another accelerated method, BOOM, was proposed in [17]. Distributed randomized coordinate descent methods with a purely dual analysis include Hydra [21] and Hydra 2 [5] (accelerated variant of Hydra). Quartz specialized to the distributed sampling achieves the same rate as Hydra, but for both the primal and dual problems simultaneously.

General problem. We consider problem (1), and consequently the associated dual, in a rather general form; most existing primal-dual methods focus on the case when $g$ is a quadratic (e.g., [29, 28]) or $m=1$ (e.g., [36]). Lower bounds for a variant of problem (1) were recently established by Agarwal and Bottou [1].

1.3 Outline

In Section 2 we describe the algorithm and show that it admits a natural interpretation in terms of Fenchel duality. We also outline the similarities and differences of the primal and dual update steps with SDCA-like methods. In Section 3 we show how parameters $\{v_i\}$ satisfying the ESO inequality (6) can be computed for several selected samplings. We then proceed to Section 4, where we state the main result and specialize it to some of the samplings discussed in Section 3. Sections 5 and 6 deal with Quartz specialized to the $\tau$-nice and distributed sampling, respectively. We also give a detailed comparison of our results with existing results for related primal-dual stochastic methods in the literature, and analyze theoretical speedup factors. We then provide the proof of the main complexity result in Section 7. In Section 8 we perform numerical experiments on the problem of training an $L_2$-regularized linear support vector machine with square and smoothed hinge loss on real datasets. Finally, in Section 9 we conclude.

2 The Quartz Algorithm

In this section we describe our method (Algorithm 1).

2.1 Preliminaries

The most important parameter of Quartz is a random sampling $\hat S$ of the dual variables $[n]=\{1,2,\dots,n\}$. That is, $\hat S$ is a random subset of $[n]$, or more precisely, a random set-valued mapping with values being the subsets of $[n]$. In order to guarantee that each block (dual variable) has a chance to get updated by the method, we necessarily need to make the following assumption.

Assumption 1 (Proper sampling). $\hat S$ is a proper sampling. That is,

$$p_i \stackrel{\text{def}}{=} \mathbb{P}(i\in\hat S) > 0, \qquad i\in[n]. \qquad (7)$$

However, we shall not make any other assumption on $\hat S$. Prior to running the algorithm, we compute positive constants $v_1,\dots,v_n$ satisfying (6) (such constants always exist), as these are used to define the stepsize parameter $\theta$ used throughout:

$$\theta = \min_i \frac{p_i\lambda\gamma n}{v_i+\lambda\gamma n}. \qquad (8)$$

We shall show how this parameter can be computed for various samplings in Section 3. Let us now formalize the notions of $1/\gamma$-smoothness and strong convexity.

Assumption 2 (Loss). For each $i\in[n]$, the loss function $\phi_i:\mathbb{R}^m\to\mathbb{R}$ is convex, differentiable and has $1/\gamma$-Lipschitz continuous gradient with respect to the L2 norm, where $\gamma$ is a positive constant:

$$\|\nabla\phi_i(x)-\nabla\phi_i(y)\| \le \frac{1}{\gamma}\|x-y\|, \qquad \forall x,y\in\mathbb{R}^m.$$

For brevity, the last property is often called $1/\gamma$-smoothness. It follows that $\phi_i^*$ is $\gamma$-strongly convex.

Assumption 3 (Regularizer). The regularizer $g:\mathbb{R}^d\to\mathbb{R}$ is 1-strongly convex. That is,

$$g(w') \ge g(w) + \langle \nabla g(w), w'-w\rangle + \tfrac{1}{2}\|w'-w\|^2, \qquad \forall w,w'\in\mathbb{R}^d,$$

where $\nabla g(w)$ is a subgradient of $g$ at $w$. It follows that $g^*$ is 1-smooth.

2.2 Description of the method

Quartz starts with an initial pair of primal and dual vectors $(w^0,\alpha^0)$. Given $w^{t-1}$ and $\alpha^{t-1}$, the method maintains the vector

$$\bar\alpha^{t-1} = \frac{1}{\lambda n}\sum_{i=1}^n A_i\alpha_i^{t-1}. \qquad (9)$$

Initially this is computed from scratch, and subsequently it is maintained in an efficient manner at the end of each iteration. Let us now describe how the vectors $w^t$ and $\alpha^t$ are computed. Quartz first updates the primal vector $w^t$ by setting it to a convex combination of the previous value $w^{t-1}$ and $\nabla g^*(\bar\alpha^{t-1})$:

$$w^t = (1-\theta)w^{t-1} + \theta\,\nabla g^*(\bar\alpha^{t-1}). \qquad (10)$$

We then proceed to select, and subsequently update, a random subset $S_t\subseteq[n]$ of the dual variables, independently from the sets drawn in previous iterations, and following the distribution of $\hat S$. Clearly, there are many ways in which the distribution of $\hat S$ can be chosen, leading to numerous variants of Quartz. We shall describe some of them in Section 3. We allow two options for the actual computation of the dual updates. Once the dual variables are updated, the vector $\bar\alpha^t$ is updated in an efficient manner so that (9) holds. The entire process is repeated.

Algorithm 1 (Quartz)

  Parameters: proper random sampling $\hat S$ and a positive vector $v\in\mathbb{R}^n$
  Initialization: choose $\alpha^0\in\mathbb{R}^N$ and $w^0\in\mathbb{R}^d$; set $p_i=\mathbb{P}(i\in\hat S)$, $\theta=\min_i \frac{p_i\lambda\gamma n}{v_i+\lambda\gamma n}$ and $\bar\alpha^0=\frac{1}{\lambda n}\sum_i A_i\alpha_i^0$
  for $t\ge 1$ do
    $w^t=(1-\theta)w^{t-1}+\theta\,\nabla g^*(\bar\alpha^{t-1})$
    $\alpha^t=\alpha^{t-1}$
    Generate a random set $S_t\subseteq[n]$, following the distribution of $\hat S$
    for $i\in S_t$ do
      Calculate $\Delta\alpha_i^t$ using one of the following options:
        Option I: $\Delta\alpha_i^t=\arg\max_{\Delta\in\mathbb{R}^m}\left\{-\phi_i^*\big({-}(\alpha_i^{t-1}+\Delta)\big)-\langle\nabla g^*(\bar\alpha^{t-1}),A_i\Delta\rangle-\frac{v_i\|\Delta\|^2}{2\lambda n}\right\}$
        Option II: $\Delta\alpha_i^t=\theta p_i^{-1}\big({-\nabla\phi_i(A_i^\top w^t)}-\alpha_i^{t-1}\big)$
      $\alpha_i^t=\alpha_i^{t-1}+\Delta\alpha_i^t$
    end for
    $\bar\alpha^t=\bar\alpha^{t-1}+\frac{1}{\lambda n}\sum_{i\in S_t}A_i\Delta\alpha_i^t$
  end for
  Output: $(w^t,\alpha^t)$

Fenchel duality interpretation. Quartz has a natural interpretation in terms of Fenchel duality. Fix a primal-dual pair of vectors $(w,\alpha)\in\mathbb{R}^d\times\mathbb{R}^N$ and define $\bar\alpha=\frac{1}{\lambda n}\sum_i A_i\alpha_i$. The duality gap for the pair $(w,\alpha)$ can be decomposed as follows:

$$P(w)-D(\alpha) \stackrel{(1)+(2)}{=} \lambda\big(g(w)+g^*(\bar\alpha)\big)+\frac{1}{n}\sum_i\big(\phi_i(A_i^\top w)+\phi_i^*(-\alpha_i)\big)$$
$$= \underbrace{\lambda\big(g(w)+g^*(\bar\alpha)-\langle w,\bar\alpha\rangle\big)}_{\mathrm{GAP}_g(w,\alpha)} + \frac{1}{n}\sum_i\underbrace{\big(\phi_i(A_i^\top w)+\phi_i^*(-\alpha_i)+\langle A_i^\top w,\alpha_i\rangle\big)}_{\mathrm{GAP}_{\phi_i}(w,\alpha)}.$$

By the Fenchel-Young inequality, $\mathrm{GAP}_g(w,\alpha)\ge 0$ and $\mathrm{GAP}_{\phi_i}(w,\alpha)\ge 0$ for all $i$, which proves weak duality for the problems (1) and (2), i.e., $P(w)\ge D(\alpha)$. The pair $(w,\alpha)$ is optimal when both $\mathrm{GAP}_g$ and $\mathrm{GAP}_{\phi_i}$ for all $i$ are zero. It is known that this happens precisely when the following optimality conditions hold:

$$w=\nabla g^*(\bar\alpha), \qquad (11)$$
$$\alpha_i=-\nabla\phi_i(A_i^\top w), \qquad i\in[n]. \qquad (12)$$

We will now interpret the primal and dual steps of Quartz in terms of the above discussion. At iteration $t$ we first set the primal variable $w^t$ to a convex combination of its current value $w^{t-1}$ and a value that would set $\mathrm{GAP}_g$ to zero: see (10). Hence, our primal update is not as aggressive as that of Prox-SDCA. This is followed by adjusting the dual variables corresponding to a randomly chosen set of examples $S_t$. Under Option II, for each example $i\in S_t$, the $i$-th dual variable $\alpha_i^t$ is set to a convex combination of its current value $\alpha_i^{t-1}$ and the value that would set $\mathrm{GAP}_{\phi_i}$ to zero:

$$\alpha_i^t = \left(1-\frac{\theta}{p_i}\right)\alpha_i^{t-1} + \frac{\theta}{p_i}\big({-\nabla\phi_i(A_i^\top w^t)}\big).$$
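To make the update rules above concrete, here is a minimal NumPy sketch of Quartz with Option II, restricted to $m=1$ (scalar dual variables), $g(w)=\tfrac12\|w\|^2$ (so $\nabla g^*(\bar\alpha)=\bar\alpha$) and a serial sampling with probabilities $p$. The data layout, the helper name `grad_phi` and the overall interface are illustrative assumptions and not code from the paper.

```python
import numpy as np

def quartz_option2(A, grad_phi, lam, gamma, p, v, n_iters, seed=0):
    """Minimal sketch of Quartz (Option II) for m = 1 and g(w) = 0.5*||w||^2.

    A        : d-by-n array whose i-th column is the example A_i
    grad_phi : callable, grad_phi(i, z) returns phi_i'(z) for a scalar z
    lam      : regularization parameter lambda > 0
    gamma    : the losses phi_i are (1/gamma)-smooth
    p, v     : length-n arrays of sampling probabilities p_i and ESO parameters v_i
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    alpha = np.zeros(n)                         # dual variables
    w = np.zeros(d)                             # primal variable
    abar = A @ alpha / (lam * n)                # abar = (1/(lam*n)) * sum_i A_i alpha_i, see (9)
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))   # stepsize (8)

    for _ in range(n_iters):
        # primal step (10); for g = 0.5*||.||^2 we have grad g*(abar) = abar
        w = (1.0 - theta) * w + theta * abar
        # dual step: serial sampling, block i chosen with probability p_i
        i = rng.choice(n, p=p)
        delta = (theta / p[i]) * (-grad_phi(i, A[:, i] @ w) - alpha[i])  # Option II
        alpha[i] += delta
        abar += A[:, i] * delta / (lam * n)     # keep abar consistent with (9)
    return w, alpha
```

For instance, for the quadratic loss $\phi_i(z)=\tfrac12(z-y_i)^2$ (which is 1-smooth, so $\gamma=1$) one would pass `grad_phi = lambda i, z: z - y[i]`.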

Quartz vs Prox-SDCA. In the special case when $\hat S$ is the serial uniform sampling (i.e., $p_i=1/n$ for all $i\in[n]$), Quartz can be compared to Proximal Stochastic Dual Coordinate Ascent (Prox-SDCA) [28, 29]. Indeed, if Option I is always used in Quartz, then the dual update of $\alpha_i^t$ in Quartz is exactly the same as the dual update of Prox-SDCA using its Option I. In this case, the difference between our method and Prox-SDCA lies in the update of the primal variable $w^t$: while Quartz performs the update (10), Prox-SDCA (see also [34, 10]) performs the more aggressive update $w^t=\nabla g^*(\bar\alpha^{t-1})$.

3 Expected Separable Overapproximation

For the sake of brevity, it will be convenient to establish some notation. Let $A=[A_1,\dots,A_n]\in\mathbb{R}^{d\times N}=\mathbb{R}^{d\times nm}$ be the block matrix with blocks $A_i\in\mathbb{R}^{d\times m}$. Further, let $A_{j:}$ denote the $j$-th row of $A$. Likewise, for $h\in\mathbb{R}^N$ we will write $h=(h_1,\dots,h_n)$, where $h_i\in\mathbb{R}^m$, so that $Ah=\sum_i A_ih_i$. For a vector of positive weights $u\in\mathbb{R}^n$, we define a weighted Euclidean norm in $\mathbb{R}^N$ by

$$\|h\|_u^2 \stackrel{\text{def}}{=} \sum_{i=1}^n u_i\|h_i\|^2, \qquad (13)$$

where $\|\cdot\|$ is the standard Euclidean norm on $\mathbb{R}^m$. For $S\subseteq[n]\stackrel{\text{def}}{=}\{1,\dots,n\}$ and $h\in\mathbb{R}^N$ we use the notation $h_{[S]}$ to denote the vector in $\mathbb{R}^N$ coinciding with $h$ for blocks $i\in S$ and equal to zero elsewhere:

$$(h_{[S]})_i = \begin{cases} h_i, & \text{if } i\in S,\\ 0, & \text{otherwise.}\end{cases}$$

With this notation, we have

$$Ah_{[S]} = \sum_{i\in S}A_ih_i. \qquad (14)$$

As mentioned before, in our analysis we require that the random sampling $\hat S$ and the positive vector $v\in\mathbb{R}^n$ used in Quartz satisfy inequality (6). We shall now formalize this as an assumption, using the compact notation established above.

Assumption 4 (ESO). The following inequality holds for all $h\in\mathbb{R}^N$:

$$\mathbb{E}\big[\|Ah_{[\hat S]}\|^2\big] \le \|h\|^2_{p\circ v}, \qquad (15)$$

where $p=(p_1,\dots,p_n)$ is defined in (7), $v=(v_1,\dots,v_n)>0$ and $p\circ v=(p_1v_1,\dots,p_nv_n)\in\mathbb{R}^n$.

Note that for any proper sampling $\hat S$, there must exist a vector $v>0$ satisfying Assumption 4. Hence, this is an assumption that such a vector $v$ is readily available. Indeed, the term on the left is a finite average of convex quadratic functions of $h$, and hence is a convex quadratic. Moreover, we can write

$$\mathbb{E}\big[\|Ah_{[\hat S]}\|^2\big]=\mathbb{E}\big[h_{[\hat S]}^\top A^\top Ah_{[\hat S]}\big]=h^\top\big(P\circ A^\top A\big)h,$$

where $\circ$ denotes the Hadamard (component-wise) product of matrices and $P\in\mathbb{R}^{N\times N}$ is an $n$-by-$n$ block matrix with block $(i,j)$ equal to $\mathbb{P}(i\in\hat S, j\in\hat S)\,\mathbf{1}_m$, with $\mathbf{1}_m$ being the $m$-by-$m$ matrix of all ones. Hence (15) merely means to upper bound the matrix $P\circ A^\top A$ by the $n$-by-$n$ block diagonal matrix $D=D_{p,v}$, the $i$-th block of which is equal to $p_iv_iI_m$, with $I_m$ being the $m$-by-$m$ identity matrix.

There is an infinite number of ways how this can be done in theory. Indeed, for any proper sampling $\hat S$ and any positive $u\in\mathbb{R}^n$, (15) holds with $v=tu$, where $t=\lambda_{\max}\big(D_{p,u}^{-1/2}(P\circ A^\top A)D_{p,u}^{-1/2}\big)$, since then $P\circ A^\top A\preceq tD_{p,u}=D_{p,v}$. In practice, and especially in the big data setting when $n$ is very large, computing $v$ by solving an eigenvalue problem with an $N\times N$ matrix (recall that $N=nm$) will be either inefficient or impossible. It is therefore important that a good (i.e., small), albeit perhaps suboptimal, $v$ can be identified cheaply. In all the cases we consider in this paper, the identification of $v$ can be done during the time the data is being read, or in time roughly equal to a single pass through the data matrix $A$.

In the special case of uniform samplings (but for arbitrary smooth functions and not just quadratics, which is all we need here), inequality (15) was introduced and studied by Richtárik and Takáč [23], in the context of complexity analysis of (non primal-dual) parallel block coordinate descent methods. (A sampling $\hat S$ is uniform if $p_i=p_j$ for all $i,j$. It is easy to see that then, necessarily, $p_i=\mathbb{E}[|\hat S|]/n$ for all $i$. The ESO inequality studied in [23] is of the form $\mathbb{E}[\xi(x+h_{[\hat S]})]\le\xi(x)+\frac{\mathbb{E}[|\hat S|]}{n}\big(\langle\nabla\xi(x),h\rangle+\frac{1}{2}\|h\|_v^2\big)$; in the case of a uniform sampling, $x=0$ and $\xi(h)=\frac{1}{2}\|Ah\|^2$, we recover (15).) A variant of ESO for arbitrary (possibly nonuniform) samplings was introduced in [22], and to the best of our knowledge that is the only work analyzing a stochastic coordinate descent method which uses an arbitrary sampling. However, NSync is not a primal-dual method and applies to a different problem (unconstrained minimization of a smooth strongly convex function). Besides [23, 22], ESO inequalities were further studied in [30, 31, 7, 21, 6, 5, 12].

3.1 Serial samplings

The most studied sampling in the literature on stochastic optimization is the serial sampling, which corresponds to the selection of a single block $i\in[n]$. That is, $|\hat S|=1$ with probability 1. The name "serial" points to the fact that a method using such a sampling will typically be a serial (as opposed to parallel) method, updating a single block (dual variable) at a time. A serial sampling is uniquely characterized by the vector of probabilities $p=(p_1,\dots,p_n)$, where $p_i$ is defined by (7). It turns out that we can find a vector $v>0$ for which (15) holds for any serial sampling, independently of its distribution given by $p$.

Lemma 5. If $\hat S$ is a serial sampling (i.e., if $|\hat S|=1$ with probability 1), then Assumption 4 is satisfied for

$$v_i=\lambda_{\max}(A_i^\top A_i), \qquad i\in[n]. \qquad (16)$$

Proof. Note that for any $h\in\mathbb{R}^N$,

$$\mathbb{E}\big[\|Ah_{[\hat S]}\|^2\big]=\sum_i p_i\|Ah_{[\{i\}]}\|^2 \stackrel{(14)}{=} \sum_i p_i\,h_i^\top A_i^\top A_ih_i \le \sum_i p_i\lambda_{\max}(A_i^\top A_i)\|h_i\|^2 \stackrel{(13)}{=} \|h\|^2_{p\circ v}.$$

Note that $v_i$ is the largest eigenvalue of an $m$-by-$m$ matrix. If $m$ is relatively small (and in many machine learning applications one has $m=1$, as examples are usually vectors and not matrices), then the cost of computing $v_i$ is small. If $m=1$, then $v_i$ is simply the squared Euclidean norm of the vector $A_i$, and hence one can compute all of these parameters in one pass through the data (e.g., during loading to memory).

3.2 Parallel (τ-nice) sampling

We now consider $\hat S$ which selects subsets of $[n]$ of cardinality $\tau$, uniformly at random. In the terminology established in [23], such $\hat S$ is called $\tau$-nice. This sampling satisfies $p_i=p_j$ for all $i,j\in[n]$, and hence it is uniform. This sampling is well suited for parallel computing. Indeed, Quartz could be implemented as follows. If we have $\tau$ processors available, then at the beginning of iteration $t$ we can assign each block (dual variable) in $S_t$ to a dedicated processor. The processor assigned to $i$ would then compute $\Delta\alpha_i^t$ and apply the update. If all processors have fast access to the memory where all the data is stored, as is the case in a shared-memory multicore workstation, then this way of assigning workload to the individual processors does not cause any major problems. Depending on the particular computer architecture and the size $m$ of the blocks (which will influence processing time), it may be more efficient to choose $\tau$ to be a multiple of the number of processors available, in which case in each iteration every processor updates more than one block.

The following lemma gives a closed-form formula for parameters $\{v_i\}$ for which the ESO inequality holds.

Lemma 6 (compare with [6]). If $\hat S$ is a $\tau$-nice sampling, then Assumption 4 is satisfied for

$$v_i=\lambda_{\max}\left(\sum_{j=1}^d\left(1+\frac{(\omega_j-1)(\tau-1)}{n-1}\right)A_{ji}^\top A_{ji}\right), \qquad i\in[n], \qquad (17)$$

where for each $j\in[d]$, $\omega_j$ is the number of nonzero blocks in the $j$-th row of the matrix $A$, i.e.,

$$\omega_j \stackrel{\text{def}}{=} |\{i\in[n] : A_{ji}\ne 0\}|, \qquad j\in[d]. \qquad (18)$$

Proof. In the $m=1$ case the result follows from Theorem 1 in [6]. The extension to the $m>1$ case is straightforward.

Note that $v_i$ is the largest eigenvalue of an $m$-by-$m$ matrix which is formed as the sum of $d$ rank-one matrices. The formation of all of these matrices takes time proportional to the number of nonzeros in $A$ if the data is stored in a sparse format. The constants $\{\omega_j\}$ can be computed by scanning the data once (e.g., during the loading-to-memory phase). Finally, one must solve eigenvalue problems for matrices of size $m\times m$. In most applications $m=1$, so there is no more work to be done. If $m>1$, the cost of computing these eigenvalues is small.

While for $\tau=1$ it was easy to find parameters $\{v_i\}$ for any sampling (and hence, as we will see, it will be easy to find an optimal sampling), this is not the case for $\tau>1$. The task is in general a difficult optimization problem. For some work in this direction we refer the reader to [22].
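For $m=1$, formula (17) reads $v_i=\sum_{j=1}^d\big(1+\frac{(\omega_j-1)(\tau-1)}{n-1}\big)A_{ji}^2$ and can be evaluated in one pass over a sparse data matrix. The following sketch assumes SciPy's sparse CSR format purely for convenience; it is an illustration, not the paper's implementation.

```python
import numpy as np
import scipy.sparse as sp

def eso_params_tau_nice(A, tau):
    """ESO parameters v_i of Lemma 6 for the tau-nice sampling, in the m = 1 case.

    A   : scipy.sparse matrix of shape (d, n); column i is the example A_i
    tau : mini-batch size, 1 <= tau <= n
    Returns an array v of length n with
        v_i = sum_j (1 + (omega_j - 1)(tau - 1)/(n - 1)) * A_ji^2.
    """
    A = sp.csr_matrix(A)
    d, n = A.shape
    omega = np.diff(A.indptr)                       # omega_j = number of nonzeros in row j
    row_weight = 1.0 + (omega - 1.0) * (tau - 1.0) / max(n - 1.0, 1.0)
    W = sp.diags(row_weight) @ A.multiply(A)        # entry (j, i) equals row_weight_j * A_ji^2
    return np.asarray(W.sum(axis=0)).ravel()        # v_i = sum over rows j
```

For $\tau=1$ the row weights are all 1 and the function returns $v_i=\|A_i\|^2$, recovering Lemma 5.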

3.3 Product sampling

In this section we give an example of a sampling $\hat S$ which can be both non-uniform and non-serial (i.e., for which $\mathbb{P}(|\hat S|=1)\ne 1$). We make the following group separability assumption: there is a partition $X_1,\dots,X_\tau$ of $[n]$ according to which the examples $\{A_i\}$ can be partitioned into $\tau$ groups such that no feature is shared by any two examples belonging to different groups. Consider the following example with $m=1$, $n=5$ and $d=4$:

$$A=[A_1,A_2,A_3,A_4,A_5]=\begin{pmatrix} 0 & 0 & 6 & 4 & 9\\ 0 & 3 & 0 & 0 & 0\\ 0 & 0 & 3 & 0 & 1\\ 1 & 8 & 0 & 0 & 0\end{pmatrix}.$$

If we choose $\tau=2$ and $X_1=\{1,2\}$, $X_2=\{3,4,5\}$, then no row of $A$ has a nonzero both in a column belonging to $X_1$ and in a column belonging to $X_2$. With each $i\in[n]$ we now associate $l(i)\in[\tau]$ such that $i\in X_{l(i)}$, and define

$$\mathcal{S} \stackrel{\text{def}}{=} X_1\times\cdots\times X_\tau.$$

The product sampling $\hat S$ is obtained by choosing $S\in\mathcal{S}$ uniformly at random, that is, via

$$\mathbb{P}(\hat S=S)=\frac{1}{|\mathcal{S}|}=\frac{1}{\prod_{l=1}^\tau|X_l|}, \qquad S\in\mathcal{S}. \qquad (19)$$

Then $\hat S$ is proper and

$$p_i \stackrel{\text{def}}{=} \mathbb{P}(i\in\hat S)=\frac{\prod_{l\ne l(i)}|X_l|}{|\mathcal{S}|} \stackrel{(19)}{=} \frac{1}{|X_{l(i)}|}, \qquad i\in[n]. \qquad (20)$$

Hence the sampling is nonuniform as long as not all of the sets $X_l$ have the same cardinality. We next show that the product sampling $\hat S$ defined as above allows the same stepsize parameters $v$ as the serial uniform sampling.

Lemma 7. Under the group separability assumption, Assumption 4 is satisfied for the product sampling $\hat S$ and $v_i=\lambda_{\max}(A_i^\top A_i)$, $i\in[n]$.

Proof. For each $j\in[d]$, denote by $A_{j:}$ the $j$-th row of the matrix $A$ and by $\Omega_j$ the column (block) index set of nonzero blocks of $A_{j:}$: $\Omega_j\stackrel{\text{def}}{=}\{i\in[n]:A_{ji}\ne 0\}$. For each $l\in[\tau]$, define

$$J_l \stackrel{\text{def}}{=} \{j\in[d] : \Omega_j\subseteq X_l\}. \qquad (21)$$

In words, $J_l$ is the set of features associated with the examples in $X_l$. By the group separability assumption, $J_1,\dots,J_\tau$ forms a partition of $[d]$, namely,

$$\bigcup_{l=1}^\tau J_l=[d]; \qquad J_k\cap J_l=\emptyset, \quad k\ne l\in[\tau]. \qquad (22)$$

Thus,

$$A^\top A=\sum_{j=1}^d A_{j:}^\top A_{j:} \stackrel{(22)}{=} \sum_{l=1}^\tau\sum_{j\in J_l}A_{j:}^\top A_{j:}. \qquad (23)$$

Now fix $l\in[\tau]$ and $j\in J_l$. For any $h\in\mathbb{R}^N$ we have:

$$\mathbb{E}\big[h_{[\hat S]}^\top A_{j:}^\top A_{j:}h_{[\hat S]}\big]=\sum_{i,i'\in[n]}h_i^\top A_{ji}^\top A_{ji'}h_{i'}\,\mathbb{P}(i\in\hat S, i'\in\hat S)=\sum_{i,i'\in\Omega_j}h_i^\top A_{ji}^\top A_{ji'}h_{i'}\,\mathbb{P}(i\in\hat S, i'\in\hat S).$$

Since $X_1,\dots,X_\tau$ forms a partition of $[n]$ and $\Omega_j\subseteq X_l$, any two distinct indexes belonging to the same subset $X_l$ will never be selected simultaneously in $\hat S$, i.e.,

$$\mathbb{P}(i\in\hat S,\,i'\in\hat S)=\begin{cases}p_i & \text{if } i=i',\\ 0 & \text{if } i\ne i',\ i,i'\in X_l.\end{cases}$$

Therefore,

$$\mathbb{E}\big[h_{[\hat S]}^\top A_{j:}^\top A_{j:}h_{[\hat S]}\big]=\sum_{i\in\Omega_j}p_i\,h_i^\top A_{ji}^\top A_{ji}h_i=\sum_{i=1}^n p_i\,h_i^\top A_{ji}^\top A_{ji}h_i. \qquad (24)$$

It follows from (23) and (24) that:

$$\mathbb{E}\big[\|Ah_{[\hat S]}\|^2\big]=\mathbb{E}\big[h_{[\hat S]}^\top A^\top Ah_{[\hat S]}\big]=\sum_{l=1}^\tau\sum_{j\in J_l}\mathbb{E}\big[h_{[\hat S]}^\top A_{j:}^\top A_{j:}h_{[\hat S]}\big]=\sum_{l=1}^\tau\sum_{j\in J_l}\sum_{i=1}^n p_i\,h_i^\top A_{ji}^\top A_{ji}h_i. \qquad (25)$$

Hence,

$$\mathbb{E}\big[\|Ah_{[\hat S]}\|^2\big] \stackrel{(22)}{=} \sum_{i=1}^n p_i\sum_{j=1}^d h_i^\top A_{ji}^\top A_{ji}h_i=\sum_{i=1}^n p_i\,h_i^\top A_i^\top A_ih_i \le \sum_{i=1}^n p_i\lambda_{\max}(A_i^\top A_i)\|h_i\|^2=\|h\|^2_{p\circ v}.$$

3.4 Distributed sampling

We now describe a sampling which is particularly suitable for a distributed implementation of Quartz. This sampling was first proposed in [21] and later used in [5], where the distributed coordinate descent algorithm Hydra and its accelerated variant Hydra 2 were proposed and analyzed, respectively. Both methods were shown to be able to scale up to huge problem sizes (tests were performed on problems of size of several TB, with up to 50 billion dual variables).

Consider a distributed computing environment with $c$ nodes/computers. For simplicity, assume that $n$ is an integer multiple of $c$ and let the blocks $\{1,2,\dots,n\}$ be partitioned into $c$ sets of equal size: $P_1,P_2,\dots,P_c$. We assign partition $P_l$ to node $l$. The data $A_1,\dots,A_n$ and the dual variables (blocks) $\alpha_1,\dots,\alpha_n$ are partitioned accordingly and stored on the respective nodes. At each iteration, all nodes $l\in\{1,\dots,c\}$, in parallel, pick a subset $\hat S_l$ of $\tau$ dual variables from those they own, i.e., from $P_l$, uniformly at random. That is, each node locally performs a $\tau$-nice sampling, independently from the other nodes. Node $l$ computes the updates to the dual variables $\alpha_i$ corresponding to $i\in\hat S_l$, and locally stores them. Hence, in a single distributed iteration, Quartz updates the dual variables belonging to the set $\hat S\stackrel{\text{def}}{=}\cup_{l=1}^c\hat S_l$. This defines a sampling, which we will call the $(c,\tau)$-distributed sampling.
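The samplings described in this section are straightforward to generate. The sketch below (illustrative helper names, NumPy assumed) draws one realization of a $\tau$-nice sampling, of the product sampling given a partition $X_1,\dots,X_\tau$ satisfying the group separability assumption, and of a $(c,\tau)$-distributed sampling given a partition $P_1,\dots,P_c$ of the examples across $c$ nodes.

```python
import numpy as np

def sample_tau_nice(n, tau, rng):
    """Subset of {0,...,n-1} of size tau, chosen uniformly at random (tau-nice sampling)."""
    return set(rng.choice(n, size=tau, replace=False))

def sample_product(groups, rng):
    """Product sampling: pick one example from each group X_1, ..., X_tau."""
    return {rng.choice(list(group)) for group in groups}

def sample_distributed(partitions, tau, rng):
    """(c, tau)-distributed sampling: each node picks tau of its own examples."""
    S = set()
    for part in partitions:                      # partitions = [P_1, ..., P_c]
        S.update(rng.choice(list(part), size=tau, replace=False))
    return S

rng = np.random.default_rng(0)
print(sample_tau_nice(10, 3, rng))
print(sample_product([{0, 1}, {2, 3, 4}], rng))
print(sample_distributed([range(0, 5), range(5, 10)], 2, rng))
```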

Of course, there are other important considerations pertaining to a distributed implementation of Quartz, but we do not discuss them here as the focus of this section is on the sampling. In particular, it is possible to design a distributed communication protocol for the update of the primal variable. The following result gives a formula for admissible parameters $\{v_i\}$.

Lemma 8 (compare with [5]). If $\hat S$ is a $(c,\tau)$-distributed sampling, then Assumption 4 is satisfied for

$$v_i=\lambda_{\max}\left(\sum_{j=1}^d\left(1+\frac{(\tau-1)(\omega_j-1)}{\max\{n/c-1,1\}}+\left(\frac{\tau c}{n}-\frac{\tau-1}{\max\{n/c-1,1\}}\right)\frac{(\omega_j'-1)\,\omega_j}{\omega_j'}\right)A_{ji}^\top A_{ji}\right), \qquad i\in[n], \qquad (26)$$

where $\omega_j$ is the number of nonzero blocks in the $j$-th row of the matrix $A$, as defined previously in (18), and $\omega_j'$ is the number of partitions active at row $j$ of $A$; more precisely,

$$\omega_j' \stackrel{\text{def}}{=} |\{l\in[c] : \{i\in P_l : A_{ji}\ne 0\}\ne\emptyset\}|, \qquad j\in[d]. \qquad (27)$$

Proof. When $m=1$, the result is equivalent to Theorem 4.1 in [5]. The extension to blocks ($m>1$) is straightforward.

Lemma 6 is a special case of Lemma 8 when only a single node ($c=1$) is used, in which case $\omega_j'=1$ for all $j\in[d]$. Lemma 8 also improves on the constants $\{v_i\}$ derived in [21], where instead of $\omega_j$ and $\omega_j'$ in (26) one has $\max_j\omega_j$ and $\max_j\omega_j'$. Lemma 8 is expressed in terms of certain sparsity parameters associated with the data ($\{\omega_j\}$) and the partitioning ($\{\omega_j'\}$). However, it is possible to derive alternative ESO results for the $(c,\tau)$-distributed sampling. For instance, one can instead express the parameters $\{v_i\}$ without any sparsity assumptions, using only spectral properties of the data. We have not included these results here, but in the $m=1$ case such results have been derived in [5]. It is possible to adapt them to the $m>1$ case, as we have done with Lemma 8.

4 Main Result

The complexity of our method is given by the following theorem.

Theorem 9 (Main Result). Let Assumption 2 ($\phi_i$ are $1/\gamma$-smooth) and Assumption 3 ($g$ is 1-strongly convex) be satisfied. Let $\hat S$ be a proper sampling (Assumption 1) and $v_1,\dots,v_n$ be positive scalars satisfying Assumption 4. Then the sequence of primal and dual variables $\{(w^t,\alpha^t)\}_{t\ge0}$ produced by Quartz (Algorithm 1) satisfies:

$$\mathbb{E}\big[P(w^t)-D(\alpha^t)\big] \le (1-\theta)^t\big(P(w^0)-D(\alpha^0)\big), \qquad (28)$$

where

$$\theta=\min_i\frac{p_i\lambda\gamma n}{v_i+\lambda\gamma n}. \qquad (29)$$

In particular, if we fix $\epsilon\le P(w^0)-D(\alpha^0)$, then for

$$T \ge \max_i\left(\frac{1}{p_i}+\frac{v_i}{p_i\lambda\gamma n}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right), \qquad (30)$$

we are guaranteed that $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon$.

A result of a similar flavour, but for a different problem and not in a primal-dual setting, has been established in [22], where the authors analyze a parallel coordinate descent method, NSync, also with an arbitrary sampling, for minimizing a strongly convex function under an ESO assumption. In the rest of this section we will specialize the above result to a few selected samplings. We then devote two separate sections to Quartz specialized to the $\tau$-nice sampling (Section 5) and to the $(c,\tau)$-distributed sampling (Section 6), where we do a more detailed analysis of the results in these two cases.

4.1 Quartz with uniform serial sampling

We first look at the special case when $\hat S$ is the uniform serial sampling, i.e., when $p_i=1/n$ for all $i\in[n]$.

Corollary 10. Assume that at each iteration of Quartz we update only one dual variable, chosen uniformly at random, and use $v_i=\lambda_{\max}(A_i^\top A_i)$ for all $i\in[n]$. If we let $\epsilon\le P(w^0)-D(\alpha^0)$ and

$$T \ge \left(n+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right), \qquad (31)$$

then $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon$.

Proof. The result follows by combining Lemma 5 and Theorem 9.

Corollary 10 should be compared with Theorem 5 in [29] (covering the L2-regularized case) and Theorem 1 in [28] (covering the case of general $g$). They obtain the rate

$$\left(n+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\log\left(\left(n+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\cdot\frac{D(\alpha^*)-D(\alpha^0)}{\epsilon}\right),$$

where $\alpha^*$ is the dual optimal solution. Notice that the dominant terms in the two rates exactly match, although our logarithmic term is better and simpler.

4.2 Quartz with optimal serial sampling (importance sampling)

According to Lemma 5, the parameter $v$ for a serial sampling $\hat S$ is determined by (16) and is independent of the distribution of $\hat S$. We can therefore seek to maximize the quantity $\theta$ in (29) in order to obtain the best bound. A simple calculation reveals that the optimal probability is given by:

$$\mathbb{P}(\hat S=\{i\})=p_i^* \stackrel{\text{def}}{=} \frac{\lambda_{\max}(A_i^\top A_i)+n\lambda\gamma}{\sum_{j=1}^n\big(\lambda_{\max}(A_j^\top A_j)+n\lambda\gamma\big)}. \qquad (32)$$
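The quantities appearing in Theorem 9 and in the optimal serial sampling (32) are simple to evaluate once the $v_i$ are known. The following sketch (illustrative helper names; NumPy assumed) returns the stepsize $\theta$ from (29), the leading factor $\max_i(1/p_i+v_i/(p_i\lambda\gamma n))$ from (30), and the importance sampling probabilities (32).

```python
import numpy as np

def quartz_rate(p, v, lam, gamma, n):
    """Stepsize theta from (29) and the leading complexity factor from (30)."""
    theta = np.min(p * lam * gamma * n / (v + lam * gamma * n))
    leading = np.max(1.0 / p + v / (p * lam * gamma * n))   # equals 1/theta
    return theta, leading

def importance_probabilities(v, lam, gamma, n):
    """Optimal serial sampling (32): p_i proportional to v_i + n*lam*gamma,
    where v_i = lambda_max(A_i^T A_i) as in Lemma 5."""
    weights = v + n * lam * gamma
    return weights / weights.sum()
```

For example, with $v$ from Lemma 5 and the uniform probabilities $p_i=1/n$, `quartz_rate` returns the leading factor $n+\max_i\lambda_{\max}(A_i^\top A_i)/(\lambda\gamma)$ of Corollary 10.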

Using this sampling, we obtain the following iteration complexity bound, which is an improvement on the bound (31) for uniform probabilities.

Corollary 11. Assume that at each iteration of Quartz we update only one dual variable, chosen at random according to the probabilities $p^*$ defined in (32), and use $v_i=\lambda_{\max}(A_i^\top A_i)$ for all $i\in[n]$. If we let $\epsilon\le P(w^0)-D(\alpha^0)$ and

$$T \ge \left(n+\frac{\frac{1}{n}\sum_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right), \qquad (33)$$

then $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon$.

Note that, in contrast with the serial uniform sampling, we now have dependence on the average of the eigenvalues. The above result should be compared with the complexity result of IProx-SDCA [37]:

$$\left(n+\frac{\frac{1}{n}\sum_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\log\left(\left(n+\frac{\frac{1}{n}\sum_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\cdot\frac{D(\alpha^*)-D(\alpha^0)}{\epsilon}\right),$$

where $\alpha^*$ is the dual optimal solution. Again, the dominant terms in the two rates exactly match, although our logarithmic term is better and simpler.

4.3 Quartz with product sampling

In this section we apply Theorem 9 to the case when $\hat S$ is the product sampling (see the description in Section 3.3). All the notation we use here was established there.

Corollary 12. Under the group separability assumption, let $\hat S$ be the product sampling and let $v_i=\lambda_{\max}(A_i^\top A_i)$ for all $i\in[n]$. If we fix $\epsilon\le P(w^0)-D(\alpha^0)$ and

$$T \ge \max_{i\in[n]}\left(|X_{l(i)}|+\frac{\lambda_{\max}(A_i^\top A_i)\,|X_{l(i)}|}{\lambda\gamma n}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right),$$

then $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon$.

Proof. The proof follows directly from Theorem 9, Lemma 7 and (20).

Recall from Section 3.3 that the product sampling $\hat S$ has cardinality $\tau\ge 1$ and is non-uniform as long as the sets $\{X_1,\dots,X_\tau\}$ do not all have the same cardinality. To the best of our knowledge, Corollary 12 is the first explicit complexity bound for a stochastic algorithm using a non-serial and non-uniform sampling for a composite convex optimization problem (the paper [22] only deals with smooth functions and the method is not primal-dual), albeit under the group separability assumption. Let us compare the complexity bound with the serial uniform case (Corollary 10):

$$\max_{i\in[n]}\left(|X_{l(i)}|+\frac{\lambda_{\max}(A_i^\top A_i)\,|X_{l(i)}|}{\lambda\gamma n}\right) \le \frac{\max_l|X_l|}{n}\left(n+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right).$$

Hence the iteration bound of Quartz specialized to the product sampling is at most a $\max_l|X_l|/n$ fraction of that of Quartz specialized to the serial uniform sampling. The factor $\max_l|X_l|/n$ varies from $1/\tau$ to 1, depending on the degree to which the partition $X_1,\dots,X_\tau$ is balanced.

A perfect, linear speedup ($\max_l|X_l|/n=1/\tau$) only occurs when the partition $X_1,\dots,X_\tau$ is perfectly balanced (i.e., the sets $X_l$ all have the same cardinality), in which case the product sampling is uniform (recall the definition of uniformity we use in this paper: $\mathbb{P}(i\in\hat S)=\mathbb{P}(i'\in\hat S)$ for all $i,i'\in[n]$). Note that if the partition is not perfectly balanced, but sufficiently so, then the factor $\max_l|X_l|/n$ will be close to the perfect linear speedup factor $1/\tau$.

5 Quartz with τ-nice Sampling (standard mini-batching)

We now specialize Theorem 9 to the case of the $\tau$-nice sampling.

Corollary 13. Assume $\hat S$ is the $\tau$-nice sampling and $v$ is chosen as in (17). If we let $\epsilon\le P(w^0)-D(\alpha^0)$ and

$$T \ge \left(\frac{n}{\tau}+\frac{\max_i\lambda_{\max}\left(\sum_{j=1}^d\left(1+\frac{(\omega_j-1)(\tau-1)}{n-1}\right)A_{ji}^\top A_{ji}\right)}{\lambda\gamma\tau}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right), \qquad (34)$$

then $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon$.

Proof. The result follows by combining Lemma 6 and Theorem 9.

Let us now have a detailed look at the above result, especially in terms of how it compares with the serial uniform case (Corollary 10). We do this comparison in Table 1. For fully sparse data, we get perfect linear speedup: the bound in the second line of Table 1 is a $1/\tau$ fraction of the bound in the first line. For fully dense data, the condition number $\kappa\stackrel{\text{def}}{=}\max_i\lambda_{\max}(A_i^\top A_i)/(\gamma\lambda)$ is unaffected by mini-batching/parallelization. Hence, linear speedup is obtained if $\kappa=O(n/\tau)$. For general data, the behaviour of Quartz with the $\tau$-nice sampling interpolates between these two extreme cases. That is, $\kappa$ gets multiplied by a quantity between $1/\tau$ (fully sparse case) and 1 (fully dense case). It is convenient to write this factor in the form $\frac{1}{\tau}\big(1+\frac{(\omega-1)(\tau-1)}{n-1}\big)$, where $\omega\in[1,n]$ is a measure of average sparsity of the data, using which we can write:

$$T(\tau) \stackrel{\text{def}}{=} \left(\frac{n}{\tau}+\left(1+\frac{(\omega-1)(\tau-1)}{n-1}\right)\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma\tau}\right)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right). \qquad (35)$$

Table 1: Comparison of the complexity of Quartz with serial uniform sampling and $\tau$-nice sampling.

Sampling $\hat S$ | Data | Complexity of Quartz (34) | Theorem
Serial uniform | Any data | $n+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}$ | Corollary 10
$\tau$-nice | Fully sparse data ($\omega_j=1$ for all $j$) | $\frac{n}{\tau}+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma\tau}$ | Corollary 13
$\tau$-nice | Fully dense data ($\omega_j=n$ for all $j$) | $\frac{n}{\tau}+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}$ | Corollary 13
$\tau$-nice | Any data | $\frac{n}{\tau}+\left(1+\frac{(\omega-1)(\tau-1)}{n-1}\right)\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma\tau}$ | Corollary 13

5.1 Theoretical speedup factor

For simplicity of exposition, let us now assume that $\max_i\lambda_{\max}(A_i^\top A_i)=1$. We will now study the theoretical speedup factor, defined as:

$$\frac{T(1)}{T(\tau)} \stackrel{(35)}{=} \frac{\tau(1+\lambda\gamma n)}{1+\lambda\gamma n+\frac{(\tau-1)(\omega-1)}{n-1}} = \frac{\tau}{1+\frac{(\tau-1)(\omega-1)}{(n-1)(1+\lambda\gamma n)}}. \qquad (36)$$
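Equation (36) is easy to evaluate numerically; the short sketch below computes the theoretical speedup factor $T(1)/T(\tau)$ for given $n$, $\tau$, $\omega$ and $\lambda\gamma$, assuming (as in the derivation above) that $\max_i\lambda_{\max}(A_i^\top A_i)=1$.

```python
def speedup_factor(n, tau, omega, lam_gamma):
    """Theoretical speedup factor (36) of the tau-nice sampling over the serial
    uniform sampling, assuming max_i lambda_max(A_i^T A_i) = 1."""
    return tau / (1.0 + (tau - 1.0) * (omega - 1.0) / ((n - 1.0) * (1.0 + lam_gamma * n)))

# e.g. n = 10**6, gamma = 1, lambda = 1/sqrt(n): near-linear speedup up to tau ~ 2 + sqrt(n)
print(speedup_factor(10**6, 1000, 10**4, 10**-3))
```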

That is, the speedup factor measures how much better Quartz is with the $\tau$-nice sampling than in the serial uniform case (1-nice sampling). Note that the speedup factor is a concave and increasing function of the number of threads $\tau$. Its value depends on two factors: the relative sparsity level of the data matrix $A$, expressed through the quantity $\frac{\omega-1}{n-1}$, and the condition number of the problem, expressed through the quantity $\lambda\gamma n$. We provide below two lower bounds for the speedup factor:

$$\frac{T(1)}{T(\tau)} \ge \frac{\tau}{1+\frac{(\tau-1)(\omega-1)}{n-1}}, \qquad \frac{T(1)}{T(\tau)} \ge \frac{\tau}{2} \ \text{ whenever } \ 1\le\tau\le 2+\lambda\gamma n. \qquad (37)$$

Note that the last bound does not involve $\omega$. In other words, linear speedup (modulo a factor of 2) is achieved at least until $\tau=2+\lambda\gamma n$ (of course, we also require that $\tau\le n$), regardless of the data matrix $A$. For instance, if $\lambda\gamma=1/\sqrt{n}$, which is a frequently used setting for the regularizer, then we get data-independent linear speedup up to mini-batch size $\tau=2+\sqrt{n}$. Moreover, from the first inequality in (37) we see that there is further data-driven speedup, depending on the average sparsity measure $\omega$. We give an illustration of this phenomenon in Figure 1, where we plot the theoretical speedup factor (36) as a function of the number of threads $\tau$, for $n=10^6$, $\gamma=1$ and three values of $\omega$ and $\lambda$. Looking at the plots from right to left, we see that for fixed $\lambda$, the speedup factor increases as $\omega$ decreases, as described by (36). Moreover, as the regularization parameter $\lambda$ gets smaller and reaches the value $1/n$, the speedup factor is healthy for sparse data only. However, for $\lambda=1/\sqrt{n}=10^{-3}$, we observe linear speedup up to $\tau=\sqrt{n}=1000$, regardless of $\omega$ (the sparsity of the data), as predicted. There is additional data-driven speedup beyond this point, which is better for sparser data.

[Figure 1: The speedup factor (36) as a function of $\tau$ for $n=10^6$, $\gamma=1$, three regularization parameters ($\lambda=10^{-3},10^{-4},10^{-6}$) and data of various sparsity levels: (a) $\omega=10^2$, (b) $\omega=10^4$, (c) $\omega=10^6$.]

5.2 Quartz vs existing primal-dual mini-batch methods

We now compare the above result with existing mini-batch stochastic dual coordinate ascent methods. A mini-batch variant of SDCA, to which Quartz with the $\tau$-nice sampling can be naturally compared, has been proposed and analyzed previously in [30], [28] and [36]. In [30], the authors proposed to use a so-called safe mini-batching, which is precisely equivalent to finding stepsize parameters $v$ satisfying Assumption 4 in the special case of the $\tau$-nice sampling. However, they only analyzed the case where the functions $\phi_i$ are non-smooth. In [28], the authors studied an accelerated mini-batch SDCA (ASDCA), specialized to the case when the regularizer $g$ is the squared L2 norm. They showed that the complexity of ASDCA interpolates between that of SDCA and of accelerated gradient descent (AGD) [20] through varying the mini-batch size $\tau$. In [36], the authors proposed a mini-batch extension of their stochastic primal-dual coordinate algorithm (SPDC). Both ASDCA and SPDC reach the same complexity as AGD when the mini-batch size equals $n$, and thus should be considered accelerated algorithms. The complexity bounds for all these algorithms are summarized in Table 2. To facilitate the comparison, we assume that $\max_i\lambda_{\max}(A_i^\top A_i)=1$ (since the analysis of ASDCA assumes this). In Table 3 we compare the complexities of SDCA, ASDCA, SPDC and Quartz in several regimes. We have used Lemma 14 to simplify the bounds for Quartz.

Lemma 14. For any $\omega\in[1,n]$ and $\tau\in[1,n]$ we have

$$\frac{(\omega-1)(\tau-1)}{n-1} \le \frac{\omega\tau}{n} \le 1+\frac{(\omega-1)(\tau-1)}{n-1} \le 1+\frac{\omega\tau}{n}.$$

Proof. The second inequality follows by showing that the function $\varphi_1(x)=x+\frac{(\omega-x)(\tau-x)}{n-x}$ is increasing on $[0,1]$; the first and third follow by showing that $\varphi_2(x)=\frac{(\omega-x)(\tau-x)}{n-x}$ is decreasing on $[0,1]$. The monotonicity claims follow from the fact that $\varphi_1'(x)=\frac{(n-\omega)(n-\tau)}{(n-x)^2}\ge 0$ and $\varphi_2'(x)=\varphi_1'(x)-1=\frac{(n-\omega)(n-\tau)}{(n-x)^2}-1\le 0$ for all $x\in[0,1]$.

Table 2: Comparison of the iteration complexity of several primal-dual algorithms performing stochastic coordinate ascent steps in the dual using a mini-batch of examples of size $\tau$ (with the exception of SDCA, which is a serial method using $\tau=1$). We assume that $\lambda_{\max}(A_i^\top A_i)=1$ for all $i$ to facilitate comparison (this assumption is implicitly made in [28]).

Algorithm | Iteration complexity | $g$
SDCA [29] | $n+\frac{1}{\lambda\gamma}$ | $\frac{1}{2}\|\cdot\|^2$
ASDCA [28] | $4\max\left\{\frac{n}{\tau},\ \sqrt{\frac{n}{\lambda\gamma\tau}},\ \frac{1}{\lambda\gamma\tau},\ \frac{n^{1/3}}{(\lambda\gamma\tau)^{2/3}}\right\}$ | $\frac{1}{2}\|\cdot\|^2$
SPDC [36] | $\frac{n}{\tau}+\sqrt{\frac{n}{\lambda\gamma\tau}}$ | general
Quartz with $\tau$-nice sampling | $\frac{n}{\tau}+\left(1+\frac{(\omega-1)(\tau-1)}{n-1}\right)\frac{1}{\lambda\gamma\tau}$ | general

Table 3: Comparison of the leading factors in the complexity bounds of several methods in 5 regimes, where $\kappa=1/(\gamma\lambda)$ is the condition number. We ignore constant terms and hence one can replace each plus by a max.

Algorithm | $\gamma\lambda=\Theta(1/n^{3/2})$, $\kappa=n^{3/2}$ | $\gamma\lambda=\Theta(1/(n\tau))$, $\kappa=n\tau$ | $\gamma\lambda=\Theta(1/n)$, $\kappa=n$ | $\gamma\lambda=\Theta(\tau/n)$, $\kappa=n/\tau$ | $\gamma\lambda=\Theta(1/\sqrt{n})$, $\kappa=\sqrt{n}$
SDCA [29] | $n^{3/2}$ | $n\tau$ | $n$ | $n$ | $n$
ASDCA [28] | $\frac{n^{3/2}}{\tau}+\frac{n^{5/4}}{\sqrt{\tau}}+\frac{n^{4/3}}{\tau^{2/3}}$ | $n$ | $\frac{n}{\sqrt{\tau}}$ | $\frac{n}{\tau}$ | $\frac{n}{\tau}+\frac{n^{3/4}}{\sqrt{\tau}}$
SPDC [36] | $\frac{n^{5/4}}{\sqrt{\tau}}$ | $n$ | $\frac{n}{\sqrt{\tau}}$ | $\frac{n}{\tau}$ | $\frac{n}{\tau}+\frac{n^{3/4}}{\sqrt{\tau}}$
Quartz ($\tau$-nice) | $\frac{n^{3/2}}{\tau}+\omega\sqrt{n}$ | $n+\omega\tau$ | $\frac{n}{\tau}+\omega$ | $\frac{n}{\tau}$ | $\frac{n}{\tau}+\frac{\omega}{\sqrt{n}}$

Looking at Table 3, we see that in the $\gamma\lambda=\Theta(\tau/n)$ regime (i.e., if the condition number is $\kappa=\Theta(n/\tau)$), Quartz matches the linear speedup (when compared to SDCA) of ASDCA and SPDC. When the condition number is roughly equal to the sample size ($\kappa=\Theta(n)$), Quartz does better than both ASDCA and SPDC as long as $n/\tau+\omega\le n/\sqrt{\tau}$. In particular, this is the case when the data is sparse: $\omega\le n/\sqrt{\tau}$. If the data is even more sparse (and in many big data applications one has $\omega=O(1)$), so that $\omega\le n/\tau$, then Quartz significantly outperforms both ASDCA and SPDC.

Note that Quartz can be better than both ASDCA and SPDC even in the domain of accelerated methods, that is, when the condition number is larger than the number of examples:

$$\kappa=\frac{1}{\gamma\lambda} \ge n. \qquad (38)$$

Indeed, we have the following result, which can be interpreted as follows: if $\kappa\le n\tau/4$ (that is, $n\lambda\gamma\tau\ge 4$), then there are sparse-enough problems for which Quartz is better than both ASDCA and SPDC.

Proposition 15. Assume that (38) holds and that $\max_i\lambda_{\max}(A_i^\top A_i)=1$. If the data is sufficiently sparse so that

$$\frac{(\omega-1)(\tau-1)}{n-1}+2 \le \sqrt{n\lambda\gamma\tau}, \qquad (39)$$

then the iteration complexity of Quartz (in the $\tilde O$ sense) is better than that of ASDCA and SPDC.

Proof. As long as $n\lambda\gamma\tau\ge 1$, which holds under our assumption, the iteration complexity of ASDCA is

$$\tilde O\left(\max\left\{\frac{n}{\tau},\sqrt{\frac{n}{\lambda\gamma\tau}},\frac{1}{\lambda\gamma\tau},\frac{n^{1/3}}{(\lambda\gamma\tau)^{2/3}}\right\}\right)=\tilde O\left(\sqrt{\frac{n}{\lambda\gamma\tau}}\right),$$

which is already less than that of SPDC. Moreover,

$$\sqrt{\frac{n}{\lambda\gamma\tau}} \stackrel{(39)}{\ge} \frac{2+\frac{(\omega-1)(\tau-1)}{n-1}}{\lambda\gamma\tau} \stackrel{(38)}{\ge} \frac{n}{\tau}+\left(1+\frac{(\omega-1)(\tau-1)}{n-1}\right)\frac{1}{\lambda\gamma\tau}.$$

6 Quartz with Distributed Sampling

In this section we apply Theorem 9 to the case when $\hat S$ is the $(c,\tau)$-distributed sampling; see the description of this sampling in Section 3.4.

Corollary 16. Assume that $\hat S$ is a $(c,\tau)$-distributed sampling and $v$ is chosen as in (26). If we let $\epsilon\le P(w^0)-D(\alpha^0)$ and

$$T \ge T(c,\tau)\log\left(\frac{P(w^0)-D(\alpha^0)}{\epsilon}\right), \qquad (40)$$

where

$$T(c,\tau) \stackrel{\text{def}}{=} \frac{n}{c\tau}+\frac{\max_i\lambda_{\max}\left(\sum_{j=1}^d\left(1+\frac{(\tau-1)(\omega_j-1)}{\max\{n/c-1,1\}}+\left(\frac{\tau c}{n}-\frac{\tau-1}{\max\{n/c-1,1\}}\right)\frac{(\omega_j'-1)\,\omega_j}{\omega_j'}\right)A_{ji}^\top A_{ji}\right)}{\lambda\gamma c\tau}, \qquad (41)$$

then $\mathbb{E}[P(w^T)-D(\alpha^T)]\le\epsilon$.

Proof. If $\hat S$ is a $(c,\tau)$-distributed sampling, then $p_i=\frac{c\tau}{n}$ for all $i\in[n]$. It now only remains to combine Theorem 9 and Lemma 8.

The expression (41) involves $\omega_j'$, which depends on the partitioning $\{P_1,P_2,\dots,P_c\}$ of the dual variables and on the data. The following lemma says that the effect of the partition is negligible, and in fact vanishes as $\tau$ increases. It was proved in [5, Lemma 5.2].

Lemma 17 ([5]). If $n/c\ge 2$ and $\tau\ge 2$, then for all $j\in[d]$ we have

$$\left(\frac{\tau c}{n}-\frac{\tau-1}{n/c-1}\right)\frac{(\omega_j'-1)\,\omega_j}{\omega_j'} \le \frac{1}{\tau-1}\left(1+\frac{(\tau-1)(\omega_j-1)}{n/c-1}\right).$$

According to this result, when each node owns at least two dual examples ($n/c\ge2$) and picks and updates at least two examples in each iteration ($\tau\ge2$), then

$$T(c,\tau) \le \frac{n}{c\tau}+\left(1+\frac{1}{\tau-1}\right)\frac{\max_i\lambda_{\max}\left(\sum_{j=1}^d\left(1+\frac{(\tau-1)(\omega_j-1)}{n/c-1}\right)A_{ji}^\top A_{ji}\right)}{\lambda\gamma c\tau} = \frac{n}{c\tau}+\left(1+\frac{1}{\tau-1}\right)\left(1+\frac{(\tau-1)(\hat\omega-1)}{n/c-1}\right)\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma c\tau}, \qquad (42)$$

where $\hat\omega\in[1,n]$ is an average sparsity measure similar to the one we introduced in the study of the $\tau$-nice sampling. This bound is similar to the one we obtained for the $\tau$-nice sampling and can be interpreted in an analogous way. Note that the first term receives perfect mini-batch scaling (it is divided by $c\tau$), while the condition number $\max_i\lambda_{\max}(A_i^\top A_i)/(\lambda\gamma)$ is divided by $c\tau$ but also multiplied by $\big(1+\frac{1}{\tau-1}\big)\big(1+\frac{(\tau-1)(\hat\omega-1)}{n/c-1}\big)$. However, this term is bounded by $2\hat\omega$, and hence if $\hat\omega$ is small, the condition number also receives a nearly perfect mini-batch scaling.

6.1 Quartz vs DisDCA

A distributed variant of SDCA, named DisDCA, has been proposed in [34] and analyzed in [35]. The authors of [34] proposed a basic DisDCA variant (which was analyzed) and a practical DisDCA variant (which was not analyzed). The complexity of basic DisDCA was shown to be

$$\left(\frac{n}{c\tau}+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\log\left(\left(\frac{n}{c\tau}+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}\right)\cdot\frac{D(\alpha^*)-D(\alpha^0)}{\epsilon}\right), \qquad (43)$$

where $\alpha^*$ is an optimal dual solution. Note that this rate is much worse than our rate. Ignoring the logarithmic terms: while the first expression $n/(c\tau)$ is the same in both results, if we replace all $\omega_j$ by the upper bound $n$ and all $\omega_j'$ by the upper bound $c$ in (41), then

$$T(c,\tau) \le \frac{n}{c\tau}+\frac{\max_i\lambda_{\max}(A_i^\top A_i)\left(1+\frac{(\tau-1)(n-1)}{\max\{n/c-1,1\}}+\left(\frac{\tau c}{n}-\frac{\tau-1}{\max\{n/c-1,1\}}\right)\frac{(c-1)\,n}{c}\right)}{\lambda\gamma c\tau} \le \frac{n}{c\tau}+\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma}.$$

Therefore, the dominant term in (40) is a lower bound on that in (43). Moreover, it is clear that the gap between (40) and (43) is large when the data is sparse. For instance, in the perfectly sparse case with $\hat\omega=1$, the bound (42) for Quartz becomes

$$\frac{n}{c\tau}+\left(1+\frac{1}{\tau-1}\right)\frac{\max_i\lambda_{\max}(A_i^\top A_i)}{\lambda\gamma c\tau},$$

which is much better than (43).
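Under the simplification (42), the leading factors of Quartz with the $(c,\tau)$-distributed sampling and of basic DisDCA (43) can be compared directly. The sketch below does so, assuming $\max_i\lambda_{\max}(A_i^\top A_i)=1$ and taking the average sparsity $\hat\omega$ as an input; the function names are illustrative.

```python
def quartz_distributed_factor(n, c, tau, omega_hat, lam_gamma):
    """Leading factor of Quartz with the (c, tau)-distributed sampling, via bound (42),
    assuming max_i lambda_max(A_i^T A_i) = 1, n/c >= 2 and tau >= 2."""
    block = 1.0 + (tau - 1.0) * (omega_hat - 1.0) / (n / c - 1.0)
    return n / (c * tau) + (1.0 + 1.0 / (tau - 1.0)) * block / (lam_gamma * c * tau)

def disdca_basic_factor(n, c, tau, lam_gamma):
    """Leading factor of basic DisDCA (43), assuming max_i lambda_max(A_i^T A_i) = 1."""
    return n / (c * tau) + 1.0 / lam_gamma

# e.g. n = 10**6 examples, c = 32 nodes, tau = 100, perfectly sparse data (omega_hat = 1):
print(quartz_distributed_factor(10**6, 32, 100, 1.0, 10**-3))
print(disdca_basic_factor(10**6, 32, 100, 10**-3))
```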

6.2 Theoretical speedup factor

In analogy with the discussion in Section 5.1, we shall now analyze the theoretical speedup factor $T(1,1)/T(c,\tau)$, measuring the multiplicative amount by which Quartz specialized to the $(c,\tau)$-distributed sampling is better than Quartz specialized to the serial uniform sampling. In Section 5 we have seen how the speedup factor increases with $\tau$ when a mini-batch of examples is used at each iteration, following the $\tau$-nice sampling. As we have discussed before, this sampling is not particularly suitable for a distributed implementation, unless $\tau=n$, which in the big data setting where $n$ is very large may require many more cores/threads than are available. This is because an implementation of the updates using this sampling would result either in frequently idle nodes or in increased data transfer. Often the data matrix $A$ is too large to be stored on a single node, or a limited number of threads/cores is available per node. We then want to implement Quartz in a distributed way ($c>1$). It is therefore necessary to understand how the speedup factor compares to the hypothetical situation in which we would have a large machine where all the data could be stored (we ignore communication costs here) and hence a $c\tau$-nice sampling could be implemented. That is, we are interested in comparing $T(c,\tau)$ (distributed implementation) and $T(1,c\tau)$ (hypothetical computer). If for simplicity of exposition we assume that $\lambda_{\max}(A_i^\top A_i)=1$, it is possible to argue that if $c\tau\ll n$, then

$$\frac{T(1,1)}{T(c,\tau)} \approx \frac{T(1,1)}{T(1,c\tau)}. \qquad (44)$$

In Figure 2 we plot the contour lines of the theoretical speedup factor in a log-log plot with axes corresponding to $\tau$ and $c$. The contours are nearly perfect straight lines, which means that the speedup factor is approximately constant for those pairs $(c,\tau)$ for which $c\tau$ is the same. In particular, this means that (44) holds. Note that better speedup is obtained for sparse data than for dense data. However, in all plots we have chosen $\gamma=1$ and $\lambda=1/\sqrt{n}$, and hence we expect data-independent linear speedup up to $c\tau=\Theta(\sqrt{n})$ (a special line is depicted in all three plots which defines this contour).

[Figure 2: Contour line plots of the speedup factor $T(1,1)/T(c,\tau)$ for $n=10^6$, $\gamma=1$, $\lambda=10^{-3}$ and (a) $\omega=10^2$, (b) $\omega=10^4$, (c) $\omega=10^6$. Here, $\omega\in[1,n]$ is a degree of average sparsity of the data.]

7 Proof of the Main Result

In this section we prove our main result (Theorem 9). In order to make the analysis more transparent, we first establish three auxiliary results.

7.1 Three lemmas

Lemma 18. The function $f:\mathbb{R}^N\to\mathbb{R}$ defined in (3) satisfies the following inequality:

$$f(\alpha+h) \le f(\alpha)+\langle\nabla f(\alpha),h\rangle+\frac{1}{2\lambda n^2}h^\top A^\top Ah, \qquad \forall\alpha,h\in\mathbb{R}^N. \qquad (45)$$

Proof. Since $g$ is 1-strongly convex, $g^*$ is 1-smooth. Pick $\alpha,h\in\mathbb{R}^N$. Since $f(\alpha)=\lambda g^*\big(\frac{1}{\lambda n}A\alpha\big)$, we have

$$f(\alpha+h)=\lambda g^*\!\left(\frac{1}{\lambda n}A\alpha+\frac{1}{\lambda n}Ah\right) \le \lambda\left(g^*\!\left(\frac{1}{\lambda n}A\alpha\right)+\left\langle\nabla g^*\!\left(\frac{1}{\lambda n}A\alpha\right),\frac{1}{\lambda n}Ah\right\rangle+\frac{1}{2}\left\|\frac{1}{\lambda n}Ah\right\|^2\right)=f(\alpha)+\langle\nabla f(\alpha),h\rangle+\frac{1}{2\lambda n^2}h^\top A^\top Ah.$$

For $s=(s_1,\dots,s_n)\in\mathbb{R}^N$ and $h=(h_1,\dots,h_n)\in\mathbb{R}^N$, where $s_i,h_i\in\mathbb{R}^m$ for all $i$, we will for convenience write $\langle s,h\rangle_p=\sum_i p_i\langle s_i,h_i\rangle$, where $p=(p_1,\dots,p_n)$ and $p_i=\mathbb{P}(i\in\hat S)$ for $i\in[n]$. In the next lemma we give an expected separable overapproximation of the negative of the dual objective $D$.

Lemma 19. If $\hat S$ and $v\in\mathbb{R}^n$ satisfy Assumption 4, then for all $\alpha,h\in\mathbb{R}^N$ the following holds:

$$\mathbb{E}\big[-D(\alpha+h_{[\hat S]})\big] \le f(\alpha)+\langle\nabla f(\alpha),h\rangle_p+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v}+\frac{1}{n}\sum_{i=1}^n\Big[(1-p_i)\phi_i^*(-\alpha_i)+p_i\phi_i^*\big({-}(\alpha_i+h_i)\big)\Big]. \qquad (46)$$

Proof. By definition of $D$, we have $-D(\alpha+h_{[\hat S]}) \stackrel{(2)}{=} f(\alpha+h_{[\hat S]})+\psi(\alpha+h_{[\hat S]})$, where $f$ and $\psi$ are defined in (3) and (4). We apply Lemma 18 and (15) to bound the expectation of the first term:

$$\mathbb{E}\big[f(\alpha+h_{[\hat S]})\big] \stackrel{(45)}{\le} \mathbb{E}\Big[f(\alpha)+\langle\nabla f(\alpha),h_{[\hat S]}\rangle+\frac{1}{2\lambda n^2}h_{[\hat S]}^\top A^\top Ah_{[\hat S]}\Big] \stackrel{(15)}{\le} f(\alpha)+\mathbb{E}\big[\langle\nabla f(\alpha),h_{[\hat S]}\rangle\big]+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v} = f(\alpha)+\langle\nabla f(\alpha),h\rangle_p+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v}.$$

Moreover, since $\psi$ is block separable, we can write

$$\mathbb{E}\big[\psi(\alpha+h_{[\hat S]})\big] \stackrel{(4)}{=} \frac{1}{n}\sum_{i=1}^n\Big[\mathbb{P}(i\notin\hat S)\,\phi_i^*(-\alpha_i)+\mathbb{P}(i\in\hat S)\,\phi_i^*\big({-}(\alpha_i+h_i)\big)\Big] = \frac{1}{n}\sum_{i=1}^n\Big[(1-p_i)\phi_i^*(-\alpha_i)+p_i\phi_i^*\big({-}(\alpha_i+h_i)\big)\Big].$$

Our last auxiliary result is a technical lemma for further bounding the right-hand side in Lemma 19.

Lemma 20. Suppose that $\hat S$ and $v\in\mathbb{R}^n$ satisfy Assumption 4. Fixing $\alpha\in\mathbb{R}^N$ and $w\in\mathbb{R}^d$, let $h\in\mathbb{R}^N$ be defined by

$$h_i=\theta p_i^{-1}\big({-\nabla\phi_i(A_i^\top w)}-\alpha_i\big), \qquad i\in[n],$$

where $\theta$ is as in (29). Then

$$f(\alpha)+\langle\nabla f(\alpha),h\rangle_p+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v}+\frac{1}{n}\sum_{i=1}^n\Big[(1-p_i)\phi_i^*(-\alpha_i)+p_i\phi_i^*\big({-}(\alpha_i+h_i)\big)\Big]$$
$$\le -(1-\theta)D(\alpha)-\theta\lambda g\big(\nabla g^*(\bar\alpha)\big)-\frac{\theta}{n}\Big\langle\nabla g^*(\bar\alpha),\sum_{i=1}^n A_i\nabla\phi_i(A_i^\top w)\Big\rangle+\frac{\theta}{n}\sum_{i=1}^n\phi_i^*\big(\nabla\phi_i(A_i^\top w)\big), \qquad (47)$$

where $\bar\alpha=\frac{1}{\lambda n}A\alpha$.

Proof. Recall from (3) that $f(\alpha)=\lambda g^*(\bar\alpha)$ and hence $\nabla_i f(\alpha)=\frac{1}{n}A_i^\top\nabla g^*(\bar\alpha)$. Thus,

$$f(\alpha)+\langle\nabla f(\alpha),h\rangle_p+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v} = \lambda g^*(\bar\alpha)+\sum_i p_i\Big\langle\frac{1}{n}A_i^\top\nabla g^*(\bar\alpha),\,\theta p_i^{-1}\big({-\nabla\phi_i(A_i^\top w)}-\alpha_i\big)\Big\rangle+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v}$$
$$= (1-\theta)\lambda g^*(\bar\alpha)+\theta\lambda\big(g^*(\bar\alpha)-\langle\nabla g^*(\bar\alpha),\bar\alpha\rangle\big)-\frac{\theta}{n}\Big\langle\nabla g^*(\bar\alpha),\sum_i A_i\nabla\phi_i(A_i^\top w)\Big\rangle+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v}. \qquad (48)$$

Since the functions $\phi_i$ are $1/\gamma$-smooth, the conjugate functions $\phi_i^*$ must be $\gamma$-strongly convex. Therefore,

$$\phi_i^*\big({-}(\alpha_i+h_i)\big) = \phi_i^*\Big(\big(1-\theta p_i^{-1}\big)(-\alpha_i)+\theta p_i^{-1}\nabla\phi_i(A_i^\top w)\Big)$$
$$\le \big(1-\theta p_i^{-1}\big)\phi_i^*(-\alpha_i)+\theta p_i^{-1}\phi_i^*\big(\nabla\phi_i(A_i^\top w)\big)-\frac{\gamma}{2}\theta p_i^{-1}\big(1-\theta p_i^{-1}\big)\big\|\alpha_i+\nabla\phi_i(A_i^\top w)\big\|^2$$
$$= \big(1-\theta p_i^{-1}\big)\phi_i^*(-\alpha_i)+\theta p_i^{-1}\phi_i^*\big(\nabla\phi_i(A_i^\top w)\big)-\frac{\gamma p_i\big(1-\theta p_i^{-1}\big)}{2\theta}\|h_i\|^2, \qquad (49)$$

and we can write

$$\frac{1}{n}\sum_{i=1}^n\Big[(1-p_i)\phi_i^*(-\alpha_i)+p_i\phi_i^*\big({-}(\alpha_i+h_i)\big)\Big] \stackrel{(49)}{\le} (1-\theta)\psi(\alpha)+\frac{\theta}{n}\sum_{i=1}^n\phi_i^*\big(\nabla\phi_i(A_i^\top w)\big)-\frac{1}{2\lambda n^2}\sum_{i=1}^n\frac{\lambda\gamma n\,p_i^2\big(1-\theta p_i^{-1}\big)}{\theta}\|h_i\|^2. \qquad (50)$$

Then by combining (48) and (50) we get:

$$f(\alpha)+\langle\nabla f(\alpha),h\rangle_p+\frac{1}{2\lambda n^2}\|h\|^2_{p\circ v}+\frac{1}{n}\sum_{i=1}^n\Big[(1-p_i)\phi_i^*(-\alpha_i)+p_i\phi_i^*\big({-}(\alpha_i+h_i)\big)\Big]$$
$$\le -(1-\theta)D(\alpha)-\theta\lambda g\big(\nabla g^*(\bar\alpha)\big)-\frac{\theta}{n}\Big\langle\nabla g^*(\bar\alpha),\sum_i A_i\nabla\phi_i(A_i^\top w)\Big\rangle+\frac{\theta}{n}\sum_i\phi_i^*\big(\nabla\phi_i(A_i^\top w)\big)+\frac{1}{2\lambda n^2}\sum_i\left(p_iv_i-\frac{\lambda\gamma n\,p_i^2\big(1-\theta p_i^{-1}\big)}{\theta}\right)\|h_i\|^2.$$

Here we used $(1-\theta)\big(\lambda g^*(\bar\alpha)+\psi(\alpha)\big)=-(1-\theta)D(\alpha)$ and the Fenchel-Young equality $g^*(\bar\alpha)-\langle\nabla g^*(\bar\alpha),\bar\alpha\rangle=-g\big(\nabla g^*(\bar\alpha)\big)$. It remains to notice that for $\theta$ defined in (29) we have

$$p_iv_i \le \frac{\lambda\gamma n\,p_i^2\big(1-\theta p_i^{-1}\big)}{\theta}, \qquad i\in[n].$$

7.2 Proof of Theorem 9

Let $t\ge1$. Define $h^t=(h_1^t,\dots,h_n^t)\in\mathbb{R}^N$ by

$$h_i^t=\theta p_i^{-1}\big({-\nabla\phi_i(A_i^\top w^t)}-\alpha_i^{t-1}\big), \qquad i\in[n],$$

and $\kappa^t=(\kappa_1^t,\dots,\kappa_n^t)$ by

$$\kappa_i^t=\arg\max_{\Delta\in\mathbb{R}^m}\left\{-\phi_i^*\big({-}(\alpha_i^{t-1}+\Delta)\big)-\big\langle\nabla g^*(\bar\alpha^{t-1}),A_i\Delta\big\rangle-\frac{v_i\|\Delta\|^2}{2\lambda n}\right\}, \qquad i\in[n].$$

If we use Option I in Algorithm 1, then $\alpha^t=\alpha^{t-1}+\kappa^t_{[S_t]}$. If we use Option II in Algorithm 1, then $\alpha^t=\alpha^{t-1}+h^t_{[S_t]}$. In both cases, by Lemma 19 (applied with $h=h^t$; under Option I each $\kappa_i^t$ minimizes the $i$-th term of the separable bound in Lemma 19, so the bound evaluated at $h^t$ applies as well):

$$\mathbb{E}_t\big[-D(\alpha^t)\big] \le f(\alpha^{t-1})+\langle\nabla f(\alpha^{t-1}),h^t\rangle_p+\frac{1}{2\lambda n^2}\|h^t\|^2_{p\circ v}+\frac{1}{n}\sum_{i=1}^n\Big[(1-p_i)\phi_i^*(-\alpha_i^{t-1})+p_i\phi_i^*\big({-}(\alpha_i^{t-1}+h_i^t)\big)\Big].$$

We now apply Lemma 20 (with $w=w^t$ and $\alpha=\alpha^{t-1}$) to further bound the right-hand side and obtain:

$$\mathbb{E}_t\big[-D(\alpha^t)\big] \le -(1-\theta)D(\alpha^{t-1})-\theta\lambda g\big(\nabla g^*(\bar\alpha^{t-1})\big)-\frac{\theta}{n}\Big\langle\nabla g^*(\bar\alpha^{t-1}),\sum_i A_i\nabla\phi_i(A_i^\top w^t)\Big\rangle+\frac{\theta}{n}\sum_i\phi_i^*\big(\nabla\phi_i(A_i^\top w^t)\big). \qquad (51)$$

By convexity of $g$,

$$P(w^t) \stackrel{(1),(10)}{=} \frac{1}{n}\sum_i\phi_i(A_i^\top w^t)+\lambda g\big((1-\theta)w^{t-1}+\theta\nabla g^*(\bar\alpha^{t-1})\big) \le \frac{1}{n}\sum_i\phi_i(A_i^\top w^t)+(1-\theta)\lambda g(w^{t-1})+\theta\lambda g\big(\nabla g^*(\bar\alpha^{t-1})\big). \qquad (52)$$

By combining (51) and (52) we get:

$$\mathbb{E}_t\big[P(w^t)-D(\alpha^t)\big] \le \frac{1}{n}\sum_i\phi_i(A_i^\top w^t)+(1-\theta)\lambda g(w^{t-1})-(1-\theta)D(\alpha^{t-1})-\frac{\theta}{n}\Big\langle\nabla g^*(\bar\alpha^{t-1}),\sum_i A_i\nabla\phi_i(A_i^\top w^t)\Big\rangle+\frac{\theta}{n}\sum_i\phi_i^*\big(\nabla\phi_i(A_i^\top w^t)\big)$$
$$= (1-\theta)\big(P(w^{t-1})-D(\alpha^{t-1})\big)+\frac{1}{n}\sum_i\Big(\phi_i(A_i^\top w^t)-(1-\theta)\phi_i(A_i^\top w^{t-1})\Big)-\frac{\theta}{n}\Big\langle\nabla g^*(\bar\alpha^{t-1}),\sum_i A_i\nabla\phi_i(A_i^\top w^t)\Big\rangle+\frac{\theta}{n}\sum_i\phi_i^*\big(\nabla\phi_i(A_i^\top w^t)\big). \qquad (53)$$

Note that $\theta\nabla g^*(\bar\alpha^{t-1})=w^t-(1-\theta)w^{t-1}$ and $\phi_i^*\big(\nabla\phi_i(A_i^\top w^t)\big)=\big\langle\nabla\phi_i(A_i^\top w^t),A_i^\top w^t\big\rangle-\phi_i(A_i^\top w^t)$. Finally, we plug these two equalities into (53) and obtain:

$$\mathbb{E}_t\big[P(w^t)-D(\alpha^t)\big] \le (1-\theta)\big(P(w^{t-1})-D(\alpha^{t-1})\big)+\frac{1}{n}\sum_i\Big(\phi_i(A_i^\top w^t)-(1-\theta)\phi_i(A_i^\top w^{t-1})\Big)$$
$$\qquad -\frac{1}{n}\sum_i\big\langle A_i^\top w^t-(1-\theta)A_i^\top w^{t-1},\nabla\phi_i(A_i^\top w^t)\big\rangle+\frac{\theta}{n}\sum_i\Big(\big\langle\nabla\phi_i(A_i^\top w^t),A_i^\top w^t\big\rangle-\phi_i(A_i^\top w^t)\Big)$$
$$= (1-\theta)\big(P(w^{t-1})-D(\alpha^{t-1})\big)+\frac{1-\theta}{n}\sum_i\Big(\phi_i(A_i^\top w^t)-\phi_i(A_i^\top w^{t-1})\Big)-\frac{1-\theta}{n}\sum_i\big\langle A_i^\top w^t-A_i^\top w^{t-1},\nabla\phi_i(A_i^\top w^t)\big\rangle$$
$$= (1-\theta)\big(P(w^{t-1})-D(\alpha^{t-1})\big)+\frac{1-\theta}{n}\sum_i\Big(\phi_i(A_i^\top w^t)-\phi_i(A_i^\top w^{t-1})+\big\langle A_i^\top w^{t-1}-A_i^\top w^t,\nabla\phi_i(A_i^\top w^t)\big\rangle\Big)$$
$$\le (1-\theta)\big(P(w^{t-1})-D(\alpha^{t-1})\big),$$

where the last inequality follows from the convexity of $\phi_i$.

8 Experimental Results

In [29] and [28] the reader can find an extensive list of popular machine learning problems to which Prox-SDCA can be applied. Sharing the same primal-dual formulation, our algorithm can also be specified for and applied to those applications, including ridge regression, SVM, Lasso, logistic regression and multiclass prediction. We focus our numerical experiments on the L2-regularized linear SVM problem with smoothed hinge loss or squared hinge loss. These problems are described in detail in Section 8.1. The three main messages that we draw from the numerical experiments are:

- Importance sampling does improve the convergence for certain datasets;
- Quartz specialized to serial samplings is comparable to Prox-SDCA in practice;
- The theoretical speedup factor is an almost exact predictor of the actual speedup in terms of iteration complexity.

We performed the experiments on several real-world large datasets of various dimensions $n$, $d$ and sparsity. The dataset characteristics are provided in Table 4. In all our experiments we used Option I, which we found to be better in practice.

Table 4: Datasets used in our experiments.

Dataset | Training size $n$ | # features $d$ | Sparsity (# nonzeros / $nd$)
astro-ph | 29,882 | 99,757 | 0.08%
CCAT | 781,265 | 47,236 | 0.16%
cov1 | 522,911 | 54 | 22.22%
w8a | 49,749 | 300 | 3.91%
ijcnn1 | 49,990 | 22 | 59.09%
webspam | 350,000 | 254 | 33.52%

8.1 Applications

Smoothed hinge loss with $L_2$ regularizer. We specify Quartz for the linear Support Vector Machine (SVM) problem with smoothed hinge loss and $L_2$ regularizer:

$$\min_{w\in\mathbb{R}^d} P(w) \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^n\phi(y_iA_i^\top w)+\lambda g(w),$$

where

$$\phi(a)=\begin{cases}0 & a\ge 1,\\ 1-a-\gamma/2 & a\le 1-\gamma,\\ \frac{(1-a)^2}{2\gamma} & \text{otherwise},\end{cases} \qquad a\in\mathbb{R}, \qquad (54)$$

and

$$g(w)=\frac{1}{2}\|w\|^2, \qquad w\in\mathbb{R}^d. \qquad (55)$$
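For reference, the smoothed hinge loss (54) and its derivative, which is what Quartz's Option II update requires, can be written as follows. This is a small illustrative helper under the assumption that a vectorized NumPy form is convenient; it is not code from the paper.

```python
import numpy as np

def smoothed_hinge(a, gamma):
    """Smoothed hinge loss (54); phi is (1/gamma)-smooth."""
    a = np.asarray(a, dtype=float)
    return np.where(a >= 1.0, 0.0,
           np.where(a <= 1.0 - gamma, 1.0 - a - gamma / 2.0, (1.0 - a) ** 2 / (2.0 * gamma)))

def smoothed_hinge_grad(a, gamma):
    """Derivative of (54); it is Lipschitz continuous with constant 1/gamma."""
    a = np.asarray(a, dtype=float)
    return np.where(a >= 1.0, 0.0, np.where(a <= 1.0 - gamma, -1.0, (a - 1.0) / gamma))
```

The derivative is continuous at the two breakpoints ($a=1-\gamma$ and $a=1$), consistent with the $1/\gamma$-smoothness required by Assumption 2.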