Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

arXiv: v1 [cs.LG] 22 Feb 2015


Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization

Yuchen Zhang    Lin Xiao

September 24

Abstract

We consider a generic convex optimization problem associated with regularized empirical risk minimization of linear predictors. The problem structure allows us to reformulate it as a convex-concave saddle point problem. We propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing over a randomly chosen dual variable and minimizing over the primal variable. An extrapolation step on the primal variable is performed to obtain accelerated convergence rate. We also develop a mini-batch version of the SPDC method which facilitates parallel computing, and an extension with weighted sampling probabilities on the dual variables, which has a better complexity than uniform sampling on unnormalized data. Both theoretically and empirically, we show that the SPDC method has comparable or better performance than several state-of-the-art optimization methods.

1 Introduction

We consider a generic convex optimization problem that arises often in machine learning: regularized empirical risk minimization (ERM) of linear predictors. More specifically, let a_1, ..., a_n ∈ R^d be the feature vectors of n data samples, φ_i : R → R be a convex loss function associated with the linear prediction a_i^T x, for i = 1, ..., n, and g : R^d → R be a convex regularization function for the predictor x ∈ R^d. Our goal is to solve the following optimization problem:

    minimize_{x ∈ R^d}   P(x) := (1/n) Σ_{i=1}^n φ_i(a_i^T x) + g(x).    (1)

Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector a_i is associated with a label b_i ∈ {±1}. We obtain the linear SVM (support vector machine) by setting φ_i(z) = max{0, 1 − b_i z} (the hinge loss) and g(x) = (λ/2)‖x‖_2^2, where λ > 0 is a regularization parameter. Regularized logistic regression is obtained by setting φ_i(z) = log(1 + exp(−b_i z)). For linear regression problems, each feature vector a_i is associated with a dependent variable b_i ∈ R, and φ_i(z) = (1/2)(z − b_i)^2. Then we get ridge regression with g(x) = (λ/2)‖x‖_2^2, and the Lasso with g(x) = λ‖x‖_1. Further backgrounds on regularized ERM in machine learning and statistics can be found, e.g., in the book [3].

Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720, USA. Email: yuczhang@eecs.berkeley.edu. (This work was performed during an internship at Microsoft Research.)
Machine Learning Groups, Microsoft Research, Redmond, WA 98052, USA. Email: lin.xiao@microsoft.com.
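To make the formulation concrete, here is a minimal sketch (our own, not from the paper) that evaluates the ERM objective P(x) of (1) for the ridge-regression instance, i.e., φ_i(z) = (1/2)(z − b_i)^2 and g(x) = (λ/2)‖x‖_2^2; the function name and test data are illustrative only.

```python
import numpy as np

def ridge_objective(x, A, b, lam):
    """P(x) = (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||_2^2,
    i.e. problem (1) with the squared loss and the squared l2 regularizer.
    Rows of A are the feature vectors a_i."""
    n = A.shape[0]
    residual = A @ x - b                      # vector of a_i^T x - b_i
    loss = 0.5 * np.dot(residual, residual) / n
    return loss + 0.5 * lam * np.dot(x, x)
```

Swapping in the hinge or logistic loss for the first term (and the l_1 norm for g) gives the other examples listed above.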

We are especially interested in developing efficient algorithms for solving problem (1) when the number of samples n is very large. In this case, evaluating the full gradient or subgradient of the function P(x) is very expensive, thus incremental methods that operate on a single component function φ_i at each iteration can be very attractive. There has been extensive research on incremental (sub)gradient methods (e.g. [4, 4, 2, 2, 3]) as well as variants of the stochastic gradient method (e.g., [46, 5,, 7, 43]). While the computational cost per iteration of these methods is only a small fraction, say 1/n, of that of the batch gradient methods, their iteration complexities are much higher (it takes many more iterations for them to reach the same precision). In order to better quantify the complexities of various algorithms and position our contributions, we need to make some concrete assumptions and introduce the notions of condition number and batch complexity.

1.1 Condition number and batch complexity

Let γ and λ be two positive real parameters. We make the following assumption:

Assumption A. Each φ_i is convex and differentiable, and its derivative is (1/γ)-Lipschitz continuous (same as φ_i being (1/γ)-smooth), i.e.,

    |φ_i'(α) − φ_i'(β)| ≤ (1/γ)|α − β|,   ∀ α, β ∈ R,  i = 1, ..., n.

In addition, the regularization function g is λ-strongly convex, i.e.,

    g(x) ≥ g(y) + g'(y)^T (x − y) + (λ/2)‖x − y‖_2^2,   ∀ g'(y) ∈ ∂g(y),  x, y ∈ R^d.

For example, the logistic loss φ_i(z) = log(1 + exp(−b_i z)) is (1/4)-smooth, the squared error φ_i(z) = (1/2)(z − b_i)^2 is 1-smooth, and the squared l_2-norm g(x) = (λ/2)‖x‖_2^2 is λ-strongly convex. The hinge loss φ_i(z) = max{0, 1 − b_i z} and the l_1-regularization g(x) = λ‖x‖_1 do not satisfy Assumption A. Nevertheless, we can treat them using smoothing and strongly convex perturbations, respectively, so that our algorithm and theoretical framework still apply (see Section 3.1).

Under Assumption A, the gradient of each component function, ∇φ_i(a_i^T x), is also Lipschitz continuous, with Lipschitz constant L_i = ‖a_i‖_2^2/γ ≤ R^2/γ, where R = max_i ‖a_i‖_2. In other words, each φ_i(a_i^T x) is (R^2/γ)-smooth. We define a condition number

    κ = R^2/(λγ),

and focus on ill-conditioned problems where κ ≫ 1. In the statistical learning context, the regularization parameter λ is usually on the order of 1/√n or 1/n (e.g., [6]), thus κ is on the order of √n or n. It can be even larger if the strong convexity in g is added purely for numerical regularization purposes (see Section 3.1). We note that the actual conditioning of problem (1) may be better than κ, if the empirical loss function (1/n) Σ_{i=1}^n φ_i(a_i^T x) by itself is strongly convex. In those cases, our complexity estimates in terms of κ can be loose (upper bounds), but they are still useful in comparing different algorithms for solving the same given problem.

Let P* be the optimal value of problem (1), i.e., P* = min_{x ∈ R^d} P(x). In order to find an approximate solution x̂ satisfying P(x̂) − P* ≤ ǫ, the classical full gradient method and its proximal variants require O((1 + κ) log(1/ǫ)) iterations (e.g., [24, 26]). Accelerated full gradient (AFG) methods [24, 4,, 26] enjoy the improved iteration complexity O((1 + √κ) log(1/ǫ)).¹ However, each iteration of these batch methods requires a full pass over the dataset, computing the gradient

¹For the analysis of full gradient methods, we should use (R^2/γ + λ)/λ = 1 + κ as the condition number of problem (1); see [26, Section 5]. Here we used the upper bound √(1 + κ) ≤ 1 + √κ for easy comparison. When κ ≫ 1, the additive constant 1 can be dropped.

of each component function and forming their average, which cost O(nd) operations (assuming the feature vectors a_i ∈ R^d are dense). In contrast, the stochastic gradient method and its proximal variants operate on one single component φ_i(a_i^T x) (chosen randomly) at each iteration, which only costs O(d). But their iteration complexities are far worse. Under Assumption A, it takes them O(κ/ǫ) iterations to find an x̂ such that E[P(x̂) − P*] ≤ ǫ, where the expectation is with respect to the random choices made at all the iterations (see, e.g., [3, 23,, 7, 43]).

To make fair comparisons with batch methods, we measure the complexity of stochastic or incremental gradient methods in terms of the number of equivalent passes over the dataset required to reach an expected precision ǫ. We call this measure the batch complexity, which is usually obtained by dividing the iteration complexity by n. For example, the batch complexity of the stochastic gradient method is O(κ/(nǫ)). The batch complexities of full gradient methods are the same as their iteration complexities.

By carefully exploiting the finite average structure in (1) and other similar problems, several recent works [32, 36, 6, 44] proposed new variants of the stochastic gradient or dual coordinate ascent methods and obtained the iteration complexity O((n + κ) log(1/ǫ)). Since their computational cost per iteration is O(d), the equivalent batch complexity is O((1 + κ/n) log(1/ǫ)). This complexity has much weaker dependence on n than the full gradient methods, and also much weaker dependence on ǫ than the stochastic gradient methods.

In this paper, we present a new algorithm that has the batch complexity

    O( (1 + √(κ/n)) log(1/ǫ) ),    (2)

which is more efficient when κ > n.

1.2 Outline of the paper

Our approach is based on reformulating problem (1) as a convex-concave saddle point problem, and then devising a primal-dual algorithm to approximate the saddle point. More specifically, we replace each component function φ_i(a_i^T x) through convex conjugation, i.e.,

    φ_i(a_i^T x) = sup_{y_i ∈ R} { y_i ⟨a_i, x⟩ − φ_i*(y_i) },

where φ_i*(y_i) = sup_{α ∈ R} {α y_i − φ_i(α)}, and ⟨a_i, x⟩ denotes the inner product of a_i and x (which is the same as a_i^T x, but is more convenient for later presentation). This leads to a convex-concave saddle point problem

    min_{x ∈ R^d} max_{y ∈ R^n}   f(x, y) := (1/n) Σ_{i=1}^n ( y_i ⟨a_i, x⟩ − φ_i*(y_i) ) + g(x).    (3)

Under Assumption A, each φ_i* is γ-strongly convex (since φ_i is (1/γ)-smooth; see, e.g., [4, Theorem 4.2.2]) and g is λ-strongly convex. As a consequence, the saddle point problem (3) has a unique solution, which we denote by (x*, y*).

In Section 2, we propose a stochastic primal-dual coordinate (SPDC) method, which alternates between maximizing f over a randomly chosen dual coordinate y_k and minimizing f over the primal variable x. We also apply an extrapolation step to the primal variable x to accelerate the convergence. The SPDC method has iteration complexity O((n + √(nκ)) log(1/ǫ)). Since each iteration of SPDC only operates on a single dual coordinate y_k, its batch complexity is given by (2). We also present a mini-batch SPDC algorithm which is well suited for distributed computing.
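As a sanity check on the reformulation (3), the following sketch (ours, not from the paper) evaluates f(x, y) for the squared loss, whose conjugate is φ_i*(y) = y^2/2 + b_i y; maximizing over y coordinate-wise in closed form (y_i = a_i^T x − b_i) recovers the primal objective P(x).

```python
import numpy as np

def saddle_f(x, y, A, b, lam):
    """f(x,y) = (1/n) sum_i ( y_i*<a_i,x> - phi_i*(y_i) ) + g(x)   -- eq. (3),
    for phi_i(z) = 0.5*(z - b_i)^2, hence phi_i*(y) = 0.5*y**2 + b_i*y,
    and g(x) = (lam/2)*||x||_2^2."""
    conjugate = 0.5 * y ** 2 + b * y          # phi_i*(y_i), vectorized
    return np.mean(y * (A @ x) - conjugate) + 0.5 * lam * np.dot(x, x)
```

For any fixed x the maximizing dual vector is y_i = φ_i'(a_i^T x) = a_i^T x − b_i, and plugging it in gives back (1/n) Σ_i (1/2)(a_i^T x − b_i)^2 + g(x), i.e., P(x).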

Algorithm 1: The SPDC method

Input: parameters τ, σ, θ ∈ R_+, number of iterations T, and initial points x^(0) and y^(0)
Initialize: x̄^(0) = x^(0), u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i
for t = 0, 1, 2, ..., T − 1 do
    Pick an index k ∈ {1, 2, ..., n} uniformly at random, and execute the following updates:

        y_i^(t+1) = argmax_{β ∈ R} { β⟨a_i, x̄^(t)⟩ − φ_i*(β) − (1/(2σ))(β − y_i^(t))^2 }   if i = k,
        y_i^(t+1) = y_i^(t)                                                                 if i ≠ k,    (4)

        x^(t+1) = argmin_{x ∈ R^d} { g(x) + ⟨u^(t) + (y_k^(t+1) − y_k^(t)) a_k, x⟩ + (1/(2τ))‖x − x^(t)‖_2^2 },    (5)

        u^(t+1) = u^(t) + (1/n)(y_k^(t+1) − y_k^(t)) a_k,    (6)

        x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).    (7)
end
Output: x^(T) and y^(T)

In Section 3, we present two extensions of the SPDC method. We first explain how to solve problem (1) when Assumption A does not hold. The idea is to apply small regularizations to the saddle point function so that SPDC can still be applied, which results in accelerated sublinear rates. The second extension is a SPDC method with non-uniform sampling. The batch complexity of this algorithm has the same form as (2), but κ is defined as κ = R̄^2/(λγ), where R̄ = (1/n) Σ_{i=1}^n ‖a_i‖_2, which can be much smaller than R = max_i ‖a_i‖_2 if there is considerable variation in the norms ‖a_i‖_2.

In Section 4, we discuss related work. In particular, the SPDC method can be viewed as a coordinate-update extension of the batch primal-dual algorithm developed by Chambolle and Pock [8]. We also discuss two very recent works [34, 8] which achieve the same batch complexity (2).

In Section 5, we discuss efficient implementation of the SPDC method when the feature vectors a_i are sparse. We focus on two popular cases: when g is a squared l_2-norm penalty and when g is an l_1 + l_2 penalty. We show that the computational cost per iteration of SPDC only depends on the number of non-zero elements in the feature vectors.

In Section 6, we present experiment results comparing SPDC with several state-of-the-art optimization methods, including two efficient batch methods (AFG [24] and L-BFGS [27, Section 7.2]), the stochastic average gradient (SAG) method [32, 33], and the stochastic dual coordinate ascent (SDCA) method [36]. On all scenarios we tested, SPDC has comparable or better performance.

2 The SPDC method

In this section, we describe and analyze the Stochastic Primal-Dual Coordinate (SPDC) method. The basic idea of SPDC is quite simple: to approach the saddle point of f(x, y) defined in (3), we alternately maximize f with respect to y, and minimize f with respect to x. Since the dual vector y has n coordinates and each coordinate is associated with a feature vector a_i ∈ R^d, maximizing f with respect to y takes O(nd) computation, which can be very expensive if n is large. We reduce the computational cost by randomly picking a single coordinate of y at a time, and

Algorithm 2: The Mini-Batch SPDC method

Input: mini-batch size m, parameters τ, σ, θ ∈ R_+, number of iterations T, and x^(0) and y^(0)
Initialize: x̄^(0) = x^(0), u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i
for t = 0, 1, 2, ..., T − 1 do
    Randomly pick a subset of indices K ⊆ {1, 2, ..., n} of size m, such that the probability of each index being picked is equal to m/n. Execute the following updates:

        y_i^(t+1) = argmax_{β ∈ R} { β⟨a_i, x̄^(t)⟩ − φ_i*(β) − (1/(2σ))(β − y_i^(t))^2 }   if i ∈ K,
        y_i^(t+1) = y_i^(t)                                                                 if i ∉ K,    (8)

        x^(t+1) = argmin_{x ∈ R^d} { g(x) + ⟨u^(t) + (1/m) Σ_{k ∈ K} (y_k^(t+1) − y_k^(t)) a_k, x⟩ + (1/(2τ))‖x − x^(t)‖_2^2 },    (9)

        u^(t+1) = u^(t) + (1/n) Σ_{k ∈ K} (y_k^(t+1) − y_k^(t)) a_k,

        x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).
end
Output: x^(T) and y^(T)

maximizing f only with respect to this coordinate. Consequently, the computational cost of each iteration is O(d).

We give the details of the SPDC method in Algorithm 1. The dual coordinate update and primal vector update are given in equations (4) and (5) respectively. Instead of maximizing f over y_k and minimizing f over x directly, we add two quadratic regularization terms to penalize y_k^(t+1) and x^(t+1) from deviating from y_k^(t) and x^(t). The parameters σ and τ control their regularization strength, which we will specify in the convergence analysis (Theorem 1). Moreover, we introduce two auxiliary variables u^(t) and x̄^(t). From the initialization u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i and the update rules (4) and (6), we have

    u^(t) = (1/n) Σ_{i=1}^n y_i^(t) a_i,   t = 0, 1, ..., T.

Equation (7) obtains x̄^(t+1) based on extrapolation from x^(t) and x^(t+1). This step is similar to Nesterov's acceleration technique [24, Section 2.2], and yields a faster convergence rate.

Before presenting the theoretical results, we introduce a Mini-Batch SPDC method in Algorithm 2, which is a natural extension of Algorithm 1. The difference between these two algorithms is that the Mini-Batch SPDC method may simultaneously select more than one dual coordinate to update. Let m be the mini-batch size. During each iteration, the Mini-Batch SPDC method randomly picks a subset of indices K ⊆ {1, ..., n} of size m, such that the probability of each index being picked is equal to m/n. The following is a simple procedure to achieve this. First, partition the set of indices into m disjoint subsets, so that the cardinality of each subset is equal to n/m (assuming m divides n). Then, during each iteration, randomly select a single index from each subset and add it to K. Other approaches for mini-batch selection are also possible.

With a single processor, each iteration of Algorithm 2 takes O(md) time to accomplish. Since

the updates of each coordinate y_k are independent of each other, we can use parallel computing to accelerate the Mini-Batch SPDC method. Concretely, we can use m processors to update the m coordinates in the subset K in parallel, then aggregate them to update x^(t+1). Such a procedure can be achieved by a single round of communication, for example, using the Allreduce operation in MPI [2] or MapReduce []. If we ignore the communication delay, then each iteration takes O(d) time, which is the same as running one iteration of the basic SPDC algorithm. Not surprisingly, we will show that the Mini-Batch SPDC algorithm converges faster than SPDC in terms of the iteration complexity, because it processes multiple dual coordinates in a single iteration.

2.1 Convergence analysis

Since the basic SPDC algorithm is a special case of Mini-Batch SPDC with m = 1, we only present a convergence theorem for the mini-batch version.

Theorem 1. Assume that each φ_i is (1/γ)-smooth and g is λ-strongly convex (Assumption A). Let R = max{‖a_i‖_2 : i = 1, ..., n}. If the parameters τ, σ and θ in Algorithm 2 are chosen such that

    τ = (1/(2R))√(mγ/(nλ)),   σ = (1/(2R))√(nλ/(mγ)),   θ = 1 − 1/( (n/m) + R√((n/m)/(λγ)) ),    (10)

then for each t ≥ 1, the Mini-Batch SPDC algorithm achieves

    (1/(2τ) + λ) E[‖x^(t) − x*‖_2^2] + (1/(4σ) + γ)(1/m) E[‖y^(t) − y*‖_2^2]
        ≤ θ^t ( (1/(2τ) + λ)‖x^(0) − x*‖_2^2 + (1/(2σ) + γ)(1/m)‖y^(0) − y*‖_2^2 ).

The proof of Theorem 1 is given in Appendix A. The following corollary establishes the expected iteration complexity of Mini-Batch SPDC for obtaining an ǫ-accurate solution.

Corollary 1. Suppose Assumption A holds and the parameters τ, σ and θ are set as in (10). In order for Algorithm 2 to obtain

    E[‖x^(T) − x*‖_2^2] ≤ ǫ,   E[‖y^(T) − y*‖_2^2] ≤ ǫ,    (11)

it suffices to have the number of iterations T satisfy

    T ≥ ( (n/m) + R√(n/(mλγ)) ) log(C/ǫ),

where

    C = ( (1/(2τ) + λ)‖x^(0) − x*‖_2^2 + (1/(2σ) + γ)(1/m)‖y^(0) − y*‖_2^2 ) / min{ 1/(2τ) + λ, (1/(4σ) + γ)/m }.

Proof. By Theorem 1, we have E[‖x^(T) − x*‖_2^2] ≤ θ^T C and E[‖y^(T) − y*‖_2^2] ≤ θ^T C. To obtain (11), it suffices to ensure that θ^T C ≤ ǫ, which is equivalent to

    T ≥ log(C/ǫ) / log(1/θ) = log(C/ǫ) / log( 1 / (1 − 1/((n/m) + R√((n/m)/(λγ)))) ).

Applying the inequality log(1/(1 − x)) ≥ x to the denominator above completes the proof.
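To make the updates (4)–(7) and the parameter choice (10) concrete, here is a minimal sketch (ours, with m = 1) for the ridge-regression instance, where both proximal steps have closed forms: the dual step solves a scalar quadratic and the primal step is a scaled shrinkage. The function name and the test problem are illustrative only.

```python
import numpy as np

def spdc_ridge(A, b, lam, T, seed=0):
    """SPDC (Algorithm 1) sketch for ridge regression:
    phi_i(z) = 0.5*(z - b_i)^2  (so gamma = 1, phi_i*(y) = 0.5*y**2 + b_i*y),
    g(x) = (lam/2)*||x||_2^2.  Rows of A are the feature vectors a_i."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    R = np.max(np.linalg.norm(A, axis=1))
    gamma = 1.0                                    # smoothness of the squared loss
    tau = 0.5 / R * np.sqrt(gamma / (n * lam))     # primal step size, eq. (10), m = 1
    sigma = 0.5 / R * np.sqrt(n * lam / gamma)     # dual step size, eq. (10), m = 1
    theta = 1.0 - 1.0 / (n + R * np.sqrt(n / (lam * gamma)))  # extrapolation weight
    x = np.zeros(d); xbar = x.copy()
    y = np.zeros(n)
    u = np.zeros(d)                                # u = (1/n) sum_i y_i a_i
    for _ in range(T):
        k = rng.integers(n)
        # dual prox step (4): argmax_b b*<a_k,xbar> - (b^2/2 + b_k*b) - (b - y_k)^2/(2*sigma)
        y_new = (sigma * (A[k] @ xbar - b[k]) + y[k]) / (1.0 + sigma)
        dy = y_new - y[k]
        y[k] = y_new
        # primal prox step (5): argmin (lam/2)||x||^2 + <u + dy*a_k, x> + ||x - x_t||^2/(2*tau)
        x_new = (x - tau * (u + dy * A[k])) / (1.0 + tau * lam)
        u += dy * A[k] / n                         # maintain u = (1/n) A^T y, eq. (6)
        xbar = x_new + theta * (x_new - x)         # extrapolation step (7)
        x = x_new
    return x
```

At a fixed point, the dual step forces y_i = a_i^T x − b_i and the primal step forces λx = −(1/n) A^T y, which together give the ridge normal equations (A^T A/n + λI)x = A^T b/n, so the iterates approach the minimizer of (1).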

Recall the definition of the condition number κ = R^2/(λγ) in Section 1.1. Corollary 1 establishes that the iteration complexity of the Mini-Batch SPDC method for achieving (11) is

    O( ((n/m) + √(κ n/m)) log(1/ǫ) ).

So a larger batch size m leads to a smaller number of iterations. In the extreme case of m = n, we obtain a full batch algorithm, which has iteration or batch complexity O((1 + √κ) log(1/ǫ)). This complexity is also shared by the AFG methods [24, 26] (see Section 1.1), as well as the batch primal-dual algorithm of Chambolle and Pock [8] (see discussions on related work in Section 4).

Since an equivalent pass over the dataset corresponds to n/m iterations, the batch complexity (the number of equivalent passes over the data) of Mini-Batch SPDC is

    O( (1 + √(κ m/n)) log(1/ǫ) ).

The above expression implies that a smaller batch size m leads to a smaller number of passes through the data. In this sense, the basic SPDC method with m = 1 is the most efficient one. However, if we prefer the least amount of wall-clock time, then the best choice is to choose a mini-batch size m that matches the number of parallel processors available.

2.2 Convergence of primal objective

In the previous subsection, we established the iteration complexity of the Mini-Batch SPDC method in terms of approximating the saddle point of the minimax problem (3), more specifically, to meet the requirement in (11). Next we show that it has the same order of complexity in reducing the primal objective gap P(x^(T)) − P(x*). But we need an extra assumption.

Assumption B. There exist constants G and H such that for any x ∈ R^d,

    g(x) − g(x*) ≤ G‖x − x*‖_2 + (H/2)‖x − x*‖_2^2.

We note that Assumption B is weaker than either G-Lipschitz continuity or H-smoothness. It is satisfied by the l_1 norm, the squared l_2-norm, and mixed l_1 + l_2 regularizations.

Corollary 2. Suppose both Assumptions A and B hold, and the parameters τ, σ and θ are set as in (10). To guarantee E[P(x^(T)) − P(x*)] ≤ ǫ, it suffices to run Algorithm 2 for T iterations, with

    T ≥ ( (n/m) + R√(n/(mλγ)) ) log( C(4G^2 + H + 1/γ)/ǫ^2 ),

where

    C = ‖x^(0) − x*‖_2^2 + ( (1/(2σ) + γ) / (1/(2τ) + λ) )(1/m)‖y^(0) − y*‖_2^2.

Proof. Using the (1/γ)-smoothness of P − g and Assumption B, we have

    P(x^(T)) − P(x*) ≤ ⟨∇(P − g)(x*), x^(T) − x*⟩ + (1/(2γ))‖x^(T) − x*‖_2^2 + g(x^(T)) − g(x*)
        ≤ ( ‖∇(P − g)(x*)‖_2 + G )‖x^(T) − x*‖_2 + ((H + 1/γ)/2)‖x^(T) − x*‖_2^2.

Since x* minimizes P, we have −∇(P − g)(x*) ∈ ∂g(x*). Hence, Assumption B implies that ‖∇(P − g)(x*)‖_2 ≤ G. Substituting this relation into the above inequality, and using Hölder's inequality, we have

    E[P(x^(T)) − P(x*)] ≤ 2G ( E[‖x^(T) − x*‖_2^2] )^{1/2} + ((H + 1/γ)/2) E[‖x^(T) − x*‖_2^2].

To make E[P(x^(T)) − P(x*)] ≤ ǫ, it suffices to let the right-hand side of the above inequality be bounded by ǫ. Since ǫ ≤ 1, this is guaranteed by

    E[‖x^(T) − x*‖_2^2] ≤ ǫ^2 / (4G^2 + H + 1/γ).    (12)

By Theorem 1, we have E[‖x^(T) − x*‖_2^2] ≤ θ^T C. To secure inequality (12), it is sufficient to make θ^T C ≤ ǫ^2/(4G^2 + H + 1/γ), which is equivalent to

    T ≥ log( C(4G^2 + H + 1/γ)/ǫ^2 ) / log(1/θ) = log( C(4G^2 + H + 1/γ)/ǫ^2 ) / log( 1/(1 − 1/((n/m) + R√((n/m)/(λγ)))) ).

Applying log(1/(1 − x)) ≥ x to the denominator above completes the proof.

3 Extensions of SPDC

In this section, we derive two extensions of the SPDC method. The first one handles problems for which Assumption A does not hold. The second one employs a non-uniform sampling scheme to improve the iteration complexity when the feature vectors a_i are unnormalized.

3.1 Non-smooth or non-strongly convex functions

The complexity bounds established in Section 2 require each φ_i* to be γ-strongly convex, which corresponds to the condition that the first derivative of φ_i is (1/γ)-Lipschitz continuous. In addition, the function g needs to be λ-strongly convex. For general loss functions where either or both of these conditions fail (e.g., the hinge loss and l_1-regularization), we can slightly perturb the saddle-point function f(x, y) so that the SPDC method can still be applied.

For simplicity, here we consider the case where neither φ_i is smooth nor g is strongly convex. Formally, we assume that each φ_i and g are convex and Lipschitz continuous, and f(x, y) has a saddle point (x*, y*). We choose a scalar δ > 0 and consider the modified saddle-point function:

    f_δ(x, y) := (1/n) Σ_{i=1}^n ( y_i ⟨a_i, x⟩ − φ_i*(y_i) − (δ/2) y_i^2 ) + g(x) + (δ/2)‖x‖_2^2.    (13)

Denote by (x*_δ, y*_δ) the saddle point of f_δ. We employ the Mini-Batch SPDC method (Algorithm 2) to approximate (x*_δ, y*_δ), treating φ_i* + (δ/2)(·)^2 as φ_i* and g + (δ/2)‖·‖_2^2 as g, which now are all δ-strongly convex. We note that adding a strongly convex perturbation to φ_i* is equivalent to smoothing φ_i, which becomes (1/δ)-smooth. Letting γ = λ = δ, the parameters τ, σ and θ in (10) become

    τ = (1/(2R))√(m/n),   σ = (1/(2R))√(n/m),   θ = 1 − 1/( (n/m) + (R/δ)√(n/m) ).

Although (x*_δ, y*_δ) is not exactly the saddle point of f, the following corollary shows that applying the SPDC method to the perturbed function f_δ effectively minimizes the original loss function P.

Corollary 3. Assume that each φ_i is convex and G_φ-Lipschitz continuous, and g is convex and G_g-Lipschitz continuous. Define two constants:

    C_1 = ‖x*‖_2^2 + G_φ^2,
    C_2 = (G_φ R + G_g)^2 ( ‖x^(0) − x*_δ‖_2^2 + ( (1/(2σ) + δ)/(1/(2τ) + δ) )(1/m)‖y^(0) − y*_δ‖_2^2 ).

If we choose δ ≤ ǫ/C_1, and run the Mini-Batch SPDC algorithm for T iterations where

    T ≥ ( (n/m) + (R/δ)√(n/m) ) log(4C_2/ǫ^2),

then E[P(x^(T)) − P(x*)] ≤ ǫ.

Proof. Let ỹ = argmax_y f(x*_δ, y) be a shorthand notation. We have

    P(x*_δ)  (i)=  f(x*_δ, ỹ)
            (ii)≤  f_δ(x*_δ, ỹ) + (δ/(2n))‖ỹ‖_2^2
           (iii)≤  f_δ(x*_δ, y*_δ) + (δ/(2n))‖ỹ‖_2^2
            (iv)≤  f_δ(x*, y*_δ) + (δ/(2n))‖ỹ‖_2^2
             (v)≤  f(x*, y*_δ) + (δ/2)‖x*‖_2^2 + (δ/(2n))‖ỹ‖_2^2
            (vi)≤  f(x*, y*) + (δ/2)‖x*‖_2^2 + (δ/(2n))‖ỹ‖_2^2
           (vii)=  P(x*) + (δ/2)‖x*‖_2^2 + (δ/(2n))‖ỹ‖_2^2.

Here, equations (i) and (vii) use the definition of the function f, inequalities (ii) and (v) use the definition of the function f_δ, inequalities (iii) and (iv) use the fact that (x*_δ, y*_δ) is the saddle point of f_δ, and inequality (vi) is due to the fact that (x*, y*) is the saddle point of f. Since φ_i is G_φ-Lipschitz continuous, the domain of φ_i* is the interval [−G_φ, G_φ], which implies ‖ỹ‖_2^2 ≤ n G_φ^2 (see, e.g., [34, Lemma 1]). Thus, we have

    P(x*_δ) − P(x*) ≤ (δ/2)( ‖x*‖_2^2 + G_φ^2 ) = (δ/2) C_1.    (14)

On the other hand, since P is (G_φ R + G_g)-Lipschitz continuous, Theorem 1 implies

    E[P(x^(T)) − P(x*_δ)] ≤ (G_φ R + G_g) E[‖x^(T) − x*_δ‖_2] ≤ √C_2 ( 1 − 1/((n/m) + (R/δ)√(n/m)) )^{T/2}.    (15)

Combining inequality (14) and inequality (15), to guarantee E[P(x^(T)) − P(x*)] ≤ ǫ, it suffices to have

    C_1 δ ≤ ǫ   and   √C_2 ( 1 − 1/((n/m) + (R/δ)√(n/m)) )^{T/2} ≤ ǫ/2.    (16)

The corollary is established by finding the smallest T that satisfies inequality (16).

There are two other cases that can be considered: when φ_i is not smooth but g is strongly convex, and when φ_i is smooth but g is not strongly convex. They can be handled with the same technique described above, and we omit the details here. (Alternatively, it is possible to use the techniques described in [8, Section 5] to obtain accelerated sublinear convergence rates without using strongly convex perturbations.) In Table 1, we list the complexities of the Mini-Batch SPDC method for finding an ǫ-optimal solution of problem (1) under various assumptions. Similar results are also obtained in [34].
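As an illustration of the smoothing effect described above (our own sketch, not from the paper): adding (δ/2)y^2 to the conjugate of the hinge loss φ(z) = max{0, 1 − z}, whose conjugate is φ*(y) = y on the domain [−1, 0], yields a (1/δ)-smooth approximation with a simple closed form.

```python
def smoothed_hinge(z, delta):
    """max over y in [-1, 0] of ( y*z - y - (delta/2)*y**2 ):
    the hinge loss max{0, 1-z} with its conjugate perturbed as in eq. (13).
    The result is (1/delta)-smooth and within delta/2 of the hinge loss."""
    if z >= 1.0:
        return 0.0                       # flat part of the hinge
    if z <= 1.0 - delta:
        return 1.0 - z - delta / 2.0     # linear part, shifted down by delta/2
    return (1.0 - z) ** 2 / (2.0 * delta)  # quadratic transition near the kink
```

The three branches come from maximizing the concave quadratic in y over the interval [−1, 0]; the unconstrained maximizer −(1 − z)/δ is clipped at the two endpoints.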

    φ_i              g                      iteration complexity Õ(·)
    (1/γ)-smooth     λ-strongly convex      n/m + R√((n/m)/(λγ))
    (1/γ)-smooth     non-strongly convex    n/m + R√((n/m)/(ǫγ))
    non-smooth       λ-strongly convex      n/m + R√((n/m)/(ǫλ))
    non-smooth       non-strongly convex    n/m + R√(n/m)/ǫ

Table 1: Iteration complexities of the SPDC method under different assumptions on the functions φ_i and g. For the last three cases, we solve the perturbed saddle-point problem with δ = ǫ/C_1.

3.2 SPDC with non-uniform sampling

One potential drawback of the SPDC algorithm is that its convergence rate depends on a problem-specific constant R, which is the largest l_2-norm of the feature vectors a_i. As a consequence, the algorithm may perform badly on unnormalized data, especially if the l_2-norms of some feature vectors are substantially larger than others. In this section, we propose an extension of the SPDC method to mitigate this problem, which is given in Algorithm 3. The basic idea is to use non-uniform sampling in picking the dual coordinate to update at each iteration. In Algorithm 3, we pick coordinate k with the probability

    p_k = 1/(2n) + ‖a_k‖_2 / ( 2 Σ_{i=1}^n ‖a_i‖_2 ),   k = 1, ..., n.    (17)

Therefore, instances with large feature norms are sampled more frequently. Simultaneously, we adopt an adaptive regularization in step (18), imposing stronger regularization on such instances. In addition, we adjust the weight of a_k in (19) for updating the primal variable. As a consequence, the convergence rate of Algorithm 3 depends on the average norm of the feature vectors. This is summarized by the following theorem.

Theorem 2. Suppose Assumption A holds. Let R̄ = (1/n) Σ_{i=1}^n ‖a_i‖_2. If the parameters τ, σ, θ in Algorithm 3 are chosen such that

    τ = (1/(4R̄))√(γ/(nλ)),   σ = (1/(4R̄))√(nλ/γ),   θ = 1 − 1/( 2n + 2R̄√(n/(λγ)) ),

then for each t ≥ 1, we have

    (1/(2τ) + λ) E[‖x^(t) − x*‖_2^2] + (1/(6σ) + 2γ)(1/n) E[‖y^(t) − y*‖_2^2]
        ≤ (7/2) θ^t ( (1/(2τ) + λ)‖x^(0) − x*‖_2^2 + (1/(2σ) + 2γ)(1/n)‖y^(0) − y*‖_2^2 ).

Comparing the constant θ in Theorem 2 to that of Theorem 1 (with m = 1), we can find two differences. First, there is an additional factor of 2 multiplied to the denominator 2n + 2R̄√(n/(λγ)), making the value of θ larger. Second, the constant R̄ here is determined by the average norm of the features, instead of the largest one, which makes the value of θ smaller. The second difference makes the algorithm more robust to unnormalized feature vectors. For example, if the a_i's are sampled i.i.d.

Algorithm 3: SPDC method with weighted sampling

Input: parameters τ, σ, θ ∈ R_+, number of iterations T, and initial points x^(0) and y^(0)
Initialize: x̄^(0) = x^(0), u^(0) = (1/n) Σ_{i=1}^n y_i^(0) a_i
for t = 0, 1, 2, ..., T − 1 do
    Randomly pick k ∈ {1, 2, ..., n}, with probability p_k = 1/(2n) + ‖a_k‖_2/(2 Σ_{i=1}^n ‖a_i‖_2).
    Execute the following updates:

        y_i^(t+1) = argmax_{β ∈ R} { β⟨a_i, x̄^(t)⟩ − φ_i*(β) − (n p_i/(2σ))(β − y_i^(t))^2 }   if i = k,
        y_i^(t+1) = y_i^(t)                                                                     if i ≠ k,    (18)

        x^(t+1) = argmin_{x ∈ R^d} { g(x) + ⟨u^(t) + (1/(n p_k))(y_k^(t+1) − y_k^(t)) a_k, x⟩ + (1/(2τ))‖x − x^(t)‖_2^2 },    (19)

        u^(t+1) = u^(t) + (1/n)(y_k^(t+1) − y_k^(t)) a_k,

        x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).
end
Output: x^(T) and y^(T)

from a multivariate normal distribution, then max_i{‖a_i‖_2} almost surely goes to infinity as n → ∞, but the average norm (1/n) Σ_{i=1}^n ‖a_i‖_2 converges to E[‖a_i‖_2].

For simplicity of presentation, we described in Algorithm 3 a weighted sampling SPDC method with a single dual coordinate update, i.e., the case of m = 1. It is not hard to see that the non-uniform sampling scheme can also be extended to Mini-Batch SPDC with m > 1. Moreover, the non-uniform sampling scheme can also be applied to solve problems with non-smooth φ_i or non-strongly convex g, leading to similar conclusions as in Corollary 3. Here, we omit the technical details.

4 Related Work

Chambolle and Pock [8] considered a class of convex optimization problems with the following saddle-point structure:

    min_{x ∈ R^d} max_{y ∈ R^n} { ⟨Kx, y⟩ + G(x) − F*(y) },    (20)

where K ∈ R^{n×d}, and G and F* are proper closed convex functions, with F* itself being the conjugate of a convex function F. They developed the following first-order primal-dual algorithm:

    y^(t+1) = argmax_{y ∈ R^n} { ⟨Kx̄^(t), y⟩ − F*(y) − (1/(2σ))‖y − y^(t)‖_2^2 },    (21)

    x^(t+1) = argmin_{x ∈ R^d} { ⟨K^T y^(t+1), x⟩ + G(x) + (1/(2τ))‖x − x^(t)‖_2^2 },    (22)

    x̄^(t+1) = x^(t+1) + θ(x^(t+1) − x^(t)).    (23)

When both F* and G are strongly convex and the parameters τ, σ and θ are chosen appropriately, this algorithm obtains an accelerated linear convergence rate [8, Theorem 3].
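For comparison with the coordinate updates above, the batch iteration (21)–(23) specializes to the same ridge-regression instance as follows. This is our own sketch: for simplicity it uses the generic step sizes with τσ‖K‖_2^2 = 1 and θ = 1, rather than the tuned values discussed below, and the function name and test problem are illustrative.

```python
import numpy as np

def chambolle_pock_ridge(A, b, lam, T):
    """Batch primal-dual iteration (21)-(23) under the mapping K = A/n,
    G(x) = (lam/2)*||x||_2^2, F*(y) = (1/n) sum_i (0.5*y_i**2 + b_i*y_i),
    i.e. ridge regression. Rows of A are the feature vectors a_i."""
    n, d = A.shape
    L = np.linalg.norm(A, 2) / n            # L = ||K||_2 (spectral norm)
    tau = sigma = 1.0 / L                   # generic choice with tau*sigma*L^2 = 1
    theta = 1.0
    x = np.zeros(d); xbar = x.copy(); y = np.zeros(n)
    for _ in range(T):
        # dual prox step (21): separable across the coordinates of y
        y = (y + (sigma / n) * (A @ xbar - b)) / (1.0 + sigma / n)
        # primal prox step (22) for the squared l2 regularizer
        x_new = (x - tau * (A.T @ y) / n) / (1.0 + tau * lam)
        xbar = x_new + theta * (x_new - x)  # extrapolation step (23)
        x = x_new
    return x
```

Unlike SPDC, every iteration here touches all n dual coordinates, costing O(nd); the fixed point is again the ridge solution of (A^T A/n + λI)x = A^T b/n.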

algorthm τ σ θ batch complexty ( Chambolle-Pock [8] γ λ A 2 λ A 2 γ + A 2 /(2 + A 2 λγ 2 log(/ǫ λγ ( wth m = γ λ 2R λ 2R γ +R/ + R λγ λγ log(/ǫ ( wth m = γ λ 2R λ 2R γ + R +R /λγ λγ log(/ǫ Table 2: Comparg wth Chambolle ad Pock [8, Algorthm 3, Theorem 3] We ca map the saddle-pot problem (3 to the form of (2 by lettg A = [a,,a ] T ad K = A, G(x = g(x, F (y = φ (y (24 The method developed ths paper ca be vewed as a exteso of the batch method (2-(23, where the dual update step (2 s replaced by a sgle coordate update (4 or a mbatch update (8 However, order to obta accelerated covergece rate, more subtle chages are ecessary the prmal update step More specfcally, we troduced the auxlary varable = y(t a = K T y (t, ad replaced the prmal update step (22 by (5 ad (9 The prmal u (t = extrapolato step (23 stays the same To compare the batch complexty of wth that of (2-(23, we use the followg facts mpled by Assumpto A ad the relatos (24: K 2 = A 2, G(x s λ-strogly covex, ad F (y s (γ/-strogly covex Based o these codtos, we lst Table 2 the equvalet parameters used [8, Algorthm 3] ad the batch complexty obtaed [8, Theorem 3], ad compare them wth The batch complexty of the Chambolle-Pock algorthm s Õ( + A 2/(2 λγ, where the Õ( otato hdes the log(/ǫ factor We ca boud the spectral orm A 2 by the Frobeus orm A F ad obta A 2 A F max{ a 2 } = R (Note that the secod equalty above would be a equalty f the colums of A are ormalzed So the worst case, the batch complexty of the Chambolle-Pock algorthm becomes ( Õ +R/ λγ = Õ( + κ, where κ = R 2 /(λγ, whch matches the worst-case complexty of the methods [24, 26] (see Secto ad also the dscussos [8, Secto 5] Ths s also of the same order as the complexty of wth m = (see Secto 2 Whe the codto umber κ, they ca be worse tha the batch complexty of wth m =, whch s Õ(+ κ/ If ether G(x or F (y (2 s ot strogly covex, Chambolle ad Pock proposed varats of the prmal-dual batch algorthm to acheve accelerated sublear covergece rates [8, Secto 5] It s also 
possible to extend them to coordinate update methods for solving problem (1) when either φᵢ or g is not strongly convex. Their complexities would be similar to those in Table 2.
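To make the coordinate-update scheme concrete, the following sketch instantiates the single-coordinate primal-dual iteration (dual prox step, primal prox step, extrapolation) for ridge regression, where both prox steps have closed forms. It is a minimal illustration under our own assumptions, not the paper's implementation: the function name is ours, and we specialize to φᵢ(z) = (z − bᵢ)²/2 (so γ = 1) and g(x) = (λ/2)‖x‖₂², with the m = 1 step sizes discussed above.

```python
import numpy as np

def spdc_ridge(A, b, lam, iters=30000, seed=0):
    """Sketch of a single-coordinate primal-dual iteration for ridge
    regression: phi_i(z) = (z - b_i)^2 / 2 (gamma = 1), g = lam/2 ||x||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    R = np.max(np.linalg.norm(A, axis=1))       # max feature norm
    gamma = 1.0
    # step sizes and extrapolation weight (m = 1 parameter choice)
    tau = np.sqrt(gamma / (n * lam)) / (2 * R)
    sigma = np.sqrt(n * lam / gamma) / (2 * R)
    theta = 1 - 1 / (n + R * np.sqrt(n / (lam * gamma)))
    x = np.zeros(d); xbar = x.copy()
    y = np.zeros(n)
    u = A.T @ y / n                             # u = (1/n) sum_i y_i a_i
    for _ in range(iters):
        k = rng.integers(n)
        # dual prox step, closed form for phi_k^*(y) = y^2/2 + b_k y
        y_new = (y[k] + sigma * (A[k] @ xbar - b[k])) / (1 + sigma)
        # primal prox step, closed form for g = lam/2 ||x||^2
        w = u + (y_new - y[k]) * A[k]
        x_new = (x - tau * w) / (1 + lam * tau)
        u += (y_new - y[k]) * A[k] / n
        y[k] = y_new
        xbar = x_new + theta * (x_new - x)      # extrapolation
        x = x_new
    return x
```

At a fixed point, y_k = aₖᵀx − bₖ and λx = −(1/n)Σ yᵢaᵢ, which is exactly the optimality condition of the ridge problem, so the iterate can be checked against the closed-form solution.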

4.1 Dual coordinate ascent methods

We can also solve the primal problem (1) via its dual:

    maximize_{y∈Rⁿ}  D(y) := (1/n)Σᵢ₌₁ⁿ −φᵢ*(yᵢ) − g*( −(1/n)Σᵢ₌₁ⁿ yᵢaᵢ ),   (25)

where g*(u) = sup_{x∈R^d}{ xᵀu − g(x) } is the conjugate function of g. Here again, coordinate ascent methods (e.g., [29, 9, 15, 36]) can be more efficient than full gradient methods. In the stochastic dual coordinate ascent (SDCA) method [36], a dual coordinate yᵢ is picked at random during each iteration and updated to increase the dual objective value. Shalev-Shwartz and Zhang [36] showed that the iteration complexity of SDCA is O((n + κ) log(1/ε)), which corresponds to the batch complexity Õ(1 + κ/n). Therefore the SPDC method, which has batch complexity Õ(1 + √(κ/n)), can be much better when κ > n, i.e., for ill-conditioned problems.

For more general convex optimization problems, there is a vast literature on coordinate descent methods. In particular, Nesterov's work on randomized coordinate descent [25] sparked a lot of recent activity on this topic. Richtárik and Takáč [31] extended the algorithm and analysis to composite convex optimization. When applied to the dual problem (25), it becomes one variant of SDCA studied in [36]. Mini-batch and distributed versions of SDCA have been proposed and analyzed in [39] and [45], respectively. Non-uniform sampling schemes similar to the one used in Algorithm 3 have been studied for both stochastic gradient and SDCA methods (e.g., [22, 44, 48]).

Shalev-Shwartz and Zhang [35] proposed an accelerated mini-batch SDCA method which incorporates additional primal updates over SDCA, and bears some similarity to our Mini-Batch SPDC method. They showed that its complexity interpolates between that of SDCA and AFG by varying the mini-batch size m. In particular, for m = n, it matches that of the AFG methods (as SPDC does). But for m = 1, the complexity of their method is the same as SDCA, which is worse than SPDC for ill-conditioned problems.

In addition, Shalev-Shwartz and Zhang [34] developed an accelerated proximal SDCA method which achieves the same batch complexity Õ(1 + √(κ/n)) as SPDC. Their method is an inner-outer iteration procedure, where the outer loop is a full-dimensional accelerated gradient method in the primal space x ∈ R^d. At each iteration of the outer loop, the SDCA method [36] is called to solve the dual problem (25) with customized regularization parameter and
precision. In contrast, SPDC is a straightforward single-loop coordinate optimization method.

More recently, Lin et al. [18] developed an accelerated proximal coordinate gradient (APCG) method for solving a more general class of composite convex optimization problems. When applied to the dual problem (25), APCG enjoys the same batch complexity Õ(1 + √(κ/n)) as SPDC. However, it needs an extra primal proximal-gradient step to have theoretical guarantees on the convergence of the primal-dual gap [18, Section 5]. The computational cost of this additional step is equivalent to one pass over the dataset, so it does not affect the overall complexity.

4.2 Other related work

Another way to approach problem (1) is to reformulate it as a constrained optimization problem

    minimize   (1/n)Σᵢ₌₁ⁿ φᵢ(zᵢ) + g(x)
    subject to aᵢᵀx = zᵢ,   i = 1,...,n,   (26)

and solve it by ADMM-type operator-splitting methods (e.g., [9]). In fact, as shown in [8], the batch primal-dual algorithm (21)–(23) is equivalent to a pre-conditioned ADMM (or inexact Uzawa method; see, e.g., [47]). Several authors [42, 28, 37, 49] have considered a more general formulation than (26), where each φᵢ is a function of the whole vector z ∈ Rⁿ. They proposed online or stochastic versions of ADMM which operate on only one φᵢ in each iteration, and obtained sublinear convergence rates. However, their cost per iteration is O(nd) instead of O(d).

Suzuki [38] considered a problem similar to (1), but with a more complex regularization function g, meaning that g does not have a simple proximal mapping; thus primal updates such as step (5) or (9), and similar steps in SDCA, cannot be computed efficiently. He proposed an algorithm that combines SDCA [36] and ADMM (e.g., [7]), and showed that it has a linear rate of convergence under conditions similar to Assumption A. It would be interesting to see if the SPDC method can be extended to their setting to obtain an accelerated linear convergence rate.

5 Efficient Implementation with Sparse Data

During each iteration of the SPDC methods, the updates of the primal variables (i.e., computing x⁽ᵗ⁺¹⁾) require full d-dimensional vector operations; see step (5) of Algorithm 1, step (9) of Algorithm 2 and step (19) of Algorithm 3. So the computational cost per iteration is O(d), and this can be too expensive if the dimension d is very high. In this section, we show how to exploit problem structure to avoid high-dimensional vector operations when the feature vectors aᵢ are sparse. We illustrate the efficient implementation for two popular cases: when g is a squared-ℓ₂ penalty and when g is an ℓ₁+ℓ₂ penalty. For both cases, we show that the computational cost per iteration only depends on the number of non-zero components of the feature vector.

5.1 Squared ℓ₂-norm penalty

Suppose that g(x) = (λ/2)‖x‖₂². In this case, the updates for each coordinate of x are independent of each other. More specifically, x⁽ᵗ⁺¹⁾ can be computed coordinate-wise in closed form:

    x_j⁽ᵗ⁺¹⁾ = (1/(1+λτ)) ( x_j⁽ᵗ⁾ − τu_j⁽ᵗ⁾ − τΔu_j ),   j = 1,...,d,   (27)

where Δu_j denotes (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_{kj} in Algorithm 1, or (1/m)Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_{kj} in Algorithm 2, or (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_{kj}/(np_k) in Algorithm 3, and u_j represents the j-th coordinate of u⁽ᵗ⁾.
Although the dimension d can be very large, we assume that each feature vector a_k is sparse. We denote by J⁽ᵗ⁾ the set of non-zero coordinates at iteration t; that is, if a_{kj} ≠ 0 for some index k ∈ K picked at iteration t, then j ∈ J⁽ᵗ⁾. If j ∉ J⁽ᵗ⁾, then the SPDC algorithm (and its variants) updates y⁽ᵗ⁺¹⁾ without using the value of x_j⁽ᵗ⁾ or x̄_j⁽ᵗ⁾. This can be seen from the updates (4), (8) and (18), where the value of the inner product ⟨a_k, x̄⁽ᵗ⁾⟩ does not depend on the value of x̄_j⁽ᵗ⁾. As a consequence, we can delay the updates on x_j and x̄_j whenever j ∉ J⁽ᵗ⁾ without affecting the updates on y⁽ᵗ⁾, and process all the missing updates at the next time when j ∈ J⁽ᵗ⁾.

Such a delayed update can be carried out very efficiently. Assume that t₀ is the last time when j ∈ J⁽ᵗ⁾, and that t₁ is the current iteration at which we want to update x_j and x̄_j. Since j ∉ J⁽ᵗ⁾ implies Δu_j = 0, we have

    x_j⁽ᵗ⁺¹⁾ = (1/(1+λτ)) ( x_j⁽ᵗ⁾ − τu_j⁽ᵗ⁰⁺¹⁾ ),   t = t₀+1, t₀+2, ..., t₁−1.   (28)
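With u_j frozen between touches, the scalar recursion (28) contracts geometrically toward the fixed point −u_j/λ, which is what the closed-form formula (29) exploits. A quick numerical check of that rollup (helper names are ours):

```python
def l2_prox_step(x, u, lam, tau):
    """One step of the delayed recursion (28) with u held fixed."""
    return (x - tau * u) / (1 + lam * tau)

def lazy_l2_update(x0, u, lam, tau, steps):
    """Closed-form rollup of `steps` applications of l2_prox_step:
    geometric decay of (x + u/lam) by 1/(1 + lam*tau) per step."""
    return (1 + lam * tau) ** (-steps) * (x0 + u / lam) - u / lam
```

Applying `l2_prox_step` k times from the same starting point gives exactly `lazy_l2_update(..., steps=k)`, so the delayed coordinate can be caught up in O(1) time.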

Notice that u_j⁽ᵗ⁾ is updated only at iterations for which j ∈ J⁽ᵗ⁾. The value of u_j⁽ᵗ⁾ does not change during the iterations [t₀+1, t₁−1], so we have u_j⁽ᵗ⁾ ≡ u_j⁽ᵗ⁰⁺¹⁾ for t ∈ [t₀+1, t₁−1]. Substituting this into the recursive formula (28), we obtain

    x_j⁽ᵗ¹⁾ = (1/(1+λτ))^{t₁−t₀−1} ( x_j⁽ᵗ⁰⁺¹⁾ + u_j⁽ᵗ⁰⁺¹⁾/λ ) − u_j⁽ᵗ⁰⁺¹⁾/λ.   (29)

The update (29) takes O(1) time to compute. Using the same formula, we can compute x_j⁽ᵗ¹⁻¹⁾ and subsequently compute x̄_j⁽ᵗ¹⁾ = x_j⁽ᵗ¹⁾ + θ(x_j⁽ᵗ¹⁾ − x_j⁽ᵗ¹⁻¹⁾). Thus, the computational complexity of a single iteration is proportional to |J⁽ᵗ⁾|, independent of the dimension d.

5.2 (ℓ₁+ℓ₂)-norm penalty

Suppose that g(x) = λ₁‖x‖₁ + (λ₂/2)‖x‖₂². Since both the ℓ₁-norm and the squared ℓ₂-norm are decomposable, the updates for each coordinate of x⁽ᵗ⁺¹⁾ are independent. More specifically,

    x_j⁽ᵗ⁺¹⁾ = argmin_{α∈R} { λ₁|α| + (λ₂/2)α² + (u_j⁽ᵗ⁾ + Δu_j)α + (α − x_j⁽ᵗ⁾)²/(2τ) },   (30)

where Δu_j follows the definition in Section 5.1. If j ∉ J⁽ᵗ⁾, then Δu_j = 0 and equation (30) simplifies to

    x_j⁽ᵗ⁺¹⁾ = (1/(1+λ₂τ)) ( x_j⁽ᵗ⁾ − τu_j⁽ᵗ⁾ − τλ₁ )   if x_j⁽ᵗ⁾ − τu_j⁽ᵗ⁾ > τλ₁,
               (1/(1+λ₂τ)) ( x_j⁽ᵗ⁾ − τu_j⁽ᵗ⁾ + τλ₁ )   if x_j⁽ᵗ⁾ − τu_j⁽ᵗ⁾ < −τλ₁,
               0                                          otherwise.   (31)

Similar to the approach of Section 5.1, we delay the update of x_j until j ∈ J⁽ᵗ⁾. Let t₀ be the last iteration when j ∈ J⁽ᵗ⁾, and let t₁ be the current iteration at which we want to update x_j. During the iterations [t₀+1, t₁−1] the value of u_j⁽ᵗ⁾ does not change, so u_j⁽ᵗ⁾ ≡ u_j⁽ᵗ⁰⁺¹⁾ for t ∈ [t₀+1, t₁−1]. Using equation (31) and the invariance of u_j⁽ᵗ⁾ over [t₀+1, t₁−1], we have an O(1)-time algorithm to calculate x_j⁽ᵗ¹⁾, which we detail in Appendix C. The vector x̄⁽ᵗ⁾ can be updated by the same algorithm since it is a linear combination of x⁽ᵗ⁾ and x⁽ᵗ⁻¹⁾. As a consequence, the computational complexity of each iteration is proportional to |J⁽ᵗ⁾|, independent of the dimension d.

6 Experiments

In this section, we compare the basic SPDC method (Algorithm 1) with several state-of-the-art optimization algorithms for solving problem (1). They include two batch-update algorithms: the accelerated full gradient (AFG) method [24, Section 2.2], and the limited-memory quasi-Newton method L-BFGS [27, Section 7.2]. For the AFG method, we adopt an adaptive line search scheme (e.g., [26]) to improve its efficiency. For the L-BFGS method, we use the memory size 30 as suggested by [27]. We also compare with two stochastic algorithms: the stochastic average gradient (SAG) method
[32, 33], and the stochastic dual coordinate ascent (SDCA) method [36]. We conduct experiments on a synthetic dataset and three real datasets.
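Since SDCA [36] is one of the stochastic baselines (and is discussed in Section 4.1), a compact implementation for the ridge case helps fix ideas: each iteration maximizes the dual objective (25) over a single coordinate, which has a closed form for the squared loss. This is our own specialization with our own names, not code from [36].

```python
import numpy as np

def sdca_ridge(A, b, lam, epochs=200, seed=0):
    """Sketch of SDCA for ridge regression: phi_i(z) = (z - b_i)^2 / 2,
    g(x) = lam/2 ||x||^2.  Maintains x = (1/(lam*n)) sum_i alpha_i a_i and
    performs exact single-coordinate maximization of the dual objective."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    alpha = np.zeros(n)
    x = np.zeros(d)
    sq = np.einsum('ij,ij->i', A, A)        # precomputed ||a_i||^2
    for _ in range(epochs * n):
        i = rng.integers(n)
        # closed-form coordinate step for the squared loss
        delta = (b[i] - A[i] @ x - alpha[i]) / (1 + sq[i] / (lam * n))
        alpha[i] += delta
        x += delta * A[i] / (lam * n)       # keep primal-dual link in sync
    return x
```

At a fixed point, αᵢ = bᵢ − aᵢᵀx for all i and λx = (1/n)Σ αᵢaᵢ, which recovers the ridge optimality condition.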

Figure 1: Comparing SPDC with other methods on synthetic data, with the regularization coefficient λ ∈ {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶} in panels (a)–(d). The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic gap log(P(x⁽ᵀ⁾) − P(x*)).

6.1 Ridge regression with synthetic data

We first compare SPDC with other algorithms on a simple quadratic problem using synthetic data. We generate n = 5000 training examples {aᵢ, bᵢ} according to the model

    bᵢ = ⟨aᵢ, x̂⟩ + εᵢ,   aᵢ ~ N(0, Σ),   εᵢ ~ N(0, 1),

where aᵢ ∈ R^d with d = 500, and the ground-truth vector x̂ is the all-ones vector. To make the problem ill-conditioned, the covariance matrix Σ is set to be diagonal with Σ_jj = j⁻², for j = 1,...,d. Given the set of examples {aᵢ, bᵢ}, we then solve a standard ridge regression problem:

    minimize_{x∈R^d}  P(x) := (1/n)Σᵢ₌₁ⁿ (1/2)(aᵢᵀx − bᵢ)² + (λ/2)‖x‖₂².

In the form of problem (1), we have φᵢ(z) = (z − bᵢ)²/2 and g(x) = (λ/2)‖x‖₂². As a consequence, the derivative of φᵢ is 1-Lipschitz continuous and g is λ-strongly convex.
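The synthetic setup above can be reproduced directly; the sketch below (our own helper names, with n and d shrunk from the paper's 5000 × 500 for speed) generates the data and evaluates P(x), using the closed-form ridge minimizer as the reference P(x*) for the plots.

```python
import numpy as np

def make_synthetic(n=500, d=50, seed=0):
    """Generate data as in Section 6.1: a_i ~ N(0, Sigma) with diagonal
    Sigma_jj = j^-2, b_i = <a_i, ones> + standard normal noise."""
    rng = np.random.default_rng(seed)
    std = 1.0 / np.arange(1, d + 1)              # sqrt of diag(Sigma)
    A = rng.standard_normal((n, d)) * std
    b = A @ np.ones(d) + rng.standard_normal(n)
    return A, b

def ridge_objective(A, b, x, lam):
    """P(x) = (1/n) sum (a_i^T x - b_i)^2 / 2 + lam/2 ||x||^2."""
    return 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * x @ x

def ridge_solve(A, b, lam):
    """Closed-form minimizer (A^T A / n + lam I)^{-1} A^T b / n."""
    n, d = A.shape
    return np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
```

The decaying column scales make the Hessian eigenvalues span several orders of magnitude, which is exactly the ill-conditioning the experiment targets.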

Dataset    number of samples n    number of features d    sparsity
Covtype    581,012                54                      22%
RCV1       20,242                 47,236                  0.16%
News20     19,996                 1,355,191               0.04%

Table 3: Characteristics of three real datasets obtained from the LIBSVM collection [12].

We evaluate the algorithms by the logarithmic optimality gap log(P(x⁽ᵗ⁾) − P(x*)), where x⁽ᵗ⁾ is the output of an algorithm after t passes over the entire dataset, and x* is the global minimum. When the regularization coefficient is relatively large, e.g., λ = 10⁻³ or 10⁻⁴, the problem is well-conditioned and we observe fast convergence of the stochastic algorithms SPDC, SAG and SDCA, which are substantially faster than the two batch methods AFG and L-BFGS. Figure 1 shows the convergence of the five different algorithms as we vary λ from 10⁻³ to 10⁻⁶. As the plots show, when the condition number is greater than n, the SPDC algorithm also converges substantially faster than the other two stochastic methods SAG and SDCA. It is also notably faster than L-BFGS. These results support our theory that SPDC enjoys a faster convergence rate on ill-conditioned problems. In terms of their batch complexities, SPDC is up to √n times faster than AFG, and (λn)⁻¹ᐟ² times faster than SAG and SDCA.

6.2 Binary classification with real data

Finally we show the results of solving the binary classification problem on three real datasets. The datasets are obtained from the LIBSVM collection [12] and summarized in Table 3. The three datasets are selected to reflect different relations between the sample size n and the feature dimensionality d: n ≫ d (Covtype), n < d (RCV1) and n ≪ d (News20). For all tasks, the data points take the form (aᵢ, bᵢ), where aᵢ ∈ R^d is the feature vector and bᵢ ∈ {−1, 1} is the binary class label. Our goal is to minimize the regularized empirical risk

    P(x) = (1/n)Σᵢ₌₁ⁿ φᵢ(aᵢᵀx) + (λ/2)‖x‖₂²,   where   φᵢ(z) = 0 if bᵢz ≥ 1;  1/2 − bᵢz if bᵢz ≤ 0;  (1 − bᵢz)²/2 otherwise.

Here, φᵢ is the smoothed hinge loss (see, e.g., [36]). It is easy to verify that the conjugate function of φᵢ is φᵢ*(β) = bᵢβ + β²/2 for bᵢβ ∈ [−1, 0], and +∞ otherwise.

The performance of the five algorithms is plotted in Figure 2 and Figure 3. In Figure 2, we compare SPDC with the two batch methods, AFG and L-BFGS. The results show that SPDC is substantially faster than AFG and L-BFGS for relatively large λ, illustrating the advantage of stochastic methods over batch methods on
well-conditioned problems. As λ decreases, the batch methods (especially L-BFGS) become comparable to SPDC. In Figure 3, we compare SPDC with the two stochastic methods, SAG and SDCA. Here, the observations are just the opposite of those in Figure 2. The three stochastic algorithms have comparable performance for relatively large λ, but SPDC becomes substantially faster when λ gets closer to zero. Summarizing Figure 2 and Figure 3, the performance of SPDC is always comparable to or better than the other methods in comparison.
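The smoothed hinge loss of Section 6.2 and its conjugate can be written down directly. This sketch (our own function names) implements both; the conjugacy can be checked numerically through the Fenchel equality φᵢ(z) = max_β { zβ − φᵢ*(β) }.

```python
import numpy as np

def smoothed_hinge(z, b):
    """Smoothed hinge loss phi_i from Section 6.2 (cf. [36]):
    0 for b*z >= 1, 1/2 - b*z for b*z <= 0, (1 - b*z)^2 / 2 in between."""
    t = b * z
    if t >= 1:
        return 0.0
    if t <= 0:
        return 0.5 - t
    return 0.5 * (1 - t) ** 2

def smoothed_hinge_conj(beta, b):
    """Conjugate phi_i^*(beta) = b*beta + beta^2/2 on b*beta in [-1, 0],
    and +infinity outside that interval."""
    if -1 <= b * beta <= 0:
        return b * beta + 0.5 * beta ** 2
    return np.inf
```

The bounded domain b·β ∈ [−1, 0] of the conjugate is what keeps the SPDC dual prox step a simple clipped quadratic for this loss.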

Figure 2: Comparing SPDC with AFG and L-BFGS on the three real datasets (RCV1, Covtype, News20) with the smoothed hinge loss, for λ ∈ {10⁻⁵, 10⁻⁶, 10⁻⁷}. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic optimality gap log(P(x⁽ᵗ⁾) − P(x*)). The SPDC algorithm is faster than the two batch methods when λ is relatively large.

Figure 3: Comparing SPDC with SAG and SDCA on the three real datasets (RCV1, Covtype, News20) with the smoothed hinge loss, for λ ∈ {10⁻⁵, 10⁻⁶, 10⁻⁷}. The horizontal axis is the number of passes through the entire dataset, and the vertical axis is the logarithmic optimality gap log(P(x⁽ᵀ⁾) − P(x*)). The SPDC algorithm is faster than the other two stochastic methods when λ is small.

A Proof of Theorem 1

We focus on characterizing the values of x and y after the t-th update in Algorithm 2. For any i ∈ {1,...,n}, let ỹᵢ be the value that yᵢ⁽ᵗ⁺¹⁾ would take if i ∈ K, i.e.,

    ỹᵢ = argmax_{y∈R} { y⟨aᵢ, x̄⁽ᵗ⁾⟩ − φᵢ*(y) − (y − yᵢ⁽ᵗ⁾)²/(2σ) }.

Since φᵢ is (1/γ)-smooth by assumption, its conjugate φᵢ* is γ-strongly convex (e.g., [14, Theorem 4.2.2]). Thus the function being maximized above is (1/σ + γ)-strongly concave. Therefore,

    φᵢ*(yᵢ*) − yᵢ*⟨aᵢ, x̄⁽ᵗ⁾⟩ + (yᵢ* − yᵢ⁽ᵗ⁾)²/(2σ)
      ≥ φᵢ*(ỹᵢ) − ỹᵢ⟨aᵢ, x̄⁽ᵗ⁾⟩ + (ỹᵢ − yᵢ⁽ᵗ⁾)²/(2σ) + (1/2)(1/σ + γ)(ỹᵢ − yᵢ*)².

On the other hand, since yᵢ* minimizes φᵢ*(y) − y⟨aᵢ, x*⟩ (by the saddle-point property), we have

    φᵢ*(ỹᵢ) − ỹᵢ⟨aᵢ, x*⟩ ≥ φᵢ*(yᵢ*) − yᵢ*⟨aᵢ, x*⟩ + (γ/2)(ỹᵢ − yᵢ*)².

Summing up the above two inequalities, we obtain

    (yᵢ⁽ᵗ⁾ − yᵢ*)²/(2σ) ≥ (1/(2σ) + γ)(ỹᵢ − yᵢ*)² + (ỹᵢ − yᵢ⁽ᵗ⁾)²/(2σ) + (ỹᵢ − yᵢ*)⟨aᵢ, x* − x̄⁽ᵗ⁾⟩.   (32)

According to Algorithm 2, the set K of indices to be updated is chosen randomly, and for every specific index i the event i ∈ K happens with probability m/n. If i ∈ K, then yᵢ⁽ᵗ⁺¹⁾ is updated to the value ỹᵢ, which satisfies inequality (32); otherwise, yᵢ⁽ᵗ⁺¹⁾ keeps its old value yᵢ⁽ᵗ⁾. Let F_t be the sigma field generated by all random variables defined before round t. Taking expectations conditioned on F_t, we have

    E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)² | F_t] = (m/n)(ỹᵢ − yᵢ*)² + (1 − m/n)(yᵢ⁽ᵗ⁾ − yᵢ*)²,
    E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ⁽ᵗ⁾)² | F_t] = (m/n)(ỹᵢ − yᵢ⁽ᵗ⁾)²,
    E[yᵢ⁽ᵗ⁺¹⁾ | F_t] = (m/n)ỹᵢ + (1 − m/n)yᵢ⁽ᵗ⁾.

As a result, we can represent (ỹᵢ − yᵢ*)², (ỹᵢ − yᵢ⁽ᵗ⁾)² and ỹᵢ in terms of the conditional expectations of (yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)², (yᵢ⁽ᵗ⁺¹⁾ − yᵢ⁽ᵗ⁾)² and yᵢ⁽ᵗ⁺¹⁾. Plugging these representations into inequality (32), we have

    ( n/(2mσ) + ((n−m)/m)γ )(yᵢ⁽ᵗ⁾ − yᵢ*)²
      ≥ ( n/(2mσ) + (n/m)γ )E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)² | F_t] + (n/(2mσ))E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ⁽ᵗ⁾)² | F_t]
        + ⟨aᵢ, x* − x̄⁽ᵗ⁾⟩ ( (yᵢ⁽ᵗ⁾ − yᵢ*) + (n/m)E[yᵢ⁽ᵗ⁺¹⁾ − yᵢ⁽ᵗ⁾ | F_t] ).   (33)

Then, summing over all indices i = 1,2,...,n and dividing both sides by n, we have

    ( 1/(2mσ) + ((n−m)γ)/(mn) )‖y⁽ᵗ⁾ − y*‖₂²
      ≥ ( 1/(2mσ) + γ/m )E[‖y⁽ᵗ⁺¹⁾ − y*‖₂² | F_t] + (1/(2mσ))E[‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂² | F_t]
        + E[ ⟨u⁽ᵗ⁾ − u* + (1/m)Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k, x* − x̄⁽ᵗ⁾⟩ | F_t ],   (34)

where u* = (1/n)Σᵢ₌₁ⁿ yᵢ*aᵢ is a shorthand notation, and u⁽ᵗ⁾ = (1/n)Σᵢ₌₁ⁿ yᵢ⁽ᵗ⁾aᵢ is defined in Algorithm 2. We used the fact that Σᵢ₌₁ⁿ (yᵢ⁽ᵗ⁺¹⁾ − yᵢ⁽ᵗ⁾)aᵢ = Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k, since only the coordinates in K are updated.

We still need an inequality characterizing the relation between x⁽ᵗ⁺¹⁾ and x⁽ᵗ⁾. Following the same steps as for deriving inequality (32), and using the λ-strong convexity of the function g, it is not difficult to show that

    ‖x⁽ᵗ⁾ − x*‖₂²/(2τ) ≥ (1/(2τ) + λ)‖x⁽ᵗ⁺¹⁾ − x*‖₂² + ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(2τ)
        + ⟨u⁽ᵗ⁾ − u* + (1/m)Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k, x⁽ᵗ⁺¹⁾ − x*⟩.   (35)

Taking expectations on both sides of inequality (35) and adding it to inequality (34), we have

    ‖x⁽ᵗ⁾ − x*‖₂²/(2τ) + ( 1/(2mσ) + ((n−m)γ)/(mn) )‖y⁽ᵗ⁾ − y*‖₂²
      ≥ (1/(2τ) + λ)E[‖x⁽ᵗ⁺¹⁾ − x*‖₂² | F_t] + ( 1/(2mσ) + γ/m )E[‖y⁽ᵗ⁺¹⁾ − y*‖₂² | F_t]
        + E[‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂² | F_t]/(2τ) + E[‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂² | F_t]/(2mσ)
        + E[ (1/n)( y⁽ᵗ⁾ − y* + (n/m)(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾) )ᵀ A ( (x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − θ(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) ) | F_t ].   (36)

For the last term of inequality (36), we have plugged in the definitions of u⁽ᵗ⁾, u* and x̄⁽ᵗ⁾, and used the relation (y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA = Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_kᵀ. Here A is the n-by-d matrix whose i-th row is equal to the vector aᵢᵀ.

For the rest of the proof, we lower bound the last term on the right-hand side of inequality (36). In particular, we have

    (1/n)( y⁽ᵗ⁾ − y* + (n/m)(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾) )ᵀ A ( (x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − θ(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) )
      = (1/n)(y⁽ᵗ⁺¹⁾ − y*)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − (θ/n)(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)
        + (1/m − 1/n)(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − (θ/m)(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾).   (37)

Recall that ‖a_k‖₂ ≤ R and τσ = 1/(4R²) according to the choice of parameters. We have

    (1/m)|(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾)| = (1/m)|Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_kᵀ(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾)|
      ≤ ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(4τ) + τ‖(1/m)Σ_{k∈K}(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k‖₂²
      ≤ ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(4τ) + (τR²/m)‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂²
      = ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(4τ) + ‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂²/(4σm).

Similarly, we have

    (θ/m)|(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)| ≤ θ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + ‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂²/(4σm).

The above upper bounds on the absolute values imply (note 0 ≤ 1/m − 1/n ≤ 1/m)

    (1/m − 1/n)(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) ≥ −‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(4τ) − ‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂²/(4σm),
    −(θ/m)(y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) ≥ −θ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) − ‖y⁽ᵗ⁺¹⁾ − y⁽ᵗ⁾‖₂²/(4σm).

Combining the above two inequalities with (36) and (37), we obtain

    ‖x⁽ᵗ⁾ − x*‖₂²/(2τ) + ( 1/(2mσ) + ((n−m)γ)/(mn) )‖y⁽ᵗ⁾ − y*‖₂²
      ≥ (1/(2τ) + λ)E[‖x⁽ᵗ⁺¹⁾ − x*‖₂² | F_t] + ( 1/(2mσ) + γ/m )E[‖y⁽ᵗ⁺¹⁾ − y*‖₂² | F_t]
        + E[‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂² | F_t]/(4τ) − θ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ)
        + (1/n)E[(y⁽ᵗ⁺¹⁾ − y*)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) | F_t] − (θ/n)(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾).   (38)

Recall that the parameters τ, σ and θ are chosen as

    τ = (1/(2R))√(mγ/(nλ)),   σ = (1/(2R))√(nλ/(mγ)),   θ = 1 − ( n/m + R√(n/(mλγ)) )⁻¹.

Plugging in these assignments, we find that

    (1/(2τ)) / (1/(2τ) + λ) = 1/(1 + 2τλ) ≤ θ   and   ( 1/(2mσ) + ((n−m)γ)/(mn) ) / ( 1/(2mσ) + γ/m ) = 1 − ( n/m + n/(2mσγ) )⁻¹ = θ.

Therefore, if we define a sequence Δ⁽ᵗ⁾ such that

    Δ⁽ᵗ⁾ = (1/(2τ) + λ)E[‖x⁽ᵗ⁾ − x*‖₂²] + ( 1/(2mσ) + γ/m )E[‖y⁽ᵗ⁾ − y*‖₂²]
           + E[‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²]/(4τ) + (1/n)E[(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)],

then inequality (38) implies the recursive relation Δ⁽ᵗ⁺¹⁾ ≤ θΔ⁽ᵗ⁾, which implies

    (1/(2τ) + λ)E[‖x⁽ᵗ⁾ − x*‖₂²] + ( 1/(2mσ) + γ/m )E[‖y⁽ᵗ⁾ − y*‖₂²]
      + E[‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²]/(4τ) + (1/n)E[(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)] ≤ θᵗ Δ⁽⁰⁾,   (39)

where

    Δ⁽⁰⁾ = (1/(2τ) + λ)‖x⁽⁰⁾ − x*‖₂² + ( 1/(2mσ) + γ/m )‖y⁽⁰⁾ − y*‖₂².

To eliminate the last two terms on the left-hand side of inequality (39), we notice that

    (1/n)|(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)| ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + (τ/n²)‖A‖₂²‖y⁽ᵗ⁾ − y*‖₂²
      ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + (τR²/n)‖y⁽ᵗ⁾ − y*‖₂²
      = ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + ‖y⁽ᵗ⁾ − y*‖₂²/(4σn)
      ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + ‖y⁽ᵗ⁾ − y*‖₂²/(4mσ),

where the second inequality uses ‖A‖₂² ≤ ‖A‖_F² ≤ nR², the equality uses τσ = 1/(4R²), and the last inequality uses m ≤ n. The above upper bound on the absolute value implies

    (1/n)(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) ≥ −‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) − ‖y⁽ᵗ⁾ − y*‖₂²/(4mσ).

The theorem is established by combining the above inequality with inequality (39).

B Proof of Theorem 2

The proof of Theorem 2 mimics the steps for proving Theorem 1. We start by establishing relations between (y⁽ᵗ⁾, y⁽ᵗ⁺¹⁾) and between (x⁽ᵗ⁾, x⁽ᵗ⁺¹⁾). Suppose that the quantity ỹᵢ minimizes the function φᵢ*(y) − y⟨aᵢ, x̄⁽ᵗ⁾⟩ + (pᵢn/(2σ))(y − yᵢ⁽ᵗ⁾)². Then, following the same argument used for establishing inequality (32), we obtain

    (pᵢn/(2σ))(yᵢ⁽ᵗ⁾ − yᵢ*)² ≥ ( pᵢn/(2σ) + γ )(ỹᵢ − yᵢ*)² + (pᵢn/(2σ))(ỹᵢ − yᵢ⁽ᵗ⁾)² + ⟨aᵢ, x* − x̄⁽ᵗ⁾⟩(ỹᵢ − yᵢ*).   (40)

Note that i = k with probability pᵢ. Therefore, we have

    (ỹᵢ − yᵢ*)² = (1/pᵢ)E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)² | F_t] − ((1−pᵢ)/pᵢ)(yᵢ⁽ᵗ⁾ − yᵢ*)²,
    (ỹᵢ − yᵢ⁽ᵗ⁾)² = (1/pᵢ)E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ⁽ᵗ⁾)² | F_t],
    ỹᵢ = (1/pᵢ)E[yᵢ⁽ᵗ⁺¹⁾ | F_t] − ((1−pᵢ)/pᵢ)yᵢ⁽ᵗ⁾,

where F_t represents the sigma field generated by all random variables defined before iteration t. Substituting the above relations into inequality (40), and averaging over i = 1,2,...,n, we have

    Σᵢ₌₁ⁿ ( 1/(2σ) + ((1−pᵢ)γ)/(npᵢ) )(yᵢ⁽ᵗ⁾ − yᵢ*)²
      ≥ Σᵢ₌₁ⁿ ( 1/(2σ) + γ/(npᵢ) )E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)² | F_t] + E[(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)² | F_t]/(2σ)
        + E[ ⟨u⁽ᵗ⁾ − u* + (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k/(np_k), x* − x̄⁽ᵗ⁾⟩ | F_t ],   (41)

where u* = (1/n)Σᵢ₌₁ⁿ yᵢ*aᵢ and u⁽ᵗ⁾ = (1/n)Σᵢ₌₁ⁿ yᵢ⁽ᵗ⁾aᵢ have the same definitions as in the proof of Theorem 1. For the relation between x⁽ᵗ⁾ and x⁽ᵗ⁺¹⁾, we follow the steps in the proof of Theorem 1 to obtain

    ‖x⁽ᵗ⁾ − x*‖₂²/(2τ) ≥ (1/(2τ) + λ)‖x⁽ᵗ⁺¹⁾ − x*‖₂² + ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(2τ)
        + ⟨u⁽ᵗ⁾ − u* + (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k/(np_k), x⁽ᵗ⁺¹⁾ − x*⟩.   (42)

Taking expectations on both sides of inequality (42) and adding it to inequality (41) yields

    ‖x⁽ᵗ⁾ − x*‖₂²/(2τ) + Σᵢ₌₁ⁿ ( 1/(2σ) + ((1−pᵢ)γ)/(npᵢ) )(yᵢ⁽ᵗ⁾ − yᵢ*)²
      ≥ (1/(2τ) + λ)E[‖x⁽ᵗ⁺¹⁾ − x*‖₂² | F_t] + Σᵢ₌₁ⁿ ( 1/(2σ) + γ/(npᵢ) )E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)² | F_t]
        + E[‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂² | F_t]/(2τ) + E[(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)² | F_t]/(2σ)
        + E[ vᵀ( (x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − θ(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) ) | F_t ],   (43)

where v = (1/n)Aᵀ(y⁽ᵗ⁾ − y*) + (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_k/(np_k), and A is the n-by-d matrix whose i-th row is equal to the vector aᵢᵀ. Next, we lower bound the last term on the right-hand side of inequality (43). Indeed, it can be expanded as

    vᵀ( (x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − θ(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) )
      = (1/n)(y⁽ᵗ⁺¹⁾ − y*)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − (θ/n)(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)
        + ((1−p_k)/(np_k))(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_kᵀ(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) − (θ/(np_k))(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_kᵀ(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾).   (44)

Note that the probability p_k given in (17) satisfies

    p_k ≥ ‖a_k‖₂ / ( 2Σᵢ₌₁ⁿ ‖aᵢ‖₂ ),   k = 1,...,n.

Since the parameters τ and σ satisfy στR̄² = 1/16, where R̄ = (1/n)Σᵢ₌₁ⁿ ‖aᵢ‖₂, we have p_k²n²/τ ≥ 4σ‖a_k‖₂², and consequently

    |(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_kᵀ(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾)/(np_k)| ≤ ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(4τ) + τ‖a_k‖₂²(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)²/(n²p_k²)
      ≤ ‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂²/(4τ) + (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)²/(4σ).

Similarly, we have

    (θ/(np_k))|(y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)a_kᵀ(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)| ≤ θ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + (y_k⁽ᵗ⁺¹⁾ − y_k⁽ᵗ⁾)²/(4σ).

Combining the above two inequalities with (43) and (44), we obtain

    ‖x⁽ᵗ⁾ − x*‖₂²/(2τ) + Σᵢ₌₁ⁿ ( 1/(2σ) + ((1−pᵢ)γ)/(npᵢ) )(yᵢ⁽ᵗ⁾ − yᵢ*)²
      ≥ (1/(2τ) + λ)E[‖x⁽ᵗ⁺¹⁾ − x*‖₂² | F_t] + Σᵢ₌₁ⁿ ( 1/(2σ) + γ/(npᵢ) )E[(yᵢ⁽ᵗ⁺¹⁾ − yᵢ*)² | F_t]
        + E[‖x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾‖₂² | F_t]/(4τ) − θ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ)
        + (1/n)E[(y⁽ᵗ⁺¹⁾ − y*)ᵀA(x⁽ᵗ⁺¹⁾ − x⁽ᵗ⁾) | F_t] − (θ/n)(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾).   (45)

Recall that the parameters τ, σ and θ are chosen as

    τ = (1/(4R̄))√(γ/(nλ)),   σ = (1/(4R̄))√(nλ/γ),   θ = 1 − ( 2n + 2R̄√(n/(λγ)) )⁻¹.

Plugging in these assignments and using the fact that pᵢ ≥ 1/(2n), we find that

    (1/(2τ)) / (1/(2τ) + λ) ≤ θ   and   ( 1/(2σ) + ((1−pᵢ)γ)/(npᵢ) ) / ( 1/(2σ) + γ/(npᵢ) ) ≤ θ   for i = 1,2,...,n.

Therefore, if we define a sequence Δ⁽ᵗ⁾ such that

    Δ⁽ᵗ⁾ = (1/(2τ) + λ)E[‖x⁽ᵗ⁾ − x*‖₂²] + Σᵢ₌₁ⁿ ( 1/(2σ) + γ/(npᵢ) )E[(yᵢ⁽ᵗ⁾ − yᵢ*)²]
           + E[‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²]/(4τ) + (1/n)E[(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)],

then inequality (45) implies the recursive relation Δ⁽ᵗ⁺¹⁾ ≤ θΔ⁽ᵗ⁾, which, using 1/(2σ) ≤ 1/(2σ) + γ/(npᵢ) ≤ 1/(2σ) + 2γ, implies

    (1/(2τ) + λ)E[‖x⁽ᵗ⁾ − x*‖₂²] + (1/(2σ))E[‖y⁽ᵗ⁾ − y*‖₂²]
      + E[‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²]/(4τ) + (1/n)E[(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)]
      ≤ θᵗ ( (1/(2τ) + λ)‖x⁽⁰⁾ − x*‖₂² + (1/(2σ) + 2γ)‖y⁽⁰⁾ − y*‖₂² ).   (46)

To eliminate the last two terms on the left-hand side of inequality (46), we notice that

    (1/n)|(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾)| ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + (τ/n²)‖A‖₂²‖y⁽ᵗ⁾ − y*‖₂²
      ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + (τ/n²)‖A‖_F²‖y⁽ᵗ⁾ − y*‖₂²
      ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + ‖y⁽ᵗ⁾ − y*‖₂² · (Σᵢ₌₁ⁿ‖aᵢ‖₂²) / ( 16σ(Σᵢ₌₁ⁿ‖aᵢ‖₂)² )
      ≤ ‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) + ‖y⁽ᵗ⁾ − y*‖₂²/(16σ),

where the third inequality uses n²/τ = 16σn²R̄² = 16σ(Σᵢ₌₁ⁿ‖aᵢ‖₂)². This implies

    (1/n)(y⁽ᵗ⁾ − y*)ᵀA(x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾) ≥ −‖x⁽ᵗ⁾ − x⁽ᵗ⁻¹⁾‖₂²/(4τ) − ‖y⁽ᵗ⁾ − y*‖₂²/(16σ).

Substituting the above inequality into inequality (46) completes the proof.

C Efficient update for the (ℓ₁+ℓ₂)-norm penalty

From Section 5.2, writing u := u_j⁽ᵗ⁰⁺¹⁾, we have the following recursive formula for t ∈ [t₀+1, t₁−1]:

    x_j⁽ᵗ⁺¹⁾ = (1/(1+λ₂τ)) ( x_j⁽ᵗ⁾ − τu − τλ₁ )   if x_j⁽ᵗ⁾ − τu > τλ₁,
               (1/(1+λ₂τ)) ( x_j⁽ᵗ⁾ − τu + τλ₁ )   if x_j⁽ᵗ⁾ − τu < −τλ₁,
               0                                    otherwise.   (47)

Given x_j⁽ᵗ⁰⁺¹⁾ at iteration t₀+1, we present an efficient algorithm for calculating x_j⁽ᵗ¹⁾. We begin by examining the sign of x_j⁽ᵗ⁰⁺¹⁾.

Case I (x_j⁽ᵗ⁰⁺¹⁾ = 0): If u < −λ₁, then equation (47) implies x_j⁽ᵗ⁾ > 0 for all t > t₀+1. Consequently, we have a closed-form formula for x_j⁽ᵗ¹⁾:

    x_j⁽ᵗ¹⁾ = (1/(1+λ₂τ))^{t₁−t₀−1} ( x_j⁽ᵗ⁰⁺¹⁾ + (u+λ₁)/λ₂ ) − (u+λ₁)/λ₂.   (48)

If u > λ₁, then equation (47) implies x_j⁽ᵗ⁾ < 0 for all t > t₀+1. Therefore, we have the closed-form formula

    x_j⁽ᵗ¹⁾ = (1/(1+λ₂τ))^{t₁−t₀−1} ( x_j⁽ᵗ⁰⁺¹⁾ + (u−λ₁)/λ₂ ) − (u−λ₁)/λ₂.   (49)

Finally, if u ∈ [−λ₁, λ₁], then equation (47) implies x_j⁽ᵗ¹⁾ = 0.
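The Case I closed form (48) can be sanity-checked against the recursion (47). In this sketch (helper names are ours), u < −λ₁, so the iterate stays on the positive branch and decays geometrically toward −(u + λ₁)/λ₂:

```python
def prox_step_l1l2(x, u, lam1, lam2, tau):
    """One step of recursion (47): soft-threshold at tau*lam1, then
    shrink by 1/(1 + lam2*tau)."""
    z = x - tau * u
    if z > tau * lam1:
        return (z - tau * lam1) / (1 + lam2 * tau)
    if z < -tau * lam1:
        return (z + tau * lam1) / (1 + lam2 * tau)
    return 0.0

def lazy_l1l2_case1(x0, u, lam1, lam2, tau, steps):
    """Closed form (48) for the positive branch (valid when u < -lam1):
    geometric decay of (x + (u+lam1)/lam2) by 1/(1 + lam2*tau) per step."""
    c = (u + lam1) / lam2
    return (1 + lam2 * tau) ** (-steps) * (x0 + c) - c
```

The negative branch (49) is symmetric with (u − λ₁)/λ₂, and Cases II/III only need the additional crossing-time computation from (51) and (53).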

Case II (x_j⁽ᵗ⁰⁺¹⁾ > 0): If u ≤ −λ₁, then it is easy to verify that x_j⁽ᵗ¹⁾ is obtained by equation (48). Otherwise, we use the recursive formula (47) to derive the latest time t₊ ∈ [t₀+1, t₁] such that x_j⁽ᵗ⁺⁾ > 0 still holds. Indeed, since x_j⁽ᵗ⁾ > 0 for all t ∈ [t₀+1, t₊], we have a closed-form formula for x_j⁽ᵗ⁺⁾:

    x_j⁽ᵗ⁺⁾ = (1/(1+λ₂τ))^{t₊−t₀−1} ( x_j⁽ᵗ⁰⁺¹⁾ + (u+λ₁)/λ₂ ) − (u+λ₁)/λ₂.   (50)

We look for the largest t₊ such that the right-hand side of equation (50) is positive, which is equivalent to

    t₊ − t₀ − 1 < log( 1 + λ₂x_j⁽ᵗ⁰⁺¹⁾/(u+λ₁) ) / log(1+λ₂τ).   (51)

Thus, t₊ is the largest integer in [t₀+1, t₁] such that inequality (51) holds. If t₊ = t₁, then x_j⁽ᵗ¹⁾ is obtained by (50). Otherwise, we can calculate x_j⁽ᵗ⁺⁺¹⁾ by formula (47), then resort to Case I or Case III, treating t₊ as t₀.

Case III (x_j⁽ᵗ⁰⁺¹⁾ < 0): If u ≥ λ₁, then x_j⁽ᵗ¹⁾ is obtained by equation (49). Otherwise, we calculate the largest integer t₋ ∈ [t₀+1, t₁] such that x_j⁽ᵗ⁻⁾ < 0 still holds. Using the same argument as for Case II, we have the closed-form expression

    x_j⁽ᵗ⁻⁾ = (1/(1+λ₂τ))^{t₋−t₀−1} ( x_j⁽ᵗ⁰⁺¹⁾ + (u−λ₁)/λ₂ ) − (u−λ₁)/λ₂,   (52)

where t₋ is the largest integer in [t₀+1, t₁] such that the following inequality holds:

    t₋ − t₀ − 1 < log( 1 + λ₂x_j⁽ᵗ⁰⁺¹⁾/(u−λ₁) ) / log(1+λ₂τ).   (53)

If t₋ = t₁, then x_j⁽ᵗ¹⁾ is obtained by (52). Otherwise, we can calculate x_j⁽ᵗ⁻⁺¹⁾ by formula (47), then resort to Case I or Case II, treating t₋ as t₀.

Finally, we note that formula (47) implies the monotonicity of x_j⁽ᵗ⁾ for t = t₀+1, t₀+2, .... As a consequence, the procedure of Case I, Case II or Case III is executed at most once each. Hence, the algorithm for calculating x_j⁽ᵗ¹⁾ has O(1) time complexity.

References

[1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[2] D. P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, Ser. B, 129:163–195, 2011.

[3] D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 4. The MIT Press, 2012.

[4] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

[5] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), pages 177–187, Paris, France, August 2010. Springer.

[6] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[8] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

[9] K.-W. Chang, C.-J. Hsieh, and C.-J. Lin. Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9:1369–1398, 2008.

[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[11] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2898, 2009.

[12] R.-E. Fan and C.-J. Lin. LIBSVM data: Classification, regression and multi-label. URL: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, 2011.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2nd edition, 2009.

[14] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Springer, 2001.

[15] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 408–415, 2008.

[16] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.

[17] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.

[18] Q. Lin, Z. Lu, and L. Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. Technical Report MSR-TR-2014-94, Microsoft Research, 2014. arXiv:1407.1296.