arxiv: v1 [math.oc] 7 Mar 2017


Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms

Jialei Wang (Department of Computer Science, The University of Chicago, Chicago, Illinois 60637, USA), Lin Xiao (Microsoft Research, Redmond, Washington 98052, USA). Correspondence to: Jialei Wang <jialei@uchicago.edu>, Lin Xiao <lin.xiao@microsoft.com>.

arXiv:1703.02624v1 [math.OC] 7 Mar 2017

Abstract

We consider empirical risk minimization of linear predictors with convex loss functions. Such problems can be reformulated as convex-concave saddle point problems, and thus are well suited to primal-dual first-order algorithms. However, primal-dual algorithms often require explicit strongly convex regularization in order to obtain fast linear convergence, and the required dual proximal mapping may not admit a closed-form or efficient solution. In this paper, we develop both batch and randomized primal-dual algorithms that can exploit strong convexity from data adaptively and are capable of achieving linear convergence even without regularization. We also present dual-free variants of the adaptive primal-dual algorithms that do not require computing the dual proximal mapping, which are especially suitable for logistic regression.

1. Introduction

We consider the problem of regularized empirical risk minimization (ERM) of linear predictors. Let $a_1, \ldots, a_n \in \mathbb{R}^d$ be the feature vectors of $n$ data samples, $\phi_i : \mathbb{R} \to \mathbb{R}$ be a convex loss function associated with the linear prediction $a_i^T x$, for $i = 1, \ldots, n$, and $g : \mathbb{R}^d \to \mathbb{R}$ be a convex regularization function for the predictor $x \in \mathbb{R}^d$. ERM amounts to solving the following convex optimization problem:

\[ \min_{x \in \mathbb{R}^d} \Big\{ P(x) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n \phi_i(a_i^T x) + g(x) \Big\}. \tag{1} \]

Examples of the above formulation include many well-known classification and regression problems. For binary classification, each feature vector $a_i$ is associated with a label $b_i \in \{\pm 1\}$. In particular, logistic regression is obtained by setting $\phi_i(z) = \log(1 + \exp(-b_i z))$. For linear regression problems, each feature vector $a_i$ is associated with a dependent variable $b_i \in \mathbb{R}$, and $\phi_i(z) = (1/2)(z - b_i)^2$. Then we get ridge regression with $g(x) = (\lambda/2)\|x\|^2$, and the elastic net with $g(x) = \lambda_1 \|x\|_1 + (\lambda_2/2)\|x\|^2$.

Let $A = [a_1, \ldots, a_n]^T$ be the data matrix. Throughout this paper, we make the following assumptions:

Assumption 1.
The functions $\phi_i$, $g$ and matrix $A$ satisfy:
• Each $\phi_i$ is $\delta$-strongly convex and $1/\gamma$-smooth, where $\gamma > 0$ and $\delta \ge 0$, and $\gamma\delta \le 1$;
• $g$ is $\lambda$-strongly convex, where $\lambda \ge 0$;
• $\lambda + \delta\mu^2/n > 0$, where $\mu = \sqrt{\lambda_{\min}(A^T A)}$.

The strong convexity and smoothness mentioned above are with respect to the standard Euclidean norm, denoted as $\|x\| = \sqrt{x^T x}$. (See, e.g., Nesterov (2004, Sections 2.1.1 and 2.1.3) for the exact definitions.) Let $R = \max_i \{\|a_i\|\}$. Assuming $\lambda > 0$, the quantity $R^2/(\gamma\lambda)$ is a popular definition of condition number for analyzing the complexities of different algorithms. The last condition above means that the primal objective function $P(x)$ is strongly convex, even if $\lambda = 0$.

There has been extensive research activity in recent years on developing efficient algorithms for solving problem (1). A broad class of randomized algorithms that exploit the finite sum structure in the ERM problem have emerged as very competitive both in terms of theoretical complexity and practical performance. They can be put into three categories: primal, dual, and primal-dual.

Primal randomized algorithms work with the ERM problem (1) directly. They are modern versions of randomized incremental gradient methods (e.g., Bertsekas, 2012; Nedić & Bertsekas, 2001) equipped with variance reduction techniques. Each iteration of such algorithms processes only one data point $a_i$, with complexity $O(d)$. They include SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), and SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), which all achieve the iteration complexity $O\big((n + R^2/(\gamma\lambda)) \log(1/\epsilon)\big)$ to find an $\epsilon$-optimal solution. In fact, they are capable of exploiting the strong convexity from data, meaning that the condition number $R^2/(\gamma\lambda)$ in the complexity can be replaced by the more favorable one $R^2/(\gamma(\lambda + \delta\mu^2/n))$. This improvement can be achieved without explicit knowledge of $\mu$ from data.
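To make the two condition numbers above concrete, the following is a minimal sketch that evaluates them for a given data matrix. The function name `condition_numbers` and its argument names are illustrative choices, not part of the paper; $\mu$ is computed from the smallest eigenvalue of $A^T A$, so it is zero whenever $n < d$.

```python
import numpy as np

def condition_numbers(A, lam, gamma, delta):
    """Return (R, mu, R^2/(gamma*lam), R^2/(gamma*(lam + delta*mu^2/n))).

    Hypothetical helper for illustration; assumes lam > 0.
    """
    n = A.shape[0]
    # R is the largest Euclidean norm among the rows a_i of A.
    R = np.max(np.linalg.norm(A, axis=1))
    # mu = sqrt(lambda_min(A^T A)); eigvalsh returns eigenvalues in ascending order.
    lam_min = np.linalg.eigvalsh(A.T @ A)[0]
    mu = np.sqrt(max(lam_min, 0.0))
    kappa_reg = R**2 / (gamma * lam)
    kappa_data = R**2 / (gamma * (lam + delta * mu**2 / n))
    return R, mu, kappa_reg, kappa_data
```

With weak regularization and a well-conditioned data matrix, the second number can be far smaller than the first, which is the gap the algorithms in this paper try to exploit.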

Dual algorithms solve the Fenchel dual of (1) by maximizing

\[ D(y) \overset{\mathrm{def}}{=} -\frac{1}{n} \sum_{i=1}^n \phi_i^*(y_i) - g^*\Big( -\frac{1}{n} \sum_{i=1}^n y_i a_i \Big) \tag{2} \]

using randomized coordinate ascent algorithms. (Here $\phi_i^*$ and $g^*$ denote the conjugate functions of $\phi_i$ and $g$.) They include SDCA (Shalev-Shwartz & Zhang, 2013), Nesterov (2012) and Richtárik & Takáč (2014). They have the same complexity $O\big((n + R^2/(\gamma\lambda)) \log(1/\epsilon)\big)$, but it is hard for them to exploit strong convexity from data.

Primal-dual algorithms solve the convex-concave saddle point problem $\min_x \max_y \mathcal{L}(x, y)$, where

\[ \mathcal{L}(x, y) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^n \big( y_i \langle a_i, x \rangle - \phi_i^*(y_i) \big) + g(x). \tag{3} \]

In particular, SPDC (Zhang & Xiao, 2015) achieves an accelerated linear convergence rate with iteration complexity $O\big((n + \sqrt{n}\,R/\sqrt{\gamma\lambda}) \log(1/\epsilon)\big)$, which is better than the aforementioned non-accelerated complexity when $R^2/(\gamma\lambda) > n$. Lan & Zhou (2015) developed dual-free variants of accelerated primal-dual algorithms, but without considering the linear predictor structure in ERM. Balamurugan & Bach (2016) extended SVRG and SAGA to solving saddle point problems.

Accelerated primal and dual randomized algorithms have also been developed. Nesterov (2012), Fercoq & Richtárik (2015) and Lin et al. (2015b) developed accelerated coordinate gradient algorithms, which can be applied to solve the dual problem (2). Allen-Zhu (2016) developed an accelerated variant of SVRG. Acceleration can also be obtained using the Catalyst framework (Lin et al., 2015a). They all achieve the same $O\big((n + \sqrt{n}\,R/\sqrt{\gamma\lambda}) \log(1/\epsilon)\big)$ complexity. A common feature of accelerated algorithms is that they require a good estimate of the strong convexity parameter. This makes it hard for them to exploit strong convexity from data, because the minimum singular value $\mu$ of the data matrix $A$ is very hard to estimate in general.

In this paper, we show that primal-dual algorithms are capable of exploiting strong convexity from data if the algorithm parameters (such as step sizes) are set appropriately. While the optimal setting depends on knowledge of the convexity parameter $\mu$ from the data, we develop adaptive variants of primal-dual algorithms that can tune the parameters automatically. Such adaptive schemes rely critically on the capability of primal-dual algorithms to evaluate the primal-dual optimality gap.
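A major disadvantage of primal-dual algorithms is that the required dual proximal mapping may not admit a closed-form or efficient solution. We follow the approach of Lan & Zhou (2015) to derive dual-free variants of the primal-dual algorithms customized for ERM problems with the linear predictor structure, and show that they can also exploit strong convexity from data with correct choices of parameters or using an adaptation scheme.

2. Batch primal-dual algorithms

Before diving into randomized primal-dual algorithms, we first consider batch primal-dual algorithms, which exhibit similar properties to their randomized variants. To this end, we consider a batch version of the ERM problem (1),

\[ \min_{x \in \mathbb{R}^d} \big\{ P(x) \overset{\mathrm{def}}{=} f(Ax) + g(x) \big\}, \tag{4} \]

where $A \in \mathbb{R}^{n \times d}$, and make the following assumption:

Assumption 2. The functions $f$, $g$ and matrix $A$ satisfy:
• $f$ is $\delta$-strongly convex and $1/\gamma$-smooth, where $\gamma > 0$ and $\delta \ge 0$, and $\gamma\delta \le 1$;
• $g$ is $\lambda$-strongly convex, where $\lambda \ge 0$;
• $\lambda + \delta\mu^2 > 0$, where $\mu = \sqrt{\lambda_{\min}(A^T A)}$.

For exact correspondence with problem (1), we have $f(z) = \frac{1}{n} \sum_{i=1}^n \phi_i(z_i)$ with $z_i = a_i^T x$. Under Assumption 1, the function $f(z)$ is $\delta/n$-strongly convex and $1/(n\gamma)$-smooth, and $f(Ax)$ is $\delta\mu^2/n$-strongly convex and $R^2/\gamma$-smooth. However, such correspondences alone are not sufficient to exploit the structure of (1), i.e., substituting them into the batch algorithms of this section will not produce the efficient algorithms for solving problem (1) that we will present in Sections 3 and 4.2. So we do not make such correspondences explicit in this section. Rather, we treat them as independent assumptions with the same notation.

Using conjugate functions, we can derive the dual of (4) as

\[ \max_{y \in \mathbb{R}^n} \big\{ D(y) \overset{\mathrm{def}}{=} -f^*(y) - g^*(-A^T y) \big\}, \tag{5} \]

and the convex-concave saddle point formulation is

\[ \min_{x \in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} \big\{ \mathcal{L}(x, y) \overset{\mathrm{def}}{=} g(x) + y^T A x - f^*(y) \big\}. \tag{6} \]

We consider the primal-dual first-order algorithm proposed by Chambolle & Pock (2011; 2016) for solving the saddle point problem (6), which is given as Algorithm 1. Here we call it the batch primal-dual (BPD) algorithm.

Algorithm 1 Batch Primal-Dual (BPD) Algorithm
  input: parameters $\tau$, $\sigma$, $\theta$, initial point $(\tilde{x}^0 = x^0, y^0)$
  for $t = 0, 1, 2, \ldots$ do
    $y^{t+1} = \mathrm{prox}_{\sigma f^*}(y^t + \sigma A \tilde{x}^t)$
    $x^{t+1} = \mathrm{prox}_{\tau g}(x^t - \tau A^T y^{t+1})$
    $\tilde{x}^{t+1} = x^{t+1} + \theta (x^{t+1} - x^t)$
  end for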
Assuming that $f$ is smooth and $g$ is strongly convex, Chambolle & Pock (2011; 2016) showed that Algorithm 1 achieves an accelerated linear convergence rate if $\lambda > 0$. However, they did not consider the case where additional or the sole source of strong convexity comes from $f(Ax)$.
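As an illustration of the BPD iteration, here is a minimal sketch specialized to ridge regression, assuming $f(z) = \frac{1}{2}\|z - b\|^2$ and $g(x) = \frac{\lambda}{2}\|x\|^2$, for which both proximal mappings have closed forms. The function name `bpd_ridge` and the zero initial points are assumptions made for this example only.

```python
import numpy as np

def bpd_ridge(A, b, lam, tau, sigma, theta, iters):
    """Batch primal-dual iteration for min_x 0.5*||Ax - b||^2 + 0.5*lam*||x||^2.

    Illustrative sketch: f(z) = 0.5*||z - b||^2, so f*(y) = 0.5*||y||^2 + b^T y.
    """
    n, d = A.shape
    x = np.zeros(d)
    x_tilde = x.copy()
    y = np.zeros(n)
    for _ in range(iters):
        # Dual step: prox_{sigma f*}(w) = (w - sigma*b) / (1 + sigma).
        y = (y + sigma * (A @ x_tilde) - sigma * b) / (1.0 + sigma)
        # Primal step: prox_{tau g}(w) = w / (1 + tau*lam).
        x_new = (x - tau * (A.T @ y)) / (1.0 + tau * lam)
        # Extrapolation of the primal iterate.
        x_tilde = x_new + theta * (x_new - x)
        x = x_new
    return x, y
```

At a fixed point, $y = Ax - b$ and $\lambda x = -A^T y$, which is the optimality condition $(A^T A + \lambda I)x = A^T b$ of the ridge problem. Any step sizes with $\tau\sigma\|A\|^2 < 1$ and $\theta = 1$ give a convergent (though not necessarily fastest) instance.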

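In the following theorem, we show how to set the parameters $\tau$, $\sigma$ and $\theta$ to exploit both sources of strong convexity to achieve fast linear convergence.

Theorem 1. Suppose Assumption 2 holds and $(x^\star, y^\star)$ is the unique saddle point of $\mathcal{L}$ defined in (6). Let $L = \|A\| = \sqrt{\lambda_{\max}(A^T A)}$. If we set the parameters in Algorithm 1 as

\[ \sigma = \frac{1}{L} \sqrt{\frac{\lambda + \delta\mu^2}{\gamma}}, \qquad \tau = \frac{1}{L} \sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}, \tag{7} \]

and $\theta = \max\{\theta_x, \theta_y\}$ where

\[ \theta_x = \Big( 1 - \frac{\delta}{\delta + 2\sigma} \cdot \frac{\mu^2}{L^2} \Big) \frac{1}{1 + \tau\lambda}, \qquad \theta_y = \frac{1}{1 + \sigma\gamma/2}, \tag{8} \]

then we have

\[ \Big( \frac{1}{2\tau} + \frac{\lambda}{2} \Big) \|x^t - x^\star\|^2 + \frac{\gamma}{4} \|y^t - y^\star\|^2 \le \theta^t C, \]
\[ \mathcal{L}(x^t, y^\star) - \mathcal{L}(x^\star, y^t) \le \theta^t C, \]

where $C = \big( \frac{1}{2\tau} + \frac{\lambda}{2} \big) \|x^0 - x^\star\|^2 + \big( \frac{1}{2\sigma} + \frac{\gamma}{4} \big) \|y^0 - y^\star\|^2$.

The proof of Theorem 1 is given in Appendices B and C. Here we give a detailed analysis of the convergence rate. Substituting $\sigma$ and $\tau$ in (7) into the expressions for $\theta_y$ and $\theta_x$ in (8), and assuming $\gamma(\lambda + \delta\mu^2) \ll L^2$, we have

\[ \theta_x \approx 1 - \frac{\gamma\delta\mu^2}{L^2} \Big( 2\frac{\sqrt{\gamma(\lambda + \delta\mu^2)}}{L} + \gamma\delta \Big)^{-1} - \frac{\lambda}{L} \sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}, \]
\[ \theta_y = \frac{1}{1 + \sqrt{\gamma(\lambda + \delta\mu^2)}/(2L)} \approx 1 - \frac{\sqrt{\gamma(\lambda + \delta\mu^2)}}{2L}. \]

Since the overall condition number of the problem is $\frac{L^2}{\gamma(\lambda + \delta\mu^2)}$, it is clear that $\theta_y$ is an accelerated convergence rate. Next we examine $\theta_x$ in two special cases.

The case of $\delta\mu^2 = 0$ but $\lambda > 0$. In this case, we have $\tau = \frac{1}{L}\sqrt{\gamma/\lambda}$ and $\sigma = \frac{1}{L}\sqrt{\lambda/\gamma}$, and thus

\[ \theta_x = \frac{1}{1 + \sqrt{\gamma\lambda}/L} \approx 1 - \frac{\sqrt{\gamma\lambda}}{L}, \qquad \theta_y = \frac{1}{1 + \sqrt{\gamma\lambda}/(2L)} \approx 1 - \frac{\sqrt{\gamma\lambda}}{2L}. \]

Therefore we have $\theta = \max\{\theta_x, \theta_y\} \approx 1 - \frac{\sqrt{\gamma\lambda}}{2L}$. This indeed is an accelerated convergence rate, recovering the result of Chambolle & Pock (2011; 2016).

The case of $\lambda = 0$ but $\delta\mu^2 > 0$. In this case, we have $\tau = \frac{1}{L\mu}\sqrt{\gamma/\delta}$ and $\sigma = \frac{\mu}{L}\sqrt{\delta/\gamma}$, and

\[ \theta_x = 1 - \frac{\gamma\delta\mu^2}{L^2} \cdot \frac{1}{2\sqrt{\gamma\delta}\,\mu/L + \gamma\delta}, \qquad \theta_y \approx 1 - \frac{\sqrt{\gamma\delta}\,\mu}{2L}. \]

Notice that $\frac{1}{\gamma\delta} \frac{L^2}{\mu^2}$ is the condition number of $f(Ax)$. Next we assume $\mu \ll L$ and examine how $\theta_x$ varies with $\gamma\delta$.

• If $\gamma\delta \approx \mu^2/L^2$, meaning $f$ is badly conditioned, then
\[ \theta_x \approx 1 - \frac{\gamma\delta\mu^2}{L^2} \cdot \frac{1}{3\sqrt{\gamma\delta}\,\mu/L} = 1 - \frac{\sqrt{\gamma\delta}\,\mu}{3L}. \]
Because the overall condition number is $\frac{1}{\gamma\delta}\frac{L^2}{\mu^2}$, this is an accelerated linear rate, and so is $\theta = \max\{\theta_x, \theta_y\}$.

• If $\gamma\delta \approx \mu/L$, meaning $f$ is mildly conditioned, then
\[ \theta_x \approx 1 - \frac{\mu^3}{L^3} \cdot \frac{1}{2(\mu/L)^{3/2} + \mu/L} \approx 1 - \frac{\mu^2}{L^2}. \]
This represents a half-accelerated rate, because the overall condition number is $\frac{1}{\gamma\delta}\frac{L^2}{\mu^2} \approx \frac{L^3}{\mu^3}$.

• If $\gamma\delta = 1$, i.e., $f$ is a simple quadratic function, then
\[ \theta_x \approx 1 - \frac{\mu^2}{L^2} \cdot \frac{1}{2\mu/L + 1} \approx 1 - \frac{\mu^2}{L^2}. \]
This rate does not have acceleration, because the overall condition number is $\frac{1}{\gamma\delta}\frac{L^2}{\mu^2} \approx \frac{L^2}{\mu^2}$.

Algorithm 2 Adaptive Batch Primal-Dual (Ada-BPD)
  input: problem constants $\lambda$, $\gamma$, $\delta$, $L$ and $\hat{\mu} > 0$, initial point $(x^0, y^0)$, and adaptation period $T$.
  Compute $\sigma$, $\tau$, and $\theta$ as in (7) and (8) using $\mu = \hat{\mu}$
  for $t = 0, 1, 2, \ldots$ do
    $y^{t+1} = \mathrm{prox}_{\sigma f^*}(y^t + \sigma A \tilde{x}^t)$
    $x^{t+1} = \mathrm{prox}_{\tau g}(x^t - \tau A^T y^{t+1})$
    $\tilde{x}^{t+1} = x^{t+1} + \theta (x^{t+1} - x^t)$
    if $\mathrm{mod}(t+1, T) == 0$ then
      $(\sigma, \tau, \theta) = \text{BPD-Adapt}\big( \{P^s, D^s\}_{s=t-T}^{t+1} \big)$
    end if
  end for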
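The parameter choice in (7) and (8) can be written as a small helper. This is a sketch under the assumption that the formulas are as reconstructed above; the function name `bpd_parameters` is illustrative and not from the paper.

```python
import math

def bpd_parameters(L, mu, lam, gamma, delta):
    """Step sizes and extrapolation factor following (7) and (8).

    Hypothetical helper; requires lam + delta*mu**2 > 0 and gamma > 0.
    """
    s = lam + delta * mu**2                  # overall strong convexity
    sigma = math.sqrt(s / gamma) / L
    tau = math.sqrt(gamma / s) / L
    theta_x = (1.0 - (delta / (delta + 2.0 * sigma)) * (mu**2 / L**2)) / (1.0 + tau * lam)
    theta_y = 1.0 / (1.0 + sigma * gamma / 2.0)
    return sigma, tau, max(theta_x, theta_y)
```

For example, with $L = 10$, $\mu = 1$, $\lambda = 0$ and $\gamma = \delta = 1$, the helper returns $\theta = \theta_x = 1 - 0.01/1.2$, matching the non-accelerated case $\gamma\delta = 1$ discussed above.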
In summary, the extent of acceleration in the dominating factor $\theta_x$ (which determines $\theta$) depends on the relative size of $\gamma\delta$ and $\mu^2/L^2$, i.e., the relative conditioning between the function $f$ and the matrix $A$. In general, we have full acceleration if $\gamma\delta \le \mu^2/L^2$. The theory predicts that the acceleration degrades as the function $f$ gets better conditioned. However, in our numerical experiments, we often observe acceleration even if $\gamma\delta$ gets closer to 1.

As explained in Chambolle & Pock (2011), Algorithm 1 is equivalent to a preconditioned ADMM. Deng & Yin (2016) characterized conditions for ADMM to obtain linear convergence without assuming both parts of the objective function being strongly convex, but they did not derive a convergence rate for this case.

2.1. Adaptive batch primal-dual algorithms

In practice, it is often very hard to obtain a good estimate of the problem-dependent constants, especially $\mu = \sqrt{\lambda_{\min}(A^T A)}$, in order to apply the algorithmic parameters specified in Theorem 1. Here we explore heuristics that can enable adaptive tuning of such parameters, which often leads to much improved performance in practice.

A key observation is that the convergence rate of the BPD algorithm changes monotonically with the overall strong convexity parameter $\lambda + \delta\mu^2$, regardless of the extent of

acceleration. In other words, the larger $\lambda + \delta\mu^2$ is, the faster the convergence. Therefore, if we can monitor the progress of the convergence and compare it with the predicted convergence rate in Theorem 1, then we can adjust the algorithmic parameters to exploit the fastest possible convergence. More specifically, if the observed convergence is slower than the predicted convergence rate, then we should reduce the estimate of $\mu$; if the observed convergence is better than the predicted rate, then we can try to increase $\mu$ for even faster convergence.

We formalize the above reasoning in an Adaptive BPD (Ada-BPD) algorithm described in Algorithm 2. This algorithm maintains an estimate $\hat{\mu}$ of the true constant $\mu$, and adjusts it every $T$ iterations. We use $P^t$ and $D^t$ to represent the primal and dual objective values $P(x^t)$ and $D(y^t)$, respectively. We give two implementations of the tuning procedure BPD-Adapt:

• Algorithm 3 is a simple heuristic for tuning the estimate $\hat{\mu}$, where the increasing and decreasing factor $\sqrt{2}$ can be changed to other values larger than 1;

• Algorithm 4 is a more robust heuristic. It does not rely on the specific convergence rate $\theta$ established in Theorem 1. Instead, it simply compares the current estimate of the objective reduction rate $\hat{\rho}$ with the previous estimate $\rho$ ($\approx \theta^T$). It also specifies a non-tuning range of changes in $\rho$, given by the interval $[\underline{c}, \overline{c}]$.

One can also devise more sophisticated schemes; e.g., if we estimate that $\delta\mu^2 < \lambda$, then no more tuning is necessary.

The capability of accessing both the primal and dual objective values allows primal-dual algorithms to have a good estimate of the convergence rate, which enables effective tuning heuristics. Automatic tuning of primal-dual algorithms has also been studied by, e.g., Malitsky & Pock (2016) and Goldstein et al. (2013), but with different goals.

Algorithm 3 BPD-Adapt (simple heuristic)
  input: previous estimate $\hat{\mu}$, adaptation period $T$, primal and dual objective values $\{P^s, D^s\}_{s=t-T}^{t}$
  if $P^t - D^t < \theta^T (P^{t-T} - D^{t-T})$ then
    $\hat{\mu} := \sqrt{2}\,\hat{\mu}$
  else
    $\hat{\mu} := \hat{\mu}/\sqrt{2}$
  end if
  Compute $\sigma$, $\tau$, and $\theta$ as in (7) and (8) using $\mu = \hat{\mu}$
  output: new parameters $(\sigma, \tau, \theta)$
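The two tuning rules can be sketched as follows. This is an illustrative reading of the heuristics, not the authors' code: the function names are hypothetical, the factor $\sqrt{2}$ for $\hat{\mu}$ and the doubling and halving of $\Delta = \delta\hat{\mu}^2$ are taken as defaults and can be changed, and the thresholds 0.95 and 1.5 are the values used in the experiments section.

```python
import math

def adapt_simple(mu_hat, theta, T, gap_now, gap_prev, factor=math.sqrt(2.0)):
    """Simple heuristic: compare the observed gap with the predicted theta^T decay.

    gap_now = P^t - D^t, gap_prev = P^{t-T} - D^{t-T} (both assumed positive).
    """
    predicted = theta**T * gap_prev
    # Faster than predicted: try a larger mu_hat; otherwise back off.
    return mu_hat * factor if gap_now < predicted else mu_hat / factor

def adapt_robust(Delta, rho_prev, gap_now, gap_prev, c_lo=0.95, c_hi=1.5):
    """Robust heuristic on Delta = delta * mu_hat^2, using observed rates only."""
    rho_hat = gap_now / gap_prev
    if rho_hat <= c_lo * rho_prev:      # clearly faster than before
        return 2.0 * Delta, rho_hat
    if rho_hat >= c_hi * rho_prev:      # clearly slower than before
        return Delta / 2.0, rho_hat
    return Delta, rho_prev              # inside the non-tuning range
```

After either rule, the step sizes would be recomputed from the new estimate, as in the last step of each heuristic.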
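Finally, we note that Theorem 1 only establishes a convergence rate for the distance to the optimal point and the quantity $\mathcal{L}(x^t, y^\star) - \mathcal{L}(x^\star, y^t)$, which is not quite the duality gap $P(x^t) - D(y^t)$. Nevertheless, the same convergence rate can also be established for the duality gap (see Zhang & Xiao, 2015, Section 2.2), which can be used to better justify the adaptation procedure.

Algorithm 4 BPD-Adapt (robust heuristic)
  input: previous rate estimate $\rho > 0$, $\Delta = \delta\hat{\mu}^2$, period $T$, constants $\underline{c} < 1$ and $\overline{c} > 1$, and $\{P^s, D^s\}_{s=t-T}^{t}$
  Compute new rate estimate $\hat{\rho} = \frac{P^t - D^t}{P^{t-T} - D^{t-T}}$
  if $\hat{\rho} \le \underline{c}\,\rho$ then
    $\Delta := 2\Delta$, $\rho := \hat{\rho}$
  else if $\hat{\rho} \ge \overline{c}\,\rho$ then
    $\Delta := \Delta/2$, $\rho := \hat{\rho}$
  else
    $\Delta := \Delta$
  end if
  $\sigma = \frac{1}{L}\sqrt{\frac{\lambda + \Delta}{\gamma}}$, $\tau = \frac{1}{L}\sqrt{\frac{\gamma}{\lambda + \Delta}}$
  Compute $\theta$ using (8) or set $\theta = 1$
  output: new parameters $(\sigma, \tau, \theta)$

3. Randomized primal-dual algorithm

In this section, we come back to the ERM problem (1), which has a finite sum structure that allows the development of randomized primal-dual algorithms. In particular, we extend the stochastic primal-dual coordinate (SPDC) algorithm (Zhang & Xiao, 2015) to exploit the strong convexity from data in order to achieve a faster convergence rate.

First, we show that, by setting algorithmic parameters appropriately, the original SPDC algorithm may directly benefit from strong convexity from the loss function. We note that the SPDC algorithm is a special case of the Adaptive SPDC (Ada-SPDC) algorithm presented in Algorithm 5, obtained by setting the adaptation period $T = \infty$ (not performing any adaptation). The following theorem is proved in Appendix E.

Theorem 2. Suppose Assumption 1 holds. Let $(x^\star, y^\star)$ be the saddle point of the function $\mathcal{L}$ defined in (3), and $R = \max\{\|a_1\|, \ldots, \|a_n\|\}$. If we set $T = \infty$ in Algorithm 5 (no adaptation) and let

\[ \tau = \frac{1}{4R} \sqrt{\frac{\gamma}{n\lambda + \delta\mu^2}}, \qquad \sigma = \frac{1}{4R} \sqrt{\frac{n\lambda + \delta\mu^2}{\gamma}}, \tag{9} \]

and $\theta = \max\{\theta_x, \theta_y\}$ where

\[ \theta_x = \Big( 1 - \frac{\tau\sigma\delta\mu^2}{2n(\sigma + 4\delta)} \Big) \frac{1}{1 + \tau\lambda}, \qquad \theta_y = \frac{1 + ((n-1)/n)\,\sigma\gamma/2}{1 + \sigma\gamma/2}, \tag{10} \]

then we have

\[ \Big( \frac{1}{2\tau} + \frac{\lambda}{2} \Big) \mathbb{E}\big[ \|x^t - x^\star\|^2 \big] + \frac{\gamma}{4} \mathbb{E}\big[ \|y^t - y^\star\|^2 \big] \le \theta^t C, \]
\[ \mathbb{E}\big[ \mathcal{L}(x^t, y^\star) - \mathcal{L}(x^\star, y^t) \big] \le \theta^t C, \]

where $C = \big( \frac{1}{2\tau} + \frac{\lambda}{2} \big) \|x^0 - x^\star\|^2 + \big( \frac{1}{2\sigma} + \frac{\gamma}{4} \big) \|y^0 - y^\star\|^2$. The expectation $\mathbb{E}[\cdot]$ is taken with respect to the history of random indices drawn at each iteration.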

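Algorithm 5 Adaptive SPDC (Ada-SPDC)
  input: parameters $\sigma$, $\tau$, $\theta > 0$, initial point $(x^0, y^0)$, and adaptation period $T$.
  Set $\tilde{x}^0 = x^0$
  for $t = 0, 1, 2, \ldots$ do
    pick $k \in \{1, \ldots, n\}$ uniformly at random
    for $i \in \{1, \ldots, n\}$ do
      if $i == k$ then
        $y_k^{t+1} = \mathrm{prox}_{\sigma \phi_k^*}(y_k^t + \sigma a_k^T \tilde{x}^t)$
      else
        $y_i^{t+1} = y_i^t$
      end if
    end for
    $x^{t+1} = \mathrm{prox}_{\tau g}\big( x^t - \tau ( u^t + (y_k^{t+1} - y_k^t) a_k ) \big)$
    $u^{t+1} = u^t + \frac{1}{n} (y_k^{t+1} - y_k^t) a_k$
    $\tilde{x}^{t+1} = x^{t+1} + \theta (x^{t+1} - x^t)$
    if $\mathrm{mod}(t+1, T \cdot n) = 0$ then
      $(\tau, \sigma, \theta) = \text{SPDC-Adapt}\big( \{P^{t-sn}, D^{t-sn}\}_{s=0}^{T} \big)$
    end if
  end for

Below we give a detailed discussion of the expected convergence rate established in Theorem 2.

The case of $\delta\mu^2 = 0$ but $\lambda > 0$. In this case, we have $\tau = \frac{1}{4R}\sqrt{\frac{\gamma}{n\lambda}}$ and $\sigma = \frac{1}{4R}\sqrt{\frac{n\lambda}{\gamma}}$, and

\[ \theta_x = \frac{1}{1 + \tau\lambda} = 1 - \frac{1}{1 + 4R\sqrt{n/(\lambda\gamma)}}, \qquad \theta_y = \frac{1 + ((n-1)/n)\,\sigma\gamma/2}{1 + \sigma\gamma/2} = 1 - \frac{1}{n + 8R\sqrt{n/(\lambda\gamma)}}. \]

Hence $\theta = \theta_y$. These recover the parameters and convergence rate of the standard SPDC (Zhang & Xiao, 2015).

The case of $\delta\mu^2 > 0$ but $\lambda = 0$. In this case, we have $\tau = \frac{1}{4R\mu}\sqrt{\frac{\gamma}{\delta}}$ and $\sigma = \frac{\mu}{4R}\sqrt{\frac{\delta}{\gamma}}$, and

\[ \theta_x = 1 - \frac{\tau\sigma\delta\mu^2}{2n(\sigma + 4\delta)} = 1 - \frac{\gamma\delta\mu^2}{32nR^2} \cdot \frac{1}{\sqrt{\gamma\delta}\,\mu/(4R) + 4\gamma\delta}, \]
\[ \theta_y = 1 - \frac{1}{n + 8nR/(\mu\sqrt{\gamma\delta})} \approx 1 - \frac{\sqrt{\gamma\delta}\,\mu}{8nR}. \]

Since the objective is $R^2/\gamma$-smooth and $\delta\mu^2/n$-strongly convex, $\theta_y$ is an accelerated rate if $\frac{\sqrt{\gamma\delta}\,\mu}{8R} \ll 1$ (otherwise $\theta_y \approx 1 - \frac{1}{n}$). For $\theta_x$, we consider different situations:

• If $\mu \ge R$, then we have $\theta_x \approx 1 - \frac{\sqrt{\gamma\delta}\,\mu}{nR}$, which is an accelerated rate. So is $\theta = \max\{\theta_x, \theta_y\}$.

• If $\mu < R$ and $\gamma\delta \approx \mu^2/R^2$, then $\theta_x \approx 1 - \frac{\sqrt{\gamma\delta}\,\mu}{nR}$, which represents an accelerated rate. The iteration complexity of SPDC is $\tilde{O}\big( \frac{nR}{\mu\sqrt{\gamma\delta}} \big)$, which is better than that of SVRG in this case, which is $\tilde{O}\big( \frac{nR^2}{\gamma\delta\mu^2} \big)$.

• If $\mu < R$ and $\gamma\delta \approx \mu/R$, then we get $\theta_x \approx 1 - \frac{\mu^2}{nR^2}$. This is a half-accelerated rate, because in this case SVRG would require $\tilde{O}\big( \frac{nR^3}{\mu^3} \big)$ iterations, while the iteration complexity here is $\tilde{O}\big( \frac{nR^2}{\mu^2} \big)$.

• If $\mu < R$ and $\gamma\delta \approx 1$, meaning the $\phi_i$'s are well conditioned, then we get $\theta_x \approx 1 - \frac{\gamma\delta\mu^2}{nR^2} \approx 1 - \frac{\mu^2}{nR^2}$, which is a non-accelerated rate. The corresponding iteration complexity is the same as SVRG.

3.1. Parameter adaptation for SPDC

The SPDC-Adapt procedure called in Algorithm 5 follows the same logic as the batch adaptation schemes in Algorithms 3 and 4, and we omit the details here. One thing we emphasize is that the adaptation period $T$ is in terms of epochs, or number of passes over the data. In addition, we only compute the primal and dual objective values after each pass or every few passes, because computing them exactly usually requires a full pass over the data.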
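For concreteness, the randomized iteration without adaptation can be sketched for ridge regression, where $\phi_i(z) = \frac{1}{2}(z - b_i)^2$ and $g(x) = \frac{\lambda}{2}\|x\|^2$. Several things here are assumptions made for the example: the function name `spdc_ridge`, the zero initial points, the initialization $u^0 = \frac{1}{n}\sum_i y_i^0 a_i$, and the use of the regularization-only parameters (the case $\delta\mu^2 = 0$, $\lambda > 0$ above) with $\gamma = 1$.

```python
import numpy as np

def spdc_ridge(A, b, lam, iters, seed=0):
    """SPDC-style sketch for P(x) = (1/n) sum_i 0.5*(a_i^T x - b_i)^2 + 0.5*lam*||x||^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    gamma = 1.0                                   # squared loss is 1-smooth
    R = np.max(np.linalg.norm(A, axis=1))
    # Parameters for the case with no strong convexity taken from data.
    tau = np.sqrt(gamma / (n * lam)) / (4.0 * R)
    sigma = np.sqrt(n * lam / gamma) / (4.0 * R)
    theta = 1.0 - 1.0 / (n + 8.0 * R * np.sqrt(n / (lam * gamma)))
    x = np.zeros(d)
    x_tilde = x.copy()
    y = np.zeros(n)
    u = A.T @ y / n                               # assumed: u = (1/n) sum_i y_i a_i
    for _ in range(iters):
        k = rng.integers(n)
        a_k = A[k]
        # prox of sigma*phi_k^*, with phi_k^*(beta) = 0.5*beta^2 + b_k*beta.
        y_k_new = (y[k] + sigma * (a_k @ x_tilde) - sigma * b[k]) / (1.0 + sigma)
        dy = y_k_new - y[k]
        # prox_{tau g}(w) = w / (1 + tau*lam), applied to the corrected gradient step.
        x_new = (x - tau * (u + dy * a_k)) / (1.0 + tau * lam)
        u = u + dy * a_k / n
        y[k] = y_k_new
        x_tilde = x_new + theta * (x_new - x)
        x = x_new
    return x
```

Each iteration touches a single row of $A$, so its cost is $O(d)$; an adaptive variant would additionally evaluate $P$ and $D$ once per pass and re-tune the parameters.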
Another important issue is that, unlike the batch case where the duality gap usually decreases monotonically, the duality gap for randomized algorithms can fluctuate wildly. So instead of using only the two end values $P^{t-Tn} - D^{t-Tn}$ and $P^t - D^t$, we can use more points to estimate the convergence rate through a linear regression. Suppose the primal-dual values at the end of each of the past $T+1$ passes are $\{P(0), D(0)\}, \{P(1), D(1)\}, \ldots, \{P(T), D(T)\}$, and we need to estimate $\rho$ (rate per pass) such that

\[ P(t) - D(t) \approx \rho^t \big( P(0) - D(0) \big), \qquad t = 1, \ldots, T. \]

We can turn it into a linear regression problem after taking logarithms, and obtain the estimate $\hat{\rho}$ through

\[ \log(\hat{\rho}) = \frac{1}{1^2 + 2^2 + \cdots + T^2} \sum_{t=1}^{T} t \log \frac{P(t) - D(t)}{P(0) - D(0)}. \]

The rest of the adaptation procedure can follow the robust scheme in Algorithm 4. In practice, we can compute the primal-dual values more sporadically, say every few passes, and modify the regression accordingly.

4. Dual-free Primal-dual algorithms

Compared with primal algorithms, one major disadvantage of primal-dual algorithms is the requirement of computing the proximal mapping of the dual function $f^*$ or $\phi_i^*$, which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.

Lan & Zhou (2015) developed dual-free variants of primal-dual algorithms that avoid computing the dual proximal mapping. Their main technique is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the dual loss function itself.
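The per-pass rate estimate from Section 3.1 is a least-squares fit through the origin of $\log\frac{P(t) - D(t)}{P(0) - D(0)}$ against $t$. A minimal sketch follows; the function name `estimate_rate_per_pass` is hypothetical, and the input is assumed to be the sequence of positive duality gaps $P(t) - D(t)$ for $t = 0, \ldots, T$.

```python
import numpy as np

def estimate_rate_per_pass(gaps):
    """Estimate rho from gaps[t] ~ rho**t * gaps[0], t = 1..T, by log-linear regression."""
    gaps = np.asarray(gaps, dtype=float)
    T = len(gaps) - 1
    t = np.arange(1, T + 1)
    log_ratio = np.log(gaps[1:] / gaps[0])
    # Least squares through the origin: log(rho) = sum(t * y_t) / sum(t^2).
    log_rho = np.sum(t * log_ratio) / np.sum(t**2)
    return float(np.exp(log_rho))
```

Because later passes carry larger weights $t$, a single noisy early measurement has limited influence on the estimate.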

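We show how to apply this approach to solve the structured ERM problems considered in this paper. The resulting algorithms can also exploit strong convexity from data if the algorithmic parameters are set appropriately or adapted automatically.

Algorithm 6 Dual-Free BPD Algorithm
  input: parameters $\sigma$, $\tau$, $\theta > 0$, initial point $(x^0, y^0)$
  Set $\tilde{x}^0 = x^0$ and $v^0 = \nabla f^*(y^0)$
  for $t = 0, 1, 2, \ldots$ do
    $v^{t+1} = \frac{v^t + \sigma A \tilde{x}^t}{1 + \sigma}$, $\quad y^{t+1} = \nabla f(v^{t+1})$
    $x^{t+1} = \mathrm{prox}_{\tau g}(x^t - \tau A^T y^{t+1})$
    $\tilde{x}^{t+1} = x^{t+1} + \theta (x^{t+1} - x^t)$
  end for

4.1. Dual-free BPD algorithm

First, we consider the batch setting. We replace the dual proximal mapping (computing $y^{t+1}$) in Algorithm 1 with

\[ y^{t+1} = \arg\min_y \Big\{ f^*(y) - y^T A \tilde{x}^t + \frac{1}{\sigma} D(y, y^t) \Big\}, \tag{11} \]

where $D$ is the Bregman divergence of a strictly convex kernel function $h$, defined as

\[ D_h(y, y^t) = h(y) - h(y^t) - \langle \nabla h(y^t), y - y^t \rangle. \]

Algorithm 1 is obtained in the Euclidean setting with $h(y) = \frac{1}{2}\|y\|^2$ and $D(y, y^t) = \frac{1}{2}\|y - y^t\|^2$. While our convergence results would apply for an arbitrary Bregman divergence, we only focus on the case of using $f^*$ itself as the kernel, because this allows us to compute $y^{t+1}$ in (11) very efficiently. The following lemma explains the details (cf. Lan & Zhou, 2015, Lemma 1).

Lemma 1. Let the kernel $h \equiv f^*$ in the Bregman divergence $D$. If we construct a sequence of vectors $\{v^t\}$ such that $v^0 = \nabla f^*(y^0)$ and for all $t \ge 0$,

\[ v^{t+1} = \frac{v^t + \sigma A \tilde{x}^t}{1 + \sigma}, \tag{12} \]

then the solution to problem (11) is $y^{t+1} = \nabla f(v^{t+1})$.

Proof. Suppose $v^t = \nabla f^*(y^t)$ (true for $t = 0$). Then

\[ D(y, y^t) = f^*(y) - f^*(y^t) - (v^t)^T (y - y^t). \]

The solution to (11) can be written as

\[ y^{t+1} = \arg\min_y \Big\{ f^*(y) - y^T A \tilde{x}^t + \frac{1}{\sigma} \big( f^*(y) - (v^t)^T y \big) \Big\} \]
\[ = \arg\min_y \Big\{ \Big( 1 + \frac{1}{\sigma} \Big) f^*(y) - \Big( A \tilde{x}^t + \frac{1}{\sigma} v^t \Big)^T y \Big\} \]
\[ = \arg\max_y \Big\{ \Big( \frac{v^t + \sigma A \tilde{x}^t}{1 + \sigma} \Big)^T y - f^*(y) \Big\} \]
\[ = \arg\max_y \big\{ (v^{t+1})^T y - f^*(y) \big\} = \nabla f(v^{t+1}), \]

where in the last equality we used the property of the conjugate function when $f$ is strongly convex and smooth. Moreover,

\[ v^{t+1} = (\nabla f)^{-1}(y^{t+1}) = \nabla f^*(y^{t+1}), \]

which completes the proof.

According to Lemma 1, we only need to provide initial points such that $v^0 = \nabla f^*(y^0)$ is easy to compute. We do not need to compute $\nabla f^*(y^t)$ directly for any $t > 0$, because it can be updated as $v^t$ in (12). Consequently, we can update $y^t$ in the BPD algorithm using the gradient $\nabla f(v^t)$, without the need for the dual proximal mapping. The resulting dual-free algorithm is given in Algorithm 6.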
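Lemma 1 can be checked numerically in a case where the Bregman proximal step has a closed form. The sketch below assumes the squared loss $f(z) = \frac{1}{2}\|z - b\|^2$, for which $f^*(y) = \frac{1}{2}\|y\|^2 + b^T y$, $\nabla f(v) = v - b$, $\nabla f^*(y) = y + b$, and the Bregman divergence of $f^*$ is $\frac{1}{2}\|y - y^t\|^2$. Both function names are hypothetical.

```python
import numpy as np

def dual_free_step(v, x_tilde, A, sigma, f_grad):
    """One dual-free dual update: average v with A x_tilde, then map through grad f."""
    v_new = (v + sigma * (A @ x_tilde)) / (1.0 + sigma)
    return v_new, f_grad(v_new)

def bregman_prox_squared_loss(y, x_tilde, A, b, sigma):
    """Closed-form minimizer of f*(y') - y'^T A x_tilde + (1/sigma) * 0.5*||y' - y||^2
    for f*(y') = 0.5*||y'||^2 + b^T y' (first-order condition solved for y')."""
    return (y + sigma * (A @ x_tilde - b)) / (1.0 + sigma)
```

Starting from $v^0 = y^0 + b$, the two routines return the same $y^1$, which is what Lemma 1 predicts. For a loss such as the logistic loss, only `f_grad` changes, and no dual proximal mapping is needed.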
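Lan & Zhou (2015) considered a general setting which does not possess the linear predictor structure we focus on in this paper, and assumed that only the regularization $g$ is strongly convex. Our following result shows that dual-free primal-dual algorithms can also exploit strong convexity from data with appropriate algorithmic parameters.

Theorem 3. Suppose Assumption 2 holds and let $(x^\star, y^\star)$ be the unique saddle point of $\mathcal{L}$ defined in (6). If we set the parameters in Algorithm 6 as

\[ \tau = \frac{1}{2L} \sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}, \qquad \sigma = \frac{1}{2L} \sqrt{\gamma(\lambda + \delta\mu^2)}, \tag{13} \]

and $\theta = \max\{\theta_x, \theta_y\}$ where

\[ \theta_x = \Big( 1 - \frac{\tau\sigma\delta\mu^2}{4 + 2\sigma} \Big) \frac{1}{1 + \tau\lambda}, \qquad \theta_y = \frac{1}{1 + \sigma/2}, \tag{14} \]

then we have

\[ \Big( \frac{1}{2\tau} + \frac{\lambda}{2} \Big) \|x^t - x^\star\|^2 + \frac{1}{2} D(y^\star, y^t) \le \theta^t C, \]
\[ \mathcal{L}(x^t, y^\star) - \mathcal{L}(x^\star, y^t) \le \theta^t C, \]

where $C = \big( \frac{1}{2\tau} + \frac{\lambda}{2} \big) \|x^0 - x^\star\|^2 + \big( \frac{1}{\sigma} + \frac{1}{2} \big) D(y^\star, y^0)$.

Theorem 3 is proved in Appendices B and D. Assuming $\gamma(\lambda + \delta\mu^2) \ll L^2$, we have

\[ \theta_x \approx 1 - \frac{\gamma\delta\mu^2}{16L^2} - \frac{\lambda}{2L} \sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}, \qquad \theta_y \approx 1 - \frac{\sqrt{\gamma(\lambda + \delta\mu^2)}}{4L}. \]

Again, we gain insights by considering the special cases:

• If $\delta\mu^2 = 0$ and $\lambda > 0$, then $\theta_y \approx 1 - \frac{\sqrt{\gamma\lambda}}{4L}$ and $\theta_x \approx 1 - \frac{\sqrt{\gamma\lambda}}{2L}$. So $\theta = \max\{\theta_x, \theta_y\}$ is an accelerated rate.

• If $\delta\mu^2 > 0$ and $\lambda = 0$, then $\theta_y \approx 1 - \frac{\sqrt{\gamma\delta\mu^2}}{4L}$ and $\theta_x \approx 1 - \frac{\gamma\delta\mu^2}{16L^2}$. Thus $\theta = \max\{\theta_x, \theta_y\} \approx 1 - \frac{\gamma\delta\mu^2}{16L^2}$ is not accelerated. Notice that this conclusion does not depend on the relative size of $\gamma\delta$ and $\mu^2/L^2$, and this is the major difference from the Euclidean case discussed in Section 2.

If both $\delta\mu^2 > 0$ and $\lambda > 0$, then the extent of acceleration depends on their relative size. If $\lambda$ is on the same order as $\delta\mu^2$ or larger, then an accelerated rate is obtained. If $\lambda$ is much smaller than $\delta\mu^2$, then the theory predicts no acceleration.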

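Algorithm 7 Adaptive Dual-Free SPDC (ADF-SPDC)
  input: parameters $\sigma$, $\tau$, $\theta > 0$, initial point $(x^0, y^0)$, and adaptation period $T$.
  Set $\tilde{x}^0 = x^0$ and $v_i^0 = (\phi_i^*)'(y_i^0)$ for $i = 1, \ldots, n$
  for $t = 0, 1, 2, \ldots$ do
    pick $k \in \{1, \ldots, n\}$ uniformly at random
    for $i \in \{1, \ldots, n\}$ do
      if $i == k$ then
        $v_k^{t+1} = \frac{v_k^t + \sigma a_k^T \tilde{x}^t}{1 + \sigma}$, $\quad y_k^{t+1} = \phi_k'(v_k^{t+1})$
      else
        $v_i^{t+1} = v_i^t$, $\quad y_i^{t+1} = y_i^t$
      end if
    end for
    $x^{t+1} = \mathrm{prox}_{\tau g}\big( x^t - \tau ( u^t + (y_k^{t+1} - y_k^t) a_k ) \big)$
    $u^{t+1} = u^t + \frac{1}{n} (y_k^{t+1} - y_k^t) a_k$
    $\tilde{x}^{t+1} = x^{t+1} + \theta (x^{t+1} - x^t)$
    if $\mathrm{mod}(t+1, T \cdot n) = 0$ then
      $(\tau, \sigma, \theta) = \text{SPDC-Adapt}\big( \{P^{t-sn}, D^{t-sn}\}_{s=0}^{T} \big)$
    end if
  end for

4.2. Dual-free SPDC algorithm

The same approach can be applied to derive a dual-free SPDC algorithm, which is described in Algorithm 7. It also includes a parameter adaptation procedure, so we call it the adaptive dual-free SPDC (ADF-SPDC) algorithm. On related work, Shalev-Shwartz & Zhang (2016) and Shalev-Shwartz (2016) introduced dual-free SDCA.

The following theorem characterizes the choice of algorithmic parameters that can exploit strong convexity from data to achieve linear convergence (proof given in Appendix F).

Theorem 4. Suppose Assumption 1 holds. Let $(x^\star, y^\star)$ be the saddle point of $\mathcal{L}$ defined in (3), and $R = \max\{\|a_1\|, \ldots, \|a_n\|\}$. If we set $T = \infty$ in Algorithm 7 (no adaptation) and let

\[ \sigma = \frac{1}{4R} \sqrt{\gamma(n\lambda + \delta\mu^2)}, \qquad \tau = \frac{1}{4R} \sqrt{\frac{\gamma}{n\lambda + \delta\mu^2}}, \tag{15} \]

and $\theta = \max\{\theta_x, \theta_y\}$ where

\[ \theta_x = \Big( 1 - \frac{\tau\sigma\delta\mu^2}{n(4 + 2\sigma)} \Big) \frac{1}{1 + \tau\lambda}, \qquad \theta_y = \frac{1 + ((n-1)/n)\,\sigma/2}{1 + \sigma/2}, \tag{16} \]

then we have

\[ \Big( \frac{1}{2\tau} + \frac{\lambda}{2} \Big) \mathbb{E}\big[ \|x^t - x^\star\|^2 \big] + \frac{1}{4} \mathbb{E}\big[ D(y^\star, y^t) \big] \le \theta^t C, \]
\[ \mathbb{E}\big[ \mathcal{L}(x^t, y^\star) - \mathcal{L}(x^\star, y^t) \big] \le \theta^t C, \]

where $C = \big( \frac{1}{2\tau} + \frac{\lambda}{2} \big) \|x^0 - x^\star\|^2 + \big( \frac{1}{\sigma} + \frac{1}{2} \big) D(y^\star, y^0)$.

Below we discuss the expected convergence rate established in Theorem 4 in two special cases.

The case of $\delta\mu^2 = 0$ but $\lambda > 0$. In this case, we have $\tau = \frac{1}{4R}\sqrt{\frac{\gamma}{n\lambda}}$ and $\sigma = \frac{1}{4R}\sqrt{n\gamma\lambda}$, and

\[ \theta_x = \frac{1}{1 + \tau\lambda} = 1 - \frac{1}{1 + 4R\sqrt{n/(\lambda\gamma)}}, \qquad \theta_y = \frac{1 + ((n-1)/n)\,\sigma/2}{1 + \sigma/2} = 1 - \frac{1}{n + 8R\sqrt{n/(\lambda\gamma)}}. \]

These recover the convergence rate of the standard SPDC algorithm (Zhang & Xiao, 2015).

The case of $\delta\mu^2 > 0$ but $\lambda = 0$. In this case, we have $\tau = \frac{1}{4R\mu}\sqrt{\frac{\gamma}{\delta}}$, $\sigma = \frac{\mu}{4R}\sqrt{\gamma\delta}$, and

\[ \theta_x = 1 - \frac{\tau\sigma\delta\mu^2}{n(4 + 2\sigma)} = 1 - \frac{\gamma\delta\mu^2}{32nR^2} \cdot \frac{1}{\sqrt{\gamma\delta}\,\mu/(4R) + 2}, \]
\[ \theta_y = \frac{1 + ((n-1)/n)\,\sigma/2}{1 + \sigma/2} = 1 - \frac{1}{n + 8nR/(\mu\sqrt{\gamma\delta})}. \]

We note that the primal function now is $R^2/\gamma$-smooth and $\delta\mu^2/n$-strongly convex. We discuss the following cases:

• If $\sqrt{\gamma\delta}\,\mu > R$, then we have $\theta_x \approx 1 - \frac{\sqrt{\gamma\delta}\,\mu}{8nR}$ and $\theta_y \approx 1 - \frac{1}{n}$. Therefore $\theta = \max\{\theta_x, \theta_y\} \approx 1 - \frac{1}{n}$.

• Otherwise, we have $\theta_x \approx 1 - \frac{\gamma\delta\mu^2}{64nR^2}$ and $\theta_y$ is of the same order. This is not an accelerated rate, and we have the same iteration complexity as SVRG.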
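Finally, we give concrete examples of how to compute the initial points $y^0$ and $v^0$ such that $v_i^0 = (\phi_i^*)'(y_i^0)$.

• For squared loss, $\phi_i(\alpha) = \frac{1}{2}(\alpha - b_i)^2$ and $\phi_i^*(\beta) = \frac{1}{2}\beta^2 + b_i\beta$. So $v_i^0 = (\phi_i^*)'(y_i^0) = y_i^0 + b_i$.

• For logistic regression, we have $b_i \in \{1, -1\}$ and $\phi_i(\alpha) = \log(1 + e^{-b_i\alpha})$. The conjugate function is $\phi_i^*(\beta) = (-b_i\beta)\log(-b_i\beta) + (1 + b_i\beta)\log(1 + b_i\beta)$ if $b_i\beta \in [-1, 0]$ and $+\infty$ otherwise. We can choose $y_i^0 = -\frac{1}{2}b_i$ and $v_i^0 = 0$ such that $v_i^0 = (\phi_i^*)'(y_i^0)$.

For logistic regression, we have $\delta = 0$ over the full domain of $\phi_i$. However, each $\phi_i$ is locally strongly convex in a bounded domain (Bach, 2014): if $z \in [-B, B]$, then we know $\delta = \min_z \phi_i''(z) \ge \exp(-B)/4$. Therefore it is well suited to an adaptation scheme similar to Algorithm 4 that does not require knowledge of either $\delta$ or $\mu$.

5. Preliminary experiments

We present preliminary experiments to demonstrate the effectiveness of our proposed algorithms. First, we consider batch primal-dual algorithms for ridge regression over a synthetic dataset. The data matrix $A$ has sizes $n = 5000$ and $d = 3000$, and its entries are sampled from a multivariate normal distribution with mean zero and covariance matrix $\Sigma_{ij} = 2^{-|i-j|/2}$. We normalize all datasets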

such that $a_i = a_i / (\max_j \|a_j\|)$, to ensure the maximum norm of the data points is 1. We use $\ell_2$-regularization $g(x) = (\lambda/2)\|x\|^2$ with three choices of parameter $\lambda$: $1/n$, $10^{-2}/n$ and $10^{-4}/n$, which represent the strong, medium, and weak levels of regularization, respectively.

Figure 1 shows the performance of four different algorithms: the accelerated gradient algorithm for solving the primal minimization problem (Primal AG) (Nesterov, 2004) using $\lambda$ as the strong convexity parameter, the BPD algorithm (Algorithm 1) that uses $\lambda$ as the strong convexity parameter (setting $\mu = 0$), the optimal BPD algorithm (Opt-BPD) that uses $\mu = \sqrt{\lambda_{\min}(A^T A)}$ explicitly computed from data, and the Ada-BPD algorithm (Algorithm 2) with the robust adaptation heuristic (Algorithm 4) with $T = 10$, $\underline{c} = 0.95$ and $\overline{c} = 1.5$. As expected, the performance of Primal AG is very similar to BPD with the same strong convexity parameter. Opt-BPD fully exploits strong convexity from data and thus has the fastest convergence. The Ada-BPD algorithm can partially exploit strong convexity from data without knowledge of $\mu$.

Figure 1. Comparison of batch primal-dual algorithms for a ridge regression problem with $n = 5000$ and $d = 3000$. Panels: synthetic, $\lambda = 1/n$; synthetic, $\lambda = 10^{-2}/n$; synthetic, $\lambda = 10^{-4}/n$. Vertical axis: primal optimality gap. Curves: Primal AG, BPD, Opt-BPD, Ada-BPD.

Next we compare DF-SPDC (Algorithm 7 without adaptation) and ADF-SPDC (Algorithm 7 with adaptation) against several state-of-the-art randomized algorithms for ERM: SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), Katyusha (Allen-Zhu, 2016) and the standard SPDC method (Zhang & Xiao, 2015). For SVRG and Katyusha (an accelerated variant of SVRG), we choose the variance reduction period as $m = 2n$. The step sizes of all algorithms are set as their original papers suggested. For Ada-SPDC and ADF-SPDC, we use the robust adaptation scheme with $T = 10$, $\underline{c} = 0.95$ and $\overline{c} = 1.5$.

We first compare these randomized algorithms for ridge regression over the same synthetic data described above and the cpuact data from the LibSVM website. The results are shown in Figure 2.
With relatively strong regularization $\lambda = 1/n$, all methods perform similarly, as predicted by theory. For the synthetic dataset with $\lambda = 10^{-2}/n$, the regularization is weaker but still stronger than the hidden strong convexity from data, so the accelerated algorithms (all variants of SPDC and Katyusha) perform better than SVRG and SAGA. With $\lambda = 10^{-4}/n$, the strong convexity from data appears to dominate the regularization. Since the non-accelerated algorithms (SVRG and SAGA) may automatically exploit strong convexity from data, they become faster than the non-adaptive accelerated methods (Katyusha, SPDC and DF-SPDC). The adaptive accelerated method, ADF-SPDC, has the fastest convergence. This shows that our theoretical results (which predict no acceleration in this case) can be further improved.

Finally we compare these randomized algorithms for logistic regression on the rcv1 dataset (from the LibSVM website) and another synthetic dataset with $n = 5000$ and $d = 500$, generated similarly as before but with covariance matrix $\Sigma_{ij} = 2^{-|i-j|/100}$. For the standard SPDC, we solve the dual proximal mapping using a few steps of Newton's method to high precision. The dual-free SPDC algorithms only use gradients of the logistic function. The results are presented in Figure 3. For both datasets, the strong convexity from data is very weak (or none), so the accelerated algorithms perform better.

6. Conclusions

We have shown that primal-dual first-order algorithms are capable of exploiting strong convexity from data, if the algorithmic parameters are chosen appropriately. While these parameters may depend on problem-dependent constants that are unknown, we developed heuristics for adapting the parameters on the fly and obtained improved performance in experiments. It appears that our theoretical characterization of the convergence rates can be further improved, as our experiments often demonstrate significant acceleration in cases where our theory does not predict acceleration.

LibSVM website: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

[Figure 2. Comparison of randomized algorithms for ridge regression problems. Top row: synthetic data with $\lambda = 1/n$, $\lambda = 10^{-2}/n$, $\lambda = 10^{-4}/n$. Bottom row: cpuact data with $\lambda = 1/n$, $\lambda = 10^{-2}/n$, $\lambda = 10^{-4}/n$. Vertical axis: primal optimality gap. Curves: SVRG, SAGA, Katyusha, SPDC, DF-SPDC, ADF-SPDC.]

[Figure 3. Comparison of randomized algorithms for logistic regression problems. Top row: synthetic data with $\lambda = 1/n$, $\lambda = 10^{-2}/n$, $\lambda = 10^{-4}/n$. Bottom row: rcv1 data with $\lambda = 1/n$, $\lambda = 10^{-2}/n$, $\lambda = 10^{-4}/n$. Vertical axis: primal optimality gap. Curves: SVRG, SAGA, Katyusha, SPDC, DF-SPDC, ADF-SPDC.]

References

Allen-Zhu, Zeyuan. Katyusha: Accelerated variance reduction for faster SGD. ArXiv e-print 1603.05953, 2016.

Bach, Francis. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.

Balamurugan, Palaniappan and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems (NIPS) 29, pp. 1416–1424, 2016.

Bertsekas, Dimitri P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. (eds.), Optimization for Machine Learning, chapter 4, pp. 85–120. MIT Press, 2012.

Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

Chambolle, Antonin and Pock, Thomas. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, Series A, 159:253–287, 2016.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Deng, Wei and Yin, Wotao. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.

Fercoq, Olivier and Richtárik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

Goldstein, Tom, Li, Min, Yuan, Xiaoming, Esser, Ernie, and Baraniuk, Richard. Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint arXiv:1305.0546, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Lan, Guanghui and Zhou, Yi. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015a.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015b.

Malitsky, Yura and Pock, Thomas. A first-order primal-dual algorithm with linesearch. arXiv preprint arXiv:1608.08883, 2016.

Nedić, Angelia and Bertsekas, Dimitri P. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138, 2001.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Richtárik, Peter and Takáč, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

Roux, Nicolas L., Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.

Shalev-Shwartz, Shai. SDCA without duality, regularization, and individual convexity. In Proceedings of The 33rd International Conference on Machine Learning, pp. 747–754, 2016.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Zhang, Yuchen and Xiao, Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of The 32nd International Conference on Machine Learning, pp. 353–361, 2015.

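In the following appendices, we provide detailed proofs of the theorems stated in the main paper. In Section A we first prove a basic inequality which is useful throughout the rest of the convergence analysis. Section B contains general analysis of the batch primal-dual algorithm that is common to proving both Theorem 1 and Theorem 3. Sections C, D, E and F give proofs for Theorem 1, Theorem 3, Theorem 2 and Theorem 4, respectively.

A. A basic lemma

Lemma. Let $h$ be a strictly convex function and $D_h$ be its Bregman divergence. Suppose $\psi$ is $\nu$-strongly convex with respect to $D_h$ and $1/\delta$-smooth (with respect to the Euclidean norm), and
$$\hat{y} = \arg\min_{y \in C} \bigl\{ \psi(y) + \eta D_h(y, \bar{y}) \bigr\},$$
where $C$ is a compact convex set that lies within the relative interior of the domains of $h$ and $\psi$ (i.e., both $h$ and $\psi$ are differentiable over $C$). Then for any $y \in C$ and $\rho \in [0, 1]$, we have
$$\psi(y) + \eta D_h(y, \bar{y}) \ge \psi(\hat{y}) + \eta D_h(\hat{y}, \bar{y}) + \bigl(\eta + (1-\rho)\nu\bigr) D_h(y, \hat{y}) + \frac{\rho\delta}{2} \|\nabla\psi(y) - \nabla\psi(\hat{y})\|^2.$$

Proof. The minimizer $\hat{y}$ satisfies the following first-order optimality condition:
$$\langle \nabla\psi(\hat{y}) + \eta \nabla D_h(\hat{y}, \bar{y}),\; y - \hat{y} \rangle \ge 0, \quad \forall y \in C.$$
Here $\nabla D$ denotes the partial gradient of the Bregman divergence with respect to its first argument, i.e., $\nabla D(\hat{y}, \bar{y}) = \nabla h(\hat{y}) - \nabla h(\bar{y})$. So the above optimality condition is the same as
$$\langle \nabla\psi(\hat{y}) + \eta(\nabla h(\hat{y}) - \nabla h(\bar{y})),\; y - \hat{y} \rangle \ge 0, \quad \forall y \in C. \qquad (17)$$
Since $\psi$ is $\nu$-strongly convex with respect to $D_h$ and $1/\delta$-smooth, we have
$$\psi(y) \ge \psi(\hat{y}) + \langle \nabla\psi(\hat{y}), y - \hat{y} \rangle + \nu D_h(y, \hat{y}),$$
$$\psi(y) \ge \psi(\hat{y}) + \langle \nabla\psi(\hat{y}), y - \hat{y} \rangle + \frac{\delta}{2} \|\nabla\psi(y) - \nabla\psi(\hat{y})\|^2.$$
For the second inequality, see, e.g., Theorem 2.1.5 in Nesterov (2004). Multiplying the two inequalities above by $(1-\rho)$ and $\rho$ respectively and adding them together, we have
$$\psi(y) \ge \psi(\hat{y}) + \langle \nabla\psi(\hat{y}), y - \hat{y} \rangle + (1-\rho)\nu D_h(y, \hat{y}) + \frac{\rho\delta}{2} \|\nabla\psi(y) - \nabla\psi(\hat{y})\|^2.$$
The Bregman divergence $D_h$ satisfies the following equality:
$$D_h(y, \bar{y}) = D_h(y, \hat{y}) + D_h(\hat{y}, \bar{y}) + \langle \nabla h(\hat{y}) - \nabla h(\bar{y}),\; y - \hat{y} \rangle.$$
We multiply this equality by $\eta$ and add it to the last inequality to obtain
$$\psi(y) + \eta D_h(y, \bar{y}) \ge \psi(\hat{y}) + \eta D_h(\hat{y}, \bar{y}) + \bigl(\eta + (1-\rho)\nu\bigr) D_h(y, \hat{y}) + \frac{\rho\delta}{2} \|\nabla\psi(y) - \nabla\psi(\hat{y})\|^2 + \langle \nabla\psi(\hat{y}) + \eta(\nabla h(\hat{y}) - \nabla h(\bar{y})),\; y - \hat{y} \rangle.$$
Using the optimality condition (17), the last inner-product term is nonnegative and thus can be dropped, which gives the desired inequality.

B. Common Analysis of Batch Primal-Dual Algorithms

We consider the general primal-dual update rule as: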

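Iteration: $(\hat{x}, \hat{y}) = \mathrm{PD}_{\tau,\sigma}(\bar{x}, \bar{y}, \tilde{x}, \tilde{y})$
$$\hat{x} = \arg\min_{x \in \mathbb{R}^d} \Bigl\{ g(x) + \tilde{y}^T A x + \frac{1}{2\tau} \|x - \bar{x}\|^2 \Bigr\}, \qquad (18)$$
$$\hat{y} = \arg\min_{y \in \mathbb{R}^n} \Bigl\{ f^*(y) - y^T A \tilde{x} + \frac{1}{\sigma} D(y, \bar{y}) \Bigr\}. \qquad (19)$$

Each iteration of Algorithm 1 is equivalent to the following specification of $\mathrm{PD}_{\tau,\sigma}$:
$$\hat{x} = x^{(t+1)}, \quad \bar{x} = x^{(t)}, \quad \tilde{x} = x^{(t)} + \theta(x^{(t)} - x^{(t-1)}), \quad \hat{y} = y^{(t+1)}, \quad \bar{y} = y^{(t)}, \quad \tilde{y} = y^{(t+1)}. \qquad (20)$$

Besides Assumption 1, we also assume that $f^*$ is $\nu$-strongly convex with respect to a kernel function $h$, i.e.,
$$f^*(y') - f^*(y) - \langle \nabla f^*(y), y' - y \rangle \ge \nu D_h(y', y),$$
where $D_h$ is the Bregman divergence defined as
$$D_h(y', y) = h(y') - h(y) - \langle \nabla h(y), y' - y \rangle.$$
We assume that $h$ is $\gamma'$-strongly convex and $1/\delta'$-smooth. Depending on the kernel function $h$, this assumption may impose additional restrictions on $f$. In this paper, we are mostly interested in two special cases: $h(y) = \frac{1}{2}\|y\|^2$ and $h(y) = f^*(y)$ (for the latter we always have $\nu = 1$). From now on, we will omit the subscript $h$ and use $D$ to denote the Bregman divergence.

Under the above assumptions, any solution $(x^\star, y^\star)$ to the saddle-point problem (6) satisfies the optimality conditions:
$$-A^T y^\star \in \partial g(x^\star), \qquad (21)$$
$$A x^\star = \nabla f^*(y^\star). \qquad (22)$$
The optimality conditions for the updates described in equations (18) and (19) are
$$-A^T \tilde{y} + \frac{1}{\tau}(\bar{x} - \hat{x}) \in \partial g(\hat{x}), \qquad (23)$$
$$A\tilde{x} - \frac{1}{\sigma}\bigl(\nabla h(\hat{y}) - \nabla h(\bar{y})\bigr) = \nabla f^*(\hat{y}). \qquad (24)$$

Applying the lemma of Section A to the dual minimization step (19) with $\psi(y) = f^*(y) - y^T A \tilde{x}$, $\eta = 1/\sigma$, $y = y^\star$ and $\rho = 1/2$, we obtain
$$f^*(y^\star) - y^{\star T} A \tilde{x} + \frac{1}{\sigma} D(y^\star, \bar{y}) \ge f^*(\hat{y}) - \hat{y}^T A \tilde{x} + \frac{1}{\sigma} D(\hat{y}, \bar{y}) + \Bigl(\frac{1}{\sigma} + \frac{\nu}{2}\Bigr) D(y^\star, \hat{y}) + \frac{\delta}{4} \|\nabla f^*(y^\star) - \nabla f^*(\hat{y})\|^2. \qquad (25)$$
Similarly, for the primal minimization step (18), we have (setting $\rho = 0$)
$$g(x^\star) + \tilde{y}^T A x^\star + \frac{1}{2\tau}\|x^\star - \bar{x}\|^2 \ge g(\hat{x}) + \tilde{y}^T A \hat{x} + \frac{1}{2\tau}\|\hat{x} - \bar{x}\|^2 + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - \hat{x}\|^2. \qquad (26)$$
Combining the two inequalities above with the definition $\mathcal{L}(x, y) = g(x) + y^T A x - f^*(y)$, we get
$$\begin{aligned}
\mathcal{L}(\hat{x}, y^\star) - \mathcal{L}(x^\star, \hat{y}) &= g(\hat{x}) + y^{\star T} A \hat{x} - f^*(y^\star) - g(x^\star) - \hat{y}^T A x^\star + f^*(\hat{y}) \\
&\le \frac{1}{2\tau}\|x^\star - \bar{x}\|^2 + \frac{1}{\sigma} D(y^\star, \bar{y}) - \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - \hat{x}\|^2 - \Bigl(\frac{1}{\sigma} + \frac{\nu}{2}\Bigr) D(y^\star, \hat{y}) \\
&\quad - \frac{1}{2\tau}\|\hat{x} - \bar{x}\|^2 - \frac{1}{\sigma} D(\hat{y}, \bar{y}) - \frac{\delta}{4}\|\nabla f^*(y^\star) - \nabla f^*(\hat{y})\|^2 \\
&\quad + y^{\star T} A \hat{x} - \hat{y}^T A x^\star + \tilde{y}^T A x^\star - \tilde{y}^T A \hat{x} - y^{\star T} A \tilde{x} + \hat{y}^T A \tilde{x}.
\end{aligned}$$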

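We can simplify the inner product terms as
$$y^{\star T} A \hat{x} - \hat{y}^T A x^\star + \tilde{y}^T A x^\star - \tilde{y}^T A \hat{x} - y^{\star T} A \tilde{x} + \hat{y}^T A \tilde{x} = (\hat{y} - \tilde{y})^T A (\hat{x} - x^\star) - (\hat{y} - y^\star)^T A (\hat{x} - \tilde{x}).$$
Rearranging terms on the two sides of the inequality, we have
$$\begin{aligned}
\frac{1}{2\tau}\|x^\star - \bar{x}\|^2 + \frac{1}{\sigma} D(y^\star, \bar{y}) &\ge \mathcal{L}(\hat{x}, y^\star) - \mathcal{L}(x^\star, \hat{y}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - \hat{x}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{\nu}{2}\Bigr) D(y^\star, \hat{y}) \\
&\quad + \frac{1}{2\tau}\|\hat{x} - \bar{x}\|^2 + \frac{1}{\sigma} D(\hat{y}, \bar{y}) + \frac{\delta}{4}\|\nabla f^*(y^\star) - \nabla f^*(\hat{y})\|^2 \\
&\quad + (\hat{y} - y^\star)^T A (\hat{x} - \tilde{x}) - (\hat{y} - \tilde{y})^T A (\hat{x} - x^\star).
\end{aligned}$$
Applying the substitutions in (20) yields
$$\begin{aligned}
\frac{1}{2\tau}\|x^\star - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^\star, y^{(t)}) &\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{\nu}{2}\Bigr) D(y^\star, y^{(t+1)}) \\
&\quad + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^{(t+1)}, y^{(t)}) + \frac{\delta}{4}\|\nabla f^*(y^\star) - \nabla f^*(y^{(t+1)})\|^2 \\
&\quad + (y^{(t+1)} - y^\star)^T A \bigl(x^{(t+1)} - (x^{(t)} + \theta(x^{(t)} - x^{(t-1)}))\bigr). \qquad (27)
\end{aligned}$$
We can rearrange the inner product term in (27) as
$$\begin{aligned}
(y^{(t+1)} - y^\star)^T A \bigl(x^{(t+1)} - (x^{(t)} + \theta(x^{(t)} - x^{(t-1)}))\bigr) &= (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) - \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) \\
&\quad - \theta (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}).
\end{aligned}$$
Using the optimality conditions (22) and (24), we can also bound $\|\nabla f^*(y^\star) - \nabla f^*(y^{(t+1)})\|^2$:
$$\begin{aligned}
\|\nabla f^*(y^\star) - \nabla f^*(y^{(t+1)})\|^2 &= \Bigl\| A x^\star - A\bigl(x^{(t)} + \theta(x^{(t)} - x^{(t-1)})\bigr) + \frac{1}{\sigma}\bigl(\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\bigr) \Bigr\|^2 \\
&\ge \Bigl(1 - \frac{1}{\alpha}\Bigr) \|A(x^\star - x^{(t)})\|^2 - (\alpha - 1) \Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}\bigl(\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\bigr) \Bigr\|^2,
\end{aligned}$$
where $\alpha > 1$. With the definition $\mu = \sqrt{\lambda_{\min}(A^T A)}$, we also have $\|A(x^\star - x^{(t)})\|^2 \ge \mu^2 \|x^\star - x^{(t)}\|^2$. Combining them with the inequality (27) leads to
$$\begin{aligned}
&\frac{1}{2\tau}\|x^\star - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^\star, y^{(t)}) + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{\nu}{2}\Bigr) D(y^\star, y^{(t+1)}) + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) \\
&\quad + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^{(t+1)}, y^{(t)}) - \theta (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}) \\
&\quad + \Bigl(1 - \frac{1}{\alpha}\Bigr) \frac{\delta\mu^2}{4} \|x^\star - x^{(t)}\|^2 - (\alpha - 1) \frac{\delta}{4} \Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}\bigl(\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\bigr) \Bigr\|^2. \qquad (28)
\end{aligned}$$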
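The lower bound on $\|\nabla f^*(y^\star) - \nabla f^*(y^{(t+1)})\|^2$ used above rests on an elementary inequality that the text does not spell out. A short derivation, with the generic vectors $a$ and $b$ introduced here only for illustration:

```latex
% For any vectors a, b and any \alpha > 0, Young's inequality gives
%   2\langle a, b\rangle \le \tfrac{1}{\alpha}\|a\|^2 + \alpha\|b\|^2 .
\begin{align*}
\|a - b\|^2
  &= \|a\|^2 - 2\langle a, b\rangle + \|b\|^2 \\
  &\ge \|a\|^2 - \tfrac{1}{\alpha}\|a\|^2 - \alpha\|b\|^2 + \|b\|^2 \\
  &= \Bigl(1 - \tfrac{1}{\alpha}\Bigr)\|a\|^2 - (\alpha - 1)\|b\|^2 .
\end{align*}
% Take a = A(x^\star - x^{(t)}) and
%      b = \theta A(x^{(t)} - x^{(t-1)})
%          - \tfrac{1}{\sigma}\bigl(\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\bigr).
```

The inequality holds for every $\alpha > 0$; restricting to $\alpha > 1$ makes both coefficients positive, so the first term contributes a usable multiple of $\|A(x^\star - x^{(t)})\|^2$.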

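C. Proof of Theorem 1

Let the kernel function be $h(y) = \frac{1}{2}\|y\|^2$. In this case, we have $D(y', y) = \frac{1}{2}\|y' - y\|^2$ and $\nabla h(y) = y$. Moreover, $\gamma' = \delta' = 1$ and $\nu = \gamma$. Therefore, the inequality (28) becomes
$$\begin{aligned}
&\Bigl(\frac{1}{2\tau} - \Bigl(1 - \frac{1}{\alpha}\Bigr)\frac{\delta\mu^2}{4}\Bigr) \|x^\star - x^{(t)}\|^2 + \frac{1}{2\sigma}\|y^\star - y^{(t)}\|^2 + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \frac{1}{2}\Bigl(\frac{1}{\sigma} + \frac{\gamma}{2}\Bigr)\|y^\star - y^{(t+1)}\|^2 + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) \\
&\quad + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2 + \underline{\frac{1}{2\sigma}\|y^{(t+1)} - y^{(t)}\|^2 - \theta (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)})} \\
&\quad - (\alpha - 1)\frac{\delta}{4} \Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}(y^{(t+1)} - y^{(t)}) \Bigr\|^2. \qquad (29)
\end{aligned}$$
Next we derive another form of the underlined items above:
$$\begin{aligned}
&\frac{1}{2\sigma}\|y^{(t+1)} - y^{(t)}\|^2 - \theta (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}) \\
&= \frac{\sigma}{2}\Bigl( \frac{1}{\sigma^2}\|y^{(t+1)} - y^{(t)}\|^2 - \frac{2\theta}{\sigma} (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}) \Bigr) \\
&= \frac{\sigma}{2}\Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}(y^{(t+1)} - y^{(t)}) \Bigr\|^2 - \frac{\sigma\theta^2}{2}\|A(x^{(t)} - x^{(t-1)})\|^2 \\
&\ge \frac{\sigma}{2}\Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}(y^{(t+1)} - y^{(t)}) \Bigr\|^2 - \frac{\sigma\theta^2 L^2}{2}\|x^{(t)} - x^{(t-1)}\|^2,
\end{aligned}$$
where in the last inequality we used $\|A\| \le L$ and hence $\|A(x^{(t)} - x^{(t-1)})\|^2 \le L^2 \|x^{(t)} - x^{(t-1)}\|^2$. Combining with inequality (29), we have
$$\begin{aligned}
&\Bigl(\frac{1}{2\tau} - \Bigl(1 - \frac{1}{\alpha}\Bigr)\frac{\delta\mu^2}{4}\Bigr) \|x^{(t)} - x^\star\|^2 + \frac{1}{2\sigma}\|y^{(t)} - y^\star\|^2 + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \frac{\sigma\theta^2 L^2}{2}\|x^{(t)} - x^{(t-1)}\|^2 \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^{(t+1)} - x^\star\|^2 + \frac{1}{2}\Bigl(\frac{1}{\sigma} + \frac{\gamma}{2}\Bigr)\|y^{(t+1)} - y^\star\|^2 + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) \\
&\quad + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2 + \Bigl(\frac{\sigma}{2} - (\alpha - 1)\frac{\delta}{4}\Bigr) \Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}(y^{(t+1)} - y^{(t)}) \Bigr\|^2. \qquad (30)
\end{aligned}$$
We can remove the last term in the above inequality as long as its coefficient is nonnegative, i.e.,
$$\frac{\sigma}{2} - (\alpha - 1)\frac{\delta}{4} \ge 0.$$
In order to maximize $1 - 1/\alpha$, we take the equality and solve for the largest value of $\alpha$ allowed, which results in
$$\alpha = 1 + \frac{2\sigma}{\delta}, \qquad 1 - \frac{1}{\alpha} = \frac{2\sigma}{2\sigma + \delta}.$$
Applying these values in (30) gives
$$\begin{aligned}
&\Bigl(\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4\sigma + 2\delta}\Bigr) \|x^{(t)} - x^\star\|^2 + \frac{1}{2\sigma}\|y^{(t)} - y^\star\|^2 + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \frac{\sigma\theta^2 L^2}{2}\|x^{(t)} - x^{(t-1)}\|^2 \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) \\
&\quad + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^{(t+1)} - x^\star\|^2 + \frac{1}{2}\Bigl(\frac{1}{\sigma} + \frac{\gamma}{2}\Bigr)\|y^{(t+1)} - y^\star\|^2 + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2. \qquad (31)
\end{aligned}$$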

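We use $\Delta^{(t+1)}$ to denote the last row in (31). Equivalently, we define
$$\begin{aligned}
\Delta^{(t)} &= \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \frac{1}{2}\Bigl(\frac{1}{\sigma} + \frac{\gamma}{2}\Bigr)\|y^\star - y^{(t)}\|^2 + (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \frac{1}{2\tau}\|x^{(t)} - x^{(t-1)}\|^2 \\
&= \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \frac{\gamma}{4}\|y^\star - y^{(t)}\|^2 + \frac{1}{2} \begin{bmatrix} x^{(t)} - x^{(t-1)} \\ y^\star - y^{(t)} \end{bmatrix}^T \begin{bmatrix} \frac{1}{\tau} I & -A^T \\ -A & \frac{1}{\sigma} I \end{bmatrix} \begin{bmatrix} x^{(t)} - x^{(t-1)} \\ y^\star - y^{(t)} \end{bmatrix}.
\end{aligned}$$
The quadratic form in the last term is nonnegative if the matrix
$$M = \begin{bmatrix} \frac{1}{\tau} I & -A^T \\ -A & \frac{1}{\sigma} I \end{bmatrix}$$
is positive semidefinite, for which a sufficient condition is $\tau\sigma \le 1/L^2$. Under this condition,
$$\Delta^{(t)} \ge \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \frac{\gamma}{4}\|y^\star - y^{(t)}\|^2 \ge 0. \qquad (32)$$
If we can choose $\tau$ and $\sigma$ so that
$$\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4\sigma + 2\delta} \le \theta \cdot \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr), \qquad \frac{1}{2\sigma} \le \theta \cdot \frac{1}{2}\Bigl(\frac{1}{\sigma} + \frac{\gamma}{2}\Bigr), \qquad \frac{\sigma\theta^2 L^2}{2} \le \theta \cdot \frac{1}{2\tau}, \qquad (33)$$
then, according to (31), we have
$$\Delta^{(t+1)} + \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) \le \theta \Delta^{(t)}.$$
Because $\Delta^{(t)} \ge 0$ and $\mathcal{L}(x^{(t)}, y^\star) - \mathcal{L}(x^\star, y^{(t)}) \ge 0$ for any $t \ge 0$, we have
$$\Delta^{(t+1)} \le \theta \Delta^{(t)},$$
which implies
$$\Delta^{(t)} \le \theta^t \Delta^{(0)} \quad \text{and} \quad \mathcal{L}(x^{(t)}, y^\star) - \mathcal{L}(x^\star, y^{(t)}) \le \theta^t \Delta^{(0)}.$$
Let $\theta_x$ and $\theta_y$ be two contraction factors determined by the first two inequalities in (33), i.e.,
$$\theta_x = \Bigl(\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4\sigma + 2\delta}\Bigr) \Big/ \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr) = \Bigl(1 - \frac{\tau\sigma\delta\mu^2}{2\sigma + \delta}\Bigr) \frac{1}{1 + \tau\lambda},$$
$$\theta_y = \frac{1}{2\sigma} \Big/ \frac{1}{2}\Bigl(\frac{1}{\sigma} + \frac{\gamma}{2}\Bigr) = \frac{1}{1 + \sigma\gamma/2}.$$
Then we can let $\theta = \max\{\theta_x, \theta_y\}$. We note that any $\theta < 1$ would satisfy the last condition in (33) provided that
$$\tau\sigma = \frac{1}{L^2},$$
which also makes the matrix $M$ positive semidefinite and thus ensures the inequality (32). Among all possible pairs $\tau, \sigma$ that satisfy $\tau\sigma = 1/L^2$, we choose
$$\tau = \frac{1}{L}\sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}, \qquad \sigma = \frac{1}{L}\sqrt{\frac{\lambda + \delta\mu^2}{\gamma}}, \qquad (34)$$
which give the desired results of Theorem 1.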
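To make the parameter choices in this proof concrete, here is a minimal NumPy sketch of the batch primal-dual iteration with Euclidean kernel on a small ridge-regression instance. The instance is an assumption made for illustration: $f(z) = \frac{1}{2}\|z - b\|^2$ (so $\gamma = \delta = 1$ and $f^*(y) = \frac{1}{2}\|y\|^2 + b^T y$) and $g(x) = \frac{\lambda}{2}\|x\|^2$; the function name `bpd_ridge` is hypothetical. The step sizes are read off (34) and the contraction factors $\theta_x$, $\theta_y$ defined above.

```python
import numpy as np

def bpd_ridge(A, b, lam, iters=500):
    """Sketch of batch primal-dual updates for
    min_x 0.5*||Ax - b||^2 + 0.5*lam*||x||^2 (assumed test problem)."""
    gamma = delta = 1.0                       # f is 1-smooth and 1-strongly convex
    s = np.linalg.svd(A, compute_uv=False)    # singular values, descending
    L = s[0]                                  # L = ||A||
    # mu^2 = lambda_min(A^T A); it is zero when A has more columns than rows.
    mu2 = s[-1] ** 2 if A.shape[0] >= A.shape[1] else 0.0

    tau = np.sqrt(gamma / (lam + delta * mu2)) / L
    sigma = np.sqrt((lam + delta * mu2) / gamma) / L
    theta_x = (1 - tau * sigma * delta * mu2 / (2 * sigma + delta)) / (1 + tau * lam)
    theta_y = 1 / (1 + sigma * gamma / 2)
    theta = max(theta_x, theta_y)

    n, d = A.shape
    x = np.zeros(d)
    x_prev = x.copy()
    y = np.zeros(n)
    for _ in range(iters):
        x_tilde = x + theta * (x - x_prev)    # extrapolation
        # Dual step: prox of sigma*f^* at y + sigma*A*x_tilde (closed form).
        y = (y + sigma * (A @ x_tilde - b)) / (1 + sigma)
        x_prev = x
        # Primal step: prox of tau*g at x - tau*A^T y (closed form).
        x = (x - tau * (A.T @ y)) / (1 + tau * lam)
    return x
```

On a well-conditioned random instance, the iterates can be compared against the closed-form ridge solution $(A^T A + \lambda I)^{-1} A^T b$. Setting `mu2 = 0.0` in the sketch gives the variant that relies only on $\lambda$.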

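D. Proof of Theorem 3

If we choose $h = f^*$, then

- $h$ is $\gamma$-strongly convex and $1/\delta$-smooth, i.e., $\gamma' = \gamma$ and $\delta' = \delta$;
- $f^*$ is 1-strongly convex with respect to $h$, i.e., $\nu = 1$.

For convenience, we repeat inequality (28) here:
$$\begin{aligned}
&\frac{1}{2\tau}\|x^\star - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^\star, y^{(t)}) + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{\nu}{2}\Bigr) D(y^\star, y^{(t+1)}) + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) \\
&\quad + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^{(t+1)}, y^{(t)}) - \theta (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}) \\
&\quad + \Bigl(1 - \frac{1}{\alpha}\Bigr) \frac{\delta\mu^2}{4} \|x^\star - x^{(t)}\|^2 - (\alpha - 1) \frac{\delta}{4} \Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}\bigl(\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\bigr) \Bigr\|^2. \qquad (35)
\end{aligned}$$
We first bound the Bregman divergence $D(y^{(t+1)}, y^{(t)})$ using the assumption that the kernel $h$ is $\gamma$-strongly convex and $1/\delta$-smooth. Using similar arguments as in the proof of the lemma of Section A, we have for any $\rho \in [0, 1]$,
$$D(y^{(t+1)}, y^{(t)}) = h(y^{(t+1)}) - h(y^{(t)}) - \langle \nabla h(y^{(t)}), y^{(t+1)} - y^{(t)} \rangle \ge (1 - \rho)\frac{\gamma}{2}\|y^{(t+1)} - y^{(t)}\|^2 + \rho\frac{\delta}{2}\|\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\|^2. \qquad (36)$$
For any $\beta > 0$, we can lower bound the inner product term
$$-\theta (y^{(t+1)} - y^{(t)})^T A (x^{(t)} - x^{(t-1)}) \ge -\frac{\beta}{2}\|y^{(t+1)} - y^{(t)}\|^2 - \frac{\theta^2 L^2}{2\beta}\|x^{(t)} - x^{(t-1)}\|^2.$$
In addition, we have
$$\Bigl\| \theta A (x^{(t)} - x^{(t-1)}) - \frac{1}{\sigma}\bigl(\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\bigr) \Bigr\|^2 \le 2\theta^2 L^2 \|x^{(t)} - x^{(t-1)}\|^2 + \frac{2}{\sigma^2}\|\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\|^2.$$
Combining these bounds with (35) and (36) with $\rho = 1/2$, we arrive at
$$\begin{aligned}
&\Bigl(\frac{1}{2\tau} - \Bigl(1 - \frac{1}{\alpha}\Bigr)\frac{\delta\mu^2}{4}\Bigr) \|x^\star - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^\star, y^{(t)}) + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \Bigl(\frac{\theta^2 L^2}{2\beta} + \frac{(\alpha - 1)\delta\theta^2 L^2}{2}\Bigr) \|x^{(t)} - x^{(t-1)}\|^2 \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{1}{2}\Bigr) D(y^\star, y^{(t+1)}) + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) \\
&\quad + \Bigl(\frac{\gamma}{4\sigma} - \frac{\beta}{2}\Bigr) \|y^{(t+1)} - y^{(t)}\|^2 + \Bigl(\frac{\delta}{4\sigma} - \frac{(\alpha - 1)\delta}{2\sigma^2}\Bigr) \|\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\|^2 + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2. \qquad (37)
\end{aligned}$$
We choose $\alpha$ and $\beta$ in (37) to zero out the coefficients of $\|y^{(t+1)} - y^{(t)}\|^2$ and $\|\nabla h(y^{(t+1)}) - \nabla h(y^{(t)})\|^2$:
$$\alpha = 1 + \frac{\sigma}{2}, \qquad \beta = \frac{\gamma}{2\sigma}.$$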
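The choice of $\alpha$ and $\beta$ can be checked by direct substitution. The following worked step assumes the two coefficients in (37) are $\frac{\gamma}{4\sigma} - \frac{\beta}{2}$ and $\frac{\delta}{4\sigma} - \frac{(\alpha-1)\delta}{2\sigma^2}$, as they arise from (36) with $\rho = 1/2$:

```latex
\begin{align*}
\frac{\gamma}{4\sigma} - \frac{\beta}{2} = 0
  &\iff \beta = \frac{\gamma}{2\sigma}, \\
\frac{\delta}{4\sigma} - \frac{(\alpha - 1)\delta}{2\sigma^2} = 0
  &\iff \alpha - 1 = \frac{\sigma}{2}
   \iff \alpha = 1 + \frac{\sigma}{2}.
\end{align*}
% Consequences used in the next step:
%   1 - 1/\alpha = \sigma / (2 + \sigma),
%   \theta^2 L^2 / (2\beta) = \sigma\theta^2 L^2 / \gamma,
%   (\alpha - 1)\,\delta\,\theta^2 L^2 / 2 = \sigma\delta\theta^2 L^2 / 4 .
```

With these values, the coefficient of $\|x^\star - x^{(t)}\|^2$ on the left becomes $\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4(2+\sigma)}$.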

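Then the inequality (37) becomes
$$\begin{aligned}
&\Bigl(\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4(2 + \sigma)}\Bigr) \|x^\star - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^\star, y^{(t)}) + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \Bigl(\frac{\sigma\theta^2 L^2}{\gamma} + \frac{\delta\sigma\theta^2 L^2}{4}\Bigr) \|x^{(t)} - x^{(t-1)}\|^2 \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{1}{2}\Bigr) D(y^\star, y^{(t+1)}) \\
&\quad + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2.
\end{aligned}$$
The coefficient of $\|x^{(t)} - x^{(t-1)}\|^2$ can be bounded as
$$\frac{\sigma\theta^2 L^2}{\gamma} + \frac{\delta\sigma\theta^2 L^2}{4} = \Bigl(\frac{1}{\gamma} + \frac{\delta}{4}\Bigr)\sigma\theta^2 L^2 = \frac{4 + \gamma\delta}{4\gamma}\,\sigma\theta^2 L^2 < \frac{2\sigma\theta^2 L^2}{\gamma},$$
where in the inequality we used $\gamma\delta \le 1$. Therefore we have
$$\begin{aligned}
&\Bigl(\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4(2 + \sigma)}\Bigr) \|x^\star - x^{(t)}\|^2 + \frac{1}{\sigma} D(y^\star, y^{(t)}) + \theta (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \frac{2\sigma\theta^2 L^2}{\gamma} \|x^{(t)} - x^{(t-1)}\|^2 \\
&\ge \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) \\
&\quad + \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t+1)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{1}{2}\Bigr) D(y^\star, y^{(t+1)}) + (y^{(t+1)} - y^\star)^T A (x^{(t+1)} - x^{(t)}) + \frac{1}{2\tau}\|x^{(t+1)} - x^{(t)}\|^2.
\end{aligned}$$
We use $\Delta^{(t+1)}$ to denote the last row of the above inequality. Equivalently, we define
$$\Delta^{(t)} = \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \Bigl(\frac{1}{\sigma} + \frac{1}{2}\Bigr) D(y^\star, y^{(t)}) + (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \frac{1}{2\tau}\|x^{(t)} - x^{(t-1)}\|^2.$$
Since $h$ is $\gamma$-strongly convex, we have $D(y^\star, y^{(t)}) \ge \frac{\gamma}{2}\|y^\star - y^{(t)}\|^2$, and thus
$$\begin{aligned}
\Delta^{(t)} &\ge \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \frac{1}{2} D(y^\star, y^{(t)}) + \frac{\gamma}{2\sigma}\|y^{(t)} - y^\star\|^2 + (y^{(t)} - y^\star)^T A (x^{(t)} - x^{(t-1)}) + \frac{1}{2\tau}\|x^{(t)} - x^{(t-1)}\|^2 \\
&= \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \frac{1}{2} D(y^\star, y^{(t)}) + \frac{1}{2} \begin{bmatrix} x^{(t)} - x^{(t-1)} \\ y^\star - y^{(t)} \end{bmatrix}^T \begin{bmatrix} \frac{1}{\tau} I & -A^T \\ -A & \frac{\gamma}{\sigma} I \end{bmatrix} \begin{bmatrix} x^{(t)} - x^{(t-1)} \\ y^\star - y^{(t)} \end{bmatrix}.
\end{aligned}$$
The quadratic form in the last term is nonnegative if $\tau\sigma \le \gamma/L^2$. Under this condition,
$$\Delta^{(t)} \ge \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr)\|x^\star - x^{(t)}\|^2 + \frac{1}{2} D(y^\star, y^{(t)}) \ge 0. \qquad (38)$$
If we can choose $\tau$ and $\sigma$ so that
$$\frac{1}{2\tau} - \frac{\sigma\delta\mu^2}{4(2 + \sigma)} \le \theta \cdot \frac{1}{2}\Bigl(\frac{1}{\tau} + \lambda\Bigr), \qquad \frac{1}{\sigma} \le \theta \Bigl(\frac{1}{\sigma} + \frac{1}{2}\Bigr), \qquad \frac{2\sigma\theta^2 L^2}{\gamma} \le \theta \cdot \frac{1}{2\tau}, \qquad (39)$$
then we have
$$\Delta^{(t+1)} + \mathcal{L}(x^{(t+1)}, y^\star) - \mathcal{L}(x^\star, y^{(t+1)}) \le \theta \Delta^{(t)}.$$
Because $\Delta^{(t)} \ge 0$ and $\mathcal{L}(x^{(t)}, y^\star) - \mathcal{L}(x^\star, y^{(t)}) \ge 0$ for any $t \ge 0$, we have
$$\Delta^{(t+1)} \le \theta \Delta^{(t)},$$
which implies
$$\Delta^{(t)} \le \theta^t \Delta^{(0)}$$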

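and
$$\mathcal{L}(x^{(t)}, y^\star) - \mathcal{L}(x^\star, y^{(t)}) \le \theta^t \Delta^{(0)}.$$
To satisfy the last condition in (39) and also ensure the inequality (38), it suffices to have
$$\tau\sigma \le \frac{\gamma}{4L^2}.$$
We choose
$$\tau = \frac{1}{2L}\sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}, \qquad \sigma = \frac{1}{2L}\sqrt{\gamma(\lambda + \delta\mu^2)}.$$
With the above choice and assuming $\gamma(\lambda + \delta\mu^2) \ll L^2$, we have
$$\theta_y = \frac{1/\sigma}{1/\sigma + 1/2} = \frac{1}{1 + \sigma/2} = \frac{1}{1 + \sqrt{\gamma(\lambda + \delta\mu^2)}/(4L)} \approx 1 - \frac{\sqrt{\gamma(\lambda + \delta\mu^2)}}{4L}.$$
For the contraction factor over the primal variables, we have
$$\theta_x = \frac{\frac{1}{\tau} - \frac{\sigma\delta\mu^2}{4 + 2\sigma}}{\frac{1}{\tau} + \lambda} = \frac{1 - \frac{\tau\sigma\delta\mu^2}{4 + 2\sigma}}{1 + \tau\lambda} = \frac{1 - \frac{\gamma\delta\mu^2}{4(4 + 2\sigma)L^2}}{1 + \tau\lambda} \approx 1 - \frac{\gamma\delta\mu^2}{16L^2} - \frac{\lambda}{2L}\sqrt{\frac{\gamma}{\lambda + \delta\mu^2}}.$$
This finishes the proof of Theorem 3.

E. Proof of Theorem 2

We consider the SPDC algorithm in the Euclidean case with $h(x) = \frac{1}{2}x^2$. The corresponding batch case analysis is given in Section C. For each $i = 1, \ldots, n$, let $\tilde{y}_i$ be
$$\tilde{y}_i = \arg\min_y \Bigl\{ \phi_i^*(y) + \frac{(y - y_i^{(t)})^2}{2\sigma} - y \langle a_i, \tilde{x}^{(t)} \rangle \Bigr\}.$$
Based on the first-order optimality condition, we have
$$\langle a_i, \tilde{x}^{(t)} \rangle - \frac{\tilde{y}_i - y_i^{(t)}}{\sigma} \in \phi_i^{*\prime}(\tilde{y}_i).$$
Also, since $y_i^\star$ minimizes $\phi_i^*(y) - y \langle a_i, x^\star \rangle$, we have
$$\langle a_i, x^\star \rangle \in \phi_i^{*\prime}(y_i^\star).$$
By the lemma of Section A with $\rho = 1/2$, we have
$$-y_i^\star \langle a_i, \tilde{x}^{(t)} \rangle + \phi_i^*(y_i^\star) + \frac{(y_i^{(t)} - y_i^\star)^2}{2\sigma} \ge \Bigl(\frac{1}{2\sigma} + \frac{\gamma}{4}\Bigr)(\tilde{y}_i - y_i^\star)^2 + \phi_i^*(\tilde{y}_i) - \tilde{y}_i \langle a_i, \tilde{x}^{(t)} \rangle + \frac{(\tilde{y}_i - y_i^{(t)})^2}{2\sigma} + \frac{\delta}{4}\bigl(\phi_i^{*\prime}(\tilde{y}_i) - \phi_i^{*\prime}(y_i^\star)\bigr)^2,$$
and re-arranging terms, we get
$$\frac{(y_i^{(t)} - y_i^\star)^2}{2\sigma} \ge \Bigl(\frac{1}{2\sigma} + \frac{\gamma}{4}\Bigr)(\tilde{y}_i - y_i^\star)^2 + \frac{(\tilde{y}_i - y_i^{(t)})^2}{2\sigma} - (\tilde{y}_i - y_i^\star)\langle a_i, \tilde{x}^{(t)} \rangle + \bigl(\phi_i^*(\tilde{y}_i) - \phi_i^*(y_i^\star)\bigr) + \frac{\delta}{4}\bigl(\phi_i^{*\prime}(\tilde{y}_i) - \phi_i^{*\prime}(y_i^\star)\bigr)^2. \qquad (40)$$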

Notce that Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms E[y t ] = ỹ y t E[y t y ] = ỹ y y t ] = ỹ y t E[y t yt y E[φ y t ] = φ ỹ φ y t. Plug the above relatos to 40 ad dvde both sdes by we have y t y 4 4 ad summg over =... we get 4 where u t = y t y = E[y t y ] E[y t y t ] yt y E[φ y t δ 4 y t a u t = E[yt ] φ y t φ y t a x t x ỹ y t a x t y t ] φ y E[ y t y ] E[ yt y t ] 4 φ k yt k φ k yt k = u t u t u t u x t δ 4 Ax x t ỹ yt = y t a ad u = O the other had scex t mmzes the τ λ-strogly covex objectve gx u t u t u t x x xt τ we ca apply Lemma wth ρ = 0 to obta gx u t u t u t x xt x gx t u t u t u t x t xt x t ad re-arragg terms we get x t x τ τ λ τ τ φ yt φ y y a. = τ λ x t x E[ x t x ] E[ xt x t ] E[gx t gx ] τ E[ u t u t u t x t x ]. 9

Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms Also otce that Lx t y Lx y Lx y Lx y t Lx y Lx y t = φ yt φ y φ k yt k φ k yt k gxt gx = u x t u t x u t u t x. Combg everythg together we have x t x τ 4 τ λ E[ x t x ] y t y Lx y Lx y t E[ y t y ] E[ xt x t ] E[ yt y t ] 4 τ E[Lx t y Lx y Lx y Lx y t ] E[ u t u u t u t x t x t ] δ 4 Ax x t ỹ yt. Next we otce that δ 4 Ax x t E[yt ] y t for someα > ad Ax x t µ x x t ad θaxt x t ỹ yt = δ 4 Ax x t θax t x t ỹ yt δ Ax x t α 4 α δ 4 θaxt x t ỹ yt θ Ax t x t ỹ yt θ L x t x t E[ yt y t ]. We follow the same reasog as the stadard SPDC aalyss u t u u t u t x t x t = yt y T Ax t x t y t y t T Ax t x t θy t y t T Ax t x t ad usg Cauchy-Schwartz equalty we have ad y t y t T Ax t x t yt y t T A /τ yt y t /τr y t y t T Ax t x t yt y t T A /τ yt y t /τr. θyt y T Ax t x t xt x t 8τ xt x t 8τ 0

Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms Thus we get u t u u t u t x t x t yt y T Ax t x t yt y t /4τR xt x t 8τ Puttg everythg together we have τ /αδµ x t x 4 Lx y Lx y t θ τ λ E[ x t x ] 4 θ xt x t. 8τ 4 8τ α θδl E[Lx t y Lx y Lx y Lx y t ] τ E[ x t x t ] 8τ 4R τ α δ E[ y t y t ]. θyt y T Ax t x t y t y θlx t y Lx y x t x t θyt y T Ax t x t E[ y t y ] E[yt y T Ax t x t ] If we choose the parameters as α = τ = 4δ 6R the we kow 4R τ α δ = 4 8 > 0 ad α θδl L 8 R 8 56τ thus 8τ α θδl 3 8τ. I addto we have α = 4δ. Fally we obta τ δµ x t x 4 4δ y t y θlx t y Lx y 4 Lx y Lx y t 3 θ 8τ xt x t θyt y T Ax t x t τ λ E[ x t x ] E[ y t y ] E[yt y T Ax t x t ] 4 E[Lx t y Lx y Lx y Lx y t ] 3 8τ E[ xt x t ]. Now we ca defe θ x ad θ y as the ratos betwee the coeffcets the x-dstace ad y-dstace terms ad let θ = max{θ x θ y } as before. Choosg the step-sze parameters as λδµ gves the desred result. τ = 4R λδµ = 4R

F. Proof of Theorem 4 Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms I ths settg for-th coordate of the dual varablesy we chooseh = φ let ad defe For =... let ỹ be D y y = φ y φ y φ y y y ỹ = argm y Dyy = Based o the frst-order optmalty codto we have Also scey mmzesφ y y a x we have D y y. = { } φ y D yy t y a x t. a x t φ ỹ φ y t φ ỹ. a x φ y. Usg Lemma wthρ = / we obta y a x t φ y D y yt D y ỹ φ ỹ ỹ a x t ad rearragg terms we get D y yt D ỹ y t δ 4 φ ỹ φ y D y ỹ D ỹ y t ỹ y a x t φ ỹ φ y δ 4 φ ỹ φ y. 4 Wth..d. radom samplg at each terato we have the followg relatos: E[y t ] = ỹ y t E[D y t y ] =D ỹ y Dy t y E[D y t y t ] = D ỹ y t E[φ yt ] = φ ỹ φ yt. Pluggg the above relatos to 4 ad dvdg both sdes by we have D y t y D y t y E[D y t E[y t y t ] yt y y t ] a x t E[φ y t ] φ y t φ y t φ y δ a x t x φ ỹ φ y t 4

ad summg over =... we get Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms Dy t y E[Dy t y ] E[Dyt y t ] φ ky t k φ ky t k = u t u t u t u x t δ 4 Ax x t φ whereφ y t s a -dmesoal vector such that the-th coordate s φ y t φ y ỹ φ y t ad u t = = y t a u t = [φ y t ] = φ y t = y t a ad u = y a. = O the other had scex t mmzes a τ λ-strogly covex objectve gx u t u t u t x x xt τ we ca apply Lemma wth ρ = 0 to obta gx u t u t u t x xt x gx t u t u t u t x t xt x t ad rearragg terms we get Notce that x t x τ τ λ τ τ τ λ x t x E[ x t x ] E[ xt x t ] E[gx t gx ] τ E[ u t u t u t x t x ]. Lx t y Lx y Lx y Lx y t Lx y Lx y t = φ yt φ y φ k yt k φ k yt k gxt gx = u x t u t x u t u t x so x t x τ τ λ E[ x t x ] Dy t y Lx y Lx y t E[Dy t y ] E[ xt x t ] E[Dyt y t ] τ E[Lx t y Lx y Lx y Lx y t ] E[ u t u u t u t x t x t ] δ 4 Ax x t φ ỹ φ y t. 3

Next we have δ 4 Ax x t φ for ayα > ad ad θaxt x t φ Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms ỹ φ y t ỹ φ y t = δ 4 Ax x t θax t x t φ δ Ax x t α 4 α δ 4 θaxt x t φ Ax x t µ x x t Followg the same reasog as the stadard SPDC aalyss we have ỹ φ y t ỹ φ y t θ Ax t x t φ ỹ φ y t ] u t u u t u t x t x t = yt y T Ax t x t θ L x t x t E[ φ y t φ y t ]. y t y t T Ax t x t θy t y t T Ax t x t ad usg Cauchy-Schwartz equalty we have ad Thus we get y t y t T Ax t x t yt y t T A /τ yt y t /τr y t y t T Ax t x t yt y t T A /τ yt y t /τr. u t u u t u t x t x t yt y T Ax t x t yt y t /4τR xt x t 8τ θ xt x t. 8τ Also we ca lower boud the termdy t y t usg Lemma wthρ = /: Dy t y t = = = φ yt φ yt φ y t θyt y T Ax t x t xt x t 8τ xt x t 8τ y t θyt y T Ax t x t y t yt y t δ φ y t φ y t = yt y t δ φ y t φ y t. 4

Explotg Strog Covexty from Data wth Prmal-Dual Frst-Order Algorthms Combg everythg above together we have τ /αδµ x t x 4 Lx y Lx y t θ 8τ α θδl τ λ E[ x t x ] Dy t y θlx t y Lx y x t x t θyt y T Ax t x t E[Dy t y ] E[yt y T Ax t x t ] E[Lx t y Lx y Lx y Lx y t ] τ E[ x t x t ] 8τ 4R τ E[ y t y t ] δ α δ E[ φ y t φ y t ]. If we choose the parameters as the we kow ad ad thus I addto we have α θδl α = 4 τ = 6R 4R τ = 4 > 0 δ α δ = δ δ 8 > 0 δl 8 δr δ 8 56τ 56τ 8τ α θδl 3 8τ. α = 4. Fally we obta τ δµ x t x 44 Dy t y θlx t y Lx y Lx y Lx y t 3 θ 8τ xt x t θyt y T Ax t x t τ λ E[ x t x ] E[ y t y ] E[yt y T Ax t x t ] E[Lx t y Lx y Lx y Lx y t ] 3 8τ E[ xt x t ]. As before we ca defe θ x ad θ y as the ratos betwee the coeffcets the x-dstace ad y-dstace terms ad let θ = max{θ x θ y }. The choosg the step-sze parameters as gves the desred result. τ = 4R λδµ = λδµ 4R 5