Tight Complexity Bounds for Optimizing Composite Objectives


Blake Woodworth
Toyota Technological Institute at Chicago
Chicago, IL, 60637
blake@ttic.edu

Nathan Srebro
Toyota Technological Institute at Chicago
Chicago, IL, 60637
nati@ttic.edu

Abstract

We provide tight upper and lower bounds on the complexity of minimizing the average of $m$ convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.

1 Introduction

We consider minimizing the average of $m \geq 2$ convex functions:

$$\min_{x \in \mathcal{X}} \left\{ F(x) := \frac{1}{m} \sum_{i=1}^{m} f_i(x) \right\} \tag{1}$$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is a closed, convex set, and where the algorithm is given access to the following gradient (or subgradient in the case of non-smooth functions) and prox oracle for the components:

$$h_F(x, i, \beta) = \left[ f_i(x),\ \nabla f_i(x),\ \operatorname{prox}_{f_i}(x, \beta) \right] \tag{2}$$

where

$$\operatorname{prox}_{f_i}(x, \beta) = \arg\min_{u \in \mathcal{X}} \left\{ f_i(u) + \frac{\beta}{2} \|x - u\|^2 \right\} \tag{3}$$

A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather than local, information about the function. For example, for a single function ($m = 1$), one prox oracle call (with $\beta = 0$) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each component might be handled on a different machine (e.g. ADMM [7], DANE [18], DISCO [20]) or for functions that can be decomposed into several "easy" parts (e.g. PRISMA [13]).
But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required even for the average of two functions ($m = 2$).

The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over $m$ samples. Recently, stochastic methods such as SDCA [16], SAG [14], SVRG [8], and other variants, have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent. As methods with improved complexity, such as accelerated SDCA [17], accelerated SVRG, and KATYUSHA [3], have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this setting, but as we survey below, these have not been satisfactory so far.

In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective $F$ only through the oracle $h_F$ and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find $\epsilon$-suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms.

| | $L$-Lipschitz, convex, $\lVert x \rVert \leq B$ | $L$-Lipschitz, $\lambda$-strongly convex | $\gamma$-smooth, convex, $\lVert x \rVert \leq B$ | $\gamma$-smooth, $\lambda$-strongly convex |
|---|---|---|---|---|
| Deterministic, upper | $\frac{mLB}{\epsilon}$ (Section 3) | $\frac{mL}{\sqrt{\lambda\epsilon}}$ (Section 3) | $m\sqrt{\frac{\gamma B^2}{\epsilon}}$ (AGD) | $m\sqrt{\frac{\gamma}{\lambda}} \log\frac{\epsilon_0}{\epsilon}$ (AGD) |
| Deterministic, lower | $\frac{mLB}{\epsilon}$ (Section 4) | $\frac{mL}{\sqrt{\lambda\epsilon}}$ (Section 4) | $m\sqrt{\frac{\gamma B^2}{\epsilon}}$ (Section 4) | $m\sqrt{\frac{\gamma}{\lambda}} \log\frac{\epsilon_0}{\epsilon}$ (Section 4) |
| Randomized, upper | $\frac{L^2B^2}{\epsilon^2} \wedge \left( m\log\frac{1}{\epsilon} + \frac{\sqrt{m}LB}{\epsilon} \right)$ (SGD, A-SVRG) | $\frac{L^2}{\lambda\epsilon} \wedge \left( m\log\frac{1}{\epsilon} + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}} \right)$ (SGD, A-SVRG) | $m\log\frac{\epsilon_0}{\epsilon} + \sqrt{\frac{m\gamma B^2}{\epsilon}}$ (A-SVRG) | $\left( m + \sqrt{\frac{m\gamma}{\lambda}} \right) \log\frac{\epsilon_0}{\epsilon}$ (A-SVRG) |
| Randomized, lower | $\frac{L^2B^2}{\epsilon^2} \wedge \left( m + \frac{\sqrt{m}LB}{\epsilon} \right)$ (Section 5) | $\frac{L^2}{\lambda\epsilon} \wedge \left( m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}} \right)$ (Section 5) | $m + \sqrt{\frac{m\gamma B^2}{\epsilon}}$ (Section 5) | $m + \sqrt{\frac{m\gamma}{\lambda}} \log\frac{\epsilon_0}{\epsilon}$ (Section 5) |

Table 1: Upper and lower bounds on the number of grad-and-prox oracle accesses needed to find $\epsilon$-suboptimal solutions for each function class. These are exact up to constant factors except for the lower bounds for smooth and strongly convex functions, which hide extra $\log \lambda/\gamma$ and $\log \sqrt{m\lambda/\gamma}$ factors for deterministic and randomized algorithms. Here, $\epsilon_0$ is the suboptimality of the point 0.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Our upper and lower bounds are summarized in Table 1. As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes.
In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.

On the power of gradient vs prox oracles. For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on $\epsilon$ from $1/\epsilon^2$ to $1/\epsilon$, or from $1/(\lambda\epsilon)$ to $1/\sqrt{\lambda\epsilon}$ for $\lambda$-strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access using accelerated gradient descent (AGD) or accelerated SVRG. Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.

On the power of randomization. We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components must be linear in $m$ for any deterministic algorithm, but can be reduced to $\sqrt{m}$ (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm, not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to flip coins in deciding what operations and queries to perform, but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).

Prior Lower Bounds. Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou [1] presented a lower bound of $\Omega\left( m + \sqrt{\frac{m\gamma}{\lambda}} \log\frac{1}{\epsilon} \right)$. However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.). We not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose). Improving upon this, Lan [9] shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, but the algorithm must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG nor perhaps other realistic attempts at improving over these. Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and potential variants are not covered by such lower bounds. We prove a similar lower bound to Lan's, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems which were not considered by these previous attempts.

Another recent observation [15] was that with access only to random component subgradients without knowing the component's identity, an algorithm must make $\Omega(m^2)$ queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.

Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each $f_i$ is held by a different machine and when iterates lie in the span of certain permitted calculations [5]. Our construction for $m = 2$ is similar to theirs (though in a different setting), but their analysis considers neither scaling with $m$ (which has a different role in their setting) nor randomization.

Notation and Definitions. We use $\|\cdot\|$ to denote the standard Euclidean norm on $\mathbb{R}^d$.
We say that a function $f$ is $L$-Lipschitz continuous on $\mathcal{X}$ if $\forall x, y \in \mathcal{X}$, $|f(x) - f(y)| \leq L\|x - y\|$; $\gamma$-smooth on $\mathcal{X}$ if it is differentiable and its gradient is $\gamma$-Lipschitz on $\mathcal{X}$; and $\lambda$-strongly convex on $\mathcal{X}$ if $\forall x, y \in \mathcal{X}$, $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\lambda}{2}\|x - y\|^2$. We consider optimizing (1) under four combinations of assumptions: each component $f_i$ is either $L$-Lipschitz or $\gamma$-smooth, and either $F(x)$ is $\lambda$-strongly convex or its domain is bounded, $\mathcal{X} \subseteq \{x : \|x\| \leq B\}$.

2 Optimizing Smooth Sums

We briefly review the best known methods for optimizing (1) when the components are $\gamma$-smooth, yielding the upper bounds on the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle.

We can obtain exact gradients of $F(x)$ by computing all $m$ component gradients $\nabla f_i(x)$. Running accelerated gradient descent (AGD) [12] on $F(x)$ using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).

SAG [14], SVRG [8] and related methods use randomization to sample components, but also leverage the finite nature of the objective to control the variance of the gradient estimator used. Accelerating these methods using the Catalyst framework [10] ensures that for $\lambda$-strongly convex objectives we have $\mathbb{E}\left[F(x^{(k)}) - F(x^*)\right] < \epsilon$ after $k = O\left( \left( m + \sqrt{\frac{m\gamma}{\lambda}} \right) \log^2\frac{\epsilon_0}{\epsilon} \right)$ iterations, where $F(0) - F(x^*) = \epsilon_0$. KATYUSHA [3] is a more direct approach to accelerating SVRG which avoids extraneous log-factors, yielding the complexity $k = O\left( \left( m + \sqrt{\frac{m\gamma}{\lambda}} \right) \log\frac{\epsilon_0}{\epsilon} \right)$ indicated in Table 1.

When $F$ is not strongly convex, adding a regularizer to the objective and instead optimizing $F_\lambda(x) = F(x) + \frac{\lambda}{2}\|x\|^2$ with $\lambda = \epsilon/B^2$ results in an oracle complexity of $O\left( \left( m + \sqrt{\frac{m\gamma B^2}{\epsilon}} \right) \log\frac{\epsilon_0}{\epsilon} \right)$. The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan [4], which involves optimizing $F_\lambda(x)$ for progressively smaller values of $\lambda$, yielding the upper bound in the table.

KATYUSHA and Catalyst-accelerated SAG or SVRG use only gradients of the components.
Accelerated SDCA [17] achieves a similar complexity using gradient and prox oracle access.

3 Leveraging Prox Oracles for Lipschitz Sums

In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is $L$-Lipschitz. This will be done by using the prox oracle to "smooth" each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply KATYUSHA [3] and accelerated SDCA [17] to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.

The key is using a prox oracle to obtain gradients of the $\beta$-Moreau envelope of a non-smooth function, $f$, defined as:

$$f^{(\beta)}(x) = \inf_{u \in \mathcal{X}} f(u) + \frac{\beta}{2}\|x - u\|^2 \tag{4}$$

Lemma 1 ([13, Lemma 2.2], [6, Proposition 12.29], following [11]). Let $f$ be convex and $L$-Lipschitz continuous. For any $\beta > 0$,
1. $f^{(\beta)}$ is $\beta$-smooth
2. $\nabla(f^{(\beta)})(x) = \beta\left(x - \operatorname{prox}_f(x, \beta)\right)$
3. $f^{(\beta)}(x) \leq f(x) \leq f^{(\beta)}(x) + \frac{L^2}{2\beta}$

Consequently, we can consider the smoothed problem

$$\min_{x \in \mathcal{X}} \left\{ \tilde{F}^{(\beta)}(x) := \frac{1}{m} \sum_{i=1}^{m} f_i^{(\beta)}(x) \right\}. \tag{5}$$

While $\tilde{F}^{(\beta)}$ is not, in general, the $\beta$-Moreau envelope of $F$, it is $\beta$-smooth, we can calculate the gradient of its components using the oracle $h_F$, and $\tilde{F}^{(\beta)}(x) \leq F(x) \leq \tilde{F}^{(\beta)}(x) + \frac{L^2}{2\beta}$. Thus, to obtain an $\epsilon$-suboptimal solution to (1) using $h_F$, we set $\beta = L^2/\epsilon$ and apply any algorithm which can optimize (5) using gradients of the $L^2/\epsilon$-smooth components, to within $\epsilon/2$ accuracy. With the rates presented in Section 2, using AGD on (5) yields a complexity of $O\left(\frac{mLB}{\epsilon}\right)$ in the deterministic setting. When the functions are $\lambda$-strongly convex, smoothing with a fixed $\beta$ results in a spurious log-factor. To avoid this, we again apply the reduction of Allen-Zhu and Hazan [4], this time optimizing $\tilde{F}^{(\beta)}$ for increasingly large values of $\beta$. This leads to the upper bound of $O\left(\frac{mL}{\sqrt{\lambda\epsilon}}\right)$ when used with AGD (see Appendix A for details).

Similarly, we can apply an accelerated randomized algorithm (such as KATYUSHA) to the smooth problem $\tilde{F}^{(\beta)}$ to obtain complexities of $O\left( m\log\frac{\epsilon_0}{\epsilon} + \frac{\sqrt{m}LB}{\epsilon} \right)$ and $O\left( m\log\frac{\epsilon_0}{\epsilon} + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}} \right)$; this matches the presentation of Allen-Zhu [3] and is similar to that of Shalev-Shwartz and Zhang [17].

Finally, if $m > L^2B^2/\epsilon^2$ or $m > L^2/(\lambda\epsilon)$, stochastic gradient descent is a better randomized alternative, yielding complexities of $O(L^2B^2/\epsilon^2)$ or $O(L^2/(\lambda\epsilon))$.
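The smoothing recipe of this section can be illustrated with a minimal one-dimensional sketch. Everything concrete here is an assumption chosen for illustration and not taken from the paper: the components are $f_i(x) = L|x - a_i|$, the domain is taken to be all of $\mathbb{R}$ (so the prox has a closed form, soft-thresholding), the function names are hypothetical, and the AGD variant with momentum $k/(k+3)$ is one standard choice. The sketch sets $\beta = L^2/\epsilon$, obtains gradients of the Moreau envelopes from prox calls via item 2 of Lemma 1, and runs AGD on the smoothed average.

```python
import math


def prox_abs(x, beta, a, L=1.0):
    """Prox of f(u) = L*|u - a| at x: argmin_u L|u - a| + (beta/2)(x - u)^2.

    Closed form (soft-thresholding around a); assumes an unconstrained domain.
    """
    d = x - a
    shrunk = math.copysign(max(abs(d) - L / beta, 0.0), d)
    return a + shrunk


def envelope(x, beta, a, L=1.0):
    """Value of the beta-Moreau envelope of f(u) = L*|u - a| at x."""
    p = prox_abs(x, beta, a, L)
    return L * abs(p - a) + 0.5 * beta * (x - p) ** 2


def envelope_grad(x, beta, a, L=1.0):
    """Gradient of the envelope, computed from the prox as beta * (x - prox)."""
    return beta * (x - prox_abs(x, beta, a, L))


def F(x, centers, L=1.0):
    """Original non-smooth objective: average of L*|x - a_i|."""
    return sum(L * abs(x - a) for a in centers) / len(centers)


def smoothed_agd(centers, eps, L=1.0, iters=500, x0=0.0):
    """Accelerated gradient descent on the smoothed average with beta = L^2/eps."""
    beta = L * L / eps
    x_prev = x = y = x0
    for k in range(iters):
        # One "round": a prox call to every component, then average the gradients.
        g = sum(envelope_grad(y, beta, a, L) for a in centers) / len(centers)
        x_prev, x = x, y - g / beta          # gradient step with step size 1/beta
        y = x + (k / (k + 3)) * (x - x_prev)  # momentum extrapolation
    return x


if __name__ == "__main__":
    centers = [-1.0, -0.2, 0.3, 0.9, 0.5]   # the minimizer of F is the median, 0.3
    x_hat = smoothed_agd(centers, eps=0.01)
    print(x_hat, F(x_hat, centers) - F(0.3, centers))
```

Each AGD iteration in this sketch makes one prox call per component, which is where the factor $m$ in the deterministic bound comes from.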
4 Lower Bounds for Deterministic Algorithms

We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. What we would like to show is that for any deterministic optimization algorithm we can construct a "hard" function for which the algorithm cannot find an $\epsilon$-suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating the (deterministic) behavior of the algorithm. This can be viewed as a game, where an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet $(x, i, \beta)$ and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function $F$ that the algorithm is far from optimizing. The "hard" function is then gradually defined in terms of the behavior of the optimization algorithm.

To help us formulate our constructions, we define a "round" of queries as a series of queries in which $\lceil m/2 \rceil$ distinct functions $f_i$ are queried. The first round begins with the first query and continues until exactly $\lceil m/2 \rceil$ unique functions have been queried. The second round begins with the next query, and continues until exactly $\lceil m/2 \rceil$ more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm's querying strategy.
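The round bookkeeping can be made concrete with a short sketch. The function name `count_completed_rounds` and the representation of a run as a list of queried component indices are assumptions made for illustration; the paper only defines rounds as an analysis device.

```python
import math


def count_completed_rounds(query_indices, m):
    """Count completed rounds in a sequence of queried component indices.

    A round ends as soon as ceil(m/2) distinct components have been queried
    since the previous round ended; the next query starts a new round.
    """
    needed = math.ceil(m / 2)
    seen = set()
    rounds = 0
    for i in query_indices:
        seen.add(i)
        if len(seen) == needed:
            rounds += 1
            seen = set()  # the next query begins a fresh round
    return rounds


if __name__ == "__main__":
    # m = 4, so a round needs 2 distinct components.
    print(count_completed_rounds([0, 0, 0, 1, 2, 2, 3, 1], m=4))
    # Repeatedly querying one component never completes a round.
    print(count_completed_rounds([0] * 10, m=4))
```

Since a round needs $\lceil m/2 \rceil$ distinct components, any run that completes $k$ rounds has made at least $k \lceil m/2 \rceil$ queries, regardless of the querying strategy.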

4.1 Non-Smooth Components

We begin by presenting a lower bound for deterministic optimization of (1) when each component $f_i$ is convex and $L$-Lipschitz continuous, but is not necessarily strongly convex, on the domain $\mathcal{X} = \{x : \|x\| \leq B\}$. Without loss of generality, we can consider $L = B = 1$. We will construct functions of the following form:

$$f_i(x) = \frac{1}{\sqrt{2}} \left| b - \langle x, v_0 \rangle \right| + \frac{1}{2\sqrt{k}} \sum_{r=1}^{k} \delta_{i,r} \left| \langle x, v_{r-1} \rangle - \langle x, v_r \rangle \right| \tag{6}$$

where $k = \lfloor \frac{1}{12\epsilon} \rfloor$, $b = \frac{1}{\sqrt{k+1}}$, and $\{v_r\}$ is an orthonormal set of vectors in $\mathbb{R}^d$ chosen according to the behavior of the algorithm such that $v_r$ is orthogonal to all points at which the algorithm queries $h_F$ before round $r$, and where $\delta_{i,r}$ are indicators chosen so that $\delta_{i,r} = 1$ if the algorithm does not query component $i$ in round $r$ (and zero otherwise). To see how this is possible, consider the following truncations of (6):

$$f_i^t(x) = \frac{1}{\sqrt{2}} \left| b - \langle x, v_0 \rangle \right| + \frac{1}{2\sqrt{k}} \sum_{r=1}^{t-1} \delta_{i,r} \left| \langle x, v_{r-1} \rangle - \langle x, v_r \rangle \right| \tag{7}$$

During each round $t$, the adversary answers queries according to $f_i^t$, which depends only on $v_r, \delta_{i,r}$ for $r < t$, i.e. from previous rounds. When the round is completed, $\delta_{i,t}$ is determined and $v_t$ is chosen to be orthogonal to the vectors $\{v_0, \ldots, v_{t-1}\}$ as well as every point queried by the algorithm so far, thus defining $f_i^{t+1}$ for the next round. In Appendix B.1 we prove that these responses based on $f_i^t$ are consistent with $f_i$.

The algorithm can only learn $v_r$ after it completes round $r$; until then every iterate is orthogonal to it by construction. The average of these functions reaches its minimum of $F(x^*) = 0$ at $x^* = b\sum_{r=0}^{k} v_r$, so we can view optimizing these functions as the task of discovering the vectors $v_r$; even if only $v_k$ is missing, a suboptimality better than $b/(6\sqrt{k}) > \epsilon$ cannot be achieved. Therefore, the deterministic algorithm must complete at least $k$ rounds of optimization, each comprising at least $\lceil m/2 \rceil$ queries to $h_F$, in order to optimize $F$.
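A small numerical sketch can make the shape of construction (6) concrete. It is only an illustration under simplifying assumptions that are not part of the paper's argument: $v_r$ is taken to be the $r$-th standard basis vector and the indicators $\delta_{i,r}$ are fixed in advance, whereas in the actual proof both are chosen adaptively by the adversary. The helper name `make_hard_instance` is hypothetical.

```python
import math


def make_hard_instance(m, k, unqueried):
    """Build the components of (6) with L = B = 1 and v_r = e_r (an assumption).

    unqueried[r - 1] is the set of component indices i with delta_{i,r} = 1.
    Returns (b, f, F) where f(i, x) is component i and F(x) is the average.
    """
    b = 1.0 / math.sqrt(k + 1)

    def f(i, x):
        # With standard basis vectors, <x, v_r> is simply x[r].
        total = abs(b - x[0]) / math.sqrt(2)
        for r in range(1, k + 1):
            if i in unqueried[r - 1]:
                total += abs(x[r - 1] - x[r]) / (2 * math.sqrt(k))
        return total

    def F(x):
        return sum(f(i, x) for i in range(m)) / m

    return b, f, F


if __name__ == "__main__":
    eps, m = 0.01, 4
    k = math.floor(1 / (12 * eps))  # k = 8 for eps = 0.01
    # Fixed stand-in for the adversary's choice: half the components per round.
    unqueried = [{0, 1} if r % 2 else {2, 3} for r in range(1, k + 1)]
    b, f, F = make_hard_instance(m, k, unqueried)

    x_star = [b] * (k + 1)          # x* = b * sum_r v_r
    x_missing = [b] * k + [0.0]     # same point with the v_k coordinate unknown
    print(F(x_star), math.sqrt(sum(c * c for c in x_star)), F(x_missing))
```

In this sketch $x^*$ has unit norm and objective value zero, while the candidate that is missing only the $v_k$ coordinate already has suboptimality above $\epsilon$.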
The key to this construction is that even though each term $\left| \langle x, v_{r-1} \rangle - \langle x, v_r \rangle \right|$ appears in $m/2$ components, and hence has a strong effect on the average $F(x)$, we can force a deterministic algorithm to make $\Omega(m)$ queries during each round before it finds the next relevant term. We obtain (for complete proof see Appendix B.1):

Theorem 1. For any $L, B > 0$, any $0 < \epsilon < \frac{LB}{12}$, any $m \geq 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a dimension $d = O\left(\frac{mLB}{\epsilon}\right)$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \leq B\}$, which are convex and $L$-Lipschitz continuous, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\left(\frac{mLB}{\epsilon}\right)$ queries to $h_F$.

Furthermore, we can always reduce optimizing a function over $\|x\| \leq B$ to optimizing a strongly convex function by adding the regularizer $\epsilon\|x\|^2/(2B^2)$ to each component, implying (see complete proof in Appendix B.2):

Theorem 2. For any $L, \lambda > 0$, any $0 < \epsilon < \frac{L^2}{288\lambda}$, any $m \geq 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a dimension $d = O\left(\frac{mL}{\sqrt{\lambda\epsilon}}\right)$, and $m$ functions $f_i$ defined over $\mathcal{X} \subseteq \mathbb{R}^d$, which are $L$-Lipschitz continuous and $\lambda$-strongly convex, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\left(\frac{mL}{\sqrt{\lambda\epsilon}}\right)$ queries to $h_F$.

4.2 Smooth Components

When the components $f_i$ are required to be smooth, the lower bound construction is similar to (6), except it is based on squared differences instead of absolute differences. We consider the functions:

$$f_i(x) = \frac{1}{8} \left( \delta_{i,1} \left( \langle x, v_0 \rangle^2 - 2a \langle x, v_0 \rangle \right) + \delta_{i,k} \langle x, v_k \rangle^2 + \sum_{r=1}^{k} \delta_{i,r} \left( \langle x, v_{r-1} \rangle - \langle x, v_r \rangle \right)^2 \right) \tag{8}$$

where $\delta_{i,r}$ and $v_r$ are as before. Again, we can answer queries at round $t$ based only on $\delta_{i,r}, v_r$ for $r < t$. This construction yields the following lower bounds (full details in Appendix B.3):

Theorem 3. For any $\gamma, B, \epsilon > 0$, any $m \geq 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d = O\left(m\sqrt{\gamma B^2/\epsilon}\right)$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \leq B\}$, which are convex and $\gamma$-smooth, such that in order to find a point $\hat{x} \in \mathbb{R}^d$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\left(m\sqrt{\gamma B^2/\epsilon}\right)$ queries to $h_F$.

In the strongly convex case, we use a very similar construction, adding the term $\lambda\|x\|^2/2$, which gives the following bound (see Appendix B.4):

Theorem 4. For any $\gamma, \lambda > 0$ such that $\frac{\gamma}{\lambda} > 73$, any $\epsilon > 0$, any $\epsilon_0 > \frac{3\epsilon\gamma}{\lambda}$, any $m \geq 2$, and any deterministic algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d = O\left(m\sqrt{\frac{\gamma}{\lambda}} \log\frac{\lambda\epsilon_0}{\gamma\epsilon}\right)$, and $m$ functions $f_i$ defined over $\mathcal{X} \subseteq \mathbb{R}^d$, which are $\gamma$-smooth and $\lambda$-strongly convex and where $F(0) - F(x^*) = \epsilon_0$, such that in order to find a point $\hat{x}$ for which $F(\hat{x}) - F(x^*) < \epsilon$, $A$ must make $\Omega\left(m\sqrt{\frac{\gamma}{\lambda}} \log\frac{\lambda\epsilon_0}{\gamma\epsilon}\right)$ queries to $h_F$.

5 Lower Bounds for Randomized Algorithms

We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set $v_r$ and $\delta_{i,r}$ based on the predictable behavior of the algorithm. This is impossible for randomized algorithms: we must choose the "hard" function before we know the random choices the algorithm will make, so the function must be "hard" more generally than before.

Previously, we chose vectors $v_r$ orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors $v_r$ randomly in a high dimensional space, they will be nearly orthogonal to queries with high probability. Slightly modifying the absolute or squared difference from before makes near orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.

More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of $\delta_{i,r}$. In the deterministic setting, if a term revealing a new $v_r$ appeared in half of the components, we could ensure that the algorithm must make $m/2$ queries to find it.
However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on $m$ in the lower bound! Alternatively, if only one component included the term, a randomized algorithm would indeed need $\Omega(m)$ queries to find it, but that term's effect on suboptimality of $F$ would be scaled down by $m$, again eliminating the dependence on $m$.

To establish a $\Omega(\sqrt{m})$ lower bound for randomized algorithms we must take a new approach. We define $\lfloor m/2 \rfloor$ pairs of functions which operate on orthogonal subspaces of $\mathbb{R}^d$. Each pair of functions resembles the constructions from the previous section, but since there are many of them, the algorithm must solve $\Omega(m)$ separate optimization problems in order to optimize $F$.

5.1 Lipschitz Continuous Components

First consider the non-smooth, non-strongly-convex setting and assume for simplicity $m$ is even (otherwise we simply let the last function be zero). We define the helper function $\psi_c$, which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:

$$\psi_c(z) = \max(0, |z| - c) \tag{9}$$

Next, we define $m/2$ pairs of functions, indexed by $i = 1..m/2$:

$$f_{i,1}(x) = \frac{1}{\sqrt{2}} \left| b - \langle x, v_{i,0} \rangle \right| + \frac{1}{2\sqrt{k}} \sum_{r \text{ even}}^{k} \psi_c\left( \langle x, v_{i,r-1} \rangle - \langle x, v_{i,r} \rangle \right)$$
$$f_{i,2}(x) = \frac{1}{2\sqrt{k}} \sum_{r \text{ odd}}^{k} \psi_c\left( \langle x, v_{i,r-1} \rangle - \langle x, v_{i,r} \rangle \right) \tag{10}$$

where $\{v_{i,r}\}_{r=0..k,\, i=1..m/2}$ are random orthonormal vectors and $k = \Theta\left(\frac{1}{\epsilon\sqrt{m}}\right)$. With $c$ sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors $v_{i,r}$ by alternately querying $f_{i,1}$ and $f_{i,2}$; so revealing all $k + 1$ vectors requires at least $k + 1$ total queries. Until $v_{i,k}$ is revealed, an iterate is $\Omega(\epsilon)$-suboptimal on $(f_{i,1} + f_{i,2})/2$. From here, we show that an $\epsilon$-suboptimal solution to $F(x)$ can be found only after at least $k + 1$ queries are made to at least $m/4$ pairs, for a total of $\Omega(mk)$ queries. This time, since the optimum $x^*$ will need to have inner product $b$ with $\Theta(mk)$ vectors $v_{i,r}$, we need to have $b = \Theta\left(\frac{1}{\sqrt{mk}}\right) = \Theta\left(\sqrt{\epsilon}/m^{1/4}\right)$, and the total number of queries is $\Omega(mk) = \Omega\left(\frac{\sqrt{m}}{\epsilon}\right)$. The $\Omega(m)$ term of the lower bound follows trivially since we require $\epsilon = O(1/\sqrt{m})$ (proofs in Appendix C.1):

Theorem 5. For any $L, B > 0$, any $0 < \epsilon < \frac{LB}{10\sqrt{m}}$, any $m \geq 2$, and any randomized algorithm $A$ with access to $h_F$, there exists a dimension $d = O\left(\frac{L^4B^6}{\epsilon^4} \log\frac{LB}{\epsilon}\right)$, and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \leq B\}$, which are convex and $L$-Lipschitz continuous, such that to find a point $\hat{x}$ for which $\mathbb{E}\left[F(\hat{x}) - F(x^*)\right] < \epsilon$, $A$ must make $\Omega\left(m + \frac{\sqrt{m}LB}{\epsilon}\right)$ queries to $h_F$.

An added regularizer gives the result for strongly convex functions (see Appendix C.2):

Theorem 6. For any $L, \lambda > 0$, any $0 < \epsilon < \frac{L^2}{200\lambda m}$, any $m \geq 2$, and any randomized algorithm $A$ with access to $h_F$, there exists a dimension $d = O\left(\frac{L^4}{\lambda^3\epsilon} \log\frac{L}{\sqrt{\lambda\epsilon}}\right)$, and $m$ functions $f_i$ defined over $\mathcal{X} \subseteq \mathbb{R}^d$, which are $L$-Lipschitz continuous and $\lambda$-strongly convex, such that in order to find a point $\hat{x}$ for which $\mathbb{E}\left[F(\hat{x}) - F(x^*)\right] < \epsilon$, $A$ must make $\Omega\left(m + \frac{\sqrt{m}L}{\sqrt{\lambda\epsilon}}\right)$ queries to $h_F$.

The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm's queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.

When $\epsilon = \Omega(LB/\sqrt{m})$ in the non-strongly convex case or $\epsilon = \Omega\left(L^2/(\lambda m)\right)$ in the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly choose a parameter $p$, which is either $(1/2 - 2\epsilon)$ or $(1/2 + 2\epsilon)$.
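Then for $i = 1, \ldots, m$, in the non-strongly convex case we make $f_i(x) = x$ with probability $p$ and $f_i(x) = -x$ with probability $1 - p$. Optimizing $F(x)$ to within $\epsilon$ accuracy then implies recovering the bias of the Bernoulli random variable, which requires $\Omega(1/\epsilon^2)$ queries based on a standard information theoretic result [2, 19]. Setting $f_i(x) = \pm x + \frac{\lambda}{2}\|x\|^2$ gives a $\Omega(1/(\lambda\epsilon))$ lower bound in the $\lambda$-strongly convex setting. This is formalized in Appendix C.5.

5.2 Smooth Components

When the functions $f_i$ are smooth and not strongly convex, we define another helper function $\phi_c$:

$$\phi_c(z) = \begin{cases} 0 & |z| \leq c \\ 2(|z| - c)^2 & c < |z| \leq 2c \\ z^2 - 2c^2 & |z| > 2c \end{cases} \tag{11}$$

and the following pairs of functions for $i = 1, \ldots, m/2$:

$$f_{i,1}(x) = \frac{1}{16} \left( \langle x, v_{i,0} \rangle^2 - 2a \langle x, v_{i,0} \rangle + \sum_{r \text{ even}}^{k} \phi_c\left( \langle x, v_{i,r-1} \rangle - \langle x, v_{i,r} \rangle \right) \right)$$
$$f_{i,2}(x) = \frac{1}{16} \left( \phi_c\left( \langle x, v_{i,k} \rangle \right) + \sum_{r \text{ odd}}^{k} \phi_c\left( \langle x, v_{i,r-1} \rangle - \langle x, v_{i,r} \rangle \right) \right) \tag{12}$$

with $v_{i,r}$ as before. The same arguments apply, after replacing the absolute difference with squared difference. A separate argument is required in this case for the $\Omega(m)$ term in the bound, which we show using a construction involving $m$ simple linear functions (see Appendix C.3).

Theorem 7. For any $\gamma, B, \epsilon > 0$, any $m \geq 2$, and any randomized algorithm $A$ with access to $h_F$, there exists a sufficiently large dimension $d = O\left(\frac{\gamma^2B^6}{\epsilon^2} \log\frac{\gamma B^2}{\epsilon} + B^2 m \log m\right)$ and $m$ functions $f_i$ defined over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \leq B\}$, which are convex and $\gamma$-smooth, such that to find a point $\hat{x} \in \mathbb{R}^d$ for which $\mathbb{E}\left[F(\hat{x}) - F(x^*)\right] < \epsilon$, $A$ must make $\Omega\left(m + \sqrt{\frac{m\gamma B^2}{\epsilon}}\right)$ queries to $h_F$.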
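The two helper functions (9) and (11) are simple enough to check numerically. The sketch below is illustrative only: the Python names `psi`, `phi`, and `phi_prime` are assumptions, and the derivative is written out by hand from the three branches of (11). It shows that both helpers vanish on $[-c, c]$, which is what makes small inner products with undiscovered vectors harmless, and that the branches of $\phi_c$ meet with matching values and slopes.

```python
def psi(z, c):
    """Helper (9): absolute value with a dead zone of width c around zero."""
    return max(0.0, abs(z) - c)


def phi(z, c):
    """Helper (11): squared value with a dead zone, joined by a quadratic piece."""
    a = abs(z)
    if a <= c:
        return 0.0
    if a <= 2.0 * c:
        return 2.0 * (a - c) ** 2
    return z * z - 2.0 * c * c


def phi_prime(z, c):
    """Derivative of phi, read off branch by branch (slopes 0, 4, and 2)."""
    a = abs(z)
    if a <= c:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    if a <= 2.0 * c:
        return 4.0 * (a - c) * sign
    return 2.0 * z


if __name__ == "__main__":
    c = 0.1
    for z in (0.05, 0.1, 0.15, 0.2, 0.3):
        print(z, psi(z, c), phi(z, c), phi_prime(z, c))
```

Because the derivative has slope at most 4 on every branch and is continuous at $|z| = c$ and $|z| = 2c$, $\phi_c$ is differentiable with a Lipschitz gradient, unlike $\psi_c$, which has kinks at $|z| = c$.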

In the strongly convex case, we add the term $\lambda\|x\|^2/2$ to $f_{i,1}$ and $f_{i,2}$ (see Appendix C.4) to obtain:

Theorem 8. For any $m \geq 2$, any $\gamma, \lambda > 0$ such that $\frac{\gamma}{\lambda} > 161m$, any $\epsilon > 0$, any $\epsilon_0 > 60\epsilon\sqrt{\frac{\gamma}{\lambda m}}$, and any randomized algorithm $A$, there exists a dimension $d = O\left(\frac{\gamma^{2.5}\epsilon_0}{\lambda^{2.5}\epsilon} \log^3\frac{\lambda\epsilon_0}{\gamma\epsilon} + \frac{m\gamma\epsilon_0}{\lambda\epsilon} \log m\right)$, domain $\mathcal{X} \subseteq \mathbb{R}^d$, $x_0 \in \mathcal{X}$, and $m$ functions $f_i$ defined on $\mathcal{X}$ which are $\gamma$-smooth and $\lambda$-strongly convex, and such that $F(x_0) - F(x^*) = \epsilon_0$ and such that in order to find a point $\hat{x} \in \mathcal{X}$ such that $\mathbb{E}\left[F(\hat{x}) - F(x^*)\right] < \epsilon$, $A$ must make $\Omega\left(m + \sqrt{\frac{m\gamma}{\lambda}} \log\left(\frac{\epsilon_0}{\epsilon}\sqrt{\frac{m\lambda}{\gamma}}\right)\right)$ queries to $h_F$.

Remark: We consider (1) as a constrained optimization problem, thus the minimizer of $F$ could be achieved on the boundary of $\mathcal{X}$, meaning that the gradient need not vanish. If we make the additional assumption that the minimizer of $F$ lies on the interior of $\mathcal{X}$ (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on $\mathcal{X}$, 0 is always $O(\gamma B^2)$-suboptimal, and only values of $\epsilon$ in the range $0 < \epsilon < \frac{\gamma B^2}{128}$ and $0 < \epsilon < \frac{9\gamma B^2}{128}$ result in a non-trivial lower bound (see Remarks at the end of Appendices B.3 and C.3).

6 Conclusion

We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.

Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting [1, 9]. But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and KATYUSHA are optimal up to a log factor.
Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given full access to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider. Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on $m$ and reduce it to the $\sqrt{m}$ dependence of the randomized algorithms.

The deterministic algorithm for non-smooth problems that we present in Section 3 is also of interest in its own right. It avoids randomization, which is not usually problematic, but makes it fully parallelizable, unlike the optimal stochastic methods. Consider, for example, a supervised learning problem where $f_i(x) = \ell(\langle \phi_i, x \rangle, y_i)$ is the (non-smooth) loss on a single training example $(\phi_i, y_i)$, and the data is distributed across machines. Calculating a prox oracle involves applying the Fenchel conjugate of the loss function $\ell$, but even if a closed form is not available, this is often easy to compute numerically, and is used in algorithms such as SDCA. But unlike SDCA, which is inherently sequential, we can calculate all prox operations in parallel on the different machines, average the resulting gradients of the smoothed function, and take an accelerated gradient step to implement our optimal deterministic algorithm. This method attains a recent lower bound for distributed optimization, resolving a question raised by Arjevani and Shamir [5], and when the number of machines is very large improves over all other known distributed optimization methods for the problem.

In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle, not the algorithm).
Even for the classic problem of minimizing a smooth function using a first order oracle, we could not locate a published proof that applies to randomized algorithms. We provide a simple construction using $\epsilon$-insensitive differences that allows us to easily obtain such lower bounds without reverting to assuming the iterates are spanned by previous responses (as was done, e.g., in [9]), and could potentially be useful for establishing randomized lower bounds also in other settings.

Acknowledgements: We thank Ohad Shamir for his helpful discussions and for pointing out [4].

References

[1] Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. arXiv preprint arXiv:1410.0723, 2014.
[2] Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1-9, 2009.
[3] Zeyuan Allen-Zhu. Katyusha: The first truly accelerated stochastic gradient descent. arXiv preprint arXiv:1603.05953, 2016.
[4] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. arXiv preprint arXiv:1603.05642, 2016.
[5] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, pages 1747-1755, 2015.
[6] Heinz H Bauschke and Patrick L Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer Science & Business Media, 2011.
[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[9] Guanghui Lan. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.
[10] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3366-3374, 2015.
[11] Yu Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[12] Yurii Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady, 27(2):372-376, 1983.
[13] Francesco Orabona, Andreas Argyriou, and Nathan Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372, 2012.
[14] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.
[15] Shai Shalev-Shwartz. Stochastic optimization for machine learning. Slides of presentation at Optimization Without Borders 2016, http://www.di.ens.fr/~aspremon/houches/talks/shai.pdf, 2016.
[16] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567-599, 2013.
[17] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.
[18] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. arXiv preprint arXiv:1312.7853, 2013.
[19] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pages 423-435. Springer, 1997.
[20] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint arXiv:1501.00263, 2015.