Tight Complexity Bounds for Optimizing Composite Objectives


Blake Woodworth
Toyota Technological Institute at Chicago
Chicago, IL

Nathan Srebro
Toyota Technological Institute at Chicago
Chicago, IL

Abstract

We provide tight upper and lower bounds on the complexity of minimizing the average of m convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity and we present optimal methods based on smoothing that improve over methods using just gradient accesses.

1 Introduction

We consider minimizing the average of m convex functions:

    min_{x ∈ X} F(x) := (1/m) Σ_{i=1}^m f_i(x)                                      (1)

where X ⊆ R^d is a closed, convex set, and where the algorithm is given access to the following gradient (or subgradient in the case of non-smooth functions) and prox oracle for the components:

    h_F(x, i, β) = [ f_i(x), ∇f_i(x), prox_{f_i}(x, β) ]                            (2)

where

    prox_{f_i}(x, β) = argmin_{u ∈ X} f_i(u) + (β/2) ||x - u||²                     (3)

A natural question is how to leverage the prox oracle, and how much benefit it provides over gradient access alone. The prox oracle is potentially much more powerful, as it provides global, rather than local, information about the function. For example, for a single function (m = 1), one prox oracle call (with β = 0) is sufficient for exact optimization. Several methods have recently been suggested for optimizing a sum or average of several functions using prox accesses to each component, both in the distributed setting where each component might be handled on a different machine (e.g. ADMM [7], DANE [18], DISCO [20]) or for functions that can be decomposed into several "easy" parts (e.g. PRISMA [13]). But as far as we are aware, no meaningful lower bound was previously known on the number of prox oracle accesses required even for the average of two functions (m = 2).

The optimization of composite objectives of the form (1) has also been extensively studied in the context of minimizing empirical risk over m samples. Recently, stochastic methods such as SDCA [16], SAG [14], SVRG [8], and other variants, have been presented which leverage the finite nature of the problem to reduce the variance in stochastic gradient estimates and obtain guarantees that dominate both batch and stochastic gradient descent.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
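To make the component oracle (2)-(3) concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the helper name component_oracle and the use of scipy.optimize.minimize as a stand-in prox solver are assumptions made purely for exposition):

```python
import numpy as np
from scipy.optimize import minimize

def component_oracle(f, grad_f, x, beta):
    """Return (f_i(x), a (sub)gradient of f_i at x, prox_{f_i}(x, beta)).

    f, grad_f : callables for one component f_i and its (sub)gradient.
    beta      : prox parameter; prox_{f_i}(x, beta) = argmin_u f(u) + (beta/2)||x - u||^2.
    The prox is computed numerically here, purely for illustration.
    """
    prox_obj = lambda u: f(u) + 0.5 * beta * np.sum((x - u) ** 2)
    prox_jac = lambda u: grad_f(u) + beta * (u - x)
    res = minimize(prox_obj, x0=np.asarray(x, dtype=float), jac=prox_jac, method="L-BFGS-B")
    return f(x), grad_f(x), res.x

# Example with a single quadratic component f(u) = ||u - 1||^2 / 2:
f = lambda u: 0.5 * np.sum((u - 1.0) ** 2)
grad_f = lambda u: u - 1.0
value, gradient, prox_point = component_oracle(f, grad_f, np.zeros(3), beta=2.0)
print(value, gradient, prox_point)  # prox solves min_u ||u-1||^2/2 + ||u||^2, i.e. u = 1/3
```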

Table 1: Upper and lower bounds on the number of grad-and-prox oracle accesses needed to find ε-suboptimal solutions for each function class. These are exact up to constant factors except for the lower bounds for smooth and strongly convex functions, which hide extra log(β/λ) and log(√(β/λ)) factors for deterministic and randomized algorithms. Here, Δ₀ is the suboptimality of the point 0.

    Deterministic upper bounds:
      L-Lipschitz, convex, ||x|| ≤ B:    O( m LB/ε )                      (Section 3)
      L-Lipschitz, λ-strongly convex:    O( m L/√(λε) )                   (Section 3)
      β-smooth, convex, ||x|| ≤ B:       O( m √(βB²/ε) )                  (AGD)
      β-smooth, λ-strongly convex:       O( m √(β/λ) log(Δ₀/ε) )          (AGD)

    Deterministic lower bounds (Section 4):
      Ω( m LB/ε ),  Ω( m L/√(λε) ),  Ω( m √(βB²/ε) ),  Ω( m √(β/λ) log(Δ₀/ε) )  for the same four classes.

    Randomized upper bounds:
      L-Lipschitz, convex, ||x|| ≤ B:    O( L²B²/ε²  ∧  ( m log(Δ₀/ε) + √m LB/ε ) )     (SGD, A-SVRG)
      L-Lipschitz, λ-strongly convex:    O( L²/(λε)  ∧  ( m log(Δ₀/ε) + √m L/√(λε) ) )  (SGD, A-SVRG)
      β-smooth, convex, ||x|| ≤ B:       O( m log(Δ₀/ε) + √(mβB²/ε) )                   (A-SVRG)
      β-smooth, λ-strongly convex:       O( m + √(mβ/λ) log(Δ₀/ε) )                     (A-SVRG)

    Randomized lower bounds (Section 5):
      L-Lipschitz, convex, ||x|| ≤ B:    Ω( L²B²/ε²  ∧  ( m + √m LB/ε ) )
      L-Lipschitz, λ-strongly convex:    Ω( L²/(λε)  ∧  ( m + √m L/√(λε) ) )
      β-smooth, convex, ||x|| ≤ B:       Ω( m + √(mβB²/ε) )
      β-smooth, λ-strongly convex:       Ω( m + √(mβ/λ) log(Δ₀/ε) )

As methods with improved complexity, such as accelerated SDCA [17], accelerated SVRG, and KATYUSHA [3], have been presented, researchers have also tried to obtain lower bounds on the best possible complexity in this setting, but as we survey below, these have not been satisfactory so far.

In this paper, after briefly surveying methods for smooth, composite optimization, we present methods for optimizing non-smooth composite objectives, which show that prox oracle access can indeed be leveraged to improve over methods using merely subgradient access (see Section 3). We then turn to studying lower bounds. We consider algorithms that access the objective F only through the oracle h_F and provide lower bounds on the number of such oracle accesses (and thus the runtime) required to find ε-suboptimal solutions. We consider optimizing both Lipschitz (non-smooth) functions and smooth functions, and guarantees that do and do not depend on strong convexity, distinguishing between deterministic optimization algorithms and randomized algorithms. Our upper and lower bounds are summarized in Table 1. As shown in the table, we provide matching upper and lower bounds (up to a log factor) for all function and algorithm classes. In particular, our bounds establish the optimality (up to log factors) of accelerated SDCA, SVRG, and SAG for randomized finite-sum optimization, and also the optimality of our deterministic smoothing algorithms for non-smooth composite optimization.

On the power of gradient vs prox oracles  For non-smooth functions, we show that having access to prox oracles for the components can reduce the polynomial dependence on ε from 1/ε² to 1/ε, or from 1/(λε) to 1/√(λε) for λ-strongly convex functions. However, all of the optimal complexities for smooth functions can be attained with only component gradient access using accelerated gradient descent (AGD) or accelerated SVRG. Thus the worst-case complexity cannot be improved (at least not significantly) by using the more powerful prox oracle.

On the power of randomization  We establish a significant gap between deterministic and randomized algorithms for finite-sum problems. Namely, the dependence on the number of components m must be linear in m for any deterministic algorithm, but can be reduced to √m (in the typically significant term) using randomization. We emphasize that the randomization here is only in the algorithm, not in the oracle. We always assume the oracle returns an exact answer (for the requested component) and is not a stochastic oracle. The distinction is that the algorithm is allowed to "flip coins" in deciding what operations and queries to perform, but the oracle must return an exact answer to that query (of course, the algorithm could simulate a stochastic oracle).
Prior Lower Bounds  Several authors recently presented lower bounds for optimizing (1) in the smooth and strongly convex setting using component gradients. Agarwal and Bottou [1] presented a lower bound of Ω( m + √(mβ/λ) log(1/ε) ). However, their bound is valid only for deterministic algorithms (thus not including SDCA, SVRG, SAG, etc.); we not only consider randomized algorithms, but also show a much higher lower bound for deterministic algorithms (i.e. the bound of Agarwal and Bottou is loose).

Improving upon this, Lan [9] shows a similar lower bound for a restricted class of randomized algorithms: the algorithm must select which component to query for a gradient by drawing an index from a fixed distribution, but the algorithm must otherwise be deterministic in how it uses the gradients, and its iterates must lie in the span of the gradients it has received. This restricted class includes SAG, but not SVRG nor perhaps other realistic attempts at improving over these. Furthermore, both bounds allow only gradient accesses, not prox computations. Thus SDCA, which requires prox accesses, and potential variants are not covered by such lower bounds.

We prove a similar lower bound to Lan's, but our analysis is much more general and applies to any randomized algorithm, making any sequence of queries to a gradient and prox oracle, and without assuming that iterates lie in the span of previous responses. In addition to smooth functions, we also provide lower bounds for non-smooth problems which were not considered by these previous attempts.

Another recent observation [15] was that with access only to random component subgradients, without knowing the component's identity, an algorithm must make Ω(m²) queries to optimize well. This shows how relatively subtle changes in the oracle can have a dramatic effect on the complexity of the problem. Since the oracle we consider is quite powerful, our lower bounds cover a very broad family of algorithms, including SAG, SVRG, and SDCA.

Our deterministic lower bounds are inspired by a lower bound on the number of rounds of communication required for optimization when each f_i is held by a different machine and when iterates lie in the span of certain permitted calculations [5]. Our construction for m = 2 is similar to theirs (though in a different setting), but their analysis considers neither scaling with m (which has a different role in their setting) nor randomization.

Notation and Definitions  We use ||·|| to denote the standard Euclidean norm on R^d. We say that a function f is L-Lipschitz continuous on X if ∀x, y ∈ X  |f(x) - f(y)| ≤ L||x - y||; β-smooth on X if it is differentiable and its gradient is β-Lipschitz on X; and λ-strongly convex on X if ∀x, y ∈ X  f(y) ≥ f(x) + ⟨∇f(x), y - x⟩ + (λ/2)||x - y||². We consider optimizing (1) under four combinations of assumptions: each component f_i is either L-Lipschitz or β-smooth, and either F(x) is λ-strongly convex or its domain is bounded, X ⊆ {x : ||x|| ≤ B}.

2 Optimizing Smooth Sums

We briefly review the best known methods for optimizing (1) when the components are β-smooth, yielding the upper bounds on the right half of Table 1. These upper bounds can be obtained using only component gradient access, without need for the prox oracle. We can obtain exact gradients of F(x) by computing all m component gradients ∇f_i(x). Running accelerated gradient descent (AGD) [12] on F(x) using these exact gradients achieves the upper complexity bounds for deterministic algorithms and smooth problems (see Table 1).

SAG [14], SVRG [8] and related methods use randomization to sample components, but also leverage the finite nature of the objective to control the variance of the gradient estimator used. Accelerating these methods using the Catalyst framework [10] ensures that for λ-strongly convex objectives we have E[F(x^(k))] - F(x*) < ε after k = O( (m + √(mβ/λ)) log(Δ₀/ε) ) iterations, where F(0) - F(x*) = Δ₀. KATYUSHA [3] is a more direct approach to accelerating SVRG which avoids extraneous log-factors, yielding the complexity k = O( m + √(mβ/λ) log(Δ₀/ε) ) indicated in Table 1.
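To illustrate the variance-reduction idea behind SVRG-type methods, here is a minimal sketch of the basic, non-accelerated SVRG update (our own illustration; the step size, epoch length, and the toy example are assumptions, and this is not the accelerated variant whose rates appear in Table 1):

```python
import numpy as np

def svrg(grads, x0, step, n_epochs=20, inner_steps=None):
    """Basic (non-accelerated) SVRG for min_x (1/m) * sum_i f_i(x),
    given component-gradient oracles grads[i](x) = grad f_i(x)."""
    m = len(grads)
    inner_steps = inner_steps or 2 * m
    x_ref = np.asarray(x0, dtype=float)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        # Full gradient at the reference point: m component-gradient accesses.
        full_grad = sum(g(x_ref) for g in grads) / m
        x = x_ref.copy()
        for _ in range(inner_steps):
            i = rng.integers(m)
            # Variance-reduced estimate: unbiased for grad F(x), with variance
            # shrinking as x and x_ref approach the optimum.
            v = grads[i](x) - grads[i](x_ref) + full_grad
            x = x - step * v
        x_ref = x
    return x_ref

# Toy example: m quadratic components f_i(x) = ||x - a_i||^2 / 2 (1-smooth).
anchors = [np.array([1.0, -2.0]), np.array([3.0, 0.0]), np.array([-1.0, 2.0])]
grads = [(lambda x, a=a: x - a) for a in anchors]
print(svrg(grads, x0=np.zeros(2), step=0.1))  # approaches the anchors' mean, (1, 0)
```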
When F is not strongly convex, adding a regularizer to the objective and instead optimizing F_λ(x) = F(x) + (λ/2)||x||² with λ = ε/B² results in an oracle complexity of O( (m + √(mβB²/ε)) log(Δ₀/ε) ). The log-factor in the second term can be removed using the more delicate reduction of Allen-Zhu and Hazan [4], which involves optimizing F_λ(x) for progressively smaller values of λ, yielding the upper bound in the table.

KATYUSHA and Catalyst-accelerated SAG or SVRG use only gradients of the components. Accelerated SDCA [17] achieves a similar complexity using gradient and prox oracle access.

3 Leveraging Prox Oracles for Lipschitz Sums

In this section, we present algorithms for leveraging the prox oracle to minimize (1) when each component is L-Lipschitz. This will be done by using the prox oracle to smooth each component, and optimizing the new, smooth sum which approximates the original problem. This idea was used in order to apply KATYUSHA [3] and accelerated SDCA [17] to non-smooth objectives. We are not aware of a previous explicit presentation of the AGD-based deterministic algorithm, which achieves the deterministic upper complexity indicated in Table 1.

The key is using a prox oracle to obtain gradients of the β-Moreau envelope of a non-smooth function, f, defined as:

    f^(β)(x) = inf_{u ∈ X} f(u) + (β/2)||x - u||²                                   (4)

Lemma 1 ([13, Lemma 2.2], [6, Proposition 12.29], following [11]). Let f be convex and L-Lipschitz continuous. For any β > 0,
  1. f^(β) is β-smooth
  2. ∇f^(β)(x) = β(x - prox_f(x, β))
  3. f^(β)(x) ≤ f(x) ≤ f^(β)(x) + L²/(2β)

Consequently, we can consider the smoothed problem

    min_{x ∈ X} F^(β)(x) := (1/m) Σ_{i=1}^m f_i^(β)(x).                             (5)

While F^(β) is not, in general, the β-Moreau envelope of F, it is β-smooth, we can calculate the gradient of its components using the oracle h_F, and F^(β)(x) ≤ F(x) ≤ F^(β)(x) + L²/(2β). Thus, to obtain an ε-suboptimal solution to (1) using h_F, we set β = L²/ε and apply any algorithm which can optimize (5) using gradients of the L²/ε-smooth components, to within ε/2 accuracy. With the rates presented in Section 2, using AGD on (5) yields a complexity of O(mLB/ε) in the deterministic setting. When the functions are λ-strongly convex, smoothing with a fixed β results in a spurious log-factor. To avoid this, we again apply the reduction of Allen-Zhu and Hazan [4], this time optimizing F^(β) for increasingly large values of β when used with AGD (see Appendix A for details). This leads to the upper bound of O(mL/√(λε)).

Similarly, we can apply an accelerated randomized algorithm (such as KATYUSHA) to the smooth problem F^(β) to obtain complexities of O( m log(Δ₀/ε) + √m LB/ε ) and O( m log(Δ₀/ε) + √m L/√(λε) ); this matches the presentation of Allen-Zhu [3] and is similar to that of Shalev-Shwartz and Zhang [17]. Finally, if ε² > L²B²/m or ε > L²/(λm), stochastic gradient descent is a better randomized alternative, yielding complexities of O(L²B²/ε²) or O(L²/(λε)).
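To make the smoothing approach of this section concrete, here is a minimal sketch of running AGD on the smoothed average F^(β), forming each gradient only from prox calls via ∇f_i^(β)(x) = β(x - prox_{f_i}(x, β)). It is our own illustration rather than the paper's implementation: the prox is computed numerically, the step sizes follow the standard scheme for β-smooth problems, and the toy components are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def prox(f, x, beta):
    # prox_f(x, beta) = argmin_u f(u) + (beta/2)||x - u||^2, computed numerically.
    obj = lambda u: f(u) + 0.5 * beta * np.sum((x - u) ** 2)
    return minimize(obj, x0=np.asarray(x, dtype=float), method="Nelder-Mead").x

def smoothed_agd(fs, x0, beta, n_iters=300):
    """AGD on F^(beta)(x) = (1/m) sum_i f_i^(beta)(x); each iteration makes one prox
    call per component and uses grad f_i^(beta)(x) = beta*(x - prox_{f_i}(x, beta))."""
    x = y = np.asarray(x0, dtype=float)
    step, t = 1.0 / beta, 1.0                 # F^(beta) is beta-smooth
    for _ in range(n_iters):
        grad = beta * np.mean([y - prox(f, y, beta) for f in fs], axis=0)
        x_next = y - step * grad
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # Nesterov momentum
        x, t = x_next, t_next
    return x

# Two 1-Lipschitz components f_1(x) = ||x - a||_1, f_2(x) = ||x - b||_1.
a, b = np.array([1.0, 0.0]), np.array([-1.0, 2.0])
fs = [lambda u: np.sum(np.abs(u - a)), lambda u: np.sum(np.abs(u - b))]
eps = 0.05          # target accuracy; the paper's choice is beta = L^2/eps (here L = 1)
x_hat = smoothed_agd(fs, np.zeros(2), beta=1.0 / eps)
print(x_hat)        # any point between a and b coordinate-wise is optimal, e.g. ~(0, 1)
```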

4 Lower Bounds for Deterministic Algorithms

We now turn to establishing lower bounds on the oracle complexity of optimizing (1). We first consider only deterministic optimization algorithms. What we would like to show is that for any deterministic optimization algorithm we can construct a "hard" function for which the algorithm cannot find an ε-suboptimal solution until it has made many oracle accesses. Since the algorithm is deterministic, we can construct such a function by simulating the (deterministic) behavior of the algorithm. This can be viewed as a game, where an adversary controls the oracle being used by the algorithm. At each iteration the algorithm queries the oracle with some triplet (x, i, β) and the adversary responds with an answer. This answer must be consistent with all previous answers, but the adversary ensures it is also consistent with a composite function F that the algorithm is far from optimizing. The hard function is then gradually defined in terms of the behavior of the optimization algorithm.

To help us formulate our constructions, we define a "round" of queries as a series of queries in which ⌈m/2⌉ distinct functions f_i are queried. The first round begins with the first query and continues until exactly ⌈m/2⌉ unique functions have been queried. The second round begins with the next query, and continues until exactly ⌈m/2⌉ more distinct components have been queried in the second round, and so on until the algorithm terminates. This definition is useful for analysis but requires no assumptions about the algorithm's querying strategy.
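To make the bookkeeping of this "round" definition concrete, the following small sketch (our own, purely illustrative) partitions a sequence of queried component indices into rounds, closing a round as soon as ⌈m/2⌉ distinct components have appeared in it:

```python
import math

def split_into_rounds(queried_indices, m):
    """Partition queried component indices into rounds: a round ends once
    ceil(m/2) distinct components have been queried within it."""
    threshold = math.ceil(m / 2)
    rounds, current, seen = [], [], set()
    for i in queried_indices:
        current.append(i)
        seen.add(i)
        if len(seen) == threshold:
            rounds.append(current)
            current, seen = [], set()
    if current:
        rounds.append(current)   # possibly incomplete final round
    return rounds

# With m = 4 (threshold 2): re-querying the same component does not end a round.
print(split_into_rounds([0, 0, 0, 1, 2, 2, 3, 1], m=4))
# -> [[0, 0, 0, 1], [2, 2, 3], [1]]
```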

4.1 Non-Smooth Components

We begin by presenting a lower bound for deterministic optimization of (1) when each component f_i is convex and L-Lipschitz continuous, but is not necessarily strongly convex, on the domain X = {x : ||x|| ≤ B}. Without loss of generality, we can consider L = B = 1. We will construct functions of the following form:

    f_i(x) = (1/√2) ( |b - ⟨x, v_0⟩| + (1/√k) Σ_{r=1}^k δ_{i,r} |⟨x, v_{r-1}⟩ - ⟨x, v_r⟩| )          (6)

where k = ⌊1/(12ε)⌋, b = 1/√(k+1), and {v_r} is an orthonormal set of vectors in R^d chosen according to the behavior of the algorithm such that v_r is orthogonal to all points at which the algorithm queries h_F before round r, and where δ_{i,r} are indicators chosen so that δ_{i,r} = 1 if the algorithm does not query component i in round r (and zero otherwise). To see how this is possible, consider the following truncations of (6):

    f_i^t(x) = (1/√2) ( |b - ⟨x, v_0⟩| + (1/√k) Σ_{r=1}^{t-1} δ_{i,r} |⟨x, v_{r-1}⟩ - ⟨x, v_r⟩| )     (7)

During each round t, the adversary answers queries according to f_i^t, which depends only on v_r, δ_{i,r} for r < t, i.e. from previous rounds. When the round is completed, δ_{i,t} is determined and v_t is chosen to be orthogonal to the vectors {v_0, ..., v_{t-1}} as well as every point queried by the algorithm so far, thus defining f_i^{t+1} for the next round. In Appendix B.1 we prove that these responses based on f_i^t are consistent with f_i.

The algorithm can only learn v_r after it completes round r; until then, every iterate is orthogonal to it by construction. The average of these functions reaches its minimum of F(x*) = 0 at x* = b Σ_{r=0}^k v_r, so we can view optimizing these functions as the task of discovering the vectors v_r: even if only v_k is missing, a suboptimality better than b/(6√k) > ε cannot be achieved. Therefore, the deterministic algorithm must complete at least k rounds of optimization, each comprising at least ⌈m/2⌉ queries to h_F, in order to optimize F.

The key to this construction is that even though each term |⟨x, v_{r-1}⟩ - ⟨x, v_r⟩| appears in m/2 components, and hence has a strong effect on the average F(x), we can force a deterministic algorithm to make Ω(m) queries during each round before it finds the next relevant term. We obtain (for complete proof see Appendix B.1):

Theorem 1. For any L, B > 0, any 0 < ε < LB/12, any m, and any deterministic algorithm A with access to h_F, there exists a dimension d = O(mLB/ε), and m functions f_i defined over X = {x ∈ R^d : ||x|| ≤ B}, which are convex and L-Lipschitz continuous, such that in order to find a point x̂ for which F(x̂) - F(x*) < ε, A must make Ω(mLB/ε) queries to h_F.

Furthermore, we can always reduce optimizing a function over ||x|| ≤ B to optimizing a strongly convex function by adding the regularizer ε||x||²/B² to each component, implying (see complete proof in Appendix B.2):

Theorem 2. For any L, λ > 0, any 0 < ε < L²/(288λ), any m, and any deterministic algorithm A with access to h_F, there exists a dimension d = O(mL/√(λε)), and m functions f_i defined over X ⊆ R^d, which are L-Lipschitz continuous and λ-strongly convex, such that in order to find a point x̂ for which F(x̂) - F(x*) < ε, A must make Ω(mL/√(λε)) queries to h_F.

4.2 Smooth Components

When the components f_i are required to be smooth, the lower bound construction is similar to (6), except it is based on squared differences instead of absolute differences. We consider the functions:

    f_i(x) = (β/8) ( δ_{i,1}(⟨x, v_0⟩² - 2a⟨x, v_0⟩) + δ_{i,k}⟨x, v_k⟩² + Σ_{r=1}^k δ_{i,r}(⟨x, v_{r-1}⟩ - ⟨x, v_r⟩)² )          (8)

where δ_{i,r} and v_r are as before. Again, we can answer queries at round t based only on δ_{i,r}, v_r for r < t. This construction yields the following lower bounds (full details in Appendix B.3):

Theorem 3. For any β, B, ε > 0, any m, and any deterministic algorithm A with access to h_F, there exists a sufficiently large dimension d = O(m√(βB²/ε)), and m functions f_i defined over X = {x ∈ R^d : ||x|| ≤ B}, which are convex and β-smooth, such that in order to find a point x̂ ∈ R^d for which F(x̂) - F(x*) < ε, A must make Ω(m√(βB²/ε)) queries to h_F.

In the strongly convex case, we use a very similar construction, adding the term (λ/2)||x||², which gives the following bound (see Appendix B.4):

Theorem 4. For any β, λ > 0 such that β/λ > 73, any ε > 0, any Δ₀ > 3ε, any m, and any deterministic algorithm A with access to h_F, there exists a sufficiently large dimension d = O(m√(β/λ) log(Δ₀/ε)), and m functions f_i defined over X ⊆ R^d, which are β-smooth and λ-strongly convex and where F(0) - F(x*) = Δ₀, such that in order to find a point x̂ for which F(x̂) - F(x*) < ε, A must make Ω(m√(β/λ) log(Δ₀/ε)) queries to h_F.

5 Lower Bounds for Randomized Algorithms

We now turn to randomized algorithms for (1). In the deterministic constructions, we relied on being able to set v_r and δ_{i,r} based on the predictable behavior of the algorithm. This is impossible for randomized algorithms; we must choose the "hard" function before we know the random choices the algorithm will make, so the function must be hard more generally than before.

Previously, we chose vectors v_r orthogonal to all previous queries made by the algorithm. For randomized algorithms this cannot be ensured. However, if we choose orthonormal vectors v_r randomly in a high dimensional space, they will be nearly orthogonal to queries with high probability. Slightly modifying the absolute or squared difference from before makes near-orthogonality sufficient. This issue increases the required dimension but does not otherwise affect the lower bounds.

More problematic is our inability to anticipate the order in which the algorithm will query the components, precluding the use of δ_{i,r}. In the deterministic setting, if a term revealing a new v_r appeared in half of the components, we could ensure that the algorithm must make m/2 queries to find it. However, a randomized algorithm could find it in two queries in expectation, which would eliminate the linear dependence on m in the lower bound! Alternatively, if only one component included the term, a randomized algorithm would indeed need Ω(m) queries to find it, but that term's effect on the suboptimality of F would be scaled down by m, again eliminating the dependence on m.

To establish a Ω(√m) lower bound for randomized algorithms we must take a new approach. We define m/2 pairs of functions which operate on orthogonal subspaces of R^d. Each pair of functions resembles the constructions from the previous section, but since there are many of them, the algorithm must solve Ω(m) separate optimization problems in order to optimize F.

5.1 Lipschitz Continuous Components

First consider the non-smooth, non-strongly-convex setting and assume for simplicity m is even (otherwise we simply let the last function be zero). We define the helper function |·|_c, which replaces the absolute value operation and makes our construction resistant to small inner products between iterates and not-yet-discovered components:

    |z|_c = max(0, |z| - c)                                                          (9)

Next, we define m/2 pairs of functions, indexed by i = 1, ..., m/2:

    f_{i,1}(x) = (1/√2) ( |b - ⟨x, v_{i,0}⟩| + (1/√k) Σ_{r even} |⟨x, v_{i,r-1}⟩ - ⟨x, v_{i,r}⟩|_c )
    f_{i,2}(x) = (1/√2) (1/√k) Σ_{r odd} |⟨x, v_{i,r-1}⟩ - ⟨x, v_{i,r}⟩|_c                              (10)

where {v_{i,r}}_{r=0..k, i=1..m/2} are random orthonormal vectors and k = Θ(1/(√m ε)).

With c sufficiently small and the dimensionality sufficiently high, with high probability the algorithm only learns the identity of new vectors v_{i,r} by alternately querying f_{i,1} and f_{i,2}; so revealing all k+1 vectors requires at least k+1 total queries. Until v_{i,k} is revealed, an iterate is Ω(ε)-suboptimal on (f_{i,1} + f_{i,2})/2. From here, we show that an ε-suboptimal solution to F(x) can be found only after at least k+1 queries are made to at least m/4 pairs, for a total of Ω(mk) queries. This time, since the optimum x* will need to have inner product b with Ω(mk) vectors v_{i,r}, we need to have b = Θ(1/√(mk)) = Θ(√ε / m^{1/4}), and the total number of queries is Ω(mk) = Ω(√m/ε). The Ω(m) term of the lower bound follows trivially since we require ε = O(1/√m) (proofs in Appendix C.1):

Theorem 5. For any L, B > 0, any 0 < ε < LB/(10√m), any m, and any randomized algorithm A with access to h_F, there exists a dimension d (polynomial in m, L, B, and 1/ε, with a logarithmic factor), and m functions f_i defined over X = {x ∈ R^d : ||x|| ≤ B}, which are convex and L-Lipschitz continuous, such that to find a point x̂ for which E[F(x̂) - F(x*)] < ε, A must make Ω(m + √m LB/ε) queries to h_F.

An added regularizer gives the result for strongly convex functions (see Appendix C.2):

Theorem 6. For any L, λ > 0, any 0 < ε < L²/(200λ), any m, and any randomized algorithm A with access to h_F, there exists a dimension d (polynomial in m, L, λ, and 1/ε, with a logarithmic factor), and m functions f_i defined over X ⊆ R^d, which are L-Lipschitz continuous and λ-strongly convex, such that in order to find a point x̂ for which E[F(x̂) - F(x*)] < ε, A must make Ω(m + √m L/√(λε)) queries to h_F.

The large dimension required by these lower bounds is the cost of omitting the assumption that the algorithm's queries lie in the span of previous oracle responses. If we do assume that the queries lie in that span, the necessary dimension is only on the order of the number of oracle queries needed.

When ε = Ω(LB/√m) in the non-strongly convex case or ε = Ω(L²/(λm)) in the strongly convex case, the lower bounds for randomized algorithms presented above do not apply. Instead, we can obtain a lower bound based on an information theoretic argument. We first uniformly randomly choose a parameter p, which is either 1/2 - 2ε or 1/2 + 2ε. Then for i = 1, ..., m, in the non-strongly convex case we make f_i(x) = x with probability p and f_i(x) = -x with probability 1 - p. Optimizing F(x) to within ε accuracy then implies recovering the bias of the Bernoulli random variable, which requires Ω(1/ε²) queries based on a standard information theoretic result [2, 19]. Setting f_i(x) = ±x + (λ/2)||x||² gives a Ω(1/(λε)) lower bound in the λ-strongly convex setting. This is formalized in Appendix C.

5.2 Smooth Components

When the functions f_i are smooth and not strongly convex, we define another helper function Φ_c:

    Φ_c(z) = 0               for |z| ≤ c
             2(|z| - c)²     for c < |z| ≤ 2c                                        (11)
             z² - 2c²        for |z| > 2c

and the following pairs of functions for i = 1, ..., m/2:

    f_{i,1}(x) = (β/16) ( ⟨x, v_{i,0}⟩² - 2a⟨x, v_{i,0}⟩ + Σ_{r even} Φ_c(⟨x, v_{i,r-1}⟩ - ⟨x, v_{i,r}⟩) )
    f_{i,2}(x) = (β/16) ( Φ_c(⟨x, v_{i,k}⟩) + Σ_{r odd} Φ_c(⟨x, v_{i,r-1}⟩ - ⟨x, v_{i,r}⟩) )              (12)

with v_{i,r} as before. The same arguments apply, after replacing the absolute difference with the squared difference. A separate argument is required in this case for the Ω(m) term in the bound, which we show using a construction involving m simple linear functions (see Appendix C.3).

Theorem 7. For any β, B, ε > 0, any m, and any randomized algorithm A with access to h_F, there exists a sufficiently large dimension d (polynomial in m, β, B, and 1/ε, with logarithmic factors), and m functions f_i defined over X = {x ∈ R^d : ||x|| ≤ B}, which are convex and β-smooth, such that to find a point x̂ ∈ R^d for which E[F(x̂) - F(x*)] < ε, A must make Ω(m + √(mβB²/ε)) queries to h_F.
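As a quick numerical sanity check on the two helper functions as defined in (9) and (11) (a small self-contained sketch of our own, using finite differences; nothing here comes from the paper's appendices), one can verify that |z|_c vanishes on [-c, c] and is 1-Lipschitz, while Φ_c also vanishes on [-c, c] but has bounded curvature, i.e. it is smooth:

```python
import numpy as np

def abs_c(z, c):
    # Clipped absolute value |z|_c = max(0, |z| - c): insensitive to |z| <= c.
    return np.maximum(0.0, np.abs(z) - c)

def phi_c(z, c):
    # Smoothed counterpart: 0 on [-c, c], 2(|z|-c)^2 for c < |z| <= 2c, z^2 - 2c^2 beyond.
    a = np.abs(z)
    return np.where(a <= c, 0.0,
                    np.where(a <= 2 * c, 2.0 * (a - c) ** 2, z ** 2 - 2.0 * c ** 2))

c, h = 0.1, 1e-4
z = np.linspace(-1.0, 1.0, 20001)
slopes = np.diff(abs_c(z, c)) / np.diff(z)
print(np.max(np.abs(slopes)))    # ~1.0: |z|_c is 1-Lipschitz (and 0 near the origin)
curvature = (phi_c(z + h, c) - 2 * phi_c(z, c) + phi_c(z - h, c)) / h ** 2
print(np.max(curvature))         # ~4.0: bounded second differences, so Phi_c is smooth
```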

In the strongly convex case, we add the term (λ/2)||x||² to f_{i,1} and f_{i,2} (see Appendix C.4) to obtain:

Theorem 8. For any m, any β, λ > 0 such that β/λ > 161, any ε > 0, any sufficiently large Δ₀ (relative to ε), and any randomized algorithm A, there exists a dimension d (polynomial in m, β/λ, and Δ₀/ε, with logarithmic factors), a domain X ⊆ R^d, a point x_0 ∈ X, and m functions f_i defined on X which are β-smooth and λ-strongly convex, such that F(x_0) - F(x*) = Δ₀ and such that in order to find a point x̂ ∈ X for which E[F(x̂) - F(x*)] < ε, A must make Ω(m + √(mβ/λ) log(Δ₀/ε)) queries to h_F.

Remark: We consider (1) as a constrained optimization problem, thus the minimizer of F could be achieved on the boundary of X, meaning that the gradient need not vanish. If we make the additional assumption that the minimizer of F lies on the interior of X (and is thus the unconstrained global minimum), Theorems 1-8 all still apply, with a slight modification to Theorems 3 and 7. Since the gradient now needs to vanish on X, the point 0 is always O(βB²)-suboptimal, and only values of ε in the ranges 0 < ε < βB²/9 and 0 < ε < βB²/18 result in a non-trivial lower bound (see Remarks at the end of Appendices B.3 and C.3).

6 Conclusion

We provide a tight (up to a log factor) understanding of optimizing finite sum problems of the form (1) using a component prox oracle.

Randomized optimization of (1) has been the subject of much research in the past several years, starting with the presentation of SDCA and SAG, and continuing with accelerated variants. Obtaining lower bounds can be very useful for better understanding the problem, for knowing where it might or might not be possible to improve or where different assumptions would be needed to improve, and for establishing optimality of optimization methods. Indeed, several attempts have been made at lower bounds for the finite sum setting [1, 9]. But as we explain in the introduction, these were unsatisfactory and covered only limited classes of methods. Here we show that in a fairly general sense, accelerated SDCA, SVRG, SAG, and KATYUSHA are optimal up to a log factor. Improving on their runtime would require additional assumptions, or perhaps a stronger oracle. However, even if given "full access" to the component functions, all algorithms that we can think of utilize this information to calculate a prox vector. Thus, it is unclear what realistic oracle would be more powerful than the prox oracle we consider.

Our results highlight the power of randomization, showing that no deterministic algorithm can beat the linear dependence on m and reduce it to the √m dependence of the randomized algorithms.

The deterministic algorithm for non-smooth problems that we present in Section 3 is also of interest in its own right. It avoids randomization, which is not usually problematic, but makes it fully parallelizable, unlike the optimal stochastic methods. Consider, for example, a supervised learning problem where f_i(x) = ℓ(⟨φ_i, x⟩, y_i) is the (non-smooth) loss on a single training example (φ_i, y_i), and the data is distributed across m machines. Calculating a prox oracle involves applying the Fenchel conjugate of the loss function ℓ, but even if a closed form is not available, this is often easy to compute numerically, and is used in algorithms such as SDCA. But unlike SDCA, which is inherently sequential, we can calculate all m prox operations in parallel on the different machines, average the resulting gradients of the smoothed function, and take an accelerated gradient step to implement our optimal deterministic algorithm.
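As a rough sketch of the parallel scheme just described (our own illustration: each "machine" is simulated with a thread, the components and their closed-form proxes are toy assumptions, and for brevity we take a plain averaged prox-gradient step rather than the accelerated step of the actual method):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def distributed_smoothed_step(proxes, x, beta, step):
    """One gradient step on the smoothed average F^(beta): each 'machine' i computes
    its local prox in parallel, grad f_i^(beta)(x) = beta*(x - prox_{f_i}(x, beta)) is
    formed, the gradients are averaged, and the center updates x."""
    with ThreadPoolExecutor() as pool:
        grads = list(pool.map(lambda p: beta * (x - p(x, beta)), proxes))
    return x - step * np.mean(grads, axis=0)

# Toy components f_i(x) = ||x - a_i||_1, whose prox is soft-thresholding toward a_i
# (closed form, so no numerical solver is needed on the machines).
def make_prox(a):
    return lambda x, beta: a + np.sign(x - a) * np.maximum(np.abs(x - a) - 1.0 / beta, 0.0)

anchors = [np.array([1.0, 0.0]), np.array([-1.0, 2.0]), np.array([0.0, 1.0])]
proxes = [make_prox(a) for a in anchors]
x = np.zeros(2)
for _ in range(200):
    x = distributed_smoothed_step(proxes, x, beta=20.0, step=1.0 / 20.0)
print(x)   # approaches the coordinate-wise median of the anchors, roughly (0, 1)
```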
This method attains a recent lower bound for distributed optimization, resolving a question raised by Arjevani and Shamir [5], and when the number of machines is very large improves over all other known distributed optimization methods for the problem.

In studying finite sum problems, we were forced to explicitly study lower bounds for randomized optimization as opposed to stochastic optimization (where the source of randomness is the oracle, not the algorithm). Even for the classic problem of minimizing a smooth function using a first order oracle, we could not locate a published proof that applies to randomized algorithms. We provide a simple construction using ε-insensitive differences that allows us to easily obtain such lower bounds without reverting to assuming the iterates are spanned by previous responses (as was done, e.g., in [9]), and could potentially be useful for establishing randomized lower bounds also in other settings.

Acknowledgements: We thank Ohad Shamir for his helpful discussions and for pointing out [4].

References

[1] Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. arXiv preprint, 2014.
[2] Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, 2009.
[3] Zeyuan Allen-Zhu. Katyusha: The first truly accelerated stochastic gradient descent. arXiv preprint, 2016.
[4] Zeyuan Allen-Zhu and Elad Hazan. Optimal black-box reductions between optimization objectives. arXiv preprint, 2016.
[5] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, 2015.
[6] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Science & Business Media, 2011.
[7] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[8] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
[9] Guanghui Lan. An optimal randomized incremental gradient method. arXiv preprint, 2015.
[10] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.
[11] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[12] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372-376, 1983.
[13] Francesco Orabona, Andreas Argyriou, and Nathan Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv preprint, 2012.
[14] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.
[15] Shai Shalev-Shwartz. Stochastic optimization for machine learning. Slides of presentation at Optimization Without Borders, 2016.
[16] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567-599, 2013.
[17] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.
[18] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using an approximate Newton-type method. arXiv preprint, 2013.
[19] Bin Yu. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer, 1997.
[20] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. arXiv preprint, 2015.


More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

STOPPING SIMULATED PATHS EARLY

STOPPING SIMULATED PATHS EARLY Proceedings of the 2 Winter Siulation Conference B.A.Peters,J.S.Sith,D.J.Medeiros,andM.W.Rohrer,eds. STOPPING SIMULATED PATHS EARLY Paul Glasseran Graduate School of Business Colubia University New Yor,

More information

Synthetic Generation of Local Minima and Saddle Points for Neural Networks

Synthetic Generation of Local Minima and Saddle Points for Neural Networks uigi Malagò 1 Diitri Marinelli 1 Abstract In this work-in-progress paper, we study the landscape of the epirical loss function of a feed-forward neural network, fro the perspective of the existence and

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Least Squares Fitting of Data

Least Squares Fitting of Data Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a

More information

A Simple Homotopy Algorithm for Compressive Sensing

A Simple Homotopy Algorithm for Compressive Sensing A Siple Hootopy Algorith for Copressive Sensing Lijun Zhang Tianbao Yang Rong Jin Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China Departent of Coputer

More information

time time δ jobs jobs

time time δ jobs jobs Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010

A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING. Emmanuel J. Candès Yaniv Plan. Technical Report No November 2010 A PROBABILISTIC AND RIPLESS THEORY OF COMPRESSED SENSING By Eanuel J Candès Yaniv Plan Technical Report No 200-0 Noveber 200 Departent of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes

Graphical Models in Local, Asymmetric Multi-Agent Markov Decision Processes Graphical Models in Local, Asyetric Multi-Agent Markov Decision Processes Ditri Dolgov and Edund Durfee Departent of Electrical Engineering and Coputer Science University of Michigan Ann Arbor, MI 48109

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

A Theoretical Framework for Deep Transfer Learning

A Theoretical Framework for Deep Transfer Learning A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il

More information

Mixed Robust/Average Submodular Partitioning

Mixed Robust/Average Submodular Partitioning Mixed Robust/Average Subodular Partitioning Kai Wei 1 Rishabh Iyer 1 Shengjie Wang 2 Wenruo Bai 1 Jeff Biles 1 1 Departent of Electrical Engineering, University of Washington 2 Departent of Coputer Science,

More information

Data Dependent Convergence For Consensus Stochastic Optimization

Data Dependent Convergence For Consensus Stochastic Optimization Data Dependent Convergence For Consensus Stochastic Optiization Avleen S. Bijral, Anand D. Sarwate, Nathan Srebro Septeber 8, 08 arxiv:603.04379v ath.oc] Sep 06 Abstract We study a distributed consensus-based

More information