Sublinear Optimization for Machine Learning


Sublinear Optimization for Machine Learning

Kenneth L. Clarkson, IBM Almaden Research Center
Elad Hazan, Technion - Israel Institute of Technology
David P. Woodruff, IBM Almaden Research Center

In this paper we describe and analyze sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions of these problems, such as SVDD, hard margin SVM, and L2-SVM, for which sublinear-time algorithms were not known before. These new algorithms use a combination of novel sampling techniques and a new multiplicative update algorithm. We give lower bounds which show the running times of many of our algorithms to be nearly best possible in the unit-cost RAM model.

1. INTRODUCTION

Linear classification is a fundamental problem of machine learning, in which positive and negative examples of a concept are represented in Euclidean space by their feature vectors, and we seek to find a hyperplane separating the two classes of vectors. The Perceptron Algorithm for linear classification is one of the oldest algorithms studied in machine learning [Novikoff 1963; Minsky and Papert 1988]. It can be used to efficiently give a good approximate solution, if one exists, and has nice noise-stability properties which allow it to be used as a subroutine in many applications, such as learning with noise [Bylander 1994; Blum et al. 1998], boosting [Servedio 1999], and more general optimization [Dunagan and Vempala 2004]. In addition, it is extremely simple to implement: the algorithm starts with an arbitrary hyperplane, and iteratively finds a vector on which it errs, and moves in the direction of this vector by adding a multiple of it to the normal vector of the current hyperplane.

The standard implementation of the Perceptron Algorithm must iteratively find a bad vector which is classified incorrectly, that is, for which the inner product with the current normal vector has an incorrect sign. Our new algorithm is similar to the Perceptron Algorithm, in that it maintains a hyperplane and modifies it iteratively, according to the examples seen. However, instead of explicitly finding a bad vector, we run another dual learning algorithm to learn the most adversarial distribution over the vectors, and use that distribution to generate an expected bad vector. Moreover, we do not compute the inner products with the current normal vector exactly, but instead estimate them using a fast sampling-based scheme. Thus our update to the hyperplane uses a vector whose badness is determined quickly, but very crudely. We show that despite this, an approximate solution is still obtained in about the same number of iterations as the standard perceptron. So our algorithm is faster; notably, it can be executed in time sublinear in the size of the input data, and still have good output, with high probability. (Here we must make some reasonable assumptions about the way in which the data is stored, as discussed below.)

Part of this work was done while E. Hazan was at IBM Almaden Research Center. He is currently supported by Israel Science Foundation grant 810/11.

Fig. 1. Our results; the parameters are defined in the relevant sections.
- Linear classification: previous time Õ(ε^{-2} M) [Novikoff 1963]; time here Õ(ε^{-2}(n + d)) (§2.1); lower bound Ω(ε^{-2}(n + d)) (§6.1).
- Minimum enclosing ball (MEB): previous time Õ(ε^{-1/2} M) [Saha and Vishwanathan 2009]; time here Õ(ε^{-2} n + ε^{-1} d) (§3.1); lower bound Ω(ε^{-2} n + ε^{-1} d) (§6.2).
- QP in the simplex: previous time O(ε^{-1} M) [Frank and Wolfe 1956]; time here Õ(ε^{-2} n + ε^{-1} d) (§3.3).
- Las Vegas versions: additive O(M) (Cor. 2.11); lower bound Ω(M) (§6.4).
- Kernelized MEB and QP: extra factors of O(s^4) or O(q) (§5).

This technique applies more generally than to the perceptron: we also obtain sublinear-time approximation algorithms for the related problems of finding an approximate Minimum Enclosing Ball (MEB) of a set of points, and training a Support Vector Machine (SVM), in the hard margin or L2-SVM formulations. We give lower bounds that imply that our algorithms for classification are best possible, up to polylogarithmic factors, in the unit-cost RAM model, while our bounds for MEB are best possible up to an Õ(ε^{-1}) factor. For most of these bounds, we give a family of inputs such that a single coordinate, randomly planted over a large collection of input vector coordinates, determines the output to such a degree that all coordinates in the collection must be examined for even a 2/3 probability of success. Our approach can be extended to give algorithms for the kernelized versions of these problems, for some popular kernels including the Gaussian and polynomial, and also easily gives Las Vegas results, where the output guarantees always hold, and only the running time is probabilistic.¹

Our main results are given in Figure 1, using the following notation: all the problems we consider have an n × d matrix A as input, with M nonzero entries, and with each row of A having Euclidean length no more than one. The parameter ε > 0 is the additive error; for MEB, this can be a relative error, after a simple O(M) preprocessing step. We use the asymptotic notation Õ(f) = O(f · polylog(n/ε)). The parameter σ is the margin of the problem instance, explained below. The parameters s and q determine the standard deviation of a Gaussian kernel, and the degree of a polynomial kernel, respectively. The time bounds given for our algorithms, except the Las Vegas ones, are under the assumption of constant error probability; for output guarantees that hold with probability 1 − δ, our bounds should be multiplied by log(n/δ). The time bounds also require the assumption that the input data is stored in such a way that a given entry A_{i,j} can be recovered in constant time. This can be done by, for example, keeping each row A_i of A as a hash table. (Simply keeping the entries of the row in sorted order by column number is also sufficient, incurring an O(log d) overhead in running time for binary search.)

¹For MEB and the kernelized versions, we assume that the Euclidean norms of the relevant input vectors are known. Even with the addition of this linear-time step, all our algorithms improve on prior bounds, with the exception of MEB when M = o(ε^{-3/2}(n + d)).
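To illustrate the storage assumption, here is a minimal sketch (not from the paper) of the hash-table row layout in Python; the class and method names are ours.

class SparseRows:
    """Rows of an n x d matrix kept as dicts, giving O(1) expected entry lookup."""
    def __init__(self, rows):
        # rows: one dict per example A_i, mapping column index -> nonzero value
        self.rows = rows

    def entry(self, i, j):
        # Return A_{i,j}; columns absent from the dict are zeros.
        return self.rows[i].get(j, 0.0)

# Example: a 2 x 5 matrix with three nonzero entries.
A = SparseRows([{0: 0.6, 3: 0.8}, {1: 1.0}])
assert A.entry(0, 3) == 0.8 and A.entry(1, 4) == 0.0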

Formal Description: Classification. In the linear classification problem, the learner is given a set of n labeled examples in the form of d-dimensional vectors, comprising the input matrix A. The labels comprise a vector y ∈ {+1, −1}^n. The goal is to find a separating hyperplane, that is, a normal vector x in the unit Euclidean ball B such that for all i, y(i) A_i x ≥ 0; here y(i) denotes the i-th coordinate of y. As mentioned, we will assume throughout that A_i ∈ B for all i ∈ [n], where generally [m] denotes the set of integers {1, 2, ..., m}. As is standard, we may assume that the labels y(i) are all 1, by taking A_i ← −A_i for any i with y(i) = −1. The approximation version of linear classification is to find a vector x_ε ∈ B that is an ε-approximate solution, that is,

    min_i A_i x_ε ≥ max_{x ∈ B} min_i A_i x − ε.    (1)

The optimum for this formulation is obtained when ||x|| = 1, except when no separating hyperplane exists, and then the optimum x is the zero vector. Note that min_i A_i x = min_{p ∈ Δ} p^T A x, where Δ ⊆ R^n is the unit simplex {p ∈ R^n : p_i ≥ 0, Σ_i p_i = 1}. Thus we can regard the optimum as the outcome of a game to determine p^T A x, between a minimizer choosing p ∈ Δ, and a maximizer choosing x ∈ B, yielding

    σ ≡ max_{x ∈ B} min_{p ∈ Δ} p^T A x,

where this optimum σ is called the margin. From standard duality results, σ is also the optimum of the dual problem min_{p ∈ Δ} max_{x ∈ B} p^T A x, and the optimum vectors p and x are the same for both problems.

The classical Perceptron algorithm returns an ε-approximate solution to this problem in 1/ε² iterations², and total time O(ε^{-2} M). For given δ ∈ (0, 1), our new algorithm takes O(ε^{-2}(n + d)(log n) log(n/δ)) time to return an ε-approximate solution with probability at least 1 − δ. Further, we show this is optimal in the unit-cost RAM model, up to poly-logarithmic factors.

Formal Description: Minimum Enclosing Ball (MEB). The MEB problem is to find the smallest Euclidean ball in R^d containing the rows of A. It is a special case of quadratic programming (QP) in the unit simplex, namely, to find min_{p ∈ Δ} p^T b + p^T A A^T p, where b is an n-vector. This relationship, and the generalization of our MEB algorithm to QP in the simplex, is discussed in §3.3; for more general background on QP in the simplex, and related problems, see for example [Clarkson 2008].

1.1. Related work

Perhaps the most closely related work is that of [Grigoriadis and Khachiyan 1995], who showed how to approximately solve a zero-sum game up to additive precision ε in time Õ(ε^{-2}(n + d)), where the game matrix is n × d. This problem is analogous to ours, and our algorithm is similar in structure to theirs, but where we minimize over p ∈ Δ and maximize over x ∈ B, their optimization has not only p but also x in a unit simplex. Their algorithm (and ours) relies on sampling based on x and p, to estimate inner products x^T v or p^T w for vectors v and w that are rows or columns of A. For a vector p ∈ Δ, this estimation is easily done by returning w(i) with probability p(i).

²Strictly speaking, this is true only for ε equal to the margin, denoted σ and defined below. Yet a slight modification of the perceptron gives this running time for any small enough ε > 0.
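As a concrete illustration of the ℓ1-sampling estimator just described (return w(i) with probability p(i)), here is a minimal Python sketch; the function name is ours.

import random

def l1_sample_dot(p, w, rng=random):
    """One-coordinate unbiased estimate of p^T w, for p in the unit simplex.

    Pick index i with probability p(i) and return w(i); the expectation over
    the choice of i is exactly sum_i p(i)*w(i), and the estimate is bounded
    in magnitude by max_i |w(i)|.
    """
    i = rng.choices(range(len(p)), weights=p, k=1)[0]
    return w[i]

# Averaging independent estimates concentrates around p^T w:
p = [0.5, 0.25, 0.25]
w = [0.2, -0.4, 0.8]
est = sum(l1_sample_dot(p, w) for _ in range(20000)) / 20000
# est is close to 0.5*0.2 + 0.25*(-0.4) + 0.25*0.8 = 0.2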

For vectors x ∈ B, however, the natural estimation technique is to pick i with probability x(i)²/||x||², and return v(i) ||x||²/x(i). The estimator from this ℓ2 sample is less well-behaved, since it is unbounded, and can have a high variance. While ℓ2 sampling has been used in streaming applications [Monemizadeh and Woodruff 2010], it has not previously found applications in optimization due to this high variance problem. Indeed, it might seem surprising that sublinearity is at all possible, given that the correct classifier might be determined by very few examples, as shown in Figure 2. It thus seems necessary to go over all examples at least once, instead of looking at noisy estimates based on sampling.

Fig. 2. The optimum x* is determined by the vectors near the horizontal axis.

However, as we show, in our setting there is a version of the fundamental Multiplicative Weights (MW) technique that can cope with unbounded updates, and for which the variance of ℓ2-sampling is manageable. In our version of MW, the multiplier associated with a value z is quadratic in z, in contrast to the more standard multiplier that is exponential in z; while the latter is a fundamental building block in approximate optimization algorithms, as discussed in [Plotkin et al. 1991], in our setting such exponential updates can lead to a prohibitively large number of iterations. We analyze MW from the perspective of on-line optimization, and show that our version of MW has low expected regret given only that the random updates have the variance bounds provable for ℓ2 sampling. We also use another technique from on-line optimization, a gradient descent variant which is better suited for the ball.

For the special case of zero-sum games in which the entries are all non-negative (this is equivalent to packing and covering linear programs), [Koufogiannakis and Young 2007] give a sublinear-time algorithm which returns a relative approximation in time Õ(ε^{-2}(n + d)). Our lower bounds show that a similar relative approximation bound for sublinear algorithms is impossible for general classification, and hence general linear programming.

2. LINEAR CLASSIFICATION AND THE PERCEPTRON

Before our algorithm, some reminders and further notation: Δ ⊆ R^n is the unit simplex {p ∈ R^n : p_i ≥ 0, Σ_i p_i = 1}, B ⊆ R^d is the Euclidean unit ball, and the unsubscripted ||x|| denotes the Euclidean norm ||x||_2. The n-vector all of whose entries are one is denoted by 1_n. The i-th row of the input matrix A is denoted A_i, although a vector is a column vector unless otherwise indicated. The i-th coordinate of vector v is denoted v(i). For a vector v, we let v² denote the vector whose coordinates have v²(i) = v(i)² for all i.
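The following is a minimal Python sketch (names ours) of the ℓ2-sampling estimator described above: it is unbiased for v^T x, its second moment is ||v||² ||x||² (at most one when both vectors lie in the unit ball), but a single estimate is unbounded.

import random

def l2_sample_dot(x, v, rng=random):
    """Unbiased estimate of v^T x via l2-sampling.

    Pick j with probability x(j)^2 / ||x||^2 and return v(j) * ||x||^2 / x(j).
    Then E[estimate] = v^T x and E[estimate^2] = ||v||^2 * ||x||^2.
    """
    norm_sq = sum(xj * xj for xj in x)
    if norm_sq == 0.0:
        return 0.0                      # x = 0, so v^T x = 0 exactly
    j = rng.choices(range(len(x)), weights=[xj * xj for xj in x], k=1)[0]
    return v[j] * norm_sq / x[j]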

2.1. The Sublinear Perceptron

Our sublinear perceptron algorithm is given as Algorithm 1. The algorithm maintains a vector w_t ∈ R^n, with nonnegative coordinates, and also p_t ∈ Δ, which is w_t scaled to have unit ℓ1 norm. A vector y_t ∈ R^d is maintained also, and x_t, which is y_t scaled to have Euclidean norm no larger than one. These normalizations are done on line 4. In lines 5 and 6, the algorithm is updating y_t by adding a row of A randomly chosen using p_t. This is a randomized version of Online Gradient Descent (OGD); due to the random choice of i_t, A_{i_t} is an unbiased estimator of p_t^T A, which is the gradient of p_t^T A y with respect to y. In lines 7 through 12, the algorithm is updating w_t using a column j_t of A randomly chosen based on x_t, and also using the value x_t(j_t). This is a version of the Multiplicative Weights (MW) technique for online optimization in the unit simplex, where v_t is an unbiased estimator of A x_t, the gradient of p^T A x_t with respect to p. Actually, v_t is not unbiased, after the clip operation: for z, V ∈ R, clip(z, V) ≡ min{V, max{−V, z}}, and our analysis is helped by clipping the entries of v_t; we show that the resulting slight bias is not harmful.

As discussed in §1.1, the sampling used to choose j_t (and update p_t) is ℓ2-sampling, and that for i_t, ℓ1-sampling. These techniques, which can be regarded as special cases of an ℓ_p-sampling technique, for p ∈ [1, 2], yield unbiased estimators of vector dot products. It is important for us also that ℓ2-sampling has a variance bound here; in particular, for each relevant i and t,

    E[v_t(i)²] ≤ ||A_i||² ||x_t||² ≤ 1.    (2)

Algorithm 1 Sublinear Perceptron
1: Input: ε > 0, A ∈ R^{n×d} with A_i ∈ B for i ∈ [n].
2: Let T ← 200² ε^{-2} log n, y_1 ← 0, w_1 ← 1_n, η ← √((log n)/T).
3: for t = 1 to T do
4:    p_t ← w_t/||w_t||_1, x_t ← y_t/max{1, ||y_t||}.
5:    Choose i_t ∈ [n] by i_t ← i with probability p_t(i).
6:    y_{t+1} ← y_t + (1/√(2T)) A_{i_t}
7:    Choose j_t ∈ [d] by j_t ← j with probability x_t(j)²/||x_t||².
8:    for i ∈ [n] do
9:       ṽ_t(i) ← A_i(j_t) ||x_t||²/x_t(j_t)
10:      v_t(i) ← clip(ṽ_t(i), 1/η)
11:      w_{t+1}(i) ← w_t(i)(1 − η v_t(i) + η² v_t(i)²)
12:   end for
13: end for
14: return x̄ = (1/T) Σ_t x_t

First we note the running time.

THEOREM 2.1. The sublinear perceptron takes O(ε^{-2} log n) iterations, with a total running time of O(ε^{-2}(n + d) log n).

PROOF. The algorithm iterates T = O(ε^{-2} log n) times. Each iteration requires:
(1) One ℓ2 sample per iterate, which takes O(d) time using known data structures.

(2) Sampling i_t ∼ p_t, which takes O(n) time.
(3) The update of x_t and p_t, which takes O(n + d) time.
The total running time is O(ε^{-2}(n + d) log n).

Next we analyze the output quality. The proof uses new tools from regret minimization and sampling that are the building blocks of most of our upper bound results. Let us first state the MW algorithm used in all our algorithms.

Definition 2.2 (MW algorithm). Consider a sequence of vectors q_1, ..., q_T ∈ R^n. The Multiplicative Weights (MW) algorithm is as follows. Let w_1 ← 1_n, and for t ≥ 1, 0 < η ∈ R, and all i ∈ [n],

    p_t ← w_t/||w_t||_1,    (3)
    w_{t+1}(i) ← w_t(i)(1 − η q_t(i) + η² q_t(i)²).    (4)

The following is a key lemma, which proves a novel bound on the regret of the MW algorithm above, suitable for the case where the losses are random variables with bounded variance. As opposed to previous multiplicative-updates algorithms, this is the only MW algorithm we are familiar with that does not require an upper bound on the losses/payoffs. The proof is deferred to after the main theorem and its proof.

LEMMA 2.3 (VARIANCE MW LEMMA). The MW algorithm satisfies (recall that v² denotes the vector with v²(i) = v(i)²)

    Σ_t p_t^T q_t ≤ min_{i ∈ [n]} Σ_t max{q_t(i), −1/η} + (log n)/η + η Σ_t p_t^T q_t².

The following three lemmas give concentration bounds on our random variables from their expectations. The first two are based on standard martingale analysis, and the last is a simple Markov application.

LEMMA 2.4. For η ≤ √((log n)/T), with probability at least 1 − O(1/n),

    max_{i ∈ [n]} Σ_t [v_t(i) − A_i x_t] ≤ 4ηT.

LEMMA 2.5. For η ≤ √((log n)/T), with probability at least 1 − O(1/n), it holds that

    | Σ_t A_{i_t} x_t − Σ_t p_t^T v_t | ≤ 10ηT.

LEMMA 2.6. With probability at least 1 − 1/4, it holds that Σ_t p_t^T v_t² ≤ 8T.

THEOREM 2.7 (MAIN THEOREM). With probability 1/2, the sublinear perceptron returns a solution x̄ that is an ε-approximation.

PROOF. First we use the regret bounds for lazy gradient descent to lower bound Σ_t A_{i_t} x_t, next we get an upper bound for that quantity using Lemma 2.3, and then we combine the two. By definition, A_i x* ≥ σ for all i ∈ [n], and so, using the bound of Lemma A.2,

    T σ ≤ max_{x ∈ B} Σ_t A_{i_t} x ≤ Σ_t A_{i_t} x_t + √(2T),    (5)

or rearranging,

    Σ_t A_{i_t} x_t ≥ T σ − √(2T).    (6)

Now we turn to the MW part of our algorithm. By the Variance MW Lemma 2.3, and using the clipping of v_t(i),

    Σ_t p_t^T v_t ≤ min_{i ∈ [n]} Σ_t v_t(i) + (log n)/η + η Σ_t p_t^T v_t².

By Lemma 2.4 above, with high probability, for any i ∈ [n], Σ_t v_t(i) ≤ Σ_t A_i x_t + 4ηT, so that with high probability

    Σ_t p_t^T v_t ≤ min_{i ∈ [n]} Σ_t A_i x_t + (log n)/η + η Σ_t p_t^T v_t² + 4Tη.    (7)

Combining (6) and (7) we get

    min_{i ∈ [n]} Σ_t A_i x_t ≥ −(log n)/η − η Σ_t p_t^T v_t² − 4Tη + T σ − √(2T) + Σ_t p_t^T v_t − Σ_t A_{i_t} x_t.

By Lemmas 2.5 and 2.6 we have, with probability at least 3/4 − O(1/n) ≥ 1/2,

    min_{i ∈ [n]} Σ_t A_i x_t ≥ −(log n)/η − 8ηT − 4Tη + T σ − √(2T) − 10ηT ≥ T σ − (log n)/η − O(ηT).

Dividing through by T, and using our choice of η and T, we have min_i A_i x̄ ≥ σ − ε/2 with probability at least 1/2, as claimed.

PROOF OF LEMMA 2.3. We first show an upper bound on log ||w_{T+1}||_1, then a lower bound, and then relate the two. From (4) and (3) we have

    ||w_{t+1}||_1 = Σ_{i ∈ [n]} w_{t+1}(i)
                 = Σ_{i ∈ [n]} p_t(i) ||w_t||_1 (1 − η q_t(i) + η² q_t(i)²)
                 = ||w_t||_1 (1 − η p_t^T q_t + η² p_t^T q_t²).

This implies by induction on t, and using 1 + z ≤ exp(z) for z ∈ R, that

    log ||w_{T+1}||_1 = log n + Σ_t log(1 − η p_t^T q_t + η² p_t^T q_t²) ≤ log n + Σ_t [−η p_t^T q_t + η² p_t^T q_t²].    (8)

Now for the lower bound. From (4) we have by induction on t that

    w_{T+1}(i) = Π_t (1 − η q_t(i) + η² q_t(i)²),

and so

    log ||w_{T+1}||_1 = log Σ_{i ∈ [n]} Π_t (1 − η q_t(i) + η² q_t(i)²)
                      ≥ log max_{i ∈ [n]} Π_t (1 − η q_t(i) + η² q_t(i)²)
                      = max_{i ∈ [n]} Σ_t log(1 − η q_t(i) + η² q_t(i)²)
                      ≥ max_{i ∈ [n]} Σ_t min{−η q_t(i), 1},

where the last inequality uses the fact that 1 + z + z² ≥ exp(min{z, 1}) for all z ∈ R. Putting this together with the upper bound (8), we have

    max_{i ∈ [n]} Σ_t min{−η q_t(i), 1} ≤ log n − η Σ_t p_t^T q_t + η² Σ_t p_t^T q_t².

Changing sides,

    η Σ_t p_t^T q_t ≤ −max_{i ∈ [n]} Σ_t min{−η q_t(i), 1} + log n + η² Σ_t p_t^T q_t²
                    = min_{i ∈ [n]} Σ_t max{η q_t(i), −1} + log n + η² Σ_t p_t^T q_t²,

and the lemma follows, dividing through by η.

COROLLARY 2.8 (DUAL SOLUTION). The vector p̄ ≡ Σ_t e_{i_t}/T is, with probability 1/2, an O(ε)-approximate dual solution.

PROOF. Observing in (5) that the middle expression max_{x ∈ B} Σ_t A_{i_t} x is equal to T max_{x ∈ B} p̄^T A x, we have

    T max_{x ∈ B} p̄^T A x ≤ Σ_t A_{i_t} x_t + √(2T),

or changing sides,

    Σ_t A_{i_t} x_t ≥ T max_{x ∈ B} p̄^T A x − √(2T).

Recall from (7) that with high probability,

    Σ_t p_t^T v_t ≤ min_{i ∈ [n]} Σ_t A_i x_t + (log n)/η + η Σ_t p_t^T v_t² + 4Tη.

Following the proof of the main theorem, we combine both inequalities and use Lemmas 2.5 and 2.6, so that with probability at least 1/2:

    T max_{x ∈ B} p̄^T A x ≤ min_{i ∈ [n]} Σ_t A_i x_t + (log n)/η + η Σ_t p_t^T v_t² + O(Tη) + √(2T) + Σ_t p_t^T v_t − Σ_t A_{i_t} x_t
                          ≤ T σ + O(√(T log n)).

Dividing through by T we have, with probability at least 1/2, that max_{x ∈ B} p̄^T A x ≤ σ + O(ε) for our choice of T and η.
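Putting the pieces together, the following is a compact, unoptimized Python/NumPy sketch of Algorithm 1. It uses dense arrays and recomputes quantities that an efficient implementation would maintain incrementally with the O(1)-access row structures discussed in Section 1, so it is for illustration only; the iteration count and step sizes follow the pseudocode above.

import numpy as np

def sublinear_perceptron(A, eps, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1: returns an approximate max-margin direction x_bar."""
    n, d = A.shape
    T = int(np.ceil(200**2 * np.log(n) / eps**2))   # iterations, as on line 2
    eta = np.sqrt(np.log(n) / T)
    w = np.ones(n)                                  # MW weights over examples (dual)
    y = np.zeros(d)                                 # unnormalized primal iterate
    x_sum = np.zeros(d)
    for _ in range(T):
        p = w / w.sum()
        x = y / max(1.0, np.linalg.norm(y))
        x_sum += x
        # OGD step: sample a row i_t ~ p_t and move toward it (lines 5-6).
        i_t = rng.choice(n, p=p)
        y = y + A[i_t] / np.sqrt(2 * T)
        # MW step from one l2-sampled coordinate j_t (lines 7-11).
        norm_sq = float(x @ x)
        if norm_sq > 0:
            j_t = rng.choice(d, p=(x * x) / norm_sq)
            v = A[:, j_t] * norm_sq / x[j_t]        # crude estimates of A_i x_t
        else:
            v = np.zeros(n)
        v = np.clip(v, -1.0 / eta, 1.0 / eta)
        w = w * (1.0 - eta * v + (eta * v) ** 2)
    return x_sum / T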

2.2. High Success Probability and Las Vegas

Given two vectors u, v ∈ B, we have seen that a single ℓ2-sample is an unbiased estimator of their inner product with variance at most one. Averaging 1/ε² such samples reduces the variance to ε², which reduces the standard deviation to ε. Repeating O(log(1/δ)) such estimates, and taking the median, gives an estimator denoted X_{ε,δ}, which satisfies, via a Chernoff bound:

    Pr[ |X_{ε,δ} − v^T u| > ε ] ≤ δ.

As an immediate corollary of this fact we obtain:

COROLLARY 2.9. There exists a randomized algorithm that, with probability 1 − δ, successfully determines whether a given hyperplane with normal vector x ∈ B, together with an instance of linear classification and parameter σ > 0, is an ε-approximate solution. The algorithm runs in time O(d + n ε^{-2} log(n/δ)).

PROOF. Let δ' = δ/n. Generate the random variable X_{ε,δ'} for each inner-product pair x, A_i, and return true if and only if X_{ε,δ'} ≥ σ − ε for each pair. By the observation above and taking a union bound over all n inner products, with probability 1 − δ the estimate X_{ε,δ'} was ε-accurate for all inner-product pairs, and hence the algorithm returned a correct answer. The running time includes preprocessing of x in O(d) time, and n inner-product estimates, for a total of O(d + n ε^{-2} log(n/δ)).

Hence, we can amplify the success probability of Algorithm 1 to 1 − δ for any δ > 0, albeit incurring additional poly-log factors in running time:

COROLLARY 2.10 (HIGH PROBABILITY). There exists a randomized algorithm that with probability 1 − δ returns an ε-approximate solution to the linear classification problem, and runs in expected time O((n + d) ε^{-2} log(n/δ)).

PROOF. Run Algorithm 1 log(1/δ) times to generate that many candidate solutions. By Theorem 2.7, at least one candidate solution is an ε-approximate solution with probability at least 1 − 2^{−log(1/δ)} = 1 − δ. For each candidate solution apply the verification procedure above with success probability 1 − δ/log(1/δ), so that all verifications are correct, again with probability at least 1 − δ. Hence, both events hold with probability at least 1 − 2δ. The result follows after adjusting constants. The worst-case running time comes to O((n + d) ε^{-2} log(n/δ) log(1/δ)). However, we can generate the candidate solutions and verify them one at a time, rather than all at once. The expected number of candidates we need to generate is constant.

It is also possible to obtain an algorithm that never errs:

COROLLARY 2.11 (LAS VEGAS VERSION). After O(ε^{-2} log n) iterations, the sublinear perceptron returns a solution that with probability 1/2 can be verified in O(M) time to be ε-approximate. Thus with expected O(1) repetitions, and a total of expected O(M + ε^{-2}(n + d) log n) work, a verified ε-approximate solution can be found.

PROOF. We have, for any x ∈ B and p ∈ Δ,

    min_i A_i x ≤ σ ≤ max_{x' ∈ B} p^T A x',

and so if

    min_i A_i x ≥ max_{x' ∈ B} p^T A x' − ε,    (9)

then x is an ε-approximate solution, and x will pass this test if it and p are (ε/2)-approximate solutions, and the same for p. Thus, running the algorithm for a constant factor more iterations, so that with probability 1/2, x̄ and p̄ are both (ε/2)-approximate solutions, it can be verified that both are ε-approximate solutions.

2.3. Further Optimizations

The regret of OGD as given in Lemma A.2 is smaller than that of the dual strategy of random MW. We can take advantage of this and improve the running time slightly, by replacing line 6 of the sublinear algorithm with the line shown below.

6': With probability 1/log T, let y_{t+1} ← y_t + (1/√(2T)) A_{i_t} (else do nothing).

This has the effect of increasing the regret of the primal online algorithm by a log n factor, which does not hurt the number of iterations required to converge, since the overall regret is dominated by that of the MW algorithm. Since the primal solution x_t is not updated in every iteration, we improve the running time slightly to O(ε^{-2} log n · (n + d/(log(1/ε) + log log n))). We use this technique to greater effect for the MEB problem, where it is discussed in more detail.

2.4. Implications in the PAC model

Consider the separable case of hyperplane learning, in which there exists a hyperplane classifying all data points correctly. It is well known that the concept class of hyperplanes in d dimensions with margin σ has effective dimension at most min{d, 1/σ²} + 1. Consider the case in which the margin is significant, i.e. σ ≥ 1/√d. PAC learning theory implies that the number of examples needed to attain generalization error δ is O(1/(σ²δ)). Using the method of online-to-batch conversion (see [Cesa-Bianchi et al. 2004]), and applying the online gradient descent algorithm, it is possible to obtain δ generalization error in time O(d/(σ²δ)), by going over the data once and performing a gradient step on each example.

Our algorithm improves upon this running time bound as follows: we use the sublinear perceptron to compute a σ/2-approximation to the best hyperplane over the training data, where the number of examples is taken to be n = O(1/(σ²δ)) (in order to obtain δ generalization error). As shown previously, the total running time amounts to Õ(σ^{-2}(1/(σ²δ) + d)) = Õ(1/(σ⁴δ) + d/σ²). This improves upon standard methods by a factor of Õ(σ²d), which is always an improvement by our initial assumption on σ and d.
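Returning to the estimator X_{ε,δ} used in Section 2.2, the following is a minimal Python sketch of the median-of-means construction (function names and the constant in the repetition count are ours, chosen for illustration).

import math, random, statistics

def _l2_sample(x, v, rng):
    # One l2-sample of v^T x, as in the sketch in Section 2.
    norm_sq = sum(t * t for t in x)
    if norm_sq == 0.0:
        return 0.0
    j = rng.choices(range(len(x)), weights=[t * t for t in x], k=1)[0]
    return v[j] * norm_sq / x[j]

def estimate_dot(x, v, eps, delta, rng=random):
    """Median-of-means estimator X_{eps,delta} of v^T x.

    Averages about 1/eps^2 l2-samples so the standard deviation is about eps,
    repeats O(log(1/delta)) times, and returns the median, so that
    Pr[|estimate - v^T x| > eps] <= delta by a Chernoff bound.
    """
    reps = max(1, math.ceil(8 * math.log(1.0 / delta)))
    k = max(1, math.ceil(1.0 / eps ** 2))
    means = [sum(_l2_sample(x, v, rng) for _ in range(k)) / k for _ in range(reps)]
    return statistics.median(means)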

3. STRONGLY CONVEX PROBLEMS: MEB AND SVM

3.1. Minimum Enclosing Ball

In the Minimum Enclosing Ball problem the input consists of a matrix A ∈ R^{n×d}. The rows are interpreted as vectors and the problem is to find a vector x ∈ R^d such that

    x* ∈ argmin_{x ∈ R^d} max_{i ∈ [n]} ||x − A_i||².

We further assume for this problem that all vectors A_i have Euclidean norm at most one. Denote by σ ≡ max_{i ∈ [n]} ||x* − A_i||² the squared radius of the optimal ball, and we say that a solution is ε-approximate if the ball it generates has squared radius at most σ + ε. As in the case of linear classification, to obtain tight running time bounds we use a primal-dual approach; the algorithm is given below. (This is a conceptual version of the algorithm: in the analysis of the running time, we use the fact that we can batch together the updates for w_t over the iterations for which x_t does not change.)

Algorithm 2 Sublinear Primal-Dual MEB
1: Input: ε > 0, A ∈ R^{n×d} with A_i ∈ B for i ∈ [n] and ||A_i|| known.
2: Let T ← Θ(ε^{-2} log n), y_1 ← 0, w_1 ← 1_n, η ← √((log n)/T), α ← (log T)/√(T log n).
3: for t = 1 to T do
4:    p_t ← w_t/||w_t||_1
5:    Choose i_t ∈ [n] by i_t ← i with probability p_t(i).
6:    With probability α, update y_{t+1} ← y_t + A_{i_t}, x_{t+1} ← y_{t+1}/t (else do nothing).
7:    Choose j_t ∈ [d] by j_t ← j with probability x_t(j)²/||x_t||².
8:    for i ∈ [n] do
9:       ṽ_t(i) ← −2 A_i(j_t) ||x_t||²/x_t(j_t) + ||A_i||² + ||x_t||².
10:      v_t(i) ← clip(ṽ_t(i), 1/η).
11:      w_{t+1}(i) ← w_t(i)(1 + η v_t(i) + η² v_t(i)²).
12:   end for
13: end for
14: return x̄ = (1/T) Σ_t x_t

THEOREM 3.1. Algorithm 2 runs in O(ε^{-2} log n) iterations, with a total expected running time of

    Õ(ε^{-2} n + ε^{-1} d),

and with probability 1/2, returns an ε-approximate solution.

PROOF. Except for the running time analysis, the proof of this theorem is very similar to that of Theorem 2.7, where we take advantage of a tighter regret bound for strictly convex loss functions in the case of MEB, for which the OGD algorithm with a learning rate of 1/t is known to obtain a tighter regret bound of O(log T) instead of O(√T). For presentation, we use asymptotic notation rather than computing the exact constants (as done for the linear classification problem).

Let f_t(x) = ||x − A_{i_t}||². Notice that argmin_{x ∈ B} Σ_{τ=1}^{t} f_τ(x) = (1/t) Σ_{τ=1}^{t} A_{i_τ}. By Lemma A.5, applied with f_t(x) = ||x − A_{i_t}||², with gradient bound G ≤ 4 and strong-convexity parameter H = 2, and with x* being the solution to the

instance, we have

    E_{c_t}[ Σ_t ||x_t − A_{i_t}||² ] ≤ E_{c_t}[ Σ_t ||x* − A_{i_t}||² ] + (4/α) log T ≤ T σ + (4/α) log T,    (10)

where σ is the squared MEB radius. Here the expectation is taken only over the random coin tosses for updating x_t, denoted c_t, and holds for any outcome of the indices i_t sampled from p_t and the coordinates j_t used for the ℓ2 sampling.

Now we turn to the MW part of our algorithm. By the Variance MW Lemma 2.3, using the clipping of v_t(i), and reversing inequalities to account for the change of sign, we have

    Σ_t p_t^T v_t ≥ max_{i ∈ [n]} Σ_t v_t(i) − O((log n)/η + η Σ_t p_t^T v_t²).

Using Lemmas B.4 and B.5, with high probability,

    ∀ i ∈ [n]: Σ_t v_t(i) ≥ Σ_t ||x_t − A_i||² − O(ηT),   and   | Σ_t ||x_t − A_{i_t}||² − Σ_t p_t^T v_t | = O(ηT).

Plugging these two facts into the previous inequality we have, w.h.p.,

    Σ_t ||x_t − A_{i_t}||² ≥ max_{i ∈ [n]} Σ_t ||x_t − A_i||² − O((log n)/η + η Σ_t p_t^T v_t² + Tη).

This holds w.h.p. over the random choices of {i_t, j_t}, and irrespective of the coin tosses {c_t}. Hence, we can take expectations w.r.t. {c_t}, and obtain

    E_{c_t}[ Σ_t ||x_t − A_{i_t}||² ] ≥ E_{c_t}[ max_{i ∈ [n]} Σ_t ||x_t − A_i||² ] − O((log n)/η + η Σ_t p_t^T v_t² + Tη).    (11)

Combining with equation (10), we obtain that w.h.p. over the random variables {i_t, j_t},

    T σ + (4/α) log T ≥ E_{c_t}[ max_{i ∈ [n]} Σ_t ||x_t − A_i||² ] − O((log n)/η + η Σ_t p_t^T v_t² + Tη).

Rearranging and using Lemma B.8, we have with probability at least 1/2,

    E_{c_t}[ max_{i ∈ [n]} Σ_t ||x_t − A_i||² ] ≤ O(T σ + (log T)/α + (log n)/η + Tη).

Dividing through by T and applying Jensen's inequality, we have

    E[ max_{j ∈ [n]} ||x̄ − A_j||² ] ≤ (1/T) E[ max_{i ∈ [n]} Σ_t ||x_t − A_i||² ] ≤ O(σ + (log T)/(Tα) + (log n)/(Tη) + η).

Optimizing over the values of α, η, and T, this implies that the expected error is O(ε), and so using Markov's inequality, x̄ is an O(ε)-approximate solution with probability at least 1/2.

Running time. The algorithm above consists of T = O(ε^{-2} log n) iterations. Naively, this would result in the same running time as for linear classification. Yet notice that x_t changes only an expected αT times, and only then do we perform an O(d) operation. The expected number of iterations in which x_t changes is αT ≤ 16 ε^{-1} log T, and so the running time is O(ε^{-1}(log T) d + ε^{-2}(log n) n) = Õ(ε^{-2} n + ε^{-1} d).

The following corollary is a direct analogue of Corollary 2.8.

COROLLARY 3.2 (DUAL SOLUTION). The vector p̄ ≡ Σ_t e_{i_t}/T is, with probability 1/2, an O(ε)-approximate dual solution.
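The following is an unoptimized Python/NumPy sketch of Algorithm 2, mainly to illustrate the lazy primal update: the center x_t is refreshed only with probability α per iteration, so only an expected αT iterations pay the O(d) cost. The batching of the w-updates between refreshes, which the running-time analysis relies on, is omitted here, and the constants are illustrative.

import numpy as np

def sublinear_meb(A, eps, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2 (sublinear primal-dual MEB)."""
    n, d = A.shape
    sq_norms = (A * A).sum(axis=1)                 # ||A_i||^2, assumed known
    T = int(np.ceil(np.log(n) / eps ** 2))
    eta = np.sqrt(np.log(n) / T)
    alpha = np.log(T) / np.sqrt(T * np.log(n))
    w = np.ones(n)
    y = np.zeros(d)
    x = np.zeros(d)
    x_sum = np.zeros(d)
    for t in range(1, T + 1):
        p = w / w.sum()
        i_t = rng.choice(n, p=p)
        if rng.random() < alpha:                   # lazy primal (OGD) update
            y = y + A[i_t]
            x = y / t
        x_sum += x
        # MW update from one l2-sampled coordinate: v_t(i) estimates ||x_t - A_i||^2.
        norm_sq = float(x @ x)
        if norm_sq > 0:
            j_t = rng.choice(d, p=(x * x) / norm_sq)
            v = -2.0 * A[:, j_t] * norm_sq / x[j_t] + sq_norms + norm_sq
        else:
            v = sq_norms.copy()                    # x_t = 0, distances are exactly ||A_i||^2
        v = np.clip(v, -1.0 / eta, 1.0 / eta)
        w = w * (1.0 + eta * v + (eta * v) ** 2)
    return x_sum / T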

13 Sublinear Optimization for Machine Learning 0:13 Running time. The algorithm above consists of T = O( log n ε ) iterations. Naively, this woul result in the same running time as for linear classification. Yet notice that x t changes only an expecte αt times, an only then o we perform an O() operation. The expecte number of iterations in which x t changes is αt 16ε 1 log T, an so the running time is O(ε 1 (log T ) + log n ε n)) = Õ(ε n + ε 1 ). The following Corollary is a irect analogue of Corollary.8. COROLLARY 3. (DUAL SOLUTION). 1/, an O(ε)-approximate ual solution. The vector p t e i t /T is, with probability 3.. High Success Probability an Las Vegas As for linear classification, we can amplify the success probability of Algorithm to 1 δ for any δ > 0 albeit incurring aitional poly-log factors in running time. COROLLARY 3.3 (MEB HIGH PROBABILITY). There exists a ranomize algorithm that with probability 1 δ returns an ε-approximate solution to the MEB problem, an runs in expecte time Õ( n ε log n εδ + ε log 1 ε ). There is also a ranomize algorithm that returns an ε-approximate solution in Õ(M + n ε + ε ) time. PROOF. We can estimate the istance between two points in B in O(ε log(1/δ)) time, with error at most ε an failure probability at most δ, using the ot prouct estimator escribe in.. Therefore we can estimate the maximum istance of a given point to every input point in O(nε log(n/δ)) time, with error at most ε an failure probability at most δ. This istance is σ ε, where σ is the optimal raius attainable, w.p. 1 δ. Because Algorithm yiels an ε-ual solution with probability 1/, we can use this solution to verify that the raius of any possible solution to the farthest point is at least σ ε. So, to obtain a solution as escribe in the lemma statement, run Algorithm, an verify that it yiels an ε-approximation, using this approximate ual solution; with probability 1/, this gives a verifie ε-approximation. Keep trying until this succees, in an expecte trials. For a Las Vegas algorithm, we simply apply the same scheme, but verify the istances exactly Convex Quaratic Programming in the Simplex We can exten our approach to problems of the form min p p b + p AA p, (1) where b R n, A R n, an is, as usual, the unit simplex in R n. As is well known, an as we partially review below, this problem inclues the MEB problem, margin estimation as for har margin support vector machines, the L -SVM variant of support vector machines, the problem of fining the shortest vector in a polytope, an others. Applying v x = v v + x x v x 0 with v A p, we have max p Ax x = p AA p, (13) x R

Since

    min_{p ∈ Δ} p^T ( b + 2 A x − 1_n ||x||² ) = min_i ( b(i) + 2 A_i x − ||x||² ),    (16)

with equality when p_î = 0 if î is not a minimizer, the dual can also be expressed as

    max_{x ∈ R^d} min_i ( b(i) + 2 A_i x − ||x||² ).    (17)

By the two relations (13) and (16) used to derive the dual problem from the primal, we have immediately the weak duality condition that the objective function value of the dual (17) is always no more than the objective function value of the primal (12). The strong duality condition, that the two problems take the same optimal value, also holds here; indeed, the optimum x* also solves (13), and the optimal p* also solves (16).

To generalize Algorithm 2, we make v_t an unbiased estimator of b + 2 A x_t − 1_n ||x_t||², and set x_{t+1} to be the maximizer of

    Σ_{τ ∈ [t]} ( b(i_τ) + 2 A_{i_τ} x − ||x||² ),

namely, as with MEB, y_{t+1} ← Σ_{τ ∈ [t]} A_{i_τ}, and x_{t+1} ← y_{t+1}/t. (We also make some sign changes to account for the max-min formulation here, versus the min-max formulation used for MEB above.) This allows the use of Lemma A.4 for essentially the same analysis as for MEB; the gradient bound G and Hessian bound H are both at most 4, again assuming that all A_i ∈ B.

MEB. When b(i) ≡ −||A_i||², we have

    max_{x ∈ R^d} min_i ( b(i) + 2 A_i x − ||x||² ) = −min_{x ∈ R^d} max_i ( ||A_i||² − 2 A_i x + ||x||² ) = −min_{x ∈ R^d} max_i ||x − A_i||²,

the negative of the objective function for the MEB problem.

Margin Estimation. When b ≡ 0 in the primal problem (12), that problem is one of finding the shortest vector in the polytope {A^T p : p ∈ Δ}. Considering this case of the dual problem (17): for any given x ∈ R^d with min_i A_i x ≤ 0, the value of β ∈ R such that βx maximizes min_i ( 2 A_i (βx) − ||βx||² ) is β = 0. On the other hand, if x is such that min_i A_i x > 0, the maximizing value is β = min_i A_i x/||x||², so that the solution of (17) also maximizes min_i (A_i x)²/||x||². The latter is the square of the margin σ, which as before is the minimum distance of the points A_i to the hyperplane that is normal to x and passes through the origin.

Adapting Algorithm 2 for margin estimation, and with the slight changes needed for its analysis, we have that there is an algorithm taking Õ(n/ε² + d/ε) time that finds x ∈ R^d such that, for all i ∈ [n],

    2 A_i x − ||x||² ≥ σ² − ε.

When σ² ≤ ε, we don't appear to gain any useful information. However, when σ² > ε, we have min_{i ∈ [n]} A_i x > 0, and so, by appropriate scaling of x, we have x̂ such that

    σ̂² ≡ min_{i ∈ [n]} (A_i x̂)²/||x̂||² ≥ min_{i ∈ [n]} ( 2 A_i x̂ − ||x̂||² ) ≥ σ² − ε,

and so σ̂ ≥ σ − ε/σ. That is, letting ε' ≡ ε/σ, if ε' ≤ σ, there is an algorithm taking Õ(n/(ε'σ)² + d/(ε'σ)) time that finds a solution x̂ with σ̂ ≥ σ − ε'.
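The following small NumPy check (not from the paper; the instance and names are ours) illustrates the weak duality relation between (12) and (17) for the MEB instantiation b(i) = −||A_i||²: for any p in the simplex and the corresponding x = A^T p, the dual objective never exceeds the primal one, and its negative is the squared radius of the ball centered at x.

import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
A = rng.normal(size=(n, d))
A /= np.maximum(1.0, np.linalg.norm(A, axis=1, keepdims=True))  # rows in the unit ball
b = -(A * A).sum(axis=1)                  # MEB instance: b(i) = -||A_i||^2

p = rng.random(n); p /= p.sum()           # an arbitrary point of the simplex
x = A.T @ p                               # the maximizer in (13) for this p

primal = p @ b + p @ (A @ A.T) @ p        # objective of (12)
dual = np.min(b + 2.0 * A @ x - x @ x)    # objective of (17) at this x

assert dual <= primal + 1e-9              # weak duality
# For MEB, -dual is exactly max_i ||x - A_i||^2 at the candidate center x.
assert np.isclose(-dual, np.max(((x - A) ** 2).sum(axis=1)))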

15 Sublinear Optimization for Machine Learning 0:15 an so ˆσ σ ɛ/σ. That is, letting ɛ ɛ σ, if ɛ σ, there is an algorithm taking Õ(n/(ɛσ) + /ɛ σ) time that fins a solution ˆx with ˆσ σ ɛ. 4. A GENERIC SUBLINEAR PRIMAL-DUAL ALGORITHM We note that our technique above can be applie more broaly to any constraine optimization problem for which low-regret algorithms exist an low-variance sampling can be applie efficiently; that is, consier the general problem with optimum σ: max min x K i c i (x) = σ. (18) Suppose that for the set K an cost functions c i (x), there exists an iterative low regret algorithm, enote LRA, with regret R(T ) = o(t ). Let T ε (LRA) be the smallest T such that R(T ) T ε. We enote by x t+1 LRA(x t, c) an invocation of this algorithm, when at state x t K an the cost function c is observe. Let Sample(x, c) be a proceure that returns an unbiase estimate of c(x) with variance at most one, that runs in constant time. Further assume c i (x) 1 for all x K, i [n]. Algorithm 3 Generic Sublinear Primal-Dual Algorithm 1: Let T max{t ε (LRA), log n ε }, x 1 LRA(initial), w 1 1 n, η log n T. : for t = 1 to T o 3: for i [n] o 4: Let v t (i) Sample(x t, c i ) 5: v t (i) clip(ṽ t (i), 1/η) 6: w t+1 (i) w t (i)(1 ηv t (i) + η v t (i) ) 7: en for 8: p t wt w t 1, 9: Choose i t [n] by i t i with probability p t (i). 10: x t LRA(x t 1, c it ) 11: en for 1: return x = 1 T t x t Applying the techniques of section we can obtain the following generic lemma. LEMMA 4.1. The generic sublinear primal-ual algorithm returns a solution x that with probability at least 1 is an ε-approximate solution in max{t ε(lra), log n ε } iterations. PROOF. First we use the regret bouns for LRA to lower boun c i t (x t ), next we get an upper boun for that quantity using the Weak Regret Lemma, an then we combine the two in expectation. By efinition, c i (x ) σ for all i [n], an so, using the LRA regret guarantee, T σ max x B c it (x) c it (x t ) + R(T ), (19)

or rearranging,

    Σ_t c_{i_t}(x_t) ≥ T σ − R(T).    (20)

Now we turn to the MW part of our algorithm. By the Variance MW Lemma 2.3, and using the clipping of v_t(i),

    Σ_t p_t^T v_t ≤ min_{i ∈ [n]} Σ_t v_t(i) + (log n)/η + η Σ_t p_t^T v_t².

Using Lemma B.4 and Lemma B.5, since the procedure Sample is unbiased and has variance at most one, with high probability:

    ∀ i ∈ [n]: Σ_t v_t(i) ≤ Σ_t c_i(x_t) + O(ηT),   and   | Σ_t c_{i_t}(x_t) − Σ_t p_t^T v_t | = O(ηT).

Plugging these two facts into the previous inequality we have, w.h.p.,

    Σ_t c_{i_t}(x_t) ≤ min_{i ∈ [n]} Σ_t c_i(x_t) + O((log n)/η + η Σ_t p_t^T v_t² + ηT).    (21)

Combining (20) and (21) we get, w.h.p.,

    min_{i ∈ [n]} Σ_t c_i(x_t) ≥ T σ − O((log n)/η + ηT + η Σ_t p_t^T v_t²) − R(T).

And via Lemma B.8 we have, with probability at least 1/2,

    min_{i ∈ [n]} Σ_t c_i(x_t) ≥ T σ − O((log n)/η + ηT) − R(T).

Dividing through by T, and using our choice of η, we have min_i c_i(x̄) ≥ σ − ε/2 with probability at least 1/2, as claimed.

High-probability results can be obtained using the same technique as for linear classification.

4.1. More applications

The generic algorithm above can be used to derive the result of [Grigoriadis and Khachiyan 1995] on sublinear approximation of zero-sum games with payoffs/losses bounded by one (up to poly-logarithmic factors in running time). A zero-sum game can be cast as the following min-max optimization problem:

    max_{x ∈ Δ_d} min_{i ∈ [n]} A_i x.

That is, the constraints are inner products with the rows of the game matrix. This is exactly the same as the linear classification problem, but the vectors x are taken from the convex set K which is the simplex — the set of all mixed strategies of the column player.

A low-regret algorithm for the simplex is the multiplicative weights algorithm, which attains regret R(T) ≤ O(√(T log d)). The procedure Sample(x, A_i) to estimate the inner product A_i x is much simpler than the one used for linear classification: we sample from the distribution x and return A_i(j) with probability x(j). This has the correct expectation, and variance bounded by one (in fact, the random variable is always bounded by one). Lemma 4.1 then implies:

COROLLARY 4.2. The sublinear primal-dual algorithm applied to zero-sum games returns a solution x̄ that with probability at least 1/2 is an ε-approximate solution, in O(ε^{-2} log n) iterations and total time Õ((n + d)/ε²).

Essentially any constrained optimization problem which has convex or linear constraints, and is over a simple convex body such as the ball or simplex, can be approximated in sublinear time using our method.
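As an illustration of the generic scheme specialized to zero-sum games, here is a minimal Python/NumPy sketch (constants and names are ours): the LRA is standard multiplicative weights over the column player's simplex, and Sample(x, A_i) returns the single entry A_i(j_t) for j_t drawn from x.

import numpy as np

def sublinear_zero_sum(A, eps, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 3 for a zero-sum game with entries in [-1, 1]."""
    n, d = A.shape
    T = int(np.ceil(np.log(n + d) / eps ** 2))
    eta = np.sqrt(np.log(n) / T)            # step size of the dual MW over rows
    eta_x = np.sqrt(np.log(d) / T)          # step size of the LRA (MW over columns)
    w = np.ones(n)                          # weights over the n constraints (rows)
    u = np.ones(d)                          # weights defining x_t in the simplex
    x_sum = np.zeros(d)
    for _ in range(T):
        x = u / u.sum()
        x_sum += x
        p = w / w.sum()
        # Sample(x, A_i): one entry per row, using a shared column j_t ~ x.
        j_t = rng.choice(d, p=x)
        v = np.clip(A[:, j_t], -1.0 / eta, 1.0 / eta)   # already bounded by one here
        w = w * (1.0 - eta * v + (eta * v) ** 2)
        # LRA step: the column player observes the row i_t ~ p_t and does MW.
        i_t = rng.choice(n, p=p)
        u = u * np.exp(eta_x * A[i_t])
    return x_sum / T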

5. KERNELIZING THE SUBLINEAR ALGORITHMS

An important generalization of linear classifiers is that of kernel-based linear predictors (see e.g. [Schölkopf and Smola 2003]). Let Ψ : R^d → H be a mapping of feature vectors into a reproducing kernel Hilbert space H. In this setting, we seek a non-linear classifier given by h in the unit ball of H so as to maximize the margin:

    σ ≡ max_h min_{i ∈ [n]} ⟨h, Ψ(A_i)⟩.

The kernels of interest are those for which we can compute inner products of the form k(x, y) = ⟨Ψ(x), Ψ(y)⟩ efficiently. One popular kernel is the polynomial kernel, for which the corresponding Hilbert space is the set of polynomials over R^d of degree q. The mapping Ψ for this kernel is given by

    Ψ(x)_S = Π_{i ∈ S} x_i,   for S ⊆ [d], |S| ≤ q.

That is, all monomials of degree at most q. The kernel function in this case is given by k(x, y) = (x^T y)^q. Another useful kernel is the Gaussian kernel, k(x, y) = exp(−||x − y||²/(2s²)), where s is a parameter. The mapping here is defined by the kernel function (see [Schölkopf and Smola 2003] for more details).

The kernel version of Algorithm 1 is shown as Algorithm 4. Note that x_t and y_t are members of H, and not maintained explicitly, but rather are implicitly represented by the values i_τ. (And thus ||y_t|| is the norm of H, not R^d.) Also, Ψ(A_i) is not computed. The needed kernel product ⟨x_t, Ψ(A_i)⟩ is estimated by the procedure Kernel-ℓ2-Sampling, using the implicit representations and specific properties of the kernel being used. In the regular sublinear algorithm, this inner product could be sufficiently well approximated in O(1) time via ℓ2-sampling. As we show below, for many interesting kernels the time for Kernel-ℓ2-Sampling is not much longer.

For the analog of Theorem 2.7 to apply, we need the expectation of the estimates v_t(i) to be correct, with variance O(1). By Lemma C.1, it is enough if the estimates v_t(i) have an additive bias of O(ε). Hence, we define the procedure Kernel-ℓ2-Sampling to obtain such a not-too-biased estimator with variance at most one; first we show how to implement Kernel-ℓ2-Sampling, assuming that there is an estimator k̃(·,·) of the kernel k(·,·) such that E[k̃(x, y)] = k(x, y) and Var(k̃(x, y)) ≤ 1, and then we show how to implement such kernel estimators.

Algorithm 4 Sublinear Kernel Perceptron
1: Input: ε > 0, A ∈ R^{n×d} with A_i ∈ B for i ∈ [n].
2: Let T ← 200² ε^{-2} log n, y_1 ← 0, w_1 ← 1_n, η ← √((log n)/T).
3: for t = 1 to T do
4:    p_t ← w_t/||w_t||_1, x_t ← y_t/max{1, ||y_t||}.
5:    Choose i_t ∈ [n] by i_t ← i with probability p_t(i).
6:    y_{t+1} ← Σ_{τ ∈ [t]} Ψ(A_{i_τ})/√(2T).
7:    for i ∈ [n] do
8:       ṽ_t(i) ← Kernel-ℓ2-Sampling(x_t, Ψ(A_i))   (estimating ⟨x_t, Ψ(A_i)⟩)
9:       v_t(i) ← clip(ṽ_t(i), 1/η).
10:      w_{t+1}(i) ← w_t(i)(1 − η v_t(i) + η² v_t(i)²).
11:   end for
12: end for
13: return x̄ = (1/T) Σ_t x_t

5.1. Implementing Kernel-ℓ2-Sampling

Estimating ||y_t||. A key step in Kernel-ℓ2-Sampling is the estimation of ||y_t||, which readily reduces to estimating

    Y_t ≡ 2T ||y_t||²/t² = (1/t²) Σ_{τ,τ' ∈ [t]} k(A_{i_τ}, A_{i_τ'}),

that is, the mean of the summands. Since we use max{1, ||y_t||}, we need not be concerned with small ||y_t||, and it is enough that the additive bias in our estimate of Y_t be at most ε/T ≤ ε·(2T/t²) for t ∈ [T], implying a bias for ||y_t|| of no more than ε. Since we need 1/||y_t|| in the algorithm, it is not enough for estimates of Y_t just to be good in mean and variance; we will find an estimator whose error bounds hold with high probability.

Our estimate Ỹ_t of Y_t can first be considered assuming we only need to make an estimate for a single value of t. Let N_Y ≡ (8/3) log(1/δ) T²/ε². To estimate Y_t, we compute, for each τ, τ' ∈ [t], n_t ≡ N_Y/t² independent estimates X_{τ,τ',m} ≡ clip(k̃(A_{i_τ}, A_{i_τ'}), T/ε), for m ∈ [n_t], and our estimate is Ỹ_t ≡ Σ_{τ,τ' ∈ [t]} Σ_{m ∈ [n_t]} X_{τ,τ',m}/N_Y.

LEMMA 5.1. With probability at least 1 − δ, |Y_t − Ỹ_t| ≤ ε/T.

PROOF. We apply Bernstein's inequality (as in (31)) to the N_Y random variables X_{τ,τ',m} − E[X_{τ,τ',m}], which have mean zero, variance at most one, and are at most T/ε in magnitude. Bernstein's inequality implies, using Var[X_{τ,τ',m}] ≤ 1,

    log Prob{ Σ_{τ,τ' ∈ [t]} Σ_{m ∈ [n_t]} ( X_{τ,τ',m} − E[X_{τ,τ',m}] ) > α } ≤ −α²/( 2 N_Y + (2/3)(T/ε) α ),

and putting α ≡ N_Y ε/T gives

    log Prob{ Ỹ_t − E[Ỹ_t] > ε/T } ≤ −N_Y²(ε/T)²/( 2 N_Y + (2/3) N_Y ) = −(3/8) N_Y (ε/T)² = −log(1/δ).

Similar reasoning for −X_{τ,τ',m}, and the union bound, imply the lemma, after adjusting δ by a constant factor.

To compute Ỹ_t for t = 1, ..., T, we can save some work by reusing estimates from one t to the next. Now let N_Y ≡ (8/3) log(1/δ) T²/ε². Compute Ỹ_1 as above for t = 1, and let Ŷ_1 ≡ Ỹ_1. For t > 1, let n_t ≡ N_Y/t², and let

    Ŷ_t ≡ Σ_{m ∈ [n_t]} X_{t,t,m}/n_t + Σ_{τ ∈ [t−1]} Σ_{m ∈ [n_t]} ( X_{t,τ,m} + X_{τ,t,m} )/n_t,

and return Ỹ_t ≡ Σ_{τ ∈ [t]} Ŷ_τ/t². Since for each τ and τ', the expected total contribution of all X_{τ,τ',m} terms to Ỹ_t is k(A_{i_τ}, A_{i_τ'})/t², we have E[Ỹ_t] = Y_t. Moreover, the number of instances of X_{τ,τ',m} averaged to compute Ỹ_t is always at least as large as the number used for the above batch version; it follows that the total variance of Ỹ_t is non-increasing in t, and therefore Lemma 5.1 holds also for the Ỹ_t computed stepwise. Since the number of calls to k̃(·,·) is Σ_{t ∈ [T]} O(t·n_t) = O(N_Y log T), we have the following lemma.

LEMMA 5.2. The values Ỹ_t (t²/(2T)) ≈ ||y_t||², for t ∈ [T], can be estimated with O(log(1/(εδ)) T²/ε²) calls to k̃(·,·), so that with probability at least 1 − δ, |Ỹ_t (t²/(2T)) − ||y_t||²| ≤ ε. The values ||y_t||, t ∈ [T], can be computed exactly with T² calls to the exact kernel k(·,·).

PROOF. This follows from the discussion above, applying the union bound over t ∈ [T], and adjusting constants. The claim for exact computation is straightforward.

Given this procedure for estimating ||y_t||, we can describe Kernel-ℓ2-Sampling. Since x_{t+1} = y_{t+1}/max{1, ||y_{t+1}||}, we have

    ⟨x_{t+1}, Ψ(A_i)⟩ = (1/max{1, ||y_{t+1}||}) Σ_{τ ∈ [t]} ⟨Ψ(A_{i_τ}), Ψ(A_i)⟩/√(2T)
                      = (1/max{1, ||y_{t+1}||}) (1/√(2T)) Σ_{τ ∈ [t]} k(A_{i_τ}, A_i),    (22)

so that the main remaining step is to estimate Σ_{τ ∈ [t]} k(A_{i_τ}, A_i), for i ∈ [n]. Here we simply call k̃(A_{i_τ}, A_i) for each τ. We save time, at the cost of O(n) space, by saving the value of the sum for each i ∈ [n], and updating it for the next t with the n calls k̃(A_{i_t}, A_i).

LEMMA 5.3. Let L_k denote the expected time needed for one call to k̃(·,·), and T_k denote the time needed for one call to k(·,·). Except for estimating ||y_t||, Kernel-ℓ2-Sampling can be computed in n L_k expected time per iteration t. The resulting estimate has expectation within additive ε of ⟨x_t, Ψ(A_i)⟩, and variance at most one. Thus Algorithm 4 runs in time

    Õ( L_k(n + d)/ε² + min{ T_k/ε⁴, L_k/ε⁶ } ),

and produces a solution with properties as in Algorithm 1.

PROOF. For Kernel-ℓ2-Sampling it remains only to show that its variance is at most one, given that each k̃(·,·) has variance at most one. We observe from (22) that t

independent estimates k̃(·,·) are added together, and scaled by a value that is at most 1/√(2T). Since the variance of the sum is at most t, and the variance is scaled by a value no more than 1/(2T), the variance of Kernel-ℓ2-Sampling is at most one. The only bias in the estimate is due to the estimation of ||y_t||, which gives a relative error of ε. For our kernels, ||Ψ(v)|| ≤ 1 if v ∈ B, so the additive error of Kernel-ℓ2-Sampling is O(ε). The analysis of Algorithm 4 then follows as for the un-kernelized perceptron; we neglect the time needed for preprocessing for the calls to k̃(·,·), as it is dominated by other terms for the kernels we consider, and this is likely in general.

5.2. Implementing the Kernel Estimators

Using the lemma above we can derive corollaries for the Gaussian and polynomial kernels. More general kernels can be handled via the technique of [Cesa-Bianchi et al. 2010].

Polynomial kernels. For the polynomial kernel of degree q, estimating a single kernel product k̃(x, y), for x ≡ A_i and y ≡ A_j with norms at most one, takes O(q) time, as follows. Recall that for the polynomial kernel, k(x, y) = (x^T y)^q. To estimate this kernel we take the product of q independent ℓ2-samples, yielding k̃(x, y). Notice that the expectation of this estimator is exactly equal to the product of the expectations, E[k̃(x, y)] = (x^T y)^q. The second moment of this estimator is equal to the product of the second moments of the individual samples, so that Var(k̃(x, y)) ≤ (||x||² ||y||²)^q ≤ 1. Of course, calculating the inner product exactly takes O(d + log q) time. We obtain:

COROLLARY 5.4. For the polynomial degree-q kernel, Algorithm 4 runs in time

    Õ( q(n + d)/ε² + min{ (d + log q)/ε⁴, q/ε⁶ } ).

Gaussian kernels. To estimate the Gaussian kernel function, we assume that ||x|| and ||y|| are known and no more than s/√2; thus to estimate

    k(x, y) = exp(−||x − y||²/(2s²)) = exp(−(||x||² + ||y||²)/(2s²)) · exp(x^T y/s²),

we need to estimate exp(x^T y/s²). To estimate exp(γ X) = Σ_{i ≥ 0} γ^i X^i/i! for a random X and a parameter γ > 0, we pick an index i with probability e^{−γ} γ^i/i! (that is, i has a Poisson distribution) and return exp(γ) times the product of i independent estimates of X. In our case we take X to be the average of c ℓ2-samples of x^T y, and hence E[X] = x^T y and E[X²] ≤ (x^T y)² + ||x||² ||y||²/c. The expectation of our kernel estimator is thus

    E[k̃(x, y)] = E[ e^γ Π_{j=1}^{i} X_j ] = e^γ Σ_{i ≥ 0} (e^{−γ} γ^i/i!) E[X]^i = Σ_{i ≥ 0} γ^i (x^T y)^i/i! = exp(γ x^T y).

The second moment of this estimator is bounded by

    E[k̃(x, y)²] = e^{2γ} Σ_{i ≥ 0} (e^{−γ} γ^i/i!) E[X²]^i = e^γ exp(γ E[X²]),

which is a constant for the parameter choices below, given the assumed bounds on ||x|| and ||y||. Hence, we take γ = c = 1/s². This gives a correct estimator in terms of expectation, and constant variance. The variance can further be made smaller than one by taking the average of a constant number of estimators of the above type. As for evaluation time, the expected size of the index i is γ = 1/s²; thus we require in expectation γ·c = 1/s⁴ ℓ2-samples. We obtain:

COROLLARY 5.5. For the Gaussian kernel with parameter s, Algorithm 4 runs in time

    Õ( (n + d)/(s⁴ ε²) + min{ d/ε⁴, 1/(s⁴ ε⁶) } ).

5.3. Kernelizing the MEB and strictly convex problems

Analogously to Algorithm 4, we can define the kernel version of strongly convex problems, including MEB. The kernelized version of MEB is particularly efficient, since, as in Algorithm 2, the norm ||y_t|| is never required. This means that the procedure Kernel-ℓ2-Sampling can be computed in time O(n L_k) per iteration, for a total running time of O(L_k(ε^{-2} n + ε^{-1} d)).

6. LOWER BOUNDS

All of our lower bounds are information-theoretic, meaning that any successful algorithm must read at least some number of entries of the input matrix A. Clearly this also lower bounds the time complexity of the algorithm in the unit-cost RAM model. Some of our arguments use the following meta-theorem. Consider a p × q matrix A, where p is an even integer. Consider the following random process. Let W ≥ q, let a ≡ 1 − 1/W, and let e_j denote the j-th standard q-dimensional unit vector. For each i ∈ [p/2], choose a random j ∈ [q] uniformly, and set A_{i+p/2} ≡ A_i ≡ a e_j + b(1_q − e_j), where b is chosen so that ||A_i|| = 1. We say that such an A is a YES instance. With probability 1/2, transform A into a NO instance as follows: choose a random i* ∈ [p/2] uniformly, and if A_{i*} = a e_j + b(1_q − e_j) for a particular j ∈ [q], set A_{i*+p/2} ≡ −a e_j + b(1_q − e_j).

Suppose there is a randomized algorithm reading at most s positions of A which distinguishes YES and NO instances with probability 2/3, where the probability is over the algorithm's coin tosses and this distribution µ on YES and NO instances. By averaging, this implies a deterministic algorithm Alg reading at most s positions of A and distinguishing YES and NO instances with probability 2/3, where the probability is taken only over µ. We show the following meta-theorem with a standard argument.

THEOREM 6.1 (META-THEOREM). For any such algorithm Alg, s = Ω(pq).

This meta-theorem follows from the following folklore fact:

FACT 6.2. Consider the following random process. Initialize a length-r array A to an array of r zeros. With probability 1/2, choose a random position i ∈ [r] and set A[i] = 1. With the remaining probability 1/2, leave A as the all-zero array. Then any algorithm which determines whether A is the all-zero array with probability 2/3 must read Ω(r) entries of A.

Let us prove Theorem 6.1 using this fact:

PROOF. Consider the matrix B ∈ R^{(p/2)×q} which is defined by subtracting the bottom half of the matrix from the top half, that is, B_{i,j} = A_{i,j} − A_{i+p/2,j}. Then B is the all-zeros matrix, except that with probability 1/2, there is one entry whose value is roughly two, and whose location is random and distributed uniformly. An algorithm distinguishing between YES and NO instances of A in particular distinguishes between the two cases for B, which cannot be done without reading a linear number of entries.

In the proofs of Theorem 6.3, Corollary 6.4, and Theorem 6.6, it will be more convenient to use M as an upper bound on the number of non-zero entries of A rather than the exact number of non-zero entries. However, it should be understood that these the-


More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016

Admin BACKPROPAGATION. Neural network. Neural network 11/3/16. Assignment 7. Assignment 8 Goals today. David Kauchak CS158 Fall 2016 Amin Assignment 7 Assignment 8 Goals toay BACKPROPAGATION Davi Kauchak CS58 Fall 206 Neural network Neural network inputs inputs some inputs are provie/ entere Iniviual perceptrons/ neurons Neural network

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

Pure Further Mathematics 1. Revision Notes

Pure Further Mathematics 1. Revision Notes Pure Further Mathematics Revision Notes June 20 2 FP JUNE 20 SDB Further Pure Complex Numbers... 3 Definitions an arithmetical operations... 3 Complex conjugate... 3 Properties... 3 Complex number plane,

More information

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners Lower Bouns for Local Monotonicity Reconstruction from Transitive-Closure Spanners Arnab Bhattacharyya Elena Grigorescu Mahav Jha Kyomin Jung Sofya Raskhonikova Davi P. Wooruff Abstract Given a irecte

More information

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics UC Berkeley Department of Electrical Engineering an Computer Science Department of Statistics EECS 8B / STAT 4B Avance Topics in Statistical Learning Theory Solutions 3 Spring 9 Solution 3. For parti,

More information

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy, NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Homework 2 EM, Mixture Models, PCA, Dualitys

Homework 2 EM, Mixture Models, PCA, Dualitys Homework 2 EM, Mixture Moels, PCA, Dualitys CMU 10-715: Machine Learning (Fall 2015) http://www.cs.cmu.eu/~bapoczos/classes/ml10715_2015fall/ OUT: Oct 5, 2015 DUE: Oct 19, 2015, 10:20 AM Guielines The

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

Approximate Constraint Satisfaction Requires Large LP Relaxations

Approximate Constraint Satisfaction Requires Large LP Relaxations Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi

More information

Permanent vs. Determinant

Permanent vs. Determinant Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

A. Exclusive KL View of the MLE

A. Exclusive KL View of the MLE A. Exclusive KL View of the MLE Lets assume a change-of-variable moel p Z z on the ranom variable Z R m, such as the one use in Dinh et al. 2017: z 0 p 0 z 0 an z = ψz 0, where ψ is an invertible function

More information

Unit vectors with non-negative inner products

Unit vectors with non-negative inner products Unit vectors with non-negative inner proucts Bos, A.; Seiel, J.J. Publishe: 01/01/1980 Document Version Publisher s PDF, also known as Version of Recor (inclues final page, issue an volume numbers) Please

More information

On colour-blind distinguishing colour pallets in regular graphs

On colour-blind distinguishing colour pallets in regular graphs J Comb Optim (2014 28:348 357 DOI 10.1007/s10878-012-9556-x On colour-blin istinguishing colour pallets in regular graphs Jakub Przybyło Publishe online: 25 October 2012 The Author(s 2012. This article

More information

Beating CountSketch for Heavy Hitters in Insertion Streams

Beating CountSketch for Heavy Hitters in Insertion Streams Beating CountSketch for eavy itters in Insertion Streams ABSTRACT Vlaimir Braverman Johns opkins University Baltimore, MD, USA vova@cs.jhu.eu Nikita Ivkin Johns opkins University Baltimore, MD, USA nivkin1@jhu.eu

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

Generalized Tractability for Multivariate Problems

Generalized Tractability for Multivariate Problems Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

APPPHYS 217 Thursday 8 April 2010

APPPHYS 217 Thursday 8 April 2010 APPPHYS 7 Thursay 8 April A&M example 6: The ouble integrator Consier the motion of a point particle in D with the applie force as a control input This is simply Newton s equation F ma with F u : t q q

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

Inverse Theory Course: LTU Kiruna. Day 1

Inverse Theory Course: LTU Kiruna. Day 1 Inverse Theory Course: LTU Kiruna. Day Hugh Pumphrey March 6, 0 Preamble These are the notes for the course Inverse Theory to be taught at LuleåTekniska Universitet, Kiruna in February 00. They are not

More information

Probabilistic Analysis of Power Assignments

Probabilistic Analysis of Power Assignments Probabilistic Analysis of Power Assignments Maurits e Graaf 1,2 an Boo Manthey 1 1 University of Twente, Department of Applie Mathematics, Enschee, Netherlans m.egraaf/b.manthey@utwente.nl 2 Thales Neerlan

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

15.1 Upper bound via Sudakov minorization

15.1 Upper bound via Sudakov minorization ECE598: Information-theoretic methos in high-imensional statistics Spring 206 Lecture 5: Suakov, Maurey, an uality of metric entropy Lecturer: Yihong Wu Scribe: Aolin Xu, Mar 7, 206 [E. Mar 24] In this

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization JMLR: Workshop an Conference Proceeings vol 30 013) 1 On the Complexity of Banit an Derivative-Free Stochastic Convex Optimization Oha Shamir Microsoft Research an the Weizmann Institute of Science oha.shamir@weizmann.ac.il

More information

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation Binary Discrimination Methos for High Dimensional Data with a Geometric Representation Ay Bolivar-Cime, Luis Miguel Corova-Roriguez Universia Juárez Autónoma e Tabasco, División Acaémica e Ciencias Básicas

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

A Randomized Approximate Nearest Neighbors Algorithm - a short version

A Randomized Approximate Nearest Neighbors Algorithm - a short version We present a ranomize algorithm for the approximate nearest neighbor problem in - imensional Eucliean space. Given N points {x } in R, the algorithm attempts to fin k nearest neighbors for each of x, where

More information

TIME-DELAY ESTIMATION USING FARROW-BASED FRACTIONAL-DELAY FIR FILTERS: FILTER APPROXIMATION VS. ESTIMATION ERRORS

TIME-DELAY ESTIMATION USING FARROW-BASED FRACTIONAL-DELAY FIR FILTERS: FILTER APPROXIMATION VS. ESTIMATION ERRORS TIME-DEAY ESTIMATION USING FARROW-BASED FRACTIONA-DEAY FIR FITERS: FITER APPROXIMATION VS. ESTIMATION ERRORS Mattias Olsson, Håkan Johansson, an Per öwenborg Div. of Electronic Systems, Dept. of Electrical

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

arxiv: v1 [cs.lg] 22 Mar 2014

arxiv: v1 [cs.lg] 22 Mar 2014 CUR lgorithm with Incomplete Matrix Observation Rong Jin an Shenghuo Zhu Dept. of Computer Science an Engineering, Michigan State University, rongjin@msu.eu NEC Laboratories merica, Inc., zsh@nec-labs.com

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

Vectors in two dimensions

Vectors in two dimensions Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication

More information

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1

d dx But have you ever seen a derivation of these results? We ll prove the first result below. cos h 1 Lecture 5 Some ifferentiation rules Trigonometric functions (Relevant section from Stewart, Seventh Eition: Section 3.3) You all know that sin = cos cos = sin. () But have you ever seen a erivation of

More information