6.883: Online Methods in Machine Learning Alexander Rakhlin


LECTURE 4

This lecture is partly based on chapters 14-15 in [SSBD14].

Let us now give a variant of SGD for strongly convex functions.

Algorithm 1 SGD for strongly convex functions
  Input: σ > 0 (strong convexity parameter)
  Init: w_1 = 0
  for t = 1, ..., T do
    w_{t+1} = w_t − (1/(σt)) ∇_t,  where E_t[∇_t] ∈ ∂f(w_t)
  end for

Lemma 1. If f is strongly convex with parameter σ and E‖∇_t‖² ≤ G² for all t ∈ [T], then the average of the trajectory ŵ = (1/T) Σ_{t=1}^T w_t satisfies

    E[f(ŵ)] − f(w*) ≤ (G²/(2σT)) (1 + log T).

The proof is a small modification of the gradient descent lemma from the previous lecture. We prove the result for the non-stochastic gradient, and leave the stochastic version as an exercise.

Proof. Following the proof of the gradient descent lemma, but with time-varying η_t, we get

    ⟨∇f(w_t), w_t − w*⟩ ≤ (1/(2η_t)) [‖w_t − w*‖² − ‖w_{t+1} − w*‖²] + (η_t/2) ‖∇f(w_t)‖².    (1)

However, there is now an additional negative term coming from strong convexity,

    f(w_t) − f(w*) ≤ ⟨∇f(w_t), w_t − w*⟩ − (σ/2) ‖w_t − w*‖²,

and this term will give us the faster 1/T convergence rate. Combining the two inequalities and averaging over t,

    f(ŵ) − f(w*) ≤ (1/T) Σ_{t=1}^T [f(w_t) − f(w*)],

which is upper bounded by

    (1/T) Σ_{t=1}^T ( (1/(2η_t)) [‖w_t − w*‖² − ‖w_{t+1} − w*‖²] − (σ/2) ‖w_t − w*‖² + (η_t/2) ‖∇f(w_t)‖² ).    (2)

With η_t = 1/(σt), the first two terms inside the sum telescope to a nonpositive quantity, and each remaining term is at most G²/(2σt); summing over t ≤ T gives the stated bound (G²/(2σT))(1 + log T).
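To fix ideas, here is a minimal NumPy sketch of Algorithm 1 together with the trajectory averaging from Lemma 1. The function name sgd_strongly_convex and the grad_fn interface are my own conventions, not part of the lecture notes.

import numpy as np

def sgd_strongly_convex(grad_fn, dim, sigma, T, rng=None):
    """SGD with step size 1/(sigma*t) (Algorithm 1) plus trajectory averaging (Lemma 1)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(dim)              # w_1 = 0
    w_sum = np.zeros(dim)
    for t in range(1, T + 1):
        g = grad_fn(w, rng)        # stochastic subgradient: E[g] lies in the subdifferential of f at w
        w = w - g / (sigma * t)    # w_{t+1} = w_t - (1/(sigma*t)) * g
        w_sum += w
    return w_sum / T               # the average iterate w_hat

For instance, for f(w) = (σ/2)‖w‖² one could pass grad_fn = lambda w, rng: sigma * w + rng.normal(size=w.shape), whose conditional mean is the true gradient.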

A few remarks:

- The logarithmic factor can be removed by averaging only the second half of the trajectory, or by putting some other nonuniform weights on the trajectory. In practice, the last iterate is quite good, and possibly better than the average. An analysis for the last iterate was done in [SZ12].
- Once again, we may incorporate a Euclidean projection step onto a convex set after each update. In this case, the guarantee is with respect to w* in that set.

0.1 Full gradient for empirical objectives

Many offline (or "batch") problems in machine learning can be written as an empirical objective

    (1/n) Σ_{t=1}^n ℓ(w, (x_t, y_t))    (3)

or a regularized version of it

    (1/n) Σ_{t=1}^n ℓ(w, (x_t, y_t)) + λ R(w)    (4)

for some penalty R, tradeoff parameter λ, and a function ℓ that measures how well w explains the relationship between x and y. For instance, for finding a low-error linear separator in the non-separable case, we may try to perform gradient descent on

    f(w) = (1/n) Σ_{t=1}^n max{0, 1 − y_t⟨w, x_t⟩}.    (5)

This would be a non-stochastic gradient descent, but each iteration requires one to compute an element of ∂f(w_t). We may take ∇ = (1/n) Σ_{t=1}^n ∇_t, where ∇_t is a subgradient of the t-th loss. For the hinge loss case, a subgradient (with respect to example i) can be written as

    −y_i x_i 1{y_i⟨w, x_i⟩ < 1}.    (6)

The procedure amounts to running through the whole dataset to calculate the full gradient, and then making one step. Time complexity, in terms of gradient evaluations, to obtain an ε-accurate solution is then

    n R² ‖w*‖² / ε²,

where R = max_i ‖x_i‖. Check that R comes from the bound on the gradient of the loss. Unfortunately, the bound scales with the size n of our dataset, and one opts for the SGD procedure instead. Technically speaking, there are better analyses of SGD that may even get the log(1/ε) dependence on target accuracy under additional assumptions on the functions. However, it has been argued in the last decade (both empirically and theoretically) that the ability to process larger n is more important than attaining high accuracy for a limited n. That is, if the constraint is computation time, rather than the amount of data, one should opt for stochastic gradient descent [BB08, SSSSC11].
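To make the per-step cost of the full-gradient procedure explicit, here is a small sketch; the names full_hinge_subgradient and full_gradient_descent are mine, and a constant step size is used only for simplicity. Note that every step touches all n examples, which is exactly the factor of n in the complexity bound above.

import numpy as np

def full_hinge_subgradient(w, X, y):
    """Subgradient of f(w) = (1/n) sum_t max{0, 1 - y_t <w, x_t>}, eqs. (5)-(6).

    X has shape (n, d); y has entries in {-1, +1}.
    """
    active = (y * (X @ w) < 1).astype(float)            # indicator {y_i <w, x_i> < 1}
    return -(X * (y * active)[:, None]).mean(axis=0)    # average of per-example subgradients

def full_gradient_descent(X, y, eta, T):
    """Non-stochastic subgradient descent: each step evaluates all n subgradients."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        w = w - eta * full_hinge_subgradient(w, X, y)
    return w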

0.2 SGD for empirical objectives

In applying SGD to batch learning problems, one views the objective

    f(w) = (1/n) Σ_{t=1}^n ℓ(w, (x_t, y_t))    (7)

(or the regularized version) as an expectation under an empirical distribution. If an index I is sampled uniformly at random from [n], then any ∇_I ∈ ∂ℓ(w, (x_I, y_I)) has the property that

    E_{I ~ Unif[n]} [∇_I] ∈ ∂f(w).    (8)

The time complexity of SGD for attaining an ε-minimizer of the objective is independent of n (!!), and the dependence on ε, of course, varies according to the properties of ℓ. We remark that we proved convergence of SGD for general random unbiased subgradients, but we are applying it to a distribution of a very specific form. This has been exploited to improve the analysis and the dependence on ε. Instead of sampling from [n], it is common to permute the data and run over it in order, possibly several times. There have been several works trying to understand how different the random sampling is from cycling through the data.

0.3 SGD for Support Vector Machines

Recall that the hinge loss penalizes data points close to the boundary, and thus pushes the hyperplane to have a large margin. This is not an entirely precise statement, since the very notion of a margin of size 1 was tied to the fact that w is a minimal-norm vector. Hence, the objective (5) is not what we want to minimize. Instead, we need a bi-criterion form

    min_w ( (1/n) Σ_{t=1}^n max{0, 1 − y_t⟨w, x_t⟩},  ‖w‖² ).    (9)

That is, we want to minimize the loss and the norm at the same time. There are several ways to combine the bi-criteria into a single one. Here is one:

    min_w (1/n) Σ_{t=1}^n max{0, 1 − y_t⟨w, x_t⟩} + (λ/2) ‖w‖².    (10)

This is known as the Support Vector Machine (a fancy name is a must in machine learning!) for the case of linear kernels (more on this later). One more caveat is that the SVM does not penalize the scalar shift of the nonhomogeneous hyperplane; in the formulation (10), however, this shift is absorbed in w and the norm penalizes it. Before large-scale problems came about, SVMs were solved as a constrained quadratic programming problem. Pegasos, the SGD solution to this objective, proposed by [SSSSC11] (see also [Zha04]), has been very influential in practical applications with large datasets. For the randomly chosen example i, the subgradient of the SVM objective is

    ∇_t = −y_i x_i 1{y_i⟨w_t, x_i⟩ < 1} + λ w_t.    (11)

To apply SGD it remains to decide on the step size. Since the objective is λ-strongly convex due to the regularization term, we choose the step size

    η_t = 1/(λt).    (12)
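Before continuing, here is a minimal sketch of the update implied by (11) and (12). The names svm_subgradient and pegasos_step are my own and not from [SSSSC11]; the sketch only illustrates a single stochastic step.

import numpy as np

def svm_subgradient(w, x_i, y_i, lam):
    """Subgradient (11) of the SVM objective at w, using a single example (x_i, y_i)."""
    hinge_part = -y_i * x_i if y_i * (x_i @ w) < 1 else np.zeros_like(w)
    return hinge_part + lam * w

def pegasos_step(w, x_i, y_i, lam, t):
    """One stochastic step with the strongly convex step size eta_t = 1/(lam*t), eq. (12)."""
    eta_t = 1.0 / (lam * t)
    return w - eta_t * svm_subgradient(w, x_i, y_i, lam)

Expanding the step shows it equals (1 − η_t λ) w_t + η_t y_i x_i when the margin is violated, and (1 − η_t λ) w_t otherwise, which is exactly the case split in Algorithm 2 below.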

We then apply SGD for strongly convex objectives. Suppose we substitute the potentially suboptimal choice w = 0 in (10). The value of the objective is then at most 1. Hence, the optimal solution w* should give an objective value no greater than that. This implies

    (λ/2) ‖w*‖² ≤ 1,    (13)

or ‖w*‖ ≤ √(2/λ) (with a bit more work, 2 can be replaced by 1). The SGD algorithm may add a projection step onto a ball of this radius to help guide the search with the extra information about the location of w*. [SSSSC11] reports that the projection step makes little difference in the experiments they performed. We summarize the Pegasos algorithm below.

Algorithm 2 SGD for SVM objective (Pegasos)
  Input: λ > 0 (regularization parameter)
  Init: w_1 = 0
  for t = 1, ..., T do
    Set η_t = 1/(λt)
    Sample i ~ Unif[n]
    if y_i⟨w_t, x_i⟩ < 1 then
      w_{t+1} = (1 − η_t λ) w_t + η_t y_i x_i
    else
      w_{t+1} = (1 − η_t λ) w_t
    end if
    Optionally, rescale w_{t+1} to have norm at most √(2/λ)
  end for

To apply Lemma 1 on convergence of SGD for strongly convex functions, we need to calculate bounds on the gradients and on ‖w_t‖. Observe that the hinge loss is R-Lipschitz, where R = max_i ‖x_i‖. Furthermore, the update of SGD can be written succinctly as

    w_{t+1} = (1/(λt)) Σ_{s=1}^t y_{i_s} x_{i_s} 1{y_{i_s}⟨w_s, x_{i_s}⟩ < 1}    (14)

(prove this by unwinding the recursion), where i_s is the index chosen at step s. In particular, this implies that ‖w_{t+1}‖ ≤ R/λ for all iterates. The Lipschitz constant of the overall function is then upper bounded by 2R, and the convergence guarantee of Pegasos for the average ŵ of the trajectory is

    E f(ŵ) − f(w*) ≤ (4R²/(λT)) (1 + log T),    (15)

where f is the SVM objective in (10).

0.4 Mini-batching

A common practice (including in SGD for deep neural nets) is to take a small batch of data, evaluate the average gradient with respect to these data, and then update the parameter. Mini-batching presents a natural interpolation between the full gradient (all gradients at once) and the single-gradient SGD as stated above. This has the effect of reducing the variance of the gradients while still being computationally cheap (and independent of n).
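As a rough illustration of the mini-batching idea applied to the SVM objective, here is a small sketch; the function names, the default batch size, and sampling with replacement are my own choices, not prescriptions from the lecture. Each step averages the per-example subgradients (11) over a sampled batch before moving.

import numpy as np

def minibatch_subgradient(w, X, y, lam, batch_size, rng):
    """Average of the per-example SVM subgradients (11) over a random mini-batch."""
    idx = rng.integers(0, X.shape[0], size=batch_size)
    Xb, yb = X[idx], y[idx]
    active = (yb * (Xb @ w) < 1).astype(float)           # which sampled examples violate the margin
    hinge_part = -(Xb * (yb * active)[:, None]).mean(axis=0)
    return hinge_part + lam * w

def minibatch_pegasos(X, y, lam, T, batch_size=32, seed=0):
    """Pegasos-style SGD where each step uses a mini-batch instead of a single example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for t in range(1, T + 1):
        w -= (1.0 / (lam * t)) * minibatch_subgradient(w, X, y, lam, batch_size, rng)
        w_sum += w
    return w_sum / T    # average of the trajectory

Setting batch_size = 1 recovers Algorithm 2 (without the optional projection), while batch_size = n recovers the full-gradient method of Section 0.1.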

0.5 Sparse updates

Examining (14), we only need to keep track of the sum of the y_{i_s} x_{i_s} that led to a correction of the hyperplane. Hence, if the x's are s-sparse, the update can be implemented in time O(s) rather than O(d). This becomes handy, for instance, in document classification with the bag-of-words (or related) sparse representation.

0.6 Equivalent form of SVM objective

A form that one may encounter in the literature is

    min_{w,b,ξ} (1/n) Σ_{t=1}^n ξ_t + (λ/2) ‖w‖²    (16)
    subject to  y_t(⟨w, x_t⟩ + b) ≥ 1 − ξ_t    (17)
                ξ_t ≥ 0    (18)

References

[BB08] Olivier Bousquet and Léon Bottou. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161-168, 2008.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[SSSSC11] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.

[SZ12] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. arXiv preprint arXiv:1212.1824, 2012.

[Zha04] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM, 2004.
