A Generalized Online Mirror Descent with Applications to Classification and Regression
Journal of Machine Learning Research (Submitted 4/00; Published 10/00)

A Generalized Online Mirror Descent with Applications to Classification and Regression

Francesco Orabona (francesco@orabona.com), Toyota Technological Institute at Chicago, Chicago, IL, USA
Koby Crammer (koby@ee.technion.ac.il), Department of Electrical Engineering, The Technion, Haifa, 32000 Israel
Nicolò Cesa-Bianchi (nicolo.cesa-bianchi@unimi.it), Department of Computer Science, Università degli Studi di Milano, Milano, 20135 Italy

Editor:

Abstract

Online learning algorithms are fast, memory-efficient, easy to implement, and applicable to many prediction problems, including classification, regression, and ranking. Several online algorithms were proposed in the past few decades, some based on additive updates, like the Perceptron, and some others on multiplicative updates, like Winnow. Online convex optimization is a general framework to unify both the design and the analysis of online algorithms using a single prediction strategy: online mirror descent. Different first-order online algorithms are obtained by choosing the regularization function in online mirror descent. We generalize online mirror descent to sequences of time-varying regularizers. Our approach allows us to recover as special cases many recently proposed second-order algorithms, such as the Vovk-Azoury-Warmuth, the second-order Perceptron, and the AROW algorithm. Moreover, we derive a new second-order adaptive p-norm algorithm, and improve bounds for some first-order algorithms, such as Passive-Aggressive (PA-I).

Keywords: Online learning, Convex optimization, Second-order algorithms

1. Introduction

Online learning provides a scalable and flexible approach for the solution of a wide range of prediction problems, including classification, regression, ranking, and portfolio management. Popular online algorithms for classification include the standard Perceptron and its many variants, such as the kernel Perceptron (Freund and Schapire, 1999), the p-norm Perceptron (Gentile, 2003), and Passive-Aggressive (Crammer et al., 2006). These algorithms have well-known counterparts for regression problems, such as the Widrow-Hoff algorithm and its p-norm generalization. Other online algorithms, with properties different from those of the standard Perceptron, are based
on exponential rather than additive updates, such as Winnow (Littlestone, 1988) for classification and Exponentiated Gradient (Kivinen and

© 2000 Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi.
Warmuth, 1997) for regression. Whereas these online algorithms are all essentially variants of stochastic gradient descent (Tsypkin, 1971), in the last decade many algorithms using second-order information from the input features have been proposed. These include the Vovk-Azoury-Warmuth algorithm for regression (Vovk, 2001; Azoury and Warmuth, 2001), the second-order Perceptron (Cesa-Bianchi et al., 2005), the CW/AROW algorithms (Dredze et al., 2008; Crammer et al., 2009), and the algorithms proposed by Duchi et al. (2011), all for binary classification.

Recently, online convex optimization has been proposed as a common unifying framework for designing and analyzing online algorithms. In particular, online mirror descent (OMD) is a general online convex optimization algorithm which is parameterized by a regularizer, i.e., a strongly convex function. By appropriate choices of the regularizer, most first-order online learning algorithms are recovered as special cases of OMD. Moreover, performance guarantees can also be derived simply by instantiating the general OMD bounds to the specific regularizer being used.

The theoretical study of OMD relies on convex analysis. Warmuth and Jagota (1997) and Kivinen and Warmuth (2001) pioneered the use of Bregman divergences in the analysis of online algorithms, as explained in the monograph of Cesa-Bianchi and Lugosi (2006). Shalev-Shwartz and Singer (2007), Shalev-Shwartz (2007) in his dissertation, and Shalev-Shwartz and Kakade (2009) showed a different analysis based on a primal-dual method. Starting from the work of Kakade et al. (2009), it is now clear that many instances of OMD can be analyzed using only a few basic convex duality properties. See the recent survey by Shalev-Shwartz (2012) for a lucid description of these developments.

In this paper we extend and generalize the theoretical framework of Kakade et al. (2009). In particular, we allow OMD to use a sequence of time-varying regularizers. This is known to be the key to obtaining second-order algorithms, and indeed we recover the Vovk-Azoury-Warmuth, the second-order Perceptron, and the AROW algorithm as special cases, with a slightly improved analysis of AROW. Our generalized analysis also captures the efficient variants of
these algorithms that only use the diagonal elements of the second-order information matrix, a result which was not within reach of the previous techniques. Besides being able to express second-order algorithms, time-varying regularizers can be used to perform other types of adaptation to the sequence of observed data. We give a concrete example by introducing a new adaptive regularizer corresponding to a weighted version of the p-norm regularizer. In the case of sparse targets, the corresponding instance of OMD achieves a performance bound better than that of OMD with 1-norm regularization, which is the standard regularizer for the sparse target assumption. Even in the case of first-order algorithms our framework gives improvements on previous results. For example, although aggressive algorithms for binary classification often exhibit a better empirical performance than their conservative counterparts, a theoretical explanation of this behavior remained so far elusive. Using our refined analysis, we are able to prove the first bound for Passive-Aggressive (PA-I) that is never worse and sometimes better than the Perceptron bound.
The generalized gradient-based linear forecaster

2. Online convex programming

Let X be some Euclidean space (a finite-dimensional linear space over the reals equipped with an inner product ⟨·,·⟩). In the online convex optimization protocol an algorithm sequentially chooses elements from S ⊆ X, each time incurring a certain loss. At each step t = 1, 2, … the algorithm chooses w_t ∈ S and then observes a convex loss function ℓ_t : S → R. The value ℓ_t(w_t) is the loss of the learner at step t, and the goal is to control the regret,

    R_T(u) = Σ_{t=1}^T ℓ_t(w_t) − Σ_{t=1}^T ℓ_t(u)

for all u ∈ S and for any sequence of convex loss functions ℓ_t. An important application domain for this protocol is sequential linear regression/classification. In this case, there is a fixed and given loss function ℓ : R × R → R and a fixed but unknown sequence (x_1, y_1), (x_2, y_2), … of examples (x_t, y_t) ∈ X × R. At each step t = 1, 2, … the learner observes x_t and picks w_t ∈ S ⊆ X. The loss suffered at step t is then defined as ℓ_t(w_t) = ℓ(⟨w_t, x_t⟩, y_t). For example, in regression ℓ(⟨w_t, x_t⟩, y_t) = (⟨w_t, x_t⟩ − y_t)². In classification, where y_t ∈ {−1, +1}, a typical loss function is the hinge loss [1 − y_t⟨w_t, x_t⟩]_+, where [a]_+ = max{0, a}. This is a convex upper bound on the true quantity of interest, namely the mistake indicator function I_{y_t⟨w_t, x_t⟩ ≤ 0}.

2.1 Further notation and definitions

We now introduce some basic notions of convex analysis that are used in the paper. We refer to Rockafellar (1970) for definitions and terminology. We consider functions f : X → R that are closed and convex. This is equivalent to saying that their epigraph {(x, y) : f(x) ≤ y} is a convex and closed subset of X × R. The effective domain of f, that is the set {x ∈ X : f(x) < ∞}, is a convex set whenever f is convex. We can always choose any S ⊆ X as the domain of f by letting f(x) = ∞ for x ∉ S. Given a closed and convex function f with domain S ⊆ X, its Fenchel conjugate f* : X → R is defined as f*(u) = sup_{v ∈ S} (⟨v, u⟩ − f(v)). Note that the domain of f* is always X. Moreover, one can prove that f** = f. A generic norm of a vector u ∈ X is denoted by ‖u‖. Its dual ‖·‖* is the norm defined as ‖v‖* = sup_u {⟨u, v⟩ : ‖u‖ ≤ 1}. The Fenchel-Young inequality states that f(u) + f*(v) ≥ ⟨u, v⟩ for all v, u. A vector x is a subgradient of a convex function f at v if f(u) − f(v) ≥ ⟨u − v, x⟩ for any u in the domain of f. The differential set of f at v, denoted by ∂f(v), is the set of all the
subgradients of f at v. If f is also differentiable at v, then ∂f(v) contains a single vector, denoted by ∇f(v), which is the gradient of f at v. A consequence of the Fenchel-Young inequality is the following: for all x ∈ ∂f(v) we have that f(v) + f*(x) = ⟨v, x⟩. A function f is β-strongly convex with respect to a norm ‖·‖ if for any u, v in its domain, and any x ∈ ∂f(u),

    f(v) ≥ f(u) + ⟨x, v − u⟩ + (β/2)‖u − v‖².   (2)
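The strong-convexity inequality can be checked numerically in the canonical case f(w) = ½‖w‖₂², which is 1-strongly convex with respect to the Euclidean norm (there the inequality holds with equality). The snippet below is only an illustration of the definition, not part of the paper's analysis.

```python
# Numeric check of the strong-convexity inequality for f(w) = (1/2)||w||_2^2,
# which is 1-strongly convex w.r.t. the Euclidean norm (beta = 1); for this
# quadratic f the inequality holds with equality.
import random

random.seed(0)

def f(w):
    return 0.5 * sum(wi * wi for wi in w)

ok = True
for _ in range(100):
    u = [random.uniform(-1, 1) for _ in range(3)]
    v = [random.uniform(-1, 1) for _ in range(3)]
    grad_u = u  # gradient of f at u is u itself
    rhs = (f(u)
           + sum(g * (vi - ui) for g, vi, ui in zip(grad_u, v, u))
           + 0.5 * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))
    ok = ok and f(v) >= rhs - 1e-12
```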
The Fenchel conjugate f* of a β-strongly convex function f is everywhere differentiable and (1/β)-strongly smooth. This means that, for all u, v ∈ X,

    f*(v) ≤ f*(u) + ⟨∇f*(u), v − u⟩ + (1/(2β))‖v − u‖*².

See also the paper of Kakade et al. (2009) and references therein. A further property of strongly convex functions f : S → R is the following: for all u ∈ X,

    ∇f*(u) = argmax_{v ∈ S} (⟨v, u⟩ − f(v)).   (1)

This implies the useful identity f(∇f*(u)) + f*(u) = ⟨∇f*(u), u⟩. Strong convexity and strong smoothness are key properties in the design of online learning algorithms. In the following, we often write ‖·‖_f to denote the norm according to which f is strongly convex.

3. Online Mirror Descent

We now introduce our main algorithmic tool: a generalization of the standard OMD algorithm for online convex programming in which the regularizers may change over time.

Algorithm 1: Online Mirror Descent
1: Parameters: A sequence of strongly convex functions f_1, f_2, … defined on a common domain S ⊆ X.
2: Initialize: θ_1 = 0 ∈ X
3: for t = 1, 2, … do
4:    Choose w_t = ∇f_t*(θ_t)
5:    Observe z_t ∈ X
6:    Update θ_{t+1} = θ_t + z_t
7: end for

Standard OMD (see, e.g., Kakade et al., 2009) uses f_t = f for all t. Note the following remarkable property of Algorithm 1: while θ_t moves freely in X as determined by the input sequence z_t, because of (1) the property w_t ∈ S holds for all t. The following lemma is a generalization of Corollary 4 of Kakade et al. (2009) and of Corollary 3 of Duchi et al. (2011).

Lemma 1 Assume OMD is run with functions f_1, f_2, … defined on a common domain S ⊆ X and such that each f_t is β_t-strongly convex with respect to the norm ‖·‖_{f_t}. Then, for any u ∈ S,

    Σ_{t=1}^T ⟨z_t, u − w_t⟩ ≤ f_T(u) + Σ_{t=1}^T ( ‖z_t‖_{f_t,*}²/(2β_t) + f_t*(θ_t) − f_{t−1}*(θ_t) ),

where we set f_0*(0) = 0. Moreover, f_t*(θ_t) − f_{t−1}*(θ_t) ≤ f_{t−1}(w_t) − f_t(w_t) for all t ≥ 1.
Proof Let Δ_t = f_t*(θ_{t+1}) − f_{t−1}*(θ_t). Then

    Σ_{t=1}^T Δ_t = f_T*(θ_{T+1}) − f_0*(θ_1) = f_T*(θ_{T+1}).

Since the functions f_t* are (1/β_t)-strongly smooth with respect to ‖·‖_{f_t,*}, and recalling that θ_{t+1} = θ_t + z_t,

    Δ_t = f_t*(θ_{t+1}) − f_t*(θ_t) + f_t*(θ_t) − f_{t−1}*(θ_t)
        ≤ f_t*(θ_t) − f_{t−1}*(θ_t) + ⟨∇f_t*(θ_t), z_t⟩ + ‖z_t‖_{f_t,*}²/(2β_t)
        = f_t*(θ_t) − f_{t−1}*(θ_t) + ⟨w_t, z_t⟩ + ‖z_t‖_{f_t,*}²/(2β_t),

where we used the definition of w_t in the last step. On the other hand, the Fenchel-Young inequality implies

    Σ_{t=1}^T Δ_t = f_T*(θ_{T+1}) ≥ ⟨u, θ_{T+1}⟩ − f_T(u) = Σ_{t=1}^T ⟨u, z_t⟩ − f_T(u).

Combining the upper and lower bounds on Σ_t Δ_t and summing over t we get

    Σ_{t=1}^T ⟨u, z_t⟩ − f_T(u) ≤ Σ_{t=1}^T ( f_t*(θ_t) − f_{t−1}*(θ_t) + ⟨w_t, z_t⟩ + ‖z_t‖_{f_t,*}²/(2β_t) ).

We now prove the second statement. Recalling again the definition of w_t, (1) implies f_t*(θ_t) = ⟨w_t, θ_t⟩ − f_t(w_t). On the other hand, the Fenchel-Young inequality implies that f_{t−1}*(θ_t) ≥ ⟨w_t, θ_t⟩ − f_{t−1}(w_t). Combining the two we get f_t*(θ_t) − f_{t−1}*(θ_t) ≤ f_{t−1}(w_t) − f_t(w_t), as desired.

Next, we show a general regret bound for Algorithm 1.

Corollary 1 Let R : S → R be a convex function and let g_1, g_2, … be a sequence of nondecreasing convex functions g_t : S → R. Fix η > 0 and assume f_t = g_t + tηR are β_t-strongly convex with respect to ‖·‖_{f_t}. If OMD is run on the input sequence z_t = −η ℓ'_t for some ℓ'_t ∈ ∂ℓ_t(w_t), then

    Σ_{t=1}^T (ℓ_t(w_t) + R(w_t)) − Σ_{t=1}^T (ℓ_t(u) + R(u)) ≤ g_T(u)/η + (η/2) Σ_{t=1}^T ‖ℓ'_t‖_{f_t,*}²/β_t   (3)

for all u ∈ S. Moreover, if f_t = √t g + tηR, where g : S → R is β-strongly convex, then

    Σ_{t=1}^T (ℓ_t(w_t) + R(w_t)) − Σ_{t=1}^T (ℓ_t(u) + R(u)) ≤ √T g(u)/η + (η/β) √T max_{t≤T} ‖ℓ'_t‖_{f,*}²   (4)

for all u ∈ S.
Finally, if f_t = tηR, where R is β-strongly convex with respect to a norm ‖·‖, then

    Σ_{t=1}^T (ℓ_t(w_t) + R(w_t)) − Σ_{t=1}^T (ℓ_t(u) + R(u)) ≤ max_{t≤T} ‖ℓ'_t‖*² (1 + ln T)/(2β)   (5)

for all u ∈ S.

Proof By convexity, ℓ_t(w_t) − ℓ_t(u) ≤ −(1/η)⟨z_t, u − w_t⟩. Using Lemma 1 we have

    −Σ_{t=1}^T ⟨z_t, u − w_t⟩ ≤ g_T(u) + ηT R(u) + (η²/2) Σ_{t=1}^T ‖ℓ'_t‖_{f_t,*}²/β_t + η Σ_{t=1}^T ((t−1)R(w_t) − tR(w_t)),

where we used the fact that the terms g_{t−1}(w_t) − g_t(w_t) are nonpositive under the hypothesis that the functions g_t are nondecreasing. Reordering terms we obtain (3). In order to obtain (4) it is sufficient to note that, by definition of strong convexity, g_t = √t g is √t β-strongly convex because g is β-strongly convex, hence f_t is √t β-strongly convex too. The elementary inequality Σ_{t=1}^T 1/√t ≤ 2√T concludes the proof of (4). Finally, bound (5) is proven by observing that f_t = tηR is tηβ-strongly convex because R is β-strongly convex. The elementary inequality Σ_{t=1}^T 1/t ≤ 1 + ln T concludes the proof.

A special case of OMD is the Regularized Dual Averaging framework of Xiao (2010), where the prediction at each step is defined by

    w_t = argmin_w ( (1/(t−1)) Σ_{s=1}^{t−1} ⟨w, ℓ'_s⟩ + (β_{t−1}/(t−1)) g(w) + R(w) )   (6)

for some ℓ'_s ∈ ∂ℓ_s(w_s), s = 1, …, t−1. Using (1), it is easy to see that this update is equivalent to w_t = ∇f_{t−1}*(−Σ_{s=1}^{t−1} ℓ'_s), where f_{t−1}(w) = β_{t−1} g(w) + (t−1)R(w). The framework of Xiao (2010) has been extended by Duchi et al. (2010) to allow the strongly convex part of the regularizer to increase over time. However, their framework is not flexible enough to include algorithms that update without using the gradient of the loss function with respect to which the regret is calculated. Examples of such algorithms are the Vovk-Azoury-Warmuth algorithm of the next section and the online binary classification algorithms of Section 6. A bound similar to (3) has been recently presented by Duchi et al. (2011) and extended to variable potential functions by Duchi et al. (2010). There, a more immediate tradeoff between the current gradient and the Bregman divergence from the new solution to the previous one is used to update at each time step. Note that the only hypothesis on R is convexity. Hence, R can be a nondifferentiable function as well. Thus we recover the results about minimization of strongly convex and composite loss functions, and adaptive learning rates, in a simple unique framework. In the next sections we show more algorithms that can be viewed as special cases of this framework.¹ (¹ Although Xiao (2010) explicitly mentions that
his results cannot be recovered with the primal-dual proofs, here we prove the contrary.)
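As a concrete illustration of Algorithm 1, here is a minimal sketch assuming the simple time-varying quadratic regularizer f_t(w) = (β_t/2)‖w‖₂², for which the mirror map is ∇f_t*(θ) = θ/β_t. The choice β_t = √t and all names are our illustrative assumptions, not prescriptions from the paper.

```python
# Minimal sketch of Algorithm 1 (OMD with time-varying regularizers),
# specialized to f_t(w) = (beta_t / 2) * ||w||_2^2, whose conjugate gradient
# is grad f_t*(theta) = theta / beta_t.  beta_t = sqrt(t) is an assumption.
import math

def omd(z_stream, dim, beta=lambda t: math.sqrt(t)):
    """Run OMD on a sequence of update vectors z_t; return all predictions w_t."""
    theta = [0.0] * dim                      # step 2: theta_1 = 0
    ws = []
    for t, z in enumerate(z_stream, start=1):
        b = beta(t)
        w = [th / b for th in theta]         # step 4: w_t = grad f_t*(theta_t)
        ws.append(w)
        theta = [th + zi for th, zi in zip(theta, z)]  # step 6: theta update
    return ws

# Toy run with a constant update direction z_t = (1, 0).
preds = omd([[1.0, 0.0]] * 4, dim=2)
```

Since θ_t = (t−1, 0) here, the predictions shrink by the growing β_t, e.g. w_4 = (3/2, 0).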
4. Square Loss

In this section we recover known regret bounds for online regression with the square loss via Lemma 1. Throughout this section, X = R^d and the inner product ⟨u, x⟩ is the standard dot product uᵀx. We set ℓ_t(u) = ½(y_t − uᵀx_t)², where (x_1, y_1), (x_2, y_2), … is some arbitrary sequence of examples (x_t, y_t) ∈ R^d × R.

First, note that it is possible to specialize OMD to the Vovk-Azoury-Warmuth algorithm for online regression by setting z_t = y_t x_t and f_t(u) = ½ uᵀA_t u, where A_t = aI_d + Σ_{s=1}^t x_s x_sᵀ. The regret bound of this algorithm (see, e.g., Theorem 11.8 of Cesa-Bianchi and Lugosi, 2006) is recovered from Lemma 1 by noting that f_t is 1-strongly convex with respect to the norm ‖u‖_{f_t} = √(uᵀA_t u). Hence,

    R_T = Σ_{t=1}^T ½(y_t − w_tᵀx_t)² − Σ_{t=1}^T ½(y_t − uᵀx_t)²
        ≤ f_T(u) + Σ_{t=1}^T ( ½ y_t² x_tᵀA_t^{-1}x_t + f_t*(θ_t) − f_{t−1}*(θ_t) ) − ½ Σ_{t=1}^T (uᵀx_t)²
        ≤ (a/2)‖u‖² + (Y²/2) Σ_{t=1}^T x_tᵀA_t^{-1}x_t,

since f_t*(θ_t) − f_{t−1}*(θ_t) ≤ f_{t−1}(w_t) − f_t(w_t) = −½(w_tᵀx_t)², and by setting Y = max_t |y_t|.

We can also generalize the p-norm LMS algorithm of Kivinen et al. (2006) for controlling the adaptive filtering regret

    R_T^af = Σ_{t=1}^T (w_tᵀx_t − uᵀx_t)².

The reader interested in the motivations behind the study of this regret is addressed to that paper. This is achieved by setting z_t = (y_t − w_tᵀx_t)x_t and f_t(u) = (√t X_t²/β) f(u) in OMD, where f is an arbitrary β-strongly convex function with respect to some norm ‖·‖, and X_t = max_{s≤t} ‖x_s‖*. We can then write

    ½ R_T + ½ R_T^af = Σ_{t=1}^T (y_t − w_tᵀx_t)(uᵀx_t − w_tᵀx_t) ≤ f_T(u) + Σ_{t=1}^T (y_t − w_tᵀx_t)²/(2√t),

where in the last step we used Lemma 1, the √t X_t²-strong convexity of f_t, and the fact that f_t ≥ f_{t−1}. Simplifying the expression we obtain the following adaptive filtering bound:

    R_T^af ≤ (2 X_T² √T/β) f(u) + Σ_{t=1}^T (y_t − uᵀx_t)².

Compared to the bounds of Kivinen et al. (2006), our algorithm inherits the ability to adapt to the maximum norm of x_t without any prior knowledge. Moreover, instead of using a decreasing learning rate, here we use an increasing regularizer.
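A minimal sketch of the Vovk-Azoury-Warmuth specialization above: A_t^{-1} is maintained through the Sherman-Morrison rank-one update, with A_t = aI + Σ_{s≤t} x_s x_sᵀ and w_t = A_t^{-1}θ_t. All helper names are ours, not the paper's.

```python
# Sketch of the Vovk-Azoury-Warmuth instance of OMD: z_t = y_t x_t and
# f_t(u) = (1/2) u' A_t u with A_t = a I + sum_{s<=t} x_s x_s'.  A_t^{-1}
# is kept up to date via the Sherman-Morrison formula.

def _matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def _sherman_morrison(Minv, x):
    """Return (M + x x')^{-1} given Minv = M^{-1} (rank-one update)."""
    Mx = _matvec(Minv, x)
    denom = 1.0 + sum(xi * mxi for xi, mxi in zip(x, Mx))
    d = len(x)
    return [[Minv[i][j] - Mx[i] * Mx[j] / denom for j in range(d)] for i in range(d)]

def vaw(examples, dim, a=1.0):
    """Return the predictions w_t' x_t on the sequence of (x_t, y_t) pairs."""
    Ainv = [[(1.0 / a if i == j else 0.0) for j in range(dim)] for i in range(dim)]
    theta = [0.0] * dim
    preds = []
    for x, y in examples:
        Ainv = _sherman_morrison(Ainv, x)     # A_t = A_{t-1} + x_t x_t'
        w = _matvec(Ainv, theta)              # w_t = A_t^{-1} theta_t
        preds.append(sum(wi * xi for wi, xi in zip(w, x)))
        theta = [th + y * xi for th, xi in zip(theta, x)]  # theta += y_t x_t
    return preds

p = vaw([([1.0, 0.0], 1.0), ([1.0, 0.0], 1.0)], dim=2)
```

Note how the current instance x_t enters A_t before the prediction is made, which is exactly what makes this update not expressible through the loss gradient alone.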
5. A new algorithm for online regression

In this section we show the full power of our framework by introducing a new time-varying regularizer f_t generalizing the squared q-norm. Then, we derive the corresponding regret bound. As in the previous section, let X = R^d and let the inner product ⟨u, x⟩ be the standard dot product uᵀx. Given b_1, …, b_d ∈ R_+ and q ∈ (1, 2], let the weighted q-norm of w ∈ R^d be (Σ_{i=1}^d b_i |w_i|^q)^{1/q}. Define the corresponding regularization function by

    f(w) = (1/(2(q−1))) (Σ_{i=1}^d b_i |w_i|^q)^{2/q}.

This function has the following properties (proof in the appendix).

Lemma 2 The Fenchel conjugate of f is

    f*(θ) = (1/(2(p−1))) (Σ_{i=1}^d b_i^{1−p} |θ_i|^p)^{2/p}   for p = q/(q−1).   (7)

Moreover, the function f(w) is 1-strictly convex with respect to the norm (Σ_{i=1}^d b_i |x_i|^q)^{1/q}, whose dual norm is defined by (Σ_{i=1}^d b_i^{1−p} |θ_i|^p)^{1/p}.

We can now prove the following regret bound for linear regression with absolute loss.

Corollary 3 Let

    f_t(u) = (1/(2(q_t−1))) (Σ_{i=1}^d b_{t,i} |u_i|^{q_t})^{2/q_t},

where b_{t,i} = max_{s=1,…,t} x_{s,i}², and let q_t be defined by

    1/(q_t − 1) = ln( max_{s=1,…,t} ‖x_s‖_0 ).
If OMD is run using the regularizers f_t on the input sequence z_t = −ηℓ'_t, where ℓ'_t ∈ ∂ℓ_t(w_t) for ℓ_t(w) = |wᵀx_t − y_t| and η > 0, then

    Σ_{t=1}^T |w_tᵀx_t − y_t| − Σ_{t=1}^T |uᵀx_t − y_t| ≤ √(e ln(max_{t≤T} ‖x_t‖_0)) ( (Σ_{i=1}^d u_i² B_{T,i}²)/(2η) + ηT/2 )

for any u ∈ R^d, where B_{T,i} = max_{t=1,…,T} |x_{t,i}|.

This bound has the interesting property of being invariant with respect to arbitrary scaling of individual coordinates of the data points x_t. This is unlike running standard OMD with non-adaptive regularizers, which gives bounds of the form ‖u‖ max_t ‖x_t‖ √T. In particular, by an appropriate tuning of η the regret in Corollary 3 is bounded by a quantity of the order of

    √(Σ_{i=1}^d u_i² max_t x_{t,i}²) √(T ln d).

When the good u are sparse, that is, when ‖u‖_1 is small, this is always better than running standard OMD with a non-weighted q-norm regularizer, which for q → 1 (the best choice for the sparse-u case) gives bounds of the form ‖u‖_1 max_t ‖x_t‖_∞ √(T ln d). Indeed, we have

    √(Σ_{i=1}^d u_i² max_t x_{t,i}²) ≤ √(Σ_{i=1}^d u_i² max_{t,j} x_{t,j}²) ≤ ‖u‖_1 max_t ‖x_t‖_∞.

Similar regularization functions are studied by Grave et al. (2011), although in a different context.

6. Binary classification: aggressive and diagonal updates

In this section we show that several known algorithms for online binary classification are special cases of OMD. These algorithms include the p-norm Perceptron (Gentile, 2003), Passive-Aggressive (Crammer et al., 2006), the second-order Perceptron (Cesa-Bianchi et al., 2005), and AROW (Crammer et al., 2009). Besides recovering all previously known mistake bounds, we also show new bounds for Passive-Aggressive and for AROW with diagonal updates.

Fix any Euclidean space with inner product ⟨·,·⟩. Given a fixed but unknown sequence (x_1, y_1), (x_2, y_2), … of examples (x_t, y_t) ∈ X × {−1, +1}, let ℓ_t(w) = ℓ(⟨w, x_t⟩, y_t) be the hinge loss [1 − y_t⟨w, x_t⟩]_+. It is easy to verify that the hinge loss satisfies the following condition: if ℓ_t(w) > 0 then

    ℓ_t(u) ≥ 1 + ⟨u, ℓ'_t⟩   for all u, w ∈ R^d, with ℓ'_t ∈ ∂ℓ_t(w).   (8)

Note that when ℓ_t(w) > 0 the subgradient notation is redundant, as ∂ℓ_t(w) is the singleton {∇ℓ_t(w)}. We apply the OMD algorithm to online binary classification by setting z_t = −η_t ℓ'_t if ℓ_t(w_t) > 0, and z_t = 0 otherwise.
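The claim that the hinge loss upper-bounds the mistake indicator is easy to check numerically; the cases below are arbitrary and only illustrate the definition.

```python
# Check that the hinge loss [1 - y <w, x>]_+ upper-bounds the mistake
# indicator I{y <w, x> <= 0} on a few arbitrary (w, x, y) triples.

def hinge(w, x, y):
    return max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))

def mistake(w, x, y):
    return 1.0 if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0 else 0.0

cases = [([0.5, -0.2], [1.0, 1.0], +1),   # correct with small margin
         ([0.5, -0.2], [1.0, 1.0], -1),   # mistake
         ([0.0, 0.0], [2.0, 0.0], +1)]    # zero margin counts as a mistake
ok = all(hinge(w, x, y) >= mistake(w, x, y) for w, x, y in cases)
```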
In the following, when T is understood from the context, we denote by M the set of steps t on which the algorithm made a mistake, ŷ_t ≠ y_t. Similarly, we denote by U the set of margin error steps, that is, steps where ŷ_t = y_t but ℓ_t(w_t) > 0. Following a standard terminology, we call conservative (or passive) an algorithm that updates its classifier only on mistake steps, and aggressive an algorithm that updates its classifier both on mistake and on margin error steps.

6.1 First-order algorithms

If we run OMD in conservative mode, and let f_t = f = ½‖·‖_p² for 1 < p ≤ 2, then we recover the p-norm Perceptron of Gentile (2003). We now show how to use our framework to generalize and improve previous analyses for binary classification algorithms that use aggressive updates.

Corollary 4 Assume OMD is run with f_t = f, where f, with domain X, is β-strongly convex with respect to the norm ‖·‖ and satisfies f(λu) ≤ λ² f(u) for all λ ∈ R and all u ∈ X. Further assume the input sequence is z_t = η_t y_t x_t, for some 0 < η_t ≤ 1 such that y_t⟨w_t, x_t⟩ ≤ 0 implies η_t = 1. Then, for all T ≥ 1 and for all u ∈ X,

    M ≤ L(u) + D + (2/β) f(u) X_T² + X_T √((2/β) f(u) L(u)),

where M = |M|, X_T = max_{t≤T} ‖x_t‖*, L(u) = Σ_t [1 − y_t⟨u, x_t⟩]_+, and

    D = Σ_{t∈U} η_t ( η_t‖x_t‖*² + 2β y_t⟨w_t, x_t⟩ − X_T² ) / X_T².

For the conservative p-norm Perceptron, we have U = ∅, ‖·‖* = ‖·‖_q where q = p/(p−1), and β = p−1, because ½‖·‖_p² is (p−1)-strongly convex with respect to ‖·‖_p for 1 < p ≤ 2 (see Lemma 17 of Shalev-Shwartz, 2007). We therefore recover the mistake bound of Gentile (2003).

The term D in the bound of Corollary 4 can be negative. We can minimize it, subject to 0 ≤ η_t ≤ 1, by setting

    η_t = max{ min{ (X_T² − 2β y_t⟨w_t, x_t⟩)/(2‖x_t‖*²), 1 }, 0 }.

This tuning of η_t is quite similar to that of the Passive-Aggressive algorithm (type I) of Crammer et al. (2006). In fact, for f_t = f = ½‖·‖² we would have

    η_t = max{ min{ (X_T² − 2 y_t⟨w_t, x_t⟩)/(2‖x_t‖²), 1 }, 0 },

while the update rule for PA-I is

    η_t = max{ min{ (1 − y_t⟨w_t, x_t⟩)/‖x_t‖², 1 }, 0 }.
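The PA-I rule of Crammer et al. (2006), moving by η_t y_t x_t with the step size clipped at an aggressiveness parameter C, can be sketched as follows; the function name is ours, and C = 1 mirrors the constraint 0 ≤ η_t ≤ 1 used above.

```python
# Sketch of one PA-I update step (Crammer et al., 2006): the step size is
# tau = min{C, hinge_loss / ||x||^2}, clipped at C, and there is no update
# when the hinge loss is zero (the "passive" case).

def pa1_step(w, x, y, C=1.0):
    """Return the updated weight vector after one PA-I step on (x, y)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)                  # hinge loss [1 - y <w,x>]_+
    if loss == 0.0:
        return w                                   # passive: no update
    tau = min(C, loss / sum(xi * xi for xi in x))  # aggressive, clipped step
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = pa1_step([0.0, 0.0], [1.0, 1.0], +1)  # loss 1, ||x||^2 = 2, so tau = 0.5
```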
The mistake bound of Corollary 4 is, however, better than the aggressive bounds for PA-I of Crammer et al. (2006) and Shalev-Shwartz (2007). Indeed, while the PA-I bounds are generally worse than the Perceptron mistake bound

    M ≤ L(u) + ‖u‖² X_T² + ‖u‖ X_T √(L(u)),   (9)

as discussed by Crammer et al. (2006), our bound is better as soon as D < 0. Hence, it can be viewed as the first theoretical evidence in support of aggressive updates.

Proof of Corollary 4 Using (15) in Lemma 5 with the assumption η_t = 1 when t ∈ M, and noting that B_t = 0 since f_t = f, we get

    M + Σ_{t∈U} η_t ≤ L_η(u) + 2 √( f(u) Σ_t ( η_t²‖x_t‖*²/(2β) + η_t y_t⟨w_t, x_t⟩ ) )
                  ≤ L(u) + √( (2/β) f(u) X_T² ( M + Σ_{t∈U} η_t + D ) ),

where we have used the fact that ‖x_t‖* ≤ X_T for all t ≤ T, that y_t⟨w_t, x_t⟩ ≤ 0 for t ∈ M, and the definition of D. Solving the resulting quadratic inequality for M + Σ_{t∈U} η_t we get

    M + Σ_{t∈U} η_t ≤ L(u) + (1/β) f(u) X_T² + √( (2/β) f(u) X_T² (L(u) + D) + ((1/β) f(u) X_T²)² ).   (10)

We further upper bound the right-hand side of (10) using the elementary inequality √(a + b) ≤ √a + b/(2√a), for all a > 0 and b ≥ −a. Applying the inequality √(a + b) ≤ √a + √b, dropping the nonnegative term Σ_{t∈U} η_t from the left-hand side, and rearranging gives the desired bound.

6.2 Second-order algorithms

We now apply our framework to second-order algorithms for binary classification. Here, we let X = R^d and the inner product ⟨u, x⟩ be the standard dot product uᵀx.
Second-order algorithms for binary classification are online variants of Ridge regression. Recall that the Ridge regression linear predictor is defined by

    w_{t+1} = argmin_{w ∈ R^d} ( Σ_{s=1}^t (wᵀx_s − y_s)² + ‖w‖² ).

The closed-form expression for w_{t+1}, which involves the design matrix S_t = [x_1, …, x_t] and the label vector y_t = (y_1, …, y_t)ᵀ, is given by w_{t+1} = (I + S_t S_tᵀ)^{-1} S_t y_t. The second-order Perceptron (see below) uses this weight w_{t+1}, but S_t and y_t only contain the examples (x_s, y_s) on which a mistake occurred. In this sense, it is an online variant of Ridge regression. In practice, second-order algorithms typically perform better than their first-order counterparts, such as the algorithms in the Perceptron family.

There are two basic second-order algorithms: the second-order Perceptron of Cesa-Bianchi et al. (2005) and the AROW algorithm of Crammer et al. (2009). We show that both of them are instances of OMD and recover their mistake bounds as special cases of our analysis. Let f_t(x) = ½ xᵀA_t x, where A_0 = I and A_t = A_{t−1} + (1/r) x_t x_tᵀ with r > 0. Each function f_t is 1-strongly convex with respect to the norm ‖x‖_{f_t} = √(xᵀA_t x), with dual norm ‖x‖_{f_t,*} = √(xᵀA_t^{-1}x). The dual function of f_t is f_t*(x) = ½ xᵀA_t^{-1}x. Now, the conservative version of OMD run with the f_t chosen as above is the second-order Perceptron. The aggressive version corresponds instead to AROW, with a minor difference. Indeed, in this case OMD predicts with the sign of w_tᵀx_t, where

    y_t w_tᵀx_t = r m_t/(r + χ_t),

using the notation χ_t = x_tᵀA_{t−1}^{-1}x_t and m_t = y_t x_tᵀA_{t−1}^{-1}θ_t. On the other hand, AROW simply predicts using the sign of m_t. The sign of the predictions is the same, but OMD updates when r m_t/(r + χ_t) ≤ 1, while AROW updates when m_t ≤ 1. Typically, for large t the value of χ_t is small, and thus the two update rules coincide in practice.

To derive a mistake bound for OMD run with f_t(x) = ½ xᵀA_t x, first observe that using the Woodbury identity we have

    f_t*(θ_t) − f_{t−1}*(θ_t) = − (x_tᵀA_{t−1}^{-1}θ_t)²/(2(r + x_tᵀA_{t−1}^{-1}x_t)) = − m_t²/(2(r + χ_t)).

Hence, using (15) in Lemma 5, and setting η_t = 1, we obtain

    M + |U| ≤ L(u) + √( uᵀA_T u ( Σ_t x_tᵀA_t^{-1}x_t + Σ_t ( 2 y_t w_tᵀx_t − m_t²/(r + χ_t) ) ) )
           ≤ L(u) + √( ( r‖u‖² + Σ_{t∈M∪U} (uᵀx_t)² ) ( ln det(A_T) + Σ_{t∈U} m_t(2r − m_t)/(r(r + χ_t)) ) )

for all u ∈ X, where L(u) = Σ_t [1 − y_t uᵀx_t]_+, and where we used uᵀA_T u = ‖u‖² + (1/r)Σ_t (uᵀx_t)², the identity 2y_t w_tᵀx_t − m_t²/(r + χ_t) = m_t(2r − m_t)/(r + χ_t), and the bound Σ_t x_tᵀA_t^{-1}x_t ≤ r ln det(A_T).
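A minimal sketch of the conservative instance just described (the second-order Perceptron): A_t^{-1} is maintained with the Sherman-Morrison formula for the rank-one update A_t = A_{t−1} + (1/r)x_t x_tᵀ, and both the matrix and θ change only on mistake steps. All helper names are ours.

```python
# Sketch of the conservative second-order Perceptron as an instance of OMD:
# predict with sign(x' A_t^{-1} theta_t), where A_t includes the current
# instance, but commit the update only when a mistake occurs.

def _matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def _rank_one_inv_update(Minv, x):
    """Return (M + x x')^{-1} given Minv = M^{-1} (Sherman-Morrison)."""
    Mx = _matvec(Minv, x)
    denom = 1.0 + sum(xi * mxi for xi, mxi in zip(x, Mx))
    d = len(x)
    return [[Minv[i][j] - Mx[i] * Mx[j] / denom for j in range(d)] for i in range(d)]

def second_order_perceptron(examples, dim, r=1.0):
    """Run the conservative second-order Perceptron; return the mistake count."""
    Ainv = [[float(i == j) for j in range(dim)] for i in range(dim)]  # A_0 = I
    theta = [0.0] * dim
    mistakes = 0
    for x, y in examples:
        xs = [xi / r ** 0.5 for xi in x]      # (1/r) x x' as a rank-one term
        Ainv_try = _rank_one_inv_update(Ainv, xs)
        margin = sum(xi * wi for xi, wi in zip(x, _matvec(Ainv_try, theta)))
        if y * margin <= 0:                   # mistake: commit matrix and theta
            mistakes += 1
            Ainv = Ainv_try
            theta = [th + y * xi for th, xi in zip(theta, x)]
    return mistakes

m = second_order_perceptron([([1.0, 0.0], 1), ([-1.0, 0.0], -1), ([1.0, 0.0], 1)], dim=2)
```

On this separable toy sequence only the first, zero-margin example triggers an update.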
This bound improves slightly over the known bound for AROW, in the last sum under the square root. In fact, in AROW we have the term |U|, while here we have

    Σ_{t∈U} m_t(2r − m_t)/(r(r + χ_t)) ≤ Σ_{t∈U} r/(r + χ_t) ≤ |U|,   (11)

where the first inequality holds because m_t(2r − m_t) ≤ r² for any m_t. In the conservative case, when U = ∅, the bound specializes to the standard second-order Perceptron bound.

6.3 Diagonal updates

AROW and the second-order Perceptron can be run more efficiently using diagonal matrices. In this case, each update takes time linear in d. We now use Lemma 5 to prove a mistake bound for the diagonal version of the second-order Perceptron. Denote by D_t = diag{A_t} the diagonal matrix that agrees with A_t on the diagonal, where A_t is defined as before, and let f_t(x) = ½ xᵀD_t x. Setting η_t = 1, using the second bound of Lemma 5, and Lemma 6, we have

    M + |U| ≤ L(u) + √( uᵀD_T u ( Σ_t x_tᵀD_t^{-1}x_t + 2|U| ) )
           ≤ L(u) + √( ( ‖u‖² + (1/r) Σ_{i=1}^d u_i² Σ_t x_{t,i}² ) ( r Σ_{i=1}^d ln( (1/r) Σ_t x_{t,i}² + 1 ) + 2|U| ) ).   (12)

(We did not optimize the constant multiplying |U| in this bound.)

This allows us to theoretically analyze the cases where this algorithm could be advantageous. In particular, features of NLP data are typically binary, and it is often the case that most of the features are zero most of the time. On the other hand, these rare features are usually the most informative ones (see, e.g., the discussion of Dredze et al., 2008). Figure 1 shows the number of times each feature (word) appears in two sentiment datasets vs. the word rank. Clearly, there are a few very frequent words and many rare words. These exact properties originally motivated the CW and AROW algorithms, and now our analysis provides a theoretical justification. Concretely, the above considerations support the assumption that the optimal hyperplane u satisfies

    Σ_{i=1}^d u_i² Σ_t x_{t,i}² = Σ_{i∈I} u_i² Σ_t x_{t,i}² ≤ s Σ_{i∈I} u_i² ≤ s‖u‖²,   (13)

where I is the set of informative and rare features, and s is the maximum number of times these features appear in the sequence. Running the diagonal version of the second-order Perceptron so that U = ∅, and assuming that (13) holds,
Figure 1: Evidence of heavy tails for NLP data. The plots show the number of times each word appears vs. the word rank on two sentiment datasets.

the last term in the mistake bound (12) can be rewritten and bounded as

    r Σ_{i=1}^d ln( (1/r) Σ_t x_{t,i}² + 1 ) ≤ dr ln( M X_T²/(dr) + 1 ),

where we calculated the maximum of the sum given the constraint Σ_i Σ_t x_{t,i}² ≤ X_T² M. Together with (13), the mistake bound (12) then becomes the implicit inequality

    M ≤ L(u) + √( ‖u‖² (1 + s/r) dr ln( M X_T²/(dr) + 1 ) ).

We can now use Corollary 3 in the appendix to solve this inequality for M and obtain an explicit mistake bound in which the dependence on the cumulative hinge loss enters only through its logarithm. Hence, when hypothesis (13) is verified, the number of mistakes of the diagonal version of AROW depends on √(ln L(u)) rather than on √(L(u)).

7. Conclusions

We proposed a framework for online convex optimization combining online mirror descent with time-varying regularizers. This allowed us to view second-order algorithms such as the Vovk-Azoury-Warmuth forecaster, the second-order Perceptron, and the AROW algorithm as special cases of mirror descent. Our analysis also captures second-order variants that only employ the diagonal elements of the second-order information matrix, a result which was not within reach of the previous techniques. Within our framework, we also derived and analyzed a new regularizer based on an adaptive weighted version of the p-norm Perceptron. In the case of sparse targets, the
corresponding instance of OMD achieves a performance bound better than that of OMD with 1-norm regularization. We also improved previous bounds for existing first-order algorithms. For example, we were able to formally explain the phenomenon according to which aggressive algorithms typically exhibit better empirical performance than their conservative counterparts. Specifically, our refined analysis provides a bound for Passive-Aggressive (PA-I) that is never worse and sometimes better than the Perceptron bound.

One interesting direction to pursue is the derivation and analysis of algorithms based on time-varying versions of the entropic regularizers used by the EG and Winnow algorithms. More in general, it would be useful to devise a more systematic approach to the design of adaptive regularizers enjoying a given set of desired properties. This would help obtaining more examples of adaptation mechanisms that are not based on second-order information.

Acknowledgments

The third author gratefully acknowledges partial support by the PASCAL2 Network of Excellence under EC grant no. This publication only reflects the authors' views. The second author gratefully acknowledges partial support by an Israel Science Foundation grant ISF-1567/10.

Technical lemmas

Proof of Lemma 2 The Fenchel conjugate of f is f*(θ) = sup_v (vᵀθ − f(v)). Set w equal to the gradient of (1/(2(p−1)))(Σ_i b_i^{1−p}|θ_i|^p)^{2/p} with respect to θ. Easy calculations show that

    wᵀθ − f(w) = (1/(2(p−1))) (Σ_{i=1}^d b_i^{1−p}|θ_i|^p)^{2/p}.

We now show that this quantity is indeed sup_v (vᵀθ − f(v)). Pick any v ∈ R^d. Applying Hölder's inequality to the vectors (v_1 b_1^{1/q}, …, v_d b_d^{1/q}) and (θ_1 b_1^{−1/q}, …, θ_d b_d^{−1/q}) we get

    vᵀθ ≤ (Σ_i b_i|v_i|^q)^{1/q} (Σ_i b_i^{−p/q}|θ_i|^p)^{1/p} = (Σ_i b_i|v_i|^q)^{1/q} (Σ_i b_i^{1−p}|θ_i|^p)^{1/p}.

Hence

    vᵀθ − f(v) ≤ (Σ_i b_i|v_i|^q)^{1/q} (Σ_i b_i^{1−p}|θ_i|^p)^{1/p} − (1/(2(q−1))) (Σ_i b_i|v_i|^q)^{2/q}.

The right-hand side is a quadratic function of (Σ_i b_i|v_i|^q)^{1/q}. If we maximize it, we obtain

    vᵀθ − f(v) ≤ ((q−1)/2) (Σ_i b_i^{1−p}|θ_i|^p)^{2/p} = (1/(2(p−1))) (Σ_i b_i^{1−p}|θ_i|^p)^{2/p},
which concludes the proof for f*.

In order to show the second part, we follow Lemma 17 of Shalev-Shwartz (2007) and prove that (Σ_i b_i|x_i|^q)^{2/q} ≤ xᵀ∇²f(w)x. Define Ψ(a) = a^{2/q}/(2(q−1)) and φ(a) = |a|^q, hence f(w) = Ψ(Σ_i b_i φ(w_i)). Clearly Ψ'(a) = a^{2/q − 1}/(q(q−1)) and Ψ''(a) = (2 − q) a^{2/q − 2}/(q²(q−1)). Moreover, φ'(a) = q sign(a)|a|^{q−1} and φ''(a) = q(q−1)|a|^{q−2}. The (i, j) element of ∇²f(w), for i ≠ j, is

    Ψ''(Σ_k b_k φ(w_k)) b_i b_j φ'(w_i) φ'(w_j),

and the diagonal elements of ∇²f(w) are

    Ψ''(Σ_k b_k φ(w_k)) b_i² φ'(w_i)² + Ψ'(Σ_k b_k φ(w_k)) b_i φ''(w_i).

Thus we have

    xᵀ∇²f(w)x = Ψ''(Σ_k b_k φ(w_k)) (Σ_i b_i x_i φ'(w_i))² + Ψ'(Σ_k b_k φ(w_k)) Σ_i b_i x_i² φ''(w_i).

The first term is non-negative since q ≤ 2. Writing the second term explicitly, we have

    xᵀ∇²f(w)x ≥ (Σ_k b_k|w_k|^q)^{2/q − 1} Σ_i b_i x_i² |w_i|^{q−2}.

We now lower bound this quantity using Hölder's inequality:

    Σ_i b_i|x_i|^q = Σ_i ( b_i^{q/2} |x_i|^q |w_i|^{q(q−2)/2} ) ( b_i^{1−q/2} |w_i|^{q(2−q)/2} )
                 ≤ ( Σ_i b_i x_i² |w_i|^{q−2} )^{q/2} ( Σ_i b_i |w_i|^q )^{(2−q)/2},

hence

    Σ_i b_i x_i² |w_i|^{q−2} ≥ (Σ_i b_i|x_i|^q)^{2/q} / (Σ_i b_i|w_i|^q)^{(2−q)/q}.
Combining the last two displays, and using 2/q − 1 = (2 − q)/q, we just showed that

    xᵀ∇²f(w)x ≥ (Σ_k b_k|w_k|^q)^{2/q − 1} (Σ_i b_i|x_i|^q)^{2/q} / (Σ_k b_k|w_k|^q)^{(2−q)/q} = (Σ_i b_i|x_i|^q)^{2/q}.

This concludes the proof of the 1-strict convexity of f.

We now prove that the dual norm of (Σ_i b_i|x_i|^q)^{1/q} is (Σ_i b_i^{1−p}|θ_i|^p)^{1/p}. By definition of dual norm,

    sup_x { uᵀx : (Σ_i b_i|x_i|^q)^{1/q} ≤ 1 } = sup_y { Σ_i u_i b_i^{−1/q} y_i : ‖y‖_q ≤ 1 } = ‖(u_1 b_1^{−1/q}, …, u_d b_d^{−1/q})‖_p,

where 1/q + 1/p = 1. Writing the last norm explicitly and observing that p = q/(q−1),

    (Σ_i |u_i|^p b_i^{−p/q})^{1/p} = (Σ_i |u_i|^p b_i^{1−p})^{1/p},

which concludes the proof.

Lemma 5 Assume OMD is run with functions f_1, f_2, … defined on X and such that each f_t is β_t-strongly convex with respect to the norm ‖·‖_{f_t} and f_t(λu) ≤ λ² f_t(u) for all λ ∈ R and all u ∈ S. Assume further the input sequence is z_t = −η_t ℓ'_t for some η_t > 0, where ℓ'_t ∈ ∂ℓ_t(w_t), ℓ_t(w_t) = 0 implies ℓ'_t = 0, and ℓ_t = ℓ(·, x_t, y_t) satisfies (8). Then, for all T ≥ 1,

    Σ_{t=1}^T η_t ≤ L_η(u) + λ f_T(u) + (1/λ) Σ_{t=1}^T ( B_t + η_t²‖ℓ'_t‖_{f_t,*}²/(2β_t) − η_t⟨w_t, ℓ'_t⟩ )   (14)
for any u ∈ S and any λ > 0, where L_η(u) = Σ_{t=1}^T η_t ℓ_t(u) and B_t = f_t*(θ_t) − f_{t−1}*(θ_t). In particular, choosing the optimal λ, we obtain

    Σ_{t=1}^T η_t ≤ L_η(u) + 2 √( f_T(u) Σ_{t=1}^T ( B_t + η_t²‖ℓ'_t‖_{f_t,*}²/(2β_t) − η_t⟨w_t, ℓ'_t⟩ ) ).   (15)

Proof We apply Lemma 1 with z_t = −η_t ℓ'_t, and with λu in place of u for any λ > 0:

    −Σ_t η_t ⟨ℓ'_t, λu − w_t⟩ ≤ λ² f_T(u) + Σ_t ( η_t²‖ℓ'_t‖_{f_t,*}²/(2β_t) + f_t*(θ_t) − f_{t−1}*(θ_t) ).

Since ℓ_t(w_t) = 0 implies ℓ'_t = 0, and using (8),

    λ Σ_t η_t − λ Σ_t η_t ℓ_t(u) + Σ_t η_t ⟨ℓ'_t, w_t⟩ ≤ −Σ_t η_t ⟨ℓ'_t, λu − w_t⟩.

Dividing by λ and rearranging gives the first bound. The second bound is obtained by choosing the λ that makes equal the last two terms on the right-hand side of (14).

Lemma 6 For all x_1, …, x_T ∈ R^d let D_t = diag{A_t}, where A_0 = I and A_t = A_{t−1} + (1/r) x_t x_tᵀ for some r > 0. Then

    Σ_{t=1}^T x_tᵀD_t^{-1}x_t ≤ r Σ_{i=1}^d ln( (1/r) Σ_{t=1}^T x_{t,i}² + 1 ).

Proof Consider a sequence a_t ≥ 0 and define v_t = a_0 + Σ_{s=1}^t a_s with a_0 > 0. The concavity of the logarithm implies ln b ≤ ln a + (b − a)/a for all a, b > 0. Hence

    Σ_{t=1}^T a_t/v_t = Σ_{t=1}^T (v_t − v_{t−1})/v_t ≤ Σ_{t=1}^T ln(v_t/v_{t−1}) = ln(v_T/v_0) = ln( (a_0 + Σ_{t=1}^T a_t)/a_0 ).

Using the above and the definition of D_t, we obtain

    Σ_t x_tᵀD_t^{-1}x_t = Σ_t Σ_{i=1}^d x_{t,i}²/(1 + (1/r)Σ_{s=1}^t x_{s,i}²)
                      = r Σ_{i=1}^d Σ_t (x_{t,i}²/r)/(1 + (1/r)Σ_{s=1}^t x_{s,i}²)
                      ≤ r Σ_{i=1}^d ln( (1/r) Σ_{t=1}^T x_{t,i}² + 1 ).

We conclude the appendix by proving the results required to solve the implicit logarithmic equations of Section 6.3. We use the following fact of Orabona et al. (2012).
Lemma 7 Let a, x > 0 be such that x ≤ a ln x. Then, for all n > 1,

    x ≤ (n/(n−1)) a ln(na/e).

Corollary 2 For all a, b, c, d, x > 0 such that x ≤ a ln(bx + c) + d, we have

    x ≤ (n/(n−1)) a ln(nab/e) + (n/(n−1)) (d + c/b).

Corollary 3 For all a, b, c, d, x > 0 such that

    x ≤ √( a ln(bx + c) + d ),   (16)

we have

    x ≤ √( 2a ln( 8ab²/e + 2b²c + 2db² ) ) + √(2(c + d)).

Proof Assumption (16) implies

    x² ≤ a ln(bx + c) + d ≤ a ln( b(x + √(c + d)) + c ) + d.   (17)

Bounding the linear term inside the logarithm by a quadratic one, applying the n = 2 case of Corollary 2 to the resulting inequality in x², and repeatedly using the elementary inequalities ln y ≤ y/e and √(a + b) ≤ √a + √b yields the claim, concluding the proof.
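The telescoping inequality at the heart of the proof of Lemma 6, namely Σ_t a_t/v_t ≤ ln(v_T/a_0), can be sanity-checked numerically; the sequence below is arbitrary.

```python
# Numeric check of the telescoping inequality used in the proof of Lemma 6:
# with v_t = a_0 + a_1 + ... + a_t, one has sum_t a_t / v_t <= ln(v_T / a_0),
# because a_t / v_t <= ln(v_t) - ln(v_{t-1}) by concavity of the logarithm.
import math, random

random.seed(1)
a0 = 0.5
a = [random.uniform(0, 3) for _ in range(50)]

v, lhs = a0, 0.0
for at in a:
    v += at            # v_t = a_0 + sum of a_s up to t
    lhs += at / v      # accumulate a_t / v_t
rhs = math.log(v / a0)
gap = rhs - lhs        # should be non-negative
```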
References

K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order Perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 2006.

K. Crammer, M. Dredze, and F. Pereira. Exact convex confidence-weighted learning. Advances in Neural Information Processing Systems, 21:345–352, 2009.

K. Crammer, A. Kulesza, and M. Dredze. Adaptive regularization of weight vectors. Advances in Neural Information Processing Systems, 22:414–422, 2009.

M. Dredze, K. Crammer, and F. Pereira. Online confidence-weighted learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the 23rd Annual Conference on Learning Theory, pages 14–26, 2010.

Y. Freund and R. E. Schapire. Large margin classification using the Perceptron algorithm. Machine Learning, pages 277–296, 1999.

C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53(3):265–299, 2003.

E. Grave, G. Obozinski, and F. R. Bach. Trace Lasso: a trace norm regularization for correlated designs. Advances in Neural Information Processing Systems, 24, 2011.

S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. CoRR, 2009.

J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

J. Kivinen and M. K. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45(3):301–329, 2001.

J. Kivinen, M. K. Warmuth, and B. Hassibi. The p-norm generalization of the LMS algorithm for adaptive filtering. IEEE Transactions on Signal Processing, 54(5), 2006.
N. Littlestone. Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

F. Orabona, N. Cesa-Bianchi, and C. Gentile. Beyond logarithmic bounds in online learning. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, JMLR W&CP, 2012.

R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.

S. Shalev-Shwartz and S. M. Kakade. Mind the duality gap: Logarithmic regret algorithms for online optimization. Advances in Neural Information Processing Systems, 21, 2009.

S. Shalev-Shwartz and Y. Singer. A primal-dual perspective of online learning algorithms. Machine Learning Journal, 2007.

Y. Tsypkin. Adaptation and Learning in Automatic Systems. Academic Press, 1971.

V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.

M. K. Warmuth and A. K. Jagota. Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. In Electronic Proceedings of the 5th International Symposium on Artificial Intelligence and Mathematics, 1997.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010.
More information