Second-Order Non-Stationary Online Learning for Regression

Nina Vaits, Edward Moroshko, and Koby Crammer, Fellow, IEEE
arXiv:1303.0140v1 [cs.LG] 1 Mar 2013

Abstract—The goal of a learner, in standard online learning, is to have the cumulative loss not much larger compared with the best-performing function from some fixed class. Numerous algorithms were shown to have this gap arbitrarily close to zero, compared with the best function that is chosen off-line. Nevertheless, many real-world applications, such as adaptive filtering, are non-stationary in nature, and the best prediction function may drift over time. We introduce two novel algorithms for online regression, designed to work well in non-stationary environments. Our first algorithm performs adaptive resets to forget the history, while the second is last-step min-max optimal in the context of a drift. We analyze both algorithms in the worst-case regret framework and show that they maintain an average loss close to that of the best slowly changing sequence of linear functions, as long as the cumulative drift is sublinear. In addition, in the stationary case, when no drift occurs, our algorithms suffer logarithmic regret, as for previous algorithms. Our bounds improve over the existing ones, and simulations demonstrate the usefulness of these algorithms compared with other state-of-the-art approaches.

I. INTRODUCTION

We consider the classical problem of online learning for regression. On each iteration, an algorithm receives a new instance (for example, input from an array of antennas) and outputs a prediction of a real value (for example, distance to the source). The correct value is then revealed, and the algorithm suffers a loss based on both its prediction and the correct output value. In the past half a century many algorithms were proposed for this problem (see e.g. a comprehensive book [9]), some of which are able to achieve an average loss arbitrarily close to that of the best function in retrospect. Furthermore, such guarantees hold even if the input and output pairs are chosen in a fully adversarial manner with no distributional assumptions. Many of these algorithms exploit first-order information (e.g. gradients). Recently there is an increased amount of interest in algorithms that exploit second-order information. For example, the second-order
perceptron algorithm [8], confidence-weighted learning [?], [3], adaptive regularization of weights (AROW) [?], all designed for classification; and AdaGrad [4] and FTPRL [8] for general loss functions. Despite the extensive and impressive guarantees that can be made for algorithms in such settings, competing with the best fixed function is not always good enough. In many real-world applications, the true target function is not fixed, but is slowly changing over time. Consider a filter designed to cancel echoes in a hall. Over time, people enter and leave the hall, furniture is being moved, microphones are replaced and so on. When this drift occurs, the predictor itself must also change in order to remain relevant.

(Footnote: All authors are with the Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa 32000, Israel. Vaits and Moroshko have equal contribution to the paper. Manuscript received Month date, 2013; revised month date, 2013.)

With such properties in mind, we develop new learning algorithms, based on second-order quantities, designed to work with target drift. The goal of an algorithm is to maintain an average loss close to that of the best slowly changing sequence of functions, rather than compete well with a single function. We focus on problems for which this sequence consists only of linear functions. Most previous algorithms (e.g. [?], [3], [6], [7]) designed for this problem are based on first-order information, such as gradient descent, with additional control on the norm of the weight-vector used for prediction [6] or the number of inputs used to define it [6]. In Sec. II we review three second-order learning algorithms: the recursive least squares (RLS) [?] algorithm, the Aggregating Algorithm for regression (AAR) [?], [36], which can be shown to be derived based on a last-step min-max approach [6], and the AROWR algorithm [35], which is a modification of the AROW algorithm [?] for regression. All three algorithms obtain logarithmic regret in the stationary setting, although derived using different approaches, and they are not equivalent in general. In Sec. III we formally present the non-stationary setting, both in terms of algorithms and in terms of theoretical analysis. For the RLS algorithm, a variant called CR-RLS [?], [?], [3] for the non-stationary setting was described, yet not
analyzed, before. In Sec. IV we present two algorithms for the non-stationary setting that build on the other two algorithms. Specifically, in Sec. IV-A we extend the AROWR algorithm to the non-stationary setting, yielding an algorithm called ARCOR, for adaptive regularization with covariance reset. Similar to CR-RLS, ARCOR performs a step called covariance-reset, which resets the second-order information from time-to-time, yet this is done based on the properties of this covariance-like matrix, and not based on the number of examples observed, as in CR-RLS. In Sec. IV-B we derive a different algorithm based on the last-step min-max approach proposed by Forster [6] and later used [34] for online density estimation. On each iteration the algorithm makes the optimal min-max prediction with respect to the regret, assuming it is the last iteration. Yet, unlike previous work [6], it is optimal when a drift is allowed. As opposed to the derivation of the last-step min-max predictor for a fixed vector, the resulting optimization

problem is not straightforward to solve. We develop a dynamic program (a recursion) to solve this problem, which allows us to compute the optimal last-step min-max predictor. We call this algorithm LASER, for last step adaptive regressor algorithm. We conclude the algorithmic part in Sec. IV-C, in which we compare all non-stationary algorithms head-to-head, highlighting their similarities and differences. Additionally, after describing the details of our algorithms, we provide in Sec. V a comprehensive review of previous work that puts our contribution in perspective. Both algorithms reduce to their stationary counterparts when no drift occurs. We then move to Sec. VI, which summarizes our next contribution, stating and proving regret bounds for both algorithms. We analyse both algorithms in the worst-case regret setting and show that as long as the amount of average drift is sublinear, the average loss of both algorithms will converge to the average loss of the best sequence of functions. Specifically, we show in Sec. VI-A that the cumulative loss of ARCOR after observing T examples, denoted by L_T(ARCOR), is upper bounded by the cumulative loss of any sequence of weight-vectors {u_t}, denoted by L_T({u_t}), plus an additional term O(T^{1/2} V({u_t})^{1/2} log T), where V({u_t}) measures the differences (or variance) between consecutive weight-vectors of the sequence {u_t}. Later, we show in Sec. VI-B a similar bound for the loss of LASER, denoted by L_T(LASER), for which the second term is O(T^{2/3} V({u_t})^{1/3}). We emphasize that in the two bounds the measure V({u_t}) of differences between consecutive weight-vectors is not defined in the same way, and thus the bounds are not comparable in general. In Sec. VII we report results of simulations designed to highlight the properties of both algorithms, as well as the commonalities and differences between them. We conclude in Sec. VIII, and most of the technical proofs appear in the appendix. The ARCOR algorithm was presented in a shorter publication [35], together with its analysis and some of its details. The LASER algorithm and its analysis were also presented in a shorter version [30]. The contribution of this submission is three-fold. First, we provide a head-to-head comparison of three second-order algorithms for the stationary case. Second, we fill the gap of second-order algorithms for the
non-stationary case. Specifically, we add to the CR-RLS algorithm (which extends RLS) and design second-order algorithms for the non-stationary case and analyze them, building both on AROWR and AAR. Our algorithms are derived from principles different from each other, which is reflected in our analysis. Finally, we provide empirical evidence showing that under various conditions a different algorithm performs the best.

Some notation we use throughout the paper: for a symmetric matrix Σ we denote its jth eigenvalue by λ_j(Σ). Similarly, we denote its smallest eigenvalue by λ_min(Σ) = min_j λ_j(Σ), and its largest eigenvalue by λ_max(Σ) = max_j λ_j(Σ). For a vector u in R^d, we denote by ‖u‖ the ℓ2-norm of the vector. Finally, for y > 0 we define clip(x, y) = sign(x) min(|x|, y).

II. STATIONARY ONLINE LEARNING

We focus on the regression task, evaluated with the squared loss. Our algorithms are designed for the online setting and work in iterations (or rounds). On each round an online algorithm receives an input-vector x_t in R^d and predicts a real value ŷ_t in R. Then the algorithm receives a target label y_t in R associated with x_t, uses it to update its prediction rule, and proceeds to the next round. At each iteration, the performance of the algorithm is evaluated using the squared loss, ℓ_t(alg) = ℓ(y_t, ŷ_t) = (ŷ_t - y_t)^2. The cumulative loss suffered by the algorithm over T iterations is L_T(alg) = sum_{t=1}^T ℓ_t(alg). The goal of the algorithm is to have low cumulative loss compared to predictors from some class. A large body of work, which we adopt as well, is focused on linear prediction functions of the form f(x) = x^T u, where u in R^d is some weight-vector. We denote by ℓ_t(u) = (x_t^T u - y_t)^2 the instantaneous loss of a weight-vector u. The cumulative loss suffered by a fixed weight-vector u is L_T(u) = sum_{t=1}^T ℓ_t(u). The goal of the learning algorithm is to suffer low loss compared with the best linear function. Formally, we define the regret of an algorithm to be R_T = L_T(alg) - inf_u L_T(u). The goal of an algorithm is to have R_T = o(T), such that the average loss will converge to the average loss of the best linear function u. Numerous algorithms were developed for this problem; see a comprehensive review in the book of Cesa-Bianchi and Lugosi [9]. Among these, a few second-order online algorithms for regression were proposed in recent years, which we
summarize in Table I. One approach for online learning is to reduce the problem to consecutive batch problems, and specifically to use all previous examples to generate a classifier, which is used to predict the current example. The recursive least squares (RLS) [?] approach, for example, sets a weight-vector to be the solution of the following optimization problem,

  w_t = arg min_w sum_{i=1}^{t} r^{t-i} (y_i - w^T x_i)^2.   (1)

Since the last problem grows with time, the well-known recursive least squares (RLS) algorithm was developed to generate a solution recursively. The RLS algorithm maintains both a vector w_t and a positive semi-definite (PSD) matrix Σ_t. On each iteration, after making a prediction ŷ_t = x_t^T w_{t-1}, the algorithm receives the true label y_t and updates,

  w_t = w_{t-1} + ((y_t - x_t^T w_{t-1}) / (r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t   (2)
  Σ_t^{-1} = r Σ_{t-1}^{-1} + x_t x_t^T.   (3)

The update of the prediction vector w_t is additive, with the vector Σ_{t-1} x_t scaled by the error y_t - x_t^T w_{t-1} over the norm of the input measured using the norm defined by the matrix Σ_{t-1}, that is, r + x_t^T Σ_{t-1} x_t. The algorithm is summarized in the right column of Table I.
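One RLS round of (2)-(3) can be sketched in a few lines of NumPy (a minimal illustration, not the paper's implementation; as an implementation choice, the inverse update (3) is realized with the Sherman-Morrison identity so Σ itself is maintained and no matrix is explicitly inverted):

```python
import numpy as np

def rls_step(w, Sigma, x, y, r=1.0):
    """One RLS iteration: predict, then update w and the covariance-like
    matrix Sigma following (2)-(3)."""
    y_hat = x @ w                        # prediction made before y is revealed
    Sx = Sigma @ x
    denom = r + x @ Sx                   # r + x^T Sigma x
    w = w + (y - y_hat) / denom * Sx     # additive weight update (2)
    Sigma = (Sigma - np.outer(Sx, Sx) / denom) / r   # Sherman-Morrison form of (3)
    return y_hat, w, Sigma
```

With r = 1 this is plain online ridge-style averaging; r < 1 acts as a forgetting factor that keeps the eigenvalues of Σ^{-1} from growing without bound.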

The Aggregating Algorithm for regression (AAR) [?], [36], summarized in the middle column of Table I, was introduced by Vovk and is similar to the RLS algorithm, except that it shrinks its predictions. The AAR algorithm was shown to be last-step min-max optimal by Forster [6]: given a new input x_T, the algorithm predicts ŷ_T, which is the minimizer of the following problem,

  arg min_{ŷ_T} max_{y_T} [ sum_{t=1}^{T} (y_t - ŷ_t)^2 - inf_u ( b ‖u‖^2 + L_T(u) ) ].   (4)

Forster also proposed a simpler analysis with the same regret bound. Finally, the AROWR algorithm [35] is a modification of the AROW algorithm [?] for regression. In a nutshell, the AROW algorithm maintains a Gaussian distribution parameterized by a mean w_t in R^d and a full covariance matrix Σ_t in R^{d x d}. Intuitively, the mean w_t represents a current linear function, while the covariance matrix Σ_t captures the uncertainty in the linear function w_t. Given a new example (x_t, y_t), the algorithm uses its current mean to make a prediction ŷ_t = x_t^T w_{t-1}. AROWR then sets the new distribution to be the solution of the following optimization problem,

  arg min_{w, Σ} D_KL( N(w, Σ) ‖ N(w_{t-1}, Σ_{t-1}) ) + (1/(2r)) (y_t - w^T x_t)^2 + (1/(2r)) x_t^T Σ x_t.   (5)

This optimization problem is similar to the one of AROW [?] for classification, except that we use the square loss rather than the squared-hinge loss used in AROW. Intuitively, the optimization problem trades off between three requirements. The first term forces the parameters not to change much per example, as the entire learning history is encapsulated within them. The second term requires that the new vector w should perform well on the current instance, and finally, the third term reflects the fact that the uncertainty about the parameters reduces as we observe the current example x_t. The weight vector solving this optimization problem (details are given by Vaits and Crammer [35]) is

  w_t = w_{t-1} + ((y_t - x_t^T w_{t-1}) / (r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t,   (6)

and the optimal covariance matrix is

  Σ_t^{-1} = Σ_{t-1}^{-1} + (1/r) x_t x_t^T.   (7)

The algorithm is summarized in the left column of Table I. Comparing AROWR to RLS, we observe that while the update (6) of the weights is equivalent to the update (2) of RLS, the update (3) of the matrix for RLS is not equivalent to (7): in the former case the matrix goes via a multiplicative update as well as an additive one, while in (7) the update is only additive. The two updates are equivalent only by setting r = 1. Moving to AAR, we note that the update rules for w_t and Σ_t in AROWR and AAR are the same if we define Σ_t(AAR) = Σ_t(AROWR)/r, but AROWR does not shrink its predictions as AAR does. Thus all three algorithms are not equivalent, although very similar.
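The contrast between the two matrix updates is easiest to see side by side in code. Below is a hedged NumPy sketch of one AROWR round, (6)-(7); compared with the RLS step, the weight update is identical and only the covariance update differs (purely additive, no multiplication by r):

```python
import numpy as np

def arowr_step(w, Sigma, x, y, r=1.0):
    """One AROWR round following (6)-(7). Written with the
    Sherman-Morrison identity so no explicit matrix inversion is needed."""
    Sx = Sigma @ x
    denom = r + x @ Sx                        # r + x^T Sigma x
    w = w + (y - x @ w) / denom * Sx          # weight update (6), same as RLS
    Sigma = Sigma - np.outer(Sx, Sx) / denom  # additive inverse update (7)
    return w, Sigma
```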
III. NON-STATIONARY ONLINE LEARNING

All previous algorithms assume, both by design and by analysis, that the data is stationary. The analysis of all three algorithms compares their performance to that of a single fixed weight vector u, and all suffer regret that is logarithmic in T. We use an extended notion of evaluation, comparing our algorithms to a sequence of functions. We define the loss suffered by such a sequence to be

  L_T(u_1, ..., u_T) = L_T({u_t}) = sum_{t=1}^T ℓ_t(u_t),

and the regret is then defined to be

  R_T = L_T(alg) - inf_{u_1,...,u_T} L_T({u_t}).   (8)

We focus on algorithms that are able to compete against sequences of weight-vectors (u_1, ..., u_T) in R^d x ... x R^d, where u_t is used to make a prediction for the tth example (x_t, y_t). Clearly, with no restriction over the set {u_t}, the right term of the regret can easily be made zero by setting u_t = x_t y_t / ‖x_t‖^2, which implies ℓ_t(u_t) = 0 for all t. Thus, in the analysis below we will make use of the total drift of the weight-vectors, defined to be

  V_P = V_P({u_t}) = sum_{t=1}^{T-1} ‖u_t - u_{t+1}‖^P,   where P in {1, 2}.

For all three algorithms, as was also observed previously in the context of CW [3], AROW [?], AdaGrad [4] and FTPRL [8], the matrix Σ_t can be interpreted as an adaptive learning rate. As these algorithms process more examples, that is, for larger values of t, the eigenvalues of the matrix Σ_t^{-1} increase, the eigenvalues of the matrix Σ_t decrease, and we get that the rate of updates gets smaller, since the additive term Σ_t x_t gets smaller. As a consequence, the algorithms will gradually stop updating using current instances which lie in the subspace of examples that were previously observed numerous times. This property leads to a very fast convergence in the stationary case. However, when we allow these algorithms to be compared with a sequence of weight-vectors, each applied to a different input example, or equivalently, when there is a drift or shift of a good prediction vector, these algorithms will perform poorly, as they will converge and not be able to adapt to the non-stationary nature of the data. This phenomenon motivated the proposal of the CR-RLS algorithm [?], [?], [3], which resets the covariance matrix every fixed number of input examples, causing the algorithm not to converge (or get stuck). The pseudo-code of the CR-RLS algorithm is given in the right column of Table II. The only difference
of CR-RLS from RLS is that after updating the matrix Σ_t, the algorithm checks whether T_0 examples (a predefined natural number) were observed since the last restart, and if this is the case, it sets the matrix to be the identity matrix. Clearly, if T_0 = ∞, the CR-RLS algorithm reduces to the RLS algorithm.
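A compact sketch of the CR-RLS loop just described (illustrative NumPy only, following the reset-to-identity variant of Table II; the arrays X, Y and the default T0 = 50 are placeholders, not values from the paper):

```python
import numpy as np

def cr_rls(X, Y, r=1.0, T0=50):
    """CR-RLS: standard RLS updates, with the covariance-like matrix reset
    to the identity every T0 examples so the algorithm keeps adapting."""
    d = X.shape[1]
    w, Sigma = np.zeros(d), np.eye(d)
    preds = []
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        preds.append(x @ w)                  # predict with w_{t-1}
        Sx = Sigma @ x
        denom = r + x @ Sx
        w = w + (y - x @ w) / denom * Sx
        Sigma = (Sigma - np.outer(Sx, Sx) / denom) / r
        if t % T0 == 0:                      # covariance reset: drop history
            Sigma = np.eye(d)
    return np.array(preds), w
```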

TABLE I
ALGORITHMS FOR THE STATIONARY SETTING (columns: AROWR | AAR | RLS)

Parameters: 0 < r | 0 < b | 0 < r
Initialize: w_0 = 0, Σ_0 = I | w_0 = 0, Σ_0 = (1/b) I | w_0 = 0, Σ_0 = I
For t = 1, ..., T:
  Receive an instance x_t
  Output prediction: ŷ_t = x_t^T w_{t-1} | ŷ_t = x_t^T w_{t-1} / (1 + x_t^T Σ_{t-1} x_t) | ŷ_t = x_t^T w_{t-1}
  Receive a correct label y_t
  Update Σ_t: Σ_t^{-1} = Σ_{t-1}^{-1} + (1/r) x_t x_t^T | Σ_t^{-1} = Σ_{t-1}^{-1} + x_t x_t^T | Σ_t^{-1} = r Σ_{t-1}^{-1} + x_t x_t^T
  Update w_t: w_t = w_{t-1} + ((y_t - x_t^T w_{t-1})/(r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t | w_t = w_{t-1} + ((y_t - x_t^T w_{t-1})/(1 + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t | w_t = w_{t-1} + ((y_t - x_t^T w_{t-1})/(r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t
Output: w_t, Σ_t
Extension to non-stationary setting: ARCOR (Sec. IV-A below) | LASER (Sec. IV-B below) | CR-RLS [?], [?], [3]
Analysis: yes, Sec. VI-A below | yes, Sec. VI-B below | no

IV. ALGORITHMS FOR NON-STATIONARY REGRESSION

In this work we fill the gap and propose extensions to the non-stationary setting for the two other algorithms in Table I. Similar to CR-RLS, both algorithms modify the matrix Σ_t to prevent its eigenvalues from shrinking to zero. The first algorithm, described in Sec. IV-A, extends AROWR to the non-stationary setting and is similar in spirit to CR-RLS, yet the restart operations it performs depend on the spectral properties of the covariance matrix, rather than on the time index. Additionally, this algorithm performs a projection of the weight vector into a predefined ball. A similar technique was used in first-order algorithms by Herbster and Warmuth [3], and by Kivinen and Warmuth [5]. Both steps are motivated by the design and analysis of AROWR. Its design is composed of solving small optimization problems defined in (5), one per input example. The non-stationary version performs explicit corrections to its update, in order to prevent the covariance matrix from shrinking to zero and the weight-vector from growing too fast. The second algorithm, described in Sec. IV-B, is based on a last-step min-max prediction principle and objective, where we replace L_T(u) in (4) with L_T({u_t}), with some additional modifications preventing the solution from being degenerate. Here the algorithmic modifications from the original AAR algorithm are implicit, and are due to the modifications of the objective. The resulting algorithm smoothly interpolates the covariance matrix with a unit matrix.

A. ARCOR: Adaptive Regularization of Weights for Regression with COvariance Reset

Our first algorithm is based on AROWR. We propose two modifications to (6) and (7), which in combination overcome the problem that the algorithm's learning rate gradually goes to
zero. The modified algorithm operates on segments of the input sequence. In each segment, indexed by i, the algorithm checks whether the lowest eigenvalue of Σ_t is greater than a given lower bound Λ_i. Once the lowest eigenvalue of Σ_t is smaller than Λ_i, the algorithm resets Σ_t = I and updates the value of the lower bound to Λ_{i+1}. Formally, the algorithm uses the update (7) to compute an intermediate candidate for Σ_t, denoted by

  Σ̃_t^{-1} = Σ_{t-1}^{-1} + (1/r) x_t x_t^T.   (9)

If indeed Σ̃_t ⪰ Λ_i I, then it sets Σ_t = Σ̃_t; otherwise it sets Σ_t = I and the segment index is increased by 1. Additionally, before our modification, the norm of the weight vector w_t did not increase much, as the effective learning rate (the matrix Σ_t) went to zero. After our update, as the learning rate is effectively bounded from below, the norm of w_t may increase too fast, which in turn would cause a low update-rate on non-stationary inputs. We thus employ an additional modification, which is exploited by the analysis. After updating the mean w_t as in (6),

  w̃_t = w_{t-1} + ((y_t - x_t^T w_{t-1}) / (r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t,   (10)

we project it onto a ball B around the origin of radius R_B, using a Mahalanobis distance.
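Putting the two modifications together, an illustrative NumPy sketch of the ARCOR loop follows (the reset test on the candidate matrix of (9), plus the projection step; the polynomial schema Λ_i = (i+1)^{-1/2} and all constants are assumptions chosen for the example, and the projection helper implements the binary-search construction described in the text):

```python
import numpy as np

def project_ball(w, Sigma, RB, tol=1e-10):
    """Mahalanobis projection of w onto the ball ||w|| <= RB, via an
    eigendecomposition of Sigma and a binary search for the Lagrange
    multiplier alpha. NumPy's eigh gives Sigma = V diag(lam) V^T."""
    if np.linalg.norm(w) <= RB:
        return w                               # KKT: alpha = 0, no change
    lam, V = np.linalg.eigh(Sigma)
    u = V.T @ w
    # ||(I + alpha*Lam)^{-1} u|| decreases in alpha; search alpha in [0, a]
    lo, hi = 0.0, (np.linalg.norm(u) / RB - 1.0) / lam.min()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(u / (1.0 + mid * lam)) > RB:
            lo = mid
        else:
            hi = mid
    return V @ (u / (1.0 + hi * lam))

def arcor(X, Y, r=1.0, RB=5.0, lambdas=lambda i: (i + 1) ** -0.5):
    """ARCOR sketch: AROWR updates plus (a) a reset Sigma = I whenever the
    smallest eigenvalue of the candidate matrix drops below Lambda_i, and
    (b) projection of the weight vector onto a ball of radius RB."""
    d = X.shape[1]
    w, Sigma, i = np.zeros(d), np.eye(d), 1
    preds = []
    for x, y in zip(X, Y):
        preds.append(x @ w)
        Sx = Sigma @ x
        denom = r + x @ Sx
        cand = Sigma - np.outer(Sx, Sx) / denom        # candidate of (9)
        if np.linalg.eigvalsh(cand).min() >= lambdas(i):
            Sigma_new = cand
        else:
            Sigma_new = np.eye(d)                      # covariance reset
            i += 1
        w = w + (y - x @ w) / denom * Sx               # update (10)
        w = project_ball(w, Sigma_new, RB)             # projection step
        Sigma = Sigma_new
    return np.array(preds), w
```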

Formally, we define the function proj(w̃, Σ, R_B) to be the solution of the following optimization problem,

  arg min_{‖w‖ ≤ R_B} (w - w̃)^T Σ^{-1} (w - w̃).   (11)

We write the Lagrangian,

  L = (w - w̃)^T Σ^{-1} (w - w̃) + α (‖w‖^2 - R_B^2).

Setting the gradient with respect to w to zero we get Σ^{-1}(w - w̃) + α w = 0. Solving for w we get w = (α I + Σ^{-1})^{-1} Σ^{-1} w̃ = (I + α Σ)^{-1} w̃. From the KKT conditions we get that if ‖w̃‖ ≤ R_B then α = 0 and w = w̃. Otherwise, α is the unique positive scalar that satisfies ‖(I + α Σ)^{-1} w̃‖ = R_B. The value of α can be found using a binary search and an eigen-decomposition of the matrix Σ. We write explicitly Σ = V^T Λ V for a diagonal matrix Λ. Denoting u = V w̃, we rewrite the last equation as ‖(I + α Λ)^{-1} u‖ = R_B. We thus wish to find α such that sum_{j=1}^d u_j^2 (1 + α Λ_{j,j})^{-2} = R_B^2. This can be done using a binary search for α in [0, a], where a = (‖u‖/R_B - 1)/λ_min(Λ). To summarize, the projection step can be performed in time cubic in d and logarithmic in R_B and Λ. We call the algorithm ARCOR, for adaptive regularization with covariance reset. A pseudo-code of the algorithm is summarized in the left column of Table II. We defer a comparison of ARCOR with CR-RLS to after the presentation of our second algorithm.

B. Last-Step Min-Max Algorithm for the Non-Stationary Setting

Our second algorithm is based on the last-step min-max predictor proposed by Forster [6] and later modified by Moroshko and Crammer [9] to obtain sub-logarithmic regret in the stationary case. On each round, the algorithm predicts as if it is the last round, and assumes a worst-case choice of y_T given the algorithm's prediction. We extend this rule to the non-stationary setting given in (4), and re-define the last-step min-max predictor ŷ_T to be

  arg min_{ŷ_T} max_{y_T} [ sum_{t=1}^T (y_t - ŷ_t)^2 - min_{u_1,...,u_T} Q_T(u_1, ..., u_T) ],   (12)

where, for some positive constants b, c,

  Q_T(u_1, ..., u_T) = b ‖u_1‖^2 + c sum_{s=1}^{T-1} ‖u_{s+1} - u_s‖^2 + sum_{s=1}^T (y_s - u_s^T x_s)^2.

The first term of (12) is the loss suffered by the algorithm, while Q_T(u_1, ..., u_T) is a sum of the loss suffered by some sequence of linear functions (u_1, ..., u_T) and a penalty for consecutive pairs that are far from each other, and for the norm of the first vector being far from zero. Note that y_T and ŷ_T serve both as quantifiers of the max and min operators, respectively, and as the optimal arguments of this optimization problem. We develop the algorithm by solving the three optimization problems in (12): first, minimizing the inner term min_{u_1,...,u_T} Q_T(u_1, ..., u_T); then maximizing over y_T; and finally, minimizing over
ŷ_T. We start with the inner term, for which we define an auxiliary function,

  P_t(u) = min_{u_1,...,u_{t-1}} Q_t(u_1, ..., u_{t-1}, u),

which clearly satisfies

  min_{u_1,...,u_t} Q_t(u_1, ..., u_t) = min_u P_t(u).

The following lemma states a recursive form of the function-sequence P_t(u).

Lemma 1: For t = 2, 3, ...

  P_1(u) = Q_1(u),
  P_t(u) = min_{u'} [ P_{t-1}(u') + c ‖u - u'‖^2 ] + (y_t - u^T x_t)^2.

The proof appears in Sec. A. Using Lemma 1 we write explicitly the function P_t(u).

Lemma 2: The following equality holds,

  P_t(u) = u^T D_t u - 2 u^T e_t + f_t,   (13)

where

  D_1 = b I + x_1 x_1^T,   D_t = (D_{t-1}^{-1} + c^{-1} I)^{-1} + x_t x_t^T   (14)
  e_1 = y_1 x_1,   e_t = (I + c^{-1} D_{t-1})^{-1} e_{t-1} + y_t x_t   (15)
  f_1 = y_1^2,   f_t = f_{t-1} - e_{t-1}^T (c I + D_{t-1})^{-1} e_{t-1} + y_t^2.   (16)

Note that D_t in R^{d x d} is a positive definite matrix, e_t in R^d and f_t in R. The proof appears in Sec. B. From Lemma 2 we conclude that

  min_{u_1,...,u_T} Q_T(u_1, ..., u_T) = min_u P_T(u) = min_u ( u^T D_T u - 2 u^T e_T + f_T ) = -e_T^T D_T^{-1} e_T + f_T.   (17)

Substituting (17) back in (12), we get that the last-step min-max predictor is given by

  ŷ_T = arg min_{ŷ_T} max_{y_T} [ sum_{t=1}^T (y_t - ŷ_t)^2 + e_T^T D_T^{-1} e_T - f_T ].   (18)

Since e_T depends on y_T, we substitute (15) in the second term of (18),

  e_T^T D_T^{-1} e_T = ( (I + c^{-1} D_{T-1})^{-1} e_{T-1} + y_T x_T )^T D_T^{-1} ( (I + c^{-1} D_{T-1})^{-1} e_{T-1} + y_T x_T ).   (19)

TABLE II
THE ARCOR, LASER AND CR-RLS ALGORITHMS (columns: ARCOR | LASER | CR-RLS)

Parameters: 0 < r, R_B, a sequence Λ_1 ≥ Λ_2 ≥ ... > 0 | 0 < b < c | 0 < r, T_0 in N
Initialize: w_0 = 0, Σ_0 = I, i = 1 | w_0 = 0, Σ_0 = ((c - b)/(bc)) I | w_0 = 0, Σ_0 = I
For t = 1, ..., T:
  Receive an instance x_t
  Output prediction: ŷ_t = x_t^T w_{t-1} | ŷ_t = x_t^T w_{t-1} / (1 + x_t^T (Σ_{t-1} + c^{-1} I) x_t) | ŷ_t = x_t^T w_{t-1}
  Receive a correct label y_t
  Update Σ_t: Σ̃_t^{-1} = Σ_{t-1}^{-1} + (1/r) x_t x_t^T; if Σ̃_t ⪰ Λ_i I set Σ_t = Σ̃_t, else set Σ_t = I and i = i + 1 | Σ_t^{-1} = (Σ_{t-1} + c^{-1} I)^{-1} + x_t x_t^T | Σ̃_t^{-1} = r Σ_{t-1}^{-1} + x_t x_t^T; if mod(t, T_0) ≠ 0 set Σ_t = Σ̃_t, else set Σ_t = I
  Update w_t: w̃_t = w_{t-1} + ((y_t - x_t^T w_{t-1})/(r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t; w_t = proj(w̃_t, Σ_t, R_B) | w_t = w_{t-1} + ((y_t - x_t^T w_{t-1})/(1 + x_t^T (Σ_{t-1} + c^{-1} I) x_t)) (Σ_{t-1} + c^{-1} I) x_t | w_t = w_{t-1} + ((y_t - x_t^T w_{t-1})/(r + x_t^T Σ_{t-1} x_t)) Σ_{t-1} x_t
Output: w_t, Σ_t

Substituting (19) and (16) in (18) and omitting terms that do not depend explicitly on y_T and ŷ_T, we get

  ŷ_T = arg min_{ŷ_T} max_{y_T} [ (y_T - ŷ_T)^2 + 2 y_T x_T^T D_T^{-1} ẽ_{T-1} + y_T^2 x_T^T D_T^{-1} x_T - y_T^2 ]
      = arg min_{ŷ_T} max_{y_T} [ ŷ_T^2 - 2 y_T ( ŷ_T - x_T^T D_T^{-1} ẽ_{T-1} ) + y_T^2 x_T^T D_T^{-1} x_T ],   (20)

where ẽ_{T-1} = (I + c^{-1} D_{T-1})^{-1} e_{T-1}. The last expression is strictly convex in y_T, and thus the inner maximum is not bounded. To solve this, we follow an approach used by Forster in a different context [6]. In order to make the optimal value bounded, we assume that the adversary can only choose labels from a bounded set, y_T in [-Y, Y]. Thus, since the optimal value is y_T in {+Y, -Y}, the optimal solution of (20) over y_T is given by

  ŷ_T = arg min_{ŷ_T} [ ŷ_T^2 + 2 Y | ŷ_T - x_T^T D_T^{-1} ẽ_{T-1} | + Y^2 x_T^T D_T^{-1} x_T ].   (21)

This problem is of a similar form to the one discussed by Forster [6], from which we get the optimal solution,

  ŷ_T = clip( x_T^T D_T^{-1} ẽ_{T-1}, Y ).   (22)

The optimal solution depends explicitly on the bound Y, and as its value is not known, we thus ignore it, and define the output of the algorithm to be

  ŷ_T = x_T^T D_T^{-1} (I + c^{-1} D_{T-1})^{-1} e_{T-1} = x_T^T D_T^{-1} ẽ_{T-1}.

We call the algorithm LASER, for last step adaptive regressor algorithm. Clearly, for c = ∞ the LASER algorithm reduces to the AAR algorithm. Similar to CR-RLS and ARCOR, this algorithm can also be expressed in terms of a weight-vector w_t and a PSD matrix Σ_t, by denoting w_t = D_t^{-1} e_t and Σ_t = D_t^{-1}. The algorithm is summarized in the middle column of Table II.

C. Discussion

Table II enables us to compare the three algorithms head-to-head. All algorithms perform linear predictions, and then update the prediction vector w_t and the matrix Σ_t. CR-RLS and ARCOR are more similar to each other: both stem from a stationary algorithm, and perform resets from time-to-time. For CR-RLS the reset is performed every fixed number of time steps, while for
ARCOR it is performed when the eigenvalues of the matrix (the effective learning rate) are too small. ARCOR also performs a projection step, which is motivated to ensure that the weight-vector does not grow too much, and it is used explicitly in the analysis below. Note that CR-RLS (as well as RLS) also uses a forgetting factor if r < 1. Our second algorithm, LASER, controls the covariance matrix in a smoother way: on each iteration it interpolates the covariance matrix with the identity matrix before adding x_t x_t^T.
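For concreteness, the LASER recursions (14)-(15) and the (optionally clipped) prediction can be sketched as follows (a NumPy illustration under the constraint 0 < b < c from Table II; D_0 is chosen so that D_1 = b I + x_1 x_1^T):

```python
import numpy as np

def laser(X, Y, b=1.0, c=100.0, clip_Y=None):
    """LASER sketch: maintain D_t and e_t via (14)-(15) and predict
    y_hat_t = x_t^T D_t^{-1} e~_{t-1}, with e~ = (I + D/c)^{-1} e.
    If clip_Y is given, predictions are clipped to [-clip_Y, clip_Y]."""
    d = X.shape[1]
    D = (b * c / (c - b)) * np.eye(d)   # D_0, so that D_1 = b I + x_1 x_1^T
    e = np.zeros(d)
    preds = []
    for x, y in zip(X, Y):
        D_tilde = np.linalg.inv(np.linalg.inv(D) + np.eye(d) / c)  # (D^{-1}+c^{-1}I)^{-1}
        e_tilde = np.linalg.solve(np.eye(d) + D / c, e)            # (I+c^{-1}D)^{-1} e
        D = D_tilde + np.outer(x, x)                               # recursion (14)
        y_hat = x @ np.linalg.solve(D, e_tilde)                    # min-max prediction
        if clip_Y is not None:
            y_hat = np.sign(y_hat) * min(abs(y_hat), clip_Y)       # clip(., Y)
        preds.append(float(y_hat))
        e = e_tilde + y * x                                        # recursion (15)
    return np.array(preds)
```

As a sanity check, taking c very large recovers the AAR predictions, matching the remark that LASER reduces to AAR for c = ∞.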

Note that if λ is an eigenvalue of Σ_{t-1}^{-1}, then λc/(λ + c) < λ is an eigenvalue of (Σ_{t-1} + c^{-1} I)^{-1}. Thus the algorithm implicitly reduces the eigenvalues of the inverse covariance and increases the eigenvalues of the covariance. Finally, all three algorithms can be combined with Mercer kernels, as they employ only sums of inner- and outer-products of their inputs. This allows them to perform non-linear predictions, similarly to SVMs.

V. RELATED WORK

There is a large body of research in online learning for regression problems. Almost half a century ago, Widrow and Hoff [37] developed a variant of the least mean squares (LMS) algorithm for adaptive filtering and noise reduction. The algorithm was further developed and analyzed extensively, for example by Feuer [5]. The normalized least mean squares filter (NLMS) [3], [4] builds on LMS and behaves better with respect to scaling of the input. The recursive least squares algorithm (RLS) [?] is the closest to our algorithms in the signal processing literature; it also maintains a weight-vector and a covariance-like matrix, which is positive semi-definite (PSD), that is used to re-weight inputs. In the machine learning literature the problem of online regression was studied extensively, and clearly we cannot cover all the relevant work. Cesa-Bianchi et al. [7] studied gradient descent based algorithms for regression with the squared loss. Kivinen and Warmuth [5] proposed various generalizations for general regularization functions. We refer the reader to a comprehensive book on the subject [9]. Foster [7] studied an online version of the ridge regression algorithm in the worst-case setting. Vovk [?] proposed a related algorithm called the Aggregating Algorithm (AA), which was later applied to the problem of linear regression with square loss [36]. Forster [6] simplified the regret analysis for this problem. Both algorithms employ second-order information. ARCOR for the separable case is very similar to these algorithms, although it has an alternative derivation. Recently, a few algorithms were proposed either for classification [8], [?], [3] or for general loss functions [4], [8] in the online convex programming framework. AROWR [35] shares the same design principles as AROW [?], yet it is aimed at regression. The ARCOR algorithm takes AROWR one step further and has two
important modifications which make it work in drifting or shifting settings. These modifications make the analysis more complex than that of AROW. Two of the approaches used in previous algorithms for the non-stationary setting are bounding the weight vector and covariance reset. Bounding the weight vector was performed either by projecting it into a bounded set [3], shrinking it by multiplication [6], or subtracting previously seen examples [6]. These three methods (or at least most of their variants) can be combined with kernel operators, and in fact, the last two approaches were designed and motivated by kernels. The covariance reset RLS algorithm (CR-RLS) [?], [?], [3] was designed for adaptive filtering. CR-RLS makes a covariance reset every fixed number of data points, while ARCOR performs resets based on the actual properties of the data, namely the eigenspectrum of the covariance matrix. Furthermore, as far as we know, there is no analysis in the mistake bound model for this algorithm. Both ARCOR and CR-RLS are motivated by the property that the covariance matrix goes to zero and becomes rank deficient. In both algorithms, the information encapsulated in the covariance matrix is lost after resets. In rapidly varying environments, like a wireless channel, this loss of memory can be beneficial, as previous contributions to the covariance matrix may have little correlation with the current structure. Recent versions of CR-RLS [9], [33] employ covariance reset to obtain numerically stable computations. The ARCOR algorithm combines both techniques with online learning that employs a second-order algorithm for regression. In this aspect we have the best of all worlds: a fast convergence rate due to the usage of second-order information, and the ability to adapt in non-stationary environments due to projection and resets. LASER is simpler than all these algorithms, as it controls the increase of the eigenvalues of the covariance matrix implicitly rather than explicitly, by averaging with a fixed diagonal matrix (see (14)), and it does not involve projection steps. The Kalman filter [4] and the H-infinity algorithm (e.g. the work of Simon [3]), designed for filtering, take a similar approach, yet the exact algebraic form is different. The derivation of the LASER algorithm in this work shares similarities with the work of Forster [6] and the work of Moroshko and Crammer [9].
These algorithms are motivated by the last-step min-max predictor. Yet, the algorithms of Forster and of Moroshko and Crammer are designed for the stationary setting, while LASER is primarily designed for the non-stationary setting. Moroshko and Crammer [9] also discussed a weak variant of the non-stationary setting, where the complexity is measured by the total distance from a reference vector ū, rather than by the total distance between consecutive vectors as in this paper, which is more relevant to non-stationary problems.

VI. REGRET BOUNDS

We now analyze our algorithms in the non-stationary case, upper bounding the regret using more than a single comparison vector. Specifically, our goal is to prove bounds that would hold uniformly for all inputs, and are of the form

  L_T(alg) ≤ L_T({u_t}) + α T^{1-γ} V_P^γ + β,

for either P = 1 or P = 2, a constant γ, and some functions α, β that may depend implicitly on other quantities of the problem. Specifically, in the next section we show that under a particular choice of Λ_i = Λ_i(V_1) for the ARCOR algorithm, its regret is bounded by

  L_T(ARCOR) ≤ L_T({u_t}) + O( (T V_1)^{1/2} log T ).

Additionally, in Sec. VI-B, we show that under a proper choice of the constant c = c(V_2), the regret of LASER is bounded by

  L_T(LASER) ≤ L_T({u_t}) + O( T^{2/3} V_2^{1/3} ).

The two bounds are not comparable in general. For example, assume a constant instantaneous drift, ‖u_{t+1} - u_t‖ = ν for some constant value ν. In this case the variance and squared variance are V_1 = T ν and V_2 = T ν^2. The bound of ARCOR becomes T ν^{1/2} log T, while the bound of LASER becomes T ν^{2/3}. The bound of ARCOR is larger if log^6 T > ν, and the bound of LASER is larger in the opposite case. Another example is a polynomial decay of the drift, ‖u_{t+1} - u_t‖ ≤ t^{-κ} for some κ > 0. In this case, for κ < 1 we get V_1 = sum_t t^{-κ} ≤ (T^{1-κ} - κ)/(1 - κ), and for κ = 1 we get V_1 ≤ log T + 1. For LASER we have, for κ < 0.5, V_2 = sum_t t^{-2κ} ≤ (T^{1-2κ} - 2κ)/(1 - 2κ), and for κ = 0.5 we get V_2 ≤ log T + 1. Asymptotically, ARCOR outperforms LASER roughly when κ ≥ 0.7.

Herbster and Warmuth [3] developed shifting bounds for general gradient descent algorithms with projection of the weight-vector using the Bregman divergence. In their bounds, there is a factor greater than 1 multiplying the term L_T({u_t}), leading to a small regret only when the data is close to being realizable with linear models. Busuttil and Kalnishkan [5] developed a variant of the Aggregating Algorithm [?] for the non-stationary setting. However, to have sublinear regret they require a strong assumption on the drift, V_2 = o(1), while we require only V_2 = o(T) for LASER or V_1 = o(T) for ARCOR.

A. Analysis of the ARCOR algorithm

Let us define additional notation that we will use in our bounds. We denote by t_i the example index for which a restart was performed for the ith time, that is, Σ_{t_i} = I for all i. We denote by n the total number of restarts, or intervals. We denote by T_i = t_i - t_{i-1} the number of examples between two consecutive restarts. Clearly, sum_{i=1}^n T_i = T. Finally, we denote by Σ_i = Σ̃_{t_i} the covariance matrix just before the ith restart, and we note that it depends on exactly T_i examples observed since the previous restart.

In what follows we compare the performance of the ARCOR algorithm to the performance of a sequence of weight vectors u_t in R^d, all of which are of bounded norm R_B. In other words, all the vectors u_t belong to B. We break the proof into four steps. In the first step (Theorem 3) we bound the regret when the algorithm is executed with some values of the parameters {Λ_i} and the resulting covariance matrices. In the
second step, summarized in Corollary 4, we remove the dependencies on the covariance matrices by taking a worst-case bound. In the third step, summarized in Lemma 5, we upper bound the total number of switches n given the parameters {Λ_i}. Finally, in Corollary 6 we provide the regret bound for a specific choice of the parameters. We now move to state the first theorem.

Theorem 3: Assume that the ARCOR algorithm is run with an input sequence (x_1, y_1), ..., (x_T, y_T). Assume that all the inputs are upper bounded by unit norm, ‖x_t‖ ≤ 1, and that the outputs are bounded by Y = max_t |y_t|. Let u_t be any sequence of bounded weight vectors, ‖u_t‖ ≤ R_B. Then the cumulative loss is bounded by

  L_T(ARCOR) ≤ L_T({u_t}) + (2 R_B r / Λ_n) sum_t ‖u_t - u_{t-1}‖ + r u_T^T Σ_T^{-1} u_T + (R_B^2 + Y^2) sum_{i=1}^n log det( Σ_i^{-1} ),

where n is the number of covariance restarts and Σ_i is the value of the covariance matrix just before the ith restart. The proof appears in Sec. C.

Note that the number of restarts n is not fixed, but depends both on the total number of examples T and on the scheme used to set the values of the lower bounds Λ_i of the eigenvalues. In general, the lower the values of Λ_i are, the smaller the number of covariance restarts that occur, yet the larger the value of the second term of the bound, which scales inversely proportional to Λ_n. A more precise statement is given in the next corollary.

Corollary 4: Assume that the ARCOR algorithm made n restarts. Under the conditions of Theorem 3 we have

  L_T(ARCOR) ≤ L_T({u_t}) + (2 R_B r / Λ_n) sum_t ‖u_t - u_{t-1}‖ + r u_T^T Σ_T^{-1} u_T + (R_B^2 + Y^2) d n log( 1 + T/(n r d) ).

Proof: By definition we have Σ_i^{-1} = I + (1/r) sum_t x_t x_t^T, where the sum runs over the T_i examples of the ith interval. Denote the eigenvalues of Σ_i^{-1} - I = (1/r) sum_t x_t x_t^T by λ_1, ..., λ_d. Since ‖x_t‖ ≤ 1, their sum is at most T_i / r. We use the concavity of the log function to bound

  log det( Σ_i^{-1} ) = sum_{j=1}^d log( 1 + λ_j ) ≤ d log( 1 + T_i/(r d) ).

We use concavity again to bound the sum,

  sum_{i=1}^n log det( Σ_i^{-1} ) ≤ sum_{i=1}^n d log( 1 + T_i/(r d) ) ≤ d n log( 1 + T/(n r d) ),

where we used the fact that sum_i T_i = T. Substituting the last inequality in Theorem 3, as well as using the monotonicity of the coefficients, Λ_i ≥ Λ_n for all i ≤ n, yields the desired bound.

Implicitly, the drift term and the last term of the bound have opposite dependence on n. The drift term is decreasing with n: if n is small it means that the lower bound Λ_n is very low (otherwise we would make many restarts), and thus 1/Λ_n is large. The last term is increasing with n. We now make this implicit dependence explicit. Our goal is to bound the number
of restarts n as a function of the number of examples T. This depends on the exact

sequence of values Λ_i used. The following lemma provides a bound on n given a specific sequence of Λ_i.

Lemma 5: Assume that the ARCOR algorithm is run with some sequence of Λ_i. Then the number of restarts is upper bounded by

  n ≤ max{ N in N : r sum_{i=1}^N ( Λ_{i+1}^{-1} - 1 ) ≤ T }.

Proof: Since T = sum_{i=1}^n T_i, the number of restarts is maximized when the number of examples between restarts is minimized. We now prove a lower bound on T_i for all i = 1, ..., n. A restart occurs for the ith time when the smallest eigenvalue of Σ̃_t is smaller, for the first time, than Λ_{i+1}. As before, by definition, Σ̃_t^{-1} = I + (1/r) sum_l x_l x_l^T, where the sum runs over the examples of the current interval. By a result in matrix analysis [8, Theorem 8.8], there exists a matrix A in R^{d x T_i}, with a_{k,l} ≥ 0 and sum_k a_{k,l} ≤ 1 for l = 1, ..., T_i, such that the kth eigenvalue λ_k of Σ̃_t^{-1} equals λ_k = 1 + (1/r) sum_l a_{k,l}. The value of T_i is defined as the first time when the largest eigenvalue of Σ̃_t^{-1} hits Λ_{i+1}^{-1}. Formally, we get the following lower bound on T_i,

  min_{ {a_{k,l}}, s } s   such that   max_k ( 1 + (1/r) sum_{l=1}^s a_{k,l} ) ≥ Λ_{i+1}^{-1},
  a_{k,l} ≥ 0 for k = 1, ..., d and l = 1, ..., s,   sum_k a_{k,l} ≤ 1 for l = 1, ..., s.

For a fixed value of s, the maximal value of max_k ( 1 + (1/r) sum_{l=1}^s a_{k,l} ) is obtained when all the mass is concentrated in one value of k, that is, a_{k,l} = 1 for k = k_0 and a_{k,l} = 0 otherwise. In this case max_k ( 1 + (1/r) sum_{l=1}^s a_{k,l} ) = 1 + s/r, and the lower bound is obtained when 1 + s/r = Λ_{i+1}^{-1}. Solving for s, we get that the shortest possible length of the ith interval is bounded by

  T_i ≥ r ( Λ_{i+1}^{-1} - 1 ).

Summing the last inequality we get

  T = sum_{i=1}^n T_i ≥ r sum_{i=1}^n ( Λ_{i+1}^{-1} - 1 ).

Thus, the number of restarts is upper bounded by the maximal value of n that satisfies the last inequality.

We now prove a bound for a specific choice of the parameters {Λ_i}, namely polynomial decay, Λ_i = (i + 1)^{-q}. This schema for setting {Λ_i} balances between the amount of noise, which calls for many restarts, and the property that using the covariance matrix for updates achieves fast convergence. We note that an exponential schema Λ_i = 2^{-i} would lead to very few restarts, and to very small eigenvalues of the covariance matrix; intuitively, this is because the last segment would be about half the length of the entire sequence. Combining Lemma 5 with Corollary 4 we get the following.

Corollary 6: Assume that the ARCOR algorithm is run with a polynomial schema, that is, Λ_i = (i + 1)^{-q} for some q ≥ 0. Under the conditions
of heorem 3 we have, L ARCOR L u } + ru Σ u + RB + Y d q + q log + 3 nrd + R B r q + q q + u u 4 Proof: Subsung Λ = q + n Lemma 5 we ge, n r Λ n n = r q r x q dx = r q nq = hs yelds an upper bound on n, n q + q Λ n q + q q + Comparng he las wo erms of he bound of Corollary 6 we observe a naural radeoff n he value of q he hrd erm of 3 s decreasng wh large values of q, whle he fourh erm of 4 s ncreasng wh q Assumng a bound on he devaon u u = V O /p, or n oher words p = log / log V We se a drf dependen parameer q = p / p + = log / log + log V and ge ha he sum of 3 and 4 s of order O p+ p log = O V log Few commens are n order Frs, as long as p > he sum of 3 and 4 s o and hus vanshng Second, when he nose s very low, ha s p + ɛ, he algorhm ses q + /ɛ, and hus wll no make any resars, and he bound of Olog for he saonary case s rereved In oher words, for hs choce of q he algorhm wll have only one nerval, and here wll be no resars o conclude, we showed ha f he algorhm s gven an upper bound on he amoun of drf, whch s sub-lnear n, can acheve sub-lnear regre Furhermore, f s known ha here s no non-saonary n he reference vecors, hen runnng he algorhm wh large enough q wll have a regre logarhmc n B Analyss of LASER algorhm We now analyze he performance of he LASER algorhm n he wors-case seng n sx seps Frs, sae a echncal lemma ha s used n he second sep heorem 8, n whch we bound he regre wh a quany proporonal o = x x hrd, n Lemma 9 we bound each of he summands wh wo erms, one logarhmc and one lnear n he egenvalues of he marces In he fourh Lemma 0 and ffh Lemma seps we bound he egenvalues of frs for scalars and hen exend he resuls o marces Fnally, n Corollary we pu all hese resuls ogeher and ge he desred bounds Lemma 7: For all he followng saemen holds, x x + + c I 0
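The scalar ($d = 1$) form of the recursion behind this lemma, $\lambda_t = \lambda_{t-1}c/(\lambda_{t-1}+c) + x_t^2$, is exactly the function analyzed in Lemma 10 below, and its three properties can be checked numerically. A minimal sketch, not the authors' code; the constants are arbitrary placeholder choices:

```python
import math
import random

random.seed(0)
b, c, gamma = 1.0, 10.0, 1.0  # prior scale, drift parameter, bound |x_t| <= gamma

def f(lam, beta, x):
    # Scalar form of the recursion D_t = (D_{t-1}^{-1} + I/c)^{-1} + x_t x_t^T:
    # f(lambda) = lambda*beta/(lambda + beta) + x^2.
    return lam * beta / (lam + beta) + x * x

bound3 = (3 * gamma**2 + math.sqrt(gamma**4 + 4 * gamma**2 * c)) / 2
lam = b + gamma**2  # the eigenvalue of D_1 is at most b + X^2
for _ in range(1000):
    x = random.uniform(-gamma, gamma)
    nxt = f(lam, c, x)
    assert nxt <= c + gamma**2 + 1e-12      # property (1) of Lemma 10
    assert nxt <= lam + gamma**2 + 1e-12    # property (2)
    assert nxt <= max(lam, bound3) + 1e-12  # property (3)
    lam = nxt

# Lemma 11 specialized to d = 1: the eigenvalue stays bounded for all t.
assert lam <= max(bound3, b + gamma**2)
```

Property (3) is what keeps the trace term of Lemma 9 linear in $T$: the eigenvalues of $D_t$ never escape a constant interval.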

Here $D'_{t-1} := D_{t-1}\big(I + \frac{1}{c} D_{t-1}\big)^{-1} = \big(D_{t-1}^{-1} + \frac{1}{c} I\big)^{-1}$, as defined in (21). The proof of Lemma 7 appears in Sec. D. We next bound the cumulative loss of the algorithm.

Theorem 8: Assume that the labels are bounded, $\sup_t |y_t| \le Y$ for some $Y \in \mathbb{R}$. Then the following bound holds:
$$L_T(\text{LASER}) \le \min_{u_1,\dots,u_T}\Big\{ L_T(\{u_t\}) + c\sum_{t=1}^{T-1}\|u_t - u_{t+1}\|^2 + b\|u_1\|^2 \Big\} + Y^2 \sum_{t=1}^{T} x_t^\top D_t^{-1} x_t. \tag{25}$$

Proof: Fix $t$. A long algebraic manipulation, given in Sec. E, yields
$$(y_t - \hat y_t)^2 + \min_{u_1,\dots,u_{t-1}} Q_{t-1}(u_1,\dots,u_{t-1}) - \min_{u_1,\dots,u_t} Q_t(u_1,\dots,u_t) = (y_t - \hat y_t)^2 - y_t^2 + e_t^\top D_t^{-1} e_t - \tilde e_{t-1}^\top \big(D_t - x_t x_t^\top\big)^{-1} \tilde e_{t-1}, \tag{26}$$
where $\tilde e_{t-1} = \big(I + \frac1c D_{t-1}\big)^{-1} e_{t-1}$, so that $e_t = \tilde e_{t-1} + y_t x_t$. Substituting the specific value of the predictor $\hat y_t = x_t^\top D_t^{-1} \tilde e_{t-1}$ from (22), we get that (26) equals
$$\hat y_t^2 - 2 y_t \hat y_t + e_t^\top D_t^{-1} e_t - \tilde e_{t-1}^\top\big(D_t - x_tx_t^\top\big)^{-1}\tilde e_{t-1} = \tilde e_{t-1}^\top\Big[D_t^{-1} + D_t^{-1}x_tx_t^\top D_t^{-1} - \big(D_t - x_tx_t^\top\big)^{-1}\Big]\tilde e_{t-1} + y_t^2\, x_t^\top D_t^{-1} x_t. \tag{27}$$
Using Lemma 7, which gives $x_t^\top D_t^{-1} x_t < 1$ and hence, via the Woodbury identity, $(D_t - x_tx_t^\top)^{-1} \succeq D_t^{-1} + D_t^{-1}x_tx_t^\top D_t^{-1}$, we upper bound the first term of (27) by $0$, and thus (27) is bounded by $Y^2\, x_t^\top D_t^{-1} x_t$. Finally, summing over $t \in \{1,\dots,T\}$ gives the desired bound,
$$L_T(\text{LASER}) \le \min_{u_1,\dots,u_T}\Big\{ b\|u_1\|^2 + c\sum_{t=1}^{T-1}\|u_t - u_{t+1}\|^2 + L_T(\{u_t\}) \Big\} + Y^2\sum_{t=1}^T x_t^\top D_t^{-1} x_t.$$

In the next lemma we further bound the right term of (25). This type of bound is based on the usage of the covariance-like matrix $D_t$.

Lemma 9:
$$\sum_{t=1}^{T} x_t^\top D_t^{-1} x_t \le \ln\frac{\det D_T}{b^d} + \frac{1}{c}\sum_{t=1}^{T-1} \operatorname{tr}(D_t). \tag{28}$$

Proof: Let $B_t = D_t - x_t x_t^\top = \big(D_{t-1}^{-1} + \frac1c I\big)^{-1} \succ 0$. Then
$$x_t^\top D_t^{-1} x_t = \operatorname{tr}\big(D_t^{-1} x_t x_t^\top\big) = \operatorname{tr}\big(D_t^{-1}(D_t - B_t)\big) = \operatorname{tr}\big(I - D_t^{-1/2} B_t D_t^{-1/2}\big) = \sum_{j=1}^{d}\Big(1 - \lambda_j\big(D_t^{-1/2} B_t D_t^{-1/2}\big)\Big).$$
We continue using $1 - x \le \ln\frac1x$ and get
$$x_t^\top D_t^{-1} x_t \le \sum_{j=1}^{d}\ln\frac{1}{\lambda_j\big(D_t^{-1/2}B_tD_t^{-1/2}\big)} = -\ln\det\big(D_t^{-1/2}B_tD_t^{-1/2}\big) = \ln\frac{\det D_t}{\det B_t}.$$
It follows that, for $t \ge 2$,
$$\ln\frac{\det D_t}{\det B_t} = \ln\frac{\det D_t}{\det D_{t-1}} + \ln\det\Big(I + \frac1c D_{t-1}\Big),$$
while $\det B_1 = b^d$. Summing over $t$, telescoping, and using $\ln\det\big(I + \frac1c D_{t-1}\big) \le \frac1c\operatorname{tr}(D_{t-1})$, we get
$$\sum_{t=1}^T x_t^\top D_t^{-1} x_t \le \ln\frac{\det D_T}{b^d} + \frac1c\sum_{t=1}^{T-1}\operatorname{tr}(D_t).$$

At first sight it seems that the right term of (28) may grow super-linearly with $T$, as each of the matrices $D_t$ grows with $t$. The next two lemmas show that this is not the case and that, in fact, the right term of (28) does not grow too fast, which will allow us to obtain a sub-linear regret bound. Lemma 10 analyzes the properties of the recursion of $D_t$ defined in (4) for scalars, that is $d = 1$. In Lemma 11 we extend this analysis to matrices.

Lemma 10: Define $f(\lambda) = \lambda\beta/(\lambda + \beta) + x^2$ for $\beta, \lambda \ge 0$ and some $x^2 \le \gamma^2$. Then: (1) $f(\lambda) \le \beta + \gamma^2$; (2) $f(\lambda) \le \lambda + \gamma^2$; (3) $f(\lambda) \le \max\Big\{\lambda,\; \frac{3\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2}\Big\}$.

Proof: For the first property we have $f(\lambda) = \lambda\beta/(\lambda+\beta) + x^2 \le \beta + x^2$. The second property follows from the symmetry between $\beta$ and $\lambda$. To prove the third property we decompose the function as
$$f(\lambda) = \lambda - \Big(\frac{\lambda^2}{\lambda+\beta} - x^2\Big).$$

Therefore, the function is bounded by its argument, $f(\lambda) \le \lambda$, if and only if $\frac{\lambda^2}{\lambda+\beta} - x^2 \ge 0$. Since we assume $x^2 \le \gamma^2$, the last inequality holds if $\lambda^2 - \gamma^2\lambda - \gamma^2\beta \ge 0$, which holds for $\lambda \ge \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2}$. To conclude: if $\lambda \ge \frac{\gamma^2 + \sqrt{\gamma^4+4\gamma^2\beta}}{2}$, then $f(\lambda) \le \lambda$. Otherwise, by the second property, we have
$$f(\lambda) \le \lambda + \gamma^2 \le \frac{\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2} + \gamma^2 = \frac{3\gamma^2 + \sqrt{\gamma^4 + 4\gamma^2\beta}}{2},$$
as required.

We build on Lemma 10 to bound the maximal eigenvalue of the matrices $D_t$.

Lemma 11: Assume $\|x_t\|^2 \le X^2$ for some $X$. Then the eigenvalues of $D_t$ for $t = 1,\dots,T$, denoted by $\lambda_j(D_t)$, are upper bounded by
$$\max_j \lambda_j(D_t) \le \max\Big\{\frac{3X^2 + \sqrt{X^4 + 4X^2 c}}{2},\; b + X^2\Big\}.$$

Proof: By induction. From (4) we have that $\lambda_j(D_1) \le b + X^2$ for $j = 1,\dots,d$. We proceed with a proof for some $t$. For simplicity, denote by $\lambda_j$ the $j$-th eigenvalue of $D_{t-1}$, with a corresponding eigenvector $v_j$. From (4) we have
$$D_t = \Big(D_{t-1}^{-1} + \frac1c I\Big)^{-1} + x_t x_t^\top \preceq \Big(D_{t-1}^{-1} + \frac1c I\Big)^{-1} + X^2 I = \sum_{j=1}^{d} v_j v_j^\top\Big(\frac{\lambda_j c}{\lambda_j + c} + X^2\Big). \tag{29}$$
Plugging Lemma 10 into (29), together with the induction hypothesis, we get
$$D_t \preceq \sum_j v_j v_j^\top \max\Big\{\frac{3X^2 + \sqrt{X^4+4X^2c}}{2},\; b+X^2\Big\} = \max\Big\{\frac{3X^2 + \sqrt{X^4+4X^2c}}{2},\; b+X^2\Big\}\, I.$$

Finally, equipped with the above lemmas, we are able to prove the main result of this section.

Corollary 12: Assume $\|x_t\|^2 \le X^2$ and $y_t^2 \le Y^2$. Then
$$L_T(\text{LASER}) \le b\|u_1\|^2 + L_T(\{u_t\}) + Y^2\ln\frac{\det D_T}{b^d} + c\sum_{t=1}^{T-1}\|u_t - u_{t+1}\|^2 + \frac{Y^2}{c}\,T\,d\,\max\Big\{\frac{3X^2+\sqrt{X^4+4X^2c}}{2},\; b+X^2\Big\}. \tag{30}$$
Furthermore, set $b = \varepsilon c$ for some $0 < \varepsilon < 1$. Denote $\mu = \max\big\{\frac98 X^2,\, b + X^2\big\}$ and $M = \max\{3X^2,\, b+X^2\}$. If $V_2 = \sum_{t=1}^{T-1}\|u_t - u_{t+1}\|^2 \le \frac{Y^2 d X^2}{\mu^{3/2}}\,T$ (low drift), then by setting
$$c = \Big(\frac{Y^2 d X^2\, T}{V_2}\Big)^{2/3} \tag{31}$$
we have
$$L_T(\text{LASER}) \le b\|u_1\|^2 + 3\big(Y^2 d X^2\, T\big)^{2/3} V_2^{1/3} + \frac{1+\varepsilon}{\varepsilon}\, Y^2 d + L_T(\{u_t\}) + Y^2 \ln\frac{\det D_T}{b^d}. \tag{32}$$
The proof appears in Sec. F. Note that if $V_2 \ge \frac{Y^2 d M}{\mu}\,T$, then by setting $c = \sqrt{Y^2 d M\, T / V_2}$ we have
$$L_T(\text{LASER}) \le b\|u_1\|^2 + 2\sqrt{Y^2 d M\, T\, V_2} + \frac{1+\varepsilon}{\varepsilon}\, Y^2 d + L_T(\{u_t\}) + Y^2\ln\frac{\det D_T}{b^d}. \tag{33}$$
See Sec. G for details. The last bound is linear in $T$ and can be obtained also by a naive algorithm that outputs $\hat y_t = 0$ for all $t$.

A few remarks are in order. When the variance $V_2$ goes to zero, we set $c = \infty$ and thus we have $D_t = bI + \sum_{s=1}^t x_s x_s^\top$, used in recent algorithms [8], [6], [36]. In this case the algorithm reduces to the algorithm by Forster [6], which is also the AAR algorithm of Vovk [36], with the same logarithmic regret bound (note that the last term in the bounds is logarithmic in $T$; see the proof of Forster [6]). See also the work of Azoury and Warmuth.

VII. SIMULATIONS

We evaluate our algorithms on three datasets, one synthetic and two real-world. The synthetic dataset contains 2,000 points in $\mathbb{R}^{20}$, where the first ten coordinates were grouped into five groups of size two. Each such pair was drawn from a 45°-rotated Gaussian distribution with two different standard deviations, and the remaining 10 coordinates were drawn from independent Gaussian distributions $N(0,1)$. The dataset was generated using a sequence of vectors $u_t \in \mathbb{R}^{20}$ for which the only non-zero coordinates are the first two, where their values are the coordinates of a unit vector that is rotating at a constant rate. Specifically, we have $\|u_t\| = 1$ and the instantaneous drift $\|u_t - u_{t-1}\|$ is constant.

The other two datasets are generated from an echoed speech signal. The first echoed speech signal was generated using an FIR filter with $k$ delays and varying attenuated amplitude. This effect imitates acoustic echo reflections from large, distant and dynamic obstacles. The difference equation $y(n) = x(n) + \sum_k A_k(n)\,x(n - d_k) + v(n)$ was used, where $d_k$ is a delay in samples, the coefficient $A_k(n)$ describes the changing attenuation related to object reflection, and $v(n) \sim N(0, 10^{-3})$ is white noise. The second echoed speech signal was generated using a flange IIR filter, where the delay is not constant but changing with time. This effect imitates time-stretching of the audio signal caused by moving and changing objects in the room. The difference equation $y(n) = x(n) + A\,y(n - d(n)) + v(n)$ was used.

Five algorithms are evaluated: NLMS (normalized least mean squares) [3], [4], which is a state-of-the-art first-order algorithm; AROWR (AROW for Regression), with no restarts nor projection; ARCOR; LASER; and CR-RLS. For the synthetic datasets the algorithms' parameters were tuned using a single random sequence.
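The drifting synthetic data and the FIR echo effect described above can be sketched as follows. This is a hypothetical re-implementation, not the authors' generator: the rotation rate, delay, attenuation profile, and noise level are placeholder choices, and the correlated-pair structure of the inputs is omitted for brevity.

```python
import math
import random

random.seed(1)
T, d = 2000, 20

def u(t, rate=2 * math.pi / 2000):
    # Reference vector: only the first two coordinates are non-zero; they trace
    # a unit vector rotating at a constant rate, so ||u_t - u_{t-1}|| is constant.
    vec = [0.0] * d
    vec[0], vec[1] = math.cos(rate * t), math.sin(rate * t)
    return vec

data = []
for t in range(T):
    x = [random.gauss(0.0, 1.0) for _ in range(d)]
    y = sum(ui * xi for ui, xi in zip(u(t), x))  # noiseless label y_t = u_t . x_t
    data.append((x, y))

def echo_fir(x, k=100, noise_std=math.sqrt(1e-3)):
    # FIR echo: y(n) = x(n) + A(n) x(n - k) + v(n), with a slowly varying
    # attenuation A(n) and white noise v(n) ~ N(0, 1e-3).
    out = []
    for n, xn in enumerate(x):
        a_n = 0.5 * (1.0 + math.sin(2 * math.pi * n / len(x)))
        tap = x[n - k] if n >= k else 0.0
        out.append(xn + a_n * tap + random.gauss(0.0, noise_std))
    return out
```

Feeding `data` to any of the five online learners and accumulating `(y - y_hat) ** 2` reproduces the kind of cumulative-loss curves reported below.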

[Fig. 1: Cumulative squared loss for AROWR, ARCOR, LASER, NLMS and CR-RLS vs. iteration. The left panel shows results for the synthetic dataset with drift; the two right panels show results for a problem of acoustic echo cancellation on a speech signal (best shown in color).]

We repeat each experiment 100 times, reporting the mean cumulative square loss. We note that AAR [36] is a special case of LASER and RLS is a special case of CR-RLS, for a specific choice of their respective parameters. Additionally, the performance of AROWR, AAR and RLS is similar, and thus only the performance of AROWR is shown. For the speech signals the algorithms' parameters were tuned on 10% of the signal; the best parameter choices for each algorithm were then used to evaluate the performance on the remaining signal.

The results are summarized in Fig. 1. AROWR performs worst on all datasets, as it converges very fast and thus is not able to track the changes in the data. Focusing on the left panel, showing the results for the synthetic signal, we observe that ARCOR performs relatively badly, as suggested by our analysis for constant, yet not too large, drift. Both CR-RLS and NLMS perform better, where CR-RLS is slightly better, as it is a second-order algorithm, which allows it to converge faster between switches; NLMS, on the other hand, is not converging and is able to adapt to the drift. Finally, LASER performs the best, as hinted by its analysis, for which the bound is lower when there is a constant drift.

Moving to the center panel, showing the results for the first echoed speech signal with varying amplitude, we observe that LASER is the worst among all algorithms except AROWR. Indeed, it does prevent convergence by keeping the learning rates far from zero, yet it is a min-max algorithm designed for the worst case, which is not the case for real-world speech data: speech data is highly regular, and the instantaneous drift varies. NLMS performs better as it is not converging, yet both CR-RLS and ARCOR perform even better, as they are both non-converging, due to covariance resets on the one hand, and second-order updates on the other hand. ARCOR outperforms CR-RLS, as the former adapts the resets to the actual data and does not use a pre-defined schedule as the latter does.

Finally, the right panel summarizes the results of the evaluations on the second echoed speech signal. Note that the amount of drift grows, since the data is generated using a flange filter. Both LASER and ARCOR are outperformed, as both assume drift that is sublinear or at most linear, which is not the case here. CR-RLS outperforms NLMS: the latter is first-order, so it is able to adapt to changes, yet it has a slower convergence rate; the former is able to cope with drift due to its resets. Interestingly, in all experiments NLMS was performing neither the best nor the worst. There is no clear winner among the three algorithms that are both second-order and designed to adapt to drifts. Intuitively, if the drift suits the assumptions of an algorithm, that algorithm will perform the best; otherwise, its performance may even be worse than that of NLMS.

We have seen above that ARCOR performs a projection step, which was partially motivated by the analysis. We now evaluate its need and effect in practice on the two speech problems. We test two modifications of ARCOR, resulting in four variants altogether. First, we replace the polynomial thresholds scheme with a constant thresholds scheme, that is, all thresholds are equal. Second, we omit the projection step. The results are summarized in Fig. 2. The line corresponding to the original algorithm is called "proj, poly", as it performs a projection step and uses a polynomial schema for the lower bound on the eigenvalues. The version that omits projection and uses a constant schema, called "no proj, cons", is most similar to CR-RLS: both reset the covariance matrix, CR-RLS after a fixed number of iterations, and ARCOR ("no proj, cons") when the eigenvalues meet a specified fixed lower bound. The difference between the two plots is the amount of drift used: the top panel shows results for sublinear drift, and the bottom panel shows results with increasing per-instance drift. The original version, as hinted by the analysis, is designed to work with sub-linear drift, and performs the best in this case. However, when this assumption on the amount of drift breaks, this version is no longer optimal.
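The two reset rules contrasted above reduce to two small predicates; a schematic sketch (function names and constants are hypothetical, and the one-dimensional eigenvalue trace assumes $\|x_t\| = 1$, so that $\lambda_{\min}(\Sigma_t) = 1/(1 + s/r)$ after $s$ examples in the current segment):

```python
def crrls_should_reset(t, period=500):
    # CR-RLS: reset the covariance matrix every fixed number of iterations.
    return t > 0 and t % period == 0

def arcor_should_reset(min_eigenvalue, threshold):
    # ARCOR ("no proj, cons" variant): reset when the smallest eigenvalue of the
    # covariance matrix falls below a fixed lower bound.  The full algorithm
    # uses a polynomially decaying sequence of thresholds Lambda_i instead.
    return min_eigenvalue < threshold

# Data-dependent resets in the d = 1, ||x_t|| = 1 case: the eigenvalue of
# Sigma_t is 1/(1 + s/r), so a segment ends after about r*(1/Lambda - 1) steps.
r, Lam = 10.0, 0.25
resets, s = 0, 0
for t in range(1, 2001):
    s += 1
    if arcor_should_reset(1.0 / (1.0 + s / r), Lam):
        resets += 1
        s = 0
```

With these constants each segment lasts 31 steps, matching the $r(\Lambda^{-1} - 1) = 30$ segment-length lower bound of Lemma 5 up to rounding.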
In that case the constant schema performs better, as it allows the algorithm to adapt to the non-vanishing drift. Finally, on both datasets, the variants that perform best perform a projection step after each iteration, providing some empirical evidence for its need.

VIII. SUMMARY AND CONCLUSIONS

We proposed and analyzed two novel algorithms for non-stationary online regression, designed and analyzed with the squared loss in the worst-case regret framework. The ARCOR algorithm is built on AROWR, which employs second-order information.

[Fig. 2: Cumulative squared loss of four variants of ARCOR ("no proj, poly", "proj, poly", "no proj, cons", "proj, cons") vs. iteration, on the speech signal with echo (top) and the speech signal with additive echo (bottom).]

It additionally performs data-dependent covariance resets, which provide the ability to track drifts. The LASER algorithm is built on the last-step min-max predictor, with the proper modifications for non-stationary problems. The regret bounds shown and proven are optimized using knowledge of the amount of drift, and in general the two algorithms are not comparable. A few open directions are possible. First, to extend these algorithms to loss functions other than the squared loss. Second, a direct implementation of both algorithms currently requires either matrix inversion or eigenvector decomposition; a possible direction is to design a more efficient version of these algorithms. Third, an interesting direction is to design algorithms that automatically detect the level of drift, or that do not need this information before run-time.

APPENDIX

A. Proof of Lemma 1

Proof:
$$P_t(u) = \min_{u_1,\dots,u_{t-1}}\Big[ b\|u_1\|^2 + c\sum_{s=1}^{t-1}\|u_{s+1} - u_s\|^2 + \sum_{s=1}^{t}(y_s - u_s^\top x_s)^2 \Big] \quad (\text{with } u_t = u)$$
$$= \min_{u_1,\dots,u_{t-1}}\Big[ b\|u_1\|^2 + c\sum_{s=1}^{t-2}\|u_{s+1}-u_s\|^2 + \sum_{s=1}^{t-1}(y_s - u_s^\top x_s)^2 + c\|u - u_{t-1}\|^2 \Big] + (y_t - u^\top x_t)^2$$
$$= \min_{u_{t-1}}\Big[ \min_{u_1,\dots,u_{t-2}}\Big( b\|u_1\|^2 + c\sum_{s=1}^{t-2}\|u_{s+1}-u_s\|^2 + \sum_{s=1}^{t-1}(y_s - u_s^\top x_s)^2 \Big) + c\|u - u_{t-1}\|^2 \Big] + (y_t - u^\top x_t)^2$$
$$= \min_{u'}\Big[ P_{t-1}(u') + c\|u - u'\|^2 \Big] + (y_t - u^\top x_t)^2.$$

B. Proof of Lemma 2

Proof: By definition,
$$P_1(u) = Q_1(u) = b\|u\|^2 + (y_1 - u^\top x_1)^2 = u^\top\big(bI + x_1 x_1^\top\big)u - 2 y_1 u^\top x_1 + y_1^2,$$
and indeed $D_1 = bI + x_1x_1^\top$, $e_1 = y_1 x_1$, and $f_1 = y_1^2$. We proceed by induction; assume that
$$P_{t-1}(u) = u^\top D_{t-1} u - 2 u^\top e_{t-1} + f_{t-1}.$$

Applying Lemma 1 we get
$$P_t(u) = \min_{u'}\Big[ u'^\top D_{t-1} u' - 2 u'^\top e_{t-1} + f_{t-1} + c\|u - u'\|^2 \Big] + (y_t - u^\top x_t)^2.$$
The inner minimum is attained at $u' = (D_{t-1} + cI)^{-1}(e_{t-1} + c\,u)$, and substituting it back yields
$$P_t(u) = u^\top\Big[\big(D_{t-1}^{-1} + \tfrac1c I\big)^{-1} + x_t x_t^\top\Big]u - 2u^\top\Big[\big(I + \tfrac1c D_{t-1}\big)^{-1} e_{t-1} + y_t x_t\Big] + f_{t-1} - e_{t-1}^\top(cI + D_{t-1})^{-1} e_{t-1} + y_t^2,$$
and indeed $D_t = \big(D_{t-1}^{-1} + \tfrac1c I\big)^{-1} + x_t x_t^\top$, $e_t = \big(I + \tfrac1c D_{t-1}\big)^{-1} e_{t-1} + y_t x_t$, and $f_t = f_{t-1} - e_{t-1}^\top(cI + D_{t-1})^{-1} e_{t-1} + y_t^2$, as desired.

C. Proof of Theorem 3

We prove the theorem in four steps. First, we state the following technical lemma, for which we define the notation
$$d_t(z, v) = (z - v)^\top \Sigma_t^{-1}(z - v), \qquad \tilde d_t(z, v) = (z - v)^\top \tilde\Sigma_t^{-1}(z - v), \qquad \chi_t = x_t^\top \Sigma_{t-1} x_t,$$
where $\tilde w_t$ and $\tilde\Sigma_t$ denote the weight vector and covariance matrix after the update of round $t$ but before the (possible) reset and projection steps. Second, we define a telescopic sum, and in Lemma 14 prove a lower bound for each of its elements. Third, in Lemma 15 we upper bound one term of the telescopic sum, and finally, in the fourth step, we combine all these parts to conclude the proof. Let us start with the technical lemma.

Lemma 13: Let $\tilde w_t$ and $\tilde\Sigma_t$ be defined via (9) and (10). Then
$$d_{t-1}(w_{t-1}, u_t) - \tilde d_t(\tilde w_t, u_t) = \frac{1}{r}\,\ell_t - \frac{1}{r}\, g_t - \frac{\ell_t \chi_t}{r(r + \chi_t)},$$
where $\ell_t = (y_t - w_{t-1}^\top x_t)^2$ and $g_t = (y_t - u_t^\top x_t)^2$.

Proof: We start by writing the distances explicitly,
$$d_{t-1}(w_{t-1}, u_t) - \tilde d_t(\tilde w_t, u_t) = (w_{t-1} - u_t)^\top\Sigma_{t-1}^{-1}(w_{t-1} - u_t) - (\tilde w_t - u_t)^\top\tilde\Sigma_t^{-1}(\tilde w_t - u_t).$$
Substituting $\tilde w_t$ as it appears in (10),
$$\tilde w_t = w_{t-1} + \frac{(y_t - w_{t-1}^\top x_t)\,\Sigma_{t-1} x_t}{r + x_t^\top\Sigma_{t-1}x_t},$$
plugging $\tilde\Sigma_t$ as it appears in (9), $\tilde\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac1r x_t x_t^\top$, and using the Woodbury identity
$$\tilde\Sigma_t = \Sigma_{t-1} - \frac{\Sigma_{t-1} x_t x_t^\top \Sigma_{t-1}}{r + x_t^\top\Sigma_{t-1} x_t},$$
a direct computation gives
$$d_{t-1}(w_{t-1},u_t) - \tilde d_t(\tilde w_t, u_t) = \frac{(y_t - w_{t-1}^\top x_t)^2 - (y_t - u_t^\top x_t)^2}{r} - \frac{(y_t - w_{t-1}^\top x_t)^2\,\chi_t}{r(r + \chi_t)}.$$
Finally, substituting $\ell_t = (y_t - w_{t-1}^\top x_t)^2$, $g_t = (y_t - u_t^\top x_t)^2$ and $\chi_t = x_t^\top\Sigma_{t-1}x_t$ and rearranging the terms completes the proof.

We now define one element of the telescopic sum, and lower bound it.

Lemma 14: Denote
$$\Delta_t = d_{t-1}(w_{t-1}, u_{t-1}) - d_t(w_t, u_t).$$
Then
$$\Delta_t \ge \frac{\ell_t - g_t}{r} - \frac{\ell_t\chi_t}{r(r+\chi_t)} + u_{t-1}^\top\Sigma_{t-1}^{-1}u_{t-1} - u_t^\top\Sigma_t^{-1}u_t - \frac{2R_B}{\Lambda_{i_t}}\|u_t - u_{t-1}\|,$$
where $i_t$ is the number of restarts occurring before example $t$.

Proof: We write $\Delta_t$ as a telescopic sum of four terms, as follows:
$$\Delta_{t,1} = d_{t-1}(w_{t-1}, u_t) - \tilde d_t(\tilde w_t, u_t)$$
$$\Delta_{t,2} = \tilde d_t(\tilde w_t, u_t) - \hat d_t(\tilde w_t, u_t)$$
$$\Delta_{t,3} = \hat d_t(\tilde w_t, u_t) - d_t(w_t, u_t)$$
$$\Delta_{t,4} = d_{t-1}(w_{t-1}, u_{t-1}) - d_{t-1}(w_{t-1}, u_t),$$
where $\hat d_t$ is defined with the covariance matrix after the (possible) reset, so that $\Delta_t = \Delta_{t,1} + \Delta_{t,2} + \Delta_{t,3} + \Delta_{t,4}$. We lower bound each of the four terms. Since the value of $\Delta_{t,1}$ was computed in Lemma 13, we start with the second term. If no reset occurs, then the two matrices coincide and $\Delta_{t,2} = 0$. Otherwise, we use the facts that $0 \preceq \tilde\Sigma_t \preceq I$ and that the matrix is reset to $I$, and get
$$\Delta_{t,2} = (\tilde w_t - u_t)^\top\tilde\Sigma_t^{-1}(\tilde w_t - u_t) - (\tilde w_t - u_t)^\top I^{-1}(\tilde w_t - u_t) \ge (\tilde w_t - u_t)^\top(\tilde w_t - u_t) - (\tilde w_t - u_t)^\top(\tilde w_t - u_t) = 0.$$
To summarize, $\Delta_{t,2} \ge 0$. We can lower bound $\Delta_{t,3} \ge 0$ by using the fact that $w_t$ is a projection of $\tilde w_t$ onto a closed set (a ball of radius $R_B$ around the origin) which, by our assumption, contains $u_t$. Employing Corollary 3 of Herbster and Warmuth [3] we get $\hat d_t(\tilde w_t, u_t) \ge d_t(w_t, u_t)$, and thus $\Delta_{t,3} \ge 0$. Finally, we lower bound the fourth term $\Delta_{t,4}$:
$$\Delta_{t,4} = (w_{t-1} - u_{t-1})^\top\Sigma_{t-1}^{-1}(w_{t-1} - u_{t-1}) - (w_{t-1} - u_t)^\top\Sigma_{t-1}^{-1}(w_{t-1} - u_t) \tag{34}$$
$$= u_{t-1}^\top\Sigma_{t-1}^{-1}u_{t-1} - u_t^\top\Sigma_{t-1}^{-1}u_t - 2\, w_{t-1}^\top\Sigma_{t-1}^{-1}(u_{t-1} - u_t).$$
We use Hölder's inequality and then the Cauchy–Schwarz inequality to get the following bound:
$$\big|w_{t-1}^\top\Sigma_{t-1}^{-1}(u_{t-1} - u_t)\big| = \big|\operatorname{tr}\big(\Sigma_{t-1}^{-1}(u_{t-1} - u_t)\, w_{t-1}^\top\big)\big| \le \lambda_{\max}\big(\Sigma_{t-1}^{-1}\big)\,\|w_{t-1}\|\,\|u_{t-1} - u_t\|.$$
Using the facts that $\|w_{t-1}\| \le R_B$ and that $\lambda_{\max}(\Sigma_{t-1}^{-1}) = 1/\lambda_{\min}(\Sigma_{t-1}) \le \Lambda_{i_t}^{-1}$, where $i_t$ is the current segment index, we get
$$\big|w_{t-1}^\top\Sigma_{t-1}^{-1}(u_{t-1}-u_t)\big| \le \Lambda_{i_t}^{-1}\,R_B\,\|u_{t-1} - u_t\|. \tag{35}$$
Substituting (35) in (34), and using $\Sigma_t \preceq \Sigma_{t-1}$, a lower bound is obtained:
$$\Delta_{t,4} \ge u_{t-1}^\top\Sigma_{t-1}^{-1}u_{t-1} - u_t^\top\Sigma_{t-1}^{-1}u_t - 2R_B\Lambda_{i_t}^{-1}\|u_{t-1}-u_t\| \ge u_{t-1}^\top\Sigma_{t-1}^{-1}u_{t-1} - u_t^\top\Sigma_t^{-1}u_t - 2R_B\Lambda_{i_t}^{-1}\|u_{t-1}-u_t\|. \tag{36}$$
Combining (36) with Lemma 13 concludes the proof.

Next we state an upper bound that will appear in one of the summands of the telescopic sum.

Lemma 15: During the runtime of the ARCOR algorithm we have
$$\sum_{t=t_i}^{t_i + T_i - 1}\frac{\chi_t}{\chi_t + r} \le \log\det\big(\bar\Sigma_i^{-1}\big).$$
We remind the reader that $t_i$ is the first example index after the $i$-th restart, and that $T_i$ is the number of examples observed before the next restart. We also remind the reader of the notation $\bar\Sigma_i = \Sigma_{t_i + T_i - 1}$, the covariance matrix just before the next restart. The proof of the lemma is similar to the proof of Lemma 4 by Crammer et al. and is thus omitted.

We now put all the pieces together and prove Theorem 3.

Proof: We bound the sum $\sum_{t=1}^T \Delta_t$ from above and below, and start with an upper bound using the property of a telescopic sum:
$$\sum_{t=1}^T \Delta_t = d_0(w_0, u_0) - d_T(w_T, u_T) \le d_0(w_0, u_0). \tag{37}$$
We compute a lower bound by applying Lemma 14:
$$\sum_{t=1}^T \Delta_t \ge \sum_t\Big(\frac{\ell_t - g_t}{r} - \frac{\ell_t\chi_t}{r(r+\chi_t)}\Big) + \sum_t\big(u_{t-1}^\top\Sigma_{t-1}^{-1}u_{t-1} - u_t^\top\Sigma_t^{-1}u_t\big) - 2R_B\sum_t\Lambda_{i_t}^{-1}\|u_t - u_{t-1}\|,$$
where $i_t$ is the number of restarts that occurred before observing the $t$-th example. Continuing to develop the last equation, we obtain
$$\sum_{t=1}^T\Delta_t \ge \sum_t\Big(\frac{\ell_t - g_t}{r} - \frac{\ell_t\chi_t}{r(r+\chi_t)}\Big) + u_0^\top\Sigma_0^{-1}u_0 - u_T^\top\Sigma_T^{-1}u_T - 2R_B\sum_t\Lambda_{i_t}^{-1}\|u_t - u_{t-1}\|. \tag{38}$$
Combining (37) with (38), and using $d_0(w_0, u_0) = u_0^\top\Sigma_0^{-1}u_0$ as $w_0 = 0$, we get
$$\sum_t\frac{\ell_t - g_t}{r} \le \sum_t\frac{\ell_t\chi_t}{r(r+\chi_t)} + u_T^\top\Sigma_T^{-1}u_T + 2R_B\sum_t\Lambda_{i_t}^{-1}\|u_t - u_{t-1}\|.$$
Rearranging the terms of the last inequality,
$$\sum_t \ell_t \le \sum_t g_t + \sum_t\frac{\ell_t\chi_t}{r + \chi_t} + r\,u_T^\top\Sigma_T^{-1}u_T + 2R_B\,r\sum_t\Lambda_{i_t}^{-1}\|u_t - u_{t-1}\|.$$
Since $\|w_{t-1}\| \le R_B$ and we assume that $\|x_t\| = 1$ and $\sup_t|y_t| = Y$, we get that $\sup_t \ell_t \le (R_B + Y)^2$. Substituting the last inequality in Lemma 15, we bound the second term
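The per-round identity of Lemma 13 can be verified numerically; here is a scalar ($d = 1$) check under the update equations (9)–(10) as written above (the tolerance and sampled ranges are arbitrary):

```python
import random

random.seed(2)
r = 4.0
for _ in range(100):
    sigma = random.uniform(0.1, 2.0)               # Sigma_{t-1}
    w, u, x, y = (random.uniform(-1.0, 1.0) for _ in range(4))
    chi = x * sigma * x                            # chi_t = x_t^T Sigma_{t-1} x_t
    ell = (y - w * x) ** 2                         # l_t
    g = (y - u * x) ** 2                           # g_t
    sigma_t = 1.0 / (1.0 / sigma + x * x / r)      # update (9)
    w_t = w + (y - w * x) * sigma * x / (r + chi)  # update (10)
    lhs = (w - u) ** 2 / sigma - (w_t - u) ** 2 / sigma_t
    rhs = ell / r - g / r - ell * chi / (r * (r + chi))
    assert abs(lhs - rhs) < 1e-9
```

The exact cancellation is what makes the telescoping argument of Theorem 3 go through: the per-round difference $(\ell_t - g_t)/r$ appears with no residual dependence on $w_{t-1}$ or $u_t$ beyond the losses themselves.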


More information

WiH Wei He

WiH Wei He Sysem Idenfcaon of onlnear Sae-Space Space Baery odels WH We He wehe@calce.umd.edu Advsor: Dr. Chaochao Chen Deparmen of echancal Engneerng Unversy of aryland, College Par 1 Unversy of aryland Bacground

More information

Mechanics Physics 151

Mechanics Physics 151 Mechancs Physcs 5 Lecure 9 Hamlonan Equaons of Moon (Chaper 8) Wha We Dd Las Tme Consruced Hamlonan formalsm Hqp (,,) = qp Lqq (,,) H p = q H q = p H L = Equvalen o Lagrangan formalsm Smpler, bu wce as

More information

[ ] 2. [ ]3 + (Δx i + Δx i 1 ) / 2. Δx i-1 Δx i Δx i+1. TPG4160 Reservoir Simulation 2018 Lecture note 3. page 1 of 5

[ ] 2. [ ]3 + (Δx i + Δx i 1 ) / 2. Δx i-1 Δx i Δx i+1. TPG4160 Reservoir Simulation 2018 Lecture note 3. page 1 of 5 TPG460 Reservor Smulaon 08 page of 5 DISCRETIZATIO OF THE FOW EQUATIOS As we already have seen, fne dfference appromaons of he paral dervaves appearng n he flow equaons may be obaned from Taylor seres

More information

Density Matrix Description of NMR BCMB/CHEM 8190

Density Matrix Description of NMR BCMB/CHEM 8190 Densy Marx Descrpon of NMR BCMBCHEM 89 Operaors n Marx Noaon If we say wh one bass se, properes vary only because of changes n he coeffcens weghng each bass se funcon x = h< Ix > - hs s how we calculae

More information

Relative controllability of nonlinear systems with delays in control

Relative controllability of nonlinear systems with delays in control Relave conrollably o nonlnear sysems wh delays n conrol Jerzy Klamka Insue o Conrol Engneerng, Slesan Techncal Unversy, 44- Glwce, Poland. phone/ax : 48 32 37227, {jklamka}@a.polsl.glwce.pl Keywor: Conrollably.

More information

2.1 Constitutive Theory

2.1 Constitutive Theory Secon.. Consuve Theory.. Consuve Equaons Governng Equaons The equaons governng he behavour of maerals are (n he spaal form) dρ v & ρ + ρdv v = + ρ = Conservaon of Mass (..a) d x σ j dv dvσ + b = ρ v& +

More information

Cubic Bezier Homotopy Function for Solving Exponential Equations

Cubic Bezier Homotopy Function for Solving Exponential Equations Penerb Journal of Advanced Research n Compung and Applcaons ISSN (onlne: 46-97 Vol. 4, No.. Pages -8, 6 omoopy Funcon for Solvng Eponenal Equaons S. S. Raml *,,. Mohamad Nor,a, N. S. Saharzan,b and M.

More information

Lecture 18: The Laplace Transform (See Sections and 14.7 in Boas)

Lecture 18: The Laplace Transform (See Sections and 14.7 in Boas) Lecure 8: The Lalace Transform (See Secons 88- and 47 n Boas) Recall ha our bg-cure goal s he analyss of he dfferenal equaon, ax bx cx F, where we emloy varous exansons for he drvng funcon F deendng on

More information

Machine Learning Linear Regression

Machine Learning Linear Regression Machne Learnng Lnear Regresson Lesson 3 Lnear Regresson Bascs of Regresson Leas Squares esmaon Polynomal Regresson Bass funcons Regresson model Regularzed Regresson Sascal Regresson Mamum Lkelhood (ML)

More information

Should Exact Index Numbers have Standard Errors? Theory and Application to Asian Growth

Should Exact Index Numbers have Standard Errors? Theory and Application to Asian Growth Should Exac Index umbers have Sandard Errors? Theory and Applcaon o Asan Growh Rober C. Feensra Marshall B. Rensdorf ovember 003 Proof of Proposon APPEDIX () Frs, we wll derve he convenonal Sao-Vara prce

More information

General Weighted Majority, Online Learning as Online Optimization

General Weighted Majority, Online Learning as Online Optimization Sascal Technques n Robocs (16-831, F10) Lecure#10 (Thursday Sepember 23) General Weghed Majory, Onlne Learnng as Onlne Opmzaon Lecurer: Drew Bagnell Scrbe: Nahanel Barshay 1 1 Generalzed Weghed majory

More information

Math 128b Project. Jude Yuen

Math 128b Project. Jude Yuen Mah 8b Proec Jude Yuen . Inroducon Le { Z } be a sequence of observed ndependen vecor varables. If he elemens of Z have a on normal dsrbuon hen { Z } has a mean vecor Z and a varancecovarance marx z. Geomercally

More information

Online Appendix for. Strategic safety stocks in supply chains with evolving forecasts

Online Appendix for. Strategic safety stocks in supply chains with evolving forecasts Onlne Appendx for Sraegc safey socs n supply chans wh evolvng forecass Tor Schoenmeyr Sephen C. Graves Opsolar, Inc. 332 Hunwood Avenue Hayward, CA 94544 A. P. Sloan School of Managemen Massachuses Insue

More information

( ) [ ] MAP Decision Rule

( ) [ ] MAP Decision Rule Announcemens Bayes Decson Theory wh Normal Dsrbuons HW0 due oday HW o be assgned soon Proec descrpon posed Bomercs CSE 90 Lecure 4 CSE90, Sprng 04 CSE90, Sprng 04 Key Probables 4 ω class label X feaure

More information

Outline. Probabilistic Model Learning. Probabilistic Model Learning. Probabilistic Model for Time-series Data: Hidden Markov Model

Outline. Probabilistic Model Learning. Probabilistic Model Learning. Probabilistic Model for Time-series Data: Hidden Markov Model Probablsc Model for Tme-seres Daa: Hdden Markov Model Hrosh Mamsuka Bonformacs Cener Kyoo Unversy Oulne Three Problems for probablsc models n machne learnng. Compung lkelhood 2. Learnng 3. Parsng (predcon

More information

Normal Random Variable and its discriminant functions

Normal Random Variable and its discriminant functions Noral Rando Varable and s dscrnan funcons Oulne Noral Rando Varable Properes Dscrnan funcons Why Noral Rando Varables? Analycally racable Works well when observaon coes for a corruped snle prooype 3 The

More information

P R = P 0. The system is shown on the next figure:

P R = P 0. The system is shown on the next figure: TPG460 Reservor Smulaon 08 page of INTRODUCTION TO RESERVOIR SIMULATION Analycal and numercal soluons of smple one-dmensonal, one-phase flow equaons As an nroducon o reservor smulaon, we wll revew he smples

More information

Chapter 6: AC Circuits

Chapter 6: AC Circuits Chaper 6: AC Crcus Chaper 6: Oulne Phasors and he AC Seady Sae AC Crcus A sable, lnear crcu operang n he seady sae wh snusodal excaon (.e., snusodal seady sae. Complee response forced response naural response.

More information

Time-interval analysis of β decay. V. Horvat and J. C. Hardy

Time-interval analysis of β decay. V. Horvat and J. C. Hardy Tme-nerval analyss of β decay V. Horva and J. C. Hardy Work on he even analyss of β decay [1] connued and resuled n he developmen of a novel mehod of bea-decay me-nerval analyss ha produces hghly accurae

More information

Robustness Experiments with Two Variance Components

Robustness Experiments with Two Variance Components Naonal Insue of Sandards and Technology (NIST) Informaon Technology Laboraory (ITL) Sascal Engneerng Dvson (SED) Robusness Expermens wh Two Varance Componens by Ana Ivelsse Avlés avles@ns.gov Conference

More information

Econ107 Applied Econometrics Topic 5: Specification: Choosing Independent Variables (Studenmund, Chapter 6)

Econ107 Applied Econometrics Topic 5: Specification: Choosing Independent Variables (Studenmund, Chapter 6) Econ7 Appled Economercs Topc 5: Specfcaon: Choosng Independen Varables (Sudenmund, Chaper 6 Specfcaon errors ha we wll deal wh: wrong ndependen varable; wrong funconal form. Ths lecure deals wh wrong ndependen

More information

EEL 6266 Power System Operation and Control. Chapter 5 Unit Commitment

EEL 6266 Power System Operation and Control. Chapter 5 Unit Commitment EEL 6266 Power Sysem Operaon and Conrol Chaper 5 Un Commmen Dynamc programmng chef advanage over enumeraon schemes s he reducon n he dmensonaly of he problem n a src prory order scheme, here are only N

More information

Existence and Uniqueness Results for Random Impulsive Integro-Differential Equation

Existence and Uniqueness Results for Random Impulsive Integro-Differential Equation Global Journal of Pure and Appled Mahemacs. ISSN 973-768 Volume 4, Number 6 (8), pp. 89-87 Research Inda Publcaons hp://www.rpublcaon.com Exsence and Unqueness Resuls for Random Impulsve Inegro-Dfferenal

More information

Fall 2010 Graduate Course on Dynamic Learning

Fall 2010 Graduate Course on Dynamic Learning Fall 200 Graduae Course on Dynamc Learnng Chaper 4: Parcle Flers Sepember 27, 200 Byoung-Tak Zhang School of Compuer Scence and Engneerng & Cognve Scence and Bran Scence Programs Seoul aonal Unversy hp://b.snu.ac.kr/~bzhang/

More information

January Examinations 2012

January Examinations 2012 Page of 5 EC79 January Examnaons No. of Pages: 5 No. of Quesons: 8 Subjec ECONOMICS (POSTGRADUATE) Tle of Paper EC79 QUANTITATIVE METHODS FOR BUSINESS AND FINANCE Tme Allowed Two Hours ( hours) Insrucons

More information

F-Tests and Analysis of Variance (ANOVA) in the Simple Linear Regression Model. 1. Introduction

F-Tests and Analysis of Variance (ANOVA) in the Simple Linear Regression Model. 1. Introduction ECOOMICS 35* -- OTE 9 ECO 35* -- OTE 9 F-Tess and Analyss of Varance (AOVA n he Smple Lnear Regresson Model Inroducon The smple lnear regresson model s gven by he followng populaon regresson equaon, or

More information

2/20/2013. EE 101 Midterm 2 Review

2/20/2013. EE 101 Midterm 2 Review //3 EE Mderm eew //3 Volage-mplfer Model The npu ressance s he equalen ressance see when lookng no he npu ermnals of he amplfer. o s he oupu ressance. I causes he oupu olage o decrease as he load ressance

More information

ON THE WEAK LIMITS OF SMOOTH MAPS FOR THE DIRICHLET ENERGY BETWEEN MANIFOLDS

ON THE WEAK LIMITS OF SMOOTH MAPS FOR THE DIRICHLET ENERGY BETWEEN MANIFOLDS ON THE WEA LIMITS OF SMOOTH MAPS FOR THE DIRICHLET ENERGY BETWEEN MANIFOLDS FENGBO HANG Absrac. We denfy all he weak sequenal lms of smooh maps n W (M N). In parcular, hs mples a necessary su cen opologcal

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecure Sldes for Machne Learnng nd Edon ETHEM ALPAYDIN, modfed by Leonardo Bobadlla and some pars from hp://www.cs.au.ac.l/~aparzn/machnelearnng/ The MIT Press, 00 alpaydn@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/mle

More information

TSS = SST + SSE An orthogonal partition of the total SS

TSS = SST + SSE An orthogonal partition of the total SS ANOVA: Topc 4. Orhogonal conrass [ST&D p. 183] H 0 : µ 1 = µ =... = µ H 1 : The mean of a leas one reamen group s dfferen To es hs hypohess, a basc ANOVA allocaes he varaon among reamen means (SST) equally

More information

Computing Relevance, Similarity: The Vector Space Model

Computing Relevance, Similarity: The Vector Space Model Compung Relevance, Smlary: The Vecor Space Model Based on Larson and Hears s sldes a UC-Bereley hp://.sms.bereley.edu/courses/s0/f00/ aabase Managemen Sysems, R. Ramarshnan ocumen Vecors v ocumens are

More information

Lecture 2 L n i e n a e r a M od o e d l e s

Lecture 2 L n i e n a e r a M od o e d l e s Lecure Lnear Models Las lecure You have learned abou ha s machne learnng Supervsed learnng Unsupervsed learnng Renforcemen learnng You have seen an eample learnng problem and he general process ha one

More information

New M-Estimator Objective Function. in Simultaneous Equations Model. (A Comparative Study)

New M-Estimator Objective Function. in Simultaneous Equations Model. (A Comparative Study) Inernaonal Mahemacal Forum, Vol. 8, 3, no., 7 - HIKARI Ld, www.m-hkar.com hp://dx.do.org/.988/mf.3.3488 New M-Esmaor Objecve Funcon n Smulaneous Equaons Model (A Comparave Sudy) Ahmed H. Youssef Professor

More information

Bayes rule for a classification problem INF Discriminant functions for the normal density. Euclidean distance. Mahalanobis distance

Bayes rule for a classification problem INF Discriminant functions for the normal density. Euclidean distance. Mahalanobis distance INF 43 3.. Repeon Anne Solberg (anne@f.uo.no Bayes rule for a classfcaon problem Suppose we have J, =,...J classes. s he class label for a pxel, and x s he observed feaure vecor. We can use Bayes rule

More information

Let s treat the problem of the response of a system to an applied external force. Again,

Let s treat the problem of the response of a system to an applied external force. Again, Page 33 QUANTUM LNEAR RESPONSE FUNCTON Le s rea he problem of he response of a sysem o an appled exernal force. Agan, H() H f () A H + V () Exernal agen acng on nernal varable Hamlonan for equlbrum sysem

More information

e-journal Reliability: Theory& Applications No 2 (Vol.2) Vyacheslav Abramov

e-journal Reliability: Theory& Applications No 2 (Vol.2) Vyacheslav Abramov June 7 e-ournal Relably: Theory& Applcaons No (Vol. CONFIDENCE INTERVALS ASSOCIATED WITH PERFORMANCE ANALYSIS OF SYMMETRIC LARGE CLOSED CLIENT/SERVER COMPUTER NETWORKS Absrac Vyacheslav Abramov School

More information

Dual Approximate Dynamic Programming for Large Scale Hydro Valleys

Dual Approximate Dynamic Programming for Large Scale Hydro Valleys Dual Approxmae Dynamc Programmng for Large Scale Hydro Valleys Perre Carpener and Jean-Phlppe Chanceler 1 ENSTA ParsTech and ENPC ParsTech CMM Workshop, January 2016 1 Jon work wh J.-C. Alas, suppored

More information

Survival Analysis and Reliability. A Note on the Mean Residual Life Function of a Parallel System

Survival Analysis and Reliability. A Note on the Mean Residual Life Function of a Parallel System Communcaons n Sascs Theory and Mehods, 34: 475 484, 2005 Copyrgh Taylor & Francs, Inc. ISSN: 0361-0926 prn/1532-415x onlne DOI: 10.1081/STA-200047430 Survval Analyss and Relably A Noe on he Mean Resdual

More information

[Link to MIT-Lab 6P.1 goes here.] After completing the lab, fill in the following blanks: Numerical. Simulation s Calculations

[Link to MIT-Lab 6P.1 goes here.] After completing the lab, fill in the following blanks: Numerical. Simulation s Calculations Chaper 6: Ordnary Leas Squares Esmaon Procedure he Properes Chaper 6 Oulne Cln s Assgnmen: Assess he Effec of Sudyng on Quz Scores Revew o Regresson Model o Ordnary Leas Squares () Esmaon Procedure o he

More information

J i-1 i. J i i+1. Numerical integration of the diffusion equation (I) Finite difference method. Spatial Discretization. Internal nodes.

J i-1 i. J i i+1. Numerical integration of the diffusion equation (I) Finite difference method. Spatial Discretization. Internal nodes. umercal negraon of he dffuson equaon (I) Fne dfference mehod. Spaal screaon. Inernal nodes. R L V For hermal conducon le s dscree he spaal doman no small fne spans, =,,: Balance of parcles for an nernal

More information

CH.3. COMPATIBILITY EQUATIONS. Continuum Mechanics Course (MMC) - ETSECCPB - UPC

CH.3. COMPATIBILITY EQUATIONS. Continuum Mechanics Course (MMC) - ETSECCPB - UPC CH.3. COMPATIBILITY EQUATIONS Connuum Mechancs Course (MMC) - ETSECCPB - UPC Overvew Compably Condons Compably Equaons of a Poenal Vecor Feld Compably Condons for Infnesmal Srans Inegraon of he Infnesmal

More information

Scattering at an Interface: Oblique Incidence

Scattering at an Interface: Oblique Incidence Course Insrucor Dr. Raymond C. Rumpf Offce: A 337 Phone: (915) 747 6958 E Mal: rcrumpf@uep.edu EE 4347 Appled Elecromagnecs Topc 3g Scaerng a an Inerface: Oblque Incdence Scaerng These Oblque noes may

More information

Supplementary Material to: IMU Preintegration on Manifold for E cient Visual-Inertial Maximum-a-Posteriori Estimation

Supplementary Material to: IMU Preintegration on Manifold for E cient Visual-Inertial Maximum-a-Posteriori Estimation Supplemenary Maeral o: IMU Prenegraon on Manfold for E cen Vsual-Ineral Maxmum-a-Poseror Esmaon echncal Repor G-IRIM-CP&R-05-00 Chrsan Forser, Luca Carlone, Fran Dellaer, and Davde Scaramuzza May 0, 05

More information

A HIERARCHICAL KALMAN FILTER

A HIERARCHICAL KALMAN FILTER A HIERARCHICAL KALMAN FILER Greg aylor aylor Fry Consulng Acuares Level 8, 3 Clarence Sree Sydney NSW Ausrala Professoral Assocae, Cenre for Acuaral Sudes Faculy of Economcs and Commerce Unversy of Melbourne

More information

(,,, ) (,,, ). In addition, there are three other consumers, -2, -1, and 0. Consumer -2 has the utility function

(,,, ) (,,, ). In addition, there are three other consumers, -2, -1, and 0. Consumer -2 has the utility function MACROECONOMIC THEORY T J KEHOE ECON 87 SPRING 5 PROBLEM SET # Conder an overlappng generaon economy le ha n queon 5 on problem e n whch conumer lve for perod The uly funcon of he conumer born n perod,

More information

Part II CONTINUOUS TIME STOCHASTIC PROCESSES

Part II CONTINUOUS TIME STOCHASTIC PROCESSES Par II CONTINUOUS TIME STOCHASTIC PROCESSES 4 Chaper 4 For an advanced analyss of he properes of he Wener process, see: Revus D and Yor M: Connuous marngales and Brownan Moon Karazas I and Shreve S E:

More information

Chapter 6 DETECTION AND ESTIMATION: Model of digital communication system. Fundamental issues in digital communications are

Chapter 6 DETECTION AND ESTIMATION: Model of digital communication system. Fundamental issues in digital communications are Chaper 6 DEECIO AD EIMAIO: Fundamenal ssues n dgal communcaons are. Deecon and. Esmaon Deecon heory: I deals wh he desgn and evaluaon of decson makng processor ha observes he receved sgnal and guesses

More information

SOME NOISELESS CODING THEOREMS OF INACCURACY MEASURE OF ORDER α AND TYPE β

SOME NOISELESS CODING THEOREMS OF INACCURACY MEASURE OF ORDER α AND TYPE β SARAJEVO JOURNAL OF MATHEMATICS Vol.3 (15) (2007), 137 143 SOME NOISELESS CODING THEOREMS OF INACCURACY MEASURE OF ORDER α AND TYPE β M. A. K. BAIG AND RAYEES AHMAD DAR Absrac. In hs paper, we propose

More information

Approximate Analytic Solution of (2+1) - Dimensional Zakharov-Kuznetsov(Zk) Equations Using Homotopy

Approximate Analytic Solution of (2+1) - Dimensional Zakharov-Kuznetsov(Zk) Equations Using Homotopy Arcle Inernaonal Journal of Modern Mahemacal Scences, 4, (): - Inernaonal Journal of Modern Mahemacal Scences Journal homepage: www.modernscenfcpress.com/journals/jmms.aspx ISSN: 66-86X Florda, USA Approxmae

More information

NATIONAL UNIVERSITY OF SINGAPORE PC5202 ADVANCED STATISTICAL MECHANICS. (Semester II: AY ) Time Allowed: 2 Hours

NATIONAL UNIVERSITY OF SINGAPORE PC5202 ADVANCED STATISTICAL MECHANICS. (Semester II: AY ) Time Allowed: 2 Hours NATONAL UNVERSTY OF SNGAPORE PC5 ADVANCED STATSTCAL MECHANCS (Semeser : AY 1-13) Tme Allowed: Hours NSTRUCTONS TO CANDDATES 1. Ths examnaon paper conans 5 quesons and comprses 4 prned pages.. Answer all

More information

2 Aggregate demand in partial equilibrium static framework

2 Aggregate demand in partial equilibrium static framework Unversy of Mnnesoa 8107 Macroeconomc Theory, Sprng 2009, Mn 1 Fabrzo Perr Lecure 1. Aggregaon 1 Inroducon Probably so far n he macro sequence you have deal drecly wh represenave consumers and represenave

More information

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.

This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore. Ths documen s downloaded from DR-NTU, Nanyang Technologcal Unversy Lbrary, Sngapore. Tle A smplfed verb machng algorhm for word paron n vsual speech processng( Acceped verson ) Auhor(s) Foo, Say We; Yong,

More information