A Novel Family of Boosted Online Regression Algorithms with Strong Theoretical Bounds


Dariush Kari, Farhan Khan, Selami Ciftci, Suleyman S. Kozat

arXiv, v2 [math.ST], 6 Dec 2016

Dariush Kari and Suleyman S. Kozat are with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey (kari@ee.bilkent.edu.tr, kozat@ee.bilkent.edu.tr). Selami Ciftci is with Turk Telekom Communications Services Inc., Istanbul, Turkey (selami.ciftci1@turktelekom.com.tr). Farhan Khan is with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey, and also with the Electrical Engineering Department, COMSATS Institute of Information Technology, Pakistan (khan@ee.bilkent.edu.tr, engrfarhan@ciit.net.pk).

Abstract: We investigate boosted online regression and propose a novel family of regression algorithms with strong theoretical bounds. In addition, we implement several variants of the proposed generic algorithm. We specifically provide theoretical bounds for the performance of our proposed algorithms that hold in a strong mathematical sense. We achieve guaranteed performance improvement over the conventional online regression methods without any statistical assumptions on the desired data or feature vectors. We demonstrate an intrinsic relationship, in terms of boosting, between the adaptive mixture-of-experts and data reuse algorithms. Furthermore, we introduce a boosting algorithm based on random updates that is significantly faster than the conventional boosting methods and the other variants of our proposed algorithms, while achieving an enhanced performance gain. Hence, the random updates method is specifically applicable to fast and high dimensional streaming data. Specifically, we investigate Newton Method-based and Stochastic Gradient Descent-based linear regression algorithms in a mixture-of-experts setting, and provide several variants of these well known adaptation methods. However, the proposed algorithms can be extended to other base learners, e.g., nonlinear or tree-based piecewise linear learners. Furthermore, we provide theoretical bounds for the computational complexity of our proposed algorithms. We demonstrate substantial performance gains in terms of mean square error over

the base learners through an extensive set of benchmark real data sets and simulated examples.

Keywords: Online boosting, online regression, boosted regression, ensemble learning, smooth boost, mixture methods

1 Introduction

Boosting is considered one of the most important ensemble learning methods in the machine learning literature and it is extensively used in several different real life applications from classification to regression (Bauer and Kohavi 1999; Dietterich 2000; Schapire and Singer 1999; Schapire and Freund 2012; Freund and Schapire 1997; Shrestha and Solomatine 2006; Shalev-Shwartz and Singer 2010; Saigo et al. 2009; Demiriz et al. 2002). As an ensemble learning method (Fern and Givan 2003; Soltanmohammadi et al. 2016; Duda et al. 2001), boosting combines several parallel running weakly performing algorithms to build a final strongly performing algorithm (Soltanmohammadi et al. 2016; Freund 2001; Schapire and Freund 2012; Mannor and Meir 2002). This is accomplished by finding a linear combination of weak learning algorithms in order to minimize the total loss over a set of training data, commonly using a functional gradient descent (Duffy and Helmbold 2002; Freund and Schapire 1997). Boosting has been successfully applied to several different problems in the machine learning literature, including classification (Jin and Zhang 2007; Chapelle et al. 2011; Freund and Schapire 1997), regression (Duffy and Helmbold 2002; Shrestha and Solomatine 2006), and prediction (Taieb and Hyndman 2014, 2013). However, significantly less attention has been given to the idea of boosting in the online regression framework. To this end, our goal is (a) to introduce a new boosting approach for online regression, (b) to derive several different online regression algorithms based on this boosting approach, (c) to provide mathematical guarantees for the performance improvements of our algorithms, and (d) to demonstrate the intrinsic connections of boosting with the adaptive mixture-of-experts algorithms (Arenas-Garcia et al. 2016; Kozat et al. 2010) and data reuse algorithms (Shaffer and Williams 1983).

Although boosting was initially introduced in the batch setting (Freund and Schapire 1997), where algorithms boost themselves over a fixed set of training data, it was later extended to the online setting (Oza and Russell 2001). In the online setting, however, we neither need nor have access to a fixed set of training data, since the data samples arrive one by one as a stream (Ben-David et al. 1997; Fern and Givan 2003; Lu et al. 2016). Each newly arriving data sample is processed and then discarded without any storing. The online setting is naturally motivated by many real life applications, especially those involving big data, where there may not be enough storage space available or the constraints of the problem require instant processing (Bottou and Bousquet 2008). Therefore, we concentrate on the online boosting framework and propose several algorithms for online regression tasks. In addition, since our algorithms are online, they can be directly used in adaptive filtering applications to improve the performance of conventional mixture-of-experts methods (Arenas-Garcia et al. 2016). For adaptive filtering purposes, the online setting is especially important, where the sequentially

arriving data is used to adjust the internal parameters of the filter, either to dynamically learn the underlying model or to track the nonstationary data statistics (Arenas-Garcia et al. 2016; Sayed 2003). Specifically, we have m parallel running weak learners (WLs) (Schapire and Freund 2012) that receive the input vectors sequentially. Each WL uses an update method, such as the second order Newton's Method (NM) or Stochastic Gradient Descent (SGD), depending on the target of the application or the problem constraints (Sayed 2003). After receiving the input vector, each algorithm produces its output and then calculates its instantaneous error after the observation is revealed. In the most generic setting, this estimation/prediction error and the corresponding input vector are then used to update the internal parameters of the algorithm to minimize an a priori defined loss function, e.g., the instantaneous squared error for the SGD algorithm. These updates are performed for all of the m WLs in the mixture. However, in the online boosting approaches, these adaptations at each time proceed in rounds from top to bottom, starting from the first WL to the last one, to achieve the boosting effect (Chen et al. 2012). Furthermore, unlike the usual mixture approaches (Arenas-Garcia et al. 2016; Kozat et al. 2010), the update of each WL depends on the previous WLs in the mixture. In particular, at each time t, after the k-th WL calculates its error over the (x_t, d_t) pair, it passes a certain weight to the next WL, the (k+1)-th WL, quantifying how much error the constituent WLs from the 1st to the k-th made on the current (x_t, d_t) pair. Based on the performance of the WLs 1 to k on the current (x_t, d_t) pair, the (k+1)-th WL may give a different emphasis (importance weight) to the (x_t, d_t) pair in its adaptation in order to rectify the mistakes of the previous WLs.

The proposed idea for online boosting is clearly related to the adaptive mixture-of-experts algorithms widely used in the machine learning literature, where several parallel running adaptive algorithms are combined to improve the performance. In the mixture methods, the performance improvement is achieved due to the diversity provided by using several different adaptive algorithms, each having a different view or advantage (Kozat et al. 2010). This diversity is exploited to yield a final combined algorithm, which achieves a performance better than any of the algorithms in the mixture. Although the online boosting approach is similar to mixture approaches (Kozat et al. 2010), there are significant differences. In the online boosting notion, the parallel running algorithms are not independent, i.e., one deliberately introduces the diversity by updating the WLs one by one, from the first WL to the m-th WL, for each new sample based on the performance of all the previous WLs on this sample. In this sense, each adaptive algorithm, say the (k+1)-th WL, receives feedback from the previous WLs, i.e., the 1st to the k-th, and updates its inner parameters accordingly. As an example, if the current (x_t, d_t) is well modeled by the previous WLs, then the (k+1)-th WL performs a minor update using (x_t, d_t) and may give more emphasis (importance weight) to later arriving samples that may be worse modeled by the previous WLs. Thus, by boosting, each adaptive algorithm in the mixture can concentrate on different parts of the input and output pairs, achieving diversity and significantly improving the gain.

The linear online learning algorithms, such as SGD or NM, are among the simplest as well as the most widely used regression algorithms in real-life applications (Sayed 2003). Therefore, we use such algorithms as base WLs in our boosting algorithms. To this end, we first apply the boosting notion to several parallel

running linear NM-based WLs and introduce three different approaches to use the importance weights (Chen et al. 2012), namely weighted updates, data reuse, and random updates. In the first approach, we use the importance weights directly to produce certain weighted NM algorithms. In the second approach, we use the importance weights to construct data reuse adaptive algorithms (Oza and Russell 2001). However, data reuse in boosting, such as in (Oza and Russell 2001), is significantly different from the usual data reusing approaches in adaptive filtering (Shaffer and Williams 1983). As an example, in boosting, the importance weight coming from the k-th WL determines the data reuse amount in the (k+1)-th WL, i.e., it is not used for the k-th filter, hence achieving the diversity. The third approach uses the importance weights to decide whether to update the constituent WLs or not, based on a random number generated from a Bernoulli distribution with parameter equal to the weight. The latter method can be effectively used for big data processing (Malik 2013) due to its reduced complexity. The outputs of the constituent WLs are also combined using a linear mixture algorithm to construct the final output. We then update the final combination algorithm using the SGD algorithm (Kozat et al. 2010). Furthermore, we extend the boosting idea to parallel running linear SGD-based algorithms, similar to the NM case.

We start our discussions by investigating the related works in Section 2. We then introduce the problem setup and background in Section 3, where we provide individual sequence as well as MSE convergence results for the NM and SGD algorithms. We introduce our generic boosted online regression algorithm in Section 4 and provide the mathematical justifications for its performance. Then, in Sections 5 and 6, three different variants of the proposed boosting algorithm are derived, using the NM and SGD, respectively. Then, in Section 7 we provide the mathematical analysis of the computational complexity of the proposed algorithms. The paper concludes with extensive sets of experiments over well known benchmark data sets and simulation models widely used in the machine learning literature to demonstrate the significant gains achieved by the boosting notion.

2 Related Works

AdaBoost is one of the earliest and most popular boosting methods, which has been used for binary and multiclass classification as well as regression (Freund and Schapire 1997). This algorithm has been well studied and has clear theoretical guarantees, and its excellent performance is explained rigorously (Breiman 1997). However, AdaBoost cannot perform well on noisy data sets (Servedio 2003); therefore, other boosting methods have been suggested that are more robust against noise. In order to reduce the effect of noise, SmoothBoost was introduced in (Servedio 2003) in a batch setting. Moreover, in (Servedio 2003) the author proves the termination time of the SmoothBoost algorithm by simultaneously obtaining upper and lower bounds on the weighted advantage of all samples over all of the weak learners. We note that the SmoothBoost algorithm avoids overemphasizing the noisy samples, hence providing robustness against noise. In (Oza and Russell 2001), the authors extend bagging and boosting methods to an online setting, where they use a Poisson sampling process to approximate the reweighting algorithm. However, the online boosting method in (Oza and Russell 2001)

corresponds to AdaBoost, which is susceptible to noise. In (Babenko et al. 2009), the authors use a greedy optimization approach to develop the boosting notion in the online setting and introduce stochastic boosting. Nevertheless, while most of the online boosting algorithms in the literature seek to approximate AdaBoost, (Chen et al. 2012) investigates the inherent difference between batch and online learning, extends the SmoothBoost algorithm to an online setting, and provides mathematical guarantees for their algorithm. (Chen et al. 2012) points out that the online weak learners do not need to perform well on all possible distributions of data; instead, they have to perform well only with respect to smoother distributions. Recently, in (Beygelzimer et al. 2015b) the authors have developed two online boosting algorithms for classification, an optimal algorithm in terms of the number of weak learners, and also an adaptive algorithm using potential functions and boost-by-majority (Freund 1995). In addition to the classification task, the boosting approach has also been developed for regression (Duffy and Helmbold 2002). In (Bertoni et al. 1997), a boosting algorithm for regression is proposed, which is an extension of AdaBoost.R (Bertoni et al. 1997). Moreover, in (Duffy and Helmbold 2002), several gradient descent algorithms are presented, and some bounds on their performances are provided. In (Babenko et al. 2009) the authors present a family of boosting algorithms for online regression through greedy minimization of a loss function. Also, in (Beygelzimer et al. 2015a) the authors propose an online gradient boosting algorithm for regression.

In this paper we propose a novel family of boosted online algorithms for the regression task using the online boosting notion introduced in (Chen et al. 2012), and investigate three different variants of the introduced algorithm. Furthermore, we show that our algorithm can achieve a desired mean squared error (MSE), given a sufficient amount of data and a sufficient number of weak learners. In addition, we use techniques similar to (Servedio 2003) to prove the correctness of our algorithm. We emphasize that our algorithm has a guaranteed performance in an individual sequence manner, i.e., without any statistical assumptions on the data. In establishing our algorithm and its justifications, we refrain from converting the regression problem into a classification problem, unlike AdaBoost.R (Freund and Schapire 1997). Furthermore, unlike the online SmoothBoost (Chen et al. 2012), our algorithm can learn the guaranteed MSE of the weak learners, which in turn improves its adaptivity.

3 Problem Description and Background

All vectors are column vectors and are represented by bold lower case letters. Matrices are represented by bold upper case letters. For a vector a (or a matrix A), a^T (or A^T) is its transpose and Tr(A) is the trace of the matrix A. Here, I_m and 0_m represent the identity matrix of dimension m x m and the all zeros vector of length m, respectively. Except for I_m and 0_m, the time index is given in the subscript, i.e., x_t is the sample at time t. We work with real data for notational simplicity. We denote the mean of a random variable x as E[x]. Also, we denote the cardinality of a set S by |S|. We sequentially receive r-dimensional input (regressor) vectors {x_t}, x_t in R^r, and desired data {d_t}, and estimate d_t by \hat{d}_t = f_t(x_t), where f_t(.) is an

online regression algorithm. At each time t the estimation error is given by e_t = d_t - \hat{d}_t and is used to update the parameters of the WL. For presentation purposes, we assume that d_t is in [-1, 1]; however, our derivations hold for any bounded but arbitrary desired data sequences. In our framework, we do not use any statistical assumptions on the input feature vectors or on the desired data, such that our results are guaranteed to hold in an individual sequence manner (Kozat and Singer Jan. 2008).

The linear methods are considered the simplest online modeling or learning algorithms, which estimate the desired data d_t by a linear model as \hat{d}_t = w_t^T x_t, where w_t is the linear algorithm's coefficient vector at time t. Note that this expression also covers the affine model if one includes a constant term in x_t, hence we use the purely linear form for notational simplicity. When the true d_t is revealed, the algorithm updates its coefficients w_t based on the error e_t. As an example, in the basic implementation of the NM algorithm, the coefficients are selected to minimize the accumulated squared regression error up to time t-1 as

w_t = \arg\min_{w} \sum_{l=1}^{t-1} (d_l - x_l^T w)^2 = \left( \sum_{l=1}^{t-1} x_l x_l^T \right)^{-1} \left( \sum_{l=1}^{t-1} x_l d_l \right),   (1)

where w is a fixed vector of coefficients. The NM algorithm is shown to enjoy several optimality properties under different statistical settings (Sayed 2003). Apart from these results, and more related to the framework of this paper, the NM algorithm is also shown to be rate optimal in an individual sequence manner (Merhav and Feder 1993). As shown in (Merhav and Feder 1993, Section V), when applied to any sequences {x_t} and {d_t}, the accumulated squared error of the NM algorithm is as small as the accumulated squared error of the best batch least squares (LS) method that is directly optimized for these realizations of the sequences, i.e., for all T, {x_t} and {d_t}, the NM achieves

\sum_{l=1}^{T} (d_l - x_l^T w_l)^2 - \min_{w} \sum_{l=1}^{T} (d_l - x_l^T w)^2 \le O(\ln T).   (2)

The NM algorithm is a member of the Follow-the-Leader type algorithms (Cesa-Bianchi and Lugosi 2006, Section 3), where one uses the best performing linear model up to time t-1 to predict d_t. Hence, (2) follows by direct application of the online convex optimization results (Shalev-Shwartz 2012) after regularization. The convergence rate (or the rate of the regret) of the NM algorithm is also shown to be optimal, so that the O(\ln T) in the upper bound cannot be improved (Singer et al. 2002). It is also shown in (Singer et al. 2002) that one can reach the optimal upper bound (with exact scaling terms) by using a slightly modified version of (1):

w_t = \left( \sum_{l=1}^{t} x_l x_l^T \right)^{-1} \left( \sum_{l=1}^{t-1} x_l d_l \right).   (3)

Note that the extension (3) of (1) is a forward algorithm (Section 5 of Azoury and Warmuth 2001), and one can show that, in the scalar case, the predictions of (3) are always bounded, which is not the case for (1) (Singer et al. 2002).
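To make the forward prediction rule (3) concrete, the following is a minimal NumPy sketch of an online NM (follow-the-leader least squares) learner; the small ridge term `reg` is an assumption added only to keep the matrix invertible at start-up, and the function name is illustrative rather than the paper's:

```python
import numpy as np

def nm_forward_predictions(X, d, reg=1e-3):
    """Online NM predictions in the spirit of (1) and (3).

    X: (T, r) regressor matrix, d: (T,) desired data.
    reg: small ridge term (an assumption; the paper only refers to
    regularization abstractly when invoking online convex optimization).
    """
    T, r = X.shape
    A = reg * np.eye(r)          # running sum of x_l x_l^T (plus regularization)
    b = np.zeros(r)              # running sum of x_l d_l
    preds = np.zeros(T)
    for t in range(T):
        x = X[t]
        A += np.outer(x, x)      # forward rule (3): include x_t before predicting
        w = np.linalg.solve(A, b)
        preds[t] = w @ x
        b += d[t] * x            # d_t is revealed only after the prediction
    return preds
```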

We emphasize that in the basic application of the NM algorithm, all data pairs (d_l, x_l), l = 1, ..., t, receive the same importance or weight in (1). Although there exist exponentially weighted or windowed versions of the basic NM algorithm (Sayed 2003), these methods weight (or concentrate on) the most recent samples for better modeling of the nonstationarity (Sayed 2003). However, in the boosting framework (Freund and Schapire 1997), each sample pair receives a different weight based not only on those weighting schemes, but also on the performance of the boosted algorithms on this pair. As an example, if a WL performs worse on a sample, the next WL concentrates more on this example to better rectify this mistake. In the following sections, we use this notion to derive different boosted online regression algorithms. Although in this paper we use linear WLs for the sake of notational simplicity, one can readily extend our approach to nonlinear and piecewise linear regression methods. For example, one can use tree based online regression methods (Khan et al. 2016; Vanli and Kozat 2014; Kozat et al. 2007) as the weak learners, and boost them with the proposed approach.

4 New Boosted Online Regression Algorithm

In this section we present the generic form of our proposed algorithms and provide the guaranteed performance bounds. Regarding the notion of online boosting introduced in (Chen et al. 2012), the online weak learners need to perform well only over smooth distributions of data points. We first present the generic algorithm in Algorithm 1 and provide its theoretical justifications, and then discuss its structure and the intuition behind it.

Algorithm 1 Boosted online regression algorithm
1: Input: (x_t, d_t) (data stream), m (number of weak learners running in parallel), sigma_m^2 (the modified desired MSE), and sigma^2 (the guaranteed achievable weighted MSE).
2: Initialize the regression coefficients w_1^(k) for each WL, and the combination coefficients as z_1 = (1/m)[1, 1, ..., 1]^T;
3: for t = 1 to T do
4:   Receive the regressor data instance x_t;
5:   Compute the WLs' outputs \hat{d}_t^(k);
6:   Produce the final estimate \hat{d}_t = z_t^T y_t, where y_t = [\hat{d}_t^(1), ..., \hat{d}_t^(m)]^T;
7:   Receive the true output d_t (desired data);
8:   lambda_t^(1) = 1; l_t^(1) = 0;
9:   for k = 1 to m do
10:     lambda_t^(k) = min{1, (sigma^2)^(l_t^(k)/2)};
11:     Update the k-th WL, such that it has a weighted MSE of at most sigma^2;
12:     e_t^(k) = d_t - \hat{d}_t^(k);
13:     l_t^(k+1) = l_t^(k) + [sigma_m^2 - (e_t^(k))^2];
14:   end for
15:   Update z_t based on e_t = d_t - z_t^T y_t;
16: end for
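For illustration, the loop structure of Algorithm 1 can be sketched in Python as below. This is a minimal sketch under stated assumptions: the `predict`/`update` interface of the weak learners and the combiner step size `mu_z` are illustrative choices, not the paper's notation, and the combiner update anticipates the normalized SGD rule (8) detailed in Section 4.1.

```python
import numpy as np

def boosted_online_regression(stream, weak_learners, sigma_m2, sigma2, mu_z=0.01):
    """Sketch of Algorithm 1; `stream` yields (x, d) pairs one at a time."""
    m = len(weak_learners)
    z = np.ones(m) / m                              # combination weights z_1
    for x, d in stream:
        y = np.array([wl.predict(x) for wl in weak_learners])
        d_hat = z @ y                               # final estimate (line 6)
        lam, loss = 1.0, 0.0                        # lambda_t^(1) = 1, l_t^(1) = 0
        for k, wl in enumerate(weak_learners):
            lam = min(1.0, sigma2 ** (loss / 2.0))  # line 10
            wl.update(x, d, lam)                    # line 11: weighted update of the k-th WL
            e_k = d - y[k]                          # line 12
            loss += sigma_m2 - e_k ** 2             # line 13: l passed to the next WL
        e = d - d_hat
        z = z + mu_z * e * y / max(y @ y, 1e-12)    # line 15, normalized SGD as in (8)
        yield d_hat
```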

In Algorithm 1, we have m copies of an online WL, each of which is guaranteed to have a weighted MSE of at most sigma^2. We prove that Algorithm 1 can reach a desired MSE, sigma_d^2, through Lemma 1, Lemma 2, and Theorem 1. Note that since we assume d_t is in [-1, 1], the trivial solution \hat{d}_t = 0 incurs an MSE of at most 1. Therefore, we define a weak learner as an algorithm which has an MSE less than 1.

Lemma 1. In Algorithm 1, if there is an integer M such that \sum_{t=1}^{T} lambda_t^(k) >= kappa T for every k <= M, and also \sum_{t=1}^{T} lambda_t^(M+1) < kappa T, where 0 < kappa < sigma_d^2 is arbitrarily chosen, it can reach a desired MSE, sigma_d^2.

Proof. The proof of Lemma 1 is given in Appendix A.

Lemma 2. If the weak learners are guaranteed to have a weighted MSE less than sigma^2, i.e., for all k: ( \sum_{t=1}^{T} lambda_t^(k) (e_t^(k))^2 ) / ( 4 \sum_{t=1}^{T} lambda_t^(k) ) <= sigma^2 < 1/4, then there is an integer M that satisfies the conditions in Lemma 1.

Proof. The proof of Lemma 2 is given in Appendix B.

Theorem 1. If the weak learners in line 11 of Algorithm 1 achieve a weighted MSE of at most sigma^2 < 1/4, there exists an upper bound on m such that the algorithm reaches the desired MSE.

Proof. This theorem is a direct consequence of combining Lemma 1 and Lemma 2.

Note that although we use copies of a base learner as the weak learners and seek to improve its performance, the constituent WLs can be different. However, by using the boosting approach, we can improve the MSE performance of the overall system as long as the WLs can provide a weighted MSE of at most sigma^2. For example, we can improve the performance of mixture-of-experts algorithms (Arenas-Garcia et al. 2016) by leveraging the boosting approach introduced in this paper.

As shown in Fig. 1, at each iteration t, we have m parallel running WLs with estimating functions f_t^(k), producing estimates \hat{d}_t^(k) = f_t^(k)(x_t) of d_t, k = 1, ..., m. As an example, if we use m linear algorithms, \hat{d}_t^(k) = x_t^T w_t^(k) is the estimate generated by the k-th WL. The outputs of these m WLs are then combined using the linear weights z_t to produce the final estimate as \hat{d}_t = z_t^T y_t (Kozat et al. 2010), where y_t = [\hat{d}_t^(1), ..., \hat{d}_t^(m)]^T is the vector of outputs. After the desired output d_t is revealed, the m parallel running WLs are updated for the next iteration. Moreover, the linear combination coefficients z_t are also updated using the normalized SGD (Sayed 2003), as detailed later in Section 4.1.

After d_t is revealed, the constituent WLs, f_t^(k), k = 1, ..., m, are consecutively updated, as shown in Fig. 1, from top to bottom, i.e., first k = 1 is updated, then k = 2, and finally k = m is updated. However, to enhance the performance, we use a boosted updating approach (Freund and Schapire 1997), such that the (k+1)-th WL receives a total loss parameter, l_t^(k+1), from the k-th WL, as

l_t^(k+1) = l_t^(k) + [ sigma_m^2 - (d_t - f_t^(k)(x_t))^2 ],   (4)

to compute a weight lambda_t^(k+1). The total loss parameter l_t^(k) indicates the sum of the differences between the modified desired MSE (sigma_m^2) and the squared errors of the first k-1 WLs at time t.

Fig. 1: The block diagram of a boosted online regression system that uses the input vector x_t to produce the final estimate \hat{d}_t. There are m constituent WLs f^(1), ..., f^(m), each of which is an online linear algorithm that generates its own estimate \hat{d}_t^(k). The final estimate \hat{d}_t is a linear combination of the estimates generated by all these constituent WLs, with the combination weights z_t^(k) corresponding to the \hat{d}_t^(k). The combination weights are stored in a vector which is updated after each iteration. At time t the k-th WL is updated based on the values of lambda_t^(k) and e_t^(k), and provides the (k+1)-th filter with l_t^(k+1), which is used to compute lambda_t^(k+1). The parameter delta_{t-1}^(k) indicates the weighted MSE of the k-th WL over the first t-1 estimations, and is used in computing lambda_t^(k).

Then, we add the difference sigma_m^2 - (e_t^(k))^2 to l_t^(k) to generate l_t^(k+1), and pass l_t^(k+1) to the next WL, as shown in Fig. 1. Here, [ sigma_m^2 - (d_t - f_t^(k)(x_t))^2 ] measures how much the k-th WL is off with respect to the final MSE performance goal. For example, in a stationary environment, if d_t = f(x_t) + nu_t, where f(.) is a deterministic function and nu_t is the observation noise, one can select the desired MSE sigma_d^2 as an upper bound on the variance of the noise process nu_t, and define a modified desired MSE as sigma_m^2 = (sigma_d^2 - kappa) / (1 - kappa). In this sense, l_t^(k) measures how the WLs j = 1, ..., k-1 are cumulatively performing on the (d_t, x_t) pair with respect to the final performance goal. We then use the weight lambda_t^(k) to update the k-th WL with the weighted updates, data reuse, or random updates method, which we explain later in Sections 5 and 6.

Our aim is to make lambda_t^(k) large if the first k-1 WLs made large errors on d_t, so that the k-th WL gives more importance to (x_t, d_t) in order to rectify the performance of the overall system. We now explain how to construct these weights such that 0 < lambda_t^(k) <= 1. To this end, we set lambda_t^(1) = 1 for all t, and introduce a weighting similar to (Servedio 2003; Chen et al. 2012). We define the weights as

lambda_t^(k) = \min\{ 1, (\sigma^2)^{l_t^{(k)}/2} \},   (5)

where sigma^2 is the guaranteed upper bound on the weighted MSE of the weak learners. However, since there is no prior information about the exact MSE performance of the weak learners, we use the following weighting scheme

lambda_t^(k) = \min\{ 1, (\delta_{t-1}^{(k)})^{c\, l_t^{(k)}} \},   (6)

where delta_{t-1}^(k) indicates an estimate of the k-th weak learner's MSE, and c >= 0 is a design parameter which determines the dependence of each WL update on the performance of the previous WLs, i.e., c = 0 corresponds to independent updates, like the ordinary combination of WLs in adaptive filtering (Kozat et al. 2010; Arenas-Garcia et al. 2016), while a greater c indicates a greater effect of the previous WLs' performance on the weight lambda_t^(k) of the current WL. Note that including the parameter c does not change the validity of our proofs, since one can take (delta_{t-1}^(k))^{2c} as the new guaranteed weighted MSE. Here, delta_{t-1}^(k) is an estimate of the Weighted Mean Squared Error (WMSE) of the k-th WL over {x_t} and {d_t}. In the basic implementation of online boosting (Servedio 2003; Chen et al. 2012), 1 - delta_{t-1}^(k) is set to the classification advantage of the weak learners (Servedio 2003), where this advantage is assumed to be the same for all weak learners. In this paper, to avoid using any a priori knowledge and to be completely adaptive, we choose delta_{t-1}^(k) as the weighted and thresholded MSE of the k-th WL up to time t-1 as

\delta_t^{(k)} = \frac{ \sum_{\tau=1}^{t} \frac{\lambda_\tau^{(k)}}{4} \left( d_\tau - [f_\tau^{(k)}(x_\tau)]^{+} \right)^2 }{ \sum_{\tau=1}^{t} \lambda_\tau^{(k)} } = \frac{ \Lambda_{t-1}^{(k)} \delta_{t-1}^{(k)} + \frac{\lambda_t^{(k)}}{4} \left( d_t - [f_t^{(k)}(x_t)]^{+} \right)^2 }{ \Lambda_{t-1}^{(k)} + \lambda_t^{(k)} },   (7)

where Lambda_t^(k) = Lambda_{t-1}^(k) + lambda_t^(k), and [f_\tau^(k)(x_\tau)]^+ thresholds f_\tau^(k)(x_\tau) into the range [-1, 1]. This thresholding is necessary to ensure that 0 < delta_t^(k) <= 1, which guarantees 0 < lambda_t^(k) <= 1 for all k = 1, ..., m and all t. We point out that (7) can be calculated recursively. Regarding the definition of lambda_t^(k), if the first k WLs are good, we will pass less weight to the next WLs, such that those WLs can concentrate more on the other samples. Hence, the WLs can increase the diversity by concentrating on different parts of the data (Kozat et al. 2010). Furthermore, following this idea, in (6) the weight lambda_t^(k) is larger, i.e., close to 1, if most of the WLs 1, ..., k-1 have squared errors larger than sigma_m^2 on (x_t, d_t), and smaller, i.e., close to 0, if the pair (x_t, d_t) is easily modeled by the previous WLs, such that the WLs k, ..., m do not need to concentrate more on this pair.
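The bookkeeping behind (6) and (7) for a single WL can be sketched as follows. This is a minimal illustration assuming plain Python/NumPy; the class and method names are not from the paper, and the guard for an uninitialized delta is an implementation assumption.

```python
import numpy as np

class WeightState:
    """Recursive weight bookkeeping of (6)-(7) for one weak learner."""
    def __init__(self):
        self.delta = 0.0      # delta_{t-1}^(k), weighted thresholded MSE estimate
        self.Lambda = 0.0     # Lambda_{t-1}^(k), running sum of weights

    def weight(self, loss, c=1.0):
        # lambda_t^(k) = min{1, (delta_{t-1}^(k))^(c * l_t^(k))}, eq. (6)
        if self.delta <= 0.0:
            return 1.0                      # before any update, behave like lambda = 1
        return min(1.0, self.delta ** (c * loss))

    def update(self, lam, d, f_x):
        f_clip = np.clip(f_x, -1.0, 1.0)    # threshold the WL output into [-1, 1]
        sq = (d - f_clip) ** 2 / 4.0        # normalized squared error
        self.delta = (self.Lambda * self.delta + lam * sq) / (self.Lambda + lam)
        self.Lambda += lam                  # Lambda_t^(k) = Lambda_{t-1}^(k) + lambda_t^(k)
```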

4.1 The Combination Algorithm

Although in the proof of our algorithm we assume a constant combination vector z over time, we use a time varying combination vector in practice, since there is no knowledge about the exact number of required weak learners for each problem. Hence, after d_t is revealed, we also update the final combination weights z_t based on the final output \hat{d}_t = z_t^T y_t, where y_t = [\hat{d}_t^(1), ..., \hat{d}_t^(m)]^T. To update the final combination weights, we use the normalized SGD algorithm (Sayed 2003), yielding

z_{t+1} = z_t + \mu_z e_t \frac{y_t}{\|y_t\|^2}.   (8)

4.2 Choice of Parameter Values

The choice of sigma_m^2 is a crucial task, i.e., we cannot reach any desired MSE for any data sequence unconditionally. As an example, suppose that the data are generated randomly according to a known distribution, while they are contaminated with a white noise process. It is clear that we cannot obtain an MSE level below the noise power. However, if the WLs are guaranteed to satisfy the conditions of Theorem 1, this would not happen. Intuitively, there is a guaranteed upper bound (i.e., sigma^2) on the worst case performance, since in the weighted MSE the samples with a higher error have a more important effect. On the other hand, if one chooses a sigma_m^2 smaller than the noise power, l_t^(k) will be negative for almost every k, turning most of the weights into 1, and as a result the weak learners fail to reach a weighted MSE smaller than sigma^2. Nevertheless, in practice we have to choose the parameter sigma_m^2 reasonably and precisely, such that the conditions of Theorem 1 are satisfied. For instance, we set sigma_m^2 to be an upper bound on the noise power. In addition, the number of weak learners, m, is chosen according to the computational complexity constraints. However, in our experiments we choose a moderate number of weak learners, m = 20, which successfully improves the performance. Moreover, according to the results in Section 8.3, the optimum value for c is around 1, hence we set the parameter c = 1 in our simulations.

5 Boosted NM Algorithms

At each time t, all of the WLs (shown in Fig. 1) estimate the desired data d_t in parallel, and the final estimate is a linear combination of the results generated by the WLs. When the k-th WL receives the weight lambda_t^(k), it updates the linear coefficients w_t^(k) using one of the following methods.

5.1 Directly Using the lambda's as Sample Weights

Here, we consider lambda_t^(k) as the weight for the observation pair (x_t, d_t) and apply a weighted NM update to w_t^(k). For this particular weighted NM algorithm, we define the Hessian matrix and the gradient vector as

R_{t+1}^{(k)} = \beta R_t^{(k)} + \lambda_t^{(k)} x_t x_t^T,   (9)
p_{t+1}^{(k)} = \beta p_t^{(k)} + \lambda_t^{(k)} x_t d_t,   (10)

where beta is the forgetting factor (Sayed 2003), and w_{t+1}^(k) = (R_{t+1}^(k))^{-1} p_{t+1}^(k) can be calculated in a recursive manner as

e_t^{(k)} = d_t - x_t^T w_t^{(k)},
g_t^{(k)} = \frac{\lambda_t^{(k)} P_t^{(k)} x_t}{\beta + \lambda_t^{(k)} x_t^T P_t^{(k)} x_t},
w_{t+1}^{(k)} = w_t^{(k)} + e_t^{(k)} g_t^{(k)},
P_{t+1}^{(k)} = \beta^{-1} \left( P_t^{(k)} - g_t^{(k)} x_t^T P_t^{(k)} \right),   (11)

where P_t^(k) = (R_t^(k))^{-1}, P_0^(k) = v^{-1} I, and 0 < v <= 1. The complete algorithm is given in Algorithm 2 with the weighted NM implementation in (11).

5.2 Data Reuse Approaches Based on the Weights

Another approach follows Ozaboost (Oza and Russell 2001). In this approach, from lambda_t^(k) we generate an integer, say n_t^(k) = ceil(K lambda_t^(k)), where K is a design parameter that takes on positive integer values. We then apply the NM update on the (x_t, d_t) pair repeatedly n_t^(k) times, i.e., run the NM update on the same (x_t, d_t) pair n_t^(k) times consecutively. Note that K should be determined according to the computational complexity constraints. However, increasing K does not necessarily result in a better performance; therefore, we use moderate values for K, e.g., we use K = 5 in our simulations. The final w_{t+1}^(k) is calculated after n_t^(k) NM updates. As a major advantage, this reusing approach can clearly be generalized to other adaptive algorithms in a straightforward manner. We point out that Ozaboost (Oza and Russell 2001) uses a different data reuse strategy. In that approach, lambda_t^(k) is used as the parameter of a Poisson distribution and an integer n_t^(k) is randomly generated from this Poisson distribution. One then applies the NM update n_t^(k) times.

5.3 Random Updates Approach Based on the Weights

In this approach, we simply use the weight lambda_t^(k) as the probability of updating the k-th WL at time t. To this end, we generate a Bernoulli random variable, which is 1 with probability lambda_t^(k) and 0 with probability 1 - lambda_t^(k). Then, we update the k-th WL only if the Bernoulli random variable equals 1. With this method, we significantly reduce the computational complexity of the algorithm. Moreover, due to the dependence of this Bernoulli random variable on the performance of the previous constituent WLs, this method does not degrade the MSE performance, while offering a considerably lower complexity, i.e., when the MSE is low, there is no need for further updates, hence the probability of an update is low, while this probability is larger when the MSE is high.
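A minimal sketch of the weighted NM recursion (11) from Section 5.1 is given below, assuming a simple per-learner class; the class name and constructor defaults are illustrative, with beta the forgetting factor and P_0 = I / v as in the text.

```python
import numpy as np

class WeightedNMLearner:
    """One NM-based WL with the weighted recursion (9)-(11)."""
    def __init__(self, r, beta=0.999, v=1.0):
        self.w = np.zeros(r)
        self.P = np.eye(r) / v               # P_0^(k) = I / v
        self.beta = beta

    def predict(self, x):
        return self.w @ x

    def update(self, x, d, lam):
        e = d - self.w @ x                                   # a priori error e_t^(k)
        Px = self.P @ x
        g = lam * Px / (self.beta + lam * (x @ Px))          # gain vector g_t^(k)
        self.w = self.w + e * g                              # w_{t+1}^(k)
        self.P = (self.P - np.outer(g, Px)) / self.beta      # P_{t+1}^(k)
```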

Algorithm 2 Boosted NM-based algorithm
1: Input: (x_t, d_t) (data stream), m (number of WLs), and sigma_m^2.
2: Initialize the regression coefficients w_1^(k) for each WL, the combination coefficients as z_1 = (1/m)[1, 1, ..., 1]^T, and for all k set delta_0^(k) = 0.
3: for t = 1 to T do
4:   Receive the regressor data instance x_t;
5:   Compute the WLs' outputs \hat{d}_t^(k) = x_t^T w_t^(k);
6:   Produce the final estimate \hat{d}_t = z_t^T [\hat{d}_t^(1), ..., \hat{d}_t^(m)]^T;
7:   Receive the true output d_t (desired data);
8:   lambda_t^(1) = 1; l_t^(1) = 0;
9:   for k = 1 to m do
10:     lambda_t^(k) = min{1, (delta_{t-1}^(k))^(c l_t^(k))};
11:     Update the regression coefficients w_t^(k) by using the NM and the weight lambda_t^(k), based on one of the algorithms introduced in Section 5;
12:     e_t^(k) = d_t - \hat{d}_t^(k);
13:     delta_t^(k) = ( Lambda_{t-1}^(k) delta_{t-1}^(k) + (lambda_t^(k)/4) ( d_t - [f_t^(k)(x_t)]^+ )^2 ) / ( Lambda_{t-1}^(k) + lambda_t^(k) );
14:     Lambda_t^(k) = Lambda_{t-1}^(k) + lambda_t^(k);
15:     l_t^(k+1) = l_t^(k) + [sigma_m^2 - (e_t^(k))^2];
16:   end for
17:   e_t = d_t - z_t^T y_t;
18:   z_{t+1} = z_t + mu_z e_t y_t / ||y_t||^2;
19: end for

6 Boosted SGD Algorithms

In this case, as shown in Fig. 1, we have m parallel running WLs, each of which is updated using the SGD algorithm. Based on the weights given in (6) and the total loss and MSE parameters in (4) and (7), we next introduce three SGD based boosting algorithms, similar to those introduced in Section 5.

6.1 Directly Using the lambda's to Scale the Learning Rates

We note that by the construction method in (6), 0 < lambda_t^(k) <= 1, thus these weights can be directly used to scale the learning rates for the SGD updates. When the k-th

WL receives the weight lambda_t^(k), it updates its coefficients w_t^(k) as

w_{t+1}^{(k)} = \left( I - \mu^{(k)} \lambda_t^{(k)} x_t x_t^T \right) w_t^{(k)} + \mu^{(k)} \lambda_t^{(k)} x_t d_t,   (12)

where 0 < mu^(k) lambda_t^(k) <= mu^(k). Note that we can choose mu^(k) = mu for all k, since the online algorithms work consecutively from top to bottom and the k-th WL will have a different effective learning rate mu^(k) lambda_t^(k).

6.2 A Data Reuse Approach Based on the Weights

In this scenario, for updating w_t^(k), we use the SGD update n_t^(k) = ceil(K lambda_t^(k)) times to obtain w_{t+1}^(k) as

q^{(0)} = w_t^{(k)},
q^{(a)} = \left( I - \mu^{(k)} x_t x_t^T \right) q^{(a-1)} + \mu^{(k)} x_t d_t, \quad a = 1, ..., n_t^{(k)},
w_{t+1}^{(k)} = q^{(n_t^{(k)})},   (13)

where K is a constant design parameter. Similar to the NM case, if we follow Ozaboost (Oza and Russell 2001), we use the weights to generate a random number n_t^(k) from a Poisson distribution with parameter lambda_t^(k), and perform the SGD update n_t^(k) times on w_t^(k) as explained above.

6.3 Random Updates Based on the Weights

Again, in this scenario, similar to the NM case, we use the weight lambda_t^(k) to generate a random number from a Bernoulli distribution, which equals 1 with probability lambda_t^(k) and 0 with probability 1 - lambda_t^(k). Then we update w_t^(k) using SGD only if the generated number is 1.
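The three SGD-based update modes above reduce to a few lines each. The following sketch assumes a raw coefficient vector per WL and plain NumPy; function names and the random generator handling are illustrative.

```python
import numpy as np
from math import ceil

def sgd_weighted_update(w, x, d, mu, lam):
    """Eq. (12): scale the SGD learning rate by the importance weight."""
    return w + mu * lam * x * (d - w @ x)

def sgd_data_reuse_update(w, x, d, mu, lam, K=5):
    """Eq. (13): repeat the plain SGD update ceil(K * lambda) times on the same pair."""
    for _ in range(int(ceil(K * lam))):
        w = w + mu * x * (d - w @ x)
    return w

def sgd_random_update(w, x, d, mu, lam, rng=None):
    """Section 6.3: perform one plain SGD update with probability lambda."""
    if rng is None:
        rng = np.random.default_rng()
    return w + mu * x * (d - w @ x) if rng.random() < lam else w
```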

7 Analysis of the Proposed Algorithms

In this section we provide the complexity analysis of the proposed algorithms. We prove an upper bound on the weights lambda_t^(k) which is significantly less than 1. This bound shows that the complexity of the random updates algorithm is significantly less than that of the other proposed algorithms, and only slightly greater than that of a single WL. Hence, it shows the considerable advantage of boosting with random updates in the processing of high dimensional data.

7.1 Complexity Analysis

Here we compare the complexity of the proposed algorithms and find an upper bound for the computational complexity of the random updates scenario (introduced in Section 5.3 for NM and in Section 6.3 for SGD updates), which shows its significantly lower computational burden with respect to the two other approaches. For x_t in R^r, each WL performs O(r) computations to generate its estimate and, if updated using the NM algorithm, requires O(r^2) computations due to updating the matrix R_t^(k), while it needs O(r) computations when updated using the SGD method (in their most basic implementations). We first derive the computational complexity of using the NM updates in the different boosting scenarios. Since there are a total of m WLs, all of which are updated in the weighted updates method, this method has a computational cost of order O(m r^2) per iteration. However, in the random updates, at iteration t the k-th WL may or may not be updated, with probabilities lambda_t^(k) and 1 - lambda_t^(k) respectively, yielding

C_t^{(k)} = \begin{cases} O(r^2) & \text{with probability } \lambda_t^{(k)} \\ O(r) & \text{with probability } 1 - \lambda_t^{(k)}, \end{cases}   (14)

where C_t^(k) indicates the complexity of running the k-th WL at iteration t. Therefore, the total computational complexity C_t at iteration t will be C_t = \sum_{k=1}^{m} C_t^(k), which yields

E[C_t] = E\left[ \sum_{k=1}^{m} C_t^{(k)} \right] = \sum_{k=1}^{m} E[\lambda_t^{(k)}] O(r^2).   (15)

Hence, if E[lambda_t^(k)] is upper bounded by \bar{\lambda}^(k) < 1, the average computational complexity of the random updates method will be

E[C_t] < \sum_{k=1}^{m} \bar{\lambda}^{(k)} O(r^2).   (16)

In Theorem 2, we provide sufficient constraints to have such an upper bound. Furthermore, we can use such a bound for the data reuse mode as well. In this case, for each WL f^(k), we perform the NM update ceil(lambda_t^(k) K) times, resulting in a computational complexity of order E[C_t] < K \sum_{k=1}^{m} \bar{\lambda}^{(k)} O(r^2). For the SGD updates, we similarly obtain the computational complexities O(mr), \sum_{k=1}^{m} O(\bar{\lambda}^{(k)} r), and \sum_{k=1}^{m} O(K \bar{\lambda}^{(k)} r) for the weighted updates, random updates, and data reuse scenarios, respectively. The following theorem determines the upper bound \bar{\lambda}^(k) for E[lambda_t^(k)].

Theorem 2. If the WLs converge and achieve a sufficiently small MSE (in the sense made precise in the proof following this theorem), the following upper bound is obtained for lambda_t^(k), given that sigma_m^2 is chosen properly,

E[\lambda_t^{(k)}] \le \bar{\lambda}^{(k)} = \left( \gamma^{-2\sigma_m^2} \left( 1 + 2\zeta^2 \ln\gamma \right) \right)^{\frac{1-k}{2}},   (17)

where gamma = E[delta_{t-1}^(k)] and zeta^2 = E[(e_t^(k))^2]. It can be straightforwardly shown that this bound is less than 1 for appropriate choices of sigma_m^2 and reasonable values of the MSE (according to the proof). This theorem states that if we adjust sigma_m^2 such that it is achievable, i.e., the WLs can provide a slightly lower MSE than sigma_m^2, the probability of updating the WLs in the random updates scenario will decrease. This is of course our desired result, since if the WLs are performing sufficiently well, there is no need for additional updates. Moreover, if sigma_m^2 is chosen such that the WLs cannot achieve an MSE equal to sigma_m^2, the WLs have to be updated at each iteration, which increases the complexity.

Proof: For simplicity, in this proof we assume that c = 1; however, the results are readily extended to general values of c. We construct our proof based on the following assumption.

Assumption: The e_t^(k) are independent and identically distributed (i.i.d.) zero-mean Gaussian random variables with variance zeta^2. We have

E[\lambda_t^{(k)}] = E\left[ \min\left\{ 1, (\delta_{t-1}^{(k)})^{l_t^{(k)}} \right\} \right] \le \min\left\{ 1, E\left[ (\delta_{t-1}^{(k)})^{l_t^{(k)}} \right] \right\}.   (18)

Now, we show that under certain conditions E[(delta_{t-1}^(k))^(l_t^(k))] will be less than 1, and hence we obtain an upper bound for E[lambda_t^(k)]. We define s = ln(delta_{t-1}^(k)), yielding

E\left[ (\delta_{t-1}^{(k)})^{l_t^{(k)}} \right] = E\left[ E\left[ \exp(s\, l_t^{(k)}) \mid s \right] \right] = E\left[ M_{l_t^{(k)}}(s) \right],   (19)

where M_{l_t^(k)}(.) is the moment generating function of the random variable l_t^(k). From Algorithm 2, l_t^(k) = (k-1)\sigma_m^2 - \sum_{j=1}^{k-1} (e_t^(j))^2. According to the Assumption, e_t^(j)/zeta is a standard normal random variable. Therefore, \sum_{j=1}^{k-1} (e_t^(j))^2 has a Gamma distribution Gamma((k-1)/2, 2 zeta^2) (Papoulis and Pillai 2002), which results in the following moment generating function for l_t^(k):

M_{l_t^{(k)}}(s) = \exp\left( s (k-1) \sigma_m^2 \right) \left( 1 + 2\zeta^2 s \right)^{-\frac{k-1}{2}} = \left( \delta_{t-1}^{(k)} \right)^{(k-1)\sigma_m^2} \left( 1 + 2\zeta^2 \ln \delta_{t-1}^{(k)} \right)^{-\frac{k-1}{2}}.   (20)

In the above equality delta_{t-1}^(k) is a random variable, the mean of which is denoted by gamma. We point out that gamma will approach zeta^2 in convergence. We define a function phi(.) such that E[lambda_t^(k)] = E[phi(delta_{t-1}^(k))], and seek a condition for phi(.) to be a concave function. Then, by using Jensen's inequality for concave functions, we have

E[\lambda_t^{(k)}] \le \phi(\gamma).   (21)

Inspired by (20), we define the auxiliary quantity A(delta_{t-1}^(k)) = 1 + 2 zeta^2 ln(delta_{t-1}^(k)), so that phi(delta_{t-1}^(k)) = (delta_{t-1}^(k))^((k-1) sigma_m^2) A(delta_{t-1}^(k))^((1-k)/2). With these definitions we obtain the first and second derivatives phi'(delta_{t-1}^(k)) and phi''(delta_{t-1}^(k)).   (22)

Considering that k > 1, in order for phi(.) to be concave it suffices to have phi''(delta_{t-1}^(k)) <= 0,   (23)

which reduces to the following necessary and sufficient conditions:

\left( \delta_{t-1}^{(k)} \right)^{2\sigma_m^2} \left( \zeta^2 \ln \delta_{t-1}^{(k)} \right)^2 < \frac{(\sigma_m^2)^2}{4(k+1)},   (24)

and

\frac{(1 - \xi_1)\,\sigma_m^2}{1 - 2\sigma_m^2 \ln \delta_{t-1}^{(k)}} < \zeta^2 < \frac{(1 - \xi_2)\,\sigma_m^2}{1 - 2\sigma_m^2 \ln \delta_{t-1}^{(k)}},   (25)

where xi_1 and xi_2 are given in terms of alpha = 1 + 2 zeta^2 ln(delta_{t-1}^(k)), sigma_m^2, k, and delta_{t-1}^(k). Under these conditions phi(.) is concave; therefore, by substituting phi(.) in (21) we obtain (17). This concludes the proof of Theorem 2.

8 Experiments

In this section, we demonstrate the efficacy of the proposed boosting algorithms for NM and SGD linear WLs under different scenarios. To this end, we first consider the online regression of data generated with a stationary linear model. Then, we illustrate the performance of our algorithms under nonstationary conditions, to thoroughly test the adaptation capabilities of the proposed boosting framework. Furthermore, since the most important parameters in the proposed methods are sigma_m^2, c, and m, we investigate their effects on the final MSE performance. Finally, we provide the results of the experiments over several real and synthetic benchmark data sets.

Throughout this section, SGD represents the linear SGD-based WL, NM represents the linear NM-based WL, and a prefix B indicates the boosted algorithms. In addition, we use the suffixes -WU, -RU, and -DR to denote the weighted updates, random updates, and data reuse modes, respectively; e.g., BSGD-RU represents the Boosted SGD-based algorithm using Random Updates. In order to observe the boosting effect, in all experiments we set the step size of the SGD and the forgetting factor of the NM to their optimal values, and use those parameters for the WLs, too. In addition, the initial values of all of the weak learners in all of the experiments are set to zero. However, in all experiments, since we use K = 5 in the BSGD-DR algorithm, we set the step size of the WLs in the BSGD-DR method to mu/K = mu/5, where mu is the step size of the SGD. To compare the MSE results, we provide the Accumulated Squared Error (ASE) results.

8.1 Stationary Data

In this experiment, we consider the case where the desired data is generated by a stationary linear model. The input vectors x_t = [x_{1,t} x_{2,t} 1]^T are 3-dimensional, where [x_{1,t} x_{2,t}] is drawn from a jointly Gaussian random process and then scaled such that [x_{1,t} x_{2,t}]^T lies in [0, 1]^2. We include 1 as the third entry of x_t to consider affine learners. Specifically, the desired data is generated by d_t = [1 1 1]^T x_t + nu_t, where nu_t is a random Gaussian noise. In our simulations, we use m = 20 WLs and mu = 0.1 for all SGD learners. In addition, for the NM-based boosting algorithms, we set the same forgetting factor beta for all algorithms. Moreover, we choose sigma_m^2 = 0.02 for the SGD-based algorithms, set sigma_m^2 separately for the NM-based algorithms, use K = 5 for the data reuse approaches, and c = 1 for all boosting algorithms. To achieve robustness, we average the results over 100 trials. As depicted in Fig. 2, our proposed methods boost the performance of a single linear SGD-based WL. Nevertheless, we cannot further improve the performance of a linear NM-based WL in such a stationary experiment, since the NM achieves the lowest MSE. We point out that the random updates method achieves the performance of the weighted updates method and the data reuse method with a much lower complexity. In addition, we observe that by increasing the data length, the performance improvement increases (note that the distance between the ASE curves is slightly increasing).

8.2 Chaotic Data

Here, in order to show the tracking capability of our algorithms in nonstationary environments, we consider the case where the desired data is generated by the Duffing map (Wiggins 2003) as a chaotic model. Specifically, the data is generated by the following equation: x_{t+1} = 2.75 x_t - x_t^3 - 0.2 x_{t-1}, where we set x_{-1} and x_0 to fixed initial values. We consider d_t = x_{t+1} as the desired data and [x_{t-1} x_t 1]^T as the input vector. In this experiment, each boosting algorithm uses 20 WLs. The step sizes for the SGD-based algorithms are set to 0.1, the forgetting factor beta for the NM-based algorithms is set to 0.999, and the modified desired MSE parameter sigma_m^2 is set to 0.25 for the BSGD methods and 0.17 for the BNM methods.
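The Duffing-map data used in this experiment can be generated with a few lines of NumPy. This is a sketch under the stated recursion; the initial values x_{-1} and x_0 are passed in as free parameters since the paper fixes them to specific constants not repeated here, and the function name is illustrative.

```python
import numpy as np

def duffing_series(T, x_prev, x_curr):
    """Generate T (input, desired) pairs from the Duffing map
    x_{t+1} = 2.75 x_t - x_t**3 - 0.2 x_{t-1} (Section 8.2)."""
    xs = [x_prev, x_curr]
    for _ in range(T):
        xs.append(2.75 * xs[-1] - xs[-1] ** 3 - 0.2 * xs[-2])
    x = np.array(xs)
    # input vector [x_{t-1}, x_t, 1] and desired output d_t = x_{t+1}
    X = np.stack([x[:-2], x[1:-1], np.ones(T)], axis=1)
    d = x[2:]
    return X, d
```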

Fig. 2: The ASE performance of the proposed algorithms in the stationary data experiment. (Plot: Accumulated Squared Error versus Data Length (T) for SGD, NM, BNM-WU, BNM-DR, BNM-RU, BSGD-WU, BSGD-DR, and BSGD-RU.)

Note that although the value of sigma_m^2 is higher than the achieved MSE, it can improve the performance significantly. This is because of the boosting effect, i.e., emphasizing the harder data patterns. The figures show the superior performance of our algorithms over a single WL (whose step size is chosen to be the best) in this highly nonstationary environment. Moreover, as shown in Fig. 3, among the SGD-based boosted algorithms, the data reuse method shows a better performance relative to the other boosting methods. However, the random updates method has a significantly lower time consumption, which makes it desirable for larger data lengths. From Fig. 3, one can see that our method is truly boosting the performance of the conventional linear WLs in this chaotic environment. From Fig. 4, we observe the approximate changes of the weights in the BSGD-RU algorithm running over the Duffing data. As shown in this figure, the weights do not change monotonically, which shows the capability of our algorithm in effectively tracking the nonstationary data. Furthermore, since we update the WLs in an ordered manner, i.e., we update the (k+1)-th WL after the k-th WL is updated, the weights assigned to the last WLs are generally smaller than the weights assigned to the previous WLs. As an example, in Fig. 4 we see that the weights assigned to the 5th WL are larger than those of the 10th and 20th WLs. Furthermore, note that in this experiment the dependency parameter c is set to 1. We should mention that increasing the value of this parameter, in general, causes lower weights, hence it can considerably reduce the complexity of the random updates and data reuse methods.

8.3 The Effect of Parameters

In this section, we investigate the effects of the dependence parameter c and the modified desired MSE sigma_m^2, as well as the number of WLs, m, on the boosting performance of our methods in the Duffing data experiment explained in Section 8.2. From the results in Fig. 5c, we observe that increasing the number of WLs up to 30 can improve the performance significantly, while further increasing m only increases the computational complexity without improving the performance.

Fig. 3: The ASE performance of the proposed methods on the Duffing data set. (Plot: Accumulated Squared Error versus Data Length (T) for NM, SGD, BNM-RU, BNM-DR, BNM-WU, BSGD-WU, BSGD-RU, and BSGD-DR.)

Fig. 4: The changes of the weights lambda in the BSGD-RU algorithm in the Duffing data experiment. (Plot: lambda versus Data Length (T) for the 5th, 10th, and 20th WLs.)

In addition, as shown in Fig. 5b, in this experiment the dependency parameter c has an optimum value around 1. We note that choosing small values for c reduces the boosting effect and causes the weights to be larger, which in turn increases the computational complexity of the random updates and data reuse approaches. On the other hand, choosing very large values for c increases the dependency, i.e., in this case the generated weights are very close to 1 or 0, hence the boosting effect is decreased. Overall, one should choose values around 1 for c to avoid these extreme cases. Furthermore, as depicted in Fig. 5a, there is an optimum value around 0.5 for sigma_m^2 in this experiment. Note that choosing small values for sigma_m^2 results in large weights, which increases the complexity and reduces the diversity. However, choosing higher values for sigma_m^2 results in smaller weights, and in turn reduces the complexity. Nevertheless, we note that increasing the value of sigma_m^2 does not necessarily enhance the performance. Through the experiments, we find that sigma_m^2 must be of the order of the MSE to obtain the best performance.

Fig. 5: The effect of the parameters sigma_m^2, c, and m on the MSE performance of the BNM-RU and BSGD-RU algorithms in the Duffing data experiment. (a) The effect of the parameter sigma_m^2; (b) the effect of the parameter c; (c) the effect of the parameter m (the number of WLs). (y-axis: Mean Squared Error.)

8.4 Benchmark Real and Synthetic Data Sets

In this section, we demonstrate the efficiency of the introduced methods over several widely used real life machine learning regression data sets. We have normalized each dimension of the data to the interval [-1, 1] in all algorithms. We present the MSE performance of the algorithms in Table 1. These experiments show that our algorithms can successfully improve the performance of single linear WLs.

Table 1: The MSE of the proposed algorithms on real data sets. (Columns: SGD, BSGD-WU, BSGD-DR, BSGD-RU, NM, BNM-WU, BNM-DR, BNM-RU; rows: MV, Puma8NH, Kinematics, Compactiv, Protein Tertiary, ONP, California Housing, YPMSD.)

We now describe the experiments and provide the results. Here, we briefly explain the details of the data sets:

1. MV: This is an artificial dataset with dependencies between the attribute values. One can refer to (Torgo) for further details. There are 10 attributes and one target value. In this dataset, we can slightly improve the performance of a single linear WL by using any of the proposed methods.
2. Puma Dynamics (Puma8NH): This dataset is a realistic simulation of the dynamics of a Puma 560 robot arm (Torgo). The task is to predict the angular acceleration of one of the robot arm's links. The inputs include angular positions, velocities, and torques of the robot arm. According to the ASE results in Fig. 6a, BNM-WU has the best boosting performance in this experiment. Nonetheless, the SGD-based methods also improve the performance.
3. Kinematics: This dataset is concerned with the forward kinematics of an 8 link robot arm (Torgo). We use the variant 8nm, which is highly non-linear and noisy. As shown in Fig. 6b, our proposed algorithms slightly improve the performance in this experiment.
4. Computer Activity (Compactiv): This real dataset is a collection of computer systems activity measures (Torgo). The task is to predict USR, the portion of time that CPUs run in user mode, from all attributes (Torgo). The NM-based boosting algorithms deliver a significant performance improvement in this experiment, as shown by the results in Table 1.
5. Protein Tertiary (Lichman 2013): This dataset is collected from Critical Assessment of protein Structure Prediction (CASP) experiments 5-9. The aim is to predict the size of the residue using 9 attributes over the data instances.
6. Online News Popularity (ONP) (Lichman 2013; Pereira et al. 2015): This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).
7. California Housing: This dataset has been obtained from the StatLib repository. Information on the variables was collected using all the block groups in California from the 1990 Census. Here, we seek to find the house median values, based on the given attributes. For further description one can refer to (Torgo).
8. Year Prediction Million Song Dataset (YPMSD) (Bertin-Mahieux et al. 2011): The aim is to predict the release year of a song from its audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s. We use a subset of the Million Song Dataset (Bertin-Mahieux


Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time.

Nature Neuroscience: doi: /nn Supplementary Figure 1. Spike-count autocorrelations in time. Supplemenary Figure 1 Spike-coun auocorrelaions in ime. Normalized auocorrelaion marices are shown for each area in a daase. The marix shows he mean correlaion of he spike coun in each ime bin wih he spike

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

Lecture Notes 2. The Hilbert Space Approach to Time Series

Lecture Notes 2. The Hilbert Space Approach to Time Series Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information

Notes on online convex optimization

Notes on online convex optimization Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec

More information

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD

PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD PENALIZED LEAST SQUARES AND PENALIZED LIKELIHOOD HAN XIAO 1. Penalized Leas Squares Lasso solves he following opimizaion problem, ˆβ lasso = arg max β R p+1 1 N y i β 0 N x ij β j β j (1.1) for some 0.

More information

Navneet Saini, Mayank Goyal, Vishal Bansal (2013); Term Project AML310; Indian Institute of Technology Delhi

Navneet Saini, Mayank Goyal, Vishal Bansal (2013); Term Project AML310; Indian Institute of Technology Delhi Creep in Viscoelasic Subsances Numerical mehods o calculae he coefficiens of he Prony equaion using creep es daa and Herediary Inegrals Mehod Navnee Saini, Mayank Goyal, Vishal Bansal (23); Term Projec

More information

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB

T L. t=1. Proof of Lemma 1. Using the marginal cost accounting in Equation(4) and standard arguments. t )+Π RB. t )+K 1(Q RB Elecronic Companion EC.1. Proofs of Technical Lemmas and Theorems LEMMA 1. Le C(RB) be he oal cos incurred by he RB policy. Then we have, T L E[C(RB)] 3 E[Z RB ]. (EC.1) Proof of Lemma 1. Using he marginal

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

Boosting with Online Binary Learners for the Multiclass Bandit Problem

Boosting with Online Binary Learners for the Multiclass Bandit Problem Shang-Tse Chen School of Compuer Science, Georgia Insiue of Technology, Alana, GA Hsuan-Tien Lin Deparmen of Compuer Science and Informaion Engineering Naional Taiwan Universiy, Taipei, Taiwan Chi-Jen

More information

GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

DEPARTMENT OF STATISTICS

DEPARTMENT OF STATISTICS A Tes for Mulivariae ARCH Effecs R. Sco Hacker and Abdulnasser Haemi-J 004: DEPARTMENT OF STATISTICS S-0 07 LUND SWEDEN A Tes for Mulivariae ARCH Effecs R. Sco Hacker Jönköping Inernaional Business School

More information

10. State Space Methods

10. State Space Methods . Sae Space Mehods. Inroducion Sae space modelling was briefly inroduced in chaper. Here more coverage is provided of sae space mehods before some of heir uses in conrol sysem design are covered in he

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits

Exponential Weighted Moving Average (EWMA) Chart Under The Assumption of Moderateness And Its 3 Control Limits DOI: 0.545/mjis.07.5009 Exponenial Weighed Moving Average (EWMA) Char Under The Assumpion of Moderaeness And Is 3 Conrol Limis KALPESH S TAILOR Assisan Professor, Deparmen of Saisics, M. K. Bhavnagar Universiy,

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles

Diebold, Chapter 7. Francis X. Diebold, Elements of Forecasting, 4th Edition (Mason, Ohio: Cengage Learning, 2006). Chapter 7. Characterizing Cycles Diebold, Chaper 7 Francis X. Diebold, Elemens of Forecasing, 4h Ediion (Mason, Ohio: Cengage Learning, 006). Chaper 7. Characerizing Cycles Afer compleing his reading you should be able o: Define covariance

More information

Appendix to Creating Work Breaks From Available Idleness

Appendix to Creating Work Breaks From Available Idleness Appendix o Creaing Work Breaks From Available Idleness Xu Sun and Ward Whi Deparmen of Indusrial Engineering and Operaions Research, Columbia Universiy, New York, NY, 127; {xs2235,ww24}@columbia.edu Sepember

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Lecture 20: Riccati Equations and Least Squares Feedback Control

Lecture 20: Riccati Equations and Least Squares Feedback Control 34-5 LINEAR SYSTEMS Lecure : Riccai Equaions and Leas Squares Feedback Conrol 5.6.4 Sae Feedback via Riccai Equaions A recursive approach in generaing he marix-valued funcion W ( ) equaion for i for he

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LDA, logisic

More information

OBJECTIVES OF TIME SERIES ANALYSIS

OBJECTIVES OF TIME SERIES ANALYSIS OBJECTIVES OF TIME SERIES ANALYSIS Undersanding he dynamic or imedependen srucure of he observaions of a single series (univariae analysis) Forecasing of fuure observaions Asceraining he leading, lagging

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LTU, decision

More information

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems.

di Bernardo, M. (1995). A purely adaptive controller to synchronize and control chaotic systems. di ernardo, M. (995). A purely adapive conroller o synchronize and conrol chaoic sysems. hps://doi.org/.6/375-96(96)8-x Early version, also known as pre-prin Link o published version (if available):.6/375-96(96)8-x

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

Christos Papadimitriou & Luca Trevisan November 22, 2016

Christos Papadimitriou & Luca Trevisan November 22, 2016 U.C. Bereley CS170: Algorihms Handou LN-11-22 Chrisos Papadimiriou & Luca Trevisan November 22, 2016 Sreaming algorihms In his lecure and he nex one we sudy memory-efficien algorihms ha process a sream

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

15. Vector Valued Functions

15. Vector Valued Functions 1. Vecor Valued Funcions Up o his poin, we have presened vecors wih consan componens, for example, 1, and,,4. However, we can allow he componens of a vecor o be funcions of a common variable. For example,

More information

Mean Square Projection Error Gradient-based Variable Forgetting Factor FAPI

Mean Square Projection Error Gradient-based Variable Forgetting Factor FAPI 3rd Inernaional Conference on Advances in Elecrical and Elecronics Engineering (ICAEE'4) Feb. -, 4 Singapore Mean Square Projecion Error Gradien-based Variable Forgeing Facor FAPI Young-Kwang Seo, Jong-Woo

More information

Sensors, Signals and Noise

Sensors, Signals and Noise Sensors, Signals and Noise COURSE OUTLINE Inroducion Signals and Noise: 1) Descripion Filering Sensors and associaed elecronics rv 2017/02/08 1 Noise Descripion Noise Waveforms and Samples Saisics of Noise

More information

Notes for Lecture 17-18

Notes for Lecture 17-18 U.C. Berkeley CS278: Compuaional Complexiy Handou N7-8 Professor Luca Trevisan April 3-8, 2008 Noes for Lecure 7-8 In hese wo lecures we prove he firs half of he PCP Theorem, he Amplificaion Lemma, up

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Failure of the work-hamiltonian connection for free energy calculations. Abstract

Failure of the work-hamiltonian connection for free energy calculations. Abstract Failure of he work-hamilonian connecion for free energy calculaions Jose M. G. Vilar 1 and J. Miguel Rubi 1 Compuaional Biology Program, Memorial Sloan-Keering Cancer Cener, 175 York Avenue, New York,

More information

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes

23.2. Representing Periodic Functions by Fourier Series. Introduction. Prerequisites. Learning Outcomes Represening Periodic Funcions by Fourier Series 3. Inroducion In his Secion we show how a periodic funcion can be expressed as a series of sines and cosines. We begin by obaining some sandard inegrals

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

Appendix to Online l 1 -Dictionary Learning with Application to Novel Document Detection

Appendix to Online l 1 -Dictionary Learning with Application to Novel Document Detection Appendix o Online l -Dicionary Learning wih Applicaion o Novel Documen Deecion Shiva Prasad Kasiviswanahan Huahua Wang Arindam Banerjee Prem Melville A Background abou ADMM In his secion, we give a brief

More information

4.1 Other Interpretations of Ridge Regression

4.1 Other Interpretations of Ridge Regression CHAPTER 4 FURTHER RIDGE THEORY 4. Oher Inerpreaions of Ridge Regression In his secion we will presen hree inerpreaions for he use of ridge regression. The firs one is analogous o Hoerl and Kennard reasoning

More information

Lab #2: Kinematics in 1-Dimension

Lab #2: Kinematics in 1-Dimension Reading Assignmen: Chaper 2, Secions 2-1 hrough 2-8 Lab #2: Kinemaics in 1-Dimension Inroducion: The sudy of moion is broken ino wo main areas of sudy kinemaics and dynamics. Kinemaics is he descripion

More information

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks -

Deep Learning: Theory, Techniques & Applications - Recurrent Neural Networks - Deep Learning: Theory, Techniques & Applicaions - Recurren Neural Neworks - Prof. Maeo Maeucci maeo.maeucci@polimi.i Deparmen of Elecronics, Informaion and Bioengineering Arificial Inelligence and Roboics

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

STRUCTURAL CHANGE IN TIME SERIES OF THE EXCHANGE RATES BETWEEN YEN-DOLLAR AND YEN-EURO IN

STRUCTURAL CHANGE IN TIME SERIES OF THE EXCHANGE RATES BETWEEN YEN-DOLLAR AND YEN-EURO IN Inernaional Journal of Applied Economerics and Quaniaive Sudies. Vol.1-3(004) STRUCTURAL CHANGE IN TIME SERIES OF THE EXCHANGE RATES BETWEEN YEN-DOLLAR AND YEN-EURO IN 001-004 OBARA, Takashi * Absrac The

More information

An recursive analytical technique to estimate time dependent physical parameters in the presence of noise processes

An recursive analytical technique to estimate time dependent physical parameters in the presence of noise processes WHAT IS A KALMAN FILTER An recursive analyical echnique o esimae ime dependen physical parameers in he presence of noise processes Example of a ime and frequency applicaion: Offse beween wo clocks PREDICTORS,

More information

FITTING OF A PARTIALLY REPARAMETERIZED GOMPERTZ MODEL TO BROILER DATA

FITTING OF A PARTIALLY REPARAMETERIZED GOMPERTZ MODEL TO BROILER DATA FITTING OF A PARTIALLY REPARAMETERIZED GOMPERTZ MODEL TO BROILER DATA N. Okendro Singh Associae Professor (Ag. Sa.), College of Agriculure, Cenral Agriculural Universiy, Iroisemba 795 004, Imphal, Manipur

More information

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II

Zürich. ETH Master Course: L Autonomous Mobile Robots Localization II Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

1. VELOCITY AND ACCELERATION

1. VELOCITY AND ACCELERATION 1. VELOCITY AND ACCELERATION 1.1 Kinemaics Equaions s = u + 1 a and s = v 1 a s = 1 (u + v) v = u + as 1. Displacemen-Time Graph Gradien = speed 1.3 Velociy-Time Graph Gradien = acceleraion Area under

More information

Testing for a Single Factor Model in the Multivariate State Space Framework

Testing for a Single Factor Model in the Multivariate State Space Framework esing for a Single Facor Model in he Mulivariae Sae Space Framework Chen C.-Y. M. Chiba and M. Kobayashi Inernaional Graduae School of Social Sciences Yokohama Naional Universiy Japan Faculy of Economics

More information

Dimitri Solomatine. D.P. Solomatine. Data-driven modelling (part 2). 2

Dimitri Solomatine. D.P. Solomatine. Data-driven modelling (part 2). 2 Daa-driven modelling. Par. Daa-driven Arificial di Neural modelling. Newors Par Dimiri Solomaine Arificial neural newors D.P. Solomaine. Daa-driven modelling par. 1 Arificial neural newors ANN: main pes

More information

An Online Minimax Optimal Algorithm for Adversarial Multi-Armed Bandit Problem

An Online Minimax Optimal Algorithm for Adversarial Multi-Armed Bandit Problem 1 An Online Minimax Opimal Algorihm for Adversarial Muli-Armed Bandi Problem Kaan Gokcesu and Suleyman S. Koza, Senior Member, IEEE Absrac We invesigae he adversarial muli-armed bandi problem and inroduce

More information

A DELAY-DEPENDENT STABILITY CRITERIA FOR T-S FUZZY SYSTEM WITH TIME-DELAYS

A DELAY-DEPENDENT STABILITY CRITERIA FOR T-S FUZZY SYSTEM WITH TIME-DELAYS A DELAY-DEPENDENT STABILITY CRITERIA FOR T-S FUZZY SYSTEM WITH TIME-DELAYS Xinping Guan ;1 Fenglei Li Cailian Chen Insiue of Elecrical Engineering, Yanshan Universiy, Qinhuangdao, 066004, China. Deparmen

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

On-line Adaptive Optimal Timing Control of Switched Systems

On-line Adaptive Optimal Timing Control of Switched Systems On-line Adapive Opimal Timing Conrol of Swiched Sysems X.C. Ding, Y. Wardi and M. Egersed Absrac In his paper we consider he problem of opimizing over he swiching imes for a muli-modal dynamic sysem when

More information

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 RL Lecure 7: Eligibiliy Traces R. S. Suon and A. G. Baro: Reinforcemen Learning: An Inroducion 1 N-sep TD Predicion Idea: Look farher ino he fuure when you do TD backup (1, 2, 3,, n seps) R. S. Suon and

More information

Object tracking: Using HMMs to estimate the geographical location of fish

Object tracking: Using HMMs to estimate the geographical location of fish Objec racking: Using HMMs o esimae he geographical locaion of fish 02433 - Hidden Markov Models Marin Wæver Pedersen, Henrik Madsen Course week 13 MWP, compiled June 8, 2011 Objecive: Locae fish from agging

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

Tom Heskes and Onno Zoeter. Presented by Mark Buller

Tom Heskes and Onno Zoeter. Presented by Mark Buller Tom Heskes and Onno Zoeer Presened by Mark Buller Dynamic Bayesian Neworks Direced graphical models of sochasic processes Represen hidden and observed variables wih differen dependencies Generalize Hidden

More information