arXiv: v2 [cs.LG] 20 May 2010

Online Learning of Noisy Data with Kernels

Nicolò Cesa-Bianchi, Università degli Studi di Milano
Shai Shalev-Shwartz, The Hebrew University
Ohad Shamir, The Hebrew University

Abstract

We study online learning when individual instances are corrupted by adversarially chosen random noise. We assume the noise distribution is unknown, and may change over time with no restriction other than having zero mean and bounded variance. Our technique relies on a family of unbiased estimators for non-linear functions, which may be of independent interest. We show that a variant of online gradient descent can learn functions in any dot-product (e.g., polynomial) or Gaussian kernel space with any analytic convex loss function. Our variant uses randomized estimates that need to query a random number of noisy copies of each instance, where with high probability this number is upper bounded by a constant. Allowing such multiple queries cannot be avoided: indeed, we show that online learning is in general impossible when only one noisy copy of each instance can be accessed.

1 Introduction

In many machine learning applications training data are typically collected by measuring certain physical quantities. Examples include bioinformatics, medical tests, robotics, and remote sensing. These measurements have errors that may be due to several reasons: sensor costs, communication constraints, or intrinsic physical limitations. In all such cases, the learner trains on a distorted version of the actual "target" data, which is where the learner's predictive ability is eventually evaluated. In this work we investigate the extent to which a learning algorithm can achieve a good predictive performance when training data are corrupted by noise with unknown distribution. We prove upper and lower bounds on the learner's cumulative loss in the framework of online learning, where examples are generated by an arbitrary and possibly adversarial source. We model the measurement error via a random perturbation which affects each instance observed by the learner. We do not assume any specific property of the noise distribution other than zero mean and bounded variance. Moreover, we allow the noise
distribution to change at every step in an adversarial way and fully hidden from the learner. Our positive results are quite general: by using a randomized unbiased estimate for the loss gradient and a randomized feature mapping to estimate kernel values, we show that a variant of online gradient descent can learn functions in any dot-product (e.g., polynomial) or Gaussian RKHS under any given analytic convex loss function. Our techniques are readily extendable to other kernel types as well.

In order to obtain unbiased estimates of loss gradients and kernel values, we allow the learner to query a random number of independently perturbed copies of the current unseen instance. We show how low-variance estimates can be computed using a number of queries that is constant with high probability. This is in sharp contrast with standard averaging techniques, which attempt to directly estimate the noisy instance, as these require a sample whose size depends on the scale of the problem. Finally, we formally show that learning is impossible, even without kernels, when only one perturbed copy of each instance can be accessed. This is true for essentially any reasonable loss function.

Our paper is organized as follows. In the next subsection we discuss related work. In Sec. 2 we introduce our setting and justify some of our choices. In Sec. 4 we present our main results, but before that, in Sec. 3, we discuss the techniques used to obtain them. In the same section, we also explain why existing techniques are insufficient to deal with our problem. The detailed proofs and subroutine implementations appear in Sec. 5, with some of the more technical lemmas and proofs
relegated to the appendix. We wrap up with a discussion on possible avenues for future work in Sec. 6.

1.1 Related Work

In the machine learning literature, the problem of learning from noisy examples, and, in particular, from noisy training instances, has traditionally received a lot of attention (see, for example, the recent survey [...]). On the other hand, there are comparably few theoretically-principled studies on this topic. Two of them focus on models quite different from the one studied here: random attribute noise in PAC boolean learning [3, 8], and malicious noise [9, 5]. In the first case, learning is restricted to classes of boolean functions and the noise must be independent across each boolean coordinate. In the second case, an adversary is allowed to perturb a small fraction of the training examples in an arbitrary way, making learning impossible in a strong informational sense unless this perturbed fraction is very small (of the order of the desired accuracy for the predictor).

The previous work perhaps closest to the one presented here is [10], where binary classification mistake bounds are proven for the online Winnow algorithm in the presence of attribute errors. Similarly to our setting, the sequence of instances observed by the learner is chosen by an adversary. However, in [10] the noise is generated by an adversary, who may change the value of each attribute in an arbitrary way. The final mistake bound, which only applies when the noiseless data sequence is linearly separable without kernels, depends on the sum of all adversarial perturbations.

2 Setting

We consider a setting where the goal is to predict values $y \in \mathbb{R}$ based on instances $x \in \mathbb{R}^d$. In this paper we focus on kernel-based linear predictors of the form $x \mapsto \langle w, \Psi(x)\rangle$, where $\Psi$ is a feature mapping into some reproducing kernel Hilbert space (RKHS). We assume there exists a kernel function that efficiently implements dot products in that space, i.e., $k(x, x') = \langle \Psi(x), \Psi(x')\rangle$. Note that a special case of this setting is linear kernels, where $\Psi$ is the identity mapping and $k(x, x') = \langle x, x'\rangle$.

The standard online learning protocol for linear prediction with kernels is defined as follows: at each round $t$, the learner picks a linear hypothesis $w_t$
from the RKHS. The adversary then picks an example $(x_t, y_t)$ and reveals it to the learner. The loss suffered by the learner is $\ell(\langle w_t, \Psi(x_t)\rangle, y_t)$, where $\ell$ is a known and fixed loss function. The goal of the learner is to minimize regret with respect to a fixed convex set of hypotheses $W$, namely

\[ \sum_{t=1}^{T} \ell(\langle w_t, \Psi(x_t)\rangle, y_t) - \min_{w \in W} \sum_{t=1}^{T} \ell(\langle w, \Psi(x_t)\rangle, y_t). \]

Typically, we wish to find a strategy for the learner such that, no matter what the adversary's strategy of choosing the sequence of examples is, the expression above is sub-linear in $T$.

We now make the following twist, which limits the information available to the learner: instead of receiving $(x_t, y_t)$, the learner observes $y_t$ and is given access to an oracle $A_t$. On each call, $A_t$ returns an independent copy of $x_t + Z_t$, where $Z_t$ is a zero-mean random vector with some known finite bound on its variance (in the sense that $E[\|Z_t\|^2] \le a$ for some uniform constant $a$). In general, the distribution of $Z_t$ is unknown to the learner. It might be chosen by the adversary, and change from round to round or even between consecutive calls to $A_t$. Note that here we assume that $y_t$ remains unperturbed, but we emphasize that this is just for simplicity: our techniques can be readily extended to deal with noisy values as well.

The learner may call $A_t$ more than once. In fact, as we discuss later on, being able to call $A_t$ more than once is necessary for the learner to have any hope to succeed. On the other hand, if the learner calls $A_t$ an unlimited number of times, it can reconstruct $x_t$ arbitrarily well by averaging, and we are back to the standard learning setting. In this paper we focus on learning algorithms that call $A_t$ only a small, essentially constant number of times, which depends only on our choice of loss function and kernel (rather than $T$, the norm of $x_t$, or the variance of $Z_t$, which will happen with naïve averaging techniques). Moreover, since the number of queries is bounded with very high probability, one can even produce an algorithm with an absolute bound on the number of queries, which will fail or introduce some bias with an arbitrarily small probability. For simplicity, we ignore these issues in this paper.

In this
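setting, we wish to minimize the regret in hindsight with respect to the unperturbed data and averaged over the noise introduced by the oracle, namely

\[ E\left[ \sum_{t=1}^{T} \ell(\langle w_t, \Psi(x_t)\rangle, y_t) - \min_{w \in W} \sum_{t=1}^{T} \ell(\langle w, \Psi(x_t)\rangle, y_t) \right] \tag{1} \]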
where the random quantities are the predictors $w_1, w_2, \ldots$ generated by the learner, which depend on the observed noisy instances (in the appendix, we briefly discuss alternative regret measures, and why they are unsatisfactory). This kind of regret is relevant where we actually wish to learn from data, without the noise causing a hindrance. In particular, consider the batch setting, where the examples $\{(x_t, y_t)\}_{t=1}^{T}$ are actually sampled i.i.d. from some unknown distribution, and we wish to find a predictor which minimizes the expected loss $E[\ell(\langle w, x\rangle, y)]$ with respect to new examples $(x, y)$. Using standard online-to-batch conversion techniques, if we can find an online algorithm with a sublinear bound on Eq. (1), then it is possible to construct learning algorithms for the batch setting which are robust to noise. That is, algorithms generating a predictor $w$ with close to minimal expected loss $E[\ell(\langle w, x\rangle, y)]$ among all $w \in W$.

While our techniques are quite general, the exact algorithmic and theoretical results depend a lot on which loss function and kernel is used. Discussing the loss function first, we will assume that $\ell(\langle w, \Psi(x)\rangle, y)$ is a convex function of $w$ for each example $(x, y)$. Somewhat abusing notation, we assume the loss can be written either as $\ell(\langle w, \Psi(x)\rangle, y) = f(y\langle w, \Psi(x)\rangle)$ or as $\ell(\langle w, \Psi(x)\rangle, y) = f(\langle w, \Psi(x)\rangle - y)$ for some function $f$. We refer to the first type as classification losses, as it encompasses most reasonable losses for classification, where $y \in \{-1, +1\}$ and the goal is to predict the label. We refer to the second type as regression losses, as it encompasses most reasonable regression losses, where $y$ takes arbitrary real values. For simplicity, we present some of our results in terms of classification losses, but they all hold for regression losses as well with slight modifications.

We present our results under the assumption that the loss function is "smooth", in the sense that $\ell'(a)$ can be written as $\sum_{n=0}^{\infty} \gamma_n a^n$ for any $a$ in its domain. This assumption holds for instance for the squared loss $\ell(a) = a^2$, the exponential loss $\ell(a) = \exp(-a)$, and smoothed versions of loss functions such as the hinge loss and the absolute loss (we discuss examples in more detail in Subsection 4.2). This
assumption can be relaxed under certain conditions, and this is further discussed in Subsection 3.2.

Turning to the issue of kernels, we note that the general presentation of our approach is somewhat hampered by the fact that it needs to be tailored to the kernel we use. In this paper, we focus on two families of kernels:

Dot Product Kernels: the kernel $k(x, x')$ can be written as a function of $\langle x, x'\rangle$. Examples of such kernels $k(x, x')$ are linear kernels $\langle x, x'\rangle$; homogeneous polynomial kernels $(\langle x, x'\rangle)^n$; inhomogeneous polynomial kernels $(1 + \langle x, x'\rangle)^n$; exponential kernels $e^{\langle x, x'\rangle}$; binomial kernels $(1 + \langle x, x'\rangle)^{-\alpha}$, and more (see for instance [4, 6]).

Gaussian Kernels: $k(x, x') = e^{-\|x - x'\|^2/\sigma^2}$ for some $\sigma^2 > 0$.

Again, we emphasize that our techniques are extendable to other kernel types as well.

3 Techniques

Our results are based on two key ideas: the use of online gradient descent algorithms, and construction of unbiased gradient estimators in the kernel setting. The latter is based on a general method to build unbiased estimators for non-linear functions, which may be of independent interest.

3.1 Online Gradient Descent

There exist well developed theory and algorithms for dealing with the standard online learning setting, where the example $(x_t, y_t)$ is revealed after each round, and for general convex loss functions. One of the simplest and most well known ones is the online gradient descent algorithm due to Zinkevich [7]. Since this algorithm forms a basis for our algorithm in the new setting, we briefly review it below (as adapted to our setting).

The algorithm initializes the classifier $w_1 = 0$. At round $t$, the algorithm predicts according to $w_t$, and updates the learning rule according to $w_{t+1} = P(w_t - \eta_t \nabla_t)$, where $\eta_t$ is a suitably chosen constant which might depend on $t$; $\nabla_t = \ell'(y_t\langle w_t, \Psi(x_t)\rangle)\, y_t \Psi(x_t)$ is the gradient of $\ell(y_t\langle w, \Psi(x_t)\rangle)$ with respect to $w$ at $w_t$; and $P$ is a projection operator on the convex set $W$, on whose elements we wish to achieve low regret. In particular, if we wish to compete with hypotheses of bounded squared norm $B_w$, $P$ simply involves rescaling the norm of the predictor so as to have squared norm at most $B_w$. With this algorithm, one can prove regret bounds with respect to any $w \in W$.

A
folklore result about this algorithm is that in fact, we do not need to update the predictor by the gradient at each step. Instead, it is enough to update by some random vector of bounded variance, which merely equals the gradient in expectation. This is a useful property in settings where $(x_t, y_t)$ is not revealed to the learner, and has been used before, such as in the online bandit setting (see for instance [6, 7, ...]). Here, we will use this property in a new way, in order to devise
algorithms which are robust to noise. When the kernel and loss function are linear (e.g., $\Psi(x) = x$ and $\ell(a) = ca + b$ for some constants $b, c$), this property already ensures that the algorithm is robust to noise without any further changes. This is because the noise injected into each $x_t$ merely causes the exact gradient estimate to change to a random vector which is correct in expectation: if we assume $\ell$ is a classification loss, and write $\tilde{x}_t$ for a noisy copy of $x_t$ returned by the oracle, then

\[ E\left[\ell'(y_t\langle w_t, \Psi(\tilde{x}_t)\rangle)\, y_t \Psi(\tilde{x}_t)\right] = E[c\, y_t \tilde{x}_t] = c\, y_t x_t. \]

On the other hand, when we use nonlinear kernels and nonlinear loss functions, using standard online gradient descent leads to systematic and unknown biases (since the noise distribution is unknown), which prevent the method from working properly. To deal with this problem, we now turn to describe a technique for estimating expressions such as $\ell'(y_t\langle w_t, \Psi(x_t)\rangle)$ in an unbiased manner. In Subsection 3.3, we discuss how $\Psi(x_t)$ can be estimated in an unbiased manner.

3.2 Unbiased Estimators for Non-Linear Functions

Suppose that we are given access to independent copies of a real random variable $X$, with expectation $E[X]$, and some real function $f$, and we wish to construct an unbiased estimate of $f(E[X])$. If $f$ is a linear function, then this is easy: just sample $x$ from $X$, and return $f(x)$. By linearity, $E[f(X)] = f(E[X])$ and we are done. The problem becomes less trivial when $f$ is a general, nonlinear function, since usually $E[f(X)] \ne f(E[X])$. In fact, when $X$ takes finitely many values and $f$ is not a polynomial function, one can prove that no unbiased estimator can exist (see [3], Proposition 8 and its proof). Nevertheless, we show how in many cases one can construct an unbiased estimator of $f(E[X])$, including cases covered by the impossibility result. There is no contradiction, because we do not construct a "standard" estimator. Usually, an estimator is a function from a given sample to the range of the parameter we wish to estimate. An implicit assumption is that the size of the sample given to it is fixed, and this is also a crucial ingredient in the impossibility result. We circumvent this by constructing an estimator based on a random number of samples.

Here is the key idea: suppose $f : \mathbb{R} \to \mathbb{R}$ is any function continuous on a bounded interval. It is well known
that one can construct a sequence of polynomials $(Q_n(\cdot))_{n=1}^{\infty}$, where $Q_n(\cdot)$ is a polynomial of degree $n$, which converges uniformly to $f$ on the interval. If $Q_n(x) = \sum_{i=0}^{n} \gamma_{n,i} x^i$, let $Q'_n(x_1, \ldots, x_n) = \sum_{i=0}^{n} \gamma_{n,i} \prod_{j=1}^{i} x_j$. Now, consider the estimator which draws a positive integer $N$ according to some distribution $P(N = n) = p_n$, samples $X$ for $N$ times to get $x_1, x_2, \ldots, x_N$, and returns

\[ \frac{1}{p_N}\left( Q'_N(x_1, \ldots, x_N) - Q'_{N-1}(x_1, \ldots, x_{N-1}) \right), \]

where we assume $Q'_0 = 0$. The expected value of this estimator is equal to:

\[ E_{N, x_1, \ldots, x_N}\left[ \frac{1}{p_N}\left( Q'_N(x_1, \ldots, x_N) - Q'_{N-1}(x_1, \ldots, x_{N-1}) \right) \right]
= \sum_{n=1}^{\infty} \frac{p_n}{p_n}\, E_{x_1, \ldots, x_n}\left[ Q'_n(x_1, \ldots, x_n) - Q'_{n-1}(x_1, \ldots, x_{n-1}) \right]
= \sum_{n=1}^{\infty} \left( Q_n(E[X]) - Q_{n-1}(E[X]) \right) = f(E[X]). \]

Thus, we have an unbiased estimator of $f(E[X])$.

This technique appeared in a rather obscure early 1960's paper [5] from sequential estimation theory, and appears to be little known, particularly outside the sequential estimation community. However, we believe this technique is interesting, and expect it to have useful applications for other problems as well.

While this may seem at first like a very general result, the variance of this estimator must be bounded for it to be useful. Unfortunately, this is not true for general continuous functions. More precisely, let $N$ be distributed according to $p_n$, and let $\theta$ be the value returned by the estimator. In [2], it is shown that if $X$ is a Bernoulli random variable, and if $E[\theta N^k] < \infty$ for some integer $k$, then $f$ must be $k$ times continuously differentiable. Since $E[\theta N^k] \le (E[\theta^2] + E[N^{2k}])/2$, this means that functions $f$ which yield an estimator with finite variance, while using a number of queries with bounded variance, must be continuously differentiable. Moreover, in case we desire the number of queries to be essentially constant (i.e., choose a distribution for $N$ with exponentially decaying tails), we must have $E[N^k] < \infty$ for all $k$, which means that $f$ should be infinitely differentiable (in fact, in [2] it is conjectured that $f$ must be analytic in such cases).

Thus, we focus in this paper on functions $f$ which are analytic, i.e., they can be written as $f(x) = \sum_{i=0}^{\infty} \gamma_i x^i$ for appropriate constants $\gamma_0, \gamma_1, \ldots$. In that case, $Q_n$ can simply be the truncated
Taylor expansion of $f$ to order $n$, i.e., $Q_n(x) = \sum_{i=0}^{n} \gamma_i x^i$. Moreover, we can pick $p_n \propto 1/p^n$ for any $p > 1$. So the estimator becomes the following: we sample a nonnegative integer $N$ according to $P(N = n) = (p-1)/p^{n+1}$, sample $X$ independently $N$ times to get $x_1, x_2, \ldots, x_N$, and return

\[ \theta = \gamma_N \frac{p^{N+1}}{p-1}\, x_1 x_2 \cdots x_N, \]

where we set $\theta = \frac{p}{p-1}\gamma_0$ if $N = 0$. We have the following:

Lemma 1. For the above estimator, it holds that $E[\theta] = f(E[X])$. The expected number of samples used by the estimator is $1/(p-1)$, and the probability of it being at least $z$ is $p^{-z}$. Moreover, if we assume that $f_+(x) = \sum_{n=0}^{\infty} |\gamma_n| x^n$ exists for any $x$ in the domain of interest, then

\[ E[\theta^2] \le \frac{p}{p-1}\, f_+^2\!\left(\sqrt{p\, E[X^2]}\right). \]

Proof. The fact that $E[\theta] = f(E[X])$ follows from the discussion above. The results about the number of samples follow directly from properties of the geometric distribution. As for the second moment, $E[\theta^2]$ equals

\[ E_{N, x_1, \ldots, x_N}\left[ \gamma_N^2 \frac{p^{2(N+1)}}{(p-1)^2}\, x_1^2 x_2^2 \cdots x_N^2 \right]
= \sum_{n=0}^{\infty} \frac{p-1}{p^{n+1}} \cdot \frac{p^{2(n+1)}}{(p-1)^2}\, \gamma_n^2\, E_{x_1, \ldots, x_n}\left[ x_1^2 x_2^2 \cdots x_n^2 \right]
= \frac{p}{p-1} \sum_{n=0}^{\infty} \gamma_n^2 \left( p\, E[X^2] \right)^n \]
\[ = \frac{p}{p-1} \sum_{n=0}^{\infty} \left( |\gamma_n| \left( \sqrt{p\, E[X^2]} \right)^n \right)^2
\le \frac{p}{p-1} \left( \sum_{n=0}^{\infty} |\gamma_n| \left( \sqrt{p\, E[X^2]} \right)^n \right)^2
= \frac{p}{p-1}\, f_+^2\!\left(\sqrt{p\, E[X^2]}\right). \]

The parameter $p$ provides a tradeoff between the variance of the estimator and the number of samples needed: the larger $p$ is, the fewer samples we need, but the estimator has more variance. In any case, the sample size distribution decays exponentially fast, so the sample size is essentially bounded.

It should be emphasized that the estimator associated with Lemma 1 is tailored for generality, and is suboptimal in some cases. For example, if $f$ is a polynomial function, then $\gamma_n = 0$ for sufficiently large $n$, and there is no reason to sample $N$ from a distribution supported on all nonnegative integers: it just increases the variance. Nevertheless, in order to keep the presentation unified and general, we will always use this type of estimator. If needed, the estimator can always be optimized for specific cases.

We also note that this technique can be improved in various directions, if more is known about the distribution of $X$. For instance, if we have some estimate of the expectation and variance of $X$, then we can perform a Taylor expansion around the estimated $E[X]$ rather than 0, and tune the probability distribution of $N$ to be different than the one we used above. These modifications can allow us to make the variance of the estimator
arbitrarily small, if the variance of $X$ is small enough. Moreover, one can take polynomial approximations to $f$ which are perhaps better than truncated Taylor expansions. In this paper, for simplicity, we will ignore these potential improvements.

Finally, we note that a related result in [2] implies that it is impossible to estimate $f(E[X])$ in an unbiased manner when $f$ is discontinuous, even if we allow a number of queries and estimator values which are infinite in expectation. Therefore, since the derivative of the hinge loss is not continuous, estimating in an unbiased manner the gradient of the hinge loss with arbitrary noise appears to be impossible. Thus, if online learning with noise and hinge loss is at all feasible, a rather different approach than ours will need to be taken.

3.3 Unbiasing Noise in the RKHS

The third component of our approach involves the unbiased estimation of $\Psi(x_t)$, when we only have unbiased noisy copies of $x_t$. Here again, we have a non-trivial problem, because the feature mapping $\Psi$ is usually highly non-linear, so $E[\Psi(\tilde{x}_t)] \ne \Psi(E[\tilde{x}_t])$ in general. Moreover, $\Psi$ is not a scalar function, so the technique of Subsection 3.2 will not work as-is.

Footnote 1: Admittedly, the event $N = 0$ should receive zero probability, as it amounts to "skipping" the sampling altogether. However, setting $P(N = 0) = 0$ appears to improve the bound in this paper only in the smaller order terms, while making the analysis in the paper more complicated.
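The random-sample-size estimator of Subsection 3.2 (the one analyzed in Lemma 1) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the choice $f = \exp$ (so $\gamma_n = 1/n!$), the choice $X \sim \mathrm{Uniform}(0, 1)$ and the value $p = 2$ are all assumptions made only for this example.

```python
import math
import numpy as np


def sample_query_count(p, rng):
    """Draw N with P(N = n) = (p - 1) / p**(n + 1) for n = 0, 1, 2, ..."""
    # numpy's geometric distribution lives on {1, 2, ...}; shifting it by one
    # gives the distribution on {0, 1, 2, ...} described in the text.
    return int(rng.geometric(1.0 - 1.0 / p)) - 1


def unbiased_estimate(gamma, draw_x, p, rng):
    """Return (theta, N), theta = gamma(N) * p**(N+1) / (p-1) * x_1 * ... * x_N."""
    n = sample_query_count(p, rng)
    product = 1.0
    for _ in range(n):
        product *= draw_x()  # one fresh, independent copy of X per factor
    theta = gamma(n) * p ** (n + 1) / (p - 1.0) * product
    return theta, n


rng = np.random.default_rng(0)
p = 2.0              # illustrative choice of the sample parameter
n_trials = 100_000   # Monte Carlo repetitions, only to check unbiasedness


def gamma_exp(n):
    """Taylor coefficients of f = exp (hypothetical example function)."""
    return 1.0 / math.factorial(n)


def draw_uniform():
    """X ~ Uniform(0, 1), so E[X] = 0.5 (hypothetical example distribution)."""
    return rng.uniform(0.0, 1.0)


results = [unbiased_estimate(gamma_exp, draw_uniform, p, rng) for _ in range(n_trials)]
theta_mean = float(np.mean([theta for theta, _ in results]))
mean_queries = float(np.mean([n for _, n in results]))

# Plug-in baseline: averaging f(x) estimates E[f(X)], not f(E[X]).
plug_in_mean = float(np.mean(np.exp(rng.uniform(0.0, 1.0, n_trials))))
target = math.exp(0.5)  # f(E[X])

print(f"f(E[X]) = {target:.4f}")
print(f"mean of theta = {theta_mean:.4f}, mean number of queries = {mean_queries:.3f}")
print(f"mean of f(x) (biased plug-in) = {plug_in_mean:.4f}")
```

Under these assumptions the average of $\theta$ should settle near $f(E[X]) = e^{0.5}$ while using about $1/(p-1)$ queries per estimate, whereas averaging $f(x)$ directly converges to $E[f(X)]$, which is a different number for nonlinear $f$.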

To tackle this problem, we construct an explicit feature mapping, which needs to be tailored to the kernel we want to use. To give a very simple example, suppose we use the homogeneous 2nd-degree polynomial kernel, $k(r, s) = (\langle r, s\rangle)^2$. It is not hard to verify that the function $\Psi : \mathbb{R}^d \to \mathbb{R}^{d^2}$, defined via $\Psi(x) = (x_1 x_1, x_1 x_2, \ldots, x_d x_d)$, is an explicit feature mapping for this kernel. Now, if we query two independent noisy copies $\tilde{x}, \tilde{x}'$ of $x$, we have that the expectation of the random vector $(\tilde{x}_1 \tilde{x}'_1, \tilde{x}_1 \tilde{x}'_2, \ldots, \tilde{x}_d \tilde{x}'_d)$ is nothing more than $\Psi(x)$. Thus, we can construct unbiased estimates of $\Psi(x)$ in the RKHS. Of course, this example pertains to a very simple RKHS with a finite dimensional representation. By a randomization trick somewhat similar to the one in Subsection 3.2, we can adapt this approach to infinite dimensional RKHS as well. In a nutshell, we represent $\Psi(x)$ as an infinite-dimensional vector, and its noisy unbiased estimate is a vector which is non-zero on only finitely many entries, using finitely many noisy queries. Moreover, inner products between these estimates can be done efficiently, allowing us to implement the learning algorithms, and use the resulting predictor on test instances.

4 Main Results

4.1 Algorithm

We present our algorithmic approach in a modular form. We start by introducing the main algorithm, which contains several subroutines. Then we prove our two main results, which bound the regret of the algorithm, the number of queries to the oracle, and the running time for two types of kernels: dot product and Gaussian (our results can be extended to other kernel types as well). In itself, the algorithm is nothing more than a standard online gradient descent algorithm with a standard $O(\sqrt{T})$ regret bound. Thus, most of the proofs are devoted to a detailed discussion of how the subroutines are implemented (including explicit pseudo-code). In this section, we just describe one subroutine, based on the techniques discussed in Sec. 3. The other subroutines require a more detailed and technical discussion, and thus their implementation is described as part of the proofs in Sec. 5. In any case, the intuition behind the implementations and the techniques used are described in Sec. 3.

For simplicity,
we will focus on a finite-horizon setting, where the number of online rounds $T$ is fixed and known to the learner. The algorithm can easily be modified to deal with the infinite horizon setting, where the learner needs to achieve sub-linear regret for all $T$ simultaneously. Also, for the remainder of this subsection, we assume for simplicity that $\ell$ is a classification loss, namely that it can be written as a function $\ell(y\langle w, \Psi(x)\rangle)$. It is not hard to adapt the results below to the case where $\ell$ is a regression loss (where $\ell$ is a function of $\langle w, \Psi(x)\rangle - y$).

We note that at each round, the algorithm below constructs an object which we denote as $\tilde{\Psi}(x_t)$. This object has two interpretations here: formally, it is an element of a reproducing kernel Hilbert space (RKHS) corresponding to the kernel we use, and is equal in expectation to $\Psi(x_t)$. However, in terms of implementation, it is simply a data structure consisting of a finite set of vectors from $\mathbb{R}^d$. Thus, it can be efficiently stored in memory and handled even for infinite-dimensional RKHS.

Algorithm 1: Kernel Learning Algorithm with Noisy Input
  Parameters: learning rate $\eta > 0$, number of rounds $T$, sample parameter $p > 1$
  Initialize: $\alpha_i = 0$ for all $i = 1, \ldots, T$; $\tilde{\Psi}(x_i) = \emptyset$ for all $i = 1, \ldots, T$
    // $\tilde{\Psi}(x_i)$ is a data structure which can store a variable number of vectors in $\mathbb{R}^d$
  For $t = 1, \ldots, T$:
    Define $w_t = \sum_{i=1}^{t-1} \alpha_{t-1,i} \tilde{\Psi}(x_i)$    // $\alpha_{t,i}$ denotes the value of $\alpha_i$ at the end of round $t$
    Receive $A_t, y_t$    // The oracle $A_t$ provides noisy estimates of $x_t$
    Let $\tilde{\Psi}(x_t)$ := Map_Estimate($A_t$, $p$)    // Get unbiased estimate of $\Psi(x_t)$ in the RKHS
    Let $\tilde{g}_t$ := Grad_Length_Estimate($A_t$, $y_t$, $p$)    // Get unbiased estimate of $\ell'(y_t\langle w_t, \Psi(x_t)\rangle)$
    Let $\alpha_t := -\tilde{g}_t\, \eta/\sqrt{T}$    // Perform gradient step
    Let $\tilde{n}_t := \sum_{i=1}^{t} \sum_{j=1}^{t} \alpha_{t,i} \alpha_{t,j}$ Prod($\tilde{\Psi}(x_i)$, $\tilde{\Psi}(x_j)$)
      // Compute squared norm, where Prod($\tilde{\Psi}(x_i)$, $\tilde{\Psi}(x_j)$) returns $\langle \tilde{\Psi}(x_i), \tilde{\Psi}(x_j)\rangle$
    If $\tilde{n}_t > B_w$:    // If norm squared is larger than $B_w$, then project
      Let $\alpha_i := \alpha_i \sqrt{B_w/\tilde{n}_t}$ for all $i = 1, \ldots, t$

Like $\tilde{\Psi}(x_t)$, $w_{t+1}$ has also two interpretations: formally, it is an element in the RKHS, as defined
in the pseudocode. In terms of implementation, it is defined via the data structures $\tilde{\Psi}(x_1), \ldots, \tilde{\Psi}(x_t)$ and the values of $\alpha_1, \ldots, \alpha_t$ at round $t$. To apply this hypothesis on a given instance $x$, we compute $\sum_{i=1}^{t} \alpha_{t,i}$ Prod($\tilde{\Psi}(x_i)$, $x$), where Prod($\tilde{\Psi}(x_i)$, $x$) is a subroutine which returns $\langle \tilde{\Psi}(x_i), \Psi(x)\rangle$ (a pseudocode is provided as part of the proofs later on).

We now turn to the main results pertaining to the algorithm. The first result shows what regret bound is achievable by the algorithm for any dot-product kernel, and also characterizes the number of oracle queries per instance and the overall running time of the algorithm.

Theorem 1. Assume that the loss function $\ell$ has an analytic derivative $\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$ for all $a$ in its domain, and let $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$ (assuming it exists). Assume also that the kernel $k(x, x')$ can be written as $Q(\langle x, x'\rangle)$ for all $x, x' \in \mathbb{R}^d$. Finally, assume that $E[\|\tilde{x}_t\|^2] \le B_{\tilde{x}}$ for any $\tilde{x}_t$ returned by the oracle at round $t$, for all $t = 1, \ldots, T$. Then, for all $B_w > 0$ and $p > 1$, it is possible to implement the subroutines of Algorithm 1 such that:

The expected number of queries to each oracle $A_t$ is $p/(p-1)^2$.

The expected running time of the algorithm is [...].

If we run Algorithm 1 with $\eta = B_w / \left(u\, \ell'_+(\sqrt{p-1}\, u)\right)$, where $u = \frac{p}{p-1}\sqrt{B_w\, Q(p B_{\tilde{x}})}$, then

\[ E\left[ \sum_{t=1}^{T} \ell(y_t\langle w_t, \Psi(x_t)\rangle) - \min_{w : \|w\|^2 \le B_w} \sum_{t=1}^{T} \ell(y_t\langle w, \Psi(x_t)\rangle) \right] \le \ell'_+(\sqrt{p-1}\, u)\, u \sqrt{T}. \]

The expectations are with respect to the randomness of the oracles and the algorithm throughout its run.

We note that the distribution of the number of oracle queries can be specified explicitly, and it decays very rapidly; see the proof for details. Also, for simplicity, we only bound the expected regret in the theorem above. If the noise is bounded almost surely or with sub-Gaussian tails (rather than just bounded variance), then it is possible to obtain similar guarantees with high probability, by relying on Azuma's inequality or variants thereof (see for example [4]).

We now turn to the case of Gaussian kernels.

Theorem 2. Assume that the loss function $\ell$ has an analytic derivative $\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$ for all $a$ in its domain, and let $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$ (assuming it exists). Assume that the kernel $k(x, x')$ is defined as $\exp(-\|x - x'\|^2/\sigma^2)$. Finally, assume that $E[\|\tilde{x}_t\|^2] \le B_{\tilde{x}}$ for any $\tilde{x}_t$ returned by the oracle at round $t$, for
all $t = 1, \ldots, T$. Then for all $B_w > 0$ and $p > 1$ it is possible to implement the subroutines of Algorithm 1 such that:

The expected number of queries to each oracle $A_t$ is [...].

The expected running time of the algorithm is [...].

If we run Algorithm 1 with $\eta = B_w / \left(u\, \ell'_+(\sqrt{p-1}\, u)\right)$, where $u$ = [...] (an expression in $B_w$, $B_{\tilde{x}}$, $p$ and $\sigma^2$ of the form $\sqrt{B_w \cdots \exp(\cdots/\sigma^2)}$), then

\[ E\left[ \sum_{t=1}^{T} \ell(y_t\langle w_t, \Psi(x_t)\rangle) - \min_{w : \|w\|^2 \le B_w} \sum_{t=1}^{T} \ell(y_t\langle w, \Psi(x_t)\rangle) \right] \le \ell'_+(\sqrt{p-1}\, u)\, u \sqrt{T}. \]

The expectations are with respect to the randomness of the oracles and the algorithm throughout its run.

As in Thm. 1, note that the number of oracle queries has a fast decaying distribution. Also, note that with Gaussian kernels, $\sigma^2$ is usually chosen to be on the order of the examples' squared norms. Thus, if the noise added to the examples is proportional to their original norm, we can assume that $B_{\tilde{x}}/\sigma^2 = O(1)$, and thus $u$, which appears in the bound, is also bounded by a constant.

As previously mentioned, most of the subroutines are described in the proofs section, as part of the proof of Thm. 1. Here, we only show how to implement the Grad_Length_Estimate subroutine,
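which returns the gradient length estimate $\tilde{g}_t$. The idea is based on the technique described in Subsection 3.2. We prove that $\tilde{g}_t$ is an unbiased estimate of $y_t \ell'(y_t\langle w_t, \Psi(x_t)\rangle)$, and bound $E[\tilde{g}_t^2]$. As discussed earlier, we assume that $\ell'$ is analytic and can be written as $\ell'(a) = \sum_{n=0}^{\infty} \gamma_n a^n$.

Subroutine 1: Grad_Length_Estimate($A_t$, $y_t$, $p$)
  Sample nonnegative integer $n$ according to $P(n) = (p-1)/p^{n+1}$
  For $j = 1, \ldots, n$:
    Let $\tilde{\Psi}(x_t)_j$ := Map_Estimate($A_t$)    // Get unbiased estimate of $\Psi(x_t)$ in the RKHS
  Return $\tilde{g}_t := y_t \gamma_n \frac{p^{n+1}}{p-1} \prod_{j=1}^{n} \left( \sum_{i=1}^{t-1} \alpha_{t-1,i}\, \mathrm{Prod}(\tilde{\Psi}(x_i), \tilde{\Psi}(x_t)_j) \right)$

Lemma 2. Assume that $E[\tilde{\Psi}(x_t)] = \Psi(x_t)$, and that Prod($\tilde{\Psi}(x)$, $\tilde{\Psi}(x')$) returns $\langle \tilde{\Psi}(x), \tilde{\Psi}(x')\rangle$ for all $x, x'$. Then for any given $w_t = \alpha_{t-1,1}\tilde{\Psi}(x_1) + \cdots + \alpha_{t-1,t-1}\tilde{\Psi}(x_{t-1})$ it holds that

\[ E_t[\tilde{g}_t] = y_t\, \ell'(y_t\langle w_t, \Psi(x_t)\rangle) \quad\text{and}\quad E_t[\tilde{g}_t^2] \le \frac{p}{p-1}\, \ell'_+\!\left(\sqrt{p\, B_w B_{\tilde{\Psi}}}\right)^2, \]

where the expectation is with respect to the randomness of Subroutine 1, and $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$.

Proof. The result follows from Lemma 1, where $\tilde{g}_t$ corresponds to the estimator $\theta$, the function $f$ corresponds to $\ell'$, and the random variable $X$ corresponds to $\langle w_t, \tilde{\Psi}(x_t)\rangle$ (where $\tilde{\Psi}(x_t)$ is random and $w_t$ is held fixed). The term $E[X^2]$ in Lemma 1 can be upper bounded as

\[ E_t\left[ \langle w_t, \tilde{\Psi}(x_t)\rangle^2 \right] \le \|w_t\|^2\, E_t\left[ \|\tilde{\Psi}(x_t)\|^2 \right] \le B_w B_{\tilde{\Psi}}. \]

4.2 Loss Function Examples

Theorems 1 and 2 both deal with generic loss functions $\ell$ whose derivative can be written as $\sum_{n=0}^{\infty} \gamma_n a^n$, and the regret bounds involve the functions $\ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n| a^n$. Below, we present a few examples of loss functions and their corresponding $\ell'_+$. As mentioned earlier, while the theorems in the previous subsection are in terms of classification losses (i.e., $\ell$ is a function of $y\langle w, \Psi(x)\rangle$), virtually identical results can be proven for regression losses (i.e., $\ell$ is a function of $\langle w, \Psi(x)\rangle - y$), so we will give examples from both families. Working out the first two examples is straightforward. The proofs of the other two appear in Sec. 5. The loss functions are illustrated graphically in Fig. 1.

Example 1. For the squared loss function, $\ell(\langle w, x\rangle, y) = (\langle w, x\rangle - y)^2$, we have $\ell'_+(\sqrt{p-1}\, u) = 2\sqrt{p-1}\, u$.

Example 2. For the exponential loss function, $\ell(\langle w, x\rangle, y) = e^{-y\langle w, x\rangle}$, we have $\ell'_+(\sqrt{p-1}\, u) = e^{\sqrt{p-1}\, u}$.

Example 3. Consider a "smoothed" absolute loss function $\ell_\sigma(\langle w, \Psi(x)\rangle - y)$, defined as an antiderivative of $\mathrm{Erf}(sa)$ for some $s > 0$ (see proof for exact analytic form). Then we have that [...].

Example 4. Consider a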
"smoothed" hinge loss $\ell(y\langle w, \Psi(x)\rangle)$, defined as an antiderivative of $(\mathrm{Erf}(sa) - 1)/2$ for some $s > 0$ (see proof for exact analytic form). Then we have that [...].

For any $s$, the loss functions in the last two examples are convex, and respectively approximate the absolute loss $|\langle w, \Psi(x)\rangle - y|$ and the hinge loss $\max\{0, 1 - y\langle w, \Psi(x)\rangle\}$ arbitrarily well for large enough $s$. Fig. 1 shows these loss functions graphically for $s = 1$. Note that $s$ need not be large in order to get a good approximation. Also, we note that both the loss itself and its gradient are computationally easy to evaluate.

Finally, we remind the reader that as discussed in Subsection 3.2, performing an unbiased estimate of the gradient for non-differentiable losses directly (such as the hinge loss or absolute loss) appears to be impossible in general. On the flip side, if one is willing to use a random number of queries with polynomial rather than exponential tails, then one can achieve much better sample complexity results, by focusing on loss functions (or approximations thereof) which are only differentiable to a bounded order, rather than fully analytic. This again demonstrates the tradeoff between the sample size and the amount of information that needs to be gathered on each training example.
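As a short worked illustration of how $\ell'_+$ is obtained from the Taylor coefficients of $\ell'$, the squared and exponential losses of Examples 1 and 2 can be handled as follows. This is a sketch under the assumption that the squared loss is taken in regression form $f(a) = a^2$ and the exponential loss in classification form $f(a) = e^{-a}$; the variable $a$ stands for the argument of $f$.

```latex
% Squared loss, f(a) = a^2:
f'(a) = 2a
\;\Rightarrow\; \gamma_1 = 2,\quad \gamma_n = 0 \ \text{for } n \neq 1
\;\Rightarrow\; \ell'_+(a) = \sum_{n=0}^{\infty} |\gamma_n|\, a^n = 2a .

% Exponential loss, f(a) = e^{-a}:
f'(a) = -e^{-a} = \sum_{n=0}^{\infty} \frac{(-1)^{n+1}}{n!}\, a^n
\;\Rightarrow\; |\gamma_n| = \frac{1}{n!}
\;\Rightarrow\; \ell'_+(a) = \sum_{n=0}^{\infty} \frac{a^n}{n!} = e^{a} .
```

Taking absolute values of the coefficients removes the alternating signs, which is why $\ell'_+$ of the exponential loss grows like $e^{a}$ even though $|\ell'(a)| = e^{-a}$ decays for positive $a$.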

[Figure 1: Absolute loss, hinge loss, and smooth approximations. Plot legends: absolute loss, smoothed absolute loss; hinge loss, smoothed hinge loss.]

4.3 One Noisy Copy is Not Enough

The previous results might lead one to wonder whether it is really necessary to query the same instance more than once. In some applications this is inconvenient, and one would prefer a method which works when just a single noisy copy of each instance is made available. In this subsection we show that, unfortunately, such a method cannot be found. Specifically, we prove that under very mild assumptions, no method can achieve sub-linear regret when it has access to just a single noisy copy of each instance. On the other hand, for the case of squared loss and linear kernels, our techniques can be adapted to work with exactly two noisy copies of each instance (see footnote 2), so without further assumptions, the lower bound that we prove here is indeed tight. For simplicity, we prove the result for linear kernels (i.e., where $k(x, x') = \langle x, x'\rangle$). It is an interesting open problem to show improved lower bounds when nonlinear kernels are used. We also note that the result crucially relies on the learner not knowing the noise distribution, and we leave to future work the investigation of what happens when this assumption is relaxed.

Theorem 3. Let $W$ be a compact convex subset of $\mathbb{R}^d$, and let $\ell(\cdot, 1) : \mathbb{R} \to \mathbb{R}$ satisfy the following: (1) it is bounded from below; (2) it is differentiable at 0 with $\ell'(0, 1) < 0$. For any learning algorithm which selects hypotheses from $W$ and is allowed access to a single noisy copy of the instance at each round $t$, there exists a strategy for the adversary such that the sequence $w_1, w_2, \ldots$ of predictors output by the algorithm satisfies

\[ \limsup_{T \to \infty}\, \max_{w \in W}\, \frac{1}{T} \sum_{t=1}^{T} \left( \ell(\langle w_t, x_t\rangle, y_t) - \ell(\langle w, x_t\rangle, y_t) \right) > 0 \]

with probability 1 with respect to the randomness of the oracles.

Note that condition (1) is satisfied by virtually any loss function other than the linear loss, while condition (2) is satisfied by most regression losses, and by all classification calibrated losses, which include all reasonable losses for classification (see [2]). The most obvious example where the conditions are not satisfied is when $\ell(\cdot, 1)$ is a linear
function. This is not surprising, because when $\ell(\cdot, 1)$ is linear, the learner is always robust to noise (see the discussion at Sec. 3).

The intuition of the proof is very simple: the adversary chooses beforehand whether the examples are drawn i.i.d. from a distribution $D$, and then perturbed by noise, or drawn i.i.d. from some other distribution $D'$ without adding noise. The distributions $D, D'$ and the noise are designed so that the examples observed by the learner are distributed in the same way irrespective of which of the two sampling strategies the adversary chooses. Therefore, it is impossible for the learner accessing a single copy of each instance to be statistically consistent with respect to both distributions simultaneously. As a result, the adversary can always choose a distribution on which the algorithm will be inconsistent, leading to constant regret. The full proof is presented in Section 5.3.

Footnote 2: In a nutshell, for squared loss and linear kernels, we just need to estimate $2(\langle w_t, x_t\rangle - y_t)x_t$ in an unbiased manner at each round $t$. This can be done by computing $2(\langle w_t, \tilde{x}_t\rangle - y_t)\tilde{x}'_t$, where $\tilde{x}_t, \tilde{x}'_t$ are two noisy copies of $x_t$.
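The two-copy gradient estimate for squared loss and linear kernels described in footnote 2 can be checked numerically. The sketch below is illustrative only: the vectors `x`, `w`, the label `y`, the Gaussian noise model and the helper names are assumptions made for the example (the setting itself only requires zero-mean noise with bounded variance).

```python
import numpy as np


def two_copy_gradients(w, y, copies_a, copies_b):
    """Rows are 2 * (<w, x~> - y) * x~', built from two independent noisy copies."""
    residual = copies_a @ w - y          # shape (n_trials,)
    return 2.0 * residual[:, None] * copies_b


def one_copy_gradients(w, y, copies_a):
    """Rows are 2 * (<w, x~> - y) * x~, reusing a single noisy copy (biased)."""
    residual = copies_a @ w - y
    return 2.0 * residual[:, None] * copies_a


rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])   # hypothetical clean instance
w = np.array([0.3, 0.1, -0.4])   # hypothetical current predictor
y = 0.7                          # hypothetical target value
noise_std = 1.0                  # assumed isotropic Gaussian noise level
n_trials = 200_000

true_grad = 2.0 * (w @ x - y) * x

# Two independent batches of noisy copies of the same instance x.
copies_a = x + noise_std * rng.standard_normal((n_trials, x.size))
copies_b = x + noise_std * rng.standard_normal((n_trials, x.size))

two_copy_mean = two_copy_gradients(w, y, copies_a, copies_b).mean(axis=0)
one_copy_mean = one_copy_gradients(w, y, copies_a).mean(axis=0)

# With isotropic noise, reusing one copy adds the bias 2 * noise_std**2 * w.
one_copy_bias = 2.0 * noise_std**2 * w

print("true gradient      :", true_grad)
print("two-copy estimate  :", two_copy_mean)
print("one-copy estimate  :", one_copy_mean)
print("predicted one-copy :", true_grad + one_copy_bias)
```

The two-copy average should approach the true gradient because the two noise vectors are independent, so the cross term has zero mean. Reusing a single copy leaves the term $2E[\langle w, Z\rangle Z]$, which depends on the unknown noise covariance and therefore cannot be corrected by the learner.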

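5 Proofs

Due to the lack of space, some of the proofs are given in the appendix.

5.1 Preliminary Result

To prove Thm. 1 and Thm. 2, we need a theorem which basically states that if all subroutines in Algorithm 1 behave as they should, then one can achieve an $O(\sqrt{T})$ regret bound. This is provided in the following theorem, which is an adaptation of a standard result of online convex optimization (see, e.g., [17]). The proof is given in Appendix D.

Theorem 4. Assume the following conditions hold with respect to Algorithm 1:

1. For all $t$, $\tilde\Psi(x_t)$ and $\tilde g_t$ are independent of each other (as random variables induced by the randomness of Algorithm 1) as well as independent of any $\tilde\Psi(x_i)$ and $\tilde g_i$ for $i < t$.
2. For all $t$, $E[\tilde\Psi(x_t)] = \Psi(x_t)$, and there exists a constant $B_{\tilde\Psi} > 0$ such that $E[\|\tilde\Psi(x_t)\|^2] \le B_{\tilde\Psi}$.
3. For all $t$, $E[\tilde g_t] = y_t\,\ell'(y_t\langle w_t, \Psi(x_t)\rangle)$, and there exists a constant $B_{\tilde g} > 0$ such that $E[\tilde g_t^2] \le B_{\tilde g}$.
4. For any pair of instances $x, x'$, $\mathrm{Prod}(\tilde\Psi(x), \tilde\Psi(x')) = \langle\tilde\Psi(x), \tilde\Psi(x')\rangle$.

Then if Algorithm 1 is run with $\eta = \sqrt{B_w / (B_{\tilde g} B_{\tilde\Psi})}$, the following inequality holds

$$E\Big[\sum_{t=1}^{T}\ell(y_t\langle w_t, \Psi(x_t)\rangle) - \min_{w:\|w\|^2\le B_w}\sum_{t=1}^{T}\ell(y_t\langle w, \Psi(x_t)\rangle)\Big] \le \sqrt{B_w B_{\tilde g} B_{\tilde\Psi}\,T},$$

where the expectation is with respect to the randomness of the oracles and the algorithm throughout its run.

5.2 Proof of Thm. 1

In this subsection, we present the proof of Thm. 1. We first show how to implement the subroutines of Algorithm 1, and prove the relevant results on their behavior. Then, we prove the theorem itself.

It is known that for $k(\cdot,\cdot) = Q(\langle\cdot,\cdot\rangle)$ to be a valid kernel, it is necessary that $Q(\langle x, x'\rangle)$ can be written as a Taylor expansion $\sum_{n=0}^{\infty}\beta_n(\langle x, x'\rangle)^n$, where $\beta_n \ge 0$ (see Theorem 4.19 in [14]). This makes these types of kernels amenable to our techniques.

We start by constructing an explicit feature mapping $\Psi(\cdot)$ corresponding to the RKHS induced by our kernel. For any $x, x'$, we have that

$$k(x,x') = \sum_{n=0}^{\infty}\beta_n(\langle x, x'\rangle)^n = \sum_{n=0}^{\infty}\beta_n\Big(\sum_{i=1}^{d}x_i x'_i\Big)^n = \sum_{n=0}^{\infty}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}\beta_n\, x_{k_1}x_{k_2}\cdots x_{k_n}\,x'_{k_1}x'_{k_2}\cdots x'_{k_n}$$
$$= \sum_{n=0}^{\infty}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}\big(\sqrt{\beta_n}\,x_{k_1}x_{k_2}\cdots x_{k_n}\big)\big(\sqrt{\beta_n}\,x'_{k_1}x'_{k_2}\cdots x'_{k_n}\big).$$

This suggests the following feature representation: for any $x$, $\Psi(x)$ returns an infinite-dimensional vector, indexed by $n$ and $k_1,\ldots,k_n \in \{1,\ldots,d\}$, with the entry corresponding to $n, k_1,\ldots,k_n$ being $\sqrt{\beta_n}\,x_{k_1}\cdots x_{k_n}$. The dot product between $\Psi(x)$ and $\Psi(x')$ is similar to a standard dot product between two vectors, and by the derivation above equals $k(x,x')$ as required.

We now use a slightly more elaborate variant of our unbiased estimate technique, to derive an unbiased estimate of $\Psi(x)$. First, we sample $N$ according to $P(N = n) = (p-1)/p^{n+1}$. Then, we query the oracle for $x$ for $N$ times to get $\tilde x^{(1)},\ldots,\tilde x^{(N)}$, and formally define $\tilde\Psi(x)$ as

$$\tilde\Psi(x) = \sqrt{\beta_N}\,\frac{p^{N+1}}{p-1}\sum_{k_1=1}^{d}\cdots\sum_{k_N=1}^{d}\tilde x^{(1)}_{k_1}\cdots\tilde x^{(N)}_{k_N}\,e_{N,k_1,\ldots,k_N} \qquad (12)$$

where $e_{N,k_1,\ldots,k_N}$ represents the unit vector in the direction indexed by $N, k_1,\ldots,k_N$ as explained above. Since the oracle queries are i.i.d., the expectation of this expression is

$$\sum_{n=0}^{\infty}\frac{p-1}{p^{n+1}}\sqrt{\beta_n}\,\frac{p^{n+1}}{p-1}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}E\big[\tilde x^{(1)}_{k_1}\cdots\tilde x^{(n)}_{k_n}\big]e_{n,k_1,\ldots,k_n} = \sum_{n=0}^{\infty}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}\sqrt{\beta_n}\,x_{k_1}\cdots x_{k_n}\,e_{n,k_1,\ldots,k_n},$$

which is exactly $\Psi(x)$. We formalize the needed properties of $\tilde\Psi(x)$ in the following lemma.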
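The sampling step of this construction can be sketched in a few lines. This is an illustrative sketch only: the helper names, the inverse-transform sampler, and the Gaussian-noise oracle in the usage lines are assumptions made here, not part of the paper. It draws $N$ with $P(N=n) = (p-1)/p^{n+1}$, collects $N$ noisy copies, and exposes the factor $p^{n+1}/(p-1)$ that cancels the sampling probability in the expectation.

```python
import math
import random


def pmf(n, p):
    """P(N = n) = (p - 1) / p**(n + 1) for n = 0, 1, 2, ..."""
    return (p - 1.0) / p ** (n + 1)


def sample_n(p, rng):
    """Draw N by inverting the tail probability P(N >= n) = p**(-n)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.floor(-math.log(u) / math.log(p)))


def importance_weight(n, p):
    """Factor p**(n + 1) / (p - 1) that cancels the sampling probability."""
    return p ** (n + 1) / (p - 1.0)


def map_estimate(oracle, p, rng):
    """Return the N noisy copies that implicitly represent the estimate."""
    n = sample_n(p, rng)
    return [oracle() for _ in range(n)]


# Usage with a hypothetical oracle that adds Gaussian noise to a fixed instance.
rng = random.Random(0)
x = [1.0, -2.0]
oracle = lambda: [xi + rng.gauss(0.0, 0.1) for xi in x]
copies = map_estimate(oracle, 2.0, rng)
```

Since $P(N \ge n) = p^{-n}$, the number of stored copies has mean $1/(p-1)$ and an exponentially decaying tail, so larger values of $p$ mean fewer oracle queries per round.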

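Lemma 3. Assuming $\tilde\Psi(x)$ is constructed as in the discussion above, it holds that $E[\tilde\Psi(x)] = \Psi(x)$ for any $x$. Moreover, if the noisy samples $\tilde x_t$ returned by the oracle $A_t$ satisfy $E[\|\tilde x_t\|^2] \le B_{\tilde x}$, then

$$E\big[\|\tilde\Psi(x_t)\|^2\big] \le \frac{p}{p-1}\,Q(pB_{\tilde x}),$$

where we recall that $Q$ defines the kernel by $k(x,x') = Q(\langle x, x'\rangle)$.

Proof. The first part of the lemma follows from the discussion above. As to the second part, note that by (12),

$$E\big[\|\tilde\Psi(x_t)\|^2\big] = E\Big[\beta_N\frac{p^{2N+2}}{(p-1)^2}\sum_{k_1,\ldots,k_N=1}^{d}\big(\tilde x^{(1)}_{t,k_1}\cdots\tilde x^{(N)}_{t,k_N}\big)^2\Big] = E\Big[\beta_N\frac{p^{2N+2}}{(p-1)^2}\prod_{j=1}^{N}\|\tilde x^{(j)}_t\|^2\Big]$$
$$= \sum_{n=0}^{\infty}\frac{p-1}{p^{n+1}}\,\beta_n\frac{p^{2n+2}}{(p-1)^2}\big(E[\|\tilde x_t\|^2]\big)^n = \frac{p}{p-1}\sum_{n=0}^{\infty}\beta_n\big(p\,E[\|\tilde x_t\|^2]\big)^n \le \frac{p}{p-1}\sum_{n=0}^{\infty}\beta_n(pB_{\tilde x})^n = \frac{p}{p-1}\,Q(pB_{\tilde x}),$$

where the second-to-last step used the fact that $\beta_n \ge 0$ for all $n$.

Of course, explicitly storing $\tilde\Psi(x)$ as defined above is infeasible, since the number of entries is huge. Fortunately, this is not needed: we just need to store $\tilde x^{(1)}_t,\ldots,\tilde x^{(N)}_t$. The representation above is used implicitly when we calculate dot products between $\tilde\Psi(x)$ and other elements in the RKHS, via the subroutine Prod. We note that while $N$ is a random quantity which might be unbounded, its distribution decays exponentially fast, so the number of vectors to store is essentially bounded.

After the discussion above, the pseudocode for Map_Estimate below should be self-explanatory.

Subroutine 2: Map_Estimate($A_t$, $p$)
  Sample nonnegative integer $N$ according to $P(N = n) = (p-1)/p^{n+1}$
  Query $A_t$ for $N$ times to get $\tilde x^{(1)}_t,\ldots,\tilde x^{(N)}_t$
  Return $\tilde x^{(1)}_t,\ldots,\tilde x^{(N)}_t$ as $\tilde\Psi(x_t)$

We now turn to the subroutine Prod, which given two elements in the RKHS, returns their dot product. This subroutine comes in two flavors: either as a procedure defined over $\tilde\Psi(x), \tilde\Psi(x')$ and returning $\langle\tilde\Psi(x), \tilde\Psi(x')\rangle$ (Subroutine 3); or as a procedure defined over $\tilde\Psi(x), x'$ (Subroutine 4, where the second element is an explicitly given vector) and returning $\langle\tilde\Psi(x), \Psi(x')\rangle$. This second variant of Prod is needed when we wish to apply the learned predictor on a new given instance $x'$.

Subroutine 3: Prod($\tilde\Psi(x)$, $\tilde\Psi(x')$)
  Let $n, x^{(1)},\ldots,x^{(n)}$ be the index and vectors comprising $\tilde\Psi(x)$
  Let $n', x'^{(1)},\ldots,x'^{(n')}$ be the index and vectors comprising $\tilde\Psi(x')$
  If $n \ne n'$ return 0, else return $\beta_n\frac{p^{2n+2}}{(p-1)^2}\prod_{j=1}^{n}\langle x^{(j)}, x'^{(j)}\rangle$

Lemma 4. Prod($\tilde\Psi(x)$, $\tilde\Psi(x')$) returns $\langle\tilde\Psi(x), \tilde\Psi(x')\rangle$.

Proof. Using the formal representation of $\tilde\Psi(x), \tilde\Psi(x')$ in (12), we have that $\langle\tilde\Psi(x), \tilde\Psi(x')\rangle$ is 0 whenever $N \ne N'$ (because then these two elements are composed of different unit vectors with respect to an orthogonal basis). Otherwise, we have that

$$\langle\tilde\Psi(x), \tilde\Psi(x')\rangle = \beta_N\frac{p^{2N+2}}{(p-1)^2}\sum_{k_1,\ldots,k_N=1}^{d}\tilde x^{(1)}_{k_1}\cdots\tilde x^{(N)}_{k_N}\,\tilde x'^{(1)}_{k_1}\cdots\tilde x'^{(N)}_{k_N}$$
$$= \beta_N\frac{p^{2N+2}}{(p-1)^2}\Big(\sum_{k_1=1}^{d}\tilde x^{(1)}_{k_1}\tilde x'^{(1)}_{k_1}\Big)\cdots\Big(\sum_{k_N=1}^{d}\tilde x^{(N)}_{k_N}\tilde x'^{(N)}_{k_N}\Big) = \beta_N\frac{p^{2N+2}}{(p-1)^2}\prod_{j=1}^{N}\langle\tilde x^{(j)}, \tilde x'^{(j)}\rangle,$$

which is exactly what the algorithm returns, hence the lemma follows.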
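The implicit representation makes Prod cheap to compute. Below is a minimal sketch of Subroutine 3 under stated assumptions: each estimate is represented simply as the list of noisy copies (its length plays the role of the index $n$), the kernel has finitely many Taylor coefficients given as a list `beta`, and the helper `expected_prod_noiseless` is a hypothetical sanity check for the special case of a noise-free oracle. None of these names come from the paper.

```python
def dot(u, v):
    """Standard inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pmf(n, p):
    """P(N = n) = (p - 1) / p**(n + 1)."""
    return (p - 1.0) / p ** (n + 1)


def prod(beta, p, est, est2):
    """Inner product of two implicitly stored estimates.

    est and est2 are lists of noisy copies; their lengths are the indices n, n'.
    Coefficients beyond the end of beta are treated as zero.
    """
    n = len(est)
    if n != len(est2):
        return 0.0  # different indices: orthogonal components
    if n >= len(beta):
        return 0.0  # beta_n = 0 for a finite-degree kernel
    value = beta[n] * p ** (2 * n + 2) / (p - 1.0) ** 2
    for u, v in zip(est, est2):
        value *= dot(u, v)
    return value


def expected_prod_noiseless(beta, p, x, x2):
    """Exact E[Prod] when every oracle copy equals the clean instance."""
    total = 0.0
    for n in range(len(beta)):
        for m in range(len(beta)):
            total += pmf(n, p) * pmf(m, p) * prod(beta, p, [x] * n, [x2] * m)
    return total


# Q(a) = (1 + a)^2 = 1 + 2a + a^2, so beta = [1, 2, 1].
beta = [1.0, 2.0, 1.0]
x, x2 = [1.0, 2.0], [0.5, -1.0]
kernel_value = (1.0 + dot(x, x2)) ** 2
estimate_mean = expected_prod_noiseless(beta, 1.5, x, x2)
```

In the noise-free case the sampling probabilities and the factor $p^{2n+2}/(p-1)^2$ cancel exactly, so `estimate_mean` equals `kernel_value`; with zero-mean noise the same cancellation holds in expectation because the copies are independent.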

The pseudocode for calculating the dot product $\langle\tilde\Psi(x), \Psi(x')\rangle$ (where $x'$ is known) is very similar, and the proof is essentially the same.

Subroutine 4: Prod($\tilde\Psi(x)$, $x'$)
  Let $n, x^{(1)},\ldots,x^{(n)}$ be the index and vectors comprising $\tilde\Psi(x)$
  Return $\beta_n\frac{p^{n+1}}{p-1}\prod_{j=1}^{n}\langle x^{(j)}, x'\rangle$

We are now ready to prove Thm. 1. First, regarding the expected number of queries, notice that to run Algorithm 1, we invoke Map_Estimate and Grad_Length_Estimate once at round $t$. Map_Estimate uses a random number $B$ of queries distributed as $P(B = n) = (p-1)/p^{n+1}$, and Grad_Length_Estimate invokes Map_Estimate a random number $C$ of times, distributed as $P(C = n) = (p-1)/p^{n+1}$. The total number of queries is therefore $\sum_{j=1}^{C+1}B_j$, where $B_j$ for all $j$ are i.i.d. copies of $B$. The expected value of this expression, using a standard result on the expected value of a sum of a random number of independent random variables, is equal to $(1 + E[C])\,E[B_j]$, or $p/(p-1)^2$.

In terms of running time, we note that the expected running time of Prod is $O(1 + d/(p-1))$, because it performs $N$ multiplications of inner products, each one with running time $O(d)$, and $E[N] = 1/(p-1)$. The expected running time of Map_Estimate is $O(1 + 1/(p-1))$. The expected running time of Grad_Length_Estimate is bounded along the same lines. Since Algorithm 1 at each of $T$ rounds calls Map_Estimate once, Grad_Length_Estimate once, Prod for $O(T^2)$ times, and performs $O(1)$ other operations, and using $\frac{1}{p-1} \le \frac{p}{(p-1)^2}$, the overall expected runtime can be upper bounded by $O\big(T^3(1 + dp/(p-1)^2)\big)$.

The regret bound in the theorem follows from Thm. 4, with the expressions for constants following from Lemma 2, Lemma 3, and Lemma 4.

5.3 Proof Sketch of Thm. 3

To prove the theorem, we use a more general result which leads to non-vanishing regret, and then show that under the assumptions of Thm. 3, the result holds. The proof of the result is given in Appendix F.

Theorem 5. Let $W$ be a compact convex subset of $\mathbb{R}^d$ and pick any learning algorithm which selects hypotheses from $W$ and is allowed access to a single noisy copy of the instance at each round $t$. If there exists a distribution over a compact subset of $\mathbb{R}^d$ such that

$$\operatorname*{argmin}_{w\in W} E\big[\ell(\langle w, x\rangle, 1)\big] \quad\text{and}\quad \operatorname*{argmin}_{w\in W}\ell(\langle w, E[x]\rangle, 1) \qquad (13)$$

are disjoint, then there exists a strategy for the adversary such that the sequence $w_1, w_2, \ldots \in W$ of predictors output by the algorithm satisfies

$$\limsup_{T\to\infty}\ \max_{w\in W}\ \frac{1}{T}\sum_{t=1}^{T}\Big(\ell(\langle w_t, x_t\rangle, y_t) - \ell(\langle w, x_t\rangle, y_t)\Big) > 0$$

with probability 1 with respect to the randomness of the oracles.

Another way to phrase this theorem is that the regret cannot vanish, if given examples sampled i.i.d. from a distribution, the learning problem is more complicated than just finding the mean of the data. Indeed, the adversary's strategy we choose later on is simply drawing and presenting examples from such a distribution. Below, we sketch how we use Thm. 5 in order to prove Thm. 3. A full proof is provided in Appendix E.
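The disjointness condition of Theorem 5 is easy to check numerically for a concrete loss. The sketch below is an illustration under assumptions chosen here, not a step of the proof: it uses the squared loss $\ell(a,1) = (a-1)^2$, a one-dimensional instance distribution that is uniform on $\{3, -1\}$ (so its mean is 1), the interval $W = [-5, 5]$, and a generic ternary search as the minimizer.

```python
def squared_loss(a):
    """l(a, 1) = (a - 1)^2: bounded below, with derivative -2 at a = 0."""
    return (a - 1.0) ** 2


def argmin_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2  # the minimizer lies in [lo, m2]
        else:
            lo = m1  # the minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)


support = [3.0, -1.0]                 # instance is 3 or -1 with equal probability
mean_x = sum(support) / len(support)  # equals 1.0


def expected_loss(w):
    """E[l(w * x, 1)] under the uniform distribution on the support."""
    return sum(squared_loss(w * x) for x in support) / len(support)


def loss_at_mean(w):
    """l(w * E[x], 1): the loss seen by a learner that only tracks the mean."""
    return squared_loss(w * mean_x)


w_risk = argmin_convex(expected_loss, -5.0, 5.0)
w_mean = argmin_convex(loss_at_mean, -5.0, 5.0)
```

Here the expected loss is $5w^2 - 2w + 1$, minimized at $w = 0.2$, while the loss at the mean is minimized at $w = 1$, so the two argmin sets are disjoint for this choice of loss and distribution.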

We construct a very simple one-dimensional distribution, which satisfies the conditions of Thm. 5: it is simply the uniform distribution on $\{3x, -x\}$, where $x$ is the vector $(1, 0, \ldots, 0)$. Thus, it is enough to show that

$$\operatorname*{argmin}_{w:\|w\|^2\le B_w}\ \ell(3w_1, 1) + \ell(-w_1, 1) \quad\text{and}\quad \operatorname*{argmin}_{w:\|w\|^2\le B_w}\ \ell(w_1, 1) \qquad (14)$$

are disjoint, for some appropriately chosen $B_w$. Assuming the contrary, then under the assumptions on $\ell$, we show that the first set in Eq. (14) is inside a bounded ball around the origin, in a way which is independent of $B_w$, no matter how large it is. Thus, if we pick $B_w$ to be large enough, and assume that the two sets in Eq. (14) are not disjoint, then there must be some $w$ such that both $\ell(3w_1, 1) + \ell(-w_1, 1)$ and $\ell(w_1, 1)$ have a subgradient of zero at $w$. However, this can be shown to contradict the assumptions on $\ell$, leading to the desired result.

6 Future Work

There are several interesting research directions worth pursuing in the noisy learning framework introduced here. One is doing away with unbiasedness, which could lead to the design of estimators that are applicable to more types of loss functions, for which unbiased estimators may not even exist. Also, it would be interesting to show how additional information one has about the noise distribution can be used to design improved estimates, possibly in association with specific losses or kernels. Another open question is whether our lower bound (Thm. 3) can be improved when nonlinear kernels are used.

References

[1] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In COLT, 2008.
[2] S. Bhandari and A. Bose. Existence of unbiased estimators in sequential binomial experiments. Sankhyā: The Indian Journal of Statistics, 52:127–130, 1990.
[3] N. Bshouty, J. Jackson, and C. Tamon. Uniform-distribution attribute noise learnability. Information and Computation, 187(2), 2003.
[4] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9), September 2004.
[5] N. Cesa-Bianchi, E. Dichterman, P. Fischer, E. Shamir, and H. Simon. Sample-efficient strategies for learning in the presence of noise. Journal of the ACM, 46(5):684–719, 1999.
[6] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[7] A. Flaxman, A. Tauman Kalai, and H. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of SODA, 2005.
[8] S. Goldman and R. Sloan. Can PAC learning algorithms tolerate random attribute noise? Algorithmica, 14:70–84, 1995.
[9] M. Kearns and M. Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 1993.
[10] N. Littlestone. Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow. In Proceedings of COLT, pages 147–156, 1991.
[11] D. Nettleton, A. Orriols-Puig, and A. Fornells. A study of the effect of different types of noise on the precision of supervised learning techniques. Artificial Intelligence Review, 2010.
[12] P. Bartlett, M. Jordan, and J. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006.
[13] L. Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.
[14] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[15] R. Singh. Existence of unbiased estimates. Sankhyā: The Indian Journal of Statistics, 26:93–96, 1964.
[16] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[17] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of ICML, 2003.

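A Alternative Notions of Regret

In the online setting, one may consider notions of regret other than (1). One choice is

$$\sum_{t=1}^{T}\ell(\langle w_t, \Psi(\tilde x_t)\rangle, y_t) - \min_{w\in W}\sum_{t=1}^{T}\ell(\langle w, \Psi(\tilde x_t)\rangle, y_t),$$

but this is too easy, as it reduces to standard online learning with respect to examples which happen to be noisy. Another kind of regret we may want to minimize is

$$\sum_{t=1}^{T}\ell(\langle w_t, \Psi(\tilde x_t)\rangle, y_t) - \min_{w\in W}\sum_{t=1}^{T}\ell(\langle w, \Psi(x_t)\rangle, y_t). \qquad (15)$$

This is the kind of regret which is relevant for actually predicting the values $y_t$ well based on the noisy instances. Unfortunately, in general this is too much to hope for. To see why, assume we deal with a linear kernel (so that $\Psi(x) = x$), and assume $\ell(\langle w, x\rangle, y) = (\langle w, x\rangle - y)^2$. Now, suppose that the adversary picks some $w^* \ne 0$ in $W$, which might even be known to the learner, and at each round $t$ provides the example $(w^*/\|w^*\|^2, 1)$. It is easy to verify that Eq. (15) in this case equals

$$\sum_{t=1}^{T}(\langle w_t, \tilde x_t\rangle - 1)^2 - 0.$$

Recall that the learner chooses $w_t$ before $\tilde x_t$ is revealed. Therefore, if the noise which leads to $\tilde x_t$ has positive variance, it will generally be impossible for the learner to choose $w_t$ such that $\langle w_t, \tilde x_t\rangle$ is arbitrarily close to 1. Therefore, the expression above cannot grow sub-linearly with $T$.

B Proof of Thm. 2

The analysis in this subsection is similar to the one of Subsection 5.2, focusing on Gaussian kernels. Namely, we assume here that the kernel $k(x,x')$ is equal to $e^{-\|x-x'\|^2/\sigma^2}$ for some $\sigma^2 > 0$.

We start by constructing an explicit feature mapping $\Psi(\cdot)$ corresponding to the RKHS induced by our kernel. For any $x, x'$, we have that

$$k(x,x') = e^{-\|x-x'\|^2/\sigma^2} = e^{-\|x\|^2/\sigma^2}e^{-\|x'\|^2/\sigma^2}e^{2\langle x,x'\rangle/\sigma^2} = e^{-\|x\|^2/\sigma^2}e^{-\|x'\|^2/\sigma^2}\sum_{n=0}^{\infty}\frac{(2\langle x,x'\rangle)^n}{\sigma^{2n}\,n!}$$
$$= e^{-\|x\|^2/\sigma^2}e^{-\|x'\|^2/\sigma^2}\sum_{n=0}^{\infty}\sum_{k_1=1}^{d}\cdots\sum_{k_n=1}^{d}\frac{(2/\sigma^2)^n}{n!}\,x_{k_1}\cdots x_{k_n}\,x'_{k_1}\cdots x'_{k_n}.$$

This suggests the following feature representation: for any $x$, $\Psi(x)$ returns an infinite-dimensional vector, indexed by $n$ and $k_1,\ldots,k_n \in \{1,\ldots,d\}$, with the entry corresponding to $n, k_1,\ldots,k_n$ being $e^{-\|x\|^2/\sigma^2}\sqrt{(2/\sigma^2)^n/n!}\;x_{k_1}\cdots x_{k_n}$. The dot product between $\Psi(x)$ and $\Psi(x')$ is similar to a standard dot product between two vectors, and by the derivation above equals $k(x,x')$ as required.

The idea of deriving an unbiased estimate of $\Psi(x)$ is the following: first, we sample $N_1, N_2$ independently according to $P(N_1 = n_1) = (p-1)/p^{n_1+1}$ and $P(N_2 = n_2) = (p-1)/p^{n_2+1}$. Then, we query the oracle for $x$ for $2N_1 + N_2$ times to get $\tilde x^{(1)},\ldots,\tilde x^{(2N_1+N_2)}$, and formally define $\tilde\Psi(x)$ as

$$\tilde\Psi(x) = \frac{(-1)^{N_1}\,p^{N_1+N_2+2}\,\sqrt{2^{N_2}}}{(p-1)^2\,N_1!\,\sqrt{N_2!}\;\sigma^{2N_1+N_2}}\Big(\prod_{j=1}^{N_1}\langle\tilde x^{(2j-1)}, \tilde x^{(2j)}\rangle\Big)\sum_{k_1,\ldots,k_{N_2}=1}^{d}\tilde x^{(2N_1+1)}_{k_1}\cdots\tilde x^{(2N_1+N_2)}_{k_{N_2}}\,e_{N_2,k_1,\ldots,k_{N_2}} \qquad (16)$$

where $e_{N_2,k_1,\ldots,k_{N_2}}$ represents the unit vector in the direction indexed by $N_2, k_1,\ldots,k_{N_2}$ as explained above. Since the oracle calls are i.i.d., it is not hard to verify that the expectation of the expression above is

$$\sum_{n_1=0}^{\infty}\sum_{n_2=0}^{\infty}\frac{(-1)^{n_1}\|x\|^{2n_1}}{n_1!\,\sigma^{2n_1}}\sqrt{\frac{2^{n_2}}{n_2!\,\sigma^{2n_2}}}\sum_{k_1,\ldots,k_{n_2}=1}^{d}x_{k_1}\cdots x_{k_{n_2}}\,e_{n_2,k_1,\ldots,k_{n_2}}$$
$$= \Big(\sum_{n_1=0}^{\infty}\frac{(-\|x\|^2/\sigma^2)^{n_1}}{n_1!}\Big)\sum_{n_2=0}^{\infty}\sqrt{\frac{(2/\sigma^2)^{n_2}}{n_2!}}\sum_{k_1,\ldots,k_{n_2}=1}^{d}x_{k_1}\cdots x_{k_{n_2}}\,e_{n_2,k_1,\ldots,k_{n_2}}$$
$$= \sum_{n_2=0}^{\infty}\sum_{k_1,\ldots,k_{n_2}=1}^{d}e^{-\|x\|^2/\sigma^2}\sqrt{\frac{(2/\sigma^2)^{n_2}}{n_2!}}\;x_{k_1}\cdots x_{k_{n_2}}\,e_{n_2,k_1,\ldots,k_{n_2}},$$

which is exactly $\Psi(x)$ as defined above.

To actually store $\tilde\Psi(x)$ in memory, we simply keep $N_1, N_2$ and $\tilde x^{(1)},\ldots,\tilde x^{(2N_1+N_2)}$. The representation above is used implicitly when we calculate dot products between $\tilde\Psi(x)$ and other elements in the RKHS, via the subroutine Prod. We formalize the needed properties of $\tilde\Psi(x)$ in the following lemma.

Lemma 5. Assuming the construction of $\tilde\Psi(x)$ as in the discussion above, it holds that $E[\tilde\Psi(x)] = \Psi(x)$ for all $x$. Moreover, if the noisy sample $\tilde x_t$ returned by the oracle $A_t$ satisfies $E[\|\tilde x_t\|^2] \le B_{\tilde x}$, then

$$E\big[\|\tilde\Psi(x_t)\|^2\big] \le \Big(\frac{p}{p-1}\Big)^2\exp\Big(\frac{\sqrt{p}\,B_{\tilde x} + 2p\,B_{\tilde x}}{\sigma^2}\Big).$$

Proof. The first part of the lemma follows from the discussion above. As to the second part, note that by (16), we have that

$$\|\tilde\Psi(x_t)\|^2 = \frac{p^{2N_1+2N_2+4}\,2^{N_2}}{(p-1)^4\,(N_1!)^2\,N_2!\;\sigma^{4N_1+2N_2}}\Big(\prod_{j=1}^{N_1}\langle\tilde x^{(2j-1)}, \tilde x^{(2j)}\rangle\Big)^2\sum_{k_1,\ldots,k_{N_2}=1}^{d}\big(\tilde x^{(2N_1+1)}_{k_1}\cdots\tilde x^{(2N_1+N_2)}_{k_{N_2}}\big)^2$$
$$= \frac{p^{2N_1+2N_2+4}\,2^{N_2}}{(p-1)^4\,(N_1!)^2\,N_2!\;\sigma^{4N_1+2N_2}}\prod_{j=1}^{N_1}\langle\tilde x^{(2j-1)}, \tilde x^{(2j)}\rangle^2\prod_{j=1}^{N_2}\|\tilde x^{(2N_1+j)}\|^2,$$

and its expectation conditioned on $N_1, N_2$ is at most

$$\frac{p^{2N_1+2N_2+4}\,2^{N_2}}{(p-1)^4\,(N_1!)^2\,N_2!\;\sigma^{4N_1+2N_2}}\,B_{\tilde x}^{2N_1}B_{\tilde x}^{N_2}.$$

The expectation of this expression over $N_1, N_2$ is equal to

$$\Big(\frac{p}{p-1}\Big)^2\sum_{n_1=0}^{\infty}\frac{(\sqrt{p}\,B_{\tilde x}/\sigma^2)^{2n_1}}{(n_1!)^2}\sum_{n_2=0}^{\infty}\frac{(2p\,B_{\tilde x}/\sigma^2)^{n_2}}{n_2!}.$$

Evaluating the sum over $n_2$ as an exponential series, and bounding the sum over $n_1$ by an exponential as well, gives the bound stated in the lemma.

After the discussion above, the pseudocode for Map_Estimate below should be self-explanatory.

Subroutine 5: Map_Estimate($A_t$, $p$)
  Sample $N_1$ according to $P(N_1 = n_1) = (p-1)/p^{n_1+1}$
  Sample $N_2$ according to $P(N_2 = n_2) = (p-1)/p^{n_2+1}$
  Query $A_t$ for $2N_1 + N_2$ times to get $\tilde x^{(1)}_t,\ldots,\tilde x^{(2N_1+N_2)}_t$
  Return $\tilde x^{(1)}_t,\ldots,\tilde x^{(2N_1+N_2)}_t$ as $\tilde\Psi(x_t)$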
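The role of the $N_1$ pairs of noisy copies is to estimate the scalar factor $e^{-\|x\|^2/\sigma^2}$ through its Taylor series. The sketch below isolates that part of the construction; it is one reading of the estimator, with hypothetical function names, and the sanity check covers only the special case of a noise-free oracle, where the expectation over $N_1$ can be summed exactly.

```python
import math


def dot(u, v):
    """Standard inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def pmf(n, p):
    """P(N = n) = (p - 1) / p**(n + 1)."""
    return (p - 1.0) / p ** (n + 1)


def norm_factor_estimate(pairs, p, sigma2):
    """Estimate exp(-||x||^2 / sigma2) from N1 pairs of noisy copies.

    The coefficient is (-1)^N1 * p^(N1+1) / ((p-1) * N1! * sigma2^N1),
    multiplied by the inner product within each pair, whose expectation
    is ||x||^2 when the two copies carry independent zero-mean noise.
    """
    n1 = len(pairs)
    value = (-1.0) ** n1 * p ** (n1 + 1)
    value /= (p - 1.0) * math.factorial(n1) * sigma2 ** n1
    for a, b in pairs:
        value *= dot(a, b)
    return value


def expected_noiseless(x, p, sigma2, max_n=60):
    """Sum the estimate over N1 exactly, assuming every copy equals x."""
    return sum(
        pmf(n, p) * norm_factor_estimate([(x, x)] * n, p, sigma2)
        for n in range(max_n)
    )


x = [1.0, 0.5]
sigma2 = 2.0
target = math.exp(-dot(x, x) / sigma2)
approx = expected_noiseless(x, 1.5, sigma2)
```

Using two independent copies per factor is what keeps the estimate of $\|x\|^2$ unbiased: squaring a single noisy copy would add the noise variance to each factor.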

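We now turn to the subroutine Prod, which given two elements in the RKHS, returns their dot product. This subroutine comes in two flavors: either as a procedure defined over $\tilde\Psi(x), \tilde\Psi(x')$ and returning $\langle\tilde\Psi(x), \tilde\Psi(x')\rangle$ (Subroutine 6); or as a procedure defined over $\tilde\Psi(x), x'$ (Subroutine 7, where the second element is an explicitly given vector) and returning $\langle\tilde\Psi(x), \Psi(x')\rangle$. This second variant of Prod is needed when we wish to apply the hypothesis on a new known instance $x'$.

Subroutine 6: Prod($\tilde\Psi(x)$, $\tilde\Psi(x')$)
  Let $x^{(1)},\ldots,x^{(2n_1+n_2)}$ be the vectors comprising $\tilde\Psi(x)$
  Let $x'^{(1)},\ldots,x'^{(2n'_1+n'_2)}$ be the vectors comprising $\tilde\Psi(x')$
  If $n_2 \ne n'_2$ return 0, else return
  $$\frac{(-1)^{n_1+n'_1}\,p^{n_1+n'_1+2n_2+4}\,2^{n_2}}{(p-1)^4\,n_1!\,n'_1!\,n_2!\;\sigma^{2n_1+2n'_1+2n_2}}\Big(\prod_{j=1}^{n_1}\langle x^{(2j-1)}, x^{(2j)}\rangle\Big)\Big(\prod_{j=1}^{n'_1}\langle x'^{(2j-1)}, x'^{(2j)}\rangle\Big)\Big(\prod_{j=1}^{n_2}\langle x^{(2n_1+j)}, x'^{(2n'_1+j)}\rangle\Big)$$

The proof of the following lemma is a straightforward algebraic exercise, similar to the proof of Lemma 4.

Lemma 6. Prod($\tilde\Psi(x)$, $\tilde\Psi(x')$) returns $\langle\tilde\Psi(x), \tilde\Psi(x')\rangle$.

The pseudocode for calculating the dot product $\langle\tilde\Psi(x), \Psi(x')\rangle$ (where $x'$ is known) is very similar, and the proof is essentially the same.

Subroutine 7: Prod($\tilde\Psi(x)$, $x'$)
  Let $x^{(1)},\ldots,x^{(2n_1+n_2)}$ be the vectors comprising $\tilde\Psi(x)$
  Return
  $$\frac{(-1)^{n_1}\,p^{n_1+n_2+2}\,2^{n_2}}{(p-1)^2\,n_1!\,n_2!\;\sigma^{2n_1+2n_2}}\;e^{-\|x'\|^2/\sigma^2}\Big(\prod_{j=1}^{n_1}\langle x^{(2j-1)}, x^{(2j)}\rangle\Big)\Big(\prod_{j=1}^{n_2}\langle x^{(2n_1+j)}, x'\rangle\Big)$$

We are now ready to prove Thm. 2. First, regarding the expected number of queries, notice that to run Algorithm 1, we invoke Map_Estimate and Grad_Length_Estimate once at round $t$. Map_Estimate uses a random number $2B_1 + B_2$ of queries, where $B_1, B_2$ are independent and distributed as $P(B_1 = n) = P(B_2 = n) = (p-1)/p^{n+1}$. Grad_Length_Estimate invokes Map_Estimate a random number $C$ of times, where $P(C = n) = (p-1)/p^{n+1}$. The total number of queries is therefore $\sum_{j=1}^{C+1}(2B_{j,1} + B_{j,2})$, where $B_{j,1}, B_{j,2}$ are i.i.d. copies of $B_1, B_2$ respectively. The expected value of this expression, using a standard result on the expected value of a sum of a random number of random variables, is equal to $(1 + E[C])(2E[B_{j,1}] + E[B_{j,2}])$, or $3p/(p-1)^2$.

In terms of running time, the analysis is completely identical to the one performed in the proof of Thm. 1, and the expected running time is the same up to constants.

The regret bound in the theorem follows from Thm. 4, with the expressions for constants following from Lemma 2, Lemma 5, and Lemma 6.

C Proof of Examples 3 and 4

Examples 3 and 4 use the error function $\mathrm{Erf}(a)$ in order to construct smooth approximations of the hinge loss and the absolute loss (see Fig. 1). The error function is useful for our purposes, since it is analytic in all of $\mathbb{R}$, and smoothly interpolates between $-1$ for $a \ll 0$ and $1$ for $a \gg 0$. Thus, it can be used to approximate the derivative of losses which are piecewise linear, such as the hinge loss $\ell(a) = \max\{0, 1-a\}$ and the absolute loss $\ell(a) = |a|$.

To approximate the absolute loss, we use the antiderivative of $\mathrm{Erf}(sa)$. This function represents a smooth upper bound on the absolute loss, which becomes tighter as $s$ increases. It can be verified that the antiderivative (with the constant free parameter fixed so the function has the desired behavior) is

$$\ell(a) = a\,\mathrm{Erf}(sa) + \frac{e^{-s^2a^2}}{s\sqrt{\pi}}.$$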
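These properties of the smoothed absolute loss can be checked numerically. The following is a small sketch using Python's standard `math.erf`; the function names, the grid of test points, and the finite-difference step are illustrative choices, and the formula is written with the smoothing parameter $s$ in the denominator, which is the form whose derivative is exactly $\mathrm{Erf}(sa)$.

```python
import math


def smoothed_abs(a, s):
    """a * Erf(s a) + exp(-s^2 a^2) / (s sqrt(pi))."""
    return a * math.erf(s * a) + math.exp(-(s * a) ** 2) / (s * math.sqrt(math.pi))


def numeric_derivative(f, a, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(a + h) - f(a - h)) / (2.0 * h)


grid = [i / 10.0 for i in range(-30, 31)]

# Gap between the smooth loss and |a| for two smoothing levels.
gaps_s1 = [smoothed_abs(a, 1.0) - abs(a) for a in grid]
gaps_s5 = [smoothed_abs(a, 5.0) - abs(a) for a in grid]

# The derivative of the smooth loss should match Erf(s a).
deriv_check = numeric_derivative(lambda a: smoothed_abs(a, 2.0), 0.3)
deriv_exact = math.erf(2.0 * 0.3)
```

The gap is largest at $a = 0$, where it equals $1/(s\sqrt{\pi})$, and shrinks as $s$ grows, which matches the claim that the upper bound becomes tighter as $s$ increases.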

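While this loss function may seem to have a slightly complex form, we note that our algorithm only needs to calculate the derivative of this loss function at various points (namely $\mathrm{Erf}(sa)$ for various values of $a$), which can be easily done. By a Taylor expansion of the error function, we have that

$$\ell'(a) = \frac{2}{\sqrt{\pi}}\sum_{n=0}^{\infty}\frac{(-1)^n(sa)^{2n+1}}{n!\,(2n+1)}.$$

Therefore, $\ell'_+(a)$ in this case is at most

$$\frac{2}{\sqrt{\pi}}\sum_{n=0}^{\infty}\frac{(sa)^{2n+1}}{n!\,(2n+1)} \le \frac{2}{\sqrt{\pi}}\sum_{n=0}^{\infty}\frac{(sa)^{2n+1}}{n!} = \frac{2as}{\sqrt{\pi}}\,e^{s^2a^2}.$$

We now turn to deal with Example 4. This time, we use the antiderivative of $(\mathrm{Erf}(s(a-1)) - 1)/2$. This function smoothly interpolates between $-1$ for $a \ll 1$ and $0$ for $a \gg 1$. Therefore, its antiderivative with respect to $a$ represents a smooth upper bound on the hinge loss, which becomes tighter as $s$ increases. It can be verified that the antiderivative (with the constant free parameter fixed so the function has the desired behavior) is

$$\ell(a) = \frac{(a-1)\big(\mathrm{Erf}(s(a-1)) - 1\big)}{2} + \frac{e^{-s^2(a-1)^2}}{2\sqrt{\pi}\,s}.$$

By a Taylor expansion of the error function, we have that

$$\ell'(a) = -\frac{1}{2} + \frac{1}{\sqrt{\pi}}\sum_{n=0}^{\infty}\frac{(-1)^n(s(a-1))^{2n+1}}{n!\,(2n+1)}.$$

Thus, $\ell'_+(a)$ in this case can be upper bounded by

$$\frac{1}{2} + \frac{1}{\sqrt{\pi}}\sum_{n=0}^{\infty}\frac{(s(a-1))^{2n+1}}{n!\,(2n+1)} \le \frac{1}{2} + \frac{1}{\sqrt{\pi}}\sum_{n=0}^{\infty}\frac{(s(a-1))^{2n+1}}{n!} = \frac{1}{2} + \frac{s(a-1)}{\sqrt{\pi}}\,e^{s^2(a-1)^2}.$$

D Proof of Thm. 4

Our algorithm corresponds to Zinkevich's algorithm [17] in a finite horizon setting, where we assume the sequence of examples is $\tilde g_1\tilde\Psi(x_1),\ldots,\tilde g_T\tilde\Psi(x_T)$, the cost function is linear, and the learning rate at round $t$ is $\eta/\sqrt{T}$. By a straightforward adaptation of the standard regret bound for that algorithm (see [17]), we have that for any $w$ such that $\|w\|^2 \le B_w$,

$$\sum_{t=1}^{T}\langle w_t, \tilde g_t\tilde\Psi(x_t)\rangle - \sum_{t=1}^{T}\langle w, \tilde g_t\tilde\Psi(x_t)\rangle \le \frac{1}{2}\Big(\frac{B_w}{\eta} + \eta\,\frac{1}{T}\sum_{t=1}^{T}\|\tilde g_t\tilde\Psi(x_t)\|^2\Big)\sqrt{T}.$$

We now take expectation of both sides in the inequality above. The expectation of the right-hand side is simply

$$E\Big[\frac{1}{2}\Big(\frac{B_w}{\eta} + \eta\,\frac{1}{T}\sum_{t=1}^{T}\|\tilde g_t\tilde\Psi(x_t)\|^2\Big)\sqrt{T}\Big] = \frac{1}{2}\Big(\frac{B_w}{\eta} + \eta\,\frac{1}{T}\sum_{t=1}^{T}E\big[E_t[\tilde g_t^2]\,E_t[\|\tilde\Psi(x_t)\|^2]\big]\Big)\sqrt{T} \le \frac{1}{2}\Big(\frac{B_w}{\eta} + \eta\,B_{\tilde g}B_{\tilde\Psi}\Big)\sqrt{T}.$$

As to the left-hand side, note that

$$E\Big[\sum_{t=1}^{T}\langle w_t, \tilde g_t\tilde\Psi(x_t)\rangle\Big] = E\Big[\sum_{t=1}^{T}E_t\big[\langle w_t, \tilde g_t\tilde\Psi(x_t)\rangle\big]\Big] = E\Big[\sum_{t=1}^{T}\langle w_t,\ y_t\,\ell'(y_t\langle w_t, \Psi(x_t)\rangle)\,\Psi(x_t)\rangle\Big].$$

Also,

$$E\Big[\sum_{t=1}^{T}\langle w, \tilde g_t\tilde\Psi(x_t)\rangle\Big] = E\Big[\sum_{t=1}^{T}\langle w,\ y_t\,\ell'(y_t\langle w_t, \Psi(x_t)\rangle)\,\Psi(x_t)\rangle\Big].$$

Plugging in these expectations and choosing $\eta = \sqrt{B_w/(B_{\tilde g}B_{\tilde\Psi})}$, we get that for any $w$ such that $\|w\|^2 \le B_w$,

$$E\Big[\sum_{t=1}^{T}\langle w_t,\ y_t\,\ell'(y_t\langle w_t, \Psi(x_t)\rangle)\,\Psi(x_t)\rangle - \sum_{t=1}^{T}\langle w,\ y_t\,\ell'(y_t\langle w_t, \Psi(x_t)\rangle)\,\Psi(x_t)\rangle\Big] \le \sqrt{B_w B_{\tilde g}B_{\tilde\Psi}\,T}.$$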
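The choice of $\eta$ in the last step can be spelled out as a one-line optimization. The derivation below is a worked sketch, assuming the expected right-hand side has the form $\frac{1}{2}\big(B_w/\eta + \eta\,B_{\tilde g}B_{\tilde\Psi}\big)\sqrt{T}$; the shorthand $f(\eta)$ and $\eta^*$ are introduced here for illustration.

```latex
\begin{align*}
f(\eta) &= \frac{1}{2}\Big(\frac{B_w}{\eta} + \eta\, B_{\tilde g} B_{\tilde\Psi}\Big)\sqrt{T}, \\
f'(\eta) &= \frac{1}{2}\Big(-\frac{B_w}{\eta^2} + B_{\tilde g} B_{\tilde\Psi}\Big)\sqrt{T} = 0
  \;\Longleftrightarrow\; \eta^* = \sqrt{\frac{B_w}{B_{\tilde g} B_{\tilde\Psi}}}, \\
f(\eta^*) &= \frac{1}{2}\Big(\sqrt{B_w B_{\tilde g} B_{\tilde\Psi}} + \sqrt{B_w B_{\tilde g} B_{\tilde\Psi}}\Big)\sqrt{T}
  = \sqrt{B_w B_{\tilde g} B_{\tilde\Psi}\, T}.
\end{align*}
```

Since $f$ is convex on $\eta > 0$, this stationary point is the minimizer, so the two terms of the bound are balanced at $\eta^*$.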
