Tighter Bounds for Multi-Armed Bandits with Expert Advice


H. Brendan McMahan and Matthew Streeter
Google, Inc.
Pittsburgh, PA 15213, USA

Abstract

Bandit problems are a classic way of formulating exploration versus exploitation tradeoffs. Auer et al. [ACBFS02] introduced the EXP4 algorithm, which explicitly decouples the set of A actions which can be taken in the world from the set of M experts (general strategies for selecting actions) with which we wish to be competitive. Auer et al. show that EXP4 has expected cumulative regret bounded by $O(\sqrt{TA \log M})$, where T is the total number of rounds. This bound is attractive when the number of actions is small compared to the number of experts, but poor when the situation is reversed. In this paper we introduce a new algorithm, similar in spirit to EXP4, which has a bound of $O(\sqrt{TS \log M})$. The S parameter measures the extent to which expert recommendations agree; we always have $S \le \min\{A, M\}$. We discuss practical applications that arise in the contextual bandits setting, including sponsored search keyword advertising. In these problems, common context means many actions are irrelevant on any given round, and so $S \ll \min\{A, M\}$, implying our bounds offer a significant improvement. The key to our new algorithm is a linear-programming-based exploration strategy that is optimal in a certain sense. In addition to proving tighter bounds, we run experiments on real-world data from an online advertising problem, and demonstrate that our refined exploration strategy leads to significant improvements over known approaches.

1 Introduction

The various formulations of the k-armed bandit problem provide clean frameworks for analyzing tradeoffs between exploration and exploitation, and hence have seen extensive attention from researchers in a variety of fields. A bandit problem takes place over a series of rounds. On each round t, the algorithm selects some action $\hat{a}_t \in \mathcal{A}$ to be executed in the world. After $\hat{a}_t$ is chosen, reward $r_t(\hat{a}_t)$ is obtained and observed by the algorithm. The goal is to maximize the sum of rewards $\sum_t r_t(\hat{a}_t)$. In this paper we adopt the nonstochastic viewpoint [footnote 1]: we make no assumptions about the source of rewards, and so seek bounds that hold for arbitrary sequences of reward vectors (we assume each reward is in [0, 1]). It is not possible to make any guarantees about cumulative reward (for example, we might face a sequence of vectors where every action on every round gets reward 0).
Instead, algorithms for this problem bound performance in terms of regret, the difference between the algorithm's cumulative reward and the reward achieved by the best fixed action. Such nonstochastic assumptions are justified in changing worlds, where the past performance of actions may not be indicative of their future rewards.

In many real-world problems, however, it is not appropriate to compare ourselves to the performance of the best fixed action: for example, suppose the actions are advertisements that could be shown in response to queries on a search engine. Any single ad will have terrible performance if shown for all queries, and so treating this as a single multi-armed bandit problem would provide extremely weak guarantees. Instead, our approach decouples the actions $\mathcal{A}$ that can be taken in the world from the set of strategies (experts) $\mathcal{M}$ with which we wish to be competitive. This approach is not new to this paper: the EXP4 algorithm of [ACBFS02] addresses this problem. However, the bounds for that algorithm are only useful when the set of strategies $\mathcal{M}$ is larger than $\mathcal{A}$. We propose and analyze a new algorithm for this problem which addresses this issue.

For an algorithm to perform as well as the best expert from $\mathcal{M}$, it must implicitly estimate the cumulative reward obtained by each expert. If experts often agree on the actions they recommend, intuitively this estimation problem should become easier; however, current bounds for the problem do not reflect this. We propose a new algorithm, NEXP (the N is for Nonuniform exploration), which solves a linear program to select a distribution on actions that offers a locally optimal (with respect to our analysis) balance of exploration and exploitation. This algorithm has a bound of $O(\sqrt{GS \log M})$ on total regret, where $M = |\mathcal{M}|$, $A = |\mathcal{A}|$, G is a bound on the best expert's cumulative reward (for example, G = T), and S is a parameter that measures the extent to which the experts' recommendations agree. Importantly, $S \le \min\{A, M\}$, and for some problems (such as the sponsored search advertising problem mentioned above) it can be many orders of magnitude smaller.

Footnote 1: The stochastic version, where the reward of each action is drawn i.i.d. from some distribution (unknown to the algorithm) on each round, has also been extensively studied. Lai and Robbins [LR85] is the foundational paper.

    Alg     Bound                        Example    Reference
    EXP3    $2.63\sqrt{GM \log M}/T$                [ACBFS02]
    EXP4    $2.63\sqrt{GA \log M}/T$     2.78       [ACBFS02]
    NEXP    $2.63\sqrt{GS \log M}/T$                this paper

Table 1: Bounds for the bandit problem with expert advice with A actions and M experts. G is a bound on the reward of the best expert. The parameter S, introduced in this paper, satisfies $S \le \min\{A, M\}$, and is often much less. To make these bounds concrete, the Example column shows the bound on expected regret per round for A = 10000, M = 1000, S = 20, and T = G = 100,000. Note that the bound for EXP4 is vacuous. EXP3 directly selects experts, without using the structure induced by the actions.

The paper is organized as follows: Section 2 completes the formal statement of the problem, defines notation, and compares the bounds for our algorithm to previously published results. Section 3 introduces several real-world instances of this setting where the tighter bounds proved can have significant practical impact. Section 4 summarizes related work. In Section 5 we introduce our algorithm and present and prove bounds. Section 6 presents experiments.

2 Preliminaries

An instance $I = (\mathcal{A}, r, e)$ of the bandit problem with expert advice is defined by a sequence of reward vectors $r_t$ that is fixed in advance, with rewards bounded in [0, 1], that is $r_t : \mathcal{A} \to [0, 1]$. The bandit algorithm has access to the recommendations of M experts from a set $\mathcal{M}$. Each expert i suggests a probability distribution $e_{i,t}$ over actions on each round t. These recommendations must be fixed (though not necessarily known to the algorithm) in advance, to the extent that they do not depend on the actions selected on earlier rounds. We discuss the ramifications of this assumption in the next section. Our goal is to construct a randomized algorithm that on each round proposes a distribution $p_t$ over the actions in such a way that our cumulative regret is small. In order to formalize the notion of regret, let $G_i$ be a random variable (with respect to the distributions $e_{i,t}$) giving the performance of the i-th expert on a fixed problem instance I. Then, taking the expectation with respect to the draws from $e_{i,t}$,

$$E[G_i] = \sum_{t=1}^{T} \sum_{a \in \mathcal{A}} e_{i,t}(a)\, r_t(a) = \sum_{t=1}^{T} e_{i,t} \cdot r_t$$

is the expected performance of the i-th expert on I.
We then define $G_{OPT} = \max_i \{E[G_i]\}$ and $G_{ALG} = \sum_{t=1}^{T} r_t(\hat{a}_t)$, where $\hat{a}_t$ is the action chosen by the algorithm on round t. Then, Regret $= G_{OPT} - G_{ALG}$. The cumulative reward $G_{ALG}$ of the algorithm is a random variable (with distribution dependent on the randomization used by the algorithm in choosing the $\hat{a}_t$), and so Regret is also a random variable. Since only $r_t(\hat{a}_t)$ is observed, even post-hoc we will not be able to exactly calculate our regret as in the experts setting where $r_t$ is fully observable [footnote 2]; instead we can bound E[Regret], the algorithm's expected regret. Unless otherwise stated, expectations are with respect to any internal randomness of the bandit algorithm. In the case where experts make probabilistic recommendations $e_{i,t}$, this expectation may implicitly include draws from these distributions, depending on whether the given algorithm internally samples from these distributions. The notation used in this paper is summarized in Table 2. We use subscripts for time, but sometimes omit them when referring to time t, so $w_i$ is $w_{i,t}$, implicitly.

Our Main Result. We can now state the main theoretical result of this paper. Define $s_t = \sum_a \max_i \{e_{i,t}(a)\}$. Observe that $s_t \le A$, and also $s_t \le \sum_a \sum_i e_{i,t}(a) = M$. If all the experts recommend the same distribution, then $s_t = 1$. If all experts are deterministic, then $s_t$ is the number of distinct actions recommended by the experts. Our main theorem (stated fully as Theorem 3) shows that the NEXP algorithm we introduce, when run with appropriate parameters, satisfies

$$E[\text{Regret}] \le 2.63 \sqrt{SG \ln M} \qquad (1)$$

where $S = \max_t \{s_t\} \le \min\{A, M\}$. The dependence on the maximum of $s_t$ over all rounds is perhaps unsatisfying: suppose on a single round all the experts are confused and each puts probability 1.0 on a separate action: suddenly S = M, and our bound is no better than EXP3, even if on every other round the experts agree entirely; clearly a tighter bound should be possible in this case. Under mild assumptions on the rewards of the problem, we strengthen our main theorem to handle this situation, replacing S in Equation (1) with $\bar{S}$, the harmonic mean of the $n_s$ largest $s_t$; the precise statement is given as Theorem 6.

Comparison to Previous Bounds. The first algorithm proposed for the bandit problem with expert advice was EXP4 [ACBFS02], which bounds the expected regret by $O(\sqrt{GA \log M})$, where G is an upper bound on $G_{OPT}$. Since rewards are in [0, 1], one can always use G = T.
Dividing the bound by T shows that the per-round regret goes to zero as $T \to \infty$, and so this is a no-regret algorithm. This bound is good when the number of actions is small and the number of experts is large.

What if the number of actions is much larger than the number of experts? If we have deterministic experts that recommend a single action (e.g., $e_{i,t}(a_{i,t}) = 1$), then we can construct a reward vector $r'_t$ directly on experts, where $r'_t(i) = e_{i,t} \cdot r_t = r_t(a_{i,t})$. Now we apply a standard nonstochastic multi-armed bandit algorithm (say, EXP3, also from [ACBFS02]) where the arms of the bandit problem are the deterministic experts from the original bandit problem with expert advice. This gives a bound of $O(\sqrt{GM \log M})$ on expected regret, which is better than the bound of EXP4 when A > M. If the recommendations $e_{i,t}$ are stochastic, $e_{i,t} \cdot r_t$ is not fully observable, but a method based on sampling deterministic experts from the stochastic ones can be applied to obtain bounds on expected regret.

Intuitively, the extra information provided by observing the recommendations of each expert should only make the problem easier, but in the case of large numbers of actions EXP4 actually has worse bounds than EXP3 (and significantly worse performance in our experiments as well). Our new algorithm resolves this deficiency. In fact, our bounds remain unchanged even if an entirely different set of actions are recommended on each round or if actions are arbitrarily re-indexed on each round. This makes clear that the only value in knowing the experts' recommendations (rather than, say, just being able to blindly follow the experts' advice) comes from their use in correlating the performance of the experts, making estimating each expert's performance easier. In the full-information case (i.e., the full reward vector $r_t$ is observed on each round), the actions can be effectively ignored, as the exact expected performance of each expert can be computed directly, and so a standard full-information algorithm like Hedge can be applied directly to the experts and performs essentially optimally.

These bounds are summarized in Table 1, along with some concrete average per-round regret numbers; the parameters for the example were chosen to be reasonable both in terms of computation and data, but show that both EXP3 and EXP4 might perform poorly.

Footnote 2: [FS95] describes the fully-observable experts setting in detail; for a comparison of the fully and partially observable cases, see [DHK07].

[Figure 1 diagram: experts $m_1, m_2, m_3, m_4, m_5$ on one side; actions/ads "Buy Pet Lizards", "1-800-Roses", "Digital Camera Supply", "Best Local Florists", "Cheap MP3 Players", "Discount Climbing Gear" on the other.]

Figure 1: The ad selection bandit problem with expert advice. Each expert corresponds to a deterministic function $m_i$ mapping queries to a recommended ad to show for the query. While the set $\mathcal{A}$ of all ads can be very large, given knowledge of the query many of the schemes $m_i$ are likely to suggest the same action/ad. Our algorithm leverages this fact to achieve sharper regret bounds. In this example round (considering only the experts and actions shown), we have M = 5, A = 6, and $s_t = 2$.
As the gap between S and $\min\{A, M\}$ grows large, there are problems where both EXP3 and EXP4 will provide vacuous guarantees, but NEXP's bounds will be quite tight.

    Problem Statement
      $\mathcal{A}$      set of actions, $A = |\mathcal{A}|$
      $\mathcal{M}$      set of experts, $M = |\mathcal{M}|$
      $a$                generic action
      $\hat{a}_t$        action played by the algorithm on round t
      $T$                total number of rounds
      $t$                index of a round
      $r_t$              reward vector on actions
      $e_{i,t}(a)$       expert i's recommended probability on a on round t
      $s_t$              $\sum_a \max_i \{e_{i,t}(a)\}$
    Algorithm
      $p$                distribution on actions executed in world
      $\bar{p}$          ideal (non-exploration) distribution on actions
      $q$                distribution on experts
      $w$                weights on experts
      $W$                sum of weights
    Bounds and Analysis Variables
      $G_{OPT}$          cumulative reward of best expert in hindsight
      $G_i$              cumulative reward of the i-th expert
      $G_{ALG}$          cumulative reward of bandit algorithm
      $S$                bound on $s_t$, $S \ge \max_t \{s_t\}$
      $z_t$              bound on $\bar{p}_t(a)/p_t(a)$ for all a
      $Z$                bound on $z_t$, $Z \ge \max_t \{z_t\}$
      $G$                bound on $G_{OPT}$

Table 2: Summary of notation.

3 Applications

While the improvement in bounds for bandit problems with expert advice is interesting from a purely theoretical point of view, we also believe that many (perhaps even most) real-world bandit problems are better framed as bandit problems with expert advice. To support this claim, we consider several motivating problem domains where expert advice is particularly useful and the new algorithms introduced in this paper are particularly advantageous.

Search Engine Keyword Advertising. The problem of selecting and pricing ads to be shown alongside Internet search engine queries has received a great deal of attention lately, for example [RCKU08, Var07, GP07, WVLL07, PO07]. We can effectively apply our algorithm to this problem as follows. Consider a set of M different schemes for determining which ads to show on a particular query. Let $\mathcal{A}$ be the total set of advertisements available to be shown across all possible queries, and let X be the set of possible queries. Each (deterministic) expert i is associated with a function $m_i : X \to \mathcal{A}$ (for simplicity, we assume we show one ad per query). Clearly, the set $\mathcal{A}$ may be extremely large, and the EXP4 regret bound will likely be vacuous. Further, the set $\mathcal{M}$ may also be very large: suppose, for example, we have a family of schemes for showing ads that are parameterized by some vector $\theta \in \Theta$; we might construct the set $\mathcal{M}$ by discretizing $\Theta$. If $\Theta$ has even moderately high dimension, the square-root dependence on M from EXP3 will be prohibitive.
However, here we see that the structure induced by the context can be used to our advantage: for many queries, only a small number of ads will be relevant (see Figure 1).
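To make the agreement parameter concrete, the following is a minimal Python sketch that computes $s_t = \sum_a \max_i \{e_{i,t}(a)\}$ for a single round. The function name `agreement_s` and the toy recommendations are illustrative assumptions, not part of the paper; the first example is chosen to mirror the round in Figure 1, where five deterministic experts recommend only two distinct ads out of six.

```python
def agreement_s(expert_dists):
    """Return s = sum over actions of the largest probability any expert assigns.

    expert_dists: list of M lists, each a probability distribution over A actions.
    """
    num_actions = len(expert_dists[0])
    return sum(max(dist[a] for dist in expert_dists) for a in range(num_actions))


# Hypothetical round: 5 deterministic experts, 6 ads, only ads 1 and 3 are recommended.
recommended = [1, 1, 3, 1, 3]
round_dists = [[1.0 if a == rec else 0.0 for a in range(6)] for rec in recommended]

s_figure = agreement_s(round_dists)             # distinct recommended ads
s_same = agreement_s([[0.5, 0.5, 0.0]] * 4)     # every expert recommends the same distribution
```

For deterministic experts the result is the number of distinct recommended actions, and when all experts recommend the same distribution it is 1, matching the two special cases stated in Section 2.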

If all of the schemes $m_i$ are basically in agreement for most queries, then our job of selecting the best one should become much easier. This is exactly the intuition that our algorithm captures. We apply our algorithm to a problem from this setting in Section 6, and demonstrate substantial empirical improvements.

Online Choice of Active Learning Algorithms. Baram et al. proposed using EXP4 to dynamically choose among several active learning algorithms in the pool-based active learning setting [BEYL04]. They empirically evaluate their approach using EXP4 with M = 3 active learning algorithms; on each round each algorithm suggests an example from a pool $\mathcal{A}$ of unlabeled examples which it would like to have labeled (the size of this pool ranged from 25 to 8300 in their experiments). Ideally, the reward associated with labelling a particular example should be the differential improvement in generalization error gained by having access to the label. This is not generally available, so [BEYL04] introduced the Classification Entropy Maximization (CEM) heuristic, and used it to assign a reward for the example labelled. They show empirically that this is quite effective, and further that their approach does a remarkably good job of tracking the best expert (individual active learning algorithm) with almost no regret. In fact, on some problems their combined approach outperforms any of the individual experts (that is, achieves negative regret) [footnote 3].

The Contextual Bandits Problem. The above applications can be viewed as examples of contextual multi-armed bandit problems [LZ07] (also known as bandits with side information). In this setting, on each round t the algorithm has access to a vector $x_t \in X$ that provides context (side information) for the current round, which may be used in determining which action to take. Formally, an instance of the nonstochastic contextual bandits problem $I = (\mathcal{A}, r, x)$ is given by a sequence $(x_t, r_t)$, where the side information $x_t \in X$ is observed by the algorithm before $\hat{a}_t$ is chosen and reward $r_t(\hat{a}_t)$ is obtained and observed. Rather than try to leverage the context information $x_t$ in a general-purpose way that is applicable to arbitrary instances $I = (\mathcal{A}, r, x)$, we consider introducing domain-specific experts that know how to make use of the side information, but then ignore the side information in the master algorithm: this transforms the contextual bandit problem to a bandit problem with expert advice.

A domain-specific scheme for incorporating the context can be viewed as an expert template $\tilde{e}_i : X \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the set of probability distributions over actions. Using these templates, given an instance $I = (\mathcal{A}, r, x)$ of the contextual bandits problem we construct an instance $I' = (\mathcal{A}, r, e)$ of the bandit problem with expert advice by setting $e_{i,t} = \tilde{e}_i(x_t)$ for all i and t. This is the approach implicitly taken in the applications just discussed. When this transformation is applied and the context is actually highly relevant (as is the query in determining which ads to show), it is likely that the experts' access to the common $x_t$ will cause S to be much smaller than A, and so the approach taken in this paper will be particularly beneficial.

History-Dependent Experts. There is a subtle issue with the application of EXP4 (or our algorithms) to many practical problems, including the online choice of active learning algorithms. In this domain, the recommendations $e_{i,t}$ naturally depend on the observations and context from the previous rounds. In particular, the recommendations of the active learning algorithms depend on which examples have previously been labelled. In a contextual bandits problem, some experts might be online learning algorithms that continue to train on the newly labelled examples $(x_t, r_t(\hat{a}_t))$ (where $\hat{a}_t, r_t(\hat{a}_t)$ can be associated with a label). In a pure contextual bandits problem, as just defined, the $x_t$ and $e_{i,t}$ are jointly fixed in advance, and so dependence of $e_{i,t}$ on $x_1, \dots, x_t$ (or even $r_t$) is implicitly allowed by the bounds. However, if $e_{i,t}$ is actually a function of $\hat{a}_1, \dots, \hat{a}_{t-1}$, we cannot bound regret with respect to the expected gain of following expert i on every time step.

Footnote 3: Interestingly, the small M and large A imply that EXP3 gives better bounds for this problem than EXP4. However, because the individual algorithms use all of the labels observed so far, the uniform-random exploration done by EXP4 may in fact be a benefit: it provides labelled training data to each active learning algorithm that might not have been available to any of the active learning algorithms if they had been run individually. This may explain both why EXP4 is so effective despite its poor bounds, and also why the authors observe that their combined approach can actually outperform the individual algorithms. (We will have more to say on the exact nature of these bounds in a few paragraphs.)
Hence, even though [BEYL04] show that their combined approach does as well or better than each individual learning algorithm, the bound from EXP4 provides no such guarantee. Regret bounds still hold, but they bound regret with respect to the post-hoc sequence of recommendations that each expert actually made. It is straightforward to construct pathological examples where this quantity may be very different than the expected performance of expert i had it been played on every round (for example, consider an expert that makes perfect recommendations, but only if its advice is followed exactly on the first k rounds).

If the experts are history-dependent, our algorithm is really trying to solve a reinforcement learning problem where the state is the history of past actions. If it is reasonable to make assumptions about the transition probabilities and rewards, or if we are allowed to reset to previously visited states, then standard reinforcement learning techniques can be applied, for example [KMN00]. If one believes that only a limited window of history matters to the performance of experts, approaches like those of [dfm06] can be adapted. In this work we are unwilling to make such strong assumptions, but it means our bounds can only hold with respect to the post-hoc recommendations of the experts. However, for many practical applications (including the online choice of active learning algorithms setting) it is reasonable to believe that pathological cases like the above will not occur, or even that the experts will do better based on the obtained shared history than if they had been run independently. In these cases, algorithms for the bandit problem with expert advice are an appropriate (and easy to implement) option.

4 Related Work

We have already mentioned several important pieces of related work. For an excellent summary of bounds for standard bandit problems, comparisons to the full information setting, and generalizations, see [DHK07].

Langford and Zhang [LZ07] formalize a general contextual multi-armed bandit problem under stochastic assumptions. In particular, they assume that each $(x_t, r_t)$ is drawn i.i.d. from a fixed (i.e., independent of t) distribution. Their focus is on the case where $\mathcal{M}$ is an infinite hypothesis space (but with finite VC dimension), and so their work can be seen as extending supervised learning techniques to the contextual bandits problem. When their algorithm is applied to a finite hypothesis space $\mathcal{M}$ of size M, they get bounds of the form $O(G^{2/3} A^{1/3} (\ln M)^{1/3})$. Thus, compared to EXP4, they get better dependence on M and A, but worse dependence on G. However, this is really an apples-to-oranges comparison, as their work makes a strong probabilistic assumption on $(x_t, r_t)$ in order to be able to handle infinite hypothesis spaces. In contrast, we make no distributional assumptions and so get bounds that hold for arbitrary sequences $(x_t, r_t)$. We can also obtain much tighter bounds in terms of the number of actions A (in some cases removing the dependence entirely), and we can combine entirely arbitrary and unrelated ways of incorporating the side information.

Several authors have considered applying bandit-style algorithms to sponsored search auctions. Explore-exploit tradeoffs may arise at two different levels in this domain. Most prior work (including [PO07] and [WVLL07]) has addressed the tradeoff between showing ads that are known to have good click-through-rate (CTR) versus the need to show ads with unknown CTRs in order to estimate their relevance. These algorithms directly propose a set of ads to show on each query. In particular, Pandey and Olston [PO07] consider a bandit-based algorithm that directly tries to learn click-through rates as well as correctly allocate ads to queries in the budget-limited case. Gonen and Pavlov [GP07] study a similar problem, but also consider advertiser incentives.
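Our approach is orthogonal to this work, as we address the exploration/exploitation tradeoff at the meta-level: given a selection $m_1, \dots, m_M$ of possible algorithms (possibly including those from the above references), how do we trade off evaluating these different algorithms versus using the algorithm currently estimated to be best?

5 Algorithms and Analysis

The algorithms we analyze have the general form given in Figure 2; the key distinction between the algorithms in this family is the choice of the exploration policy $F_{mix}$. Our recommended approach, LP-Mix, is given by the linear program in the figure. We refer to this algorithm as NEXP(LP-Mix) or just NEXP. The distribution $\bar{p}$ can be viewed as the ideal distribution to follow if all of our estimates were perfect; it corresponds to the exponential weighting scheme used by algorithms like Hedge [FS95]. The key algorithmic choice is how to modify $\bar{p}$ to ensure sufficient exploration. It will become clear from Lemma 2 that we will want $p_t$ that satisfies the following properties for an appropriate choice of $\alpha$ and as small a $z_t$ as possible:

$$(\alpha): \ \forall i, a, t \quad \frac{e_{i,t}(a)}{p_t(a)} \le \frac{1}{\alpha}, \qquad (Z): \ \forall a, t \quad \frac{\bar{p}_t(a)}{p_t(a)} \le z_t. \qquad (2)$$

The bound $(\alpha)$ ensures sufficient exploration, in particular that our importance-weighted estimates $\hat{y}$ of the true reward of each expert remain bounded; $(Z)$ bounds the componentwise ratio of the exploitation distribution $\bar{p}$ we would like to play to the exploration-modified distribution p we actually play.

    Algorithm NEXP
      Choose parameter $\alpha$ and subroutine $F_{mix}$
      Add the expert $e_0(a) = \frac{1}{s} \max_i \{e_i(a)\}$ to $\mathcal{M}$
      $(\forall i \in \mathcal{M})\ w_{i,1} \leftarrow 1$
      for t = 1, 2, ..., T do
        Observe expert distributions $e_1, \dots, e_M$
        $W \leftarrow \sum_{i=1}^{M} w_i$
        $(\forall i)\ q_i \leftarrow w_i / W$
        $(\forall a)\ \bar{p}(a) \leftarrow \sum_{i=1}^{M} q_i e_i(a)$
        $p \leftarrow F_{mix}(\bar{p}, q, e)$    // For example, LP-Mix$_\alpha$
        Draw $\hat{a}$ randomly according to p.
        Take action $\hat{a}$, observe reward $r(\hat{a})$
        $(\forall i)\ \hat{y}_i \leftarrow \frac{e_i(\hat{a})}{p(\hat{a})} r(\hat{a})$
        $(\forall i)\ w_{i,(t+1)} \leftarrow w_{i,t} \exp(\alpha \hat{y}_i)$
      end for

    Subroutine LP-Mix$_\alpha(\bar{p}, q, e)$ solves for p    // Use for $F_{mix}$
      Solve the linear program below, and return p
        $\max_{p,c}\ c$
        subject to  $\forall a \quad p(a) \ge \alpha \max_i \{e_i(a)\}$
                    $\forall a \quad p(a) \ge c\, \bar{p}(a)$
                    $\sum_a p(a) = 1$.

Figure 2: Algorithm NEXP. Variables used only in a single iteration of the for loop have subscripts omitted. The function $F_{mix}$ takes an ideal exploitation distribution $\bar{p}$ and modifies it to ensure sufficient exploration. The solution p to LP-Mix is the recommended choice for $F_{mix}$; other choices are discussed in the text. Algorithm LP-Mix-Solve (Figure 3) can be used to solve LP-Mix efficiently.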
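As an illustration of the loop in Figure 2, here is a hedged Python sketch of a single NEXP round. All names (`nexp_round`, `mix_with_e0`, `reward_fn`) are assumptions introduced for this example. In place of LP-Mix it uses the simple mixture $p = (1 - \gamma)\bar{p} + \gamma e_0$ with $\gamma = s\alpha$, which satisfies condition $(\alpha)$ because $p(a) \ge \gamma e_0(a) = \alpha \max_i \{e_i(a)\}$; it is a stand-in, not the recommended exploration policy. For brevity the sketch also does not add $e_0$ to the expert set.

```python
import math
import random


def mix_with_e0(p_bar, experts, alpha):
    """Stand-in for F_mix: p = (1 - gamma) * p_bar + gamma * e_0, with gamma = s * alpha."""
    n = len(p_bar)
    e_max = [max(e[a] for e in experts) for a in range(n)]
    s = sum(e_max)
    gamma = s * alpha  # assumes s * alpha is at most 1
    return [(1 - gamma) * p_bar[a] + gamma * e_max[a] / s for a in range(n)]


def nexp_round(weights, experts, reward_fn, alpha, f_mix=mix_with_e0, rng=random):
    """Play one round and return (updated weights, chosen action, played distribution)."""
    n = len(experts[0])
    total = sum(weights)
    q = [w / total for w in weights]
    # Ideal exploitation distribution: expert recommendations averaged by q.
    p_bar = [sum(q[i] * experts[i][a] for i in range(len(experts))) for a in range(n)]
    p = f_mix(p_bar, experts, alpha)
    action = rng.choices(range(n), weights=p)[0]
    reward = reward_fn(action)  # only this single entry of the reward vector is observed
    # Importance-weighted reward estimate for every expert.
    y_hat = [e[action] / p[action] * reward for e in experts]
    new_weights = [w * math.exp(alpha * y) for w, y in zip(weights, y_hat)]
    return new_weights, action, p


# Toy run: three deterministic experts, three actions, reward 1 only for action 0.
experts = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights = [1.0] * len(experts)
rng = random.Random(0)
alpha = 0.1
for _ in range(50):
    weights, action, p = nexp_round(
        weights, experts, lambda a: 1.0 if a == 0 else 0.0, alpha, rng=rng)
```

Because no expert recommends action 2, the played distribution puts zero mass on it, and every expert that recommended the chosen action has its weight updated in the same round.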
The need for this componentwise-ratio definition of distance will become clear in the proof of Lemma 2. For our analysis, we assume our set of experts contains an expert that recommends the distribution $e_{0,t}(a) = \frac{1}{s_t} \max_i \{e_{i,t}(a)\}$ on each round. If this is not the case, M becomes M + 1 in the bounds, and the $e_{0,t}$ expert can be added in the algorithm implementation on a per-round basis.

Exploration Strategies. We consider several exploration strategies (subroutines $F_{mix}$), all of which we will be able to analyze using Lemma 2.

UA-Mix$_\gamma$: uniform distribution on actions. For all $a \in \mathcal{A}$, use

$$p(a) = (1 - \gamma)\, \bar{p}(a) + \frac{\gamma}{A}. \qquad (3)$$

This produces an algorithm that is almost identical to the original EXP4, and for which we prove identical bounds. The only difference is that NEXP(UA-Mix) adds an additional expert $e_0$, while EXP4 adds an additional expert that plays the uniform distribution over all actions.

UE-Mix$_\gamma$: uniform distribution on experts. Let $p_u(a) = \frac{1}{M} \sum_i e_i(a)$, and for all $a \in \mathcal{A}$, use

$$p(a) = (1 - \gamma)\, \bar{p}(a) + \gamma\, p_u(a). \qquad (4)$$

This produces an algorithm similar to running EXP3 on the experts, but it works immediately for experts that recommend general probability distributions, and takes advantage of importance weighting to update the estimates for all experts that recommended the action $\hat{a}$ actually played.

LP-Mix$_\alpha$: optimal exploration. Given a constant $\alpha$, use the p derived by solving the linear program given in Figure 2. Theorem 7 gives an efficient, easy-to-implement algorithm for solving this LP.

We now begin the analysis of these three algorithms in a unified framework. The next lemma shows that even though many per-round variables are not independently distributed due to dependence on the prior actions chosen, in an important case we can still treat their expectations independently:

Lemma 1. Let $X_t$ be a random variable associated with the t-th round of NEXP whose value depends on the outcome of previous randomness (i.e., the history $\hat{a}_1, \dots, \hat{a}_{t-1}$). Then, if for all possible histories $\hat{a}_1, \dots, \hat{a}_{t-1}$, $E[X_t \mid \hat{a}_1, \dots, \hat{a}_{t-1}] = x_t$ for a fixed value $x_t$ (independent of the previous $\hat{a}$'s, and hence independent of the distribution $p_t$), we have

$$E\Big[\sum_{t=1}^{T} X_t\Big] = \sum_{t=1}^{T} x_t.$$

The proof follows from linearity of expectation. The above lemma will typically be applied where the random variable $X_t$ depends on the distribution $p_t$; observe that $p_t$ is a fixed distribution given $\hat{a}_1, \dots, \hat{a}_{t-1}$.

The next lemma gives a general purpose bound in terms of the bounds $\alpha$ and $z_t$ of Equation (2). The results for specific algorithms will follow by plugging in suitable constants based on the different exploration strategies.

Lemma 2. If conditions $(\alpha)$ and $(Z)$ are satisfied by the $p_t$ distributions selected by NEXP($F_{mix}$), then

$$E[G_{ALG}] \ge \Big(\frac{1}{Z} - (e - 2)\,\alpha S\Big) G_{OPT} - \frac{1}{\alpha Z} \ln M \qquad (5)$$

where $Z \ge \max_t \{z_t\}$.
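Proof: Unless otherwise stated, variables are defined as in Table 2, though in some cases subscript t's have been added. All expectations are with respect to the draws $\hat{a}_t \sim p_t$. The basic proof technique follows the lines of those for EXP3 and EXP4 (see [ACBFS02] and [CBL06]). The key is relating $W_{T+1}$, the sum of our weights on the last round, to both our performance and the performance of the best expert. To do this, we will use the inequality $\exp(x) \le 1 + x + (e - 2)x^2$ for $x \in [0, 1]$. For compactness, we write $\kappa = e - 2$. Then

$$\frac{W_{t+1}}{W_t} = \sum_i \frac{w_{i,t}}{W_t} \exp(\alpha \hat{y}_{i,t}) \le \sum_i q_{i,t} \big[1 + \alpha \hat{y}_{i,t} + \kappa (\alpha \hat{y}_{i,t})^2\big] = 1 + \alpha \sum_i q_{i,t} \hat{y}_{i,t} + \kappa \alpha^2 \sum_i q_{i,t} \hat{y}_{i,t}^2,$$

noting that because $\hat{y}_i \le e_i(\hat{a})/p(\hat{a}) \le 1/\alpha$, we have $\alpha \hat{y}_i \in [0, 1]$. Now, taking logs and summing from t = 1 to T, we have for the left-hand side

$$\sum_{t=1}^{T} \ln \frac{W_{t+1}}{W_t} = \sum_{t=1}^{T} (\ln W_{t+1} - \ln W_t) = \ln W_{T+1} - \ln M,$$

and using $\ln(1 + x) \le x$ for the right-hand side, we have

$$\underbrace{\ln W_{T+1}}_{\text{(I): } G_{OPT}} - \ln M \le \sum_{t=1}^{T} \Big[ \alpha \underbrace{\sum_i q_{i,t} \hat{y}_{i,t}}_{\text{(II): } G_{ALG}} + \kappa \alpha^2 \underbrace{\sum_i q_{i,t} \hat{y}_{i,t}^2}_{\text{(III): Regret}} \Big] \qquad (6)$$

where the underbraces indicate the quantities to which we relate each term.

First we relate term (I) to the gain of the best expert. Note that $\hat{y}_{i,t}$ is an unbiased estimate of the reward we would have received on the t-th round if we had chosen expert i, and

$$w_{i,T+1} = \prod_{t=1}^{T} \exp(\alpha \hat{y}_{i,t}) = \exp\Big(\alpha \sum_{t=1}^{T} \hat{y}_{i,t}\Big).$$

Thus, $w_{i,T+1}$ is the exponentiated scaled estimated total reward of expert i. Using the fact that $\ln \sum_i \exp(x_i)$ is a good approximation for $\max_i \{x_i\}$, we can show $\ln W_{T+1}$ must be close to the total reward of the best expert. In particular, for any expert k, we have

$$\ln W_{T+1} \ge \ln w_{k,T+1} = \alpha \sum_{t=1}^{T} \hat{y}_{k,t}. \qquad (7)$$

We can relate term (II) in Equation (6) to our algorithm's actual gain on each round (dropping t subscripts):

$$\sum_i q_i \hat{y}_i = \sum_i q_i \frac{e_i(\hat{a})}{p(\hat{a})} r(\hat{a}) = \frac{\bar{p}(\hat{a})}{p(\hat{a})} r(\hat{a}) \le Z\, r(\hat{a}). \qquad (8)$$

Combining the main inequality (6) with the bounds of (7) and (8), we have

$$\alpha \sum_{t=1}^{T} \hat{y}_{k,t} - \ln M \le \alpha \sum_{t=1}^{T} Z\, r_t(\hat{a}_t) + \kappa \alpha^2 \sum_{t=1}^{T} \sum_i q_{i,t} \hat{y}_{i,t}^2. \qquad (9)$$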

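Then, dividing by $\alpha Z$ and rearranging, we have

$$\sum_{t=1}^{T} r_t(\hat{a}_t) \ge \frac{1}{Z} \sum_{t=1}^{T} \hat{y}_{k,t} - \frac{1}{\alpha Z} \ln M - \frac{\kappa \alpha}{Z} \sum_{t=1}^{T} \sum_i q_{i,t} \hat{y}_{i,t}^2. \qquad (10)$$

We now bound term (III), which contributes to our regret. It is here that our analysis diverges from the analysis of EXP4. Consider some particular t and define $\bar{e}(a) = \max_i \{e_i(a)\}$. Again omitting t subscripts, we have

$$\sum_i q_i \hat{y}_i^2 = \sum_i q_i \frac{e_i(\hat{a})^2\, r(\hat{a})^2}{p(\hat{a})^2} \le \sum_i q_i \frac{e_i(\hat{a})\, \bar{e}(\hat{a})\, r(\hat{a})^2}{p(\hat{a})^2} = \frac{\bar{p}(\hat{a})\, \bar{e}(\hat{a})}{p(\hat{a})^2} r(\hat{a})^2 \le Z\, \bar{e}(\hat{a}) \frac{r(\hat{a})}{p(\hat{a})}, \qquad (11)$$

recalling $r(a) \in [0, 1]$ and so $r(a)^2 \le r(a)$. Define

$$\hat{r}(a) = \begin{cases} r(\hat{a})/p(\hat{a}) & \text{if } a = \hat{a} \\ 0 & \text{otherwise.} \end{cases}$$

Summing the bound of Equation (11) over t, and using $S \ge s_t$, we have

$$\sum_{t=1}^{T} \sum_i q_{i,t} \hat{y}_{i,t}^2 \le Z \sum_{t=1}^{T} \bar{e}_t(\hat{a}_t) \frac{r_t(\hat{a}_t)}{p_t(\hat{a}_t)} \le Z S \sum_{t=1}^{T} \sum_a \frac{\bar{e}_t(a)}{s_t} \hat{r}_t(a) = Z S \sum_{t=1}^{T} \sum_a e_{0,t}(a)\, \hat{r}_t(a).$$

Note that for any a where $p(a) > 0$, $E_{\hat{a} \sim p}[\hat{r}(a)] = r(a)$. We have $q_i > 0$ for all i, and so condition $(Z)$ implies $p(a) > 0$ whenever $e_{0,t}(a) > 0$. Thus, applying Lemma 1 to $\hat{r}(a)$,

$$E\Big[\sum_{t=1}^{T} \sum_a e_{0,t}(a)\, \hat{r}_t(a)\Big] = G_0 \le G_{OPT}, \qquad (12)$$

where $G_0$ is the expected reward for always following the advice of the expert $e_0$. The distribution $p_t$ is fixed given $\hat{a}_1, \dots, \hat{a}_{t-1}$, so for any k,

$$E_{\hat{a}_t}[\hat{y}_{k,t} \mid \hat{a}_1, \dots, \hat{a}_{t-1}] = \sum_a p_t(a) \frac{e_{k,t}(a)}{p_t(a)} r_t(a) = e_{k,t} \cdot r_t,$$

and so again using Lemma 1,

$$E\Big[\sum_{t=1}^{T} \hat{y}_{k,t}\Big] = E[G_k].$$

By definition, $E[\sum_{t=1}^{T} r_t(\hat{a}_t)] = E[G_{ALG}]$, and so combining these expectations with Equation (10) and taking the max over k,

$$E[G_{ALG}] \ge \frac{1}{Z} G_{OPT} - \frac{1}{\alpha Z} \ln M - \kappa \alpha S\, G_{OPT},$$

which proves the lemma.

We now consider bounds for specific versions of the algorithm, parameterized by different choices of the $F_{mix}$ function. We begin with our main theorem for NEXP(LP-Mix).

Theorem 3. Algorithm NEXP(LP-Mix), run with parameter $\alpha = \min\Big\{\frac{1}{S}, \sqrt{\frac{\ln M}{(e - 1) S G}}\Big\}$, has expected regret bounded by

$$E[\text{Regret}] \le 2 \sqrt{(e - 1)\, S G \ln M}.$$

Proof: For the case when $\alpha = 1/S$, solving $\frac{1}{S} \le \sqrt{\frac{\ln M}{(e - 1) S G}}$ for G shows that the gain of the best expert must be less than $\sqrt{SG \ln M}$ and so the result follows immediately. In the other case, we first show that, for this choice of $\alpha$, the optimum z of the linear program is at most $\frac{1}{1 - \gamma}$, where $\gamma = S\alpha$. To see this, let $p(a) = (1 - \gamma)\, \bar{p}(a) + \gamma\, e_{0,t}(a)$. Because $e_i(a) \le \bar{e}(a)$ and $p(a) \ge \gamma\, e_{0,t}(a) \ge \alpha\, \bar{e}(a)$, we have

$$\frac{e_i(a)}{p(a)} \le \frac{\bar{e}(a)}{\alpha\, \bar{e}(a)} = \frac{1}{\alpha}.$$

Thus, p is a feasible solution to the linear program. Furthermore,

$$\frac{\bar{p}(a)}{p(a)} = \frac{p(a) - \gamma\, e_{0,t}(a)}{(1 - \gamma)\, p(a)} \le \frac{1}{1 - \gamma},$$

which implies $z \le \frac{1}{1 - \gamma}$. Applying Lemma 2 and substituting into Equation (5) with $Z = \frac{1}{1 - \gamma}$, we have

$$E[G_{ALG}] \ge \big((1 - \gamma) - \kappa \alpha S\big) G_{OPT} - \frac{(1 - \gamma)}{\alpha} \ln M$$

where $\kappa = e - 2$.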
Dropping the $(1 - \gamma)$ on the $\ln M$ term, plugging in $\gamma = S\alpha$, re-arranging, and substituting G for $G_{OPT}$ gives

$$E[\text{Regret}] = G_{OPT} - E[G_{ALG}] \le (e - 1)\, S \alpha G + \frac{\ln M}{\alpha}.$$

Plugging in our choice of $\alpha$ proves the theorem.

Note that the optimal choice of $\alpha$ depends on S and G; if good estimates of these are not available in advance, then one can make conservative guesses initially. If the current estimate is ever exceeded, then one simply restarts the algorithm after re-setting $\gamma$ based on doubling the exceeded estimate. This only inflates the bounds by a constant factor. Such approaches are standard; for details see [CBL06].

For completeness, we also derive regret bounds for the EXP3- and EXP4-like algorithms NEXP(UE-Mix) and NEXP(UA-Mix):

Theorem 4. Algorithm NEXP(UE-Mix), run with parameter $\gamma = \min\Big\{1, \sqrt{\frac{M \ln M}{(e - 1) G}}\Big\}$, has

$$E[\text{Regret}] \le 2 \sqrt{(e - 1)\, G M \ln M},$$

and NEXP(UA-Mix), using $\gamma = \min\Big\{1, \sqrt{\frac{A \ln M}{(e - 1) G}}\Big\}$, has

$$E[\text{Regret}] \le 2 \sqrt{(e - 1)\, G A \ln M}.$$
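The parameter choices in Theorems 3 and 4 share one functional form, which the following Python sketch makes explicit. The function names and the example values of S, M, A, and G are assumptions chosen only for illustration; they are not the parameters of Table 1 or of the experiments.

```python
import math

E_MINUS_1 = math.e - 1.0


def lp_mix_alpha(S, G, M):
    """alpha for NEXP(LP-Mix): min{1/S, sqrt(ln M / ((e - 1) S G))}."""
    return min(1.0 / S, math.sqrt(math.log(M) / (E_MINUS_1 * S * G)))


def uniform_mix_gamma(K, G, M):
    """gamma for the uniform mixes, with K = M (uniform over experts) or K = A (uniform over actions)."""
    return min(1.0, math.sqrt(K * math.log(M) / (E_MINUS_1 * G)))


def regret_bound(K, G, M):
    """2 * sqrt((e - 1) * K * G * ln M), where K is one of S, M, or A."""
    return 2.0 * math.sqrt(E_MINUS_1 * K * G * math.log(M))


# Hypothetical problem sizes.
S, M, A, G = 5, 50, 2000, 10000.0
alpha = lp_mix_alpha(S, G, M)
bound_lp = regret_bound(S, G, M)   # LP-Mix: depends on S
bound_ue = regret_bound(M, G, M)   # uniform over experts: depends on M
bound_ua = regret_bound(A, G, M)   # uniform over actions: depends on A

# When the cap 1/S is inactive, plugging alpha into the intermediate bound
# (e - 1) * S * alpha * G + ln(M) / alpha recovers the closed form.
intermediate = E_MINUS_1 * S * alpha * G + math.log(M) / alpha
```

Since $S \le \min\{A, M\}$, the LP-Mix bound is never larger than the other two under this common form, and the leading constant $2\sqrt{e - 1}$ is just under the 2.63 quoted in Equation (1).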

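Proof (sketch): For NEXP(UE-Mix), the result follows along the lines of the previous theorem after showing $\alpha = \gamma/M$ satisfies condition $(\alpha)$, $Z = 1/(1 - \gamma)$ satisfies $(Z)$, and $S \le M$. For NEXP(UA-Mix), the proof uses $\alpha = \gamma/A$, $Z = 1/(1 - \gamma)$, and $S \le A$.

Bounds in Terms of the Average $s_t$. We now prove a bound that depends on the per-round $s_t$, rather than the max over all rounds. We will need the following lemma about weighted sums:

Lemma 5. Fix constants $\bar{A} \ge \bar{a} \ge 0$. Let $w_1, \dots, w_T \in \mathbb{R}^+$ be a sequence of non-negative real numbers, let $a_1, \dots, a_T \in [0, \bar{a}]$, with the additional constraint that $\sum_t a_t \ge \bar{A}$. Let $n = \lfloor \bar{A}/\bar{a} \rfloor$. Then,

$$\sum_{t=1}^{T} w_t a_t \ge MB(w, n) \sum_{t=1}^{T} a_t,$$

where $MB(w, n)$ is the mean of the n smallest $w_t$'s.

Proof: Assume without loss of generality that w is sorted in non-decreasing order. Let $A = \sum_{t=1}^{T} a_t$ and $m = \lfloor A/\bar{a} \rfloor$. The sum $\sum_t w_t a_t$ is minimized by setting $a_t = \bar{a}$ for $t \le m$ and $a_{m+1} = A - m\bar{a}$. Thus,

$$\sum_{t=1}^{T} w_t a_t \ge \bar{a} \sum_{t=1}^{m} w_t + a_{m+1} w_{m+1} = m \bar{a}\, MB(w, m) + a_{m+1} w_{m+1}$$
$$\ge (m \bar{a} + a_{m+1})\, MB(w, m) \qquad (13)$$
$$\ge \Big(\sum_{t=1}^{T} a_t\Big) MB(w, n) \qquad (14)$$

where line (13) follows because $MB(w, m) \le w_{m+1}$ and line (14) uses $A = m\bar{a} + a_{m+1}$ and $MB(w, m) \ge MB(w, n)$ because $m \ge n$.

Now we can prove the following theorem, strengthening Theorem 3. The key additional assumption is that we can bound $G_0$ away from zero, as this lets us show that a few bad $s_t$ can't hurt us too much.

Theorem 6. Suppose $G_0 \ge S$, and let $\bar{S} = MB(1/s, n_s)^{-1}$ be the harmonic mean of the $n_s$ largest $s_t$, where $n_s = \lfloor G_0/S \rfloor$. Then algorithm NEXP(LP-Mix), run with parameter $\alpha = \sqrt{\frac{\ln M}{(e - 1) \bar{S} G}}$, has regret bounded by

$$E[\text{Regret}] \le 2 \sqrt{(e - 1)\, \bar{S} G \ln M}.$$

Proof (sketch): Building on the proof of Lemma 2 and Theorem 3, it suffices to show that

$$E\Big[\sum_t \sum_i q_{i,t} \hat{y}_{i,t}^2\Big] \le Z \bar{S} G_0.$$

Using Equation (11), we have

$$\sum_t \sum_i q_{i,t} \hat{y}_{i,t}^2 \le Z \sum_t \frac{r_t(\hat{a}_t)\, \bar{e}_t(\hat{a}_t)}{p_t(\hat{a}_t)}.$$

Recall that $\bar{e}_t(a) = \max_i \{e_{i,t}(a)\}$. Let $A_p = \{a \mid p(a) > 0\}$, and taking expectations, we get

$$E\Big[\sum_t \sum_i q_{i,t} \hat{y}_{i,t}^2\Big] \le Z \sum_t \sum_{a \in A_p} r_t(a)\, \bar{e}_t(a) = Z \sum_t g_t$$

where $g_t = \sum_{a \in A_p} r_t(a)\, \bar{e}_t(a)$. It remains to show that $\sum_t g_t \le \bar{S} G_0$. To see this, note that $g_t \le S$ and $\sum_t g_t \ge G_0 \ge S$. Thus, by Lemma 5,

$$G_0 = \sum_t \frac{1}{s_t} g_t \ge MB(1/s, n_s) \sum_t g_t.$$

Rearranging this inequality gives $\sum_t g_t \le \bar{S} G_0$.

    Algorithm LP-Mix-Solve
      Define $p_{min}(a) = \alpha \max_i \{e_i(a)\}$
      Initialize $c \leftarrow 1$
      repeat
        Let $A_0 = \{a : p_{min}(a) \ge c\, \bar{p}(a)\}$, and set
          $c \leftarrow \dfrac{1 - \sum_{a \in A_0} p_{min}(a)}{\sum_{a \in \mathcal{A} \setminus A_0} \bar{p}(a)}$    (15)
      until the update (15) produces no change in c
      Return the distribution $p(a) = \max\{p_{min}(a), c\, \bar{p}(a)\}$

Figure 3: Algorithm LP-Mix-Solve.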
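The iterative procedure of Figure 3 can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `lp_mix_solve`, the argument layout (`e_max[a]` standing for $\max_i \{e_i(a)\}$), the explicit feasibility check, the guard for an empty complement of $A_0$, and the floating-point tolerance are all assumptions added for the example.

```python
def lp_mix_solve(p_bar, e_max, alpha, tol=1e-12):
    """Return (p, c) maximizing c subject to p >= alpha * e_max, p >= c * p_bar, sum(p) = 1."""
    n = len(p_bar)
    p_min = [alpha * e for e in e_max]
    if sum(p_min) > 1.0 + tol:
        raise ValueError("linear program is infeasible: alpha * s exceeds 1")
    c = 1.0
    # Two spare passes beyond the number of actions, as a safety margin for the loop.
    for _ in range(n + 2):
        in_a0 = [p_min[a] >= c * p_bar[a] for a in range(n)]
        floor_mass = sum(p_min[a] for a in range(n) if in_a0[a])
        free_mass = sum(p_bar[a] for a in range(n) if not in_a0[a])
        if tol >= free_mass:
            break  # every action sits at its exploration floor
        new_c = (1.0 - floor_mass) / free_mass
        if tol >= abs(new_c - c):
            break  # the update produced no change in c
        c = new_c
    p = [max(p_min[a], c * p_bar[a]) for a in range(n)]
    return p, c


# Toy round: three deterministic experts, each recommending a different action,
# so e_max is all ones and p_bar equals the distribution q over experts.
p_bar = [0.855, 0.105, 0.04]
e_max = [1.0, 1.0, 1.0]
alpha = 0.1
p, c = lp_mix_solve(p_bar, e_max, alpha)
```

In this toy round the first update lifts only the rarest action to its floor, the second also lifts the middle action, and the third pass leaves c unchanged, giving p close to (0.8, 0.1, 0.1). The bound $Z$ used in the analysis corresponds to $1/c$.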
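A Fast Algorithm for LP-Mix. In this section, we present an efficient and easy-to-implement algorithm for solving LP-Mix. The algorithm, given in Figure 3, iteratively refines an upper bound c on the optimal objective function value until it reaches a feasible (and optimal) solution. Its performance is summarized in Theorem 7.

Theorem 7. Assuming the linear program is feasible, algorithm LP-Mix-Solve runs for at most $1 + |\{a : \bar{p}(a) > 0\}|$ iterations before returning an optimal p for the linear program

$$\max_{p,c}\ c \quad \text{subject to} \quad \forall a\ \ p(a) \ge \alpha \max_i \{e_i(a)\}, \qquad \forall a\ \ p(a) \ge c\, \bar{p}(a), \qquad \sum_a p(a) = 1.$$

Proof: Consider an arbitrary feasible solution $(p', c')$. We first show that our algorithm maintains the invariant $c \ge c'$. First, note that

$$c' = c' \sum_a \bar{p}(a) \le \sum_a p'(a) = 1,$$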

so the invariant is true initially. For any set $A_0 \subseteq \mathcal{A}$, we have

$$\sum_{a \in A_0} p_{min}(a) + c' \sum_{a \in \mathcal{A} \setminus A_0} \bar{p}(a) \le \sum_{a \in \mathcal{A}} p'(a) = 1.$$

Rearranging this inequality shows that (15) maintains the invariant.

Let $c^*$ be the optimal value of the objective function. We next show that if $c > c^*$, then applying (15) will reduce c. To see this, consider the point $(\tilde{p}, c)$, where $\tilde{p}(a) = \max\{p_{min}(a), c\, \bar{p}(a)\}$. Because $c > c^*$, the point $(\tilde{p}, c)$ cannot be feasible, which implies $\sum_a \tilde{p}(a) \ne 1$ (the other two constraints are satisfied by construction). Assume $\sum_a \tilde{p}(a) > 1$. Then, for the $A_0$ defined from c,

$$\sum_a \tilde{p}(a) = \sum_{a \in A_0} p_{min}(a) + c \sum_{a \in \mathcal{A} \setminus A_0} \bar{p}(a) > 1.$$

Rearranging this inequality implies that (15) will decrease c. On the other hand, if $\sum_a \tilde{p}(a)$ were less than 1, then we could increase the components of $\tilde{p}$ arbitrarily to obtain a feasible solution $(\tilde{p}', c)$, contradicting $c > c^*$. Thus, c keeps decreasing until $c = c^*$, at which point c no longer changes. This shows that our algorithm is correct assuming it terminates.

We now consider the time the algorithm requires. Because c is non-increasing, the set $A_0$ can only grow across iterations. Furthermore, by inspection of (15) we see that if c decreases, $A_0$ must have increased. Thus there can be at most $|\mathcal{A}|$ iterations before c does not change (at which point the algorithm terminates). To tighten this bound, note that every action a with $\bar{p}(a) = 0$ is added to $A_0$ on the first iteration, so in fact the number of iterations can be at most $1 + |\{a : \bar{p}(a) > 0\}|$.

6 Experiments

We compare our new algorithm to EXP3 and EXP4 on a large, real-world problem: predicting ad clicks on a search engine. EXP3 is run directly on the experts, as discussed in Section 2. In the real ad-selection bandit problem, the search engine chooses a few (say 10) ads to show from a presumably much larger set of ads targeted at a particular query; from this, we construct a smaller bandit problem where we pretend only the 10 ads actually shown are relevant, and from this set select a single ad to show. This simplification is necessary because we only observe rewards (click vs. no-click) for the ads that were shown to users. This sidesteps a typical challenge in evaluating bandit algorithms on real datasets: if the bandit problem was real then only a single reward is observed each round, but for low-variance evaluation of different bandit algorithms, the experimenter needs access to the full reward vector. Our datasets are based on anonymized query information from google.com [footnote 4].
From a 12 month period, we collected queries for a particular phrase (e.g., "canon 40d") where at least two ads were shown and at least one ad was clicked by a user. (No user-specific data was used in these experiments.) We then transformed this to a prediction problem with a feature vector $x$ for each (query, ad) pair that was shown, using features based on the text of the ad and the query; the target label is 1 if the ad was clicked, and 0 otherwise. Using the first 9 months of data, we trained a family of logistic regression models $m(\lambda, [a, b])$, where $\lambda$ gives the amount of $L_1$-regularization and $[a, b]$ indicates which months of data this particular model trained on; for example, $[a, b] = [1, 9]$ trains on all the data, while $[9, 9]$ trains on only the most recent month. These models were fixed, and used to produce experts for a hypothetical bandit problem played on the data from months 10-12.

[Table 3 (columns: Avg. Regret Actual, Bound; rows: EXP3, EXP4, NEXP; cell values not recoverable)]
Table 3: Average experimental per-round regret and theoretical bounds, for a problem with T = 9,73, A = 3567, M = 90, S = [...], and G_OPT = [...]. Regrets are averaged over 100 runs, and 95% confidence intervals are all tighter than ± [...]. Parameters were set and bounds computed using the true S and G; note the bound on EXP4 is vacuous.

Each timestep in the bandit problem maps to a real query that occurred on google.com. On each round, the bandit algorithm faces the problem of choosing a single ad to show. The full set $\mathcal{A}$ of actions corresponds to the set of all ads shown on the included queries over the 3 months. On a given round, an ad/action has reward 1 if it was shown by the search engine and was clicked by the user, and 0 otherwise. For each model $m$, the bandit algorithm has access to a deterministic expert $E(m)$. The expert $E(m)$ receives side-information, namely, the query phrase and the set of ads Google showed when the query originally occurred (only these ads can have positive reward). The expert/model then predicts the probability of a click on each ad in this set, and recommends deterministically the action which received the highest prediction. For example, if ads $(a_1, a_2, a_3)$ were shown on the query for round $t$, and model $m$ predicts (0.05, 0.02, 0.03) respectively, then the expert $E(m)$ recommends the distribution (1, 0, 0).

Our goal is not to fully capture the complexity of deciding which ads to show alongside search results.
In particular, we ignore the effect of the position in which ads are shown, the auction typically used to rank and price the ads, the fact that multiple ads are usually shown, and the fact that the set of available actions would typically be a larger set (e.g., all ads targeted at the query from advertisers with remaining budgets). However, we believe our setup captures enough of the essence of the problem to be useful for evaluating how well different bandit algorithms might apply to such real-world problems.

We report results for a representative dataset, based on queries for "canon 40d". About 200,000 training examples were selected from the 9-month training period, on which we trained 90 models based on different combinations of the regularization and date-range parameters. From the following 3 months of data, we formed a bandit problem with 9,73 timesteps and 90 experts.

Table 3 shows the experimental average per-round regret along with the corresponding bound. The parameters were chosen based on Theorem 3 for NEXP and the corresponding results from [ACBFS02] for EXP3 and EXP4, using the true values for S and G; in practice good estimates of these are likely to be available in advance. For example, S follows immediately from the side information present in our example domain, since this is the maximum number of ads google.com shows on a single query.

[Figure 4: plot of regret (y-axis) against parameter multiplier (x-axis) for EXP3 and NEXP(LP)]
Figure 4: Effect on regret of varying $\gamma$ for EXP3 and $\alpha$ for NEXP. The Y-axis is scaled so that the top of the plot corresponds to the performance of the worst expert; NEXP outperforms EXP3 for all parameters. The parameter values are given as multiples of the parameters used in Table 3.

Figure 4 shows the effect of using different parameters than those recommended by the regret bounds. We explored the parameter space by multiplying the parameter settings used for Table 3 by multipliers $m \in [0, 10]$, discarding values that produced infeasible parameter settings ($\gamma > 1$, $\alpha > 1/S$). The total number of actions available in this problem is so large that EXP4 performs hopelessly badly, as indicated in Table 3, and so is omitted from Figure 4. In fact, the theory suggests EXP4 should get $\gamma = 1$ for this problem (always play uniformly random actions); we experimented with different parameter settings, and the best results were effectively for $\gamma = 0$, which essentially plays a random expert. (It is possible to run EXP4 with uniform exploration on the actions that some expert recommends with positive probability. The standard analysis of EXP4 does not apply to this modified algorithm, however. And, while this algorithm can easily be analyzed along the lines of Theorem 3, it fails immediately if one introduces a single unsure expert which puts some small probability on each action.)

These experiments demonstrate that the optimized exploration strategy used by NEXP is not only a theoretical improvement useful in deriving tighter regret bounds, but also an important algorithmic improvement that can produce significantly lower regret in real-world applications.

7 Conclusions

We have introduced NEXP, a new algorithm for the bandit problem with expert advice.
NEXP provides a bound of $O(\sqrt{GS \log M})$ on expected cumulative regret, where $S \le \min\{A, M\}$ (in practice, $S$ can be much smaller). A refined bound shows that a certain average $\bar{S}$ can be used in place of $S$. Experiments demonstrated that on a realistic problem of significant real-world importance, our improved algorithms dramatically outperform previously published approaches.

Acknowledgements

The authors would like to thank Gary Holt, Arkady Epshteyn, Brent Bryan, Mike Meyer, and Andrew Moore for interesting discussions and feedback.

References

[ACBFS02] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002.
[BEYL04] Yoram Baram, Ran El-Yaniv, and Kobi Luz. Online choice of active learning algorithms. J. Mach. Learn. Res., 5:255-291, 2004.
[CBL06] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[dfm06] Daniela Pucci de Farias and Nimrod Megiddo. Combining expert advice in reactive environments. J. ACM, 53(5), 2006.
[DHK07] Varsha Dani, Thomas Hayes, and Sham M. Kakade. The price of bandit information for online optimization. In NIPS, 2007.
[FS95] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23-37, 1995.
[GP07] Rica Gonen and Elan Pavlov. An incentive-compatible multi-armed bandit mechanism. In PODC, 2007.
[KMN00] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In NIPS, 2000.
[LR85] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6:4-22, 1985.
[LZ07] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
[PO07] Sandeep Pandey and Christopher Olston. Handling advertisements of unknown quality in search advertising. In NIPS, 2007.
[RCKU08] F. Radlinski, D. Chakrabarti, R. Kumar, and E. Upfal. Mortal multi-armed bandits. In NIPS, 2008.
[Var07] Hal R. Varian. Position auctions. International Journal of Industrial Organization, 25(6):1163-1178, December 2007.
[WVLL07] Jennifer Wortman, Yevgeniy Vorobeychik, Lihong Li, and John Langford. Maintaining equilibria during exploration in sponsored search auctions. In WINE, pages 119-130, 2007.
