Unimodal Thompson Sampling for Graph Structured Arms


Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Stefano Paladino, Francesco Trovò, Marcello Restelli, Nicola Gatti
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano, Milano, Italy
{stefano.paladino, francesco.trovo, marcello.restelli, nicola.gatti}@polimi.it

Abstract

We study, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, for any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In our paper, we design a Thompson Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound for the considered setting. We show that, as it happens in a wide number of scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our algorithm and of state-of-the-art algorithms as the properties of the graph vary.

Introduction

Multi-Armed Bandit (MAB) algorithms (Auer, Cesa-Bianchi, and Fischer 2002) have been proven to provide effective solutions for a wide range of applications fitting the sequential decision-making scenario. In this framework, at each round over a finite horizon $T$, the learner selects an action (usually called arm) from a finite set and observes only the reward corresponding to the choice she made.
The goal of a MAB algorithm is to converge to the optimal arm, i.e., the one with the highest expected reward, while minimizing the loss incurred in the learning process. Therefore, its performance is measured through its expected pseudo-regret, defined as the difference between the expected reward achieved by an oracle algorithm always selecting the optimal arm and the one achieved by the considered algorithm. We focus on the so-called Unimodal MAB (UMAB), introduced in Combes and Proutiere (2014a), in which each arm corresponds to a node of a graph and each edge is associated with a relationship specifying which node of the edge gives the largest expected reward (thus providing a partial ordering over the arm space). Furthermore, from any node there is a path leading to the unique node with the maximum expected reward along which the expected reward is monotonically increasing. While the graph structure may not necessarily be known a priori by the UMAB algorithm, the relationship defined over the edges is discovered during the learning. In the present paper, we propose a novel algorithm relying on the Bayesian learning approach for a generic UMAB setting.

Models presenting a graph structure have become more and more interesting in recent years due to the spread of social networks. Indeed, the relationships among the entities of a social network have a natural graph structure. A practical problem in this scenario is the targeted advertisement problem, whose goal is to discover the part of the network that is interested in a given product. This task is heavily influenced by the graph structure, since in social networks people tend to have similar characteristics to those of their friends (i.e., neighbor nodes in the graph). Therefore, interests of people in a social network change smoothly and neighboring nodes in the graph look similar to each other (McPherson, Smith-Lovin, and Cook 2001; Crandall et al. 2008).

Copyright © 2017, Association for the Advancement of Artificial Intelligence. All rights reserved.
More specifically, an advertiser aims at finding those users that maximize the ad expected revenue (i.e., the product between click probability and value per click), while at the same time reducing the number of times the advertisement is presented to people not interested in its content. Under the assumption of unimodal expected reward, the learner can move from low expected rewards to high ones just by climbing them in the graph, avoiding the need for an exploration over all the graph nodes. This assumption reduces the complexity in the search for the optimal arm, since the learning algorithm can avoid pulling the arms corresponding to some subset of non-optimal nodes, thus reducing the regret. Other applications might benefit from this structure, e.g., recommender systems, which aim at coupling items with those users who are likely to enjoy them. Similarly, the use of the unimodal graph structure might provide more meaningful recommendations without testing all the users in the social network. Finally, notice that unimodal problems with a single variable, e.g., in sequential pricing (Jia and Mannor 2011), bidding in online sponsored search auctions (Edelman and Ostrovsky 2007), and single-peak preferences in economics and voting settings (Mas-Colell, Whinston, and Green 1995), are graph-structured problems in which the graph is a line.

Frequentist approaches for UMAB with graph structure are proposed in Jia and Mannor (2011) and Combes and Proutiere (2014a). Jia and Mannor (2011) introduce the GLSE algorithm with a regret of order $O(\sqrt{T} \log T)$. However, GLSE performs better than classical bandit algorithms only when the number of arms is $\Theta(T)$. Combes and Proutiere (2014a) present the OSUB algorithm, based on KL-UCB, achieving asymptotic regret of $O(\log T)$ and outperforming GLSE in settings with a few arms. To the best of our knowledge, no Bayesian approach has been proposed for unimodal bandit settings, including the UMAB setting we study. However, it is well known that Bayesian MAB algorithms (the most popular is Thompson Sampling, TS) usually suffer from the same order of regret as the best frequentist one (e.g., in unstructured settings (Kaufmann, Korda, and Munos 2012)), but they outperform the frequentist methods in a wide range of problems (e.g., in bandit problems without structure (Chapelle and Li 2011) and in bandit problems with budget (Xia et al. 2015)). Furthermore, in problems with structure, the classical Thompson Sampling (not exploiting the problem structure) may outperform frequentist algorithms exploiting the problem structure. For this reason, in this paper we explore Bayesian approaches for the UMAB setting. More precisely, we provide the following original contributions:

• we design a novel Bayesian MAB algorithm, called UTS and based on the TS algorithm;
• we derive a tight upper bound over the pseudo-regret for UTS, which asymptotically matches the lower bound for the UMAB setting;
• we describe a wide experimental campaign showing better performance of UTS in applicative scenarios than that of state-of-the-art algorithms, evaluating also how the performance of the algorithms (ours and those of the state of the art) varies as the graph structure properties vary.

Related work

Here, we mention the main works related to ours. Some works deal with unimodal reward functions in the continuous armed bandit setting (Jia and Mannor 2011; Combes and Proutiere 2014b; Kleinberg, Slivkins, and Upfal 2008). In Jia and Mannor (2011) a successive elimination algorithm, called LSE, is proposed achieving regret of $O(\sqrt{T} \log T)$.
In this case, assumptions over the minimum local decrease and increase of the expected reward are required. Combes and Proutiere (2014b) consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. They propose the SP algorithm, based on the stochastic pentachotomy procedure to narrow the search space. Unimodal MABs on metric spaces are studied in Kleinberg, Slivkins, and Upfal (2008). An application-dependent solution to recommendation systems, which exploits the similarity of the graph in social networks in targeted advertisement, has been proposed in Valko et al. (2014). Similar information has been considered in Caron and Bhagat (2013), where the problem of cold-start users (i.e., new users) is studied. Another type of structure considered in sequential games is the monotonicity of the conversion rate in the price (Trovò et al. 2015). Interestingly, the assumptions of monotonicity and unimodality are orthogonal, neither being a special case of the other; therefore the results for the monotonic setting cannot be used in unimodal bandits. In Alon et al. (2013) and Mannor and Shamir (2011), a graph structure of the arm feedback in an adversarial setting is studied. More precisely, they assume to have correlation over rewards and not over the expected values of arms.

Problem Formulation

A learner receives in input a finite undirected graph MAB setting $G = (A, E)$, whose vertices $A = \{a_1, \ldots, a_K\}$ with $K \in \mathbb{N}$ correspond to the arms, and an edge $(a_i, a_j) \in E$ exists only if there is a direct partial order relationship between the expected rewards of arms $a_i$ and $a_j$. The learner knows a priori the nodes and the edges (i.e., she knows the graph), but, for each edge, she does not know a priori which is the node of the edge with the largest expected reward (i.e., she does not know the ordering relationship). At each round $t$ over a time horizon of $T \in \mathbb{N}$ the learner selects an arm $a_i$ and gains the corresponding reward $x_{i,t}$. This reward is drawn from an i.i.d. random variable $X_{i,t}$ (i.e., we consider a stochastic MAB setting) characterized by an unknown distribution $D_i$ with finite known support $\Omega \subset \mathbb{R}$ (as customary in MAB settings, from now on we consider $\Omega \subseteq [0, 1]$) and by unknown expected value $\mu_i := \mathbb{E}[X_{i,t}]$. We assume that there is a single optimal arm, i.e., there exists a unique arm $a_{i^*}$ s.t. its expected value $\mu_{i^*} = \max_i \mu_i$ and, for sake of notation, we denote $\mu_{i^*}$ with $\mu^*$. Here we analyze a graph bandit setting with the unimodality property, defined as:

Definition 1. A graph unimodal MAB (UMAB) setting $G = (A, E)$ is a graph bandit setting $G$ s.t. for each sub-optimal arm $a_i$, $i \neq i^*$, there exists a finite path $p = (i_1 = i, \ldots, i_m = i^*)$ s.t. $\mu_{i_k} < \mu_{i_{k+1}}$ and $(a_{i_k}, a_{i_{k+1}}) \in E$ for each $k \in \{1, \ldots, m - 1\}$.

This definition assures that if one is able to identify a non-decreasing path in $G$ of expected rewards, she will be able to reach the optimum arm, without getting stuck in local optima. Note that the unimodality property implies that the graph $G$ is connected and therefore we consider only connected graphs from here on.

A policy $U$ over a UMAB setting is a procedure able to select at each round $t$ an arm $a_{i_t}$ based on the history $h_t$, i.e., the sequence of past selected arms and past rewards gained. The pseudo-regret $R_T(U)$ of a generic policy $U$ over a UMAB setting is defined as:

$$R_T(U) := T \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_{i_t,t}\right], \qquad (1)$$

where the expected value $\mathbb{E}[\cdot]$ is taken w.r.t. the stochasticity of the gained rewards $X_{i_t,t}$ and of the policy $U$. Let us define the neighborhood of arm $a_i$ as $N(i) := \{j \mid (a_i, a_j) \in E\}$, i.e., the set of each index $j$ of the arm $a_j$ connected to the arm $a_i$ by an edge $(a_i, a_j) \in E$. It has been shown in Combes and Proutiere (2014a) that the problem of learning in a UMAB setting presents a lower bound over the regret $R_T(U)$ of the following form:

Theorem 1. Let $U$ be a uniformly good policy, i.e., a policy s.t. $R_T(U) = o(T^c)$ for each $c > 0$. Given a UMAB setting $G = (A, E)$ we have:

$$\liminf_{T \to \infty} \frac{R_T(U)}{\log(T)} \geq \sum_{i \in N(i^*)} \frac{\mu^* - \mu_i}{KL(\mu_i, \mu^*)}, \qquad (2)$$

where $KL(p, q) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}$, i.e., the Kullback-Leibler divergence of two Bernoulli distributions with means $p$ and $q$, respectively.

This result is similar to the one provided in Lai and Robbins (1985), with the only difference that the summation is restricted to the arms lying in the neighborhood of the optimal arm $N(i^*)$, and reduces to it when the optimal arm is connected to all the others (i.e., $N(i^*) = \{1, \ldots, K\}$) or the graph is completely connected (i.e., $N(i) = \{1, \ldots, K\}$, $\forall i$). We would like to point out that by relying on the assumption of having a single maximum of the expected rewards, we also assure that the optimal arm neighborhood $N(i^*)$ is uniquely defined and, thus, the lower bound inequality in Equation (2) is well defined.

The UTS algorithm

We describe the UTS algorithm and we show that its regret is asymptotically optimal, i.e., it asymptotically matches the lower bound of Theorem 1. The algorithm is an extension of Thompson Sampling (Thompson 1933) that exploits the graph structure and the unimodal property of the UMAB setting. Basically, the rationale of the algorithm is to apply a simple variation of the TS algorithm to only the arms associated with the nodes that compose the neighborhood of the arm with the highest empirical mean reward, called leader.

The pseudo-code

The pseudo-code of the UTS algorithm is presented in Algorithm 1. The algorithm receives in input the graph structure $G$, the time horizon $T$, and a Bayesian prior $\pi_i$ for each expected reward $\mu_i$. At each round $t$, the algorithm computes the empirical expected reward for each arm (Line 3):

$$\hat{\mu}_{i,t} := \begin{cases} \dfrac{S_{i,t}}{T_{i,t}} & \text{if } T_{i,t} > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $S_{i,t} = \sum_{h=1}^{t-1} X_{i,h} \mathbb{1}\{U(h) = a_i\}$ is the cumulative reward of arm $a_i$ up to round $t$ and $T_{i,t} = \sum_{h=1}^{t-1} \mathbb{1}\{U(h) = a_i\}$ is the number of times the arm $a_i$ has been pulled up to round $t$.
After that, UTS selects the arm denoted as the leader $a_{l(t)}$ for round $t$, i.e., the one having the maximum empirical expected reward:

$$a_{l(t)} = \arg\max_{a_i \in A} \hat{\mu}_{i,t}.$$

(Footnote: we here denote with $\mathbb{1}\{\cdot\}$ the indicator function.)

Algorithm 1: UTS
1: Input: UMAB setting $G = (A, E)$, Horizon $T$, Priors $\{\pi_i\}_{i=1}^{K}$
2: for $t \in \{1, \ldots, T\}$ do
3:     Compute $\hat{\mu}_{i,t}$ for each $i \in \{1, \ldots, K\}$
4:     Find the leader $a_{l(t)}$
5:     if $L_{l(t),t} \bmod |N^+(l(t))| = 0$ then
6:         Collect reward $x_{l(t),t}$
7:     else
8:         Draw $\theta_{i,t}$ from $\pi_{i,t}$ for each $i \in N^+(l(t))$
9:         Collect reward $x_{i_t,t}$ where $i_t = \arg\max_i \theta_{i,t}$

Once the leader has been chosen, we restrict the selection procedure to it and its neighborhood, considering only arms with indexes in $N^+(l(t)) := N(l(t)) \cup \{l(t)\}$. Denote with $L_{i,t} := \sum_{h=1}^{t-1} \mathbb{1}\{l(h) = i\}$ the number of times the arm $a_i$ has been selected as leader before round $t$. If $L_{l(t),t}$ is a multiple of $|N^+(l(t))|$, then the leader is pulled and reward $x_{l(t),t}$ is gained (Line 6). Otherwise, the TS algorithm is performed over arms $a_i$ s.t. $i \in N^+(l(t))$ (Lines 8-9). Basically, under the assumption of having a prior $\pi_i$, we can compute the posterior distribution $\pi_{i,t}$ for $\mu_i$ after $t$ rounds, using the information gathered from the rounds in which $a_i$ has been pulled. We denote with $\theta_{i,t}$ a sample drawn from $\pi_{i,t}$, called Thompson sample. For instance, for Bernoulli rewards and by assuming uniform priors we have that $\pi_{i,t} = \mathrm{Beta}(1 + S_{i,t}, 1 + T_{i,t} - S_{i,t})$, where $\mathrm{Beta}(\alpha, \beta)$ is the beta distribution with parameters $\alpha$ and $\beta$. Finally, the algorithm pulls the arm with the largest Thompson sample $\theta_{i_t,t}$ and collects the corresponding reward $x_{i_t,t}$. See Kaufmann, Korda, and Munos (2012) for further details.

Remark 1. Assuming that the UTS algorithm receives in input the whole graph $G$ is unnecessary. The algorithm just requires an oracle that, at each round $t$, is able to return the neighborhood $N(l(t))$ of the arm which is currently the leader $a_{l(t)}$. This is crucial in all the applications in which the graph is discovered by means of a series of queries and the queries have a non-negligible cost (e.g., in social networks a query might be computationally costly).
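The steps of Algorithm 1 can be sketched in Python for the Bernoulli case with uniform priors. This is a minimal illustration rather than the authors' implementation: the function name `uts`, the `neighbors` oracle (in the spirit of Remark 1), the simulated rewards, and the tie-breaking rule for the leader are all assumptions made for the example.

```python
import random


def uts(neighbors, means, horizon, seed=0):
    """Sketch of UTS on simulated Bernoulli arms; returns the pull count per arm.

    neighbors: hypothetical oracle, neighbors(i) -> iterable of indices adjacent to arm i
    means: true expected rewards, used here only to simulate the rewards
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k      # S_{i,t}: cumulative reward of arm i
    pulls = [0] * k          # T_{i,t}: number of pulls of arm i
    leader_count = [0] * k   # L_{i,t}: times arm i was leader before round t

    for _ in range(horizon):
        # Line 3: empirical means, set to 0 for arms never pulled.
        mu_hat = [successes[i] / pulls[i] if pulls[i] > 0 else 0.0 for i in range(k)]
        # Line 4: the leader; ties go to the lowest index (an assumption).
        leader = max(range(k), key=lambda i: mu_hat[i])
        hood = [leader] + [j for j in neighbors(leader) if j != leader]  # N^+(l(t))

        if leader_count[leader] % len(hood) == 0:
            arm = leader  # Lines 5-6: forced pull of the leader
        else:
            # Lines 8-9: Thompson samples from Beta(1 + S, 1 + T - S) posteriors.
            samples = {
                i: rng.betavariate(1 + successes[i], 1 + pulls[i] - successes[i])
                for i in hood
            }
            arm = max(samples, key=samples.get)
        leader_count[leader] += 1

        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
    return pulls
```

For a line graph with `k` arms, the oracle could be `lambda i: [j for j in (i - 1, i + 1) if 0 <= j < k]`. Only the neighborhood of the current leader is ever queried, which is why the whole graph is not needed.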
Finally, we remark that the frequentist counterpart of our algorithm (i.e., the OSUB algorithm) requires the computation of the maximum node degree $\gamma := \max_i |N(i)|$, thus requiring at least an initial analysis of the entire graph $G$. (Footnote: we here denote with $|\cdot|$ the cardinality operator.)

Finite time analysis of UTS

Theorem 2. Given a UMAB setting $G = (A, E)$, the expected pseudo-regret of the UTS algorithm satisfies, for every $\varepsilon > 0$:

$$R_T(\mathrm{UTS}) \leq (1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu^* - \mu_i}{KL(\mu_i, \mu^*)} \left[\log(T) + \log\log(T)\right] + C,$$

where $C > 0$ is a constant depending on $\varepsilon$, the number of arms $K$ and the expected rewards $\{\mu_1, \ldots, \mu_K\}$.
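The sum over the neighborhood of the optimal arm, which appears both in the lower bound of Theorem 1 and in the upper bound of Theorem 2, can be evaluated numerically for Bernoulli arms. The sketch below is illustrative only; the helper names `bernoulli_kl` and `regret_constant` are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), assuming 0 < q < 1."""
    def term(a, b):
        # By convention 0 * log(0 / b) = 0.
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)


def regret_constant(mu, best, neighbors_of_best):
    """Sum over i in N(i*) of (mu* - mu_i) / KL(mu_i, mu*)."""
    return sum(
        (mu[best] - mu[i]) / bernoulli_kl(mu[i], mu[best])
        for i in neighbors_of_best
    )


# Line graph with a triangular profile: only the two neighbors of the peak count.
mu = [0.7, 0.8, 0.9, 0.8, 0.7]
structured = regret_constant(mu, best=2, neighbors_of_best=[1, 3])
unstructured = regret_constant(mu, best=2, neighbors_of_best=[0, 1, 3, 4])
```

Comparing `structured` with `unstructured` shows how restricting the sum to the neighbors of the optimal arm lowers the constant multiplying log(T) relative to the classical bound, which sums over all suboptimal arms.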

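Sketch of proof. The complete version of the proof is reported in Paladino et al. (2016). At first, we remark that a straightforward application of the proof provided for OSUB is not possible in the case of UTS. Indeed, the use of frequentist upper bounds over the expected reward in OSUB implies that in finite time and with high probability the bounds are ordered as the expected values. Since we are using a Bayesian algorithm, we would require the same assurance over the Thompson samples $\theta_{i,t}$, but we do not have a direct bound over $P(\theta_{i,t} > \theta_{i',t})$, where $a_{i'}$ is the optimal arm in the neighborhood $N^+(i)$. This fact requires us to follow a completely different strategy when we analyze the case in which the leader is not the optimal arm.

The regret of the UTS algorithm $R_T(\mathrm{UTS})$ can be divided in two parts: the one obtained during those rounds in which the optimal arm $a_{i^*}$ is the leader, called $R^*$, and the summation of the regrets in the rounds in which the leader is the arm $a_i \neq a_{i^*}$, called $R_i$. $R^*$ is obtained when $a_{i^*}$ is the leader; thus, the algorithm behaves like Thompson Sampling restricted to the optimal arm and its neighborhood $N^+(i^*)$, and the regret upper bound is the one derived in Kaufmann, Korda, and Munos (2012) for the TS algorithm. $R_i$ is upper bounded by the expected number of rounds the arm $a_i$ has been selected as leader, $\mathbb{E}[L_{i,T}]$, over the horizon $T$. Let us consider $\hat{L}_{i,T}$, defined as the number of rounds spent with $a_i$ as leader when restricting the problem to its neighborhood $N^+(i)$. $\mathbb{E}[\hat{L}_{i,T}]$ is an upper bound over $\mathbb{E}[L_{i,T}]$, since there is nonzero probability that the algorithm moves to another neighborhood. Since $i \neq i^*$ and the setting is unimodal, there exists an optimal arm $a_{i'}$, $i' \neq i$, among those in the neighborhood $N(i)$ s.t. $\mu_{i'} = \max_{j \mid a_j \in N(i)} \mu_j$, and $a_i$ can be the leader only if $\hat{\mu}_{i,t} \geq \hat{\mu}_{i',t}$. Thus:

$$
\begin{aligned}
R_i \leq \mathbb{E}[\hat{L}_{i,T}] &= \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\left\{\hat{\mu}_{i,t} = \max_{a_j \in N^+(i)} \hat{\mu}_{j,t}\right\}\right] \\
&= \sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} = \max_{a_j \in N^+(i)} \hat{\mu}_{j,t}\right) \\
&\leq \sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} \geq \hat{\mu}_{i',t}\right) \\
&= \sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} - \mu_i - \frac{\Delta_i}{2} \geq \hat{\mu}_{i',t} - \mu_{i'} + \frac{\Delta_i}{2}\right) \\
&\leq \underbrace{\sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} \geq \mu_i + \frac{\Delta_i}{2}\right)}_{R_{i1}} + \underbrace{\sum_{t=1}^{T} P\left(\hat{\mu}_{i',t} \leq \mu_{i'} - \frac{\Delta_i}{2}\right)}_{R_{i2}},
\end{aligned}
$$

where $\Delta_i = \max_{i' \mid a_{i'} \in N(i)} \mu_{i'} - \mu_i$ is the expected loss incurred in choosing $a_i$ instead of its best adjacent one $a_{i'}$.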
$R_{i1}$ can be upper bounded by a constant by relying on the conditional probability definition and the Hoeffding inequality (Hoeffding 1963). Specifically, we rely on the fact that the leader is chosen at least $L_{l(t),t} / |N^+(l(t))|$ times. Upper bounding $R_{i2}$ by a constant term requires the use of Proposition 1 in Kaufmann, Korda, and Munos (2012), which limits the expected number of times the optimal arm is pulled less than $t^b$ times by a constant $C_b$, where $b \in (0, 1)$ is a constant, and the use of a technique already used on $R_{i1}$. Summing up the regret over $i \neq i^*$ and considering the three obtained bounds concludes the proof.

[Table 1: Results concerning $R_\%(U, \mathrm{OSUB})$ in the setting with $K = 17$ and $K = 129$ and a line graph. Columns: $K = 17$, $K = 129$; rows: KL-UCB, TS, UTS. The numeric entries are not recoverable from the extracted text.]

Experimental Evaluation

In this section, we compare the empirical performance of the proposed algorithm UTS with the performance of a number of algorithms. We study the performance of the state-of-the-art algorithm OSUB (Combes and Proutiere 2014a) to evaluate the improvement due to the employment of Bayesian approaches w.r.t. frequentist approaches. Furthermore, we study the performance of TS (Thompson 1933) to evaluate the improvement in Bayesian approaches due to the exploitation of the problem structure. For completeness, we study also the performance of KL-UCB (Garivier and Cappé 2011), being a frequentist algorithm that is optimal for Bernoulli distributions.

Figures of merit

Given a policy $U$, we evaluate the average and 95% confidence intervals of the following figures of merit:

• the pseudo-regret $R_T(U)$ as defined in Equation (1); the lower $R_T(U)$, the better the performance;
• the regret ratio $R_\%(U, U') = \frac{R_T(U)}{R_T(U')}$, showing the ratio between the total regret of policy $U$ after $T$ rounds and the one obtained with $U'$; the lower $R_\%(U, U')$, the larger the relative improvement of $U$ w.r.t. $U'$.

Line graphs

We initially consider the same experimental settings, composed of line graphs, that are studied in Combes and Proutiere (2014a).
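The figures of merit defined above can be computed from simulation output along the following lines. This is a sketch under stated assumptions: the pseudo-regret of one run is obtained from the pull counts through the standard identity $R_T = \sum_i (\mu^* - \mu_i) T_{i,T}$, and the 95% interval uses a normal approximation; the text does not say how the intervals were computed, and all helper names are hypothetical.

```python
import math
from statistics import mean, stdev


def pseudo_regret(mu, pulls):
    """Pseudo-regret of one run: sum over arms of (mu* - mu_i) * pulls_i."""
    best = max(mu)
    return sum((best - m) * n for m, n in zip(mu, pulls))


def mean_ci95(values):
    """Sample mean and half-width of a normal-approximation 95% interval.

    Requires at least two values (assumption: independent trials).
    """
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return mean(values), half_width


def regret_ratio(regrets_u, regrets_ref):
    """R_%(U, U'): average regret of U divided by average regret of U'."""
    return mean(regrets_u) / mean(regrets_ref)
```

Here `regrets_u` and `regrets_ref` would each hold one pseudo-regret value per independent trial. A ratio below 1 means that policy U accumulated less regret than the reference policy U'.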
They consider graphs with $K \in \{17, 129\}$ arms, where the arms are ordered on a line from the arm with the smallest index to the arm with the largest index and with Bernoulli rewards whose averages have a triangular shape with the maximum on the arm in the middle of the line. More precisely, the minimum average is 0.1, associated with arms $a_1$ and $a_{17}$ when $K = 17$ and with arms $a_1$ and $a_{129}$ when $K = 129$, while the maximum average reward is $\mu^* = 0.9$, associated with arm $a_9$ when $K = 17$ and with arm $a_{65}$ when $K = 129$. The averages decrease linearly from the maximum one to the minimum one. For both the experiments, we average the regret over independent trials of length $T = 10^5$. We report $R_t(U)$ for each policy $U$ as $t$ varies in Fig. 1a, for $K = 17$, and in Fig. 1b, for $K = 129$. The UTS algorithm outperforms all the other algorithms along the whole time horizon, providing a significant improvement in terms of regret w.r.t. the state-of-the-art algorithms. In order to have a more precise evaluation of the reduction of the regret w.r.t. the OSUB algorithm, we report $R_\%(U, \mathrm{OSUB})$ in Tab. 1. As also confirmed below by a more exhaustive series of experiments, in line graphs the relative improvement of performance due to UTS w.r.t. OSUB reduces as the number of arms increases, while the relative improvement of performance due to UTS w.r.t. TS increases as the number of arms increases.

[Figure 1: Results for the pseudo-regret $R_t(U)$ in line graph settings with $K = 17$ (a) and $K = 129$ (b) as defined in Combes and Proutiere (2014a).]

Erdős-Rényi graphs

To provide a thorough experimental evaluation of the considered algorithms in settings in which the space of arms has a graph structure, we generate graphs using the model proposed by Erdős and Rényi (1959), which allows us to simulate graph structures more complex than a simple line. An Erdős-Rényi graph is generated by connecting nodes randomly: each edge is included in the graph with probability $p$, independently from existing edges. We consider connected graphs with six different numbers of arms $K$, starting from $K = 5$, and with probability $p \in \{1, 1/2, \log(K)/K, \ell\}$, where $p = 1$ corresponds to having a fully connected graph, for which the graph structure is therefore useless, $p = 1/2$ corresponds to having a number of edges that increases linearly in the number of nodes, $p = \log(K)/K$ corresponds to having a few edges w.r.t. the nodes, and we use $p = \ell$ to denote line graphs (these line graphs differ from those used for the experimental evaluation discussed above in the reward function, as discussed in what follows). We use different values of $p$ in order to see how the performance of UTS changes w.r.t. the number of edges in the graph; we remark that such an analysis is unexplored in the literature so far. The optimal arm is chosen randomly among the existing arms and its reward is given by a Bernoulli distribution with expected value 0.9. The rewards of the suboptimal arms are given by Bernoulli distributions with expected value depending on their distance from the optimal one.
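A connected Erdős-Rényi graph and the hop distance of every arm from a randomly chosen optimal arm could be generated as sketched below. The text only says that connected graphs are considered; redrawing the graph until it is connected is an assumption made here, and the function names are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_erdos_renyi(k, p, rng, max_tries=10_000):
    """Draw G(k, p) graphs until a connected one is found (rejection sampling)."""
    for _ in range(max_tries):
        adj = {i: set() for i in range(k)}
        for i in range(k):
            for j in range(i + 1, k):
                if rng.random() < p:  # each edge included independently with prob. p
                    adj[i].add(j)
                    adj[j].add(i)
        if len(bfs_distances(adj, 0)) == k:  # every node reachable from node 0
            return adj
    raise RuntimeError("no connected graph found; p may be too small for this k")


rng = random.Random(0)
graph = connected_erdos_renyi(20, 0.5, rng)
optimal = rng.randrange(20)             # optimal arm chosen at random
d = bfs_distances(graph, optimal)       # distance of each arm from the optimum
```

The dictionary `d` holds the shortest-path distance from each arm to the optimal one, which is the quantity the expected reward of a suboptimal arm depends on in this setting.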
More precisely, let $d_i$ be the shortest path from the $i$-th arm to the optimal arm and let $d_{\max} = \max_{i \in \{1, \ldots, K\}} d_i$ be the maximum shortest path of the graph. The expected reward of the $i$-th arm is:

$$\mu_i = 0.9 - \frac{d_i}{d_{\max}} (0.9 - 0.1),$$

i.e., the arm with $d_{\max}$ has a value equal to 0.1 and the expected rewards of the arms along the path from it to the optimal arm are evenly spaced between 0.1 and 0.9. We generate different graphs for each combination of $K$ and $p$ and we run independent trials of length $T = 10^5$ for each graph. We average the regret over the results of the graphs. In Tab. 2, we report $R_T(U)$ for each combination of policy $U$, $K$, and $p$. It can be observed that the UTS algorithm outperforms all the other algorithms, providing in every case the smallest regret except for one value of $K$ with $p = \ell$. Below we discuss how the relative performance of the algorithms varies as the values of the parameters $K$ and $p$ vary.

[Table 2: Results concerning $R_T(U)$ ($T = 10^5$) in the setting with Erdős-Rényi graphs. One block per value of $K$; columns: $p = 1$, $p = 1/2$, $p = \log(K)/K$, $p = \ell$; rows: KL-UCB, TS, OSUB, UTS. The numeric entries are not recoverable from the extracted text.]

Consider the case with $p = 1$. The performance of UTS and TS are approximately equal and the same holds for the performance of OSUB and KL-UCB. This is due to the fact that the neighborhood of each node is composed of all the arms, the graphs being fully connected, and therefore UTS and OSUB cannot take any advantage from the structure of the problem. We notice, however, that UTS and TS do not have the same behavior and that UTS always performs slightly better than TS. It can be observed in Fig. 2 with $K = 5$ and $p = 1$ that the relative improvement is mainly at the beginning of the time horizon and that it goes to zero as $t$ increases (the same holds for OSUB w.r.t. KL-UCB).
The reason behind this behavior is that UTS reduces the exploration performed by TS in the first rounds, forcing the algorithm to pull the leader (chosen as the arm maximizing the empirical mean) for a larger number of rounds.

[Figure 2: Results for the pseudo-regret $R_t(U)$ in the setting with $K = 5$ and $p = 1$.]

Consider the case with $p = 1/2$. In the considered experimental setting, the relative performance of the algorithms does not depend on $K$. The ordering, from the best to the worst, over the performance of the algorithms is: UTS, TS, OSUB, and finally KL-UCB. Surprisingly, even the dependency of the following ratios on $K$ is negligible: $R_\%(\mathrm{UTS}, \mathrm{TS}) \approx 0.68$, $R_\%(\mathrm{UTS}, \mathrm{OSUB}) \approx 0.47$, and $R_\%(\cdot, \text{KL-UCB}) \approx 0.68$. This shows that the relative improvement due to UTS is constant w.r.t. TS and OSUB as $K$ varies. These results raise the question of whether the relative performance of OSUB and TS would be the same, except for the numerical values, for every $p$ constant w.r.t. $K$. To answer this question, we consider a smaller constant value of $p$, corresponding to the case in which the number of edges is linear in $K$, but smaller than in the case with $p = 1/2$. The results in terms of $R_T(U)$, reported in Table 3, show that OSUB outperforms TS for some values of $K$, suggesting that, when $p$ is constant in $K$, TS may or may not outperform OSUB depending on the specific pair $(p, K)$.

Consider the case with $p = \log(K)/K$. The ordering over the performance of the algorithms changes as $K$ varies. More precisely, while UTS remains the best algorithm for every $K$ and KL-UCB the worst algorithm for every $K$, the ordering between TS and OSUB changes. For small $K$, TS performs better than OSUB, whereas for large $K$ OSUB outperforms TS (see Fig. 3). This is due to the fact that, with a small number of arms, exploiting the graph structure is not sufficient for a frequentist algorithm to outperform TS, while with many arms exploiting the graph structure even with a frequentist algorithm is much better than employing a general-purpose Bayesian algorithm. The ratio $R_\%(\mathrm{UTS}, \mathrm{TS})$ monotonically decreases as $K$ increases, starting from 0.66 when $K = 5$, suggesting that exploiting the graph structure provides advantages as $K$ increases.

[Table 3: Results concerning $R_T(U)$ ($T = 10^5$) in the setting with Erdős-Rényi graphs and constant $p$. The numeric entries are not recoverable from the extracted text.]

[Figure 3: Results for the pseudo-regret $R_t(U)$ in the setting with $K = 5$ (a) and a larger $K$ (b) and $p = \log(K)/K$.]
Instead, the ratio $R_\%(\mathrm{UTS}, \mathrm{OSUB})$ monotonically increases as $K$ increases, from 0.45 when $K = 5$ to 0.94 for the largest $K$, suggesting that the improvement provided by employing Bayesian approaches reduces as $K$ increases (as observed above in line graphs).

Consider the case with $p = \ell$. As in the case discussed above, OSUB is outperformed by TS for a small number of arms, while it outperforms TS for many arms. The reason is the same as above. Similarly, the ratio $R_\%(\mathrm{UTS}, \mathrm{TS})$ monotonically decreases as $K$ increases, starting from 0.58 when $K = 5$, and the ratio $R_\%(\mathrm{UTS}, \mathrm{OSUB})$ monotonically increases as $K$ increases, starting from 0.45 when $K = 5$. This confirms that the performance of UTS and that of OSUB asymptotically match as $K$ increases when $p = \ell$ as well as when $p = \log(K)/K$.

In order to investigate the reasons behind such a behavior, we produce an additional experiment with the line graphs of Combes and Proutiere (2014a), except that the maximum expected reward is lowered for both $K = 17$ and $K = 129$, so that, given any edge with terminals $i$ and $i + 1$, the difference $|\mu_i - \mu_{i+1}|$ is constant and very small. What we observe (details of these experiments and those described below are in Paladino et al. (2016)) is that, on average, OSUB outperforms UTS at $T = 10^5$, suggesting that, when it is necessary to repeatedly distinguish between three arms that have very similar expected rewards, frequentist methods may outperform the Bayesian ones. This is no longer true when $T$ is much larger, e.g., $T = 10^7$, where UTS outperforms OSUB (interestingly, differently from what happens in the other topologies, in line graphs with very small $|\mu_i - \mu_{i+1}|$, the averages $R_T(\mathrm{UTS})$ and $R_T(\mathrm{OSUB})$ cross a number of times during the time horizon). Furthermore, we evaluate how the relative performance of OSUB w.r.t. UTS varies for several values of $|\mu_i - \mu_{i+1}|$, observing that it improves as $|\mu_i - \mu_{i+1}|$ decreases. Finally, we evaluate whether this behavior emerges also in Erdős-Rényi graphs in which $p = c/K$, where $c$ is a constant (we use $p = 5/K$), and we observe that UTS outperforms OSUB, suggesting that line graphs with very small $|\mu_i - \mu_{i+1}|$ are pathological instances for UTS.

Conclusions and Future Work

In this paper, we focus on the Unimodal Multi-Armed Bandit problem with graph structure, in which each arm corresponds to a node of a graph and each edge is associated with a relationship in terms of expected reward between its arms. We propose, to the best of our knowledge, the first Bayesian algorithm for the UMAB setting, called UTS, which is based on the well-known Thompson Sampling algorithm. We derive a tight upper bound for UTS that asymptotically matches the lower bound for the UMAB setting, providing a non-trivial derivation of the bound. Furthermore, we present a thorough experimental analysis showing that our algorithm outperforms the state-of-the-art methods.

In the future, we will evaluate the performance of the algorithms considered in this paper with other classes of graphs, e.g., Barabási-Albert and lattices. Future development of this work may consider an analysis of the proposed algorithm in the case of time-varying environments, i.e., where the expected reward of each arm varies over time, assuming that the unimodal structure is preserved. Another interesting study may consider the case of a continuous decision space.

References

Alon, N.; Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2013. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems (NIPS).
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.
Caron, S., and Bhagat, S. 2013. Mixing bandits: A recipe for improved cold-start recommendations in a social network. In Proceedings of the Workshop on Social Network Mining and Analysis (SNA-KDD).
Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS).
Combes, R., and Proutiere, A. 2014a. Unimodal bandits: Regret lower bounds and optimal algorithms. In Proceedings of the International Conference on Machine Learning (ICML).
Combes, R., and Proutiere, A. 2014b. Unimodal bandits without smoothness. arXiv preprint.
Crandall, D.; Cosley, D.; Huttenlocher, D.; Kleinberg, J.; and Suri, S. 2008. Feedback effects between similarity and social influence in online communities. In Proceedings of the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD).
Edelman, B., and Ostrovsky, M. 2007. Strategic bidder behavior in sponsored search auctions. Decision Support Systems 43(1):192-198.
Erdős, P., and Rényi, A. 1959. On random graphs I. Publicationes Mathematicae Debrecen 6:290-297.
Garivier, A., and Cappé, O. 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the Conference on Learning Theory (COLT).
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13-30.
Jia, Y. Y., and Mannor, S. 2011. Unimodal bandits. In Proceedings of the International Conference on Machine Learning (ICML).
Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson sampling: An asymptotically optimal finite-time analysis. In Proceedings of the Algorithmic Learning Theory (ALT), 199-213.
Kleinberg, R.; Slivkins, A.; and Upfal, E. 2008. Multi-armed bandits in metric spaces. In Proceedings of the Symposium on Theory of Computing (STOC).
Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4-22.
Mannor, S., and Shamir, O. 2011. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems (NIPS).
Mas-Colell, A.; Whinston, M. D.; and Green, J. R. 1995. Microeconomic Theory. Oxford University Press.
McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology.
Paladino, S.; Trovò, F.; Restelli, M.; and Gatti, N. 2016. Unimodal Thompson sampling for graph-structured arms. arXiv preprint.
Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25.
Trovò, F.; Paladino, S.; Restelli, M.; and Gatti, N. 2015. Multi-armed bandit for pricing. In Proceedings of the European Workshop on Reinforcement Learning (EWRL).
Valko, M.; Munos, R.; Kveton, B.; and Kocák, T. 2014. Spectral bandits for smooth graph functions. In Proceedings of the International Conference on Machine Learning (ICML).
Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015. Thompson sampling for budgeted multi-armed bandits. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
