arXiv: v2 [cs.LG] 22 Nov 2016

Unimodal Thompson Sampling for Graph-Structured Arms

Stefano Paladino, Francesco Trovò, Marcello Restelli, and Nicola Gatti
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
{stefano.paladino, francesco.trovo, marcello.restelli, nicola.gatti}@polimi.it

Abstract

We study, to the best of our knowledge, the first Bayesian algorithm for unimodal Multi-Armed Bandit (MAB) problems with graph structure. In this setting, each arm corresponds to a node of a graph and each edge provides a relationship, unknown to the learner, between two nodes in terms of expected reward. Furthermore, for any node of the graph there is a path leading to the unique node providing the maximum expected reward, along which the expected reward is monotonically increasing. Previous results on this setting describe the behavior of frequentist MAB algorithms. In our paper, we design a Thompson Sampling-based algorithm whose asymptotic pseudo-regret matches the lower bound for the considered setting. We show that, as happens in a wide number of scenarios, Bayesian MAB algorithms dramatically outperform frequentist ones. In particular, we provide a thorough experimental evaluation of the performance of our and state-of-the-art algorithms as the properties of the graph vary.

Introduction

Multi-Armed Bandit (MAB) algorithms (Auer, Cesa-Bianchi, and Fischer 2002) have been proven to provide effective solutions for a wide range of applications fitting the sequential decision making scenario. In this framework, at each round over a finite horizon $T$, the learner selects an action (usually called arm) from a finite set and observes only the reward corresponding to the choice she made. The goal of a MAB algorithm is to converge to the optimal arm, i.e., the one with the highest expected reward, while minimizing the loss incurred in the learning process. Its performance is therefore measured through its expected regret, defined as the difference between the expected reward achieved by an oracle algorithm always selecting the optimal arm and the one achieved by the considered algorithm.
We focus on the so-called Unimodal MAB (UMAB), introduced in (Combes and Proutiere 2014a), in which each arm corresponds to a node of a graph and each edge is associated with a relationship specifying which node of the edge gives the largest expected reward (providing thus a partial ordering over the arm space). Furthermore, from any node there is a path leading to the unique node with the maximum expected reward along which the expected reward is monotonically increasing. While the graph structure may not necessarily be known a priori by the UMAB algorithm, the relationship defined over the edges is discovered during the learning. In the present paper, we propose a novel algorithm relying on the Bayesian learning approach for a generic UMAB setting.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Models presenting a graph structure have become more and more interesting in recent years due to the spread of social networks. Indeed, the relationships among the entities of a social network have a natural graph structure. A practical problem in this scenario is the targeted advertisement problem, whose goal is to discover the part of the network that is interested in a given product. This task is heavily influenced by the graph structure, since in social networks people tend to have similar characteristics to those of their friends (i.e., neighbor nodes in the graph). Therefore, the interests of people in a social network change smoothly and neighboring nodes in the graph look similar to each other (McPherson, Smith-Lovin, and Cook 2001; Crandall et al. 2008). More specifically, an advertiser aims at finding those users that maximize the ad expected revenue (i.e., the product between click probability and value per click), while at the same time reducing the number of times the advertisement is presented to people not interested in its content. Under the assumption of unimodal expected reward, the learner can move from low expected rewards to high ones just by climbing them in the graph, avoiding the need for a uniform exploration over all the graph nodes.
This assumption reduces the complexity in the search for the optimal arm, since the learning algorithm can avoid pulling the arms corresponding to some subset of non-optimal nodes, thus reducing the regret. Other applications might benefit from this structure, e.g., recommender systems which aim at coupling items with those users that are likely to enjoy them. Similarly, the use of the unimodal graph structure might provide more meaningful recommendations without testing all the users in the social network. Finally, notice that unimodal problems with a single variable, e.g., in sequential pricing (Jia and Mannor 2011), bidding in online sponsored search auctions (Edelman and Ostrovsky 2007) and single-peak preferences in economics and voting settings (Mas-Colell, Whinston, and Green 1995), are graph-structured problems in which the graph is a line.

Frequentist approaches for UMAB with graph structure are proposed in (Jia and Mannor 2011) and (Combes and Proutiere 2014a). Jia and Mannor (2011) introduce the GLSE algorithm with a regret of order $O(\sqrt{T}\log T)$. However, GLSE performs better than classical bandit algorithms only when the number of arms is $\Theta(T)$. Combes and Proutiere (2014a) present the OSUB algorithm, based on KL-UCB, achieving asymptotic regret of $O(\log T)$ and outperforming GLSE in settings with a few arms. To the best of our knowledge, no Bayesian approach has been proposed for unimodal bandit settings, including the UMAB setting we study. However, it is well known that Bayesian MAB algorithms (the most popular is Thompson Sampling, TS) usually suffer the same order of regret as the best frequentist one (e.g., in unstructured settings (Kaufmann, Korda, and Munos 2012)), but they outperform the frequentist methods in a wide range of problems (e.g., in bandit problems without structure (Chapelle and Li 2011) and in bandit problems with budget (Xia et al. 2015)). Furthermore, in problems with structure, the classical Thompson Sampling (not exploiting the problem structure) may outperform frequentist algorithms exploiting the problem structure. For this reason, in this paper we explore Bayesian approaches for the UMAB setting. More precisely, we provide the following original contributions: we design a novel Bayesian MAB algorithm, called UTS and based on the TS algorithm; we derive a tight upper bound over the pseudo-regret for UTS, which asymptotically matches the lower bound for the UMAB setting; we describe a wide experimental campaign showing better performance of UTS in applicative scenarios than those of state-of-the-art algorithms, evaluating also how the performance of the algorithms (ours and of the state of the art) varies as the graph structure properties vary.

Related work

Here, we mention the main works related to ours. Some works deal with unimodal reward functions in the continuous armed bandit setting (Jia and Mannor 2011; Combes and Proutiere 2014b; Kleinberg, Slivkins, and Upfal 2008). In (Jia and Mannor 2011) a successive elimination algorithm, called LSE, is proposed achieving regret of $O(\sqrt{T}\log T)$. In this case, assumptions over the minimum local decrease and increase of the expected reward are required.
Combes and Proutiere (2014b) consider stochastic bandit problems with a continuous set of arms and where the expected reward is a continuous and unimodal function of the arm. They propose the SP algorithm, based on the stochastic pentachotomy procedure to narrow the search space. Unimodal MABs on metric spaces are studied in (Kleinberg, Slivkins, and Upfal 2008). An application-dependent solution for recommendation systems, which exploits the similarity of the graph in social networks in targeted advertisement, has been proposed in (Valko et al. 2014). Similar information has been considered in (Caron and Bhagat 2013), where the problem of cold-start users (i.e., new users) is studied. Another type of structure considered in sequential games is the monotonicity of the conversion rate in the price (Trovò et al. 2015). Interestingly, the assumptions of monotonicity and unimodality are orthogonal, none of them being a special case of the other; therefore the results for the monotonic setting cannot be used in unimodal bandits. In (Alon et al. 2013; Mannor and Shamir 2011), a graph structure of the arm feedback in an adversarial setting is studied. More precisely, they assume to have correlation over rewards and not over the expected values of arms.

Problem Formulation

A learner receives in input a finite undirected graph MAB setting $G = (A, E)$, whose vertices $A = \{a_1, \ldots, a_K\}$ with $K \in \mathbb{N}$ correspond to the arms, and an edge $(a_i, a_j) \in E$ exists only if there is a direct partial order relationship between the expected rewards of arms $a_i$ and $a_j$. The learner knows a priori the nodes and the edges (i.e., she knows the graph), but, for each edge, she does not know a priori which is the node of the edge with the largest expected reward (i.e., she does not know the ordering relationship). At each round $t$ over a time horizon of $T \in \mathbb{N}$ the learner selects an arm $a_i$ and gains the corresponding reward $x_{i,t}$. This reward is drawn from an i.i.d. random variable $X_{i,t}$ (i.e., we consider a stochastic MAB setting) characterized by an unknown distribution $D_i$ with finite known support $\Omega \subset \mathbb{R}$ (as customary in MAB settings, from now on we consider $\Omega \subseteq [0, 1]$) and by unknown expected value $\mu_i := \mathbb{E}[X_{i,t}]$. We assume that there is a single optimal arm, i.e., there exists a unique arm $a_{i^*}$ s.t. its expected value is $\mu_{i^*} = \max_i \mu_i$ and, for sake of notation, we denote $\mu_{i^*}$ with $\mu^*$. Here we analyze a graph bandit setting with the unimodality property, defined as:

Definition 1. A graph unimodal MAB (UMAB) setting $G = (A, E)$ is a graph bandit setting $G$ s.t. for each sub-optimal arm $a_i$, $i \neq i^*$, there exists a finite path $p = (i_1 = i, \ldots, i_m = i^*)$ s.t. $\mu_{i_k} < \mu_{i_{k+1}}$ and $(a_{i_k}, a_{i_{k+1}}) \in E$ for each $k \in \{1, \ldots, m-1\}$.

This definition assures that if one is able to identify a non-decreasing path in $G$ of expected rewards, she will be able to reach the optimum arm, without getting stuck in local optima. Note that the unimodality property implies that the graph $G$ is connected and therefore we consider only connected graphs from here on.

A policy $U$ over a UMAB setting is a procedure able to select at each round $t$ an arm $a_{i_t}$ by basing on the history $h_t$, i.e., the sequence of past selected arms and past rewards gained. The pseudo-regret $R_T(U)$ of a generic policy $U$ over a UMAB setting is defined as:

$$R_T(U) := T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_{i_t,t}\right], \quad (1)$$

where the expected value $\mathbb{E}[\cdot]$ is taken w.r.t. the stochasticity of the gained rewards $X_{i_t,t}$ and of the policy $U$. Let us define the neighborhood of arm $a_i$ as $N(i) := \{j \mid (a_i, a_j) \in E\}$, i.e., the set of each index $j$ of the arm $a_j$ connected to the arm $a_i$ by an edge $(a_i, a_j) \in E$. It has been shown in (Combes and Proutiere 2014a) that the problem of learning in a UMAB setting presents a lower bound over the regret $R_T(U)$ of the following form:

Theorem 1. Let $U$ be a uniformly good policy, i.e., a policy s.t. $R_T(U) = o(T^c)$ for each $c > 0$. Given a UMAB setting $G = (A, E)$ we have:

$$\liminf_{T \to \infty} \frac{R_T(U)}{\log T} \geq \sum_{i \in N(i^*)} \frac{\mu^* - \mu_i}{\mathrm{KL}(\mu_i, \mu^*)}, \quad (2)$$

where $\mathrm{KL}(p, q) = p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}$, i.e., the Kullback-Leibler divergence of two Bernoulli distributions with means $p$ and $q$, respectively.

This result is similar to the one provided in (Lai and Robbins 1985), with the only difference that the summation is restricted to the arms lying in the neighborhood of the optimal arm $N(i^*)$, and reduces to it when the optimal arm is connected to all the others (i.e., $N(i^*) \equiv \{1, \ldots, K\}$) or the graph is completely connected (i.e., $N(i) \equiv \{1, \ldots, K\}$, $\forall i$). We would like to point out that by relying on the assumption of having a single maximum of the expected rewards, we also assure that the optimal arm neighborhood $N(i^*)$ is uniquely defined and, thus, the lower bound inequality in Equation (2) is well defined.

The UTS algorithm

We describe the UTS algorithm and we show that its regret is asymptotically optimal, i.e., it asymptotically matches the lower bound of Theorem 1. The algorithm is an extension of Thompson Sampling (Thompson 1933) that exploits the graph structure and the unimodal property of the UMAB setting. Basically, the rationale of the algorithm is to apply a simple variation of the TS algorithm only to the arms associated with the nodes that compose the neighborhood of the arm with the highest empirical mean reward, called leader.

The pseudo-code

Algorithm 1 UTS
1: Input: UMAB setting $G = (A, E)$, Horizon $T$, Priors $\{\pi_i\}_{i=1}^{K}$
2: for $t \in \{1, \ldots, T\}$ do
3:     Compute $\hat{\mu}_{i,t}$ for each $i \in \{1, \ldots, K\}$
4:     Find the leader $a_{l(t)}$
5:     if $L_{l(t)}(t) \bmod |N^+(l(t))| = 0$ then
6:         Collect reward $x_{l(t),t}$
7:     else
8:         Draw $\theta_{i,t}$ from $\pi_{i,t}$ for each $i \in N^+(l(t))$
9:         Collect reward $x_{i_t,t}$ where $i_t = \arg\max_i \theta_{i,t}$

The pseudo-code of the UTS algorithm is presented in Algorithm 1. The algorithm receives in input the graph structure $G$, the time horizon $T$, and a Bayesian prior $\pi_i$ for each expected reward $\mu_i$.
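The pseudo-code above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes Bernoulli rewards with uniform Beta(1, 1) priors, and the function name `uts_bernoulli`, the adjacency-dictionary graph encoding and the 5-arm example instance are all hypothetical. The true means are passed in only to simulate rewards and to accumulate the pseudo-regret of a single run.

```python
import random


def uts_bernoulli(neighbors, means, horizon, rng):
    """UTS-style policy on Bernoulli arms with uniform Beta(1, 1) priors.

    neighbors: dict arm -> list of adjacent arms (assumed graph encoding).
    means: true expected rewards, used only for simulation and regret.
    """
    n_arms = len(means)
    successes = [0] * n_arms      # S_i: cumulative reward of arm i
    pulls = [0] * n_arms          # T_i: number of pulls of arm i
    leader_count = [0] * n_arms   # L_i: rounds in which arm i was leader
    best_mean = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Line 3: empirical means (0 for arms never pulled).
        mu_hat = [s / n if n else 0.0 for s, n in zip(successes, pulls)]
        # Line 4: the leader is the arm with the highest empirical mean.
        leader = max(range(n_arms), key=mu_hat.__getitem__)
        hood = [leader] + [j for j in neighbors[leader] if j != leader]
        if leader_count[leader] % len(hood) == 0:
            arm = leader  # Lines 5-6: forced pull of the leader
        else:
            # Lines 8-9: Thompson step restricted to the leader's
            # extended neighborhood N+(l).
            samples = {
                i: rng.betavariate(1 + successes[i],
                                   1 + pulls[i] - successes[i])
                for i in hood
            }
            arm = max(hood, key=samples.__getitem__)
        leader_count[leader] += 1
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        pulls[arm] += 1
        regret += best_mean - means[arm]
    return regret, pulls


# Hypothetical 5-arm line graph with a single peak in the middle.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
mu = [0.1, 0.5, 0.9, 0.5, 0.1]
regret, pulls = uts_bernoulli(line, mu, horizon=2000, rng=random.Random(0))
```

Note that the policy only ever queries `neighbors[leader]`, so the full graph does not have to be held in memory; an oracle returning the neighborhood of the current leader would be enough.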
At each round $t$, the algorithm computes the empirical expected reward for each arm (Line 3):

$$\hat{\mu}_{i,t} := \begin{cases} \dfrac{S_{i,t}}{T_{i,t}} & \text{if } T_{i,t} > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $S_{i,t} = \sum_{h=1}^{t-1} X_{i,h} \mathbb{1}\{U(h) = a_i\}$ is the cumulative reward of arm $a_i$ up to round $t$ and $T_{i,t} = \sum_{h=1}^{t-1} \mathbb{1}\{U(h) = a_i\}$ is the number of times the arm $a_i$ has been pulled up to round $t$ (here $\mathbb{1}\{\cdot\}$ denotes the indicator function). After that, UTS selects the arm denoted as the leader $a_{l(t)}$ for round $t$, i.e., the one having the maximum empirical expected reward:

$$a_{l(t)} = \arg\max_{a_i \in A} \hat{\mu}_{i,t}. \quad (3)$$

Once the leader has been chosen, we restrict the selection procedure to it and its neighborhood, considering only arms with indexes in $N^+(l(t)) := N(l(t)) \cup \{l(t)\}$. Denote with $L_i(t) := \sum_{h=1}^{t-1} \mathbb{1}\{l(h) = i\}$ the number of times the arm $a_i$ has been selected as leader before round $t$. If $L_{l(t)}(t)$ is a multiple of $|N^+(l(t))|$ (here $|\cdot|$ denotes the cardinality operator), then the leader is pulled and reward $x_{l(t),t}$ is gained (Line 6). Otherwise, the TS algorithm is performed over arms $a_i$ s.t. $i \in N^+(l(t))$ (Lines 8-9). Basically, under the assumption of having a prior $\pi_i$, we can compute the posterior distribution $\pi_{i,t}$ for $\mu_i$ after $t$ rounds, using the information gathered from the rounds in which $a_i$ has been pulled. We denote with $\theta_{i,t}$ a sample drawn from $\pi_{i,t}$, called Thompson sample. For instance, for Bernoulli rewards and by assuming uniform priors we have that $\pi_{i,t} = \mathrm{Beta}(1 + S_{i,t}, 1 + T_{i,t} - S_{i,t})$, where $\mathrm{Beta}(\alpha, \beta)$ is the beta distribution with parameters $\alpha$ and $\beta$. Finally, the algorithm pulls the arm $a_{i_t}$ with the largest Thompson sample $\theta_{i_t,t}$ and collects the corresponding reward $x_{i_t,t}$. See (Kaufmann, Korda, and Munos 2012) for further details.

Remark 1. Assuming that the algorithm receives in input the whole graph $G$ is unnecessary. The algorithm just requires an oracle that, at each round $t$, is able to return the neighborhood $N(l(t))$ of the arm which is currently the leader $a_{l(t)}$. This is crucial in all the applications in which the graph is discovered by means of a series of queries and the queries have a non-negligible cost (e.g., in social networks a query might be computationally costly). Finally, we remark that the frequentist counterpart of our algorithm (i.e., the OSUB algorithm) requires the computation of the maximum node degree $\gamma := \max_i |N(i)|$, thus requiring at least an initial analysis of the entire graph $G$.

Finite time analysis of UTS

Theorem 2. Given a UMAB setting $G = (A, E)$, the expected pseudo-regret of the UTS algorithm satisfies, for every $\varepsilon > 0$:

$$R_T(\mathrm{UTS}) \leq (1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu^* - \mu_i}{\mathrm{KL}(\mu_i, \mu^*)} \left[\log T + \log\log T\right] + C,$$

where $C > 0$ is a constant depending on $\varepsilon$, the number of arms $K$ and the expected rewards $\{\mu_1, \ldots, \mu_K\}$.

Sketch of proof. The complete version of the proof is reported in the appendices. At first, we remark that a straightforward application of the proof provided for OSUB is not possible in the case of UTS. Indeed, the use of frequentist upper bounds over the expected reward in OSUB implies that in finite time and with high probability the bounds are ordered as the expected values. Since we are using a Bayesian algorithm, we would require the same assurance over the Thompson samples $\theta_{i,t}$, but we do not have a direct bound over $P(\theta_{i,t} > \theta_{i',t})$, where $a_{i'}$ is the optimal arm in the neighborhood $N^+(i)$. This fact requires following a completely different strategy when we analyze the case in which the leader is not the optimal arm.

The regret of the UTS algorithm $R_T(\mathrm{UTS})$ can be divided in two parts: the one obtained during those rounds in which the optimal arm $a_{i^*}$ is the leader, called $R_1$, and the summation of the regrets in the rounds in which the leader is an arm $a_i \neq a_{i^*}$, called $R_i$. $R_1$ is obtained when $a_{i^*}$ is the leader; thus, the algorithm behaves like Thompson Sampling restricted to the optimal arm and its neighborhood $N^+(i^*)$, and the regret upper bound is the one derived in (Kaufmann, Korda, and Munos 2012) for the TS algorithm. $R_i$ is upper bounded by the expected number of rounds the arm $a_i$ has been selected as leader, $\mathbb{E}[L_i(T)]$, over the horizon $T$. Let us consider $\hat{L}_i(T)$, defined as the number of rounds spent with $a_i$ as leader when restricting the problem to its neighborhood $N^+(i)$. $\mathbb{E}[\hat{L}_i(T)]$ is an upper bound over $\mathbb{E}[L_i(T)]$, since there is nonzero probability that the algorithm moves to another neighborhood. Since $i \neq i^*$ and the setting is unimodal, there exists an optimal arm $a_{i'}$ among those in the neighborhood $N(i)$ s.t. $\mu_{i'} = \max_{j : a_j \in N(i)} \mu_j$ and $\mu_{i'} > \mu_i$. Thus:

$$R_i \leq \mathbb{E}[\hat{L}_i(T)] = \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\left\{\hat{\mu}_{i,t} = \max_{a_j \in N^+(i)} \hat{\mu}_{j,t}\right\}\right] = \sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} = \max_{a_j \in N^+(i)} \hat{\mu}_{j,t}\right) \leq \sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} \geq \hat{\mu}_{i',t}\right)$$

$$= \sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} - \mu_i \geq \hat{\mu}_{i',t} - \mu_{i'} + \Delta_i\right) \leq \underbrace{\sum_{t=1}^{T} P\left(\hat{\mu}_{i,t} - \mu_i \geq \frac{\Delta_i}{2}\right)}_{R_{i1}} + \underbrace{\sum_{t=1}^{T} P\left(\hat{\mu}_{i',t} - \mu_{i'} \leq -\frac{\Delta_i}{2}\right)}_{R_{i2}},$$

where $\Delta_i = \max_{j : a_j \in N(i)} \mu_j - \mu_i$ is the expected loss incurred in choosing $a_i$ instead of its best adjacent one $a_{i'}$. $R_{i1}$ can be upper bounded by a constant by relying on the conditional probability definition and the Hoeffding inequality (Hoeffding 1963). Specifically, we rely on the fact that the leader is chosen at least $\lfloor L_{l(t)}(t) / |N^+(l(t))| \rfloor$ times. Upper bounding $R_{i2}$ by a constant term requires the use of Proposition 1 in (Kaufmann, Korda, and Munos 2012), which limits the expected number of times the optimal arm is pulled less than $t^b$ times by TS, where $b \in (0, 1)$ is a constant, and the use of a technique already used on $R_{i1}$. Summing up the regret over $i \neq i^*$ and considering the three obtained bounds concludes the proof.

Table 1: Results concerning $R_\%(U, \mathrm{OSUB})$ in the setting with $K = 17$ and $K = 129$ and a line graph.

U | K = 17 | K = 129
KL-UCB | 3.8 ± .5 | 6.5 ± .7
TS | .34 ± .7 | .68 ± .5
UTS | .5 ± .7 | .76 ± .5

Experimental Evaluation

In this section, we compare the empirical performance of the proposed algorithm UTS with the performance of a number of algorithms. We study the performance of the state-of-the-art algorithm OSUB (Combes and Proutiere 2014a) to evaluate the improvement due to the employment of Bayesian approaches w.r.t. frequentist approaches. Furthermore, we study the performance of TS (Thompson 1933) to evaluate the improvement in Bayesian approaches due to the exploitation of the problem structure. For completeness, we study also the performance of KL-UCB (Garivier and Cappé 2011), being a frequentist algorithm that is optimal for Bernoulli distributions.

Figures of merit

Given a policy $U$, we evaluate the average and 95% confidence intervals of the following figures of merit: the pseudo-regret $R_T(U)$ as defined in Equation (1), where the lower $R_T(U)$ the better the performance; the regret ratio $R_\%(U, U') = \frac{R_T(U)}{R_T(U')}$, showing the ratio between the total regret of policy $U$ after $T$ rounds and the one obtained with $U'$, where the lower $R_\%(U, U')$ the larger the relative improvement of $U$ w.r.t. $U'$.

Line graphs

We initially consider the same experimental settings, composed of line graphs, that are studied in (Combes and Proutiere 2014a). They consider graphs with $K \in \{17, 129\}$ arms, where the arms are ordered on a line from the arm with the smallest index to the arm with the largest index and with Bernoulli rewards whose averages have a triangular shape with the maximum on the arm in the middle of the line. More precisely, the minimum average is 0.1, associated with arms $a_1$ and $a_{17}$ when $K = 17$ and with arms $a_1$ and $a_{129}$ when $K = 129$, while the maximum average reward is $\mu^* = 0.9$, associated with arm $a_9$ when $K = 17$ and with arm $a_{65}$ when $K = 129$. The averages decrease linearly from the maximum one to the minimum one.
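The triangular line-graph instances described above can be generated with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, arms are 0-indexed (so the middle arm $a_9$ of a 17-arm line is index 8), $K$ is assumed odd, and the endpoint values are left as parameters.

```python
def line_graph(n_arms):
    """Adjacency lists of a line: arm i is linked to arms i - 1 and i + 1."""
    return {
        i: [j for j in (i - 1, i + 1) if 0 <= j < n_arms]
        for i in range(n_arms)
    }


def triangular_means(n_arms, mu_min=0.1, mu_max=0.9):
    """Means peaking at the middle arm and decreasing linearly to both ends.

    Assumes an odd number of arms, so that the middle arm is unique.
    """
    mid = n_arms // 2
    step = (mu_max - mu_min) / mid
    return [mu_max - step * abs(i - mid) for i in range(n_arms)]


graph = line_graph(17)
means = triangular_means(17)
```

By construction every sub-optimal arm has a neighbor with a strictly larger mean, so the instance satisfies the unimodality property.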
For both the experiments, we average the regret over independent trials of length $T = 10^5$. We report $R_t(U)$ for each policy $U$ as $t$ varies in Fig. 1a, for $K = 17$, and in Fig. 1b, for $K = 129$. The UTS algorithm outperforms all the other algorithms along the whole time horizon, providing a significant improvement in terms of regret w.r.t. the state-of-the-art algorithms. In order to have a more precise evaluation of the reduction of the regret w.r.t. the OSUB algorithm, we report $R_\%(U, \mathrm{OSUB})$ in Tab. 1. As also confirmed below by a more exhaustive series of experiments, in line graphs the relative improvement of performance due to UTS w.r.t. OSUB reduces as the number of arms increases, while the relative improvement of performance due to UTS w.r.t. TS increases as the number of arms increases.

Figure 1: Results for the pseudo-regret $R_t(U)$ in line graph settings with $K = 17$ (a) and $K = 129$ (b) as defined in (Combes and Proutiere 2014a).

Erdős-Rényi graphs

To provide a thorough experimental evaluation of the considered algorithms in settings in which the space of arms has a graph structure, we generate graphs using the model proposed by Erdős and Rényi (1959), which allows us to simulate graph structures more complex than a simple line. An Erdős-Rényi graph is generated by connecting nodes randomly: each edge is included in the graph with probability $p$, independently from existing edges. We consider connected graphs with $K \in \{5, 10, 20, 50, 100, 1000\}$ and with probability $p \in \{1, \frac{1}{2}, \frac{\log K}{K}, l\}$, where $p = 1$ corresponds to having a fully connected graph, in which the graph structure is therefore useless, $p = \frac{1}{2}$ corresponds to having a number of edges that increases linearly in the number of nodes, $p = \frac{\log K}{K}$ corresponds to having a few edges w.r.t. the nodes, and we use $p = l$ to denote line graphs (these line graphs differ from those used for the experimental evaluation discussed above for the reward function, as discussed in what follows). We use different values of $p$ in order to see how the performance of UTS changes w.r.t. the number of edges in the graph; we remark that such an analysis is unexplored in the literature so far. The optimal arm is chosen randomly among the existing arms and its reward is given by a Bernoulli distribution with expected value 0.9. The rewards of the suboptimal arms are given by Bernoulli distributions with expected value depending on their distance from the optimal one. More precisely, let $d_i$ be the shortest path from the $i$-th arm to the optimal arm and let $d_{\max} = \max_{i \in \{1, \ldots, K\}} d_i$ be the maximum shortest path of the graph. The expected reward of the $i$-th arm is:

$$\mu_i = 0.9 - d_i \, \frac{0.9 - 0.1}{d_{\max}},$$

i.e., the arm with $d_{\max}$ has a value equal to 0.1 and the expected rewards of the arms along the path from it to the optimal arm are evenly spaced between 0.1 and 0.9. We generate different graphs for each combination of $K$ and $p$ and we run independent trials of length $T = 10^5$ for each graph. We average the regret over the results of the graphs. In Tab. 2, we report $R_T(U)$ for each combination of policy $U$, $K$, and $p$. It can be observed that the UTS algorithm outperforms all the other algorithms, providing the smallest regret in every case but one, which occurs with $p = l$. Below we discuss how the relative performance of the algorithms varies as the values of the parameters $K$ and $p$ vary.

Table 2: Results concerning $R_T(U)$ ($T = 10^5$) in the setting with Erdős-Rényi graphs.

K | U | p = 1 | p = 1/2 | p = log(K)/K | p = l
5 | KL-UCB | 34 ± .4 | 5 ± .5 | 5 ± 3.7 | 56 ± .
5 | TS | 8 ± . | 3 ± .6 | 4 ± .3 | 5 ± .7
5 | OSUB | 34 ± .3 | 3 ± 7. | 35 ± 5.8 | 3 ± 4.
5 | UTS | 7 ± . | 5 ± .4 | 6 ± . | 4 ± .3
10 | KL-UCB | 77 ± .5 | 7 ± 5.5 | 7 ± . | 59 ± 7.
10 | TS | 4 ± . | 5 ± . | 56 ± 3.8 | 67 ± .5
10 | OSUB | 77 ± .3 | 76 ± 8. | 57 ± 5.6 | 7 ± 8.
10 | UTS | 39 ± . | 35 ± 3. | 7 ± . | 34 ± .4
20 | KL-UCB | 63 ± .7 | 7 ± 6. | 6 ± 6. | 386 ± .3
20 | TS | 84 ± .5 | ± .3 | 7 ± 5.7 | 57 ± 6.9
20 | OSUB | 63 ± .8 | 48 ± 4.9 | 86 ± 4.6 | 4 ± .7
20 | UTS | 83 ± .3 | 7 ± 6. | 44 ± 4.8 | 65 ± 8.8
50 | KL-UCB | 4 ± .7 | 56 ± 5. | 686 ± 3.5 | 3 ± 49.
50 | TS | 7 ± .5 | 6 ± 4.4 | 33 ± . | 454 ± 9.9
50 | OSUB | 4 ± . | 38 ± 35.6 | 6 ± 3.9 | 4 ± 5.8
50 | UTS | 6 ± .7 | 8 ± 4. | 89 ± 5.5 | 56 ± 3.
100 | KL-UCB | 846 ± . | 34 ± 7.8 | 33 ± 59.7 | 37 ± 63.5
100 | TS | 436 ± . | 58 ± 4.9 | 586 ± 8.4 | 973 ± 3.8
100 | OSUB | 846 ± .7 | 786 ± 39. | 6 ± 7. | 369 ± .7
100 | UTS | 437 ± .5 | 37 ± 5. | 4 ± 9. | 9 ± 4.3
1000 | KL-UCB | 855 ± . | 47 ± 6. | 4 ± 464.7 | 64 ± 9.5
1000 | TS | 439 ± 3.4 | 56 ± 3. | 5478 ± 5.3 | 6554 ± 5.
1000 | OSUB | 8493 ± 3.6 | 776 ± 53.4 | 5 ± 45. | 65 ± .7
1000 | UTS | 4388 ± 5. | 378 ± 6.9 | ± 4. | 65 ± 4.8

Consider the case with $p = 1$. The performance of UTS and TS are approximately equal and the same holds for the performance of OSUB and KL-UCB. This is due to the fact that the neighborhood of each node is composed of all the arms, the graphs being fully connected, and therefore UTS and OSUB cannot take any advantage from the structure of the problem. We notice, however, that UTS and TS do not have the same behavior and that UTS always performs slightly better than TS. It can be observed in Fig. 2 with $K = 5$ and $p = 1$ that the relative improvement is mainly at the beginning of the time horizon and that it goes to zero as $t$ increases (the same holds for OSUB w.r.t. KL-UCB). The reason behind this behavior is that UTS reduces the exploration performed by TS in the first rounds, forcing the algorithm to pull the leader (chosen as the arm maximizing the empirical mean) for a larger number of rounds.
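The construction of the Erdős-Rényi instances used in this section (random graph, distance-based expected rewards) can be sketched as follows. This is an illustrative sketch rather than the authors' experimental code: all function names are hypothetical, connectivity is obtained by simple rejection sampling (which can be slow for small $p$), and at least two arms are assumed. The helper `is_unimodal` checks the property of Definition 1 through the equivalent local condition that the maximum is unique and every other arm has a strictly better neighbor.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_erdos_renyi(n_arms, p, rng):
    """Sample G(n_arms, p) graphs until a connected one is drawn."""
    while True:
        adj = {i: [] for i in range(n_arms)}
        for i in range(n_arms):
            for j in range(i + 1, n_arms):
                if rng.random() < p:
                    adj[i].append(j)
                    adj[j].append(i)
        if len(bfs_distances(adj, 0)) == n_arms:
            return adj


def distance_based_means(adj, best, mu_max=0.9, mu_min=0.1):
    """mu_i = mu_max - d_i * (mu_max - mu_min) / d_max, d_i = hops to `best`."""
    dist = bfs_distances(adj, best)
    d_max = max(dist.values())  # requires at least two arms
    return [mu_max - dist[i] * (mu_max - mu_min) / d_max
            for i in range(len(adj))]


def is_unimodal(adj, means):
    """Unique maximum, and every other arm has a strictly better neighbor."""
    top = max(means)
    if sum(m == top for m in means) != 1:
        return False
    return all(
        m == top or any(means[j] > m for j in adj[i])
        for i, m in enumerate(means)
    )


rng = random.Random(0)
adj = connected_erdos_renyi(20, 0.3, rng)
best = rng.randrange(20)
means = distance_based_means(adj, best)
```

Because each arm at distance $d > 0$ from the optimal arm has a neighbor at distance $d - 1$, the generated means are unimodal on any connected graph.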

Figure 2: Results for the pseudo-regret $R_t(U)$ in the setting with $K = 5$ and $p = 1$.

Consider the case with $p = \frac{1}{2}$. In the considered experimental setting, the relative performance of the algorithms does not depend on $K$. The ordering, from the best to the worst, over the performance of the algorithms is: UTS, TS, OSUB, and finally KL-UCB. Surprisingly, even the dependency of the following ratios on $K$ is negligible: $R_\%(\mathrm{UTS}, \mathrm{TS})$ = .68 ± .3, $R_\%(\mathrm{UTS}, \mathrm{OSUB})$ = .47 ± ., and $R_\%(\mathrm{UTS}, \text{KL-UCB})$ = .68 ± .3. This shows that the relative improvement due to UTS is constant w.r.t. TS and OSUB as $K$ varies. These results raise the question whether the relative performance of UTS and TS would be the same, except for the numerical values, for every $p$ constant w.r.t. $K$. To answer this question, we consider the case in which $p = 0.1$, corresponding to the case in which the number of edges is linear in $K$, but is smaller than in the case with $p = \frac{1}{2}$. The results in terms of $R_T(U)$, reported in Table 3, show that UTS outperforms TS only for some values of $K$, suggesting that, when $p$ is constant in $K$, UTS may or may not outperform TS depending on the specific pair $(p, K)$.

Table 3: Results concerning $R_T(U)$ ($T = 10^5$) in the setting with Erdős-Rényi graphs and $p = 0.1$.

U | K = 5 | K = 10 | K = 20 | K = 50 | K = 100 | K = 1000
TS | 5 | 66 | 6 | 78 | 59 | 4564
UTS | 9 | 64 | 6 | 44 | 66 | 358

Consider the case with $p = \frac{\log K}{K}$. The ordering over the performance of the algorithms changes as $K$ varies. More precisely, while UTS remains the best algorithm for every $K$ and KL-UCB the worst algorithm for every $K$, the ordering between TS and OSUB changes. For a small number of arms TS performs better than OSUB, whereas for a large number of arms OSUB outperforms TS, see Fig. 3. This is due to the fact that, with a small number of arms, exploiting the graph structure is not sufficient for a frequentist algorithm to outperform TS, while with many arms exploiting the graph structure even with a frequentist algorithm is much better than employing a general-purpose Bayesian algorithm. The ratio $R_\%(\mathrm{UTS}, \mathrm{TS})$ monotonically decreases as $K$ increases, from .66 when $K = 5$ to .9 for the largest $K$, suggesting that exploiting the graph structure provides advantages as $K$ increases. Instead, the ratio $R_\%(\mathrm{UTS}, \mathrm{OSUB})$ monotonically increases as $K$ increases, from .45 when $K = 5$ to .94 for the largest $K$, suggesting that the improvement provided by employing Bayesian approaches reduces as $K$ increases (as observed above in line graphs).

Figure 3: Results for the pseudo-regret $R_t(U)$ in the setting with $K = 5$ (a) and a larger $K$ (b) and $p = \frac{\log K}{K}$.

Consider the case with $p = l$. As in the case discussed above, OSUB is outperformed by TS for a small number of arms, while it outperforms TS for many arms. The reason is the same as above. Similarly, the ratio $R_\%(\mathrm{UTS}, \mathrm{TS})$ monotonically decreases as $K$ increases, from .58 when $K = 5$ to .8 for the largest $K$, and the ratio $R_\%(\mathrm{UTS}, \mathrm{OSUB})$ monotonically increases as $K$ increases, from .45 when $K = 5$ to . for the largest $K$. This confirms that the performance of UTS and the one of OSUB asymptotically match as $K$ increases when $p = l$ as well as when $p = \frac{\log K}{K}$.

In order to investigate the reasons behind such a behavior, we produce an additional experiment with the line graphs of (Combes and Proutiere 2014a), except that the maximum expected reward is set to .8 when $K = 17$ and .65 when $K = 129$ (thus, given any edge with terminals $i$ and $i+1$, we have $|\mu_i - \mu_{i+1}|$ = .). What we observe (details of these experiments and those described below are in the appendices) is that, on average, OSUB outperforms UTS at $T = 10^5$, suggesting that, when it is necessary to repeatedly distinguish between three arms that have very similar expected rewards, frequentist methods may outperform the Bayesian ones. This is no longer true when $T$ is much larger, e.g., $T = 10^7$, where UTS outperforms OSUB (interestingly, differently from what happens in the other topologies, in line graphs with very small $|\mu_i - \mu_{i+1}|$, the average $R_T(\mathrm{UTS})$ and $R_T(\mathrm{OSUB})$ cross a number of times during the time horizon). Furthermore, we evaluate how the relative performance of OSUB w.r.t. UTS varies for $|\mu_i - \mu_{i+1}| \in$ {., ., .5}, observing that it improves as $|\mu_i - \mu_{i+1}|$ decreases. Finally, we evaluate whether this behavior emerges also in Erdős-Rényi graphs in which $p = \frac{c}{K}$, where $c$ is a constant (we use $p = \frac{5}{K}$), and we observe that UTS outperforms OSUB, suggesting that line graphs with very small $|\mu_i - \mu_{i+1}|$ are pathological instances for UTS.

Conclusions and Future Work

In this paper, we focus on the Unimodal Multi-Armed Bandit (UMAB) problem with graph structure, in which each arm corresponds to a node of a graph and each edge is associated with a relationship in terms of expected reward between its arms. We propose, to the best of our knowledge, the first Bayesian algorithm for the UMAB setting, called UTS, which is based on the well-known Thompson Sampling algorithm. We derive a tight upper bound for UTS that asymptotically matches the lower bound for the UMAB setting, providing a non-trivial derivation of the bound. Furthermore, we present a thorough experimental analysis showing that our algorithm outperforms the state-of-the-art methods.

In the future, we will evaluate the performance of the algorithms considered in this paper with other classes of graphs, e.g., Barabási-Albert graphs and lattices. Future development of this work may consider an analysis of the proposed algorithm in the case of time-varying environments, i.e., the expected reward of each arm varies over time, assuming that the unimodal structure is preserved. Another interesting study may consider the case of a continuous decision space.

References

Alon, N.; Cesa-Bianchi, N.; Gentile, C.; and Mansour, Y. 2013. From bandits to experts: A tale of domination and independence. In Advances in Neural Information Processing Systems, 1610-1618.
Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235-256.
Caron, S., and Bhagat, S. 2013. Mixing bandits: A recipe for improved cold-start recommendations in a social network. In Proceedings of the 7th Workshop on Social Network Mining and Analysis. ACM.
Chapelle, O., and Li, L. 2011. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, 2249-2257.
Combes, R., and Proutiere, A. 2014a. Unimodal bandits: Regret lower bounds and optimal algorithms. In ICML, 521-529.
Combes, R., and Proutiere, A. 2014b. Unimodal bandits without smoothness. arXiv preprint arXiv:1406.7447.
Crandall, D.; Cosley, D.; Huttenlocher, D.; Kleinberg, J.; and Suri, S. 2008. Feedback effects between similarity and social influence in online communities. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 160-168. ACM.
Edelman, B., and Ostrovsky, M. 2007. Strategic bidder behavior in sponsored search auctions. Decision Support Systems 43(1):192-198.
Erdős, P., and Rényi, A. 1959. On random graphs. Publ. Math. Debrecen 6:290-297.
Garivier, A., and Cappé, O. 2011. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 359-376.
Hoeffding, W. 1963. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301):13-30.
Jia, Y. Y., and Mannor, S. 2011. Unimodal bandits. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 41-48.
Kaufmann, E.; Korda, N.; and Munos, R. 2012. Thompson sampling: An asymptotically optimal finite-time analysis. In ALT, volume 7568 of Lecture Notes in Computer Science, 199-213. Springer.
Kleinberg, R.; Slivkins, A.; and Upfal, E. 2008. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, 681-690. ACM.
Lai, T. L., and Robbins, H. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4-22.
Mannor, S., and Shamir, O. 2011. From bandits to experts: On the value of side-observations. In NIPS, 684-692.
Mas-Colell, A.; Whinston, M. D.; and Green, J. R. 1995. Microeconomic Theory.
McPherson, M.; Smith-Lovin, L.; and Cook, J. M. 2001. Birds of a feather: Homophily in social networks. Annual Review of Sociology 415-444.
Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285-294.
Trovò, F.; Paladino, S.; Restelli, M.; and Gatti, N. 2015. Multi-armed bandit for pricing. In 12th European Workshop on Reinforcement Learning (EWRL). https://ewrl.wordpress.com/past-ewrl/ewrl12-2015/.
Valko, M.; Munos, R.; Kveton, B.; and Kocák, T. 2014. Spectral bandits for smooth graph functions. In Proceedings of the 31st International Conference on Machine Learning, ICML, 46-54.
Xia, Y.; Li, H.; Qin, T.; Yu, N.; and Liu, T.-Y. 2015. Thompson sampling for budgeted multi-armed bandits. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

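Appendix A: Proof of Theorem

Theorem. Given a UMAB setting $G = (A, E)$, the expected pseudo-regret of the UTS algorithm satisfies, for every $\varepsilon > 0$:

$$R_T(\mathrm{UTS}) \le (1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu_{i^*} - \mu_i}{KL(\mu_i, \mu_{i^*})} \left[\log(T) + \log\log(T)\right] + C,$$

where $C > 0$ is a constant depending on $\varepsilon$, the number of arms $K$ and the expected rewards $\{\mu_1, \ldots, \mu_K\}$.

Proof. At first, the regret of the UTS algorithm $R_T(\mathrm{UTS})$ can be rewritten by dividing the $T$ rounds in two sets: those rounds in which the best arm $a_{i^*}$ is the leader, i.e., $l(t) = i^*$, and those in which the leader is another arm, i.e., $l(t) \neq i^*$:

$$
\begin{aligned}
R_T(\mathrm{UTS}) &= \sum_{i=1}^{K} (\mu_{i^*} - \mu_i)\, \mathbb{E}[T_{i,T}]
= \sum_{i=1}^{K} (\mu_{i^*} - \mu_i)\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{i_t = i\}\right] \\
&= \underbrace{\sum_{i=1}^{K} (\mu_{i^*} - \mu_i)\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{l(t) = i^* \wedge i_t = i\}\right]}_{R_1}
+ \underbrace{\sum_{i=1}^{K} (\mu_{i^*} - \mu_i)\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{l(t) \neq i^* \wedge i_t = i\}\right]}_{R_2}.
\end{aligned}
$$

Let us focus on $R_1$. When $a_{i^*}$ is the leader, the proposed algorithm behaves like Thompson Sampling restricted to the optimal arm and its neighborhood $N^+(i^*)$, and the regret upper bound is the one presented in the theorem in Kaufmann, Korda, and Munos (2012) for the TS algorithm, i.e., for every $\varepsilon > 0$:

$$R_1 \le (1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu_{i^*} - \mu_i}{KL(\mu_i, \mu_{i^*})} \left[\log(T) + \log\log(T)\right] + C_1, \qquad (4)$$

where $C_1$ is an appropriate constant depending on $\varepsilon$ and on the expected rewards $\mu_i$ of the arms in $N^+(i^*)$.

Now let us consider $R_2$. Since the expected rewards lie in $[0, 1]$, we have:

$$R_2 = \sum_{i=1}^{K} (\mu_{i^*} - \mu_i)\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{l(t) \neq i^* \wedge i_t = i\}\right]
\le \sum_{i \neq i^*} \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{l(t) = i\}\right]}_{\mathbb{E}[L_{i,T}]}.$$

Here we want to upper bound the number of times $a_i$ has been the leader, $L_{i,T}$, with $\hat{L}_{i,T}$, defined as the number of rounds spent with $a_i$ as leader in the case in which only its neighborhood is considered during the whole time horizon $T$. This is clearly an upper bound over $L_{i,T}$, since there is nonzero probability that the algorithm moves to another neighborhood. From now on in the proof, the analysis is carried out on an algorithm working only on a unique neighborhood $N(i)$:

$$R_2 \le \sum_{i \neq i^*} \mathbb{E}[L_{i,T}] \le \sum_{i \neq i^*} \mathbb{E}[\hat{L}_{i,T}]
= \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\{l(t) = i\}\right]
= \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\Big\{\hat{\mu}_{i,t} = \max_{a_j \in N(i)} \hat{\mu}_{j,t}\Big\}\right],$$

where, with abuse of notation, $l(t)$ is the leader at round $t$ in this new problem where only $N(i)$ is considered.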

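When $a_i$ is the leader, $a_i$ is not the optimal arm. Thus, since we are in a unimodal setting, there exists an optimal arm $a_{i'} \in N(i)$, s.t. $\mu_{i'} = \max_{a_j \in N(i)} \mu_j$. Nonetheless, since $a_i$ is the leader, its empirical mean is the maximum in its neighborhood and, in particular, $\hat{\mu}_{i,t} \ge \hat{\mu}_{i',t}$. Thus, we have:

$$
\begin{aligned}
R_2 &\le \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\Big\{\hat{\mu}_{i,t} = \max_{a_j \in N(i)} \hat{\mu}_{j,t}\Big\}\right]
\le \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\{\hat{\mu}_{i,t} \ge \hat{\mu}_{i',t}\}\right]
= \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i,t} \ge \hat{\mu}_{i',t}\right) \\
&= \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i,t} - \mu_i - \frac{\Delta_i}{2} \ge \hat{\mu}_{i',t} + \frac{\Delta_i}{2} - \mu_{i'}\right) \\
&\le \underbrace{\sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right)}_{R_{21}}
+ \underbrace{\sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right)}_{R_{22}},
\end{aligned}
$$

where $\Delta_i = \max_{a_j \in N(i)} \mu_j - \mu_i$ denotes the expected loss incurred in choosing arm $a_i$ instead of its best adjacent one $a_{i'}$. The last step is a union bound: if the event in the second line holds, then either its left-hand side is nonnegative or its right-hand side is nonpositive.

Let us focus on $R_{22}$:

$$
\begin{aligned}
R_{22} &= \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right)
= \sum_{i \neq i^*} \sum_{t=1}^{T} \sum_{h=1}^{t} \mathbb{P}\left(T_{i,t} = h \wedge \hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right) \\
&= \sum_{i \neq i^*} \sum_{t=1}^{T} \sum_{h=1}^{t} \mathbb{P}\left(T_{i,t} = h \,\Big|\, \hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right) \mathbb{P}\left(\hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right) \\
&\le \sum_{i \neq i^*} \sum_{t=1}^{T} \sum_{h=1}^{t} \mathbb{P}\left(T_{i,t} = h \,\Big|\, \hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right) e^{-h \Delta_i^2 / 2},
\end{aligned}
$$

where the last inequality is due to the Hoeffding inequality (Hoeffding 1963). By relying on the fact that $\sum_{k = x+1}^{\infty} e^{-\kappa k} \le \frac{1}{\kappa} e^{-\kappa x}$ and by considering $\kappa = \Delta_i^2 / 2$ and $x = t / |N^+(i)|$, we have:

$$
\begin{aligned}
R_{22} &\le \sum_{i \neq i^*} \sum_{t=1}^{T} \left[ \sum_{h=1}^{t / |N^+(i)|} \underbrace{\mathbb{P}\left(T_{i,t} = h \,\Big|\, \hat{\mu}_{i,t} \ge \mu_i + \frac{\Delta_i}{2}\right)}_{= 0} e^{-h \Delta_i^2 / 2} + \sum_{h = t / |N^+(i)| + 1}^{t} e^{-h \Delta_i^2 / 2} \right] \\
&\le \sum_{i \neq i^*} \sum_{t=1}^{T} \frac{2}{\Delta_i^2}\, e^{-\frac{t \Delta_i^2}{2 |N^+(i)|}} \le C_2,
\end{aligned}
$$

where $\mathbb{P}\left(T_{i,t} = h \mid \hat{\mu}_{i,t} \ge \mu_i + \Delta_i / 2\right) = 0$ for $h \le t / |N^+(i)|$ is due to the fact that the leader is chosen at least $t / |N^+(i)|$ times over $t$ rounds, and $C_2$ is a constant.

Let us focus on $R_{21}$ and on the following proposition provided in Kaufmann, Korda, and Munos (2012):

Proposition. If we use a TS policy over a finite set of arms $\{a_i\}$ where $a_{i'}$ is the optimal one, there exist constants $b \in (0, 1)$ and $C_b < \infty$ s.t.:

$$\sum_{t=1}^{\infty} \mathbb{E}\left[\mathbb{1}\{T_{i',t} \le t^b\}\right] \le C_b. \qquad (5)$$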

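Similarly to what has been derived for $R_{22}$, we have:

$$
\begin{aligned}
R_{21} &= \sum_{i \neq i^*} \sum_{t=1}^{T} \mathbb{P}\left(\hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right)
= \sum_{i \neq i^*} \sum_{t=1}^{T} \sum_{h=1}^{t} \mathbb{P}\left(T_{i',t} = h \wedge \hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right) \\
&= \sum_{i \neq i^*} \sum_{t=1}^{T} \left[ \sum_{h=1}^{t^b} \mathbb{P}\left(T_{i',t} = h \wedge \hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right) + \sum_{h = t^b + 1}^{t} \mathbb{P}\left(T_{i',t} = h \wedge \hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right) \right] \\
&\le \sum_{i \neq i^*} \left[ \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\{T_{i',t} \le t^b\}\right] + \sum_{t=1}^{T} \sum_{h = t^b + 1}^{t} \underbrace{\mathbb{P}\left(T_{i',t} = h \,\Big|\, \hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right)}_{\le 1} \mathbb{P}\left(\hat{\mu}_{i',t} \le \mu_{i'} - \frac{\Delta_i}{2}\right) \right] \\
&\le \sum_{i \neq i^*} \left[ C_b + \sum_{t=1}^{T} \sum_{h = t^b + 1}^{t} e^{-h \Delta_i^2 / 2} \right]
\le \sum_{i \neq i^*} \left[ C_b + \sum_{t=1}^{T} \frac{2}{\Delta_i^2}\, e^{-t^b \Delta_i^2 / 2} \right] \le C_3,
\end{aligned}
$$

where the bound by $C_b$ follows from the Proposition, since we are using TS among the arms in $N(i)$, the exponential terms follow from the Hoeffding inequality, and the last inequality holds for all $b \in (0, 1)$.

By considering the three partial results on $R_1$, $R_{21}$ and $R_{22}$, we have:

$$R_T(\mathrm{UTS}) \le R_1 + R_{21} + R_{22} \le (1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu_{i^*} - \mu_i}{KL(\mu_i, \mu_{i^*})} \left[\log(T) + \log\log(T)\right] + C_1 + C_2 + C_3;$$

considering $C = C_1 + C_2 + C_3$ concludes the proof.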
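To get a feel for the leading term of the bound proved above, the following minimal sketch evaluates $(1 + \varepsilon) \sum_{i \in N(i^*)} \frac{\mu_{i^*} - \mu_i}{KL(\mu_i, \mu_{i^*})} [\log T + \log\log T]$ for Bernoulli arms. The five-arm line graph, the reward values, the horizon and all function names are illustrative assumptions, not taken from the experiments; the additive constant $C$ is ignored. For comparison, the same expression is evaluated with the sum extended to every suboptimal arm, which is the shape of the bound when the graph structure is not exploited.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    tiny = 1e-12
    p = min(max(p, tiny), 1 - tiny)
    q = min(max(q, tiny), 1 - tiny)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def leading_term(mu, neighbors, T, eps=0.0):
    """Leading term of the bound: the sum runs over the given neighbors of the best arm."""
    best = max(range(len(mu)), key=lambda i: mu[i])
    const = sum(
        (mu[best] - mu[i]) / bernoulli_kl(mu[i], mu[best]) for i in neighbors[best]
    )
    return (1 + eps) * const * (math.log(T) + math.log(math.log(T)))


# Hypothetical unimodal instance on a line graph with 5 arms.
mu = [0.1, 0.3, 0.5, 0.3, 0.1]
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(mu)] for i in range(len(mu))}
# Same rewards, but every other arm counts as a "neighbor" (no structure exploited).
full = {i: [j for j in range(len(mu)) if j != i] for i in range(len(mu))}

horizon = 10**5  # assumed horizon, chosen only for illustration
unimodal_term = leading_term(mu, line, horizon)
classical_term = leading_term(mu, full, horizon)
print(f"neighborhood only: {unimodal_term:.1f}, all arms: {classical_term:.1f}")
```

Because every summand is positive, restricting the sum to the neighborhood of the best arm can only shrink the leading term, which is the advantage the unimodal structure provides.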

Appendix B: Additional Results on p = l

In order to investigate the reasons why the performance of UTS and the one of OSUB asymptotically match as K increases when p = l, we produce additional experiments with the line graphs described in Combes and Proutiere (2014a). We generated line graphs where the minimum expected reward is set to . and the maximum expected reward varies: given any edge whose terminals are two consecutive nodes $i$ and $i + 1$, we generated graphs where $\Delta = |\mu_i - \mu_{i+1}| \in$ {., ., .5}. More precisely, when K = 7, the expected reward of the central arm $a_8$ is set to, respectively, .8, .6 and .4. When K = 9, the expected reward of the central arm $a_{65}$ is set to, respectively, .65, .3 and .45. The results for T = 7 are reported in Figure 4 for K = 7 and in Figure 5 for K = 9.

Figure 4: Results for the pseudo-regret $R_t(U)$ in the setting with K = 7, p = l and $\Delta$ = . (a), $\Delta$ = . (b) and $\Delta$ = .5 (c).

Figure 5: Results for the pseudo-regret $R_t(U)$ in the setting with K = 9, p = l and $\Delta$ = . (a), $\Delta$ = . (b) and $\Delta$ = .5 (c).

We observe that, at T = 7 with K = 7 and $\Delta$ = ., on average OSUB outperforms UTS while, with $\Delta \in$ {., .5}, at the end of the experiments UTS outperforms OSUB. This behavior suggests that, even in the case with $\Delta$ = ., UTS will perform better than OSUB for T > 7. In the case with K = 9, $\Delta$ = .5 and T = 7, UTS outperforms OSUB at the end of the experiments, while with $\Delta \in$ {., .} OSUB performs better. Following the same line of reasoning, for T > 7 this could no longer be true. All these results suggest that, when it is necessary to repeatedly distinguish between three arms that have very similar and very low expected rewards, frequentist methods may outperform the Bayesian ones at the beginning of the learning process, while Bayesian methods asymptotically outperform frequentist ones. In particular, we observe how the relative performance of OSUB w.r.t. UTS varies for $\Delta \in$ {., ., .5}: OSUB improves as $\Delta$ decreases.

Finally, we evaluate whether this behavior emerges also in Erdős-Rényi graphs in which p = c, where c is a constant. We use p $\in$ { 5, } and T = 6. We observe that UTS outperforms OSUB, suggesting that line graphs with very small $\Delta$ represent pathological instances for UTS. The results are reported in Figure 6.
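The line-graph instances discussed in this appendix can be sketched as follows: expected rewards grow by a fixed gap per edge up to the central arm and then decrease symmetrically. The simulation loop is only an assumption reconstructed from the description used in the proof (Thompson Sampling restricted to the neighborhood of the empirical leader, with the leader itself played once every $|N^+|$ rounds of leadership); it is not the authors' pseudocode. The values of the number of arms, the gap, the minimum reward, the horizon, the Beta(1, 1) priors and all function names are hypothetical.

```python
import numpy as np


def line_graph_means(K, delta, mu_min):
    """Unimodal means on a line: rise by delta per edge up to the central arm, then fall."""
    center = K // 2
    return np.array([mu_min + delta * (center - abs(i - center)) for i in range(K)])


def run_unimodal_ts(mu, T, rng):
    """Sketch of neighborhood-restricted Thompson Sampling on a line graph (Bernoulli arms)."""
    K = len(mu)
    succ = np.zeros(K)                      # observed successes per arm
    fail = np.zeros(K)                      # observed failures per arm
    leader_count = np.zeros(K, dtype=int)   # rounds each arm has spent as leader
    pulls = np.zeros(K, dtype=int)
    regret = 0.0
    for _ in range(T):
        # Empirical means; arms never pulled default to 0.
        means = succ / np.maximum(succ + fail, 1)
        leader = int(np.argmax(means))
        nbhd = [j for j in (leader - 1, leader, leader + 1) if 0 <= j < K]
        leader_count[leader] += 1
        if leader_count[leader] % len(nbhd) == 0:
            arm = leader                    # forced play of the leader
        else:
            # Beta(1, 1) priors: sample a posterior mean for each arm in the neighborhood.
            theta = rng.beta(succ[nbhd] + 1, fail[nbhd] + 1)
            arm = nbhd[int(np.argmax(theta))]
        reward = float(rng.random() < mu[arm])
        succ[arm] += reward
        fail[arm] += 1.0 - reward
        pulls[arm] += 1
        regret += mu.max() - mu[arm]        # pseudo-regret uses the true means
    return regret, pulls


rng = np.random.default_rng(0)
mu = line_graph_means(K=17, delta=0.05, mu_min=0.1)  # hypothetical instance
regret, pulls = run_unimodal_ts(mu, T=2000, rng=rng)
print(f"pseudo-regret after 2000 rounds: {regret:.1f}")
```

Shrinking `delta` in this sketch makes the three arms around the leader nearly indistinguishable, which is the regime the appendix identifies as hardest.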

Figure 6: Results for the pseudo-regret $R_t(U)$ in the setting with K = , p = 5 (a) and p = (b).