Optimality of Myopic Policy for a Class of Monotone Affine Restless Multi-Armed Bandit

Owen Chandler
5 years ago

1 University of Southern California. Optimality of Myopic Policy for a Class of Monotone Affine Restless Multi-Armed Bandit. Parisa Mansourifard (USC), Tara Javidi (UCSD), Bhaskar Krishnamachari (USC). Dec 10, 2012

2 Introduction. Multi-armed bandit (MAB): a stochastic decision problem. Selecting from several alternative arms each time. Playing an arm yields an immediate reward. How to play the arms to maximize the expected discounted or average reward over a horizon? Trade-off between exploration and exploitation. Two categories: rested and restless.

3 Rested MAB: Introduction. The state of the played arm changes according to a known Markovian rule (Bayesian). The remaining arms stay frozen. Optimal policy: specifically, an index can be assigned to the state of each arm; play the arm with the largest index each time. Referred to as the Gittins index.

4 Restless MAB: Introduction. The states of all arms, even those that are not selected, evolve in a Markovian fashion each time. An index policy is not in general optimal; the Whittle index policy is optimal under some constraint on the average number of arms that can be played each time. PSPACE-hard problem. In the literature: special classes of RMAB for which particular heuristics are optimal. Our contribution: a general class of RMAB for which a simple index policy (the myopic policy) is optimal.

5 Myopic policy: Introduction. Select the arm with the highest immediate reward each time, ignoring the impact of the current action on the future reward. Recently, several studies: optimality of the myopic policy under certain conditions for multiple arms evolving as i.i.d. two-state discrete-time Markov chains. Our contribution: generalizing beyond the specific setting of two-state Markov chains to real-valued states. [Figure: two-state Markov chain with states "bad" and "good" and transition probabilities $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$.]

6 A class of RMAB: Problem Formulation. $n$ independent and stochastically identical arms. Finite horizon $T$, time steps $t = 1, \dots, T$. Only one arm can be played each time. Each arm is in a real-valued state $s \in [s_0, s_{\max}]$. Playing an arm with state $s$ yields an immediate reward with expectation $R(s)$.

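7 Problem Formulation. $s_j(t)$: the state of arm $j$ at time $t$. The state of the selected arm resets stochastically. The state of each not-played arm evolves according to a deterministic function $\tau$. State transition of arm $j$:

$$
s_j(t+1) =
\begin{cases}
s_{\max} & \text{with probability } p(s_j(t)), \text{ if arm } j \text{ is played at time } t, \\
s_0 & \text{with probability } 1 - p(s_j(t)), \text{ if arm } j \text{ is played at time } t, \\
\tau(s_j(t)) & \text{if arm } j \text{ is not played at time } t.
\end{cases}
$$

Prior work: a specific setting of our formulation, in which $R$, $p$, $s_0$, and $s_{\max}$ are determined by the transition probabilities of a two-state Markov chain.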
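The transition rule on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the names `step`, `p`, `tau`, `s0`, and `smax` are hypothetical, and the sketch assumes that the played arm resets to `smax` with probability `p(s)` and to `s0` otherwise, while every other arm moves deterministically through `tau`. The numeric values of `P01` and `P11` are made up for the example.

```python
import random

def step(states, played, p, tau, s0, smax, rng=random):
    """Return the next state vector after playing arm `played`.

    Assumption (hypothetical, for illustration): the played arm resets to
    smax with probability p(s) and to s0 otherwise; every other arm evolves
    through the deterministic map tau.
    """
    nxt = []
    for j, s in enumerate(states):
        if j == played:
            # Stochastic reset of the selected arm.
            nxt.append(smax if rng.random() < p(s) else s0)
        else:
            # Deterministic evolution of a not-played arm.
            nxt.append(tau(s))
    return nxt

# Hypothetical two-state-chain instance: states are beliefs in [P01, P11].
P01, P11 = 0.2, 0.8
tau = lambda s: s * P11 + (1.0 - s) * P01

print(step([0.3, 0.6], played=1, p=lambda s: s, tau=tau, s0=P01, smax=P11))
```

Passing a seeded `random.Random` instance as `rng` makes a simulated trajectory reproducible.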

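8 Problem Formulation. Policy vector: $\pi = [\pi_1, \dots, \pi_T]$. The policy $\pi_t$ maps the current state vector to the action of selecting an arm at time $t$, $a(t) \in \{1, \dots, n\}$. The current state vector is a sufficient statistic due to the Markovian dynamics. Goal: maximize the total discounted expected reward, with discount factor $\beta$:

$$
\max_{\pi} \; E^{\pi}\left[ \sum_{t=1}^{T} \beta^{\,t-1} R\big(s_{a(t)}(t)\big) \right]
$$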

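9 Problem Formulation. Value function: the maximum expected remaining reward starting from time $t$. Recursive equations (DP), where $p(s)$ is the probability that a played arm in state $s$ resets to $s_{\max}$, $\tau$ is the deterministic evolution of not-played arms, and $\beta$ is the discount factor:

$$
V_T(s_1, \dots, s_n) = \max_{a \in \{1, \dots, n\}} R(s_a)
$$

$$
V_t(s_1, \dots, s_n) = \max_{a \in \{1, \dots, n\}} \Big\{ R(s_a) + \beta \big[ p(s_a)\, V_{t+1}(\tau(s_1), \dots, s_{\max}, \dots, \tau(s_n)) + (1 - p(s_a))\, V_{t+1}(\tau(s_1), \dots, s_0, \dots, \tau(s_n)) \big] \Big\}
$$

Here $s_{\max}$ and $s_0$ occupy position $a$ of the state vector.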
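For a handful of arms and a short horizon, a backward recursion of this kind can be evaluated directly. The sketch below is one possible reading of the recursion, not the authors' implementation: `make_value_function` and its arguments are hypothetical names, and the sketch assumes a played arm jumps to `smax` with probability `p(s)` and to `s0` otherwise, and that `beta` is a discount factor.

```python
from functools import lru_cache

def make_value_function(R, p, tau, s0, smax, T, beta=1.0):
    """Build V(t, states): max expected remaining reward from time t to T.

    Assumptions (hypothetical, for illustration): played arm -> smax w.p.
    p(s), else s0; not-played arms -> tau(s); beta is the discount factor.
    `states` must be a tuple so that results can be memoized.
    """
    @lru_cache(maxsize=None)
    def V(t, states):
        best = float("-inf")
        for a, s in enumerate(states):
            value = R(s)  # immediate expected reward of playing arm a
            if t < T:
                idle = [tau(x) for x in states]
                up = tuple(idle[:a] + [smax] + idle[a + 1:])
                down = tuple(idle[:a] + [s0] + idle[a + 1:])
                value += beta * (p(s) * V(t + 1, up)
                                 + (1.0 - p(s)) * V(t + 1, down))
            best = max(best, value)
        return best
    return V

# Hypothetical instance: two arms, horizon T = 2, no discounting.
P01, P11 = 0.2, 0.8
V = make_value_function(R=lambda s: s, p=lambda s: s,
                        tau=lambda s: s * P11 + (1.0 - s) * P01,
                        s0=P01, smax=P11, T=2)
print(V(1, (0.3, 0.6)))
```

The recursion branches over every action and both reset outcomes, so the cost grows exponentially in the horizon; memoization helps only when successor state vectors repeat.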

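10 Problem Formulation. Optimal policy:

$$
\pi^{\text{optimal}} = \arg\max_{\pi} \; E^{\pi}\left[ \sum_{t'=1}^{T} \beta^{\,t'-1} R\big(s_{a(t')}(t')\big) \right]
$$

Myopic policy (maximizing the current expected reward):

$$
a^{\text{Myopic}}(t) = \arg\max_{j \in \{1, \dots, n\}} R\big(s_j(t)\big) = \arg\max_{j \in \{1, \dots, n\}} s_j(t)
$$

The second equality holds because $R(s)$ is assumed monotonically increasing in $s$.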
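The myopic rule itself is a one-line selection. The sketch below uses a hypothetical helper name, `myopic_action`, and an arbitrary increasing affine reward chosen only for illustration; it shows that maximizing `R(s_j)` picks the same arm as maximizing the state when `R` is increasing.

```python
def myopic_action(states, R):
    """Index of the arm with the highest immediate expected reward.

    Ties are broken in favor of the lowest index (behavior of max()).
    """
    return max(range(len(states)), key=lambda j: R(states[j]))

# Hypothetical increasing affine reward; the values are made up.
reward = lambda s: 2.0 * s + 1.0
states = (0.3, 0.6, 0.5)

by_reward = myopic_action(states, reward)
by_state = max(range(len(states)), key=lambda j: states[j])
print(by_reward, by_state)
```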

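11 Main Result. Conditions: a. $R(s)$, $p(s)$, and the deterministic evolution function $\tau(s)$ of not-played arms are monotonically increasing and affine functions of the state; b. $\tau$ is a contraction mapping. Theorem: under the above conditions, and [an additional condition whose equations are garbled in the source; the surviving fragments involve $p(s_{\max})$, $p(s_0)$, and the indices $t = 1, \dots, T$, $i = 2, \dots, n$], the myopic policy is optimal.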
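A small numerical check can illustrate what the theorem claims on one instance. The sketch below is not from the slides. It assumes the two-state Markov chain setting with made-up values `P01 = 0.2` and `P11 = 0.8`, takes `R(s) = s`, `p(s) = s`, and `tau(s) = s*P11 + (1 - s)*P01` (all increasing and affine, with `tau` of slope 0.6, hence a contraction), and assumes a played arm resets to `SMAX` with probability `p(s)` and to `S0` otherwise. It compares the optimal value with the value of always playing the myopic arm.

```python
from functools import lru_cache

P01, P11 = 0.2, 0.8      # hypothetical transition probabilities
S0, SMAX = P01, P11      # assumed reset states
T, BETA = 4, 0.9         # hypothetical horizon and discount factor

def R(s): return s                              # affine, increasing
def p(s): return s                              # affine, increasing
def tau(s): return s * P11 + (1.0 - s) * P01    # affine, slope 0.6

def q_value(t, states, a, cont):
    """Expected reward of playing arm a at time t, then following `cont`."""
    s = states[a]
    if t == T:
        return R(s)
    idle = [tau(x) for x in states]
    up = tuple(idle[:a] + [SMAX] + idle[a + 1:])
    down = tuple(idle[:a] + [S0] + idle[a + 1:])
    return R(s) + BETA * (p(s) * cont(t + 1, up)
                          + (1.0 - p(s)) * cont(t + 1, down))

@lru_cache(maxsize=None)
def v_opt(t, states):
    # Optimal value: maximize over all arms at every step.
    return max(q_value(t, states, a, v_opt) for a in range(len(states)))

@lru_cache(maxsize=None)
def v_myopic(t, states):
    # Value of the myopic policy: always play the arm with the largest R.
    a = max(range(len(states)), key=lambda j: R(states[j]))
    return q_value(t, states, a, v_myopic)

start = (0.3, 0.6, 0.45)
gap = v_opt(1, start) - v_myopic(1, start)
print(f"optimal={v_opt(1, start):.6f} myopic={v_myopic(1, start):.6f} gap={gap:.2e}")
```

A gap of zero on one instance illustrates the claim but does not prove it; the gap is never negative, since the myopic policy is one of the policies the optimum ranges over.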

12 Conclusion. We proved the optimality of the myopic policy for a general class of restless multi-armed bandits. Generalizing to non-identical arms and non-affine evolution. Generalizing to multi-dimensional states. Identifying conditions for problems in which the myopic policy is not optimal but another efficient, possibly index-based, policy is optimal.
