Optimality of Myopic Policy for a Class of Monotone Affine Restless Multi-Armed Bandit

University of Southern California
Optimality of Myopic Policy for a Class of Monotone Affine Restless Multi-Armed Bandit
Parisa Mansourifard (USC), Tara Javidi (UCSD), Bhaskar Krishnamachari (USC)
Dec 10, 2012

Introduction
- Multi-Armed Bandits (MAB): a stochastic decision problem
- Select from several alternative arms at each time
- Playing an arm yields an immediate reward
- Goal: play the arms so as to maximize the expected discounted or average reward over a horizon
- Trade-off between exploration and exploitation
- Two categories: rested and restless

Rested MAB: Introduction
- The state of the played arm changes according to a known Markovian rule (Bayesian)
- The remaining arms stay frozen
- Optimal policy: an index can be assigned to the state of each arm, and playing the arm with the largest index at each time is optimal
- Referred to as the Gittins index

Restless MAB: Introduction
- The states of all arms, even those that are not selected, evolve in a Markovian fashion at each time
- An index policy is not optimal in general, while the Whittle index policy is optimal under a relaxed constraint on the average number of arms that can be played at each time
- A PSPACE-hard problem
- In the literature: special classes of RMAB for which particular heuristics are optimal
- Our contribution: a general class of RMAB for which a simple index policy, the myopic policy, is optimal

Myopic policy: Introduction
- Select an arm with the highest immediate reward at each time, ignoring the impact of the current action on future rewards
- Recently, several researchers have shown optimality of the myopic policy under certain conditions for multiple arms evolving as i.i.d. two-state discrete-time Markov chains
- Our contribution: generalizing beyond the specific setting of two-state Markov chains to real-valued states
[Figure: two-state Markov chain with states "bad" (0) and "good" (1) and transition probabilities p01, p11, p10, p00]
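For concreteness, a minimal Python sketch of this two-state setting (the transition values p01, p11 are hypothetical, not from the talk): the belief of a passive arm updates through an affine map, the played arm resets stochastically based on its observation, and the myopic rule plays the largest belief.

    import numpy as np

    # Illustrative two-state setting: each arm is a Markov chain with states
    # "bad" (0) and "good" (1); the arm's real-valued state is the belief
    # that it is currently "good". Values below are hypothetical.
    p01, p11 = 0.2, 0.8            # positively correlated case: p11 >= p01

    def tau(belief):
        # Belief update of an arm that is NOT played: affine in the belief,
        # and a contraction since |p11 - p01| < 1.
        return p01 + (p11 - p01) * belief

    def myopic_arm(beliefs):
        # Myopic rule: highest immediate expected reward; with R(s) = s
        # this is simply the largest belief.
        return int(np.argmax(beliefs))

    beliefs = np.array([0.3, 0.6, 0.5])
    a = myopic_arm(beliefs)                     # plays arm 1
    obs_good = np.random.rand() < beliefs[a]    # observe the played arm
    beliefs = tau(beliefs)                      # passive arms evolve deterministically
    beliefs[a] = p11 if obs_good else p01       # played arm resets stochastically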

A class of RMAB: Problem Formulation
- n independent and stochastically identical arms
- Finite horizon T, time steps t = 1, ..., T
- Only one arm can be played at each time
- Each arm is in a real-valued state s in [s_0, s_max]
- Playing an arm with state s yields an immediate reward with expectation R(s)

Problem Formulation
- s_j(t): the state of arm j at time t
- The state of the selected arm resets stochastically
- The states of the not-played arms evolve according to a deterministic function τ
- State transition of arm j:
    s_j(t+1) = stochastic reset, if arm j is played at time t
    s_j(t+1) = τ(s_j(t)), if arm j is not played at time t
- Prior work is a specific setting of our formulation: τ(s) = p01 + (p11 - p01)s, s_0 = p01, s_max = p11, R(s) = s, and the played arm resets to p11 with probability s_j(t) and to p01 with probability 1 - s_j(t)
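A minimal sketch of one time step of the general model. All parameter values are assumed for illustration, and since the slides leave the reset law unspecified, a uniform draw on [s_0, s_max] stands in for it.

    import numpy as np

    rng = np.random.default_rng(0)
    s0, s_max = 0.2, 0.8        # hypothetical state interval [s_0, s_max]
    lam, b = 0.6, 0.32          # hypothetical affine dynamics tau(s) = lam*s + b

    def tau(s):
        # Deterministic evolution of a not-played arm: monotone increasing,
        # affine, and a contraction (0 <= lam < 1). With these values the
        # fixed point b / (1 - lam) equals s_max, so states stay in range.
        return lam * s + b

    def step(states, a):
        # One transition of the state vector when arm `a` is played.
        nxt = tau(states)                # passive arms evolve deterministically
        nxt[a] = rng.uniform(s0, s_max)  # played arm resets (stand-in reset law)
        return nxt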

Problem Formulation
- Policy vector: π = [π_1, ..., π_T]
- The policy π_t maps the current state vector to the action a(t) in {1, ..., n} of selecting an arm at time t
- The current state vector is a sufficient statistic due to the Markovian dynamics
- Goal: maximize the total discounted expected reward:
    max_π E[ Σ_{t=1}^{T} β^{t-1} R(s_{a(t)}(t)) ]
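Given the step() sketch above, this objective can be estimated for any policy by Monte Carlo rollouts. A minimal sketch, taking R(s) = s as an assumed increasing reward:

    def evaluate(policy, init_states, T=20, beta=0.9, n_runs=2000):
        # Monte Carlo estimate of E[ sum_{t=1..T} beta^(t-1) * R(s_{a(t)}(t)) ].
        total = 0.0
        for _ in range(n_runs):
            s = np.array(init_states, dtype=float)
            for t in range(T):
                a = policy(s)
                total += (beta ** t) * s[a]   # immediate reward, R(s) = s
                s = step(s, a)
        return total / n_runs

    myopic = lambda s: int(np.argmax(s))      # play the largest state (R increasing)
    print(evaluate(myopic, [0.3, 0.5, 0.7]))

Swapping any other policy into evaluate() gives a direct comparison against the myopic baseline.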

Problem Formulation
- Value function V_t(s_1, ..., s_n): the maximum expected remaining reward starting from time t
- Recursive equation (DP):
    V_t(s_1, ..., s_n) = max_{a in {1,...,n}} { R(s_a) + β E[ V_{t+1}(τ(s_1), ..., s'_a, ..., τ(s_n)) ] }
    V_{T+1}(·) = 0
  where the played arm a transitions to its stochastic reset s'_a and every other arm i transitions to τ(s_i)
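For a toy two-arm instance, this recursion can be solved numerically by backward induction on a discretized state grid. The sketch below reuses the assumed dynamics and stand-in uniform reset from above, and approximates the expectation over resets by an average over grid points; grid size, horizon, and discount are arbitrary choices.

    from itertools import product

    K = 41                                   # grid resolution (arbitrary)
    grid = np.linspace(s0, s_max, K)
    T_h, beta = 10, 0.9                      # horizon and discount (assumed)

    def project(x):
        # Nearest grid index: a crude discretization of the real-valued state.
        return int(np.clip(round((x - s0) / (s_max - s0) * (K - 1)), 0, K - 1))

    # V[t][i][j] approximates V_t(grid[i], grid[j]); V[T_h + 1] = 0.
    V = np.zeros((T_h + 2, K, K))
    for t in range(T_h, 0, -1):
        for i, j in product(range(K), repeat=2):
            s = (grid[i], grid[j])
            q = []
            for a in (0, 1):
                p = project(tau(s[1 - a]))   # passive arm's next grid index
                # expectation over the uniform reset of the played arm,
                # approximated by averaging V over the reset axis
                cont = V[t + 1][:, p].mean() if a == 0 else V[t + 1][p, :].mean()
                q.append(s[a] + beta * cont) # R(s) = s
            V[t][i][j] = max(q)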

Problem Formulation
- Optimal policy: a*(t) = argmax_{a in {1,...,n}} { R(s_a(t)) + β E[ V_{t+1}(s(t+1)) ] }
- Myopic policy: a_Myopic(t) = argmax_{a in {1,...,n}} R(s_a(t)), maximizing only the current expected reward
- R(s) is assumed monotonically increasing in s, so the myopic policy simply plays the arm with the largest state

Main Result
Conditions:
- τ is a monotonically increasing and affine function of the state: τ(s) = λs + b
- τ is a contraction mapping, i.e. 0 ≤ λ < 1
Theorem: Under the above conditions, the myopic policy is optimal at every time t = 1, ..., T.
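A rough numerical sanity check of the theorem on the toy instance above (not a proof): extract the greedy action from the backward-induction values V and count where it disagrees with the myopic choice. Under the stated conditions the two should coincide, up to discretization error near ties.

    def dp_action(t, i, j):
        # Greedy action with respect to the values V[t+1] from the sketch above.
        s = (grid[i], grid[j])
        q = []
        for a in (0, 1):
            p = project(tau(s[1 - a]))
            cont = V[t + 1][:, p].mean() if a == 0 else V[t + 1][p, :].mean()
            q.append(s[a] + beta * cont)
        return int(np.argmax(q))

    # Myopic plays the larger state; count disagreements off the diagonal.
    mismatches = sum(dp_action(1, i, j) != (0 if grid[i] > grid[j] else 1)
                     for i, j in product(range(K), repeat=2) if i != j)
    print("disagreements at t = 1:", mismatches)  # expected 0 under the conditions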

Conclusions
- We proved the optimality of the myopic policy for a general class of restless multi-armed bandits
Future work:
- Generalizing to non-identical arms and non-affine evolutions
- Generalizing to multi-dimensional states
- Identifying conditions for problems where the myopic policy is not optimal but another efficient, possibly index-based, policy is optimal
