Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems


Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems
Sofía S. Villar, Postdoctoral Fellow, Basque Center for Applied Mathematics (BCAM)
Lancaster University, November 5th, 2012

The objective of the talk: to bring together Multi-armed Restless Bandit Problems (OR) and Optimal Sequential Estimation Problems (Statistics), yielding a novel solution to existing applied problems in modern sensor systems.

Problem 1: Multi-target tracking (I). A set of moving targets is to be tracked over a period of time:

Problem 1: Multi-target tracking (II). A set of flexible (and error-prone) sensors is available to track them.

Problem 1: Multi-target tracking (III). P1: How can we best estimate the positions of all targets over time by selecting the beam directions that produce the sensor measurements?

Problem 2: Finding elusive targets. An error-prone sensor must find an object that hides itself when sensed and exposes itself when not sensed. P2: When should we activate the sensor to maximize the chances of finding the object?

Outline: Optimal Sequential Estimation Problems; Multi-armed Restless Bandit Problems; Some Results; Further Work.

Optimal Sequential Estimation and Multitarget Tracking: an illustrative example.
Let the position of target n at time t be a random variable x_{n,t} evolving over time:
    x_{n,t} = x_{n,t-1} + ω_{n,t},   ω_{n,t} ~ N(0, q_n),   t ≥ 1   (1)
Let the measurement of target n's position at time t, y_{n,t}, be such that
    y_{n,t} = x_{n,t} + ν_{n,t},   ν_{n,t} ~ N(0, r_n)   (2)
with E(x_{n,0}) = µ and V(x_{n,0}) = 0.
Q: From the observed sequence of measurements, can we produce estimates of the unobservable variables, i.e., y_{n,t} → x̂_{n,t}? Can these estimates be more precise than those based on a single measurement alone?

The Kalman Filter Algorithm. Let v_{n,t} be the variance of the sequential estimator x̂_{n,t}.
A priori estimate of x_{n,t} from (1) (Predictive Step):
    (x̂_{n,t})^P = x̂_{n,t-1}   (3)
    (v_{n,t})^P = v_{n,t-1} + q_n   (4)
A posteriori estimate of x_{n,t} (Updating Step):
    (x̂_{n,t})^U = (1 - k) x̂_{n,t-1} + k y_{n,t}   (5)
    (v_{n,t})^U = (v_{n,t-1} + q_n) / (v_{n,t-1}/r_n + q_n/r_n + 1)   (6)
where k = (v_{n,t})^P / ((v_{n,t})^P + r_n) is the Kalman gain, optimal in the sense that it minimizes (v_{n,t})^U.
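To make the recursion concrete, here is a minimal Python sketch of one predict/update cycle for the scalar random-walk model (1)-(2); the function name and the small demo loop are illustrative, not from the talk.

```python
import numpy as np

def kalman_step(x_hat, v, y, q, r):
    """One predict/update cycle of the scalar Kalman filter for the
    random-walk model (1)-(2): x_t = x_{t-1} + w_t, y_t = x_t + nu_t."""
    # Predictive step, equations (3)-(4).
    x_pred = x_hat
    v_pred = v + q
    # Kalman gain k = v^P / (v^P + r), which minimizes the posterior variance.
    k = v_pred / (v_pred + r)
    # Updating step, equations (5)-(6); note that (1 - k) * v_pred equals
    # (v + q) / (v/r + q/r + 1), the right-hand side of (6).
    x_upd = (1.0 - k) * x_pred + k * y
    v_upd = (1.0 - k) * v_pred
    return x_upd, v_upd

# Illustrative run: the filtered variance converges to a steady state below
# the single-measurement variance r, i.e. filtering beats a lone measurement.
rng = np.random.default_rng(0)
q, r = 5.0, 1.0
x, x_hat, v = 0.0, 0.0, 0.0              # E(x_0) = mu = 0, V(x_0) = 0
for t in range(20):
    x += rng.normal(0.0, np.sqrt(q))     # latent position, model (1)
    y = x + rng.normal(0.0, np.sqrt(r))  # noisy measurement, model (2)
    x_hat, v = kalman_step(x_hat, v, y, q, r)
print(x_hat, v)   # v converges to about 0.85 here, below r = 1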

Applications, Properties & More. The Kalman Filter algorithm is defined for models more general than equations (1) and (2). It is widely used for time series analysis, in fields such as signal processing and econometrics. Theoretical property: it yields minimum Mean Squared Error (MSE) estimators.
Key issue: if there are N targets and N sensors, the Kalman filter yields the minimum-MSE estimators of the targets' locations at every t. If the number of sensors falls below N, then at each time some targets receive a prediction without an update, resulting in a higher total predictive error. Room for optimizing while estimating!

Outline: Optimal Sequential Estimation Problems; Multi-armed Restless Bandit Problems; Some Results; Further Work.

Markov Decision Problems: dynamic and stochastic optimization models in OR. Markov Decision Problems (MDPs) are a natural way of formulating optimal sequential estimation problems.
State space: X_t ∈ X, the sufficient statistic for the N targets' state upon which to make decisions at time t.
Action set: a_t ∈ A_t(X_t), the feasible control set applicable to the system at time t; if binary, a_t ∈ {0, 1} (Bandit).
Transition dynamics: X_{t+1} ~ P(· | X_t, a_t), a probability measure following from the corresponding filter algorithm.
One-period net rewards/costs: R(X_t, a_t) / C(X_t, a_t), capturing the effect of the selected actions on the system's objective.
Multi-armed Bandit Problems are a special class of constrained MDPs.

Multitarget Tracking and Multi-armed Bandit Problems: elements.
Arms: N (the moving targets).
Action set: a_{n,t} ∈ {0, 1}, with a_{n,t} = 1 if target n is observed at t and 0 otherwise.
State space: v_{n,t} ∈ V_n ≡ [0, ∞).
One-period dynamics (sketched in code below):
    v_{n,t} = φ_n(v_{n,t-1}, 0) ≡ v_{n,t-1} + q_n,   if a_{n,t} = 0
    v_{n,t} = φ_n(v_{n,t-1}, 1) ≡ (v_{n,t-1} + q_n) / (v_{n,t-1}/r_n + q_n/r_n + 1),   if a_{n,t} = 1
Time periods: t = 0, 1, ... (infinite horizon).
One-period costs: C_n(v_{n,t}, a_{n,t}) ≡ d_n φ_n(v_{n,t}, a_{n,t}) + h_n a_{n,t}, where h_n ≥ 0 is the cost of measuring target n and d_n > 0 is the relative importance of target n's tracking precision.
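These dynamics and costs are simple enough to state directly in code; a minimal sketch (the function names are mine, not from the talk):

```python
def phi(v, a, q, r):
    """Variance transition for one target: prediction alone (a = 0), or
    prediction followed by a Kalman measurement update (a = 1)."""
    if a == 0:
        return v + q
    return (v + q) / (v / r + q / r + 1.0)

def one_period_cost(v, a, d, h, q, r):
    """C_n(v, a) = d * phi(v, a) + h * a: weighted next-step variance plus
    the cost of taking a measurement."""
    return d * phi(v, a, q, r) + h * a
```

Note that the transition is deterministic: the Kalman variance recursion does not depend on the realized measurements, so each arm is a restless bandit with deterministic per-arm dynamics.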

Optimal Sequential Estimation: a constrained MDP's optimal policy. The estimation rule yielding the minimum total (discounted) variance solves MDP (7):
    V*(v_0) ≡ min_{π ∈ Π(M)} E^π_{v_0} [ Σ_{n=1}^{N} Σ_{t=0}^{∞} β^t C^{(n)}(v_t, a_t) ]   (7)
where E^π_{v_0}[·] denotes expectation under π conditioned on the initial variance vector v_0; Π(M) contains all policies that observe at most M targets at each t; and β ∈ [0, 1) is a discount factor.
The solution π* to (7) is a stationary deterministic policy solving the traditional dynamic programming (DP) equations:
    V*(v_0) = min_{a ∈ A(v_0)} { C(v_0, a) + β E_{v_1} V*(v_1) },   v_0, v_1 ∈ [0, ∞)^N

Heuristic-based Solution Methods: an alternative to Dynamic Programming.
(1) The cardinality of the state space V determines the computation and storage requirements of the DP approach. Here V ⊆ R^N is of infinite size, and closed-form solutions rarely exist. Even discretizing each V_n to k points gives a joint state space of size k^N (e.g., a coarse grid of k = 100 points per target with N = 10 targets already yields 10^20 joint states), and matters are worse still for the infinite horizon. ⇒ Practical implementation of π* for realistic scenarios is hindered by the curse of dimensionality.
(2) Mathematically based heuristic solution methodology: design tractable and near-optimal priority-index policies.
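To make the blow-up concrete, here is a sketch of a single backup of the joint DP equation on a discretized product grid; the names and the nearest-grid lookup are mine, and grid is assumed to be a NumPy array of per-target variance levels. One backup already costs on the order of k^N states times the number of feasible actions, which is why exact DP is hopeless beyond small N.

```python
import itertools
import numpy as np

def bellman_backup(V, grid, q, r, d, h, beta, M):
    """One backup of the joint DP equation on a k-point grid per target:
    visits k**N joint states, each with every action observing <= M targets.
    V has shape (k,) * N; nearest-grid lookup stands in for interpolation."""
    q, r, d, h = map(np.asarray, (q, r, d, h))
    N, k = len(q), len(grid)
    actions = [np.array(a) for a in itertools.product((0, 1), repeat=N)
               if sum(a) <= M]
    V_new = np.empty_like(V)
    for idx in itertools.product(range(k), repeat=N):   # k**N joint states
        v = grid[list(idx)]
        best = np.inf
        for a in actions:
            # Deterministic variance transition phi for each target.
            nxt = np.where(a == 1, (v + q) / (v/r + q/r + 1.0), v + q)
            cost = float(np.sum(d * nxt + h * a))       # C(v, a)
            nidx = tuple(int(np.argmin(np.abs(grid - x))) for x in nxt)
            best = min(best, cost + beta * V[nidx])
        V_new[idx] = best
    return V_new
```

With k = 100 and N = 10, the outer loop would visit 10^20 states, which is exactly the slide's point.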

Bandit Indexation: a mathematically based methodology.
Gittins '74: first showed that the optimal solution to the classic version (a bandit's state remains frozen if not activated) admits an expression that breaks the curse of dimensionality. For each arm, an index function (depending on its current state) exists and recovers the optimal policy (Indexability). The index function assigns to each arm a priority value (index) for being activated, depending on its state.
Heuristic index policy: at each t, allocate the available resources to activate the M arms with the highest priority values (see the sketch below).
Whittle '88: proposed a relaxation and a Lagrangian approach for the restless case (a bandit's state evolves even if not activated). Index functions are no longer ensured to always exist, and the resulting heuristic is not guaranteed to be optimal.
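In code, the resulting heuristic is just a top-M selection at each period. A sketch, assuming some per-arm index function index_fn (a name of my choosing):

```python
import numpy as np

def index_policy_action(states, index_fn, M):
    """Activate the M arms whose current index (priority) is highest."""
    priorities = np.asarray([index_fn(v) for v in states])
    action = np.zeros(len(states), dtype=int)
    action[np.argsort(-priorities)[:M]] = 1   # top-M arms get the resource
    return action
```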

Solving Sequential Estimation Problems as MDPs: research challenges.
Multi-armed bandit reformulations of the resulting MDPs are restless: an observed target's variance evolves by predicting and updating, an unobserved one's by predicting alone.
Q: Does an index exist? Can it be computed, and tractably so?
Niño-Mora '01: discrete-state Sufficient Indexability Conditions (SIC). If SIC hold, the index exists and is computable through a tractable algorithm.
Continuous state space: can SIC be extended to an infinite state space? Optimal sequential estimation models have complex nonlinear dynamics for the state variable (Möbius transformations). Q: How to address these complications to establish extended SIC?
Even when the Whittle heuristic exists, it is not necessarily optimal. Q: How to assess the suboptimality gap?

Outline: Optimal Sequential Estimation Problems; Multi-armed Restless Bandit Problems; Some Results; Further Work.

Whittle Index for Multitarget Tracking: the marginal profit of pulling an arm.
Definition: the single-arm problem is indexable if there exists an index function λ*(v) such that, for every measurement charge λ ∈ R and every state v, it is optimal to observe the target in state v iff λ*(v) ≥ λ.
One-arm problem:
    V*(v) ≡ min_{π ∈ Π} E^π_v [ Σ_{t=0}^{∞} β^t ( d_n φ_n(v_{n,t}, a_{n,t}) + h_n a_{n,t} ) ]   (8)
SIC for the single-target problem hold (Niño-Mora and Villar, 2009): the single-arm tracking problem is indexable under the β-discounted criterion, and the Whittle index can be computed as
    λ*(v) = d_n ( φ_n(v, 0) − φ_n(v, 1) ) − h_n + β ( V*(φ_n(v, 0)) − V*(φ_n(v, 1)) )
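Because the single-arm dynamics are deterministic, V*(·) in the index formula can be approximated by value iteration on a truncated, discretized variance grid and λ*(v) then evaluated directly from it. A numerical sketch under those assumptions (the truncation at v_max and the linear interpolation are my approximations, not part of the talk):

```python
import numpy as np

def whittle_index_fn(q, r, d, h, beta, v_max=200.0, n_grid=4000, tol=1e-6):
    """Approximate the Whittle index lambda*(v) for the single-target
    tracking problem via value iteration on a discretized variance grid."""
    grid = np.linspace(0.0, v_max, n_grid)

    def phi(v, a):  # deterministic variance transition
        return v + q if a == 0 else (v + q) / (v / r + q / r + 1.0)

    # Value iteration for V*(v) = min_a { d*phi(v,a) + h*a + beta*V*(phi(v,a)) }.
    # States above v_max are clamped to the last grid point (truncation error).
    V = np.zeros(n_grid)
    while True:
        V0 = d * phi(grid, 0) + beta * np.interp(phi(grid, 0), grid, V)
        V1 = d * phi(grid, 1) + h + beta * np.interp(phi(grid, 1), grid, V)
        V_new = np.minimum(V0, V1)
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break

    def lam(v):
        # lambda*(v) = d(phi(v,0) - phi(v,1)) - h + beta(V*(phi(v,0)) - V*(phi(v,1)))
        return (d * (phi(v, 0) - phi(v, 1)) - h
                + beta * (np.interp(phi(v, 0), grid, V)
                          - np.interp(phi(v, 1), grid, V)))

    return lam
```

For the instance of the next figure (q = 5, r = 1), whittle_index_fn(5.0, 1.0, 1.0, 0.0, 0.99) returns a callable approximating λ*(v).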

Index Existence & Evaluation.
[Figure: the Whittle index λ*(v) as a function of v, for different discount factors β (e.g., β = 0.99), in a target instance with r = 1 and q = 5.]

Whittle Index Policy: benchmarking results.
Base instance: β = 0.99, T = 10^4, M = 1, N = 4, with q_n = 10, r_n = 1, d_n = 1, h_n = 0, and v_{n,0} = 0 for all n; the instance is then modified by letting q_1 range over {0.5, 1, ..., 10}.
Performance objective: the total prediction error for the positions of all targets over time.
[Figure: total prediction error vs. q_1 for the myopic policy (β = 0), the TEV policy (index v_t), the Whittle policy (index λ*(v)), and a lower bound on V^D(v).]
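A sketch of the kind of simulation behind such curves, with the policies as the legend describes (myopic is the β = 0 index, TEV is the current variance, Whittle is λ*(v)); the harness and its names are mine, an illustration of the setup rather than the talk's actual benchmarking code:

```python
import numpy as np

def simulate(index_of, q, r, d, h, v0, T=10_000, M=1):
    """Run a priority-index policy for T periods over N targets and return
    the total tracking cost; index_of(n, v) is target n's priority in state v."""
    q, r, d, h = map(np.asarray, (q, r, d, h))
    v = np.array(v0, dtype=float)
    N, total = len(v), 0.0
    for _ in range(T):
        pri = np.array([index_of(n, v[n]) for n in range(N)])
        a = np.zeros(N, dtype=int)
        a[np.argsort(-pri)[:M]] = 1                       # observe top-M targets
        v = np.where(a == 1, (v + q) / (v/r + q/r + 1.0), v + q)
        total += float(np.sum(d * v + h * a))             # C_n = d*phi + h*a
    return total

# Base instance: N = 4, q_n = 10, r_n = 1, d_n = 1, h_n = 0, v_{n,0} = 0.
N = 4
q, r, d, h, v0 = [10.0]*N, [1.0]*N, [1.0]*N, [0.0]*N, [0.0]*N
tev = simulate(lambda n, v: v, q, r, d, h, v0)            # TEV: index is v itself
# Myopic (beta = 0): index is the immediate saving d*(phi(v,0) - phi(v,1)) - h.
```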

Outline: Optimal Sequential Estimation Problems; Multi-armed Restless Bandit Problems; Some Results; Further Work.

Further Work: current and future work.
Other applied problems, among them: Sensor networks (search for a reactive, elusive target); Wireless cellular systems (allocation of jobs under partial observability of channel conditions); Clinical trials (design of multi-arm multi-stage trials).
Methodological: optimality conditions for the Whittle heuristic.
Computational: use properties of Möbius transformations to improve the index computation method (by truncation).

Discussion: Questions & Comments. Thank you for your attention!


References

Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., Sarkadi, K., and Vincze, I., editors, Progress in Statistics (European Meeting of Statisticians, Budapest, 1972), pages 241-266. North-Holland, Amsterdam, The Netherlands.

Whittle, P. (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability, 25:287-298.

Niño-Mora, J. (2001). Restless bandits, partial conservation laws and indexability. Advances in Applied Probability, 33(1):76-98.

Niño-Mora, J. and Villar, S. S. (2009). Multitarget tracking via restless bandit marginal productivity indices and Kalman filter in discrete time. In Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 2009 28th Chinese Control Conference (CDC/CCC 2009), pages 2905-2910. IEEE.

Niño-Mora, J. and Villar, S. S. (2011). Sensor scheduling for hunting elusive hiding targets via Whittle's restless bandit index policy. In NetGCOOP 2011: International Conference on NETwork Games, COntrol and OPtimization. IEEE.

Jacko, P. and Villar, S. S. (2012). Opportunistic schedulers for optimal scheduling of flows in wireless systems with ARQ feedback. In Proceedings of the 24th International Teletraffic Congress (ITC), pages 1-8.