Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems

1 Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems. Sofía S. Villar, Postdoctoral Fellow, Basque Center for Applied Mathematics (BCAM). Lancaster University, November 5th, 2012

2 The objective of the talk
- Multi-armed Restless Bandit Problems (OR)
- Optimal Sequential Estimation Problems (Statistics)
- A novel solution to existing applied problems in modern sensor systems

3 Problem 1: Multi-target tracking (I) A set of moving targets to be tracked during a period of time:

4 Problem 1: Multi-target tracking (II) A set of flexible (and error-prone) sensors to track them

5 Problem 1: Multi-target tracking (III) P1: How can we best estimate the positions of all targets over time by selecting the beam directions that produce the sensors' measurements?

6 Problem 2: Finding elusive targets An error-prone sensor must find an object that hides itself when sensed and exposes itself when not sensed. P2: How should we select when to activate the sensor to maximize the chances of finding the object?

7 Outline
- Optimal Sequential Estimation Problems
- Multi-armed Restless Bandit Problems
- Some Results
- Further Work

8 Optimal Sequential Estimation and Multitarget Tracking: an illustrative example
Let the position of target n at time t be a random variable $x_{n,t}$ evolving over time:
$$x_{n,t} = x_{n,t-1} + \omega_{n,t}, \qquad \omega_{n,t} \sim N(0, q_n), \quad t \geq 1 \qquad (1)$$
Let $y_{n,t}$, the measurement of target n's position at time t, be such that
$$y_{n,t} = x_{n,t} + \nu_{n,t}, \qquad \nu_{n,t} \sim N(0, r_n) \qquad (2)$$
with $E(x_{n,0}) = \mu$ and $V(x_{n,0}) = 0$.
Q: From the observed sequence of measurements, can we produce estimates of the unobservable variables, i.e., $y_{n,t} \mapsto \hat{x}_{n,t}$? Can these estimates be more precise than those based on a single measurement alone?

9 The Kalman Filter Algorithm
Let $v_{n,t}$ be the variance of the sequential estimator $\hat{x}_{n,t}$.
A priori estimate of $x_{n,t}$ from (1) (Predictive Step):
$$(\hat{x}_{n,t})^P = \hat{x}_{n,t-1} \qquad (3)$$
$$(v_{n,t})^P = v_{n,t-1} + q_n \qquad (4)$$
A posteriori estimate of $x_{n,t}$ (Updating Step):
$$(\hat{x}_{n,t})^U = (1 - k)\,\hat{x}_{n,t-1} + k\, y_{n,t} \qquad (5)$$
$$(v_{n,t})^U = \frac{v_{n,t-1} + q_n}{v_{n,t-1}/r_n + q_n/r_n + 1} \qquad (6)$$
where $k = \frac{(v_{n,t})^P}{(v_{n,t})^P + r_n}$ is the Kalman gain, optimal in the sense that it minimizes $(v_{n,t})^U$.
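The recursion (3)-(6) is short enough to state in code. Below is a minimal Python sketch for the random-walk model (1)-(2); the function name, parameter values, and random seed are illustrative, not from the talk.

```python
import numpy as np

def kalman_step(x_hat, v, y, q, r):
    """One predict-update cycle of the scalar Kalman filter for model (1)-(2)."""
    x_pred, v_pred = x_hat, v + q        # predictive step, eqs. (3)-(4)
    k = v_pred / (v_pred + r)            # optimal Kalman gain
    x_upd = (1.0 - k) * x_pred + k * y   # a posteriori estimate, eq. (5)
    v_upd = (1.0 - k) * v_pred           # a posteriori variance, equals eq. (6)
    return x_upd, v_upd

rng = np.random.default_rng(0)
q, r = 5.0, 1.0                          # process / measurement noise variances
x = x_hat = 0.0                          # E(x_0) = mu = 0, V(x_0) = 0
v = 0.0
for t in range(1, 11):
    x = x + rng.normal(0.0, np.sqrt(q))  # hidden state, eq. (1)
    y = x + rng.normal(0.0, np.sqrt(r))  # noisy measurement, eq. (2)
    x_hat, v = kalman_step(x_hat, v, y, q, r)
    print(f"t={t}: x={x:6.2f}  x_hat={x_hat:6.2f}  v={v:.3f}")
```

Because a measurement is taken every period here, $v$ follows a deterministic recursion and converges to a fixed point below $r$ (about 0.85 in this instance), answering the question above: the sequential estimates are indeed more precise than any estimate based on a single measurement alone (whose variance is $r = 1$).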

10 Applications, Properties & More
- The Kalman filter algorithm is defined for models more general than equations (1) and (2).
- It is widely used for time series analysis, in fields such as signal processing and econometrics.
- Theoretical property: it yields minimum mean squared error (MSE) estimators.
- Key issue: if there are N targets and N sensors, the Kalman filter yields the minimum-MSE target-location estimators at every t. If the number of sensors falls below N, then at each time some targets receive a prediction without an update, resulting in a higher total predictive error. Room for optimizing while estimating!

11 Outline
- Optimal Sequential Estimation Problems
- Multi-armed Restless Bandit Problems
- Some Results
- Further Work

12 Markov Decision Problems: dynamic and stochastic optimization models in OR
Markov decision problems (MDPs) are a natural way of formulating optimal sequential estimation problems.
- State space: $X_t \in \mathcal{X}$, the sufficient statistic for the N targets' state upon which decisions are made at time t.
- Action set: $a_t \in A_t(X_t)$, the feasible control set applicable to the system at time t. If binary, $a_t \in \{0, 1\}$: a bandit.
- Transition dynamics: $X_{t+1} \sim P(\cdot \mid X_t, a_t)$, a probability measure following from the corresponding filter algorithm.
- One-period net rewards/costs: $R(X_t, a_t)$ / $C(X_t, a_t)$, representing the effect of the selected actions on the system's objective.
Multi-armed bandit problems are a special class of constrained MDPs.

13 Multitarget Tracking and Multi-armed Bandit Problems: elements
- Arms: N (moving targets).
- Action set: $a_{n,t} \in \{0, 1\}$, with $a_{n,t} = 1$ if target n is observed at t and 0 otherwise.
- State space: $v_{n,t} \in V_n \subseteq [0, \infty)$.
- One-period dynamics:
$$v_{n,t} = \phi_n(v_{n,t-1}, a_{n,t}) = \begin{cases} v_{n,t-1} + q_n, & \text{if } a_{n,t} = 0, \\ \dfrac{v_{n,t-1} + q_n}{v_{n,t-1}/r_n + q_n/r_n + 1}, & \text{if } a_{n,t} = 1. \end{cases}$$
- Time periods: $t = 0, 1, \ldots$ (infinite horizon).
- One-period costs: $C_n(v_{n,t}, a_{n,t}) = d_n\,\phi_n(v_{n,t}, a_{n,t}) + h_n\,a_{n,t}$, where $h_n \geq 0$ is the cost of measuring target n and $d_n > 0$ is the relative importance of target n's tracking precision.
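For concreteness, the dynamics and cost above might be coded as follows (a sketch; the function names phi and cost are my own):

```python
def phi(v, a, q, r):
    """Variance transition of one target: predict only (a = 0),
    or predict and update with a new measurement (a = 1)."""
    if a == 0:
        return v + q
    return (v + q) / (v / r + q / r + 1.0)

def cost(v, a, q, r, d, h):
    """One-period cost: weighted tracking-error variance plus measurement cost."""
    return d * phi(v, a, q, r) + h * a
```

Note that $\phi_n(\cdot, 1)$ is a linear-fractional (Möbius) transformation of $v$, the source of the nonlinear-dynamics complications discussed later.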

14 Optimal Sequential Estimation: a constrained MDP's optimal policy
The estimation rule yielding the minimum total (discounted) variance solves the MDP (7):
$$V^*(\mathbf{V}_0) = \min_{\pi \in \Pi(M)} \mathbb{E}^{\pi}_{\mathbf{V}_0}\left[\, \sum_{n=1}^{N} \sum_{t=0}^{\infty} \beta^t\, C^{(n)}(v_t, a_t) \right] \qquad (7)$$
- $\mathbb{E}^{\pi}_{\mathbf{V}_0}[\cdot]$ denotes expectation under $\pi$ conditioned on $\mathbf{V}_0$.
- $\Pi(M)$ contains all admissible policies that observe at most M targets at each t.
- $\beta$ is a discount factor in $[0, 1)$.
The solution to (7), $\pi^*$, is a stationary deterministic policy solving the traditional dynamic programming (DP) equations
$$V^*(\mathbf{V}_0) = \min_{a \in A(\mathbf{V}_0)} \left\{ C(\mathbf{V}_0, a) + \beta\, \mathbb{E}_{\mathbf{V}_1} V^*(\mathbf{V}_1) \right\}, \qquad \mathbf{V}_0, \mathbf{V}_1 \in [0, \infty)^N.$$

15 Heuristic-based Solution Methods: an alternative to dynamic programming
(1) The cardinality of the state space $\mathcal{V}$ determines the computation and storage requirements of the DP approach. Here $\mathcal{V} \subseteq \mathbb{R}^N$ is of infinite size, and no closed-form solution is available. Even if each $V_n$ is discretized to k values, the joint state space has size $k^N$; matters are worse still over an infinite horizon. ⇒ Practical implementation of $\pi^*$ for realistic scenarios is hindered by the curse of dimensionality.
(2) Mathematically based heuristic solution methodology: design tractable and near-optimal priority-index policies.

16 Bandit Indexation: a mathematically based methodology
Gittins '74: first showed that the optimal solution to the classic version (a bandit's state remains frozen if not activated) admits an expression that breaks the curse of dimensionality.
- For each arm, an index function (depending on its current state) exists and recovers the optimal policy (indexability).
- The index function assigns each arm a priority value (index) for activation, depending on its state.
- Heuristic index policy: at each t, allocate the available resources to activate the M arms with the highest priority values.
Whittle '88: proposed a relaxation and a Lagrangian approach for the restless case (a bandit's state evolves even if not activated).
- Index functions are no longer guaranteed to exist.
- The resulting heuristic is not guaranteed to be optimal.
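In code, the heuristic index policy is just a greedy top-M priority rule. A generic sketch (the interface is hypothetical):

```python
import numpy as np

def index_policy_step(states, index_fns, M):
    """Greedy priority rule: activate the M arms whose current states
    have the highest index values; return a binary action vector."""
    priorities = np.array([f(s) for f, s in zip(index_fns, states)])
    a = np.zeros(len(states), dtype=int)
    a[np.argsort(priorities)[-M:]] = 1   # top-M arms by priority
    return a
```

For the classic case with Gittins indices this rule is optimal; in the restless case it is only a heuristic.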

17 Solving Sequential Estimation Problems as MDPs: research challenges
Multi-armed bandit reformulations of the resulting MDPs are restless: active arms predict and update, while passive arms only predict.
Q: Does an index exist? Can it be computed tractably?
- Niño-Mora '01: discrete-state sufficient indexability conditions (SIC). If the SIC hold, the index exists and is computable through a tractable algorithm.
Q: Can the SIC be extended to a continuous (infinite) state space? Optimal sequential estimation models have complex nonlinear dynamics for the state variable (Möbius transformations); how can these complications be addressed to establish extended SIC?
Q: Even when the Whittle heuristic exists, it is not necessarily optimal. How can its suboptimality gap be assessed?

18 Outline
- Optimal Sequential Estimation Problems
- Multi-armed Restless Bandit Problems
- Some Results
- Further Work

19 Whittle Index for Multitarget Tracking: the marginal profit of pulling an arm
Definition: the single-arm problem is indexable if there exists an index function $\lambda^*(v)$ such that, for every $\lambda \in \mathbb{R}$ and every state $v \in V$, it is optimal to observe the target in state $v$ iff $\lambda^*(v) \geq \lambda$.
One-arm problem:
$$V^*(v) = \min_{\pi \in \Pi} \mathbb{E}^{\pi}_{v}\left[\, \sum_{t=0}^{\infty} \beta^t \left( d_n\,\phi_n(v_{n,t}, a_{n,t}) + h_n\,a_{n,t} \right) \right] \qquad (8)$$
The SIC hold for the single-target problem: Niño-Mora and Villar (2009) show that the single-arm tracking problem is indexable under the β-discounted criterion. The Whittle index can be computed as
$$\lambda^*(v) = d_n \left( \phi_n(v, 0) - \phi_n(v, 1) \right) - h_n + \beta \left( V^*(\phi_n(v, 0)) - V^*(\phi_n(v, 1)) \right).$$
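Given the closed form above, the index can be evaluated numerically once $V^*$ from (8) is approximated. The sketch below is my own construction, not the talk's algorithm: it runs value iteration on a truncated, discretized variance grid (the truncation level, grid size, and snap-to-grid interpolation are arbitrary choices) and then reads off $\lambda^*(v)$:

```python
import numpy as np

def phi(v, a, q, r):
    """Variance transition, as in the tracking model."""
    return v + q if a == 0 else (v + q) / (v / r + q / r + 1.0)

def whittle_index(q=5.0, r=1.0, d=1.0, h=0.0, beta=0.9,
                  v_max=50.0, n_grid=2000, tol=1e-9):
    """Approximate lambda*(v) on a truncated, discretized variance grid."""
    grid = np.linspace(0.0, v_max, n_grid)
    snap = lambda w: np.clip(np.searchsorted(grid, w), 0, n_grid - 1)
    nxt = {a: snap(np.minimum(phi(grid, a, q, r), v_max)) for a in (0, 1)}
    cst = {a: d * phi(grid, a, q, r) + h * a for a in (0, 1)}
    V = np.zeros(n_grid)
    while True:                          # value iteration for the one-arm problem (8)
        V_new = np.minimum(cst[0] + beta * V[nxt[0]],
                           cst[1] + beta * V[nxt[1]])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Whittle index via the closed form above:
    # lambda*(v) = d (phi(v,0) - phi(v,1)) - h + beta (V*(phi(v,0)) - V*(phi(v,1)))
    lam = (d * (phi(grid, 0, q, r) - phi(grid, 1, q, r)) - h
           + beta * (V[nxt[0]] - V[nxt[1]]))
    return grid, lam

grid, lam = whittle_index(beta=0.99)     # index curves like those on the next slide
```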

20 Index Existence & Evaluation
[Figure: the Whittle index $\lambda^*(v)$ plotted against $v$ for different discount factors $\beta$, in a target instance with $r = 1$ and $q = 5$.]

21 Whittle Index Policy: benchmarking results
Base instance: $\beta = 0.99$, $T = 10^4$, $M = 1$, $N = 4$, where for all $n \in N$: $q_n \equiv 10$, $r_n \equiv 1$, $d_n \equiv 1$, $h_n \equiv 0$, and $v_{n,0} = 0$. The instance is modified by letting $q_1 \in \{0.5, 1, \ldots, 10\}$.
Performance objective: total prediction error for the positions of all targets over time.
[Figure: total prediction error of the myopic policy ($\beta = 0$), the TEV policy (index $v_t$), and the Whittle policy (index $\lambda^*(v)$), compared against a lower bound on $V^D(v)$.]
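Since the variance dynamics are deterministic, a benchmark of this kind can be computed exactly by iterating the index rule. Below is a self-contained sketch under my own simplifications (only the myopic and TEV rules are shown; a Whittle rule would plug in $\lambda^*(v)$ from the earlier sketch):

```python
import numpy as np

def phi(v, a, q, r):
    """Variance transition: predict only (a = 0) or predict-and-update (a = 1)."""
    return v + q if a == 0 else (v + q) / (v / r + q / r + 1.0)

def simulate(index_fn, q, r, T=10_000, M=1):
    """Total predictive tracking error of all targets under an index policy."""
    v = np.zeros(len(q))                           # v_{n,0} = 0
    total = 0.0
    for _ in range(T):
        a = np.zeros(len(q), dtype=int)
        a[np.argsort(index_fn(v, q, r))[-M:]] = 1  # measure the top-M targets
        v = np.where(a == 1, phi(v, 1, q, r), phi(v, 0, q, r))
        total += v.sum()                           # d_n = 1, h_n = 0
    return total

q = np.array([0.5, 10.0, 10.0, 10.0])              # q_1 modified as on the slide
r = np.ones(4)
myopic = lambda v, q, r: phi(v, 0, q, r) - phi(v, 1, q, r)  # beta = 0 index
tev = lambda v, q, r: v                                     # TEV: index v_t
print("myopic:", simulate(myopic, q, r))
print("TEV:   ", simulate(tev, q, r))
```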

22 Outline
- Optimal Sequential Estimation Problems
- Multi-armed Restless Bandit Problems
- Some Results
- Further Work

23 Further Work: current and future work
Other applied problems, among them:
- Sensor networks: search for a reactive (elusive) target.
- Wireless cellular systems: allocation of jobs under partial observability of channel conditions.
- Clinical trials: design of multi-arm multi-stage trials.
Methodological: optimality conditions for the Whittle heuristic.
Computational: use properties of Möbius transformations to improve the index computation method (by truncation).

24 Discussion: Questions & Comments. Thank you for your attention!

26 References
- Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., Sarkadi, K., and Vincze, I., editors, Progress in Statistics (European Meeting of Statisticians, Budapest, 1972). North-Holland, Amsterdam, The Netherlands.
- Whittle, P. (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability, 25.
- Niño-Mora, J. (2001). Restless bandits, partial conservation laws and indexability. Advances in Applied Probability, 33(1).
- Niño-Mora, J. and Villar, S. S. (2009). Multitarget tracking via restless bandit marginal productivity indices and Kalman filter in discrete time. In Proceedings of the 48th IEEE Conference on Decision and Control, held jointly with the 28th Chinese Control Conference (CDC/CCC 2009). IEEE.
- Niño-Mora, J. and Villar, S. S. (2011). Sensor scheduling for hunting elusive hiding targets via Whittle's restless bandit index policy. In NetGCOOP 2011: International Conference on NETwork Games, COntrol and OPtimization. IEEE.
- Jacko, P. and Villar, S. S. (2012). Opportunistic schedulers for optimal scheduling of flows in wireless systems with ARQ feedback. In Proceedings of the 24th International Teletraffic Congress (ITC), pp. 1-8.
