Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems
|
|
- Aubrey Atkinson
- 5 years ago
- Views:
Transcription
1 Restless Bandit Index Policies for Solving Constrained Sequential Estimation Problems Sofía S. Villar Postdoctoral Fellow Basque Center for Applied Mathemathics (BCAM) Lancaster University November 5th, 2012
2 The objective of the talk Multi-armed Restless Bandit Problems (OR) Optimal Sequential Estimation Problems (Statistics) Novel Solution to existing Applied Problems in Modern Sensor Systems
3 Porblem 1: Multi-target tracking (I) A set of moving targets to be tracked during a period of time:
4 Problem 1: Multi-target tracking (II) A set of flexible (and error-prone) sensors to track them
5 Problem 1: Multi-target tracking (III) P1: How to best estimate the position of all targets over time by selecting the beams direction to produce sensors measurements?
6 Problem 2: Finding elusive targets An error prone sensor must find an object which hides/exposes itself when sensed/not sensed. P2: How should we select when to activate the sensor to maximize the chances of finding the object?
7 Outline Optimal Sequential Estimation Problems Multi-armed Restless Bandit Problems Some Results Further Work
8 Optimal Sequential Estimation and Multitarget Tracking An illustrative example Let the position of some target n at time t, be a random variable x n,t evolving over time: x n,t = x n,t 1 + ω n,t, ω n,t N(0, q n ) t 1 (1) Let the measurement of target s n position at time t: y n,t, be such that y n,t = x n,t + ν n,t, ν n,t N(0, r n ) (2) E(x n,0 ) = µ and V(x n,0 ) = 0. Q: From the observed sequence of measurements produce estimates of unobservale variables, i.e., y n,t ˆx n,t? Can these estimates be more precise than those based on a single measurement alone?
9 The Kalman Filter Algorithm Let v n,t be the variance of the sequential estimator ˆx n,t A priori estimate of x n,t from (1) (Predictive Step) (ˆx n,t ) P = ˆx n,t 1 (3) (v n,t ) P = v n,t 1 + q n. (4) A posteriori estimate of x n,t (Updating Step) Where k (ˆx n,t ) U = (1 k )ˆx n,t 1 + k y n,t (5) (v n,t ) U = v n,t 1 + q n v n,t 1 /r n + q n /r n + 1. (6) (v n,t 1) p (v n,t 1 ) p +r is the optimal Kalman Gain, and it is optimal in the sense that it minimizes (v n,t ) U
10 Applications, Properties & More The Kalman Filter algorithm is defined for models more general than equations (1) and (2). Widely used for time series analysis, in fields such as signal processing and econometrics. Theoretical properties: minimum Mean Squared Error (MSE) estimators. Key Issue: If there are N targets and N sensors the Kalman filter yields the minimum MSE targets locations estimators at every t. If sensors fall below N, at each time for some targets there will be a prediction without update, therefore resulting in a higher total predictive error. Room for optimizing while estimating!
11 Outline Optimal Sequential Estimation Problems Multi-armed Restless Bandit Problems Some Results Further Work
12 Markov Decision Problems Dynamic and Stochastic Optimization Models in OR Markov Decision Problems (MDPs) are a natural way of formulating optimal sequential estimation problems. State space: X t X the sufficient statistic for the N targets state upon which to make decisions at time t. Action Set: a t A t (X t ) the feasible control set applicable to the system at time t. If binary a t {0, 1} Bandit Transition Dynamics: X t+1 P(. X t, a t ) A probability measure following from the corresponding Filter algorithm. One-period net rewards/costs: R(X t, a t ) / C(X t, a t ) representing the effect in terms of the objective of the system of the selected actions. Multi-armed Bandit Problems are a special class of constrained MDPs.
13 Multitarget tracking and Multi-armed Bandit Problems Elements Arms: N (Moving targets) Action Set: a n,t {0, 1} with a n,t = 1 if target n is observed in t and 0 otherwise. State space: v n,t V n [0, ) One-period Dynamics: ( φ n vn,t 1, 0 ) v n,t 1 + q n, if a n,t = 0 v n,t = ( φ n vn,t 1, 1 ) v n,t 1 + q n v n,t 1 /r n + q n /r n + 1, if a n,t = 1, Time periods: t = 0, 1,... (Infinite horizon) One-period costs: C n (v n,t, a n,t ) d n φ n (v n,t, a n,t ) + h n a n,t h n 0 : measurement cost of measuring target n d n > 0: relative importance of tracking precision of target n.
14 Optimal Sequential Estimation A Constrained MDP s optimal policy The estimation rule yielding the total minimum (discounted) variance solves MDP (7) [ N V (V 0 ) min π Π(M) Eπ V 0 β t C (n)( ) ] v t, a t, (7) n=1 t=0 E π V 0 [.] denotes expectation under π conditioned on V 0. Π(M) contains all the possible solutions that observe at most M targets at each t. β is a discount factor in [0, 1). The solution to (7) π is stationary deterministic policy solving the traditional dynamic programming (DP) equations V (V 0 ) min {C(V 0, a) + βe V1 V (V 1 )}, V 0, V 1 [0, ) N a A(V 0 )
15 Heuristic based Solutions Methods An alternative to Dynamic Programming (1) The cardinality of the state space V determines the computation and storage requirements of the DP approach. V R N infinite size. Existence of closed-form solutions. If V n = k its size is k N. Worse for infinite horizon. = Practical implementation of π for realistic scenarios is hindered by the curse of dimensionality. (2) Mathematically based Heuristic solution Methodology Design tractable and near-optimal priority-index policies.
16 Bandit Indexation A Mathematically based Methodology Gittins 74 First showed that the optimal solution to the Classic version (Bandits state remain frozen if not activated) admitted an expression that breaks the curse of dimensionality. For each arm, an index function (depending on its current state) exists and recovers its optimal policy (Indexability). The index function assigns a priority value (index) to each arm to being activated, depending on its state. Heuristic Index Policy Allocate available resources at each t to activate M arms with highest priority values. hittle 88 Proposed a relaxation and a Lagrangian approach for the Restless case (Bandits state evolve even if not activated) Index functions were not ensured to always exist. The resulting heuristic was not guaranteed to be optimal.
17 Solving Sequential Estimation Problems as MDPs Research Challenges Multi-armed Bandit reformulations of the resulting MDPs are restless Predicting and Updating Predicting Q: Index existence? Index computation? Tractable? Niño-Mora 01: Discrete-State Sufficient Indexability Conditions (SIC). If SIC hold Index exists and it is computable through a tractable algorithm. Continuous state space Extended SIC for infinite state? Optimal sequential estimation models complex nonlinear dynamics for the state variable (Möbius Transformations) Q: How to address these complications to establish extended SIC? If Whittle heuristic exists, it is not neccesarily optimal. Q: Assesing Suboptimality Gap?
18 Outline Optimal Sequential Estimation Problems Multi-armed Restless Bandit Problems Some Results Further Work
19 Whittle Index for Multitarget Tracking The marginal profit of pulling an arm Definition: Single-arm problem is indexable if an index function λ (v) such that λ R and state v V 0,1, it is optimal to observe the target in state v iff λ (v) λ. One-arm problem: V (v) min π Π Eπ V [ ] β t ( ) d n φ n vn,t, a n,t + hn a n,t, (8) t=0 SIC for the single-target problem hold Niño-Mora and Villar (2009) the single-arm tracking problem is indexable under the β-discounted criterion. The Whittle index can be computed as follows: λ (v) = d n ( φn ( vn, 0 ) φ n ( vn, 1 )) h n +β (V (φ(v, 0) V (φ(v, 1))
20 Index Existence & Evaluation Λ*HvL H1L Φ Β = The Whittle index for different discount factors β in a target instance of r = 1, and q = 5 v
21 Whittle Index Policy: Benchmarking results Base instance: β = 0.99, T = 10 4, M = 1, N = 4, where n N, q n 10, r n 1, d n 1, h n 0, and v n,0 = 0. Total Prediction Error for the prediction of the position of all targets over time Performance Objective 60 Modify it by letting: q 1 {0.5, 1,..., 10}. 50 Myopic policy (β = 0) TEV policy, index v t Whittle policy, index λ (v) Lower Bound on V D (v)
22 Outline Optimal Sequential Estimation Problems Multi-armed Restless Bandit Problems Some Results Further Work
23 Further Work Current and Future Work Other Applied Problems. Some of them: Sensor Networks Search of a reactive (elusive) target. Wireless Cellular Systems Allocation of jobs under partial observability of channel conditions. Clinical Trials Design of Multi-arm multi-stage trials. Methodological: Optimality Conditions for the Whittle Heuristic. Computational: Use properties of Möbious transformations to improve on index computation method (by truncation).
24 Discussion Questions & Comments Thanks for the attention!
25 Discussion Questions & Comments Thanks for the attention!
26 References I Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design of experiments. In Gani, J., Sarkadi, K., and Vincze, I., editors, Progress in Statistics (European Meeting of Statisticians, Budapest, 1972), pages North-Holland, Amsterdam, The Netherlands. Whittle, P. (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability, 25: Niño-Mora, J. (2001). Restless bandits, partial conservation laws and indexability. Advances in Applied Probability, 33(1): Niño-Mora, J. and Villar, S. S. (2009). Multitarget tracking via restless bandit marginal productivity indices and Kalman filter in discrete time. In Proceedings of the 48th IEEE Conference on Decision and Control, 2009 held jointly with the th Chinese Control Conference. CDC/CCC 2009, pages IEEE. Niño-Mora, J. and Villar, S. S. (2011). Sensor scheduling for hunting elusive hiding targets via whittle s restless bandit index policy. In NetGCOOP 2011 : International conference on NETwork Games, COntrol and OPtimization. IEEE. P. Jacko, S. S. Villar (2012). Opportunistic Schedulers for Optimal Scheduling of Flows in Wireless Systems with ARQ Feedback Proceedings of The 24th International Teletraffic Congress (ITC), pp. 1-8
An Optimal Index Policy for the Multi-Armed Bandit Problem with Re-Initializing Bandits
An Optimal Index Policy for the Multi-Armed Bandit Problem with Re-Initializing Bandits Peter Jacko YEQT III November 20, 2009 Basque Center for Applied Mathematics (BCAM), Bilbao, Spain Example: Congestion
More informationWireless Channel Selection with Restless Bandits
Wireless Channel Selection with Restless Bandits Julia Kuhn and Yoni Nazarathy Abstract Wireless devices are often able to communicate on several alternative channels; for example, cellular phones may
More informationMulti-Armed Bandit: Learning in Dynamic Systems with Unknown Models
c Qing Zhao, UC Davis. Talk at Xidian Univ., September, 2011. 1 Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models Qing Zhao Department of Electrical and Computer Engineering University
More informationPower Allocation over Two Identical Gilbert-Elliott Channels
Power Allocation over Two Identical Gilbert-Elliott Channels Junhua Tang School of Electronic Information and Electrical Engineering Shanghai Jiao Tong University, China Email: junhuatang@sjtu.edu.cn Parisa
More informationIndex Policies and Performance Bounds for Dynamic Selection Problems
Index Policies and Performance Bounds for Dynamic Selection Problems David B. Brown Fuqua School of Business Duke University dbbrown@duke.edu James E. Smith Tuck School of Business Dartmouth College jim.smith@dartmouth.edu
More informationMulti-channel Opportunistic Access: A Case of Restless Bandits with Multiple Plays
Multi-channel Opportunistic Access: A Case of Restless Bandits with Multiple Plays Sahand Haji Ali Ahmad, Mingyan Liu Abstract This paper considers the following stochastic control problem that arises
More informationNear-optimal policies for broadcasting files with unequal sizes
Near-optimal policies for broadcasting files with unequal sizes Majid Raissi-Dehkordi and John S. Baras Institute for Systems Research University of Maryland College Park, MD 20742 majid@isr.umd.edu, baras@isr.umd.edu
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationOptimality of Myopic Sensing in Multi-Channel Opportunistic Access
Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Tara Javidi, Bhasar Krishnamachari, Qing Zhao, Mingyan Liu tara@ece.ucsd.edu, brishna@usc.edu, qzhao@ece.ucdavis.edu, mingyan@eecs.umich.edu
More informationCOMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and.
COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis State University of New York at Stony Brook and Cyrus Derman Columbia University The problem of assigning one of several
More informationSequential Selection of Projects
Sequential Selection of Projects Kemal Gürsoy Rutgers University, Department of MSIS, New Jersey, USA Fusion Fest October 11, 2014 Outline 1 Introduction Model 2 Necessary Knowledge Sequential Statistics
More informationComplexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
More informationExploration vs. Exploitation with Partially Observable Gaussian Autoregressive Arms
Exploration vs. Exploitation with Partially Observable Gaussian Autoregressive Arms Julia Kuhn The University of Queensland, University of Amsterdam j.kuhn@uq.edu.au Michel Mandjes University of Amsterdam
More informationOptimality of Myopic Sensing in Multi-Channel Opportunistic Access
Optimality of Myopic Sensing in Multi-Channel Opportunistic Access Tara Javidi, Bhasar Krishnamachari,QingZhao, Mingyan Liu tara@ece.ucsd.edu, brishna@usc.edu, qzhao@ece.ucdavis.edu, mingyan@eecs.umich.edu
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationBandit models: a tutorial
Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses
More informationBayesian Contextual Multi-armed Bandits
Bayesian Contextual Multi-armed Bandits Xiaoting Zhao Joint Work with Peter I. Frazier School of Operations Research and Information Engineering Cornell University October 22, 2012 1 / 33 Outline 1 Motivating
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationOn the Optimality of Myopic Sensing. in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels
On the Optimality of Myopic Sensing 1 in Multi-channel Opportunistic Access: the Case of Sensing Multiple Channels Kehao Wang, Lin Chen arxiv:1103.1784v1 [cs.it] 9 Mar 2011 Abstract Recent works ([1],
More informationMulti-armed bandit models: a tutorial
Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)
More informationResource Capacity Allocation to Stochastic Dynamic Competitors: Knapsack Problem for Perishable Items and Index-Knapsack Heuristic
Annals of Operations Research manuscript No. (will be inserted by the editor) Resource Capacity Allocation to Stochastic Dynamic Competitors: Knapsack Problem for Perishable Items and Index-Knapsack Heuristic
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationMULTI-ARMED BANDIT PROBLEMS
Chapter 6 MULTI-ARMED BANDIT PROBLEMS Aditya Mahajan University of Michigan, Ann Arbor, MI, USA Demosthenis Teneketzis University of Michigan, Ann Arbor, MI, USA 1. Introduction Multi-armed bandit MAB)
More informationMathematical Optimization Models and Applications
Mathematical Optimization Models and Applications Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 1, 2.1-2,
More informationReductions Of Undiscounted Markov Decision Processes and Stochastic Games To Discounted Ones. Jefferson Huang
Reductions Of Undiscounted Markov Decision Processes and Stochastic Games To Discounted Ones Jefferson Huang School of Operations Research and Information Engineering Cornell University November 16, 2016
More informationReinforcement Learning
Reinforcement Learning Markov decision process & Dynamic programming Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value
More informationMulti-armed Bandits and the Gittins Index
Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge A talk to accompany Lecture 6 Two-armed Bandit 3, 10, 4, 9, 12, 1,... 5, 6, 2, 15, 2, 7,... Two-armed
More informationEvaluation of multi armed bandit algorithms and empirical algorithm
Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.
More informationMulti-armed Bandits and the Gittins Index
Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge Seminar at Microsoft, Cambridge, June 20, 2011 The Multi-armed Bandit Problem The Multi-armed Bandit
More informationInformation Relaxation Bounds for Infinite Horizon Markov Decision Processes
Information Relaxation Bounds for Infinite Horizon Markov Decision Processes David B. Brown Fuqua School of Business Duke University dbbrown@duke.edu Martin B. Haugh Department of IE&OR Columbia University
More informationThe Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount
The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationAbstract We present a polyhedral framework for establishing general structural properties on optimal solutions of stochastic scheduling problems, wher
On certain greedoid polyhedra, partially indexable scheduling problems, and extended restless bandit allocation indices Submitted to Mathematical Programming José Ni~no-Mora Λ April 5, 2 Λ Dept. of Economics
More informationBayesian Congestion Control over a Markovian Network Bandwidth Process
Bayesian Congestion Control over a Markovian Network Bandwidth Process Parisa Mansourifard 1/30 Bayesian Congestion Control over a Markovian Network Bandwidth Process Parisa Mansourifard (USC) Joint work
More informationProphet Inequalities and Stochastic Optimization
Prophet Inequalities and Stochastic Optimization Kamesh Munagala Duke University Joint work with Sudipto Guha, UPenn Bayesian Decision System Decision Policy Stochastic Model State Update Model Update
More informationOnline Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel
Online Learning to Optimize Transmission over an Unknown Gilbert-Elliott Channel Yanting Wu Dept. of Electrical Engineering University of Southern California Email: yantingw@usc.edu Bhaskar Krishnamachari
More informationThe Irrevocable Multi-Armed Bandit Problem
The Irrevocable Multi-Armed Bandit Problem Vivek F. Farias Ritesh Madan 22 June 2008 Revised: 16 June 2009 Abstract This paper considers the multi-armed bandit problem with multiple simultaneous arm pulls
More informationNear-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement
Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Jefferson Huang March 21, 2018 Reinforcement Learning for Processing Networks Seminar Cornell University Performance
More informationDynamic spectrum access with learning for cognitive radio
1 Dynamic spectrum access with learning for cognitive radio Jayakrishnan Unnikrishnan and Venugopal V. Veeravalli Department of Electrical and Computer Engineering, and Coordinated Science Laboratory University
More informationInfinite-Horizon Discounted Markov Decision Processes
Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected
More informationInfinite-Horizon Average Reward Markov Decision Processes
Infinite-Horizon Average Reward Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Average Reward MDP 1 Outline The average
More informationApproximate dynamic programming for stochastic reachability
Approximate dynamic programming for stochastic reachability Nikolaos Kariotoglou, Sean Summers, Tyler Summers, Maryam Kamgarpour and John Lygeros Abstract In this work we illustrate how approximate dynamic
More informationState estimation of linear dynamic system with unknown input and uncertain observation using dynamic programming
Control and Cybernetics vol. 35 (2006) No. 4 State estimation of linear dynamic system with unknown input and uncertain observation using dynamic programming by Dariusz Janczak and Yuri Grishin Department
More informationTemporal-Difference Q-learning in Active Fault Diagnosis
Temporal-Difference Q-learning in Active Fault Diagnosis Jan Škach 1 Ivo Punčochář 1 Frank L. Lewis 2 1 Identification and Decision Making Research Group (IDM) European Centre of Excellence - NTIS University
More informationMulti-armed Bandits with Limited Exploration
Multi-armed Bandits with Limited Exploration Sudipto Guha Kamesh Munagala Abstract A central problem to decision making under uncertainty is the trade-off between exploration and exploitation: between
More informationSOME MEMORYLESS BANDIT POLICIES
J. Appl. Prob. 40, 250 256 (2003) Printed in Israel Applied Probability Trust 2003 SOME MEMORYLESS BANDIT POLICIES EROL A. PEKÖZ, Boston University Abstract We consider a multiarmed bit problem, where
More informationINDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS
Applied Probability Trust (4 February 2008) INDEX POLICIES FOR DISCOUNTED BANDIT PROBLEMS WITH AVAILABILITY CONSTRAINTS SAVAS DAYANIK, Princeton University WARREN POWELL, Princeton University KAZUTOSHI
More informationFinding the Value of Information About a State Variable in a Markov Decision Process 1
05/25/04 1 Finding the Value of Information About a State Variable in a Markov Decision Process 1 Gilvan C. Souza The Robert H. Smith School of usiness, The University of Maryland, College Park, MD, 20742
More informationMDP Preliminaries. Nan Jiang. February 10, 2019
MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process
More informationArtificial Intelligence
Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important
More informationReinforcement learning
Reinforcement learning Based on [Kaelbling et al., 1996, Bertsekas, 2000] Bert Kappen Reinforcement learning Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error
More informationSome notes on Markov Decision Theory
Some notes on Markov Decision Theory Nikolaos Laoutaris laoutaris@di.uoa.gr January, 2004 1 Markov Decision Theory[1, 2, 3, 4] provides a methodology for the analysis of probabilistic sequential decision
More informationThe knowledge gradient method for multi-armed bandit problems
The knowledge gradient method for multi-armed bandit problems Moving beyond inde policies Ilya O. Ryzhov Warren Powell Peter Frazier Department of Operations Research and Financial Engineering Princeton
More informationOptimal and Suboptimal Policies for Opportunistic Spectrum Access: A Resource Allocation Approach
Optimal and Suboptimal Policies for Opportunistic Spectrum Access: A Resource Allocation Approach by Sahand Haji Ali Ahmad A dissertation submitted in partial fulfillment of the requirements for the degree
More information1 MDP Value Iteration Algorithm
CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using
More informationWIRELESS systems often operate in dynamic environments
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 61, NO. 9, NOVEMBER 2012 3931 Structure-Aware Stochastic Control for Transmission Scheduling Fangwen Fu and Mihaela van der Schaar, Fellow, IEEE Abstract
More informationReinforcement Learning
Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha
More informationBasics of reinforcement learning
Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system
More informationExploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks A Whittle s Indexability Analysis
1 Exploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks A Whittle s Indexability Analysis Wenzhuo Ouyang, Sugumar Murugesan, Atilla Eryilmaz, Ness B Shroff Abstract We address
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationPartially Observable Markov Decision Processes (POMDPs)
Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions
More informationCS 7180: Behavioral Modeling and Decisionmaking
CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and
More informationSection Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018
Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections
More informationCoupled Bisection for Root Ordering
Coupled Bisection for Root Ordering Stephen N. Pallone, Peter I. Frazier, Shane G. Henderson School of Operations Research and Information Engineering Cornell University, Ithaca, NY 14853 Abstract We consider
More informationStochastic Multi-Armed-Bandit Problem with Non-stationary Rewards
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationExploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks
Exploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks Wenzhuo Ouyang, Sugumar Murugesan, Atilla Eryilmaz, Ness B. Shroff Department of Electrical and Computer Engineering The
More informationarxiv: v1 [math.oc] 9 Oct 2018
A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces Insoon Yang arxiv:1810.03847v1 [math.oc] 9 Oct 2018 Abstract A convex optimization-based method is proposed to
More informationDynamic Discrete Choice Structural Models in Empirical IO
Dynamic Discrete Choice Structural Models in Empirical IO Lecture 4: Euler Equations and Finite Dependence in Dynamic Discrete Choice Models Victor Aguirregabiria (University of Toronto) Carlos III, Madrid
More informationPerformance of Round Robin Policies for Dynamic Multichannel Access
Performance of Round Robin Policies for Dynamic Multichannel Access Changmian Wang, Bhaskar Krishnamachari, Qing Zhao and Geir E. Øien Norwegian University of Science and Technology, Norway, {changmia,
More informationIntroduction to Sequential Teams
Introduction to Sequential Teams Aditya Mahajan McGill University Joint work with: Ashutosh Nayyar and Demos Teneketzis, UMichigan MITACS Workshop on Fusion and Inference in Networks, 2011 Decentralized
More informationLearning of Uncontrolled Restless Bandits with Logarithmic Strong Regret
Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret Cem Tekin, Member, IEEE, Mingyan Liu, Senior Member, IEEE 1 Abstract In this paper we consider the problem of learning the optimal
More informationIntroduction p. 1 Fundamental Problems p. 2 Core of Fundamental Theory and General Mathematical Ideas p. 3 Classical Statistical Decision p.
Preface p. xiii Acknowledgment p. xix Introduction p. 1 Fundamental Problems p. 2 Core of Fundamental Theory and General Mathematical Ideas p. 3 Classical Statistical Decision p. 4 Bayes Decision p. 5
More informationPlayers as Serial or Parallel Random Access Machines. Timothy Van Zandt. INSEAD (France)
Timothy Van Zandt Players as Serial or Parallel Random Access Machines DIMACS 31 January 2005 1 Players as Serial or Parallel Random Access Machines (EXPLORATORY REMARKS) Timothy Van Zandt tvz@insead.edu
More informationMultiple Bits Distributed Moving Horizon State Estimation for Wireless Sensor Networks. Ji an Luo
Multiple Bits Distributed Moving Horizon State Estimation for Wireless Sensor Networks Ji an Luo 2008.6.6 Outline Background Problem Statement Main Results Simulation Study Conclusion Background Wireless
More informationCan Decentralized Status Update Achieve Universally Near-Optimal Age-of-Information in Wireless Multiaccess Channels?
Can Decentralized Status Update Achieve Universally Near-Optimal Age-of-Information in Wireless Multiaccess Channels? Zhiyuan Jiang, Bhaskar Krishnamachari, Sheng Zhou, Zhisheng Niu, Fellow, IEEE {zhiyuan,
More informationEE482: Digital Signal Processing Applications
Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 11 Adaptive Filtering 14/03/04 http://www.ee.unlv.edu/~b1morris/ee482/
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More informationApproximate policy iteration for dynamic resource-constrained project scheduling
Approximate policy iteration for dynamic resource-constrained project scheduling Mahshid Salemi Parizi, Yasin Gocgun, and Archis Ghate June 14, 2017 Abstract We study non-preemptive scheduling problems
More informationIntroducing strategic measure actions in multi-armed bandits
213 IEEE 24th International Symposium on Personal, Indoor and Mobile Radio Communications: Workshop on Cognitive Radio Medium Access Control and Network Solutions Introducing strategic measure actions
More information6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011
6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationCS 598 Statistical Reinforcement Learning. Nan Jiang
CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL
More informationA LAGRANGIAN APPROACH TO DYNAMIC RESOURCE ALLOCATION
Proceedings of the 2010 Winter Simulation Conference B. ohansson, S. ain,. Montoya-Torres,. Hugan, and E. Yücesan, eds. A LAGRANGAN APPROACH TO DYNAMC RESOURCE ALLOCATON Yasin Gocgun Archis Ghate ndustrial
More informationDecentralized Control of Stochastic Systems
Decentralized Control of Stochastic Systems Sanjay Lall Stanford University CDC-ECC Workshop, December 11, 2005 2 S. Lall, Stanford 2005.12.11.02 Decentralized Control G 1 G 2 G 3 G 4 G 5 y 1 u 1 y 2 u
More informationA Stochastic Control Approach for Scheduling Multimedia Transmissions over a Polled Multiaccess Fading Channel
A Stochastic Control Approach for Scheduling Multimedia Transmissions over a Polled Multiaccess Fading Channel 1 Munish Goyal, Anurag Kumar and Vinod Sharma Dept. of Electrical Communication Engineering,
More informationDynamic Control of a Tandem Queueing System with Abandonments
Dynamic Control of a Tandem Queueing System with Abandonments Gabriel Zayas-Cabán 1 Jungui Xie 2 Linda V. Green 3 Mark E. Lewis 1 1 Cornell University Ithaca, NY 2 University of Science and Technology
More informationLearning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning
JMLR: Workshop and Conference Proceedings vol:1 8, 2012 10th European Workshop on Reinforcement Learning Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning Michael
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationMarkov decision processes with threshold-based piecewise-linear optimal policies
1/31 Markov decision processes with threshold-based piecewise-linear optimal policies T. Erseghe, A. Zanella, C. Codemo Dept. of Information Engineering, University of Padova, Italy Padova, June 2, 213
More informationExploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks
1 Exploiting Channel Memory for Joint Estimation and Scheduling in Downlink Networks Wenzhuo Ouyang, Sugumar Murugesan, Atilla Eryilmaz, Ness B. Shroff arxiv:1009.3959v6 [cs.ni] 7 Dec 2011 Abstract We
More informationA Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling
A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling Nicolas Gast Bruno Gaujal Grenoble University Knoxville, May 13th-15th 2009 N. Gast (LIG) Mean
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationMarginal Productivity Index Policies for Admission Control and Routing to Parallel Multi-server Loss Queues with Reneging
Marginal Productivity Index Policies for Admission Control and Routing to Parallel Multi-server Loss Queues with Reneging José Niño-Mora Universidad Carlos III de Madrid Department of Statistics C/ Madrid
More informationA Study of Covariances within Basic and Extended Kalman Filters
A Study of Covariances within Basic and Extended Kalman Filters David Wheeler Kyle Ingersoll December 2, 2013 Abstract This paper explores the role of covariance in the context of Kalman filters. The underlying
More informationVariable Objective Search
Variable Objective Search Sergiy Butenko, Oleksandra Yezerska, and Balabhaskar Balasundaram Abstract This paper introduces the variable objective search framework for combinatorial optimization. The method
More informationJoint-optimal Probing and Scheduling in Wireless Systems
Joint-optimal Probing and Scheduling in Wireless Systems Prasanna Chaporkar and Alexandre Proutiere Abstract Consider a wireless system where a sender can transmit data to various users with independent
More informationCS 4100 // artificial intelligence. Recap/midterm review!
CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationPractical Dynamic Programming: An Introduction. Associated programs dpexample.m: deterministic dpexample2.m: stochastic
Practical Dynamic Programming: An Introduction Associated programs dpexample.m: deterministic dpexample2.m: stochastic Outline 1. Specific problem: stochastic model of accumulation from a DP perspective
More information