Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement


Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement
Jefferson Huang
Reinforcement Learning for Processing Networks Seminar, Cornell University
March 21, 2018

Performance Evaluation and Optimization

Policy Evaluation: there are various (approximate) methods for evaluating a fixed policy of an MDP, where "evaluate" means computing its value function; methods include LSTD, TD(λ), ...

Policy Improvement: if v^π is the exact value function of the policy π, then a policy π+ that is provably at least as good is given by

$$\pi^+(x) \in \arg\min_{a \in A(x)} \Big\{ c(x,a) + \delta\, \mathbb{E}\big[v^\pi(X') \,\big|\, X' \sim p(\cdot \mid x, a)\big] \Big\}, \qquad x \in X. \tag{1}$$

Discounted costs: δ ∈ [0, 1), and v^π gives the expected discounted total cost from each state under π.
Average costs (when the MDP is unichain, e.g., ergodic under every policy): δ = 1, and v^π is the relative value function under π.

Policy Iteration is based on this idea.
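As a concrete illustration, here is a minimal sketch of exact policy evaluation followed by one-step improvement on a small finite MDP; the state space, costs, and transition probabilities below are illustrative assumptions, not anything from the talk.

    import numpy as np

    def one_step_improvement(P, c, v, delta):
        """Return the greedy (improved) policy for a finite MDP.

        P : (S, A, S) transition probabilities, c : (S, A) one-step costs,
        v : (S,) value (or relative value) function of the current policy,
        delta : discount factor in [0, 1), or 1.0 for the average-cost case.
        """
        q = c + delta * (P @ v)        # Q(x, a) = c(x, a) + delta * E[v(X') | x, a]
        return q.argmin(axis=1)

    def evaluate_discounted(P, c, policy, delta):
        """Exact discounted evaluation of a stationary policy: solve (I - delta * P_pi) v = c_pi."""
        S = c.shape[0]
        P_pi = P[np.arange(S), policy]
        c_pi = c[np.arange(S), policy]
        return np.linalg.solve(np.eye(S) - delta * P_pi, c_pi)

    # Toy example with random data (illustrative only).
    S, A, delta = 4, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(S), size=(S, A))    # P[x, a, :] is a probability vector
    c = rng.uniform(0.0, 1.0, size=(S, A))
    pi = np.zeros(S, dtype=int)                   # arbitrary starting policy
    v = evaluate_discounted(P, c, pi, delta)
    pi_plus = one_step_improvement(P, c, v, delta)

With the exact v^π, the policy pi_plus is guaranteed to be at least as good as pi; this exact improvement step is what the rest of the talk approximates.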

Approximate Policy Improvement

If v^π is replaced with an approximation v̂^π, then the "improved" policy π+, where

$$\pi^+(x) \in \arg\min_{a \in A(x)} \Big\{ c(x,a) + \delta\, \mathbb{E}\big[\hat{v}^\pi(X') \,\big|\, X' \sim p(\cdot \mid x, a)\big] \Big\}, \qquad x \in X, \tag{2}$$

is not necessarily better than π.

Questions:
1. When is (2) computationally tractable?
2. When is π+ close to being optimal?

Our focus is on MDPs modeling queueing systems.

Outline

Part 1: Using Analytically Tractable Policies (average costs)
Bhulai, S. (2017). Value Function Approximation in Complex Queueing Systems. In Markov Decision Processes in Practice (pp. 33-62). Springer.

Part 2: Using Simulation and Interpolation (average costs)
James, T., Glazebrook, K., & Lin, K. (2016). Developing effective service policies for multiclass queues with abandonment: asymptotic optimality and approximate policy improvement. INFORMS Journal on Computing, 28(2), 251-264.

Part 3: Using Lagrangian Relaxations (discounted costs)
Brown, D. B., & Haugh, M. B. (2017). Information relaxation bounds for infinite horizon Markov decision processes. Operations Research, 65(5), 1355-1379.

Part 1: Using Analytically Tractable Policies

Analytically Tractable Queueing Systems

Idea:
1. Start with systems whose Poisson equation is analytically solvable.
2. Use them to suggest analytically tractable policies for more complex systems.

Examples (Bhulai 2017): the M/Cox(r)/1 queue, the M/M/s queue, the M/M/s/s blocking system, and the priority queue.

The Poisson Equation

Let π be a policy (e.g., a fixed admission rule or a fixed priority rule). In general, the Poisson equation has the form

$$g + h(x) = c(x, \pi(x)) + \sum_{y \in X} p(y \mid x, \pi(x))\, h(y), \qquad x \in X.$$

We want to solve for the average cost g and the relative value function h(x) of π. The Poisson equation is also called the evaluation equation; see, e.g., Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
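For a finite-state model, the Poisson equation can be solved directly as a linear system by pinning h at a reference state. Here is a minimal numpy sketch; the finite-buffer uniformized M/M/1 example at the end is an illustrative assumption, not part of the talk.

    import numpy as np

    def solve_poisson(P, c, ref=0):
        """Solve g + h(x) = c(x) + sum_y P[x, y] h(y) with the normalization h(ref) = 0.

        P : (n, n) transition matrix of the chain induced by the policy
        c : (n,) one-step cost under the policy
        Returns the average cost g and the relative value function h.
        """
        n = len(c)
        A = np.zeros((n + 1, n + 1))   # unknowns: [g, h(0), ..., h(n-1)]
        b = np.zeros(n + 1)
        A[:n, 0] = 1.0                 # coefficient of g in each Poisson equation
        A[:n, 1:] = np.eye(n) - P      # (I - P) h
        b[:n] = c
        A[n, 1 + ref] = 1.0            # normalization h(ref) = 0
        sol = np.linalg.solve(A, b)
        return sol[0], sol[1:]

    # Uniformized M/M/1 queue with a large finite buffer (illustrative parameters).
    lam, mu, cap = 0.3, 0.5, 200
    n = cap + 1
    P = np.zeros((n, n))
    for x in range(n):
        P[x, min(x + 1, cap)] += lam       # arrival (blocked at the buffer limit)
        P[x, max(x - 1, 0)] += mu          # service completion (self-loop at 0)
        P[x, x] += 1.0 - lam - mu          # uniformization self-loop
    g, h = solve_poisson(P, np.arange(n, dtype=float))
    # g is close to lam/(mu - lam) = 1.5, and h(x) is close to x(x+1)/(2(mu - lam)) for small x.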

The Poisson Equation for Queueing Systems

For queueing systems, the Poisson equation is often a linear difference equation; see, e.g., Mickens, R. (1991). Difference Equations: Theory and Applications. CRC Press.

Example: the Poisson equation for a uniformized M/M/1 queue with λ + µ < 1 and linear holding cost rate c is

$$g + h(x) = cx + \lambda h(x+1) + \mu h(x-1) + (1 - \lambda - \mu)\, h(x), \qquad x \in \{1, 2, \dots\},$$
$$g + h(0) = \lambda h(1) + (1 - \lambda)\, h(0).$$

This is a second-order linear difference equation.

Linear Difference Equations

Theorem (Bhulai 2017, Theorem 2.1). Suppose f : {0, 1, ...} → R satisfies

$$f(x+1) + \alpha(x) f(x) + \beta(x) f(x-1) = q(x), \qquad x \ge 1,$$

where β(x) ≠ 0 for all x ≥ 1. If f_h : {0, 1, ...} → R is a homogeneous solution, i.e.,

$$f_h(x+1) + \alpha(x) f_h(x) + \beta(x) f_h(x-1) = 0, \qquad x \ge 1,$$

then, letting the empty product be equal to one,

$$\frac{f(x)}{f_h(x)} = \frac{f(0)}{f_h(0)} + \sum_{i=1}^{x} \left[\, \prod_{j=1}^{i-1} \frac{\beta(j)\, f_h(j-1)}{f_h(j+1)} \right] \left[ \left( \frac{f(1)}{f_h(1)} - \frac{f(0)}{f_h(0)} \right) + \sum_{j=1}^{i-1} \frac{q(j)}{f_h(j+1)} \prod_{k=1}^{j} \frac{f_h(k+1)}{\beta(k)\, f_h(k-1)} \right].$$

Application: M/M/1 Queue

Rewrite the Poisson equation in the form of the Theorem:

$$h(x+1) + \underbrace{\left(-\frac{\lambda+\mu}{\lambda}\right)}_{\alpha(x)} h(x) + \underbrace{\frac{\mu}{\lambda}}_{\beta(x)}\, h(x-1) = \underbrace{\frac{g - cx}{\lambda}}_{q(x)}, \qquad x \ge 1.$$

Note that f_h ≡ 1 works as the homogeneous solution. We also know that, for an M/M/1 queue with linear holding cost rate c,

$$g = \frac{c\lambda}{\mu - \lambda}.$$

So, according to the Theorem (with h(0) = 0 and h(1) = g/λ from the boundary equation at x = 0),

$$h(x) = \frac{g}{\lambda} \sum_{i=1}^{x} \left(\frac{\mu}{\lambda}\right)^{i-1} + \sum_{i=1}^{x} \left(\frac{\mu}{\lambda}\right)^{i-1} \sum_{j=1}^{i-1} \left(\frac{\lambda}{\mu}\right)^{j} \frac{g - cj}{\lambda} = \frac{c\, x(x+1)}{2(\mu - \lambda)}.$$
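The Theorem and this application can be checked numerically: solve the difference equation by forward recursion from h(0) and h(1), evaluate the closed form, and compare both with cx(x+1)/(2(µ − λ)). A minimal sketch, with function names and parameter values chosen purely for illustration:

    def forward_recursion(alpha, beta, q, f0, f1, X):
        """Solve f(x+1) = q(x) - alpha(x) f(x) - beta(x) f(x-1) forward from f(0), f(1)."""
        f = [f0, f1]
        for x in range(1, X):
            f.append(q(x) - alpha(x) * f[x] - beta(x) * f[x - 1])
        return f

    def closed_form(beta, q, f0, f1, fh, X):
        """Evaluate the Theorem's closed form for f(0), ..., f(X), given a homogeneous solution fh."""
        u = f0 / fh(0)                       # running value of f(i) / fh(i)
        vals = [f0]
        prod_r = 1.0                         # running product of beta(j) fh(j-1) / fh(j+1)
        bracket = f1 / fh(1) - f0 / fh(0)    # bracketed term, grows with i
        for i in range(1, X + 1):
            u += prod_r * bracket
            vals.append(fh(i) * u)
            prod_r *= beta(i) * fh(i - 1) / fh(i + 1)
            bracket += q(i) / fh(i + 1) / prod_r
        return vals

    # Check on the uniformized M/M/1 example (illustrative parameters).
    lam, mu, c = 0.3, 0.5, 1.0
    g = c * lam / (mu - lam)
    alpha = lambda x: -(lam + mu) / lam
    beta = lambda x: mu / lam
    q = lambda x: (g - c * x) / lam
    h_rec = forward_recursion(alpha, beta, q, 0.0, g / lam, 10)
    h_thm = closed_form(beta, q, 0.0, g / lam, lambda x: 1.0, 10)    # f_h = 1
    h_exact = [c * x * (x + 1) / (2 * (mu - lam)) for x in range(11)]
    # All three lists agree up to floating-point error.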

Other Analytically Tractable Systems

Relative value functions for the following systems are presented in (Bhulai 2017):
1. The M/Cox(r)/1 queue; special cases include hyperexponential, hypoexponential, Erlang, and exponential service times.
2. The M/M/s queue, with an infinite buffer and with no buffer (blocking system).
3. The 2-class M/M/1 priority queue.

Application to Analytically Intractable Systems

Idea:
1. Pick an initial policy whose relative value function can be written in terms of the relative value functions of simpler systems.
2. Do one-step policy improvement using that policy.

In (Bhulai 2017), this is applied to the following problems:
1. Routing Poisson arrivals to two different M/Cox(r)/1 queues.
   Initial policy: Bernoulli routing; uses the relative value function of the M/Cox(r)/1 queue (a simplified sketch of this case follows below).
2. Routing in a multi-skill call center.
   Initial policy: a static randomized policy that tries to route calls to the agents with the fewest skills first; uses the relative value function of the M/M/s queue.
3. A controlled polling system with switching costs.
   Initial policy: the cµ-rule; uses the relative value function of the priority queue.
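To make the routing case concrete, here is a simplified sketch of one-step improvement on Bernoulli routing, using the M/M/1 relative value function in place of the M/Cox(r)/1 one treated in (Bhulai 2017); all function names and parameters are illustrative assumptions. Under Bernoulli routing the two queues are independent M/M/1 queues, so the joint relative value function decomposes as h(x_1, x_2) = h_1(x_1) + h_2(x_2), and the improved policy routes an arrival to the queue with the smaller marginal increase h_i(x_i + 1) − h_i(x_i).

    def mm1_relative_value(x, lam, mu, c):
        """Relative value function of an M/M/1 queue with holding cost rate c: c x (x+1) / (2 (mu - lam))."""
        return c * x * (x + 1) / (2.0 * (mu - lam))

    def improved_routing_action(x1, x2, p, lam, mu1, mu2, c1, c2):
        """One-step improvement on Bernoulli routing (route to queue 1 with probability p).

        Under the initial policy, queue i is an M/M/1 queue with arrival rate p*lam
        (resp. (1 - p)*lam), so its relative value function is known in closed form.
        """
        d1 = (mm1_relative_value(x1 + 1, p * lam, mu1, c1)
              - mm1_relative_value(x1, p * lam, mu1, c1))
        d2 = (mm1_relative_value(x2 + 1, (1 - p) * lam, mu2, c2)
              - mm1_relative_value(x2, (1 - p) * lam, mu2, c2))
        return 1 if d1 <= d2 else 2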

Controlled Polling System with Switching Costs

Two queues with independent Poisson arrivals at rates λ_1, λ_2, exponential service times with rates µ_1, µ_2, and holding cost rates c_1, c_2, respectively. If queue i is currently being served, switching to queue j ≠ i costs s_i, for i = 1, 2.

Problem: dynamically assign the server to one of the two queues so that the average cost incurred is minimized.

Do one-step policy improvement on the cµ-rule. Results for λ_1 = λ_2 = 1, µ_1 = 6, µ_2 = 3, c_1 = 2, c_2 = 1, s_1 = s_2 = 2:

Policy                   Average Cost
cµ-rule                  3.62894
One-Step Improvement     3.09895
Optimal Policy           3.09261

Part 2: Using Simulation and Interpolation

Scheduling a Multiclass Queue with Abandonments

k queues with independent Poisson arrivals at rates λ_1, ..., λ_k, exponential service times with rates µ_1, ..., µ_k, and holding cost rates c_1, ..., c_k, respectively. Each customer in queue i = 1, ..., k remains available for service for an exponentially distributed amount of time with rate θ_i. Each service completion from queue i = 1, ..., k earns R_i; each abandonment from queue i costs D_i.

Problem: dynamically assign the server to one of the k queues so that the average cost incurred is minimized.

Relative Value Function

Let π be a policy, and select any reference state x_r ∈ X = {(i_1, ..., i_k) ∈ {0, 1, ...}^k}. Define

g^π = average cost incurred under π
r^π(x) = expected total cost to reach x_r under π, starting from state x
t^π(x) = expected time to reach x_r under π, starting from state x

Then the relative value function is

$$h^\pi(x) = r^\pi(x) - g^\pi\, t^\pi(x), \qquad x \in X.$$
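This representation suggests a regenerative simulation estimator: estimate g^π from a long run, and estimate r^π(x) and t^π(x) by averaging the cost and time accumulated on sample paths from x until the reference state is reached. Below is a minimal sketch for a single uniformized M/M/1 queue (the multiclass model of the talk would replace the single-queue dynamics and policy); all parameters are illustrative.

    import random

    def mm1_step(x, lam, mu):
        """One uniformized transition of an M/M/1 queue (assumes lam + mu <= 1)."""
        u = random.random()
        if u < lam:
            return x + 1
        if u < lam + mu and x > 0:
            return x - 1
        return x

    def estimate_relative_value(x0, lam, mu, c=1.0, x_ref=0, n_paths=20000, horizon=200000):
        """Estimate h(x0) = r(x0) - g * t(x0) by simulation."""
        # Estimate the average cost per uniformized step from one long run.
        x, total_cost = x_ref, 0.0
        for _ in range(horizon):
            total_cost += c * x
            x = mm1_step(x, lam, mu)
        g = total_cost / horizon

        # Estimate r(x0) and t(x0): expected cost and time to reach x_ref from x0.
        r_sum, t_sum = 0.0, 0.0
        for _ in range(n_paths):
            x, cost, steps = x0, 0.0, 0
            while x != x_ref:
                cost += c * x
                x = mm1_step(x, lam, mu)
                steps += 1
            r_sum += cost
            t_sum += steps
        return r_sum / n_paths - g * (t_sum / n_paths)

    # For lam = 0.3, mu = 0.5, the estimate at x0 = 3 is close to 3 * 4 / (2 * 0.2) = 30.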

Approximate Policy Improvement

Exact DP is infeasible for k > 3 classes. (James, Glazebrook, Lin 2016) propose an approximate policy improvement algorithm.

Idea: given a policy π, approximate its relative value function h^π as follows:
1. Simulate π to estimate its average cost g^π and the long-run frequency with which each state is visited.
2. Based on Step 1, select a set of initial states from which the relative value under π is estimated via simulation.
3. Estimate the relative value function by interpolating between the values estimated in Step 2.
4. Do policy improvement using the estimated relative value function.

Selecting States (Step 2)

S = set of initial states selected in Step 2, from which the relative value is estimated via simulation. Here S = S_anchor ∪ S_support, where
1. S_anchor = set of most frequently visited states (based on Step 1)
2. S_support = set of regularly spaced states

Parameters: how many states to include in S_anchor and S_support.
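A minimal sketch of this selection step for k = 2 classes; the exact parameterization used by (James, Glazebrook, Lin 2016) differs, so treat the spacing and counts below as illustrative assumptions.

    def select_states(visit_counts, n_anchor, spacing, max_level):
        """Return anchor states (most frequently visited) plus a regular grid of support states.

        visit_counts : dict mapping states (tuples) to visit counts from the Step-1 simulation.
        """
        anchor = sorted(visit_counts, key=visit_counts.get, reverse=True)[:n_anchor]
        grid = range(0, max_level + 1, spacing)
        support = [(i, j) for i in grid for j in grid]   # k = 2 classes for illustration
        return list(dict.fromkeys(anchor + support))     # de-duplicate, preserving order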

Interpolation (Step 3)

Use an (augmented) radial basis function approximation

$$h^\pi(x) \approx \sum_{i=1}^{n} \alpha_i\, \phi(\lVert x - x_i \rVert) + \sum_{j=1}^{d} \beta_j\, p_j(x),$$

where
n = number of selected states in Step 2
x_i = ith selected state in Step 2
φ(r) = r^2 log(r) (thin plate spline)
‖·‖ = Euclidean norm
d = k + 1
p_1(x) = 1, and p_j(x) = number of customers in queue j − 1, for j = 2, ..., d.

Computing the Interpolation Parameters

x_i = ith selected state in Step 2
f_i = estimated relative value starting from x_i
A_ij = φ(‖x_i − x_j‖) for i, j = 1, ..., n
P_ij = p_j(x_i) for i = 1, ..., n and j = 1, ..., k + 1

Solve the following linear system of equations:

$$A\alpha + P\beta = f, \qquad P^{\mathsf{T}}\alpha = 0.$$
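A minimal numpy sketch of fitting and evaluating this interpolant, using the thin plate spline and polynomial tail defined on the previous slide; the function names are illustrative, and the fitted interpolant would then be plugged into one-step policy improvement as the approximate relative value function.

    import numpy as np

    def tps(r):
        """Thin plate spline phi(r) = r^2 log(r), with phi(0) = 0."""
        return np.where(r > 0, r**2 * np.log(np.maximum(r, 1e-300)), 0.0)

    def fit_rbf(states, f):
        """Fit (alpha, beta) of the augmented RBF interpolant from (state, value) pairs.

        states : (n, k) array of selected states; f : (n,) estimated relative values.
        Solves A alpha + P beta = f together with P^T alpha = 0.
        """
        states = np.asarray(states, dtype=float)
        f = np.asarray(f, dtype=float)
        n, k = states.shape
        A = tps(np.linalg.norm(states[:, None, :] - states[None, :, :], axis=2))
        P = np.hstack([np.ones((n, 1)), states])      # p_1 = 1, p_{j+1}(x) = x_j
        M = np.block([[A, P], [P.T, np.zeros((k + 1, k + 1))]])
        rhs = np.concatenate([f, np.zeros(k + 1)])
        sol = np.linalg.solve(M, rhs)
        return sol[:n], sol[n:]                       # alpha, beta

    def evaluate_rbf(x, states, alpha, beta):
        """Evaluate the fitted interpolant at a (possibly unsampled) state x."""
        x = np.asarray(x, dtype=float)
        r = np.linalg.norm(np.asarray(states, dtype=float) - x, axis=1)
        return tps(r) @ alpha + beta @ np.concatenate([[1.0], x])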

Example: Interpolation

2 classes; the initial policy is the Rµθ-rule; m = number of replications for each selected state. [Figure 1 in (James, Glazebrook, Lin 2016).]

Example: Approximate Policy Improvement

Same problem; API = policy from one-step approximate policy improvement. [Figure 2 in (James, Glazebrook, Lin 2016).]

Suboptimality of Heuristics

API(π) = one-step approximate policy improvement applied to π; k = 3 classes, ρ = Σ_{i=1}^{k} λ_i/µ_i. [Figure 3 in (James, Glazebrook, Lin 2016); see the paper for details on this and other numerical studies.]

Part 3: Using Lagrangian Relaxations

Multiclass Queue with Convex Holding Costs

k queues with independent Poisson arrivals at rates λ_1, ..., λ_k and exponential service times with rates µ_1, ..., µ_k, respectively. If there are x_i customers in queue i, the holding cost rate is c_i(x_i), where c_i : {0, 1, ...} → R is nonnegative, increasing, and convex. Each queue i has a buffer of size B_i.

Problem: dynamically assign the server to one of the k queues so that the discounted cost incurred is minimized.

Relaxed Problem

The following relaxation is considered in (Brown, Haugh 2017):
The queues are grouped into G groups.
The server can serve at most one queue per group, but may serve multiple groups simultaneously.
A penalty l is charged for serving multiple groups simultaneously.

Under this relaxation, the value function decouples across groups (δ = discount factor):

$$v^{l}(x) = \frac{(G-1)\, l}{1-\delta} + \sum_{g} v_g^{l}(x_g).$$

v^l can be used both in one-step policy improvement and to construct a lower bound on the optimal cost via an information relaxation.

Suboptimality of Heuristics

Myopic: use one-step improvement with the value function v^m(x) = Σ_i c_i(x_i). [From (Brown, Haugh 2017); see the paper for details.]

Research Questions

1. Applications to other systems?
2. Performance guarantees for one-step improvement?
3. Other functions to use in one-step improvement?
4. Conditions under which one-step improvement is practical?