An Optimal Index Policy for the Multi-Armed Bandit Problem with Re-Initializing Bandits

Transcription:

An Optimal Index Policy for the Multi-Armed Bandit Problem with Re-Initializing Bandits
Peter Jacko
YEQT III, November 20, 2009
Basque Center for Applied Mathematics (BCAM), Bilbao, Spain

Slide 1: Example: Congestion Control in a Router

Slide 2: Multi-Armed Bandit Problem

Slide 3: Multi-Armed Bandit Problem
- A classic problem of efficient learning
- Originally arose in the sequential design of experiments: Thompson (1933): which of two drugs is superior? Robbins (1952), Bradt et al. (1956), Bellman (1956)
- Job sequencing problem: Cox & Smith (1961): the cµ-rule
- Celebrated general solution: Gittins and colleagues (1970s): the Gittins index rule; crucial condition: non-played bandits are frozen

Slide 4: Outline
- Re-initializing bandits
- Transmission Control Protocol (TCP)
- MDP formulation
- Relaxations and decomposition into subproblems
- Optimal solution to the subproblems, obtaining an index
- Optimal solution to the relaxations
- Optimal index policy for the original problem

Slide 5: Transmission Control Protocol (TCP)
- Implemented at the two ends of a connection
- A mechanism for end-to-end congestion control
- Provides reliable, ordered delivery of a stream of packets
- Fully sensitive to packet losses
- Examples: web browsers, e-mail, file transfer (FTP)
- Must be distinguished from congestion control in routers
- An alternative (UDP) is used for VoIP, streaming, etc.

Slide 6: TCP End-to-End Connection
- Sender sends an initial packet and waits
- Receiver sends an acknowledgment for each received packet
- Sender sends more packets after receiving acknowledgments, or restarts after a time-out

Slide 7: TCP Dynamics as a Markov Chain
[Diagram: a chain of states $0, 1, 2, \dots, N-2, N-1$; each OK (acknowledgment) transition moves one level up, each NO (time-out) transition returns to state 0.]
- States: $n \in \{0, 1, \dots, N-1\}$ = sending-rate level
- $n = 0$: sending rate of 1 packet/RTT
- $n = N-1$: maximum rate, $W_{\max}$
- Transitions: OK (acknowledgment), NO (time-out)
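As a concrete illustration of these dynamics, here is a minimal Python sketch of one transition of the rate-level chain. It assumes, as the re-initializing structure of the talk suggests, that a time-out resets the rate to level 0 and that losses occur independently with probability `p_loss`; the function and parameter names are hypothetical, not from the slides.

```python
import random

def tcp_rate_step(n, p_loss, N):
    """One transition of the sending-rate-level chain of slide 7.

    State n in {0, ..., N-1} is the rate level. An acknowledgment (OK)
    moves one level up, capped at N-1; a time-out (NO) re-initializes
    the rate to level 0 (1 packet/RTT).
    """
    if random.random() < p_loss:
        return 0                  # NO: time-out, restart at the lowest rate
    return min(n + 1, N - 1)      # OK: acknowledged, rate level grows
```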

Slide 8: Congestion Control of TCP Flows in a Router
- Time epochs $t = 0, 1, 2, \dots$
- Two possible control actions $a(t)$: transmit the flow's packets, or block the flow by dropping its packets
- If $W_{X(t)}$ packets are transmitted in state $X(t)$, then goodput (reward) $R_{X(t)}$ is earned, and the sender sets $X(t+1)$ according to the TCP dynamics
- Objective: maximize the long-run goodput (reward) while choosing exactly one flow at every time epoch
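The per-flow decision problem can be sketched in the same style. The block below is a hedged illustration rather than the talk's exact model: it assumes that blocking (dropping the flow's packets) forces a time-out, so the sender re-initializes to level 0 and earns no reward, while transmitting earns the goodput $R_{X(t)}$ and then follows the chain above; `goodput` is a user-supplied mapping from rate level to reward.

```python
import random

def flow_mdp_step(x, transmit, goodput, p_loss, N):
    """One epoch of the per-flow MDP sketched on slide 8.

    `goodput(x)` plays the role of the reward R_{X(t)} earned when the
    flow is transmitted in state x. Blocking is assumed to trigger a
    time-out, re-initializing the sender to level 0 with zero reward.
    Returns (next_state, reward).
    """
    if not transmit:
        return 0, 0.0                 # blocked: packets dropped, time-out
    reward = goodput(x)               # goodput R_{X(t)} earned this epoch
    if random.random() < p_loss:
        return 0, reward              # NO: loss despite transmission
    return min(x + 1, N - 1), reward  # OK: rate level grows
```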

Slide 9: Congestion Control in Router
Maximizing the time-average expected goodput:

$$\max_{\pi \in \Pi} \; \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}^{\pi}_{n} \left[ \sum_{t=0}^{T-1} \sum_{m \in \mathcal{M}} R^{a_m(t)}_{m, X_m(t)} \right]$$

subject to the sample-path condition

$$\sum_{m \in \mathcal{M}} a_m(t) = 1 \quad \text{for all } t,$$

conditional on the state history under $\pi$.

Slide 10: Relaxations
1. Whittle's relaxation: choose one flow on average:
$$\lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}^{\pi}_{n} \left[ \sum_{t=0}^{T-1} \sum_{m \in \mathcal{M}} a_m(t) \right] = 1$$
2. Multiply by $W_{\max}$ and use $W^{a_m(t)}_{m, X_m(t)} \le W_{\max} \, a_m(t)$:
$$\lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}^{\pi}_{n} \left[ \sum_{t=0}^{T-1} \sum_{m \in \mathcal{M}} W^{a_m(t)}_{m, X_m(t)} \right] \le W_{\max}$$
3. Dualize this constraint using a Lagrangian multiplier $\nu$:
$$\max_{\pi \in \Pi} \; \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}^{\pi}_{n} \left[ \sum_{t=0}^{T-1} \sum_{m \in \mathcal{M}} \left( R^{a_m(t)}_{m, X_m(t)} - \nu \, W^{a_m(t)}_{m, X_m(t)} \right) \right] + \nu \, W_{\max}$$

Slide 11: Decomposition
Decompose the Lagrangian relaxation, thanks to flow independence, into single-flow parametric subproblems:

$$\max_{\pi_m \in \Pi_m} \; \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}^{\pi_m}_{n_m} \left[ \sum_{t=0}^{T-1} \left( R^{a_m(t)}_{m, X_m(t)} - \nu \, W^{a_m(t)}_{m, X_m(t)} \right) \right]$$

This is known as a restless bandit. Under certain natural conditions, there exist break-even values $\nu_{m,n}$ of $\nu$, called transmission indices (prices), such that:
- it is optimal to transmit if $\nu_{m,n} \ge \nu$
- it is optimal to block if $\nu_{m,n} \le \nu$
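Numerically, the break-even value $\nu_{m,n}$ can be approximated by bisection: solve the single-flow average-reward subproblem for each candidate $\nu$ and check whether transmitting or blocking is preferred in state $n$. The sketch below assumes indexability (the preference is monotone in $\nu$, as the slide's "certain natural conditions" are meant to guarantee) and a unichain model so relative value iteration converges; `P1`/`P0` are the active/passive transition matrices and `R`, `W` the per-state goodputs and packet counts as NumPy arrays, all supplied by the user.

```python
import numpy as np

def q_gap(P1, P0, r1, r0, state, iters=5000):
    """Relative value iteration for the average-reward subproblem;
    returns Q(state, transmit) - Q(state, block) at (near-)optimality."""
    h = np.zeros(len(r1))            # relative value function
    for _ in range(iters):
        h_new = np.maximum(r1 + P1 @ h, r0 + P0 @ h)
        h = h_new - h_new[0]         # normalize to keep the iterates bounded
    return (r1 + P1 @ h)[state] - (r0 + P0 @ h)[state]

def transmission_index(R, W, P1, P0, state, lo=0.0, hi=100.0, tol=1e-6):
    """Bisection for the break-even price nu_{m,n} of slide 11: the nu at
    which transmit and block are equally good in `state` for the
    subproblem  max (time-average) E[R - nu * W]."""
    r0 = np.zeros_like(R)            # blocking earns nothing, sends nothing
    while hi - lo > tol:
        nu = 0.5 * (lo + hi)
        if q_gap(P1, P0, R - nu * W, r0, state) > 0:
            lo = nu                  # transmitting still preferred: raise the price
        else:
            hi = nu
    return 0.5 * (lo + hi)
```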

Slide 12: Optimal Solutions to Relaxations
- For relaxation 3 (the multi-flow Lagrangian relaxation): for each flow, it is optimal to transmit if $\nu_{m,n} \ge \nu$ and to block if $\nu_{m,n} \le \nu$.
- Suppose the re-initializing state 0 has the highest value, i.e., $\nu_{m,0} \ge \nu_{m,n}$ for all states $n$ of any flow $m$.
- Denote by $\nu^*$ the second-highest $\nu_{m,0}$ over $m$.
- For relaxation 1, an optimal policy is: at every $t$, transmit each flow satisfying $\nu_{m,X_m(t)} > \nu^*$; if no such flow exists, transmit one flow satisfying $\nu_{m,X_m(t)} = \nu^*$.

Slide 13: Optimal Solution to the Original Problem
- An optimal policy is: at every $t$, transmit each flow satisfying $\nu_{m,X_m(t)} > \nu^*$; if no such flow exists, transmit one flow satisfying $\nu_{m,X_m(t)} = \nu^*$.
- This policy chooses exactly one flow at every time epoch.
- It is optimal for the original problem because it is feasible there and optimal for a relaxation.
- (See animation.)
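Put together, one epoch of the index policy looks like the following sketch, with precomputed indices `nu[m][x]` $= \nu_{m,x}$ and threshold `nu_star` $= \nu^*$ (the second-highest $\nu_{m,0}$); a small numerical tolerance stands in for the exact equality on the slide.

```python
def index_policy_step(states, nu, nu_star, tol=1e-9):
    """One epoch of the policy of slides 12-13: transmit every flow with
    index strictly above nu*, and if there is none, transmit a single
    flow whose index equals nu* (up to a numerical tolerance)."""
    above = [m for m, x in enumerate(states) if nu[m][x] > nu_star + tol]
    if above:
        return above
    at = [m for m, x in enumerate(states) if abs(nu[m][x] - nu_star) <= tol]
    return at[:1]                    # tie at nu*: pick one flow arbitrarily
```

Under the structure assumed on slide 12 (state 0 carries the highest index for every flow), at most one flow can sit strictly above $\nu^*$ at any epoch, which is why this rule transmits exactly one flow per epoch.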

Slide 14: Multi-Armed Bandit Problem
- We can apply the same reasoning to the classic problem.
- Set the threshold to the second-highest Gittins index.
- Play the bandits whose Gittins index is higher than the threshold, breaking ties by choosing one arbitrarily.
- Once no bandits are above the threshold, restart the procedure.
- Optimal policy = a sequence of optimal solutions to Lagrangian relaxations with decreasing values of the Lagrangian multiplier.
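The same threshold logic gives the classic Gittins procedure a compact form. The sketch below assumes a precomputed index function `gittins[m](x)` and a sampler `step(m)` that plays bandit m once and returns its next state (both hypothetical helpers, not from the talk); since non-played bandits are frozen, only the leading bandit's index can move between recomputations of the threshold.

```python
def gittins_threshold_play(states, gittins, step, rounds=10, max_plays=1000):
    """Sketch of the restart procedure of slide 14 (assumes >= 2 bandits)."""
    M = len(states)
    for _ in range(rounds):
        idx = [gittins[m](states[m]) for m in range(M)]
        threshold = sorted(idx, reverse=True)[1]      # second-highest index
        leader = max(range(M), key=lambda m: idx[m])  # ties broken arbitrarily
        plays = 0
        # Play the leader while its index stays at or above the threshold;
        # the other bandits are frozen, so their indices cannot change.
        while gittins[leader](states[leader]) >= threshold and plays < max_plays:
            states[leader] = step(leader)
            plays += 1
    return states
```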

Slide 15: Routing: A More Realistic Setting
- Bandwidth $W$, i.e., deterministic server capacity
- Target time-average router throughput $\widehat{W} < W$, i.e., virtual capacity
- Buffer size $B \ge W$
- Backlog process $B(t)$ at epochs $t$: the number of packets buffered for more than one period
- Capacity to be allocated to randomly appearing and disappearing flows

Slide 16: Summary
- Apart from the multi-armed bandit problem and its special cases, proofs of optimality of an index policy are rare.
- The approach leads to a new proof for the classic problem.
- For more complex (restless) bandit problems, it:
  - gives some intuition for when an index policy is optimal,
  - provides a well-grounded method for designing (suboptimal) greedy rules,
  - is useful for problems with an on-average constraint (if capacity can be marketed between periods).

Slide 17: Thank you for your attention!

Slide 18: Congestion Control in Router
[Diagram: $M$ flows arriving at the router. Flow $m$ is in state $X_m(t)$ with sent packets $W^{\text{sent}}_{m,X_m(t)}$; under action $a_m(t)$ it transmits $W^{a_m(t)}_{m,X_m(t)}$ packets and earns reward $R^{a_m(t)}_{m,X_m(t)}$; the router maintains the backlog process $B(t)$.]