Online regret in reinforcement learning

University of Leoben, Austria Tübingen, 31 July 2007

Undiscounted online regret

I am interested in the difference (in rewards during learning) between an optimal policy and a reinforcement learner:

$$R_T = \sum_{t=0}^{T} r(s^*_t, a^*_t) - \sum_{t=0}^{T} r(s_t, a_t),$$

where $s^*_0, s^*_1, \ldots$ is the sequence of states visited by an optimal policy choosing actions $a^*_t$, and $s_0, s_1, \ldots$ is the sequence of states visited by the learner choosing actions $a_t$.
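
As a concrete illustration, here is a minimal sketch that computes this quantity from two reward traces of equal length (the helper name and the example rewards are hypothetical):

```python
# A minimal sketch: empirical undiscounted regret R_T from two reward traces.
# Assumes both an optimal policy and the learner can be rolled out for T+1 steps
# in the same MDP; `optimal_rewards` and `learner_rewards` are hypothetical
# sequences r(s*_t, a*_t) and r(s_t, a_t) for t = 0, ..., T.

def undiscounted_regret(optimal_rewards, learner_rewards):
    """R_T = sum_t r(s*_t, a*_t) - sum_t r(s_t, a_t)."""
    assert len(optimal_rewards) == len(learner_rewards)
    return sum(optimal_rewards) - sum(learner_rewards)

# Example: a learner that missed 0.25 reward on 2 of 4 steps.
print(undiscounted_regret([0.5] * 4, [0.5, 0.25, 0.5, 0.25]))  # 0.5
```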

Regret for discounted RL

Naive:
$$\sum_{t=0}^{\infty} \gamma^t r(s^*_t, a^*_t) - \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) = O(1).$$

Counting non-optimal actions (Kakade, 2003):
$$\#\{t : a_t \neq a^*_t\}$$

Using the value function of the optimal policy (Strehl, Littman, 2005):
$$\sum_{t=0}^{T} V^*(s_t) - \sum_{t=0}^{T}\sum_{\tau=t}^{T} \gamma^{\tau-t} r_\tau \;=\; \sum_{t=0}^{T} V^*(s_t) - \sum_{\tau=0}^{T}\frac{1-\gamma^{\tau+1}}{1-\gamma}\, r_\tau$$

Undiscounted regret bounds: $\log T$ vs. $\sqrt{T}$ bounds

Consider the bandit problem first: $S = 1$.

Logarithmic regret bounds in terms of the average rewards $r_a$, $a \in A$:
$$E[R_T] = O\Big(\log T \sum_{a \neq a^*}\frac{1}{r_{a^*} - r_a}\Big)$$
Logarithmic regret depends on the gap between the best action and the other actions.

A bound independent of the size of the gap:
$$E[R_T] = O\big(\sqrt{A\, T \log A}\big)$$
This bound holds even for varying $r_a$ when the regret is calculated with respect to the single best action.

Both bounds are essentially tight.
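
For intuition, the logarithmic bound is achieved, for example, by UCB1; this is a standard algorithm, not taken from these slides, and the arm means in the example are hypothetical. A minimal sketch:

```python
import math
import random

# A minimal UCB1 sketch for the bandit case S = 1. Pull the arm maximizing
# the empirical mean plus a confidence bonus; the expected regret grows as
# O(log T / gap) per suboptimal arm.

def ucb1(pull, num_arms, horizon):
    counts = [0] * num_arms
    means = [0.0] * num_arms
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= num_arms:                      # pull each arm once
            a = t - 1
        else:
            a = max(range(num_arms),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running mean update
        total_reward += r
    return total_reward

# Two Bernoulli arms with means 0.5 and 0.4 (hypothetical example).
random.seed(0)
arms = [0.5, 0.4]
reward = ucb1(lambda a: float(random.random() < arms[a]), 2, 5000)
print(5000 * max(arms) - reward)   # empirical regret against the best arm
```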

Theoretical bounds for RL

PAC-like bounds by Fiechter (1994): assumes a reset action and learns an $\epsilon$-optimal policy of fixed length $T$ in poly$(1/\epsilon, T)$ time.

$E^3$ by Kearns and Singh (1998): learns an $\epsilon$-optimal policy in poly$(1/\epsilon, S, A, T^\epsilon_{mix})$ steps.

Analysis of Rmax by Kakade (2003): bounds the number of actions which are not $\epsilon$-optimal:
$$\#\{t : a_t \neq a^\epsilon_t\} = \tilde O\big(S^2 A\, (T^\epsilon_{mix}/\epsilon)^3\big)$$
$T^\epsilon_{mix}$ is the number of steps such that for any policy $\pi$ its actual average reward is $\epsilon$-close to the expected average reward.

log T regret for irreducible MDPs

Burnetas, Katehakis, 1997:
$$E[R_T] = O\Big(\frac{S A\, (T_{hit})^2 \log T}{\Phi}\Big)$$
$$T_{hit} = \max_{s,s'} E\big[T^\pi_{s,s'}\big], \qquad T^\pi_{s,s'} = \min\{t \geq 0 : s_t = s' \mid s_0 = s, \pi\}$$
$\Phi$ measures the distance (in expected future rewards) between the best and the second best action at a state.

But this holds only for $T \geq A^S$.

Related result by Strehl and Littman: the MBIE algorithm, ICML 2006.

log T regret for irreducible MDPs

Auer, Ortner, 2006: for any $T$,
$$E[R_T] = O\Big(\frac{S^5 A\, (T^{max}_{hit})^3 \log T}{\Delta^2}\Big)$$
$$T^{max}_{hit} = \max_\pi \max_{s,s'} E\big[T^\pi_{s,s'}\big]$$
$$\Delta = \rho^* - \max\{\rho(\pi) : \rho(\pi) < \rho^*\}$$
$$\rho(\pi) = \lim_{T\to\infty}\frac{1}{T}\, E\Big[\sum_{t=0}^{T-1} r_t \,\Big|\, \pi\Big]$$

$T^{2/3}$ regret for irreducible MDPs

Analogously, the regret with respect to an $\epsilon$-optimal policy can be bounded by
$$E[R^\epsilon_T] = O\Big(\frac{\log T}{\epsilon^2}\Big).$$
For $\epsilon = \sqrt[3]{(\log T)/T}$ this gives
$$E[R_T] = O\Big(\frac{\log T}{\epsilon^2} + \epsilon T\Big) = O\big(T^{2/3}(\log T)^{1/3}\big).$$
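
The choice of $\epsilon$ comes from balancing the two terms; a short worked step (constants suppressed):

$$\frac{\log T}{\epsilon^2} = \epsilon T \;\Longrightarrow\; \epsilon^3 = \frac{\log T}{T} \;\Longrightarrow\; \epsilon = \Big(\frac{\log T}{T}\Big)^{1/3}, \qquad \frac{\log T}{\epsilon^2} + \epsilon T = 2\,T^{2/3}(\log T)^{1/3}.$$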

The algorithm: Optimistic reinforcement learning

Let $\mathcal{M}_t$ be the set of plausible MDPs with respect to the experience $E_t$. Choose an optimal policy for the most optimistic MDP in $\mathcal{M}_t$:
$$\tilde\pi_t := \arg\max_\pi \max_{M \in \mathcal{M}_t} \rho(\pi \mid M)$$

For simplicity we assume that the rewards are known. Thus $M \in \mathcal{M}_t$ if for all $s, a, s'$:
$$\big|p(s' \mid s, a) - \hat p_t(s' \mid s, a)\big| \leq \sqrt{\frac{3\log t}{N_t(s,a)}}.$$
Hence the true MDP $M^* \in \mathcal{M}_t$ with probability at least $1 - t^{-3}$.
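
A minimal sketch of the plausibility test above (illustrative names; the optimistic planning step $\arg\max_\pi \max_{M \in \mathcal{M}_t} \rho(\pi \mid M)$ is not shown):

```python
import numpy as np

# A minimal sketch of the plausibility test for M_t (illustrative names).
# counts[s, a, s2] holds the number of observed transitions s --a--> s2 up to
# time t; an MDP with transition kernel p[s, a, s2] is "plausible" if every
# entry is within sqrt(3 log t / N_t(s, a)) of the empirical estimate.

def is_plausible(p, counts, t):
    n_sa = counts.sum(axis=2)                          # N_t(s, a)
    p_hat = counts / np.maximum(n_sa[:, :, None], 1)   # empirical kernel
    radius = np.sqrt(3 * np.log(max(t, 2)) / np.maximum(n_sa, 1))
    return bool(np.all(np.abs(p - p_hat) <= radius[:, :, None]))

# Tiny example: 2 states, 1 action, a few observed transitions.
counts = np.array([[[3, 1]], [[1, 3]]], dtype=float)
p_true = np.array([[[0.7, 0.3]], [[0.2, 0.8]]])
print(is_plausible(p_true, counts, t=10))   # True for this sample size
```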

We cannot change the policy too often

Consider two states that both give the optimal reward 0.5 under the action "red". The policy $\tilde\pi_t$ would switch between the states often, to balance the number of visits (as the confidence bound for the reward is larger for the less frequently visited state). Moving from one state to the other is costly, so always following $\tilde\pi_t$ gives large (linear) regret.

We cannot change the policy too often

We change the policy only if the number of uses of some state/action pair, $N_t(s,a)$, has doubled. Let $t_k$ be the times at which a new policy is calculated. Thus the number of policy changes within $T$ steps is bounded by
$$S A \log_2 \frac{T}{S A}.$$
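
A minimal sketch of one variant of this criterion (illustrative names and counts):

```python
# A minimal sketch of the "recompute only on doubling" rule (names illustrative):
# the optimistic policy is recomputed only when the visit count of some
# state/action pair has at least doubled since the last recomputation, which
# caps the number of recomputations at roughly S*A*log2(T/(S*A)).

def should_recompute(counts, counts_at_last_update):
    return any(counts[sa] >= 2 * max(counts_at_last_update[sa], 1)
               for sa in counts)

# Example usage inside a learning loop (schematic):
counts = {("s0", "a0"): 7, ("s1", "a0"): 2}
snapshot = {("s0", "a0"): 3, ("s1", "a0"): 2}
print(should_recompute(counts, snapshot))   # True: ("s0", "a0") went from 3 to 7
```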

Accurate estimates imply optimality of $\tilde\pi$

Let $\tilde\pi_k$ be the optimal policy chosen at time $t = t_k$ for the corresponding MDP $\tilde M \in \mathcal{M}_t$. If for all $s, s'$
$$\big|\tilde p(s' \mid s) - p^*(s' \mid s)\big| < \frac{\Delta}{2\,T^{max}_{hit}\, S^2},$$
then
$$\rho(\tilde\pi \mid M^*) > \rho(\tilde\pi \mid \tilde M) - \Delta \geq \rho(\pi^* \mid M^*) - \Delta.$$
Thus $\tilde\pi$ is also optimal in $M^*$ (since the distance between the best and the second best policy is $\Delta$).

Analysis

By executing a suboptimal policy for $\tau$ steps, we may lose reward at most $\tau$. Making $\tau$ steps with a fixed policy, each state is visited (on average) $\tau / T^{max}_{hit}$ times. For $\tilde\pi$ being non-optimal, there is a triple $(s, a, s')$ with $\tilde\pi(s) = a$ and
$$\big|\tilde p(s' \mid s, a) - p^*(s' \mid s, a)\big| \geq \frac{\Delta}{2\,T^{max}_{hit}\, S^2},$$
and thus
$$N_t(s,a) \leq \frac{12\,(T^{max}_{hit})^2\, S^4 \log t}{\Delta^2}.$$
Hence the number of non-optimal steps is at most
$$\frac{24\,(T^{max}_{hit})^3\, S^5 A \log T}{\Delta^2} + T^{max}_{hit}\, S A \log_2\frac{T}{S A}.$$

Alternative analysis

We want a bound in terms of
$$T^{min}_{hit} = \max_{s,s'} \min_\pi E\big[T^\pi_{s,s'}\big].$$
We use the bias equation: let $P$ be the transition matrix of a policy $\pi$ and $r$ the vector of rewards. Then
$$\lambda = r - \rho e + P\lambda$$
for some bias vector $\lambda$ if and only if $\rho$ is the average reward of $\pi$.

Intuition about $\lambda_s - \lambda_{s'}$: it is the difference in total cumulative reward between starting in state $s$ and starting in state $s'$.
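
For a fixed unichain policy, $\rho$ and $\lambda$ can be obtained from this equation by solving a small linear system; a minimal NumPy sketch (illustrative, pinning $\lambda$ at one state to remove the additive constant and then shifting so that $\min_s \lambda_s = 0$):

```python
import numpy as np

# A minimal sketch: solve the bias equation  lambda = r - rho*e + P*lambda
# for a fixed unichain policy with transition matrix P and reward vector r.
# The bias is only defined up to an additive constant, so we pin lambda[0] = 0
# and then shift so that min_s lambda_s = 0, as on the slide.

def average_reward_and_bias(P, r):
    S = len(r)
    # Unknowns: x = (rho, lambda_1, ..., lambda_{S-1}), with lambda_0 = 0.
    # Equations: (I - P) lambda + rho * e = r   (S equations).
    A = np.zeros((S, S))
    A[:, 0] = 1.0                      # coefficient of rho
    IP = np.eye(S) - P
    A[:, 1:] = IP[:, 1:]               # lambda_0 is fixed to 0
    x = np.linalg.solve(A, r)
    rho = x[0]
    lam = np.concatenate(([0.0], x[1:]))
    lam -= lam.min()                   # normalize: min_s lambda_s = 0
    return rho, lam

# Two-state example (hypothetical numbers): rho = 5/6, lambda = (5/3, 0).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])
print(average_reward_and_bias(P, r))
```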

Bounding the bias

For any MDP there is an optimal policy with
$$\lambda = r - \rho e + P\lambda$$
and
$$|\lambda_s - \lambda_{s'}| \leq T^{min}_{hit}$$
for any states $s, s'$.

Intuition: if $\lambda_{s'} - \lambda_s > T^{min}_{hit}$, then we could modify the optimal policy to quickly move from $s$ to $s'$. The number of steps for this is bounded by $T^{min}_{hit}$, so $s$ would be at most $T^{min}_{hit}$ worse than $s'$.

We choose $\min_s \lambda_s = 0$, such that $0 \leq \lambda_s \leq T^{min}_{hit}$.

A general bound on the regret

Let
$$\lambda = r - \rho^* e + P\lambda, \qquad 0 \leq \lambda_s \leq T^{min}_{hit},$$
be the bias equation for an optimal policy $\pi^*$. Then the expected regret can be bounded by
$$E[R_T] \leq \sum_{s,a} E[N_T(s,a)]\,\Big[r(s,\pi^*(s)) - r(s,a) + \big(p(\cdot \mid s, \pi^*(s)) - p(\cdot \mid s, a)\big)\lambda\Big] + O(1).$$
Assuming that $r(s,a) = r(s,a')$ for all $a, a'$, we get
$$E[R_T] \leq \sum_{s,a} E[N_T(s,a)]\; 2\,T^{min}_{hit}\, \big\|p(\cdot \mid s, \pi^*(s)) - p(\cdot \mid s, a)\big\|_1.$$
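
One way to see where the first bound comes from (a sketch, using the bias equation for $\pi^*$): for every step $t$,

$$\rho^* - r(s_t,a_t) = \Big[r(s_t,\pi^*(s_t)) - r(s_t,a_t) + \big(p(\cdot\mid s_t,\pi^*(s_t)) - p(\cdot\mid s_t,a_t)\big)\lambda\Big] + \Big[p(\cdot\mid s_t,a_t)\lambda - \lambda(s_t)\Big].$$

The first bracket is what is counted per pair $(s,a)$; the second bracket equals $E[\lambda(s_{t+1}) \mid s_t, a_t] - \lambda(s_t)$, so its sum over $t$ telescopes in expectation to at most $\max_s \lambda_s \leq T^{min}_{hit}$, which is the $O(1)$ term.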

Bounding the regret of our algorithm (1)

Let $\tilde\pi = \tilde\pi_k$ be the optimal policy chosen by our algorithm at time $t_k$ for a plausible MDP $\tilde M$. Let $\tilde\rho = \rho(\tilde\pi \mid \tilde M)$, $\rho = \rho(\tilde\pi \mid M^*)$, and $\rho^* = \rho(\pi^* \mid M^*)$ be the average rewards of $\tilde\pi$ in the MDP $\tilde M$, of $\tilde\pi$ in the true MDP $M^*$, and the optimal average reward in the true MDP. Then, since $\tilde\rho \geq \rho^*$ by the optimistic choice of $\tilde\pi$,
$$(t_{k+1} - t_k)\rho^* - E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, M^*, \tilde\pi\Big] \leq \Big((t_{k+1} - t_k)\tilde\rho - E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, \tilde M, \tilde\pi\Big]\Big) + \Big(E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, \tilde M, \tilde\pi\Big] - E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, M^*, \tilde\pi\Big]\Big).$$

Bounding the regret of our algorithm (2)

It can be shown that
$$T\rho^* - \sum_{k=0}^{K} E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, \tilde M, \tilde\pi\Big] \leq O\big((K+1)\, T^{min}_{hit}\big) = O\big(S A\, T^{min}_{hit} \log T\big)$$
with $t_0 = 0$ and $t_{K+1} = T$, using the bound $K \leq S A \log_2\frac{T}{S A}$ on the number of policy changes.

Bounding the regret of our algorithm (3)

Now we construct an MDP $M'$ with actions $a$ and $\tilde a$ for each $a \in A$, such that $a$ represents an action in the original MDP $M^*$ and $\tilde a$ represents the corresponding action in $\tilde M$. Thus
$$r(s, \tilde a \mid M') = r(s, a \mid M') = r(s, a \mid M^*) = r(s, a \mid \tilde M),$$
$$p(\cdot \mid M', s, a) = p(\cdot \mid M^*, s, a),$$
$$p(\cdot \mid M', s, \tilde a) = p(\cdot \mid \tilde M, s, a).$$
Then $\tilde\pi'$ with $\tilde\pi'(s) = \tilde a$ for $\tilde\pi(s) = a$ is an optimal policy for $M'$, and $\tilde\pi$ is a suboptimal policy in $M'$.
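
For concreteness, a minimal sketch of this construction in code (illustrative array layout; the names p_true, p_opt, and augment are introduced here for the example):

```python
import numpy as np

# A minimal sketch of the augmented MDP M' (illustrative): every original
# action a gets a twin a~; a keeps the true transitions p(.|M*, s, a), a~ uses
# the optimistic model's transitions p(.|M~, s, a), and both share the rewards.

def augment(p_true, p_opt, r):
    """p_true, p_opt: (S, A, S) transition kernels; r: (S, A) rewards."""
    p_aug = np.concatenate([p_true, p_opt], axis=1)   # actions a_1..a_A, a~_1..a~_A
    r_aug = np.concatenate([r, r], axis=1)
    return p_aug, r_aug

# A policy that picks action index a in M* corresponds to index a in M';
# its "optimistic twin" picks index a + A instead.
```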

Bounding the regret of our algorithm (4)

Thus we can use the general regret bound and get
$$E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, \tilde M, \tilde\pi\Big] - E\Big[\sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \,\Big|\, M^*, \tilde\pi\Big] \leq \sum_{s,a} E\big[N_{k+1}(s,a) - N_k(s,a)\big]\; 2\,T^{min}_{hit}\,\big\|p(\cdot \mid \tilde M, s, \tilde\pi(s)) - p(\cdot \mid M^*, s, \tilde\pi(s))\big\|_1$$
$$\leq 2\,T^{min}_{hit}\, S \sqrt{3\log T}\; \sum_s E\left[\frac{N_{k+1}(s,\tilde\pi(s)) - N_k(s,\tilde\pi(s))}{\sqrt{N_k(s,\tilde\pi(s))}}\right].$$

Bounding the regret of our algorithm (5)

Since $N_{k+1}(s,a) \leq 2 N_k(s,a)$ and $\sum_{s,a} N_{K+1}(s,a) = T$, summing over $k$ gives
$$2\,T^{min}_{hit}\, S\sqrt{3\log T}\; \sum_k \sum_s E\left[\frac{N_{k+1}(s,\tilde\pi_k(s)) - N_k(s,\tilde\pi_k(s))}{\sqrt{N_k(s,\tilde\pi_k(s))}}\right] = O\left(T^{min}_{hit}\, S\sqrt{\log T}\; \sum_{k,s,a}\frac{N_{k+1}(s,a) - N_k(s,a)}{\sqrt{N_k(s,a)}}\right)$$
$$= O\left(T^{min}_{hit}\, S\sqrt{\log T}\; \sum_{s,a}\sqrt{N_{K+1}(s,a)}\right) = O\left(T^{min}_{hit}\, S\sqrt{S A\, T \log T}\right).$$
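
The key summation step, that $\sum_k \frac{N_{k+1}(s,a) - N_k(s,a)}{\sqrt{N_k(s,a)}} = O\big(\sqrt{N_{K+1}(s,a)}\big)$ when a count at most doubles between recomputations, can be sanity-checked numerically; a small sketch using the explicit constant $\sqrt{2}+1$:

```python
import math
import random

# Sanity check (illustrative): if a count at most doubles from one episode to
# the next, then  sum_k (N_{k+1} - N_k) / sqrt(N_k)  is O(sqrt(N_{K+1})).
# Here (sqrt(2) + 1) * sqrt(N_{K+1}) is used as the explicit constant,
# since (N_{k+1} - N_k)/sqrt(N_k) <= (sqrt(2)+1)(sqrt(N_{k+1}) - sqrt(N_k)).

random.seed(1)
for _ in range(5):
    counts = [1]
    while counts[-1] < 10**6:
        counts.append(counts[-1] + random.randint(1, counts[-1]))  # at most doubles
    total = sum((counts[k + 1] - counts[k]) / math.sqrt(counts[k])
                for k in range(len(counts) - 1))
    print(total <= (math.sqrt(2) + 1) * math.sqrt(counts[-1]))     # always True
```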