Reinforcement Learning, Lecture 6: RL algorithms 2.0. Alexandre Proutiere, Sadegh Talebi, Jungseul Ok. KTH, The Royal Institute of Technology

Objectives of this lecture: present and analyse two online algorithms based on the optimism in front of uncertainty principle, and compare their regret to that of algorithms with random exploration:
- UCB-VI for episodic RL problems
- UCRL2 for ergodic RL problems

Lecture 6: Outline
1. Minimal exploration in RL
2. UCB-VI
3. UCRL2

Towards minimal exploration. The MDP model is unknown and has to be learnt. Solutions for on-policy algorithms:
1. Estimate the model, then optimise: poor regret and premature exploitation
2. ε-greedy exploration: undirected exploration (explores too many (state, action) pairs with low values)
3. Bandit-like optimal exploration-exploitation trade-off
But how much should a (state, action) pair be explored?

Regret lower bounds. In the case of ergodic RL problems:
Problem-specific lower bound (Burnetas-Katehakis 1997):
$$\liminf_{T\to\infty} \frac{\mathbb{E}[N_{(s,a)}(T)]}{\log(T)} \ge \frac{1}{K_M(s,a)},$$
leading to an asymptotic regret lower bound scaling as SA log(T).
Minimax lower bound: Θ(√(SAT)).
We don't know when the asymptotic problem-specific regret lower bound is representative, often only for very large T! Read, for bandit optimisation: Explore First, Exploit Next: The True Shape of Regret in Bandit Problems, Garivier et al., https://arxiv.org/abs/1602.07182

Which regret lower bound should we target? Example: SA = 1000, comparison of √(SAT) and SA log(T) as functions of T.

Which regret lower bound should we target? Boundary: SA = T / log(T)^2.
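
To see where this boundary lies in the SA = 1000 example, here is a quick numerical check. It is only a sketch: it ignores the constants in the bounds, compares bare scalings, and uses natural logarithms.

```python
import math

# Compare the two lower-bound scalings for SA = 1000 (constants ignored).
SA = 1000

def minimax(T):           # sqrt(S A T) scaling
    return math.sqrt(SA * T)

def problem_specific(T):  # S A log(T) scaling
    return SA * math.log(T)

# Fixed-point iteration for the boundary SA = T / log(T)^2,
# i.e. the T at which the two scalings coincide.
T = 10.0
for _ in range(100):
    T = SA * math.log(T) ** 2
print(f"crossover at T ~ {T:.3g}")            # roughly 1.4e5 for SA = 1000
print(minimax(1e4) < problem_specific(1e4))   # True: sqrt(SAT) is smaller below the boundary
print(minimax(1e7) > problem_specific(1e7))   # True: SA log(T) is smaller far beyond it
```

So for SA = 1000 the logarithmic bound only becomes the smaller (hence more representative) target beyond roughly T ≈ 10^5 steps.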

Optimism in front of uncertainty. Estimate the unknown system parameters (here p(·|·,·) and r(·,·)) and build an optimistic reward estimate to trigger exploration.
- Estimate: find confidence balls containing the true model w.h.p.
- Optimistic reward estimate: find the model within the confidence balls leading to the highest value.

Optimism in front of uncertainty: generic algorithm (for infinite-horizon RL problems)
Initialise p̂, r̂, and N(s, a)
For t = 1, 2, ...
1. Build an optimistic reward model (Q(s, a))_{s,a} from p̂, r̂, and N(s, a)
2. Select action a(t) maximising Q(s(t), a) over A_{s(t)}
3. Observe the transition to s(t+1) and collect reward r(s(t), a(t))
4. Update p̂, r̂, and N(s, a)
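
A minimal Python sketch of this generic loop. The environment object `env` (with `reset()` and `step(state, action) -> (next_state, reward)`) and the callable `optimistic_q`, which stands in for step 1 (e.g. the UCB-VI or UCRL2 construction), are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def optimistic_rl(env, S, A, T, optimistic_q):
    """Generic optimism-in-front-of-uncertainty loop (sketch).

    optimistic_q(p_hat, r_hat, N) -> Q of shape (S, A) is a placeholder for
    step 1 of the algorithm (the optimistic reward model).
    """
    N = np.zeros((S, A))                     # visit counts N(s, a)
    r_hat = np.zeros((S, A))                 # empirical rewards
    p_hat = np.full((S, A, S), 1.0 / S)      # empirical transition probabilities
    counts = np.zeros((S, A, S))
    s = env.reset()
    for t in range(T):
        Q = optimistic_q(p_hat, r_hat, N)    # 1. optimistic reward model
        a = int(np.argmax(Q[s]))             # 2. greedy w.r.t. the optimistic Q
        s_next, r = env.step(s, a)           # 3. observe transition and reward
        N[s, a] += 1                         # 4. update the estimates
        counts[s, a, s_next] += 1
        r_hat[s, a] += (r - r_hat[s, a]) / N[s, a]
        p_hat[s, a] = counts[s, a] / N[s, a]
        s = s_next
    return Q
```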

Examples:
- UCB-VI: directly build a confidence ball for the Q-function based on the empirical estimates of the model.
- UCRL2: first build confidence balls for the reward and transition probabilities, and then identify Q.

Lecture 6: Outline
1. Minimal exploration in RL
2. UCB-VI
3. UCRL2

Finite-horizon MDP to episodic RL problems
- Initial state s_0 (could be a r.v.)
- Transition probabilities at time t: p(s'|s, a)
- Reward at time t: r(s, a), and at time H: r_H(s)
- Unknown transition probabilities and reward function
Objective: quickly learn a policy π maximising, over π_0 ∈ MD (Markov deterministic policies),
$$V^{\pi_0}_H := \mathbb{E}\Big[\sum_{u=0}^{H-1} r(s^{\pi_0}_u, a^{\pi_0}_u) + r_H(s^{\pi_0}_H)\Big].$$

Finite-horizon MDP to episodic RL problems
- Data: K episodes of length H (actions, states, rewards)
- Learning algorithm π: data → π_K ∈ MD
- Performance of π: how close π_K is to the optimal policy π*

UCB-VI. UCBVI is an extension of Value Iteration, guaranteeing that the resulting value function is a (high-probability) upper confidence bound (UCB) on the optimal value function. At the beginning of episode k, it computes state-action values using the empirical transition kernel and reward function. In step h of the backward induction (to update Q_{k,h}(s, a) for any (s, a)), it adds a bonus b_{k,h}(s, a) to the value, and ensures that Q_{k,h} never exceeds Q_{k,h-1}. Two variants of UCBVI, depending on the choice of bonus b_{k,h}:
- UCBVI-CH
- UCBVI-BF

UCB-VI algorithm. Variables maintained by the algorithm (for a known reward function):
- p̂ = (p̂(s'|s, a), s, s' ∈ S, a ∈ A_s): estimated transition probabilities
- Q = (Q_h(s, a), h ≤ H, s ∈ S, a ∈ A_s): estimated Q-function
- b = (b_h(s, a), h ≤ H, s ∈ S, a ∈ A_s): Q-value bonus
- N = (N(s, a), s ∈ S, a ∈ A_s): number of visits to (s, a) so far
- N_h = (N_h(s, a), h ≤ H, s ∈ S, a ∈ A_s): number of visits to (s, a) in step h of the episodes so far

UCB-VI algorithm
Algorithm UCB-VI. Input: initial state distribution ν_0, precision δ.
Initialise the variables p̂, N, and N_h.
For episode k = 1, 2, ...
1. Optimistic reward:
   a. Compute the bonus: b ← bonus(N, N_h, p̂, Q, δ)
   b. Estimate the Q-function: Q ← bellmanopt(Q, b, p̂)
2. Initialise the state s(0) ~ ν_0
3. For h = 1, ..., H, select action a ∈ argmax_{a' ∈ A_{s(h-1)}} Q_h(s(h-1), a')
4. Observe the transitions and update p̂, N, and N_h

UCB-VI algorithm: bonus
UCBVI-CH:
$$b_h(s, a) = 7H\sqrt{\frac{\log(5SAT/\delta)}{N(s, a)}}$$
UCBVI-BF:
$$b_h(s, a) = \sqrt{\frac{8L\,\mathrm{Var}_{\hat p(\cdot|s,a)}(V_{h+1}(Y))}{N(s, a)}} + \frac{14HL}{3N(s, a)} + \sqrt{\frac{8}{N(s, a)}\sum_{y} \hat p(y|s, a)\min\Big\{\frac{10^4 H^3 S^2 A L^2}{N_{h+1}(y)}, H^2\Big\}}$$
where L = log(5SAT/δ).
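
For concreteness, the Chernoff-Hoeffding bonus is straightforward to compute from the visit counts. A small sketch in Python; guarding unvisited pairs with max(1, N) is an assumption, not something stated on the slide.

```python
import numpy as np

def bonus_ch(N, H, S, A, T, delta):
    """UCBVI-CH bonus: b_h(s,a) = 7 H sqrt(log(5SAT/delta) / N(s,a)) (sketch)."""
    L = np.log(5 * S * A * T / delta)
    return 7 * H * np.sqrt(L / np.maximum(N, 1))   # elementwise over the (S, A) count array
```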

UCB-VI algorithm: optimistic Bellman operator
bellmanopt(Q, b, p̂) applies dynamic programming with a bonus.
Initialisation: Q_H(s, a) = r_H(s) for all (s, a).
For step h = H-1, ..., 1: for all (s, a) visited at least once so far,
$$Q_h(s, a) \leftarrow \min\Big(Q_h(s, a),\ H,\ r(s, a) + \sum_{y} \hat p(y|s, a)\, V_{h+1}(y) + b_h(s, a)\Big),$$
where V_{h+1}(y) = max_a Q_{h+1}(y, a).
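
A sketch of this backward induction in Python. The array shapes (Q and b indexed as (H+1, S, A), 0-indexed steps) are assumptions made for the illustration; only pairs visited at least once are updated, as on the slide.

```python
import numpy as np

def bellman_opt(Q, b, p_hat, r, r_H, N, H):
    """Optimistic Bellman backup (sketch).

    Q, b: arrays of shape (H + 1, S, A); p_hat: (S, A, S); r: (S, A); r_H: (S,).
    """
    S, A = r.shape
    Q[H] = r_H[:, None]                      # Q_H(s, a) = r_H(s)
    for h in range(H - 1, -1, -1):           # steps h = H-1, ..., 0 (0-indexed)
        V_next = Q[h + 1].max(axis=1)        # V_{h+1}(y) = max_a Q_{h+1}(y, a)
        for s in range(S):
            for a in range(A):
                if N[s, a] == 0:
                    continue                 # update only visited pairs
                target = r[s, a] + p_hat[s, a] @ V_next + b[h, s, a]
                Q[h, s, a] = min(Q[h, s, a], H, target)
    return Q
```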

UCB-VI: regret guarantees
Regret up to time T = KH: $R^{UCBVI}(T) = \sum_{k=1}^{K} \big(V^*(x_{k,1}) - V^{\pi_k}(x_{k,1})\big)$.
Theorem. For any δ > 0, the regret of UCB-VI-CH(δ) is bounded w.p. at least 1-δ by
$$R^{UCBVI\text{-}CH}(T) \le 20\, H L \sqrt{SAT} + 250\, H^2 S^2 A L^2,$$
with L = log(5HSAT/δ).
For T ≥ HS^3 A and SA ≥ H, the regret upper bound scales as Õ(H√(SAT)) (!?)

Sketch of proof. Notation:
- π_k: the policy applied by UCBVI in the k-th episode
- V_{k,h}: the optimistic value function computed by UCBVI in step h of the k-th episode
- V^π_h: the value function from step h under π
- P^π = (p(s'|s, π(s)))_{s,s'}
- P̂^π_k = (p̂_k(s'|s, π(s)))_{s,s'}, where p̂_k is the estimated transition kernel in episode k
Claim 1: by construction, with high probability, V_{k,h} ≥ V^*_h. Then:
$$R^{UCBVI}(T) \le \bar R(T) = \sum_{k=1}^{K} \big(V_{k,1}(x_{k,1}) - V^{\pi_k}(x_{k,1})\big)$$

Sketch of proof. Let Δ_{k,h} = V_{k,h} - V^{π_k}_h, so that $\bar R(T) = \sum_{k=1}^{K} \Delta_{k,1}(x_{k,1})$.
Backward induction on h to bound Δ_{k,1}: introduce δ_{k,h} = Δ_{k,h}(x_{k,h}); then
$$\delta_{k,h} \le (\hat P^{\pi_k}_k - P^{\pi_k})\,\Delta_{k,h+1}(x_{k,h}) + \delta_{k,h+1} + \epsilon_{k,h} + b_{k,h} + e_{k,h},$$
where
$$\epsilon_{k,h} = P^{\pi_k}\Delta_{k,h+1}(x_{k,h}) - \Delta_{k,h+1}(x_{k,h+1}), \qquad e_{k,h} = (\hat P^{\pi_k}_k - P^{\pi_k})\, V^*_{h+1}(x_{k,h}).$$
Conclude with concentration and martingale (Azuma) arguments, and by bounding the bonus terms.

Numerical experiments: the river-swim example.

[Figure: Regret vs. episode, 4 states, H = 2, δ = 0.05 (for UCBVI), ε-greedy: ε_t = min(1, 1000/t); curves for UCBVI-CH, DP, and ε-greedy.]

[Figure: Regret vs. episode, 4 states, H = 3, δ = 0.05 (for UCBVI), ε-greedy: ε_t = min(1, 1000/t); curves for UCBVI-CH, DP, and ε-greedy.]

[Figure: Optimistic Q-values, Q*(s, a) - Q_{k,1}(s, a) vs. episode for each (s, a); 4 states, H = 3, δ = 0.05 (for UCBVI).]

[Figure: Value function convergence under UCBVI, V*(s) - V_k(s) vs. episode for each state s; 4 states, H = 3, δ = 0.05 (for UCBVI).]

Lecture 6: Outline
1. Minimal exploration in RL
2. UCB-VI
3. UCRL2

Expected average reward MDP to ergodic RL problems
Stationary transition probabilities p(s'|s, a) and rewards r(s, a), uniformly bounded: ∀a, s, r(s, a) ≤ 1.
Objective: learn from data a policy π ∈ MD maximising (over all possible policies)
$$g^\pi = V^\pi_1(s_0) := \liminf_{T\to\infty} \frac{1}{T}\,\mathbb{E}_{s_0}\Big[\sum_{u=0}^{T-1} r(s^\pi_u, a^\pi_u)\Big].$$

Ergodic RL problems: preliminaries. Optimal policy: recall Bellman's equation
$$g^* + h^*(s) = \max_{a\in\mathcal{A}_s}\big(r(s, a) + h^{*\top} p(\cdot|s, a)\big), \quad \forall s,$$
where g^* is the maximal gain and h^* is the bias function (h^* is uniquely determined up to an additive constant). Note: g^* does not depend on the initial state for communicating MDPs. Let a^*(s) denote any optimal action for state s (i.e., a maximiser in the above). Define the gap of a sub-optimal action a at state s:
$$\phi(s, a) := \big(r(s, a^*(s)) - r(s, a)\big) + h^{*\top}\big(p(\cdot|s, a^*(s)) - p(\cdot|s, a)\big).$$

Ergodic RL problems: preliminaries. Diameter D, defined as
$$D := \max_{s\ne s'} \min_{\pi} \mathbb{E}[T^\pi_{s,s'}],$$
where T^π_{s,s'} denotes the first time step at which s' is reached under π starting from initial state s. Remark: all communicating MDPs have a finite diameter.
Important parameters impacting performance:
- Diameter D
- Gap Φ := min_{s, a ≠ a^*(s)} φ(s, a)
- Gap Δ := min_{π: g^π < g^*} (g^* - g^π)

Ergodic RL problems: regret lower bounds
Problem-specific regret lower bound (Burnetas-Katehakis): for any algorithm π,
$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \ge c_{bk} := \sum_{s,a} \frac{\phi(s, a)}{\inf\{KL(p(\cdot|s, a), q) : q \in \Theta_{s,a}\}},$$
where Θ_{s,a} is the set of distributions q s.t. replacing (only) p(·|s, a) by q makes a the unique optimal action in state s.
- asymptotic (valid as T → ∞)
- valid for any ergodic MDP
- scales as Ω((DSA/Φ) log(T)) for specific MDPs
Minimax regret lower bound: Ω(√(DSAT))
- non-asymptotic (valid for all T ≥ DSA)
- derived for a specific family of hard-to-learn communicating MDPs

Ergodic RL problems: state of the art. Two types of algorithms targeting different regret guarantees:
Problem-specific guarantees
- MDP-specific regret bound scaling as O(log(T))
- Algorithms: B-K (Burnetas & Katehakis, 1997), OLP (Tewari & Bartlett, 2007), UCRL2 (Jaksch et al. 2009), KL-UCRL (Filippi et al. 2010)
Minimax guarantees
- Valid for a class of MDPs with S states and A actions, and (typically) diameter D
- Scaling as Ω(√T)
- Algorithms: UCRL2 (Jaksch et al. 2009), KL-UCRL (Filippi et al. 2010), REGAL (Bartlett & Tewari, 2009), A-J (Agrawal & Jia, 2010)

Ergodic RL problems: state of the art

Problem-specific (logarithmic) regret bounds:
- B-K (ergodic MDPs, known rewards): O(c_bk log(T)), asymptotic
- OLP (ergodic MDPs, known rewards): O((D^2 SA/Φ) log(T)), asymptotic
- UCRL (unichain MDPs): O((S^5 A^2/Δ) log(T))
- UCRL2, KL-UCRL (communicating MDPs): O((D^2 S^2 A/Δ) log(T))
- Lower bound (ergodic MDPs, known rewards): Ω(c_bk log(T)), Ω((DSA/Φ) log(T))

Minimax regret bounds:
- UCRL2 (communicating MDPs): Õ(DS√(AT))
- KL-UCRL (communicating MDPs): Õ(DS√(AT))
- REGAL (weakly communicating MDPs, known rewards): Õ(BS√(AT))
- A-J (communicating MDPs, known rewards): Õ(D√(SAT)), for T ≥ S^5 A
- Lower bound (known rewards): Ω(√(DSAT)), for T ≥ DSA
(B denotes the span of the bias function of the true MDP, and B ≤ D.)

UCRL2. UCRL2 is an optimistic algorithm that works in episodes of increasing length. At the beginning of each episode k, it maintains a set of plausible MDPs M_k (which contains the true MDP w.h.p.). It then computes an optimal policy π_k, with the largest gain over all MDPs in M_k (π_k ∈ argmax_{M' ∈ M_k, π} g^π(M')).
- For computational efficiency, UCRL2 computes a 1/√(t_k)-optimal policy, where t_k is the starting step of episode k
- To find such a near-optimal policy, UCRL2 uses Extended Value Iteration
It then follows π_k within episode k until the number of visits of some pair (s, a) is doubled (at which point a new episode starts).

UCRL2. Notation:
- k ∈ N: index of an episode
- N_k(s, a): total number of visits to the pair (s, a) before episode k
- p̂_k(·|s, a): empirical transition probabilities of (s, a), from observations up to episode k
- r̂_k(s, a): empirical reward of (s, a), from observations up to episode k
- π_k: policy followed in episode k
- M_k: set of models for episode k (defined next)
- ν_k(s, a): number of visits to the pair (s, a) so far in episode k

UCRL2: main ingredients
The set of plausible MDPs M_k: for confidence parameter δ, define
$$\mathcal{M}_k = \Big\{ M' = (S, A, r, p) :\ \forall (s, a),\ |r(s, a) - \hat r_k(s, a)| \le \sqrt{\tfrac{3.5\log(2SAt_k/\delta)}{\max(1, N_k(s, a))}},\ \ \|p(\cdot|s, a) - \hat p_k(\cdot|s, a)\|_1 \le \sqrt{\tfrac{14 S \log(2At_k/\delta)}{\max(1, N_k(s, a))}} \Big\}.$$
Optimistic gain: find in M_k the MDP that leads to the highest gain. For episode k, we need to solve: maximise g^π(M') over (M', π) subject to M' ∈ M_k.
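
As an illustration, the two confidence radii defining M_k can be computed directly from the visit counts. A sketch; using max(1, N_k) for unvisited pairs and t_k inside the logarithms follows the formulas above, but the exact conventions are assumptions of this snippet.

```python
import numpy as np

def confidence_radii(N_k, t_k, S, A, delta):
    """Radii of the UCRL2 plausible set M_k (sketch).

    d_r(s, a) bounds |r(s, a) - r_hat_k(s, a)|,
    d_p(s, a) bounds ||p(.|s, a) - p_hat_k(.|s, a)||_1.
    """
    n = np.maximum(N_k, 1)
    d_r = np.sqrt(3.5 * np.log(2 * S * A * t_k / delta) / n)
    d_p = np.sqrt(14 * S * np.log(2 * A * t_k / delta) / n)
    return d_r, d_p
```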

UCRL2 pseudo-code
Algorithm UCRL2. Input: initial state s_0, precision δ; set t = 1.
For each episode k ≥ 1:
1. Initialisation: t_k = t (start time of the episode); update N_k(s, a), r̂_k(s, a), and p̂_k(s, a) for all (s, a)
2. Compute the set of plausible MDPs M_k (using δ)
3. Compute the policy π_k ← ExtendedValueIteration(M_k, 1/√(t_k))
4. Execute π_k until the end of the episode:
   While ν_k(s_t, π_k(s_t)) < max(1, N_k(s_t, π_k(s_t))):
   - Play π_k(s_t), observe the reward and the next state
   - Update ν_k(s_t, π_k(s_t)) ← ν_k(s_t, π_k(s_t)) + 1 and t ← t + 1
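
A minimal Python sketch of this episode structure, centred on the doubling rule of step 4. The callable `plan` is a hypothetical stand-in for steps 2-3 (building M_k and running Extended Value Iteration with accuracy 1/√t_k), `env` is an assumed tabular environment, and the bookkeeping of r̂_k, p̂_k is elided.

```python
import numpy as np

def ucrl2_episodes(env, S, A, T, plan):
    """UCRL2 episode structure (sketch): steps 1-4 with the doubling rule.

    plan(N, t_k) returns a deterministic policy pi_k as an int array of shape (S,).
    """
    N = np.zeros((S, A))                 # N_k(s, a): visits before the current episode
    s, t = env.reset(), 1
    pi_k = np.zeros(S, dtype=int)
    while t <= T:
        t_k = t                          # step 1: start time of episode k
        pi_k = plan(N, t_k)              # steps 2-3: plausible set M_k + Extended VI
        nu = np.zeros((S, A))            # nu_k(s, a): visits within episode k
        # step 4: follow pi_k until the count of the current pair doubles
        while t <= T and nu[s, pi_k[s]] < max(1, N[s, pi_k[s]]):
            a = int(pi_k[s])
            s_next, _reward = env.step(s, a)   # play, observe reward and next state
            nu[s, a] += 1
            t += 1
            s = s_next
        N += nu                          # counts carried over to the next episode
    return pi_k
```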

Extended value iteration
Set of plausible MDPs M_k:
$$\mathcal{M}_k = \{ M' = (S, A, r, p) :\ \forall (s, a),\ |r(s, a) - \hat r_k(s, a)| \le d(s, a),\ \ \|p(\cdot|s, a) - \hat p_k(\cdot|s, a)\|_1 \le d'(s, a) \}$$
We wish to find M' ∈ M_k and a policy π_k maximising g^π(M') over all possible M' ∈ M_k and policies π. Ideas:
a. we can fix the reward to its maximum: r(s, a) = r̂_k(s, a) + d(s, a)
b. solve a large MDP whose set of actions at state s is A'_s, where (a, q) ∈ A'_s if and only if q ∈ P_k(s, a), with
$$P_k(s, a) = \{q : \|q(\cdot) - \hat p_k(\cdot|s, a)\|_1 \le d'(s, a)\}.$$

Extended value iteration
Solution: apply one of the known algorithms for finding an optimal policy in an MDP, e.g., the value iteration algorithm.
Extended Value Iteration: for all s ∈ S, starting from u_0(s) = 0,
$$u_{i+1}(s) = \max_{a\in\mathcal{A}_s}\Big\{ r(s, a) + \max_{q\in P_k(s, a)} q^\top u_i \Big\}.$$
- P_k(s, a) is a polytope, and the inner maximisation can be done in O(S) operations.
- To obtain an ε-optimal policy, the update is stopped when max_s(u_{i+1}(s) - u_i(s)) - min_s(u_{i+1}(s) - u_i(s)) ≤ ε.
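
The inner maximisation over the L1 ball P_k(s, a) admits a simple implementation in the spirit of Jaksch et al.: put as much probability mass as allowed on the state with the largest u_i, then remove the excess from the states with the smallest u_i. A sketch (it uses a sort, hence O(S log S) as written, rather than the O(S) claimed on the slide):

```python
import numpy as np

def inner_max(p_hat, d_p, u):
    """max_{q distribution : ||q - p_hat||_1 <= d_p} q . u  (sketch)."""
    order = np.argsort(u)[::-1]                 # states sorted by decreasing u_i
    q = p_hat.astype(float).copy()
    best = order[0]
    q[best] = min(1.0, p_hat[best] + d_p / 2)   # add up to d_p/2 of mass to the best state
    excess = q.sum() - 1.0                      # mass that must be removed elsewhere
    for s in order[::-1]:                       # remove it from the worst states first
        if excess <= 0:
            break
        if s == best:
            continue                            # never remove mass from the best state
        removed = min(q[s], excess)
        q[s] -= removed
        excess -= removed
    return float(q @ u)
```

The resulting q is feasible: at most d_p/2 of mass is added and at most d_p/2 removed, so the L1 distance to p̂ is at most d_p.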

UCRL2: regret guarantees
Let π = UCRL2. Regret up to time T: $R^\pi(T) = T g^* - \sum_{t=1}^{T} r(s^\pi_t, a^\pi_t)$, a random variable capturing both the learning cost and the mixing-time issues.
Theorem. W.p. at least 1-δ, the regret of UCRL2 satisfies, for any initial state and any T > 1,
$$R^\pi(T) \le 34\, D S \sqrt{A T \log(T/\delta)}.$$
For any initial state, any T ≥ 1, and any ε > 0, we have w.p. at least 1-3δ,
$$R^\pi(T) \le 34^2\, \frac{D^2 S^2 A \log(T/\delta)}{\epsilon} + \epsilon T.$$

[Figure: Regret vs. time, 6 states, δ = 0.05 (for UCRL2), ε-greedy: ε_t = min(1, 1000/t); curves for UCRL2, KL-UCRL, and ε-greedy.]

[Figure: Regret vs. time, 12 states, δ = 0.05 (for UCRL2); curves for UCRL2 and KL-UCRL.]

References
Episodic RL, UCBVI algorithm: M. Gheshlaghi Azar, I. Osband, and R. Munos, Minimax regret bounds for reinforcement learning, Proc. ICML, 2017.
Ergodic RL, UCRL algorithm: P. Auer and R. Ortner, Logarithmic online regret bounds for undiscounted reinforcement learning, Proc. NIPS, 2006.
UCRL2 algorithm and minimax LB: T. Jaksch, R. Ortner, and P. Auer, Near-optimal regret bounds for reinforcement learning, J. Machine Learning Research, 2010.