Reinforcement Learning

Reinforcement Learning: Model-Based Reinforcement Learning
Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning
Vien Ngo
MLR, University of Stuttgart

Outline
- Model-based RL
- Exploration/Exploitation
- PAC-MDP: E3, RMAX
- Bayes optimal: Bayesian model-based RL

RL Approaches

Model-based RL
Model learning: Given data $D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{H}$, estimate $P(s' \mid a, s)$ and $P(r \mid a, s)$.
- discrete state-action: $\hat{P}(s' \mid a, s) = \frac{\#(s', a, s)}{\#(a, s)}$
- continuous state-action: $\hat{P}(s' \mid a, s) = \mathcal{N}(s' \mid \phi(s, a)^\top \beta, \Sigma)$; estimate the parameters $\beta$ (and perhaps $\Sigma$) as for regression (including non-linear features, regularization, cross-validation!)
Planning:
- discrete state-action: Dynamic Programming with the estimated model
- continuous state-action: Differential Dynamic Programming, Planning-by-Inference, etc.
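
The count-based estimate for the discrete case can be written down directly. Below is a minimal Python sketch (not from the slides), assuming states and actions are encoded as integer ids and the data are (s, a, r, s') tuples; all names are illustrative.

```python
import numpy as np

def estimate_model(data, num_states, num_actions):
    """Maximum-likelihood tabular model: P_hat(s'|s,a) = #(s,a,s') / #(s,a)."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sum = np.zeros((num_states, num_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)                       # visit counts #(s,a)
    P_hat = np.divide(counts, n_sa,
                      out=np.full_like(counts, 1.0 / num_states),  # uniform where unvisited
                      where=n_sa > 0)
    R_hat = np.divide(reward_sum, n_sa[:, :, 0],
                      out=np.zeros_like(reward_sum),
                      where=n_sa[:, :, 0] > 0)                     # mean observed reward
    return P_hat, R_hat
```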

Exploration vs. Exploitation

Exploration: Motivation (figures)
Start: many potential paths. Continue using this path? Try out alternatives?

Reinforcement learning
In reinforcement learning (RL), the agent starts to act without a model of the environment. The agent has to learn from its experience what to do in order to fulfill tasks and achieve high rewards.
RL algorithms we have seen thus far: Q-learning, TD-learning, policy gradient, ...
Note the difference to the problem of adapting the behavior based on a given model (also called planning / solving an MDP / calculating optimal state and action values). This is a computational subproblem of model-based RL.
Planning algorithms we have seen thus far: value iteration, policy iteration.

Exploration / Exploitation
In contrast to supervised learning, in RL the data used for learning depend on the agent. There are two different types of behavior:
- exploration: behave with the goal of learning as much as possible
- exploitation: behave with the goal of getting as much reward as possible
Challenge in exploration: which actions will lead to the most important learning progress with respect to the goal? Exploration is a fundamental intelligent behavior.

Recall: Markov Decision Processes
$$P(s_{0:T}, a_{0:T}, r_{0:T}; \pi) = P(s_0)\, P(a_0 \mid s_0; \pi)\, P(r_0 \mid a_0, s_0) \prod_{t=1}^{T} P(s_t \mid a_{t-1}, s_{t-1})\, P(a_t \mid s_t; \pi)\, P(r_t \mid a_t, s_t)$$
- world's initial state distribution $P(s_0)$
- world's transition probabilities $P(s_{t+1} \mid a_t, s_t)$
- world's reward probabilities $P(r_t \mid a_t, s_t)$
- agent's policy $\pi(a_t \mid s_t)$ (or deterministic $a_t = \pi(s_t)$)
- discount parameter $\gamma$ for future rewards
Two different sources of uncertainty: the world itself (not controlled by the agent) vs. the policy (controlled by the agent).
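
To make the factorization concrete, here is a small Python sketch (not part of the slides) that samples a trajectory from a tabular transition table P of shape (S, A, S), an expected-reward table R of shape (S, A) and a stochastic policy pi of shape (S, A); the interface and names are assumed for illustration only.

```python
import numpy as np

def rollout(P, R, pi, p0, T, rng=None):
    """Sample (s_t, a_t, r_t) for t = 0..T according to P(s_0), pi(a|s) and P(s'|a,s)."""
    rng = rng or np.random.default_rng()
    traj = []
    s = rng.choice(len(p0), p=p0)                 # s_0 ~ P(s_0)
    for t in range(T + 1):
        a = rng.choice(pi.shape[1], p=pi[s])      # a_t ~ pi(a_t | s_t)
        r = R[s, a]                               # expected reward stands in for P(r_t | a_t, s_t)
        traj.append((s, a, r))
        s = rng.choice(P.shape[2], p=P[s, a])     # s_{t+1} ~ P(s_{t+1} | a_t, s_t)
    return traj
```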

Exploration-exploitation tradeoff
Goal of the reinforcement learning agent: maximize future rewards $E[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0; \pi]$.
However, the agent does not know the transition parameters $P(s_{t+1} \mid a_t, s_t)$ and reward parameters $P(r_t \mid a_t, s_t)$ of the MDP. Rather, the agent needs to learn from its experience $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ which actions will lead to high rewards.
Exploration-exploitation tradeoff: Which policy $\pi(a_t \mid s_t)$ for action selection should the agent follow so that it does not miss the high-reward states, but does not spend too much time in low-reward states either?
- Exploitation: Prefer actions $a_t$ which have led to reward before?
- Exploration: Or rather take actions to learn more about the unknown MDP parameters and potentially find states with higher reward?

Notion of Optimality

Sample Complexity
Let $M$ be an MDP with $N$ states, $K$ actions, discount factor $\gamma \in [0, 1)$ and maximal reward $R_{\max} > 0$.
Let $A$ be an algorithm (that is, a reinforcement learning agent) that acts in the environment, resulting in $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$.
Let $V^A_{t,M} = E[\sum_{i=0}^{\infty} \gamma^i r_{t+i} \mid s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t]$.
$V^*$ is the value function of the optimal policy.
Define an accuracy threshold $\epsilon$: $|\hat{V} - V^*| \leq \epsilon$.

Definition: Let $\epsilon > 0$ be a prescribed accuracy and $\delta > 0$ an allowed probability of failure. The expression $\eta(\epsilon, \delta, N, K, \gamma, R_{\max})$ is a sample complexity bound for algorithm $A$ if, independently of the choice of $s_0$, with probability at least $1 - \delta$ the number of timesteps such that $V^A_{t,M} < V^*(s_t) - \epsilon$ is at most $\eta(\epsilon, \delta, N, K, \gamma, R_{\max})$. (Kakade, 2003)

Efficient exploration
An algorithm whose sample complexity is polynomial in $1/\epsilon$, $\log(1/\delta)$, $N$, $K$, $1/(1-\gamma)$, $R_{\max}$ is called PAC-MDP (probably approximately correct in MDPs).

Exploration strategies

Exploration strategies
The exploration strategy is reflected in the policy $\pi(s)$. In the following, assume we have estimates $\hat{Q}(s, a)$.
- greedy (only exploit): $\pi(s) = \operatorname{argmax}_a \hat{Q}(s, a)$
  problem: the learned model is not the same as the true environment; without exploration, the agent is likely to miss high rewards
- random: choose action $a$ with probability $1/|\text{actions}|$
  problem: ignores value estimates and thus rewards

Exploration strategies (continued)
- $\epsilon$-greedy: $\pi(s) = \operatorname{argmax}_a \hat{Q}(s, a)$ with probability $1 - \epsilon$, a random action with probability $\epsilon$
  the most popular method
  converges to the optimal value function with probability 1 (all paths will be visited sooner or later), if the exploration rate diminishes according to an appropriate schedule
  problem: sample complexity exponential in the number of states
- Boltzmann: choose action $a$ with probability $\frac{\exp(\hat{Q}(s,a)/T)}{\sum_{a'} \exp(\hat{Q}(s,a')/T)}$
  the temperature $T$ controls the amount of exploration
  problem again: sample complexity exponential in the number of states
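
A minimal Python sketch of the two rules above (not from the slides), assuming a tabular value estimate Q_hat of shape (num_states, num_actions); the epsilon and temperature defaults are illustrative choices.

```python
import numpy as np

def epsilon_greedy(Q_hat, s, epsilon=0.1, rng=None):
    """Greedy action with probability 1 - epsilon, uniformly random action otherwise."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q_hat.shape[1]))
    return int(np.argmax(Q_hat[s]))

def boltzmann(Q_hat, s, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q_hat(s,a) / T)."""
    rng = rng or np.random.default_rng()
    prefs = Q_hat[s] / temperature
    prefs = prefs - prefs.max()                   # subtract max for numerical stability
    p = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(Q_hat.shape[1], p=p))
```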

Exploration strategies (continued)
Other heuristics for exploration:
- minimize the variance of the action value estimates
- optimistic initial values ("optimism in the face of uncertainty")
- state bonuses: frequency, recency, error, etc. (a small sketch of these heuristics follows below)
problem again: sample complexity exponential in the number of states
Bayesian RL: optimal exploration strategy
- maintain a distribution over MDP models (i.e., over the parameters of the MDP)
- the posterior distribution is updated after each new observation
- the exploration strategy minimizes the uncertainty of the parameters
- Bayes-optimal solution to the exploration-exploitation tradeoff (i.e., no other policy is better in expectation w.r.t. the prior distribution over MDPs)
- only tractable for very simple problems
E3 and RMAX: principled approaches to the exploration-exploitation tradeoff with polynomial sample complexity
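
As a small illustration (not from the slides), the optimistic-initial-values and count-bonus heuristics can be combined with any value-based learner: initialize Q optimistically and add a bonus to the observed reward that shrinks with the visit count. All constants below are arbitrary choices.

```python
import numpy as np

num_states, num_actions = 10, 4
R_max, gamma, bonus_scale = 1.0, 0.95, 0.5

# optimism in the face of uncertainty: start every Q-value at the largest possible return
Q = np.full((num_states, num_actions), R_max / (1.0 - gamma))

visit_counts = np.zeros((num_states, num_actions))

def bonus_reward(r, s, a):
    """Add an exploration bonus that shrinks with the visit count of (s, a)."""
    visit_counts[s, a] += 1
    return r + bonus_scale / np.sqrt(visit_counts[s, a])
```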

PAC-MDP Algorithms
- Explicit Explore or Exploit (E3)
- RMAX
Common intuition: optimism in the face of uncertainty (i.e., when faced with the choice between a certain and an uncertain reward region, try the UNCERTAIN reward region).

Explicit Explore or Exploit (E3) algorithm

Explicit Explore or Exploit (E3) algorithm (Kearns and Singh, 2002)
- Model-based approach with polynomial sample complexity (PAC-MDP)
- uses optimism in the face of uncertainty
- assumes knowledge of the maximum reward
- maintains counts for states and actions to quantify confidence in the model estimates
- a state s is known if all actions in s have been executed sufficiently often
From the observed data, E3 constructs two MDPs:
- MDP_known: includes the known states, with (approximately exact) estimates of $P(s_{t+1} \mid a_t, s_t)$ and $P(r_t \mid a_t, s_t)$ (the model which captures what you know)
- MDP_unknown: MDP_known plus a special state where the agent receives the maximum reward (the model which drives exploration)

E3 sketch
Input: state s. Output: action a.
if s is known then                                        (sufficiently accurate model estimates)
    plan in MDP_known
    if the resulting plan has value above some threshold then
        return the first action of the plan               (exploitation)
    else
        plan in MDP_unknown
        return the first action of the plan               (planned exploration)
else
    return the action with the least observations in s    (direct exploration)
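
The decision rule in the sketch can be condensed into a few lines of Python (illustrative only; plan(), the known-state bookkeeping and the value threshold are assumed to be provided by the surrounding implementation):

```python
import numpy as np

def e3_action(s, known, counts, mdp_known, mdp_unknown, value_threshold, plan):
    """Explicit Explore or Exploit: exploit, plan exploration, or explore directly."""
    if known[s]:
        policy, value = plan(mdp_known, s)        # accurate estimates for known states
        if value >= value_threshold:
            return policy[s]                      # exploitation
        policy, _ = plan(mdp_unknown, s)          # head towards the unknown part of the MDP
        return policy[s]                          # planned exploration
    return int(np.argmin(counts[s]))              # direct exploration: least-tried action in s
```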

E3 example (figures from S. Singh, Tutorial 2005): $M$ denotes the true known-state MDP and $\hat{M}$ the estimated known-state MDP.

E3 implementation setting
- $T$ is the time horizon; $G^T_{\max}$ is the maximum T-step return (in the discounted case, $G^T_{\max} \leq T R_{\max}$).
- A state is known if it has been visited $O\!\left((N T G^T_{\max}/\epsilon)^4\, \mathrm{Var}_{\max} \log(1/\delta)\right)$ times, where $\mathrm{Var}_{\max}$ is the maximum variance of the random payoffs over all states.
- For the exploration/exploitation choice at known states, the optimal value function $V^*$ is assumed to be given: if $\hat{V}$ obtained from MDP_known is greater than $V^* - \epsilon$, then exploit.

RMAX Algorithm

RMAX
R-MAX maintains only one single model (it does not separate MDP_known and MDP_unknown) and therefore implicitly explores or exploits. The R-MAX and E3 algorithms achieve roughly the same level of performance (Strehl's thesis).
RMAX builds an approximate MDP based on the optimistic reward function
$$\hat{R}(s, a) = \begin{cases} R(s, a) & \text{if } (s, a) \text{ is known} \\ R_{\max} & \text{otherwise} \end{cases}$$

RMAX sketch
Initialize all counters $n(s, a) = 0$, $n(s, a, s') = 0$ and reward sums $r(s, a) = 0$.
Initialize $\hat{T}(s' \mid s, a) = \mathbb{I}_{s'=s}$ and $\hat{R}(s, a) = R_{\max}$.
while true do
    Compute policy $\pi_t$ using the MDP model $(\hat{T}, \hat{R})$.
    Choose $a = \pi_t(s)$, observe $s'$ and $r$.
    $n(s, a) \leftarrow n(s, a) + 1$,  $r(s, a) \leftarrow r(s, a) + r$,  $n(s, a, s') \leftarrow n(s, a, s') + 1$
    if $n(s, a) = m$ then
        Update $\hat{T}(\cdot \mid s, a) = n(s, a, \cdot)/n(s, a)$ and $\hat{R}(s, a) = r(s, a)/n(s, a)$
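
A compact Python version of this sketch is given below (not from the slides); the value_iteration planner and the env.reset()/env.step() interface are assumed placeholders, and m is the known-ness threshold analyzed on the next slide.

```python
import numpy as np

def rmax(env, value_iteration, num_states, num_actions, R_max, gamma, m, num_steps):
    n = np.zeros((num_states, num_actions))                     # n(s, a)
    n_next = np.zeros((num_states, num_actions, num_states))    # n(s, a, s')
    r_sum = np.zeros((num_states, num_actions))                 # accumulated rewards r(s, a)
    # optimistic initial model: every (s, a) self-loops and yields reward R_max
    T_hat = np.stack([np.eye(num_states)] * num_actions, axis=1)
    R_hat = np.full((num_states, num_actions), R_max)

    s = env.reset()
    for _ in range(num_steps):
        pi = value_iteration(T_hat, R_hat, gamma)               # greedy policy on current model
        a = int(pi[s])
        s_next, r = env.step(a)
        n[s, a] += 1; r_sum[s, a] += r; n_next[s, a, s_next] += 1
        if n[s, a] == m:                                        # (s, a) becomes known: plug in estimates
            T_hat[s, a] = n_next[s, a] / n[s, a]
            R_hat[s, a] = r_sum[s, a] / n[s, a]
        s = s_next
```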

RMAX analysis
The general PAC-MDP theorem does not easily adapt to the analysis of E3 because of E3's use of two internal models (the original analysis depends on the horizon and the mixing time).
(Upper bound) There exists $m = O\!\left(\frac{N T^2}{\epsilon^2} \ln^2 \frac{NK}{\delta}\right)$ such that, with probability at least $1 - \delta$, $V(s_t) \geq V^*(s_t) - \epsilon$ holds for all but $O\!\left(\frac{N^2 K T^3}{\epsilon^3} \ln^2 \frac{NK}{\delta}\right)$ timesteps, where $N$ is the number of states. For the discounted case, $T = \frac{\log(1/\epsilon)}{1 - \gamma}$.

Limitations of E3 and RMAX
E3/RMAX is called efficient because its sample complexity scales only polynomially in the number of states. In natural environments, however, this number of states is enormous: it is exponential in the number of objects in the environment. Hence E3/RMAX scales exponentially in the number of objects.
Generalization over states and actions is crucial for exploration.

KWIK
KWIK (Knows What It Knows): a supervised-learning model
- Input set: $X$
- Output set: $Y$
- Observation set: $Z$
- Hypothesis class: $H: X \to Y$
- Target function: $h^* \in H$
- Special symbol: ? ("I don't know")

KWIK example: coin learning
Predict $Pr(\text{head}) \in [0, 1]$ for a coin from many observations (head or tail).
Algorithm:
- Predict "?" for the first $O(1/\epsilon^2 \log(1/\delta))$ observations.
- Use the empirical estimate afterwards.
The bound $O(1/\epsilon^2 \log(1/\delta))$ follows from Hoeffding's inequality. (Li et al., ICML 2008)
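
A minimal Python sketch of this coin learner (not from the slides); the exact constant hidden in the O(...) bound is not specified on the slide, so the threshold below is only indicative.

```python
import math

class KWIKCoinLearner:
    """Output '?' until enough flips have been seen, then the empirical estimate."""

    def __init__(self, epsilon, delta):
        # Hoeffding-style threshold: on the order of 1/eps^2 * log(1/delta) observations
        self.threshold = math.ceil(math.log(1.0 / delta) / epsilon**2)
        self.heads = 0
        self.total = 0

    def predict(self):
        if self.total < self.threshold:
            return None                  # the KWIK "?" (I don't know)
        return self.heads / self.total   # empirical estimate of Pr(head)

    def observe(self, is_head):
        self.total += 1
        self.heads += int(is_head)
```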

KWIK-RMAX
$T(s' \mid s, a)$ and $R(s, a)$ are KWIK-learned.

Conclusions
- RL agents need to solve the exploration-exploitation tradeoff.
- Sample complexity measures the required number of explorative actions of an algorithm.
- Ideas for driving exploration: random actions, optimism in the face of uncertainty, maximizing learning progress and information gain.

References
- Kakade (2003): On the Sample Complexity of Reinforcement Learning. PhD thesis.
- Poupart, Vlassis, Hoey, Regan: An Analytic Solution to Discrete Bayesian Reinforcement Learning. ICML 2006.
- Li, Littman, Walsh: Knows What It Knows: A Framework for Self-Aware Learning. ICML 2008.