Reinforcement Learning: An Introduction

Introduction. Advisor: Freek Stulp. Hauptseminar Intelligente Autonome Systeme (Seminar on Intelligent Autonomous Systems, winter term 2004/05), Research and Teaching Unit Informatics IX, Technische Universität München. November 24, 2004

Introduction: What is Learning? Learning is often divided into supervised learning and trial-and-error learning; in a computational sense these are referred to as supervised learning and reinforcement learning. Just as we often combine both ways of learning in the real world, we also combine them in computational learning. Possible techniques for supervised learning include artificial neural networks and other machine-learning methods; we will ignore these completely here and take a closer look at reinforcement learning (RL).

Outline
1 The Problem: Markov Decision Processes - The Smaller Problem; The Gridworld - An Example
2 The Bellman Equation - Calculating the Value Function; Value Iteration - An Incremental Approach
3 Prediction - Policy Evaluation in a TD World; Q-Learning - Learning the Optimal Policy
4 The Acrobot - An Example

The Problem: What is Reinforcement Learning? Computational learning of an agent by interaction with its environment. Advantages: fast implementation, polynomial complexity, and a wide variety of applications: robotics, control and planning, ...

Markov Decision Processes (MDPs). We represent our environment as a (finite) MDP.

Table: The Elements of a Markov Decision Process (MDP)
    S                        set of states
    A_s                      set of actions for every state
    P(s_t, a_t, s_{t+1})     transition probability function
    r_{t+1}                  scalar reward

Markov Property: every transition depends only on the current state and action (and the transition probability function).

The Gridworld: Representation as an MDP. Parts of the grid:
- Every square represents a state.
- In every state the possible actions are the four moves up, down, left and right; the MDP is deterministic.
- A reward of -1 is given on every transition.
- There are two terminal states (light grey in the graphics).
Figure: Empty 4x4 Gridworld; squares correspond to states. A sketch of this gridworld as an MDP follows below.
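
As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this 4x4 gridworld as an MDP. The state numbering 0-15 and the helper name step are choices made for this sketch; the corners 0 and 15 are taken as the two terminal states, matching the zeros in the corners of the figures below.

    # Minimal 4x4 gridworld MDP sketch (assumed state numbering 0..15; 0 and 15 terminal).
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    TERMINAL = {0, 15}

    def step(state, action):
        """Deterministic transition: returns (next_state, reward)."""
        if state in TERMINAL:
            return state, 0.0                      # terminal states are absorbing
        row, col = divmod(state, 4)
        drow, dcol = ACTIONS[action]
        nrow = min(max(row + drow, 0), 3)          # moves off the grid leave the state unchanged
        ncol = min(max(col + dcol, 0), 3)
        return nrow * 4 + ncol, -1.0               # every transition yields reward -1

For example, step(1, "left") returns (0, -1.0): one step into the terminal corner.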

The Gridworld: A Random Walk
Figure (built up step by step over several slides): a random walk over the gridworld, together with the number of steps each visited state needs to reach the terminal state under this particular walk.

The Gridworld: A General Approach to Random Walks
General Random Walk: On the last slide we saw one random walk and the resulting number of steps from each state to reach a terminal state. Now imagine how the grid would be labelled in general if we always took a random action in every state until reaching a terminal state. We are looking for the mean number of steps.

Figure: Gridworld with the mean number of steps under random actions (terminal states in the corners):
     0  14  20  22
    14  18  22  20
    20  22  18  14
    22  20  14   0

Solution to the Gridworld
    π       policy: selects an action for every state
    V(s)    state-value function: the sum of future rewards for each state

Figure: The gridworld with the state-value function for the optimal policy and the possible actions under the optimal policy:
     0  1  2  3
     1  2  3  2
     2  3  2  1
     3  2  1  0

Outline (Section 2): The Bellman Equation - Calculating the Value Function; Value Iteration - An Incremental Approach

The Bellman Equation
Theorem (Bellman Equation): V(s_t) = r_{t+1} + γ V(s_{t+1})
Parts of the Bellman equation:
- the state s_t at time t,
- the reward r_{t+1} received for taking action a_t under the current policy π,
- the successor state s_{t+1}, and
- the discount factor γ with 0 < γ ≤ 1.

The Bellman Equation: Proof
    V(s_t) = Σ_{k=0}^{n} γ^k r_{t+k+1}
           = r_{t+1} + Σ_{k=1}^{n} γ^k r_{t+k+1}
           = r_{t+1} + Σ_{k=0}^{n-1} γ^{k+1} r_{t+k+2}
           = r_{t+1} + γ Σ_{k=0}^{n-1} γ^k r_{t+k+2}
           = r_{t+1} + γ V(s_{t+1})

The Bellman Optimality Equation for V*
    V*(s_t) = max_a ( r_{t+1} + γ V*(s_{t+1}) )
with s_t, r_{t+1} and γ as stated before. Taking max_a, i.e. selecting the action that gives the maximal reward plus the current estimate of the return for the successor state s_{t+1}, computes the optimal state-value function.

Policy Evaluation, or Computing V^π(s)
Policy evaluation describes the process of computing the state-value function V^π for a given policy π.

Algorithm: Policy Evaluation
    repeat:
        Δ = 0
        for each s ∈ S:
            v = V(s)
            choose a from s using π, observe r_{t+1} and s_{t+1}
            V(s) = r_{t+1} + γ V(s_{t+1})
            Δ = max(Δ, |v - V(s)|)
    until Δ < Θ    // Θ a small positive number
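
Not part of the original slides: a minimal Python sketch of policy evaluation for the gridworld above, using a full expected update over the four equiprobable actions of the random policy and reusing the ACTIONS, TERMINAL and step definitions from the earlier sketch; the function name is illustrative.

    def evaluate_random_policy(theta=1e-4, gamma=1.0):
        """Iterative policy evaluation for the equiprobable random policy (sketch)."""
        V = {s: 0.0 for s in range(16)}
        while True:
            delta = 0.0
            for s in range(16):
                if s in TERMINAL:
                    continue
                v = V[s]
                # expected update over the four equiprobable actions
                V[s] = sum(0.25 * (r + gamma * V[s2])
                           for s2, r in (step(s, a) for a in ACTIONS))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                return V

Since every step yields reward -1 and γ = 1, -V(s) is the mean number of steps from s to a terminal state under the random policy.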

Value Iteration, or Finding the Optimal Policy
Value iteration computes the optimal state-value function V*.

Algorithm: Value Iteration
    repeat:
        Δ = 0
        for each s ∈ S:
            v = V(s)
            V(s) = max_a ( r_{t+1} + γ V(s_{t+1}) )
            Δ = max(Δ, |v - V(s)|)
    until Δ < Θ    // Θ a small positive number
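
Likewise, a minimal Python sketch of value iteration for the same gridworld (again reusing the gridworld definitions sketched earlier; illustrative, not from the slides):

    def value_iteration(theta=1e-4, gamma=1.0):
        """Value iteration: sweep all states, backing up the best action each time (sketch)."""
        V = {s: 0.0 for s in range(16)}
        while True:
            delta = 0.0
            for s in range(16):
                if s in TERMINAL:
                    continue
                v = V[s]
                # Bellman optimality backup: best one-step lookahead over all actions
                V[s] = max(r + gamma * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                return V

A greedy policy with respect to the resulting V is optimal; for this gridworld, -V(s) reproduces the 0-3 step counts shown in the solution figure.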

Outline (Section 3): Prediction - Policy Evaluation in a TD World; Q-Learning - Learning the Optimal Policy

So far (model-based):
- the model (S, A_s, P, r) needs to be known
- computation of V(s) by iteration over the entire state set
Temporal Difference (TD) Learning:
- model unknown, based on experience
- samples state, action and reward triples
- approximates V(s) while interacting with the environment

Prediction
Policy evaluation is also known as the prediction problem: V^π(s) is computed by turning the Bellman equation into an update rule and iterating until no more changes occur.
TD Prediction: update rule for TD(0)
    V(s_t) = V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]
Convergence criteria:
1 for a sufficiently small constant α: convergence in the mean
2 if α decreases over time: convergence with probability 1
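
A minimal tabular TD(0) prediction sketch in Python (illustrative only; the env.reset/env.step interface and the episode loop are assumptions, not from the slides):

    def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
        """Tabular TD(0): estimate V^pi from sampled transitions (sketch).

        Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
        and policy(state) -> action.
        """
        V = {}
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                # TD(0) update: move V(s) toward the one-step bootstrapped target
                V[s] = V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0))
                s = s_next
        return V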

Q-Learning and the State-Action-Value Function
Learning the state-action-value function:
- Value iteration takes the max over all actions; this is only possible with a complete model.
- Q-Learning computes Q(s, a) instead; taking the max over Q(s, a) for all actions does not require a model.
Figure: Digraph of a simple MDP.

Off-Policy Learning: Q-Learning
Off-policy learning: to ensure that we constantly explore the state space, we follow one policy while actually evaluating another one.

Algorithm: Q-Learning
    repeat (for each episode):
        initialize s_t
        choose a_t from s_t using an ε-greedy policy derived from Q
        repeat (for each step of the episode):
            take action a_t, observe r_{t+1}, s_{t+1}
            choose a_{t+1} from s_{t+1} using an ε-greedy policy derived from Q
            Q(s_t, a_t) = Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]
            s_t = s_{t+1}, a_t = a_{t+1}
        until s_t is terminal
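
A minimal tabular Q-learning sketch in Python (illustrative; the env interface, the action list and the hyperparameter values are assumptions, not from the slides):

    import random
    from collections import defaultdict

    def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration (sketch).

        Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
        """
        Q = defaultdict(float)                                # Q[(state, action)], initialized to 0

        def eps_greedy(s):
            if random.random() < epsilon:                     # explore
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])      # exploit

        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = eps_greedy(s)
                s_next, r, done = env.step(a)
                # off-policy target: bootstrap with the greedy (max) action in s_next
                target = r + gamma * max(Q[(s_next, b)] for b in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q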

Outline (Section 4): The Acrobot - An Example

The Acrobot: Parts of the Acrobot
- Two links.
- A torque from {-1, 0, +1} is exerted only at the second joint.
- Continuous state variables: the joint angles θ_1, θ_2 and their angular velocities.
- Goal: hit a horizontal line at a distance of max(L_1, L_2) in minimum time.
- r_{t+1} = -1, γ = 1.
Figure: The Acrobot (links of lengths L_1, L_2 with masses M_1, M_2; the torque is applied at the second joint).

The Acrobot: Parts of the Acrobot II
- The angular velocities are limited.
- The state space is a rectangular, confined region of a four-dimensional space.
- The state space is bounded but continuous; tilings define discrete intervals for each dimension (see the sketch below).
Figure: The Acrobot.
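
To make the tiling idea concrete, here is a small Python sketch that maps a continuous 4-dimensional state to a discrete tile index. This is a simplified single-tiling version under assumed state bounds; the actual acrobot experiments use several overlapping tilings.

    def tile_index(state, lows, highs, bins_per_dim=6):
        """Map a continuous state (tuple of 4 floats) to one discrete tile index (sketch).

        lows/highs give assumed bounds per dimension; each dimension is cut into
        bins_per_dim equal intervals, so the 4-D state space becomes a finite grid.
        """
        index = 0
        for x, lo, hi in zip(state, lows, highs):
            frac = min(max((x - lo) / (hi - lo), 0.0), 1.0 - 1e-9)   # clip into [lo, hi)
            index = index * bins_per_dim + int(frac * bins_per_dim)  # mixed-radix encoding
        return index

With several such tilings, each offset slightly against the others, a state activates one tile per tiling and the value estimate is the sum of the corresponding weights.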

The Acrobot - An Algorithm that Solves the Problem
Solution: A TD algorithm similar to Q-learning, called Sarsa(λ), was implemented to solve the problem; the λ indicates that an update affects more than one state. The constants were set to α = 0.2/48 and λ = 0.9. With ε = 0 the algorithm acts greedily with respect to the state-action-value function, which is important since a single exploratory move could spoil a whole sequence of good moves. Exploration was instead ensured by optimistically initializing the values Q(s, a) to 0.
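
A compact sketch of the tabular Sarsa(λ) update with replacing eligibility traces (illustrative only; the slides do not give this code, and the actual experiment used tile coding rather than a plain table):

    from collections import defaultdict

    def sarsa_lambda_episode(env, Q, policy, alpha=0.1, gamma=1.0, lam=0.9):
        """Run one episode of tabular Sarsa(lambda) with replacing traces (sketch).

        Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
        and policy(Q, state) -> action (e.g. greedy over an optimistically initialized Q).
        """
        E = defaultdict(float)                 # eligibility traces
        s = env.reset()
        a = policy(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(Q, s_next) if not done else None
            target = r if done else r + gamma * Q[(s_next, a_next)]
            delta = target - Q[(s, a)]
            E[(s, a)] = 1.0                    # replacing trace for the visited pair
            for key in E:                      # every eligible pair shares in the TD error
                Q[key] += alpha * delta * E[key]
                E[key] *= gamma * lam          # traces decay
            s, a = s_next, a_next
        return Q

It would be called repeatedly with a shared Q, e.g. Q = defaultdict(float), whose default of 0 is optimistic here because every true return is negative.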

Outline
5 Generalizations: Exploration vs. Exploitation; Optimistic Initial Values
6 Conclusion: Back to the Full Problem

Exploration vs. Exploitation
- Exploration: extend knowledge of the model by trying out actions.
- Exploitation: exploit existing knowledge to achieve maximum return.
Solutions:
- ε-greedy policy selection
- following one policy while evaluating another
- optimistic initial values

Optimistic Initial Values
- Initial state or state-action values are set higher than they could ever actually get.
- This guarantees exploration in an early phase: every untried action looks better than it really is until it has been tried.
- Cannot be used in the non-stationary case.
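
A small Python illustration of how optimistic initialization drives early exploration even under a purely greedy policy (the value 0.0 is optimistic here because, as in the acrobot task, every reward is -1; names are illustrative, not from the slides):

    from collections import defaultdict

    # Every true return is negative (reward -1 per step), so initializing Q to 0.0 is
    # optimistic: an untried action always looks at least as good as any tried one,
    # so greedy selection keeps picking untried actions until their estimates drop below 0.
    Q = defaultdict(lambda: 0.0)        # optimistic default for reward-per-step = -1 tasks

    def greedy(state, actions):
        """Pure greedy selection (epsilon = 0); exploration comes from the optimistic defaults."""
        return max(actions, key=lambda a: Q[(state, a)])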

Outline (Section 6): Conclusion - Back to the Full Problem

Conclusion
Disadvantages:
- only one goal
- the Markov property means an abstraction from the environment
- actions come from a discrete space and always take exactly one time step
- in the real world, good initial behavior is often more important than asymptotically optimal behavior
Outlook: some of the previously mentioned disadvantages are addressed by a current research interest: hierarchical multi-agent learning.

FIN