Q-Learning (Watkins & Dayan, 1992)
Christopher Watkins and Peter Dayan
Presented by Noga Zaslavsky, The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679), November 1, 2015

Outline
1. What is reinforcement learning? (Modeling the problem; Bellman optimality)
2. Q-Learning (Algorithm; Convergence analysis)
3. Assumptions and Limitations
4. Summary

Reinforcement Learning: Agent-Environment Interaction (What is reinforcement learning?)
- Reinforcement learning is learning how to behave in order to maximize value or achieve some goal.
- It is not supervised learning, and it is not unsupervised learning.
- There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience.
- The learning agent receives feedback from its environment, so it can learn by trying different actions in different situations.
- There is a tradeoff between exploration and exploitation.
[Diagram: the agent-environment interaction loop - the agent acts on the environment and receives states and rewards in return.]

Toy Problem (What is reinforcement learning?)
[Diagram of a toy grid-world problem with states s and a final state f.]

Markov Decision Process (MDP) (Modeling the problem)
- $s \in \mathcal{S}$ - the state of the system
- $a \in \mathcal{A}$ - the agent's action
- $p(s' \mid s, a)$ - the dynamics of the system
- $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ - the reward function (possibly stochastic)

Definition: an MDP is the tuple $\langle \mathcal{S}, \mathcal{A}, p, r \rangle$.

- $\pi(a \mid s)$ - the agent's policy

[Graphical model: states $S_{t-1}, S_t, S_{t+1}, \ldots$ evolve according to $p$, actions $A_t$ are drawn from $\pi$, and rewards $R_t$ are emitted at each step.]
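To make the notation concrete, a small tabular MDP can be stored directly as transition and reward arrays. This is a minimal illustrative sketch; the particular states, actions, and numbers are made-up assumptions, not from the slides or the paper.

```python
import numpy as np

# A hypothetical 3-state, 2-action MDP.
# P[a, s, s'] = p(s' | s, a) and R[s, a] = r(s, a).
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0, from states 0, 1, 2
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0],   # action 1, from states 0, 1, 2
     [0.0, 0.1, 0.9],
     [0.0, 0.0, 1.0]],
])
R = np.array([
    [0.0, 0.0],   # reward for (state 0, action 0/1)
    [0.0, 0.0],
    [1.0, 1.0],   # state 2 is rewarding
])
gamma = 0.9  # discounting factor

assert np.allclose(P.sum(axis=2), 1.0)  # each p(. | s, a) is a probability distribution
```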

The Value Function (Modeling the problem)

The un-discounted value function for any given policy $\pi$:
$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{T} r(s_t, a_t) \,\middle|\, s_0 = s\right]$$
where $T$ is the horizon - it can be finite, episodic, or infinite.

The discounted value function for any given policy $\pi$:
$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right]$$
where $0 < \gamma < 1$ is the discounting factor.

The goal: find a policy $\pi^*$ that maximizes the value, i.e. for all $s \in \mathcal{S}$:
$$V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s)$$
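As a sanity check on the definition, the discounted value of a fixed policy can be estimated by averaging sampled returns. A minimal sketch, assuming a random tabular MDP in the P[a, s, s'] / R[s, a] array convention above and a uniform-random policy (both illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))  # P[a, s] = p(. | s, a)
R = rng.random((nS, nA))                       # R[s, a] = r(s, a)

def mc_value(s0, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s0) for the uniform-random policy pi."""
    returns = []
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite discounted sum
            a = int(rng.integers(nA))         # pi(a | s): uniform over actions
            g += discount * R[s, a]
            discount *= gamma
            s = rng.choice(nS, p=P[a, s])     # s' ~ p(. | s, a)
        returns.append(g)
    return float(np.mean(returns))

print(mc_value(0))
```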

The Bellman Equation (Bellman optimality)

The value function can be written recursively:
$$V^\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s\right] = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\!\left[ r(s, a) + \gamma V^\pi(s') \right]$$

The optimal value satisfies the Bellman equation:
$$V^*(s) = \max_\pi \, \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\!\left[ r(s, a) + \gamma V^*(s') \right]$$

$\pi^*$ is not necessarily unique, but $V^*$ is unique.
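Because $V^*$ is the fixed point of this equation, it can be computed by repeatedly applying the right-hand side; this is value iteration. A minimal sketch under the same tabular-array assumptions as the earlier snippet:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ] to a fixed point."""
    nA, nS, _ = P.shape
    V = np.zeros(nS)
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)  # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```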

The Q-Function (Bellman optimality)

The value of taking action $a$ at state $s$:
$$Q^*(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\!\left[ V^*(s') \right]$$

If we know $V^*$, then an optimal policy is to act deterministically:
$$a^*(s) = \arg\max_a Q^*(s, a)$$

This means that
$$V^*(s) = \max_a Q^*(s, a)$$

Conclusion: learning $Q^*$ makes it easy to obtain an optimal policy.
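Given $V^*$ and the model, both $Q^*$ and a greedy optimal policy follow in one line each. A small sketch, again assuming the illustrative P[a, s, s'] / R[s, a] array convention:

```python
import numpy as np

def q_from_v(P, R, V, gamma=0.9):
    """Q*(s, a) = r(s, a) + gamma * E_{s' ~ p(.|s,a)}[ V*(s') ]."""
    return R + gamma * np.einsum('asn,n->sa', P, V)

def greedy_policy(Q):
    """a*(s) = argmax_a Q*(s, a); note that V*(s) = Q[s].max()."""
    return Q.argmax(axis=1)
```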

Learning Q* (Q-Learning: Algorithm)
- If the agent knows the dynamics $p$ and the reward function $r$, it can find $Q^*$ by dynamic programming (e.g. value iteration).
- Otherwise, it needs to estimate $Q^*$ from its experience.
- The experience of an agent is a sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots$
- The $n$-th episode is $(s_n, a_n, r_n, s_{n+1})$.

Learning Q* (Q-Learning: Algorithm)

Q-Learning:
Initialize $Q_0(s, a)$ for all $a \in \mathcal{A}$ and $s \in \mathcal{S}$.
For each episode $n$:
- observe the current state $s_n$
- select and execute an action $a_n$
- observe the next state $s_{n+1}$ and the reward $r_n$
- update
$$Q_n(s_n, a_n) \leftarrow (1 - \alpha_n)\, Q_{n-1}(s_n, a_n) + \alpha_n \left( r_n + \gamma \max_a Q_{n-1}(s_{n+1}, a) \right)$$

$\alpha_n \in (0, 1)$ is the learning rate.
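A minimal tabular implementation of this update might look as follows. The environment interface (reset() returning a state, step(a) returning next state, reward, and a termination flag) and the constant learning rate are illustrative assumptions, not part of the original algorithm statement:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_steps=100_000,
               gamma=0.9, alpha=0.1, epsilon=0.1, seed=0):
    """Tabular Q-learning with a constant learning rate and epsilon-greedy exploration."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection (see the exploration slide below)
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = env.step(a)
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = env.reset() if done else s_next
    return Q
```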

Exploration vs. Exploitation (Q-Learning: Algorithm)
- How should the agent gain experience in order to learn?
- If it explores too much, it might suffer from a low value.
- If it exploits the learned Q function too early, it might get stuck at bad states without even knowing about better possibilities (overfitting).
- One common approach is to use an ε-greedy policy, sketched below.
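A minimal sketch of ε-greedy action selection, with a linearly decaying ε so the agent explores more early on and exploits more later; the particular schedule is an illustrative choice, not something prescribed by the paper:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon take a random action, otherwise act greedily w.r.t. Q."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())

def epsilon_schedule(n, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start down to eps_end over decay_steps updates."""
    frac = min(n / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```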

Convergence of the Q function (Convergence analysis)

Theorem. Given bounded rewards $|r_n| \le R$ and learning rates $0 \le \alpha_n \le 1$ such that
$$\sum_{n=1}^{\infty} \alpha_n = \infty \quad \text{and} \quad \sum_{n=1}^{\infty} \alpha_n^2 < \infty,$$
then with probability 1
$$Q_n(s, a) \xrightarrow{\; n \to \infty \;} Q^*(s, a).$$

The exploration policy should be such that each state-action pair is encountered infinitely many times.
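Learning rates of the form $\alpha_n = n^{-\omega}$ with $1/2 < \omega \le 1$, where $n$ counts the visits to the particular state-action pair, satisfy both conditions. A small sketch of such a schedule (the bookkeeping details are illustrative assumptions):

```python
import numpy as np

def alpha(visit_count, omega=0.8):
    """alpha_n = n^(-omega); for 1/2 < omega <= 1, sum_n alpha_n diverges
    while sum_n alpha_n^2 converges, as required by the theorem."""
    return visit_count ** (-omega)

visits = np.zeros((5, 2), dtype=int)  # hypothetical 5 states, 2 actions

def next_alpha(s, a):
    """Keep one counter per (s, a) pair, so the conditions hold for every pair."""
    visits[s, a] += 1
    return alpha(visits[s, a])
```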

Convergence of the Q function (Convergence analysis)

Proof - main idea:
Define the operator
$$B[Q]_{s,a} = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')$$
$B$ is a contraction under the sup norm:
$$\| B[Q_1] - B[Q_2] \|_\infty \le \gamma \, \| Q_1 - Q_2 \|_\infty$$
It is easy to see that $B[Q^*] = Q^*$.
We are not done yet, because the updates come from a stochastic process.
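The contraction property can also be checked numerically: applying $B$ to two arbitrary Q tables shrinks their sup-norm distance by at least a factor of $\gamma$. A quick sketch, reusing the random tabular MDP convention from the earlier snippets:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # p(s' | s, a)
R = rng.random((nS, nA))                        # r(s, a)

def B(Q):
    """B[Q](s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) * max_a' Q(s',a')."""
    return R + gamma * np.einsum('asn,n->sa', P, Q.max(axis=1))

Q1 = rng.normal(size=(nS, nA))
Q2 = rng.normal(size=(nS, nA))
lhs = np.max(np.abs(B(Q1) - B(Q2)))     # ||B[Q1] - B[Q2]||_inf
rhs = gamma * np.max(np.abs(Q1 - Q2))   # gamma * ||Q1 - Q2||_inf
assert lhs <= rhs + 1e-12
```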

Assumptions and Limitations

What assumptions did we make?
- Q-learning is model free.
- The state and reward are assumed to be fully observable (a POMDP is a much harder problem).
- The Q function can be represented by a look-up table; otherwise we need to do something else, such as function approximation - and this is where deep learning comes in! (See the sketch below.)
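When a look-up table is infeasible, one standard alternative (going beyond the tabular setting of the paper) is to approximate Q with a parametric function and keep the same temporal-difference target. A hedged sketch of semi-gradient Q-learning with a linear approximator $q(s, a; w) = w^\top \phi(s, a)$; the feature map phi is a placeholder the user would have to supply:

```python
import numpy as np

def linear_q_update(w, phi, s, a, r, s_next, actions,
                    alpha=0.01, gamma=0.9, done=False):
    """One semi-gradient Q-learning step for q(s, a; w) = w . phi(s, a).
    phi(s, a) must return a feature vector with the same shape as w."""
    q_sa = w @ phi(s, a)
    q_next = 0.0 if done else max(w @ phi(s_next, b) for b in actions)
    td_error = (r + gamma * q_next) - q_sa
    # The gradient of q(s, a; w) with respect to w is phi(s, a).
    return w + alpha * td_error * phi(s, a)
```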

Summary
- The basic reinforcement learning problem can be modeled as an MDP.
- If the model is known, we can solve it exactly by dynamic programming.
- Q-learning allows learning an optimal policy when the model is not known, and without trying to learn the model.
- It converges asymptotically to the optimal Q if the learning rate decays neither too fast nor too slow, and if the exploration policy visits every state-action pair infinitely often.

Further Reading

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.