Reinforcement Learning

Reinforcement Learning: Temporal Difference Learning
Temporal difference learning, TD prediction, Q-learning, eligibility traces
(many slides from Marc Toussaint)
Vien Ngo, Marc Toussaint, University of Stuttgart

Outline
- Temporal Difference Learning
- Q-learning
- Eligibility Traces

Learning in MDPs

While interacting with the world, the agent collects data of the form
$$D = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^{H}$$
(state, action, immediate reward, next state)

What could we learn from that?
- learn to predict the next state: estimate $P(s' \mid s, a)$
- learn to predict the immediate reward: estimate $P(r \mid s, a)$
- learn to predict the value: estimate $Q(s, a)$ for all $s, a$
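
To make the first two options concrete, here is a minimal sketch of estimating a transition model and mean rewards from the collected tuples by counting. The function name and the assumption of small, integer-indexed state and action spaces are ours, not from the slides.

```python
import numpy as np

def estimate_model(D, n_states, n_actions):
    """Count-based (maximum-likelihood) estimates of P(s' | s, a) and the mean
    immediate reward R(s, a) from data D = [(s, a, r, s_next), ...]."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in D:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)                        # N(s, a)
    P = counts / np.maximum(visits, 1)[:, :, None]     # rows of unvisited (s, a) stay all-zero
    R = reward_sum / np.maximum(visits, 1)
    return P, R
```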

Model-based versus Model-free

Edward Tolman (1886-1959), Clark Hull (1884-1952; Principles of Behavior, 1943), Wolfgang Köhler (1887-1967)

- learn stimulus-response mappings based on reinforcement
- learn facts about the world that can subsequently be used in a flexible manner, rather than simply learning automatic responses

Let's introduce basic model-free methods first.

Monte-Carlo method (figure from Introduction to RL, Sutton & Barto 1998)

Temporal difference (TD) learning

Recall
$$V^\pi(s) = R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V^\pi(s')$$

TD learning: given a new experience $(s, a, r, s')$,
$$V_{\text{new}}(s) = (1-\alpha)\, V_{\text{old}}(s) + \alpha\, [r + \gamma V_{\text{old}}(s')] = V_{\text{old}}(s) + \alpha\, [r - V_{\text{old}}(s) + \gamma V_{\text{old}}(s')].$$

Reinforcement:
- more reward than expected ($r > V_{\text{old}}(s) - \gamma V_{\text{old}}(s')$): increase $V(s)$
- less reward than expected ($r < V_{\text{old}}(s) - \gamma V_{\text{old}}(s')$): decrease $V(s)$
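
A minimal sketch of this update for a tabular value function, assuming integer-indexed states; the function name and default parameters are illustrative, not from the slides.

```python
import numpy as np

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update for a single experience (s, a, r, s').
    V is a 1-D NumPy array of state values indexed by integer state."""
    td_error = r + gamma * V[s_next] - V[s]   # positive if more reward than expected
    V[s] += alpha * td_error
    return td_error
```

Applied to a stream of experiences, this is simply `for (s, a, r, s_next) in D: td0_update(V, s, r, s_next)`.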

Temporal difference vs. Monte-Carlo: TD(0) (figure from Introduction to RL, Sutton & Barto 1998)

Q-learning

Recall
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(s', a')$$

Q-learning (Watkins, 1989): given a new experience $(s, a, r, s')$,
$$Q_{\text{new}}(s, a) = (1-\alpha)\, Q_{\text{old}}(s, a) + \alpha\, [r + \gamma \max_{a'} Q_{\text{old}}(s', a')] = Q_{\text{old}}(s, a) + \alpha\, [r - Q_{\text{old}}(s, a) + \gamma \max_{a'} Q_{\text{old}}(s', a')]$$

Reinforcement:
- more reward than expected ($r > Q_{\text{old}}(s, a) - \gamma \max_{a'} Q_{\text{old}}(s', a')$): increase $Q(s, a)$
- less reward than expected ($r < Q_{\text{old}}(s, a) - \gamma \max_{a'} Q_{\text{old}}(s', a')$): decrease $Q(s, a)$
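
A minimal sketch of the tabular update, again assuming integer-indexed states and actions. The ε-greedy helper anticipates the exploration issue mentioned two slides below; it is a common choice, not something the slides prescribe.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update for experience (s, a, r, s').
    Q is a NumPy array of shape (n_states, n_actions)."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

def epsilon_greedy(Q, s, epsilon=0.1):
    """A common behaviour policy for collecting experience: random action with
    probability epsilon, otherwise greedy with respect to the current Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))
```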

Q-learning algorithm (figure from Introduction to RL, Sutton & Barto 1998)

Q-learning convergence with probability 1

Q-learning is a stochastic approximation of Q-Iteration:
- Q-learning: $Q_{\text{new}}(s, a) = (1-\alpha)\, Q_{\text{old}}(s, a) + \alpha [r + \gamma \max_{a'} Q_{\text{old}}(s', a')]$
- Q-Iteration: $\forall s, a: \; Q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(s', a')$

We have shown convergence of Q-Iteration to $Q^*$.

Convergence of Q-learning:
- Q-Iteration is a deterministic update: $Q_{k+1} = T(Q_k)$
- Q-learning is a stochastic version: $Q_{k+1} = (1-\alpha) Q_k + \alpha [T(Q_k) + \eta_k]$
- $\eta_k$ is zero mean!

Q-learning convergence with probability 1

The Q-learning algorithm converges with probability 1 as long as the learning rates satisfy
$$\sum_t \alpha_t(s, a) = \infty \quad \text{and} \quad \sum_t \alpha_t^2(s, a) < \infty$$
(Watkins and Dayan, Q-learning. Machine Learning, 1992)
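
As a concrete example of a schedule satisfying both conditions, a per-state-action rate $\alpha_t(s, a) = 1 / N_t(s, a)$, where $N_t(s, a)$ counts how often $(s, a)$ has been updated, has a diverging sum and a converging sum of squares (harmonic series vs. its squares). A hedged sketch; the class name and count-based bookkeeping are ours.

```python
import numpy as np

class CountBasedLearningRate:
    """alpha_t(s, a) = 1 / N_t(s, a): along the visits to (s, a), the sum of the
    rates diverges while the sum of their squares converges."""
    def __init__(self, n_states, n_actions):
        self.N = np.zeros((n_states, n_actions), dtype=int)

    def __call__(self, s, a):
        self.N[s, a] += 1          # count this update of (s, a)
        return 1.0 / self.N[s, a]
```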

Q-learning impact

- Q-learning was the first provably convergent direct adaptive optimal control algorithm.
- Great impact on the field of Reinforcement Learning:
  - smaller representation than models
  - automatically focuses attention where it is needed, i.e., no sweeps through state space
- It does not, however, solve the exploration-versus-exploitation issue: ε-greedy, optimistic initialization, etc.

Unified View (figure)

Eligibility traces

Temporal Difference: based on a single experience $(s_0, r_0, s_1)$,
$$V_{\text{new}}(s_0) = V_{\text{old}}(s_0) + \alpha [r_0 + \gamma V_{\text{old}}(s_1) - V_{\text{old}}(s_0)]$$

Longer experience sequence, e.g.: $(s_0, r_0, r_1, r_2, s_3)$.
Temporal credit assignment, think further backwards: receiving $r_2$ also tells us something about $V(s_0)$:
$$V_{\text{new}}(s_0) = V_{\text{old}}(s_0) + \alpha [r_0 + \gamma r_1 + \gamma^2 r_2 + \gamma^3 V_{\text{old}}(s_3) - V_{\text{old}}(s_0)]$$

Eligibility trace

The error of the n-step TD update, with the n-step return $R_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V_t(s_{t+n})$:
$$\Delta V_t(s_t) = \alpha\, [R_t^{(n)} - V_t(s_t)]$$

The offline value update up to time $T$:
$$V(s) \leftarrow V(s) + \sum_{t=0}^{T-1} \Delta V_t(s)$$

Error reduction property of the n-step return (with $V_t^{(n)}(s)$ the expected n-step return from $s$):
$$\| V_t^{(n)} - V^\pi \|_\infty \le \gamma^n\, \| V_t - V^\pi \|_\infty$$

TD(λ): Forward View

TD(λ) averages the n-step backups over all n:
$$R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$$
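
For a finite episode the infinite sum truncates, with the remaining weight $\lambda^{T-t-1}$ placed on the full Monte-Carlo return. Below is a hedged sketch of computing n-step returns and the forward-view λ-return from a recorded episode; the reward indexing follows the slides ($r_t$ is received when leaving $s_t$), and all names are ours.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^(n): n rewards from time t plus the bootstrapped value of the state n
    steps ahead. rewards[t] follows state t, values[t] approximates V(s_t), and
    the episode is assumed to terminate after the last reward (terminal value 0)."""
    T = len(rewards)
    n = min(n, T - t)                               # truncate at the episode end
    G = sum(gamma**k * rewards[t + k] for k in range(n))
    if t + n < T:                                   # bootstrap only before the terminal state
        G += gamma**n * values[t + n]
    return G

def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return: (1-lam) sum_n lam^(n-1) R_t^(n), with the tail
    weight lam^(T-t-1) on the full Monte-Carlo return."""
    T = len(rewards)
    G = 0.0
    for n in range(1, T - t):
        G += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    G += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return G
```

For lam = 0 this reduces to the one-step TD target, and for lam = 1 to the Monte-Carlo return, as expected.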

TD(λ): Backward View

TD(λ): remember where you've been recently (the "eligibility trace") and update those values as well:
$$e(s_t) \leftarrow e(s_t) + 1$$
$$\forall s: \; V_{\text{new}}(s) = V_{\text{old}}(s) + \alpha\, e(s)\, [r_t + \gamma V_{\text{old}}(s_{t+1}) - V_{\text{old}}(s_t)]$$
$$\forall s: \; e(s) \leftarrow \gamma \lambda\, e(s)$$
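
A minimal sketch of one step of the backward view for a tabular value function, with accumulating traces as on the slide; the names and the in-place NumPy updates are our choices.

```python
import numpy as np

def td_lambda_step(V, e, s, r, s_next, alpha, gamma, lam):
    """One backward-view TD(lambda) step with an accumulating eligibility trace.
    V and e are 1-D arrays over states; every state is updated in proportion to its trace."""
    e[s] += 1.0                              # accumulate the trace of the visited state
    td_error = r + gamma * V[s_next] - V[s]  # the usual one-step TD error
    V += alpha * td_error * e                # update all states, weighted by their traces
    e *= gamma * lam                         # decay all traces
    return td_error
```

The trace array e is typically reset to zero at the start of each episode.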

TD(λ): Backward vs. Forward View

The two views give equivalent offline updates (see the proof in Section 7.4 of Introduction to RL, Sutton & Barto).

TD-Gammon, by Gerald Tesauro
(See Section 11.1 in Sutton & Barto's book.)
- MLP to represent the value function $V(s)$
- Reward is only given at the end of the game, for a win
- Self-play: use the current policy to sample moves on both sides!
- With random policies, games take up to thousands of steps; skilled players need 50-60 steps
- TD(λ) learning (gradient-based update of the NN weights)

TD-Gammon: input features
- At first, only raw position inputs (number of pieces at each place): as good as previous computer programs
- Using the previous computer programs' expert features: world-class player