Temporal Difference Learning & Policy Iteration


Temporal Difference Learning & Policy Iteration
Advanced Topics in Reinforcement Learning Seminar, WS 15/16
by Tobias Joppen, 03.11.2015
Fachbereich Informatik, Knowledge Engineering Group, Prof. J. Fürnkranz

Overview
- Introduction: Reinforcement Learning Model, Learning Procedure, Markov Property, Value Function
- Policy Iteration
- Temporal Difference Learning: Idea, Update Rule, Application

Introduction (Reinforcement Learning)
- Reinforcement learning is a branch of machine learning: the agent learns from reinforcements (good/bad moves).
- Wanted: a policy that says how to behave in different situations.
- Mostly sequential problems: state, action, state, action, ..., action, final state.
- Delayed rewards (e.g. ±0, ±0, +1 in the running example).

Reinforcement-Learning Model
An agent is connected to its environment via perception and action. The formal model consists of:
- a discrete set of states S,
- a discrete set of actions A,
- a set of scalar reinforcement signals (rewards).
(Leslie Pack Kaelbling)
Main goal: find a policy π: S → A, mapping states to actions, that maximizes some long-run measure of reinforcement. (π(s) is the chosen action in state s.)

Main procedure
1. The agent is in state s_1 and can choose from actions {a_1, ..., a_n}.
2. The agent chooses action a_i.
3. The agent receives a reinforcement (reward) r_1 and is given its new state s_2 and a new set of possible actions.
4. This loops until a given final state is reached or the procedure is canceled.
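As an illustration of this interaction loop, here is a minimal Python sketch. The `env` object with `reset()` and `step(action)` returning `(next_state, reward, done)` is a hypothetical, Gym-style interface assumed only for this example; it is not part of the original slides.

```python
def run_episode(env, policy, max_steps=1000):
    """Run one episode: observe the state, act, receive reward and next state."""
    s = env.reset()                      # agent starts in s_1
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)                    # agent chooses an action
        s, r, done = env.step(a)         # environment returns reward and next state
        total_reward += r
        if done:                         # a final state was reached
            break
    return total_reward
```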

Markov Decision Process
- In general the environment is non-deterministic: the same action in the same state can lead to different reinforcements and successor states.
- But the probability of each outcome is fixed (static environment).
- The Markov property: the outcome of an action in a given state depends on nothing but the state/action pair (it does not depend on earlier actions or states).

Markov Decision Process
- State transition function T: S × A → Π(S).
- Members of Π(S) are probability distributions over the set S: for each state there is a probability that the pair (s, a) leads into it.
- We write T(s, a, s') for the probability of making a transition from state s to state s' using action a.
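One possible way to represent T(s, a, s') in code is a dictionary mapping each (state, action) pair to a distribution over successor states, as sketched below. The states, actions, and probabilities are made up purely for illustration.

```python
import random

# Hypothetical example: map each (s, a) pair to a distribution over successors.
T = {
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s2", "right"): {"s3": 1.0},
}

def prob(T, s, a, s2):
    """T(s, a, s') as written on the slide: probability of reaching s2 from s via a."""
    return T.get((s, a), {}).get(s2, 0.0)

def sample_next_state(T, s, a):
    """Draw a successor state according to the distribution in Pi(S)."""
    successors, weights = zip(*T[(s, a)].items())
    return random.choices(successors, weights=weights, k=1)[0]
```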

Measuring performance
- The learner needs information about the quality of a state.
- We want to find an optimal policy π*.
- A value function V^π represents the expected future reward for the current state if the agent follows policy π:

  V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s')
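To make this equation concrete, here is a minimal iterative policy-evaluation sketch in Python. The dictionary-based MDP representation (`states`, `T`, `R`), the `policy` mapping, and the tolerance `theta` are assumptions introduced for this example, not something the slides specify.

```python
def evaluate_policy(states, T, R, policy, gamma=0.9, theta=1e-8):
    """Iterate V(s) <- R(s, pi(s)) + gamma * sum_s' T(s, pi(s), s') * V(s') to a fixed point.

    T[(s, a, s2)] is the transition probability, R[(s, a)] the expected reward.
    """
    V = {s: 0.0 for s in states}              # start from an arbitrary estimate
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = R[(s, a)] + gamma * sum(
                T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                     # values changed less than the tolerance
            return V
```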

Optimality

  V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s')

- The optimal policy is the policy with the highest value for all states s:

  π* = arg max_π V^π(s)

  [Example grid: optimal value function V*(s) with values 1/16, 1/8, 1/4, 1/2 and 1/8, 1/4, 1/2, 1.]

- The optimal value function V*(s) determines the optimal policy:

  π*(s) = arg max_a [R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V*(s')]
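A sketch of extracting this greedy (optimal) policy from a value function, under the same hypothetical dictionary-based MDP representation as above:

```python
def greedy_policy(states, actions, T, R, V, gamma=0.9):
    """For each state pick arg max_a [R(s, a) + gamma * sum_s' T(s, a, s') * V(s')]."""
    policy = {}
    for s in states:
        best_a, best_q = None, float("-inf")
        for a in actions:
            q = R[(s, a)] + gamma * sum(
                T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
            if q > best_q:
                best_a, best_q = a, q
        policy[s] = best_a
    return policy
```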

Policy Improvement
The task: improve a policy π so that it performs better, i.e. so that it has a larger cumulative expected reward afterwards (for each state s).
If

  V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s')
         < R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V^π(s')

then it is better to choose action a instead of π(s) in the current state s. So one can improve the policy by choosing action a instead of π(s) in state s.

Policy Iteration
If this improvement is applied repeatedly to all states, the procedure is called policy iteration. One iteration consists of:
1. Evaluating the value function of the current policy π for each state s:

   V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s')

2. Improving the policy at each state s ∈ S.

  π_0 --E--> V^{π_0} --I--> π_1 --E--> V^{π_1} --I--> π_2 --E--> ... --I--> π* --E--> V*
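A minimal sketch of this evaluate/improve loop, reusing the hypothetical `evaluate_policy` and `greedy_policy` functions from the sketches above; stopping when the policy no longer changes is the usual criterion, not something the slide states.

```python
def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate evaluation (E) and greedy improvement (I) until the policy is stable."""
    policy = {s: actions[0] for s in states}                         # arbitrary initial policy
    while True:
        V = evaluate_policy(states, T, R, policy, gamma)             # E step
        new_policy = greedy_policy(states, actions, T, R, V, gamma)  # I step
        if new_policy == policy:                                     # no state changed its action
            return policy, V
        policy = new_policy
```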

Policy Iteration
Why does it work (and why does it take multiple iterations)? After one iteration, at least the part of the policy shown in the example grid is already optimal.
This works, but:
- it needs many policy evaluations (what is V^π(s)?),
- it needs a model of the environment (R and T must be known to compute V^π(s)),
- trajectories may be long (expensive computation, high memory usage).

Can we learn V^π(s) without knowing R and T? Yes!

How to learn V^π(s) without knowing R and T?
This is where temporal difference learning comes in: its idea and update rule are covered next.

Temporal difference learning
- Following a policy π gives experience; use this experience to update the estimate V of V^π.
- Begin with an incorrect estimate V (all zeros in the example grid); the true values V^π are unknown, otherwise there would be nothing left to learn.
- Learn the correct value function while following a good policy; in the example grid the estimate should converge to the true values 1/16, 1/8, 1/4, 1/2 and 1/8, 1/4, 1/2, 1.

Temporal difference learning
Recall the value function:

  V^π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V^π(s')

For deterministic environments the value function is simpler. Since T(s, π(s), s') is 1 for the unique successor s' = δ(s, π(s)) and 0 otherwise, in a deterministic environment it becomes:

  V^π(s) = R(s, π(s)) + γ V^π(δ(s, π(s)))

where δ: S × A → S is the transition function.
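In the deterministic case this recursion can be unrolled along the single trajectory the policy generates. The sketch below truncates the unrolling at a finite horizon; the dictionaries `delta` and `R` and the horizon are assumptions made for the example.

```python
def value_deterministic(s, policy, R, delta, gamma=0.9, horizon=100):
    """Unroll V^pi(s) = R(s, pi(s)) + gamma * V^pi(delta(s, pi(s))) along the trajectory.

    delta[(s, a)] is the deterministic successor state; the horizon truncates the sum.
    """
    v, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        v += discount * R[(s, a)]     # discounted reward collected at this step
        discount *= gamma
        s = delta[(s, a)]             # move to the unique successor
    return v
```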

Temporal difference learning
Example (non-deterministic environment):
- The agent is in state s at time step t (call it s_t); V(s_t) is the expected future reward.
- The agent performs action a = π(s_t), which leads to state s_{t+1} with reinforcement R(s, a).
- Now we have a new expected future reward: V(s_{t+1}).
- Is V(s_t) = R(s, a) + γ V(s_{t+1})? For the true value function it holds in expectation: V^π(s_t) = R(s, a) + γ E[V^π(s_{t+1})].

Temporal difference learning
- In general, V(s_t) = R(s, a) + γ V(s_{t+1}) does not hold, because the estimated expectations are not yet correct.
- Is R(s, a) + γ V(s_{t+1}) a better value for V(s_t)? Should the old value simply be replaced by the new one?
- No!
  - The term V(s_{t+1}) is just as wrong as V(s_t) was.
  - The learner would forget everything it had learned before about V(s_t).
- But R(s, a) + γ V(s_{t+1}) has a bit of truth in it:
  - R(s, a) is the correct reinforcement value.
  - V(s_{t+1}) hopefully converges to V^π(s_{t+1}).

TD Update Rule
- Both V(s_t) and R(s, a) + γ V(s_{t+1}) have some value:
  - V(s_t) contains the old knowledge,
  - R(s, a) + γ V(s_{t+1}) contains the new experience.
- A proper update for V(s_t) is therefore a mixture of both.
- An often-used update rule, known as TD(0), is:

  V(s_t) ← V(s_t) + α [R(s, a) + γ V(s_{t+1}) − V(s_t)]

- Using TD(0), the true V^π will eventually be learned (given good values for γ and α and running long enough).
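A minimal sketch of the TD(0) update applied along one episode, again assuming the hypothetical Gym-style `env` interface used earlier; treating terminal states as having value 0 is a standard convention, not something the slide specifies.

```python
def td0_episode(env, policy, V, alpha=0.1, gamma=0.9, max_steps=1000):
    """Run one episode under policy pi and update the estimate V after every transition."""
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * V[s_next]   # terminal states count as value 0
        V[s] += alpha * (target - V[s])                 # TD(0): mix old estimate and new experience
        if done:
            break
        s = s_next
    return V
```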

Application
- TD(0) as described only learns V^π, not the best policy π*!
- Main goal: find a policy π: S → A, mapping states to actions, that maximizes some long-run measure of reinforcement.
- To achieve this, one has to:
  - learn values over S × A instead of only S (a Q-function instead of a V-function),
  - not follow a single fixed policy, but select actions using an exploration strategy:
    at the beginning, explore the state space; towards the end, exploit the gathered knowledge.
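The slides do not name a specific algorithm for this; one standard choice that combines a Q-function with an exploration strategy is tabular Q-learning with a decaying ε-greedy policy, sketched below under the same hypothetical `env` interface as before.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9,
               eps_start=1.0, eps_end=0.05):
    """Tabular Q-learning with epsilon-greedy exploration.

    Early on, epsilon is high (explore the state space); it decays linearly
    towards eps_end over the episodes (exploit the gathered knowledge).
    """
    Q = defaultdict(float)                       # Q[(s, a)], defaults to 0
    for ep in range(episodes):
        eps = eps_start + (eps_end - eps_start) * ep / max(1, episodes - 1)
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:            # explore: random action
                a = random.choice(actions)
            else:                                # exploit: greedy action
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```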

The end
Thanks for your attention! Any questions?

References
- Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
- Barto, Andrew G., and Thomas G. Dietterich. "Reinforcement Learning and Its Relationship to Supervised Learning." In Handbook of Learning and Approximate Dynamic Programming, 2004, p. 47.
- Kunz, Florian. "An Introduction to Temporal Difference Learning."
- Sutton, Richard S. "Learning to Predict by the Methods of Temporal Differences." Machine Learning 3.1 (1988): 9-44.
- Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research 4 (1996): 237-285.