Reading Response: Due Wednesday (R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction)


Another Example
Get to the top of the hill as quickly as possible.
reward = -1 for each step where not at top of hill
return = -(number of steps before reaching top of hill)
Return is maximized by minimizing the number of steps to reach the top of the hill.
(Figure: www.cs.lafayette.edu/~taylorm/traj.gif)

The Markov Property
By "the state" at step t, the book means whatever information is available to the agent at step t about its environment.
The state can include immediate sensations, highly processed sensations, and structures built up over time from sequences of sensations.
Ideally, a state should summarize past sensations so as to retain all essential information, i.e., it should have the Markov Property:
$$\Pr\{s_{t+1} = s',\ r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0\} = \Pr\{s_{t+1} = s',\ r_{t+1} = r \mid s_t, a_t\}$$
for all $s'$, $r$, and histories $s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0$.

Markov Decision Processes
If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need to give:
the state and action sets;
one-step dynamics defined by transition probabilities:
$$P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s);$$
and expected rewards:
$$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s).$$
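As a concrete illustration of these definitions, the one-step dynamics of a small finite MDP can be stored as arrays indexed by action, current state, and next state. The numbers below are hypothetical, chosen only for illustration, not taken from the slides:

```python
import numpy as np

# P[a, s, s'] = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
# R[a, s, s'] = E{ r_{t+1}       | s_t = s, a_t = a, s_{t+1} = s' }
n_states, n_actions = 2, 2                     # hypothetical 2-state, 2-action MDP
P = np.zeros((n_actions, n_states, n_states))
R = np.zeros((n_actions, n_states, n_states))

P[0] = [[0.9, 0.1],                            # action 0 mostly keeps the current state
        [0.1, 0.9]]
P[1] = [[0.2, 0.8],                            # action 1 mostly switches it
        [0.8, 0.2]]
R[0] = [[1.0, 0.0],
        [0.0, 1.0]]
R[1] = [[0.0, 5.0],
        [5.0, 0.0]]

assert np.allclose(P.sum(axis=2), 1.0)         # each (a, s) row is a probability distribution
```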

An Example Finite MDP: the Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high or low.
Reward = number of cans collected.

Recycling Robot MDP
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$R^{\text{search}}$ = expected number of cans while searching
$R^{\text{wait}}$ = expected number of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$

Value Functions
The value of a state is the expected return starting from that state; it depends on the agent's policy.
State-value function for policy $\pi$:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$$
The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
Action-value function for policy $\pi$:
$$Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right\}$$
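To make the definitions concrete, here is a minimal sketch (the helper names are mine, not from the slides) of the return $R_t = \sum_{k \ge 0} \gamma^k r_{t+k+1}$ and of a Monte Carlo estimate of $V^\pi(s)$ as an average of sampled returns:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def estimate_value(sample_episode, gamma, n_episodes=1000):
    """Estimate V^pi(s) as the average return over sampled episodes.

    `sample_episode` is assumed to return the reward sequence of one episode
    that starts in the state of interest and follows pi.
    """
    return float(np.mean([discounted_return(sample_episode(), gamma)
                          for _ in range(n_episodes)]))
```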

Bellman Equation for a Policy $\pi$
The basic idea:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\left(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots\right) = r_{t+1} + \gamma R_{t+1}$$
So:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$$
Or, without the expectation operator:
$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$

More on the Bellman Equation
$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^\pi(s')\right]$$
This is a set of equations (in fact, linear), one for each state. The value function for $\pi$ is its unique solution.
Backup diagrams: (figure) one for $V^\pi$, one for $Q^\pi$.
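Because the system is linear, $V^\pi$ can be computed directly for a small finite MDP by solving $(I - \gamma P_\pi)V = R_\pi$. A sketch using the P and R arrays introduced earlier, with helper names that are mine rather than the slides':

```python
import numpy as np

def evaluate_policy_exactly(P, R, pi, gamma):
    """Solve the linear Bellman equations for V^pi.

    P[a, s, s'] and R[a, s, s'] are the one-step dynamics; pi[s, a] = pi(s, a).
    """
    n_states = P.shape[1]
    P_pi = np.einsum('sa,ast->st', pi, P)          # P_pi[s, s'] = sum_a pi(s,a) P^a_{ss'}
    r_pi = np.einsum('sa,ast,ast->s', pi, P, R)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```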

Gridworld
Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = -1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.
(Figure: state-value function for the equiprobable random policy; $\gamma = 0.9$.)
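Here is a minimal sketch of evaluating the equiprobable random policy on this gridworld. The slide's figure is not reproduced above, so the grid details below are assumptions taken from the book's gridworld example (a 5x5 grid; from A every action jumps to A' with reward +10, from B every action jumps to B' with reward +5); adjust them if your figure differs:

```python
import numpy as np

GAMMA = 0.9
SIZE = 5
A, A_PRIME = (0, 1), (4, 1)                     # special state A and its target (book's placement)
B, B_PRIME = (0, 3), (2, 3)                     # special state B and its target
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # north, south, west, east

def step(state, action):
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < SIZE and 0 <= c < SIZE:
        return (r, c), 0.0
    return state, -1.0                           # off the grid: no move, reward -1

V = np.zeros((SIZE, SIZE))
for _ in range(1000):                            # synchronous sweeps until (approximately) converged
    V_new = np.zeros_like(V)
    for r in range(SIZE):
        for c in range(SIZE):
            for a in ACTIONS:                    # equiprobable random policy: each action has prob 1/4
                (nr, nc), reward = step((r, c), a)
                V_new[r, c] += 0.25 * (reward + GAMMA * V[nr, nc])
    V = V_new
print(np.round(V, 1))
```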

Golf
State is the ball's location.
Reward of -1 for each stroke until the ball is in the hole.
Value of a state?
Actions: putt (use putter), driver (use driver).
putt succeeds anywhere on the green.

Optimal Value Functions
For finite MDPs, policies can be partially ordered: $\pi \geq \pi'$ if and only if $V^\pi(s) \geq V^{\pi'}(s)$ for all $s \in S$.
There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all $\pi^*$.
Optimal policies share the same optimal state-value function:
$$V^*(s) = \max_\pi V^\pi(s) \quad \text{for all } s \in S$$
Optimal policies also share the same optimal action-value function:
$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \quad \text{for all } s \in S \text{ and } a \in A(s)$$
This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

Optimal Value Function for Golf
We can hit the ball farther with driver than with putter, but with less accuracy.
$Q^*(s, \text{driver})$ gives the value of using driver first, then using whichever actions are best afterward.

Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)
= \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}
= \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^*(s')\right]$$
The relevant backup diagram: (figure)
$V^*$ is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for V*
$$V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)
= \max_{a \in A(s)} E\{r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a\}
= \max_{a \in A(s)} \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma V^*(s')\right]$$
What is V* for the recycling robot?
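One way to answer this numerically, given the robot's transition table encoded as the P[a, s, s'] and R[a, s, s'] arrays used earlier, is value iteration, which repeatedly applies the max-over-actions backup above until the values stop changing. This is a sketch with my own helper names, not an algorithm given on these slides:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma V(s') ]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)       # V* and a greedy (optimal) policy
        V = V_new
```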

Bellman Optimality Equation for Q*
$$Q^*(s, a) = E\left\{r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\Big|\, s_t = s, a_t = a\right\}
= \sum_{s'} P^a_{ss'}\left[R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')\right]$$
The relevant backup diagram: (figure)
$Q^*$ is the unique solution of this system of nonlinear equations.
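The same fixed-point iteration can be written directly in terms of Q*; again a sketch using the arrays from earlier, not something given on the slides:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8):
    """Iterate Q(s,a) <- sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma max_{a'} Q(s',a') ]."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_actions, n_states))
    while True:
        V = Q.max(axis=0)                        # max_{a'} Q(s', a') for every s'
        Q_new = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```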