Introduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming

Bellman Equations: recursive relationships among values that can be used to compute values.

The tree of transition dynamics: a path, or trajectory [diagram: state, action, possible path]

The web of transition dynamics: a path, or trajectory [diagram: state, action, possible path]

The web of transition dynamics [backup diagram: state, action, possible path]

The four Bellman-equation backup diagrams, representing recursive relationships among values: state values and action values, for prediction and for control (where control takes a max over actions).

Bellman Equation for a Policy π

The basic idea:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots = R_{t+1} + \gamma\,(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots) = R_{t+1} + \gamma\, G_{t+1}$$

So:
$$v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,] = \mathbb{E}_\pi[\, R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \,]$$

Or, without the expectation operator:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]$$

More on the Bellman Equation

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr]$$

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution. Backup diagrams: one for v_π and one for q_π.

Gridworld

Actions: north, south, east, west; deterministic. If an action would take the agent off the grid: no move, but reward = −1. Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown. State-value function for the equiprobable random policy; γ = 0.9.
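Because the Bellman equation for a fixed policy is a linear system with one equation per state, v_π for this gridworld can be computed exactly with a linear solve. Below is a minimal NumPy sketch, assuming the usual textbook layout for the special states (A in the top row sending the agent to A' with reward +10, B sending it to B' with reward +5), since the slide's figure is not reproduced here.

```python
import numpy as np

# Minimal sketch: exact policy evaluation for the 5x5 gridworld by solving
# the linear system v = R + gamma * P v under the equiprobable random policy.
# The special-state layout (A -> A' reward +10, B -> B' reward +5) is assumed
# from the standard textbook figure, which is not shown in the transcription.
N, gamma = 5, 0.9
A, A_prime = (0, 1), (4, 1)
B, B_prime = (0, 3), (2, 3)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # north, south, west, east

def step(s, a):
    """Return (next_state, reward) for a deterministic action."""
    if s == A:
        return A_prime, 10.0
    if s == B:
        return B_prime, 5.0
    r, c = s[0] + a[0], s[1] + a[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0  # off-grid move: stay put, reward -1

idx = {(r, c): r * N + c for r in range(N) for c in range(N)}
P = np.zeros((N * N, N * N))   # state-transition matrix under the policy
R = np.zeros(N * N)            # expected immediate reward under the policy
for s, i in idx.items():
    for a in actions:
        s2, rew = step(s, a)
        P[i, idx[s2]] += 0.25  # each action chosen with probability 1/4
        R[i] += 0.25 * rew

v = np.linalg.solve(np.eye(N * N) - gamma * P, R)
print(np.round(v.reshape(N, N), 1))  # state values under the equiprobable random policy
```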

Bellman Optimality Equation for q*

$$q_*(s, a) = \mathbb{E}\Bigl[\, R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\Big|\, S_t = s,\, A_t = a \,\Bigr] = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma \max_{a'} q_*(s', a') \,\bigr]$$

The relevant backup diagram: from (s, a), the environment produces r and s'; a max is then taken over the actions a' available in s'. q* is the unique solution of this system of nonlinear equations.
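One way to see the content of this equation is to iterate it as an update rule (Q-value iteration) until the fixed point q* is reached. The sketch below is a minimal tabular version for a generic finite MDP; the `p[s, a, s2]` / `r[s, a, s2]` array layout and the tiny example MDP at the end are illustrative assumptions, not something from the slides.

```python
import numpy as np

def q_value_iteration(p, r, gamma=0.9, theta=1e-8):
    """Iterate the Bellman optimality backup for q* until convergence.

    p[s, a, s2] -- transition probabilities; r[s, a, s2] -- expected rewards.
    (This tabular layout is an assumption for illustration.)
    """
    n_states, n_actions, _ = p.shape
    q = np.zeros((n_states, n_actions))
    while True:
        # q_{k+1}(s,a) = sum_{s'} p(s'|s,a) [ r + gamma * max_{a'} q_k(s', a') ]
        q_new = (p * (r + gamma * q.max(axis=1))).sum(axis=2)
        if np.abs(q_new - q).max() < theta:
            return q_new
        q = q_new

# Usage on a tiny 2-state, 2-action MDP (made-up numbers):
p = np.zeros((2, 2, 2)); r = np.zeros((2, 2, 2))
p[0, 0, 0] = 1.0; p[0, 1, 1] = 1.0; p[1, :, 1] = 1.0
r[0, 1, 1] = 1.0; r[1, :, 1] = 2.0
q_star = q_value_iteration(p, r)
print(q_star, q_star.argmax(axis=1))  # optimal action values and the greedy policy
```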

Dynamic Programming: using Bellman equations to compute values and optimal policies (thus a form of planning).

Iterative Methods

$$v_0 \;\to\; v_1 \;\to\; \cdots \;\to\; v_k \;\to\; v_{k+1} \;\to\; \cdots \;\to\; v_\pi$$

Each arrow denotes a sweep; a sweep consists of applying a backup operation to each state. A full policy-evaluation backup:

$$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_k(s') \,\bigr] \qquad \forall s \in \mathcal{S}$$

A Small Gridworld

An undiscounted episodic task (γ = 1). Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as shaded squares). Actions that would take the agent off the grid leave the state unchanged. Reward is −1 on all transitions until the terminal state is reached.

Iterative Policy Evaluation for the Small Gridworld

π = equiprobable random action choices. As above: an undiscounted episodic task (γ = 1); nonterminal states 1, 2, ..., 14; one terminal state (shown twice, as shaded squares); actions that would take the agent off the grid leave the state unchanged; reward is −1 until the terminal state is reached.

Iterative Policy Evaluation (one-array version)

Input π, the policy to be evaluated
Initialize an array V(s) = 0, for all s ∈ S⁺
Repeat:
    Δ ← 0
    For each s ∈ S:
        v ← V(s)
        V(s) ← Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ V(s') ]
        Δ ← max(Δ, |v − V(s)|)
until Δ < θ (a small positive number)
Output V ≈ v_π
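A minimal Python sketch of this one-array algorithm, applied to the small gridworld just described (equiprobable random policy, reward −1 per step, γ = 1). Treating the two shaded corners of a 4×4 grid as the single terminal state is an assumption based on the standard textbook layout, since the figure is not reproduced.

```python
import numpy as np

# One-array iterative policy evaluation on the 4x4 small gridworld:
# equiprobable random policy, reward -1 per step, gamma = 1.
N, gamma, theta = 4, 1.0, 1e-6
terminal = {(0, 0), (N - 1, N - 1)}               # assumed terminal corners
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # north, south, west, east

V = np.zeros((N, N))
while True:
    delta = 0.0
    for row in range(N):
        for col in range(N):
            if (row, col) in terminal:
                continue
            old = V[row, col]
            new = 0.0
            for dr, dc in actions:
                r2, c2 = row + dr, col + dc
                if not (0 <= r2 < N and 0 <= c2 < N):
                    r2, c2 = row, col             # off-grid actions leave the state unchanged
                new += 0.25 * (-1.0 + gamma * V[r2, c2])
            V[row, col] = new                     # in-place update: the "one array" version
            delta = max(delta, abs(old - new))
    if delta < theta:
        break

print(np.round(V, 1))  # converges to the random-policy values (0, -14, -20, -22, ...)
```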

Gambler's Problem

The gambler can repeatedly bet $ on a coin flip: heads he wins his stake, tails he loses it. Initial capital ∈ {$1, $2, ..., $99}. The gambler wins if his capital reaches $100 and loses if it reaches $0. The coin is unfair: heads (gambler wins) with probability p = 0.4. What are the states, actions, and rewards?

Gambler's Problem Solution [figures not reproduced in the transcription]
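The gambler's problem is small enough to solve directly with value iteration over the capital levels 1–99. The sketch below assumes p_h = 0.4 and stakes in {1, ..., min(s, 100 − s)}; note that the greedy policy extracted at the end depends on how ties among equally good stakes are broken, which is one reason plotted policies for this problem can look different.

```python
import numpy as np

# Value iteration for the gambler's problem: states are capital levels 1..99.
# A stake of a from capital s wins with probability p_h (new capital s + a)
# and loses otherwise (new capital s - a). The only nonzero reward is +1 for
# reaching capital 100; the task is undiscounted.
p_h, theta = 0.4, 1e-10
V = np.zeros(101)  # V[0] and V[100] are terminal and stay at 0

def action_returns(s, V):
    """Expected return for each stake 1..min(s, 100 - s) from capital s."""
    returns = []
    for a in range(1, min(s, 100 - s) + 1):
        win = p_h * ((1.0 if s + a == 100 else 0.0) + V[s + a])
        lose = (1.0 - p_h) * V[s - a]
        returns.append(win + lose)
    return returns

while True:
    delta = 0.0
    for s in range(1, 100):
        old = V[s]
        V[s] = max(action_returns(s, V))
        delta = max(delta, abs(old - V[s]))
    if delta < theta:
        break

# Greedy stakes; ties among equally good stakes are broken toward the smallest.
policy = [1 + int(np.argmax(action_returns(s, V))) for s in range(1, 100)]
print(np.round(V[1:100], 3))
print(policy)
```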

Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity. Evaluation drives v toward v_π; improvement makes π greedy with respect to v. A geometric metaphor for the convergence of GPI: starting from some v_0 and π_0, the two processes pull toward the lines v = v_π and π = greedy(v), and their joint fixed point is v_* and π_*.
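Policy iteration is the most clear-cut instance of GPI: run policy evaluation to completion, then improve the policy greedily, and repeat until the policy stops changing. A minimal tabular sketch, reusing the assumed `p[s, a, s2]` / `r[s, a, s2]` arrays from the Q-value-iteration example above.

```python
import numpy as np

def policy_iteration(p, r, gamma=0.9):
    """Alternate exact policy evaluation and greedy improvement (tabular GPI).

    p[s, a, s2], r[s, a, s2]: assumed tabular dynamics arrays, as above.
    """
    n_states, n_actions, _ = p.shape
    pi = np.zeros(n_states, dtype=int)            # start from an arbitrary policy
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v exactly.
        P_pi = p[np.arange(n_states), pi]                                     # (s, s2)
        r_pi = (p[np.arange(n_states), pi] * r[np.arange(n_states), pi]).sum(axis=1)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = (p * (r + gamma * v)).sum(axis=2)                                 # (s, a)
        new_pi = q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, v                                                      # policy is stable
        pi = new_pi
```

Value iteration can be seen as the extreme case of GPI in which evaluation is truncated to a single sweep before each improvement.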

Jack's Car Rental

$10 for each car rented (the car must be available when the request is received). Two locations, maximum of 20 cars at each. Cars are returned and requested randomly, following a Poisson distribution: n returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$. 1st location: average requests = 3, average returns = 3. 2nd location: average requests = 4, average returns = 2. Up to 5 cars can be moved between locations overnight. What are the states, actions, and rewards? The transition probabilities?

Jack's Car Rental [figure: the sequence of policies π₀, π₁, π₂, π₃, π₄ found by policy iteration, and the final value function v₄, plotted against #Cars at first location and #Cars at second location]
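Building the transition probabilities for this problem mostly comes down to combining Poisson probabilities for requests and returns at each location. Below is a minimal sketch of just that piece; the helper names and the choice to lump the tail probability onto the 20-car cap are illustrative assumptions, not the slides' specification.

```python
from math import exp, factorial

MAX_CARS = 20

def poisson(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) * exp(-lam) / factorial(n)

def capped_poisson(lam, cap):
    """Probabilities for 0..cap, with the tail P(N > cap) lumped onto cap
    (an illustrative simplification: no more than `cap` cars matter here)."""
    probs = [poisson(n, lam) for n in range(cap)]
    probs.append(1.0 - sum(probs))
    return probs

# Request/return rates from the slide: location 1 has averages (3, 3),
# location 2 has averages (4, 2).
requests_1, returns_1 = capped_poisson(3, MAX_CARS), capped_poisson(3, MAX_CARS)
requests_2, returns_2 = capped_poisson(4, MAX_CARS), capped_poisson(2, MAX_CARS)
print(sum(requests_1))   # sanity check: each distribution sums to 1
print(requests_2[:5])    # P(0..4 rental requests at the second location)
```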

Solving MDPs with Dynamic Programming

Finding an optimal policy by solving the Bellman optimality equation requires the following:
- accurate knowledge of the environment dynamics;
- enough space and time to do the computation;
- the Markov property.

How much space and time do we need?
- polynomial in the number of states (via dynamic programming methods; Chapter 4),
- BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman optimality equation.

Efficiency of DP

- Finding an optimal policy is polynomial in the number of states...
- BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables.
- We need to use approximation, but unfortunately classical DP is not sound with approximation (more on this later).
- In practice, classical DP can be applied to problems with up to a few million states.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.
- The biggest limitation of DP is that it requires a probability model (as opposed to a generative or simulation model).

Unified View [figure: a space of methods organized by the width and the height (depth) of their backups: temporal-difference learning, Dyna, eligibility traces, Monte Carlo, MCTS, dynamic programming, and exhaustive search]