Bellman Optimality Equation for V*

Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)
       = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}
       = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]

The relevant backup diagram: [backup diagram for V*]

V* is the unique solution of this system of nonlinear equations.

Bellman Optimality Equation for Q*

Q^*(s, a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}
          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]

The relevant backup diagram: [backup diagram for Q*]

Q* is the unique solution of this system of nonlinear equations.
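
To make the backup concrete, here is a minimal sketch in Python of a single Bellman optimality backup for V*. The tabular arrays P[s, a, s'] (transition probabilities) and R[s, a, s'] (expected rewards), and the function name, are hypothetical illustrations, not something from the text:

```python
import numpy as np

def bellman_optimality_backup(V, P, R, s, gamma=0.9):
    """One Bellman optimality backup for V* at state s.

    Hypothetical tabular model (not from the text):
      P[s, a, s'] -- transition probabilities
      R[s, a, s'] -- expected immediate rewards
      V           -- current value estimates, shape (num_states,)
    Returns max_a sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s']).
    """
    action_values = np.einsum("as,as->a", P[s], R[s] + gamma * V)
    return action_values.max()
```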

Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld: [gridworld figure]

What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)
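
As a tiny illustration (not from the text), acting greedily with respect to a stored Q* table is just an argmax over the actions available in the current state; the array name Q and its shape below are hypothetical:

```python
import numpy as np

def greedy_action(Q, s):
    """Pick the greedy action for state s from a hypothetical table Q of shape
    (num_states, num_actions) holding estimates of Q*(s, a). No lookahead or
    model of the environment is needed."""
    return int(np.argmax(Q[s]))
```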

Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation requires the following: accurate knowledge of environment dynamics; enough space and time to do the computation; the Markov Property.

How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; Chapter 4), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations. Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

Summary

Agent-environment interaction: states, actions, rewards
Policy: stochastic rule for selecting actions
Return: the function of future rewards the agent tries to maximize
Episodic and continuing tasks
Markov Property
Markov Decision Process: transition probabilities, expected rewards
Value functions: state-value function for a policy, action-value function for a policy, optimal state-value function, optimal action-value function
Optimal value functions, optimal policies
Bellman Equations
The need for approximation

Gridworld

Actions: north, south, east, west; deterministic.
If an action would take the agent off the grid: no move, but reward = -1.
Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown. [gridworld figure]

What if all rewards are shifted by a constant?

Chapter 4: Dynamic Programming

Objectives of this chapter:
Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP).
Show how DP can be used to compute value functions, and hence, optimal policies.
Discuss the efficiency and utility of DP.

Policy Evaluation

Policy Evaluation: for a given policy π, compute the state-value function V^π.

Recall: state-value function for policy π:

V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right\}

Bellman equation for V^π:

V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

This is a system of |S| simultaneous linear equations.
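
Because the Bellman equation for V^π is linear, it can be solved directly when the model is small. Below is a minimal sketch (not from the text), assuming hypothetical tabular arrays pi[s, a], P[s, a, s'], and R[s, a, s']:

```python
import numpy as np

def evaluate_policy_exactly(pi, P, R, gamma=0.9):
    """Solve the |S| linear Bellman equations for V^pi directly.

    Hypothetical tabular inputs (illustrative, not the book's notation):
      pi[s, a]    -- policy probabilities
      P[s, a, s'] -- transition probabilities
      R[s, a, s'] -- expected immediate rewards
    """
    # Policy-averaged transition matrix and expected one-step reward.
    P_pi = np.einsum("sa,sax->sx", pi, P)
    r_pi = np.einsum("sa,sax,sax->s", pi, P, R)
    num_states = P.shape[0]
    # V^pi = (I - gamma * P_pi)^{-1} r_pi
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
```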

Iterative Methods

V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi

Each arrow denotes a sweep; a sweep consists of applying a backup operation to each state.

A full policy-evaluation backup:

V_{k+1}(s) \leftarrow \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]

Iterative Policy Evaluation

[Iterative policy evaluation algorithm box from the text.]
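
As a rough stand-in for the algorithm box (a sketch under the same hypothetical pi/P/R array layout as the earlier example, not the book's pseudocode), iterative policy evaluation repeats the full backup over all states until the largest change in a sweep falls below a small threshold theta:

```python
import numpy as np

def iterative_policy_evaluation(pi, P, R, gamma=0.9, theta=1e-8):
    """Sweep-based policy evaluation with hypothetical tabular pi/P/R arrays."""
    V = np.zeros(P.shape[0])
    while True:
        # V_{k+1}(s) = sum_a pi(s,a) sum_{s'} P[s,a,s'] (R[s,a,s'] + gamma V_k(s'))
        V_new = np.einsum("sa,sax,sax->s", pi, P, R + gamma * V)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```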

A Small Gridworld

An undiscounted, episodic task.
Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as shaded squares).
Actions that would take the agent off the grid leave the state unchanged.
Reward is -1 until the terminal state is reached. [gridworld figure]

Iterative Policy Evaluation for the Small Gridworld

π = equiprobable random action choices. [figure]

Policy Improvement

Suppose we have computed V^π for a deterministic policy π.

For a given state s, would it be better to do an action a ≠ π(s)?

The value of doing a in state s is:

Q^\pi(s, a) = E\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}
            = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

It is better to switch to action a for state s if and only if Q^\pi(s, a) > V^\pi(s).

Policy Improvement Cont.

Do this for all states to get a new policy π' that is greedy with respect to V^π:

\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

Then V^{\pi'} \geq V^\pi.
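
A minimal sketch of this greedification step, again assuming the hypothetical tabular P[s, a, s'] and R[s, a, s'] arrays used in the earlier sketches:

```python
import numpy as np

def greedify(V, P, R, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V.

    Computes Q(s,a) = sum_{s'} P[s,a,s'] (R[s,a,s'] + gamma V[s']) for every
    state-action pair and takes the argmax over actions. The P/R layout is a
    hypothetical tabular model, not the book's code.
    """
    Q = np.einsum("sax,sax->sa", P, R + gamma * V)
    return np.argmax(Q, axis=1)  # pi'(s) as one action index per state
```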

Policy Improvement Cont.

What if V^{\pi'} = V^\pi? That is, for all s ∈ S,

V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi'}(s') \right] ?

But this is the Bellman Optimality Equation. So V^{\pi'} = V^* and both π and π' are optimal policies.

Policy Iteration

\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*

Each step from a policy to its value function is policy evaluation; each step from a value function to a new policy is policy improvement ("greedification").

Policy Iteration

[Policy iteration algorithm box from the text.]
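
In place of the algorithm box, here is a compact sketch of the whole loop, alternating exact policy evaluation (solving the linear system) with greedification until the greedy policy stops changing. The tabular P[s, a, s'] / R[s, a, s'] layout and the function name are assumptions for illustration, not the book's pseudocode:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration over a hypothetical tabular MDP (P[s,a,s'], R[s,a,s'])."""
    num_states = P.shape[0]
    policy = np.zeros(num_states, dtype=int)            # one action index per state
    while True:
        # Policy evaluation: solve the linear Bellman equations for the current policy.
        P_pi = P[np.arange(num_states), policy]          # shape (S, S')
        r_pi = np.einsum("sx,sx->s", P_pi, R[np.arange(num_states), policy])
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedify with respect to V.
        Q = np.einsum("sax,sax->sa", P, R + gamma * V)
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```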

Jack's Car Rental

$10 for each car rented (must be available when the request is received).
Two locations, maximum of 20 cars at each.
Cars are returned and requested randomly, following a Poisson distribution: n returns/requests with probability \frac{\lambda^n}{n!} e^{-\lambda}.
1st location: average requests = 3, average returns = 3.
2nd location: average requests = 4, average returns = 2.
Can move up to 5 cars between locations overnight.

States, actions, rewards? Transition probabilities?
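
The only probabilistic ingredient needed to build the transition probabilities is the Poisson formula quoted above; here is a one-line helper (an illustrative sketch with hypothetical names, not from the text):

```python
from math import exp, factorial

def poisson_prob(n, lam):
    """P(N = n) for a Poisson random variable with mean lam: lam^n / n! * e^(-lam)."""
    return lam ** n / factorial(n) * exp(-lam)

# e.g., probability of exactly 3 rental requests at the first location (mean 3):
# poisson_prob(3, 3)  ->  about 0.224
```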

Jack's Car Rental

[Figure: policies and value function found by policy iteration for Jack's Car Rental.]

Jack's CR Exercise

Suppose the first car moved is free, from the 1st to the 2nd location, because an employee travels that way anyway (by bus).
Suppose only 10 cars can be parked for free at each location; more than 10 cost $4 for using an extra parking lot.
Such arbitrary nonlinearities are common in real problems.

Value Iteration

Recall the full policy-evaluation backup:

V_{k+1}(s) \leftarrow \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]

Here is the full value-iteration backup:

V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]
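
A minimal sketch of value iteration with the same hypothetical tabular P[s, a, s'] / R[s, a, s'] arrays as before; it applies the max-backup until the value function stops changing and then reads off a greedy policy:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration over a hypothetical tabular MDP (P[s,a,s'], R[s,a,s']).

    Returns the converged V and the greedy policy extracted from it.
    """
    V = np.zeros(P.shape[0])
    while True:
        # V_{k+1}(s) = max_a sum_{s'} P[s,a,s'] (R[s,a,s'] + gamma V_k(s'))
        Q = np.einsum("sax,sax->sa", P, R + gamma * V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, np.argmax(Q, axis=1)
        V = V_new
```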

Value Iteration Cont.

[Value iteration algorithm box from the text.]