Chapter 4: Dynamic Programming

Objectives of this chapter:
- Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
- Show how DP can be used to compute value functions, and hence, optimal policies
- Discuss efficiency and utility of DP

Policy Evaluation

Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.

Recall the state-value function for policy $\pi$:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$

Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

This is a system of $|S|$ simultaneous linear equations.
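Because the Bellman equation for $V^\pi$ is a linear system, it can be solved directly. Below is a minimal NumPy sketch, assuming tabular arrays P[a, s, s'] and R[a, s, s'] for the dynamics and pi[s, a] for the policy; these names and array conventions are illustrative, not taken from the slides.

```python
import numpy as np

def evaluate_policy_exact(P, R, pi, gamma):
    """Solve V^pi = R_pi + gamma * P_pi * V^pi as a linear system.

    P[a, s, s'] : transition probabilities P^a_{ss'}
    R[a, s, s'] : expected rewards R^a_{ss'}
    pi[s, a]    : probability of taking action a in state s
    """
    n_states = P.shape[1]
    # Policy-weighted transition matrix and expected one-step reward.
    P_pi = np.einsum('sa,ast->st', pi, P)        # P_pi[s, s'] = sum_a pi(s,a) P^a_{ss'}
    R_pi = np.einsum('sa,ast,ast->s', pi, P, R)  # R_pi[s] = sum_a pi(s,a) sum_s' P^a_{ss'} R^a_{ss'}
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```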

Iterative Methods

$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$

Each arrow is a "sweep": a sweep consists of applying a backup operation to each state.

A full policy-evaluation backup:

$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Iterative Policy Evaluation
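A minimal Python sketch of this iterative backup, using the same assumed array conventions as above. The update is done in place over states, one common variant of the algorithm.

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Iterative policy evaluation: sweep the full backup until the largest
    change in V over a sweep falls below theta.

    Shapes (assumed): P[a, s, s'], R[a, s, s'], pi[s, a].
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # V_{k+1}(s) = sum_a pi(s,a) sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V_k(s')]
            V[s] = np.sum(pi[s] * np.sum(P[:, s, :] * (R[:, s, :] + gamma * V), axis=1))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            return V
```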

A Small Gridworld

- An undiscounted episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is -1 until the terminal state is reached

Iterative Policy Evaluation for the Small Gridworld

$\pi$ = equiprobable random action choices
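A compact sketch of this example, assuming the 4x4 layout with cells numbered 0..15 row by row and cells 0 and 15 both standing for the single terminal state (the numbering is an assumption about the figure).

```python
import numpy as np

# 4x4 gridworld: states 0..15; cells 0 and 15 are the single terminal state.
# Actions that would take the agent off the grid leave the state unchanged.
# Reward is -1 on every transition until the terminal state is reached.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def next_state(s, a):
    if s in (0, 15):
        return s
    r, c = divmod(s, 4)
    dr, dc = ACTIONS[a]
    nr, nc = r + dr, c + dc
    return nr * 4 + nc if 0 <= nr < 4 and 0 <= nc < 4 else s

V = np.zeros(16)
while True:
    delta = 0.0
    for s in range(1, 15):  # nonterminal states only; gamma = 1 (undiscounted)
        v = sum(0.25 * (-1 + V[next_state(s, a)]) for a in range(4))
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-6:
        break

print(V.reshape(4, 4))  # converges to the values shown in the book's figure
```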

Policy Improvement

Suppose we have computed $V^\pi$ for a deterministic policy $\pi$.

For a given state $s$, would it be better to do an action $a \neq \pi(s)$?

The value of doing $a$ in state $s$ is:

$$Q^\pi(s,a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

It is better to switch to action $a$ for state $s$ if and only if $Q^\pi(s,a) > V^\pi(s)$.

Policy Improvement Cont.

Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:

$$\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]$$

Then $V^{\pi'} \geq V^\pi$.
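A sketch of this greedification step, again assuming arrays P[a, s, s'] and R[a, s, s'] for the dynamics.

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """Policy improvement: the deterministic policy that is greedy w.r.t. V.

    Shapes (assumed): P[a, s, s'], R[a, s, s'], V[s'].
    """
    # Q[a, s] = sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V(s')]
    Q = np.sum(P * (R + gamma * V), axis=2)
    return np.argmax(Q, axis=0)  # pi'(s) = argmax_a Q(s, a)
```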

Policy Improvement Cont.

What if $V^{\pi'} = V^\pi$? I.e., for all $s \in S$,

$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V^\pi(s')\big]\,?$$

But this is the Bellman optimality equation. So $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.

Policy Iteration

$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$

Each $\to V$ step is policy evaluation; each $\to \pi$ step is policy improvement ("greedification").
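Putting the two steps together: a sketch of policy iteration that reuses the evaluation and greedification helpers sketched above (function names and array conventions are assumptions, not from the slides).

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate policy evaluation and greedification until the greedy
    policy stops changing."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)      # start from an arbitrary policy
    while True:
        pi = np.eye(n_actions)[policy]          # one-hot policy, shape (S, A)
        V = iterative_policy_evaluation(P, R, pi, gamma, theta)
        new_policy = greedy_policy(P, R, V, gamma)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```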

Policy Iteration

Jack's Car Rental

- $10 for each car rented (must be available when request rec'd)
- Two locations, maximum of 20 cars at each
- Cars returned and requested randomly
  - Poisson distribution: $n$ returns/requests with probability $\frac{\lambda^n}{n!} e^{-\lambda}$
  - 1st location: average requests = 3, average returns = 3
  - 2nd location: average requests = 4, average returns = 2
- Can move up to 5 cars between locations overnight
- States, Actions, Rewards? Transition probabilities?
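The Poisson probability quoted on the slide, as a tiny helper (the function name is illustrative):

```python
from math import exp, factorial

def poisson_prob(n, lam):
    """Probability of n requests or returns: (lam**n / n!) * e**(-lam)."""
    return lam ** n / factorial(n) * exp(-lam)

# e.g. probability of exactly 3 requests at the first location (average 3/day)
p = poisson_prob(3, 3)   # about 0.224
```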

Jack's Car Rental

Jack's CR Exercise

Suppose the first car moved is free:
- From 1st to 2nd location
- Because an employee travels that way anyway (by bus)

Suppose only 10 cars can be parked for free at each location:
- More than 10 costs $4 for using an extra parking lot

Such arbitrary nonlinearities are common in real problems.

Value Iteration

Recall the full policy-evaluation backup:

$$V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Here is the full value-iteration backup:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'}\big[R^a_{ss'} + \gamma V_k(s')\big]$$

Value Iteration Cont.
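A minimal sketch of value iteration, with the same assumed array conventions as the earlier snippets.

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Value iteration: full backups with a max over actions.

    Shapes (assumed): P[a, s, s'], R[a, s, s'].
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V(s')]
        Q = np.sum(P * (R + gamma * V), axis=2)
        V_new = Q.max(axis=0)                    # V_{k+1}(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=0)       # value function and greedy policy
        V = V_new
```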

Gambler's Problem

- Gambler can repeatedly bet $ on a coin flip
- Heads he wins his stake, tails he loses it
- Initial capital: {$1, $2, ..., $99}
- Gambler wins if his capital becomes $100, loses if it becomes $0
- Coin is unfair: heads (gambler wins) with probability p = 0.4
- States, Actions, Rewards?

Gambler's Problem Solution
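A short value-iteration sketch for this problem, assuming the standard formulation for this example: reward +1 for reaching $100, 0 otherwise, undiscounted.

```python
import numpy as np

P_HEADS = 0.4        # unfair coin: the gambler wins a flip with probability 0.4
GOAL = 100

V = np.zeros(GOAL + 1)
V[GOAL] = 1.0        # boundary condition: the +1 reward for reaching $100

# Sweep capital levels 1..99; the stake (action) can be at most min(s, 100 - s).
for _ in range(1000):
    for s in range(1, GOAL):
        stakes = range(1, min(s, GOAL - s) + 1)
        V[s] = max(P_HEADS * V[s + a] + (1 - P_HEADS) * V[s - a] for a in stakes)
```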

Herd Management

- You are a consultant to a farmer managing a herd of cows
- Herd consists of 5 kinds of cows: young, milking, breeding, old, sick
- Number of each kind is the State
- Number sold of each kind is the Action
- Cows transition from one kind to another
- Young cows can be born

Asynchronous DP

- All the DP methods described so far require exhaustive sweeps of the entire state set.
- Asynchronous DP does not use sweeps. Instead it works like this:
  - Repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup
- Still needs lots of computation, but does not get locked into hopelessly long sweeps
- Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
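A sketch of the asynchronous idea, reusing the assumed P[a, s, s'] and R[a, s, s'] arrays; a fixed budget of backups stands in here for the slide's convergence criterion.

```python
import numpy as np

def asynchronous_value_iteration(P, R, gamma, n_backups=100_000, rng=None):
    """Asynchronous DP: no systematic sweeps; repeatedly pick a state at
    random and apply a full value-iteration backup to it."""
    rng = rng or np.random.default_rng()
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(n_backups):
        s = rng.integers(n_states)
        # V(s) <- max_a sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V(s')]
        V[s] = np.max(np.sum(P[:, s, :] * (R[:, s, :] + gamma * V), axis=1))
    return V
```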

Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.

A geometric metaphor for convergence of GPI:

Efficiency of DP

- To find an optimal policy is polynomial in the number of states...
- BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.

Summary

- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Full backups (to be contrasted later with sample backups)
- Generalized Policy Iteration (GPI)
- Asynchronous DP: a way to avoid exhaustive sweeps
- Bootstrapping: updating estimates based on other estimates