Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

V^*(s) = \max_{a \in \mathcal{A}(s)} Q^{\pi^*}(s, a)
       = \max_{a \in \mathcal{A}(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}
       = \max_{a \in \mathcal{A}(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]

The relevant backup diagram: [backup diagram for V*]

V^* is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation for Q*

Q^*(s, a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}
          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]

The relevant backup diagram: [backup diagram for Q*]

Q^* is the unique solution of this system of nonlinear equations.
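To make the two equations above concrete, here is a minimal numerical sketch (not from the slides) of one Bellman optimality backup, assuming the dynamics are given as hypothetical arrays P[a, s, s'] and R[a, s, s'] filled with randomly generated toy values:

```python
import numpy as np

# Hypothetical toy MDP: P[a, s, s'] are transition probabilities,
# R[a, s, s'] expected rewards; all numbers are randomly generated.
rng = np.random.default_rng(0)
n_actions, n_states, gamma = 2, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_actions, n_states, n_states))

V = np.zeros(n_states)  # current estimate of V*

# Q[a, s] = sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ]
Q = (P * (R + gamma * V)).sum(axis=2)

# One backup of the Bellman optimality equation for V*:
V_new = Q.max(axis=0)
```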
Why Optimal State-Value Functions Are Useful

Any policy that is greedy with respect to V^* is an optimal policy.

Therefore, given V^*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld: [gridworld figure]
What About Optimal Action-Value Functions?

Given Q^*, the agent does not even have to do a one-step-ahead search:

\pi^*(s) = \arg\max_{a \in \mathcal{A}(s)} Q^*(s, a)
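As a minimal sketch of this point: with Q* tabulated as an |S| x |A| array (the numbers below are hypothetical), acting optimally is a single argmax per state, with no model and no lookahead:

```python
import numpy as np

# Hypothetical tabulated Q* for 2 states and 3 actions.
Q_star = np.array([[1.0, 2.5, 0.3],
                   [0.0, -1.0, 4.2]])

# pi*(s) = argmax_a Q*(s, a); no one-step-ahead search needed.
pi_star = Q_star.argmax(axis=1)
print(pi_star)  # -> [1 2]
```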
Solving the Bellman Optimality Equation

Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
- accurate knowledge of environment dynamics;
- enough space and time to do the computation;
- the Markov Property.

How much space and time do we need?
- polynomial in the number of states (via dynamic programming methods; Chapter 4),
- BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations.

Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
Summary

- Agent-environment interaction: states, actions, rewards
- Policy: stochastic rule for selecting actions
- Return: the function of future rewards the agent tries to maximize
- Episodic and continuing tasks
- Markov Property
- Markov Decision Process: transition probabilities, expected rewards
- Value functions: state-value and action-value functions for a policy
- Optimal state-value and action-value functions; optimal value functions and optimal policies
- Bellman Equations
- The need for approximation
Gridworld

- Actions: north, south, east, west; deterministic.
- If an action would take the agent off the grid: no move, but reward = -1.
- Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.

What if all rewards are shifted by a constant?
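A short worked note on the question above (my derivation, not stated on the slide): in a continuing, discounted task, adding a constant c to every reward adds the same constant c/(1 - \gamma) to every state's value, so the relative ordering of states and the greedy policy are unchanged:

```latex
V_c(s) = E\left[\sum_{k=0}^{\infty} \gamma^k \,(r_{t+k+1} + c) \,\middle|\, s_t = s\right]
       = V(s) + \sum_{k=0}^{\infty} \gamma^k c
       = V(s) + \frac{c}{1-\gamma}
```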
Chapter 4: Dynamic Programming

Objectives of this chapter:
- Overview of a collection of classical solution methods for MDPs known as dynamic programming (DP)
- Show how DP can be used to compute value functions, and hence, optimal policies
- Discuss the efficiency and utility of DP
Policy Evaluation

Policy evaluation: for a given policy \pi, compute the state-value function V^\pi.

Recall the state-value function for policy \pi:

V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right\}

Bellman equation for V^\pi:

V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

-- a system of |S| simultaneous linear equations
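Because the Bellman equation for V^\pi is linear, small problems can be solved exactly. A minimal numpy sketch (toy numbers, my own construction), writing the system as V = r_\pi + \gamma P_\pi V:

```python
import numpy as np

# Toy 2-state example (all numbers hypothetical).
gamma = 0.9
P_pi = np.array([[0.8, 0.2],      # P_pi[s, s'] = sum_a pi(s,a) * P^a_{ss'}
                 [0.3, 0.7]])
r_pi = np.array([1.0, -1.0])      # expected immediate reward under pi

# Solve (I - gamma * P_pi) V = r_pi: |S| linear equations in |S| unknowns.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)
```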
Iterative Methods

V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi

Each arrow denotes a sweep. A sweep consists of applying a backup operation to each state.

A full policy-evaluation backup:

V_{k+1}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]
Iterative Policy Evaluation

[algorithm box: iterative policy evaluation]
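A minimal sketch of the algorithm (my own code, using in-place sweeps), demonstrated on the 4x4 gridworld introduced on the next slide, with the equiprobable random policy:

```python
import numpy as np

# 4x4 gridworld: states 0..15, where 0 and 15 are the terminal state
# (shown twice on the slide); every step gives reward -1; gamma = 1.
# The indexing scheme is my own choice.
N = 4
TERMINALS = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west

def step(s, a):
    """Deterministic dynamics: off-grid moves leave the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s

def policy_evaluation(theta=1e-6, gamma=1.0):
    """In-place iterative policy evaluation for the equiprobable policy."""
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINALS:
                continue
            # Full backup: each of the 4 actions has probability 0.25.
            v = sum(0.25 * (-1.0 + gamma * V[step(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

print(policy_evaluation().reshape(N, N).round(1))
```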
A Small Gridworld

- An undiscounted, episodic task
- Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice, as shaded squares)
- Actions that would take the agent off the grid leave the state unchanged
- Reward is -1 on all transitions until the terminal state is reached
Iterative Policy Evaluation for the Small Gridworld

[figure: sequence of value estimates V_k and corresponding greedy policies, under equiprobable random action choices]
Policy Improvement

Suppose we have computed V^\pi for a deterministic policy \pi.

For a given state s, would it be better to do an action a \neq \pi(s)?

The value of doing a in state s is:

Q^\pi(s, a) = E\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}
            = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

It is better to switch to action a for state s if and only if Q^\pi(s, a) > V^\pi(s).
Policy Improvement Cont.

Do this for all states to get a new policy \pi' that is greedy with respect to V^\pi:

\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

Then V^{\pi'} \geq V^\pi.
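As a minimal sketch, assuming the model is given as the same hypothetical arrays P[a, s, s'] and R[a, s, s'] as in the earlier example, the greedification step is a single argmax over a backed-up Q:

```python
import numpy as np

def greedify(P, R, V, gamma=0.9):
    """Return the policy that is greedy with respect to V.

    P[a, s, s'] are transition probabilities, R[a, s, s'] expected rewards.
    """
    # Q[a, s] = sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V(s') ]
    Q = (P * (R + gamma * V)).sum(axis=2)
    return Q.argmax(axis=0)  # pi'(s) = argmax_a Q^pi(s, a)
```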
Policy Improvement Cont.

What if V^{\pi'} = V^\pi? I.e., for all s \in S,

V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^{\pi'}(s') \right]?

But this is the Bellman Optimality Equation. So V^{\pi'} = V^* and both \pi and \pi' are optimal policies.
Policy Iteration

\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*

- policy evaluation: each \pi_i \to V^{\pi_i} step
- policy improvement ("greedification"): each V^{\pi_i} \to \pi_{i+1} step
Policy Iteration

[algorithm box: policy iteration]
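A minimal sketch of the full loop (my own code, using the same hypothetical P[a, s, s'] / R[a, s, s'] model arrays as above), alternating exact evaluation with greedification until the policy stops changing:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """P[a, s, s']: transition probabilities; R[a, s, s']: expected rewards."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)         # arbitrary initial policy
    idx = np.arange(n_states)
    while True:
        # 1. Policy evaluation: solve V = r_pi + gamma * P_pi V exactly.
        P_pi = P[pi, idx]                      # P_pi[s, s'] under pi
        r_pi = (P_pi * R[pi, idx]).sum(axis=1) # expected immediate reward
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedify with respect to V^pi.
        Q = (P * (R + gamma * V)).sum(axis=2)  # Q[a, s]
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):         # stable policy => optimal
            return pi, V
        pi = new_pi
```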
Jack's Car Rental

- $10 for each car rented (must be available when request received)
- Two locations, maximum of 20 cars at each
- Cars returned and requested randomly
  - Poisson distribution: n returns/requests with probability \frac{\lambda^n}{n!} e^{-\lambda}
  - 1st location: average requests = 3, average returns = 3
  - 2nd location: average requests = 4, average returns = 2
- Can move up to 5 cars between locations overnight

States, actions, rewards? Transition probabilities?
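A minimal sketch of the Poisson term above, e.g. the probability of exactly n rental requests at the first location (\lambda = 3):

```python
from math import exp, factorial

def poisson(n, lam):
    # P(N = n) = (lam**n / n!) * exp(-lam)
    return lam ** n / factorial(n) * exp(-lam)

print(poisson(3, 3.0))  # probability of exactly 3 requests, ~0.224
```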
Jack's Car Rental

[figure: policy iteration results for Jack's Car Rental]
Jack's CR Exercise

- Suppose the first car moved is free
  - From 1st to 2nd location
  - Because an employee travels that way anyway (by bus)
- Suppose only 10 cars can be parked for free at each location
  - More than 10 costs $4 for using an extra parking lot

Such arbitrary nonlinearities are common in real problems.
Value Iteration

Recall the full policy-evaluation backup:

V_{k+1}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]

Here is the full value-iteration backup:

V_{k+1}(s) = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V_k(s') \right]
Value Iteration Cont.

[algorithm box: value iteration]
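A minimal sketch of value iteration (my own code, using the same hypothetical P[a, s, s'] / R[a, s, s'] model arrays as in the earlier sketches):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """P[a, s, s']: transition probabilities; R[a, s, s']: expected rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma * V_k(s') ]
        Q = (P * (R + gamma * V)).sum(axis=2)
        V_new = Q.max(axis=0)                  # the value-iteration backup
        if np.abs(V_new - V).max() < theta:
            return V_new, Q.argmax(axis=0)     # V ~ V*, plus a greedy policy
        V = V_new
```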