Module 6: Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
1 Module 6: Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo
2 Markov Decision Process Definition. Set of states: $S$. Set of actions (i.e., decisions): $A$. Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$. Reward model (i.e., utility): $R(s_t, a_t)$. Discount factor: $0 \le \gamma \le 1$. Horizon (i.e., # of time steps): $h$. Goal: find an optimal policy $\pi^*$.
3 Finite Horizon. Policy evaluation:
$V^\pi_h(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
Recursive form (dynamic programming):
$V^\pi_0(s) = R(s, \pi_0(s))$
$V^\pi_t(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V^\pi_{t-1}(s')$
4 Optimal Policy $\pi^*$: Finite Horizon. $V^{\pi^*}_h(s) \ge V^\pi_h(s) \quad \forall \pi, s$. Optimal value function $V^*$ (shorthand for $V^{\pi^*}$):
$V^*_0(s) = \max_a R(s, a)$
$V^*_t(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \right]$ (Bellman's equation)
5 Value Iteration Algorithm.
valueIteration(MDP):
  $V^*_0(s) \leftarrow \max_a R(s, a) \quad \forall s$
  For $t = 1$ to $h$ do
    $V^*_t(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \right] \quad \forall s$
  Return $V^*$
Optimal policy $\pi^*$:
  $t = 0$: $\pi^*_0(s) \leftarrow \mathrm{argmax}_a\, R(s, a) \quad \forall s$
  $t > 0$: $\pi^*_t(s) \leftarrow \mathrm{argmax}_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*_{t-1}(s') \right] \quad \forall s$
NB: $\pi^*$ is non-stationary (i.e., time dependent).
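The tabular algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code; the 2-state MDP, its state and action names, and all rewards and probabilities below are made up for the example.

```python
# Finite-horizon value iteration (sketch of the algorithm above).
# R[s][a] is the reward; P[s][a][s2] is Pr(s2 | s, a).

def value_iteration(S, A, R, P, gamma, h):
    V = {s: max(R[s][a] for a in A) for s in S}                # V*_0
    policy = [{s: max(A, key=lambda a: R[s][a]) for s in S}]   # pi*_0
    for t in range(1, h + 1):
        Q = {s: {a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S)
                 for a in A} for s in S}
        V = {s: max(Q[s][a] for a in A) for s in S}            # V*_t
        policy.append({s: max(A, key=lambda a: Q[s][a]) for s in S})
    return V, policy  # policy[t] is the decision rule with t steps to go

# Made-up 2-state example
S, A = ['s0', 's1'], ['stay', 'go']
R = {'s0': {'stay': 0.0, 'go': 1.0}, 's1': {'stay': 2.0, 'go': 0.0}}
P = {'s0': {'stay': {'s0': 1.0, 's1': 0.0}, 'go': {'s0': 0.2, 's1': 0.8}},
     's1': {'stay': {'s0': 0.0, 's1': 1.0}, 'go': {'s0': 1.0, 's1': 0.0}}}
V, pol = value_iteration(S, A, R, P, gamma=0.9, h=10)
```

Note that the returned policy is non-stationary, exactly as the NB on the slide says: `pol` holds one decision rule per remaining time step.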
6 Matrix form: Value Iteration.
$R^a$: $|S| \times 1$ column vector of rewards for action $a$
$V^*_t$: $|S| \times 1$ column vector of state values
$T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
valueIteration(MDP):
  $V^*_0 \leftarrow \max_a R^a$
  For $t = 1$ to $h$ do
    $V^*_t \leftarrow \max_a \left[ R^a + \gamma T^a V^*_{t-1} \right]$
  Return $V^*$
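The matrix form makes each sweep a single vectorized update. A possible NumPy sketch, using the same made-up 2-state MDP as the illustration, with `T[a]` the $|S| \times |S|$ matrix for action $a$ and the max taken elementwise over actions:

```python
import numpy as np

# Matrix-form value iteration (made-up example MDP).
gamma, h = 0.9, 10
Ra = np.array([[0.0, 2.0],                      # rewards under action 0
               [1.0, 0.0]])                     # rewards under action 1
T = np.array([[[1.0, 0.0], [0.0, 1.0]],         # T^a for action 0
              [[0.2, 0.8], [1.0, 0.0]]])        # T^a for action 1
V = Ra.max(axis=0)                              # V*_0 = max_a R^a
for _ in range(h):
    V = (Ra + gamma * (T @ V)).max(axis=0)      # V*_t = max_a (R^a + gamma T^a V*_{t-1})
```

`T @ V` exploits NumPy's stacked-matrix multiplication: it multiplies every $T^a$ by $V$ at once, so the whole Bellman backup is two array operations.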
7 Infinite Horizon. Let $h \to \infty$. Then $V^\pi_h \to V^\pi_\infty$ and $V^\pi_{h-1} \to V^\pi_\infty$.
Policy evaluation: $V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s')$
Bellman's equation: $V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*(s') \right]$
8 Policy evaluation. Linear system of equations:
$V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi(s))\, V^\pi(s')$
Matrix form:
$R$: $|S| \times 1$ column vector of state rewards for $\pi$
$V$: $|S| \times 1$ column vector of state values for $\pi$
$T$: $|S| \times |S|$ matrix of transition probabilities for $\pi$
$V = R + \gamma T V$
9 Solving linear equations. Linear system: $V = R + \gamma T V$.
Gaussian elimination: $(I - \gamma T) V = R$
Compute inverse: $V = (I - \gamma T)^{-1} R$
Iterative methods: value iteration (a.k.a. Richardson iteration): repeat $V \leftarrow R + \gamma T V$
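Both routes are easy to compare numerically. A sketch with a made-up fixed policy (its rewards $R$ and transition matrix $T$ are invented for illustration): solve $(I - \gamma T)V = R$ directly, then run the Richardson iteration and check the two answers agree.

```python
import numpy as np

# Policy evaluation two ways, for a made-up fixed policy.
gamma = 0.9
R = np.array([0.0, 2.0])                 # rewards under the policy
T = np.array([[0.2, 0.8],                # transition matrix under the policy
              [0.0, 1.0]])

# Direct solve: (I - gamma T) V = R
V_direct = np.linalg.solve(np.eye(2) - gamma * T, R)

# Richardson iteration: repeat V <- R + gamma T V
V = np.zeros(2)
for _ in range(500):
    V = R + gamma * T @ V
```

The direct solve costs $O(|S|^3)$ once; each Richardson sweep costs $O(|S|^2)$ and, as the next slides show, the error shrinks by a factor $\gamma$ per sweep.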
10 Contraction. Let $H(V) \equiv R + \gamma T V$ be the policy evaluation operator.
Lemma 1: $H$ is a contraction mapping: $\|H(V) - H(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$.
Proof:
$\|H(V) - H(\tilde V)\|_\infty = \|R + \gamma T V - R - \gamma T \tilde V\|_\infty$ (by definition)
$= \gamma \|T (V - \tilde V)\|_\infty$ (simplification)
$\le \gamma \|T\|_\infty \|V - \tilde V\|_\infty$ (since $\|AB\| \le \|A\| \|B\|$)
$= \gamma \|V - \tilde V\|_\infty$ (since $\|T\|_\infty = \max_s \sum_{s'} T(s, s') = 1$)
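Lemma 1 is easy to sanity-check empirically: apply $H$ to random pairs of value vectors and record the worst observed contraction ratio. The $R$ and $T$ below are the same made-up policy used earlier.

```python
import numpy as np

# Empirical check of Lemma 1: the contraction ratio never exceeds gamma.
rng = np.random.default_rng(0)
gamma = 0.9
R = np.array([0.0, 2.0])
T = np.array([[0.2, 0.8], [0.0, 1.0]])
H = lambda V: R + gamma * T @ V

worst = 0.0
for _ in range(100):
    V, W = rng.normal(size=2), rng.normal(size=2)
    lhs = np.max(np.abs(H(V) - H(W)))        # ||H(V) - H(W)||_inf
    rhs = np.max(np.abs(V - W))              # ||V - W||_inf
    worst = max(worst, lhs / rhs)
```

Because $H(V) - H(W) = \gamma T (V - W)$ and the rows of $T$ sum to 1, the ratio is bounded by $\gamma$ for every pair, which the random trials confirm.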
11 Convergence. Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:
$\lim_{n \to \infty} H^{(n)}(V) = V^\pi \quad \forall V$
Proof: By definition $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(n)}(V)$ for any initial $V$. By Lemma 1, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \le \gamma^n \|V - 0\|_\infty$. Hence, as $n \to \infty$, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \to 0$ and $H^{(\infty)}(V) = V^\pi \quad \forall V$.
12 Approximate Policy Evaluation. In practice, we cannot perform an infinite number of iterations. Suppose that we perform value iteration for $k$ steps and $\|H^{(k)}(V) - H^{(k-1)}(V)\|_\infty = \varepsilon$: how far is $H^{(k)}(V)$ from $V^\pi$?
13 Approximate Policy Evaluation. Theorem 3: If $\|H^{(k)}(V) - H^{(k-1)}(V)\|_\infty \le \varepsilon$, then $\|V^\pi - H^{(k)}(V)\|_\infty \le \frac{\varepsilon}{1 - \gamma}$.
Proof:
$\|V^\pi - H^{(k)}(V)\|_\infty = \|H^{(\infty)}(V) - H^{(k)}(V)\|_\infty$ (by Theorem 2)
$= \left\| \sum_{t=1}^{\infty} \left[ H^{(t+k)}(V) - H^{(t+k-1)}(V) \right] \right\|_\infty$ (telescoping sum)
$\le \sum_{t=1}^{\infty} \|H^{(t+k)}(V) - H^{(t+k-1)}(V)\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
$\le \sum_{t=1}^{\infty} \gamma^t \varepsilon = \frac{\gamma \varepsilon}{1 - \gamma} \le \frac{\varepsilon}{1 - \gamma}$ (by Lemma 1, applied $t$ times to each term)
14 Optimal Value Function. Non-linear system of equations:
$V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^*(s') \right]$
Matrix form:
$R^a$: $|S| \times 1$ column vector of rewards for action $a$
$V^*$: $|S| \times 1$ column vector of optimal values
$T^a$: $|S| \times |S|$ matrix of transition probabilities for action $a$
$V^* = \max_a \left[ R^a + \gamma T^a V^* \right]$
15 Contraction. Let $H^*(V) \equiv \max_a \left[ R^a + \gamma T^a V \right]$ be the operator in value iteration.
Lemma 3: $H^*$ is a contraction mapping: $\|H^*(V) - H^*(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$.
Proof: Without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$, and let
$a^*_s = \mathrm{argmax}_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s') \right]$
16 Contraction. Proof continued: Then
$0 \le H^*(V)(s) - H^*(\tilde V)(s)$ (by assumption)
$\le R(s, a^*_s) + \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, V(s') - R(s, a^*_s) - \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, \tilde V(s')$ (by definition of $a^*_s$, since $H^*(\tilde V)(s)$ maximizes over $a$)
$= \gamma \sum_{s'} \Pr(s' \mid s, a^*_s) \left[ V(s') - \tilde V(s') \right]$
$\le \gamma \sum_{s'} \Pr(s' \mid s, a^*_s)\, \|V - \tilde V\|_\infty$ (max-norm upper bound)
$= \gamma \|V - \tilde V\|_\infty$ (since $\sum_{s'} \Pr(s' \mid s, a^*_s) = 1$)
Repeat the same argument for $H^*(\tilde V)(s) \ge H^*(V)(s)$, and for each $s$.
17 Convergence. Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:
$\lim_{n \to \infty} H^{*(n)}(V) = V^* \quad \forall V$
Proof: By definition $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(n)}(V)$ for some initial $V$. By Lemma 3, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \le \gamma^n \|V - 0\|_\infty$. Hence, as $n \to \infty$, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \to 0$ and $H^{*(\infty)}(V) = V^* \quad \forall V$.
18 Value Iteration. Even when the horizon is infinite, we perform finitely many iterations. Stop when $\|V_n - V_{n-1}\|_\infty \le \varepsilon$.
valueIteration(MDP):
  $V_0 \leftarrow \max_a R^a$; $n \leftarrow 0$
  Repeat
    $n \leftarrow n + 1$
    $V_n \leftarrow \max_a \left[ R^a + \gamma T^a V_{n-1} \right]$
  Until $\|V_n - V_{n-1}\|_\infty \le \varepsilon$
  Return $V_n$
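The stopping rule above turns the finite-horizon loop into a while-loop. A sketch on the same made-up 2-state MDP used in the earlier illustrations:

```python
import numpy as np

# Infinite-horizon value iteration with the epsilon stopping rule.
gamma, eps = 0.9, 1e-6
Ra = np.array([[0.0, 2.0], [1.0, 0.0]])          # made-up rewards per action
T = np.array([[[1.0, 0.0], [0.0, 1.0]],          # made-up T^a per action
              [[0.2, 0.8], [1.0, 0.0]]])

V = Ra.max(axis=0)                                # V_0 = max_a R^a
n = 0
while True:
    n += 1
    V_new = (Ra + gamma * (T @ V)).max(axis=0)    # V_n
    if np.max(np.abs(V_new - V)) <= eps:          # ||V_n - V_{n-1}|| <= eps
        V = V_new
        break
    V = V_new
```

On this example the fixed point has $V^*(s_1) = 2/(1-\gamma) = 20$ (collect reward 2 forever), so the loop should land within roughly $\varepsilon/(1-\gamma)$ of that value.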
19 Induced Policy. Since $\|V_n - V_{n-1}\|_\infty \le \varepsilon$, by Theorem 4 we know that $\|V_n - V^*\|_\infty \le \frac{\varepsilon}{1 - \gamma}$. But how good is the stationary policy $\pi_n(s)$ extracted based on $V_n$?
$\pi_n(s) = \mathrm{argmax}_a \left[ R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s') \right]$
How far is $V^{\pi_n}$ from $V^*$?
20 Induced Policy. Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \frac{2\varepsilon}{1 - \gamma}$.
Proof:
$\|V^{\pi_n} - V^*\|_\infty = \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
$\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$ (since $\|A + B\| \le \|A\| + \|B\|$)
$= \|H^{\pi_n(\infty)}(V_n) - V_n\|_\infty + \|V_n - H^{*(\infty)}(V_n)\|_\infty$
$\le \frac{\varepsilon}{1 - \gamma} + \frac{\varepsilon}{1 - \gamma} = \frac{2\varepsilon}{1 - \gamma}$ (by Theorems 2 and 4, noting $H^{\pi_n}(V_n) = H^*(V_n)$ so each one-step residual is at most $\varepsilon$, and applying the argument of Theorem 3 to each term)
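Theorem 5 can be verified numerically on the same made-up 2-state MDP: run value iteration to tolerance $\varepsilon$, extract the greedy stationary policy, evaluate it exactly by solving the linear system from slide 9, and compare against a near-exact $V^*$.

```python
import numpy as np

# Numeric check of Theorem 5 (made-up example MDP).
gamma, eps = 0.9, 1e-4
Ra = np.array([[0.0, 2.0], [1.0, 0.0]])
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.2, 0.8], [1.0, 0.0]]])

V = Ra.max(axis=0)
while True:                                       # iterate to tolerance eps
    V_new = (Ra + gamma * (T @ V)).max(axis=0)
    done = np.max(np.abs(V_new - V)) <= eps
    V = V_new
    if done:
        break

Q = Ra + gamma * (T @ V)                          # action values at V_n
pi = Q.argmax(axis=0)                             # greedy stationary policy pi_n

# Exact evaluation of pi_n: solve (I - gamma T^pi) V = R^pi
S = len(V)
T_pi = np.array([T[pi[s], s] for s in range(S)])
R_pi = np.array([Ra[pi[s], s] for s in range(S)])
V_pi = np.linalg.solve(np.eye(S) - gamma * T_pi, R_pi)

V_star = V.copy()                                 # near-exact V*: iterate much further
for _ in range(2000):
    V_star = (Ra + gamma * (T @ V_star)).max(axis=0)
gap = np.max(np.abs(V_pi - V_star))               # should be <= 2 eps / (1 - gamma)
```

On small examples the gap is usually far below the worst-case bound, because the greedy policy often becomes exactly optimal well before $V_n$ converges.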
21 Summary. Value iteration: a simple dynamic programming algorithm. Complexity: $O(n |A| |S|^2)$, where $n$ is the number of iterations. Can we optimize the policy directly instead of optimizing the value function and then inducing a policy? Yes: by policy iteration.