Dynamic Policy Programming with Function Approximation: Supplementary Material

Size: px

Start display at page:

Download "Dynamic Policy Programming with Function Approximation: Supplementary Material"

Gary Jordan
5 years ago
Views:

1 Dyamic Policy Programmig with Fuctio pproximatio: Supplemetary Material Mohammad Gheshlaghi zar Radboud Uiversity Nijmege Geert Grooteplei Noord EZ Nijmege Netherlads Viceç Gómez Radboud Uiversity Nijmege Geert Grooteplei Noord EZ Nijmege Netherlads Hilbert J. Kappe Radboud Uiversity Nijmege Geert Grooteplei Noord EZ Nijmege Netherlads ppedices Defiitios.1 Markov Decisio Processes ad Bellma Equatio statioary MDP is a 5-tuple S,,R,T,γ, where S,,R are, respectively, the set of all system states, the set of actios that ca be take ad the set of rewards that may be issued, such that rss a deotes the reward of the ext state s give that the curret state is s ad the actio is a. T is a set of matrices of dimesio S S, oe for each a such that Tss a deotes the probability of the ext state s give that the curret state is s ad the actio is a. γ 0,1 deotes the discout factor. ssumptio 1. We assume that for every 3-tuple s,a,s S S, the magitude of the immediate reward, r a ss is bouded from above by R. statioary policy is a mappig π that assigs to each state s a probability distributio over the actio space, oe for each s,a S such that π s a deotes the probability of the actio a give the curret state is s. Give the policy π, its correspodig value fuctio V π deotes the expected value of the log-term discouted sum of rewards i each state s, whe the actio is chose by policy π. The goal is to fid a policy π that attais the optimal value fuctio V s, such that V s satisfies a Bellma equatio: V s = s π s a π a Tss a ra ss +γv s, s S. 1 s S

2 Mohammad Gheshlaghi zar, Viceç Gómez.2 Matrix otatio We fid it coveiet to use matrix ad vector otatio for aalysis of theorem 1, 2 ad 3. We begi by re-defiig the variables of the MDP. T ad r are, respectively, a S S trasitio matrix with etries Tsa,s = Tss a ad a S S reward matrix with etries rsa,s = rss a. r is a S 1 colum vector of expected rewards with etries rsa = s S Ta ss ra ss. The policy ca also be re-expressed as a S 1 vector π with etries πsa = π s a. I additio, it will be coveiet to itroduce a S S matrix Π, give by: π s1 a 1 π s1 a π s2 a 1 π s2 a Π =..., 2 πs S a1 πs S a where Π cosists of S row blocks, each of legth, which are arraged diagoally. The policy matrix Π is related to the policy vector π by π = Π T 1 with 1 deotes a S 1 vector of all 1s. Further, oe ca easily verify that the matrix-product ΠT gives the state to state trasitio matrix for the policy π. Oe ca also preset the value fuctio i vector space. v π is a S 1 vector with etries v π s = V π s. The Bellma equatio 1 ca ow be re-expressed i matrix otatio as: v = M r+γtv, 3 wherem istheoperatorothe S 1vector r+γtv,suchthatm r+γtv s = a rsa+ γtv sa. Ofte it is coveiet to associate value fuctios ot with states but with state-actio pairs. Therefore, we itroduce a S 1 vector of the actio-values q π whose etries q π sa deotes the expected value of the sum of future rewards for all state-actio s,a S, provided the future actios are chose by the policy π. q π satisfies the followig Bellma equatio: q π = T π q π = r+γtπq π, 4 where we itroduce the Bellma operator T π ad q π deotes the fixed poit of T π. The optimal q, q, also satisfies a Bellma equatio: q = T q = r+γtm q, 5 where we itroduce the Bellma operator T. q deotes the fixed poit of the operator T. Both T ad T π are cotractio mappigs with the factor γ?, chap. 1. I other words, for ay two vectors q ad q, we have: T q T q q q, T π q T π q q q, 6 where deotes a L -orm o R S. The operator O defied ca be writte i matrix otatio as: Op = p+ r+γtm η p ΞM η p. 7 Here, p deotes a S 1 vector of the actio prefereces. M η is the soft- operator o the vector p with M η ps = M η Ps. Ξ is a S S matrix give by: 1 1 Ξ = 1 1 T,

3 Mohammad Gheshlaghi zar, Viceç Gómez where Ξ cosists of S colum blocks of 1s, each of legth, which are arraged diagoally. Ξ trasforms the S 1 vector M η p to the S 1 vector ΞM η p with ΞM η psa = M η ps. Further, oe ca easily verify that for ay policy matrix defied by 2: πs 1 a a πs 2 a ΠΞ = a. = I, 9.. πs S a where I is a S S idetity matrix. We kow that lim η M η p = M p. Further, It is ot difficult to show that the followig iequality holds for M p M η p : Lemma 1. Let p be a S 1 vector of actio prefereces ad η be a positive costat, the a upper boud for M η p M p ca be obtaied as follows: a M η p M p 1 η log. Proof. sketch First, we ote that M η p M p = M η p M p 0. The we apply equatio 12 of the mai article to each etries of p M p. Now, oe ca easily show that L η p M ps 0 for all s S. This combied with the fact that for the probability distributio π: H π s log derive 0 M p M η p 1 η log. The result the follows by takig the sup-orm. Defiig p as the actio preferece resulted by the th iteratio of 7, we have: p = p 1 + r+γtm η p 1 ΞM η p 1 = p 1 + r+γtπ 1 p 1 ΞΠ 1 p 1, = 1,2,3,, 10 where p 0 = p ad Π is a policy distributio matrix associated with the policy distributio vector π give by: π sa = expηp sa, s,a S, 11 Zs ad Zs = a expηp sa is the ormalizatio factor. B Proof of Lemma 1 of The Mai rticle The optimal value value fuctio is defied as follows: V π s = s π s Π s a π a Tss a ra ss +γv π s 1 η KLπ s π s s S 12 The imizatio i 12 ca be performed i closed form usig Lagrage multipliers: Ls,λ s = π s a Tss a ra ss +γv πs 1 η KLπ s π s λ s s a 1. a s S a π The ecessary coditio for the extremum with respect to π s is: 0 = Ls,λ s π s a = s S Tss a ra ss +γv πs 1 η 1 η log πs a λ s. π s a

4 Mohammad Gheshlaghi zar, Viceç Gómez So, the solutio is: π s a = π s aexp ηλ s 1exp η ss s ST a ra ss +γv πs, s S. 13 The Lagrage multipliers ca be solved from the costraits: 1 = π s a = exp ηλ s 1 s aexp η a a π ss s ST a ra ss +γv πs, λ s = 1 η log a π s aexp η ss s ST a ra ss +γv π s 1 η. 14 By pluggig 14 i to 13 we have: π s aexp η Tss a ra ss +γv πs π s a = a s S Substitutig policy 15 i 12, we obtai: π s aexp η, s,a S. 15 Tss a ra ss +γv πs s S V πs = 1 η log a π s aexp η s ST a ss ra ss +γv πs. C Proof of Theorem 1 This sectio provides a formal aalysis of the covergece behavior of DPP. Our objective is to establish a rate of covergece for the value fuctio of the policy iduced by DPP. Our mai result is that at iteratio of DPP we have: q π q < δ, where δ = O Here, q π is a vector of the actio-values uder the policy π ad π is the policy iduced by DPP at iteratio. Equatio 16 implies that, asymptotically, the policy iduced by DPP, π = lim π, coverges to the optimal policy π. To derive 16 oe eeds to relate q π to the optimal q. Ufortuately, fidig a direct relatio betwee q π ad q is ot a easy task. Istead, we ca relate q π to q via a auxiliary q, which we defie later i this sectio. I the remaider of this sectio, we first express π i terms of q. The, we obtai a upper boud o the ormed error q q i lemma 2. Fially, we use these two results to derive a boud o the ormed error q π q. I ordertoexpressπ i termsofq, weexpadthe correspodigactiopreferecep byrecursivesubstitutio of 10: p = p 1 + r+γtπ 1 p 1 ΞΠ 1 p 1 = p 2 +2 r+γtπ 2 p 2 ΞΠ 2 p 2 +γtπ 1 p 2 +γtπ 1 r +γ 2 TΠ 1 TΠ 2 p 2 γtπ 1 ΞΠ 2 p 2 ΞΠ 1 p 2 ΞΠ 1 r 17 γξπ 1 TΠ 2 p 2 +ΞΠ 1 ΞΠ 2 p 2 = 2 r+γtπ 1 r+p 2 +γtπ 1 p 2 +γ 2 TΠ 1 TΠ 2 p 2 ΞΠ 1 r ΞΠ 1 p 2 γξπ 1 TΠ 2 p 2 by 9.

5 Mohammad Gheshlaghi zar, Viceç Gómez Equatio 17 expresses p i terms of p 2, where i the last step we have caceled some terms because of 9. To express p i terms of p 0, we proceed with the expasio of 10: p = r+ 1 kγk k TΠ j r+p 0 + γk k TΠ j p 0 j=1 j=1 ΞΠ 1 1 r+ 2 k 1γk k TΠ j 1 r j=1 ΞΠ 1 p γk k TΠ j 1 p 0 j=1 = r+ 1 γk k TΠ j r γ 1 kγk 1 k TΠ j r+p 0 j=1 j=1 + γk k TΠ j p 0 1ΞΠ 1 r+ 2 γk k TΠ j 1 r j=1 j=1 +ΞΠ 1 γ 2 kγk 1 k TΠ j 1 r p 0 1 γk k TΠ j 1 p 0. j=1 j=1 18 We defie the auxiliary actio-values q γk k j=1 TΠ j p0 ad write 18 as: = r + 1 γk k j=1 TΠ j r ad q p = p 0 + p = q γ q γ +qp ΞΠ 1 1q 1 γ q 1 +q P 1 = q γ +c Ξd. 19 Here, c = q P γ q / γ is a S 1 vector bouded by: c c = γ R p 0, = 1,2,3,, 20 ad d = Π 1 1q 1 γ q 1/ γ+q P 1 is a S 1 vector. From its defiitio, it is easy to see that q satisfies the followig Bellma equatio: q = r+γtπ 1q Note the differece betwee 21 ad 4: whereas q π evolves accordig to a fixed policy π, q evolves accordig to a policy that evolves accordig to DPP. Fially, we express π i terms of q. By pluggig 19 i 11 ad takig i to accout that Ξd sa = d s, we obtai: π sa = exp η q sa+c sa d s = exp η q sa+c sa Zs Z, s,a S, 22 s where Z s = Zsexpηd s is the ormalizatio factor. Equatio 22 establishes the relatio betwee π ad q. To relate q ad q we state the followig lemma, that establishes a boud o q q : Lemma 2 boud o q q. Let assumptio 1 hold, deotes the cardiality of ad be a positive iteger, also, for keepig the represetatio succict, assume that both r ad p 0 are bouded from above by some costat L > 0, the the followig iequality holds: q q δ 1, where δ 1 is give by: δ 1 = 2γ 2 log /η +2L 4 +2γ 1 L. 23

6 Mohammad Gheshlaghi zar, Viceç Gómez Proof. q q = 1 T k 1 q k+1 T k q k +T 1 q 1 q 1 T k 1 q k+1 T k q k + T 1 q 1 q 1 k 1 q k+1 T q k +γ 1 q 1 q by For 1 k 1 we obtai: q k+1 T q k TΠ k q k TM q k by 21 ad 5 Π k q k M q k by Hölder s iequality. 25 By mootoicity property for ay actio-value vector q ad distributio matrix Π we have: Πqs = a πsaqsa M qs, s S. 26 By comparig 26 with 25, we obtai: q k+1 T q k M q ks Π k q ks s S M q k + c k s+ c s S k k Π kq ks M q k + c k Π k q 1 k + γc k M q k + c k Π k k q k k + c M η k q k + c k k + 2γc k M k q k + c + 2γc k 27 log /η +2c k by lemma 1. By substitutio of 27 i 24, we obtai: It is ot difficult to show that 1 q q 1 γ k log /η +2c k + 2γ 1 L. 28 γ k k 2γ/ 2. Combiig this with 28 yields: q q 2γ log /η +2c 2 +2γ 1 L 2γ 2 log /η +2L 4 +2γ 1 L by 37. Equatio 22 expresses the policy π i terms of the auxiliary q. Lemma 2 provides a upper boud o the ormed-error q q. We use these two results to derive a boud o the ormed-error q π q :

7 Mohammad Gheshlaghi zar, Viceç Gómez We start by otig that q π is obtaied by ifiite applicatio of the operator T π to a arbitrary iitial q-vector, for which we take q, the: q π q m = lim T k m π q T π k 1 T q lim m lim m m T k π q T π k 1 T q m γ k 1 T π q T q by 6 1 T π q T q TΠ q TM q Π q M q by Hölder s iequality. 29 log similar lies with the proof of lemma 2, we obtai: q π q Π q M q M q s Π q s s S M q s Π q s +δ 1 s S M q + c s Π q s s S + γ δ 1 + c M q + c Π q + c + γ 2δ 1 + 2c M q + c M η q + c + γ 2δ 1 + 2c log +2δ 1 η + 2c 2 log /η +2L 2 +2δ 1 2 log /η +2L 2 +4γ 2 log /η +2L log /η +2L 4 +4γ 1 L +4γ 1 L by lemma 2 by 37 by lemma 1 by 37 This completes the proof. D Proof of Theorem 2 First, we show that there exists a limit for p i ifiity. The, we compute this limit. Before we proceed with the proof we review the followig assumptio ad corollary from the mai article

8 Mohammad Gheshlaghi zar, Viceç Gómez ssumptio 2. We assume that MDP has a uique determiistic optimal policy π give by: π sa = { 1 a = a s 0 otherwise, s S, where a s = arg a q sa. Corollary 1. The followig relatio holds i limit: lim = q, s,a S. + qπ The covergece of p to a limit ca be established by provig the covergece of all the terms i RHS of 19. We have already prove the covergece of q to q i lemma 2. The, the covergece of q / γ to q / γ is immediate. Corollary 1 together with assumptio 2 yields the covergece of π to π. The covergece of q p to some limit q p the follows. ccordigly, there exists a limit for p. Now, we compute the limit of p. Combiig corollary 1 with 19 ad takig i to accout that v = Π q yields: lim p sa = lim q sa γ q sa +q p sa 1v s+γ v s q p sa γ γ E Proof of Theorem 3 = lim q sa v s v s +γ q sa +q p sa q p sa +v s γ γ { v = s a = a s otherwise. This sectio provides a formal theoretical aalysis of the performace of dyamic policy programmig i the presece of approximatio error. Each iteratio of approximate dyamic programmig ca be characterized by 30. Our objective is to establish a L -orm performace loss boud of the policy iduced by approximate DPP. p = p 1 + r+γtm η p 1 ΞM η p 1 +ǫ 1 = p 1 + r+γtπ 1 p 1 ΞΠ 1 p 1 +ǫ 1, = 1,2,3,, 30 Our mai result that at iteratio of approximate dyamic policy programmig, we have: where: q π q δ, 31 δ = 4γ 2 log /η +4L 5 +4γ L 2 + 2γ γ k ε k. with ε k = 1 / k j=0:k 1 ǫ j. Here, q π is a vector of the actio-values uder the policy π ad π is the policy iduced after iteratio of approximate DPP. To relate q with q π, we ca relate q π to q via a auxiliary q, which we defie later i this sectio. I the remaider of this sectio, we first express π i terms of q i lemma 3. The, we obtai a upper boud o the ormed error q q i lemma 4. Fially, we use these two results to derive 31. We begi our aalysis with the followig lemma:

9 Mohammad Gheshlaghi zar, Viceç Gómez Lemma 3. Let be a positive iteger ad p 0 deotes the iitial actio prefereces. lso, lets defie the auxiliary actio-value fuctios q ad q p as: 1 q = r+ ǫ + γ k kj=1 TΠ j r+ ǫ k q p = p 0 + k=0 γ k kj=1 TΠ j p0 32 with ǫ = 1 l=0 ǫ l/, the we have: p = q γ q γ +qp ΞΠ 1 1q 1 γ q 1 +q P 1 γ 33 Proof. I order to express π i terms of q, we expad the correspodig actio preferece p by recursive substitutio of 30: p = p 1 + r+γtπ 1 p 1 ΞΠ 1 p 1 +ǫ 1 = p 2 +ǫ 2 +2 r+γtπ 2 p 2 +γtπ 2 ǫ 2 ΞΠ 2 p 2 +γtπ 1 p 2 +γ 2 TΠ 1 TΠ 2 p 2 +γtπ 1 r γtπ 1 ΞΠ 2 p 2 ΞΠ 1 p 2 ΞΠ 1 ǫ 2 ΞΠ 1 r γξπ 1 TΠ 2 p 2 +ΞΠ 1 ΞΠ 2 p 2 +ǫ 1 = 2 r+γtπ 1 r+p 2 ++γtπ 1 p 2 +γ 2 TΠ 1 TΠ 2 p 2 ΞΠ 1 r 34 ΞΠ 1 p 2 γξπ 1 TΠ 2 p 2 +ǫ 1 +ǫ 2 +γtπ 2 ǫ 2 ΞΠ 1 ǫ 2 Equatio 34 expresses p i terms of p 2, where i the last step we have caceled some terms because of 9. To express p i terms of p 0, we proceed with the expasio of 30: p = 1 k=0 kγk k j=1 TΠ j r+ k 1 k=0 γk k j=1 TΠ j + 1 γk k TΠ 2 j ǫ l ΞΠ 1 k 1γk k TΠ j 1 k=0 j=1 l=0 k=0 j=1 1 ΞΠ 1 γk k TΠ j 1 p γk k TΠ k 2 j 1 ǫ l k=0 j=1 k=0 j=1 l=0 = 1 γk k TΠ j r γ 1 kγk 1 k TΠ j r k=0 j=1 j=1 + 1 γk k TΠ j ǫ k γ 1 kγk 1 k TΠ j ǫ k k=0 j=1 j=1 + γk k TΠ 2 j p 0 1ΞΠ 1 γk k TΠ j 1 r k=0 j=1 k=0 j=1 2 1ΞΠ 1 ǫ k 1 p 0 γk k TΠ j 1 k=0 j=1 2 +γξπ 1 kγk 1 k TΠ j 1 ǫ k 1 j=1 +ΞΠ 1 γ 2 kγk 1 k TΠ j 1 r 1 γk k TΠ j 1 p 0 j=1 k=0 j=1 r 35 Equatio 33 the follows by comparig 35 with 32. It is easy to see that q, defied i lemma 3, satisfies the followig Bellma equatio: q = r+γtπ 1q 1 + ǫ 36 Note the differece betwee 36 ad 4: whereas q π evolves accordig to a fixed policy π, q to a policy that evolves accordig to approximate DPP. evolves accordig

10 Mohammad Gheshlaghi zar, Viceç Gómez For brevity we itroduce the ew variables c = q P γ q / γ 1 ad d = Π 1 1q 1 γ q 1/ γ + q P 1 ad re-express 33 as: p = q +c Ξd 38 Fially, we express π i terms of q. By pluggig 33 i 11 ad takig i to accout that Ξd sa = d s, we obtai: π sa = exp ηq sa+c sa d s Zs = exp η q sa+c sa s,a S, 39 Z, s where Z s = Zsexpd s is the ormalizatio factor. Equatio 39 establishes the relatio betwee π ad q. To relate q ad q, we state the followig lemma, that establishes a boud o q q : Lemma 4 L boud o q q. Let assumptio 1 hold ad q defied accordig to 32. Let deotes the cardiality of ad be a positive iteger, also, for keepig the represetatio succict, assume that both r ad p 0 are bouded from above by some costat L > 0, the the followig iequality holds: q q δ 2, where δ 2 is give by: with ε k = ǫ k. δ 2 = 2γ 2 log /η +4L 4 +2γ 1 L + γ k ε k. 40 Proof. q q = 1 T k 1 q k+1 T k q k +T 1 q 1 q 1 T k 1 q k+1 T k q k + T 1 q 1 q 1 k 1 q k+1 T q k +γ 1 q 1 q by For all l N : 2 l 1 we obtai: q l+1 T q l = γ TΠl q l TM q l + ǫl+1 by 36 ad 5 + ε l+1 TΠl q l TM q l Πl q l M q l + ε l+1 by Hölder s iequality Note that: c c = γ 2 R 1 +ǫ + p0, = 1,2,3,, 37 with ǫ =,2,3,... ǫ k

11 Mohammad Gheshlaghi zar, Viceç Gómez Takig i to accout that M q l Π l q l s we obtai: q l+1 T q l M q l s Π l q l s + ε l+1 s S M q l + c l s+ c Π l q l s + ε l+1 s S l l M q l + c l Π l q l + γc + ε l+1 l l M q l + c l Π l q l + c l + 2γc + ε l+1 l l l M ηl q l + c l M q l + c l + 2γc + ε l+1 l l l log / η +2c l + ε l+1 by lemma By substitutio of 43 i 41, we obtai: q q log / 1 η +2c γ k k + 2γ 1 L + γ k 1 ε k. 44 It is ot difficult to show that 1 γ k k 2γ/ 2. Combiig this with 44 yields: q q 2γ log /η +2c 2 +2γ 1 L + 2γ 2 log /η +4L 4 γ k ε k +2γ 1 L + γ k ε k. Equatio 39 expresses the policy π i terms of the auxiliary q. Lemma 4 provides a upper-boud o the ormed-error q q. We use these two results to derive a boud o the ormed-error q π q. We start by otig that q π is obtaied by ifiite applicatio of the operator T π to a arbitrary iitial q-vector, for which we take q, the: q π q = lim m lim m T k π q T π k 1 T q m lim m m T k π q T π k 1 T q m γ k 1 T π q T q by 6 1 T π q T q TΠ q TM q Π q M q by Hölder s iequality. 45

12 Mohammad Gheshlaghi zar, Viceç Gómez log similar lies with the proof of lemma 4, we obtai: q π q Π q M q M q s Π q s s S s S M s S M q s Π q s q + c + γ δ 2 + c M q + c Π + γ 2δ 2 + 2c q π q M q + c + γ Π +δ 2 s Π q s q + c q + c 2δ 2 + 2c M q + c M η q + c + γ 2δ 2 + 2c log +2δ 2 + 2c η 2 log /η +4L 2 +2δ log /η +4L 4 +4γ 1 L +2 γ k ε k by lemma 4 by 37 by lemma 1 by 37 This completes the proof of theorem 3. Refereces Bertsekas, D. P Dyamic Programmig ad Optimal Cotrol, volume II. thea Scietific, Belmout, Massachusetts, third editio.

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + 62. Power series Defiitio 16. (Power series) Give a sequece {c }, the series c x = c 0 + c 1 x + c 2 x 2 + c 3 x 3 + is called a power series i the variable x. The umbers c are called the coefficiets of