Robust Dynamic Programming for Discounted Infinite-Horizon Markov Decision Processes with Uncertain Stationary Transition Matrice

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) Robust Dynamc rogrammng for Dscounte Infnte-Horzon Markov Decson rocesses wth Uncertan Statonary Transton Matrce Baohua L Department of Electrcal Engneerng Arzona State Unversty Tempe, AZ 85287-5706 Emal: BaohuaL@asueu Jenne S Department of Electrcal Engneerng Arzona State Unversty Tempe, AZ 85287-5706 Telephone: (480) 965-6133 Fax: (480) 965-2811 Emal: S@asueu Abstract In ths paper, fnte-state, fnte-acton, scounte nfnte-horzon-cost Markov ecson processes (MDs) wth uncertan statonary transton matrces are scusse n the etermnstc polcy space Uncertan statonary parametrc transton matrces are clearly classfe nto nepenent an correlate cases It s ponte out n ths paper that the optmalty crteron of unform mnmzaton of the mum expecte total scounte cost functons for all ntal states, or robust unform optmalty crteron, s not approprate for solvng MDs wth correlate transton matrces A new optmalty crteron of mnmzng the mum quaratc total value functon s propose whch nclues the prevous crteron as a specal case Base on the new optmalty crteron, robust polcy teraton s evelope to compute an optmal polcy n the etermnstc statonary polcy space Uner some assumptons, the soluton s guarantee to be optmal or near-optmal n the etermnstc polcy space I INTRODUCTION Dynamc programmng (D) s a computatonal approach to fnng an optmal polcy by employng the prncple of optmalty ntrouce by Rchar Bellman [1] D an approxmate D can solve the problems of Markov ecson processes (MDs) wth accurate knowlege of transton probabltes to obtan an optmal or near-optmal polcy by the algorthms such as value teraton, polcy teraton, evolutonary polcy teraton, approxmate polcy teraton base on Monte Carlo smulaton [2] [5] However, n practce, accurate transton probabltes are very ffcult to obtan Thus, exact solutons va classc D are not attanable, an those algorthms are not applcable to obtan solutons any more Moreover, the estmate transton probabltes may be far away from the true values ue to nose an other ssues assocate wth the estmaton process, or the estmaton error may not be trval that t can result n sgnfcant evaton of the solutons from the optmal values [6] Hence, the ea of set estmaton for transton matrces wth hgh confence can be use to allevate some of the efct from naccurate pont estmaton Wth those uncertan transton matrces, robust ynamc programmng s esre to aress the ssue of esgnng approxmaton metho wth an approprate robustness to exten the power of the Bellman Equaton Representatve efforts n evelopng robust ynamc programmng can be foun n [6] [10] One commonly use prncple of optmalty crteron for robust algorthms s to mnmze the mum value functons for all ntal states, whch s referre to as robust unform optmalty crteron n ths paper Base on ths optmalty crteron, robust value teraton an robust polcy teraton were propose n [6] an [7], respectvely, to obtan a etermnstc, unformly optmal polcy To eal wth uncertan transton matrces, the noton of correlaton n a transton matrx was ntrouce n [6] an [10] n the context of MDs The exstng optmalty crtera an assocate robust algorthms were evelope only for MDs wth nepenent transton matrces In ths paper, fnte-state, fnte-acton MDs wth scounte nfnte-horzon cost s scusse The optmal polcy s constrane n the etermnstc polcy space The transton matrces are assume to be statonary, whch s reasonable when systems are slowly varyng The contrbuton of ths paper s as follows Frst, mathematcally clear an more tractable efntons of nepenent an correlate uncertan statonary transton matrces are prove Usng these newly formulate efntons, robust unform optmalty crteron s not applcable for MDs wth correlate transton matrces Because of the correlaton of uncertan transton matrces, for any gven polcy, there may be no optmal transton matrces such that the value functons of the polcy reach mum unformly for all ntal states Ths makes the comparson among polces meanngless Even f such transton matrces exst for all polces, there may not be an optmal polcy such that ts mum value functons reach mnmum unformly for all ntal states Hence, a new optmalty crteron of mnmzng quaratc total value functon s propose Base on ths crteron, uner a weak conton, there exsts an optmal polcy whch can be optmal non-unformly n ntal states The prevous crteron becomes a specal case of the new crteron Note that the exstng robust value teraton an robust polcy teraton can not guarantee solutons for 1-4244-0706-0/07/$2000 2007 IEEE 96

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) MDs wth correlate transton matrces A new robust polcy teraton s evelope to obtan an optmal polcy n the statonary space Uner some assumptons, the soluton s guarantee to be optmal or near-optmal n the etermnstc polcy space The rest of the paper s organze as follows In secton 2, the optmalty crteron of mnmzng the mum quaratc total value functon s propose an theorems are evelope to show that ths crteron makes robust unform optmalty crteron a specal case an uner some contons, statonary optmal polces are proven to be optmal or near-optmal n the etermnstc polcy space Base on ths optmalty crteron, robust polcy teraton s evelope n secton 3 The paper conclues n secton 4 II OTIMALITY CRITERION USING QUADRATIC TOTAL VALUE FUNCTIONS In ths secton, an MD wth uncertan parametrc transton matrx s formulate Base on ths formulaton, we present why an optmal soluton may be senstve to transton probabltes After that, uncertan parametrc transton matrces are classfe nto nepenent an correlate cases Analyss emonstrates why an optmal polcy may not exst uner robust unform optmalty crteron Consequently, the optmalty crteron usng the quaratc total value functon s propose It s proven that robust unform optmalty crteron s a specal case Beses, two theorems an one corollary are prove to show that uner some assumptons, statonary optmal polces are optmal or near-optmal n the etermnstc polcy space Fnte-state, fnte-acton, nfnte-horzon MDs wth statonary transton matrces are escrbe as follows Let T enote the screte, nfnte ecson horzon, where T = {0, 1, 2, } At each stage, the system occupes a state S, where S s the state space wth n states an enote as S = {1, 2,, n} Wth each state S, a ecson maker s allowe to choose an acton a etermnstcally from a fnte state-epenent set of allowable actons, enote by A = {1, 2,, m } Let M = n =1 m, an let a t be a functon mappng states nto actons wth a t () A at the tme t, e, a t : 1 2 n Denote a polcy by π a 1 a 2 a n A 1 A 2 A n (1) π = (a 1, a 2, ), (2) an let Π represent the etermnstc polcy space Denote a statonary polcy by π s π s = (a, a, ), (3) an let Π s represent the etermnstc statonary polcy space Obvously, Π s Π Defne the cost corresponng to state S an acton a A by c(, a) Assume that c(, a) s nonnegatve The costs are tme scounte by a factor γ (0 < γ < 1) The system starts from an ntal state The states make Markov transtons accorng to statonary transton probabltes p a j from one state to another state j uner an acton a All transton probabltes consttute an M n transton matrx ( = 1 1 ) ( a n mn ), (4) where the superscrpt enotes the transpose an a represents a transton probablty row uner the state S an the acton a A In a more general settng, let each transton probablty p a j be represente by a functon of a parameter vector enote as θ = {ϑ 1,, ϑ q }, e, p a j = p a j(θ), j S, (5) where 0 p a j 1 an n j=1 pa j = 1 In [6] [10], the parameters are specfe as the transton probabltes themselves, that s, θ = {p 1,, p j,, p (n 1),, pa 1,, p a j,, p a (n 1),, pmn n1,, pmn nj,, pmn n(n 1) } (6) The transton probablty row a an the transton matrx can also be represente as a functon of θ, e, a = a(θ) an = (θ) Assume the parameter vector θ vary n a known subset Θ R 1 q An the transton matrx s varyng n the set = { : θ Θ} It s not too ffcult to see that optmal value functons may be very senstve to the perturbaton of the parameters n transton matrx Conser a statonary polcy π s efne n (3) For any, v πs = (I γ πs ) 1 C πs, (7) where v πs s the expecte total scounte cost functon uner the polcy π s an the traston matrx, πs an C πs are the n n transton matrx an the n 1 cost functon vector respectvely to π s, e, πs = a(1) 1 a(n) n c(1, a(1)), C πs = c(, a()) c(n, a(n)) (8) Obvously, v πs s contnuous n [11] However, can be scontnuous n θ over Θ, whch may lea to scontnuous value functons v πs n θ Even f s contnuous n θ an therefore v πs s contnuous n Θ, the functon vπs stll may not be smooth enough, that s, a small perturbaton n θ results n relatvely large varaton n v πs Therefore there s a nee to evelop robust solutons uner uncertanty We are now n a poston to ntrouce the concepts of nepenence an correlaton n an πs Defnton (Correlate transton matrx, Inepenent 97

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) transton matrx): The transton matrx s correlate f a mn n, (9) where a = { a (θ) : θ Θ} The transton matrx s nepenent f = a mn n (10) Defnton (Correlate transton matrx for π s, Inepenent transton matrx for π s ): The transton matrx πs s correlate f πs a(1) 1 n a(n), (11) where πs = { πs (θ) : θ Θ} The transton matrx πs s nepenent f πs = a(1) 1 n a(n) (12) () The set a mn n s an M-mensonal hyper-rectangle Matrx s correlate f s a proper subset of ths hyper-rectangle, an s nepenent f s equal to the hyper-rectangle An exact transton matrx s a specal case of an nepenent transton matrx () The set a(1) 1 n a(n) s an n-mensonal hyper-rectangle Matrx πs s correlate f πs s a proper subset of ths hyper-rectangle, an πs s nepenent f πs s equal to the hyper-rectangle () Accorng to the above efntons, MDs are classfe nto those wth nepenent transton matrces an those wth correlate transton matrces The optmalty crteron of mnmzng the mum value functon for any ntal state s gven as follows mn π Π vπ () S, (13) where v π () s the expecte total scounte cost functon uner the polcy π Π an the transton matrx at the ntal state, e, v π () = E π ( γ t c(j, a) ) (14) t=0 Accorng to the optmalty crteron gven n (13), an optmal polcy s efne as follows Defnton (Optmal polcy): A polcy π s optmal f there exsts such that v π () = vπ () = mn π Π vπ () S, (15) where there exsts θ Θ such that = (θ ) For MDs wth nepenent transton matrces, a etermnstc optmal polcy has been proven exstent [6], [7] However, for MDs wth correlate transton matrces, an optmal polcy oes not always exst The efnton of an optmal polcy gven by (15) can be nterprete as the contons for the exstence of an optmal polcy: () for any gven π Π, there s, at least one such that v π () = vπ () for any S, where from here on epens on π; () there s at least one π Π such that v π () = mn π Π vπ () for any S Such an optmal polcy s unformly optmal for ntal states For MDs wth correlate transton matrces, the above contons are too strong to be satsfe, whch can be shown n the followng example Example: conser a two-state, two-acton, nfnte-horzon MD Let the state space be S = {1, 2} The acton spaces at state 1 an state 2 are A 1 = {1, 2} an A 2 = {1, 2} Accorng to (1) an (3), all four statonary polces n Π s are efne by π 1, π 2, π 3, π 4 as follows, π 1 = (,, ), (16) π 2 = (,, ), (17) π 3 = (,, ), (18) π 4 = (,, ) (19) Let the transton matrx have the followng formulaton 1 1 θ θ 1 = 1 2 2 1 = θ 2 1 θ 2 1 θ 2 2 2 1 θ1 2, (20) 1 θ 3 θ 3 an the cost functons are as follows, c(1, 1) = 1, c(1, 2) = 2, c(2, 1) = 3, c(2, 2) = 4 (21) The scount factor γ s chosen at 09 Let Ω = {0, 02, 04, 06, 08, 1} Let θ = {θ 1, θ 2, θ 3 } an Θ = {θ : θ 1, θ 2, θ 3 Ω} The transton matrx s correlate, that s, where 2 1 1 2, (22) = 2 1 = = {(0, 1), (02, 08), (04, 06), (06, 04), (08, 02), (1, 0)}, (23) 1 2 = {(1, 0), (096, 004), (084, 016), (064, 036), Actually, by (8), (036, 064), (0, 1)} (24) π2 = π3 = π4 = θ θ 1, (25) 1 θ 3 θ 3 θ2 1 θ 2 1 θ1 2 θ1 2, (26) θ2 1 θ 2 (27) 1 θ 3 θ 3 By (12), π2, π3 an π4 are nepenent an the mum value functons are reachable However, for π 1, θ θ π1 = 1 1 θ1 2 θ1 2 (28) 98

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) By (11), t s correlate an then there s no such that the mum value functons for all ntal states are reachable Thus, by the optmalty crteron efne n (13), an optmal polcy oes not exst Thus, a new optmalty crteron s neee an ts corresponng optmal polcy also nee to be efne For any fxe transton matrx, the quaratc total value functon for a polcy π s efne as follows, v π 2 = (v π ) v π, (29) where n terms of the value functon efne n (14) v π = ( v π (1) vπ () vπ (n) ) (30) The new optmalty crteron s to mnmze the mum quaratc total value functon, e, mn π Π vπ 2 (31) Defnton (Optmal polcy): A polcy π s optmal f v π 2 = vπ 2 = mn π Π vπ 2 = mn π Π vπ 2 (32) () Base on the optmalty crteron usng the quaratc total value functon, an optmal polcy can be non-unformly optmal n ntal states () The exstence of such an optmal polcy s guarantee uner a weak conton For any π, s easly satsfe to make the mum value reachable For example, the set s compact an close an the functon v π 2 s contnuous over Even f the mum an mnmum values are not reachable, an mn can be replace by nf an sup, respectvely, an an near-optmal polcy can be easly propose () If the cost functon s also parameterze n θ, the optmalty crteron can be mofe usng the quaratc value functon wth the cost functon represente as c(, a, θ) (v) If for any polcy π, the probablty strbuton of the ntal states s non-unform, or the value functons for fferent ntal states have fferent weghts, the optmalty crteron can be mofe usng weghte quaratc total value functon v π 2 W = (vπ W v π ) Actually, the optmalty crteron efne by (31) generalzes the unform optmalty crteron efne by (13), whch s shown n the followng Theorem 1 Theorem 1: If an optmal polcy exsts wth regar to the optmalty crteron efne n (13), then ths optmalty crteron s equvalent to the one efne n (31) That s, f an optmal polcy exsts, a polcy uner the optmalty crteron efne n (13) s optmal f an only f t s optmal uner the optmalty crteron efne n (31) roof: Snce an optmal polcy exsts uner the optmalty crteron efne n (13), for any π Π, there exsts such that v π () = vπ () S (33) an let π 1 be an optmal polcy such that v π1 () = vπ1 Snce v π () 0, by (33), () = mn π Π vπ () S (34) (v π ()) 2 = (vπ ()) 2 = v π 2, (35) an then by (34), mn (v π ()) 2 = mn π Π π Π (vπ ()) 2 = v π1 2 (36) : By (35) an (36), v π1 2 = (v π1 ())2 = mn π Π (v π ()) 2 (37) Hence, π 1 s an optmal polcy uner the optmalty crteron efne n (31) : Let π 2 be an optmal polcy uner the optmalty crteron efne n (31) Then, for any π Π, there exsts Q such that vq π 2 = ( v π Q () ) 2 = (v π ()) 2, (38) an for π 2, Q 2 = ( By (35), for any π Π, an then by (36), Snce an then Q () ) 2 = mn π Π (v π ()) 2 (39) v π Q 2 = (vπ ()) 2, (40) Q 2 = mn π Π (vπ ()) 2 (41) 2 Q () (vπ2 ())2, 2 Q () = (vπ2 ())2, (42) Q () = vπ2 () (43) Snce (vπ2 ())2 mn π Π (vπ ())2, 2 Q () = mn π Π (vπ ()) 2, (44) an then Hence, Q () = Q vπ2 () = mn π Π vπ () (45) () = mn π Π vπ2 () S (46) That s to say, π 2 s an optmal polcy uner the optmalty crteron efne n (13) 99

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) Generally, an optmal polcy uner the optmalty crteron usng the quaratc total value functon may be non-statonary However, uner some assumptons, statonary optmal polces are optmal or ɛ-optmal n the polcy space Π Theorem 2: If mn vπ 2 = mn v π 2, (47) statonary optmal polces are optmal n the polcy space Π roof: For any fxe, snce mn v π π Π 2 = s mn π Π vπ 2, by (47), mn vπ 2 = mn v π 2 = mn π Π vπ 2 (48) Snce by weak ualty, mn π Π vπ 2 mn π Π vπ 2, mn vπ 2 mn π Π vπ 2 (49) Snce mn vπ 2 mn π Π vπ 2, mn vπ 2 = mn π Π vπ 2 (50) That s to say, statonary optmal polces are also optmal n Π Assume that Θ s a compact an close subspace of the vector space R 1 q an for any π Π, the functon v π (θ) 2 s contnuous n θ Then, gven an arbtrarly small constant ɛ > 0, for any θ 0 Θ, there exsts δθ π 0 > 0, such that v π (θ) 2 v π (θ 0) 2 < ɛ, θ B π θ0 Θ (51) where Bθ π 0 = {θ : θ θ 0 < δθ π 0 } Snce Bθ π 0 Θ, (52) θ 0 Θ by Hene-Borel theorem, there exst fnte balls to cover Θ Let those selecte balls be Bθ π 1,, Bθ π j,, Bθ π r an π = { (θ j ) : 1 j r}, where θ j an r epen on π An then = π (53) π Π Theorem 3: Wth the above assumptons an notatons, f for any π Π s, mn v π (θ) (θ) 2 = (θ) mn v π (θ) π Π 2, (54) s then statonary optmal polces on are ɛ-optmal an statonary optmal polces on are also ɛ-optmal n the polcy space Π roof: Wthout loss of generalty, for any π Π, let (θ ), (θ 1 ) an (θ 2 ) π such that v π (θ ) 2 = v π (θ 1) 2 = (θ) vπ (θ) 2, (55) v π (θ) (θ) 2, (56) 0 v π (θ ) 2 v π (θ 2) 2 < ɛ (57) Snce v π (θ 2) 2 v π (θ 1) 2 v π (θ ) 2, Then, 0 (θ) vπ (θ) 2 v π (θ) (θ) 2 < ɛ (58) 0 mn π Π (θ) vπ (θ) 2 mn v π (θ) π Π (θ) 2 ɛ (59) By (54) an Theorem 2, Hence, mn v π 2 = mn v π 2 (60) (θ) π Π (θ) 0 mn π Π (θ) vπ (θ) 2 mn v π (θ) (θ) 2 ɛ, (61) that s to say, statonary optmal polces on are ɛ-optmal n Π Snce 0 mn (θ) vπ (θ) 2 mn v π (θ) (θ) 2 ɛ, 0 mn (θ) vπ (θ) 2 mn π Π = mn (θ) vπ (θ) 2 mn (θ) vπ (θ) 2 v π (θ) (θ) 2 (62) + mn v π (θ) (θ) 2 mn π Π (θ) vπ (θ) 2 ɛ (63) Hence, statonary optmal polces on are ɛ-optmal n Π () If for any fxe π Π, the functon v π (θ) 2 s Lpschtz wth the constant L π that epens on π, that s, v π (θ 1) 2 vθ π 2 2 L π θ 1 θ 2, θ 1, θ 2 Θ (64) then the parameter δθ π 0 can be chosen as follows, δθ π 0 = δ π ɛ = L π + 1, θ 0 Θ (65) That s to say, δθ π 0 s nepenent of θ 0 () If the functon v π (θ) 2 s contnuously fferentable wth respect to θ, t s also Lpschtz an the constant L π can be chosen as follows, > L π v π (θ) 2 θ Θ θ (66) () The smaller the L π s, the larger the δ π shoul be, an then the smaller the carnalty of π s, whch can make the conton (54) relatvely easy to satsfy (v) If for some polces, the constants L π are equal, the same balls Bθ π j can be selecte to cover Θ an also ther sets π are equal, whch may reuce the carnalty of an make the conton (54) satsfe more easly When for all π Π, the functons v π (θ) 2 are Lpschtz wth the constant L that s nepenent of π, then the parameters δθ π 0 can be nepenent of not only θ 0 but also of π, that s, δθ π 0 = δ = ɛ L + 1, θ 0 Θ, π Π, (67) 100

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) an consequently, the balls Bθ π 0 are nepenent of π, that s, B π θ 0 = B θ0 = {θ : θ θ0 < δ}, π Π (68) Hence, for all π, the same balls Bθ π j can be selecte to cover Θ an also all π are equal Thus, the set becomes fnte, that s, = f = { (θj ) : 1 j r}, where θ j an r are nepenent of π Corollary 1: Wth the above assumptons an notatons, f mn v π 2 = mn v π 2, (69) (θ) f (θ) f then statonary optmal polces on f are ɛ-optmal an statonary optmal polces on are also ɛ-optmal n the polcy space Π roof: By (69) an Theorem 3, statonary optmal polces on f are ɛ-optmal an statonary optmal polces on are also ɛ-optmal n Π () If the contnuous ervatves of the functons v π (θ) 2 are unformly boune for all π Π, the Lpschtz constant L can be chosen as follows, v π (θ) > L sup 2 π Π θ Θ θ (70) () The smaller the L s, the larger the δ shoul be, an consequently the smaller the carnalty of f s, whch can make the conton (69) relatvely easy to satsfy III ROBUST OLICY ITERATION UNDER QUADRATIC TOTAL VALUE FUNCTION Base on the optmalty crteron of mnmzng the mum quaratc total value functon gven n (31), a robust polcy teraton s evelope to fn a statonary optmal polcy for MDs Even though ths soluton s generally suboptmal n the etermnstc polcy space Π, accorng to Theorems 2, 3, an Corollary 1, ths statonary optmal polcy can be guarantee to be optmal or ɛ-optmal n Π In ths secton, we conser statonary polces For smplcty, the subscrpt s whch ncates assocaton wth statonary polces, s roppe out olcy teraton nclues polcy evaluaton an polcy mprovement steps Uner the polcy evaluaton step, for any polcy π = (a, a, ), the mum quaratc total value functon s expresse as follows n terms of (7) v π 2 = vπ 2 = (Cπ ) ( I γ ( π ) ) 1 (I γ π ) 1 C π (71) The mum value v π 2 an the optmal are to be calculate Such approaches are avalable to compute these values Actually, f the transton matrx for the polcy π s nepenent, an effcent teratve metho gven n Algorthm 1 can be use to compute v π 2 () Gven a polcy π, for the mum pont, the transton Algorthm 1 Iteratve Algorthm for mum quaratc total value functon of a statonary polcy wth ts nepenent transton matrx 1 select v π 0 Rn 1 an set k = 0; 2 compute v π k+1 by vk+1 π () = c(, a()) + γ vk π, S; (72) 3 termnate f vk+1 π = vπ k ; otherwse, ncrement k by 1 an go to 2; 4 output vk π 2 as v π 2 probablty rows for ( all states ) an the corresponng actons, a(), enote by are compute as follows, { } = arg, S, (73) an there are no specal constrants for other rows () A proof of convergence of the teratve algorthm uner the total value functon can be obtane smlar to that n [6], [7] () In prncple, the ntal value v0 π can be selecte arbtrarly n R n 1 However, a carefully selecte v0 π can accelerate the convergence of the teratve process To see that, let an G 1 = {v : v() c(, a()) + γ G 2 = {v : v() c(, a()) + γ v π k v} (74) v} (75) If v π 0 G 1, the sequence {v π k } s non-ecreasng If vπ 0 G 2, the sequence {v π k } s non-ncreasng Thus, the sequence {vπ k } converges to v π faster than those wth vπ 0 / G 1 G 2 For the polcy mprovement step, the polcy can be mprove easly for MDs wth nepenent transton matrces usng robust polcy teraton [7] However, such mprovement metho oes not guarantee solutons for MDs wth correlate transton matrces Hence, the polcy elmnaton metho s consere heren as a means of mprovng polcy The polcy elmnaton process entals elmnaton of some non-optmal polces urng each teraton by a necessary conton that a statonary polcy s optmal The nequalty v µ 2 v πk 2 gven n (77) s such a necessary conton for a statonary polcy µ beng optmal, where v µ 2 s the quaratc total value functon of µ at, v πk 2 s the mum quaratc total value functon of π k an s the corresponng optmal matrx Robust polcy teraton uner quaratc total value functon s escrbe n etals n Algorthm 2 () Robust polcy teraton uner quaratc total value functon termnates n fnte teratons when the carnalty of the set Π k, enote by Π k n Algorthm 2, s equal to one, snce the carnalty of Π 0 s fnte an Π k+1 Π k (k = 0, 1, 2, ) The sequence {π k } converges to a statonary optmal polcy () A goo ntal polcy π 0, whch has relatvely small total value functon, can accelerate the convergence process 101

roceengs of the 2007 IEEE Symposum on Approxmate Dynamc rogrammng an Renforcement Learnng (ADRL 2007) Algorthm 2 Robust olcy Iteraton Uner Quaratc Total Value Functon 1 Intalzaton: set k = 0, Π 0 = Π s, O 1 = +, {π 1 } =, an select π 0 Π 0 ; 2 olcy evaluaton: compute v π k 2 an such that 3 olcy mprovement: f Π k > 1 (a) elmnate polces to obtan Πk such that v π k 2 = vπ k 2 ; (76) Π k = {µ Π k : v µ 2 v π k 2 }; (77) f Πk > 1 f v π k 2 < O k 1 (b) set O k = v π k 2, Πk = Πk {π k 1 }; (c) f Πk = 1, set Π k+1 = Πk = {π k }, π k+1 = π k, O k+1 = O k = v π k 2, ncrement k by 1, an go to 4; otherwse, go to (); () set Π k+1 = Πk, select π k+1 Π k+1 {π k }, ncrement k by 1, an return to 2; else (e) f Πk {π k } {π k 1 } 1, select π k Πk {π k } {π k 1 }, set Π k = Πk {π k }, π k = π k, an return to 2; otherwse, set Π k = Πk {π k } = {π k 1 }, π k = π k 1, O k = O k 1 = v π k 1 2 ; else (f) set Π k+1 = Πk = {π k }, π k+1 = π k, O k+1 = v π k 2, ncrement k by 1, an go to 4; else (g) go to 4; 4 Termnaton: output π k as the statonary optmal polcy wth the corresponng optmal transton matrx an mum quaratc total value functon O k [3] Hyeong Soo Chang, Hong-G Lee, Mchael C Fu, an Steven I Marcus, Evolutonary olcy Iteraton for Solvng Markov Decson rocesses, IEEE Transactons on Automatc Cntrol, vol 50, n 11, pp 1804-1808, 2005 [4] D Berstsekas an J Tstskls, Neuro-Dynamc rogrammng, Athena Scentfc Massachusetts, 1996 [5] J S, A G Barto, W B owell an D Wunsch, Hanbook of Learnng an Approxmate Dynamc rogrammng, Wley-IEEE ress, 2004 [6] A Nlm an L E Ghaou, Robust Control of Markov Decson rocesses wth Uncertan Transton Matrces, Operatons Research, vol 53, n 5, pp 780-798, 2005 [7] J K Sata an R E Lave, Jr, Markov Decson rocesses wth Uncertan Transton robabltes, Operatons Research, vol 21, pp 728-740, 1973 [8] C C Whte an H K Eleb, Markov Decson rocesses wth Imprecse Transton robabltes, Operatons Research, vol 42, pp 739-749, 1994 [9] R Gvan, S Leach, an T Dean, Boune arameter Markov Decson rocesses, Artfcal Intellgence, vol 122, no 1-2, pp 71-109, 2000 [10] S Kalyanasunaram, E K Chong, an N B Shroff, Markov Decson rocesses wth Uncertan Transton Rates: Senstvty an Robust Control, roceengs of the 41st IEEE Conference on Decson an Control, vol 4, pp 3799-3804, 2002 [11] M Abba, an J A Flar, erturbaton an Stablty Theory for Markov Control roblems, IEEE transatons on Automatc Control, vol 37, pp 1415-1420, 1992 () Robust polcy teraton uner quaratc total value functon can be extene to solvng problems wth parameterze cost functons an weghte quaratc total value functons (v) Assume that the set Θ s compact an close n R 1 q, an f for any fxe π an (θ), the quaratc total value functon can be compute wthn arbtrarly small error boun, then Algorthm 2 can be extene to obtan a statonary ɛ-optmal polcy IV CONCLUSION A theoretcal framework s propose to stuy MDs wth uncertantes n the transton probablty matrces As a result, the paper conclues wth a robust polcy teraton proceure uner a newly propose total value functon, whch guarantees optmal or near-optmal solutons Such a robust construct may be especally useful n practcal applcatons when there s too lmte amount of expermental ata to obtan an accurate transton probablty matrx, or when an accurate estmaton of the probablty transton matrx s unrealstc The propose framework s straghtforwar to apply n practcal problems REFERENCES [1] Bellman an R kalaba, Dynamc rogrammng an Moern Control Theory, New York: Acaemc ress, 1965 [2] M utterman, Markov Decson rocesses: Dscrete Stochastc Dynamc rogrammng, Wley-Interscnce, New York, 1994 102