A Fast and Reliable Policy Improvement Algorithm
A Fast and Reliable Policy Improvement Algorithm

Yasin Abbasi-Yadkori (Queensland University of Technology), Peter L. Bartlett (UC Berkeley and QUT), Stephen J. Wright (University of Wisconsin-Madison)

Abstract

We introduce a simple, efficient method that improves stochastic policies for Markov decision processes. The computational complexity is the same as that of the value estimation problem. We prove that when the value estimation error is small, this method gives an improvement in performance that increases with certain variance properties of the initial policy and transition dynamics. Performance in numerical experiments compares favorably with previous policy improvement algorithms.

1 Introduction

Markov decision problems (MDPs) are sequential decision problems where the loss has memory (also known as state). The objective is to find a policy, a mapping from states to actions, that yields a high discounted cumulative reward. In large-state problems, finding an optimal policy is challenging and one has to resort to approximations. Unfortunately, many approximate MDP algorithms do not always improve monotonically. We propose a computationally efficient algorithm and show that it generates a sequence of increasingly better policies.

We consider MDPs with finite state and action spaces, and a reward function r defined on the state space. The distribution of the state at time t+1 is a function of the state x_t and action a_t at the previous time t. We define a transition matrix P, with rows indexed by state-action pairs and columns indexed by subsequent states, so that P_{(x_t, a_t)} is the vector of probabilities of the state x_{t+1}. A policy π is a mapping from states to probability distributions over actions. We write π(a|x) for the probability of action a in state x under policy π. (We also use π(x_t) to denote the random action a_t distributed according to π(·|x_t).)

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.
For a starting state x_0, the value function corresponding to π is defined by

    V^π(x_0) = E[ Σ_{t=0}^∞ γ^t r(x_t) ],    (1)

where γ ∈ (0, 1) is a discount factor, x_t is the state at time t, and a_t ∼ π(·|x_t). The expectation is over the stochasticity in the policy and in the evolution of states. The objective is to find a policy π such that the total cumulative reward V^π(x_0) is near-optimal. (The optimal policy is the one for which V^π(x_0) is maximized.) We assume that the reward function is bounded in [0, (1−γ)b] for some b ∈ (0, 1).

There is a vast literature on Markov decision problems and reinforcement learning (RL) (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). Dynamic programming (DP) algorithms, such as value iteration and policy iteration, are standard techniques for computing the optimal policy. In large state space problems, exact DP is not feasible, because the computational complexity scales at least quadratically with the number of states. In such problems, the optimal value function can be approximated with a linear combination of a small number of features, with the understanding that searching in this low-dimensional subspace is easier than solving the original problem. Unlike exact DP, approximate DP does not necessarily improve the policy in each iteration (Kakade and Langford, 2002).

Given a stochastic policy π, our method finds an estimate V̂^π for its value, and returns an improved policy π̃ such that V^{π̃}(x_0) ≥ V^π(x_0) − E(V̂^π, V^π) + Δ, for some policy evaluation error E(V̂^π, V^π) and some positive scalar Δ. Our performance bounds are composed of a policy evaluation (PE) error term and a positive policy improvement (PI) term. The main advantage of both our method and CPI, by comparison with API, is that we can obtain strict policy improvement as long as the PI term is bigger than the PE term. If the PE error is very large, our algorithm might fail to improve the policy. The same is true of the CPI approach of
Kakade and Langford (2002). Value estimates, however, are needed only at the states that the agent visits under the policy. Estimates can be obtained by performing roll-outs from the current state. By choosing the number of roll-outs appropriately, we can control the accuracy of these estimates, and thus ensure policy improvement. For API, the performance is only guaranteed to not degrade by more than the PE error. The policy π̃ is randomized, assigning larger probabilities to actions with larger value estimates.

The closest to our work is the Conservative Policy Iteration (CPI) of Kakade and Langford (2002), which uses an approximate greedy update. Pirotta et al. (2013) study several extensions of CPI. Thomas et al. (2015) propose a different approach that guarantees safe policy improvement, but the computational complexity of their method is high.

Our contributions are as follows: (1) We propose a policy iteration scheme that makes a step towards the greedy policy; however, unlike CPI, the mixture coefficients are state-dependent, and unlike Pirotta et al. (2013), these state-dependent coefficients can be computed efficiently. (2) We analyze the proposed algorithm and show that its performance improvement is larger than that of CPI. While the improvement in CPI has the form of the quadratic of an expectation, our improvement has the form of the expectation of quadratics. Moreover, the mixture coefficients can be significantly larger in our updates, making our algorithm practical while guaranteed to improve the initial policy. (3) We study the proposed algorithm numerically on chain-walk and inverted-pendulum benchmarks, showing that it performs well in these domains.

1.1 Notation

The expectation of a random variable z with respect to a distribution v is denoted by E_v z = Σ_p v(p) z(p), where the summation is over the countable domain of z. For a policy π, we write E_{π(·|x)} z = Σ_a π(a|x) z(x, a) and E_{π(·|x)} P z = Σ_a π(a|x) P_{(x,a)} z. Similarly, Var_{π(·|x)} z = E_{π(·|x)} z² − (E_{π(·|x)} z)². Variables z and y can be scalars, vectors, or matrices.
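As a small concrete check of this notation (a sketch; the function names are mine, not from the paper), the per-state expectation and variance under a policy are just probability-weighted sums over actions:

```python
import numpy as np

def policy_expectation(pi_x, z_x):
    """E_{pi(.|x)} z = sum_a pi(a|x) z(x, a)."""
    return pi_x @ z_x

def policy_variance(pi_x, z_x):
    """Var_{pi(.|x)} z = E_{pi(.|x)} z^2 - (E_{pi(.|x)} z)^2."""
    mean = pi_x @ z_x
    return pi_x @ z_x**2 - mean**2
```

Here `pi_x` is the probability vector π(·|x) and `z_x` is the vector of values z(x, ·) over actions.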
We use P^π to denote the probability transition matrix under policy π. We use L^π to denote the Bellman operator: for any V ∈ R^X,

    (L^π V)(x) = Σ_a π(a|x) (r(x, a) + γ P_{(x,a)} V).

2 Algorithm

We assume that the reward is independent of the action. From here on, we use r(x) to represent r(x, a), since the reward is independent of a ∈ A. (For any V ∈ R^X, P_{(x,a)} V = Σ_{x'} P(x'|x, a) V(x').) Fix a constant b < 1 and scale rewards such that r(x) ∈ [0, (1−γ)b]. This implies that V^π(x) ∈ (0, b) for any policy π and state x.

We say a function V ∈ R^X is a consistent value estimate if for any state x,

    min_a (r(x) + γ P_{(x,a)} V) ≤ V(x) ≤ max_a (r(x) + γ P_{(x,a)} V).

Let π be an arbitrary policy. Let V^π ∈ R^X be the value of π. Let V̂^π ∈ R^X be an approximation of V^π, and define Q̂^π(x, a) = r(x) + γ P_{(x,a)} V̂^π. First, check if V̂^π is a consistent value estimate:

    min_a Q̂^π(x, a) ≤ V̂^π(x) ≤ max_a Q̂^π(x, a).    (2)

If (2) holds, find a policy ν such that

    V̂^π(x) = E_{ν(·|x)} Q̂^π(x, ·) + Var_{ν(·|x)} Q̂^π(x, ·).    (3)

Otherwise find a policy ν such that

    E_{π(·|x)} Q̂^π(x, ·) = E_{ν(·|x)} Q̂^π(x, ·) + Var_{ν(·|x)} Q̂^π(x, ·).    (4)

Equation (4) always has a solution ν. If we choose ν = π, then the LHS is no more than the RHS. On the other hand, if ν assigns all the probability mass to argmin_a Q̂^π(x, a), then Var_{ν(·|x)} Q̂^π(x, ·) = 0 and the LHS is no less than the RHS. As the RHS is a continuous function of ν, the above equation has a solution, and at least one solution is a convex combination of π(·|x) and 1{a = argmin_a Q̂^π(x, a)}. Similarly, (3) has a solution under condition (2). Because of monotonicity, the solution can be found efficiently by binary search.

Let Δ^π(x, a) = Q̂^π(x, a) − E_{ν(·|x)} Q̂^π(x, ·) and π̄(a|x) = ν(a|x)(1 + Δ^π(x, a)). Inclusion of the term E_{ν(·|x)} Q̂^π(x, ·) ensures that the probabilities sum to one: Σ_{a∈A} π̄(a|x) = 1 for all x ∈ X. In the absence of estimation error, that is, V̂^π = V^π, it can be shown that L^{π̄} V^π = V^π. (See Lemma 2.) Although π̄ might be different from π, it has the same value function: V^{π̄} = V^π.

Let F(π̄) = max_{x,a} |Δ^π(x, a)|. Choose s = 1/F(π̄) and define the policy

    π̃(a|x) = ν(a|x)(1 + s Δ^π(x, a)).    (5)
If Δ^π(x, a) = 0 for all x and a, we use the convention that 0 · 1/0 = 0. If we do not have access to a good estimate of F(π̄), choose s = 1/(γ max_x V̂^π(x)). This ensures that γ (P_{(x,a)} V̂^π) s ≤ 1. If we do not have access to a good estimate of max_x V̂^π(x), then we can use the more conservative choice of s = 1/(γb). In practice, when estimating F(π̄) and max_x V̂^π(x) is hard, we start from a large value of s and decrease it when we observe a negative π̃ value.
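Condition (4) can be solved per state by the binary search described above, over the convex combination of π(·|x) and the argmin indicator. The following is a minimal sketch under my own naming (not the authors' code): `q` is the vector Q̂^π(x, ·), `pi` is π(·|x), and `target` is E_{π(·|x)} Q̂^π(x, ·); the bracketing and monotonicity follow the argument in the text.

```python
import numpy as np

def solve_nu(q, pi, target, tol=1e-10):
    """Binary search for nu solving E_nu[q] + Var_nu[q] = target (Eq. (4)),
    over nu = (1 - beta) * pi + beta * e_amin, beta in [0, 1]."""
    e_min = np.zeros_like(pi)
    e_min[np.argmin(q)] = 1.0          # point mass on argmin_a q

    def rhs(beta):
        nu = (1.0 - beta) * pi + beta * e_min
        mean = nu @ q
        return mean + nu @ (q - mean) ** 2   # E_nu q + Var_nu q

    lo, hi = 0.0, 1.0                  # rhs(0) >= target >= rhs(1)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rhs(mid) >= target:         # rhs decreases in beta
            lo = mid
        else:
            hi = mid
    return (1.0 - lo) * pi + lo * e_min
```

For a two-action state with q = (1, 0) and uniform π, this returns the ν with E_ν q + Var_ν q = 1/2, i.e. ν(a₁|x) = (2 − √2)/2.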
Input: Policy π, constant s;
for t = 1, 2, ... do
    Observe state x_t;
    Estimate V̂^π(x_t) and Q̂^π(x_t, a) for a ∈ A;
    if Inequality (2) holds then
        Obtain ν(·|x) such that (3) is satisfied;
    else
        Obtain ν(·|x) such that (4) is satisfied;
    end if
    Take action a sampled according to π̃(·|x_t) := ν(·|x_t)(1 + s Δ^π(x_t, ·)), where s is defined in the text;
end for

Figure 1: Linearized Policy Improvement Algorithm.

Input: Initial policy π_1, constant s, time horizon T;
for i = 1, 2, ..., I do
    Run policy π_i for T steps;
    Estimate V̂^{π_i};
    Obtain ν from (3) or (4);
    Define the new policy for all x, a: π_{i+1}(a|x) := ν(a|x)(1 + s Δ^{π_i}(x, a)), where s is defined in the text;
end for

Figure 2: Iterative LPI Algorithm.

The definition of s ensures that π̃(a|x) ≥ 0 for all x and a. In summary, we reshape policy π and obtain π̄, which has the same value function. Then π̃(·|x) is obtained by increasing the probability of actions with positive Δ^π(x, a). We calculate ν(·|x) only when we visit state x, so we do not need to perform these calculations for all states beforehand. We call the resulting algorithm the LPI algorithm, for Linearized Policy Improvement. Pseudo-code of the algorithm is given in Figure 1.

Let I(V) be the set of states for which V is a consistent value estimate. In Theorem 4, we show that for any starting state distribution c ∈ R^X, we have

    c^T V^{π̃} ≥ c^T V^π − c^T |V̂^π − V^π| + B(s−1)/2 − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |E_{π̄(·|x)} Q̂^π(x, ·) − V̂^π(x)|,    (6)

where v_{π̃,c}^T = c^T Σ_{t=0}^∞ γ^t (P^{π̃})^t and B = 2 Σ_x v_{π̃,c}(x) Var_{ν(·|x)} Q̂^π(x, ·). In particular, if V̂^π = V^π, then

    c^T V^{π̃} ≥ c^T V^π + B(s−1)/2.

All quantities on the RHS can be estimated by roll-outs, which provides an efficient way to estimate the policy improvement.

We can iterate the procedure of Figure 1 to improve the policy. The resulting algorithm, called Iterative LPI or ILPI, is shown in Figure 2.

Our update rule (5) has similarities with the CPI rule of Kakade and Langford (2002), although it is not a convex combination of the current policy and the greedy policy. Also, unlike CPI, our update is nonuniform across the state space.
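In code, one step of update (5) at a single state is a multiplicative tilt of ν. This sketch (my naming, not the paper's implementation) also mirrors the practical heuristic from the text of decreasing s when a negative probability appears:

```python
import numpy as np

def lpi_update(nu, q, s):
    """Update (5): pi_tilde(a|x) = nu(a|x) * (1 + s * Delta(x, a)),
    with Delta(x, a) = Q(x, a) - E_nu Q. Halve s until all
    probabilities are nonnegative, as suggested in the text."""
    delta = q - nu @ q               # centering makes the result sum to 1
    while np.any(nu * (1.0 + s * delta) < 0.0):
        s *= 0.5
    return nu * (1.0 + s * delta)
```

With ν uniform over two actions, q = (1, 0), and s = 1, this tilts the policy to (0.75, 0.25): probability mass moves toward the action with the larger Q̂ value while the probabilities still sum to one.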
Update rule (5) makes small changes to the current policy when there are small differences in Q̂^π values, and larger changes when the differences in Q̂^π values are more substantial. Interestingly, our theorem also reflects this; our theoretical improvement is more significant compared to CPI when differences in Q̂^π values vary across the state space. (See Section 2.3, where we show that our algorithm enjoys stronger performance guarantees.)

Policy improvement in (6) depends on the error in estimating the value of the previous policy π. An effective way to keep this error small is to perform roll-outs in states that we visit under policy π̃. Unfortunately, the computational cost increases exponentially with the number of iterations I in the ILPI algorithm, making this approach effective only when I is small. An alternative approach, which we use in our inverted-pendulum experiments in Section 3, is to estimate V^{π_i} by a linear combination of columns of a feature matrix: V̂^{π_i} = Φθ, where Φ ∈ R^{X×d} is a feature matrix and θ ∈ R^d is a parameter vector. For example, we can use the value iteration algorithm to estimate θ: θ_0 = 0,

    θ_{k+1} = ( Σ_{x∈S} Φ(x)^T Φ(x) )^{-1} Σ_{x∈S} Φ(x)^T t_k(x, π_i(x)),

where S is a set of states visited while running policy π_i and t_k(x, a) = r(x) + γ P_{(x,a)} Φθ_k. Notice that in our performance guarantee (6), there is no estimation error in states where V̂^{π_i} is a consistent value estimate. For this reason, we propose the following modified procedure, where target values are thresholded with appropriate min/max values:

    θ_{k+1} = ( Σ_{x∈S} Φ(x)^T Φ(x) )^{-1} Σ_{x∈S} Φ(x)^T y_k(x, π_i(x)),
where y_k(x, a) = r(x) + γ P_{(x,a)} z_k and

    z_k(x) = { min_a t_k(x, a)   if Φ(x)θ_k < min_a t_k(x, a);
               max_a t_k(x, a)   if Φ(x)θ_k > max_a t_k(x, a);
               Φ(x)θ_k           otherwise. }

2.1 Analysis

In this section, we show a performance bound for the LPI algorithm. We start with a useful lemma that expresses the objective c^T V^{π'} in terms of c^T V and a Bellman error. The lemma is from Kakade and Langford (2002). Its proof can also be extracted from the proof of Theorem 1 of de Farias and Van Roy (2003).

Lemma 1 (Kakade and Langford (2002)). Fix a policy π' and vectors V, c ∈ R^X. Let P^{π'} denote the probability transition kernel under policy π'. Define the measure

    v_{π',c}^T = c^T Σ_{t=0}^∞ γ^t (P^{π'})^t = c^T (I − γ P^{π'})^{-1}.    (7)

We have

    c^T V^{π'} = c^T V + v_{π',c}^T (L^{π'} V − V).    (8)

Lemma 2. Consider the policy π̄(a|x) = ν(a|x)(1 + Q̂^π(x, a) − E_{ν(·|x)} Q̂^π(x, ·)). Under Condition (3), we have (L^{π̄} V̂^π)(x) = V̂^π(x), and under Condition (4), we have (L^{π̄} V̂^π)(x) = E_{π(·|x)} Q̂^π(x, ·).

Proof. First consider Condition (4). We want to show that, for a state x,

    E_{π(·|x)} Q̂^π(x, ·) = Σ_a ν(a|x)(1 + Q̂^π(x, a) − E_{ν(·|x)} Q̂^π(x, ·)) Q̂^π(x, a).

Expanding the RHS,

    Σ_a ν(a|x)(1 + Q̂^π(x, a) − E_{ν(·|x)} Q̂^π(x, ·)) Q̂^π(x, a)
        = E_{ν(·|x)} Q̂^π(x, ·) + E_{ν(·|x)} Q̂^π(x, ·)² − (E_{ν(·|x)} Q̂^π(x, ·))²
        = E_{ν(·|x)} Q̂^π(x, ·) + Var_{ν(·|x)} Q̂^π(x, ·),

which equals E_{π(·|x)} Q̂^π(x, ·) by Condition (4). We have a similar argument when Condition (3) holds.

Lemma 3. Let π_w(a|x) = ν(a|x)(1 + Δ^π(x, a) w). Consider the function

    h(w) = c^T (w V̂^π) + v_{π̃,c}^T (L^{π_w}(w V̂^π) − w V̂^π).

Then h(w) = (1/2) B w² + g w + f, where

    f = v_{π̃,c}^T r,
    g = c^T V̂^π − v_{π̃,c}^T V̂^π + γ v_{π̃,c}^T (P^ν V̂^π),
    B = 2 Σ_x v_{π̃,c}(x) Var_{ν(·|x)} Q̂^π(x, ·).

The proof is in Appendix A. The main result of this section is as follows.

Theorem 4. Let I(V) be the set of states such that V is a consistent value estimate (as defined at the beginning of this section). For any starting state distribution c,

    c^T V^{π̃} ≥ c^T V^π − c^T |V̂^π − V^π| + B(s−1)/2 − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |E_{π̄(·|x)} Q̂^π(x, ·) − V̂^π(x)|.

Proof. Recall the definition of π_w and h(w) from Lemma 3. Notice that π_s = π̃.
The function h(w) can be written as

    h(w) = c^T (w V̂^π) + Σ_x v_{π̃,c}(x) ( Σ_a ν(a|x)(1 + w Δ^π(x, a))(r(x) + γ P_{(x,a)} w V̂^π) − w V̂^π(x) ).

We have that

    h(0) = v_{π̃,c}^T r = c^T (I − γ P^{π̃})^{-1} r = c^T V^{π̃},

where the second equality holds by the definition of v_{π̃,c} in Lemma 1. If we set V = s V̂^π and π' = π̃ = π_s, then it is apparent by comparing (8) with the definition of h(·) in Lemma 3 that h(s) = c^T V^{π̃}. Thus, h(0) = h(s). On the other hand,

    h(1) = c^T V̂^π + Σ_x v_{π̃,c}(x) ((L^{π̄} V̂^π)(x) − V̂^π(x))
         ≥ c^T V^π − c^T |V̂^π − V^π| − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |(L^{π̄} V̂^π)(x) − V̂^π(x)|
         = c^T V^π − c^T |V̂^π − V^π| − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |E_{π̄(·|x)} Q̂^π(x, ·) − V̂^π(x)|,

where the terms with x ∈ I(V̂^π) vanish because (L^{π̄} V̂^π)(x) = V̂^π(x) there, and the last step holds by Lemma 2. Because h is convex and 0 < 1 < s, h(1) ≤ h(s).

We can calculate the improvement. Write h in the quadratic form h(w) = (1/2) B w² + g w + f, where B, g, f are defined in Lemma 3. We know that h(s) = h(0) = f. Thus the improvement is h(s) − h(1) = −g − B/2. On the other hand, h(s) = B s²/2 + g s + f = f, and so g = −Bs/2. Thus, h(s) − h(1) = B(s−1)/2, from which the theorem statement follows.
2.2 Choosing s

As Theorem 4 suggests, a bigger value of s gives a bigger policy improvement. On the other hand, the analysis is valid as long as the probabilities π̃(a|x) = ν(a|x)(1 + s Δ^π(x, a)) are positive, and this prevents us from choosing very large values of s. The next corollary relaxes the positivity condition and shows that if these probabilities are negative only in a small subset of the state space, we can still have policy improvement.

Corollary 5. Let G be the set of good states where ν(·|x)(1 + s Δ^π(x, ·)) is positive, and let B̄ = X \ G. Define the policy

    π'_w(a|x) = { ν(a|x)(1 + w Δ^π(x, a))   if x ∈ G;
                  ν(a|x)(1 + Δ^π(x, a))     if x ∈ B̄, }

and π̃ = π'_s. Let

    B = 2 Σ_x v_{π̃,c}(x) Var_{ν(·|x)} Q̂^π(x, ·).

We have that

    c^T V^{π̃} ≥ c^T V^π − c^T |V̂^π − V^π| − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |E_{π̄(·|x)} Q̂^π(x, ·) − V̂^π(x)|
                 − (s−1) Σ_{x ∈ B̄} v_{π̃,c}(x) Var_{ν(·|x)} Q̂^π(x, ·) + B(s−1)/2.

Proof. Consider the function

    h'(w) = c^T (w V̂^π) + v_{π̃,c}^T (L^{π'_w}(w V̂^π) − w V̂^π).

Similar to the argument in the proof of Theorem 4, we have that h'(0) = v_{π̃,c}^T r = c^T V^{π̃} and h'(s) = c^T V^{π̃}. Thus, h'(0) = h'(s). As before,

    h'(1) ≥ c^T V^π − c^T |V̂^π − V^π| − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |E_{π̄(·|x)} Q̂^π(x, ·) − V̂^π(x)|.

Let B' = 2 Σ_{x ∈ G} v_{π̃,c}(x) Var_{ν(·|x)} Q̂^π(x, ·). The new h' is also quadratic and can be written as h'(w) = (1/2) B' w² + g' w + f', for some g' and f'. We know that h'(s) = h'(0) = f'. Thus the improvement is h'(s) − h'(1) = −g' − B'/2. On the other hand, h'(s) = B' s²/2 + g' s + f' = f', and so g' = −B's/2. Thus,

    h'(s) − h'(1) = B'(s−1)/2 = B(s−1)/2 − (s−1) Σ_{x ∈ B̄} v_{π̃,c}(x) Var_{ν(·|x)} Q̂^π(x, ·),

from which the statement follows.

Input: Initial policy π_1, negativity threshold ε, time horizon T, initial s_0;
for i = 1, 2, ..., I do
    s = s_0;
    repeat
        Run policy π_i for T steps;
        Estimate G_i(s) using (9);
        If G_i(s) > ε, set s = s/2;
    until G_i(s) ≤ ε;
    Estimate V̂^{π_i};
    Obtain ν_i from (3) or (4);
    Define the new π_{i+1} based on Corollary 5;
end for

Figure 3: The Adaptive Iterative LPI Algorithm.

In particular, if Σ_{x ∈ B̄} (1−γ) v_{π̃,c}(x) ≤ ε for some small ε (so that, using Var_{ν(·|x)} Q̂^π(x, ·) ≤ b²/4, the penalty for the bad states is at most ε b²(s−1)/(4(1−γ))), then

    c^T V^{π̃} ≥ c^T V^π − c^T |V̂^π − V^π| − ε b²(s−1)/(4(1−γ)) + B(s−1)/2
                 − Σ_{x ∉ I(V̂^π)} v_{π̃,c}(x) |E_{π̄(·|x)} Q̂^π(x, ·) − V̂^π(x)|.
This argument motivates an adaptive procedure for updating s: start from a big value of s and decrease it only when the frequency of visits to bad states becomes larger than a threshold. The adaptive algorithm, called AILPI, is shown in Figure 3. In the figure, π_i is the ith policy, ν_i is the corresponding base policy,

    π_{i+1}(a|x) = { ν_i(a|x)(1 + s Δ^{π_i}(x, a))   if x ∈ G;
                     ν_i(a|x)(1 + Δ^{π_i}(x, a))     if x ∈ B̄, }

    B_i(s) = { x : ∃a, ν_i(a|x)(1 + s Δ^{π_i}(x, a)) < 0 },

and

    G_i(s) = Σ_{x ∈ B_i(s)} (1−γ) v_{π_i,c}(x).    (9)

To simplify the presentation, we estimate G_i(s) after running the policy for a fixed number of rounds. We can also design a version that updates the estimate in an online fashion and decreases s as soon as the number of visits to bad states becomes large.

2.3 Comparison with Conservative Policy Iteration

Let us compare the performance bound in Theorem 4 with the performance bound of Conservative Policy Iteration. To simplify the argument, we assume the exact value functions are available and ν is the uniform
policy. Let Q^π(x, a) = r(x) + γ P_{(x,a)} V^π be the state-action value of policy π and let A^π(x, a) = Q^π(x, a) − V^π(x) be the advantage function. Let g_π(x) = argmax_a Q^π(x, a) be the greedy policy with respect to policy π, and let A^π_{π'}(x) = Σ_a π'(a|x) A^π(x, a) be the policy advantage of π' with respect to π. Let

    A_g = (1−γ) Σ_x v_{π,c}(x) A^π_{g_π}(x) = (1−γ) Σ_x v_{π,c}(x) (max_a Q^π(x, a) − V^π(x)).

Let E_CPI = A_g²/(8b). Kakade and Langford (2002) propose Conservative Policy Iteration, which uses an approximate greedy update

    π_CPI(a|x) = (1−α) π(a|x) + α 1{a = g_π(x)}    (10)

for some α ∈ (0, 1). Kakade and Langford (2002) show that, using the choice of α = (1−γ)A_g/(4b),

    c^T V^{π_CPI} ≥ c^T V^π + E_CPI.

Let N_x = max_a Q^π(x, a) − min_a Q^π(x, a) denote the range of Q^π(x, ·). The CPI improvement can be upper bounded by

    E_CPI ≤ (1/(8b)) ( Σ_x (1−γ) v_{π,c}(x) N_x )².

Theorem 4 shows an improvement of

    c^T V^{π̃} = c^T V^π + B(s−1)/2 = c^T V^π + (s−1) Σ_x v_{π̃,c}(x) Var_{ν(·|x)} Q^π(x, ·).

Define

    E_LPI := (s−1) Σ_x v_{π̃,c}(x) Var_{ν(·|x)} Q^π(x, ·).

Because ν is assumed to be uniform (over the two actions), Var_{ν(·|x)} Q^π(x, ·) = N_x²/4. Thus, E_LPI = ((s−1)/4) Σ_x v_{π̃,c}(x) N_x². Let us choose b = γ and s = 1/(bγ) (the most conservative choice of s). Thus

    E_LPI = ((1−γ²)/(4γ²)) Σ_x v_{π̃,c}(x) N_x².

Thus,

    E_LPI ≥ ((1+γ)/(4γ²)) Σ_x (1−γ) v_{π̃,c}(x) N_x²,
    E_CPI ≤ (1/(8γ)) ( Σ_x (1−γ) v_{π,c}(x) N_x )².

A direct comparison is not possible because v_{π̃,c} is different from v_{π,c}. If we assume that v_{π̃,c} and v_{π,c} are similar, by Jensen's inequality we expect E_LPI to be bigger than E_CPI. We attribute this difference to the fact that, unlike CPI, the mixture coefficient in our update rule is not constant and depends on the state and action. Even if N_x is uniform over the state space and equal to a constant N, we still have an improvement:

    E_LPI − E_CPI ≥ N² ( (1+γ)/(4γ²) − 1/(8γ) ) ≥ 3N²/(8γ).

In practice, the recommended choice of α = (1−γ)A_g/(4b) leads to very conservative updates and very slow progress (Scherrer, 2014). Often one needs to choose a much larger α to make CPI practical, but there are no theoretical guarantees for such choices.
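The Jensen step can be checked numerically. With any nonnegative weights summing to one (playing the role of (1−γ)v_{π,c}), the "expectation of quadratics" dominates the "quadratic of an expectation"; this is a toy check of the inequality, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.dirichlet(np.ones(20))        # nonnegative weights summing to 1
n = rng.uniform(0.0, 1.0, size=20)    # stand-ins for the per-state ranges N_x

lhs = v @ n**2                        # expectation of quadratics (LPI-style)
rhs = (v @ n) ** 2                    # quadratic of an expectation (CPI-style)
assert lhs >= rhs                     # Jensen's inequality
```

The gap lhs − rhs is the weighted variance of n, which is exactly why the advantage grows when the Q-value ranges vary across states.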
Scherrer (2014) proposes doing a line search to find the best α. But unlike our adaptive method, such a procedure lacks theoretical justification. As we show in the experiments, even our most conservative choice of s = 1/(bγ) results in faster progress than CPI.

The above argument assumes maximum variance for ν. If π is deterministic, then ν is also deterministic, Var_{ν(·|x)} Q^π(x, ·) = 0, B = 0, and the performance bound in Theorem 4 shows no improvement. CPI does not have this restriction and can be applied with initial deterministic policies. Also, we require rewards to be action-independent, while CPI applies to more general reward functions.

Let m_π = max_x ||1{· = argmax_a Q^π(x, a)} − π(·|x)||_1, ΔA^π_{π'} = max_{x,x'} |A^π_{π'}(x) − A^π_{π'}(x')|, and α* = (1−γ)A_g/(γ m_π ΔA^π_{g_π}). Pirotta et al. (2013) improve the theoretical analysis of Kakade and Langford (2002) and show that if α* ≤ 1 and we update the policy according to (10) with the choice of mixture coefficient α*, the policy improvement is at least A_g²/(2γ m_π ΔA^π_{g_π}). Although this improves upon CPI, estimating m_π and α* is computationally hard in large state problems. Pirotta et al. (2013) also propose a multi-parameter version that uses a different value of α for each state, but the improvement over the single-parameter version is not shown and the method is computationally expensive.

3 Experiments

We implemented the ILPI algorithm in Python and tested its performance on three problems: two chain walk problems and balancing an inverted pendulum. The performance of the algorithm is compared with the performance of CPI (Kakade and Langford, 2002).
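The chain-walk environment used below can be written down directly. This is a sketch of the environment only (my construction, not the authors' code; I assume the rewarding states 10 and 41 in the text are 1-based, so they become indices 9 and 40, and that an action that would step off the chain leaves the state unchanged):

```python
import numpy as np

def chain_walk(n=50, p=0.9):
    """Transition tensor P[a, x, x'] and reward vector for the n-state
    chain: each action moves in its direction w.p. p, opposite w.p. 1-p."""
    P = np.zeros((2, n, n))
    for x in range(n):
        left, right = max(x - 1, 0), min(x + 1, n - 1)
        P[0, x, left] += p       # action Left, intended move
        P[0, x, right] += 1 - p  # action Left, slip
        P[1, x, right] += p      # action Right, intended move
        P[1, x, left] += 1 - p   # action Right, slip
    r = np.zeros(n)
    r[[9, 40]] = 1.0             # states 10 and 41 (1-based) give reward +1
    return P, r
```

With such an explicit model, the "exact" versions of ILPI can compute V^π = (I − γP^π)^{-1} r directly rather than estimating it from roll-outs.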
Figure 4: Performance of ILPI on the chain walk benchmark (50 states). Each run is repeated 10 times, and means and standard deviations are reported. ILPI finds an optimal policy in fewer than 10 iterations.

3.1 Chain Walk Domains

We tested the performance of the algorithm on two simple chain walk problems. (See Section 9.1 in Lagoudakis and Parr (2003).) The first chain has 50 states, and there are two actions (Left and Right) available in each state. An action moves the state in the intended direction with probability 0.9, and moves the state in the opposite direction with probability 0.1. The reward is +1 in states 10 and 41, and is zero in other states. The discount factor is 0.9.

Figure 4 shows the performance of the exact version of the ILPI algorithm on this benchmark. The initial policy π_1 is the uniform random policy that takes Left and Right with equal probability. We chose s = 1/F(π_1) and b = 0.9 in the ILPI algorithm. Figure 4 shows that the ILPI algorithm achieves the performance of the optimal policy in fewer than 10 iterations. In comparison, the USPI algorithm of Pirotta et al. (2013) needs 74 iterations to achieve this performance. (The value of the optimal policy that we find is slightly different than the value reported by Pirotta et al. (2013).) CPI exhibits much slower progress (Pirotta et al., 2013).

The second chain has 4 states. The action set, discount factor, and transition dynamics are the same as before. Lagoudakis and Parr (2003) show that LSPI finds the optimal policy in this problem, although Koller and Parr (2000) show that an algorithm that is a combination of LSTD and policy improvement oscillates between the suboptimal policies RRRR and LLLL (always going to the right and always going to the left).

Figure 5: Performance of ILPI on the chain walk benchmark with 4 states. 95% confidence intervals are shown for approximate algorithms.

Figure 5 shows the performance of five versions of the ILPI algorithm on this benchmark. The initial policy
is always the uniform random policy that takes Left and Right with equal probability. The first three versions (shown by blue circles, stars, and red circles) use s_i = 1/F(π_i), s_i = 1/(γ max_x V̂^{π_i}(x)), and s = 1/(γb), respectively, and the value functions V^π are computed exactly. Notice that the first two versions change s_i adaptively in each iteration. The fourth version (shown by triangles) uses s = 1/(γb). Value functions are estimated by averaging over 4 roll-outs of length 20. Other quantities (ν and Q̂^π) are also estimated by averaging over 4 samples. The last version (shown by the pink line) uses only one roll-out to estimate each quantity. This last version fails to improve the initial policy (apparently due to large estimation errors). We also show the performance of the CPI algorithm, which improves the policies very slowly. Pirotta et al. (2013) show that their algorithms find a near-optimal policy in 49 iterations; however, as discussed in Section 2.3, these approximate algorithms use the quantity m_π = max_x ||1{· = argmax_a Q^π(x, a)} − π(·|x)||_1, and having access to such a quantity for an approximate algorithm is questionable.

We make a few observations. First, all versions of the exact ILPI algorithm are faster than CPI. Second, using roll-outs to estimate value functions is sufficient to improve policies; however, the number of roll-outs should be sufficiently large so that estimation errors become small.

3.2 Inverted Pendulum

The problem is to balance an inverted pendulum at the upright position by applying horizontal forces to the cart that the pendulum is attached to. The length and mass of the pendulum are unknown to the learner. The actions are left force (−50N), right force (50N), or no force (0N). A uniform perturbation in [−10, 10] is added to the action. The state vector consists of the vertical
angle θ and the angular velocity θ̇ of the pendulum. Given an action a, the state evolves according to

    θ̈ = ( 9.8 sin(θ) − α m l (θ̇)² sin(2θ)/2 − α cos(θ) a ) / ( 4l/3 − α m l cos²(θ) ).

Here, m = 2 kg is the mass of the pendulum, M = 8 kg is the mass of the cart, l = 0.5 m is the length of the pendulum, and α = 1/(m + M). The simulation step is 0.1 seconds. The objective is to keep the angle in [−π/2, π/2]. An episode ends when the angle of the pendulum is outside this interval or when the episode exceeds 3000 steps.

Figure 6: Performance of ILPI and AILPI on the inverted pendulum benchmark. 95% confidence intervals are shown. (a) Performance of ILPI (s = 1/(γb)). (b) Performance of AILPI. (c) Performance of ILPI (s = 100).

We tested the performance of the iterative policy improvement algorithm on this problem. We used 10 basis functions to estimate the value of policies:

    Ψ(x) = (1, exp(−||x − p_1||²/2), ..., exp(−||x − p_9||²/2)),

where {p_1, ..., p_9} = {−π/4, 0, π/4} × {−1, 0, +1}. To estimate the value of policy π_i, we collected data by running π_i for 100 episodes. Then we used this data and estimated V̂^{π_i} by an approximate value iteration (AVI) algorithm (using the additional trick that we introduced at the end of Section 2). The number of iterations of AVI is 100. We performed 20 policy improvements (so I = 20 in Figure 2). We chose γ = 0.95, b = 0.9, and s = 1/(γb) in the ILPI algorithm. Figure 6(a) shows the performance of the ILPI algorithm. The CPI algorithm exhibits very slow progress; even after 100 iterations, the number of balancing steps is less than 15.

The performance of the ILPI algorithm can be significantly improved by using a larger s. Because the state space is continuous, calculating max_x V̂^{π_i} or F(π_i) is not easy. Instead, we run the AILPI algorithm, which adaptively updates s. Figure 6(b) shows the performance of AILPI with initial s = 20. We choose ε = 0.2, and 100 episodes are used for value estimation. Figure 6(c) shows that ILPI with a fixed s = 100 finds the optimal policy in 2 iterations.
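The dynamics above are straightforward to simulate with Euler steps of 0.1 s. A minimal sketch (my code, using the constants from the text; the uniform noise on the action is omitted here):

```python
import math

def pendulum_step(theta, theta_dot, u, dt=0.1, m=2.0, M=8.0, l=0.5, g=9.8):
    """One Euler step of the inverted-pendulum dynamics from the text.
    u is the applied horizontal force; returns (theta, theta_dot)."""
    a = 1.0 / (m + M)
    num = (g * math.sin(theta)
           - a * m * l * theta_dot**2 * math.sin(2 * theta) / 2.0
           - a * math.cos(theta) * u)
    den = 4.0 * l / 3.0 - a * m * l * math.cos(theta) ** 2
    theta_ddot = num / den
    return theta + dt * theta_dot, theta_dot + dt * theta_ddot
```

An episode would then apply one of the three forces {−50, 0, 50} (plus noise) per step and terminate once |θ| > π/2 or 3000 steps elapse.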
4 Conclusions

We proposed a policy iteration algorithm that is guaranteed to improve the performance of the initial stochastic policy. We showed that the theoretical improvement is bigger than that of the Conservative Policy Iteration algorithm. Our theorem has two advantages compared with the guarantees that are known for CPI: First, the mixture coefficients are state-dependent, and because of this, our improvement has the form of the expectation of quadratics while the improvement of CPI has the form of the quadratic of an expectation. Second, our theorem allows for much bigger steps towards the greedy policy, hence faster convergence. Our experiments are consistent with these theoretical advantages.

Acknowledgements

We gratefully acknowledge the support of the Australian Research Council through an Australian Laureate Fellowship (FL ) and through the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).
References

D. P. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51, 2003.

S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.

D. Koller and R. Parr. Policy iteration for factored MDPs. In Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.

M. G. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 4:1107-1149, 2003.

M. Pirotta, M. Restelli, A. Pecorino, and D. Calandriello. Safe policy iteration. In ICML, 2013.

B. Scherrer. Approximate policy iteration schemes: A comparison. In ICML, 2014.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. A Bradford Book. MIT Press, 1998.

P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In ICML, 2015.
More information3.4 Numerical integration
3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,
More informationSolution for Assignment 1 : Intro to Probability and Statistics, PAC learning
Solution for Assignment 1 : Intro to Probbility nd Sttistics, PAC lerning 10-701/15-781: Mchine Lerning (Fll 004) Due: Sept. 30th 004, Thursdy, Strt of clss Question 1. Bsic Probbility ( 18 pts) 1.1 (
More informationSUMMER KNOWHOW STUDY AND LEARNING CENTRE
SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18
More informationChapter 3 Solving Nonlinear Equations
Chpter 3 Solving Nonliner Equtions 3.1 Introduction The nonliner function of unknown vrible x is in the form of where n could be non-integer. Root is the numericl vlue of x tht stisfies f ( x) 0. Grphiclly,
More informationMath 270A: Numerical Linear Algebra
Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner
More informationAPPROXIMATE INTEGRATION
APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be
More informationDiscrete Least-squares Approximations
Discrete Lest-squres Approximtions Given set of dt points (x, y ), (x, y ),, (x m, y m ), norml nd useful prctice in mny pplictions in sttistics, engineering nd other pplied sciences is to construct curve
More informationChapter 3 Polynomials
Dr M DRAIEF As described in the introduction of Chpter 1, pplictions of solving liner equtions rise in number of different settings In prticulr, we will in this chpter focus on the problem of modelling
More informationRiemann is the Mann! (But Lebesgue may besgue to differ.)
Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >
More informationLecture 1. Functional series. Pointwise and uniform convergence.
1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is
More informationChapter 4 Contravariance, Covariance, and Spacetime Diagrams
Chpter 4 Contrvrince, Covrince, nd Spcetime Digrms 4. The Components of Vector in Skewed Coordintes We hve seen in Chpter 3; figure 3.9, tht in order to show inertil motion tht is consistent with the Lorentz
More informationReview of basic calculus
Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below
More informationMath& 152 Section Integration by Parts
Mth& 5 Section 7. - Integrtion by Prts Integrtion by prts is rule tht trnsforms the integrl of the product of two functions into other (idelly simpler) integrls. Recll from Clculus I tht given two differentible
More informationNumerical Integration. 1 Introduction. 2 Midpoint Rule, Trapezoid Rule, Simpson Rule. AMSC/CMSC 460/466 T. von Petersdorff 1
AMSC/CMSC 46/466 T. von Petersdorff 1 umericl Integrtion 1 Introduction We wnt to pproximte the integrl I := f xdx where we re given, b nd the function f s subroutine. We evlute f t points x 1,...,x n
More informationThe steps of the hypothesis test
ttisticl Methods I (EXT 7005) Pge 78 Mosquito species Time of dy A B C Mid morning 0.0088 5.4900 5.5000 Mid Afternoon.3400 0.0300 0.8700 Dusk 0.600 5.400 3.000 The Chi squre test sttistic is the sum of
More informationNUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.
NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with
More informationAbstract inner product spaces
WEEK 4 Abstrct inner product spces Definition An inner product spce is vector spce V over the rel field R equipped with rule for multiplying vectors, such tht the product of two vectors is sclr, nd the
More informationTHE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS.
THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS RADON ROSBOROUGH https://intuitiveexplntionscom/picrd-lindelof-theorem/ This document is proof of the existence-uniqueness theorem
More informationA REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007
A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus
More informationNumerical Analysis: Trapezoidal and Simpson s Rule
nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =
More informationCMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature
CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy
More informationOnline Supplements to Performance-Based Contracts for Outpatient Medical Services
Jing, Png nd Svin: Performnce-bsed Contrcts Article submitted to Mnufcturing & Service Opertions Mngement; mnuscript no. MSOM-11-270.R2 1 Online Supplements to Performnce-Bsed Contrcts for Outptient Medicl
More information1 Probability Density Functions
Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our
More informationMath 61CM - Solutions to homework 9
Mth 61CM - Solutions to homework 9 Cédric De Groote November 30 th, 2018 Problem 1: Recll tht the left limit of function f t point c is defined s follows: lim f(x) = l x c if for ny > 0 there exists δ
More informationThe Wave Equation I. MA 436 Kurt Bryan
1 Introduction The Wve Eqution I MA 436 Kurt Bryn Consider string stretching long the x xis, of indeterminte (or even infinite!) length. We wnt to derive n eqution which models the motion of the string
More informationLecture 21: Order statistics
Lecture : Order sttistics Suppose we hve N mesurements of sclr, x i =, N Tke ll mesurements nd sort them into scending order x x x 3 x N Define the mesured running integrl S N (x) = 0 for x < x = i/n for
More informationRecitation 3: More Applications of the Derivative
Mth 1c TA: Pdric Brtlett Recittion 3: More Applictions of the Derivtive Week 3 Cltech 2012 1 Rndom Question Question 1 A grph consists of the following: A set V of vertices. A set E of edges where ech
More information221B Lecture Notes WKB Method
Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using
More informationDiscrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17
EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,
More informationAdvanced Calculus: MATH 410 Notes on Integrals and Integrability Professor David Levermore 17 October 2004
Advnced Clculus: MATH 410 Notes on Integrls nd Integrbility Professor Dvid Levermore 17 October 2004 1. Definite Integrls In this section we revisit the definite integrl tht you were introduced to when
More informationMath 8 Winter 2015 Applications of Integration
Mth 8 Winter 205 Applictions of Integrtion Here re few importnt pplictions of integrtion. The pplictions you my see on n exm in this course include only the Net Chnge Theorem (which is relly just the Fundmentl
More information1 The Lagrange interpolation formula
Notes on Qudrture 1 The Lgrnge interpoltion formul We briefly recll the Lgrnge interpoltion formul. The strting point is collection of N + 1 rel points (x 0, y 0 ), (x 1, y 1 ),..., (x N, y N ), with x
More informationECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance
Generl structure ECO 37 Economics of Uncertinty Fll Term 007 Notes for lectures 4. Stochstic Dominnce Here we suppose tht the consequences re welth mounts denoted by W, which cn tke on ny vlue between
More informationDuality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.
Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we
More informationW. We shall do so one by one, starting with I 1, and we shall do it greedily, trying
Vitli covers 1 Definition. A Vitli cover of set E R is set V of closed intervls with positive length so tht, for every δ > 0 nd every x E, there is some I V with λ(i ) < δ nd x I. 2 Lemm (Vitli covering)
More informationOrthogonal Polynomials and Least-Squares Approximations to Functions
Chpter Orthogonl Polynomils nd Lest-Squres Approximtions to Functions **4/5/3 ET. Discrete Lest-Squres Approximtions Given set of dt points (x,y ), (x,y ),..., (x m,y m ), norml nd useful prctice in mny
More informationLECTURE NOTE #12 PROF. ALAN YUILLE
LECTURE NOTE #12 PROF. ALAN YUILLE 1. Clustering, K-mens, nd EM Tsk: set of unlbeled dt D = {x 1,..., x n } Decompose into clsses w 1,..., w M where M is unknown. Lern clss models p(x w)) Discovery of
More informationMIXED MODELS (Sections ) I) In the unrestricted model, interactions are treated as in the random effects model:
1 2 MIXED MODELS (Sections 17.7 17.8) Exmple: Suppose tht in the fiber breking strength exmple, the four mchines used were the only ones of interest, but the interest ws over wide rnge of opertors, nd
More informationAn approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95
An pproximtion to the rithmetic-geometric men G.J.O. Jmeson, Mth. Gzette 98 (4), 85 95 Given positive numbers > b, consider the itertion given by =, b = b nd n+ = ( n + b n ), b n+ = ( n b n ) /. At ech
More informationSection 11.5 Estimation of difference of two proportions
ection.5 Estimtion of difference of two proportions As seen in estimtion of difference of two mens for nonnorml popultion bsed on lrge smple sizes, one cn use CLT in the pproximtion of the distribution
More informationNumerical Integration
Chpter 5 Numericl Integrtion Numericl integrtion is the study of how the numericl vlue of n integrl cn be found. Methods of function pproximtion discussed in Chpter??, i.e., function pproximtion vi the
More information8 Laplace s Method and Local Limit Theorems
8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved
More informationDiscrete Mathematics and Probability Theory Summer 2014 James Cook Note 17
CS 70 Discrete Mthemtics nd Proility Theory Summer 2014 Jmes Cook Note 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion, y tking
More informationGoals: Determine how to calculate the area described by a function. Define the definite integral. Explore the relationship between the definite
Unit #8 : The Integrl Gols: Determine how to clculte the re described by function. Define the definite integrl. Eplore the reltionship between the definite integrl nd re. Eplore wys to estimte the definite
More informationp-adic Egyptian Fractions
p-adic Egyptin Frctions Contents 1 Introduction 1 2 Trditionl Egyptin Frctions nd Greedy Algorithm 2 3 Set-up 3 4 p-greedy Algorithm 5 5 p-egyptin Trditionl 10 6 Conclusion 1 Introduction An Egyptin frction
More informationStudent Activity 3: Single Factor ANOVA
MATH 40 Student Activity 3: Single Fctor ANOVA Some Bsic Concepts In designed experiment, two or more tretments, or combintions of tretments, is pplied to experimentl units The number of tretments, whether
More informationLecture 19: Continuous Least Squares Approximation
Lecture 19: Continuous Lest Squres Approximtion 33 Continuous lest squres pproximtion We begn 31 with the problem of pproximting some f C[, b] with polynomil p P n t the discrete points x, x 1,, x m for
More informationDriving Cycle Construction of City Road for Hybrid Bus Based on Markov Process Deng Pan1, a, Fengchun Sun1,b*, Hongwen He1, c, Jiankun Peng1, d
Interntionl Industril Informtics nd Computer Engineering Conference (IIICEC 15) Driving Cycle Construction of City Rod for Hybrid Bus Bsed on Mrkov Process Deng Pn1,, Fengchun Sun1,b*, Hongwen He1, c,
More informationDefinite integral. Mathematics FRDIS MENDELU
Definite integrl Mthemtics FRDIS MENDELU Simon Fišnrová Brno 1 Motivtion - re under curve Suppose, for simplicity, tht y = f(x) is nonnegtive nd continuous function defined on [, b]. Wht is the re of the
More informationSection 4.8. D v(t j 1 ) t. (4.8.1) j=1
Difference Equtions to Differentil Equtions Section.8 Distnce, Position, nd the Length of Curves Although we motivted the definition of the definite integrl with the notion of re, there re mny pplictions
More informationODE: Existence and Uniqueness of a Solution
Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =
More informationNumerical Integration
Chpter 1 Numericl Integrtion Numericl differentition methods compute pproximtions to the derivtive of function from known vlues of the function. Numericl integrtion uses the sme informtion to compute numericl
More informationa < a+ x < a+2 x < < a+n x = b, n A i n f(x i ) x. i=1 i=1
Mth 33 Volume Stewrt 5.2 Geometry of integrls. In this section, we will lern how to compute volumes using integrls defined by slice nlysis. First, we recll from Clculus I how to compute res. Given the
More informationP 3 (x) = f(0) + f (0)x + f (0) 2. x 2 + f (0) . In the problem set, you are asked to show, in general, the n th order term is a n = f (n) (0)
1 Tylor polynomils In Section 3.5, we discussed how to pproximte function f(x) round point in terms of its first derivtive f (x) evluted t, tht is using the liner pproximtion f() + f ()(x ). We clled this
More informationdifferent methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).
Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different
More informationMath 426: Probability Final Exam Practice
Mth 46: Probbility Finl Exm Prctice. Computtionl problems 4. Let T k (n) denote the number of prtitions of the set {,..., n} into k nonempty subsets, where k n. Argue tht T k (n) kt k (n ) + T k (n ) by
More informationCBE 291b - Computation And Optimization For Engineers
The University of Western Ontrio Fculty of Engineering Science Deprtment of Chemicl nd Biochemicl Engineering CBE 9b - Computtion And Optimiztion For Engineers Mtlb Project Introduction Prof. A. Jutn Jn
More informationapproaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below
. Eponentil nd rithmic functions.1 Eponentil Functions A function of the form f() =, > 0, 1 is clled n eponentil function. Its domin is the set of ll rel f ( 1) numbers. For n eponentil function f we hve.
More informationMAA 4212 Improper Integrals
Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which
More informationResearch Article Moment Inequalities and Complete Moment Convergence
Hindwi Publishing Corportion Journl of Inequlities nd Applictions Volume 2009, Article ID 271265, 14 pges doi:10.1155/2009/271265 Reserch Article Moment Inequlities nd Complete Moment Convergence Soo Hk
More information( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that
Arc Length of Curves in Three Dimensionl Spce If the vector function r(t) f(t) i + g(t) j + h(t) k trces out the curve C s t vries, we cn mesure distnces long C using formul nerly identicl to one tht we
More informationNear-Bayesian Exploration in Polynomial Time
J. Zico Kolter kolter@cs.stnford.edu Andrew Y. Ng ng@cs.stnford.edu Computer Science Deprtment, Stnford University, CA 94305 Abstrct We consider the explortion/exploittion problem in reinforcement lerning
More informationA recursive construction of efficiently decodable list-disjunct matrices
CSE 709: Compressed Sensing nd Group Testing. Prt I Lecturers: Hung Q. Ngo nd Atri Rudr SUNY t Bufflo, Fll 2011 Lst updte: October 13, 2011 A recursive construction of efficiently decodble list-disjunct
More informationAcceptance Sampling by Attributes
Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire
More informationNew Expansion and Infinite Series
Interntionl Mthemticl Forum, Vol. 9, 204, no. 22, 06-073 HIKARI Ltd, www.m-hikri.com http://dx.doi.org/0.2988/imf.204.4502 New Expnsion nd Infinite Series Diyun Zhng College of Computer Nnjing University
More informationEstimation of Binomial Distribution in the Light of Future Data
British Journl of Mthemtics & Computer Science 102: 1-7, 2015, Article no.bjmcs.19191 ISSN: 2231-0851 SCIENCEDOMAIN interntionl www.sciencedomin.org Estimtion of Binomil Distribution in the Light of Future
More informationLecture Note 9: Orthogonal Reduction
MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A
More informationDefinite integral. Mathematics FRDIS MENDELU. Simona Fišnarová (Mendel University) Definite integral MENDELU 1 / 30
Definite integrl Mthemtics FRDIS MENDELU Simon Fišnrová (Mendel University) Definite integrl MENDELU / Motivtion - re under curve Suppose, for simplicity, tht y = f(x) is nonnegtive nd continuous function
More informationEntropy and Ergodic Theory Notes 10: Large Deviations I
Entropy nd Ergodic Theory Notes 10: Lrge Devitions I 1 A chnge of convention This is our first lecture on pplictions of entropy in probbility theory. In probbility theory, the convention is tht ll logrithms
More informationPhysics 116C Solution of inhomogeneous ordinary differential equations using Green s functions
Physics 6C Solution of inhomogeneous ordinry differentil equtions using Green s functions Peter Young November 5, 29 Homogeneous Equtions We hve studied, especilly in long HW problem, second order liner
More informationMath 360: A primitive integral and elementary functions
Mth 360: A primitive integrl nd elementry functions D. DeTurck University of Pennsylvni October 16, 2017 D. DeTurck Mth 360 001 2017C: Integrl/functions 1 / 32 Setup for the integrl prtitions Definition:
More informationVariational Techniques for Sturm-Liouville Eigenvalue Problems
Vritionl Techniques for Sturm-Liouville Eigenvlue Problems Vlerie Cormni Deprtment of Mthemtics nd Sttistics University of Nebrsk, Lincoln Lincoln, NE 68588 Emil: vcormni@mth.unl.edu Rolf Ryhm Deprtment
More informationWe partition C into n small arcs by forming a partition of [a, b] by picking s i as follows: a = s 0 < s 1 < < s n = b.
Mth 255 - Vector lculus II Notes 4.2 Pth nd Line Integrls We begin with discussion of pth integrls (the book clls them sclr line integrls). We will do this for function of two vribles, but these ides cn
More informationModule 6: LINEAR TRANSFORMATIONS
Module 6: LINEAR TRANSFORMATIONS. Trnsformtions nd mtrices Trnsformtions re generliztions of functions. A vector x in some set S n is mpped into m nother vector y T( x). A trnsformtion is liner if, for
More informationMath 113 Exam 2 Practice
Mth 3 Exm Prctice Februry 8, 03 Exm will cover 7.4, 7.5, 7.7, 7.8, 8.-3 nd 8.5. Plese note tht integrtion skills lerned in erlier sections will still be needed for the mteril in 7.5, 7.8 nd chpter 8. This
More informationRiemann Sums and Riemann Integrals
Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties
More informationCS 188: Artificial Intelligence Spring 2007
CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment
More informationIntegral equations, eigenvalue, function interpolation
Integrl equtions, eigenvlue, function interpoltion Mrcin Chrząszcz mchrzsz@cernch Monte Crlo methods, 26 My, 2016 1 / Mrcin Chrząszcz (Universität Zürich) Integrl equtions, eigenvlue, function interpoltion
More information