2D1431 Machine Learning
Lab 3: Reinforcement Learning
Frank Hoffmann, modified by Örjan Ekeberg
December 7, 2004

1 Introduction

In this lab you will learn about dynamic programming and reinforcement learning. It is assumed that you are familiar with the basic concepts of reinforcement learning and that you have read chapter 13 in the course book Machine Learning (Mitchell, 1997). The first four chapters of the survey on reinforcement learning by Kaelbling et al. (1996) are good supplementary material. For further reading and a detailed discussion of policy iteration and reinforcement learning, the textbook Reinforcement Learning (Sutton and Barto, 1999) is highly recommendable. In particular, studying chapters 3, 4 and 6 is of immense help for this lab. The predefined Matlab functions for this lab are located in the course directory /info/mi04/labs/lab3.

Dynamic programming refers to a class of algorithms that can be used to compute optimal policies given a complete model of the environment. Dynamic programming solves problems that can be formulated as Markov decision processes. Unlike in the reinforcement learning case, dynamic programming assumes that the state transition and reward functions are known. The central idea of both dynamic programming and reinforcement learning is to learn value functions, which in turn can be used to identify the optimal policy.

2 Policy Evaluation and Policy Iteration

First we consider policy evaluation, namely how to compute the state-value function V^π for an arbitrary policy π. For the deterministic case the value
function has to obey the Bellman equation

    V^π(s) = r(s, π(s)) + γ V^π(δ(s, π(s)))                                (1)

where δ(s, a): S × A → S and r(s, a): S × A → R are the deterministic state transition and reward functions. This equation can either be solved directly, as a linear equation of the type

    V = R + BV                                                             (2)

where V and R are vectors and B is a matrix, or by successive approximation of equation 1, considering the Bellman equation as an update rule

    V^π_{k+1}(s) = r(s, π(s)) + γ V^π_k(δ(s, π(s)))                        (3)

The sequence of V^π_k can be shown to converge to V^π as k → ∞. This method is called iterative policy evaluation.

If the policy is stochastic, i.e., the action a in a given situation s is drawn from a probability distribution over possible actions, then we will use π(s, a) to denote the probability of taking action a. The iterative Bellman equation then has the following form:

    V^π_{k+1}(s) = Σ_a π(s, a) (r(s, a) + γ V^π_k(δ(s, a)))                (4)

For the non-deterministic case, the transition and reward functions have to be replaced by probabilistic functions. In that case the Bellman equation becomes:

    V^π(s) = Σ_{s'} P(s' | s, π(s)) (R(s, s', π(s)) + γ V^π(s'))           (5)

where P(s' | s, a) is the probability that the next state is s' when executing action a in state s, and R(s, s', a) is the reward when executing action a in state s and transitioning to the next state s'. Policy evaluation for the non-deterministic case can be formulated as an update rule similar to equation 3 by

    V^π_{k+1}(s) = Σ_{s'} P(s' | s, π(s)) (R(s, s', π(s)) + γ V^π_k(s'))   (6)

Our main motivation for computing the value function for a policy is to improve on our current policy. For some state s we can improve our current policy by picking an alternative action a that deviates from our current policy π(s) if it has a higher action value, Q(s, a) >
Q(s, π(s)). This process is called policy improvement. In other words, for each state s we greedily choose the action that maximizes Q^π(s, a):

    π'(s) = argmax_a Q^π(s, a) = argmax_a (r(s, a) + γ V^π(δ(s, a)))       (7)

Once a policy π has been improved using V^π to yield a better policy π', we can then compute V^{π'} and improve it again to yield an even better π''. Policy iteration intertwines policy evaluation and policy improvement according to

    V_{k+1}(s) = max_a Q(s, a) = max_a (r(s, a) + γ V_k(δ(s, a)))
    π_{k+1}(s) = argmax_a Q(s, a) = argmax_a (r(s, a) + γ V_k(δ(s, a)))    (8)

For the non-deterministic case we obtain

    V_{k+1}(s) = max_a Q(s, a) = max_a Σ_{s'} P(s' | s, a) (R(s, s', a) + γ V_k(s'))
    π_{k+1}(s) = argmax_a Q(s, a) = argmax_a Σ_{s'} P(s' | s, a) (R(s, s', a) + γ V_k(s'))   (9)

It can be shown that policy iteration converges to the optimal policy. Notice that each policy evaluation, itself an iterative computation, is started with the value function of the previous policy.

Assume a grid world of 4 × 4 cells that correspond to 16 states enumerated as s_1, ..., s_16 as shown in Figure 1. In each state the agent can choose one of four possible actions (North, West, South, East) in order to move to a neighboring cell. If the agent attempts to move beyond the limits of the grid world, for example going east in state s_8 located at the right edge, it remains in the original cell but incurs a penalty of -1. There are two special cells A (s_1) and B (s_3) from which the agent is beamed to the cells A' (s_13) respectively B' (s_11), independent of the action it chooses. When being beamed it receives a reward of +10 for the transition from A to A' and a reward of +5 for the transportation from B to B'. For all other moves that do not attempt to lead outside the grid world the reward is zero. There are no terminal states and the agent tries to maximize its future discounted rewards over an infinite horizon. Assume a discount factor of γ = 0.9. Due to the discount factor the accumulated reward remains finite even if the
[Figure: a 4 × 4 grid of cells numbered 1-16; A marks cell 1 (beaming reward +10), B marks cell 3 (beaming reward +5), B' marks cell 11 and A' marks cell 13.]

Figure 1: Grid world. Independent of the action taken by the agent in cell A, it is beamed to cell A' and receives a reward of +10. The same applies to B and B' with a reward of +5.

problem has an infinite horizon. Notice that returning from B' to B only takes a minimum of two steps, whereas going back to A from A' takes at least three steps. Therefore, it is not immediately obvious which policy is optimal.

Assignment 1:

Use value iteration to compute the value function V^π(s) for an equiprobable policy in which at each state all four possible actions (including the ones that attempt to cross the boundary of the grid world) have the same uniform probability π(s, a) = 1/4. Assume a discount factor γ = 0.9. Use value iteration according to the Bellman equation in (4) to approximate the value function. You can either use two arrays, one for the old values V^π_k(s) and one for the new values V^π_{k+1}(s). This way the new values can be computed one by one from the old values without the old values being changed. It turns out, however, that it is easier to use asynchronous updates, with each new value immediately overwriting the old one. The asynchronous update also converges to V^π; in fact it usually converges faster than the synchronous two-array version. As an example we compute the new value of state s_8. For the four possible actions
North, West, South, East the successor states are δ(s_8, North) = s_4, δ(s_8, South) = s_12, δ(s_8, West) = s_7 and δ(s_8, East) = s_8 (the agent attempts to leave the grid world and remains in the same square). The rewards are all zero except for the penalty r(s_8, East) = -1 when taking the East action. All actions are equally likely, therefore π(s_8, North) = π(s_8, South) = π(s_8, West) = π(s_8, East) = 1/4. In Matlab we use a vector of length 16 to store the value function. The update rule for state s_8 would look like:

>> gamma=0.9;
>> V=zeros(16,1);
>> V(8) = 1/4 * (-1 + gamma*(V(4) + V(7) + V(12) + V(8)))

The Matlab function plot_v(V,range,pi) plots the state value function as a color plot. The first argument V is a 16x1-vector with the state values V(s_i). The second optional argument range is a 2x1-vector specifying the lower and upper bound of the value function for scaling the color plot. The default range is [-10 30]. The third optional argument pi is a 16x1-vector specifying the current policy π(s): S → A, where by definition the actions North, East, South, West are enumerated clockwise from 1 to 4.

Use policy iteration based on equations 8 to compute the optimal value function V* and policy π*(s). It might be easier to use the action value function Q(s, a) rather than the state value function V(s). In Matlab you represent Q(s, a) by a 16x4-matrix, where the first dimension corresponds to the state and the second dimension to the action. Visualize the optimal value function and policy using plot_v. After how many iterations does the algorithm find an optimal policy, assuming the initial state values are zero? Is the optimal policy unique? What happens if you initialize the state value function with random values rather than zero?

>> V=10.0*rand(16,1);

Does the algorithm converge to a different policy?

Assignment 2:

Assume that the transition function is no longer deterministic, but given by the probability P(s' | s, a). Compute the optimal value function V* and
policy π*(s) using policy iteration according to equations 9 for a non-deterministic state transition function. Assume that with probability p = 0.7 the agent moves to the correct square as indicated by the desired action, but with probability 1 - p = 0.3 a random action is taken that pushes the agent to a random neighboring square. The random square can coincidentally be the very same cell that was originally preferred by the action. A random action can also be an illegal move, which incurs a penalty of -1. Visualize the optimal value function and policy using plot_v. After how many iterations does the algorithm find an optimal policy, assuming the initial state values are zero? Is the optimal policy unique?

3 Temporal Difference Learning

This assignment deals with the general reinforcement learning problem, in which we no longer assume that the state transition and reward functions are known. Temporal difference (TD) learning methods learn directly from experience and do not rely on a model of the environment's dynamics. TD methods update the estimate of the action value function based on learned estimates; in other words, unlike Monte Carlo methods, which update their estimates only at the end of an episode, they bootstrap and update their beliefs immediately after each state transition. For more details on temporal difference learning read chapters six and seven of the reinforcement learning book (Sutton and Barto, 1999).

Temporal difference learning is easier formulated using the action value function Q(s, a) rather than the state value function V(s); the two are related through

    Q^π(s, a) = Σ_{s'} P(s' | s, a) (R(s, s', a) + γ V^π(s'))              (10)

In contrast to dynamic programming, the agent learns through interaction with the environment. There is a need for active exploration of the state space and the possible actions. At each state s the agent chooses an action a according to its current policy, and observes an immediate reward r and a new state s'. This sequence of state, action, reward, state, action motivates the name SARSA for this form of learning. The action value function can also be learned by means of off-policy TD learning, called Q-learning.
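When a model happens to be available, equation 10 shows how to recover action values from a state value function. A minimal, illustrative Matlab sketch (the 16x16x4 arrays P and R holding P(s' | s, a) and R(s, s', a), the 16x1 vector V and the scalar gamma are assumed to exist; they are not part of the provided lab functions):

>> for a=1:4
>>   for s=1:16
>>     % inner product of transition probabilities with rewards plus discounted values
>>     Q(s,a) = P(s,:,a) * (R(s,:,a)' + gamma*V);
>>   end
>> end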
In its simplest form, one-step Q-learning, it is defined by the update rule

    Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') - Q(s, a))             (11)
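With the matrix representation of Q(s, a) used in the earlier assignments, this update is a single Matlab statement. A minimal sketch, where the variable names s, a, r, snew and alpha are illustrative (current state and action, observed reward, next state and learning rate):

>> % one-step Q-learning update, equation 11
>> Q(s,a) = Q(s,a) + alpha*(r + gamma*max(Q(snew,:)) - Q(s,a));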
In this case, the learned action-value function Q(s, a) directly approximates the optimal value function Q*(s, a), independent of the policy followed, hence off-policy learning. However, the policy π(s, a): S × A → R (π(s, a) is the probability of taking action a in state s) still has an effect in that it determines which state-action pairs are visited and updated. All temporal difference methods have a need for active exploration, which requires that the agent every now and then tries alternative actions that are not necessarily optimal according to its current estimates of Q(s, a). The policy is generally soft, meaning that π(s, a) > 0 for all states and actions. An ε-greedy policy satisfies this requirement, in that most of the time, with probability 1 - ε, it picks the optimal action according to

    π(s) = argmax_a Q(s, a)                                                (12)

but with a small probability ε it takes a random action. Therefore, all non-greedy actions are taken with the probability π(s, a) = ε/A(s), where A(s) is the number of alternative actions in state s. As the agent collects more and more evidence, the policy shifts towards a deterministic optimal policy. This can be achieved by decreasing ε with an increasing number of observations, for example according to

    ε(t) = ε_0 (1 - t/T)                                                   (13)

where T is the total number of iterations. Reasonable values for the learning and exploration rates are α = 0.1 and ε_0 = 0.2. The off-policy TD algorithm can be summarized as:

    Initialize Q(s, a) arbitrarily
    Initialize s
    Repeat for each step:
        Choose a from s using the ε-greedy policy based on Q(s, a)
        Take action a, observe reward r and next state s'
        Update Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') - Q(s, a))
        Replace s with s'
    until T steps
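The ε-greedy choice inside the loop above can be sketched in Matlab as follows; the variable names epsilon0, t, T and s are illustrative, with t the current step and T the total number of steps as in equation 13, and Q the action value matrix:

>> epsilon = epsilon0*(1 - t/T);   % decreasing exploration rate, equation 13
>> if rand < epsilon
>>   a = ceil(4*rand);             % exploratory: one of the four actions at random
>> else
>>   [dummy a] = max(Q(s,:));      % greedy action, equation 12
>> end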
Assignment 3:

For an unknown environment the agent is supposed to learn the optimal policy by means of off-policy temporal difference learning. The state space consists of 25 states s_1, ..., s_25, corresponding to a 5 × 5 grid world. In each state the agent has the choice between four possible actions a_1, ..., a_4, which can be associated with the four directions North, East, South, West. However, the transition function is not deterministic, which means the agent sometimes ends up in a non-neighboring square. Assume that the exact model of the environment and the rewards are unknown. The dynamics of the environment are determined by the Matlab functions s = startstate and [s_new reward] = env(s_old,action). The function startstate returns the initial state. The states s_1, ..., s_25 are represented by the integers 1, ..., 25, and the actions a_1, ..., a_4 are enumerated by 1, ..., 4. The function [s_new reward] = env(s_old,action) computes the next state s_new and the reward reward when executing action action in the current state s_old.

Represent the action value function Q(s, a) by a 25x4-matrix Q. Given Q you can compute the optimal policy pi(s) and state value function V and visualize them with plot_v_td(V,range,pi) using the following code:

>> [V pi] = max(Q,[],2);
>> plot_v_td(V,[-5 15],pi);

The function plot_v_td(V,range,pi) is the counterpart to the Matlab function plot_v(V,range,pi) for the 4 × 4 grid world used in the earlier assignments. The function plot_trace(states,actions,tlength) can be used to plot a trace of the most recently visited states. The parameter states is an Nx1-vector that contains the history of recent states s(t), ..., s(t + N), the parameter actions is an Nx1-vector that stores the history of recent actions a(t - 1), ..., a(t + N - 1), and tlength determines how many states from the past are plotted. Build a history of states, actions and rewards when iterating the TD-learning algorithm, by appending the new state s', action a and reward r to the history of previous states, actions and rewards.

>> for k=1:iterations
>> ...
>> states = [states s];
>> actions = [actions a];
>> rewards = [rewards r];
>> ...
>> end
>> plot_trace(states,actions,12);

Run the off-policy TD learning algorithm for 20000 steps. Initialize Q(s, a) with small positive values (e.g. 0.1) in order to bias the TD-learning to explore alternative actions in the early stages, when most of the time the rewards are zero. Every 500 steps visualize the current state value function V(s) and optimal policy π(s), and plot a trace of the recently visited states and actions. Also compute the average reward over the past 500 steps, and plot the evolution of the average and accumulated reward as a function of the number of iterations. Experiment with different settings for the exploration parameter ε_0 and the learning rate α. Can you think of an extension to the one-step TD-learning algorithm that would help to learn the optimal policy in a smaller number of iterations? If you have time, try to implement this extension.

References

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

T. M. Mitchell. Machine Learning. McGraw Hill, 1997.

R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1999. Also available online at http://www-anw.cs.umass.edu/~rich/book/the-book.html