Bellman goes Relational

Kristian Kersting^1 (kersting@informatik.uni-freiburg.de)
University of Freiburg, Machine Learning Lab, Georges-Koehler-Allee 079, Freiburg, Germany

Martijn Van Otterlo^1 (otterlo@cs.utwente.nl)
University of Freiburg, Machine Learning Lab, Georges-Koehler-Allee 079, Freiburg, Germany
Twente University, Department of Computer Science, TKI, P.O. Box 217, 7500 AE Enschede, The Netherlands

Luc De Raedt (deraedt@informatik.uni-freiburg.de)
University of Freiburg, Machine Learning Lab, Georges-Koehler-Allee 079, Freiburg, Germany

Appearing in Proceedings of the 21st International Conference on Machine Learning, Banff, Canada. Copyright by the authors.

Abstract

Motivated by the interest in relational reinforcement learning, we introduce a novel relational Bellman update operator called ReBel. It employs a constraint logic programming language to compactly represent Markov decision processes over relational domains. Using ReBel, a novel value iteration algorithm is developed in which abstraction (over states and actions) plays a major role. This framework provides new insights into relational reinforcement learning. Convergence results as well as experiments are presented.

1. Introduction

There has been a lot of attention and progress in reinforcement learning (RL) and Markov decision processes (MDPs) recently. Several basic algorithms have been proposed and their behavior is relatively well understood today (Sutton & Barto, 1998). This has led to an increased interest in the effects of generalization and to new challenges. One of them concerns the use of RL in relational domains (Džeroski et al., 2001). Even though a number of relational RL algorithms have been developed, essentially through varying the underlying function approximators (Driessens & Ramon, 2003; Gärtner et al., 2003), the problem of relational RL is still not well understood and a theory of relational RL is lacking.

In traditional RL, the Bellman backup operator is one of the central concepts.
A particularly interesting approach is that of Dietterich and Flann (1997), who showed that value backups in model-based RL can be upgraded to region-based backups, where multiple states are updated simultaneously using a backup operator that reverses the action operators. Inspired by this work, the key contribution of this paper is the introduction of a relational Bellman backup operator, called ReBel. ReBel is developed within a simple probabilistic STRIPS-like relational formalism that incorporates several elements of relational and logical Markov Decision Programming (Kersting & De Raedt, 2003; Van Otterlo, 2004) such as abstract states that are represented using relational queries. Using ReBel, we then develop a model-based relational RL algorithm and demonstrate it on a number of experiments.

The approach is also related to that by Boutilier et al. (2001) who employ a situation calculus based language. Although their work is certainly elegant and principled, due to the complexity of the language, they neither report on a complete implementation nor present automated experiments. In contrast, our approach is simpler and therefore fully automated. It deals fully automatically with the same experimental example that Boutilier et al. report on.

Outline: Section 2 briefly reviews relational logic and MDPs. After discussing value iteration (VI) for MDPs in Section 3, we introduce a language to compactly specify MDPs over relational domains in Section 4. In Section 5, we develop a relational VI algorithm based on ReBel. It is empirically validated in Section 6. Before concluding we discuss related work.

^1 Both authors contributed equally to the paper.

2. Preliminaries

Relational Logic, cf. (Nienhuys-Cheng & de Wolf, 1997): An alphabet $\Sigma$ is a set of relation symbols $p$ with arity $m \geq 0$, and a set of constants $c$. An atom $p(t_1, \ldots, t_m)$ is a relation symbol $p$ followed by a bracketed $m$-tuple of terms $t_i$. A term is a variable $X$ or a constant $c$. A conjunction $A$ is a set of atoms. The set of variables in a conjunction $A$ is denoted as $\mathit{vars}(A)$. A substitution $\theta$ is a set of assignments of terms to variables $\{X_1/t_1, \ldots, X_n/t_n\}$ where the $X_i$ are variables and all $t_i$ are terms. A term, atom or conjunction is called ground if it contains no variables. Conjunctions are implicitly assumed to be existentially quantified. A conjunction $A$ is said to be $\theta$-subsumed by a conjunction $B$, denoted by $A \leq_\theta B$, if there exists a substitution $\theta$ such that $B\theta \subseteq A$. The most general unifier (mgu) for atoms $a$ and $b$ is denoted by $\mathit{mgu}(a, b)$. A (Horn) clause $H \leftarrow B$ consists of a positive atom $H$ and a conjunction $B$ and can be read as "$H$ is true if $B$ is true". The greatest lower bound (glb) of two conjunctions $A$ and $B$ is the most general conjunction that is subsumed by both $A$ and $B$. Both subsumption and glb are also defined for clauses. The Herbrand base of $\Sigma$, $\mathit{HB}_\Sigma$, is the set of all ground atoms which can be constructed with the predicate symbols and constants of $\Sigma$. An interpretation is a subset of $\mathit{HB}_\Sigma$.

Our running example will be the blocks world. Here, a block $X$ can be moved on top of another block $Y$, denoted as $\mathit{move}(X, Y)$. Valid relations are $\mathit{on}(X, Y)$, i.e. block $X$ is on $Y$, and $\mathit{cl}(Z)$, i.e. block $Z$ is clear. To model the floor, we follow a common approach: it is a set of blocks which cannot be on top of other blocks.

Markov Decision Processes, cf. (Sutton & Barto, 1998): A Markov Decision Process (MDP) is a tuple $M = \langle S, A, T, R \rangle$, where $S$ is a set of states, $A$ a set of actions, $T : S \times A \times S \to [0, 1]$ a transition model and $R : S \times A \times S \to \mathbb{R}$ a reward model. The set of actions applicable in a state $s \in S$ is denoted $A(s)$. A transition from state $i \in S$ to $j \in S$ caused by some action $a \in A(i)$ occurs with probability $T(i, a, j)$ and a reward $R(i, a, j)$ is received.
$T$ defines a proper probability distribution if for all states $i \in S$ and all actions $a \in A(i)$: $\sum_{j \in S} T(i, a, j) = 1$. A deterministic policy $\pi : S \to A$ for $M$ specifies which action $a \in A(s)$ will be executed when the agent is in some state $s \in S$, i.e. $\pi(s) = a$.

3. Value Iteration

Given some MDP $M = \langle S, A, T, R \rangle$, a policy $\pi$ for $M$, and a discount factor $\gamma \in [0, 1]$, the state value function $V^\pi : S \to \mathbb{R}$ represents the value of being in a state following policy $\pi$, w.r.t. expected rewards. A similar state-action value function $Q^\pi : S \times A \to \mathbb{R}$ can be defined. A policy $\pi^*$ is optimal if $V^{\pi^*}(s) \geq V^{\pi'}(s)$ for all $s \in S$ and all $\pi'$. Optimal value functions are denoted $V^*$ and $Q^*$. Bellman's (1957) optimality equation states:

$$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] \tag{1}$$

From this equation, basically all methods for solving MDPs can be derived. For example, the well-known exact solution technique value iteration (VI) is obtained from (1) by turning it into an update rule:

$$V_{t+1}(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_t(s') \right] \tag{2}$$

$$\phantom{V_{t+1}(s)} = \max_a Q_{t+1}(s, a). \tag{3}$$

Based on Equation (2), the VI algorithm can be stated as follows: starting with a value function $V_0$ over all states, we iteratively update the value of each state according to (2) to get the next value functions $V_t$ ($t = 1, 2, 3, \ldots$). VI is guaranteed to converge in the limit towards $V^*$, i.e. the Bellman optimality equation (1) holds for each state.

Traditional VI as expressed by Equation (2) assumes that all states and values are represented explicitly in a table. This is impractical for all but the smallest state spaces. Furthermore, for relational domains, where the number of states can grow very large (even infinitely large), this is infeasible. Therefore, methods that abstract from specific states are needed. Such a method is developed in the next sections.

4. Markov Decision Programs

Traditional MDPs are essentially propositional in that each state can be represented using a separate proposition. In Markov decision programs these propositional symbols are replaced by abstract states:

Definition 1 An abstract state is a conjunction $Z$ of logical atoms, i.e., a logical query.

Abstract states represent sets of states.
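As a point of reference for the abstraction that follows, the tabular update rule (2) of Section 3 can be sketched in Python. This is a minimal sketch, not the authors' implementation: the function name `value_iteration`, the callback signatures, the stopping tolerance `tol`, and the three-state toy chain are all assumptions made for illustration.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-10):
    """Tabular value iteration following update rule (2).

    states        : list of states
    actions(s)    : actions applicable in s (empty for terminal states)
    T(s, a)       : dict {successor: probability}
    R(s, a, s2)   : reward for the transition s -a-> s2
    """
    V = {s: 0.0 for s in states}  # V_0 (assumed to be all zeros here)
    while True:
        V_next = {}
        for s in states:
            # Q_{t+1}(s, a) for every applicable action, Equation (3)
            q_values = [
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a).items())
                for a in actions(s)
            ]
            V_next[s] = max(q_values) if q_values else 0.0
        if max(abs(V_next[s] - V[s]) for s in states) < tol:
            return V_next
        V = V_next


# Hypothetical toy chain: far -> near -> goal; each step fails with probability 0.1.
STATES = ["far", "near", "goal"]
NEXT = {"far": "near", "near": "goal"}


def toy_actions(s):
    return ["step"] if s in NEXT else []  # the goal is terminal


def toy_T(s, a):
    return {NEXT[s]: 0.9, s: 0.1}


def toy_R(s, a, s2):
    return 10.0 if s2 == "goal" else 0.0  # reward 10 for entering the goal


V = value_iteration(STATES, toy_actions, toy_T, toy_R)
print(V)
```

The table `V` holds one entry per ground state, which is exactly the explicit representation that becomes infeasible in relational domains.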
More formally, a state is an interpretation, i.e. a set of ground facts. Consider e.g. the state $z = \mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, c)$ in the blocks world. An abstract state $Z$ is, e.g., $\mathit{cl}(X)$. It represents all states that are subsumed by $Z$, i.e., all interpretations in which there exists something that is clear. We can now introduce the basic ingredients of Markov decision programs, namely, abstract actions, abstract rewards, and integrity constraints.
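The relation between a ground state and an abstract state can be made concrete with a small $\theta$-subsumption check. The following is a minimal sketch under stated assumptions: atoms are encoded as tuples such as `("on", "a", "c")`, variables are strings starting with an uppercase letter, and the function names `is_variable` and `subsumes` are hypothetical. Inequality and integrity constraints are ignored here.

```python
def is_variable(term):
    """Convention assumed here: variables start with an uppercase letter."""
    return term[:1].isupper()


def subsumes(general, specific):
    """Return a substitution theta with general*theta a subset of specific, or None.

    Terms of `specific` are treated as constants, so the check also works
    when `specific` is itself an abstract state.
    """
    atoms = list(general)
    facts = set(specific)

    def search(i, theta):
        if i == len(atoms):
            return theta
        pred, *args = atoms[i]
        for fact in facts:
            if fact[0] != pred or len(fact) != len(atoms[i]):
                continue
            trial = dict(theta)
            ok = True
            for g, s in zip(args, fact[1:]):
                if is_variable(g):
                    if trial.setdefault(g, s) != s:  # conflicting binding
                        ok = False
                        break
                elif g != s:  # constants must match exactly
                    ok = False
                    break
            if ok:
                result = search(i + 1, trial)
                if result is not None:
                    return result
        return None

    return search(0, {})


# The ground state z from the text and two abstract states.
z = {("cl", "a"), ("cl", "b"), ("on", "a", "c")}
print(subsumes([("cl", "X")], z) is not None)      # something is clear in z
print(subsumes([("on", "a", "Y")], z))             # binds Y to c
print(subsumes([("on", "X", "Y"), ("cl", "Y")], z))  # no block under a block is clear
```

The backtracking search is exponential in the worst case, which is acceptable for the small conjunctions used in the examples of this paper.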

An abstract action is defined as follows.

Definition 2 An abstract action^2 is a finite set of action rules $H_i \xleftarrow{p_i : A} B$ where $A$ is an atom representing the name and the arguments of the action and $B$ is an abstract state denoting the preconditions of $A$. $H_i$ is the $i$-th possible outcome of $A$. It holds that $\sum_i p_i = 1$. We assume that $\mathit{vars}(A) = (\mathit{vars}(H_i) \cup \mathit{vars}(B))$.

The semantics of the action definition are: if the current state $b$ is subsumed by $B$, i.e., $b \leq_\theta B$, then taking action $A$ will result in $[b \setminus B\theta] \cup H_i\theta$ with probability $p_i$. So, if the preconditions are fulfilled, all outcomes are possible. As an illustration, consider

$$\mathit{on}(X,Y),\ \mathit{cl}(X),\ \mathit{cl}(Z),\ X \neq Y,\ Y \neq Z,\ X \neq Z \;\xleftarrow{0.9\,:\,\mathit{move}(X,Y,Z)}\; \mathit{cl}(X),\ \mathit{cl}(Y),\ \mathit{on}(X,Z),\ X \neq Y,\ Y \neq Z,\ X \neq Z.$$

$$\mathit{cl}(X),\ \mathit{cl}(Y),\ \mathit{on}(X,Z),\ X \neq Y,\ Y \neq Z,\ X \neq Z \;\xleftarrow{0.1\,:\,\mathit{move}(X,Y,Z)}\; \mathit{cl}(X),\ \mathit{cl}(Y),\ \mathit{on}(X,Z),\ X \neq Y,\ Y \neq Z,\ X \neq Z.$$

which moves block $X$ on $Y$ with probability 0.9. With probability 0.1 the action fails, i.e., we do not change the state. Applied to the above state $z$ the action tells us that $\mathit{move}(a, b, c)$ will result in $z' \equiv \mathit{on}(a, b), \mathit{cl}(a), \mathit{cl}(c)$ with probability 0.9 and with probability 0.1 we stay in $z$. This type of action definition implements a kind of probabilistic STRIPS operator.

The model $R$ of abstract rewards specifies the rewards generated by entering abstract states. In our framework it coincides with our initial abstract state value function $V_0$.

Definition 3 An abstract state value function $V$ is a finite list of value rules of the form $c \leftarrow B$ where $B$ is an abstract state and $c \in \mathbb{R}$.

To any abstract state $Z$, $V$ assigns the maximal value $c$ of all matching value rules $c \leftarrow B$ to $Z$ as value. A rule matches if $Z \leq_\theta B$. Consider e.g. $R = V_0$ as $10.0 \leftarrow \mathit{on}(a, b)$. and $0.0 \leftarrow \mathit{true}$. It assigns 0 to $z$ but 10 to $z'$. Using $\mathit{true}$ in the last value rule assures that all states are assigned a value. To develop ReBel, we will also employ abstract action-state value functions, which are similar to abstract state value functions and of which an example can be found in Section 5.2.

Definition 4 An abstract state action value function $Q$ is a finite set of Q-rules of the form $c : A \leftarrow B$ where $B$ is an abstract state, $A$ is an action and $c \in \mathbb{R}$.
To any abstract state-action pair $B$ and $A$, $Q$ assigns the maximal value $c$ of all abstract state action rules subsumed by $A \leftarrow B$.

^2 For the sake of simplicity, we consider cost-free actions. The framework can be adapted to the case of action costs. Note also that the meaning of abstract action here differs from that sometimes used in the context of hierarchical RL.

Rewards are specified over queries, i.e., existentially quantified goals. Although these are simple, they are expressive enough to specify many interesting problems studied by the (relational) RL community such as shortest-path problems. Here, the goal is to reach certain (abstract) states. When a goal state is entered, the process ends. In RL, episodic tasks are encoded using absorbing states. We encode this by artificial deterministic actions such as $\mathit{on}(a, b) \xleftarrow{1.0\,:\,\mathit{absorbing}} \mathit{on}(a, b)$, which denotes that all states that are subsumed by $\mathit{on}(a, b)$ transition only to themselves and generate only zero rewards. For example, $z$ is not absorbing but $z'$ is.

Finally, we need a way to cope with the integrity constraints imposed by our domain. For instance, in the move definition above we employed symmetry of $\neq$. This can be modeled by a set $C$ of integrity constraints. Each integrity constraint is a Horn clause. For instance in the blocks world, no block can be free if there is a block on top of it and no block can be on itself: $\mathit{false} \leftarrow \mathit{on}(X, Y), \mathit{cl}(Y)$ and $X \neq Y \leftarrow \mathit{on}(X, Y)$. The completion of an abstract state $Z$ is the least fixpoint of $C \cup \{Z\}$, i.e., all facts deducible from $C \cup \{Z\}$. For example, $\mathit{on}(a, b)$ does not encode that $a$ is not $b$. Using the rules above, this state is completed to $\mathit{on}(a, b), a \neq b$. Furthermore, if the completion includes $\mathit{false}$, the state does not satisfy the constraints, i.e., it is an illegal state. To deal with integrity constraints, we also have to adapt our notions of action definitions and generality. Action definitions are now constrained so that they cannot lead to illegal states. For subsumption we employ the integrity constraints as background theory and use Buntine's generalized subsumption framework (Buntine, 1988).
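The action semantics of Definition 2 and the two blocks world integrity constraints can be illustrated on ground states. The sketch below is an assumption-laden illustration, not the paper's constraint logic program: atoms are tuples, the substitution is passed in explicitly instead of being computed by subsumption, the inequality constraints of the move rule are assumed to hold for the chosen substitution, and the names `apply_outcome` and `is_legal` are hypothetical.

```python
def apply_outcome(state, precondition, outcome, theta):
    """Compute (b minus B*theta) union H_i*theta for a ground state b."""

    def ground(atoms):
        return {(a[0],) + tuple(theta.get(t, t) for t in a[1:]) for a in atoms}

    pre = ground(precondition)
    if not pre <= state:
        raise ValueError("preconditions are not fulfilled in this state")
    return (state - pre) | ground(outcome)


def is_legal(state):
    """Check the two integrity constraints of the text on a ground state."""
    for atom in state:
        if atom[0] == "on":
            _, x, y = atom
            if x == y:  # X != Y <- on(X, Y)
                return False
            if ("cl", y) in state:  # false <- on(X, Y), cl(Y)
                return False
    return True


# The probabilistic move rule, with inequality constraints omitted.
MOVE_PRE = [("cl", "X"), ("cl", "Y"), ("on", "X", "Z")]
MOVE_OUTCOMES = [
    (0.9, [("on", "X", "Y"), ("cl", "X"), ("cl", "Z")]),  # move succeeds
    (0.1, MOVE_PRE),                                     # move fails, state unchanged
]

z = frozenset({("cl", "a"), ("cl", "b"), ("on", "a", "c")})
theta = {"X": "a", "Y": "b", "Z": "c"}  # move(a, b, c)

successors = [(p, apply_outcome(z, MOVE_PRE, h, theta)) for p, h in MOVE_OUTCOMES]
for p, s in successors:
    print(p, sorted(s), is_legal(s))
```

The first successor is the state $z'$ of the text; the second is $z$ itself.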
Along the lines of (Kersting & De Raedt, 2003; Van Otterlo, 2004), it can be proven that any Markov decision program induces a (possibly infinite) MDP.

5. Relational Value Iteration: ReBel

We will now develop a value iteration algorithm for Markov decision programs, i.e., given an abstract reward model $R$, i.e., an initial abstract state value function $V_0$, compute the next abstract state value functions $V_t$, $t = 1, 2, \ldots$ The main idea is to upgrade Bellman's traditional backup operator in Equation (2). Therefore, we iterate over:

1) Regress all preceding abstract states from $V_t$.

2) Compute $Q_{t+1}$ over the regressed states.
3) Compute $V_{t+1}$ by maximizing over $Q_{t+1}$.

We will now discuss each step in turn.

5.1. Regression

Let $V_t$ be the current abstract state value function, say $V_0$, and consider the abstract action move. For a single Bellman backup, all abstract states $S$ which lead to a condition in $V_0$ when taking action move have to be computed. Thus, we have to reason from post- to preconditions. For example, the first outcome of $\mathit{move}(a, b, c)$ can lead from state $S \equiv (\mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, c), \mathit{on}(b, d))$ (inequality constraints omitted) to the abstract state $S' \equiv \mathit{on}(a, b)$. Thus, we have to compute the weakest preconditions for the outcomes of move and $S'$.

Definition 5 All abstract states which lead to $S'$ when following some action rule $H_i \xleftarrow{p_i : A} B$ constitute the so-called weakest precondition $\mathit{wp}_i(A, S')$ of the $i$-th outcome of $A$.

For example, $S$ lies in the weakest precondition of $S'$, i.e., $S \in \mathit{wp}_1(\mathit{move}(X, Y, Z), S')$ but it does not lie in $\mathit{wp}_2(\mathit{move}(X, Y, Z), S')$.

To compute $\mathit{wp}_1(\mathit{move}(X, Y, Z), S')$ we can assume that we moved from $S$ to $S'$. Thus, 1) the preconditions of the action (rule) are fulfilled in $S$, and 2) $S'$ is partially caused by the first outcome of the action. As an illustration of 2), consider $S' \equiv \mathit{on}(a, b)$:

move caused $\mathit{on}(a, b)$: We have been in abstract state $S_1 \equiv (\mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, Z), a \neq b, a \neq Z, b \neq Z)$ and moved $X = a$ on $Y = b$.

move did not cause $\mathit{on}(a, b)$: We moved $X$ on $Y$ but not $a$ on $b$. Therefore, we have been in abstract states $T \equiv (\mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(X, Z), \mathit{on}(a, b), X \neq Y, X \neq Z, Y \neq Z)$ satisfying that we did not move $a$ on $b$, i.e., $\mathit{on}(X, Y) \neq \mathit{on}(a, b)$, and that we did not move $a$ away from $b$, i.e., $\mathit{on}(X, Z) \neq \mathit{on}(a, b)$. The constraints guarantee that applying $\mathit{move}(X, Y, Z)$ in $T$ preserves $\mathit{on}(a, b)$. The definition of $S$ simplifies to $S_2 \equiv (T \wedge X \neq a)$, $S_3 \equiv (T \wedge X \neq a \wedge Z \neq b)$, $S_4 \equiv (T \wedge Y \neq b \wedge X \neq a)$, and $S_5 \equiv (T \wedge Y \neq b \wedge Z \neq b)$. All $S_i$ are completed to the same state, namely $S_6 \equiv \mathit{cl}(A), \mathit{cl}(B), \mathit{on}(a, b), \mathit{on}(A, C)$ where all variables and constants are mutually different.

The abstract states $S_1$, $S_6$ together logically define $\mathit{wp}_1(\mathit{move}(X, Y, Z), S') \equiv (S_1 \vee S_6)$.
So far, we considered a single effect only, namely $\mathit{on}(a, b)$. In general, however, there can be multiple (combined) effects that are or that are not caused by taking action move, cf. WeakestPre in Procedure 1.

Procedure 1: WeakestPre returns the weakest precondition of an action rule $H_i \xleftarrow{p_i : A} B$ and an abstract state $S'$ given a set of integrity constraints $C$. We omitted that only legal and completed abstract states are inserted in $\mathit{wp}_i$.

1: initialize $\mathit{wp}_i$ to be the empty list
2: for each subset $S''$ of $S'$ and subset $P$ of $H_i$ such that $\theta = \mathit{mgu}(S'', P)$ exists, or $S'' = \emptyset$ and $P = H_i$, i.e., $\theta = \emptyset$, do
3:     $S := (S'\theta \setminus P\theta) \cup B\theta$
4:     for all pairs $(l, l')$ in $\{(l, l') \mid l \in (S'\theta \setminus P\theta) \wedge l' \in H_i\theta \cup B\theta\}$ do
5:         if $\mathit{mgu}(l, l')$ exists then
6:             add $l \neq l'$ to $S$
7:     add all simplifications of $S$ to $\mathit{wp}_i$
8: return $\mathit{wp}_i$

Consider for example $S' \equiv (\mathit{on}(a, b), \mathit{on}(c, d))$. Moving a block on some other block can have caused either $\mathit{on}(a, b)$ or $\mathit{on}(c, d)$, or neither of them, cf. line 2. Assume that no effect was caused. Then, $S''$ is empty and $P = H_1$, cf. line 2. Therefore, $\theta$ is the empty substitution and $S \equiv (\mathit{on}(a, b), \mathit{on}(c, d), \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(X, Z))$ (inequality constraints omitted) is a possible preimage, cf. line 3. However, we know that move did not cause $\mathit{on}(a, b), \mathit{on}(c, d)$. Therefore, it holds that $\mathit{on}(X, Z) \neq \mathit{on}(a, b) \wedge \mathit{on}(X, Z) \neq \mathit{on}(c, d) \wedge \mathit{on}(X, Y) \neq \mathit{on}(a, b) \wedge \mathit{on}(X, Y) \neq \mathit{on}(c, d)$, cf. lines 4-6. $S$ can be simplified for instance to $S, X \neq a, X \neq c$ which is a legal abstract state. The case that the action caused some effects is covered by the "$\mathit{mgu}(S'', P)$ exists" condition in line 2. It is treated analogously.

5.2. Computing Abstract State Action Values

Given the regressed abstract states and the current abstract state value function $V_t$, we now compute an abstract state-action value function $Q_{t+1}$ according to Procedure 2. To do so, (A) we treat each outcome of an action $A$ as though it would be a single action and compute its abstract state action value, cf. line 4. Then, (B) we combine the values of all outcomes to an abstract state action value for $A$, cf. Procedure 2.^3

^3 For the sake of brevity, we will not state constraints in the examples till the end of Section 5.3.
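For step (A), consider again the first outcome of move. The weakest precondition was $\mathit{wp}_1(\mathit{move}(X, Y, Z), S') \equiv S_1 \vee S_6$. Because $S_6$ is absorbing, we assign an abstract state action value of 10 for taking action move, i.e., $\langle 10 : \mathit{move}(X, Y, Z) \leftarrow S_6 \rangle$. The value of $S_1$, however, is dependent on $V_t(S')$, i.e. in our example $V_0$. Assuming a discount factor of 0.9 this yields $R(S) + p_1 \cdot \gamma \cdot V_0(S') = 0 + 0.9 \cdot 0.9 \cdot 10 = 8.1$, i.e., $\langle 8.1 : \mathit{move}(a, b, Z) \leftarrow S_1 \rangle$. Doing the same for all other rules in $V_0$ results in:

a  $\langle 10 : \mathit{move}(X, Y, Z) \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(a, b), \mathit{on}(X, Z) \rangle$
b  $\langle 8.1 : \mathit{move}(a, b, Z) \leftarrow \mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, Z) \rangle$
c  $\langle 0.0 : \mathit{move}(X, Y, Z) \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(X, Z) \rangle$

For the second outcome of move, step (A) leads to:

d  $\langle 1.0 : \mathit{move}(a, X, b) \leftarrow \mathit{cl}(a), \mathit{cl}(X), \mathit{on}(a, b) \rangle$
e  $\langle 1.0 : \mathit{move}(X, Y, Z) \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(a, b), \mathit{on}(X, Z) \rangle$
f  $\langle 0.0 : \mathit{move}(X, Y, Z) \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(X, Z) \rangle$

Procedure 2: QRules returns the Q-rules of an action $A$ given the reward model $R$, the current value function $V_t$ and discount factor $\gamma$. Note that $\tilde{A}$ denotes the action head where we keep the substitution made by $\mathit{wp}_i$. We also omitted that only legal and completed abstract states $G$ are inserted in Qrules.

1: initialize Qrules to the empty set
2: for each action rule $H_i \xleftarrow{p_i : A} B$ for $A$ do
3:     for each $\langle v \leftarrow V' \rangle$ in $V_t$ do
4:         partialQ $:= \{ \langle q : \tilde{A} \leftarrow S \rangle \mid S \in \mathit{wp}_i(A, V') \}$, where $q := R(S)$ if $S$ is absorbing and $q := R(S) + p_i \gamma V_t(V')$ otherwise
5:     if Qrules $= \emptyset$ then
6:         Qrules := partialQ
7:     else
8:         newQ $:= \emptyset$
9:         for all pairs $\langle q' : \tilde{A}' \leftarrow S' \rangle \in$ Qrules and $\langle q'' : \tilde{A}'' \leftarrow S'' \rangle \in$ partialQ do
10:            if $G := \mathit{glb}(\tilde{A}' \leftarrow S', \tilde{A}'' \leftarrow S'')$ exists then
11:                add $\langle q : G \rangle$ to newQ with $q = q' + q''$
12:        Qrules := newQ
13: return Qrules

For step (B), we note that each of these rules describes situations such as "if we are in a state, then we can get some value for achieving the $i$-th outcome of action $A$". This information has to be combined into abstract state action values for $A$. To do so, we select a rule from a to c, say b, and a rule from d to f, say f, and check whether we can be in both abstract states at the same time and whether we can apply the same action. In other words, we compute the greatest lower bound (glb) of the logical clauses underlying both value rules. If the glb (where the actions have to unify) exists and it is a legal state, then it is inserted as a new rule, cf. line 11. The value of the new rule is the sum of the values of the combined rules. For b and f this yields $\langle 8.1 : \mathit{move}(a, b, X) \leftarrow \mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, X) \rangle$. In contrast, b and d do not give a new rule.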
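The arithmetic behind rules b and f and their combination can be checked numerically. This is only a sketch of the value computation of line 4 and the sum of line 11 of Procedure 2 as printed above; the function name `partial_q` is hypothetical, and the logical part (weakest preconditions and the glb) is not modeled.

```python
import math

GAMMA = 0.9  # discount factor assumed in the running example


def partial_q(reward, p_i, v_next, absorbing, gamma=GAMMA):
    """Value of one regressed state S for the i-th outcome (Procedure 2, line 4)."""
    if absorbing:
        return reward
    return reward + p_i * gamma * v_next


# Rule b: first outcome (p_1 = 0.9), regressed from the V_0 rule 10 <- on(a, b).
q_b = partial_q(reward=0.0, p_i=0.9, v_next=10.0, absorbing=False)

# Rule f: second outcome (p_2 = 0.1), regressed from the V_0 rule 0 <- true.
q_f = partial_q(reward=0.0, p_i=0.1, v_next=0.0, absorbing=False)

# Step (B): the value of the combined rule is the sum of both parts.
q_combined = q_b + q_f
print(round(q_b, 2), round(q_f, 2), round(q_combined, 2))
```

Only the non-absorbing branch is exercised here, since it is the one the text works through explicitly.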
In our blocks world example, QRules yields the following abstract state action value function when applied on $V_0$ and move and absorbing:

1  $\langle 10 : \mathit{absorbing} \leftarrow \mathit{on}(a, b) \rangle$
2  $\langle 10 : \mathit{move}(X, Y, Z) \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(a, b), \mathit{on}(X, Z) \rangle$
3  $\langle 8.1 : \mathit{move}(a, b, X) \leftarrow \mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, X) \rangle$
4  $\langle 0.0 : \mathit{move}(X, Y, Z) \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(X, Z) \rangle$

Note that we have sorted the Q-rules in descending order only for the sake of readability.

Procedure 3: VRules returns the value function $V_{t+1}$ given the Q-rules computed from $V_t$ for all actions.

1: initialize $V_{t+1}$ to the empty set of V-rules
2: sort Qrules in decreasing order of Q-values
3: while Qrules not empty do
4:     remove top element $\langle d : A \leftarrow B \rangle$ of Qrules
5:     if no other rule $\langle d : A' \leftarrow B' \rangle$ in Qrules exists such that $B'$ subsumes $B$ then
6:         add $\langle d \leftarrow B \rangle$ to $V_{t+1}$
7:         remove all rules $\langle d' \leftarrow B' \rangle$ from Qrules such that $B'$ is subsumed by $B$
8: return $V_{t+1}$

5.3. Computing Abstract State Values

The set of Q-rules enables one to compute the next abstract state value function $V_{t+1}$. In contrast to the traditional case, Q-rules, i.e., values of abstract state action pairs, can overlap such as Q-rules 1 and 2. To compute abstract state values we make use of the fact that $V_{t+1}(S) = \max_A Q_{t+1}(S, A)$ due to Eq. (3). In general, any value-preserving transformation can be applied. In this paper, we use a simple separate-and-conquer rule learning approach where the rules to learn and the examples to learn from coincide, see VRules in Procedure 3. We search for a Q-rule $m$ having a maximal Q-value among Qrules, lines 3-4, separate the covered Q-rules, line 5, and recursively conquer the remaining Q-rules by selecting more rules until no Q-rules remain, line 6. The main difference is that we select $m$ and add it to $V_{t+1}$ only if there is no other Q-rule left in Qrules with the same value whose body subsumes the body of $m$, cf. line 8. In our running example, we start with rule 1. Because it is not subsumed by any other rule having the same value, we add $10 \leftarrow \mathit{on}(a, b)$ to $V_1$ and, because it subsumes 2, we remove 2 from Qrules. The remaining highest valued rule is 3, and we iterate.
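The selection loop of Procedure 3 can be sketched in Python on the four Q-rules of the running example. This is a simplified illustration under explicit assumptions: rule bodies are lists of tuple-encoded atoms, variables start with an uppercase letter, inequality and integrity constraints are ignored, plain $\theta$-subsumption stands in for generalized subsumption, and the removal step is read as belonging inside the `if`. The names `subsumes` and `v_rules` are hypothetical.

```python
def is_variable(term):
    return term[:1].isupper()


def subsumes(general, specific):
    """True if a substitution maps every atom of `general` onto an atom of `specific`."""
    atoms, facts = list(general), set(specific)

    def search(i, theta):
        if i == len(atoms):
            return True
        pred, *args = atoms[i]
        for fact in facts:
            if fact[0] != pred or len(fact) != len(atoms[i]):
                continue
            trial = dict(theta)
            if all(
                (trial.setdefault(g, s) == s) if is_variable(g) else g == s
                for g, s in zip(args, fact[1:])
            ):
                if search(i + 1, trial):
                    return True
        return False

    return search(0, {})


def v_rules(q_rules):
    """q_rules: list of (value, action, body). Returns V-rules as (value, body)."""
    remaining = sorted(q_rules, key=lambda r: r[0], reverse=True)
    v_next = []
    while remaining:
        value, _action, body = remaining.pop(0)
        # Skip the rule if an equally valued, more general rule is still waiting.
        blocked = any(v == value and subsumes(b, body) for v, _a, b in remaining)
        if not blocked:
            v_next.append((value, body))
            # Separate: drop every remaining rule covered by the selected body.
            remaining = [r for r in remaining if not subsumes(body, r[2])]
    return v_next


Q_RULES = [
    (10.0, "absorbing", [("on", "a", "b")]),
    (10.0, "move(X,Y,Z)", [("cl", "X"), ("cl", "Y"), ("on", "a", "b"), ("on", "X", "Z")]),
    (8.1, "move(a,b,X)", [("cl", "a"), ("cl", "b"), ("on", "a", "X")]),
    (0.0, "move(X,Y,Z)", [("cl", "X"), ("cl", "Y"), ("on", "X", "Z")]),
]

V1 = v_rules(Q_RULES)
for value, body in V1:
    print(value, body)
```

On these inputs the loop keeps rules 1, 3 and 4 and discards rule 2, mirroring the walk-through in the text.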
After completing, this yields the new value function $V_1$ (constraints listed again):

$10 \leftarrow \mathit{on}(a, b), a \neq b.$
$8.1 \leftarrow \mathit{cl}(a), \mathit{cl}(b), \mathit{on}(a, X), a \neq b, a \neq X, b \neq X.$
$0 \leftarrow \mathit{cl}(X), \mathit{cl}(Y), \mathit{on}(X, Z), X \neq Y, X \neq Z, Y \neq Z.$

5.4. Relational Bellman Backup Operator

To summarize, the general scheme of ReBel is: 1) Compute the weakest precondition of each action outcome for each abstract state in $V_t$ using WeakestPre. As done in QRules, 2a) assign to each abstract state action outcome pair computed in 1) a Q-value and 2b) combine them using the glb. 3) Maximize the Q-rules to compute $V_{t+1}$ using VRules. Note that in 2b), if there are $n > 1$ many outcomes of an action, then the Q-values of the $n$-th outcome are combined with the already combined Q-values of the $n - 1$ previous outcomes. Thus, there are $n - 1$ many combinations per action. This might produce many rules. To overcome this, one can adapt VRules maximizing Q-rules to compress Q-rules: if we are in a state with different currently combined values for compatible actions, then we select only the higher one. This is safe because the higher valued Q-rule subsumes the lower valued one. Therefore, it would have been selected in any case later on. Formally, this Bellman backup requires an infinite number of iterations to converge to $V^*$, cf. Section 6. In practice, we stop when the abstract value function changes by only a small amount.

Figure 1. Blocks World Experiment I: Abstract state value function for the $\mathit{cl}(a)$ goal after 10 iterations. It applies for any number of blocks. Values are rounded to the second digit. $F_i$ can be a block or a floor block. States structurally different from the depicted ones get value 0.0.

6. Experiments

In this section we empirically validate ReBel. We implemented ReBel with compressing Qrules in the Prolog system YAP, and we used the supplemented constraint handling rules library (Frühwirth, 1998). In all experiments we assume a discount factor of 0.9 and a goal reward of 10, i.e., in all other states we receive 0 reward. Only goal states are absorbing. Experiments were run on a 3.1 GHz Linux machine. The running times were estimated using YAP's built-in statistics(runtime, _).
We focused on standard examples known from the relational RL literature.

Blocks World Experiment I: We consider $\mathit{cl}(a)$ as goal in our probabilistic blocks world setting. The experiment shows that even on simple problems ReBel is not guaranteed to converge on the structural level. Figure 1 shows the abstract state value function after 10 iterations. It took ReBel roughly 1 minute to iterate ten times. Figure 1 highlights that states that are one step further away from the goal get the same value. The value, however, is lower because of the additional block on top of the stack of $a$. Thus, because the number of blocks is not restricted, value iteration will never stop.

Proposition: Abstraction does not guarantee convergence in infinite domains because an infinite number of abstract states can be required.

This is interesting, because infinite state spaces easily arise when relational representations are used and relational abstraction was hoped to be a solution. Nevertheless, relational value iteration can converge even for infinite domains as our third experiment will show.

Figure 2. Blocks World Experiment II: Parts of the abstract value function for $\mathit{on}(a, b)$ after 10 iterations (values rounded to the second digit). It applies for any number of blocks. We omitted the inequality constraints: all blocks are mutually different. $F_i$ can be a block or a floor block. States more than 10 steps away from the goal get value 0.0.

Blocks World Experiment II: We consider the goal $\mathit{on}(a, b)$ in a deterministic blocks world because it is reported to be a hard problem for model-free relational RL (RRL) approaches (Džeroski et al., 2001; Driessens & Ramon, 2003). For instance, Driessens and Ramon (2003) report that on average the learned policies did not reach optimal performance even for 5 blocks. Using the same experimental set-up as in our first experiment but a deterministic move action, ReBel computed $V_{10}$ in less than 12 minutes. The abstract value function is partially shown in Figure 2. Because the move action is deterministic, $V_{10}$ is optimal for 10 blocks (more than 58 million ground states). The optimal policy can directly be extracted by computing the maximizing Q-rules for each abstract state. In our example, this results in removing the top elements from the stacks on top of $a$ and $b$. However, to compactly represent this strategy, one needs to define the predicate ontop. In the experiments Driessens and Ramon (2003) reported on, this was always the case. The policy based on ReBel is optimal no matter how many blocks there are.

Table 1. Load-Unload Experiment: The $t$-th column shows the abstract state value function after the $t$-th iteration. When no value is given, the abstract state has value 0.0. Bold numbers highlight changed values. Abstract states (rows of the table):
- bin(b, p)
- tin(A, p), on(b, A), not rain
- tin(A, p), on(b, A), rain
- tin(A, B), on(b, A), not rain
- tin(A, B), on(b, A), rain
- tin(A, B), bin(b, B), not rain
- tin(A, B), bin(b, B), rain
- tin(A, B), bin(b, C), not rain
- tin(A, B), bin(b, C), rain
- tin(A, B)

Load-Unload Experiment: Our final experiment considers the logistics domain which Boutilier et al. (2001) solved semi-automatically. The domain consists of cities, trucks and boxes. Boxes can be loaded onto and unloaded from trucks, and trucks can be driven between cities. The predicate $\mathit{on}(B, T)$ denotes that box $B$ is on the truck $T$, $\mathit{bin}(B, C)$ denotes that box $B$ is in some city $C$ and $\mathit{tin}(T, C)$ denotes that truck $T$ is in city $C$. The actions that can be performed are: $\mathit{load}(B, T)$ and $\mathit{unload}(B, T)$ specifying how box $B$ can be loaded onto or unloaded from truck $T$, and $\mathit{drive}(T, C)$ specifying that the truck $T$ is driven to city $C$. The actions in this domain have probabilistic effects. The probability of failing a load or unload action, i.e., staying in the current state, depends on whether it rains or not, denoted by rain.
The action specification is as follows (we omit the failing specifications for the sake of brevity):

$$\mathit{bin}(B, C),\ \mathit{tin}(T, C),\ R \;\xleftarrow{\mathit{pr}\,:\,\mathit{unload}(B, T)}\; \mathit{on}(B, T),\ \mathit{tin}(T, C),\ R$$

$$\mathit{on}(B, T),\ \mathit{tin}(T, C),\ R \;\xleftarrow{\mathit{pr}\,:\,\mathit{load}(B, T)}\; \mathit{bin}(B, C),\ \mathit{tin}(T, C),\ R$$

$$\mathit{tin}(T, C'),\ C \neq C' \;\xleftarrow{1.0\,:\,\mathit{drive}(T, C')}\; \mathit{tin}(T, C)$$

where the probability $\mathit{pr}$ is 0.9 if $R$ is rain and 0.7 if $R$ is not rain. To correctly handle the explicit negation we used for rain, we provided $\mathit{false} \leftarrow \mathit{rain}, \mathit{not\ rain}$ as constraint. The goal in this domain is to get some box $b$ in $p$ where $p$ stands for Paris, i.e., in $\mathit{bin}(b, p)$ we get a reward of 10. ReBel ran for less than 6 seconds to compute the results summarized in Table 1. In contrast to the blocks world examples, the solution converges both at the value level and at the structural level. E.g., take the situation in which a truck is in a city different from Paris and the box is there too. Then, it will take three steps (load, drive, unload) to reach the goal state and the state value in $V_{10}$ is [...] in case it rains. The abstract state value function applies no matter how many trucks, boxes and cities are present.

7. Related Work

In the past few years, there has been an increased and significant interest in using rich relational representations for modeling and learning MDPs. In model-free relational RL, one has studied different relational learners for function approximation (Džeroski et al., 2001; Lecoeuche, 2001; Driessens & Ramon, 2003; Gärtner et al., 2003). Others have applied Q-learning based on pre-specified abstract state spaces: Kersting and De Raedt (2003) investigate pure Q-learning, Van Otterlo (2004) learns the Q-function via learning the underlying transition model. Fern et al. (2003) extended previous work on upgrading learned policies for small relational MDPs (RMDPs) with approximated policy iteration. Finally, Guestrin et al. (2003) recently reported on class-based, approximate value functions for RMDPs. For model-based approaches, there has been a surprising lack of research on exact solution methods.
From a general point of view, ReBel is closely related to decision theoretic regression (DTR) (Boutilier et al., 1999) and, because of that, it is also related to regression planning in the same way as DTR is. Within DTR, most algorithms are designed to work with propositional representations. Actually, the only exception the authors know of is that of Boutilier et al. (2001). ReBel relates to this in that it also is a model-based exact solution method for RMDPs. One key difference is that Boutilier et al. employ situation calculus for representing RMDPs. Situation calculus is very expressive and as a consequence it is harder to simplify the logical descriptions of the abstract value function states that are obtained. This may also explain why, to the best of the authors' knowledge, that approach has not been fully implemented and experimented with. In contrast, because of the use of a simpler logical language, the simplification in ReBel is computationally feasible. As shown in the experiments, ReBel successfully and fully automatically implements relational value iteration. Finally, the work by Dietterich and Flann (1997) is also concerned with generalizing Bellman backups but no relational representation is used.

8. Conclusions

The key contribution of this paper is the introduction of ReBel, a relational upgrade of the Bellman update operator. It has been used to implement a relational value iteration algorithm. It has been shown to be effective in a number of simple though significant examples. This in turn has led to a number of novel insights into relational MDPs. First, it has been shown that value-based methods for relational MDPs may not converge because an infinite number of abstract states has to be represented. Second, we highlighted that in such cases background knowledge may enable the learning of optimal policies. So, depending on the representation of the problem, one can or cannot learn the optimal policy. Therefore, using background knowledge in RMDPs is not only an interesting feature, but in some cases also a necessity for successful learning. In this way, we have given an explanation for and confirmed some of the experimental insights of the early relational RL work (Džeroski et al., 2001). Further work could address combining ReBel with other types of value-based methods, extending the representation language, efficient data structures, complexity analysis, and employing other learning algorithms to compress value functions.
The authors hope that the theoretical insights, as well as the algorithm developed in this paper, will be helpful in advancing the field of relational RL as well as contribute to an improved understanding of the problems involved.

Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful comments. This research was supported by the European Union IST programme, contract no. FP , APrIL II. Martijn Van Otterlo was supported by a Marie Curie fellowship at DAISY, HPMT-CT

References

Bellman, R. E. (1957). Dynamic programming. Princeton, New Jersey: Princeton University Press.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. J. Art. Intel. Res., 11.

Boutilier, C., Reiter, R., & Price, B. (2001). Symbolic dynamic programming for first-order MDPs. Proc. of IJCAI 01.

Buntine, W. (1988). Generalized subsumption and its applications to induction and redundancy. Artificial Intelligence, 36.

Dietterich, T. G., & Flann, N. S. (1997). Explanation-based learning and reinforcement learning: A unified view. Machine Learning, 28.

Driessens, K., & Ramon, J. (2003). Relational instance based regression for relational reinforcement learning. Proc. of ICML.

Džeroski, S., De Raedt, L., & Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43.

Fern, A., Yoon, S., & Givan, R. (2003). Approximate policy iteration with a policy language bias. Proc. of NIPS 03.

Frühwirth, T. (1998). Theory and practice of constraint handling rules. Journal of Logic Programming, 37.

Gärtner, T., Driessens, K., & Ramon, J. (2003). Graph kernels and Gaussian processes for relational reinforcement learning. Proc. of ILP 03.

Guestrin, C., Koller, D., Gearhart, C., & Kanodia, N. (2003). Generalizing plans to new environments in relational MDPs. Proc. of IJCAI 03.

Kersting, K., & De Raedt, L. (2003). Logical Markov decision programs. Proc. of the IJCAI 03 Workshop on Learning Statistical Models of Relational Data.

Lecoeuche, R. (2001). Learning optimal dialogue management rules by using reinforcement learning and inductive logic programming. Proc. of the North American Chapter of the Association for Computational Linguistics (NAACL). Pittsburgh.

Nienhuys-Cheng, S.-H., & de Wolf, R. (1997). Foundations of inductive logic programming. Lecture Notes in Artificial Intelligence. Springer-Verlag.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: The MIT Press.

Van Otterlo, M. (2004). Reinforcement learning for relational MDPs. Machine Learning Conference of Belgium and the Netherlands (BeNeLearn 04).
