Autonomous Learning of High-Level States and Actions in Continuous Environments


Jonathan Mugan and Benjamin Kuipers, Fellow, IEEE

J. Mugan is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: jmugan@cs.cmu.edu). B. Kuipers is with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: kuipers@umich.edu).

Abstract: How can an agent bootstrap up from a pixel-level representation to autonomously learn high-level states and actions using only domain-general knowledge? In this paper we assume that the learning agent has a set of continuous variables describing the environment. There exist methods for learning models of the environment, and there also exist methods for planning. However, for autonomous learning, these methods have been used almost exclusively in discrete environments. We propose attacking the problem of learning high-level states and actions in continuous environments by using a qualitative representation to bridge the gap between continuous and discrete variable representations. In this approach, the agent begins with a broad discretization and initially can only tell whether the value of each variable is increasing, decreasing, or remaining steady. The agent then simultaneously learns a qualitative representation (discretization) and a set of predictive models of the environment. These models are converted into plans to perform actions. The agent then uses those learned actions to explore the environment. The method is evaluated using a simulated robot with realistic physics. The robot is sitting at a table that contains a block and other distractor objects that are out of reach. The agent autonomously explores the environment without being given a task. After learning, the agent is given various tasks to determine if it learned the necessary states and actions to complete them. The results show that the agent was able to use this method to autonomously learn to perform the tasks.

Index Terms: unsupervised learning, reinforcement learning, qualitative reasoning, intrinsic motivation, active learning.

I. INTRODUCTION

We would like to build intelligent agents that can autonomously learn to predict and control the environment using only domain-general knowledge. Such agents could simply be placed in an environment, and they would learn it. After they had learned the environment, the agents could be directed to achieve specified goals. The intelligence of the agents would free engineers from having to design new agents for each environment. These agents would be flexible and robust because they would be able to adapt to unanticipated aspects of the environment.

Designing such agents is a difficult problem because the environment can be almost infinitely complex. This complexity means that an agent with limited resources cannot represent and reason about the environment without describing it in a simpler form. And possibly more importantly, the complexity of the environment means that it is a challenge to generalize from experience, since each experience will be in some respect different.

A solution to the difficulty of learning in a complex environment is for the agent to autonomously learn useful and appropriate abstractions. There are approaches that do impressive learning in specific scenarios [1]–[3]. But most autonomous learning methods require a discrete representation and a given set of discrete actions [4]–[8]. Our goal is to enable autonomous learning to be done in a continuous environment and to enable an agent to learn its first actions.

The Problem: The world is continuous and infinitely complex (or effectively so). The dynamics of the world are also continuous, and add further complexity.
An agent acts in that world by sending continuous low-level motor signals. Our goal is to show how such an agent can learn a hierarchy of actions for acting effectively in such a world.

We simplify the problem of perception for our agent. Instead of dealing with continuous high-bandwidth pixel-level perception, we assume that the agent has trackers for a small number of moving objects (including its own body parts) within an otherwise static environment. These trackers provide the agent with a perceptual stream consisting of a relatively small number of continuous variables. Learning such trackers and the models that provide dynamically updated model parameters is done by methods outside the scope of this project, such as the Object Semantic Hierarchy [9].

Our solution to the problem of learning a hierarchy of actions in continuous environments is the Qualitative Learner of Action and Perception, QLAP.

A. The Qualitative Learner of Action and Perception, QLAP

QLAP has two major processes:

(1) Modeling Events: An event is a qualitative change in the value of some state variable (Section II). QLAP starts by identifying contingencies between events: situations where the observation of one event E1 means that another event E2 is more likely to occur soon after. QLAP searches for improved descriptions of the conditions for contingencies, trying to find sufficiently deterministic simple models that predict the results of actions (Section III). This is done by introducing new distinctions into the qualitative abstractions of the domains of particular variables, and by identifying additional dependencies on context variables, making predictions more reliable. These improved models of the forward dynamics of the environment are represented as dynamic Bayesian networks (DBNs). This process is depicted in Figure 1.

(2) Modeling Actions: Define an action to be the occurrence of a particular event: a particular qualitative change to some

variable. A reliable DBN model that leads to that event can be transformed (by familiar RL methods) into a plan for accomplishing that consequent event by means of achieving antecedent events, and thus for carrying out that action. Since the plan thus consists of embedded actions, we get a natural hierarchical structure on actions, via the plans available to carry them out (Section IV).

A hierarchy of actions and plans must ground out in motor actions: qualitative changes to the values of motor variables that the agent can carry out simply by willing them. QLAP ensures that its higher-level actions and plans have this grounding in motor actions because it learns everything autonomously, starting from random exploration of the consequences of setting its motor variables (i.e., motor babbling). The process of modeling actions is depicted in Figure 2.

To perform exploration and learning, the two processes of Modeling Events and Modeling Actions run continuously as the agent acts in the world (Section V). Those actions can be driven initially by random motor babbling. However, they can be driven by autonomous exploration through various intrinsic motivation drives. They can also be driven by explicit coaching, or by play posing various goals and making plans to achieve them. (See [10] for a video of QLAP.)

Fig. 1: Perception in QLAP. Images over time are reduced to continuous variables; these are discretized into discrete variables, from which models of the environment are learned, with feedback from the models back to the discretization.

Fig. 2: Actions in QLAP. (a) Models are converted into plans. (b) Plans are different ways to do actions; each plan carries a Q-table Q_i(s, a) and a policy π_i(s) = arg max_a Q_i(s, a). (c) Actions and plans are put together into a hierarchy that grounds out in low-level motor commands.

B. Contributions

To the best of our knowledge, QLAP is the only algorithm that learns states and hierarchical actions through autonomous exploration in continuous, dynamic environments with continuous motor commands. For the field of Autonomous Mental Development, QLAP provides a method for a developing agent to learn its first temporally-extended actions and to learn more complex actions on top of previously-learned actions.

The related field of reinforcement learning is about enabling an agent to learn from experience to maximize a reward signal [11]. QLAP addresses three challenges in reinforcement learning: (1) continuous states and actions, (2) automatic hierarchy construction, and (3) automatic generation of reinforcement learning problems. Continuous states and actions are a challenge because it is hard to know how to generalize from experience, since no two states are exactly alike. There exist many function approximation methods, and other methods that use real values [3], [12]–[16], but QLAP provides a method for discretizing the state and action space so that the discretization corresponds to the natural joints in the environment. Finding these natural joints allows QLAP to use simple learning algorithms while still representing the complexity of the environment. Learning of hierarchies can enable an agent to explore the space more effectively because it can aggregate smaller actions into larger ones. QLAP creates a hierarchical set of actions from continuous motor variables. Currently, most reinforcement learning problems must be designed by the human experimenter. QLAP autonomously creates reinforcement learning problems as part of its developmental progression.
Section II discusses the qualitative representation used to discretize the continuous data. Section III explains how QLAP learns models of events, and Section IV describes how models of events are converted into actions. Section V discusses how the QLAP agent explores and learns about the world. Experimental results are presented in Section VI, and the paper concludes with a discussion (Section VII), an overview of related work (Section VIII), and a summary (Section IX).

II. QUALITATIVE REPRESENTATION

A qualitative representation allows an agent to bridge the gap between continuous and discrete values. It does this by encoding the values of continuous variables relative to known landmarks [17]. The value of a continuous variable can be described qualitatively. Its qualitative magnitude is described as equal to a landmark value or as in the open interval between two adjacent landmark values. Its qualitative direction of change can be increasing, steady, or decreasing. Because a landmark is intended to represent an important value of a variable where the system behavior may change qualitatively, a qualitative representation allows the agent to generalize and to focus on important events.

A. Landmarks

A landmark is a symbolic name for a point on the number line. Using landmarks, QLAP can convert a continuous variable ṽ with an infinite number of values into a qualitative variable v with a finite set of qualitative values Q(v) called a quantity space [17]. A quantity space Q(v) = L(v) ∪ I(v), where L(v) = {v1, …, vn} is a totally ordered set of landmark values, and I(v) = {(−∞, v1), (v1, v2), …, (vn, +∞)} is the set of mutually disjoint open intervals that L(v) defines in the real number line. A quantity space with two landmarks might be described by (v1, v2), which implies five distinct qualitative values, Q(v) = {(−∞, v1), v1, (v1, v2), v2, (v2, +∞)}. This is shown in Figure 3.

Fig. 3: Landmarks divide the number line into a discrete set of qualitative values.

QLAP perceives the world through a set of continuous input variables and affects the world through a set of continuous motor variables. (QLAP can also handle discrete, nominal input variables; see [18] for details.) For each continuous input variable ṽ, two qualitative variables are created: a discrete variable v(t) that represents the qualitative magnitude of ṽ(t), and a discrete variable v̇(t) that represents the qualitative direction of change of ṽ(t). Also, a qualitative variable u(t) is created for each continuous motor variable ũ. (When the distinction between motor variables and non-motor variables is unimportant, we refer to the variable as v.) The result of these transformations is three types of qualitative variables that the agent can use to affect and reason about the world: motor variables, magnitude variables, and direction-of-change variables. The properties of these variables are shown in Table I.

TABLE I: Types of Qualitative Variables

  Type of variable       Initial landmarks   Learn landmarks?
  motor                  {0}                 yes
  magnitude              {}                  yes
  direction of change    {0}                 no

Each direction-of-change variable v̇ has a single intrinsic landmark at 0, so its quantity space is Q(v̇) = {(−∞, 0), 0, (0, +∞)}, which can be abbreviated as Q(v̇) = {[−], [0], [+]}. Motor variables are also given an initial landmark at 0. Magnitude variables initially have no landmarks, treating zero as just another point on the number line. Initially, when the agent knows of no meaningful qualitative distinctions among values for ṽ(t), we describe the quantity space with the empty list of landmarks, {}, as Q(v) = {(−∞, +∞)}. However, the agent can learn new landmarks for magnitude and motor variables. Each additional landmark allows the agent to perceive or affect the world at a finer granularity.

B. Events

If a is a qualitative value of a qualitative variable A, meaning a ∈ Q(A), then the event A→t a is defined by A(t−1) ≠ a and A(t) = a. That is, an event takes place when a discrete variable A changes to the value a at time t, from some other value. We will often drop the t and describe this simply as A→a. We will also refer to an event as E when the variable and value involved are not important, and we use the notation E(t) to indicate that event E occurs at time t.

For magnitude variables, A→t a is really two possible events, depending on the direction that the value is coming from: one event for reaching a from below (A(t−1) < a) and one for reaching a from above (A(t−1) > a). However, for ease of notation, we generally refer to the event as A→t a. We also say that event A→t a is satisfied if A(t) = a.
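To make the representation concrete before turning to model learning, the following minimal sketch shows one way to compute qualitative magnitudes, directions of change, and events as defined above. It is our illustration under simplified assumptions (ad hoc tolerances, hypothetical variable names), not the QLAP implementation.

```python
from bisect import bisect_left

def qual_magnitude(x, landmarks, eps=1e-6):
    """Map a real value to a qualitative magnitude: a landmark index
    or an open interval between adjacent landmarks."""
    for i, lm in enumerate(landmarks):
        if abs(x - lm) <= eps:
            return ('at', i)              # equal to landmark v_i
    return ('in', bisect_left(landmarks, x))  # open interval number i

def qual_direction(dx, eps=1e-6):
    """Direction of change: [-], [0], or [+]."""
    if dx > eps:
        return '[+]'
    if dx < -eps:
        return '[-]'
    return '[0]'

def events(prev_q, curr_q):
    """An event A->a occurs when a qualitative variable changes value."""
    return [(var, q) for (var, q) in curr_q.items()
            if prev_q.get(var) != q]

# Example: one variable with landmarks at 0 and 1.5.
lms = [0.0, 1.5]
prev = {'h_x': qual_magnitude(-0.2, lms)}
curr = {'h_x': qual_magnitude(0.0, lms)}
print(events(prev, curr))   # h_x reaches the landmark at 0
```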
III. MODELING EVENTS

There are many methods for learning predictive models in continuous environments. Such models have been learned, for example, using regression [3], [12]–[14], neural networks [15], and Gaussian processes [16]. But as described in the introduction, we want to break up the environment and represent it using a qualitative representation. In a discretized environment, dynamic Bayesian networks (DBNs) are a convenient way to encode predictive models.

Most work on learning DBNs learns a network to predict each variable at the next timestep for each action, e.g. [19]–[21]. However, QLAP learns the set of actions itself, and QLAP works in environments where events may take more than one timestep. QLAP learns two different types of DBN models. The first type predicts events on change variables (change DBNs). The second type models the reaching of magnitude values (magnitude DBNs); see Section III-E.

To learn change DBNs, QLAP uses a novel DBN learning algorithm. Given the current discretization, QLAP tracks statistics on all pairs of events to search for contingencies (Section III-A), where an antecedent event leads to a consequent event. When such a contingency is found, QLAP converts it to a DBN with the antecedent event as the parent variable and the consequent event as the child variable. QLAP adds context variables to the DBN one at a time as they make the DBN more reliable. For each DBN, QLAP also searches for a new discretization that will make the DBN more reliable. This new discretization then creates new possible events and allows new DBNs to be learned. This method is outlined in Figure 4.

A. Searching for Contingencies

The search for change DBNs begins with a search for contingencies. A contingency represents the knowledge that if the antecedent event occurs, then the consequent event will soon occur. An example would be: if you flip a light switch, then the light will go off. QLAP searches for contingencies by tracking statistics on pairs of events E1, E2 and extracting those pairs into a contingency where the occurrence of event E1 indicates that event E2 is more likely to soon occur than it would be otherwise.

Fig. 4: (a) Do a pairwise search for contingencies that use one event to predict another. The antecedent events are along the y-axis, and the consequent events are along the x-axis. The color indicates the probability that the consequent event will soon follow the antecedent event (lighter corresponds to higher probability). When the probability of the consequent event is sufficiently high, the pair is converted into a contingency (yellow). (b) A found contingency is converted to a DBN with antecedent event event(t, X→x) and consequent event soon(t, Y→y). (c) Once a DBN is created, context variables V1(t), …, Vn(t) and landmarks are added to make it more reliable. (d) The DBN creates a self-supervised learning problem to predict when the consequent event will follow the antecedent event. This allows new landmarks to be found; those landmarks create new events for the pairwise search. (Best viewed in color.)

1) Definition: To define contingencies in a continuous environment, we have to discretize both variable values and time. To discretize variable values, we create a special Boolean variable event(t, X→x) that is true if event X→t x occurs

  event(t, X→x) ≡ X→t x    (1)

To discretize time, we use a time window. We define the Boolean variable soon(t, Y→y) that is true if event Y→y occurs within a time window of length k

  soon(t, Y→y) ≡ ∃t′ [t ≤ t′ < t + k ∧ event(t′, Y→y)]    (2)

(The length of the time window k is learned by noting how long it takes for motor commands to be observed as changes in the world; see [18].) With these variables, we define a contingency as

  event(t, X→x) ⇒ soon(t, Y→y)    (3)

which represents the proposition that if the antecedent event X→t x occurs, then the consequent event Y→y will occur within k timesteps.

2) The Pairwise Search: QLAP looks for contingencies using a pairwise search by tracking statistics on pairs of events X→x and Y→y to determine if the pair is a contingency. QLAP learns a contingency E1 ⇒ E2 if, when the event E1 occurs, the event E2 is more likely to soon occur than it would have been otherwise

  Pr(soon(t, E2) | E1(t)) > Pr(soon(t, E2))    (4)

where Pr(soon(t, E2)) is the probability of event E2 occurring within a random window of k timesteps. Specifically, the contingency is learned when

  Pr(soon(t, E2) | E1(t)) − Pr(soon(t, E2)) > θ_pen = 0.05    (5)

QLAP performs this search considering all pairs of events, excluding those where
1) the consequent event is on a magnitude variable (since these are handled by the models on magnitude variables, as discussed in Section III-E);
2) the consequent event is on a direction-of-change variable moving to the landmark value [0] (since we want to predict changes that result in moving towards or away from landmarks);
3) the magnitude variable corresponding to the direction-of-change variable of the consequent event matches the magnitude variable of the antecedent event (since we want to learn how the values of variables are affected by other variables).
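The statistics behind Eqs. (4) and (5) can be sketched as follows. This is an illustration only: the event names are hypothetical, and the bookkeeping (batch processing of a recorded event stream) is simplified relative to the incremental statistics QLAP tracks.

```python
from collections import defaultdict

class ContingencyTracker:
    """Track Pr(soon(E2)) and Pr(soon(E2) | E1) from an event stream,
    as in Eqs. (4)-(5). `window` plays the role of k."""
    def __init__(self, window=10, theta_pen=0.05):
        self.window, self.theta_pen = window, theta_pen
        self.base = defaultdict(lambda: [0, 0])   # E2 -> [hits, trials]
        self.cond = defaultdict(lambda: [0, 0])   # (E1, E2) -> [hits, trials]

    def observe(self, timeline, pairs):
        """timeline: list of sets of events, one set per timestep."""
        for t in range(len(timeline) - self.window):
            soon = set().union(*timeline[t:t + self.window])
            for e1, e2 in pairs:
                self.base[e2][1] += 1
                self.base[e2][0] += e2 in soon
                if e1 in timeline[t]:
                    self.cond[(e1, e2)][1] += 1
                    self.cond[(e1, e2)][0] += e2 in soon

    def is_contingency(self, e1, e2):
        h, n = self.cond[(e1, e2)]
        bh, bn = self.base[e2]
        if n == 0 or bn == 0:
            return False
        return h / n - bh / bn > self.theta_pen

# Toy stream: 'A->a' is reliably followed by 'B->b' within 3 steps.
tl = [{'A->a'}, set(), {'B->b'}, set(), {'A->a'}, {'B->b'},
      set(), set(), set(), set(), set(), set()]
tr = ContingencyTracker(window=3)
tr.observe(tl, [('A->a', 'B->b')])
print(tr.is_contingency('A->a', 'B->b'))   # True
```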
B. Converting Contingencies to DBNs

In this section we describe how QLAP converts a contingency of the form event(t, X→x) ⇒ soon(t, Y→y) into a dynamic Bayesian network. A dynamic Bayesian network (DBN) is a compact way to describe a probability distribution over time-series data. Dynamic Bayesian networks allow QLAP to identify situations in which the contingency will be reliable.

1) Adding Context: The consequent event may only follow the antecedent event in certain contexts, so we also want to learn a set of qualitative context variables C that predict when event Y→y will soon follow X→x. This can be represented as a DBN r of the form

  r = ⟨C : event(t, X→x) ⇒ soon(t, Y→y)⟩    (6)

which we abbreviate to

  r = ⟨C : X→x ⇒ Y→y⟩    (7)

In this notation, event E1 = X→x is the antecedent event, and event E2 = Y→y is the consequent event. We can further abbreviate this QLAP DBN r as

  r = ⟨C : E1 ⇒ E2⟩    (8)

Figure 5 shows the correspondence between this notation and standard DBN notation. Because we consider only cases where event E1 occurs, we can treat the conditional probability table (CPT) of DBN r as defined over the probability that event Y→y will soon follow event X→x for each qualitative value in context C. If the antecedent event does not occur, then the CPT does not define the probability for the consequent event occurring. If the antecedent event occurs and the consequent event does follow soon after, we say that the DBN succeeds. Likewise, if the antecedent event occurs and the consequent event does not follow soon after, we say that the DBN fails. These models are referred to as dynamic Bayesian networks and not simply Bayesian networks because we are using them to model a dynamic system. An example of a DBN learned by QLAP is shown in Figure 6.

The set C = {v1, …, vn} consists of the variables in the conditional probability table (CPT) of the DBN r = ⟨C : E1 ⇒ E2⟩. The CPT is defined over the product space

  Q(C) = Q(v1) × Q(v2) × ⋯ × Q(vn)    (9)

Fig. 5: Correspondence between QLAP DBN notation and traditional graphical DBN notation. (a) QLAP notation of a DBN: r = ⟨C : E1 ⇒ E2⟩ with C = {v1, v2, …, vn}. Context C consists of a set of qualitative variables. Event E1 is an antecedent event and event E2 is a consequent event. (b) Traditional graphical notation. A Boolean parent variable event(t, E1) is true if event E1 occurs at time t. A Boolean child variable soon(t, E2) is true if event E2 occurs within k timesteps of t. The other parent variables are the context variables in C. The conditional probability table (CPT) gives the probability of soon(t, E2) for each value of its parents. For all elements of the CPT where event(t, E1) is false, the probability is undefined. The remaining probabilities are learned through experience.

Fig. 6: An example DBN. This DBN says that if the motor value of u_x becomes greater than 300, and the location of the hand, h_x, is in the range −1.5 ≤ h_x < 1.5, then the variable ḣ_x will most likely soon become [+] (the hand will move to the right). (The limits of movement of h_x are −1.5 and +1.5, and so the prior of 0.5 dominates outside of that range.)

Since C is a subset of the variables available to the agent, Q(C) is an abstraction of the overall state space S

  Q(C) ⊑ Q(v1) × Q(v2) × ⋯ × Q(vm) = S    (10)

where m ≥ n is the number of variables available to the agent. (In our experiments, we limit n to 2.)

2) Notation of DBNs: We define the reliability for q ∈ Q(C) for DBN r as

  rel(r, q) = Pr(soon(t, E2) | E1(t), q)    (11)

which is the probability of success for the DBN for the value q ∈ Q(C). These probabilities come from the CPT and are calculated using observed counts. The best reliability of a DBN gives the highest probability of success in any context state. We define the best reliability brel(r) of DBN r as

  brel(r) = max_{q ∈ Q(C)} rel(r, q)    (12)

(We require 5 actual successes for a q ∈ Q(C) before it can be considered for best reliability.) By increasing the best reliability brel(r) we increase the reliability of DBN r. And we say that a DBN r is sufficiently reliable if at any time brel(r) > θ_SR.

The entropy of a DBN r = ⟨C : E1 ⇒ E2⟩ is a measure of how well the context C predicts that event E2 will soon follow event E1. Since we only consider the timesteps where event E1 occurs, we define the entropy H(r) of DBN r as

  H(r) = Σ_{q ∈ Q(C)} H(soon(t, E2) | q, E1(t)) Pr(q | E1(t))    (13)

By decreasing the entropy H(r) of DBN r, we increase the determinism of DBN r.

C. Adding Context Variables

QLAP iteratively adds context variables to DBNs to make them more reliable and deterministic. This hillclimbing process of adding one context variable at a time is inspired by the marginal attribution process in Drescher's [5] schema mechanism. By only considering the next variable to improve the DBN, marginal attribution decreases the search space. Drescher spun off an entirely new model each time a context variable was added, and this resulted in a proliferation of models. To eliminate this proliferation of models, we instead modify the model by changing the context.

QLAP initially hillclimbs on best reliability brel(r) because the DBN models will eventually be used for planning. In our experiments, we found this to be necessary to make good plans, because we want to find some context state in which the model is reliable. This allows the agent to set a subgoal of getting to that reliable context state. However, we also want the model to be deterministic, so after a model is sufficiently reliable, QLAP hillclimbs on reduction of entropy H(r).
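A minimal sketch of how rel(r, q), brel(r), and H(r) in Eqs. (11)-(13) could be computed from per-context success counts follows; the class name and the minimum-success threshold are our own illustration, not the authors' code.

```python
import math
from collections import defaultdict

class ChangeDBNStats:
    """Success/failure counts per context value q, giving the CPT of
    r = <C : E1 => E2>; rel, brel, and H follow Eqs. (11)-(13)."""
    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])   # q -> [successes, trials]

    def record(self, q, success):
        self.counts[q][0] += int(success)
        self.counts[q][1] += 1

    def rel(self, q):
        s, n = self.counts[q]
        return s / n if n else 0.0

    def brel(self, min_successes=5):
        cands = [self.rel(q) for q, (s, n) in self.counts.items()
                 if s >= min_successes]
        return max(cands, default=0.0)

    def entropy(self):
        total = sum(n for _, n in self.counts.values())
        if total == 0:
            return 0.0
        h = 0.0
        for _, (s, n) in self.counts.items():
            p = s / n
            hq = 0.0 if p in (0.0, 1.0) else \
                -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
            h += (n / total) * hq               # weighted by Pr(q | E1)
        return h
```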
D. Learning New Landmarks

Learning new landmarks allows the agent to perceive and represent the world at a higher resolution. This increase in resolution allows existing models to be made more reliable and allows new models to be learned. QLAP has two mechanisms for learning landmarks. The first learns a new landmark to make an existing DBN more reliable. The second learns a new landmark that predicts the occurrence of an event.

1) New Landmarks on Existing DBNs: QLAP learns new landmarks based on previously-learned models (DBNs). For any particular DBN, predicting when the consequent event will follow the antecedent event is a supervised learning problem, because once the antecedent event occurs, the environment will determine if the consequent event will occur. QLAP takes advantage of this supervisory signal to learn new landmarks that improve the predictive ability of DBNs. For each DBN r = ⟨C : E1 ⇒ E2⟩, QLAP searches for a landmark on each magnitude and motor variable v in each open interval q ∈ Q(v). Finding this landmark is done using the information-theoretic method of Fayyad and Irani [22]. The best candidate is then adopted if it has sufficient information gain, the product of the information gain and the probability of being in that interval Pr(v = q) is sufficient, and the adopted landmark would improve DBN r.
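The core of this landmark search is a single information-gain cut on one variable. The following sketch illustrates a Fayyad-and-Irani-style cut under simplified assumptions (binary success labels, no MDL stopping criterion); it is not the authors' code.

```python
import math

def _entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_cut(samples):
    """samples: (value_of_v, dbn_succeeded) pairs collected when the
    antecedent event occurred. Returns (cut, info_gain) maximizing
    information gain; the cut is a candidate landmark."""
    samples = sorted(samples)
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    base = _entropy(ys)
    best = (None, 0.0)
    for i in range(1, len(samples)):
        if xs[i] == xs[i - 1]:
            continue                     # cut only between distinct values
        left, right = ys[:i], ys[i:]
        gain = base - (len(left) * _entropy(left)
                       + len(right) * _entropy(right)) / len(ys)
        if gain > best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, gain)
    return best

data = [(0.2, False), (0.4, False), (1.6, True), (2.0, True)]
print(best_cut(data))   # cut at 1.0 with information gain 1.0
```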

2) New Landmarks to Predict Events: QLAP also learns new landmarks to predict events. QLAP needs this second landmark-learning process because some events may not be preceded by another known event. An example of this is when an object moves because it is hit by another object. In this case, the agent needs to learn that a distance of 0 between objects is significant, because it causes one of the objects to move. QLAP learns landmarks to predict events by storing histograms of observed real values of variables. If a landmark v* is found that co-occurs with some event E, then the agent can predict the occurrence of event E by learning a DBN of the form ⟨C : v→v* ⇒ E⟩. Using these histograms, QLAP searches for such a landmark preceding an event E by looking for a variable ṽ such that the distribution of ṽ is significantly different just before the event E than otherwise.

E. Magnitude DBN Models

A magnitude value can be less than, greater than, or equal to a qualitative value. We want to have models for a variable ṽ reaching a qualitative value q. Intuitively, if we want v = q and currently v(t) < q, then we need to set v̇ = [+]. This section describes how this process is modeled. For each magnitude variable v and each qualitative value q ∈ Q(v), QLAP creates two models: one that corresponds to approaching the value v = q from below on the number line, and another that corresponds to approaching v = q from above. For each magnitude variable Y and each value y ∈ Q(Y), these models can be written as

  r⁺ = ⟨C : Ẏ→[+] ⇒ Y→y⟩    (14)
  r⁻ = ⟨C : Ẏ→[−] ⇒ Y→y⟩    (15)

DBN r⁺ means that if Y(t) < y and Ẏ = [+], then eventually event Y→y will occur (DBN r⁻ is analogous in this discussion). As the notation suggests, we can treat Ẏ→[+] ⇒ Y→y similarly to how we treat a contingency, and we can learn context variables for when this model will be reliable. These models are based on the test-operate-test-exit (TOTE) models of Miller et al. [23].

Magnitude DBNs do not use the soon predicate, because how long it takes to reach a qualitative value is determined by how far away the variable is from that value. Instead, statistics are gathered on magnitude DBNs when the agent sets Ẏ = [+] to bring about Y→y. DBN r⁺ is successful if Y→y occurs while Ẏ = [+], and it fails if the agent is unable to maintain Ẏ = [+] long enough to bring about event Y→y. Like change DBNs, magnitude DBNs are used in planning, as described in Section IV. (While context variables are learned on magnitude DBNs, experiments showed that landmarks learned from these models were not useful to the agent, so these models are not used for learning landmarks in QLAP.)

IV. MODELING ACTIONS

QLAP uses the learned DBN models to create plans for performing actions. There are two broad planning frameworks within AI: STRIPS-based goal regression [24], and Markov Decision Process (MDP) planning [25]. Goal regression has the advantage of working well when only some of the variables are relevant, and MDP planning has the advantage of providing a principled framework for probabilistic actions [26]. Planning in QLAP was designed to exploit the best of both frameworks. A broad principle of QLAP is that the agent should learn a factored model of the environment to make learning and planning more tractable. QLAP uses MDP planning to plan within each learned factor and uses goal regression to stitch the factors together.

QLAP defines an action to achieve each qualitative value of each variable. QLAP creates plans from the learned models, where each plan is a different way to perform an action. An action can have zero, one, or many plans. If an action has no plans it cannot be performed. If an action has multiple plans, then the action can be performed in more than one way. The actions that can be called by each plan are QLAP actions to bring about qualitative events.
This stitches the plans together and leads to an action hierarchy, because the plans to perform one QLAP action call other QLAP actions as if they were primitive actions. This hierarchical action network encodes all of the learned skills of the agent. See Figure 7.

Each plan is a policy of an MDP created from a DBN. The policy for each plan is learned using a combination of model-based and model-free methods. QLAP uses MDP planning instead of only goal regression because the transitions are probabilistic. Since the variables in the state space of the plan come only from the DBN model, we minimize the problem of state explosion common to MDP planning. Additionally, we use MDP planning instead of a more specialized planning algorithm such as RRT [27], because RRT and similar algorithms are designed specifically for moving in space, while the actions taken by a QLAP plan may be arbitrary, such as hitting the block off the table.

In this section, we define actions and plans in QLAP. We discuss how change and magnitude DBNs are converted into plans. We then discuss how QLAP can learn when variables need to be added to the state space of a plan, and we conclude with a description of how actions are performed in QLAP.

A. Actions and Plans in QLAP

Actions are how the QLAP agent brings about changes in the world. An action a(v, q) is created for each combination of qualitative variable v and qualitative value q ∈ Q(v). An action a(v, q) is called by the agent and is said to be successful if v = q when it terminates. Action a(v, q) fails if it terminates with v ≠ q. Statistics are tracked on the reliability of actions. The reliability of an action a is denoted by rel(a), which gives the probability of a succeeding if it is called.

When an action is called, the action chooses a plan to carry it out. Each plan implements only one action, but an action can have multiple different plans, where each plan is a different way to perform the action. This gives QLAP the advantage of being able to use different plans in different situations, instead of having one big plan that must cover all situations. As with actions, we say that a plan associated with action a(v, q) is successful if it terminates with v = q and fails if it terminates with v ≠ q.

Fig. 7: Planning in QLAP. (a) QLAP defines an action for each qualitative value of each variable; the action shown brings variable Y to value y. (b) Each action can have multiple plans, which are different ways to perform the action. Each plan comes from an MDP M_i. The value Q_i(s, a) is the cumulative, expected, discounted reward of choosing action a in state s in MDP M_i; given the function Q_i, the policy π_i(s) = arg max_a Q_i(s, a) can be computed. (c) Plans are created from models. The state space for an MDP is the cross product of the values of X, Y, Z, and W from the model (although more variables can be added if needed). (d) The actions for each plan are QLAP actions to move to different locations in the state space of the MDP. This is reminiscent of goal regression. Here, one of the actions for M_i calls the QLAP action to bring about X→x; this link results from event X→x being the antecedent event of the DBN model to bring about event Y→y.

Models learned by QLAP can result in plans. Each plan is represented as a policy π_i over an MDP M_i = ⟨S_i, A_i, T_i, R_i⟩. A Markov Decision Process (MDP) is a framework for temporal decision making [25]. Since QLAP learns multiple models of the environment, QLAP learns multiple MDPs. And as with models, each MDP represents a small part of the environment. The actions used within each MDP are QLAP actions. And since these actions, in turn, use plans that call other QLAP actions, the actions and plans of QLAP are tied together, and planning takes on the flavor of goal regression.

We can think of this policy π_i as being part of an option o_i = ⟨I_i, π_i, β_i⟩, where I_i is a set of initiation states, π_i is the policy, and β_i is a set of termination states or a termination function [28]. An option is like a subroutine that can be called to perform a task. Options in QLAP follow this pattern, except that π_i is a policy over QLAP actions instead of being over primitive actions or other options. We use the terminology of a plan being an option because options are common in the literature, and because QLAP takes advantage of the non-Markov termination function β_i, which can terminate after a fixed number of timesteps. However, plans in QLAP differ from options philosophically, because options are usually used with the assumption that there is some underlying large MDP. QLAP assumes no large, underlying MDP, but rather creates many little, independent MDPs that are connected by actions. Each small MDP M_i created by QLAP has one policy π_i.

B. Converting Change DBNs to Plans

When a DBN of the form r_i = ⟨C : X→x ⇒ Y→y⟩ becomes sufficiently reliable, it is converted into a plan to bring about Y→y. (There are some restrictions on this to limit resource usage. For example, actions that can already be reliably performed do not get more plans, the agent must be able to bring about the antecedent event X→x with sufficient reliability, and each action may have at most three plans. See [18].) This plan can then be called by the action a(Y, y). The plan is based on an MDP M_i. In this section, we first describe how QLAP creates MDP M_i from DBN r_i. We then describe how QLAP learns a policy for this MDP. And finally, we describe how this policy is mapped to an option.

1) Creating the MDP from the DBN: QLAP converts a DBN of the form r_i = ⟨C : X→x ⇒ Y→y⟩ to an MDP of the form M_i = ⟨S_i, A_i, T_i, R_i⟩. The state space S_i consists of the Cartesian product of the values of the variables in DBN r_i. The actions in A_i are the QLAP actions to bring the agent to the different states of S_i. The transition function T_i comes from the CPT of r_i and the reliability rel(a) of the different actions a ∈ A_i. The reward function simply penalizes each action with a cost of 1 and gives a reward of 0 for reaching the goal of Y = y.
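As an illustration of the small MDPs this construction yields, the following sketch builds a toy transition model and solves it with value iteration, anticipating the first of the policy-learning methods described next. The states, actions, and probabilities are hypothetical stand-ins for qualitative values and QLAP actions; this is not the authors' code.

```python
def value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Generic value iteration over a small factored MDP, the kind QLAP
    builds from one DBN. T[(s, a)] is a list of (prob, s2); R(s, a, s2)
    is the reward. Returns the Q-table."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iters):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}
        for s in states:
            for a in actions:
                Q[(s, a)] = sum(p * (R(s, a, s2) + gamma * V[s2])
                                for p, s2 in T[(s, a)])
    return Q

# Toy two-state MDP standing in for "antecedent achieved" / "not yet";
# each action costs 1 and the goal state yields reward 0, as in the text.
states, actions = ['far', 'near'], ['approach', 'wait']
T = {('far', 'approach'):  [(0.8, 'near'), (0.2, 'far')],
     ('far', 'wait'):      [(1.0, 'far')],
     ('near', 'approach'): [(1.0, 'near')],
     ('near', 'wait'):     [(1.0, 'near')]}
R = lambda s, a, s2: 0.0 if s2 == 'near' else -1.0
Q = value_iteration(states, actions, T, R)
print(max(actions, key=lambda a: Q[('far', a)]))   # 'approach'
```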
2) Learning a Policy for the MDP: QLAP uses three different methods to learn the policy π_i for each MDP M_i. (1) QLAP uses the transition function T_i and the reward function R_i to learn the Q-table Q_i using dynamic programming with value iteration [29]. The Q-table value Q_i(s, a) represents the cumulative, expected, discounted reward of taking action a ∈ A_i in state s ∈ S_i. The policy π_i then follows directly from Q_i, because for each state s the agent can choose the action a that maximizes Q_i(s, a) (or it can choose some other action to explore the world). (2) As the agent further experiences the world, policy π_i is updated using the temporal difference learning method Sarsa(λ) [29]. (3) And as the agent gathers more statistics, its transition model may be improved, so QLAP also updates the model by occasionally running one loop of the dynamic programming.

3) Mapping the Policy to an Option: An option has the form o_i = ⟨I_i, π_i, β_i⟩. We have described how the policy π_i is learned. When an option o_i is created for DBN r_i = ⟨C : X→x ⇒ Y→y⟩, the set of initiation states I_i is the set of all states. The termination function β_i terminates option o_i when it succeeds (the consequent event occurs), when it exceeds resource constraints (300 timesteps, or 5 action calls), or when the agent gets stuck. The agent is considered stuck if none of the self variables (see Section V-C for a discussion of how the agent learns which variables are part of self) or variables in S_i change in 10 timesteps.

C. Converting Magnitude DBNs into Plans

As discussed in Section III, each qualitative value y ∈ Q(Y) on each magnitude variable Y has two DBNs

  r⁺ = ⟨C : Ẏ→[+] ⇒ Y→y⟩    (16)
  r⁻ = ⟨C : Ẏ→[−] ⇒ Y→y⟩    (17)

that correspond to achieving the event Y→y from below and above the value Y = y, respectively. Both of these DBN models are converted into plans to achieve Y→y. And as with change DBNs, each magnitude DBN r_i is converted into a plan based on an MDP M_i. For MDP M_i, the state space S_i, the set of available actions A_i, the transition function T_i, and the reward function R_i are computed similarly to the way they are for change plans. The result is that each action a(v, q) on a magnitude variable has two plans: one plan to perform the action when v < q, and another plan to perform the action when v > q. See [18] for further details.

D. Improving the State Space of Plans

The state space of a plan consists of the Cartesian product of the quantity spaces Q(v) of the variables in the model from which it was created. But what if there are variables that were not part of the model, but that are nonetheless necessary to successfully carry out the plan? To learn when new variables should be added to plans, QLAP keeps statistics on the reliability of each plan and uses those statistics to determine when a variable should be added.

1) Tracking Statistics on Plans: QLAP tracks statistics on plans the same way it does when learning models. For change DBN models, QLAP tracks statistics on the reliability of the contingency. For magnitude models, QLAP tracks statistics on the ability of a variable to reach a qualitative value if moving in that direction. For plans, QLAP tracks statistics on the agent's ability to successfully complete the plan when called. To track these statistics on the probability of a plan o being successful, QLAP creates a second-order model

  r_o = ⟨C_o : call(t, o) ⇒ succeeds(t, o)⟩    (18)

The child variable of second-order DBN r_o is succeeds(t, o), which is true if option o succeeds after being called at time t and is false otherwise. The parent variables of r_o are call(t, o) and the context variables in C_o. The Boolean variable call(t, o) is true when the option is called at time t and is false otherwise. When created, model r_o initially has an empty context, and context variables are added as they are for magnitude and change models. The notation for these models is the same as for magnitude and change models: QLAP computes rel(o), rel(o, s), and brel(o). A plan can therefore also become sufficiently reliable if at any time brel(o) > θ_SR.

2) Adding New Variables to the State Space: Second-order models allow the agent to identify other variables necessary for the success of an option o, because those variables will be added to its context. Each variable that is added to r_o is also added to the state space S_i of its associated MDP M_i. For example, for a plan created from model r_i = ⟨C : X→x ⇒ Y→y⟩, the state space S_i is updated so that

  S_i = Q(C_o) × Q(C) × Q(X) × Q(Y)    (19)

(variables in more than one of C_o, C, {X}, or {Y} are represented only once in S_i). For both magnitude and change options, an action a(v, q) where v ∈ C_o is treated the same way as v ∈ C.
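A sketch of the bookkeeping this implies: per-context success counts for the second-order model of Eq. (18), and the de-duplicated state space of Eq. (19). Names and data structures are illustrative, not from the QLAP code.

```python
from collections import defaultdict

class PlanStats:
    """Second-order model r_o = <C_o : call(o) => succeeds(o)> (Eq. 18):
    per-context success counts for a plan, usable for rel(o, s) and for
    deciding which variables to add to the plan's state space."""
    def __init__(self):
        self.by_context = defaultdict(lambda: [0, 0])  # q -> [succ, calls]

    def record(self, q, succeeded):
        self.by_context[q][0] += int(succeeded)
        self.by_context[q][1] += 1

    def rel(self, q):
        s, n = self.by_context[q]
        return s / n if n else 0.0

def augmented_state_space(C_o, C, X, Y):
    """Eq. (19): variables appearing in several of C_o, C, {X}, {Y}
    are represented only once."""
    seen, out = set(), []
    for v in list(C_o) + list(C) + [X, Y]:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

print(augmented_state_space(['h_x'], ['h_x', 'h_y'], 'u_x', 'h_x'))
# ['h_x', 'h_y', 'u_x']
```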
E. Performing Actions

QLAP actions are performed using plans, and these plans call other QLAP actions. This leads to a hierarchy of plans and actions.

1) Calling and Processing Actions: When an action is called, it chooses a plan and then starts executing the policy of that chosen plan. Executing that policy results in more QLAP actions being called, and this process continues until a motor action is reached. When an action a(u, q) is called on a motor variable u, QLAP sends a random motor value within the range covered by the qualitative value u = q to the body. This hierarchical structure of actions and plans means that multiple actions will be performed simultaneously. Each plan only keeps track of the action it is currently performing. When that action terminates, the next action is called according to the policy. So as the initial action called by the agent is being processed, the path between that initial action and motor actions continually changes.

2) Terminating Actions: An action a(v, q) terminates if v = q, in which case it succeeds. It also terminates if it fails. An action fails if (a) it has no plans, or (b) for every plan for this action, the action to bring about the antecedent event of the plan is already in the call list, or (c) its chosen plan fails. Similar to an action, a plan to bring about v = q terminates if v = q, in which case it succeeds. It also terminates if it fails. A plan to bring about v = q fails if (a) the termination function β is triggered by resource constraints, or (b) there is no applicable action in the current state, or (c) the action chosen by the policy is already in the call list, or (d) the action chosen by the policy immediately fails when it is called.
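The calling-and-termination discipline above can be sketched as a small recursive interpreter. This is a simplified illustration (plans reduced to sequences of sub-actions, a hypothetical three-level hierarchy), not the QLAP implementation; in particular, real plans execute policies rather than fixed sequences.

```python
def perform(action, plans, call_list, motor_out):
    """Sketch of hierarchical action execution: an action chooses a plan,
    the plan calls sub-actions, recursion bottoms out at motor actions,
    and the call list prevents loops (Section IV-E)."""
    var, goal = action
    if var.startswith('u'):             # motor variable: just will it
        motor_out.append(action)
        return True
    for plan in plans.get(action, []):  # each plan: sequence of sub-actions
        if any(sub in call_list for sub in plan):
            continue                    # sub-action already in the call list
        if all(perform(sub, plans, call_list + [sub], motor_out)
               for sub in plan):
            return True
    return False                        # no plans, or every plan failed

# Hypothetical hierarchy: to move the block right, first bring the hand
# to the block's left edge, then move the hand right.
plans = {('block_x', '[+]'): [[('x_RL', '0'), ('hdot_x', '[+]')]],
         ('x_RL', '0'):      [[('u_x', '(300,+inf)')]],
         ('hdot_x', '[+]'):  [[('u_x', '(300,+inf)')]]}
motor = []
print(perform(('block_x', '[+]'), plans, [], motor), motor)
# True [('u_x', '(300,+inf)'), ('u_x', '(300,+inf)')]
```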

V. EXPLORATION AND LEARNING

The QLAP agent explores and learns autonomously without being given a task. This autonomous exploration and learning raises many issues. For example, how can the agent decide what is worth exploring? As the agent explores, it learns new representations. How can it keep from learning unnecessary representations and getting bogged down? Should the agent use the same criteria for learning all representations? Or should it treat some representations as especially important? And finally, can the agent learn that some parts of the environment can be controlled with high reliability and low latency, so that they can be considered part of self?

Previous sections have explained how QLAP learns representations that take the form of landmarks, DBNs, plans, and actions. This section explains how learning in QLAP unfolds over time. We first discuss how the agent explores the environment. We then discuss developmental restrictions that determine what representations the agent learns and the order in which it learns them. We then discuss how QLAP pays special attention to goals that are hard to achieve. And finally, we discuss how the agent learns what is part of self.

A. Exploration

The QLAP agent explores the environment autonomously without being given a task. Instead of trying to learn to do a particular task, the agent tries to learn to predict and control all of the variables in its environment. However, this raises difficulties because there might be many variables in the environment, and some may be difficult or impossible to predict or control. This section explains how the agent determines what should be explored and the best way to go about that exploration.

Initially, the agent motor babbles for 20,000 timesteps by repeatedly choosing random motor values and maintaining those values for a random number of timesteps. After that point, QLAP begins to practice its learned actions. An outline of the execution of QLAP is shown in Algorithm 1. The agent continually makes three types of choices during its exploration. These choices vary in time scale from coarse to fine. The agent chooses: a learned action a(v, q) to practice, the best plan o_i for performing the action a(v, q), and the action based on policy π_i for plan o_i.

1) Choosing a Learned Action to Practice: One method for choosing where to explore is to measure prediction error and then to motivate the agent to explore parts of the space for which it currently does not have a good model. This form of intrinsic motivation is used in [30], [31]. However, focusing attention on states where the model has poor prediction ability can cause the agent to explore spaces where learning is too difficult. Schmidhuber [32] proposed a method whereby an agent learns to predict the decrease in the error of the model that results from taking each action. The agent can then choose the action that will cause the biggest decrease in prediction error. Oudeyer, Kaplan, and Hafner [33] apply this approach with a developing agent and have the agent explore regions of the sensory-motor space that are expected to produce the largest decrease in predictive error. Their method is called Intelligent Adaptive Curiosity (IAC).

QLAP uses IAC to determine which action to practice. After the motor babbling period of 20,000 timesteps, QLAP chooses a motor babbling action with probability 0.1; otherwise it uses IAC to choose a learned action to practice. Choosing a learned action to practice consists of two steps: (1) determine the set of applicable actions that could be practiced in the current state s, and (2) choose an action from that set. The set of applicable actions to practice consists of the set of actions that are not currently accomplished, but could be performed. For a change action, this means that the action must have at least one plan. For a magnitude action a(v, q), this means that if v(t) < q then a(v̇, [+]) must have at least one plan (and similarly for v(t) > q). QLAP chooses an action to practice by assigning a weight w to each action in the set of applicable actions. The action is then chosen randomly based on this weight w. The weights are assigned using a version of Intelligent Adaptive Curiosity (IAC) [33] that measures the change in the agent's ability to perform the action over time and then chooses actions where that ability is increasing.

2) Choosing the Best Plan to Perform an Action: When an action is called, it chooses a plan to perform the action. QLAP seeks to choose the plan that is most likely to be successful in the current state. To compare plans, QLAP computes a weight w_o^s for each plan o in state s: the product of the reliability of the DBN r that led to the plan, rel(r, s), and the reliability of the second-order DBN, rel(o, s), so that

  w_o^s = rel(r, s) · rel(o, s)    (20)

To choose the plan to perform the action, QLAP uses ε-greedy plan selection (ε = 0.05). With probability 1 − ε, QLAP chooses the plan with the highest weight, and with probability ε it chooses a plan randomly. To prevent loops in the calling list, a plan whose DBN has its antecedent event already in the call list is not applicable and cannot be chosen.

3) Choosing an Action within a Plan: Recall from Section IV that QLAP learns a Q-table for each plan that gives a value for taking each action a in state s. Here again, QLAP uses ε-greedy action selection. With probability 1 − ε, in state s, QLAP chooses an action a that maximizes Q_i(s, a), and with probability ε, QLAP chooses a random action. This action selection method balances exploration with exploitation [29].

Algorithm 1 The Qualitative Learner of Action and Perception (QLAP)
 1: for t = 0 : ∞ do
 2:   sense environment
 3:   convert input to qualitative values using current landmarks
 4:   update statistics for learning new contingencies
 5:   update statistics for each DBN
 6:   if mod(t, 2000) == 0 then
 7:     learn new DBNs
 8:     update contexts on existing DBNs
 9:     delete unneeded DBNs and plans
10:     if mod(t, 4000) == 0 then
11:       learn new landmarks on events
12:     else
13:       learn new landmarks on DBNs
14:     end if
15:     convert DBNs to plans
16:   end if
17:   if current exploration action is completed then
18:     choose new exploration action and action plan
19:   end if
20:   get low-level motor command based on current qualitative state and plan of current exploration action
21:   pass motor command to robot
22: end for
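The per-step choices of Section V-A can be sketched as follows. The IAC weighting shown is a simplified stand-in for the actual IAC measure, and the plan selector implements Eq. (20) with ε-greedy selection; names and histories are hypothetical.

```python
import random

def choose_action_to_practice(applicable, competence_history):
    """IAC-style choice: weight each applicable action by the recent
    increase in its success rate, then sample proportionally."""
    weights = []
    for a in applicable:
        hist = competence_history[a]        # past success rates, non-empty
        recent, older = hist[-5:], hist[-10:-5] or hist[:1]
        gain = max(0.0, sum(recent) / len(recent) - sum(older) / len(older))
        weights.append(gain + 1e-3)         # keep every action possible
    return random.choices(applicable, weights=weights)[0]

def choose_plan(plans, rel_r, rel_o, s, eps=0.05):
    """Epsilon-greedy plan choice using w_o^s = rel(r,s) * rel(o,s)."""
    if random.random() < eps:
        return random.choice(plans)
    return max(plans, key=lambda o: rel_r[o](s) * rel_o[o](s))

hist = {'a1': [0.1, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.6, 0.7, 0.8],
        'a2': [0.9] * 10}
print(choose_action_to_practice(['a1', 'a2'], hist))  # usually 'a1'
```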
With probbility ɛ, QLAP chooses the pln with the highest weight. And with probbility ɛ it chooses pln rndomly. To prevent loops in the clling list, pln whose DBN hs its ntecedent event lredy in the cll list is not pplicble nd cnnot be chosen. 3) Choosing n within Pln: Recll from Section IV tht QLAP lerns Q-tble for ech pln tht gives vlue for tking ech ction in stte s. Here gin, QLAP uses ɛ-greedy selection. With probbility ɛ, in stte s, QLAP chooses ction tht mximizes Q i (s, ), nd with probbility ɛ, QLAP chooses rndom ction. This ction selection method blnces explortion with exploittion [9]. Algorithm The Qulittive Lerner of nd Perception (QLAP) : for t = 0 : do : sense environment 3: convert input to qulittive vlues using current lndmrks 4: updte sttistics for lerning new contingencies 5: updte sttistics for ech DBN 6: if mod(t, 000) == 0 then 7: lern new DBNs 8: updte contexts on existing DBNs 9: delete unneeded DBNs nd plns 0: if mod(t, 4000) == 0 then : lern new lndmrks on events : else 3: lern new lndmrks on DBNs 4: end if 5: convert DBNs to plns 6: end if 7: if current explortion ction is completed then 8: choose new explortion ction nd ction pln 9: end if 0: get low-level motor commnd bsed on current qulittive stte nd pln of current explortion ction : pss motor commnd to robot : end for B. Trgeted Lerning Since QLAP cretes n ction for ech vrible nd qulittive vlue combintion, QLAP gent is fced with mny potentil ctions tht could be lerned. QLAP cn choose different ctions to prctice bsed on the lerning grdient,

10 0 but wht bout the thresholds to lern predictive DBN models nd plns? Some ctions might be more difficult to lern thn others, so it seems resonble tht the requirements for lerning representtions tht led to lerning such ctions should be loosened. QLAP does trgeted lerning for difficult ctions. To lern pln for n ction chosen for trgeted lerning, QLAP ) Lowers the threshold needed to lern contingency. Recll from Section III-A, tht contingency is lerned when P r(soon(t, E ) E (t)) P r(soon(t, E )) > θ pen = 0.05 () If event E is chosen for trgeted lerning, QLAP mkes it more likely tht contingency will by lerned by setting θ pen = 0.0. ) Lowers the threshold needed to lern pln. Recll from Section IV-B tht one of the requirements to convert chnge DBN r into pln is tht brel(r) > θ SR = If event E is chosen for trgeted lerning, QLAP mkes it more likely tht DBN will be converted to pln by setting θ SR = 0.5. This leves the question of when to use trgeted lerning of ctions. An event is chosen s gol for trgeted lerning if the probbility of being in stte where the event is stisfied is less thn 0.05; we cll such n event sufficiently rre. This is reminiscent of Bonrini et l. [34]. They consider desirble sttes to be those tht re rrely reched or re esily left once reched. C. Lerning Wht Is Prt of Self One step towrds tool use is mking objects in the environment prt of self so tht they cn be used to perform useful tsks. The representtion of self is strightforwrd in QLAP. A chnge vrible is prt of self if it cn be quickly nd relibly mnipulted. QLAP lerns wht is prt of self by looking for vribles tht it cn relibly control with low ltency. Mrjnovic [35] enbled robot to identify wht ws prt of self by hving the robot wve its rm nd hving the robot ssume tht the only thing moving in the scene ws itself. The work of Mett nd Fitzptrick [36] is similr but more sophisticted becuse it looks for opticl flow tht correltes with motor commnds of the rm. Gold nd Scssellti [37] use the time between giving motor commnd nd seeing movement to determine wht is prt of self. Our method for lerning self is similr to tht of Gold nd Scssellti, but we lern wht is prt of self while lerning ctions. A direction of chnge vrible v is prt of self if: ) the verge time it tkes for the ction to set v = [+] nd v = [ ] is less thn k, nd ) the ctions for both v = [+] nd v = [ ] re sufficiently relible. where k is how long it tkes for motor commnds to be observed s chnges in the world, see [8]. VI. EVALUATION Evluting utonomous lerning is difficult becuse there is no pre-set tsk on which to evlute performnce. The () Not grsping (b) Grsping Fig. 8: The evlution environment (shown here with floting objects) pproch we tke is to first hve the gent lern utonomously in n environment; we then evlute if the gent is ble to perform set of tsks. It is importnt to note tht during lerning the gent does not know on which tsks it will be evluted. A. Evlution Environment The evlution environment is implemented in Breve [38] nd hs relistic physics. Breve simultes physics using the Open Dynmics Engine (ODE) [39]. The simultion consists of robot t tble with block nd floting objects. The robot hs n orthogonl rm tht cn move in the x, y, nd z directions. The environment is shown in Figure 8, nd the vribles perceived by the gent for the core environment re shown in Tble II. The block hs width tht vries between nd 3 units. The block is replced when it is out of rech nd not moving, or when it hits the floor. Ech timestep in the simultor corresponds to 0.05 seconds. 
i.e., 20 timesteps per second, 1,200 timesteps per minute, or 72,000 timesteps per hour. See [18] for further details.

The robot can grasp the block in a way that is reminiscent of both the palmar reflex [40] and having a sticky mitten [41]. The palmar reflex is a reflex that is present from birth until the age of 4-6 months in human babies. The reflex causes the baby to close its hand when something touches the palm. In the sticky mittens experiments, three-month-old infants wore mittens covered with Velcro that allowed them to more easily grasp objects. Grasping is implemented on the robot to allow it to grasp only when over the block. Specifically, the block is grasped if the hand and block are colliding and the Euclidean 2D distance from the center of the block in the x and y directions is less than half the width of the palm, 3/2 = 1.5 units.

In addition to the core environment, QLAP is also evaluated with distractor objects. This is done using the floating extension environment, which adds two floating objects that the agent can observe but cannot interact with. The purpose of this environment is to evaluate QLAP's ability to focus on learnable relationships in the presence of unlearnable ones. The objects float around in an invisible box. The variables added to the core environment to make the floating extension environment are shown in Table III.
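The grasp condition is simple enough to state directly in code; the sketch below assumes 2D center coordinates for the hand and the block and uses the half-palm-width threshold given above.

```python
import math

PALM_WIDTH = 3.0

def grasp_triggered(hand_xy, block_xy, colliding):
    """The grasp reflex: the block is grasped when the hand touches it
    and the 2D center-to-center distance is under half the palm width
    (3/2 = 1.5 units)."""
    return colliding and math.dist(hand_xy, block_xy) < PALM_WIDTH / 2

print(grasp_triggered((0.0, 0.0), (0.4, 0.3), colliding=True))  # True
```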

B. Experimental Conditions

We compare the performance of QLAP with the performance of a supervised learner on a set of tasks. The supervised learner is trained only on the evaluation tasks. This puts QLAP at a disadvantage on the evaluation tasks, because QLAP is not informed of the evaluation tasks and QLAP learns more than the evaluation tasks. We hope QLAP can demonstrate developmental learning by getting better at the tasks over time, and that QLAP can do as well as the supervised learner.

Each agent is evaluated on three tasks, referred to as the core tasks.

1) move the block: The evaluator picks a goal to move the block left (Ṫ_L = [+]), right (Ṫ_R = [−]), or forward (Ṫ_T = [+]). The goal is chosen randomly based on the relative position of the hand and the block. A trial is terminated early if the agent hits the block in the wrong direction.
2) hit the block to the floor: The goal is to make bang = true.
3) pick up the block: The goal is to get the hand in just the right place so the robot can grasp the block and make T = true. A trial is terminated early if the agent hits the block out of reach.

The supervised learner is trained using linear, gradient-descent Sarsa(λ) with binary features [29], where the binary features come from tile coding. Tile coding is a way to discretize continuous input for reinforcement learning. Both the QLAP and the supervised learning agents are evaluated on the core tasks in both the core environment and the floating extension environment under three experimental conditions:

1) QLAP: the QLAP algorithm.
2) SupLrn-1: supervised learning, choosing an action every timestep.
3) SupLrn-10: supervised learning, choosing an action every 10 timesteps.

SupLrn-1 and SupLrn-10 are both used because SupLrn-1 has difficulty learning the core tasks due to high task diameter. QLAP learns autonomously for 250,000 timesteps (corresponding to about 3.5 hours of physical experience) as described in Section V. The supervised learning agents repeatedly perform trials of a particular core task for 250,000 timesteps. At the beginning of each trial, the core task that the supervised learning agent will practice is chosen randomly. The state of the agent is saved every 10,000 timesteps (about every 8 minutes of physical experience). The agent is then evaluated on how well it can do the specified task using the representations from each stored state.

At the beginning of each trial, the block is placed in a random location within reach of the agent and the hand is moved to a random location. Then the goal is given to the agent. The agent makes and executes plans to achieve the goal. If the QLAP agent cannot make a plan to achieve the goal, it moves randomly. The trial is terminated after 300 timesteps or when the goal is achieved. The agent receives a penalty of 0.01 for each timestep it does not achieve the goal and a reward of 9.99 on the timestep it achieves the goal. (SupLrn-10 gets a penalty of 0.1 every 10th timestep it does not reach the goal and a reward of 9.99 on the timestep it reaches the goal.) Each evaluation consists of 100 trials. The rewards over the 100 trials are averaged, and the average reward is taken as a measure of ability. For each experiment, 10 QLAP agents and 10 supervised learning agents are trained.
TABLE II: Variables of the core environment

  Variable          Meaning
  u_x, u_y, u_z     force in x, y, and z directions
  u_UG              ungrasp the block
  h_x, h_y, h_z     global location of hand in x, y, and z directions
  ḣ_x, ḣ_y, ḣ_z     derivative of h_x, h_y, h_z
  y_TB, ẏ_TB        top of hand in frame of reference of bottom of block (y direction)
  y_BT, ẏ_BT        bottom of hand in frame of reference of top of block (y direction)
  x_RL, ẋ_RL        right side of hand in frame of reference of left side of block (x direction)
  x_LR, ẋ_LR        left side of hand in frame of reference of right side of block (x direction)
  z_BT, ż_BT        bottom side of hand in frame of reference of top of block (z direction)
  z_F, ż_F          distance to the floor
  T_L, Ṫ_L          location of nearest edge of block in x direction, in coordinate frame defined by left edge of table
  T_R, Ṫ_R          location of nearest edge of block in x direction, in coordinate frame defined by right edge of table
  T_T, Ṫ_T          location of nearest edge of block in y direction, in coordinate frame defined by top edge of table
  c_x, ċ_x          location of hand in x direction relative to center of block
  c_y, ċ_y          location of hand in y direction relative to center of block
  T                 block is grasped, true or false; becomes true when the hand is touching the block and the 2D distance between the center of the hand and the center of the block is less than 1.5
  bang              true when block hits the floor

TABLE III: Variables added to the core environment to make up the floating extension environment

  Variable             Meaning
  f1_x, f1_y, f1_z     location of first floating object in x, y, and z directions
  ḟ1_x, ḟ1_y, ḟ1_z     derivative of f1_x, f1_y, f1_z
  f2_x, f2_y, f2_z     location of second floating object in x, y, and z directions
  ḟ2_x, ḟ2_y, ḟ2_z     derivative of f2_x, f2_y, f2_z

C. Results

The results are shown in Figures 9 and 10. Figure 9 compares QLAP and supervised learning on the task of moving the block in the specified direction. As can be seen in Figure 9(a), SupLrn-1 was not able to do the task well compared to QLAP, due to the high task diameter (the number of timesteps needed to complete the task). Having the supervised learning agents choose an action every 10 timesteps improved their performance, as can be seen in Figure 9(b). But as can be seen by visually inspecting Figure 9(c), the performance of supervised learning degrades much more than the performance of QLAP degrades when the distractor objects are added.

Fig. 9: Moving the block (average reward per episode on the move task vs. timesteps × 10,000). (a) QLAP does better than SupLrn-1 because of the high task diameter. (b) SupLrn-10 does better than QLAP. (c) When the floating objects are added, the performance of SupLrn-10 degrades much more than the performance of QLAP degrades.

Fig. 10: QLAP outperforms supervised reinforcement learning using tile coding on the more difficult tasks in the floating extension environment (average reward per episode vs. timesteps × 10,000). (a) Knock the block off the table. (b) Pick up the block.

This same pattern (QLAP outperforming SupLrn-1, and SupLrn-10 doing as well as or better than QLAP in the environment without the distractor objects, but QLAP not degrading with the distractor objects) was also observed for the tasks of hitting the block off the table and picking up the block. Figure 10 shows the performance of QLAP and supervised learning on these two tasks. For brevity, this figure only contains the final comparison of QLAP and SupLrn-10 with floating objects on those tasks. We see that QLAP does better than SupLrn-10 on these more difficult tasks in the environment with floating objects. For all three tasks, QLAP displays developmental learning and gets better over time.

These results suggest the following conclusions: (1) QLAP autonomously selects an appropriately coarse temporal coding for actions, out-performing the fine-grained actions used in SupLrn-1, which suffers from the resulting large task diameter; (2) QLAP is more robust to distractor events than SupLrn-10.

D. Additional evaluation: QLAP learns landmarks that are generally useful

We want to show that the learned landmarks really do represent the natural joints in the environment. Since we have no ground truth for what the natural joints of the environment are, we compare the results of tabular Q-learning using landmarks learned with QLAP against the results of tabular Q-learning using randomly generated landmarks. If the landmarks do represent the joints in the environment, then the tabular Q-learner using learned landmarks should do better than the one using random landmarks.

1) Experimental Environment: Tabular Q-learning does not generalize well. During exploratory experiments, the state space of the core environment was so large that tabular Q-learning rarely visited the same state more than once. We therefore evaluate this claim using a smaller environment and a simple task. We use the 2D core environment, in which the hand only moves in two dimensions. It removes the variables of the z direction from the core environment: it subtracts u_z, u_UG, h_z, ḣ_z, z_BT, ż_BT, c_x, ċ_x, c_y, ċ_y, T, and bang.

2) Experimental Conditions: (a) QLAP landmarks: tabular Q-learning using landmarks learned by QLAP in a previous run of 100,000 timesteps on the 2D core environment. (b) random landmarks: tabular Q-learning using randomly generated landmarks. To generate the random landmarks, for each magnitude or motor variable v, a random number of landmarks between 0 and 5 is chosen. Each landmark is then placed in a randomly chosen location within the minimum and maximum range observed for v during a typical run of QLAP. Note that motor

3) Results: The results are shown in Figure 11. Tabular Q-learning works much better using the learned landmarks than using the random ones.

Fig. 11: QLAP landmarks enable the agent to learn the task better than do random landmarks. (The plot shows average reward per episode on the move task over timesteps (x 10,000) for QLAP landmarks and random landmarks.)

VII. DISCUSSION

A. Description of Learned Representations

One of the first contingencies (models) that QLAP learns is that if it gives a positive force to the hand, then the hand will move to the right. But this contingency is not very reliable, because in the simulator it takes a force of at least 300 units to move the hand. The agent learns a landmark at 300 on that motor variable and modifies the model to state that if the force is at least 300, then the hand will move to the right. But this model is still not completely reliable, because if the hand is already all the way to the right, then it cannot move any farther. From this model, however, the agent can note the location of its hand each time it applies a force of 300, and it can learn a landmark to indicate when the hand is all the way to the right. The completed model states that if the hand is not all the way to the right and a force of at least 300 is given, then the hand will usually move to the right. Even this model is not completely reliable, because there are unusual situations where, for example, the hand is stuck on the block. But the model is probabilistic, so it can handle nondeterminism.

The agent also learns that the location of the right side of the hand in the frame of reference of the left side of the block has a special value at 0. It learns this because it notices that the block begins to move to the right when that value is achieved. It then creates a landmark to indicate that value and an action to reach that value. Based on this landmark, QLAP can learn a contingency (model) that says that if the value goes to 0, then the block will move to the right. It can then learn other landmarks that indicate in which situations this will be successful. In a similar way, it learns to pick up the block and to knock the block off the table. A sketch of this refinement process appears below.
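To make this refinement concrete, here is a minimal, hypothetical sketch of a probabilistic contingency of the kind described above. The event strings and the h_max landmark name are illustrative; QLAP's actual models are DBNs with conditional probability tables, so this is only a schematic.

    from dataclasses import dataclass, field

    @dataclass
    class Contingency:
        # If antecedent holds (and every context condition holds),
        # predict consequent with the estimated reliability.
        antecedent: str
        consequent: str
        context: list = field(default_factory=list)
        successes: int = 0
        trials: int = 0

        def observe(self, outcome):
            # Update reliability statistics after each relevant trial.
            self.trials += 1
            self.successes += int(outcome)

        def reliability(self):
            return self.successes / self.trials if self.trials else 0.0

    # The initial, unreliable model: positive force moves the hand right.
    m0 = Contingency("u_x > 0", "h_x increasing")

    # After learning the landmark at 300 and the context condition that
    # the hand is not already all the way right, the refined model is
    # more reliable but still probabilistic.
    m1 = Contingency("u_x >= 300", "h_x increasing", context=["h_x < h_max"])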
B. Theoretical Bounds

QLAP is reasonably efficient in time and space. Let V be the number of variables. QLAP searches for pairs of events to form a contingency, so this process is O(V^2) in both time and space. QLAP also searches for context variables and landmarks for each learned contingency. It does this by considering one variable at a time for each DBN, so these processes are O(V^3) in time and space. This means that if there were a very large number of variables, such as 1000, the variables would need to be categorized so that only a subset of the possible contingencies is considered.

MDP planning is known to be computationally expensive because the state space grows exponentially with the number of variables. However, this explosion is limited in QLAP because QLAP builds many MDPs, each consisting of only a few variables. Essentially, QLAP searches for the simplest MDP models that give reasonable descriptions of the observed dynamics of the environment. The search is a greedy breadth-first search that incrementally increases the number of variables in the MDP (via the DBN). Note that any given MDP model describes the world in terms of a certain set of explicit variables. The rest of the dynamics of the world is encoded in the probabilities of the conditional probability tables (CPTs). Presumably, there is a general trade-off between the number of variables in the model and the determinism of the CPT. However, QLAP is based on the assumption that this general trade-off is dominated by the choice of the right variables and the right landmarks for those variables. That is, the right small models may well be quite good, and the hill-climbing methods of QLAP can often find them. The sketch below illustrates where these bounds come from.
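As a rough illustration of the bounds, the following sketch enumerates event pairs and then greedily considers one context variable at a time per learned model. The predicates is_contingency and improves stand in for QLAP's statistical tests and are assumptions of this sketch.

    def learn_structure(events, variables, is_contingency, improves):
        # O(V^2): consider every ordered pair of events as a
        # candidate contingency.
        models = []
        for e1 in events:
            for e2 in events:
                if e1 is not e2 and is_contingency(e1, e2):
                    models.append({"pair": (e1, e2), "context": []})
        # O(V) more per model, hence O(V^3) overall: greedily add
        # one context variable at a time to each learned model.
        for model in models:
            for v in variables:
                if improves(model, v):
                    model["context"].append(v)
        return models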

C. Assumptions of QLAP

QLAP assumes that any goal that an outside observer would want the agent to accomplish is represented with an input variable. QLAP also assumes that meaningful landmarks can be found on single variables. In some cases where these assumptions are violated, QLAP can do a search on combinations of variables (see [18]). QLAP also assumes a set of continuous motor primitives that correspond to orthogonal directions of movement. QLAP builds on the work of Pierce and Kuipers [42]. In their work, an agent was able to use principal components analysis (PCA) [43] to learn a set of motor primitives corresponding to turn and travel for a robot that had motors to turn each of two wheels independently.

VIII. RELATED WORK

QLAP learns states and hierarchical actions in continuous, dynamic environments with continuous motors through autonomous exploration. The closest direct competitor to QLAP is the work of Barto, Jonsson, and Vigorito. Given a DBN model of the environment, the VISA algorithm [44] creates a causal graph, which it uses to identify state variables for options. Like QLAP, the VISA algorithm performs state abstraction by finding the relevant variables for each option. Jonsson and Barto [20] learn DBNs through an agent's interaction with a discrete environment by maximizing the posterior of the DBN given the data, building a tree to represent the conditional probability. Vigorito and Barto [45] extend [20], [44] by proposing an algorithm for learning options when there is no specific task. This work differs from QLAP in that learning takes place in discrete environments with events that are assumed to occur over one-timestep intervals. The work also assumes that the agent begins with a set of discrete actions. Because QLAP is designed for continuous environments with dynamics, QLAP uses a qualitative representation. This qualitative representation leads to a novel DBN learning algorithm for learning predictive models, and a novel method for converting those models into a set of hierarchical actions.

Shen's LIVE algorithm [46] learns a set of rules in first-order logic and then uses goal regression to perform actions. The algorithm assumes that the agent already has basic actions, and the experiments presented are in environments without dynamics, such as the Tower of Hanoi. Another method for learning planning rules in first-order logic is [7], [8]. The rules they learn are probabilistic: given a context and an action, their learned rules provide a distribution over results. This algorithm assumes a discrete state space and that the agent already has basic actions such as "pick up."

QLAP's structure of actions and plans is reminiscent of the MAXQ value function decomposition [47]. QLAP defines its own actions and plans as it learns, and as the agent learns a more refined discretization, the hierarchy changes. There has also been much work on learning a hierarchy. Like QLAP, Digney [48] creates a task to achieve each discrete value of each variable. However, QLAP learns the discretization. Work has been done on learning a hierarchical decomposition of a factored Markov decision process by identifying exits. Exits are combinations of variable values and actions that cause some state variable to change its value [44]. Exits roughly correspond to the DBNs found by QLAP, except that no explicit action is needed for QLAP DBNs. Hengst [49] determined an order on the input variables based on how often they changed value. Using this ordering, he identified exits to change the next variable in the order and created an option for each exit.

There has been other work on structure learning in sequential decision processes where the environment can be modeled as a factored MDP. Degris et al. [19] proposed a method called SDYNA that learns a structured representation in the form of a decision tree and then uses that structure to compute a value function. Strehl et al. [21] learn a DBN to predict each component of a factored-state MDP. Hester and Stone [50] learn decision trees to predict both the reward and the change in the next state. All of these methods are evaluated in discrete environments where transitions occur over one-timestep intervals.

IX. SUMMARY AND CONCLUSION

The Qualitative Learner of Action and Perception (QLAP) is an unsupervised learning algorithm that allows an agent to autonomously learn states and actions in continuous environments. Learning actions from a learned representation is significant because it moves the state of the art of autonomous learning from grid worlds to continuous environments. Another contribution of QLAP is providing a method for factoring the environment into small pieces. Instead of learning one large predictive model, QLAP learns many small models. And instead of learning one large plan to perform an action, QLAP learns many small plans that are useful in different situations.

QLAP starts with a bottom-up process that detects contingencies and builds DBN models to identify the conditions (i.e., values of context variables) under which there are near-deterministic relations among events. Meanwhile, an action is defined for each event, where the action has the intended effect of making that event occur. The desirable situation is for an action to have one or more sufficiently reliable plans for implementing the action. If there are no plans for an action, or if the plans are not sufficiently reliable, the action is still defined, but it is not useful, so it will not be used as a step in a higher-level plan. Each plan comes from a sufficiently reliable DBN, and the overall structure is that DBNs lead to MDPs, MDPs are converted into policies, and policies are plans.

QLAP explores autonomously and tries to learn to achieve each qualitative value of each variable. To explore, the agent continually chooses an action to practice. To choose which action to practice, QLAP uses Intelligent Adaptive Curiosity (IAC) [33]. IAC motivates the agent to practice actions that it is getting better at, and it motivates the agent to stop practicing actions that are too hard or too easy. A minimal sketch of this selection rule appears below.
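Under the behavior just described, IAC-style action selection might be sketched as follows, where learning progress is approximated as the recent decrease in prediction error for each action. The windowing scheme and the one-error-history-per-action simplification are assumptions of this sketch; the actual IAC of [33] splits the sensorimotor space into regions.

    from collections import defaultdict, deque

    class CuriosityScheduler:
        # Prefer actions whose prediction error is dropping fastest,
        # so actions that are too easy (no error left) or too hard
        # (no improvement) are practiced less.
        def __init__(self, window=20):
            self.window = window
            self.errors = defaultdict(lambda: deque(maxlen=2 * window))

        def record(self, action, prediction_error):
            self.errors[action].append(prediction_error)

        def progress(self, action):
            errs = list(self.errors[action])
            if len(errs) < 2 * self.window:
                return float("inf")  # practice unfamiliar actions first
            old = sum(errs[: self.window]) / self.window
            new = sum(errs[self.window :]) / self.window
            return old - new  # positive while the error is decreasing

        def choose(self, actions):
            # Practice the action with the greatest learning progress.
            return max(actions, key=self.progress)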
QLAP was evaluated in environments with simulated physics. The evaluation was performed by having QLAP explore autonomously and then measuring how well it could perform a set of tasks. The agent learned to hit a block in a specified direction and to pick up the block as well as or better than a supervised learner trained only on the task. The evaluation also showed that the landmarks learned by QLAP were broadly useful.

Future work will consist of incorporating continuous learning methods within the discretized representation learned by QLAP. This should enable QLAP to leverage both the best of discrete learning and the best of continuous learning.

ACKNOWLEDGMENT

This work has taken place in the Intelligent Robotics Lab at the Artificial Intelligence Laboratory, The University of Texas at Austin. Research of the Intelligent Robotics Lab is supported in part by grants from the Texas Advanced Research Program ( ), and from the National Science Foundation (IIS-0413257, IIS-0713150, and IIS-0750011).

REFERENCES

[1] A. Saxena, J. Driemeyer, J. Kearns, and A. Ng, "Robotic grasping of novel objects," Advances in Neural Information Processing Systems, vol. 19, p. 1209, 2007.
[2] P. Fitzpatrick, G. Metta, L. Natale, S. Rao, and G. Sandini, "Learning about objects through action: initial steps towards artificial cognition," in IEEE International Conference on Robotics and Automation (ICRA '03), vol. 3, 2003.
[3] S. Vijayakumar, A. D'Souza, and S. Schaal, "Incremental online learning in high dimensions," Neural Computation, vol. 17, no. 12, pp. 2602-2634, 2005.
[4] C. M. Vigorito and A. G. Barto, "Intrinsically motivated hierarchical skill learning in structured environments," IEEE Transactions on Autonomous Mental Development (TAMD), vol. 2, no. 2, 2010.
[5] G. L. Drescher, Made-Up Minds: A Constructivist Approach to Artificial Intelligence. Cambridge, MA: MIT Press, 1991.

[6] P. R. Cohen, M. S. Atkin, T. Oates, and C. R. Beal, "Neo: Learning conceptual knowledge by sensorimotor interaction with an environment," in Agents '97. Marina del Rey, CA: ACM, 1997.
[7] L. S. Zettlemoyer, H. Pasula, and L. P. Kaelbling, "Learning planning rules in noisy stochastic worlds," in Proc. 20th National Conf. on Artificial Intelligence (AAAI-2005), 2005.
[8] H. Pasula, L. Zettlemoyer, and L. Kaelbling, "Learning symbolic models of stochastic domains," Journal of Artificial Intelligence Research, vol. 29, pp. 309-352, 2007.
[9] C. Xu and B. Kuipers, "Towards the Object Semantic Hierarchy," in Proc. of the Int. Conf. on Development and Learning (ICDL 2010), 2010.
[10] J. Mugan and B. Kuipers, "The qualitative learner of action and perception, QLAP," in AAAI Video Competition (AIVC 2010), 2010.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning. Cambridge, MA: MIT Press, 1998.
[12] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning," Artificial Intelligence Review, vol. 11, no. 1-5, pp. 11-73, 1997.
[13] C. G. Atkeson, A. W. Moore, and S. Schaal, "Locally weighted learning for control," Artificial Intelligence Review, vol. 11, no. 1-5, pp. 75-113, 1997.
[14] S. Vijayakumar and S. Schaal, "Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space," in Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), vol. 1, 2000.
[15] M. Jordan and D. Rumelhart, "Forward models: Supervised learning with a distal teacher," Cognitive Science, vol. 16, pp. 307-354, 1992.
[16] C. Rasmussen, "Gaussian processes in machine learning," Advanced Lectures on Machine Learning, pp. 63-71, 2006.
[17] B. Kuipers, Qualitative Reasoning. Cambridge, Massachusetts: The MIT Press, 1994.
[18] J. Mugan, "Autonomous Qualitative Learning of Distinctions and Actions in a Developing Agent," Ph.D. dissertation, University of Texas at Austin, 2010.
[19] T. Degris, O. Sigaud, and P. Wuillemin, "Learning the structure of factored Markov decision processes in reinforcement learning problems," in ICML, 2006.
[20] A. Jonsson and A. Barto, "Active learning of dynamic Bayesian networks in Markov decision processes," Lecture Notes in Artificial Intelligence: Abstraction, Reformulation, and Approximation (SARA), 2007.
[21] A. Strehl, C. Diuk, and M. Littman, "Efficient structure learning in factored-state MDPs," in AAAI, 2007.
[22] U. Fayyad and K. Irani, "On the handling of continuous-valued attributes in decision tree generation," Machine Learning, vol. 8, no. 1, pp. 87-102, 1992.
[23] G. A. Miller, E. Galanter, and K. H. Pribram, Plans and the Structure of Behavior. Holt, Rinehart and Winston, 1960.
[24] N. J. Nilsson, Principles of Artificial Intelligence. Tioga Publishing Company, 1980.
[25] M. Puterman, Markov Decision Problems. New York: Wiley, 1994.
[26] C. Boutilier, T. Dean, and S. Hanks, "Decision theoretic planning: Structural assumptions and computational leverage," Journal of Artificial Intelligence Research, vol. 11, pp. 1-94, 1999.
[27] J. Kuffner Jr. and S. LaValle, "RRT-connect: An efficient approach to single-query path planning," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2000.
[28] R. S. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, no. 1-2, pp. 181-211, 1999.
[29] R. S. Sutton and A. G. Barto, Reinforcement Learning. Cambridge, MA: MIT Press, 1998.
[30] X. Huang and J. Weng, "Novelty and Reinforcement Learning in the Value System of Developmental Robots," Proc. 2nd Inter. Workshop on Epigenetic Robotics, 2002.
[31] J. Marshall, D. Blank, and L. Meeden, "An emergent framework for self-motivation in developmental robotics," Proc. of the 3rd Int. Conf. on Development and Learning (ICDL 2004), 2004.
[32] J. Schmidhuber, "Curious model-building control systems," in Proc. Int. Joint Conf. on Neural Networks, vol. 2, 1991.
[33] P. Oudeyer, F. Kaplan, and V. Hafner, "Intrinsic Motivation Systems for Autonomous Mental Development," IEEE Transactions on Evolutionary Computation, vol. 11, no. 2, pp. 265-286, 2007.
[34] A. Bonarini, A. Lazaric, and M. Restelli, "Incremental Skill Acquisition for Self-Motivated Learning Animats," in From Animals to Animats 9: 9th International Conference on Simulation of Adaptive Behavior (SAB). Springer, 2006.
[35] M. Marjanovic, B. Scassellati, and M. Williamson, "Self-taught visually guided pointing for a humanoid robot," in From Animals to Animats 4: Proc. Fourth Int'l Conf. Simulation of Adaptive Behavior, 1996.
[36] G. Metta and P. Fitzpatrick, "Early integration of vision and manipulation," Adaptive Behavior, vol. 11, no. 2, pp. 109-128, 2003.
[37] K. Gold and B. Scassellati, "Learning acceptable windows of contingency," Connection Science, vol. 18, no. 2, pp. 217-228, 2006.
[38] J. Klein, "Breve: a 3d environment for the simulation of decentralized systems and artificial life," in Proc. of the Int. Conf. on Artificial Life, 2003.
[39] R. Smith, "Open Dynamics Engine v0.5 user guide."
[40] V. G. Payne and L. D. Isaacs, Human Motor Development: A Lifespan Approach. McGraw-Hill Humanities/Social Sciences/Languages, 2007.
[41] A. Needham, T. Barrett, and K. Peterman, "A pick-me-up for infants' exploratory skills: Early simulated experiences reaching for objects using 'sticky mittens' enhances young infants' object exploration skills," Infant Behavior and Development, vol. 25, no. 3, 2002.
[42] D. M. Pierce and B. J. Kuipers, "Map learning with uninterpreted sensors and effectors," Artificial Intelligence, vol. 92, pp. 169-227, 1997.
[43] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Wiley-Interscience, 2000.
[44] A. Jonsson and A. Barto, "Causal graph based decomposition of factored MDPs," The Journal of Machine Learning Research, vol. 7, pp. 2259-2301, 2006.
[45] C. M. Vigorito and A. G. Barto, "Autonomous hierarchical skill acquisition in factored MDPs," in Yale Workshop on Adaptive and Learning Systems, New Haven, Connecticut, 2008.
[46] W.-M. Shen, Autonomous Learning from the Environment. W. H. Freeman and Company, 1994.
[47] T. Dietterich, "The MAXQ method for hierarchical reinforcement learning," in ICML, 1998.
[48] B. Digney, "Emergent hierarchical control structures: Learning reactive/hierarchical relationships in reinforcement environments," in From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior. The MIT Press, 1996.
[49] B. Hengst, "Discovering hierarchy in reinforcement learning with HEXQ," in Proceedings of the Nineteenth International Conference on Machine Learning, 2002, pp. 243-250.
[50] T. Hester and P. Stone, "Generalized model learning for reinforcement learning in factored domains," in Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems, Volume 2. International Foundation for Autonomous Agents and Multiagent Systems, 2009.

Jonathan Mugan is a Post-Doctoral Fellow at Carnegie Mellon University. He received his Ph.D. in computer science from the University of Texas at Austin. He received his M.S. from the University of Texas at Dallas, and he received his M.B.A. and B.A. from Texas A&M University.

Benjamin Kuipers joined the University of Michigan in January 2009 as Professor of Computer Science and Engineering. Prior to that, he held an endowed Professorship in Computer Sciences at the University of Texas at Austin. He received his B.A. from Swarthmore College, and his Ph.D. from MIT.
