Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments


Yi Sun, Faustino Gomez, and Jürgen Schmidhuber
IDSIA, Galleria 2, Manno, CH-6928, Switzerland

Abstract. To maximize its success, an AGI typically needs to explore its initially unknown world. Is there an optimal way of doing so? Here we derive an affirmative answer for a broad class of environments.

1 Introduction

An intelligent agent is sent to explore an unknown environment. Over the course of its mission, the agent makes observations, carries out actions, and incrementally builds up a model of the environment from this interaction. Since the way in which the agent selects actions may greatly affect the efficiency of the exploration, the following question naturally arises: How should the agent choose its actions such that the knowledge about the environment accumulates as quickly as possible? In this paper, this question is addressed under a classical framework in which the agent improves its model of the environment through probabilistic inference, and learning progress is measured in terms of Shannon information gain. We show that the agent can, at least in principle, optimally choose actions based on previous experiences, such that the cumulative expected information gain is maximized.

The rest of the paper is organized as follows: Section 2 reviews the basic concepts and establishes the terminology; Section 3 elaborates the principle of optimal Bayesian exploration; Section 4 presents a simple experiment; related work is briefly reviewed in Section 5; Section 6 concludes the paper.

2 Preliminaries

Suppose that the agent interacts with the environment in discrete time cycles t = 1, 2, .... In each cycle, the agent performs an action a, then receives a sensory input o. A history h is either the empty string ∅ or a string of the form a_1 o_1 ⋯ a_t o_t for some t; ha and hao refer to the strings resulting from appending a and ao to h, respectively.

2.1 Learning from Sequential Interactions

To facilitate the subsequent discussion under a probabilistic framework, we make the following assumptions:

Assumption I. The models of the environment under consideration are fully described by a random element Θ which depends solely on the environment. Moreover, the agent's initial knowledge about Θ is summarized by a prior density p(θ).

Assumption II. The agent is equipped with a conditional predictor p(o | ha; θ), i.e. the agent is capable of refining its predictions in the light of information about Θ.

Using p(θ) and p(o | ha; θ) as building blocks, it is straightforward to formulate learning in terms of probabilistic inference. From Assumption I, given the history h, the agent's knowledge about Θ is fully summarized by p(θ | h). According to Bayes' rule, p(θ | hao) = p(θ | ha) p(o | ha; θ) / p(o | ha), with p(o | ha) = ∫ p(o | ha; θ) p(θ | ha) dθ. The term p(θ | ha) represents the agent's knowledge about Θ given the history h and an additional action a. Since Θ depends solely on the environment, and, importantly, knowing the action without the subsequent observation cannot change the agent's state of knowledge about Θ, we have p(θ | ha) = p(θ | h), and thus the knowledge about Θ can be updated using

    p(θ | hao) = p(θ | h) p(o | ha; θ) / p(o | ha).    (1)

It is worth pointing out that p(o | ha; θ) is chosen before entering the environment. It is not required to match the true dynamics of the environment, but the effectiveness of the learning certainly depends on the choice of p(o | ha; θ). For example, if Θ ∈ ℝ and p(o | ha; θ) depends on θ only through its sign, then no knowledge other than the sign of Θ can be learned.

2.2 Information Gain as Learning Progress

Let h and h′ be two histories such that h is a prefix of h′. The respective posterior distributions of Θ are p(θ | h) and p(θ | h′). Using h as a reference point, the amount of information gained when the history grows to h′ can be measured by the KL divergence between p(θ | h′) and p(θ | h). This information gain from h to h′ is defined as

    g(h′ ‖ h) = KL(p(θ | h′) ‖ p(θ | h)) = ∫ p(θ | h′) log [p(θ | h′) / p(θ | h)] dθ.

As a special case, if h = ∅, then g(h′) = g(h′ ‖ ∅) is the cumulative information gain with respect to the prior p(θ).
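To make the update of Eq. 1 and the definition of g concrete, the following sketch (ours, not from the paper) discretizes Θ on a grid, so that the Bayes update and the KL divergence reduce to finite sums. The environment here is a single Bernoulli "coin" with unknown bias θ, so there is only one implicit action and the a argument is dropped; the grid size and the observation sequence are arbitrary illustrative choices.

```python
import math

N = 2000
THETAS = [(k + 0.5) / N for k in range(N)]           # grid over theta in (0, 1)
PRIOR = [1.0 / N] * N                                # uniform prior p(theta)

def update(post, o):
    """Bayes update, Eq. (1): p(theta | h, o) is proportional to p(theta | h) p(o | theta)."""
    lik = [th if o == 1 else 1.0 - th for th in THETAS]
    unnorm = [p * l for p, l in zip(post, lik)]
    z = sum(unnorm)                                  # predictive p(o | h)
    return [u / z for u in unnorm]

def kl(p, q):
    """KL(p || q) for two grid distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

post = PRIOR
for o in [1, 1, 0, 1]:                               # a fixed illustrative observation sequence
    post = update(post, o)
print(kl(post, PRIOR))                               # cumulative gain g(h') w.r.t. the prior
```

After observing 1, 1, 0, 1, the grid posterior approximates Beta(4, 2), so the printed cumulative gain approximates KL(Beta(4, 2) ‖ Beta(1, 1)).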
We also write g(ao | h) for g(hao ‖ h), the information gained from an additional action-observation pair. From an information-theoretic point of view, the KL divergence between two distributions p and q represents the additional number of bits required to encode elements sampled from p using an optimal coding strategy designed for q. This can be interpreted as the degree of unexpectedness or surprise caused by observing samples from p when expecting samples from q.

The key property of information gain for the treatment below is the following decomposition: let h be a prefix of h′ and h′ be a prefix of h″; then

    E_{h″ | h′} g(h″ ‖ h) = g(h′ ‖ h) + E_{h″ | h′} g(h″ ‖ h′).    (2)
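The decomposition of Eq. 2 can be checked numerically. In the sketch below (our own illustration, using a discretized single-action Bernoulli model), h = ∅, h′ is the history after one observation o = 1, and h″ ranges over the one-step extensions of h′; the expectation of g(h″ ‖ h) splits exactly as Eq. 2 predicts.

```python
import math

N = 1000
THETAS = [(k + 0.5) / N for k in range(N)]
prior = [1.0 / N] * N                       # p(theta | h), h = empty history

def update(post, o):
    """Return the posterior after observing o, and the predictive p(o | h)."""
    unnorm = [p * (th if o == 1 else 1 - th) for p, th in zip(post, THETAS)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

post1, _ = update(prior, 1)                 # p(theta | h'), h' = one observation
lhs, rhs = 0.0, kl(post1, prior)            # rhs starts with g(h' || h)
for o in (0, 1):                            # h'' ranges over extensions of h'
    post2, pred = update(post1, o)          # pred = p(o | h')
    lhs += pred * kl(post2, prior)          # E_{h''|h'} g(h'' || h)
    rhs += pred * kl(post2, post1)          # ... + E_{h''|h'} g(h'' || h') on the rhs
print(lhs, rhs)                             # the two sides of Eq. (2) agree
```

The agreement is exact up to floating point, because Eq. 2 is an identity of the KL divergence, not an approximation.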

That is, the information gain is additive in expectation.

Having defined the information gain of trajectories ending with observations, one may proceed to define the expected information gain of performing action a, before observing the outcome o. Formally, the expected information gain of performing a with respect to the current history h is given by ḡ(a | h) = E_{o | ha} g(ao | h). A simple derivation gives

    ḡ(a | h) = Σ_o ∫ p(o, θ | ha) log [p(o, θ | ha) / (p(θ | h) p(o | ha))] dθ = I(O; Θ | ha),

which means that ḡ(a | h) is the mutual information between Θ and the random variable O representing the unknown observation, conditioned on the history h and the action a.

3 Optimal Bayesian Exploration

In this section, the general principle of optimal Bayesian exploration in dynamic environments is presented. We first give results obtained by assuming a fixed limited life span for our agent, then discuss the condition required to extend this to infinite time horizons.

3.1 Results for Finite Time Horizon

Suppose that the agent has experienced history h, and is about to choose τ more actions in the future. Let π be a policy mapping the set of histories to the set of actions, such that the agent performs a with probability π(a | h) given h. Define the curiosity Q-value q^τ_π(h, a) as the expected information gained from the additional τ actions, assuming that the agent performs a in the next step and follows policy π in the remaining τ − 1 steps. Formally, for τ = 1,

    q^1_π(h, a) = E_{o | ha} g(ao | h) = ḡ(a | h),

and for τ > 1,

    q^τ_π(h, a) = E_{o | ha} E_{a_1 | hao} E_{o_1 | haoa_1} ⋯ E_{o_{τ−1} | hao a_1 o_1 ⋯ a_{τ−1}} g(hao a_1 o_1 ⋯ a_{τ−1} o_{τ−1} ‖ h)
                = E_{o | ha} E_{a_1 o_1 ⋯ a_{τ−1} o_{τ−1} | hao} g(hao a_1 o_1 ⋯ a_{τ−1} o_{τ−1} ‖ h).

The curiosity Q-value can be defined recursively. Applying Eq. 2, for τ = 2,

    q^2_π(h, a) = E_{o | ha} E_{a_1 o_1 | hao} g(hao a_1 o_1 ‖ h)
                = E_{o | ha} [g(ao | h) + E_{a_1 o_1 | hao} g(a_1 o_1 | hao)]
                = ḡ(a | h) + E_{o | ha} E_{a′ | hao} q^1_π(hao, a′),

and for τ > 2,

    q^τ_π(h, a) = E_{o | ha} E_{a_1 o_1 ⋯ a_{τ−1} o_{τ−1} | hao} g(hao a_1 o_1 ⋯ a_{τ−1} o_{τ−1} ‖ h)
                = E_{o | ha} [g(ao | h) + E_{a_1 o_1 ⋯ a_{τ−1} o_{τ−1} | hao} g(hao a_1 o_1 ⋯ a_{τ−1} o_{τ−1} ‖ hao)]
                = ḡ(a | h) + E_{o | ha} E_{a′ | hao} q^{τ−1}_π(hao, a′).    (3)
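The identity ḡ(a | h) = I(O; Θ | ha) can be confirmed numerically. The sketch below (our own toy model, not from the paper) discretizes Θ on a grid and equips the agent with two hypothetical actions, observing coins of bias θ and θ² respectively; the expected KL gain and the mutual information, computed along two different routes (expected posterior-vs-prior divergence vs. entropy difference), coincide.

```python
import math

N = 1000
THETAS = [(k + 0.5) / N for k in range(N)]
post = [1.0 / N] * N                        # current posterior p(theta | h)

def lik(o, a, th):
    """Toy predictor p(o | a; theta): action 0 flips a theta-coin,
    action 1 flips a theta^2-coin (an arbitrary illustrative choice)."""
    p1 = th if a == 0 else th * th
    return p1 if o == 1 else 1.0 - p1

def gbar(a):
    """Expected information gain, E_{o|ha} KL( p(.|hao) || p(.|h) )."""
    total = 0.0
    for o in (0, 1):
        unnorm = [p * lik(o, a, th) for p, th in zip(post, THETAS)]
        z = sum(unnorm)                     # predictive p(o | ha)
        total += sum(u * math.log(u / (z * p)) for u, p in zip(unnorm, post))
    return total

def mutual_info(a):
    """The same quantity as I(O; Theta | ha) = H(O | ha) - E_theta H(O | a, theta)."""
    pred = [sum(p * lik(o, a, th) for p, th in zip(post, THETAS)) for o in (0, 1)]
    h_marg = -sum(z * math.log(z) for z in pred)
    h_cond = -sum(p * (lik(1, a, th) * math.log(lik(1, a, th))
                       + lik(0, a, th) * math.log(lik(0, a, th)))
                  for p, th in zip(post, THETAS))
    return h_marg - h_cond

for a in (0, 1):
    print(a, gbar(a), mutual_info(a))       # the two computations agree for each action
```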

Noting that Eq. 3 bears a great resemblance to the definition of state-action values Q(s, a) in reinforcement learning, one can similarly define the curiosity value of a particular history as v^τ_π(h) = E_{a | h} q^τ_π(h, a), analogous to state values V(s), which can also be iteratively defined as

    v^1_π(h) = E_{a | h} ḡ(a | h),  and  v^τ_π(h) = E_{a | h} [ḡ(a | h) + E_{o | ha} v^{τ−1}_π(hao)].

The curiosity value v^τ_π(h) is the expected information gain of performing the additional τ steps, assuming that the agent follows policy π. The two notations can be combined to write

    q^τ_π(h, a) = ḡ(a | h) + E_{o | ha} v^{τ−1}_π(hao).    (4)

This equation has an interesting interpretation: since the agent is operating in a dynamic environment, it has to take into account not only the immediate expected information gain of performing the current action, i.e., ḡ(a | h), but also the expected curiosity value of the situation in which the agent ends up due to the action, i.e., v^{τ−1}_π(hao). As a consequence, the agent needs to choose actions that balance the two factors in order to improve its total expected information gain.

Now we show that there is an optimal policy which leads to the maximum cumulative expected information gain given any history h. To obtain the optimal policy, one may work backwards in τ, taking greedy actions with respect to the curiosity Q-values at each time step. Namely, for τ = 1, let

    q^1(h, a) = ḡ(a | h),  π^1(h) = arg max_a ḡ(a | h),  and  v^1(h) = max_a ḡ(a | h),

such that v^1(h) = q^1(h, π^1(h)), and for τ > 1, let

    q^τ(h, a) = ḡ(a | h) + E_{o | ha} [max_{a′} q^{τ−1}(hao, a′)] = ḡ(a | h) + E_{o | ha} v^{τ−1}(hao),

with π^τ(h) = arg max_a q^τ(h, a) and v^τ(h) = max_a q^τ(h, a). We show that π^τ(h) is indeed the optimal policy for any given τ and h, in the sense that the curiosity value when following π^τ is maximized. To see this, take any other policy π. First notice that

    v^1(h) = max_a ḡ(a | h) ≥ E_{a | h} ḡ(a | h) = v^1_π(h).

Moreover, assuming v^τ(h) ≥ v^τ_π(h),

    v^{τ+1}(h) = max_a [ḡ(a | h) + E_{o | ha} v^τ(hao)]
               ≥ max_a [ḡ(a | h) + E_{o | ha} v^τ_π(hao)]
               ≥ E_{a | h} [ḡ(a | h) + E_{o | ha} v^τ_π(hao)] = v^{τ+1}_π(h).

Therefore v^τ(h) ≥ v^τ_π(h) holds for arbitrary τ, h, and π.
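The backward-greedy construction above can be implemented directly for tiny discrete models. In the sketch below (ours), Θ ranges over three candidate environments with a binary observation per action; for such a model the posterior over Θ is a sufficient summary of the history, so we plan over beliefs. The model parameters are arbitrary illustrative choices, and the recursion's cost grows exponentially in the look-ahead τ.

```python
import math

# Three candidate environments (values of Theta); two actions; binary observation.
# P[theta][a] = probability of observing o = 1 under action a in environment theta.
P = [(0.9, 0.5), (0.1, 0.5), (0.5, 0.2)]
ACTIONS, OBS = (0, 1), (0, 1)

def posterior(b, a, o):
    """Eq. (1): update the belief b over Theta after performing a and seeing o."""
    lik = [P[th][a] if o == 1 else 1 - P[th][a] for th in range(len(b))]
    z = sum(bi * li for bi, li in zip(b, lik))          # predictive p(o | ha)
    return tuple(bi * li / z for bi, li in zip(b, lik)), z

def gbar(b, a):
    """Expected one-step information gain: E_o KL(posterior || b)."""
    total = 0.0
    for o in OBS:
        nb, z = posterior(b, a, o)
        total += z * sum(p * math.log(p / q) for p, q in zip(nb, b) if p > 0)
    return total

def q(b, a, tau):
    """Curiosity Q-value via the greedy recursion; branching makes this exponential in tau."""
    if tau == 1:
        return gbar(b, a)
    total = gbar(b, a)
    for o in OBS:
        nb, z = posterior(b, a, o)
        total += z * max(q(nb, a1, tau - 1) for a1 in ACTIONS)
    return total

b0 = (1 / 3, 1 / 3, 1 / 3)
for tau in (1, 2, 3):
    best = max(ACTIONS, key=lambda a: q(b0, a, tau))
    print(tau, best, q(b0, best, tau))
```

Since all immediate gains are non-negative, the optimal curiosity Q-values computed this way are non-decreasing in the look-ahead τ.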
The same can be shown for curiosity Q-values, namely, q^τ(h, a) ≥ q^τ_π(h, a) for all τ, h, a, and π.

Now consider an agent with a fixed life span T. It can be seen that at time t, the agent has to perform π^{T−t+1}(h_{t−1}) to maximize the expected information gain in the remaining T − t + 1 steps. Here h_{t−1} = a_1 o_1 ⋯ a_{t−1} o_{t−1} is the history at time t. However, from Eq. 2,

    E_{h_T | h_{t−1}} g(h_T) = g(h_{t−1}) + E_{h_T | h_{t−1}} g(h_T ‖ h_{t−1}).

Note that at time t, g(h_{t−1}) is a constant, thus maximizing the cumulative expected information gain in the remaining time steps is equivalent to maximizing the expected information gain of the whole trajectory with respect to the prior. The result is summarized in the following proposition:

Proposition 1. Let q^1(h, a) = ḡ(a | h), v^1(h) = max_a q^1(h, a), and

    q^τ(h, a) = ḡ(a | h) + E_{o | ha} v^{τ−1}(hao),  v^τ(h) = max_a q^τ(h, a);

then the policy π^τ(h) = arg max_a q^τ(h, a) is optimal in the sense that v^τ(h) ≥ v^τ_π(h) and q^τ(h, a) ≥ q^τ_π(h, a) for any π, τ, h, and a. In particular, for an agent with fixed life span T, following π^{T−t+1}(h_{t−1}) at time t = 1, ..., T is optimal in the sense that the expected cumulative information gain with respect to the prior is maximized.

The definition of the optimal exploration policy is constructive, which means that it can be readily implemented, provided that the number of actions and possible observations is finite, so that the expectation and maximization can be computed exactly. However, the cost of computing such a policy is O((n_o n_a)^τ), where n_o and n_a are the number of possible observations and actions, respectively. Since the cost is exponential in τ, planning with a large number of look-ahead steps is infeasible, and approximation heuristics must be used in practice.

3.2 Non-triviality of the Result

Intuitively, the recursive definition of the curiosity (Q) value is simple, and bears a clear resemblance to its counterpart in reinforcement learning. It might be tempting to think that the result is nothing more than solving the finite horizon reinforcement learning problem using ḡ(a | h) or g(ao | h) as the reward signals. However, this is not the case.

First, note that the decomposition of Eq. 2 is a direct consequence of the formulation of the KL divergence. The decomposition does not necessarily hold if g is replaced with other types of measures of information gain.
Second, it is worth pointing out that g(ao | h) and ḡ(a | h) behave differently from normal reward signals in the sense that they are additive only in expectation, while in the reinforcement learning setup, the reward signals are usually assumed to be additive, i.e., adding reward signals together is always meaningful. Consider a simple problem with only two actions. If g(ao | h) were a plain reward function, then g(ao | h) + g(a′o′ | hao) should be meaningful no matter whether a′ and o′ are known or not. But this is not the case, since the sum does not have a valid information-theoretic interpretation. On the other hand, the sum is meaningful in expectation. Namely, when o′ has not been observed, from Eq. 2,

    g(ao | h) + E_{o′ | haoa′} g(a′o′ | hao) = E_{o′ | haoa′} g(aoa′o′ | h),

so the sum can be interpreted as the expectation of the information gained from h to haoa′o′. This result shows that g(ao | h) and ḡ(a | h) can be treated as additive reward signals only when one is planning ahead.

To emphasize the difference further, note that all immediate information gains g(ao | h) are non-negative, since they are essentially KL divergences. A natural assumption would be that the information gain g(h), which is the sum of

all g(ao | h) in expectation, grows monotonically as the length of the history increases. However, this is not the case; see Fig. 1 for an example. Although g(ao | h) is always non-negative, some of the gains may pull θ closer to its prior density p(θ), resulting in a decrease of the KL divergence between p(θ | h) and p(θ). This is never the case for normal reward signals in reinforcement learning, where the accumulated reward can never decrease if all rewards are non-negative.

Fig. 1. Illustration of the difference between the sum of one-step information gains and the cumulative information gain with respect to the prior. In this case, 1000 independent samples are generated from a distribution over the finite sample space {1, 2, 3}, with p(x = 1) = 0.1, p(x = 2) = 0.5, and p(x = 3) = 0.4. The task of learning is to recover the mass function from the samples, assuming a Dirichlet prior Dir(50, 50, 50). The KL divergence between two Dirichlet distributions is computed according to [5]. It is clear from the graph that the cumulative information gain fluctuates as the number of samples increases, while the sum of the one-step information gains increases monotonically. It also shows that the difference between the two quantities can be large.

3.3 Extending to Infinite Horizon

Having to restrict the maximum life span of the agent is rather inconvenient. It is tempting to define the curiosity Q-value in the infinite-horizon case as the limit of curiosity Q-values with increasing life spans, T → ∞. However, this cannot be achieved without additional technical constraints. For example, consider a simple coin toss. Assuming a Beta(1, 1) prior over the probability of seeing heads, the expected cumulative information gain for the next T flips is given by v^T(h_1) = I(Θ; X_1, ..., X_T), which grows like log T. With increasing T, v^T(h_1) → ∞. A frequently used approach to simplifying the math is to introduce a discount factor γ (0 ≤ γ < 1), as used in reinforcement learning.
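The divergence in the coin-toss example can be checked directly. The sketch below (ours) computes v(T) = I(Θ; X_1, ..., X_T) for the Beta(1, 1) coin exactly: under this prior the number of heads k in T flips is uniform on 0, ..., T, and for integer Beta parameters the digamma values needed for the KL divergence reduce to harmonic sums.

```python
import math

EULER = 0.5772156649015329                  # Euler-Mascheroni constant

def psi(n):
    """Digamma at a positive integer: psi(n) = -gamma + H_{n-1}."""
    return -EULER + sum(1.0 / k for k in range(1, n))

def kl_beta(a, b):
    """KL( Beta(a, b) || Beta(1, 1) ) for integer a, b."""
    ln_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (-ln_B + (a - 1) * (psi(a) - psi(a + b))
                  + (b - 1) * (psi(b) - psi(a + b)))

def v(T):
    """Expected cumulative gain I(Theta; X_1..X_T) for a Beta(1,1) coin.
    Under Beta(1,1), the number of heads k is uniform on 0..T, and the
    posterior after k heads in T flips is Beta(k+1, T-k+1)."""
    return sum(kl_beta(k + 1, T - k + 1) for k in range(T + 1)) / (T + 1)

for T in (1, 10, 100, 1000):
    print(T, v(T), math.log(T))             # v(T) keeps growing without bound
```

For T = 1 the value reduces to the closed form log 2 − 1/2, the mutual information between the bias and a single flip.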
Assume that the agent has a maximum of τ actions left, but before finishing the τ actions it may be forced to leave the environment with probability 1 − γ at each time step. In this case, the curiosity Q-value becomes

q^{γ,1}_π(h, a) = ḡ(a | h), and

    q^{γ,τ}_π(h, a) = (1 − γ) ḡ(a | h) + γ [ḡ(a | h) + E_{o | ha} E_{a′ | hao} q^{γ,τ−1}_π(hao, a′)]
                    = ḡ(a | h) + γ E_{o | ha} E_{a′ | hao} q^{γ,τ−1}_π(hao, a′).

One may also interpret q^{γ,τ}_π(h, a) as a linear combination of curiosity Q-values without the discount:

    q^{γ,τ}_π(h, a) = (1 − γ) Σ_{t=1}^{τ} γ^{t−1} q^t_π(h, a) + γ^τ q^τ_π(h, a).

Note that curiosity Q-values with larger look-ahead steps are weighted exponentially less. The optimal policy in the discounted case is given by

    q^{γ,1}(h, a) = ḡ(a | h),  v^{γ,1}(h) = max_a q^{γ,1}(h, a),

and

    q^{γ,τ}(h, a) = ḡ(a | h) + γ E_{o | ha} v^{γ,τ−1}(hao),  v^{γ,τ}(h) = max_a q^{γ,τ}(h, a).

The optimal actions are given by π^{γ,τ}(h) = arg max_a q^{γ,τ}(h, a). The proof that π^{γ,τ} is optimal is similar to the one for the finite-horizon case (Section 3.1) and thus is omitted here.

Adding the discount enables one to define the curiosity Q-value over an infinite time horizon in a number of cases. However, it is still possible to construct scenarios where such a discount fails. Consider an infinite list of bandits. For bandit n, there are n possible outcomes, with Dirichlet prior Dir(1/n, ..., 1/n). The expected information gain of pulling bandit n for the first time is then

    log n − ψ(2) + ψ(1 + 1/n) ≈ log n,

with ψ(·) being the digamma function. Assume that at time t only the first ⌈e^{e^{t²}}⌉ bandits are available, so that the curiosity Q-value over any finite time horizon is always finite. However, since the largest expected information gain grows at speed e^{t²}, for any given γ > 0, q^{γ,τ} goes to infinity with increasing τ. This example gives the intuition that, to make the curiosity Q-value meaningful, the total information content of the environment (or its growth rate) must be bounded. The following technical lemma gives a sufficient condition for when such an extension is meaningful.

Lemma 1. We have

    0 ≤ q^{γ,τ+1}(h, a) − q^{γ,τ}(h, a) ≤ γ^τ E_{o | ha} max_{a_1} E_{o_1 | haoa_1} ⋯ max_{a_τ} ḡ(a_τ | hao a_1 o_1 ⋯ o_{τ−1}).

Proof. Expand q^{γ,τ} and q^{γ,τ+1}, and note that max X − max Y ≤ max (X − Y); then

    q^{γ,τ+1}(h, a) − q^{γ,τ}(h, a)
      = E_{o | ha} max_{a_1} E_{o_1 | haoa_1} ⋯ max_{a_τ} [ḡ(a | h) + γ ḡ(a_1 | hao) + ⋯ + γ^τ ḡ(a_τ | hao a_1 o_1 ⋯ o_{τ−1})]
        − E_{o | ha} max_{a_1} E_{o_1 | haoa_1} ⋯ max_{a_{τ−1}} [ḡ(a | h) + γ ḡ(a_1 | hao) + ⋯ + γ^{τ−1} ḡ(a_{τ−1} | hao a_1 o_1 ⋯ o_{τ−2})]
      ≤ E_{o | ha} max_{a_1} { E_{o_1 | haoa_1} ⋯ max_{a_τ} [ḡ(a | h) + γ ḡ(a_1 | hao) + ⋯ + γ^τ ḡ(a_τ | hao a_1 o_1 ⋯ o_{τ−1})]
        − E_{o_1 | haoa_1} ⋯ max_{a_{τ−1}} [ḡ(a | h) + γ ḡ(a_1 | hao) + ⋯ + γ^{τ−1} ḡ(a_{τ−1} | hao a_1 o_1 ⋯ o_{τ−2})] }
      ≤ γ^τ E_{o | ha} max_{a_1} E_{o_1 | haoa_1} ⋯ max_{a_τ} ḡ(a_τ | hao a_1 o_1 ⋯ o_{τ−1}).

It can be seen that if E_{o a_1 o_1 ⋯ o_{τ−1} | ha} ḡ(a_τ | hao a_1 o_1 ⋯ o_{τ−1}) grows sub-exponentially in τ, then q^{γ,τ} is a Cauchy sequence, and it makes sense to define the curiosity Q-value over the infinite time horizon.

4 Experiment

The idea presented in the previous section is illustrated through a simple experiment. The environment is an MDP consisting of two groups of densely connected states (cliques) linked by a long corridor. The agent has two actions allowing it to move along the corridor deterministically, whereas the transition probabilities inside each clique are randomly generated. The agent assumes Dirichlet priors over all transition probabilities, and the goal is to learn the transition model of the MDP. In the experiment, each clique consists of 5 states (states 1 to 5 and states 56 to 60), and the corridor is of length 50 (states 6 to 55). The prior over each transition probability is Dir(1/60, ..., 1/60). We compare four different algorithms: i) random exploration, where the agent selects each of the two actions with equal probability at each time step; ii) Q-learning with the immediate information gain g(ao | h) as the reward; iii) greedy exploration, where the agent chooses at each time step the action maximizing ḡ(a | h); and iv) a dynamic-programming (DP) approximation of optimal Bayesian exploration, where at each time step the agent follows a policy which is computed using policy iteration, assuming that the dynamics of the MDP are given by the current posterior, and the reward is the expected information gain ḡ(a | h). The details of this algorithm are described in [11]. Fig. 2 shows the typical behavior of the four algorithms.
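A minimal sketch in the spirit of algorithm iii (greedy exploration) is given below. It is our own scaled-down illustration, not the paper's setup: a random 4-state, 2-action MDP with a uniform Dir(1, ..., 1) prior per state-action pair instead of the 60-state environment with Dir(1/60, ..., 1/60), so that the digamma arguments stay integral. The closed-form expected gain of observing one more transition follows from the Dirichlet KL formula [5].

```python
import math
import random

EULER = 0.5772156649015329

def psi(n):                                   # digamma at a positive integer
    return -EULER + sum(1.0 / k for k in range(1, n))

NS, NA = 4, 2
random.seed(0)
# A fixed, arbitrary "true" MDP: TRUE[s][a] is a distribution over next states.
TRUE = [[[random.random() for _ in range(NS)] for _ in range(NA)] for _ in range(NS)]
for s in range(NS):
    for a in range(NA):
        z = sum(TRUE[s][a])
        TRUE[s][a] = [p / z for p in TRUE[s][a]]

# Dirichlet counts, one vector per (s, a), starting from the Dir(1,...,1) prior.
alpha = [[[1] * NS for _ in range(NA)] for _ in range(NS)]

def gbar(s, a):
    """Closed-form expected info gain of observing one more transition from (s, a):
    sum_j (alpha_j / alpha_0) * KL( Dir(alpha + e_j) || Dir(alpha) )."""
    al = alpha[s][a]
    a0 = sum(al)
    return sum(aj / a0 * (math.log(a0) - math.log(aj) + psi(aj + 1) - psi(a0 + 1))
               for aj in al)

s, gains = 0, []
for step in range(200):                       # greedy exploration loop
    a = max(range(NA), key=lambda a: gbar(s, a))
    gains.append(gbar(s, a))
    s2 = random.choices(range(NS), weights=TRUE[s][a])[0]
    alpha[s][a][s2] += 1                      # Bayes update of the transition model
    s = s2
print(sum(gains))                             # total expected gain collected
```

As the counts grow, the expected gain of every state-action pair shrinks, which is the effect that traps the myopic Q-learning variant in the deterministic corridor of the full experiment.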
The upper four plots in Fig. 2 show how the agent moves in the MDP, starting from one clique. Both greedy exploration and DP move back and forth between the two cliques. Random exploration has difficulty moving between the two cliques due to its random-walk behavior in the corridor. Q-learning exploration, however, gets stuck in the initial clique. The reason is that since movement along the corridor is deterministic, the information gain decreases to virtually zero after only several attempts, therefore the Q-value of jumping into the corridor becomes much lower than the Q-value of jumping inside the clique. The bottom plot shows how the cumulative information gain grows over time, and how the DP approximation clearly outperforms the other algorithms, particularly in the early phase of exploration.

Fig. 2. The exploration process of a typical run of 4000 steps. The upper four plots (random exploration, Q-learning, greedy exploration, DP) show the position of the agent between state 1 (the lowest) and state 60 (the highest) over time. The states at the top and the bottom correspond to the two cliques, and the states in the middle correspond to the corridor. The lowest plot is the cumulative information gain with respect to the prior.

5 Related Work

The idea of actively selecting queries to accelerate a learning process has a long history [1, 2, 7], and has received a lot of attention in recent decades, primarily in the context of active learning [8] and artificial curiosity [6]. In particular, measuring learning progress using KL divergence dates back to the 1950s [2, 4]. In

1995 this was combined with reinforcement learning, with the goal of optimizing future expected information gain [10]. Others later renamed this Bayesian surprise [3]. Our work differs from most previous work in two main points: First, like [10], we consider the problem of exploring a dynamic environment, where actions change the environmental state, while most work on active learning and Bayesian experiment design focuses on queries that do not affect the environment [8]. Second, our result is theoretically sound and directly derived from first principles, in contrast to the more heuristic application [10] of traditional reinforcement learning to maximize the expected information gain. In particular, we pointed out a previously neglected subtlety of using KL divergence as learning progress. Conceptually, however, this work is closely connected to artificial curiosity and intrinsically motivated reinforcement learning [6, 7, 9] for agents that actively explore the environment without an external reward signal. In fact, the very definition of the curiosity (Q) value permits a firm connection between pure exploration and reinforcement learning.

6 Conclusion

We have presented the principle of optimal Bayesian exploration in dynamic environments, centered around the concept of the curiosity (Q) value. Our work provides a theoretically sound foundation for designing more effective exploration strategies. Future work will concentrate on studying the theoretical properties of various approximation strategies inspired by this principle.

7 Acknowledgements

This research was funded in part by Swiss National Science Foundation grant /1 and the EU IM-CLeVeR project (#231722).

References

1. Chaloner, K., Verdinelli, I.: Bayesian experimental design: A review. Statistical Science 10 (1995)
2. Fedorov, V.V.: Theory of Optimal Experiments. Academic Press (1972)
3. Itti, L., Baldi, P.F.: Bayesian surprise attracts human attention. In: NIPS 2005 (2006)
4. Lindley, D.V.: On a measure of the information provided by an experiment. Annals of Mathematical Statistics 27(4) (1956)
5.
Penny, W.: Kullback-Leibler divergences of normal, gamma, Dirichlet and Wishart densities. Tech. rep., Wellcome Department of Cognitive Neurology, University College London (2001)
6. Schmidhuber, J.: Curious model-building control systems. In: IJCNN 1991, vol. 2 (1991)
7. Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Transactions on Autonomous Mental Development 2(3) (2010)
8. Settles, B.: Active learning literature survey. Tech. rep., University of Wisconsin-Madison (2010)
9. Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In: NIPS 2004 (2004)
10. Storck, J., Hochreiter, S., Schmidhuber, J.: Reinforcement driven information acquisition in non-deterministic environments. In: ICANN 1995 (1995)
11. Sun, Y., Gomez, F.J., Schmidhuber, J.: Planning to be surprised: Optimal Bayesian exploration in dynamic environments (2011)


More information

CS 188: Artificial Intelligence Spring 2007

CS 188: Artificial Intelligence Spring 2007 CS 188: Artificil Intelligence Spring 2007 Lecture 3: Queue-Bsed Serch 1/23/2007 Srini Nrynn UC Berkeley Mny slides over the course dpted from Dn Klein, Sturt Russell or Andrew Moore Announcements Assignment

More information

Mapping the delta function and other Radon measures

Mapping the delta function and other Radon measures Mpping the delt function nd other Rdon mesures Notes for Mth583A, Fll 2008 November 25, 2008 Rdon mesures Consider continuous function f on the rel line with sclr vlues. It is sid to hve bounded support

More information

Continuous Random Variables

Continuous Random Variables STAT/MATH 395 A - PROBABILITY II UW Winter Qurter 217 Néhémy Lim Continuous Rndom Vribles Nottion. The indictor function of set S is rel-vlued function defined by : { 1 if x S 1 S (x) if x S Suppose tht

More information

New Expansion and Infinite Series

New Expansion and Infinite Series Interntionl Mthemticl Forum, Vol. 9, 204, no. 22, 06-073 HIKARI Ltd, www.m-hikri.com http://dx.doi.org/0.2988/imf.204.4502 New Expnsion nd Infinite Series Diyun Zhng College of Computer Nnjing University

More information

Review of Calculus, cont d

Review of Calculus, cont d Jim Lmbers MAT 460 Fll Semester 2009-10 Lecture 3 Notes These notes correspond to Section 1.1 in the text. Review of Clculus, cont d Riemnn Sums nd the Definite Integrl There re mny cses in which some

More information

7.2 The Definite Integral

7.2 The Definite Integral 7.2 The Definite Integrl the definite integrl In the previous section, it ws found tht if function f is continuous nd nonnegtive, then the re under the grph of f on [, b] is given by F (b) F (), where

More information

Entropy and Ergodic Theory Notes 10: Large Deviations I

Entropy and Ergodic Theory Notes 10: Large Deviations I Entropy nd Ergodic Theory Notes 10: Lrge Devitions I 1 A chnge of convention This is our first lecture on pplictions of entropy in probbility theory. In probbility theory, the convention is tht ll logrithms

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

LECTURE NOTE #12 PROF. ALAN YUILLE

LECTURE NOTE #12 PROF. ALAN YUILLE LECTURE NOTE #12 PROF. ALAN YUILLE 1. Clustering, K-mens, nd EM Tsk: set of unlbeled dt D = {x 1,..., x n } Decompose into clsses w 1,..., w M where M is unknown. Lern clss models p(x w)) Discovery of

More information

MAA 4212 Improper Integrals

MAA 4212 Improper Integrals Notes by Dvid Groisser, Copyright c 1995; revised 2002, 2009, 2014 MAA 4212 Improper Integrls The Riemnn integrl, while perfectly well-defined, is too restrictive for mny purposes; there re functions which

More information

Math 426: Probability Final Exam Practice

Math 426: Probability Final Exam Practice Mth 46: Probbility Finl Exm Prctice. Computtionl problems 4. Let T k (n) denote the number of prtitions of the set {,..., n} into k nonempty subsets, where k n. Argue tht T k (n) kt k (n ) + T k (n ) by

More information

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction Lesson : Logrithmic Functions s Inverses Prerequisite Skills This lesson requires the use of the following skills: determining the dependent nd independent vribles in n exponentil function bsed on dt from

More information

Estimation of Binomial Distribution in the Light of Future Data

Estimation of Binomial Distribution in the Light of Future Data British Journl of Mthemtics & Computer Science 102: 1-7, 2015, Article no.bjmcs.19191 ISSN: 2231-0851 SCIENCEDOMAIN interntionl www.sciencedomin.org Estimtion of Binomil Distribution in the Light of Future

More information

4 The dynamical FRW universe

4 The dynamical FRW universe 4 The dynmicl FRW universe 4.1 The Einstein equtions Einstein s equtions G µν = T µν (7) relte the expnsion rte (t) to energy distribution in the universe. On the left hnd side is the Einstein tensor which

More information

Decision Networks. CS 188: Artificial Intelligence Fall Example: Decision Networks. Decision Networks. Decisions as Outcome Trees

Decision Networks. CS 188: Artificial Intelligence Fall Example: Decision Networks. Decision Networks. Decisions as Outcome Trees CS 188: Artificil Intelligence Fll 2011 Decision Networks ME: choose the ction which mximizes the expected utility given the evidence mbrell Lecture 17: Decision Digrms 10/27/2011 Cn directly opertionlize

More information

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique?

How do we solve these things, especially when they get complicated? How do we know when a system has a solution, and when is it unique? XII. LINEAR ALGEBRA: SOLVING SYSTEMS OF EQUATIONS Tody we re going to tlk bout solving systems of liner equtions. These re problems tht give couple of equtions with couple of unknowns, like: 6 2 3 7 4

More information

5.7 Improper Integrals

5.7 Improper Integrals 458 pplictions of definite integrls 5.7 Improper Integrls In Section 5.4, we computed the work required to lift pylod of mss m from the surfce of moon of mss nd rdius R to height H bove the surfce of the

More information

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance

ECO 317 Economics of Uncertainty Fall Term 2007 Notes for lectures 4. Stochastic Dominance Generl structure ECO 37 Economics of Uncertinty Fll Term 007 Notes for lectures 4. Stochstic Dominnce Here we suppose tht the consequences re welth mounts denoted by W, which cn tke on ny vlue between

More information

APPROXIMATE INTEGRATION

APPROXIMATE INTEGRATION APPROXIMATE INTEGRATION. Introduction We hve seen tht there re functions whose nti-derivtives cnnot be expressed in closed form. For these resons ny definite integrl involving these integrnds cnnot be

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

S. S. Dragomir. 2, we have the inequality. b a

S. S. Dragomir. 2, we have the inequality. b a Bull Koren Mth Soc 005 No pp 3 30 SOME COMPANIONS OF OSTROWSKI S INEQUALITY FOR ABSOLUTELY CONTINUOUS FUNCTIONS AND APPLICATIONS S S Drgomir Abstrct Compnions of Ostrowski s integrl ineulity for bsolutely

More information

Infinite Geometric Series

Infinite Geometric Series Infinite Geometric Series Finite Geometric Series ( finite SUM) Let 0 < r < 1, nd let n be positive integer. Consider the finite sum It turns out there is simple lgebric expression tht is equivlent to

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

1 1D heat and wave equations on a finite interval

1 1D heat and wave equations on a finite interval 1 1D het nd wve equtions on finite intervl In this section we consider generl method of seprtion of vribles nd its pplictions to solving het eqution nd wve eqution on finite intervl ( 1, 2. Since by trnsltion

More information

Jack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah

Jack Simons, Henry Eyring Scientist and Professor Chemistry Department University of Utah 1. Born-Oppenheimer pprox.- energy surfces 2. Men-field (Hrtree-Fock) theory- orbitls 3. Pros nd cons of HF- RHF, UHF 4. Beyond HF- why? 5. First, one usully does HF-how? 6. Bsis sets nd nottions 7. MPn,

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

How to simulate Turing machines by invertible one-dimensional cellular automata

How to simulate Turing machines by invertible one-dimensional cellular automata How to simulte Turing mchines by invertible one-dimensionl cellulr utomt Jen-Christophe Dubcq Déprtement de Mthémtiques et d Informtique, École Normle Supérieure de Lyon, 46, llée d Itlie, 69364 Lyon Cedex

More information

CS 275 Automata and Formal Language Theory

CS 275 Automata and Formal Language Theory CS 275 Automt nd Forml Lnguge Theory Course Notes Prt II: The Recognition Problem (II) Chpter II.5.: Properties of Context Free Grmmrs (14) Anton Setzer (Bsed on book drft by J. V. Tucker nd K. Stephenson)

More information

Decision Networks. CS 188: Artificial Intelligence. Decision Networks. Decision Networks. Decision Networks and Value of Information

Decision Networks. CS 188: Artificial Intelligence. Decision Networks. Decision Networks. Decision Networks and Value of Information CS 188: Artificil Intelligence nd Vlue of Informtion Instructors: Dn Klein nd Pieter Abbeel niversity of Cliforni, Berkeley [These slides were creted by Dn Klein nd Pieter Abbeel for CS188 Intro to AI

More information

The Wave Equation I. MA 436 Kurt Bryan

The Wave Equation I. MA 436 Kurt Bryan 1 Introduction The Wve Eqution I MA 436 Kurt Bryn Consider string stretching long the x xis, of indeterminte (or even infinite!) length. We wnt to derive n eqution which models the motion of the string

More information

Continuous Random Variables Class 5, Jeremy Orloff and Jonathan Bloom

Continuous Random Variables Class 5, Jeremy Orloff and Jonathan Bloom Lerning Gols Continuous Rndom Vriles Clss 5, 8.05 Jeremy Orloff nd Jonthn Bloom. Know the definition of continuous rndom vrile. 2. Know the definition of the proility density function (pdf) nd cumultive

More information

Bayesian Networks: Approximate Inference

Bayesian Networks: Approximate Inference pproches to inference yesin Networks: pproximte Inference xct inference Vrillimintion Join tree lgorithm pproximte inference Simplify the structure of the network to mkxct inferencfficient (vritionl methods,

More information

CS 188: Artificial Intelligence Fall 2010

CS 188: Artificial Intelligence Fall 2010 CS 188: Artificil Intelligence Fll 2010 Lecture 18: Decision Digrms 10/28/2010 Dn Klein C Berkeley Vlue of Informtion 1 Decision Networks ME: choose the ction which mximizes the expected utility given

More information

Local orthogonality: a multipartite principle for (quantum) correlations

Local orthogonality: a multipartite principle for (quantum) correlations Locl orthogonlity: multiprtite principle for (quntum) correltions Antonio Acín ICREA Professor t ICFO-Institut de Ciencies Fotoniques, Brcelon Cusl Structure in Quntum Theory, Bensque, Spin, June 2013

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties

More information

221B Lecture Notes WKB Method

221B Lecture Notes WKB Method Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using

More information

CHM Physical Chemistry I Chapter 1 - Supplementary Material

CHM Physical Chemistry I Chapter 1 - Supplementary Material CHM 3410 - Physicl Chemistry I Chpter 1 - Supplementry Mteril For review of some bsic concepts in mth, see Atkins "Mthemticl Bckground 1 (pp 59-6), nd "Mthemticl Bckground " (pp 109-111). 1. Derivtion

More information

Arithmetic & Algebra. NCTM National Conference, 2017

Arithmetic & Algebra. NCTM National Conference, 2017 NCTM Ntionl Conference, 2017 Arithmetic & Algebr Hether Dlls, UCLA Mthemtics & The Curtis Center Roger Howe, Yle Mthemtics & Texs A & M School of Eduction Relted Common Core Stndrds First instnce of vrible

More information

Numerical integration

Numerical integration 2 Numericl integrtion This is pge i Printer: Opque this 2. Introduction Numericl integrtion is problem tht is prt of mny problems in the economics nd econometrics literture. The orgniztion of this chpter

More information

Module 6 Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Module 6 Value Iteration. CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo Module 6 Vlue Itertion CS 886 Sequentil Decision Mking nd Reinforcement Lerning University of Wterloo Mrkov Decision Process Definition Set of sttes: S Set of ctions (i.e., decisions): A Trnsition model:

More information

Frobenius numbers of generalized Fibonacci semigroups

Frobenius numbers of generalized Fibonacci semigroups Frobenius numbers of generlized Fiboncci semigroups Gretchen L. Mtthews 1 Deprtment of Mthemticl Sciences, Clemson University, Clemson, SC 29634-0975, USA gmtthe@clemson.edu Received:, Accepted:, Published:

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

Recitation 3: Applications of the Derivative. 1 Higher-Order Derivatives and their Applications

Recitation 3: Applications of the Derivative. 1 Higher-Order Derivatives and their Applications Mth 1c TA: Pdric Brtlett Recittion 3: Applictions of the Derivtive Week 3 Cltech 013 1 Higher-Order Derivtives nd their Applictions Another thing we could wnt to do with the derivtive, motivted by wht

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17 EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,

More information

Extended nonlocal games from quantum-classical games

Extended nonlocal games from quantum-classical games Extended nonlocl gmes from quntum-clssicl gmes Theory Seminr incent Russo niversity of Wterloo October 17, 2016 Outline Extended nonlocl gmes nd quntum-clssicl gmes Entngled vlues nd the dimension of entnglement

More information

Tutorial 4. b a. h(f) = a b a ln 1. b a dx = ln(b a) nats = log(b a) bits. = ln λ + 1 nats. = log e λ bits. = ln 1 2 ln λ + 1. nats. = ln 2e. bits.

Tutorial 4. b a. h(f) = a b a ln 1. b a dx = ln(b a) nats = log(b a) bits. = ln λ + 1 nats. = log e λ bits. = ln 1 2 ln λ + 1. nats. = ln 2e. bits. Tutoril 4 Exercises on Differentil Entropy. Evlute the differentil entropy h(x) f ln f for the following: () The uniform distribution, f(x) b. (b) The exponentil density, f(x) λe λx, x 0. (c) The Lplce

More information

N 0 completions on partial matrices

N 0 completions on partial matrices N 0 completions on prtil mtrices C. Jordán C. Mendes Arújo Jun R. Torregros Instituto de Mtemátic Multidisciplinr / Centro de Mtemátic Universidd Politécnic de Vlenci / Universidde do Minho Cmino de Ver

More information

1.4 Nonregular Languages

1.4 Nonregular Languages 74 1.4 Nonregulr Lnguges The number of forml lnguges over ny lphbet (= decision/recognition problems) is uncountble On the other hnd, the number of regulr expressions (= strings) is countble Hence, ll

More information

Fig. 1. Open-Loop and Closed-Loop Systems with Plant Variations

Fig. 1. Open-Loop and Closed-Loop Systems with Plant Variations ME 3600 Control ystems Chrcteristics of Open-Loop nd Closed-Loop ystems Importnt Control ystem Chrcteristics o ensitivity of system response to prmetric vritions cn be reduced o rnsient nd stedy-stte responses

More information

Coalgebra, Lecture 15: Equations for Deterministic Automata

Coalgebra, Lecture 15: Equations for Deterministic Automata Colger, Lecture 15: Equtions for Deterministic Automt Julin Slmnc (nd Jurrin Rot) Decemer 19, 2016 In this lecture, we will study the concept of equtions for deterministic utomt. The notes re self contined

More information

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95

An approximation to the arithmetic-geometric mean. G.J.O. Jameson, Math. Gazette 98 (2014), 85 95 An pproximtion to the rithmetic-geometric men G.J.O. Jmeson, Mth. Gzette 98 (4), 85 95 Given positive numbers > b, consider the itertion given by =, b = b nd n+ = ( n + b n ), b n+ = ( n b n ) /. At ech

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

DIRECT CURRENT CIRCUITS

DIRECT CURRENT CIRCUITS DRECT CURRENT CUTS ELECTRC POWER Consider the circuit shown in the Figure where bttery is connected to resistor R. A positive chrge dq will gin potentil energy s it moves from point to point b through

More information

CS 188: Artificial Intelligence

CS 188: Artificial Intelligence CS 188: Artificil Intelligence Lecture 19: Decision Digrms Pieter Abbeel --- C Berkeley Mny slides over this course dpted from Dn Klein, Sturt Russell, Andrew Moore Decision Networks ME: choose the ction

More information

Lecture 3 ( ) (translated and slightly adapted from lecture notes by Martin Klazar)

Lecture 3 ( ) (translated and slightly adapted from lecture notes by Martin Klazar) Lecture 3 (5.3.2018) (trnslted nd slightly dpted from lecture notes by Mrtin Klzr) Riemnn integrl Now we define precisely the concept of the re, in prticulr, the re of figure U(, b, f) under the grph of

More information

More on automata. Michael George. March 24 April 7, 2014

More on automata. Michael George. March 24 April 7, 2014 More on utomt Michel George Mrch 24 April 7, 2014 1 Automt constructions Now tht we hve forml model of mchine, it is useful to mke some generl constructions. 1.1 DFA Union / Product construction Suppose

More information

The practical version

The practical version Roerto s Notes on Integrl Clculus Chpter 4: Definite integrls nd the FTC Section 7 The Fundmentl Theorem of Clculus: The prcticl version Wht you need to know lredy: The theoreticl version of the FTC. Wht

More information

1 Structural induction, finite automata, regular expressions

1 Structural induction, finite automata, regular expressions Discrete Structures Prelim 2 smple uestions s CS2800 Questions selected for spring 2017 1 Structurl induction, finite utomt, regulr expressions 1. We define set S of functions from Z to Z inductively s

More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

I1 = I2 I1 = I2 + I3 I1 + I2 = I3 + I4 I 3

I1 = I2 I1 = I2 + I3 I1 + I2 = I3 + I4 I 3 2 The Prllel Circuit Electric Circuits: Figure 2- elow show ttery nd multiple resistors rrnged in prllel. Ech resistor receives portion of the current from the ttery sed on its resistnce. The split is

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information