Continuous Learning Method for a Continuous Dynamical Control in a Partially Observable Universe
Frédéric Dambreville
Délégation Générale pour l'Armement, DGA/CTA/DT/GIP
16 Bis, Avenue Prieur de la Côte d'Or, Arcueil, F-94114, France

Abstract - In this paper, we are interested in the optimal dynamical control of sensors based on partial and noisy observations. Such problems are related to the POMDP family. Here, however, we manipulate continuous-valued controls and continuous-valued decisions. While the dynamic programming method relies on a discretization of the problem, we deal here directly with the continuous data. Moreover, our purpose is to address the full range of past observations. Our approach is to model the POMDP strategies by means of Dynamic Bayesian Networks. A method based on the Cross-Entropy is implemented for optimizing the parameters of such a DBN relative to the POMDP problem. In this particular work, the Dynamic Bayesian Networks are built from semi-continuous probabilistic laws, so as to ensure the manipulation of continuous data.

Keywords: Optimization, Dynamical control, Cross-Entropy method, Resource allocation, Tracking

1 Introduction

When planning the surveillance of an area, the optimization of the sensor allocations presents different degrees of difficulty. There are well-known and efficient models for such planning when the observation is not involved in the optimization process [1, 2, 3, 4, 5]. The planning is much more difficult when it becomes dynamical and involves partial observation. In order to better understand the difficulties, let us consider a very simple example. Assume that we have to catch a target moving within a field. This target is not entirely predictable and moves according to a random model. Now, assume there is a hill in the center of the field. It is hypothesized that moving in the field is easy, while climbing the hill is difficult.
On the other hand, observing the target from the field is difficult, while one has a full observation of the situation from the top of the hill. Then, what should our strategy be? Do we conduct our investigation in the field only: we move fast, but with poor observations. Or do we first climb the hill, in order to have a better knowledge of the target: we lose time climbing, but then we have a good perception of the target. And how will we use such knowledge efficiently? How do we evaluate the information gained from the hill? Such choices are the fundamental difficulties of planning with partial observation. Mathematically, the problem is particularly complex. A quite classical model for such problems is the theory of Partially Observable Markov Decision Processes. Some hypotheses are made here. First, the law of evolution of the universe (e.g. the target move) is Markovian. Secondly, the criterion to be optimized is sufficiently simple: it is additive over time. It is well known [10, 11] that the problem is then solved by a dynamic programming approach. But the solution is in many cases tedious or even practically intractable. Such problems are then forced to be simple. Moreover, this approach requires a discretization of the problem. Another approach is to approximate the strategy by reinforcement learning [12]. Although this method requires bounding the range of past observations, there is interesting progress by way of a hierarchical approach. In previous works [6] we proposed a new approach which shares some common points with the reinforcement learning method. The purpose is to describe the possible control policies by a wide family of probabilistic laws (typically a family of Dynamic Bayesian Networks), and then to learn an optimal law within this family by a simulation-based optimization algorithm.
As described in the previous papers, this approach makes very few hypotheses about the problem: no Markovian hypothesis, no additivity hypothesis, and no restriction on the range of past observations. But of course there are limitations: the limitations of our policy models. Since the complexity of these models is necessarily bounded, the optimal answer to the control problem is restricted. It has been shown, however, that complex policy models are not necessary to reach a good policy. Moreover, it is possible to implement hierarchical models [7], which allow more complex answers for the control. In this paper, we will apply such a technique to a problem of detection-investigation. While many POMDP
problems manipulate discretized quantities, we apply our method directly to continuous parameters. Here, the optimized policy will be a probabilistic law which takes a continuous observation (typically, the noisy positions of the target) and produces a continuous decision (typically, the direction and speed of two patrols). More precisely, the optimal strategic tree will be approximated by means of semi-continuous Hidden Markov Models. We will describe the setting of these laws and how to relearn them during the simulation process. The next section introduces some formalism and gives a quick description of control problems with partial observation. Our planning method is then introduced. It is based on the direct approximation of the optimal decision tree by means of an approximating structure; typically, a Hidden Markov Model is used as such a structure. A particular structure of HMM for manipulating continuous data is introduced; it has been implemented. The third section explains how to optimize the parameters of this HMM in order to approximate the optimal decision tree for planning with partial observation. In particular, the cross-entropy method is described and applied. The fourth section explains the simulated application on which our model is applied, and presents some results.

2 Decision in a partially observable universe

This section is dedicated to the theoretical description of control with partial observation. A practical example of experimentation, a simulation, is detailed in section 4. One should keep in mind that we intend here to solve a control problem manipulating continuous parameters (decisions and observations). Now let us introduce the formal problem. It is assumed that a subject is acting in a given world with a given purpose or mission. The goal is to optimize the accomplishment of this mission. The subject will receive observations from the world and will produce actions on it.

The world.
The world is characterized by a hidden state x, which evolves with time. As an assumption, the time t is discretized from step 1 to the maximal step T. The temporal evolution of the hidden state is denoted by the vector x = x_1, ..., x_t, ..., x_T. During the evolution of the world, the subject will make some decisions d which will impact the evolution of the world. He is also able to make some partial and noisy observations, denoted y. The world is thus characterized by a law of evolution involving the decisions, the hidden state and the observations. It is hypothesized that this law, denoted P, is probabilistic: the hidden state x_t and observation y_t are obtained from the conditional law P(x_t, y_t | x_{1:t-1}, y_{1:t-1}, d_{1:t-1}), which depends on the past states, observations and decisions.

[Figure 1: The world. The hidden state x produces the observations y_t and consumes the decisions d_t.]

Moreover, it is assumed that d_t is generated by the subject after the observation y_t. There is no Markovian hypothesis about the law P. But it is assumed that the laws P(x_t, y_t | x_{1:t-1}, y_{1:t-1}, d_{1:t-1}) can be simulated very quickly. Notice that this lack of assumption makes a method based on dynamic programming impossible to use. The law of x, y given d is represented graphically by figure 1. In this description, output arrows are for the values produced by the world, i.e. in this case the observations. The input arrows are for the values consumed by the world, i.e. the decisions of the subject. The variables are placed chronologically: y_t appears before d_t because the decision d_t occurs after the observation y_t. In this paper, the observation y_t and the decision d_t are continuous values. From now on, we will use the notation P(x, y | d) for the full law of the world:

P(x, y | d) = \prod_{t=1}^{T} P(x_t, y_t | x_{1:t-1}, y_{1:t-1}, d_{1:t-1}).

Evaluation and optimal planning.

The previous paragraphs have built a modelling of the world, of the actions and of the observations.
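The factorized law above can be read as a sequential simulator: at each step the world draws (x_t, y_t) conditionally on the whole past, and only then does the subject emit d_t. The following minimal sketch illustrates this interface with toy one-dimensional dynamics; the history-dependent drift is purely illustrative (it shows that no Markovian hypothesis is required), and none of it corresponds to the experiment of section 4.

```python
import random

random.seed(0)

def sample_step(x_hist, y_hist, d_hist):
    """One draw from P(x_t, y_t | x_1:t-1, y_1:t-1, d_1:t-1).
    The law may depend on the whole past (no Markovian hypothesis):
    here the drift depends on the mean of all past states.  Toy 1-D
    dynamics, assumed for illustration only."""
    drift = sum(x_hist) / len(x_hist) if x_hist else 0.0
    d_prev = d_hist[-1] if d_hist else 0.0
    x_prev = x_hist[-1] if x_hist else 0.0
    x = x_prev + 0.2 * drift + d_prev + random.uniform(-1.0, 1.0)
    y = x + random.gauss(0.0, 0.5)      # partial, noisy observation of x
    return x, y

def simulate_world(policy, T):
    """Unroll the full law P(x, y | d); note that d_t is drawn after y_t."""
    xs, ys, ds = [], [], []
    for t in range(T):
        x, y = sample_step(xs, ys, ds)
        xs.append(x)
        ys.append(y)
        ds.append(policy(ys))           # decision from the whole past y_1:t
    return xs, ys, ds

xs, ys, ds = simulate_world(lambda obs: -0.3 * obs[-1], T=25)
```

The ordering inside the loop mirrors the chronology of figure 1: observation first, decision second.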
We now give a characterization of the mission to be accomplished. The mission is limited in time; let T be this maximum time. In full generality, the mission is evaluated by a function V(d, y, x) defined on the trajectories of d, y, x. Typically, the function V could be used for computing the time needed for the mission accomplishment. The purpose is to construct an optimal decision tree d(obs), depending on the observations obs, in order to maximize the mean evaluation. This is a dynamic optimization problem, since the actions depend on the previous observations. The related program consists in the optimization of (d_t(y_{1:t}))_{t=1}^{T} as follows:

d^* \in \arg\max_{(d_t(y_{1:t}))_{t=1}^{T}} \int_y \int_x P(x, y | (d_t(y_{1:t}))_{t=1}^{T}) \, V((d_t(y_{1:t}))_{t=1}^{T}, y, x) \, dx \, dy.   (1)

This optimization is schematized in figure 2. It is related to the family of Partially Observable Markov Decision Problems, although there is no Markovian hypothesis about the world here. In the figure, the double arrows characterize the variables to be optimized. More precisely, these arrows describe the flow of information between the observations and the actions. The cells are making decisions and transmitting
all the received and generated information (including the actions). This architecture illustrates that planning with observation is an indefinite-memory problem: the decision depends on the whole past of observations.

[Figure 2: POMDP planning. The double arrows carry the flow of information from the observations y_t to the decisions d_t.]

When the evaluation function V is additive, it is known that there is a finite-memory construction of the solution, by means of the dynamic programming paradigm. Notice, however, that this finite memory is a probabilistic posterior of the world, resulting from the past observations. In practice, it is too huge to be manipulated properly. An alternative method to Dynamic Programming is proposed subsequently. It relies on the optimization of a probabilistic template for the control policy.

Direct approximation of the decision tree.

In an optimization problem like (1), the value to be optimized, d, is a deterministic object. In this precise case, d is a decision tree, that is, a function which maps any sequence of observations y_{1:t} to a decision d_t. It is possible, however, to adopt a probabilistic viewpoint. Then the problem is equivalent to finding π(d | y), a probabilistic law of actions conditional on the past observations, which maximizes the mean evaluation:

V(π) = \int_d \int_y \int_x \prod_{t=1}^{T} π(d_t | d_{1:t-1}, y_{1:t}) \, P(x, y | d) \, V(d, y, x) \, dx \, dy \, dd.

Notice that this problem could still be schematized by figure 2, but the double arrows now describe the DBN (Dynamic Bayesian Network) structure of the law π. Since the memories are still indefinite, the schematized law is quite general. Actually, there will not be a great difference with the deterministic case for an optimal solution: when the solution d^* is unique, the optimal law π^* is a Dirac distribution on d^*. But things change when approximating π. Now, why use a probability to approximate the optimal strategy? The main point is that probabilistic models seem more suitable for approximation.
The second point is that we are sure to approximate continuously: indeed, π ↦ V(π) is continuous. There is a third point. When replacing in figure 2 the indefinite memories by finite memories, denoted m as in figure 3, a natural approximation of the law π is obtained. The approximated law is a Hidden Markov Model. As will be seen, HMMs are very practical for an optimization.

[Figure 3: Finite-memory planning approximation. The information flows from the observations y_t to the decisions d_t through the finite memories m_t.]

[Figure 4: A typical Hidden Markov Model, with memories m_t, observations y_t and decisions d_t.]

Policy approximation by a HMM.

Define for any time t a variable m_t ∈ M, called the memory at time t. Notice that m_t is intended to describe a finite memory. Nevertheless, M is not necessarily a finite set; for example, M could contain continuous or semi-continuous values. In the most general case, an HMM for the decision policy will take the form:

h(d | y) = \int_{m \in M^T} h(d, m | y) \, dm, where h(d, m | y) = \prod_{t=1}^{T} h(d_t | m_t) \, h(m_t | y_t, m_{t-1}).

This general model for an HMM is schematized in figure 4. This formalism is the most general, and it hides many possible HMM settings, more or less intricate. We will not discuss here the detailed structure of the HMM (see next paragraph), but instead the general principle of the approximation of π by such an HMM. The approach developed in [6][7] is quite general and can be split up into two points:

- Define a family of HMMs H, to be used as policy approximations,
- Optimize the parameters of such HMMs in order to maximize the mean evaluation: find h^* ∈ \arg\max_{h \in H} V(h).

In practice, a good choice of H implies that h^* is a good approximation of π^*.

A detailed description of the HMM family.

It is recalled that our purpose is to investigate a continuous control problem. Thus, any HMM h ∈ H should input continuous data (the observation y) and output a continuous decision d.
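Since the laws P and h can only be simulated, the mean evaluation V(h) that this optimization targets is in practice estimated by plain Monte Carlo: roll out the joint law of world and policy many times and average the evaluations. A minimal sketch follows, with a toy one-dimensional world and a memoryless policy standing in for the HMM; all dynamics and names here are illustrative assumptions, not the paper's model.

```python
import random

random.seed(0)

def rollout(policy, T=10):
    """Draw one trajectory (d, y, x) from the joint law of world and
    policy: toy 1-D world, observation = state + noise.  Illustrative."""
    x, d = 0.0, 0.0
    traj = []
    for t in range(T):
        x = x + random.uniform(-1.0, 1.0) + d   # world evolves under the decision
        y = x + random.gauss(0.0, 0.5)          # partial, noisy observation
        d = policy(y)                           # decision after observation
        traj.append((d, y, x))
    return traj

def estimate_V(policy, evaluate, n_rollouts=500):
    """Monte Carlo estimate of the mean evaluation V(h)."""
    return sum(evaluate(rollout(policy)) for _ in range(n_rollouts)) / n_rollouts

# Example: the mission rewards keeping the hidden state near 0;
# the candidate policy pulls back toward 0 from the observation.
V_hat = estimate_V(policy=lambda y: -0.5 * y,
                   evaluate=lambda traj: -max(abs(x) for _, _, x in traj))
```

Any policy family for which such rollouts are cheap can be plugged into `estimate_V`; this is the quantity the optimization of section 3 seeks to maximize over the family H.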
The choice here is to manipulate both a discrete memory and a continuous memory.
Let m^D denote the discrete memory of h, and assume m^D_t takes its values within the set {0, ..., 2^L - 1}. Let m^C denote the continuous memory of h, and assume that m^C_t is an IR-valued vector of dimension K. In addition, we define a continuous temporary memory, denoted μ^D, such that μ^D_t is an IR-valued vector of dimension L. The idea is to derive the continuous data m^C_t and μ^D_t from the previous memories m^C_{t-1}, m^D_{t-1} and the observation y_t by means of Gaussian laws; these laws will be optimized. The discrete memory m^D_t is obtained by discretizing the temporary memory μ^D_t; this process is fixed and cannot be optimized. The decision d_t is obtained from the memories m^C_t, m^D_t by means of a Gaussian law; this law will be optimized. Let N(Σ, μ) denote a multivariate Gaussian vector with variance matrix Σ and mean vector μ. The whole process can be detailed as follows:

- m^C_t = N(Σ^C[m^D_{t-1}], A^C[m^D_{t-1}] (1, m^C_{t-1}, y_t)), where the matrices Σ^C[m] and A^C[m] have to be optimized for any m ∈ {0, ..., 2^L - 1}. Notice that Σ^C[m] is of dimension K × K while A^C[m] is of dimension K × (1 + K + dim y_t),

- μ^D_t = N(Σ^D[m^D_{t-1}], A^D[m^D_{t-1}] (1, m^C_{t-1}, y_t)), where the matrices Σ^D[m] and A^D[m] have to be optimized for any m ∈ {0, ..., 2^L - 1}. Notice that Σ^D[m] is of dimension L × L while A^D[m] is of dimension L × (1 + K + dim y_t),

- m^D_t is the Boolean vector of dimension L which indicates in which hypercorner of IR^L the vector μ^D_t is placed. More precisely:

  m^D_t = \sum_{k=0}^{L-1} b_{k,t} 2^k, where b_{k,t} = 1 when μ^D_{k,t} ≥ 0 and b_{k,t} = 0 otherwise,

- d_t = N(Σ^dec[m^D_t], A^dec[m^D_t] (1, m^C_t)), where the matrices Σ^dec[m] and A^dec[m] have to be optimized for any m ∈ {0, ..., 2^L - 1}. Notice that Σ^dec[m] is of dimension dim d_t × dim d_t while A^dec[m] is of dimension dim d_t × (1 + K).

Figure 5 gives an illustration of the Markovian transition. The double arrows mean that the parameters have to be optimized. From now on, the set H will refer to these semi-continuous HMMs.
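One transition of this semi-continuous HMM can be sketched directly from the equations above. The dimensions below are illustrative (K = 3, L = 2, with dim y_t = 8 and dim d_t = 4 as in section 4), and the parameters are initialized arbitrarily; in the actual method they are the quantities tuned by the cross-entropy optimization.

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, DIM_Y, DIM_D = 3, 2, 8, 4   # illustrative dimensions

# One pair (Sigma[m], A[m]) per discrete memory value m in {0, ..., 2^L - 1}.
def init_params(rows, cols):
    return {m: (np.eye(rows), rng.normal(scale=0.1, size=(rows, cols)))
            for m in range(2 ** L)}

params_C   = init_params(K, 1 + K + DIM_Y)   # continuous memory m^C_t
params_D   = init_params(L, 1 + K + DIM_Y)   # temporary memory mu^D_t
params_dec = init_params(DIM_D, 1 + K)       # decision d_t

def hmm_step(m_C_prev, m_D_prev, y_t):
    """One transition of the semi-continuous HMM (illustrative sketch)."""
    z = np.concatenate(([1.0], m_C_prev, y_t))    # input vector (1, m^C_{t-1}, y_t)
    S_C, A_C = params_C[m_D_prev]
    m_C = rng.multivariate_normal(A_C @ z, S_C)   # m^C_t ~ N(A^C[m] z, Sigma^C[m])
    S_D, A_D = params_D[m_D_prev]
    mu_D = rng.multivariate_normal(A_D @ z, S_D)  # mu^D_t ~ N(A^D[m] z, Sigma^D[m])
    m_D = sum(1 << k for k in range(L) if mu_D[k] >= 0)   # fixed discretization
    S_d, A_d = params_dec[m_D]
    d = rng.multivariate_normal(A_d @ np.concatenate(([1.0], m_C)), S_d)
    return m_C, m_D, d

m_C, m_D, d = hmm_step(np.zeros(K), 0, np.zeros(DIM_Y))
```

The discretization step (sign pattern of μ^D_t packed into an integer) is the only non-optimized part of the transition, exactly as stated above.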
Why not use purely continuous HMMs? Continuous HMMs, particularly Gaussian ones, are too weak as structures. A semi-continuous scheme is necessary to achieve a sufficient abstraction. It remains now to explain how to optimize the choice of h among the family H. The following section explains a cross-entropic method for optimizing such a choice.

[Figure 5: Semi-continuous HMM transition. The memories m^C_{t-1}, m^D_{t-1} and observation y_t generate m^C_t, μ^D_t and d_t.]

3 Cross-entropic optimization

The reader interested in CE methods should refer to the tutorial on the CE method [8]. CE algorithms were first dedicated to estimating the probability of rare events. A slight change to the basic algorithm made it also suitable for optimization. In their article [13], Homem-de-Mello and Rubinstein have given some results about global convergence. In order to ensure such convergence, some refinements are introduced, particularly concerning the selective rate. This presentation is restricted to the basic CE optimization method. The improvements of the CE algorithm proposed in [13] have not been implemented, but the algorithm has been observed to work properly. For this reason, this paper does not deal with the choice of the selective rate.

3.1 General CE algorithm for the optimization

The Cross-Entropy algorithm repeats until convergence three successive phases:

1. Generate samples of random data according to a parameterized random mechanism,
2. Select the best samples according to an evaluation criterion,
3. Update the parameters of the random mechanism on the basis of the selected samples.

In the particular case of CE, the update in phase 3 is obtained by minimizing the Kullback-Leibler distance, or cross-entropy, between the updated random mechanism and the selected samples. The next paragraphs describe, on a theoretical example, how such a method can be used in an optimization problem.

Formalism. Let be given an easily computable function x ↦ f(x).
The value f(x) has to be maximized by optimizing the choice of x ∈ X. The function f will be the evaluation criterion. Now let be given a family of probabilistic laws, {P_σ : σ ∈ Σ}, applying to the variable x. The family P is the parameterized random mechanism; the variable x is the random data. Let ρ ∈ ]0, 1[ be a selective rate. The CE algorithm for (x, f, P) follows the synopsis:
1. Initialize σ ∈ Σ,
2. Generate N samples x_n according to P_σ,
3. Select the ρN best samples according to the evaluation criterion f,
4. Update σ as a minimizer of the cross-entropy with the selected samples:

   σ ∈ \arg\max_{σ ∈ Σ} \sum_{n selected} \ln P_σ(x_n),

5. Repeat from step 2 until convergence.

This algorithm requires f to be easily computable.

Interpretation. The CE algorithm tightens the law P_σ around the maximizer of f. Then, when the probabilistic family P is well suited to the maximization of f, it becomes equivalent to find a maximizer of f or to optimize the parameter σ by means of the CE algorithm. The problem is to find a good family... Another issue is the criterion for deciding convergence. Some answers are given in [13], but it is outside the scope of this paper to investigate these questions precisely. Our criterion was to stop after a given threshold of successive unsuccessful tries, and this very simple method has worked fine on our problem.

3.2 Application

The cross-entropy method, together with the probabilistic modelling of the policy, is now applied in order to approximate the optimal strategy for planning with partial observation. Our objective is to tune the semi-continuous HMM h ∈ H in order to have the best approximation of the optimal planning strategy π^*:

π^* ≃ h^* ∈ \arg\max_{h ∈ H} V(h).

Define P[h], the complete probabilistic law of the system Universe/Planner, by:

P[h](d, y, x, m) = P(x, y | d) h(d, m | y).

Notice here that the memory is composite, i.e. m = (m^D, m^C, μ^D), with one discrete and two continuous components. The approximated planning reduces to solving:

h^* ∈ \arg\max_{h ∈ H} \int_d \int_y \int_x \int_m P[h](d, y, x, m) \, V(d, y, x) \, dx \, dy \, dd \, dm.

Optimizing h means tuning the parameter h ∈ H in order to tighten the probability P[h] around optimal values of V. This is exactly solved by the Cross-Entropy optimization method. However, it is required that the evaluation function V be easily computable. Typically, the definition of V may be recursive, e.g.:
V(d, y, x) = v_T(d_T, y_T, x_T, v_{T-1}(d_{T-1}, y_{T-1}, x_{T-1}, \ldots v_1(d_1, y_1, x_1) \ldots)),

i.e. a nested composition of elementary evaluations v_t. Let the selective rate ρ be a positive number such that ρ < 1. The cross-entropy optimization method follows the synopsis:

1. Initialize h, for example a flat h,
2. Make N tossings θ_n = (d_n, y_n, x_n, m_n) according to the law P[h],
3. Choose the ρN best samples θ_n according to the evaluation V(d, y, x). Denote S the set of the selected samples,
4. Update h as the minimizer of the cross-entropy with the selected samples:

   h ∈ \arg\max_{h ∈ H} \sum_{n ∈ S} \ln P[h](θ_n),   (2)

5. Reiterate from step 2 until convergence.

In this case, the maximization (2) is not difficult. In particular, the Markovian property is widely used: ln P[h] decomposes into a sum, and the optimization subsequently splits into several elementary independent problems. Finally, this maximization (2) is solved as follows (v' denotes the transpose of a vector v):

For the continuous memory:

A^C[m] = \Big[ \sum_{n ∈ S} \sum_{t : m^D_{n,t-1} = m} m^C_{n,t} (1, m^C_{n,t-1}, y_{n,t})' \Big] \Big[ \sum_{n ∈ S} \sum_{t : m^D_{n,t-1} = m} (1, m^C_{n,t-1}, y_{n,t}) (1, m^C_{n,t-1}, y_{n,t})' \Big]^{-1}

and

Σ^C[m] = \frac{\sum_{n ∈ S} \sum_{t : m^D_{n,t-1} = m} Γ_{n,t} Γ_{n,t}'}{\mathrm{card}\{(n, t) : n ∈ S, m^D_{n,t-1} = m\}}, where Γ_{n,t} = m^C_{n,t} - A^C[m] (1, m^C_{n,t-1}, y_{n,t}).

For the temporary memory:

A^D[m] = \Big[ \sum_{n ∈ S} \sum_{t : m^D_{n,t-1} = m} μ^D_{n,t} (1, m^C_{n,t-1}, y_{n,t})' \Big] \Big[ \sum_{n ∈ S} \sum_{t : m^D_{n,t-1} = m} (1, m^C_{n,t-1}, y_{n,t}) (1, m^C_{n,t-1}, y_{n,t})' \Big]^{-1}

and

Σ^D[m] = \frac{\sum_{n ∈ S} \sum_{t : m^D_{n,t-1} = m} Γ_{n,t} Γ_{n,t}'}{\mathrm{card}\{(n, t) : n ∈ S, m^D_{n,t-1} = m\}}, where Γ_{n,t} = μ^D_{n,t} - A^D[m] (1, m^C_{n,t-1}, y_{n,t}).
For the decision:

A^dec[m] = \Big[ \sum_{n ∈ S} \sum_{t : m^D_{n,t} = m} d_{n,t} (1, m^C_{n,t})' \Big] \Big[ \sum_{n ∈ S} \sum_{t : m^D_{n,t} = m} (1, m^C_{n,t}) (1, m^C_{n,t})' \Big]^{-1}

and

Σ^dec[m] = \frac{\sum_{n ∈ S} \sum_{t : m^D_{n,t} = m} Γ_{n,t} Γ_{n,t}'}{\mathrm{card}\{(n, t) : n ∈ S, m^D_{n,t} = m\}}, where Γ_{n,t} = d_{n,t} - A^dec[m] (1, m^C_{n,t}).

4 Implementation

The algorithm has been applied to a target detection and interception problem.

4.1 The experiment.

A target R moves in the continuous space [-20, 20] × [-20, 20]. It is initially located in the area [0, 20] × [-20, 20], with a known distribution (actually a uniform distribution). R is tracked by two mobiles, B¹ and B², controlled by the subject. B¹ and B² are initially located at the coordinates (-20, 0). The mobiles receive the relative position of each other, with additive noise (Gaussian noise with variance 1). Each mobile receives only one piece of information about the target: it knows the direction of the target spot but has no information about its distance. Moreover, this direction information is noisy. The noise varies with the (Euclidean) distance d between the patrol and the target: in this simulation, the angular noise is a uniform random variable on the set [-(d/(d+1))(π/2), (d/(d+1))(π/2)]. Each mobile will receive this information as a spot drawn according to the noisy distribution (in particular, there is a very big variance on the distance). Thus, the dimension of the continuous information y_t is 8, since y_t contains two spot positions and two mobile relative positions. B¹ and B² are able to move according to a directive of the subject. The directive is a direction and a move intensity (a speed) for each mobile. The mobile maximal speed is 2, starting from 0; but the mobile cannot escape from the space. Thus, the dimension of the continuous decision d_t is 4 (moves will be truncated). Moreover, the patrol moves are noised additively (Gaussian noise with variance 1).
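The bearing-only sensor described above can be sketched as follows. The distance-dependent angular noise matches the text; the spread used for the uninformative range component is an assumption (the paper only states that the variance on the distance is very big), so the value 40.0 below is illustrative.

```python
import math
import random

random.seed(0)

def observe_target(patrol, target):
    """Bearing-only observation with distance-dependent angular noise:
    the angular error is uniform on [-(d/(d+1))*pi/2, (d/(d+1))*pi/2],
    and the reported range is essentially uninformative (drawn with a
    large, assumed spread), so the mobile receives a 'spot' that is only
    reliable in direction."""
    dx, dy = target[0] - patrol[0], target[1] - patrol[1]
    d = math.hypot(dx, dy)
    half_width = d / (d + 1.0) * math.pi / 2.0
    angle = math.atan2(dy, dx) + random.uniform(-half_width, half_width)
    r = random.uniform(0.0, 40.0)   # assumed spread for the unknown distance
    return (patrol[0] + r * math.cos(angle),
            patrol[1] + r * math.sin(angle))

spot = observe_target((-20.0, 0.0), (10.0, 5.0))
```

Note that the noise half-width d/(d+1) · π/2 tends to π/2 at long range, so a distant target is observed almost uniformly over the forward half-plane, while a nearby target is observed with an almost exact bearing.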
The target moves according to the following directives (unless other test directives are given):

- It cannot escape from the space, unless it reaches the escape line {-20} × [-20, 20],
- The target speed is characterized by its relative move from step t to step t+1. This relative move is chosen as a uniform random variable on the set [-4, 0] × [-2, 2]. The move is truncated if a constraint is reached.

As a consequence, the target moves toward the escape line. It moves twice as fast as the patrols. The purpose of the mission is to get as close as possible to the target (at least once and by means of at least one patrol) before it escapes. More precisely, the evaluation V of a sample is given by:

V = \max_{t before escape} \max\{ 1 / d(B¹_t, R_t)², 1 / d(B²_t, R_t)² \}.

Thus, we are just optimizing the expected maximal inverted (squared) distance, which results in strategies with close target contacts. The total number of turns is T = 100.

Results.

Owing to the conference deadline schedule, our tests have been limited. More results should be available later. In the tests described subsequently, the processes were run for one hour on a 2 GHz PC (almost all the processor time was used). This was sufficient to reach a good convergence, since most of the gains are obtained at the beginning of the process (convergence is almost complete after about ten minutes). In this version of the paper, we are interested in 3 different tests (test 1 is the simplest).

Test 1. In this case, the target does not move and is initially located at position (20, 0). After optimization of the strategy, the obtained mean reward is 1982, which means that a patrol typically contacts the target at a distance of about 1/√1982 ≈ 0.02.

Test 2. Again the target does not move, but it is initially located randomly on the space [0, 20] × [-20, 20] with a uniform distribution. After optimization of the strategy, the obtained mean reward is 17, which means that a patrol typically contacts the target at a distance of about 1/√17 ≈ 0.24.

Test 3.
In this case, the full location and moving hypotheses are made about the target. Notice that the period of possible contact is quite reduced (because of the escape) in comparison with the previous tests. After optimization of the strategy, the obtained mean reward is 16, which means that a patrol typically contacts the target at a distance of about 1/√16 = 0.25. It appears, fortunately, that our optimized policies are capable of good contact with the target. Such results are promising, but more tests should be done, and comparisons with other methods (for example a Q-learning approach on a discretized problem) are needed. More
intricate examples should be investigated too. These tests are planned for the near future.

5 Conclusion

In this paper, we proposed a method for approximating the optimal planning in a partially observable control problem. The planning involves an optimization of continuous decisions with regard to a sequence of continuous past observations. The method relies on a modelling of the policies by means of a semi-continuous probabilistic law family. The cross-entropy method is applied to find the optimal law. This method has been implemented for solving a problem of detection-investigation, where two mobiles have to catch a target while receiving a radial observation of this target. The tests on this scenario are promising. More tests are being done.

References

[1] S.S. Brown, Optimal Search for a Moving Target in Discrete Time and Space. Operations Research 28.
[2] J. de Guenin, Optimum Distribution of Effort: an Extension of the Koopman Basic Theory. Operations Research 9, pp. 1-7.
[3] B.O. Koopman, Search and Screening: General Principle with Historical Applications. MORS Heritage Series, Alexandria, VA.
[4] L.D. Stone, Theory of Optimal Search, 2nd ed. Operations Research Society of America, Arlington, VA.
[5] A.R. Washburn, Search for a Moving Target: The FAB Algorithm. Operations Research 31.
[6] F. Dambreville, Learning a Machine for the Decision in a Partially Observable Markov Universe. ISDA 2004, Budapest, Hungary, August 26-28.
[7] F. Dambreville, Learning a Machine for the Decision in a Partially Observable Markov Universe. Submitted to European Journal of Operational Research.
[8] De Boer, Kroese, Mannor and Rubinstein, A Tutorial on the Cross-Entropy Method.
[9] R. Bellman, Dynamic Programming. Princeton University Press, Princeton, New Jersey.
[10] E.J. Sondik, The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, Stanford, California, 1971.
[11] A.R. Cassandra, Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Brown University, Providence, Rhode Island.
[12] B. Bakker and J. Schmidhuber, Hierarchical Reinforcement Learning Based on Subgoal Discovery and Subpolicy Specialization. Proceedings of the 8th Conference on Intelligent Autonomous Systems, Amsterdam, The Netherlands.
[13] T. Homem-de-Mello and R.Y. Rubinstein, Rare Event Estimation for Static Models via Cross-Entropy and Importance Sampling.
[14] K. Murphy and M. Paskin, Linear Time Inference in Hierarchical HMMs. Proceedings of Neural Information Processing Systems.
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationIntroduction to Artificial Intelligence (AI)
Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 10 Oct, 13, 2011 CPSC 502, Lecture 10 Slide 1 Today Oct 13 Inference in HMMs More on Robot Localization CPSC 502, Lecture
More informationOptimal Control of Partiality Observable Markov. Processes over a Finite Horizon
Optimal Control of Partiality Observable Markov Processes over a Finite Horizon Report by Jalal Arabneydi 04/11/2012 Taken from Control of Partiality Observable Markov Processes over a finite Horizon by
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationConstrained Minimax Optimization of Continuous Search Efforts for the Detection of a Stationary Target
Constrained Mini Optimization of Continuous Search Efforts for the Detection of a Stationary Target Frédéric Dambreville, 1 Jean-Pierre Le Cadre 2 1 Délégation Générale pour l Armement, 16 Bis, Avenue
More informationCS599 Lecture 1 Introduction To RL
CS599 Lecture 1 Introduction To RL Reinforcement Learning Introduction Learning from rewards Policies Value Functions Rewards Models of the Environment Exploitation vs. Exploration Dynamic Programming
More informationSequential Decision Problems
Sequential Decision Problems Michael A. Goodrich November 10, 2006 If I make changes to these notes after they are posted and if these changes are important (beyond cosmetic), the changes will highlighted
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationTemplate-Based Representations. Sargur Srihari
Template-Based Representations Sargur srihari@cedar.buffalo.edu 1 Topics Variable-based vs Template-based Temporal Models Basic Assumptions Dynamic Bayesian Networks Hidden Markov Models Linear Dynamical
More informationSYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I
SYDE 372 Introduction to Pattern Recognition Probability Measures for Classification: Part I Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 Why use probability
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 12: Gaussian Belief Propagation, State Space Models and Kalman Filters Guest Kalman Filter Lecture by
More informationMARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti
1 MARKOV DECISION PROCESSES (MDP) AND REINFORCEMENT LEARNING (RL) Versione originale delle slide fornita dal Prof. Francesco Lo Presti Historical background 2 Original motivation: animal learning Early
More informationDETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION. Alexandre Iline, Harri Valpola and Erkki Oja
DETECTING PROCESS STATE CHANGES BY NONLINEAR BLIND SOURCE SEPARATION Alexandre Iline, Harri Valpola and Erkki Oja Laboratory of Computer and Information Science Helsinki University of Technology P.O.Box
More informationMassachusetts Institute of Technology
Massachusetts Institute of Technology 6.867 Machine Learning, Fall 2006 Problem Set 5 Due Date: Thursday, Nov 30, 12:00 noon You may submit your solutions in class or in the box. 1. Wilhelm and Klaus are
More informationOptimal Control. McGill COMP 765 Oct 3 rd, 2017
Optimal Control McGill COMP 765 Oct 3 rd, 2017 Classical Control Quiz Question 1: Can a PID controller be used to balance an inverted pendulum: A) That starts upright? B) That must be swung-up (perhaps
More informationTemporal-Difference Q-learning in Active Fault Diagnosis
Temporal-Difference Q-learning in Active Fault Diagnosis Jan Škach 1 Ivo Punčochář 1 Frank L. Lewis 2 1 Identification and Decision Making Research Group (IDM) European Centre of Excellence - NTIS University
More informationProbabilistic Robotics
University of Rome La Sapienza Master in Artificial Intelligence and Robotics Probabilistic Robotics Prof. Giorgio Grisetti Course web site: http://www.dis.uniroma1.it/~grisetti/teaching/probabilistic_ro
More informationPATTERN CLASSIFICATION
PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS
More informationNovel spectrum sensing schemes for Cognitive Radio Networks
Novel spectrum sensing schemes for Cognitive Radio Networks Cantabria University Santander, May, 2015 Supélec, SCEE Rennes, France 1 The Advanced Signal Processing Group http://gtas.unican.es The Advanced
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationKalman filtering and friends: Inference in time series models. Herke van Hoof slides mostly by Michael Rubinstein
Kalman filtering and friends: Inference in time series models Herke van Hoof slides mostly by Michael Rubinstein Problem overview Goal Estimate most probable state at time k using measurement up to time
More informationAccuracy and Decision Time for Decentralized Implementations of the Sequential Probability Ratio Test
21 American Control Conference Marriott Waterfront, Baltimore, MD, USA June 3-July 2, 21 ThA1.3 Accuracy Decision Time for Decentralized Implementations of the Sequential Probability Ratio Test Sra Hala
More information2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030
2-Step Temporal Bayesian Networks (2TBN): Filtering, Smoothing, and Beyond Technical Report: TRCIM1030 Anqi Xu anqixu(at)cim(dot)mcgill(dot)ca School of Computer Science, McGill University, Montreal, Canada,
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Choosing Hypotheses Generally want the most probable hypothesis given the training data Maximum a posteriori
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationCSEP 573: Artificial Intelligence
CSEP 573: Artificial Intelligence Hidden Markov Models Luke Zettlemoyer Many slides over the course adapted from either Dan Klein, Stuart Russell, Andrew Moore, Ali Farhadi, or Dan Weld 1 Outline Probabilistic
More informationArtificial Intelligence
Artificial Intelligence Roman Barták Department of Theoretical Computer Science and Mathematical Logic Summary of last lecture We know how to do probabilistic reasoning over time transition model P(X t
More informationEfficient Sensitivity Analysis in Hidden Markov Models
Efficient Sensitivity Analysis in Hidden Markov Models Silja Renooij Department of Information and Computing Sciences, Utrecht University P.O. Box 80.089, 3508 TB Utrecht, The Netherlands silja@cs.uu.nl
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationA Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project
More informationBayesian Networks Inference with Probabilistic Graphical Models
4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning
More information10 Robotic Exploration and Information Gathering
NAVARCH/EECS 568, ROB 530 - Winter 2018 10 Robotic Exploration and Information Gathering Maani Ghaffari April 2, 2018 Robotic Information Gathering: Exploration and Monitoring In information gathering
More informationApplication of probabilistic PCR5 Fusion Rule for Multisensor Target Tracking
Application of probabilistic PCR5 Fusion Rule for Multisensor Target Tracking arxiv:0707.3013v1 [stat.ap] 20 Jul 2007 Aloïs Kirchner a, Frédéric Dambreville b, Francis Celeste c Délégation Générale pour
More informationAlgorithm-Independent Learning Issues
Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning
More informationDevelopment of a Deep Recurrent Neural Network Controller for Flight Applications
Development of a Deep Recurrent Neural Network Controller for Flight Applications American Control Conference (ACC) May 26, 2017 Scott A. Nivison Pramod P. Khargonekar Department of Electrical and Computer
More informationCS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning
CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we
More informationBayesian Networks BY: MOHAMAD ALSABBAGH
Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional
More information10-701/ Machine Learning, Fall
0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training
More informationCSCE 478/878 Lecture 6: Bayesian Learning
Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell
More informationCS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs
CS242: Probabilistic Graphical Models Lecture 4B: Learning Tree-Structured and Directed Graphs Professor Erik Sudderth Brown University Computer Science October 6, 2016 Some figures and materials courtesy
More informationHidden Markov Models Part 1: Introduction
Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that
More informationA Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems
A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and
More informationLecture 6: April 19, 2002
EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2002 Lecturer: Jeff Bilmes Lecture 6: April 19, 2002 University of Washington Dept. of Electrical Engineering Scribe: Huaning Niu,Özgür Çetin
More informationSequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes
Sequential Monte Carlo methods for filtering of unobservable components of multidimensional diffusion Markov processes Ellida M. Khazen * 13395 Coppermine Rd. Apartment 410 Herndon VA 20171 USA Abstract
More information16.4 Multiattribute Utility Functions
285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationLearning from Sequential and Time-Series Data
Learning from Sequential and Time-Series Data Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/? Sequential and Time-Series Data Many real-world applications
More informationForecasting Wind Ramps
Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationStatistical learning. Chapter 20, Sections 1 3 1
Statistical learning Chapter 20, Sections 1 3 Chapter 20, Sections 1 3 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationde Blanc, Peter Ontological Crises in Artificial Agents Value Systems. The Singularity Institute, San Francisco, CA, May 19.
MIRI MACHINE INTELLIGENCE RESEARCH INSTITUTE Ontological Crises in Artificial Agents Value Systems Peter de Blanc Machine Intelligence Research Institute Abstract Decision-theoretic agents predict and
More informationThe Cross Entropy Method for the N-Persons Iterated Prisoner s Dilemma
The Cross Entropy Method for the N-Persons Iterated Prisoner s Dilemma Tzai-Der Wang Artificial Intelligence Economic Research Centre, National Chengchi University, Taipei, Taiwan. email: dougwang@nccu.edu.tw
More informationCS532, Winter 2010 Hidden Markov Models
CS532, Winter 2010 Hidden Markov Models Dr. Alan Fern, afern@eecs.oregonstate.edu March 8, 2010 1 Hidden Markov Models The world is dynamic and evolves over time. An intelligent agent in such a world needs
More informationIntroduction to Mobile Robotics Probabilistic Robotics
Introduction to Mobile Robotics Probabilistic Robotics Wolfram Burgard 1 Probabilistic Robotics Key idea: Explicit representation of uncertainty (using the calculus of probability theory) Perception Action
More informationDynamic Approaches: The Hidden Markov Model
Dynamic Approaches: The Hidden Markov Model Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Inference as Message
More informationAn Adaptive Neural Network Scheme for Radar Rainfall Estimation from WSR-88D Observations
2038 JOURNAL OF APPLIED METEOROLOGY An Adaptive Neural Network Scheme for Radar Rainfall Estimation from WSR-88D Observations HONGPING LIU, V.CHANDRASEKAR, AND GANG XU Colorado State University, Fort Collins,
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationLinear Dynamical Systems (Kalman filter)
Linear Dynamical Systems (Kalman filter) (a) Overview of HMMs (b) From HMMs to Linear Dynamical Systems (LDS) 1 Markov Chains with Discrete Random Variables x 1 x 2 x 3 x T Let s assume we have discrete
More informationData Structures for Efficient Inference and Optimization
Data Structures for Efficient Inference and Optimization in Expressive Continuous Domains Scott Sanner Ehsan Abbasnejad Zahra Zamani Karina Valdivia Delgado Leliane Nunes de Barros Cheng Fang Discrete
More informationHuman-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg
Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationMath 350: An exploration of HMMs through doodles.
Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationQ-Learning in Continuous State Action Spaces
Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental
More informationIndependent Component Analysis and Unsupervised Learning
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent
More informationCS 6375 Machine Learning
CS 6375 Machine Learning Nicholas Ruozzi University of Texas at Dallas Slides adapted from David Sontag and Vibhav Gogate Course Info. Instructor: Nicholas Ruozzi Office: ECSS 3.409 Office hours: Tues.
More informationIntroduction: MLE, MAP, Bayesian reasoning (28/8/13)
STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this
More informationDialogue as a Decision Making Process
Dialogue as a Decision Making Process Nicholas Roy Challenges of Autonomy in the Real World Wide range of sensors Noisy sensors World dynamics Adaptability Incomplete information Robustness under uncertainty
More informationON SCALABLE CODING OF HIDDEN MARKOV SOURCES. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose
ON SCALABLE CODING OF HIDDEN MARKOV SOURCES Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose Department of Electrical and Computer Engineering University of California, Santa Barbara, CA, 93106
More informationA Tour of Reinforcement Learning The View from Continuous Control. Benjamin Recht University of California, Berkeley
A Tour of Reinforcement Learning The View from Continuous Control Benjamin Recht University of California, Berkeley trustable, scalable, predictable Control Theory! Reinforcement Learning is the study
More informationMarkov chain optimisation for energy systems (MC-ES)
Markov chain optimisation for energy systems (MC-ES) John Moriarty Queen Mary University of London 19th August 2016 Approaches to the stochastic optimisation of power systems are not mature at research
More informationMarkov localization uses an explicit, discrete representation for the probability of all position in the state space.
Markov Kalman Filter Localization Markov localization localization starting from any unknown position recovers from ambiguous situation. However, to update the probability of all positions within the whole
More informationFinal Exam, Fall 2002
15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work
More informationIntroduction to Bayesian Learning. Machine Learning Fall 2018
Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability
More informationSymbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning
Symbolic Perseus: a Generic POMDP Algorithm with Application to Dynamic Pricing with Demand Learning Pascal Poupart (University of Waterloo) INFORMS 2009 1 Outline Dynamic Pricing as a POMDP Symbolic Perseus
More information