Optimal path planning using Cross-Entropy method


F. Celeste, F. Dambreville
CEP/Dept. of Geomatics Imagery Perception, Arcueil, France
{francis.celeste, frederic.dambreville}@etca.fr

J-P. Le Cadre
IRISA/CNRS, Campus de Beaulieu, 35042 Rennes, France
lecadre@irisa.fr

Abstract - This paper addresses the problem of optimizing the navigation of an intelligent mobile in a real-world environment described by a map. The map is composed of features representing natural landmarks in the environment. The vehicle is equipped with a sensor which provides range and bearing measurements from observed landmarks during the execution of the plan. These measurements are correlated with the map to estimate the mobile's position. The optimal trajectory must be designed so as to control a measure of performance of the filtering algorithm used for the localization task. As the mobile state and the measurements are random, a well-suited measure is a functional of the approximate Posterior Cramér-Rao Bound. A natural way to perform optimal path planning is to use this measure of performance within a (constrained) Markov Decision Process framework. However, due to the characteristics of this functional, Dynamic Programming is generally irrelevant. To face that, we investigate a learning approach based on the Cross-Entropy method.

Keywords: Markov Decision Process, planning, estimation, Posterior Cramér-Rao Bound, Cross-Entropy method.

1 Introduction

In this paper we are concerned with the task of finding a plan for a vehicle moving around in its environment, that is to say, reaching a given goal position from an initial position. In many applications, it is crucial to be able to estimate accurately the state of the mobile during the execution of the plan, so it seems necessary to couple the planning and execution stages. One way to achieve that aim is to propose trajectories along which the performance of the localization algorithm can be guaranteed. This problem has been well studied, and in most approaches the environment is discretized and described as a graph whose nodes correspond to particular areas and whose edges are actions to move from one place to another. Some of these previous contributions address the problem within a Markov Decision Process (MDP) framework. Such approaches also use a graph representation of the mobile's state space and provide a theoretical framework to solve the problem in some cases. In the present paper, we will also use the constrained MDP framework, as in [1]. Our optimality criterion is based on the Posterior Cramér-Rao Bound. However, the nature of the objective function for path planning makes it impossible to perform a complete optimization of the MDP with classical means. Indeed, we will show that the reward at one stage of our MDP depends on the whole history of the trajectory. To solve the problem, the Cross-Entropy method, originally used for rare-event simulation, appeared to be a valuable tool.

The paper is organized in five parts. In the second section we introduce the problem in detail. Section three deals with the Posterior Cramér-Rao bound and its properties; we also derive its formulation for our problem and the criterion for the optimization. In section four, we give a short introduction to the Cross-Entropy method and show how to apply it to our needs. Finally, in section five the results of a first example are discussed.

2 Problem statement

Let $X_k$, $A_k$ and $Z_k$ respectively denote the state of the vehicle, the action and the vector of observations at time $k$. We consider that the state vector $X_k$ is the position of the vehicle along the x-axis and y-axis, so that $X_k = (x_k, y_k)$. The action vector $A_k = (a_x^k, a_y^k)$ is restricted to the set $\{-d, 0, +d\} \times \{-d, 0, +d\}$, with $d \in \mathbb{R}^+$ a given constant. The motion of the state sequence $\{X_k\}$ is governed by a dynamic model and an initial uncertainty:

$$X_0 \sim \pi_0 = \mathcal{N}(\bar{X}_0, P_0), \qquad X_{k+1} = f(X_k, A_k) + w_k, \quad w_k \sim \mathcal{N}(0, Q_k), \qquad (1)$$

where $\{w_k\}$ is a white noise sequence, the symbol $\sim$ means "distributed according to" and $\mathcal{N}$ denotes the normal law.

The known map $\mathcal{M}$ is composed of $N_f$ features $(m_i)_{1 \le i \le N_f}$ with respective positions $(x_i, y_i) \in \mathbb{R}^2$. At time $k$, due to the sensor capabilities, only a few landmarks can be observed, so the observation process depends on the number of visible features. Moreover, we do not consider the data association and non-detection problems for the moment; that is to say, each observation is made from one landmark represented in the map.
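To make the model concrete, here is a minimal simulation sketch of the dynamics (1) in Python. The transition function $f(X, A) = X + A$ and all numerical values are illustrative assumptions; the paper leaves $f$ and the noise levels unspecified at this point.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 1.0                                   # elementary displacement (assumed value)
# action set {-d, 0, +d} x {-d, 0, +d}
ACTIONS = [np.array([ax, ay]) for ax in (-d, 0.0, d) for ay in (-d, 0.0, d)]

def step(X, A, Q):
    """One draw of X_{k+1} = f(X_k, A_k) + w_k with w_k ~ N(0, Q_k),
    taking f(X, A) = X + A (an assumption consistent with the action set)."""
    return X + A + rng.multivariate_normal(np.zeros(2), Q)

P0 = 0.05 * np.eye(2)                     # initial uncertainty of pi_0 (assumed)
Q = 0.05 * np.eye(2)                      # process noise covariance (assumed)
X = rng.multivariate_normal(np.zeros(2), P0)    # X_0 ~ N(X0_bar, P0), X0_bar = 0 here
for A in (ACTIONS[8], ACTIONS[8], ACTIONS[5]):  # up-right, up-right, up
    X = step(X, A, Q)
print(X)                                  # noisy position after three moves
```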

If we denote $I_v(k)$ the indexes of the visible landmarks at time $k$, the measurement vector received at time $k$ is $Z_k = \{z_k(j)\}_{j \in I_v(k)}$, with, for $j \in I_v(k)$:

$$z_k^r(j) = \sqrt{(x_k - x_j)^2 + (y_k - y_j)^2} + \gamma_k^r(j), \qquad z_k^\beta(j) = \arctan\!\left(\frac{y_j - y_k}{x_j - x_k}\right) - \theta_k + \gamma_k^\beta(j),$$

where $\theta_k$ is the global mobile orientation, and $\{\gamma_k^r(j)\}$ and $\{\gamma_k^\beta(j)\}$ are mutually independent white Gaussian noises. Nevertheless, a more complex modeling would have to be considered if we wanted to take into account correlated errors in the map (e.g. Markov Random Fields) or in the observations (e.g. colored noise or biases).

2.3 A Markov Decision Process model with constraints

As in [1], we first formulate the problem within the Markov Decision Process framework. Generally, a Markov Decision Process is defined by its state and action sets, the state transition probabilities and the reward functions. In our case, the state and action spaces are discrete and finite. Indeed, the map is discretized into $N_s = N_x \times N_y$ locations, and one action is defined as a move between two neighboring points (figure 1), where $N_x$ and $N_y$ are the numbers of grid points along both axes. At each point, the mobile can choose one action among $N_a$. So we can model the process with:

- $S = \{1, \ldots, N_s\}$, the state space;
- $A = \{1, \ldots, N_a\}$, the action space;
- $T_{ss'}(a) \triangleq \Pr(s_{k+1} = s' \mid s_k = s, a_k = a)$, the transition function;
- $R_{ss'}(a)$, the reward or cost function associated with the transition from state $s$ to state $s'$ under action $a$.

Finding the optimal path between two states $s_i$ and $s_f$ is equivalent to determining the best sequence of actions or decisions $V_a = (a_0, \ldots, a_K)$ allowing to simultaneously connect them and optimize a (global) criterion. In our application, we consider only 8 possible actions per state (figure 1). Each of them can be selected, except at the points located on the border of the map and in places where obstacles can be found. Moreover, we take into account operational constraints on the mobile dynamics between two consecutive epochs, as in [1]. It is possible to do that by defining an authorized transition matrix $\delta(a_k, a_{k+1})$ which indicates which actions can be chosen at time $k+1$ according to the choice made at time $k$. The actions are numbered clockwise from 1 ("go up and right") to 8 ("go up"). For example, if only $[-\frac{\pi}{2}, \frac{\pi}{2}]$ heading controls are allowed, such a matrix can be expressed as below (row: $a_k$, column: $a_{k+1}$):

$$\delta = \begin{pmatrix} 1&1&1&0&0&0&1&1 \\ 1&1&1&1&0&0&0&1 \\ 1&1&1&1&1&0&0&0 \\ 0&1&1&1&1&1&0&0 \\ 0&0&1&1&1&1&1&0 \\ 0&0&0&1&1&1&1&1 \\ 1&0&0&0&1&1&1&1 \\ 1&1&0&0&0&1&1&1 \end{pmatrix}$$

The above $\delta$ matrix indicates that if action 1 is chosen at time $k$, only actions 1, 2, 3, 7 and 8 can be selected at time $k+1$. Moreover, the mobile orientation $\theta_k$ is restricted to the values directly linked with the orientations of the elementary displacements.

[Figure 1: MDP grid with states (red crosses). Trajectories with (solid line) and without (dashed line) the $[-\frac{\pi}{2}, \frac{\pi}{2}]$ constraint on heading changes.]

In the Markov Decision Process context, when the reward is completely known, an optimal policy can be computed using Dynamic Programming or similar algorithms such as Value Iteration and Policy Iteration [8]. They are based on Bellman's principle of optimality. However, the applicability of this principle implies some basic assumptions on the cost functional. This is not the case in our context, where the criterion of optimality depends on the Posterior Cramér-Rao Bound (PCRB) matrix introduced in the next section. Indeed, as shown in [2], a cost functional must satisfy the Matrix Dynamic Programming Property to guarantee the principle of optimality. For the determinant functional used in our work, this property is not verified [2].

3 Posterior Cramér-Rao Bound

3.1 Definition

In this section, we briefly recall the properties of the Cramér-Rao bound for estimation problems. Let $\hat{X}(Z)$ be an unbiased estimator of an unknown random vector $X \in \mathbb{R}^d$ based on random observations $Z$. The PCRB is given by the inverse of the Fisher Information Matrix (FIM) $F$ [9] and measures the minimum mean square error of $\hat{X}(Z)$:

$$E\{(\hat{X}(Z) - X)(\hat{X}(Z) - X)^T\} \succeq F^{-1}(X, Z), \qquad (2)$$

where $A \succeq B$ means that $(A - B)$ is positive semidefinite and $F \triangleq E\left[-\Delta_X^X \log p_{X,Z}(X, Z)\right]$, $\Delta$ being the second-order partial derivative operator and $p_{X,Z}$ the joint probability density of $(X, Z)$. For the filtering case, it can be shown [4] that

$$E\left\{(\hat{X}_k - X_k)(\hat{X}_k - X_k)^T \mid V_a\right\} \succeq J_k^{-1}(V_a), \qquad (3)$$

with $X_k$ the state at time $k$, $\hat{X}_k \triangleq \hat{X}_k(Z(1\!:\!k), V_a)$ the estimate based on $Z(1\!:\!k)$, all the measurements observed up to time $k$, and $V_a = \{a_1, \ldots, a_k\}$ the sequence of actions chosen up to time $k$.

3.2 Tichavský PCRB recursion

A remarkable contribution of Tichavský et al. [4] was to introduce a Riccati-like recursion for computing $J_k$:

$$J_{k+1} = D_k^{22} - D_k^{21}\left(J_k + D_k^{11}\right)^{-1} D_k^{12}, \qquad (4)$$

where

$$D_k^{11} = E\{-\Delta_{X_k}^{X_k} \log p(X_{k+1} \mid X_k)\}, \qquad (5)$$
$$D_k^{12} = E\{-\Delta_{X_k}^{X_{k+1}} \log p(X_{k+1} \mid X_k)\}, \qquad (6)$$
$$D_k^{21} = \left[D_k^{12}\right]^T, \qquad (7)$$
$$D_k^{22} = E\{-\Delta_{X_{k+1}}^{X_{k+1}} \log p(X_{k+1} \mid X_k)\} + E\{-\Delta_{X_{k+1}}^{X_{k+1}} \log p(Z_{k+1} \mid X_{k+1})\}. \qquad (8)$$

The relation is valid under the assumption that the mentioned second-order derivatives, expectations and matrix inverses exist. The initial information matrix $J_0$ can be calculated from $\pi_0$. The measurements only contribute through the last term of $D_k^{22}$. The dynamics being linear, the following expressions are easily derived:

$$J_0 = P_0^{-1}, \quad D_k^{11} = Q_k^{-1}, \quad D_k^{12} = -Q_k^{-1}, \quad D_k^{21} = \left[-Q_k^{-1}\right]^T, \quad D_k^{22} = Q_k^{-1} + J_{k+1}(Z). \qquad (9)$$

The observation contribution $J_{k+1}(Z)$ is given by the following expression:

$$J_{k+1}(Z) = E_{X_{k+1}}\Big\{ \sum_{j \in I_v(X_{k+1})} H(X_{k+1}, j)^T\, R_{k+1}^{-1}\, H(X_{k+1}, j) \Big\}, \qquad (10)$$

where $H(X_k, j)$ is the Jacobian of the measurement made from landmark $j$,

$$H(X_k, j) = \begin{pmatrix} \frac{\partial z_k^r(j)}{\partial x_k} & \frac{\partial z_k^r(j)}{\partial y_k} \\[2pt] \frac{\partial z_k^\beta(j)}{\partial x_k} & \frac{\partial z_k^\beta(j)}{\partial y_k} \end{pmatrix},$$

and $R_{k+1}$ is the covariance matrix of the combined visible observations. Actually, this yields a lower bound which may nevertheless be reasonably accurate. We assume that $R_{k+1}$ only depends on the current position. Obviously, there is no explicit expression for $J_{k+1}(Z)$ because the observation model is nonlinear. Consequently, it is necessary to resort to Monte Carlo simulation to approximate it:

- draw $N_c$ samples $(X_0^i)_{1 \le i \le N_c}$ according to $\pi_0$;
- for $l = 0$ until $l = k$, draw $(X_{l+1}^i)_{1 \le i \le N_c}$ from $(X_l^i)_{1 \le i \le N_c}$ according to the prediction equation (1);
- for each $X_{k+1}^i$, compute $I_v(X_{k+1}^i)$, then $H(X_{k+1}^i, j)$ for $j \in I_v(X_{k+1}^i)$, and $R_{k+1}$.

For our observation model, the matrix $H(X_k, j)$ is as follows:

$$H(X_k, j) = \begin{pmatrix} \frac{x_k - x_j}{d_j} & \frac{y_k - y_j}{d_j} \\[2pt] -\frac{y_k - y_j}{d_j^2} & \frac{x_k - x_j}{d_j^2} \end{pmatrix}, \qquad (11)$$

where $d_j = \sqrt{(x_k - x_j)^2 + (y_k - y_j)^2}$. The approximation can then be obtained as

$$J_{k+1}(Z) \approx \frac{1}{N_c} \sum_{i=1}^{N_c} \sum_{j \in I_v(X_{k+1}^i)} H(X_{k+1}^i, j)^T\, R_{k+1}^{-1}\, H(X_{k+1}^i, j). \qquad (12)$$

3.3 A criterion for path planning

We are interested in finding one or more paths connecting two points which minimize a functional $\phi$ of the PCRB along the trajectory [1]. We consider two functionals which depend on the determinants of the history of approximated PCRBs $\{J_0^{-1}, \ldots, J_K^{-1}\}$. The determinant is linked to the volume of the error ellipse of the position estimate:

- minimizing a weighted mean volume along the trajectory, $\phi_1(J_{0:K}) = \sum_{k=0}^{K} w_k \det(J_k^{-1})$;
- minimizing the final estimation error, $\phi_2(J_{0:K}) = \det(J_K^{-1})$.

In figure 2, 90% error ellipses are computed and drawn along two trajectories of the same length. The size of the ellipses decreases rapidly for the trajectory which goes towards the area with landmarks. Since a classical optimization method is irrelevant, we propose to investigate a learning approach based on the innovative Cross-Entropy (CE) method developed by Rubinstein et al. [5].
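The recursion (4)-(12) is straightforward to implement. Below is a sketch for the linear-dynamics case, where $D^{11} = Q^{-1}$, $D^{12} = D^{21} = -Q^{-1}$ and $D^{22} = Q^{-1} + J(Z)$. The particle clouds `sample_sets` are assumed to have been propagated with equation (1); the visibility callback `visible` (an assumed helper standing in for $I_v$) and a common 2x2 per-landmark covariance inverse `Rinv` (an assumption) are left to the caller.

```python
import numpy as np

def H_jac(X, lm):
    """Measurement Jacobian H(X, j) of equation (11)."""
    dx, dy = X[0] - lm[0], X[1] - lm[1]
    dj = np.hypot(dx, dy)
    return np.array([[ dx / dj,     dy / dj],
                     [-dy / dj**2,  dx / dj**2]])

def J_obs(samples, landmarks, visible, Rinv):
    """Monte Carlo approximation (12) of the observation term J_{k+1}(Z)."""
    J = np.zeros((2, 2))
    for X in samples:
        for j in visible(X):
            H = H_jac(X, landmarks[j])
            J += H.T @ Rinv @ H
    return J / len(samples)

def pcrb_history(P0, Qs, sample_sets, landmarks, visible, Rinv):
    """Tichavsky recursion (4) with the linear-case terms (9)."""
    J = np.linalg.inv(P0)                        # J_0 = P_0^{-1}
    hist = [J]
    for Q, samples in zip(Qs, sample_sets):      # samples: cloud of X_{k+1}
        Qi = np.linalg.inv(Q)
        D22 = Qi + J_obs(samples, landmarks, visible, Rinv)
        J = D22 - Qi @ np.linalg.inv(J + Qi) @ Qi    # D12 = D21 = -Q^{-1}
        hist.append(J)
    return hist

# phi_1 and phi_2 are then read off the history, e.g.
# phi2 = np.linalg.det(np.linalg.inv(hist[-1]))
```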

[Figure 2: Example of 90% error ellipses along two trajectories of the same length (map features in black (o)).]

4 Cross-Entropy algorithm

In this section we briefly introduce the Cross-Entropy algorithm. The reader interested in the CE method and its convergence properties should refer to [5, 6].

4.1 A short introduction

The Cross-Entropy method was first used to estimate the probability of rare events, a rare event being an event with very small probability. It was then adapted for optimization, under the observation that sampling around the optimum of a function is a rare event.

Simulation of rare events

Let $X$ be a random variable on a space $\mathcal{X}$, $p_x$ its probability density function (pdf) and $\phi$ a function on $\mathcal{X}$. Suppose we are concerned with estimating $l(\gamma)$, the probability of the event $F_\gamma = \{x \in \mathcal{X} \mid \phi(x) \ge \gamma\}$, with $\gamma \in \mathbb{R}^+$. $F_\gamma$ is a rare event if $l(\gamma)$ is very small. An unbiased estimate can be obtained via crude Monte Carlo simulation: given a random sample $(x_1, \ldots, x_N)$ drawn from $p_x$, this estimate is

$$\hat{l}(\gamma) = \frac{1}{N} \sum_{i=1}^{N} I[\phi(x_i) \ge \gamma].$$

For a rare event, the variance of $\hat{l}(\gamma)$ is very high, and it is necessary to increase $N$ to improve the estimate. The estimator's properties can be improved with a variance reduction technique such as importance sampling, where the random sample is drawn from a more appropriate probability density function $q$. It can easily be shown that the optimal pdf is $q^*(x) = I[\phi(x) \ge \gamma]\, p_x(x)/l(\gamma)$. Nevertheless, $q^*$ depends on $l(\gamma)$, which is precisely what needs to be estimated. To get round this difficulty, it is convenient to choose $q$ within a family of pdfs $\{\pi(\cdot, \lambda) \mid \lambda \in \Lambda\}$. The idea is to find the optimal parameter $\lambda^*$ such that $D(q^*, \pi(\cdot, \lambda))$ is minimized, where $D$ is the Kullback-Leibler pseudo-distance:

$$D(p, q) = E_p\left[\ln \frac{p}{q}\right] = \int_{\mathcal{X}} p(x) \ln p(x)\, dx - \int_{\mathcal{X}} p(x) \ln q(x)\, dx.$$

Minimizing $D(q^*, \pi(\cdot, \lambda))$ is equivalent to maximizing $\int_{\mathcal{X}} q^*(x) \ln \pi(x, \lambda)\, dx$, which implies

$$\lambda^* = \arg\max_{\lambda \in \Lambda} E_{p_x}\left( I[\phi(X) \ge \gamma] \ln \pi(X, \lambda) \right). \qquad (13)$$

The computation of the expectation in (13) must also be done using importance sampling. So we need a change of measure $q$ and a sample $(x_1, \ldots, x_N)$ drawn from $q$, and we estimate $\lambda^*$ as

$$\hat{\lambda} = \arg\max_{\lambda \in \Lambda} \frac{1}{N} \sum_{i=1}^{N} I[\phi(x_i) \ge \gamma]\, \frac{p_x(x_i)}{q(x_i)} \ln \pi(x_i, \lambda). \qquad (14)$$

The family $\{\pi(\cdot, \lambda) \mid \lambda \in \Lambda\}$ must be chosen so that equation (14) is easy to solve. Natural Exponential Families (NEF) [5], i.e. pdfs of the form $f_\lambda(x) = c(\lambda)\, e^{\lambda^T t(x)}\, h(x)$, are especially well adapted. However, $q$ is not known in equation (14). The CE algorithm tries to overcome this difficulty by adaptively constructing a sequence of parameters $(\gamma_t)_{t \ge 1}$ and $(\lambda_t)_{t \ge 1}$ such that:

- $F_{\gamma_1}$ is not a rare event;
- $F_{\gamma_{t+1}}$ is not a rare event for $\pi(\cdot, \lambda_t)$;
- $\lim_{t \to \infty} \gamma_t = \gamma$.

More precisely, given $\rho \in\, ]0, 1[$:

1. choose $\lambda_0$ such that $\pi(\cdot, \lambda_0) = p_x$, and set $t = 1$;
2. draw $(x_1, \ldots, x_N)$ from $\pi(\cdot, \lambda_{t-1})$;
3. sort $(\phi(x_1), \ldots, \phi(x_N))$ in increasing order and evaluate the $(1 - \rho)$ quantile $\gamma_t$;
4. compute $\lambda_t$ as
$$\lambda_t = \arg\max_{\lambda \in \Lambda} \frac{1}{N}\sum_{i=1}^{N} I[\phi(x_i) \ge \gamma_t]\, \frac{p_x(x_i)}{\pi(x_i, \lambda_{t-1})} \ln \pi(x_i, \lambda);$$
5. if $\gamma_t < \gamma$, set $t = t + 1$ and go to step 2; else estimate the probability of $F_\gamma$ with
$$\hat{l}(\gamma) = \frac{1}{N} \sum_{i=1}^{N} I[\phi(x_i) \ge \gamma]\, \frac{p_x(x_i)}{\pi(x_i, \lambda_t)}.$$

This is the main CE algorithm, but other versions can be found in [5].
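As a self-contained numerical illustration of steps 1-5 (not taken from the paper), the sketch below estimates $l(\gamma) = P(X \ge \gamma)$ for $X \sim \mathcal{N}(0, 1)$ with the one-dimensional Gaussian family $\pi(\cdot, \lambda) = \mathcal{N}(\lambda, 1)$; for this NEF, the argmax in step 4 is the weighted mean of the elite samples.

```python
import numpy as np

def ce_rare_event(phi, gamma, N=100_000, rho=0.1, seed=0):
    """Adaptive CE estimation of l(gamma) = P(phi(X) >= gamma) for X ~ N(0, 1),
    with sampling family pi(., lam) = N(lam, 1), so lam_0 = 0 reproduces p_x."""
    rng = np.random.default_rng(seed)
    lam = 0.0
    while True:
        x = rng.normal(lam, 1.0, N)
        gamma_t = min(np.quantile(phi(x), 1.0 - rho), gamma)  # step 3, capped at gamma
        w = np.exp(-0.5 * x**2 + 0.5 * (x - lam)**2)          # p_x(x) / pi(x, lam)
        elite = phi(x) >= gamma_t
        if gamma_t >= gamma:                                  # step 5: final estimate
            return np.mean(elite * w)
        lam = np.sum(w[elite] * x[elite]) / np.sum(w[elite])  # step 4 for N(lam, 1)

print(ce_rare_event(lambda x: x, gamma=4.0))   # exact value is about 3.17e-5
```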

4.2 Application to optimization

The CE method was then adapted to solve optimization problems. Consider the optimization problem

$$\phi(x^*) = \gamma^* = \max_{x \in \mathcal{X}} \phi(x). \qquad (15)$$

The principle of CE for optimization is to translate the problem (15) into an associated stochastic problem, and then to solve it adaptively as the simulation of a rare event. If $\gamma^*$ is the optimum of $\phi$, $F_{\gamma^*}$ is generally a rare event. The main idea is to define a family $\{\pi(\cdot, \lambda) \mid \lambda \in \Lambda\}$ and to iterate the CE algorithm long enough that $\gamma_t \to \gamma^*$, so as to draw samples around the optimum. Moreover, one has to notice that, unlike other local random search algorithms such as simulated annealing, which rely on a local neighborhood hypothesis, the CE method tries to solve the optimization problem globally. To apply the CE approach, we need to:

- define a selection rate $\rho$;
- choose a well-suited family of pdfs $\{\pi(\cdot, \lambda) \mid \lambda \in \Lambda\}$;
- define a criterion to stop the iterations.

Once these elements are given, the algorithm for optimization proceeds as follows:

1. initialize $\lambda_t = \lambda_0$;
2. generate a sample of size $N$, $(x_i^t)_{1 \le i \le N}$, from $\pi(\cdot, \lambda_t)$; compute $(\phi(x_i^t))_{1 \le i \le N}$ and order them from smallest to largest; estimate $\gamma_t$ as the $(1 - \rho)$ sample percentile;
3. update $\lambda_t$ with
$$\lambda_{t+1} = \arg\max_{\lambda \in \Lambda} \frac{1}{N} \sum_{i=1}^{N} I\left[\phi(x_i^t) \ge \gamma_t\right] \ln \pi(x_i^t, \lambda); \qquad (16)$$
4. repeat from step 2 until convergence;
5. assuming convergence is reached at $t = t^*$, an optimal value for $\phi$ can be obtained by drawing from $\pi(\cdot, \lambda_{t^*})$.
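To fix ideas before instantiating this on trajectories, here is a generic sketch of steps 1-5 for a continuous toy problem, using a Gaussian family whose mean and standard deviation are both refit on the elite samples (a standard choice; the family, the toy objective and the stopping rule are assumptions, not the paper's).

```python
import numpy as np

def ce_maximize(phi, dim, N=1000, rho=0.1, max_iters=100, seed=0):
    """Steps 1-5: sample from pi(., lam_t), keep the (1 - rho) tail, refit lam."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), 10.0 * np.ones(dim)    # broad initial pi(., lam_0)
    for _ in range(max_iters):
        x = rng.normal(mu, sigma, size=(N, dim))      # step 2: sample
        scores = phi(x)
        gamma_t = np.quantile(scores, 1.0 - rho)      # step 2: (1 - rho) percentile
        elite = x[scores >= gamma_t]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0)   # step 3: update (16)
        if sigma.max() < 1e-6:                        # degenerate pdf => converged
            break
    return mu                                         # step 5: center of pi(., lam_t*)

# the maximum of -(x - 3)^2 - (y + 1)^2 is attained at (3, -1)
print(ce_maximize(lambda x: -((x[:, 0] - 3)**2 + (x[:, 1] + 1)**2), dim=2))
```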

4.3 Application to the path planning task

In this part we deal with the application of the CE method to our task. First of all, it is necessary to define the random mechanism $\{\pi(\cdot, \lambda) \mid \lambda \in \Lambda\}$ used to generate samples of trajectories. More precisely, we want to generate samples which:

- start from $s_i$ and end at $s_f$;
- respect the authorized transition matrix $\delta(a_k, a_{k+1})$;
- have a total length smaller than $T_{max}$.

One way to achieve that is to use a probability matrix $P_{sa} = (p_{sa})$, with $s \in \{1, \ldots, N_s\}$ and $a \in \{1, \ldots, N_a\}$, defining the probability of choosing action $a$ in state $s$. $P_{sa}$ is an $N_s \times N_a$ matrix (in our case $N_a = 8$):

$$P_{sa} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{18} \\ \vdots & & & \vdots \\ p_{N_s 1} & p_{N_s 2} & \cdots & p_{N_s 8} \end{pmatrix}, \qquad (17)$$

whose $s$-th row $P_s(\cdot)$ is a one-dimensional discrete probability law: $P_s(i) = p_{si}$ for $i = 1, \ldots, 8$, with $\sum_i p_{si} = 1$. So, to solve our problem, we are concerned with the optimization of the $N_s \times N_a$ parameters $(p_{sa})$ using the CE algorithm.

Trajectory generation

At each iteration of the CE algorithm, we want to generate from $P_{sa}$ $N$ trajectories of length $T \le T_{max}$, starting from $s_i$, ending at $s_f$, and drawn in accordance with the authorized transition matrix $\delta$. Let us first introduce:

- $A(\tilde{s}, \tilde{a}) = \{a \mid s_k = \tilde{s},\ \delta(a_{k-1} = \tilde{a}, a_k = a) = 1\}$, the set of admissible actions at time $k$ in state $\tilde{s}$, knowing that $\tilde{a}$ was chosen at time $k-1$;
- $\tilde{P}_s(\cdot)$, the normalized restriction of $P_s(\cdot)$ to $A(\tilde{s}, \tilde{a})$: for $a \in A(\tilde{s}, \tilde{a})$, $\tilde{p}_{sa} = p_{sa} / \sum_{a' \in A(\tilde{s}, \tilde{a})} p_{sa'}$, and $\tilde{p}_{sa} = 0$ otherwise.

We proceed as follows for one iteration of the CE:

Table 1: Path generation principle

    j = 0
    while (j < N)
        k = 0, set s_0 = s_i
        generate one action a_0^j according to P_{s_0} and apply it
        set k = 1 and T = 1
        until s_k = s_f do
            - compute A(s_k, a_{k-1})
            - if A(s_k, a_{k-1}) is not empty, generate one action a_k^j
              according to the restriction of P_{s_k} and apply it;
              else stop and set j = j - 1
            - set k = k + 1 and T = T + 1
            - if T > T_max, stop and set j = j - 1
        return x(j) = (s_i, a_0^j, s_1^j, a_1^j, ..., s_{k-1}^j, a_{k-1}^j, s_f)
        j = j + 1

Updating the P_sa matrix

At each iteration $t$ of the Cross-Entropy algorithm, given the $N$ trajectories $(x(j))_{1 \le j \le N}$ (starting from $s_i$, ending at $s_f$ and respecting the matrix $\delta$) generated by the algorithm described in Table 1, we can evaluate each one by computing its PCRB sequence and applying our functional $\phi_e$ ($e \in \{1, 2\}$). The parameter $\gamma_t$ can then be evaluated, and one can update $P_{sa}$ by solving (16). Let $x(j) = (s_i, a_0^j, s_1^j, a_1^j, \ldots, s_{k-1}^j, a_{k-1}^j, s_f)$ be one trajectory; we have

$$\ln \pi(x(j), P_{sa}) = \sum_{s, a} I[\{x(j) \in \chi_{sa}\}] \ln p_{sa}, \qquad (18)$$

where $I$ is the indicator function and $\{x(j) \in \chi_{sa}\}$ means that the trajectory contains a visit to state $s$ in which action $a$ is taken. Since for each $s$ the row $P_s(\cdot)$ is a discrete probability law, (16) must be solved under the condition that the rows of $P_{sa}$ sum up to 1. This leads to the following problem, with Lagrange multipliers $(\mu_s)_{1 \le s \le N_s}$:

$$\max_{P_{sa}} \min_{\mu_1, \ldots, \mu_{N_s}} \frac{1}{N} \sum_{j=1}^{N} I[\phi_e(x(j)) \ge \gamma_t] \ln \pi(x(j), P_{sa}) + \sum_{s=1}^{N_s} \mu_s \left( \sum_{a=1}^{8} p_{sa} - 1 \right), \qquad (19)$$

with $\ln \pi(x(j), P_{sa})$ given in (18). After differentiation with respect to each parameter $p_{sa}$ ($s \in \{1, \ldots, N_s\}$, $a \in \{1, \ldots, 8\}$), we can derive the following equation:

$$\sum_{j=1}^{N} I[\phi_e(x(j)) \ge \gamma_t]\, I[\{x(j) \in \chi_{sa}\}] = -N \mu_s\, p_{sa}.$$

After summing over $a = 1, \ldots, 8$ and applying the condition $\sum_a p_{sa} = 1$, we obtain $\mu_s$ and the final updating formula:

$$\mu_s = -\frac{1}{N}\sum_{j=1}^{N} I[\phi_e(x(j)) \ge \gamma_t]\, I[\{x(j) \in \chi_s\}], \qquad (20)$$

$$p_{sa} = \frac{\sum_{j=1}^{N} I[\phi_e(x(j)) \ge \gamma_t]\, I[\{x(j) \in \chi_{sa}\}]}{\sum_{j=1}^{N} I[\phi_e(x(j)) \ge \gamma_t]\, I[\{x(j) \in \chi_s\}]}, \qquad (21)$$

where $\{x(j) \in \chi_s\}$ means that the trajectory contains a visit to state $s$. Since our functionals $\phi_e$ are minimized, the elite test is $\phi_e(x(j)) \le \gamma_t$, with $\gamma_t$ the $\rho$ sample percentile. For the first iteration, $P_s(\cdot)$ is a uniform probability law for every $s$.
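Combining Table 1 with the update (21) gives the core of the planner. The sketch below is one possible encoding; the state/action indexing, the `next_state` helper and the rejection handling are assumptions, and, as noted above, the elite set is taken on the low side of the scores.

```python
import numpy as np

def sample_path(P, s_i, s_f, delta, next_state, T_max, rng):
    """Table 1: draw one trajectory from P_sa, restricted at each step to the
    admissible set A(s, a_prev); returns None when the draw must be rejected."""
    s, a_prev, path = s_i, None, [s_i]
    for _ in range(T_max):
        probs = P[s].copy()
        if a_prev is not None:
            probs = probs * delta[a_prev]            # keep a with delta(a_prev, a) = 1
        for a in range(P.shape[1]):                  # forbid moves off grid/obstacles
            if next_state(s, a) is None:
                probs[a] = 0.0
        if probs.sum() == 0.0:
            return None                              # A(s, a_prev) empty: reject
        probs /= probs.sum()                         # normalized restriction of P_s
        a = rng.choice(P.shape[1], p=probs)
        s, a_prev = next_state(s, a), a
        path.append((a, s))
        if s == s_f:
            return path
    return None                                      # length exceeded T_max: reject

def update_P(P, paths, scores, rho):
    """Update (21): p_sa = elite visits to (s, a) / elite visits to s."""
    gamma_t = np.quantile(scores, rho)               # minimization: low phi is elite
    num = np.zeros_like(P)
    for path, score in zip(paths, scores):
        if score <= gamma_t:
            s = path[0]
            for a, s_next in path[1:]:
                num[s, a] += 1.0                     # visit counts for chi_sa
                s = s_next
    denom = num.sum(axis=1, keepdims=True)
    visited = denom[:, 0] > 0.0
    P[visited] = num[visited] / denom[visited]       # unvisited rows keep old P_s
    return P
```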

5 Simulation results

The algorithm has not been widely tested, and only one simple scenario is presented in this paper. In this example, the map contains $N_f = 6$ point features $m_0, \ldots, m_5$ (figure 3, table 2). The state space is discretized into $N_s = 25 \times 25$ states. The positions of the initial and terminal states are shown in figure 3.

[Table 2: the $(x, y)$ coordinates of the six map features $m_0, \ldots, m_5$.]

[Figure 3: The maximum likelihood optimal solution after the first iteration of the CE. Starting and terminal states (*) and map features (o).]

The mobile can only apply heading controls within $[-\frac{\pi}{2}, \frac{\pi}{2}]$ at each time step, so the $\delta$ matrix is exactly the one defined in section 2.3. Only trajectories of length $T \le 30$ are admissible. For the observation model, a landmark $m_j$ is visible at time $k$ provided that

$$r_{min} \le z_k^r(j) \le r_{max}, \qquad |z_k^\beta(j)| \le \theta_{max},$$

where $r_{min}$, $r_{max}$ and $\theta_{max}$ are fixed sensor parameters. The noise variances $\sigma_r^2$ on the range and $\sigma_\beta$ on the bearing are the same for all features and time independent: $\sigma_r^2 = 5 \cdot 10^{-3}$, $\sigma_\beta = 0.5$ deg. The computation of the PCRB matrices was performed with $N_c = 1000$ Monte Carlo samples to estimate the observation term $J_k(Z)$. For the optimization step, the Cross-Entropy algorithm was run for 1000 iterations, with $N = 5000$ admissible trajectories and a selection rate $\rho = 0.01$; that is to say, the 50 best samples are used for updating the $p_{sa}$ probabilities. For the mobile dynamic model, the elementary displacement $d$ is equal to the grid step, and the noise covariances are defined by $P_0 = \sigma^2 I$ and $Q_k = \sigma^2 I$, where $I$ is the identity matrix of size 2 and $\sigma = 0.05$. Figure 4 shows the optimal trajectory found by the algorithm at iteration 1000 for the performance function $\phi_1$.

[Figure 4: The most likely optimal trajectory found by the algorithm for $\phi_1$.]

5.1 Analysis

As expected (figure 4), the mobile is guided toward the area with landmarks in order to improve its localization performance. Moreover, it operates so as to keep the landmarks visible for as long as the maneuverability ($\delta$) and time ($T_{max}$) constraints allow it. We can also notice that the algorithm converges rapidly to a solution. To illustrate this, figure 5 presents the evolution of the parameter $\gamma_t$ and of the extremal value of $\phi_1$ at each iteration of the CE algorithm.

[Figure 5: Evolution of $\gamma_t$ (solid line) and of the extremal value of the functional (dashed line) over the 1000 iterations.]

When we look closely, after convergence, at the densities $P_s(\cdot)$ for the states along the optimal trajectory, we notice that some of them are not Dirac probability laws. The most likely optimal trajectory is composed of 30 states, and all but a handful of them have an associated $P_s$ equivalent to a Dirac law. Table 3 shows the probability density functions of the remaining ones (see figure 6). In these states the algorithm converges toward a multi-modal density: two actions can be chosen, but with different probabilities. We can notice that this behavior is concentrated on states where maneuvers can be made to increase the time during which the landmarks are observed.

[Figure 6: states (squares) whose pdf differs from a Dirac law after convergence.]

[Table 3: the probability density functions $P_s(\cdot)$ of the non-Dirac states.]

6 Conclusions and perspectives

In this paper, we presented a framework to solve a path planning task for a mobile. The problem was discretized, and a Markov Decision Process with constraints on the mobile maneuvers was used. Our main goal was to find the optimal trajectory according to a measure of the capability of estimating accurately the state of the mobile during the execution of the plan. Functionals of the Posterior Cramér-Rao Bound were used as the criterion of performance. The main contribution of the paper is the use of the Cross-Entropy algorithm to solve the optimization step, as Dynamic Programming could not be applied. This approach was tested on a simple first example and seems to be relevant. Future work will first concentrate on the complete implementation of the algorithm and on applications to more examples. More analysis of the resulting probability laws has to be made. We will also investigate a continuous approach and try to approximate the computation of the observation contribution to the PCRB, which is time consuming. The tuning of the Cross-Entropy method for our specific task was not studied; some experiments remain to be carried out based on the guidelines given in [5]. Finally, we want to consider more complex maps, such as those used in Geographical Information Systems, and take into account measurement models with data association and non-detection problems.

References

[1] S. Paris, J-P. Le Cadre, "Planning for Terrain-Aided Navigation", Fusion 2002, Annapolis (USA), pp. 1007-1014, 7-11 Jul. 2002.

[2] J-P. Le Cadre and O. Trémois, "The Matrix Dynamic Programming Property and its Implications", SIAM Journal on Matrix Analysis, 18(2), April 1997.

[3] R. Enns and D. Morrell, "Terrain-Aided Navigation Using the Viterbi Algorithm", Journal of Guidance, Control, and Dynamics, November-December 1995.

[4] P. Tichavský, C.H. Muravchik, A. Nehorai, "Posterior Cramér-Rao Bounds for Discrete-Time Nonlinear Filtering", IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1386-1396, May 1998.

[5] R. Rubinstein, D.P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Information Science & Statistics, Springer, 2004.

[6] P-T. de Boer, D.P. Kroese, S. Mannor and R. Rubinstein, "A Tutorial on the Cross-Entropy Method", 2003. http://www.cs.utwente.nl/~ptdeboer/ce/

[7] S. Mannor, R. Rubinstein, Y. Gat, "The Cross Entropy Method for Fast Policy Search", Proc. of the 20th International Conference on Machine Learning, 2003.

[8] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, A Bradford Book, 2000.

[9] H.L. Van Trees, Detection, Estimation and Modulation Theory, New York: Wiley, 1968.