Motion Planning under Uncertainty using Iterative Local Optimization in Belief Space

Jur van den Berg (1), Sachin Patil (2), Ron Alterovitz (2)

(1) School of Computing, University of Utah, berg@cs.utah.edu.
(2) Dept. of Computer Science, University of North Carolina at Chapel Hill, {sachin, ron}@cs.unc.edu.

Abstract

We present a new approach to motion planning under sensing and motion uncertainty by computing a locally optimal solution to a continuous partially observable Markov decision process (POMDP). Our approach represents beliefs (the distributions of the robot's state estimate) by Gaussian distributions and is applicable to robot systems with non-linear dynamics and observation models. The method follows the general POMDP solution framework in which we approximate the belief dynamics using an extended Kalman filter and represent the value function by a quadratic function that is valid in the vicinity of a nominal trajectory through belief space. Using a belief space variant of iterative LQG (iLQG), our approach iterates with second-order convergence towards a linear control policy over the belief space that is locally optimal with respect to a user-defined cost function. Unlike previous work, our approach does not assume maximum-likelihood observations, does not assume fixed estimator or control gains, takes into account obstacles in the environment, and does not require discretization of the state and action spaces. The running time of the algorithm is polynomial ($O[n^6]$) in the dimension $n$ of the state space. We demonstrate the potential of our approach in simulation for holonomic and nonholonomic robots maneuvering through environments with obstacles with noisy and partial sensing and with non-linear dynamics and observation models.

1 Introduction

As a robot moves through an environment to accomplish a task, uncertainty may arise in (1) the robot's motion due to unmodeled or unpredictable external forces, and (2) the robot's sensing of its state due to noisy or incomplete sensor measurements.
These forms of uncertainty are common in a variety of practical robotics tasks, including guiding aerial vehicles in turbulent conditions, maneuvering mobile robots in unfamiliar terrain, and robotically steering flexible medical needles to clinical targets in soft tissues. Explicitly considering motion and sensing uncertainty when computing motion plans can improve the quality of computed plans.

The objective of motion planning under uncertainty is to plan motions for a robot such that the expected cost (as defined by a user-specified cost function) is minimized. Optimal plans typically limit the information that is lost due to motion uncertainty and move the robot through regions of the state space where information on the state is gained. Optimal solutions will maximize, for instance, the probability of reaching a specified goal location while avoiding collisions with obstacles. To fully consider the impact of uncertainty in motion and sensing, a motion planner should not merely compute a static path through the robot's configuration space but rather a control policy that defines the motion to perform given any current state information.

A key challenge is that the robot often cannot directly observe its current state but instead estimates a distribution over the set of possible states (i.e., its belief state) based on sensor measurements that are both noisy and partial (i.e., only a subset of the state vector can be sensed). The problem of computing a control policy over the space of belief states is formally described as a partially observable Markov decision process (POMDP), on which a large body of work is available in the literature. Solutions to POMDPs are known to be extremely complex [19], since the belief space (over which a control policy is to be computed) is in the most general formulation an infinite-dimensional space of all possible probability distributions over the finite-dimensional state space. Solutions based on discrete or discretized state and action spaces are inherently subject to the curse of dimensionality, and have only been successfully applied to very small and low-dimensional state spaces.

In this paper, we present a method that takes as input a feasible trajectory and improves it by computing a locally optimal trajectory and a corresponding control policy that together minimize the expected value of a user-specified cost metric in the presence of motion and sensing uncertainty. To accomplish this, our method computes a locally optimal solution to a POMDP problem with continuous state and action spaces and non-linear dynamics and observation models, where we assume a belief can be represented by a Gaussian distribution. This POMDP formulation is applicable to a wide range of robot motion planning problems. Our approach uses a belief space variant of iterative linear-quadratic Gaussian (iLQG) to perform value iteration, where the value function is approximated using a quadratization around a nominal trajectory, and the belief dynamics are approximated using an extended Kalman filter (any non-linear Gaussian filter can in fact be used). The result is a linear control policy over the belief space that is valid in the vicinity of the nominal trajectory. By executing the control policy, a new nominal trajectory is created, around which a new control policy is constructed.
This process continues with second-order convergence towards a locally optimal solution to the POMDP problem. Unlike general POMDP solvers that have an exponential running time, our approach does not rely on discretizations and has a running time that is polynomial ($O[n^6]$) in the dimension $n$ of the state space.

Our approach combines, generalizes, and overcomes the limitations of previous work that has addressed the same problem of creating applicable approximations to the POMDP problem. Most previous work on POMDPs assumes maximum-likelihood observations to enable or simplify computing a control policy. This assumption has no formal justification, yet seems to produce reasonable results. Our approach does not assume maximum-likelihood observations, but can relatively easily be adapted such that it does. We use this to study the impact of the maximum-likelihood observation assumption on the resulting control policies and discuss the impact on plans computed using iterative local optimization. Our results indicate that not making this assumption yields, on average, better control policies (i.e., policies with lower expected cost). Furthermore, our approach does not assume fixed estimator or control gains, and takes into account obstacles in the environment. We do assume that the dynamics and observation models and cost functions are sufficiently smooth, and that the belief about the state of the robot is well described by only its mean and its variance. We show the potential of our approach in several illustrative scenarios involving robots with non-linear dynamics and observation models moving through environments containing obstacles and relying on limited and partial sensing.

2 Previous Work

Partially observable Markov decision processes (POMDPs) [24] provide a principled mathematical framework for planning under uncertainty. They are known to be of extreme complexity [19], and can only be directly applied to problems with small and low-dimensional state spaces [16]. Recently, several POMDP algorithms have been developed that use approximate value iteration with point-based updates [1, 17, 20, 18].
These have been shown to scale up to medium-sized domains. However, they rely on discretizing the state space or the action space, making them inevitably subject to the curse of dimensionality. The methods of [23, 4, 9, 6] handle continuous state and action spaces, but maintain a global (discrete) representation of the value function over the belief space. In contrast, our approach is continuous and approximates the value function in parametric form only in the regions of the belief space that are relevant to solving the problem, allowing for a running time polynomial in the dimension of the state.

Another class of works, to which our method is directly related, assumes a linear-quadratic Gaussian (LQG) framework to find locally optimal feedback policies. In the basic LQG derivation [2], motion and sensing uncertainty have no impact on the resulting policy. As shown in [25], the LQG framework can be extended such that it accounts for state and control dependent motion noise, but still implicitly assumes full observation (or an independent estimator) of the state. Several approaches have been proposed to include partial and noisy observations such that the controller will actively choose actions to gain information about the state. Belief roadmaps [22] and icLQG [10] combine an iterative LQG approach with a roadmap, but this approach does not result in (locally) optimal solutions. The approaches of [21, 7, 8] are similar to ours: they incorporate the variance into an augmented state and use the LQG framework to find a locally optimal control policy. The main difference is that these approaches assume maximum-likelihood observations to make the belief propagation deterministic. LQG-MP [26] removes this assumption, but only evaluates the probability of success of a given trajectory, rather than constructing an optimal one. Belief trees [5] overcome this limitation by combining a variant of LQG-MP with RRT* to find an optimal trajectory through belief space. A great advantage of this approach is that it finds a globally optimal solution. Vitus and Tomlin [31] propose an alternative solution that involves solving a chance constrained optimal control problem.
However, these approaches do not really solve a POMDP as they assume fixed control gains along each section of the trajectory independent of the context. The work of [15] takes into account state and control dependent motion and observation noise by an interleaved iteration of the estimator and the controller, converging to a local optimum. While this approach is asymptotically faster than ours, it does not allow for obstacles in the environment and results in a controller that is optimal only under the assumption of fixed estimator gains.

Our approach combines and generalizes these approaches as it does not assume maximum-likelihood observations, does not assume fixed control or estimator gains, and takes into account the existence of obstacles in the environment to compute locally optimal policies that minimize the expected value of a user-defined cost function.

This paper is an extended version of a preliminary paper presented by the authors in [28], which used stochastic differential dynamic programming (SDDP) rather than iLQG for the value iteration, but otherwise presents the same global approach. Also, to improve numerical stability compared to [28], in this paper we use the principal square root of the variance, rather than the variance itself, in the definition of the belief. Qualitatively, iLQG is asymptotically faster than SDDP ($O[n^6]$ rather than $O[n^7]$) and numerically more stable (regularization of matrices to maintain positive-semidefiniteness of the value function is not necessary with iLQG). Our experimental results include a quantitative comparison between the two approaches.

3 Preliminaries and Definitions

We begin by defining POMDPs in their most general formulation (following [24]). Then, we specifically state the instance of the problem we discuss in this paper.

3.1 General POMDPs

Let $\mathcal{X} \subset \mathbb{R}^n$ be the space of all possible states $\mathbf{x}$ of the robot, $\mathcal{U} \subset \mathbb{R}^m$ be the space of all possible control inputs $\mathbf{u}$ of the robot, and $\mathcal{Z} \subset \mathbb{R}^k$ be the space of all possible sensor measurements $\mathbf{z}$ the robot may receive. General POMDPs take as input a stochastic dynamics and observation model, here given in probabilistic notation:

$$\mathbf{x}_{t+1} \sim p[\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t], \qquad \mathbf{z}_t \sim p[\mathbf{z}_t \mid \mathbf{x}_t], \tag{1}$$

where $\mathbf{x}_t \in \mathcal{X}$, $\mathbf{u}_t \in \mathcal{U}$, and $\mathbf{z}_t \in \mathcal{Z}$ are the robot's state, control input, and received measurement at time step $t$, respectively.

The belief $b[\mathbf{x}_t]$ of the robot is defined as the distribution of the state $\mathbf{x}_t$ given all past control inputs and sensor measurements:

$$b[\mathbf{x}_t] = p[\mathbf{x}_t \mid \mathbf{u}_0, \ldots, \mathbf{u}_{t-1}, \mathbf{z}_1, \ldots, \mathbf{z}_t]. \tag{2}$$

Given a control input $\mathbf{u}_t$ and a measurement $\mathbf{z}_{t+1}$, the belief is propagated using Bayesian filtering:

$$b[\mathbf{x}_{t+1}] = \eta \, p[\mathbf{z}_{t+1} \mid \mathbf{x}_{t+1}] \int p[\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t] \, b[\mathbf{x}_t] \, d\mathbf{x}_t, \tag{3}$$

where $\eta$ is a normalizer independent of $\mathbf{x}_{t+1}$. Denoting belief $b[\mathbf{x}_t]$ by $\mathbf{b}_t$, and the space of all possible beliefs by $\mathcal{B} \subset \{\mathcal{X} \to \mathbb{R}\}$, the belief dynamics defined by Eq. (3) can be written as a function $\beta : \mathcal{B} \times \mathcal{U} \times \mathcal{Z} \to \mathcal{B}$:

$$\mathbf{b}_{t+1} = \beta[\mathbf{b}_t, \mathbf{u}_t, \mathbf{z}_{t+1}]. \tag{4}$$

Now, the challenge of the POMDP problem is to find a control policy $\pi_t : \mathcal{B} \to \mathcal{U}$ for all $0 \le t < l$, where $l$ is the time horizon (i.e. the index of the final time step), such that selecting the controls $\mathbf{u}_t = \pi_t[\mathbf{b}_t]$ minimizes the objective function:

$$\mathop{\mathrm{E}}_{\mathbf{z}_1, \ldots, \mathbf{z}_l} \Big[ c_l[\mathbf{b}_l] + \sum_{t=0}^{l-1} c_t[\mathbf{b}_t, \mathbf{u}_t] \Big], \tag{5}$$

for given immediate cost functions $c_l$ and $c_t$. The expectation is taken because the measurements are stochastic. A general solution approach uses value iteration [24], a backward recursion procedure, to find the control policy $\pi_t$ for each time step $t$:

$$v_l[\mathbf{b}_l] = c_l[\mathbf{b}_l] \tag{6}$$

$$v_t[\mathbf{b}_t] = \min_{\mathbf{u}_t} \Big( c_t[\mathbf{b}_t, \mathbf{u}_t] + \mathop{\mathrm{E}}_{\mathbf{z}_{t+1}} \big[ v_{t+1}[\beta[\mathbf{b}_t, \mathbf{u}_t, \mathbf{z}_{t+1}]] \big] \Big) \tag{7}$$

$$\pi_t[\mathbf{b}_t] = \mathop{\mathrm{argmin}}_{\mathbf{u}_t} \Big( c_t[\mathbf{b}_t, \mathbf{u}_t] + \mathop{\mathrm{E}}_{\mathbf{z}_{t+1}} \big[ v_{t+1}[\beta[\mathbf{b}_t, \mathbf{u}_t, \mathbf{z}_{t+1}]] \big] \Big), \tag{8}$$

where $v_t[\mathbf{b}_t] : \mathcal{B} \to \mathbb{R}$ is called the value function at time step $t$.

3.2 Problem Definition

The complexity of POMDPs stems from the fact that $\mathcal{B}$, the space of all beliefs, is infinite-dimensional, and that in general the value function cannot be expressed in parametric form. We address these challenges in our approach by representing beliefs by Gaussian distributions, approximating the belief dynamics using an extended Kalman filter, and approximating the value function by a quadratization around a nominal trajectory through the belief space.
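Specifically, we assume we are given a (non-linear) stochastic dynamics and observation model, here given in state-transition notation:

$$\mathbf{x}_{t+1} = \mathbf{f}[\mathbf{x}_t, \mathbf{u}_t, \mathbf{m}_t], \qquad \mathbf{m}_t \sim \mathcal{N}[\mathbf{0}, I], \tag{9}$$

$$\mathbf{z}_t = \mathbf{h}[\mathbf{x}_t, \mathbf{n}_t], \qquad \mathbf{n}_t \sim \mathcal{N}[\mathbf{0}, I], \tag{10}$$

where $\mathbf{m}_t$ is the motion noise and $\mathbf{n}_t$ is the measurement noise, each drawn from an independent Gaussian distribution with (without loss of generality) zero mean and unit variance. Note that the motion and sensing uncertainty can be state and control input dependent through manipulations on $\mathbf{m}_t$ and $\mathbf{n}_t$ within the functions $\mathbf{f}$ and $\mathbf{h}$, respectively.

The belief, denoted $\mathbf{b}_t = (\hat{\mathbf{x}}_t, \sqrt{\Sigma_t})$, is assumed to be defined by the mean $\hat{\mathbf{x}}_t$ and the principal square root $\sqrt{\Sigma_t}$ of the variance $\Sigma_t$ of a Gaussian distribution $\mathcal{N}[\hat{\mathbf{x}}_t, \Sigma_t]$ of the state $\mathbf{x}_t$. We use the square root for numerical robustness of the algorithm we present below. Similar to the general POMDP case, our objective is to find a control policy $\mathbf{u}_t = \pi_t[\mathbf{b}_t]$ that minimizes the cost function $\mathrm{E}\big[ c_l[\mathbf{b}_l] + \sum_{t=0}^{l-1} c_t[\mathbf{b}_t, \mathbf{u}_t] \big]$. In our case, we assume in addition positive-(semi)definiteness for the Hessian matrices of the immediate cost functions:

$$\frac{\partial^2 c_l}{\partial \mathbf{b} \partial \mathbf{b}}[\mathbf{b}] \succeq 0, \qquad \frac{\partial^2 c_t}{\partial \mathbf{u} \partial \mathbf{u}}[\mathbf{b}, \mathbf{u}] \succ 0, \qquad \begin{bmatrix} \frac{\partial^2 c_t}{\partial \mathbf{b} \partial \mathbf{b}}[\mathbf{b}, \mathbf{u}] & \frac{\partial^2 c_t}{\partial \mathbf{b} \partial \mathbf{u}}[\mathbf{b}, \mathbf{u}] \\ \frac{\partial^2 c_t}{\partial \mathbf{u} \partial \mathbf{b}}[\mathbf{b}, \mathbf{u}] & \frac{\partial^2 c_t}{\partial \mathbf{u} \partial \mathbf{u}}[\mathbf{b}, \mathbf{u}] \end{bmatrix} \succeq 0, \tag{11}$$

for all $\mathbf{b}$, $\mathbf{u}$ and $t$. Further, we assume that the initial belief $\mathbf{b}_0 = (\hat{\mathbf{x}}_0, \sqrt{\Sigma_0})$ is given.

4 Approach

To compute a locally optimal solution to the Gaussian POMDP problem as formulated above, we follow the general solution approach as sketched in Section 3.1. First, we approximate the belief dynamics using an extended Kalman filter. Second, we approximate the value function using a quadratic function that is locally valid in the vicinity of a nominal trajectory through the belief space. We then use a belief-space variant of iterative LQG to perform the value iteration, which results in a linear control policy over the belief space that is locally optimal around the nominal trajectory. We then iteratively generate new nominal trajectories by executing the control policy, and repeat the process until convergence to a locally optimal solution to the POMDP problem. We discuss each of these steps in this section, and analyze the running time of our algorithm.

4.1 Bayesian Filter and Belief Dynamics

Given a current belief $\mathbf{b}_t = (\hat{\mathbf{x}}_t, \sqrt{\Sigma_t})$, a control input $\mathbf{u}_t$, and a measurement $\mathbf{z}_{t+1}$, the belief evolves using a Bayesian filter.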
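We approximate the Bayesian filter by an extended Kalman filter (EKF), which is applicable to Gaussian beliefs (we note that any other non-linear Gaussian filter, such as the unscented Kalman filter [12], can be used as well). The EKF is widely used for state estimation of non-linear systems [32], and uses the first-order approximation that for any vector-valued function $\mathbf{f}[\mathbf{x}]$ of a stochastic variable $\mathbf{x}$ we have:

$$\mathrm{E}[\mathbf{f}[\mathbf{x}]] \approx \mathbf{f}[\mathrm{E}[\mathbf{x}]], \qquad \mathrm{Var}[\mathbf{f}[\mathbf{x}]] \approx \frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\mathrm{E}[\mathbf{x}]] \cdot \mathrm{Var}[\mathbf{x}] \cdot \frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\mathrm{E}[\mathbf{x}]]^T. \tag{12}$$

Given $\hat{\mathbf{x}}_t$ and $\sqrt{\Sigma_t}$ that define the current belief, the EKF update equations are then given by:

$$\hat{\mathbf{x}}_{t+1} = \mathbf{f}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}] + K_t (\mathbf{z}_{t+1} - \mathbf{h}[\mathbf{f}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}], \mathbf{0}]), \tag{13}$$

$$\sqrt{\Sigma_{t+1}} = \sqrt{\Gamma_t - K_t H_t \Gamma_t}, \tag{14}$$

where

$$\Gamma_t = A_t \sqrt{\Sigma_t} (A_t \sqrt{\Sigma_t})^T + M_t M_t^T, \qquad A_t = \frac{\partial \mathbf{f}}{\partial \mathbf{x}}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}], \qquad M_t = \frac{\partial \mathbf{f}}{\partial \mathbf{m}}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}], \tag{15}$$

$$K_t = \Gamma_t H_t^T (H_t \Gamma_t H_t^T + N_t N_t^T)^{-1}, \qquad H_t = \frac{\partial \mathbf{h}}{\partial \mathbf{x}}[\mathbf{f}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}], \mathbf{0}], \qquad N_t = \frac{\partial \mathbf{h}}{\partial \mathbf{n}}[\mathbf{f}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}], \mathbf{0}]. \tag{16}$$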
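As a rough illustration of Eqs. (13) to (16), the following Python sketch performs one EKF belief update for a scalar state, so that all matrices reduce to numbers. The model in `model` (a noisy integrator whose measurement noise grows with |x|), the time step `TAU`, and all numeric values are assumptions made for this example and are not taken from the paper.

```python
import math

def ekf_step(x_hat, sqrt_sigma, u, z_next, model):
    """One scalar EKF belief update following Eqs. (13)-(16).

    The belief is (x_hat, sqrt_sigma), with sqrt_sigma the square root of the
    variance. `model` is a dict of callables supplied by the caller.
    """
    x_pred = model["f"](x_hat, u, 0.0)            # f[x_hat, u, 0]
    A = model["df_dx"](x_hat, u)                  # A_t
    M = model["df_dm"](x_hat, u)                  # M_t
    H = model["dh_dx"](x_pred)                    # H_t
    N = model["dh_dn"](x_pred)                    # N_t
    gamma = (A * sqrt_sigma) ** 2 + M * M         # Gamma_t
    K = gamma * H / (H * gamma * H + N * N)       # K_t
    x_new = x_pred + K * (z_next - model["h"](x_pred, 0.0))   # Eq. (13)
    sqrt_sigma_new = math.sqrt(gamma - K * H * gamma)         # Eq. (14)
    return x_new, sqrt_sigma_new

# Hypothetical scalar model (illustration only).
TAU = 0.5
model = {
    "f": lambda x, u, m: x + TAU * u + 0.1 * m,
    "h": lambda x, n: x + (0.5 + 0.5 * x * x) * n,
    "df_dx": lambda x, u: 1.0,                 # df/dx evaluated at m = 0
    "df_dm": lambda x, u: 0.1,                 # df/dm evaluated at m = 0
    "dh_dx": lambda x: 1.0,                    # dh/dx evaluated at n = 0
    "dh_dn": lambda x: 0.5 + 0.5 * x * x,      # dh/dn evaluated at n = 0
}

x_new, s_new = ekf_step(2.0, 1.0, -1.0, 1.7, model)
```

In this toy setting the updated mean moves from the prediction (1.5) towards the measurement (1.7) by the fraction given by the gain, and the updated standard deviation is smaller than the predicted one.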

Note that all of these matrices are functions of $\mathbf{b}_t$ and $\mathbf{u}_t$. Equations (13) and (14) define the (non-linear) belief dynamics. The second term of Eq. (13), called the innovation term, depends on the measurement $\mathbf{z}_{t+1}$. Since the measurement is unknown in advance, the belief dynamics are stochastic. Using Eq. (10) and the assumptions of Eq. (12), the innovation term is distributed according to $\mathcal{N}[\mathbf{0}, K_t H_t \Gamma_t]$.

We define the belief $\mathbf{b}_t = \begin{bmatrix} \hat{\mathbf{x}}_t \\ \mathrm{vec}[\sqrt{\Sigma_t}] \end{bmatrix}$ as a true vector, containing the mean $\hat{\mathbf{x}}_t$ and the columns of $\sqrt{\Sigma_t}$. In our implementation we exploit the symmetry of $\sqrt{\Sigma_t}$ to eliminate the redundancy. Then, the stochastic belief dynamics are given by:

$$\mathbf{b}_{t+1} = \mathbf{g}[\mathbf{b}_t, \mathbf{u}_t] + W[\mathbf{b}_t, \mathbf{u}_t] \mathbf{w}_t, \qquad \mathbf{w}_t \sim \mathcal{N}[\mathbf{0}, I_n], \tag{17}$$

where $n$ is the dimension $\dim[\mathbf{x}]$ of the state, and:

$$\mathbf{g}[\mathbf{b}_t, \mathbf{u}_t] = \begin{bmatrix} \mathbf{f}[\hat{\mathbf{x}}_t, \mathbf{u}_t, \mathbf{0}] \\ \mathrm{vec}[\sqrt{\Gamma_t - K_t H_t \Gamma_t}] \end{bmatrix}, \qquad W[\mathbf{b}_t, \mathbf{u}_t] = \begin{bmatrix} \sqrt{K_t H_t \Gamma_t} \\ 0 \end{bmatrix}. \tag{18}$$

4.2 Value Iteration

We perform value iteration backward in time to find a locally optimal control policy. When using value iteration (dynamic programming) over discrete states one usually stores the value of each possible state. In the case of a continuous state this is not possible. Instead, we assume that we have an initial (nominal) trajectory given. For each time step $t$ we calculate an approximation of the value function around the state the robot is in at time step $t$ when following the nominal trajectory. As the value function at time step $t$ depends on the value function at time step $t + 1$, this is done in a backward iterative process starting at the final time step $l$. Using the approximated value function, we can also calculate an optimal policy for each time step. Using this optimal policy we generate a new nominal trajectory by starting at the initial state and applying this optimal policy forward in time. The process is then repeated using the new nominal trajectory, and ultimately converges to a locally optimal solution.

We use a belief-space variant of iterative LQG [25] to perform the value iteration. We approximate the value function $v_t[\mathbf{b}]$ as a quadratic function that is approximately valid around a given nominal trajectory in belief space.
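Let the nominal trajectory be given as a series of beliefs and control inputs $(\bar{\mathbf{b}}_0, \bar{\mathbf{u}}_0, \ldots, \bar{\mathbf{b}}_{l-1}, \bar{\mathbf{u}}_{l-1}, \bar{\mathbf{b}}_l)$ such that $\bar{\mathbf{b}}_{t+1} = \mathbf{g}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t]$ for $t \in 0 \ldots l-1$ (we will discuss initialization and iterative convergence of the nominal trajectory to a locally optimal trajectory in the next subsection). The value function is then approximated as:

$$v_t[\mathbf{b}] \approx \tfrac{1}{2} (\mathbf{b} - \bar{\mathbf{b}}_t)^T S_t (\mathbf{b} - \bar{\mathbf{b}}_t) + (\mathbf{b} - \bar{\mathbf{b}}_t)^T \mathbf{s}_t + s_t, \tag{19}$$

with $S_t \succeq 0$, where $\mathbf{s}_t$ is a vector and $s_t$ a scalar. For the final time step $t = l$, the value function $v_l$ (see Eq. (6)) is approximated by setting

$$S_l = \frac{\partial^2 c_l}{\partial \mathbf{b} \partial \mathbf{b}}[\bar{\mathbf{b}}_l], \qquad \mathbf{s}_l = \frac{\partial c_l}{\partial \mathbf{b}}[\bar{\mathbf{b}}_l], \qquad s_l = c_l[\bar{\mathbf{b}}_l], \tag{20}$$

which amounts to a second-order Taylor expansion of $c_l$ around the point $\bar{\mathbf{b}}_l$. The value functions and the control policies for the time steps $l > t \ge 0$ are computed by backward recursion.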

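We proceed by combining Eqs. (7), (17), and (19):

$$v_t[\mathbf{b}] = \min_{\mathbf{u}} \Big( c_t[\mathbf{b}, \mathbf{u}] + \mathop{\mathrm{E}}_{\mathbf{w}} \big[ v_{t+1}[\mathbf{g}[\mathbf{b}, \mathbf{u}] + W[\mathbf{b}, \mathbf{u}] \mathbf{w}] \big] \Big)$$

$$= \min_{\mathbf{u}} \Big( c_t[\mathbf{b}, \mathbf{u}] + \mathop{\mathrm{E}}_{\mathbf{w}} \big[ \tfrac{1}{2} (\mathbf{g}[\mathbf{b}, \mathbf{u}] + W[\mathbf{b}, \mathbf{u}] \mathbf{w} - \bar{\mathbf{b}}_{t+1})^T S_{t+1} (\mathbf{g}[\mathbf{b}, \mathbf{u}] + W[\mathbf{b}, \mathbf{u}] \mathbf{w} - \bar{\mathbf{b}}_{t+1}) + (\mathbf{g}[\mathbf{b}, \mathbf{u}] + W[\mathbf{b}, \mathbf{u}] \mathbf{w} - \bar{\mathbf{b}}_{t+1})^T \mathbf{s}_{t+1} + s_{t+1} \big] \Big)$$

$$= \min_{\mathbf{u}} \Big( c_t[\mathbf{b}, \mathbf{u}] + \tfrac{1}{2} (\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1})^T S_{t+1} (\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1}) + (\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1})^T \mathbf{s}_{t+1} + s_{t+1} + \tfrac{1}{2} \mathrm{tr}\big[ W[\mathbf{b}, \mathbf{u}]^T S_{t+1} W[\mathbf{b}, \mathbf{u}] \big] \Big) \tag{21}$$

$$= \min_{\mathbf{u}} \Big( c_t[\mathbf{b}, \mathbf{u}] + \tfrac{1}{2} (\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1})^T S_{t+1} (\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1}) + (\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1})^T \mathbf{s}_{t+1} + s_{t+1} + \tfrac{1}{2} \sum_{i=1}^{n} W^{(i)}[\mathbf{b}, \mathbf{u}]^T S_{t+1} W^{(i)}[\mathbf{b}, \mathbf{u}] \Big), \tag{22}$$

where $W^{(i)}[\mathbf{b}, \mathbf{u}]$ refers to the $i$-th column of matrix $W[\mathbf{b}, \mathbf{u}]$ (note that $W[\mathbf{b}, \mathbf{u}]$ has $n$ columns, where $n$ is the dimension of the state). The trace term in Eq. (21) follows from the fact that $\mathrm{E}[\mathbf{x}^T Q \mathbf{x}] = \mathrm{E}[\mathbf{x}]^T Q \, \mathrm{E}[\mathbf{x}] + \mathrm{tr}[Q \, \mathrm{Var}[\mathbf{x}]]$ for any stochastic variable $\mathbf{x}$, and that $\mathrm{tr}[Q X X^T] = \mathrm{tr}[X^T Q X]$. It is this term that ensures that the stochastic nature of the belief dynamics is accounted for in the value iteration. Eq. (22) follows from the fact that $\mathrm{tr}[X^T Q X] = \sum_i X^{(i)T} Q X^{(i)}$.

To approximate the optimal value of $\mathbf{u}$ as a function of $\mathbf{b}$ we linearize the belief dynamics and each of the columns of $W[\mathbf{b}, \mathbf{u}]$ about the belief $\bar{\mathbf{b}}_t$ and control input $\bar{\mathbf{u}}_t$ of the nominal trajectory. Given that $\bar{\mathbf{b}}_{t+1} = \mathbf{g}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t]$, we get:

$$\mathbf{g}[\mathbf{b}, \mathbf{u}] - \bar{\mathbf{b}}_{t+1} \approx F_t (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t (\mathbf{u} - \bar{\mathbf{u}}_t), \tag{23}$$

$$W^{(i)}[\mathbf{b}, \mathbf{u}] \approx \mathbf{e}_t^i + F_t^i (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t^i (\mathbf{u} - \bar{\mathbf{u}}_t), \tag{24}$$

where

$$F_t = \frac{\partial \mathbf{g}}{\partial \mathbf{b}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \qquad G_t = \frac{\partial \mathbf{g}}{\partial \mathbf{u}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \tag{25}$$

$$\mathbf{e}_t^i = W^{(i)}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \qquad F_t^i = \frac{\partial W^{(i)}}{\partial \mathbf{b}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \qquad G_t^i = \frac{\partial W^{(i)}}{\partial \mathbf{u}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t]. \tag{26}$$

The immediate cost function $c_t[\mathbf{b}, \mathbf{u}]$ is quadratized about $\bar{\mathbf{b}}_t$ and $\bar{\mathbf{u}}_t$:

$$c_t[\mathbf{b}, \mathbf{u}] \approx \frac{1}{2} \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix}^T \begin{bmatrix} Q_t & P_t^T \\ P_t & R_t \end{bmatrix} \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix} + \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix}^T \begin{bmatrix} \mathbf{q}_t \\ \mathbf{r}_t \end{bmatrix} + p_t, \tag{27}$$

where

$$Q_t = \frac{\partial^2 c_t}{\partial \mathbf{b} \partial \mathbf{b}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \quad R_t = \frac{\partial^2 c_t}{\partial \mathbf{u} \partial \mathbf{u}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \quad P_t = \frac{\partial^2 c_t}{\partial \mathbf{u} \partial \mathbf{b}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \quad \mathbf{q}_t^T = \frac{\partial c_t}{\partial \mathbf{b}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \quad \mathbf{r}_t^T = \frac{\partial c_t}{\partial \mathbf{u}}[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t], \quad p_t = c_t[\bar{\mathbf{b}}_t, \bar{\mathbf{u}}_t]. \tag{28}$$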
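The Jacobians $F_t$ and $G_t$ of Eq. (25) can be approximated numerically with central differences. The sketch below is a minimal illustration on plain Python lists; the function `jacobian`, the step size `h`, and the scalar-state belief dynamics `g` (belief `[x_hat, sqrt_sigma]`, one control) are hypothetical choices for this example, not the authors' implementation.

```python
def jacobian(func, x, h=1e-5):
    """Central-difference Jacobian: J[i][j] approximates d func_i / d x_j at x."""
    n_out = len(func(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(n_out):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J

def g(b, u):
    """Hypothetical deterministic belief dynamics for a scalar state."""
    x_hat, s = b
    x_pred = x_hat + 0.5 * u[0]
    gamma = s * s + 0.01                       # predicted variance
    n2 = (0.5 + 0.5 * x_pred ** 2) ** 2        # measurement noise variance
    return [x_pred, (gamma * n2 / (gamma + n2)) ** 0.5]

b_bar, u_bar = [2.0, 1.0], [-1.0]
F = jacobian(lambda b: g(b, u_bar), b_bar)     # approximates F_t
G = jacobian(lambda u: g(b_bar, u), u_bar)     # approximates G_t
```

Each Jacobian costs two evaluations of `g` per input dimension (plus one to size the output), which is why the number of belief-dynamics evaluations grows with the dimension of the belief.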

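Filling in Eqs. (23), (24), and (27) into Eq. (22), we get:

$$v_t[\mathbf{b}] \approx \min_{\mathbf{u}} \Big( \frac{1}{2} \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix}^T \begin{bmatrix} Q_t & P_t^T \\ P_t & R_t \end{bmatrix} \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix} + \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix}^T \begin{bmatrix} \mathbf{q}_t \\ \mathbf{r}_t \end{bmatrix} + p_t + \tfrac{1}{2} (F_t (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t (\mathbf{u} - \bar{\mathbf{u}}_t))^T S_{t+1} (F_t (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t (\mathbf{u} - \bar{\mathbf{u}}_t)) + (F_t (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t (\mathbf{u} - \bar{\mathbf{u}}_t))^T \mathbf{s}_{t+1} + s_{t+1} + \tfrac{1}{2} \sum_{i=1}^{n} (\mathbf{e}_t^i + F_t^i (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t^i (\mathbf{u} - \bar{\mathbf{u}}_t))^T S_{t+1} (\mathbf{e}_t^i + F_t^i (\mathbf{b} - \bar{\mathbf{b}}_t) + G_t^i (\mathbf{u} - \bar{\mathbf{u}}_t)) \Big)$$

$$= \min_{\mathbf{u}} \Big( \frac{1}{2} \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix}^T \begin{bmatrix} C_t & E_t^T \\ E_t & D_t \end{bmatrix} \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix} + \begin{bmatrix} \mathbf{b} - \bar{\mathbf{b}}_t \\ \mathbf{u} - \bar{\mathbf{u}}_t \end{bmatrix}^T \begin{bmatrix} \mathbf{c}_t \\ \mathbf{d}_t \end{bmatrix} + e_t \Big), \tag{29}$$

where

$$C_t = Q_t + F_t^T S_{t+1} F_t + \sum_{i=1}^{n} F_t^{iT} S_{t+1} F_t^i, \qquad \mathbf{c}_t = \mathbf{q}_t + F_t^T \mathbf{s}_{t+1} + \sum_{i=1}^{n} F_t^{iT} S_{t+1} \mathbf{e}_t^i, \tag{30}$$

$$D_t = R_t + G_t^T S_{t+1} G_t + \sum_{i=1}^{n} G_t^{iT} S_{t+1} G_t^i, \qquad \mathbf{d}_t = \mathbf{r}_t + G_t^T \mathbf{s}_{t+1} + \sum_{i=1}^{n} G_t^{iT} S_{t+1} \mathbf{e}_t^i, \tag{31}$$

$$E_t = P_t + G_t^T S_{t+1} F_t + \sum_{i=1}^{n} G_t^{iT} S_{t+1} F_t^i, \qquad e_t = p_t + s_{t+1} + \tfrac{1}{2} \sum_{i=1}^{n} \mathbf{e}_t^{iT} S_{t+1} \mathbf{e}_t^i. \tag{32}$$

Equation (29) is then solved by expanding the terms, taking the derivative with respect to $\mathbf{u}$ and equating to 0 (for $\mathbf{u}$ to be actually a minimum, $D_t$ must be positive-definite; given the assumptions of Eq. (11), this is necessarily the case). We then get the solution:

$$\mathbf{u} - \bar{\mathbf{u}}_t = -D_t^{-1} E_t (\mathbf{b} - \bar{\mathbf{b}}_t) - D_t^{-1} \mathbf{d}_t. \tag{33}$$

Hence, the control policy for time step $t$ is linear and given by:

$$\mathbf{u}_t = \pi_t[\mathbf{b}_t] = L_t (\mathbf{b}_t - \bar{\mathbf{b}}_t) + \mathbf{l}_t + \bar{\mathbf{u}}_t, \qquad L_t = -D_t^{-1} E_t, \qquad \mathbf{l}_t = -D_t^{-1} \mathbf{d}_t. \tag{34}$$

Filling Eq. (33) back into Eq. (29) gives the value function $v_t[\mathbf{b}]$ as a function of only $\mathbf{b}$ in the form of Eq. (19). Expanding and collecting terms gives:

$$S_t = C_t - E_t^T D_t^{-1} E_t, \qquad \mathbf{s}_t = \mathbf{c}_t - E_t^T D_t^{-1} \mathbf{d}_t, \qquad s_t = e_t - \tfrac{1}{2} \mathbf{d}_t^T D_t^{-1} \mathbf{d}_t. \tag{35}$$

This recursion then continues by computing a control policy for time step $t - 1$.

4.3 Iteration to a Locally Optimal Control Policy

The above value iteration gives a control policy that is valid in the vicinity of the given nominal trajectory. To let the control policy converge to a local optimum, we iteratively update the nominal trajectory using the most recent control policy [11]. Given the initial belief $\mathbf{b}_0 = (\hat{\mathbf{x}}_0, \sqrt{\Sigma_0})$, and an (arbitrary) initial nominal trajectory $(\bar{\mathbf{b}}_0^{(0)}, \bar{\mathbf{u}}_0^{(0)}, \ldots, \bar{\mathbf{b}}_{l-1}^{(0)}, \bar{\mathbf{u}}_{l-1}^{(0)}, \bar{\mathbf{b}}_l^{(0)})$ (such that $\bar{\mathbf{b}}_0^{(0)} = \mathbf{b}_0$ and $\bar{\mathbf{b}}_{t+1}^{(0)} = \mathbf{g}[\bar{\mathbf{b}}_t^{(0)}, \bar{\mathbf{u}}_t^{(0)}]$ for $t \in 0 \ldots l-1$), which can be obtained using RRT motion planning [13], for instance, we proceed as follows.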
Using the value iteration procedure as described above given the nominal trajectory of the $i$-th iteration, we find the control policy, i.e. the matrices $L_t^{(i)}$ and vectors $\mathbf{l}_t^{(i)}$ for the $i$-th iteration. We then compute the nominal trajectory $(\bar{\mathbf{b}}_t^{(i+1)}, \bar{\mathbf{u}}_t^{(i+1)})$ of the $(i+1)$-th iteration (starting with $i = 0$) by forward integrating the control policy in the deterministic (zero-noise) belief dynamics:

$$\bar{\mathbf{b}}_0^{(i+1)} = \mathbf{b}_0, \qquad \bar{\mathbf{u}}_t^{(i+1)} = L_t^{(i)} (\bar{\mathbf{b}}_t^{(i+1)} - \bar{\mathbf{b}}_t^{(i)}) + \mathbf{l}_t^{(i)} + \bar{\mathbf{u}}_t^{(i)}, \qquad \bar{\mathbf{b}}_{t+1}^{(i+1)} = \mathbf{g}[\bar{\mathbf{b}}_t^{(i+1)}, \bar{\mathbf{u}}_t^{(i+1)}]. \tag{36}$$

We then recompute the control policy, and reiterate. This lets the control policy converge to a locally optimal trajectory with a second-order convergence rate [14].

4.4 Ensuring Convergence

To ensure that the above algorithm in fact converges to a locally optimal control policy, the algorithm must be augmented with line search. As with Newton's method for finding roots of a function, second-order convergence of the above algorithm is only achieved if the current nominal trajectory is already close to the locally optimal trajectory. If the current nominal trajectory is far away from the local optimum, the approach may overshoot local minima, which significantly slows down convergence, or even results in divergence.

To address this issue, we subtly change the algorithm following [33]. We limit the increment to the control policy by adding a parameter $\varepsilon$ to Eq. (33): $(\mathbf{u}_t - \bar{\mathbf{u}}_t) = L_t (\mathbf{b}_t - \bar{\mathbf{b}}_t) + \varepsilon \mathbf{l}_t$. Initially, $\varepsilon = 1$, but each time that the control policy creates a trajectory with higher expected cost than the previous nominal trajectory, the trajectory is rejected, $\varepsilon$ is divided in half, and a new trajectory is created. When a new trajectory is accepted, $\varepsilon$ is reset to 1. This change is equivalent to using backtracking line search to limit the step size in Newton's method and guarantees convergence to a locally optimal control policy [33].

An issue that remains is how to compute the expected cost of a given nominal trajectory. In deterministic iLQG, one simply evaluates its cost using the given immediate cost functions $c_t[\mathbf{b}, \mathbf{u}]$. In our case however, the dynamics are stochastic, so one has to compute the expected cost. We do this as follows. Let $L_t^{(i)}$ and $\varepsilon \mathbf{l}_t^{(i)}$ define the control policy in the $i$-th iteration. A candidate nominal trajectory for iteration $i + 1$ is now generated by applying this control policy with respect to the nominal trajectory of iteration $i$, according to Eq. (36).
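The halving rule for $\varepsilon$ can be sketched in a few lines of Python. This is a toy illustration only: `candidate_cost` is a hypothetical callable standing in for "generate the candidate trajectory with step $\varepsilon$ and evaluate its expected cost", and the `max_halvings` guard is an assumption added here to keep the loop finite.

```python
def backtrack(current_cost, candidate_cost, max_halvings=30):
    """Halve eps until the candidate's expected cost drops below current_cost.

    Returns (eps, cost) for the accepted candidate, or None if no candidate
    was accepted within max_halvings trials.
    """
    eps = 1.0
    for _ in range(max_halvings):
        cost = candidate_cost(eps)
        if cost < current_cost:
            return eps, cost      # accept; the next iteration starts at eps = 1 again
        eps *= 0.5                # reject and retry with half the increment
    return None

# Toy stand-in for the expected cost: a full step (eps = 1) overshoots.
toy_cost = lambda eps: (1.0 - 3.0 * eps) ** 2
result = backtrack(current_cost=1.0, candidate_cost=toy_cost)
```

With this toy cost, the full step is rejected (cost 4.0 against a current cost of 1.0) and the half step is accepted (cost 0.25).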
For the candidate trajectory generated with step $\varepsilon$ we have:

$$\bar{\mathbf{u}}_t^{(i+1)} - \bar{\mathbf{u}}_t^{(i)} = L_t^{(i)} (\bar{\mathbf{b}}_t^{(i+1)} - \bar{\mathbf{b}}_t^{(i)}) + \varepsilon \mathbf{l}_t^{(i)}. \tag{37}$$

The control policy of iteration $i$ itself is defined as

$$\mathbf{u}_t - \bar{\mathbf{u}}_t^{(i)} = L_t^{(i)} (\mathbf{b}_t - \bar{\mathbf{b}}_t^{(i)}) + \varepsilon \mathbf{l}_t^{(i)},$$

$$(\mathbf{u}_t - \bar{\mathbf{u}}_t^{(i+1)}) + (\bar{\mathbf{u}}_t^{(i+1)} - \bar{\mathbf{u}}_t^{(i)}) = L_t^{(i)} \big( (\mathbf{b}_t - \bar{\mathbf{b}}_t^{(i+1)}) + (\bar{\mathbf{b}}_t^{(i+1)} - \bar{\mathbf{b}}_t^{(i)}) \big) + \varepsilon \mathbf{l}_t^{(i)},$$

$$(\mathbf{u}_t - \bar{\mathbf{u}}_t^{(i+1)}) + L_t^{(i)} (\bar{\mathbf{b}}_t^{(i+1)} - \bar{\mathbf{b}}_t^{(i)}) + \varepsilon \mathbf{l}_t^{(i)} = L_t^{(i)} \big( (\mathbf{b}_t - \bar{\mathbf{b}}_t^{(i+1)}) + (\bar{\mathbf{b}}_t^{(i+1)} - \bar{\mathbf{b}}_t^{(i)}) \big) + \varepsilon \mathbf{l}_t^{(i)},$$

$$\mathbf{u}_t - \bar{\mathbf{u}}_t^{(i+1)} = L_t^{(i)} (\mathbf{b}_t - \bar{\mathbf{b}}_t^{(i+1)}). \tag{38}$$

Hence, Eq. (38) gives the control policy of iteration $i$ relative to a candidate trajectory of iteration $i + 1$. We now compute the expected cost of the candidate nominal trajectory $(\bar{\mathbf{b}}_t^{(i+1)}, \bar{\mathbf{u}}_t^{(i+1)})$ as follows. Quadratizing the immediate cost functions and linearizing the belief dynamics about the candidate trajectory of iteration $i + 1$ according to Eqs. (23) to (28), in combination with the control policy of Eq. (38), allows us to recursively update the value function along the candidate trajectory as:

$$S_t = Q_t + L_t^T R_t L_t + L_t^T P_t + P_t^T L_t + (F_t + G_t L_t)^T S_{t+1} (F_t + G_t L_t) + \sum_{j=1}^{n} (F_t^j + G_t^j L_t)^T S_{t+1} (F_t^j + G_t^j L_t), \tag{39}$$

$$\mathbf{s}_t = \mathbf{q}_t + L_t^T \mathbf{r}_t + (F_t + G_t L_t)^T \mathbf{s}_{t+1} + \sum_{j=1}^{n} (F_t^j + G_t^j L_t)^T S_{t+1} \mathbf{e}_t^j, \tag{40}$$

$$s_t = p_t + s_{t+1} + \tfrac{1}{2} \sum_{j=1}^{n} \mathbf{e}_t^{jT} S_{t+1} \mathbf{e}_t^j, \tag{41}$$

where $L_t = L_t^{(i)}$. The value $s_0$ now gives the expected cost of the candidate nominal trajectory with respect to the control policy of the current nominal trajectory (note that the vectors $\mathbf{s}_t$ are inconsequential for the expected cost, and need not be computed). If this expected cost is lower than the expected cost of the current nominal trajectory, the candidate nominal trajectory is accepted, $\varepsilon$ is reset to 1, and the iteration continues. Otherwise, $\varepsilon$ is divided in half, and the search for a new nominal trajectory continues. Since the vectors $\mathbf{l}_t$ point in the direction of the gradient of the expected cost, a positive $\varepsilon$ that generates a new trajectory with lower expected cost will always be found. When the magnitudes of the vectors $\mathbf{l}_t$ vanish (or drop below a preset small value), the iteration stops, and the current nominal trajectory and its control policy form a locally optimal solution.

4.5 Running Time Analysis

Let us analyze the running time of our algorithm. The dimension of the state is $n$, and we assume for the sake of analysis that the dimensions of the control inputs and the measurements are $O[n]$. As the belief contains the (square root of the) covariance matrix of the state, the dimension of a belief is $O[n^2]$.

The bottleneck of the running time lies in the computation of the matrix $C_t$ in Eq. (30). Evaluating the product $F_t^T S_{t+1} F_t$ in Eq. (30) of matrices of $O[n^2] \times O[n^2]$ dimension takes $O[n^6]$ time. Also, computing the matrix $Q_t$ of Eq. (28), which contains $O[n^4]$ entries, using numerical differentiation (central differences) can be done in $O[n^6]$ time assuming that $c_t[\mathbf{b}, \mathbf{u}]$ can be evaluated in $O[n^2]$ time. Further, the product $F_t^{iT} S_{t+1} F_t^i$ is evaluated $n$ times, but each can be evaluated in $O[n^5]$ time, since each $F_t^i$ only contains non-zero entries in the upper $n \times O[n^2]$ block of the matrix (see the definition of $W[\mathbf{b}, \mathbf{u}]$ in Eq. (18)). Note that linearizing the belief dynamics, i.e. computing the matrices $F_t$, $G_t$, $F_t^i$ and $G_t^i$ using numerical differentiation (central differences), can be done in $O[n^5]$ time, as it involves evaluating the belief dynamics (which takes $O[n^3]$ time for the EKF, and also for the UKF) $O[n^2]$ times. Hence, this does not form a bottleneck of the computation.

A complete cycle of value iteration takes $l$ steps ($l$ being the index of the final time step), bringing the complexity to $O[l n^6]$. The number of such cycles needed to obtain convergence cannot be expressed in terms of $n$ or $l$, but as noted before, our algorithm converges with a second-order rate to a local optimum.

5 Environments with Obstacles

We presented our approach above for general immediate cost functions $c_l[\mathbf{b}]$ and $c_t[\mathbf{b}, \mathbf{u}]$ (with the assumptions of Eq. (11)). In typical LQG-style cost functions, the existence of obstacles in the environment is not incorporated, while we may want to minimize the probability of colliding with them. We incorporate obstacles into the cost functions as follows.

Figure 1: Plots of the function $f[\sigma] = -\log \gamma[n/2, \sigma^2/2]$ for $n = \{1, 2, 3\}$.

Let $\mathcal{O} \subset \mathcal{X}$ be the region of the state space that is occupied by obstacles. Given a belief $\mathbf{b}_t = (\hat{\mathbf{x}}_t, \sqrt{\Sigma_t})$, the probability of colliding with an obstacle is given by the integral over $\mathcal{O}$ of the probability-density function of $\mathcal{N}[\hat{\mathbf{x}}_t, \Sigma_t]$. As described in [26], this probability can be approximated by using a collision-checker to compute the number $\sigma[\mathbf{b}_t]$ of standard deviations one may deviate from the mean before an obstacle is hit (it takes one geometric distance computation to compute this number, and does not involve Monte Carlo sampling). A lower bound on the probability of not colliding is then given by $\gamma[n/2, \sigma[\mathbf{b}_t]^2/2]$, where $\gamma$ is the regularized gamma function, and $n$ the dimension of the state. A lower bound on the total probability of not colliding along a trajectory is subsequently computed as $\prod_{t=0}^{l-1} \gamma[n/2, \sigma[\mathbf{b}_t]^2/2]$, and this number should be maximized.

To fit this objective within the minimizing and additive nature of the POMDP objective function, we note that maximizing a product is equivalent to minimizing the sum of the negative logarithms of the factors. Hence, we add to $c_t[\mathbf{b}, \mathbf{u}]$ the term $f[\sigma[\mathbf{b}]] = -\log \gamma[n/2, \sigma[\mathbf{b}]^2/2]$ to account for the probability of colliding with obstacles (note that $f[\sigma[\mathbf{b}]] > 0$ and $\frac{\partial^2 f}{\partial \sigma \partial \sigma} > 0$; see Fig. 1), potentially multiplied by a scaling factor to allow trading off with respect to other costs (such as the magnitude of the control input).

While the above approach works well, it should be noted that in order to compute the Hessian of $c_t[\mathbf{b}, \mathbf{u}]$ at $\bar{\mathbf{b}}_t$ (i.e. computing the matrix $Q_t$ as is done in Eq. (28)), a total of $O[n^4]$ collision-checks with respect to the obstacles need to be performed, since the obstacle term $f[\sigma[\mathbf{b}]]$ is part of $c_t[\mathbf{b}, \mathbf{u}]$. As this can be prohibitively costly, we can instead approximate the Hessian of $f[\sigma[\mathbf{b}]]$ using linearizations, which involves only $O[n^2]$ collision checks.
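To this end, let us approximate $f[\sigma]$ by a second-order Taylor expansion about $\sigma[\bar{\mathbf{b}}_t]$:

$$f[\sigma[\mathbf{b}]] \approx \tfrac{1}{2} a (\sigma[\mathbf{b}] - \sigma[\bar{\mathbf{b}}_t])^2 + b (\sigma[\mathbf{b}] - \sigma[\bar{\mathbf{b}}_t]) + f[\sigma[\bar{\mathbf{b}}_t]], \tag{42}$$

where the scalars $a = \frac{\partial^2 f}{\partial \sigma \partial \sigma}[\sigma[\bar{\mathbf{b}}_t]]$ and $b = \frac{\partial f}{\partial \sigma}[\sigma[\bar{\mathbf{b}}_t]]$ (note that this requires only one collision-check). Now, we approximate $(\sigma[\mathbf{b}] - \sigma[\bar{\mathbf{b}}_t])$ using a first-order Taylor expansion about $\bar{\mathbf{b}}_t$:

$$\sigma[\mathbf{b}] - \sigma[\bar{\mathbf{b}}_t] \approx (\mathbf{b} - \bar{\mathbf{b}}_t)^T \mathbf{a}, \tag{43}$$

where the vector $\mathbf{a}^T = \frac{\partial \sigma}{\partial \mathbf{b}}[\bar{\mathbf{b}}_t]$ (note that this requires $O[n^2]$ collision-checks). By substituting Eq. (43) in Eq. (42), we get

$$f[\sigma[\mathbf{b}]] \approx \tfrac{1}{2} (\mathbf{b} - \bar{\mathbf{b}}_t)^T (a \mathbf{a} \mathbf{a}^T) (\mathbf{b} - \bar{\mathbf{b}}_t) + (\mathbf{b} - \bar{\mathbf{b}}_t)^T (b \mathbf{a}) + f[\sigma[\bar{\mathbf{b}}_t]]. \tag{44}$$

Hence, $a \mathbf{a} \mathbf{a}^T$ is an approximate Hessian of the obstacle term $f[\sigma[\mathbf{b}]]$ of $c_t[\mathbf{b}, \mathbf{u}]$ that requires only $O[n^2]$ collision-checks to compute. In addition, since $a > 0$, this Hessian is guaranteed to be positive-semidefinite, as mandated by Eq. (11).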
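For state dimensions $n = 1, 2, 3$ the regularized gamma function $\gamma[n/2, \sigma^2/2]$ has closed forms in terms of `erf` and `exp`, so the obstacle term can be sketched with the Python standard library alone. In the sketch below the function names, the finite-difference step `h`, and the use of finite differences for the scalars $a$ and $b$ of Eq. (42) are illustrative assumptions; $\sigma$ is assumed to have been computed already by a collision checker.

```python
import math

def prob_no_collision(n, sigma):
    """Regularized lower incomplete gamma gamma[n/2, sigma^2/2] for n in {1, 2, 3}."""
    x = 0.5 * sigma * sigma
    if n == 1:
        return math.erf(math.sqrt(x))
    if n == 2:
        return 1.0 - math.exp(-x)
    if n == 3:
        return math.erf(math.sqrt(x)) - 2.0 * math.sqrt(x / math.pi) * math.exp(-x)
    raise ValueError("closed form only implemented for n = 1, 2, 3")

def obstacle_cost(n, sigma):
    """f[sigma] = -log gamma[n/2, sigma^2/2]."""
    return -math.log(prob_no_collision(n, sigma))

def quadratic_coefficients(n, sigma_bar, h=1e-4):
    """Finite-difference estimates of a = f'' and b = f' at sigma_bar (Eq. (42))."""
    f0 = obstacle_cost(n, sigma_bar)
    fp = obstacle_cost(n, sigma_bar + h)
    fm = obstacle_cost(n, sigma_bar - h)
    return (fp - 2.0 * f0 + fm) / (h * h), (fp - fm) / (2.0 * h)

def approx_hessian(a, grad_sigma):
    """Rank-one matrix a * grad_sigma * grad_sigma^T from Eq. (44)."""
    return [[a * gi * gj for gj in grad_sigma] for gi in grad_sigma]

a, b = quadratic_coefficients(2, 2.0)
hessian = approx_hessian(a, [1.0, 2.0])   # hypothetical gradient of sigma w.r.t. b
```

Because the matrix is a positive scalar times an outer product of a vector with itself, it is symmetric and positive-semidefinite by construction, which is the property Eq. (11) asks for.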

6 Results

We evaluate our approach in simulation applied to robot motion planning scenarios involving stochastic dynamics, measurement models with state and control-dependent noise, and spatially-varying sensing capabilities. We consider three scenarios: (i) a 2-D point robot with linear dynamics, (ii) a non-holonomic, car-like robot with second-order dynamics, and (iii) an aircraft-like robot navigating in a 3-D environment.

Our method takes as input a collision-free trajectory to the goal. A naïve trajectory computed using an uncertainty-unaware planner might stray very close to the obstacles in the environment and accumulate considerable uncertainty during execution. We show that our method improves the input trajectory to compute a locally optimal trajectory and a corresponding control policy that safely guides the robot to the goal, even in the presence of large motion uncertainty and measurement noise.

In each of the following experiments, we use the following definitions of $c_l[\mathbf{b}_l]$ and $c_t[\mathbf{b}_t, \mathbf{u}_t]$ in the cost function to be minimized (Eq. (5)):

$$c_l[\mathbf{b}_l] = \hat{\mathbf{x}}_l^T Q_l \hat{\mathbf{x}}_l + \mathrm{tr}[\sqrt{\Sigma_l} Q_l \sqrt{\Sigma_l}], \tag{45}$$

$$c_t[\mathbf{b}_t, \mathbf{u}_t] = \mathbf{u}_t^T R_t \mathbf{u}_t + \mathrm{tr}[\sqrt{\Sigma_t} Q_t \sqrt{\Sigma_t}] + f[\sigma[\mathbf{b}_t]], \tag{46}$$

for given $Q_t \succeq 0$ and $R_t \succ 0$. The term $\hat{\mathbf{x}}_l^T Q_l \hat{\mathbf{x}}_l + \mathrm{tr}[\sqrt{\Sigma_l} Q_l \sqrt{\Sigma_l}] = \mathrm{E}[\mathbf{x}_l^T Q_l \mathbf{x}_l]$ encodes the final cost of arriving at the goal, $\mathbf{u}_t^T R_t \mathbf{u}_t$ penalizes the control effort along the trajectory, $\mathrm{tr}[\sqrt{\Sigma_t} Q_t \sqrt{\Sigma_t}]$ penalizes the uncertainty, and $f[\sigma[\mathbf{b}_t]]$ encodes the obstacle cost term (if applicable). Using the approximation of Eq. (44) for $f[\sigma[\mathbf{b}_t]]$, the above cost functions are in accordance with the assumptions of Eq. (11), and their Hessians can be constructed in $O[n^4]$ time, so this does not present a bottleneck for the running time.

All the performance results presented in this section are based on a C++ implementation running on a 3.33 GHz Intel i7 PC. For each scenario, we evaluate the performance of our approach and the quality of the computed control policy. We also separately consider environments with and without obstacles to demonstrate that our approach can handle both types of environments.
We compare and analyze the performance and convergence characteristics of the approach presented in this paper with those of our preliminary approach based on stochastic differential dynamic programming (sDDP) [28]. We also analyze the effect of assuming maximum-likelihood observations [21, 7, 8] on the computed locally optimal trajectory and corresponding control policy.

6.1 2-D Point Robot

We consider the case of a point robot moving in a 2-D environment with the following linear dynamics model with control-dependent motion noise:

$$\mathbf{x}_{t+1} = f[\mathbf{x}_t, \mathbf{u}_t, \mathbf{m}_t] = \mathbf{x}_t + \tau \mathbf{u}_t + M[\mathbf{u}_t] \, \mathbf{m}_t, \tag{47}$$

where the state $\mathbf{x} = (x, y) \in \mathbb{R}^2$ is the robot's position, the control input $\mathbf{u} \in \mathbb{R}^2$ is the robot's velocity, $\tau$ is the duration of a time step, and the matrix $M[\mathbf{u}_t]$ scales the motion noise $\mathbf{m}_t$ proportional to the control input $\mathbf{u}_t$. The robot localizes itself using noisy measurements from sensors in the environment, the reliability of which varies as a function of the robot's position $\mathbf{x}$. The robot is able to obtain reliable measurements in the bright region of the environment, but the measurements become noisier as the robot moves into the dark regions. This gives the following linear observation model with spatially-varying noise:

$$\mathbf{z}_t = h[\mathbf{x}_t, \mathbf{n}_t] = \mathbf{x}_t + N[\mathbf{x}_t] \, \mathbf{n}_t, \tag{48}$$

where the measurement vector $\mathbf{z}_t \in \mathbb{R}^2$ consists of noisy measurements of the robot's position and the matrix $N[\mathbf{x}_t]$ scales the measurement noise based on a function of the robot's position. We use state and control cost matrices of $Q_t = I$, $R_t = I$, and the final cost matrix $Q_\ell = 10 \ell I$ in our experiments, where $\ell$ is the number of sections in the initial trajectory. The method is considered converged when the difference in expected cost between successive iterations falls below a user-specified threshold.

Figure 2: Point robot moving in a 2-D light-dark domain without obstacles (adapted from Platt et al. [21]). (a) Initial trajectory: the method is initialized with a naïve straight-line trajectory to the goal. (b) Locally optimal solution: the nominal trajectory and associated beliefs of the solution (shown in black), and the trajectory obtained by applying the computed feedback policy to a robot with an initial belief that is considerably different from the initial belief used for computing the control policy (shown in red).

6.1.1 Light-Dark Domain (No Obstacles)

We consider the light-dark domain scenario suggested by Platt et al. [21]. The measurement noise (modeled by the matrix $N[\mathbf{x}_t]$) varies as a quadratic function of the robot's horizontal coordinate $x$ (as shown in Fig. 2). We initialize our method with a naïve straight-line trajectory from the initial position to the goal (Fig. 2(a)). Fig. 2(b) shows the nominal trajectory and the associated beliefs of the solution computed using our approach. The locally optimal nominal trajectory leads the robot to the horizontal coordinate where the measurement noise is minimum, in order to better localize itself, before proceeding to the goal. For this example, the initial nominal trajectory has an expected cost of 49.7, and the trajectory converges to a (local) optimum with an expected cost of 9.61 in 42 iterations, requiring a total computation time of […] seconds. To evaluate the quality of the computed control policy, we also computed the actual expected costs across simulation runs that use the computed feedback policy to compensate for artificial motion and measurement noise.
The actual expected cost for the computed control policy was 9.46 units. To demonstrate the effectiveness of the control policy computed by our method, we apply the computed feedback policy to a robot with a belief that is considerably different from the belief with which our method is initialized. The resulting trajectory is indicated in red in Fig. 2(b). The computed policy initially leads the robot towards the light region and quickly rectifies the trajectory once a better estimate of the belief is obtained there. The basin of attraction of the control policy is wide enough to avoid the need for replanning.

6.1.2 Light-Dark Domain (With Obstacles)

We consider the light-dark domain scenario with obstacles as suggested in Bry and Roy [5]. In this scenario, the measurement noise (modeled by the matrix $N[\mathbf{x}_t]$) varies as a sigmoid function

of the robot's horizontal coordinate $x$ (as shown in Fig. 3). We initialize our method with a collision-free initial trajectory computed using an RRT planner [13] (Fig. 3(a)). Fig. 3(b) shows the nominal trajectory and the associated beliefs of the solution computed using our approach. The nominal trajectory leads the robot to the region of the environment with reliable sensing for better localization, before moving the robot through the narrow passage to arrive at the goal. For this example, the initial trajectory has an expected cost of […] and the trajectory converges to a local optimum with an expected cost of […] in 66 iterations, which requires a total computation time of […] seconds.

Figure 3: Point robot moving in a 2-D light-dark domain with obstacles. (a) Initial trajectory: an initial collision-free trajectory is computed using an RRT planner. (b) Locally optimal solution: nominal trajectory and the associated beliefs of the solution computed using our approach. The robot moves away from the goal to better localize itself before reaching the goal with significantly reduced uncertainty. (c) Execution traces of the robot's true state drawn from different initial beliefs while following the computed control policy.

To evaluate the quality of the computed control policy, we also computed the actual expected costs across simulation runs that use the computed feedback policy to compensate for artificial motion and measurement noise. The actual expected cost was 13.8 units. To demonstrate the effectiveness of the control policy computed by our method, we apply the computed feedback policy to a robot with a belief that is considerably different from the belief with which our method is initialized. Fig. 3(c) shows traces of the true state $\mathbf{x}$ of the robot across 5 simulations, where the initial state $\mathbf{x}_0$ of the robot is sampled from a different initial belief to evaluate the robustness of the control policy. Even if the initial belief is considerably different from the initial belief used to compute the solution, the control policy is able to safely guide the robot to the goal.
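The point-robot models of Eqs. (47) and (48) that underlie such simulation runs can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' C++ implementation: the exact forms of $M[\mathbf{u}_t]$ and $N[\mathbf{x}_t]$ are not given in the text, so the isotropic scaling by `noise_gain`, the sigmoid parameters (`x_light`, `steepness`, `n_min`, `n_max`), and the choice of which side of the domain is well lit are all hypothetical.

```python
import math
import random


def motion_step(x, u, tau, rng, noise_gain=0.1):
    """One step of Eq. (47), assuming M[u] = noise_gain * ||u|| * I."""
    scale = noise_gain * math.hypot(u[0], u[1])
    return (x[0] + tau * u[0] + scale * rng.gauss(0.0, 1.0),
            x[1] + tau * u[1] + scale * rng.gauss(0.0, 1.0))


def noise_scale(x_coord, x_light=5.0, steepness=1.0, n_min=0.05, n_max=2.0):
    """Hypothetical sigmoid: near n_max for x << x_light, near n_min for x >> x_light."""
    return n_min + (n_max - n_min) / (1.0 + math.exp(steepness * (x_coord - x_light)))


def observe(x, rng):
    """Eq. (48), assuming N[x] = noise_scale(x) * I."""
    s = noise_scale(x[0])
    return (x[0] + s * rng.gauss(0.0, 1.0), x[1] + s * rng.gauss(0.0, 1.0))


# Roll the true state forward under a constant control and collect measurements.
rng = random.Random(0)
state = (0.0, 0.0)
measurements = []
for _ in range(10):
    state = motion_step(state, (1.0, 0.0), tau=0.5, rng=rng)
    measurements.append(observe(state, rng))
```

In a full evaluation, the constant control would be replaced by the computed feedback policy applied to the belief maintained by the extended Kalman filter.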
We also evaluated our method quantitatively by computing the percentage of executions in which the robot was able to avoid obstacles across 1000 simulation executions for 10 random initial beliefs. In 93% (standard deviation: 3%) of the executions, the robot was able to safely traverse the narrow passage without colliding with obstacles. Our solution also agrees with the solution found by Bry and Roy [5] for this experiment. Our method directly optimizes the trajectory rather than relying on an optimal sampling-based planner in belief space, resulting in computation times that are an order of magnitude faster. Our method also does not assume fixed control gains along each section of the nominal trajectory. However, the method of Bry and Roy is able to find a globally optimal solution (given the fixed control gains), whereas our method computes a locally optimal solution given an initial trajectory.

6.2 Non-Holonomic Car-Like Robot

We consider the case of a non-holonomic car-like robot navigating in a 2-D environment with noisy and partial sensing of the robot's state. The state $\mathbf{x} = (x, y, \theta, v) \in \mathbb{R}^4$ of the robot consists of its position $(x, y)$, its orientation $\theta$, and its speed $v$. The control input vector $\mathbf{u} = (a, \phi)$

consists of an acceleration $a$ and the steering wheel angle $\phi$. This gives the following non-linear dynamics model:

$$\mathbf{x}_{t+1} = f[\mathbf{x}_t, \mathbf{u}_t, \mathbf{m}_t] = \begin{bmatrix} x + \tau v \cos\theta \\ y + \tau v \sin\theta \\ \theta + \tau v \tan[\phi]/d \\ v + \tau a \end{bmatrix} + M[\mathbf{u}_t] \, \mathbf{m}_t, \tag{49}$$

where $\tau$ is the duration of a time step, $d$ is the length of the car-like robot, and $M[\mathbf{u}_t]$ scales the motion noise $\mathbf{m}_t$ proportional to the control input $\mathbf{u}_t$.

Figure 4: Car-like robot moving in a 2-D light-dark domain without obstacles (adapted from Platt et al. [21]). (a) Initial trajectory: the method is initialized with a naïve trajectory to the goal computed using an RRT planner. (b) Locally optimal solution: the nominal trajectory and associated beliefs of the solution computed using our approach (shown in black), and the trajectory obtained by applying the computed feedback policy to a robot with a belief that is considerably different from the belief used for method initialization (shown in red).

6.2.1 Light-Dark Domain (No Obstacles)

We again consider the light-dark domain scenario suggested by Platt et al. [21]. In this scenario, the robot's ability to sense its state is both partial (the robot is only capable of sensing its position but not its velocity or orientation) and noisy. The measurement noise (modeled by the matrix $N[\mathbf{x}_t]$) varies as a quadratic function of the robot's horizontal coordinate $x$ (as shown in Fig. 4). This gives the following observation model with spatially-varying noise:

$$\mathbf{z}_t = h[\mathbf{x}_t, \mathbf{n}_t] = \begin{bmatrix} x \\ y \end{bmatrix} + N[\mathbf{x}_t] \, \mathbf{n}_t, \tag{50}$$

where the measurement vector $\mathbf{z}_t \in \mathbb{R}^2$ consists of noisy measurements of the robot's position, and the matrix $N[\mathbf{x}_t]$ scales the measurement noise based on a function of the robot's horizontal coordinate $x$. We initialize our method with a naïve trajectory to the goal computed using an RRT planner [13] (Fig. 4(a)). We use state and control cost matrices of $Q_t = I$, $R_t = I$, and the final cost matrix $Q_\ell = 10 \ell I$ in our experiments, where $\ell$ is the number of sections in the initial trajectory. Fig. 4(b) shows the nominal trajectory and the associated beliefs of the solution computed using our approach.
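The car-like models of Eqs. (49) and (50) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the `noise` arguments stand for the already-scaled terms $M[\mathbf{u}_t]\mathbf{m}_t$ and $N[\mathbf{x}_t]\mathbf{n}_t$, whose exact form is not specified in the text.

```python
import math


def car_step(state, control, tau, d, noise=(0.0, 0.0, 0.0, 0.0)):
    """One step of Eq. (49); `noise` is the already-scaled term M[u] m."""
    x, y, theta, v = state
    a, phi = control
    return (x + tau * v * math.cos(theta) + noise[0],
            y + tau * v * math.sin(theta) + noise[1],
            theta + tau * v * math.tan(phi) / d + noise[2],
            v + tau * a + noise[3])


def observe_position(state, noise=(0.0, 0.0)):
    """Eq. (50): only the position is sensed; `noise` is the scaled term N[x] n."""
    x, y, _theta, _v = state
    return (x + noise[0], y + noise[1])


# Drive straight for one step, then steer left for one step (noise-free).
s0 = (0.0, 0.0, 0.0, 1.0)
s1 = car_step(s0, (0.0, 0.0), tau=0.5, d=1.0)
s2 = car_step(s1, (0.0, 0.3), tau=0.5, d=1.0)
z2 = observe_position(s2)
```

The observation returns only two of the four state components, which is why orientation and speed must be inferred by the filter from successive position measurements.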
The locally optimal nominal trajectory leads the robot to the horizontal coordinate where the measurement noise is minimum, in order to better localize itself, before proceeding to the goal. For this example, the initial trajectory has an expected cost of […] and the trajectory converges to a local optimum with an expected cost of 7.6 in 81 iterations, which requires a total computation time of 2.07 seconds. We also apply the computed feedback policy to a robot with a belief that is considerably different from the belief with which our method is initialized. The resulting trajectory is shown


in red in Fig. 4(b). Since the belief is considerably different from the assumed belief used for method initialization, the control policy leads the robot to mimic the computed nominal trajectory, but once the robot has localized itself in the light region of the environment, the control policy reliably leads the robot to the goal.

6.2.2 Domain With Spatially Varying Sensing (With Obstacles)

Figure 5: A car-like robot moving in a 2-D light-dark domain with obstacles. The robot obtains measurements from two beacons (marked by blue squares) and an on-board speedometer. (a) Initial trajectory: an initial collision-free trajectory is computed using an RRT planner. (b) Locally optimal solution: nominal trajectory and the associated beliefs of the solution computed using our approach. The robot localizes itself by moving closer to the beacon(s) before reaching the goal. The final nominal trajectory also follows the medial axis of the narrow passage to minimize the possibility of colliding with obstacles. (c) Solution for $Q_t = 10I$: nominal trajectory computed by varying the cost matrices. The robot tries to reduce the uncertainty in its state by visiting both beacons. (d) Initial trajectory: a different initial trajectory results in a different locally optimal solution. (e) Locally optimal solution: our method is able to improve trajectories within a single homotopy class.

We also consider a scenario in which the car-like robot estimates its location using signal measurements from two beacons $b_1$ and $b_2$ placed in the environment at locations $(x_1, y_1)$ and $(x_2, y_2)$, respectively. The strength of the signal decays quadratically with the distance to the beacon. The robot also measures its current speed using an on-board speedometer. The measurement uncertainty is scaled by a constant matrix $N$. This gives us the following non-linear observation model:

$$\mathbf{z}_t = h[\mathbf{x}_t, \mathbf{n}_t] = \begin{bmatrix} 1/((x - x_1)^2 + (y - y_1)^2 + 1) \\ 1/((x - x_2)^2 + (y - y_2)^2 + 1) \\ v \end{bmatrix} + N \mathbf{n}_t, \tag{51}$$

where the vector $\mathbf{z}_t \in \mathbb{R}^3$ consists of two readings of signal strengths from the beacons and a speed measurement from the speedometer. Fig.
5(a) visually illustrates the quadratic decay in the beacon signal strengths in the environment. The robot is able to obtain reliable measurements in the bright regions of the environment, but the measurements become relatively noisier as the robot moves into the dark regions due to the decreased signal-to-noise ratio.
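The noise-free part of the beacon observation model of Eq. (51) is simple to evaluate, and doing so makes the signal-to-noise argument concrete. The following is a minimal sketch with a hypothetical function name and made-up beacon locations; the `noise` argument stands for the already-scaled term $N\mathbf{n}_t$.

```python
def beacon_observation(state, beacons, noise=(0.0, 0.0, 0.0)):
    """Eq. (51): two beacon signal strengths and the speed, plus scaled noise N n."""
    x, y, _theta, v = state
    signals = [1.0 / ((x - bx) ** 2 + (y - by) ** 2 + 1.0) for bx, by in beacons]
    return (signals[0] + noise[0], signals[1] + noise[1], v + noise[2])


# Hypothetical beacon locations (x_1, y_1) and (x_2, y_2).
beacons = [(0.0, 0.0), (3.0, 4.0)]

# At the first beacon the signal is 1; five units from the second it is 1/26.
z = beacon_observation((0.0, 0.0, 0.0, 2.0), beacons)
```

Since the additive noise has constant scale while the signal shrinks with distance, the same noise level corrupts a much smaller signal far from the beacons, which is what makes those regions effectively dark.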

Figure 6: An aircraft-like robot with omni-directional acceleration moving in a 3-D environment with obstacles with partial and noisy sensing. The motion uncertainty is considerably lower at higher altitudes (indicated by the yellow region). (a) Initial trajectory: an initial collision-free trajectory is computed using an RRT planner. (b) Locally optimal solution: nominal trajectory and the associated beliefs of the solution computed using our approach. The nominal trajectory is locally optimized such that the robot spends a large proportion of the trajectory at higher altitudes to reduce uncertainty, before reaching the goal. (c) Different trajectory: a different trajectory initialization results in local improvement within its initial homotopy class, resulting in a locally optimal nominal trajectory (d).

We initialize our method with a collision-free trajectory to the goal computed using an RRT planner [13] (Fig. 5(a)). We use state and control cost matrices of $Q_t = I$, $R_t = I$, and the final cost matrix $Q_\ell = 10 \ell I$ in our experiments. Fig. 5(b) shows the nominal trajectory and the associated beliefs of the solution computed using our approach. The nominal trajectory leads the robot to the region of the environment with reliable sensing for better localization, before moving the robot through the narrow passage to arrive at the goal. In contrast to the initial trajectory (Fig. 5(a)), the locally optimal trajectory also moves away from the obstacles and takes a safer path to the goal. For this example, the initial trajectory has an expected cost of […] and the trajectory converges to a local optimum with an expected cost of […] in 19 iterations, which requires a total computation time of 9.57 seconds.

The cost matrices $Q_t$ and $R_t$ determine the relative weighting between minimizing uncertainty in the robot state and minimizing control effort in the objective function. Fig. 5(c) shows the nominal trajectory of the solution computed by changing the cost matrix to $Q_t = 10I$. Notice that the trajectory visits both beacons for better localization and reduced uncertainty, at the expense of additional control effort. Figs.
5(d) and 5(e) show the nominal trajectory when a different initial trajectory is provided as input. The presence of obstacles in the environment forces our method to locally optimize trajectories within a single homotopy class.

6.3 3-D Aircraft

We consider the case of an aircraft-like robot with partial and noisy sensing maneuvering in a 3-D environment with obstacles. We consider a simplified model of an aircraft that has omni-directional acceleration. This model can be used to approximate the kinematic constraints on the aircraft as long as the aircraft is moving with non-zero speed [27]. The state $\mathbf{x} = (x, y, z, v_x, v_y, v_z) \in \mathbb{R}^6$ of the robot consists of its position $\mathbf{p} = (x, y, z)$ and its velocity $\mathbf{v} = (v_x, v_y, v_z)$. The control input vector $\mathbf{u} = (a_x, a_y, a_z)$ consists of the omni-directional acceleration applied to the robot. This gives the following dynamics model:

$$\mathbf{x}_{t+1} = f[\mathbf{x}_t, \mathbf{u}_t, \mathbf{m}_t] = \begin{bmatrix} \mathbf{p}_t + \tau \mathbf{v}_t + \tfrac{1}{2} \tau^2 \mathbf{u}_t \\ \mathbf{v}_t + \tau \mathbf{u}_t \end{bmatrix} + M[\mathbf{p}_t] \, \mathbf{m}_t, \tag{52}$$

where $\tau$ is the duration of a time step, and $M[\mathbf{p}_t]$ scales the motion noise $\mathbf{m}_t$ based on the robot's position $\mathbf{p}_t$. We set motion uncertainty to be much lower at higher altitudes, approximately modeling the effect of atmospheric and weather conditions on the robot motion. The uncertainty steadily increases as the altitude of the robot decreases (Fig. 6). We also assume the following stochastic observation model based on partial and noisy sensing:

$$\mathbf{z}_t = h[\mathbf{x}_t, \mathbf{n}_t] = \mathbf{p}_t + N \mathbf{n}_t, \tag{53}$$

where the measurement vector $\mathbf{z}_t \in \mathbb{R}^3$ consists of noisy measurements of the robot's position, and the measurement noise is scaled by a constant matrix $N$.

We initialize our method with a collision-free trajectory to the goal computed using an RRT planner [13] (Fig. 6(a)). Fig. 6(b) shows the nominal trajectory and the associated beliefs of the solution computed using our approach. The robot spends a considerable proportion of the nominal trajectory at higher altitudes in order to reduce the uncertainty, before arriving at the goal. In contrast to the initial trajectory (Fig. 6(a)), the locally optimal trajectory is also smoother in terms of the applied control inputs and stays away from the obstacles to take a safer path to the goal. For this example, the initial trajectory has an expected cost of […] and the trajectory converges to a local optimum with a considerably lower expected cost of […] in 47 iterations, which requires a total computation time of 41.8 seconds. Figs. 6(c) and 6(d) show the nominal trajectory when a different initial trajectory is provided as input. The presence of obstacles in the environment forces our method to locally optimize trajectories within a single homotopy class. Our method is still able to locally force the robot to ascend to a higher altitude to reduce the uncertainty, before descending and going around the obstacle to arrive at the goal.

6.4 Comparison between iLQG and sDDP

We quantitatively compared our approach, which uses value iteration based on iLQG, with our preliminary approach, which uses value iteration based on stochastic differential dynamic programming (sDDP) [28].
In Table 1, we compare the number of iterations required for convergence and the optimal expected cost for each of the considered scenarios for both methods. Qualitatively, the iLQG-based method is asymptotically faster than the sDDP-based method ($O[n^6]$ rather than $O[n^7]$) and numerically more stable, even when the sDDP method is implemented with the square root of the variance in the belief (sDDP requires regularization of matrices to maintain positive-semidefiniteness of the value function). As expected, each iteration of the iLQG method ($O[n^6]$) takes less time than an equivalent sDDP iteration ($O[n^7]$). The differences are more pronounced as the dimensionality of the belief space increases, as is evident in the aircraft scenario. On the other hand, sDDP converges in fewer iterations than iLQG. This is because sDDP uses direct computation of the Hessians of the value function, while iLQG computes the Hessians based on a linearization of the belief dynamics (which truncates some second-order terms compared to sDDP).

In all experiments, iLQG and sDDP yield almost identical solutions, whose difference is hardly appreciable visually, and the optimal expected costs that iLQG and sDDP converge to are almost identical. To evaluate the difference between the two methods, we also compute the actual expected costs across simulation runs that use the computed feedback policy to compensate for artificially simulated motion uncertainty and measurement noise. The differences in the actual expected costs are minimal, which indicates that the control policies computed by the two methods are similar. This is what one would expect; the slight differences that do appear are a result of numerical variations between the methods, and in a few cases this causes the approaches to converge to different local optima. Overall, our experiments indicate that iLQG is preferable over sDDP because it scales better to higher dimensional problems and is numerically more stable since the iLQG method

does not require regularization to ensure that the Hessians are positive semi-definite [28]. The inherent complexity of the method is still too high for robots with complex dynamics and high-dimensional state spaces, and algorithmic improvements in the method and efficient implementations thereof present interesting research directions.

Table 1: Comparison of iLQG and sDDP. Columns: Scenario; Initial exp. cost; Method; Num. iter.; Time (s); Time per iter. (s); Optimal exp. cost; Actual exp. cost. Rows: Point (no obs), Point (obs), Car (no obs), Car (obs), and Aircraft, each reported for iLQG and sDDP.

6.5 Effect Of Assuming Maximum-Likelihood Observations

We analyze the effect of the maximum-likelihood observation assumption made in prior work [21, 7, 8] on the computed locally optimal trajectory and corresponding control policy. We reproduce this assumption in our method by ignoring all the terms in the value iteration that pertain to the matrix $W[\mathbf{b}, \mathbf{u}]$, which determines the stochastic nature of the belief dynamics given by Eq. (17). More specifically, we can reproduce the assumption by removing the terms containing the sum-quantifiers in Eqs. (30), (31), and (32). The net result is that the belief dynamics are treated as deterministic, as is the case when maximum-likelihood observations are assumed.

We consider an illustrative example of a point robot moving in a 2-D domain with obstacles, as shown in Fig. 7(a). We consider the same stochastic dynamics model for the robot as in Sec. […]. We also consider the light-dark domain scenario suggested by Platt et al. [21], where the measurement noise varies as a quadratic function of the robot's horizontal coordinate $x$ (as shown in Fig. 7(a)). We use state and control cost matrices of $Q_t = I$, $R_t = 3I$, and the final cost matrix $Q_\ell = 10I$ in our experiments. We computed 100 random trajectories using an RRT planner [13] and used the trajectories to initialize our method with and without assuming maximum-likelihood observations. In the case of maximum-likelihood observations, the mean initial cost is […] units with a standard deviation of 35 units.
The mean final cost at convergence is 17.7 units with a standard deviation of 1.5 units. It is important to note that the final cost is based on deterministic belief dynamics and is exactly known. We also computed the final expected cost of the computed control policy using value iteration assuming stochastic belief dynamics, as outlined in Sec. […]. The mean expected cost of the policy at convergence is 23.1 units with a standard deviation of 2 units. This indicates that there is a mismatch between the final cost assuming deterministic belief dynamics and the actual expected cost of the computed policies when executed under motion and sensing uncertainty. We also ran our method on the same 100 trajectories without assuming maximum-likelihood observations. The mean initial cost is 35,371 units with a standard deviation of 41,522 units, while the mean expected cost at convergence is 21.3 units with a standard deviation of 1.9 units. For this scenario, our method, which does not assume maximum-likelihood observations, yielded an average expected cost 8.5% better than the method making the maximum-likelihood


assumption. To demonstrate the effectiveness of the control policy computed with and without assuming maximum-likelihood observations, we evaluated each control policy quantitatively by computing the percentage of executions in which the robot was able to avoid obstacles across simulation executions assuming artificial motion and measurement noise. In our experiments, the control policies computed assuming maximum-likelihood observations result in an average of 324 collisions (standard deviation: 87), while the control policies computed by our method result in an average of 252 collisions (standard deviation: 72). This demonstrates that not assuming maximum-likelihood observations reduces the number of collisions by approximately 25% for the considered scenario.

Figure 7: An illustrative example that considers a point robot moving in a 2-D domain with obstacles. (a) Initial trajectory: an initial collision-free trajectory is computed using an RRT planner. (b) With maximum-likelihood assumption: nominal trajectory and the associated beliefs of the solution computed using our method under the assumption of maximum-likelihood observations. The optimization results in a nominal trajectory that does not lead the robot all the way to the horizontal coordinate where the measurement noise is minimum. (c) Without maximum-likelihood assumption: solution computed without making the maximum-likelihood observation assumption. The optimization is able to find a different locally optimal trajectory and policy that allows the robot to localize itself with greater certainty before arriving at the goal region with reduced uncertainty.

We visualize the difference between the two cases in Figs. 7(b) and 7(c) using an illustrative example from the 100 random scenarios considered in our experiments. As shown in Fig. 7(b), the nominal trajectory for the case in which we assume maximum-likelihood observations does not lead the robot all the way to the horizontal coordinate where the measurement noise is minimum.
This results in a higher expected cost of 24.4 units at convergence and higher uncertainty in the state of the robot as the robot traverses the narrow passage. In contrast, the solution computed without making the maximum-likelihood observation assumption is able to find a different locally optimal trajectory and policy that allows the robot to localize itself with greater certainty before arriving at the goal region with reduced uncertainty (see Fig. 7(c)). The expected cost at convergence in this case is 16.9 units. We note that a lower expected cost is not guaranteed: among the 100 random initial trajectories there are also cases in which the solution computed with the maximum-likelihood assumption has a better expected cost than the solution computed without the assumption. As in the scenario of the figure, this is very likely the result of the two methods converging to different local optima.

Overall, our results indicate that not making the maximum-likelihood assumption gives, on average, better control policies. However, depending on the application, the impact of the assumption may be relatively limited. This raises the question of whether the assumption can be formally justified and its negative impact bounded. In the case of our method, making the
