arxiv: v3 [cs.sy] 28 Oct 2015


Model Predictive Path Integral Control using Covariance Variable Importance Sampling

Grady Williams, Andrew Aldrich, and Evangelos A. Theodorou

arXiv:1509.01149v3 [cs.SY] 28 Oct 2015

Abstract: In this paper we develop a Model Predictive Path Integral (MPPI) control algorithm based on a generalized importance sampling scheme and perform parallel optimization via sampling using a Graphics Processing Unit (GPU). The proposed generalized importance sampling scheme allows for changes in the drift and diffusion terms of stochastic diffusion processes and plays a significant role in the performance of the model predictive control algorithm. We compare the proposed algorithm in simulation with a model predictive control version of differential dynamic programming.

I. INTRODUCTION

The path integral optimal control framework [7], [15], [16] provides a mathematically sound methodology for developing optimal control algorithms based on stochastic sampling of trajectories. The key idea in this framework is that the value function for the optimal control problem is transformed using the Feynman-Kac lemma [2], [8] into an expectation over all possible trajectories, which is known as a path integral. This transformation allows stochastic optimal control problems to be solved with a Monte-Carlo approximation using forward sampling of stochastic diffusion processes.

There have been a variety of algorithms developed in the path integral control setting. The most straightforward application of path integral control is when the iterative feedback control law suggested in [15] is implemented in its open loop formulation. This requires that sampling takes place only from the initial state of the optimal control problem. A more effective approach is to use the path integral control framework to find the parameters of a feedback control policy. This can be done by sampling in policy parameter space; these methods are known as Policy Improvement with Path Integrals [14]. Another approach to finding the parameters of a policy is to attempt to directly sample from the optimal distribution defined by the value function [3].
Other methods along similar threads of research include [], [17].

Another way that the path integral control framework can be applied is in a model predictive control setting. In this setting an open-loop control sequence is constantly optimized in the background while the machine is simultaneously executing the best guess that the controller has. An issue with this approach is that many trajectories must be sampled in real-time, which is difficult when the system has complex dynamics. One way around this problem is to drastically simplify the system under consideration by using a hierarchical scheme [4], and use path integral control to generate trajectories for a point mass which is then followed by a low level controller. Even though this approach may be successful for certain applications, it is limited in the kinds of behaviors that it can generate since it does not consider the full non-linearity of dynamics. A more efficient approach is to take advantage of the parallel nature of sampling and use a graphics processing unit (GPU) [19] to sample thousands of trajectories from the nonlinear dynamics.

(Footnote: This research has been supported by NSF Grant No. NRI. The authors are with the Autonomous Control and Decision Systems Laboratory at the Georgia Institute of Technology, Atlanta, GA, USA. Email: gradyrw@gatech.edu)

A major issue in the path integral control framework is that the expectation is taken with respect to the uncontrolled dynamics of the system. This is problematic since the probability of sampling a low cost trajectory using the uncontrolled dynamics is typically very low. This problem becomes more drastic when the underlying dynamics are nonlinear and sampled trajectories can become trapped in undesirable parts of the state space. It has previously been demonstrated how to change the mean of the sampling distribution using Girsanov's theorem [15], [16]; this can then be used to develop an iterative algorithm. However, the variance of the sampling distribution has always remained unchanged.
Although in some simple simulated scenarios changing the variance is not necessary, in many cases the natural variance of a system will be too low to produce useful deviations from the current trajectory. Previous methods have either dealt with this problem by artificially adding noise into the system and then optimizing the noisy system [], [14], or they have simply ignored the problem entirely and sampled from whatever distribution worked best [], [19]. Although these approaches can be successful, both are problematic in that the optimization either takes place with respect to the wrong system or the resulting algorithm ignores the theoretical basis of path integral control.

The approach we take here generalizes these approaches in that it enables both the mean and variance of the sampling distribution to be changed by the control designer, without violating the underlying assumptions made in the path integral derivation. This enables the algorithm to converge fast enough that it can be applied in a model predictive control setting. After deriving the model predictive path integral control (MPPI) algorithm, we compare it with an existing model predictive control formulation based on differential dynamic programming (DDP) [6], [13], [18]. DDP is one of the most powerful techniques for trajectory optimization. It relies on a first or second order approximation of the dynamics and a quadratic approximation of the cost along a nominal trajectory, and then computes a second order approximation of the value function which it uses to generate the control.

II. PATH INTEGRAL CONTROL

In this section we review the path integral optimal control framework [7]. Let $x_t \in \mathbb{R}^N$ denote the state of a dynamical system at time $t$, let $u(x_t,t) \in \mathbb{R}^m$ denote a control input for the system, let $\tau : [t_0, T] \to \mathbb{R}^n$ represent a trajectory of the system, and let $dw \in \mathbb{R}^p$ be a Brownian disturbance. In the path integral control framework we suppose that the dynamics take the form:

$$dx = f(x_t,t)\,dt + G(x_t,t)\,u(x_t,t)\,dt + B(x_t,t)\,dw \tag{1}$$

In other words, the dynamics are affine in control and subject to an affine Brownian disturbance. We also assume that $G$ and $B$ are partitioned as:

$$G(x_t,t) = \begin{pmatrix} 0 \\ G_c(x_t,t) \end{pmatrix}; \qquad B(x_t,t) = \begin{pmatrix} 0 \\ B_c(x_t,t) \end{pmatrix} \tag{2}$$

Expectations taken with respect to (1) are denoted as $E_Q[\cdot]$. We will also be interested in taking expectations with respect to the uncontrolled dynamics of the system (i.e. (1) with $u \equiv 0$). These will be denoted $E_P[\cdot]$. We suppose that the cost function for the optimal control problem has a quadratic control cost and an arbitrary state-dependent cost. Let $\phi(x_T)$ denote the terminal cost, $q(x_t,t)$ a state dependent running cost, and define $R(x_t,t)$ as a positive definite matrix. The value function $V(x_t,t)$ for this optimal control problem is then defined as:

$$V(x_t,t) = \min_u E_Q\left[\phi(x_T) + \int_t^T \left(q(x_t,t) + \frac{1}{2}u^T R(x_t,t)\,u\right)dt\right] \tag{3}$$

The Stochastic Hamilton-Jacobi-Bellman equation [1], [11] for the type of system in (1) and for the cost function in (3) is given as:

$$-\partial_t V = q(x_t,t) + f(x_t,t)^T V_x - \frac{1}{2}V_x^T G(x_t,t)R(x_t,t)^{-1}G(x_t,t)^T V_x + \frac{1}{2}\mathrm{tr}\left(B(x_t,t)B(x_t,t)^T V_{xx}\right) \tag{4}$$

where the optimal control is expressed as:

$$u^* = -R(x_t,t)^{-1}G(x_t,t)^T V_x \tag{5}$$

The solution to this backwards PDE yields the value function for the stochastic optimal control problem, which is then used to generate the optimal control. Unfortunately, classical methods for solving partial differential equations of this nature suffer from the curse of dimensionality and are intractable for systems with more than a few state variables.
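As a concrete illustration of the controlled diffusion and the cost inside the value function, the sketch below simulates a single trajectory of a scalar system and accumulates its running, control, and terminal costs. It is a minimal sketch under stated assumptions: the Euler-Maruyama time stepping, the scalar state, and every name in it (`rollout_cost`, `f`, `G`, `B`, `q`, `phi`, `R`, `dt`) are chosen here for illustration and are not taken from the paper.

```python
import math
import random


def rollout_cost(x0, controls, f, G, B, q, phi, R, dt, rng):
    """Simulate dx = f dt + G u dt + B dw for a scalar state (Euler-Maruyama)
    and return phi(x_T) plus the summed running cost (q + 0.5 R u^2) dt."""
    x, cost = x0, 0.0
    for u in controls:
        cost += (q(x) + 0.5 * R * u * u) * dt       # state cost + control cost
        eps = rng.gauss(0.0, 1.0)                   # standard normal sample
        x += (f(x) + G * u) * dt + B * eps * math.sqrt(dt)
    return cost + phi(x)                            # add the terminal cost


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy system: stable drift, unit control gain, moderate noise.
    costs = [
        rollout_cost(1.0, [0.0] * 20, lambda x: -x, 1.0, 0.5,
                     lambda x: x * x, lambda x: x * x, 1.0, 0.05, rng)
        for _ in range(1000)
    ]
    print("average cost of the uncontrolled system:", sum(costs) / len(costs))
```

Averaging this quantity over many sampled trajectories gives a Monte-Carlo estimate of the expected cost of one fixed control sequence; the minimization over controls is the part that the remainder of the derivation addresses.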
The approach we take in the path integral control framework is to transform the backwards PDE into a path integral, which is an expectation over all possible trajectories of the system. This expectation can then be approximated by forward sampling of the stochastic dynamics. In order to effect this transformation we apply an exponential transformation of the value function:

$$V(x,t) = -\lambda \log \Psi(x,t) \tag{6}$$

Here $\lambda$ is a positive constant. We also have to assume a relationship between the cost and noise in the system (as well as $\lambda$) through the equation:

$$B_c(x_t,t)B_c(x_t,t)^T = \lambda\, G_c(x_t,t)R(x_t,t)^{-1}G_c(x_t,t)^T \tag{7}$$

The main restriction implied by this assumption is that $B(x_t,t)$ has the same rank as $R(x_t,t)$. This limits the noise in the system to only affect state variables that are directly actuated (i.e. the noise is control dependent). There are a wide variety of systems which naturally fall into this description, so the assumption is not too restrictive. However, there are interesting systems for which this description does not hold (e.g. if there are known strong disturbances on indirectly actuated state variables or if the dynamics are only partially known).

By making this assumption and performing the exponential transformation of the value function, the stochastic HJB equation is transformed into the linear partial differential equation:

$$\partial_t \Psi = \frac{\Psi(x_t,t)}{\lambda}\,q(x_t,t) - f(x_t,t)^T \Psi_x - \frac{1}{2}\mathrm{tr}\left(\Sigma(x_t,t)\Psi_{xx}\right) \tag{8}$$

Here we have denoted the covariance matrix $B_c(x_t,t)B_c(x_t,t)^T$ as $\Sigma(x_t,t)$. This equation is known as the backward Chapman-Kolmogorov PDE. We can then apply the Feynman-Kac lemma, which relates backward PDEs of this type to path integrals through the equation:

$$\Psi(x_{t_0},t_0) = E_P\left[\exp\left(-\frac{1}{\lambda}\int_{t_0}^T q(x,t)\,dt\right)\Psi(x_T,T)\right] \tag{9}$$

Note that the expectation (which is the path integral) is taken with respect to $P$, which is the uncontrolled dynamics of the system. By recognizing that the term $\Psi(x_T)$ is the transformed terminal cost $e^{-\frac{1}{\lambda}\phi(x_T)}$, we can re-write this expression as:

$$\Psi(x_{t_0},t_0) \approx E_P\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)\right] \tag{10}$$

where $S(\tau) = \phi(x_T) + \int_{t_0}^T q(x_t,t)\,dt$ is the cost-to-go of the state dependent cost of a trajectory.
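The Feynman-Kac expectation above can be checked numerically on a case with a known answer. The sketch below is an illustration under assumptions that are not from the paper: a scalar system with zero drift, constant noise gain `sigma`, zero running cost, and terminal cost $\phi(x) = x^2/2$. For that case $x_T$ is Gaussian and the expectation of $\exp(-\phi(x_T)/\lambda)$ has a closed form. The function names and the parameter `lam` (standing for the positive constant of the exponential transformation) are hypothetical.

```python
import math
import random


def psi_monte_carlo(x0, sigma, lam, T, n_steps, n_samples, seed=0):
    """Estimate E_P[exp(-phi(x_T)/lam)] by forward sampling dx = sigma dw."""
    rng = random.Random(seed)
    dt = T / n_steps
    total = 0.0
    for _ in range(n_samples):
        x = x0
        for _ in range(n_steps):
            x += sigma * rng.gauss(0.0, 1.0) * math.sqrt(dt)
        total += math.exp(-(0.5 * x * x) / lam)     # phi(x) = x^2 / 2
    return total / n_samples


def psi_closed_form(x0, sigma, lam, T):
    """Gaussian integral of exp(-x^2/(2 lam)) with x ~ N(x0, sigma^2 T)."""
    s2 = sigma * sigma * T
    return math.sqrt(lam / (lam + s2)) * math.exp(-x0 * x0 / (2.0 * (lam + s2)))


if __name__ == "__main__":
    mc = psi_monte_carlo(1.0, 1.0, 1.0, 1.0, 20, 20000)
    exact = psi_closed_form(1.0, 1.0, 1.0, 1.0)
    print("Monte-Carlo:", mc, "closed form:", exact)
    print("value function estimate -lam*log(psi):", -1.0 * math.log(mc))
```

The two numbers should agree up to Monte-Carlo error, which shrinks roughly as one over the square root of the number of samples.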
Lastly we have to compute the gradient of $\Psi$ with respect to the initial state $x_{t_0}$. This can be done analytically and is a straightforward, albeit lengthy, computation so we omit it and refer the interested reader to [14]. After taking the gradient we obtain:

$$u^* dt = \mathcal{G}(x_{t_0},t_0)\,\frac{E_P\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)B(x_{t_0},t_0)\,dw\right]}{E_P\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)\right]} \tag{11}$$

where the matrix $\mathcal{G}(x_t,t)$ is defined as:

$$\mathcal{G}(x_t,t) = R(x_t,t)^{-1}G_c(x_t,t)^T\left(G_c(x_t,t)R(x_t,t)^{-1}G_c(x_t,t)^T\right)^{-1} \tag{12}$$

Note that if $G_c(x_t,t)$ is square (which is the case if the system is not over actuated) this reduces to $G_c(x_t,t)^{-1}$. Equation (11) is the path integral form of the optimal control. The fundamental difference between this form of the optimal control and classical optimal control theory is that instead of relying on a backwards in time process, this formula requires the evaluation of an expectation which can be approximated using forward sampling of stochastic differential equations.

A. Discrete Approximation

Equation (11) provides an expression for the optimal control in terms of a path integral. However, these equations are for continuous time, and in order to sample trajectories on a computer we need discrete time approximations. We first discretize the dynamics of the system. We have that $x_{t+1} = x_t + dx_t$ where $dx_t$ is defined as:

$$dx_t = \left(f(x_t,t) + G(x_t,t)\,u(x_t,t)\right)\Delta t + B(x_t,t)\,\epsilon\sqrt{\Delta t} \tag{13}$$

The term $\epsilon$ is a vector of standard normal Gaussian random variables. For the uncontrolled dynamics of the system we have:

$$dx_t = f(x_t,t)\,\Delta t + B(x_t,t)\,\epsilon\sqrt{\Delta t} \tag{14}$$

Another way we can express $B(x_t,t)\,dw$ which will be useful is as:

$$B(x_t,t)\,dw \approx dx_t - f(x_t,t)\,\Delta t \tag{15}$$

Lastly we say:

$$S(\tau) \approx \phi(x_T) + \sum_{i=1}^{N} q(x_{t_i},t_i)\,\Delta t$$

where $N = (T - t)/\Delta t$. Then by defining $p$ as the probability induced by the discrete time uncontrolled dynamics we can approximate (11) as:

$$u^* = \mathcal{G}(x_{t_0},t_0)\,\frac{E_p\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)\left(\frac{dx_{t_0}}{\Delta t} - f(x_{t_0},t_0)\right)\right]}{E_p\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)\right]} \tag{16}$$

Note that we have moved the term multiplying $u$ over to the right-hand side of the equation and inserted it into the expectation.

III. GENERALIZED IMPORTANCE SAMPLING

Equation (16) provides an implementable method for approximating the optimal control via random sampling of trajectories. By drawing many samples from $p$ the expectation can be evaluated using a Monte-Carlo approximation. In practice, this approach is unlikely to succeed. The problem is that $p$ is typically an inefficient distribution to sample from (i.e. the cost-to-go will be high for most trajectories sampled from $p$).
Intuitively, sampling from the uncontrolled dynamics corresponds to turning a machine on and waiting for the natural noise in the system dynamics to produce interesting behavior. In order to efficiently approximate the controls, we require the ability to sample from a distribution which is likely to produce low cost trajectories. In previous applications of path integral control [15], [16] the mean of the sampling distribution has been changed, which allows for an iterative update law. However, the variance of the sampling distribution has always remained unchanged. In well engineered systems, where the natural variance of the system is very low, changing the mean is insufficient since the state space is never aggressively explored. In the following derivation we provide a method for changing both the initial control input and the variance of the sampling distribution.

A. Likelihood Ratio

We suppose that we have a sampling distribution with non-zero control input and a changed variance, which we denote as $q$, and we would like to approximate (16) using samples from $q$ as opposed to $p$. Now if we write the expectation term in (16) in integral form we get:

$$\frac{\int \exp\left(-\frac{1}{\lambda}S(\tau)\right)\left(\frac{dx_{t_0}}{\Delta t} - f(x_{t_0},t_0)\right)p(\tau)\,d\tau}{\int \exp\left(-\frac{1}{\lambda}S(\tau)\right)p(\tau)\,d\tau} \tag{17}$$

where we are abusing notation and using $\tau$ to represent the discrete trajectory $(x_{t_0}, x_{t_1}, \ldots, x_{t_N})$. Next we multiply both integrals by $1 = \frac{q(\tau)}{q(\tau)}$ to get:

$$\frac{\int \exp\left(-\frac{1}{\lambda}S(\tau)\right)\left(\frac{dx_{t_0}}{\Delta t} - f(x_{t_0},t_0)\right)\frac{q(\tau)}{q(\tau)}\,p(\tau)\,d\tau}{\int \exp\left(-\frac{1}{\lambda}S(\tau)\right)\frac{q(\tau)}{q(\tau)}\,p(\tau)\,d\tau} \tag{18}$$

And we can then write this as an expectation with respect to $q$:

$$\frac{E_q\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)\left(\frac{dx_{t_0}}{\Delta t} - f(x_{t_0},t_0)\right)\frac{p(\tau)}{q(\tau)}\right]}{E_q\left[\exp\left(-\frac{1}{\lambda}S(\tau)\right)\frac{p(\tau)}{q(\tau)}\right]} \tag{19}$$

We now have the expectation in terms of a sampling distribution $q$ for which we can choose:

1) The initial control sequence from which to sample around.
2) The variance of the exploration noise, which determines how aggressively the state space is explored.

However, we now have an extra term to compute: $\frac{p(\tau)}{q(\tau)}$. This is known as the likelihood ratio (or Radon-Nikodym derivative) between the distributions $p$ and $q$. In order to derive an expression for this term we first have to derive equations for the probability density functions of $p(\tau)$ and $q(\tau)$ individually.
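The role of the likelihood ratio can be illustrated on a single Gaussian step before the trajectory-level densities are derived. The sketch below is a toy example under assumptions chosen here, not the paper's setting: the "uncontrolled" density $p$ is $N(0, 0.5^2)$, the cost is $S(z) = (z-3)^2$, and the sampling density $q$ has a shifted mean and an enlarged standard deviation. The weighted average of $z$ with weights $\exp(-S/\lambda)\,p(z)/q(z)$ estimates the same ratio of expectations as sampling from $p$ would. For these particular numbers the exact value is 1.0, because the product of the two Gaussians has precision $1/0.25 + 2 = 6$ and mean $(2 \cdot 3)/6$. All function names are hypothetical.

```python
import math
import random


def normal_pdf(z, mean, std):
    """Density of N(mean, std^2) at z."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(n, lam, cost, p_std, q_mean, q_std, seed=0):
    """Estimate E_p[exp(-S/lam) z] / E_p[exp(-S/lam)] using samples from q."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        z = rng.gauss(q_mean, q_std)                       # sample from q
        ratio = normal_pdf(z, 0.0, p_std) / normal_pdf(z, q_mean, q_std)
        w = math.exp(-cost(z) / lam) * ratio               # cost weight times p/q
        num += w * z
        den += w
    return num / den


if __name__ == "__main__":
    cost = lambda z: (z - 3.0) ** 2
    # Sampling from p itself (the likelihood ratio is then identically 1).
    print("sampling from p:", importance_estimate(20000, 1.0, cost, 0.5, 0.0, 0.5))
    # Sampling from q with shifted mean and larger variance.
    print("sampling from q:", importance_estimate(20000, 1.0, cost, 0.5, 1.0, 1.0))
```

Both estimators target the same quantity; the difference is how many samples land where the weights are non-negligible, which is the motivation for choosing $q$ freely.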
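We can do this by deriving the probability density function for the general discrete time diffusion processes $P(\tau)$, corresponding to the dynamics:

$$dx_t = \left(f(x_t,t) + G(x_t,t)\,u(x_t,t)\right)\Delta t + B(x_t,t)\,\epsilon\sqrt{\Delta t} \tag{20}$$

The goal is to find $P(\tau) = P(x_{t_0}, x_{t_1}, \ldots, x_{t_N})$. By conditioning and using the Markov property of the state space this probability becomes:

$$P(x_{t_0}, x_{t_1}, \ldots, x_{t_N}) = \prod_{i=1}^{N} P(x_{t_i} \mid x_{t_{i-1}}) \tag{21}$$

Now recall that a portion of the state space has deterministic dynamics and that we have partitioned the diffusion matrix as:

$$B(x_t,t) = \begin{pmatrix} 0 \\ B_c(x_t,t) \end{pmatrix} \tag{22}$$

We can partition the state variables $x$ into the deterministic and non-deterministic variables $x^{(a)}_t$ and $x^{(c)}_t$ respectively. The next step is to condition on $x^{(a)}_{t_{i+1}} = F_a(x_{t_i},t_i) = x^{(a)}_{t_i} + \left(f_a(x_{t_i},t_i) + G_a(x_{t_i},t_i)\,u_{t_i}\right)dt$, since if this does not hold $P(\tau)$ is zero. We thus need to compute:

$$\prod_{i=1}^{N} P\left(x^{(c)}_{t_i} \,\middle|\, x_{t_{i-1}},\; x^{(a)}_{t_i} = F_a(x_{t_{i-1}},t_{i-1})\right) \tag{23}$$

And from the dynamics equations we know that each of these one-step transitions is Gaussian with (per unit time) mean $f_c(x_{t_i},t_i) + G_c(x_{t_i},t_i)\,u(x_{t_i},t_i)$ and variance:

$$\Sigma_i = B_c(x_{t_i},t_i)B_c(x_{t_i},t_i)^T \tag{24}$$

We then define $z_i = \frac{dx^{(c)}_{t_i}}{\Delta t} - f_c(x_{t_i},t_i)$ and $\mu_i = G_c(x_{t_i},t_i)\,u(x_{t_i},t_i)$. Applying the definition of the Gaussian distribution with these terms yields:

$$P(\tau) = \prod_{i=1}^{N} \frac{\exp\left(-\frac{\Delta t}{2}(z_i - \mu_i)^T \Sigma_i^{-1}(z_i - \mu_i)\right)}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \tag{25}$$

And then using basic rules of exponents this probability becomes:

$$P(\tau) = Z(\tau)^{-1}\exp\left(-\frac{\Delta t}{2}\sum_{i=1}^{N}(z_i - \mu_i)^T \Sigma_i^{-1}(z_i - \mu_i)\right) \tag{26}$$

where $Z(\tau) = \prod_{i=1}^{N}(2\pi)^{n/2}\,|\Sigma_i|^{1/2}$. With this equation in hand we are now ready to compute the likelihood ratio between two diffusion processes.

Theorem 1: Let $p(\tau)$ be the probability density function for trajectories under the uncontrolled discrete time dynamics:

$$dx_t = f(x_t,t)\,\Delta t + B(x_t,t)\,\epsilon\sqrt{\Delta t} \tag{27}$$

And let $q(\tau)$ be the probability density function for trajectories under the controlled dynamics with an adjusted variance:

$$dx_t = \left(f(x_t,t) + G(x_t,t)\,u(x_t,t)\right)\Delta t + B_E(x_t,t)\,\epsilon\sqrt{\Delta t} \tag{28}$$

where the adjusted variance has the form:

$$B_E(x_t,t) = \begin{pmatrix} 0 \\ A_t B_c(x_t,t) \end{pmatrix}$$

And define $z_i$, $\mu_i$, and $\Sigma_i$ as before. Let $Q_i$ be defined as:

$$Q_i = (z_i - \mu_i)^T \Gamma_i^{-1}(z_i - \mu_i) + 2\mu_i^T \Sigma_i^{-1}(z_i - \mu_i) + \mu_i^T \Sigma_i^{-1}\mu_i \tag{29}$$

where $\Gamma_i$ is:

$$\Gamma_i = \left(\Sigma_i^{-1} - \left(A_{t_i}^T \Sigma_i A_{t_i}\right)^{-1}\right)^{-1} \tag{30}$$

Then under the condition that each $A_{t_i}$ is invertible and each $\Gamma_i$ is invertible, the likelihood ratio for the two distributions is:

$$\frac{p(\tau)}{q(\tau)} = \left(\prod_{i=1}^{N}|A_{t_i}|\right)\exp\left(-\frac{\Delta t}{2}\sum_{i=1}^{N}Q_i\right) \tag{31}$$

Proof: In discrete time the probability of a trajectory is formulated according to (26).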
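We thus have $p(\tau)$ equal to:

$$p(\tau) = \frac{\exp\left(-\frac{\Delta t}{2}\sum_{i=1}^{N} z_i^T \Sigma_i^{-1} z_i\right)}{Z_p(\tau)} \tag{32}$$

and $q(\tau)$ equal to:

$$q(\tau) = \frac{\exp\left(-\frac{\Delta t}{2}\sum_{i=1}^{N}(z_i - \mu_i)^T\left(A_{t_i}^T \Sigma_i A_{t_i}\right)^{-1}(z_i - \mu_i)\right)}{Z_q(\tau)} \tag{33}$$

Then dividing these two equations we have $\frac{p(\tau)}{q(\tau)}$ as:

$$\frac{p(\tau)}{q(\tau)} = \left(\prod_{i=1}^{N}\frac{(2\pi)^{n/2}\,|A_{t_i}^T \Sigma_i A_{t_i}|^{1/2}}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}}\right)\exp\left(-\frac{\Delta t}{2}\sum_{i=1}^{N}\zeta_i\right) \tag{34}$$

where $\zeta_i$ is:

$$\zeta_i = z_i^T \Sigma_i^{-1} z_i - (z_i - \mu_i)^T\left(A_{t_i}^T \Sigma_i A_{t_i}\right)^{-1}(z_i - \mu_i) \tag{35}$$

Using basic rules of determinants it is easy to see that the term outside the exponent reduces to:

$$\prod_{j=1}^{N}\frac{(2\pi)^{n/2}\,|A_{t_j}^T \Sigma_j A_{t_j}|^{1/2}}{(2\pi)^{n/2}\,|\Sigma_j|^{1/2}} = \prod_{j=1}^{N}|A_{t_j}| \tag{36}$$

So we need only show that $\zeta_i$ reduces to $Q_i$. Observe that at every timestep we have the difference between two quadratic functions of $z_i$, so we can complete the square to combine this into a single quadratic function. If we recall the definition of $\Gamma_i$ from above, and define $\Lambda_i = A_{t_i}^T \Sigma_i A_{t_i}$, then completing the square yields:

$$\zeta_i = \left(z_i + \Gamma_i\Lambda_i^{-1}\mu_i\right)^T \Gamma_i^{-1}\left(z_i + \Gamma_i\Lambda_i^{-1}\mu_i\right) - \mu_i^T\Lambda_i^{-1}\mu_i - \left(\Gamma_i\Lambda_i^{-1}\mu_i\right)^T \Gamma_i^{-1}\left(\Gamma_i\Lambda_i^{-1}\mu_i\right) \tag{37}$$

Now we expand out the first quadratic term to get:

$$\zeta_i = z_i^T\Gamma_i^{-1}z_i + 2\mu_i^T\Lambda_i^{-1}z_i + \underline{\mu_i^T\Lambda_i^{-1}\Gamma_i\Lambda_i^{-1}\mu_i} - \mu_i^T\Lambda_i^{-1}\mu_i - \underline{\left(\Gamma_i\Lambda_i^{-1}\mu_i\right)^T\Gamma_i^{-1}\Gamma_i\Lambda_i^{-1}\mu_i} \tag{38}$$

Notice that the two underlined terms are the same, except for the sign, so they cancel out and we are left with:

$$\zeta_i = z_i^T\Gamma_i^{-1}z_i + 2\mu_i^T\Lambda_i^{-1}z_i - \mu_i^T\Lambda_i^{-1}\mu_i \tag{39}$$

Now define $\tilde{z}_i = z_i - \mu_i$, and then re-write this equation in terms of $\tilde{z}_i$:

$$\zeta_i = (\tilde{z}_i + \mu_i)^T\Gamma_i^{-1}(\tilde{z}_i + \mu_i) + 2\mu_i^T\Lambda_i^{-1}(\tilde{z}_i + \mu_i) - \mu_i^T\Lambda_i^{-1}\mu_i \tag{40}$$

which expands out to:

$$\zeta_i = \tilde{z}_i^T\Gamma_i^{-1}\tilde{z}_i + 2\mu_i^T\Gamma_i^{-1}\tilde{z}_i + \mu_i^T\Gamma_i^{-1}\mu_i + 2\mu_i^T\Lambda_i^{-1}\tilde{z}_i + 2\mu_i^T\Lambda_i^{-1}\mu_i - \mu_i^T\Lambda_i^{-1}\mu_i \tag{41}$$

which then simplifies to:

$$\zeta_i = \tilde{z}_i^T\Gamma_i^{-1}\tilde{z}_i + 2\mu_i^T\Gamma_i^{-1}\tilde{z}_i + \mu_i^T\Gamma_i^{-1}\mu_i + 2\mu_i^T\Lambda_i^{-1}\tilde{z}_i + \mu_i^T\Lambda_i^{-1}\mu_i \tag{42}$$

Now recall that $\Gamma_i^{-1} = \Sigma_i^{-1} - \Lambda_i^{-1}$, so we can split the terms in $\Gamma_i^{-1}$ that involve $\mu_i$ into the $\Sigma_i^{-1}$ and $\Lambda_i^{-1}$ components. Doing this yields:

$$\zeta_i = \tilde{z}_i^T\Gamma_i^{-1}\tilde{z}_i + 2\mu_i^T\Sigma_i^{-1}\tilde{z}_i - \underline{2\mu_i^T\Lambda_i^{-1}\tilde{z}_i} + \mu_i^T\Sigma_i^{-1}\mu_i - \underline{\mu_i^T\Lambda_i^{-1}\mu_i} + \underline{2\mu_i^T\Lambda_i^{-1}\tilde{z}_i} + \underline{\mu_i^T\Lambda_i^{-1}\mu_i} \tag{43}$$

and by noting that the underlined terms cancel out we see that we are left with:

$$\zeta_i = \tilde{z}_i^T\Gamma_i^{-1}\tilde{z}_i + 2\mu_i^T\Sigma_i^{-1}\tilde{z}_i + \mu_i^T\Sigma_i^{-1}\mu_i \tag{44}$$

which is the same as:

$$(z_i - \mu_i)^T\Gamma_i^{-1}(z_i - \mu_i) + 2\mu_i^T\Sigma_i^{-1}(z_i - \mu_i) + \mu_i^T\Sigma_i^{-1}\mu_i \tag{45}$$

And so $\zeta_i = Q_i$, which completes the proof.

The key difference between this proof and earlier path integral works, which use an application of Girsanov's theorem to sample from a non-zero control input, is that this theorem allows for a change in the variance as well. In the expression for the likelihood ratio derived here the last two terms, $2\mu_i^T\Sigma_i^{-1}(z_i - \mu_i) + \mu_i^T\Sigma_i^{-1}\mu_i$, are exactly the terms from Girsanov's theorem. The first term $(z_i - \mu_i)^T\Gamma_i^{-1}(z_i - \mu_i)$, which can be interpreted as penalizing over-aggressive exploration, is the only additional term.

B. Likelihood Ratio as Additional Running Cost

The form of the likelihood ratio just derived is easily incorporated into the path integral control framework by folding it into the cost-to-go as an extra running cost. Note that the likelihood ratio appears in both the numerator and denominator of (19), the importance-sampled form of (16). Therefore, any terms which do not depend on the state can be factored out of the expectation and canceled. This removes the numerically troublesome normalizing term $\prod_{j=1}^{N}|A_{t_j}|$, so only the summation of $Q_i$ remains. Recall that $\Sigma_i = \lambda\,G_c(x_t,t)R(x_t,t)^{-1}G_c(x_t,t)^T$. This implies that:

$$\Gamma_i = \left(\frac{1}{\lambda}\left(G_c R^{-1} G_c^T\right)^{-1} - \frac{1}{\lambda}\left(A^T G_c R^{-1} G_c^T A\right)^{-1}\right)^{-1} \tag{46}$$

where $G_c$, $R$, and $A$ are evaluated at $(x_{t_i},t_i)$. Now define $H = G_c(x_t,t)R(x_t,t)^{-1}G_c(x_t,t)^T$ and $\tilde{\Gamma} = \frac{1}{\lambda}\Gamma$. We then have:

$$Q_i = \frac{1}{\lambda}\left((z_i - \mu_i)^T\tilde{\Gamma}_i^{-1}(z_i - \mu_i) + 2\mu_i^T H^{-1}(z_i - \mu_i) + \mu_i^T H^{-1}\mu_i\right) \tag{47}$$

Then by re-defining the running cost $q(x_t,t)$ as:

$$\tilde{q}(x,u,dx) = q(x_t,t) + \frac{1}{2}(z_i - \mu_i)^T\tilde{\Gamma}_i^{-1}(z_i - \mu_i) + \mu_i^T H^{-1}(z_i - \mu_i) + \frac{1}{2}\mu_i^T H^{-1}\mu_i \tag{48}$$

and $\tilde{S}(\tau) = \phi(x_T) + \sum_{j=1}^{N}\tilde{q}(x,u,dx)\,\Delta t$, we have:

$$u^*_{t_0} = \mathcal{G}(x_{t_0},t_0)\,\frac{E_q\left[\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau)\right)\left(\frac{dx_{t_0}}{\Delta t} - f(x_{t_0},t_0)\right)\right]}{E_q\left[\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau)\right)\right]} \tag{49}$$

Also note that $dx_{t_0}$ is now equal to:

$$\left(f(x_{t_0},t_0) + G(x_{t_0},t_0)\,u(x_{t_0},t_0)\right)\Delta t + B(x_{t_0},t_0)\,\epsilon\sqrt{\Delta t} \tag{50}$$

So we can re-write $\frac{dx_{t_0}}{\Delta t} - f(x_{t_0},t_0)$ as:

$$G(x_{t_0},t_0)\,u(x_{t_0},t_0) + B(x_{t_0},t_0)\frac{\epsilon}{\sqrt{\Delta t}} \tag{51}$$

And then since $G(x_{t_0},t_0)\,u(x_{t_0},t_0)$ does not depend on the expectation we can pull it out and get the iterative update law:

$$u^*_{t_0} = \mathcal{G}(x_{t_0},t_0)\,G(x_{t_0},t_0)\,u(x_{t_0},t_0) + \mathcal{G}(x_{t_0},t_0)\,\frac{E_q\left[\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau)\right)B(x_{t_0},t_0)\frac{\epsilon}{\sqrt{\Delta t}}\right]}{E_q\left[\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau)\right)\right]} \tag{52}$$

C. Special Case

The update law (52) is applicable for a very general class of systems. In this section we examine a special case which we use for all of our experiments. We consider dynamics of the form:

$$dx_t = f(x_t,t)\,\Delta t + G(x_t,t)\left(u(x_t,t)\,\Delta t + \frac{1}{\sqrt{\rho}}\,\epsilon\sqrt{\Delta t}\right) \tag{53}$$

And for the sampling distribution we set $A$ equal to $\sqrt{\nu}\,I$. We also assume that $G_c(x_t,t)$ is a square invertible matrix. This reduces $\mathcal{G}(x_t,t)$ to $G_c(x_t,t)^{-1}$. Next the dynamics can be re-written as:

$$dx_t = \left(f(x_t,t) + G(x_t,t)\left(u(x_t,t) + \frac{1}{\sqrt{\rho}}\frac{\epsilon}{\sqrt{\Delta t}}\right)\right)\Delta t \tag{54}$$

Then we can interpret $\frac{1}{\sqrt{\rho}}\frac{\epsilon}{\sqrt{\Delta t}}$ as a random change in the control input. To emphasize this we will denote this term as $\delta u = \frac{1}{\sqrt{\rho}}\frac{\epsilon}{\sqrt{\Delta t}}$. We then have $B(x_t,t)\frac{\epsilon}{\sqrt{\Delta t}} = G(x_t,t)\,\delta u$. This yields the iterative update law as:

$$u(x_{t_i},t_i)^* = u(x_{t_i},t_i) + \frac{E_q\left[\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau)\right)\delta u\right]}{E_q\left[\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau)\right)\right]} \tag{55}$$

which can be approximated as:

$$u(x_{t_i},t_i)^* \approx u(x_{t_i},t_i) + \frac{\sum_{k=1}^{K}\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_{i,k})\right)\delta u_{i,k}}{\sum_{k=1}^{K}\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_{i,k})\right)} \tag{56}$$

where $K$ is the number of random samples (termed rollouts) and $\tilde{S}(\tau_{i,k})$ is the cost-to-go of the $k$-th rollout from time $t_i$ onward. This expression is simply a reward-weighted average of random variations in the control input. Next we investigate what the likelihood ratio addition to the running cost is. For these dynamics we have the following simplifications:

$$z_i - \mu_i = G(x_t,t)\,\delta u$$
$$\tilde{\Gamma}_i^{-1} = (1 - \nu^{-1})\,G(x_t,t)^{-T}R(x_t,t)\,G(x_t,t)^{-1}$$
$$H^{-1} = G(x_t,t)^{-T}R(x_t,t)\,G(x_t,t)^{-1}$$

Given these simplifications $\tilde{q}$ reduces to:

$$\tilde{q}(x,u,dx) = q(x_t,t) + \frac{(1 - \nu^{-1})}{2}\,\delta u^T R\,\delta u + u^T R\,\delta u + \frac{1}{2}u^T R\,u \tag{57}$$

This means that the introduction of the likelihood ratio simply introduces the original control cost from the optimal control formulation into the sampling cost, which originally only included state-dependent terms.

IV. MODEL PREDICTIVE CONTROL ALGORITHM

We apply the iterative path integral control update law, with the generalized importance sampling term, in a model predictive control setting. In this setting optimization and execution occur simultaneously: the trajectory is optimized and then a single control is executed, then the trajectory is re-optimized using the un-executed portion of the previous trajectory to warm-start the optimization. This scheme has two key requirements:

1) Rapid convergence to a good control input.
2) The ability to sample a large number of trajectories in real-time.

The first requirement is essential because the algorithm does not have the luxury of waiting until the trajectory has converged before executing. The new importance sampling term enables tuning of the exploration variance, which allows for rapid convergence (see the comparisons over exploration variance and number of rollouts in Section V). The second requirement, sampling a large number of trajectories in real-time, is satisfied by implementing the random sampling of trajectories on a GPU. The algorithm is given in Algorithm 1. In the parallel GPU implementation the sampling for loop (for $k \leftarrow 0$ to $K-1$) is run completely in parallel.

V. EXPERIMENTS

We tested the model predictive path integral control algorithm (MPPI) on three simulated platforms: (1) a cart-pole, (2) a miniature race car, and (3) a quadrotor attempting to navigate an obstacle filled environment.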
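The receding-horizon scheme of Section IV, combined with the special-case update (a cost-weighted average of sampled control perturbations) and the augmented running cost, can be sketched on a toy problem. This is a minimal CPU sketch under assumptions chosen here, not the authors' GPU implementation: a scalar single-integrator $dx = (u + \delta u)\,\Delta t$ (so $f = 0$, $G = 1$), quadratic state and terminal costs, a noise-free plant when executing the first control, cost-to-go computed by a backward cumulative sum, and a minimum-cost shift inside the exponential for numerical stability. With $G_c = 1$ and $B_c = 1/\sqrt{\rho}$ the noise/cost relation gives `lam = R / rho`. All function and parameter names are hypothetical.

```python
import math
import random


def mppi_update(x0, u, K, dt, lam, R, rho, nu, q, phi, rng):
    """Return an updated control sequence for dx = (u + du) dt (f = 0, G = 1)."""
    N = len(u)
    std = math.sqrt(nu / (rho * dt))              # std of the sampled du
    du = [[rng.gauss(0.0, std) for _ in range(N)] for _ in range(K)]
    S = [[0.0] * N for _ in range(K)]             # cost-to-go S[k][i]
    for k in range(K):
        x = x0
        step = [0.0] * N
        for i in range(N):
            d = du[k][i]
            # State cost plus the likelihood-ratio terms of the special case.
            q_tilde = (q(x) + 0.5 * (1.0 - 1.0 / nu) * R * d * d
                       + R * u[i] * d + 0.5 * R * u[i] * u[i])
            step[i] = q_tilde * dt
            x += (u[i] + d) * dt
        tail = phi(x)
        for i in range(N - 1, -1, -1):            # backward cumulative sum
            tail += step[i]
            S[k][i] = tail
    new_u = []
    for i in range(N):
        s_min = min(S[k][i] for k in range(K))    # shift: numerical stability only
        w = [math.exp(-(S[k][i] - s_min) / lam) for k in range(K)]
        avg = sum(w[k] * du[k][i] for k in range(K)) / sum(w)
        new_u.append(u[i] + avg)                  # weighted average of perturbations
    return new_u


def run_mppi(x0=2.0, mpc_steps=50, N=10, K=300, dt=0.1,
             R=1.0, rho=1.0, nu=2.0, seed=0):
    """Drive the toy system toward the origin; returns the final state."""
    rng = random.Random(seed)
    lam = R / rho                                 # noise/cost relation with G_c = 1
    q = lambda x: x * x                           # running state cost
    phi = lambda x: x * x                         # terminal cost
    u, x = [0.0] * N, x0
    for _ in range(mpc_steps):
        u = mppi_update(x, u, K, dt, lam, R, rho, nu, q, phi, rng)
        x += u[0] * dt                            # execute first control (noise-free plant)
        u = u[1:] + [0.0]                         # warm start: shift, pad with u_init = 0
    return x


if __name__ == "__main__":
    print("final state:", run_mppi())
```

The loop over `k` is the part that is independent across rollouts, which is what makes a parallel implementation natural. The sketch keeps it sequential for readability.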
For the race car and quadrotor we used a model predictive control version of the differential dynamic programming (DDP) algorithm as a baseline comparison. In all of these experiments the controller operates at 50 Hz; this means that the open loop control sequence is re-optimized every 20 milliseconds.

Algorithm 1: Model Predictive Path Integral Control
  Given: $K$: Number of samples;
         $N$: Number of timesteps;
         $(u_0, u_1, \ldots, u_{N-1})$: Initial control sequence;
         $\Delta t, x_{t_0}, f, G, B, \nu$: System/sampling dynamics;
         $\phi, q, R, \lambda$: Cost parameters;
         $u_{init}$: Value to initialize new controls to;
  while task not completed do
    for $k \leftarrow 0$ to $K-1$ do
      $x = x_{t_0}$;
      for $i \leftarrow 0$ to $N-1$ do
        $x_{i+1} = x_i + \left(f + G\,(u_i + \delta u_{i,k})\right)\Delta t$;
        $\tilde{S}(\tau_{i+1,k}) = \tilde{S}(\tau_{i,k}) + \tilde{q}$;
    for $i \leftarrow 0$ to $N-1$ do
      $u_i \leftarrow u_i + \frac{\sum_{k=1}^{K}\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_{i,k})\right)\delta u_{i,k}}{\sum_{k=1}^{K}\exp\left(-\frac{1}{\lambda}\tilde{S}(\tau_{i,k})\right)}$;
    send to actuators($u_0$);
    for $i \leftarrow 0$ to $N-2$ do
      $u_i = u_{i+1}$;
    $u_{N-1} = u_{init}$;
    Update the current state after receiving feedback;
    check for task completion;

A. Cart-Pole

For the cart-pole swing-up task we used the state cost: qx = p cosθ + θ + ṗ, where p is the position of the cart, ṗ is the velocity, and θ, θ̇ are the angle and angular velocity of the pole. The control input is desired velocity, which maps to velocity through the equation: p = u ṗ. The disturbance parameter ρ was set equal . and the control cost was R = . We ran the MPPI controller for seconds with a second optimization horizon. The controller has to swing up the pole and keep it balanced for the rest of the second horizon. The exploration variance parameter, ν, was varied between and 5. The MPPI controller is able to swing up the pole faster with increasing exploration variance. Fig. 1 illustrates the performance of the MPPI controller as the exploration variance and the number of rollouts are changed. Using only the natural variance of the system for exploration is insufficient in this task; in that case (not shown in the figure) the controller is never able to swing up the pole, which results in a cost around .

Fig. 1. Average running cost for the cart-pole swing-up task as a function of the exploration variance ν and the number of rollouts. Using only the natural system variance the MPC algorithm does not converge in this scenario. (Plot: average running cost versus number of rollouts on a log scale, one curve per value of ν.)

B. Race Car

In the race car task the goal was to minimize the objective function: qx = d + v x 7.. Where d is defined as: d = x 3 + y 6, and v_x is the forward (in body frame) velocity of the car. This cost ensures that the car stays on an elliptical track while maintaining a forward speed of 7 meters/sec. We use a non-linear dynamics model [5] which takes into account the highly non-linear interactions between tires and the ground. The exploration variance was set to a constant ν times the natural variance of the system.

The MPPI controller is able to enter turns at close to the desired speed of 7 m/s and then slide through the turn. The DDP solution does not attempt to slide and significantly reduces its forward velocity before entering the turn; this results in a higher average cost compared to the MPPI controller. Fig. 2 shows the cost comparison between MPPI and MPC-DDP, and Figures 3 and 4 show samples of the trajectories taken by the two algorithms as well as the velocity profiles.

Fig. 2. Performance comparison in terms of average cost between MPPI and MPC-DDP as the exploration variance ν changes from 5 to 3 and the number of rollouts changes from to . Only with a very large increase in the exploration variance is MPPI able to outperform MPC-DDP. Note that the cost is capped at . (Plot: average running cost versus number of rollouts on a log scale, with the DDP solution as a reference.)

Fig. 3. Comparison of DDP (left) and MPPI (right) performing a cornering maneuver along an ellipsoid track. MPPI is able to make a much tighter turn while carrying more speed in and out of the corner than DDP. The direction of travel is counterclockwise.

Fig. 4. Comparison of DDP (left) and MPPI (right) performing a cornering maneuver along an ellipsoid track. MPPI is able to make a much tighter turn while carrying more speed in and out of the corner than DDP. (Plot: velocity in m/s versus time in s, showing v_x and v_y for DDP and MPPI.)

C. Quadrotor

The quadrotor task was to fly through a field filled with cylindrical obstacles as fast as possible. We used the quadrotor dynamics model from [9]. This is a non-linear model which includes position, velocity, Euler angles, angular acceleration, and the rotor dynamics. We randomly generated three forests, one where obstacles are on average 3 meters apart, the second one 4 meters apart, and the third 5 meters apart. We then separately created cost functions for both MPPI and DDP which guide the quadrotor through the forest as quickly as possible. The cost function for MPPI was of the form:

qx =.5p x p des x +.5p y p des y + 5p z p des z + 5ψ + v +35 exp d + C

where (p_x, p_y, p_z) denotes the position of the vehicle, ψ denotes the yaw angle in radians, v is velocity, and d is the distance to the closest obstacle. C is a variable which indicates whether the vehicle has crashed into the ground or an obstacle. Additionally, if C = 1 (which indicates a crash), the rollout stops simulating the dynamics and the vehicle remains where it is for the rest of the time horizon. We found that the crash indicator term is not useful for the MPC-DDP based controller. This is not surprising since the discontinuity it creates is difficult to approximate with a quadratic function. The term in the cost for avoiding obstacles in the MPC-DDP controller consists purely of a large exponential term: N = exp d. Note that this sum is over all the obstacles in the proximity of the vehicle, whereas the MPPI controller only has to consider the closest obstacle.

Fig. 5. Left: sample DDP trajectory through the 4 m obstacle field. Right: sample MPPI trajectory through the same field. Since the MPPI controller can directly reason about the shape of the obstacles it is able to safely pass through the field taking a much more direct route.

Fig. 6. Time to navigate forest. Comparison between MPPI and DDP. (Plot: time to completion in s versus density setting of the forest, 3 m, 4 m, and 5 m, for MPPI and MPC-DDP.)

Since the MPPI controller can explicitly reason about crashing (as opposed to just staying away from obstacles), it is able to travel both faster and closer to obstacles than the MPC-DDP controller. Fig. 6 shows the difference in time between the two algorithms and Fig. 5 the trajectories taken by MPC-DDP and one of the MPPI runs on the forest with obstacles placed on average 4 meters away.

Fig. 7. Simulated forest environment used in the quadrotor navigation task.

VI. CONCLUSION

In this paper we have developed a model predictive path integral control algorithm which is able to outperform a state-of-the-art DDP method on two difficult control tasks. The algorithm is based on stochastic sampling of system trajectories and requires no derivatives of either the dynamics or costs of the system. This enables the algorithm to naturally take into account non-linear dynamics, such as a non-linear tire model [5]. It is also able to handle cost functions which are intuitively appealing, such as an impulse cost for hitting an obstacle, but are difficult for traditional approaches that rely on a smooth gradient signal to perform optimization.

The two keys to achieving this level of performance with a sampling based method are:

1) The derivation of the generalized likelihood ratio between discrete time diffusion processes.
2) The use of a GPU to sample thousands of trajectories in real-time.

The derivation of the likelihood ratio enables the designer of the algorithm to tune the exploration variance in the path integral control framework, whereas previous methods have only allowed for the mean of the distribution to be changed.
Tuning the exploration variance is critical in achieving a high level of performance since the natural variance of the system is typically too low to achieve good performance. The experiments considered in this work only consider changing the variance by a constant multiple times the natural variance of the system. In this special case the introduction of the likelihood ratio corresponds to adding in a control cost when evaluating the cost-to-go of a trajectory. A direction for future research is to investigate how to automatically adjust the variance online. Doing so could enable the algorithm to switch from aggressively exploring the state space when performing aggressive maneuvers to exploring more conservatively for performing very precise maneuvers.

REFERENCES

[1] W. H. Fleming and H. M. Soner. Controlled Markov Processes and Viscosity Solutions. Applications of Mathematics. Springer, New York, 2nd edition, 2006.
[2] A. Friedman. Stochastic Differential Equations and Applications. Academic Press, 1975.
[3] Vicenç Gómez, Hilbert J. Kappen, Jan Peters, and Gerhard Neumann. Policy search for path integral control. In Machine Learning and Knowledge Discovery in Databases. Springer, 2014.
[4] Vicenç Gómez, Sep Thijssen, Hilbert J. Kappen, Stephen Hailes, and Andrew Symington. Real-time stochastic optimal control for multi-agent quadrotor swarms. arXiv preprint arXiv:1502.04548, 2015.
[5] R. Y. Hindiyeh. Dynamics and Control of Drifting in Automobiles. PhD thesis, Stanford University, March 2013.
[6] D. H. Jacobson and D. Q. Mayne. Differential Dynamic Programming. American Elsevier Pub. Co., New York, 1970.
[7] H. J. Kappen. Linear theory for control of nonlinear stochastic systems. Phys Rev Lett, 95:200201, 2005.
[8] I. Karatzas and S. E. Shreve. Brownian Motion and Stochastic Calculus (Graduate Texts in Mathematics). Springer, 2nd edition, August 1991.
[9] Nathan Michael, Daniel Mellinger, Quentin Lindsey, and Vijay Kumar. The GRASP multiple micro-UAV testbed. Robotics & Automation Magazine, IEEE, 17(3):56-65, 2010.
[10] E. Rombokas, M. Malhotra, E. A. Theodorou, E. Todorov, and Y. Matsuoka. Reinforcement learning and synergistic control of the ACT hand. IEEE/ASME Transactions on Mechatronics, 18, 2013.
[11] R. F. Stengel. Optimal Control and Estimation. Dover books on advanced mathematics. Dover Publications, New York, 1994.
[12] F. Stulp, J. Buchli, E. Theodorou, and S. Schaal. Reinforcement learning of full-body humanoid motor skills. In Proceedings of 10th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 405-410, Dec 2010.
[13] E. Theodorou, Y. Tassa, and E. Todorov. Stochastic differential dynamic programming. In American Control Conference, 2010, pages 1125-1132, 2010.
[14] E. A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral approach to reinforcement learning. Journal of Machine Learning Research, 11:3137-3181, 2010.
[15] E. A. Theodorou and E. Todorov. Relative entropy and free energy dualities: Connections to path integral and KL control. In the Proceedings of IEEE Conference on Decision and Control, Dec 2012.
[16] Evangelos A. Theodorou. Nonlinear stochastic control and information theoretic dualities: Connections, interdependencies and thermodynamic interpretations. Entropy, 17(5), 2015.
[17] Sep Thijssen and H. J. Kappen. Path integral control and state-dependent feedback. Physical Review E, 91(3):032104, 2015.
[18] E. Todorov and W. Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. Pages 300-306, 2005.
[19] G. Williams, E. Rombokas, and T. Daniel. GPU based path integral control with learned dynamics. In Neural Information Processing Systems - ALR Workshop, 2014.
