Stochastic Optimal Control with Linearized Dynamics


Stochastic Optimal Control with Linearized Dynamics
Stochastisch optimale Regelung mit linearisierten Modellen
Master's thesis by Hany Abdulsamad
Date of submission: 1st reviewer: Prof. Gerhard Neumann, 2nd reviewer: Prof. Jan Peters, 3rd reviewer: Prof. Ulrich Konigorski


Declaration (Erklärung zur Master-Thesis)
I hereby declare that I have written the present Master's thesis without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. This thesis has not previously been submitted in the same or a similar form to any examination authority. Darmstadt, March 1, 2016 (Hany Abdulsamad)

Abstract
Policy Search is a powerful class of algorithms for learning optimal control policies of complex systems. By allowing a very broad description of tasks, these methods are suitable for solving challenging robotic applications. Although model-free Policy Search approaches require the least amount of knowledge about the environment, they often suffer from the disadvantage of having to draw a large number of samples from the system. Therefore, in cases where it is feasible to reconstruct the system dynamics, it is advantageous to include as much prior knowledge about the learning setting as possible. In this work we consider this insight as motivation for exploring model-based Policy Search algorithms. A recent approach, Guided Policy Search, has combined the strengths of a powerful model-based trajectory optimization technique, Stochastic Optimal Control, with Relative Entropy Policy Search to learn policies for complicated tasks like bipedal walking. By alternating between linearizing the system dynamics and optimizing local policies, it follows the main scheme of iterative methods like Differential Dynamic Programming and Iterative Linear Quadratic Gaussian. The novelty is, however, the introduction of a relative entropy bound on the trajectory distribution in order to preserve the locality of the linearization and improve the robustness of convergence. In this work we examine and reformulate Guided Policy Search in order to highlight its main contributions. We show that the bound on the trajectory distribution is equivalent to a bound on the change of the policy. Moreover, we motivate and propose a new constraint that strictly bounds the state distribution between iterations, and further ensures the validity of the linearization, while allowing us to perform larger steps on the policy update. In addition, we introduce a bound on the entropy of the policy, which allows us to control the ability of the controller to explore the action space and prevents premature convergence. We present results and compare all variants of the proposed algorithms on highly non-linear systems, such as swing-up tasks on torque- and angle-constrained double and quad pendulums. As supplementary material, we provide a full and detailed mathematical derivation of our methods.

Acknowledgments
I would especially like to thank Prof. Gerhard Neumann, Head of the Computational Learning and Autonomous Systems (CLAS) group, for introducing me to the ideas covered in this thesis and for his patient supervision, open-door policy and the countless hours of his time, which have resulted in many informative discussions for me. I also thank M.Sc. Oleg Arenz, who co-supervised me and always took the time to share his insights and experience. I owe a debt of gratitude to Prof. Jan Peters, Head of the Intelligent Autonomous Systems (IAS) group, and all IAS and CLAS members, who constantly engage and motivate their students. During my time at IAS and CLAS, I have had the pleasure of working closely with Alexandros Paraschos and Simone Parisi. I am deeply grateful for their help and support. Finally, I thank Prof. Ulrich Konigorski and M.Sc. Zhongyi Gong from the Institut für Regelungstechnik und Mechatronik (RTM) at the Electrical Engineering Department, who agreed to co-supervise me and showed interest in my work.

Contents
1. Introduction
   1.1 Locality and Validity of the Linearization
   1.2 Reinforcement Learning vs. Motion Planning
   1.3 Preliminaries: Markov Decision Processes; Stochastic Optimal Control; Information Theoretic Bounds (Differential Entropy, Relative Entropy)
2. Related Work
   2.1 Iterative Local Methods for Non-Linear Systems: Differential Dynamic Programming; Iterative Linear Quadratic Gaussian
   2.2 Relative Entropy Policy Search
3. Guided Policy Search
   3.1 Optimization Problem
   3.2 Dual Problem
   3.3 Policy Dependent Reward
   3.4 Implementation
4. State-Action Bound Policy Search
   4.1 Optimization Problem
   4.2 Dual Problem
   4.3 State-Action Dependent Reward
   4.4 Implementation: Circular Dependency of V_t(s) and µ_t(s); Block Descent over V_t(s) and µ_t(s); Gradient Descent over α_t; Block Coordinate Descent
5. Entropy State-Action Bound Policy Search
   Optimization Problem; Dual Problem; Augmented Reward; Implementation
6. Evaluation
   Double Pendulum Task; Quad Pendulum Task; Discussion

7. Future Work
   Separate Bounds on State and Action; Comparison to Full Gradient Descent; Principled Control of the Policy Entropy; Reformulation for Deterministic Policies; Further Evaluation on Larger and Real Systems
Conclusion
References
A. Derivation of Guided Policy Search
B. Derivation of State-Action Bound Policy Search
C. Derivation of Entropy State-Action Bound Policy Search

Figures and Tables
List of Figures
6.1 Double Pendulum Task: The total expected reward of GPS, SAPS and ESAPS in comparison during a swing-up task. Each learner is given 25 iterations per trial to find the best policy. To account for the stochasticity of the setup, 10 trials were performed and averaged. The hyperparameters of each learner were optimized separately to reflect its best performance.
6.2 Double Pendulum Task: The maximum change in the policy for each iteration of GPS, SAPS and ESAPS. GPS has a constant step that is equal to its KL-bound. SAPS takes significantly bigger steps while maintaining the upper bound on the state-action distribution. ESAPS is able to take the largest steps due to its ability to maintain a larger variance.
6.3 Quad Pendulum Task: The expected reward of GPS and ESAPS. Each learner is given 50 iterations. For a statistical mean of the expected reward, 10 trials were performed and averaged. The hyperparameters of each learner were optimized separately to reflect its best performance. The final results show ESAPS outperforming GPS significantly.
6.4 Quad Pendulum Task: The maximum step in the policy space for each iteration of GPS and ESAPS. The step of GPS, per definition, is constant and equal to its KL-bound. ESAPS, however, modulates the maximum step size based on the state-action bound.
List of Algorithms
1. Guided Policy Search in Pseudo-Code
2. State-Action Policy Search: Dual Block Descent over V_t(s) and µ_t(s) in Pseudo-Code
3. State-Action Policy Search: Dual Gradient Descent over α_t in Pseudo-Code
4. State-Action Policy Search: Dual Coordinate Descent in Pseudo-Code
5. Entropy State-Action Policy Search: Dual Coordinate Descent in Pseudo-Code

Abbreviations
List of Abbreviations
DDP    Differential Dynamic Programming
DP     Dynamic Programming
ESAPS  Entropy State-Action Bound Policy Search
GPS    Guided Policy Search
iLQG   Iterative Linear Quadratic Gaussian
KLD    Kullback-Leibler Divergence
LQG    Linear Quadratic Gaussian
MDP    Markov Decision Process
REPS   Relative Entropy Policy Search
RL     Reinforcement Learning
SAPS   State-Action Bound Policy Search
SOC    Stochastic Optimal Control

1 Introduction
Recent advancements in the field of robotics have resulted in considerable growth in robotic applications and tasks. The introduction of new platforms such as high dimensional humanoids and high velocity/torque manipulators gives the promise of making headway in solving major tasks like bipedal locomotion and grasping. However, with this progress comes a sharp rise in the complexity and nonlinearity of the dynamical systems in question, which poses several challenges from a control and planning point of view. Trajectory Optimization methods set out to solve Optimal Control problems under general time, energy and spatial constraints. Stochastic Optimal Control (SOC) with linearized dynamics, in particular, is a powerful approach to obtain optimal control laws for non-linear systems. Fundamental work on Stochastic Optimal Control includes Differential Dynamic Programming (DDP) (Mayne, 1966) (Jacobson and Mayne, 1970), Iterative Linear Quadratic Gaussian (Todorov and Li, 2005) (Tassa et al., 2012), Approximate Inference Control (AICO) (Toussaint, 2009) (Rawlik et al., 2010) and Robust Policy Updates for Stochastic Optimal Control (RSOC) (Rueckert et al., 2014).

1.1 Locality and Validity of the Linearization
Stochastic Optimal Control algorithms implement an iterative scheme using linearized dynamics to locally optimize the current trajectory. A key element in the stability of such a procedure is a mechanism to control the step size of the update of the controller in a principled manner. The linearized dynamics are only accurate in the vicinity of the linearization point. Solutions that stray too far from this linearization point have to be avoided, as they may cause oscillations or even instabilities. A recent approach, Guided Policy Search (GPS) (Levine and Koltun, 2014) (Levine and Abbeel, 2014), addresses this issue by introducing a relative entropy bound on the update of the trajectory distribution between iterations. In the course of this thesis, we will evaluate the aforementioned approach and extend it by proposing new bounds. We will argue for an explicit bound on the state distribution, instead of the trajectory distribution, as it is crucial to the linearization. This bound should provide stronger guarantees for the validity of the linearization between iterations, hence allowing more aggressive updates of the policy. Moreover, we will suggest an entropy constraint that allows us to control the exploration rate of the policy, thus preventing the premature convergence issues that have been observed in GPS.

1.2 Reinforcement Learning vs. Motion Planning
At this point it is necessary to draw an important distinction between two categories of Optimal Control methods, namely Motion Planning algorithms and Reinforcement Learning (RL) methods (Sutton and Barto, 1998). In Motion Planning, a complete model of the environment, a mapping from system dynamics to reward, is available and can be exploited to optimize the expected return of the trajectory. State-of-the-art algorithms in this area are CHOMP (Ratliff et al., 2009), STOMP (Kalakrishnan et al., 2011) and TRAJOPT (Schulman et al., 2013). In a Reinforcement Learning setting, on the other hand, such models are either learned online, as in PILCO (Deisenroth and Rasmussen, 2011) and GPS (Levine and Abbeel, 2014), or completely circumvented, as in REPS (Peters et al., 2010), in the process of finding the optimal policy. The focus of this work will be devoted to model-based Reinforcement Learning algorithms with state-of-the-art GPS as the center piece.

1.3 Preliminaries

Markov Decision Processes
A Markov Decision Process (MDP) is a mathematical model for sequential decision making in a Motion Planning or Reinforcement Learning setting. An MDP sequence is time discrete and can stretch over a finite or infinite time horizon. MDPs derive their name from the Markov property, which stipulates that in a Markovian system the distribution over the future states of the environment depends only on the current state and next action. The transition to future states is governed by the stochastic system dynamics P_t(s'|s, a). Because decision making has to be rationalized by some quantifiable measure, MDPs also specify reward functions R_t(s, a), which rate the quality of a state-action pair (s, a). Since we are interested in time-constrained trajectories, we will, for the remainder of this thesis, always consider finite-horizon MDPs. Furthermore, we assume the dynamics to be of linear-Gaussian nature with time-varying quadratic reward functions,

P_t(s'|s, a) = N(s' | A_t s + B_t a + c_t, Σ_t),   R_t(s, a) = s^T M_t s + a^T H_t a.

We dub these conditions on the dynamics and reward function the LQG assumptions.

Stochastic Optimal Control
The Stochastic Optimal Control objective is to find a control policy π_t(a|s) that maximizes a reward measure R_t(s, a) along a trajectory. General solutions are rare and restricted to a small class of systems. Aside from discrete systems, the most important exceptions are systems with linear-Gaussian dynamics. It has been shown that the optimal controller for systems adhering to the LQG assumptions can be computed in closed form by applying Dynamic Programming (DP) (Bellman, 1957). DP introduces the concepts of the state value function V_t(s) and the state-action value function Q_t(s, a). V_t(s) is defined as the expected reward-to-go under a certain policy π_t(a|s) starting from state s, whereas Q_t(s, a) is the expected reward-to-go after executing an action a and subsequently following the policy π_t(a|s). DP implements a backward induction algorithm that recursively computes both value functions starting from the end time point T, where V_T(s) = R_T(s). The Bellman equation, here given in continuous form, defines the relation between V_t(s) and Q_t(s, a) as follows

Q_t(s, a) = R_t(s, a) + ∫ P_t(s'|s, a) V_{t+1}(s') ds',
V_t(s) = ∫ π_t(a|s) Q_t(s, a) da.

Following these definitions, determining the optimal policy π_t(a|s) is reduced to finding a function that maximizes the state-action value function Q_t(s, a) at each time step of the trajectory

π_t(a|s) = argmax_a Q_t(s, a).

Information Theoretic Bounds
In this section we introduce the theoretical background to the entropy and relative entropy bounds that we will encounter in the course of this thesis.
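The backward induction described above can be sketched for a discrete MDP, where the integrals become sums and the recursion is exact. The following is a minimal illustration on a hypothetical two-state, two-action MDP; all transition probabilities and rewards are made-up numbers, not taken from the thesis.

```python
import numpy as np

# Tiny finite-horizon MDP (2 states, 2 actions, T = 3); all numbers are
# illustrative. P[a] is the transition matrix under action a, R[s, a] the
# immediate reward, R_T the terminal reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.1, 0.9], [0.7, 0.3]]])   # action 1
R = np.array([[1.0, 0.0], [0.0, 2.0]])     # R[s, a]
R_T = np.array([0.0, 5.0])                 # terminal reward R_T(s)
T = 3

V = R_T.copy()                  # V_T(s) = R_T(s)
policy = []
for t in reversed(range(T - 1)):
    # Q_t(s, a) = R(s, a) + sum_s' P(s'|s, a) V_{t+1}(s')
    Q = R + np.einsum('aij,j->ia', P, V)
    policy.append(Q.argmax(axis=1))   # greedy policy pi_t(s) = argmax_a Q
    V = Q.max(axis=1)                 # V_t(s) = max_a Q_t(s, a)
policy.reverse()
```

Running the recursion backward from t = T yields both value functions and the greedy policy at every step, exactly as the Bellman equation prescribes.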

Differential Entropy
The entropy of a distribution p over a random variable is a measure of the variance of that distribution and, thus, also a measure of the average amount of information embedded in it. Relevant to this work is the entropy of a probability distribution over a continuous random variable, also called differential entropy. In that case the entropy H of a distribution p(x) is defined as

H(p) = −∫ p(x) log p(x) dx.

We will use the entropy as a measure of the stochasticity of the control policy, which in turn allows us to judge and control its capability in exploring the state-action space.

Relative Entropy
The relative entropy, also known as the Kullback-Leibler divergence, D_KL(p||q) between two probability distributions p(x) and q(x), is a non-negative measure of the information loss incurred when q(x) is used to approximate p(x) and is defined as

D_KL(p(x) || q(x)) = ∫ p(x) log [ p(x) / q(x) ] dx.

The relative entropy of two conditionals p(y|x) and q(y|x), or their expected KL divergence, is defined analogously

D_KL(p(y|x) || q(y|x)) = ∫ p(x) ∫ p(y|x) log [ p(y|x) / q(y|x) ] dy dx.

The measure of relative entropy is central to the ideas of this work. We will use it to define bounds over different distributions with the aim of limiting their change between iterations, thus ensuring the stability of the algorithms.
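Both definitions can be checked numerically for one-dimensional Gaussians, for which closed-form expressions exist. The sketch below compares direct numerical integration against the well-known closed forms; the particular means and variances are arbitrary.

```python
import numpy as np

# Differential entropy and KL divergence for 1-D Gaussians: compare the
# closed-form expressions against direct numerical integration on a grid.
m1, s1 = 0.0, 1.0     # p = N(0, 1)
m2, s2 = 1.0, 2.0     # q = N(1, 4)

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
p = np.exp(-0.5 * ((x - m1) / s1) ** 2) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-0.5 * ((x - m2) / s2) ** 2) / (s2 * np.sqrt(2 * np.pi))

# H(p) = -int p log p dx; closed form: 0.5 log(2 pi e sigma^2)
H_num = -np.sum(p * np.log(p)) * dx
H_closed = 0.5 * np.log(2 * np.pi * np.e * s1 ** 2)

# D_KL(p||q) = int p log(p/q) dx; closed form for two Gaussians
kl_num = np.sum(p * np.log(p / q)) * dx
kl_closed = np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5
```

The agreement between the numerical and closed-form values is what makes the later LQG derivations tractable: every entropy and KL term in the dual can be evaluated analytically for Gaussians.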

2 Related Work

2.1 Iterative Local Methods for Non-Linear Systems
In our introduction of Stochastic Optimal Control, we have discussed the limitations of its framework, in which tractable solutions are exclusive to discrete and linear systems. These restrictions may seem to completely eliminate the possibility of applying Stochastic Optimal Control to non-linear systems. However, it is possible to apply SOC in an iterative scheme with the following structure:
- Starting from an initial state, apply an initial control sequence to the non-linear dynamics to obtain a finite state sequence.
- Linearize the dynamics around each point of the retrieved trajectory and quadratize the reward function.
- Formulate and solve a local LQG problem with respect to state-action deviations to get a new locally optimal policy.
- Execute the new policy on the non-linear system to obtain a new trajectory.
In the coming sections we will introduce Differential Dynamic Programming (DDP) and Iterative Linear Quadratic Gaussian (iLQG), two algorithms that follow this iterative cycle.

Differential Dynamic Programming
Differential Dynamic Programming was introduced in (Mayne, 1966) (Jacobson and Mayne, 1970). It follows the main scheme described above, starting from the objective of maximizing the expected reward along the trajectory τ = {s_1, a_1, ..., s_T, a_T}

J(s_1, A) = Σ_t R_t(s_t, a_t) + R_T(s_T).

Maximizing J is equivalent to finding the optimal state value function V_t(s) that maximizes the reward-to-go for each state and time step

V_t(s) = max_A J(s_t, A).

By setting V_T(s) = R_T(s) and applying Dynamic Programming, we can reduce the maximization over the whole control sequence to a sequence of maximizations over a single control

V_t(s) = max_a [ R_t(s, a) + ∫ P_t(s'|s, a) V_{t+1}(s') ds' ] = max_a [ R_t(s, a) + V_{t+1}(P_t(s, a)) ],

where P_t(s, a) are the linearized dynamics at each time step of the current trajectory. By moving to a notation that describes the perturbations around each state-action pair (s_t, a_t), we are able to reformulate the argument of the maximization problem

Q_t(δs, δa) = R_t(s_t + δs, a_t + δa) − R_t(s_t, a_t) + V_{t+1}(P_t(s_t + δs, a_t + δa)) − V_{t+1}(P_t(s_t, a_t)).

After expanding to second order, the Jacobians and Hessians of the dynamics can be determined. The subscripts denote the derivatives with respect to state and action

Q_{s,t} = R_{s,t} + P_{s,t}^T V_{s,t+1},
Q_{a,t} = R_{a,t} + P_{a,t}^T V_{s,t+1},
Q_{ss,t} = R_{ss,t} + P_{s,t}^T V_{ss,t+1} P_{s,t} + V_{s,t+1} · P_{ss,t},
Q_{aa,t} = R_{aa,t} + P_{a,t}^T V_{ss,t+1} P_{a,t} + V_{s,t+1} · P_{aa,t},
Q_{as,t} = R_{as,t} + P_{a,t}^T V_{ss,t+1} P_{s,t} + V_{s,t+1} · P_{as,t}.

For the optimal local control sequence δa_t, we maximize the function Q_t(δs, δa) and get a policy π_t(δa|δs) that resembles a linear controller

δa_t = argmax_{δa} Q_t(δs, δa) = −Q_{aa,t}^{-1} (Q_{a,t} + Q_{as,t} δs) = k_t + K_t δs.

Substituting the policy π_t(δa|δs) into Q_t(δs, δa) leads to a quadratic value function

ΔV_t = −(1/2) Q_{a,t}^T Q_{aa,t}^{-1} Q_{a,t},
V_{s,t} = Q_{s,t} − Q_{a,t}^T Q_{aa,t}^{-1} Q_{as,t},
V_{ss,t} = Q_{ss,t} − Q_{sa,t} Q_{aa,t}^{-1} Q_{as,t}.

Applying the new policy to the non-linear system to get a new trajectory completes one cycle of DDP. The main problem with this formulation is that it greedily exploits the local dynamics and produces policies that can be arbitrarily different between iterations, undermining the locality and validity of the linearization. In most cases this leads to divergence or oscillations. The authors addressed this issue by introducing a regularization of the action-reward Hessian

Q_{aa,t} = Q_{aa,t} − µI,

which is equivalent to adding a reward for staying close to the last policy and not straying. This regularization is helpful under the assumption that small changes in the policy imply small changes in the state space and, thus, preserve the validity of the linearization.

Iterative Linear Quadratic Gaussian
Iterative Linear Quadratic Gaussian sets out to correct the shortcomings of DDP by offering several improvements on the regularization and line search algorithms. In (Tassa et al., 2012) the authors present a new regularization on the state reward, which forces the new trajectory to stay close to the last one and results in modified state and action Hessian matrices

Q_{aa,t} = R_{aa,t} + P_{a,t}^T (V_{ss,t+1} − µI) P_{a,t} + V_{s,t+1} · P_{aa,t},
Q_{as,t} = R_{as,t} + P_{a,t}^T (V_{ss,t+1} − µI) P_{s,t} + V_{s,t+1} · P_{as,t},

which also results in a new quadratic value function that takes the new regularization into account

ΔV_t = (1/2) k_t^T Q_{aa,t} k_t + k_t^T Q_{a,t},
V_{s,t} = Q_{s,t} + K_t^T Q_{aa,t} k_t + K_t^T Q_{a,t} + Q_{as,t}^T k_t,
V_{ss,t} = Q_{ss,t} + K_t^T Q_{aa,t} K_t + K_t^T Q_{as,t} + Q_{as,t}^T K_t.
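For purely linear dynamics (all dynamics Hessians P_{ss}, P_{aa}, P_{as} equal to zero), the DDP backward recursion above reduces to the LQR Riccati recursion and can be written in a few lines. The sketch below uses illustrative matrices for a reward-maximization setting with negative definite quadratic rewards; it is not the thesis's implementation, only a minimal instance of the recursion.

```python
import numpy as np

# Backward pass of the recursion above for the linear-quadratic special case
# (P_ss = P_aa = P_as = 0, so DDP reduces to LQR). Reward maximization with
# negative definite rewards; A, B, M, H are illustrative.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # P_s: state Jacobian
B = np.array([[0.0], [0.1]])             # P_a: action Jacobian
M = -np.eye(2)                           # R_ss (negative definite)
H = -0.1 * np.eye(1)                     # R_aa (negative definite)
T = 50

V = M.copy()                             # V_ss,T = R_ss at the final step
gains = []
for t in reversed(range(T - 1)):
    Q_ss = M + A.T @ V @ A
    Q_aa = H + B.T @ V @ B
    Q_as = B.T @ V @ A
    K = -np.linalg.solve(Q_aa, Q_as)                    # K_t = -Q_aa^{-1} Q_as
    V = Q_ss - Q_as.T @ np.linalg.solve(Q_aa, Q_as)     # V_ss,t
    gains.append(K)
K0 = gains[-1]   # feedback gain at t = 0
```

After enough backward steps the gain approaches the stationary LQR solution, and the closed-loop system A + B K0 is stable, which is exactly the behavior the regularized recursion is meant to preserve on non-linear systems.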

Furthermore, with the aim of bounding the trajectory change even more, and preventing highly non-linear systems from diverging, the authors also introduce a scaler α on the policy parameters

â_t = a_t + α k_t + K_t δs.

This scaler is optimized by a line search method based on the expected improvement of the reward.

2.2 Relative Entropy Policy Search
Relative Entropy Policy Search (REPS) is a model-free Reinforcement Learning approach (Peters et al., 2010). The novelty of REPS is the introduction of a new type of bound that can be imposed between updates. The bound resembles a relative entropy measure, or Kullback-Leibler divergence, on the state-action distribution. In a Reinforcement Learning environment this constraint is crucial to convergence, as it preserves the experience contained in the last policy and last state distribution, which has developed over multiple iterations, and constrains the algorithm from jumping arbitrarily to new unexplored regions of the state space. The optimization problem under REPS is given as

argmax_{π(a|s), µ(s)} Σ_{s,a} R(s, a) µ(s) π(a|s),   (2.5a)
s.t. Σ_{s,a} µ(s) π(a|s) log [ µ(s) π(a|s) / q(s, a) ] ≤ ε,   (2.5b)
Σ_{s'} µ(s') Φ(s') = Σ_{s,a,s'} µ(s) π(a|s) P(s'|s, a) Φ(s'),   (2.5c)
Σ_{s,a} µ(s) π(a|s) = 1,   (2.5d)

where the objective 2.5a maximizes the reward with respect to the joint distribution over the states µ(s) and the conditional actions π(a|s), and Equation 2.5b ensures that the state-action distribution µ(s)π(a|s) stays close to the old one q(s, a). Under this formulation the optimal policy is a normalized exponential

π(a|s) ∝ exp( (1/η) [ η log q(s, a) + R(s, a) + Σ_{s'} P(s'|s, a) θ^T Φ(s') − θ^T Φ(s) ] ).

The parameters θ and η are the Lagrangian multipliers corresponding to Equations 2.5c and 2.5b and can be optimized by gradient descent methods.
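The role of the temperature η in the REPS solution can be illustrated in the simplest, feature-free case (a single state, the "bandit" special case of the normalized exponential above): samples are reweighted by exp(R/η), and a larger η keeps the new distribution closer to the old one. The rewards below are randomly generated, purely for illustration.

```python
import numpy as np

# Effect of the temperature eta on the KL-constrained update: reweight a
# uniform old distribution q over sampled rewards by exp(R/eta), the
# feature-free special case of the REPS policy. Larger eta means a smaller
# KL to q; smaller eta means a greedier update. Rewards are illustrative.
rng = np.random.default_rng(0)
R = rng.normal(size=1000)
q = np.full(1000, 1.0 / 1000)

def reweight(eta):
    w = q * np.exp((R - R.max()) / eta)   # shift by max for numerical stability
    return w / w.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p_greedy = reweight(0.1)   # small eta: aggressive step, far from q
p_safe = reweight(10.0)    # large eta: conservative step, close to q
```

The greedy setting concentrates mass on the highest-reward samples at the price of a large divergence from q, which is precisely the behavior the KL constraint is introduced to control.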

3 Guided Policy Search
Guided Policy Search (GPS) was developed over multiple publications (Levine and Koltun, 2013, 2014; Levine and Abbeel, 2014). The idea presented in (Levine and Koltun, 2013) is to introduce a set of guiding trajectories, generated under locally optimal Differential Dynamic Programming (DDP) and weighted by Importance Sampling (IS), to exploit regions in the state space with high reward and to "guide" and speed up convergence. In (Levine and Koltun, 2014) the algorithm was further modified to ensure the usefulness of the guiding trajectories. This improvement is achieved by alternating between optimizing a set of trajectories for high reward (Trajectory Optimization) and constraining the policy to match the actions in each trajectory, thus keeping the policy updates from straying into unexplored regions of the state space (Policy Search). While the contributions in (Levine and Koltun, 2013, 2014) are interesting in their own standing, this thesis will concentrate on the core of Guided Policy Search in its latest and most refined version presented in (Levine and Abbeel, 2014), which imposes a KL-divergence bound on the trajectory distribution between iterations.

3.1 Optimization Problem
In their work the authors adopt a trajectory-based notation (Levine and Abbeel, 2014)

argmax_{p(τ)} ∫ R(τ) p(τ) dτ,   (3.1a)
s.t. ∫ p(τ) log [ p(τ) / q(τ) ] dτ ≤ ε,   (3.1b)
p(τ) = p(s_1) Π_{t=1}^{T-1} P_t(s_{t+1}|s_t, a_t) π_t(a_t|s_t),   (3.1c)

where the objective 3.1a maximizes the reward R(τ) along the trajectory τ = {s_1, a_1, ..., s_T, a_T}, while Equation 3.1b provides the KL-bound on the current and last trajectory distributions p(τ) and q(τ). Equation 3.1c propagates the state along the trajectory under the local linear dynamics P_t(s'|s, a) and the Gaussian policy π_t(a|s), starting from the state distribution p(s_1). We find this notation to be somewhat unclear, therefore we transform the problem to its step-based equivalent. Thus, we are able to show that the KL-divergence bound imposed on the trajectory distribution p(τ) can be, in fact, simplified to a bound set on the policy π_t(a|s). For the purpose of clarity we perform this transformation explicitly.
By substituting the dynamics constraint 3.1c into the KL-bound 3.1b and replacing trajectories τ with state-action pairs (s, a), we can rewrite the integral in 3.1b

D_KL(p(τ) || q(τ)) = ∫ p(τ) log [ p(s_1) Π_{t=1}^{T-1} P_t(s_{t+1}|s_t, a_t) π_t(a_t|s_t) / ( p(s_1) Π_{t=1}^{T-1} P_t(s_{t+1}|s_t, a_t) q_t(a_t|s_t) ) ] dτ   (3.2a)
= Σ_{t=1}^{T-1} ∫∫ p_t(s, a) log [ π_t(a|s) / q_t(a|s) ] da ds   (3.2b)
= Σ_{t=1}^{T-1} ∫ p_t(s) ∫ π_t(a|s) log [ π_t(a|s) / q_t(a|s) ] da ds.   (3.2c)

From Equation 3.2c, it is clear that a KL-bound on the trajectory distribution is equivalent to an expected bound on the policy at each time step. At this point we are able to rewrite the whole problem in our new state-action-pair notation

argmax_{π_t(a|s)} Σ_{t=1}^{T-1} ∫∫ R_t(s, a) µ_t(s) π_t(a|s) da ds + ∫ µ_T(s) R_T(s) ds,   (3.3a)
s.t. ∀t > 1: ∫∫ µ_{t-1}(s) π_{t-1}(a|s) P_{t-1}(s'|s, a) da ds = µ_t(s'),   (3.3b)
∀t < T: ∫ µ_t(s) ∫ π_t(a|s) log [ π_t(a|s) / q_t(a|s) ] da ds ≤ ε,   (3.3c)
∀t: ∫ π_t(a|s) da = 1,   (3.3d)
t = 1: µ_1(s) = p_1(s),   (3.3e)

where the reward R_t(s, a) is to be maximized with respect to the state-action distribution, given by the policy π_t(a|s) and its induced state distribution µ_t(s), under the system dynamics constraint 3.3b, which propagates the initial state distribution through time and is referred to as the forward pass. Equation 3.3c is a constraint on the expected KL-bound on the policy for each time step, whereas Equation 3.3d ensures the policy is a distribution, and Equation 3.3e specifies the initial state distribution µ_1(s).
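The equivalence derived in Equation 3.2, that the trajectory KL collapses to a sum of expected per-step policy KLs when both distributions share the initial state distribution and dynamics, can be checked numerically by brute force on a tiny discrete MDP. All probabilities below are randomly generated and purely illustrative.

```python
import numpy as np
from itertools import product

# Check Equation 3.2 numerically: for two trajectory distributions sharing
# p(s_1) and the dynamics P_t but differing in their policies, the trajectory
# KL equals the sum over time of the expected policy KL.
rng = np.random.default_rng(1)
nS, nA, T = 2, 2, 3                     # 3 states visited, 2 policy steps
p1 = np.array([0.6, 0.4])               # p(s_1)
P = rng.dirichlet(np.ones(nS), size=(T - 1, nS, nA))   # P_t(s'|s,a)
pi = rng.dirichlet(np.ones(nA), size=(T - 1, nS))      # new policy pi_t(a|s)
q = rng.dirichlet(np.ones(nA), size=(T - 1, nS))       # old policy q_t(a|s)

def traj_prob(policy, states, actions):
    pr = p1[states[0]]
    for t in range(T - 1):
        pr *= policy[t, states[t], actions[t]] * P[t, states[t], actions[t], states[t + 1]]
    return pr

# Left-hand side: KL over full trajectories tau = (s_1, a_1, ..., s_T)
kl_traj = 0.0
for states in product(range(nS), repeat=T):
    for actions in product(range(nA), repeat=T - 1):
        pp = traj_prob(pi, states, actions)
        qq = traj_prob(q, states, actions)
        kl_traj += pp * np.log(pp / qq)

# Right-hand side: sum_t E_{p_t(s)} KL(pi_t(.|s) || q_t(.|s))
kl_steps, mu = 0.0, p1.copy()
for t in range(T - 1):
    for s in range(nS):
        kl_steps += mu[s] * np.sum(pi[t, s] * np.log(pi[t, s] / q[t, s]))
    mu = np.einsum('s,sa,saj->j', mu, pi[t], P[t])     # forward pass
```

The identical dynamics factors cancel inside the logarithm, which is why only the policy ratio survives; the two quantities agree to machine precision.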

3.2 Dual Problem
For the purposes of this thesis we produce a complete derivation of the closed-form solution of Guided Policy Search under the assumptions of linear dynamics, Gaussian noise and quadratic rewards, see Appendix A. We start by applying the method of Lagrangian multipliers to formulate the so-called primal problem, which introduces a new Lagrangian multiplier per constraint and time step. The state-dependent Lagrangian multipliers V_t(s) are associated with the dynamics constraint 3.3b and will later resemble the state value function, while the α_t are associated with the KL-bound given in Equation 3.3c. By solving for the optimal policy π_t(a|s) we obtain a normalized exponential of the state-action value function Q_t(s, a)

π_t(a|s) ∝ exp( (1/α_t) [ α_t log q_t(a|s) + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ).   (3.4)

By plugging Equation 3.4 into the primal problem we arrive at the Lagrangian dual L(µ_t, V_t, α_t)

L(µ_t, V_t, α_t) = ∫ µ_T(s) R_T(s) ds + ∫ V_1(s) p_1(s) ds − Σ_{t=2}^{T} ∫ V_t(s') µ_t(s') ds' + Σ_{t=1}^{T-1} α_t ε
+ Σ_{t=1}^{T-1} α_t ∫ µ_t(s) log ∫ q_t(a|s) exp( (1/α_t) [ R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ) da ds.   (3.5)

The dual L is a function of the state distributions µ_t(s) and the Lagrangian multipliers V_t(s) and α_t. By exploiting the duality of this optimization, we are able to maximize the primal problem by minimizing the dual function (Boyd and Vandenberghe, 2009). Therefore, we take the partial derivatives of L and apply dual descent in their respective directions,

∂L/∂µ_t = R_T(s) − V_T(s), for t = T,
∂L/∂µ_t = −V_t(s) + α_t log ∫ q_t(a|s) exp( (1/α_t) [ R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ) da, for t < T,   (3.6a)

∂L/∂V_t = p_1(s) − µ_1(s), for t = 1,
∂L/∂V_t = ∫∫ π_{t-1}(a|ŝ) µ_{t-1}(ŝ) P_{t-1}(s|ŝ, a) da dŝ − µ_t(s), for t > 1,   (3.6b)

∂L/∂α_t = ε − ∫ µ_t(s) ∫ π_t(a|s) log [ π_t(a|s) / q_t(a|s) ] da ds.   (3.6c)

Setting the derivatives in Equations 3.6a and 3.6b to zero delivers two optimality conditions for the state value function V_t(s) and the state distribution µ_t(s), which correspond to a backward pass (backward propagation of future reward) and a forward pass (forward propagation of the state distribution) respectively

V_t(s) = R_T(s), for t = T,
V_t(s) = α_t log ∫ q_t(a|s) exp( (1/α_t) [ R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ) da, for t < T,   (3.7a)

µ_t(s) = p_1(s), for t = 1,
µ_t(s) = ∫∫ π_{t-1}(a|ŝ) µ_{t-1}(ŝ) P_{t-1}(s|ŝ, a) da dŝ, for t > 1.   (3.7b)

Under the LQG assumptions, these passes can be computed in closed form, whereas the α_t have to be optimized by gradient descent. Considering the partial derivative of L with respect to α_t, it is worth noting that at the optimal point the KL-constraint given in Equation 3.3c is met exactly at the bound ε, because the gradient in Equation 3.6c becomes zero. Finally, by plugging Equations 3.7a and 3.7b into Equation 3.5 the dual simplifies to

L(µ_t, V_t, α_t) = ∫ V_1(s) µ_1(s) ds + Σ_{t=1}^{T-1} α_t ε.   (3.8)

3.3 Policy Dependent Reward
An interesting insight into the state value function V_t(s), which stands for the expected reward-to-go and is defined in Equation 3.7a, is the emergence of a new term that augments the immediate reward to include a policy-related term q_t(a|s), in addition to the standard state-action reward provided by the time-varying function R_t(s, a) in settings analog to DDP and iLQG

r_t(s, a) = R_t(s, a) + α_t log q_t(a|s).   (3.9)

Under linear-Gaussian dynamics P_t(s'|s, a) = N(s' | A_t s + B_t a + c_t, Σ_t) and a quadratic reward R_t(s, a) = (z_t − s)^T M_t (z_t − s) + a^T H_t a, we show that the overall reward r_t(s, a) is also quadratic

r_t(s, a) = s^T R_{ss,t} s + a^T R_{aa,t} a + s^T R_{sa,t} a + a^T R_{sa,t}^T s + s^T r_{s,t} + a^T r_{a,t} + r_{0,t}.   (3.10)

R_{ss,t} = M_t − (α_t/2) (K_t^q)^T (Σ_{a,t}^q)^{-1} K_t^q,   (3.11a)
R_{aa,t} = H_t − (α_t/2) (Σ_{a,t}^q)^{-1},   (3.11b)
R_{sa,t} = (α_t/2) (K_t^q)^T (Σ_{a,t}^q)^{-1},   (3.11c)
r_{s,t} = −α_t (K_t^q)^T (Σ_{a,t}^q)^{-1} k_t^q − 2 M_t z_t,   (3.11d)
r_{a,t} = α_t (Σ_{a,t}^q)^{-1} k_t^q,   (3.11e)
r_{0,t} = z_t^T M_t z_t − (α_t/2) log |2πΣ_{a,t}^q| − (α_t/2) (k_t^q)^T (Σ_{a,t}^q)^{-1} k_t^q.   (3.11f)

A quadratic reward function r_t(s, a), by definition, forces a quadratic state value function V_t(s)

V_t(s) = s^T V_t s + s^T v_t + v_{0,t}.   (3.12)

In turn, and by considering Equation 3.4, a quadratic state value function gives rise to a time-varying linear-Gaussian optimal policy

π_t(a|s) = N(a | k_t^π + K_t^π s, Σ_{a,t}^π).   (3.13)

3.4 Implementation
In this section we describe the structure of our version of Guided Policy Search as we have implemented it. For the purpose of brevity, we do not consider the process of linearization. Generally, linearization is done by sampling full trajectories from the non-linear system under the current policy and fitting linear-Gaussian dynamics at each time step. The implementation discussed here focuses on the optimization step and presupposes the existence of the linearized dynamics. Based on the derivation of the dual function from the previous section, we have transformed the problem into a convex minimization problem over three parameters per time step: V_t(s), µ_t(s) and α_t. However, since Equations 3.7a and 3.7b deliver closed-form solutions for the optimal state value function V_t(s) and state distribution µ_t(s) as functions of α_t, the problem is reduced to a minimization of the dual with respect to α_t and can be iteratively solved by a gradient descent scheme. In this case, the whole procedure can be seen as a batch-coordinate-descent optimization with respect to V_t(s), µ_t(s) and α_t. Algorithm 1 shows the step-by-step sequence of the minimization. Although a gradient descent implementation is a straightforward procedure, it is recommended to use more sophisticated optimizers as provided by Mathworks MATLAB or the Non-Linear Optimization Library (NLopt) (Johnson, 2016), because they provide advanced heuristics for modulating the step size along the gradient and numerical estimates of the second degree derivatives (Hessian), generally leading to faster convergence and lower computation cost. For reasons related to computational stability and efficiency, all our algorithms will be implemented in the framework of the Armadillo Linear Algebra Library (Sanderson, 2010).
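The closed form of the forward pass (Equation 3.7b) under the LQG assumptions can be made concrete: pushing a Gaussian state distribution through a time-varying linear-Gaussian policy and linear-Gaussian dynamics yields another Gaussian. The sketch below uses illustrative matrices (not taken from the thesis) and cross-checks the closed form by sampling.

```python
import numpy as np

# Closed-form forward pass under LQG assumptions: a Gaussian state
# distribution mu_t = N(m, S) pushed through a linear-Gaussian policy
# a = k + K s + eps_a and dynamics s' = A s + B a + c + eps_d stays Gaussian.
# All matrices are illustrative; the result is cross-checked by sampling.
rng = np.random.default_rng(2)
m, S = np.array([1.0, -1.0]), np.diag([0.5, 0.2])         # mu_t(s)
k, K, Sa = np.array([0.2]), np.array([[0.3, -0.1]]), np.array([[0.05]])
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([0.0, 0.1])
Sd = 0.01 * np.eye(2)                                     # dynamics noise

# closed form: s' = (A + B K) s + B k + c + B eps_a + eps_d
F = A + B @ K
m_next = F @ m + B @ k + c
S_next = F @ S @ F.T + B @ Sa @ B.T + Sd

# Monte Carlo cross-check
n = 200000
s = rng.multivariate_normal(m, S, size=n)
a = s @ K.T + k + rng.multivariate_normal(np.zeros(1), Sa, size=n)
s2 = s @ A.T + a @ B.T + c + rng.multivariate_normal(np.zeros(2), Sd, size=n)
```

This is the building block a `forward_pass` routine would iterate over all time steps to produce the state distributions µ_t(s).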

input : T /* time horizon */
        P_t(s'|s,a) /* linearized dynamics */
        µ_1(s) /* initial state distribution */
        q_t(a|s) /* last policy */
        M_t, H_t, z_t /* reward matrices and goal state */
output: π_t(a|s) /* optimal policy */
        V_t(s) /* optimal state value function */
        µ_t(s) /* state distribution under optimal policy */
        α_t /* optimal Lagrangian parameters α_t */
initialize α_t /* initial guess of α_t */
/* minimizing the dual by gradient descent */
while L(µ_t, V_t, α_t) not at a minimum do
    /* compute augmented reward function using Equation 3.10 */
    r_t(s,a) ← overall_reward(M_t, H_t, z_t, q_t(a|s), α_t)
    /* compute value function and policy using Equations 3.7a and 3.4 */
    [V_t(s), π_t(a|s)] ← backward_pass(r_t(s,a), P_t(s'|s,a), α_t)
    /* compute the state distribution using Equation 3.7b */
    µ_t(s) ← forward_pass(µ_1(s), π_t(a|s), P_t(s'|s,a))
    /* update Lagrange dual value with Equation 3.8 */
    L(µ_t, V_t, α_t) ← update_dual(V_1(s), µ_1(s), α_t, ε)
    /* compute Lagrange dual gradient with respect to α_t using Equation 3.6c */
    ∂L/∂α_t ← dual_alpha_gradient(µ_t(s), π_t(a|s), q_t(a|s), ε)
    /* update α_t along the gradient with step λ */
    α_t ← α_t − λ ∂L/∂α_t
Algorithm 1: Guided Policy Search in Pseudo-Code
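The logic of the α-update in Algorithm 1 can be illustrated on a single KL-constrained soft-max: since the dual gradient with respect to α is ε − KL(π_α || q) (Equation 3.6c) and the KL of π_α(a) ∝ q(a) exp(Q(a)/α) shrinks monotonically as α grows, the optimal α can be found by a simple bisection, used here as a stand-in for the gradient descent of the pseudocode. The values below are illustrative.

```python
import numpy as np

# One-step illustration of the alpha-update: for pi_alpha(a) ∝ q(a) exp(Q(a)/alpha),
# the dual gradient w.r.t. alpha is eps - KL(pi_alpha || q), and the KL shrinks as
# alpha grows. Bisection on alpha (a stand-in for gradient descent) therefore
# lands where the KL-constraint holds with equality. Values are illustrative.
Q = np.array([1.0, 0.5, 0.0, -0.5])      # state-action values for one state
q = np.array([0.25, 0.25, 0.25, 0.25])   # old policy
eps = 0.1                                # KL bound

def kl_at(alpha):
    p = q * np.exp((Q - Q.max()) / alpha)   # shift by max for stability
    p /= p.sum()
    return float(np.sum(p * np.log(p / q)))

lo, hi = 1e-3, 1e3                       # KL(lo) > eps > KL(hi)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if kl_at(mid) > eps else (lo, mid)
alpha_star = 0.5 * (lo + hi)
```

At the returned α the constraint is active, matching the observation in Section 3.2 that the KL-bound is met exactly at ε at the optimum.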

4 State-Action Bound Policy Search
At the beginning of this thesis we introduced the general scheme of applying Stochastic Optimal Control to non-linear systems. The main challenge is the absence of theoretical guarantees on the improvement of the induced trajectory after each iteration. This shortcoming is due to the restricted validity of the local dynamics in a small region around the linearization point. A greedy exploitation of the linearized dynamics may lead to policies that force the non-linear system into regions of the state space that are "far away" from what is expected under the linearized model, making the optimization step under the model meaningless. Therefore, it is crucial to maintain a bound on the state distribution between iterations in order to ensure the validity of the locally optimized controller. Iterative Linear Quadratic Gaussian (iLQG) tries to solve this problem by introducing a scaler on the policy parameters, which is optimized by a backtracking line-search scheme that increases or reduces the step size based on the improvement in the expected reward. Guided Policy Search follows a similar logic; by introducing a relative entropy bound on the change of the stochastic policy, the induced state distribution becomes implicitly bounded. However, for highly dynamical systems this condition would require imposing very small steps on the policy, which might dramatically slow down convergence and cost a considerable extra amount of samples on the real system. In this chapter we aim to address this issue. We propose the introduction of an explicit relative entropy bound on the state-action distribution and set out to show that such a bound allows taking larger steps in the policy space while preventing the state distribution from diverging, thus reducing the number of needed iterations and overall samples.

4.1 Optimization Problem
We take a similar formulation to Guided Policy Search, but replace the KL-bound on the policy distribution by a bound on the state-action distribution

argmax_{π_t(a|s)} Σ_{t=1}^{T-1} ∫∫ R_t(s, a) µ_t(s) π_t(a|s) da ds + ∫ µ_T(s) R_T(s) ds,   (4.1a)
s.t. ∀t > 1: ∫∫ µ_{t-1}(s) π_{t-1}(a|s) P_{t-1}(s'|s, a) da ds = µ_t(s'),   (4.1b)
∀t < T: ∫∫ µ_t(s) π_t(a|s) log [ µ_t(s) π_t(a|s) / q_t(s, a) ] da ds ≤ ε,   (4.1c)
∀t: ∫ π_t(a|s) da = 1,   (4.1d)
t = 1: µ_1(s) = p_1(s).   (4.1e)

The objective in 4.1a seeks to maximize the reward under the final state-action distribution p_t(s, a) = µ_t(s)π_t(a|s), while 4.1b keeps the state distribution µ_t(s) under the constraint of the linearized system dynamics. Our novelty, the state-action bound, is introduced in 4.1c, with q_t(s, a) representing the state-action distribution of the last linearization. The remaining constraints 4.1d and 4.1e ensure that the policy is a distribution and specify the initial state distribution respectively.
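The relationship between the new bound 4.1c and the GPS policy bound follows from the chain rule of the KL divergence: the state-action KL splits into a KL between the state marginals plus the expected policy KL, so bounding the joint bounds both parts. A discrete numerical check of this identity, with randomly generated (purely illustrative) distributions:

```python
import numpy as np

# Chain rule of the KL divergence, relating the state-action bound (4.1c) to
# the policy bound of GPS: KL(mu*pi || q_s*q_a) equals the KL between the
# state marginals plus the expected policy KL. Distributions are illustrative.
rng = np.random.default_rng(3)
nS, nA = 3, 2
mu = rng.dirichlet(np.ones(nS))            # current state distribution mu(s)
qs = rng.dirichlet(np.ones(nS))            # last state distribution q(s)
pi = rng.dirichlet(np.ones(nA), size=nS)   # current policy pi(a|s)
qa = rng.dirichlet(np.ones(nA), size=nS)   # last policy q(a|s)

joint_p = mu[:, None] * pi                 # mu(s) pi(a|s)
joint_q = qs[:, None] * qa                 # q(s) q(a|s)

kl_joint = np.sum(joint_p * np.log(joint_p / joint_q))
kl_state = np.sum(mu * np.log(mu / qs))
kl_policy = np.sum(mu[:, None] * pi * np.log(pi / qa))
```

Because both terms on the right are non-negative, a single bound ε on the joint simultaneously limits the drift of the state distribution and the step in the policy space.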

4.2 Dual Problem
As in our derivation of Guided Policy Search in Chapter 3, we apply the method of Lagrangian multipliers to formulate the primal problem with one Lagrangian multiplier per constraint and time step. The full derivation under the LQG assumptions is listed in Appendix B. In this case, the optimal policy is also a normalized exponential of the state-action value function Q_t(s, a)

π_t(a|s) ∝ exp( (1/α_t) [ α_t log q_t(s, a) + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ).   (4.2)

We obtain the Lagrangian dual function L(µ_t, V_t, α_t) by substituting the optimal policy of Equation 4.2 into the primal problem

L = ∫ µ_T(s) R_T(s) ds + ∫ V_1(s) p_1(s) ds − ∫ V_T(s') µ_T(s') ds' − Σ_{t=2}^{T-1} ∫ V_t(s') µ_t(s') ds' + Σ_{t=1}^{T-1} α_t ε − Σ_{t=1}^{T-1} α_t ∫ µ_t(s) log µ_t(s) ds   (4.3)
+ Σ_{t=1}^{T-1} α_t ∫ µ_t(s) log ∫ exp( (1/α_t) [ α_t log q_t(s, a) + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ) da ds.

According to the principle of duality, minimizing the dual function is equivalent to maximizing the primal problem (Boyd and Vandenberghe, 2009). Therefore, we minimize L by taking its partial derivatives

∂L/∂µ_t = R_T(s) − V_T(s), for t = T,
∂L/∂µ_t = −V_t(s) + α_t log ∫ exp( (1/α_t) [ α_t log q_t(s, a) − α_t log µ_t(s) − α_t + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ) da, for t < T,   (4.4a)

∂L/∂V_t = p_1(s) − µ_1(s), for t = 1,
∂L/∂V_t = ∫∫ π_{t-1}(a|ŝ) µ_{t-1}(ŝ) P_{t-1}(s|ŝ, a) da dŝ − µ_t(s), for t > 1,   (4.4b)

∂L/∂α_t = ε − ∫∫ µ_t(s) π_t(a|s) log [ µ_t(s) π_t(a|s) / q_t(s, a) ] da ds.   (4.4c)

At the optimal point of L the partial derivatives are equal to zero, which can be seen as optimality conditions for the state value function V_t(s) and the state distribution µ_t(s)

V_t(s) = R_T(s), for t = T,
V_t(s) = α_t log ∫ exp( (1/α_t) [ α_t log q_t(s, a) − α_t log µ_t(s) − α_t + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] ) da, for t < T,   (4.5a)

µ_t(s) = p_1(s), for t = 1,
µ_t(s) = ∫∫ π_{t-1}(a|ŝ) µ_{t-1}(ŝ) P_{t-1}(s|ŝ, a) da dŝ, for t > 1.   (4.5b)

Analog to Guided Policy Search in Chapter 3, the optimality conditions resemble a backward pass and a forward pass that can be computed in closed form in an LQG environment. Furthermore, the KL-constraint 4.1c is met exactly at the bound ε due to Equation 4.4c becoming equal to zero at the optimal point. Also, using Equations 4.5a and 4.5b, we can further simplify the Lagrange dual L(µ_t, V_t, α_t)

L(µ_t, V_t, α_t) = ∫ V_1(s) µ_1(s) ds + Σ_{t=1}^{T-1} α_t (ε + 1).   (4.6)

4.3 State-Action Dependent Reward
The introduction of the state-action constraint 4.1c results in an augmented reward function. The new terms not only account for the distance to the last policy q_t(a|s), but also weigh the distance between µ_t(s), the current state distribution, and q_t(s), the state distribution under the last policy around which the system was linearized

r_t(s, a) = R_t(s, a) + α_t log q_t(s, a) − α_t log µ_t(s) − α_t
= R_t(s, a) + α_t log q_t(a|s) + α_t log q_t(s) − α_t log µ_t(s) − α_t.   (4.7)

By substituting the Gaussian state distributions q_t(s) = N(s | τ_{s,t}^q, Σ_{s,t}^q) and µ_t(s) = N(s | τ_{s,t}^p, Σ_{s,t}^p), the Gaussian policy q_t(a|s) = N(a | k_t^q + K_t^q s, Σ_{a,t}^q) and the quadratic reward function R_t(s, a) = (z_t − s)^T M_t (z_t − s) + a^T H_t a, the overall reward r_t(s, a) also becomes quadratic

r_t(s, a) = s^T R_{ss,t} s + a^T R_{aa,t} a + s^T R_{sa,t} a + a^T R_{sa,t}^T s + s^T r_{s,t} + a^T r_{a,t} + r_{0,t},   (4.8a)

R_{ss,t} = M_t − (α_t/2) (K_t^q)^T (Σ_{a,t}^q)^{-1} K_t^q − (α_t/2) (Σ_{s,t}^q)^{-1} + (α_t/2) (Σ_{s,t}^p)^{-1},   (4.8b)
R_{aa,t} = H_t − (α_t/2) (Σ_{a,t}^q)^{-1},   (4.8c)
R_{sa,t} = (α_t/2) (K_t^q)^T (Σ_{a,t}^q)^{-1},   (4.8d)
r_{s,t} = −α_t (K_t^q)^T (Σ_{a,t}^q)^{-1} k_t^q + α_t (Σ_{s,t}^q)^{-1} τ_{s,t}^q − α_t (Σ_{s,t}^p)^{-1} τ_{s,t}^p − 2 M_t z_t,   (4.8e)
r_{a,t} = α_t (Σ_{a,t}^q)^{-1} k_t^q,   (4.8f)
r_{0,t} = z_t^T M_t z_t − (α_t/2) log |2πΣ_{a,t}^q| − (α_t/2) (k_t^q)^T (Σ_{a,t}^q)^{-1} k_t^q
− (α_t/2) log |2πΣ_{s,t}^q| − (α_t/2) (τ_{s,t}^q)^T (Σ_{s,t}^q)^{-1} τ_{s,t}^q − α_t
+ (α_t/2) log |2πΣ_{s,t}^p| + (α_t/2) (τ_{s,t}^p)^T (Σ_{s,t}^p)^{-1} τ_{s,t}^p.   (4.8g)

4.4 Implementation
In this section we present the implementation of State-Action Bound Policy Search (SAPS). We ignore the linearization step and focus on the convex minimization problem of the dual L(µ_t, V_t, α_t) presented in the previous section.

4.4.1 Circular Dependency of V_t(s) and µ_t(s)
The equations of the backward pass 4.5a and forward pass 4.5b introduce a new algorithmic challenge that did not occur under Guided Policy Search. The emergence of new state-distribution-dependent terms in the augmented reward function r_t(s, a) of the state value function V_t(s) generates a circular dependency between V_t(s) and the state distribution µ_t(s). This relation becomes clear when we recognize that the state distribution µ_t(s) is a function of the policy π_t(a|s), Equation 4.5b, and that π_t(a|s) is in itself a function of the state value function V_t(s), Equation 4.2.

4.4.2 Block Descent over V_t(s) and µ_t(s)
At this point we propose a new approach to calculate the state value function V_t(s) and state distribution µ_t(s). The Equations 4.5a and 4.5b still offer optimality conditions and can be used iteratively in a

block-descent scheme on the dual L(µ_t, V_t, α_t). Starting with an initial and broad guess of the state distribution p_t(s), we iteratively apply the backward pass, to compute V_t(s) and π_t(a|s), and the forward pass, to compute µ_t(s), and update p_t(s) by interpolating in the direction of µ_t(s) until both distributions match. Algorithm 2 provides a detailed view of this procedure.

input : T /* time horizon */
        P_t(s'|s,a) /* linearized dynamics */
        µ_1(s) /* initial state distribution */
        q_t(a|s) /* last policy */
        q_t(s) /* last state distribution */
        α_t /* current Lagrangian parameters α_t */
        M_t, H_t, z_t /* reward matrices and goal state */
output: π_t(a|s) /* policy under current α_t */
        V_t(s) /* state value function under current α_t */
        µ_t(s) /* state distribution under current α_t */
initialize p_t(s) /* initial guess of the state distribution */
           L(µ_t, V_t, α_t) /* initial dual value */
           γ /* interpolation step size */
/* minimizing the dual with respect to V_t(s) and µ_t(s) */
while p_t(s) ≠ µ_t(s) do
    /* compute augmented reward function using Equation 4.7 */
    r_t(s,a) ← overall_reward(M_t, H_t, z_t, q_t(a|s), q_t(s), p_t(s), α_t)
    /* compute value function and policy using Equations 4.5a and 4.2 */
    [V_t(s), π_t(a|s)] ← backward_pass(r_t(s,a), P_t(s'|s,a), α_t)
    /* compute the state distribution using Equation 4.5b */
    µ_t(s) ← forward_pass(µ_1(s), π_t(a|s), P_t(s'|s,a))
    /* check KL-divergence between p_t(s) and µ_t(s) */
    if D_KL(p_t(s) || µ_t(s)) < threshold then break
    /* interpolate p_t(s) in the direction of µ_t(s) with step size γ */
    p*_t(s) ← interpolate_distribution(p_t(s), µ_t(s), γ)
    /* update Lagrange dual value with Equation 4.6 */
    L*(µ_t, V_t, α_t) ← update_dual(V_1(s), p*_1(s), α_t, ε)
    /* check if the dual reached a lower value */
    if L* < L then L ← L*; p_t(s) ← p*_t(s)
    else γ ← 0.5 γ
Algorithm 2: State-Action Policy Search: Dual Block Descent over V_t(s) and µ_t(s) in Pseudo-Code

4.4.3 Gradient Descent over α

input : T ; /* time horizon */
        P_t(s'|s, a) ; /* linearized dynamics */
        µ_1(s) ; /* initial state distribution */
        q_t(a|s) ; /* last policy */
        q_t(s) ; /* last state distribution */
        M_t, H_t, z_t ; /* reward matrices and goal state */
output: π_t(a|s) ; /* optimal policy */
        V_t(s) ; /* optimal state value function */
        µ_t(s) ; /* optimal state distribution */
initialize α_t ; /* initial guess of α */
/* minimizing the dual by gradient descent */
while L(µ, V, α) not at a minimum do
    /* do block descent to compute V(s) and µ(s) */
    [V_t(s), π_t(a|s), µ_t(s)] ← block_descent(P_t(s'|s, a), q_t(a|s), q_t(s), p_t(s), M_t, H_t, z_t, α_t);
    /* update Lagrange dual value with Equation 4.6 */
    L(µ, V, α) ← update_dual(V_1(s), µ_1(s), α_t, ε);
    /* compute Lagrange dual gradient with respect to α using Equation 4.4c */
    ∂L/∂α_t ← dual_alpha_gradient(µ_t(s), π_t(a|s), q_t(a|s), q_t(s), ε);
    /* update α along the gradient with step λ */
    α_t ← α_t − λ ∂L/∂α_t;

Algorithm 3: State-Action Policy Search: Dual Gradient Descent over α in Pseudo-Code

4.4.4 Block Coordinate Descent

A significant drawback of Algorithm 3 is the computational cost of performing the block descent over V_t(s) and µ_t(s) for every gradient-descent step of α_t. Therefore, we suggest a modified algorithm that implements a different block coordinate descent with respect to V_t(s), µ_t(s) and α_t. By holding the state value function V_t(s) constant while optimizing α_t, and vice versa, we are able to optimize both separately and reduce computation time dramatically. However, this requires us to reconsider the optimality condition of µ_t(s) when optimizing α_t. Thus, when we retake the partial derivative of Equation 4.6 with respect to µ_t(s), we arrive at a different closed-form condition for µ_t(s)

    µ_t(s) = N(V_t(s), V̂_t(s), α_t),    (4.9)

where V̂_t(s) is a term that resembles an α-dependent state value function

    V̂_t(s) = (1/α_t) log ∫ exp( α_t log q_t(s, a) + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ) da.    (4.10)

A full derivation of the coordinate-descent scheme can be found in Appendix B.
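The gradient step over α in Algorithm 3 can be illustrated on a toy problem where the α-regularized policy has a closed form. In this sketch (our own construction, not the thesis implementation) the policy minimizes E_π[c(a)] + α KL(π || q) for a scalar quadratic cost and Gaussian q, so π_α is Gaussian; α is then adapted with a damped multiplicative step, one common way of following the dual gradient (ε − KL) while keeping α positive, until the KL bound is met with equality:

```python
import math

def policy(alpha, m_q=0.0, v_q=1.0, m_star=2.0, v_star=0.5):
    """Closed-form minimizer of E_pi[(a - m*)^2 / (2 v*)] + alpha * KL(pi || q)."""
    prec = 1.0 / v_q + 1.0 / (alpha * v_star)
    mean = (m_q / v_q + m_star / (alpha * v_star)) / prec
    return mean, 1.0 / prec

def kl_gauss(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for scalar Gaussians."""
    return 0.5 * (v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0))

epsilon, alpha = 0.1, 1.0          # KL bound and initial multiplier
for _ in range(500):
    m, v = policy(alpha)
    kl = kl_gauss(m, v, 0.0, 1.0)
    # dual gradient is (epsilon - kl); a damped multiplicative step keeps alpha > 0:
    # alpha grows while the policy still violates the bound, shrinks otherwise
    alpha *= math.exp(0.2 * (kl - epsilon))
m, v = policy(alpha)
```

At the fixed point the policy exactly exhausts its KL budget ε, which is the complementary-slackness behavior the dual gradient step is after; the step size 0.2 and the cost parameters are purely illustrative.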

input : T ; /* time horizon */
        P_t(s'|s, a) ; /* linearized dynamics */
        µ_1(s) ; /* initial state distribution */
        q_t(a|s) ; /* last policy */
        q_t(s) ; /* last state distribution */
        M_t, H_t, z_t ; /* reward matrices and goal state */
output: π_t(a|s) ; /* optimal policy */
        V_t(s) ; /* optimal state value function */
        µ_t(s) ; /* optimal state distribution */
initialize α_t ; /* initial guess of α */
/* minimizing the dual by coordinate descent */
while L(µ, V, α) not at a minimum do
    /* do block descent to compute V(s) and µ(s) */
    [V_t(s), π_t(a|s)] ← block_descent(P_t(s'|s, a), q_t(a|s), q_t(s), p_t(s), M_t, H_t, z_t, α_t);
    /* minimize Lagrange dual with respect to α */
    while L(µ, α) not at a minimum do
        /* compute V̂_t(s) with Equation 4.10 */
        [V̂_t(s), π̂_t(a|s)] ← coord_descent_backward_pass(P_t(s'|s, a), V_t(s), q_t(a|s), q_t(s), M_t, H_t, z_t, α_t);
        /* compute state distribution µ̂_t(s) with Equation 4.9 */
        µ̂_t(s) ← coord_descent_state_distribution(V̂_t(s), V_t(s), α_t);
        /* update Lagrange dual value with Equation 4.3 */
        L(µ, α) ← update_dual(V_t(s), V̂_t(s), µ̂_t(s), α_t, ε);
        /* compute Lagrange dual gradient with respect to α using Equation 4.4c */
        ∂L/∂α_t ← dual_alpha_gradient(µ̂_t(s), π̂_t(a|s), q_t(a|s), q_t(s), ε);
        /* update α along the gradient with step λ */
        α_t ← α_t − λ ∂L/∂α_t;

Algorithm 4: State-Action Policy Search: Dual Coordinate Descent in Pseudo-Code

5 Entropy State-Action Bound Policy Search

The introduction of stochastic policies to the classical Markov Decision Process formulation of Optimal Control poses challenges similar to problems that occur in general Stochastic Search settings (Abdolmaleki et al., 2015). These issues boil down to the problem of exploration vs. exploitation. The stochasticity of a policy adds to the ability of an algorithm to explore the state-action space. The challenge lies in systematically controlling the variance of the policy in a way that allows for exploration but also converges to a mean controller that maximizes the expected reward. Algorithms like Guided Policy Search and State-Action Bound Policy Search can suffer from premature convergence because of the nature of their relative entropy bound. The KL divergence acts on the mean and variance of a distribution and may result in the algorithm opting to greedily maximize its reward by rapidly shrinking the variance while barely exploring in the direction of the mean actions. To counteract this dynamic, we introduce a new constraint on the entropy of the policy that aims to maintain a lower bound of stochasticity and, thus, force exploration in the action space.

5.1 Optimization Problem

The new optimization problem is analogous to that of State-Action Bound Policy Search with the addition of an entropy constraint in Equation 5.1d

    argmax_{π_t(a|s)} Σ_{t=1}^{T−1} ∫∫ R_t(s, a) µ_t(s) π_t(a|s) ds da + ∫ µ_T(s) R_T(s) ds,    (5.1a)

    s.t. ∀t > 1: ∫∫ µ_{t−1}(s) π_{t−1}(a|s) P_{t−1}(s'|s, a) ds da = µ_t(s'),    (5.1b)
         ∀t < T: ∫∫ µ_t(s) π_t(a|s) log [ µ_t(s) π_t(a|s) / q_t(s, a) ] ds da ≤ ε,    (5.1c)
         ∀t < T: −∫ µ_t(s) ∫ π_t(a|s) log π_t(a|s) da ds ≥ δ,    (5.1d)
         ∀t: ∫ π_t(a|s) da = 1,    (5.1e)
         t = 1: µ_1(s) = p_1(s).    (5.1f)

The hyperparameter δ can be chosen, for example, to maintain or increase the variance or entropy of the last policy q_t(a|s) by some factor.

5.2 Dual Problem

Just as in GPS and SAPS, we transform the primal problem to its dual equivalent by solving for π_t(a|s). The introduction of the entropy constraint 5.1d results in a new Lagrangian variable β_t for each time step. A complete derivation of Entropy State-Action Bound Policy Search is in Appendix C

    π_t(a|s) ∝ exp( [ R_t(s, a) + α_t log q_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] / (α_t + β_t) ).    (5.2)
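The text suggests choosing δ relative to the last policy, e.g. to maintain its entropy up to some factor. Since the differential entropy of a Gaussian policy depends only on its covariance, this choice is cheap to compute; the sketch below (dimension and covariance determinant are illustrative assumptions, not values from the thesis) sets δ to 90% of the old policy's entropy:

```python
import math

def gaussian_entropy(cov_det, dim):
    """Differential entropy of N(mu, Sigma): 0.5 * log((2*pi*e)^dim * |Sigma|)."""
    return 0.5 * (dim * math.log(2 * math.pi * math.e) + math.log(cov_det))

dim = 2                                  # action dimension
old_cov_det = 0.25                       # |Sigma| of the last policy q(a|s)
h_old = gaussian_entropy(old_cov_det, dim)
delta = 0.9 * h_old                      # require at least 90% of the old entropy
```

Setting δ this way makes the entropy bound relative, so it tracks the policy as its covariance shrinks over iterations instead of pinning it to an absolute value.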

We substitute π_t(a|s) into the primal problem to get the dual function L(µ, V, α, β)

    L = ∫ µ_T(s) R_T(s) ds + ∫ V_1(s) p_1(s) ds − ∫ V_T(s) µ_T(s) ds − Σ_{t<T} ∫ V_t(s) µ_t(s) ds
        + Σ_t α_t ε + Σ_t β_t δ − Σ_t α_t ∫ µ_t(s) log µ_t(s) ds
        + Σ_t (α_t + β_t) ∫ µ_t(s) log ∫ exp( [ R_t(s, a) + α_t log q_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] / (α_t + β_t) ) da ds.

For dual minimization, we take the partial derivatives of L(µ, V, α, β) and set them to zero to get the optimality conditions of the state value function V_t(s) and state distribution µ_t(s)

    V_t(s) = { R_T(s),  t = T,
             { (α_t + β_t) log ∫ exp( [ α_t log q_t(s, a) − α_t log µ_t(s) − α_t + R_t(s, a) + ∫ V_{t+1}(s') P_t(s'|s, a) ds' ] / (α_t + β_t) ) da,  t < T,    (5.3)

    µ_t(s) = { p_1(s),  t = 1,    (5.4a)
             { ∫∫ π_{t−1}(a|ŝ) µ_{t−1}(ŝ) P_{t−1}(s|ŝ, a) da dŝ,  t > 1,    (5.4b)

    ε = ∫∫ µ_t(s) π_t(a|s) log [ µ_t(s) π_t(a|s) / q_t(s, a) ] ds da,    (5.4c)
    δ = −∫ µ_t(s) ∫ π_t(a|s) log π_t(a|s) da ds.    (5.4d)

By plugging these optimality conditions into Equation 5.3 we get a simplified dual L(µ, V, α, β)

    L(µ, V, α, β) = ∫ V_1(s) µ_1(s) ds + Σ_t α_t (ε + 1) + Σ_t β_t δ.    (5.5)

5.3 Augmented Reward

From Equations 5.4a and 5.4b, it is clear that the reward function r_t(s, a) is similar to that of State-Action Bound Policy Search in Equation 4.7. However, the temperature parameter of the state-action value function Q_t(s, a) and the weighting of the state value function V_t(s) have the added value of β_t

    Q_t(s, a) = 1/(α_t + β_t) ( r_t(s, a) + E_P[V_{t+1}(s')] ),    (5.6a)
    V_t(s) = (α_t + β_t) log ∫ exp( Q_t(s, a) ) da.    (5.6b)

In Appendix C, we give a full derivation of ESAPS under linear-Gaussian dynamics P(s'|s, a) = N(s' | A_t s + B_t a + c_t, Σ_t) and a time-variant quadratic reward R_t(s, a) = −(s − z_t)^T M_t (s − z_t) − a^T H_t a, and show that the resulting value functions Q_t(s, a) and V_t(s) are also quadratic and that the policy π_t(a|s) is a linear-Gaussian distribution.
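The backup in Equation 5.6b is a log-sum-exp with temperature α_t + β_t. The following sketch replaces the integral over actions with a discrete sum (our simplification, with made-up Q values) to show the two regimes: as the temperature goes to zero the backup approaches the greedy maximum of Q, while a larger temperature, e.g. a larger entropy multiplier β, yields a softer value:

```python
import math

def soft_value(q_values, eta):
    """V = eta * log(sum_a exp(Q(a)/eta)), computed stably by shifting out the max."""
    m = max(q_values)
    return m + eta * math.log(sum(math.exp((q - m) / eta) for q in q_values))

q = [1.0, 2.0, 5.0]
v_cold = soft_value(q, 1e-4)   # temperature -> 0: essentially max_a Q
v_warm = soft_value(q, 2.0)    # larger alpha + beta: a softer backup
```

The max-shift keeps the exponentials from overflowing, which matters in practice because α_t + β_t can become small as the dual descent tightens the constraints.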

5.4 Implementation

The implementation of ESAPS is similar in its structure to SAPS with an additional optimization over β_t. Algorithm 5 shows the details of the coordinate-descent scheme.

input : T ; /* time horizon */
        P_t(s'|s, a) ; /* linearized dynamics */
        µ_1(s) ; /* initial state distribution */
        q_t(a|s) ; /* last policy */
        q_t(s) ; /* last state distribution */
        M_t, H_t, z_t ; /* reward matrices and goal state */
output: π_t(a|s) ; /* optimal policy */
        V_t(s) ; /* optimal state value function */
        µ_t(s) ; /* optimal state distribution */
initialize α_t, β_t ; /* initial guess of α, β */
/* minimizing the dual by coordinate descent */
while L(µ, V, α, β) not at a minimum do
    /* do block descent to compute V(s) and µ(s) */
    [V_t(s), π_t(a|s)] ← block_descent(P_t(s'|s, a), q_t(a|s), q_t(s), p_t(s), M_t, H_t, z_t, α_t, β_t);
    /* minimize Lagrange dual with respect to α and β */
    while L(µ, α) not at a minimum do
        /* compute state value function V̂_t(s) */
        [V̂_t(s), π̂_t(a|s)] ← coord_descent_backward_pass(P_t(s'|s, a), V_t(s), q_t(a|s), q_t(s), M_t, H_t, z_t, α_t, β_t);
        /* compute state distribution µ̂_t(s) */
        µ̂_t(s) ← coord_descent_state_distribution(V̂_t(s), V_t(s), α_t, β_t);
        /* update Lagrange dual value with Equation 5.3 */
        L(µ, α, β) ← update_dual(V_t(s), V̂_t(s), µ̂_t(s), α_t, β_t, ε, δ);
        /* compute Lagrange dual gradient with respect to α using Equation 5.4c */
        ∂L/∂α_t ← dual_alpha_gradient(µ̂_t(s), π̂_t(a|s), q_t(a|s), q_t(s), ε);
        /* compute Lagrange dual gradient with respect to β using Equation 5.4d */
        ∂L/∂β_t ← dual_beta_gradient(µ̂_t(s), π̂_t(a|s), δ);
        /* update α and β along the gradients with steps λ and ζ */
        α_t ← α_t − λ ∂L/∂α_t; β_t ← β_t − ζ ∂L/∂β_t;

Algorithm 5: Entropy State-Action Policy Search: Dual Coordinate Descent in Pseudo-Code
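The β update in Algorithm 5 follows the dual gradient of Equation 5.4d, which vanishes when the policy entropy equals δ. The sketch below (a scalar simplification with an illustrative step size; the sign convention is our choice, made so that the update is a descent step on the dual) shows the intended behavior: a collapsed policy variance, i.e. entropy below δ, pushes β up and thereby raises the temperature of the softmax backup, while ample entropy lets β shrink toward zero:

```python
import math

def entropy_gauss(var):
    """Differential entropy of a scalar Gaussian: 0.5 * log(2*pi*e*var)."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def beta_step(beta, policy_var, delta, zeta=0.1):
    # the dual gradient w.r.t. beta vanishes when H(pi) = delta (Equation 5.4d);
    # stepping against (H - delta) raises beta whenever entropy is too low
    grad = entropy_gauss(policy_var) - delta
    return max(beta - zeta * grad, 0.0)   # project back onto beta >= 0

delta = entropy_gauss(0.5)            # demand the entropy of the last policy
b_up = beta_step(1.0, 0.1, delta)     # variance collapsed: entropy < delta
b_down = beta_step(1.0, 0.9, delta)   # entropy above the bound
```

The projection onto β ≥ 0 mirrors the usual treatment of inequality-constraint multipliers: once the entropy bound is inactive, β can decay to zero and the backup reduces to the SAPS temperature α.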

6 Evaluation

6.1 Double Pendulum Task

The double pendulum task is set up with a fully actuated two-link arm under the influence of gravity. The objective of the learner is to perform a full swing-up of the pendulum, starting from the down-right position, and to stabilize the tail of the trajectory around the up-right posture. To make the task harder, we introduce friction at the joints and shift the center of mass toward the end of the second link. Furthermore, we limit the allowed torque by applying a sharp non-linear constraint. The number of samples used for linearization is 25 per iteration.

Figure 6.1.: Double Pendulum Task: The total expected reward of GPS, SAPS and ESAPS in comparison during a swing-up task. Each learner is given 25 iterations per trial to find the best policy. To account for the stochasticity of the setup, 10 trials were performed and averaged. The hyperparameters of each learner were optimized separately to reflect its best performance.

Figure 6.1 shows a direct comparison between GPS, SAPS and ESAPS after independent optimization of the respective hyperparameters. After 25 iterations, GPS reaches the lowest reward and demonstrates the highest rate of oscillation during the last 5 iterations, which is due to the system prematurely running into the torque limits. SAPS and ESAPS both outperform GPS by reaching the same reward level after only half the number of iterations or less.

Figure 6.2.: Double Pendulum Task: The maximum change in the policy for each iteration of GPS, SAPS and ESAPS. GPS has a constant step that is equal to its KL bound. SAPS takes significantly bigger steps while maintaining the upper bound on the state-action distribution. ESAPS is able to take the largest steps due to its ability to maintain a larger variance.

Figure 6.2 illustrates the maximum KL divergence of the policies after each iteration. The results validate our assumption that, by bounding the state-action distribution in SAPS and ESAPS, we are able to take larger steps in the policy space without the risk of leaving the vicinity of the linearized dynamics. Also, by maintaining a significant portion of its entropy, ESAPS is capable of taking larger steps in the direction of the mean actions.

6.2 Quad Pendulum Task

The quad pendulum task is similar to that of the double pendulum, albeit with a much higher complexity in the dynamics. The pendulum is fully actuated and has to be swung up and stabilized in the up-right position. We only specify the end-point of the trajectory for stabilization and forgo the specification of any other via-points. The number of samples used for linearization is 100 per iteration.

Figure 6.3 offers a comparison of the total expected reward of ESAPS against GPS. The hyperparameters of both algorithms were optimized independently. It is clear that ESAPS outperforms GPS by a very large margin, reaching a similar reward level after only 25 iterations, compared to 50 iterations for GPS. A justification for this difference in performance is found in Figure 6.4, which compares the maximum policy steps that both algorithms can take without risking divergence. ESAPS can, at least for some time steps, take steps 6-7 times the size of those of GPS without compromising the integrity of the linearization.

6.3 Discussion

Based on the results we have presented, it is clear that our assumptions have been validated to some extent. In a direct comparison to GPS, we were able to show the impact of bounding the state distribution to preserve the validity of the linearization: it allowed us to execute larger steps in the policy space and to significantly reduce the number of iterations and samples.
Also, the existence of an entropy lower bound has contributed to maintaining exploration and, thus, to reaching better end policies.
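Both evaluations rely on refitting the linear-Gaussian dynamics from rollout samples at every iteration (25 samples on the double pendulum, 100 on the quad pendulum). The fitting procedure is not spelled out here; a common choice, sketched below for a scalar state and action with made-up coefficients and synthetic data, is an ordinary least-squares regression from (s, a) to s', which yields the mean of the model P(s'|s, a) = N(A s + B a + c, Σ):

```python
import random

def fit_linear_dynamics(samples):
    """Fit s' ~ c + A*s + B*a by ordinary least squares (scalar case)."""
    n = len(samples)
    X = [[1.0, s, a] for s, a, _ in samples]
    y = [s_next for _, _, s_next in samples]
    # normal equations G w = g with G = X^T X and g = X^T y
    G = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
    g = [sum(X[i][r] * y[i] for i in range(n)) for r in range(3)]
    for col in range(3):                  # forward elimination (G is SPD here)
        for row in range(col + 1, 3):
            f = G[row][col] / G[col][col]
            for c in range(col, 3):
                G[row][c] -= f * G[col][c]
            g[row] -= f * g[col]
    w = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):                 # back substitution
        w[row] = (g[row] - sum(G[row][c] * w[c] for c in range(row + 1, 3))) / G[row][row]
    return w                              # [c, A, B]

random.seed(0)
true_c, true_A, true_B = 0.5, 0.9, 0.3
data = []
for _ in range(25):                       # 25 transitions, as in the double pendulum setup
    s, a = random.uniform(-2.0, 2.0), random.uniform(-1.0, 1.0)
    s_next = true_c + true_A * s + true_B * a + random.gauss(0.0, 0.01)
    data.append((s, a, s_next))
c_hat, A_hat, B_hat = fit_linear_dynamics(data)
```

With low-noise transitions, 25 samples already recover the local model closely; the residual covariance of such a fit would supply the Σ of the linear-Gaussian dynamics.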

Figure 6.3.: Quad Pendulum Task: The expected reward of GPS and ESAPS. Each learner is given 50 iterations. For a statistical mean of the expected reward, 10 trials were performed and averaged. The hyperparameters of each learner were optimized separately to reflect its best performance. The final result shows ESAPS outperforming GPS significantly.

Figure 6.4.: Quad Pendulum Task: The maximum step in the policy space for each iteration of GPS and ESAPS. The step of GPS, per definition, is constant and equal to its KL bound. ESAPS, however, modulates the maximum step size based on the state-action bound.

7 Future Work

In this chapter we suggest a possible list of improvements and areas of further research, based on the encouraging results we have presented.

7.1 Separate Bounds on State and Action

Our main contribution in this thesis has been the introduction of an upper bound on the change of the state distribution in iterative Stochastic Optimal Control methods. We have chosen to achieve that by bounding the state-action distribution. However, it is conceivable that two separate bounds, one on the policy and one on the state distribution, may carry some advantages, such as being able to set independent upper or lower bounds on the policy change.

7.2 Comparison to Full Gradient Descent

In our derivations we have shown that we are able to compute the optimal value function and state distribution in closed form based on two optimality conditions from the partial derivatives of the dual. This formulation reduces the minimization of the dual function to a gradient descent problem over the Lagrangian multipliers associated with the relative entropy constraints. In the future we plan to analyze the possibility of applying a full gradient descent on the value function and state distribution and comparing its search direction to that of the optimality conditions.

7.3 Principled Control of Policy Entropy

By adding the entropy constraint on the policy in Entropy State-Action Bound Policy Search, we were able to prevent the decay of the policy variance, allowing us to explore the action space for a larger number of iterations. A possible extension is the introduction of some heuristic that would not only maintain the variance but also increase it. Such an ability to manipulate the entropy would help in escaping shallow local minima that might result from a sub-optimal initialization of the policy.

7.4 Reformulation for Deterministic Policies

By concentrating on the formulation of Guided Policy Search, we are limited to a class of algorithms that try to optimize a stochastic policy. However, the original formulation of the Markov Decision Process does not necessarily require such a policy. In fact, it states that the optimal policy is a deterministic controller. Based on this insight, it may be interesting to reformulate the problem along the lines of Differential Dynamic Programming and Iterative Linear Quadratic Gaussian control and to explore equivalent regularizations that correspond to what we have introduced in this thesis.
7.5 Further Evaluation on Larger and Real Systems

Although our results are promising, further comparisons to other state-of-the-art algorithms are still needed for a stronger validation. Also, the application to high-dimensional and real systems would help us understand the scalability of the computation time and the feasibility with regard to the number of samples.

8 Conclusion

Stochastic Optimal Control with linearized dynamics is a powerful technique for learning optimal control policies of highly non-linear systems. In this thesis we have investigated and introduced several variations of state-of-the-art algorithms in this field.

In our introduction we discussed a major issue in this class of algorithms, which is its dependency on the validity of the model around the linearization point. Hence, it is crucial to provide guarantees that prevent a greedy exploitation of the local dynamics.

In Chapter 3, we went on to analyze a recent approach, Guided Policy Search, that addresses this issue by enforcing a relative entropy bound on the trajectory distribution between iterations. We succeeded in reformulating GPS and were able to show that its proposed constraint is equivalent to bounding the policy update at each time step. We have also argued that such an approach only implicitly bounds the state distribution around which the system is linearized. Thus, to avoid divergence on highly dynamical systems, the algorithm is limited to very small updates of the policy, which in turn increases the number of iterations and samples needed.

In Chapter 4, relying on these insights, we proposed a new constraint that explicitly imposes a relative entropy bound on the state distribution by bounding the state-action distribution instead of the policy. This addition resulted in a number of new algorithmic challenges. The main issue was the emergence of a new reward term that encodes the distance between the current and last state distributions, which led to a circular dependency between the value function and the state distribution; we were able to solve it by applying a block-coordinate-descent scheme.

By concentrating on a class of algorithms that requires a stochastic policy, and due to the nature of the relative entropy bounds we have introduced, we were inevitably confronted with the problem of a trade-off between exploration and exploitation. We addressed this issue, in Chapter 5, through an additional constraint on the differential entropy of the policy, thus allowing us to control the stochasticity of the policy as the algorithm advances after each iteration.

As a proof of concept of our contributions, we have compared our algorithms with GPS by performing swing-up tasks on the highly non-linear double and quad pendulums.
The results validate our view that a bound on the state-action distribution allows for more aggressive updates of the policy, while setting an upper bound on the divergence of the state distribution. Finally, we have discussed ways to improve and extend our contributions, such as introducing separate bounds on the state and action, developing a principled approach for manipulating the entropy of the policy, and performing evaluations on higher-dimensional and real systems.


More information

6.8 Laplace Transform: General Formulas

6.8 Laplace Transform: General Formulas 48 HAP. 6 Laplace Tranform 6.8 Laplace Tranform: General Formula Formula Name, ommen Sec. F() l{ f ()} e f () d f () l {F()} Definiion of Tranform Invere Tranform 6. l{af () bg()} al{f ()} bl{g()} Lineariy

More information

Math Week 12 continue ; also cover parts of , EP 7.6 Mon Nov 14

Math Week 12 continue ; also cover parts of , EP 7.6 Mon Nov 14 Mh 225-4 Week 2 coninue.-.3; lo cover pr of.4-.5, EP 7.6 Mon Nov 4.-.3 Lplce rnform, nd pplicion o DE IVP, epecilly hoe in Chper 5. Tody we'll coninue (from l Wednedy) o fill in he Lplce rnform ble (on

More information

MATH 124 AND 125 FINAL EXAM REVIEW PACKET (Revised spring 2008)

MATH 124 AND 125 FINAL EXAM REVIEW PACKET (Revised spring 2008) MATH 14 AND 15 FINAL EXAM REVIEW PACKET (Revised spring 8) The following quesions cn be used s review for Mh 14/ 15 These quesions re no cul smples of quesions h will pper on he finl em, bu hey will provide

More information

Magnetostatics Bar Magnet. Magnetostatics Oersted s Experiment

Magnetostatics Bar Magnet. Magnetostatics Oersted s Experiment Mgneosics Br Mgne As fr bck s 4500 yers go, he Chinese discovered h cerin ypes of iron ore could rc ech oher nd cerin mels. Iron filings "mp" of br mgne s field Crefully suspended slivers of his mel were

More information

Properties of Logarithms. Solving Exponential and Logarithmic Equations. Properties of Logarithms. Properties of Logarithms. ( x)

Properties of Logarithms. Solving Exponential and Logarithmic Equations. Properties of Logarithms. Properties of Logarithms. ( x) Properies of Logrihms Solving Eponenil nd Logrihmic Equions Properies of Logrihms Produc Rule ( ) log mn = log m + log n ( ) log = log + log Properies of Logrihms Quoien Rule log m = logm logn n log7 =

More information

Admin MAX FLOW APPLICATIONS. Flow graph/networks. Flow constraints 4/30/13. CS lunch today Grading. in-flow = out-flow for every vertex (except s, t)

Admin MAX FLOW APPLICATIONS. Flow graph/networks. Flow constraints 4/30/13. CS lunch today Grading. in-flow = out-flow for every vertex (except s, t) /0/ dmin lunch oday rading MX LOW PPLIION 0, pring avid Kauchak low graph/nework low nework direced, weighed graph (V, ) poiive edge weigh indicaing he capaciy (generally, aume ineger) conain a ingle ource

More information

CS4445/9544 Analysis of Algorithms II Solution for Assignment 1

CS4445/9544 Analysis of Algorithms II Solution for Assignment 1 Conider he following flow nework CS444/944 Analyi of Algorihm II Soluion for Aignmen (0 mark) In he following nework a minimum cu ha capaciy 0 Eiher prove ha hi aemen i rue, or how ha i i fale Uing he

More information

A new model for limit order book dynamics

A new model for limit order book dynamics Anewmodelforlimiorderbookdynmics JeffreyR.Russell UniversiyofChicgo,GrdueSchoolofBusiness TejinKim UniversiyofChicgo,DeprmenofSisics Absrc:Thispperproposesnewmodelforlimiorderbookdynmics.Thelimiorderbookconsiss

More information

A continuous-time approach to constraint satisfaction: Optimization hardness as transient chaos

A continuous-time approach to constraint satisfaction: Optimization hardness as transient chaos A coninuou-ime pproch o conrin ifcion: Opimizion hrdne rnien cho PN-II-RU-TE--- Finl Sineic Repor Generl im nd objecive of he projec Conrin ifcion problem (uch Boolen ifibiliy) coniue one of he hrde cle

More information

Applications of Prüfer Transformations in the Theory of Ordinary Differential Equations

Applications of Prüfer Transformations in the Theory of Ordinary Differential Equations Irih Mh. Soc. Bullein 63 (2009), 11 31 11 Applicion of Prüfer Trnformion in he Theory of Ordinry Differenil Equion GEORGE CHAILOS Abrc. Thi ricle i review ricle on he ue of Prüfer Trnformion echnique in

More information

Reinforcement Learning. Markov Decision Processes

Reinforcement Learning. Markov Decision Processes einforcemen Lerning Mrkov Decision rocesses Mnfred Huber 2014 1 equenil Decision Mking N-rmed bi problems re no good wy o model sequenil decision problem Only dels wih sic decision sequences Could be miiged

More information

Exponential Decay for Nonlinear Damped Equation of Suspended String

Exponential Decay for Nonlinear Damped Equation of Suspended String 9 Inernionl Symoium on Comuing, Communicion, nd Conrol (ISCCC 9) Proc of CSIT vol () () IACSIT Pre, Singore Eonenil Decy for Nonliner Dmed Equion of Suended Sring Jiong Kemuwn Dermen of Mhemic, Fculy of

More information

Temperature Rise of the Earth

Temperature Rise of the Earth Avilble online www.sciencedirec.com ScienceDirec Procedi - Socil nd Behviorl Scien ce s 88 ( 2013 ) 220 224 Socil nd Behviorl Sciences Symposium, 4 h Inernionl Science, Socil Science, Engineering nd Energy

More information

Making Complex Decisions Markov Decision Processes. Making Complex Decisions: Markov Decision Problem

Making Complex Decisions Markov Decision Processes. Making Complex Decisions: Markov Decision Problem Mking Comple Decisions Mrkov Decision Processes Vsn Honvr Bioinformics nd Compuionl Biology Progrm Cener for Compuionl Inelligence, Lerning, & Discovery honvr@cs.ise.edu www.cs.ise.edu/~honvr/ www.cild.ise.edu/

More information

Price Discrimination

Price Discrimination My 0 Price Dicriminion. Direc rice dicriminion. Direc Price Dicriminion uing wo r ricing 3. Indirec Price Dicriminion wih wo r ricing 4. Oiml indirec rice dicriminion 5. Key Inigh ge . Direc Price Dicriminion

More information

Research Article The General Solution of Differential Equations with Caputo-Hadamard Fractional Derivatives and Noninstantaneous Impulses

Research Article The General Solution of Differential Equations with Caputo-Hadamard Fractional Derivatives and Noninstantaneous Impulses Hindwi Advnce in Mhemicl Phyic Volume 207, Aricle ID 309473, pge hp://doi.org/0.55/207/309473 Reerch Aricle The Generl Soluion of Differenil Equion wih Cpuo-Hdmrd Frcionl Derivive nd Noninnneou Impule

More information

Maximum Flow. Flow Graph

Maximum Flow. Flow Graph Mximum Flow Chper 26 Flow Grph A ommon enrio i o ue grph o repreen flow nework nd ue i o nwer queion ou meril flow Flow i he re h meril move hrough he nework Eh direed edge i ondui for he meril wih ome

More information

u(t) Figure 1. Open loop control system

u(t) Figure 1. Open loop control system Open loop conrol v cloed loop feedbac conrol The nex wo figure preen he rucure of open loop and feedbac conrol yem Figure how an open loop conrol yem whoe funcion i o caue he oupu y o follow he reference

More information

Some basic notation and terminology. Deterministic Finite Automata. COMP218: Decision, Computation and Language Note 1

Some basic notation and terminology. Deterministic Finite Automata. COMP218: Decision, Computation and Language Note 1 COMP28: Decision, Compuion nd Lnguge Noe These noes re inended minly s supplemen o he lecures nd exooks; hey will e useful for reminders ou noion nd erminology. Some sic noion nd erminology An lphe is

More information

Reinforcement Learning

Reinforcement Learning Reiforceme Corol lerig Corol polices h choose opiml cios Q lerig Covergece Chper 13 Reiforceme 1 Corol Cosider lerig o choose cios, e.g., Robo lerig o dock o bery chrger o choose cios o opimize fcory oupu

More information

PHYSICS 1210 Exam 1 University of Wyoming 14 February points

PHYSICS 1210 Exam 1 University of Wyoming 14 February points PHYSICS 1210 Em 1 Uniersiy of Wyoming 14 Februry 2013 150 poins This es is open-noe nd closed-book. Clculors re permied bu compuers re no. No collborion, consulion, or communicion wih oher people (oher

More information

CHAPTER 7: SECOND-ORDER CIRCUITS

CHAPTER 7: SECOND-ORDER CIRCUITS EEE5: CI RCUI T THEORY CHAPTER 7: SECOND-ORDER CIRCUITS 7. Inroducion Thi chaper conider circui wih wo orage elemen. Known a econd-order circui becaue heir repone are decribed by differenial equaion ha

More information

The Finite Element Method for the Analysis of Non-Linear and Dynamic Systems

The Finite Element Method for the Analysis of Non-Linear and Dynamic Systems Swiss Federl Insiue of Pge 1 The Finie Elemen Mehod for he Anlysis of Non-Liner nd Dynmic Sysems Prof. Dr. Michel Hvbro Fber Dr. Nebojs Mojsilovic Swiss Federl Insiue of ETH Zurich, Swizerlnd Mehod of

More information

1 jordan.mcd Eigenvalue-eigenvector approach to solving first order ODEs. -- Jordan normal (canonical) form. Instructor: Nam Sun Wang

1 jordan.mcd Eigenvalue-eigenvector approach to solving first order ODEs. -- Jordan normal (canonical) form. Instructor: Nam Sun Wang jordnmcd Eigenvlue-eigenvecor pproch o solving firs order ODEs -- ordn norml (cnonicl) form Insrucor: Nm Sun Wng Consider he following se of coupled firs order ODEs d d x x 5 x x d d x d d x x x 5 x x

More information

(b) 10 yr. (b) 13 m. 1.6 m s, m s m s (c) 13.1 s. 32. (a) 20.0 s (b) No, the minimum distance to stop = 1.00 km. 1.

(b) 10 yr. (b) 13 m. 1.6 m s, m s m s (c) 13.1 s. 32. (a) 20.0 s (b) No, the minimum distance to stop = 1.00 km. 1. Answers o Een Numbered Problems Chper. () 7 m s, 6 m s (b) 8 5 yr 4.. m ih 6. () 5. m s (b).5 m s (c).5 m s (d) 3.33 m s (e) 8. ().3 min (b) 64 mi..3 h. ().3 s (b) 3 m 4..8 mi wes of he flgpole 6. (b)

More information

2D Motion WS. A horizontally launched projectile s initial vertical velocity is zero. Solve the following problems with this information.

2D Motion WS. A horizontally launched projectile s initial vertical velocity is zero. Solve the following problems with this information. Nme D Moion WS The equions of moion h rele o projeciles were discussed in he Projecile Moion Anlsis Acii. ou found h projecile moes wih consn eloci in he horizonl direcion nd consn ccelerion in he ericl

More information

EXISTENCE AND UNIQUENESS OF SOLUTIONS FOR A SECOND-ORDER ITERATIVE BOUNDARY-VALUE PROBLEM

EXISTENCE AND UNIQUENESS OF SOLUTIONS FOR A SECOND-ORDER ITERATIVE BOUNDARY-VALUE PROBLEM Elecronic Journl of Differenil Equions, Vol. 208 (208), No. 50, pp. 6. ISSN: 072-669. URL: hp://ejde.mh.xse.edu or hp://ejde.mh.un.edu EXISTENCE AND UNIQUENESS OF SOLUTIONS FOR A SECOND-ORDER ITERATIVE

More information

Flow Networks. Ma/CS 6a. Class 14: Flow Exercises

Flow Networks. Ma/CS 6a. Class 14: Flow Exercises 0/0/206 Ma/CS 6a Cla 4: Flow Exercie Flow Nework A flow nework i a digraph G = V, E, ogeher wih a ource verex V, a ink verex V, and a capaciy funcion c: E N. Capaciy Source 7 a b c d e Sink 0/0/206 Flow

More information

Procedia Computer Science

Procedia Computer Science Procedi Compuer Science 00 (0) 000 000 Procedi Compuer Science www.elsevier.com/loce/procedi The Third Informion Sysems Inernionl Conference The Exisence of Polynomil Soluion of he Nonliner Dynmicl Sysems

More information

Reminder: Flow Networks

Reminder: Flow Networks 0/0/204 Ma/CS 6a Cla 4: Variou (Flow) Execie Reminder: Flow Nework A flow nework i a digraph G = V, E, ogeher wih a ource verex V, a ink verex V, and a capaciy funcion c: E N. Capaciy Source 7 a b c d

More information

S Radio transmission and network access Exercise 1-2

S Radio transmission and network access Exercise 1-2 S-7.330 Rdio rnsmission nd nework ccess Exercise 1 - P1 In four-symbol digil sysem wih eqully probble symbols he pulses in he figure re used in rnsmission over AWGN-chnnel. s () s () s () s () 1 3 4 )

More information

DC Miniature Solenoids KLM Varioline

DC Miniature Solenoids KLM Varioline DC Miniure Solenoi KLM Vrioline DC Miniure Solenoi Type KLM Deign: Single roke olenoi pulling n puhing, oule roke n invere roke ype. Snr: Zinc ple (opionl: pine / nickel ple) Fixing: Cenrl or flnge mouning.

More information

Physic 231 Lecture 4. Mi it ftd l t. Main points of today s lecture: Example: addition of velocities Trajectories of objects in 2 = =

Physic 231 Lecture 4. Mi it ftd l t. Main points of today s lecture: Example: addition of velocities Trajectories of objects in 2 = = Mi i fd l Phsic 3 Lecure 4 Min poins of od s lecure: Emple: ddiion of elociies Trjecories of objecs in dimensions: dimensions: g 9.8m/s downwrds ( ) g o g g Emple: A foobll pler runs he pern gien in he

More information

Solutions to Problems from Chapter 2

Solutions to Problems from Chapter 2 Soluions o Problems rom Chper Problem. The signls u() :5sgn(), u () :5sgn(), nd u h () :5sgn() re ploed respecively in Figures.,b,c. Noe h u h () :5sgn() :5; 8 including, bu u () :5sgn() is undeined..5

More information

Chapter Introduction. 2. Linear Combinations [4.1]

Chapter Introduction. 2. Linear Combinations [4.1] Chper 4 Inrouion Thi hper i ou generlizing he onep you lerne in hper o pe oher n hn R Mny opi in hi hper re heoreil n MATLAB will no e le o help you ou You will ee where MATLAB i ueful in hper 4 n how

More information

Image-based localization for mobile robots in dynamic environments

Image-based localization for mobile robots in dynamic environments Univeriy of Pdu Fculy of Engineering Imge-bed loclizion for mobile robo in dynmic environmen Supervior: Prof. Enrico Pgello Co-upervior: Prof. Sefn Wermer Suden: Nicol Belloo Lure in ELECTRONIC ENGINEERING

More information

Discussion Session 2 Constant Acceleration/Relative Motion Week 03

Discussion Session 2 Constant Acceleration/Relative Motion Week 03 PHYS 100 Dicuion Seion Conan Acceleraion/Relaive Moion Week 03 The Plan Today you will work wih your group explore he idea of reference frame (i.e. relaive moion) and moion wih conan acceleraion. You ll

More information

Max-flow and min-cut

Max-flow and min-cut Mx-flow nd min-cu Mx-Flow nd Min-Cu Two imporn lgorihmic prolem, which yield euiful duliy Myrid of non-rivil pplicion, i ply n imporn role in he opimizion of mny prolem: Nework conneciviy, irline chedule

More information

A LIMIT-POINT CRITERION FOR A SECOND-ORDER LINEAR DIFFERENTIAL OPERATOR IAN KNOWLES

A LIMIT-POINT CRITERION FOR A SECOND-ORDER LINEAR DIFFERENTIAL OPERATOR IAN KNOWLES A LIMIT-POINT CRITERION FOR A SECOND-ORDER LINEAR DIFFERENTIAL OPERATOR j IAN KNOWLES 1. Inroducion Consider he forml differenil operor T defined by el, (1) where he funcion q{) is rel-vlued nd loclly

More information

T-Match: Matching Techniques For Driving Yagi-Uda Antennas: T-Match. 2a s. Z in. (Sections 9.5 & 9.7 of Balanis)

T-Match: Matching Techniques For Driving Yagi-Uda Antennas: T-Match. 2a s. Z in. (Sections 9.5 & 9.7 of Balanis) 3/0/018 _mch.doc Pge 1 of 6 T-Mch: Mching Techniques For Driving Ygi-Ud Anenns: T-Mch (Secions 9.5 & 9.7 of Blnis) l s l / l / in The T-Mch is shun-mching echnique h cn be used o feed he driven elemen

More information

DEVELOPMENT OF A DISCRETE-TIME AERODYNAMIC MODEL FOR CFD- BASED AEROELASTIC ANALYSIS

DEVELOPMENT OF A DISCRETE-TIME AERODYNAMIC MODEL FOR CFD- BASED AEROELASTIC ANALYSIS AIAA-99-765 DEVELOPENT OF A DISCRETE-TIE AERODYNAIC ODEL FOR CFD- BASED AEROELASTIC ANALYSIS Timohy J. Cown * nd Andrew S. Aren, Jr. echnicl nd Aeropce Engineering Deprmen Oklhom Se Univeriy Sillwer, OK

More information

SOME USEFUL MATHEMATICS

SOME USEFUL MATHEMATICS SOME USEFU MAHEMAICS SOME USEFU MAHEMAICS I is esy o mesure n preic he behvior of n elecricl circui h conins only c volges n currens. However, mos useful elecricl signls h crry informion vry wih ime. Since

More information

P441 Analytical Mechanics - I. Coupled Oscillators. c Alex R. Dzierba

P441 Analytical Mechanics - I. Coupled Oscillators. c Alex R. Dzierba Lecure 3 Mondy - Deceber 5, 005 Wrien or ls upded: Deceber 3, 005 P44 Anlyicl Mechnics - I oupled Oscillors c Alex R. Dzierb oupled oscillors - rix echnique In Figure we show n exple of wo coupled oscillors,

More information

Graduate Algorithms CS F-18 Flow Networks

Graduate Algorithms CS F-18 Flow Networks Grue Algorihm CS673-2016F-18 Flow Nework Dvi Glle Deprmen of Compuer Siene Univeriy of Sn Frnio 18-0: Flow Nework Diree Grph G Eh ege weigh i piy Amoun of wer/eon h n flow hrough pipe, for inne Single

More information

Convergence of Singular Integral Operators in Weighted Lebesgue Spaces

Convergence of Singular Integral Operators in Weighted Lebesgue Spaces EUROPEAN JOURNAL OF PURE AND APPLIED MATHEMATICS Vol. 10, No. 2, 2017, 335-347 ISSN 1307-5543 www.ejpm.com Published by New York Business Globl Convergence of Singulr Inegrl Operors in Weighed Lebesgue

More information

Chapter 7: Inverse-Response Systems

Chapter 7: Inverse-Response Systems Chaper 7: Invere-Repone Syem Normal Syem Invere-Repone Syem Baic Sar ou in he wrong direcion End up in he original eady-ae gain value Two or more yem wih differen magniude and cale in parallel Main yem

More information

Max-flow and min-cut

Max-flow and min-cut Mx-flow nd min-cu Mx-Flow nd Min-Cu Two imporn lgorihmic prolem, which yield euiful duliy Myrid of non-rivil pplicion, i ply n imporn role in he opimizion of mny prolem: Nework conneciviy, irline chedule

More information