Feature Extraction for Inverse Reinforcement Learning


Feature Extraction for Inverse Reinforcement Learning
(Feature-Extraktion für Inverse Reinforcement Learning)
Master's thesis by Oleg Arenz from Wiesbaden
Date of submission:
1st Review:  2nd Review:  3rd Review:

Declaration on the Master's Thesis

I hereby declare that I have written the present Master's thesis without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. This thesis has not been submitted to any examination authority in the same or a similar form before.

Darmstadt, December 17, 2014

(Oleg Arenz)

Abstract

Equipping an agent with the ability to infer the intentions behind observed behavior is a prerequisite for creating truly autonomous robots. By deducing the purpose of the actions of other agents, the robot would be able to react in a sensible way and, furthermore, to imitate the strategy on a high level. Inverse Reinforcement Learning (IRL) can be applied to learn a reward function that is consistent with the observed behavior and is, thus, a step towards this overall goal. Some strategies can only be modeled properly by an underlying time-dependent reward function. Maximum Causal Entropy Inverse Reinforcement Learning (MaxCausalEnt-IRL) is a method that can be applied to learn such non-stationary functions. However, it depends on gradient-based optimization and its performance can therefore suffer if too many parameters have to be learned. This can be problematic, since the number of parameters increases significantly if separate reward functions are learned for each time step. Furthermore, since only a few time steps might be relevant for the observed task, a second challenge of applying IRL for learning non-stationary reward functions consists in properly extracting such sparseness. This thesis investigates how to meet these practical requirements. A novel approach, IRL-MSD, is developed for that purpose. Unlike some previous IRL methods, Inverse Reinforcement Learning by Matching State Distributions (IRL-MSD) does not aim to match feature counts but instead learns a reward function by matching state distributions. This approach has two interesting properties. Firstly, the features do not have to be defined explicitly but arise naturally from the structure of the observed state distributions. Secondly, it does not require gradient-based optimization. The experiments show that it converges faster than existing IRL methods and properly recovers the goals of sparse reward functions. Therefore, IRL-MSD suggests itself for future research.

Acknowledgments

I would like to thank Prof. Dr. Jan Peters for founding the Intelligent Autonomous Systems lab at TU Darmstadt. The IAS concentrates a lot of expertise in the field of robot learning from which I could benefit. I also want to thank my supervisors Prof. Dr. Gerhard Neumann and M.Sc. Christian Daniel for the countless hours they invested during our regular meetings. I learned a lot from you during my work on this thesis. I especially want to thank Prof. Dr. Gerhard Neumann, who came up with the optimization problem for IRL-MSD. I am grateful for having gotten the opportunity to work on this method.

Contents

1. Introduction
   1.1. Thesis Statement
   1.2. Motivation for Imitation Learning by Inverse Reinforcement Learning
   1.3. Motivation for Learning Time-Dependent Reward Functions
   1.4. Preliminaries
        1.4.1. Information Theory
        1.4.2. Markov Decision Process
        1.4.3. Optimal Control
   1.5. Outline
2. Related Work
   2.1. Learning From Demonstration
        2.1.1. Perceiving Demonstrations
        2.1.2. Imitation Learning
        2.1.3. Inverse Reinforcement Learning
   2.2. Relative Entropy Policy Search
3. Maximum Causal Entropy Inverse Reinforcement Learning
   3.1. Insights
   3.2. Implementation
        3.2.1. Learning Non-Stationary Reward Functions
        3.2.2. Towards a Compact Representation Using Radial Basis Functions
        3.2.3. Ensuring Positive Semi-Definite Reward Functions
        3.2.4. Adapting the Stepsize Based on the Dual Function
4. Inverse Reinforcement Learning by Matching State Distributions
   4.1. The Optimization Problem
   4.2. The Proposed Algorithm
        4.2.1. Using the Policy KL for Learning Action Costs
        4.2.2. Using the Policy KL for Regularization
        4.2.3. Ensuring Positive Definite Reward Functions
        4.2.4. Disregarding Features by Matching Marginals
        4.2.5. A Pseudo-code Implementation of IRL-MSD
5. Evaluation
   5.1. Assuming Known Statistics of the Expert
        5.1.1. IRL by Matching State Distributions
        5.1.2. Comparison with Maximum Causal Entropy IRL
   5.2. Learning Based on Samples
        5.2.1. IRL by Matching State Distributions
        5.2.2. Comparison with Maximum Causal Entropy IRL
   5.3. Discussion

6. Future Work
   Learning Context Dependent Reward Functions
   Dropping the LQG Assumption
   Bounding the Policy-KL
   Learning State Independent Action Costs
7. Conclusion

References

A. Derivations for Maximum Causal Entropy Inverse Reinforcement Learning
B. Derivations for Inverse Reinforcement Learning by Matching State Distributions
C. The Difference between Ṽ^π(s) and V^π(s)
   C.1. Derivation of V^π(s)
   C.2. Derivation of Ṽ^π(s)
   C.3. Comparison of V^π(s) and Ṽ^π(s)

Figures, Tables and Algorithms

List of Figures
1.1. Motivating Example
First Dimension of Expert State Distribution in Task Space
Expected Reward of IRL-MSD for Known Expert Statistics
First Dimension of MSD State Distribution in Task Space
Comparison of MSD and MaxCausalEnt for Known Statistics
First Dimension of MaxCausalEnt State Distribution in Task Space
Plot of Learned Reward Parameters
Expected Reward of MSD-ACL Based on 7 Samples
Comparison of Maximum Causal Entropy IRL and IRL-MSD Based on 7 Samples
Plot of Learned Reward Parameters Based on 7 Samples

List of Tables
5.1. Learned Goal Positions for Known Expert Statistics
Expected Reward for MSD-REG Based on 7 Samples
Learned Goal Positions Based on 7 Samples

List of Algorithms
1. Pseudo-code Implementation of IRL-MSD

Abbreviations and Symbols

List of Abbreviations
DMP             Dynamic Movement Primitive
IRL             Inverse Reinforcement Learning
IRL-MSD         Inverse Reinforcement Learning by Matching State Distributions
KL              Kullback-Leibler divergence
LfD             Learning from Demonstration
LQG             Linear Quadratic Gaussian
MAP             Maximum-A-Posteriori
MaxCausalEnt-IRL  Maximum Causal Entropy Inverse Reinforcement Learning
MaxEnt-IRL      Maximum Entropy Inverse Reinforcement Learning
MCMC            Markov-Chain-Monte-Carlo
MDP             Markov Decision Process
MSD-ACL         MSD with Action Cost Learning
MSD-REG         MSD with KL-based Regularization
PbD             Programming by Demonstration
PoWER           Policy Learning by Weighting Exploration with the Returns
ProMP           Probabilistic Movement Primitive
RBF             Radial Basis Function
REPS            Relative Entropy Policy Search
RL              Reinforcement Learning

1 Introduction

Autonomous robots need to adapt their planning based on observations of their environment. Hence, when observing other agents, the autonomous robot should decide how to react to the observed behavior. For example, if during a collaborative assembly task one agent picks up a screw, a sensible reaction of a second agent might consist in handing over the screwdriver. Especially if the environment is not well-defined, a meaningful reaction to the observed behavior might only be possible by inferring the intention of the other agent. Learning the intention behind an observed behavior is also helpful for Imitation Learning. Imitation Learning (Schaal, 1999; Argall et al., 2009) allows an agent to be taught how to perform a task by providing demonstrations. By inferring the goal of the demonstrations, the agent learns a task representation that is generalizable and not affected by different embodiments and system dynamics. IRL (Ng et al., 2000) is a step towards learning the intention of an observed behavior by learning a reward function from expert demonstrations. Reward functions define task goals by rating actions and states with respect to these goals and hence can represent intentions. This representation is suitable for Imitation Learning, because the agent can learn to perform the desired task by applying Reinforcement Learning (RL) (Sutton and Barto, 1998).

1.1 Thesis Statement

The aim of this thesis is to apply Inverse Reinforcement Learning in order to learn time-dependent reward functions. Furthermore, it is investigated how well the sparseness can be recovered if the true reward function of the expert is sparse in time. MaxCausalEnt-IRL (Ziebart et al., 2010) is an IRL approach that can properly treat situations where the agent adapted its plan based on information that has been revealed to it during execution. This method therefore suggests itself for learning time-dependent reward functions. MaxCausalEnt-IRL applies gradient descent for learning the parameters of the reward function. Therefore, the speed of convergence can suffer if many parameters have to be optimized. A novel approach, IRL-MSD, is developed in order to avoid gradient-based optimization. This method does not require an explicit definition of the features, but chooses them automatically based on the structure of the target distributions.

1.2 Motivation for Imitation Learning by Inverse Reinforcement Learning

Teaching a task by combining IRL and RL seems laborious because it involves two steps of learning. Hence, it could be argued that a more direct approach should be preferred by (1) learning the policy directly from the demonstrations or (2) by providing the reward function directly. However, both of these approaches have limitations that shall be discussed in the following. Imitating movements from demonstrations without learning a reward function is usually achieved by learning a parameterized target policy or target trajectory using regression. Learning the policy directly was used by Sammut et al. (2002) to create an autopilot for a flight simulator. However, such approaches can suffer from the correspondence problem, fail if the dynamics change and can not be generalized. Learning target trajectories is more promising and was successfully applied on a humanoid robot to learn a forehand tennis swing (Ijspeert et al., 2003).

Nevertheless, learning a trajectory is still less flexible than learning a reward function, which becomes evident when obstacles occur along the desired path. Reinforcement Learning on a provided reward function has been used with great success, e.g. for learning fast quadrupedal walking (Kohl and Stone, 2004). However, defining an appropriate reward function can be cumbersome and difficult even for experts. For example, when assessing how well a dishwasher has been loaded, several different features have to be taken into account (e.g. for describing the types of the dishes as well as their positions and orientations) with respect to several different subgoals, e.g. efficient usage of space and high expected cleanness. As it is usually not obvious which subgoals are relevant and how they have to be weighed against each other, defining an appropriate reward function requires experience and often involves hand-tuning by trial and error. In contrast to that, demonstrations can often be provided substantially more easily and even by non-experts.

1.3 Motivation for Learning Time-Dependent Reward Functions

Figure 1.1.: Motivating Example. The demonstrations can be explained with a time-dependent policy based on the only observed feature φ, plotted over time.

Learning a single, stationary (time-independent) reward function can be inappropriate for learning a task that decomposes into several subtasks. For example, assume that the demonstrations of an expert led to the trajectories depicted in Figure 1.1, where φ shall be some arbitrary feature plotted over time. The demonstrated behavior can be explained using a non-stationary reward function that rewards the agent for achieving φ = 1 at time step 25 and φ = 0 at time step 50. Even if both subtasks served an overall task and could be explained using a stationary reward function, that reward function might possess higher complexity than the non-stationary one and would depend on additional, unobserved features.

1.4 Preliminaries

This section provides background information on Information Theory, Markov Decision Processes and Optimal Control Theory.

1.4.1 Information Theory

Entropy
The entropy (Shannon, 2001) H(P) of a discrete probability distribution P(X) measures the uncertainty of that distribution and, thus, also the amount of encoded information. It is defined as

H(X) = -∑_x P(x) log P(x).

This thesis will make use of several related entropy measures for continuous probability distributions that are defined as follows. The continuous entropy of a probability distribution p(s), denoted H(S), is defined as

H(S) = -∫_s p(s) log p(s) ds.

If an agent chooses its actions according to a time-dependent policy, the entropy of that distribution should only take into account the information that has already been revealed to the agent. The entropy is then denoted as causal entropy.

Causal Entropy
For a continuous, conditional distribution p(a|s), the conditional entropy H(A|S) is defined as the expected continuous entropy of the conditional distribution, i.e.,

H(A|S) = -∫_s p(s) ∫_a p(a|s) log p(a|s) da ds.

Let the conditional distribution p(a_1, ..., a_T | s_1, ..., s_T) denote the probability of observing a_t given the states at each discrete time step t ∈ [1, T]. If this distribution only depends on past observations and, thus, marginalizes to p(a_t | s_1, ..., s_T) = p(a_t | s_1, ..., s_t), it will be referred to as causally conditioned on s. The entropy of such a causally conditioned distribution is then denoted as causal entropy and defined as

H(A||S) = -∑_{t=1}^{T} ∫_{s_1,...,s_t} p(s_1, ..., s_t) ∫_{a_t} p(a_t | s_1, ..., s_t) log p(a_t | s_1, ..., s_t) da_t ds_1 ... ds_t.

For the special case where p(a_1, ..., a_T | s_1, ..., s_T) = ∏_{t=1}^{T} p_t(a_t | s_t), the causal entropy simplifies to

H(A||S) = -∑_{t=1}^{T} ∫_{s_t} ∫_{a_t} p_t(s_t) p_t(a_t | s_t) log p_t(a_t | s_t) da_t ds_t.

This equation can be used to compute the entropy of first order Markovian policies, i.e. policies that do not depend on past states if the current state is given.

Relative Entropy
The relative entropy KL(P || Q) between two distributions p(s) and q(s), also known as Kullback-Leibler divergence, measures the information loss that results when p(s) is employed as an approximation of q(s). It is defined as

KL(P || Q) = ∫_s p(s) log( p(s) / q(s) ) ds.

The conditional relative entropy and the causal relative entropy are defined, analogously to the conditional and causal entropy, as

KL(P || Q) = ∫_{s,a} p(s) p(a|s) log( p(a|s) / q(a|s) ) da ds,

KL(P || Q) = ∑_{t=1}^{T} ∫_{s_t,a_t} p_t(s_t) p_t(a_t|s_t) log( p_t(a_t|s_t) / q_t(a_t|s_t) ) da_t ds_t.   (1.1)

Equation 1.1 can be used as a measure of similarity between two time-dependent policies while properly taking causality into account.

1.4.2 Markov Decision Process

A Markov Decision Process (MDP) is a mathematical framework that is often employed for modeling the environment of an agent in the context of Reinforcement Learning. MDPs model time by using discrete time steps. Within this thesis, only finite horizon MDPs are discussed, i.e. it is assumed that the process always ends after a fixed number T of time steps. The state of the environment and the state of the agent are modeled together using the set S of states. At each time step, the agent chooses an action from the set A of actions. After an action has been chosen, the process transitions into the next time step by randomly choosing a new state according to the system dynamics p_t(s_{t+1} | s_t, a_t). An MDP assumes the Markovian property, i.e. it assumes that the probability distribution over the next state does not depend on past states or actions if the current state and the current action are known. Additionally, at any given time step, the states and actions are rated using a reward function r_t(s_t, a_t). Therefore, a finite horizon MDP can be defined using the five-tuple M = ⟨S, A, p_t(s_{t+1}|s_t, a_t), T, r_t(s_t, a_t)⟩.

For the remainder of this thesis, the states and actions are represented using vectors s and a of size N_s and N_a respectively. The state and action at time step t are denoted by s_t and a_t and their continuous elements at index i by s_{i,t} and a_{i,t} respectively. The system dynamics are assumed to be linear with Gaussian noise, i.e.

p_t(s_{t+1} | s_t, a_t) = N(s_{t+1} | A_t s_t + B_t a_t + b_t, Σ_dyn,t).

Furthermore, the reward function is assumed to be a convex quadratic in s_t and a_t. Thereby, the MDP complies with the Linear Quadratic Gaussian (LQG) assumption.
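Because the LQG assumption just introduced means that all relevant distributions in later chapters are Gaussian, the relative entropy of Section 1.4.1 will repeatedly be evaluated between multivariate Gaussians, for which it has a closed form. A small illustrative helper (not part of the thesis) is sketched below:

    import numpy as np

    def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
        """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) for multivariate Gaussians."""
        k = mu_p.shape[0]
        cov_q_inv = np.linalg.inv(cov_q)
        diff = mu_q - mu_p
        return 0.5 * (np.trace(cov_q_inv @ cov_p)
                      + diff @ cov_q_inv @ diff
                      - k
                      + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))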

1.4.3 Optimal Control

An agent acting in an MDP chooses its actions according to a policy π_t(a_t | s_t) that is assumed to be consistent over time and thereby (and due to the Markovian property) depends only on the current state and time step. Reinforcement Learning addresses the problem of learning a policy that maximizes the expected reward of the agent. For finite horizon MDPs that comply with the LQG assumption, the optimal control policy can be computed recursively, even for continuous (and thereby infinite) state and action spaces, by using backward induction. Backward induction makes use of the concepts of Value functions and state-action Value functions. The Value function for a given policy π at time step t is denoted by V^π_t(s_t) and corresponds to the expected reward of an agent that starts at time step t in state s_t, assuming that the agent will act according to policy π for the current and all remaining time steps. The state-action Value function Q^π_t(s_t, a_t) is defined accordingly; however, it assumes that the agent chooses action a_t at time step t and only afterwards acts according to the policy π. The Value function for the last time step does not have to take future rewards into account and is thereby given by the reward function of the last time step,

V^π_T(s_T) = r_T(s_T).

The reward at the last time step does not depend on an action, because it is assumed that the MDP ends immediately after reaching the final state. The state-action Value function is consequently only defined for t < T. It can be expressed in terms of the expected Value function of the next time step using

Q^π_t(s_t, a_t) = r_t(s_t, a_t) + E_{p_t(s_{t+1}|s_t,a_t)}[ V^π_{t+1}(s_{t+1}) ],   (1.2)

where the distribution over the next state is given by the system dynamics p_t(s_{t+1}|s_t, a_t). Similarly, the Value function for time steps t < T can be expressed in terms of the expected state-action Value function of that same time step using

V^π_t(s_t) = E_{π_t(a_t|s_t)}[ Q^π_t(s_t, a_t) ],   (1.3)

where the distribution of the current action is given by the policy π_t(a_t|s_t). These recursive equations (1.2 and 1.3) are known as the Bellman equations for finite horizon MDPs. Starting with V^π_T(s_T), they can be employed to compute the Value functions and the state-action Value functions of any given policy π for all time steps by deducing backwards in time. The optimal policy π*_t(a_t|s_t) chooses at each time step an action that maximizes the state-action Value function Q^{π*}_t(s_t, a_t). Hence, the corresponding state-action Value function Q^{π*}_t(s_t, a_t) and the Value function V^{π*}_t(s_t) can be computed using Equation 1.2 and

V^{π*}_t(s_t) = max_{a_t} Q^{π*}_t(s_t, a_t).

The recursive equations are then known as the Bellman optimality equations for finite horizon MDPs. For LQG systems, V^{π*}_t(s_t) and Q^{π*}_t(s_t, a_t) are convex quadratics for all time steps. The involved expected values and maximum values can thus be derived analytically. The optimal policy for such systems takes the form of a deterministic, linear controller π*_t(s_t) = K_t s_t + k_t, where the controller gains K_t and k_t can be computed using backward induction.

1.5 Outline

The remainder of this thesis is structured as follows: Chapter 2 discusses related work by providing an overview of Imitation Learning and Inverse Reinforcement Learning. Studying MaxCausalEnt-IRL led to insights that hopefully assist in better understanding the mechanics of the approach. These insights are presented in Chapter 3 along with the discussion of implementation specific details. Chapter 4 will cover the novel approach of IRL-MSD. Chapter 5 evaluates this approach and compares it to MaxCausalEnt-IRL. Finally, Chapter 6 presents an outlook on future research and Chapter 7 recapitulates the findings of this thesis.
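To make the backward induction of Section 1.4.3 concrete before moving on to related work, the following minimal finite-horizon LQR sketch computes the time-varying gains of the optimal linear controller. It assumes deterministic dynamics and a quadratic cost formulation (the negative of a quadratic reward), so it illustrates the recursion rather than the exact LQG setting used later:

    import numpy as np

    def lqr_backward_induction(A, B, Q, R, T):
        """Finite-horizon LQR: minimize sum_t x_t'Q x_t + u_t'R u_t.

        Returns the time-varying gains K_t of the controller u_t = -K_t x_t,
        computed by the Riccati backward recursion (the value function stays
        quadratic, V_t(x) = x'P_t x).
        """
        P = Q  # value of the last time step is the state cost itself
        gains = []
        for _ in range(T - 1, 0, -1):
            # K_t = (R + B'P_{t+1}B)^{-1} B'P_{t+1}A
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
            # Riccati update for the quadratic value function
            P = Q + A.T @ P @ (A - B @ K)
            gains.append(K)
        return gains[::-1]  # gains ordered from t = 1 to T-1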

2 Related Work

2.1 Learning From Demonstration

This section provides a brief overview of Learning from Demonstration (LfD) in the field of robotics. Robotic LfD has been an active field of research since it was first covered in depth in 1984 (Halbert, 1984). Since then it has evolved into several different directions and produced overlapping terminology (Argall et al., 2009). This thesis distinguishes between the following terms:

Learning from Demonstration is used for all robot learning methods that make use of demonstrations. This definition is strictly more general than the one given by Argall et al. (2009), which only applies to approaches that learn a policy based on demonstrations.
Programming by Demonstration (PbD) is used in its original meaning, when demonstrations are used to produce code in a programming language (Cypher and Halbert, 1993).
Imitation Learning is used when the robot should learn to imitate the demonstrator.
True Imitation Learning is used when, following the definition of Tomasello et al. (1993), an understanding of the intentional states underlying the behavior is additionally required.
Behavioral Cloning is used when the agent learns a state-action mapping that closely matches the observed one without regarding the implications of the actions. It is thereby a subfield of Imitation Learning but can not provide true imitation.

A natural question regarding LfD is how to demonstrate. The different ways of demonstrating a task can also be described from the perspective of the learning agent, leading to the question: How can a robot perceive demonstrations?

2.1.1 Perceiving Demonstrations

Within this thesis, three different possibilities of perceiving demonstrations are discussed. For the first one, the agent perceives the demonstrations merely via exteroceptive sensors, i.e. sensors that measure quantities concerning the agent's environment, e.g. via vision. The second one also incorporates proprioception, i.e. directly sensing quantities that are related to the robot itself, e.g. joint positions. The third possibility of perceiving demonstrations includes proprioception and exteroception and furthermore a recording of the issued control commands.

Perception Based on Exteroceptive Sensing
When perceiving demonstrations based on exteroception, the agent passively observes the expert performing the task. Even though the robot might be able to measure its state using its proprioceptive sensors, this information is not considered to be part of the demonstration. Kuniyoshi et al. (1994) used learning by watching to learn high-level assembly plans by watching human demonstrations. Marker-based motion capturing is often used by humanoid robots to mimic the whole-body movements of human demonstrations (Ude et al., 2004; Kulić et al., 2011; Kim et al., 2009). Such demonstrations have the advantage that they often can be performed naturally and without the need of interacting with the robot.

However, they have the drawback that they give the least information to the agent. Especially, they do not contain any information about how the robot is supposed to perform the task but only about how the expert performs it. If the expert and the robot have similar embodiments, the robot might be able to perform the desired motion by imitating the expert. However, if they have different embodiments, it is not clear how to imitate the expert, which is known as the correspondence problem (Nehaniv and Dautenhahn, 2002; Alissandrakis et al., 2002).

Additional Perception Based on Proprioceptive Sensing
The robot can also perceive demonstrations using its proprioceptive sensors if the expert demonstrates the task by physically moving its limbs. Providing such demonstrations to the robot is known as kinesthetic teaching. If the robot is lightweight and compliant, the desired robot motion can be induced solely by forces created by the expert. Such demonstrations have been used successfully for learning a large variety of tasks, including Ball-Paddling, Ball-in-a-Cup, Pendulum Swing-Up and Pancake-Flipping (Kober and Peters, 2009; Kormushev et al., 2010).

Additional Perception of the Controls
Demonstrations can also be performed by using the actuators of the robot directly, for example via teleoperation. Thereby, the robot is controlled from a distance via remote control, e.g. a joystick or an exoskeleton. Other examples for this kind of demonstration include programming the desired motor commands and some forms of active kinesthetic teaching, i.e. kinesthetic teaching where the robot senses touches of the expert and reacts by issuing corresponding control commands. Demonstrations on a radio-controlled helicopter were used by Abbeel et al. (2010) to learn a controller that was capable of producing maneuvers at the edge of the physically feasible, including in-place flips and auto-rotational landings (emergency landings with an unpowered main rotor).

2.1.2 Imitation Learning

Imitation Learning is an important application of LfD, as it makes programming the robot easier for the expert and accessible for non-expert users. Imitation Learning can be divided into three different approaches (Schaal, 1999): (1) approaches that are only concerned with matching the demonstrated policy, (2) approaches that try to reproduce the demonstrated trajectory and (3) approaches that use the demonstrated trajectories in order to reduce the search space for Reinforcement Learning.

Policy-Based Imitation Learning
Already in the early years of robotics, demonstrations were used to ease the task of robot programming. The manipulator was guided by a human expert in order to perform the desired movements and was able to repeat that motion by recording the via points. This simple approach of robot programming by demonstration does not incorporate sensory input and is thereby only applicable to a few industrial tasks like painting component parts (Lozano-Perez, 1983). Behavioral Cloning (Bain and Sammut, 1995; Pomerleau, 1989) additionally records the sensory information to map the observed states to actions. This approach was used to create an autopilot for a flight simulator that exceeded the performance of the expert by smoothing out the human control noise (Sammut et al., 2002). This clean-up effect is often encountered with behavioral cloning (Michie et al., 1990). However, as behavioral cloning merely tries to produce the same actions as the expert, regardless of the resulting trajectory, it is often fragile with respect to changes in the environment.

Trajectory-Based Imitation Learning
The policy-based imitation learning approaches can fail if the robot and the expert differ in their kinematics or dynamics, because the same actions then lead to different trajectories. Trajectory-based imitation learning methods circumvent this shortcoming by trying to match the expert's trajectory instead of its policy. In order to make learning of the trajectory feasible, a representation is required that can generate the desired trajectories while at the same time depending on a tractable number of parameters. The choice of representation is crucial for the performance of the imitation learning task. An example of such a representation are via points. Miyamoto et al. (1996) extract via points from demonstrations by first adding a via point at the goal position and then iteratively adding via points at the positions of maximal squared error between the learned trajectory and the demonstrated one until the demonstration is matched sufficiently well. However, this representation can not be generalized straightforwardly and might not recover from perturbations in the environment. Schaal et al. (2000) proposed to represent movements by using a nonlinear dynamical system as an attractor. This led to the development of Dynamic Movement Primitives (DMPs), a flexible representation of motions that can be learned from demonstrations and is therefore suitable for Imitation Learning (Ijspeert et al., 2002, 2003; Schaal et al., 2003, 2004). A DMP is actually not a trajectory representation, but a representation of a control policy. This policy corresponds to a PD-controller that is perturbed with a nonlinear forcing function. The PD-controller has the purpose of reaching a given goal position, whereas the forcing function controls which path should be chosen to reach that goal. The forcing function is defined as a radial basis function network and converges to zero so that it does not prevent the PD-controller from reaching the goal state (when assuming appropriate controller gains). Furthermore, a phase variable is introduced that can be used to change the speed of execution. Additional advantages include robustness against perturbations, the capability of producing rhythmic movements, and ease of learning, which is achieved by learning the weights of the radial basis functions. Ijspeert et al. (2002) demonstrated the effectiveness of their approach by learning a forehand tennis swing from demonstrations. The trajectories can also be represented as Gaussian distributions by using Probabilistic Movement Primitives (ProMPs) (Paraschos et al., 2013). This approach uses weighted radial basis functions in order to encode the means for each time step and encodes the covariance matrices by using a prior on these weights. Hence, by marginalizing out the weight vector, the trajectories are given by the parameterization of the prior. Similar to DMPs, temporal modulation is achieved by introducing a phase variable, and rhythmic motions can be produced by using von-Mises basis functions instead of Gaussian ones. ProMPs can be used for Imitation Learning by learning the parameters of the prior that maximize the likelihood of the demonstrated trajectories. This probabilistic approach has several interesting capabilities: The movement primitives can be adapted to reach different via points by conditioning the distribution accordingly. Also, several primitives can be co-activated and blended by computing their product while weighing them with time-dependent activation functions. ProMPs were recently applied for learning interactions between multiple agents in collaborative assembly tasks (Maeda et al., 2014).

Reinforcement Learning From Demonstrations
For many practical applications, the state space is large and the dynamics model has to be approximated. If the dynamics are not learned, Imitation Learning can be insufficient to fulfill the task. Reinforcement Learning can be applied to learn the task by trial-and-error while at the same time learning the dynamics model. However, exploring the large state space can be too costly, especially if the reward function is sparse. Learning can be boosted by first learning a policy from demonstrations and afterwards improving it using Reinforcement Learning (Schaal, 1997).
Following up Imitation Learning with Reinforcement Learning may not only enable the robot to accomplish the task but also to improve upon the expert demonstrations (Atkeson and Schaal, 1997). Furthermore, by applying Reinforcement Learning the robot is able to handle new situations for which the demonstrated behavior would fail, for example, when new obstacles occur (Guenter et al., 2007).

Kober and Peters (2009) used Imitation Learning based on DMPs in order to initialize their policy search method Policy Learning by Weighting Exploration with the Returns (PoWER). This approach enabled the robot to reliably succeed in the game of Ball-in-a-Cup. A similar approach was used by Kormushev et al. (2010) for learning the task of pancake-flipping.

2.1.3 Inverse Reinforcement Learning

Inverse Reinforcement Learning constitutes a different application of Learning from Demonstration by aiming at learning the reward function of the expert based on its demonstrations. By using the learned reward function for Reinforcement Learning, IRL can be used as a method of Imitation Learning. However, the IRL-problem should be clearly distinguished from the problem of Imitation Learning, because learning a reward function can also serve other purposes than imitation. For example, in a human-robot collaboration task, the robot might try to infer the intention of the human, not in order to take over his task but in order to assist him in achieving it. In a less cooperative way, as mentioned by Ramachandran and Amir (2007), the reward function can be used to model the opponent in adversarial games like poker, in order to exploit its strategy. More generally, IRL aims at inferring the goal underlying the observed behavior. Nevertheless, surveying current research in IRL, one may conclude that it is indeed most often used for Imitation Learning tasks, both for experimental evaluation as well as for practical applications. This is not surprising, because describing (nearly) optimal behavior in terms of its intention is succinct and allows for generalization. Furthermore, unlike a policy, the reward function remains valid if the dynamics change. In the following, an overview will be given of research in the field of Inverse Reinforcement Learning, starting with the first MDP formulation of the problem, given by Ng et al. (2000).

The MDP Formulation of Inverse Reinforcement Learning
Similar to Reinforcement Learning, Inverse Reinforcement Learning assumes that the agent is acting in a Markov Decision Process. However, the reward function of that MDP is not known to the agent. In exchange, it is given demonstrations of an expert that tries to maximize this unknown reward. The goal of IRL is now to learn that reward function from the demonstrations. Hence, instead of addressing the problem of finding a near-optimal policy for a given reward function, it addresses the problem of recovering a reward function from demonstrations of a near-optimal policy. Unfortunately, this problem formulation is ill-posed, because the demonstrations do not suffice to deduce the underlying intention of the expert with certainty. For example, there is always the, usually extremely unlikely but always non-zero, possibility that the expert chooses its actions completely randomly. This would correspond to a constant reward function, for which any behavior would be optimal. Clearly, such degenerate solutions should be discarded and more reasonable assumptions about the underlying reward function should be found. However, what would make an assumption reasonable? Demanding that the expert performs (near) optimally on the learned reward function is a necessary but not sufficient condition. Ng et al. (2000) proposed to further constrain the solution space by demanding that any single-step deviation from the observed policy should be maximally punished. Additionally, an L1-regularization was added to prefer simple solutions. They demonstrated that this approach can be used to approximately recover the true expert policy on simple grid world tasks and the mountain-car problem.

Bayesian Inverse Reinforcement Learning
Ramachandran and Amir (2007) propose a Bayesian approach in order to avoid having to decide for a particular reward function. For that purpose the posterior is computed based on a prior on the reward parameters and the likelihood of the demonstrations. Possible priors on the reward parameters include the Gaussian-, Laplace- and Beta-distribution as well as the uniform distribution over a finite interval. The likelihood of a demonstration for given reward parameters is computed under the assumption that the expert chooses its actions proportionally to the exponential of the corresponding expected reward. The reward function can be inferred from the posterior distribution by computing its mean or Maximum-A-Posteriori (MAP) estimate. More interestingly for Imitation Learning, the policy can also be inferred from the posterior directly. However, due to the complexity of the posterior distribution, Markov-Chain-Monte-Carlo (MCMC) (Gilks, 2005) has to be applied in both cases, whereby the approach can suffer from the curse of dimensionality (Bellman, 1957).

Matching Feature Counts
Abbeel and Ng (2004) proposed a new approach to discard degenerate solutions, by demanding that the optimal policy with respect to the learned reward function should match the observed policy in behavior. For that purpose, the reward function is assumed to be a linear combination of given time-dependent features φ_t(s_t, a_t), i.e.,

r_t(s_t, a_t) = φ_t(s_t, a_t)ᵀ θ,

and a reward function is learned such that the expected feature counts φ̄ match the empirical feature counts of the expert φ̂,

φ̂ = (1/N_D) ∑_{i=1}^{N_D} ∑_{t=1}^{T} φ_t(s_t^{(i)}, a_t^{(i)})  =!  φ̄ = ∑_{t=1}^{T} E_{p_t(s_t,a_t)}[ φ_t(s_t, a_t) ],

where N_D denotes the number of demonstrations and s_t^{(i)} and a_t^{(i)} denote the state and action of the i-th demonstration at time step t. The expected feature counts φ̄ are computed based on the joint distributions p_t(s_t, a_t) that would result from the optimal policy with respect to the learned parameters. Since the total reward r(s, a) can be written as

r(s, a) = ∑_{t=1}^{T} r_t(s_t, a_t) = ∑_{t=1}^{T} φ_t(s_t, a_t)ᵀ θ = ( ∑_{t=1}^{T} φ_t(s_t, a_t) )ᵀ θ,

a policy that produces the same feature counts as the expert would achieve the same reward on the true reward function, independent of its parameters θ. However, if the joint distributions p_t(s_t, a_t) for computing φ̄ are based on an optimal, deterministic policy, it is often not possible to match the empirical feature counts, because they are usually based on samples and, furthermore, on suboptimal demonstrations. Therefore, IRL approaches that are based on matching feature counts use a stochastic, suboptimal policy for computing the feature expectations. However, when the expected feature counts are based on a policy that is suboptimal with respect to the learned parameters θ, different reward functions could be learned, depending on which policy has been chosen.

And even though all these reward functions could be used to match the empirical feature counts by employing their respective policies, they might not infer the correct goal and the optimal policies for these reward functions thus might perform badly.

Maximum Entropy Inverse Reinforcement Learning
Maximum Entropy Inverse Reinforcement Learning (Ziebart et al., 2008) applies the principle of maximum entropy (Jaynes, 1957) by choosing the policy that leads to the maximum entropy joint distribution for matching the feature counts. This is the most principled approach, as it does not presume any ungrounded constraints on the joint distribution and thereby minimizes the worst-case prediction log-loss (Grünwald and Dawid, 2004). Under the maximum entropy model, the likelihood of a given path ζ_i is proportional to the exponential of its reward, i.e.,

p(ζ_i | θ) ∝ exp( θᵀ φ_{ζ_i} ).

The reward function θ that leads to the maximum entropy state distribution matching the feature counts can be found by maximizing the log-likelihood of the demonstrations,

θ* = argmax_θ L(θ) = argmax_θ ∑_{i=1}^{N_d} log p(ζ_i | θ),

where N_d denotes the number of demonstrations. The maximum likelihood can be found using gradient-based optimization, where the gradient of the log-likelihood is given by the difference between the empirical feature counts and the expected feature counts for the current estimate of the reward function, i.e.,

∇L(θ) = φ̂ − φ̄.

Several recent advances in Inverse Reinforcement Learning are based on Maximum Entropy Inverse Reinforcement Learning (MaxEnt-IRL), e.g. Levine et al. (2011) apply Gaussian Process Regression in order to learn nonlinear reward functions. Levine and Koltun (2012) apply a Laplace approximation in order to locally approximate the policy by a Gaussian along the demonstrated trajectories in a deterministic MDP. Boularias et al. (2011) actually use the relative entropy between the empirical distribution and the expected distribution. By estimating the subgradient via Importance Sampling they derive a model-free method. In its original formulation, MaxEnt-IRL assumes that all side information was available to the expert at the beginning of its demonstration. If this is not the case, e.g. whenever the system dynamics are noisy, Maximum Causal Entropy Inverse Reinforcement Learning (Ziebart et al., 2010) can be used instead, which chooses the stochastic policy that has the maximum causal entropy while matching the feature counts. This modification is of main interest for this thesis and is discussed in detail in Chapter 3.

2.2 Relative Entropy Policy Search

Policy Search is an approach to Reinforcement Learning that does not aim at learning the Value function but instead learns a policy directly. Policy gradient methods achieve this by using gradient-based optimization to iteratively update the parameters of the policy. However, because the previous experience is encoded within the current policy, a policy update destroys part of that experience, leading to an information loss. Relative Entropy Policy Search (Peters et al., 2010) is based on the optimization problem of maximizing the expected reward (Equation 2.2) while bounding the information loss (relative entropy) between the current and previous iteration (Equation 2.2b), i.e.,

maximize_{π(a|s), µ(s)}   ∫_{s,a} µ(s) π(a|s) r(s, a) ds da   (2.2)

subject to   ∫_{s,a} µ(s) π(a|s) log( µ(s)π(a|s) / q(s, a) ) ds da ≤ ε,   (2.2b)

             ∀s':  ∫ µ(s') φ(s') ds' = ∫_{s,a,s'} µ(s) π(a|s) p(s'|s, a) φ(s') ds da ds',   (2.2c)

             ∫_{s,a} µ(s) π(a|s) ds da = 1,   (2.2d)

where µ(s) is the stationary (Equation 2.2c) state distribution that is eventually reached when executing the policy π(a|s) in an infinite horizon MDP. Equation 2.2d is necessary to ensure that π(a|s) and µ(s) are probability distributions. q(s, a) represents the (estimated) joint distribution of the previous iteration; thus, the policy that maximizes the problem,

π_max(a|s) ∝ q(s, a) exp( (1/η) ( r(s, a) + E_{p(s'|s,a)}[ θᵀφ(s') ] − θᵀφ(s) ) ),

corresponds to the best policy update with bounded information loss ε. The Lagrangian multipliers θ and η can be found using gradient-based optimization. Encoding the iterative nature of a learning approach by bounding the information loss between the current and last iteration is a major inspiration for IRL-MSD.

3 Maximum Causal Entropy Inverse Reinforcement Learning

The action choices of the expert for different time steps might not be consistent with respect to the complete trajectory, because it might have adapted its planning based on the outcomes of the stochastic system dynamics p_t(s_{t+1}|s_t, a_t). Maximizing the entropy of the whole trajectory distribution does not take this causality into account. Maximum Causal Entropy IRL differs from Maximum Entropy IRL by aiming at maximizing the causal entropy of the policy π_t(a_t|s_t) instead of the entropy of the joint distribution p(s, a). The resulting Gaussian policy chooses actions with probabilities that increase exponentially with the expected future reward,

π^MCE_t(a_t|s_t) ∝ exp( Q^π_t(s_t, a_t) ).   (3.1)

For LQG systems, this leads to a noisy linear controller that differs from the optimal controller only by its noise. The state-action Value function Q^π_t(s_t, a_t) and the Value function V^π_t(s_t) can be computed using Equations 1.2 and 1.3 described in Section 1.4.3. However, Ziebart et al. (2010) provide a different way to compute a Value function Ṽ^π_t(s_t) that is similar to the computation of the optimal Value function (Section 1.4.3) except that the maximum-operator has been replaced by the softmax-operator,

Ṽ^π_t(s_t) = log ∫_{a_t} exp( Q^π_t(s_t, a_t) ) da_t = softmax_{a_t} Q^π_t(s_t, a_t).   (3.2)

The difference between Ṽ^π_t(s_t) and V^π_t(s_t) will be investigated in Section 3.1. MaxCausalEnt-IRL learns the reward function for which π^MCE_t(a_t|s_t) matches the empirical feature counts φ̂ in expectation. This is achieved by minimizing the dual of the optimization problem of maximizing the causal entropy of the policy constrained on matching the feature counts. The partial derivative of the dual function with respect to the parameterization of the reward function is given in (Ziebart et al., 2010) as the difference between the empirical feature counts φ̂ of the expert and the expected feature counts φ̄ of π^MCE_t(a_t|s_t) for the current estimate of θ, i.e.,

∂g/∂θ = φ̂ − φ̄.   (3.3)

Hence, MaxCausalEnt-IRL proposes to compute the reward parameters iteratively using gradient descent. Each iteration thereby involves a backward pass and a forward pass. The backward pass computes Ṽ^π_t(s_t) for all time steps starting at the last time step T (based on the current estimate of θ) as described in Section 1.4.3. For that purpose, it applies Equation 1.2 to compute Q^π_t(s_t, a_t) and Equation 3.2 to compute Ṽ^π_t(s_t). For LQG systems, Q_t(s_t, a_t) is a convex quadratic function for all time steps. Hence, the policies π^MCE_t(a_t|s_t) can be computed using Equation 3.1. The forward pass starts at the (known) initial state distribution p_1(s_1) and uses the policies π^MCE_t(a_t|s_t) to compute the joint distributions p_t(s_t, a_t) for all time steps based on the system dynamics p_t(s_{t+1}|s_t, a_t). These joint distributions are then used to compute the expected feature counts φ̄ for the current estimate of θ. The feature expectations can then be used to compute the gradient using Equation 3.3.
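This backward-forward iteration can be sketched compactly for a small tabular MDP. The snippet below is only illustrative: it uses time-independent features, ignores the special handling of the final time step described above, and is not the LQG implementation used in the thesis.

    import numpy as np

    def max_causal_ent_irl(P, phi, p1, demos, T, lr=0.1, n_iters=200):
        """Tabular sketch of the MaxCausalEnt-IRL iteration.

        P:     dynamics, shape (S, A, S), P[s, a, s'] = p(s'|s, a)
        phi:   features, shape (S, A, K)
        p1:    initial state distribution, shape (S,)
        demos: list of trajectories [(s_0, a_0), ..., (s_{T-1}, a_{T-1})]
        """
        S, A, K = phi.shape
        # empirical feature counts of the expert
        phi_hat = np.mean([sum(phi[s, a] for s, a in traj) for traj in demos], axis=0)
        theta = np.zeros(K)
        for _ in range(n_iters):
            # backward pass: soft value functions and stochastic policies (Eqs. 3.1, 3.2)
            V = np.zeros(S)
            policy = np.zeros((T, S, A))
            for t in reversed(range(T)):
                Q = phi @ theta + P @ V                 # state-action values (S, A)
                V = np.log(np.exp(Q).sum(axis=1))       # softmax over actions
                policy[t] = np.exp(Q - V[:, None])
            # forward pass: propagate the initial state distribution
            phi_bar = np.zeros(K)
            p_s = p1.copy()
            for t in range(T):
                p_sa = p_s[:, None] * policy[t]         # joint p_t(s, a)
                phi_bar += np.einsum('sa,sak->k', p_sa, phi)
                p_s = np.einsum('sa,sap->p', p_sa, P)
            # step along the feature-count mismatch (Eq. 3.3), i.e. the direction
            # that increases the likelihood of the demonstrations
            theta += lr * (phi_hat - phi_bar)
        return theta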

Since the performance of gradient descent depends heavily on the chosen stepsize, it is sensible to look for ways to adapt it during optimization. In the context of MaxCausalEnt-IRL, the evaluation of the dual function can serve as a basis for stepsize adaptation. Unfortunately, the dual function has not been published along with the algorithm and, therefore, had to be derived for this thesis (see Appendix A). Fortunately, however, this derivation provides insights into the intrinsics of the algorithm that shall now be discussed.

3.1 Insights

Based on the description in (Ziebart et al., 2010), the optimization problem can be formulated as

maximize_{π_t(a_t|s_t)}   −∑_{t=1}^{T−1} ∫_{s_t,a_t} p_t(s_t) π_t(a_t|s_t) log π_t(a_t|s_t) ds_t da_t   (3.4)

subject to   ∑_{t=1}^{T−1} ∫_{s_t,a_t} p_t(s_t) π_t(a_t|s_t) φ_t(s_t, a_t) ds_t da_t + ∫_{s_T} p_T(s_T) φ_T(s_T, 0) ds_T = φ̂,   (3.4b)

             ∀t>1:  p_t(s_t) = ∫_{s_{t−1},a_{t−1}} p_{t−1}(s_{t−1}) π_{t−1}(a_{t−1}|s_{t−1}) p(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   (3.4c)

             p_1(s_1) = µ_1(s_1),   (3.4d)

             ∀t<T, ∀s_t:  ∫_{a_t} π_t(a_t|s_t) da_t = 1,   (3.4e)

where the causal entropy of the policy π_t(a_t|s_t) should be maximized (3.4) subject to the constraints of matching the feature counts (3.4b) and keeping the state distributions consistent (3.4c), where the initial state distribution shall be provided by µ_1(s_1) (3.4d). Furthermore, Equation (3.4e) ensures that the policy is a probability distribution. The Lagrangian multiplier of the constraint (3.4b) for matching the feature counts will be denoted as θ, as it corresponds to the weight vector of the reward function. The Lagrangian multipliers of the constraints (3.4c) and (3.4d) will be denoted by Ṽ^π_2(s_2), ..., Ṽ^π_T(s_T) and Ṽ^π_1(s_1), respectively, as they relate to the Value functions of the policy π_t(a_t|s_t). The optimization problem is solved using Lagrange optimization (Boyd and Vandenberghe, 2009) as demonstrated in Appendix A. The dual function g(p_t(s_t), θ, Ṽ^π_t(s_t)) is thereby minimized using the partial derivatives

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂p_t(s_t) =
    −Ṽ^π_t(s_t) + log ∫_{a_t} exp( θᵀφ_t(s_t, a_t) + ∫_{s_{t+1}} Ṽ^π_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ) da_t,   if t < T
    −Ṽ^π_T(s_T) + θᵀφ_T(s_T, 0),   if t = T,   (3.5)

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂Ṽ^π_t(s_t) =
    −p_1(s_1) + µ_1(s_1),   if t = 1
    −p_t(s_t) + ∫_{s_{t−1},a_{t−1}} π_{t−1}(a_{t−1}|s_{t−1}) p_{t−1}(s_{t−1}) p_{t−1}(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   if t > 1,   (3.5b)

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂θ = φ̂ − φ̄.   (3.5c)

Setting Equation 3.5 equal to zero leads to an update equation for Ṽ^π_t(s_t) (backward pass),

Ṽ^π_t(s_t) = log ∫_{a_t} exp( θᵀφ_t(s_t, a_t) + ∫_{s_{t+1}} Ṽ^π_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ) da_t,   if t < T
Ṽ^π_T(s_T) = θᵀφ_T(s_T, 0),   if t = T.

Setting Equation 3.5b equal to zero leads to an update equation for p_t(s_t) (forward pass),

p_t(s_t) = µ_1(s_1),   if t = 1
p_t(s_t) = ∫_{s_{t−1},a_{t−1}} π_{t−1}(a_{t−1}|s_{t−1}) p_{t−1}(s_{t−1}) p_{t−1}(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   if t > 1.   (3.6)

As mentioned at the beginning of this chapter, the backward pass and the forward pass are performed consecutively during each iteration in order to compute the expected feature counts φ̄ for the current approximation of θ, which can then be used for a single step of gradient descent by using Equation (3.5c). The common way of computing the Value function V^π_t(s_t) of π^MCE would involve computing the expectation of the state-action Value function Q^π_t(s_t, a_t) using Equation 1.3. It is therefore interesting to compare the Value function V^π_t(s_t) that is computed using Equation 1.3 with the Value function Ṽ^π_t(s_t) that is computed using Equation 3.6. It turns out that both Value functions have the same state-dependent part, but differ in an offset that depends on θ,

Ṽ^π_t(s_t) = V^π_t(s_t) + ∑_{i=t}^{T−1} ½ ( N_a + log|2πΣ_{a,i}| ),   (3.7)

where Σ_{a,t} denotes the covariance matrix of π^MCE_t(a_t|s_t). The proof is given in Appendix C. As the offset does not depend on the state, both backward passes lead to the same policies. When evaluating the dual function for the current estimates of p_t(s_t), Ṽ^π_t(s_t) and θ, Equations (3.5) and (3.5b) equate to zero for all time steps. In that case, the dual function simplifies greatly and is given by

g(θ) = φ̂ᵀθ − Ṽ^π_1(s_1).   (3.8)

The expected total reward V^π_1(s_1) can also be computed based on the expected feature counts φ̄,

V^π_1(s_1) = θᵀφ̄.   (3.9)

Using Equations 3.7, 3.8 and 3.9, the dual function can be expressed in terms of φ̄,

g(θ) = φ̂ᵀθ − Ṽ^π_1(s_1)
     = φ̂ᵀθ − V^π_1(s_1) − ∑_{i=1}^{T−1} ½ ( N_a + log|2πΣ_{a,i}| )
     = φ̂ᵀθ − θᵀφ̄ − ∑_{t=1}^{T−1} ½ log|2πΣ_{a,t}| + const.   (3.10)

The empirical feature counts φ̂ and the expected feature counts φ̄ have to be computed anyway in order to evaluate Equation (3.5c), and the covariances Σ_{a,t} of the stochastic policies π_t(a_t|s_t) are a side product of the backward pass. Therefore, the dual function can be evaluated at each iteration for the current approximation of θ without any noticeable computational overhead.
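Evaluating the simplified dual then only requires quantities that are already available; a small illustrative helper following the reconstruction of Equation 3.10 above (not part of the thesis):

    import numpy as np

    def dual_value(theta, phi_hat, phi_bar, action_covs):
        """Evaluate the simplified dual (Eq. 3.10) up to an additive constant.

        phi_hat:     empirical feature counts of the expert
        phi_bar:     expected feature counts under the current policy (forward pass)
        action_covs: policy covariances Sigma_{a,t}, t = 1..T-1 (backward pass by-product)
        """
        entropy_like = sum(0.5 * np.linalg.slogdet(2.0 * np.pi * cov)[1]
                           for cov in action_covs)
        return (phi_hat - phi_bar) @ theta - entropy_like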

3.2 Implementation

This section discusses the practical challenges that arose when employing MaxCausalEnt-IRL for the purpose of learning non-stationary reward functions and how they have been coped with. Applying the algorithm for learning time-dependent reward functions is straightforward and is achieved by introducing an independent set of features for each time step. Thereby, however, the number of features is increased significantly, and learning becomes infeasible if the duration of the demonstrations is discretized into too many partitions. Radial Basis Functions (RBFs) are employed in order to learn smooth, time-continuous reward functions based on a coarse discretization. Additionally, the insights discussed in Section 3.1 are utilized by adapting the stepsize based on the dual function (3.10).

3.2.1 Learning Non-Stationary Reward Functions

MaxCausalEnt-IRL is applied to learn T independent, quadratic reward functions

r_t(s_t, a_t) = −(s_t − g_t)ᵀ R_t (s_t − g_t) − a_tᵀ H a_t = −s_tᵀ R_t s_t + r_tᵀ s_t − a_tᵀ H a_t + const,

where R_t must be positive semi-definite and H must be positive definite. The goal position g_t relates to the linear coefficients r_t via r_t = 2 R_t g_t. The action penalty matrix H is assumed to be a time-independent diagonal matrix in order to reduce the amount of features. The features φ are divided into T + 1 subsets φ = [φ_0, φ_1, ..., φ_T], where φ_0 aggregates the quadratic actions, i.e.

φ_0 = [ ∑_{t=1}^{T−1} a²_{1,t}, ..., ∑_{t=1}^{T−1} a²_{N_a,t} ].

The remaining subsets encode the linear, quadratic and mixed state features for their corresponding time step, i.e.,

∀t ∈ [1, T]:  φ_t = [ s_{1,t}, ..., s_{N_s,t}, s_{1,t}s_{1,t}, ..., s_{1,t}s_{N_s,t}, s_{2,t}s_{2,t}, ..., s_{2,t}s_{N_s,t}, ..., s_{N_s,t}s_{N_s,t} ].

The total reward can then be expressed as a linear combination of these features,

r(s, a) = ∑_{t=1}^{T} r_t(s_t, a_t) = θᵀ φ + const.

The entries of θ thereby correspond to the entries of the parameters r_t, R_t and H depending on the features they weight:
if the feature is a quadratic action term, it corresponds to the respective diagonal element of H.
if the feature is a quadratic state term, it corresponds to the respective diagonal element of R_t.
if the feature is a mixed state term, it corresponds to the respective off-diagonal entry times two.
if the feature is a linear state term, it corresponds to the respective element of r_t.
Here, t corresponds to the index of the subset that contains the feature. Thus, the parameters of the time-dependent reward functions r_t(s_t, a_t) can be learned by learning θ via MaxCausalEnt-IRL as described earlier in this chapter.
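For illustration, the linear, quadratic and mixed state features of a single time step could be assembled as follows (the layout and ordering only mirror the description above; the helper itself is not taken from the thesis):

    import numpy as np

    def state_features(s):
        """Linear, quadratic and mixed state features for one time step.

        Returns [s_1, ..., s_N, s_1*s_1, s_1*s_2, ..., s_N*s_N] with each mixed
        product appearing once (upper triangle), matching the layout sketched above.
        """
        n = s.shape[0]
        linear = list(s)
        quad = [s[i] * s[j] for i in range(n) for j in range(i, n)]
        return np.array(linear + quad)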

3.2.2 Towards a Compact Representation Using Radial Basis Functions

Learning an independent set of features for each time step increases the number of parameters significantly. In order to soften this increase, the T time-dependent reward functions are expressed using N_ϕ radial basis functions ϕ_i(t), i.e.

r^ϕ_t(s_t, a_t) = ∑_{i=1}^{N_ϕ} ϕ_i(t) ( −s_tᵀ R_i s_t + r_iᵀ s_t ) − a_tᵀ H a_t,

where ϕ_i(t) are normalized Gaussian radial basis functions

ϕ_i(t) = w_i exp( −½ ( (t − c_i) / σ_i )² ),

with equally spaced centers c_i and variances σ_i. The weights w_i are chosen to normalize the radial basis functions on the interval [1, T], i.e.,

∀i ∈ [1, N_ϕ]:  ∑_{t=1}^{T} ϕ_i(t) = 1.

By employing radial basis functions, only N_ϕ + 1 subsets of features are required instead of T + 1. Again, the first subset, φ_0, shall encode the quadratic actions as shown in Section 3.2.1. The remaining subsets, however, now aggregate the linear, quadratic and mixed state features with respect to the responsibilities of the respective radial basis functions, i.e.,

∀i ∈ [1, N_ϕ]:  φ^ϕ_i = ∑_{t=1}^{T} ϕ_i(t) [ s_{1,t}, ..., s_{N_s,t}, s_{1,t}s_{1,t}, ..., s_{1,t}s_{N_s,t}, s_{2,t}s_{2,t}, ..., s_{N_s,t}s_{N_s,t} ].

By composing the feature vector φ^ϕ of these subsets, the total reward is again a linear combination of the features,

r(s, a) = ∑_{t=1}^{T} r^ϕ_t(s_t, a_t) = θᵀ φ^ϕ,

and the parameters of the time-dependent reward functions can thus be learned via θ. Besides decreasing the number of parameters, the RBF-based representation leads to a time-continuous reward function by assuming a reward function that changes smoothly over time.
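A small sketch of the normalized Gaussian radial basis functions over the time axis (names and the equal spacing of centers follow the description above; the helper itself is illustrative):

    import numpy as np

    def normalized_rbf_activations(T, n_basis, sigma):
        """Gaussian RBFs over time steps 1..T with equally spaced centers.

        Each basis function is scaled so that its activations sum to one over the
        horizon, as described above. Returns an array of shape (n_basis, T).
        """
        t = np.arange(1, T + 1, dtype=float)
        centers = np.linspace(1, T, n_basis)
        activations = np.exp(-0.5 * ((t[None, :] - centers[:, None]) / sigma) ** 2)
        return activations / activations.sum(axis=1, keepdims=True)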

3.2.3 Ensuring Positive Semi-Definite Reward Functions

As the entries of θ directly correspond to the elements of the parameters r_x, R_x and H, the gradient-based optimization may lead to a non-positive-semidefinite state cost R_x, which would violate the LQG assumption and thereby lead to failure. This can be avoided by applying gradient descent for learning the entries of a lower triangular matrix L_x instead of learning the entries of R_x directly. R_x can then be constructed by

R_x = L_x L_xᵀ.

Following the Cholesky decomposition, any positive semi-definite matrix can be decomposed into such a matrix product, and any such product yields a positive semi-definite matrix. In order to learn the entries of L_x, an additional weight vector θ̃ of the same size as θ is introduced, such that each entry of that weight vector corresponds to a different entry of the respective triangular matrix L_x, goal position g_x or action cost matrix H. Gradient descent is then applied to update θ̃ instead of θ using

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂θ̃ = ( ∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂θ ) (dθ / dθ̃).

The gradient dθ/dθ̃ can be solved in closed form and produces a block diagonal matrix

dθ/dθ̃ = blockdiag( dθ_0/dθ̃_0, dθ_1/dθ̃_1, ..., dθ_N/dθ̃_N ),

where θ_i and θ̃_i indicate the subsets of θ and θ̃ that correspond to the respective subsets φ_i of φ.

3.2.4 Adapting the Stepsize Based on the Dual Function

As shown in Section 3.1, the dual function can be evaluated very efficiently using Equation 3.10. Therefore, it makes sense to take these evaluations as a basis for stepsize adaptation during gradient descent. This is done by checking after each gradient step whether the dual function did indeed decrease. Only then is the gradient step accepted and the stepsize increased by a fixed factor α. Whenever the dual function did not decrease, the last gradient step is withdrawn and the stepsize is decreased by β. The parameters α and β are chosen such that the relative increase after a successful step is smaller than the relative decrease after an unsuccessful step, in order to avoid having to withdraw too often.
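The accept/withdraw stepsize rule of Section 3.2.4 can be sketched as follows (gradient_fn and dual_fn stand for any implementation of Equations 3.5c and 3.10; the default values are placeholders, not the ones used in the thesis):

    def adaptive_gradient_descent(theta_tilde, gradient_fn, dual_fn,
                                  stepsize=1e-3, alpha=1.01, beta=0.5, n_iters=1000):
        """Accept a step only if the dual decreases; otherwise withdraw it and shrink the stepsize."""
        best_dual = dual_fn(theta_tilde)
        for _ in range(n_iters):
            candidate = theta_tilde - stepsize * gradient_fn(theta_tilde)
            candidate_dual = dual_fn(candidate)
            if candidate_dual < best_dual:      # dual decreased: accept and grow the stepsize
                theta_tilde, best_dual = candidate, candidate_dual
                stepsize *= alpha
            else:                               # dual did not decrease: withdraw and shrink
                stepsize *= beta
        return theta_tilde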

4 Inverse Reinforcement Learning by Matching State Distributions

This chapter presents a novel approach for learning non-stationary reward functions from demonstrations. In contrast to Maximum Causal Entropy Inverse Reinforcement Learning, it is not based on matching the feature expectations of the learned policy with the observed feature counts. Instead, it aims at matching the state distributions that result from the learned policies, p^π_t(s_t), with the observed state distributions, q_t(s_t). By defining the objective in terms of distributions instead of feature counts, the features do not have to be explicitly defined. Instead, the structure of the reward is automatically determined based on the structure of the distributions. Matching the state distributions is achieved by minimizing the relative entropy between those probability distributions. Additionally, inspired by Relative Entropy Policy Search (REPS), the relative entropy of the current policy π_t(a_t|s_t) is minimized with respect to the last policy q_{0,t}(a_t|s_t) in order to reduce the loss of information between iterations.

4.1 The Optimization Problem

Inverse Reinforcement Learning by Matching State Distributions is based on the following optimization problem:

minimize_{π_t(a_t|s_t)}   ∑_{t=1}^{T} ∫_{s_t} p_t(s_t) log( p_t(s_t) / q_t(s_t) ) ds_t + ∑_{t=1}^{T−1} ∫_{s_t} p_t(s_t) ∫_{a_t} π_t(a_t|s_t) log( π_t(a_t|s_t) / q_{0,t}(a_t|s_t) ) da_t ds_t   (4.1)

subject to   ∀t>1:  p_t(s_t) = ∫_{s_{t−1},a_{t−1}} p_{t−1}(s_{t−1}) π_{t−1}(a_{t−1}|s_{t−1}) p(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   (4.1b)

             p_1(s_1) = µ_1(s_1),   (4.1c)

             ∀t<T, ∀s_t:  ∫_{a_t} π_t(a_t|s_t) da_t = 1.   (4.1d)

The constraints (4.1b, 4.1c, 4.1d) are exactly the same as their counterparts in the Maximum Causal Entropy optimization problem (3.4c, 3.4d, 3.4e). The goal of the optimization (4.1), however, is different, as it aims at minimizing the two Kullback-Leibler divergences (KLs). It is worth noting that the target policy q_{0,t}(a_t|s_t) does not have to be set to the policy of the last iteration, but could be set to a fixed distribution instead. In this case, the corresponding KL does not lead to learning action costs, but serves as a regularization instead. Again, the Lagrangian multipliers for the constraints (4.1b) and (4.1c) relate to the Value functions of the policy π_t(a_t|s_t) and will therefore be denoted by Ṽ^π_t(s_t). As the constraint of matching the feature counts has been dropped, the optimization problem does not possess Lagrangian multipliers that correspond to the parameters of the reward function. Indeed, it might look like the optimization problem does not refer to any form of reward at all, giving rise to the question of how it can be employed to tackle the problem of IRL.

This question can best be answered by looking at the partial derivatives of the dual function,

∂g(p_t(s_t), Ṽ_t(s_t)) / ∂p_t(s_t) =
    −Ṽ_T(s_T) + log( p_T(s_T) / q_T(s_T) ) + 1,   if t = T
    −Ṽ_t(s_t) + log ∫_{a_t} exp( log q_{0,t}(a_t|s_t) + log q_t(s_t) − log p_t(s_t) − 1 + ∫_{s_{t+1}} Ṽ_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ) da_t,   if t < T,   (4.2)

∂g(p_t(s_t), Ṽ_t(s_t)) / ∂Ṽ_t(s_t) =
    −p_1(s_1) + µ_1(s_1),   if t = 1
    −p_t(s_t) + ∫_{s_{t−1},a_{t−1}} π_{t−1}(a_{t−1}|s_{t−1}) p_{t−1}(s_{t−1}) p_{t−1}(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   if t > 1.   (4.2b)

The derivations are given in Appendix B. Setting the derivative with respect to the state distribution (4.2) to zero leads to the backward pass to compute Ṽ^π_t(s_t), while setting the derivative with respect to the Value function (4.2b) equal to zero leads to the forward pass to compute p_t(s_t). Actually, the forward pass is exactly the same for MaxCausalEnt-IRL (3.5b) and IRL-MSD. More interestingly, however, the backward pass of IRL-MSD is the same as the backward pass of MaxCausalEnt-IRL (3.5) except that the reward function θᵀφ_t(s_t, a_t) has been replaced by the term

r_t(s_t, a_t) = log q_{0,t}(a_t|s_t) + log q_t(s_t) − log p_t(s_t) − 1,   (4.3)

which serves as a reward signal for the optimal policy π_t(a_t|s_t), which is, as in MaxCausalEnt-IRL, proportional to the exponential of the state-action Value function Q_t(s_t, a_t),

π_t(a_t|s_t) ∝ exp( Q_t(s_t, a_t) ) = exp( r_t(s_t, a_t) + ∫_{s_{t+1}} Ṽ_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ).   (4.4)

Hence, the reward function depends directly on the target policy q_{0,t}(a_t|s_t), the target state distribution q_t(s_t), and p_t(s_t), the state distribution that results from the policy that minimizes the objective (4.1). As the state distribution p_t(s_t) depends on the policy, and the policy depends on the Value function Ṽ_t(s_t), the reward function is actually defined recursively. The features of the reward function thus evolve naturally, depending on the type of the involved probability distributions. Most notably, if all distributions are Gaussian, and the system dynamics are linear with Gaussian noise, the reward function is quadratic in states and actions. For LQG systems, π_t(a_t|s_t) always takes the form of a linear PD controller with Gaussian noise. Therefore, in the following, the target policy is assumed to be of the same form, i.e. q_{0,t}(a_t|s_t) = N(a_t | K_{0,t} s_t + k_{0,t}, Σ_{q0,t}). Then, the parameters of the reward function

r_t(s_t, a_t) = − [s_t; a_t]ᵀ [ R_t  F_t ; F_tᵀ  H_t ] [s_t; a_t] + r_tᵀ s_t + h_tᵀ a_t + const   (4.5)

can be computed by

R_t = ½ ( Σ^{-1}_{q,t} − Σ^{-1}_{p,t} + K_{0,t}ᵀ Σ^{-1}_{q0,t} K_{0,t} ),   (4.6)
r_t = Σ^{-1}_{q,t} µ_{q,t} − Σ^{-1}_{p,t} µ_{p,t} − K_{0,t}ᵀ Σ^{-1}_{q0,t} k_{0,t},   (4.6b)
F_t = −½ K_{0,t}ᵀ Σ^{-1}_{q0,t},   (4.6c)
H_t = ½ Σ^{-1}_{q0,t},   (4.6d)
h_t = Σ^{-1}_{q0,t} k_{0,t}.   (4.6e)

The derivations can be found in Appendix B.
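A direct transcription of Equations 4.6 to 4.6e, as reconstructed above, into a small helper; the inputs are the Gaussian statistics of the target state distribution q_t(s_t), the current state distribution p_t(s_t) and the target policy q_{0,t}(a_t|s_t) (function and variable names are illustrative):

    import numpy as np

    def msd_reward_parameters(mu_q, cov_q, mu_p, cov_p, K0, k0, cov_q0):
        """Quadratic reward parameters of Eq. 4.5 from Gaussian statistics (Eqs. 4.6 to 4.6e)."""
        prec_q, prec_p, prec_q0 = (np.linalg.inv(c) for c in (cov_q, cov_p, cov_q0))
        R = 0.5 * (prec_q - prec_p + K0.T @ prec_q0 @ K0)
        r = prec_q @ mu_q - prec_p @ mu_p - K0.T @ prec_q0 @ k0
        F = -0.5 * K0.T @ prec_q0
        H = 0.5 * prec_q0
        h = prec_q0 @ k0
        return R, r, F, H, h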

4.2 The Proposed Algorithm

The partial derivatives of the dual function (Equation 4.2 and Equation 4.2b) lead to an iterative algorithm similar to MaxCausalEnt-IRL. Starting with an initial estimate of the reward function, the softmax Value functions Ṽ^π_t(s_t) as well as the corresponding policies π_t(a_t|s_t) are computed via the backward pass (4.2), and the resulting state distributions p_t(s_t) are computed via the forward pass (4.2b). However, IRL-MSD utilizes these state distributions in order to compute the parameters of the reward function that is optimal with respect to the current estimate in one shot, whereas MaxCausalEnt-IRL utilizes them merely to update the parameters by a single step of gradient descent. It should be noted that the computational cost of computing the optimal parameters (Equations 4.6 to 4.6e) does not exceed the cost of computing the gradient of the parameters via feature expectations. The IRL-MSD optimization problem allows different interpretations regarding q_{0,t}, which lead to two different variants of the algorithm. The first variant treats q_{0,t} as an estimate of the optimal policy that is updated after each iteration, whereas the second variant treats q_{0,t} as a regularization on the policy which is not changed during optimization. The effects of these different points of view shall be discussed in the following subsections.

4.2.1 Using the Policy KL for Learning Action Costs

The first variant of IRL-MSD, which will be referred to as MSD-ACL in the following, regards q_{0,t} as an estimate of the optimal policy. It is updated after each iteration by setting it equal to the policy that was computed during that iteration. Thereby, similar to REPS, the current policy should stay close to the last one. However, in contrast to REPS, the relative entropy is not bounded. This allows for faster convergence but does not give any guarantee that the policy is indeed close to the last one. Therefore, if such a guarantee is needed, e.g. when learning local linearizations of the state dynamics, a bound on that KL should be introduced by reformulating it as a constraint. In the following, however, it will be assumed that no such bound is necessary. The relative entropy then merely serves the purpose of learning the policy that matches the target state distributions as closely as possible. MSD with Action Cost Learning (MSD-ACL) would per default not regularize the actions at all; however, it can be easily augmented with action regularization by adding a fixed offset H_reg,t to the action cost matrix H_t (Equation 4.6d) whenever computing the reward function. This corresponds to adding a corresponding punishment term to the objective of the optimization problem. A more serious disadvantage of MSD-ACL comes from the fact that setting q_{0,t}(a_t|s_t) equal to a noisy PD-controller leads to learning state-dependent action costs F_t and potentially non-zero action goals h_t. While this might be fine for many cases, it has the drawback that it leads to a reward function that is difficult to interpret. When state-dependent action costs should be avoided, the second variant of IRL-MSD can be employed.

4.2.2 Using the Policy KL for Regularization

The second variant, which is in the following named MSD with KL-based Regularization (MSD-REG), uses q_{0,t} for regularization. The controller gains K_{0,t} and k_{0,t} are set to zero, and its noise Σ_{q0,t} serves as a weight of the regularization. If the covariance is high, deviations from the zero action are punished less and the regularization is thus low. Respectively, a low covariance would result in strong regularization. However, misusing the KL for regularization comes at a cost, because minimizing the relative entropy might actually result in increasing the action costs artificially just for the purpose of matching Σ_{q0,t}, which might lead to suboptimal learning. Nevertheless, it allows to learn state-dependent reward functions for a given action cost matrix H. Such reward functions have the advantage that goal states as well as their relevance can be directly inferred.

The effects on the performance that result from using the KL for regularization will be investigated in Chapter 5.

4.2.3 Ensuring Positive Definite Reward Functions

Computing R_t or H_t according to Equation 4.6 or 4.6d does not guarantee positive semi-definite cost matrices. Such reward functions should be avoided, because they break the LQG assumption and may lead to non-Gaussian policies. For IRL-MSD, a non-positive-semi-definite reward function indicates that the variance with respect to the goal position is smaller than the target variance. The goal position then is no longer an attractor, but serves as a detractor instead, i.e. the reward increases with the distance to that position. As it is not admissible to allow such reward functions, the cost matrices can be checked immediately after computation for positive definiteness. If they are not positive definite, a spectral decomposition can be utilized in order to replace all negative eigenvalues by a small positive number and then to transform the matrix back into its original basis.

4.2.4 Disregarding Features by Matching Marginals

In some cases, it is not desirable to match the demonstrations with respect to all state trajectories. For example, kinesthetic teaching might be used in order to demonstrate the via points of a movement regarding only the task space positions, but not the corresponding velocities. It might seem to be a drawback of IRL-MSD that the features cannot be chosen directly, but are rather implicitly defined by the target distributions. However, defining the type of the target distribution p_t(s_t) does not need to be less intuitive than defining the type of the reward function. It could be argued that defining the type of the target distribution is more intuitive, since it can be directly estimated based on the trajectories, whereas inferring the structure of the reward function from the trajectories is slightly less direct. However, for many practical applications it probably does not make any difference in the end, and the problem of choosing the proper features and the problem of choosing the proper target distribution are just two different views on the same problem. Ignoring certain state dimensions for MaxCausalEnt-IRL is achieved by ignoring the corresponding features when computing the feature counts, whereas for IRL-MSD it is achieved by ignoring the corresponding dimensions of the random variables when computing the KL. This leads to a slightly more general formulation of the objective of the optimization problem, i.e. Equation 4.1 becomes

minimize_{π_t(a_t|s_t)}   ∑_{t=1}^{T} ∫ p_t(s̄_t) log( p_t(s̄_t) / q_t(s̄_t) ) ds̄_t + ∑_{t=1}^{T−1} ∫_{s_t} p_t(s_t) ∫_{a_t} π_t(a_t|s_t) log( π_t(a_t|s_t) / q_{0,t}(a_t|s_t) ) da_t ds_t,   (4.7)

where p_t(s̄_t) and q_t(s̄_t) are the respective marginal distributions over the regarded state dimensions s̄_t. The only difference that results from this small modification is that the entries of R_t and r_t that correspond to the disregarded states have to be filled up with zeros, i.e. they have to be computed by

R_t = ½ ( Dᵀ( Σ^{-1}_{q̄,t} − Σ^{-1}_{p̄,t} )D + K_{0,t}ᵀ Σ^{-1}_{q0,t} K_{0,t} ),   (4.8)
r_t = Dᵀ( Σ^{-1}_{q̄,t} µ_{q̄,t} − Σ^{-1}_{p̄,t} µ_{p̄,t} ) − K_{0,t}ᵀ Σ^{-1}_{q0,t} k_{0,t},   (4.8b)

where Σ_{q̄,t}, µ_{q̄,t}, Σ_{p̄,t} and µ_{p̄,t} denote the statistics of these marginals and D is constructed by removing all rows of the N_s-by-N_s identity matrix that correspond to the disregarded states.

4.2.5 A Pseudo-code Implementation of IRL-MSD

A pseudo-code implementation of IRL-MSD for LQG systems is given in Algorithm 1 for the variants discussed in Sections 4.2.1 and 4.2.2. The modifications given in Sections 4.2.3 and 4.2.4 can be incorporated by computing the reward accordingly.

input:  q_t(s_t)              /* target state distributions for all time steps */
        q^(0)_{0,t}(a_t|s_t)  /* target policies for all time steps */
        p(s_{t+1}|s_t, a_t)   /* system dynamics for all time steps */
        µ_1(s_1)              /* state distribution of the first time step */
        T                     /* time horizon */
        learnActionCosts      /* boolean indicating whether action costs should be learned */
output: H^(i)_t, h^(i)_t, F^(i)_t, R^(i)_t, r^(i)_t   /* reward parameters for all time steps */

/* Initialize reward parameters */
[H^(0)_t, h^(0)_t, F^(0)_t, R^(0)_t, r^(0)_t] ← initialize()
i ← 0
while not converged do
    /* compute the Value functions and policies using Equations 4.2 and 4.4 */
    [Ṽ^(i)_t(s_t), π^(i)_t(a_t|s_t)] ← backward_pass(H^(i)_t, h^(i)_t, F^(i)_t, R^(i)_t, r^(i)_t)   /* see Appendix B */
    /* compute the state distributions using Equation 4.2b */
    p^(i)_t(s_t) ← forward_pass(π^(i)_t(a_t|s_t), p(s_{t+1}|s_t, a_t), µ_1(s_1))
    /* set the next target policies */
    if learnActionCosts then
        q^(i+1)_{0,t}(a_t|s_t) ← π^(i)_t(a_t|s_t)
    else
        q^(i+1)_{0,t}(a_t|s_t) ← q^(i)_{0,t}(a_t|s_t)
    /* compute reward parameters for the next iteration using Equations 4.6 to 4.6e */
    [H^(i+1)_t, h^(i+1)_t, F^(i+1)_t, R^(i+1)_t, r^(i+1)_t] ← compute_reward(q_t(s_t), q^(i+1)_{0,t}(a_t|s_t), p^(i)_t(s_t))
    /* iterate */
    i ← i + 1

Algorithm 1: Pseudo-code Implementation of IRL-MSD
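One step that Algorithm 1 leaves implicit is the projection described in Section 4.2.3. A minimal sketch of replacing negative eigenvalues by a small positive number via a spectral decomposition (the threshold value is an arbitrary choice):

    import numpy as np

    def project_to_positive_definite(M, eps=1e-6):
        """Replace negative eigenvalues of a symmetric matrix by a small positive number."""
        eigvals, eigvecs = np.linalg.eigh(M)
        eigvals = np.maximum(eigvals, eps)
        return eigvecs @ np.diag(eigvals) @ eigvecs.T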

5 Evaluation

The different approaches are evaluated on a double integrator toy example. A double integrator is a dynamical model of an underactuated system that possesses twice as many state dimensions as action dimensions, where these two states correspond to the first and second integration of the corresponding action. Thereby the actions can be interpreted as accelerations and the states as velocities and positions. The velocity at time step t, v_t, can be approximated using the velocity and acceleration of the last time step, i.e.

v_t = v_{t−1} + v̇_{t−1} Δt ≈ v_{t−1} + a_{t−1} Δt,

where Δt is the difference in time between two time steps. Similarly, the position at time step t, x_t, can be approximated as

x_t ≈ x_{t−1} + v_{t−1} Δt.

For a single action and a time discretization Δt = 0.1, this corresponds to the linear system dynamics

s_{t+1} = A s_t + B a_t,   with   A = [ 1  0.1 ; 0  1 ]   and   B = [ 0 ; 0.1 ].   (5.1)

The dimensionality can be increased by adding additional actions and their corresponding pairs of states. The matrices of such higher dimensional systems are block-diagonal and composed of the blocks given in Equation 5.1. The approximation error is modeled as Gaussian noise, leading to non-deterministic system dynamics.

5.1 Assuming Known Statistics of the Expert

The evaluations within this section assume that the true policy of the expert is known. Therefore, the state distributions resulting from that policy are used in order to compute the empirical feature counts φ̂ for MaxCausalEnt-IRL and, respectively, are used as target distributions q_t(s_t) for IRL-MSD. The effects of the sampling error are thereby removed when comparing the different approaches. The system dynamics are given by a double integrator with N_a action dimensions and, thus, N_s = 2N_a state dimensions. The MDP lasts for T = 50 time steps. The true reward function is quadratic in states and actions. The action costs are not correlated with the states and are time-independent. The action cost matrix is given by H = σ²_a I_{N_a}, where I_{N_a} denotes the N_a-by-N_a identity matrix and σ²_a is set to a fixed value. State costs are only given at time step 25 and at time step 50, and are based on a projection of the positions to a three-dimensional space. This relates to the problem of task space control of an N_a-link manipulator, where the forward kinematics are given by the projection matrix. For the following experiments, the number of links was set to three. The target distribution was computed using optimal control on the true reward function. Figure 5.1 shows the resulting state distributions of the first dimension in task space. The shaded area corresponds to the 2σ confidence interval. The right side shows an estimate based on seven samples.

5.1.1 IRL by Matching State Distributions

MSD-REG and MSD-ACL are evaluated for different initial covariance matrices Σ_{q0,t}. The mean of the target policy q_0(a_t|s_t) is set equal to zero, independent of the state. Only a single parameter, σ_0, has to be chosen by setting Σ_{q0,t} = σ²_0 I_{N_a}. MSD-REG uses σ²_0 to control the regularization, where a high value of σ²_0 corresponds to low regularization and vice versa. MSD-ACL uses σ²_0 as an initial guess of the target policy that is updated during optimization.
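The block-diagonal dynamics of Equation 5.1 can be assembled as follows for N_a action dimensions (a sketch; the noise magnitude is an arbitrary illustration, as its value is not stated here):

    import numpy as np

    def double_integrator_dynamics(n_actions, dt=0.1, noise_std=0.01):
        """Block-diagonal linear dynamics s_{t+1} = A s_t + B a_t + noise (Eq. 5.1).

        Each action dimension contributes one [position, velocity] pair of states.
        """
        A_block = np.array([[1.0, dt],
                            [0.0, 1.0]])
        B_block = np.array([[0.0],
                            [dt]])
        A = np.kron(np.eye(n_actions), A_block)
        B = np.kron(np.eye(n_actions), B_block)
        dyn_cov = noise_std ** 2 * np.eye(2 * n_actions)   # Gaussian approximation error
        return A, B, dyn_cov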

Figure 5.1.: The expert trajectory distribution of the first dimension in task space plotted over time. Left: true distribution, Right: estimate based on seven samples.

Figure 5.2.: The expected rewards for MSD-REG and MSD-ACL are shown for different initializations. Left: MSD-REG achieves good performance already after the first iteration, however it converges to a solution that performs worse. Right: The expected reward of MSD-ACL converges to the one of the expert. The higher initialization has a faster speed of convergence.

Figure 5.2 shows the expected reward of the optimal controller with respect to the learned reward functions for good values of σ_0². As for all other experiments conducted in this chapter, the expected reward is computed based on the true reward function. The values for σ_0² have been found in previous experiments. Higher values have led to instabilities for both approaches. Interestingly, MSD-REG attains the best performing reward function after only a few iterations and subsequently converges to a solution that performs slightly worse. MSD-ACL converges smoothly to the expert's performance. The speed of convergence is faster for the initialization with the higher variance. The state distributions of the optimal controllers on the learned reward functions are depicted in Figure 5.3. The initialization was set to σ_0² = 1000 for MSD-REG and to σ_0² = 10 for MSD-ACL. The plots on the left side show the state distributions after one (red) and after five thousand iterations (blue) of MSD-REG. The distribution after a single iteration has clearly lower variance than the baseline distribution (black). The state distribution after five thousand iterations matches the baseline distribution better, even though it performs worse in terms of expected reward. The plots on the right side show the corresponding distributions for MSD-ACL. The state distribution after a single iteration (red) approximates the target distribution worse than the respective distribution of MSD-REG because it started with a higher regularization. The state distribution after five thousand iterations (blue) matches the baseline distribution exactly.

Figure 5.3.: The trajectories of the first dimension in task space are shown for MSD-REG (left) and MSD-ACL (right). The initial estimate of MSD-REG (red) has clearly lower variance than the baseline distribution. The state distribution of MSD-ACL after five thousand iterations (blue) is indistinguishable from the baseline distribution.

5.1.2 Comparison with Maximum Causal Entropy IRL

MSD-ACL with σ_0² = 10 is compared to MaxCausalEnt-IRL. The modification based on radial basis functions (Chapter 3.2.2) is not shown, because it does not lead to a significant difference for the given task (except, of course, if the number of centers is chosen very low, which impairs the performance significantly). For MaxCausalEnt-IRL the action costs are assumed to be known in order to reduce the number of features. The step size is adapted after each iteration of gradient descent by multiplying it with α = 1.01 or β = 0.5 respectively, as described in chapter 3. The algorithms are compared with respect to the number of performed iterations. An additional comparison based on the computational time spent would be uninspiring, because both algorithms need approximately the same time per iteration.

Figure 5.4 shows the expected reward after each iteration. The blue curve relates to MSD-ACL and is the same as the corresponding curve in Figure 5.2. The red curve shows the performance of MaxCausalEnt-IRL. It converges significantly slower than MSD-ACL. Figure 5.5 shows the resulting state distribution for MaxCausalEnt-IRL after one thousand (red) and after ten thousand iterations (blue). These distributions are based on the greedy, optimal controller. The corresponding distribution for the true reward function is shown in black. MaxCausalEnt-IRL succeeds in matching the means very accurately already after a thousand iterations, but has too high a variance at the goal positions. After ten thousand iterations the variances are matched more closely; however, they are still slightly too large at the goal positions.

It is also interesting to inspect the learned parameters directly. Both approaches learn reward functions that only assign high rewards at time steps that are close to the critical time steps 25 and 50; thus, the remaining time steps can be neglected. MSD-ACL learns state-dependent action costs and is therefore difficult to interpret. Hence, MSD-REG and MaxCausalEnt-IRL are analyzed, based on the reward functions that have been learned after ten thousand iterations. The learned reward matrices R_t and goal positions g_t are mapped to the task space using the projection matrix. This transformation yields the low-dimensional reward parameters R_TS and g_TS that can be compared to the true reward function. For ease of illustration, only the first task space dimension is considered and reward correlations are ignored. Figure 5.6 shows the first dimension of the task space goal positions (solid line) and the corresponding entries of R_TS (radius of the shaded area) for the interesting time steps. Interestingly, MaxCausalEnt-IRL learns goal positions that are distant from the desired trajectory, especially at time steps shortly before the critical ones. The goal positions are given in Table 5.1. MSD-REG succeeds in extracting the true goal positions at the critical time steps 25 and 50. The goal positions learned by MaxCausalEnt-IRL are very different from the true goal positions even for high-reward time steps.

Figure 5.4.: The expected reward of MSD-ACL and MaxCausalEnt-IRL is compared after each iteration based on the true reward function. MSD-ACL converges significantly faster to the expected reward of the expert.

Table 5.1.: The table compares the goal positions around the critical time steps for the first dimension in task space (columns: time step, actual, MSD-REG, MaxCausalEnt-IRL). MSD-REG extracts the actual goal positions at the critical time steps (25 and 50) very accurately. These values are given with higher precision. MaxCausalEnt-IRL learns wrong goal positions at the critical time steps and compensates this by choosing goal positions that are very different from zero for the preceding time steps.
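The goal positions discussed above can be read off a learned quadratic reward. The following sketch assumes the parameterization r_t(s) = -½ s^T R_t s + r_t^T s with positive definite R_t, which may differ from the exact convention used in the thesis; under that assumption the implied goal position is the maximizer g_t = R_t^{-1} r_t.

import numpy as np

def goal_position(R, r):
    """Maximizer of the quadratic reward r(s) = -0.5 * s^T R s + r^T s,
    assuming R is positive definite (illustrative convention)."""
    return np.linalg.solve(R, r)

R = np.array([[4.0, 0.0],
              [0.0, 1.0]])
r = np.array([2.0, 0.5])
print(goal_position(R, r))   # -> [0.5, 0.5]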

Figure 5.5.: The state distribution of the optimal controller based on the reward learned by Maximum Causal Entropy IRL is shown after 1000 iterations (red) and after 10000 iterations (blue). The state distribution for the true reward function is shown in black. The learned reward function leads to a variance that is slightly too high at the goal positions.

5.2 Learning Based on Samples

This section covers the more realistic scenario where the policy and the resulting joint distribution of the expert are not known. The experiments are based on the same system as the one described in section 5.1. The empirical feature counts and the target distributions, respectively, are estimated based on demonstrations. All experiments use the same twenty sets of seven demonstrations.

5.2.1 IRL by Matching State Distributions

The employed reward function does not depend on the velocities directly. As the velocities are usually very noisy, it is sensible to exclude them from the target distribution as described in chapter 4. The effect of removing the velocities from the target distribution was tested both for MSD-ACL as well as for MSD-REG. The version of MSD-ACL that tries to match the complete state distribution was initialized with σ_0² = 10; the version that ignores the velocities uses σ_0² = 1. These values were chosen based on preliminary experiments. The averaged expected rewards and their 2σ confidence intervals are shown in Figure 5.7. The variant that does not try to match the velocities converges slower, but seems to converge to a better solution. Both variants of MSD-REG have been tested with the same value of σ_0². Similar to the experiments based on the known expert distribution, the averaged expected reward did not change much after the first iteration. The results of the optimization are therefore shown in Table 5.2. By ignoring the noisy velocities, the performance could be improved.
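The per-time-step target distributions used in this section can be estimated from the demonstrated trajectories as empirical means and covariances, optionally dropping the velocity dimensions. The following is a small sketch under these assumptions; all names and the index convention (positions at even state indices) are illustrative.

import numpy as np

def estimate_targets(demos, keep_dims=None):
    """demos: array of shape (n_demos, T, state_dim).
    Returns per-time-step empirical means and covariances, optionally
    restricted to a subset of the state dimensions."""
    if keep_dims is not None:
        demos = demos[:, :, keep_dims]
    means = demos.mean(axis=0)                                        # shape (T, d)
    centered = demos - means[None, :, :]
    covs = np.einsum('nti,ntj->tij', centered, centered) / (demos.shape[0] - 1)
    return means, covs

demos = np.random.randn(7, 50, 6)                          # e.g. seven demonstrations, T = 50
means, covs = estimate_targets(demos, keep_dims=[0, 2, 4]) # keep only the position dimensions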

Figure 5.6.: The plots show the learned goal positions and their associated rewards for the interesting time steps at the middle and at the end of the demonstrations. The top row illustrates the reward parameters learned by MSD-REG. The bottom row shows those learned by MaxCausalEnt-IRL. Both approaches associate high rewards to the critical time steps 25 and 50. The goal positions learned by MaxCausalEnt-IRL are very different from zero at the time steps that precede the critical ones. The goal positions are shown in Table 5.1.

Table 5.2.: The expected averaged reward is shown after a thousand iterations (rows: matching velocities, ignoring velocities, expert demonstrations; columns: averaged reward, 2σ-confidence). The performance of MSD-REG is similar to the one demonstrated by the expert when the velocities are ignored.

Figure 5.7.: The red curve shows the averaged expected reward of MSD-ACL when matching both positions and velocities. The blue curve shows the averaged expected reward of MSD-ACL when the velocities are ignored. The average expected reward of the demonstrations is illustrated by the black line. The shaded areas correspond to 2σ confidence. The variant that ignores the velocities converges slower. However, it seems to converge to a better solution. The slower convergence can be explained with the higher initial regularization.

Table 5.3.: The table compares the goal positions around the critical time steps for the first dimension in task space (columns: time step, actual, MSD-REG, MaxCausalEnt-IRL). MSD-REG approximately learns the true goal positions at the critical time steps (25 and 50).

5.2.2 Comparison with Maximum Causal Entropy IRL

Figure 5.8 compares the expected rewards of MSD-ACL and MaxCausalEnt-IRL. Both approaches try to match all states. MSD-ACL is initialized with σ_0² = 10. As for the setting where the expert state distribution is known, MSD-ACL converges significantly faster. Again, the reward parameters learned by MSD-REG are compared to those learned by MaxCausalEnt-IRL. Figure 5.9 shows the goal positions for the first dimension in task space as well as the associated rewards for time steps that are close to the critical ones.

Figure 5.8.: MaxCausalEnt-IRL (blue curve) is compared to MSD-ACL (red curve) on the same set of 7 samples. MSD-ACL converges significantly faster.

5.3 Discussion

A simple toy task was used to evaluate both variants of IRL-MSD and to compare them with MaxCausalEnt-IRL. Both MSD-ACL and MSD-REG converged significantly faster than MaxCausalEnt-IRL. This is not surprising, given that IRL-MSD does not rely on gradient descent but directly computes the reward parameters that are optimal with respect to the current estimation. When properly initialized, both variants of IRL-MSD seem to converge to similar solutions. MSD-REG converges almost instantly. The expected reward based on the initial estimate of the parameters was always close to the best one found over all iterations. However, the initialization has a huge impact on the performance (Figure 5.2, left) because it corresponds to a regularization. MSD-ACL updates the target policy after each iteration and, thus, the initial parameter is less influential (Figure 5.2, right). The parameters of the reward functions have been inspected in order to assess how well the true goal positions of the expert have been matched. MSD-REG was able to approximately recover the true goal positions at the important time steps even for the sample-based estimate of the target state distribution.
