Feature Extraction for Inverse Reinforcement Learning


Feature Extraction for Inverse Reinforcement Learning
(Feature-Extraktion für Inverse Reinforcement Learning)
Master's thesis by Oleg Arenz from Wiesbaden
Date of submission:
1st Review:  2nd Review:  3rd Review:

Declaration on the Master's Thesis

I hereby declare that I have written the present Master's thesis without the help of third parties and using only the cited sources and aids. All passages taken from sources are marked as such. This thesis has not been submitted to any examination authority in the same or a similar form before.

Darmstadt, December 17, 2014

(Oleg Arenz)

Abstract

Equipping an agent with the ability to infer the intentions behind observed behavior is a prerequisite for creating truly autonomous robots. By deducing the purpose of the actions of other agents, the robot would be able to react in a sensible way and, furthermore, to imitate the strategy on a high level. Inverse Reinforcement Learning (IRL) can be applied to learn a reward function that is consistent with the observed behavior and is, thus, a step towards this overall goal. Some strategies can only be modeled properly by an underlying time-dependent reward function. Maximum Causal Entropy Inverse Reinforcement Learning (MaxCausalEnt-IRL) is a method that can be applied to learn such non-stationary functions. However, it depends on gradient-based optimization and its performance can therefore suffer if too many parameters have to be learned. This can be problematic, since the number of parameters increases significantly if separate reward functions are learned for each time step. Furthermore, since only a few time steps might be relevant for the observed task, a second challenge of applying IRL for learning non-stationary reward functions consists in properly extracting such sparseness. This thesis investigates how to meet these practical requirements. A novel approach, IRL-MSD, is developed for that purpose. Unlike some previous IRL methods, Inverse Reinforcement Learning by Matching State Distributions (IRL-MSD) does not aim to match feature counts but instead learns a reward function by matching state distributions. This approach has two interesting properties. Firstly, the features do not have to be defined explicitly but arise naturally from the structure of the observed state distributions. Secondly, it does not require gradient-based optimization. The experiments show that it converges faster than existing IRL methods and properly recovers the goals of sparse reward functions. Therefore, IRL-MSD suggests itself for future research.

Acknowledgments

I would like to thank Prof. Dr. Jan Peters for founding the Intelligent Autonomous Systems lab at TU Darmstadt. The IAS concentrates a lot of expertise in the field of robot learning from which I could benefit. I also want to thank my supervisors Prof. Dr. Gerhard Neumann and M.Sc. Christian Daniel for the countless hours they invested during our regular meetings. I learned a lot from you during my work on this thesis. I especially want to thank Prof. Dr. Gerhard Neumann, who came up with the optimization problem for IRL-MSD. I am grateful for having gotten the opportunity to work on this method.

Contents

1. Introduction
   1.1. Thesis Statement
   1.2. Motivation for Imitation Learning by Inverse Reinforcement Learning
   1.3. Motivation for Learning Time-Dependent Reward Functions
   1.4. Preliminaries
        1.4.1. Information Theory
        1.4.2. Markov Decision Process
        1.4.3. Optimal Control
   1.5. Outline
2. Related Work
   2.1. Learning From Demonstration
        2.1.1. Perceiving Demonstrations
        2.1.2. Imitation Learning
        2.1.3. Inverse Reinforcement Learning
   2.2. Relative Entropy Policy Search
3. Maximum Causal Entropy Inverse Reinforcement Learning
   3.1. Insights
   3.2. Implementation
        3.2.1. Learning Non-Stationary Reward Functions
        3.2.2. Towards a Compact Representation Using Radial Basis Functions
        3.2.3. Ensuring Positive Semi-Definite Reward Functions
        3.2.4. Adapting the Stepsize Based on the Dual Function
4. Inverse Reinforcement Learning by Matching State Distributions
   4.1. The Optimization Problem
   4.2. The Proposed Algorithm
        4.2.1. Using the Policy KL for Learning Action Costs
        4.2.2. Using the Policy KL for Regularization
        4.2.3. Ensuring Positive Definite Reward Functions
        4.2.4. Disregarding Features by Matching Marginals
        4.2.5. A Pseudo-code Implementation of IRL-MSD
5. Evaluation
   5.1. Assuming Known Statistics of the Expert
        5.1.1. IRL by Matching State Distributions
        5.1.2. Comparison with Maximum Causal Entropy IRL
   5.2. Learning Based on Samples
        5.2.1. IRL by Matching State Distributions
        5.2.2. Comparison with Maximum Causal Entropy IRL
   5.3. Discussion

6. Future Work
   Learning Context Dependent Reward Functions
   Dropping the LQG Assumption
   Bounding the Policy-KL
   Learning State Independent Action Costs
7. Conclusion

References

A. Derivations for Maximum Causal Entropy Inverse Reinforcement Learning
B. Derivations for Inverse Reinforcement Learning by Matching State Distributions
C. The Difference between Ṽ^π(s) and V^π(s)
   C.1. Derivation of V^π(s)
   C.2. Derivation of Ṽ^π(s)
   C.3. Comparison of V^π(s) and Ṽ^π(s)

Figures, Tables and Algorithms

List of Figures
1.1. Motivating Example
First Dimension of Expert State Distribution in Task Space
Expected Reward of IRL-MSD for Known Expert Statistics
First Dimension of MSD State Distribution in Task Space
Comparison of MSD and MaxCausalEnt for Known Statistics
First Dimension of MaxCausalEnt State Distribution in Task Space
Plot of Learned Reward Parameters
Expected Reward of MSD-ACL Based on 7 Samples
Comparison of Maximum Causal Entropy IRL and IRL-MSD Based on 7 Samples
Plot of Learned Reward Parameters Based on 7 Samples

List of Tables
5.1. Learned Goal Positions for Known Expert Statistics
Expected Reward for MSD-REG Based on 7 Samples
Learned Goal Positions Based on 7 Samples

List of Algorithms
1. Pseudo-code Implementation of IRL-MSD

Abbreviations and Symbols

List of Abbreviations
DMP             Dynamic Movement Primitive
IRL             Inverse Reinforcement Learning
IRL-MSD         Inverse Reinforcement Learning by Matching State Distributions
KL              Kullback-Leibler divergence
LfD             Learning from Demonstration
LQG             Linear Quadratic Gaussian
MAP             Maximum-A-Posteriori
MaxCausalEnt-IRL  Maximum Causal Entropy Inverse Reinforcement Learning
MaxEnt-IRL      Maximum Entropy Inverse Reinforcement Learning
MCMC            Markov-Chain-Monte-Carlo
MDP             Markov Decision Process
MSD-ACL         MSD with Action Cost Learning
MSD-REG         MSD with KL-based Regularization
PbD             Programming by Demonstration
PoWER           Policy Learning by Weighting Exploration with the Returns
ProMP           Probabilistic Movement Primitive
RBF             Radial Basis Function
REPS            Relative Entropy Policy Search
RL              Reinforcement Learning

1 Introduction

Autonomous robots need to adapt their planning based on observations of their environment. Hence, when observing other agents, the autonomous robot should decide how to react to the observed behavior. For example, if during a collaborative assembly task one agent picks up a screw, a sensible reaction of a second agent might consist in handing over the screwdriver. Especially if the environment is not well-defined, a meaningful reaction to the observed behavior might only be possible by inferring the intention of the other agent. Learning the intention behind an observed behavior is also helpful for Imitation Learning. Imitation Learning (Schaal, 1999; Argall et al., 2009) allows an agent to be taught how to perform a task by providing demonstrations. By inferring the goal of the demonstrations, the agent learns a task representation that is generalizable and not affected by different embodiments and system dynamics. IRL (Ng et al., 2000) is a step towards learning the intention of an observed behavior by learning a reward function from expert demonstrations. Reward functions define task goals by rating actions and states with respect to these goals and hence can represent intentions. This representation is suitable for Imitation Learning, because the agent can learn to perform the desired task by applying Reinforcement Learning (RL) (Sutton and Barto, 1998).

1.1 Thesis Statement

The aim of this thesis is to apply Inverse Reinforcement Learning in order to learn time-dependent reward functions. Furthermore, it is investigated how well the sparseness can be recovered if the true reward function of the expert is sparse in time. MaxCausalEnt-IRL (Ziebart et al., 2010) is an IRL approach that can properly treat situations where the agent adapted its plan based on information that has been revealed to it during execution. This method therefore suggests itself for learning time-dependent reward functions. MaxCausalEnt-IRL applies gradient descent for learning the parameters of the reward function. Therefore, the speed of convergence can suffer if many parameters have to be optimized. A novel approach, IRL-MSD, is developed in order to avoid gradient-based optimization. This method does not require an explicit definition of the features, but chooses them automatically based on the structure of the target distributions.

1.2 Motivation for Imitation Learning by Inverse Reinforcement Learning

Teaching a task by combining IRL and RL seems laborious because it involves two steps of learning. Hence, it could be argued that a more direct approach should be preferred by (1) learning the policy directly from the demonstrations or (2) by providing the reward function directly. However, both of these approaches have limitations that shall be discussed in the following. Imitating movements from demonstrations without learning a reward function is usually achieved by learning a parameterized target policy or target trajectory using regression. Learning the policy directly was used by Sammut et al. (2002) to create an autopilot for a flight simulator. However, such approaches can suffer from the correspondence problem, fail if the dynamics change and can not be generalized. Learning target trajectories is more promising and was successfully applied on a humanoid robot to learn a forehand tennis swing (Ijspeert et al., 2003).

Nevertheless, learning a trajectory is still less flexible than learning a reward function, which becomes evident when obstacles occur along the desired path. Reinforcement Learning on a provided reward function has been used with great success, e.g. for learning fast quadrupedal walking (Kohl and Stone, 2004). However, defining an appropriate reward function can be cumbersome and difficult even for experts. For example, when assessing how well a dishwasher has been loaded, several different features have to be taken into account (e.g. for describing the types of the dishes as well as their positions and orientations) with respect to several different subgoals, e.g. efficient usage of space and high expected cleanness. As it is usually not obvious which subgoals are relevant and how they have to be weighed against each other, defining an appropriate reward function requires experience and often involves hand-tuning by trial and error. In contrast to that, demonstrations can often be provided substantially more easily and even by non-experts.

1.3 Motivation for Learning Time-Dependent Reward Functions

Figure 1.1.: Motivating Example. The demonstrations can be explained with a time-dependent policy based on the only observed feature φ, plotted over time.

Learning a single, stationary (time-independent) reward function can be inappropriate for learning a task that decomposes into several subtasks. For example, assume that the demonstrations of an expert led to the trajectories depicted in Figure 1.1, where φ shall be some arbitrary feature plotted over time. The demonstrated behavior can be explained using a non-stationary reward function that rewards the agent for achieving φ = 1 at time step 25 and φ = 0 at time step 50. Even if both subtasks served an overall task and could be explained using a stationary reward function, that reward function might possess higher complexity than the non-stationary one and would depend on additional, unobserved features.

1.4 Preliminaries

This section provides background information on Information Theory, Markov Decision Processes and Optimal Control Theory.

1.4.1 Information Theory

Entropy
The entropy (Shannon, 2001) H(P) of a discrete probability distribution P(X) measures the uncertainty of that distribution and, thus, also the amount of encoded information. It is defined as

H(X) = -∑_x P(x) log P(x).

This thesis will make use of several related entropy measures for continuous probability distributions that are defined as follows. The continuous entropy of a probability distribution p(s), denoted H(S), is defined as

H(S) = -∫_s p(s) log p(s) ds.

If an agent chooses its actions according to a time-dependent policy, the entropy of that distribution should only take into account the information that has already been revealed to the agent. The entropy is then denoted as causal entropy.

Causal Entropy
For a continuous, conditional distribution p(a|s), the conditional entropy H(A|S) is defined as the expected continuous entropy of the conditional distribution, i.e.,

H(A|S) = -∫_s p(s) ∫_a p(a|s) log p(a|s) da ds.

Let the conditional distribution p(a_1, ..., a_T | s_1, ..., s_T) denote the probability of observing a_t given the states at each discrete time step t ∈ [1, T]. If this distribution only depends on past observations and, thus, marginalizes to p(a_t | s_1, ..., s_T) = p(a_t | s_1, ..., s_t), it will be referred to as causally conditioned on s. The entropy of such a causally conditioned distribution is then denoted as causal entropy and defined as

H(A||S) = -∑_{t=1}^{T} ∫_{s_1,...,s_t} p(s_1, ..., s_t) ∫_{a_t} p(a_t | s_1, ..., s_t) log p(a_t | s_1, ..., s_t) da_t ds_1 ... ds_t.

For the special case where p(a_1, ..., a_T | s_1, ..., s_T) = ∏_{t=1}^{T} p_t(a_t | s_t), the causal entropy simplifies to

H(A||S) = -∑_{t=1}^{T} ∫_{s_t} ∫_{a_t} p_t(s_t) p_t(a_t | s_t) log p_t(a_t | s_t) da_t ds_t.

This equation can be used to compute the entropy of first order Markovian policies, i.e. policies that do not depend on past states if the current state is given.

Relative Entropy
The relative entropy KL(P || Q) between two distributions p(s) and q(s), also known as Kullback-Leibler divergence, measures the information loss that results when p(s) is employed as an approximation of q(s). It is defined as

KL(P || Q) = ∫_s p(s) log( p(s) / q(s) ) ds.

The conditional relative entropy and the causal relative entropy are defined, analogously to the conditional and causal entropy, as

KL(P || Q) = ∫_{s,a} p(s) p(a|s) log( p(a|s) / q(a|s) ) da ds,

KL(P || Q) = ∑_{t=1}^{T} ∫_{s_t,a_t} p_t(s_t) p_t(a_t|s_t) log( p_t(a_t|s_t) / q_t(a_t|s_t) ) da_t ds_t.   (1.1)

Equation 1.1 can be used as a measure of similarity between two time-dependent policies while properly taking causality into account.

1.4.2 Markov Decision Process

A Markov Decision Process (MDP) is a mathematical framework that is often employed for modeling the environment of an agent in the context of Reinforcement Learning. MDPs model time by using discrete time steps. Within this thesis, only finite horizon MDPs are discussed, i.e. it is assumed that the process always ends after a fixed number T of time steps. The state of the environment and the state of the agent are modeled together using the set S of states. At each time step, the agent chooses an action from the set A of actions. After an action has been chosen, the process transitions into the next time step by randomly choosing a new state according to the system dynamics p_t(s_{t+1} | s_t, a_t). An MDP assumes the Markovian property, i.e. it assumes that the probability distribution over the next state does not depend on past states or actions if the current state and the current action are known. Additionally, at any given time step, the states and actions are rated using a reward function r_t(s_t, a_t). Therefore, a finite horizon MDP can be defined using the five-tuple M = ⟨S, A, p_t(s_{t+1}|s_t, a_t), T, r_t(s_t, a_t)⟩.

For the remainder of this thesis, the states and actions are represented using vectors s and a of size N_s and N_a respectively. The state and action at time step t are denoted by s_t and a_t and their continuous elements at index i by s_{i,t} and a_{i,t} respectively. The system dynamics are assumed to be linear with Gaussian noise, i.e.

p_t(s_{t+1} | s_t, a_t) = N(s_{t+1} | A_t s_t + B_t a_t + b_t, Σ_dyn,t).

Furthermore, the reward function is assumed to be a convex quadratic in s_t and a_t. Thereby, the MDP complies with the Linear Quadratic Gaussian (LQG) assumption.
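Because the LQG assumption just introduced means that all relevant distributions in later chapters are Gaussian, the relative entropy of Section 1.4.1 will repeatedly be evaluated between multivariate Gaussians, for which it has a closed form. A small illustrative helper (not part of the thesis) is sketched below:

    import numpy as np

    def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
        """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) for multivariate Gaussians."""
        k = mu_p.shape[0]
        cov_q_inv = np.linalg.inv(cov_q)
        diff = mu_q - mu_p
        return 0.5 * (np.trace(cov_q_inv @ cov_p)
                      + diff @ cov_q_inv @ diff
                      - k
                      + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))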

1.4.3 Optimal Control

An agent acting in an MDP chooses its actions according to a policy π_t(a_t | s_t) that is assumed to be consistent over time and thereby (and due to the Markovian property) depends only on the current state and time step. Reinforcement Learning addresses the problem of learning a policy that maximizes the expected reward of the agent. For finite horizon MDPs that comply with the LQG assumption, the optimal control policy can be computed recursively, even for continuous (and thereby infinite) state and action spaces, by using backward induction. Backward induction makes use of the concepts of Value functions and state-action Value functions. The Value function for a given policy π at time step t is denoted by V^π_t(s_t) and corresponds to the expected reward of an agent that starts at time step t in state s_t, assuming that the agent will act according to policy π for the current and all remaining time steps. The state-action Value function Q^π_t(s_t, a_t) is defined accordingly; however, it assumes that the agent chooses action a_t at time step t and only afterwards acts according to the policy π. The Value function for the last time step does not have to take future rewards into account and is thereby given by the reward function of the last time step,

V^π_T(s_T) = r_T(s_T).

The reward at the last time step does not depend on an action, because it is assumed that the MDP ends immediately after reaching the final state. The state-action Value function is consequently only defined for t < T. It can be expressed in terms of the expected Value function of the next time step using

Q^π_t(s_t, a_t) = r_t(s_t, a_t) + E_{p_t(s_{t+1}|s_t,a_t)}[ V^π_{t+1}(s_{t+1}) ],   (1.2)

where the distribution over the next state is given by the system dynamics p_t(s_{t+1}|s_t, a_t). Similarly, the Value function for time steps t < T can be expressed in terms of the expected state-action Value function of that same time step using

V^π_t(s_t) = E_{π_t(a_t|s_t)}[ Q^π_t(s_t, a_t) ],   (1.3)

where the distribution of the current action is given by the policy π_t(a_t|s_t). These recursive equations (1.2 and 1.3) are known as the Bellman equations for finite horizon MDPs. Starting with V^π_T(s_T), they can be employed to compute the Value functions and the state-action Value functions of any given policy π for all time steps by deducing backwards in time. The optimal policy π*_t(a_t|s_t) chooses at each time step an action that maximizes the state-action Value function Q^{π*}_t(s_t, a_t). Hence, the corresponding state-action Value function Q^{π*}_t(s_t, a_t) and the Value function V^{π*}_t(s_t) can be computed using Equation 1.2 and

V^{π*}_t(s_t) = max_{a_t} Q^{π*}_t(s_t, a_t).

The recursive equations are then known as the Bellman optimality equations for finite horizon MDPs. For LQG systems, V^{π*}_t(s_t) and Q^{π*}_t(s_t, a_t) are convex quadratics for all time steps. The involved expected values and maximum values can thus be derived analytically. The optimal policy for such systems takes the form of a deterministic, linear controller π*_t(s_t) = K_t s_t + k_t, where the controller gains K_t and k_t can be computed using backward induction.

1.5 Outline

The remainder of this thesis is structured as follows: Chapter 2 discusses related work by providing an overview of Imitation Learning and Inverse Reinforcement Learning. Studying MaxCausalEnt-IRL led to insights that hopefully assist in better understanding the mechanics of the approach. These insights are presented in Chapter 3 along with the discussion of implementation specific details. Chapter 4 will cover the novel approach of IRL-MSD. Chapter 5 evaluates this approach and compares it to MaxCausalEnt-IRL. Finally, Chapter 6 presents an outlook on future research and Chapter 7 recapitulates the findings of this thesis.
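To make the backward induction of Section 1.4.3 concrete before moving on to related work, the following minimal finite-horizon LQR sketch computes the time-varying gains of the optimal linear controller. It assumes deterministic dynamics and a quadratic cost formulation (the negative of a quadratic reward), so it illustrates the recursion rather than the exact LQG setting used later:

    import numpy as np

    def lqr_backward_induction(A, B, Q, R, T):
        """Finite-horizon LQR: minimize sum_t x_t'Q x_t + u_t'R u_t.

        Returns the time-varying gains K_t of the controller u_t = -K_t x_t,
        computed by the Riccati backward recursion (the value function stays
        quadratic, V_t(x) = x'P_t x).
        """
        P = Q  # value of the last time step is the state cost itself
        gains = []
        for _ in range(T - 1, 0, -1):
            # K_t = (R + B'P_{t+1}B)^{-1} B'P_{t+1}A
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
            # Riccati update for the quadratic value function
            P = Q + A.T @ P @ (A - B @ K)
            gains.append(K)
        return gains[::-1]  # gains ordered from t = 1 to T-1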

2 Related Work

2.1 Learning From Demonstration

This section provides a brief overview of Learning from Demonstration (LfD) in the field of robotics. Robotic LfD has been an active field of research since it was first covered in depth in 1984 (Halbert, 1984). Since then it has evolved into several different directions and produced overlapping terminology (Argall et al., 2009). This thesis distinguishes between the following terms:

Learning from Demonstration is used for all robot learning methods that make use of demonstrations. This definition is strictly more general than the one given by Argall et al. (2009), which only applies to approaches that learn a policy based on demonstrations.
Programming by Demonstration (PbD) is used in its original meaning, when demonstrations are used to produce code in a programming language (Cypher and Halbert, 1993).
Imitation Learning is used when the robot should learn to imitate the demonstrator.
True Imitation Learning is used when, following the definition of Tomasello et al. (1993), an understanding of the intentional states underlying the behavior is additionally required.
Behavioral Cloning is used when the agent learns a state-action mapping that closely matches the observed one without regarding the implications of the actions. It is thereby a subfield of Imitation Learning but can not provide true imitation.

A natural question regarding LfD is how to demonstrate. The different ways of demonstrating a task can also be described from the perspective of the learning agent, leading to the question: How can a robot perceive demonstrations?

2.1.1 Perceiving Demonstrations

Within this thesis, three different possibilities of perceiving demonstrations are discussed. For the first one, the agent perceives the demonstrations merely via exteroceptive sensors, i.e. sensors that measure quantities concerning the agent's environment, e.g. via vision. The second one also incorporates proprioception, i.e. directly sensing quantities that are related to the robot itself, e.g. joint positions. The third possibility of perceiving demonstrations includes proprioception and exteroception and furthermore a recording of the issued control commands.

Perception Based on Exteroceptive Sensing
When perceiving demonstrations based on exteroception, the agent passively observes the expert performing the task. Even though the robot might be able to measure its state using its proprioceptive sensors, this information is not considered to be part of the demonstration. Kuniyoshi et al. (1994) used learning by watching to learn high-level assembly plans by watching human demonstrations. Marker-based motion capturing is often used by humanoid robots to mimic the whole-body movements of human demonstrations (Ude et al., 2004; Kulić et al., 2011; Kim et al., 2009). Such demonstrations have the advantage that they often can be performed naturally and without the need of interacting with the robot.

However, they have the drawback that they give the least information to the agent. Especially, they do not contain any information about how the robot is supposed to perform the task but only about how the expert performs it. If the expert and the robot have similar embodiments, the robot might be able to perform the desired motion by imitating the expert. However, if they have different embodiments, it is not clear how to imitate the expert, which is known as the correspondence problem (Nehaniv and Dautenhahn, 2002; Alissandrakis et al., 2002).

Additional Perception Based on Proprioceptive Sensing
The robot can also perceive demonstrations using its proprioceptive sensors if the expert demonstrates the task by physically moving its limbs. Providing such demonstrations to the robot is known as kinesthetic teaching. If the robot is lightweight and compliant, the desired robot motion can be induced solely by forces created by the expert. Such demonstrations have been used successfully for learning a large variety of tasks, including Ball-Paddling, Ball-in-a-Cup, Pendulum Swing-Up and Pancake-Flipping (Kober and Peters, 2009; Kormushev et al., 2010).

Additional Perception of the Controls
Demonstrations can also be performed by using the actuators of the robot directly, for example via teleoperation. Thereby, the robot is controlled from a distance via remote control, e.g. a joystick or an exoskeleton. Other examples for this kind of demonstration include programming the desired motor commands and some forms of active kinesthetic teaching, i.e. kinesthetic teaching where the robot senses touches of the expert and reacts by issuing corresponding control commands. Demonstrations on a radio-controlled helicopter were used by Abbeel et al. (2010) to learn a controller that was capable of producing maneuvers at the edge of the physically feasible, including in-place flips and auto-rotational landings (emergency landings with an unpowered main rotor).

2.1.2 Imitation Learning

Imitation Learning is an important application of LfD, as it makes programming the robot easier for the expert and accessible for non-expert users. Imitation Learning can be divided into three different approaches (Schaal, 1999): (1) approaches that are only concerned with matching the demonstrated policy, (2) approaches that try to reproduce the demonstrated trajectory and (3) approaches that use the demonstrated trajectories in order to reduce the search space for Reinforcement Learning.

Policy-Based Imitation Learning
Already in the early years of robotics, demonstrations were used to ease the task of robot programming. The manipulator was guided by a human expert in order to perform the desired movements and was able to repeat that motion by recording the via points. This simple approach of robot programming by demonstration does not incorporate sensory input and is thereby only applicable to a few industrial tasks like painting component parts (Lozano-Perez, 1983). Behavioral Cloning (Bain and Sammut, 1995; Pomerleau, 1989) additionally records the sensory information to map the observed states to actions. This approach was used to create an autopilot for a flight simulator that exceeded the performance of the expert by smoothing out the human control noise (Sammut et al., 2002). This clean-up effect is often encountered with behavioral cloning (Michie et al., 1990). However, as behavioral cloning merely tries to produce the same actions as the expert, regardless of the resulting trajectory, it is often fragile with respect to changes in the environment.

Trajectory-Based Imitation Learning
The policy-based imitation learning approaches can fail if the robot and the expert differ in their kinematics or dynamics, because the same actions then lead to different trajectories. Trajectory-based imitation learning methods circumvent this shortcoming by trying to match the expert's trajectory instead of its policy. In order to make learning of the trajectory feasible, a representation is required that can generate the desired trajectories while at the same time depending on a tractable number of parameters. The choice of representation is crucial for the performance of the imitation learning task. An example of such a representation are via points. Miyamoto et al. (1996) extract via points from demonstrations by first adding a via point at the goal position and then iteratively adding via points at the positions of maximal squared error between the learned trajectory and the demonstrated one until the demonstration is matched sufficiently well. However, this representation can not be generalized straightforwardly and might not recover from perturbations in the environment. Schaal et al. (2000) proposed to represent movements by using a nonlinear dynamical system as an attractor. This led to the development of Dynamic Movement Primitives (DMPs), a flexible representation of motions that can be learned from demonstrations and is therefore suitable for Imitation Learning (Ijspeert et al., 2002, 2003; Schaal et al., 2003, 2004). A DMP is actually not a trajectory representation, but a representation of a control policy. This policy corresponds to a PD-controller that is perturbed with a nonlinear forcing function. The PD-controller has the purpose of reaching a given goal position, whereas the forcing function controls which path should be chosen to reach that goal. The forcing function is defined as a radial basis function network and converges to zero so that it does not prevent the PD-controller from reaching the goal state (when assuming appropriate controller gains). Furthermore, a phase variable is introduced that can be used to change the speed of execution. Additional advantages include robustness against perturbations, the capability of producing rhythmic movements, and ease of learning, which is achieved by learning the weights of the radial basis functions. Ijspeert et al. (2002) demonstrated the effectiveness of their approach by learning a forehand tennis swing from demonstrations. The trajectories can also be represented as Gaussian distributions by using Probabilistic Movement Primitives (ProMPs) (Paraschos et al., 2013). This approach uses weighted radial basis functions in order to encode the means for each time step and encodes the covariance matrices by using a prior on these weights. Hence, by marginalizing out the weight vector, the trajectories are given by the parameterization of the prior. Similar to DMPs, temporal modulation is achieved by introducing a phase variable, and rhythmic motions can be produced by using von-Mises basis functions instead of Gaussian ones. ProMPs can be used for Imitation Learning by learning the parameters of the prior that maximize the likelihood of the demonstrated trajectories. This probabilistic approach has several interesting capabilities: The movement primitives can be adapted to reach different via points by conditioning the distribution accordingly. Also, several primitives can be co-activated and blended by computing their product while weighing them with time-dependent activation functions. ProMPs were recently applied for learning interactions between multiple agents in collaborative assembly tasks (Maeda et al., 2014).

Reinforcement Learning From Demonstrations
For many practical applications, the state space is large and the dynamics model has to be approximated. If the dynamics are not learned, Imitation Learning can be insufficient to fulfill the task. Reinforcement Learning can be applied to learn the task by trial-and-error while at the same time learning the dynamics model. However, exploring the large state space can be too costly, especially if the reward function is sparse. Learning can be boosted by first learning a policy from demonstrations and afterwards improving it using Reinforcement Learning (Schaal, 1997).
Following up Imitation Learning with Reinforcement Learning may not only enable the robot to accomplish the task but also to improve upon the expert demonstrations (Atkeson and Schaal, 1997). Furthermore, by applying Reinforcement Learning the robot is able to handle new situations for which the demonstrated behavior would fail, for example, when new obstacles occur (Guenter et al., 2007).

Kober and Peters (2009) used Imitation Learning based on DMPs in order to initialize their policy search method Policy Learning by Weighting Exploration with the Returns (PoWER). This approach enabled the robot to reliably succeed in the game of Ball-in-a-Cup. A similar approach was used by Kormushev et al. (2010) for learning the task of pancake-flipping.

2.1.3 Inverse Reinforcement Learning

Inverse Reinforcement Learning constitutes a different application of Learning from Demonstration by aiming at learning the reward function of the expert based on its demonstrations. By using the learned reward function for Reinforcement Learning, IRL can be used as a method of Imitation Learning. However, the IRL-problem should be clearly distinguished from the problem of Imitation Learning, because learning a reward function can also serve other purposes than imitation. For example, in a human-robot collaboration task, the robot might try to infer the intention of the human, not in order to take over his task but in order to assist him in achieving it. In a less cooperative way, as mentioned by Ramachandran and Amir (2007), the reward function can be used to model the opponent in adversarial games like poker, in order to exploit its strategy. More generally, IRL aims at inferring the goal underlying the observed behavior. Nevertheless, surveying current research in IRL, one may conclude that it is indeed most often used for Imitation Learning tasks, both for experimental evaluation as well as for practical applications. This is not surprising, because describing (nearly) optimal behavior in terms of its intention is succinct and allows for generalization. Furthermore, unlike a policy, the reward function remains valid if the dynamics change. In the following, an overview will be given of research in the field of Inverse Reinforcement Learning, starting with the first MDP formulation of the problem, given by Ng et al. (2000).

The MDP Formulation of Inverse Reinforcement Learning
Similar to Reinforcement Learning, Inverse Reinforcement Learning assumes that the agent is acting in a Markov Decision Process. However, the reward function of that MDP is not known to the agent. In exchange, it is given demonstrations of an expert that tries to maximize this unknown reward. The goal of IRL is now to learn that reward function from the demonstrations. Hence, instead of addressing the problem of finding a near-optimal policy for a given reward function, it addresses the problem of recovering a reward function from demonstrations of a near-optimal policy. Unfortunately, this problem formulation is ill-posed, because the demonstrations do not suffice to deduce the underlying intention of the expert with certainty. For example, there is always the, usually extremely unlikely but always non-zero, possibility that the expert chooses its actions completely randomly. This would correspond to a constant reward function, for which any behavior would be optimal. Clearly, such degenerate solutions should be discarded and more reasonable assumptions about the underlying reward function should be found. However, what would make an assumption reasonable? Demanding that the expert performs (near) optimally on the learned reward function is a necessary but not sufficient condition. Ng et al. (2000) proposed to further constrain the solution space by demanding that any single-step deviation from the observed policy should be maximally punished. Additionally, an L1-regularization was added to prefer simple solutions. They demonstrated that this approach can be used to approximately recover the true expert policy on simple grid world tasks and the mountain-car problem.

Bayesian Inverse Reinforcement Learning
Ramachandran and Amir (2007) propose a Bayesian approach in order to avoid having to decide for a particular reward function. For that purpose the posterior is computed based on a prior on the reward parameters and the likelihood of the demonstrations. Possible priors on the reward parameters include the Gaussian-, Laplace- and Beta-distribution as well as the uniform distribution over a finite interval. The likelihood of a demonstration for given reward parameters is computed under the assumption that the expert chooses its actions proportionally to the exponential of the corresponding expected reward. The reward function can be inferred from the posterior distribution by computing its mean or Maximum-A-Posteriori (MAP) estimate. More interestingly for Imitation Learning, the policy can also be inferred from the posterior directly. However, due to the complexity of the posterior distribution, Markov-Chain-Monte-Carlo (MCMC) (Gilks, 2005) has to be applied in both cases, whereby the approach can suffer from the curse of dimensionality (Bellman, 1957).

Matching Feature Counts
Abbeel and Ng (2004) proposed a new approach to discard degenerate solutions, by demanding that the optimal policy with respect to the learned reward function should match the observed policy in behavior. For that purpose, the reward function is assumed to be a linear combination of given time-dependent features φ_t(s_t, a_t), i.e.,

r_t(s_t, a_t) = φ_t(s_t, a_t)ᵀ θ,

and a reward function is learned such that the expected feature counts φ̄ match the empirical feature counts of the expert φ̂,

φ̂ = (1/N_D) ∑_{i=1}^{N_D} ∑_{t=1}^{T} φ_t(s_t^{(i)}, a_t^{(i)})  =!  φ̄ = ∑_{t=1}^{T} E_{p_t(s_t,a_t)}[ φ_t(s_t, a_t) ],

where N_D denotes the number of demonstrations and s_t^{(i)} and a_t^{(i)} denote the state and action of the i-th demonstration at time step t. The expected feature counts φ̄ are computed based on the joint distributions p_t(s_t, a_t) that would result from the optimal policy with respect to the learned parameters. Since the total reward r(s, a) can be written as

r(s, a) = ∑_{t=1}^{T} r_t(s_t, a_t) = ∑_{t=1}^{T} φ_t(s_t, a_t)ᵀ θ = ( ∑_{t=1}^{T} φ_t(s_t, a_t) )ᵀ θ,

a policy that produces the same feature counts as the expert would achieve the same reward on the true reward function, independent of its parameters θ. However, if the joint distributions p_t(s_t, a_t) for computing φ̄ are based on an optimal, deterministic policy, it is often not possible to match the empirical feature counts, because they are usually based on samples and, furthermore, on suboptimal demonstrations. Therefore, IRL approaches that are based on matching feature counts use a stochastic, suboptimal policy for computing the feature expectations. However, when the expected feature counts are based on a policy that is suboptimal with respect to the learned parameters θ, different reward functions could be learned, depending on which policy has been chosen.

And even though all these reward functions could be used to match the empirical feature counts by employing their respective policies, they might not infer the correct goal and the optimal policies for these reward functions thus might perform badly.

Maximum Entropy Inverse Reinforcement Learning
Maximum Entropy Inverse Reinforcement Learning (Ziebart et al., 2008) applies the principle of maximum entropy (Jaynes, 1957) by choosing the policy that leads to the maximum entropy joint distribution for matching the feature counts. This is the most principled approach, as it does not presume any ungrounded constraints on the joint distribution and thereby minimizes the worst-case prediction log-loss (Grünwald and Dawid, 2004). Under the maximum entropy model, the likelihood of a given path ζ_i is proportional to the exponential of its reward, i.e.,

p(ζ_i | θ) ∝ exp( θᵀ φ_{ζ_i} ).

The reward function θ that leads to the maximum entropy state distribution matching the feature counts can be found by maximizing the log-likelihood of the demonstrations,

θ* = argmax_θ L(θ) = argmax_θ ∑_{i=1}^{N_d} log p(ζ_i | θ),

where N_d denotes the number of demonstrations. The maximum likelihood can be found using gradient-based optimization, where the gradient of the log-likelihood is given by the difference between the empirical feature counts and the expected feature counts for the current estimate of the reward function, i.e.,

∇L(θ) = φ̂ − φ̄.

Several recent advances in Inverse Reinforcement Learning are based on Maximum Entropy Inverse Reinforcement Learning (MaxEnt-IRL), e.g. Levine et al. (2011) apply Gaussian Process Regression in order to learn nonlinear reward functions. Levine and Koltun (2012) apply a Laplace approximation in order to locally approximate the policy by a Gaussian along the demonstrated trajectories in a deterministic MDP. Boularias et al. (2011) actually use the relative entropy between the empirical distribution and the expected distribution. By estimating the subgradient via Importance Sampling they derive a model-free method. In its original formulation, MaxEnt-IRL assumes that all side information was available to the expert at the beginning of its demonstration. If this is not the case, e.g. whenever the system dynamics are noisy, Maximum Causal Entropy Inverse Reinforcement Learning (Ziebart et al., 2010) can be used instead, which chooses the stochastic policy that has the maximum causal entropy while matching the feature counts. This modification is of main interest for this thesis and is discussed in detail in Chapter 3.

2.2 Relative Entropy Policy Search

Policy Search is an approach to Reinforcement Learning that does not aim at learning the Value function but instead learns a policy directly. Policy gradient methods achieve this by using gradient-based optimization to iteratively update the parameters of the policy. However, because the previous experience is encoded within the current policy, a policy update destroys part of that experience, leading to an information loss. Relative Entropy Policy Search (Peters et al., 2010) is based on the optimization problem of maximizing the expected reward (Equation 2.2) while bounding the information loss (relative entropy) between the current and previous iteration (Equation 2.2b), i.e.,

maximize_{π(a|s), µ(s)}   ∫_{s,a} µ(s) π(a|s) r(s, a) ds da   (2.2)

subject to   ∫_{s,a} µ(s) π(a|s) log( µ(s)π(a|s) / q(s, a) ) ds da ≤ ε,   (2.2b)

             ∀s':  ∫ µ(s') φ(s') ds' = ∫_{s,a,s'} µ(s) π(a|s) p(s'|s, a) φ(s') ds da ds',   (2.2c)

             ∫_{s,a} µ(s) π(a|s) ds da = 1,   (2.2d)

where µ(s) is the stationary (Equation 2.2c) state distribution that is eventually reached when executing the policy π(a|s) in an infinite horizon MDP. Equation 2.2d is necessary to ensure that π(a|s) and µ(s) are probability distributions. q(s, a) represents the (estimated) joint distribution of the previous iteration; thus, the policy that maximizes the problem,

π_max(a|s) ∝ q(s, a) exp( (1/η) ( r(s, a) + E_{p(s'|s,a)}[ θᵀφ(s') ] − θᵀφ(s) ) ),

corresponds to the best policy update with bounded information loss ε. The Lagrangian multipliers θ and η can be found using gradient-based optimization. Encoding the iterative nature of a learning approach by bounding the information loss between the current and last iteration is a major inspiration for IRL-MSD.

3 Maximum Causal Entropy Inverse Reinforcement Learning

The action choices of the expert for different time steps might not be consistent with respect to the complete trajectory, because it might have adapted its planning based on the outcomes of the stochastic system dynamics p_t(s_{t+1}|s_t, a_t). Maximizing the entropy of the whole trajectory distribution does not take this causality into account. Maximum Causal Entropy IRL differs from Maximum Entropy IRL by aiming at maximizing the causal entropy of the policy π_t(a_t|s_t) instead of the entropy of the joint distribution p(s, a). The resulting Gaussian policy chooses actions with probabilities that increase exponentially with the expected future reward,

π^MCE_t(a_t|s_t) ∝ exp( Q^π_t(s_t, a_t) ).   (3.1)

For LQG systems, this leads to a noisy linear controller that differs from the optimal controller only by its noise. The state-action Value function Q^π_t(s_t, a_t) and the Value function V^π_t(s_t) can be computed using Equations 1.2 and 1.3 described in Section 1.4.3. However, Ziebart et al. (2010) provide a different way to compute a Value function Ṽ^π_t(s_t) that is similar to the computation of the optimal Value function (Section 1.4.3) except that the maximum-operator has been replaced by the softmax-operator,

Ṽ^π_t(s_t) = log ∫_{a_t} exp( Q^π_t(s_t, a_t) ) da_t = softmax_{a_t} Q^π_t(s_t, a_t).   (3.2)

The difference between Ṽ^π_t(s_t) and V^π_t(s_t) will be investigated in Section 3.1. MaxCausalEnt-IRL learns the reward function for which π^MCE_t(a_t|s_t) matches the empirical feature counts φ̂ in expectation. This is achieved by minimizing the dual of the optimization problem of maximizing the causal entropy of the policy constrained on matching the feature counts. The partial derivative of the dual function with respect to the parameterization of the reward function is given in (Ziebart et al., 2010) as the difference between the empirical feature counts φ̂ of the expert and the expected feature counts φ̄ of π^MCE_t(a_t|s_t) for the current estimate of θ, i.e.,

∂g/∂θ = φ̂ − φ̄.   (3.3)

Hence, MaxCausalEnt-IRL proposes to compute the reward parameters iteratively using gradient descent. Each iteration thereby involves a backward pass and a forward pass. The backward pass computes Ṽ^π_t(s_t) for all time steps starting at the last time step T (based on the current estimate of θ) as described in Section 1.4.3. For that purpose, it applies Equation 1.2 to compute Q^π_t(s_t, a_t) and Equation 3.2 to compute Ṽ^π_t(s_t). For LQG systems, Q_t(s_t, a_t) is a convex quadratic function for all time steps. Hence, the policies π^MCE_t(a_t|s_t) can be computed using Equation 3.1. The forward pass starts at the (known) initial state distribution p_1(s_1) and uses the policies π^MCE_t(a_t|s_t) to compute the joint distributions p_t(s_t, a_t) for all time steps based on the system dynamics p_t(s_{t+1}|s_t, a_t). These joint distributions are then used to compute the expected feature counts φ̄ for the current estimate of θ. The feature expectations can then be used to compute the gradient using Equation 3.3.
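This backward-forward iteration can be sketched compactly for a small tabular MDP. The snippet below is only illustrative: it uses time-independent features, ignores the special handling of the final time step described above, and is not the LQG implementation used in the thesis.

    import numpy as np

    def max_causal_ent_irl(P, phi, p1, demos, T, lr=0.1, n_iters=200):
        """Tabular sketch of the MaxCausalEnt-IRL iteration.

        P:     dynamics, shape (S, A, S), P[s, a, s'] = p(s'|s, a)
        phi:   features, shape (S, A, K)
        p1:    initial state distribution, shape (S,)
        demos: list of trajectories [(s_0, a_0), ..., (s_{T-1}, a_{T-1})]
        """
        S, A, K = phi.shape
        # empirical feature counts of the expert
        phi_hat = np.mean([sum(phi[s, a] for s, a in traj) for traj in demos], axis=0)
        theta = np.zeros(K)
        for _ in range(n_iters):
            # backward pass: soft value functions and stochastic policies (Eqs. 3.1, 3.2)
            V = np.zeros(S)
            policy = np.zeros((T, S, A))
            for t in reversed(range(T)):
                Q = phi @ theta + P @ V                 # state-action values (S, A)
                V = np.log(np.exp(Q).sum(axis=1))       # softmax over actions
                policy[t] = np.exp(Q - V[:, None])
            # forward pass: propagate the initial state distribution
            phi_bar = np.zeros(K)
            p_s = p1.copy()
            for t in range(T):
                p_sa = p_s[:, None] * policy[t]         # joint p_t(s, a)
                phi_bar += np.einsum('sa,sak->k', p_sa, phi)
                p_s = np.einsum('sa,sap->p', p_sa, P)
            # step along the feature-count mismatch (Eq. 3.3), i.e. the direction
            # that increases the likelihood of the demonstrations
            theta += lr * (phi_hat - phi_bar)
        return theta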

Since the performance of gradient descent depends heavily on the chosen stepsize, it is sensible to look for ways to adapt it during optimization. In the context of MaxCausalEnt-IRL, the evaluation of the dual function can serve as a basis for stepsize adaptation. Unfortunately, the dual function has not been published along with the algorithm and, therefore, had to be derived for this thesis (see Appendix A). Fortunately, however, this derivation provides insights into the intrinsics of the algorithm that shall now be discussed.

3.1 Insights

Based on the description in (Ziebart et al., 2010), the optimization problem can be formulated as

maximize_{π_t(a_t|s_t)}   −∑_{t=1}^{T−1} ∫_{s_t,a_t} p_t(s_t) π_t(a_t|s_t) log π_t(a_t|s_t) ds_t da_t   (3.4)

subject to   ∑_{t=1}^{T−1} ∫_{s_t,a_t} p_t(s_t) π_t(a_t|s_t) φ_t(s_t, a_t) ds_t da_t + ∫_{s_T} p_T(s_T) φ_T(s_T, 0) ds_T = φ̂,   (3.4b)

             ∀t>1:  p_t(s_t) = ∫_{s_{t−1},a_{t−1}} p_{t−1}(s_{t−1}) π_{t−1}(a_{t−1}|s_{t−1}) p(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   (3.4c)

             p_1(s_1) = µ_1(s_1),   (3.4d)

             ∀t<T, ∀s_t:  ∫_{a_t} π_t(a_t|s_t) da_t = 1,   (3.4e)

where the causal entropy of the policy π_t(a_t|s_t) should be maximized (3.4) subject to the constraints of matching the feature counts (3.4b) and keeping the state distributions consistent (3.4c), where the initial state distribution shall be provided by µ_1(s_1) (3.4d). Furthermore, Equation (3.4e) ensures that the policy is a probability distribution. The Lagrangian multiplier of the constraint (3.4b) for matching the feature counts will be denoted as θ, as it corresponds to the weight vector of the reward function. The Lagrangian multipliers of the constraints (3.4c) and (3.4d) will be denoted by Ṽ^π_2(s_2), ..., Ṽ^π_T(s_T) and Ṽ^π_1(s_1), respectively, as they relate to the Value functions of the policy π_t(a_t|s_t). The optimization problem is solved using Lagrange optimization (Boyd and Vandenberghe, 2009) as demonstrated in Appendix A. The dual function g(p_t(s_t), θ, Ṽ^π_t(s_t)) is thereby minimized using the partial derivatives

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂p_t(s_t) =
    −Ṽ^π_t(s_t) + log ∫_{a_t} exp( θᵀφ_t(s_t, a_t) + ∫_{s_{t+1}} Ṽ^π_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ) da_t,   if t < T
    −Ṽ^π_T(s_T) + θᵀφ_T(s_T, 0),   if t = T,   (3.5)

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂Ṽ^π_t(s_t) =
    −p_1(s_1) + µ_1(s_1),   if t = 1
    −p_t(s_t) + ∫_{s_{t−1},a_{t−1}} π_{t−1}(a_{t−1}|s_{t−1}) p_{t−1}(s_{t−1}) p_{t−1}(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   if t > 1,   (3.5b)

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂θ = φ̂ − φ̄.   (3.5c)

Setting Equation 3.5 equal to zero leads to an update equation for Ṽ^π_t(s_t) (backward pass),

Ṽ^π_t(s_t) = log ∫_{a_t} exp( θᵀφ_t(s_t, a_t) + ∫_{s_{t+1}} Ṽ^π_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ) da_t,   if t < T
Ṽ^π_T(s_T) = θᵀφ_T(s_T, 0),   if t = T.

Setting Equation 3.5b equal to zero leads to an update equation for p_t(s_t) (forward pass),

p_t(s_t) = µ_1(s_1),   if t = 1
p_t(s_t) = ∫_{s_{t−1},a_{t−1}} π_{t−1}(a_{t−1}|s_{t−1}) p_{t−1}(s_{t−1}) p_{t−1}(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   if t > 1.   (3.6)

As mentioned at the beginning of this chapter, the backward pass and the forward pass are performed consecutively during each iteration in order to compute the expected feature counts φ̄ for the current approximation of θ, which can then be used for a single step of gradient descent by using Equation (3.5c). The common way of computing the Value function V^π_t(s_t) of π^MCE would involve computing the expectation of the state-action Value function Q^π_t(s_t, a_t) using Equation 1.3. It is therefore interesting to compare the Value function V^π_t(s_t) that is computed using Equation 1.3 with the Value function Ṽ^π_t(s_t) that is computed using Equation 3.6. It turns out that both Value functions have the same state-dependent part, but differ in an offset that depends on θ,

Ṽ^π_t(s_t) = V^π_t(s_t) + ∑_{i=t}^{T−1} ½ ( N_a + log|2πΣ_{a,i}| ),   (3.7)

where Σ_{a,t} denotes the covariance matrix of π^MCE_t(a_t|s_t). The proof is given in Appendix C. As the offset does not depend on the state, both backward passes lead to the same policies. When evaluating the dual function for the current estimates of p_t(s_t), Ṽ^π_t(s_t) and θ, Equations (3.5) and (3.5b) equate to zero for all time steps. In that case, the dual function simplifies greatly and is given by

g(θ) = φ̂ᵀθ − Ṽ^π_1(s_1).   (3.8)

The expected total reward V^π_1(s_1) can also be computed based on the expected feature counts φ̄,

V^π_1(s_1) = θᵀφ̄.   (3.9)

Using Equations 3.7, 3.8 and 3.9, the dual function can be expressed in terms of φ̄,

g(θ) = φ̂ᵀθ − Ṽ^π_1(s_1)
     = φ̂ᵀθ − V^π_1(s_1) − ∑_{i=1}^{T−1} ½ ( N_a + log|2πΣ_{a,i}| )
     = φ̂ᵀθ − θᵀφ̄ − ∑_{t=1}^{T−1} ½ log|2πΣ_{a,t}| + const.   (3.10)

The empirical feature counts φ̂ and the expected feature counts φ̄ have to be computed anyway in order to evaluate Equation (3.5c), and the covariances Σ_{a,t} of the stochastic policies π_t(a_t|s_t) are a side product of the backward pass. Therefore, the dual function can be evaluated at each iteration for the current approximation of θ without any noticeable computational overhead.
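Evaluating the simplified dual then only requires quantities that are already available; a small illustrative helper following the reconstruction of Equation 3.10 above (not part of the thesis):

    import numpy as np

    def dual_value(theta, phi_hat, phi_bar, action_covs):
        """Evaluate the simplified dual (Eq. 3.10) up to an additive constant.

        phi_hat:     empirical feature counts of the expert
        phi_bar:     expected feature counts under the current policy (forward pass)
        action_covs: policy covariances Sigma_{a,t}, t = 1..T-1 (backward pass by-product)
        """
        entropy_like = sum(0.5 * np.linalg.slogdet(2.0 * np.pi * cov)[1]
                           for cov in action_covs)
        return (phi_hat - phi_bar) @ theta - entropy_like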

3.2 Implementation

This section discusses the practical challenges that arose when employing MaxCausalEnt-IRL for the purpose of learning non-stationary reward functions and how they have been coped with. Applying the algorithm for learning time-dependent reward functions is straightforward and is achieved by introducing an independent set of features for each time step. Thereby, however, the number of features is increased significantly, and learning becomes infeasible if the duration of the demonstrations is discretized into too many partitions. Radial Basis Functions (RBFs) are employed in order to learn smooth, time-continuous reward functions based on a coarse discretization. Additionally, the insights discussed in Section 3.1 are utilized by adapting the stepsize based on the dual function (3.10).

3.2.1 Learning Non-Stationary Reward Functions

MaxCausalEnt-IRL is applied to learn T independent, quadratic reward functions

r_t(s_t, a_t) = −(s_t − g_t)ᵀ R_t (s_t − g_t) − a_tᵀ H a_t = −s_tᵀ R_t s_t + r_tᵀ s_t − a_tᵀ H a_t + const,

where R_t must be positive semi-definite and H must be positive definite. The goal position g_t relates to the linear coefficients r_t via r_t = 2 R_t g_t. The action penalty matrix H is assumed to be a time-independent diagonal matrix in order to reduce the amount of features. The features φ are divided into T + 1 subsets φ = [φ_0, φ_1, ..., φ_T], where φ_0 aggregates the quadratic actions, i.e.

φ_0 = [ ∑_{t=1}^{T−1} a²_{1,t}, ..., ∑_{t=1}^{T−1} a²_{N_a,t} ].

The remaining subsets encode the linear, quadratic and mixed state features for their corresponding time step, i.e.,

∀t ∈ [1, T]:  φ_t = [ s_{1,t}, ..., s_{N_s,t}, s_{1,t}s_{1,t}, ..., s_{1,t}s_{N_s,t}, s_{2,t}s_{2,t}, ..., s_{2,t}s_{N_s,t}, ..., s_{N_s,t}s_{N_s,t} ].

The total reward can then be expressed as a linear combination of these features,

r(s, a) = ∑_{t=1}^{T} r_t(s_t, a_t) = θᵀ φ + const.

The entries of θ thereby correspond to the entries of the parameters r_t, R_t and H depending on the features they weight:
if the feature is a quadratic action term, it corresponds to the respective diagonal element of H.
if the feature is a quadratic state term, it corresponds to the respective diagonal element of R_t.
if the feature is a mixed state term, it corresponds to the respective off-diagonal entry times two.
if the feature is a linear state term, it corresponds to the respective element of r_t.
Here, t corresponds to the index of the subset that contains the feature. Thus, the parameters of the time-dependent reward functions r_t(s_t, a_t) can be learned by learning θ via MaxCausalEnt-IRL as described earlier in this chapter.
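For illustration, the linear, quadratic and mixed state features of a single time step could be assembled as follows (the layout and ordering only mirror the description above; the helper itself is not taken from the thesis):

    import numpy as np

    def state_features(s):
        """Linear, quadratic and mixed state features for one time step.

        Returns [s_1, ..., s_N, s_1*s_1, s_1*s_2, ..., s_N*s_N] with each mixed
        product appearing once (upper triangle), matching the layout sketched above.
        """
        n = s.shape[0]
        linear = list(s)
        quad = [s[i] * s[j] for i in range(n) for j in range(i, n)]
        return np.array(linear + quad)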

3.2.2 Towards a Compact Representation Using Radial Basis Functions

Learning an independent set of features for each time step increases the number of parameters significantly. In order to soften this increase, the T time-dependent reward functions are expressed using N_ϕ radial basis functions ϕ_i(t), i.e.

r^ϕ_t(s_t, a_t) = ∑_{i=1}^{N_ϕ} ϕ_i(t) ( −s_tᵀ R_i s_t + r_iᵀ s_t ) − a_tᵀ H a_t,

where ϕ_i(t) are normalized Gaussian radial basis functions

ϕ_i(t) = w_i exp( −½ ( (t − c_i) / σ_i )² ),

with equally spaced centers c_i and variances σ_i. The weights w_i are chosen to normalize the radial basis functions on the interval [1, T], i.e.,

∀i ∈ [1, N_ϕ]:  ∑_{t=1}^{T} ϕ_i(t) = 1.

By employing radial basis functions, only N_ϕ + 1 subsets of features are required instead of T + 1. Again, the first subset, φ_0, shall encode the quadratic actions as shown in Section 3.2.1. The remaining subsets, however, now aggregate the linear, quadratic and mixed state features with respect to the responsibilities of the respective radial basis functions, i.e.,

∀i ∈ [1, N_ϕ]:  φ^ϕ_i = ∑_{t=1}^{T} ϕ_i(t) [ s_{1,t}, ..., s_{N_s,t}, s_{1,t}s_{1,t}, ..., s_{1,t}s_{N_s,t}, s_{2,t}s_{2,t}, ..., s_{N_s,t}s_{N_s,t} ].

By composing the feature vector φ^ϕ of these subsets, the total reward is again a linear combination of the features,

r(s, a) = ∑_{t=1}^{T} r^ϕ_t(s_t, a_t) = θᵀ φ^ϕ,

and the parameters of the time-dependent reward functions can thus be learned via θ. Besides decreasing the number of parameters, the RBF-based representation leads to a time-continuous reward function by assuming a reward function that changes smoothly over time.
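A small sketch of the normalized Gaussian radial basis functions over the time axis (names and the equal spacing of centers follow the description above; the helper itself is illustrative):

    import numpy as np

    def normalized_rbf_activations(T, n_basis, sigma):
        """Gaussian RBFs over time steps 1..T with equally spaced centers.

        Each basis function is scaled so that its activations sum to one over the
        horizon, as described above. Returns an array of shape (n_basis, T).
        """
        t = np.arange(1, T + 1, dtype=float)
        centers = np.linspace(1, T, n_basis)
        activations = np.exp(-0.5 * ((t[None, :] - centers[:, None]) / sigma) ** 2)
        return activations / activations.sum(axis=1, keepdims=True)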

3.2.3 Ensuring Positive Semi-Definite Reward Functions

As the entries of θ directly correspond to the elements of the parameters r_x, R_x and H, the gradient-based optimization may lead to a non-positive-semidefinite state cost R_x, which would violate the LQG assumption and thereby lead to failure. This can be avoided by applying gradient descent for learning the entries of a lower triangular matrix L_x instead of learning the entries of R_x directly. R_x can then be constructed by

R_x = L_x L_xᵀ.

Following the Cholesky decomposition, any positive semi-definite matrix can be decomposed into such a matrix product, and any such product yields a positive semi-definite matrix. In order to learn the entries of L_x, an additional weight vector θ̃ of the same size as θ is introduced, such that each entry of that weight vector corresponds to a different entry of the respective triangular matrix L_x, goal position g_x or action cost matrix H. Gradient descent is then applied to update θ̃ instead of θ using

∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂θ̃ = ( ∂g(p_t(s_t), θ, Ṽ^π_t(s_t)) / ∂θ ) (dθ / dθ̃).

The gradient dθ/dθ̃ can be solved in closed form and produces a block diagonal matrix

dθ/dθ̃ = blockdiag( dθ_0/dθ̃_0, dθ_1/dθ̃_1, ..., dθ_N/dθ̃_N ),

where θ_i and θ̃_i indicate the subsets of θ and θ̃ that correspond to the respective subsets φ_i of φ.

3.2.4 Adapting the Stepsize Based on the Dual Function

As shown in Section 3.1, the dual function can be evaluated very efficiently using Equation 3.10. Therefore, it makes sense to take these evaluations as a basis for stepsize adaptation during gradient descent. This is done by checking after each gradient step whether the dual function did indeed decrease. Only then is the gradient step accepted and the stepsize increased by a fixed factor α. Whenever the dual function did not decrease, the last gradient step is withdrawn and the stepsize is decreased by β. The parameters α and β are chosen such that the relative increase after a successful step is smaller than the relative decrease after an unsuccessful step, in order to avoid having to withdraw too often.
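The accept/withdraw stepsize rule of Section 3.2.4 can be sketched as follows (gradient_fn and dual_fn stand for any implementation of Equations 3.5c and 3.10; the default values are placeholders, not the ones used in the thesis):

    def adaptive_gradient_descent(theta_tilde, gradient_fn, dual_fn,
                                  stepsize=1e-3, alpha=1.01, beta=0.5, n_iters=1000):
        """Accept a step only if the dual decreases; otherwise withdraw it and shrink the stepsize."""
        best_dual = dual_fn(theta_tilde)
        for _ in range(n_iters):
            candidate = theta_tilde - stepsize * gradient_fn(theta_tilde)
            candidate_dual = dual_fn(candidate)
            if candidate_dual < best_dual:      # dual decreased: accept and grow the stepsize
                theta_tilde, best_dual = candidate, candidate_dual
                stepsize *= alpha
            else:                               # dual did not decrease: withdraw and shrink
                stepsize *= beta
        return theta_tilde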

4 Inverse Reinforcement Learning by Matching State Distributions

This chapter presents a novel approach for learning non-stationary reward functions from demonstrations. In contrast to Maximum Causal Entropy Inverse Reinforcement Learning, it is not based on matching the feature expectations of the learned policy with the observed feature counts. Instead, it aims at matching the state distributions that result from the learned policies, p^π_t(s_t), with the observed state distributions, q_t(s_t). By defining the objective in terms of distributions instead of feature counts, the features do not have to be explicitly defined. Instead, the structure of the reward is automatically determined based on the structure of the distributions. Matching the state distributions is achieved by minimizing the relative entropy between those probability distributions. Additionally, inspired by Relative Entropy Policy Search (REPS), the relative entropy of the current policy π_t(a_t|s_t) is minimized with respect to the last policy q_{0,t}(a_t|s_t) in order to reduce the loss of information between iterations.

4.1 The Optimization Problem

Inverse Reinforcement Learning by Matching State Distributions is based on the following optimization problem:

minimize_{π_t(a_t|s_t)}   ∑_{t=1}^{T} ∫_{s_t} p_t(s_t) log( p_t(s_t) / q_t(s_t) ) ds_t + ∑_{t=1}^{T−1} ∫_{s_t} p_t(s_t) ∫_{a_t} π_t(a_t|s_t) log( π_t(a_t|s_t) / q_{0,t}(a_t|s_t) ) da_t ds_t   (4.1)

subject to   ∀t>1:  p_t(s_t) = ∫_{s_{t−1},a_{t−1}} p_{t−1}(s_{t−1}) π_{t−1}(a_{t−1}|s_{t−1}) p(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   (4.1b)

             p_1(s_1) = µ_1(s_1),   (4.1c)

             ∀t<T, ∀s_t:  ∫_{a_t} π_t(a_t|s_t) da_t = 1.   (4.1d)

The constraints (4.1b, 4.1c, 4.1d) are exactly the same as their counterparts in the Maximum Causal Entropy optimization problem (3.4c, 3.4d, 3.4e). The goal of the optimization (4.1), however, is different, as it aims at minimizing the two Kullback-Leibler divergences (KLs). It is worth noting that the target policy q_{0,t}(a_t|s_t) does not have to be set to the policy of the last iteration, but could be set to a fixed distribution instead. In this case, the corresponding KL does not lead to learning action costs, but serves as a regularization instead. Again, the Lagrangian multipliers for the constraints (4.1b) and (4.1c) relate to the Value functions of the policy π_t(a_t|s_t) and will therefore be denoted by Ṽ^π_t(s_t). As the constraint of matching the feature counts has been dropped, the optimization problem does not possess Lagrangian multipliers that correspond to the parameters of the reward function. Indeed, it might look like the optimization problem does not refer to any form of reward at all, giving rise to the question of how it can be employed to tackle the problem of IRL.

This question can best be answered by looking at the partial derivatives of the dual function,

∂g(p_t(s_t), Ṽ_t(s_t)) / ∂p_t(s_t) =
    −Ṽ_T(s_T) + log( p_T(s_T) / q_T(s_T) ) + 1,   if t = T
    −Ṽ_t(s_t) + log ∫_{a_t} exp( log q_{0,t}(a_t|s_t) + log q_t(s_t) − log p_t(s_t) − 1 + ∫_{s_{t+1}} Ṽ_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ) da_t,   if t < T,   (4.2)

∂g(p_t(s_t), Ṽ_t(s_t)) / ∂Ṽ_t(s_t) =
    −p_1(s_1) + µ_1(s_1),   if t = 1
    −p_t(s_t) + ∫_{s_{t−1},a_{t−1}} π_{t−1}(a_{t−1}|s_{t−1}) p_{t−1}(s_{t−1}) p_{t−1}(s_t|s_{t−1}, a_{t−1}) ds_{t−1} da_{t−1},   if t > 1.   (4.2b)

The derivations are given in Appendix B. Setting the derivative with respect to the state distribution (4.2) to zero leads to the backward pass to compute Ṽ^π_t(s_t), while setting the derivative with respect to the Value function (4.2b) equal to zero leads to the forward pass to compute p_t(s_t). Actually, the forward pass is exactly the same for MaxCausalEnt-IRL (3.5b) and IRL-MSD. More interestingly, however, the backward pass of IRL-MSD is the same as the backward pass of MaxCausalEnt-IRL (3.5) except that the reward function θᵀφ_t(s_t, a_t) has been replaced by the term

r_t(s_t, a_t) = log q_{0,t}(a_t|s_t) + log q_t(s_t) − log p_t(s_t) − 1,   (4.3)

which serves as a reward signal for the optimal policy π_t(a_t|s_t), which is, as in MaxCausalEnt-IRL, proportional to the exponential of the state-action Value function Q_t(s_t, a_t),

π_t(a_t|s_t) ∝ exp( Q_t(s_t, a_t) ) = exp( r_t(s_t, a_t) + ∫_{s_{t+1}} Ṽ_{t+1}(s_{t+1}) p(s_{t+1}|s_t, a_t) ds_{t+1} ).   (4.4)

Hence, the reward function depends directly on the target policy q_{0,t}(a_t|s_t), the target state distribution q_t(s_t), and p_t(s_t), the state distribution that results from the policy that minimizes the objective (4.1). As the state distribution p_t(s_t) depends on the policy, and the policy depends on the Value function Ṽ_t(s_t), the reward function is actually defined recursively. The features of the reward function thus evolve naturally, depending on the type of the involved probability distributions. Most notably, if all distributions are Gaussian, and the system dynamics are linear with Gaussian noise, the reward function is quadratic in states and actions. For LQG systems, π_t(a_t|s_t) always takes the form of a linear PD controller with Gaussian noise. Therefore, in the following, the target policy is assumed to be of the same form, i.e. q_{0,t}(a_t|s_t) = N(a_t | K_{0,t} s_t + k_{0,t}, Σ_{q0,t}). Then, the parameters of the reward function

r_t(s_t, a_t) = − [s_t; a_t]ᵀ [ R_t  F_t ; F_tᵀ  H_t ] [s_t; a_t] + r_tᵀ s_t + h_tᵀ a_t + const   (4.5)

can be computed by

R_t = ½ ( Σ^{-1}_{q,t} − Σ^{-1}_{p,t} + K_{0,t}ᵀ Σ^{-1}_{q0,t} K_{0,t} ),   (4.6)
r_t = Σ^{-1}_{q,t} µ_{q,t} − Σ^{-1}_{p,t} µ_{p,t} − K_{0,t}ᵀ Σ^{-1}_{q0,t} k_{0,t},   (4.6b)
F_t = −½ K_{0,t}ᵀ Σ^{-1}_{q0,t},   (4.6c)
H_t = ½ Σ^{-1}_{q0,t},   (4.6d)
h_t = Σ^{-1}_{q0,t} k_{0,t}.   (4.6e)

The derivations can be found in Appendix B.
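A direct transcription of Equations 4.6 to 4.6e, as reconstructed above, into a small helper; the inputs are the Gaussian statistics of the target state distribution q_t(s_t), the current state distribution p_t(s_t) and the target policy q_{0,t}(a_t|s_t) (function and variable names are illustrative):

    import numpy as np

    def msd_reward_parameters(mu_q, cov_q, mu_p, cov_p, K0, k0, cov_q0):
        """Quadratic reward parameters of Eq. 4.5 from Gaussian statistics (Eqs. 4.6 to 4.6e)."""
        prec_q, prec_p, prec_q0 = (np.linalg.inv(c) for c in (cov_q, cov_p, cov_q0))
        R = 0.5 * (prec_q - prec_p + K0.T @ prec_q0 @ K0)
        r = prec_q @ mu_q - prec_p @ mu_p - K0.T @ prec_q0 @ k0
        F = -0.5 * K0.T @ prec_q0
        H = 0.5 * prec_q0
        h = prec_q0 @ k0
        return R, r, F, H, h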

4.2 The Proposed Algorithm

The partial derivatives of the dual function (Equation 4.2 and Equation 4.2b) lead to an iterative algorithm similar to MaxCausalEnt-IRL. Starting with an initial estimate of the reward function, the softmax Value functions Ṽ^π_t(s_t) as well as the corresponding policies π_t(a_t|s_t) are computed via the backward pass (4.2), and the resulting state distributions p_t(s_t) are computed via the forward pass (4.2b). However, IRL-MSD utilizes these state distributions in order to compute the parameters of the reward function that is optimal with respect to the current estimate in one shot, whereas MaxCausalEnt-IRL utilizes them merely to update the parameters by a single step of gradient descent. It should be noted that the computational cost of computing the optimal parameters (Equations 4.6 to 4.6e) does not exceed the cost of computing the gradient of the parameters via feature expectations. The IRL-MSD optimization problem allows different interpretations regarding q_{0,t}, which lead to two different variants of the algorithm. The first variant treats q_{0,t} as an estimate of the optimal policy that is updated after each iteration, whereas the second variant treats q_{0,t} as a regularization on the policy which is not changed during optimization. The effects of these different points of view shall be discussed in the following subsections.

4.2.1 Using the Policy KL for Learning Action Costs

The first variant of IRL-MSD, which will be referred to as MSD-ACL in the following, regards q_{0,t} as an estimate of the optimal policy. It is updated after each iteration by setting it equal to the policy that was computed during that iteration. Thereby, similar to REPS, the current policy should stay close to the last one. However, in contrast to REPS, the relative entropy is not bounded. This allows for faster convergence but does not give any guarantee that the policy is indeed close to the last one. Therefore, if such a guarantee is needed, e.g. when learning local linearizations of the state dynamics, a bound on that KL should be introduced by reformulating it as a constraint. In the following, however, it will be assumed that no such bound is necessary. The relative entropy then merely serves the purpose of learning the policy that matches the target state distributions as closely as possible. MSD with Action Cost Learning (MSD-ACL) would per default not regularize the actions at all; however, it can be easily augmented with action regularization by adding a fixed offset H_reg,t to the action cost matrix H_t (Equation 4.6d) whenever computing the reward function. This corresponds to adding a corresponding punishment term to the objective of the optimization problem. A more serious disadvantage of MSD-ACL comes from the fact that setting q_{0,t}(a_t|s_t) equal to a noisy PD-controller leads to learning state-dependent action costs F_t and potentially non-zero action goals h_t. While this might be fine for many cases, it has the drawback that it leads to a reward function that is difficult to interpret. When state-dependent action costs should be avoided, the second variant of IRL-MSD can be employed.

4.2.2 Using the Policy KL for Regularization

The second variant, which is in the following named MSD with KL-based Regularization (MSD-REG), uses q_{0,t} for regularization. The controller gains K_{0,t} and k_{0,t} are set to zero, and its noise Σ_{q0,t} serves as a weight of the regularization. If the covariance is high, deviations from the zero action are punished less and the regularization is thus low. Respectively, a low covariance would result in strong regularization. However, misusing the KL for regularization comes at a cost, because minimizing the relative entropy might actually result in increasing the action costs artificially just for the purpose of matching Σ_{q0,t}, which might lead to suboptimal learning. Nevertheless, it allows to learn state-dependent reward functions for a given action cost matrix H. Such reward functions have the advantage that goal states as well as their relevance can be directly inferred.

The effects on the performance that result from using the KL for regularization will be investigated in Chapter 5.

4.2.3 Ensuring Positive Definite Reward Functions

Computing R_t or H_t according to Equation 4.6 or 4.6d does not guarantee positive semi-definite cost matrices. Such reward functions should be avoided, because they break the LQG assumption and may lead to non-Gaussian policies. For IRL-MSD, a non-positive-semi-definite reward function indicates that the variance with respect to the goal position is smaller than the target variance. The goal position then is no longer an attractor, but serves as a detractor instead, i.e. the reward increases with the distance to that position. As it is not admissible to allow such reward functions, the cost matrices can be checked immediately after computation for positive definiteness. If they are not positive definite, a spectral decomposition can be utilized in order to replace all negative eigenvalues by a small positive number and then to transform the matrix back into its original basis.

4.2.4 Disregarding Features by Matching Marginals

In some cases, it is not desirable to match the demonstrations with respect to all state trajectories. For example, kinesthetic teaching might be used in order to demonstrate the via points of a movement regarding only the task space positions, but not the corresponding velocities. It might seem to be a drawback of IRL-MSD that the features cannot be chosen directly, but are rather implicitly defined by the target distributions. However, defining the type of the target distribution p_t(s_t) does not need to be less intuitive than defining the type of the reward function. It could be argued that defining the type of the target distribution is more intuitive, since it can be directly estimated based on the trajectories, whereas inferring the structure of the reward function from the trajectories is slightly less direct. However, for many practical applications it probably does not make any difference in the end, and the problem of choosing the proper features and the problem of choosing the proper target distribution are just two different views on the same problem. Ignoring certain state dimensions for MaxCausalEnt-IRL is achieved by ignoring the corresponding features when computing the feature counts, whereas for IRL-MSD it is achieved by ignoring the corresponding dimensions of the random variables when computing the KL. This leads to a slightly more general formulation of the objective of the optimization problem, i.e. Equation 4.1 becomes

minimize_{π_t(a_t|s_t)}   ∑_{t=1}^{T} ∫ p_t(s̄_t) log( p_t(s̄_t) / q_t(s̄_t) ) ds̄_t + ∑_{t=1}^{T−1} ∫_{s_t} p_t(s_t) ∫_{a_t} π_t(a_t|s_t) log( π_t(a_t|s_t) / q_{0,t}(a_t|s_t) ) da_t ds_t,   (4.7)

where p_t(s̄_t) and q_t(s̄_t) are the respective marginal distributions over the regarded state dimensions s̄_t. The only difference that results from this small modification is that the entries of R_t and r_t that correspond to the disregarded states have to be filled up with zeros, i.e. they have to be computed by

R_t = ½ ( Dᵀ( Σ^{-1}_{q̄,t} − Σ^{-1}_{p̄,t} )D + K_{0,t}ᵀ Σ^{-1}_{q0,t} K_{0,t} ),   (4.8)
r_t = Dᵀ( Σ^{-1}_{q̄,t} µ_{q̄,t} − Σ^{-1}_{p̄,t} µ_{p̄,t} ) − K_{0,t}ᵀ Σ^{-1}_{q0,t} k_{0,t},   (4.8b)

where Σ_{q̄,t}, µ_{q̄,t}, Σ_{p̄,t} and µ_{p̄,t} denote the statistics of these marginals and D is constructed by removing all rows of the N_s-by-N_s identity matrix that correspond to the disregarded states.

4.2.5 A Pseudo-code Implementation of IRL-MSD

A pseudo-code implementation of IRL-MSD for LQG systems is given in Algorithm 1 for the variants discussed in Sections 4.2.1 and 4.2.2. The modifications given in Sections 4.2.3 and 4.2.4 can be incorporated by computing the reward accordingly.

input:  q_t(s_t)              /* target state distributions for all time steps */
        q^(0)_{0,t}(a_t|s_t)  /* target policies for all time steps */
        p(s_{t+1}|s_t, a_t)   /* system dynamics for all time steps */
        µ_1(s_1)              /* state distribution of the first time step */
        T                     /* time horizon */
        learnActionCosts      /* boolean indicating whether action costs should be learned */
output: H^(i)_t, h^(i)_t, F^(i)_t, R^(i)_t, r^(i)_t   /* reward parameters for all time steps */

/* Initialize reward parameters */
[H^(0)_t, h^(0)_t, F^(0)_t, R^(0)_t, r^(0)_t] ← initialize()
i ← 0
while not converged do
    /* compute the Value functions and policies using Equations 4.2 and 4.4 */
    [Ṽ^(i)_t(s_t), π^(i)_t(a_t|s_t)] ← backward_pass(H^(i)_t, h^(i)_t, F^(i)_t, R^(i)_t, r^(i)_t)   /* see Appendix B */
    /* compute the state distributions using Equation 4.2b */
    p^(i)_t(s_t) ← forward_pass(π^(i)_t(a_t|s_t), p(s_{t+1}|s_t, a_t), µ_1(s_1))
    /* set the next target policies */
    if learnActionCosts then
        q^(i+1)_{0,t}(a_t|s_t) ← π^(i)_t(a_t|s_t)
    else
        q^(i+1)_{0,t}(a_t|s_t) ← q^(i)_{0,t}(a_t|s_t)
    /* compute reward parameters for the next iteration using Equations 4.6 to 4.6e */
    [H^(i+1)_t, h^(i+1)_t, F^(i+1)_t, R^(i+1)_t, r^(i+1)_t] ← compute_reward(q_t(s_t), q^(i+1)_{0,t}(a_t|s_t), p^(i)_t(s_t))
    /* iterate */
    i ← i + 1

Algorithm 1: Pseudo-code Implementation of IRL-MSD
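One step that Algorithm 1 leaves implicit is the projection described in Section 4.2.3. A minimal sketch of replacing negative eigenvalues by a small positive number via a spectral decomposition (the threshold value is an arbitrary choice):

    import numpy as np

    def project_to_positive_definite(M, eps=1e-6):
        """Replace negative eigenvalues of a symmetric matrix by a small positive number."""
        eigvals, eigvecs = np.linalg.eigh(M)
        eigvals = np.maximum(eigvals, eps)
        return eigvecs @ np.diag(eigvals) @ eigvecs.T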

5 Evaluation

The different approaches are evaluated on a double integrator toy example. A double integrator is a dynamical model of an underactuated system that possesses twice as many state dimensions as action dimensions, where these two states correspond to the first and second integration of the corresponding action. Thereby the actions can be interpreted as accelerations and the states as velocities and positions. The velocity at time step t, v_t, can be approximated using the velocity and acceleration of the last time step, i.e.

v_t = v_{t−1} + v̇_{t−1} Δt ≈ v_{t−1} + a_{t−1} Δt,

where Δt is the difference in time between two time steps. Similarly, the position at time step t, x_t, can be approximated as

x_t ≈ x_{t−1} + v_{t−1} Δt.

For a single action and a time discretization Δt = 0.1, this corresponds to the linear system dynamics

s_{t+1} = A s_t + B a_t,   with   A = [ 1  0.1 ; 0  1 ]   and   B = [ 0 ; 0.1 ].   (5.1)

The dimensionality can be increased by adding additional actions and their corresponding pairs of states. The matrices of such higher dimensional systems are block-diagonal and composed of the blocks given in Equation 5.1. The approximation error is modeled as Gaussian noise, leading to non-deterministic system dynamics.

5.1 Assuming Known Statistics of the Expert

The evaluations within this section assume that the true policy of the expert is known. Therefore, the state distributions resulting from that policy are used in order to compute the empirical feature counts φ̂ for MaxCausalEnt-IRL and, respectively, are used as target distributions q_t(s_t) for IRL-MSD. The effects of the sampling error are thereby removed when comparing the different approaches. The system dynamics are given by a double integrator with N_a action dimensions and, thus, N_s = 2N_a state dimensions. The MDP lasts for T = 50 time steps. The true reward function is quadratic in states and actions. The action costs are not correlated with the states and are time-independent. The action cost matrix is given by H = σ²_a I_{N_a}, where I_{N_a} denotes the N_a-by-N_a identity matrix and σ²_a is set to a fixed value. State costs are only given at time step 25 and at time step 50, and are based on a projection of the positions to a three-dimensional space. This relates to the problem of task space control of an N_a-link manipulator, where the forward kinematics are given by the projection matrix. For the following experiments, the number of links was set to three. The target distribution was computed using optimal control on the true reward function. Figure 5.1 shows the resulting state distributions of the first dimension in task space. The shaded area corresponds to the 2σ confidence interval. The right side shows an estimate based on seven samples.

5.1.1 IRL by Matching State Distributions

MSD-REG and MSD-ACL are evaluated for different initial covariance matrices Σ_{q0,t}. The mean of the target policy q_0(a_t|s_t) is set equal to zero, independent of the state. Only a single parameter, σ_0, has to be chosen by setting Σ_{q0,t} = σ²_0 I_{N_a}. MSD-REG uses σ²_0 to control the regularization, where a high value of σ²_0 corresponds to low regularization and vice versa. MSD-ACL uses σ²_0 as an initial guess of the target policy that is updated during optimization.
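The block-diagonal dynamics of Equation 5.1 can be assembled as follows for N_a action dimensions (a sketch; the noise magnitude is an arbitrary illustration, as its value is not stated here):

    import numpy as np

    def double_integrator_dynamics(n_actions, dt=0.1, noise_std=0.01):
        """Block-diagonal linear dynamics s_{t+1} = A s_t + B a_t + noise (Eq. 5.1).

        Each action dimension contributes one [position, velocity] pair of states.
        """
        A_block = np.array([[1.0, dt],
                            [0.0, 1.0]])
        B_block = np.array([[0.0],
                            [dt]])
        A = np.kron(np.eye(n_actions), A_block)
        B = np.kron(np.eye(n_actions), B_block)
        dyn_cov = noise_std ** 2 * np.eye(2 * n_actions)   # Gaussian approximation error
        return A, B, dyn_cov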

Figure 5.1.: The expert trajectory distribution of the first dimension in task space plotted over time. Left: true distribution, Right: estimate based on seven samples.

Figure 5.2.: The expected rewards for MSD-REG and MSD-ACL are shown for different initializations. Left: MSD-REG achieves good performance already after the first iteration, however it converges to a solution that performs worse. Right: The expected reward of MSD-ACL converges to the one of the expert. The higher initialization has a faster speed of convergence.

Figure 5.2 shows the expected reward of the optimal controller with respect to the learned reward functions for good values of σ_0². As for all other experiments conducted in this chapter, the expected reward is computed based on the true reward function. The values for σ_0² have been found in previous experiments. Higher values have led to instabilities for both approaches. Interestingly, MSD-REG attains the best performing reward function after only a few iterations and subsequently converges to a solution that performs slightly worse. MSD-ACL converges smoothly to the expert's performance. The speed of convergence is faster for the initialization with the higher variance. The state distributions of the optimal controllers on the learned reward functions are depicted in Figure 5.3. The initialization was set to σ_0² = 1000 for MSD-REG and to σ_0² = 10 for MSD-ACL. The plots on the left side show the state distributions after one (red) and after five thousand iterations (blue) of MSD-REG. The distribution after a single iteration has clearly lower variance than the baseline distribution (black). The state distribution after five thousand iterations matches the baseline distribution better, even though it performs worse in terms of expected reward. The plots on the right side show the corresponding distributions for MSD-ACL. The state distribution after a single iteration (red) approximates the target distribution worse than the respective distribution of MSD-REG because it started with a higher regularization. The state distribution after five thousand iterations (blue) matches the baseline distribution exactly.

Figure 5.3.: The trajectories of the first dimension in task space are shown for MSD-REG (left) and MSD-ACL (right). The initial estimate of MSD-REG (red) has clearly lower variance than the baseline distribution. The state distribution of MSD-ACL after five thousand iterations (blue) is indistinguishable from the baseline distribution.

5.1.2 Comparison with Maximum Causal Entropy IRL

MSD-ACL with σ_0² = 10 is compared to MaxCausalEnt-IRL. The modification based on radial basis functions (Chapter 3.2.2) is not shown, because it does not lead to a significant difference for the given task (except, of course, if the number of centers is chosen very low, which impairs the performance significantly). For MaxCausalEnt-IRL the action costs are assumed to be known in order to reduce the number of features. The step size is adapted after each iteration of gradient descent by multiplying it with α = 1.01 or β = 0.5 respectively, as described in chapter 3. The algorithms are compared with respect to the number of performed iterations. An additional comparison based on the computational time spent would be uninspiring, because both algorithms need approximately the same time per iteration.

Figure 5.4 shows the expected reward after each iteration. The blue curve relates to MSD-ACL and is the same as the corresponding curve in Figure 5.2. The red curve shows the performance of MaxCausalEnt-IRL. It converges significantly slower than MSD-ACL. Figure 5.5 shows the resulting state distribution for MaxCausalEnt-IRL after one thousand (red) and after ten thousand iterations (blue). These distributions are based on the greedy, optimal controller. The corresponding distribution for the true reward function is shown in black. MaxCausalEnt-IRL succeeds in matching the means very accurately already after a thousand iterations, but has too high a variance at the goal positions. After ten thousand iterations the variances are matched more closely; however, they are still slightly too large at the goal positions.

It is also interesting to inspect the learned parameters directly. Both approaches learn reward functions that only assign high rewards at time steps that are close to the critical time steps 25 and 50; thus, the remaining time steps can be neglected. MSD-ACL learns state-dependent action costs and is therefore difficult to interpret. Hence, MSD-REG and MaxCausalEnt-IRL are analyzed, based on the reward functions that have been learned after ten thousand iterations. The learned reward matrices R_t and goal positions g_t are mapped to the task space using the projection matrix. This transformation yields the low-dimensional reward parameters R_TS and g_TS that can be compared to the true reward function. For ease of illustration, only the first task space dimension is considered and reward correlations are ignored. Figure 5.6 shows the first dimension of the task space goal positions (solid line) and the corresponding entries of R_TS (radius of the shaded area) for the interesting time steps. Interestingly, MaxCausalEnt-IRL learns goal positions that are distant from the desired trajectory, especially at time steps shortly before the critical ones. The goal positions are given in Table 5.1. MSD-REG succeeds in extracting the true goal positions at the critical time steps 25 and 50. The goal positions learned by MaxCausalEnt-IRL are very different from the true goal positions even for high-reward time steps.

Figure 5.4.: The expected reward of MSD-ACL and MaxCausalEnt-IRL is compared after each iteration based on the true reward function. MSD-ACL converges significantly faster to the expected reward of the expert.

Table 5.1.: The table compares the goal positions around the critical time steps for the first dimension in task space (columns: time step, actual, MSD-REG, MaxCausalEnt-IRL). MSD-REG extracts the actual goal positions at the critical time steps (25 and 50) very accurately. These values are given with higher precision. MaxCausalEnt-IRL learns wrong goal positions at the critical time steps and compensates this by choosing goal positions that are very different from zero for the preceding time steps.
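The goal positions discussed above can be read off a learned quadratic reward. The following sketch assumes the parameterization r_t(s) = -½ s^T R_t s + r_t^T s with positive definite R_t, which may differ from the exact convention used in the thesis; under that assumption the implied goal position is the maximizer g_t = R_t^{-1} r_t.

import numpy as np

def goal_position(R, r):
    """Maximizer of the quadratic reward r(s) = -0.5 * s^T R s + r^T s,
    assuming R is positive definite (illustrative convention)."""
    return np.linalg.solve(R, r)

R = np.array([[4.0, 0.0],
              [0.0, 1.0]])
r = np.array([2.0, 0.5])
print(goal_position(R, r))   # -> [0.5, 0.5]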

Figure 5.5.: The state distribution of the optimal controller based on the reward learned by Maximum Causal Entropy IRL is shown after 1000 iterations (red) and after 10000 iterations (blue). The state distribution for the true reward function is shown in black. The learned reward function leads to a variance that is slightly too high at the goal positions.

5.2 Learning Based on Samples

This section covers the more realistic scenario where the policy and the resulting joint distribution of the expert are not known. The experiments are based on the same system as the one described in section 5.1. The empirical feature counts and the target distributions, respectively, are estimated based on demonstrations. All experiments use the same twenty sets of seven demonstrations.

5.2.1 IRL by Matching State Distributions

The employed reward function does not depend on the velocities directly. As the velocities are usually very noisy, it is sensible to exclude them from the target distribution as described in chapter 4. The effect of removing the velocities from the target distribution was tested both for MSD-ACL as well as for MSD-REG. The version of MSD-ACL that tries to match the complete state distribution was initialized with σ_0² = 10; the version that ignores the velocities uses σ_0² = 1. These values were chosen based on preliminary experiments. The averaged expected rewards and their 2σ confidence intervals are shown in Figure 5.7. The variant that does not try to match the velocities converges slower, but seems to converge to a better solution. Both variants of MSD-REG have been tested with the same value of σ_0². Similar to the experiments based on the known expert distribution, the averaged expected reward did not change much after the first iteration. The results of the optimization are therefore shown in Table 5.2. By ignoring the noisy velocities, the performance could be improved.
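The per-time-step target distributions used in this section can be estimated from the demonstrated trajectories as empirical means and covariances, optionally dropping the velocity dimensions. The following is a small sketch under these assumptions; all names and the index convention (positions at even state indices) are illustrative.

import numpy as np

def estimate_targets(demos, keep_dims=None):
    """demos: array of shape (n_demos, T, state_dim).
    Returns per-time-step empirical means and covariances, optionally
    restricted to a subset of the state dimensions."""
    if keep_dims is not None:
        demos = demos[:, :, keep_dims]
    means = demos.mean(axis=0)                                        # shape (T, d)
    centered = demos - means[None, :, :]
    covs = np.einsum('nti,ntj->tij', centered, centered) / (demos.shape[0] - 1)
    return means, covs

demos = np.random.randn(7, 50, 6)                          # e.g. seven demonstrations, T = 50
means, covs = estimate_targets(demos, keep_dims=[0, 2, 4]) # keep only the position dimensions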

Figure 5.6.: The plots show the learned goal positions and their associated rewards for the interesting time steps at the middle and at the end of the demonstrations. The top row illustrates the reward parameters learned by MSD-REG. The bottom row shows those learned by MaxCausalEnt-IRL. Both approaches associate high rewards to the critical time steps 25 and 50. The goal positions learned by MaxCausalEnt-IRL are very different from zero at the time steps that precede the critical ones. The goal positions are shown in Table 5.1.

Table 5.2.: The expected averaged reward is shown after a thousand iterations (rows: matching velocities, ignoring velocities, expert demonstrations; columns: averaged reward, 2σ-confidence). The performance of MSD-REG is similar to the one demonstrated by the expert when the velocities are ignored.

Figure 5.7.: The red curve shows the averaged expected reward of MSD-ACL when matching both positions and velocities. The blue curve shows the averaged expected reward of MSD-ACL when the velocities are ignored. The average expected reward of the demonstrations is illustrated by the black line. The shaded areas correspond to 2σ confidence. The variant that ignores the velocities converges slower. However, it seems to converge to a better solution. The slower convergence can be explained with the higher initial regularization.

Table 5.3.: The table compares the goal positions around the critical time steps for the first dimension in task space (columns: time step, actual, MSD-REG, MaxCausalEnt-IRL). MSD-REG approximately learns the true goal positions at the critical time steps (25 and 50).

5.2.2 Comparison with Maximum Causal Entropy IRL

Figure 5.8 compares the expected rewards of MSD-ACL and MaxCausalEnt-IRL. Both approaches try to match all states. MSD-ACL is initialized with σ_0² = 10. As for the setting where the expert state distribution is known, MSD-ACL converges significantly faster. Again, the reward parameters learned by MSD-REG are compared to those learned by MaxCausalEnt-IRL. Figure 5.9 shows the goal positions for the first dimension in task space as well as the associated rewards for time steps that are close to the critical ones.

Figure 5.8.: MaxCausalEnt-IRL (blue curve) is compared to MSD-ACL (red curve) on the same set of 7 samples. MSD-ACL converges significantly faster.

5.3 Discussion

A simple toy task was used to evaluate both variants of IRL-MSD and to compare them with MaxCausalEnt-IRL. Both MSD-ACL and MSD-REG converged significantly faster than MaxCausalEnt-IRL. This is not surprising, given that IRL-MSD does not rely on gradient descent but directly computes the reward parameters that are optimal with respect to the current estimation. When properly initialized, both variants of IRL-MSD seem to converge to similar solutions. MSD-REG converges almost instantly. The expected reward based on the initial estimate of the parameters was always close to the best one found over all iterations. However, the initialization has a huge impact on the performance (Figure 5.2, left) because it corresponds to a regularization. MSD-ACL updates the target policy after each iteration and, thus, the initial parameter is less influential (Figure 5.2, right). The parameters of the reward functions have been inspected in order to assess how well the true goal positions of the expert have been matched. MSD-REG was able to approximately recover the true goal positions at the important time steps even for the sample-based estimate of the target state distribution.
