Factorized Decision Forecasting via Combining Value-based and Reward-based Estimation

Size: px

Start display at page:

Download "Factorized Decision Forecasting via Combining Value-based and Reward-based Estimation"

Horace Perkins
6 years ago
Views:

1 Fcorized Decision Forecsing vi Combining Vlue-bsed nd Rewrd-bsed Esimion Brin D. Ziebr Crnegie Mellon Universiy Pisburgh, PA Absrc A powerful recen perspecive for predicing sequenil decisions lerns he prmeers of decision problems h produce observed behvior s (ner) opiml soluions. Under his perspecive, behvior is explined in erms of uiliies, which cn ofen be defined s funcions of se nd cion feures o enble generlizion cross decision sks. Two pproches hve been proposed from his perspecive: esime feure-bsed rewrd funcion nd recursively compue vlues from i, or direcly esime feure-bsed vlue funcion. In his work, we invesige he combinion of hese wo pproches ino single lerning sk using direced informion heory nd he principle of mximum enropy. This enbles uncovering which ype of esime is mos pproprie in erms of predicive ccurcy nd/or compuionl benefi for differen porions of he decision spce. I. INTRODUCTION For mny sks, humn-genered decisions re he bes exmples of inelligen behvior vilble nd re vluble source for rining mchines o behve inelligenly. Erly pproches for lerning o imie inelligen behvior direcly esime observed policies s behvior cloning sk. Perhps he mos successful exmple is ALVINN 20, he uonomous vehicle rined o conrol he driving wheel s seering ngle using neurl nework wih forwrd-mouned video feeds s inpu. This reduces decision predicion o clssificion sk. Though successful in keeping he vehicle sfely on he rodwy, his model of decision mking s solely he response o immedie sensor d is limied in is cpbiliies. For exmple, i would no be successful in choosing sequence of rodwys leding o priculr new desinion. In oher words, i is incpble of higher-level decision mking, like sequenil plnning. More recen echniques for lerning nd predicing sequenil decisions hve conneced he lerning sk o decision heory nd conrol heory frmeworks. In hese frmeworks, n insnneous rewrd is received by king priculr cion in priculr se nd i is reled o he expeced cion vlue of cumulive fuure rewrds for ech secion combinion ccording o he Bellmn equion 3. Inverse opiml conrol 13, 4, 19, 1 emps o find rewrds nd vlues h explin observed behvior, following Klmn s erly quesion: When is liner conrol sysem opiml 13? Unforunely, he quesion is ill-posed; mny choices for he rewrd funcion will mke observed sequences of decisions opiml, including degenercies h mke ll decision sequences eqully good 19, bu re uninformive. A number of echniques hve been developed h resolve he mbiguiies of his originl quesion. Mos hve focused on esiming he rewrd funcion bsed on feures of he se nd cion 5, 1, 22, 2, 21, 18, 25, 27. However, recen pproch hs employed his sme perspecive o lern he vlue funcion bsed on se nd cion feures 15. Inverse opiml conrol pproches hve been successfully employed for vehicle 29 nd roboic nvigion pplicions 30, 23, s well s cogniive science models 2. In his pper, we invesige new pproch for combining rewrd-bsed nd vlue-bsed decision lerning. We ssume h he desirbiliy of ech se s cions moives observed behvior eiher s vlue ( funcion of se-cion or se feures h is he ulime moivor of cions) or s cos ( combinion of direc funcion of se-cion feures nd he influence of fuure expeced feures in he decision process i.e., fuure vlue). Porions of he se spce h re vlue-bsed effecively fcorize he decision forecsing sk, since decision leding o hose ses do no consider he fuure beyond he vlue-bsed cions of hose ses. This fcorizion cn drmiclly improve he compuionl efficiency of inverse opiml conrol while reining mny of he dvnges of non-myopic resoning over smller cos-bsed porions of he decision spce. We formule he combinion of rewrd-bsed nd vluebsed inverse opiml conrol s mximum cusl enropy opimizion 27. We hen invesige prmeer nd srucure lerning sks, s well s predicion sks, under he resuling model. We show h given he se-bsed fcorizion of he decision problem, rewrd nd vlue prmeers cn be efficienly lerned s convex opimizion. We hen pose he problem of deciding wheher ech se influences decisions s vlue-bsed or s rewrd-bsed fcor from Byesin perspecive. We presen srucure lerning lgorihms for obining he poserior belief of vlue-bsed versus rewrd-bsed influence for ech se from demonsred behvior nd known vlue nd rewrd weighs. To ccomplish his, we inroduce echnique bsed on Mrkov chin Mone Crlo simulion 8. Combining hese wo procedures, we inroduce n expecion-mximizion 6 pproch for lerning he combined influence ype of ses nd he corresponding rewrd nd vlue prmeers.

2 II. BACKGROUND AND RELATED WORK We begin by reviewing decision mking frmeworks, opiml decision crieri, nd inverse opiml conrol lerning echniques. Our combined vlue-bsed nd rewrd-bsed mximum cusl enropy inverse opiml conrol pproch builds upon hese conceps. A. Decision Processes nd Opiml Conrol Mrkov decision processes (Definiion 1) provide flexible represenion of sequenil decision mking. Definiion 1: A Mrkov decision process (MDP) is uple, M MDP = (S, A, P (s s, ), R(s, )), of: A se of ses (s S); A se of cions ( A) ssocied wih ses; Acion-dependen se rnsiion dynmics probbiliy disribuions (P (s s, )) specifying nex se (s ); nd A rewrd funcion (R(s, ) R). A ech imesep, he se (S ) is genered from he rnsiion probbiliy disribuion (bsed on he previous se S 1 nd previous cion A 1 ). The se is known o he decision mker before he nex cion (A ) is seleced. The disribuion from which decisions re drwn is he (sochsic) policy, P (A S) or π(a S). The sndrd problem of ineres, given n MDP, is o find he opiml policy which mximizes he cumulive expeced rewrd: π(a S) = rgmx E P (A,S) R(, s ) π. π(a S) The Bellmn equion, π(s) = rgmx Q(, s) (1) Q(, s) = R(, s) + E P (S s,) V (s ) (2) V (s) = mx Q(, s), (3) defines fixed poin for his opiml policy. The recurrences of Equion 2 nd Equion 3 cn be ierively pplied o compue he se-cion vlues (lso known s he rewrdo-go), Q(, s), nd he se vlues, V (s), vi he vlue ierion lgorihm 3. These vlues qunify he cumulive fuure expeced rewrd received by he opiml policy from invoking priculr cion in priculr se or from priculr se, respecively. Under his opimliy crieri, deerminisic policy cn lwys be obined h provides he opiml vlues compued vi he Bellmn equion. Ofen, he h rewrd obined in he MDP is discouned by fcor of γ (0 γ < 1) o ensure fs convergence when pplying he Bellmn equion. However, his cn be equivlenly represened by scling ll rnsiion probbiliies P (s, s) by fcor of γ nd ermining fer ech cion wih probbiliy (1 γ). Thus, we do no explicily consider he discouned rewrd seing in ou formulion. In prcice, compuing he Bellmn equion in decision spces wih lrge numbers of ses nd cions cn be compuionlly burdensome. One pproch o me his complexiy is o inelligenly order he se updes of he Bellmn equion h re pplied nd o employ heurisics h bound he vlue funcions, limiing he size of he decision spce needed o be considered. This pproch is fmous for is use in plnning problems nd is known s he A* serch lgorihm 10. The vlue-bsed inverse opiml conrol porion of our pproch cn be viewed s providing reled compuionl benefis. B. Inverse Opiml Conrol Inverse opiml conrol (IOC) invesiges he problem of deermining wh rewrd funcion bes explins demonsred behvior sequences of ses s = (s 1,..., s T ) S nd cions = ( 1,..., T ) A. Typiclly, he prmeers of liner rewrd funcions, R θ (, s) = θ f,s, (4) which re defined in erms of rel-vlued feure vecors, f,s R K, re lerned. A broder view of inverse opiml conrol llows ny of he underlying Bellmn vribles of Equions 1 3 o be esimed. This leds o hree ypes of esimes: Acion vlue esimion, Q(, s) = φ Q f,s, from which he policy cn be obined vi Equion 1. Se vlue esimion, V (s) = φ V f s, from which he policy cn be obined vi Equion 1 nd Equion 2. Rewrd esimion, R(, s) = θ f s,, from which he policy cn be recursively obined vi Equions 1-3. Unforunely, he nïve problem formulion find rewrd prmeers θ or vlue prmeers φ Q or φ V h mke ll demonsred behvior opiml is ill-posed; mny choices of hese prmeers, including degenercies will chieve his objecive. Consider he zero weigh vecor, θ = φ Q = φ V = 0. I mkes ll decision choices, including demonsred decision sequences, hve n equl vlue of zero. While his sisfies he nïve formulion, i is compleely uninformive soluion. Erly pproches employed heurisics o selec he more meningful rewrd funcion weigh soluions o he inverse opiml conrol problem 19. However, he choice of heurisic is no priculrly well-jusified nd ofen he degenere soluion is he only vlid soluion in his problem formulion when demonsred sequences re no perfecly predicble. A key insigh of Abbeel & Ng 1 is h o ensure h n esimed policy mches he performnce of demonsred (smple from ) policy on decision mker s unknown rewrd funcion, he expeced feures mus mch. Following lineriy, E ˆP f,s = E P f,s θ R K, θ E ˆP f,s = θ E P f,s θ R K, E ˆP R θ (, s ) =E P R θ (, s ),

3 we cn see h his is indeed he cse. Unforunely, mching feures in expecion does no resolve ll of he mbiguiies in he inverse opiml conrol problem; when demonsred decision sequences re no consisenly explined s he opiml resul of single choice of rewrd prmeers, θ, hen here re mny sochsic policies h mch feure couns. This corresponds o he siuion in which demonsred decision sequences re no perfecly predicble nd degenere soluions re he only soluions o he nïve inverse opiml conrol formulion. When lerning from humn-genere decision sequences, his is ofen he cse no necessrily becuse people re nonopiml, bu lso becuse MDP nd se-cion feures re ofen simplificion of he decision sk nd is influences. Abbeel & Ng 1 mix se of deerminisic policies h re he opiml policies for sequence of rewrd prmeers, {θ i }: T T E P (S,A) f s, π θi (A S) E P (S,A) f s,. i =1 =1 However, here re mny oher wys o find such feuremching policy (mixure), nd mny do no provide good predicive performnce. Indeed, very simple exmples hve been shown o hve infinie log-loss under he Abbeel & Ng pproch 28. Oher pproches o inverse opiml conrol employ gmeheoreic prmeer selecion crieri for obining rewrd funcion prmeers 25, mximum mrgin echniques 22, or Byesin poserior disribuions of rewrd prmeers 5. A Bolzmnn cion disribuion pproch 2, 21, 18 is similr o he mximum enropy echniques we use in his pper. I employs disribuion over cions bsed on he se-cion vlue, P ( s) e Q θ(,s), (5) where he Q(, s) cion vlues re compued from he Bellmn equion (Equion 1) wih liner rewrd funcion (Equion 4). Unforunely, he d likelihood under his model is no convex, so opimizion echniques for finding rewrd weighs θ re subjec o locl opim. Addiionlly, he policy wih he lrges expeced rewrd does no necessrily hve he lrges probbiliy in his model 28. C. Rewrd-Bsed Mximum Cusl Enropy IOC The principle of mximum enropy 12 prescribes probbiliy disribuions h re he mos uncerin subjec o consrins mching mesured properies of d. This provides robus predicive log-loss minimizion gurnees 9. I hs been recenly exended o seings wih inercion nd feedbck, mking i pplicble o inverse opiml conrol problems 27. Definiion 2: Rewrd-bsed mximum cusl enropy inverse opiml conrol 27 is defined by he following opimizion: mx H(A S) (6) such h: E P (S,A) f s, = E P (S,A) f s,, s S, A, P (s, ) = P ( s) P (s T T 1 ) s S, A, P ( s) 0 nd s S, P ( s) = 1. where he cuslly condiioned enropy 14 from he Mrko- Mssey heory of direced informion 16, 17, H(A S) = E P (A,S) log P ( s), (7) is bsed on he cuslly condiioned probbiliy disribuion, P ( s) = T P ( s 1:, 1: 1 ), (8) =1 nd is emporl complemen, P (s T T 1 ) = T P (s s 1: 1, 1: 1 ). (9) =1 These disribuions crucilly differ from he sndrd condiionl probbiliy, P ( s) = T P ( s 1:T, 1: 1 ), (10) =1 in h ech cion is condiioned only on previously vilble se informion, s 1: nd no fuure se informion, s +1:T. This difference is refleced in Mrkov decision processes where ech se is only reveled o he decision mker fer previous cions re invoked nd policies h condiion on fuure ses cnno be execued. Togeher, he wo cuslly condiioned disribuions form he join disribuion: P (, s) = P ( s)p (s T T 1 ), which is explicily enforced s consrin in Equion 6. By mximizing his cusl enropy mesure (Equion 6), his pproch provides robus log-loss esime for he policy P (A S) 27. The soluion o his opimizion (Definiion 2) fcors ino sochsic policy wih cions disribued ccording o P ( s) e Qsof θ (,s), where Q sof (, s) is recursively defined s follows: θ Q sof θ (, s) = R θ (, s) + E P (S s,) V sof θ (s) (11) Vθ sof (s) = sofmx Q sof θ (, s), wih rewrd funcion R θ (, s) = θ f s, nd sofmx x f(x) = log x ef(x). This recursive relionship cn be viewed s smooh relxion of he Bellmn equion (Equion 1). The vlue updes of Equion 11, like he Bellmn equion, cn be ierively pplied vi dynmic progrmming o obin he sochsic policy of he model. This disribuion is equivlen o mximum likelihood esimion under mulivrie exreme-vlue noise disribuion for rewrd vlues in n MDP 24, bu cn be pplied more brodly o seings in which pproprie

4 noise erms re difficul o specify (e.g., coninuous conrol wih liner dynmics nd qudric coss 27). The rewrd funcion prmeers cn be lerned using sndrd grdien-bsed opimizion echniques. The grdien is simply he difference beween empiricl nd expeced feures: θ log L(θ P (S, A)) = E P (S,A) f s, E Pθ (S,A) f s,. Due o he convexiy of he log-likelihood funcion, log L(θ P (S, A)) = A,S P (A, S) log P θ (A S), rewrd prmeers converging o globl opim re obined using e.g., he grdien scen lgorihm. D. Vlue-bsed mximum enropy inverse opiml conrol An lerne pproch lerns vlue funcion rher hn rewrd funcion 15. This pproch cn lso be posed s mximum enropy esimion sk nd is closely reled o logisic regression. Definiion 3: Acion vlue-bsed mximum enropy inverse opiml conrol is defined by he following opimizion: mx H(A S) (12) such h: E P (S,A) f s, = E P (S,A) f s, A, s S P ( s) 0 nd s S, A P ( s) = 1, where he condiionl enropy is: H(A S) = E P log P (A S). The condiionl probbiliy disribuion of cions is of he form: P ( s) e Q(,s) where Q(, s) = φ Q f s,. Noe h he disribuion is over single se nd cion pir. Acion disribuions from se-bsed vlues re lso possible s resul of mximum enropy formulion. Definiion 4: Se vlue-bsed mximum enropy inverse opiml conrol 15 is defined by he following opimizion: mx H(A S) (13) such h: E P (S,S,A) f s = E P (S,S,A) f s A, s S P ( s) 0 nd s S, A P ( s) = 1, where S is he nex se experienced fer employing cion A in se S, nd is disribued ccording o he known decision process dynmics P (S S, A). Boh esimes cn similrly be obined using sndrd grdien-bsed opimizion echniques h void locl opim s consequence of convexiy. A key dvnge of he vlue-bsed esimes is h he policy cn be direcly obined wihou requiring poenilly compuionlly expensive dynmic progrmming sk (Equion 11). However, lerned vlue funcions cn ypiclly be less generlly pplied o e.g., differences in he gol erminl se of he MDP, for which he rewrd-bsed pproch is more pproprie. E. Inverse Opiml Heurisic Conrol In our previous work, we combined rewrd-bsed, IOCsyle lerning wih insnneous influence lerning, similr o he vlue-bsed perspecive, by ugmening lerned rewrd funcions hving long-erm influence wih lerned vlue funcions hving only insnneous influence. Acions under his model re disribued ccording o: P ( s) e Qsof θ (,s)+qins φ (,s), (14) where he se-cion vlues, Q sof θ (, s), re recursively compued from he dynmic progrm of Equion 11 nd he insnneous vlue is liner funcion of se-cion feures, Q ins φ (, s) = φ f s,, nd is no recursively reled o he se-cion vlues Q sof θ (s, ) ssocied wih he lerned rewrd funcion. Richer feures cn be employed wihin he insnneous vlue funcion (e.g., chrcerisics of he previous se-cion sequence) hn cn be efficienly incorpored in he rewrd funcion. While his pproch improves upon he predicive cpbiliies of rewrd-bsed mximum cusl enropy inverse opiml conrol 27, mny of he niceies of he mximum enropy formulion re los in his formulion. Firs nd foremos, lerning rewrd nd vlue funcion prmeers θ nd φ is no longer convex opimizion sk. Therefore, prcicl lgorihms for lerning model prmeers re suscepible o locl opim. Addiionlly, i does no improve upon he compuionl complexiy of he underlying mximum cusl enropy pproch i employs by combining he long-erm nd insnneous rewrds in his mnner. In conrs, he pproch we inroduce in his pper permis ech se s cions o influence behvior s rewrd or s vlue, bu no simulneously combinion of boh. This reins he convexiy properies of he mximum enropy formulion. I lso provides compuionl benefis, s he Bellmn-like inference procedure is effecively resriced o smller porions of he decision spce. III. A UNIFYING STATISTICAL ESTIMATION FRAMEWORK FOR REWARD-BASED AND VALUE-BASED INVERSE OPTIMAL CONTROL We leverge he recenly developed mximum cusl enropy frmework 27 o pose se-dependen combinion of rewrd-bsed nd vlue-bsed inverse opiml conrol under unified esimion sk. We hen inroduce lgorihms for inference nd lerning in he resuling model. A. A Unified Esimion Formulion We combine vlue-bsed nd rewrd-bsed inverse opiml conrol by priioning he ses ino hree subses: one of rewrd-bsed esimes, S R, one of cion vlue-bsed esimes, S Q, nd one of se vlue-bsed esimes, S V. Ech se s S hs ype, denoed ype(s), indicing o which of hese ses i belongs. We denoe he se of ypes for ll ses s ype(s). Using hese subses of ses, we

5 view decision sequences s shorer subsequences ccording o Definiion 5 1. Definiion 5: A vlue-se-segmened behvior sequence is rnsformion of sequence of ses nd cions ino se of smller subsequences in which only he finl se of ech sub-sequence cn be vlue-bsed se. For exmple, priculr se-cion sequence, (s, x, s b, y, s c, x, s d, z, s, y, s b ), wih cion vluebsed se S Q = {s b } nd se vlue-bsed se S V = {s d } is segmened ino sequences: (s, x, s b, y ), (s c, x, s d, z ), nd (s, y, s b ). Our view of demonsred decision sequences nd disribuions over decision sequences considers only hese segmened subsequences. We mke use of he empiricl disribuion of he iniil sub-sequence se, Psubseq (S 1 ), which, in our exmple, would be: P subseq (S 1 = s ) = 2 3 nd P subseq (S 1 = s c ) = 1 3. Definiion 6: Hybrid rewrd nd vlue mximum cusl enropy inverse opiml conrol given rewrd-bsed ses S R nd vlue-bsed ses S V nd ll se-cion sequences segmened ccording o Definiion 5 hs cions wih disribuions obined from he following opimizion: {P (A S)} = rgmx H(A S) (15) {P (A S)} T 1 T 1 such h: E P (S,A) f s, = E P (S,A) f s, =1 =1 E P (S,A) ISQ (s T )f st, T = E P (S,A) ISQ (s T )f st, T E P (S,A) I SV (s T ) P (s s T, T ) f s s S = E P (S,A) I SV (s T ) s S P (s s T, T ) f s s S, A, P (s, ) = P ( s) P (s T T 1 ) s S, A P ( s) 0, s S A P ( s) = 1 nd s 1 S, P (s 1 ) = P subseq (s 1 ), wih indicor funcion I A () = 1 if A nd 0 oherwise. We noe h he cusl enropy is concve nd h he join sequence disribuion P (S, A) is liner funcion of unknown cuslly condiioned vribles P (A S). This enbles he opimizion of Definiion 6 o be expressed s convex progrm. B. Dul Form nd Inference Procedure Though he convex progrm of Definiion 6 cn be direcly opimized, he dul progrm enbles beer undersnding of he model nd suppors more compc opimizion procedures. Theorem 1: The hybrid IOC opimizion of Definiion 6 cn be formuled s sofened version of he Bellmn 1 For simpliciy, we ssume h he finl se of ech originl sequence is vlue se. However, his ssumpion cn be relxed wih only ddiionl noionl complexiy. equion 3 vi convex duliy: θ f s, + E P (S Q hyb s,) V hyb (s ) if s S R (, s) = φ Q f s, if s S Q s S P (s s, ) φ V f s if s S V (16) V hyb (s) = sofmx Q hyb (, s), wih sofmx x f(x) = log x ef(x), he rewrd funcion prmeerized by rewrd weighs θ, nd he vlue funcions prmeerized by vlue weighs φ = {φ Q, φ V }. A srigh-forwrd procedure, Algorihm 1, follows from hese sofmx Bellmn-like updes. For simpliciy, we ssume in our noion h he decision ime-horizon is infinie nd h he corresponding policies re ime-invrin. Exensions o ime-vrying, fixed horizon decision seings re srigh-forwrd, bu noionlly more cumbersome. Algorihm 1 Rewrd/vlue-bsed IOC inference procedure Require: MDP M MDP, rewrd prmeers θ, cion vlue prmeers φ Q, se vlue prmeers φ V, cion vluebsed se S Q, se vlue-bsed se S V, nd rewrd-bsed se S R. Ensure: Se-cion vlues Q (, s) nd se vlues V (s) ccording o Equion 16 1: for ll s S Q do 2: for ll A do 3: Q (, s) φ Q f s, 4: end for 5: end for 6: for ll s S V do 7: for ll A do 8: Q (, s) φ V f s, 9: end for 10: end for 11: while no converged do 12: for ll s S R do 13: V (s) sofmx Q(, s) 14: end for 15: for ll s S R do 16: for ll A do 17: Q (, s) θ f s, + E P (S s,)v (s ) s, 18: end for 19: end for 20: end while The compuionl benefi of replcing some rewrd-bsed ses wih vlue-bsed ses cn be undersood from considering he longes phs of he decision spce. In he wors cse, vlue ierion requires O( S 2 A ) ime qudric in he number of ses o obin he opiml infinie imehorizon policy. Inuiively, his is becuse he bes policy could correspond o ph of lengh O( S ) nd propging vlues over ech se of h ph requires O( S A ) ime vi vlue ierion. Replcing rewrd-bsed ses wih vluebsed ses so h here re no phs consising enirely

6 of rewrd-bsed ses of lengh bounded below O( S 1 2 ) or O(log S ) reduces he vlue ierion o O( S 3 2 A ) or O( S A log S ) ime. The inference procedure (Algorihm 1) cn be generlized o seings in which here is belief bou he vluebsed se se, P (S V ). Nïvely, he resuling policy cn be obined for ech belief combinion for vlue se ses: P ( s) = P (S V )P ( s, S V ). (17) S V S However, more efficien lgorihm is obined when beliefs in individul se ypes, P (ype(s)) re independen beween ses (Theorem 2) nd we ssume h decisions re mde under his sme unceriny. Theorem 2: When se ype beliefs re independen, P (ype(s)) = s S P (ype(s)), he se-cion vlue of Equion 16 cn be re-wrien s: Q (, s) =P (s S R ) (θ f s, + EV (s ) s, ), + P (s S Q ) φ Qf s, + P (s S V ) s S P (s s, ) φ V f s (18) wih he inerpreion h ech visi o se, smple from P (ype(s)) is drwn indicing wheher he se is reed s rewrd-bsed influence, n cion vlue-bsed influence, or se vlue-bsed influence. C. Rewrd nd Vlue Weigh Lerning from Known Se Ses We now invesige he sk of lerning rewrd weighs θ nd vlue weighs φ = {φ Q, φ V } given he rewrd se se. Our objecive is o mximize he log likelihood: {θ, φ Q, φ V } = rgmx θ, φ Q, φ V log L(θ, φ Q, φ V P (, s)) = rgmx P (, s) log P ( s). θ, φ Q, φ Q A, s S The grdien of his opimizion wih respec o rewrd prmeers nd vlue prmeers is: θ log L(θ, φ Q, φ V P (S, A)) = (19) T 1 T 1 E P (S,A) f s, E P (S,A) f s, =1 =1 φq log L(θ, φ Q, φ V P (S, A)) = (20) E P (S,A) IsT S Q f st, T EP (S,A) IsT S Q f st, T φq log L(θ, φ Q, φ V P (S, A)) = (21) E P (S,A) P (s s T, T ) f s E P (S,A) I st S V s S I st S V P (s s T, T ) f s. s S Sndrd grdien-bsed opimizion echniques cn be employed o find prmeers h converge o globl opimum. Algorihm 2 Rewrd/vlue-bsed IOC rewrd/vlue prmeer lerning procedure Require: MDP M MDP, demonsred sequences P (S, A), iniil rewrd prmeers θ 0, iniil cion vlue prmeers φ Q,0, iniil se vlue prmeers φ V,0, se ypes ype(s). Ensure: Rewrd weigh esimes ˆθ nd vlue weigh esimes ˆφ h re (pproximely) opiml mximum likelihood esimes 1: ˆθ θ0, ˆφ Q φ Q,0, ˆφ V φ V,0 2: while ˆθ, ˆφ Q, nd ˆφ V no (pproximely) converged do 3: Compue Qˆθ, ˆφ(, s) nd Vˆθ, ˆφ(s) given ype(s) vi Algorihm 1 4: Compue policy P ( s) = e Qˆθ, ˆφ(,s) Vˆθ, ˆφ(s) 5: Compue grdien θ log L(θ, φ P (S, A)) θ=ˆθ vi Equion 19 6: Compue grdien vi Equion 20 φq log L(θ, φ P (S, A)) φq = ˆφ Q 7: Compue grdien vi Equion 21 φv log L(θ, φ P (S, A)) φv = ˆφ V 8: ˆθ ˆθ + η θ log L(θ, φ P (S, A)) 9: ˆφQ ˆφ Q + η φq log L(θ, φ P (S, A)) 10: ˆφV ˆφ V + η φv log L(θ, φ P (S, A)) 11: end while One such procedure is shown s Algorihm 2. The lerning re prmeers ( η re chosen o slowly decy o 0 re of e.g., Θ 1 ). D. Inferring he Influence Types of Ses Inferring which ses re vlue-bsed influences of decision mking nd which re rewrd-bsed influences is more difficul sk. Nïvely, considering he powerse of ses is compuionlly burdensome due o he Θ(3 S ) subse choices. We drw upon pproches h hve been successfully employed for Byesin srucure lerning 11, which is similrly difficul sk of deermining how join disribuion should fcor ino produc of condiionl probbiliies. Specificlly, Mrkov chin Mone Crlo (MCMC) simulion 8, hs been employed for Byesin nework srucure lerning 7 nd provides generl pproch for ddressing he influence ype inference seing of his pper s well. We re ineresed in obining poserior disribuion of se ypes given vilble evidence. Here, we represen h evidence s π, he demonsred policy. Using Byes rule, he poserior disribuion is: P (ype(s) π, θ, φ) P ( π ype(s), θ, φ)p (ype(s)). (22) We rely on key propery of he rewrd/vlue-bsed IOC disribuion (Theorem 3) reling he log likelihood o rewrd nd sofened se vlues. Theorem 3: The log probbiliy of policy under he hybrid rewrd-bsed/vlue-bsed inverse opiml conrol p-

7 proch is reled o he policy s expeced rewrds s follows: log P ( π ype(s), θ, φ) = (23) T 1 E P (S,A) θ f s, + φ f st, T V hyb (s 1), π =1 where V hyb is defined ccording o Theorem 1. Using sndrd Mrkov chin Mone Crlo simulion echniques, n pproximion o he poserior disribuion of rewrd/vlue se ype given demonsred policies is obined using Algorihm 3. Algorihm 3 Poserior rewrd/vlue se MCMC procedure Require: MDP M MDP, rewrd prmeers θ, cion vlue prmeers φ Q, se vlue prmeers φ V, demonsred policy π(a S), burn-in size N b, smple size N s Ensure: ˆP (ype(s)) bsed on smples from Byesin poserior. 1: s S, se ype(s) = Acion vlue 2: 1 3: while (N b + N s ) do 4: Smple i {1,..., S } uniformly rndom 5: Se ype(s i ) = Rewrd 6: Compue P R = P ( π ype(s), θ, φ) P (ype(s)) vi Equion 23 7: Se ype(s i ) = Acion vlue 8: Compue P Q = P ( π ype(s), θ, φ) P (ype(s)) vi Equion 23 9: Se ype(s i ) = Se vlue 10: Compue P V = P ( π ype(s), θ, φ) P (ype(s)) vi Equion 23 11: Smple ype(s i ) from he disribuion: ( Mulinomil P R P R +P Q +P V, P Q P R +P Q +P V, ) P V P R +P Q +P V 12: if > N b hen 13: Add smple ype(s) o disribuion ˆP (ype(s)) 14: end if 15: : end while Algorihm 3 ssumes h he rewrd nd vlue weighs, θ, φ Q, nd φ V re known. Ofen his is no he cse. Insed, we would like o esime boh hose rewrd nd vlue weighs s well s he ype of ech se. Algorihm 4 employs n expecion-mximizion pproch 6 o obin boh se ypes nd prmeer weighs. I obins he mximum likelihood weighs for priculr se of se beliefs in line 3 (he mximizion sep) nd clcules he expecion of se ypes given hose weigh esimes in line 4 (he expecion sep). When boh he expecion nd mximizion seps re exc, his procedure is gurneed o converge o locl opim of he likelihood funcion of rewrd/vlue weighs nd se ypes. In his seing, such gurnee cn be provided when he Mrkov chin Mone Crlo procedure of Algorihm 3 mixes well nd he rue disribuion of se ypes fcors independenly (s ssumed in line 5 of Algorihm 4 for compuionl efficiency benefis). Algorihm 4 Rewrd/vlue se nd prmeer EM procedure Require: MDP M MDP, demonsred sequences P (S, A) / policy π Ensure: Se se esime ˆP (ype(s)) nd rewrd/vlue prmeer esimes ˆθ, ˆφ h re (pproximely) locl opim. 1: s S, le P (ype(s)) be disribued uniformly rndom 2: while ype(s), ˆθ, nd ˆφ no converged do 3: Obin rewrd/vlue weighs ˆθ, ˆφ given ˆP (ype(s)) nd P (S, A) vi Algorihm 2 using Equion 18 4: Obin ˆP (ype(s)) given π nd prmeers ˆθ, ˆφ vi Algorihm 3 5: Approxime ˆP (ype(s)) s n independen disribuion wih ˆP (ype(s i )) s he mrginl disribuion of ˆP (ype(s)) 6: end while IV. DISCUSSION In his pper, we hve combined rewrd-bsed inverse opiml conrol wih vlue-bsed inverse opiml conrol s sisicl esimion sk using he principle of mximum cusl enropy. We llow ech se o influence behvior s eiher rewrd, n cion vlue, or se vlue. This llows compuionlly dvngeous vlue-bsed esimes o be employed in some porions of he decision spce while reining he sequenil resoning of cos-bsed esimes in oher porions of he se spce. When he ype of ech se is known, lerning rewrd/vlue weighs is ccomplished by convex opimizion. When se ypes re unknown, Mrkov chin Mone Crlo nd Expecion-Mximizion pproches cn be pplied wih weker gurnees. One of he grees benefis of inverse opiml conrol is he biliy o rnsfer lerned rewrd/vlue weighs o differen decision processes h re chrcerized by rewrd/vlue feures from he sme feure spce. This benefi is seemingly los by he hybrid formulion of inverse opiml conrol in his pper becuse he belief in he ype of ech se is specific o he decision process of he demonsred decision sequences nd does no rnsfer. Esiming se ypes bsed on vilble (nd rnsferble) informion, such s se-cion feures, is n imporn fuure direcion for enbling his pproch o he rnsfer seing. REFERENCES 1 P. Abbeel nd A. Y. Ng. Appreniceship lerning vi inverse reinforcemen lerning. In Proc. Inernionl Conference on Mchine Lerning, pges 1 8, C. Bker, J. Tenenbum, nd R. Sxe. Gol inference s inverse plnning. In Proceedings of he 29h nnul meeing of he cogniive science sociey, R. Bellmn. A Mrkovin decision process. Journl of Mhemics nd Mechnics, 6: , S. Boyd, L. El Ghoui, E. Feron, nd V. Blkrishnn. Liner mrix inequliies in sysem nd conrol heory. SIAM, 15, U. Chjewsk, D. Koller, nd D. Ormonei. Lerning n gen s uiliy funcion by observing behvior. In Inernionl Conference on Mchine Lerning, pges 35 42, 2001.

8 6 A. P. Dempser, N. M. Lird, nd D. B. Rubin. Mximum likelihood from incomplee d vi he EM lgorihm. Journl of he Royl Sisicl Sociey. Series B (Mehodologicl), 39(1):1 38, N. Friedmn nd D. Koller. Being Byesin bou nework srucure. Byesin pproch o srucure discovery in Byesin neworks. Mchine lerning, 50(1):95 125, W. Gilks, S. Richrdson, nd D. Spiegelhler. Mrkov chin Mone Crlo in prcice. Chpmn & Hll/CRC, P. D. Grünwld nd A. P. Dwid. Gme heory, mximum enropy, minimum discrepncy, nd robus Byesin decision heory. Annls of Sisics, 32: , P. Hr, N. Nilsson, nd B. Rphel. A forml bsis for he heurisic deerminion of minimum cos phs. IEEE Trnscions on Sysems Science nd Cyberneics, 4: , D. Heckermn, D. Geiger, nd D. Chickering. Lerning Byesin neworks: The combinion of knowledge nd sisicl d. Mchine lerning, 20(3): , E. T. Jynes. Informion heory nd sisicl mechnics. Physicl Review, 106: , R. Klmn. When is liner conrol sysem opiml? Trns. ASME, J. Bsic Engrg., 86:51 60, G. Krmer. Direced Informion for Chnnels wih Feedbck. PhD hesis, Swiss Federl Insiue of Technology (ETH) Zurich, D. Krishnmurhy nd E. Todorov. Inverse opiml conrol wih linerly-solvble MDPs. In Proc. Inernionl Conference on Mchine Lerning, pges , H. Mrko. The bidirecionl communicion heory generlizion of informion heory. In IEEE Trnscions on Communicions, pges , J. L. Mssey. Cusliy, feedbck nd direced informion. In Proc. IEEE Inernionl Symposium on Informion Theory nd Is Applicions, pges 27 30, G. Neu nd C. Szepesvári. Appreniceship lerning using inverse reinforcemen lerning nd grdien mehods. In Proc. UAI, pges , A. Y. Ng nd S. Russell. Algorihms for inverse reinforcemen lerning. In Proc. Inernionl Conference on Mchine Lerning, pges , D. Pomerleu. ALVINN: An uonomous lnd vehicle in neurl nework. Advnces in Neurl Informion Processing Sysems I, 1:305, D. Rmchndrn nd E. Amir. Byesin inverse reinforcemen lerning. In Proc. Inernionl Join Conference on Arificil Inelligence, pges , N. Rliff, J. A. Bgnell, nd M. Zinkevich. Mximum mrgin plnning. In Proc. Inernionl Conference on Mchine Lerning, pges , N. Rliff, D. Brdley, J. Bgnell, nd J. Chesnu. Boosing srucured predicion for imiion lerning. In In Advnces in Neurl Informion Processing Sysems (NIPS) 19, J. Rus. Mximum likelihood esimion of discree conrol processes. SIAM Journl on Conrol nd Opimizion, 26:1006, U. Syed nd R. Schpire. A gme-heoreic pproch o ppreniceship lerning. Advnces in Neurl Informion Processing Sysems, 20: , B. D. Ziebr. Modeling Purposeful Adpive Behvior wih he Principle of Mximum Cusl Enropy. PhD hesis, Mchine Lerning Deprmen, Crnegie Mellon Universiy, Dec B. D. Ziebr, J. A. Bgnell, nd A. K. Dey. Modeling inercion vi he principle of mximum cusl enropy. In Proc. Inernionl Conference on Mchine Lerning, pges , B. D. Ziebr, A. Ms, J. A. Bgnell, nd A. K. Dey. Mximum enropy inverse reinforcemen lerning. In Proc. Conference on Arificil Inelligence (AAAI), pges , B. D. Ziebr, A. Ms, A. K. Dey, nd J. A. Bgnell. Nvige like cbbie: Probbilisic resoning from observed conex-wre behvior. In Proc. Inernionl Conference on Ubiquious Compuing, B. D. Ziebr, N. Rliff, G. Gllgher, C. Merz, K. Peerson, J. A. Bgnell, M. Heber, A. K. Dey, nd S. Srinivs. Plnning-bsed predicion for pedesrins. In Proc. of he Inernionl Conference on Inelligen Robos nd Sysems, APPENDIX Theorem 1: The hybrid IOC opimizion of Definiion 6 cn be formuled s sofened version of he Bellmn equion 3 vi convex duliy: Q hyb (, s) = V hyb θ f s, + E P (S s,) φ Q f s, s S P (s s, ) φ V f s (s) = sofmx Q hyb (, s), V hyb (s ) if s S R if s S Q if s S V (24) wih sofmx x f(x) = log x ef(x), he rewrd funcion prmeerized by rewrd weighs θ, nd he vlue funcions prmeerized by vlue weighs φ = {φ Q, φ V }. Proof: Afer seing feures over he enire sequence s F(s, ) = T 1 =1 f s, I SQ (s T ) f st I SV (s T ) s S P (s s T, T ) f s, (25) his is corollry of Theorem Theorem 2: When se ype beliefs re independen, P (ype(s)) = s S P (ype(s)), he se-cion vlue of Equion 16 cn be re-wrien s: Q (, s) =P (s S R ) (θ f s, + EV (s ) s, ), + P (s S Q ) φ Qf s, + P (s S V ) s S P (s s, ) φ V f s (26) wih he inerpreion h ech visi o se, smple from P (ype(s)) is drwn indicing wheher he se is reed s rewrd-bsed influence, n cion vlue-bsed influence, or se vlue-bsed influence. Proof: (skech) The siuion where se ype is disribued ccording o n independen belief disribuion cn be represened wih n ugmened se of dynmics over he join of ses nd se ypes: P expnd (s, ype(s ) s, ) = P (s s, )P (ype(s )), (27) where P (s s, ) is he originl dynmics nd P (ype(s )) is he belief disribuion over ypes. Following Theorem 1 in his expnded se spce complees he proof. Theorem 3: The log probbiliy of policy under he hybrid rewrd-bsed/vlue-bsed inverse opiml conrol pproch is reled o he policy s expeced rewrds s follows: log P ( π ype(s), θ, φ) = (28) T 1 E P (S,A) θ f s, + φ f st, T V hyb (s 1), π =1 where V hyb is defined ccording o Theorem 1. Proof: This is corollry of Theorem wih feures defined ccording o Equion 25.

e t dt e t dt = lim e t dt T (1 e T ) = 1

e t dt e t dt = lim e t dt T (1 e T ) = 1 Improper Inegrls There re wo ypes of improper inegrls - hose wih infinie limis of inegrion, nd hose wih inegrnds h pproch some poin wihin he limis of inegrion. Firs we will consider inegrls wih infinie