Simultaneous estimation of rewards and dynamics from noisy expert demonstrations

Michael Herman$^{1,2}$, Tobias Gindele$^{1}$, Jörg Wagner$^{1}$, Felix Schmitt$^{1}$, and Wolfram Burgard$^{2}$
1 - Robert Bosch GmbH, Stuttgart - Germany
2 - University of Freiburg - Department of Computer Science, 79110 Freiburg - Germany

Abstract. Inverse Reinforcement Learning (IRL) describes the problem of learning an unknown reward function of a Markov Decision Process (MDP) from demonstrations of an expert. Current approaches typically require the system dynamics to be known or additional demonstrations of state transitions to be available to solve the inverse problem accurately. If these assumptions are not satisfied, heuristics can be used to compensate for the lack of a model of the system dynamics. However, heuristics can add bias to the solution. To overcome this, we present a gradient-based approach, which simultaneously estimates rewards, dynamics, and the parameterizable stochastic policy of an expert from demonstrations, while the stochastic policy is a function of optimal Q-values.

1 Introduction

The growing number of autonomous systems requires efficient methods to adjust the system to new environments and tasks. Learning from demonstration offers methods to parameterize desired behavior and can be split into two subfields: Behavioral Cloning and Inverse Reinforcement Learning (IRL). Behavioral Cloning estimates a policy from demonstrations and therefore mimics the expert directly. Especially if the environment or its dynamics change, pretrained policies can be inappropriate. Therefore, IRL [1] has been introduced, which describes the problem of recovering a reward function from demonstrations, as the reward function encodes the expert's goal. Approaches have been proposed which solve the IRL problem under various assumptions, e.g. [2, 3, 4, 5]. The cited approaches require the true system dynamics to be known. Inaccurate transition models can bias the reward estimate. Since the system dynamics are often unknown, model-free IRL algorithms have been proposed, such as in [6, 7, 8]. Typically, those approaches require access to additional observations of transitions. If these cannot be obtained, the approaches tend to suffer from wrong generalizations due to heuristics.

Often, experts are unable to produce optimal demonstrations. As a consequence, IRL approaches are necessary that deal with stochastic behavior. In [9, 10, 11], stochastic policies of maximum (causal) entropy are trained under the constraint of matching feature expectations. This causes the stochastic policy to be a Boltzmann distribution over soft Q-values. However, if the expert's stochastic policy follows a different type of distribution, these approaches can be inappropriate.

Our contribution is to generalize IRL to the case of unknown dynamics and unknown stochastic policies. We propose an approach that simultaneously optimizes rewards, dynamics, and the expert's stochastic policy by maximizing the posterior probability of the demonstrations. Even though many transitions have never been observed, they influenced the expert's policy and can therefore to some degree be inferred from demonstrations. The expert's stochastic policy is modeled as a parametric function of optimal Q-values, which assumes that the expert is able to correctly estimate the value of different actions, but is unable to choose them appropriately. We provide a gradient-based solution and evaluate our approach on a synthetic gridworld satellite navigation task.

2 Fundamentals

An MDP is a tuple $M = \{S, A, P(s'|s,a), \gamma, P(s_0), R\}$, where $S$ is the state space with states $s \in S$, $A$ is the action space with actions $a \in A$, $P(s'|s,a)$ is the probability of a transition to $s'$ when action $a$ is applied in state $s$, $\gamma \in [0, 1)$ is a discount factor, $P(s_0)$ is a start state probability distribution, and $R : S \times A \to \mathbb{R}$ is a reward function which assigns a real-valued reward for picking action $a$ in state $s$. Often, this reward is expressed as a linear function $R(s,a) = \theta_R^T f(s,a)$ of state- and action-dependent features $f : S \times A \to \mathbb{R}^d$ with feature weights $\theta_R$. The goal of an MDP is to find an optimal policy $\pi^*(s) \in A$, which specifies state-dependent actions, such that its execution maximizes the expected, discounted, cumulated reward $E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, \pi\right]$. The optimal value function can be computed by value iteration, which repeatedly applies Eq. (1) to an arbitrary initial Q-function. After convergence, the optimal policy chooses the actions with the largest Q-value: $\pi^*(s) = \arg\max_a Q^*(s,a)$.

$$Q(s,a) = R(s,a) + \gamma \sum_{s' \in S} \left[ P(s'|s,a) \max_{a'} Q(s',a') \right] \tag{1}$$

3 Simultaneous Estimation of Rewards, Dynamics, and Stochastic Policy (SERD-SP)

We propose an approach, called Simultaneous Estimation of Rewards, Dynamics, and Stochastic Policy (SERD-SP), to account for problems where neither the rewards, the dynamics, nor the expert's stochastic policy $\pi(s,a) = P(a|s)$ is known. Since the expert's estimate of the transition model may differ from the real one, we introduce independent models. Additionally, we assume that there exists a parameterizable stochastic mapping $\pi = g(Q^*)$ from optimal Q-values to the stochastic policy of the expert. Then, the problem can be formalized as:

Determine:
- Expert's reward function $R(s,a)$
- Expert's estimate of the dynamics $P_A(s'|s,a)$
- Real dynamics $P(s'|s,a)$
- Stochastic policy mapping $g(Q^*)$

Given:
- MDP $M \setminus \{R, P(s'|s,a), P_A(s'|s,a)\}$ without rewards and dynamics
- Demonstrations $D = \{\tau_1, \tau_2, \ldots, \tau_N\}$ with trajectories $\tau = \{(s_0^\tau, a_0^\tau), (s_1^\tau, a_1^\tau), \ldots, (s_{T_\tau}^\tau, a_{T_\tau}^\tau)\}$ of an expert acting in $M$ based on a policy that depends on $R(s,a)$, $P_A(s'|s,a)$, and $g(Q^*)$

A set of parameters $\theta$ of the rewards, dynamics, and the stochastic policy is introduced, which should be estimated from the given demonstrations $D$:
- $\theta_R$: Feature weights of the reward function $R(s,a)$
- $\theta_{TA}$: Parameters of the expert's transition model $P_{\theta_{TA}}$
- $\theta_T$: Parameters of the real transition model $P_{\theta_T}$
- $\theta_P$: Parameters of the expert's stochastic policy mapping $g(Q^*)$

We propose to maximize the posterior probability of the demonstrations with respect to the parameters $\theta = (\theta_R, \theta_{TA}, \theta_T, \theta_P)$. Assuming independent trajectories, the likelihood of the demonstrations in $D$ can be expressed as

$$P(D \mid M, \theta) = \prod_{\tau \in D} P(s_0^\tau) \prod_{t=0}^{T_\tau - 1} \left[ \pi_\theta(s_t^\tau, a_t^\tau) \, P_{\theta_T}(s_{t+1}^\tau \mid s_t^\tau, a_t^\tau) \right]. \tag{2}$$

It should be noted that the policy $\pi_\theta(s,a)$ depends on the parameters $\theta_R$, $\theta_{TA}$, and $\theta_P$. In contrast, the transition model $P_{\theta_T}(s'|s,a)$ only depends on $\theta_T$. Then, the maximum a posteriori estimator of the parameters can be formulated:

$$\theta^* = \arg\max_\theta \; \log P(D \mid M, \theta) + \log P(\theta). \tag{3}$$

We propose a gradient-based method to optimize the parameters according to Eq. (3) with $L_\theta(D) = \log P(D \mid M, \theta) + \log P(\theta)$:

$$\nabla_\theta L_\theta(D) = \sum_{\tau \in D} \sum_{t=0}^{T_\tau - 1} \left[ \nabla_\theta \log \pi_\theta(s_t^\tau, a_t^\tau) + \nabla_\theta \log P_{\theta_T}(s_{t+1}^\tau \mid s_t^\tau, a_t^\tau) \right] + \nabla_\theta \log P(\theta). \tag{4}$$

Since the system dynamics and the prior are problem-dependent, the following derivations will focus on the partial derivative $\frac{\partial}{\partial \theta_i} \log \pi_\theta(s_t^\tau, a_t^\tau)$. This requires the stochastic policy mapping $\pi = g(Q^*)$ of the expert to be specified. As an example, we derive the gradient for a Boltzmann policy with temperature $\theta_P$:

$$\pi_\theta(s,a) = g(Q^*)(s,a) = \frac{\exp\left(\frac{1}{\theta_P} Q^*(s,a)\right)}{\sum_{a' \in A} \exp\left(\frac{1}{\theta_P} Q^*(s,a')\right)}.$$

Then, the partial derivative of the log policy $\log \pi_\theta(s,a)$ results in:

$$\frac{\partial}{\partial \theta_i} \log \pi_\theta(s,a) = \begin{cases} \dfrac{1}{\theta_P} \left( \dfrac{\partial}{\partial \theta_i} Q^*(s,a) - E_{\pi_\theta(s,a')}\left[ \dfrac{\partial}{\partial \theta_i} Q^*(s,a') \right] \right) & \text{if } \theta_i \neq \theta_P \\[2ex] \dfrac{1}{\theta_P^2} \left( E_{\pi_\theta(s,a')}\left[ Q^*(s,a') \right] - Q^*(s,a) \right) & \text{if } \theta_i = \theta_P \end{cases} \tag{5}$$
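The value iteration of Eq. (1) and the temperature case of Eq. (5) can be illustrated with a minimal sketch. It assumes a tabular MDP and a Boltzmann policy of the form $\pi(a|s) \propto \exp(Q^*(s,a)/\theta_P)$. The function names, the tolerance, and the two-state toy MDP are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def value_iteration(R, P, gamma, tol=1e-10, max_iter=10_000):
    """Repeat the update of Eq. (1) until the largest Q change is below tol.

    R[s][a] is the reward, P[s][a][s2] the transition probability to s2.
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        V = [max(row) for row in Q]  # V(s') = max_a' Q(s', a')
        Q_new = [
            [R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)
        ]
        delta = max(abs(Q_new[s][a] - Q[s][a])
                    for s in range(n_states) for a in range(n_actions))
        Q = Q_new
        if delta < tol:
            break
    return Q


def boltzmann_policy(q_row, temperature):
    """pi(a | s) proportional to exp(Q(s, a) / temperature), for one state."""
    shift = max(q_row)  # subtracting the max avoids overflow, result unchanged
    weights = [math.exp((q - shift) / temperature) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]


def dlogpi_dtemperature(q_row, temperature):
    """Temperature case of Eq. (5): (E_pi[Q] - Q(s, a)) / temperature**2."""
    pi = boltzmann_policy(q_row, temperature)
    expected_q = sum(p * q for p, q in zip(pi, q_row))
    return [(expected_q - q) / temperature ** 2 for q in q_row]


# Hypothetical 2-state, 2-action MDP (not the gridworld used in the paper).
R = [[0.0, 1.0], [2.0, 0.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
Q = value_iteration(R, P, gamma=0.9)
pi_s0 = boltzmann_policy(Q[0], temperature=2.0)
grad_s0 = dlogpi_dtemperature(Q[0], temperature=2.0)
```

The analytic derivative in `dlogpi_dtemperature` can be compared against a central finite difference of `math.log(boltzmann_policy(...))`, which is a cheap sanity check before plugging the term into the full gradient of Eq. (4).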

The gradient of the policy depends on the gradient of the state-action value function $Q^*(s,a)$. Since we assume that the expert chooses actions based on an optimal, greedy value function, the derivative of the Q-function from Eq. (1) has to be computed. This can result in a sub-derivative, as the max-function is not differentiable. Nevertheless, for the sake of simplicity, we call it Q-gradient.

$$\frac{\partial}{\partial \theta_i} Q^*(s,a) = \frac{\partial}{\partial \theta_i} \theta_R^T f(s,a) + \gamma \sum_{s' \in S} \left[ \frac{\partial}{\partial \theta_i} P_{\theta_{TA}}(s'|s,a) \right] V^*(s') + \gamma \sum_{s' \in S} P_{\theta_{TA}}(s'|s,a) \, \frac{\partial}{\partial \theta_i} Q^*(s', \pi^*(s')) \tag{6}$$

Here $V^*(s') = \max_{a'} Q^*(s',a')$. Eq. (6) shares similarities with the approach from Neu and Szepesvári [3]. It is a linear equation system and can be solved directly. However, since it is a fixed point equation, repeatedly applying Eq. (6) to an arbitrary Q-gradient will converge to the true one. Especially in large state and action spaces, this Q-gradient iteration can require fewer computations than directly solving the linear equation system. Algorithm 1 summarizes the proposed algorithm.

Algorithm 1: SERD algorithm
Require: MDP $M \setminus \{R, P_{\theta_T}, P_{\theta_{TA}}, g(Q^*)\}$, demonstrations $D$, initial $\theta_0$, step size $\alpha : \mathbb{N}^+ \to \mathbb{R}^+$
  $t \leftarrow 0$
  while not sufficiently converged do
    $Q^* \leftarrow$ QIteration($M$, $\theta_t$)    (Eq. (1))
    $\pi_\theta \leftarrow$ DerivePolicy($M$, $Q^*$)    (Eq. (5))
    $dQ^* \leftarrow$ ComputeQGradient($M$, $Q^*$, $\pi_\theta$, $\theta_t$)    (Eq. (6))
    $dL_\theta(D) \leftarrow$ ComputeGradient($M$, $D$, $dQ^*$)    (Eq. (4))
    $\theta_{t+1} \leftarrow \theta_t + \alpha(t) \, dL_\theta(D)$
    $t \leftarrow t + 1$
  end while

4 Evaluation

We evaluate the proposed approach in a satellite gridworld navigation task, which is illustrated in Fig. 1. The motion dynamics are stochastic and differ in the forest and on open terrain. The action space allows the agent to choose from five different actions: moving in one of four directions (north, east, south, or west) or remaining in the state. Possible successor states are the four neighbouring ones or the current one. On the open terrain (depicted in light gray in Fig. 1 (c)), the agent has a probability of 0.8 to successfully execute the desired motion and 0.1 each to fall to the right or to the left. In the forest (depicted in dark gray in Fig. 1 (c)), successful motions only occur with a probability of 0.3. The remaining successor states account for the remaining probability mass. Staying in a state is always successful in both forest and open terrain.

Due to this definition of the motion dynamics, the agent has to trade off between short cuts through the forest, which are less likely to be successful, or longer paths on
open terrain. The reward is a function of two features, which are weighted by $\theta_R = (6, 6)$. The first feature encodes the normalized gray scale value $[0, 1]$ of the image, while the second one is a goal indicator $\{0, 1\}$. The discount is 0.99 and the temperature of the Boltzmann policy was set to $\theta_P = 2$. We compute the optimal Q-function and sample trajectories from the resulting stochastic policy to obtain expert demonstrations. We assume that the expert has knowledge about the true transition model. Therefore, the parameters of the transition model $\theta_{TA}$ and $\theta_T$ are identical. The system dynamics are modeled as energies of Boltzmann distributions. Since there exist 4 motion actions in each of forest and open terrain, as well as one staying action, 9 models are trained with 5 possible outcomes, resulting in 45 parameters. An m-estimator with uniform prior is used to estimate the dynamics from demonstrations before applying SERD-SP or alternative IRL approaches. The feature weights are initialized randomly ($\forall i : \theta_i \in [-10, 10]$). We use Gaussian priors for the feature weights and the policy parameter. The prior of the dynamics favors high entropies. We optimize all parameters for various sizes of demonstration sets with SERD-SP and compare it to the result of Maximum Discounted Causal Entropy IRL [11] (MDCE IRL), and Relative Entropy IRL [6] (REIRL). The additional samples, which are needed by REIRL, are sampled from the m-estimated transition model. Fig. 2 summarizes the results.

Fig. 1: (a) Environment, Map data: Google. (b) Discretized state space (Goal: green. Initial states: red.). (c) Forest states are indicated in dark gray and open terrain in light gray. (d) Reward. (e) Value function. (f) Expected state frequency.

Fig. 2: (a) Log likelihood of the demonstrations: median with quartiles of the log likelihood of demonstrations drawn from the true model under the estimated model, plotted over the number of demonstrations for SERD-SP, MDCE IRL, and REIRL. (b) KL divergence of the transition model: average Kullback-Leibler divergence between the estimated dynamics and the true ones, plotted over the number of demonstrations.

The median log likelihood of demonstrations from the true model on the learned ones in Fig. 2 (a) shows that SERD-SP outperforms
the other algorithms, while being sample efficient. This result is understandable, as the comparative approaches model different types of stochastic policies. In addition, Fig. 2 (b) illustrates that SERD-SP further optimizes the initially m-estimated dynamics, which results in more accurate models.

5 Conclusion

In this paper, we presented a gradient-based solution for simultaneous estimation of rewards, dynamics, as well as the expert's stochastic policy. We assume that the expert is able to compute an optimal Q-function, but executes suboptimal actions. This stochasticity is modeled by a parameterizable function of optimal Q-values. The evaluation shows improved performance against traditional IRL methods with more accurate policies and dynamics. Future work could elaborate on different types of stochastic policies and on the case that the agent's estimate of the dynamics differs from the true one.

References

[1] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[2] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, New York, NY, USA, 2004. ACM.
[3] Gergely Neu and Csaba Szepesvári. Apprenticeship learning using inverse reinforcement learning and gradient methods. In UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007.
[4] Deepak Ramachandran and Eyal Amir. Bayesian Inverse Reinforcement Learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.
[5] Constantin A. Rothkopf and Christos Dimitrakakis. Preference elicitation and inverse reinforcement learning. In ECML/PKDD (3), volume 6913 of Lecture Notes in Computer Science. Springer, 2011.
[6] Abdeslam Boularias, Jens Kober, and Jan Peters. Relative entropy inverse reinforcement learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), 2011.
[7] Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin. Inverse Reinforcement Learning through Structured Classification. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe (NV, USA), December 2012.
[8] Edouard Klein, Bilal Piot, Matthieu Geist, and Olivier Pietquin. A cascaded supervised learning approach to inverse reinforcement learning. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), Prague (Czech Republic), September 2013.
[9] Brian D. Ziebart, Andrew Maas, J. Andrew (Drew) Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI 2008, July 2008.
[10] Brian D. Ziebart, J. Andrew Bagnell, and Anind K. Dey. Modeling interaction via the principle of maximum causal entropy. In Proc. of the International Conference on Machine Learning, 2010.
[11] Michael Bloem and Nicholas Bambos. Infinite time horizon maximum causal entropy inverse reinforcement learning. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014.
