A Variance Analysis for POMDP Policy Evaluation


Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Mahdi Milani Fard and Joelle Pineau
School of Computer Science, McGill University, Montreal, Canada

Peng Sun
Fuqua School of Business, Duke University, Durham, USA

Abstract

Partially Observable Markov Decision Processes have been studied widely as a model for decision making under uncertainty, and a number of methods have been developed to find the solutions for such processes. Such studies often involve calculation of the value function of a specific policy, given a model of the transition and observation probabilities, and the reward. These models can be learned using labeled samples of on-policy trajectories. However, when using empirical models, some bias and variance terms are introduced into the value function as a result of imperfect models. In this paper, we propose a method for estimating the bias and variance of the value function in terms of the statistics of the empirical transition and observation model. Such error terms can be used to meaningfully compare the value of different policies. This is an important result for sequential decision-making, since it will allow us to provide more formal guarantees about the quality of the policies we implement. To evaluate the precision of the proposed method, we provide supporting experiments on problems from the field of robotics and medical decision making.

Introduction

It is common in the context of Markov Decision Processes (MDPs) to calculate the value function of a specific policy, based on some transition and reward model. When the model is not known a priori, one can compute an empirical model from some sample on-policy trajectories using a basic frequentist approach and then use this model (along with Bellman's equation) to calculate the value function of the target policy. Using imperfect models however will introduce some error in the estimated value function.
As a general practice with learning methods, we might want to know how good this estimate of the value function is, given the error in the estimated models. This can be expressed in terms of bias and variance of the calculated value function. The variability of the value function may have two different sources. One is the stochastic nature of MDPs (internal variance), and the other is the error due to the use of the imperfect empirical model instead of the true model (parametric variance). Internal variance and its reduction have been studied in several works (Greensmith, Bartlett, & Baxter 2004). Here we are mostly interested in finding the latter. Mannor et al. (2004) showed that when the empirical model is reasonably close to the true model, we can use a second order approximation to calculate these terms in the value function of an MDP.

In this paper we extend these ideas to the context of Partially Observable Markov Decision Processes (POMDPs) and derive similar expressions for the bias and variance terms. This is an important result for the deployment of autonomous decision-making systems in real-world domains since it is well-known that POMDPs are a more realistic model of decision-making than MDPs (because they allow partial state observability). It is worth noting that approximation methods for POMDPs have made large leaps in recent years; and while these approaches consistently assume a perfect model of the domain, in real-world applications, these models must often be estimated from data. The method outlined in this paper can be used to assess when we have gathered sufficient data to have a good estimate of the value function. The method can also be used to assess whether we can confidently select one policy over another. Finally, the method can be used to define classes of equivalent policies. These are useful considerations when developing expert systems, especially for critical applications such as human-robot interaction and medical decision-making.

Copyright © 2008, Association for the Advancement of Artificial Intelligence. All rights reserved.
Background

In this section we define the model and notation that will be used in the following sections.

Partially Observable Markov Decision Process

We consider a finite-state, finite-action, discounted reward POMDP (Sondik 1971; Cassandra, Kaelbling, & Littman 1994):

$S$: finite set of states
$A$: finite set of actions
$Z$: finite set of observations
$R^a$: $|S|$ dimensional vector of rewards when selecting action $a$
$T^a$: $|S| \times |S|$ dimensional matrix of transition probabilities when selecting action $a$

$O^a$: $|S| \times |Z|$ dimensional matrix of observation probabilities when selecting action $a$
$\gamma$: discount factor

It is well known that the value function of the optimal policy of a POMDP in the finite horizon is a convex piecewise linear function of the belief state (Sondik 1971). It is often convenient to use a finite-horizon approximation in the infinite horizon case. Thus, we work only with piecewise linear value functions.

Finite State Controller and Value Function

Sondik (1971) points out that an optimal policy for a finite-horizon POMDP can be represented as an acyclic finite-state controller in which each of the machine states represents a linear piece (or the corresponding alpha vector) in the piecewise linear value function. The state of the controller is based on the observation history and the action of the agent will only be based on the state of the controller. For deterministic policies, each machine state $i$ issues an action $a(i)$ and then the controller transitions to a new machine state according to the received observation. This finite-state controller is usually represented as a policy graph. An example of a policy graph for a POMDP dialog manager is shown in Fig 2.

Cassandra, Kaelbling, & Littman (1994) state that dynamic programming algorithms for infinite-horizon POMDPs, such as value iteration, sometimes converge to an optimal piecewise linear value function that is equivalent to a cyclic finite-state controller. In the case that the optimal value function is not piecewise linear, it is still possible to find an approximate or suboptimal finite-state controller (Poupart & Boutilier 2003).

Given a finite-state controller for a policy, we can extract the value function of the POMDP using a linear system of equations. To extract the $i$th linear piece of the POMDP value function, we calculate the value of each POMDP state over that linear piece.
For each machine state $i$ (corresponding to a linear piece), and each POMDP state $s$, the value of $s$ over the $i$th linear piece is:

$$v_i(s) = r(s, a(i)) + \gamma \sum_{s', z} T^{a(i)}(s, s')\, O^{a(i)}(s', z)\, v_{l(i,z)}(s'),$$

where $r(s, a)$ is the immediate reward and $l(i, z)$ is the next machine state from state $i$ given observation $z$ (Hansen 1998). We can rewrite the above system of equations in matrix form using the following definitions:

$K$: finite set of machine states in the policy graph
$v_k$ for $k \in K$: $|S|$ dimensional vector of coefficients representing a linear piece in the value function
$V$: $|S||K|$ dimensional vector, vertical concatenation of the $v_k$'s, representing the POMDP value function
$a(k)$ for $k \in K$: the action associated with machine state $k$ according to the fixed policy
$r_k = R^{a(k)}$ for $k \in K$: $|S|$ dimensional vector of coefficients representing a linear piece in the piecewise linear immediate reward function
$R$: $|S||K|$ dimensional vector, concatenation of the $r_k$'s
$T$: $|S||K| \times |S||K|$ dimensional block diagonal matrix of $|K| \times |K|$ blocks, with $T^{a(k)}$ as the $k$th diagonal sub-matrix
$O$: $|S||K| \times |Z||S||K|$ dimensional block diagonal matrix of $|K| \times |K|$ blocks. Each diagonal block is a $|S| \times |Z||S|$ block diagonal sub-matrix of $|S| \times |S|$ sub-blocks. Each sub-block is therefore a $|Z|$ dimensional row vector. The $k$th block, $s$th sub-block contains the $s$th row of $O^{a(k)}$.
$\Pi$: $|Z||S||K| \times |S||K|$ dimensional block matrix of $|K| \times |K|$ blocks. Each block $\Pi_{k_1 k_2}$ is itself a $|Z||S| \times |S|$ block diagonal sub-matrix of $|S| \times |S|$ sub-blocks. Each sub-block is therefore a $|Z|$ dimensional vector. For all $s$, the $z$th component of the $s$th diagonal block of the $(k_1, k_2)$ sub-matrix, $[(\Pi_{k_1 k_2})_s]_z$, is equal to 1 if $k_2$ is the succeeding index of the machine state when the machine state is $k_1$ and the observation is $z$, and 0 otherwise. This matrix represents the transition function $l(i, z)$ of the finite-state controller, that is, the arcs in the policy graph.

We can write the system of equations representing the value of policy $\pi$ in the following matrix form:

$$V^{\pi} = R + \gamma T O \Pi^{\pi} V^{\pi}, \tag{1}$$

leading to:

$$V^{\pi} = (I - \gamma T O \Pi^{\pi})^{-1} R. \tag{2}$$

The above equation can be used to calculate the value function of a given policy, if the models for $T$, $O$ and $R$ are known. This equation is at the core of most policy iteration algorithms for POMDPs (Hansen 1997; 1998), including one of the most recent highly successful approximation methods (Ji et al. 2007). Thus having confidence intervals over the calculated value function might be of great use in such algorithms.

Model Error

Given a POMDP (as defined in the previous section), a fixed policy and a set of labeled on-policy trajectories, one can use a frequentist approach to calculate the models for $T$, $O$ and $R$. The assumption of having training data with known labeled states is a strong assumption and in many POMDP domains may not be plausible. However, it is still more practical than the assumption of having exact true models of $T$, $O$ and $R$. In the case where EM-type algorithms are used to label the data (Koenig & Simmons 1996), the derivation of the estimates with the above assumption is not exactly correct, but might still provide a useful guide to compare competing policies.

Here we focus on the case in which the model for immediate reward is known, while $T$ and $O$ are estimated from data. The method can be further extended to the case where rewards are also estimated from data. If action $a$ is used $N^a_i$ times in state $s_i$, from which there were $N^a_{ij}$ transitions to $s_j$, we can write down the empirical transition probability from $s_i$ to $s_j$ given action $a$ as:

$$\hat{T}^a(i, j) = \frac{N^a_{ij}}{N^a_i}. \tag{3}$$
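As a rough illustration of Eqn 3 and of the per-piece value equation given earlier in this section, the following Python sketch estimates model rows from counts and evaluates a policy graph. It is only a sketch under stated assumptions: the function names, the nested-list layout (`T[a][s][s2]`, `O[a][s2][z]`, `r[a][s]`, `action[i]` for $a(i)$, `succ[i][z]` for $l(i, z)$) and the toy counts are illustrative and do not come from the paper. It also iterates the fixed-point form of the value equation instead of inverting the matrix in Eqn 2; for $\gamma < 1$ both give the same values.

```python
def empirical_rows(counts):
    """Row-normalise a table of counts (Eqn 3 style frequency estimate)."""
    rows = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("row has no samples, so its estimate is undefined")
        rows.append([c / total for c in row])
    return rows


def evaluate_controller(T, O, r, action, succ, gamma, tol=1e-10, max_iter=100000):
    """Iterate v_i(s) = r(s,a(i)) + gamma * sum_{s',z} T O v_{l(i,z)}(s').

    Returns v[i][s], one linear piece per machine state i.
    """
    n_nodes = len(action)
    n_states = len(T[0])
    v = [[0.0] * n_states for _ in range(n_nodes)]
    for _ in range(max_iter):
        delta = 0.0
        new_v = []
        for i in range(n_nodes):
            a = action[i]
            piece = []
            for s in range(n_states):
                future = 0.0
                for s2 in range(n_states):
                    for z, p_obs in enumerate(O[a][s2]):
                        future += T[a][s][s2] * p_obs * v[succ[i][z]][s2]
                value = r[a][s] + gamma * future
                delta = max(delta, abs(value - v[i][s]))
                piece.append(value)
            new_v.append(piece)
        v = new_v
        if delta < tol:  # stop once the largest update is negligible
            break
    return v


# Toy example (hypothetical numbers): one action, two states, two observations.
T_hat = [empirical_rows([[9, 1], [2, 8]])]    # transition counts for the single action
O_hat = [empirical_rows([[17, 3], [4, 16]])]  # observation counts for the single action
rewards = [[1.0, 0.0]]                        # reward model assumed known
values = evaluate_controller(T_hat, O_hat, rewards,
                             action=[0], succ=[[0, 0]], gamma=0.9)
```

The iterative form avoids building the large block matrices $O$ and $\Pi$ explicitly, which keeps the sketch short; a linear solve of Eqn 2 would be the more direct choice when a matrix library is available.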

A similar method can be used with the observation model. If there were $M^a_i$ transitions to $s_i$ after action $a$, and $z_j$ was observed in $M^a_{ij}$ of them, the empirical model of observation probabilities would be:

$$\hat{O}^a(i, j) = \frac{M^a_{ij}}{M^a_i}. \tag{4}$$

From these empirical models we can create the $\hat{T}$ and $\hat{O}$ models as defined in the previous section. Our training data has a finite number of samples, and therefore these empirical models are likely to be imperfect, containing error terms $\tilde{T}$ and $\tilde{O}$. We therefore have:

$$\hat{T} = T + \tilde{T}, \qquad \hat{O} = O + \tilde{O}. \tag{5}$$

As we used a simple frequentist approach to calculate the empirical models, we can assume independence of errors in the following manner: different rows in $\hat{T}$ and $\hat{O}$ are independent from each other, and each row is drawn from a multinomial distribution. Considering statistical properties of the multinomial distribution, we know that the expected errors are zero and independent:

$$E[\tilde{T}] = E[\tilde{O}] = E[\tilde{T}\tilde{O}] = 0. \tag{6}$$

We can write the covariance of the $i$th row of $\hat{T}^a$ (denoted $\hat{T}^a_{(i)}$) as:

$$\mathrm{cov}(T^a_{(i)}) = \frac{1}{N^a_i}\left(\mathrm{diag}(\hat{T}^a_{(i)}) - (\hat{T}^a_{(i)})^{\top} \hat{T}^a_{(i)}\right), \tag{7}$$

where $\mathrm{diag}(\hat{T}^a_{(i)})$ is a diagonal matrix with $\hat{T}^a_{(i)}$ along the diagonal. Similarly for $\hat{O}^a_{(i)}$ we have:

$$\mathrm{cov}(O^a_{(i)}) = \frac{1}{M^a_i}\left(\mathrm{diag}(\hat{O}^a_{(i)}) - (\hat{O}^a_{(i)})^{\top} \hat{O}^a_{(i)}\right). \tag{8}$$

Using the above derivations and the definition of the $T$ and $O$ matrices from the previous section, it is straightforward to calculate the four dimensional covariance matrices of $\tilde{T}$ and $\tilde{O}$ in terms of $\mathrm{cov}(T^a_{(i)})$ and $\mathrm{cov}(O^a_{(i)})$. With $\tilde{T}$ and $\tilde{O}$ being zero mean variables, the covariance matrices will be:

$$\mathrm{cov}(T(i, j), T(k, l)) = E[\tilde{T}(i, j)\tilde{T}(k, l)], \tag{9}$$

$$\mathrm{cov}(O(i, j), O(k, l)) = E[\tilde{O}(i, j)\tilde{O}(k, l)]. \tag{10}$$

These terms capture the variance in the empirical models. The interesting question that arises is how these errors in the empirical models impact our estimate of the value function.

Calculation of Bias and Variance

If we use the empirical models instead of the true models to calculate the value of a given policy $\pi$, we will have:

$$\hat{V}^{\pi} = (I - \gamma \hat{T}\hat{O}\Pi^{\pi})^{-1} R. \tag{11}$$

To simplify the notation, we will drop the $\pi$ superscript in the later derivations.
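Before continuing with the derivation, the multinomial row covariances of Eqns 7 and 8 can be made concrete with a short Python sketch. The helper name `multinomial_row_covariance` and the plain-list representation are assumptions made for illustration; the formula itself is the one stated above, (diag(p) - pᵀp) / n for an empirical row p estimated from n samples.

```python
def multinomial_row_covariance(p_hat, n):
    """Covariance matrix of one empirical row, (diag(p) - p^T p) / n."""
    if n <= 0:
        raise ValueError("need at least one sample to estimate a covariance")
    m = len(p_hat)
    cov = []
    for j in range(m):
        row = []
        for k in range(m):
            diagonal_term = p_hat[j] if j == k else 0.0
            row.append((diagonal_term - p_hat[j] * p_hat[k]) / n)
        cov.append(row)
    return cov


# Hypothetical row estimated from 4 samples split evenly over two outcomes.
example_cov = multinomial_row_covariance([0.5, 0.5], 4)
```

Each row of the result sums to zero because the entries of a probability row are constrained to sum to one, and every entry shrinks as 1/n, which is why the variance estimates tighten with more data.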
Eqn 11 can be rewritten as:

$$\hat{V} = (I - \gamma (T + \tilde{T})(O + \tilde{O})\Pi)^{-1} R. \tag{12}$$

Now using Taylor expansion and matrix manipulation (Mannor et al. 2007), we can re-write the above expression as:

$$\hat{V} = \sum_{k=0}^{\infty} \gamma^k f_k(\tilde{T}, \tilde{O}) R, \tag{13}$$

where

$$X = (I - \gamma T O \Pi)^{-1}, \tag{14}$$

$$f_k(\tilde{T}, \tilde{O}) = \left(X(\tilde{T} O \Pi + T \tilde{O} \Pi + \tilde{T}\tilde{O}\Pi)\right)^k X. \tag{15}$$

We will use the above derivation to approximate the expectation of the calculated value function:

$$E[\hat{V}] = E\left[\sum_{k=0}^{\infty} \gamma^k f_k(\tilde{T}, \tilde{O}) R\right]. \tag{16}$$

Because the exact expression of the above equation cannot be further simplified, we consider a second order approximation instead. The expectation of the value function then becomes:

$$E[\hat{V}] = XR + \gamma E[f_1] R + \gamma^2 E[f_2] R. \tag{17}$$

As $\tilde{O}$ and $\tilde{T}$ are zero mean and independent, $E[f_1(\tilde{T}, \tilde{O})]$ will be 0. By substituting $X$, the above expression becomes:

$$E[\hat{V}] = V + \gamma^2 E[f_2(\tilde{T}, \tilde{O})] R, \tag{18}$$

which shows that the calculated value function is expected to have some non-zero bias term. Using a similar approximation, we can write down the second moment of the value function as:

$$E[\hat{V}\hat{V}^{\top}] = VV^{\top} + \gamma^2 E[f_1 R R^{\top} f_1^{\top}] + \gamma^2 E[f_0 R R^{\top} f_2^{\top}] + \gamma^2 E[f_2 R R^{\top} f_0^{\top}]. \tag{19}$$

The covariance matrix will therefore be:

$$E[\hat{V}\hat{V}^{\top}] - E[\hat{V}]E[\hat{V}]^{\top} = \gamma^2 E[f_1 R R^{\top} f_1^{\top}]. \tag{20}$$

Substituting $f_1$ with the definition we get:

$$\mathrm{cov}(\hat{V}) = \gamma^2 X E[\tilde{T} O \Pi V V^{\top} \Pi^{\top} O^{\top} \tilde{T}^{\top}] X^{\top} + \gamma^2 X T E[\tilde{O} \Pi V V^{\top} \Pi^{\top} \tilde{O}^{\top}] T^{\top} X^{\top}. \tag{21}$$

We will approximately calculate the above expression by substituting the true models with our empirical models (which is a standard classical approach). We also require the following lemma:

Lemma 1. Let $Q$ be an $n \times n$ dimensional matrix:

$$Q = A X A^{\top}, \tag{22}$$

where $A$ is an $n \times m$ matrix of zero mean random variables and $X$ is a constant matrix of $m \times m$ dimensions. The $ij$th entry of $E[Q]$ is equal to:

$$E\left[\sum_{k,l} A_{ik} X_{kl} A^{\top}_{lj}\right] = \sum_{k,l} X_{kl} E[A_{ik} A_{jl}] = \sum_{k,l} X_{kl}\, \mathrm{cov}(A_{ik}, A_{jl}), \tag{23}$$

which is only dependent on the four dimensional covariance of the matrix $A$.

By applying Lemma 1 to Eqn 21, we can calculate the covariance of the calculated value function using the covariance of $\tilde{T}$ and $\tilde{O}$ defined in the previous section. In summary, we propose a second order approximation to estimate the expected error in the value function, in terms of the expected error in the empirical models. Using similar calculations, we can also calculate the bias as defined by Eqn 18 (the derivation will appear in a longer version of this paper; in most cases this term is much smaller than the variance).

Experiment and Results

The purpose of this section is two-fold. First, we wish to evaluate the approximations used when deriving our estimate of the variance in the value function. Second, we wish to illustrate how the method can be used to compare different policies for a given task.

POMDP dialog manager

We begin by evaluating the method on synthetic data from a human-robot dialog task. The use of POMDP-based dialog managers is well-established (Doshi & Roy 2007; Williams & Young 2006). However, it is often not easy to get training data in human-robot interaction domains. With small training sets, error terms tend to be important. Estimates of the error variance will therefore be helpful to evaluate and compare policies.

Here we focus on a small simulated problem which requires evaluating dialog policies for the purpose of acquiring motion goals from the user. We presume a human operator is instructing an assistive robot to move to one of two locations (e.g. bedroom or bathroom). While the human intent (i.e. the state) is one of these goals, the observation received by the robot (presumably through a speech recognizer) might be incorrect. The robot has the option to ask again to ensure the goal was understood correctly. Note however that the human may change his/her intent (the state) with a small probability. Fig 1 shows a model of the described situation.

In the generative model (used to provide the training data), we assume the probability of a wrong observation is 0.15 and the human might change goals with probability [value missing]. If the robot acts as requested, it gets a reward of 10; otherwise it gets a penalty of 40.
There is a small penalty of 1 when asking for clarification. We assume $\gamma$ = [value missing]. Fig 2 shows a policy graph for the described POMDP dialog manager. This policy graph corresponds to the policy where the robot keeps asking the human until it receives one observation twice more than the other one.

Figure 1: Example of a dialog POMDP, with actions goto bedroom, goto bathroom and ask. Dashed lines refer to taking action ask.

Figure 2: Policy graph for the POMDP dialog manager.

We ran the following experiment: given the fixed policy of Fig 2 and a fixed number $n$, we draw on-policy labeled trajectories that on the whole contain $n$ transitions. We use these to calculate the empirical models (Eqns 3 and 4), and use Eqn 2 to calculate the value function. Then we use Eqn 21 to calculate the covariance and standard deviation of the value function at the initial belief point ($b_0 = [0.5; 0.5]$).

Let $V(b_0)$ be the expected value at the initial belief state $b_0$, and let $\alpha = [\alpha_1; \alpha_2]$ be the vector of coefficients describing the corresponding linear piece in the piecewise linear value function. We have $V(b_0) = E[\alpha \cdot b_0] = (\alpha_1 + \alpha_2)/2$ and thus the variance of $V(b_0)$ can be calculated as:

$$\mathrm{var}(V(b_0)) = \frac{\mathrm{var}(\alpha_1) + \mathrm{var}(\alpha_2) + 2\,\mathrm{cov}(\alpha_1, \alpha_2)}{4}. \tag{24}$$

Fixing the size of the training set, we run the above experiment 1000 times. Each time, we calculate the empirical value of the initial belief state ($\hat{V}(b_0)$), and estimate its variance using Eqn 24. We then calculate the percentage of cases in which the estimated value ($\hat{V}(b_0)$) lies within 1 and 2 estimated standard deviations of the true value ($V(b_0)$). Assuming that the error between the calculated and true value has a Gaussian distribution (this was confirmed by plotting the histogram of error terms), these values should be 68% and 95% respectively. Fig 3 confirms that the variance estimation we propose satisfies these criteria. The result holds for a variety of sample set sizes (from n=1000 to n=5000).
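Eqn 24 is the special case $b_0 = [0.5; 0.5]$ of the quadratic form $b^{\top} \mathrm{cov}(\alpha)\, b$, and the coverage check described above is a simple count. The Python sketch below illustrates both steps; the function names and the toy numbers are hypothetical and are not taken from the experiments reported here.

```python
import math


def variance_at_belief(cov, belief):
    """Variance of the value at a belief point: b^T cov(alpha) b."""
    n = len(belief)
    return sum(belief[i] * cov[i][j] * belief[j]
               for i in range(n) for j in range(n))


def coverage(estimates, variances, true_value, k):
    """Fraction of runs whose estimate lies within k estimated standard
    deviations of the true value."""
    hits = sum(1 for value, var in zip(estimates, variances)
               if abs(value - true_value) <= k * math.sqrt(var))
    return hits / len(estimates)


# Hypothetical covariance of (alpha_1, alpha_2) for one training set.
toy_cov = [[4.0, 1.0], [1.0, 2.0]]
toy_variance = variance_at_belief(toy_cov, [0.5, 0.5])  # (4 + 2 + 2*1) / 4
```

With well-calibrated variance estimates and roughly Gaussian errors, `coverage(..., k=1)` and `coverage(..., k=2)` over many repeated runs would be expected to approach 0.68 and 0.95.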
To investigate how these variance estimates can be useful to compare competing policies, we calculate the variance of the value function for two other policies on this dialog problem (we presume these dialog policies are provided by an expert, though they could be acquired from a policy iteration algorithm such as Ji et al. (2007)). One policy is to ask for the goal only once, and then act according to that single observation. The other policy is to keep asking until the robot observes one of the goals three times more than the other one, and then act accordingly. Fig 4 shows the 1 standard deviation interval for the calculated value of the initial belief state as a function of the number of samples, for each of our three policies (including the one shown in Fig 2). Given larger sample sizes, the policy in Fig 2 becomes a clear favorite, whereas the other two are not significantly different from each other. This illustrates how our estimates can be used practically to assess the difference between policies using more information than simply their expected value (as is usually standard in the POMDP literature).

Figure 3: Percentage of the cases in which $\hat{V}(b_0)$ lies within 1 (+) and 2 (×) approximately calculated standard deviations from $V(b_0)$, plotted against the number of samples - the dialog problem.

Figure 4: 1 standard deviation interval for the calculated value of the initial belief state for different policies (ask once, ask twice, ask three times) on the dialog problem, plotted against the number of samples.

Medical Domain

We now evaluate the accuracy of our approximation in a medical decision-making task involving real data. The data was collected as part of a large (4000+ patients) multi-step randomized clinical trial, designed to investigate the comparative effectiveness of different treatments provided sequentially for patients suffering from depression (Fava et al. 2003). The POMDP framework offers a powerful model for optimizing treatment strategies from such data. However, given the sensitive nature of the application, as well as the cost involved in collecting such data, estimates of the potential error are highly useful.

The dataset provided includes a large number of measured outcomes, which will be the focus of future investigations.

Figure 5: The policy graph for the STAR*D problem (machine states MedA, MedB and MedC; arcs labeled l, m and h).
For the current experiment, we focus on a numerical score called the Quick Inventory of Depressive Symptomatology (QIDS), which roughly indicates the level of depression. This score was collected throughout the study in two different ways: a self-report version (QIDS-SR) was collected using an automated phone system; a clinical version (QIDS-C) was also collected by a qualified clinician. For the purposes of our experiment, we presume the QIDS-C score completely describes the patient's state, and the QIDS-SR score is a noisy observation of the state. To make the problem tractable with small training data, we discretize the score (which usually ranges from 0 to 27) uniformly according to quantiles into 2 states and 3 observations. The dataset includes information about 4 steps of treatments. We focus on policies which only differ in terms of treatment options in the second step of the sequence (other treatment steps are held constant). There are seven treatment options at that step. A reward of 1 is given if the patient achieves remission (at any step); a reward of 0 is given otherwise. Although this is a relatively small POMDP domain, it is nonetheless an interesting validation for our estimate, since it uses real data, and highlights the type of problem where these estimates are particularly crucial.

We focus on estimating the variance in the value estimate for the policy shown in Fig 5. This policy includes only three treatments: medication A is given to patients with low QIDS-SR scores, medication B is given to patients with medium QIDS-SR scores, and medication C is given to patients with high QIDS-SR scores. Since we do not know the exact value of this policy (over an infinitely large data set), we use a bootstrapping estimate. This means we take all the samples in our dataset which are consistent with this policy, and presume that they define the true model and true value function. Now to investigate the accuracy of our variance estimate, we subsample this data set, estimate the corresponding parameters, and calculate the value function using Eqn 2.
To summarize the value function into a single value (denoted by $V(B)$), we simply take the average over the 3 linear pieces in the value function. The variance of $\hat{V}(B)$ will therefore be the average of the elements of the covariance matrix we calculated for the value function. To check the quality of the estimates, we calculate the percentage of cases in which the calculated value lies within 1 and 2 standard deviations from the true value. If the error term in the value function has a normal distribution these percentages should again be 68 and 95. Fig 6 shows the mentioned percentages as a function of the number of samples. Here again, the variance estimates are close to what is observed empirically.

Figure 6: Percentage of cases in which $\hat{V}(B)$ lies within 1 (+) and 2 (×) approximately calculated standard deviations from $V(B)$, plotted against the number of samples - the STAR*D problem.

Finally, we conducted an experiment to compare policies with different choices of medications in the policy graph of Fig 5. During the STAR*D experiment, patients mostly preferred not to use a certain treatment (CT: Cognitive Therapy). To study the effect of this preference, we compared two policies, only one of which uses CT. As shown in Fig 7, the CT-based policy has a slightly better expected value and much higher variation. Using the result of this analysis, one might prefer the non CT-based policy for two reasons: even with high empirical values, we have little evidence to support the CT-based policy; moreover, CT is not preferred by most patients. Such a method can be applied in similar cases for comparison between an empirically optimal policy and medically preferred ones.

Figure 7: 2 standard deviation interval for the calculated value of the summarized belief state for different policies (not using CT, using CT) on the STAR*D problem, plotted against the number of samples.

Discussion

Most of the literature on sequential decision-making focuses strictly on the problem of making the best possible decision. This paper argues that it is sometimes important to take into account the error in our value function, when comparing alternative policies. In particular, we show that when we use imperfect empirical models generated from sample data (instead of the true model), some bias and variance terms are introduced in the value function of a POMDP. We also present a method to approximately calculate these errors in terms of the statistics of the empirical models.

Such information can be highly valuable when comparing different action selection strategies. During policy search, for instance, one could make use of these error terms to search for policies that have high expected value and low expected variance. Furthermore, in some domains (including human-robot interaction and medical treatment design), where there is an extensive tradition of using hand-crafted policies to select actions, the method we present would be useful to compare hand-crafted policies with the best policy selected by an automated planning method.

The method we presented can be further extended to work in cases where the reward model is also unknown and is approximated by sampling. However, the derived equations are more cumbersome as we need to take into account the potential correlations between reward and transition models.

Acknowledgment

Funding for this work was provided by the National Institutes of Health (grant R21 DA019800) and the NSERC Discovery Grant program.

References

Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Acting optimally in partially observable stochastic domains. In Proceedings of AAAI.
Doshi, F., and Roy, N. 2007. Efficient model learning for dialog management. In Proceeding of HRI.
Fava, M.; Rush, A.; Trivedi, M.; Nierenberg, A.; Thase, M.; Sackeim, H.; Quitkin, F.; Wisniewski, S.; Lavori, P.; Rosenbaum, J.; and Kupfer, D. 2003. Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatr Clin North Am 26(2).
Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. J. Mach. Learn. Res. 5.
Hansen, E. A. 1997. An improved policy iteration algorithm for partially observable MDPs. In Proceedings of NIPS.
Hansen, E. A. 1998. Solving POMDPs by searching in policy space. In Proceedings of UAI.
Ji, S.; Parr, R.; Li, H.; Liao, X.; and Carin, L. 2007. Point-based policy iteration. In Proceedings of AAAI.
Koenig, S., and Simmons, R. 1996. Unsupervised learning of probabilistic models for robot navigation. In Proceedings of ICRA.
Mannor, S.; Simester, D.; Sun, P.; and Tsitsiklis, J. N. 2004. Bias and variance in value function estimation. In Proceedings of ICML.
Mannor, S.; Simester, D.; Sun, P.; and Tsitsiklis, J. N. 2007. Bias and variance approximation in value function estimates. Manage. Sci. 53(2).
Poupart, P., and Boutilier, C. 2003. Bounded finite state controllers. In Proceedings of NIPS, volume 16.
Sondik, E. J. 1971. The optimal control of partially observable Markov processes. Ph.D. Dissertation, Stanford.
Williams, J. D., and Young, S. 2006. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language 21(2).
