Metrics for Finite Markov Decision Processes

Norm Ferns
School of Computer Science, McGill University, Montréal, Canada, H3A 2A7
nferns@cs.mcgill.ca

Prakash Panangaden
School of Computer Science, McGill University, Montréal, Canada, H3A 2A7
prakash@cs.mcgill.ca

Doina Precup
School of Computer Science, McGill University, Montréal, Canada, H3A 2A7
dprecup@cs.mcgill.ca

Abstract

We present metrics for measuring the similarity of states in a finite Markov decision process (MDP). The formulation of our metrics is based on the notion of bisimulation for MDPs, with an aim towards solving discounted infinite horizon reinforcement learning tasks. Such metrics can be used to aggregate states, as well as to better structure other value function approximators (e.g., memory-based or nearest-neighbor approximators). We provide bounds that relate our metric distances to the optimal values of states in the given MDP.

1 Introduction

Markov decision processes (MDPs) offer a popular mathematical tool for planning and learning in the presence of uncertainty (Boutilier et al., 1999). MDPs are a standard formalism for describing multi-stage decision making in probabilistic environments. The objective of the decision making is to maximize a cumulative measure of long-term performance, called the return. Dynamic programming algorithms, e.g., value iteration or policy iteration (Puterman, 1994), allow us to compute the optimal expected return for any state, as well as the way of behaving (policy) that generates this return. However, in many practical applications, the state space of an MDP is simply too large, possibly even continuous, for such standard algorithms to be applied. A typical means of overcoming such circumstances is to partition the state space in the hope of obtaining an "essentially equivalent" reduced system. One defines a new MDP over the partition blocks, and if it is small enough, it can be solved by classical methods. The hope is that optimal values and policies for the reduced MDP can be extended to optimal values and policies for the original MDP.
Recent MDP research on defining equivalence relations on MDPs (Givan et al., 2003) has built on the notion of strong probabilistic bisimulation from concurrency theory. Bisimulation was introduced by Larsen and Skou (1991) based on ideas of Park (1981) and Milner (1980). Roughly speaking, two states of a process are deemed equivalent if all the transitions of one state can be matched by transitions of the other state, and the results are themselves bisimilar. The extension of bisimulation to transition systems with rewards was carried out in the context of MDPs by Givan, Dean and Greig (2003) and in the context of performance evaluation by Bernardo and Bravetti (2003). In both cases, the motivation is to use the equivalence relation to aggregate the states and get smaller state spaces. The basic notion of bisimulation is modified only slightly by the introduction of rewards.

The notion of equivalence for stochastic processes is problematic to use in practice because it requires that the transition probabilities agree exactly. This is not a robust concept, especially considering that usually, the numbers used in probabilistic models come from experimentation or are approximate estimates. A small change in probability estimates can cause bisimilar states to appear non-bisimilar. Dean, Givan and Leach (1997) addressed this issue by allowing the state space to be partitioned into blocks of states such that the states within a block are close in terms of their transition probabilities. However, their technique involves moving to a slightly generalized model, namely, the bounded-parameter MDP.

In this paper we address the same problem in a different way, by developing metrics, or distance functions, on the states of an MDP. Unlike an equivalence relation, a metric may vary smoothly as a function of the transition probabilities. Yet, a metric can be used to aggregate states in a manner similar to an equivalence relation. For example, we can choose a tolerance parameter, ε, and cluster together states that are in ε-neighborhoods. A metric can have broader applicability to other classes of function approximators as well.
For instance, a metric can be used in a nearest-neighbor approximator in order to decide on the data points to be used as prototypes. The metrics we develop are based on the notion of bisimulation. More precisely, we will require that if one of our metrics assigns a distance of 0 to a pair of states, then those states have to be bisimilar. Thus, our metrics provide a quantitative analogue of bisimulation. Additionally, our metrics will possess the following pleasing property: if the system parameters of two bisimilar states are perturbed slightly, then the two states will remain close in metric distance.

We build on previous work by Desharnais, Panangaden, Jagadeesan and Gupta (Desharnais et al., 1999; Desharnais et al., 2002) and by van Breugel and Worrell (2001), in which the theory of bisimulation, metrics and approximation was developed for labeled Markov processes with continuous state spaces. Their work was developed in the context of formal verification; here we take the first steps to apply and extend their results in the context of optimization problems. Although we present our work currently in the context of discrete MDPs, our recent research indicates that the results can be extended for continuous MDPs.

The paper is organized as follows. Sections 2 and 3 provide the definitions and theoretical results required to construct our metrics. In section 4 we introduce two kinds of bisimulation metrics and in section 5 provide bounds on the optimal value function of MDPs that can be obtained by using these metrics for state aggregation. In section 6 we provide some experimental results to compare and contrast our metrics. Section 7 contains conclusions and directions for future work.

2 Background

A finite Markov decision process consists of a finite set of states, $S$, a finite set of actions, $A$, and for every pair of states $s$ and $s'$ and action $a$, a Markovian state transition probability, $P^a_{ss'}$, and a numerical reward, $r^a_s$. In the rest of this work we will focus on a fixed, known MDP. Moreover, since rewards are necessarily bounded, we will assume without loss of generality that $\forall a \in A, \forall s \in S$, $0 \leq r^a_s \leq 1$. We will now review briefly some basic definitions and results from MDP theory (e.g., (Puterman, 1994), sec. 6.1-6.3). A way of behaving or policy is defined as a mapping from states to actions, $\pi : S \to A$.
The value of a state $s$ under policy $\pi$, $V^\pi(s)$, is defined as:

$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right],$$

where $s_0$ is the state at time 0, $\gamma \in [0, 1)$ is a discount factor for future rewards, $r_t$ is the reward obtained at time $t$, and the expectation is achieved by following the state dynamics induced by $\pi$. The mapping $V^\pi : S \to \mathbb{R}$ is called the value function according to $\pi$. The goal of decision making in an MDP is to find a policy $\pi$ that maximizes $V^\pi(s)$ for each $s \in S$. Such a maximizing policy and its associated value function are said to be optimal. (Footnote 1: If rewards are bounded between $R_{min}$ and $R_{max}$, the normalization $0 \leq r^a_s \leq 1$ can be achieved by subtracting $R_{min}$ from all rewards and dividing by $R_{max} - R_{min}$.)

Note that while there may be many optimal policies, the optimal value function, $V^*$, is unique and satisfies a family of fixed point equations,

$$V^*(s) = \max_{a \in A} \left( r^a_s + \gamma \sum_{s' \in S} P^a_{ss'} V^*(s') \right), \quad \forall s \in S.$$

These are known as the Bellman optimality equations. They lead to the following theorem, which expresses $V^*$ as the limit of a sequence of iterates.

Theorem 2.1. Let $V^0(s) = 0$ for all $s \in S$ and

$$V^{n+1}(s) = \max_{a \in A} \left( r^a_s + \gamma \sum_{s' \in S} P^a_{ss'} V^n(s') \right), \quad \forall s \in S.$$

Then $\{V^n\}$ converges to $V^*$ uniformly.

These results can be realized via a dynamic programming (DP) algorithm that computes a value function up to a prescribed degree of accuracy. For example, if one is given a positive tolerance ε then iterating until the maximum difference between consecutive iterates is at most $\varepsilon(1-\gamma)/(2\gamma)$ guarantees that the current iterate differs from the true value function by at most ε.

Unfortunately, it is sometimes the case that the state space is too large for DP to be feasible. A standard strategy is to approximate the given MDP by aggregating its state space. The hope is that one can obtain a smaller "equivalent" MDP, with an easily computable value function, that could provide information about the value function of the original MDP. Givan, Dean, and Greig (2003) investigated several notions of state equivalence and determined that the most appropriate is stochastic bisimulation:

Definition 2.2. A stochastic bisimulation relation is an equivalence relation $R$ on $S$ that satisfies the following property:

$$s R s' \iff \forall a \in A, \left( r^a_s = r^a_{s'} \text{ and } \forall C \in S/R,\ P^a_s(C) = P^a_{s'}(C) \right),$$

where $S/R$ is the state partition induced by $R$ and $P^a_s(C) = \sum_{c \in C} P^a_{sc}$.
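The value-iteration scheme of Theorem 2.1, with the stopping rule quoted above, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `value_iteration` and the nested-list layout `P[a][s][t]`, `r[a][s]` are assumptions made for this example, and the code requires 0 < γ < 1.

```python
def value_iteration(P, r, gamma, eps):
    """Approximate the optimal value function of a finite MDP.

    Hypothetical data layout (not from the paper):
      P[a][s][t] -- probability of moving from state s to state t under action a
      r[a][s]    -- reward in [0, 1] for taking action a in state s
    Iterates V^{n+1}(s) = max_a (r[a][s] + gamma * sum_t P[a][s][t] * V^n(t)),
    starting from V^0 = 0, until consecutive iterates differ by at most
    eps * (1 - gamma) / (2 * gamma) in the max norm.
    """
    n_actions, n_states = len(P), len(P[0])
    threshold = eps * (1.0 - gamma) / (2.0 * gamma)
    V = [0.0] * n_states
    while True:
        V_next = [
            max(
                r[a][s] + gamma * sum(P[a][s][t] * V[t] for t in range(n_states))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
        diff = max(abs(x - y) for x, y in zip(V_next, V))
        V = V_next
        if diff <= threshold:
            return V
```

For a two-state MDP in which staying in state 0 pays reward 1 and state 1 can switch to state 0 for free, the returned values approach 1/(1-γ) and γ/(1-γ) respectively.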
Stochastic bisimulation, $\sim$, is the largest stochastic bisimulation relation.

In (Givan et al., 2003) it was shown that the stochastic bisimulation (henceforth simply "bisimulation") partition could be found by iteratively refining partitions based on rewards and equivalence class transition probabilities, beginning with an initial partition in which all states are lumped together. This could be done in $O(|A||S|^3)$ operations.

Unfortunately, bisimulation is too stringent. Consider the sample MDP in figure 1 with 4 states labeled $s$, $t$, $u$, and $v$, and one action labeled $a$. Suppose $r^a_v = 0$. Then all states are bisimilar, because they share the same immediate reward and transition among themselves w.p. 1. On the other hand, if $r^a_v > 0$ then $v$ is the only state in its bisimulation class since it is the only one with a positive reward. Moreover, $s$ and $t$ are bisimilar iff they share the same probability of transitioning to $v$'s bisimulation class. Each is bisimilar to $u$ iff that probability is zero. Thus, $u, s, t \not\sim v$; $s \sim t \iff p = q$; $s \sim u \iff p = 0$; and $t \sim u \iff q = 0$.

[Figure 1: Sample MDP]

This example demonstrates that bisimulation is simply too strong a notion; if $r^a_v$ is just slightly positive, and $p$ differs only slightly from $q$ then we should expect $s$ and $t$ to be practically bisimilar. From the point of view of the value function, these states will also be very close, and one can argue that aggregating them would be safe. However, such a fine distinction cannot be made using bisimulation alone. Therefore, we seek a quantitative notion of bisimulation so that we can obtain a measure of how bisimilar two states are. To formulate such a notion we use semimetrics, distance functions on the state space.

Definition 2.3. A semimetric on $S$ is a map $d : S \times S \to [0, \infty)$ such that for all $s, s', s''$:

1. $s = s' \Rightarrow d(s, s') = 0$
2. $d(s, s') = d(s', s)$
3. $d(s, s'') \leq d(s, s') + d(s', s'')$

If the converse of the first axiom holds as well, we say $d$ is a metric.

Let $M$ be the set of all semimetrics on $S$ that assign distances of at most 1. Note that every semimetric $d$ induces an equivalence relation, $R_d$, on $S$, obtained by equating points assigned distance zero by $d$.

Definition 2.4. We say that $d \in M$ is a bisimulation relation metric if $R_d$ is a bisimulation relation. We say that $d$ is a bisimulation metric if $R_d$ is $\sim$.

3 Probability metrics

Our goal is to construct a class of bisimulation metrics for use in MDP state aggregation. Specifically, such metrics would be required to be easily computable and provide information concerning the optimal values of states. However, if we denote by $\mathbb{I}_{x \not\sim y}$ the bisimulation metric that assigns distance 1 to states that are not bisimilar then it is not hard to show that $\mathbb{I}_{x \not\sim y}$ satisfies both requirements, while possessing no more distinguishing power than that of bisimulation itself.
So we additionally require that metric distances vary smoothly and proportionally with differences in rewards and differences in probabilities. Formally, we will construct bisimulation metrics via a metric on rewards and a metric on probability functions. The choice of metric on rewards is an obvious one: we simply use the absolute value of the difference. However, there are many ways of defining useful probability metrics (Gibbs & Su, 2002). Two of the most important are the Kantorovich metric and the total variation metric.

Given $d \in M$, the Kantorovich metric, $T_K(d)$, applied to state probability functions $P$ and $Q$ is defined by the following linear program:

$$\max_{u_i,\ i = 1, \ldots, |S|} \ \sum_{i=1}^{|S|} \left( P(s_i) - Q(s_i) \right) u_i$$

subject to: $\forall i, j:\ u_i - u_j \leq d(s_i, s_j)$ and $\forall i:\ 0 \leq u_i \leq 1$,

which is equivalent to the following dual program:

$$\min_{l_{kj},\ k, j = 1, \ldots, |S|} \ \sum_{k,j=1}^{|S|} l_{kj}\, d(s_k, s_j)$$

subject to: $\forall k:\ \sum_j l_{kj} = P(s_k)$, $\forall j:\ \sum_k l_{kj} = Q(s_j)$, and $\forall k, j:\ l_{kj} \geq 0$.

The origins of the Kantorovich metric lie in mass transportation theory. Consider two copies of the state space, one in which states are labeled as supply nodes, and the other in which states are labeled as demand nodes. Each supply node has a supply whose value is equal to the probability mass of the corresponding state under $P$. Each demand node has a demand whose value is equal to the probability mass of the corresponding state under $Q$. Furthermore, imagine there is a transportation arc from each supply node to each demand node, labeled with a cost equal to the distance of the corresponding states under $d$. This constitutes a transportation network. A flow with respect to this network is an assignment of quantities to be shipped along each arc subject to the conditions that the total flow leaving a supply node is equal to its supply, and the total flow entering a demand node is equal to its demand. The cost of a flow along an arc is the value of the flow along that arc multiplied by the cost assigned to that arc. The goal of the Kantorovich optimal mass transportation problem is to find the best total flow for the given network, i.e. the flow of minimal cost.
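As a quick sanity check on the two programs, it may help to work them out on the smallest nontrivial case. The two-state space, the distance $\delta$ and the probabilities $p$, $q$ below are illustrative assumptions, not an example taken from the paper.

```latex
% Illustrative assumption: S = \{s_1, s_2\}, d(s_1, s_2) = \delta \in [0, 1],
% P = (p, 1 - p), Q = (q, 1 - q), and p \geq q without loss of generality.
\begin{align*}
\text{Primal:}\quad
  & (P(s_1) - Q(s_1))\,u_1 + (P(s_2) - Q(s_2))\,u_2
    = (p - q)(u_1 - u_2) \le (p - q)\,\delta, \\
  & \text{with equality at the feasible point } u_1 = \delta,\ u_2 = 0. \\
\text{Dual:}\quad
  & l_{11} = q,\quad l_{12} = p - q,\quad l_{21} = 0,\quad l_{22} = 1 - p
    \ \Longrightarrow\ \sum_{k,j} l_{kj}\, d(s_k, s_j) = (p - q)\,\delta. \\
\text{Hence}\quad
  & T_K(d)(P, Q) = \delta\,|p - q|.
\end{align*}
```

The primal and dual values coincide, so both are optimal: the cheapest flow ships the surplus mass $|p - q|$ across the single arc of cost $\delta$. With $\delta = 1$ the value reduces to $|p - q|$, which is half the sum of absolute differences between $P$ and $Q$.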
This formulation is captured exactly in the dual program above. The distance assigned to $P$ and $Q$, $T_K(d)(P, Q)$, is the cost of the optimal flow, which is known to be computable in strongly polynomial time. This formulation can be computed in $O(|S|^2 \log |S|)$ time (Orlin, 1988). (Footnote 2: Note that the Kullback-Leibler divergence, also known as KL-distance, which is commonly used to estimate the similarity of probability distributions, is not a metric.)

Since the underlying cost function $d$ is a semimetric, the Kantorovich metric may be further simplified.

Lemma 3.1. Let $d \in M$. Then

$$T_K(d)(P, Q) = \max_{v_C} \sum_{C \in S/R_d} \left( P(C) - Q(C) \right) v_C$$

subject to: $\forall C, D:\ v_C - v_D \leq \min_{i \in C,\, j \in D} d(s_i, s_j)$ and $\forall C:\ 0 \leq v_C \leq 1$,

and $T_K(d)(P, Q) = 0 \iff P(C) = Q(C),\ \forall C \in S/R_d$.

Proof. Let $(v_i)$ be any feasible solution to the primal LP for $T_K(d)(P, Q)$. Note that if $s_i R_d s_j$ then we must have $v_i = v_j$. Define for each $C \in S/R_d$, $v_C = v_i$ for some $s_i \in C$. Then collecting terms yields the desired expression. From this expression it is clear that if $P(C) = Q(C)$ for every equivalence class $C$, then $T_K(d)(P, Q) = 0$. For the converse, suppose that there exists $C$ such that $P(C) \neq Q(C)$. Without loss of generality, suppose $P(C) > Q(C)$. Clearly $C \neq S$, so we may take $v_C = \min_{k \in C,\, j \notin C} d(s_k, s_j)$ and $v_D = 0$ for all other classes and obtain a positive lower bound on $T_K(d)(P, Q)$.

By contrast, the total variation probability metric, $T_{TV}$, is defined independently of $d$ by

$$T_{TV}(P, Q) = \frac{1}{2} \sum_{s \in S} |P(s) - Q(s)|,$$

which is half the $L^1$-norm of the difference $P - Q$. It clearly has the advantage of being simply defined and easily computable. Yet, it may still be placed within the previous context since $T_{TV}$ can be expressed as $T_K(\mathbb{I}_{x \neq y})$, where $\mathbb{I}_{x \neq y}$ is the discrete metric that assigns distance 1 to every pair of distinct states.

4 Bisimulation Metrics

Our construction of bisimulation metrics is heavily based on the following two lemmas, which are important consequences of lemma 3.1. Here the usefulness of the Kantorovich metric becomes evident.

Lemma 4.1. If $d$ is a bisimulation metric then $\forall s, s' \in S$,

$$d(s, s') = 0 \iff \left( \forall a \in A,\ r^a_s = r^a_{s'} \text{ and } T_K(d)(P^a_s, P^a_{s'}) = 0 \right) \qquad (1)$$

Since condition (1) is necessary for $d \in M$ to be a bisimulation metric, the question naturally arises as to whether or not it is sufficient as well. In general, the answer is negative. However, it is sufficient for $d$ to be a bisimulation relation metric.

Lemma 4.2. Suppose $d \in M$ satisfies (1).
Then $d(s, s') = 0 \Rightarrow s \sim s'$.

We have stated that our goal is to construct bisimulation metrics that provide useful information concerning the optimal values of states, but we have not mentioned how this can be done. For inspiration we look to the Bellman optimality equations for the optimal value function, which yield the following bound:

$$|V^*(s) - V^*(s')| \leq \max_{a \in A} \left( |r^a_s - r^a_{s'}| + \gamma \left| \sum_{u \in S} (P^a_{su} - P^a_{s'u}) V^*(u) \right| \right)$$

The first component of the RHS is simply the distance in immediate rewards, while the second component is strikingly similar to the primal LP for the Kantorovich distance in distributions. Based on these observations we fix a particular form for our bisimulation metrics, namely

$$d(s, s') = \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T\, d_P(P^a_s, P^a_{s'}) \right)$$

where $d_P$ is some probability metric and $c_R$ and $c_T$ are two positive 1-bounded constants. Intuitively, these constants weight the importance given to the distance between rewards and to the distance between transition probabilities, respectively. For instance, in MDPs a natural choice would be $c_T = \gamma$ and $c_R = 1 - \gamma$. The particular choice of probability metric leads to two kinds of bisimulation metrics, which we now describe in detail.

4.1 Fixed-Point Metrics

In this section, we will use the Kantorovich distance as a basis for formulating a bisimulation metric. Before we do so, we need some definitions and results from fixed-point theory. These may be found, for example, in (Winskel, 1993). We present them in general notation first, then we explain them in the context of our problem.

Let $(X, \sqsubseteq)$ be a partial order. An ω-chain of this partial order is an increasing sequence $\{x_n\}$. The partial order is said to be an ω-complete partial order (ω-cpo) if it contains least upper bounds of all ω-chains. It is called an ω-cpo with bottom if it additionally contains a least element, $\bot$, called bottom. A function $f : X \to Y$ between ω-cpos is said to be monotonic if $x \sqsubseteq x' \Rightarrow f(x) \sqsubseteq f(x')$. It is continuous if for every ω-chain $\{x_n\}$, $f(\bigsqcup_n x_n) = \bigsqcup_n f(x_n)$. A point $x \in X$ is said to be a prefixed-point of $f$ if $f(x) \sqsubseteq x$. It is a fixed-point if $x = f(x)$. With these definitions, the following important theorem can be established.

Theorem 4.3 (Fixed-Point Theorem).
Let $f : X \to X$ be a continuous function on an ω-cpo with bottom $X$. Define $fix(f) = \bigsqcup_{n \in \mathbb{N}} f^n(\bot)$. Then $fix(f)$ is the least prefixed-point of $f$ and the least fixed-point of $f$.

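In order to use this result, we equip $M$ with the usual pointwise ordering: $d \leq d'$ iff $d(s, s') \leq d'(s, s')$ for all $s, s' \in S$. As a result, we obtain an ω-cpo with bottom, where $\bot$ is the constant zero function and $\bigsqcup_n d_n$ is given by $(\bigsqcup_n d_n)(s, s') = \sup_n d_n(s, s')$. Moreover, the same can be said of the set $M_P$ of semimetrics on the set of probability functions on $S$. With this in mind it is now easy to see that this ordering is preserved by the Kantorovich metric, i.e.

Lemma 4.4. $T_K : M \to M_P$ is continuous.

Proof. See appendix.

We are now ready to establish the bisimulation metric based on the Kantorovich probability metric:

Theorem 4.5. Let $c_R, c_T > 0$ with $c_R + c_T \leq 1$. Define $F : M \to M$ by

$$F(d)(s, s') = \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T\, T_K(d)(P^a_s, P^a_{s'}) \right)$$

Then $F$ has a least fixed-point, $d_{fix}$, and $d_{fix}$ is a bisimulation metric.

Proof. Clearly, existence of the least fixed-point will follow from theorem 4.3, so we only need to show that $F$ is continuous. For future reference we will denote the iterates, $F^n(\bot)$, by $d_n$ and remark that they form an ω-chain in $M$. Continuity of $F$ follows from lemma 4.4, since it establishes the monotonicity of $F$, and from the fact that given an ω-chain $\{x_n\}$ in $M$ and a pair of states $s$ and $s'$,

$$\begin{aligned}
F\left(\bigsqcup_n x_n\right)(s, s')
&= \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T\, T_K\left(\bigsqcup_n x_n\right)(P^a_s, P^a_{s'}) \right) \\
&= \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T \sup_n T_K(x_n)(P^a_s, P^a_{s'}) \right) \\
&= \sup_n \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T\, T_K(x_n)(P^a_s, P^a_{s'}) \right) \\
&= \sup_n F(x_n)(s, s') \\
&= \left(\bigsqcup_n F(x_n)\right)(s, s')
\end{aligned}$$

So $d_{fix}$ exists, and $d_{fix} = \bigsqcup_n F^n(\bot)$. Note that by construction, $d_{fix}$ satisfies (1), and so, from lemma 4.2, $d_{fix}(s, s') = 0 \Rightarrow s \sim s'$. On the other hand, let $\mathbb{I}_{x \not\sim y}$ denote the bisimulation metric that assigns distance 1 to states that are not bisimilar (and 0 otherwise). Since $\mathbb{I}_{x \not\sim y}$ is a bisimulation metric, by applying lemma 4.1 and the definition of $F$, $F(\mathbb{I}_{x \not\sim y})$ is also a bisimulation metric. Therefore, $F(\mathbb{I}_{x \not\sim y}) \leq \mathbb{I}_{x \not\sim y}$, i.e. $\mathbb{I}_{x \not\sim y}$ is a prefixed-point of $F$. So $d_{fix} \leq \mathbb{I}_{x \not\sim y}$, since $d_{fix}$ is the least prefixed-point of $F$. Thus, $s \sim s' \Rightarrow d_{fix}(s, s') = 0$.

Note that by induction $d_{fix} - d_n \leq c_T^n$ for every $n$. Thus, we can compute $d_{fix}$ up to a prescribed degree of accuracy δ by iteratively applying $F$ for $\lceil \ln \delta / \ln c_T \rceil$ steps. Since this essentially reduces to computing a Kantorovich metric at each iteration for every action and pair of states, $d_{fix}$ can be computed in $O(|A||S|^4 \log|S| \, \ln \delta / \ln c_T)$ operations.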
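The iteration count and the geometric convergence of the iterates can be illustrated on the smallest possible case. The sketch below assumes a hypothetical MDP with two states and a single action; in a two-point space the Kantorovich distance between two distributions is the distance between the two points times the absolute difference in the mass placed on the first point, so $F$ collapses to a scalar recurrence. The function names and parameter names are assumptions made for this example, not part of the paper.

```python
import math


def iterations_needed(delta, c_T):
    """Smallest n with c_T**n <= delta, i.e. ceil(ln(delta) / ln(c_T)).

    Assumes 0 < delta < 1 and 0 < c_T < 1.
    """
    return math.ceil(math.log(delta) / math.log(c_T))


def iterate_two_state(reward_gap, prob_gap, c_R, c_T, n):
    """Apply F n times, starting from the zero semimetric.

    Hypothetical two-state, single-action MDP:
      reward_gap = |r_{s1} - r_{s2}|
      prob_gap   = |P_{s1}(s1) - P_{s2}(s1)|
    In a two-point space T_K(d)(P_{s1}, P_{s2}) = d(s1, s2) * prob_gap,
    so F(d)(s1, s2) = c_R * reward_gap + c_T * d(s1, s2) * prob_gap.
    """
    d = 0.0
    for _ in range(n):
        d = c_R * reward_gap + c_T * d * prob_gap
    return d


def fixed_point_two_state(reward_gap, prob_gap, c_R, c_T):
    """Closed-form solution of d = c_R * reward_gap + c_T * prob_gap * d."""
    return c_R * reward_gap / (1.0 - c_T * prob_gap)
```

In this special case the iterates can be compared against the closed form directly; in general each application of $F$ requires solving one transportation problem per action and pair of states.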
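4.2 Metrics based on Total Variation

We remarked in the proof of theorem 4.5 that $F(\mathbb{I}_{x \not\sim y})$, where $\mathbb{I}_{x \not\sim y}$ assigns distance 1 to non-bisimilar states and 0 otherwise, is also a bisimulation metric; we now denote it by $d_\sim$. The advantage to using $d_\sim$ in place of $d_{fix}$ is that its component probability semimetric, $T_K(\mathbb{I}_{x \not\sim y})$, admits an explicit, easily computable formulation, similar to that of the total variation metric.

Lemma 4.6. $T_K(\mathbb{I}_{x \not\sim y})(P, Q) = \frac{1}{2} \sum_{C \in S/\sim} |P(C) - Q(C)|$

Proof: By lemma 3.1, we have:

$$T_K(\mathbb{I}_{x \not\sim y})(P, Q) = \max_{u_C} \sum_{C \in S/\sim} \left( P(C) - Q(C) \right) u_C$$

subject to: $\forall C, D:\ u_C - u_D \leq \min_{i \in C,\, j \in D} \mathbb{I}_{x \not\sim y}(s_i, s_j)$ and $\forall C:\ 0 \leq u_C \leq 1$.

However, for distinct bisimulation equivalence classes $C$ and $D$, $\mathbb{I}_{x \not\sim y}(s_i, s_j)$ is 1, and so the first constraint is extraneous. Thus, if we define $u_C$ to be 1 if $P(C) \geq Q(C)$ and 0 otherwise, then it is clear that $(u_C)$ is a feasible solution at which the maximum is achieved. For this solution we have, using $P(S) = Q(S) = 1$,

$$\begin{aligned}
T_K(\mathbb{I}_{x \not\sim y})(P, Q)
&= \sum_{C \in S/\sim} \left( P(C) - Q(C) \right) u_C \\
&= \sum_{C \in S/\sim} \left( P(C) - Q(C) \right) u_C - \frac{1}{2}\left( P(S) - Q(S) \right) \\
&= \sum_{C \in S/\sim} \left( P(C) - Q(C) \right) \left( u_C - \frac{1}{2} \right) \\
&= \frac{1}{2} \sum_{C \in S/\sim} |P(C) - Q(C)|
\end{aligned}$$

Thus, $d_\sim$ can be computed via the bisimulation partition in $O(|A||S|^3)$ operations.

5 Value Function Bounds

We are now ready to provide value function bounds. We will state the bounds in terms of $d_{fix}$ only. The bounds hold immediately for $d_\sim$ as well, because $d_{fix} \leq d_\sim$.

Theorem 5.1. Suppose $\gamma \leq c_T$. Then $\forall s, s' \in S$:

1. $c_R |V^n(s) - V^n(s')| \leq d_n(s, s')$
2. $c_R |V^*(s) - V^*(s')| \leq d_{fix}(s, s')$

Proof: Clearly the proof of the second item follows from the first by taking limits. For the proof of the first item we proceed by induction. Note that since $\gamma \leq c_T$,

$$0 \leq \frac{c_R \gamma}{c_T} V^n(u) \leq \frac{(1 - c_T)\gamma}{c_T (1 - \gamma)} \leq 1$$

and by the induction hypothesis

$$\frac{c_R \gamma}{c_T} V^n(u) - \frac{c_R \gamma}{c_T} V^n(v) \leq c_R |V^n(u) - V^n(v)| \leq d_n(u, v)$$

So $\{ \frac{c_R \gamma}{c_T} V^n(u) : u \in S \}$ constitutes a feasible solution to the primal LP for $T_K(d_n)(P^a_s, P^a_{s'})$. It follows that

$$\begin{aligned}
c_R |V^{n+1}(s) - V^{n+1}(s')|
&= c_R \left| \max_{a \in A} \left( r^a_s + \gamma \sum_{u \in S} P^a_{su} V^n(u) \right) - \max_{a \in A} \left( r^a_{s'} + \gamma \sum_{u \in S} P^a_{s'u} V^n(u) \right) \right| \\
&\leq c_R \max_{a \in A} \left| r^a_s - r^a_{s'} + \gamma \sum_{u \in S} (P^a_{su} - P^a_{s'u}) V^n(u) \right| \\
&\leq \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T \left| \sum_{u \in S} (P^a_{su} - P^a_{s'u}) \frac{c_R \gamma}{c_T} V^n(u) \right| \right) \\
&\leq \max_{a \in A} \left( c_R |r^a_s - r^a_{s'}| + c_T\, T_K(d_n)(P^a_s, P^a_{s'}) \right) \\
&= F(d_n)(s, s') = d_{n+1}(s, s')
\end{aligned}$$

These bounds can be extended to relate the optimal values of states in the given MDP and an aggregate MDP. First, let us fix some notation and assumptions concerning the form of an aggregate MDP. We assume the aggregate MDP has as its state space a partition of the state space $S$, the same finite set of actions $A$, and transition probabilities $P^a_{CD}$ and rewards $r^a_C$, for blocks $C$ and $D$ of the partition, that are each averaged over equivalence classes, i.e.

$$P^a_{CD} = \frac{1}{|C|} \sum_{s \in C} P^a_s(D) \quad \text{and} \quad r^a_C = \frac{1}{|C|} \sum_{s \in C} r^a_s$$

Additionally in the following we will denote the map from $S$ to the partition taking a state to its equivalence class by ρ, and the average distance from a state $s$ to all states in its equivalence class under a semimetric $d$, by $g(s, d) = \frac{1}{|\rho(s)|} \sum_{s' \in \rho(s)} d(s, s')$.

Theorem 5.2. Suppose $\gamma \leq c_T$. Then $\forall s \in S$, the following inequalities hold:

1. $c_R |V^n(\rho(s)) - V^n(s)| \leq g(s, d_n) + \sum_{k=1}^{n-1} \gamma^{n-k} \max_u g(u, d_k)$
2. $c_R |V^*(\rho(s)) - V^*(s)| \leq g(s, d_{fix}) + \frac{\gamma}{1 - \gamma} \max_u g(u, d_{fix})$

Proof: See appendix.

The proposed distance metrics can be used for aggregating states in a straightforward way. For some positive ε we choose several seed states and for each, we cluster all the states within an ε-neighborhood (while ensuring that each state is placed in only one cluster). Then for a cluster $C$ and any state $s$ belonging to it, the above theorem tells us that $c_R |V^*(C) - V^*(s)| \leq \frac{2\varepsilon}{1 - \gamma}$, provided $\gamma \leq c_T$. Thus, as ε decreases, the optimal values of a class and its states converge.

6 Illustration

We illustrate our distance metrics and error bounds on a very simple toy MDP, consisting of a 5 × 5 grid. There are 5 actions, north, south, east, west and stay. Transitions for each cell are uniformly distributed among adjacent cells. Rewards are distributed as follows. Moving south from rows 1-4 to rows 2-5 yields rewards of 0.1, 0.2, 0.3, 0.4 respectively. Moving east from columns 1-4 to columns 2-5 yields rewards of 0.5, 0.53, 0.56, 0.59 respectively.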
Finally, staying in the southeast corner yields a reward of 1. All other actions give reward 0. We used these parameters in order to be able to inspect the partitions obtained. More extensive (but similar) illustrations, using random MDPs, are discussed in (Ferns, 2003). In all experiments, $c_R = 1 - \gamma$ and $c_T = \gamma$.

We first compute the pairwise distances between all pairs of states. Note that this is not a practical approach; here, we are just trying to understand the behavior of the metrics. Then, from an initial seed state we grow a partition of ε-clusters of states, adding a new cluster each time we encounter a state at distance greater than ε from the seed states of each cluster presently in the partition. Of course, the quality of the partition will depend on the choice of seeds, and more sophisticated methods can be employed here (e.g., picking the seeds for subsequent partitions as far as possible from the previous ones). Once a partition is established, we perform value iteration to find the value of the optimal policy.

We varied the parameter ε which bounds the allowed distance between states, as well as the discount factor γ. Note that low ε (close to 0) means that we only allow states to be aggregated if they are very close in terms of the distance. Hence, at this end of the spectrum, very little aggregation will occur and the value function in the aggregated MDP should be very close (or identical) to the one in the original MDP. When ε = 1, all states can be aggregated, resulting in a single-state MDP, and a poor approximation of the optimal value function.

Figure 2 shows the size of the aggregated MDPs, obtained using the Kantorovich metric and the total variation metric, for values of γ = 0.1, 0.5 and 0.9. The two metrics are close for low γ but behave quite differently for high values of γ (which are typical in the MDP community). In particular, the total variation metric has a very abrupt transition from no aggregation to aggregating all states in one lump. We note, though, that this metric is much faster to compute (by an order of 4 in our experiments, in a Java implementation).
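The seed-based clustering described above, followed by the class-averaged aggregate MDP used in the value function bounds, can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: the function names, the nested-list layout `P[a][s][t]`, `r[a][s]`, the scan of states in index order, and the rule of assigning a state to the first seed within ε are all assumptions made for this example.

```python
def epsilon_clusters(dist, eps):
    """Greedy seed-based clustering from a precomputed distance matrix.

    dist[i][j] is the (semi)metric distance between states i and j.
    States are scanned in index order (an assumption). A state joins the
    first existing cluster whose seed is within eps; otherwise it becomes
    the seed of a new cluster. Returns (seeds, assignment), where
    assignment[s] is the cluster index of state s.
    """
    seeds, assignment = [], []
    for s in range(len(dist)):
        for c, seed in enumerate(seeds):
            if dist[s][seed] <= eps:
                assignment.append(c)
                break
        else:
            # No seed is within eps of s, so s starts a new cluster.
            seeds.append(s)
            assignment.append(len(seeds) - 1)
    return seeds, assignment


def aggregate_mdp(P, r, assignment):
    """Average rewards and transition probabilities over each cluster.

    P[a][s][t] and r[a][s] describe the original MDP (hypothetical layout).
    Returns (P_agg, r_agg) with P_agg[a][C][D] = (1/|C|) sum_{s in C} P_s(D)
    and r_agg[a][C] = (1/|C|) sum_{s in C} r[a][s].
    """
    n_blocks = max(assignment) + 1
    members = [
        [s for s, c in enumerate(assignment) if c == b] for b in range(n_blocks)
    ]
    P_agg, r_agg = [], []
    for a in range(len(P)):
        r_agg.append([sum(r[a][s] for s in C) / len(C) for C in members])
        P_agg.append(
            [
                [sum(P[a][s][t] for s in C for t in D) / len(C) for D in members]
                for C in members
            ]
        )
    return P_agg, r_agg
```

The aggregate MDP returned by `aggregate_mdp` can then be solved by ordinary value iteration and its values compared, state by state, with those of the original MDP.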
Figure 3 compares the metrics in terms of actual and estimated error. The lower curves represent the maximum error between the optimal value functions of the aggregated MDP and the original MDP. The higher curves are the upper bound on the error, based on Theorem 5.2. The straight line is the naive estimate, $\frac{2\varepsilon}{1 - \gamma}$. Note that the bounds in the theorem are much tighter than the naive bound (which is omitted in the last graph to make the figure clear). The bounds get looser as γ increases, due to the $1/(1 - \gamma)$ factor. We note, though, that the shape of the bound mimics very well the shape of the actual error.

[Figure 2: Size of aggregated MDP as a function of ε, for γ = 0.1 (left), γ = 0.5 (middle) and γ = 0.9 (right). Each panel plots the size of the aggregate MDP against ε for the total variation and Kantorovich metrics.]

[Figure 3: True error and estimated error bounds between the optimal value function of the original and aggregated MDP, as a function of ε, for γ = 0.1 (left), γ = 0.5 (middle) and γ = 0.9 (right). Each panel plots the maximum error against ε, with curves for the true error (total variation), true error (Kantorovich), bound (total variation), bound (Kantorovich) and, in the first two panels, the naive bound.]

7 Conclusion

In this paper, we introduced metrics for measuring the distance between the states of an MDP, based on the notion of bisimulation. Unlike equivalence relations, the metrics are robust to perturbations in the parameters of the MDP: if two bisimilar states are slightly perturbed, the metric will still show them as close. Moreover, the same can be said of the states' optimal values, as reflected by the bounds relating these to our metrics. Such metrics are obviously useful for state aggregation, but also for other value function approximators (e.g. memory-based). We are currently pursuing an interesting connection to diffusion kernels on graphs (Kondor & Lafferty, 2002).

The existence of bisimulation metrics for finite MDPs allows us to tackle compression of such systems in a new manner. A metric defined on the state space of an MDP can be extended to a metric on the space of finite MDPs.
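With this in mind, we are now concerned with answering the following question: given a finite MDP and a positive integer k, what is its best k-state approximation? Here by best we mean a k-state MDP of minimal distance to the original.

We also aim to extend these results to other probabilistic models. We have mostly established an extension for continuous-state MDPs. In the future, we hope to tackle factored MDPs and partially observable MDPs as well.

Appendix: Proof of Lemma 4.4

Fix probability functions $P$ and $Q$. Monotonicity of $T_K$ follows from the primal LP: for, if $d \leq d'$ then every feasible solution to $T_K(d)(P, Q)$ is a feasible solution to $T_K(d')(P, Q)$. Thus, $T_K(d) \leq T_K(d')$. Next, given an ω-chain $\{d_n\}$ note that by monotonicity $\sup_n T_K(d_n)(P, Q) \leq T_K(\bigsqcup_m d_m)(P, Q)$. For the other direction, we use the dual LP. For each $n$, let $(l^n_{kj})$ denote a feasible solution of $T_K(d_n)$ yielding the minimum. Then each is also a feasible solution for $T_K(\bigsqcup_m d_m)$. Define $\varepsilon^n_{kj} = (\bigsqcup_m d_m)(s_k, s_j) - d_n(s_k, s_j)$ and $\delta_{kj} = \min\{P(s_k), Q(s_j)\}$. Then for every $k$, $j$, and $n$, $\varepsilon^n_{kj} \geq 0$, $\lim_{n \to \infty} \varepsilon^n_{kj} = 0$, and $l^n_{kj} \leq \delta_{kj}$. Thus,

$$\begin{aligned}
T_K\left(\bigsqcup_m d_m\right)(P, Q)
&\leq \sum_{k,j} l^n_{kj} \left(\bigsqcup_m d_m\right)(s_k, s_j) \\
&= T_K(d_n)(P, Q) + \sum_{k,j} l^n_{kj}\, \varepsilon^n_{kj} \\
&\leq \sup_m T_K(d_m)(P, Q) + \sum_{k,j} \delta_{kj}\, \varepsilon^n_{kj}
\end{aligned}$$

By taking $n \to \infty$ on both sides of the inequality, we obtain the desired result.

Appendix: Proof of Theorem 5.2

Once more we proceed by induction. Since rewards and transition probabilities of the aggregate MDP are class averages, $r^a_{\rho(s)} + \gamma \sum_D P^a_{\rho(s)D} V^n(D) = \frac{1}{|\rho(s)|} \sum_{s' \in \rho(s)} \left( r^a_{s'} + \gamma \sum_u P^a_{s'u} V^n(\rho(u)) \right)$. Hence

$$\begin{aligned}
c_R |V^{n+1}(\rho(s)) - V^{n+1}(s)|
&= c_R \left| \max_{a} \left( r^a_{\rho(s)} + \gamma \sum_{D} P^a_{\rho(s)D} V^n(D) \right) - \max_{a} \left( r^a_s + \gamma \sum_{u} P^a_{su} V^n(u) \right) \right| \\
&\leq c_R \max_{a} \frac{1}{|\rho(s)|} \sum_{s' \in \rho(s)} \left| r^a_{s'} - r^a_s + \gamma \sum_{u} \left( P^a_{s'u} V^n(\rho(u)) - P^a_{su} V^n(u) \right) \right| \\
&\leq \max_{a} \frac{1}{|\rho(s)|} \sum_{s' \in \rho(s)} \left( c_R |r^a_{s'} - r^a_s| + c_T \left| \sum_{u} (P^a_{s'u} - P^a_{su}) \frac{c_R \gamma}{c_T} V^n(u) \right| \right) \\
&\quad + \gamma \max_{u} c_R |V^n(\rho(u)) - V^n(u)|
\end{aligned}$$

Note by (the proof of) theorem 5.1 that $\{ \frac{c_R \gamma}{c_T} V^n(u) : u \in S \}$ constitutes a feasible solution to the primal LP for $T_K(d_n)(P^a_{s'}, P^a_s)$. Hence we can continue as follows:

$$\begin{aligned}
&\leq \max_{a} \frac{1}{|\rho(s)|} \sum_{s' \in \rho(s)} \left( c_R |r^a_{s'} - r^a_s| + c_T\, T_K(d_n)(P^a_{s'}, P^a_s) \right) + \gamma \max_{u} c_R |V^n(\rho(u)) - V^n(u)| \\
&\leq \frac{1}{|\rho(s)|} \sum_{s' \in \rho(s)} d_{n+1}(s, s') + \gamma \max_{u} c_R |V^n(\rho(u)) - V^n(u)| \\
&= g(s, d_{n+1}) + \gamma \max_{u} c_R |V^n(\rho(u)) - V^n(u)| \\
&\leq g(s, d_{n+1}) + \gamma \max_{u} \left( g(u, d_n) + \sum_{k=1}^{n-1} \gamma^{n-k} \max_{v} g(v, d_k) \right) \\
&= g(s, d_{n+1}) + \sum_{k=1}^{n} \gamma^{n+1-k} \max_{u} g(u, d_k)
\end{aligned}$$

References

Bernardo, M., & Bravetti, M. (2003). Performance measure sensitive congruences for Markovian process algebras. Theoretical Computer Science, 290, 117-160.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1-94.

Dean, T., Givan, R., & Leach, S. (1997). Model reduction techniques for computing approximately optimal solutions for Markov decision processes. Proceedings of UAI (pp. 124-131).

Desharnais, J., Gupta, V., Jagadeesan, R., & Panangaden, P. (1999). Metrics for labeled Markov systems. International Conference on Concurrency Theory (pp. 258-273).

Desharnais, J., Gupta, V., Jagadeesan, R., & Panangaden, P. (2002). The metric analogue of weak bisimulation for probabilistic processes. Logic in Computer Science (pp. 413-422). IEEE Computer Society.

Ferns, N. (2003). Metrics for Markov decision processes. Master's thesis, McGill University. URL: http://www.cs.mcgill.ca/~nferns/mythesis.ps.

Gibbs, A. L., & Su, F. E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70, 419-435.

Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147, 163-223.

Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. Proceedings of the ICML.

Larsen, K., & Skou, A. (1991). Bisimulation through probabilistic testing. Information and Computation, 94, 1-28.

Milner, R. (1980). A calculus of communicating systems. Lecture Notes in Computer Science Vol. 92. Springer-Verlag.

Orlin, J. (1988). A faster strongly polynomial minimum cost flow algorithm. Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing (pp. 377-387). ACM Press.

Park, D. (1981). Concurrency and automata on infinite sequences. Proceedings of the 5th GI-Conference on Theoretical Computer Science (pp. 167-183). Springer-Verlag.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc.

van Breugel, F., & Worrell, J. (2001). An algorithm for quantitative verification of probabilistic transition systems. Proceedings of the 12th International Conference on Concurrency Theory (pp. 336-350). Springer-Verlag.

Winskel, G. (1993). The formal semantics of programming languages. Foundations of Computing. The MIT Press.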