Metrics for Markov Decision Processes with Infinite State Spaces

Norm Ferns
School of Computer Science, McGill University, Montréal, Canada, H3A 2A7

Prakash Panangaden
School of Computer Science, McGill University, Montréal, Canada, H3A 2A7

Doina Precup
School of Computer Science, McGill University, Montréal, Canada, H3A 2A7

Abstract

We present metrics for measuring state similarity in Markov decision processes (MDPs) with infinitely many states, including MDPs with continuous state spaces. Such metrics provide a stable quantitative analogue of the notion of bisimulation for MDPs, and are suitable for use in MDP approximation. We show that the optimal value function associated with a discounted infinite horizon planning task varies continuously with respect to our metric distances.

1 Introduction

Markov decision processes (MDPs) offer a popular mathematical tool for planning and learning in the presence of uncertainty (Boutilier et al., 1999). MDPs are a standard formalism for describing multi-stage decision making in probabilistic environments. The objective of the decision making is to maximize a cumulative measure of long-term performance, called the return. Dynamic programming algorithms, e.g., value iteration or policy iteration (Puterman, 1994), allow us to compute the optimal expected return for any state, as well as the way of behaving (policy) that generates this return. However, in many practical applications, the state space of an MDP is simply too large, possibly infinite or even continuous, for such standard algorithms to be applied. A typical means of overcoming such circumstances is to partition the state space in the hope of obtaining an essentially equivalent reduced system. One defines a new MDP over the partition blocks, and if it is small enough, it can be solved by classical methods. The hope is that optimal values and policies for the reduced MDP can be extended to optimal values and policies for the original MDP.

Recent MDP research on defining equivalence relations on MDPs (Givan et al., 2003) has built on the notion of strong probabilistic bisimulation from concurrency theory.
Probabilistic bisimulation was introduced by Larsen and Skou (1991) based on bisimulation for nondeterministic (nonprobabilistic) systems due to Park (1981) and Milner (1980). Henceforth when we say bisimulation we will mean strong probabilistic bisimulation. In a probabilistic setting, bisimulation can be described as an equivalence relation that relates two states precisely when they have the same probability of transitioning to classes of equivalent states. The extension of bisimulation to transition systems with rewards was carried out in the context of MDPs by Givan, Dean and Greig (2003) and in the context of performance evaluation by Bernardo and Bravetti (2003). In both cases, the motivation is to use the equivalence relation to aggregate the states and get smaller state spaces. The basic notion of bisimulation is modified only slightly by the introduction of rewards.

However, it has been well established for a while now that the use of exact equivalences in quantitative systems is problematic. A notion of equivalence is two-valued: two states are either equivalent or not equivalent. A small perturbation of the transition probabilities can make two equivalent states no longer equivalent. In short, any kind of equivalence is too unstable under perturbations of the numerical values of the transition probabilities.

A natural remedy is to use metrics. Metrics are natural quantitative analogues of the notion of equivalence relation: for example, the triangle inequality is a natural quantitative analogue of transitivity. The metrics on which we focus here specify the degree to which objects of interest behave similarly. Much of this work has been done in a very general setting, using the labelled Markov process (LMP) model (Blute et al., 1997; Desharnais et al., 2002). Previous metrics (Desharnais et al., 1999; van Breugel & Worrell, 2001; Desharnais et al., 2002b) (more precisely pseudo-metrics or semi-metrics) have quantitatively generalized bisimulation by assigning distance zero to states that are bisimilar, distance one to states that are easily distinguishable, and an intermediate distance to those in between. In (van Breugel & Worrell, 2001) it was shown how, in a simplified setting of finite state space LMPs, metric distances could be calculated in polynomial time. This work, along with that of (Desharnais et al., 2002b), was adapted to finite MDPs in (Ferns et al., 2004). There, we used fixed point theory to construct metrics, each of which had bisimulation as its kernel, was sensitive to perturbations in MDP parameters, and provided bounds on the optimal values of states. We showed how to compute the metrics up to any prescribed degree of accuracy and then used them to directly aggregate sample finite MDPs.

In this paper we present a significant generalization of these previous results to MDPs with continuous state spaces. The linear programming arguments we used in our previous work no longer apply, and we have to use measure theory and duality theory on continuous state spaces. The mathematical theory is interesting in its own right. Although continuous MDPs are of great interest for practical applications, e.g. in the areas of automated control and robotics, the existing methods for measuring distances between states, for the purpose of state aggregation as well as other approximation methods, are still largely heuristic. As a result, it is hard to provide guaranteed error bounds between the correct and the approximate value function. It is also difficult to determine the impact that structural changes in the approximator would have on the quality of the approximation. The metrics we define in this paper allow the definition of error bounds for value functions. These bounds can be used as a tool in the analysis of existing approximation schemes.

The paper is organized as follows. In sections 2 and 3 we provide the theoretical tools necessary for the construction of our metrics. The actual construction is carried out in section 4, where we also argue that our metrics are the best for the job.
Section 5 provides a proof of value function continuity with respect to our metrics. In section 6 we provide a simple illustration of metric use in approximation. Finally, section 7 contains our conclusions and directions for future work.

2 Background

2.1 Markov Decision Processes

Let $(S, A, P, r)$ be a Markov decision process (MDP), where $S$ is a complete separable metric space equipped with its Borel sigma algebra $\Sigma$, $A$ is a finite set of actions, $r : S \times A \to \mathbb{R}$ is a measurable reward function, and $P : S \times A \times \Sigma \to [0, 1]$ is a labeled stochastic transition kernel, i.e. for all $a \in A$ and $s \in S$, $P(s, a, \cdot) : \Sigma \to [0, 1]$ is a probability measure, and for all $a \in A$ and $X \in \Sigma$, $P(\cdot, a, X) : S \to [0, 1]$ is a measurable function. We will use the following notation: for $a \in A$ and $s \in S$, $P^a_s$ denotes $P(s, a, \cdot)$ and $r^a_s$ denotes $r(s, a)$. Given a measure $P$ and an integrable function $f$, we denote the integral of $f$ with respect to $P$ by $P(f)$. We also make the following assumptions:

1. $B := \sup_{s, s', a} |r^a_s - r^a_{s'}| < \infty$.
2. For each $a \in A$, $r(\cdot, a)$ is continuous on $S$.
3. For each $a \in A$, $P^a_s$ is (weakly) continuous as a function of $s$, i.e. if $s_n$ tends to $s$ in $S$ then for every bounded continuous function $f : S \to \mathbb{R}$, $P^a_{s_n}(f)$ tends to $P^a_s(f)$.

The first assumption is a direct consequence of the standard assumption that rewards are bounded. The second assumption is non-standard, but very mild. In general, rewards in an MDP are not assumed to vary continuously (e.g., in goal-directed tasks). However, it is generally assumed that there would be a finite or countable number of discontinuities. In this case, it is easy to transform the reward structure into one that is continuous and arbitrarily close to the original one, e.g. by applying smoothing sigmoid functions at the points of discontinuity. The third assumption is a continuity assumption on the transition probabilities, and is satisfied by most reasonable systems (including physical systems of interest in control and robotics).
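The remark on assumption 2 can be illustrated with a minimal sketch, which is not taken from the paper. It replaces a discontinuous goal reward on $[0, 1]$ by a logistic sigmoid. The goal location 0.5, the steepness 200 and all function names are illustrative assumptions. The surrogate is close to the original only outside a small neighbourhood of the jump; at the jump itself the two differ by 1/2.

```python
import math

def step_reward(s, goal=0.5):
    """Hypothetical goal-directed reward: 1 at or beyond the goal, else 0."""
    return 1.0 if s >= goal else 0.0

def smoothed_reward(s, goal=0.5, steepness=200.0):
    """Continuous surrogate: a logistic sigmoid centred on the jump (assumed form)."""
    return 1.0 / (1.0 + math.exp(-steepness * (s - goal)))

def max_gap_outside(margin, goal=0.5, steepness=200.0, grid=1001):
    """Largest |step - smoothed| over grid points of [0, 1] at least `margin` from the jump."""
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(step_reward(s, goal) - smoothed_reward(s, goal, steepness))
               for s in points if abs(s - goal) >= margin)
```

Increasing the steepness shrinks the neighbourhood on which the two rewards disagree noticeably, at the cost of a larger Lipschitz constant for the smoothed reward.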
The discounted, infinite horizon planning task in an MDP is to determine a policy $\pi : S \to A$ that maximizes the value of every state, $V^\pi(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi]$, where $s_0$ is the state at time 0, $r_t$ is the reward achieved at time $t$, $\gamma$ is a discount factor in $(0, 1)$, and the expectation is taken by following the state dynamics induced by $\pi$. The function $V^\pi$ is called the value function of policy $\pi$. The optimal value function $V^*$, associated with an optimal policy, is the unique solution of the fixed point equation

$$V^*(s) = \max_{a \in A} \left( r^a_s + \gamma P^a_s(V^*) \right)$$

and can be used to directly determine an optimal policy, provided it is computable. Note that in general the optimal value function need not be measurable, in which case the fixed point equation would be invalid. However, under assumptions 1-3, this cannot be the case (see the relevant theorem of (Puterman, 1994)). In fact, in this case, the optimal value function can be computed as the limit of a sequence of iterates. Define $V_0 = 0$ and $V_{n+1}(s) = \max_{a \in A} (r^a_s + \gamma P^a_s(V_n))$. Then the $V_n$ converge to $V^*$ in the uniform (max-norm) metric.

Of course, for this computation to work in practice it would be desirable to work with a small discretized version of the given MDP. This brings about the problem of approximation, and of finding a formal definition which characterizes when states are equivalent (and hence can be lumped together). The correct equivalence relation is bisimulation.

2.2 Bisimulation

Bisimulation is a notion of behavioural equivalence, the strongest of a whole zoo of equivalence relations considered in concurrency theory. Bisimulation can be defined solely in terms of relations or using fixed point theory (so-called co-induction). The latter will be useful for our purposes, but first requires some basic definitions and tools from fixed point theory on lattices that can be found, for example, in (Winskel, 1993).

Let $(L, \preceq)$ be a partial order. If it has least upper bounds and greatest lower bounds of arbitrary subsets of elements, then it is said to be a complete lattice. A function $f : L \to L$ is said to be monotone if $x \preceq x'$ implies $f(x) \preceq f(x')$. A point $x$ in $L$ is said to be a prefixed point if $f(x) \preceq x$, a postfixed point if $x \preceq f(x)$ and a fixed point if $x = f(x)$. The importance of these definitions arises in the following theorem.

Theorem 2.1.¹ Let $L$ be a complete lattice, and suppose $f : L \to L$ is monotone. Then $f$ has a least fixed point, which is also its least prefixed point, and $f$ has a greatest fixed point, which is also its greatest postfixed point.

Let $REL$ be the complete lattice of binary relations on $S$ with the usual subset ordering. We say a set $X \subseteq S$ is $R$-closed if the collection of all those elements of $S$ that are reachable by $R$ from $X$ is itself contained in $X$. When $R$ is an equivalence relation this is equivalent to saying that $X$ is a union of $R$-equivalence classes.
We write $R_{rst}$ for the reflexive, symmetric, transitive closure of $R$, and $\Sigma(R)$ for those $\Sigma$-measurable sets that are $R$-closed.

Definition 2.2. Define $\mathcal{F} : REL \to REL$ by

$$s \, \mathcal{F}(R) \, s' \iff \forall a \in A, \ r^a_s = r^a_{s'} \ \text{and} \ \forall X \in \Sigma(R_{rst}), \ P^a_s(X) = P^a_{s'}(X).$$

The greatest fixed point of $\mathcal{F}$ is bisimulation, written $\sim$.

The existence of bisimulation is guaranteed by the fixed-point theorem. Unfortunately, as an exact equivalence, bisimulation suffers from issues of instability; that is, slight numerical differences in the MDP parameters, $r$ and $P$, can lead to vastly different bisimulation partitions. To get around this, one generalizes the notion of equivalence through metrics.

1 This is an elementary theorem sometimes called the Knaster-Tarski theorem in the literature. In fact the Knaster-Tarski theorem is a much stronger statement to the effect that the collection of fixed points is itself a complete lattice.

2.3 Metrics

Definition 2.3. A semimetric² on $S$ is a map $d : S \times S \to [0, \infty)$ such that for all $s, s', s''$:

1. $s = s' \Rightarrow d(s, s') = 0$
2. $d(s, s') = d(s', s)$
3. $d(s, s'') \le d(s, s') + d(s', s'')$

If the converse of the first axiom holds as well, we say $d$ is a metric.³

Recall that a function $h : S \times S \to \mathbb{R}$ is lower semicontinuous (lsc) if whenever $(s_n, s'_n)$ tends to $(s, s')$, $\liminf h(s_n, s'_n) \ge h(s, s')$. Here we are considering $S \times S$ to be endowed with the product topology. Note that lsc functions are product measurable.

Let $M$ be the set of semimetrics on $S$ that are lsc on $S \times S$ and uniformly bounded, e.g. those assigning distance at most 1, and give it the usual pointwise ordering. Then $M$ is a complete lattice. This follows because taking the pointwise supremum of an arbitrary collection of lsc functions yields an lsc function, and taking the pointwise supremum of an arbitrary collection of semimetrics yields a semimetric. Additionally, if we take $M$ with the metric induced by the uniform norm, $\|h\| = \sup_{s, s'} |h(s, s')|$, then it is a complete metric space. The rich structure of $M$ allows us to apply both the lattice theoretic fixed-point theorem and the more familiar Banach fixed-point theorem, provided we construct an appropriate map on $M$.
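For a finite MDP, the greatest fixed point of Definition 2.2 can be reached by starting from the relation that relates every pair of states and applying the operator until nothing changes. The following is a minimal sketch under that finiteness assumption; the function name, the dictionary layout of `rewards` and `transitions`, and the rounding to `digits` decimals are illustrative choices, not part of the paper. Relations are stored as partitions, since for an equivalence relation the closed sets are exactly the unions of classes.

```python
def bisimulation_classes(rewards, transitions, digits=12):
    """Greatest fixed point of the bisimulation operator on a finite MDP.

    rewards[a][s] is the reward r^a_s; transitions[a][s][t] is P^a_s({t}).
    Returns the bisimulation partition as a sorted list of sorted blocks.
    """
    n = len(next(iter(rewards.values())))
    actions = sorted(rewards)
    blocks = [list(range(n))]            # top of REL: every pair of states related
    while True:
        def signature(s):
            # Reward of each action, plus the mass that action sends into each block.
            return tuple(
                (round(rewards[a][s], digits),
                 tuple(round(sum(transitions[a][s][t] for t in b), digits)
                       for b in blocks))
                for a in actions)
        groups = {}
        for s in range(n):
            groups.setdefault(signature(s), []).append(s)
        refined = sorted(groups.values())
        if refined == blocks:            # fixed point reached
            return blocks
        blocks = refined

rewards = {"a": [1.0, 1.0, 0.0]}
exact = {"a": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]}
perturbed = {"a": [[0.0, 0.0, 1.0], [0.01, 0.0, 0.99], [0.0, 0.0, 1.0]]}
```

With `exact`, states 0 and 1 fall into the same class; with `perturbed`, a 1% change in one transition probability separates them, which is the instability that motivates the move to metrics.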
Since bisimulation involves an exact matching of rewards and probabilistic transitions, the appropriate metric generalization should involve a metric on rewards and a metric on probability measures. The choice of reward metric is obvious: the usual Euclidean distance. The choice of probability metric, however, is not so obvious.

2 They are often called pseudo-metrics in the literature.
3 For convenience we will use the terms metric and semimetric interchangeably; however, we really mean the latter.

3 Probability Metrics

There are numerous ways of defining a notion of distance between probability measures on a given space (Gibbs & Su, 2002). The particular probability semimetric of which we make use is known as the Kantorovich metric. Given a semimetric $h \in M$ and probability measures $P$ and $Q$ on $S$, the induced Kantorovich distance, $T_K(h)$, is defined by

$$T_K(h)(P, Q) = \sup_f \left( P(f) - Q(f) \right),$$

where the supremum is taken over all bounded measurable $f : S \to \mathbb{R}$ satisfying the Lipschitz condition: $f(x) - f(y) \le h(x, y)$ for all $x, y \in S$. We write $Lip(h)$ for the set of all such functions.

In light of the definition of bisimulation, the importance of using the Kantorovich distance is made evident in the following lemma.

Lemma 3.1. Let $h \in M$. Then $T_K(h)(P, Q) = 0 \iff P(X) = Q(X)$ for all $X \in \Sigma(Rel(h))$.

Proof. ($\Leftarrow$) Fix $\epsilon > 0$ and let $f \in Lip(h)$ be such that $T_K(h)(P, Q) < P(f) - Q(f) + \epsilon$. WLOG $f \ge 0$. Choose $\psi$ a simple approximation (the usual one) to $f$ so that $T_K(h)(P, Q) < P(\psi) - Q(\psi) + 2\epsilon$. Let $\psi(S) = \{c_1, \ldots, c_k\}$ where the $c_i$ are distinct, $E_i = \psi^{-1}(\{c_i\})$, and $R = Rel(h)$. Then each $E_i$ is $R$-closed, for if $y \in R(E_i)$ then there is some $x \in E_i$ such that $h(x, y) = 0$. So $f(x) = f(y)$ and therefore $\psi(x) = \psi(y)$. So $y \in E_i$. So by assumption $P(\psi) - Q(\psi) = \sum_i c_i P(E_i) - \sum_i c_i Q(E_i) = 0$. Thus, $T_K(h)(P, Q) = 0$.

($\Rightarrow$) Let $X \in \Sigma(R)$. Let $K \subseteq X$ be compact. Define $f(x) = \inf_{k \in K} h(x, k)$. Since an lsc function has a minimum on a compact set, we may write $f(x) = \min_{k \in K} h(x, k)$. In fact, $f$ is itself lsc (see Theorem B.5 of (Puterman, 1994)). Since $f$ is measurable, $R(K) = f^{-1}(\{0\}) \in \Sigma(R)$. Now, since $P$ is tight (as $S$ is a complete separable metric space), $P(X) = \sup P(K)$ where the supremum is taken over all compact $K \subseteq X$. However, $K \subseteq X$ implies $K \subseteq R(K) \subseteq R(X) = X$. Since $R(K)$ is measurable, we have $P(X) = \sup P(R(K))$. Similarly, $Q(X) = \sup Q(R(K))$. Define $g_n = \max(0, 1 - nf)$. Then $g_n$ decreases to the indicator function on $R(K)$. Also, $g_n / n \in Lip(h)$, so by assumption $P(g_n / n) = Q(g_n / n)$. Multiplying by $n$ and taking limits gives $P(R(K)) = Q(R(K))$ and we are done.
The Kantorovich metric arose in the study of optimal mass transportation (see (Villani, 2002)): Assume we are given a pile of sand and a hole, occupying measurable spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$, each representing a copy of $(S, \Sigma)$. The pile of sand and the hole obviously have the same volume, and the mass of the pile is assumed to be normalized to 1. Let $P$ and $Q$ be measures on $X$ and $Y$ respectively, such that whenever $A \in \Sigma_X$ and $B \in \Sigma_Y$, $P[A]$ measures how much sand occupies $A$ and $Q[B]$ measures how much sand can be piled into $B$. Suppose further that we have some measurable cost function $h : X \times Y \to \mathbb{R}$, where $h(x, y)$ tells us how much it costs to transfer one unit of mass from a point $x \in X$ to a point $y \in Y$. Here we consider $h \in M$. The goal is to determine a plan for transferring all the mass from $X$ to $Y$ while keeping the cost at a minimum. Such a transfer plan is modelled by a probability measure $\lambda$ on $(X \times Y, \Sigma_X \otimes \Sigma_Y)$, where $d\lambda(x, y)$ measures how much mass is transferred from location $x$ to $y$. Of course, for the plan to be valid we require that $\lambda[A \times Y] = P[A]$ and $\lambda[X \times B] = Q[B]$ for all measurable $A$ and $B$. A plan satisfying this condition is said to have marginals $P$ and $Q$, and we denote the collection of all such plans by $\Lambda(P, Q)$. We can now restate the goal formally as:

minimize $h(\lambda)$ over $\lambda \in \Lambda(P, Q)$

This is actually an instance of an infinite linear program. Fortunately, under very general circumstances, it has a solution and admits a dual formulation. Let us first note that measures in $\Lambda(P, Q)$ can equivalently be characterized as those $\lambda$ satisfying $P(\phi) + Q(\psi) = \lambda(\phi + \psi)$ for all $(\phi, \psi) \in L^1(P) \times L^1(Q)$. As a consequence of this characterization we have the following inequality:

$$\sup_f \left( P(f) - Q(f) \right) \le T_K(h)(P, Q) \le \inf_{\lambda \in \Lambda(P, Q)} h(\lambda) \qquad (1)$$

where $f$ is restricted to the continuous functions in $Lip(h)$. The leftmost and rightmost terms in inequality (1) are examples of infinite linear programs in duality.
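The sandwich in inequality (1) can be checked by hand on a tiny instance. The sketch below is illustrative only: it assumes finitely supported distributions on the real line with cost $h(x, y) = |x - y|$, and it uses the standard one-dimensional closed form (the integral of the absolute difference of the cumulative distribution functions), which the paper does not discuss. All function names are hypothetical.

```python
def kantorovich_1d(points, p, q):
    """Kantorovich distance for cost |x - y| between two distributions on the
    same sorted points of the real line, via the cumulative-distribution formula."""
    total, cdf_gap = 0.0, 0.0
    for i in range(len(points) - 1):
        cdf_gap += p[i] - q[i]                      # F_P(x_i) - F_Q(x_i)
        total += abs(cdf_gap) * (points[i + 1] - points[i])
    return total

def dual_value(points, p, q, f):
    """P(f) - Q(f) for a test function f; a lower bound when f is in Lip(h)."""
    return sum((pi - qi) * f(x) for x, pi, qi in zip(points, p, q))

def plan_cost(plan):
    """h(lambda) for a transfer plan given as {(x, y): mass}; an upper bound
    when the plan has marginals P and Q."""
    return sum(mass * abs(x - y) for (x, y), mass in plan.items())

points = [0.0, 0.5, 1.0]
p = [0.5, 0.5, 0.0]                          # half the sand at 0, half at 0.5
q = [0.0, 0.5, 0.5]                          # half the hole at 0.5, half at 1
plan = {(0.0, 0.5): 0.5, (0.5, 1.0): 0.5}    # shift every grain right by 0.5
lower = dual_value(points, p, q, lambda x: -x)
upper = plan_cost(plan)
exact = kantorovich_1d(points, p, q)
```

Here the test function $f(x) = -x$ and the shifting plan give matching lower and upper bounds, so both are optimal for this instance and there is no gap between the two programs.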
It is a highly nontrivial result that there is no duality gap in this case, as a result of the Kantorovich-Rubinstein Duality Theorem with metric cost function (see (Rachev & Rüschendorf, 1998), specifically theorems 4.15 & 4.28 and example 4.24, or see (Villani, 2002) for a more readable account of this topic). In the case of a finite state space, this duality leads to a strongly polynomial time algorithm (in terms of the size of the state space) for calculating the Kantorovich metric (Orlin, 1988). Thus, one approach for calculating the Kantorovich metric is to discretize the linear program in some manner and solve a finite linear program (see section 5.3 of (Rachev & Rüschendorf, 1998) for compact $S$). In further restricted settings, e.g. if $S$ is Euclidean and $h$ is continuous, more direct approximation schemes exist (see section 5.4 of (Anderson & Nash, 1987)). Issues of efficiency aside, the Kantorovich distance is computable.

We conclude this section by noting that if the state space metric is chosen to be the discrete metric, which assigns distance 1 to all pairs of unequal points, then the Kantorovich metric agrees with the total variation metric, defined as $d_{TV}(P, Q) = \sup_{X \in \Sigma} |P(X) - Q(X)|$. While simple to define, the total variation metric gives an overly strong measure of the numerical differences across probabilistic transitions to all measurable sets. Note, for example, that the distance between two point masses, $\delta_x$ and $\delta_y$, is always 1, unless $x = y$ exactly. Nevertheless, the total variation distance is commonly used in practice and can lead to interesting bounds.

4 Bisimulation Metrics

Our development of fixed point metrics mirrors closely the definition of bisimulation. In the following, $c \in (0, 1)$ is a discount factor, in the same vein as the discount factor $\gamma$ used in the definition and estimation of value functions. It determines the extent to which future transitions are taken into account when trying to distinguish states quantitatively. In section 2 we mentioned that $M$ is a uniformly bounded set of lsc semimetrics. Here we fix that upper bound to be the constant $\alpha$ defined as $\frac{B}{1 - c}$.

Theorem 4.1. Let $c \in (0, 1)$. Define $F_c : M \to M$ by

$$F_c(h)(s, s') = \max_{a \in A} \left( |r^a_s - r^a_{s'}| + c \, T_K(h)(P^a_s, P^a_{s'}) \right).$$

Then $F_c$ has a least fixed point, $d^c_{fix}$, and $Rel(d^c_{fix})$ is bisimulation.

Proof. It is easy to see that $F_c$ is monotone on $M$ and so existence of $d^c_{fix}$ follows from the Knaster-Tarski Theorem. It is important to note here that we are implicitly invoking the leftmost equality in (1) in order to correctly claim that the map taking $(s, s')$ to $T_K(h)(P^a_s, P^a_{s'})$ is lsc. By means of lemma 3.1 we find that for any $h$ in $M$, $Rel(F_c(h)) = \mathcal{F}(Rel(h))$, where $\mathcal{F}$ is the operator on relations of Definition 2.2. Thus, $Rel(d^c_{fix}) = \mathcal{F}(Rel(d^c_{fix}))$ is a fixed point and so is contained in bisimulation.
For the other direction, we consider the discrete bisimulation semimetric; note that it is not immediately clear that it is lsc. Call it $I_\sim$. Let $l$ be the greatest lower bound in $M$ of $\{\alpha I_\sim\}$. Then $\sim \ \subseteq Rel(l)$. Thus, $\sim \ = \mathcal{F}(\sim) \subseteq \mathcal{F}(Rel(l)) = Rel(F_c(l))$, which implies $F_c(l) \le \alpha I_\sim$. Since $F_c(l) \in M$, we must have $F_c(l) \le l$. Since $d^c_{fix}$ is the least prefixed point of $F_c$, $d^c_{fix} \le l \le \alpha I_\sim$, so that $\sim \ \subseteq Rel(d^c_{fix})$.

Thus, we have established existence of a metric that assigns distance zero to points exactly in the case when those points are bisimilar. Of course, $d^c_{fix}$ is not the only such metric; the discrete bisimulation metric, for example, is another. However, $d^c_{fix}$ is the most suitable candidate bisimulation metric for MDP analysis. Before we argue that this is the case, let us first note that $d^c_{fix}$ is in fact unique.

Proposition 4.2. For any $h_0 \in M$,

$$\| d^c_{fix} - (F_c)^n(h_0) \| \le \frac{c^n}{1 - c} \| F_c(h_0) - h_0 \|.$$

In particular, $\lim_{n \to \infty} (F_c)^n(h_0) = d^c_{fix}$, and $d^c_{fix}$ is the unique fixed point of $F_c$.

Proof. This is simply an application of the Banach Fixed Point Theorem. Here we use the dual minimization form of $T_K(\cdot)$, as given in (1). Note that for all $h, h' \in M$, and for all $s, s' \in S$,

$$\begin{aligned}
F_c(h)(s, s') - F_c(h')(s, s')
&\le c \max_{a \in A} \left( T_K(h)(P^a_s, P^a_{s'}) - T_K(h')(P^a_s, P^a_{s'}) \right) \\
&= c \max_{a \in A} \left( T_K(h - h' + h')(P^a_s, P^a_{s'}) - T_K(h')(P^a_s, P^a_{s'}) \right) \\
&\le c \max_{a \in A} \left( T_K(\|h - h'\| + h')(P^a_s, P^a_{s'}) - T_K(h')(P^a_s, P^a_{s'}) \right) \\
&\le c \max_{a \in A} \left( \|h - h'\| + T_K(h')(P^a_s, P^a_{s'}) - T_K(h')(P^a_s, P^a_{s'}) \right) \\
&= c \, \|h - h'\|.
\end{aligned}$$

Thus, $\|F_c(h) - F_c(h')\| \le c \, \|h - h'\|$, so that $F_c$ is a contraction mapping and has a unique fixed point $d^c_{fix}$.

As an immediate corollary of theorem 4.1 we find that bisimulation is a closed subset of $S \times S$, under the given restrictions on $r$ and $P$. So the discrete bisimulation metric, $\alpha I_\sim$, is lsc, and in particular, $\{(F_c)^n(\alpha I_\sim)\}$ is a family of lsc semimetrics decreasing to $d^c_{fix}$, each of which has bisimulation as its kernel. The first iterate can be expressed in a more familiar form by noting that $T_K(I_\sim)(P, Q) = \sup_{X \in \Sigma(\sim)} |P(X) - Q(X)|$, which is the total variation distance of $P$ and $Q$ as defined over the fully compressed state space (see appendix for proof).
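The iteration of Proposition 4.2 can be sketched for a special case in which the Kantorovich term needs no linear program. The code below assumes a finite MDP with deterministic transitions, so that each $P^a_s$ is a point mass and $T_K(h)(\delta_x, \delta_y) = h(x, y)$. The function name, the `successor` table and the stopping tolerance are illustrative assumptions, not part of the paper.

```python
def iterate_metric(rewards, successor, c, tol=1e-9):
    """Iterate F_c from h_0 = 0 on a finite MDP with deterministic transitions.

    rewards[a][s] is r^a_s and successor[a][s] is the unique next state, so the
    Kantorovich term reduces to h(successor[a][s], successor[a][t]).
    Returns the last iterate, the number of steps, and the Proposition 4.2 bound.
    """
    n = len(next(iter(rewards.values())))

    def apply_F(h):
        return [[max(abs(rewards[a][s] - rewards[a][t])
                     + c * h[successor[a][s]][successor[a][t]]
                     for a in rewards)
                 for t in range(n)] for s in range(n)]

    h = [[0.0] * n for _ in range(n)]
    first = apply_F(h)
    # ||F_c(h_0) - h_0|| in the uniform norm, with h_0 = 0.
    first_step = max(max(row) for row in first)
    steps, bound = 0, first_step / (1 - c)
    while bound > tol:
        h = apply_F(h)
        steps += 1
        bound = c ** steps / (1 - c) * first_step
    return h, steps, bound

rewards = {"stay": [0.0, 1.0]}
successor = {"stay": [0, 1]}
c = 0.5
h, steps, bound = iterate_metric(rewards, successor, c)
```

In this two-state example the fixed point equation reads $d(0, 1) = 1 + c \, d(0, 1)$, so the iterates approach $1 / (1 - c)$ and the returned `bound` is the a priori error estimate of the proposition.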
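The advantage of using $d^c_{fix}$ over any of these iterates is that $d^c_{fix}$ is sensitive to perturbations in the MDP parameters. Formally, $d^c_{fix}$ is continuous in $r$ and $P$.

Proposition 4.3. Suppose $(r_i, P_i)$, $i = 1, 2$, are MDP parameters, each satisfying the assumptions of section 2, and set $B = \max(B_1, B_2)$. Let $d_1$ and $d_2$ be the corresponding bisimulation metrics given by theorem 4.1 with discount factor $c$. Then

$$\| d_1 - d_2 \| \le \frac{2}{1 - c} \max_{a \in A} \| r^a_1 - r^a_2 \| + \frac{2Bc}{(1 - c)^2} \sup_{a, s} d_{TV}(P^a_{1,s}, P^a_{2,s}).$$

This result follows from the unwinding of the fixed point definitions of $d_1$ and $d_2$.

Proof. Write $I_{\neq}$ for the discrete metric, which assigns distance 1 to all pairs of unequal points. Since $Lip\left(\frac{d_2}{\|d_2\|}\right) \subseteq Lip(I_{\neq})$, we first obtain the following inequality:

$$\begin{aligned}
&\left| T_K(d_2)(P^a_{1,x}, P^a_{1,y}) - T_K(d_2)(P^a_{2,x}, P^a_{2,y}) \right| \\
&\quad \le \sup_{f \in Lip(d_2)} \left| (P^a_{1,x}(f) - P^a_{1,y}(f)) - (P^a_{2,x}(f) - P^a_{2,y}(f)) \right| \\
&\quad = \|d_2\| \sup_{f \in Lip(d_2)} \left| \left( P^a_{1,x}\left(\tfrac{f}{\|d_2\|}\right) - P^a_{2,x}\left(\tfrac{f}{\|d_2\|}\right) \right) - \left( P^a_{1,y}\left(\tfrac{f}{\|d_2\|}\right) - P^a_{2,y}\left(\tfrac{f}{\|d_2\|}\right) \right) \right| \\
&\quad \le \|d_2\| \left( \sup_{g \in Lip(I_{\neq})} |P^a_{1,x}(g) - P^a_{2,x}(g)| + \sup_{g \in Lip(I_{\neq})} |P^a_{1,y}(g) - P^a_{2,y}(g)| \right) \\
&\quad = \|d_2\| \left( d_{TV}(P^a_{1,x}, P^a_{2,x}) + d_{TV}(P^a_{1,y}, P^a_{2,y}) \right).
\end{aligned}$$

Here we are once more using the minimization form of $T_K$. Then

$$\begin{aligned}
d_1(x, y) - d_2(x, y)
&= \max_{a \in A} \left( |r^a_{1,x} - r^a_{1,y}| + c \, T_K(d_1)(P^a_{1,x}, P^a_{1,y}) \right) - \max_{a \in A} \left( |r^a_{2,x} - r^a_{2,y}| + c \, T_K(d_2)(P^a_{2,x}, P^a_{2,y}) \right) \\
&\le \max_{a \in A} \left( |r^a_{1,x} - r^a_{1,y}| - |r^a_{2,x} - r^a_{2,y}| + c \left( T_K(d_1)(P^a_{1,x}, P^a_{1,y}) - T_K(d_2)(P^a_{2,x}, P^a_{2,y}) \right) \right) \\
&\le \max_{a \in A} \Big( |(r^a_{1,x} - r^a_{1,y}) - (r^a_{2,x} - r^a_{2,y})| + c \left( T_K(d_1)(P^a_{1,x}, P^a_{1,y}) - T_K(d_2)(P^a_{1,x}, P^a_{1,y}) \right) \\
&\qquad\qquad + c \left( T_K(d_2)(P^a_{1,x}, P^a_{1,y}) - T_K(d_2)(P^a_{2,x}, P^a_{2,y}) \right) \Big) \\
&\le \max_{a \in A} \left( |r^a_{1,x} - r^a_{2,x}| + |r^a_{1,y} - r^a_{2,y}| + c \, \|d_1 - d_2\| + 2c \, \|d_2\| \sup_s d_{TV}(P^a_{1,s}, P^a_{2,s}) \right) \\
&\le \max_{a \in A} \left( 2 \| r^a_1 - r^a_2 \| + c \, \|d_1 - d_2\| + 2c \, \|d_2\| \sup_s d_{TV}(P^a_{1,s}, P^a_{2,s}) \right) \\
&\le 2 \max_{a \in A} \| r^a_1 - r^a_2 \| + c \, \|d_1 - d_2\| + 2c \left( \frac{B}{1 - c} \right) \sup_{a, s} d_{TV}(P^a_{1,s}, P^a_{2,s}).
\end{aligned}$$

Taking the supremum over $x$ and $y$, using the symmetry of the argument in $d_1$ and $d_2$, and rearranging gives the stated bound.

Finally, note that proposition 4.2 allows us to calculate distances up to any prescribed degree of accuracy using iteration, provided the Kantorovich metrics can be efficiently and suitably calculated themselves. It remains to be seen if such a method will be feasible in practice.

5 Value Function Bounds

Theorem 5.1. Suppose $\gamma \le c$. Then $V^*$ is 1-Lipschitz continuous with respect to $d^c_{fix}$, i.e. $|V^*(s) - V^*(s')| \le d^c_{fix}(s, s')$.

Proof. Each iterate $V_n$ is continuous, and so each $|V_n(s) - V_n(s')|$ belongs to $M$. The result now follows by induction and taking limits.

6 Illustration

In this section we present a toy example of metric computation and metric approximation guarantees. Let $S = [0, 1]$ with the usual Borel sigma-algebra, $A = \{a, b\}$, $r^a_s = 1 - s$, $r^b_s = s$, $P^a_s$ be uniform on $S$, and $P^b_s$ the point mass at $s$. Clearly, these MDP parameters satisfy the required assumptions.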
Given any $c \in (0, 1)$, we claim $d^c_{fix}(x, y) = \frac{|x - y|}{1 - c}$. Denote the RHS by $h$. Note that $T_K(h)(P^a_x, P^a_y) = 0$ and $T_K(h)(P^b_x, P^b_y) = \sup_{f \in Lip(h)} (f(x) - f(y))$. Taking $f_1(x) = \frac{x}{1 - c}$ and $f_2(x) = 1 - f_1(x)$ in $Lip(h)$ we find $T_K(h)(P^b_x, P^b_y) = h(x, y)$. Thus,

$$F_c(h)(x, y) = \max\left( |x - y| + c \cdot 0, \ |x - y| + c \cdot h(x, y) \right) = |x - y| + c \, h(x, y) = h(x, y).$$

By uniqueness, $d^c_{fix} = h$.

Now consider the following approximation. Given $\epsilon > 0$, choose $n$ large enough so that $\frac{1}{n} < (1 - c)\epsilon$. Partition $S$ as $B_k = [\frac{k}{n}, \frac{k+1}{n})$ for $k = 0, 1, 2, \ldots, n - 2$, and $B_{n-1} = [\frac{n-1}{n}, 1]$. Note that the diameter of each $B_k$ with respect to $d^c_{fix}$ is $\frac{1}{n(1 - c)} < \epsilon$. The $n$ blocks of the partition will be the states of a finite MDP approximant. We obtain the rest of the parameters by averaging over the states in a block. Thus, $r^a_{B_k} = 1 - \frac{2k+1}{2n}$, $r^b_{B_k} = \frac{2k+1}{2n}$, $P^a_{B_k, B_l} = \frac{1}{n}$, and $P^b_{B_k, B_l} = \delta_{B_k, B_l}$.

Assume $\gamma$ is given and choose $c = \gamma$. Note that for all $x, y \in B_k$, $|V^*(x) - V^*(y)| \le \mathrm{diam}_{d^c_{fix}} B_k \le \epsilon$. Thus, we would expect that by averaging, and solving the finite MDP, $V^*(B_k)$ should differ by at most $\epsilon$ from $V^*(x)$, for any $x \in B_k$. In fact, in this case the value functions of the original MDP and of the finite approximant can be computed directly and we can verify this.
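The finite approximant described above is small enough to solve numerically. The following is a minimal sketch, assuming the averaged parameters exactly as given in the text; the choices $\gamma = 0.5$ and $\epsilon = 0.1$, the tolerance, and the function name are illustrative assumptions. It solves the block MDP by value iteration and records the largest difference between values of neighbouring blocks.

```python
def solve_approximant(n, gamma, tol=1e-10):
    """Value iteration on the n-block approximant of the toy MDP on [0, 1]."""
    mid = [(2 * k + 1) / (2 * n) for k in range(n)]   # block midpoints (2k+1)/(2n)
    v = [0.0] * n
    while True:
        mean_v = sum(v) / n                 # action a: uniform over the n blocks
        new_v = [max(1 - mid[k] + gamma * mean_v,    # reward 1 - (2k+1)/(2n)
                     mid[k] + gamma * v[k])          # action b: stay in B_k
                 for k in range(n)]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v

gamma, eps = 0.5, 0.1
n = 1
while 1 / n >= (1 - gamma) * eps:     # smallest n with 1/n < (1 - c) * eps, c = gamma
    n += 1
values = solve_approximant(n, gamma)
# Largest jump between neighbouring blocks of the approximant.
largest_jump = max(abs(values[k + 1] - values[k]) for k in range(n - 1))
```

Block midpoints are $\frac{1}{n}$ apart, which is a distance $\frac{1}{n(1 - c)} < \epsilon$ in the metric, so a value function that is 1-Lipschitz in the spirit of Theorem 5.1 cannot change by more than $\epsilon$ between neighbouring blocks; `largest_jump` lets one check this on the approximant.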

For $x \in S$,

$$V^*(x) = \begin{cases} 1 - x + \frac{\gamma}{2(1 - \gamma)} & \text{if } 0 \le x < \frac{1}{2} \\ \frac{x}{1 - \gamma} & \text{if } \frac{1}{2} \le x \le 1 \end{cases}$$

$$V^*(B_k) = \begin{cases} 1 - \frac{2k+1}{2n} + \frac{\gamma}{2(1 - \gamma)} & \text{if } 0 \le k < \frac{n-1}{2} \\ \frac{1}{1 - \gamma} \cdot \frac{2k+1}{2n} & \text{if } \frac{n-1}{2} \le k \le n - 1 \end{cases}$$

Thus, for $x \in B_k$, $|V^*(x) - V^*(B_k)| \le \frac{1}{1 - \gamma} \left| x - \frac{2k+1}{2n} \right| \le \mathrm{diam}_{d^c_{fix}} B_k \le \epsilon$.

7 Conclusion

In this paper we have constructed metrics for MDPs with continuous state spaces. Each metric has bisimulation as its kernel and is continuous in the MDP parameters. Most importantly, each metric bounds the optimal value of states continuously. Hence, if one were to aggregate states, this metric allows a guarantee on the error introduced by this approximation.

In contrast to previous situations, in this theoretical development the most important factor that we had to take into consideration was the way in which the rewards vary across the state space. What can be said in the case of a general bounded measurable, yet not necessarily continuous, reward function? In order to generalize our results, we need to establish the measurability of the map taking a pair of states to the Kantorovich distance, and to generalize lemma 3.1. We are currently working on this development. In the meantime, if the reward structure does not satisfy our assumption, we can still consider the best lsc approximations to $|r^a_s - r^a_{s'}|$ in $M$. That is, we can replace $|r^a_s - r^a_{s'}|$ by $R^a_1(s, s') = \inf_M \{ |r^a_s - r^a_{s'}| \}$ and $R^a_2(s, s') = \sup_M \{ |r^a_s - r^a_{s'}| \}$ and obtain two fixed point semimetrics $d^c_1$ and $d^c_2$, respectively. Then it is not hard to modify theorem 4.1 to show that $Rel(d^c_2) \subseteq \ \sim \ \subseteq Rel(d^c_1)$. The idea is that we are sandwiching bisimulation by nearby closed equivalence relations.

The theoretical foundation we established can be used, potentially, in two different ways. The first idea is to use the distance metric in the process of state aggregation, in order to provide a finite approximant for a continuous state MDP. However, even though our distance metrics are computable, the computation methods that we have investigated so far are not satisfactory. Discretizing the Kantorovich linear program may result in added complexity when one considers that the direct solution might be simple.
On the other hand, more direct methods of calculating the distance are not currently known in general. The second idea, which holds a lot of promise, is to use our metric as a tool for the theoretical analysis of existing approximation schemes. There are many heuristic methods for providing variable resolution or multi-resolution approximations to MDPs with continuous state spaces. Using our metrics, the error bounds of these heuristics can be analyzed. A second important application is in the analysis of approximation schemes which start with a coarse approximant and gradually increase the resolution. The distance metrics can provide tools for proving that such schemes converge to correct value estimates in the limit. We are currently pursuing research in this direction.

Acknowledgments

This work has been supported in part by funding from NSERC and CFI.

Appendix

Lemma 7.1. Suppose $C$ is a closed equivalence relation on $S$. Then

$$T_K(I_C)(P, Q) = \sup_{X \in \Sigma(C)} |P(X) - Q(X)|.$$

Proof. For every $X \in \Sigma(C)$, the indicator function on $X$ belongs to $Lip(I_C)$. Thus, the RHS is at most the LHS. For the other inequality, fix positive $\epsilon$ and take $f : S \to [0, 1]$ and $\psi = \sum_{i=1}^n c_i I_{E_i}$ as in the proof of lemma 3.1. Let $J = \{ i \mid P(E_i) \ge Q(E_i) \}$. Then

$$\begin{aligned}
T_K(I_C)(P, Q) - 2\epsilon
&\le P(\psi) - Q(\psi) \\
&= \sum_i c_i (P(E_i) - Q(E_i)) \\
&\le \sum_{i \in J} c_i (P(E_i) - Q(E_i)) \\
&\le (\max_i c_i) \sum_{i \in J} (P(E_i) - Q(E_i)) \\
&\le 1 \cdot \left( P(\cup_{i \in J} E_i) - Q(\cup_{i \in J} E_i) \right) \\
&\le \sup_{X \in \Sigma(C)} |P(X) - Q(X)|
\end{aligned}$$

since $\cup_{i \in J} E_i$ belongs to $\Sigma(C)$.

References

Anderson, E. J., & Nash, P. (1987). Linear Programming in Infinite-Dimensional Spaces. John Wiley and Sons, Ltd.

Bernardo, M., & Bravetti, M. (2003). Performance measure sensitive congruences for Markovian process algebras. Theoretical Computer Science, 290.

Blute, R., Desharnais, J., Edalat, A., & Panangaden, P. (1997). Bisimulation for labelled Markov processes. In Proceedings of the Twelfth IEEE Symposium On Logic In Computer Science, Warsaw, Poland.

Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11.

Desharnais, J., Edalat, A., & Panangaden, P. Bisimulation for labeled Markov processes. Information and Computation, vol 179.

Desharnais, J., Gupta, V., Jagadeesan, R., & Panangaden, P. (1999). Metrics for labeled Markov systems. International Conference on Concurrency Theory.

Desharnais, J., Gupta, V., Jagadeesan, R., & Panangaden, P. (2002). The metric analogue of weak bisimulation for probabilistic processes. Logic in Computer Science. IEEE Computer Society.

Ferns, N., Panangaden, P., & Precup, D. (2004). Metrics for finite Markov decision processes. Proceedings of the 20th conference on Uncertainty in artificial intelligence.

Gibbs, A. L., & Su, F. E. (2002). On choosing and bounding probability metrics. International Statistical Review, 70.

Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147.

Larsen, K., & Skou, A. (1991). Bisimulation through probabilistic testing. Information and Computation, 94.

Milner, R. (1980). A calculus of communicating systems. Lecture Notes in Computer Science Vol. 92. Springer-Verlag.

Orlin, J. (1988). A faster strongly polynomial minimum cost flow algorithm. Proceedings of the Twentieth annual ACM symposium on Theory of Computing. ACM Press.

Park, D. (1981). Concurrency and automata on infinite sequences. Proceedings of the 5th GI-Conference on Theoretical Computer Science. Springer-Verlag.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc.

Rachev, S. T., & Rüschendorf, L. (1998). Mass Transportation Problems, Vol. I: Theory. Springer, Berlin Heidelberg New York.

van Breugel, F., & Worrell, J. (2001). Towards Quantitative Verification of Probabilistic Transition Systems. Proceedings of the 28th International Colloquium on Automata, Languages, and Programming (ICALP). Springer-Verlag.

van Breugel, F., & Worrell, J. (2001). An algorithm for quantitative verification of probabilistic transition systems. Proceedings of the 12th International Conference on Concurrency Theory. Springer-Verlag.

Villani, C. (2002). Topics in Mass Transportation. [ seminr/rticles/vilnotes.ps](28/07/03)

Winskel, G. (1993). The formal semantics of programming languages. Foundations of Computing. The MIT Press.
