Continuous-Time Markov Decision Processes: Discounted and Average Optimality Conditions
Xianping Guo, Zhongshan University
Outline
- The control model
- The existing works
- Our conditions and results
- Examples
1. The control model

The model of continuous-time Markov decision processes:
$$\{S,\ (A(x)\subset A,\ x\in S),\ q(\cdot\mid x,a),\ r(x,a)\},\qquad(1)$$
where
- $S$: the state space, a Polish space;
- $A(x)$: the set of admissible actions in state $x$; $A$: a Polish space of actions;
- $q(\cdot\mid x,a)$: the transition rates, $a\in A(x)$;
- $r(x,a)$: the reward/cost rates, $a\in A(x)$.

A Markov policy $\pi$: a family $(\pi_t,\ t\ge 0)$ of stochastic kernels on $A$ given $S$.

A stationary policy $f$: a measurable function on $S$ with $f(x)\in A(x)$ for all $x\in S$.

For each Markov policy $\pi=(\pi_t,\ t\ge 0)$, define the transition rates
$$Q(D\mid x,\pi_t):=\int_{A(x)} q(D\mid x,a)\,\pi_t(da\mid x).\qquad(2)$$
To guarantee the existence of a Q-process with the transition rates $Q(D\mid x,\pi_t)$, we introduce admissible policies.

Definition 1.1 (Admissible Policies). A Markov policy $(\pi_t)$ is called admissible if $Q(D\mid x,\pi_t)$ is continuous in $t\ge 0$. Let $\Pi$ be the class of admissible policies.

Under suitable conditions, we can define the expected discounted and average criteria:
$$J(s,x,\pi)=\int_s^\infty \tilde{E}^\pi_{s,x}\big[e^{-\alpha(t-s)}\,r(\xi(t),\eta(t))\big]\,dt,$$
$$V(x,\pi)=\liminf_{T\to\infty}\frac{1}{T}\int_0^T \tilde{E}^\pi_{0,x}\big[r(\xi(t),\eta(t))\big]\,dt.$$

Definition 1.2. A policy $\pi^*$ is said to be discounted optimal if $J(s,x,\pi^*)\ge J(s,x,\pi)$ for all $\pi\in\Pi$, $x\in S$ and $s\ge 0$. Similarly, we can define an average optimal policy, with $V$ in place of $J$.

Main aims:
- conditions for the existence of optimal policies;
- algorithms for computing optimal policies;
- the characterization of optimal policies, and applications.
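When the state space is finite and the rates are bounded, the discounted value of a fixed stationary policy $f$ can be computed directly: it solves the linear system $(\alpha I - Q_f)u = r_f$, where $Q_f$ is the generator under $f$. A minimal sketch, with a made-up three-state generator and reward vector purely for illustration:

```python
import numpy as np

# Hypothetical 3-state CTMDP under a fixed stationary policy f.
# Q_f: generator (transition-rate) matrix; each row sums to zero.
Q_f = np.array([[-2.0,  1.5,  0.5],
                [ 1.0, -3.0,  2.0],
                [ 0.5,  0.5, -1.0]])
r_f = np.array([1.0, 4.0, 2.0])   # reward rates under f
alpha = 0.1                        # discount rate, alpha > 0

# The discounted value u(x) = E_x int_0^inf e^{-alpha t} r(x(t)) dt
# satisfies (alpha * I - Q_f) u = r_f, which we solve directly.
u = np.linalg.solve(alpha * np.eye(3) - Q_f, r_f)
print(u)
```

Since $\alpha I - Q_f$ is a nonsingular M-matrix for $\alpha>0$, the system always has a unique solution, and the solution is positive when the rewards are.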
2. The Existing Works

Let $\|r\|:=\sup_{(x,a)\in K}|r(x,a)|$ and $\|q\|:=\sup_{(x,a)\in K}\big(-q(\{x\}\mid x,a)\big)$, where $K:=\{(x,a):x\in S,\ a\in A(x)\}$.

When $S$ is denumerable:
- Case 1, $\|r\|<\infty$, $\|q\|<\infty$: Kakumanu (1972, 1975); Dong (1979); Cao (2002).
- Case 2, $\|r\|<\infty$, $\|q\|=\infty$: Doshi (1975); Song (1987); Guo & Liu (2001).
- Case 3, $\|r\|=\infty$, $\|q\|<\infty$: Yushkevich (1979); Puterman (1994); Haviv (1998); Sennott (1999); Lewis & Puterman (2000).
- Case 4, $\|r\|=\infty$, $\|q\|=\infty$: Guo & Zhu (2001, 2002); Guo & Hernandez-Lerma (2003a, 2003b).

When $S$ is not denumerable:
- Case 1, $\|r\|<\infty$, $\|q\|=\infty$: Doshi (1975) (finite $A(x)$).
- Case 2, $r(x,a)\ge 0$, $\|q\|=\infty$: Hernandez-Lerma & Govindan (2001) (assuming the existence of a solution to the optimality equation!).
- Case 3, $\|r\|=\infty$, $\|q\|=\infty$: open.
3. On the Discounted Criterion

To ensure the regularity of a possibly nonhomogeneous Q-process, we use the following drift conditions.

Assumption A. There exist a measurable function $w_1\ge 1$ on $S$ and constants $c_1\ne 0$, $b_1\ge 0$ and $M_q>0$ such that
(1) $\int_S w_1(y)\,q(dy\mid x,a)\le c_1 w_1(x)+b_1$ for all $(x,a)\in K$;
(2) $q(x)\le M_q\,w_1(x)$, where $q(x):=\sup_{a\in A(x)}[-q(\{x\}\mid x,a)]$.

Theorem 3.1. Let $w$ be any nonnegative measurable function on $S$, and let $c$ ($\ne 0$) and $b$ ($\ge 0$) be two constants. Then for each $\pi\in\Pi$, the following statements are equivalent:
(a) $\int_S w(y)\,q(dy\mid x,\pi_t)\le c\,w(x)+b$ for all $x\in S$, $t\ge 0$;
(b) $\int_S w(y)\,p^{\min}_\pi(s,x,t,dy)\le e^{c(t-s)}w(x)+\dfrac{b}{c}\big[e^{c(t-s)}-1\big]$ for all $x\in S$, $t\ge s\ge 0$,
where $p^{\min}_\pi(s,x,t,dy)$ is the minimum Q-process with the transition rates $Q(D\mid x,\pi_t)$.

Theorem 3.2. Suppose that Assumption A holds. Then for each $\pi\in\Pi$, $x\in S$, $t\ge s\ge 0$:
(a) the Q-process is regular (i.e., $p^{\min}_\pi(s,x,t,S)\equiv 1$);
(b) $E^\pi_{s,x}\,w_1(x(t,\pi))\le e^{c_1(t-s)}w_1(x)+\dfrac{b_1}{c_1}\big(e^{c_1(t-s)}-1\big)$.
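The moment bound of Theorem 3.2(b) can be checked by hand on a simple uncontrolled example. For a linear birth-death chain with birth rate $\lambda i$ and death rate $\mu i$ ($\mu>\lambda$), the choice $w_1(i)=i+1$ gives $\int w_1\,dq = (\lambda-\mu)i = c_1 w_1(i)+b_1$ with $c_1=\lambda-\mu<0$, $b_1=\mu-\lambda$, and $E_x[x(t)]=x\,e^{(\lambda-\mu)t}$ exactly. A small numerical sketch (the rates and horizon are made-up):

```python
import numpy as np

# Made-up linear birth-death instance: birth rate lam*i, death rate mu*i.
# With w1(i) = i + 1, Assumption A(1) holds with c1 = lam - mu < 0 and
# b1 = mu - lam.
lam, mu, x0, t = 1.0, 3.0, 10, 0.7
c1, b1 = lam - mu, mu - lam

# Exact expectation for this chain: E[x(t)] = x0 * exp((lam - mu) t),
# hence E[w1(x(t))] = x0 * exp(c1 t) + 1.
lhs = x0 * np.exp(c1 * t) + 1
# Bound from Theorem 3.2(b): e^{c1 t} w1(x0) + (b1/c1)(e^{c1 t} - 1).
rhs = np.exp(c1 * t) * (x0 + 1) + (b1 / c1) * (np.exp(c1 * t) - 1)
print(lhs, rhs)  # for this linear chain the bound holds with equality
```

Here $b_1/c_1=-1$, so the right-hand side simplifies to $x_0 e^{c_1 t}+1$: the drift bound is tight for this instance.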
To ensure the finiteness of the discounted criterion $J(s,x,\pi)$, by Theorem 3.2(b) it is natural to propose the following conditions.

Assumption B.
(1) $|r(x,a)|\le M_1 w_1(x)$ for all $(x,a)\in K$.
(2) $\alpha-c_1>0$.

To ensure the existence of optimal stationary policies, in addition to Assumptions A and B we propose the following.

Assumption C.
(1) $A(x)$ is a compact set for each $x\in S$;
(2) $r(x,a)$ is continuous in $a\in A(x)$, for each fixed $x\in S$;
(3) for each $x\in S$, the function $\int_S u(y)\,q(dy\mid x,a)$ is continuous in $a\in A(x)$ for every bounded measurable function $u$ on $S$, and also for $u=w_1$;
(4) there exist a nonnegative measurable function $w_2$ on $S$ and constants $c_2>0$, $b_2\ge 0$ and $M_2>0$ such that
$$q(x)\,w_1(x)\le M_2 w_2(x),\qquad \int_S w_2(y)\,q(dy\mid x,a)\le c_2 w_2(x)+b_2.$$

Let $B_{w_1}(S):=\Big\{u:\ \sup_{x\in S}\dfrac{|u(x)|}{w_1(x)}<\infty\Big\}$.

Theorem 3.3. Suppose that Assumptions A, B and C hold.
(a) $|J(s,x,\pi)|\le \dfrac{b_1 M_1}{\alpha(\alpha-c_1)}+\dfrac{M_1 w_1(x)}{\alpha-c_1}$.
(b) There exists a function $u^*\in B_{w_1}(S)$ satisfying the optimality equation
$$\alpha u^*(x)=\sup_{a\in A(x)}\Big\{r(x,a)+\int_S u^*(y)\,q(dy\mid x,a)\Big\},\qquad x\in S.$$
(c) $u^*(x)=\sup_{\pi\in\Pi}J(s,x,\pi)=:J^*_\alpha(x)$ for all $s\ge 0$.
(d) There exists an optimal stationary policy.

Theorem 3.3 ensures the existence of an optimal stationary policy. Under the hypotheses of Theorem 3.3, for each $f\in F$ we define a stochastic process
$$M(\tau,f)=\int_0^\tau e^{-\alpha t}\,r(x(t,f),f(x(t,f)))\,dt+e^{-\alpha\tau}u^*(x(\tau,f)).$$
Theorem 3.4. Suppose that Assumptions A, B and C hold. Then the following statements are equivalent:
(a) $f^*\in F$ is discounted optimal;
(b) for each $x\in S$, $\{M(\tau,f^*)\}$ is a $P^{f^*}_x$-martingale with respect to $\mathcal{F}_t=\sigma\{x(s,f^*):s\le t\}$.

Theorem 3.4 gives a martingale characterization of discounted optimal stationary policies.
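For a finite state space with bounded rates, the discounted optimality equation of Theorem 3.3(b) can be solved by uniformization: choosing $\Lambda\ge\sup_x q(x)$, the equation is equivalent to the fixed-point relation $u=\max_a\{r_a+\Lambda P_a u\}/(\alpha+\Lambda)$ with stochastic matrices $P_a=I+Q_a/\Lambda$, a contraction with modulus $\Lambda/(\alpha+\Lambda)<1$. A sketch with made-up two-state, two-action data (not from the talk):

```python
import numpy as np

# Illustrative finite CTMDP: 2 states, 2 actions; q[a] is the generator
# under action a (rows sum to zero), r[a] the reward-rate vector.
q = {0: np.array([[-1.0, 1.0], [2.0, -2.0]]),
     1: np.array([[-3.0, 3.0], [0.5, -0.5]])}
r = {0: np.array([1.0, 0.0]), 1: np.array([2.0, -0.5])}
alpha = 0.2

# Uniformization constant: at least the maximal total exit rate.
Lam = max(np.max(-np.diag(q[a])) for a in q)

u = np.zeros(2)
for _ in range(500):
    # Fixed-point step: (alpha + Lam) u = max_a { r_a + (Lam I + Q_a) u },
    # since Lam * P_a = Lam * I + Q_a.
    candidates = [(r[a] + (Lam * np.eye(2) + q[a]) @ u) / (alpha + Lam)
                  for a in q]
    u_new = np.max(candidates, axis=0)
    if np.max(np.abs(u_new - u)) < 1e-12:
        u = u_new
        break
    u = u_new

# u now approximates the solution of alpha*u = max_a { r_a + Q_a u }.
print(u)
```

Because $\Lambda u$ does not depend on the action, the fixed point of this iteration satisfies exactly $\alpha u(x)=\max_a\{r(x,a)+\sum_y q(y\mid x,a)u(y)\}$, and the maximizing action in each state yields a discounted optimal stationary policy.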
4. On the Average Criterion

To prove the existence of average optimal policies, we give the following conditions.

Assumption D.
(1) $c_1<0$, with $c_1$ as in Assumption A.
(2) There exist functions $v_1,v_2\in B_{w_1}(S)$ and a state $x_0\in S$ satisfying
$$v_1(x)\le h_\alpha(x)\le v_2(x)\qquad\text{for all }x\in S\text{ and }\alpha>0,$$
where $h_\alpha(x):=J^*_\alpha(x)-J^*_\alpha(x_0)$.
To verify Assumption D, we provide sufficient conditions.

Proposition 4.1. Under Assumptions A and B, each one of the following conditions (a) and (b) implies Assumption D.
(a) There exist constants $R>0$ and $\rho>0$ such that
$$\sup_{|u|\le w_1}\Big|E^f_x[u(x(t))]-\int_S u(y)\,\mu_f(dy)\Big|\le R\,e^{-\rho t}\,w_1(x)$$
for all $t\ge 0$ and $f\in F$, where $\mu_f$ is the invariant distribution of the process under $f$.
(b) $S=[0,\infty)^d$ for some integer $d\ge 1$, and for each $f\in F$:
(b1) the function $w_1$ in Assumption A is nondecreasing in each component, and
$$\int_S w_1(y)\,q(dy\mid x,f(x))\le c_1 w_1(x)+b_1 I_{\{0^d\}}(x);$$
(b2) the process $x(t)$ is stochastically monotone.

We now state our main results about average optimality.

Theorem 4.2. Suppose that Assumptions A, B, C and D hold.
(a) There exist a constant $g^*$, functions $u_1,u_2\in B_{w_1}(S)$ and a stationary policy $f^*\in F$ satisfying the optimality inequalities
$$g^*\ \ge\ \max_{a\in A(x)}\Big\{r(x,a)+\int_S u_1(y)\,q(dy\mid x,a)\Big\},\qquad x\in S;$$
$$g^*\ \le\ \max_{a\in A(x)}\Big\{r(x,a)+\int_S u_2(y)\,q(dy\mid x,a)\Big\}=r(x,f^*(x))+\int_S u_2(y)\,q(dy\mid x,f^*(x)),\qquad x\in S.$$
(b) $g^*=\sup_{\pi\in\Pi}V(x,\pi)$ for all $x\in S$.
(c) The policy $f^*$ in (a) is average optimal.

Theorem 4.2 ensures the existence of an average optimal stationary policy.

For each $f\in F$, $x\in S$, $u\in B_{w_1}(S)$ and any constant $g$, let
$$\Delta(x;f,u,g):=r(x,f(x))+\int_S u(y)\,q(dy\mid x,f(x))-g,$$
and then define a stochastic process
$$M_t(f,u,g):=\int_0^t r(x(s),f(x(s)))\,ds+u(x(t))-tg,\qquad t\ge 0.$$

Theorem 4.3. Suppose that Assumptions A, B, C and D hold.
(a) Let $f^*$ be the average optimal policy obtained in Theorem 4.2, with $u_1$, $u_2$, $g^*$ as in Theorem 4.2. Then
(a1) $M_t(f^*,u_2,g^*)$ is a $P^{f^*}_x$-submartingale for all $x\in S$;
(a2) $M_t(f,u_1,g^*)$ is a $P^f_x$-supermartingale for all $f\in F$ and $x\in S$.
(b) Conversely, if there exist a policy $f^*\in F$, functions $u_1,u_2\in B_{w_1}(S)$ and a constant $g$ such that for each $f\in F$ and $x\in S$,
(b1) $M_t(f^*,u_2,g)$ is a $P^{f^*}_x$-submartingale for all $x\in S$, and
(b2) $M_t(f,u_1,g)$ is a $P^f_x$-supermartingale,
then $f^*$ is average optimal.

Theorem 4.3 gives a semi-martingale characterization of average optimal stationary policies.
5. Examples

5.1. A controlled generalized potlatch process.

Take $S:=[0,\infty)^d$ with $d\ge 1$, and let $(p_{ij})$ be a transition probability matrix on $\{1,2,\ldots,d\}$. The generalized potlatch process is generated by
$$Lu(x):=\sum_{i=1}^d\int_0^\infty\Big[u\Big(x-x_i e_i+y\sum_{j=1}^d p_{ij}x_i e_j\Big)-u(x)\Big]\,dF(y,\lambda).$$
Let
$$r(x,a):=\sum_{i=1}^d q_i\sum_{j=1}^d p_{ij}x_j-\lambda(x_1+\cdots+x_d),\qquad(4)$$
where $(q_1,\ldots,q_d)=:a$ will be interpreted as control actions. Thus, the transition rates $q(\cdot\mid x,a)$ are given by
$$q(D\mid x,a):=\sum_{i=1}^d\int_0^\infty I_{D\setminus\{x\}}\Big(x-x_i e_i+y\sum_{j=1}^d p_{ij}x_i e_j\Big)\lambda e^{-\lambda y}\,dy,$$
$$q(\{x\}\mid x,a):=-q(S\setminus\{x\}\mid x,a).$$

Conclusion: All of Assumptions A, B, C and D hold if $A(x)$ is compact for each $x\in S$ and $\lambda>1$.
5.2. A controlled generalized birth-death system.

For $i=0$ and each $a:=(a_1,a_2)\in A(0)$,
$$q(1\mid 0,a):=-q(0\mid 0,a):=h_2(0,a_2)>0,$$
and for $i\ge 1$ and all $a:=(a_1,a_2)\in A(i)$,
$$q(j\mid i,a):=\begin{cases}\mu i+h_1(i,a_1) & \text{if } j=i-1,\\[2pt] -(\mu+\lambda)i-h_1(i,a_1)-h_2(i,a_2) & \text{if } j=i,\\[2pt] \lambda i+h_2(i,a_2) & \text{if } j=i+1,\\[2pt] 0 & \text{otherwise};\end{cases}$$
$$r(i,a):=p\,i+r(i,a_2)-c(i,a_1).$$

Consider the following conditions:

E$_1$: (a) $\mu-\lambda>0$;
(b) either $\kappa:=\mu-\lambda+\bar h_2-\underline h_1\le 0$, or $\mu-\lambda>\bar h_2-\underline h_1$ when $\kappa>0$, where $\bar h_2:=\sup_{a_2\in A_2(i),\,i\ge 1}h_2(i,a_2)$ and $\underline h_1:=\inf_{a_1\in A_1(i),\,i\ge 1}h_1(i,a_1)$.

E$_2$: $h_1(i,\cdot)$, $h_2(i,\cdot)$, $c(i,\cdot)$ and $r(i,\cdot)$ are all continuous.

E$_3$: (a) $|c(i,a_1)|\le L_1(i+1)$ and $|r(i,a_2)|\le L_2(i+1)$;
(b) $\bar h_k:=\sup_{i\in S,\,a_k\in A_k(i)}h_k(i,a_k)<\infty$.

Proposition 5.1. Under E$_1$, E$_2$ and E$_3$, the birth-death system satisfies Assumptions A, B, C and D.
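The birth-death rates above are easy to experiment with numerically under a fixed stationary policy. A sketch that builds the generator on a truncated state space $\{0,\ldots,N\}$ and evaluates the long-run average reward via the stationary distribution; the particular choices of $h_1$, $h_2$, the cost/reward terms, the policy and the truncation level are all made-up for illustration, not taken from the talk:

```python
import numpy as np

# Truncated controlled birth-death chain under a fixed stationary policy.
N, mu, lam, p = 50, 3.0, 1.0, 2.0
h1 = lambda i, a1: a1          # hypothetical controlled extra service rate
h2 = lambda i, a2: a2          # hypothetical controlled extra arrival rate
f = lambda i: (1.0, 0.5)       # a fixed stationary policy f(i) = (a1, a2)

Q = np.zeros((N + 1, N + 1))
r = np.zeros(N + 1)
for i in range(N + 1):
    a1, a2 = f(i)
    up = lam * i + h2(i, a2) if i >= 1 else h2(0, a2)
    down = mu * i + h1(i, a1) if i >= 1 else 0.0
    if i < N:
        Q[i, i + 1] = up       # birth rate (dropped at the truncation level)
    if i > 0:
        Q[i, i - 1] = down     # death rate
    Q[i, i] = -Q[i].sum()      # diagonal entry makes the row sum to zero
    r[i] = p * i + a2 - a1     # reward rate p*i + r(i,a2) - c(i,a1), made-up

# Stationary distribution: solve pi Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(N + 1)])
b = np.append(np.zeros(N + 1), 1.0)
pi_stat, *_ = np.linalg.lstsq(A, b, rcond=None)
g = pi_stat @ r                # long-run average reward of the policy f
```

Since $\mu>\lambda$, the chain is strongly mean-reverting and the truncation at a moderate $N$ barely affects the stationary distribution, which is concentrated near $0$.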
Bibliography

[1] Altman, E., Constrained Markov Decision Processes, Chapman & Hall/CRC, 1999.
[2] Borkar, V.S., Topics in Controlled Markov Chains, Pitman Research Notes in Mathematics No. 240, Longman Scientific and Technical, Harlow.
[3] Derman, C., Finite State Markovian Decision Processes, Academic Press, New York.
[4] Dynkin, E.B. and Yushkevich, A.A., Controlled Markov Processes, Springer-Verlag, New York.
[5] Feinberg, E.A. and Shwartz, A., Handbook of Markov Decision Processes, Kluwer Academic Publishers, Boston/Dordrecht/London.
[6] Filar, J.A. and Vrieze, K., Competitive Markov Decision Processes, Springer-Verlag, New York.
[7] Hernandez-Lerma, O. and Lasserre, J.B., Further Topics on Discrete-Time Markov Control Processes, Springer-Verlag, New York.
[8] Hernandez-Lerma, O. and Lasserre, J.B., Discrete-Time Markov Control Processes: Basic Optimality Criteria, Springer-Verlag, New York.
[9] Hernandez-Lerma, O., Adaptive Markov Control Processes, Springer-Verlag, New York.
[10] Hinderer, K., Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, Lecture Notes in Operations Research, Springer-Verlag, New York.
[11] Hou, Z.T. and Guo, X.P., Markov Decision Processes, Science and Technology Press of Hunan. (In Chinese.)
[12] Hordijk, A., Dynamic Programming and Markov Potential Theory, Mathematical Centre Tract No. 51, Mathematisch Centrum, Amsterdam.
[13] Howard, R.A., Dynamic Programming and Markov Processes, MIT Press, Cambridge.
[14] Kallenberg, L.C.M., Linear Programming and Finite Markovian Control Problems, Mathematical Centre Tract 148, Mathematical Centre, Amsterdam.
[15] Piunovskiy, A.B., Optimal Control of Random Sequences in Problems with Constraints, Kluwer Academic Publishers, 1997.
[16] Puterman, M.L., Markov Decision Processes, Wiley, New York.
[17] Ross, S.M., Introduction to Stochastic Dynamic Programming, Academic Press, New York.
[18] Sennott, L.I., Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley, New York.
[19] Tijms, H.C. and Wessels, J., Markov Decision Theory, Mathematical Centre Tract 93, Mathematical Centre, Amsterdam.
[20] White, D.J., Markov Decision Processes, John Wiley & Sons, Chichester.

THANK YOU!!!
On Ergodic Impulse Control with Constraint Maurice Robin Based on joint papers with J.L. Menaldi University Paris-Sanclay 9119 Saint-Aubin, France (e-mail: maurice.robin@polytechnique.edu) IMA, Minneapolis,
More informationINEQUALITY FOR VARIANCES OF THE DISCOUNTED RE- WARDS
Applied Probability Trust (5 October 29) INEQUALITY FOR VARIANCES OF THE DISCOUNTED RE- WARDS EUGENE A. FEINBERG, Stony Brook University JUN FEI, Stony Brook University Abstract We consider the following
More informationLIMITS FOR QUEUES AS THE WAITING ROOM GROWS. Bell Communications Research AT&T Bell Laboratories Red Bank, NJ Murray Hill, NJ 07974
LIMITS FOR QUEUES AS THE WAITING ROOM GROWS by Daniel P. Heyman Ward Whitt Bell Communications Research AT&T Bell Laboratories Red Bank, NJ 07701 Murray Hill, NJ 07974 May 11, 1988 ABSTRACT We study the
More informationMarkov Decision Processes and Dynamic Programming
Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process
More informationLet (Ω, F) be a measureable space. A filtration in discrete time is a sequence of. F s F t
2.2 Filtrations Let (Ω, F) be a measureable space. A filtration in discrete time is a sequence of σ algebras {F t } such that F t F and F t F t+1 for all t = 0, 1,.... In continuous time, the second condition
More informationfor all f satisfying E[ f(x) ] <.
. Let (Ω, F, P ) be a probability space and D be a sub-σ-algebra of F. An (H, H)-valued random variable X is independent of D if and only if P ({X Γ} D) = P {X Γ}P (D) for all Γ H and D D. Prove that if
More informationLinear Programming Methods
Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution
More informationMAXIMAL COUPLING OF EUCLIDEAN BROWNIAN MOTIONS
MAXIMAL COUPLING OF EUCLIDEAN BOWNIAN MOTIONS ELTON P. HSU AND KAL-THEODO STUM ABSTACT. We prove that the mirror coupling is the unique maximal Markovian coupling of two Euclidean Brownian motions starting
More informationThe Optimal Stopping of Markov Chain and Recursive Solution of Poisson and Bellman Equations
The Optimal Stopping of Markov Chain and Recursive Solution of Poisson and Bellman Equations Isaac Sonin Dept. of Mathematics, Univ. of North Carolina at Charlotte, Charlotte, NC, 2822, USA imsonin@email.uncc.edu
More informationA relative entropy characterization of the growth rate of reward in risk-sensitive control
1 / 47 A relative entropy characterization of the growth rate of reward in risk-sensitive control Venkat Anantharam EECS Department, University of California, Berkeley (joint work with Vivek Borkar, IIT
More informationVerona Course April Lecture 1. Review of probability
Verona Course April 215. Lecture 1. Review of probability Viorel Barbu Al.I. Cuza University of Iaşi and the Romanian Academy A probability space is a triple (Ω, F, P) where Ω is an abstract set, F is
More informationOn the convergence rates of genetic algorithms
Theoretical Computer Science 229 (1999) 23 39 www.elsevier.com/locate/tcs On the convergence rates of genetic algorithms Jun He a;, Lishan Kang b a Department of Computer Science, Northern Jiaotong University,
More informationA concentration theorem for the equilibrium measure of Markov chains with nonnegative coarse Ricci curvature
A concentration theorem for the equilibrium measure of Markov chains with nonnegative coarse Ricci curvature arxiv:103.897v1 math.pr] 13 Mar 01 Laurent Veysseire Abstract In this article, we prove a concentration
More information1 Markov decision processes
2.997 Decision-Making in Large-Scale Systems February 4 MI, Spring 2004 Handout #1 Lecture Note 1 1 Markov decision processes In this class we will study discrete-time stochastic systems. We can describe
More informationSection Notes 9. Midterm 2 Review. Applied Math / Engineering Sciences 121. Week of December 3, 2018
Section Notes 9 Midterm 2 Review Applied Math / Engineering Sciences 121 Week of December 3, 2018 The following list of topics is an overview of the material that was covered in the lectures and sections
More informationProcedia Computer Science 00 (2011) 000 6
Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-
More information4.6 Example of non-uniqueness.
66 CHAPTER 4. STOCHASTIC DIFFERENTIAL EQUATIONS. 4.6 Example of non-uniqueness. If we try to construct a solution to the martingale problem in dimension coresponding to a(x) = x α with
More informationSUPPLEMENT TO CONTROLLED EQUILIBRIUM SELECTION IN STOCHASTICALLY PERTURBED DYNAMICS
SUPPLEMENT TO CONTROLLED EQUILIBRIUM SELECTION IN STOCHASTICALLY PERTURBED DYNAMICS By Ari Arapostathis, Anup Biswas, and Vivek S. Borkar The University of Texas at Austin, Indian Institute of Science
More informationA Barrier Version of the Russian Option
A Barrier Version of the Russian Option L. A. Shepp, A. N. Shiryaev, A. Sulem Rutgers University; shepp@stat.rutgers.edu Steklov Mathematical Institute; shiryaev@mi.ras.ru INRIA- Rocquencourt; agnes.sulem@inria.fr
More informationDynamic Control of a Tandem Queueing System with Abandonments
Dynamic Control of a Tandem Queueing System with Abandonments Gabriel Zayas-Cabán 1 Jungui Xie 2 Linda V. Green 3 Mark E. Lewis 1 1 Cornell University Ithaca, NY 2 University of Science and Technology
More information1 Stochastic Dynamic Programming
1 Stochastic Dynamic Programming Formally, a stochastic dynamic program has the same components as a deterministic one; the only modification is to the state transition equation. When events in the future
More informationPoisson Jumps in Credit Risk Modeling: a Partial Integro-differential Equation Formulation
Poisson Jumps in Credit Risk Modeling: a Partial Integro-differential Equation Formulation Jingyi Zhu Department of Mathematics University of Utah zhu@math.utah.edu Collaborator: Marco Avellaneda (Courant
More information