Infinite-Horizon Discounted Markov Decision Processes

1 Infinite-Horizon Discounted Markov Decision Processes

Dan Zhang
Leeds School of Business, University of Colorado at Boulder

Dan Zhang, Spring 2012, Infinite Horizon Discounted MDP

2 Outline

The expected total discounted reward
Policy evaluation
Optimality equations
Value iteration
Policy iteration
Linear programming

3 Expected Total Reward Criterion

Let $\pi = (d_1, d_2, \dots) \in \Pi^{HR}$.

Starting at a state $s$, using policy $\pi$ leads to a sequence of state-action pairs $\{(X_t, Y_t)\}$.

The sequence of rewards is given by $\{R_t \equiv r_t(X_t, Y_t) : t = 1, 2, \dots\}$.

Let $\lambda \in [0, 1)$ be the discount factor.

The expected total discounted reward from policy $\pi$ starting in state $s$ is given by

$$v_\lambda^\pi(s) \equiv \lim_{N \to \infty} E_s^\pi\left[\sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t)\right].$$

The limit above exists when $r(\cdot)$ is bounded; i.e., $\sup_{s \in S,\, a \in A_s} |r(s, a)| = M < \infty$.
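A short sketch of why boundedness suffices, using only the bound $M$ defined above and the geometric series (this step is not spelled out on the slide):

```latex
\left|\sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t)\right|
  \le \sum_{t=1}^{N} \lambda^{t-1} M
  = M\,\frac{1-\lambda^{N}}{1-\lambda}
  \le \frac{M}{1-\lambda},
\qquad
\left|\sum_{t=N+1}^{N'} \lambda^{t-1} r(X_t, Y_t)\right|
  \le \frac{M\lambda^{N}}{1-\lambda} \xrightarrow[N \to \infty]{} 0 .
```

The partial sums are therefore uniformly bounded and Cauchy along every sample path, so the limit exists and $|v_\lambda^\pi(s)| \le M/(1-\lambda)$ for every $s \in S$.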

4 Expected Total Reward Criterion

Under suitable conditions (such as the boundedness of $r(\cdot)$), we have

$$v_\lambda^\pi(s) \equiv \lim_{N \to \infty} E_s^\pi\left[\sum_{t=1}^{N} \lambda^{t-1} r(X_t, Y_t)\right] = E_s^\pi\left[\sum_{t=1}^{\infty} \lambda^{t-1} r(X_t, Y_t)\right].$$

Let

$$v^\pi(s) \equiv E_s^\pi\left[\sum_{t=1}^{\infty} r(X_t, Y_t)\right].$$

We have $v^\pi(s) = \lim_{\lambda \uparrow 1} v_\lambda^\pi(s)$ whenever $v^\pi(s)$ exists.

5 Optimality Criteria

A policy $\pi^*$ is discount optimal for $\lambda \in [0, 1)$ if

$$v_\lambda^{\pi^*}(s) \ge v_\lambda^{\pi}(s), \quad \forall s \in S,\ \forall \pi \in \Pi^{HR}.$$

The value of a discounted MDP is defined by

$$v_\lambda^*(s) \equiv \sup_{\pi \in \Pi^{HR}} v_\lambda^\pi(s), \quad \forall s \in S.$$

Let $\pi^*$ be a discount optimal policy. Then $v_\lambda^{\pi^*}(s) = v_\lambda^*(s)$ for all $s \in S$.

6 Vector Notation for Markov Decision Processes

Let $V$ denote the set of bounded real-valued functions on $S$ with componentwise partial order and norm $\|v\| \equiv \sup_{s \in S} |v(s)|$.

The corresponding matrix norm is given by

$$\|H\| \equiv \sup_{s \in S} \sum_{j \in S} |H(j \mid s)|,$$

where $H(j \mid s)$ denotes the $(s, j)$-th component of $H$.

Let $e \in V$ denote the function with all components equal to 1; that is, $e(s) = 1$ for all $s \in S$.

7 Vector Notation for Markov Decision Processes

For $d \in D^{MD}$, let $r_d(s) \equiv r(s, d(s))$ and $p_d(j \mid s) \equiv p(j \mid s, d(s))$.

Similarly, for $d \in D^{MR}$, let

$$r_d(s) \equiv \sum_{a \in A_s} q_{d(s)}(a)\, r(s, a), \qquad p_d(j \mid s) \equiv \sum_{a \in A_s} q_{d(s)}(a)\, p(j \mid s, a).$$

Let $r_d$ denote the $|S|$-vector with $s$-th component $r_d(s)$, and $P_d$ the $|S| \times |S|$ matrix with $(s, j)$-th entry $p_d(j \mid s)$.

We refer to $r_d$ as the reward vector and $P_d$ as the transition probability matrix.

8 Vector Notation for Markov Decision Processes

Let $\pi = (d_1, d_2, \dots) \in \Pi^{MR}$. The $(s, j)$ component of the $t$-step transition probability matrix $P_\pi^t$ satisfies

$$P_\pi^t(j \mid s) = [P_{d_1} \cdots P_{d_{t-1}} P_{d_t}](j \mid s) = P^\pi(X_{t+1} = j \mid X_1 = s),$$

with $P_\pi^0 \equiv I$.

For $v \in V$,

$$E_s^\pi[v(X_t)] = \sum_{j \in S} P_\pi^{t-1}(j \mid s)\, v(j).$$

We also have

$$v_\lambda^\pi = \sum_{t=1}^{\infty} \lambda^{t-1} P_\pi^{t-1} r_{d_t}.$$
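The series above can be evaluated numerically by truncating it. The following is a minimal sketch in plain Python, assuming a finite state space; the function names and the two-state matrix `P_d` and reward vector `r_d` are hypothetical illustrations, not taken from the slides.

```python
def mat_mul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def discounted_value(P_list, r_list, lam):
    """Truncated sum_{t=1}^{T} lam^(t-1) * P_pi^(t-1) * r_{d_t}.

    P_list[t-1] and r_list[t-1] hold P_{d_t} and r_{d_t} for a Markov policy.
    """
    n = len(r_list[0])
    # P_pi^0 is the identity matrix.
    P_pi = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    v = [0.0] * n
    for t, (P_d, r_d) in enumerate(zip(P_list, r_list)):
        term = mat_vec(P_pi, r_d)                  # P_pi^(t) r_{d_(t+1)}, t from 0
        v = [vi + lam ** t * ti for vi, ti in zip(v, term)]
        P_pi = mat_mul(P_pi, P_d)                  # extend the product by one step
    return v


# Hypothetical two-state example: the same decision rule used in every period.
P_d = [[0.5, 0.5], [0.0, 1.0]]
r_d = [5.0, -1.0]
v = discounted_value([P_d] * 400, [r_d] * 400, 0.95)
```

With bounded rewards the truncation error after $T$ terms is at most $M\lambda^{T}/(1-\lambda)$, so the number of terms can be chosen from a target accuracy.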

9 Assumptions

Stationary rewards and transition probabilities: $r(s, a)$ and $p(j \mid s, a)$ do not vary with time.

Bounded rewards: $|r(s, a)| \le M < \infty$.

Discounting: $\lambda \in [0, 1)$.

Discrete state space: $S$ is finite or countable.

10 Policy Evaluation

Theorem. Let $\pi = (d_1, d_2, \dots) \in \Pi^{HR}$. Then for each $s \in S$, there exists a policy $\pi' = (d_1', d_2', \dots) \in \Pi^{MR}$ satisfying

$$P^{\pi'}(X_t = j, Y_t = a \mid X_1 = s) = P^{\pi}(X_t = j, Y_t = a \mid X_1 = s), \quad \forall t.$$

$\Longrightarrow$ Suppose $\pi \in \Pi^{HR}$; then for each $s \in S$, there exists a policy $\pi' \in \Pi^{MR}$ such that $v_\lambda^{\pi'}(s) = v_\lambda^{\pi}(s)$.

$\Longrightarrow$ It suffices to consider $\Pi^{MR}$:

$$v_\lambda^*(s) = \sup_{\pi \in \Pi^{HR}} v_\lambda^\pi(s) = \sup_{\pi \in \Pi^{MR}} v_\lambda^\pi(s).$$

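11 Policy Evaluation

Let $\pi = (d_1, d_2, \dots) \in \Pi^{MR}$. Then

$$v_\lambda^\pi(s) = E_s^\pi\left[\sum_{t=1}^{\infty} \lambda^{t-1} r(X_t, Y_t)\right].$$

In vector notation, we have

$$\begin{aligned}
v_\lambda^\pi &= \sum_{t=1}^{\infty} \lambda^{t-1} P_\pi^{t-1} r_{d_t} \\
&= r_{d_1} + \lambda P_\pi^{1} r_{d_2} + \lambda^2 P_\pi^{2} r_{d_3} + \cdots \\
&= r_{d_1} + \lambda P_{d_1} r_{d_2} + \lambda^2 P_{d_1} P_{d_2} r_{d_3} + \cdots \\
&= r_{d_1} + \lambda P_{d_1} \left(r_{d_2} + \lambda P_{d_2} r_{d_3} + \cdots\right) \\
&= r_{d_1} + \lambda P_{d_1} v_\lambda^{\pi'},
\end{aligned}$$

where $\pi' = (d_2, d_3, \dots)$.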

12 Policy Evaluation

When $\pi$ is stationary, $\pi = (d, d, \dots) \equiv d^\infty$ and $\pi' = \pi$. It follows that $v_\lambda^{d^\infty}$ satisfies

$$v_\lambda^{d^\infty} = r_d + \lambda P_d v_\lambda^{d^\infty} \equiv L_d v_\lambda^{d^\infty},$$

where $L_d : V \to V$ is a linear transformation.

Theorem. Suppose $\lambda \in [0, 1)$. Then for any stationary policy $d^\infty$ with $d \in D^{MR}$, $v_\lambda^{d^\infty}$ is a solution in $V$ of

$$v = r_d + \lambda P_d v.$$

Furthermore, $v_\lambda^{d^\infty}$ may be written as

$$v_\lambda^{d^\infty} = (I - \lambda P_d)^{-1} r_d.$$
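For a finite state space, the theorem reduces policy evaluation to one linear solve. Below is a minimal sketch in plain Python that solves $(I - \lambda P_d) v = r_d$ by Gaussian elimination; the helper names and the two-state data are hypothetical, chosen only for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for k in range(col, n + 1):
                M[i][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def evaluate_policy(P_d, r_d, lam):
    """Return v solving (I - lam * P_d) v = r_d for a stationary decision rule."""
    n = len(r_d)
    A = [[(1.0 if i == j else 0.0) - lam * P_d[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, r_d)


# Hypothetical two-state example.
P_d = [[0.5, 0.5], [0.0, 1.0]]
r_d = [5.0, -1.0]
lam = 0.95
v = evaluate_policy(P_d, r_d, lam)

# Residual of the fixed-point equation v = r_d + lam * P_d v.
residual = [v[i] - r_d[i] - lam * sum(P_d[i][j] * v[j] for j in range(2))
            for i in range(2)]
```

Because every row of $P_d$ sums to 1 and $\lambda < 1$, the matrix $I - \lambda P_d$ is strictly diagonally dominant, so the elimination never meets a zero pivot.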

13 Optimality Equations

For any fixed $n$, the finite horizon optimality equation is given by

$$v_n(s) = \sup_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda p(j \mid s, a)\, v_{n+1}(j) \right\}.$$

Taking limits on both sides leads to

$$v(s) = \sup_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda p(j \mid s, a)\, v(j) \right\}.$$

The equations above for all $s \in S$ are the optimality equations.

For $v \in V$, let

$$\mathcal{L} v \equiv \sup_{d \in D^{MD}} [r_d + \lambda P_d v], \qquad L v \equiv \max_{d \in D^{MD}} [r_d + \lambda P_d v].$$

14 Optimality Equations

Proposition. For all $v \in V$ and $\lambda \in [0, 1)$,

$$\sup_{d \in D^{MD}} [r_d + \lambda P_d v] = \sup_{d \in D^{MR}} [r_d + \lambda P_d v].$$

Replacing $D^{MD}$ with $D$, the optimality equation can be written as

$$v = \mathcal{L} v.$$

In case the supremum above is attained for all $v \in V$,

$$v = L v.$$

15 Solutions of the Optimality Equations

Theorem. Suppose $v \in V$.

(i) If $v \ge \mathcal{L} v$, then $v \ge v_\lambda^*$;

(ii) If $v \le \mathcal{L} v$, then $v \le v_\lambda^*$;

(iii) If $v = \mathcal{L} v$, then $v$ is the only element of $V$ with this property and $v = v_\lambda^*$.

16 Solutions of the Optimality Equations

Let $U$ be a Banach space (complete normed linear space). Special case: the space of bounded measurable real-valued functions.

An operator $T : U \to U$ is a contraction mapping if there exists a $\lambda \in [0, 1)$ such that $\|Tv - Tu\| \le \lambda \|v - u\|$ for all $u$ and $v$ in $U$.

Theorem [Banach Fixed-Point Theorem]. Suppose $U$ is a Banach space and $T : U \to U$ is a contraction mapping. Then

(i) There exists a unique $v^*$ in $U$ such that $T v^* = v^*$;

(ii) For arbitrary $v^0 \in U$, the sequence $\{v^n\}$ defined by $v^{n+1} = T v^n = T^{n+1} v^0$ converges to $v^*$.

17 Solutions of the Optimality Equations

Proposition. Suppose $\lambda \in [0, 1)$. Then $L$ and $\mathcal{L}$ are contraction mappings on $V$.

Theorem. Suppose $\lambda \in [0, 1)$, $S$ is finite or countable, and $r(s, a)$ is bounded. The following results hold.

(i) There exists a $v^* \in V$ satisfying $\mathcal{L} v^* = v^*$ ($L v^* = v^*$). Furthermore, $v^*$ is the only element of $V$ with this property and equals $v_\lambda^*$;

(ii) For each $d \in D^{MR}$, there exists a unique $v \in V$ satisfying $L_d v = v$. Furthermore, $v = v_\lambda^{d^\infty}$.
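The contraction property can be checked numerically on a small example. This is a sketch only: the two-state MDP below (arrays `P`, `R`, and the discount `LAM`) is a hypothetical instance, and a random search illustrates the inequality $\|Lu - Lv\| \le \lambda \|u - v\|$ rather than proving it.

```python
import random

# Hypothetical MDP: P[s][a][j] = p(j | s, a), R[s][a] = r(s, a).
# State 0 has two actions, state 1 has one.
P = [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0]]]
R = [[5.0, 10.0], [-1.0]]
LAM = 0.95


def bellman(v):
    """Apply (Lv)(s) = max_a { r(s,a) + LAM * sum_j p(j|s,a) v(j) }."""
    return [max(R[s][a] + LAM * sum(p * vj for p, vj in zip(P[s][a], v))
                for a in range(len(R[s])))
            for s in range(len(R))]


def sup_norm(v):
    """Supremum norm of a finite vector."""
    return max(abs(x) for x in v)


rng = random.Random(0)          # fixed seed so the check is repeatable
worst_ratio = 0.0
for _ in range(1000):
    u = [rng.uniform(-50.0, 50.0) for _ in range(2)]
    v = [rng.uniform(-50.0, 50.0) for _ in range(2)]
    num = sup_norm([a - b for a, b in zip(bellman(u), bellman(v))])
    den = sup_norm([a - b for a, b in zip(u, v)])
    if den > 0.0:
        worst_ratio = max(worst_ratio, num / den)
```

After the loop, `worst_ratio` should not exceed `LAM` beyond floating-point rounding, which is the modulus the proposition asserts.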

18 Existence of Stationary Optimal Policies

A decision rule $d^*$ is conserving if

$$d^* \in \operatorname*{argmax}_{d \in D} \{ r_d + \lambda P_d v_\lambda^* \}.$$

Theorem. Suppose there exists a conserving decision rule or an optimal policy. Then there exists a deterministic stationary policy which is optimal.

19 Value Iteration

1. Select $v^0 \in V$, specify $\epsilon > 0$, and set $n = 0$.

2. For each $s \in S$, compute $v^{n+1}(s)$ by

$$v^{n+1}(s) = \max_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda p(j \mid s, a)\, v^n(j) \right\}.$$

3. If

$$\|v^{n+1} - v^n\| < \frac{\epsilon (1 - \lambda)}{2 \lambda},$$

go to step 4. Otherwise, increment $n$ by 1 and return to step 2.

4. For each $s \in S$, choose

$$d_\epsilon(s) \in \operatorname*{argmax}_{a \in A_s} \left\{ r(s, a) + \sum_{j \in S} \lambda p(j \mid s, a)\, v^{n+1}(j) \right\}$$

and stop.
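The four steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions: a finite MDP stored as nested lists, $0 < \lambda < 1$ (the stopping rule divides by $\lambda$), and a hypothetical two-state instance; the function and variable names are not from the slides.

```python
def value_iteration(P, R, lam, eps=1e-6):
    """Value iteration for a finite MDP.

    P[s][a][j] = p(j | s, a), R[s][a] = r(s, a), 0 < lam < 1.
    Returns the last iterate v^{n+1} and a greedy decision rule.
    """
    n_states = len(R)

    def q(s, a, v):
        # r(s, a) + lam * sum_j p(j | s, a) v(j)
        return R[s][a] + lam * sum(p * vj for p, vj in zip(P[s][a], v))

    v = [0.0] * n_states                                   # step 1: v^0 = 0
    while True:
        v_next = [max(q(s, a, v) for a in range(len(R[s])))
                  for s in range(n_states)]                # step 2
        gap = max(abs(x - y) for x, y in zip(v_next, v))   # sup-norm difference
        v = v_next
        if gap < eps * (1.0 - lam) / (2.0 * lam):          # step 3
            break
    policy = [max(range(len(R[s])), key=lambda a: q(s, a, v))
              for s in range(n_states)]                    # step 4
    return v, policy


# Hypothetical two-state MDP: state 0 has two actions, state 1 has one.
P = [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0]]]
R = [[5.0, 10.0], [-1.0]]
v_eps, d_eps = value_iteration(P, R, 0.95)
```

Starting from $v^0 = 0$ is one admissible choice; any bounded starting vector works, since the iterates converge by the contraction argument.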

20 Policy Iteration

1. Set $n = 0$ and select an arbitrary decision rule $d_0 \in D$.

2. (Policy evaluation) Obtain $v^n$ by solving

$$(I - \lambda P_{d_n}) v = r_{d_n}.$$

3. (Policy improvement) Choose $d_{n+1}$ to satisfy

$$d_{n+1} \in \operatorname*{argmax}_{d \in D} [r_d + \lambda P_d v^n],$$

setting $d_{n+1} = d_n$ if possible.

4. If $d_{n+1} = d_n$, stop and set $d^* = d_n$. Otherwise, increment $n$ by 1 and return to step 2.
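A minimal sketch of these steps in plain Python, assuming a finite MDP and deterministic decision rules stored as a list of action indices. The helper `solve_linear`, the tie tolerance `1e-12`, and the two-state data are illustrative assumptions, not part of the slides.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for k in range(col, n + 1):
                M[i][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def policy_iteration(P, R, lam, d0):
    """Policy iteration; P[s][a][j] = p(j|s,a), R[s][a] = r(s,a), d0 = initial rule."""
    n = len(R)
    d = list(d0)                                            # step 1
    while True:
        # Step 2 (policy evaluation): solve (I - lam * P_d) v = r_d.
        A = [[(1.0 if s == j else 0.0) - lam * P[s][d[s]][j] for j in range(n)]
             for s in range(n)]
        b = [R[s][d[s]] for s in range(n)]
        v = solve_linear(A, b)
        # Step 3 (policy improvement): greedy rule, keeping d(s) when it ties.
        d_next = []
        for s in range(n):
            q = [R[s][a] + lam * sum(p * vj for p, vj in zip(P[s][a], v))
                 for a in range(len(R[s]))]
            best = max(q)
            d_next.append(d[s] if q[d[s]] >= best - 1e-12 else q.index(best))
        if d_next == d:                                     # step 4
            return v, d
        d = d_next


# Hypothetical two-state MDP, started from the rule that picks action 1 in state 0.
P = [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0]]]
R = [[5.0, 10.0], [-1.0]]
v_star, d_star = policy_iteration(P, R, 0.95, [1, 0])
```

Keeping the current action on ties implements "setting $d_{n+1} = d_n$ if possible", which is what guarantees the stopping test in step 4 is eventually met.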

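21 Linear Programming

Let $\alpha(s)$ be positive scalars such that $\sum_{s \in S} \alpha(s) = 1$.

The primal linear program is given by

$$\begin{aligned}
\min_{v} \quad & \sum_{j \in S} \alpha(j)\, v(j) \\
\text{s.t.} \quad & v(s) - \sum_{j \in S} \lambda p(j \mid s, a)\, v(j) \ge r(s, a), \quad \forall s \in S,\ a \in A_s.
\end{aligned}$$

The dual linear program is given by

$$\begin{aligned}
\max_{x} \quad & \sum_{s \in S} \sum_{a \in A_s} r(s, a)\, x(s, a) \\
\text{s.t.} \quad & \sum_{a \in A_j} x(j, a) - \sum_{s \in S} \sum_{a \in A_s} \lambda p(j \mid s, a)\, x(s, a) = \alpha(j), \quad \forall j \in S, \\
& x(s, a) \ge 0, \quad \forall s \in S,\ a \in A_s.
\end{aligned}$$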
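The sketch below does not solve either program. Instead, for a hypothetical two-state MDP, it checks that a hand-computed value vector is primal feasible, that the discounted state-action frequencies of the rule `D` are dual feasible, and that the two objective values agree. All data and names (`P`, `R`, `ALPHA`, `V_STAR`, `D`) are illustrative assumptions.

```python
# Hypothetical MDP: P[s][a][j] = p(j | s, a), R[s][a] = r(s, a).
P = [[[0.5, 0.5], [0.0, 1.0]], [[0.0, 1.0]]]
R = [[5.0, 10.0], [-1.0]]
LAM = 0.95
ALPHA = [0.5, 0.5]
D = [0, 0]                        # deterministic rule: action 0 in both states
V_STAR = [-60.0 / 7.0, -20.0]     # value of rule D, computed by hand


def primal_slack(v):
    """v(s) - LAM * sum_j p(j|s,a) v(j) - r(s,a) for each (s, a); feasible if all >= 0."""
    return {(s, a): v[s] - LAM * sum(p * vj for p, vj in zip(P[s][a], v)) - R[s][a]
            for s in range(len(R)) for a in range(len(R[s]))}


def occupancy(d, n_terms=2000):
    """x(s, d(s)) = sum_t LAM^(t-1) P(X_t = s) with X_1 ~ ALPHA (truncated series)."""
    n = len(R)
    dist = ALPHA[:]               # distribution of X_t, starting at t = 1
    x = [0.0] * n
    for t in range(n_terms):
        x = [xs + LAM ** t * ps for xs, ps in zip(x, dist)]
        dist = [sum(dist[s] * P[s][d[s]][j] for s in range(n)) for j in range(n)]
    return {(s, a): (x[s] if a == d[s] else 0.0)
            for s in range(n) for a in range(len(R[s]))}


def dual_residual(x):
    """sum_a x(j,a) - sum_{s,a} LAM p(j|s,a) x(s,a) - alpha(j) for each j; feasible if all 0."""
    return [sum(x[j, a] for a in range(len(R[j])))
            - sum(LAM * P[s][a][j] * x[s, a] for (s, a) in x)
            - ALPHA[j]
            for j in range(len(R))]


x = occupancy(D)
slack = primal_slack(V_STAR)
resid = dual_residual(x)
primal_obj = sum(al * vs for al, vs in zip(ALPHA, V_STAR))
dual_obj = sum(R[s][a] * x[s, a] for (s, a) in x)
```

A feasible primal-dual pair with equal objective values is optimal for both programs, so in this instance the check certifies that `D` is an optimal rule. In practice one would hand the primal or dual to an LP solver rather than verify a guess.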
