Infinite-Horizon Average Reward Markov Decision Processes

1 Infinite-Horizon Average Reward Markov Decision Processes
Dan Zhang, Leeds School of Business, University of Colorado at Boulder
Spring 2012

2 Outline
- The average reward
- Classification of MDPs
- Optimality equations
- Value iteration in unichain models
- Policy iteration in unichain models
- Linear programming in unichain models

3 Average Reward Criterion
Let $\pi = (d_1, d_2, \ldots) \in \Pi^{HR}$. Starting at a state $s$, using policy $\pi$ leads to a sequence of state-action pairs $\{(X_t, Y_t)\}$. The sequence of rewards is given by $\{R_t \equiv r_t(X_t, Y_t) : t = 1, 2, \ldots\}$.
The average reward (or gain) from policy $\pi \in \Pi^{HR}$ starting in state $s$ is given by
\[ g^{\pi}(s) \equiv \lim_{N \to \infty} \frac{1}{N} E_s^{\pi}\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right]. \]
The limit above may not exist, in which case we define
\[ g_-^{\pi}(s) \equiv \liminf_{N \to \infty} \frac{1}{N} E_s^{\pi}\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right], \qquad g_+^{\pi}(s) \equiv \limsup_{N \to \infty} \frac{1}{N} E_s^{\pi}\left[ \sum_{t=1}^{N} r(X_t, Y_t) \right]. \]
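
The definition of the gain can be checked numerically for a fixed stationary decision rule. The sketch below is only an illustration: the two-state chain `P`, the reward vector `r`, and the function name `finite_horizon_average` are hypothetical and not part of the slides. For this chain the stationary distribution is (1/3, 2/3), so the finite-horizon averages should approach 2 from either starting state as N grows.

```python
def finite_horizon_average(P, r, start, N):
    """Return (1/N) * E_start[ sum_{t=1}^{N} r(X_t) ] for a fixed stationary rule."""
    n = len(r)
    dist = [0.0] * n            # distribution of X_t, with X_1 = start
    dist[start] = 1.0
    total = 0.0
    for _ in range(N):
        # expected reward collected in the current period
        total += sum(dist[i] * r[i] for i in range(n))
        # advance the state distribution by one period
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total / N


# Hypothetical example: P[i][j] and r[i] under one fixed decision rule.
P = [[0.0, 1.0],
     [0.5, 0.5]]
r = [10.0, -2.0]

for s in range(2):
    for N in (10, 100, 10_000):
        print(s, N, finite_horizon_average(P, r, s, N))
```

The starting state only affects the transient part of the sum, which is divided by N and therefore vanishes in the limit.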

4 Optimality Criteria
When $g^{\pi}(s)$ exists for all $s \in S$ and $\pi \in \Pi^{HR}$, a policy $\pi^*$ is average optimal if
\[ g^{\pi^*}(s) \ge g^{\pi}(s), \quad \forall s \in S, \ \pi \in \Pi^{HR}. \]
The value (or optimal gain) is defined by
\[ g^*(s) \equiv \sup_{\pi \in \Pi^{HR}} g^{\pi}(s), \quad s \in S. \]
Let $\pi^*$ be an average optimal policy; then $g^{\pi^*}(s) = g^*(s)$ for all $s \in S$.

5 Markov Policies
Theorem. Suppose $\pi \in \Pi^{HR}$. For each $s \in S$, there exists a $\pi' \in \Pi^{MR}$ (which possibly varies with $s$) for which
\[ g_+^{\pi'}(s) = g_+^{\pi}(s), \qquad g_-^{\pi'}(s) = g_-^{\pi}(s), \]
and $g^{\pi'}(s) = g^{\pi}(s)$ whenever $g_+^{\pi}(s) = g_-^{\pi}(s)$ and $g_+^{\pi'}(s) = g_-^{\pi'}(s)$.

6 Assumptions
- Stationary rewards and transition probabilities: $r(s, a)$ and $p(j \mid s, a)$ do not vary with time.
- Bounded rewards: $|r(s, a)| \le M < \infty$.
- Finite state spaces.
- Unichain: the transition matrix corresponding to every deterministic stationary policy is unichain (i.e., it consists of a single recurrent class plus a possibly empty set of transient states).
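
For a finite transition matrix, the unichain property can be tested by counting recurrent classes. The following is a minimal sketch, assuming a matrix given as a list of rows; the helper names `recurrent_classes` and `is_unichain` are hypothetical. A state is treated as recurrent when every state reachable from it can reach it back, which is valid for finite chains. Checking the model assumption itself would require repeating this test for the matrix of every deterministic stationary policy.

```python
def recurrent_classes(P):
    """Return the set of recurrent classes of a finite transition matrix P."""
    n = len(P)
    # reach[i][j] is True when j can be reached from i in zero or more steps
    reach = [[P[i][j] > 0 or i == j for j in range(n)] for i in range(n)]
    for k in range(n):                      # transitive closure (Warshall)
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    classes = set()
    for i in range(n):
        reachable = [j for j in range(n) if reach[i][j]]
        # i is recurrent iff every state it can reach leads back to i
        if all(reach[j][i] for j in reachable):
            classes.add(frozenset(reachable))
    return classes


def is_unichain(P):
    """True when P has exactly one recurrent class (transient states allowed)."""
    return len(recurrent_classes(P)) == 1


print(is_unichain([[0.0, 1.0], [0.5, 0.5]]))   # one recurrent class {0, 1}
print(is_unichain([[1.0, 0.0], [0.5, 0.5]]))   # state 1 is transient
print(is_unichain([[1.0, 0.0], [0.0, 1.0]]))   # two absorbing states
```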

7 The Average Reward Optimality Equation: Unichain Models
For unichain models, it can be shown that the gain of every stationary policy is constant across states, so the optimal gain is a scalar $g$.
Optimality equations: for each $s \in S$,
\[ 0 = \max_{a \in A_s} \Big\{ r(s, a) - g + \sum_{j \in S} p(j \mid s, a) h(j) - h(s) \Big\}. \]
In matrix notation:
\[ 0 = \max_{d \in D} \{ r_d - g e + (P_d - I) h \} \equiv B(g, h). \]
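
To make the operator $B(g, h)$ concrete, the sketch below evaluates it state by state on a small hypothetical model. The data layout (`r[s][a]` for rewards, `p[s][a][j]` for transition probabilities), the numbers, and the function name `optimality_residual` are assumptions made for illustration. For this model the pair g = 2, h = (8, 0) gives a zero residual in both states, while raising g to 3 pushes every component below zero.

```python
def optimality_residual(r, p, g, h):
    """Return B(g, h) as a list: one maximised residual per state."""
    n = len(r)
    return [
        max(
            r[s][a] - g + sum(p[s][a][j] * h[j] for j in range(n)) - h[s]
            for a in range(len(r[s]))
        )
        for s in range(n)
    ]


# Hypothetical unichain model: state 0 has two actions, state 1 has one.
r = [[5.0, 10.0], [-2.0]]
p = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.5, 0.5]]]

print(optimality_residual(r, p, 2.0, [8.0, 0.0]))   # candidate solution of B(g, h) = 0
print(optimality_residual(r, p, 3.0, [8.0, 0.0]))   # g too large: B(g, h) < 0
```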

8 The Average Reward Optimality Equation: Unichain Models
Theorem. Suppose $S$ is countable.
(i) If there exists a scalar $g$ and an $h \in V$ which satisfy $B(g, h) \le 0$, then $g e \ge g_+^*$;
(ii) If there exists a scalar $g$ and an $h \in V$ which satisfy $B(g, h) \ge 0$, then $g e \le \sup_{d \in D^{MD}} g_-^{d^\infty} \le g_-^*$;
(iii) If there exists a scalar $g$ and an $h \in V$ which satisfy $B(g, h) = 0$, then $g e = g^* = g_+^* = g_-^*$.

9 Existence of Solutions to the Optimality Equation: Unichain Models
Theorem. Suppose $S$ and $A_s$ are finite, $|r(s, a)| \le M < \infty$ for all $s, a$, and the model is unichain.
(i) There exists a $g \in R^1$ and an $h \in V$ for which
\[ 0 = \max_{d \in D} \{ r_d - g e + (P_d - I) h \}; \]
(ii) If $(g', h')$ is any other solution of the average reward optimality equation, then $g = g'$.

10 Existence of Optimal Policies: Unichain Models
A decision rule $d_h$ is $h$-improving if
\[ d_h \in \arg\max_{d \in D} \{ r_d + P_d h \}. \]
Theorem. Suppose there exists a scalar $g^*$ and an $h^* \in V$ for which $B(g^*, h^*) = 0$. Then if $d^*$ is $h^*$-improving, $(d^*)^\infty$ is average optimal.

11 Existence of Optimal Policies: Unichain Models
Theorem. Suppose $S$ and $A_s$ are finite, $r(s, a)$ is bounded, and the model is unichain. Then
(i) there exists a stationary average optimal policy;
(ii) there exists a scalar $g^*$ and an $h^* \in V$ for which $B(g^*, h^*) = 0$;
(iii) any stationary policy derived from an $h^*$-improving decision rule is average optimal;
(iv) $g^* e = g_+^* = g_-^*$.

12 Value Iteration
1. Select $v^0 \in V$, specify $\epsilon > 0$, and set $n = 0$.
2. For each $s \in S$, compute $v^{n+1}(s)$ by
\[ v^{n+1}(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) v^n(j) \Big\}. \]
3. If $sp(v^{n+1} - v^n) < \epsilon$, go to step 4. Otherwise, increment $n$ by 1 and return to step 2.
4. For each $s \in S$, choose
\[ d_\epsilon(s) \in \arg\max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) v^{n+1}(j) \Big\} \]
and stop.
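
The four steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the model data, the layout `r[s][a]` / `p[s][a][j]`, and the names `span`, `q_values` and `value_iteration` are hypothetical. Here `span(v)` is taken to be max(v) - min(v), and the midpoint of the largest and smallest components of $v^{n+1} - v^n$ is reported as a gain estimate, which is a standard addition rather than something stated on the slide. The example model is aperiodic under every decision rule, which the span-based stopping test relies on.

```python
def span(v):
    """Span seminorm: largest component minus smallest component."""
    return max(v) - min(v)


def q_values(r, p, v):
    """q[s][a] = r(s, a) + sum_j p(j | s, a) * v(j)."""
    n = len(r)
    return [[r[s][a] + sum(p[s][a][j] * v[j] for j in range(n))
             for a in range(len(r[s]))] for s in range(n)]


def value_iteration(r, p, eps=1e-8, max_iter=100_000):
    v = [0.0] * len(r)                                   # step 1: v^0 = 0
    for _ in range(max_iter):
        v_next = [max(row) for row in q_values(r, p, v)]  # step 2
        diff = [a - b for a, b in zip(v_next, v)]
        if span(diff) < eps:                              # step 3
            # step 4: greedy decision rule with respect to v^{n+1}
            d = [max(range(len(row)), key=row.__getitem__)
                 for row in q_values(r, p, v_next)]
            gain = 0.5 * (max(diff) + min(diff))          # assumed gain estimate
            return d, gain, v_next
        v = v_next
    raise RuntimeError("span criterion not met within max_iter iterations")


# Hypothetical unichain model: state 0 has two actions, state 1 has one.
r = [[5.0, 10.0], [-2.0]]
p = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.5, 0.5]]]

d, gain, v = value_iteration(r, p)
print(d, gain)
```

Note that the iterates themselves grow by roughly the gain at every step; only their differences settle down, which is why the stopping rule uses the span rather than a norm.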

13 Relative Value Iteration
1. Select $u^0 \in V$, choose $s^* \in S$, specify $\epsilon > 0$, set $w^0 = u^0 - u^0(s^*) e$, and set $n = 0$.
2. For each $s \in S$, compute $u^{n+1}(s)$ by
\[ u^{n+1}(s) = \max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) w^n(j) \Big\}. \]
Let $w^{n+1} = u^{n+1} - u^{n+1}(s^*) e$.
3. If $sp(u^{n+1} - u^n) < \epsilon$, go to step 4. Otherwise, increment $n$ by 1 and return to step 2.
4. For each $s \in S$, choose
\[ d_\epsilon(s) \in \arg\max_{a \in A_s} \Big\{ r(s, a) + \sum_{j \in S} p(j \mid s, a) u^n(j) \Big\} \]
and stop.
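
A minimal sketch of the relative scheme, under the same hypothetical data layout (`r[s][a]`, `p[s][a][j]`); the names `relative_value_iteration` and `s_ref` (playing the role of $s^*$) are assumptions. Subtracting the reference component at every step keeps the iterates bounded: at convergence the reference component of $u^{n+1}$ approximates the gain and $w^{n+1}$ approximates a bias vector normalised to zero at the reference state. For simplicity the greedy rule is taken with respect to the latest normalised iterate, which differs from the slide's $u^n$ by a constant shift and one iteration.

```python
def span(v):
    return max(v) - min(v)


def q_values(r, p, v):
    """q[s][a] = r(s, a) + sum_j p(j | s, a) * v(j)."""
    n = len(r)
    return [[r[s][a] + sum(p[s][a][j] * v[j] for j in range(n))
             for a in range(len(r[s]))] for s in range(n)]


def relative_value_iteration(r, p, s_ref=0, eps=1e-8, max_iter=100_000):
    u = [0.0] * len(r)                      # step 1: u^0 = 0
    w = [x - u[s_ref] for x in u]           # w^0 = u^0 - u^0(s*) e
    for _ in range(max_iter):
        u_next = [max(row) for row in q_values(r, p, w)]   # step 2
        w_next = [x - u_next[s_ref] for x in u_next]
        if span([a - b for a, b in zip(u_next, u)]) < eps:  # step 3
            # step 4: greedy rule (a constant shift does not change the argmax)
            d = [max(range(len(row)), key=row.__getitem__)
                 for row in q_values(r, p, w_next)]
            return d, u_next[s_ref], w_next
        u, w = u_next, w_next
    raise RuntimeError("span criterion not met within max_iter iterations")


# Hypothetical unichain model: state 0 has two actions, state 1 has one.
r = [[5.0, 10.0], [-2.0]]
p = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.5, 0.5]]]

d, gain, w = relative_value_iteration(r, p)
print(d, gain, w)
```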


15 Policy Iteration
1. Set $n = 0$ and select an arbitrary decision rule $d_0 \in D$.
2. (Policy evaluation) Obtain a scalar $g_n$ and an $h_n \in V$ by solving
\[ 0 = r_{d_n} - g e + (P_{d_n} - I) h. \]
3. (Policy improvement) Choose $d_{n+1}$ to satisfy
\[ d_{n+1} \in \arg\max_{d \in D} [ r_d + P_d h_n ], \]
setting $d_{n+1} = d_n$ if possible.
4. If $d_{n+1} = d_n$, stop and set $d^* = d_n$. Otherwise, increment $n$ by 1 and return to step 2.
Practical consideration: set $h_n(s_0) = 0$ for some fixed $s_0 \in S$.
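
The sketch below illustrates the policy iteration steps together with the practical consideration of fixing $h(s_0) = 0$. It is an illustration under assumptions: the data layout (`r[s][a]`, `p[s][a][j]`), the example numbers, and the helper names `solve`, `evaluate`, `improve` and `policy_iteration` are hypothetical. Because $h(s_0)$ is fixed at zero, its column in $I - P_d$ is reused to carry the unknown $g$, so the evaluation step becomes a square linear system, solved here by a small Gauss-Jordan routine.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(n):
            if i != c:
                f = M[i][c] / M[c][c]
                M[i] = [x - f * y for x, y in zip(M[i], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(r, p, d, s0=0):
    """Solve 0 = r_d - g e + (P_d - I) h with the normalisation h(s0) = 0."""
    n = len(r)
    A = [[(1.0 if i == j else 0.0) - p[i][d[i]][j] for j in range(n)]
         for i in range(n)]
    for i in range(n):
        A[i][s0] = 1.0            # column s0 now multiplies g instead of h(s0)
    x = solve(A, [r[i][d[i]] for i in range(n)])
    g, h = x[s0], x[:]
    h[s0] = 0.0
    return g, h


def improve(r, p, d, h, tol=1e-12):
    """Greedy improvement, keeping the current action whenever it is maximal."""
    n, new_d = len(r), []
    for s in range(n):
        q = [r[s][a] + sum(p[s][a][j] * h[j] for j in range(n))
             for a in range(len(r[s]))]
        new_d.append(d[s] if q[d[s]] >= max(q) - tol else q.index(max(q)))
    return new_d


def policy_iteration(r, p, d0, s0=0):
    d = list(d0)
    while True:
        g, h = evaluate(r, p, d, s0)
        d_next = improve(r, p, d, h)
        if d_next == d:
            return d, g, h
        d = d_next


# Hypothetical unichain model: state 0 has two actions, state 1 has one.
r = [[5.0, 10.0], [-2.0]]
p = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.5, 0.5]]]

d_star, g_star, h_star = policy_iteration(r, p, [0, 0])
print(d_star, g_star, h_star)
```

Keeping the current action on ties mirrors the "setting $d_{n+1} = d_n$ if possible" rule, which prevents the algorithm from cycling between equally good decision rules.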

16 Linear Programming
Primal linear program is given by
\[ \min_{g, h} \ g \]
subject to
\[ g + h(s) - \sum_{j \in S} p(j \mid s, a) h(j) \ge r(s, a), \quad s \in S, \ a \in A_s. \]
Dual linear program is given by
\[ \max_{x} \ \sum_{s \in S} \sum_{a \in A_s} r(s, a) x(s, a) \]
subject to
\[ \sum_{a \in A_j} x(j, a) - \sum_{s \in S} \sum_{a \in A_s} p(j \mid s, a) x(s, a) = 0, \quad j \in S, \]
\[ \sum_{s \in S} \sum_{a \in A_s} x(s, a) = 1, \]
\[ x(s, a) \ge 0, \quad s \in S, \ a \in A_s. \]
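
The relationship between the two programs can be illustrated without an LP solver by checking a candidate primal solution and a candidate dual solution against their constraints. The sketch below assumes the undiscounted balance constraint for the dual and uses a hypothetical model, hypothetical candidate values, and hypothetical helper names (`primal_feasible`, `dual_feasible`). The dual candidate places the stationary distribution (1/3, 2/3) of one decision rule on the actions that rule selects. Both candidates turn out to be feasible with equal objective values, so by weak duality each is optimal for its program.

```python
def primal_feasible(r, p, g, h, tol=1e-9):
    """Check g + h(s) - sum_j p(j|s,a) h(j) >= r(s,a) for every (s, a)."""
    n = len(r)
    return all(
        g + h[s] - sum(p[s][a][j] * h[j] for j in range(n)) >= r[s][a] - tol
        for s in range(n) for a in range(len(r[s]))
    )


def dual_feasible(p, x, tol=1e-9):
    """Check balance, normalisation and non-negativity of x(s, a)."""
    n = len(p)
    balance = all(
        abs(sum(x[j, a] for a in range(len(p[j])))
            - sum(p[s][a][j] * x[s, a] for (s, a) in x)) < tol
        for j in range(n)
    )
    normalised = abs(sum(x.values()) - 1.0) < tol
    nonnegative = all(v >= -tol for v in x.values())
    return balance and normalised and nonnegative


# Hypothetical unichain model: state 0 has two actions, state 1 has one.
r = [[5.0, 10.0], [-2.0]]
p = [[[0.5, 0.5], [0.0, 1.0]],
     [[0.5, 0.5]]]

g, h = 2.0, [8.0, 0.0]                          # candidate primal solution
x = {(0, 0): 0.0, (0, 1): 1 / 3, (1, 0): 2 / 3}  # candidate dual solution

dual_objective = sum(r[s][a] * x[s, a] for (s, a) in x)
print(primal_feasible(r, p, g, h), dual_feasible(p, x), g, dual_objective)
```

In this reading, $x(s, a)$ is the long-run fraction of time the process spends in state $s$ choosing action $a$, and the dual objective is the resulting average reward.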
