AM 121: Intro to Optimization Models and Methods, Fall 2018
Lecture 18: Markov Decision Processes
Yiling Chen

Lesson Plan
- Markov decision processes
- Policies and value functions
- Solving: average reward, discounted reward LPs; value and policy iteration
- Example applications: airline meals, assisted living, automated vehicles / helicopters, Super Mario.
- Reading: Hillier and Lieberman, Introduction to Operations Research, McGraw Hill, p903-99
Markov decision process (MDP)
An MDP model contains:
- A set of possible world states $S$ ($m$ states)
- A set of possible actions $A$ ($n$ actions)
- A reward function $R(s,a)$
- A transition function $P(s,a,s') \in [0,1]$: the probability of transitioning to state $s'$ given $(s,a)$
- Number of time periods: finite or infinite horizon.
Problem: which action to take in each state?
Markov property: the effect of an action depends only on the current state, not on prior history.

Example: Maintenance
- Four machine states {0, 1, 2, 3}
- Actions: do nothing, overhaul, replace; transitions depend on the action
- Each action in each state is associated with a cost (negated reward), as follows (Hillier and Lieberman):

State  Action    Defective  Maintenance  Lost production  Possible next states
0      Nothing   0          0            0                To 1, 2, 3
1      Nothing   1000       0            0                To 1, 2, 3
1      Replace   0          4000         2000             To 0
2      Nothing   3000       0            0                To 2, 3
2      Overhaul  0          2000         2000             To 1
2      Replace   0          4000         2000             To 0
3      Replace   0          4000         2000             To 0
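As a rough illustration, the maintenance example can be written down as plain Python dictionaries. This is only a sketch: the names `COST`, `P`, and `available_actions` are made up for this note, actions are numbered 1 = do nothing, 2 = overhaul, 3 = replace (the numbering used in the lecture's LP), each cost is the sum of the three cost columns in the table, and the transition probabilities are the ones appearing in the lecture's transition diagram and LP constraints.

```python
from fractions import Fraction as F

# Assumed action numbering: 1 = do nothing, 2 = overhaul, 3 = replace.
# COST[(s, a)] = defective + maintenance + lost-production cost (a negated reward).
COST = {
    (0, 1): 0,
    (1, 1): 1000, (1, 3): 6000,
    (2, 1): 3000, (2, 2): 4000, (2, 3): 6000,
    (3, 3): 6000,
}

# P[(s, a)] maps each possible next state s' to P(s, a, s').
P = {
    (0, 1): {1: F(7, 8), 2: F(1, 16), 3: F(1, 16)},
    (1, 1): {1: F(3, 4), 2: F(1, 8), 3: F(1, 8)},
    (1, 3): {0: F(1)},
    (2, 1): {2: F(1, 2), 3: F(1, 2)},
    (2, 2): {1: F(1)},
    (2, 3): {0: F(1)},
    (3, 3): {0: F(1)},
}


def available_actions(s):
    """Actions that the table allows in state s."""
    return sorted(a for (state, a) in COST if state == s)


# Every (state, action) pair must define a probability distribution over next states.
for (s, a), row in P.items():
    assert sum(row.values()) == 1, (s, a)

print({s: available_actions(s) for s in range(4)})
```

Storing only the allowed (state, action) pairs keeps the model faithful to the table, where not every action is available in every state.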
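Maintenance Problem: state transition diagram
(each edge is labeled with the reward $R(s,a)$ and the transition probability $P(s,a,s')$)
- State 0, do nothing ($0): to 1 w.p. 7/8, to 2 w.p. 1/16, to 3 w.p. 1/16
- State 1, do nothing (-$1000): to 1 w.p. 3/4, to 2 w.p. 1/8, to 3 w.p. 1/8
- State 2, do nothing (-$3000): to 2 w.p. 1/2, to 3 w.p. 1/2
- State 2, overhaul (-$4000): to 1
- States 1, 2, 3, replace (-$6000): to 0

A Decision Policy
- A (deterministic) policy $\mu: S \to A$ is a mapping from state to action.
- A stochastic policy $\mu: S \to \Delta(A)$ maps each state to a distribution on actions.
- A stationary policy is invariant to time. We focus on stationary policies (and assume an infinite time horizon).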
Following a Deterministic Policy
- Determine the current state $s$.
- Policy $\mu: S \to A$. Execute action $\mu(s) \in A$.
Fact: An MDP and a policy $\mu$ define a Markov chain.
Defn. An MDP is ergodic if the associated Markov chain is ergodic for every deterministic policy.
For an ergodic MDP, we can define $\pi_\mu(s')$ as the steady-state probability of being in state $s'$ given $\mu$.

Evaluating a Policy
- How good is a policy $\mu$ from state $s$?
- Expected total reward? This may be infinite!
- How can we compare policies?
Value functions
Definition. A value function $V^\mu(s)$ provides a value for policy $\mu$ in each state $s$. There are different ways to do this:
- With a finite decision horizon, we can sum the rewards.
- With an infinite decision horizon, we can adopt two different criteria: the limit-average reward, or the expected discounted reward.

Limit-Average Reward
Let $V^\mu_{(t)}(s)$ denote the total expected reward of policy $\mu$ for the next $t$ periods from state $s$.
Limit-average reward criterion:
$$V^\mu(s) = \lim_{t \to \infty} \frac{V^\mu_{(t)}(s)}{t}$$
For an ergodic MDP,
$$V^\mu(s) = \sum_{s'} R(s', \mu(s'))\, \pi_\mu(s'),$$
where $\pi_\mu(s')$ is the steady-state probability of state $s'$.
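The two expressions for the limit-average value can be compared numerically. The sketch below is not part of the lecture: it assumes an example policy on the maintenance MDP (do nothing in states 0 and 1, overhaul in 2, replace in 3), works with costs rather than rewards (so smaller is better), and uses made-up helper names `steady_state` and `finite_horizon_average`.

```python
# Markov chain induced by the assumed example policy:
# chain[s] maps next state s' -> probability; cost[s] is the cost of the chosen action.
chain = {
    0: {1: 7 / 8, 2: 1 / 16, 3: 1 / 16},  # do nothing
    1: {1: 3 / 4, 2: 1 / 8, 3: 1 / 8},    # do nothing
    2: {1: 1.0},                          # overhaul
    3: {0: 1.0},                          # replace
}
cost = {0: 0.0, 1: 1000.0, 2: 4000.0, 3: 6000.0}


def steady_state(chain, iters=2000):
    """Approximate the steady-state distribution by power iteration."""
    dist = {s: 1.0 / len(chain) for s in chain}
    for _ in range(iters):
        nxt = {s: 0.0 for s in chain}
        for s, row in chain.items():
            for s2, p in row.items():
                nxt[s2] += dist[s] * p
        dist = nxt
    return dist


def finite_horizon_average(chain, cost, start, t):
    """V_(t)(start) / t, using V_(k)(s) = cost(s) + sum_s' P(s, s') V_(k-1)(s')."""
    v = {s: 0.0 for s in chain}
    for _ in range(t):
        v = {s: cost[s] + sum(p * v[s2] for s2, p in chain[s].items()) for s in chain}
    return v[start] / t


pi = steady_state(chain)
avg_from_pi = sum(cost[s] * pi[s] for s in chain)                    # sum_s' R(s') pi(s')
avg_from_limit = finite_horizon_average(chain, cost, start=0, t=20000)  # V_(t)(s) / t
print(pi, avg_from_pi, avg_from_limit)
```

The two averages should agree up to an error that shrinks like 1/t, and the result does not depend on the start state, which is what ergodicity buys.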
Expected Discounted Reward
- A reward $t$ steps away is discounted by $\gamma^t$, for discount factor $0 < \gamma < 1$.
- Models uncertainty about the time horizon, or current rewards being more important than future rewards.
Expected discounted reward criterion:
$$V^\mu(s) = E_{s', s'', \ldots}\left[ R(s,\mu(s)) + \gamma R(s',\mu(s')) + \gamma^2 R(s'',\mu(s'')) + \cdots \right]$$

Solving MDPs
- We can write an MDP problem as a linear program. Solve the LP!
- The particular formulation depends on the objective criterion.
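Returning to the expected discounted reward criterion: one way to see that it avoids the "may be infinite" problem is the geometric-series bound below. It assumes rewards are bounded, and the symbol $R_{\max}$ is introduced here for illustration; it does not appear in the slides.

```latex
\[
\left| V^\mu(s) \right|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma},
\qquad
R_{\max} = \max_{s,a} \left| R(s,a) \right| .
\]
```

For the maintenance example, $R_{\max} = 6000$ and $\gamma = 0.9$ would give $|V^\mu(s)| \le 60000$ for every policy and start state.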
Limit-average Reward
Given model $M = (S, A, P, R)$, find the policy $\mu$ that maximizes the limit-average reward. Assume the MDP is ergodic.

Possible Decision Variables

state \ action   0          1          ...   n-1
0                x(0,0)     x(0,1)     ...   x(0,n-1)
1                x(1,0)     x(1,1)     ...   x(1,n-1)
...
m-1              x(m-1,0)   x(m-1,1)   ...   x(m-1,n-1)

$x(s,a) = 1$ if action $a$ is taken in state $s$, and $0$ otherwise.
- Rows sum to 1.
- If the policy is stochastic, $x(s,a)$ defines the probability of action $a$ in state $s$.
Better Decision Variables
Define $\pi(s,a)$ as the steady-state probability of being in state $s$ and taking action $a$. This defines a policy. To see this, we have:
$$\pi(s) = \sum_{a'} \pi(s,a')$$
$$\pi(s,a) = \pi(s)\, x(s,a) \quad \text{(prob. of } s \text{ times prob. of } a \text{ given } s\text{)}$$
Defines a policy:
$$x(s,a) = \frac{\pi(s,a)}{\sum_{a'} \pi(s,a')}$$
Balance Equation
$\pi(s,a)$: steady-state probability of being in state $s$ and taking action $a$.
Balance equation:
$$\pi(s') = \sum_a \pi(s',a) = \sum_s \pi(\text{in } s, \text{ go to } s') = \sum_s \sum_a \pi(s,a)\, P(s,a,s')$$
Implies consistency in $\pi$ values across $(s,a)$ pairs.

LP formulation (Limit-average)
$$V = \max \sum_s \sum_a R(s,a)\,\pi(s,a) \quad (1)$$
s.t.
$$\sum_s \sum_a \pi(s,a) = 1 \quad (2)$$
$$\sum_a \pi(s',a) = \sum_s \sum_a \pi(s,a)\, P(s,a,s') \quad \forall s' \quad (3)$$
$$\pi(s,a) \ge 0 \quad \forall s, a \quad (4)$$
(1) Maximize expected average reward
(2) Total probability sums to one
(3) Balance equation
Note: it is important here that the MDP be ergodic.
Optimal Policies are Deterministic
$$V = \max \sum_s \sum_a R(s,a)\,\pi(s,a) \quad (1)$$
s.t.
$$\sum_s \sum_a \pi(s,a) = 1 \quad (2)$$
$$\sum_a \pi(s',a) = \sum_s \sum_a \pi(s,a)\, P(s,a,s') \quad \forall s' \quad (3)$$
$$\pi(s,a) \ge 0 \quad \forall s, a \quad (4)$$
- $nm$ decision variables; $m+1$ equalities (one of the balance equalities is redundant).
- Simplex terminates with $m$ basic variables.
- $\pi(s) > 0$ for each $s$ since the MDP is ergodic, and so $\pi(s,a) > 0$ for at least one $a$ for each $s$.
- $\Rightarrow$ $\pi(s,a) > 0$ for exactly one $a$ in each $s$ $\Rightarrow$ a deterministic policy.

Maintenance Problem: state transition diagram (repeated from the earlier slide; rewards $R(s,a)$ and transition probabilities $P(s,a,s')$)
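Example: Maintenance
(minimization, because these are costs, not rewards; actions are numbered 1 = do nothing, 2 = overhaul, 3 = replace)
$$\min\ 1000\,\pi(1,1) + 6000\,\pi(1,3) + 3000\,\pi(2,1) + 4000\,\pi(2,2) + 6000\,\pi(2,3) + 6000\,\pi(3,3)$$
s.t.
$$\pi(0,1) + \pi(1,1) + \pi(1,3) + \cdots + \pi(3,3) = 1$$
$$\pi(0,1) - \big(\pi(1,3) + \pi(2,3) + \pi(3,3)\big) = 0$$
$$\pi(1,1) + \pi(1,3) - \big(\tfrac{7}{8}\pi(0,1) + \tfrac{3}{4}\pi(1,1) + \pi(2,2)\big) = 0$$
$$\pi(2,1) + \pi(2,2) + \pi(2,3) - \big(\tfrac{1}{16}\pi(0,1) + \tfrac{1}{8}\pi(1,1) + \tfrac{1}{2}\pi(2,1)\big) = 0$$
$$\pi(3,3) - \big(\tfrac{1}{16}\pi(0,1) + \tfrac{1}{8}\pi(1,1) + \tfrac{1}{2}\pi(2,1)\big) = 0$$
$$\pi(0,1), \ldots, \pi(3,3) \ge 0$$
Optimal solution: $\pi^*(0,1) = 2/21$; $\pi^*(1,1) = 5/7$; $\pi^*(2,2) = 2/21$; $\pi^*(3,3) = 2/21$; rest zero.
Optimal policy: $\mu(0) = 1$, $\mu(1) = 1$, $\mu(2) = 2$, $\mu(3) = 3$; do nothing in 0 and 1, overhaul in 2, and replace in 3.

Markov Chain of Optimal Policy
- State 0, do nothing ($0): to 1 w.p. 7/8, to 2 w.p. 1/16, to 3 w.p. 1/16
- State 1, do nothing (-$1000): to 1 w.p. 3/4, to 2 w.p. 1/8, to 3 w.p. 1/8
- State 2, overhaul (-$4000): to 1
- State 3, replace (-$6000): to 0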
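Because optimal policies can be taken to be deterministic and this example has only six deterministic policies, the LP answer can be cross-checked by brute force. The sketch below is an illustrative check, not the simplex method from the lecture: it enumerates the policies, solves each policy's balance equations exactly with `fractions.Fraction`, and keeps the policy with the lowest average cost. The helper names (`solve`, `steady_state`, `average_cost`) are assumptions made for this note.

```python
from fractions import Fraction as F
from itertools import product

STATES = [0, 1, 2, 3]
# Assumed action numbering: 1 = do nothing, 2 = overhaul, 3 = replace.
COST = {(0, 1): 0, (1, 1): 1000, (1, 3): 6000,
        (2, 1): 3000, (2, 2): 4000, (2, 3): 6000, (3, 3): 6000}
P = {
    (0, 1): {1: F(7, 8), 2: F(1, 16), 3: F(1, 16)},
    (1, 1): {1: F(3, 4), 2: F(1, 8), 3: F(1, 8)},
    (1, 3): {0: F(1)},
    (2, 1): {2: F(1, 2), 3: F(1, 2)},
    (2, 2): {1: F(1)},
    (2, 3): {0: F(1)},
    (3, 3): {0: F(1)},
}


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def steady_state(policy):
    """Steady-state distribution of the chain induced by policy (state -> action)."""
    n = len(STATES)
    # Normalization replaces the (redundant) balance equation for state 0.
    A, b = [[F(1)] * n], [F(1)]
    for s2 in STATES[1:]:
        # pi(s') - sum_s pi(s) P(s, policy[s], s') = 0
        A.append([F(int(s == s2)) - P[(s, policy[s])].get(s2, F(0)) for s in STATES])
        b.append(F(0))
    return solve(A, b)


def average_cost(policy):
    pi = steady_state(policy)
    return sum(pi[s] * COST[(s, policy[s])] for s in STATES)


actions = {s: sorted(a for (st, a) in COST if st == s) for s in STATES}
policies = [dict(zip(STATES, choice))
            for choice in product(*(actions[s] for s in STATES))]
best = min(policies, key=average_cost)
print(best, average_cost(best), steady_state(best))
```

Enumeration is only practical because the example is tiny; with $m$ states and $n$ actions there can be up to $n^m$ deterministic policies, which is why the LP is the method of choice in general.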
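Expected discounted reward criterion
Given model $M = (S, A, P, R)$, find the policy $\mu$ that maximizes the expected discounted reward, for discount factor $0 < \gamma < 1$. No need for the MDP to be ergodic.

LP formulation: Discounted Criterion
Start from the limit-average LP, with the balance equations (3) written with zero on the right-hand side:
$$V = \max \sum_s \sum_a R(s,a)\,\pi(s,a) \quad (1)$$
s.t.
$$\sum_s \sum_a \pi(s,a) = 1 \quad (2)$$
$$\sum_a \pi(s',a) - \sum_s \sum_a \pi(s,a)\, P(s,a,s') = 0 \quad \forall s' \quad (3)$$
$$\pi(s,a) \ge 0 \quad \forall s, a \quad (4)$$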
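LP formulation: Discounted Criterion
$$V = \max \sum_s \sum_a R(s,a)\,\pi(s,a) \quad (1)$$
s.t.
$$\sum_a \pi(s',a) - \gamma \sum_s \sum_a \pi(s,a)\, P(s,a,s') = \beta(s') \quad \forall s' \quad (3')$$
$$\pi(s,a) \ge 0 \quad \forall s, a \quad (4)$$
for any $\beta$ values such that $\sum_s \beta(s) = 1$ and $\beta(s) > 0$ for all $s$.
The normalization constraint (2) of the limit-average LP is dropped here: summing (3') over $s'$ gives $\sum_s \sum_a \pi(s,a) = 1/(1-\gamma)$, not 1.
Policy: $x(s,a) = \pi(s,a) / \sum_{a'} \pi(s,a')$
Interpretation: $\pi(s,a)$ is the discounted expected number of time periods of being in state $s$ and taking action $a$. $\beta(s)$ is the initial state distribution.

Understanding this formulation
It is the dual of another, easier to understand formulation. See this in a few slides!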
Example: Maintenance
Discount $\gamma = 0.9$; arbitrarily set $\beta = (\tfrac14, \tfrac14, \tfrac14, \tfrac14)$.
$$\min\ 1000\,\pi(1,1) + 6000\,\pi(1,3) + 3000\,\pi(2,1) + 4000\,\pi(2,2) + 6000\,\pi(2,3) + 6000\,\pi(3,3)$$
s.t.
$$\pi(0,1) - 0.9\big(\pi(1,3) + \pi(2,3) + \pi(3,3)\big) = \tfrac14$$
$$\pi(1,1) + \pi(1,3) - 0.9\big(\tfrac{7}{8}\pi(0,1) + \tfrac{3}{4}\pi(1,1) + \pi(2,2)\big) = \tfrac14$$
$$\pi(2,1) + \pi(2,2) + \pi(2,3) - 0.9\big(\tfrac{1}{16}\pi(0,1) + \tfrac{1}{8}\pi(1,1) + \tfrac{1}{2}\pi(2,1)\big) = \tfrac14$$
$$\pi(3,3) - 0.9\big(\tfrac{1}{16}\pi(0,1) + \tfrac{1}{8}\pi(1,1) + \tfrac{1}{2}\pi(2,1)\big) = \tfrac14$$
$$\pi(0,1), \ldots, \pi(3,3) \ge 0$$
Optimal solution: $\pi^*(0,1) = 1.21$, $\pi^*(1,1) = 6.66$, $\pi^*(2,2) = 1.07$, $\pi^*(3,3) = 1.07$, rest all zero. (This gives the same policy as for the limit-average criterion.)

Fundamental Theorem of MDPs
Theorem. Under the discounted reward criterion, there exists a policy $\mu^*$ that is uniformly optimal; i.e., it is optimal for all distributions on start states.
Note 1: There need not be a unique optimal policy.
Note 2: Under the limit-average reward criterion, the uniform optimality property holds only for an ergodic MDP.
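The interpretation of $\pi(s,a)$ as a discounted expected number of visits can be checked on the discounted maintenance example. The sketch below does not solve the LP; it only fixes the reported policy (do nothing in 0 and 1, overhaul in 2, replace in 3) and computes its discounted occupancy by iterating $x \leftarrow \beta + \gamma P^\top x$, which is the constraint set of the discounted LP restricted to that policy. The function name `discounted_occupancy` is made up for this note.

```python
GAMMA = 0.9
BETA = [0.25, 0.25, 0.25, 0.25]  # initial state distribution from the example

# Chain induced by the reported policy; chain[s] maps next state -> probability.
chain = {
    0: {1: 7 / 8, 2: 1 / 16, 3: 1 / 16},  # do nothing
    1: {1: 3 / 4, 2: 1 / 8, 3: 1 / 8},    # do nothing
    2: {1: 1.0},                          # overhaul
    3: {0: 1.0},                          # replace
}
cost = [0.0, 1000.0, 4000.0, 6000.0]      # cost of the chosen action in each state


def discounted_occupancy(chain, beta, gamma, iters=1000):
    """Fixed point of x(s') = beta(s') + gamma * sum_s x(s) P(s, s')."""
    x = list(beta)
    for _ in range(iters):
        nxt = list(beta)
        for s, row in chain.items():
            for s2, p in row.items():
                nxt[s2] += gamma * x[s] * p
        x = nxt
    return x


occ = discounted_occupancy(chain, BETA, GAMMA)
total_cost = sum(c * o for c, o in zip(cost, occ))
print([round(o, 2) for o in occ], total_cost, sum(occ))
```

The occupancies should come out close to the $\pi^*$ values quoted in the example, and they should sum to $1/(1-\gamma) = 10$ rather than to 1, which is why the discounted LP has no "total probability" constraint.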
Policy and Value Iteration
Another way to solve MDPs under the expected discounted reward criterion.

The Bellman equations
Define consistency requirements for the case of a discounted reward criterion.
For any policy $\mu$, the value function satisfies:
$$V^\mu(s) = R(s,\mu(s)) + \gamma \sum_{s' \in S} P(s,\mu(s),s')\, V^\mu(s') \quad \forall s$$
For the optimal policy:
$$V^*(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s,a,s')\, V^*(s') \Big] \quad \forall s$$
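The Bellman equation for a fixed policy is a fixed-point condition, so one simple (if not the fastest) way to evaluate a policy is to apply its right-hand side repeatedly. The sketch below assumes the maintenance MDP with $\gamma = 0.9$, rewards equal to negated costs, and an example policy (do nothing in 0 and 1, overhaul in 2, replace in 3); the function name `backup` is hypothetical.

```python
GAMMA = 0.9

# Chain and per-state reward under the assumed example policy.
chain = {
    0: {1: 7 / 8, 2: 1 / 16, 3: 1 / 16},  # do nothing
    1: {1: 3 / 4, 2: 1 / 8, 3: 1 / 8},    # do nothing
    2: {1: 1.0},                          # overhaul
    3: {0: 1.0},                          # replace
}
reward = {0: 0.0, 1: -1000.0, 2: -4000.0, 3: -6000.0}


def backup(V):
    """Right-hand side of the Bellman equation for the fixed policy."""
    return {s: reward[s] + GAMMA * sum(p * V[s2] for s2, p in chain[s].items())
            for s in chain}


V = {s: 0.0 for s in chain}
for _ in range(1000):  # each sweep shrinks the error by a factor of GAMMA
    V = backup(V)

# At the fixed point, V equals its own backup (up to floating-point error).
residual = max(abs(V[s] - backup(V)[s]) for s in chain)
print(V, residual)
```

Since the equations are linear in $V^\mu$, they could equally be solved as an $m \times m$ linear system; the iterative version is shown only because it mirrors the equation directly.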
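Policy Iteration
Improve a policy:
$$\mu'(s) := \arg\max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V^\mu(s') \Big]$$
Evaluate a policy:
$$V^\mu(s) = R(s,\mu(s)) + \gamma \sum_{s' \in S} P(s,\mu(s),s')\, V^\mu(s')$$
Policy iteration alternates evaluation (E) and improvement (I):
$$\mu_0 \xrightarrow{E} V^{\mu_0} \xrightarrow{I} \mu_1 \xrightarrow{E} V^{\mu_1} \xrightarrow{I} \mu_2 \xrightarrow{E} V^{\mu_2} \xrightarrow{I} \cdots$$
Each step provides a strict improvement in the policy (until the policy is optimal). There is a finite number of policies. Converges!

Value Iteration
Just iterate on the Bellman equations (for the $k$th step):
$$V_{k+1}(s) := \max_a \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s,a,s')\, V_k(s') \Big]$$
Also converges. Less computationally intensive per iteration than policy iteration.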
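Both procedures can be sketched in a few lines on the maintenance MDP. This is a minimal illustration under stated assumptions, not a reference implementation: rewards are negated costs, $\gamma = 0.9$, actions are numbered 1 = do nothing, 2 = overhaul, 3 = replace, policy evaluation is done by repeated sweeps rather than by solving a linear system, and all function names are made up for this note.

```python
GAMMA = 0.9
STATES = [0, 1, 2, 3]
REWARD = {(0, 1): 0.0, (1, 1): -1000.0, (1, 3): -6000.0,
          (2, 1): -3000.0, (2, 2): -4000.0, (2, 3): -6000.0, (3, 3): -6000.0}
P = {
    (0, 1): {1: 7 / 8, 2: 1 / 16, 3: 1 / 16},
    (1, 1): {1: 3 / 4, 2: 1 / 8, 3: 1 / 8},
    (1, 3): {0: 1.0},
    (2, 1): {2: 1 / 2, 3: 1 / 2},
    (2, 2): {1: 1.0},
    (2, 3): {0: 1.0},
    (3, 3): {0: 1.0},
}
ACTIONS = {s: [a for (st, a) in REWARD if st == s] for s in STATES}


def q_value(s, a, V):
    """R(s,a) + gamma * sum_s' P(s,a,s') V(s')."""
    return REWARD[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def greedy(V):
    """Improvement step: the best action in each state with respect to V."""
    return {s: max(ACTIONS[s], key=lambda a: q_value(s, a, V)) for s in STATES}


def value_iteration(tol=1e-9):
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {s: max(q_value(s, a, V) for a in ACTIONS[s]) for s in STATES}
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V, greedy(new_V)
        V = new_V


def evaluate(policy, sweeps=2000):
    """Evaluation step: iterate the Bellman equation of the fixed policy."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: q_value(s, policy[s], V) for s in STATES}
    return V


def policy_iteration():
    policy = {s: ACTIONS[s][0] for s in STATES}  # arbitrary starting policy
    while True:
        V = evaluate(policy)
        improved = greedy(V)
        if improved == policy:                   # no change: policy is optimal
            return V, policy
        policy = improved


V_vi, policy_vi = value_iteration()
V_pi, policy_pi = policy_iteration()
print(policy_vi, policy_pi)
print(V_vi)
```

On this example both methods should agree with the policy found by the discounted LP; policy iteration needs only a handful of improvement steps, while value iteration needs a few hundred cheap sweeps to reach the chosen tolerance.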
Understanding the Discounted reward LP formulation
This LP: [equations not recovered from the slide]
The previous LP is equivalent to (for positive $\beta$ values): [equation not recovered from the slide]
and this is the dual!
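The equations on this slide did not survive extraction. As a hedged reconstruction, the standard primal-dual pair for discounted MDPs is written out below in the lecture's notation; it is assumed, not confirmed, that this is what the slide showed. The primal asks for the smallest value function that dominates every one-step Bellman backup, and its dual is the $\pi(s,a)$ formulation with constraints (3') and (4).

```latex
\begin{align*}
\text{Primal:}\quad
  & \min_{V} \; \sum_{s} \beta(s)\, V(s) \\
  & \text{s.t.}\quad V(s) \;\ge\; R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s')
    \qquad \forall s,\, a \\[6pt]
\text{Dual:}\quad
  & \max_{\pi} \; \sum_{s} \sum_{a} R(s,a)\, \pi(s,a) \\
  & \text{s.t.}\quad \sum_{a} \pi(s',a) - \gamma \sum_{s} \sum_{a} \pi(s,a)\, P(s,a,s') = \beta(s')
    \qquad \forall s' \\
  & \phantom{\text{s.t.}\quad} \pi(s,a) \ge 0 \qquad \forall s,\, a
\end{align*}
```

Each primal constraint (one per state-action pair) gets a nonnegative dual variable $\pi(s,a)$, and each free primal variable $V(s')$ gives one dual equality. At the optimum the primal solution satisfies the Bellman optimality equation, $V(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s,a,s') V(s')]$, when all $\beta(s) > 0$.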
Applications
- Airline meal provisioning
- Assisted living
- Automated vehicles

Airline meal provisioning (Goto 00)
- Determine the quantity of meals to load.
- Passenger load varies because of stand-by passengers and missed flights.
- Average costs down 17%, short-catered flights down 33%.
Additional applications
- Assisted Living (Pollack 05)
- High speed obstacle avoidance (Michels et al. 05)
- Super Mario World

Summary: MDPs
- A policy maps states to actions; fixing a policy, we have a Markov chain.
- Can adopt limit-average reward and expected discounted reward criteria.
- Can solve for optimal policies via LPs. Solutions are:
  - deterministic
  - uniformly optimal (need ergodic if limit-average reward criterion)
- Can also use policy and value iteration.