Markov Decision Processes (and a small amount of reinforcement learning)
Slides adapted from: Brian Williams, MIT; Manuela Veloso, Andrew Moore, Reid Simmons, & Tom Mitchell, CMU; Nicholas Roy
16.4/13 Session 23
How Should a Rover Search for its Landing Craft?
Landing Craft
State Space Search? As a Constraint Satisfaction Problem? Goal-directed Planning? Linear Programming? Is the real world well-behaved?
How Should a Rover Search for its Landing Craft?
Landing Craft
What if each action can have one of a set of different outcomes? What if the outcomes occur probabilistically?
Ideas in this lecture
Problem is to accumulate rewards, rather than to achieve goal states. Approach is to generate reactive policies for how to act in all situations, rather than plans for a single starting situation. Policies fall out of value functions, which describe the greatest lifetime reward achievable at every state. Value functions are iteratively approximated.
MDP Problem: Model
[Figure: agent-environment loop — the agent observes state s_t and reward r_t from the environment and emits action a_t, producing the trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, ...]
Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
Markov Decision Processes (MDPs)
Model:
Finite set of states, S
Finite set of actions, A
(Probabilistic) state transitions, T(s_i, a_j, s_k)
Reward for each state and action, R(s_i, a_i)
Process:
Observe state s_t in S
Choose action a_t in A
Receive immediate reward r_t
State changes to some s_{t+1} according to T(s_t, a_t, s_{t+1})
[Figure: example transition graph with states s1, s2, s3 and actions a1, a2; legal transitions shown. Rewards on unlabeled transitions are 0.]
MDP Environment Assumptions
Markov Assumption: Next state and reward are a function only of the current state and action:
p(s_{t+1} | a_t, s_t, a_{t-1}, s_{t-1}, a_{t-2}, ...) = p(s_{t+1} | a_t, s_t)
r(s_t, a_t, s_{t-1}, a_{t-1}, s_{t-2}, ...) = r(s_t, a_t)
Uncertain and Unknown Environment: p(s_{t+1} | a_t, s_t) and r may be nondeterministic and unknown
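The (S, A, T, R) model above can be sketched in code. This is a minimal illustrative encoding, not from the slides: the states, actions, and numbers are made up, and representing T and R as plain dictionaries is my own choice.

```python
import random

# Hypothetical 3-state MDP. T[(s, a)] lists (next_state, probability)
# pairs; R[(s, a)] is the immediate reward for taking action a in state s.
T = {
    ("s1", "a1"): [("s2", 1.0)],
    ("s1", "a2"): [("s1", 0.5), ("s3", 0.5)],
    ("s2", "a1"): [("s3", 1.0)],
    ("s3", "a1"): [("s3", 1.0)],
}
R = {
    ("s1", "a1"): 0.0,
    ("s1", "a2"): 0.0,
    ("s2", "a1"): 10.0,
    ("s3", "a1"): 0.0,
}

def step(s, a, rng):
    """One step of the MDP process: receive r_t, sample s_{t+1} from T."""
    r = R[(s, a)]
    u, acc = rng.random(), 0.0
    for s_next, p in T[(s, a)]:
        acc += p
        if u < acc:
            return r, s_next
    return r, T[(s, a)][-1][0]   # guard against floating-point round-off

r, s_next = step("s2", "a1", random.Random(0))   # deterministic transition
```

Repeatedly calling `step` produces exactly the observe/choose/receive/change cycle the slide lists.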
So what is the solution to an MDP?
An MDP solution is a policy π : S → A
Selects an action for each state.
Optimal policy π* : S → A
Selects the action for each state that maximizes lifetime reward.
(Assume a deterministic world.)
There are many policies; not all are necessarily optimal. There may be several optimal policies.
What is this lifetime reward?
The optimal policy maximizes the expected reward of the agent over the lifetime of the agent:
π*(s) = argmax_{a ∈ A} E_{s, s_1, ...} [ Σ_{t=0..∞} r(s_t, a_t) ]
How long will the agent live?
Finite horizon: Rewards accumulate for a fixed period. $1K + $1K + $1K = $3K
Infinite horizon: Assume reward accumulates forever. $1K + $1K + ... = infinity
Discounting: Future rewards are not worth as much (a bird in hand). Introduce discount factor γ: $1K + γ $1K + γ^2 $1K + ... converges.
Value Function V^π for a Given Policy π
V^π(s_t) is the accumulated lifetime reward resulting from starting in state s_t and repeatedly executing policy π:
V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
V^π(s_t) = Σ_i γ^i r_{t+i}
where r_t, r_{t+1}, r_{t+2}, ... are generated by following π, starting at s_t.
[Figure: grid example of V^π, assuming γ = 0.9.]
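The discounting arithmetic above can be checked numerically. A small sketch (the function name and the constant-reward example are mine): a constant reward stream r, r, r, ... diverges undiscounted, but with γ < 1 it converges to the geometric-series limit r / (1 - γ).

```python
def discounted_return(rewards, gamma):
    """V = r_0 + gamma*r_1 + gamma^2*r_2 + ... (finite truncation)."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# With gamma = 0.9, a reward of 100 per step forever is worth
# 100 / (1 - 0.9) = 1000; a long finite run gets arbitrarily close.
gamma = 0.9
approx = discounted_return([100.0] * 1000, gamma)  # long finite run
exact = 100.0 / (1 - gamma)                        # geometric-series limit
```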
An Optimal Policy π* Given Value Function V*
Notice: Suppose, given state s, we knew the lifetime rewards for all other states?
1. Examine all possible actions a_i in state s.
2. Select the action a_i with greatest lifetime reward.
Lifetime reward Q(s, a_i) combines:
the immediate reward for taking the action, r(s, a);
the probability of each posterior state s', p(s' | s, a);
the lifetime reward starting in the posterior state, V(s'), discounted by γ.
π*(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s' | s, a) V(s') ]
Must Know:
Value function
Environment model: p : S × A × S → R, r : S × A → R
Value Function V* for an Optimal Policy π*
Example: [Figure: two-state example with states S_A, S_B, actions A_A, A_B and rewards R_A, R_B.]
Optimal value function for a one-step horizon:
V*_1(s) = max_{a_i} [ r(s, a_i) ]
Optimal value function for a two-step horizon:
V*_2(s) = max_{a_i} [ r(s, a_i) + γ Σ_{s'} p(s' | s, a_i) V*_1(s') ]
Optimal value function for an n-step horizon:
V*_n(s) = max_{a_i} [ r(s, a_i) + γ Σ_{s'} p(s' | s, a_i) V*_{n-1}(s') ]
Optimal value function for an infinite horizon:
V*(s) = max_{a_i} [ r(s, a_i) + γ Σ_{s'} p(s' | s, a_i) V*(s') ]
Solving MDPs by Value Iteration
Insight: Calculate optimal values iteratively using dynamic programming.
Algorithm:
1. Label all states: for each state s, V(s) ← max_a r(s, a)
2. Iteratively update values using Bellman's equation: for each state s,
V_{t+1}(s) ← max_a [ r(s, a) + γ Σ_{s'} p(s' | s, a) V_t(s') ]
3. Terminate when values are close enough: |V_{t+1}(s) - V_t(s)| < ε
4. Return V* = V_{t+1}
Policy Execution: the agent selects the optimal action by one-step lookahead on V:
π(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s' | s, a) V_t(s') ]
Example of Value Iteration
V_{t+1}(s) ← max_a [ r(s, a) + γ Σ_{s'} p(s' | s, a) V_t(s') ]
γ = 0.9, p(s' | s, a) is deterministic
[Figure: grid showing V_t and V_{t+1}; e.g. the state adjacent to the reward-100 transition gets value 90 = 0.9 × 100, and its neighbor 81 = 0.9 × 90.]
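The four algorithm steps above can be sketched compactly. The MDP encoding is my own for illustration (T[(s, a)] lists (next_state, probability) pairs, R[(s, a)] is the immediate reward), as is the tiny 2-state example; one simplification is that V is initialized to 0 rather than to max_a r(s, a).

```python
# Value iteration over a dictionary-encoded MDP (a sketch, not the
# slide's grid world).
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}          # init (0 here, not max_a r)
    while True:
        # Bellman backup for every state
        V_new = {
            s: max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)])
                for a in actions if (s, a) in T
            )
            for s in states
        }
        # terminate when values are close enough
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

# Made-up example: "s1" reaches "s2" with reward 10; "s2" self-loops
# with reward 0, so V*(s1) = 10 and V*(s2) = 0.
states, actions = ["s1", "s2"], ["go"]
T = {("s1", "go"): [("s2", 1.0)], ("s2", "go"): [("s2", 1.0)]}
R = {("s1", "go"): 10.0, ("s2", "go"): 0.0}
V = value_iteration(states, actions, T, R)
```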
Example of Value Iteration
V_{t+1}(s) ← max_a [ r(s, a) + γ Σ_{s'} p(s' | s, a) V_t(s') ]
γ = 0.9, p(s' | s, a) is non-deterministic (red arcs occur with 50% probability)
[Figure: grid showing V_1, V_2, V_3; e.g. 45 = 0.9 × (0.5 × 100), 90.5 = 0.5 × 100 + 0.5 × 81, 81.45 = 0.9 × 90.5.]
Convergence of Value Iteration
If we terminate when values are close enough, |V_{t+1}(s) - V_t(s)| < ε,
then: max_{s ∈ S} |V_{t+1}(s) - V*(s)| < 2εγ / (1 - γ)
Converges in polynomial time. Convergence is guaranteed even if updates are performed infinitely often, but asynchronously and in any order.
Ideas in this lecture
Objective is to accumulate rewards, rather than to achieve goal states. Objectives are achieved along the way, rather than at the end. Policies can be described by value functions, which describe the greatest lifetime reward achievable at every state. Value iteration is a fast algorithm for computing the value function under certain assumptions.
Appendix: Policy Iteration
Idea: Iteratively improve the policy.
1. Policy Evaluation: Given a policy π_i, calculate V_i = V^{π_i}, the utility of each state if π_i were to be executed.
2. Policy Improvement: Calculate a new maximum-expected-utility policy π_{i+1} using one-step lookahead based on V_i.
π_i improves at every step, converging when π_{i+1} = π_i.
Computing V_i is simpler than for value iteration (no max):
V_{t+1}(s) ← r(s, π_i(s)) + γ Σ_{s'} p(s' | s, π_i(s)) V_t(s')
Either solve the linear equations in O(N^3), or solve iteratively, similar to value iteration.
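The two-step loop in the appendix can be sketched as follows. The dictionary MDP encoding (T[(s, a)] as (next_state, probability) pairs, R[(s, a)] as reward) and the 2-state example are mine; policy evaluation is done iteratively (the "no max" update) rather than by the O(N^3) linear solve.

```python
# Policy iteration sketch: evaluate, improve, repeat until the policy
# stops changing.
def policy_iteration(states, actions, T, R, gamma=0.9, eps=1e-9):
    # start from an arbitrary policy: first legal action in each state
    pi = {s: next(a for a in actions if (s, a) in T) for s in states}
    while True:
        # 1. policy evaluation: V(s) <- r(s, pi(s)) + gamma * E[V(s')]
        V = {s: 0.0 for s in states}
        while True:
            V_new = {s: R[(s, pi[s])]
                        + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])])
                     for s in states}
            if max(abs(V_new[s] - V[s]) for s in states) < eps:
                break
            V = V_new
        # 2. policy improvement: one-step lookahead on V
        pi_new = {
            s: max((a for a in actions if (s, a) in T),
                   key=lambda a: R[(s, a)]
                   + gamma * sum(p * V[s2] for s2, p in T[(s, a)]))
            for s in states
        }
        if pi_new == pi:          # converged: pi_{i+1} == pi_i
            return pi, V
        pi = pi_new

# Made-up example: in "s1", "stay" pays 0.5 forever (value 5) while
# "go" pays 10 once, so the improved policy picks "go".
states, actions = ["s1", "s2"], ["stay", "go"]
T = {("s1", "stay"): [("s1", 1.0)], ("s1", "go"): [("s2", 1.0)],
     ("s2", "stay"): [("s2", 1.0)], ("s2", "go"): [("s2", 1.0)]}
R = {("s1", "stay"): 0.5, ("s1", "go"): 10.0,
     ("s2", "stay"): 0.0, ("s2", "go"): 0.0}
pi, V = policy_iteration(states, actions, T, R)
```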
Reinforcement Learning Problem
Given: repeatedly executed action, observed state, observed reward.
Learn an action policy π : S → A that maximizes lifetime reward r_0 + γ r_1 + γ^2 r_2 + ... from any start state. Discount: 0 < γ < 1.
Note: unsupervised learning; delayed reward; model not known.
[Figure: agent-environment loop — agent observes state and reward, emits action.]
Goal: Learn to choose actions that maximize lifetime reward r_0 + γ r_1 + γ^2 r_2 + ...
How About Learning the Model? Certainty Equivalence
1. Explore the world.
2. Count how often reward r occurs when in state s_i: SumR[s_i] += r, Count[s_i] += 1.
3. Count how often state s_j occurs after state s_i: Trans[s_i, s_j] += 1.
4. At any time: r_est(s_i) = SumR[s_i] / Count[s_i], T_est(s_i, s_j) = Trans[s_i, s_j] / Count[s_i]
5. So at any time we can solve for V_est.
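The certainty-equivalence counting above can be sketched directly. Variable names mirror the slide's SumR / Count / Trans tables; the sample experience at the bottom is made up.

```python
# Certainty-equivalence bookkeeping: accumulate counts from experience,
# then form the empirical estimates r_est and T_est.
from collections import defaultdict

sum_r = defaultdict(float)   # SumR[s_i]
count = defaultdict(int)     # Count[s_i]
trans = defaultdict(int)     # Trans[s_i, s_j]

def record(s, r, s_next):
    """Record one observed step: reward r in state s, then move to s_next."""
    sum_r[s] += r
    count[s] += 1
    trans[(s, s_next)] += 1

def r_est(s):
    return sum_r[s] / count[s]

def t_est(s, s_next):
    return trans[(s, s_next)] / count[s]

# Two visits to "A": reward 10 both times, landing once in "B", once in "A".
record("A", 10.0, "B")
record("A", 10.0, "A")
```

With the estimates in hand, V_est can be computed by any of the methods above (e.g. value iteration on r_est and T_est).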
Certainty Equivalence Costs
Memory: O(N^2)
Time to update counters: O(1)
Time to re-evaluate V:
O(N^3) if we use matrix inversion
O(N^2 k_CRIT) if we use value iteration and need k_CRIT iterations to converge
O(N k_CRIT) if we use value iteration, need k_CRIT iterations to converge, and T is sparse (i.e., mean number of successors is constant)
Too expensive for some people. Prioritized sweeping will help (see later), but first let's review a very inexpensive approach.
Eliminating the Model with Q Functions
π*(s) = argmax_a [ r(s, a) + γ Σ_{s'} p(s' | s, a) V*(s') ]
Key idea: Define a function like V that encapsulates δ and r:
Q(s, a) = r(s, a) + γ Σ_{s'} p(s' | s, a) V*(s')
Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
π*(s) = argmax_a Q(s, a)
How Do We Learn Q?
Q(s_t, a_t) = r(s_t, a_t) + γ Σ_{s'} p(s' | s_t, a_t) V*(s')
Need to eliminate V* in the update rule. Note Q and V* are closely related:
V*(s) = max_{a'} Q(s, a')
Substituting Q for V*:
Q(s_t, a_t) = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
Example (γ = 0.9):
[Figure: grid with Q values 72, 63, 81 before the update and 90, 63, 81 after.]
Learning update:
Q(s_1, a_right) ← r(s_1, a_right) + γ max_{a'} Q(s_2, a')
= 0 + 0.9 × max{63, 81, 100} = 90
Note: if rewards are non-negative:
For all s, a, n: Q_n(s, a) ≤ Q_{n+1}(s, a)
For all s, a, n: 0 ≤ Q_n(s, a) ≤ Q(s, a)
Q-Learning Iterations
Start at the top-left corner and move clockwise around the perimeter; initially all values in the Q table are zero; γ = 0.8.
Q(s, a) ← r + γ max_{a'} Q(s', a')
[Grid: s1 s2 s3 / s6 s5 s4]
Q(s4, W) ← r + γ max_{a'} {Q(s5, loop)} = 10 + 0.8 × 0 = 10
Q(s3, S) ← r + γ max_{a'} {Q(s4, W), Q(s4, N)} = 0 + 0.8 × max{0, 10} = 8
Q(s2, E) ← r + γ max_{a'} {Q(s3, W), Q(s3, S)} = 0 + 0.8 × max{0, 8} = 6.4
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q. Initially, for each s, a, initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
Do forever:
Select an action a and execute it
Receive immediate reward r
Observe the new state s'
Update the table entry: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
s ← s'
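The crib sheet's loop can be sketched as below. The update rule Q(s, a) ← r + γ max_a' Q(s', a') is from the slide; the two-state chain, the action set, the bounded episode length, and the purely random action selection are my own simplifications.

```python
# Deterministic-world Q-learning over a made-up two-state chain.
import random
from collections import defaultdict

def q_learning(step_fn, actions, start, episodes, gamma=0.8, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                   # table entries start at 0
    for _ in range(episodes):
        s = start
        for _ in range(20):                  # bounded episode length
            a = rng.choice(actions)          # explore at random
            r, s_next, done = step_fn(s, a)
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
            if done:
                break
    return Q

# Chain: s1 --E (r=0)--> s2 --E (r=10)--> goal (absorbing).
def step_fn(s, a):
    if s == "s1" and a == "E":
        return 0.0, "s2", False
    if s == "s2" and a == "E":
        return 10.0, "goal", True
    return 0.0, s, False                     # any other move stays put

Q = q_learning(step_fn, ["E", "W"], "s1", episodes=50)
```

With γ = 0.8, Q(s2, E) converges to 10 and Q(s1, E) to 0.8 × 10 = 8, mirroring the backward propagation in the grid example above.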
Discussion
How should the learning agent use the intermediate Q values? Exploration vs exploitation.
Scaling up in the size of the state space: function approximators (neural net instead of table); reuse, use of macros.
Nondeterministic Case
We redefine V, Q by taking expected values:
V^π(s_t) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + ... ]
V^π(s_t) = E[ Σ_i γ^i r_{t+i} ]
Q(s_t, a_t) = E[ r(s_t, a_t) + γ V*(δ(s_t, a_t)) ]
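One standard answer to the exploration-vs-exploitation question raised in the Discussion above is epsilon-greedy selection. The slide only poses the question; the strategy, function name, and sample Q table below are my own illustration.

```python
# Epsilon-greedy: with probability epsilon act randomly (explore),
# otherwise act greedily on the current Q table (exploit).
import random

def epsilon_greedy(Q, s, actions, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(actions)                         # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit

rng = random.Random(0)
Q = {("s1", "E"): 8.0, ("s1", "W"): 6.4}
picks = [epsilon_greedy(Q, "s1", ["E", "W"], 0.1, rng) for _ in range(200)]
# Mostly the greedy action "E", with occasional random explorations.
```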
Nondeterministic Case
Alter the training rule to
Q_n(s, a) ← (1 - α_n) Q_{n-1}(s, a) + α_n [ r + γ max_{a'} Q_{n-1}(s', a') ]
where α_n = 1 / (1 + visits_n(s, a)) and s' = δ(s, a).
Can still prove convergence of Q̂ [Watkins and Dayan, 92].
Ongoing Research
Handling the case where the state is only partially observable
Designing optimal exploration strategies
Extending to continuous actions and states
Learning and using δ̂ : S × A → S
Relationship to dynamic programming
Multiple learners: multi-agent reinforcement learning
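The nondeterministic training rule above can be sketched as a single update function. The decaying learning rate α_n = 1/(1 + visits_n(s, a)) follows the slide; the table layout and example call are mine.

```python
# Nondeterministic Q update: blend the old estimate with the new
# sampled backup, with a learning rate that decays per (s, a) visit.
from collections import defaultdict

Q = defaultdict(float)
visits = defaultdict(int)

def q_update(s, a, r, s_next, actions, gamma=0.9):
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
    return Q[(s, a)]

# First visit: alpha = 1/2, so Q moves halfway toward the sample.
v1 = q_update("s", "a", 10.0, "t", ["a"])
```

As visits(s, a) grows, α_n shrinks, so noisy individual samples perturb the estimate less and less; this averaging is what makes convergence provable in the stochastic case.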