Planning in Markov Decision Processes


1 Carnegie Mellon School of Computer Science, Deep Reinforcement Learning and Control. Planning in Markov Decision Processes. Lecture 3, CMU. Katerina Fragkiadaki

2 Markov Decision Process (MDP) A Markov Decision Process is a tuple $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$: $\mathcal{S}$ is a finite set of states; $\mathcal{A}$ is a finite set of actions; $T$ is a state transition probability function, $T(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$; $r$ is a reward function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$; $\gamma$ is a discount factor, $\gamma \in [0, 1]$.
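
The tuple maps naturally onto tabular arrays. Below is a minimal sketch, not from the slides: the class name `TabularMDP` and the example numbers are illustrative assumptions.

```python
import numpy as np

class TabularMDP:
    """Finite MDP (S, A, T, r, gamma) with |S| states and |A| actions."""
    def __init__(self, T, r, gamma):
        # T[s, a, s'] = P[S_{t+1} = s' | S_t = s, A_t = a]
        # r[s, a]     = E[R_{t+1}   | S_t = s, A_t = a]
        self.T, self.r, self.gamma = np.asarray(T), np.asarray(r), gamma
        self.n_states, self.n_actions = self.r.shape
        assert np.allclose(self.T.sum(axis=2), 1.0), "each T[s, a, :] must be a distribution"

# Example: a 2-state, 2-action MDP with arbitrary numbers.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
mdp = TabularMDP(T, r, gamma=0.9)
```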

3 Solving MDPs Prediction: given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$ and a policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$, find the state and action value functions. Optimal control: given an MDP $(\mathcal{S}, \mathcal{A}, T, r, \gamma)$, find the optimal policy (aka the planning problem). Compare with the learning problem, where information about rewards/dynamics is missing. We still consider finite MDPs (finite $\mathcal{S}$ and $\mathcal{A}$) with known dynamics!

4 Outline Policy iteration, value iteration, linear programming, asynchronous DP.

5 Policy Evaluation Policy evaluation: for a given policy $\pi$, compute the state value function $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$, where $v_\pi(s)$ is implicitly given by the Bellman equation $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_\pi(s') \right)$, a system of $|\mathcal{S}|$ simultaneous equations.

6 MDPs to MRPs An MDP under a fixed policy $\pi$ becomes a Markov Reward Process (MRP): $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_\pi(s') \right) = \sum_{a \in \mathcal{A}} \pi(a \mid s) r(s, a) + \gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_\pi(s') = r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s' s} \, v_\pi(s')$, where $r^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s) r(s, a)$ and $T^\pi_{s' s} = \sum_{a \in \mathcal{A}} \pi(a \mid s) T(s' \mid s, a)$.
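
A small sketch of this reduction in code, assuming tabular numpy arrays for $T$, $r$ and a stochastic policy `pi[s, a]`; the helper name `induced_mrp` is made up for illustration.

```python
import numpy as np

def induced_mrp(T, r, pi):
    """Collapse an MDP under a fixed policy pi into an MRP (r_pi, T_pi).

    T:  (S, A, S) transition tensor, T[s, a, s'] = T(s' | s, a)
    r:  (S, A)    expected rewards,  r[s, a]
    pi: (S, A)    policy,            pi[s, a] = pi(a | s)
    """
    r_pi = np.einsum('sa,sa->s', pi, r)       # r^pi_s      = sum_a pi(a|s) r(s,a)
    T_pi = np.einsum('sa,sap->sp', pi, T)     # T^pi_{s s'} = sum_a pi(a|s) T(s'|s,a)
    return r_pi, T_pi
```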

7 Backup Diagram: MDP (figure: one-step backup tree rooted at $v_\pi(s)$, branching over actions $a$ and rewards $r$ to successor values $v_\pi(s')$; not reproduced here).

8 Backup Diagram: MDP vs. MRP (figure: under $\pi$ the MDP backup tree collapses to the MRP backup $v_\pi(s) = r^\pi_s + \gamma \sum_{s' \in \mathcal{S}} T^\pi_{s' s} \, v_\pi(s')$).

9 Matrix Form The Bellman expectation equation can be written concisely using the induced MRP as $v_\pi = r^\pi + \gamma T^\pi v_\pi$, with direct solution $v_\pi = (I - \gamma T^\pi)^{-1} r^\pi$ of complexity $O(N^3)$.
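
A sketch of the direct solution under the same tabular assumptions; solving the linear system is preferable to forming the inverse explicitly.

```python
import numpy as np

def evaluate_policy_exact(T, r, pi, gamma):
    """Solve v_pi = (I - gamma * T_pi)^{-1} r_pi by Gaussian elimination (O(N^3))."""
    r_pi = np.einsum('sa,sa->s', pi, r)       # induced MRP reward vector
    T_pi = np.einsum('sa,sap->sp', pi, T)     # induced MRP transition matrix
    n = r_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
```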

10 Iterative Methods: Recall the Bellman Equation $v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_\pi(s') \right)$ (figure: the corresponding one-step backup diagram).

11 Iterative Methods: Backup Operation Given an expected value function at iteration $k$, we back up the expected value function at iteration $k+1$: $v^{[k+1]}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v^{[k]}(s') \right)$ (figure: backup diagram with $v^{[k+1]}(s)$ at the root and $v^{[k]}(s')$ at the leaves).

12 Iterative Methods: Sweep A sweep consists of applying the backup operation to all the states in $\mathcal{S}$: $v \to v'$. Applying the backup operator iteratively: $v^{[0]} \to v^{[1]} \to v^{[2]} \to \dots \to v_\pi$.
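
A sketch of iterative policy evaluation with synchronous sweeps, under the same tabular assumptions; the tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def iterative_policy_evaluation(T, r, pi, gamma, tol=1e-8, max_iters=10_000):
    """Apply the Bellman expectation backup to all states until the values stop changing."""
    r_pi = np.einsum('sa,sa->s', pi, r)            # r^pi
    T_pi = np.einsum('sa,sap->sp', pi, T)          # T^pi
    v = np.zeros(r_pi.shape[0])                    # v^[0] = 0
    for _ in range(max_iters):
        v_new = r_pi + gamma * T_pi @ v            # one synchronous sweep over all states
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```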

13 A Small Gridworld $\gamma = 1$: an undiscounted episodic task. Nonterminal states: 1, 2, ..., 14. Terminal state: one, shown in the shaded square. Actions that would take the agent off the grid leave the state unchanged. Reward is -1 on all transitions until the terminal state is reached.
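
A sketch of this gridworld as a tabular MDP, evaluated under the equiprobable random policy (as the next slides do). Representing the terminal state by two absorbing corner cells is a modeling choice made here for convenience, not something stated on the slide.

```python
import numpy as np

def small_gridworld():
    """4x4 gridworld: cells 0 and 15 act as the terminal state, reward -1 per step elsewhere."""
    n_states, n_actions = 16, 4                       # actions: up, down, left, right
    T = np.zeros((n_states, n_actions, n_states))
    r = np.full((n_states, n_actions), -1.0)
    moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
    for s in range(n_states):
        row, col = divmod(s, 4)
        for a, (dr, dc) in moves.items():
            if s in (0, 15):                          # terminal: absorbing, zero reward
                T[s, a, s], r[s, a] = 1.0, 0.0
                continue
            nr, nc = row + dr, col + dc
            if not (0 <= nr < 4 and 0 <= nc < 4):     # off-grid moves leave the state unchanged
                nr, nc = row, col
            T[s, a, nr * 4 + nc] = 1.0
    return T, r

T, r = small_gridworld()
pi = np.full((16, 4), 0.25)                           # equiprobable random policy
r_pi = np.einsum('sa,sa->s', pi, r)
T_pi = np.einsum('sa,sap->sp', pi, T)
v = np.zeros(16)
for _ in range(1000):                                 # iterative policy evaluation, gamma = 1
    v = r_pi + T_pi @ v
print(np.round(v.reshape(4, 4)))                      # e.g. about -14 next to a terminal corner
```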

14 Iterative Policy Evaluation $v^{[k]}$ for the random policy: the policy $\pi$ is the equiprobable random action policy on the small gridworld above (figure: the value grid after an early iteration $k$; not reproduced here).

15 Iterative Policy Evaluation (continued): $v^{[k]}$ for the random policy at a later iteration (figure not reproduced).

16 Iterative Policy Evaluation (continued): $v^{[k]}$ for the random policy at a still later iteration (figure not reproduced).

17 Contraction Mapping Theorem An operator $F$ on a normed vector space $\mathcal{X}$ is a $\gamma$-contraction, for $0 < \gamma < 1$, provided for all $x, y \in \mathcal{X}$: $\lVert F(x) - F(y) \rVert \le \gamma \lVert x - y \rVert$. Theorem (Contraction Mapping): for a $\gamma$-contraction $F$ in a complete normed vector space $\mathcal{X}$, iterative application of $F$ converges to a unique fixed point in $\mathcal{X}$ at a linear convergence rate. Remark: in general we only need a complete metric (vs. normed) space.

18 Value Function Space Consider the vector space $\mathcal{V}$ over value functions. There are $|\mathcal{S}|$ dimensions; each point in this space fully specifies a value function $v(s)$. Does the Bellman backup bring value functions closer in this space? If so, the backup must converge to a unique solution.

19 Value Function $\infty$-Norm We will measure the distance between state-value functions $u$ and $v$ by the $\infty$-norm, i.e. the largest difference between state values: $\lVert u - v \rVert_\infty = \max_{s \in \mathcal{S}} |u(s) - v(s)|$.

20 Bellman Expectation Backup is a Contraction Define the Bellman expectation backup operator $F^\pi(v) = r^\pi + \gamma T^\pi v$. This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$: $\lVert F^\pi(u) - F^\pi(v) \rVert_\infty = \lVert (r^\pi + \gamma T^\pi u) - (r^\pi + \gamma T^\pi v) \rVert_\infty = \lVert \gamma T^\pi (u - v) \rVert_\infty \le \gamma \lVert T^\pi \rVert_\infty \lVert u - v \rVert_\infty \le \gamma \lVert u - v \rVert_\infty$, where the last step uses that each row of $T^\pi$ sums to 1.

21 Convergence of Iterative Policy Evaluation and Policy Iteration The Bellman expectation operator $F^\pi$ has a unique fixed point. $v_\pi$ is a fixed point of $F^\pi$ (by the Bellman expectation equation). By the contraction mapping theorem, iterative policy evaluation converges on $v_\pi$.

22 Policy Improvement Suppose we have computed $v_\pi$ for a deterministic policy $\pi$. For a given state $s$, would it be better to take an action $a \ne \pi(s)$? It is better to switch to action $a$ in state $s$ if and only if $q_\pi(s, a) > v_\pi(s)$. And we can compute $q_\pi(s, a)$ from $v_\pi$ by: $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_\pi(s')$.

23 Policy Improvement Cont. Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$: $\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_\pi(s') \right)$. What if the policy is unchanged by this? Then the policy must be optimal!
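
A sketch of one greedy improvement step over a computed $v_\pi$, returning a deterministic policy as an array of action indices (tabular arrays as before).

```python
import numpy as np

def greedy_policy(T, r, v, gamma):
    """pi'(s) = argmax_a [ r(s,a) + gamma * sum_{s'} T(s'|s,a) v(s') ]."""
    q = r + gamma * np.einsum('sap,p->sa', T, v)   # q(s, a) computed from v
    return np.argmax(q, axis=1)                    # deterministic greedy policy
```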

24 Policy Iteration $\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$, where $E$ denotes policy evaluation and $I$ denotes policy improvement (greedification).

25 Policy Iteration
1. Initialization: $V(s) \in \mathbb{R}$ and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$.
2. Policy Evaluation: Repeat: $\Delta \leftarrow 0$; for each $s \in \mathcal{S}$: $v \leftarrow V(s)$; $V(s) \leftarrow r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, \pi(s)) \, V(s')$; $\Delta \leftarrow \max(\Delta, |v - V(s)|)$; until $\Delta < \theta$ (a small positive number).
3. Policy Improvement: policy-stable $\leftarrow$ true; for each $s \in \mathcal{S}$: $a \leftarrow \pi(s)$; $\pi(s) \leftarrow \arg\max_a \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V(s') \right)$; if $a \ne \pi(s)$, then policy-stable $\leftarrow$ false. If policy-stable, then stop and return $V$ and $\pi$; else go to 2.
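
A compact sketch of the whole loop. Unlike the box above, policy evaluation here is done by solving the linear system exactly (which assumes $\gamma < 1$) rather than by iterative sweeps.

```python
import numpy as np

def policy_iteration(T, r, gamma, max_iters=1000):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    n_states = r.shape[0]
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial deterministic policy
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * T_pi) v = r_pi for the current policy.
        T_pi = T[np.arange(n_states), pi]              # (S, S)
        r_pi = r[np.arange(n_states), pi]              # (S,)
        v = np.linalg.solve(np.eye(n_states) - gamma * T_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        q = r + gamma * np.einsum('sap,p->sa', T, v)
        new_pi = np.argmax(q, axis=1)
        if np.array_equal(new_pi, pi):                 # policy stable => optimal
            return pi, v
        pi = new_pi
    return pi, v
```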

26 Iterative Policy Eval for the Small Gridworld The policy $\pi$ is the equiprobable random action policy; $\gamma = 1$ (an undiscounted episodic task). Nonterminal states: 1, 2, ..., 14. Terminal state: one, shown in the shaded square. Actions that would take the agent off the grid leave the state unchanged. Reward is -1 on all transitions until the terminal state is reached. (Figure: $v^{[k]}$ and the corresponding greedy policy for successive $k$; not reproduced here.)

27 Iterative Policy Eval for the Small Gridworld (continued; figure not reproduced).

28 Generalized Policy Iteration Does policy evaluation need to converge to $v_\pi$? Or should we introduce a stopping condition, e.g. $\epsilon$-convergence of the value function, or simply stop after $k$ iterations of iterative policy evaluation? For example, in the small gridworld $k = 3$ was sufficient to achieve the optimal policy. Why not update the policy every iteration, i.e. stop after $k = 1$? This is equivalent to value iteration (next section).

29 Generalized Policy Iteration Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity. (Figure: a geometric metaphor for the convergence of GPI; evaluation drives $V$ toward $v_\pi$, improvement drives $\pi$ toward $\text{greedy}(V)$, and the two processes converge jointly to $v_*$ and $\pi_*$, where $V = v_\pi$ and $\pi = \text{greedy}(V)$.)

30 Principle of Optimality Any optimal policy can be subdivided into two components: an optimal first action $A_*$, followed by an optimal policy from the successor state $S'$. Theorem (Principle of Optimality): a policy $\pi(a \mid s)$ achieves the optimal value from state $s$, $v_\pi(s) = v_*(s)$, if and only if, for any state $s'$ reachable from $s$, $\pi$ achieves the optimal value from state $s'$: $v_\pi(s') = v_*(s')$.

31 Value Iteration If we know the solution to the subproblems $v_*(s')$, then the solution $v_*(s)$ can be found by one-step lookahead: $v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_*(s') \right)$. The idea of value iteration is to apply these updates iteratively. Intuition: start with the final rewards and work backwards. Still works with loopy, stochastic MDPs.

32 Example: Shortest Path (Figure: a gridworld with goal state $g$; successive value-iteration estimates $V_1$ through $V_7$ are shown; not reproduced here.)

33 Value Iteration Problem: find the optimal policy $\pi_*$. Solution: iterative application of the Bellman optimality backup, $v_1 \to v_2 \to \dots \to v_*$, using synchronous backups: at each iteration $k+1$, for all states $s \in \mathcal{S}$, update $v_{k+1}(s)$ from $v_k(s')$.

34 Value Iteration (2) $v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_k(s') \right)$, or in matrix form $v_{k+1} = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a) v_k \right)$ (figure: the corresponding one-step lookahead backup diagram).
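
A sketch of synchronous value iteration in this vectorized form; the stopping tolerance is an illustrative choice. It can be run directly on the gridworld arrays built earlier.

```python
import numpy as np

def value_iteration(T, r, gamma, tol=1e-8, max_iters=100_000):
    """v_{k+1}(s) = max_a [ r(s,a) + gamma * sum_{s'} T(s'|s,a) v_k(s') ]."""
    v = np.zeros(r.shape[0])
    for _ in range(max_iters):
        q = r + gamma * np.einsum('sap,p->sa', T, v)   # one-step lookahead for every (s, a)
        v_new = q.max(axis=1)                          # Bellman optimality backup
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    greedy_pi = q.argmax(axis=1)                       # greedy policy w.r.t. the final values
    return v, greedy_pi
```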

35 Bellman Optimality Backup is a Contraction Define the Bellman optimality backup operator $F_*(v) = \max_{a \in \mathcal{A}} \left( r(a) + \gamma T(a) v \right)$. This operator is a $\gamma$-contraction, i.e. it makes value functions closer by at least $\gamma$ (similar to the previous proof): $\lVert F_*(u) - F_*(v) \rVert_\infty \le \gamma \lVert u - v \rVert_\infty$.

36 Convergence of Value Iteration The Bellman optimality operator $F_*$ has a unique fixed point. $v_*$ is a fixed point of $F_*$ (by the Bellman optimality equation). By the contraction mapping theorem, value iteration converges on $v_*$.

37 Synchronous Dynamic Programming Algorithms
Prediction: Bellman expectation equation → Iterative policy evaluation.
Control: Bellman expectation equation + greedy policy improvement → Policy iteration.
Control: Bellman optimality equation → Value iteration.
These algorithms are based on the state-value function $v_\pi(s)$ or $v_*(s)$, with complexity $O(m n^2)$ per iteration for $m$ actions and $n$ states. They could also apply to the action-value function $q_\pi(s, a)$ or $q_*(s, a)$, with complexity $O(m^2 n^2)$ per iteration.

38 Efficiency of DP Finding an optimal policy is polynomial in the number of states. BUT the number of states is often astronomical, e.g. often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality"). In practice, classical DP can be applied to problems with a few million states.

39 Asynchronous DP All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup. Still requires lots of computation, but it does not get locked into hopelessly long sweeps. Guaranteed to converge if all states continue to be selected. Can you select states to back up intelligently? YES: an agent's experience can act as a guide.

40 Asynchronous Dynamic Programming Three simple ideas for asynchronous dynamic programming: in-place dynamic programming, prioritized sweeping, real-time dynamic programming.

41 In-Place Dynamic Programming Synchronous value iteration stores two copies of the value function: for all $s$ in $\mathcal{S}$: $v_{\text{new}}(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v_{\text{old}}(s') \right)$, then $v_{\text{old}} \leftarrow v_{\text{new}}$. In-place value iteration stores only one copy of the value function: for all $s$ in $\mathcal{S}$: $v(s) \leftarrow \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v(s') \right)$.
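
A sketch of the in-place variant: a single array is read and written within the same sweep, so later states in the sweep already see the updated values of earlier ones.

```python
import numpy as np

def in_place_value_iteration(T, r, gamma, n_sweeps=1000):
    """One copy of the value function; each state update immediately overwrites v(s)."""
    n_states = r.shape[0]
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            v[s] = np.max(r[s] + gamma * T[s] @ v)   # uses freshly updated values of earlier states
        # (a stopping test on the largest change per sweep could be added here)
    return v
```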

42 Prioritized Sweeping Use the magnitude of the Bellman error to guide state selection, e.g. $\left| \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, v(s') \right) - v(s) \right|$. Back up the state with the largest remaining Bellman error, and update the Bellman error of affected states after each backup. Requires knowledge of the reverse dynamics (predecessor states). Can be implemented efficiently by maintaining a priority queue.
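
A sketch of the idea with a priority queue (`heapq` is a min-heap, so priorities are negated). Re-inserting predecessors with fresh priorities after each backup is one simple way to keep the queue up to date, not necessarily the most efficient one.

```python
import heapq
import numpy as np

def prioritized_sweeping_vi(T, r, gamma, n_backups=10_000, theta=1e-8):
    """Value iteration where the state with the largest Bellman error is backed up next."""
    n_states = r.shape[0]
    v = np.zeros(n_states)
    # Predecessors of s2: states s with T(s2|s,a) > 0 for some a (the reverse dynamics).
    preds = [np.nonzero(T[:, :, s2].sum(axis=1))[0] for s2 in range(n_states)]

    def bellman_error(s):
        return abs(np.max(r[s] + gamma * T[s] @ v) - v[s])

    heap = [(-bellman_error(s), s) for s in range(n_states)]
    heapq.heapify(heap)
    for _ in range(n_backups):
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:                        # largest remaining error is tiny: done
            break
        v[s] = np.max(r[s] + gamma * T[s] @ v)      # back up the highest-error state
        for p in preds[s]:                          # re-prioritize states whose backup uses v(s)
            heapq.heappush(heap, (-bellman_error(p), p))
    return v
```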

43 Real-Time Dynamic Programming Idea: back up only states that are relevant to the agent. Use the agent's experience to guide the selection of states. After each time step $(S_t, A_t, R_{t+1})$, back up the state $S_t$: $v(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( r(S_t, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid S_t, a) \, v(s') \right)$.

44 Sample Backups In subsequent lectures we will consider sample backups, using sample rewards and sample transitions $(S, A, R, S')$ instead of the reward function and transition dynamics. Advantages: model-free, i.e. no advance knowledge of the MDP required; breaks the curse of dimensionality through sampling; the cost of a backup is constant, independent of $n = |\mathcal{S}|$.

45 Approximate Dynamic Programming Approximate the value function using a function approximator $\hat{v}(s, \mathbf{w})$ and apply dynamic programming to $\hat{v}(\cdot, \mathbf{w})$. E.g. Fitted Value Iteration repeats at each iteration $k$: sample states $\tilde{\mathcal{S}} \subseteq \mathcal{S}$; for each sampled state $s \in \tilde{\mathcal{S}}$, estimate the target value using the Bellman optimality equation, $\tilde{v}_k(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, \hat{v}(s', \mathbf{w}_k) \right)$; train the next value function $\hat{v}(\cdot, \mathbf{w}_{k+1})$ using the targets $\{\langle s, \tilde{v}_k(s) \rangle\}$.
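
A sketch of fitted value iteration with a linear approximator $\hat{v}(s, \mathbf{w}) = \phi(s)^\top \mathbf{w}$ fit by least squares; the feature matrix `phi` and the sampling scheme are illustrative assumptions, not details from the slide.

```python
import numpy as np

def fitted_value_iteration(T, r, phi, gamma, n_iters=50, n_samples=32, seed=0):
    """Fitted VI: sample states, form Bellman-optimality targets, regress w onto them.

    T:   (S, A, S) transition tensor     r: (S, A) rewards
    phi: (S, d) feature matrix, so v_hat(s, w) = phi[s] @ w
    """
    rng = np.random.default_rng(seed)
    n_states = r.shape[0]
    w = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        states = rng.choice(n_states, size=min(n_samples, n_states), replace=False)
        v_hat = phi @ w                                          # current approximate values
        # Target: max_a [ r(s,a) + gamma * sum_{s'} T(s'|s,a) v_hat(s') ] for each sampled s.
        targets = np.max(r[states] + gamma * T[states] @ v_hat, axis=1)
        # Fit the next weight vector by least squares on the pairs (s, target).
        w, *_ = np.linalg.lstsq(phi[states], targets, rcond=None)
    return w
```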

46 Linear Programming (LP) Recall, at value iteration convergence we have, for all $s \in \mathcal{S}$: $V^*(s) = \max_{a \in \mathcal{A}} \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V^*(s') \right)$. LP formulation to find $V^*$: $\min_V \sum_{s \in \mathcal{S}} \mu_0(s) V(s)$ subject to $V(s) \ge r(s, a) + \gamma \sum_{s' \in \mathcal{S}} T(s' \mid s, a) \, V(s')$ for all $s \in \mathcal{S}, a \in \mathcal{A}$, where $\mu_0$ is a probability distribution over $\mathcal{S}$ with $\mu_0(s) > 0$ for all $s$ in $\mathcal{S}$. Theorem: $V^*$ is the solution to the above LP.
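
A sketch of this primal LP using `scipy.optimize.linprog`: each pair $(s, a)$ contributes one constraint $V(s) \ge r(s,a) + \gamma \sum_{s'} T(s' \mid s, a) V(s')$, rearranged into the $\le$ form that `linprog` expects. The uniform default for $\mu_0$ is an illustrative choice.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, r, gamma, mu0=None):
    """min_V mu0^T V  s.t.  V(s) >= r(s,a) + gamma * sum_{s'} T(s'|s,a) V(s')  for all s, a."""
    n_states, n_actions = r.shape
    mu0 = np.full(n_states, 1.0 / n_states) if mu0 is None else mu0
    # Constraint for each (s, a):  gamma * T(.|s,a)^T V - V(s) <= -r(s,a).
    A_ub = np.zeros((n_states * n_actions, n_states))
    b_ub = np.zeros(n_states * n_actions)
    for s in range(n_states):
        for a in range(n_actions):
            row = s * n_actions + a
            A_ub[row] = gamma * T[s, a]
            A_ub[row, s] -= 1.0
            b_ub[row] = -r[s, a]
    res = linprog(c=mu0, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x                                      # V*, the optimal value function
```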

47 LP con't Let $F$ be the Bellman optimality operator, i.e. $V_{i+1} = F(V_i)$. Then the LP can be written as: $\min_V \mu_0^\top V$ s.t. $V \ge F(V)$. Monotonicity property: if $U \ge V$ then $F(U) \ge F(V)$. Hence, if $V \ge F(V)$, then $F(V) \ge F(F(V))$, and by repeated application, $V \ge F(V) \ge F^2(V) \ge F^3(V) \ge \dots \ge F^\infty(V) = V^*$. Any feasible solution to the LP must satisfy $V \ge F(V)$, and hence must satisfy $V \ge V^*$. Hence, assuming all entries in $\mu_0$ are positive, $V^*$ is the optimal solution to the LP.

48 LP: The Dual $\max_\lambda \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \lambda(s, a) \, r(s, a)$ (Equation 1) s.t. $\forall s' \in \mathcal{S}: \sum_{a' \in \mathcal{A}} \lambda(s', a') = \mu_0(s') + \gamma \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \lambda(s, a) \, T(s' \mid s, a)$ (Equation 2), with $\lambda(s, a) \ge 0$. Interpretation: $\lambda(s, a) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{P}(s_t = s, a_t = a)$, the discounted state-action visitation frequencies. Equation 2 ensures that $\lambda$ has the above meaning; Equation 1 maximizes the expected discounted sum of rewards. Optimal policy: $\pi(s) = \arg\max_a \lambda(s, a)$.
