Lecture 3: Markov Decision Processes

Size: px

Start display at page:

Download "Lecture 3: Markov Decision Processes"

Elvin Foster
6 years ago
Views:

1 Lecture 3: Markov Decision Processes Joseph Modayil

2 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs

3 Markov Processes Introduction Introduction to MDPs Markov decision processes formally describe an environment for reinforcement learning Conventionally where the environment is fully observable i.e. The current state completely characterizes the process Almost all RL problems can be formalized as MDPs, e.g. Optimal control primarily deals with continuous MDPs Partially observable problems can be converted into MDPs Bandits are MDPs with one state References: Sutton & Barto Chapter 3. Slides:

4 Markov Processes Notation Statistical notation in Reinforcement Learning A variety of statistical notation has been used in earlier RL textbooks and earlier versions of this course. We will move towards more standard notation. Lowercase letters are used for states s and functions f. Uppercase letters are used for random variables, S t, and these are often indexed by time. This is not the case yet for the figures, so for now assume V π = v π (and similar), particularly in figures. There will be some typos in the slides. Also be aware of this when combining information from different sources. An expectation E [G t ] is often made over future trajectories. A probability distribution is denoted by P [].

5 Markov Processes Markov Property Markov Property The future is independent of the past given the present Consider a sequence of random states, {S t } t N, indexed by time. Definition A random state S t has the Markov property if and only if s, s S P [ S t+1 = s S t = s ] = P [ S t+1 = s S 1,..., S t = s ] for all histories (all instantiations of S k for k < t). The state captures all relevant information from the history Once the state is known, the history may be thrown away i.e. The state is a sufficient statistic of the future

6 Markov Processes Markov Property State Transition Matrix For random states with the Markov property, the state transition probability is defined by P ss = P [ S t+1 = s S t = s ] State transition matrix P defines transition probabilities from all states s to all successor states s, P = from to P P 1n. P P nn where each row of the matrix sums to 1.

7 Markov Processes Markov Chains Markov Process A Markov process is a memoryless random process, i.e. a sequence of random states S 1, S 2,... with the Markov property. Definition A Markov Process (or Markov Chain) is a tuple S, P S is a (finite) set of states P is a state transition probability matrix, P ss = P [S t+1 = s S t = s]

8 Markov Processes Markov Chains Example: Student Markov Chain Facebook Sleep Class Class 2 Class Pass 0.2 Pub

9 Markov Processes Markov Chains Example: Student Markov Chain Episodes Sample episodes for Student Markov Chain starting from S 1 = C1 0.9 Facebook Sleep S 1, S 2,..., S T Class Class 2 Class Pass C1 C2 C3 Pass Sleep C1 FB FB C1 C2 Sleep C1 C2 C3 Pub C2 C3 Pass Sleep 0.2 Pub C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

10 Markov Processes Markov Chains Example: Student Markov Chain Transition Matrix 0.9 Facebook Class 1 Sleep Class 2 Class Pass C1 C2 C3 Pass Pub FB Sleep C C C3 0.6 P = Pass 1.0 Pub 0.2 FB Sleep Pub

11 Markov Reward Processes MRP Markov Reward Process A Markov reward process is a Markov chain with values. Definition A Markov Reward Process is a tuple S, P, R, γ S is a finite set of states P is a state transition probability matrix, P ss = P [S t+1 = s S t = s] R is a (expected) reward function, R s = E [R t+1 S t = s] γ is a discount factor, γ [0, 1]

12 Markov Reward Processes MRP Example: Student MRP 0.9 Facebook Sleep 0.1 r = -1 r = Class Class Class 3 r = -2 r = -2 r = Pass r = Pub r = +1

13 Markov Reward Processes Return Return Definition The return G t is the total discounted reward from time-step t. G t = R t+1 + γr t = γ k R t+k+1 k=0 The discount γ [0, 1] is the present value of future rewards The value of receiving reward R after k + 1 time-steps is γ k R. This values immediate reward above delayed reward. γ close to 0 leads to myopic evaluation γ close to 1 leads to far-sighted evaluation

14 Markov Reward Processes Return Why discount? Most Markov reward and decision processes are discounted. Why? Mathematically convenient to discount rewards Avoids infinite returns in cyclic Markov processes Uncertainty about the future may not be fully represented If the reward is financial, immediate rewards may earn more interest than delayed rewards Animal/human behaviour shows preference for immediate reward It sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

15 Markov Reward Processes Value Function Value Function The value function v(s) gives the long-term value of state s Definition The state value function v(s) of an MRP is the expected return starting from state s v(s) = E [G t S t = s]

16 Markov Reward Processes Value Function Example: Student Markov Chain Returns Sample returns for Student Markov Chain: Starting from S 1 = C1 with γ = 1 2 G 1 = R 1 + γr γ T 1 R T C1 C2 C3 Pass Sleep G 1 = = 2.25 C1 FB FB C1 C2 Sleep G 1 = = C1 C2 C3 Pub C2 C3 Pass Sleep G 1 = = 3.41 C1 FB FB C1 C2 C3 Pub C1... G 1 = = 3.20 FB FB FB C1 C2 C3 Pub C2 Sleep

17 Markov Reward Processes Value Function Example: State-Value Function for Student MRP (1) V(s) for γ = r = -1 r = r = -2 r = -2 r = r = r = +1

18 Markov Reward Processes Value Function Example: State-Value Function for Student MRP (2) V(s) for γ = r = -1 r = r = -2 r = -2 r = r = r = +1

19 Markov Reward Processes Value Function Example: State-Value Function for Student MRP (3) V(s) for γ = r = -1 r = r = -2 r = -2 r = r = r = +1

20 Markov Reward Processes Bellman Equation Bellman Equation for MRPs The value function can be decomposed into two parts: immediate reward r discounted value of successor state γv(s ) v(s) = E [G t S t = s] = E [ R t+1 + γr t+2 + γ 2 R t S t = s ] = E [R t+1 + γ (R t+2 + γr t ) S t = s] = E [R t+1 + γg t+1 S t = s] = E [R t+1 + γv(s t+1 ) S t = s]

21 Markov Reward Processes Bellman Equation Bellman Equation for MRPs (2) v(s) = E [R t+1 + γv(s t+1 ) S t = s] s V(s) r s' V(s') v(s) = R s + γ s S P ss v(s )

22 Markov Reward Processes Bellman Equation Example: Bellman Equation for Student MRP 4.3 = *10 + * r = -1 r = r = -2 r = -2 r = r = r = +1

23 Markov Reward Processes Bellman Equation Bellman Equation in Matrix Form The Bellman equation can be expressed concisely using matrices, v = R + γpv where v is a column vector with one entry per state v(1) R 1 P P 1n v(1). =. + γ.. v(n) R n P n1... P nn v(n)

24 Markov Reward Processes Bellman Equation Solving the Bellman Equation The Bellman equation is a linear equation It can be solved directly: (I γp) v = R v = R + γpv v = (I γp) 1 R Computational complexity is O(n 3 ) for n states Direct solution only possible for small MRPs There are many iterative methods for large MRPs, e.g. Dynamic programming Monte-Carlo evaluation Temporal-Difference learning

25 Markov Decision Processes MDP Markov Decision Process A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all random states are Markov. Definition A Markov Decision Process is a tuple S, A, P, R, γ S is a finite set of states A is a finite set of actions P is a state transition probability matrix, P a ss = P [S t+1 = s S t = s, A t = a] R is a reward function, R a s = E [R t+1 S t = s, A t = a] γ is a discount factor γ [0, 1].

26 Markov Decision Processes MDP Markov Decision Process: New notation A more precise way to describe the state and reward dynamics of a MDP is to give the probability for each possible next state and reward. p(s, r s, a) P [ S t+1 = s, R t+1 = r S t = s, A t = a ] We can then write the state transistion probabilities as P a ss P [ S t+1 = s S t = s, A t = a ] = r R p(s, r s, a). We can then write the state transistion probabilities as R a s E [R t+1 = r S t = s, A t = a] = r R r s S p(s, r s, a).

27 Markov Decision Processes MDP Example: Student MDP Facebook r = -1 Quit r = 0 Facebook Sleep r = -1 r = 0 Study Study r = -2 r = -2 Study r = Pub r = +1

28 Markov Decision Processes Policies Policies (1) Definition A policy π is a distribution over actions given states, π(s, a) = P [A t = a S t = s] A policy fully defines the behaviour of an agent MDP policies depend on the current state (not the history) i.e. Policies are stationary (time-independent), A t π(s t, ), t > 0 The policy π(s, a) is sometimes written as π(a s)

29 Markov Decision Processes Policies Policies (2) Given an MDP M = S, A, P, R, γ and a policy π The state sequence S 1, S 2,... is a Markov process S, P π The state and reward sequence S 1, R 2, S 2,... is a Markov reward process S, P π, R π, γ where P π s,s = a A π(s, a)p a s,s R π s = a A π(s, a)r a s

30 Markov Decision Processes Value Functions Policy Conditional Expectations Definition The expectation of a random variable F t conditional on π being used to select future actions is written as E π [F t ].

31 Markov Decision Processes Value Functions Value Function Definition The state-value function v π (s) of an MDP is the expected return starting from state s, and then following policy π v π (s) = E π [G t S t = s] Definition The action-value function q π (s, a) is the expected return starting from state s, taking action a, and then following policy π q π (s, a) = E π [G t S t = s, A t = a]

32 Markov Decision Processes Value Functions Example: State-Value Function for Student MDP Facebook r = -1 V π (s) for π(s,a)=0.5, γ = Quit r = 0 Facebook Sleep r = -1 r = Study 2.7 Study 7.4 r = -2 r = -2 Study r = Pub r = +1

33 Markov Decision Processes Bellman Expectation Equation Bellman Expectation Equation The state-value function can again be decomposed into immediate reward plus discounted value of successor state, v π (s) = E π [R t+1 + γv π (S t+1 ) S t = s] The action-value function can similarly be decomposed, q π (s, a) = E π [R t+1 + γq π (S t+1, A t+1 ) S t = s, A t = a]

34 Markov Decision Processes Bellman Expectation Equation Bellman Expectation Equation for v π s V! (s) a Q! (s,a) v π (s) = a A π(s, a)q π (s, a)

35 Markov Decision Processes Bellman Expectation Equation Bellman Expectation Equation for q π s,a Q! (s,a) r s' V! (s') q π (s, a) = R a s + γ s S P a ss v π (s )

36 Markov Decision Processes Bellman Expectation Equation Bellman Expectation Equation for v π (2) s V! (s) a r s' V! (s') v π (s) = ( π(s, a) R a s + γ ) Pss a v π (s ) a A s S

37 Markov Decision Processes Bellman Expectation Equation Bellman Expectation Equation for q π (2) s,a Q! (s,a) r s' a' Q! (s',a') q π (s, a) = R a s + γ s S P a ss π(s, a )q π (s, a ) a A

38 Markov Decision Processes Bellman Expectation Equation Example: Bellman Expectation Equation in Student MDP Facebook r = = 0.5 * ( * * * 7.4) * Quit r = 0 Facebook Sleep r = -1 r = Study 2.7 Study 7.4 r = -2 r = -2 Study r = Pub r = +1

39 Markov Decision Processes Bellman Expectation Equation Bellman Expectation Equation (Matrix Form) The Bellman expectation equation can be expressed concisely using the induced MRP, with direct solution v π = R π + γp π v π v π = (I γp π ) 1 R π

40 Markov Decision Processes Optimal Value Functions Optimal Value Function Definition The optimal state-value function v (s) is the maximum value function over all policies v (s) = max π v π (s) The optimal action-value function q (s, a) is the maximum action-value function over all policies q (s, a) = max π q π (s, a) The optimal value function specifies the best possible performance in the MDP. An MDP is solved when we know the optimal value fn.

41 Markov Decision Processes Optimal Value Functions Example: Optimal Value Function for Student MDP Facebook r = -1 V*(s) for γ =1 6 0 Quit r = 0 Facebook Sleep r = -1 r = 0 6 Study Study 8 r = -2 r = Study r = +10 Pub r =

42 Markov Decision Processes Optimal Value Functions Example: Optimal Action-Value Function for Student MDP Facebook r = -1 Q* =5 Q*(s,a) for γ =1 6 0 Quit r = 0 Q* =6 Facebook Sleep r = -1 r = 0 Q* =5 Q* =0 6 Study Study 8 r = -2 r = Q* =6 Q* =8 Study r = +10 Q* =10 Pub r = Q* =8.4

43 Markov Decision Processes Optimal Value Functions Optimal Policy Define a partial ordering over policies π π if v π (s) v π (s), s Theorem For any Markov Decision Process There exists an optimal policy π that is better than or equal to all other policies, π π, π All optimal policies achieve the optimal value function, v π (s) = v (s) All optimal policies achieve the optimal action-value function, q π (s, a) = q (s, a)

44 Markov Decision Processes Optimal Value Functions Finding an Optimal Policy An optimal policy can be found by maximising over q (s, a), { 1 if a = argmax q (s, a) π (s, a) = a A 0 otherwise There is always a deterministic optimal policy for any MDP If we know q (s, a), we immediately have the optimal policy There can be multiple optimal policies

45 Markov Decision Processes Optimal Value Functions Example: Optimal Policy for Student MDP Facebook r = -1 Q* =5 π*(s,a) for γ =1 6 0 Quit r = 0 Q* =6 Facebook Sleep r = -1 r = 0 Q* =5 Q* =0 6 Study Study 8 r = -2 r = Q* =6 Q* =8 Study r = +10 Q* =10 Pub r = Q* =8.4

46 Markov Decision Processes Bellman Optimality Equation Bellman Optimality Equation for v The optimal value functions are recursively related by the Bellman optimality equations: s V * (s) a Q * (s,a) v (s) = max a q (s, a)

47 Markov Decision Processes Bellman Optimality Equation Bellman Optimality Equation for q s,a Q * (s,a) r s' V * (s') q (s, a) = R a s + γ s S P a ss v (s )

48 Markov Decision Processes Bellman Optimality Equation Bellman Optimality Equation for v (2) s V * (s) a r s' V * (s') v (s) = max a R a s + γ s S P a ss v (s )

49 Markov Decision Processes Bellman Optimality Equation Bellman Optimality Equation for q (2) s,a Q * (s,a) r s' a' Q * (s',a') q (s, a) = R a s + γ s S P a ss max a v (s, a )

50 Markov Decision Processes Bellman Optimality Equation Example: Bellman Optimality Equation in Student MDP Facebook r = -1 6 = max {-2 + 8, } 6 0 Quit r = 0 Facebook Sleep r = -1 r = 0 6 Study Study 8 r = -2 r = Study r = +10 Pub r =

51 Markov Decision Processes Bellman Optimality Equation Solving the Bellman Optimality Equation Bellman Optimality Equation is non-linear No closed form solution (in general) Many iterative solution methods Value Iteration Policy Iteration Q-learning Sarsa

52 Extensions to MDPs Extensions to MDPs Infinite and continuous MDPs Partially observable MDPs Undiscounted, average reward MDPs

53 Extensions to MDPs Infinite MDPs Infinite MDPs The following extensions are all possible: Countably infinite state and/or action spaces Straightforward Continuous state and/or action spaces Closed form for linear quadratic model (LQR) Continuous time Requires partial differential equations Hamilton-Jacobi-Bellman (HJB) equation Limiting case of Bellman equation as time-step 0

54 Extensions to MDPs Partially Observable MDPs POMDPs A POMDP is an MDP with hidden states. It is a hidden Markov model with actions. Definition A Partially Observable Markov Decision Process is a tuple S, A, O, P, R, Z, γ S is a finite set of states A is a finite set of actions O is a finite set of observations P is a state transition probability matrix, P a ss = P [S t+1 = s S t = s, A t = a] R is a reward function, R a s = E [R t+1 = r S t = s, A t = a] Z is an observation function, Z a s = P [O t+1 = o S t = s, A t = a] γ is a discount factor γ [0, 1].

55 Extensions to MDPs Partially Observable MDPs Belief States Definition A history H t is a sequence of actions, rewards, and observations H t = A 1, R 2, O 2,..., A t 1, R t, O t Definition A belief state b(h t ) is a probability distribution over states, conditioned on the history h t b(h t ) = (P [ S t = s 1 H t ],..., P [St = s n H t ])

56 Extensions to MDPs Partially Observable MDPs Reductions of POMDPs The history H t satisfies the Markov property The belief state b(h t ) satisfies the Markov property History tree Belief tree P(s) a 1 a 2 a 1 a 2 a 1 a 2 o 1 o 2 o 1 o 2 P(s a 1 ) P(s a 2 ) o 1 o 2 o 1 o 2 a 1 o 1 a 1 o 2 a 2 o 1 a 2 o 2 P(s a 1 o 1 ) P(s a 1 o 2 ) P(s a 2 o 1 ) P(s a 2 o 2 ) a 1 a 2 a 1 o 1 a 1 a 1 o 1 a a 1 a 2 P(s a 1 o 1 a 1 ) P(s a 1 o 1 a 2 ) A POMDP can be reduced to an (infinite) history tree A POMDP can be reduced to an (infinite) belief state tree

57 Extensions to MDPs Average Reward MDPs Ergodic Markov Process An ergodic Markov process is Theorem Recurrent: each state is visited an infinite number of times Aperiodic: each state is visited without any systematic period An ergodic Markov process has a limiting stationary distribution d π (s) with the property d π (s) = s S P k ss dπ (s )

58 Extensions to MDPs Average Reward MDPs Ergodic MDP Definition An MDP is ergodic if the Markov chain induced by any policy is ergodic. For any policy π, an ergodic MDP has an average reward per time-step ρ π that is independent of start state. ρ π = lim T [ T ] 1 T E π R t t=1

59 Extensions to MDPs Average Reward MDPs Average Reward Value Function The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward. ṽ π (s) is the extra reward due to starting from state s, [ ] ṽ π (s) = E π (R t+k ρ π ) S t = s k=1 There is a corresponding average reward Bellman equation, ] ṽ π (s) = E π [(R t+1 ρ π ) + (R t+k+1 ρ π ) S t = s k=1 = E π [(R t+1 ρ π ) + ṽ π (S t+1 ) S t = s]

60 Extensions to MDPs Average Reward MDPs Questions? The only stupid question is the one you were afraid to ask but never did. -Rich Sutton

Reinforcement Learning and Deep Reinforcement Learning

Reinforcement Learning and Deep Reinforcement Learning Ashis Kumer Biswas, Ph.D. ashis.biswas@ucdenver.edu Deep Learning November 5, 2018 1 / 64 Outlines 1 Principles of Reinforcement Learning 2 The Q