Introduction to Markov Decision Processes

Melvin Carson
5 years ago

1 Introduction to Markov Decision Processes. Fall. Alborz Geramifard, Research Scientist at Amazon.com. *This work was done during my postdoc at MIT.

2 Motivation: Understand the customer's needs over a sequence of interactions. Minimize a notion of accumulated frustration level.

3 Applications

4 Grid World Example. Goal: grab the cookie fast and avoid pits. Noisy movement. Actions: four moves (shown as arrow icons on the slide).

5 Outline: Motivation, Problem Formulation, Solving MDPs, Extensions

6 Markov Decision Process: $(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$. Policy $\pi(s): S \to A$. Interaction: $s_t, r_t \xrightarrow{a_t} s_{t+1}, r_{t+1}$.
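
To make the tuple concrete, here is a minimal sketch of one way to store an MDP in plain Python. The two-state MDP, the names `STATES`, `ACTIONS`, `P`, `is_valid` and `step`, and all the numbers are assumptions made up for illustration; they are not from the slides.

```python
import random

# Hypothetical two-state MDP, invented for illustration only.
# P[s][a] is a list of (probability, next_state, reward) triples,
# so it bundles P^a_{ss'} and R^a_{ss'} from the tuple.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]
GAMMA = 0.9
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.9, "s1", 1.0), (0.1, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(0.9, "s0", 0.0), (0.1, "s1", 0.0)]},
}

def is_valid(P):
    """Check that outgoing probabilities sum to 1 for every (s, a)."""
    return all(abs(sum(p for p, _, _ in P[s][a]) - 1.0) < 1e-9
               for s in P for a in P[s])

def step(s, a, rng=random):
    """Sample (next_state, reward) from P[s][a]."""
    u, acc = rng.random(), 0.0
    for p, s_next, r in P[s][a]:
        acc += p
        if u < acc:
            return s_next, r
    return P[s][a][-1][1:]  # guard against floating-point rounding
```

A policy in this representation is just a dict from state to action, for example `{"s0": "go", "s1": "stay"}`.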

7 Markov Decision Process: $(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$

8 Policy $\pi: S \to A$

9 Assumptions: Fully Observable, Markovian Property

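10 State Values. Trajectory: $s_0 \xrightarrow{a_0, r_1} s_1 \xrightarrow{a_1, r_2} s_2 \cdots$. $Q^{\pi}(s, a) = E_{\pi}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0 = s, a_0 = a\right]$, $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$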
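
The quantity inside the expectation is a discounted sum of rewards along one trajectory. A minimal sketch of that sum for a finite reward sequence follows; the function name `discounted_return` and the sample rewards are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * r_t for t = 1..len(rewards)."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

Averaging this return over many trajectories that start with $(s, a)$ and then follow $\pi$ would give a Monte Carlo estimate of $Q^{\pi}(s, a)$.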

11 Problem: $\pi^* = \operatorname{argmax}_{\pi} V^{\pi}(s), \ \forall s \in S$

12 Outline: Motivation, Problem Formulation, Solving MDPs, Extensions

13 $(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$: assume all elements of the MDP are known.

14 Dynamic Programming. Policy Evaluation: given a fixed policy (π), estimate the value of each state (Q or V). Policy Improvement: given a fixed value function, improve the policy (π). Loop until convergence.

15 Policy Iteration. Policy Evaluation (Q or V): $Q^{\pi}(s, a) = \sum_{s' \in S} P^a_{ss'} \left[ R^a_{ss'} + \gamma Q^{\pi}(s', \pi(s')) \right]$. Solve by formulating as a set of linear equations, in matrix form $Q = R + \gamma P Q$. Costly calculation: $O(|S|^3)$. Policy Improvement: $\pi(s) = \operatorname{argmax}_{a \in A} Q(s, a)$, also known as the greedy policy with respect to Q.
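
The evaluate-then-improve loop can be sketched as below. This is an illustrative sketch under stated assumptions, not the slide's exact method: the two-state MDP is made up, and policy evaluation uses repeated sweeps instead of the $O(|S|^3)$ linear solve, so that the example needs no linear-algebra library.

```python
# Hypothetical two-state MDP; P[s][a] = [(prob, next_state, reward), ...].
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}
GAMMA = 0.9

def q_value(V, s, a):
    """One-step lookahead: sum over s' of P * (R + gamma * V(s'))."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def evaluate(policy, tol=1e-10):
    """Iterative policy evaluation (stands in for the linear solve)."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}
    while True:
        V = evaluate(policy)
        improved = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
        if improved == policy:
            return policy, V
        policy = improved

policy, V = policy_iteration()
```

On this toy MDP the loop stops after two improvement steps with the policy "action 1 in state 0, action 0 in state 1".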

16 Value Iteration. Policy Evaluation: $Q(s, a) \leftarrow \sum_{s' \in S} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q(s', a') \right]$. Improve the value of a single state-action pair at a time. Lower computation: * $O(|S|)$. Policy Improvement: $\pi(s) = \operatorname{argmax}_{a \in A} Q(s, a)$.
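
A minimal sketch of this backup applied to every state-action pair until the values stop changing. The two-state MDP and the names `value_iteration`, `P`, `tol` are assumptions for illustration only.

```python
# Hypothetical two-state MDP; P[s][a] = [(prob, next_state, reward), ...].
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.9, 1, 1.0), (0.1, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma, tol=1e-10):
    """Sweep the max-backup over all (s, a) until the largest change < tol."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    while True:
        delta = 0.0
        for s in P:
            for a in P[s]:
                # Each single backup touches every possible next state.
                target = sum(p * (r + gamma * max(Q[s2].values()))
                             for p, s2, r in P[s][a])
                delta = max(delta, abs(target - Q[s][a]))
                Q[s][a] = target
        if delta < tol:
            break
    # Policy improvement: act greedily with respect to the final Q.
    policy = {s: max(Q[s], key=Q[s].get) for s in Q}
    return Q, policy

Q, policy = value_iteration(P, gamma=0.9)
```

Each individual backup costs one pass over the next states, which is where the per-update $O(|S|)$ figure comes from.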

17 Value Iteration Example, Iteration 0. [Figure: transition model with probabilities 0.9 and 0.1; V(s) and policy per state; labels "= 1" and "Reward = 100"]

18 Value Iteration Example, Iteration 1. [Figure: transition model with probabilities 0.9 and 0.1; V(s) and policy per state; labels "= 1" and "Reward = 100"]

19 Value Iteration Example, Iteration 2. [Figure: transition model with probabilities 0.9 and 0.1; V(s) and policy per state; labels "= 1" and "Reward = 100"]

20 Value Iteration Example, Iteration 3. [Figure: transition model with probabilities 0.9 and 0.1; V(s) and policy per state; labels "= 1" and "Reward = 100"]

21 $(S, A, P^a_{ss'}, R^a_{ss'}, \gamma)$: Not known!

22 Reinforcement Learning. Policy $\pi(s): S \to A$. Interaction: $s_t, r_t \xrightarrow{a_t} s_{t+1}, r_{t+1}$. We only see this: $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots$

23 Reinforcement Learning [B.F. Skinner Foundation]

24 Reinforcement Learning. Unknown: $P^a_{ss'}$, $R^a_{ss'}$. What can we do with only samples? $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots$

25 Value Iteration. Policy Evaluation: $Q(s, a) \leftarrow Q^+(s, a)$, where $Q^+(s, a) = \sum_{s' \in S} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q(s', a') \right]$. Can we build a noisy estimate of $Q^+(s, a)$? Policy Improvement: $\pi(s) = \operatorname{argmax}_{a \in A} Q(s, a)$.

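26 Q-Learning. Policy Evaluation, from one sampled transition $s \xrightarrow{a,\, r} s'$: $Q^+(s, a) = r_t + \gamma \max_{a'} Q(s', a')$; $\delta = Q^+(s, a) - Q(s, a)$; $Q(s, a) = Q(s, a) + \alpha \delta$. Policy Improvement: $\pi(s) = \operatorname{argmax}_a Q(s, a)$ with probability $1 - \epsilon$; UniformRandom(A) with probability $\epsilon$.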
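
A minimal sketch of the tabular Q-learning step and the ε-greedy policy. The function names, the dict-based table keyed by `(state, action)`, and the sample numbers are assumptions for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one Q-learning step in place and return the TD error delta."""
    q_plus = r + gamma * max(Q[(s_next, b)] for b in actions)  # noisy Q+
    delta = q_plus - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta

# Hypothetical transition: from "s0", action "right", reward 1, next "s1".
ACTIONS = ["left", "right"]
Q = defaultdict(float)          # unseen pairs default to 0
Q[("s1", "right")] = 2.0
delta = q_learning_update(Q, "s0", "right", 1.0, "s1", ACTIONS,
                          alpha=0.5, gamma=0.9)
```

Here the target is $1 + 0.9 \cdot 2 = 2.8$, so the entry for `("s0", "right")` moves halfway from 0 to 2.8.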

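27 SARSA. Policy Evaluation, from one sampled transition $s \xrightarrow{a,\, r} s' \xrightarrow{a'}$: $Q^+(s, a) = r_t + \gamma Q(s', a')$; $\delta = Q^+(s, a) - Q(s, a)$; $Q(s, a) = Q(s, a) + \alpha \delta$. Policy Improvement: $\pi(s) = \operatorname{argmax}_a Q(s, a)$ with probability $1 - \epsilon$; UniformRandom(A) with probability $\epsilon$.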
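
The only change from Q-learning is the target: SARSA bootstraps from the action actually chosen in the next state instead of the maximizing one. A minimal sketch, with the function name, table layout, and numbers assumed for illustration:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA step in place and return the TD error delta."""
    q_plus = r + gamma * Q[(s_next, a_next)]  # uses the action taken next
    delta = q_plus - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta

# Hypothetical transition where the next action is the non-greedy "left".
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
delta = sarsa_update(Q, "s0", "right", 1.0, "s1", "left",
                     alpha=0.5, gamma=0.9)
```

Because the exploratory action `"left"` has value 0 here, the target is just the reward 1, whereas a Q-learning target on the same transition would be 2.8.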

28 SARSA Example. Grid world with start S and goal G; States = 95. Rewards: +1 at goal, … per step. γ = 0.98. Transitions: …, 30% noise.

29 What is the main challenge in solving MDPs with a tabular representation of values for every problem? The grid world (S to G) has only 95 states. In practice, state spaces are huge...

30 Huge State Spaces. Dialog Turns: 7. Frustration Level: 10. Possible Sentences: … Caller Gender: 2. Caller Location: … Total: … Billion Parameters.

31 Outline: Motivation, Problem Formulation, Solving MDPs, Extensions

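32 Linear Function Approximation. State $s$ is mapped to features $\phi_1, \ldots, \phi_n$ with weights $\theta_1, \ldots, \theta_n$: $V(s) \approx \phi(s)^{\top} \theta$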

33 Example. [Table with columns: State Feature, Weight, Value; rows include Male and Seattle; V(s) = …] What is the right set of features?

34 Adaptive Tile Coding [Whiteson et al. 2007]

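35 Matrix Form. $\tilde{V} = \begin{bmatrix} \phi(s_1)^{\top} \\ \phi(s_2)^{\top} \\ \vdots \\ \phi(s_{|S|})^{\top} \end{bmatrix} \theta \triangleq \Phi_{|S| \times m}\, \theta_{m \times 1}$, with $V = [V(s_1), \ldots, V(s_{|S|})]^{\top}$. Recall $Q^{\pi}(s, a) = \sum_{s' \in S} P^a_{ss'} \left[ R^a_{ss'} + \gamma Q^{\pi}(s', \pi(s')) \right]$ (solve by formulating as a set of linear equations; costly calculation). Bellman operator: $T(V) \triangleq R + \gamma P V$.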
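
A minimal sketch of the matrix form in plain Python: each row of the feature matrix is one state's feature vector, and the approximate value vector is the row-wise dot product with the weights. The feature matrix `Phi`, the weights `theta`, and the function names are made-up assumptions for illustration.

```python
def linear_value(phi_s, theta):
    """Approximate value of one state: phi(s)^T theta."""
    return sum(f * w for f, w in zip(phi_s, theta))

def value_vector(Phi, theta):
    """Approximate values of all states: Phi theta."""
    return [linear_value(row, theta) for row in Phi]

# Hypothetical example: 3 states, 2 features.
Phi = [[1.0, 0.0],
       [1.0, 1.0],
       [0.0, 1.0]]
theta = [2.0, -1.0]
V_tilde = value_vector(Phi, theta)
print(V_tilde)  # [2.0, 1.0, -1.0]
```

Only the two weights are stored, not one value per state, which is the point of the approximation when the state space is huge.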

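36 Geometric View. Projection onto the span of the features: $\Pi = \Phi (\Phi^{\top} D \Phi)^{-1} \Phi^{\top} D$. [Figure labels: $V$, $\Pi V$, $\tilde{V} = \Phi\theta$, $T(\tilde{V})$, $\Pi T(\tilde{V})$]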
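
The projection corresponds to a weighted least-squares fit: the weights that best represent a value vector $V$ are $\theta = (\Phi^{\top} D \Phi)^{-1} \Phi^{\top} D V$. A minimal sketch, assuming $D$ is a diagonal matrix of state weights and that the feature columns are linearly independent; the helper names and the data are made up for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def project_weights(Phi, d, V):
    """theta = (Phi^T D Phi)^{-1} Phi^T D V, with D = diag(d)."""
    rows, n = len(Phi), len(Phi[0])
    A = [[sum(d[k] * Phi[k][i] * Phi[k][j] for k in range(rows))
          for j in range(n)] for i in range(n)]
    b = [sum(d[k] * Phi[k][i] * V[k] for k in range(rows))
         for i in range(n)]
    return solve(A, b)

# Hypothetical data: V is exactly linear in the features (V = 1 + 2x),
# so the projection recovers it without error.
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
d = [1 / 3, 1 / 3, 1 / 3]
V = [1.0, 3.0, 5.0]
theta = project_weights(Phi, d, V)
```

Multiplying the feature matrix by the returned weights gives $\Pi V$; when $V$ is not in the span of the features, the result is the closest representable vector under the $D$-weighted norm.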

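37 Q-Learning. Policy Evaluation, from one sampled transition $s \xrightarrow{a,\, r} s'$: $Q^+(s, a) = r_t + \gamma \max_{a'} Q(s', a')$; $\delta = Q^+(s, a) - Q(s, a)$; $\theta = \theta + \alpha \delta\, \phi(s, a)$. Policy Improvement: $\pi(s) = \operatorname{argmax}_a Q(s, a)$ with probability $1 - \epsilon$; UniformRandom(A) with probability $\epsilon$.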
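
With linear function approximation the table entry is replaced by a dot product, and the TD error moves the weights along the features of the visited pair. The sketch below assumes $Q(s, a) = \phi(s, a)^{\top} \theta$; the one-hot feature map `phi` and all names and numbers are hypothetical choices for illustration.

```python
def q_hat(theta, phi_sa):
    """Approximate Q(s, a) as the dot product phi(s, a)^T theta."""
    return sum(w * f for w, f in zip(theta, phi_sa))

def q_learning_fa_update(theta, phi, s, a, r, s_next, actions, alpha, gamma):
    """Return new weights after one Q-learning step with linear features."""
    q_plus = r + gamma * max(q_hat(theta, phi(s_next, b)) for b in actions)
    delta = q_plus - q_hat(theta, phi(s, a))
    return [w + alpha * delta * f for w, f in zip(theta, phi(s, a))]

def phi(s, a):
    """Hypothetical one-hot features over 2 states x 2 actions."""
    features = [0.0] * 4
    features[2 * s + a] = 1.0
    return features

ACTIONS = [0, 1]
theta = [0.0, 0.0, 0.0, 2.0]   # so Q(1, 1) = 2, all other pairs 0
theta = q_learning_fa_update(theta, phi, s=0, a=1, r=1.0, s_next=1,
                             actions=ACTIONS, alpha=0.5, gamma=0.9)
```

With one-hot features this reduces to the tabular update; with shared features, one update changes the estimate for every state-action pair that has a nonzero feature in common with the visited one.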

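38 SARSA. Policy Evaluation, from one sampled transition $s \xrightarrow{a,\, r} s' \xrightarrow{a'}$: $Q^+(s, a) = r_t + \gamma Q(s', a')$; $\delta = Q^+(s, a) - Q(s, a)$; $\theta = \theta + \alpha \delta\, \phi(s, a)$. Policy Improvement: $\pi(s) = \operatorname{argmax}_a Q(s, a)$ with probability $1 - \epsilon$; UniformRandom(A) with probability $\epsilon$.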

Planning in Markov Decision Processes

Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov