Introduction to Reinforcement Learning
Part 1: Markov Decision Processes
Rowan McAllister
Reinforcement Learning Reading Group
8 April 2015
Note: I've created these slides whilst following the Algorithms for Reinforcement Learning lectures by Csaba Szepesvári, specifically sections 2.2-2.4. The lectures themselves are available on Professor Szepesvári's homepage: http://www.ualberta.ca/~szepesva/papers/rlalgsinmdps.pdf If you spot any errors, please email me: rtm26 at cam dot ac dot uk
MDP Framework

M = {X, A, P, R, γ}

- X: states, e.g. the Euclidean plane, X = R^2
- A: actions, e.g. movements of 1 meter, A = {north, south, east, west}
- P: transition probability, Pr(X_{t+1} | X_t, A_t), e.g. Pr({0, 1} | {0, 0}, north) = 0.99
- R: reward function, R : X × A → R
- γ: discount factor, prefer rewards now vs later, γ ∈ [0, 1]
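(Not from the lectures: a minimal sketch of how such a finite MDP could be stored as NumPy arrays. The sizes, numbers, and array layout P[a, x, x'], R[x, a] are illustrative assumptions that the later sketches reuse.)

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, encoded as arrays (illustrative only).
n_states, n_actions = 2, 2

# P[a, x, x'] = Pr(X_{t+1} = x' | X_t = x, A_t = a); each row sums to 1.
P = np.array([
    [[0.9, 0.1],    # action 0 taken in state 0
     [0.2, 0.8]],   # action 0 taken in state 1
    [[0.1, 0.9],    # action 1 taken in state 0
     [0.8, 0.2]],   # action 1 taken in state 1
])

# R[x, a] = immediate reward for taking action a in state x.
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

gamma = 0.9  # discount factor
```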
MDP Framework

Markov: conditional independence: P(X_{t+1} | X_{1:t}, A_{1:t}) = P(X_{t+1} | X_t, A_t)
Decision: decide an action to optimise the objective
Process: sequential movements and decisions, time t = 0, 1, 2, ...

[Graphical model: a chain of states X_{t-1}, X_t, X_{t+1} linked by the transition probability P, with an action A_{t-1}, A_t, A_{t+1} and a reward R_{t-1}, R_t, R_{t+1} attached at each time step]
Goal

Maximise the return, where: return = Σ_{t=0}^∞ γ^t R_t

How? We can influence the return by deciding the action A_t at each time step.

Define the policy function: π : X → A
Define the value of a policy: V^π : X → R, V^π(x) = E_P[ return | X_0 = x; π ], ∀x ∈ X
Optimise the policy: π* ∈ arg max_π V^π(x), ∀x ∈ X
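(Again a sketch of my own, not the lectures': the return of a finite reward sequence, and a deterministic tabular policy stored as an array of action indices.)

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * R_t for a finite sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A deterministic tabular policy pi : X -> A is just an array of action indices.
pi = np.array([1, 0])  # e.g. take action 1 in state 0 and action 0 in state 1

print(discounted_return([1.0, 0.0, 4.0], gamma=0.5))  # 1 + 0.5*0 + 0.25*4 = 2.0
```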
Also useful: define the action-value of a policy: Q^π : X × A → R

Q^π(x, a) = E_P[ return | X_0 = x, A_0 = a; π ], ∀x ∈ X, a ∈ A
Bellman Equations (evaluating a fixed policy)

Q^π(x, a) = R(x, a) + γ Σ_{x' ∈ X} P(x' | x, a) V^π(x'), ∀x ∈ X, a ∈ A
V^π(x) = Q^π(x, π(x)) = R(x, π(x)) + γ Σ_{x' ∈ X} P(x' | x, π(x)) V^π(x'), ∀x ∈ X
(the right-hand side defines the operator T^π acting on V^π)

i.e. V^π = T^π V^π

i.e. a linear system of equations: v^π = r + γ P^π v^π, so v^π = (I − γ P^π)^{-1} r
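(Sketch, assuming the array layout introduced earlier; the function name is mine. Because the fixed-policy Bellman equations are linear, V^π follows from a single linear solve.)

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Exact policy evaluation: solve v = (I - gamma * P_pi)^(-1) r_pi."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]   # row x is P(. | x, pi(x))
    r_pi = R[np.arange(n_states), pi]      # r_pi[x] = R(x, pi(x))
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```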
Bellman Optimality Equations (evaluating the optimal policy)

Q*(x, a) := Q^{π*}(x, a) = R(x, a) + γ Σ_{x' ∈ X} P(x' | x, a) V*(x'), ∀x ∈ X, a ∈ A
V*(x) := V^{π*}(x) = max_{a ∈ A} Q*(x, a), ∀x ∈ X
π*(x) = arg max_{a ∈ A} Q*(x, a), ∀x ∈ X

i.e. V*(x) = max_{a ∈ A} [ R(x, a) + γ Σ_{x' ∈ X} P(x' | x, a) V*(x') ], ∀x ∈ X
(the right-hand side defines the operator T* acting on V*)

V* = T* V*
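(Sketch of one application of the T* operator under the same assumed array layout; the names are mine.)

```python
import numpy as np

def bellman_optimality_backup(P, R, gamma, v):
    """One application of T*: for each state, max over actions of the one-step lookahead."""
    # q[x, a] = R(x, a) + gamma * sum_x' P(x' | x, a) * v(x')
    q = R + gamma * np.einsum('axy,y->xa', P, v)
    return q.max(axis=1), q.argmax(axis=1)  # greedy value and greedy action per state
```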
Value Iteration

V_{k+1} = T* V_k
Value Iteration: worked example (3×3 grid world, γ = 0.5)

[Figure: value grids after each sweep, shown alongside the greedy policy arrows, which are not reproducible here; "?" marked cells whose greedy action was not yet determined]

Value k=0:  16 0 0 / 0 0 4 / 0 0 0
Value k=1:  16 8 2 / 8 2 4 / 0 0 2
Value k=2:  16 8 4 / 8 4 4 / 4 1 2
Value k=3:  16 8 4 / 8 4 4 / 4 2 2
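(Sketch of the full value-iteration loop V_{k+1} = T* V_k, with an illustrative stopping tolerance; not code from the lectures.)

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Iterate V_{k+1} = T* V_k until successive value functions differ by less than tol."""
    n_states = R.shape[0]
    v = np.zeros(n_states)                              # V_0 = 0 everywhere
    for _ in range(max_iters):
        q = R + gamma * np.einsum('axy,y->xa', P, v)    # Q_k(x, a)
        v_next = q.max(axis=1)                          # V_{k+1} = T* V_k
        if np.max(np.abs(v_next - v)) < tol:            # (approximate) convergence
            v = v_next
            break
        v = v_next
    return v, q.argmax(axis=1)                          # value and greedy policy
```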
Policy Iteration

Initialise a random policy π_0, k ← 0
WHILE π_k not converged:
  1. Compute the associated action values Q^{π_k} (policy evaluation)
  2. Update the policy greedily w.r.t. Q^{π_k} (policy improvement): π_{k+1}(x) ← arg max_{a ∈ A} Q^{π_k}(x, a), ∀x ∈ X
  3. k ← k + 1
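(A sketch of this loop, reusing exact policy evaluation for step 1 and greedy improvement for step 2; array conventions and names as in the earlier sketches.)

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iters=1_000):
    """Alternate exact policy evaluation and greedy improvement until the policy stops changing."""
    n_states = R.shape[0]
    pi = np.zeros(n_states, dtype=int)                  # pi_0: arbitrary initial policy
    for _ in range(max_iters):
        # 1. Policy evaluation: v = (I - gamma * P_pi)^(-1) r_pi
        P_pi = P[pi, np.arange(n_states), :]
        r_pi = R[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # 2. Policy improvement: act greedily w.r.t. Q^{pi_k}
        q = R + gamma * np.einsum('axy,y->xa', P, v)
        pi_next = q.argmax(axis=1)
        if np.array_equal(pi_next, pi):                 # converged: pi_{k+1} = pi_k
            break
        pi = pi_next
    return pi, v
```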
Policy Iteration: worked example (same 3×3 grid world, γ = 0.5)

[Figure: alternating policy-improvement and policy-evaluation steps; the policy arrows from the original figures are not reproducible here]

Value k=0:  16 1 2 / 0 0 4 / 0 0 0    (shown with policies π_0 and π_1)
Value k=1:  16 8 2 / 8 2 4 / 0.5 1 2  (shown with policies π_1 and π_2)
Value k=2:  16 8 4 / 8 4 4 / 4 1 2    (shown with policies π_2 and π_3)
Value k=3:  16 8 4 / 8 4 4 / 4 2 2    (shown with policy π_3)
Policy Iteration (policy evaluation): v^π = (I − γ P^π)^{-1} r

Transition matrix under policy π_{k=3} (rows FROM state, columns TO state; grey in the original figure indicates a terminal state; states are indexed down the grid columns, with a tenth absorbing state that the terminal cells transition to):

P^π =
0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 1

Rewards: r = (16, 0, 0, 0, 0, 0, 0, 4, 0, 0)^T

Solving v^π = (I − γ P^π)^{-1} r gives the value grid:

16 8 4
8 4 4
4 2 2
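(A sketch that re-does this computation numerically. P^π, r, and γ are the reconstructed slide values; the column-major state indexing and the absorbing tenth state are my reading of the original figure.)

```python
import numpy as np

gamma = 0.5

# Successor state under policy pi_{k=3} for each of the 10 states (0-based),
# states indexed down the grid columns; state 9 is the absorbing terminal state.
successor = [9, 0, 1, 0, 1, 2, 3, 9, 7, 9]
P_pi = np.zeros((10, 10))
P_pi[np.arange(10), successor] = 1.0

r = np.array([16., 0., 0., 0., 0., 0., 0., 4., 0., 0.])

v = np.linalg.solve(np.eye(10) - gamma * P_pi, r)
print(v[:9].reshape(3, 3, order='F'))  # back to the 3x3 grid (column-major)
# [[16.  8.  4.]
#  [ 8.  4.  4.]
#  [ 4.  2.  2.]]
```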