Introduction to Reinforcement Learning Part 1: Markov Decision Processes

1 Introduction to Reinforcement Learning Part 1: Markov Decision Processes. Rowan McAllister, Reinforcement Learning Reading Group, 8 April 2015

2 Note: I've created these slides whilst following the Algorithms for Reinforcement Learning lectures by Csaba Szepesvári, specifically sections [...]. The lectures themselves are available on Professor Szepesvári's homepage: [...]. Any errors, please email me: rtm26 at cam dot ac dot uk

3-8 MDP Framework

\[ M = \{\, \underbrace{X}_{\text{states}},\ \underbrace{A}_{\text{actions}},\ \underbrace{P}_{\text{transition probability}},\ \underbrace{R}_{\text{reward function}},\ \underbrace{\gamma}_{\text{discount factor}} \,\} \]

X: e.g. Euclidean plane, $X = \mathbb{R}^2$
A: e.g. movement by 1 meter, $A = \{\text{north}, \text{south}, \text{east}, \text{west}\}$
P: $\Pr(X_{t+1} \mid X_t, A_t)$, e.g. $\Pr(\{0, 1\} \mid \{0, 0\}, \text{north}) = 0.99$
R: $R : X \times A \to \mathbb{R}$
γ: prefer rewards now vs later, $\gamma \in [0, 1]$
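To make the five components concrete, here is a minimal sketch of an MDP as a plain Python container. The class name `MDP`, the method `is_valid`, and the two-cell world are all hypothetical; only the 0.99 probability of moving north from (0, 0) to (0, 1) comes from the slide, and every other number is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Container for the five components M = {X, A, P, R, gamma}."""
    states: list      # X
    actions: list     # A
    P: dict           # P[(x, a)] -> {x_next: probability}
    R: dict           # R[(x, a)] -> immediate reward
    gamma: float      # discount factor in [0, 1]

    def is_valid(self):
        """Check gamma is in [0, 1] and each P[(x, a)] is a distribution."""
        if not 0.0 <= self.gamma <= 1.0:
            return False
        for x in self.states:
            for a in self.actions:
                probs = self.P[(x, a)].values()
                if any(p < 0.0 for p in probs):
                    return False
                if abs(sum(probs) - 1.0) > 1e-9:
                    return False
        return True


# Hypothetical two-cell world: only Pr((0, 1) | (0, 0), north) = 0.99 is
# taken from the slides; all other entries are assumptions.
a, b = (0, 0), (0, 1)
mdp = MDP(
    states=[a, b],
    actions=["north", "south"],
    P={
        (a, "north"): {b: 0.99, a: 0.01},
        (a, "south"): {a: 1.0},
        (b, "north"): {b: 1.0},
        (b, "south"): {a: 0.99, b: 0.01},
    },
    R={(a, "north"): 0.0, (a, "south"): 0.0,
       (b, "north"): 0.0, (b, "south"): 0.0},
    gamma=0.5,
)
```

Storing P as a dictionary keyed by (state, action) keeps the sketch close to the notation $\Pr(X_{t+1} \mid X_t, A_t)$; a dense array would be the more usual choice for larger finite problems.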

9-10 MDP Framework

Markov: conditional independence, $P(X_{t+1} \mid X_{1:t}, A_{1:t}) = P(X_{t+1} \mid X_t, A_t)$
Decision: decide action to optimise objective
Process: sequential movements and decisions, time t = 0, 1, 2, ...

[Figure: graphical model over time steps t-1, t, t+1. States $X_{t-1}, X_t, X_{t+1}$ are linked by transitions P, with actions $A_{t-1}, A_t, A_{t+1}$ and rewards $R_{t-1}, R_t, R_{t+1}$.]
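The Markov property means a simulator only needs the current state and action to draw the next state. A minimal sketch, assuming a hypothetical three-state chain (states 0, 1, 2 with state 2 terminal) that is not taken from the slides; the function name `rollout` and all numbers are illustrative:

```python
import random

# Hypothetical chain: P[(x, a)] -> {x_next: probability}, R[(x, a)] -> reward.
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
}
R = {(0, "left"): 0.0, (0, "right"): 0.0,
     (1, "left"): 0.0, (1, "right"): 4.0}
TERMINAL = {2}


def rollout(policy, x0, max_steps=100, rng=None):
    """Simulate one episode; each step uses only the current (x, a)."""
    rng = rng or random.Random(0)
    x, states, rewards = x0, [x0], []
    for _ in range(max_steps):
        if x in TERMINAL:
            break
        a = policy[x]                        # Decision
        rewards.append(R[(x, a)])
        next_states = list(P[(x, a)].keys())
        weights = list(P[(x, a)].values())
        # Markov: the draw depends on (x, a) only, not on earlier history.
        x = rng.choices(next_states, weights=weights, k=1)[0]
        states.append(x)                     # Process: t advances by one
    return states, rewards


states, rewards = rollout({0: "right", 1: "right"}, x0=0)
```

Because this chain is deterministic, the "always right" policy visits 0, 1, 2 in order and collects rewards 0 and 4.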

11-14 Goal

Maximise return, where:

\[ \text{return} = \sum_{t=0}^{\infty} \gamma^t R_t \]

How? Can influence the return by deciding actions $A_t$ each time step.

Define policy function: $\pi : X \to A$

Define value of policy: $V^\pi : X \to \mathbb{R}$, with
\[ V^\pi(x) = \mathbb{E}_P[\, \text{return} \mid X_0 = x;\ \pi \,], \quad \forall x \in X \]

Optimise policy:
\[ \pi^* \in \arg\max_\pi V^\pi(x), \quad \forall x \in X \]
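As a small worked example of the return, the sketch below sums a finite reward sequence with discounting. Truncating the infinite sum to a finite list is an assumption made here (it is exact when all later rewards are zero), and the function name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * rewards[t] for a finite reward list."""
    total = 0.0
    # Work backwards: G_t = R_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


g_short = discounted_return([0.0, 4.0], gamma=0.5)       # 0 + 0.5 * 4
g_ones = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25
```

With γ = 0.5 a reward of 4 received one step later is worth 2 now, which is the sense in which γ expresses a preference for rewards now versus later.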

15 Also useful

Define action-value of policy: $Q^\pi : X \times A \to \mathbb{R}$, with
\[ Q^\pi(x, a) = \mathbb{E}_P[\, \text{return} \mid X_0 = x,\ A_0 = a;\ \pi \,], \quad \forall x \in X,\ a \in A \]

16-18 Bellman Equations (evaluating a fixed policy)

\[ Q^\pi(x, a) = R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^\pi(x'), \quad \forall x \in X,\ a \in A \]

\[ V^\pi(x) = Q^\pi(x, \pi(x)) = \underbrace{R(x, \pi(x)) + \gamma \sum_{x' \in X} P(x' \mid x, \pi(x))\, V^\pi(x')}_{T^\pi \text{ operator on } V^\pi}, \quad \forall x \in X \]

i.e.:
\[ V^\pi = T^\pi V^\pi \]

i.e. a linear system of equations:
\[ v^\pi = r + \gamma P^\pi v^\pi \quad \Longrightarrow \quad v^\pi = (I - \gamma P^\pi)^{-1} r \]
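For a finite state space the linear system can be solved directly. The sketch below is one way to do it in plain Python, assuming a hypothetical three-state chain under an "always move right" policy; the helper names `solve_linear` and `evaluate_policy`, the matrix `P_pi`, and the reward vector `r` are illustrative and not taken from the slides. It solves $(I - \gamma P^\pi) v^\pi = r$ by Gauss-Jordan elimination rather than forming the inverse explicitly.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b]
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for i in range(n):
            if i != col:
                f = M[i][col]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[col])]
    return [M[i][n] for i in range(n)]


def evaluate_policy(P_pi, r, gamma):
    """Return v_pi solving (I - gamma * P_pi) v = r."""
    n = len(r)
    A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, r)


# Hypothetical chain 0 -> 1 -> 2, with state 2 absorbing and reward-free.
# Rows are FROM state, columns are TO state.
P_pi = [[0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]
r = [0.0, 4.0, 0.0]       # reward for the action the policy takes in each state
v_pi = evaluate_policy(P_pi, r, gamma=0.5)
```

Here the solution is v = [2, 4, 0]: state 1 earns 4 immediately, and state 0 earns that same 4 one step later, discounted by γ = 0.5.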

19-20 Bellman Optimality Equations (evaluating the optimal policy)

\[ Q^*(x, a) \doteq Q^{\pi^*}(x, a) = R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x'), \quad \forall x \in X,\ a \in A \]

\[ V^*(x) \doteq V^{\pi^*}(x) = \max_{a \in A} Q^*(x, a), \quad \forall x \in X \]

\[ \pi^*(x) = \arg\max_{a \in A} Q^*(x, a), \quad \forall x \in X \]

i.e.:
\[ V^*(x) = \underbrace{\max_{a \in A} \Big[ R(x, a) + \gamma \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \Big]}_{T^* \text{ operator on } V^*}, \quad \forall x \in X \]

\[ V^* = T^* V^* \]
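The fixed-point statement $V^* = T^* V^*$ can be checked numerically on a small example. This is a sketch on a hypothetical three-state chain (not from the slides); the names `q_from_v`, `bellman_optimality_backup` and `greedy_policy` are illustrative, and the candidate `V_star` was worked out by hand for this chain.

```python
GAMMA = 0.5
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
# Hypothetical chain; state 2 is absorbing with zero reward.
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
R = {(0, "left"): 0.0, (0, "right"): 0.0,
     (1, "left"): 0.0, (1, "right"): 4.0,
     (2, "left"): 0.0, (2, "right"): 0.0}


def q_from_v(V):
    """Q(x, a) = R(x, a) + gamma * sum_x' P(x' | x, a) V(x')."""
    return {(x, a): R[(x, a)]
            + GAMMA * sum(p * V[x2] for x2, p in P[(x, a)].items())
            for x in STATES for a in ACTIONS}


def bellman_optimality_backup(V):
    """Apply T*: (T* V)(x) = max_a Q(x, a)."""
    Q = q_from_v(V)
    return {x: max(Q[(x, a)] for a in ACTIONS) for x in STATES}


def greedy_policy(V):
    """pi(x) = argmax_a Q(x, a); ties go to the first action listed."""
    Q = q_from_v(V)
    return {x: max(ACTIONS, key=lambda a: Q[(x, a)]) for x in STATES}


V_star = {0: 2.0, 1: 4.0, 2: 0.0}
is_fixed_point = bellman_optimality_backup(V_star) == V_star
pi_star = greedy_policy(V_star)
```

Applying the backup to `V_star` returns it unchanged, and the greedy policy moves right in states 0 and 1. The action chosen in the absorbing state 2 is an arbitrary tie-break.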

21 Value Iteration

\[ V_{k+1} = T^* V_k \]
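A minimal sketch of the update $V_{k+1} = T^* V_k$ as a loop, assuming a finite MDP stored in dictionaries. The three-state chain, the zero initialisation, and the stopping tolerance are assumptions for illustration, and the function name `value_iteration` is hypothetical.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-10, max_iters=1000):
    """Iterate V <- T* V from V = 0 until the largest change is below tol."""
    V = {x: 0.0 for x in states}
    history = [dict(V)]
    for _ in range(max_iters):
        V_next = {
            x: max(R[(x, a)]
                   + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())
                   for a in actions)
            for x in states
        }
        history.append(V_next)
        delta = max(abs(V_next[x] - V[x]) for x in states)
        V = V_next
        if delta < tol:
            break
    return V, history


# Hypothetical chain; state 2 is absorbing with zero reward.
states = [0, 1, 2]
actions = ["left", "right"]
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
R = {(0, "left"): 0.0, (0, "right"): 0.0,
     (1, "left"): 0.0, (1, "right"): 4.0,
     (2, "left"): 0.0, (2, "right"): 0.0}

V, history = value_iteration(states, actions, P, R, gamma=0.5)
```

On this chain the reward propagates backwards one state per sweep: `history` goes from all zeros, to a value of 4 in state 1, to a value of 2 in state 0, after which nothing changes.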

22-25 Value Iteration

[Figure: worked example with γ = 0.5, showing a Value panel and a Policy panel at each of k = 0, 1, 2, 3.]

26 Policy Iteration

Initialise random policy $\pi_0$
$k \leftarrow 0$
WHILE $\pi_k$ not converged
  1. Compute associated action values $Q^{\pi_k}$ (policy evaluation)
  2. Update policy greedily w.r.t. $Q^{\pi_k}$ (policy improvement):
     $\pi_{k+1}(x) \leftarrow \arg\max_{a \in A} Q^{\pi_k}(x, a), \quad \forall x \in X$
  3. $k \leftarrow k + 1$
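The loop above can be sketched in Python as follows. Two departures from the slide are assumptions made for reproducibility: the initial policy is fixed ("always left") rather than random, and policy evaluation repeatedly applies $T^\pi$ until the values stop changing instead of solving the linear system exactly. The three-state chain and all function names are hypothetical.

```python
def evaluate(policy, states, P, R, gamma, tol=1e-12, max_iters=10000):
    """Approximate V_pi by iterating V <- T_pi V from V = 0."""
    V = {x: 0.0 for x in states}
    for _ in range(max_iters):
        V_next = {
            x: R[(x, policy[x])]
            + gamma * sum(p * V[x2] for x2, p in P[(x, policy[x])].items())
            for x in states
        }
        if max(abs(V_next[x] - V[x]) for x in states) < tol:
            return V_next
        V = V_next
    return V


def policy_iteration(states, actions, P, R, gamma, policy0):
    """Alternate policy evaluation and greedy improvement until stable."""
    policy, k = dict(policy0), 0
    while True:
        # 1. Policy evaluation: V_pi, then Q_pi by one-step lookahead
        V = evaluate(policy, states, P, R, gamma)
        Q = {(x, a): R[(x, a)]
             + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())
             for x in states for a in actions}
        # 2. Policy improvement: greedy w.r.t. Q_pi (ties -> first action)
        new_policy = {x: max(actions, key=lambda a: Q[(x, a)]) for x in states}
        if new_policy == policy:          # converged
            return policy, V, k
        # 3. k <- k + 1
        policy, k = new_policy, k + 1


# Hypothetical chain; state 2 is absorbing with zero reward.
states = [0, 1, 2]
actions = ["left", "right"]
P = {
    (0, "left"): {0: 1.0}, (0, "right"): {1: 1.0},
    (1, "left"): {0: 1.0}, (1, "right"): {2: 1.0},
    (2, "left"): {2: 1.0}, (2, "right"): {2: 1.0},
}
R = {(0, "left"): 0.0, (0, "right"): 0.0,
     (1, "left"): 0.0, (1, "right"): 4.0,
     (2, "left"): 0.0, (2, "right"): 0.0}

policy, V, k = policy_iteration(states, actions, P, R, gamma=0.5,
                                policy0={x: "left" for x in states})
```

Starting from "always left", the first improvement switches state 1 to "right", the second switches state 0, and the third pass changes nothing, so the loop stops with k = 2.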

27-33 Policy Iteration

[Figure: worked example with γ = 0.5, alternating Policy and Value panels, with Value shown for k = 0, 1, 2, 3.]

34-36 Policy Iteration (Policy evaluation)

\[ v^\pi = (I - \gamma P^\pi)^{-1} r \]

[Figure: transition matrix $P^\pi$ for Policy k=3. Rows FROM state, columns TO state. Grey indicates a terminal state.]
[Figure: rewards vector r (legible entries: 16, 4).]
[Figure: resulting Value panel (legible entries: 4, 2, 2).]
