Reinforcement Learning


Beverly Warner
6 years ago

1 Reinforcement Learning: Markov decision process & Dynamic programming
Evaluative feedback, value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration.
Vien Ngo, Marc Toussaint, University of Stuttgart

2 Outline
- Evaluative feedback
- Reinforcement learning problem
  - Elements of reinforcement learning
  - Markov decision process
- Dynamic programming
  - Value iteration
  - Policy iteration

3 Evaluative Feedback
Multi-armed bandit: evaluative actions vs. instructing actions

4 n-armed bandit problem
Each arm $i$ is associated with an expected value $\mu_i$ and a variance $\delta_i^2$.
At each play $t$, pull one arm $a_t \in \{1, \dots, n\}$ and receive its respective reward, with expectation $\mu_{a_t}$. Each pull is considered a play.
The total expected reward for a fixed number of plays $T$ is
$$R = \sum_{t=1}^{T} \mu_{a_t} = \sum_{i=1}^{n} \mu_i \, E(n_i),$$
where $n_i$ is a random number representing the number of plays of arm $i$ in $T$ plays.
The objective is to maximize $R$, or to minimize the regret $T\mu^* - R$, where $\mu^* = \max_i \mu_i$ is the expected reward of the best arm.
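The regret formula above can be checked on a small example. This is a minimal sketch; the arm means and expected play counts are made-up illustration values, and the function names are hypothetical.

```python
def expected_total_reward(means, expected_counts):
    """R = sum_i mu_i * E(n_i)."""
    return sum(mu * n for mu, n in zip(means, expected_counts))


def regret(means, expected_counts):
    """Regret = T * mu_star - R, with T the total number of plays."""
    total_plays = sum(expected_counts)
    return total_plays * max(means) - expected_total_reward(means, expected_counts)


# Hypothetical 3-armed bandit: the best arm (mean 0.9) gets 70 of 100 plays.
means = [0.2, 0.5, 0.9]
expected_counts = [10.0, 20.0, 70.0]
example_regret = regret(means, expected_counts)
print(example_regret)  # about 15: the plays spent on the two weaker arms
```

Regret is zero only if every play goes to the best arm, which is why a learner has to trade off exploring the arms against exploiting its current estimate.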

6 Action-Value Methods
Assume that at the $t$-th play, arm $i$ has been pulled $n_i$ times with rewards $r_1, r_2, \dots, r_{n_i}$. Then
$$Q_t(i) = \frac{r_1 + r_2 + \dots + r_{n_i}}{n_i}$$
By the law of large numbers, $Q_t(i) \to Q^*(i)$, the true value of arm $i$, as $n_i \to \infty$.
(from Introduction to RL book, Sutton & Barto)
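The sample average does not require storing all past rewards. The sketch below uses the equivalent incremental form $Q \leftarrow Q + (r - Q)/n$, which is a standard rewriting and not something stated on the slide; the function name and reward values are illustrative assumptions.

```python
def update_mean(q, n, reward):
    """Fold one more reward into the running sample average.

    q is the average of the first n rewards; returns (new_q, n + 1).
    """
    n += 1
    q += (reward - q) / n
    return q, n


# Hypothetical rewards observed for a single arm.
rewards = [1.0, 0.0, 2.0, 1.0]
q, n = 0.0, 0
for r in rewards:
    q, n = update_mean(q, n, r)

print(q, n)  # q matches sum(rewards) / len(rewards)
```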

8 Exploration/Exploitation Dilemma
The greedy action selection: $a_t^* = \operatorname{argmax}_i Q_t(i)$.
- Exploitation: $a_t = a_t^*$
- Exploration: $a_t \neq a_t^*$
$\epsilon$-greedy action selection:
- Exploitation: with probability $1 - \epsilon$
- Exploration: with probability $\epsilon$

10 10-armed Testbed
- $n = 10$ arms.
- Each $\mu_i$ is chosen randomly from $N(0, 1)$.
- Each run consists of 1000 plays.
- Repeat the whole thing 2000 times and report the averaged results.
(from Introduction to RL book, Sutton & Barto)
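A single run of such a testbed with $\epsilon$-greedy selection might look like the sketch below. Two things are assumptions rather than facts from the slides: rewards are drawn from a unit-variance Gaussian around each arm's mean, and the exploration step picks uniformly among all arms (so it can occasionally pick the greedy arm too). The function names are hypothetical.

```python
import random


def run_bandit(epsilon, n_arms=10, n_plays=1000, seed=0):
    """One epsilon-greedy run; returns (true means, play counts, rewards)."""
    rng = random.Random(seed)
    mu = [rng.gauss(0.0, 1.0) for _ in range(n_arms)]  # true arm means ~ N(0, 1)
    q = [0.0] * n_arms     # sample-average estimates Q_t(i)
    counts = [0] * n_arms  # n_i, number of plays of each arm
    rewards = []
    for _ in range(n_plays):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                   # explore
        else:
            arm = max(range(n_arms), key=lambda i: q[i])  # exploit (greedy)
        r = rng.gauss(mu[arm], 1.0)  # assumed unit-variance reward noise
        counts[arm] += 1
        q[arm] += (r - q[arm]) / counts[arm]  # incremental sample average
        rewards.append(r)
    return mu, counts, rewards


def average_reward(epsilon, n_runs=2000, n_plays=1000):
    """Average per-play reward over independent runs (one seed per run)."""
    total = 0.0
    for seed in range(n_runs):
        _, _, rewards = run_bandit(epsilon, n_plays=n_plays, seed=seed)
        total += sum(rewards) / n_plays
    return total / n_runs


mu, counts, rewards = run_bandit(epsilon=0.1)
print(counts)  # how often each arm was played in this run
```

Calling `average_reward` for several values of `epsilon` gives the kind of averaged comparison the testbed describes; no particular outcome is claimed here.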

11 Reinforcement Learning Problem
Elements of the Reinforcement Learning Problem:
- Agent vs. Environment.
- State, Action, Reward, Goal, Return.
- The Markov property.
- Markov decision process.
- Bellman equations.
- Optimality and Approximation.

12 Agent vs. Environment
The learner and decision-maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. (from Introduction to RL book, Sutton & Barto)

14 The Markov property
Ideally, a state summarizes past sensations compactly, yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. (Introduction to RL book, Sutton & Barto)
Formally,
$$Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, \dots, s_0, a_0, r_0) = Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t)$$
Examples: the current configuration of a chess board, for predicting the next moves; the position and velocity of the cart together with the angle of the pole and its rate of change, in the cart-pole domain.

17 Markov decision process
A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.
MDP = $\{S, A, T, R, P_0, \gamma\}$:
- $S$: the set of all possible states.
- $A$: the set of all possible actions.
- $T$: a transition function, which defines the probability $T(s', s, a) = Pr(s' \mid s, a)$.
- $R$: a reward function, which defines the reward $R(s, a)$.
- $P_0$: the probability distribution over initial states.
- $\gamma$: a discount factor.
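One way to hold the six components of an MDP in code is sketched below. The container layout, the `is_well_formed` helper, and the two-state toy problem are all hypothetical illustrations, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # S
    actions: list  # A
    T: dict        # T[(s, a)] -> {next_state: probability}
    R: dict        # R[(s, a)] -> reward
    P0: dict       # P0[s] -> probability of starting in s
    gamma: float   # discount factor


def is_well_formed(mdp, tol=1e-9):
    """Check that P0 and every T[(s, a)] are probability distributions."""
    if abs(sum(mdp.P0.values()) - 1.0) > tol:
        return False
    for s in mdp.states:
        for a in mdp.actions:
            if abs(sum(mdp.T[(s, a)].values()) - 1.0) > tol:
                return False
    return 0.0 <= mdp.gamma <= 1.0


# Hypothetical two-state problem: "work" usually moves low -> high,
# and waiting in "high" pays a reward of 1 per step.
toy = MDP(
    states=["low", "high"],
    actions=["wait", "work"],
    T={
        ("low", "wait"): {"low": 1.0},
        ("low", "work"): {"high": 0.8, "low": 0.2},
        ("high", "wait"): {"high": 1.0},
        ("high", "work"): {"low": 1.0},
    },
    R={("low", "wait"): 0.0, ("low", "work"): 0.0,
       ("high", "wait"): 1.0, ("high", "work"): 0.0},
    P0={"low": 1.0},
    gamma=0.9,
)
print(is_well_formed(toy))
```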

20 Example: Recycling Robot MDP

21 [Figure: trajectory of states $s_0, s_1, s_2$, actions $a_0, a_1, a_2$ and rewards $r_0, r_1, r_2$]
A policy is a mapping from state space to action space, $\mu : S \to A$.
Objective function:
- Expected average reward:
$$\eta = \lim_{T \to \infty} \frac{1}{T} \, E\left[ \sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \right]$$
- Expected discounted reward:
$$\eta_\gamma = E\left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t, s_{t+1}) \right]$$
where $\gamma \in [0, 1]$ is a discount factor.
Singh et al. 1994:
$$\eta_\gamma = \frac{1}{1 - \gamma} \, \eta$$
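For a single observed reward sequence, the discounted sum can be computed by a backward recursion. The sketch below is illustrative only: the function name is hypothetical, and the constant-reward sequence is chosen because it is the simplest case in which the relation $\eta_\gamma = \eta / (1 - \gamma)$ can be checked by hand (average reward 1, discounted sum close to 10 for $\gamma = 0.9$).

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
    return g


short = discounted_return([1.0, 2.0, 3.0], 0.5)  # 1 + 0.5*2 + 0.25*3
long_run = discounted_return([1.0] * 500, 0.9)   # truncated infinite sum
print(short, long_run)
```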

25 Dynamic Programming
- State value functions
- Bellman's equation
- Value Iteration
- Policy Iteration

26 State value function
The value (expected discounted return) of policy $\pi$ when started in state $s$, with discounting factor $\gamma \in [0, 1]$:
$$V^\pi(s) = E_\pi\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s \} \qquad (1)$$
Definition of optimality: a behavior $\pi^*$ is optimal iff
$$\forall s: \; V^{\pi^*}(s) = V^*(s) \quad \text{where} \quad V^*(s) = \max_\pi V^\pi(s)$$
(simultaneously maximising the value in all states).
(In MDPs there always exists at least one optimal deterministic policy.)

28 Bellman optimality equation
$$\begin{aligned}
V^\pi(s) &= E\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s; \pi \} \\
&= E\{ r_0 \mid s_0 = s; \pi \} + \gamma E\{ r_1 + \gamma r_2 + \dots \mid s_0 = s; \pi \} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s) \, E\{ r_1 + \gamma r_2 + \dots \mid s_1 = s'; \pi \} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s) \, V^\pi(s')
\end{aligned}$$
We can write this in vector notation, $V^\pi = R^\pi + \gamma P^\pi V^\pi$, with vectors $V^\pi_s = V^\pi(s)$, $R^\pi_s = R(\pi(s), s)$ and matrix $P^\pi_{s s'} = P(s' \mid \pi(s), s)$.
For stochastic $\pi(a \mid s)$:
$$V^\pi(s) = \sum_a \pi(a \mid s) R(a, s) + \gamma \sum_{s', a} \pi(a \mid s) P(s' \mid a, s) \, V^\pi(s')$$
Bellman optimality equation:
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, V^*(s') \Big]$$
(Sketch of proof: if $\pi$ selected an action other than $\operatorname{argmax}_a [\cdot]$, then the policy $\pi'$ that equals $\pi$ everywhere except $\pi'(s) = \operatorname{argmax}_a [\cdot]$ would be better.)
This is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm).

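31 Richard E. Bellman
Bellman's principle of optimality
[Figure: principle of optimality, with labels A, B, A opt, B opt]
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, V^*(s') \Big]$$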

32 Value Iteration
Given the Bellman equation
$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, V^*(s') \Big]$$
iterate
$$\forall s: \; V_{k+1}(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, V_k(s') \Big]$$
Stopping criterion:
$$\max_s | V_{k+1}(s) - V_k(s) | \le \epsilon$$
Value Iteration converges to the optimal value function $V^*$ (proof below).
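The update and stopping criterion above can be sketched in a few lines. The two-state, two-action MDP here is a made-up example (not the maze from the slides), and the dictionary layout and function names are assumptions chosen for brevity.

```python
# Hypothetical MDP. P[s][a] is a list of (probability, next_state) pairs,
# R[s][a] is the reward R(a, s).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
GAMMA = 0.9


def backup(V, s, a):
    """R(a, s) + gamma * sum_s' P(s' | a, s) V(s')."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(eps=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: max(backup(V, s, a) for a in P[s]) for s in P}
        # Stopping criterion: max_s |V_{k+1}(s) - V_k(s)| <= eps
        if max(abs(V_new[s] - V[s]) for s in P) <= eps:
            return V_new
        V = V_new


def greedy_policy(V):
    """pi(s) = argmax_a of the one-step backup."""
    return {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}


V_star = value_iteration()
pi_star = greedy_policy(V_star)
print(V_star, pi_star)
```

In this toy problem the values can be verified by hand: staying in state 1 earns 1 per step, so $V^*(1) = 1/(1 - 0.9) = 10$, and $V^*(0)$ solves $V = 0.9\,(0.8 \cdot 10 + 0.2\,V)$.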

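33 2x2 Maze
[Figure: 2x2 maze with stochastic transitions; the transition percentages were lost in extraction except for one value of 10%.]
Manually solving.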

34 State-action value function (Q-function)
The state-action value function (or Q-function) is the expected discounted return when starting in state $s$ and taking first action $a$:
$$\begin{aligned}
Q^\pi(a, s) &= E_\pi\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \dots \mid s_0 = s, a_0 = a \} \\
&= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, Q^\pi(\pi(s'), s')
\end{aligned}$$
(Note: $V^\pi(s) = Q^\pi(\pi(s), s)$.)
Bellman optimality equation for the Q-function:
$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s')$$
$$\pi^*(s) = \operatorname{argmax}_a Q^*(a, s)$$

35 Q-Iteration
Given the Bellman equation
$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s')$$
iterate
$$\forall a, s: \; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(a', s')$$
Stopping criterion:
$$\max_{a, s} | Q_{k+1}(a, s) - Q_k(a, s) | \le \epsilon$$
Q-Iteration converges to the optimal state-action value function $Q^*$.
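The same iteration written over state-action pairs might look as follows. This is a sketch on a made-up two-state MDP; the data layout and names are assumptions, and the table is keyed as `Q[(a, s)]` to mirror the $Q(a, s)$ argument order used in these slides.

```python
# Hypothetical MDP. P[s][a] is a list of (probability, next_state) pairs,
# R[s][a] is the reward R(a, s).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
GAMMA = 0.9


def q_iteration(eps=1e-8):
    Q = {(a, s): 0.0 for s in P for a in P[s]}
    while True:
        Q_new = {}
        for s in P:
            for a in P[s]:
                # Expected value of acting greedily from the next state on.
                future = sum(p * max(Q[(a2, s2)] for a2 in P[s2])
                             for p, s2 in P[s][a])
                Q_new[(a, s)] = R[s][a] + GAMMA * future
        # Stopping criterion: max_{a,s} |Q_{k+1}(a, s) - Q_k(a, s)| <= eps
        if max(abs(Q_new[k] - Q[k]) for k in Q) <= eps:
            return Q_new
        Q = Q_new


Q_star = q_iteration()
pi_star = {s: max(P[s], key=lambda a: Q_star[(a, s)]) for s in P}
print(Q_star, pi_star)
```

Unlike a state value function, the Q-table yields the greedy policy directly through an argmax over actions, without another pass through the transition model.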

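36 Proof of convergence
Let $\Delta_k = \| Q^* - Q_k \|_\infty = \max_{a, s} | Q^*(a, s) - Q_k(a, s) |$. Then
$$\begin{aligned}
Q_{k+1}(a, s) &= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q_k(a', s') \\
&\le R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} \big[ Q^*(a', s') + \Delta_k \big] \\
&= \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \max_{a'} Q^*(a', s') \Big] + \gamma \Delta_k \\
&= Q^*(a, s) + \gamma \Delta_k
\end{aligned}$$
Similarly: $Q_k \ge Q^* - \Delta_k \;\Rightarrow\; Q_{k+1} \ge Q^* - \gamma \Delta_k$.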

37 Convergence
Contraction property:
$$\| U_{k+1} - V_{k+1} \| \le \gamma \, \| U_k - V_k \|$$
which guarantees convergence with different initial values $U_0$, $V_0$ of two approximations.
Stopping condition:
$$\| V_{k+1} - V_k \| \le \epsilon \;\Rightarrow\; \| V_{k+1} - V^* \| \le \epsilon \gamma / (1 - \gamma)$$
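The stopping bound follows from the contraction property in a few steps. The derivation below is a sketch that assumes $\gamma < 1$ and uses the fact that $V^*$ is a fixed point of the update, so the contraction property can be applied with $U_k = V^*$.

```latex
\begin{align*}
\| V_{k+1} - V^* \|
  &\le \gamma \, \| V_k - V^* \|
  && \text{(contraction, with } U_k = U_{k+1} = V^* \text{)} \\
  &\le \gamma \big( \| V_k - V_{k+1} \| + \| V_{k+1} - V^* \| \big)
  && \text{(triangle inequality)} \\
  &\le \gamma \epsilon + \gamma \, \| V_{k+1} - V^* \|
  && \text{(stopping condition)} \\
\Rightarrow \quad (1 - \gamma) \, \| V_{k+1} - V^* \| &\le \gamma \epsilon \\
\Rightarrow \quad \| V_{k+1} - V^* \| &\le \frac{\epsilon \gamma}{1 - \gamma}
\end{align*}
```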

39 Policy Evaluation
Value Iteration and Q-Iteration directly compute $V^*$ and $Q^*$.
If we want to evaluate a given policy $\pi$, we want to compute $V^\pi$ or $Q^\pi$.
Iterate using $\pi$ instead of $\max_a$:
$$\forall s: \; V_{k+1}(s) = R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s) \, V_k(s')$$
$$\forall a, s: \; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s) \, Q_k(\pi(s'), s')$$
Or, invert the matrix equation:
$$\begin{aligned}
V^\pi &= R^\pi + \gamma P^\pi V^\pi \\
V^\pi - \gamma P^\pi V^\pi &= R^\pi \\
(I - \gamma P^\pi) V^\pi &= R^\pi \\
V^\pi &= (I - \gamma P^\pi)^{-1} R^\pi
\end{aligned}$$
This requires inversion of an $n \times n$ matrix for $|S| = n$, which costs $O(n^3)$.
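Both routes to $V^\pi$ can be compared on a small example. In the sketch below, `P_pi` and `R_pi` are a made-up two-state instance of $P^\pi$ and $R^\pi$, and the linear system $(I - \gamma P^\pi) V^\pi = R^\pi$ is solved by a plain Gaussian elimination written for this illustration rather than by forming the inverse explicitly; all names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_exact(P_pi, R_pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi."""
    n = len(R_pi)
    A = [[(1.0 if i == j else 0.0) - gamma * P_pi[i][j] for j in range(n)]
         for i in range(n)]
    return solve(A, R_pi)


def evaluate_iterative(P_pi, R_pi, gamma, eps=1e-10):
    """Iterate V_{k+1} = R_pi + gamma * P_pi V_k until the change is <= eps."""
    n = len(R_pi)
    V = [0.0] * n
    while True:
        V_new = [R_pi[i] + gamma * sum(P_pi[i][j] * V[j] for j in range(n))
                 for i in range(n)]
        if max(abs(x - y) for x, y in zip(V_new, V)) <= eps:
            return V_new
        V = V_new


# Hypothetical policy-induced model: row s holds P(s' | pi(s), s).
P_pi = [[0.2, 0.8], [0.0, 1.0]]
R_pi = [0.0, 1.0]
V_exact = evaluate_exact(P_pi, R_pi, 0.9)
V_iter = evaluate_iterative(P_pi, R_pi, 0.9)
print(V_exact, V_iter)
```

The direct solve gives the answer in one step at cubic cost, while the iteration only approaches it but needs just repeated matrix-vector products, which can matter when the state space is large.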

42 Policy Iteration
How does computing $V^\pi$ or $Q^\pi$ help to find the optimal policy?
Policy Iteration:
1. Initialise $\pi_0$ somehow (e.g. randomly).
2. Iterate:
   - Policy Evaluation: compute $V^{\pi_k}$ or $Q^{\pi_k}$
   - Policy Update: $\pi_{k+1}(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$
Demo: 2x2 maze
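The two alternating steps can be sketched as below. The MDP is a made-up two-state example rather than the 2x2 maze demo, the evaluation step uses the iterative scheme instead of a matrix solve, and stopping when the policy no longer changes is an assumption, since the slides do not state a termination rule. All names are hypothetical.

```python
# Hypothetical MDP. P[s][a] is a list of (probability, next_state) pairs,
# R[s][a] is the reward R(a, s).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}
GAMMA = 0.9


def q_value(V, s, a):
    """Q(a, s) = R(a, s) + gamma * sum_s' P(s' | a, s) V(s')."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate(pi, eps=1e-10):
    """Policy Evaluation: iterate the Bellman equation for a fixed pi."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: q_value(V, s, pi[s]) for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) <= eps:
            return V_new
        V = V_new


def policy_iteration(pi0):
    pi = dict(pi0)
    while True:
        V = evaluate(pi)
        # Policy Update: act greedily with respect to Q^{pi_k}.
        new_pi = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
        if new_pi == pi:  # assumed termination rule: policy is stable
            return pi, V
        pi = new_pi


# Start from a deliberately poor policy: stay in state 0, leave state 1.
pi_final, V_final = policy_iteration({0: 0, 1: 1})
print(pi_final, V_final)
```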

46 So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (that is, we know $P(s' \mid a, s)$ and $R(a, s)$):
- Value Iteration / Q-Iteration $\to V^*, Q^*, \pi^*$
- Policy Evaluation $\to V^\pi, Q^\pi$
- Policy Update $\to \pi(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$
- Policy Iteration (iterate Policy Evaluation and Policy Update)
Reinforcement Learning?
