Reinforcement Learning

1 Reinforcement Learning Dipendra Misra Cornell University

2 Task Grasp the green cup. Output: sequence of controller actions. Setup from Lenz et al. 2014

3 Supervised Learning Grasp the green cup. Expert demonstrations. Setup from Lenz et al. 2014

4 Supervised Learning Grasp the green cup. Problem? Expert demonstrations. Setup from Lenz et al. 2014

5 Supervised Learning Grasp the cup. Problem? Training data vs. test data; no exploration.

6 Exploring the environment

7 What is reinforcement learning? Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. - R. Sutton and A. Barto

8 Interaction with the environment action, then reward + new environment. Scalar reward. Setup from Lenz et al. 2014

9 Interaction with the environment $a_1\, r_1\, a_2\, r_2\, \ldots\, a_n\, r_n$. Episodic vs Non-Episodic. Setup from Lenz et al. 2014

10 Rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. Setup from Lenz et al. 2014

11 Setup A transition $s_t \xrightarrow{a_t} s_{t+1}$ yields reward $r_t$, e.g., 1\$.

12 Policy $\pi(s, a) = 0.9$
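
A minimal sketch (not from the slides) of how a stochastic policy such as $\pi(s, a) = 0.9$ can be stored as a probability table and sampled from; the 2-state, 2-action numbers are hypothetical:

```python
import numpy as np

# A stochastic policy stored as a table: pi[s, a] = probability of action a in state s.
# Hypothetical 2-state, 2-action numbers, chosen only to mirror pi(s, a) = 0.9.
pi = np.array([[0.9, 0.1],   # state 0: action 0 with prob 0.9, action 1 with prob 0.1
               [0.5, 0.5]])  # state 1: uniform over actions

rng = np.random.default_rng(0)

def sample_action(s):
    """Sample an action from pi(s, .)."""
    return rng.choice(len(pi[s]), p=pi[s])

print(pi[0, 0])            # pi(s=0, a=0) = 0.9
print(sample_action(0))    # a sampled action in state 0
```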

13 Interaction with the environment action, then reward + new environment. Objective? Setup from Lenz et al. 2014

14 Objective $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. Maximize expected reward $\mathbb{E}\left[\sum_{t=1}^{n} r_t\right]$. Problem? Setup from Lenz et al. 2014

15 Discounted Reward Maximizing expected reward $\mathbb{E}\left[\sum_{t=0}^{\infty} r_{t+1}\right]$ is unbounded, so discount future reward: $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$ with $\gamma \in [0, 1)$. Problem?

16 Discounted Reward Maximize discounted expected reward $\mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$. If $|r| \le M$ and $\gamma \in [0, 1)$ then $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1-\gamma}$.
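
A short sketch of the bound above: the discounted return of any bounded reward sequence never exceeds $M/(1-\gamma)$. The reward values below are made up for illustration.

```python
import numpy as np

# Discounted return of a finite reward sequence, checked against the bound
# M / (1 - gamma) from the slide (valid when |r_t| <= M and gamma in [0, 1)).
gamma = 0.9
rewards = np.array([1.0, 0.5, 1.0, 0.2, 1.0])          # r_1, ..., r_n (invented)
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

M = np.max(np.abs(rewards))
bound = M / (1 - gamma)                                # sum_{t>=0} gamma^t * M

print(discounted_return, bound)
assert discounted_return <= bound                      # the bound always holds
```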

17 Need for discounting To keep the problem well-formed. Evidence that humans discount future reward.

18 Markov Decision Process An MDP is a tuple $(S, A, P, R, \gamma)$ where $S$ is a set of finite or infinite states, $A$ is a set of finite or infinite actions, for the transition $s \xrightarrow{a} s'$ the quantity $P^a_{s,s'} \in P$ is the transition probability and $R^a_{s,s'} \in R$ is the reward for the transition (Markov assumption), and $\gamma \in [0, 1]$ is the discount factor.
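
A minimal sketch of how such a tuple can be represented in code for a small finite MDP; the transition and reward numbers are invented purely for illustration:

```python
import numpy as np

# A finite MDP stored as the tuple (S, A, P, R, gamma):
#   P[s, a, s'] = transition probability, R[s, a, s'] = reward for that transition.
num_states, num_actions = 2, 2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # shape (S, A, S'), invented numbers
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])   # shape (S, A, S'), invented numbers
gamma = 0.9

# Sanity check: every P[s, a, :] is a probability distribution.
assert np.allclose(P.sum(axis=-1), 1.0)
```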

19 MDP Example Example from Sutton and Barto 1998

20 Summary MDP is a tuple $(S, A, P, R, \gamma)$. Maximize discounted expected reward $\mathbb{E}\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$. Agent controls the policy $\pi(s, a)$.

21 What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP

22 Value functions Expected reward from following a policy. State value function: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$. State-action value function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, a_1 = a, \pi\right]$.
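
A hedged sketch of what this definition means operationally: $V^\pi(s)$ can be approximated by averaging sampled (truncated) discounted returns of rollouts that follow $\pi$. The toy MDP and the uniform policy below are invented, not taken from the slides.

```python
import numpy as np

# Monte-Carlo estimate of V^pi(s): average discounted returns over sampled rollouts.
rng = np.random.default_rng(0)
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] (invented)
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])   # R[s, a, s'] (invented)
pi = np.full((2, 2), 0.5)                  # pi[s, a]: uniform policy
gamma = 0.9

def rollout_return(s, horizon=200):
    """One sampled discounted return starting from s, following pi (truncated)."""
    g = 0.0
    for t in range(horizon):               # gamma**200 is negligible here
        a = rng.choice(2, p=pi[s])
        s_next = rng.choice(2, p=P[s, a])
        g += gamma**t * R[s, a, s_next]
        s = s_next
    return g

v_hat = np.mean([rollout_return(0) for _ in range(2000)])
print(v_hat)    # approximates V^pi(s=0)
```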

23 State Value function $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_1 = s, \pi\right]$. Rollout tree: from $s_1$, action $a_1$ is taken with probability $\pi(s_1, a_1)$ and leads to $s_2$ with probability $P^{a_1}_{s_1,s_2}$ and reward $R^{a_1}_{s_1,s_2}$; from $s_2$, action $a_2$ with probability $\pi(s_2, a_2)$ leads to $s_3$ with $P^{a_2}_{s_2,s_3}$ and $R^{a_2}_{s_2,s_3}$; and so on.

24 State Value function $V^\pi(s_1) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{\tau} (r_1 + \gamma r_2 + \cdots)\, p(\tau)$, where a trajectory $\tau = \langle s_1, a_1, s_2, a_2, \ldots \rangle$ decomposes as $\tau = \langle s_1, a_1, s_2 \rangle : \tau'$, i.e., the first transition followed by the remaining trajectory $\tau'$.

25 State Value function $V^\pi(s_1) = \sum_{a_1, s_2} \sum_{\tau'} P(s_1, a_1, s_2)\, P(\tau' \mid s_1, a_1, s_2) \left[ R^{a_1}_{s_1,s_2} + \gamma (r_2 + \gamma r_3 + \cdots) \right] = \sum_{a_1, s_2} P(s_1, a_1, s_2) \left\{ R^{a_1}_{s_1,s_2} + \gamma \sum_{\tau'} P(\tau' \mid s_1, a_1, s_2)(r_2 + \gamma r_3 + \cdots) \right\}$; the inner sum is $V^\pi(s_2)$, so $V^\pi(s_1) = \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2} \left\{ R^{a_1}_{s_1,s_2} + \gamma V^\pi(s_2) \right\}$.

26 Bellman Self-Consistency Eqn $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$. Similarly, $Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$ and $Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a') Q^\pi(s', a') \right\}$.

27 Bellman Self-Consistency Eqn $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$. Given $N$ states, this is a system of $N$ linear equations in $N$ variables; solve it. Does it have a unique solution? Yes, it does. Exercise: prove it.
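
Because the self-consistency equation is linear in the $N$ unknowns $V^\pi(s)$, it can be solved directly as $(I - \gamma P_\pi)V = R_\pi$. A sketch under an assumed toy MDP and uniform policy (invented numbers):

```python
import numpy as np

# Exact policy evaluation by solving the N-by-N linear system (I - gamma*P_pi) V = R_pi.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s'] (invented)
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])   # R[s, a, s'] (invented)
pi = np.full((2, 2), 0.5)                  # pi[s, a]: uniform policy
gamma = 0.9

# Marginalize over actions: P_pi[s, s'] and the expected one-step reward R_pi[s].
P_pi = np.einsum('sa,sax->sx', pi, P)
R_pi = np.einsum('sa,sax,sax->s', pi, P, R)

V = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
print(V)    # the unique V^pi, one value per state
```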

28 Optimal Policy $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$. Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy?

29 Optimal Policy Policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy? $\pi^*$ is an optimal policy if $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and all $\pi$. Does it always exist? Yes, it always does.

30 Existence of Optimal Policy The leader policy for state $s$ is $\pi_s = \arg\max_{\pi} V^{\pi}(s)$. Define $\pi^*(s, a) = \pi_s(s, a)$ for all $s, a$. To show $\pi^*$ is optimal, or equivalently $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$, use $V^{\pi^*}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \}$ and $V^{\pi_s}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi_s}(s') \}$.

31 Existence of Optimal Policy With the leader policy $\pi_s = \arg\max_{\pi} V^{\pi}(s)$ and $\pi^*(s, a) = \pi_s(s, a)$ for all $s, a$: $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ V^{\pi^*}(s') - V^{\pi_s}(s') \} \ge \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ \Delta(s') \} = \gamma \cdot \mathrm{conv}(\Delta(s'))$, a convex combination of the $\Delta(s')$.

32 Existence of Optimal Policy $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ V^{\pi^*}(s') - V^{\pi_s}(s') \} \ge \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{ \Delta(s') \} = \gamma \cdot \mathrm{conv}(\Delta(s'))$. Hence $\Delta(s) \ge \gamma \min_{s'} \Delta(s')$, so $\min_s \Delta(s) \ge \gamma \min_{s'} \Delta(s')$; since $\gamma \in [0, 1)$, it follows that $\min_s \Delta(s) \ge 0$. Hence proved.

33 Bellman's Optimality Condition Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$. Then $V^*(s) = \sum_a \pi^*(s, a) Q^*(s, a) \le \max_a Q^*(s, a)$. Claim: $V^*(s) = \max_a Q^*(s, a)$. Suppose instead $V^*(s) < \max_a Q^*(s, a)$, and define $\pi'(s) = \arg\max_a Q^*(s, a)$.

34 Bellman's Optimality Condition With $\pi'(s) = \arg\max_a Q^*(s, a)$: $V^*(s) = \sum_a \pi^*(s, a) Q^*(s, a)$ and $V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$, so $\Delta(s) = V^{\pi'}(s) - V^*(s) = Q^{\pi'}(s, \pi'(s)) - \sum_a \pi^*(s, a) Q^*(s, a) \ge Q^{\pi'}(s, \pi'(s)) - Q^*(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'} \Delta(s')$. Therefore $\Delta(s) \ge 0$, and there exists $s'$ such that $\Delta(s') > 0$, so $\pi^*$ is not optimal, which is a contradiction.

35 Bellman's Optimality Condition $V^*(s) = \max_a Q^*(s, a)$, i.e., $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$; similarly $Q^*(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a') \}$.

36 Optimal policy from Q value Given $Q^*(s, a)$, an optimal policy is given by $\pi^*(s) = \arg\max_a Q^*(s, a)$. Corollary: every MDP has a deterministic optimal policy.
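
A one-line illustration of greedy extraction from a Q table; the Q values below are placeholders, not derived from the slides' MDP:

```python
import numpy as np

# Greedy policy extraction: pi*(s) = argmax_a Q*(s, a), which is deterministic.
Q = np.array([[1.2, 0.7],
              [0.3, 2.1]])           # Q[s, a] (placeholder values)

pi_star = np.argmax(Q, axis=1)       # one action per state
print(pi_star)                       # e.g. array([0, 1])
```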

37 Summary An optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and all $\pi$. Bellman's self-consistency equation: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^\pi(s') \right\}$. Bellman's optimality condition: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$.

38 What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition

39 Solving MDP To solve an MDP is to find an optimal policy

40 Bellman's Optimality Condition $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$. Iteratively solve the above equation.

41 Bellman Backup Operator $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \}$. Define $T : \mathcal{V} \to \mathcal{V}$ by $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$.
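
A sketch of one application of the backup operator $T$ to a value vector, using the same invented toy MDP arrays as the earlier sketches:

```python
import numpy as np

# One Bellman backup: (T V)(s) = max_a sum_{s'} P[s,a,s'] * (R[s,a,s'] + gamma * V[s']).
def bellman_backup(V, P, R, gamma):
    q = np.einsum('sax,sax->sa', P, R + gamma * V[None, None, :])  # backed-up Q values
    return q.max(axis=1)                                           # max over actions

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # invented numbers
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])   # invented numbers
print(bellman_backup(np.zeros(2), P, R, gamma=0.9))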

42 Dynamic Programming Solution Initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_\infty \le \epsilon$; return $V_{t+1}$. Concretely, $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V_t(s') \}$.
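
A sketch of the iterative solution above (value iteration), reusing the same assumed toy MDP numbers; the stopping rule is the sup-norm test from the pseudocode:

```python
import numpy as np

# Value iteration: start from any V_0 and apply the backup operator until the
# sup-norm change drops below a tolerance epsilon.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # invented numbers
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [0.0, 1.0]]])   # invented numbers
gamma, eps = 0.9, 1e-8

def T(V):
    return np.einsum('sax,sax->sa', P, R + gamma * V[None, None, :]).max(axis=1)

V = np.zeros(2)                  # any initialization works (see the convergence proof)
while True:
    V_next = T(V)
    if np.max(np.abs(V_next - V)) <= eps:   # ||V_{t+1} - V_t||_inf <= eps
        break
    V = V_next
print(V_next)                    # approximately V*
```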

43 Convergence $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$. Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_k|\}$ for $x \in \mathbb{R}^k$. Proof: $(TV_1)(s) - (TV_2)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V_1(s') \} - \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V_2(s') \}$, and using $\max_x f(x) - \max_x g(x) \le \max_x \left( f(x) - g(x) \right)$:

44 Convergence Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_k|\}$ for $x \in \mathbb{R}^k$. Proof: $(TV_1)(s) - (TV_2)(s) \le \max_a \sum_{s'} P^a_{s,s'}\, \gamma \left( V_1(s') - V_2(s') \right) \le \max_a \sum_{s'} P^a_{s,s'}\, \gamma \left| V_1(s') - V_2(s') \right| \le \max_a \sum_{s'} P^a_{s,s'}\, \gamma \max_{s'} \left| V_1(s') - V_2(s') \right| \le \gamma \|V_1 - V_2\|_\infty$, hence $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$.

45 Optimal is a fixed point $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^*(s') \} = (TV^*)(s)$, i.e., $V^* = TV^*$: $V^*$ is a fixed point of $T$.

46 Optimal is the fixed point $V^* = TV^*$, so $V^*$ is a fixed point of $T$. Theorem: $V^*$ is the only fixed point of $T$. Proof: suppose $TV_1 = V_1$ and $TV_2 = V_2$. Then $\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$. As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0$, i.e., $V_1 = V_2$.

47 Dynamic Programming Solution Initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_\infty \le \epsilon$; return $V_{t+1}$. Problem? Theorem: the algorithm converges for all $V_0$. Proof: $\|V_{t+1} - V^*\|_\infty = \|TV_t - TV^*\|_\infty \le \gamma \|V_t - V^*\|_\infty$, so $\|V_t - V^*\|_\infty \le \gamma^t \|V_0 - V^*\|_\infty$ and $\lim_{t \to \infty} \|V_t - V^*\|_\infty \le \lim_{t \to \infty} \gamma^t \|V_0 - V^*\|_\infty = 0$.

48 Summary Iteratively solving the optimality condition: $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V_t(s') \}$. Bellman backup operator: $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$. Convergence of the iterative solution.

49 What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition Bellman Backup Operator Iterative Solution

50 In next tutorial Value and Policy Iteration Monte Carlo Solution SARSA and Q-Learning Policy Gradient Methods Learning to search OR Atari game paper?
