Reinforcement Learning Part 2

1 Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu

2 From previous tutorial: Reinforcement Learning, Exploration, No supervision, Agent-Reward-Environment, Policy, MDP, Consistency Equation, Optimal Policy, Optimality Condition, Bellman Backup Operator, Iterative Solution

3 Interaction with the environment: the agent sends an action; the environment returns a scalar reward and a new environment. Setup from Lenz et al. 2014

4 Rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$ (figure: successive actions and rewards $a_1, r_1, a_2, r_2, \ldots, a_n, r_n$). Setup from Lenz et al. 2014

5 Setup: $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_t$ (e.g., a reward of 1 dollar)

6 Policy $\pi(s, a) = 0.9$

7 From previous tutorial
An optimal policy $\pi^*$ exists such that: $V^{\pi^*}(s) \ge V^{\pi}(s) \quad \forall s \in S, \ \forall \pi$
Bellman's self-consistency equation: $V^{\pi}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi}(s') \}$
Bellman's optimality condition: $V^{\pi^*}(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V^{\pi^*}(s') \}$
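As a rough illustration of the self-consistency equation, the sketch below evaluates a fixed policy by applying the equation repeatedly until the values settle. The two-state MDP, the names `P`, `evaluate_policy` and `uniform`, and the discount of 0.9 are assumptions made for this example, not part of the tutorial.

```python
# Hypothetical 2-state, 2-action MDP (an assumption for illustration).
# P[s][a] is a list of (next_state, probability, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 2.0)]},
}

def evaluate_policy(policy, P, gamma, sweeps=1000):
    """Apply V(s) = sum_a pi(s,a) sum_s' P {R + gamma V(s')} repeatedly."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {
            s: sum(
                policy[s][a]
                * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
                for a in P[s]
            )
            for s in P
        }
    return V

# Uniformly random policy: pi(s, a) = 0.5 for both actions.
uniform = {s: {0: 0.5, 1: 0.5} for s in P}
V_uniform = evaluate_policy(uniform, P, GAMMA)
```

For this toy problem the two equations can also be solved by hand, giving $V(0) = 7.25$ and $V(1) = 7.75$, which the iteration approaches.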

8 Solving MDP To solve an MDP (or RL problem) is to find an optimal policy

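9 Dynamic Programming Solution
Initialize $V^0$ randomly
do $V^{t+1} = T V^t$ until $\lVert V^{t+1} - V^t \rVert_\infty < \epsilon$
return $V^{t+1}$
where $T : V \to V$ is the Bellman backup operator, $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$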
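The loop on this slide can be sketched in a few lines of Python. The toy MDP, the function names and the tolerance are assumptions for illustration; the slide initializes $V^0$ randomly, while this sketch starts from zeros so that the run is reproducible.

```python
# Hypothetical 2-state, 2-action MDP (an assumption for illustration).
# P[s][a] is a list of (next_state, probability, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 2.0)]},
}

def bellman_backup(V, P, gamma):
    """(TV)(s) = max_a sum_s' P {R + gamma V(s')}."""
    return {
        s: max(
            sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
            for a in P[s]
        )
        for s in P
    }

def value_iteration(P, gamma, eps=1e-8):
    V = {s: 0.0 for s in P}  # zeros instead of a random start
    while True:
        V_next = bellman_backup(V, P, gamma)
        # Stop once the sup-norm change falls below the tolerance.
        if max(abs(V_next[s] - V[s]) for s in P) < eps:
            return V_next
        V = V_next

V_star = value_iteration(P, GAMMA)
```

In this toy problem the best behaviour is to always take action 1, so the result should be close to $V(1) = 2 / (1 - 0.9) = 20$ and $V(0) = 1 + 0.9 \cdot 20 = 19$. Note that the backup needs the transition model `P` explicitly.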

10 From previous tutorial: Reinforcement Learning, Exploration, No supervision, Agent-Reward-Environment, Policy, MDP, Consistency Equation, Optimal Policy, Optimality Condition, Bellman Backup Operator, Iterative Solution

11 Dynamic Programming Solution
Initialize $V^0$ randomly
do $V^{t+1} = T V^t$ until $\lVert V^{t+1} - V^t \rVert_\infty < \epsilon$
return $V^{t+1}$
where $T : V \to V$, $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$
Problem?

12 Learning from rollouts Step 1: gather experience using a behaviour policy. Step 2: update value functions of an estimation policy.

13 On-Policy and Off-Policy On-policy methods: behaviour and estimation policy are the same. Off-policy methods: behaviour and estimation policy can be different. Advantage?

14 Behaviour Policy Encourage exploration of the search space.
Epsilon-greedy policy:
$\pi(s, a) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A(s)|} & a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A(s)|} & \text{otherwise} \end{cases}$
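A minimal sketch of the epsilon-greedy rule, assuming the action values for one state are given as a plain list indexed by action. The function names are hypothetical.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(s, .) for one state: epsilon/|A| everywhere,
    plus an extra 1 - epsilon on the greedy action."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon, rng=random):
    """Draw one action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(n_actions := len(q_values)), weights=probs, k=1)[0]

example_probs = epsilon_greedy_probs([0.1, 0.5, 0.2], epsilon=0.3)
```

With three actions and $\epsilon = 0.3$, the greedy action gets probability $1 - 0.3 + 0.1 = 0.8$ and each other action gets $0.1$.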

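15 Temporal Difference Method
$Q^{\pi}(s, a) = E\left[ \sum_{t \ge 0} \gamma^t r_{t+1} \,\middle|\, s_1 = s, a_1 = a \right]$
$= E\left[ r_1 + \gamma \sum_{t \ge 0} \gamma^t r_{t+2} \,\middle|\, s_1 = s, a_1 = a \right]$
$= E\left[ r_1 + \gamma Q^{\pi}(s_2, a_2) \,\middle|\, s_1 = s, a_1 = a \right]$
Update: $Q^{\pi}(s, a) = (1 - \alpha) Q^{\pi}(s, a) + \alpha \left( r_1 + \gamma Q^{\pi}(s_2, a_2) \right)$
Combination of Monte Carlo and dynamic programming.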
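The update rule on this slide is a weighted average of the old estimate and a one-step target. The sketch below applies it to a single state-action pair with made-up numbers (reward 1, next value 2, $\alpha = 0.5$, $\gamma = 0.9$); the function name and all values are assumptions for illustration.

```python
def td_update(q_sa, reward, q_next, alpha, gamma):
    """Return (1 - alpha) * Q(s,a) + alpha * (r + gamma * Q(s2,a2))."""
    return (1.0 - alpha) * q_sa + alpha * (reward + gamma * q_next)

# One update from Q(s,a) = 0 moves halfway to the target 1 + 0.9 * 2 = 2.8.
q_one_step = td_update(0.0, reward=1.0, q_next=2.0, alpha=0.5, gamma=0.9)

# Repeating the same update drives the estimate to the target itself.
q = 0.0
for _ in range(200):
    q = td_update(q, reward=1.0, q_next=2.0, alpha=0.5, gamma=0.9)
```

In a real run the target changes from step to step because both the reward and $Q(s_2, a_2)$ are sampled, so the estimate tracks an average of targets rather than a single fixed value.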

16 SARSA Converges with probability 1 to an optimal policy as long as all state-action pairs are visited infinitely many times and epsilon eventually decays to 0, i.e. the policy becomes greedy. On or off?
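A minimal tabular SARSA sketch, using the temporal difference update with the next action chosen by the same epsilon-greedy policy that generates the data. The deterministic two-state environment, the linear epsilon decay, the step size and the function names are all assumptions chosen to keep the example small.

```python
import random

GAMMA, ALPHA = 0.9, 0.1
# Hypothetical toy environment: the next state equals the action taken.
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

def step(state, action):
    return REWARD[(state, action)], action

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.randrange(2)  # explore: uniform over both actions
    return max((0, 1), key=lambda a: Q[(state, a)])  # exploit

def sarsa(num_steps=20000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = 0
    a = epsilon_greedy(Q, s, 1.0, rng)
    for t in range(num_steps):
        epsilon = 1.0 - t / num_steps  # decays linearly towards 0
        r, s2 = step(s, a)
        a2 = epsilon_greedy(Q, s2, epsilon, rng)
        # The target uses the action a2 that will actually be taken next.
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * Q[(s2, a2)])
        s, a = s2, a2
    return Q

Q_sarsa = sarsa()
```

In this toy problem action 1 is better in both states, so the learned table should rank it above action 0 in each state.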

17 Q-Learning Resemblance to Bellman optimality condition:
$Q^{\pi^*}(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^{\pi^*}(s', a') \}$
For proof of convergence see: On or off?
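A minimal tabular Q-learning sketch. The target takes a max over next actions, mirroring the optimality condition, while the data here comes from a uniformly random behaviour policy. The toy environment, step size and function names are assumptions for illustration.

```python
import random

GAMMA = 0.9
# Hypothetical toy environment: the next state equals the action taken.
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

def step(state, action):
    return REWARD[(state, action)], action

def q_learning(num_steps=20000, alpha=0.5, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    s = 0
    for _ in range(num_steps):
        a = rng.randrange(2)  # behaviour policy: uniformly random
        r, s2 = step(s, a)
        # Target uses the greedy value of the next state, not the action
        # the behaviour policy will take there.
        target = r + GAMMA * max(Q[(s2, b)] for b in (0, 1))
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
        s = s2
    return Q

Q_star = q_learning()
```

For this deterministic toy problem the optimal values can be worked out by hand: $Q(1,1) = 20$, $Q(0,1) = 19$, and $Q(0,0) = Q(1,0) = 17.1$.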

18 Summary SARSA and Q-Learning On vs Off policy. Epsilon greedy policy.

19 What we learned Solving Reinforcement Learning Dynamic Programming Soln. Temporal Difference Learning Bellman Backup Operator SARSA Q-Learning Iterative Solution

20 Another Approach So far the policy is implicitly defined using value functions. Can't we work with policies directly?

21 Policy Gradient Methods
Parameterized policy: $\pi_\theta(s, a)$
Optimization: $\max_\theta J(\theta)$ where $J(\theta) = E_{\pi_\theta(s,a)}\left[ \sum_t \gamma^t r_{t+1} \right]$
Gradient ascent on $J(\theta)$. Smoothly evolving policy. Obtaining gradient estimator? On or off?

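22 Finite Difference
$\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2 \epsilon}$
$\theta_i^{t+1} \leftarrow \theta_i^t + \alpha_t \frac{\partial J(\theta^t)}{\partial \theta_i^t}$
Easy to implement and works for all policies. Problem?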
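A small sketch of the finite difference estimator followed by the parameter update. The objective `J` here is a hypothetical smooth function standing in for the expected return, which in practice would be estimated from rollouts; the step size and perturbation size are also assumptions.

```python
def finite_difference_gradient(J, theta, eps=1e-4):
    """Central difference estimate of dJ/dtheta_i for every coordinate i."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2 * eps))
    return grad

def J(theta):
    # Hypothetical objective with its maximum at theta = [1.0, -0.5].
    return -(theta[0] - 1.0) ** 2 - 2.0 * (theta[1] + 0.5) ** 2

grad_at_origin = finite_difference_gradient(J, [0.0, 0.0])

theta = [0.0, 0.0]
for _ in range(200):
    g = finite_difference_gradient(J, theta)
    theta = [th + 0.1 * gi for th, gi in zip(theta, g)]  # ascent step
```

Each gradient estimate costs two evaluations of `J` per parameter, which is cheap here but expensive when every evaluation requires running rollouts.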

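23 Likelihood Ratio Trick
$J(\theta) = E_{t \sim p_\theta(t')}[R(t)] = \sum_t R(t) p_\theta(t)$, and we want $\max_\theta J(\theta)$
$\nabla_\theta J(\theta) = \sum_t R(t) \nabla_\theta p_\theta(t)$
$= \sum_t R(t) p_\theta(t) \nabla_\theta \log p_\theta(t)$
$= E_{t \sim p_\theta(t')}[R(t) \nabla_\theta \log p_\theta(t)]$
$= E_{t \sim p_\theta(t')}[(R(t) - b) \nabla_\theta \log p_\theta(t)] \quad \forall b$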
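The identity can be checked numerically on a tiny case. The sketch below assumes only three possible trajectories, drawn from a softmax distribution over parameters `theta_demo`, so that the expectation can be computed exactly by enumeration. All names and numbers are hypothetical.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_prob(theta, k):
    """Gradient of log p_theta(k) for a softmax: indicator(k) - p."""
    p = softmax(theta)
    return [(1.0 if i == k else 0.0) - p[i] for i in range(len(theta))]

def likelihood_ratio_gradient(theta, returns, baseline=0.0):
    """Exact E_{t ~ p_theta}[(R(t) - b) grad log p_theta(t)] by enumeration."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for k, (pk, Rk) in enumerate(zip(p, returns)):
        g = grad_log_prob(theta, k)
        for i in range(len(theta)):
            grad[i] += pk * (Rk - baseline) * g[i]
    return grad

theta_demo = [0.2, -0.1, 0.5]
returns_demo = [1.0, 0.0, 3.0]
grad_plain = likelihood_ratio_gradient(theta_demo, returns_demo)
grad_baseline = likelihood_ratio_gradient(theta_demo, returns_demo, baseline=5.0)
```

The two gradients agree up to rounding because $E[\nabla_\theta \log p_\theta(t)] = 0$. In practice the expectation is replaced by an average over sampled trajectories, and a well-chosen $b$ reduces the variance of that average.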

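24 Reinforce (Multi Step)
Policy gradient theorem: $\nabla_\theta J(\theta) = E_{\pi_\theta(s,a)}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]$
initialize $\theta$
for each episode $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle \sim \pi_\theta$:
  for $t \in \{1, \ldots, n\}$:
    $v_t \leftarrow$ sample of $Q^{\pi_\theta}(s_t, a_t)$
    $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) \, v_t$
return $\theta$
(content from David Silver)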
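A compact sketch of this loop, assuming a tabular softmax policy with one parameter per state-action pair and using the discounted return from step $t$ onwards as the sample $v_t$. The toy environment, horizon, step size and function names are assumptions for illustration; the inner loop runs backwards through the episode only so that returns can be accumulated in one pass.

```python
import math
import random

GAMMA, ALPHA, HORIZON = 0.9, 0.01, 10
# Hypothetical toy environment: the next state equals the action taken.
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

def policy_probs(theta, s):
    """Softmax policy pi_theta(s, .) over the two actions."""
    exps = [math.exp(x) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(theta, rng):
    s, episode = 0, []
    for _ in range(HORIZON):
        a = rng.choices((0, 1), weights=policy_probs(theta, s))[0]
        episode.append((s, a, REWARD[(s, a)]))
        s = a
    return episode

def reinforce(num_episodes=3000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_episodes):
        G = 0.0
        for s, a, r in reversed(run_episode(theta, rng)):
            G = r + GAMMA * G  # discounted return from this step: v_t
            probs = policy_probs(theta, s)
            for b in (0, 1):
                # d log pi(s,a) / d theta[s][b] = indicator(b == a) - pi(s,b)
                theta[s][b] += ALPHA * G * ((1.0 if b == a else 0.0) - probs[b])
    return theta

theta_learned = reinforce()
```

Because action 1 yields the higher return in both states of this toy problem, the learned policy should put most of its probability on action 1.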

25 Summary SARSA and Q-Learning On vs Off policy. Epsilon greedy policy. Policy Gradient Methods

26 What we learned Solving Reinforcement Learning Dynamic Programming Soln. Temporal Difference Learning Bellman Backup Operator SARSA Q-Learning Iterative Solution Policy Gradient Methods Finite difference method Reinforce

27 What we did not cover: Generalized policy iteration, Simple Monte Carlo solution, TD($\lambda$) algorithm, Convergence of Q-learning and SARSA, Actor-critic method

28 Application

29 Playing Atari game with Deep RL State is given by raw images. Learn a good policy for a given game.

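30 Playing Atari game with Deep RL
Approximate the optimal action-value function with a network: $Q(s, a, \theta) \approx Q^{\pi^*}(s, a)$
$Q^{\pi^*}(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^{\pi^*}(s', a') \} = R^a_{s,s'} + \gamma \max_{a'} Q^{\pi^*}(s', a')$ (the second equality holds when transitions are deterministic)
Network (input to output): conv + relu, conv + relu, FC + relu, FC, giving $Q(s, a, \theta)$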

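31 Playing Atari game with Deep RL
$Q^{\pi^*}(s, a) = R^a_{s,s'} + \gamma \max_{a'} Q^{\pi^*}(s', a')$
$Q(s, a, \theta) \to R^a_{s,s'} + \gamma \max_{a'} Q(s', a', \theta)$
$\min_{\theta_t} \left( Q(s, a, \theta_t) - R^a_{s,s'} - \gamma \max_{a'} Q(s', a', \theta_{t-1}) \right)^2$
Network (input to output): conv + relu, conv + relu, FC + relu, FC, giving $Q(s, a, \theta)$
Nothing deep about their RL.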
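The squared-error objective can be sketched without a neural network by letting a lookup table stand in for $Q(s, a, \theta)$. Everything here (the table representation, the function names, the sample transitions) is an assumption for illustration; the point is only that the target is computed from the previous parameters $\theta_{t-1}$ while the prediction uses $\theta_t$.

```python
def q_value(theta, s, a):
    # Hypothetical tabular stand-in for the network output Q(s, a, theta).
    return theta[(s, a)]

def dqn_loss(theta, theta_prev, batch, gamma, actions=(0, 1)):
    """Mean of (Q(s,a,theta) - r - gamma * max_a' Q(s',a',theta_prev))^2."""
    total = 0.0
    for s, a, r, s2 in batch:
        target = r + gamma * max(q_value(theta_prev, s2, b) for b in actions)
        total += (q_value(theta, s, a) - target) ** 2
    return total / len(batch)

zeros = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
batch = [(0, 1, 1.0, 1), (1, 1, 2.0, 1)]  # (s, a, r, s') transitions
loss_at_zero = dqn_loss(zeros, zeros, batch, gamma=0.9)

# A table that already satisfies the optimality condition has zero loss.
optimal = {(0, 0): 17.1, (0, 1): 19.0, (1, 0): 17.1, (1, 1): 20.0}
loss_at_optimum = dqn_loss(optimal, optimal, batch, gamma=0.9)
```

With all values at zero the targets are just the rewards, so the loss is $(1^2 + 2^2) / 2 = 2.5$.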

32 Playing Atari game with Deep RL

33 Playing Atari game with Deep RL Why replay memory? To break the correlation between consecutive data points.
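A minimal replay memory sketch: transitions are stored in a bounded buffer and training batches are drawn uniformly at random, so consecutive frames rarely end up in the same batch. The class name, capacity and transition format are assumptions for illustration.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s') transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s2):
        self.buffer.append((s, a, r, s2))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=5, seed=0)
for i in range(10):
    memory.push(i, 0, 0.0, i + 1)
minibatch = memory.sample(3)
```

After ten pushes into a buffer of capacity five, only the five most recent transitions remain available for sampling.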

34 Playing Atari game with Deep RL

35 Why Deep RL is hard
$Q^{\pi^*}(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^{\pi^*}(s', a') \}$
The recursive equation blows up when the difference between $s$ and $s'$ is small. Too many iterations are required for convergence: 10 million frames for an Atari game. It may take too long to see a high-reward action.

36 Learning to Search It may take too long to see a high reward. Ease the learning using a reference policy. Exploit the reference policy to search the space better: $\pi(s, a)$, $\pi^{\mathrm{ref}}(s, a)$ (figure: states $s_1, \ldots, s_i, \ldots, s_n$)

37 Summary SARSA and Q-Learning On vs Off policy. Epsilon greedy policy. Policy Gradient Methods Playing Atari game using deep reinforcement learning Why deep RL is hard. Learning to search.
