Reinforcement Learning Part 2
Dipendra Misra, Cornell University
dkm@cs.cornell.edu
https://dipendramisra.wordpress.com/
From previous tutorial: reinforcement learning (exploration, no supervision), the agent-reward-environment loop, policies, MDPs, the consistency equation, optimal policies, the optimality condition, the Bellman backup operator, and its iterative solution.
Interaction with the environment: the agent sends an action and receives a scalar reward plus a new state of the environment. (Setup from Lenz et al. 2014.)
Rollout: $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. (Setup from Lenz et al. 2014.)
Setup: at time $t$, taking action $a_t$ moves the environment $s_t \to s_{t+1}$ and yields reward $r_t$, e.g., \$1.
Policy: $\pi(s, a) = 0.9$, i.e., the policy assigns probability 0.9 to taking action $a$ in state $s$.
From previous tutorial. An optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^{\pi}(s) \;\; \forall s \in S$, for every policy $\pi$.
Bellman's self-consistency equation: $V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V^{\pi}(s') \right\}$
Bellman's optimality condition: $V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V^{*}(s') \right\}$
Solving an MDP. To solve an MDP (or an RL problem) is to find an optimal policy.
Dynamic Programming Solution.
Initialize $V_0$ randomly; do $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_{\infty} \le \epsilon$; return $V_{t+1}$.
Here $T : V \to V$ is the Bellman backup operator: $(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V(s') \right\}$
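A minimal Python sketch of this loop, assuming a small tabular MDP given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (rewards); these array names and the tabular setting are illustrative assumptions, not from the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Apply the Bellman backup T until ||V_{t+1} - V_t||_inf <= eps.
    P[s, a, s']: transition probabilities, R[s, a, s']: rewards."""
    V = np.random.rand(P.shape[0])  # initialize V_0 randomly
    while True:
        # (TV)(s) = max_a sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        TV = np.max(np.sum(P * (R + gamma * V[None, None, :]), axis=2), axis=1)
        if np.max(np.abs(TV - V)) <= eps:
            return TV
        V = TV
```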
Dynamic Programming Solution (revisited). Initialize $V_0$ randomly; do $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_{\infty} \le \epsilon$; return $V_{t+1}$, with $(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V(s') \right\}$. Problem? The backup needs the transition probabilities $P^{a}_{s,s'}$ and rewards $R^{a}_{s,s'}$, which are unknown in the RL setting.
Learning from rollouts. Step 1: gather experience using a behaviour policy. Step 2: update the value function of an estimation policy.
On-Policy and Off-Policy. On-policy methods: the behaviour and estimation policies are the same. Off-policy methods: the behaviour and estimation policies can differ. Advantage?
Behaviour Policy. Encourage exploration of the search space with an epsilon-greedy policy: $\pi(s, a) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A(s)|} & \text{otherwise} \end{cases}$
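A direct translation of this rule, assuming a tabular array `Q[s, a]`; a sketch, not code from the slides. Drawing a uniform action with probability $\epsilon$ reproduces the two cases above, since the greedy action can also be drawn at random.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Greedy action with prob. 1 - eps + eps/|A(s)|, each other
    action with prob. eps/|A(s)| (via a uniform draw with prob. eps)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore uniformly
    return int(np.argmax(Q[s]))               # exploit
```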
Temporal Difference Method.
$Q^{\pi}(s, a) = E\left[\sum_{t \ge 0} \gamma^{t} r_{t+1} \mid s_1 = s, a_1 = a\right] = E\left[r_1 + \gamma \left(\sum_{t \ge 0} \gamma^{t} r_{t+2}\right) \mid s_1 = s, a_1 = a\right] = E\left[r_1 + \gamma Q^{\pi}(s_2, a_2) \mid s_1 = s, a_1 = a\right]$
TD update: $Q^{\pi}(s, a) \leftarrow (1 - \alpha)\, Q^{\pi}(s, a) + \alpha \left(r_1 + \gamma Q^{\pi}(s_2, a_2)\right)$, a combination of Monte Carlo and dynamic programming.
SARSA. Converges w.p. 1 to an optimal policy as long as all state-action pairs are visited infinitely many times and epsilon eventually decays to 0, i.e., the policy becomes greedy in the limit. On- or off-policy?
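On-policy: the epsilon-greedy policy that generates the data is also the one whose value is updated. A tabular sketch reusing the `epsilon_greedy` helper above; the `env.reset()` / `env.step()` interface returning `(next_state, reward, done)` is an assumption for illustration.

```python
def sarsa_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.9):
    """One SARSA episode, applying the TD update from the previous slide:
    Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma Q(s', a'))."""
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, epsilon)  # next action from the SAME policy
        target = r + gamma * (0.0 if done else Q[s2, a2])
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s2, a2
```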
Q-Learning. Resemblance to the Bellman optimality condition: $Q^{*}(s, a) = \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \right\}$ For a proof of convergence see: http://users.isr.ist.utl.pt/~mtjspaan/readinggroup/proofqlearning.pdf On- or off-policy?
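The same loop with one change: the backup bootstraps from $\max_{a'} Q(s', a')$ rather than from the action the behaviour policy actually takes, which is why Q-Learning is off-policy. Same assumed interface as the SARSA sketch.

```python
import numpy as np

def q_learning_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.9):
    """Behaviour is epsilon-greedy, but the target r + gamma * max_a' Q(s', a')
    evaluates the greedy (estimation) policy."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, epsilon)
        s2, r, done = env.step(a)
        target = r + gamma * (0.0 if done else np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
```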
Summary: SARSA and Q-Learning; on- vs off-policy; the epsilon-greedy policy.
What we learned: solving reinforcement learning via dynamic programming (the Bellman backup operator and its iterative solution) and via temporal difference learning (SARSA, Q-Learning).
Another Approach. So far the policy is implicitly defined through value functions. Can't we work with policies directly?
Policy Gradient Methods. Parameterize the policy as $\pi_{\theta}(s, a)$ and optimize $\max_{\theta} J(\theta)$ where $J(\theta) = E_{\pi_{\theta}(s,a)}\left[\sum_{t} \gamma^{t} r_{t+1}\right]$. Gradient ascent gives a smoothly evolving policy. How do we obtain a gradient estimator? On- or off-policy?
Finite Difference Method. $\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2 \epsilon}, \qquad \theta^{t+1}_i \leftarrow \theta^{t}_i + \alpha \frac{\partial J(\theta^{t})}{\partial \theta^{t}_i}$ Easy to implement and works for all policies. Problem?
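A sketch of the two-sided estimator; `estimate_return` is a hypothetical placeholder for averaging rollout returns under $\pi_\theta$.

```python
import numpy as np

def finite_difference_gradient(estimate_return, theta, delta=1e-2):
    """dJ/d(theta_i) ~ (J(theta + delta e_i) - J(theta - delta e_i)) / (2 delta),
    estimated one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e_i = np.zeros_like(theta)
        e_i[i] = delta
        grad[i] = (estimate_return(theta + e_i) - estimate_return(theta - e_i)) / (2 * delta)
    return grad
```

The loop makes the "Problem?" concrete: each gradient needs $2 \cdot \dim(\theta)$ return estimates, every one of them a noisy rollout average, which becomes prohibitive for large parameter vectors.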
Likelihood Ratio Trick. $J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)] = \sum_{\tau} R(\tau)\, p_{\theta}(\tau), \qquad \max_{\theta} J(\theta)$
$\nabla_{\theta} J(\theta) = \sum_{\tau} R(\tau)\, \nabla_{\theta} p_{\theta}(\tau) = \sum_{\tau} R(\tau)\, p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)] = E_{\tau \sim p_{\theta}(\tau)}[(R(\tau) - b)\, \nabla_{\theta} \log p_{\theta}(\tau)] \quad \forall b$
Reinforce (Multi-Step). Policy gradient theorem: $\nabla_{\theta} J(\theta) = E_{\pi_{\theta}(s,a)}[\nabla_{\theta} \log \pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)]$
Algorithm: initialize $\theta$; for each episode $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle \sim \pi_{\theta}$: for $t \in \{1, \ldots, n\}$: set $v_t$ to the sampled return, an unbiased estimate of $Q^{\pi_{\theta}}(s_t, a_t)$, and update $\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t)\, v_t$; return $\theta$. (Content from David Silver.)
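A sketch of this loop for a tabular softmax policy with logits `theta[s, a]`; the softmax parameterization and the `(next_state, reward, done)` environment interface are assumptions, not from the slides. `v_t` is the discounted return from step `t`, the unbiased sample of $Q^{\pi_{\theta}}(s_t, a_t)$.

```python
import numpy as np

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99, rng=None):
    """theta[s, a]: logits of a softmax policy pi_theta(s, a)."""
    rng = rng or np.random.default_rng()

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())  # numerically stable softmax
        return p / p.sum()

    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                            # roll out one episode with pi_theta
        a = rng.choice(len(theta[s]), p=policy(s))
        s2, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2

    v = 0.0
    for t in reversed(range(len(rewards))):    # v_t = r_t + gamma * v_{t+1}
        v = rewards[t] + gamma * v
        s, a = states[t], actions[t]
        grad_log = -policy(s)                  # d log pi(s, a) / d theta[s, :]
        grad_log[a] += 1.0                     # ... = onehot(a) - pi(s, .)
        theta[s] += alpha * grad_log * v       # theta += alpha * grad log pi * v_t
    return theta
```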
Summary: SARSA and Q-Learning; on- vs off-policy; the epsilon-greedy policy; policy gradient methods.
What we learned: solving reinforcement learning via dynamic programming (the Bellman backup operator and its iterative solution), temporal difference learning (SARSA, Q-Learning), and policy gradient methods (the finite difference method, Reinforce).
What we did not cover: generalized policy iteration, the simple Monte Carlo solution, the TD($\lambda$) algorithm, convergence proofs for Q-Learning and SARSA, and actor-critic methods.
Application
Playing Atari games with Deep RL. The state is given by raw images; the goal is to learn a good policy for a given game.
Playing Atari games with Deep RL. Approximate $Q^{*}(s, a)$ with a network $Q(s, a, \theta)$: $Q^{*}(s, a) = \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \right\} = R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a')$, since the transitions are deterministic. Architecture: conv + relu → conv + relu → FC + relu → FC → $Q(s, a, \theta)$.
Playing Atari games with Deep RL. From $Q^{*}(s, a) = R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a')$, train $Q(s, a, \theta) \to R^{a}_{s,s'} + \gamma \max_{a'} Q(s', a', \theta)$ by minimizing $\left(Q(s, a, \theta_t) - R^{a}_{s,s'} - \gamma \max_{a'} Q(s', a', \theta_{t-1})\right)^2$ with the same conv + relu / FC architecture. Note there is nothing deep about their RL: the algorithm is standard Q-learning, only the function approximator is deep.
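A PyTorch sketch of this squared TD error, assuming `q_net` holds the current parameters $\theta_t$ and `target_net` a frozen copy $\theta_{t-1}$, each mapping a batch of states to per-action Q-values; the network names, batched tensors, and the `done` mask that zeroes the bootstrap at episode ends are illustrative details, not from the slides.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
    """(Q(s, a; theta_t) - r - gamma * max_a' Q(s', a'; theta_{t-1}))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta_t)
    with torch.no_grad():                                 # theta_{t-1} stays fixed
        bootstrap = target_net(s2).max(dim=1).values
        target = r + gamma * (1.0 - done) * bootstrap     # no bootstrap at terminal s'
    return F.mse_loss(q_sa, target)
```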
Playing Atari games with Deep RL. Why replay memory? To break the correlation between consecutive datapoints.
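A minimal replay memory sketch: transitions go into a ring buffer and training draws uniform random minibatches, so consecutive, highly correlated frames do not appear together in a batch.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)
```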
Why Deep RL is hard. $Q^{*}(s, a) = \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \right\}$ The recursive equation blows up errors because the difference between $s$ and $s'$ is small. Too many iterations are required for convergence: 10 million frames for an Atari game. It may take too long to see a high-reward action.
Learning to Search. It may take too long to see a high reward, so ease learning with a reference policy: exploit a reference policy $\pi_{\mathrm{ref}}(s, a)$ alongside the learned policy $\pi(s, a)$ to search the space better.
Summary: SARSA and Q-Learning; on- vs off-policy; the epsilon-greedy policy; policy gradient methods; playing Atari games with deep reinforcement learning; why deep RL is hard; learning to search.