Reinforcement Learning Part 2
Dipendra Misra, Cornell University
dkm@cs.cornell.edu
https://dipendramisra.wordpress.com/
From previous tutorial: reinforcement learning (exploration, no supervision), the agent-reward-environment loop, policies, MDPs, the consistency equation, optimal policies, the optimality condition, the Bellman backup operator, and its iterative solution.
Interaction with the environment: the agent sends an action and receives a scalar reward plus a new state of the environment. (Setup from Lenz et al. 2014.)
Rollout: $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. (Setup from Lenz et al. 2014.)
Setup: at time $t$, taking action $a_t$ moves the environment $s_t \to s_{t+1}$ and yields reward $r_t$, e.g., \$1.
Policy: $\pi(s, a) = 0.9$, i.e., the policy assigns probability 0.9 to taking action $a$ in state $s$.
From previous tutorial. An optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^{\pi}(s) \;\; \forall s \in S$, for every policy $\pi$.
Bellman's self-consistency equation: $V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V^{\pi}(s') \right\}$
Bellman's optimality condition: $V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V^{*}(s') \right\}$
Solving an MDP. To solve an MDP (or an RL problem) is to find an optimal policy.
Dynamic Programming Solution.
Initialize $V_0$ randomly; do $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_{\infty} \le \epsilon$; return $V_{t+1}$.
Here $T : V \to V$ is the Bellman backup operator: $(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V(s') \right\}$
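A minimal Python sketch of this loop, assuming a small tabular MDP given as arrays `P[s, a, s']` (transition probabilities) and `R[s, a, s']` (rewards); these array names and the tabular setting are illustrative assumptions, not from the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Apply the Bellman backup T until ||V_{t+1} - V_t||_inf <= eps.
    P[s, a, s']: transition probabilities, R[s, a, s']: rewards."""
    V = np.random.rand(P.shape[0])  # initialize V_0 randomly
    while True:
        # (TV)(s) = max_a sum_{s'} P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        TV = np.max(np.sum(P * (R + gamma * V[None, None, :]), axis=2), axis=1)
        if np.max(np.abs(TV - V)) <= eps:
            return TV
        V = TV
```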
Dynamic Programming Solution (revisited). Initialize $V_0$ randomly; do $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_{\infty} \le \epsilon$; return $V_{t+1}$, with $(TV)(s) = \max_{a} \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma V(s') \right\}$. Problem? The backup needs the transition probabilities $P^{a}_{s,s'}$ and rewards $R^{a}_{s,s'}$, which are unknown in the RL setting.
Learning from rollouts. Step 1: gather experience using a behaviour policy. Step 2: update the value function of an estimation policy.
On-Policy and Off-Policy. On-policy methods: the behaviour and estimation policies are the same. Off-policy methods: the behaviour and estimation policies can differ. Advantage?
Behaviour Policy. Encourage exploration of the search space with an epsilon-greedy policy: $\pi(s, a) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A(s)|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A(s)|} & \text{otherwise} \end{cases}$
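A direct translation of this rule, assuming a tabular array `Q[s, a]`; a sketch, not code from the slides. Drawing a uniform action with probability $\epsilon$ reproduces the two cases above, since the greedy action can also be drawn at random.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Greedy action with prob. 1 - eps + eps/|A(s)|, each other
    action with prob. eps/|A(s)| (via a uniform draw with prob. eps)."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore uniformly
    return int(np.argmax(Q[s]))               # exploit
```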
Temporal Difference Method.
$Q^{\pi}(s, a) = E\left[\sum_{t \ge 0} \gamma^{t} r_{t+1} \mid s_1 = s, a_1 = a\right] = E\left[r_1 + \gamma \left(\sum_{t \ge 0} \gamma^{t} r_{t+2}\right) \mid s_1 = s, a_1 = a\right] = E\left[r_1 + \gamma Q^{\pi}(s_2, a_2) \mid s_1 = s, a_1 = a\right]$
TD update: $Q^{\pi}(s, a) \leftarrow (1 - \alpha)\, Q^{\pi}(s, a) + \alpha \left(r_1 + \gamma Q^{\pi}(s_2, a_2)\right)$, a combination of Monte Carlo and dynamic programming.
SARSA. Converges w.p. 1 to an optimal policy as long as all state-action pairs are visited infinitely many times and epsilon eventually decays to 0, i.e., the policy becomes greedy in the limit. On- or off-policy?
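On-policy: the epsilon-greedy policy that generates the data is also the one whose value is updated. A tabular sketch reusing the `epsilon_greedy` helper above; the `env.reset()` / `env.step()` interface returning `(next_state, reward, done)` is an assumption for illustration.

```python
def sarsa_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.9):
    """One SARSA episode, applying the TD update from the previous slide:
    Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma Q(s', a'))."""
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, epsilon)  # next action from the SAME policy
        target = r + gamma * (0.0 if done else Q[s2, a2])
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s2, a2
```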
Q-Learning. Resemblance to the Bellman optimality condition: $Q^{*}(s, a) = \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \right\}$ For a proof of convergence see: http://users.isr.ist.utl.pt/~mtjspaan/readinggroup/proofqlearning.pdf On- or off-policy?
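The same loop with one change: the backup bootstraps from $\max_{a'} Q(s', a')$ rather than from the action the behaviour policy actually takes, which is why Q-Learning is off-policy. Same assumed interface as the SARSA sketch.

```python
import numpy as np

def q_learning_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.9):
    """Behaviour is epsilon-greedy, but the target r + gamma * max_a' Q(s', a')
    evaluates the greedy (estimation) policy."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, epsilon)
        s2, r, done = env.step(a)
        target = r + gamma * (0.0 if done else np.max(Q[s2]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s2
```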
Summary: SARSA and Q-Learning; on- vs off-policy; the epsilon-greedy policy.
What we learned: solving reinforcement learning via dynamic programming (the Bellman backup operator and its iterative solution) and via temporal difference learning (SARSA, Q-Learning).
Another Approach. So far the policy is implicitly defined through value functions. Can't we work with policies directly?
Policy Gradient Methods. Parameterize the policy as $\pi_{\theta}(s, a)$ and optimize $\max_{\theta} J(\theta)$ where $J(\theta) = E_{\pi_{\theta}(s,a)}\left[\sum_{t} \gamma^{t} r_{t+1}\right]$. Gradient ascent gives a smoothly evolving policy. How do we obtain a gradient estimator? On- or off-policy?
Finite Difference Method. $\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2 \epsilon}, \qquad \theta^{t+1}_i \leftarrow \theta^{t}_i + \alpha \frac{\partial J(\theta^{t})}{\partial \theta^{t}_i}$ Easy to implement and works for all policies. Problem?
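A sketch of the two-sided estimator; `estimate_return` is a hypothetical placeholder for averaging rollout returns under $\pi_\theta$.

```python
import numpy as np

def finite_difference_gradient(estimate_return, theta, delta=1e-2):
    """dJ/d(theta_i) ~ (J(theta + delta e_i) - J(theta - delta e_i)) / (2 delta),
    estimated one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e_i = np.zeros_like(theta)
        e_i[i] = delta
        grad[i] = (estimate_return(theta + e_i) - estimate_return(theta - e_i)) / (2 * delta)
    return grad
```

The loop makes the "Problem?" concrete: each gradient needs $2 \cdot \dim(\theta)$ return estimates, every one of them a noisy rollout average, which becomes prohibitive for large parameter vectors.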
Likelihood Ratio Trick. $J(\theta) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)] = \sum_{\tau} R(\tau)\, p_{\theta}(\tau), \qquad \max_{\theta} J(\theta)$
$\nabla_{\theta} J(\theta) = \sum_{\tau} R(\tau)\, \nabla_{\theta} p_{\theta}(\tau) = \sum_{\tau} R(\tau)\, p_{\theta}(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau) = E_{\tau \sim p_{\theta}(\tau)}[R(\tau)\, \nabla_{\theta} \log p_{\theta}(\tau)] = E_{\tau \sim p_{\theta}(\tau)}[(R(\tau) - b)\, \nabla_{\theta} \log p_{\theta}(\tau)] \quad \forall b$
Reinforce (Multi-Step). Policy gradient theorem: $\nabla_{\theta} J(\theta) = E_{\pi_{\theta}(s,a)}[\nabla_{\theta} \log \pi_{\theta}(s, a)\, Q^{\pi_{\theta}}(s, a)]$
Algorithm: initialize $\theta$; for each episode $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle \sim \pi_{\theta}$: for $t \in \{1, \ldots, n\}$: set $v_t$ to the sampled return, an unbiased estimate of $Q^{\pi_{\theta}}(s_t, a_t)$, and update $\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t)\, v_t$; return $\theta$. (Content from David Silver.)
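A sketch of this loop for a tabular softmax policy with logits `theta[s, a]`; the softmax parameterization and the `(next_state, reward, done)` environment interface are assumptions, not from the slides. `v_t` is the discounted return from step `t`, the unbiased sample of $Q^{\pi_{\theta}}(s_t, a_t)$.

```python
import numpy as np

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99, rng=None):
    """theta[s, a]: logits of a softmax policy pi_theta(s, a)."""
    rng = rng or np.random.default_rng()

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())  # numerically stable softmax
        return p / p.sum()

    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                            # roll out one episode with pi_theta
        a = rng.choice(len(theta[s]), p=policy(s))
        s2, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s2

    v = 0.0
    for t in reversed(range(len(rewards))):    # v_t = r_t + gamma * v_{t+1}
        v = rewards[t] + gamma * v
        s, a = states[t], actions[t]
        grad_log = -policy(s)                  # d log pi(s, a) / d theta[s, :]
        grad_log[a] += 1.0                     # ... = onehot(a) - pi(s, .)
        theta[s] += alpha * grad_log * v       # theta += alpha * grad log pi * v_t
    return theta
```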
Summary: SARSA and Q-Learning; on- vs off-policy; the epsilon-greedy policy; policy gradient methods.
What we learned: solving reinforcement learning via dynamic programming (the Bellman backup operator and its iterative solution), temporal difference learning (SARSA, Q-Learning), and policy gradient methods (the finite difference method, Reinforce).
What we did not cover: generalized policy iteration, the simple Monte Carlo solution, the TD($\lambda$) algorithm, convergence proofs for Q-Learning and SARSA, and actor-critic methods.
Application
Playing Atari games with Deep RL. The state is given by raw images; the goal is to learn a good policy for a given game.
Playing Atari games with Deep RL. Approximate $Q^{*}(s, a)$ with a network $Q(s, a, \theta)$: $Q^{*}(s, a) = \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \right\} = R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a')$, since the transitions are deterministic. Architecture: conv + relu → conv + relu → FC + relu → FC → $Q(s, a, \theta)$.
Playing Atari games with Deep RL. From $Q^{*}(s, a) = R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a')$, train $Q(s, a, \theta) \to R^{a}_{s,s'} + \gamma \max_{a'} Q(s', a', \theta)$ by minimizing $\left(Q(s, a, \theta_t) - R^{a}_{s,s'} - \gamma \max_{a'} Q(s', a', \theta_{t-1})\right)^2$ with the same conv + relu / FC architecture. Note there is nothing deep about their RL: the algorithm is standard Q-learning, only the function approximator is deep.
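A PyTorch sketch of this squared TD error, assuming `q_net` holds the current parameters $\theta_t$ and `target_net` a frozen copy $\theta_{t-1}$, each mapping a batch of states to per-action Q-values; the network names, batched tensors, and the `done` mask that zeroes the bootstrap at episode ends are illustrative details, not from the slides.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
    """(Q(s, a; theta_t) - r - gamma * max_a' Q(s', a'; theta_{t-1}))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta_t)
    with torch.no_grad():                                 # theta_{t-1} stays fixed
        bootstrap = target_net(s2).max(dim=1).values
        target = r + gamma * (1.0 - done) * bootstrap     # no bootstrap at terminal s'
    return F.mse_loss(q_sa, target)
```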
Playing Atari games with Deep RL. Why replay memory? To break the correlation between consecutive datapoints.
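A minimal replay memory sketch: transitions go into a ring buffer and training draws uniform random minibatches, so consecutive, highly correlated frames do not appear together in a batch.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)
```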
Why Deep RL is hard. $Q^{*}(s, a) = \sum_{s'} P^{a}_{s,s'} \left\{ R^{a}_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \right\}$ The recursive equation blows up errors because the difference between $s$ and $s'$ is small. Too many iterations are required for convergence: 10 million frames for an Atari game. It may take too long to see a high-reward action.
Learning to Search. It may take too long to see a high reward, so ease learning with a reference policy: exploit a reference policy $\pi_{\mathrm{ref}}(s, a)$ alongside the learned policy $\pi(s, a)$ to search the space better.
Summary: SARSA and Q-Learning; on- vs off-policy; the epsilon-greedy policy; policy gradient methods; playing Atari games with deep reinforcement learning; why deep RL is hard; learning to search.