Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/
Task: Grasp the green cup. Output: a sequence of controller actions. (Setup from Lenz et al. 2014)
Supervised Learning: Grasp the green cup. Expert demonstrations. (Setup from Lenz et al. 2014)
Supervised Learning: Grasp the green cup. Problem? Expert demonstrations. (Setup from Lenz et al. 2014)
Supervised Learning: Grasp the cup. Problem? The training data differs from the test data, and there is no exploration.
Exploring the environment
What is reinforcement learning? Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. - R. Sutton and A. Barto
Interaction with the environment: action → reward + new environment. Scalar reward. (Setup from Lenz et al. 2014)
Interaction with the environment: $a_1, r_1, a_2, r_2, \ldots, a_n, r_n$. Episodic vs. non-episodic. (Setup from Lenz et al. 2014)
Rollout: $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. (Setup from Lenz et al. 2014)
Setup: taking action $a_t$ in state $s_t$ gives the transition $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_t$ (e.g., \$1).
Policy: $\pi(s, a)$ is the probability of taking action $a$ in state $s$, e.g., $\pi(s, a) = 0.9$.
Interaction with the environment: action → reward + new environment. Objective? (Setup from Lenz et al. 2014)
Objective: given a rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$, maximize the expected reward $E\left[\sum_{t=1}^{n} r_t\right]$. Problem? (Setup from Lenz et al. 2014)
Discounted Reward: maximizing the expected reward $E\left[\sum_{t=0}^{\infty} r_{t+1}\right]$ is unbounded, so instead discount future reward and maximize $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$ with $\gamma \in [0, 1)$. Problem?
Discounted Reward: maximize the discounted expected reward $E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$. If $r \le M$ and $\gamma \in [0, 1)$, then $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1-\gamma}$.
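A quick numerical sanity check of this bound (a hypothetical sketch, assuming every reward equals the cap $M = 1$ and $\gamma = 0.9$):

```python
# Hypothetical check of the bound: sum_t gamma^t * r <= M / (1 - gamma).
gamma, M, horizon = 0.9, 1.0, 10_000
discounted_return = sum(gamma**t * M for t in range(horizon))
print(discounted_return)   # ~10.0, approaching the bound from below
print(M / (1 - gamma))     # 10.0
```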
Need for discounting To keep the problem well formed Evidence that humans discount future reward
Markov Decision Process: an MDP is a tuple $(S, A, P, R, \gamma)$ where $S$ is a set of finite or infinite states, $A$ is a set of finite or infinite actions, for the transition $s \xrightarrow{a} s'$ the quantity $P^a_{s,s'} \in P$ is the transition probability and $R^a_{s,s'} \in R$ is the reward for the transition (Markov assumption), and $\gamma \in [0, 1]$ is the discount factor.
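One way to hold such a tuple in code (a minimal sketch; the field names and array layout are assumptions, not part of the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    """MDP (S, A, P, R, gamma) with finite states and actions."""
    n_states: int
    n_actions: int
    P: np.ndarray   # P[a, s, s'] = probability of the transition s --a--> s'
    R: np.ndarray   # R[a, s, s'] = reward for the transition s --a--> s'
    gamma: float    # discount factor in [0, 1]
```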
MDP Example Example from Sutton and Barto 1998
Summary: an MDP is a tuple $(S, A, P, R, \gamma)$; maximize the discounted expected reward $E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$; the agent controls the policy $\pi(s, a)$.
What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP
Value functions: expected reward from following a policy. State value function: $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$. State-action value function: $Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, a_1 = a, \pi\right]$.
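These expectations can be estimated by simply rolling out the policy; a sketch, assuming a hypothetical environment function `step(s, a)` returning `(next_state, reward)` and a tabular stochastic policy `pi[s]`:

```python
import numpy as np

def mc_state_value(step, pi, s0, gamma, n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_{t+1} | s_1 = s0, pi]."""
    returns = []
    for _ in range(n_rollouts):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = np.random.choice(len(pi[s]), p=pi[s])  # sample a ~ pi(s, .)
            s, r = step(s, a)                          # environment transition
            g += discount * r
            discount *= gamma
        returns.append(g)
    return np.mean(returns)
```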
State Value function: $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$. The rollout unfolds as $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \cdots$, where each step occurs with probability $\pi(s_t, a_t) P^{a_t}_{s_t, s_{t+1}}$ and yields reward $R^{a_t}_{s_t, s_{t+1}}$.
State Value function: $V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{\tau} (r_1 + \gamma r_2 + \cdots)\, p(\tau)$, where the trajectory $\tau = \langle s_1, a_1, s_2, a_2, \ldots \rangle$ splits as $\tau = \langle s_1, a_1, s_2 \rangle : \tau'$. Then
$V^\pi(s_1) = \sum_{a_1, s_2} \sum_{\tau'} P(s_1, a_1, s_2)\, P(\tau' \mid s_1, a_1, s_2)\, \big( R^{a_1}_{s_1, s_2} + \gamma (r_2 + \gamma r_3 + \cdots) \big)$
$= \sum_{a_1, s_2} P(s_1, a_1, s_2) \Big\{ R^{a_1}_{s_1, s_2} + \gamma \underbrace{\sum_{\tau'} P(\tau' \mid s_1, a_1, s_2)(r_2 + \gamma r_3 + \cdots)}_{V^\pi(s_2)} \Big\}$
$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1, s_2} \big\{ R^{a_1}_{s_1, s_2} + \gamma V^\pi(s_2) \big\}$
Bellman Self-Consistency Eqn: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Similarly, $Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$ and $Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a') Q^\pi(s', a')\}$.
Bellman Self-Consistency Eqn: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Given $N$ states, we have $N$ equations in $N$ variables; solve the above system. Does it have a unique solution? Yes, it does. Exercise: Prove it.
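Because the system is linear, it can be solved directly. A minimal sketch (assuming tabular `P[a, s, s']`, `R[a, s, s']`, `pi[s, a]` numpy arrays; the 2-state example below is purely hypothetical):

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve the self-consistency system V = r_pi + gamma * P_pi V for a tabular MDP."""
    n_states = P.shape[1]
    # Policy-averaged transition matrix and expected one-step reward.
    P_pi = np.einsum("sa,ast->st", pi, P)        # P^pi[s, s']
    r_pi = np.einsum("sa,ast,ast->s", pi, P, R)  # expected immediate reward under pi
    # N linear equations in N variables: (I - gamma * P_pi) V = r_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Hypothetical 2-state, 2-action MDP with every reward equal to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.ones((2, 2, 2))
pi = np.full((2, 2), 0.5)
print(policy_evaluation_exact(P, R, pi, gamma=0.9))  # -> [10. 10.], i.e., M / (1 - gamma)
```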
Optimal Policy: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy?
Optimal Policy: policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy? $\pi^*$ is an optimal policy if $V^{\pi^*}(s) \ge V^{\pi}(s)\; \forall s \in S, \forall \pi$. Does it always exist? Yes, it always does.
Existence of Optimal Policy: the leader policy for every state $s$ is $\pi_s = \arg\max_\pi V^\pi(s)$. Define $\pi^*(s, a) = \pi_s(s, a)\; \forall s, a$. To show $\pi^*$ is optimal, or equivalently $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$:
$V^{\pi^*}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi^*}(s')\}$
$V^{\pi_s}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi_s}(s')\}$
Subtracting, $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{V^{\pi^*}(s') - V^{\pi_s}(s')\} \ge \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \Delta(s')$, since $V^{\pi_{s'}}(s') \ge V^{\pi_s}(s')$ by definition of the leader policy. The right-hand side is $\gamma$ times a convex combination of the $\Delta(s')$, so $\Delta(s) \ge \gamma \min_{s'} \Delta(s')$ and hence $\min_s \Delta(s) \ge \gamma \min_{s'} \Delta(s')$. Since $\gamma \in [0, 1)$, this implies $\min_s \Delta(s) \ge 0$. Hence proved.
Bellman's Optimality Condition: define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$. Then $V^*(s) = \sum_a \pi^*(s, a) Q^*(s, a) \le \max_a Q^*(s, a)$. Claim: $V^*(s) = \max_a Q^*(s, a)$. Suppose instead $V^*(s) < \max_a Q^*(s, a)$ and define the deterministic policy $\pi'(s) = \arg\max_a Q^*(s, a)$. Since $V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$,
$\Delta(s) = V^{\pi'}(s) - V^*(s) = Q^{\pi'}(s, \pi'(s)) - \sum_a \pi^*(s, a) Q^*(s, a) \ge Q^{\pi'}(s, \pi'(s)) - Q^*(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'} \Delta(s')$
so $\Delta(s) \ge 0$ for all $s$, and at the state where $V^*(s) < \max_a Q^*(s, a)$ we get $\Delta(s) > 0$, i.e., $\pi^*$ is not optimal, a contradiction.
Bellman's Optimality Condition: $V^*(s) = \max_a Q^*(s, a)$, i.e., $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$; similarly $Q^*(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a')\}$.
Optimal policy from Q value: given $Q^*(s, a)$, an optimal policy is given by $\pi^*(s) = \arg\max_a Q^*(s, a)$. Corollary: every MDP has a deterministic optimal policy.
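Extracting that deterministic policy from a tabular Q is a one-liner (a sketch, assuming `Q[s, a]` is a numpy array):

```python
import numpy as np

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a): one deterministic action per state."""
    return np.argmax(Q, axis=1)
```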
Summary: an optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^\pi(s)\; \forall s \in S, \forall \pi$. Bellman's self-consistency equation: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Bellman's optimality condition: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$.
What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition
Solving MDP To solve an MDP is to find an optimal policy
Bellman's Optimality Condition: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$. Iteratively solve the above equation.
Bellman Backup Operator: $T : \mathcal{V} \to \mathcal{V}$, defined by $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$.
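For a tabular MDP the operator is a few lines of numpy; a sketch, assuming the same hypothetical `P[a, s, s']` and `R[a, s, s']` arrays as before:

```python
import numpy as np

def bellman_backup(V, P, R, gamma):
    """(TV)(s) = max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V[s'])."""
    # Expected backed-up value for every (action, state) pair, then max over actions.
    q = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
    return np.max(q, axis=0)
```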
Dynamic Programming Solution: initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_\infty \le \epsilon$; return $V_{t+1}$. Explicitly, $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_t(s')\}$.
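A minimal sketch of this loop, reusing the hypothetical `bellman_backup` above (the tolerance name `eps` is an assumption):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Repeatedly apply the Bellman backup operator until the iterates stop moving."""
    V = np.random.rand(P.shape[1])  # initialize V_0 randomly
    while True:
        V_next = bellman_backup(V, P, R, gamma)
        if np.max(np.abs(V_next - V)) <= eps:  # ||V_{t+1} - V_t||_inf <= eps
            return V_next
        V = V_next
```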
Convergence: $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$. Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_k|\}$ for $x \in \mathbb{R}^k$. Proof:
$(TV_1)(s) - (TV_2)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s')\} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s')\}$
Using $\max_x f(x) - \max_x g(x) \le \max_x |f(x) - g(x)|$:
$(TV_1)(s) - (TV_2)(s) \le \max_a \Big| \gamma \sum_{s'} P^a_{s,s'} \big(V_1(s') - V_2(s')\big) \Big| \le \gamma \max_a \sum_{s'} P^a_{s,s'} \big|V_1(s') - V_2(s')\big| \le \gamma \max_{s'} \big|V_1(s') - V_2(s')\big| = \gamma \|V_1 - V_2\|_\infty$
Hence $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$.
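The contraction can also be checked empirically; a sketch reusing the hypothetical `bellman_backup` and the 2-state `P`, `R` arrays above:

```python
import numpy as np

rng = np.random.default_rng(0)
V1, V2 = rng.random(2), rng.random(2)
lhs = np.max(np.abs(bellman_backup(V1, P, R, 0.9) - bellman_backup(V2, P, R, 0.9)))
rhs = 0.9 * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)  # True: ||TV1 - TV2||_inf <= gamma * ||V1 - V2||_inf
```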
Optimal is a fixed point: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\} = (TV^*)(s)$, so $V^*$ is a fixed point of $T$.
Optimal is the fixed point: $V^* = TV^*$, so $V^*$ is a fixed point of $T$. Theorem: $V^*$ is the only fixed point of $T$. Proof: suppose $TV_1 = V_1$ and $TV_2 = V_2$. Then $\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$. As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$.
Dynamic Programming Solution: initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_\infty \le \epsilon$; return $V_{t+1}$. Problem: does this converge? Theorem: the algorithm converges for all $V_0$. Proof:
$\|V_{t+1} - V^*\|_\infty = \|TV_t - TV^*\|_\infty \le \gamma \|V_t - V^*\|_\infty$
$\|V_t - V^*\|_\infty \le \gamma^t \|V_0 - V^*\|_\infty$
$\lim_{t\to\infty} \|V_t - V^*\|_\infty \le \lim_{t\to\infty} \gamma^t \|V_0 - V^*\|_\infty = 0$
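The geometric decay can be observed numerically; a sketch reusing the hypothetical `bellman_backup` and the 2-state `P`, `R` arrays above (here `V_star` is just a long-run iterate used as a stand-in for $V^*$):

```python
import numpy as np

# Run many backups to get a stand-in for V*, then watch ||V_t - V*||_inf shrink.
V_star = np.zeros(2)
for _ in range(2000):
    V_star = bellman_backup(V_star, P, R, gamma=0.9)

V = np.random.rand(2)
for t in range(10):
    print(t, np.max(np.abs(V - V_star)))  # error drops by at least a factor gamma per step
    V = bellman_backup(V, P, R, gamma=0.9)
```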
Summary: iteratively solving the optimality condition $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_t(s')\}$; the Bellman backup operator $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$; convergence of the iterative solution.
What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition Bellman Backup Operator Iterative Solution
In the next tutorial: Value and Policy Iteration, Monte Carlo solutions, SARSA and Q-Learning, Policy Gradient Methods, Learning to Search, or the Atari game paper? https://dipendramisra.wordpress.com/