Reinforcement Learning Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/
Task: Grasp the green cup. Output: a sequence of controller actions. (Setup from Lenz et al. 2014)
Supervised Learning: Grasp the green cup. Expert demonstrations. (Setup from Lenz et al. 2014)
Supervised Learning: Grasp the green cup. Problem? Expert demonstrations. (Setup from Lenz et al. 2014)
Supervised Learning: Grasp the cup. Problem? The training data differs from the test data, and there is no exploration.
Exploring the environment
What is reinforcement learning? Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. - R. Sutton and A. Barto
Interaction with the environment: action → reward + new environment. Scalar reward. (Setup from Lenz et al. 2014)
Interaction with the environment: $a_1, r_1, a_2, r_2, \ldots, a_n, r_n$. Episodic vs. non-episodic. (Setup from Lenz et al. 2014)
Rollout: $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. (Setup from Lenz et al. 2014)
Setup: taking action $a_t$ in state $s_t$ gives the transition $s_t \xrightarrow{a_t} s_{t+1}$ with reward $r_t$ (e.g., \$1).
Policy: $\pi(s, a)$ is the probability of taking action $a$ in state $s$, e.g., $\pi(s, a) = 0.9$.
Interaction with the environment: action → reward + new environment. Objective? (Setup from Lenz et al. 2014)
Objective: given a rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$, maximize the expected reward $E\left[\sum_{t=1}^{n} r_t\right]$. Problem? (Setup from Lenz et al. 2014)
Discounted Reward: maximizing the expected reward $E\left[\sum_{t=0}^{\infty} r_{t+1}\right]$ is unbounded, so instead discount future reward and maximize $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$ with $\gamma \in [0, 1)$. Problem?
Discounted Reward: maximize the discounted expected reward $E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$. If $r \le M$ and $\gamma \in [0, 1)$, then $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1-\gamma}$.
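A quick numerical sanity check of this bound (a hypothetical sketch, assuming every reward equals the cap $M = 1$ and $\gamma = 0.9$):

```python
# Hypothetical check of the bound: sum_t gamma^t * r <= M / (1 - gamma).
gamma, M, horizon = 0.9, 1.0, 10_000
discounted_return = sum(gamma**t * M for t in range(horizon))
print(discounted_return)   # ~10.0, approaching the bound from below
print(M / (1 - gamma))     # 10.0
```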
Need for discounting To keep the problem well formed Evidence that humans discount future reward
Markov Decision Process: an MDP is a tuple $(S, A, P, R, \gamma)$ where $S$ is a set of finite or infinite states, $A$ is a set of finite or infinite actions, for the transition $s \xrightarrow{a} s'$ the quantity $P^a_{s,s'} \in P$ is the transition probability and $R^a_{s,s'} \in R$ is the reward for the transition (Markov assumption), and $\gamma \in [0, 1]$ is the discount factor.
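One way to hold such a tuple in code (a minimal sketch; the field names and array layout are assumptions, not part of the slides):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    """MDP (S, A, P, R, gamma) with finite states and actions."""
    n_states: int
    n_actions: int
    P: np.ndarray   # P[a, s, s'] = probability of the transition s --a--> s'
    R: np.ndarray   # R[a, s, s'] = reward for the transition s --a--> s'
    gamma: float    # discount factor in [0, 1]
```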
MDP Example Example from Sutton and Barto 1998
Summary: an MDP is a tuple $(S, A, P, R, \gamma)$; maximize the discounted expected reward $E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$; the agent controls the policy $\pi(s, a)$.
What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP
Value functions: expected reward from following a policy. State value function: $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$. State-action value function: $Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, a_1 = a, \pi\right]$.
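These expectations can be estimated by simply rolling out the policy; a sketch, assuming a hypothetical environment function `step(s, a)` returning `(next_state, reward)` and a tabular stochastic policy `pi[s]`:

```python
import numpy as np

def mc_state_value(step, pi, s0, gamma, n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_{t+1} | s_1 = s0, pi]."""
    returns = []
    for _ in range(n_rollouts):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = np.random.choice(len(pi[s]), p=pi[s])  # sample a ~ pi(s, .)
            s, r = step(s, a)                          # environment transition
            g += discount * r
            discount *= gamma
        returns.append(g)
    return np.mean(returns)
```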
State Value function: $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$. The rollout unfolds as $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \cdots$, where each step occurs with probability $\pi(s_t, a_t) P^{a_t}_{s_t, s_{t+1}}$ and yields reward $R^{a_t}_{s_t, s_{t+1}}$.
State Value function: $V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{\tau} (r_1 + \gamma r_2 + \cdots)\, p(\tau)$, where the trajectory $\tau = \langle s_1, a_1, s_2, a_2, \ldots \rangle$ splits as $\tau = \langle s_1, a_1, s_2 \rangle : \tau'$. Then
$V^\pi(s_1) = \sum_{a_1, s_2} \sum_{\tau'} P(s_1, a_1, s_2)\, P(\tau' \mid s_1, a_1, s_2)\, \big( R^{a_1}_{s_1, s_2} + \gamma (r_2 + \gamma r_3 + \cdots) \big)$
$= \sum_{a_1, s_2} P(s_1, a_1, s_2) \Big\{ R^{a_1}_{s_1, s_2} + \gamma \underbrace{\sum_{\tau'} P(\tau' \mid s_1, a_1, s_2)(r_2 + \gamma r_3 + \cdots)}_{V^\pi(s_2)} \Big\}$
$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1, s_2} \big\{ R^{a_1}_{s_1, s_2} + \gamma V^\pi(s_2) \big\}$
Bellman Self-Consistency Eqn: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Similarly, $Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$ and $Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a') Q^\pi(s', a')\}$.
Bellman Self-Consistency Eqn: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Given $N$ states, we have $N$ equations in $N$ variables; solve the above system. Does it have a unique solution? Yes, it does. Exercise: Prove it.
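Because the system is linear, it can be solved directly. A minimal sketch (assuming tabular `P[a, s, s']`, `R[a, s, s']`, `pi[s, a]` numpy arrays; the 2-state example below is purely hypothetical):

```python
import numpy as np

def policy_evaluation_exact(P, R, pi, gamma):
    """Solve the self-consistency system V = r_pi + gamma * P_pi V for a tabular MDP."""
    n_states = P.shape[1]
    # Policy-averaged transition matrix and expected one-step reward.
    P_pi = np.einsum("sa,ast->st", pi, P)        # P^pi[s, s']
    r_pi = np.einsum("sa,ast,ast->s", pi, P, R)  # expected immediate reward under pi
    # N linear equations in N variables: (I - gamma * P_pi) V = r_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Hypothetical 2-state, 2-action MDP with every reward equal to 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.ones((2, 2, 2))
pi = np.full((2, 2), 0.5)
print(policy_evaluation_exact(P, R, pi, gamma=0.9))  # -> [10. 10.], i.e., M / (1 - gamma)
```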
Optimal Policy: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy?
Optimal Policy: policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \succeq \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy? $\pi^*$ is an optimal policy if $V^{\pi^*}(s) \ge V^{\pi}(s)\; \forall s \in S, \forall \pi$. Does it always exist? Yes, it always does.
Existence of Optimal Policy: the leader policy for every state $s$ is $\pi_s = \arg\max_\pi V^\pi(s)$. Define $\pi^*(s, a) = \pi_s(s, a)\; \forall s, a$. To show $\pi^*$ is optimal, or equivalently $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$:
$V^{\pi^*}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi^*}(s')\}$
$V^{\pi_s}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi_s}(s')\}$
Subtracting, $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{V^{\pi^*}(s') - V^{\pi_s}(s')\} \ge \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \Delta(s')$, since $V^{\pi_{s'}}(s') \ge V^{\pi_s}(s')$ by definition of the leader policy. The right-hand side is $\gamma$ times a convex combination of the $\Delta(s')$, so $\Delta(s) \ge \gamma \min_{s'} \Delta(s')$ and hence $\min_s \Delta(s) \ge \gamma \min_{s'} \Delta(s')$. Since $\gamma \in [0, 1)$, this implies $\min_s \Delta(s) \ge 0$. Hence proved.
Bellman's Optimality Condition: define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$. Then $V^*(s) = \sum_a \pi^*(s, a) Q^*(s, a) \le \max_a Q^*(s, a)$. Claim: $V^*(s) = \max_a Q^*(s, a)$. Suppose instead $V^*(s) < \max_a Q^*(s, a)$ and define the deterministic policy $\pi'(s) = \arg\max_a Q^*(s, a)$. Since $V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$,
$\Delta(s) = V^{\pi'}(s) - V^*(s) = Q^{\pi'}(s, \pi'(s)) - \sum_a \pi^*(s, a) Q^*(s, a) \ge Q^{\pi'}(s, \pi'(s)) - Q^*(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'} \Delta(s')$
so $\Delta(s) \ge 0$ for all $s$, and at the state where $V^*(s) < \max_a Q^*(s, a)$ we get $\Delta(s) > 0$, i.e., $\pi^*$ is not optimal, a contradiction.
Bellman's Optimality Condition: $V^*(s) = \max_a Q^*(s, a)$, i.e., $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$; similarly $Q^*(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a')\}$.
Optimal policy from Q value: given $Q^*(s, a)$, an optimal policy is given by $\pi^*(s) = \arg\max_a Q^*(s, a)$. Corollary: every MDP has a deterministic optimal policy.
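Extracting that deterministic policy from a tabular Q is a one-liner (a sketch, assuming `Q[s, a]` is a numpy array):

```python
import numpy as np

def greedy_policy(Q):
    """pi*(s) = argmax_a Q*(s, a): one deterministic action per state."""
    return np.argmax(Q, axis=1)
```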
Summary: an optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^\pi(s)\; \forall s \in S, \forall \pi$. Bellman's self-consistency equation: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Bellman's optimality condition: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$.
What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition
Solving MDP To solve an MDP is to find an optimal policy
Bellman's Optimality Condition: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$. Iteratively solve the above equation.
Bellman Backup Operator: $T : \mathcal{V} \to \mathcal{V}$, defined by $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$.
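For a tabular MDP the operator is a few lines of numpy; a sketch, assuming the same hypothetical `P[a, s, s']` and `R[a, s, s']` arrays as before:

```python
import numpy as np

def bellman_backup(V, P, R, gamma):
    """(TV)(s) = max_a sum_{s'} P[a,s,s'] * (R[a,s,s'] + gamma * V[s'])."""
    # Expected backed-up value for every (action, state) pair, then max over actions.
    q = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
    return np.max(q, axis=0)
```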
Dynamic Programming Solution: initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_\infty \le \epsilon$; return $V_{t+1}$. Explicitly, $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_t(s')\}$.
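A minimal sketch of this loop, reusing the hypothetical `bellman_backup` above (the tolerance name `eps` is an assumption):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Repeatedly apply the Bellman backup operator until the iterates stop moving."""
    V = np.random.rand(P.shape[1])  # initialize V_0 randomly
    while True:
        V_next = bellman_backup(V, P, R, gamma)
        if np.max(np.abs(V_next - V)) <= eps:  # ||V_{t+1} - V_t||_inf <= eps
            return V_next
        V = V_next
```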
Convergence: $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$. Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \ldots, |x_k|\}$ for $x \in \mathbb{R}^k$. Proof:
$(TV_1)(s) - (TV_2)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s')\} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s')\}$
Using $\max_x f(x) - \max_x g(x) \le \max_x |f(x) - g(x)|$:
$(TV_1)(s) - (TV_2)(s) \le \max_a \Big| \gamma \sum_{s'} P^a_{s,s'} \big(V_1(s') - V_2(s')\big) \Big| \le \gamma \max_a \sum_{s'} P^a_{s,s'} \big|V_1(s') - V_2(s')\big| \le \gamma \max_{s'} \big|V_1(s') - V_2(s')\big| = \gamma \|V_1 - V_2\|_\infty$
Hence $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$.
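The contraction can also be checked empirically; a sketch reusing the hypothetical `bellman_backup` and the 2-state `P`, `R` arrays above:

```python
import numpy as np

rng = np.random.default_rng(0)
V1, V2 = rng.random(2), rng.random(2)
lhs = np.max(np.abs(bellman_backup(V1, P, R, 0.9) - bellman_backup(V2, P, R, 0.9)))
rhs = 0.9 * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)  # True: ||TV1 - TV2||_inf <= gamma * ||V1 - V2||_inf
```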
Optimal is a fixed point: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\} = (TV^*)(s)$, so $V^*$ is a fixed point of $T$.
Optimal is the fixed point: $V^* = TV^*$, so $V^*$ is a fixed point of $T$. Theorem: $V^*$ is the only fixed point of $T$. Proof: suppose $TV_1 = V_1$ and $TV_2 = V_2$. Then $\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$. As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$.
Dynamic Programming Solution: initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ until $\|V_{t+1} - V_t\|_\infty \le \epsilon$; return $V_{t+1}$. Problem: does this converge? Theorem: the algorithm converges for all $V_0$. Proof:
$\|V_{t+1} - V^*\|_\infty = \|TV_t - TV^*\|_\infty \le \gamma \|V_t - V^*\|_\infty$
$\|V_t - V^*\|_\infty \le \gamma^t \|V_0 - V^*\|_\infty$
$\lim_{t\to\infty} \|V_t - V^*\|_\infty \le \lim_{t\to\infty} \gamma^t \|V_0 - V^*\|_\infty = 0$
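The geometric decay can be observed numerically; a sketch reusing the hypothetical `bellman_backup` and the 2-state `P`, `R` arrays above (here `V_star` is just a long-run iterate used as a stand-in for $V^*$):

```python
import numpy as np

# Run many backups to get a stand-in for V*, then watch ||V_t - V*||_inf shrink.
V_star = np.zeros(2)
for _ in range(2000):
    V_star = bellman_backup(V_star, P, R, gamma=0.9)

V = np.random.rand(2)
for t in range(10):
    print(t, np.max(np.abs(V - V_star)))  # error drops by at least a factor gamma per step
    V = bellman_backup(V, P, R, gamma=0.9)
```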
Summary: iteratively solving the optimality condition $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_t(s')\}$; the Bellman backup operator $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$; convergence of the iterative solution.
What we learned Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition Bellman Backup Operator Iterative Solution
In the next tutorial: Value and Policy Iteration, Monte Carlo solutions, SARSA and Q-Learning, Policy Gradient Methods, Learning to Search, or the Atari game paper? https://dipendramisra.wordpress.com/