Reinforcement Learning: Temporal Difference Learning
Temporal difference learning, TD prediction, Q-learning, eligibility traces
(many slides from Marc Toussaint)
Vien Ngo, Marc Toussaint
University of Stuttgart
Outline
- Temporal Difference Learning
- Q-learning
- Eligibility Traces
Learning in MDPs
While interacting with the world, the agent collects data of the form
  D = {(s_t, a_t, r_t, s_{t+1})}_{t=1}^H    (state, action, immediate reward, next state)
What could we learn from that?
- learn to predict the next state: estimate P(s' | s, a)
- learn to predict the immediate reward: estimate P(r | s, a)
- learn to predict the value: estimate Q(s, a) for all s, a
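A minimal sketch (not from the slides) of collecting such data, assuming a gym-style environment interface (env.reset(), env.step(a) returning a 4-tuple) and a given behavior policy; all names are illustrative:

```python
# Sketch: collect transitions D = {(s_t, a_t, r_t, s_{t+1})} while interacting.
# Assumes a gym-style environment and a function policy(s) -> a.

def collect_transitions(env, policy, horizon):
    D = []
    s = env.reset()
    for t in range(horizon):
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        D.append((s, a, r, s_next))         # (state, action, reward, next state)
        s = env.reset() if done else s_next
    return D
```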
Model-based versus Model-free
Edward Tolman (1886-1959), Clark Hull (1884-1952; Principles of Behavior, 1943), Wolfgang Köhler (1887-1967)
- model-free: learn stimulus-response mappings based on reinforcement
- model-based: learn facts about the world that can subsequently be used in a flexible manner, rather than simply learning automatic responses
Let's introduce basic model-free methods first.
Monte-Carlo method (Introduction to RL, Sutton & Barto 1998)
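As a concrete sketch (an assumption, not part of the slides), first-visit Monte-Carlo prediction estimates V^π(s) by averaging the complete returns observed after the first visit to s in each episode; here episodes is assumed to be a list of (state, reward) trajectories:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma):
    """First-visit Monte-Carlo estimate of V^pi.

    episodes: list of trajectories, each a list of (state, reward) pairs,
              where reward is the immediate reward received after the state.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # returns G_t = r_t + gamma*r_{t+1} + ..., computed backwards
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:               # first visit only
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```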
Temporal difference (TD) learning
recall  V^π(s) = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')
TD learning: Given a new experience (s, a, r, s')
  V_new(s) = (1 - α) V_old(s) + α [r + γ V_old(s')]
           = V_old(s) + α [r - V_old(s) + γ V_old(s')]
Reinforcement:
- more reward than expected (r > V_old(s) - γ V_old(s')): increase V(s)
- less reward than expected (r < V_old(s) - γ V_old(s')): decrease V(s)
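A minimal tabular sketch of this update (names are illustrative; V is assumed to be dict-like with default value 0, e.g. a defaultdict(float)):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```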
Temporal difference vs. Monte-Carlo method: TD(0) (Introduction to RL, Sutton & Barto 1998)
Q-learning
recall  Q*(s, a) = R(s, a) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(s', a')
Q-learning (Watkins, 1988): Given a new experience (s, a, r, s')
  Q_new(s, a) = (1 - α) Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a')]
              = Q_old(s, a) + α [r - Q_old(s, a) + γ max_{a'} Q_old(s', a')]
Reinforcement:
- more reward than expected (r > Q_old(s, a) - γ max_{a'} Q_old(s', a')): increase Q(s, a)
- less reward than expected (r < Q_old(s, a) - γ max_{a'} Q_old(s', a')): decrease Q(s, a)
Q-learning (Introduction to RL, Sutton & Barto 1998)
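A tabular Q-learning sketch in the spirit of the Sutton & Barto pseudocode, assuming a gym-style env (reset()/step(a) with a 4-tuple return) and a discrete action set; all names are illustrative:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                  # Q[(s, a)] -> value

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * max(Q[(s_next, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # Q-learning update
            s = s_next
    return Q
```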
Q-learning convergence with prob 1
Q-learning is a stochastic approximation of Q-Iteration:
  Q-learning:  Q_new(s, a) = (1 - α) Q_old(s, a) + α [r + γ max_{a'} Q_old(s', a')]
  Q-Iteration: ∀ s, a:  Q_{k+1}(s, a) = R(s, a) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(s', a')
We've shown convergence of Q-Iteration to Q*.
Convergence of Q-learning:
  Q-Iteration is a deterministic update:  Q_{k+1} = T(Q_k)
  Q-learning is a stochastic version:     Q_{k+1} = (1 - α) Q_k + α [T(Q_k) + η_k]
  η_k is zero mean!
Q-learning convergence with prob 1
The Q-learning algorithm converges w.p.1 as long as the learning rates satisfy
  Σ_t α_t(s, a) = ∞   and   Σ_t α_t²(s, a) < ∞
(Watkins and Dayan, Q-learning. Machine Learning, 1992)
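For illustration (an assumption, not from the slides), a per-state-action learning rate such as α_t(s, a) = 1 / N_t(s, a), where N_t(s, a) counts visits, satisfies both conditions:

```python
from collections import defaultdict

visit_count = defaultdict(int)

def learning_rate(s, a):
    """alpha_t(s,a) = 1/N(s,a): the alphas sum to infinity, their squares sum to a finite value."""
    visit_count[(s, a)] += 1
    return 1.0 / visit_count[(s, a)]
```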
Q-learning impact
- Q-learning was the first provably convergent direct adaptive optimal control algorithm
- Great impact on the field of Reinforcement Learning
  - smaller representation than models
  - automatically focuses attention on where it is needed, i.e., no sweeps through state space
- though it does not solve the exploration-versus-exploitation issue (ε-greedy, optimistic initialization, etc.)
Unified View
Eligibility traces
Temporal Difference: based on a single experience (s_0, r_0, s_1)
  V_new(s_0) = V_old(s_0) + α [r_0 + γ V_old(s_1) - V_old(s_0)]
Longer experience sequence, e.g.: (s_0, r_0, r_1, r_2, s_3)
Temporal credit assignment, think further backwards: receiving r_2 also tells us something about V(s_0)
  V_new(s_0) = V_old(s_0) + α [r_0 + γ r_1 + γ² r_2 + γ³ V_old(s_3) - V_old(s_0)]
Eligibility traces
The n-step TD update uses the n-step return R_t^(n) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n V_t(s_{t+n}):
  ΔV_t(s_t) = α [R_t^(n) - V_t(s_t)]
The offline value update up to time T:
  V(s) ← V(s) + Σ_{t=0}^{T-1} ΔV_t(s)
Error reduction property:
  max_s |E_π[R_t^(n) | s_t = s] - V^π(s)| ≤ γ^n max_s |V_t(s) - V^π(s)|
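A small sketch of computing the n-step return R_t^(n) from recorded rewards and the current value estimates (names are illustrative; assumes the episode extends at least n steps beyond t):

```python
def n_step_return(rewards, V, states, t, n, gamma):
    """R_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})."""
    G = 0.0
    for k in range(n):
        G += (gamma ** k) * rewards[t + k]      # discounted rewards along n steps
    G += (gamma ** n) * V[states[t + n]]        # bootstrap with the current value estimate
    return G
```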
TD(λ): Forward View
TD(λ) averages the n-step backups over all n:
  R_t^λ = (1 - λ) Σ_{n=1}^∞ λ^(n-1) R_t^(n)
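A sketch of the λ-return for an episode of length T (an illustrative implementation, not from the slides); following the Sutton & Barto convention, the weight beyond the episode end is put on the full Monte-Carlo return:

```python
def lambda_return(rewards, V, states, t, gamma, lam):
    """Forward-view lambda-return R_t^lambda for an episode of length T = len(rewards).

    R_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * R_t^(n)
                 + lam^(T-t-1) * G_t,   with G_t the full Monte-Carlo return.
    """
    T = len(rewards)

    def n_step(n):
        # R_t^(n): n discounted rewards plus bootstrapped value of the state n steps ahead
        G = sum(gamma ** k * rewards[t + k] for k in range(n))
        return G + gamma ** n * V[states[t + n]]

    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step(n) for n in range(1, T - t))
    G_full = sum(gamma ** k * rewards[t + k] for k in range(T - t))
    return G_lam + lam ** (T - t - 1) * G_full      # remaining weight on the MC return
```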
TD(λ): Backward View
TD(λ): remember where you've been recently ("eligibility trace") and update those values as well:
  e(s_t) ← e(s_t) + 1
  ∀ s:  V_new(s) = V_old(s) + α e(s) [r_t + γ V_old(s_{t+1}) - V_old(s_t)]
  ∀ s:  e(s) ← γ λ e(s)
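A tabular sketch of the backward view, processing one episode with an accumulating eligibility trace per state (names are illustrative; transitions are assumed to be (s, r, s') tuples generated by following π, and V a dict-like table with default 0):

```python
from collections import defaultdict

def td_lambda(transitions, V, alpha, gamma, lam):
    """Backward-view TD(lambda) over one episode.

    transitions: list of (s, r, s_next) generated by following the policy.
    V: dict-like value table (e.g. defaultdict(float)), updated in place.
    """
    e = defaultdict(float)                      # eligibility traces
    for (s, r, s_next) in transitions:
        delta = r + gamma * V[s_next] - V[s]    # TD error
        e[s] += 1.0                             # accumulating trace
        for x in e:                             # update every recently visited state
            V[x] += alpha * e[x] * delta
            e[x] *= gamma * lam                 # decay all traces
    return V
```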
TD(λ): Backward vs. Forward View
The two views yield equivalent offline updates (see the proof in Section 7.4 of Introduction to RL, Sutton & Barto).
TD-Gammon, by Gerald Tesauro
(See Section 11.1 in Sutton & Barto's book.)
- MLP to represent the value function V(s)
- Reward is only given at the end of the game, for a win
- Self-play: use the current policy to sample moves on both sides!
- Random policies: games take up to thousands of steps; skilled players need 50-60 steps
- TD(λ) learning (gradient-based update of the NN weights)
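A rough sketch (an assumption, not TD-Gammon's actual code) of the gradient-based TD(λ) weight update, shown for a linear value function V_w(s) = w·φ(s); for an MLP, the feature vector φ(s_t) in the trace update is replaced by the gradient ∇_w V_w(s_t). All names are illustrative:

```python
import numpy as np

def td_lambda_linear(episode_features, rewards, w, alpha, gamma, lam):
    """TD(lambda) with a linear value function V_w(s) = w . phi(s).

    episode_features: feature vectors phi(s_0), ..., phi(s_T)
    rewards: r_0, ..., r_{T-1} (for TD-Gammon, only the final reward is nonzero)
    """
    z = np.zeros_like(w)                               # eligibility trace on the weights
    for t in range(len(rewards)):
        phi_t, phi_next = episode_features[t], episode_features[t + 1]
        v_t, v_next = w @ phi_t, w @ phi_next
        if t == len(rewards) - 1:
            v_next = 0.0                               # terminal state has value 0
        delta = rewards[t] + gamma * v_next - v_t      # TD error
        z = gamma * lam * z + phi_t                    # for an MLP: + grad_w V_w(s_t)
        w = w + alpha * delta * z                      # gradient-based weight update
    return w
```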
TD-Gammon: input features
- At first, only raw position inputs (number of pieces at each place): as good as previous computer programs
- Using the expert features of previous computer programs: world-class player