Temporal difference learning
AI & Agents for IET
Lecturer: S Luz
http://www.scss.tcd.ie/~luzs/t/cs7032/
February 4, 2014

Recall: background & assumptions

The environment is a finite MDP (i.e. A and S are finite). The MDP's dynamics are defined by transition probabilities

    $P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$

and expected immediate rewards

    $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$

Goal: to search for good policies $\pi$.

DP strategy: use value functions to structure the search:

    $V^*(s) = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')]$

or

    $Q^*(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$

MC strategy: expand each episode and keep averages of returns per state.

Temporal Difference (TD) Learning

TD is a combination of ideas from Monte Carlo methods and DP methods:

- TD can learn directly from raw experience, without a model of the environment's dynamics (like MC).
- TD methods update estimates based in part on other learned estimates, without waiting for a final outcome, i.e. they "bootstrap" (like DP).

Some widely used TD methods: TD(0), TD(λ), Sarsa, Q-learning, Actor-critic, R-learning.
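To ground the DP strategy before moving on to TD, here is a minimal value-iteration sketch of the $V^*$ backup above. The state/action counts and the random dynamics are assumptions made purely for illustration; they are not from the lecture.

    import numpy as np

    # Illustrative sizes and dynamics (assumptions, not from the lecture).
    n_states, n_actions, gamma = 4, 2, 0.9
    rng = np.random.default_rng(0)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] is a distribution
    R = rng.random((n_states, n_actions, n_states))

    V = np.zeros(n_states)
    for _ in range(100):                # repeated full backups (value iteration)
        # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
        Q = (P * (R + gamma * V)).sum(axis=2)
        V = Q.max(axis=1)               # V*(s) = max_a Q(s, a)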
Prediction & Control

Recall value function estimation (prediction) for DP and MC.

Update for DP:

    $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$

Update rule for MC:

    $V(s_t) \leftarrow V(s_t) + \alpha [R_t - V(s_t)]$

(Q: why are the value functions above merely estimates?)

Prediction in TD(0)

Prediction (update of estimates of V) for the TD(0) method is done as follows:

    $V(s_t) \leftarrow V(s_t) + \alpha [\underbrace{r_{t+1} + \gamma V(s_{t+1})}_{\text{1-step estimate of } R_t} - V(s_t)]$    (1)

There is no need to wait until the end of the episode to update V (as in MC), and no need to sweep across the possible successor states (as in DP).

Tabular TD(0) value function estimation

Use sample backups (as in MC) rather than full backups (as in DP):

Algorithm 1: TD(0)

    TabularTD0(π)
        Initialisation, for all s ∈ S:
            V(s) ← arbitrary value
            α ← learning rate
            γ ← discount factor
        For each episode E until stop criterion met:
            Initialise s
            For each step of E:
                a ← π(s)
                Take action a; observe reward r and next state s'
                V(s) ← V(s) + α[r + γV(s') − V(s)]
                s ← s'
            until s is a terminal state

A runnable sketch of this algorithm follows.
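The following is a minimal, runnable Python sketch of Algorithm 1. The environment interface (reset() returning a state, step(s, a) returning reward, next state and a termination flag) and the dictionary representation of V are assumptions made for this sketch.

    def td0_prediction(env, policy, alpha=0.1, gamma=1.0, n_episodes=100):
        """Tabular TD(0) prediction; env interface is assumed, see above."""
        V = {}                                   # missing states default to 0 (arbitrary init)
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)
                r, s2, done = env.step(s, a)
                # sample backup: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
                target = r + gamma * (0.0 if done else V.get(s2, 0.0))
                V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
                s = s2
        return V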
Example: random walk estimate

[Figure: the random walk MDP.]

$P^a_{ss'} = .5$ for all non-terminal states; $R^a_{ss'} = 0$, except for the terminal state on the right, where it is 1.

[Figure: comparison between MC and TD(0) over 100 episodes; RMS error vs. episodes.]

Focusing on action-value function estimates

This is the control (algorithm) part:

- exploration vs. exploitation trade-off;
- on-policy and off-policy methods.

Several variants exist, e.g.:

- Q-learning: an off-policy method (Watkins and Dayan, 1992)
- Modified Q-learning (aka Sarsa): an on-policy method

Sarsa

Transitions from non-terminal states update Q as follows:

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$    (2)

[Figure: Sarsa backup diagram over the sequence $(s_t, a_t), r_{t+1}, (s_{t+1}, a_{t+1}), r_{t+2}, (s_{t+2}, a_{t+2}), \ldots$]

(For terminal states $s_{t+1}$, assign $Q(s_{t+1}, a_{t+1}) \leftarrow 0$.)

Algorithm 2: Sarsa

    Initialise Q(s, a) arbitrarily
    For each episode:
        Initialise s
        Choose a from s using a policy π derived from Q (e.g. ε-greedy)
        Repeat (for each step of the episode):
            Perform a; observe r, s'
            Choose a' from s' using a policy π derived from Q (e.g. ε-greedy)
            Q(s, a) ← Q(s, a) + α[r + γQ(s', a') − Q(s, a)]
            s ← s'; a ← a'
        until s is a terminal state

A runnable sketch of this on-policy update follows.
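A minimal Python sketch of Algorithm 2, under the same assumed environment interface as the TD(0) sketch; the eps_greedy helper is also an assumption. The key point is that the backup target uses the action a' actually chosen in s'.

    import random

    def sarsa(env, actions, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=500):
        """Tabular Sarsa; env interface and action list are assumed."""
        Q = {}                                   # Q(s, a), missing entries default to 0

        def eps_greedy(s):
            if random.random() < epsilon:
                return random.choice(actions)    # explore
            return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit

        for _ in range(n_episodes):
            s = env.reset()
            a = eps_greedy(s)
            done = False
            while not done:
                r, s2, done = env.step(s, a)
                a2 = eps_greedy(s2)
                # on-policy target: uses the action a' actually selected in s'
                q_next = 0.0 if done else Q.get((s2, a2), 0.0)   # Q of terminal states is 0
                Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * q_next - Q.get((s, a), 0.0))
                s, a = s2, a2
        return Q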
Q-learning

An off-policy method: approximate $Q^*$ independently of the policy being followed:

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$    (3)

Algorithm 3: Q-Learning

    Initialise Q(s, a) arbitrarily
    For each episode:
        Initialise s
        Repeat (for each step of the episode):
            Choose a from s using a policy π derived from Q (e.g. ε-greedy)
            Perform a; observe r, s'
            Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
            s ← s'
        until s is a terminal state

A simple comparison

A comparison between Q-learning and Sarsa on the cliff-walking problem (Sutton and Barto, 1998):

[Figure: the cliff-walking gridworld, showing the Safe Path and the Optimal Path, with a plot of average reward per episode for each method.]

Why does Q-learning find the optimal path? Why is its average reward per episode worse than Sarsa's?

Convergence theorems

Convergence holds in the following sense: Q converges in the limit to the optimal Q function, provided the system can be modelled as a deterministic Markov process, r is bounded, and π visits every state-action pair infinitely often.

This has been proved for the TD methods described above: Q-learning (Watkins and Dayan, 1992), Sarsa and TD(0) (Sutton and Barto, 1998). General proofs based on stochastic approximation theory can be found in (Bertsekas and Tsitsiklis, 1996). A sketch of the Q-learning update, for contrast with Sarsa, follows.
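For contrast with the Sarsa sketch above, here is a minimal Q-learning sketch under the same assumed environment interface. The only substantive change is the max over successor actions in the backup target, which is what makes the method off-policy.

    import random

    def q_learning(env, actions, alpha=0.5, gamma=1.0, epsilon=0.1, n_episodes=500):
        """Tabular Q-learning; env interface and action list are assumed."""
        Q = {}

        def eps_greedy(s):
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: Q.get((s, a), 0.0))

        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = eps_greedy(s)                # behaviour policy (exploratory)
                r, s2, done = env.step(s, a)
                # off-policy target: max over a', regardless of the action taken next
                q_next = 0.0 if done else max(Q.get((s2, a2), 0.0) for a2 in actions)
                Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * q_next - Q.get((s, a), 0.0))
                s = s2
        return Q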
Summary of methods

[Figure: summary of methods — full backups (dynamic programming, exhaustive search) vs. sample backups (Monte Carlo), and shallow backups (temporal difference, bootstrapping) vs. deep backups, with λ spanning the range in between.]

TD(λ)

Basic idea: if with TD(0) we backed up our estimates based on one step ahead, why not generalise this to include n steps ahead?

[Figure: backup diagrams for the 1-step (TD(0)), 2-step, n-step, and Monte Carlo returns.]

A small sketch of the n-step return used by these backups follows.
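The sketch below computes the n-step return $R^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$ that these backups are built on; the function signature is an assumption made for illustration.

    def n_step_return(rewards, V, s_n, n, gamma=1.0):
        """n-step return: rewards holds r_{t+1}..r_{t+n}, s_n is s_{t+n}."""
        ret = 0.0
        for i, r in enumerate(rewards[:n]):
            ret += (gamma ** i) * r                       # gamma^i * r_{t+i+1}
        return ret + (gamma ** n) * V.get(s_n, 0.0)       # bootstrap from V(s_{t+n})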
Averaging backups

Consider, for instance, averaging 2-step and 4-step backups:

    $R^{avg}_t = \frac{1}{2} R^{(2)}_t + \frac{1}{2} R^{(4)}_t$

TD(λ) is a method for averaging all n-step backups, weighting each $R^{(n)}_t$ by $\lambda^{n-1}$ ($0 \le \lambda \le 1$):

    $R^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t$

Backup using the λ-return:

    $\Delta V_t(s_t) = \alpha [R^\lambda_t - V_t(s_t)]$

Sutton and Barto (1998) call this the Forward View of TD(λ).

TD(λ) backup structure

[Figure: the TD(λ) backup as a weighted sum of n-step backups, n = 1, 2, ..., with weights $(1-\lambda)\lambda^{n-1}$.]

The weights sum to one: $\lim_{n \to \infty} \sum_{i=1}^{n} (1 - \lambda) \lambda^{i-1} = 1$.

Backup value (λ-return):

    $R^\lambda_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^{(n)}_t$

or, for an episode terminating at time T:

    $R^\lambda_t = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^{(n)}_t + \lambda^{T-t-1} R_t$

Implementing TD(λ)

We need a way to accumulate the effect of the trace-decay parameter λ.

Eligibility traces:

    $e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) + 1 & \text{if } s = s_t \\ \gamma \lambda e_{t-1}(s) & \text{otherwise} \end{cases}$    (4)

The TD error for state-value prediction is:

    $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$    (5)

Sutton and Barto call this the Backward View of TD(λ). A sketch of one backward-view update step follows.
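A minimal sketch of one backward-view update step, combining the trace update (4) with the TD error (5); the dictionary-based signature is an assumption made for this sketch.

    def td_lambda_step(V, e, s, r, s2, alpha, gamma, lam):
        """One online TD(lambda) step: V and e map states to values/traces."""
        delta = r + gamma * V.get(s2, 0.0) - V.get(s, 0.0)   # TD error, Eq. (5)
        e[s] = e.get(s, 0.0) + 1.0                           # accumulate trace for s_t, Eq. (4)
        for state in list(e):
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= gamma * lam                          # decay all traces, Eq. (4)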
Equivalence

It can be shown (Sutton and Barto, 1998) that the Forward View of TD(λ) and the Backward View of TD(λ) are equivalent.

Tabular TD(λ)

Algorithm 4: On-line TD(λ)

    Initialise V(s) arbitrarily
    e(s) ← 0 for all s ∈ S
    For each episode:
        Initialise s
        Repeat (for each step of the episode):
            Choose a according to π
            Perform a; observe r, s'
            δ ← r + γV(s') − V(s)
            e(s) ← e(s) + 1
            For all s:
                V(s) ← V(s) + αδe(s)
                e(s) ← γλe(s)
            s ← s'
        until s is a terminal state

Similar algorithms can be implemented for control (Sarsa, Q-learning), using eligibility traces.

Control algorithms: SARSA(λ)

Algorithm 5: SARSA(λ)

    Initialise Q(s, a) arbitrarily
    e(s, a) ← 0 for all s ∈ S and a ∈ A
    For each episode:
        Initialise s, a
        Repeat (for each step of the episode):
            Perform a; observe r, s'
            Choose a' from s' using a policy derived from Q (e.g. ε-greedy)
            δ ← r + γQ(s', a') − Q(s, a)
            e(s, a) ← e(s, a) + 1
            For all s, a:
                Q(s, a) ← Q(s, a) + αδe(s, a)
                e(s, a) ← γλe(s, a)
            s ← s'; a ← a'
        until s is a terminal state

Notes based on (Sutton and Barto, 1998, ch. 6).

References

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Watkins, C. J. C. H. and Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8:279-292.