The convergence limit of the temporal difference learning

Size: px

Start display at page:

Download "The convergence limit of the temporal difference learning"

Edith McDowell
5 years ago
Views:

1 The convergence limit of the temporal difference learning Ryosuke Nomura the University of Tokyo September 3,

2 Outline Reinforcement Learning Convergence limit Construction of the feature vector Numerical experiments Conclusion References 2

3 Reinforcement Learning Reinforcement learning is one of machine learning, which deals with a problem that an agent decides an optimal policy in an environment. We consider a finite state space, and an agent gets a reward when he moves from a state to next state. Our purpose is to evaluate an expectation of a cumulative reward and to find an optimal policy which maximizes or minimizes the reward. 3

4 Setting S: finite state space state sequence (s t ) t=0,1,... : Markov chain on S s 0 : follows some probability distribution on S P M S (R): transition probability matrix has a stationary distribution d(s) reward sequence (r t ) t N : sequence of uniformly bounded random variables p(r t+1 s 0, s 1,..., s t+1 ) = p(r t+1 s t, s t+1 ) E[r t+1 s t = s] is independent of t cumulative reward: R t = γ k 1 r t+k k=1 where, γ: discount rate (0 γ 1) 4

5 Setting value function: [ ] V (s) = E [R t s t = s] = E γ k 1 r t+k s t = s Here, let k=1 R(s) = E[r t+1 s t = s], we have V = R + γp V. We solve this problem when observations are given, and R and P are unknown. 5

6 Temporal difference learning ϕ k R S, k = 1,..., K: feature vectors Φ = (ϕ 1,..., ϕ K ) R S K : feature matrix w 0 R K : initial values of parameters V t : estimator of V defined as follows K V t = w t (k)ϕ k = Φw t k=1 w t = (w t (k)) k=1,...,k is updated as the following rules: δ t = r t + γv t 1 (s t ) V t 1 (s t 1 ) w t = w t 1 + a t δ t ϕ(s t 1 ) where, ϕ(s t ) = (ϕ k (s t )) k=1,...,k. cf. 0 = R + γp V V 6

7 Notation and Assumption Notation: D M S (R): diagonal matrix whose elements are d(s) s ϕ k (s) = ϕ k (s) γ s P (s, s )ϕ k (s ) Φ = ( ϕ 1,..., ϕ K ) = (I S γp )Φ R S K Assumption: Φ T D Φ: invertible 7

8 Convergence limit Consider next rules: ) w t = w t 1 + α (Φ T DR Φ T D Φw t 1 = w t 1 + αφ T D Φ(ŵ w t 1 ) where, ŵ = (Φ T D Φ) 1 Φ T DR. Then, we have w t ŵ = ( I K αφ T D Φ) t (w0 ŵ) where, I K is a K K identity matrix. 8

9 Convergence limit Theorem. Under Assumption, w t converges to ŵ for small enough α > 0 as t. outline of proof. Lemma 1. Under Assumption, every eigenvalue of Φ T D Φ has positive real part. Lemma 2. There exists some positive number α such that the absolute value of every eigenvalue of I K αφ T D Φ is less than 1. 9

10 Motivation The limit V of an estimator V t is the form of V = Φŵ = Φ(Φ T D Φ) 1 Φ T DR, and if the true value V is expressed as linear combination of feature vectors, then the limit consists with the true value, but it is not true in general. Then, I propose the construction of the feature vector related to this limit to converge the true value. 10

11 Property of the limit Fact. The limit V satisfies the following equation: proof. V T D(I γp )(V V ) = 0 V T D(I γp )V = ŵ T Φ T D(I γp )Φ(Φ T D Φ) 1 Φ T DR = ŵ T Φ T DR = V T D(I γp )V 11

12 Construction of the feature vector Here, consider the following algorithm: 1. Let a limit V 1 0 be an initial vector. 2. Obtain D(I γp )(V V 1 ). 3. Obtain the limit V 2 for two feature vectors, V 1 and D(I γp )(V V 1 ). 4. Repeat 2 and 3. Then, the limit V t converges to the true value V as t. 12

13 Numerical experiments Consider the model as follows: R = , P = 1/2 1/ /2 0 1/ /2 0 1/ /2 1/2, γ = 0.8 and we use an initial feature vector ϕ T = (1, 0, 0, 0). 13

14 Numerical experiments Black line is a theoretical line and the value of y-axis is V t V 2. Red points are obtained by 100 simulations, and 500 observations are given in each simulation. The value of y-axis is V t ˆV 2, where ˆV = (I γ ˆP ) 1 R and ˆP is an estimator of P. 14

15 Numerical experiments 15

16 Numerical experiments Iteration Error S.D Iteration Error S.D

17 Conclusion When the true value is not expressed as linear combination of feature vectors, I constructed the feature vector based on the limit vector, and proposed the algorithm where the estimator converges to the true value. 17

18 References [1] R. S. Sutton (1988). Learning to predict by the methods of temporal differences. Machine learning, 3:9 44. [2] R. S. Sutton, A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press. [3] T. Ueno, S. Maeda, M. Kawanabe, S. Ishii (2011). Generalized TD learning. Journal of Machine Learning Research, pages [4] H. Yu and D. P. Bertsekas (2006). Convergence results for some temporal difference methods based on least squares. Technical report, LIDS REPORT

Christopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015

Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)