Chapter 13 Wow! Least Squares Methods in Batch RL
1 Chapter 13 Wow! Least Squares Methods in Batch RL

Objectives of this chapter:
- Introduce batch RL
- Tradeoffs: least squares vs. gradient methods
- Evaluating policies: fitted value iteration, Bellman residual minimization, least-squares temporal difference learning
- Learning control: fitted Q-iteration, policy iteration
2 Batch RL

Goal: Given a trajectory of the behavior policy π_b,
  X_1, A_1, R_1, ..., X_t, A_t, R_t, ..., X_N,
compute a good policy!

Batch learning. Properties:
- Data collection is not influenced by the learner
- Emphasis is on the quality of the solution
- Computational complexity plays a secondary role

Performance measures:
- ||V* − V^π||_∞ = sup_x |V*(x) − V^π(x)|
- ||V* − V^π||²_{2,µ} = ∫ (V*(x) − V^π(x))² dµ(x)

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction ++ 2
3 Solution methods

- Build a model
- Do not build a model, but find an approximation to Q*:
  - using value iteration => fitted Q-iteration
  - using policy iteration, where the policy is evaluated by:
    - approximate value iteration
    - Bellman-residual minimization (BRM)
    - least-squares temporal difference learning (LSTD) => LSPI
- Policy search
4 Evaluating a policy: Fitted value iteration

Choose a function space F. For i = 1, 2, ..., M, solve the LS (regression) problems:

  Q_{i+1} = argmin_{Q ∈ F} Σ_{t=1}^{T} (R_t + γ Q_i(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t))²

Wait, what about the counterexample of Tsitsiklis and Van Roy? Or the counterexample of Baird? When does this work??

Requirement: if M is big enough and the number of samples is big enough, Q_M should be close to Q^π.
We have to make some assumptions on F...
5 Least-squares vs. gradient

Linear least squares (ordinary regression):
  y_t = w*^T x_t + ε_t, with (x_t, y_t) jointly distributed r.v.s., i.i.d., E[ε_t | x_t] = 0.
Having seen (x_t, y_t), t = 1, ..., T, find w*.

Loss function: L(w) = E[(y_1 − w^T x_1)²].

Least-squares approach:
  w_T = argmin_w Σ_{t=1}^{T} (y_t − w^T x_t)²

Stochastic gradient method:
  w_{t+1} = w_t + α_t (y_t − w_t^T x_t) x_t

Tradeoffs:
- Sample complexity: how good is the estimate?
- Computational complexity: how expensive is the computation?
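The tradeoff above can be seen concretely in a small experiment. Everything below is an illustrative assumption (the synthetic data, the target w*, and a small constant step size used instead of a decaying α_t for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data y_t = w*.x_t + noise; w* is the unknown target.
w_star = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = X @ w_star + 0.1 * rng.normal(size=500)

# Least squares: solve the whole problem at once -- O(T d^2) work,
# statistically the best use of the sample.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient: one cheap O(d) update per sample; a small constant
# step size stands in for the decaying alpha_t of the slide.
alpha = 0.05
w_sgd = np.zeros(2)
for _ in range(10):                      # a few passes over the batch
    for x_t, y_t in zip(X, y):
        w_sgd += alpha * (y_t - w_sgd @ x_t) * x_t

print(w_ls, w_sgd)                       # both end up near w_star
```

The least-squares solution extracts more from the same sample at a higher per-solve cost; the gradient iterate is cheap per step but noisier.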
6 Fitted value iteration: Analysis

Goal: bound ||Q_M − Q^π||_{2,µ} in terms of max_m ||ε_m||_ν, where
  ||ε_m||²_ν = ∫ ε_m²(x, a) ν(dx, da)  and  Q_{m+1} = T^π Q_m + ε_m.

Let U_m = Q_m − Q^π. Then
  U_{m+1} = Q_{m+1} − Q^π = T^π Q_m − Q^π + ε_m = T^π Q_m − T^π Q^π + ε_m = γ P^π U_m + ε_m.

Unrolling the recursion (with the convention ε_{−1} = U_0):
  U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}.
7 Analysis/2

Starting from U_M = Σ_{m=0}^{M} (γ P^π)^{M−m} ε_{m−1}, apply Jensen's inequality (to the operators), together with the assumptions µ ≤ C_1 ν and ∀ρ: ρ P^π ≤ C_1 ν:

  ||U_M||²_{2,µ} = µ U_M²
    ≤ ((1 − γ^{M+1})/(1 − γ))² Σ_{m=0}^{M} (γ^m (1 − γ)/(1 − γ^{M+1})) µ((P^π)^m ε_{M−m−1})²
    ≤ ((1 − γ^{M+1})/(1 − γ)) C_1 Σ_{m=0}^{M} γ^m ν ε_{M−m−1}²
    ≤ C_1 (1/(1 − γ))² (max_m ||ε_m||²_ν + γ^M ν ε_{−1}²).

Legend: ρf = ∫ f(x) ρ(dx), (Pf)(x) = ∫ f(y) P(dy|x).
8 Summary

If the regression errors are all small and the system is "noisy" (∀π, ρ: ρ P^π ≤ C_1 ν), then the final error will be small.

How to make the regression errors small? Regression error decomposition:

  ||Q_{m+1} − T^π Q_m||²  ≤  ||Q_{m+1} − Π_F T^π Q_m||²  +  ||Π_F T^π Q_m − T^π Q_m||²
                                 (estimation error)            (approximation error)
9 Controlling the approximation error

(Figure: the function space F, its image TF, and the distance from Tf back to F.)

10 Controlling the approximation error

(Figure: the distance d_{p,µ}(TF, F) between TF and F.)

11 Controlling the approximation error

(Figure: enlarging F brings it closer to TF, but TF grows as well.)
12 Controlling the approximation error

Assume smoothness! The Bellman operator maps bounded functions into smooth ones:

  T(B(X, R_max/(1−γ))) ⊆ Lip^α(L) ∩ B(X, R_max/(1−γ))
13 Learning with (lots of) historical data

Data: a long trajectory of some exploration policy.
Goal: an efficient algorithm to learn a policy.
Idea: use fitted action-values.

Algorithms:
- Bellman residual minimization, FQI [Antos et al. 06]
- LSPI [Lagoudakis, Parr 03]

Bounds: oracle inequalities (BRM, FQI and LSPI), consistency.
14 BRM insight

TD error: δ_t = R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t)
Bellman error: E[E[δ_t | X_t, A_t]²]
What we can compute/estimate: E[E[δ_t² | X_t, A_t]]
They are different! However:

  E[δ_t | X_t, A_t]² = E[δ_t² | X_t, A_t] − Var[δ_t | X_t, A_t]
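The gap between the two quantities is easy to see numerically. The toy setup below (one fixed state-action pair, two possible next states, made-up Q values) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# One fixed (x, a) pair; the next state X' is random, which is exactly what
# makes the naive squared-TD-error objective biased.
gamma, r = 0.9, 1.0
q_next = np.array([0.0, 2.0])       # Q(x', pi(x')) for the two possible next states
q_xa = 1.5                          # current estimate Q(x, a)

xn = rng.integers(0, 2, size=100_000)          # X' ~ uniform over the two states
delta = r + gamma * q_next[xn] - q_xa          # TD errors

bellman_err_sq = np.mean(delta) ** 2           # (E[delta | x, a])^2  -- what BRM wants
naive_obj = np.mean(delta ** 2)                # E[delta^2 | x, a]    -- what we can estimate
# The identity on the slide: E[delta]^2 = E[delta^2] - Var[delta]
assert abs(bellman_err_sq - (naive_obj - np.var(delta))) < 1e-9
print(bellman_err_sq, naive_obj)               # differ by Var[delta] > 0
```

Minimizing the naive objective also penalizes the next-state variance, which is why BRM needs the correction of the next slide.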
15 Loss function

  L_{N,π}(Q, h) = (1/N) Σ_{t=1}^{N} w_t { (R_t + γ Q(X_{t+1}, π(X_{t+1})) − Q(X_t, A_t))²
                                        − (R_t + γ Q(X_{t+1}, π(X_{t+1})) − h(X_t, A_t))² },

  where w_t = 1/µ(A_t | X_t).
16 Algorithm (BRM++)

1. Choose π_0, i := 0
2. While (i ≤ K) do:
3.   Q_{i+1} = argmin_{Q ∈ F^A} sup_{h ∈ F^A} L_{N,π_i}(Q, h)
4.   π_{i+1}(x) = argmax_{a ∈ A} Q_{i+1}(x, a)
5.   i := i + 1
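The loop above can be sketched as follows. This is a simplified stand-in, not the full algorithm: the min-max over (Q, h) in step 3 is replaced by a plain fitted least-squares evaluation, and tabular one-hot features are used so the regression is exact and the iteration well behaved; all sizes and the random batch are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Skeleton of the slide's loop: choose pi_0, then alternate policy evaluation
# (step 3, simplified) and greedy improvement (step 4) for K rounds.
n_states, n_actions, gamma, K = 6, 2, 0.9, 5
phi = np.eye(n_states * n_actions).reshape(n_states, n_actions, -1)  # one-hot features

def evaluate(batch, pi, n_iters=30):
    # Iterate Q <- least-squares fit of R_t + gamma * Q(X_{t+1}, pi(X_{t+1})).
    w = np.zeros(phi.shape[-1])
    X = np.array([phi[x, a] for x, a, _, _ in batch])
    for _ in range(n_iters):
        y = np.array([r + gamma * phi[xn, pi[xn]] @ w for _, _, r, xn in batch])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def greedy(w):
    return np.argmax(phi @ w, axis=1)        # step 4: pi(x) = argmax_a Q(x, a)

# A random single-trajectory-style batch of (X_t, A_t, R_t, X_{t+1}) tuples.
batch = [(int(rng.integers(n_states)), int(rng.integers(n_actions)),
          float(rng.normal()), int(rng.integers(n_states))) for _ in range(300)]

pi = np.zeros(n_states, dtype=int)           # step 1: pi_0
for _ in range(K):                           # steps 2-5
    w = evaluate(batch, pi)
    pi = greedy(w)
```

With a non-tabular function class F, step 3 would use the modified loss L_{N,π_i}(Q, h) of the previous slide instead of the plain regression.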
17 Do we need to reweight or throw away data?

NO! Why? Intuition from regression: m(x) = E[Y | X = x] can be learnt no matter what p(x) is! Learning π*(a|x) should be possible in the same way.
BUT... performance might be poor! => YES! Just like in supervised learning when the training and test distributions are different.
18 Bound

  ||Q* − Q^{π_K}||_{2,ρ} ≤ (2γ/(1−γ)²) C^{1/2}_{ρ,ν} (Ẽ(F) + E(F) + S_{N,x})^{1/2} + (2γ^K)^{1/2} R_max,

where S_{N,x} is the estimation-error term, of order

  S_{N,x} ∝ ( ((V² + 1) ln N + ln c_1 + x) / (b^{1/κ} N) )^{1/2}

up to the constants c_2, κ (V is the "dimension" of the function class F).
19 The concentration coefficients

Lyapunov exponents. Our case:

  y_{t+1} = P_t y_t,    γ̂_top = limsup_{t→∞} (1/t) log⁺(||y_t||)

- y_t is infinite dimensional
- P_t depends on the policy chosen
- If the top Lyapunov exponent is ≤ 0, we are good
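The quantity γ̂_top can be estimated by running the product and tracking log-growth. The toy 2×2 row-stochastic matrices below are a stand-in for the (infinite-dimensional, policy-dependent) P_t of the slide:

```python
import numpy as np

rng = np.random.default_rng(4)

# Estimate gamma_top = limsup (1/t) log ||y_t||  for  y_{t+1} = P_t y_t,
# where P_t is drawn at random from two fixed row-stochastic matrices.
mats = [rng.uniform(size=(2, 2)) for _ in range(2)]
mats = [m / m.sum(axis=1, keepdims=True) for m in mats]   # make rows sum to 1

y = rng.normal(size=2)
T = 5000
log_norm = 0.0
for _ in range(T):
    y = mats[rng.integers(2)] @ y
    n = np.linalg.norm(y)
    log_norm += np.log(n)      # accumulate the log growth...
    y /= n                     # ...and renormalize to avoid under/overflow
lyap = log_norm / T
print(lyap)                    # near 0: products of stochastic matrices stay bounded
```

A nonpositive estimate corresponds to the "good" case on the slide, where the products stay bounded.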
20 Open question

Abstraction: f(i_1, ..., i_m) = log(||P_{i_1} P_{i_2} ... P_{i_m}||), i_k ∈ {0, 1}.

Let f : {0,1}* → R_+ be subadditive, f(xy) ≤ f(x) + f(y), and suppose that for every infinite sequence x,

  limsup_m (1/m) f([x]_m) ≤ β,

where [x]_m denotes the length-m prefix of x. True?

  For every {y_m}_m with y_m ∈ {0,1}^m:  limsup_m (1/m) f(y_m) ≤ β
21 Relation to LSTD

LSTD:
- Linear function space
- "Bootstrap" the normal equations
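Bootstrapping the normal equations means solving them in one shot rather than iterating a regression. A minimal sketch for policy evaluation with linear features; the chain MDP, feature map, and batch are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# LSTD with a linear function space V(x) = phi(x).w: solve the bootstrapped
# normal equations  A w = b,  where
#   A = sum_t phi_t (phi_t - gamma * phi_{t+1})^T,   b = sum_t phi_t R_t.
gamma, d = 0.9, 3
phi = rng.normal(size=(10, d))               # one feature vector per state

# A batch of (X_t, R_t, X_{t+1}) triples from some behavior policy on a
# 10-state chain (all illustrative).
states = rng.integers(10, size=500)
rewards = rng.normal(size=500)
next_states = np.minimum(states + 1, 9)

A = np.zeros((d, d))
b = np.zeros(d)
for x, r, xn in zip(states, rewards, next_states):
    A += np.outer(phi[x], phi[x] - gamma * phi[xn])
    b += phi[x] * r
w = np.linalg.solve(A, b)                    # the LSTD solution
```

Compared with fitted value iteration, which repeats a regression until it converges, LSTD finds the fixed point of the projected Bellman equation directly from one linear solve.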
22 Conclusions

- Fitted RL algorithms work (for smooth MDPs), even with a single trajectory
- What to do about the curse of dimensionality? Need adaptive algorithms that can take advantage of regularity when present:
  - Penalized least-squares / aggregation?
  - Feature relevance
  - Factorization
  - Manifold estimation
  - Abstraction
  - Special purpose algorithms?
- What priors to assume???
- Is on-line learning easier?
23 Reading/References

- M. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4, 2003.
- A. Antos, Cs. Szepesvári, R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal (MLJ), to appear, 2007.
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationIntroduction to Reinforcement Learning Part 1: Markov Decision Processes
Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for
More informationStochastic Primal-Dual Methods for Reinforcement Learning
Stochastic Primal-Dual Methods for Reinforcement Learning Alireza Askarian 1 Amber Srivastava 1 1 Department of Mechanical Engineering University of Illinois at Urbana Champaign Big Data Optimization,
More informationMarkov Decision Processes and Dynamic Programming
Markov Decision Processes and Dynamic Programming A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process
More informationReinforcement Learning: the basics
Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationOnline least-squares policy iteration for reinforcement learning control
2 American Control Conference Marriott Waterfront, Baltimore, MD, USA June 3-July 2, 2 WeA4.2 Online least-squares policy iteration for reinforcement learning control Lucian Buşoniu, Damien Ernst, Bart
More information