Reinforcement Learning Algorithms in Markov Decision Processes AAAI-10 Tutorial. Part II: Learning to predict values

Reinforcement Learning Algorithms in Markov Decision Processes, AAAI-10 Tutorial
Part II: Learning to predict values
Csaba Szepesvári and Richard S. Sutton
University of Alberta, {szepesva,rsutton}@ualberta.ca
Atlanta, July 11, 2010

Outline
1 Introduction
2 Simple learning techniques: learning the mean, learning values with Monte-Carlo, learning values with temporal differences, Monte-Carlo or TD?, resolution
3 Function approximation
4 Methods: stochastic gradient descent, TD(λ) with linear function approximation, gradient temporal difference learning, LSTD and friends, comparing least-squares and TD-like methods
5 How to choose the function approximation method?
6 Bibliography

The problem
How to learn the value function of a policy over a large state space?
Why learn? To avoid the curses of modeling: complex models are hard to deal with, learning avoids modelling errors, and it adapts to changes.
Why learn value functions? Applications include estimating failure probabilities in a large power grid and taxi-out times of flights at airports. More generally, we are estimating a long-term expected value associated with a Markov process, and value prediction is a building block for the control methods that follow.

Learning the mean of a distribution
Assume R_1, R_2, … are i.i.d. and V = E[R_t].
Estimating the expected value by the sample mean:
  V_t = (1/t) Σ_{s=0}^{t−1} R_{s+1}.
Recursive update (R_t plays the role of the "target"):
  V_t = V_{t−1} + (1/t) (R_t − V_{t−1}).
More general update:
  V_t = V_{t−1} + α_t (R_t − V_{t−1}),
with the Robbins-Monro conditions Σ_{t=0}^{∞} α_t = ∞ and Σ_{t=0}^{∞} α_t^2 < ∞.
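
To make the update concrete, here is a minimal Python sketch (mine, not from the tutorial). With step size α_t = 1/t it reproduces the sample mean exactly; any Robbins-Monro step-size sequence also converges to E[R].

import random

def estimate_mean(rewards):
    V = 0.0
    for t, R in enumerate(rewards, start=1):
        alpha = 1.0 / t          # Robbins-Monro: sum of alpha diverges, sum of alpha^2 is finite
        V += alpha * (R - V)     # move the estimate toward the target R
    return V

# Example: i.i.d. rewards with mean 1.0
print(estimate_mean(random.gauss(1.0, 2.0) for _ in range(10000)))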

Application to learning a value: the Monte-Carlo method
Setup: a finite, episodic MDP and a policy π that is proper from x_0. (The slide shows a small episodic example MDP as a figure.)
Goal: estimate V^π(x_0).
Trajectories:
  X_0^{(0)}, R_1^{(0)}, X_1^{(0)}, R_2^{(0)}, …, X_{T_0}^{(0)},
  X_0^{(1)}, R_1^{(1)}, X_1^{(1)}, R_2^{(1)}, …, X_{T_1}^{(1)}, …
where X_0^{(i)} = x_0.

First-visit Monte-Carlo
function FIRSTVISITMC(T, V, n)
Input: T = (X_0, R_1, …, R_T, X_T) is a trajectory with X_0 = x and X_T an absorbing state; n is the number of times V has been updated.
1: sum ← 0
2: for t = 0 to T − 1 do
3:   sum ← sum + γ^t R_{t+1}
4: end for
5: V ← V + (1/n) (sum − V)
6: return V
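
A possible Python rendering of FIRSTVISITMC under the stated assumptions (a single start state x; n counts updates including the current one), written as a sketch rather than the tutorial's own code:

def first_visit_mc(trajectory_rewards, V, n, gamma=1.0):
    """trajectory_rewards = [R_1, ..., R_T] of one episode started at x;
    V is the current estimate of V(x); n >= 1 counts value updates so far."""
    G = 0.0
    for t, R in enumerate(trajectory_rewards):   # discounted return: sum_t gamma^t R_{t+1}
        G += (gamma ** t) * R
    return V + (G - V) / n                       # running average of the observed returns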

Every-visit Monte-Carlo: learning a value function
function EVERYVISITMC(X_0, R_1, X_1, R_2, …, X_{T−1}, R_T, V)
Input: X_t is the state at time t, R_{t+1} is the reward associated with the t-th transition, T is the length of the episode, V is the array storing the current value function estimate.
1: sum ← 0
2: for t = T − 1 downto 0 do
3:   sum ← R_{t+1} + γ · sum
4:   target[X_t] ← sum
5:   V[X_t] ← V[X_t] + α (target[X_t] − V[X_t])
6: end for
7: return V
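
A corresponding sketch of EVERYVISITMC (my own, assuming the episode is given as Python lists and V is a dictionary from states to values; α and γ are placeholder defaults):

def every_visit_mc(states, rewards, V, alpha=0.1, gamma=1.0):
    """states = [X_0, ..., X_{T-1}], rewards = [R_1, ..., R_T]; V maps state -> value."""
    G = 0.0
    for t in range(len(states) - 1, -1, -1):    # t = T-1 downto 0
        G = rewards[t] + gamma * G              # return observed from time t
        x = states[t]
        V[x] = V.get(x, 0.0) + alpha * (G - V.get(x, 0.0))
    return V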

Learning from snippets of data
Goals: learn from elementary transitions of the form (X_t, R_{t+1}, X_{t+1}); learn a full value function; increase the convergence rate (if possible).
⇒ Temporal Difference (TD) learning.

Learning from guesses: TD learning
Idealistic Monte-Carlo update:
  V(X_t) ← V(X_t) + (1/t) (𝓡_t − V(X_t)),
where 𝓡_t is the return from state X_t. However, 𝓡_t is not available!
Idea: replace it with something computable:
  𝓡_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ⋯ = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ⋯) ≈ R_{t+1} + γ V(X_{t+1}).
Update:
  V(X_t) ← V(X_t) + (1/t) δ_{t+1}(V),  where  δ_{t+1}(V) = R_{t+1} + γ V(X_{t+1}) − V(X_t) is the TD error.

The TD(0) algorithm
function TD0(X, R, Y, V)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value estimates.
1: δ ← R + γ V[Y] − V[X]
2: V[X] ← V[X] + α δ
3: return V
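
A minimal tabular Python sketch of the TD0 update (the dict-based value table and the default step size are my own choices, not the tutorial's):

def td0(X, R, Y, V, alpha=0.1, gamma=0.99):
    """One TD(0) update for the transition (X, R, Y); V maps state -> value."""
    delta = R + gamma * V.get(Y, 0.0) - V.get(X, 0.0)   # TD error
    V[X] = V.get(X, 0.0) + alpha * delta
    return V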

Which one to love? Part I
(Same example MDP as on the Monte-Carlo slide.)
TD(0) at state 2: by the k-th visit to state 2, state 3 has already been visited 10k times, so Var[V̂_t(3)] ≈ 1/(10k).
MC at state 2: Var[𝓡_t | X_t = 2] = 0.25, which does not decrease with k.

Which one to love? Part II
(Same example MDP.)
Replace the stochastic reward by a deterministic one of value 1.
TD has to wait until the value of state 3 converges; MC updates towards the correct value in every step (no variance!).

The happy compromise: TD(λ)
Choose 0 ≤ λ ≤ 1.
Consider the k-step return estimate:
  R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} V̂_t(X_{t+k+1}).
Consider updating the values toward the so-called λ-return estimate:
  R_t^{(λ)} = (1−λ) Σ_{k=0}^{∞} λ^k R_{t:k}.

Toward TD(λ)
Recall R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} V̂_t(X_{t+k+1}) and R_t^{(λ)} = (1−λ) Σ_{k=0}^{∞} λ^k R_{t:k}. Then
  R_t^{(λ)} − V̂_t(X_t)
    = (1−λ) { R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t) }
    + (1−λ) λ { R_{t+1} + γ R_{t+2} + γ^2 V̂_t(X_{t+2}) − V̂_t(X_t) }
    + (1−λ) λ^2 { R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 V̂_t(X_{t+3}) − V̂_t(X_t) }
    + ⋯
    = [ R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t) ]
    + γλ [ R_{t+2} + γ V̂_t(X_{t+2}) − V̂_t(X_{t+1}) ]
    + γ^2 λ^2 [ R_{t+3} + γ V̂_t(X_{t+3}) − V̂_t(X_{t+2}) ]
    + ⋯
Thus the λ-return error is a discounted sum of the one-step TD errors along the trajectory, which is exactly what the eligibility traces of TD(λ) accumulate.

The TD(λ) algorithm
function TDLAMBDA(X, R, Y, V, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value function estimate, z is the array storing the eligibility traces.
1: δ ← R + γ V[Y] − V[X]
2: for all states x do
3:   z[x] ← γ λ z[x]
4:   if X = x then
5:     z[x] ← 1
6:   end if
7:   V[x] ← V[x] + α δ z[x]
8: end for
9: return (V, z)
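
A Python sketch of one TDLAMBDA step (mine, assuming dictionaries for V and z); setting z[X] ← 1 gives replacing traces, as in the pseudocode above:

def td_lambda(X, R, Y, V, z, alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) update with replacing traces; V and z map state -> value/trace."""
    delta = R + gamma * V.get(Y, 0.0) - V.get(X, 0.0)   # TD error
    z.setdefault(X, 0.0)
    for x in z:
        z[x] *= gamma * lam                             # decay every eligibility trace
    z[X] = 1.0                                          # replacing trace for the current state
    for x in z:
        V[x] = V.get(x, 0.0) + alpha * delta * z[x]     # credit all eligible states
    return V, z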

Experimental results
(Figure: online TD(λ) on the random walk; average RMSE over the first 10 trials, plotted for several values of λ as a function of the step size α.)
Problem: 19-state random walk on a chain. Reward of 1 at the left end. Both ends are absorbing. The goal is to predict the values of the states.

Too many states! What to do?
The state space is too large: we cannot store all the values, and we cannot visit all the states. What to do?
Idea: use compressed representations!
Examples: discretization, linear function approximation, nearest-neighbor methods, kernel methods, decision trees, …
How to use them?

Regression with stochastic gradient descent
Assume (X_1, R_1), (X_2, R_2), … are i.i.d. and V(x) = E[R_t | X_t = x].
Goal: estimate V with a function of the form V_θ(x) = θ^⊤ ϕ(x). This is called regression in statistics/machine learning.
More precise goal: minimize the expected squared prediction error
  J(θ) = (1/2) E[(R_t − V_θ(X_t))^2].
Stochastic gradient descent:
  θ_{t+1} = θ_t − (α_t/2) ∇_θ (R_t − V_{θ_t}(X_t))^2
          = θ_t + α_t (R_t − V_{θ_t}(X_t)) ∇_θ V_{θ_t}(X_t)
          = θ_t + α_t (R_t − V_{θ_t}(X_t)) ϕ(X_t).
Robbins-Monro conditions: Σ_{t=0}^{∞} α_t = ∞, Σ_{t=0}^{∞} α_t^2 < ∞.
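
The resulting update is one line of NumPy; the function below is an illustrative sketch (feature vector and step size are assumptions of the sketch, not the tutorial's code):

import numpy as np

def lms_update(theta, phi_x, R, alpha=0.01):
    """One stochastic gradient step for V_theta(x) = theta . phi(x) toward target R."""
    prediction = theta @ phi_x
    return theta + alpha * (R - prediction) * phi_x

# Usage: theta = np.zeros(d); theta = lms_update(theta, phi(x_t), R_t)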

Limit theory
Stochastic gradient descent: θ_{t+1} − θ_t = α_t (R_t − V_{θ_t}(X_t)) ϕ(X_t), with the Robbins-Monro conditions Σ_{t=0}^{∞} α_t = ∞, Σ_{t=0}^{∞} α_t^2 < ∞.
If the iteration converges, it must converge to a θ* satisfying E[(R_t − V_{θ*}(X_t)) ϕ(X_t)] = 0.
Explicit form: θ* = E[ϕ_t ϕ_t^⊤]^{−1} E[ϕ_t R_t], where ϕ_t = ϕ(X_t). This is indeed a minimizer of J.
Also known as the LMS rule, Widrow-Hoff rule, delta rule, ADALINE.

Learning from guesses: TD learning with function approximation
Replace the reward with the return! Idealistic Monte-Carlo based update:
  θ_{t+1} = θ_t + α_t (𝓡_t − V_{θ_t}(X_t)) ∇_θ V_{θ_t}(X_t),
where 𝓡_t is the return from state X_t. However, 𝓡_t is not available!
Idea: replace it with an estimate, 𝓡_t ≈ R_{t+1} + γ V_{θ_t}(X_{t+1}).
Update:
  θ_{t+1} = θ_t + α_t δ_{t+1}(V_{θ_t}) ∇_θ V_{θ_t}(X_t),  where  δ_{t+1}(V_{θ_t}) = R_{t+1} + γ V_{θ_t}(X_{t+1}) − V_{θ_t}(X_t).

TD(λ) with linear function approximation
function TDLAMBDALINFAPP(X, R, Y, θ, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, θ ∈ R^d is the parameter vector of the linear function approximation, z ∈ R^d is the vector of eligibility traces.
1: δ ← R + γ θ^⊤ ϕ[Y] − θ^⊤ ϕ[X]
2: z ← ϕ[X] + γ λ z
3: θ ← θ + α δ z
4: return (θ, z)
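
A NumPy sketch of the TDLAMBDALINFAPP step (mine; the feature vectors of the two states are passed in directly, which is an assumption of this sketch):

import numpy as np

def td_lambda_linear(phi_x, R, phi_y, theta, z, alpha=0.01, gamma=0.99, lam=0.9):
    """phi_x, phi_y: feature vectors of the last and next state; returns (theta, z)."""
    delta = R + gamma * theta @ phi_y - theta @ phi_x   # TD error under the current theta
    z = phi_x + gamma * lam * z                          # accumulate eligibility in feature space
    theta = theta + alpha * delta * z
    return theta, z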

Issues with off-policy learning
(Figure: Baird's counterexample, a star-shaped Markov process with the approximate value function given by the linear expression inset in each state, together with the behavior of TD(0) with expected backups on it.)
The reward is always zero. Note that the overall value function is linear and that there are fewer states than components of θ. Moreover, the set of feature vectors {ϕ_s : s ∈ S} corresponding to this function is linearly independent, and the true value function is easily represented by setting θ = 0. In all ways, this task seems a favorable case for linear function approximation, yet TD(0) with linear function approximation and off-policy (expected) backups diverges on it.

Defining the objective function
Let δ_{t+1}(θ) = R_{t+1} + γ V_θ(Y_{t+1}) − V_θ(X_t) be the TD error at time t, and ϕ_t = ϕ(X_t).
TD(0) update: θ_{t+1} − θ_t = α_t δ_{t+1}(θ_t) ϕ_t.
When TD(0) converges, it converges to a unique vector θ* that satisfies
  E[δ_{t+1}(θ*) ϕ_t] = 0.   (TDEQ)
Goal: come up with an objective function whose optima satisfy (TDEQ).
Solution:
  J(θ) = E[δ_{t+1}(θ) ϕ_t]^⊤ E[ϕ_t ϕ_t^⊤]^{−1} E[δ_{t+1}(θ) ϕ_t].

Deriving the algorithm
  J(θ) = E[δ_{t+1}(θ) ϕ_t]^⊤ E[ϕ_t ϕ_t^⊤]^{−1} E[δ_{t+1}(θ) ϕ_t].
Take the gradient:
  ∇_θ J(θ) = −2 E[(ϕ_t − γ ϕ_{t+1}) ϕ_t^⊤] w(θ),
where
  w(θ) = E[ϕ_t ϕ_t^⊤]^{−1} E[δ_{t+1}(θ) ϕ_t].
Idea: introduce two sets of weights!
  θ_{t+1} = θ_t + α_t (ϕ_t − γ ϕ_{t+1}) ϕ_t^⊤ w_t,
  w_{t+1} = w_t + β_t (δ_{t+1}(θ_t) − ϕ_t^⊤ w_t) ϕ_t.

GTD2 with linear function approximation
function GTD2(X, R, Y, θ, w)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, θ ∈ R^d is the parameter vector of the linear function approximation, w ∈ R^d is the auxiliary weight vector.
1: f ← ϕ[X]
2: f′ ← ϕ[Y]
3: δ ← R + γ θ^⊤ f′ − θ^⊤ f
4: a ← f^⊤ w
5: θ ← θ + α (f − γ f′) a
6: w ← w + β (δ − a) f
7: return (θ, w)
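
A NumPy sketch of GTD2 (mine, not the tutorial's code); the comments refer to the line numbers of the pseudocode above, and the two step sizes α, β are placeholders:

import numpy as np

def gtd2(phi_x, R, phi_y, theta, w, alpha=0.01, beta=0.1, gamma=0.99):
    delta = R + gamma * theta @ phi_y - theta @ phi_x    # 3: TD error
    a = phi_x @ w                                        # 4: correction term f^T w
    theta = theta + alpha * (phi_x - gamma * phi_y) * a  # 5: main weight update
    w = w + beta * (delta - a) * phi_x                   # 6: auxiliary weight update
    return theta, w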

Experimental results
(Figure: RMSPBE as a function of sweeps for TD, GTD, GTD2, and TDC.) Behavior on the 7-star example.

Bibliographic notes and subsequent developments
GTD, the original idea (Sutton et al., 2009b).
GTD2 and a two-timescale version, TDC (Sutton et al., 2009a). For TDC, just replace the update in line 5 by θ ← θ + α (δ f − γ a f′).
Extension to nonlinear function approximation (Maei et al., 2010); this addresses the issue that TD can be unstable when used with nonlinear function approximation.
Extension to eligibility traces and action values (Maei and Sutton, 2010).
Extension to control (next part!).

The problem
The methods so far are gradient-like, or first-order, methods: they make small steps in weight space.
They are sensitive to the choice of the step size, the initial values of the weights, and the eigenvalue spread of the matrix determining the underlying dynamics.
Solution proposals: adaptive step sizes (Sutton, 1992; George and Powell, 2006), normalizing the updates (Bradtke, 1994), reusing previous samples (Lin, 1992). Each of these has its own weaknesses.

The LSTD algorithm
In the limit, if TD(0) converges it finds the solution to
  E[ϕ_t δ_{t+1}(θ)] = 0.   (*)
Assume the sample so far is D_n = ((X_0, R_1, Y_1), (X_1, R_2, Y_2), …, (X_{n−1}, R_n, Y_n)).
Idea: approximate (*) by
  (1/n) Σ_{t=0}^{n−1} ϕ_t δ_{t+1}(θ) = 0.   (**)
Stochastic programming: sample average approximation (Shapiro, 2003). Statistics: Z-estimation (e.g., Kosorok, 2008, Section 2.2.5).
Note: (**) is equivalent to −Â_n θ + b̂_n = 0, where b̂_n = (1/n) Σ_{t=0}^{n−1} R_{t+1} ϕ_t and Â_n = (1/n) Σ_{t=0}^{n−1} ϕ_t (ϕ_t − γ ϕ_{t+1})^⊤.
Solution: θ_n = Â_n^{−1} b̂_n, provided the inverse exists.
This is least-squares temporal difference learning, or LSTD (Bradtke and Barto, 1996).
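
A batch NumPy sketch of LSTD under these definitions (mine): accumulate Â_n and b̂_n over the sample and solve the linear system. The feature function and the sample format are assumptions of the sketch.

import numpy as np

def lstd(samples, feature, d, gamma=0.99):
    """samples: nonempty iterable of (X, R, Y) transitions; feature(s) -> R^d; returns theta_n."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    n = 0
    for X, R, Y in samples:
        phi_x, phi_y = feature(X), feature(Y)
        A += np.outer(phi_x, phi_x - gamma * phi_y)   # accumulates phi_t (phi_t - gamma phi_{t+1})^T
        b += R * phi_x                                # accumulates R_{t+1} phi_t
        n += 1
    return np.linalg.solve(A / n, b / n)              # theta_n = A_n^{-1} b_n (if A_n is invertible)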

RLSTD(0) with linear function approximation
function RLSTD(X, R, Y, C, θ)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, C ∈ R^{d×d}, and θ ∈ R^d is the parameter vector of the linear function approximation.
1: f ← ϕ[X]
2: f′ ← ϕ[Y]
3: g ← (f − γ f′)^⊤ C   (g is a 1 × d row vector)
4: a ← 1 + g f
5: v ← C f
6: δ ← R + γ θ^⊤ f′ − θ^⊤ f
7: θ ← θ + (δ / a) v
8: C ← C − v g / a
9: return (C, θ)
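
A NumPy sketch of one RLSTD(0) step (mine). Here C plays the role of a running inverse of Â_n, maintained by a rank-one (Sherman-Morrison) update, so no matrix inversion is needed per transition.

import numpy as np

def rlstd(phi_x, R, phi_y, C, theta, gamma=0.99):
    g = (phi_x - gamma * phi_y) @ C                      # 3: row vector (f - gamma f')^T C
    a = 1.0 + g @ phi_x                                  # 4: normalizer
    v = C @ phi_x                                        # 5
    delta = R + gamma * theta @ phi_y - theta @ phi_x    # 6: TD error
    theta = theta + (delta / a) * v                      # 7
    C = C - np.outer(v, g) / a                           # 8: rank-one update of the inverse
    return C, theta

# Usage: C is typically initialized to (1/epsilon) * np.eye(d) for a small epsilon > 0.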

Which one to love?
Assumptions: the time available for computation, T, is fixed, and samples are cheap to obtain.
How many samples (n) can be processed? Least-squares: n ≈ T/d^2. First-order methods: n′ ≈ T/d, i.e., n′ = n d.
Precision after t samples? Least-squares: C_1 t^{−1/2}. First-order: C_2 t^{−1/2}, with C_2 > C_1.
Conclusion, comparing the two precisions after time T:
  ‖θ′_{n′} − θ*‖ / ‖θ_n − θ*‖ ≈ (C_2 / C_1) d^{−1/2}.
Hence: if C_2/C_1 < d^{1/2} then the first-order method wins; otherwise the least-squares method wins.

The choice of the function approximation method
Factors to consider: the quality of the solution in the limit of infinitely many samples, overfitting/underfitting, and the eigenvalue spread (decorrelated features) when using first-order methods.

Error bounds
Consider TD(λ) estimating the value function V, and let V_{θ(λ)} be the limiting solution. Then
  ‖V_{θ(λ)} − V‖_µ ≤ (1 / (1 − γ_λ)) ‖Π_{F,µ} V − V‖_µ.
Here γ_λ = γ(1−λ)/(1−λγ) is the contraction modulus of Π_{F,µ} T^{(λ)} (Tsitsiklis and Van Roy, 1999; Bertsekas, 2007).

Error analysis II
Define the Bellman error Δ^{(λ)}(V̂) = T^{(λ)} V̂ − V̂ for V̂ : X → R, where T^{(λ)} = (1−λ) Σ_{m=0}^{∞} λ^m T^{[m]} and T^{[m]} is the m-step lookahead Bellman operator.
Contraction argument: ‖V − V̂‖ ≤ (1 / (1 − γ_λ)) ‖Δ^{(λ)}(V̂)‖.
What makes Δ^{(λ)}(V̂) small? Error decomposition:
  Δ^{(λ)}(V_{θ(λ)}) = Σ_{m≥0} (γλ)^m Δ_m^{[r]} + γ (1−λ) Σ_{m≥0} (γλ)^m Δ_m^{[ϕ]} θ(λ),
where
  Δ_m^{[r]} = r_m − Π_{F,µ} r_m,   Δ_m^{[ϕ]} = P^{m+1} ϕ − Π_{F,µ} P^{m+1} ϕ,
  r_m(x) = E[R_{m+1} | X_0 = x],   P^{m+1} ϕ(x) = (P^{m+1} ϕ_1(x), …, P^{m+1} ϕ_d(x)),   P^m ϕ_i(x) = E[ϕ_i(X_m) | X_0 = x].

For Further Reading
Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 3rd edition.
Bradtke, S. J. (1994). Incremental Dynamic Programming for On-line Adaptive Optimal Control. PhD thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, Massachusetts.
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22.
George, A. P. and Powell, W. B. (2006). Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65.
Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 9.
Maei, H., Szepesvári, C., Bhatnagar, S., Silver, D., Precup, D., and Sutton, R. (2010). Convergent temporal-difference learning with arbitrary smooth function approximation. In NIPS-22.
Maei, H. R. and Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Baum, E., Hutter, M., and Kitzelmann, E., editors, AGI 2010. Atlantis Press.
Shapiro, A. (2003). Monte Carlo sampling methods. In Stochastic Programming, Handbooks in OR & MS, volume 10. North-Holland Publishing Company, Amsterdam.
Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Bottou, L. and Littman, M., editors, ICML 2009. ACM.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2009b). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, NIPS-21. MIT Press.
Tsitsiklis, J. N. and Van Roy, B. (1999). Average cost temporal-difference learning. Automatica, 35(11).
