Reinforcement Learning Algorithms in Markov Decision Processes AAAI-10 Tutorial. Part II: Learning to predict values

Reinforcement Learning Algorithms in Markov Decision Processes, AAAI-10 Tutorial
Part II: Learning to predict values
Csaba Szepesvári and Richard S. Sutton
University of Alberta, {szepesva,rsutton}@ualberta.ca
Atlanta, July 11, 2010

Outline
1 Introduction
2 Simple learning techniques: learning the mean, learning values with Monte-Carlo, learning values with temporal differences, Monte-Carlo or TD?, resolution
3 Function approximation
4 Methods: stochastic gradient descent, TD(λ) with linear function approximation, gradient temporal difference learning, LSTD and friends, comparing least-squares and TD-like methods
5 How to choose the function approximation method?
6 Bibliography

The problem
How to learn the value function of a policy over a large state space?
Why learn? To avoid the curses of modeling: complex models are hard to deal with, learning avoids modelling errors, and it adapts to changes.
Why learn value functions? Applications include estimating failure probabilities in a large power grid and taxi-out times of flights at airports. More generally, we are estimating a long-term expected value associated with a Markov process, and value prediction is a building block for the control methods that follow.

Learning the mean of a distribution
Assume R_1, R_2, … are i.i.d. and V = E[R_t].
Estimating the expected value by the sample mean:
  V_t = (1/t) Σ_{s=0}^{t−1} R_{s+1}.
Recursive update (R_t plays the role of the "target"):
  V_t = V_{t−1} + (1/t) (R_t − V_{t−1}).
More general update:
  V_t = V_{t−1} + α_t (R_t − V_{t−1}),
with the Robbins-Monro conditions Σ_{t=0}^{∞} α_t = ∞ and Σ_{t=0}^{∞} α_t^2 < ∞.
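
To make the update concrete, here is a minimal Python sketch (mine, not from the tutorial). With step size α_t = 1/t it reproduces the sample mean exactly; any Robbins-Monro step-size sequence also converges to E[R].

import random

def estimate_mean(rewards):
    V = 0.0
    for t, R in enumerate(rewards, start=1):
        alpha = 1.0 / t          # Robbins-Monro: sum of alpha diverges, sum of alpha^2 is finite
        V += alpha * (R - V)     # move the estimate toward the target R
    return V

# Example: i.i.d. rewards with mean 1.0
print(estimate_mean(random.gauss(1.0, 2.0) for _ in range(10000)))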

Application to learning a value: the Monte-Carlo method
Setup: a finite, episodic MDP and a policy π that is proper from x_0. (The slide shows a small episodic example MDP as a figure.)
Goal: estimate V^π(x_0).
Trajectories:
  X_0^{(0)}, R_1^{(0)}, X_1^{(0)}, R_2^{(0)}, …, X_{T_0}^{(0)},
  X_0^{(1)}, R_1^{(1)}, X_1^{(1)}, R_2^{(1)}, …, X_{T_1}^{(1)}, …
where X_0^{(i)} = x_0.

First-visit Monte-Carlo
function FIRSTVISITMC(T, V, n)
Input: T = (X_0, R_1, …, R_T, X_T) is a trajectory with X_0 = x and X_T an absorbing state; n is the number of times V has been updated.
1: sum ← 0
2: for t = 0 to T − 1 do
3:   sum ← sum + γ^t R_{t+1}
4: end for
5: V ← V + (1/n) (sum − V)
6: return V
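
A possible Python rendering of FIRSTVISITMC under the stated assumptions (a single start state x; n counts updates including the current one), written as a sketch rather than the tutorial's own code:

def first_visit_mc(trajectory_rewards, V, n, gamma=1.0):
    """trajectory_rewards = [R_1, ..., R_T] of one episode started at x;
    V is the current estimate of V(x); n >= 1 counts value updates so far."""
    G = 0.0
    for t, R in enumerate(trajectory_rewards):   # discounted return: sum_t gamma^t R_{t+1}
        G += (gamma ** t) * R
    return V + (G - V) / n                       # running average of the observed returns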

Every-visit Monte-Carlo: learning a value function
function EVERYVISITMC(X_0, R_1, X_1, R_2, …, X_{T−1}, R_T, V)
Input: X_t is the state at time t, R_{t+1} is the reward associated with the t-th transition, T is the length of the episode, V is the array storing the current value function estimate.
1: sum ← 0
2: for t = T − 1 downto 0 do
3:   sum ← R_{t+1} + γ · sum
4:   target[X_t] ← sum
5:   V[X_t] ← V[X_t] + α (target[X_t] − V[X_t])
6: end for
7: return V
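
A corresponding sketch of EVERYVISITMC (my own, assuming the episode is given as Python lists and V is a dictionary from states to values; α and γ are placeholder defaults):

def every_visit_mc(states, rewards, V, alpha=0.1, gamma=1.0):
    """states = [X_0, ..., X_{T-1}], rewards = [R_1, ..., R_T]; V maps state -> value."""
    G = 0.0
    for t in range(len(states) - 1, -1, -1):    # t = T-1 downto 0
        G = rewards[t] + gamma * G              # return observed from time t
        x = states[t]
        V[x] = V.get(x, 0.0) + alpha * (G - V.get(x, 0.0))
    return V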

Learning from snippets of data
Goals: learn from elementary transitions of the form (X_t, R_{t+1}, X_{t+1}); learn a full value function; increase the convergence rate (if possible).
⇒ Temporal Difference (TD) learning.

Learning from guesses: TD learning
Idealistic Monte-Carlo update:
  V(X_t) ← V(X_t) + (1/t) (𝓡_t − V(X_t)),
where 𝓡_t is the return from state X_t. However, 𝓡_t is not available!
Idea: replace it with something computable:
  𝓡_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ⋯ = R_{t+1} + γ (R_{t+2} + γ R_{t+3} + ⋯) ≈ R_{t+1} + γ V(X_{t+1}).
Update:
  V(X_t) ← V(X_t) + (1/t) δ_{t+1}(V),  where  δ_{t+1}(V) = R_{t+1} + γ V(X_{t+1}) − V(X_t) is the TD error.

The TD(0) algorithm
function TD0(X, R, Y, V)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value estimates.
1: δ ← R + γ V[Y] − V[X]
2: V[X] ← V[X] + α δ
3: return V
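
A minimal tabular Python sketch of the TD0 update (the dict-based value table and the default step size are my own choices, not the tutorial's):

def td0(X, R, Y, V, alpha=0.1, gamma=0.99):
    """One TD(0) update for the transition (X, R, Y); V maps state -> value."""
    delta = R + gamma * V.get(Y, 0.0) - V.get(X, 0.0)   # TD error
    V[X] = V.get(X, 0.0) + alpha * delta
    return V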

Which one to love? Part I
(Same example MDP as on the Monte-Carlo slide.)
TD(0) at state 2: by the k-th visit to state 2, state 3 has already been visited 10k times, so Var[V̂_t(3)] ≈ 1/(10k).
MC at state 2: Var[𝓡_t | X_t = 2] = 0.25, which does not decrease with k.

Which one to love? Part II
(Same example MDP.)
Replace the stochastic reward by a deterministic one of value 1.
TD has to wait until the value of state 3 converges; MC updates towards the correct value in every step (no variance!).

The happy compromise: TD(λ)
Choose 0 ≤ λ ≤ 1.
Consider the k-step return estimate:
  R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} V̂_t(X_{t+k+1}).
Consider updating the values toward the so-called λ-return estimate:
  R_t^{(λ)} = (1−λ) Σ_{k=0}^{∞} λ^k R_{t:k}.

Toward TD(λ)
Recall R_{t:k} = Σ_{s=t}^{t+k} γ^{s−t} R_{s+1} + γ^{k+1} V̂_t(X_{t+k+1}) and R_t^{(λ)} = (1−λ) Σ_{k=0}^{∞} λ^k R_{t:k}. Then
  R_t^{(λ)} − V̂_t(X_t)
    = (1−λ) { R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t) }
    + (1−λ) λ { R_{t+1} + γ R_{t+2} + γ^2 V̂_t(X_{t+2}) − V̂_t(X_t) }
    + (1−λ) λ^2 { R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 V̂_t(X_{t+3}) − V̂_t(X_t) }
    + ⋯
    = [ R_{t+1} + γ V̂_t(X_{t+1}) − V̂_t(X_t) ]
    + γλ [ R_{t+2} + γ V̂_t(X_{t+2}) − V̂_t(X_{t+1}) ]
    + γ^2 λ^2 [ R_{t+3} + γ V̂_t(X_{t+3}) − V̂_t(X_{t+2}) ]
    + ⋯
Thus the λ-return error is a discounted sum of the one-step TD errors along the trajectory, which is exactly what the eligibility traces of TD(λ) accumulate.

The TD(λ) algorithm
function TDLAMBDA(X, R, Y, V, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, V is the array storing the current value function estimate, z is the array storing the eligibility traces.
1: δ ← R + γ V[Y] − V[X]
2: for all states x do
3:   z[x] ← γ λ z[x]
4:   if X = x then
5:     z[x] ← 1
6:   end if
7:   V[x] ← V[x] + α δ z[x]
8: end for
9: return (V, z)
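
A Python sketch of one TDLAMBDA step (mine, assuming dictionaries for V and z); setting z[X] ← 1 gives replacing traces, as in the pseudocode above:

def td_lambda(X, R, Y, V, z, alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) update with replacing traces; V and z map state -> value/trace."""
    delta = R + gamma * V.get(Y, 0.0) - V.get(X, 0.0)   # TD error
    z.setdefault(X, 0.0)
    for x in z:
        z[x] *= gamma * lam                             # decay every eligibility trace
    z[X] = 1.0                                          # replacing trace for the current state
    for x in z:
        V[x] = V.get(x, 0.0) + alpha * delta * z[x]     # credit all eligible states
    return V, z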

Experimental results
(Figure: online TD(λ) on the random walk; average RMSE over the first 10 trials, plotted for several values of λ as a function of the step size α.)
Problem: 19-state random walk on a chain. Reward of 1 at the left end. Both ends are absorbing. The goal is to predict the values of the states.

Too many states! What to do?
The state space is too large: we cannot store all the values, and we cannot visit all the states. What to do?
Idea: use compressed representations!
Examples: discretization, linear function approximation, nearest-neighbor methods, kernel methods, decision trees, …
How to use them?

Regression with stochastic gradient descent
Assume (X_1, R_1), (X_2, R_2), … are i.i.d. and V(x) = E[R_t | X_t = x].
Goal: estimate V with a function of the form V_θ(x) = θ^⊤ ϕ(x). This is called regression in statistics/machine learning.
More precise goal: minimize the expected squared prediction error
  J(θ) = (1/2) E[(R_t − V_θ(X_t))^2].
Stochastic gradient descent:
  θ_{t+1} = θ_t − (α_t/2) ∇_θ (R_t − V_{θ_t}(X_t))^2
          = θ_t + α_t (R_t − V_{θ_t}(X_t)) ∇_θ V_{θ_t}(X_t)
          = θ_t + α_t (R_t − V_{θ_t}(X_t)) ϕ(X_t).
Robbins-Monro conditions: Σ_{t=0}^{∞} α_t = ∞, Σ_{t=0}^{∞} α_t^2 < ∞.
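
The resulting update is one line of NumPy; the function below is an illustrative sketch (feature vector and step size are assumptions of the sketch, not the tutorial's code):

import numpy as np

def lms_update(theta, phi_x, R, alpha=0.01):
    """One stochastic gradient step for V_theta(x) = theta . phi(x) toward target R."""
    prediction = theta @ phi_x
    return theta + alpha * (R - prediction) * phi_x

# Usage: theta = np.zeros(d); theta = lms_update(theta, phi(x_t), R_t)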

Limit theory
Stochastic gradient descent: θ_{t+1} − θ_t = α_t (R_t − V_{θ_t}(X_t)) ϕ(X_t), with the Robbins-Monro conditions Σ_{t=0}^{∞} α_t = ∞, Σ_{t=0}^{∞} α_t^2 < ∞.
If the iteration converges, it must converge to a θ* satisfying E[(R_t − V_{θ*}(X_t)) ϕ(X_t)] = 0.
Explicit form: θ* = E[ϕ_t ϕ_t^⊤]^{−1} E[ϕ_t R_t], where ϕ_t = ϕ(X_t). This is indeed a minimizer of J.
Also known as the LMS rule, Widrow-Hoff rule, delta rule, ADALINE.

Learning from guesses: TD learning with function approximation
Replace the reward with the return! Idealistic Monte-Carlo based update:
  θ_{t+1} = θ_t + α_t (𝓡_t − V_{θ_t}(X_t)) ∇_θ V_{θ_t}(X_t),
where 𝓡_t is the return from state X_t. However, 𝓡_t is not available!
Idea: replace it with an estimate, 𝓡_t ≈ R_{t+1} + γ V_{θ_t}(X_{t+1}).
Update:
  θ_{t+1} = θ_t + α_t δ_{t+1}(V_{θ_t}) ∇_θ V_{θ_t}(X_t),  where  δ_{t+1}(V_{θ_t}) = R_{t+1} + γ V_{θ_t}(X_{t+1}) − V_{θ_t}(X_t).

TD(λ) with linear function approximation
function TDLAMBDALINFAPP(X, R, Y, θ, z)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, θ ∈ R^d is the parameter vector of the linear function approximation, z ∈ R^d is the vector of eligibility traces.
1: δ ← R + γ θ^⊤ ϕ[Y] − θ^⊤ ϕ[X]
2: z ← ϕ[X] + γ λ z
3: θ ← θ + α δ z
4: return (θ, z)
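
A NumPy sketch of the TDLAMBDALINFAPP step (mine; the feature vectors of the two states are passed in directly, which is an assumption of this sketch):

import numpy as np

def td_lambda_linear(phi_x, R, phi_y, theta, z, alpha=0.01, gamma=0.99, lam=0.9):
    """phi_x, phi_y: feature vectors of the last and next state; returns (theta, z)."""
    delta = R + gamma * theta @ phi_y - theta @ phi_x   # TD error under the current theta
    z = phi_x + gamma * lam * z                          # accumulate eligibility in feature space
    theta = theta + alpha * delta * z
    return theta, z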

Issues with off-policy learning
(Figure: Baird's counterexample, a star-shaped Markov process with the approximate value function given by the linear expression inset in each state, together with the behavior of TD(0) with expected backups on it.)
The reward is always zero. Note that the overall value function is linear and that there are fewer states than components of θ. Moreover, the set of feature vectors {ϕ_s : s ∈ S} corresponding to this function is linearly independent, and the true value function is easily represented by setting θ = 0. In all ways, this task seems a favorable case for linear function approximation, yet TD(0) with linear function approximation and off-policy (expected) backups diverges on it.

Defining the objective function
Let δ_{t+1}(θ) = R_{t+1} + γ V_θ(Y_{t+1}) − V_θ(X_t) be the TD error at time t, and ϕ_t = ϕ(X_t).
TD(0) update: θ_{t+1} − θ_t = α_t δ_{t+1}(θ_t) ϕ_t.
When TD(0) converges, it converges to a unique vector θ* that satisfies
  E[δ_{t+1}(θ*) ϕ_t] = 0.   (TDEQ)
Goal: come up with an objective function whose optima satisfy (TDEQ).
Solution:
  J(θ) = E[δ_{t+1}(θ) ϕ_t]^⊤ E[ϕ_t ϕ_t^⊤]^{−1} E[δ_{t+1}(θ) ϕ_t].

Deriving the algorithm
  J(θ) = E[δ_{t+1}(θ) ϕ_t]^⊤ E[ϕ_t ϕ_t^⊤]^{−1} E[δ_{t+1}(θ) ϕ_t].
Take the gradient:
  ∇_θ J(θ) = −2 E[(ϕ_t − γ ϕ_{t+1}) ϕ_t^⊤] w(θ),
where
  w(θ) = E[ϕ_t ϕ_t^⊤]^{−1} E[δ_{t+1}(θ) ϕ_t].
Idea: introduce two sets of weights!
  θ_{t+1} = θ_t + α_t (ϕ_t − γ ϕ_{t+1}) ϕ_t^⊤ w_t,
  w_{t+1} = w_t + β_t (δ_{t+1}(θ_t) − ϕ_t^⊤ w_t) ϕ_t.

GTD2 with linear function approximation
function GTD2(X, R, Y, θ, w)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, θ ∈ R^d is the parameter vector of the linear function approximation, w ∈ R^d is the auxiliary weight vector.
1: f ← ϕ[X]
2: f′ ← ϕ[Y]
3: δ ← R + γ θ^⊤ f′ − θ^⊤ f
4: a ← f^⊤ w
5: θ ← θ + α (f − γ f′) a
6: w ← w + β (δ − a) f
7: return (θ, w)
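
A NumPy sketch of GTD2 (mine, not the tutorial's code); the comments refer to the line numbers of the pseudocode above, and the two step sizes α, β are placeholders:

import numpy as np

def gtd2(phi_x, R, phi_y, theta, w, alpha=0.01, beta=0.1, gamma=0.99):
    delta = R + gamma * theta @ phi_y - theta @ phi_x    # 3: TD error
    a = phi_x @ w                                        # 4: correction term f^T w
    theta = theta + alpha * (phi_x - gamma * phi_y) * a  # 5: main weight update
    w = w + beta * (delta - a) * phi_x                   # 6: auxiliary weight update
    return theta, w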

Experimental results
(Figure: RMSPBE as a function of sweeps for TD, GTD, GTD2, and TDC.) Behavior on the 7-star example.

Bibliographic notes and subsequent developments
GTD, the original idea (Sutton et al., 2009b).
GTD2 and a two-timescale version, TDC (Sutton et al., 2009a). For TDC, just replace the update in line 5 by θ ← θ + α (δ f − γ a f′).
Extension to nonlinear function approximation (Maei et al., 2010); this addresses the issue that TD can be unstable when used with nonlinear function approximation.
Extension to eligibility traces and action values (Maei and Sutton, 2010).
Extension to control (next part!).

The problem
The methods so far are gradient-like, or first-order, methods: they make small steps in weight space.
They are sensitive to the choice of the step size, the initial values of the weights, and the eigenvalue spread of the matrix determining the underlying dynamics.
Solution proposals: adaptive step sizes (Sutton, 1992; George and Powell, 2006), normalizing the updates (Bradtke, 1994), reusing previous samples (Lin, 1992). Each of these has its own weaknesses.

The LSTD algorithm
In the limit, if TD(0) converges it finds the solution to
  E[ϕ_t δ_{t+1}(θ)] = 0.   (*)
Assume the sample so far is D_n = ((X_0, R_1, Y_1), (X_1, R_2, Y_2), …, (X_{n−1}, R_n, Y_n)).
Idea: approximate (*) by
  (1/n) Σ_{t=0}^{n−1} ϕ_t δ_{t+1}(θ) = 0.   (**)
Stochastic programming: sample average approximation (Shapiro, 2003). Statistics: Z-estimation (e.g., Kosorok, 2008, Section 2.2.5).
Note: (**) is equivalent to −Â_n θ + b̂_n = 0, where b̂_n = (1/n) Σ_{t=0}^{n−1} R_{t+1} ϕ_t and Â_n = (1/n) Σ_{t=0}^{n−1} ϕ_t (ϕ_t − γ ϕ_{t+1})^⊤.
Solution: θ_n = Â_n^{−1} b̂_n, provided the inverse exists.
This is least-squares temporal difference learning, or LSTD (Bradtke and Barto, 1996).
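
A batch NumPy sketch of LSTD under these definitions (mine): accumulate Â_n and b̂_n over the sample and solve the linear system. The feature function and the sample format are assumptions of the sketch.

import numpy as np

def lstd(samples, feature, d, gamma=0.99):
    """samples: nonempty iterable of (X, R, Y) transitions; feature(s) -> R^d; returns theta_n."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    n = 0
    for X, R, Y in samples:
        phi_x, phi_y = feature(X), feature(Y)
        A += np.outer(phi_x, phi_x - gamma * phi_y)   # accumulates phi_t (phi_t - gamma phi_{t+1})^T
        b += R * phi_x                                # accumulates R_{t+1} phi_t
        n += 1
    return np.linalg.solve(A / n, b / n)              # theta_n = A_n^{-1} b_n (if A_n is invertible)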

RLSTD(0) with linear function approximation
function RLSTD(X, R, Y, C, θ)
Input: X is the last state, Y is the next state, R is the immediate reward associated with this transition, C ∈ R^{d×d}, and θ ∈ R^d is the parameter vector of the linear function approximation.
1: f ← ϕ[X]
2: f′ ← ϕ[Y]
3: g ← (f − γ f′)^⊤ C   (g is a 1 × d row vector)
4: a ← 1 + g f
5: v ← C f
6: δ ← R + γ θ^⊤ f′ − θ^⊤ f
7: θ ← θ + (δ / a) v
8: C ← C − v g / a
9: return (C, θ)
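
A NumPy sketch of one RLSTD(0) step (mine). Here C plays the role of a running inverse of Â_n, maintained by a rank-one (Sherman-Morrison) update, so no matrix inversion is needed per transition.

import numpy as np

def rlstd(phi_x, R, phi_y, C, theta, gamma=0.99):
    g = (phi_x - gamma * phi_y) @ C                      # 3: row vector (f - gamma f')^T C
    a = 1.0 + g @ phi_x                                  # 4: normalizer
    v = C @ phi_x                                        # 5
    delta = R + gamma * theta @ phi_y - theta @ phi_x    # 6: TD error
    theta = theta + (delta / a) * v                      # 7
    C = C - np.outer(v, g) / a                           # 8: rank-one update of the inverse
    return C, theta

# Usage: C is typically initialized to (1/epsilon) * np.eye(d) for a small epsilon > 0.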

Which one to love?
Assumptions: the time available for computation, T, is fixed, and samples are cheap to obtain.
How many samples (n) can be processed? Least-squares: n ≈ T/d^2. First-order methods: n′ ≈ T/d, i.e., n′ = n d.
Precision after t samples? Least-squares: C_1 t^{−1/2}. First-order: C_2 t^{−1/2}, with C_2 > C_1.
Conclusion, comparing the two precisions after time T:
  ‖θ′_{n′} − θ*‖ / ‖θ_n − θ*‖ ≈ (C_2 / C_1) d^{−1/2}.
Hence: if C_2/C_1 < d^{1/2} then the first-order method wins; otherwise the least-squares method wins.

The choice of the function approximation method
Factors to consider: the quality of the solution in the limit of infinitely many samples, overfitting/underfitting, and the eigenvalue spread (decorrelated features) when using first-order methods.

Error bounds
Consider TD(λ) estimating the value function V, and let V_{θ(λ)} be the limiting solution. Then
  ‖V_{θ(λ)} − V‖_µ ≤ (1 / (1 − γ_λ)) ‖Π_{F,µ} V − V‖_µ.
Here γ_λ = γ(1−λ)/(1−λγ) is the contraction modulus of Π_{F,µ} T^{(λ)} (Tsitsiklis and Van Roy, 1999; Bertsekas, 2007).

Error analysis II
Define the Bellman error Δ^{(λ)}(V̂) = T^{(λ)} V̂ − V̂ for V̂ : X → R, where T^{(λ)} = (1−λ) Σ_{m=0}^{∞} λ^m T^{[m]} and T^{[m]} is the m-step lookahead Bellman operator.
Contraction argument: ‖V − V̂‖ ≤ (1 / (1 − γ_λ)) ‖Δ^{(λ)}(V̂)‖.
What makes Δ^{(λ)}(V̂) small? Error decomposition:
  Δ^{(λ)}(V_{θ(λ)}) = Σ_{m≥0} (γλ)^m Δ_m^{[r]} + γ (1−λ) Σ_{m≥0} (γλ)^m Δ_m^{[ϕ]} θ(λ),
where
  Δ_m^{[r]} = r_m − Π_{F,µ} r_m,   Δ_m^{[ϕ]} = P^{m+1} ϕ − Π_{F,µ} P^{m+1} ϕ,
  r_m(x) = E[R_{m+1} | X_0 = x],   P^{m+1} ϕ(x) = (P^{m+1} ϕ_1(x), …, P^{m+1} ϕ_d(x)),   P^m ϕ_i(x) = E[ϕ_i(X_m) | X_0 = x].

For Further Reading
Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA, 3rd edition.
Bradtke, S. J. (1994). Incremental Dynamic Programming for On-line Adaptive Optimal Control. PhD thesis, Department of Computer and Information Science, University of Massachusetts, Amherst, Massachusetts.
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22.
George, A. P. and Powell, W. B. (2006). Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65.
Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
Lin, L.-J. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 9.
Maei, H., Szepesvári, C., Bhatnagar, S., Silver, D., Precup, D., and Sutton, R. (2010). Convergent temporal-difference learning with arbitrary smooth function approximation. In NIPS-22.
Maei, H. R. and Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Baum, E., Hutter, M., and Kitzelmann, E., editors, AGI 2010. Atlantis Press.
Shapiro, A. (2003). Monte Carlo sampling methods. In Stochastic Programming, Handbooks in OR & MS, volume 10. North-Holland Publishing Company, Amsterdam.
Sutton, R. S. (1992). Gain adaptation beats least squares. In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Bottou, L. and Littman, M., editors, ICML 2009. ACM.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2009b). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, NIPS-21. MIT Press.
Tsitsiklis, J. N. and Van Roy, B. (1999). Average cost temporal-difference learning. Automatica, 35(11).
